
Harden Modal pipeline: pre-baked images, auto-trigger on merge, at-large CD fix #611

Merged
baogorek merged 60 commits into main from fix-would-file-blend-and-entity-weights
Mar 25, 2026

Conversation

Collaborator

@baogorek baogorek commented Mar 16, 2026

Summary

Command to run the Modal pipeline end-to-end on this branch (the run completed successfully on 3/24):

modal run --detach modal_app/pipeline.py::main \
      --action run --branch fix-would-file-blend-and-entity-weights \
      --gpu T4

This PR hardens the Modal pipeline for production use, covering three major areas:

1. Pre-baked Modal images (eliminate runtime installs)

  • Before: Every function call (15+ sites across 4 files) did a fresh git clone + uv sync, downloading 858MB of PyTorch/CUDA dependencies each time. ~15% failure rate from network timeouts.
  • After: Source code and dependencies are baked into Modal image layers at build time via add_local_dir(copy=True). Modal caches layers by content hash — unchanged code skips the build entirely.
  • New modal_app/images.py with shared cpu_image and gpu_image
  • New modal_app/resilience.py with subprocess retry wrapper
  • All 4 Modal apps (data_build, remote_calibration_runner, local_area, pipeline) updated to use pre-baked images
  • Fixes Python 3.11→3.13 mismatch in remote_calibration_runner
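The subprocess retry wrapper in `modal_app/resilience.py` can be sketched roughly as follows (a minimal illustration of the pattern; the actual function name and signature in `resilience.py` are assumptions):

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=3, base_delay=1.0):
    """Run a subprocess command, retrying with exponential backoff.

    Hypothetical sketch of the retry pattern used for flaky network
    operations (git clone, uv sync) that caused the ~15% failure rate
    before images were pre-baked.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < max_attempts:
            # Back off 1s, 2s, 4s, ... between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"Command {cmd!r} failed after {max_attempts} attempts: {result.stderr}"
    )
```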

2. End-to-end pipeline orchestrator (pipeline.py)

  • 5-step orchestrated pipeline: build datasets → build calibration package → fit weights (regional + national in parallel) → build H5s + stage → finalize
  • Resume support via run metadata (survives preemption)
  • Parallel regional + national calibration fitting
  • Validation diagnostics aggregation
  • Atomic promotion workflow (staging → production)
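The resume mechanism can be sketched like this (a hypothetical illustration of run metadata; the real `find_resumable_run()` internals and file layout are assumptions):

```python
import json
from pathlib import Path

# Sketch of resume support: each run writes a metadata file recording
# its ID and last completed step; a restarted coordinator picks the
# most recent unfinished run instead of starting from scratch.


def save_run_metadata(runs_dir: Path, run_id: str, completed_step: int) -> None:
    runs_dir.mkdir(parents=True, exist_ok=True)
    meta = {"run_id": run_id, "completed_step": completed_step}
    (runs_dir / f"{run_id}.json").write_text(json.dumps(meta))


def find_resumable_run(runs_dir: Path, total_steps: int = 5):
    """Return metadata for the latest unfinished run, or None."""
    for path in sorted(runs_dir.glob("*.json"), reverse=True):
        meta = json.loads(path.read_text())
        if meta["completed_step"] < total_steps:
            return meta
    return None
```

Because restarts converge on the same run ID, a preempted coordinator resumes from the last completed step rather than redoing the whole pipeline.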

3. Auto-trigger on merge to main

  • New .github/workflows/pipeline.yaml triggers on push to main
  • Uses modal run --detach (fire and forget — GH Actions runner exits, Modal runs independently)
  • Supports workflow_dispatch with configurable GPU/epochs/workers/skip_national
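A minimal sketch of what such a workflow could look like (job names, secret names, and input plumbing here are assumptions, not the contents of the actual `pipeline.yaml`):

```yaml
name: pipeline
on:
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      gpu:
        default: "T4"
jobs:
  trigger:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install modal
      # --detach lets the runner exit while Modal keeps running.
      - run: modal run --detach modal_app/pipeline.py::main --action run --branch main --gpu ${{ inputs.gpu || 'T4' }}
        env:
          MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
          MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
```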

Supporting fixes (earlier commits)

  • Blend entity values on would_file draws; fix entity weights
  • Salt takeup draws with hh_id:clone_idx instead of block:hh_id
  • Fix county precomputation crashes (LA County zip_code, geography.npz removal)
  • Deduplicate county precomputation; enable aca_ptc/eitc/ctc targets
  • Configure distinct national vs regional calibration hyperparameters
  • Fix national H5 build: artifact validation remap and geography/weights mismatch

Deployment flow on merge

  1. Step 1 builds all datasets and uploads enhanced_cps_2024.h5 + cps_2024.h5 directly to HF + GCS (no promote gate)
  2. Steps 2-3 build calibration package and fit regional + national weights
  3. Step 4 builds local area H5s (states/districts/cities) + national US.h5, stages to HF staging/
  4. Promote (manual one-liner) pushes staged H5s to HF production + GCS

Test plan

  • Lint passes (ruff format)
  • Smoke test passes
  • uv.lock freshness check passes
  • Modal image builds successfully with pre-baked code (CI in progress)
  • Full pipeline smoke test

🤖 Generated with Claude Code

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 7578ba2 to 310bb73 Compare March 16, 2026 17:54
@baogorek
Collaborator Author

Removed stacked_dataset_builder.py and geography.npz artifact

geography.npz was a saved copy of the output of assign_random_geography(n_records, n_clones, seed), which is fully deterministic. publish_local_area already regenerates geography from the same seed — it never loaded the .npz.

The only consumer was stacked_dataset_builder.py, an alternative entry point that loaded geography from the file instead of regenerating it. This created a second code path that had to stay in sync with publish_local_area but could silently diverge (and had a known bug where n_clones metadata was wrong).

Removed both. load_geography/save_geography remain in clone_and_assign.py since modal_app/worker_script.py still references them.
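The core of the argument is that a saved copy of a deterministic function's output is redundant. A sketch (the internals below are illustrative, not the real `assign_random_geography`; only its `(n_records, n_clones, seed)` signature comes from the code):

```python
import random

# A function fully determined by (n_records, n_clones, seed) can always
# be regenerated; persisting its output to geography.npz only created a
# second copy that could drift out of sync.


def assign_random_geography(n_records: int, n_clones: int, seed: int):
    rng = random.Random(seed)
    # Assign each cloned record a geography index, deterministically.
    return [rng.randrange(n_clones) for _ in range(n_records * n_clones)]
```

Two calls with the same arguments produce identical output, which is why `publish_local_area` can regenerate geography from the seed instead of loading a saved artifact.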

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 0ecf3d1 to 34f5fc0 Compare March 18, 2026 01:57
baogorek and others added 27 commits March 20, 2026 11:32
Matrix builder: precompute entity values with would_file=False alongside
the all-True values, then blend per tax unit based on the would_file draw
before applying target takeup draws. This ensures X@w matches sim.calculate
for targets affected by non-target state variables.

Fixes #609
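The per-tax-unit blend can be sketched as follows (an illustrative helper, not the actual matrix builder code):

```python
import numpy as np

# Entity values are precomputed under both would_file settings, then
# selected per tax unit according to the would_file draw, so X @ w
# reproduces what sim.calculate would give for the same draws.


def blend_entity_values(values_wf_true, values_wf_false, would_file_draw):
    """Pick the would_file=True value where the draw is True, else False."""
    return np.where(would_file_draw, values_wf_true, values_wf_false)
```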

publish_local_area: remove explicit sub-entity weight overrides
(tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight,
person_weight) that used incorrect person-count splitting. These are
formula variables in policyengine-us that correctly derive from
household_weight at runtime.

Fixes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace block-based RNG salting with (hh_id, clone_idx) salting.
Draws are now tied to the donor household identity and independent
across clones, eliminating the multi-clone-same-block collision
issue (#597). Geographic variation comes through the rate threshold,
not the draw.

Closes #597

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
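The identity-based salting can be sketched like this (a hypothetical helper; the real salting code and seed handling are assumptions):

```python
import hashlib

# The draw depends only on (hh_id, clone_idx), so clones of the same
# donor household get independent draws regardless of which geographic
# block they land in -- eliminating the multi-clone-same-block collision.


def takeup_draw(hh_id: int, clone_idx: int, base_seed: int = 0) -> float:
    salt = f"{base_seed}:{hh_id}:{clone_idx}".encode()
    digest = hashlib.sha256(salt).digest()
    # Map the first 8 bytes of the digest to a uniform draw in [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64
```

Takeup is then decided by comparing the draw against a geographic rate threshold, so geographic variation enters through the threshold rather than the draw itself.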
County precomputation crashes on LA County (06037) because
aca_ptc → slcsp_rating_area_la_county → three_digit_zip_code
calls zip_code.astype(int) on 'UNKNOWN'. Set zip_code='90001'
for LA County in both precomputation and publish_local_area
so X @ w matches sim.calculate("aca_ptc").sum().

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The zip_code set for LA County (06037) was being wiped by
delete_arrays which only preserved "county". Also apply the
06037 zip_code fix to the in-process county precomputation
path (not just the parallel worker function).

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The only county-dependent variable (aca_ptc) does not depend on
would_file_taxes_voluntarily, so the entity_wf_false pass was
computing identical values. Removing it eliminates ~2,977 extra
simulation passes during --county-level builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptc/eitc/ctc targets

- Fix geography.npz n_clones: was saving unique CD count instead of
  actual clone count (line 1292 of unified_calibration.py)
- Deduplicate county precomputation: inline workers=1 path now calls
  _compute_single_state_group_counties instead of copy-pasting it
- Enable aca_ptc, eitc, and refundable_ctc targets at all levels
  in target_config.yaml (remove outdated #7748 disable comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geography is fully deterministic from (n_records, n_clones, seed)
via assign_random_geography, so the .npz file was redundant.
publish_local_area already regenerates from seed. Removing the
artifact and its only consumer (stacked_dataset_builder.py)
eliminates a divergent code path that had to stay in sync.

The modal_app/worker_script.py still uses load_geography, so
the functions remain in clone_and_assign.py for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…coped checkpoints

- Add create_source_imputed_cps.py to data_build.py Phase 5 (was skipped in CI)
- Remove geography.npz dependency from Modal pipeline; workers regenerate
  geography deterministically from (n_records, n_clones, seed)
- Add input-scoped checkpoints to publish_local_area.py: hash weights+dataset
  to auto-clear stale checkpoints when inputs change
- Remove stale artifacts from push-to-modal (stacked_blocks, stacked_takeup,
  geo_labels)
- Stop uploading source_imputed H5 as intermediate; promote-dataset uploads
  at promotion time instead
- Default skip_download=True in Modal local_area (reads from volume)
- Remove _upload_source_imputed from remote_calibration_runner
- Clean up huggingface.py: remove geography/blocks/geo_labels from
  download and upload functions
- ruff format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
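The input-scoped checkpoint idea can be sketched as (helper names are assumptions; only the hash-weights-plus-dataset idea comes from the commit message):

```python
import hashlib
from pathlib import Path

# The checkpoint directory is keyed by a hash of the input files, so a
# changed weights or dataset file automatically gets a fresh checkpoint
# scope -- stale checkpoints are never reused across different inputs.


def input_scope(*input_paths: Path) -> str:
    h = hashlib.sha256()
    for path in sorted(input_paths):
        h.update(path.name.encode())
        h.update(path.read_bytes())
    return h.hexdigest()[:16]


def checkpoint_dir(root: Path, *input_paths: Path) -> Path:
    d = root / input_scope(*input_paths)
    d.mkdir(parents=True, exist_ok=True)
    return d
```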
Keep upload-dataset and skip_download=False defaults so the full
pipeline (data_build → calibrate → stage-h5s) works via HF transport.
skip_download is available as opt-in for local push-to-modal workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_build.py upload step now pushes source_imputed to
calibration/source_imputed_stratified_extended_cps.h5 on HF so the
downstream calibration pipeline (build-matrices, calibrate) can
download it. This closes the gap in the all-Modal pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --detach to all 7 modal run commands in Makefile so long-running
  jobs survive terminal disconnects
- Add --county-level to build-matrices (required for county precomputation)
- Add N_CLONES variable (default 430) and pass --n-clones to
  build-matrices, stage-h5s, and stage-national-h5
- Plumb n_clones through Modal scripts: build_package entrypoint,
  coordinate_publish, and coordinate_national_publish (replacing
  hardcoded 430)
- Change pipeline target to a reference card since --detach makes
  sequential chaining impossible

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s mismatch

1. validate_artifacts now accepts filename_remap so the national config
   (which records calibration_weights.npy) checks national_calibration_weights.npy
2. Worker regenerates geography when national weights have fewer clones
   than the regional geography

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
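Point 1 can be sketched roughly as (the signature and internals are assumptions; the remapped filenames come from the commit message):

```python
# The national config records "calibration_weights.npy", but the artifact
# on disk is "national_calibration_weights.npy"; the remap translates
# expected names before checking presence.


def validate_artifacts(expected, present, filename_remap=None):
    remap = filename_remap or {}
    missing = []
    for name in expected:
        actual = remap.get(name, name)
        if actual not in present:
            missing.append(actual)
    if missing:
        raise FileNotFoundError(f"Missing artifacts: {missing}")
```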
…orts; build enhanced CPS

- Regional: epochs=1000, beta=0.65, L0=1e-7, L2=1e-8
- National: epochs=4000, beta=0.65, L0=1e-4, L2=1e-12
- Both use target_config.yaml (same targets, different regularization)
- Fix pipeline.py ModuleNotFoundError by adding sys.path setup
- Default GPU to T4 everywhere
- Re-enable enhanced_cps build and upload in pipeline step 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-container git clone + uv sync (858MB PyTorch/CUDA each time)
with add_local_dir(copy=True) images that bake source code and deps at
build time. Modal caches layers by content hash, so unchanged code skips
the build entirely.

- Add modal_app/images.py with shared cpu_image and gpu_image
- Add modal_app/resilience.py with subprocess retry wrapper
- Add .github/workflows/pipeline.yaml for auto-trigger on merge to main
- Simplify all 4 Modal apps to use pre-baked images (no runtime cloning)
- Fix Python 3.11→3.13 mismatch in remote_calibration_runner

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… enable full_suite PR tests

Preemption was killing coordinators mid-run, losing all state and restarting
from scratch. Now run_pipeline, promote_run, coordinate_publish,
coordinate_national_publish, and build_datasets are non-preemptible.
Added find_resumable_run() so restarts converge to the same run ID.
Enabled full_suite: true in PR CI so enhanced_cps tests run against
freshly built data, not stale HuggingFace artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anth-volk added a commit that referenced this pull request Mar 24, 2026
New "under construction" node type (amber dashed) for showing
pipeline changes that are actively being developed:

US:
- PR #611: Pipeline orchestrator in Overview (Modal hardening)
- PR #540: Category takeup rerandomization in Stage 2, extracted
  puf_impute.py + source_impute.py modules in Stage 4
- PR #618: CMS marketplace data + plan selection in Stage 5

UK:
- PR #291: New Stage 9 — OA calibration pipeline (6 phases)
- PR #296: New Stage 10 — Adversarial weight regularisation
- PR #279: Modal GPU calibration nodes in Stages 6, 7, Overview

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baogorek and others added 5 commits March 23, 2026 21:27
build_h5() was refactored to take a BaseSimData object instead of a
raw dataset_path. The tests still passed the old kwarg, causing
TypeError at the end of the 4-hour CI run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…a import

The direct import failed because Modal's system Python doesn't have the
package — it's installed in the uv venv. Matches the subprocess pattern
used by all other policyengine_us_data imports in this file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
.git is intentionally excluded from Modal images (size + cache
invalidation). Capture GIT_COMMIT/GIT_BRANCH at image build time
(locally) and bake via .env(). get_git_provenance() falls back to
these env vars when git commands fail inside containers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
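The fallback can be sketched like this (the function name comes from the commit message; the internals are assumptions):

```python
import os
import subprocess

# Inside a container built without .git, the git subprocess fails and the
# GIT_COMMIT/GIT_BRANCH values baked into the image at build time via
# .env() are used instead.


def get_git_provenance(env=os.environ):
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        branch = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return commit, branch
    except (subprocess.CalledProcessError, FileNotFoundError):
        return env.get("GIT_COMMIT", "unknown"), env.get("GIT_BRANCH", "unknown")
```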
Preemptible spot instances caused silent worker terminations that left
the pipeline hanging with no clear diagnostic trail. Every function
except pipeline_status (read-only, 60s) is now nonpreemptible. Spawn
points now print function-call IDs for coordinate_publish workers,
fit_weights, and H5 build orchestrators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anth-volk added a commit that referenced this pull request Mar 24, 2026
- PR #611 orchestrator now coordinates all stages (0-8), not just 5-8
- UK PR #291 (OA clone-and-assign) repositioned between Stage 4 and Stage 5
- UK PR #296 relabeled as standalone tool (not a pipeline stage)
- Sidebar updated to show PR #296 as "Tool" instead of "Stage 10"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BaseSimData extracted simulation data into a static dataclass to avoid
reloading per area, but this reimplemented Microsimulation internals
and produced incorrect population numbers. Each build_h5 call now
creates a fresh Microsimulation from dataset_path — correct by
construction. Also includes worker log streaming fix and target config
updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modal rejects nonpreemptible=True on GPU workloads at deploy time.
CPU-only functions retain nonpreemptible=True.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek baogorek marked this pull request as ready for review March 24, 2026 20:12
@anth-volk
Collaborator

Observation: Duplicated Modal image definition

The same image definition (Debian slim + git + uv ≥ 0.8 + add_local_dir + uv sync --frozen) is copy-pasted in four places:

  • modal_app/data_build.py
  • modal_app/local_area.py
  • modal_app/pipeline.py
  • modal_app/images.py

images.py appears to have been created as the canonical shared definition (it has a _base_image() factory producing cpu_image and gpu_image), but the other three files don't import from it — they each inline their own copy of the image, _IGNORE list, and git env var capture.

This means the ignore list, Python version, uv pin, or UV_HTTP_TIMEOUT could drift between files without anyone noticing. Consider having data_build.py, local_area.py, and pipeline.py import from images.py instead.

baogorek and others added 2 commits March 25, 2026 09:06
…ate Modal image

- Add run_id parameter to staging/promote/cleanup functions in data_upload.py
  so HF paths become staging/{run_id}/... instead of flat staging/
- Generate run_id in coordinate_publish/coordinate_national_publish if not provided
- Store run_id in manifest.json; promote_publish reads it back as fallback
- Downgrade manifest verification failure from hard error to warning so uploads
  proceed even if checksums have issues
- Add --run-id CLI arg to validate_staging, check_staging_sums, promote_local_h5s
- Thread run_id through pipeline.py spawn/promote calls
- Consolidate duplicated Modal image definition into images.py (addresses PR #611 review)
- All changes are backward-compatible: run_id="" preserves flat staging/ paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
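The backward-compatible path logic can be sketched as (the helper itself is hypothetical; the path layout comes from the commit message):

```python
# An empty run_id preserves the old flat staging/ layout, so existing
# callers are unaffected, while run-scoped uploads land under
# staging/{run_id}/... for atomic promotion.


def staging_path(filename: str, run_id: str = "") -> str:
    if run_id:
        return f"staging/{run_id}/{filename}"
    return f"staging/{filename}"
```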
@baogorek
Collaborator Author

Good observation — addressed in 3cdf5ba. images.py is now the single source of truth for the Modal image definition (ignore list, Python version, uv pin, UV_HTTP_TIMEOUT, and git env vars). The four consumer files (data_build.py, local_area.py, pipeline.py, remote_calibration_runner.py) now import from it:

from modal_app.images import cpu_image as image   # or gpu_image for calibration runner

This removed ~160 lines of duplicated image/ignore/git-env boilerplate.

baogorek and others added 5 commits March 25, 2026 10:05
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mport

Modal containers don't have the repo root on sys.path by default,
so `from modal_app.images import ...` fails. Add the same sys.path
fix that pipeline.py already uses for its cross-module imports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use existing sys/Path imports instead of aliased re-imports
- Remove duplicate sys.path block in pipeline.py (now handled once at top)
- Add sys.path fix to restage.py (also imports from modal_app)
- Consistent pattern across all modal_app/ entrypoints:
  sys.path gets /root/policyengine-us-data (baked image) and
  local repo root before any from modal_app.* imports

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
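The shared pattern can be sketched like this (the helper name is an assumption; the baked-image path comes from the commit message):

```python
import sys

# Both the baked-image location and the local repo root must be
# importable before any `from modal_app.*` import runs, since Modal
# containers don't put the repo root on sys.path by default.


def ensure_import_paths(repo_root: str) -> None:
    """Prepend the baked-image path and the local repo root to sys.path."""
    for candidate in ("/root/policyengine-us-data", repo_root):
        if candidate not in sys.path:
            sys.path.insert(0, candidate)
```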
- Copy all intermediate H5 datasets to pipeline volume for lineage tracing
- Add yearless source_imputed alias for downstream pipeline consumers
- Route source_imputed H5s to calibration/ path in HF staging for promote
- Normalize at-large congressional district GEOID 200→201 (AK, DE, etc.)
- Prune filer-gated and high-error calibration targets (67→32)
- Remove unused imports and normalize Unicode across ~58 files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_us_data import

The lazy import from policyengine_us_data.utils.run_id triggers the full
package __init__ chain (which needs policyengine_core), but the orchestrator
runs outside the uv venv. Inline the trivial timestamp logic instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@anth-volk anth-volk left a comment


Approved pending test passage

@baogorek baogorek merged commit 4991b1e into main Mar 25, 2026
6 checks passed
@baogorek baogorek deleted the fix-would-file-blend-and-entity-weights branch March 25, 2026 22:36
