
Harden Modal pipeline: pre-baked images, auto-trigger on merge, at-large CD fix #611

Merged
baogorek merged 60 commits into main from fix-would-file-blend-and-entity-weights
Mar 25, 2026

Conversation

Collaborator

@baogorek baogorek commented Mar 16, 2026

Summary

Command to run the Modal pipeline end-to-end on this branch (the run completed successfully on 3/24):

modal run --detach modal_app/pipeline.py::main \
      --action run --branch fix-would-file-blend-and-entity-weights \
      --gpu T4

This PR hardens the Modal pipeline for production use, covering three major areas:

1. Pre-baked Modal images (eliminate runtime installs)

  • Before: Every function call (15+ sites across 4 files) did a fresh git clone + uv sync, downloading 858MB of PyTorch/CUDA dependencies each time. ~15% failure rate from network timeouts.
  • After: Source code and dependencies are baked into Modal image layers at build time via add_local_dir(copy=True). Modal caches layers by content hash — unchanged code skips the build entirely.
  • New modal_app/images.py with shared cpu_image and gpu_image
  • New modal_app/resilience.py with subprocess retry wrapper
  • All 4 Modal apps (data_build, remote_calibration_runner, local_area, pipeline) updated to use pre-baked images
  • Fixes Python 3.11→3.13 mismatch in remote_calibration_runner
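The subprocess retry wrapper in `modal_app/resilience.py` can be sketched roughly as follows (a minimal illustration of the pattern; the actual function name and signature in `resilience.py` are assumptions):

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=3, base_delay=1.0):
    """Run a subprocess command, retrying with exponential backoff.

    Hypothetical sketch of the retry pattern used for flaky network
    operations (git clone, uv sync) that caused the ~15% failure rate
    before images were pre-baked.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        if attempt < max_attempts:
            # Back off 1s, 2s, 4s, ... between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"Command {cmd!r} failed after {max_attempts} attempts: {result.stderr}"
    )
```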

2. End-to-end pipeline orchestrator (pipeline.py)

  • 5-step orchestrated pipeline: build datasets → build calibration package → fit weights (regional + national in parallel) → build H5s + stage → finalize
  • Resume support via run metadata (survives preemption)
  • Parallel regional + national calibration fitting
  • Validation diagnostics aggregation
  • Atomic promotion workflow (staging → production)
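The resume mechanism can be sketched like this (a hypothetical illustration of run metadata; the real `find_resumable_run()` internals and file layout are assumptions):

```python
import json
from pathlib import Path

# Sketch of resume support: each run writes a metadata file recording
# its ID and last completed step; a restarted coordinator picks the
# most recent unfinished run instead of starting from scratch.


def save_run_metadata(runs_dir: Path, run_id: str, completed_step: int) -> None:
    runs_dir.mkdir(parents=True, exist_ok=True)
    meta = {"run_id": run_id, "completed_step": completed_step}
    (runs_dir / f"{run_id}.json").write_text(json.dumps(meta))


def find_resumable_run(runs_dir: Path, total_steps: int = 5):
    """Return metadata for the latest unfinished run, or None."""
    for path in sorted(runs_dir.glob("*.json"), reverse=True):
        meta = json.loads(path.read_text())
        if meta["completed_step"] < total_steps:
            return meta
    return None
```

Because restarts converge on the same run ID, a preempted coordinator resumes from the last completed step rather than redoing the whole pipeline.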

3. Auto-trigger on merge to main

  • New .github/workflows/pipeline.yaml triggers on push to main
  • Uses modal run --detach (fire and forget — GH Actions runner exits, Modal runs independently)
  • Supports workflow_dispatch with configurable GPU/epochs/workers/skip_national
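A minimal sketch of what such a workflow could look like (job names, secret names, and input plumbing here are assumptions, not the contents of the actual `pipeline.yaml`):

```yaml
name: pipeline
on:
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      gpu:
        default: "T4"
jobs:
  trigger:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install modal
      # --detach lets the runner exit while Modal keeps running.
      - run: modal run --detach modal_app/pipeline.py::main --action run --branch main --gpu ${{ inputs.gpu || 'T4' }}
        env:
          MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
          MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
```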

Supporting fixes (earlier commits)

  • Blend entity values on would_file draws; fix entity weights
  • Salt takeup draws with hh_id:clone_idx instead of block:hh_id
  • Fix county precomputation crashes (LA County zip_code, geography.npz removal)
  • Deduplicate county precomputation; enable aca_ptc/eitc/ctc targets
  • Configure distinct national vs regional calibration hyperparameters
  • Fix national H5 build: artifact validation remap and geography/weights mismatch

Deployment flow on merge

  1. Step 1 builds all datasets and uploads enhanced_cps_2024.h5 + cps_2024.h5 directly to HF + GCS (no promote gate)
  2. Steps 2-3 build calibration package and fit regional + national weights
  3. Step 4 builds local area H5s (states/districts/cities) + national US.h5, stages to HF staging/
  4. Promote (manual one-liner) pushes staged H5s to HF production + GCS

Test plan

  • Lint passes (ruff format)
  • Smoke test passes
  • uv.lock freshness check passes
  • Modal image builds successfully with pre-baked code (CI in progress)
  • Full pipeline smoke test

🤖 Generated with Claude Code

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 7578ba2 to 310bb73 Compare March 16, 2026 17:54
@baogorek
Collaborator Author

Removed stacked_dataset_builder.py and geography.npz artifact

geography.npz was a saved copy of the output of assign_random_geography(n_records, n_clones, seed), which is fully deterministic. publish_local_area already regenerates geography from the same seed — it never loaded the .npz.

The only consumer was stacked_dataset_builder.py, an alternative entry point that loaded geography from the file instead of regenerating it. This created a second code path that had to stay in sync with publish_local_area but could silently diverge (and had a known bug where n_clones metadata was wrong).

Removed both. load_geography/save_geography remain in clone_and_assign.py since modal_app/worker_script.py still references them.
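The core of the argument is that a saved copy of a deterministic function's output is redundant. A sketch (the internals below are illustrative, not the real `assign_random_geography`; only its `(n_records, n_clones, seed)` signature comes from the code):

```python
import random

# A function fully determined by (n_records, n_clones, seed) can always
# be regenerated; persisting its output to geography.npz only created a
# second copy that could drift out of sync.


def assign_random_geography(n_records: int, n_clones: int, seed: int):
    rng = random.Random(seed)
    # Assign each cloned record a geography index, deterministically.
    return [rng.randrange(n_clones) for _ in range(n_records * n_clones)]
```

Two calls with the same arguments produce identical output, which is why `publish_local_area` can regenerate geography from the seed instead of loading a saved artifact.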

@baogorek baogorek force-pushed the fix-would-file-blend-and-entity-weights branch from 0ecf3d1 to 34f5fc0 Compare March 18, 2026 01:57
baogorek and others added 27 commits March 20, 2026 11:32
Matrix builder: precompute entity values with would_file=False alongside
the all-True values, then blend per tax unit based on the would_file draw
before applying target takeup draws. This ensures X@w matches sim.calculate
for targets affected by non-target state variables.

Fixes #609
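The per-tax-unit blend can be sketched as follows (an illustrative helper, not the actual matrix builder code):

```python
import numpy as np

# Entity values are precomputed under both would_file settings, then
# selected per tax unit according to the would_file draw, so X @ w
# reproduces what sim.calculate would give for the same draws.


def blend_entity_values(values_wf_true, values_wf_false, would_file_draw):
    """Pick the would_file=True value where the draw is True, else False."""
    return np.where(would_file_draw, values_wf_true, values_wf_false)
```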

publish_local_area: remove explicit sub-entity weight overrides
(tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight,
person_weight) that used incorrect person-count splitting. These are
formula variables in policyengine-us that correctly derive from
household_weight at runtime.

Fixes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace block-based RNG salting with (hh_id, clone_idx) salting.
Draws are now tied to the donor household identity and independent
across clones, eliminating the multi-clone-same-block collision
issue (#597). Geographic variation comes through the rate threshold,
not the draw.

Closes #597

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
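The identity-based salting can be sketched like this (a hypothetical helper; the real salting code and seed handling are assumptions):

```python
import hashlib

# The draw depends only on (hh_id, clone_idx), so clones of the same
# donor household get independent draws regardless of which geographic
# block they land in -- eliminating the multi-clone-same-block collision.


def takeup_draw(hh_id: int, clone_idx: int, base_seed: int = 0) -> float:
    salt = f"{base_seed}:{hh_id}:{clone_idx}".encode()
    digest = hashlib.sha256(salt).digest()
    # Map the first 8 bytes of the digest to a uniform draw in [0, 1).
    return int.from_bytes(digest[:8], "big") / 2**64
```

Takeup is then decided by comparing the draw against a geographic rate threshold, so geographic variation enters through the threshold rather than the draw itself.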
County precomputation crashes on LA County (06037) because
aca_ptc → slcsp_rating_area_la_county → three_digit_zip_code
calls zip_code.astype(int) on 'UNKNOWN'. Set zip_code='90001'
for LA County in both precomputation and publish_local_area
so X @ w matches sim.calculate("aca_ptc").sum().

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The zip_code set for LA County (06037) was being wiped by
delete_arrays which only preserved "county". Also apply the
06037 zip_code fix to the in-process county precomputation
path (not just the parallel worker function).

Fixes #612

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The only county-dependent variable (aca_ptc) does not depend on
would_file_taxes_voluntarily, so the entity_wf_false pass was
computing identical values. Removing it eliminates ~2,977 extra
simulation passes during --county-level builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptc/eitc/ctc targets

- Fix geography.npz n_clones: was saving unique CD count instead of
  actual clone count (line 1292 of unified_calibration.py)
- Deduplicate county precomputation: inline workers=1 path now calls
  _compute_single_state_group_counties instead of copy-pasting it
- Enable aca_ptc, eitc, and refundable_ctc targets at all levels
  in target_config.yaml (remove outdated #7748 disable comments)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geography is fully deterministic from (n_records, n_clones, seed)
via assign_random_geography, so the .npz file was redundant.
publish_local_area already regenerates from seed. Removing the
artifact and its only consumer (stacked_dataset_builder.py)
eliminates a divergent code path that had to stay in sync.

The modal_app/worker_script.py still uses load_geography, so
the functions remain in clone_and_assign.py for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…coped checkpoints

- Add create_source_imputed_cps.py to data_build.py Phase 5 (was skipped in CI)
- Remove geography.npz dependency from Modal pipeline; workers regenerate
  geography deterministically from (n_records, n_clones, seed)
- Add input-scoped checkpoints to publish_local_area.py: hash weights+dataset
  to auto-clear stale checkpoints when inputs change
- Remove stale artifacts from push-to-modal (stacked_blocks, stacked_takeup,
  geo_labels)
- Stop uploading source_imputed H5 as intermediate; promote-dataset uploads
  at promotion time instead
- Default skip_download=True in Modal local_area (reads from volume)
- Remove _upload_source_imputed from remote_calibration_runner
- Clean up huggingface.py: remove geography/blocks/geo_labels from
  download and upload functions
- ruff format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
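The input-scoped checkpoint idea can be sketched as (helper names are assumptions; only the hash-weights-plus-dataset idea comes from the commit message):

```python
import hashlib
from pathlib import Path

# The checkpoint directory is keyed by a hash of the input files, so a
# changed weights or dataset file automatically gets a fresh checkpoint
# scope -- stale checkpoints are never reused across different inputs.


def input_scope(*input_paths: Path) -> str:
    h = hashlib.sha256()
    for path in sorted(input_paths):
        h.update(path.name.encode())
        h.update(path.read_bytes())
    return h.hexdigest()[:16]


def checkpoint_dir(root: Path, *input_paths: Path) -> Path:
    d = root / input_scope(*input_paths)
    d.mkdir(parents=True, exist_ok=True)
    return d
```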
Keep upload-dataset and skip_download=False defaults so the full
pipeline (data_build → calibrate → stage-h5s) works via HF transport.
skip_download is available as opt-in for local push-to-modal workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_build.py upload step now pushes source_imputed to
calibration/source_imputed_stratified_extended_cps.h5 on HF so the
downstream calibration pipeline (build-matrices, calibrate) can
download it. This closes the gap in the all-Modal pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --detach to all 7 modal run commands in Makefile so long-running
  jobs survive terminal disconnects
- Add --county-level to build-matrices (required for county precomputation)
- Add N_CLONES variable (default 430) and pass --n-clones to
  build-matrices, stage-h5s, and stage-national-h5
- Plumb n_clones through Modal scripts: build_package entrypoint,
  coordinate_publish, and coordinate_national_publish (replacing
  hardcoded 430)
- Change pipeline target to a reference card since --detach makes
  sequential chaining impossible

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s mismatch

1. validate_artifacts now accepts filename_remap so the national config
   (which records calibration_weights.npy) checks national_calibration_weights.npy
2. Worker regenerates geography when national weights have fewer clones
   than the regional geography

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
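Point 1 can be sketched roughly as (the signature and internals are assumptions; the remapped filenames come from the commit message):

```python
# The national config records "calibration_weights.npy", but the artifact
# on disk is "national_calibration_weights.npy"; the remap translates
# expected names before checking presence.


def validate_artifacts(expected, present, filename_remap=None):
    remap = filename_remap or {}
    missing = []
    for name in expected:
        actual = remap.get(name, name)
        if actual not in present:
            missing.append(actual)
    if missing:
        raise FileNotFoundError(f"Missing artifacts: {missing}")
```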
…orts; build enhanced CPS

- Regional: epochs=1000, beta=0.65, L0=1e-7, L2=1e-8
- National: epochs=4000, beta=0.65, L0=1e-4, L2=1e-12
- Both use target_config.yaml (same targets, different regularization)
- Fix pipeline.py ModuleNotFoundError by adding sys.path setup
- Default GPU to T4 everywhere
- Re-enable enhanced_cps build and upload in pipeline step 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-container git clone + uv sync (858MB PyTorch/CUDA each time)
with add_local_dir(copy=True) images that bake source code and deps at
build time. Modal caches layers by content hash, so unchanged code skips
the build entirely.

- Add modal_app/images.py with shared cpu_image and gpu_image
- Add modal_app/resilience.py with subprocess retry wrapper
- Add .github/workflows/pipeline.yaml for auto-trigger on merge to main
- Simplify all 4 Modal apps to use pre-baked images (no runtime cloning)
- Fix Python 3.11→3.13 mismatch in remote_calibration_runner

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… enable full_suite PR tests

Preemption was killing coordinators mid-run, losing all state and restarting
from scratch. Now run_pipeline, promote_run, coordinate_publish,
coordinate_national_publish, and build_datasets are non-preemptible.
Added find_resumable_run() so restarts converge to the same run ID.
Enabled full_suite: true in PR CI so enhanced_cps tests run against
freshly built data, not stale HuggingFace artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anth-volk added a commit that referenced this pull request Mar 24, 2026
New "under construction" node type (amber dashed) for showing
pipeline changes that are actively being developed:

US:
- PR #611: Pipeline orchestrator in Overview (Modal hardening)
- PR #540: Category takeup rerandomization in Stage 2, extracted
  puf_impute.py + source_impute.py modules in Stage 4
- PR #618: CMS marketplace data + plan selection in Stage 5

UK:
- PR #291: New Stage 9 — OA calibration pipeline (6 phases)
- PR #296: New Stage 10 — Adversarial weight regularisation
- PR #279: Modal GPU calibration nodes in Stages 6, 7, Overview

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baogorek and others added 5 commits March 23, 2026 21:27
build_h5() was refactored to take a BaseSimData object instead of a
raw dataset_path. The tests still passed the old kwarg, causing
TypeError at the end of the 4-hour CI run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…a import

The direct import failed because Modal's system Python doesn't have the
package — it's installed in the uv venv. Matches the subprocess pattern
used by all other policyengine_us_data imports in this file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
.git is intentionally excluded from Modal images (size + cache
invalidation). Capture GIT_COMMIT/GIT_BRANCH at image build time
(locally) and bake via .env(). get_git_provenance() falls back to
these env vars when git commands fail inside containers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
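The fallback can be sketched like this (the function name comes from the commit message; the internals are assumptions):

```python
import os
import subprocess

# Inside a container built without .git, the git subprocess fails and the
# GIT_COMMIT/GIT_BRANCH values baked into the image at build time via
# .env() are used instead.


def get_git_provenance(env=os.environ):
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        branch = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return commit, branch
    except (subprocess.CalledProcessError, FileNotFoundError):
        return env.get("GIT_COMMIT", "unknown"), env.get("GIT_BRANCH", "unknown")
```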
Preemptible spot instances caused silent worker terminations that left
the pipeline hanging with no clear diagnostic trail. Every function
except pipeline_status (read-only, 60s) is now nonpreemptible. Spawn
points now print function-call IDs for coordinate_publish workers,
fit_weights, and H5 build orchestrators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
anth-volk added a commit that referenced this pull request Mar 24, 2026
- PR #611 orchestrator now coordinates all stages (0-8), not just 5-8
- UK PR #291 (OA clone-and-assign) repositioned between Stage 4 and Stage 5
- UK PR #296 relabeled as standalone tool (not a pipeline stage)
- Sidebar updated to show PR #296 as "Tool" instead of "Stage 10"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BaseSimData extracted simulation data into a static dataclass to avoid
reloading per area, but this reimplemented Microsimulation internals
and produced incorrect population numbers. Each build_h5 call now
creates a fresh Microsimulation from dataset_path — correct by
construction. Also includes worker log streaming fix and target config
updates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modal rejects nonpreemptible=True on GPU workloads at deploy time.
CPU-only functions retain nonpreemptible=True.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@baogorek baogorek marked this pull request as ready for review March 24, 2026 20:12
@anth-volk
Collaborator

Observation: Duplicated Modal image definition

The same image definition (Debian slim + git + uv ≥ 0.8 + add_local_dir + uv sync --frozen) is copy-pasted in four places:

  • modal_app/data_build.py
  • modal_app/local_area.py
  • modal_app/pipeline.py
  • modal_app/images.py

images.py appears to have been created as the canonical shared definition (it has a _base_image() factory producing cpu_image and gpu_image), but the other three files don't import from it — they each inline their own copy of the image, _IGNORE list, and git env var capture.

This means the ignore list, Python version, uv pin, or UV_HTTP_TIMEOUT could drift between files without anyone noticing. Consider having data_build.py, local_area.py, and pipeline.py import from images.py instead.

baogorek and others added 2 commits March 25, 2026 09:06
…ate Modal image

- Add run_id parameter to staging/promote/cleanup functions in data_upload.py
  so HF paths become staging/{run_id}/... instead of flat staging/
- Generate run_id in coordinate_publish/coordinate_national_publish if not provided
- Store run_id in manifest.json; promote_publish reads it back as fallback
- Downgrade manifest verification failure from hard error to warning so uploads
  proceed even if checksums have issues
- Add --run-id CLI arg to validate_staging, check_staging_sums, promote_local_h5s
- Thread run_id through pipeline.py spawn/promote calls
- Consolidate duplicated Modal image definition into images.py (addresses PR #611 review)
- All changes are backward-compatible: run_id="" preserves flat staging/ paths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
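The backward-compatible path logic can be sketched as (the helper itself is hypothetical; the path layout comes from the commit message):

```python
# An empty run_id preserves the old flat staging/ layout, so existing
# callers are unaffected, while run-scoped uploads land under
# staging/{run_id}/... for atomic promotion.


def staging_path(filename: str, run_id: str = "") -> str:
    if run_id:
        return f"staging/{run_id}/{filename}"
    return f"staging/{filename}"
```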
@baogorek
Collaborator Author

Good observation — addressed in 3cdf5ba. images.py is now the single source of truth for the Modal image definition (ignore list, Python version, uv pin, UV_HTTP_TIMEOUT, and git env vars). The four consumer files (data_build.py, local_area.py, pipeline.py, remote_calibration_runner.py) now import from it:

from modal_app.images import cpu_image as image   # or gpu_image for calibration runner

This removed ~160 lines of duplicated image/ignore/git-env boilerplate.

baogorek and others added 5 commits March 25, 2026 10:05
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mport

Modal containers don't have the repo root on sys.path by default,
so `from modal_app.images import ...` fails. Add the same sys.path
fix that pipeline.py already uses for its cross-module imports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use existing sys/Path imports instead of aliased re-imports
- Remove duplicate sys.path block in pipeline.py (now handled once at top)
- Add sys.path fix to restage.py (also imports from modal_app)
- Consistent pattern across all modal_app/ entrypoints:
  sys.path gets /root/policyengine-us-data (baked image) and
  local repo root before any from modal_app.* imports

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
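The shared pattern can be sketched like this (the helper name is an assumption; the baked-image path comes from the commit message):

```python
import sys

# Both the baked-image location and the local repo root must be
# importable before any `from modal_app.*` import runs, since Modal
# containers don't put the repo root on sys.path by default.


def ensure_import_paths(repo_root: str) -> None:
    """Prepend the baked-image path and the local repo root to sys.path."""
    for candidate in ("/root/policyengine-us-data", repo_root):
        if candidate not in sys.path:
            sys.path.insert(0, candidate)
```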
- Copy all intermediate H5 datasets to pipeline volume for lineage tracing
- Add yearless source_imputed alias for downstream pipeline consumers
- Route source_imputed H5s to calibration/ path in HF staging for promote
- Normalize at-large congressional district GEOID 200→201 (AK, DE, etc.)
- Prune filer-gated and high-error calibration targets (67→32)
- Remove unused imports and normalize Unicode across ~58 files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_us_data import

The lazy import from policyengine_us_data.utils.run_id triggers the full
package __init__ chain (which needs policyengine_core), but the orchestrator
runs outside the uv venv. Inline the trivial timestamp logic instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@anth-volk anth-volk left a comment


Approved pending test passage

@baogorek baogorek merged commit 4991b1e into main Mar 25, 2026
6 checks passed
@baogorek baogorek deleted the fix-would-file-blend-and-entity-weights branch March 25, 2026 22:36
