Harden Modal pipeline: pre-baked images, auto-trigger on merge, at-large CD fix #611
Conversation
Matrix builder: precompute entity values with would_file=False alongside the all-True values, then blend per tax unit based on the would_file draw before applying target takeup draws. This ensures X @ w matches sim.calculate for targets affected by non-target state variables. Fixes #609

publish_local_area: remove explicit sub-entity weight overrides (tax_unit_weight, spm_unit_weight, family_weight, marital_unit_weight, person_weight) that used incorrect person-count splitting. These are formula variables in policyengine-us that correctly derive from household_weight at runtime. Fixes #610

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace block-based RNG salting with (hh_id, clone_idx) salting. Draws are now tied to the donor household identity and independent across clones, eliminating the multi-clone-same-block collision issue (#597). Geographic variation comes through the rate threshold, not the draw. Closes #597 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
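The salting scheme above can be sketched in a few lines. This is a hypothetical illustration of the idea (the helper name and salt string are invented, not the PR's actual code): the seed is derived only from the donor household ID and clone index, so the draw is reproducible and independent across clones, with no dependence on geographic blocks.

```python
import hashlib

import numpy as np

def salted_draw(hh_id: int, clone_idx: int, salt: str = "takeup") -> float:
    """Illustrative sketch: derive a uniform draw from (hh_id, clone_idx).

    The seed depends only on the donor household identity and the clone
    index, so the draw is deterministic per clone and cannot collide the
    way shared per-block seeds did. (Hypothetical helper, not the PR's
    implementation.)
    """
    key = f"{salt}:{hh_id}:{clone_idx}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "little")
    return np.random.default_rng(seed).random()

# Same (hh_id, clone_idx) always reproduces the same draw...
assert salted_draw(42, 3) == salted_draw(42, 3)
# ...while clones of the same household get independent draws.
assert salted_draw(42, 3) != salted_draw(42, 4)
```

Geographic variation then enters only through the rate threshold each draw is compared against, as the commit message describes.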
County precomputation crashes on LA County (06037) because
aca_ptc → slcsp_rating_area_la_county → three_digit_zip_code
calls zip_code.astype(int) on 'UNKNOWN'. Set zip_code='90001'
for LA County in both precomputation and publish_local_area
so X @ w matches sim.calculate("aca_ptc").sum().
Fixes #612
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The zip_code set for LA County (06037) was being wiped by delete_arrays which only preserved "county". Also apply the 06037 zip_code fix to the in-process county precomputation path (not just the parallel worker function). Fixes #612 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
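The failure mode and the fix can be reproduced in miniature. The sample values below are illustrative (the sentinel 'UNKNOWN' and replacement '90001' come from the commits above; the array itself is invented):

```python
import numpy as np

zip_code = np.array(["90210", "UNKNOWN", "90001"], dtype=object)

# Direct cast crashes on the sentinel, as in the LA County path:
try:
    zip_code.astype(int)
except ValueError:
    pass  # invalid literal for int() with base 10: 'UNKNOWN'

# The fix pins a real in-county ZIP before any numeric cast
# (the PR uses '90001' for LA County, FIPS 06037):
patched = np.where(zip_code == "UNKNOWN", "90001", zip_code)
assert patched.astype(int).tolist() == [90210, 90001, 90001]
```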
The only county-dependent variable (aca_ptc) does not depend on would_file_taxes_voluntarily, so the entity_wf_false pass was computing identical values. Removing it eliminates ~2,977 extra simulation passes during --county-level builds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptc/eitc/ctc targets
- Fix geography.npz n_clones: was saving unique CD count instead of actual clone count (line 1292 of unified_calibration.py)
- Deduplicate county precomputation: inline workers=1 path now calls _compute_single_state_group_counties instead of copy-pasting it
- Enable aca_ptc, eitc, and refundable_ctc targets at all levels in target_config.yaml (remove outdated #7748 disable comments)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Geography is fully deterministic from (n_records, n_clones, seed) via assign_random_geography, so the .npz file was redundant. publish_local_area already regenerates from seed. Removing the artifact and its only consumer (stacked_dataset_builder.py) eliminates a divergent code path that had to stay in sync. The modal_app/worker_script.py still uses load_geography, so the functions remain in clone_and_assign.py for now. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
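The determinism argument is that geography assignment is a pure function of its inputs, so any worker can regenerate an identical array instead of loading an artifact. A toy stand-in (the function body and area count are invented; the real logic lives in assign_random_geography):

```python
import numpy as np

def assign_geography(n_records: int, n_clones: int, seed: int) -> np.ndarray:
    """Toy stand-in: deterministically assign each (record, clone) an area ID."""
    rng = np.random.default_rng(seed)
    n_areas = 436  # illustrative area count, not the real one
    return rng.integers(0, n_areas, size=(n_records, n_clones))

# Two independent "workers" with the same (n_records, n_clones, seed)
# produce bit-identical geography, so persisting geography.npz was redundant.
a = assign_geography(1000, 430, seed=7)
b = assign_geography(1000, 430, seed=7)
assert np.array_equal(a, b)
```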
…coped checkpoints
- Add create_source_imputed_cps.py to data_build.py Phase 5 (was skipped in CI)
- Remove geography.npz dependency from Modal pipeline; workers regenerate geography deterministically from (n_records, n_clones, seed)
- Add input-scoped checkpoints to publish_local_area.py: hash weights+dataset to auto-clear stale checkpoints when inputs change
- Remove stale artifacts from push-to-modal (stacked_blocks, stacked_takeup, geo_labels)
- Stop uploading source_imputed H5 as intermediate; promote-dataset uploads at promotion time instead
- Default skip_download=True in Modal local_area (reads from volume)
- Remove _upload_source_imputed from remote_calibration_runner
- Clean up huggingface.py: remove geography/blocks/geo_labels from download and upload functions
- ruff format
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
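The input-scoped checkpoint idea can be sketched as keying checkpoints by a digest of their inputs, so changed weights or dataset automatically map to a fresh location and stale state is never reused. All names here are hypothetical:

```python
import hashlib
import tempfile
from pathlib import Path

def checkpoint_key(*input_files: Path) -> str:
    """Name checkpoints after a digest of their inputs (hypothetical sketch).

    If any input file changes, the digest changes, so the checkpoint
    directory derived from it starts empty instead of replaying stale state.
    """
    h = hashlib.sha256()
    for f in sorted(input_files):
        h.update(f.read_bytes())
    return h.hexdigest()[:16]

with tempfile.TemporaryDirectory() as d:
    weights = Path(d) / "weights.npy"
    dataset = Path(d) / "dataset.h5"
    weights.write_bytes(b"v1")
    dataset.write_bytes(b"cps")

    key1 = checkpoint_key(weights, dataset)
    weights.write_bytes(b"v2")  # inputs changed -> different checkpoint scope
    key2 = checkpoint_key(weights, dataset)
    assert key1 != key2
```

Checkpoints would then live under something like .checkpoints/&lt;digest&gt;/, so no explicit invalidation step is needed.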
Keep upload-dataset and skip_download=False defaults so the full pipeline (data_build → calibrate → stage-h5s) works via HF transport. skip_download is available as opt-in for local push-to-modal workflow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The data_build.py upload step now pushes source_imputed to calibration/source_imputed_stratified_extended_cps.h5 on HF so the downstream calibration pipeline (build-matrices, calibrate) can download it. This closes the gap in the all-Modal pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add --detach to all 7 modal run commands in Makefile so long-running jobs survive terminal disconnects
- Add --county-level to build-matrices (required for county precomputation)
- Add N_CLONES variable (default 430) and pass --n-clones to build-matrices, stage-h5s, and stage-national-h5
- Plumb n_clones through Modal scripts: build_package entrypoint, coordinate_publish, and coordinate_national_publish (replacing hardcoded 430)
- Change pipeline target to a reference card since --detach makes sequential chaining impossible
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…s mismatch
1. validate_artifacts now accepts filename_remap so the national config (which records calibration_weights.npy) checks national_calibration_weights.npy
2. Worker regenerates geography when national weights have fewer clones than the regional geography
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…orts; build enhanced CPS
- Regional: epochs=1000, beta=0.65, L0=1e-7, L2=1e-8
- National: epochs=4000, beta=0.65, L0=1e-4, L2=1e-12
- Both use target_config.yaml (same targets, different regularization)
- Fix pipeline.py ModuleNotFoundError by adding sys.path setup
- Default GPU to T4 everywhere
- Re-enable enhanced_cps build and upload in pipeline step 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-container git clone + uv sync (858MB PyTorch/CUDA each time) with add_local_dir(copy=True) images that bake source code and deps at build time. Modal caches layers by content hash, so unchanged code skips the build entirely. - Add modal_app/images.py with shared cpu_image and gpu_image - Add modal_app/resilience.py with subprocess retry wrapper - Add .github/workflows/pipeline.yaml for auto-trigger on merge to main - Simplify all 4 Modal apps to use pre-baked images (no runtime cloning) - Fix Python 3.11→3.13 mismatch in remote_calibration_runner Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
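The property this commit relies on is content-addressed layer caching: a build step keyed by a hash of its inputs is skipped whenever the hash is unchanged. A toy sketch of that decision (this is not Modal's implementation, just the caching logic in miniature):

```python
import hashlib

cache: dict = {}  # content hash -> built layer ID (toy in-memory cache)

def build_layer(source: bytes) -> tuple:
    """Return (layer_id, was_rebuilt). Rebuild only when content changes."""
    digest = hashlib.sha256(source).hexdigest()
    if digest in cache:
        return cache[digest], False  # cache hit: build skipped entirely
    cache[digest] = f"layer-{digest[:12]}"
    return cache[digest], True

_, rebuilt = build_layer(b"print('app v1')")
assert rebuilt is True                        # first build
_, rebuilt = build_layer(b"print('app v1')")  # unchanged source
assert rebuilt is False                       # -> no rebuild
_, rebuilt = build_layer(b"print('app v2')")  # changed source
assert rebuilt is True
```

This is why baking source via add_local_dir(copy=True) is cheap in steady state: only commits that actually change the copied tree trigger a rebuild.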
… enable full_suite PR tests

Preemption was killing coordinators mid-run, losing all state and restarting from scratch. Now run_pipeline, promote_run, coordinate_publish, coordinate_national_publish, and build_datasets are non-preemptible. Added find_resumable_run() so restarts converge to the same run ID.

Enabled full_suite: true in PR CI so enhanced_cps tests run against freshly built data, not stale HuggingFace artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New "under construction" node type (amber dashed) for showing pipeline changes that are actively being developed:

US:
- PR #611: Pipeline orchestrator in Overview (Modal hardening)
- PR #540: Category takeup rerandomization in Stage 2, extracted puf_impute.py + source_impute.py modules in Stage 4
- PR #618: CMS marketplace data + plan selection in Stage 5

UK:
- PR #291: New Stage 9 — OA calibration pipeline (6 phases)
- PR #296: New Stage 10 — Adversarial weight regularisation
- PR #279: Modal GPU calibration nodes in Stages 6, 7, Overview

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
build_h5() was refactored to take a BaseSimData object instead of a raw dataset_path. The tests still passed the old kwarg, causing TypeError at the end of the 4-hour CI run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…a import The direct import failed because Modal's system Python doesn't have the package — it's installed in the uv venv. Matches the subprocess pattern used by all other policyengine_us_data imports in this file. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
.git is intentionally excluded from Modal images (size + cache invalidation). Capture GIT_COMMIT/GIT_BRANCH at image build time (locally) and bake via .env(). get_git_provenance() falls back to these env vars when git commands fail inside containers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
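The fallback described above can be sketched as: try the git subprocess first, and fall back to the env vars that were baked at image-build time. The helper name and injection of the command/env are invented here to keep the sketch testable:

```python
import os
import subprocess

def get_commit(env=None, git_cmd=("git", "rev-parse", "HEAD")) -> str:
    """Sketch of the provenance fallback (hypothetical helper name).

    GIT_COMMIT is captured on the build machine and baked into the image
    via .env(), since .git is excluded from the image; inside a container
    the git call fails and the env var wins.
    """
    env = os.environ if env is None else env
    try:
        return subprocess.check_output(
            list(git_cmd), text=True, stderr=subprocess.DEVNULL
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return env.get("GIT_COMMIT", "unknown")

# Simulate a container without git by pointing at a binary that does
# not exist; the baked env var takes over:
assert get_commit({"GIT_COMMIT": "abc1234"},
                  git_cmd=("nonexistent-git-binary",)) == "abc1234"
```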
Preemptible spot instances caused silent worker terminations that left the pipeline hanging with no clear diagnostic trail. Every function except pipeline_status (read-only, 60s) is now nonpreemptible. Spawn points now print function-call IDs for coordinate_publish workers, fit_weights, and H5 build orchestrators. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- PR #611 orchestrator now coordinates all stages (0-8), not just 5-8
- UK PR #291 (OA clone-and-assign) repositioned between Stage 4 and Stage 5
- UK PR #296 relabeled as standalone tool (not a pipeline stage)
- Sidebar updated to show PR #296 as "Tool" instead of "Stage 10"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BaseSimData extracted simulation data into a static dataclass to avoid reloading per area, but this reimplemented Microsimulation internals and produced incorrect population numbers. Each build_h5 call now creates a fresh Microsimulation from dataset_path — correct by construction. Also includes worker log streaming fix and target config updates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modal rejects nonpreemptible=True on GPU workloads at deploy time. CPU-only functions retain nonpreemptible=True. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Observation: Duplicated Modal image definition

The same image definition (Debian slim + git + uv ≥ 0.8 + …) is repeated across the Modal apps. This means the ignore list, Python version, … have to be kept in sync by hand in multiple places.
…ate Modal image
- Add run_id parameter to staging/promote/cleanup functions in data_upload.py
so HF paths become staging/{run_id}/... instead of flat staging/
- Generate run_id in coordinate_publish/coordinate_national_publish if not provided
- Store run_id in manifest.json; promote_publish reads it back as fallback
- Downgrade manifest verification failure from hard error to warning so uploads
proceed even if checksums have issues
- Add --run-id CLI arg to validate_staging, check_staging_sums, promote_local_h5s
- Thread run_id through pipeline.py spawn/promote calls
- Consolidate duplicated Modal image definition into images.py (addresses PR #611 review)
- All changes are backward-compatible: run_id="" preserves flat staging/ paths
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
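The backward-compatibility claim in the last bullet can be made concrete with a small path builder. The helper name and example run_id are hypothetical; the staging/{run_id}/... layout and the run_id="" flat-path behavior are from the commit above:

```python
def staging_path(run_id: str, *parts: str) -> str:
    """Sketch of the run-scoped HF staging layout (hypothetical helper).

    A non-empty run_id scopes uploads under staging/<run_id>/..., while
    run_id="" preserves the old flat staging/ layout so existing callers
    are unaffected.
    """
    prefix = f"staging/{run_id}" if run_id else "staging"
    return "/".join([prefix, *parts])

# Flat layout preserved when no run_id is supplied:
assert staging_path("", "calibration", "weights.npy") == \
    "staging/calibration/weights.npy"
# Run-scoped layout when a run_id is threaded through:
assert staging_path("run-20250324", "calibration", "weights.npy") == \
    "staging/run-20250324/calibration/weights.npy"
```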
Good observation — addressed in 3cdf5ba:

from modal_app.images import cpu_image as image  # or gpu_image for calibration runner

This removed ~160 lines of duplicated image/ignore/git-env boilerplate.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mport Modal containers don't have the repo root on sys.path by default, so `from modal_app.images import ...` fails. Add the same sys.path fix that pipeline.py already uses for its cross-module imports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use existing sys/Path imports instead of aliased re-imports
- Remove duplicate sys.path block in pipeline.py (now handled once at top)
- Add sys.path fix to restage.py (also imports from modal_app)
- Consistent pattern across all modal_app/ entrypoints: sys.path gets /root/policyengine-us-data (baked image) and local repo root before any from modal_app.* imports
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
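The shared entrypoint pattern amounts to prepending the baked-image path (and the local repo root) to sys.path exactly once before any from modal_app.* import. A sketch with an invented helper name; /root/policyengine-us-data is the baked-image path named in this PR:

```python
import sys

def ensure_importable(*roots: str) -> None:
    """Prepend each root to sys.path once, idempotently (hypothetical helper)."""
    for root in roots:
        if root not in sys.path:
            sys.path.insert(0, root)

# Baked-image path first; in the real entrypoints the local repo root
# is added alongside it before any `from modal_app.* import`.
ensure_importable("/root/policyengine-us-data")
assert sys.path.count("/root/policyengine-us-data") == 1
ensure_importable("/root/policyengine-us-data")  # calling again is a no-op
assert sys.path.count("/root/policyengine-us-data") == 1
```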
- Copy all intermediate H5 datasets to pipeline volume for lineage tracing
- Add yearless source_imputed alias for downstream pipeline consumers
- Route source_imputed H5s to calibration/ path in HF staging for promote
- Normalize at-large congressional district GEOID 200→201 (AK, DE, etc.)
- Prune filer-gated and high-error calibration targets (67→32)
- Remove unused imports and normalize Unicode across ~58 files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_us_data import The lazy import from policyengine_us_data.utils.run_id triggers the full package __init__ chain (which needs policyengine_core), but the orchestrator runs outside the uv venv. Inline the trivial timestamp logic instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
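"Inline the trivial timestamp logic" presumably means generating the run ID directly from the clock instead of importing it through the package. A hedged sketch (the function name and exact format are assumptions, not the repo's actual run_id format):

```python
from datetime import datetime, timezone

def make_run_id(now=None) -> str:
    """Timestamp-based run ID, inlined so the orchestrator avoids the
    policyengine_us_data import chain (hypothetical name and format)."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%d-%H%M%S")

# Deterministic for a fixed instant, unique in practice per pipeline run:
assert make_run_id(
    datetime(2025, 3, 24, 12, 1, 5, tzinfo=timezone.utc)
) == "20250324-120105"
```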
anth-volk
left a comment
Approved pending test passage
Summary
Command to run the Modal pipeline on this branch, end-to-end (it completed a full run on 3/24):
This PR hardens the Modal pipeline for production use, covering three major areas:
1. Pre-baked Modal images (eliminate runtime installs)

- Before: every container ran git clone + uv sync, downloading 858MB of PyTorch/CUDA dependencies each time, with a ~15% failure rate from network timeouts.
- After: source code and dependencies are baked into images via add_local_dir(copy=True). Modal caches layers by content hash — unchanged code skips the build entirely.
- Add modal_app/images.py with shared cpu_image and gpu_image
- Add modal_app/resilience.py with subprocess retry wrapper
- All four Modal apps (data_build, remote_calibration_runner, local_area, pipeline) updated to use pre-baked images
- Fix Python 3.11→3.13 mismatch in remote_calibration_runner

2. End-to-end pipeline orchestrator (pipeline.py)

3. Auto-trigger on merge to main

- .github/workflows/pipeline.yaml triggers on push to main
- Runs modal run --detach (fire and forget — the GH Actions runner exits, Modal runs independently)
- workflow_dispatch with configurable GPU/epochs/workers/skip_national

Supporting fixes (earlier commits)

- Fix would_file draws; fix entity weights
- Salt RNG draws on hh_id:clone_idx instead of block:hh_id

Deployment flow on merge

- Uploads enhanced_cps_2024.h5 + cps_2024.h5 directly to HF + GCS (no promote gate)

Test plan
🤖 Generated with Claude Code