Update SOI-backed calibration targets through TY2023 #660

Draft
MaxGhenis wants to merge 8 commits into main from codex/update-soi-targets-2023

MaxGhenis (Contributor) commented Mar 28, 2026

What changed

  • refreshed the tracked national workbook-backed SOI targets in policyengine_us_data/storage/calibration_targets/soi_targets.csv through TY2023, and backfilled TY2022 so the tracked series no longer jumps directly from 2021 to 2023
  • replaced the brittle SOI workbook refresh logic with explicit semantic mappings for the active Publication 1304 Table 1.4 and Table 2.1 layouts, plus focused regression tests
  • verified the TY2022 and TY2023 workbook layouts directly against the IRS Publication 1304 files, and kept explicit legacy multi-column handling for the pre-layout-shift TY2021 pass-through rows
  • updated the refresh flow to store non-lossy workbook column specs for multi-column sums like BP+BT instead of silently collapsing them to the final component column
  • added shared SOI metadata for the latest published national, geography, and IRA years, and reused that metadata for the retirement contribution targets in both legacy and DB-backed calibration paths
  • updated get_soi() to select the best available tracked year per variable for the requested simulation year instead of always taking the global latest year
  • updated the DB-backed IRS SOI ETL so national targets that can be sourced from the newer workbook release are overlaid from the latest published national year, even while the geography-file release cycle remains behind
  • fixed the DB SOI upsert path to target baseline rows only, so baseline overlays cannot overwrite reform targets with the same (stratum, variable, period) tuple
  • added integration coverage for the unified builder selecting the newer workbook-backed national overlay and for preserving reform-specific targets during baseline upserts
  • moved the lightweight SOI utility tests out of the package test tree so they can run without the full dataset build-time dependencies
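
The per-variable year selection described for get_soi() can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper names, the variable names in the example, and the rule itself (take the latest tracked year not after the requested year, otherwise fall back to the earliest tracked year) are all assumptions.

```python
from typing import Dict, Iterable


def pick_tracked_year(tracked_years: Iterable[int], requested_year: int) -> int:
    """Pick one tracked year for a single variable.

    Assumed rule (hypothetical): prefer the latest tracked year that is not
    after the requested year; if every tracked year is later, use the earliest.
    """
    years = sorted(set(tracked_years))
    if not years:
        raise ValueError("no tracked years available")
    not_after = [year for year in years if year <= requested_year]
    return not_after[-1] if not_after else years[0]


def pick_years_per_variable(
    tracked: Dict[str, Iterable[int]], requested_year: int
) -> Dict[str, int]:
    """Apply the rule independently to each variable instead of globally."""
    return {
        variable: pick_tracked_year(years, requested_year)
        for variable, years in tracked.items()
    }


# Hypothetical variables with different tracked coverage.
tracked = {
    "variable_a": [2015, 2021, 2022, 2023],
    "variable_b": [2015, 2021],
}
selected = pick_years_per_variable(tracked, requested_year=2022)
```

Here variable_a resolves to 2022 while variable_b resolves to 2021, whereas a single global "latest year" would have forced both onto the same year regardless of coverage.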

Why

The repo had drifted into an inconsistent SOI state:

  • the tracked national workbook targets stopped at TY2021
  • the active DB-backed IRS SOI path was still inheriting many national targets from the TY2022 geography file
  • the tracked workbook coordinates had become stale for newer IRS layouts, especially for Table 1.4 and Table 2.1

This PR updates the available SOI-backed targets to the latest published IRS releases the repo can credibly use, and makes the refresh path less fragile going forward.

Scope and limits

This PR updates everything in the repo that can currently move to newer SOI data from published IRS sources.

Still intentionally left on older sources:

  • state and district geography-backed SOI targets remain on TY2022 because the IRS has not yet published the TY2023 in54, in55cm, and incd files
  • aca_ptc and refundable_ctc in the DB-backed IRS SOI path remain on the geography-backed source for now, because the published national workbook tables do not line up cleanly enough with the current incd code definitions to switch them safely in this PR
  • Roth IRA contributions remain on TY2022 because the latest published IRA accumulation table is still TY2022
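
One way to picture the shared metadata behind these limits is a small lookup of the latest published year per source, using the years stated above. The dictionary name, its keys, and the capping rule are hypothetical and are not taken from soi_metadata.py.

```python
# Latest published tax years per source, as stated in this PR description.
# The names and structure here are illustrative assumptions.
LATEST_PUBLISHED_YEAR = {
    "national": 2023,   # Publication 1304 workbook tables
    "geography": 2022,  # state and district files
    "ira": 2022,        # IRA accumulation table
}


def source_year(kind: str, requested_year: int) -> int:
    """Cap the requested year at the latest published year for this source."""
    try:
        latest = LATEST_PUBLISHED_YEAR[kind]
    except KeyError:
        raise ValueError(f"unknown SOI source kind: {kind!r}") from None
    return min(requested_year, latest)
```

Under this sketch, a request for 2024 resolves to 2023 for national targets but to 2022 for geography and IRA targets, which mirrors the split described in the list above.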

Review follow-up

After the initial PR draft, I followed up on the semantic-mapping review points directly against the IRS files.

  • TY2022 and TY2023 Table 1.4 / 2.1 layouts were spot-checked against the official IRS workbooks and do share the newer column layouts used here
  • TY2021 does not share that layout, so the validator now explicitly reconstructs the old pass-through totals from the legacy multi-column pairs instead of assuming the newer mapping applies retroactively
  • relative to the previous HEAD version of soi_targets.csv, this follow-up did not change any stored target values; it corrected 144 lossy column specs in TY2022 and the same 144 in TY2023 so provenance now matches the actual workbook reads
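
To make the lossy versus non-lossy distinction concrete, here is a small sketch of reading a multi-column spec such as BP+BT from one workbook row. The function names and the "+"-separated spec format are assumptions for illustration; the actual refresh script may represent specs differently.

```python
from typing import List, Sequence


def column_letters_to_index(letters: str) -> int:
    """Convert Excel-style column letters (A, Z, AA, BP) to a 0-based index."""
    if not letters:
        raise ValueError("empty column reference")
    index = 0
    for ch in letters.upper():
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column letters: {letters!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def parse_column_spec(spec: str) -> List[int]:
    """Split a spec like 'BP+BT' into the 0-based indices of every component."""
    return [column_letters_to_index(part.strip()) for part in spec.split("+")]


def read_spec(row: Sequence[float], spec: str) -> float:
    """Sum all component columns named in the spec for one row."""
    return sum(row[i] for i in parse_column_spec(spec))


# Made-up row values: BP is column 68 (index 67), BT is column 72 (index 71).
row = [0.0] * 80
row[67] = 10.0
row[71] = 5.0

full = read_spec(row, "BP+BT")   # sums both components
lossy = read_spec(row, "BT")     # what a collapsed spec would describe
```

Storing "BP+BT" rather than only "BT" keeps the recorded provenance consistent with the value that was actually read, which is the correction described for the 144 specs per year.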

Validation

  • uv run pytest -q tests/test_refresh_soi_table_targets.py tests/test_etl_irs_soi_overlay.py
  • uv run pytest -q policyengine_us_data/tests/test_calibration/test_unified_matrix_builder.py policyengine_us_data/tests/test_calibration/test_unified_calibration.py policyengine_us_data/tests/test_schema_views_and_lookups.py
  • uv run python policyengine_us_data/storage/calibration_targets/refresh_soi_table_targets.py --source-year 2021 --target-year 2022 --validate-source-year
  • uv run python policyengine_us_data/storage/calibration_targets/refresh_soi_table_targets.py --source-year 2022 --target-year 2023 --validate-source-year
  • uv run python -m py_compile policyengine_us_data/storage/calibration_targets/refresh_soi_table_targets.py policyengine_us_data/utils/soi.py policyengine_us_data/db/etl_irs_soi.py policyengine_us_data/db/etl_national_targets.py policyengine_us_data/storage/calibration_targets/pull_soi_targets.py policyengine_us_data/storage/calibration_targets/soi_metadata.py tests/test_soi_utils.py tests/test_refresh_soi_table_targets.py tests/test_etl_irs_soi_overlay.py
  • uv run ruff check policyengine_us_data/storage/calibration_targets/refresh_soi_table_targets.py policyengine_us_data/db/etl_irs_soi.py tests/test_refresh_soi_table_targets.py tests/test_etl_irs_soi_overlay.py

Calibration impact

The main national SOI-backed targets that now move from TY2022 to TY2023 in the DB-backed path are materially different. A few examples from the updated aggregate workbook targets:

  • taxable_interest_income: about +135%
  • dividend_income: about +22%
  • tax_exempt_interest_income: about +19%
  • taxable_social_security: about +15%
  • net_capital_gains: about -24%

So this is not just a bookkeeping refresh; it changes the national SOI-backed target surface in a meaningful way.
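
For reference, the percentages above are relative changes between the TY2022 and TY2023 target values. A minimal sketch of that arithmetic follows; the input numbers are round placeholders chosen only to reproduce two of the listed percentages and are not actual SOI amounts.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    if old == 0:
        raise ValueError("percent change is undefined when the old value is zero")
    return (new - old) / abs(old) * 100.0


def format_change(old: float, new: float) -> str:
    """Render the change in the style used in the list above."""
    return f"about {percent_change(old, new):+.0f}%"


# Placeholder values, not real target amounts.
increase = format_change(100.0, 235.0)
decrease = format_change(100.0, 76.0)
```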

MaxGhenis changed the title from "[codex] Refresh SOI table targets to TY2023" to "Refresh SOI table targets to TY2023" on Mar 28, 2026
MaxGhenis changed the title from "Refresh SOI table targets to TY2023" to "Update SOI-backed calibration targets through TY2023" on Mar 28, 2026
MaxGhenis force-pushed the codex/update-soi-targets-2023 branch from 833aec1 to 96f7cc6 on March 29, 2026 16:06