
Address round 2 referee feedback #3

Open

MaxGhenis wants to merge 10 commits into paper/address-referee-feedback from paper/round2-referee-feedback

Conversation

Contributor

@MaxGhenis MaxGhenis commented Feb 8, 2026

Summary

  • Replaced 4 divergent PRDC implementations with canonical prdc library for consistency
  • Expanded benchmark from 3 to 10 random seeds for more robust evaluation
  • Added coverage bar chart figure to paper (methods × sources, with error bars)
  • Fixed all micro -> microplex references across documentation
  • Clarified MAF per-column architecture in paper (conditional independence shared across all methods)
  • Rewrote benchmarks README to remove stale KS-based claims
  • Removed unimplemented feature claims from main README
  • Added HardConcrete (l0-python) to reweighting benchmark with init weight rescaling fix
  • Switched reweighting evaluation to train/test split (calibrate on age+weight, evaluate on held-out sex)
  • Added reweighting frontier figure: SparseCalibrator (L1, convex) dominates HardConcrete (L0, non-convex) at every sparsity level
  • Multi-seed analysis (5 seeds) shows HardConcrete variance explodes at high sparsity

Key reweighting finding

SparseCalibrator dominates the entire accuracy-sparsity frontier:

  • Convex/deterministic vs non-convex/stochastic
  • 12-90x faster (0.02s vs 1.8s at 5K records)
  • ~9% test error at 31 records vs 32% ± 10% for HardConcrete at similar sparsity
  • Filed policyengine-us-data#520 to migrate from l0-python
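The convex L1 side of this frontier can be sketched with a minimal proximal-gradient (ISTA) calibrator. This is an illustration of the technique, not `SparseCalibrator`'s actual implementation; the function name, toy data, and step-size choice are all assumptions:

```python
import numpy as np

def l1_calibrate(A, b, lam, n_iter=2000):
    """Illustrative L1-penalized calibration via projected ISTA.

    Minimizes (1/2)||A w - b||^2 + lam * ||w||_1 subject to w >= 0.
    The problem is convex, so the solution is deterministic -- no
    seed sensitivity, unlike stochastic L0 (HardConcrete) relaxations.
    """
    n = A.shape[1]
    w = np.full(n, b.mean() / max(A.mean() * n, 1e-12))  # rough feasible start
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)             # 1/L gradient step
    for _ in range(n_iter):
        grad = A.T @ (A @ w - b)
        # Gradient step, then soft-threshold and project onto w >= 0.
        w = np.maximum(w - step * grad - step * lam, 0.0)
    return w

# Toy problem: 3 targets over 50 records; larger lam -> more zero weights.
rng = np.random.default_rng(0)
A = rng.uniform(0, 1, size=(3, 50))
b = A @ rng.uniform(0.5, 1.5, size=50)
w = l1_calibrate(A, b, lam=0.01)
```

Sweeping `lam` over a grid and recording (nonzero weights, held-out error) pairs is what traces the accuracy-sparsity frontier discussed above.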

Test plan

🤖 Generated with Claude Code

MaxGhenis and others added 10 commits February 8, 2026 13:12
…tion

Replace 4 divergent PRDC implementations (benchmark.py, harness.py,
coverage.py, dgp.py) with wrappers around the canonical `prdc` library
(Naeem et al. 2020). Each call site still handles standardization and
edge-case guards, but the core k-NN metric computation is now unified.

- Add `prdc>=0.1` to core dependencies
- Fix all `micro` -> `microplex` references in docs (quickstart, api, benchmarks)
- Rewrite benchmarks/README.md: remove stale KS claims, reference PRDC evaluation
- Remove unsubstantiated "Joint correlations" and "Hierarchical structures"
  claims from README.md
- Clarify MAF per-column architecture in paper (1D flow per variable,
  conditional independence shared across all method families)
- Add coverage bar chart figure (paper/figures/coverage_by_method.py)
- 10-seed benchmark running separately (will update results JSON)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
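Coverage, the headline metric here, has a compact k-NN definition. A self-contained numpy sketch of what the canonical `prdc` library computes (per Naeem et al. 2020; the library's `compute_prdc(real, fake, nearest_k)` returns it alongside precision, recall, and density). The toy distributions below are illustrative:

```python
import numpy as np

def knn_coverage(real, fake, nearest_k=5):
    """Coverage (Naeem et al. 2020): fraction of real points whose k-NN
    ball (radius = distance to the k-th nearest real neighbour) contains
    at least one synthetic point."""
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    d_rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    # Sorted row includes self at distance 0, so index k is the k-th
    # nearest neighbour excluding self.
    radii = np.sort(d_rr, axis=1)[:, nearest_k]
    return float(np.mean(d_rf.min(axis=1) < radii))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
good = rng.normal(size=(200, 4))          # same distribution -> high coverage
bad = rng.normal(loc=5.0, size=(200, 4))  # shifted distribution -> low coverage
```

Unifying the four call sites on one implementation of this metric is what removes the cross-file drift the commit describes.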
10-seed results with canonical prdc library. Key findings stable:
- ZI-QRF SIPP: 95.0% ± 0.2%, ZI-MAF CPS: 49.9% ± 0.7%
- ZI lifts: MAF +83%, QDNN +67%, QRF +2%
- Standard errors now tighter with more seeds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
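The "±" figures above are the mean plus or minus the standard error across seeds. A minimal sketch of the aggregation; the per-seed values below are made up to match the reported headline number, not the actual benchmark output:

```python
import numpy as np

# Hypothetical per-seed ZI-QRF SIPP coverage scores (10 seeds).
coverage = np.array([0.946, 0.953, 0.958, 0.944, 0.950,
                     0.957, 0.941, 0.952, 0.950, 0.949])
mean = coverage.mean()
se = coverage.std(ddof=1) / np.sqrt(len(coverage))  # standard error of the mean
print(f"{mean:.1%} ± {se:.1%}")  # -> 95.0% ± 0.2%
```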
… claims

- Fix QRF sampling description: discrete 5-quantile grid, not continuous U(0.1,0.9)
- Fix QDNN description: note discrete 7-quantile grid and cross-seed instability
- Clarify PRDC vs PDC acronym and recall/coverage equivalence
- Fix flood2020 → flood2023 citation (IPUMS CPS v11.0 is from 2023)
- Note entropy balancing originates from causal inference literature
- Define constraint notation (A, b) in calibration section
- Note L0-Sparse converges to L1 solution (numerically identical)
- Remove evaluative adjectives (compelling, instructive, fundamentally)
- Convert Limitations bold pseudo-headers to running prose
- Rephrase Conclusion to avoid near-verbatim Abstract overlap
- Remove CT-GAN/TVAE from README and docs comparison tables
- Remove "Joint correlations", "Synthesize billions" from docs/intro.md
- Fix docs/_config.yml: micro → microplex
- Delete stale BENCHMARKS_SUMMARY.md (Dec 2024 claims, wrong repo paths)
- Add responses to dev deps, cvxpy as optional dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix critical bug in HardConcreteCalibrator: initial survey weights
(mean ~6800) produced constraint violations of 5000x+ after
normalization, causing gradient descent to fail. Fix rescales init
weights so A_norm @ w_init ≈ b_norm before optimization.

Results: HardConcrete achieves 8.5% mean error with 93.7% sparsity
(315/5000 records), outperforming IPF (12.6%), entropy (11.9%), and
SparseCalibrator (12.6%) on mean error while using far fewer records.

Also fixes benchmark to use deterministic targets (age_group + is_male
+ weight only) for reproducibility, removing auto-discovery of extra
columns that changed results between runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
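The rescaling fix can be sketched in a few lines. A closed-form least-squares scalar is one way to achieve `A_norm @ w_init ≈ b_norm`; the helper name and toy data here are assumptions, not the actual patch:

```python
import numpy as np

def rescale_init_weights(A_norm, b_norm, w_init):
    """Scale initial weights so A_norm @ (alpha * w_init) ~= b_norm.

    alpha is the least-squares scalar minimizing
    ||alpha * (A_norm @ w_init) - b_norm||^2, which keeps raw
    survey-scale weights from producing huge constraint violations
    against normalized targets at the start of optimization.
    """
    y = A_norm @ w_init
    alpha = (y @ b_norm) / max(y @ y, 1e-12)
    return alpha * w_init

rng = np.random.default_rng(0)
A_norm = rng.uniform(0, 1, size=(6, 5000))
w_raw = rng.uniform(5000, 9000, size=5000)        # survey-scale weights
b_raw = A_norm @ rng.uniform(0.9, 1.1, size=5000)
b_norm = b_raw / np.linalg.norm(b_raw)            # normalized targets

w0 = rescale_init_weights(A_norm, b_raw / np.linalg.norm(b_raw), w_raw)
before = np.linalg.norm(A_norm @ w_raw - b_norm)  # violation at raw scale
after = np.linalg.norm(A_norm @ w0 - b_norm)      # violation after rescaling
```

Starting gradient descent from `w0` instead of `w_raw` removes the pathological initial gradients the commit describes.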
Report out-of-sample generalization: calibrate on age + weight (6 targets),
evaluate on held-out sex (2 targets). Table now shows train and test error
columns instead of combined mean. Key finding: HardConcrete matches dense
methods on held-out targets (25.4% test error) with 94% sparsity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
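The train/test protocol reduces to: calibrate against one block of targets, then score relative error on a disjoint block. A toy sketch with made-up microdata; min-norm least squares stands in for a real calibrator, which would add non-negativity and a sparsity penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age_group = rng.integers(0, 3, n)
is_male = rng.integers(0, 2, n)
w_true = rng.uniform(0.5, 1.5, n)          # unobserved "population" weights

# Train targets: weighted age-group counts. Held-out targets: sex counts.
A_train = np.stack([(age_group == g).astype(float) for g in range(3)])
A_test = np.stack([(is_male == s).astype(float) for s in range(2)])
b_train, b_test = A_train @ w_true, A_test @ w_true

# Calibrate on the train targets only.
w_hat = np.linalg.lstsq(A_train, b_train, rcond=None)[0]

train_err = np.abs(A_train @ w_hat - b_train) / b_train  # ~0 by construction
test_err = np.abs(A_test @ w_hat - b_test) / b_test      # out-of-sample check
```

Reporting `test_err` separately is what distinguishes genuine generalization from merely fitting the calibrated targets.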
Sweep regularization parameters for SparseCalibrator and HardConcrete,
plotting active records vs held-out test error. Key finding: sparser
solutions generalize better — HardConcrete with <10 records achieves
~5% test error vs ~18% for dense methods using all 5,000 records.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
L1-Sparse and L0-Sparse (hard equality constraints, 5 records, 77% test
error) are the extreme endpoints of their respective families.
SparseCalibrator (L1) and HardConcrete (L0) trace parameterized frontiers
that dominate the hard-constraint solutions at every sparsity level.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run HardConcrete at 5 seeds per lambda to get reliable error bars.
Result: HC variance explodes at high sparsity (32-46% mean error with
±10-14% SE below 100 records). SparseCalibrator (convex, deterministic)
dominates at every sparsity level, reaching ~9% test error at 31 records.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>