
Address round 2 referee feedback #3

Open

MaxGhenis wants to merge 10 commits into paper/address-referee-feedback from paper/round2-referee-feedback

Conversation

Contributor

@MaxGhenis MaxGhenis commented Feb 8, 2026

Summary

  • Replaced 4 divergent PRDC implementations with canonical prdc library for consistency
  • Expanded benchmark from 3 to 10 random seeds for more robust evaluation
  • Added coverage bar chart figure to paper (methods × sources, with error bars)
  • Fixed all micro -> microplex references across documentation
  • Clarified MAF per-column architecture in paper (conditional independence shared across all methods)
  • Rewrote benchmarks README to remove stale KS-based claims
  • Removed unimplemented feature claims from main README
  • Added HardConcrete (l0-python) to reweighting benchmark with init weight rescaling fix
  • Switched reweighting evaluation to train/test split (calibrate on age+weight, evaluate on held-out sex)
  • Added reweighting frontier figure: SparseCalibrator (L1, convex) dominates HardConcrete (L0, non-convex) at every sparsity level
  • Multi-seed analysis (5 seeds) shows HardConcrete variance explodes at high sparsity

Key reweighting finding

SparseCalibrator dominates the entire accuracy-sparsity frontier:

  • Convex/deterministic vs non-convex/stochastic
  • 12-90x faster (0.02s vs 1.8s at 5K records)
  • ~9% test error at 31 records vs 32% ± 10% for HardConcrete at similar sparsity
  • Filed policyengine-us-data#520 to migrate from l0-python
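The convex L1 side of this frontier can be sketched with a minimal proximal-gradient (ISTA) calibrator. This is an illustration of the technique, not `SparseCalibrator`'s actual implementation; the function name, toy data, and step-size choice are all assumptions:

```python
import numpy as np

def l1_calibrate(A, b, lam, n_iter=2000):
    """Illustrative L1-penalized calibration via projected ISTA.

    Minimizes (1/2)||A w - b||^2 + lam * ||w||_1 subject to w >= 0.
    The problem is convex, so the solution is deterministic -- no
    seed sensitivity, unlike stochastic L0 (HardConcrete) relaxations.
    """
    n = A.shape[1]
    w = np.full(n, b.mean() / max(A.mean() * n, 1e-12))  # rough feasible start
    step = 1.0 / (np.linalg.norm(A, 2) ** 2)             # 1/L gradient step
    for _ in range(n_iter):
        grad = A.T @ (A @ w - b)
        # Gradient step, then soft-threshold and project onto w >= 0.
        w = np.maximum(w - step * grad - step * lam, 0.0)
    return w

# Toy problem: 3 targets over 50 records; larger lam -> more zero weights.
rng = np.random.default_rng(0)
A = rng.uniform(0, 1, size=(3, 50))
b = A @ rng.uniform(0.5, 1.5, size=50)
w = l1_calibrate(A, b, lam=0.01)
```

Sweeping `lam` over a grid and recording (nonzero weights, held-out error) pairs is what traces the accuracy-sparsity frontier discussed above.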

Test plan

🤖 Generated with Claude Code

MaxGhenis and others added 10 commits February 8, 2026 13:12
…tion

Replace 4 divergent PRDC implementations (benchmark.py, harness.py,
coverage.py, dgp.py) with wrappers around the canonical `prdc` library
(Naeem et al. 2020). Each call site still handles standardization and
edge-case guards, but the core k-NN metric computation is now unified.

- Add `prdc>=0.1` to core dependencies
- Fix all `micro` -> `microplex` references in docs (quickstart, api, benchmarks)
- Rewrite benchmarks/README.md: remove stale KS claims, reference PRDC evaluation
- Remove unsubstantiated "Joint correlations" and "Hierarchical structures"
  claims from README.md
- Clarify MAF per-column architecture in paper (1D flow per variable,
  conditional independence shared across all method families)
- Add coverage bar chart figure (paper/figures/coverage_by_method.py)
- 10-seed benchmark running separately (will update results JSON)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
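Coverage, the headline metric here, has a compact k-NN definition. A self-contained numpy sketch of what the canonical `prdc` library computes (per Naeem et al. 2020; the library's `compute_prdc(real, fake, nearest_k)` returns it alongside precision, recall, and density). The toy distributions below are illustrative:

```python
import numpy as np

def knn_coverage(real, fake, nearest_k=5):
    """Coverage (Naeem et al. 2020): fraction of real points whose k-NN
    ball (radius = distance to the k-th nearest real neighbour) contains
    at least one synthetic point."""
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    d_rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    # Sorted row includes self at distance 0, so index k is the k-th
    # nearest neighbour excluding self.
    radii = np.sort(d_rr, axis=1)[:, nearest_k]
    return float(np.mean(d_rf.min(axis=1) < radii))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
good = rng.normal(size=(200, 4))          # same distribution -> high coverage
bad = rng.normal(loc=5.0, size=(200, 4))  # shifted distribution -> low coverage
```

Unifying the four call sites on one implementation of this metric is what removes the cross-file drift the commit describes.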
10-seed results with canonical prdc library. Key findings stable:
- ZI-QRF SIPP: 95.0% ± 0.2%, ZI-MAF CPS: 49.9% ± 0.7%
- ZI lifts: MAF +83%, QDNN +67%, QRF +2%
- Standard errors now tighter with more seeds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
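The "±" figures above are the mean plus or minus the standard error across seeds. A minimal sketch of the aggregation; the per-seed values below are made up to match the reported headline number, not the actual benchmark output:

```python
import numpy as np

# Hypothetical per-seed ZI-QRF SIPP coverage scores (10 seeds).
coverage = np.array([0.946, 0.953, 0.958, 0.944, 0.950,
                     0.957, 0.941, 0.952, 0.950, 0.949])
mean = coverage.mean()
se = coverage.std(ddof=1) / np.sqrt(len(coverage))  # standard error of the mean
print(f"{mean:.1%} ± {se:.1%}")  # -> 95.0% ± 0.2%
```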
… claims

- Fix QRF sampling description: discrete 5-quantile grid, not continuous U(0.1,0.9)
- Fix QDNN description: note discrete 7-quantile grid and cross-seed instability
- Clarify PRDC vs PDC acronym and recall/coverage equivalence
- Fix flood2020 → flood2023 citation (IPUMS CPS v11.0 is from 2023)
- Note entropy balancing originates from causal inference literature
- Define constraint notation (A, b) in calibration section
- Note L0-Sparse converges to L1 solution (numerically identical)
- Remove evaluative adjectives (compelling, instructive, fundamentally)
- Convert Limitations bold pseudo-headers to running prose
- Rephrase Conclusion to avoid near-verbatim Abstract overlap
- Remove CT-GAN/TVAE from README and docs comparison tables
- Remove "Joint correlations", "Synthesize billions" from docs/intro.md
- Fix docs/_config.yml: micro → microplex
- Delete stale BENCHMARKS_SUMMARY.md (Dec 2024 claims, wrong repo paths)
- Add responses to dev deps, cvxpy as optional dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix critical bug in HardConcreteCalibrator: initial survey weights
(mean ~6800) produced constraint violations of 5000x+ after
normalization, causing gradient descent to fail. Fix rescales init
weights so A_norm @ w_init ≈ b_norm before optimization.

Results: HardConcrete achieves 8.5% mean error with 93.7% sparsity
(315/5000 records), outperforming IPF (12.6%), entropy (11.9%), and
SparseCalibrator (12.6%) on mean error while using far fewer records.

Also fixes benchmark to use deterministic targets (age_group + is_male
+ weight only) for reproducibility, removing auto-discovery of extra
columns that changed results between runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
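The rescaling fix can be sketched in a few lines. A closed-form least-squares scalar is one way to achieve `A_norm @ w_init ≈ b_norm`; the helper name and toy data here are assumptions, not the actual patch:

```python
import numpy as np

def rescale_init_weights(A_norm, b_norm, w_init):
    """Scale initial weights so A_norm @ (alpha * w_init) ~= b_norm.

    alpha is the least-squares scalar minimizing
    ||alpha * (A_norm @ w_init) - b_norm||^2, which keeps raw
    survey-scale weights from producing huge constraint violations
    against normalized targets at the start of optimization.
    """
    y = A_norm @ w_init
    alpha = (y @ b_norm) / max(y @ y, 1e-12)
    return alpha * w_init

rng = np.random.default_rng(0)
A_norm = rng.uniform(0, 1, size=(6, 5000))
w_raw = rng.uniform(5000, 9000, size=5000)        # survey-scale weights
b_raw = A_norm @ rng.uniform(0.9, 1.1, size=5000)
b_norm = b_raw / np.linalg.norm(b_raw)            # normalized targets

w0 = rescale_init_weights(A_norm, b_raw / np.linalg.norm(b_raw), w_raw)
before = np.linalg.norm(A_norm @ w_raw - b_norm)  # violation at raw scale
after = np.linalg.norm(A_norm @ w0 - b_norm)      # violation after rescaling
```

Starting gradient descent from `w0` instead of `w_raw` removes the pathological initial gradients the commit describes.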
Report out-of-sample generalization: calibrate on age + weight (6 targets),
evaluate on held-out sex (2 targets). Table now shows train and test error
columns instead of combined mean. Key finding: HardConcrete matches dense
methods on held-out targets (25.4% test error) with 94% sparsity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
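The train/test protocol reduces to: calibrate against one block of targets, then score relative error on a disjoint block. A toy sketch with made-up microdata; min-norm least squares stands in for a real calibrator, which would add non-negativity and a sparsity penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
age_group = rng.integers(0, 3, n)
is_male = rng.integers(0, 2, n)
w_true = rng.uniform(0.5, 1.5, n)          # unobserved "population" weights

# Train targets: weighted age-group counts. Held-out targets: sex counts.
A_train = np.stack([(age_group == g).astype(float) for g in range(3)])
A_test = np.stack([(is_male == s).astype(float) for s in range(2)])
b_train, b_test = A_train @ w_true, A_test @ w_true

# Calibrate on the train targets only.
w_hat = np.linalg.lstsq(A_train, b_train, rcond=None)[0]

train_err = np.abs(A_train @ w_hat - b_train) / b_train  # ~0 by construction
test_err = np.abs(A_test @ w_hat - b_test) / b_test      # out-of-sample check
```

Reporting `test_err` separately is what distinguishes genuine generalization from merely fitting the calibrated targets.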
Sweep regularization parameters for SparseCalibrator and HardConcrete,
plotting active records vs held-out test error. Key finding: sparser
solutions generalize better — HardConcrete with <10 records achieves
~5% test error vs ~18% for dense methods using all 5,000 records.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
L1-Sparse and L0-Sparse (hard equality constraints, 5 records, 77% test
error) are the extreme endpoints of their respective families.
SparseCalibrator (L1) and HardConcrete (L0) trace parameterized frontiers
that dominate the hard-constraint solutions at every sparsity level.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run HardConcrete at 5 seeds per lambda to get reliable error bars.
Result: HC variance explodes at high sparsity (32-46% mean error with
±10-14% SE below 100 records). SparseCalibrator (convex, deterministic)
dominates at every sparsity level, reaching ~9% test error at 31 records.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>