Address round 2 referee feedback#3
Open
MaxGhenis wants to merge 10 commits into paper/address-referee-feedback from
Conversation
…tion Replace 4 divergent PRDC implementations (benchmark.py, harness.py, coverage.py, dgp.py) with wrappers around the canonical `prdc` library (Naeem et al. 2020). Each call site still handles standardization and edge-case guards, but the core k-NN metric computation is now unified.

- Add `prdc>=0.1` to core dependencies
- Fix all `micro` -> `microplex` references in docs (quickstart, api, benchmarks)
- Rewrite benchmarks/README.md: remove stale KS claims, reference PRDC evaluation
- Remove unsubstantiated "Joint correlations" and "Hierarchical structures" claims from README.md
- Clarify MAF per-column architecture in paper (1D flow per variable; conditional independence shared across all method families)
- Add coverage bar chart figure (paper/figures/coverage_by_method.py)
- 10-seed benchmark running separately (will update results JSON)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
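The coverage metric that the call sites now delegate to the canonical library can be sketched in pure NumPy, following the Naeem et al. (2020) definition: a real point is covered if at least one synthetic point falls inside its k-nearest-neighbor radius. This is an illustrative re-implementation for intuition only; the repo's wrappers call `prdc.compute_prdc` rather than this code.

```python
import numpy as np

def knn_radii(X, k):
    """Distance from each row of X to its k-th nearest other row."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the self-distance (0), so index k skips it

def coverage(real, fake, nearest_k=5):
    """Fraction of real points with a synthetic point inside their k-NN ball."""
    r = knn_radii(real, nearest_k)
    dist = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    return float((dist.min(axis=1) <= r).mean())
```

With identical real and synthetic samples every real point is trivially covered (coverage 1.0); a synthetic sample shifted far away covers nothing (coverage 0.0).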
10-seed results with canonical prdc library. Key findings stable:

- ZI-QRF SIPP: 95.0% ± 0.2%; ZI-MAF CPS: 49.9% ± 0.7%
- ZI lifts: MAF +83%, QDNN +67%, QRF +2%
- Standard errors now tighter with more seeds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… claims

- Fix QRF sampling description: discrete 5-quantile grid, not continuous U(0.1, 0.9)
- Fix QDNN description: note discrete 7-quantile grid and cross-seed instability
- Clarify PRDC vs PDC acronym and recall/coverage equivalence
- Fix flood2020 → flood2023 citation (IPUMS CPS v11.0 is from 2023)
- Note entropy balancing originates from the causal inference literature
- Define constraint notation (A, b) in calibration section
- Note L0-Sparse converges to the L1 solution (numerically identical)
- Remove evaluative adjectives (compelling, instructive, fundamentally)
- Convert Limitations bold pseudo-headers to running prose
- Rephrase Conclusion to avoid near-verbatim Abstract overlap
- Remove CT-GAN/TVAE from README and docs comparison tables
- Remove "Joint correlations" and "Synthesize billions" from docs/intro.md
- Fix docs/_config.yml: micro → microplex
- Delete stale BENCHMARKS_SUMMARY.md (Dec 2024 claims, wrong repo paths)
- Add responses to dev deps; cvxpy as optional dependency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
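The corrected QRF sampling description (a discrete quantile grid rather than a continuous uniform draw) can be sketched as follows. `QUANTILE_GRID` and `predict_quantiles` are illustrative stand-ins, not the repo's actual names: the point is only that each record draws one of five fixed quantile levels instead of u ~ U(0.1, 0.9).

```python
import numpy as np

# Assumed 5-level grid; the paper's exact levels may differ.
QUANTILE_GRID = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

def sample_from_quantile_grid(predict_quantiles, X, rng):
    """predict_quantiles(X_rows, q) -> predicted conditional q-quantile
    per row (e.g. a fitted quantile regression forest's predict)."""
    n = len(X)
    # Each record picks one grid level uniformly at random...
    q = rng.choice(QUANTILE_GRID, size=n)
    # ...and takes the model's predicted quantile at that level.
    return np.array([predict_quantiles(X[i:i + 1], q[i])[0] for i in range(n)])
```

Because the draw is restricted to the grid, every sampled value is one of at most five conditional quantiles per record, which is what distinguishes this from continuous inverse-CDF sampling.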
Fix critical bug in HardConcreteCalibrator: initial survey weights (mean ~6800) produced constraint violations of 5000x+ after normalization, causing gradient descent to fail. The fix rescales the initial weights so A_norm @ w_init ≈ b_norm before optimization.

Results: HardConcrete achieves 8.5% mean error with 93.7% sparsity (315/5000 records), outperforming IPF (12.6%), entropy (11.9%), and SparseCalibrator (12.6%) on mean error while using far fewer records.

Also fixes the benchmark to use deterministic targets (age_group + is_male + weight only) for reproducibility, removing auto-discovery of extra columns that changed results between runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
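The initialization fix amounts to rescaling the survey weights so the normalized constraint system is approximately satisfied before gradient descent starts. A minimal sketch with illustrative names (the calibrator's internals may differ), using the least-squares scalar that best maps A_norm @ w_init onto b_norm:

```python
import numpy as np

def rescale_init(w_init, A_norm, b_norm):
    """Return c * w_init where c minimizes ||c * (A_norm @ w_init) - b_norm||^2.

    Survey weights with a large mean (e.g. ~6800) would otherwise start
    thousands of times away from the normalized targets, swamping the loss.
    """
    Aw = A_norm @ w_init
    c = float(Aw @ b_norm) / float(Aw @ Aw)
    return c * w_init
```

With a single constraint the rescaling is exact; with several it is the best scalar fit, which is enough to keep the initial violations on the same scale as the targets.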
Report out-of-sample generalization: calibrate on age + weight (6 targets), evaluate on held-out sex (2 targets). The table now shows train and test error columns instead of a combined mean.

Key finding: HardConcrete matches dense methods on held-out targets (25.4% test error) with 94% sparsity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
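The train/test split over calibration targets can be sketched as follows; the matrices and names here are illustrative (calibrate against the train targets only, then report mean relative error on both target sets):

```python
import numpy as np

def target_errors(w, A_train, b_train, A_test, b_test):
    """Mean relative error on calibrated (train) and held-out (test) targets.

    w is the calibrated weight vector; each (A, b) pair encodes linear
    constraint totals, as in the paper's (A, b) calibration notation.
    """
    train_err = np.abs(A_train @ w - b_train) / np.abs(b_train)
    test_err = np.abs(A_test @ w - b_test) / np.abs(b_test)
    return float(train_err.mean()), float(test_err.mean())
```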
Sweep regularization parameters for SparseCalibrator and HardConcrete, plotting active records vs held-out test error. Key finding: sparser solutions generalize better. HardConcrete with <10 records achieves ~5% test error vs ~18% for dense methods using all 5,000 records.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
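The sweep itself is a simple loop over regularization strengths, recording the number of surviving records and the held-out error at each point. A sketch with assumed callables (`fit` returns a calibrated weight vector for a given lambda; `eval_test_error` scores it on held-out targets), not the repo's API:

```python
import numpy as np

def sweep(fit, lambdas, eval_test_error, tol=1e-6):
    """Trace the accuracy-sparsity frontier over a lambda grid."""
    frontier = []
    for lam in lambdas:
        w = fit(lam)                            # calibrated weight vector
        active = int((np.abs(w) > tol).sum())   # records with nonzero weight
        frontier.append((lam, active, eval_test_error(w)))
    return frontier
```

Plotting `active` against the test error column of this frontier is what produces the active-records-vs-error curves described above.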
L1-Sparse and L0-Sparse (hard equality constraints; 5 records, 77% test error) are the extreme endpoints of their respective families. SparseCalibrator (L1) and HardConcrete (L0) trace parameterized frontiers that dominate the hard-constraint solutions at every sparsity level.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run HardConcrete at 5 seeds per lambda to get reliable error bars. Result: HC variance explodes at high sparsity (32-46% mean error with ±10-14% SE below 100 records). SparseCalibrator (convex, deterministic) dominates at every sparsity level, reaching ~9% test error at 31 records.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
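The per-lambda aggregation behind the error bars is the usual mean plus standard error of the mean across seeds. A minimal sketch (function name is illustrative):

```python
import numpy as np

def mean_and_se(errors_by_seed):
    """Mean error across seeds and its standard error (sample SD / sqrt(n))."""
    e = np.asarray(errors_by_seed, dtype=float)
    return float(e.mean()), float(e.std(ddof=1) / np.sqrt(len(e)))
```

The deterministic SparseCalibrator needs no such bars (its error is identical across seeds), which is part of why it reads as dominating the stochastic HardConcrete runs at high sparsity.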
Summary
- `prdc` library for consistency
- `micro` -> `microplex` references across documentation

Key reweighting finding
SparseCalibrator dominates the entire accuracy-sparsity frontier:
Test plan
- Fix `from micro` imports in docs

🤖 Generated with Claude Code