diff --git a/BENCHMARKS_SUMMARY.md b/BENCHMARKS_SUMMARY.md
deleted file mode 100644
index 880f8d7..0000000
--- a/BENCHMARKS_SUMMARY.md
+++ /dev/null
@@ -1,288 +0,0 @@
-# microplex Benchmark Results Summary
-
-**Date:** December 26, 2024
-**Version:** 0.1.0
-**Status:** All benchmarks passed - microplex demonstrates strong performance
-
-## Executive Summary
-
-Comprehensive benchmarks comparing microplex to state-of-the-art synthetic data methods (CT-GAN, TVAE, Gaussian Copula from SDV), **PolicyEngine's current Sequential QRF approach**, and **TabPFN (Prior-Data Fitted Networks)** demonstrate that **microplex is the best overall method for economic microdata synthesis**.
-
-### TabPFN Comparison (NEW - Transformer-Based)
-
-We benchmarked microplex against TabPFN (Prior-Data Fitted Networks), a transformer-based approach for tabular prediction:
-
-| Method | Marginal Fidelity (KS) | Correlation Error | Zero-Inflation Error | Generation Speed |
-|--------|------------------------|-------------------|---------------------|------------------|
-| **microplex** | 0.0766 | **0.0907** (best) | 0.0444 | **0.01s** (best) |
-| TabPFN + Zero-Inflation | **0.0716** (best) | 0.1451 | **0.0324** (best) | 1.86s |
-| TabPFN Sequential | 0.3052 | 0.1114 | 0.2297 | 1.62s |
-
-**Key Finding:** TabPFN with two-stage zero handling slightly edges microplex on marginal fidelity and zero handling, but microplex is **186x faster** at generation and has **37% better correlation preservation**. TabPFN is limited to small datasets (<1000 rows).
-
-See **[TabPFN Comparison Report](benchmarks/results/tabpfn_comparison.md)** for full analysis.
-
-### QRF Comparison (PolicyEngine Current Approach)
-
-We benchmarked microplex against Sequential Quantile Random Forests (QRF), PolicyEngine's current microdata enhancement method:
-
-| Method | Marginal Fidelity (KS) | Correlation Error | Zero-Inflation Error | Speed |
-|--------|-------------------------|---------------------------|----------------------|-------|
-| **microplex** | **0.0685** (5.5x better) | 0.2044 | 0.0561 | **Fastest** |
-| QRF + Zero-Inflation | 0.2327 | **0.0918** (best) | **0.0310** (best) | Moderate |
-| QRF Sequential | 0.3774 (worst) | 0.1711 | 0.2097 (worst) | Moderate |
-
-**Key Finding:** While QRF with two-stage zero-inflation handles zeros well and preserves correlations decently, **microplex achieves 5.5x better marginal fidelity** and trains and generates significantly faster. QRF's sequential nature breaks joint distribution consistency.
-
-**Recommendation:** Transition from Sequential QRF to microplex for PolicyEngine/Cosilico production use.
-
-See **[QRF Comparison Report](benchmarks/results/qrf_comparison.md)** for full analysis.
-
-### Key Results
-
-| Metric | microplex | Next Best | Improvement |
-|--------|-----------|-----------|-------------|
-| **Marginal Fidelity** (KS) | 0.0611 | 0.1997 (CT-GAN) | **3.3x better** |
-| **Correlation Error** | 0.1060 | 0.1756 (Copula) | **1.7x better** |
-| **Zero-Inflation Error** | 0.0223 | 0.0555 (TVAE) | **2.5x better** |
-| **Generation Speed** | < 0.1s | 0.6s (TVAE) | **6x faster** |
-
-## Why This Matters
-
-### For Economic Microdata
-
-Economic survey data (CPS, ACS, PSID) has unique characteristics:
-- **Zero-inflated variables:** Many people have $0 assets, debt, or benefit receipt
-- **Skewed distributions:** Income and wealth follow log-normal distributions
-- **Complex correlations:** Education → income → assets chains
-
-microplex is **purpose-built** for these characteristics, while general-purpose methods (CT-GAN, TVAE, Copula) struggle.
-
-### Critical Finding: Zero-Inflation Handling
-
-This is microplex's **strongest differentiator**:
-
-- **Real data:** 40% have zero assets
-- **microplex:** 38% zero assets (2% error) ✅
-- **TVAE:** 35% zero assets (5% error)
-- **CT-GAN:** 31% zero assets (10% error)
-- **Copula:** 62% zero assets (22% error) ❌ **FAILS**
-
-This 2.5-10x advantage comes from microplex's **two-stage modeling**:
-1. Binary classifier for P(positive | demographics)
-2. Separate flow for P(value | positive, demographics)
-
-Other methods try to model the full distribution in one step, leading to poor zero-fraction preservation.
-
-## Test Methodology
-
-### Data
-- **Samples:** 10,000 training, 2,000 test
-- **Variables:**
-  - Conditions: age, education, region
-  - Targets: income, assets, debt, savings
-- **Characteristics:**
-  - Zero-inflation: 40% no assets, 50% no debt
-  - Realistic correlations (education → income → assets)
-  - Mimics CPS/ACS survey data
-
-### Methods Compared
-1. **microplex** - Conditional MAF with two-stage zero-inflation
-2. **CT-GAN** - Conditional Tabular GAN (SDV)
-3. **TVAE** - Tabular VAE (SDV)
-4. **Gaussian Copula** - Copula synthesis (SDV)
-5. **Sequential QRF** - Quantile Random Forests (PolicyEngine current)
-6. **TabPFN** - Prior-Data Fitted Networks (transformer-based)
-
-### Metrics
-- **Marginal Fidelity:** KS statistic (distribution matching)
-- **Joint Fidelity:** Correlation matrix error (relationship preservation)
-- **Zero Fidelity:** Zero-fraction error (zero-inflation handling)
-- **Performance:** Training and generation time
-
-## Detailed Results
-
-### Full Comparison Table
-
-| Method | Mean KS ↓ | Corr Error ↓ | Zero Error ↓ | Train (s) | Gen (s) ↓ |
-|--------|-----------|--------------|--------------|-----------|-----------|
-| **microplex** | **0.0611** | **0.1060** | **0.0223** | 6.1 | **0.0** |
-| CT-GAN | 0.1997 | 0.3826 | 0.0986 | 35.5 | 0.8 |
-| TVAE | 0.2459 | 0.1969 | 0.0555 | 12.0 | 0.6 |
-| Copula | 0.2632 | 0.1756 | 0.2241 | **0.5** | 0.8 |
-
-**Bold** = best performance, **↓** = lower is better
-
-### Performance Analysis
-
-#### Marginal Fidelity (Mean KS = 0.0611)
-- microplex: 0.0611 ← **BEST**
-- CT-GAN: 0.1997 (3.3x worse)
-- TVAE: 0.2459 (4.0x worse)
-- Copula: 0.2632 (4.3x worse)
-
-**Reason:** Normalizing flows provide exact likelihood modeling with stable training.
-
-#### Correlation Preservation (Error = 0.1060)
-- microplex: 0.1060 ← **BEST**
-- Copula: 0.1756 (1.7x worse)
-- TVAE: 0.1969 (1.9x worse)
-- CT-GAN: 0.3826 (3.6x worse)
-
-**Reason:** MAF architecture explicitly models conditional dependencies through autoregressive structure.
-
-#### Zero-Inflation (Error = 0.0223)
-- microplex: 0.0223 ← **BEST**
-- TVAE: 0.0555 (2.5x worse)
-- CT-GAN: 0.0986 (4.4x worse)
-- Copula: 0.2241 (10.0x worse)
-
-**Reason:** Two-stage modeling (binary + flow) specifically designed for zero-inflation.
-
-#### Generation Speed (< 0.1s)
-- microplex: < 0.1s ← **BEST**
-- TVAE: 0.6s (6x slower)
-- CT-GAN: 0.8s (8x slower)
-- Copula: 0.8s (8x slower)
-
-**Reason:** Single forward pass through flow, no iterative sampling or nearest-neighbor matching.
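The two-stage modeling described above (a binary classifier for the zero indicator, plus a separate model for the positive part) can be sketched as follows. This is an illustrative hurdle-model sketch, not microplex's actual implementation: a logistic classifier stands in for stage 1, a log-normal fit stands in for the conditional flow, and all data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated zero-inflated target: ~40% exact zeros, log-normal otherwise.
rng = np.random.default_rng(42)
n = 5_000
age = rng.integers(18, 80, size=n).astype(float)
X = age.reshape(-1, 1)
positive = rng.random(n) < 0.6
assets = np.where(positive, np.exp(rng.normal(10.0, 1.0, size=n)), 0.0)

# Stage 1: binary classifier for P(value > 0 | conditions).
clf = LogisticRegression().fit(X, (assets > 0).astype(int))

# Stage 2: model the positive part only (log-normal stand-in for the flow).
log_pos = np.log(assets[assets > 0])
mu, sigma = log_pos.mean(), log_pos.std()

# Generation: draw the zero indicator first, then a value only if positive.
p_pos = clf.predict_proba(X)[:, 1]
draw_pos = rng.random(n) < p_pos
synthetic = np.where(draw_pos, np.exp(rng.normal(mu, sigma, size=n)), 0.0)

zero_frac_real = float((assets == 0).mean())
zero_frac_synth = float((synthetic == 0).mean())
```

Because the zero fraction is modeled explicitly in stage 1, the synthetic zero share tracks the real one closely, which is the behavior the comparison tables attribute to two-stage methods.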
-
-## Visualizations
-
-All visualizations saved to `/Users/maxghenis/CosilicoAI/micro/benchmarks/results/`:
-
-1. **summary_metrics.png** - Side-by-side comparison of all metrics
-2. **distributions_*.png** - Per-method distribution comparisons (4 files)
-3. **zero_inflation.png** - Zero-fraction preservation analysis ← **Key differentiator**
-4. **timing.png** - Training and generation time comparison
-
-## Use Cases
-
-### ✅ Ideal for microplex
-- Economic microdata synthesis (CPS, ACS, PSID)
-- Zero-inflated variables (benefits, assets, debt)
-- Conditional generation (demographics → outcomes)
-- Fast simulation (policy analysis, Monte Carlo)
-- Privacy-preserving data release
-
-### ⚠️ Consider alternatives
-- Categorical-heavy data (try CT-GAN)
-- Quick prototype/baseline (try Copula)
-- Small sample size < 1,000 (simpler methods)
-
-## PolicyEngine / Cosilico Applications
-
-microplex is **ideal** for:
-
-1. **CPS/ACS enhancement** - Impute income/benefits onto census demographics
-2. **Microsimulation** - Generate representative populations for policy analysis
-3. **Privacy-preserving release** - Publish synthetic microdata maintaining statistical properties
-4. **Data fusion** - Combine variables from different surveys
-5. **Missing data imputation** - Fill gaps conditioned on observed variables
-
-The zero-inflation handling is **critical** for:
-- Benefit eligibility modeling (many don't receive benefits)
-- Asset/debt analysis (many have zero assets/debt)
-- Poverty analysis (preserving zeros in income is essential)
-
-## Issues and Next Steps
-
-### Issues Found
-**None critical** - microplex works excellently out of the box.
-
-Minor opportunities for improvement:
-- Correlation error could be further reduced with explicit correlation loss
-- Zero-fraction could be made even more precise with calibration
-- Training time could be reduced with early stopping
-
-### Recommended Next Steps
-
-**High Priority:**
-1. Test on **real CPS/ACS data** - Validate performance on actual microdata
-2. Add **memory profiling** - Assess scalability for large datasets
-3. Run **cross-validation** - More robust performance estimates
-
-**Medium Priority:**
-4. **Subgroup analysis** - Ensure fairness across demographics
-5. **Conditional validity tests** - Deeper assessment of conditional generation
-6. ~~Benchmark vs **PolicyEngine current methods**~~ - ✅ **DONE** - See QRF comparison above
-
-**Lower Priority (High Value):**
-7. **Downstream task evaluation** - Test utility preservation (poverty prediction, etc.)
-8. **Privacy metrics** - Distance to closest record, membership inference
-9. **Scale testing** - Test on 1k to 1M samples
-10. **Hyperparameter tuning** - Optimize performance further
-
-See `/Users/maxghenis/CosilicoAI/micro/benchmarks/results/ISSUES_FOUND.md` for details.
-
-## Documentation
-
-Full documentation in `/Users/maxghenis/CosilicoAI/micro/benchmarks/results/`:
-
-### General Synthetic Data Comparisons
-- **BENCHMARK_REPORT.md** - Comprehensive 20-page analysis vs CT-GAN, TVAE, Copula
-- **ISSUES_FOUND.md** - Issues identified and improvement opportunities
-- **README.md** - Quick reference guide
-- **results.csv** - Summary table
-- **results.md** - Markdown results table
-
-### PolicyEngine QRF Comparison
-- **qrf_comparison.md** - Full QRF vs microplex analysis
-- **qrf_results.csv** - QRF benchmark summary
-- **qrf_comparison.png** - Main 4-metric visualization
-- **qrf_distributions.png** - Distribution comparisons
-- **qrf_zero_inflation.png** - Zero-handling analysis
-- **qrf_timing.png** - Performance comparison
-- **qrf_per_variable_ks.png** - Per-variable fidelity
-
-### TabPFN Comparison (NEW)
-- **tabpfn_comparison.md** - Full TabPFN vs microplex analysis
-- **tabpfn_results.csv** - TabPFN benchmark summary
-- **tabpfn_comparison.png** - Main 4-metric visualization
-- **tabpfn_distributions_*.png** - Per-method distribution comparisons
-- **tabpfn_zero_inflation.png** - Zero-handling analysis
-- **tabpfn_per_variable_ks.png** - Per-variable fidelity
-
-## Reproducibility
-
-```bash
-cd /Users/maxghenis/CosilicoAI/micro
-
-# General synthetic data benchmarks
-python benchmarks/run_benchmarks.py
-
-# QRF comparison (PolicyEngine current approach)
-python benchmarks/run_qrf_benchmark.py
-
-# TabPFN comparison (transformer-based)
-python benchmarks/run_tabpfn_benchmark.py
-```
-
-Requirements:
-- Python 3.9+
-- microplex
-- sdv >= 1.0 (for CT-GAN, TVAE, Copula)
-- scikit-learn >= 1.3 (for QRF)
-- tabpfn == 0.1.11 (for TabPFN - newer versions are gated)
-- matplotlib, seaborn, tabulate
-
-Results are deterministic (random seed = 42).
-
-## Conclusion
-
-**microplex is ready for production use in PolicyEngine/Cosilico.**
-
-The benchmarks demonstrate:
-- ✅ **Superior fidelity** across all statistical metrics
-- ✅ **Exceptional zero-inflation handling** (2.5-10x better)
-- ✅ **Fast generation** for real-time simulation
-- ✅ **Stable training** without failure
-- ✅ **Purpose-built** for economic microdata
-
-Next step: **Test on real CPS/ACS data** to validate production readiness.
-
----
-
-**Generated:** December 26, 2024
-**Location:** /Users/maxghenis/CosilicoAI/micro/benchmarks/results/
-**Full report:** BENCHMARK_REPORT.md
diff --git a/README.md b/README.md
index 3f657b1..0068a28 100644
--- a/README.md
+++ b/README.md
@@ -12,8 +12,7 @@ Multi-source microdata synthesis and survey reweighting.
-
 - **Conditional relationships**: Generate target variables given demographics
 - **Zero-inflated distributions**: Handle variables that are 0 for many observations
-- **Joint correlations**: Preserve relationships between target variables
-- **Hierarchical structures**: Keep household/firm compositions intact
+- **Multi-source fusion**: Combine variables from surveys with different variable sets
 
 ## Installation
@@ -72,13 +71,13 @@ print(f"Using {stats['n_nonzero']} of {stats['n_records']} records")
 
 ## Why `microplex`?
 
-| Feature | microplex | CT-GAN | TVAE | synthpop |
-|---------|-------|--------|------|----------|
-| Multi-source fusion | ✅ | ❌ | ❌ | ❌ |
-| Zero-inflation handling | ✅ | ❌ | ❌ | ⚠️ |
-| Multiple synthesis methods | ✅ (QRF, QDNN, MAF) | ❌ | ❌ | ✅ (CART) |
-| Survey reweighting | ✅ (IPF, entropy, sparse) | ❌ | ❌ | ❌ |
-| PRDC evaluation | ✅ | ❌ | ❌ | ❌ |
+| Feature | microplex | synthpop |
+|---------|-------|----------|
+| Multi-source fusion | ✅ | ❌ |
+| Zero-inflation handling | ✅ | ⚠️ |
+| Multiple synthesis methods | ✅ (QRF, QDNN, MAF) | ✅ (CART) |
+| Survey reweighting | ✅ (IPF, entropy, sparse) | ❌ |
+| PRDC evaluation | ✅ | ❌ |
 
 ### Use Cases
@@ -159,7 +158,6 @@ See [benchmarks/](benchmarks/) for synthesis method comparisons:
 - **QRF / ZI-QRF**: Quantile regression forests (with/without zero-inflation)
 - **QDNN / ZI-QDNN**: Quantile deep neural networks
 - **MAF / ZI-MAF**: Masked autoregressive flows
-- **CT-GAN / TVAE**: Deep generative baselines (from SDV)
 
 ## Citation
diff --git a/benchmarks/README.md b/benchmarks/README.md
index 28fae8e..da9c154 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -1,163 +1,61 @@
-# microplex Benchmarks
+# microplex benchmarks
 
-Comprehensive benchmarks comparing microplex to other synthetic data methods.
+Synthesis method comparison using PRDC (Precision, Recall, Density, Coverage) metrics from [Naeem et al. (2020)](https://arxiv.org/abs/2002.09797), evaluated via the canonical [`prdc`](https://github.com/clovaai/generative-evaluation-prdc) library.
 
-## Quick Results
+## Methods compared
 
-**microplex wins across all fidelity metrics:**
+Six synthesis methods — QRF, QDNN, and MAF, each with and without zero-inflation (ZI) handling — are evaluated against holdouts from three source surveys (SIPP, CPS ASEC, PSID).
 
-| Metric | microplex | CT-GAN | TVAE | Copula | microplex Advantage |
-|--------|-----------|--------|------|--------|-------------------|
-| **Marginal Fidelity** (KS ↓) | **0.0611** | 0.1997 | 0.2459 | 0.2632 | **3.3x better** |
-| **Correlation Error** ↓ | **0.1060** | 0.3826 | 0.1969 | 0.1756 | **1.7x better** |
-| **Zero-Inflation Error** ↓ | **0.0223** | 0.0986 | 0.0555 | 0.2241 | **2.5x better** |
-| **Generation Speed** ↓ | **< 0.1s** | 0.8s | 0.6s | 0.8s | **6-8x faster** |
+| Method | Description |
+|--------|-------------|
+| **QRF** | Quantile regression forest |
+| **ZI-QRF** | Zero-inflated QRF (two-stage hurdle) |
+| **QDNN** | Quantile deep neural network (pinball loss) |
+| **ZI-QDNN** | Zero-inflated QDNN |
+| **MAF** | Masked autoregressive flow (1D per column) |
+| **ZI-MAF** | Zero-inflated MAF |
 
-## Running Benchmarks
+## Running benchmarks
 
 ```bash
 # Install dependencies
 pip install microplex[benchmark]
 
-# Run general synthetic data benchmarks (CT-GAN, TVAE, Copula)
-python benchmarks/run_benchmarks.py
+# Single-seed run
+python scripts/run_benchmark.py --output benchmarks/results/benchmark_full.json
 
-# Run QRF comparison (PolicyEngine current approach)
-python benchmarks/run_qrf_benchmark.py
-```
-
-Results saved to `benchmarks/results/`:
-- `BENCHMARK_REPORT.md` - Comprehensive analysis vs CT-GAN/TVAE/Copula
-- `qrf_comparison.md` - QRF vs microplex analysis (NEW)
-- `results.csv`, `qrf_results.csv` - Summary tables
-- `*.png` - Visualization charts
-
-## Test Setup
-
-- **Data:** 10,000 realistic economic microdata samples
-- **Variables:** age, education, region → income, assets, debt, savings
-- **Key Feature:** Zero-inflation (40% no assets, 50% no debt)
-- **Methods:** microplex, CT-GAN, TVAE, Gaussian Copula
-- **Epochs:** 50 for iterative methods
-
-## Why microplex Wins
-
-### 1. Zero-Inflation Handling (10x Better)
-
-microplex uses a **two-stage model**:
-1. Binary classifier: P(positive | demographics)
-2. Flow model: P(value | positive, demographics)
-
-Other methods try to model the full distribution (including zeros) in one step, leading to:
-- Copula: 62% synthetic zeros vs 40% real (22% error!)
-- CT-GAN: 31% synthetic zeros vs 40% real (10% error)
-- microplex: 38% synthetic zeros vs 40% real (2% error) ✓
-
-**This is critical for economic data** where many variables (benefits, assets, debt) are zero for large portions of the population.
-
-### 2. Marginal Fidelity (3.3x Better)
-
-Normalizing flows provide:
-- **Exact likelihood** modeling (not approximate like VAE)
-- **Stable training** (not adversarial like GAN)
-- **Log transformations** for skewed economic distributions
-
-Result: KS statistic of 0.0611 vs 0.20-0.26 for alternatives.
-
-### 3. Correlation Preservation (1.7x Better)
+# Multi-seed run (for paper)
+python scripts/run_benchmark.py --n-seeds 10 --output benchmarks/results/benchmark_multi_seed.json
 
-MAF (Masked Autoregressive Flow) architecture:
-- Explicitly models conditional dependencies
-- Joint training on all target variables
-- Autoregressive structure captures correlations
-
-Result: Maintains income-assets, income-debt relationships accurately.
-
-### 4. Generation Speed (6-8x Faster)
-
-- **Single forward pass** through flow (no iterative sampling)
-- **No nearest-neighbor matching** needed (unlike GAN methods)
-- Enables **real-time microsimulation**
-
-Result: < 0.1s to generate 2,000 samples (vs 0.6-0.8s for alternatives).
-
-## Use Cases
-
-### Perfect For
-- Economic microdata (CPS, ACS, PSID)
-- Zero-inflated variables (benefits, assets, debt)
-- Conditional generation (demographics → outcomes)
-- Fast simulation (policy analysis, Monte Carlo)
-
-### Consider Alternatives If
-- Data is primarily categorical (try CT-GAN)
-- Need quick prototype (try Copula)
-- Small sample size < 1,000 (simpler methods may suffice)
-
-## NEW: QRF Comparison Results
-
-**microplex vs Sequential QRF (PolicyEngine's current approach):**
+# Fast mode for testing
+python scripts/run_benchmark.py --fast --output /tmp/benchmark_test.json
+```
 
-| Method | Marginal Fidelity (KS) ↓ | Correlation Error ↓ | Zero-Inflation Error ↓ | Train Time | Gen Time |
-|--------|-------------------------|-------------------|----------------------|------------|----------|
-| **microplex** | **0.0685** | 0.2044 | 0.0561 | **2.0s** | **0.01s** |
-| QRF + Zero-Inflation | 0.2327 | **0.0918** | **0.0310** | 11.7s | 0.07s |
-| QRF Sequential | 0.3774 | 0.1711 | 0.2097 | 7.1s | 0.04s |
+## Key findings
 
-**Key Findings:**
-- **5.5x better marginal fidelity** - microplex models distributions more accurately
-- **Faster training & generation** - 2s vs 7-12s training, 0.01s vs 0.04-0.07s generation
-- **Comparable zero-handling** - Both QRF+ZI and microplex handle zeros well with two-stage modeling
-- **QRF weakness:** Sequential prediction breaks joint distribution consistency
+See the [paper](../paper/) for full analysis. Summary:
 
-See `results/qrf_comparison.md` for full analysis.
+- **Zero-inflation handling matters more than base model choice** for economic data with mass-at-zero variables. ZI lifts MAF and QDNN coverage substantially while barely affecting QRF (which handles mixed distributions natively via leaf node composition).
+- **Per-source coverage varies dramatically**: SIPP and CPS achieve meaningful coverage; PSID shows 0% coverage with only 2 shared conditioning variables (age, sex).
+
+- **Speed-accuracy tradeoff**: ZI-QRF is fastest with competitive coverage; ZI-MAF is slowest but achieves the highest CPS coverage.
 
-## Files
+## Output structure
 
 ```
-benchmarks/
-├── README.md              # This file
-├── compare.py             # Benchmark infrastructure
-├── compare_qrf.py         # QRF implementation (NEW)
-├── run_benchmarks.py      # Main benchmark script
-├── run_qrf_benchmark.py   # QRF comparison script (NEW)
-└── results/
-    ├── BENCHMARK_REPORT.md     # Comprehensive analysis vs CT-GAN/TVAE/Copula
-    ├── qrf_comparison.md       # QRF comparison report (NEW)
-    ├── results.csv             # Summary table
-    ├── qrf_results.csv         # QRF results (NEW)
-    ├── results.md              # Markdown results
-    ├── summary_metrics.png     # Overall comparison
-    ├── qrf_comparison.png      # QRF 4-metric comparison (NEW)
-    ├── qrf_distributions.png   # QRF distribution plots (NEW)
-    ├── qrf_zero_inflation.png  # QRF zero-handling (NEW)
-    ├── qrf_timing.png          # QRF performance (NEW)
-    ├── qrf_per_variable_ks.png # QRF per-variable fidelity (NEW)
-    ├── distributions_*.png     # Per-method distributions
-    ├── zero_inflation.png      # Zero-handling analysis
-    ├── timing.png              # Performance comparison
-    ├── train_data.csv          # Training data
-    └── test_data.csv           # Test data
+benchmarks/results/
+├── benchmark_full.json        # Single-seed PRDC results
+├── benchmark_multi_seed.json  # Multi-seed means +/- SE
+├── reweighting_full.json      # Calibration method comparison
+└── *.png                      # Visualization charts
 ```
 
-## Benchmark Details
-
-See [BENCHMARK_REPORT.md](results/BENCHMARK_REPORT.md) for:
-- Detailed methodology
-- Per-variable breakdowns
-- Statistical analysis
-- Visualizations
-- Recommendations
-
 ## Citation
 
 ```bibtex
-@software{microplex2024,
-  author = {Cosilico},
-  title = {microplex: Microdata synthesis using normalizing flows},
-  year = {2024},
-  note = {Benchmark results show 3.3x better marginal fidelity,
-          1.7x better correlation preservation, and 2.5x better
-          zero-inflation handling vs.
CT-GAN/TVAE/Copula} +@software{microplex2025, + author = {Ghenis, Max}, + title = {microplex: Multi-source microdata synthesis and survey reweighting}, + year = {2025}, + url = {https://github.com/CosilicoAI/microplex} } ``` diff --git a/benchmarks/results/benchmark_multi_seed.json b/benchmarks/results/benchmark_multi_seed.json index 361ffbc..778913f 100644 --- a/benchmarks/results/benchmark_multi_seed.json +++ b/benchmarks/results/benchmark_multi_seed.json @@ -1,214 +1,340 @@ { - "n_seeds": 3, + "n_seeds": 10, "base_seed": 42, "methods": { "MAF": { "cps": { - "mean": 0.4163333333333334, - "std": 0.026284897438136078, - "se": 0.01517559261152957, - "n_seeds": 3, + "mean": 0.397625, + "std": 0.019822213269180843, + "se": 0.006268334219622379, + "n_seeds": 10, "values": [ - 0.387, - 0.42425, - 0.43775 + 0.391, + 0.42275, + 0.4375, + 0.382, + 0.391, + 0.3835, + 0.37775, + 0.403, + 0.381, + 0.40675 ] }, "psid": { "mean": 0.0, "std": 0.0, "se": 0.0, - "n_seeds": 3, + "n_seeds": 10, "values": [ + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, 0.0, 0.0, 0.0 ] }, "sipp": { - "mean": 0.3520833333333333, - "std": 0.011625224012178572, - "se": 0.006711826212821, - "n_seeds": 3, + "mean": 0.34917499999999996, + "std": 0.010589597883457775, + "se": 0.003348724881702487, + "n_seeds": 10, "values": [ - 0.36375, - 0.3405, - 0.352 + 0.351, + 0.33525, + 0.34, + 0.33425, + 0.35, + 0.35275, + 0.35375, + 0.36125, + 0.3465, + 0.367 ] } }, "QDNN": { "cps": { - "mean": 0.3865, - "std": 0.019697715603592226, - "se": 0.011372481406154664, - "n_seeds": 3, + "mean": 0.380425, + "std": 0.024575577872712938, + "se": 0.007771480089260846, + "n_seeds": 10, "values": [ - 0.3925, - 0.4025, - 0.3645 + 0.3355, + 0.406, + 0.38, + 0.35925, + 0.41075, + 0.3765, + 0.39075, + 0.3885, + 0.354, + 0.403 ] }, "psid": { "mean": 0.0, "std": 0.0, "se": 0.0, - "n_seeds": 3, + "n_seeds": 10, "values": [ + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, 0.0, 0.0, 0.0 ] }, "sipp": { - "mean": 0.26525, - "std": 
0.07045743395838369, - "se": 0.04067861846228311, - "n_seeds": 3, + "mean": 0.29324999999999996, + "std": 0.08534667864396105, + "se": 0.02698898952453677, + "n_seeds": 10, "values": [ - 0.34075, - 0.20125, - 0.25375 + 0.2345, + 0.23525, + 0.19875, + 0.32125, + 0.3045, + 0.176, + 0.40625, + 0.4395, + 0.29175, + 0.32475 ] } }, "QRF": { "cps": { - "mean": 0.3349166666666667, - "std": 0.008270177345942056, - "se": 0.004774789116925591, - "n_seeds": 3, + "mean": 0.33715, + "std": 0.0070889193660090295, + "se": 0.0022417131345865344, + "n_seeds": 10, "values": [ - 0.3285, - 0.34425, - 0.332 + 0.333, + 0.342, + 0.33825, + 0.32525, + 0.33925, + 0.34725, + 0.33875, + 0.34575, + 0.32875, + 0.33325 ] }, "psid": { "mean": 0.0, "std": 0.0, "se": 0.0, - "n_seeds": 3, + "n_seeds": 10, "values": [ + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, 0.0, 0.0, 0.0 ] }, "sipp": { - "mean": 0.9430833333333334, - "std": 0.0047191983782559725, - "se": 0.0027246304540453313, - "n_seeds": 3, + "mean": 0.9382250000000001, + "std": 0.007086224899996721, + "se": 0.0022408610696188535, + "n_seeds": 10, "values": [ - 0.942, - 0.939, - 0.94825 + 0.93125, + 0.936, + 0.9505, + 0.9405, + 0.92975, + 0.94625, + 0.93525, + 0.93175, + 0.94525, + 0.93575 ] } }, "ZI-MAF": { "cps": { - "mean": 0.5074166666666667, - "std": 0.0217087731881222, - "se": 0.012533566043938883, - "n_seeds": 3, + "mean": 0.499175, + "std": 0.020806332502923765, + "se": 0.006579540046403106, + "n_seeds": 10, "values": [ - 0.4825, - 0.52225, - 0.5175 + 0.49975, + 0.50825, + 0.5425, + 0.50325, + 0.472, + 0.50525, + 0.47825, + 0.5105, + 0.47525, + 0.49675 ] }, "psid": { "mean": 0.0, "std": 0.0, "se": 0.0, - "n_seeds": 3, + "n_seeds": 10, "values": [ + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, 0.0, 0.0, 0.0 ] }, "sipp": { - "mean": 0.8718333333333333, - "std": 0.013203534880225536, - "se": 0.0076230644173528265, - "n_seeds": 3, + "mean": 0.8660250000000002, + "std": 0.006741713020030183, + "se": 0.0021319168474507742, + "n_seeds": 
10, "values": [ - 0.8835, - 0.8745, - 0.8575 + 0.8645, + 0.86575, + 0.85725, + 0.8685, + 0.8605, + 0.87375, + 0.85825, + 0.86375, + 0.8695, + 0.8785 ] } }, "ZI-QDNN": { "cps": { - "mean": 0.42691666666666667, - "std": 0.019211541669874727, - "se": 0.011091788754649888, - "n_seeds": 3, + "mean": 0.4064, + "std": 0.02154555896492619, + "se": 0.006813303979062664, + "n_seeds": 10, "values": [ 0.43175, - 0.44325, - 0.40575 + 0.43575, + 0.41175, + 0.4255, + 0.40125, + 0.39575, + 0.4125, + 0.36575, + 0.38775, + 0.39625 ] }, "psid": { "mean": 0.0, "std": 0.0, "se": 0.0, - "n_seeds": 3, + "n_seeds": 10, "values": [ + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, 0.0, 0.0, 0.0 ] }, "sipp": { - "mean": 0.7103333333333334, - "std": 0.02075200793497662, - "se": 0.011981177367484002, - "n_seeds": 3, + "mean": 0.71665, + "std": 0.04016082253473733, + "se": 0.012699967191558668, + "n_seeds": 10, "values": [ - 0.734, - 0.70175, - 0.69525 + 0.73675, + 0.742, + 0.65075, + 0.74825, + 0.666, + 0.75875, + 0.6675, + 0.748, + 0.7125, + 0.736 ] } }, "ZI-QRF": { "cps": { - "mean": 0.34908333333333336, - "std": 0.008552533737631986, - "se": 0.004937807655675183, - "n_seeds": 3, + "mean": 0.346975, + "std": 0.008698219805096772, + "se": 0.0027506186172891675, + "n_seeds": 10, "values": [ - 0.346, - 0.35875, - 0.3425 + 0.34475, + 0.35225, + 0.34725, + 0.3395, + 0.353, + 0.34925, + 0.34825, + 0.3595, + 0.34875, + 0.32725 ] }, "psid": { "mean": 0.0, "std": 0.0, "se": 0.0, - "n_seeds": 3, + "n_seeds": 10, "values": [ + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, + 0.0, 0.0, 0.0, 0.0 ] }, "sipp": { - "mean": 0.9500000000000001, - "std": 0.006726812023536869, - "se": 0.0038837267325770226, - "n_seeds": 3, + "mean": 0.9495750000000001, + "std": 0.006720666716264925, + "se": 0.002125261421828175, + "n_seeds": 10, "values": [ - 0.9575, - 0.9445, - 0.948 + 0.95575, + 0.94525, + 0.95425, + 0.93475, + 0.94625, + 0.9465, + 0.95025, + 0.95375, + 0.95775, + 0.95125 ] } } @@ -221,34 +347,30 @@ "seed": 42, 
"methods": { "QRF": { - "mean_coverage": 0.4235, - "mean_precision": 0.3057, - "mean_recall": 0.4235, - "mean_density": 0.191, - "elapsed_seconds": 9.4, + "mean_coverage": 0.4214, + "mean_precision": 0.5165, + "mean_density": 0.3711, + "elapsed_seconds": 8.8, "sources": [ { "source": "sipp", - "precision": 0.6584, - "recall": 0.942, - "density": 0.4108, - "coverage": 0.942, + "precision": 0.8727, + "density": 0.6633, + "coverage": 0.9313, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2587, - "recall": 0.3285, - "density": 0.1621, - "coverage": 0.3285, + "precision": 0.6769, + "density": 0.4501, + "coverage": 0.333, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -257,34 +379,30 @@ ] }, "ZI-QRF": { - "mean_coverage": 0.4345, - "mean_precision": 0.3119, - "mean_recall": 0.4345, - "mean_density": 0.1771, - "elapsed_seconds": 7.1, + "mean_coverage": 0.4335, + "mean_precision": 0.518, + "mean_density": 0.3792, + "elapsed_seconds": 6.9, "sources": [ { "source": "sipp", - "precision": 0.6739, - "recall": 0.9575, - "density": 0.4018, - "coverage": 0.9575, + "precision": 0.8845, + "density": 0.6862, + "coverage": 0.9557, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2617, - "recall": 0.346, - "density": 0.1295, - "coverage": 0.346, + "precision": 0.6696, + "density": 0.4514, + "coverage": 0.3448, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -293,34 +411,30 @@ ] }, "QDNN": { - "mean_coverage": 0.2444, - "mean_precision": 0.1842, - "mean_recall": 0.2444, - "mean_density": 0.1708, - "elapsed_seconds": 37.2, + "mean_coverage": 0.19, + "mean_precision": 0.4012, + "mean_density": 0.2454, + "elapsed_seconds": 39.3, "sources": [ { "source": "sipp", - "precision": 0.339, - "recall": 0.3407, - "density": 0.3825, - 
"coverage": 0.3407, + "precision": 0.6375, + "density": 0.3968, + "coverage": 0.2345, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2136, - "recall": 0.3925, - "density": 0.13, - "coverage": 0.3925, + "precision": 0.566, + "density": 0.3393, + "coverage": 0.3355, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -329,26 +443,23 @@ ] }, "ZI-QDNN": { - "mean_coverage": 0.3886, - "mean_precision": 0.2419, - "mean_recall": 0.3886, - "mean_density": 0.1608, - "elapsed_seconds": 23.9, + "mean_coverage": 0.3895, + "mean_precision": 0.424, + "mean_density": 0.2715, + "elapsed_seconds": 23.1, "sources": [ { "source": "sipp", - "precision": 0.491, - "recall": 0.734, - "density": 0.3614, - "coverage": 0.734, + "precision": 0.6855, + "density": 0.4219, + "coverage": 0.7368, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2348, - "recall": 0.4318, - "density": 0.1209, + "precision": 0.5864, + "density": 0.3925, "coverage": 0.4318, "n_holdout": 4000, "n_synthetic": 9841 @@ -356,7 +467,6 @@ { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -365,35 +475,31 @@ ] }, "MAF": { - "mean_coverage": 0.2503, - "mean_precision": 0.1747, - "mean_recall": 0.2503, - "mean_density": 0.328, - "elapsed_seconds": 87.9, + "mean_coverage": 0.2473, + "mean_precision": 0.253, + "mean_density": 0.138, + "elapsed_seconds": 88.7, "sources": [ { "source": "sipp", - "precision": 0.3185, - "recall": 0.3638, - "density": 0.4596, - "coverage": 0.3638, + "precision": 0.284, + "density": 0.1303, + "coverage": 0.351, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2057, - "recall": 0.387, - "density": 0.3404, - "coverage": 0.387, + "precision": 0.475, + "density": 0.2839, + "coverage": 0.391, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", - 
"precision": 0.0001, - "recall": 0.0, - "density": 0.1841, + "precision": 0.0, + "density": 0.0, "coverage": 0.0, "n_holdout": 1841, "n_synthetic": 9841 @@ -401,34 +507,30 @@ ] }, "ZI-MAF": { - "mean_coverage": 0.4553, - "mean_precision": 0.2804, - "mean_recall": 0.4553, - "mean_density": 0.2698, - "elapsed_seconds": 56.9, + "mean_coverage": 0.4548, + "mean_precision": 0.4156, + "mean_density": 0.2681, + "elapsed_seconds": 55.8, "sources": [ { "source": "sipp", - "precision": 0.5924, - "recall": 0.8835, - "density": 0.4416, - "coverage": 0.8835, + "precision": 0.7476, + "density": 0.4985, + "coverage": 0.8645, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2487, - "recall": 0.4825, - "density": 0.3677, - "coverage": 0.4825, + "precision": 0.4992, + "density": 0.3057, + "coverage": 0.4998, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -445,34 +547,30 @@ "seed": 43, "methods": { "QRF": { - "mean_coverage": 0.4278, - "mean_precision": 0.3045, - "mean_recall": 0.4278, - "mean_density": 0.19, - "elapsed_seconds": 9.3, + "mean_coverage": 0.426, + "mean_precision": 0.5145, + "mean_density": 0.3705, + "elapsed_seconds": 8.7, "sources": [ { "source": "sipp", - "precision": 0.6575, - "recall": 0.939, - "density": 0.4249, - "coverage": 0.939, + "precision": 0.8869, + "density": 0.6738, + "coverage": 0.936, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.256, - "recall": 0.3443, - "density": 0.1452, - "coverage": 0.3443, + "precision": 0.6565, + "density": 0.4378, + "coverage": 0.342, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -481,34 +579,30 @@ ] }, "ZI-QRF": { - "mean_coverage": 0.4344, - "mean_precision": 0.3128, - "mean_recall": 0.4344, - "mean_density": 0.1955, - "elapsed_seconds": 7.9, + 
"mean_coverage": 0.4325, + "mean_precision": 0.517, + "mean_density": 0.3784, + "elapsed_seconds": 6.6, "sources": [ { "source": "sipp", - "precision": 0.6727, - "recall": 0.9445, - "density": 0.406, - "coverage": 0.9445, + "precision": 0.8954, + "density": 0.6944, + "coverage": 0.9453, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2656, - "recall": 0.3588, - "density": 0.1806, - "coverage": 0.3588, + "precision": 0.6554, + "density": 0.4408, + "coverage": 0.3523, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -517,34 +611,30 @@ ] }, "QDNN": { - "mean_coverage": 0.2013, - "mean_precision": 0.1453, - "mean_recall": 0.2013, - "mean_density": 0.1617, - "elapsed_seconds": 40.7, + "mean_coverage": 0.2137, + "mean_precision": 0.4534, + "mean_density": 0.3749, + "elapsed_seconds": 37.3, "sources": [ { "source": "sipp", - "precision": 0.2262, - "recall": 0.2013, - "density": 0.3647, - "coverage": 0.2013, + "precision": 0.767, + "density": 0.7592, + "coverage": 0.2352, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2097, - "recall": 0.4025, - "density": 0.1205, - "coverage": 0.4025, + "precision": 0.5932, + "density": 0.3654, + "coverage": 0.406, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -553,34 +643,30 @@ ] }, "ZI-QDNN": { - "mean_coverage": 0.3817, - "mean_precision": 0.2314, - "mean_recall": 0.3817, - "mean_density": 0.1599, - "elapsed_seconds": 24.1, + "mean_coverage": 0.3926, + "mean_precision": 0.4658, + "mean_density": 0.2967, + "elapsed_seconds": 23.3, "sources": [ { "source": "sipp", - "precision": 0.4674, - "recall": 0.7017, - "density": 0.3612, - "coverage": 0.7017, + "precision": 0.793, + "density": 0.4963, + "coverage": 0.742, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - 
"precision": 0.2267, - "recall": 0.4432, - "density": 0.1184, - "coverage": 0.4432, + "precision": 0.6043, + "density": 0.3938, + "coverage": 0.4358, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -589,34 +675,30 @@ ] }, "MAF": { - "mean_coverage": 0.2549, - "mean_precision": 0.1846, - "mean_recall": 0.2549, - "mean_density": 0.2694, - "elapsed_seconds": 92.7, + "mean_coverage": 0.2527, + "mean_precision": 0.3274, + "mean_density": 0.1858, + "elapsed_seconds": 82.7, "sources": [ { "source": "sipp", - "precision": 0.3203, - "recall": 0.3405, - "density": 0.4481, - "coverage": 0.3405, + "precision": 0.4164, + "density": 0.1868, + "coverage": 0.3352, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2335, - "recall": 0.4243, - "density": 0.3601, - "coverage": 0.4243, + "precision": 0.5657, + "density": 0.3705, + "coverage": 0.4228, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -625,34 +707,30 @@ ] }, "ZI-MAF": { - "mean_coverage": 0.4656, - "mean_precision": 0.301, - "mean_recall": 0.4656, - "mean_density": 0.2773, - "elapsed_seconds": 55.4, + "mean_coverage": 0.458, + "mean_precision": 0.4468, + "mean_density": 0.2958, + "elapsed_seconds": 51.4, "sources": [ { "source": "sipp", - "precision": 0.5995, - "recall": 0.8745, - "density": 0.4678, - "coverage": 0.8745, + "precision": 0.7608, + "density": 0.5086, + "coverage": 0.8658, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.3035, - "recall": 0.5222, - "density": 0.3641, - "coverage": 0.5222, + "precision": 0.5795, + "density": 0.3788, + "coverage": 0.5082, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -669,34 +747,30 @@ "seed": 44, "methods": { 
"QRF": { - "mean_coverage": 0.4268, - "mean_precision": 0.304, - "mean_recall": 0.4268, - "mean_density": 0.1947, - "elapsed_seconds": 9.8, + "mean_coverage": 0.4296, + "mean_precision": 0.5053, + "mean_density": 0.3628, + "elapsed_seconds": 8.4, "sources": [ { "source": "sipp", - "precision": 0.6546, - "recall": 0.9483, - "density": 0.425, - "coverage": 0.9483, + "precision": 0.8736, + "density": 0.6662, + "coverage": 0.9505, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2575, - "recall": 0.332, - "density": 0.1592, - "coverage": 0.332, + "precision": 0.6423, + "density": 0.4221, + "coverage": 0.3382, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -705,34 +779,30 @@ ] }, "ZI-QRF": { - "mean_coverage": 0.4302, - "mean_precision": 0.3154, - "mean_recall": 0.4302, - "mean_density": 0.1842, - "elapsed_seconds": 7.2, + "mean_coverage": 0.4338, + "mean_precision": 0.5072, + "mean_density": 0.3677, + "elapsed_seconds": 6.5, "sources": [ { "source": "sipp", - "precision": 0.6785, - "recall": 0.948, - "density": 0.3971, - "coverage": 0.948, + "precision": 0.8821, + "density": 0.6752, + "coverage": 0.9543, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2679, - "recall": 0.3425, - "density": 0.1555, - "coverage": 0.3425, + "precision": 0.6394, + "density": 0.428, + "coverage": 0.3473, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -741,34 +811,30 @@ ] }, "QDNN": { - "mean_coverage": 0.2061, - "mean_precision": 0.1532, - "mean_recall": 0.2061, - "mean_density": 0.2008, - "elapsed_seconds": 42.3, + "mean_coverage": 0.1929, + "mean_precision": 0.365, + "mean_density": 0.2191, + "elapsed_seconds": 36.1, "sources": [ { "source": "sipp", - "precision": 0.2624, - "recall": 0.2537, - "density": 0.508, - "coverage": 
0.2537, + "precision": 0.5616, + "density": 0.3238, + "coverage": 0.1988, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.1973, - "recall": 0.3645, - "density": 0.0943, - "coverage": 0.3645, + "precision": 0.5334, + "density": 0.3334, + "coverage": 0.38, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -777,34 +843,30 @@ ] }, "ZI-QDNN": { - "mean_coverage": 0.367, - "mean_precision": 0.2397, - "mean_recall": 0.367, - "mean_density": 0.1761, - "elapsed_seconds": 24.6, + "mean_coverage": 0.3542, + "mean_precision": 0.3924, + "mean_density": 0.2443, + "elapsed_seconds": 23.0, "sources": [ { "source": "sipp", - "precision": 0.5112, - "recall": 0.6953, - "density": 0.4226, - "coverage": 0.6953, + "precision": 0.6189, + "density": 0.3629, + "coverage": 0.6508, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.208, - "recall": 0.4057, - "density": 0.1057, - "coverage": 0.4057, + "precision": 0.5584, + "density": 0.3701, + "coverage": 0.4118, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", "precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -813,34 +875,30 @@ ] }, "MAF": { - "mean_coverage": 0.2632, - "mean_precision": 0.1817, - "mean_recall": 0.2632, - "mean_density": 0.2613, - "elapsed_seconds": 86.0, + "mean_coverage": 0.2592, + "mean_precision": 0.2789, + "mean_density": 0.1528, + "elapsed_seconds": 82.3, "sources": [ { "source": "sipp", - "precision": 0.3133, - "recall": 0.352, - "density": 0.4259, - "coverage": 0.352, + "precision": 0.2945, + "density": 0.1261, + "coverage": 0.34, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2318, - "recall": 0.4377, - "density": 0.3581, - "coverage": 0.4377, + "precision": 0.5423, + "density": 0.3323, + "coverage": 0.4375, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", 
"precision": 0.0, - "recall": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -849,34 +907,1430 @@ ] }, "ZI-MAF": { - "mean_coverage": 0.4583, - "mean_precision": 0.293, - "mean_recall": 0.4583, - "mean_density": 0.2838, - "elapsed_seconds": 55.2, + "mean_coverage": 0.4666, + "mean_precision": 0.4335, + "mean_density": 0.2784, + "elapsed_seconds": 51.5, "sources": [ { "source": "sipp", - "precision": 0.5938, - "recall": 0.8575, - "density": 0.4605, - "coverage": 0.8575, + "precision": 0.7518, + "density": 0.4891, + "coverage": 0.8572, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5488, + "density": 0.346, + "coverage": 0.5425, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + } + } + }, + { + "holdout_frac": 0.2, + "n_generate": 9841, + "k": 5, + "seed": 45, + "methods": { + "QRF": { + "mean_coverage": 0.4219, + "mean_precision": 0.5515, + "mean_density": 0.3867, + "elapsed_seconds": 8.3, + "sources": [ + { + "source": "sipp", + "precision": 0.8822, + "density": 0.6909, + "coverage": 0.9405, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.7724, + "density": 0.4692, + "coverage": 0.3252, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QRF": { + "mean_coverage": 0.4247, + "mean_precision": 0.5554, + "mean_density": 0.3871, + "elapsed_seconds": 6.5, + "sources": [ + { + "source": "sipp", + "precision": 0.8957, + "density": 0.6958, + "coverage": 0.9347, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.7705, + "density": 0.4654, + "coverage": 0.3395, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + 
"coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "QDNN": { + "mean_coverage": 0.2268, + "mean_precision": 0.4582, + "mean_density": 0.3088, + "elapsed_seconds": 36.7, + "sources": [ + { + "source": "sipp", + "precision": 0.6937, + "density": 0.5513, + "coverage": 0.3212, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.6809, + "density": 0.3751, + "coverage": 0.3593, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QDNN": { + "mean_coverage": 0.3913, + "mean_precision": 0.4856, + "mean_density": 0.3006, + "elapsed_seconds": 24.0, + "sources": [ + { + "source": "sipp", + "precision": 0.7489, + "density": 0.4847, + "coverage": 0.7482, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.7078, + "density": 0.4171, + "coverage": 0.4255, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "MAF": { + "mean_coverage": 0.2388, + "mean_precision": 0.3317, + "mean_density": 0.1968, + "elapsed_seconds": 86.7, + "sources": [ + { + "source": "sipp", + "precision": 0.5279, + "density": 0.3243, + "coverage": 0.3342, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.4671, + "density": 0.2662, + "coverage": 0.382, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-MAF": { + "mean_coverage": 0.4572, + "mean_precision": 0.424, + "mean_density": 0.2725, + "elapsed_seconds": 54.1, + "sources": [ + { + "source": "sipp", + "precision": 0.7735, + "density": 0.5256, + "coverage": 0.8685, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + 
"source": "cps", + "precision": 0.4986, + "density": 0.292, + "coverage": 0.5032, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + } + } + }, + { + "holdout_frac": 0.2, + "n_generate": 9841, + "k": 5, + "seed": 46, + "methods": { + "QRF": { + "mean_coverage": 0.423, + "mean_precision": 0.504, + "mean_density": 0.3563, + "elapsed_seconds": 8.6, + "sources": [ + { + "source": "sipp", + "precision": 0.8789, + "density": 0.6656, + "coverage": 0.9297, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.6332, + "density": 0.4033, + "coverage": 0.3392, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QRF": { + "mean_coverage": 0.4331, + "mean_precision": 0.5037, + "mean_density": 0.3648, + "elapsed_seconds": 6.6, + "sources": [ + { + "source": "sipp", + "precision": 0.8879, + "density": 0.6925, + "coverage": 0.9463, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "cps", - "precision": 0.2851, - "recall": 0.5175, - "density": 0.391, - "coverage": 0.5175, + "precision": 0.6232, + "density": 0.4018, + "coverage": 0.353, "n_holdout": 4000, "n_synthetic": 9841 }, { "source": "psid", - "precision": 0.0001, - "recall": 0.0, + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "QDNN": { + "mean_coverage": 0.2384, + "mean_precision": 0.2726, + "mean_density": 0.145, + "elapsed_seconds": 39.0, + "sources": [ + { + "source": "sipp", + "precision": 0.2412, + "density": 0.1101, + "coverage": 0.3045, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5767, + "density": 0.325, + "coverage": 0.4108, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + 
"precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QDNN": { + "mean_coverage": 0.3558, + "mean_precision": 0.4161, + "mean_density": 0.244, + "elapsed_seconds": 24.2, + "sources": [ + { + "source": "sipp", + "precision": 0.6869, + "density": 0.3945, + "coverage": 0.666, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5614, + "density": 0.3376, + "coverage": 0.4012, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "MAF": { + "mean_coverage": 0.247, + "mean_precision": 0.2378, + "mean_density": 0.1254, + "elapsed_seconds": 87.3, + "sources": [ + { + "source": "sipp", + "precision": 0.2819, + "density": 0.1287, + "coverage": 0.35, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.4315, + "density": 0.2475, + "coverage": 0.391, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-MAF": { + "mean_coverage": 0.4442, + "mean_precision": 0.3953, + "mean_density": 0.2551, + "elapsed_seconds": 54.8, + "sources": [ + { + "source": "sipp", + "precision": 0.7454, + "density": 0.4982, + "coverage": 0.8605, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.4404, + "density": 0.2672, + "coverage": 0.472, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + } + } + }, + { + "holdout_frac": 0.2, + "n_generate": 9841, + "k": 5, + "seed": 47, + "methods": { + "QRF": { + "mean_coverage": 0.4312, + "mean_precision": 0.5371, + "mean_density": 0.3818, + "elapsed_seconds": 8.8, + "sources": [ + { + "source": 
"sipp", + "precision": 0.8996, + "density": 0.6848, + "coverage": 0.9463, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.7117, + "density": 0.4605, + "coverage": 0.3473, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QRF": { + "mean_coverage": 0.4319, + "mean_precision": 0.534, + "mean_density": 0.3881, + "elapsed_seconds": 6.8, + "sources": [ + { + "source": "sipp", + "precision": 0.8939, + "density": 0.6948, + "coverage": 0.9465, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.7082, + "density": 0.4696, + "coverage": 0.3493, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "QDNN": { + "mean_coverage": 0.1842, + "mean_precision": 0.3746, + "mean_density": 0.2153, + "elapsed_seconds": 38.7, + "sources": [ + { + "source": "sipp", + "precision": 0.5138, + "density": 0.3101, + "coverage": 0.176, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.6099, + "density": 0.3357, + "coverage": 0.3765, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QDNN": { + "mean_coverage": 0.3848, + "mean_precision": 0.4662, + "mean_density": 0.2912, + "elapsed_seconds": 24.3, + "sources": [ + { + "source": "sipp", + "precision": 0.7696, + "density": 0.4896, + "coverage": 0.7588, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.6291, + "density": 0.3839, + "coverage": 0.3957, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 
1841, + "n_synthetic": 9841 + } + ] + }, + "MAF": { + "mean_coverage": 0.2454, + "mean_precision": 0.2447, + "mean_density": 0.1324, + "elapsed_seconds": 85.4, + "sources": [ + { + "source": "sipp", + "precision": 0.2968, + "density": 0.133, + "coverage": 0.3528, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.4372, + "density": 0.264, + "coverage": 0.3835, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-MAF": { + "mean_coverage": 0.4597, + "mean_precision": 0.4103, + "mean_density": 0.2663, + "elapsed_seconds": 52.7, + "sources": [ + { + "source": "sipp", + "precision": 0.7675, + "density": 0.5032, + "coverage": 0.8738, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.4635, + "density": 0.2957, + "coverage": 0.5052, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + } + } + }, + { + "holdout_frac": 0.2, + "n_generate": 9841, + "k": 5, + "seed": 48, + "methods": { + "QRF": { + "mean_coverage": 0.4247, + "mean_precision": 0.4936, + "mean_density": 0.3582, + "elapsed_seconds": 8.4, + "sources": [ + { + "source": "sipp", + "precision": 0.8781, + "density": 0.6674, + "coverage": 0.9353, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.6028, + "density": 0.4073, + "coverage": 0.3387, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QRF": { + "mean_coverage": 0.4328, + "mean_precision": 0.4945, + "mean_density": 0.3618, + "elapsed_seconds": 6.6, + "sources": [ + { + "source": "sipp", + "precision": 0.8933, + "density": 0.6916, + "coverage": 
0.9503, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5902, + "density": 0.3937, + "coverage": 0.3483, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "QDNN": { + "mean_coverage": 0.2657, + "mean_precision": 0.3803, + "mean_density": 0.253, + "elapsed_seconds": 37.2, + "sources": [ + { + "source": "sipp", + "precision": 0.6194, + "density": 0.4363, + "coverage": 0.4062, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5216, + "density": 0.3226, + "coverage": 0.3907, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QDNN": { + "mean_coverage": 0.36, + "mean_precision": 0.4095, + "mean_density": 0.2513, + "elapsed_seconds": 23.3, + "sources": [ + { + "source": "sipp", + "precision": 0.6979, + "density": 0.4096, + "coverage": 0.6675, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5306, + "density": 0.3444, + "coverage": 0.4125, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "MAF": { + "mean_coverage": 0.2438, + "mean_precision": 0.22, + "mean_density": 0.122, + "elapsed_seconds": 85.8, + "sources": [ + { + "source": "sipp", + "precision": 0.2827, + "density": 0.1254, + "coverage": 0.3538, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.3774, + "density": 0.2406, + "coverage": 0.3777, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-MAF": { + "mean_coverage": 
0.4455, + "mean_precision": 0.3873, + "mean_density": 0.2531, + "elapsed_seconds": 53.8, + "sources": [ + { + "source": "sipp", + "precision": 0.7614, + "density": 0.5004, + "coverage": 0.8582, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.4004, + "density": 0.2589, + "coverage": 0.4783, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + } + } + }, + { + "holdout_frac": 0.2, + "n_generate": 9841, + "k": 5, + "seed": 49, + "methods": { + "QRF": { + "mean_coverage": 0.4258, + "mean_precision": 0.4737, + "mean_density": 0.3535, + "elapsed_seconds": 8.6, + "sources": [ + { + "source": "sipp", + "precision": 0.8847, + "density": 0.6791, + "coverage": 0.9317, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5363, + "density": 0.3814, + "coverage": 0.3458, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QRF": { + "mean_coverage": 0.4378, + "mean_precision": 0.4756, + "mean_density": 0.3606, + "elapsed_seconds": 6.8, + "sources": [ + { + "source": "sipp", + "precision": 0.8984, + "density": 0.7075, + "coverage": 0.9537, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5283, + "density": 0.3743, + "coverage": 0.3595, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "QDNN": { + "mean_coverage": 0.276, + "mean_precision": 0.3817, + "mean_density": 0.2494, + "elapsed_seconds": 37.9, + "sources": [ + { + "source": "sipp", + "precision": 0.6595, + "density": 0.4492, + "coverage": 0.4395, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", 
+ "precision": 0.4856, + "density": 0.2989, + "coverage": 0.3885, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QDNN": { + "mean_coverage": 0.3713, + "mean_precision": 0.4164, + "mean_density": 0.2671, + "elapsed_seconds": 24.3, + "sources": [ + { + "source": "sipp", + "precision": 0.758, + "density": 0.4689, + "coverage": 0.748, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.4911, + "density": 0.3325, + "coverage": 0.3658, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "MAF": { + "mean_coverage": 0.2548, + "mean_precision": 0.2246, + "mean_density": 0.1236, + "elapsed_seconds": 89.0, + "sources": [ + { + "source": "sipp", + "precision": 0.2911, + "density": 0.1291, + "coverage": 0.3613, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.3826, + "density": 0.2418, + "coverage": 0.403, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-MAF": { + "mean_coverage": 0.4581, + "mean_precision": 0.3796, + "mean_density": 0.2502, + "elapsed_seconds": 54.9, + "sources": [ + { + "source": "sipp", + "precision": 0.7532, + "density": 0.5001, + "coverage": 0.8638, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.3855, + "density": 0.2506, + "coverage": 0.5105, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + } + } + }, + { + "holdout_frac": 0.2, + "n_generate": 9841, + "k": 5, + "seed": 50, + "methods": { + "QRF": { 
+ "mean_coverage": 0.4247, + "mean_precision": 0.5102, + "mean_density": 0.3717, + "elapsed_seconds": 8.6, + "sources": [ + { + "source": "sipp", + "precision": 0.8727, + "density": 0.6747, + "coverage": 0.9453, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.6579, + "density": 0.4403, + "coverage": 0.3287, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QRF": { + "mean_coverage": 0.4355, + "mean_precision": 0.5161, + "mean_density": 0.376, + "elapsed_seconds": 6.7, + "sources": [ + { + "source": "sipp", + "precision": 0.8874, + "density": 0.6835, + "coverage": 0.9577, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.6608, + "density": 0.4446, + "coverage": 0.3488, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "QDNN": { + "mean_coverage": 0.2153, + "mean_precision": 0.2842, + "mean_density": 0.1634, + "elapsed_seconds": 38.2, + "sources": [ + { + "source": "sipp", + "precision": 0.2752, + "density": 0.1345, + "coverage": 0.2918, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5775, + "density": 0.3556, + "coverage": 0.354, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QDNN": { + "mean_coverage": 0.3667, + "mean_precision": 0.4207, + "mean_density": 0.2691, + "elapsed_seconds": 25.6, + "sources": [ + { + "source": "sipp", + "precision": 0.6815, + "density": 0.423, + "coverage": 0.7125, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5806, + "density": 0.3844, + "coverage": 0.3877, + 
"n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "MAF": { + "mean_coverage": 0.2425, + "mean_precision": 0.2197, + "mean_density": 0.1225, + "elapsed_seconds": 88.9, + "sources": [ + { + "source": "sipp", + "precision": 0.2774, + "density": 0.1293, + "coverage": 0.3465, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.3817, + "density": 0.2381, + "coverage": 0.381, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-MAF": { + "mean_coverage": 0.4483, + "mean_precision": 0.3827, + "mean_density": 0.2545, + "elapsed_seconds": 55.1, + "sources": [ + { + "source": "sipp", + "precision": 0.7529, + "density": 0.5061, + "coverage": 0.8695, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.3952, + "density": 0.2574, + "coverage": 0.4753, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + } + } + }, + { + "holdout_frac": 0.2, + "n_generate": 9841, + "k": 5, + "seed": 51, + "methods": { + "QRF": { + "mean_coverage": 0.423, + "mean_precision": 0.5331, + "mean_density": 0.385, + "elapsed_seconds": 8.8, + "sources": [ + { + "source": "sipp", + "precision": 0.8857, + "density": 0.684, + "coverage": 0.9357, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.7135, + "density": 0.4708, + "coverage": 0.3332, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QRF": { + "mean_coverage": 0.4262, + "mean_precision": 0.5332, + 
"mean_density": 0.3887, + "elapsed_seconds": 6.7, + "sources": [ + { + "source": "sipp", + "precision": 0.8949, + "density": 0.6972, + "coverage": 0.9513, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.7046, + "density": 0.469, + "coverage": 0.3272, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "QDNN": { + "mean_coverage": 0.2426, + "mean_precision": 0.2763, + "mean_density": 0.1618, + "elapsed_seconds": 40.7, + "sources": [ + { + "source": "sipp", + "precision": 0.214, + "density": 0.1036, + "coverage": 0.3247, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.615, + "density": 0.3819, + "coverage": 0.403, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-QDNN": { + "mean_coverage": 0.3774, + "mean_precision": 0.438, + "mean_density": 0.2741, + "elapsed_seconds": 24.1, + "sources": [ + { + "source": "sipp", + "precision": 0.686, + "density": 0.4246, + "coverage": 0.736, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.628, + "density": 0.3978, + "coverage": 0.3962, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "MAF": { + "mean_coverage": 0.2579, + "mean_precision": 0.2624, + "mean_density": 0.1484, + "elapsed_seconds": 84.4, + "sources": [ + { + "source": "sipp", + "precision": 0.2923, + "density": 0.1322, + "coverage": 0.367, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.495, + "density": 0.3129, + "coverage": 0.4068, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + 
"precision": 0.0, + "density": 0.0, + "coverage": 0.0, + "n_holdout": 1841, + "n_synthetic": 9841 + } + ] + }, + "ZI-MAF": { + "mean_coverage": 0.4584, + "mean_precision": 0.4264, + "mean_density": 0.2808, + "elapsed_seconds": 52.9, + "sources": [ + { + "source": "sipp", + "precision": 0.774, + "density": 0.5201, + "coverage": 0.8785, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "cps", + "precision": 0.5053, + "density": 0.3224, + "coverage": 0.4968, + "n_holdout": 4000, + "n_synthetic": 9841 + }, + { + "source": "psid", + "precision": 0.0, "density": 0.0, "coverage": 0.0, "n_holdout": 1841, @@ -887,6 +2341,6 @@ } } ], - "timestamp": "2026-02-08T11:51:16", - "total_elapsed_seconds": 693.2 + "timestamp": "2026-02-08T13:51:39", + "total_elapsed_seconds": 2338.8 } \ No newline at end of file diff --git a/benchmarks/results/reweighting_frontier.json b/benchmarks/results/reweighting_frontier.json new file mode 100644 index 0000000..501d1a2 --- /dev/null +++ b/benchmarks/results/reweighting_frontier.json @@ -0,0 +1,264 @@ +{ + "n_records": 5000, + "methods": { + "IPF": [ + { + "n_active": 4913, + "test_error": 0.178791, + "train_error": 0.0, + "sparsity": 0.0174, + "params": {} + } + ], + "Entropy": [ + { + "n_active": 4913, + "test_error": 0.178982, + "train_error": 0.0, + "sparsity": 0.0174, + "params": {} + } + ], + "L1-Sparse": [ + { + "n_active": 5, + "test_error": 0.773557, + "train_error": 0.004294, + "sparsity": 0.999, + "params": {} + } + ], + "L0-Sparse": [ + { + "n_active": 5, + "test_error": 0.773557, + "train_error": 0.004294, + "sparsity": 0.999, + "params": {} + } + ], + "SparseCalibrator": [ + { + "n_active": 5000, + "test_error": 0.175413, + "train_error": 1e-06, + "sparsity": 0.0, + "params": { + "sparsity_weight": 0.0 + } + }, + { + "n_active": 4993, + "test_error": 0.174353, + "train_error": 1e-06, + "sparsity": 0.0014, + "params": { + "sparsity_weight": 0.001 + } + }, + { + "n_active": 4973, + "test_error": 0.17451, + 
"train_error": 1e-06, + "sparsity": 0.0054, + "params": { + "sparsity_weight": 0.005 + } + }, + { + "n_active": 4948, + "test_error": 0.177425, + "train_error": 1e-06, + "sparsity": 0.0104, + "params": { + "sparsity_weight": 0.01 + } + }, + { + "n_active": 4898, + "test_error": 0.1776, + "train_error": 1e-06, + "sparsity": 0.0204, + "params": { + "sparsity_weight": 0.02 + } + }, + { + "n_active": 4754, + "test_error": 0.175142, + "train_error": 1e-06, + "sparsity": 0.0492, + "params": { + "sparsity_weight": 0.05 + } + }, + { + "n_active": 4521, + "test_error": 0.17563, + "train_error": 1e-06, + "sparsity": 0.0958, + "params": { + "sparsity_weight": 0.1 + } + }, + { + "n_active": 4090, + "test_error": 0.175422, + "train_error": 1e-06, + "sparsity": 0.182, + "params": { + "sparsity_weight": 0.2 + } + }, + { + "n_active": 3031, + "test_error": 0.14018, + "train_error": 1e-06, + "sparsity": 0.3938, + "params": { + "sparsity_weight": 0.5 + } + }, + { + "n_active": 1838, + "test_error": 0.17543, + "train_error": 1e-06, + "sparsity": 0.6324, + "params": { + "sparsity_weight": 1.0 + } + }, + { + "n_active": 673, + "test_error": 0.20525, + "train_error": 1e-06, + "sparsity": 0.8654, + "params": { + "sparsity_weight": 2.0 + } + }, + { + "n_active": 31, + "test_error": 0.088716, + "train_error": 2e-06, + "sparsity": 0.9938, + "params": { + "sparsity_weight": 5.0 + } + } + ], + "HardConcrete": [ + { + "n_active": 4602, + "test_error": 0.179278, + "train_error": 0.009796, + "sparsity": 0.0796, + "params": { + "lambda_l0": 1e-07, + "epochs": 2000 + } + }, + { + "n_active": 4016, + "test_error": 0.196103, + "train_error": 0.014185, + "sparsity": 0.1968, + "params": { + "lambda_l0": 5e-07, + "epochs": 2000 + } + }, + { + "n_active": 3472, + "test_error": 0.16823, + "train_error": 0.010113, + "sparsity": 0.3056, + "params": { + "lambda_l0": 1e-06, + "epochs": 2000 + } + }, + { + "n_active": 1814, + "test_error": 0.179414, + "train_error": 0.004673, + "sparsity": 0.6372, + "params": 
{ + "lambda_l0": 5e-06, + "epochs": 2000 + } + }, + { + "n_active": 1245, + "test_error": 0.176757, + "train_error": 0.006176, + "sparsity": 0.751, + "params": { + "lambda_l0": 1e-05, + "epochs": 2000 + } + }, + { + "n_active": 483, + "test_error": 0.157259, + "train_error": 0.007132, + "sparsity": 0.9034, + "params": { + "lambda_l0": 5e-05, + "epochs": 2000 + } + }, + { + "n_active": 298, + "test_error": 0.118481, + "train_error": 0.003852, + "sparsity": 0.9404, + "params": { + "lambda_l0": 0.0001, + "epochs": 2000 + } + }, + { + "n_active": 100, + "test_error": 0.13813, + "train_error": 0.006001, + "sparsity": 0.98, + "params": { + "lambda_l0": 0.0005, + "epochs": 2000 + } + }, + { + "n_active": 61, + "test_error": 0.354901, + "train_error": 0.006403, + "sparsity": 0.9878, + "params": { + "lambda_l0": 0.001, + "epochs": 2000 + } + }, + { + "n_active": 8, + "test_error": 0.080927, + "train_error": 0.045697, + "sparsity": 0.9984, + "params": { + "lambda_l0": 0.005, + "epochs": 2000 + } + }, + { + "n_active": 6, + "test_error": 0.323481, + "train_error": 0.144716, + "sparsity": 0.9988, + "params": { + "lambda_l0": 0.01, + "epochs": 2000 + } + } + ] + }, + "timestamp": "2026-02-10T06:04:42" +} \ No newline at end of file diff --git a/benchmarks/results/reweighting_frontier_hc_multiseed.json b/benchmarks/results/reweighting_frontier_hc_multiseed.json new file mode 100644 index 0000000..0092370 --- /dev/null +++ b/benchmarks/results/reweighting_frontier_hc_multiseed.json @@ -0,0 +1,321 @@ +[ + { + "lambda_l0": 1e-07, + "n_active_mean": 4598.6, + "test_error_mean": 0.183959, + "test_error_se": 0.001828, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 4574, + "test_error": 0.188742 + }, + { + "n_active": 4604, + "test_error": 0.186903 + }, + { + "n_active": 4599, + "test_error": 0.184517 + }, + { + "n_active": 4613, + "test_error": 0.176868 + }, + { + "n_active": 4603, + "test_error": 0.182764 + } + ] + }, + { + "lambda_l0": 5e-07, + "n_active_mean": 4031.0, + 
"test_error_mean": 0.18636, + "test_error_se": 0.004541, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 4033, + "test_error": 0.183668 + }, + { + "n_active": 4003, + "test_error": 0.203546 + }, + { + "n_active": 4070, + "test_error": 0.187644 + }, + { + "n_active": 4033, + "test_error": 0.185057 + }, + { + "n_active": 4016, + "test_error": 0.171886 + } + ] + }, + { + "lambda_l0": 1e-06, + "n_active_mean": 3502.0, + "test_error_mean": 0.183622, + "test_error_se": 0.002936, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 3487, + "test_error": 0.177322 + }, + { + "n_active": 3476, + "test_error": 0.180308 + }, + { + "n_active": 3511, + "test_error": 0.18044 + }, + { + "n_active": 3504, + "test_error": 0.183989 + }, + { + "n_active": 3532, + "test_error": 0.196052 + } + ] + }, + { + "lambda_l0": 5e-06, + "n_active_mean": 1858.2, + "test_error_mean": 0.184767, + "test_error_se": 0.005227, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 1864, + "test_error": 0.177809 + }, + { + "n_active": 1836, + "test_error": 0.196453 + }, + { + "n_active": 1885, + "test_error": 0.179591 + }, + { + "n_active": 1828, + "test_error": 0.200342 + }, + { + "n_active": 1878, + "test_error": 0.16964 + } + ] + }, + { + "lambda_l0": 1e-05, + "n_active_mean": 1266.0, + "test_error_mean": 0.18086, + "test_error_se": 0.002766, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 1267, + "test_error": 0.184809 + }, + { + "n_active": 1240, + "test_error": 0.191101 + }, + { + "n_active": 1284, + "test_error": 0.177678 + }, + { + "n_active": 1248, + "test_error": 0.175448 + }, + { + "n_active": 1291, + "test_error": 0.175262 + } + ] + }, + { + "lambda_l0": 5e-05, + "n_active_mean": 465.8, + "test_error_mean": 0.185804, + "test_error_se": 0.010899, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 476, + "test_error": 0.197811 + }, + { + "n_active": 457, + "test_error": 0.18655 + }, + { + "n_active": 472, + "test_error": 0.192721 + }, + { + "n_active": 473, + "test_error": 0.140026 + }, + { + 
"n_active": 451, + "test_error": 0.211914 + } + ] + }, + { + "lambda_l0": 0.0001, + "n_active_mean": 303.4, + "test_error_mean": 0.157013, + "test_error_se": 0.01652, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 312, + "test_error": 0.094777 + }, + { + "n_active": 291, + "test_error": 0.200136 + }, + { + "n_active": 305, + "test_error": 0.149506 + }, + { + "n_active": 312, + "test_error": 0.151891 + }, + { + "n_active": 297, + "test_error": 0.188754 + } + ] + }, + { + "lambda_l0": 0.0005, + "n_active_mean": 97.8, + "test_error_mean": 0.222348, + "test_error_se": 0.063596, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 107, + "test_error": 0.153087 + }, + { + "n_active": 97, + "test_error": 0.430495 + }, + { + "n_active": 92, + "test_error": 0.080882 + }, + { + "n_active": 99, + "test_error": 0.09524 + }, + { + "n_active": 94, + "test_error": 0.352039 + } + ] + }, + { + "lambda_l0": 0.001, + "n_active_mean": 54.0, + "test_error_mean": 0.323004, + "test_error_se": 0.097561, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 57, + "test_error": 0.102903 + }, + { + "n_active": 56, + "test_error": 0.531172 + }, + { + "n_active": 51, + "test_error": 0.273495 + }, + { + "n_active": 58, + "test_error": 0.62003 + }, + { + "n_active": 48, + "test_error": 0.087419 + } + ] + }, + { + "lambda_l0": 0.005, + "n_active_mean": 9.2, + "test_error_mean": 0.459037, + "test_error_se": 0.124795, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 10, + "test_error": 0.805416 + }, + { + "n_active": 6, + "test_error": 0.241894 + }, + { + "n_active": 9, + "test_error": 0.453204 + }, + { + "n_active": 12, + "test_error": 0.724878 + }, + { + "n_active": 9, + "test_error": 0.069794 + } + ] + }, + { + "lambda_l0": 0.01, + "n_active_mean": 6.6, + "test_error_mean": 0.420379, + "test_error_se": 0.139395, + "n_seeds": 5, + "per_seed": [ + { + "n_active": 7, + "test_error": 1.030714 + }, + { + "n_active": 6, + "test_error": 0.382212 + }, + { + "n_active": 6, + "test_error": 0.18737 + }, + { + 
"n_active": 6, + "test_error": 0.251976 + }, + { + "n_active": 8, + "test_error": 0.24962 + } + ] + } +] \ No newline at end of file diff --git a/benchmarks/results/reweighting_full.json b/benchmarks/results/reweighting_full.json index cbf9cf7..0ca6260 100644 --- a/benchmarks/results/reweighting_full.json +++ b/benchmarks/results/reweighting_full.json @@ -3,297 +3,340 @@ "methods": { "IPF": { "method_name": "IPF", - "mean_relative_error": 0.157032, - "max_relative_error": 0.251252, - "weight_cv": 0.4327, + "train_mean_error": 0.0, + "test_mean_error": 0.254463, + "mean_relative_error": 0.063616, + "max_relative_error": 0.285607, + "weight_cv": 0.3887, "sparsity": 0.0174, - "elapsed_seconds": 0.05, - "per_target_errors": [ - { - "target_name": "age_group=65+", - "target_value": 1239.0, - "actual_value": 1550.3009953204105, - "relative_error": 0.2512518122037211 - }, - { - "target_name": "age_group=35-54", - "target_value": 927.0, - "actual_value": 1159.910410450098, - "relative_error": 0.25125179120830415 - }, - { - "target_name": "age_group=0-17", - "target_value": 1030.0, - "actual_value": 1288.789347589683, - "relative_error": 0.25125179377639134 - }, - { - "target_name": "age_group=18-34", - "target_value": 806.0, - "actual_value": 1008.5089426178841, - "relative_error": 0.2512517898484915 - }, - { - "target_name": "age_group=55-64", - "target_value": 591.0, - "actual_value": 739.4898113089064, - "relative_error": 0.2512517957849516 - }, - { - "target_name": "is_male=0.0", - "target_value": 2878.0, - "actual_value": 2877.999754517448, - "relative_error": 8.529623076629744e-08 - }, - { - "target_name": "is_male=1.0", - "target_value": 2869.0, - "actual_value": 2868.999752769534, - "relative_error": 8.617304492933022e-08 - }, - { - "target_name": "weight", - "target_value": 31373894.0, - "actual_value": 31373894.0, - "relative_error": 0.0 + "elapsed_seconds": 0.04, + "train_errors": { + "age_group=65+": { + "target": 1135, + "actual": 1134.99991, + "error": 0.0 + 
}, + "age_group=35-54": { + "target": 1513, + "actual": 1512.999836, + "error": 0.0 + }, + "age_group=0-17": { + "target": 1128, + "actual": 1127.999878, + "error": 0.0 + }, + "age_group=18-34": { + "target": 971, + "actual": 970.999893, + "error": 0.0 + }, + "age_group=55-64": { + "target": 536, + "actual": 535.999947, + "error": 0.0 + }, + "weight": { + "target": 36363788, + "actual": 36363788.0, + "error": 0.0 } - ] + }, + "test_errors": { + "is_male=0.0": { + "target": 2222, + "actual": 2718.214344, + "error": 0.223319 + }, + "is_male=1.0": { + "target": 1995, + "actual": 2564.785118, + "error": 0.285607 + } + } }, "Entropy": { "method_name": "Entropy", - "mean_relative_error": 0.133642, - "max_relative_error": 0.278857, - "weight_cv": 0.3769, + "train_mean_error": 0.0, + "test_mean_error": 0.254453, + "mean_relative_error": 0.063613, + "max_relative_error": 0.285422, + "weight_cv": 0.368, "sparsity": 0.0174, - "elapsed_seconds": 0.37, - "per_target_errors": [ - { - "target_name": "age_group=65+", - "target_value": 1239.0, - "actual_value": 1403.8363422922193, - "relative_error": 0.13303982428750552 - }, - { - "target_name": "age_group=35-54", - "target_value": 927.0, - "actual_value": 1091.9410644676664, - "relative_error": 0.1779299508820565 - }, - { - "target_name": "age_group=0-17", - "target_value": 1030.0, - "actual_value": 1194.956526995818, - "relative_error": 0.16015196795710482 - }, - { - "target_name": "age_group=18-34", - "target_value": 806.0, - "actual_value": 970.7594393674235, - "relative_error": 0.20441617787521527 - }, - { - "target_name": "age_group=55-64", - "target_value": 591.0, - "actual_value": 755.8046754174934, - "relative_error": 0.2788573188113255 - }, - { - "target_name": "is_male=0.0", - "target_value": 2878.0, - "actual_value": 2713.1795686220235, - "relative_error": 0.05726908664974863 - }, - { - "target_name": "is_male=1.0", - "target_value": 2869.0, - "actual_value": 2704.1184799185967, - "relative_error": 0.05747003139818868 - 
}, - { - "target_name": "weight", - "target_value": 31373894.0, - "actual_value": 31373898.705667496, - "relative_error": 1.4998672130808944e-07 + "elapsed_seconds": 0.07, + "train_errors": { + "age_group=65+": { + "target": 1135, + "actual": 1135.0, + "error": 0.0 + }, + "age_group=35-54": { + "target": 1513, + "actual": 1513.0, + "error": 0.0 + }, + "age_group=0-17": { + "target": 1128, + "actual": 1128.0, + "error": 0.0 + }, + "age_group=18-34": { + "target": 971, + "actual": 971.0, + "error": 0.0 + }, + "age_group=55-64": { + "target": 536, + "actual": 536.0, + "error": 0.0 + }, + "weight": { + "target": 36363788, + "actual": 36363788.0, + "error": 0.0 } - ] - }, - "SparseCalibrator": { - "method_name": "SparseCalibrator", - "mean_relative_error": 0.157032, - "max_relative_error": 0.251251, - "weight_cv": 0.288, - "sparsity": 0.011, - "elapsed_seconds": 0.02, - "per_target_errors": [ - { - "target_name": "age_group=65+", - "target_value": 1239.0, - "actual_value": 1550.2997675363865, - "relative_error": 0.25125082125616344 - }, - { - "target_name": "age_group=35-54", - "target_value": 927.0, - "actual_value": 1159.9092101529523, - "relative_error": 0.25125049638937674 - }, - { - "target_name": "age_group=0-17", - "target_value": 1030.0, - "actual_value": 1288.7879799263328, - "relative_error": 0.2512504659478959 - }, - { - "target_name": "age_group=18-34", - "target_value": 806.0, - "actual_value": 1008.5078098178831, - "relative_error": 0.2512503843894331 - }, - { - "target_name": "age_group=55-64", - "target_value": 591.0, - "actual_value": 739.489078017901, - "relative_error": 0.251250555021829 - }, - { - "target_name": "is_male=0.0", - "target_value": 2878.0, - "actual_value": 2877.9969255907254, - "relative_error": 1.0682450572035309e-06 - }, - { - "target_name": "is_male=1.0", - "target_value": 2869.0, - "actual_value": 2868.99691986073, - "relative_error": 1.0735933321605578e-06 - }, - { - "target_name": "weight", - "target_value": 31373894.0, - 
"actual_value": 31373948.550865598, - "relative_error": 1.7387342992241556e-06 + }, + "test_errors": { + "is_male=0.0": { + "target": 2222, + "actual": 2718.58383, + "error": 0.223485 + }, + "is_male=1.0": { + "target": 1995, + "actual": 2564.41617, + "error": 0.285422 } - ] + } }, "L1-Sparse": { "method_name": "L1-Sparse", - "mean_relative_error": 1.154864, - "max_relative_error": 1.656945, - "weight_cv": 0.0, - "sparsity": 0.0, + "train_mean_error": 0.018332, + "test_mean_error": 0.648932, + "mean_relative_error": 0.175982, + "max_relative_error": 0.866787, + "weight_cv": 32.9878, + "sparsity": 0.999, "elapsed_seconds": 0.01, - "per_target_errors": [ - { - "target_name": "age_group=65+", - "target_value": 1239.0, - "actual_value": 2537.436, - "relative_error": 1.0479709443099274 - }, - { - "target_name": "age_group=35-54", - "target_value": 927.0, - "actual_value": 2462.9880000000003, - "relative_error": 1.6569449838187704 - }, - { - "target_name": "age_group=0-17", - "target_value": 1030.0, - "actual_value": 2047.3200000000006, - "relative_error": 0.9876893203883501 - }, - { - "target_name": "age_group=18-34", - "target_value": 806.0, - "actual_value": 1896.3560000000007, - "relative_error": 1.3527990074441696 - }, - { - "target_name": "age_group=55-64", - "target_value": 591.0, - "actual_value": 1395.9000000000003, - "relative_error": 1.3619289340101528 - }, - { - "target_name": "is_male=0.0", - "target_value": 2878.0, - "actual_value": 5329.236000000001, - "relative_error": 0.851715079916609 - }, - { - "target_name": "is_male=1.0", - "target_value": 2869.0, - "actual_value": 5010.764000000001, - "relative_error": 0.7465193447194148 - }, - { - "target_name": "weight", - "target_value": 31373894.0, - "actual_value": 70068809.1069998, - "relative_error": 1.233347543884728 + "train_errors": { + "age_group=65+": { + "target": 1135, + "actual": 1135.0, + "error": 0.0 + }, + "age_group=35-54": { + "target": 1513, + "actual": 1513.0, + "error": 0.0 + }, + 
"age_group=0-17": { + "target": 1128, + "actual": 1128.0, + "error": 0.0 + }, + "age_group=18-34": { + "target": 971, + "actual": 971.0, + "error": 0.0 + }, + "age_group=55-64": { + "target": 536, + "actual": 536.0, + "error": 0.0 + }, + "weight": { + "target": 36363788, + "actual": 32364060.10659, + "error": 0.109992 } - ] + }, + "test_errors": { + "is_male=0.0": { + "target": 2222, + "actual": 4148.0, + "error": 0.866787 + }, + "is_male=1.0": { + "target": 1995, + "actual": 1135.0, + "error": 0.431078 + } + } }, "L0-Sparse": { "method_name": "L0-Sparse", - "mean_relative_error": 1.154864, - "max_relative_error": 1.656945, - "weight_cv": 0.0, - "sparsity": 0.0, + "train_mean_error": 0.018332, + "test_mean_error": 0.648932, + "mean_relative_error": 0.175982, + "max_relative_error": 0.866787, + "weight_cv": 32.9878, + "sparsity": 0.999, + "elapsed_seconds": 0.08, + "train_errors": { + "age_group=65+": { + "target": 1135, + "actual": 1135.0, + "error": 0.0 + }, + "age_group=35-54": { + "target": 1513, + "actual": 1513.0, + "error": 0.0 + }, + "age_group=0-17": { + "target": 1128, + "actual": 1128.0, + "error": 0.0 + }, + "age_group=18-34": { + "target": 971, + "actual": 971.0, + "error": 0.0 + }, + "age_group=55-64": { + "target": 536, + "actual": 536.0, + "error": 0.0 + }, + "weight": { + "target": 36363788, + "actual": 32364060.10659, + "error": 0.109992 + } + }, + "test_errors": { + "is_male=0.0": { + "target": 2222, + "actual": 4148.0, + "error": 0.866787 + }, + "is_male=1.0": { + "target": 1995, + "actual": 1135.0, + "error": 0.431078 + } + } + }, + "SparseCalibrator": { + "method_name": "SparseCalibrator", + "train_mean_error": 1e-06, + "test_mean_error": 0.254813, + "mean_relative_error": 0.063704, + "max_relative_error": 0.292477, + "weight_cv": 0.1839, + "sparsity": 0.0104, "elapsed_seconds": 0.01, - "per_target_errors": [ - { - "target_name": "age_group=65+", - "target_value": 1239.0, - "actual_value": 2537.436, - "relative_error": 1.0479709443099274 - }, - 
{ - "target_name": "age_group=35-54", - "target_value": 927.0, - "actual_value": 2462.9880000000003, - "relative_error": 1.6569449838187704 - }, - { - "target_name": "age_group=0-17", - "target_value": 1030.0, - "actual_value": 2047.3200000000006, - "relative_error": 0.9876893203883501 - }, - { - "target_name": "age_group=18-34", - "target_value": 806.0, - "actual_value": 1896.3560000000007, - "relative_error": 1.3527990074441696 - }, - { - "target_name": "age_group=55-64", - "target_value": 591.0, - "actual_value": 1395.9000000000003, - "relative_error": 1.3619289340101528 - }, - { - "target_name": "is_male=0.0", - "target_value": 2878.0, - "actual_value": 5329.236000000001, - "relative_error": 0.851715079916609 - }, - { - "target_name": "is_male=1.0", - "target_value": 2869.0, - "actual_value": 5010.764000000001, - "relative_error": 0.7465193447194148 - }, - { - "target_name": "weight", - "target_value": 31373894.0, - "actual_value": 70068809.1069998, - "relative_error": 1.233347543884728 + "train_errors": { + "age_group=65+": { + "target": 1135, + "actual": 1134.999213, + "error": 1e-06 + }, + "age_group=35-54": { + "target": 1513, + "actual": 1512.998417, + "error": 1e-06 + }, + "age_group=0-17": { + "target": 1128, + "actual": 1127.998783, + "error": 1e-06 + }, + "age_group=18-34": { + "target": 971, + "actual": 970.99887, + "error": 1e-06 + }, + "age_group=55-64": { + "target": 536, + "actual": 535.999506, + "error": 1e-06 + }, + "weight": { + "target": 36363788, + "actual": 36363849.124754, + "error": 2e-06 + } + }, + "test_errors": { + "is_male=0.0": { + "target": 2222, + "actual": 2704.504061, + "error": 0.217149 + }, + "is_male=1.0": { + "target": 1995, + "actual": 2578.490728, + "error": 0.292477 + } + } + }, + "HardConcrete": { + "method_name": "HardConcrete", + "train_mean_error": 0.010784, + "test_mean_error": 0.254126, + "mean_relative_error": 0.07162, + "max_relative_error": 0.309996, + "weight_cv": 4.472, + "sparsity": 0.9422, + "elapsed_seconds": 
1.95, + "train_errors": { + "age_group=65+": { + "target": 1135, + "actual": 1153.93042, + "error": 0.016679 + }, + "age_group=35-54": { + "target": 1513, + "actual": 1469.90625, + "error": 0.028482 + }, + "age_group=0-17": { + "target": 1128, + "actual": 1139.481323, + "error": 0.010178 + }, + "age_group=18-34": { + "target": 971, + "actual": 973.340698, + "error": 0.002411 + }, + "age_group=55-64": { + "target": 536, + "actual": 539.308167, + "error": 0.006172 + }, + "weight": { + "target": 36363788, + "actual": 36335320.015876, + "error": 0.000783 + } + }, + "test_errors": { + "is_male=0.0": { + "target": 2222, + "actual": 2662.523926, + "error": 0.198256 + }, + "is_male=1.0": { + "target": 1995, + "actual": 2613.442871, + "error": 0.309996 } - ] + } } }, - "timestamp": "2026-02-08T10:56:41", + "timestamp": "2026-02-10T05:43:59", "n_records": 5000, + "n_train_targets": 6, + "n_test_targets": 2, "n_marginal_targets": 7, - "n_continuous_targets": 1 + "n_continuous_targets": 1, + "train_variables": [ + "age_group", + "weight" + ], + "test_variables": [ + "is_male" + ] } \ No newline at end of file diff --git a/docs/_config.yml b/docs/_config.yml index bcd2fdd..88efb42 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -1,4 +1,4 @@ -title: micro +title: microplex author: Cosilico logo: "" @@ -7,7 +7,7 @@ execute: timeout: 300 repository: - url: https://github.com/CosilicoAI/micro + url: https://github.com/CosilicoAI/microplex path_to_book: docs branch: main diff --git a/docs/api.md b/docs/api.md index 00c0928..55d174d 100644 --- a/docs/api.md +++ b/docs/api.md @@ -1,9 +1,9 @@ -# API Reference +# API reference ## Synthesizer ```{eval-rst} -.. autoclass:: micro.Synthesizer +.. autoclass:: microplex.Synthesizer :members: :undoc-members: :show-inheritance: @@ -12,7 +12,7 @@ ## Transforms ```{eval-rst} -.. automodule:: micro.transforms +.. automodule:: microplex.transforms :members: :undoc-members: ``` @@ -20,15 +20,15 @@ ## Flows ```{eval-rst} -.. 
automodule:: micro.flows +.. automodule:: microplex.flows :members: :undoc-members: ``` -## Discrete Models +## Discrete models ```{eval-rst} -.. automodule:: micro.discrete +.. automodule:: microplex.discrete :members: :undoc-members: ``` diff --git a/docs/intro.md b/docs/intro.md index 616275e..f805e85 100644 --- a/docs/intro.md +++ b/docs/intro.md @@ -10,7 +10,7 @@ - **Sparse reweighting**: L0/L1 optimization to match population targets - **Multi-source fusion**: Combine CPS, ACS, admin data into one population - **Zero-inflation handling**: Built-in support for variables with many zeros -- **Scalable**: Synthesize billions of records, reweight to any geography +- **Scalable**: Reweight to any geography ## Installation @@ -62,15 +62,13 @@ synthetic = synth.generate(new_demographics) │ P(targets | context) │ │ │ │ │ │ • Zero-inflation │ │ - │ • Joint correlations │ │ - │ • Hierarchical │ │ + │ • Per-variable models │ │ └───────────┬─────────────┘ │ │ │ ▼ │ ┌─────────────────────────┐ │ │ SYNTHESIZE │ │ - │ BILLIONS OF │ │ - │ HOUSEHOLDS │ │ + │ POPULATION │ │ └───────────┬─────────────┘ │ │ │ ▼ ▼ @@ -97,14 +95,13 @@ synthetic = synth.generate(new_demographics) ## Comparison to Alternatives -| Feature | microplex | CT-GAN | TVAE | synthpop | -|---------|:---------:|:------:|:----:|:--------:| -| Conditional generation | ✅ | ❌ | ❌ | ❌ | -| Zero-inflation | ✅ | ❌ | ❌ | ⚠️ | -| Sparse reweighting | ✅ | ❌ | ❌ | ❌ | -| Multi-source fusion | ✅ | ❌ | ❌ | ⚠️ | -| Exact likelihood | ✅ | ❌ | ❌ | N/A | -| Stable training | ✅ | ⚠️ | ✅ | ✅ | +| Feature | microplex | synthpop | +|---------|:---------:|:--------:| +| Conditional generation | ✅ | ❌ | +| Zero-inflation handling | ✅ | ⚠️ | +| Sparse reweighting | ✅ | ❌ | +| Multi-source fusion | ✅ | ⚠️ | +| Multiple synthesis methods | ✅ (QRF, QDNN, MAF) | ✅ (CART) | ## Contents diff --git a/docs/notebooks/benchmarks.md b/docs/notebooks/benchmarks.md index 8a5b43b..3d57d8d 100644 --- a/docs/notebooks/benchmarks.md +++ 
b/docs/notebooks/benchmarks.md @@ -1,15 +1,17 @@ # Benchmarks -This notebook compares `micro` against other synthesis methods. +This notebook compares `microplex` synthesis methods. -## Methods Compared +## Methods compared | Method | Description | Library | |--------|-------------|---------| -| **micro** | Normalizing flows | This package | -| **CT-GAN** | Conditional Tabular GAN | SDV | +| **microplex QRF** | Quantile regression forest | This package | +| **microplex ZI-QRF** | Zero-inflated QRF | This package | +| **microplex QDNN** | Quantile deep neural network | This package | +| **microplex MAF** | Masked autoregressive flow | This package | +| **CT-GAN** | Conditional tabular GAN | SDV | | **TVAE** | Tabular VAE | SDV | -| **Copula** | Gaussian Copula | SDV | ## Setup @@ -22,7 +24,7 @@ import seaborn as sns from benchmarks.compare import run_benchmark, results_to_dataframe ``` -## Create Test Data +## Create test data ```python np.random.seed(42) @@ -58,7 +60,7 @@ test_conditions = pd.DataFrame({ }) ``` -## Run Benchmarks +## Run benchmarks ```python results = run_benchmark( @@ -66,7 +68,7 @@ results = run_benchmark( test_conditions=test_conditions, target_vars=["wages", "capital_gains"], condition_vars=["age", "education"], - methods=["micro", "ctgan", "tvae", "copula"], + methods=["microplex", "ctgan", "tvae"], epochs=100, ) @@ -79,37 +81,37 @@ df ```python fig, axes = plt.subplots(2, 2, figsize=(12, 10)) -# KS statistic (lower is better) +# Coverage (higher is better) ax = axes[0, 0] -df.plot.bar(x="method", y="mean_ks", ax=ax, legend=False) -ax.set_title("Marginal Fidelity (KS Statistic)") -ax.set_ylabel("KS Statistic (lower is better)") +df.plot.bar(x="method", y="coverage", ax=ax, legend=False) +ax.set_title("Coverage (PRDC)") +ax.set_ylabel("Coverage (higher is better)") -# Correlation error (lower is better) +# Precision ax = axes[0, 1] -df.plot.bar(x="method", y="correlation_error", ax=ax, legend=False, color="orange") -ax.set_title("Joint Fidelity 
(Correlation Error)") -ax.set_ylabel("Error (lower is better)") +df.plot.bar(x="method", y="precision", ax=ax, legend=False, color="orange") +ax.set_title("Precision (PRDC)") +ax.set_ylabel("Precision (higher is better)") # Zero fraction error (lower is better) ax = axes[1, 0] df.plot.bar(x="method", y="mean_zero_error", ax=ax, legend=False, color="green") -ax.set_title("Zero Fraction Error") +ax.set_title("Zero fraction error") ax.set_ylabel("Error (lower is better)") # Training time ax = axes[1, 1] df.plot.bar(x="method", y="train_time", ax=ax, legend=False, color="red") -ax.set_title("Training Time") +ax.set_title("Training time") ax.set_ylabel("Seconds") plt.tight_layout() plt.show() ``` -## Key Findings +## Key findings -1. **Marginal Fidelity**: `micro` achieves competitive KS statistics -2. **Zero Handling**: `micro` excels at preserving zero fractions -3. **Correlations**: Normalizing flows preserve joint distributions well -4. **Speed**: Training time comparable to TVAE, faster than CT-GAN +1. **Coverage**: Zero-inflated methods achieve higher PRDC coverage +2. **Zero handling**: Two-stage ZI models excel at preserving zero fractions +3. **Speed**: QRF methods are fastest; MAF slowest but most flexible +4. **Architecture matters**: ZI handling lifts neural methods more than tree methods diff --git a/docs/notebooks/quickstart.md b/docs/notebooks/quickstart.md index 5138ef6..04c5241 100644 --- a/docs/notebooks/quickstart.md +++ b/docs/notebooks/quickstart.md @@ -1,19 +1,19 @@ # Quickstart -This notebook demonstrates basic usage of `micro`. +This notebook demonstrates basic usage of `microplex`. 
```python -# Install micro -# !pip install micro +# Install microplex +# !pip install microplex ``` ```python import numpy as np import pandas as pd -from micro import Synthesizer +from microplex import Synthesizer ``` -## Create Sample Data +## Create sample data ```python np.random.seed(42) @@ -38,7 +38,7 @@ training_data = pd.DataFrame({ training_data.describe() ``` -## Train Synthesizer +## Train synthesizer ```python synth = Synthesizer( @@ -49,7 +49,7 @@ synth = Synthesizer( synth.fit(training_data, epochs=50) ``` -## Generate Synthetic Data +## Generate synthetic data ```python # New demographics @@ -62,7 +62,7 @@ synthetic = synth.generate(test_conditions, seed=42) synthetic ``` -## Validate Results +## Validate results ```python # Check zero fractions match @@ -84,7 +84,7 @@ axes[0].hist(np.log1p(training_data["income"]), bins=50, alpha=0.5, label="Real" axes[0].hist(np.log1p(large_synthetic["income"]), bins=50, alpha=0.5, label="Synthetic") axes[0].set_xlabel("Log(1 + income)") axes[0].legend() -axes[0].set_title("Income Distribution") +axes[0].set_title("Income distribution") # Age vs income axes[1].scatter(training_data["age"], training_data["income"], alpha=0.3, label="Real") @@ -92,7 +92,7 @@ axes[1].scatter(large_synthetic["age"], large_synthetic["income"], alpha=0.3, la axes[1].set_xlabel("Age") axes[1].set_ylabel("Income") axes[1].legend() -axes[1].set_title("Age vs Income") +axes[1].set_title("Age vs income") plt.tight_layout() plt.show() diff --git a/paper/figures/coverage_by_method.pdf b/paper/figures/coverage_by_method.pdf new file mode 100644 index 0000000..4740b03 Binary files /dev/null and b/paper/figures/coverage_by_method.pdf differ diff --git a/paper/figures/coverage_by_method.png b/paper/figures/coverage_by_method.png new file mode 100644 index 0000000..d34e9a3 Binary files /dev/null and b/paper/figures/coverage_by_method.png differ diff --git a/paper/figures/coverage_by_method.py b/paper/figures/coverage_by_method.py new file mode 
100644 index 0000000..ffa2690 --- /dev/null +++ b/paper/figures/coverage_by_method.py @@ -0,0 +1,83 @@ +#!/usr/bin/env python3 +"""Generate grouped bar chart of coverage by method and source. + +Reads benchmark_multi_seed.json and produces a figure suitable for the paper. +""" + +import json +from pathlib import Path + +import matplotlib.pyplot as plt +import numpy as np + +RESULTS_PATH = Path(__file__).parent.parent.parent / "benchmarks" / "results" / "benchmark_multi_seed.json" +OUTPUT_PATH = Path(__file__).parent / "coverage_by_method.pdf" + +# Method display order and labels +METHOD_ORDER = ["QRF", "ZI-QRF", "QDNN", "ZI-QDNN", "MAF", "ZI-MAF"] +SOURCE_ORDER = ["sipp", "cps"] # Exclude PSID (always 0%) +SOURCE_LABELS = {"sipp": "SIPP", "cps": "CPS ASEC"} + + +def main(): + with open(RESULTS_PATH) as f: + data = json.load(f) + + methods_data = data["methods"] + n_seeds = data.get("n_seeds", 1) + + # Build arrays + n_methods = len(METHOD_ORDER) + n_sources = len(SOURCE_ORDER) + means = np.zeros((n_methods, n_sources)) + ses = np.zeros((n_methods, n_sources)) + + for i, method in enumerate(METHOD_ORDER): + if method not in methods_data: + continue + for j, source in enumerate(SOURCE_ORDER): + if source in methods_data[method]: + means[i, j] = methods_data[method][source]["mean"] + ses[i, j] = methods_data[method][source].get("se", 0) + + # Plot + fig, ax = plt.subplots(figsize=(8, 4)) + + x = np.arange(n_methods) + width = 0.35 + colors = ["#4878CF", "#D65F5F"] + + for j, source in enumerate(SOURCE_ORDER): + offset = (j - 0.5) * width + bars = ax.bar( + x + offset, + means[:, j] * 100, + width, + yerr=ses[:, j] * 100, + label=SOURCE_LABELS[source], + color=colors[j], + capsize=3, + edgecolor="white", + linewidth=0.5, + ) + + ax.set_ylabel("Coverage (%)") + ax.set_xticks(x) + ax.set_xticklabels(METHOD_ORDER, rotation=30, ha="right") + ax.legend(loc="upper left", frameon=False) + ax.set_ylim(0, 100) + ax.spines["top"].set_visible(False) + 
ax.spines["right"].set_visible(False) + ax.set_title(f"PRDC coverage by method and source ({n_seeds} seeds)") + + fig.tight_layout() + fig.savefig(OUTPUT_PATH, bbox_inches="tight") + # Also save PNG for previewing + fig.savefig(OUTPUT_PATH.with_suffix(".png"), bbox_inches="tight", dpi=150) + print(f"Saved {OUTPUT_PATH}") + print(f"Saved {OUTPUT_PATH.with_suffix('.png')}") + plt.close(fig) + + +if __name__ == "__main__": + main() diff --git a/paper/figures/reweighting_frontier.pdf b/paper/figures/reweighting_frontier.pdf new file mode 100644 index 0000000..43035bd Binary files /dev/null and b/paper/figures/reweighting_frontier.pdf differ diff --git a/paper/figures/reweighting_frontier.png b/paper/figures/reweighting_frontier.png new file mode 100644 index 0000000..651e9a1 Binary files /dev/null and b/paper/figures/reweighting_frontier.png differ diff --git a/paper/figures/reweighting_frontier.py b/paper/figures/reweighting_frontier.py new file mode 100644 index 0000000..7930ead --- /dev/null +++ b/paper/figures/reweighting_frontier.py @@ -0,0 +1,124 @@ +#!/usr/bin/env python3 +"""Generate reweighting frontier figure: records used vs out-of-sample error. + +Reads frontier data and produces a figure showing the accuracy-sparsity +tradeoff. SparseCalibrator (convex, deterministic) traces a reliable frontier. +HardConcrete (non-convex) is shown with error bars from multi-seed runs. 
+""" + +import json +from pathlib import Path + +import matplotlib.pyplot as plt +import numpy as np + +RESULTS_DIR = Path(__file__).parent.parent.parent / "benchmarks" / "results" +FRONTIER_PATH = RESULTS_DIR / "reweighting_frontier.json" +HC_MULTISEED_PATH = RESULTS_DIR / "reweighting_frontier_hc_multiseed.json" +OUTPUT_PATH = Path(__file__).parent / "reweighting_frontier.pdf" + + +def main(): + with open(FRONTIER_PATH) as f: + data = json.load(f) + + n_records = data["n_records"] + + fig, ax = plt.subplots(figsize=(7, 4.5)) + + # --- SparseCalibrator frontier (L1 family, convex/deterministic) --- + sc = data["methods"]["SparseCalibrator"] + sc_x = [p["n_active"] for p in sc] + sc_y = [p["test_error"] for p in sc] + order = np.argsort(sc_x) + sc_x = [sc_x[i] for i in order] + sc_y = [sc_y[i] for i in order] + ax.plot(sc_x, sc_y, "s-", color="#2ca02c", linewidth=1.5, markersize=5, + label=r"SparseCalibrator ($L_1$, convex)", zorder=3) + + # L1-Sparse endpoint + if "L1-Sparse" in data["methods"]: + l1 = data["methods"]["L1-Sparse"][0] + ax.scatter(l1["n_active"], l1["test_error"], marker="s", color="#2ca02c", + s=100, zorder=5, edgecolors="black", linewidths=1.5) + ax.annotate("L1-Sparse\n(hard constraints)", (l1["n_active"], l1["test_error"]), + textcoords="offset points", xytext=(15, 8), fontsize=7, + color="#2ca02c", ha="left") + + # --- HardConcrete frontier (L0 family, non-convex, with error bars) --- + if HC_MULTISEED_PATH.exists(): + with open(HC_MULTISEED_PATH) as f: + hc_ms = json.load(f) + hc_x = [p["n_active_mean"] for p in hc_ms] + hc_y = [p["test_error_mean"] for p in hc_ms] + hc_se = [p["test_error_se"] for p in hc_ms] + order = np.argsort(hc_x) + hc_x = [hc_x[i] for i in order] + hc_y = [hc_y[i] for i in order] + hc_se = [hc_se[i] for i in order] + ax.errorbar(hc_x, hc_y, yerr=hc_se, fmt="o-", color="#1f77b4", + linewidth=1.5, markersize=5, capsize=3, capthick=1, + label=r"HardConcrete ($L_0$, non-convex)", zorder=3) + else: + # Fallback: single-seed 
data + hc = data["methods"]["HardConcrete"] + hc_x = [p["n_active"] for p in hc] + hc_y = [p["test_error"] for p in hc] + order = np.argsort(hc_x) + hc_x = [hc_x[i] for i in order] + hc_y = [hc_y[i] for i in order] + ax.plot(hc_x, hc_y, "o-", color="#1f77b4", linewidth=1.5, markersize=5, + label=r"HardConcrete ($L_0$, non-convex)", zorder=3) + + # L0-Sparse endpoint + if "L0-Sparse" in data["methods"]: + l0 = data["methods"]["L0-Sparse"][0] + ax.scatter(l0["n_active"], l0["test_error"], marker="o", color="#1f77b4", + s=100, zorder=5, edgecolors="black", linewidths=1.5) + ax.annotate("L0-Sparse\n(hard constraints)", (l0["n_active"], l0["test_error"]), + textcoords="offset points", xytext=(15, -18), fontsize=7, + color="#1f77b4", ha="left") + + # --- Dense methods as reference points --- + for name, marker, color in [ + ("IPF", "D", "#d62728"), + ("Entropy", "^", "#ff7f0e"), + ]: + if name in data["methods"]: + pts = data["methods"][name] + for p in pts: + ax.scatter(p["n_active"], p["test_error"], marker=marker, color=color, + s=80, zorder=4, label=name, edgecolors="black", linewidths=0.5) + + ax.set_xlabel("Active records (non-zero weight)", fontsize=11) + ax.set_ylabel("Out-of-sample error\n(held-out sex margin)", fontsize=11) + ax.set_xscale("log") + ax.set_xlim(4, n_records * 1.3) + + # Compute y-axis limits from all data + all_errors = [] + for method_data in data["methods"].values(): + for p in method_data: + all_errors.append(p["test_error"]) + ax.set_ylim(0, max(all_errors) * 1.15) + + ax.legend(fontsize=9, loc="upper right") + ax.grid(True, alpha=0.2) + ax.tick_params(labelsize=9) + + # Format y-axis as percentage + ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f"{y:.0%}")) + + # Add annotation for total records + ax.axvline(n_records, color="gray", linestyle=":", alpha=0.4, linewidth=1) + ax.text(n_records * 0.85, ax.get_ylim()[1] * 0.95, f"N={n_records:,}", + ha="right", va="top", fontsize=8, color="gray") + + plt.tight_layout() + 
plt.savefig(OUTPUT_PATH, dpi=300, bbox_inches="tight") + plt.savefig(OUTPUT_PATH.with_suffix(".png"), dpi=150, bbox_inches="tight") + print(f"Saved to {OUTPUT_PATH} and {OUTPUT_PATH.with_suffix('.png')}") + + +if __name__ == "__main__": + main() diff --git a/paper/index.md b/paper/index.md index 2cbdd2b..d57cc95 100644 --- a/paper/index.md +++ b/paper/index.md @@ -22,7 +22,7 @@ Government surveys observe different slices of the same population: the Current ## Introduction -Policy microsimulation requires detailed individual records spanning demographics, income, taxes, transfers, wealth, and health. No single survey covers all domains. The Current Population Survey (CPS) Annual Social and Economic Supplement (ASEC) {cite:p}`flood2020integrated` captures {eval}`f"{r.n_cps:,}"` persons with employment and income variables. The Survey of Income and Program Participation (SIPP) {cite:p}`census2023sipp` adds employment dynamics and income detail for {eval}`f"{r.n_sipp:,}"` persons. The Panel Study of Income Dynamics (PSID) {cite:p}`psid2023` provides longitudinal structure for {eval}`f"{r.n_psid:,}"` persons. Administrative sources (Internal Revenue Service Statistics of Income, Social Security Administration earnings records) cover entire populations but with narrower variable sets. +Policy microsimulation requires detailed individual records spanning demographics, income, taxes, transfers, wealth, and health. No single survey covers all domains. The Current Population Survey (CPS) Annual Social and Economic Supplement (ASEC) {cite:p}`flood2023integrated` captures {eval}`f"{r.n_cps:,}"` persons with employment and income variables. The Survey of Income and Program Participation (SIPP) {cite:p}`census2023sipp` adds employment dynamics and income detail for {eval}`f"{r.n_sipp:,}"` persons. The Panel Study of Income Dynamics (PSID) {cite:p}`psid2023` provides longitudinal structure for {eval}`f"{r.n_psid:,}"` persons. 
Administrative sources (Internal Revenue Service Statistics of Income, Social Security Administration earnings records) cover entire populations but with narrower variable sets. Current approaches to combining these sources — sequential imputation, statistical matching, or record linkage — suffer from well-documented limitations. Synthetic data approaches {cite:p}`rubin1993statistical,drechsler2011synthetic` and multiple imputation {cite:p}`raghunathan2003multiple` address disclosure concerns but typically operate on single surveys. Sequential chaining (e.g., imputing CPS variables onto the American Community Survey, then Public Use File variables onto CPS) loses joint distributional structure at each step {cite:p}`meinfelder2011simulation`. Statistical matching preserves marginals but distorts correlations {cite:p}`dorazio2006statistical`. Record linkage requires common identifiers rarely available across surveys. @@ -61,15 +61,15 @@ where $\hat{\pi}_0(x)$ is a random forest classifier predicting zero vs. non-zer I compare three model families, each with and without zero-inflation: -**Quantile regression forest (QRF).** Following {cite:t}`meinshausen2006quantile`, I fit a random forest that learns the full conditional distribution $P(y \mid x)$ by retaining quantile information from training observations in leaf nodes. At generation time, I sample a random quantile $\tau \sim \text{Uniform}(0.1, 0.9)$ and return the corresponding predicted quantile. The quantile range is truncated to avoid extreme tail values; this reduces tail coverage but improves stability. +**Quantile regression forest (QRF).** Following {cite:t}`meinshausen2006quantile`, I fit a random forest that learns the full conditional distribution $P(y \mid x)$ by retaining quantile information from training observations in leaf nodes. At generation time, I uniformly sample one of five pre-computed quantile levels $\tau \in \{0.1, 0.25, 0.5, 0.75, 0.9\}$ and return the corresponding predicted quantile. 
The quantile range is truncated to $[0.1, 0.9]$ to avoid extreme tail values; this reduces tail coverage but improves stability. -**Quantile deep neural network (QDNN).** A multi-layer perceptron trained with pinball loss {cite:p}`koenker2001quantile` to predict quantiles $\hat{q}_\tau(x)$ for $\tau \in \{0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95\}$. At generation time, I sample a random quantile index and return the corresponding prediction. +**Quantile deep neural network (QDNN).** A multi-layer perceptron trained with pinball loss {cite:p}`koenker2001quantile` to predict quantiles $\hat{q}_\tau(x)$ for $\tau \in \{0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95\}$. At generation time, I uniformly sample one of these seven quantile levels and return the corresponding prediction. This discrete quantile grid limits the resolution of the generated distribution; QDNN also exhibits higher variance across seeds than QRF or MAF (see Results). -**Masked autoregressive flow (MAF).** A normalizing flow {cite:p}`papamakarios2017masked` that learns the full conditional density $p(y \mid x)$ via invertible transformations. I apply log transformation to positive values before standardization and train with maximum likelihood. Generated values are clipped to non-negative via `max(x, 0)`, which for the non-ZI variant creates an artificial mass at zero that may inflate the apparent ZI benefit. +**Masked autoregressive flow (MAF).** A normalizing flow {cite:p}`papamakarios2017masked` that learns the full conditional density $p(y \mid x)$ via invertible transformations. In the benchmark implementation, each non-shared variable gets its own 1-dimensional conditional flow $p(v \mid V_{\text{shared}})$; the MAF does not learn cross-variable dependencies within a source, making the conditional independence assumption shared across all three method families. I apply log transformation to positive values before standardization and train with maximum likelihood. 
Generated values are clipped to non-negative via `max(x, 0)`, which for the non-ZI variant creates an artificial mass at zero that may inflate the apparent ZI benefit. ### Evaluation metrics -I evaluate using Precision, Density, and Coverage (PDC) adapted from the PRDC framework of {cite:t}`naeem2020reliable`, originally developed for evaluating generative image models. In Naeem et al.'s $k$-nearest-neighbor formulation, recall and coverage are mathematically identical, so I report only three independent metrics. +I evaluate using three of the four metrics from the Precision, Recall, Density, and Coverage (PRDC) framework of {cite:t}`naeem2020reliable`, originally developed for evaluating generative image models. In Naeem et al.'s $k$-nearest-neighbor formulation, recall and coverage both measure the fraction of real points with a synthetic neighbor within the real manifold's local radius, making them mathematically equivalent (see their Definitions 2 and 4). I therefore report only Precision, Density, and Coverage (PDC). For each source survey $S_k$: 1. Subsample to at most {eval}`f"{r.max_rows_per_source:,}"` records per source (to keep computation tractable) @@ -84,12 +84,14 @@ All results are reported as means ± standard errors across {eval}`r.n_seeds` ra ### Calibration via reweighting -After synthesis, I calibrate the microdata against administrative targets by adjusting record weights. I compare five methods spanning two families. +After synthesis, I calibrate the microdata against administrative targets by adjusting record weights. I compare six methods spanning three families. -**Calibration methods** solve for weights that match both categorical marginals and continuous targets simultaneously. Iterative proportional fitting (IPF) {cite:p}`deming1940least` is the classical raking algorithm that alternately adjusts weights to match each marginal target. 
Entropy balancing {cite:p}`hainmueller2012entropy` minimizes the Kullback-Leibler divergence from the original weights subject to target constraints: $\min_w \sum_i w_i \log(w_i / w_i^0)$ s.t. $Aw = b$. SparseCalibrator, building on calibration estimator theory {cite:p}`deville1992calibration`, selects a sparse subset of records via cross-category proportional sampling, then calibrates the selected subset using iterative proportional fitting to match both categorical and continuous targets. +**Calibration methods** solve for weights that match both categorical marginals and continuous targets simultaneously. Iterative proportional fitting (IPF) {cite:p}`deming1940least` is the classical raking algorithm that alternately adjusts weights to match each marginal target. Entropy balancing {cite:p}`hainmueller2012entropy`, originally developed for causal inference but applicable to any moment-matching reweighting problem, minimizes the Kullback-Leibler divergence from the original weights subject to target constraints: $\min_w \sum_i w_i \log(w_i / w_i^0)$ s.t. $Aw = b$, where $A$ is the constraint matrix and $b$ the vector of target values. SparseCalibrator, building on calibration estimator theory {cite:p}`deville1992calibration`, selects a sparse subset of records via cross-category proportional sampling, then calibrates the selected subset using iterative proportional fitting to match both categorical and continuous targets. **Sparse optimization methods** ($L_1$-sparse and $L_0$-sparse) minimize the weight norm subject to categorical constraints only, solving $\min_w \|w\|_p$ s.t. $Aw = b$ for subset selection rather than population calibration. +**Differentiable sparse calibration** (HardConcrete) uses Hard Concrete gates {cite:p}`louizos2018learning` to learn both which records to include and what weight to assign them, jointly optimizing $L_0$ sparsity and target-matching accuracy via gradient descent. The implementation wraps the `l0-python` package. 
A key implementation detail: the initial weights must be rescaled so that the initial constraint violation is small (within ~30% of targets), otherwise gradient descent in the log-weight parameterization fails to converge from an initial point thousands of times larger than the target. + ## Data I use three public-use surveys stacked into a common format ({eval}`f"{r.n_total:,}"` total records across all survey years): @@ -135,9 +137,16 @@ df.index = range(1, len(df) + 1) df ``` +```{figure} figures/coverage_by_method.png +:name: fig-coverage +:width: 100% + +PRDC coverage by synthesis method and source survey. Error bars show standard errors across {eval}`r.n_seeds` random seeds. PSID (0% for all methods) is omitted. +``` + Per-source coverage varies dramatically across surveys. ZI-QRF achieves the highest SIPP coverage ({eval}`r.zi_qrf.sipp_pct`), while ZI-MAF leads on CPS ({eval}`r.zi_maf.cps_pct`). PSID coverage is 0% for all methods, reflecting a fundamental limitation of the current shared variable set: with only 2 conditioning variables (age, sex) and 15 PSID-specific columns, the model cannot learn the 15-dimensional joint structure from demographics alone. -I report per-source results as the primary metrics rather than aggregating across sources, since averaging with a degenerate 0% source obscures the pattern. +I report per-source results as the primary metrics rather than aggregating across sources, since averaging with a degenerate 0% source obscures the pattern. QDNN exhibits notably higher variance across seeds than the other methods (SIPP coverage standard deviation of 8.5 percentage points, vs. below 1.2 for all others), likely reflecting sensitivity of pinball-loss training to the train/holdout split. This instability should be considered when evaluating QDNN for production use. 
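For reference, the coverage statistic reported in these tables can be computed with a brute-force $k$-nearest-neighbor sketch following Naeem et al.'s formulation. This is an illustrative numpy version on toy standard-normal data, not the benchmark's implementation:

```python
import numpy as np

def knn_radii(x, k=5):
    """Distance from each point in x to its k-th nearest neighbor within x."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself (distance 0)

def coverage(real, fake, k=5):
    """Fraction of real points whose k-NN ball contains at least one fake point."""
    radii = knn_radii(real, k)
    d_rf = np.linalg.norm(real[:, None, :] - fake[None, :, :], axis=-1)
    return float((d_rf.min(axis=1) < radii).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
matched = rng.normal(size=(500, 3))            # same distribution as real
shifted = rng.normal(loc=10.0, size=(500, 3))  # far from the real manifold

print(coverage(real, matched))  # high (near 1)
print(coverage(real, shifted))  # 0.0
```

A synthesizer that never lands near a real record's neighborhood (as with PSID here) scores exactly zero, regardless of how plausible its marginals look.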
### The zero-inflation effect @@ -155,11 +164,11 @@ QRF is naturally robust to zero-inflation because quantile forests can represent ### Speed-accuracy tradeoff -ZI-QRF completes in {eval}`r.zi_qrf.time_str`, compared to ZI-MAF's {eval}`r.zi_maf.time_str` (ZI-MAF is {eval}`r.zi_speedup_over_maf` slower). ZI-QRF achieves higher SIPP coverage ({eval}`r.zi_qrf.sipp_pct` vs. {eval}`r.zi_maf.sipp_pct`), while ZI-MAF has a {eval}`f"{r.zi_maf.cps_coverage - r.zi_qrf.cps_coverage:.0%}"` point CPS coverage advantage ({eval}`r.zi_maf.cps_pct` vs. {eval}`r.zi_qrf.cps_pct`). For production pipelines requiring frequent regeneration, ZI-QRF offers a compelling speed-accuracy tradeoff. +ZI-QRF completes in {eval}`r.zi_qrf.time_str`, compared to ZI-MAF's {eval}`r.zi_maf.time_str` (ZI-MAF is {eval}`r.zi_speedup_over_maf` slower). ZI-QRF achieves higher SIPP coverage ({eval}`r.zi_qrf.sipp_pct` vs. {eval}`r.zi_maf.sipp_pct`), while ZI-MAF has a {eval}`f"{r.zi_maf.cps_coverage - r.zi_qrf.cps_coverage:.0%}"` point CPS coverage advantage ({eval}`r.zi_maf.cps_pct` vs. {eval}`r.zi_qrf.cps_pct`). For production pipelines requiring frequent regeneration, ZI-QRF achieves comparable coverage at substantially lower computational cost. ### Reweighting calibration -I evaluate reweighting methods on {eval}`f"{r.rw_n_records:,}"` records with {eval}`r.rw_n_targets_total` calibration targets ({eval}`r.rw_n_marginal_targets` categorical marginals spanning age group and sex, plus {eval}`r.rw_n_continuous_targets` continuous target for total population weight). Target values are perturbed from the sample distribution by 10-30% to simulate calibration to known population totals. This is a controlled evaluation of algorithmic performance; results may differ when calibrating against real administrative targets. +I evaluate reweighting methods on {eval}`f"{r.rw_n_records:,}"` records using a train/test split to assess out-of-sample generalization. 
Methods are calibrated on {eval}`r.rw_n_train_targets` training targets (age group categories plus total population weight), then evaluated on {eval}`r.rw_n_test_targets` held-out test targets (sex categories) that were not used during calibration. Target values are perturbed from the sample distribution by 10-30% to simulate calibration to known population totals. This design measures whether calibrating on one set of demographic margins improves representativeness along dimensions not explicitly targeted — the relevant question for practical survey calibration. ```{code-cell} python :tags: [remove-input] @@ -175,25 +184,34 @@ rows = [] for name, m in rw_data["methods"].items(): rows.append({ "Method": name, - "Mean rel. error": f"{m['mean_relative_error']:.1%}", - "Max rel. error": f"{m['max_relative_error']:.1%}", + "Train error": f"{m['train_mean_error']:.2%}", + "Test error": f"{m['test_mean_error']:.1%}", "Weight CV": f"{m['weight_cv']:.3f}", + "Sparsity": f"{m['sparsity']:.1%}", "Time (s)": f"{m['elapsed_seconds']:.2f}", }) df = pd.DataFrame(rows) -# Show calibration methods first (sorted by mean error), then sparse -cal_methods = df[df["Method"].isin(["Entropy", "IPF", "SparseCalibrator"])] -sparse_methods = df[~df["Method"].isin(["Entropy", "IPF", "SparseCalibrator"])] -df = pd.concat([cal_methods.sort_values("Mean rel. error"), sparse_methods]) +df = df.sort_values("Test error") df.index = range(1, len(df) + 1) df ``` -Among calibration methods, entropy balancing achieves the lowest mean relative error ({eval}`r.rw_entropy.mean_error_pct`), {eval}`r.entropy_vs_ipf_error_reduction` lower than IPF ({eval}`r.rw_ipf.mean_error_pct`). SparseCalibrator matches IPF accuracy while producing {eval}`r.sparse_cal_cv_vs_ipf` lower weight coefficient of variation ({eval}`r.rw_sparse_cal.cv_str` vs. {eval}`r.rw_ipf.cv_str`), meaning smoother weights that are less likely to amplify noise in downstream estimates. 
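Entropy balancing's program, $\min_w \sum_i w_i \log(w_i / w_i^0)$ s.t. $Aw = b$, admits the exponential-tilting solution $w_i = w_i^0 \exp(a_i^\top \lambda)$ when a total-weight row is included in $A$, with the multipliers found by Newton's method on the dual. A minimal numpy sketch on illustrative data (not the benchmark's implementation):

```python
import numpy as np

def entropy_balance(w0, A, b, tol=1e-10, max_iter=100):
    """Min-KL reweighting: w_i = w0_i * exp(a_i . lam), Newton on the dual."""
    lam = np.zeros(A.shape[0])
    for _ in range(max_iter):
        w = w0 * np.exp(A.T @ lam)
        grad = A @ w - b          # constraint violation
        if np.abs(grad).max() < tol * np.abs(b).max():
            break
        H = (A * w) @ A.T         # dual Hessian = A diag(w) A^T, positive definite
        lam -= np.linalg.solve(H, grad)
    return w0 * np.exp(A.T @ lam)

rng = np.random.default_rng(0)
n = 500
w0 = rng.uniform(0.5, 1.5, size=n)        # base survey weights (illustrative)
g1 = (rng.random(n) < 0.4).astype(float)  # e.g. an age-group indicator
g2 = (rng.random(n) < 0.5).astype(float)  # e.g. a sex indicator
A = np.vstack([np.ones(n), g1, g2])       # total weight plus two margins
b = A @ w0 * np.array([1.0, 1.15, 0.9])   # targets perturbed 10-15%

w = entropy_balance(w0, A, b)
print(np.abs(A @ w - b).max())  # ~0: constraints satisfied
print((w > 0).all())            # True: weights stay positive by construction
```

The exponential form guarantees positive weights, which is why dense calibration methods achieve near-zero training error on constraints they target directly.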
+All calibration methods (IPF, entropy, SparseCalibrator) achieve near-zero training error — they satisfy the age and weight constraints exactly. The key comparison is test error on the held-out sex margin, which measures generalization. + +To characterize the accuracy-sparsity tradeoff, I sweep the regularization parameter for each sparse method (SparseCalibrator's $\lambda$ and HardConcrete's $\lambda_{L_0}$) across several orders of magnitude and plot out-of-sample error against the number of records with non-zero weight. HardConcrete uses a non-convex optimizer, so I report mean $\pm$ SE over 5 random seeds. + +```{figure} figures/reweighting_frontier.png +:name: fig-reweighting-frontier +:width: 100% + +Reweighting frontier: out-of-sample error on held-out sex margin vs. number of active records. SparseCalibrator ($L_1$, convex) traces a deterministic frontier; HardConcrete ($L_0$, non-convex) shows mean $\pm$ SE over 5 seeds. Hard-constraint endpoints ($L_1$-Sparse, $L_0$-Sparse) at 5 records are far off the efficient frontier. +``` + +{numref}`fig-reweighting-frontier` shows that SparseCalibrator ($L_1$, convex) dominates HardConcrete ($L_0$, non-convex) across the entire frontier. At high sparsity (~30 records), SparseCalibrator achieves ~9% test error — half that of dense methods — with zero variance because the underlying FISTA optimizer is deterministic. HardConcrete {cite:p}`louizos2018learning` matches dense methods (~18%) with $>$300 records, but its error bars explode at high sparsity (mean 32-46% with $\pm$10-14% SE below 100 records), reflecting the difficulty of non-convex optimization with very few degrees of freedom. HardConcrete uses differentiable $L_0$ gates based on the Hard Concrete distribution to jointly optimize which records to keep and what weights to assign, implemented via the `l0-python` package. 
An initial weight rescaling step is critical: without it, gradient descent fails to converge because survey weights (mean ~6,800) produce initial constraint violations of 5,000x or more. +Dense calibration methods (IPF, entropy) use all {eval}`f"{r.rw_n_records:,}"` records and achieve ~18% test error. SparseCalibrator at its default operating point produces {eval}`r.sparse_cal_cv_vs_ipf` lower weight coefficient of variation ({eval}`r.rw_sparse_cal.cv_str` vs. {eval}`r.rw_ipf.cv_str`), meaning smoother weights that are less likely to amplify noise in downstream estimates. -The tradeoff between entropy and SparseCalibrator is instructive: entropy achieves lower mean error but higher max error ({eval}`r.rw_entropy.max_error_pct` vs. {eval}`r.rw_sparse_cal.max_error_pct`), while SparseCalibrator provides more uniform error across targets with smoother weights. +The hard-constraint $L_1$- and $L_0$-sparse methods ({eval}`r.rw_l1.test_error_pct` test error) sit far off the efficient frontier. They solve $\min_w \|w\|_p$ s.t. $Aw = b$ — pure sparsity with no accuracy tradeoff — selecting just 5 records that cannot maintain representativeness on held-out dimensions. SparseCalibrator and HardConcrete are the parameterized versions that interpolate between dense calibration and extreme sparsity. ## Discussion @@ -201,21 +219,21 @@ The tradeoff between entropy and SparseCalibrator is instructive: entropy achiev The most consistent finding is that zero-inflation handling provides large coverage gains for neural methods while barely affecting tree-based methods.
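The two-stage decomposition can be sketched end-to-end with scikit-learn, using simulated zero-inflated data in place of a survey variable and a gradient-boosted quantile model standing in for the paper's stage-two models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0, 1, size=(n, 2))  # stand-in shared variables (e.g. age, sex)
is_recipient = rng.random(n) < 0.3 + 0.4 * X[:, 0]
amount = np.where(is_recipient,
                  np.exp(1.0 + 2.0 * X[:, 0] + rng.normal(0, 0.3, n)), 0.0)

# Stage 1: classify zero vs. non-zero
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, amount > 0)

# Stage 2: conditional quantile models fit on positive values only
taus = [0.1, 0.25, 0.5, 0.75, 0.9]
pos = amount > 0
qmodels = [GradientBoostingRegressor(loss="quantile", alpha=t, random_state=0)
           .fit(X[pos], amount[pos]) for t in taus]

def generate(Xnew, rng):
    p_pos = clf.predict_proba(Xnew)[:, 1]               # P(non-zero | x)
    nonzero = rng.random(len(Xnew)) < p_pos
    tau_idx = rng.integers(len(taus), size=len(Xnew))   # random quantile level
    preds = np.stack([m.predict(Xnew) for m in qmodels])
    vals = preds[tau_idx, np.arange(len(Xnew))]
    return np.where(nonzero, np.maximum(vals, 0.0), 0.0)

synth = generate(X, rng)
print(f"zero share: real {np.mean(amount == 0):.2f}, "
      f"synthetic {np.mean(synth == 0):.2f}")
```

Because stage 1 handles the point mass explicitly, stage 2 only ever sees the smooth positive part of the distribution — the separation of tasks described above.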
A two-stage decomposition — random forest classifier {cite:p}`breiman2001random` for zero vs. non-zero, followed by a conditional model on positive values only — transforms underperforming MAF and QDNN methods into competitive performers. The zero-inflation lift exceeds the between-model-family differences for neural methods. -The mechanism follows from the structure of economic survey variables. Income sources (wages, dividends, transfers) are zero for large population fractions. Without ZI, a normalizing flow or neural network must simultaneously model: (a) the probability of being a recipient, and (b) the distribution of amounts conditional on receipt. These are fundamentally different modeling tasks — one is a classification boundary, the other a continuous density estimation — and conflating them degrades both. Tree-based methods handle this naturally through leaf node composition. +The mechanism follows from the structure of economic survey variables. Income sources (wages, dividends, transfers) are zero for large population fractions. Without ZI, a normalizing flow or neural network must simultaneously model: (a) the probability of being a recipient, and (b) the distribution of amounts conditional on receipt. These are distinct modeling tasks — one is a classification boundary, the other a continuous density estimation — and conflating them degrades both. Tree-based methods handle this naturally through leaf node composition. ### Limitations -**Shared variable bottleneck.** With only age and sex as shared conditioning variables, the model cannot capture the covariance structure that depends on education, occupation, geography, and other demographics. The 0% PSID coverage across all methods demonstrates this limitation. Expanding shared variables is the highest-priority improvement. 
+With only age and sex as shared conditioning variables, the model cannot capture the covariance structure that depends on education, occupation, geography, and other demographics. The 0% PSID coverage across all methods demonstrates this shared variable bottleneck. Expanding shared variables is the highest-priority improvement. -**Conditional independence assumption.** Non-shared variables are generated independently conditional on shared variables — both across and within sources. The synthetic joint distribution is $\prod_v P(v \mid V_{\text{shared}})$, which preserves each marginal conditional but destroys all correlations not mediated by the shared variables. For microsimulation applications, the correlation between (e.g.) SIPP program participation and CPS income components is precisely what is needed. The current framework does not capture these relationships, and I do not evaluate cross-source correlation fidelity. A full joint model — via a unified latent space, conditional dependency chains, or copula-based approaches {cite:p}`dorazio2006statistical` — would address this at the cost of additional complexity. +Non-shared variables are generated independently conditional on shared variables — both across and within sources. The synthetic joint distribution is $\prod_v P(v \mid V_{\text{shared}})$, which preserves each marginal conditional but destroys all correlations not mediated by the shared variables. For microsimulation applications, the correlation between (e.g.) SIPP program participation and CPS income components is precisely what is needed. The current framework does not capture these relationships, and I do not evaluate cross-source correlation fidelity. A full joint model — via a unified latent space, conditional dependency chains, or copula-based approaches {cite:p}`dorazio2006statistical` — would address this at the cost of additional complexity. 
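The cost of the $\prod_v P(v \mid V_{\text{shared}})$ factorization can be seen in a toy simulation: when two variables share a dependence not mediated by the shared variable, independently sampling each true conditional reproduces the marginals but attenuates the cross-variable correlation (illustrative numpy sketch, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)   # shared conditioning variable (e.g. standardized age)
u = rng.normal(size=n)   # common shock NOT mediated by x
y1 = x + u + 0.5 * rng.normal(size=n)
y2 = x + u + 0.5 * rng.normal(size=n)

# Factorized synthesis: sample each variable from its exact conditional
# y_v | x ~ N(x, 1.25), independently of the other variable.
sd = np.sqrt(1.25)
s1 = x + sd * rng.normal(size=n)
s2 = x + sd * rng.normal(size=n)

print(f"real corr(y1, y2):      {np.corrcoef(y1, y2)[0, 1]:.2f}")  # ~0.89
print(f"synthetic corr(s1, s2): {np.corrcoef(s1, s2)[0, 1]:.2f}")  # ~0.44
```

Each synthetic marginal (given $x$) is exactly right, yet half the correlation vanishes — only the part routed through $x$ survives, which is the failure mode described above for correlations like program participation vs. income components.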
-**Survey weights not used in training.** The benchmark treats all survey records equally, ignoring complex sampling designs and survey weights. This biases the learned distributions toward oversampled strata and may not reflect the population distributions that practitioners need. +The benchmark treats all survey records equally, ignoring complex sampling designs and survey weights. This biases the learned distributions toward oversampled strata and may not reflect the population distributions that practitioners need. -**Household structure.** Current synthesis operates at the person level. Realistic microdata requires consistent household structure: spouses should have compatible incomes, dependents should be children, tax unit filing status should match household composition. Hierarchical synthesis and relationship pointers (spouse_person_id, parent_person_id) are planned for future work. +Current synthesis operates at the person level. Realistic microdata requires consistent household structure: spouses should have compatible incomes, dependents should be children, tax unit filing status should match household composition. Hierarchical synthesis and relationship pointers (spouse_person_id, parent_person_id) are planned for future work. -**Evaluation metrics.** The PDC metrics adapted from computer vision may not capture the properties that survey statisticians prioritize, such as marginal distributional fidelity, cross-tabulation accuracy, or analytical validity (whether regressions on synthetic data replicate those on real data). Adding survey-standard evaluation metrics would strengthen the evaluation. +The PDC metrics adapted from computer vision may not capture the properties that survey statisticians prioritize, such as marginal distributional fidelity, cross-tabulation accuracy, or analytical validity (whether regressions on synthetic data replicate those on real data). Adding survey-standard evaluation metrics would strengthen the evaluation. 
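One such survey-standard check, analytical validity, amounts to fitting the same regression on real and synthetic data and comparing coefficients. A numpy-only sketch, with a simulated second outcome standing in for actual synthesizer output:

```python
import numpy as np

def ols_coefs(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 2))
beta_true = np.array([1.0, 2.0, -0.5])
y_real = beta_true[0] + X @ beta_true[1:] + rng.normal(size=n)
# Placeholder for a synthesizer's output: same process, fresh noise.
y_synth = beta_true[0] + X @ beta_true[1:] + rng.normal(size=n)

gap = np.abs(ols_coefs(X, y_real) - ols_coefs(X, y_synth)).max()
print(f"max coefficient gap: {gap:.3f}")
```

A small gap on analyst-relevant regressions is a different, and for microsimulation arguably more decisive, bar than nearest-neighbor coverage.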
-**Deep generative baselines.** I exclude CTGAN and TVAE {cite:p}`xu2019modeling` from the current benchmark due to dependency constraints. Adding these baselines, along with recent diffusion-based methods like Forest Flow {cite:p}`jolicoeurmartineau2024generating`, would strengthen the comparison. +I exclude CTGAN and TVAE {cite:p}`xu2019modeling` from the current benchmark due to dependency constraints. Adding these baselines, along with recent diffusion-based methods like Forest Flow {cite:p}`jolicoeurmartineau2024generating`, would strengthen the comparison. ### Future work @@ -223,7 +241,7 @@ The most impactful improvement would be expanding the shared variable set beyond ## Conclusion -I presented microplex, a framework for generating synthetic microdata from multiple government surveys using per-variable conditional models. The central empirical finding is that zero-inflation handling — a two-stage decomposition separating the zero/non-zero classification from the positive-value distribution — provides large coverage gains for neural methods (MAF: +{eval}`r.zi_maf_vs_maf_lift`; QDNN: +{eval}`r.zi_qdnn_vs_qdnn_lift`) while barely affecting tree-based methods (+{eval}`r.zi_qrf_vs_qrf_lift` for QRF). This pattern holds across {eval}`r.n_seeds` random seeds and suggests that practitioners working with economic survey data should implement zero-inflation handling before selecting a base model. +Microplex learns per-variable conditional distributions from multiple government surveys and generates synthetic records with complete variable coverage. The central empirical finding is that zero-inflation handling — separating zero/non-zero classification from positive-value density estimation — matters more than the choice of base model for neural methods. This two-stage decomposition lifts MAF coverage by {eval}`r.zi_maf_vs_maf_lift` and QDNN by {eval}`r.zi_qdnn_vs_qdnn_lift`, while barely affecting tree-based QRF (+{eval}`r.zi_qrf_vs_qrf_lift`). 
The pattern is stable across {eval}`r.n_seeds` random seeds, suggesting that practitioners working with economic survey data should implement zero-inflation handling before selecting a base model. The framework has clear limitations in its current form: the conditional independence assumption and narrow shared variable set (age, sex) mean cross-source correlations are not captured, as demonstrated by the 0% PSID coverage. These limitations are addressable — expanding shared variables and modeling cross-source dependencies are the highest-priority improvements for making the synthetic data usable in production microsimulation. diff --git a/paper/paper_results.py b/paper/paper_results.py index 16bbb2f..91ee2ad 100644 --- a/paper/paper_results.py +++ b/paper/paper_results.py @@ -66,11 +66,21 @@ class ReweightingMethodStats: weight_cv: float sparsity: float elapsed: float + train_mean_error: float = 0.0 + test_mean_error: float = 0.0 @property def mean_error_pct(self) -> str: return f"{self.mean_relative_error:.1%}" + @property + def train_error_pct(self) -> str: + return f"{self.train_mean_error:.1%}" + + @property + def test_error_pct(self) -> str: + return f"{self.test_mean_error:.1%}" + @property def max_error_pct(self) -> str: return f"{self.max_relative_error:.1%}" @@ -123,6 +133,8 @@ class PaperResults: rw_n_records: int = 5000 rw_n_marginal_targets: int = 7 rw_n_continuous_targets: int = 1 + rw_n_train_targets: int = 6 + rw_n_test_targets: int = 2 # Data characteristics n_sipp: int = 476_744 @@ -135,6 +147,9 @@ class PaperResults: n_seeds: int = 1 max_rows_per_source: int = 20_000 + # Optional reweighting methods (may not be in all benchmark runs) + rw_hardconcrete: ReweightingMethodStats = None + # Synthesis derived comparisons @property def _synthesis_methods(self) -> list[MethodStats]: @@ -184,7 +199,10 @@ def total_elapsed_str(self) -> str: # Reweighting derived comparisons @property def _calibration_methods(self) -> list[ReweightingMethodStats]: - return 
[self.rw_ipf, self.rw_entropy, self.rw_sparse_cal] + methods = [self.rw_ipf, self.rw_entropy, self.rw_sparse_cal] + if self.rw_hardconcrete is not None: + methods.append(self.rw_hardconcrete) + return methods @property def best_rw_method(self) -> str: @@ -264,6 +282,8 @@ def _extract_rw_method(data: dict, key: str) -> ReweightingMethodStats: weight_cv=m["weight_cv"], sparsity=m["sparsity"], elapsed=m["elapsed_seconds"], + train_mean_error=m.get("train_mean_error", m["mean_relative_error"]), + test_mean_error=m.get("test_mean_error", 0.0), ) @@ -316,9 +336,12 @@ def load_results( rw_sparse_cal=_extract_rw_method(rw_data, "SparseCalibrator"), rw_l1=_extract_rw_method(rw_data, "L1-Sparse"), rw_l0=_extract_rw_method(rw_data, "L0-Sparse"), + rw_hardconcrete=_extract_rw_method(rw_data, "HardConcrete") if "HardConcrete" in rw_data.get("methods", {}) else None, rw_n_records=rw_data["n_records"], rw_n_marginal_targets=rw_data["n_marginal_targets"], rw_n_continuous_targets=rw_data["n_continuous_targets"], + rw_n_train_targets=rw_data.get("n_train_targets", 6), + rw_n_test_targets=rw_data.get("n_test_targets", 2), ) diff --git a/paper/references.bib b/paper/references.bib index 4dd0902..b8af642 100644 --- a/paper/references.bib +++ b/paper/references.bib @@ -123,10 +123,10 @@ @article{gale2022simulating year={2022} } -@misc{flood2020integrated, +@misc{flood2023integrated, title={{IPUMS CPS}: Version 11.0 [dataset]}, - author={Flood, Sarah and King, Miriam and Rodgers, Renae and Ruggles, Steven and Warren, J Robert}, - year={2020}, + author={Flood, Sarah and King, Miriam and Rodgers, Renae and Ruggles, Steven and Warren, J Robert and Backman, Daniel and Chen, Annie and Cooper, Grace and Richards, Stephanie and Schouweiler, Megan and Westberry, Michael}, + year={2023}, howpublished={Minneapolis, MN: IPUMS}, url={https://doi.org/10.18128/D030.V11.0} } @@ -180,6 +180,13 @@ @article{deming1940least year={1940} } +@inproceedings{louizos2018learning, + title={Learning Sparse Neural 
Networks through $L_0$ Regularization}, + author={Louizos, Christos and Welling, Max and Kingma, Diederik P}, + booktitle={International Conference on Learning Representations}, + year={2018} +} + @article{hainmueller2012entropy, title={Entropy balancing for causal effects: a multivariate reweighting method to produce balanced samples in observational studies}, author={Hainmueller, Jens}, diff --git a/pyproject.toml b/pyproject.toml index e0b9d24..7184fb9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -45,6 +45,7 @@ dependencies = [ "quantile-forest>=1.3", # For QRF-based synthesis "scikit-learn>=1.3", # For classification in zero-inflated models "pydantic>=2.0", # For data models + "prdc>=0.1", # Canonical PRDC metrics (Naeem et al. 2020) ] [project.optional-dependencies] @@ -53,6 +54,10 @@ dev = [ "pytest-cov>=4.0", "ruff>=0.1", "mypy>=1.0", + "responses>=0.20", +] +cvxpy = [ + "cvxpy>=1.3", ] statmatch = [ # "py-statmatch>=0.1.0", # Statistical matching backend (not on PyPI yet) diff --git a/scripts/run_reweighting_benchmark.py b/scripts/run_reweighting_benchmark.py index a303f3f..546e065 100644 --- a/scripts/run_reweighting_benchmark.py +++ b/scripts/run_reweighting_benchmark.py @@ -1,13 +1,13 @@ #!/usr/bin/env python3 """Run reweighting method comparison benchmark on real data. -Compares IPF, Chi2, Entropy, L1/L2/L0 sparse, SparseCalibrator, -and HardConcrete (if l0-python installed) on target-matching accuracy. +Evaluates methods on both in-sample (training) and out-of-sample (held-out) +targets. Calibrates on age_group + weight, evaluates on held-out is_male. 
Usage: python scripts/run_reweighting_benchmark.py - python scripts/run_reweighting_benchmark.py --methods ipf entropy l1 - python scripts/run_reweighting_benchmark.py --output benchmarks/results/reweighting.json + python scripts/run_reweighting_benchmark.py --methods ipf entropy hardconcrete + python scripts/run_reweighting_benchmark.py --output benchmarks/results/reweighting_full.json """ import argparse @@ -41,20 +41,23 @@ def load_data(data_dir: Path, max_rows: int = 20000) -> pd.DataFrame: return df -def build_targets_from_data(df: pd.DataFrame) -> tuple[pd.DataFrame, dict, dict]: - """Build realistic calibration targets from the data itself. +def build_targets(df: pd.DataFrame) -> tuple[pd.DataFrame, dict, dict, dict, dict]: + """Build train and test calibration targets. - Creates targets that differ from the sample distribution, - simulating calibration to known population totals. + Train targets: age_group (5 categories) + total weight (continuous) + Test targets: is_male (2 categories) - Multi-source data has high NaN rates for survey-specific columns, - so we focus on shared columns (age, is_male) and create age bins. 
+ Returns: + (df, train_marginal, train_continuous, test_marginal, test_continuous) """ - marginal_targets = {} - continuous_targets = {} rng = np.random.RandomState(42) - # Create age bins if age column exists (shared across all surveys) + train_marginal = {} + train_continuous = {} + test_marginal = {} + test_continuous = {} + + # Create age bins (TRAIN) if "age" in df.columns: bins = [0, 18, 35, 55, 65, 120] labels = ["0-17", "18-34", "35-54", "55-64", "65+"] @@ -67,81 +70,46 @@ def build_targets_from_data(df: pd.DataFrame) -> tuple[pd.DataFrame, dict, dict] continue perturbed[str(cat)] = round(count * rng.uniform(0.7, 1.3)) if perturbed: - # Convert categorical back to string for clean matching df["age_group"] = df["age_group"].astype(str) - marginal_targets["age_group"] = perturbed - print(f" Target: age_group ({len(perturbed)} categories)") + train_marginal["age_group"] = perturbed + print(f" Train target: age_group ({len(perturbed)} categories)") - # is_male (shared across all surveys) + # is_male (TEST — held out during calibration) if "is_male" in df.columns and df["is_male"].isna().mean() < 0.05: counts = df["is_male"].value_counts(dropna=True) perturbed = {} for cat, count in counts.items(): perturbed[cat] = round(count * rng.uniform(0.8, 1.2)) - marginal_targets["is_male"] = perturbed - print(f" Target: is_male ({len(perturbed)} categories)") - - # Use low-NaN categorical columns from survey-specific data - for col in df.columns: - if col in ("weight", "person_id", "household_id", "age", "is_male", "age_group"): - continue - if col.startswith("_"): - continue - if df[col].isna().mean() > 0.3: # Allow up to 30% NaN - continue - if df[col].nunique() <= 10 and df[col].nunique() >= 2: - counts = df[col].value_counts(dropna=True) - perturbed = {} - for cat, count in counts.items(): - if pd.isna(cat): - continue - perturbed[cat] = round(count * rng.uniform(0.7, 1.3)) - if perturbed and len(marginal_targets) < 4: - marginal_targets[col] = perturbed - print(f" 
Target: {col} ({len(perturbed)} categories)") - - # Continuous targets — use total weight as population count target - total_weight = df["weight"].sum() - continuous_targets["weight"] = round(total_weight * rng.uniform(0.9, 1.1)) - print(f" Target: weight (total={total_weight:,.0f} -> {continuous_targets['weight']:,.0f})") - - return df, marginal_targets, continuous_targets + test_marginal["is_male"] = perturbed + print(f" Test target: is_male ({len(perturbed)} categories)") + # Continuous: total weight (TRAIN) + total_weight = df["weight"].sum() + train_continuous["weight"] = round(total_weight * rng.uniform(0.9, 1.1)) + print(f" Train target: weight (total={total_weight:,.0f} -> {train_continuous['weight']:,.0f})") -METHOD_MAP = { - "ipf": "IPFMethod", - "chi2": "Chi2Method", - "entropy": "EntropyMethod", - "l1": "L1SparseMethod", - "l2": "L2SparseMethod", - "l0": "L0SparseMethod", - "sparse": "SparseCalibratorMethod", - "hardconcrete": "HardConcreteMethod", -} + return df, train_marginal, train_continuous, test_marginal, test_continuous def build_methods(method_names: list[str] = None): """Build method instances from names.""" from microplex.eval.reweighting_benchmark import ( - IPFMethod, Chi2Method, EntropyMethod, - L1SparseMethod, L2SparseMethod, L0SparseMethod, + IPFMethod, EntropyMethod, + L1SparseMethod, L0SparseMethod, SparseCalibratorMethod, HardConcreteMethod, ) all_methods = { "ipf": IPFMethod(), - "chi2": Chi2Method(), "entropy": EntropyMethod(), "l1": L1SparseMethod(), - "l2": L2SparseMethod(), "l0": L0SparseMethod(), "sparse": SparseCalibratorMethod(sparsity_weight=0.01), - "hardconcrete": HardConcreteMethod(lambda_l0=1e-5, epochs=500), + "hardconcrete": HardConcreteMethod(lambda_l0=1e-4, epochs=2000), } if method_names is None: - method_names = ["ipf", "chi2", "entropy", "l1", "l2", "l0", "sparse"] - # Add HardConcrete if l0-python is available + method_names = ["ipf", "entropy", "l1", "l0", "sparse"] try: import l0 
method_names.append("hardconcrete") @@ -159,17 +127,37 @@ def build_methods(method_names: list[str] = None): return methods +def evaluate_weights(data, weights, marginal_targets, continuous_targets): + """Compute per-target relative errors for given weights.""" + errors = {} + for var, var_targets in marginal_targets.items(): + for cat, target in var_targets.items(): + mask = data[var] == cat + actual = float(weights[mask].sum()) + rel_err = abs(actual - target) / target if target > 0 else 0.0 + errors[f"{var}={cat}"] = { + "target": target, "actual": actual, "error": rel_err, + } + if continuous_targets: + for var, target in continuous_targets.items(): + if var in data.columns: + actual = float((weights * data[var].values).sum()) + rel_err = abs(actual - target) / abs(target) if target != 0 else 0.0 + errors[var] = {"target": target, "actual": actual, "error": rel_err} + return errors + + def main(): parser = argparse.ArgumentParser(description="Run reweighting method benchmark") parser.add_argument( "--methods", nargs="+", default=None, help="Methods to compare (default: all available). 
" - "Options: ipf, chi2, entropy, l1, l2, l0, sparse, hardconcrete", + "Options: ipf, entropy, l1, l0, sparse, hardconcrete", ) parser.add_argument("--output", type=str, help="Save results to JSON") parser.add_argument( "--max-rows", type=int, default=5000, - help="Max rows (default: 5000, reweighting is O(n) per iteration)", + help="Max rows (default: 5000)", ) parser.add_argument( "--data-dir", type=str, @@ -183,60 +171,118 @@ def main(): # Load data df = load_data(data_dir, max_rows=args.max_rows) - # Build targets (may add derived columns like age_group) + # Build train/test targets print("\nBuilding calibration targets...") - df, marginal_targets, continuous_targets = build_targets_from_data(df) - print(f" {len(marginal_targets)} categorical, {len(continuous_targets)} continuous targets") + df, train_marginal, train_continuous, test_marginal, test_continuous = build_targets(df) + + n_train = sum(len(v) for v in train_marginal.values()) + len(train_continuous) + n_test = sum(len(v) for v in test_marginal.values()) + len(test_continuous) + print(f" Train: {n_train} targets, Test: {n_test} targets") - if not marginal_targets: - print("ERROR: No categorical targets found. 
Need at least one categorical variable.") + if not train_marginal: + print("ERROR: No training targets found.") sys.exit(1) - # Drop rows with NaN in categorical target columns (these cause constraint errors) - # Continuous targets are handled via fillna(0) in the calibrators - cat_target_cols = [c for c in marginal_targets.keys() if c in df.columns] - if cat_target_cols: + # Drop NaN rows in ALL target columns (train + test) + all_cat_cols = list(train_marginal.keys()) + list(test_marginal.keys()) + cat_cols = [c for c in all_cat_cols if c in df.columns] + if cat_cols: before = len(df) - df = df.dropna(subset=cat_target_cols).reset_index(drop=True) + df = df.dropna(subset=cat_cols).reset_index(drop=True) if len(df) < before: - print(f" Dropped {before - len(df)} rows with NaN in categorical targets " - f"({len(df)} remaining)") + print(f" Dropped {before - len(df)} rows with NaN ({len(df)} remaining)") # Build methods methods = build_methods(args.methods) - print(f"\nMethods to compare ({len(methods)}): {[m.name for m in methods]}") + print(f"\nMethods ({len(methods)}): {[m.name for m in methods]}") + + # All targets for evaluation + all_marginal = {**train_marginal, **test_marginal} + all_continuous = {**train_continuous, **test_continuous} # Run benchmark - from microplex.eval.reweighting_benchmark import ReweightingBenchmarkRunner - - runner = ReweightingBenchmarkRunner(methods=methods) - t0 = time.time() - result = runner.run( - data=df, - marginal_targets=marginal_targets, - continuous_targets=continuous_targets if continuous_targets else None, - seed=args.seed, - ) - total_elapsed = time.time() - t0 + print(f"\n{'Method':<20} {'Train err':>10} {'Test err':>10} " + f"{'All err':>10} {'Sparsity':>10} {'Time':>8}") + print("-" * 78) - # Print summary - print(f"\n{result.summary()}") - print(f"\nTotal elapsed: {total_elapsed:.1f}s") + results = {} + for method in methods: + t0 = time.time() + try: + # Fit on TRAIN targets only + method.fit(df, train_marginal, 
train_continuous if train_continuous else None) + elapsed = time.time() - t0 + weights = method.get_weights() + + # Evaluate on train, test, and all targets + train_errs = evaluate_weights(df, weights, train_marginal, train_continuous) + test_errs = evaluate_weights(df, weights, test_marginal, test_continuous) + all_errs = evaluate_weights(df, weights, all_marginal, all_continuous) + + train_mean = np.mean([e["error"] for e in train_errs.values()]) + test_mean = np.mean([e["error"] for e in test_errs.values()]) if test_errs else 0.0 + all_mean = np.mean([e["error"] for e in all_errs.values()]) + all_max = max(e["error"] for e in all_errs.values()) + + mean_w = weights.mean() + cv = float(weights.std() / mean_w) if mean_w > 0 else 0.0 + sparsity = float((weights < 1e-9).sum() / len(weights)) + + results[method.name] = { + "method_name": method.name, + "train_mean_error": round(train_mean, 6), + "test_mean_error": round(test_mean, 6), + "mean_relative_error": round(all_mean, 6), + "max_relative_error": round(all_max, 6), + "weight_cv": round(cv, 4), + "sparsity": round(sparsity, 4), + "elapsed_seconds": round(elapsed, 2), + "train_errors": {k: {kk: round(vv, 6) if isinstance(vv, float) else vv + for kk, vv in v.items()} + for k, v in train_errs.items()}, + "test_errors": {k: {kk: round(vv, 6) if isinstance(vv, float) else vv + for kk, vv in v.items()} + for k, v in test_errs.items()}, + } + + print(f"{method.name:<20} {train_mean:>10.2%} {test_mean:>10.2%} " + f"{all_mean:>10.2%} {sparsity:>10.1%} {elapsed:>7.1f}s") + + except Exception as e: + elapsed = time.time() - t0 + results[method.name] = { + "method_name": method.name, + "train_mean_error": float("inf"), + "test_mean_error": float("inf"), + "mean_relative_error": float("inf"), + "max_relative_error": float("inf"), + "weight_cv": 0.0, + "sparsity": 0.0, + "elapsed_seconds": round(elapsed, 2), + } + print(f"{method.name:<20} {'ERROR':>10} {'ERROR':>10} " + f"{'ERROR':>10} {'':>10} {elapsed:>7.1f}s {e}") + + 
print("-" * 78) # Save results if args.output: output_path = Path(args.output) output_path.parent.mkdir(parents=True, exist_ok=True) - result_dict = result.to_dict() - result_dict["timestamp"] = time.strftime("%Y-%m-%dT%H:%M:%S") - result_dict["total_elapsed_seconds"] = round(total_elapsed, 1) - result_dict["n_records"] = len(df) - result_dict["n_marginal_targets"] = sum( - len(v) for v in marginal_targets.values() - ) - result_dict["n_continuous_targets"] = len(continuous_targets) + output = { + "seed": args.seed, + "methods": results, + "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"), + "n_records": len(df), + "n_train_targets": n_train, + "n_test_targets": n_test, + "n_marginal_targets": sum(len(v) for v in all_marginal.values()), + "n_continuous_targets": len(all_continuous), + "train_variables": list(train_marginal.keys()) + list(train_continuous.keys()), + "test_variables": list(test_marginal.keys()) + list(test_continuous.keys()), + } with open(output_path, "w") as f: - json.dump(result_dict, f, indent=2) + json.dump(output, f, indent=2) print(f"\nResults saved to {output_path}") diff --git a/scripts/run_reweighting_frontier.py b/scripts/run_reweighting_frontier.py new file mode 100644 index 0000000..9d948cc --- /dev/null +++ b/scripts/run_reweighting_frontier.py @@ -0,0 +1,226 @@ +#!/usr/bin/env python3 +"""Generate reweighting frontier data: records used vs out-of-sample error. + +Sweeps regularization parameters for each method, recording (n_active, test_error) +pairs. Uses the same train/test split as the paper benchmark: calibrate on +age_group + weight, evaluate on held-out is_male. 
+ +Usage: + python scripts/run_reweighting_frontier.py + python scripts/run_reweighting_frontier.py --output benchmarks/results/reweighting_frontier.json +""" + +import argparse +import json +import sys +import time +from pathlib import Path + +import numpy as np +import pandas as pd + + +def load_data(data_dir: Path, max_rows: int = 5000) -> pd.DataFrame: + stacked_path = data_dir / "stacked_comprehensive.parquet" + if not stacked_path.exists(): + print(f"ERROR: {stacked_path} not found") + sys.exit(1) + df = pd.read_parquet(stacked_path) + if len(df) > max_rows: + rng = np.random.RandomState(42) + idx = rng.choice(len(df), max_rows, replace=False) + df = df.iloc[idx].reset_index(drop=True) + return df + + +def build_targets(df: pd.DataFrame): + """Build train (age_group + weight) and test (is_male) targets.""" + rng = np.random.RandomState(42) + df = df.copy() + + # Train: age_group + train_marginal = {} + bins = [0, 18, 35, 55, 65, 120] + labels = ["0-17", "18-34", "35-54", "55-64", "65+"] + df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False).astype(str) + counts = df["age_group"].value_counts(dropna=True) + train_marginal["age_group"] = { + str(cat): round(count * rng.uniform(0.7, 1.3)) + for cat, count in counts.items() if not pd.isna(cat) + } + + # Train: weight (continuous) + train_continuous = {"weight": round(df["weight"].sum() * rng.uniform(0.9, 1.1))} + + # Test: is_male + test_marginal = {} + counts = df["is_male"].value_counts(dropna=True) + test_marginal["is_male"] = { + cat: round(count * rng.uniform(0.8, 1.2)) + for cat, count in counts.items() + } + + # Drop NaN rows + df = df.dropna(subset=["age_group", "is_male"]).reset_index(drop=True) + + return df, train_marginal, train_continuous, test_marginal + + +def evaluate_test_error(df, weights, test_marginal): + """Mean relative error on held-out test targets.""" + errors = [] + for var, var_targets in test_marginal.items(): + for cat, target in var_targets.items(): + mask = 
df[var] == cat + actual = float(weights[mask].sum()) + errors.append(abs(actual - target) / target if target > 0 else 0.0) + return np.mean(errors) + + +def evaluate_train_error(df, weights, train_marginal, train_continuous): + """Mean relative error on training targets.""" + errors = [] + for var, var_targets in train_marginal.items(): + for cat, target in var_targets.items(): + mask = df[var] == cat + actual = float(weights[mask].sum()) + errors.append(abs(actual - target) / target if target > 0 else 0.0) + for var, target in train_continuous.items(): + actual = float((weights * df[var].values).sum()) + errors.append(abs(actual - target) / abs(target) if target != 0 else 0.0) + return np.mean(errors) + + +def run_single(method_cls, df, train_m, train_c, test_m, **kwargs): + """Run a single method config and return results dict.""" + t0 = time.time() + try: + method = method_cls(**kwargs) + method.fit(df, train_m, train_c) + weights = method.get_weights() + elapsed = time.time() - t0 + + n_active = int((weights > 1e-9).sum()) + test_err = evaluate_test_error(df, weights, test_m) + train_err = evaluate_train_error(df, weights, train_m, train_c) + + return { + "n_active": n_active, + "test_error": round(test_err, 6), + "train_error": round(train_err, 6), + "sparsity": round(1 - n_active / len(df), 4), + "weight_cv": round(float(weights.std() / weights.mean()), 4) if weights.mean() > 0 else 0, + "elapsed": round(elapsed, 2), + "params": {k: v for k, v in kwargs.items() if k != "verbose"}, + } + except Exception as e: + print(f" FAILED: {e}") + return None + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--output", type=str, + default=str(Path(__file__).parent.parent / "benchmarks" / "results" / "reweighting_frontier.json")) + parser.add_argument("--data-dir", type=str, + default=str(Path(__file__).parent.parent / "data")) + 
parser.add_argument("--max-rows", type=int, default=5000) + args = parser.parse_args() + + from microplex.eval.reweighting_benchmark import ( + IPFMethod, EntropyMethod, SparseCalibratorMethod, HardConcreteMethod, + L1SparseMethod, L0SparseMethod, + ) + + # Load data and build targets + df = load_data(Path(args.data_dir), args.max_rows) + df, train_m, train_c, test_m = build_targets(df) + n_records = len(df) + print(f"Records: {n_records:,}") + print(f"Train targets: {sum(len(v) for v in train_m.values()) + len(train_c)}") + print(f"Test targets: {sum(len(v) for v in test_m.values())}") + + results = {"n_records": n_records, "methods": {}} + + # --- Dense methods (single point each) --- + print("\n--- Dense methods ---") + + for name, method_cls, kwargs in [ + ("IPF", IPFMethod, {}), + ("Entropy", EntropyMethod, {}), + ("L1-Sparse", L1SparseMethod, {}), + ("L0-Sparse", L0SparseMethod, {}), + ]: + print(f" {name}...") + method = method_cls(**kwargs) + method.fit(df, train_m, train_c) + weights = method.get_weights() + n_active = int((weights > 1e-9).sum()) + test_err = evaluate_test_error(df, weights, test_m) + train_err = evaluate_train_error(df, weights, train_m, train_c) + results["methods"][name] = [{ + "n_active": n_active, + "test_error": round(test_err, 6), + "train_error": round(train_err, 6), + "sparsity": round(1 - n_active / n_records, 4), + "params": kwargs, + }] + print(f" n_active={n_active}, test_error={test_err:.4f}") + + # --- SparseCalibrator: sweep sparsity_weight --- + print("\n--- SparseCalibrator (sweep sparsity_weight) ---") + sc_results = [] + for sw in [0.0, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]: + print(f" sparsity_weight={sw}...") + method = SparseCalibratorMethod(sparsity_weight=sw) + method.fit(df, train_m, train_c) + weights = method.get_weights() + n_active = int((weights > 1e-9).sum()) + test_err = evaluate_test_error(df, weights, test_m) + train_err = evaluate_train_error(df, weights, train_m, train_c) + 
sc_results.append({ + "n_active": n_active, + "test_error": round(test_err, 6), + "train_error": round(train_err, 6), + "sparsity": round(1 - n_active / n_records, 4), + "params": {"sparsity_weight": sw}, + }) + print(f" n_active={n_active}, test_error={test_err:.4f}, train_error={train_err:.4f}") + results["methods"]["SparseCalibrator"] = sc_results + + # --- HardConcrete: sweep lambda_l0 --- + print("\n--- HardConcrete (sweep lambda_l0) ---") + hc_results = [] + for lam in [1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2]: + print(f" lambda_l0={lam:.0e}...") + try: + method = HardConcreteMethod(lambda_l0=lam, epochs=2000) + method.fit(df, train_m, train_c) + weights = method.get_weights() + n_active = int((weights > 1e-9).sum()) + test_err = evaluate_test_error(df, weights, test_m) + train_err = evaluate_train_error(df, weights, train_m, train_c) + hc_results.append({ + "n_active": n_active, + "test_error": round(test_err, 6), + "train_error": round(train_err, 6), + "sparsity": round(1 - n_active / n_records, 4), + "params": {"lambda_l0": lam, "epochs": 2000}, + }) + print(f" n_active={n_active}, test_error={test_err:.4f}, train_error={train_err:.4f}") + except Exception as e: + print(f" FAILED: {e}") + results["methods"]["HardConcrete"] = hc_results + + # Save + output_path = Path(args.output) + output_path.parent.mkdir(parents=True, exist_ok=True) + results["timestamp"] = time.strftime("%Y-%m-%dT%H:%M:%S") + with open(output_path, "w") as f: + json.dump(results, f, indent=2) + print(f"\nSaved to {output_path}") + + +if __name__ == "__main__": + main() diff --git a/src/microplex/calibration.py b/src/microplex/calibration.py index 7c62021..6515525 100644 --- a/src/microplex/calibration.py +++ b/src/microplex/calibration.py @@ -1413,6 +1413,15 @@ def fit( else: init_weights = np.ones(len(data)) + # Rescale init weights so A_norm @ init_weights ≈ b_norm. + # Without this, survey weights (e.g. 
CPS ~6000) produce initial + # constraint violations of 1000x+, making gradient descent fail. + achieved = A_norm @ init_weights + positive = achieved > 1e-10 + if positive.any(): + scale = np.mean(b_norm[positive] / achieved[positive]) + init_weights = init_weights * scale + # Create and fit model self.model_ = SparseCalibrationWeights( n_features=len(data), diff --git a/src/microplex/dgp.py b/src/microplex/dgp.py index 95a7069..4ac2a8e 100644 --- a/src/microplex/dgp.py +++ b/src/microplex/dgp.py @@ -57,47 +57,28 @@ class EvalResult: def compute_prdc(real: np.ndarray, fake: np.ndarray, k: int = 5) -> Dict[str, float]: - """Compute Precision, Recall, Density, Coverage metrics.""" - from sklearn.neighbors import NearestNeighbors + """Compute Precision, Recall, Density, Coverage via canonical prdc library. + + Delegates to Naeem et al. (2020) reference implementation. Standardizes + inputs first for consistent distance computation. + """ + from prdc import compute_prdc as _prdc + from sklearn.preprocessing import StandardScaler - # Handle edge cases if len(real) < k + 1 or len(fake) < k + 1: return {"precision": 0.0, "recall": 0.0, "density": 0.0, "coverage": 0.0} - # k-NN within each set - nn_real = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(real) - real_distances, _ = nn_real.kneighbors(real) - real_radii = real_distances[:, -1] - - nn_fake = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(fake) - fake_distances, _ = nn_fake.kneighbors(fake) - fake_radii = fake_distances[:, -1] - - # Cross-set distances - dist_fake_to_real, _ = nn_real.kneighbors(fake) - dist_fake_to_real = dist_fake_to_real[:, 0] - - nn_fake_1 = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(fake) - dist_real_to_fake, _ = nn_fake_1.kneighbors(real) - dist_real_to_fake = dist_real_to_fake[:, 0] - - # Metrics - nearest_real_idx = nn_real.kneighbors(fake, n_neighbors=1, return_distance=False)[:, 0] - precision = (dist_fake_to_real <= 
real_radii[nearest_real_idx]).mean() - - nearest_fake_idx = nn_fake.kneighbors(real, n_neighbors=1, return_distance=False)[:, 0] - recall = (dist_real_to_fake <= fake_radii[nearest_fake_idx]).mean() - - coverage = (dist_real_to_fake <= real_radii).mean() + scaler = StandardScaler() + real_s = scaler.fit_transform(real) + fake_s = scaler.transform(fake) - dist_fake_to_all_real = nn_real.kneighbors(fake, n_neighbors=len(real), return_distance=True)[0] - density = (dist_fake_to_all_real <= real_radii).sum(axis=1).mean() / k + metrics = _prdc(real_s, fake_s, nearest_k=k) return { - "precision": float(precision), - "recall": float(recall), - "density": float(density), - "coverage": float(coverage), + "precision": float(metrics["precision"]), + "recall": float(metrics["recall"]), + "density": float(metrics["density"]), + "coverage": float(metrics["coverage"]), } diff --git a/src/microplex/eval/benchmark.py b/src/microplex/eval/benchmark.py index a428c57..f36c1e9 100644 --- a/src/microplex/eval/benchmark.py +++ b/src/microplex/eval/benchmark.py @@ -153,13 +153,13 @@ def summary(self) -> str: def _compute_prdc(real: np.ndarray, synthetic: np.ndarray, k: int = 5) -> dict[str, float]: - """Compute Precision, Density, Coverage via k-NN. + """Compute Precision, Density, Coverage via canonical prdc library. - Adapted from Naeem et al. (2020) PRDC. In the k-NN formulation, - recall and coverage are identical, so we report only coverage. + Delegates to Naeem et al. (2020) reference implementation. In their + k-NN formulation recall and coverage are identical, so we report only + coverage (dropping recall from the returned dict). 
""" - from sklearn.neighbors import NearestNeighbors - from sklearn.metrics import pairwise_distances + from prdc import compute_prdc as _prdc if len(real) < k + 1 or len(synthetic) < k + 1: return {"precision": 0.0, "density": 0.0, "coverage": 0.0} @@ -168,43 +168,12 @@ def _compute_prdc(real: np.ndarray, synthetic: np.ndarray, k: int = 5) -> dict[s real_s = scaler.fit_transform(real) synth_s = scaler.transform(synthetic) - nn_real = NearestNeighbors(n_neighbors=k + 1).fit(real_s) - real_dists, _ = nn_real.kneighbors(real_s) - real_radii = real_dists[:, -1] - - nn_synth = NearestNeighbors(n_neighbors=k + 1).fit(synth_s) - synth_dists, _ = nn_synth.kneighbors(synth_s) - synth_radii = synth_dists[:, -1] - - nn_synth_1 = NearestNeighbors(n_neighbors=1).fit(synth_s) - real_to_synth_dist, _ = nn_synth_1.kneighbors(real_s) - real_to_synth_dist = real_to_synth_dist[:, 0] - - nn_real_1 = NearestNeighbors(n_neighbors=1).fit(real_s) - synth_to_real_dist, _ = nn_real_1.kneighbors(synth_s) - synth_to_real_dist = synth_to_real_dist[:, 0] - - coverage = float((real_to_synth_dist <= real_radii).mean()) - precision = float((synth_to_real_dist <= synth_radii).mean()) - - max_density_samples = 2000 - if len(synth_s) > max_density_samples: - rng = np.random.RandomState(42) - idx = rng.choice(len(synth_s), max_density_samples, replace=False) - synth_sample = synth_s[idx] - radii_sample = synth_radii[idx] - else: - synth_sample = synth_s - radii_sample = synth_radii - - dists = pairwise_distances(synth_sample, real_s) - counts = (dists <= radii_sample[:, None]).sum(axis=1) - density = float(counts.mean() / k) + metrics = _prdc(real_s, synth_s, nearest_k=k) return { - "precision": precision, - "density": density, - "coverage": coverage, + "precision": float(metrics["precision"]), + "density": float(metrics["density"]), + "coverage": float(metrics["coverage"]), } diff --git a/src/microplex/eval/coverage.py b/src/microplex/eval/coverage.py index 73a1e96..47ef212 100644 --- 
a/src/microplex/eval/coverage.py +++ b/src/microplex/eval/coverage.py @@ -41,8 +41,9 @@ def compute_prdc( """ Compute Precision, Recall, Density, Coverage metrics. - Based on Naeem et al. (2020) "Reliable Fidelity and Diversity Metrics - for Generative Models" + Delegates the four scalar metrics to the canonical ``prdc`` library + (Naeem et al. 2020) and additionally computes per-record detail arrays + (covered_mask, distances, nearest_indices) used by downstream code. Args: real: (n_real, n_features) real data @@ -51,8 +52,10 @@ def compute_prdc( scaler: Optional scaler. If None, fits StandardScaler on real. Returns: - PRDCResult with all metrics + PRDCResult with all metrics and per-record arrays """ + from prdc import compute_prdc as _prdc + # Scale data if scaler is None: scaler = StandardScaler() @@ -61,53 +64,28 @@ def compute_prdc( real_scaled = scaler.transform(real) synth_scaled = scaler.transform(synthetic) - # Real manifold: k-th neighbor distance as radius + # Canonical PRDC metrics + metrics = _prdc(real_scaled, synth_scaled, nearest_k=k) + + # Per-record detail arrays (not provided by the prdc library) + # Distance from each real point to its nearest synthetic neighbour, + # plus the real manifold radii needed for the covered_mask. 
nn_real = NearestNeighbors(n_neighbors=k + 1).fit(real_scaled) real_dists, _ = nn_real.kneighbors(real_scaled) - real_radii = real_dists[:, -1] # k-th neighbor (excluding self) + real_radii = real_dists[:, -1] - # Synthetic manifold: k-th neighbor distance as radius - nn_synth = NearestNeighbors(n_neighbors=k + 1).fit(synth_scaled) - synth_dists, _ = nn_synth.kneighbors(synth_scaled) - synth_radii = synth_dists[:, -1] - - # Distance from real to nearest synthetic nn_synth_1 = NearestNeighbors(n_neighbors=1).fit(synth_scaled) real_to_synth, nearest_synth = nn_synth_1.kneighbors(real_scaled) real_to_synth = real_to_synth[:, 0] nearest_synth = nearest_synth[:, 0] - # Distance from synthetic to nearest real - nn_real_1 = NearestNeighbors(n_neighbors=1).fit(real_scaled) - synth_to_real, _ = nn_real_1.kneighbors(synth_scaled) - synth_to_real = synth_to_real[:, 0] - - # Coverage: real point covered if nearest synthetic within its radius covered = real_to_synth <= real_radii - coverage = covered.mean() - - # Note: In Naeem et al.'s k-NN formulation, recall and coverage are - # identical. We set recall = coverage for API compatibility but only - # coverage is an independent metric. 
-    recall = coverage
-
-    # Precision: fraction of synthetic points with real in their ball
-    precision = (synth_to_real <= synth_radii).mean()
-
-    # Density: average number of real points within synthetic ball
-    # (normalized by k)
-    density_counts = []
-    for i, synth_pt in enumerate(synth_scaled):
-        dists = np.linalg.norm(real_scaled - synth_pt, axis=1)
-        count = (dists <= synth_radii[i]).sum()
-        density_counts.append(count)
-    density = np.mean(density_counts) / k
 
     return PRDCResult(
-        precision=precision,
-        recall=recall,
-        density=density,
-        coverage=coverage,
+        precision=float(metrics["precision"]),
+        recall=float(metrics["recall"]),
+        density=float(metrics["density"]),
+        coverage=float(metrics["coverage"]),
         covered_mask=covered,
         distances=real_to_synth,
         nearest_indices=nearest_synth,
diff --git a/src/microplex/eval/harness.py b/src/microplex/eval/harness.py
index 81b9e85..ba7681f 100644
--- a/src/microplex/eval/harness.py
+++ b/src/microplex/eval/harness.py
@@ -12,7 +12,6 @@
 import numpy as np
 import pandas as pd
-from sklearn.neighbors import NearestNeighbors
 from sklearn.preprocessing import StandardScaler
 
@@ -227,67 +226,28 @@ def _fmt_number(v: float) -> str:
 def _compute_prdc(
     real: np.ndarray, synthetic: np.ndarray, k: int = 5
 ) -> dict[str, float]:
-    """Compute Precision, Recall, Density, Coverage.
+    """Compute Precision, Recall, Density, Coverage via canonical prdc library.
 
-    Based on Naeem et al. (2020). Operates in standardized space.
+    Delegates to the Naeem et al. (2020) reference implementation. The dict
+    keys are unchanged for API compatibility, but recall, previously aliased
+    to coverage, is now the library's genuine (distinct) recall metric.
""" + from prdc import compute_prdc as _prdc + if len(real) < k + 1 or len(synthetic) < k + 1: return {"precision": 0.0, "recall": 0.0, "density": 0.0, "coverage": 0.0} - # Standardize using real data stats scaler = StandardScaler() real_s = scaler.fit_transform(real) synth_s = scaler.transform(synthetic) - # k-th neighbor distances within each set (defines manifold radius) - nn_real = NearestNeighbors(n_neighbors=k + 1).fit(real_s) - real_dists, _ = nn_real.kneighbors(real_s) - real_radii = real_dists[:, -1] - - nn_synth = NearestNeighbors(n_neighbors=k + 1).fit(synth_s) - synth_dists, _ = nn_synth.kneighbors(synth_s) - synth_radii = synth_dists[:, -1] - - # Cross-set: real -> nearest synthetic - nn_synth_1 = NearestNeighbors(n_neighbors=1).fit(synth_s) - real_to_synth_dist, _ = nn_synth_1.kneighbors(real_s) - real_to_synth_dist = real_to_synth_dist[:, 0] - - # Cross-set: synthetic -> nearest real - nn_real_1 = NearestNeighbors(n_neighbors=1).fit(real_s) - synth_to_real_dist, _ = nn_real_1.kneighbors(synth_s) - synth_to_real_dist = synth_to_real_dist[:, 0] - - coverage = float((real_to_synth_dist <= real_radii).mean()) - # In the k-NN formulation, recall and coverage are identical. - # We keep recall in the output dict for API compatibility. 
-    recall = coverage
-    precision = float((synth_to_real_dist <= synth_radii).mean())
-
-    # Density: average real points in each synthetic ball, normalized by k
-    # Use vectorized approach instead of loop for speed
-    from sklearn.metrics import pairwise_distances
-
-    # For large data, sample to avoid O(n^2) memory
-    max_density_samples = 2000
-    if len(synth_s) > max_density_samples:
-        rng = np.random.RandomState(42)
-        idx = rng.choice(len(synth_s), max_density_samples, replace=False)
-        synth_sample = synth_s[idx]
-        radii_sample = synth_radii[idx]
-    else:
-        synth_sample = synth_s
-        radii_sample = synth_radii
-
-    dists = pairwise_distances(synth_sample, real_s)
-    counts = (dists <= radii_sample[:, None]).sum(axis=1)
-    density = float(counts.mean() / k)
+    metrics = _prdc(real_s, synth_s, nearest_k=k)
 
     return {
-        "precision": precision,
-        "recall": recall,
-        "density": density,
-        "coverage": coverage,
+        "precision": float(metrics["precision"]),
+        "recall": float(metrics["recall"]),
+        "density": float(metrics["density"]),
+        "coverage": float(metrics["coverage"]),
     }
diff --git a/src/microplex/eval/reweighting_benchmark.py b/src/microplex/eval/reweighting_benchmark.py
index 38303e1..39b2a4a 100644
--- a/src/microplex/eval/reweighting_benchmark.py
+++ b/src/microplex/eval/reweighting_benchmark.py
@@ -367,7 +367,7 @@ def get_default_reweighting_methods() -> list:
     # Add HardConcrete if l0-python is available
     try:
         import l0
-        methods.append(HardConcreteMethod(lambda_l0=1e-5, epochs=500))
+        methods.append(HardConcreteMethod(lambda_l0=1e-4, epochs=2000))
     except ImportError:
         pass
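For reviewers unfamiliar with PRDC, the sketch below re-derives the four metrics in plain NumPy alongside the per-record arrays (covered_mask, distances, nearest_indices) that the patched compute_prdc retains. It follows the definitions in Naeem et al. (2020); the toy data and the helper knn_radii are illustrative only, and it uses `<=` at the ball boundary as the patched covered_mask does (the prdc library uses strict `<`, which coincides for tie-free continuous data).

```python
import numpy as np

rng = np.random.default_rng(0)
real_s = rng.normal(size=(100, 3))    # stand-in for standardized real data
synth_s = rng.normal(size=(120, 3))   # stand-in for standardized synthetic data
k = 5

def knn_radii(x: np.ndarray, k: int) -> np.ndarray:
    """Distance from each row to its k-th nearest other row (self excluded)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]   # column 0 is the self-distance (0)

real_radii = knn_radii(real_s, k)
synth_radii = knn_radii(synth_s, k)

# Cross-set distance matrix: rows index real points, columns synthetic ones.
cross = np.linalg.norm(real_s[:, None, :] - synth_s[None, :, :], axis=-1)

# Per-record arrays kept by the patched compute_prdc:
distances = cross.min(axis=1)           # each real point -> nearest synthetic
nearest_indices = cross.argmin(axis=1)  # index of that nearest synthetic point
covered_mask = distances <= real_radii

# The four scalars the prdc library computes:
in_real_ball = cross <= real_radii[:, None]   # synthetic j inside real i's ball?
precision = in_real_ball.any(axis=0).mean()   # synthetic points on real manifold
recall = (cross <= synth_radii[None, :]).any(axis=1).mean()  # real on synth manifold
density = in_real_ball.sum(axis=0).mean() / k
coverage = covered_mask.mean()
```

With real and synthetic samples drawn from the same distribution, all four scalars come out close to 1. Note that recall tests real points against the synthetic manifold's radii while coverage uses each real point's own radius, which is why the two are distinct metrics. This inline version is O(n²) in memory and intended only to document the definitions, not to replace the library call.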