microplex-evals is the separate evaluation workspace for Microplex benchmark science, paper writing, artifact curation, and review handoff.
It exists so the software repos stay focused on runtime code while evaluation work stays reproducible, inspectable, and citable.
| Repo | Owns | Does not own |
|---|---|---|
| microplex-evals | benchmark definitions, experiment matrices, evaluation-only scripts, repo-local artifacts, Quarto papers, review handoff docs | shared runtime abstractions, production synthesis/calibration code, country adapter runtime code |
| microplex | shared target specs, benchmark math, result/suite types, reweighting interfaces, shared artifact validation | country-specific PE execution, raw source parsing, paper/reporting workflows |
| microplex-us | US sources, PE-US execution/materialization, US-local pipelines, saved US runtime bundles | cross-country benchmark math, paper/reporting orchestration |
| microplex-uk | UK sources, PE-UK execution/materialization, UK-local pipelines | cross-country benchmark math, paper/reporting orchestration |
- Headline benchmark mission: improve downstream `microplex-us` performance on real target surfaces, with eventual PE-native loss as the primary ranking signal.
- First-class evaluation outcome: coverage and support across microdata types, not just scalar loss wins.
- Current paper workspace: `papers/imputation-benchmark-2026-03/`
- Current benchmark config: `benchmarks/us/imputation/imputation-benchmark-2026-03.toml`
- Clean method benchmark config: `benchmarks/us/imputation/method-benchmark-2026-03.toml`
- Method benchmark review handoff: `reviews/us-method-benchmark-2026-03.md`
- Zero-head benchmark config: `benchmarks/us/imputation/zi-head-benchmark-2026-03.toml`
- Zero-head benchmark review handoff: `reviews/us-zi-head-benchmark-2026-03.md`
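To make the shape of these configs concrete, here is a hypothetical sketch of what a file under `benchmarks/us/imputation/` might look like. Every table and key below (`[benchmark]`, `holdout_fraction`, `candidates`, and so on) is an illustrative assumption, not the repo's actual schema; the authoritative schema is whatever the real `.toml` files and their validation code define.

```toml
# Hypothetical benchmark config sketch -- keys are illustrative only.
[benchmark]
id = "method-benchmark-2026-03"
country = "us"
task = "imputation"

[data]
# Survey-specific holdout conditioning, as described for the clean method benchmark.
holdout_fraction = 0.2
seed = 20260301

[methods]
# Candidate adapters, matching the --methods flags shown in the commands below.
candidates = ["bootstrap_independent", "qrf", "zi_qrf"]

[outputs]
# Repo-local artifacts that paper claims can cite.
artifacts_dir = "artifacts/us/method-benchmark-2026-03"
```

A config like this keeps the experiment matrix durable and reviewable: changing a method list or holdout seed is a diff in the config, not an edit to a script.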
This repo is wired to sibling working copies through uv:

- `../microplex`
- `../microplex-us`

That keeps the eval repo lightweight while still letting scripts depend on the current local runtime code.
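In uv, this kind of sibling wiring is expressed with path sources in `pyproject.toml`. The snippet below is a sketch of that mechanism, not a copy of this repo's actual file; the exact entries here may differ.

```toml
# Sketch of uv path-source wiring to sibling working copies.
# The packages must also be listed under [project] dependencies.
[tool.uv.sources]
microplex = { path = "../microplex", editable = true }
microplex-us = { path = "../microplex-us", editable = true }
```

With `editable = true`, `uv sync` links the sibling checkouts directly, so evaluation scripts always run against the current local runtime code rather than a pinned published release.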
```sh
cd /Users/maxghenis/CosilicoAI/microplex-evals
uv sync
./scripts/render_us_imputation_paper
./scripts/plan_us_method_benchmark
./scripts/run_us_method_benchmark --methods bootstrap_independent qrf zi_qrf
./scripts/plan_us_zi_head_benchmark
./scripts/run_us_zi_head_benchmark
```

That command chain will:
- sync the configured saved results from sibling repos into repo-local artifacts
- build derived CSV/JSON tables and paper assets
- render the Quarto paper to `papers/imputation-benchmark-2026-03/rendered/index.html`
- write a durable plan for the clean US method benchmark
- run the clean method benchmark adapters against survey-specific holdout-conditioned draws
microplex-evals/
├── benchmarks/
├── papers/
├── artifacts/
├── scripts/
├── reviews/
├── AGENTS.md
├── _WORKSPACE.md
└── _BUILD_LOG.md
- Paper claims should cite repo-local saved artifacts under `artifacts/`.
- Scratch diagnostics can seed analysis, but they are not headline evidence unless the benchmark config explicitly allows it.
- If a result needs shared benchmark semantics or shared manifest validation, that logic belongs in `microplex`, not here.
The legacy benchmark implementation previously copied into this repo was intentionally deleted; Git history is the preservation layer for that code. The repo now carries a clean, from-scratch US method-benchmark scaffold built around explicit configs, artifact outputs, survey-specific holdout conditioning, and cleanly reimplemented benchmark adapters.