py-statmatch

Python implementation of R's StatMatch package for statistical matching and data fusion, plus advanced methods not available in R.

Overview

py-statmatch provides tools for statistical matching (also known as data fusion or synthetic data matching) between different datasets. The package started as a Python port of the R package StatMatch and now also includes methods that the R package does not offer, such as multiple imputation, propensity-score matching, optimal transport, and match diagnostics.

Features

Core R StatMatch Functions (21 functions)

All functions produce identical results to R's StatMatch package.

Hot Deck Matching
- nnd_hotdeck: Nearest Neighbor Distance Hot Deck
- rand_hotdeck: Random selection from k-nearest donors
- rank_nnd_hotdeck: ECDF-based rank matching
- create_fused: Create synthetic fused datasets

Distance Functions
- gower_dist: Gower's distance for mixed-type data
- mahalanobis_dist: Covariance-adjusted distance
- maximum_dist: Chebyshev/L-infinity distance

Frechet Bounds
- frechet_bounds_cat: Bounds for categorical data
- fbwidths_by_x: Bounds for all X variable subsets
- p_bayes: Pseudo-Bayes estimation

Comparison & Plotting
- comp_cont, comp_prop, pw_assoc
- plot_bounds, plot_cont, plot_tab

Sample Utilities
- comb_samples, harmonize_x, fact2dummy
- mixed_mtc, sel_mtc_by_unc
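
To illustrate the idea behind Gower's distance for mixed-type data, the sketch below averages a range-scaled absolute difference for numeric variables and a 0/1 mismatch for categorical ones. It is a minimal plain-Python sketch and not the package's gower_dist implementation; the function name gower_pair, its arguments, and the example records are hypothetical.

```python
def gower_pair(rec, don, ranges):
    """Gower dissimilarity between two records given as dicts.

    `ranges` maps each numeric variable to its observed range (max - min),
    which is assumed to be non-zero. Variables not listed there are
    treated as categorical.
    """
    total = 0.0
    for var, rec_value in rec.items():
        don_value = don[var]
        if var in ranges:
            # Numeric variable: absolute difference scaled by the range.
            total += abs(rec_value - don_value) / ranges[var]
        else:
            # Categorical variable: 0 if the categories match, 1 otherwise.
            total += 0.0 if rec_value == don_value else 1.0
    # Average the per-variable contributions.
    return total / len(rec)


# Hypothetical records with one numeric and one categorical variable.
recipient = {"age": 28, "sex": "F"}
donor = {"age": 38, "sex": "M"}
distance = gower_pair(recipient, donor, ranges={"age": 20})
print(distance)
```

Because every per-variable contribution lies between 0 and 1 when the values fall inside the stated range, numeric and categorical variables contribute on a comparable scale.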

Advanced Methods (Beyond R)

Multiple Imputation (mi_nnd_hotdeck, combine_mi_estimates)
- Generate m imputed datasets with proper uncertainty quantification
- Rubin's combining rules for valid inference

ML-Based Propensity Matching (propensity_hotdeck)
- Gradient Boosting, Random Forest, Neural Network, Logistic
- Caliper matching support

Optimal Transport (ot_hotdeck, wasserstein_dist)
- Globally optimal matching via Earth Mover's Distance
- Entropy-regularized Sinkhorn for efficiency

Bayesian Uncertainty (bayesian_match, credible_interval)
- Posterior inference on matched values
- CIA (Conditional Independence Assumption) testing

Embedding Distance (learn_embeddings, embedding_dist)
- Target encoding and SVD for high-cardinality categoricals
- Better handling of complex categorical relationships

Survey Weights (calibrate_weights, design_effect, replicate_variance)
- Complex survey design support
- Weight calibration via iterative proportional fitting

Diagnostics Dashboard (match_diagnostics, love_plot)
- Balance tables, SMD calculation
- HTML report generation
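
The weight calibration item above mentions iterative proportional fitting (raking). The sketch below shows the technique on a small two-way table: rows and columns are rescaled in turn until both sets of margins match their targets. It is a plain-Python illustration under the assumption of positive seed cells and targets with equal totals; the function name ipf_2d and its parameters are hypothetical and are not the calibrate_weights API.

```python
def ipf_2d(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rake a 2-D table of weights so its margins match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed table
    for _ in range(max_iter):
        # Scale each row so that row sums match the row targets.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [cell * target / row_sum for cell in table[i]]
        # Scale each column so that column sums match the column targets.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
        # The column step can disturb the row sums, so check them again.
        max_gap = max(
            abs(sum(table[i]) - target) for i, target in enumerate(row_targets)
        )
        if max_gap < tol:
            break
    return table


# Hypothetical uniform seed raked to row margins 30/70 and column margins 40/60.
seed = [[1.0, 1.0], [1.0, 1.0]]
raked = ipf_2d(seed, row_targets=[30.0, 70.0], col_targets=[40.0, 60.0])
for row in raked:
    print(row)
```

In survey practice the cells would hold summed unit weights within categories, and each unit's weight would be multiplied by the adjustment factor of its cell.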

Installation

pip install py-statmatch

For development:

git clone https://github.com/PolicyEngine/py-statmatch.git
cd py-statmatch
pip install -e ".[dev]"

Quick Start

Basic Matching

import pandas as pd
from statmatch import nnd_hotdeck, create_fused

# Donor dataset (has X and Y variables)
donors = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'income': [30000, 45000, 55000, 65000, 80000],
    'satisfaction': [7, 8, 6, 9, 8]  # Variable to donate
})

# Recipient dataset (has X but missing Y)
recipients = pd.DataFrame({
    'age': [28, 33, 42],
    'income': [35000, 50000, 70000]
})

# Perform matching
result = nnd_hotdeck(
    data_rec=recipients,
    data_don=donors,
    match_vars=['age', 'income']
)

# Create fused dataset
fused = create_fused(
    data_rec=recipients,
    data_don=donors,
    mtc_ids=result['mtc.ids'],
    z_vars=['satisfaction']
)
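
To make the matching step above more concrete, here is a minimal plain-Python sketch of nearest neighbor hot deck on the same toy data. It assumes a Manhattan distance on variables scaled by their range among the donors; that choice, and the helper name scaled_manhattan, are illustrative assumptions, so the package's own distance function and tie-breaking may differ.

```python
donors = [
    {"age": 25, "income": 30000, "satisfaction": 7},
    {"age": 30, "income": 45000, "satisfaction": 8},
    {"age": 35, "income": 55000, "satisfaction": 6},
    {"age": 40, "income": 65000, "satisfaction": 9},
    {"age": 45, "income": 80000, "satisfaction": 8},
]
recipients = [
    {"age": 28, "income": 35000},
    {"age": 33, "income": 50000},
    {"age": 42, "income": 70000},
]
match_vars = ["age", "income"]

# Range of each matching variable among donors, so that income (large
# numbers) does not dominate age (small numbers) in the distance.
ranges = {
    var: max(d[var] for d in donors) - min(d[var] for d in donors)
    for var in match_vars
}


def scaled_manhattan(rec, don):
    """Sum of range-scaled absolute differences over the matching variables."""
    return sum(abs(rec[var] - don[var]) / ranges[var] for var in match_vars)


# For each recipient, pick the index of the closest donor.
matches = [
    min(range(len(donors)), key=lambda j: scaled_manhattan(rec, donors[j]))
    for rec in recipients
]

# Donate the satisfaction value of the matched donor to each recipient.
fused = [
    dict(rec, satisfaction=donors[j]["satisfaction"])
    for rec, j in zip(recipients, matches)
]
print(matches)
print(fused)
```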

Multiple Imputation (Proper Uncertainty)

from statmatch import mi_nnd_hotdeck, mi_create_fused, mi_summary

# Generate 5 imputed datasets
mi_results = mi_nnd_hotdeck(
    data_rec=recipients,
    data_don=donors,
    match_vars=['age', 'income'],
    m=5
)

# Create fused datasets
fused_datasets = mi_create_fused(
    data_rec=recipients,
    data_don=donors,
    mi_results=mi_results,
    z_vars=['satisfaction']
)

# Get summary with confidence intervals
summary = mi_summary(fused_datasets, 'satisfaction')
print(summary)  # estimate, std_error, ci_lower, ci_upper
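
For readers who want to see what combining multiply imputed estimates involves, the sketch below applies Rubin's rules to a list of per-dataset point estimates and their variances. The total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name rubin_combine and the input numbers are hypothetical; this is not the code behind mi_summary.

```python
from math import sqrt
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Pool m point estimates and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # average within-imputation variance
    b = variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total, sqrt(total)


# Hypothetical estimates of a mean from m = 3 imputed datasets.
estimate, total_variance, std_error = rubin_combine(
    estimates=[7.0, 7.5, 8.0],
    variances=[0.5, 0.5, 0.5],
)
print(estimate, total_variance, std_error)
```

The between-imputation term is what a single imputed dataset cannot provide: it captures how much the estimate changes across plausible imputations.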

ML Propensity Matching

from statmatch import propensity_hotdeck

result = propensity_hotdeck(
    data_rec=recipients,
    data_don=donors,
    match_vars=['age', 'income'],
    estimator='gbm',  # or 'random_forest', 'neural_net', 'logistic'
    caliper=0.1       # optional: max propensity score distance
)
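
The caliper idea can be sketched independently of the estimator. Assuming propensity scores have already been estimated for both files, each recipient takes the donor with the closest score, and the match is discarded when that distance exceeds the caliper. The function name caliper_match and the scores below are hypothetical; how propensity_hotdeck treats unmatched recipients is not specified here.

```python
def caliper_match(rec_scores, don_scores, caliper):
    """Return, per recipient, the index of the nearest donor or None."""
    matches = []
    for score in rec_scores:
        # Index of the donor whose score is closest to this recipient's.
        nearest = min(
            range(len(don_scores)),
            key=lambda j: abs(score - don_scores[j]),
        )
        # Keep the match only if it lies within the caliper.
        within = abs(score - don_scores[nearest]) <= caliper
        matches.append(nearest if within else None)
    return matches


# Hypothetical propensity scores for three recipients and four donors.
rec_scores = [0.30, 0.52, 0.90]
don_scores = [0.25, 0.50, 0.60, 0.70]
caliper_matches = caliper_match(rec_scores, don_scores, caliper=0.1)
print(caliper_matches)
```

A tighter caliper trades coverage for match quality: fewer recipients are matched, but those that are matched have closer donors.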

Match Quality Diagnostics

from statmatch import match_diagnostics

diag = match_diagnostics(
    result=result,
    data_rec=recipients,
    data_don=donors,
    match_vars=['age', 'income']
)

# View balance table
print(diag.balance_table())

# Generate HTML report
diag.to_html('match_report.html')

# Love plot visualization
diag.love_plot()
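
As background for reading a balance table, the sketch below computes a standardized mean difference (SMD) for one matching variable, using the common definition that divides the difference in means by the square root of the average of the two sample variances. The function name and the matched donor ages are hypothetical, and the exact SMD variant used by match_diagnostics is an assumption.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(a) + variance(b)) / 2)
    return (mean(a) - mean(b)) / pooled_sd


# Ages of the recipients and of the donors they were matched to
# (the matched donors here are hypothetical).
recipient_ages = [28.0, 33.0, 42.0]
matched_donor_ages = [25.0, 35.0, 40.0]
smd = standardized_mean_difference(recipient_ages, matched_donor_ages)
print(round(smd, 3))
```

Values close to zero indicate that the matched donors resemble the recipients on that variable.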

Documentation

Full documentation: https://policyengine.github.io/py-statmatch/

Development

# Run tests
pytest

# Run with coverage
pytest --cov=statmatch

# Run R comparison tests (requires R + rpy2 + StatMatch)
pytest -k "against_r" -v

# Format code
black . -l 79

License

MIT License

Citation

@software{pystatmatch2024,
  title = {py-statmatch: Statistical matching in Python with advanced methods},
  author = {PolicyEngine},
  year = {2024},
  url = {https://github.com/PolicyEngine/py-statmatch}
}

Acknowledgments

Core matching functions are a Python port of the R StatMatch package by Marcello D'Orazio. Advanced methods (MI, propensity, OT, Bayesian, embeddings, diagnostics) are original contributions.
