2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@ Microimpute is a Python package for imputing variables from one survey dataset o
- **Statistical Matching**: distance-based matching to find similar donor observations
- **Ordinary Least Squares (OLS)**: linear regression imputation
- **Quantile Regression**: models conditional quantiles instead of the conditional mean
- **Quantile Random Forests (QRF)**: non-parametric, tree-based quantile estimation
- **Quantile Regression Forests (QRF)**: non-parametric, tree-based quantile estimation
- **Mixture Density Networks (MDN)**: neural network with a Gaussian mixture output

## Autoimpute
2 changes: 1 addition & 1 deletion docs/index.md
@@ -5,7 +5,7 @@ Microimpute is a Python package for imputing variables from one survey dataset o
The package currently supports:
- Hot Deck Matching
- Ordinary Least Squares (OLS) Linear Regression
- Quantile Random Forests (QRF)
- Quantile Regression Forests (QRF)
- Quantile Regression
- Mixture Density Networks (MDN)

4 changes: 2 additions & 2 deletions docs/models/qrf/index.md
@@ -1,4 +1,4 @@
# Quantile Random Forests
# Quantile Regression Forests

The `QRF` model uses an ensemble of decision trees to predict different quantiles of the target variable distribution. This allows it to model non-linear relationships while estimating uncertainty across the conditional distribution.

@@ -8,7 +8,7 @@ QRF handles both numerical and categorical variables. For numerical targets, it

## How it works

Quantile Random Forests build on standard random forests using the `quantile_forest` package. The method constructs an ensemble of decision trees, each trained on a bootstrapped sample of the data (bagging). At each split, only a random subset of features is considered, which introduces diversity among trees and reduces overfitting.
Quantile Regression Forests build on standard random forests using the `quantile_forest` package. The method constructs an ensemble of decision trees, each trained on a bootstrapped sample of the data (bagging). At each split, only a random subset of features is considered, which introduces diversity among trees and reduces overfitting.

Unlike standard random forests that aggregate predictions into averages, QRF retains the full predictive distribution from each tree and estimates quantiles directly from this empirical distribution.
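The pooling idea described above can be sketched with a toy example (hypothetical leaf contents, standard library only — not the `quantile_forest` internals): a plain random forest averages per-tree predictions, while QRF pools the raw training targets stored in the matched leaves and reads quantiles off that empirical distribution.

```python
import statistics

# Hypothetical leaf contents for one query point: the training targets
# stored in the leaf the point reaches in each of three trees.
leaf_targets = [
    [10.0, 12.0, 11.5],  # tree 1
    [9.0, 13.0],         # tree 2
    [11.0, 12.5, 10.5],  # tree 3
]

# A standard random forest aggregates per-tree means into one average.
mean_pred = statistics.mean(statistics.mean(t) for t in leaf_targets)

# QRF instead pools the raw targets and estimates quantiles directly
# from the pooled empirical distribution.
pooled = sorted(v for tree in leaf_targets for v in tree)
q50 = statistics.median(pooled)
deciles = statistics.quantiles(pooled, n=10)
q10, q90 = deciles[0], deciles[-1]

print(q10, q50, q90)
```

Retaining the pooled distribution is what lets the model report an interval such as Q10–Q90 for each record rather than a single point estimate.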

12 changes: 6 additions & 6 deletions docs/models/qrf/qrf-imputation.ipynb
@@ -4,13 +4,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Quantile Random Forest (QRF) imputation\n",
"# Quantile Regression Forest (QRF) imputation\n",
"\n",
"This notebook demonstrates how to use Microimpute's `QRF` imputer to impute values using Quantile Random Forests. QRF extends traditional random forests to predict the entire conditional distribution of a target variable.\n",
"This notebook demonstrates how to use Microimpute's `QRF` imputer to impute values using Quantile Regression Forests. QRF extends traditional random forests to predict the entire conditional distribution of a target variable.\n",
"\n",
"## Variable type support\n",
"\n",
"The QRF model automatically handles both numerical and categorical variables. For numerical targets, it applies quantile random forests. For categorical targets (strings, booleans, or numerically-encoded categorical variables), it switches to using a random forest classifier. This automatic adaptation happens internally without requiring any manual configuration.\n",
"The QRF model automatically handles both numerical and categorical variables. For numerical targets, it applies quantile regression forests. For categorical targets (strings, booleans, or numerically-encoded categorical variables), it switches to using a random forest classifier. This automatic adaptation happens internally without requiring any manual configuration.\n",
"\n",
"The QRF model supports sequential imputation with a single object and workflow. Pass a list of `imputed_variables` with all variables you want to impute, and the model imputes them sequentially. This means that previously imputed variables will serve as predictors for subsequent variables, capturing complex dependencies between the imputed variables.\n",
"\n",
@@ -584,7 +584,7 @@
"qrf_imputer = QRF()\n",
"\n",
"# Fit the model with our training data\n",
"# This trains a quantile random forest model\n",
"# This trains a quantile regression forest model\n",
"fitted_qrf_imputer = qrf_imputer.fit(\n",
" X_train,\n",
" predictors,\n",
@@ -2023,7 +2023,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This scatter plot compares actual observed values with those imputed by a Quantile Random Forest (QRF) model, providing a visual assessment of imputation accuracy. Each point represents a data record, with the x-axis showing the true value and the y-axis showing the model’s predicted value. The red dashed line represents the ideal 1:1 relationship, where predictions perfectly match actual values. Most points cluster around this line, suggesting that the QRF model effectively captures the underlying structure of the data. Importantly, the model does not appear to systematically over- or under-predict across the range, and while performance at the extremes may be weaker, the overall pattern indicates that QRF provides a reasonably accurate and unbiased approach to imputing missing values. Additionally, it is important to consider the characteristics of the diabetes dataset, which seems to show a strong linear relationship between predictors and the imputed variable. QRF's behavior suggests strength in accurately imputing variables for datasets when such linearity assumptions do not hold."
"This scatter plot compares actual observed values with those imputed by a Quantile Regression Forest (QRF) model, providing a visual assessment of imputation accuracy. Each point represents a data record, with the x-axis showing the true value and the y-axis showing the model’s predicted value. The red dashed line represents the ideal 1:1 relationship, where predictions perfectly match actual values. Most points cluster around this line, suggesting that the QRF model effectively captures the underlying structure of the data. Importantly, the model does not appear to systematically over- or under-predict across the range, and while performance at the extremes may be weaker, the overall pattern indicates that QRF provides a reasonably accurate and unbiased approach to imputing missing values. Additionally, it is important to consider the characteristics of the diabetes dataset, which seems to show a strong linear relationship between predictors and the imputed variable. QRF's behavior suggests strength in accurately imputing variables for datasets when such linearity assumptions do not hold."
]
},
{
@@ -3636,7 +3636,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This plot visualizes the prediction intervals produced by the Quantile Random Forest (QRF) model for imputing total serum cholesterol values across ten data records. Each vertical bar represents an 80% (light gray) or 40% (dark gray) prediction interval, capturing the model's estimated range of plausible values based on the Q10–Q90 and Q30–Q70 quantiles, respectively. Red dots mark the model's median predictions (Q50), while black dots show the actual observed values. In most cases, the true values fall within the wider intervals, indicating that the QRF model is appropriately capturing uncertainty in its imputation. The fact that the intervals are sometimes asymmetrical around the median reflects the model’s flexibility in estimating skewed or heteroskedastic distributions. Overall, the plot demonstrates that the QRF model not only provides accurate point estimates but also yields informative prediction intervals that account for uncertainty in the imputed values."
"This plot visualizes the prediction intervals produced by the Quantile Regression Forest (QRF) model for imputing total serum cholesterol values across ten data records. Each vertical bar represents an 80% (light gray) or 40% (dark gray) prediction interval, capturing the model's estimated range of plausible values based on the Q10–Q90 and Q30–Q70 quantiles, respectively. Red dots mark the model's median predictions (Q50), while black dots show the actual observed values. In most cases, the true values fall within the wider intervals, indicating that the QRF model is appropriately capturing uncertainty in its imputation. The fact that the intervals are sometimes asymmetrical around the median reflects the model’s flexibility in estimating skewed or heteroskedastic distributions. Overall, the plot demonstrates that the QRF model not only provides accurate point estimates but also yields informative prediction intervals that account for uncertainty in the imputed values."
]
},
{
2 changes: 1 addition & 1 deletion microimpute/models/__init__.py
@@ -6,7 +6,7 @@

Available models:
- OLS: ordinary least squares regression with bootstrapped quantiles
- QRF: quantile random forest for non-parametric quantile regression
- QRF: quantile regression forest for non-parametric quantile regression
- QuantReg: linear quantile regression model
- Matching: statistical matching/hot-deck imputation (optional, requires rpy2)
- MDN: Mixture Density Network for probabilistic imputation
24 changes: 24 additions & 0 deletions microimpute/models/mdn.py
@@ -268,6 +268,12 @@ def fit(
col for col in X.columns.tolist() if col not in categorical_cols
]

# Cast continuous columns to float64 to avoid pandas 3.x
# LossySetitemError when pytorch_tabular's scaler writes
# normalized float values back into integer-typed columns.
for col in continuous_cols:
train_data[col] = train_data[col].astype("float64")

# Configure data
data_config = DataConfig(
target=[y.name],
@@ -351,6 +357,12 @@ def predict(self, X: pd.DataFrame, n_samples: int = 1) -> np.ndarray:
# Put model in eval mode
self.model.model.eval()

# Cast continuous columns to float64 for pandas 3.x compat
X = X.copy()
for col in self.model.config.continuous_cols:
if col in X.columns:
X[col] = X[col].astype("float64")

# Create inference dataloader
test_loader = self.model.datamodule.prepare_inference_dataloader(X)

@@ -466,6 +478,12 @@ def fit(
col for col in X.columns.tolist() if col not in categorical_cols
]

# Cast continuous columns to float64 to avoid pandas 3.x
# LossySetitemError when pytorch_tabular's scaler writes
# normalized float values back into integer-typed columns.
for col in continuous_cols:
train_data[col] = train_data[col].astype("float64")

# Configure data
data_config = DataConfig(
target=[y.name],
@@ -541,6 +559,12 @@ def predict(
Predicted values as Series, or dict with probabilities if
return_probs=True.
"""
# Cast continuous columns to float64 for pandas 3.x compat
X = X.copy()
for col in self.model.config.continuous_cols:
if col in X.columns:
X[col] = X[col].astype("float64")

# Get predictions with probabilities
preds_df = self.model.predict(X, ret_logits=False)

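The casts added in this file guard against pandas refusing lossy writes of normalized floats into integer-typed columns (per the `LossySetitemError` noted in the diff comments). The pattern can be sketched in isolation — a toy DataFrame with hypothetical column names, not microimpute's actual data:

```python
import pandas as pd

# Toy frame whose continuous column starts out integer-typed.
train_data = pd.DataFrame({"age": [25, 40, 31]})
continuous_cols = ["age"]

# Cast up front so a scaler writing normalized float values back into
# the column cannot hit a lossy integer-assignment error.
for col in continuous_cols:
    train_data[col] = train_data[col].astype("float64")

# Standardize in place; the column is already float64, so this is safe.
train_data["age"] = (
    train_data["age"] - train_data["age"].mean()
) / train_data["age"].std()
```

The same cast is applied again at predict time because inference data arrives as a fresh DataFrame that may carry the original integer dtypes.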
2 changes: 1 addition & 1 deletion microimpute/models/qrf.py
@@ -482,7 +482,7 @@ class QRF(Imputer):
"""
Quantile Regression Forest model for imputation.

This model uses a Quantile Random Forest to predict quantiles.
This model uses a Quantile Regression Forest to predict quantiles.
The underlying QRF implementation is from the quantile_forest package.
"""

16 changes: 8 additions & 8 deletions paper/bibliography/references.bib
@@ -17,14 +17,14 @@ @article{bishop1994mixture
year = {1994}
}

@incollection{bourguignon2006microsimulation,
title = {Microsimulation as a tool for evaluating redistribution policies},
author = {Bourguignon, Fran{\c{c}}ois and Spadaro, Amedeo},
booktitle = {Journal of Economic Inequality},
volume = {4},
number = {1},
pages = {77--106},
year = {2006},
@article{bourguignon2006microsimulation,
title = {Microsimulation as a tool for evaluating redistribution policies},
author = {Bourguignon, Fran{\c{c}}ois and Spadaro, Amedeo},
journal = {Journal of Economic Inequality},
volume = {4},
number = {1},
pages = {77--106},
year = {2006},
publisher = {Springer}
}

Binary file modified paper/figures/models_dist_comparison.png
Binary file modified paper/figures/models_ssi_reform_comparison.png