Missing data can be challenging to handle and can negatively impact model performance.
IterativeImputer from scikit-learn provides a robust way to fill in missing values by modeling each feature that has missing values as a function of the other features.
It works in a round-robin fashion: each feature with missing values is imputed in turn, the freshly imputed values are used when fitting the model for the next feature, and the process repeats until convergence or until the iteration limit is reached.
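As a minimal sketch of the round-robin idea, the regressor used per feature can be swapped out via the estimator parameter (the default is BayesianRidge); the tiny array and RandomForestRegressor choice below are illustrative, not from the example later in this article:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# toy data: two roughly linearly related features with one gap each
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# any scikit-learn regressor can serve as the per-feature model
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

After fit_transform, every cell holds a value; the imputed entries come from the round-robin regression fits rather than a simple column statistic.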
Key hyperparameters include max_iter (the maximum number of imputation rounds), random_state (random seed for reproducibility), and initial_strategy (the strategy used to initialize the missing values, e.g., 'mean').
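A quick sketch of these hyperparameters in use (the small array here is illustrative; 'median' is one of the accepted initial_strategy options alongside 'mean', 'most_frequent', and 'constant'):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [8.0, 7.0, 12.0]])

# start each missing cell at the column median, then refine iteratively
imputer = IterativeImputer(max_iter=10, initial_strategy="median", random_state=42)
X_filled = imputer.fit_transform(X)
print(X_filled)
```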
The IterativeImputer class is still experimental, so enable_iterative_imputer must be imported from sklearn.experimental before it can be used.
This method is appropriate for data preprocessing in datasets with missing values.
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
# generate synthetic dataset
X, y = make_classification(n_samples=20, n_features=10, random_state=1)
# introduce missing values
rng = np.random.RandomState(0)
X[rng.randint(20, size=20), rng.randint(10, size=20)] = np.nan
# display dataset with missing values
print("Dataset with missing values:\n", X[:5])
# create imputer
imputer = IterativeImputer(max_iter=10, random_state=0)
# fit and transform the dataset
X_imputed = imputer.fit_transform(X)
# display dataset after imputation
print("Dataset after imputation:\n", X_imputed[:5])
Running the example gives an output like:
Dataset with missing values:
[[ 0.24879916 -0.29664115 -0.18339999 nan 0.49521132 0.59580588
-0.17470316 -0.51709446 1.48686048 -0.99702683]
[-0.02461696 -0.77516162 -1.20095743 nan 1.27375593 -2.02051088
1.96710175 1.12141771 0.94490844 0.40890054]
[-1.23005814 0.5505375 -0.53783907 0.89353275 0.79280687 -0.34831546
-0.62353073 -0.1809203 1.24939741 -0.60392063]
[ nan -0.11905419 0.00865963 nan 0.01740941 1.14322724
-1.12201873 0.44136444 1.66872712 -0.10015523]
[ 1.62765075 0.3380117 -1.21812012 1.28101569 -1.19926803 -1.20825938
0.86334532 -1.85798186 nan 1.23616403]]
Dataset after imputation:
[[ 0.24879916 -0.29664115 -0.18339999 1.57050966 0.49521132 0.59580588
-0.17470316 -0.51709446 1.48686048 -0.99702683]
[-0.02461696 -0.77516162 -1.20095743 -0.20551979 1.27375593 -2.02051088
1.96710175 1.12141771 0.94490844 0.40890054]
[-1.23005814 0.5505375 -0.53783907 0.89353275 0.79280687 -0.34831546
-0.62353073 -0.1809203 1.24939741 -0.60392063]
[ 0.29318898 -0.11905419 0.00865963 2.00031291 0.01740941 1.14322724
-1.12201873 0.44136444 1.66872712 -0.10015523]
[ 1.62765075 0.3380117 -1.21812012 1.28101569 -1.19926803 -1.20825938
0.86334532 -1.85798186 2.20410684 1.23616403]]
- Generate a synthetic dataset with the make_classification() function, specifying the number of samples and features.
- Introduce missing values at random positions in the dataset.
- Display the dataset with missing values to understand the extent and distribution of missing data.
- Create an IterativeImputer instance with hyperparameters such as max_iter and random_state.
- Fit the imputer on the dataset and transform it to fill in the missing values.
- Display the dataset after imputation to observe the changes and the effectiveness of the imputation.
This example demonstrates how to use IterativeImputer to handle missing values in a dataset, showcasing the practical steps and transformations involved in the imputation process using scikit-learn.
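A fitted imputer can also be reused on new data via transform(), which is how you would avoid leakage across a train/test split. The toy arrays below are illustrative, not taken from the example above:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# fit on training data only
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, np.nan]])
imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)

# impute a new sample using the already-fitted model
X_new = np.array([[np.nan, 25.0]])
print(imputer.transform(X_new))
```

Because transform() reuses the regressors learned during fit(), test-set statistics never influence the imputation model.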