Missing data can be challenging to handle and can negatively impact model performance.
IterativeImputer from scikit-learn provides a robust way to fill in missing values by modeling each feature that has missing values as a function of the other features.
It works in a round-robin fashion: each feature with missing values is imputed in turn, the freshly imputed values are used when fitting the model for the next feature, and the process repeats until convergence or until the iteration limit is reached.
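As a minimal sketch of the round-robin idea, the regressor used per feature can be swapped out via the estimator parameter (the default is BayesianRidge); the tiny array and RandomForestRegressor choice below are illustrative, not from the example later in this article:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# toy data: two roughly linearly related features with one gap each
X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [4.0, 8.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# any scikit-learn regressor can serve as the per-feature model
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

After fit_transform, every cell holds a value; the imputed entries come from the round-robin regression fits rather than a simple column statistic.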
Key hyperparameters include max_iter (the maximum number of imputation rounds), random_state (random seed for reproducibility), and initial_strategy (the strategy used to initialize the missing values, e.g., 'mean').
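A quick sketch of these hyperparameters in use (the small array here is illustrative; 'median' is one of the accepted initial_strategy options alongside 'mean', 'most_frequent', and 'constant'):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [8.0, 7.0, 12.0]])

# start each missing cell at the column median, then refine iteratively
imputer = IterativeImputer(max_iter=10, initial_strategy="median", random_state=42)
X_filled = imputer.fit_transform(X)
print(X_filled)
```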
The IterativeImputer class is still experimental, so enable_iterative_imputer must be imported from sklearn.experimental before it can be used.
This method is appropriate for data preprocessing in datasets with missing values.
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
# generate synthetic dataset
X, y = make_classification(n_samples=20, n_features=10, random_state=1)
# introduce missing values
rng = np.random.RandomState(0)
X[rng.randint(20, size=20), rng.randint(10, size=20)] = np.nan
# display dataset with missing values
print("Dataset with missing values:\n", X[:5])
# create imputer
imputer = IterativeImputer(max_iter=10, random_state=0)
# fit and transform the dataset
X_imputed = imputer.fit_transform(X)
# display dataset after imputation
print("Dataset after imputation:\n", X_imputed[:5])
Running the example gives an output like:
Dataset with missing values:
[[ 0.24879916 -0.29664115 -0.18339999 nan 0.49521132 0.59580588
-0.17470316 -0.51709446 1.48686048 -0.99702683]
[-0.02461696 -0.77516162 -1.20095743 nan 1.27375593 -2.02051088
1.96710175 1.12141771 0.94490844 0.40890054]
[-1.23005814 0.5505375 -0.53783907 0.89353275 0.79280687 -0.34831546
-0.62353073 -0.1809203 1.24939741 -0.60392063]
[ nan -0.11905419 0.00865963 nan 0.01740941 1.14322724
-1.12201873 0.44136444 1.66872712 -0.10015523]
[ 1.62765075 0.3380117 -1.21812012 1.28101569 -1.19926803 -1.20825938
0.86334532 -1.85798186 nan 1.23616403]]
Dataset after imputation:
[[ 0.24879916 -0.29664115 -0.18339999 1.57050966 0.49521132 0.59580588
-0.17470316 -0.51709446 1.48686048 -0.99702683]
[-0.02461696 -0.77516162 -1.20095743 -0.20551979 1.27375593 -2.02051088
1.96710175 1.12141771 0.94490844 0.40890054]
[-1.23005814 0.5505375 -0.53783907 0.89353275 0.79280687 -0.34831546
-0.62353073 -0.1809203 1.24939741 -0.60392063]
[ 0.29318898 -0.11905419 0.00865963 2.00031291 0.01740941 1.14322724
-1.12201873 0.44136444 1.66872712 -0.10015523]
[ 1.62765075 0.3380117 -1.21812012 1.28101569 -1.19926803 -1.20825938
0.86334532 -1.85798186 2.20410684 1.23616403]]
- Generate a synthetic dataset with the make_classification() function, specifying the number of samples and features.
- Introduce missing values at random positions in the dataset.
- Display the dataset with missing values to understand the extent and distribution of missing data.
- Create an IterativeImputer instance with hyperparameters such as max_iter and random_state.
- Fit the imputer on the dataset and transform it to fill in the missing values.
- Display the dataset after imputation to observe the changes and the effectiveness of the imputation.
This example demonstrates how to use IterativeImputer to handle missing values in a dataset, showcasing the practical steps and transformations involved in the imputation process using scikit-learn.
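A fitted imputer can also be reused on new data via transform(), which is how you would avoid leakage across a train/test split. The toy arrays below are illustrative, not taken from the example above:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# fit on training data only
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0],
                    [4.0, np.nan]])
imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)

# impute a new sample using the already-fitted model
X_new = np.array([[np.nan, 25.0]])
print(imputer.transform(X_new))
```

Because transform() reuses the regressors learned during fit(), test-set statistics never influence the imputation model.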