
Scikit-Learn SimpleImputer for Data Imputation

How to Use SimpleImputer

SimpleImputer handles missing values in a dataset by replacing them with a value chosen by a specified strategy, such as the column mean, median, or most frequent value. It is particularly useful for preprocessing data before applying machine learning algorithms, many of which cannot accept missing values.

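The key hyperparameter of SimpleImputer is strategy, which determines the replacement value. The available strategies include ‘mean’, ‘median’, ‘most_frequent’, and ‘constant’. ‘mean’ and ‘median’ apply only to numeric data, while ‘constant’ takes its replacement from the fill_value parameter. The missing_values parameter sets which placeholder is treated as missing (np.nan by default).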
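To make the difference between the strategies concrete, here is a minimal sketch that applies each one to the same small array. The array values and the fill_value of 0.0 are made up for illustration and are not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# small illustrative array (made up for this sketch); np.nan marks missing entries
X = np.array([
    [1.0, 2.0],
    [np.nan, 2.0],
    [1.0, np.nan],
    [7.0, 6.0],
    [3.0, 2.0],
])

results = {}

# these strategies compute one statistic per column from the non-missing values
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    results[strategy] = imputer.fit_transform(X)

# 'constant' ignores the data and uses fill_value (0.0 is an arbitrary choice here)
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
results["constant"] = constant_imputer.fit_transform(X)

for name, X_filled in results.items():
    print(name)
    print(X_filled)
```

In the first column the non-missing values are 1, 1, 7, and 3, so the missing entry becomes 3.0 under ‘mean’, 2.0 under ‘median’, and 1.0 under ‘most_frequent’.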

SimpleImputer is a preprocessing transformer rather than a learning algorithm, so it is appropriate for any problem type (classification, regression, etc.) where the dataset contains missing values.
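As a rough illustration of using the imputer ahead of a model, the sketch below chains SimpleImputer with a classifier in a pipeline. The choice of LogisticRegression, the 10% missing rate, and the seeds are assumptions made for this sketch, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# synthetic classification data (sizes chosen for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# mark roughly 10% of the cells as missing, using a seeded generator
rng = np.random.default_rng(1)
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# the imputer learns column means from the training split only,
# then the same means are reused when transforming the test split
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Putting the imputer inside the pipeline keeps statistics from the test split out of the fitting step. The same pattern should work with a regressor in place of the classifier.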

from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
import numpy as np

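# generate a synthetic dataset (the labels are not needed for imputation)
X, _ = make_classification(n_samples=20, n_features=5, random_state=1)
# set up to 10 randomly chosen cells to NaN (duplicate positions are possible);
# the positions are not seeded, so they vary between runs
X[np.random.randint(20, size=10), np.random.randint(5, size=10)] = np.nan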

print("Dataset before imputation:\n", X[:5])

# create the imputer
imputer = SimpleImputer(strategy='mean')

# fit and transform the dataset
X_imputed = imputer.fit_transform(X)

print("Dataset after imputation:\n", X_imputed[:5])

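Running the example gives output like the following. The positions of the missing values vary between runs because they are drawn without a fixed seed: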

Dataset before imputation:
 [[ 1.38235094 -0.29809284         nan -2.03377636         nan]
 [-0.53783907 -1.44411381  1.24939741  0.89353275 -0.34831546]
 [-0.02413359  0.88514116 -0.9970991  -1.21623831         nan]
 [-0.46468434 -0.30620401  1.75810823  1.58156872  0.15620461]
 [ 0.71911693  1.51981682 -0.84707228 -0.2124111   1.02038585]]
Dataset after imputation:
 [[ 1.38235094 -0.29809284 -0.19457485 -2.03377636 -0.13811565]
 [-0.53783907 -1.44411381  1.24939741  0.89353275 -0.34831546]
 [-0.02413359  0.88514116 -0.9970991  -1.21623831 -0.13811565]
 [-0.46468434 -0.30620401  1.75810823  1.58156872  0.15620461]
 [ 0.71911693  1.51981682 -0.84707228 -0.2124111   1.02038585]]

The steps are as follows:

  1. First, a synthetic classification dataset is generated using the make_classification() function, with a specified number of samples (n_samples), number of features (n_features), and a fixed random seed (random_state) for reproducibility. Up to 10 randomly chosen entries are then set to NaN. These positions come from NumPy's unseeded global generator, so they differ between runs.

  2. The dataset is printed before imputation to show the missing values.

  3. A SimpleImputer instance is created with the strategy set to ‘mean’. The fit_transform() method computes the mean of each column from its non-missing values and replaces every missing value with the mean of its column.

  4. The dataset is printed after imputation to illustrate how the missing values have been replaced.
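The claim in step 3 can be checked directly. The sketch below compares the imputer's learned statistics_ attribute with column means computed by np.nanmean. Unlike the example above, it seeds the missing positions with np.random.default_rng(1), which is an assumption of this sketch so that the check is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer

X, _ = make_classification(n_samples=20, n_features=5, random_state=1)

# seeded positions for the missing values (an assumption of this sketch)
rng = np.random.default_rng(1)
rows = rng.integers(20, size=10)
cols = rng.integers(5, size=10)
X[rows, cols] = np.nan

# remember where the missing values were before imputing
missing_mask = np.isnan(X)

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# per-column means that ignore NaN entries
column_means = np.nanmean(X, axis=0)

print("Learned statistics:   ", imputer.statistics_)
print("np.nanmean per column:", column_means)

# every filled cell should equal the mean of its column
expected_fill = np.broadcast_to(column_means, X.shape)[missing_mask]
filled_ok = np.allclose(X_imputed[missing_mask], expected_fill)

# every observed cell should be left unchanged
untouched_ok = np.array_equal(X_imputed[~missing_mask], X[~missing_mask])

print("Filled cells match column means:", filled_ok)
print("Observed cells unchanged:", untouched_ok)
```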

This example demonstrates how to use SimpleImputer as a preprocessing step that replaces each missing value with the mean of its column, leaving the observed values unchanged.


