How to Use SimpleImputer
SimpleImputer is used for handling missing values in a dataset by replacing them with a specified strategy such as the mean, median, or most frequent value. It is particularly useful for preprocessing data before applying machine learning algorithms.
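The key hyperparameter of SimpleImputer is strategy, which determines the replacement value. Common strategies are ‘mean’, ‘median’, ‘most_frequent’, and ‘constant’. The ‘mean’ and ‘median’ strategies apply only to numeric data, while ‘constant’ takes its replacement from the fill_value parameter.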
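To make the difference between strategies concrete, here is a minimal sketch that applies each one to the same small array. The array values and the choice of fill_value=0.0 are made up for illustration and are not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data (made-up values): one missing entry in each column.
X = np.array([
    [2.0, 10.0],
    [2.0, np.nan],
    [np.nan, 10.0],
    [8.0, 40.0],
])

# One imputer per strategy; fill_value is only passed for 'constant'.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
    "constant": SimpleImputer(strategy="constant", fill_value=0.0),
}

filled = {}
for name, imputer in imputers.items():
    filled[name] = imputer.fit_transform(X)
    # statistics_ holds the per-column value used to fill missing entries.
    print(name, "->", imputer.statistics_)
```

With this data, ‘mean’ fills the gaps with 4.0 and 20.0, ‘median’ and ‘most_frequent’ both fill them with 2.0 and 10.0, and ‘constant’ fills them with 0.0.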
The imputer is appropriate for any problem type (classification, regression, etc.) where the dataset contains missing values, since it operates only on the input features.
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
import numpy as np
# generate dataset with missing values
X, _ = make_classification(n_samples=20, n_features=5, random_state=1)
# introduce missing values
X[np.random.randint(20, size=10), np.random.randint(5, size=10)] = np.nan
print("Dataset before imputation:\n", X[:5])
# create the imputer
imputer = SimpleImputer(strategy='mean')
# fit and transform the dataset
X_imputed = imputer.fit_transform(X)
print("Dataset after imputation:\n", X_imputed[:5])
Running the example gives an output like:
Dataset before imputation:
[[ 1.38235094 -0.29809284 nan -2.03377636 nan]
[-0.53783907 -1.44411381 1.24939741 0.89353275 -0.34831546]
[-0.02413359 0.88514116 -0.9970991 -1.21623831 nan]
[-0.46468434 -0.30620401 1.75810823 1.58156872 0.15620461]
[ 0.71911693 1.51981682 -0.84707228 -0.2124111 1.02038585]]
Dataset after imputation:
[[ 1.38235094 -0.29809284 -0.19457485 -2.03377636 -0.13811565]
[-0.53783907 -1.44411381 1.24939741 0.89353275 -0.34831546]
[-0.02413359 0.88514116 -0.9970991 -1.21623831 -0.13811565]
[-0.46468434 -0.30620401 1.75810823 1.58156872 0.15620461]
[ 0.71911693 1.51981682 -0.84707228 -0.2124111 1.02038585]]
The steps are as follows:
First, a synthetic classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples) and features (n_features), and a fixed random seed (random_state) for reproducibility. Missing values are then introduced at random positions. These positions come from NumPy's unseeded global random generator, so they differ from run to run.
The dataset is printed before imputation to show the missing values.
A SimpleImputer instance is created with the strategy set to ‘mean’. The imputer is fit on the dataset and used to replace missing values with the mean of the respective columns using the fit_transform() method.
The dataset is printed after imputation to illustrate how the missing values have been replaced.
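The claim that each missing value is replaced by its column mean can be checked directly. The sketch below repeats the example with a seeded NumPy generator (an assumption added here so the missing positions repeat across runs) and compares the imputer's learned statistics_ with np.nanmean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer

X, _ = make_classification(n_samples=20, n_features=5, random_state=1)

# Seeded generator (not in the original example) so the same cells go missing each run.
rng = np.random.default_rng(0)
rows = rng.integers(0, 20, size=10)
cols = rng.integers(0, 5, size=10)
X[rows, cols] = np.nan

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# The fill values learned during fit are the column means computed while ignoring NaN.
column_means = np.nanmean(X, axis=0)
print(np.allclose(imputer.statistics_, column_means))

# Formerly missing cells hold their column mean; all other cells are unchanged.
mask = np.isnan(X)
expected = np.where(mask, column_means, X)
print(np.allclose(X_imputed, expected))
```

Both checks should print True, which also confirms that values that were never missing pass through untouched.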
This example demonstrates how to use SimpleImputer to handle missing data in a dataset. The imputer is configured to replace missing values with the mean of the column. The dataset is printed before and after imputation to illustrate the effect of this preprocessing step.
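Since imputation is a preprocessing step, one common way to use it before a model is inside a scikit-learn pipeline, so that the column means are learned from the training split only. The following is a rough sketch under assumptions not in the original example: the LogisticRegression classifier, the 10% missing rate, and the train/test split are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# Knock out roughly 10% of the cells at random (illustrative missing rate).
rng = np.random.default_rng(0)
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The pipeline fits the imputer on the training data only, then reuses
# those training means when transforming the test data.
model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Test accuracy:", accuracy)
```

Keeping the imputer inside the pipeline means the test split never influences the fill values, which helps avoid leaking information from the test data into training.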