Scikit-Learn KNNImputer for Data Imputation

Missing values are a common problem in data preprocessing. KNNImputer in scikit-learn addresses it by filling each missing value from the values of the k most similar samples, following the k-nearest neighbors approach.

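For each missing entry, KNNImputer finds the n_neighbors nearest samples that have a value for that feature and fills the gap with the mean of those values. The key hyperparameters are n_neighbors (the number of neighboring samples to average, 5 by default), weights (how neighbors are weighted in the average: "uniform" for equal weights, "distance" to weight closer neighbors more heavily, or a callable), and metric (the distance used to find neighbors; the default, "nan_euclidean", computes Euclidean distance over the coordinates present in both samples). Because the method relies on distances, it applies to numeric features and is sensitive to feature scale.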
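To make the effect of the weights parameter concrete, here is a minimal sketch on a tiny array that is small enough to check by hand. The array values and the variable names (X_small, uniform, weighted) are chosen for illustration only.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Tiny illustrative array: row 0 lacks feature 2, row 2 lacks feature 0.
X_small = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Equal weights: each gap becomes the plain mean of its 2 nearest donors.
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X_small)

# Distance weights: closer donors contribute more to the average.
weighted = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X_small)

print("uniform: ", uniform[0, 2], uniform[2, 0])
print("distance:", weighted[0, 2], weighted[2, 0])
```

With uniform weights, the gap in row 0 is filled with the mean of the two nearest observed values (3 and 5). With distance weights, the nearer of those two neighbors pulls the result toward its own value.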

from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
import numpy as np

# generate a dataset with missing values
X, y = make_classification(n_samples=10, n_features=5, random_state=1)
X[0, 0] = np.nan
X[4, 3] = np.nan
X[7, 2] = np.nan

print("Dataset with missing values:")
print(X)

# create the KNN imputer
imputer = KNNImputer(n_neighbors=2)

# fit on the dataset and transform it
X_imputed = imputer.fit_transform(X)

print("\nDataset after imputation:")
print(X_imputed)

Running the example gives an output like:

Dataset with missing values:
[[        nan  1.05492298 -0.7290756  -1.14651383  1.44634283]
 [-1.11731035  0.79495321  3.11651775 -2.85961623 -1.52637437]
 [ 0.2344157  -1.92617151  2.43027958  1.49509867 -3.42524143]
 [-0.67124613  0.72558433  1.73994406 -2.00875146 -0.60483688]
 [-0.0126646   0.14092825  2.41932059         nan -1.60290743]
 [ 1.6924546   0.0230103  -1.07460638  0.55132541  0.78712117]
 [ 0.74204416 -1.91437196  3.84266872  0.70896364 -4.42287433]
 [-0.74715829 -0.36632248         nan  1.72073855  1.23169963]
 [-0.88762896  0.59936399 -1.18938753 -0.22942496  1.37496472]
 [ 1.65980218 -1.04052679  0.89368622  1.03584131 -1.55118469]]

Dataset after imputation:
[[ 0.40241282  1.05492298 -0.7290756  -1.14651383  1.44634283]
 [-1.11731035  0.79495321  3.11651775 -2.85961623 -1.52637437]
 [ 0.2344157  -1.92617151  2.43027958  1.49509867 -3.42524143]
 [-0.67124613  0.72558433  1.73994406 -2.00875146 -0.60483688]
 [-0.0126646   0.14092825  2.41932059 -2.43418384 -1.60290743]
 [ 1.6924546   0.0230103  -1.07460638  0.55132541  0.78712117]
 [ 0.74204416 -1.91437196  3.84266872  0.70896364 -4.42287433]
 [-0.74715829 -0.36632248 -1.13199695  1.72073855  1.23169963]
 [-0.88762896  0.59936399 -1.18938753 -0.22942496  1.37496472]
 [ 1.65980218 -1.04052679  0.89368622  1.03584131 -1.55118469]]

The steps are as follows:

Generate a synthetic classification dataset using make_classification(). Introduce some missing values manually to simulate a real-world scenario.
Display the dataset before imputation to show the missing values.
Create a KNNImputer instance with n_neighbors set to 2.
Fit the imputer on the dataset and fill in the missing values in a single call to fit_transform().
Display the dataset after imputation to show the filled-in values.
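The steps above call fit_transform() on a single array. When new data arrives later, the imputer can instead be fitted once and reused. The following is a sketch under assumed data: X_train and X_new are hypothetical random arrays, not part of the example above.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: a training set and a later batch, each with one gap.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 3))
X_new = rng.normal(size=(5, 3))
X_train[2, 1] = np.nan
X_new[0, 2] = np.nan

# Fit once on the training data; the imputer stores it as the donor pool.
imputer = KNNImputer(n_neighbors=2)
imputer.fit(X_train)

# Gaps in the new batch are filled from neighbors found in the training data.
X_new_imputed = imputer.transform(X_new)

print(np.isnan(X_new_imputed).any())
```

Fitting on the training data only keeps later batches from influencing each other's imputed values.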

This example demonstrates how to handle missing data in a dataset using the KNNImputer in scikit-learn. Each missing value is replaced by the mean of that feature over the nearest neighbors, so the filled-in values reflect similar samples rather than a single global statistic such as the column mean.
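The neighbor-mean rule can be checked by hand for one cell. The sketch below rebuilds the same dataset and recomputes the value imputed at row 4, column 3, using nan_euclidean_distances from sklearn.metrics.pairwise. The names row, col, donors, nearest, and manual are illustrative, and the check assumes the default uniform weights and no ties in distance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Rebuild the dataset from the example, with the same missing entries.
X, y = make_classification(n_samples=10, n_features=5, random_state=1)
X[0, 0] = np.nan
X[4, 3] = np.nan
X[7, 2] = np.nan

X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# Recompute one imputed cell manually.
row, col = 4, 3

# Distances from the target row to every row, ignoring missing coordinates.
dist = nan_euclidean_distances(X[[row]], X)[0]

# Only rows with an observed value in this column can act as donors.
donors = np.flatnonzero(~np.isnan(X[:, col]))

# Take the 2 closest donors and average their values in the column.
nearest = donors[np.argsort(dist[donors])[:2]]
manual = X[nearest, col].mean()

print(manual, X_imputed[row, col])
```

The two printed numbers should agree, which confirms that the imputed value is the mean of that column over the two nearest rows that have it observed.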

See Also