Handling missing values in a dataset is a common problem in data preprocessing. KNNImputer in scikit-learn provides an effective solution by imputing missing values based on the k-nearest neighbors approach.
KNNImputer fills each missing value with the mean of that feature across the k nearest neighbors that have a value for it. The key hyperparameters are n_neighbors (the number of neighboring samples to use for imputation), weights (how the neighbors are weighted when averaging, such as "uniform" or "distance"), and metric (the distance metric for finding nearest neighbors, "nan_euclidean" by default). The method is suited to preprocessing numeric datasets that contain missing values.
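As a rough illustration of how these hyperparameters behave, the sketch below compares uniform and distance weighting on a tiny hand-made array. The array and the variable names X_small, X_uniform and X_distance are illustrative assumptions, not part of the main example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# small illustrative array (an assumption for this sketch) with two missing entries
X_small = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# uniform weights: plain mean of the two nearest neighbors' values
uniform_imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
X_uniform = uniform_imputer.fit_transform(X_small)

# distance weights: closer neighbors contribute more to the average
distance_imputer = KNNImputer(n_neighbors=2, weights="distance", metric="nan_euclidean")
X_distance = distance_imputer.fit_transform(X_small)

print("Uniform weights:")
print(X_uniform)
print("Distance weights:")
print(X_distance)
```

With uniform weights, the missing entry in the first row becomes 4.0, the mean of its two neighbors' values 3 and 5. With distance weighting the closer neighbor counts for more, so the imputed value is pulled towards 3.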
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
import numpy as np
# generate a dataset with missing values
X, y = make_classification(n_samples=10, n_features=5, random_state=1)
X[0, 0] = np.nan
X[4, 3] = np.nan
X[7, 2] = np.nan
print("Dataset with missing values:")
print(X)
# create the KNN imputer
imputer = KNNImputer(n_neighbors=2)
# fit on the dataset and transform it
X_imputed = imputer.fit_transform(X)
print("\nDataset after imputation:")
print(X_imputed)
Running the example gives an output like:
Dataset with missing values:
[[ nan 1.05492298 -0.7290756 -1.14651383 1.44634283]
[-1.11731035 0.79495321 3.11651775 -2.85961623 -1.52637437]
[ 0.2344157 -1.92617151 2.43027958 1.49509867 -3.42524143]
[-0.67124613 0.72558433 1.73994406 -2.00875146 -0.60483688]
[-0.0126646 0.14092825 2.41932059 nan -1.60290743]
[ 1.6924546 0.0230103 -1.07460638 0.55132541 0.78712117]
[ 0.74204416 -1.91437196 3.84266872 0.70896364 -4.42287433]
[-0.74715829 -0.36632248 nan 1.72073855 1.23169963]
[-0.88762896 0.59936399 -1.18938753 -0.22942496 1.37496472]
[ 1.65980218 -1.04052679 0.89368622 1.03584131 -1.55118469]]
Dataset after imputation:
[[ 0.40241282 1.05492298 -0.7290756 -1.14651383 1.44634283]
[-1.11731035 0.79495321 3.11651775 -2.85961623 -1.52637437]
[ 0.2344157 -1.92617151 2.43027958 1.49509867 -3.42524143]
[-0.67124613 0.72558433 1.73994406 -2.00875146 -0.60483688]
[-0.0126646 0.14092825 2.41932059 -2.43418384 -1.60290743]
[ 1.6924546 0.0230103 -1.07460638 0.55132541 0.78712117]
[ 0.74204416 -1.91437196 3.84266872 0.70896364 -4.42287433]
[-0.74715829 -0.36632248 -1.13199695 1.72073855 1.23169963]
[-0.88762896 0.59936399 -1.18938753 -0.22942496 1.37496472]
[ 1.65980218 -1.04052679 0.89368622 1.03584131 -1.55118469]]
The steps are as follows:
- Generate a synthetic classification dataset using make_classification(), then set a few entries to NaN manually to simulate a real-world scenario.
- Display the dataset before imputation to show the missing values.
- Create a KNNImputer instance with n_neighbors set to 2.
- Fit the imputer on the dataset and transform the dataset to fill in the missing values using fit_transform().
- Display the dataset after imputation to show the filled-in values.
This example demonstrates how to handle missing data in a dataset using the KNNImputer in scikit-learn. The imputer fills in each missing value with the mean of that feature over the nearest neighbors, so the estimate reflects similar samples rather than a single global statistic.
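To make the "mean of the nearest neighbors" claim concrete, the following sketch recomputes one imputed value by hand and compares it with the imputer's result. It assumes the default uniform weights and the default nan_euclidean metric. The small array and the names X_check, donors, nearest and manual_value are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# small illustrative array (an assumption for this sketch)
X_check = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

row, col = 0, 2  # position of the missing entry to verify

# NaN-aware distances from the target row to every row in the array
dists = nan_euclidean_distances(X_check[[row]], X_check)[0]

# only rows with an observed value in this column can act as neighbors
donors = np.flatnonzero(~np.isnan(X_check[:, col]))

# pick the two closest donor rows and average their values in the column
nearest = donors[np.argsort(dists[donors])[:2]]
manual_value = X_check[nearest, col].mean()

# the same entry as filled in by KNNImputer
imputed = KNNImputer(n_neighbors=2).fit_transform(X_check)

print("Manual value: ", manual_value)
print("Imputer value:", imputed[row, col])
```

The two printed values should agree, which shows that the imputer averages the feature over the closest rows that have that feature observed.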