The MissingIndicator transformer in scikit-learn produces binary indicators that mark where values are missing in a dataset. This is particularly useful in preprocessing pipelines where missing values need to be flagged before imputation.
The primary hyperparameters for MissingIndicator include missing_values, which specifies the placeholder for missing values (np.nan by default), and features, which determines whether indicators are produced for all features ("all") or only for the features that contained missing values when the transformer was fitted ("missing-only", the default).
This transformer is appropriate for preprocessing steps in both classification and regression problems.
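To make the effect of these two hyperparameters concrete, here is a minimal sketch. The sentinel value -1 and the variable names are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[np.nan, 1, 3], [4, 0, np.nan], [8, 1, 0]])

# Default features="missing-only": one indicator column per feature
# that contained missing values at fit time (here columns 0 and 2)
mask_missing_only = MissingIndicator().fit_transform(X)

# features="all": one indicator column per input feature
mask_all = MissingIndicator(features="all").fit_transform(X)

# missing_values can be a placeholder other than np.nan
# (-1 is an assumed sentinel chosen for illustration)
X_sentinel = np.array([[-1, 1, 3], [4, 0, -1], [8, 1, 0]])
mask_sentinel = MissingIndicator(missing_values=-1).fit_transform(X_sentinel)

print(mask_missing_only.shape)
print(mask_all.shape)
print(mask_sentinel)
```

With features="all", the output always has as many columns as the input, which can be convenient when downstream code expects a fixed layout.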
from sklearn.impute import MissingIndicator
import numpy as np
# Create a dataset with missing values
X = np.array([[np.nan, 1, 3], [4, 0, np.nan], [8, 1, 0]])
# Initialize the MissingIndicator
indicator = MissingIndicator()
# Fit and transform the data
X_missing = indicator.fit_transform(X)
# Print the original dataset
print("Original Dataset:")
print(X)
# Print the dataset with missing value indicators
print("Missing Indicator Added:")
print(X_missing)
Running the example gives an output like:
Original Dataset:
[[nan  1.  3.]
 [ 4.  0. nan]
 [ 8.  1.  0.]]
Missing Indicator Added:
[[ True False]
 [False  True]
 [False False]]
The steps are as follows:
1. A synthetic dataset with missing values is created using numpy arrays. This dataset uses np.nan to represent missing values.
2. The MissingIndicator transformer is instantiated with default hyperparameters. This prepares the transformer to identify missing values in the dataset.
3. The fit_transform() method is applied to the dataset to generate binary indicators for missing values. This method both fits the transformer to the data and transforms the data in a single step.
4. The original dataset and the transformed dataset with missing indicators are printed for comparison. The binary indicators identify the positions of missing values, which can then be addressed in subsequent preprocessing steps.
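Because the default setting only returns indicator columns for features that had missing values, it can help to map those columns back to the original features. The sketch below does this with the fitted features_ attribute; the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[np.nan, 1, 3], [4, 0, np.nan], [8, 1, 0]])

indicator = MissingIndicator()
X_missing = indicator.fit_transform(X)

# features_ holds the indices of the input columns that received an indicator
flagged_columns = indicator.features_

# Summing the boolean mask counts the missing entries per flagged column
missing_counts = X_missing.sum(axis=0)

for out_col, in_col in enumerate(flagged_columns):
    print(f"indicator column {out_col} -> input column {in_col}, "
          f"{missing_counts[out_col]} missing value(s)")
```

Here the two indicator columns correspond to input columns 0 and 2, which explains why the transformed array has two columns rather than three.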
This example demonstrates how to use the MissingIndicator transformer in scikit-learn to flag missing values in a dataset, a common step in data preprocessing pipelines.
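As a possible next step, the indicators can be combined with imputed values so that a model sees both the filled-in data and the missingness flags. The sketch below is one way to do this, assuming mean imputation via SimpleImputer and make_union to stack the two outputs side by side; neither choice comes from the example above.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import make_union

X = np.array([[np.nan, 1, 3], [4, 0, np.nan], [8, 1, 0]])

# Stack imputed features and missing-value indicators column-wise
union = make_union(SimpleImputer(strategy="mean"), MissingIndicator())
X_combined = union.fit_transform(X)

# SimpleImputer can also append the indicators itself via add_indicator=True
X_shortcut = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

print(X_combined)
print(X_combined.shape)
```

The first three columns hold the imputed features and the last two hold the indicators, stored as 0.0 and 1.0 because the combined array is floating point.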