
Scikit-Learn Binarizer for Data Preprocessing

Binarizer is a preprocessing tool that transforms continuous data into binary values: each value greater than a threshold becomes 1, and every other value becomes 0.

The key parameter of Binarizer is threshold, which sets the cutoff point for binarization and defaults to 0.0.
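As a minimal sketch of how the threshold changes the result, the snippet below binarizes the same small array with two different cutoffs. The array X_demo and the variable names low and high are illustrative assumptions, not part of the main example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# small hand-made array (illustrative values, not taken from the main example)
X_demo = np.array([[-1.0, 0.0, 0.5],
                   [2.0, 1.5, -0.5]])

# values strictly greater than the threshold become 1, all others become 0
low = Binarizer(threshold=0.0).fit_transform(X_demo)
high = Binarizer(threshold=1.0).fit_transform(X_demo)

print("threshold=0.0:")
print(low)
print("threshold=1.0:")
print(high)
```

Raising the threshold from 0.0 to 1.0 turns the 0.5 entry from 1 into 0, while the entries 2.0 and 1.5 stay at 1 under both cutoffs.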

Binarizer is appropriate for feature engineering in classification and clustering problems where binary features are needed.
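One way this might look in a classification workflow is sketched below: Binarizer is chained with a Bernoulli naive Bayes classifier, which models binary features. The choice of BernoulliNB, the dataset size, and the train/test split are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

# synthetic continuous data (sizes chosen only for illustration)
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# binarize the features, then fit a classifier that expects binary inputs
model = make_pipeline(Binarizer(threshold=0.0), BernoulliNB())
model.fit(X_train, y_train)

# report accuracy on the held-out data
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Placing Binarizer inside the pipeline means the same threshold is applied automatically to any data passed to predict() or score().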

from sklearn.datasets import make_classification
from sklearn.preprocessing import Binarizer
# generate continuous data
X, _ = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=1)
X = X * 10  # scale the data to give it a wider range of values

# configure the binarizer
binarizer = Binarizer(threshold=0.0)

# transform the dataset
binary_X = binarizer.fit_transform(X)

# show before and after transformation
print("Before Binarization:")
print(X[:5])
print("After Binarization:")
print(binary_X[:5])

Running the example produces output like:

Before Binarization:
[[ 13.00227169  -7.85653903]
 [ 14.41844252  -5.60085539]
 [ -8.47924448 -13.66213235]
 [ -7.22150149 -14.11294144]
 [-12.72214654   2.59451061]]
After Binarization:
[[1. 0.]
 [1. 0.]
 [0. 0.]
 [0. 0.]
 [0. 1.]]

The steps are as follows:

  1. Generate a synthetic continuous dataset using the make_classification() function. This dataset is scaled to create a wider range of values, making binarization more illustrative.

  2. Instantiate a Binarizer with threshold=0.0, which is also the default value.

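  3. Apply the fit_transform() method to transform the dataset, converting values strictly greater than the threshold to 1 and all other values (including those equal to the threshold) to 0. Binarizer is stateless, so fitting learns nothing from the data.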

  4. Print a sample of the data before and after binarization to demonstrate the effect of the transformation.
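To make the transformation rule in the steps above concrete, the following sketch compares the Binarizer output with a plain NumPy comparison on the same data. The names manual_X and edge are assumptions introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import Binarizer

# same synthetic data as in the example above
X, _ = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X = X * 10

# binarize with the transformer and with an equivalent NumPy expression
binary_X = Binarizer(threshold=0.0).fit_transform(X)
manual_X = (X > 0.0).astype(X.dtype)

# the two results should match element for element
print(np.array_equal(binary_X, manual_X))

# a value exactly equal to the threshold is mapped to 0, not 1
edge = Binarizer(threshold=0.0).fit_transform(np.array([[0.0, 0.1]]))
print(edge)
```

The comparison uses a strict greater-than, which is why a value sitting exactly on the threshold ends up as 0.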

This example shows how to use Binarizer to convert continuous data into binary values based on a specified threshold, which is useful when a machine learning algorithm expects binary features.


