Scikit-Learn KBinsDiscretizer for Data Preprocessing

Some analyses and models work better with categorical inputs, or require them, so continuous features often need to be converted into discrete bins.

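The KBinsDiscretizer in scikit-learn transforms continuous features into discrete bins using one of three strategies: ‘uniform’ (equal-width bins), ‘quantile’ (bins holding roughly equal numbers of samples), or ‘kmeans’ (bin edges derived from one-dimensional k-means clustering of each feature).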
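To make the difference between the strategies concrete, the sketch below fits each one on the same skewed toy data and prints the resulting bin edges and per-bin sample counts. The exponential dataset and the names X_skewed, edges_by_strategy, and counts_by_strategy are assumptions made for illustration; they are not part of the main example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# skewed toy data (an assumption for illustration only)
rng = np.random.RandomState(1)
X_skewed = rng.exponential(scale=1.0, size=(200, 1))

edges_by_strategy = {}
counts_by_strategy = {}
for strategy in ["uniform", "quantile", "kmeans"]:
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    binned = disc.fit_transform(X_skewed)
    # bin_edges_ holds one array of edges per feature; there is a single feature here
    edges_by_strategy[strategy] = disc.bin_edges_[0]
    # count how many samples landed in each of the 4 bins
    counts_by_strategy[strategy] = np.bincount(binned.ravel().astype(int), minlength=4)
    print(strategy, np.round(edges_by_strategy[strategy], 2), counts_by_strategy[strategy])
```

On skewed data like this, ‘uniform’ should leave most samples in the first bin or two, while ‘quantile’ should spread them almost evenly. Depending on the installed scikit-learn version, the ‘quantile’ strategy may emit a FutureWarning about changing defaults.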

Key hyperparameters include n_bins (the number of bins per feature), encode (how the bin identifiers are returned: ‘onehot’, ‘onehot-dense’, or ‘ordinal’), and strategy (how the bin edges are chosen).
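As a rough illustration of the encode options, the sketch below bins the same six values three ways. The tiny array X_demo is a made-up input used only to keep the output easy to read.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# six evenly spaced values (hypothetical input for illustration)
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# 'ordinal': one column per feature holding the integer bin index
ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(X_demo)

# 'onehot-dense': one column per bin, returned as a dense NumPy array
dense = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform").fit_transform(X_demo)

# 'onehot' (the default): same layout, returned as a SciPy sparse matrix
onehot = KBinsDiscretizer(n_bins=3, encode="onehot", strategy="uniform").fit_transform(X_demo)

print(ordinal.ravel())   # [0. 0. 1. 1. 2. 2.]
print(dense.shape)       # (6, 3)
print(type(onehot).__name__)
```

The ordinal encoding keeps the bin order in a single column, which suits tree-based models; the one-hot encodings treat each bin as an independent indicator, which tends to suit linear models.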

Because it is a preprocessing transform rather than a model, it can be placed ahead of either classification or regression estimators whenever discretized features are required.
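One way this might look in practice is a pipeline that bins a feature before a linear model. The sketch below is illustrative only: the sine-shaped toy data, the choice of LinearRegression, and the names X_wave, y_wave, plain, and binned are all assumptions, and the scores are computed on the training data rather than a held-out set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

# nonlinear toy regression problem (an assumption for illustration)
rng = np.random.RandomState(1)
X_wave = rng.uniform(-3, 3, size=(200, 1))
y_wave = np.sin(X_wave).ravel() + rng.normal(scale=0.1, size=200)

# baseline: a straight line fitted to the raw feature
plain = LinearRegression().fit(X_wave, y_wave)

# same model, but the feature is first split into 10 one-hot encoded bins,
# so the fit becomes a piecewise-constant function of the input
binned = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="uniform"),
    LinearRegression(),
).fit(X_wave, y_wave)

print("Training R^2 without binning:", round(plain.score(X_wave, y_wave), 3))
print("Training R^2 with binning:", round(binned.score(X_wave, y_wave), 3))
```

Wrapping the discretizer in a pipeline means the bin edges are learned only from the data passed to fit(), which helps avoid leaking information from test data during cross-validation.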

from sklearn.datasets import make_regression
from sklearn.preprocessing import KBinsDiscretizer

# generate continuous dataset
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=1)

# configure the transform
kbin = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

# fit the transform
kbin.fit(X)

# apply the transform
X_binned = kbin.transform(X)

# show before and after
print("Before binning:\n", X[:10])
print("After binning:\n", X_binned[:10])

Running the example gives an output like:

Before binning:
 [[-0.61175641]
 [-0.24937038]
 [ 0.48851815]
 [ 0.76201118]
 [ 1.51981682]
 [ 0.37756379]
 [ 0.51292982]
 [-0.67124613]
 [-1.39649634]
 [ 0.31563495]]
After binning:
 [[1.]
 [2.]
 [3.]
 [3.]
 [4.]
 [2.]
 [3.]
 [1.]
 [1.]
 [2.]]

The steps are as follows:

1. Generate a continuous dataset using make_regression(). This creates a dataset with a specified number of samples (n_samples), features (n_features), noise level (noise), and a fixed random seed (random_state) for reproducibility.
2. Instantiate KBinsDiscretizer with n_bins=5 for five bins, encode='ordinal' so each value is replaced by its integer bin index, and strategy='uniform' for equal-width bins.
3. Fit the KBinsDiscretizer on the dataset using the fit() method, which computes the bin edges.
4. Transform the dataset with the transform() method to get the binned data.
5. Display the first 10 samples of the dataset before and after binning to show the effect of the transformation.
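After fitting, it can be helpful to check where the bin edges actually fall. The sketch below repeats the setup from the example, using fit_transform() as a shortcut for fit() followed by transform(), then reads the learned edges from the bin_edges_ attribute and maps the bin indices back to bin centers with inverse_transform(). The variable names edges and X_approx are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import KBinsDiscretizer

# same dataset and configuration as the main example
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=1)
kbin = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')

# fit and transform in one call
X_binned = kbin.fit_transform(X)

# learned edges for the single feature: n_bins + 1 values from min to max
edges = kbin.bin_edges_[0]
print("Bin edges:", np.round(edges, 3))

# map each bin index back to the center of its bin (a lossy reconstruction)
X_approx = kbin.inverse_transform(X_binned)
print("Reconstructed values:\n", X_approx[:5])
```

With the ‘uniform’ strategy, each reconstructed value should differ from the original by at most half a bin width, which gives a quick sense of how much detail the binning discards.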

This example demonstrates how to use KBinsDiscretizer to convert continuous features into discrete bins, which can be useful for various preprocessing tasks in machine learning workflows.

See Also