Converting continuous data into discrete bins is often necessary for categorical analysis.
The KBinsDiscretizer in scikit-learn provides a way to transform continuous features into discrete bins, using different strategies such as ‘uniform’, ‘quantile’, and ‘kmeans’.
Key hyperparameters include n_bins (number of bins), encode (encoding method), and strategy (binning strategy).
This method is suitable for preprocessing steps in classification and regression tasks where discrete bins are required.
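To make the difference between the three strategies concrete, here is a minimal sketch that bins the same skewed feature with each one and counts how many samples land in each bin. The exponential toy data, the choice of four bins, and the results dictionary are assumptions made for illustration and are not part of the original example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# skewed toy feature (an assumption, chosen so the strategies differ visibly)
rng = np.random.RandomState(0)
X = rng.exponential(scale=1.0, size=(200, 1))

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    kbin = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    X_binned = kbin.fit_transform(X)
    # count the samples assigned to each of the four bins
    counts = np.bincount(X_binned.ravel().astype(int), minlength=4)
    results[strategy] = counts
    print(strategy, "edges:", kbin.bin_edges_[0].round(2), "counts:", counts)
```

On data like this, ‘uniform’ should put most samples in the first bin because the bins have equal width, while ‘quantile’ should give bins with roughly equal counts. Recent scikit-learn releases may print a FutureWarning for the ‘quantile’ strategy; it does not affect this sketch.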
from sklearn.datasets import make_regression
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
# generate continuous dataset
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=1)
# configure the transform
kbin = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
# fit the transform
kbin.fit(X)
# apply the transform
X_binned = kbin.transform(X)
# show before and after
print("Before binning:\n", X[:10])
print("After binning:\n", X_binned[:10])
Running the example gives an output like:
Before binning:
[[-0.61175641]
[-0.24937038]
[ 0.48851815]
[ 0.76201118]
[ 1.51981682]
[ 0.37756379]
[ 0.51292982]
[-0.67124613]
[-1.39649634]
[ 0.31563495]]
After binning:
[[1.]
[2.]
[3.]
[3.]
[4.]
[2.]
[3.]
[1.]
[1.]
[2.]]
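To see why each value lands in its bin, it can help to inspect the fitted bin edges. The sketch below refits the same transform and reads the bin_edges_ attribute; it also maps the bin indices back to bin centers with inverse_transform(). The variable names edges, widths, and centers are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.preprocessing import KBinsDiscretizer

# same dataset and configuration as the example above
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=1)
kbin = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = kbin.fit_transform(X)

# bin_edges_ holds one array of edges per feature; 5 bins give 6 edges
edges = kbin.bin_edges_[0]
widths = np.diff(edges)
print("Bin edges:", edges)
print("Bin widths:", widths)

# inverse_transform replaces each bin index with the center of its bin
centers = (edges[:-1] + edges[1:]) / 2
X_back = kbin.inverse_transform(X_binned)
print("Bin centers:", centers)
print("Reconstructed values:\n", X_back[:5])
```

With the ‘uniform’ strategy the edges span the minimum to the maximum of the feature in equal-width steps, so the ordinal value is simply the index of the interval a sample falls into.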
The steps are as follows:
1. Generate a continuous dataset using make_regression(). This creates a dataset with a specified number of samples (n_samples), features (n_features), noise level (noise), and a fixed random seed (random_state) for reproducibility.
2. Instantiate KBinsDiscretizer with n_bins for the number of bins, encode for the encoding method, and strategy for the binning strategy.
3. Fit the KBinsDiscretizer on the dataset using the fit() method.
4. Transform the dataset with the transform() method to get binned data.
5. Display the first 10 samples of the dataset before and after binning to show the effect of the transformation.
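The encode argument chosen when instantiating the transform controls the shape of the output. As a rough sketch, the loop below fits the same data with each of the three accepted values and records the resulting shape; it also uses fit_transform() as a shorthand for calling fit() and then transform(). The shapes dictionary is an illustrative helper, not part of the original example.

```python
from sklearn.datasets import make_regression
from sklearn.preprocessing import KBinsDiscretizer

# same dataset as the example above
X, _ = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=1)

shapes = {}
for encode in ("ordinal", "onehot-dense", "onehot"):
    kbin = KBinsDiscretizer(n_bins=5, encode=encode, strategy="uniform")
    # fit_transform() fits the bin edges and applies them in one call
    X_binned = kbin.fit_transform(X)
    shapes[encode] = X_binned.shape
    print(encode, type(X_binned).__name__, X_binned.shape)
```

With ‘ordinal’ the single feature stays a single column of bin indices, while the two one-hot options expand it to one column per bin; ‘onehot’ returns a sparse matrix and ‘onehot-dense’ a regular NumPy array.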
This example demonstrates how to use KBinsDiscretizer to convert continuous features into discrete bins, which can be useful for various preprocessing tasks in machine learning workflows.