Scikit-Learn QuantileTransformer for Data Preprocessing

QuantileTransformer maps each feature to a uniform or normal distribution using the quantiles estimated from the data. The mapping is non-linear: it tends to spread out the most frequent values and reduces the influence of outliers, which is useful in scenarios where linear models benefit from features that are approximately Gaussian.

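The key hyperparameters include n_quantiles (number of quantiles used to estimate the cumulative distribution of each feature, 1000 by default), output_distribution (target distribution for the transformed data, either 'uniform', the default, or 'normal'), and subsample (maximum number of samples used to estimate the quantiles when fitting).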
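As a rough illustration of these hyperparameters, the sketch below fits a transformer on a hypothetical skewed feature drawn from an exponential distribution. The data and the specific values chosen for n_quantiles and subsample are assumptions made for this example only.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# hypothetical skewed data: 500 samples from an exponential distribution
rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=(500, 1))

# n_quantiles: number of landmarks used to estimate the cumulative distribution
# output_distribution: 'uniform' or 'normal'
# subsample: upper bound on the number of samples used to estimate the quantiles
transformer = QuantileTransformer(
    n_quantiles=100,
    output_distribution='uniform',
    subsample=10_000,
    random_state=1,
)
X_uniform = transformer.fit_transform(X)

# the fitted quantiles have shape (n_quantiles, n_features)
print(transformer.quantiles_.shape)

# with a uniform target, the transformed values lie in the range [0, 1]
print(X_uniform.min(), X_uniform.max())
```

Note that n_quantiles should not exceed the number of samples available for fitting; otherwise scikit-learn lowers it to the sample count and issues a warning.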

QuantileTransformer is appropriate for preprocessing data in both classification and regression problems.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer
import matplotlib.pyplot as plt

# generate a classification dataset
X, y = make_classification(n_samples=100, n_informative=2, n_redundant=0, n_features=2, random_state=1)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

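# create the quantile transformer
# n_quantiles is set to the number of training samples (80), because the
# default of 1000 exceeds it and would trigger a warning
transformer = QuantileTransformer(n_quantiles=80, output_distribution='normal', random_state=1)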

# fit the transformer on the training data and transform the training data
X_train_transformed = transformer.fit_transform(X_train)

# transform the test data
X_test_transformed = transformer.transform(X_test)

# plot the distribution of the original and transformed features
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

# original distribution
axs[0, 0].hist(X_train[:, 0], bins=30)
axs[0, 0].set_title('Original Feature 1')
axs[0, 1].hist(X_train[:, 1], bins=30)
axs[0, 1].set_title('Original Feature 2')

# transformed distribution
axs[1, 0].hist(X_train_transformed[:, 0], bins=30)
axs[1, 0].set_title('Transformed Feature 1')
axs[1, 1].hist(X_train_transformed[:, 1], bins=30)
axs[1, 1].set_title('Transformed Feature 2')

plt.tight_layout()
plt.show()

Running the example produces a plot like the following:

[Figure: Scikit-Learn QuantileTransformer Plot]

The steps are as follows:

First, a synthetic classification dataset is generated using make_classification(). This creates a dataset with two features and a fixed random seed for reproducibility. The dataset is split into training and test sets using train_test_split().
Next, a QuantileTransformer is instantiated with the target distribution set to normal. The transformer is fit on the training data only, so that the test data does not influence the estimated quantiles, and is then used to transform both the training and test data.
The distribution of the original and transformed features is then plotted using matplotlib. This helps visualize the effect of the transformation, showing how the features are modified to follow the specified distribution.
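Beyond plotting, the effect of the steps above can also be checked numerically. The following is a minimal sketch that repeats the same dataset and split, prints summary statistics of the transformed training features, and uses inverse_transform() to map them back to the original scale. Setting n_quantiles=80 is an assumption chosen to match the training set size.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

# same synthetic dataset and split as in the main example
X, y = make_classification(n_samples=100, n_informative=2, n_redundant=0, n_features=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# n_quantiles=80 matches the training set size (an assumption for this sketch)
transformer = QuantileTransformer(n_quantiles=80, output_distribution='normal', random_state=1)
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

# per-feature mean and standard deviation of the transformed training data
print(X_train_transformed.mean(axis=0))
print(X_train_transformed.std(axis=0))

# inverse_transform maps the transformed values back to the original scale
X_train_restored = transformer.inverse_transform(X_train_transformed)
print(np.allclose(X_train, X_train_restored))
```

Test values that fall outside the range seen during fitting are mapped to the bounds of the output distribution, so the test set may not look as cleanly Gaussian as the training set.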

This example demonstrates how to use QuantileTransformer to preprocess data so that each feature follows a desired distribution, which can improve the performance of linear models.
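Whether the transform actually helps depends on the data, so it is worth measuring. The sketch below is one possible way to do that: it places the transformer in a Pipeline with LogisticRegression (chosen here as an example linear model, not prescribed by the example above) and compares cross-validated accuracy against the same model on the raw features. The choice of n_quantiles=50 and 5-fold cross-validation are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer

# same synthetic dataset as in the main example
X, y = make_classification(n_samples=100, n_informative=2, n_redundant=0, n_features=2, random_state=1)

# baseline: linear model on the raw features
baseline = LogisticRegression()

# pipeline: the transformer is re-fit on the training portion of every fold,
# so the held-out fold never influences the estimated quantiles
pipeline = make_pipeline(
    QuantileTransformer(n_quantiles=50, output_distribution='normal', random_state=1),
    LogisticRegression(),
)

# evaluate both with 5-fold cross-validation
baseline_scores = cross_val_score(baseline, X, y, cv=5, scoring='accuracy')
pipeline_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

print('Baseline accuracy: %.3f' % baseline_scores.mean())
print('With QuantileTransformer: %.3f' % pipeline_scores.mean())
```

Using a pipeline keeps the preprocessing inside the cross-validation loop, which avoids leaking information from the evaluation folds into the fitted quantiles.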

See Also