Scikit-Learn RobustScaler for Data Preprocessing

RobustScaler is a scikit-learn preprocessing transformer that scales features using statistics that are robust to outliers.

For each feature, this scaler subtracts the median and divides by the interquartile range (IQR), which by default is the range between the 25th and 75th percentiles.
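To make the median and interquartile range description concrete, here is a minimal sketch that reproduces the scaling by hand with NumPy and compares it to RobustScaler with default settings. The five input values are made up for illustration and are not part of the example that follows.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# a single feature with one obvious outlier (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# manual computation: subtract the median, divide by the interquartile range
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

# RobustScaler with default settings, i.e. quantile_range=(25.0, 75.0)
scaled = RobustScaler().fit_transform(X)

print(manual.ravel())
print(scaled.ravel())
```

Here the median is 3 and the interquartile range is 2, so both arrays should hold -1, -0.5, 0, 0.5 and 48.5. Note that the outlier stays large after scaling; only the statistics used to center and scale the data are robust.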

It is appropriate for regression and classification problems where data contains outliers.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# define the scaler
scaler = RobustScaler()

# fit and transform the training set
X_train_scaled = scaler.fit_transform(X_train)

# transform the test set
X_test_scaled = scaler.transform(X_test)

# create model
model = LogisticRegression()

# fit model
model.fit(X_train_scaled, y_train)

# evaluate model
yhat = model.predict(X_test_scaled)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

# make a prediction
row = [[-1.10325445, -0.49821356, -0.05962247, -0.89224592, -0.70158632]]
row_scaled = scaler.transform(row)
yhat = model.predict(row_scaled)
print('Predicted: %d' % yhat[0])

Running the example gives an output like:

Accuracy: 0.950
Predicted: 0

The steps are as follows:

First, a synthetic binary classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), and a fixed random seed (random_state) for reproducibility. The dataset is split into training and test sets using train_test_split().
Next, a RobustScaler is defined to scale the features of the dataset. The scaler is fit on the training data using the fit_transform() method, which computes the median and interquartile range of each feature and applies the scaling. The test data is then transformed with transform() using the same fitted scaler, without fitting it again, so that no information from the test set leaks into the scaling statistics.
A LogisticRegression model is instantiated with default hyperparameters. The model is fit on the scaled training data using the fit() method.
The performance of the model is evaluated by comparing the predictions (yhat) to the actual values (y_test) using the accuracy score metric.
A single prediction can be made by first scaling a new data sample with the RobustScaler and then passing it to the predict() method of the fitted model.
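The steps above can also be chained so that scaling and modeling are always applied together. The following is a sketch of one way to do this with scikit-learn's make_pipeline(); it is an alternative arrangement of the same steps rather than part of the original example, and the variable names pipeline and pipeline_acc are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# same synthetic dataset and split as in the example above
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# chain the scaler and the model; fit() fits the scaler on the training data only
pipeline = make_pipeline(RobustScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# score() scales the test set with the fitted scaler, then reports accuracy
pipeline_acc = pipeline.score(X_test, y_test)
print('Accuracy: %.3f' % pipeline_acc)

# predict on a new row without scaling it manually
row = [[-1.10325445, -0.49821356, -0.05962247, -0.89224592, -0.70158632]]
print('Predicted: %d' % pipeline.predict(row)[0])
```

With a pipeline, new rows are passed in unscaled, which removes the risk of forgetting the transform() call before predict().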

This example demonstrates how to use RobustScaler to preprocess data before fitting a LogisticRegression model. Because the median and interquartile range change very little when a few extreme values are present, the scaling applied to the bulk of the data stays stable on data that contains outliers. The outliers themselves are not removed or clipped, so they remain extreme after scaling and can still influence the model.
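To illustrate what robustness to outliers means here, the following sketch fits StandardScaler and RobustScaler on the same feature with and without a few injected extreme values, and compares how much each scaler's learned scale changes. The data, the value 1000.0, and the number of injected points are assumptions made for illustration only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# one roughly standard-normal feature (synthetic, for illustration)
rng = np.random.default_rng(1)
X_clean = rng.normal(loc=0.0, scale=1.0, size=(200, 1))

# copy of the data with five extreme values injected
X_dirty = X_clean.copy()
X_dirty[:5] = 1000.0

# fit each scaler on the clean and on the contaminated data
std_clean = StandardScaler().fit(X_clean)
std_dirty = StandardScaler().fit(X_dirty)
rob_clean = RobustScaler().fit(X_clean)
rob_dirty = RobustScaler().fit(X_dirty)

# how far the learned scale moved because of the outliers
std_scale_change = abs(std_dirty.scale_[0] - std_clean.scale_[0])
rob_scale_change = abs(rob_dirty.scale_[0] - rob_clean.scale_[0])
print('StandardScaler scale change: %.3f' % std_scale_change)
print('RobustScaler scale change: %.3f' % rob_scale_change)

# the injected values are still extreme after robust scaling
max_scaled = rob_dirty.transform(X_dirty).max()
print('Largest robust-scaled value: %.1f' % max_scaled)
```

The standard deviation used by StandardScaler is pulled far from its clean value, while the interquartile range used by RobustScaler moves only slightly. The last line shows that the injected values are still far from the rest of the data after scaling.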

See Also