RobustScaler is a preprocessing technique that scales features using statistics that are robust to outliers.
This scaler removes the median and scales the data according to the interquartile range.
It is appropriate for regression and classification problems where the data contains outliers that would distort scaling based on the mean and standard deviation.
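As a quick check of the description above, the following minimal sketch compares the output of RobustScaler with a manual computation of (x - median) / IQR. It assumes the scaler's default settings (centering on the median and a 25th to 75th percentile range); the random array and the variable names are illustrative only and are not part of the main example.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# illustrative data: 20 rows, 3 features (assumed for this sketch only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

# scale with RobustScaler using its default settings
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# compute the same transform by hand, column by column
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25
X_manual = (X - median) / iqr

# the fitted statistics are exposed as center_ (median) and scale_ (IQR)
print(np.allclose(scaler.center_, median))
print(np.allclose(scaler.scale_, iqr))
print(np.allclose(X_scaled, X_manual))
```

The center_ and scale_ attributes hold the per-feature statistics learned during fitting, which is why the same fitted scaler can later be applied to new data.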
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# define the scaler
scaler = RobustScaler()
# fit and transform the training set
X_train_scaled = scaler.fit_transform(X_train)
# transform the test set
X_test_scaled = scaler.transform(X_test)
# create model
model = LogisticRegression()
# fit model
model.fit(X_train_scaled, y_train)
# evaluate model
yhat = model.predict(X_test_scaled)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a prediction
row = [[-1.10325445, -0.49821356, -0.05962247, -0.89224592, -0.70158632]]
row_scaled = scaler.transform(row)
yhat = model.predict(row_scaled)
print('Predicted: %d' % yhat[0])
Running the example gives an output like:
Accuracy: 0.950
Predicted: 0
The steps are as follows:
1. First, a synthetic binary classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), and a fixed random seed (random_state) for reproducibility. The dataset is split into training and test sets using train_test_split().
2. Next, a RobustScaler is defined to scale the features of the dataset. The scaler is fit on the training data using the fit_transform() method, which computes the necessary statistics and applies the scaling. The test data is then transformed using the same scaler without fitting it again.
3. A LogisticRegression model is instantiated with default hyperparameters. The model is fit on the scaled training data using the fit() method.
4. The performance of the model is evaluated by comparing the predictions (yhat) to the actual values (y_test) using the accuracy score metric.
5. A single prediction can be made by first scaling a new data sample with the RobustScaler and then passing it to the predict() method of the fitted model.
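The steps above can also be expressed as a single estimator. The sketch below is an alternative arrangement, not the approach used in the example: it assumes scikit-learn's make_pipeline() helper to chain the scaler and the model, so the scaler is fit only on the data passed to fit() and is reapplied automatically at prediction time. The variable name pipeline is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression

# same synthetic dataset and split as in the example above
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# chain scaling and modeling; fit() fits the scaler on the training data only
pipeline = make_pipeline(RobustScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# score() scales X_test with the already-fitted scaler before predicting
acc = pipeline.score(X_test, y_test)
print('Accuracy: %.3f' % acc)

# a new, unscaled row can be passed directly to predict()
row = [[-1.10325445, -0.49821356, -0.05962247, -0.89224592, -0.70158632]]
yhat = pipeline.predict(row)
print('Predicted: %d' % yhat[0])
```

Keeping the scaler inside the pipeline reduces the chance of accidentally fitting it on test data, for instance when the whole pipeline is later passed to a cross-validation routine.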
This example demonstrates how to use RobustScaler to preprocess data before fitting a LogisticRegression model. Because the scaling is based on the median and interquartile range rather than the mean and standard deviation, extreme values have less influence on the scaled features, which can make the model's predictions more reliable when the data contains outliers.
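To illustrate why median and IQR statistics are less sensitive to outliers, the following sketch fits RobustScaler and StandardScaler on a tiny hand-made column before and after one value is replaced by an extreme one. The arrays clean and dirty are hypothetical data chosen for this illustration; StandardScaler is included only as a point of comparison and is not used in the example above.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# hypothetical single-feature data, with and without an extreme value
clean = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
dirty = clean.copy()
dirty[-1, 0] = 500.0  # replace the largest value with an outlier

# fit each scaler on both versions of the data
robust_clean = RobustScaler().fit(clean)
robust_dirty = RobustScaler().fit(dirty)
standard_clean = StandardScaler().fit(clean)
standard_dirty = StandardScaler().fit(dirty)

# median and IQR are the same for both versions of this data
print('Robust center:', robust_clean.center_, robust_dirty.center_)
print('Robust scale:', robust_clean.scale_, robust_dirty.scale_)

# the mean and standard deviation shift substantially
print('Standard mean:', standard_clean.mean_, standard_dirty.mean_)
print('Standard scale:', standard_clean.scale_, standard_dirty.scale_)
```

In this small case the outlier leaves the median and interquartile range untouched, while the mean and standard deviation used by StandardScaler change markedly. The outlier itself still maps to a large scaled value; RobustScaler limits its effect on the scaling statistics rather than removing it.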