Scikit-Learn robust_scale() for Data Preprocessing

robust_scale() is a function in scikit-learn's sklearn.preprocessing module that scales the features of a dataset using statistics that are robust to outliers.

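Unlike standardization, which uses the mean and standard deviation, robust_scale() centers each feature on its median and divides by its interquartile range (IQR, by default the 75th percentile minus the 25th). A few extreme values barely move these statistics, which makes the function well suited to data with outliers.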
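To make the median and interquartile range concrete, the following minimal sketch reproduces the calculation by hand with NumPy and compares it with robust_scale() under its default settings. The small array X_demo is made up for illustration and is not part of the example below.

```python
import numpy as np
from sklearn.preprocessing import robust_scale

# small illustrative array (hypothetical data); the last row is an outlier
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [4.0, 40.0],
                   [100.0, 1000.0]])

# per-feature median and interquartile range (75th minus 25th percentile)
median = np.median(X_demo, axis=0)
q25, q75 = np.percentile(X_demo, [25, 75], axis=0)
iqr = q75 - q25

# manual robust scaling: subtract the median, divide by the IQR
X_manual = (X_demo - median) / iqr

# check agreement with robust_scale() using its default arguments
print(np.allclose(X_manual, robust_scale(X_demo)))
```

The outlier in the last row has no effect on the median or the quartiles here, so the first four rows are scaled as if it were absent.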

This example applies robust_scale() to a dataset with injected outliers and prints the values before and after scaling.

from sklearn.datasets import make_blobs
from sklearn.preprocessing import robust_scale
import numpy as np

# generate dataset with outliers
X, _ = make_blobs(n_samples=100, centers=1, n_features=2, random_state=42)
X[::10] += 10  # shift every 10th sample by +10 in both features to create outliers

# print dataset before scaling
print("Dataset Before Scaling:")
print(X[:10])  # printing first 10 samples for brevity

# apply robust scaling
X_scaled = robust_scale(X)

# print dataset after scaling
print("Dataset After Scaling:")
print(X_scaled[:10])  # printing first 10 samples for brevity

Running the example gives an output like:

Dataset Before Scaling:
[[ 7.75068517 19.796109  ]
 [-1.88353028  8.15712857]
 [-2.44166942  7.58953794]
 [-3.70050112  9.67083974]
 [-2.73266041  9.72828662]
 [-2.58629933  9.3554381 ]
 [-1.68713746 10.91107911]
 [-2.42215055  8.71527878]
 [-3.74614833  7.69382952]
 [-0.64342311  9.48811905]]
Dataset After Scaling:
[[ 7.153318    7.13149617]
 [ 0.41283127 -0.78673842]
 [ 0.02233455 -1.17288184]
 [-0.85839501  0.24307001]
 [-0.18125451  0.28215229]
 [-0.07885439  0.02849587]
 [ 0.55023562  1.08683002]
 [ 0.03599074 -0.40701751]
 [-0.89033165 -1.10193017]
 [ 1.28046039  0.11876141]]

The steps are as follows:

1. A synthetic dataset is generated using the make_blobs() function, and every tenth sample is shifted to simulate outliers in real-world data.
2. The first rows of the dataset are printed before scaling; the first row is one of the injected outliers.
3. The robust_scale() function is applied to the dataset, centering each feature on its median and dividing by its interquartile range.
4. The first rows are printed after scaling. The outlier in the first row remains extreme (about 7 units from the center), but the rest of the data now sits near zero on a scale that the outliers did not inflate.
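The effect described in these steps can be checked numerically. The sketch below rebuilds the same dataset, confirms that each robust-scaled feature has a median of 0 and an interquartile range of 1, and compares the spread of the unshifted samples against standardization with scale(). The comparison with scale() and the variable names are additions for illustration, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import robust_scale, scale

# same data as in the example above
X, _ = make_blobs(n_samples=100, centers=1, n_features=2, random_state=42)
X[::10] += 10  # every 10th sample becomes an outlier

# boolean mask marking the rows that were not shifted
inliers = np.ones(len(X), dtype=bool)
inliers[::10] = False

X_robust = robust_scale(X)   # median / IQR
X_standard = scale(X)        # mean / standard deviation

# after robust scaling, each feature has median 0 and IQR 1
median_after = np.median(X_robust, axis=0)
q25, q75 = np.percentile(X_robust, [25, 75], axis=0)
iqr_after = q75 - q25
print("median after robust scaling:", median_after)
print("IQR after robust scaling:   ", iqr_after)

# spread of the unshifted samples under each method
inlier_std_robust = X_robust[inliers].std(axis=0)
inlier_std_standard = X_standard[inliers].std(axis=0)
print("inlier std (robust_scale):", inlier_std_robust)
print("inlier std (scale):       ", inlier_std_standard)
```

With standardization, the outliers inflate the standard deviation, so the unshifted samples are squeezed into a narrower range than they are under robust scaling.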

This example shows how to use robust_scale() to preprocess data that contains outliers. Because the center and scale come from the median and interquartile range, extreme values have little effect on how the remaining samples are scaled, which helps algorithms that are sensitive to feature scale.
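robust_scale() computes its statistics from whatever array it receives, so calling it on a full dataset before a train/test split lets the test samples influence the scaling. When that matters, one common alternative is the RobustScaler class from the same module, which learns the median and IQR from the training data only. The sketch below assumes the data will later be split for model evaluation; the split size and random_state values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# same data as in the example above
X, _ = make_blobs(n_samples=100, centers=1, n_features=2, random_state=42)
X[::10] += 10  # every 10th sample becomes an outlier

# hold out 25% of the samples (arbitrary split for this sketch)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# learn the median and IQR from the training data only
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)

# reuse the training statistics on the held-out data
X_test_scaled = scaler.transform(X_test)

print("learned medians:", scaler.center_)
print("learned IQRs:   ", scaler.scale_)
```

The fitted scaler can also be placed in a scikit-learn Pipeline so that the same training-only statistics are applied during cross-validation.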
