robust_scale() is a preprocessing function in scikit-learn that scales the features of a dataset using statistics that are robust to outliers.
Unlike standard scaling, which uses the mean and standard deviation, robust_scale() centers each feature on its median and divides by its interquartile range (IQR), making it well suited to data with outliers.
This example demonstrates how to apply robust_scale() to a dataset and print the values before and after scaling.
from sklearn.datasets import make_blobs
from sklearn.preprocessing import robust_scale
# generate dataset with outliers
X, _ = make_blobs(n_samples=100, centers=1, n_features=2, random_state=42)
X[::10] += 10 # add outliers
# print dataset before scaling
print("Dataset Before Scaling:")
print(X[:10]) # printing first 10 samples for brevity
# apply robust scaling
X_scaled = robust_scale(X)
# print dataset after scaling
print("Dataset After Scaling:")
print(X_scaled[:10]) # printing first 10 samples for brevity
Running the example gives an output like:
Dataset Before Scaling:
[[ 7.75068517 19.796109 ]
[-1.88353028 8.15712857]
[-2.44166942 7.58953794]
[-3.70050112 9.67083974]
[-2.73266041 9.72828662]
[-2.58629933 9.3554381 ]
[-1.68713746 10.91107911]
[-2.42215055 8.71527878]
[-3.74614833 7.69382952]
[-0.64342311 9.48811905]]
Dataset After Scaling:
[[ 7.153318 7.13149617]
[ 0.41283127 -0.78673842]
[ 0.02233455 -1.17288184]
[-0.85839501 0.24307001]
[-0.18125451 0.28215229]
[-0.07885439 0.02849587]
[ 0.55023562 1.08683002]
[ 0.03599074 -0.40701751]
[-0.89033165 -1.10193017]
[ 1.28046039 0.11876141]]
The steps are as follows:
- A synthetic dataset is generated using the make_blobs() function, and every tenth sample is shifted by 10 to simulate outliers.
- The dataset is printed before scaling to highlight the presence of outliers.
- The robust_scale() function is applied to the dataset, centering each feature on its median and dividing by its interquartile range.
- The dataset is printed after scaling. The bulk of the samples now lie close to zero, while the outlier in the first row remains clearly separated, because the median and interquartile range are barely influenced by the extreme values.
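To make the scaling step concrete, the following is a minimal sketch that reproduces the result by hand. It assumes the default arguments of robust_scale() (column-wise scaling with the 25th to 75th percentile range); the names median, iqr and X_manual are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import robust_scale

# same dataset as in the example above
X, _ = make_blobs(n_samples=100, centers=1, n_features=2, random_state=42)
X[::10] += 10  # shift every 10th sample to create outliers

# per-feature median and interquartile range (75th minus 25th percentile)
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1

# manual robust scaling: subtract the median, divide by the IQR
X_manual = (X - median) / iqr

# compare against scikit-learn's result
X_scaled = robust_scale(X)
print(np.allclose(X_manual, X_scaled))
```

If the defaults are as assumed, the comparison should print True, which shows that the function is doing nothing more than a median shift and an IQR division per feature.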
This example demonstrates how to use robust_scale() for preprocessing data that contains outliers. Because the median and interquartile range change very little when a few extreme values are present, the scaled features reflect the spread of the bulk of the data, which can help machine learning algorithms that are sensitive to feature scale.
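For contrast, here is a small sketch comparing robust_scale() with scale(), scikit-learn's standard (mean and standard deviation) scaling function, on the same data. The variable names X_std and X_rob are illustrative, and the comparison is only meant to show the direction of the effect, not to quantify it.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import robust_scale, scale

# same dataset as in the example above
X, _ = make_blobs(n_samples=100, centers=1, n_features=2, random_state=42)
X[::10] += 10  # shift every 10th sample to create outliers

X_std = scale(X)         # centers on the mean, divides by the standard deviation
X_rob = robust_scale(X)  # centers on the median, divides by the IQR

# the outliers pull the mean upward, so after standard scaling
# the typical (median) sample sits below zero
print("Median after scale():       ", np.median(X_std, axis=0))

# robust scaling centers the typical sample at zero
print("Median after robust_scale():", np.median(X_rob, axis=0))
```

With standard scaling the center of the bulk of the data is displaced by the outliers, whereas robust scaling keeps it at zero. Note that robust_scale() does not remove or clip the outliers; they remain in the data as large scaled values.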