Scikit-Learn StandardScaler for Data Preprocessing

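The StandardScaler standardizes features by removing the mean and scaling to unit variance, so that each feature has a mean of 0 and a standard deviation of 1 on the data it was fit to. It does not make the data normally distributed. It helps algorithms that expect features to be centered and on comparable scales.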
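To make the transformation concrete, the following minimal sketch compares StandardScaler with a manual calculation of z = (x - mean) / std per column. The small matrix is made up for illustration, and the manual version assumes the population standard deviation (NumPy's default, ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# small illustrative matrix: 4 samples, 2 features (values are made up)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# manual standardization: z = (x - mean) / std, computed per column
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
X_manual = (X - mean) / std

# the same calculation done by StandardScaler
X_scaled = StandardScaler().fit_transform(X)

# check that both approaches agree
print(np.allclose(X_manual, X_scaled))
```

Both columns end up on the same scale even though the second feature started out ten times larger than the first.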

The key parameters are with_mean (whether to center the data before scaling) and with_std (whether to scale the data to unit variance). Both default to True.
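As a rough illustration of what each parameter controls, the sketch below turns them off one at a time. The data values and variable names are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# illustrative data (values are made up): 3 samples, 2 features
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [6.0, 500.0]])

# scale only: divide by the per-feature standard deviation, keep the mean offset
X_scale_only = StandardScaler(with_mean=False).fit_transform(X)

# center only: subtract the per-feature mean, keep the original spread
X_center_only = StandardScaler(with_std=False).fit_transform(X)

print("scale only, column std:  ", X_scale_only.std(axis=0))
print("center only, column mean:", X_center_only.mean(axis=0))
```

With with_mean=False the columns have unit variance but are not centered. With with_std=False they are centered but keep their original variance.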

This scaler is useful for preprocessing data for algorithms that are sensitive to feature scaling, such as Support Vector Machines and K-Nearest Neighbors.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=5, random_state=1)

# split into train and test sets
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

# create and fit StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

# transform the train and test datasets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# show a sample of the data before and after scaling
print("Before scaling:\n", X_train[:5])
print("After scaling:\n", X_train_scaled[:5])

Running the example gives an output like:

Before scaling:
 [[ 0.9825172   0.58591043 -0.17816707  0.57699061  0.33847597]
 [-0.06539297 -0.66134424  0.92866146  0.82427079  1.11257376]
 [-1.51594823  0.62336218  0.67660851 -0.51225935 -0.02488133]
 [ 1.50284911  0.82458463 -0.54216937  0.62883325  0.18387793]
 [ 0.7185152  -1.52568032 -1.39569948 -0.76874019 -1.31918161]]
After scaling:
 [[ 1.04527344  0.44314612 -0.17397783  0.68382349  0.32191469]
 [ 0.01072016 -0.81987876  0.86031338  0.9534102   0.99189015]
 [-1.42134599  0.48107141  0.62477902 -0.503685    0.00743185]
 [ 1.55897308  0.6848381  -0.51412481  0.74034272  0.18811132]
 [ 0.78463646 -1.69514344 -1.31171781 -0.78330235 -1.11277471]]

The steps are as follows:

1. Generate a synthetic dataset using make_classification() with a specified number of samples and features, ensuring reproducibility with a fixed random seed.
2. Split the dataset into training and test sets using train_test_split().
3. Instantiate a StandardScaler and fit it to the training data using the fit() method.
4. Transform the training and test sets using the transform() method to standardize the data.
5. Display a sample of the data before and after scaling to illustrate the effect of the scaler.
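One way to check the steps above is to inspect the statistics the scaler learned and the column means and standard deviations after transforming. This is a sketch that reuses the same data and split as the example. The mean_ and scale_ attributes are set by fit().

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# same data and split as the example above
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

# fit on the training data only, then transform both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# per-feature statistics learned from the training data
print("mean_: ", scaler.mean_)
print("scale_:", scaler.scale_)

# training columns end up with mean close to 0 and std close to 1
print("train mean:", X_train_scaled.mean(axis=0).round(6))
print("train std: ", X_train_scaled.std(axis=0).round(6))

# test columns use the training statistics, so they are only roughly standardized
print("test mean: ", X_test_scaled.mean(axis=0).round(3))
print("test std:  ", X_test_scaled.std(axis=0).round(3))
```

The test set statistics will usually differ slightly from 0 and 1, which is expected because the scaler never saw the test data.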

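This example demonstrates how to use StandardScaler to preprocess data. The scaler is fit on the training set only, and the same learned mean and standard deviation are then applied to the test set, which keeps information from the test set out of the preprocessing step. Standardized features matter for the performance of many machine learning models, particularly those based on distances or margins.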
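To carry the scaled features into a model, one option is to chain the scaler with an estimator in a pipeline. The sketch below pairs it with K-Nearest Neighbors, one of the scale-sensitive algorithms mentioned earlier. The choice of KNeighborsClassifier with default settings is an assumption made for illustration, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# same synthetic dataset as before, this time splitting the labels as well
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# the pipeline fits the scaler on the training data, then fits the classifier
# on the scaled training data (default KNN settings, chosen for illustration)
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(X_train, y_train)

# at prediction time the pipeline applies the training statistics to X_test
accuracy = model.score(X_test, y_test)
print("Test accuracy:", accuracy)
```

Wrapping both steps in a pipeline means the scaler is always fit on whatever data is passed to fit(), which helps when the model is later used inside cross-validation.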
