Scikit-Learn scale() for Data Preprocessing

Data preprocessing often involves rescaling features so that they are on comparable scales. The scale() function in scikit-learn standardizes each feature by subtracting its mean and dividing by its standard deviation. This example applies scale() to a small dataset and shows its effect on the values.

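After standardization, each feature (column) has a mean of zero and unit variance. This matters for algorithms that are sensitive to feature scale, such as those based on distances or trained with gradient-based optimization. Standardization only shifts and rescales the values; it does not change the shape of their distribution, so it does not make the data normally distributed.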
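As a minimal sketch of the arithmetic, the snippet below standardizes a tiny hand-made array with NumPy and compares the result to scale(). The array X_demo and its values are made up for illustration. The comparison relies on scale() using the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import scale

# illustrative values only; each column is one feature
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

# standardize by hand: subtract each column's mean, divide by its standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# the manual result should match scale() up to floating-point error
print(np.allclose(X_manual, scale(X_demo)))
```

Both columns end up with the same standardized values, because the second column is just the first multiplied by 10 and standardization removes that difference in scale.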

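scale() can be used when preparing data for both classification and regression problems. It computes the mean and standard deviation from whatever array it is given and does not store them, so it is best suited to quick, one-off transformations of a single array.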

from sklearn.datasets import make_classification
from sklearn.preprocessing import scale

# generate a synthetic dataset
X, y = make_classification(n_samples=10, n_features=5, random_state=1)

# print original data
print("Original Data:\n", X)

# scale the dataset
X_scaled = scale(X)

# print scaled data
print("Scaled Data:\n", X_scaled)

Running the example gives an output like:

Original Data:
 [[-0.19183555  1.05492298 -0.7290756  -1.14651383  1.44634283]
 [-1.11731035  0.79495321  3.11651775 -2.85961623 -1.52637437]
 [ 0.2344157  -1.92617151  2.43027958  1.49509867 -3.42524143]
 [-0.67124613  0.72558433  1.73994406 -2.00875146 -0.60483688]
 [-0.0126646   0.14092825  2.41932059 -1.52320683 -1.60290743]
 [ 1.6924546   0.0230103  -1.07460638  0.55132541  0.78712117]
 [ 0.74204416 -1.91437196  3.84266872  0.70896364 -4.42287433]
 [-0.74715829 -0.36632248 -2.17641632  1.72073855  1.23169963]
 [-0.88762896  0.59936399 -1.18938753 -0.22942496  1.37496472]
 [ 1.65980218 -1.04052679  0.89368622  1.03584131 -1.55118469]]
Scaled Data:
 [[-0.27221891  1.19442667 -0.8354927  -0.6144257   1.16423689]
 [-1.23407375  0.94517452  1.10427177 -1.75733634 -0.35660933]
 [ 0.17078811 -1.66376801  0.75812482  1.14794825 -1.32807235]
 [-0.77047493  0.87866547  0.40991111 -1.18967477  0.11485052]
 [-0.08600482  0.31811266  0.75259697 -0.86573959 -0.39576375]
 [ 1.68614194  0.20505604 -1.00978266  0.518302    0.82697818]
 [ 0.69837124 -1.65245492  1.47055123  0.62347167 -1.83846272]
 [-0.84937117 -0.16822595 -1.56554912  1.29848579  1.05442514]
 [-0.99536368  0.75764875 -1.06767969 -0.00258216  1.12771975]
 [ 1.65220596 -0.81463524 -0.01695173  0.84155085 -0.36930233]]

The steps are as follows:

1. Generate a synthetic dataset with make_classification(). The call sets the number of samples (n_samples) and features (n_features), and fixes the random seed (random_state) for reproducibility.
2. Print the original values as a baseline for comparison.
3. Apply scale() to standardize the dataset. By default it works column by column, so each feature ends up with a mean of 0 and a standard deviation of 1.
4. Print the scaled values. Each feature is now centered on zero with unit variance.
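
One way to confirm the effect of scaling is to check the per-feature statistics of the scaled array. The sketch below regenerates the same dataset and tests that each column's mean is approximately 0 and its standard deviation approximately 1. It uses np.allclose because floating-point results are rarely exactly 0 or 1; the names col_means and col_stds are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import scale

# same synthetic dataset as in the example
X, y = make_classification(n_samples=10, n_features=5, random_state=1)
X_scaled = scale(X)

# per-feature (column-wise) statistics after scaling
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print("Means close to 0:", np.allclose(col_means, 0.0))
print("Stds close to 1:", np.allclose(col_stds, 1.0))
```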

This example shows how to use scale() to standardize features in a single call. Standardization often helps algorithms that are sensitive to feature scale, such as k-nearest neighbors, support vector machines, and models trained with gradient descent.
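
Because scale() returns only the transformed array, the mean and standard deviation it used are not available for scaling new data later. When a model is evaluated on held-out data, one common alternative is scikit-learn's StandardScaler, which learns the statistics from the training set and reuses them on the test set. The following is a minimal sketch of that pattern, not part of the example above; the larger dataset (n_samples=100) and the 75/25 split are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# larger synthetic dataset so that a train/test split is meaningful (illustrative size)
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# learn the mean and standard deviation from the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# reuse the training statistics on the test data
X_test_scaled = scaler.transform(X_test)

print("Train means close to 0:", np.allclose(X_train_scaled.mean(axis=0), 0.0))
print("Test means:", X_test_scaled.mean(axis=0))
```

The test-set means are generally close to, but not exactly, zero, since the test data is scaled with statistics estimated from the training data.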

See Also