
Scikit-Learn normalize() for Data Preprocessing

Normalization rescales each sample (row) of the data to have unit norm, which can be useful for machine learning algorithms that rely on distances or dot products between samples. This example demonstrates how to use the normalize() function from scikit-learn for data preprocessing. Normalization ensures that samples with large overall magnitudes do not dominate those with smaller magnitudes; only the relative proportions of the feature values within each sample matter.

The key parameters of normalize() are norm (the norm used to normalize each non-zero sample, one of 'l1', 'l2', or 'max') and axis (1, the default, to normalize each sample, or 0 to normalize each feature).
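
For intuition, here is a minimal sketch (using a single made-up two-feature sample, not the dataset from the example below) of how each norm option rescales the same row:

from sklearn.preprocessing import normalize
import numpy as np

# one illustrative sample with two features
x = np.array([[3.0, 4.0]])

# l2: divide by sqrt(3**2 + 4**2) = 5 -> [0.6, 0.8]
print(normalize(x, norm='l2'))

# l1: divide by |3| + |4| = 7 -> [~0.4286, ~0.5714]
print(normalize(x, norm='l1'))

# max: divide by max(|3|, |4|) = 4 -> [0.75, 1.0]
print(normalize(x, norm='max'))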

Normalization is appropriate for any type of problem (classification, regression) where rescaling samples to a common scale is beneficial, such as with distance- or similarity-based models.

from sklearn.datasets import make_classification
from sklearn.preprocessing import normalize
import pandas as pd

# generate synthetic dataset
X, _ = make_classification(n_samples=10, n_features=5, random_state=1)

# convert to DataFrame for better visualization
df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(1, 6)])
print("Original Dataset:")
print(df)

# apply normalization
normalized_X = normalize(X, norm='l2')
normalized_df = pd.DataFrame(normalized_X, columns=[f'Feature_{i}' for i in range(1, 6)])
print("\nNormalized Dataset:")
print(normalized_df)

Running the example gives an output like:

Original Dataset:
   Feature_1  Feature_2  Feature_3  Feature_4  Feature_5
0  -0.191836   1.054923  -0.729076  -1.146514   1.446343
1  -1.117310   0.794953   3.116518  -2.859616  -1.526374
2   0.234416  -1.926172   2.430280   1.495099  -3.425241
3  -0.671246   0.725584   1.739944  -2.008751  -0.604837
4  -0.012665   0.140928   2.419321  -1.523207  -1.602907
5   1.692455   0.023010  -1.074606   0.551325   0.787121
6   0.742044  -1.914372   3.842669   0.708964  -4.422874
7  -0.747158  -0.366322  -2.176416   1.720739   1.231700
8  -0.887629   0.599364  -1.189388  -0.229425   1.374965
9   1.659802  -1.040527   0.893686   1.035841  -1.551185

Normalized Dataset:
   Feature_1  Feature_2  Feature_3  Feature_4  Feature_5
0  -0.085050   0.467696  -0.323233  -0.508302   0.641230
1  -0.237671   0.169100   0.662935  -0.608288  -0.324685
2   0.048214  -0.396169   0.499853   0.307508  -0.704494
3  -0.231528   0.250271   0.600146  -0.692864  -0.208622
4  -0.003860   0.042958   0.737454  -0.464302  -0.488596
5   0.761222   0.010349  -0.483330   0.247972   0.354027
6   0.118752  -0.306364   0.614957   0.113458  -0.707809
7  -0.237376  -0.116382  -0.691457   0.546686   0.391317
8  -0.418203   0.282388  -0.560376  -0.108093   0.647810
9   0.582639  -0.365256   0.313710   0.363611  -0.544512

The steps are as follows:

  1. First, a synthetic dataset is generated using the make_classification() function, creating a dataset with a specified number of samples (n_samples) and features (n_features). The dataset is then converted to a DataFrame for better visualization.

  2. The original dataset is printed to display the initial feature scales.

  3. The normalize() function is used to transform the dataset. By default, normalize() uses the l2 norm, which scales each sample to have a unit norm (verified in the short check after this list).

  4. The transformed dataset is printed to show the new feature scales, where each sample’s features have been normalized.
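
As a quick check (a minimal, self-contained sketch that regenerates the same dataset), we can confirm that every row of the normalized data now has an l2 norm of 1:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import normalize

# regenerate the dataset and normalize it as in the example above
X, _ = make_classification(n_samples=10, n_features=5, random_state=1)
normalized_X = normalize(X, norm='l2')

# each row should have (approximately) unit l2 norm
print(np.linalg.norm(normalized_X, axis=1))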

This example demonstrates how to preprocess data using normalization, which can improve the performance of many machine learning algorithms, particularly those based on distances or dot products between samples, by placing every sample on a common scale.
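
In a modeling workflow, the same sample-wise scaling is usually applied with the Normalizer transformer inside a Pipeline, so that it is reapplied consistently to training and test data. The sketch below is only one illustrative setup; the choice of KNeighborsClassifier and the train/test split are assumptions, not part of the example above:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# a larger synthetic dataset for fitting a model (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Normalizer performs the same l2 sample-wise scaling as normalize()
pipeline = make_pipeline(Normalizer(norm='l2'), KNeighborsClassifier())
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))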


