Using the Pipeline class in scikit-learn allows you to chain multiple data preprocessing steps and a machine learning model into a single workflow. This streamlines the process and ensures consistency. Pipeline does not have specific hyperparameters, but it requires a list of steps, each defined by a tuple containing a name and a transformer or estimator. Pipeline is suitable for various machine learning tasks, including classification and regression.
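Conceptually, fitting a pipeline applies each transformer's fit_transform in order and then fits the final estimator on the transformed data, while predicting applies each fitted transformer's transform followed by the final estimator's predict. The hand-rolled sketch below is only for intuition; it mirrors the steps used in the example that follows, and the function names are illustrative, not part of scikit-learn.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def manual_fit(X, y):
    # each transformer is fit on the output of the previous step
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)
    # the final estimator is fit on the fully transformed data
    model = LogisticRegression()
    model.fit(X_reduced, y)
    return scaler, pca, model

def manual_predict(scaler, pca, model, X):
    # at prediction time the fitted transformers are applied in the same order
    return model.predict(pca.transform(scaler.transform(X)))

A Pipeline does this chaining for you, so there is no risk of forgetting a step or applying the transformers in a different order at prediction time.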
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# generate a binary classification dataset
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# create a pipeline
pipeline = Pipeline(steps=[
('scaler', StandardScaler()),
('pca', PCA(n_components=2)),
('model', LogisticRegression())
])
# fit the pipeline on the training data
pipeline.fit(X_train, y_train)
# evaluate the pipeline on the test data
yhat = pipeline.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a single prediction
row = [[-1.10325445, -0.49821356, -0.05962247, -0.89224592, -0.70158632]]
yhat = pipeline.predict(row)
print('Predicted: %d' % yhat[0])
Running the example gives an output like:
Accuracy: 0.950
Predicted: 0
The steps are as follows:
1. Generate a synthetic binary classification dataset using make_classification(), specifying the number of samples, features, and classes, and ensuring reproducibility with a random seed.
2. Split the dataset into training and test sets using train_test_split().
3. Create a pipeline with three steps:
   - StandardScaler() for standardizing features.
   - PCA for dimensionality reduction to two components.
   - LogisticRegression as the final classification model.
4. Fit the pipeline on the training data with the fit() method (after fitting, the individual steps can be inspected by name, as shown in the sketch after this list).
5. Evaluate the pipeline by predicting on the test set and calculating accuracy using accuracy_score().
6. Make a single prediction with the fitted pipeline on a new data sample using predict().
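As noted in step 4, once fit() has been called, each fitted step remains accessible under the name you gave it in the steps list via the named_steps attribute. A small sketch, assuming the pipeline, X_train, and y_train from the example above:

# inspect the fitted steps by the names given in the steps list
pipeline.fit(X_train, y_train)
pca = pipeline.named_steps['pca']
print('Explained variance ratio:', pca.explained_variance_ratio_)
model = pipeline.named_steps['model']
# the logistic regression coefficients live in the 2-component PCA space
print('Coefficient shape:', model.coef_.shape)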
This example demonstrates how to combine preprocessing steps and a machine learning model into a single streamlined pipeline, making the workflow more efficient and organized. The Pipeline
ensures that all steps are applied consistently to both training and test data, reducing the risk of data leakage and improving reproducibility.
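To make the data leakage point concrete, the whole pipeline can be passed to cross-validation as a single estimator, so the scaler and PCA are re-fit on each training fold and never see the corresponding validation fold. A minimal sketch, reusing X, y, and pipeline from the example above:

from sklearn.model_selection import cross_val_score
# cross-validate the whole pipeline: preprocessing is re-fit inside each fold
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print('Cross-validated accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))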