The make_pipeline() function simplifies the creation of machine learning pipelines by automatically naming each step. This example demonstrates how to use make_pipeline() to create a streamlined workflow that includes standardization, dimensionality reduction, and logistic regression for a binary classification task.
make_pipeline() is useful for chaining multiple processing steps and a final estimator into a single workflow. It is appropriate for a variety of problem types, including classification, regression, and clustering.
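Before the full example, a minimal sketch illustrates the automatic naming. It compares make_pipeline() with an explicitly named Pipeline, using StandardScaler and LogisticRegression purely as example steps:

```python
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline() names each step after the lowercased class name
auto = make_pipeline(StandardScaler(), LogisticRegression())
print(list(auto.named_steps))  # ['standardscaler', 'logisticregression']

# the equivalent Pipeline() requires explicit (name, estimator) tuples
manual = Pipeline([('scaler', StandardScaler()),
                   ('clf', LogisticRegression())])
print(list(manual.named_steps))  # ['scaler', 'clf']
```

Both objects behave identically at fit and predict time; the only difference is how the step names are chosen.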
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# create a pipeline
pipeline = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())
# fit the pipeline on the training data
pipeline.fit(X_train, y_train)
# evaluate the pipeline on the test data
yhat = pipeline.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a single prediction
row = [[-0.7612069, 0.5197695, 0.870965, -0.5985741, 0.9641751,
        0.8657729, 0.8502333, 1.1002584, -0.886034, -0.9237921]]
yhat = pipeline.predict(row)
print('Predicted: %d' % yhat[0])
Running the example gives an output like:
Accuracy: 0.950
Predicted: 1
The steps are as follows:

First, a synthetic binary classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), and classes (n_classes). The dataset is split into training and test sets using train_test_split().

Next, a pipeline is created using make_pipeline(), which includes:

- StandardScaler() for standardizing the features.
- PCA() for reducing the dimensionality to 5 components.
- LogisticRegression() as the final estimator.

The pipeline is then fit on the training data using the fit() method.

The performance of the pipeline is evaluated by comparing the predictions (yhat) to the actual values (y_test) using the accuracy score metric.

A single prediction can be made by passing a new data sample to the predict() method.
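Because make_pipeline() generates predictable step names, the fitted transformers can also be inspected after training. As a small sketch (rebuilding the pipeline from the example above so the snippet is self-contained), the fitted PCA step is reachable through named_steps:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# same synthetic dataset as in the example
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

pipeline = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())
pipeline.fit(X, y)

# the fitted PCA step is available under its auto-generated name
pca = pipeline.named_steps['pca']
print(pca.explained_variance_ratio_.shape)  # (5,)
```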
This example demonstrates how to effectively use make_pipeline() to streamline the workflow of preprocessing steps and model training, making it easier to manage and apply complex machine learning workflows in scikit-learn.
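A further benefit of the auto-generated step names is that they become the prefixes for hyperparameter tuning with GridSearchCV, addressed as step__parameter. The following sketch shows the idea; the grid values here are illustrative, not taken from the original example:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

pipeline = make_pipeline(StandardScaler(), PCA(), LogisticRegression())

# step names generated by make_pipeline() form the parameter prefixes
param_grid = {
    'pca__n_components': [3, 5, 7],
    'logisticregression__C': [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Tuning the preprocessing and the estimator jointly like this avoids leaking test-fold information into the scaler or PCA, since each is refit inside every cross-validation split.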