The make_pipeline() function simplifies the creation of machine learning pipelines by automatically naming each step. This example demonstrates how to use make_pipeline() to create a streamlined workflow that includes standardization, dimensionality reduction, and logistic regression for a binary classification task.
make_pipeline() is useful for chaining multiple processing steps and a final estimator into a single workflow. It is appropriate for a variety of problem types, including classification, regression, and clustering.
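Before the full example, a minimal sketch illustrates the automatic naming. It compares make_pipeline() with an explicitly named Pipeline, using StandardScaler and LogisticRegression purely as example steps:

```python
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# make_pipeline() names each step after the lowercased class name
auto = make_pipeline(StandardScaler(), LogisticRegression())
print(list(auto.named_steps))  # ['standardscaler', 'logisticregression']

# the equivalent Pipeline() requires explicit (name, estimator) tuples
manual = Pipeline([('scaler', StandardScaler()),
                   ('clf', LogisticRegression())])
print(list(manual.named_steps))  # ['scaler', 'clf']
```

Both objects behave identically at fit and predict time; the only difference is how the step names are chosen.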
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# create a pipeline
pipeline = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())
# fit the pipeline on the training data
pipeline.fit(X_train, y_train)
# evaluate the pipeline on the test data
yhat = pipeline.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a single prediction
row = [[-0.7612069, 0.5197695, 0.870965, -0.5985741, 0.9641751,
        0.8657729, 0.8502333, 1.1002584, -0.886034, -0.9237921]]
yhat = pipeline.predict(row)
print('Predicted: %d' % yhat[0])
Running the example gives an output like:
Accuracy: 0.950
Predicted: 1
The steps are as follows:

First, a synthetic binary classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), and classes (n_classes). The dataset is split into training and test sets using train_test_split().

Next, a pipeline is created using make_pipeline(), which includes:

- StandardScaler() for standardizing the features.
- PCA() for reducing the dimensionality to 5 components.
- LogisticRegression() as the final estimator.

The pipeline is then fit on the training data using the fit() method.

The performance of the pipeline is evaluated by comparing the predictions (yhat) to the actual values (y_test) using the accuracy score metric.

A single prediction can be made by passing a new data sample to the predict() method.
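Because make_pipeline() generates predictable step names, the fitted transformers can also be inspected after training. As a small sketch (rebuilding the pipeline from the example above so the snippet is self-contained), the fitted PCA step is reachable through named_steps:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# same synthetic dataset as in the example
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

pipeline = make_pipeline(StandardScaler(), PCA(n_components=5), LogisticRegression())
pipeline.fit(X, y)

# the fitted PCA step is available under its auto-generated name
pca = pipeline.named_steps['pca']
print(pca.explained_variance_ratio_.shape)  # (5,)
```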
This example demonstrates how to effectively use make_pipeline() to streamline the workflow of preprocessing steps and model training, making it easier to manage and apply complex machine learning workflows in scikit-learn.
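A further benefit of the auto-generated step names is that they become the prefixes for hyperparameter tuning with GridSearchCV, addressed as step__parameter. The following sketch shows the idea; the grid values here are illustrative, not taken from the original example:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

pipeline = make_pipeline(StandardScaler(), PCA(), LogisticRegression())

# step names generated by make_pipeline() form the parameter prefixes
param_grid = {
    'pca__n_components': [3, 5, 7],
    'logisticregression__C': [0.1, 1.0, 10.0],
}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Tuning the preprocessing and the estimator jointly like this avoids leaking test-fold information into the scaler or PCA, since each is refit inside every cross-validation split.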