
Scikit-Learn FeatureUnion for Pipelines

FeatureUnion is a powerful tool in scikit-learn for combining the outputs of multiple feature extraction methods into a single, composite transformer. This is especially useful for integrating different preprocessing steps that all need to be applied to the same dataset before modeling.

FeatureUnion combines different feature extraction techniques by applying each one to the same input data and concatenating their outputs column-wise. Key parameters include transformer_list (a list of (name, transformer) tuples to apply) and n_jobs (the number of jobs to run in parallel). It is suitable for preprocessing in both classification and regression problems, particularly when different kinds of features need to be combined.
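
As a quick illustration of this concatenation behaviour, the minimal sketch below applies a 5-component PCA and a StandardScaler to the same randomly generated 20-feature matrix (the data and the n_jobs setting are illustrative assumptions, not part of the worked example that follows); the union simply stacks the two outputs side by side, giving 5 + 20 = 25 columns.

import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 100 samples, 20 features
X_demo = np.random.RandomState(0).normal(size=(100, 20))

# transformer_list is a list of (name, transformer) tuples;
# n_jobs=1 runs the transformers sequentially
union = FeatureUnion(
    transformer_list=[('pca', PCA(n_components=5)), ('std', StandardScaler())],
    n_jobs=1)

# Each transformer sees the same input; outputs are concatenated column-wise
print(union.fit_transform(X_demo).shape)  # (100, 25)

The complete worked example below uses the same idea inside a Pipeline with a classifier: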

from sklearn.datasets import make_classification
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Define feature extraction methods
features = FeatureUnion([
    ('pca', PCA(n_components=5)),
    ('std', StandardScaler())
])

# Create pipeline
pipeline = Pipeline([
    ('features', features),
    ('model', LogisticRegression())
])

# Fit model
pipeline.fit(X_train, y_train)

# Evaluate model
yhat = pipeline.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

# Make a prediction
row = [[-0.46947439, 0.54256004, -0.46341769, -0.46572975, 0.24196227,
         -1.91328024, -1.72491783, -0.56228753, -1.01283112, 0.31424733,
         -0.90802408, -1.4123037, 1.46564877, -0.2257763, 0.0675282,
         -1.42474819, -0.54438272, 0.11092259, -1.15099358, 0.37569802]]
yhat = pipeline.predict(row)
print('Predicted: %d' % yhat[0])

Running the example gives an output like:

Accuracy: 0.900
Predicted: 1

The steps are as follows:

  1. First, a synthetic binary classification dataset is generated using the make_classification() function, creating a dataset with a specified number of samples (n_samples), features (n_features), and classes (n_classes).

  2. The dataset is split into training and test sets using train_test_split() to ensure that the model can be trained and evaluated separately.

  3. Two feature extraction methods are defined: PCA for dimensionality reduction and StandardScaler for standardization (zero mean, unit variance). These are combined using FeatureUnion.

  4. A Pipeline is created that first applies the combined feature extraction methods and then fits a LogisticRegression model. Because each step is named, nested parameters can also be tuned; see the sketch after this list.

  5. The model is fit on the training data using the fit() method of the pipeline.

  6. The performance of the model is evaluated by predicting the test data and calculating the accuracy score with accuracy_score.

  7. A single prediction is made by passing a new data sample through the pipeline using the predict() method.
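
As noted in step 4, every nested component can be addressed through double-underscore parameter names of the form step__substep__param. The sketch below is one possible extension, reusing the pipeline, X_train, and y_train defined above; the grid values are arbitrary assumptions chosen only to show the naming convention.

from sklearn.model_selection import GridSearchCV

# Tune the number of PCA components inside the union and the
# regularization strength of the final model in one search
param_grid = {
    'features__pca__n_components': [2, 5, 10],  # illustrative values
    'model__C': [0.1, 1.0, 10.0]                # illustrative values
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)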

This example demonstrates how to use FeatureUnion to combine multiple preprocessing steps in a single pipeline, enabling more complex data transformations and feature combinations before modeling.
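
After fitting, the union can also be pulled out of the pipeline and inspected directly. The short sketch below assumes a reasonably recent scikit-learn release, where transformers expose get_feature_names_out(); the exact name strings may differ between versions.

# Continuing from the fitted pipeline above
union = pipeline.named_steps['features']

# 5 PCA components concatenated with the 20 standardized features
print(union.transform(X_test).shape)  # (20, 25)

# Column names are prefixed with the names given to FeatureUnion
print(union.get_feature_names_out())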



See Also