Scikit-Learn make_union() for Pipelines

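The make_union() function combines several transformers or feature extraction pipelines into a single transformer. Each one is applied to the same input, and their outputs are concatenated column-wise into one feature matrix. This is useful when a model benefits from several different representations of the same data. To apply different preprocessing steps to different columns of heterogeneous data, ColumnTransformer is the usual tool.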

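make_union() is a shorthand for constructing a FeatureUnion: it takes the transformers as positional arguments and names each one automatically from its class name. Besides the transformers themselves, it accepts n_jobs to fit them in parallel and verbose to print the time taken to fit each one.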
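As a minimal sketch of the concatenation behaviour, the snippet below fits a union of two scalers on a small toy array. The array X_demo and its values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# toy input: 5 rows, 3 features (values chosen only for illustration)
X_demo = np.arange(15, dtype=float).reshape(5, 3)

# each transformer is fit on the same full input
union = make_union(StandardScaler(), MinMaxScaler())
X_out = union.fit_transform(X_demo)

# outputs are stacked column-wise: 3 standardized + 3 min-max scaled columns
print(X_demo.shape, X_out.shape)
```

The output has six columns: the first three come from StandardScaler and the last three from MinMaxScaler, so the feature count grows with every transformer added to the union.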

This approach is suitable for preprocessing tasks in both classification and regression problems.

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# generate a synthetic dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# define individual feature extraction pipelines
pipeline_1 = make_pipeline(StandardScaler())
pipeline_2 = make_pipeline(MinMaxScaler())

# combine pipelines into a single feature union
combined_features = make_union(pipeline_1, pipeline_2)

# create a final pipeline with feature union and model
model_pipeline = make_pipeline(combined_features, LogisticRegression())

# fit the model
model_pipeline.fit(X_train, y_train)

# evaluate the model
yhat = model_pipeline.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Running the example gives an output like:

Accuracy: 0.950

The steps are as follows:

  1. A synthetic classification dataset is generated using make_classification(). This dataset has 100 samples and 10 features. The dataset is split into training and test sets using train_test_split().

  2. Two separate preprocessing pipelines are created using make_pipeline(): one with StandardScaler() and another with MinMaxScaler().

  3. The make_union() function combines these pipelines into a single FeatureUnion, which applies both to the same input and concatenates their outputs, giving 20 features from the original 10.

  4. A final pipeline is created that combines the unified preprocessing steps with a LogisticRegression model.

  5. The combined pipeline is fit to the training data.

  6. The model’s performance is evaluated using the accuracy score metric on the test set.
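The steps above can be checked by inspecting the fitted pipeline. The sketch below rebuilds the same pipeline and looks up the step names that make_pipeline() and make_union() generate automatically; the variable names step_names, branch_names, n_union_features, and n_coefficients are introduced here only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# same data and split as the main example
X, y = make_classification(n_samples=100, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

model_pipeline = make_pipeline(
    make_union(make_pipeline(StandardScaler()), make_pipeline(MinMaxScaler())),
    LogisticRegression(),
)
model_pipeline.fit(X_train, y_train)

# make_pipeline and make_union name steps after the lowercased class names
step_names = list(model_pipeline.named_steps)
union = model_pipeline.named_steps["featureunion"]
branch_names = [name for name, _ in union.transformer_list]

# the union doubles the feature count: 10 standardized + 10 min-max columns
n_union_features = union.transform(X_test).shape[1]
n_coefficients = model_pipeline.named_steps["logisticregression"].coef_.shape[1]
print(step_names, branch_names, n_union_features, n_coefficients)
```

Because both branches are Pipeline objects, make_union() disambiguates them with numeric suffixes (pipeline-1 and pipeline-2), and the logistic regression ends up with one coefficient per concatenated column.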

This example shows how to use make_union() to combine multiple feature extraction pipelines so that their outputs are concatenated into a single feature matrix for a downstream model.
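When each branch is a single transformer, wrapping it in make_pipeline() is optional. The following sketch, a possible simplification rather than part of the original example, passes the scalers to make_union() directly and checks that the result matches the wrapped version on the same synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, _ = make_classification(n_samples=100, n_features=10, random_state=1)

# variant from the example: each scaler wrapped in its own one-step pipeline
wrapped = make_union(make_pipeline(StandardScaler()), make_pipeline(MinMaxScaler()))

# assumed simplification: pass the scalers to make_union() directly
direct = make_union(StandardScaler(), MinMaxScaler())

# both unions produce the same concatenated feature matrix
same_output = np.allclose(wrapped.fit_transform(X), direct.fit_transform(X))
print(same_output)
```

Wrapping in make_pipeline() becomes worthwhile once a branch chains more than one step, for example scaling followed by a decomposition.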


