Configure StackingRegressor "passthrough" Parameter

The passthrough parameter in scikit-learn’s StackingRegressor determines whether the final estimator is trained on the original features in addition to the predictions from the base estimators.

StackingRegressor is an ensemble method that combines multiple regression models via a meta-regressor (the final_estimator). The cross-validated predictions of the base estimators form the training input of the final estimator, and passthrough controls whether the original features are appended to that input.

When passthrough=True, the meta-regressor receives both the original features and the base estimator predictions. This can improve performance because the meta-model can use the original features directly instead of seeing them only through the base estimators' predictions.

The default value for passthrough is False. Setting it to True can be beneficial when the original features contain information not fully captured by the base estimators.
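
To make the effect of passthrough concrete, the following minimal sketch inspects the input that the final estimator receives by calling transform on a fitted StackingRegressor. The small dataset and the two base estimators (LinearRegression and a shallow DecisionTreeRegressor) are assumptions chosen only to keep the sketch fast; they are not part of the main example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Small illustrative dataset: 200 samples, 5 original features (assumed sizes)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Two lightweight base estimators, chosen only for speed
estimators = [
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]

shapes = {}
for passthrough in (False, True):
    stack = StackingRegressor(estimators=estimators, passthrough=passthrough)
    stack.fit(X, y)
    # transform returns the features that are fed to the final estimator
    meta_features = stack.transform(X)
    shapes[passthrough] = meta_features.shape
    print(f"passthrough={passthrough}: final estimator input shape {meta_features.shape}")

# passthrough=False: final estimator input shape (200, 2)
# passthrough=True: final estimator input shape (200, 7)
```

With two base estimators and five original features, the final estimator sees 2 columns without passthrough and 2 + 5 = 7 columns with it.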

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base models
base_models = [
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
    ('svr', SVR(kernel='rbf'))
]

# Create StackingRegressor with passthrough=False
stacking_false = StackingRegressor(
    estimators=base_models,
    final_estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    passthrough=False
)

# Create StackingRegressor with passthrough=True
stacking_true = StackingRegressor(
    estimators=base_models,
    final_estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    passthrough=True
)

# Train and evaluate models
for model in [stacking_false, stacking_true]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"passthrough={model.passthrough}, MSE: {mse:.4f}")

Running the example gives an output like:

passthrough=False, MSE: 7205.7209
passthrough=True, MSE: 4680.9284

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Define base models (RandomForestRegressor and SVR)
Create two StackingRegressor models, one with passthrough=False and one with passthrough=True
Train both models and evaluate their performance using mean squared error

Tips and heuristics for setting passthrough:

Use passthrough=True when you suspect the original features contain valuable information not captured by the base estimators
Weigh the potential performance gain against the extra computational cost: with passthrough=True the final estimator is trained on one additional column per original feature
Experiment with both options and compare performance to determine the best setting for your specific dataset
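
One way to run the comparison suggested in the tips above is cross-validation rather than a single train/test split. The sketch below is a minimal illustration under assumed settings: the small dataset, the lightweight base estimators, and the choice of 3 folds are all placeholders for your own data and models.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Small illustrative dataset (assumed sizes)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Lightweight base estimators, chosen only to keep the sketch fast
estimators = [
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]

cv_mse = {}
for passthrough in (False, True):
    stack = StackingRegressor(estimators=estimators, passthrough=passthrough)
    # scikit-learn reports negated MSE so that higher is always better
    scores = cross_val_score(stack, X, y, cv=3, scoring="neg_mean_squared_error")
    cv_mse[passthrough] = -scores.mean()
    print(f"passthrough={passthrough}, mean CV MSE: {cv_mse[passthrough]:.4f}")
```

Averaging over folds gives a more stable comparison than a single split, at the cost of fitting each stacking model several times.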

Issues to consider:

Setting passthrough=True increases the dimensionality of the meta-regressor’s input, which may lead to overfitting on smaller datasets
The effectiveness of passthrough depends on the choice of base estimators and final estimator
Setting passthrough=True also increases computational cost and memory usage, especially for high-dimensional datasets
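
Because the effect of passthrough depends on the final estimator, it can help to search over both together. The following sketch uses GridSearchCV for that. The candidate final estimators (RidgeCV and a shallow DecisionTreeRegressor), the base estimators, and the dataset are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Small illustrative dataset (assumed sizes)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Lightweight base estimators, chosen only to keep the sketch fast
estimators = [
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]

# Search over passthrough and the final estimator jointly (2 x 2 = 4 candidates)
param_grid = {
    "passthrough": [False, True],
    "final_estimator": [
        RidgeCV(),
        DecisionTreeRegressor(max_depth=3, random_state=0),
    ],
}

search = GridSearchCV(
    StackingRegressor(estimators=estimators),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

# Report the mean cross-validated MSE for every combination
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    name = type(params["final_estimator"]).__name__
    print(f"passthrough={params['passthrough']}, final_estimator={name}, CV MSE: {-score:.4f}")
```

The combination with the lowest cross-validated error is available afterwards as search.best_params_.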
