The passthrough parameter in scikit-learn’s StackingRegressor determines whether to include the original features alongside the predictions from the base estimators.
StackingRegressor is an ensemble method that combines multiple regression models via a meta-regressor (the final estimator), which is trained on cross-validated predictions of the base estimators. The passthrough parameter controls whether the final estimator also sees the original features, in addition to those predictions.
When passthrough=True, the meta-regressor receives both the original features and the base estimator predictions. This can improve performance by letting the meta-model use the original features directly.
The default value for passthrough is False. Setting it to True can be beneficial when the original features contain information not fully captured by the base estimators.
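To make the effect concrete, the following minimal sketch inspects the width of the matrix that reaches the final estimator under each setting. The dataset sizes, the choice of Ridge and DecisionTreeRegressor as base estimators, and the names X_small, y_small and meta_shapes are illustrative assumptions, not part of the example that follows.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Small synthetic dataset (illustrative sizes): 200 samples, 5 original features
X_small, y_small = make_regression(
    n_samples=200, n_features=5, noise=0.1, random_state=0
)

# Two base estimators, so each sample gets two base predictions
estimators = [
    ("ridge", Ridge()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]

meta_shapes = {}
for passthrough in (False, True):
    stack = StackingRegressor(
        estimators=estimators,
        final_estimator=LinearRegression(),
        passthrough=passthrough,
    )
    stack.fit(X_small, y_small)
    # transform() returns the meta-features that are fed to the final estimator
    meta_shapes[passthrough] = stack.transform(X_small).shape
    print(f"passthrough={passthrough}: meta-features shape = {meta_shapes[passthrough]}")
```

With two base estimators and five original features, the final estimator should see two columns when passthrough=False and seven (two predictions plus five features) when passthrough=True.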
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base models
base_models = [
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
    ('svr', SVR(kernel='rbf'))
]
# Create StackingRegressor with passthrough=False
stacking_false = StackingRegressor(
    estimators=base_models,
    final_estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    passthrough=False
)
# Create StackingRegressor with passthrough=True
stacking_true = StackingRegressor(
    estimators=base_models,
    final_estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    passthrough=True
)
# Train and evaluate models
for model in [stacking_false, stacking_true]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"passthrough={model.passthrough}, MSE: {mse:.4f}")
Running the example gives an output like:
passthrough=False, MSE: 7205.7209
passthrough=True, MSE: 4680.9284
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define base models (RandomForestRegressor and SVR)
- Create two StackingRegressor models, one with passthrough=False and one with passthrough=True
- Train both models and evaluate their performance using mean squared error
Tips and heuristics for setting passthrough:
- Use passthrough=True when you suspect the original features contain valuable information not captured by the base estimators
- Consider the trade-off between potential performance gain and increased computational cost
- Experiment with both options and compare performance to determine the best setting for your specific dataset
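One way to run the comparison suggested in the last tip is to treat passthrough as a hyperparameter and let cross-validation choose it, rather than relying on a single train/test split. The sketch below assumes a smaller synthetic dataset, RidgeCV as the final estimator and 3-fold cross-validation. These choices, along with the names X_cv, y_cv and search, are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Illustrative dataset; substitute your own X and y
X_cv, y_cv = make_regression(
    n_samples=300, n_features=10, noise=0.1, random_state=42
)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=10, random_state=42)),
        ("svr", SVR(kernel="rbf")),
    ],
    final_estimator=RidgeCV(),
)

# Search over the two possible values of passthrough
search = GridSearchCV(
    stack,
    param_grid={"passthrough": [False, True]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_cv, y_cv)

# Scores are negated MSE, so flip the sign for reporting
results = zip(search.cv_results_["params"], search.cv_results_["mean_test_score"])
for params, score in results:
    print(f"passthrough={params['passthrough']}: mean CV MSE = {-score:.4f}")
print("Best setting:", search.best_params_)
```

Which setting wins depends on the data and the estimators involved, so the result of this search on synthetic data should not be read as a general recommendation.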
Issues to consider:
- Setting passthrough=True increases the dimensionality of the meta-regressor’s input, which may lead to overfitting on smaller datasets
- The effectiveness of passthrough depends on the choice of base estimators and final estimator
- Setting passthrough=True increases computational cost and memory usage, especially for high-dimensional datasets
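To explore how the final estimator interacts with passthrough=True on a small, wide dataset, the following sketch cross-validates two candidate final estimators: a RandomForestRegressor and a regularized linear model, RidgeCV (the default final estimator when none is given). The dataset shape and the names X_wide, y_wide and cv_mse are assumptions made for illustration. The sketch makes no claim about which estimator will come out ahead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Deliberately small and wide dataset (illustrative sizes)
X_wide, y_wide = make_regression(
    n_samples=100, n_features=30, noise=0.1, random_state=42
)

base_models = [
    ("rf", RandomForestRegressor(n_estimators=10, random_state=42)),
    ("svr", SVR(kernel="rbf")),
]

# Candidate final estimators to compare with passthrough=True
final_estimators = {
    "RandomForestRegressor": RandomForestRegressor(n_estimators=10, random_state=42),
    "RidgeCV": RidgeCV(),
}

cv_mse = {}
for name, final_estimator in final_estimators.items():
    stack = StackingRegressor(
        estimators=base_models,
        final_estimator=final_estimator,
        passthrough=True,
    )
    scores = cross_val_score(
        stack, X_wide, y_wide, scoring="neg_mean_squared_error", cv=3
    )
    # Scores are negated MSE, so flip the sign before storing
    cv_mse[name] = -scores.mean()
    print(f"final_estimator={name}: mean CV MSE = {cv_mse[name]:.4f}")
```

Repeating the loop with passthrough=False gives the matching baseline for each final estimator.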