The `estimators` parameter in scikit-learn’s `StackingRegressor` defines the set of first-level estimators used in the stacking ensemble.

Stacking is an ensemble learning technique that combines multiple base models to improve prediction performance. The `estimators` parameter specifies the list of these base models.
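Each base model is passed as a `(name, estimator)` tuple, where the name is a unique string used to identify the fitted model. For example:

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# estimators is a list of (name, estimator) tuples;
# each name must be a unique string identifier
stacking = StackingRegressor(
    estimators=[('lr', LinearRegression()), ('ridge', Ridge())],
    final_estimator=LinearRegression()
)
```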
Effective configuration of `estimators` is crucial for the performance of the stacking ensemble. It typically involves selecting diverse models that capture different aspects of the data.
The `estimators` parameter has no default value; it is a required argument. Common configurations include a mix of different algorithm types, such as linear models, decision trees, and neural networks.
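As a minimal sketch, a mixed configuration of that kind might look like the following (the `MLPRegressor` settings here are illustrative, not tuned):

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# One base model per algorithm family: linear, tree, neural network
mixed_estimators = [
    ('lr', LinearRegression()),
    ('dt', DecisionTreeRegressor(max_depth=5)),
    ('mlp', MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=42)),
]
stacking = StackingRegressor(estimators=mixed_estimators,
                             final_estimator=LinearRegression())
```

The complete example below compares several `estimators` configurations on a synthetic dataset: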
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define different estimator configurations
configs = [
    ("Basic", [('lr', LinearRegression())]),
    ("Diverse", [('lr', LinearRegression()), ('ridge', Ridge()), ('dt', DecisionTreeRegressor())]),
    ("Complex", [('lr', LinearRegression()), ('ridge', Ridge()), ('dt', DecisionTreeRegressor()),
                 ('dt2', DecisionTreeRegressor(max_depth=5))])
]
# Train and evaluate a stacking ensemble for each configuration
for name, estimators in configs:
    stacking = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
    stacking.fit(X_train, y_train)
    y_pred = stacking.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name} configuration - MSE: {mse:.4f}")
Running the example gives an output like:

Basic configuration - MSE: 0.0095
Diverse configuration - MSE: 0.0095
Complex configuration - MSE: 0.0095

All three configurations score essentially the same here because `make_regression` produces a linear target with little noise; the linear base model already fits it almost perfectly, so the added tree models contribute little.
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define different `estimators` configurations for `StackingRegressor`
- Train and evaluate models with each configuration
- Compare mean squared error (MSE) for each setup
Tips for configuring `estimators`:
- Use a diverse set of base models to capture different patterns in the data
- Consider the strengths and weaknesses of each base model
- Experiment with different combinations to find the optimal setup for your specific problem (see the sketch after this list)
- Balance model complexity with computational resources
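As a starting point for that kind of experimentation, here is a minimal sketch that scores a few candidate `estimators` lists with cross-validation (the candidate configurations are illustrative, not exhaustive):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Candidate estimators lists to compare
candidates = {
    "linear_only": [('lr', LinearRegression()), ('ridge', Ridge())],
    "linear_plus_tree": [('ridge', Ridge()), ('dt', DecisionTreeRegressor(max_depth=5))],
}

# Score each candidate configuration with 5-fold cross-validation
for label, estimators in candidates.items():
    model = StackingRegressor(estimators=estimators, final_estimator=Ridge())
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"{label}: MSE = {-scores.mean():.4f}")
```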
Issues to consider:
- More base models increase computational cost and may not always improve performance
- The choice of final estimator can impact the effectiveness of the stacking ensemble
- Overfitting can occur if the base models are too complex or if there’s data leakage in the stacking process (see the sketch below)
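On the last point: `StackingRegressor` trains the final estimator on out-of-fold predictions of the base models, and its `cv` parameter controls that internal cross-validation. A minimal sketch:

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# cv controls the internal cross-validation used to generate the
# out-of-fold base-model predictions that train the final estimator
stacking = StackingRegressor(
    estimators=[('ridge', Ridge()), ('dt', DecisionTreeRegressor(max_depth=5))],
    final_estimator=Ridge(),
    cv=10,  # more folds: less biased meta-features, higher training cost
)
```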