The `estimators` parameter in scikit-learn’s `StackingRegressor` defines the set of first-level estimators used in the stacking ensemble.

Stacking is an ensemble learning technique that combines multiple base models to improve prediction performance. The `estimators` parameter specifies the list of these base models.
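Each base model is passed as a `(name, estimator)` tuple, where the name is a unique string used to identify the fitted model. For example:

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# estimators is a list of (name, estimator) tuples;
# each name must be a unique string identifier
stacking = StackingRegressor(
    estimators=[('lr', LinearRegression()), ('ridge', Ridge())],
    final_estimator=LinearRegression()
)
```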
Effective configuration of `estimators` is crucial for the performance of the stacking ensemble. It typically involves selecting diverse models that capture different aspects of the data.
The `estimators` parameter has no default value; it is a required argument. Common configurations include a mix of different algorithm types, such as linear models, decision trees, and neural networks.
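As a minimal sketch, a mixed configuration of that kind might look like the following (the `MLPRegressor` settings here are illustrative, not tuned):

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# One base model per algorithm family: linear, tree, neural network
mixed_estimators = [
    ('lr', LinearRegression()),
    ('dt', DecisionTreeRegressor(max_depth=5)),
    ('mlp', MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=42)),
]
stacking = StackingRegressor(estimators=mixed_estimators,
                             final_estimator=LinearRegression())
```

The complete example below compares several `estimators` configurations on a synthetic dataset: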
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define different estimator configurations
configs = [
    ("Basic", [('lr', LinearRegression())]),
    ("Diverse", [('lr', LinearRegression()), ('ridge', Ridge()), ('dt', DecisionTreeRegressor())]),
    ("Complex", [('lr', LinearRegression()), ('ridge', Ridge()), ('dt', DecisionTreeRegressor()),
                 ('dt2', DecisionTreeRegressor(max_depth=5))])
]
# Train and evaluate a stacking ensemble for each configuration
for name, estimators in configs:
    stacking = StackingRegressor(estimators=estimators, final_estimator=LinearRegression())
    stacking.fit(X_train, y_train)
    y_pred = stacking.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name} configuration - MSE: {mse:.4f}")
Running the example gives an output like:

Basic configuration - MSE: 0.0095
Diverse configuration - MSE: 0.0095
Complex configuration - MSE: 0.0095

All three configurations score essentially the same here because `make_regression` produces a linear target with little noise; the linear base model already fits it almost perfectly, so the added tree models contribute little.
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define different `estimators` configurations for `StackingRegressor`
- Train and evaluate models with each configuration
- Compare mean squared error (MSE) for each setup
Tips for configuring `estimators`:
- Use a diverse set of base models to capture different patterns in the data
- Consider the strengths and weaknesses of each base model
- Experiment with different combinations to find the optimal setup for your specific problem (see the sketch after this list)
- Balance model complexity with computational resources
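As a starting point for that kind of experimentation, here is a minimal sketch that scores a few candidate `estimators` lists with cross-validation (the candidate configurations are illustrative, not exhaustive):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Candidate estimators lists to compare
candidates = {
    "linear_only": [('lr', LinearRegression()), ('ridge', Ridge())],
    "linear_plus_tree": [('ridge', Ridge()), ('dt', DecisionTreeRegressor(max_depth=5))],
}

# Score each candidate configuration with 5-fold cross-validation
for label, estimators in candidates.items():
    model = StackingRegressor(estimators=estimators, final_estimator=Ridge())
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    print(f"{label}: MSE = {-scores.mean():.4f}")
```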
Issues to consider:
- More base models increase computational cost and may not always improve performance
- The choice of final estimator can impact the effectiveness of the stacking ensemble
- Overfitting can occur if the base models are too complex or if there’s data leakage in the stacking process (see the sketch below)
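On the last point: `StackingRegressor` trains the final estimator on out-of-fold predictions of the base models, and its `cv` parameter controls that internal cross-validation. A minimal sketch:

```python
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# cv controls the internal cross-validation used to generate the
# out-of-fold base-model predictions that train the final estimator
stacking = StackingRegressor(
    estimators=[('ridge', Ridge()), ('dt', DecisionTreeRegressor(max_depth=5))],
    final_estimator=Ridge(),
    cv=10,  # more folds: less biased meta-features, higher training cost
)
```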