The final_estimator parameter in scikit-learn’s StackingRegressor determines the model used to combine predictions from base estimators.
Stacking is an ensemble method that trains multiple base models and a meta-model (the final estimator) to combine their predictions. The final_estimator is crucial because it learns how best to integrate the base models’ outputs; in scikit-learn it is fit on cross-validated (out-of-fold) predictions from the base estimators, which helps prevent the meta-model from simply memorizing the training data.
By default, StackingRegressor uses RidgeCV as the final estimator. Common alternatives include LinearRegression, RandomForestRegressor, or other models capable of handling the base estimators’ outputs.
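If final_estimator is left as None, scikit-learn creates a RidgeCV instance internally; you can verify this on a fitted model through the final_estimator_ attribute. A minimal check:
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.tree import DecisionTreeRegressor
X, y = make_regression(n_samples=100, n_features=5, random_state=0)
# No final_estimator supplied, so the default RidgeCV is used
stack = StackingRegressor(estimators=[('dt', DecisionTreeRegressor(random_state=0))])
stack.fit(X, y)
print(type(stack.final_estimator_).__name__)  # prints: RidgeCV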
The choice of final_estimator can significantly impact the ensemble’s performance, especially when the base estimators have complementary strengths. The example below compares the default RidgeCV against LinearRegression and RandomForestRegressor as final estimators:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base estimators
base_estimators = [
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
    ('gb', GradientBoostingRegressor(n_estimators=10, random_state=42))
]
# Create StackingRegressors with different final estimators
final_estimators = {
    'Default (RidgeCV)': None,
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=10, random_state=42)
}
for name, final_estimator in final_estimators.items():
    # cv=5: the final estimator is trained on 5-fold out-of-fold predictions
    stack = StackingRegressor(estimators=base_estimators, final_estimator=final_estimator, cv=5)
    stack.fit(X_train, y_train)
    y_pred = stack.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Final Estimator: {name}, MSE: {mse:.4f}")
Running the example gives an output like:
Final Estimator: Default (RidgeCV), MSE: 3029.6674
Final Estimator: LinearRegression, MSE: 3029.6599
Final Estimator: RandomForestRegressor, MSE: 3644.4075
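In this run the two linear final estimators are nearly indistinguishable, while the RandomForestRegressor meta-model performs noticeably worse: make_regression produces a target that is a linear function of the features, so a linear combination of the base predictions is hard to beat (exact values may differ slightly across scikit-learn versions).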
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define base estimators (RandomForestRegressor and GradientBoostingRegressor)
- Create StackingRegressor models with different final estimators (see the sketch after this list for the meta-features the final estimator consumes)
- Train each model and evaluate its performance using mean squared error
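Because cv=5 is passed, the final estimator is trained on out-of-fold predictions rather than on predictions for samples the base models have already seen. After fitting, the transform method returns the base estimators’ predictions, which is a convenient way to inspect the meta-features the final estimator works with. A small sketch, reusing the variables from the example above:
stack = StackingRegressor(estimators=base_estimators, cv=5)
stack.fit(X_train, y_train)
# One prediction column per base estimator (passthrough=False, the default)
meta_features = stack.transform(X_test)
print(meta_features.shape)  # (200, 2)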
Tips for choosing and configuring the final_estimator:
- Consider the problem type and the nature of the base estimators’ outputs
- Experiment with different models to find the best final estimator for your specific case (a grid-search sketch follows this list)
- Ensure the final estimator can handle the dimensionality of the base estimators’ outputs
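One way to run that experiment systematically is to treat final_estimator itself as a hyperparameter. A minimal sketch using GridSearchCV (the candidate list is illustrative, and each candidate refits the full stack, so this can be slow):
from sklearn.model_selection import GridSearchCV
param_grid = {
    'final_estimator': [
        RidgeCV(),
        LinearRegression(),
        RandomForestRegressor(n_estimators=10, random_state=42),
    ]
}
search = GridSearchCV(
    StackingRegressor(estimators=base_estimators, cv=5),
    param_grid,
    scoring='neg_mean_squared_error',
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_['final_estimator'])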
Issues to consider:
- The final estimator should be able to capture non-linear relationships if present in the base estimators’ outputs
- Overly complex final estimators may lead to overfitting, especially with limited data (see the train/test comparison sketch after this list)
- The computational cost of the final estimator affects the overall training time of the StackingRegressor
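A quick way to check for meta-level overfitting is to compare train and test error for a flexible final estimator against a regularized linear one. A sketch, reusing the data and base estimators from the example above:
for name, final in [('RidgeCV', RidgeCV()),
                    ('RandomForest', RandomForestRegressor(random_state=42))]:
    stack = StackingRegressor(estimators=base_estimators,
                              final_estimator=final, cv=5)
    stack.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, stack.predict(X_train))
    test_mse = mean_squared_error(y_test, stack.predict(X_test))
    # A much lower train MSE than test MSE points to an overfit final estimator
    print(f"{name}: train MSE {train_mse:.1f}, test MSE {test_mse:.1f}")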