The final_estimator parameter in scikit-learn’s StackingRegressor determines the model used to combine predictions from the base estimators. Stacking is an ensemble method that trains multiple base models and a meta-model (the final estimator) to combine their predictions. The final_estimator is crucial because it learns how to best integrate the base models’ outputs.
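To make the stacking idea concrete, here is a minimal hand-rolled sketch: out-of-fold predictions from each base model become the meta-features that a final estimator is trained on. This is illustrative only (StackingRegressor handles this internally); the specific models and dataset sizes are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=0.3, random_state=0)

base_models = [DecisionTreeRegressor(random_state=0), Ridge()]

# Each column holds one base model's out-of-fold predictions,
# so the meta-model never sees predictions made on training folds
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5) for m in base_models
])

# The final estimator learns how to weight the base models' outputs
meta_model = LinearRegression().fit(meta_features, y)
print(meta_model.coef_)  # one weight per base model
```

Using out-of-fold predictions (rather than refitting and predicting on the training data directly) is what keeps the meta-model from simply trusting whichever base model overfits hardest.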
By default, StackingRegressor uses RidgeCV as the final estimator. Common alternatives include LinearRegression, RandomForestRegressor, or any other model capable of handling the base estimators’ outputs. The choice of final_estimator can significantly impact the ensemble’s performance, especially when the base estimators have complementary strengths.
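A quick way to confirm the default: leave final_estimator unset, fit, and inspect the fitted meta-model via the final_estimator_ attribute. The base estimators here are arbitrary placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# final_estimator is left unset, so the default is used
stack = StackingRegressor(
    estimators=[('tree', DecisionTreeRegressor(random_state=0)),
                ('ridge', Ridge())]
)
stack.fit(X, y)

# The fitted meta-model is exposed as final_estimator_
print(type(stack.final_estimator_).__name__)  # RidgeCV
```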
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base estimators
base_estimators = [
    ('rf', RandomForestRegressor(n_estimators=10, random_state=42)),
    ('gb', GradientBoostingRegressor(n_estimators=10, random_state=42))
]

# Create StackingRegressors with different final estimators
final_estimators = {
    'Default (RidgeCV)': None,
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(n_estimators=10, random_state=42)
}

for name, final_estimator in final_estimators.items():
    stack = StackingRegressor(estimators=base_estimators, final_estimator=final_estimator, cv=5)
    stack.fit(X_train, y_train)
    y_pred = stack.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Final Estimator: {name}, MSE: {mse:.4f}")
Running the example produces output like:
Final Estimator: Default (RidgeCV), MSE: 3029.6674
Final Estimator: LinearRegression, MSE: 3029.6599
Final Estimator: RandomForestRegressor, MSE: 3644.4075
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define base estimators (RandomForestRegressor and GradientBoostingRegressor)
- Create StackingRegressor models with different final estimators
- Train each model and evaluate its performance using mean squared error
Tips for choosing and configuring the final_estimator:
- Consider the problem type and the nature of the base estimators’ outputs
- Experiment with different models to find the best final estimator for your specific case
- Ensure the final estimator can handle the dimensionality of the base estimators’ outputs
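One practical way to follow the experimentation tip above: since StackingRegressor is a standard estimator, final_estimator can be treated as a hyperparameter and searched with GridSearchCV. This is a sketch with arbitrary candidate models; None selects the RidgeCV default.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=0.5, random_state=42)

stack = StackingRegressor(
    estimators=[('tree', DecisionTreeRegressor(random_state=42)),
                ('ridge', Ridge())],
    cv=3,
)

# Treat final_estimator as a tunable hyperparameter
grid = GridSearchCV(
    stack,
    param_grid={'final_estimator': [
        None,  # RidgeCV default
        LinearRegression(),
        DecisionTreeRegressor(max_depth=3, random_state=42),
    ]},
    cv=3,
    scoring='neg_mean_squared_error',
)
grid.fit(X, y)
print(grid.best_params_['final_estimator'])
```

Cross-validating the choice this way avoids picking a final estimator based on a single train/test split.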
Issues to consider:
- The final estimator should be able to capture non-linear relationships if present in the base estimators’ outputs
- Overly complex final estimators may lead to overfitting, especially with limited data
- The computational cost of the final estimator affects the overall training time of the StackingRegressor
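Related to the dimensionality and overfitting points above: StackingRegressor’s passthrough parameter controls whether the final estimator also receives the original features, which enlarges its input and can help or hurt depending on data size. A small sketch (models chosen arbitrarily for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=6, noise=0.5, random_state=42)

# passthrough=True feeds the original features to the final estimator
# alongside the base predictions, increasing its input dimensionality
stack = StackingRegressor(
    estimators=[('tree', DecisionTreeRegressor(random_state=42)),
                ('ridge', Ridge())],
    final_estimator=RidgeCV(),
    passthrough=True,
)
stack.fit(X, y)

# 2 base predictions + 6 original features = 8 meta-model inputs
print(stack.final_estimator_.coef_.shape)  # (8,)
```

With passthrough enabled, a regularized final estimator such as RidgeCV is a safer default than a high-capacity model, since the meta-model now has many more inputs to overfit on.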