The estimators parameter in scikit-learn’s VotingRegressor defines the set of base models used in the ensemble. It takes a list of (name, estimator) tuples, one per model.
VotingRegressor combines multiple regression models to improve predictions by averaging their individual outputs.
Selecting diverse and complementary estimators is crucial for maximizing the ensemble’s performance. The goal is to leverage the strengths of different models while mitigating their individual weaknesses.
There is no default value for estimators; it must be explicitly specified when creating a VotingRegressor.
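As a minimal sketch of what the parameter expects, the snippet below passes a list of (name, estimator) tuples and then checks that the ensemble prediction equals the plain average of the fitted members' predictions. The dataset sizes and the names "lr" and "rf" are illustrative choices, not requirements.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Small synthetic dataset (sizes chosen only to keep the sketch fast)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Each entry is a (name, estimator) tuple; the names are arbitrary labels
estimators = [
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=0)),
]
vr = VotingRegressor(estimators=estimators)
vr.fit(X, y)

# After fitting, the fitted members can be looked up by name
fitted_names = list(vr.named_estimators_.keys())
print(fitted_names)

# With no weights given, the ensemble output is the mean of the member outputs
member_preds = np.array([est.predict(X) for est in vr.estimators_])
ensemble_pred = vr.predict(X)
matches_average = np.allclose(ensemble_pred, member_preds.mean(axis=0))
print(matches_average)
```

The names given in the tuples are what appear as keys in named_estimators_ after fitting, so short, descriptive labels are worth choosing.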
Common choices include combinations of different regression models such as LinearRegression, RandomForestRegressor, SVR, and GradientBoostingRegressor.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base models
lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
svr = SVR(kernel='rbf')

# Each combination is a list of (name, estimator) tuples
estimator_combinations = [
    [('lr', lr), ('rf', rf)],
    [('lr', lr), ('svr', svr)],
    [('rf', rf), ('svr', svr)],
    [('lr', lr), ('rf', rf), ('svr', svr)]
]

# Fit a VotingRegressor for each combination and report its test MSE
for i, estimators in enumerate(estimator_combinations):
    vr = VotingRegressor(estimators=estimators)
    vr.fit(X_train, y_train)
    y_pred = vr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Combination {i+1}: {[e[0] for e in estimators]}")
    print(f"Mean Squared Error: {mse:.4f}")
    print()
Running the example gives an output like:
Combination 1: ['lr', 'rf']
Mean Squared Error: 1764.3715
Combination 2: ['lr', 'svr']
Mean Squared Error: 8797.5941
Combination 3: ['rf', 'svr']
Mean Squared Error: 16628.2346
Combination 4: ['lr', 'rf', 'svr']
Mean Squared Error: 7390.2678
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define individual regression models (LinearRegression, RandomForestRegressor, SVR)
- Create VotingRegressor instances with different estimator combinations
- Train each VotingRegressor and evaluate its performance using mean squared error
Tips for configuring the estimators parameter:
- Choose diverse models with different strengths and weaknesses
- Include a mix of simple and complex models to balance bias and variance
- Consider the computational cost of the ensemble, especially for large datasets
- Experiment with different combinations to find the optimal set for your specific problem
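One way to experiment with combinations systematically is to enumerate them and score each with cross-validation instead of a single train/test split. The sketch below assumes three candidate models, 3-fold cross-validation, and a smaller dataset than the main example to keep the run short. The candidates list and results dictionary are illustrative names.

```python
from itertools import combinations

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

# Candidate pool of (name, estimator) tuples
candidates = [
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=30, random_state=42)),
    ("svr", SVR(kernel="rbf")),
]

# Score every combination of two or more candidates
results = {}
for size in range(2, len(candidates) + 1):
    for combo in combinations(candidates, size):
        names = tuple(name for name, _ in combo)
        vr = VotingRegressor(estimators=list(combo))
        scores = cross_val_score(vr, X, y, cv=3, scoring="neg_mean_squared_error")
        # The scorer returns negated MSE, so flip the sign back
        results[names] = -scores.mean()

# Report combinations from lowest to highest cross-validated MSE
best = min(results, key=results.get)
for names, mse in sorted(results.items(), key=lambda item: item[1]):
    print(names, f"{mse:.4f}")
print("Best combination:", best)
```

The number of combinations grows quickly with the size of the candidate pool, so for larger pools it may be more practical to test only a hand-picked subset.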
Issues to consider:
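- The performance of individual estimators significantly impacts the ensemble’s overall performance: predictions are averaged with equal weight by default, so one weak estimator can pull the whole ensemble down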
- Including too many complex estimators may lead to overfitting
- There’s a trade-off between model diversity and individual performance
- The optimal combination of estimators can vary depending on the specific dataset and problem
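To see how individual estimators affect the ensemble, it can help to score each fitted member separately and compare it with the combined prediction. The sketch below is one way to do that using the named_estimators_ attribute. The dataset size and model settings are illustrative assumptions, and the numbers it prints will differ from the output shown earlier.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vr = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("svr", SVR(kernel="rbf")),
])
vr.fit(X_train, y_train)

# Test MSE of each fitted member, keyed by the name given in estimators
individual_mse = {
    name: mean_squared_error(y_test, est.predict(X_test))
    for name, est in vr.named_estimators_.items()
}

# Test MSE of the averaged prediction
ensemble_mse = mean_squared_error(y_test, vr.predict(X_test))

for name, mse in individual_mse.items():
    print(f"{name}: {mse:.4f}")
print(f"ensemble: {ensemble_mse:.4f}")
```

With equal weights, the ensemble's MSE cannot exceed the average of its members' MSEs, but it can still be much worse than the best single member. If one member scores far worse than the others, removing it from estimators is worth testing.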