Configure VotingRegressor "estimators" Parameter

The estimators parameter in scikit-learn’s VotingRegressor defines the set of base models used in the ensemble.

VotingRegressor fits each base regression model on the training data and averages their individual predictions to produce the final output.
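
To make the averaging concrete, the following minimal sketch checks that the ensemble prediction matches the plain mean of the fitted base models' predictions. The small dataset and the choice of LinearRegression and DecisionTreeRegressor are assumptions made for illustration only; no weights are passed, so the average is unweighted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Small illustrative dataset (sizes chosen arbitrarily for this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

vr = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
])
vr.fit(X, y)

# estimators_ holds the fitted clones of the base models
manual_average = np.mean([est.predict(X) for est in vr.estimators_], axis=0)

# With no weights set, the ensemble output is the unweighted mean
matches = np.allclose(manual_average, vr.predict(X))
print(matches)
```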

Averaging helps most when the base models make different kinds of errors, so selecting diverse and complementary estimators matters for the ensemble’s performance. The goal is to leverage the strengths of different models while mitigating their individual weaknesses.

There is no default value for estimators; it must be passed explicitly as a list of (name, estimator) tuples when creating a VotingRegressor.
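
As a rough illustration of the expected format, the sketch below builds the list of (name, estimator) tuples, shows that the names can be used as prefixes to reach nested hyperparameters, and confirms that omitting estimators raises an error. The names "lr" and "ridge" and the alpha values are arbitrary choices for this example.

```python
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge

# Each entry pairs a unique name with an unfitted regressor
estimators = [("lr", LinearRegression()), ("ridge", Ridge(alpha=1.0))]
vr = VotingRegressor(estimators=estimators)

# The names act as prefixes for nested hyperparameters
has_nested = "ridge__alpha" in vr.get_params()
vr.set_params(ridge__alpha=10.0)
new_alpha = vr.get_params()["ridge__alpha"]

# estimators has no default, so leaving it out fails at construction
try:
    VotingRegressor()
    raised = False
except TypeError:
    raised = True

print(has_nested, new_alpha, raised)
```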

Common choices include combinations of different regression models such as LinearRegression, RandomForestRegressor, SVR, and GradientBoostingRegressor.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base models
lr = LinearRegression()
rf = RandomForestRegressor(n_estimators=100, random_state=42)
svr = SVR(kernel='rbf')

# Create VotingRegressor with different estimator combinations
estimator_combinations = [
    [('lr', lr), ('rf', rf)],
    [('lr', lr), ('svr', svr)],
    [('rf', rf), ('svr', svr)],
    [('lr', lr), ('rf', rf), ('svr', svr)]
]

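# VotingRegressor fits clones of the base models, so lr, rf and svr
# can safely be reused across combinations
for i, estimators in enumerate(estimator_combinations):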
    vr = VotingRegressor(estimators=estimators)
    vr.fit(X_train, y_train)
    y_pred = vr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Combination {i+1}: {[e[0] for e in estimators]}")
    print(f"Mean Squared Error: {mse:.4f}")
    print()

Running the example gives an output like:

Combination 1: ['lr', 'rf']
Mean Squared Error: 1764.3715

Combination 2: ['lr', 'svr']
Mean Squared Error: 8797.5941

Combination 3: ['rf', 'svr']
Mean Squared Error: 16628.2346

Combination 4: ['lr', 'rf', 'svr']
Mean Squared Error: 7390.2678

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Define individual regression models (LinearRegression, RandomForestRegressor, SVR)
Create VotingRegressor instances with different estimator combinations
Train each VotingRegressor and evaluate its performance using mean squared error

Tips for configuring the estimators parameter:

Choose diverse models with different strengths and weaknesses
Include a mix of simple and complex models to balance bias and variance
Consider the computational cost of the ensemble, especially for large datasets
Experiment with different combinations to find the optimal set for your specific problem
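
One possible way to run such an experiment is to score every combination with cross-validation rather than a single train/test split. The sketch below is only an assumption about how this might be set up: the dataset size, the 3-fold setting, and the candidate list are illustrative, and itertools.combinations is used simply to enumerate subsets of two or more models.

```python
from itertools import combinations

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Smaller dataset than the main example to keep the sketch quick
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

candidates = [
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("svr", SVR(kernel="rbf")),
]

results = {}
for size in range(2, len(candidates) + 1):
    for combo in combinations(candidates, size):
        names = tuple(name for name, _ in combo)
        vr = VotingRegressor(estimators=list(combo))
        # scikit-learn reports negated MSE so that higher is better
        scores = cross_val_score(
            vr, X, y, cv=3, scoring="neg_mean_squared_error"
        )
        results[names] = -scores.mean()

# Print combinations from lowest to highest cross-validated MSE
for names, mse in sorted(results.items(), key=lambda item: item[1]):
    print(names, f"{mse:.2f}")
```

Averaging over folds gives a steadier ranking than one split, at the cost of fitting every ensemble several times.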

Issues to consider:

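The performance of individual estimators significantly impacts the ensemble’s overall performance: in the output above, adding svr to the lr and rf pair raises the mean squared error from 1764.37 to 7390.27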
Including too many complex estimators may lead to overfitting
There’s a trade-off between model diversity and individual performance
The optimal combination of estimators can vary depending on the specific dataset and problem
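
To see how much each base model contributes, one option is to score the fitted base models individually and then refit without a weak one. The sketch below assumes a smaller dataset than the main example and uses the named_estimators_ attribute together with set_params and the 'drop' value, which removes an estimator from the ensemble by name; which model to drop is an illustrative choice, not a recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

vr = VotingRegressor(estimators=[
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("svr", SVR(kernel="rbf")),
])
vr.fit(X_train, y_train)

# named_estimators_ maps each name to its fitted clone
individual_mse = {
    name: mean_squared_error(y_test, est.predict(X_test))
    for name, est in vr.named_estimators_.items()
}
ensemble_mse = mean_squared_error(y_test, vr.predict(X_test))

# Remove one estimator by name and refit the reduced ensemble
vr.set_params(svr="drop")
vr.fit(X_train, y_train)
reduced_mse = mean_squared_error(y_test, vr.predict(X_test))

for name, mse in individual_mse.items():
    print(f"{name}: {mse:.2f}")
print(f"all three: {ensemble_mse:.2f}")
print(f"without svr: {reduced_mse:.2f}")
```

A softer alternative to dropping a model is the weights parameter of VotingRegressor, which lets a weaker estimator stay in the ensemble with less influence on the average.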
