The estimator parameter in scikit-learn's BaggingRegressor determines the base model used in the ensemble.
Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple bootstrap subsets of the training data, trains a separate model on each subset, and combines their predictions. The estimator parameter specifies the type of model to use as the base learner in this ensemble.
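To make the mechanics concrete, here is a minimal hand-rolled sketch of bagging for regression, assuming plain bootstrap resampling and averaging (the variable names and the choice of 10 trees are illustrative, not part of scikit-learn's API):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

rng = np.random.RandomState(0)
models = []
for _ in range(10):
    # Draw a bootstrap sample (sampling rows with replacement)
    idx = rng.randint(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Aggregate: for regression, bagging averages the base models' predictions
y_pred = np.mean([m.predict(X) for m in models], axis=0)
```

BaggingRegressor automates exactly this loop, adding options such as feature subsampling and out-of-bag scoring.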
By default, BaggingRegressor uses DecisionTreeRegressor as the base estimator (the separate max_features parameter on BaggingRegressor, which defaults to 1.0, controls how many features each base model sees). However, you can use any regressor that follows the scikit-learn estimator API.
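One quick way to verify this default is to fit a bagging model without specifying an estimator and inspect its fitted estimator_ attribute (a minimal check, separate from the comparison below):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
model = BaggingRegressor(random_state=0).fit(X, y)
print(model.estimator_)  # DecisionTreeRegressor()
```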
Common choices for the base estimator include DecisionTreeRegressor, LinearRegression, and SVR (Support Vector Regressor).
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base estimators to compare
estimators = [
    ('Default', None),
    ('DecisionTree', DecisionTreeRegressor(random_state=42)),
    ('LinearRegression', LinearRegression()),
    ('SVR', SVR())
]

# Train and evaluate BaggingRegressor with different base estimators
for name, estimator in estimators:
    bagging = BaggingRegressor(estimator=estimator, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Estimator: {name}, MSE: {mse:.4f}")
```
Running the example gives an output like:
```
Estimator: Default, MSE: 7486.4813
Estimator: DecisionTree, MSE: 7486.4813
Estimator: LinearRegression, MSE: 0.0113
Estimator: SVR, MSE: 35251.5928
```
The identical scores for Default and DecisionTree confirm that the default base estimator is a decision tree. And because make_regression produces an (almost) linear target, bagged LinearRegression models fit it nearly perfectly, while SVR with its default RBF kernel lags behind.
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define a list of base estimators to compare
- Train BaggingRegressor models with different base estimators
- Evaluate each model's performance using Mean Squared Error
Some tips for choosing the base estimator:
- Consider the nature of your data and the problem you’re solving
- Experiment with different estimators to find the best performance
- Use cross-validation to get a more robust estimate of model performance (see the sketch after this list)
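As a sketch of that last tip, the comparison above could be rerun with cross_val_score; this snippet reuses X, y, BaggingRegressor, and the estimators list from the example (note that scikit-learn reports MSE as a negative score):

```python
from sklearn.model_selection import cross_val_score

# Reuses X, y, BaggingRegressor, and the estimators list defined above
for name, estimator in estimators:
    bagging = BaggingRegressor(estimator=estimator, random_state=42)
    scores = cross_val_score(bagging, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"Estimator: {name}, CV MSE: {-scores.mean():.4f} (std {scores.std():.4f})")
```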
Issues to consider:
- The choice of base estimator can significantly impact the ensemble’s performance
- Some estimators may be more computationally expensive than others
- The default DecisionTreeRegressor often works well, but it's not always optimal
- Bagging primarily reduces variance, so unstable, high-variance learners (e.g., fully grown decision trees) typically gain more from it than stable, high-bias learners; the sketch below illustrates this
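To illustrate that last point, a small sketch like the following compares each base learner on its own against its bagged version; the exact numbers depend on the data, but the high-variance fully grown tree typically gains far more from bagging than the already-stable linear model (the dataset settings here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

for name, base in [("DecisionTree", DecisionTreeRegressor(random_state=0)),
                   ("LinearRegression", LinearRegression())]:
    # Cross-validated MSE of the base learner used on its own
    single = -cross_val_score(base, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    # Cross-validated MSE of the same learner inside a bagging ensemble
    ensemble = BaggingRegressor(estimator=base, n_estimators=50, random_state=0)
    bagged = -cross_val_score(ensemble, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"{name}: single MSE={single:.2f}, bagged MSE={bagged:.2f}")
```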