The estimator parameter in scikit-learn's BaggingRegressor determines the base model used in the ensemble.
Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple bootstrap subsets of the training data, trains a separate model on each subset, and combines their predictions. The estimator parameter specifies the type of model to use as the base learner in this ensemble.
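To make the mechanics concrete, here is a minimal hand-rolled sketch of bagging for regression, assuming plain bootstrap resampling and averaging (the variable names and the choice of 10 trees are illustrative, not part of scikit-learn's API):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

rng = np.random.RandomState(0)
models = []
for _ in range(10):
    # Draw a bootstrap sample (sampling rows with replacement)
    idx = rng.randint(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Aggregate: for regression, bagging averages the base models' predictions
y_pred = np.mean([m.predict(X) for m in models], axis=0)
```

BaggingRegressor automates exactly this loop, adding options such as feature subsampling and out-of-bag scoring.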
By default, BaggingRegressor uses DecisionTreeRegressor as the base estimator (the separate max_features parameter on BaggingRegressor, which defaults to 1.0, controls how many features each base model sees). However, you can use any regressor that follows the scikit-learn estimator API.
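One quick way to verify this default is to fit a bagging model without specifying an estimator and inspect its fitted estimator_ attribute (a minimal check, separate from the comparison below):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
model = BaggingRegressor(random_state=0).fit(X, y)
print(model.estimator_)  # DecisionTreeRegressor()
```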
Common choices for the base estimator include DecisionTreeRegressor, LinearRegression, and SVR (Support Vector Regressor).
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base estimators to compare
estimators = [
    ('Default', None),
    ('DecisionTree', DecisionTreeRegressor(random_state=42)),
    ('LinearRegression', LinearRegression()),
    ('SVR', SVR())
]

# Train and evaluate BaggingRegressor with different base estimators
for name, estimator in estimators:
    bagging = BaggingRegressor(estimator=estimator, random_state=42)
    bagging.fit(X_train, y_train)
    y_pred = bagging.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"Estimator: {name}, MSE: {mse:.4f}")
```
Running the example gives an output like:
```
Estimator: Default, MSE: 7486.4813
Estimator: DecisionTree, MSE: 7486.4813
Estimator: LinearRegression, MSE: 0.0113
Estimator: SVR, MSE: 35251.5928
```
The identical scores for Default and DecisionTree confirm that the default base estimator is a decision tree. And because make_regression produces an (almost) linear target, bagged LinearRegression models fit it nearly perfectly, while SVR with its default RBF kernel lags behind.
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define a list of base estimators to compare
- Train BaggingRegressor models with different base estimators
- Evaluate each model's performance using Mean Squared Error
Some tips for choosing the base estimator:
- Consider the nature of your data and the problem you’re solving
- Experiment with different estimators to find the best performance
- Use cross-validation to get a more robust estimate of model performance (see the sketch after this list)
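As a sketch of that last tip, the comparison above could be rerun with cross_val_score; this snippet reuses X, y, BaggingRegressor, and the estimators list from the example (note that scikit-learn reports MSE as a negative score):

```python
from sklearn.model_selection import cross_val_score

# Reuses X, y, BaggingRegressor, and the estimators list defined above
for name, estimator in estimators:
    bagging = BaggingRegressor(estimator=estimator, random_state=42)
    scores = cross_val_score(bagging, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"Estimator: {name}, CV MSE: {-scores.mean():.4f} (std {scores.std():.4f})")
```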
Issues to consider:
- The choice of base estimator can significantly impact the ensemble’s performance
- Some estimators may be more computationally expensive than others
- The default DecisionTreeRegressor often works well, but it's not always optimal
- Bagging primarily reduces variance, so unstable, high-variance learners (e.g., fully grown decision trees) typically gain more from it than stable, high-bias learners; the sketch below illustrates this
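To illustrate that last point, a small sketch like the following compares each base learner on its own against its bagged version; the exact numbers depend on the data, but the high-variance fully grown tree typically gains far more from bagging than the already-stable linear model (the dataset settings here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

for name, base in [("DecisionTree", DecisionTreeRegressor(random_state=0)),
                   ("LinearRegression", LinearRegression())]:
    # Cross-validated MSE of the base learner used on its own
    single = -cross_val_score(base, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    # Cross-validated MSE of the same learner inside a bagging ensemble
    ensemble = BaggingRegressor(estimator=base, n_estimators=50, random_state=0)
    bagged = -cross_val_score(ensemble, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"{name}: single MSE={single:.2f}, bagged MSE={bagged:.2f}")
```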