The estimator parameter in scikit-learn’s AdaBoostRegressor determines the base regressor used in the ensemble.
AdaBoost (Adaptive Boosting) is an ensemble learning method that combines multiple weak learners to create a strong predictor. The estimator parameter specifies the type of weak learner to use as the base model.
By default, AdaBoostRegressor uses DecisionTreeRegressor with max_depth=3 as the base estimator. This default works well in many cases, but changing the base estimator can significantly impact the model’s performance and characteristics.
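As a quick illustration, the default configuration is equivalent to passing the shallow tree explicitly. This is a minimal sketch assuming scikit-learn 1.2 or later, where the parameter is named estimator (it was base_estimator in earlier releases):
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
# Leaving estimator unset falls back to a depth-3 decision tree,
# so these two models are configured identically
default_model = AdaBoostRegressor(random_state=42)
explicit_model = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=3),
    random_state=42
)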
Common alternatives to the default include decision trees with different depths, linear models like LinearRegression, or other regressors that can be considered “weak learners”.
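For instance, other simple regressors can be plugged in the same way. This is an illustrative sketch, not a recommendation; KNeighborsRegressor and LinearSVR are just example choices:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
# Any scikit-learn regressor with a fit/predict interface can serve
# as the base model
knn_boost = AdaBoostRegressor(estimator=KNeighborsRegressor(n_neighbors=5))
svr_boost = AdaBoostRegressor(estimator=LinearSVR())
The example below compares the default against a depth-1 tree and LinearRegression on a synthetic dataset.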
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and evaluate AdaBoostRegressor with different base estimators
estimators = {
    'Default': None,
    'DecisionTree(max_depth=1)': DecisionTreeRegressor(max_depth=1),
    'LinearRegression': LinearRegression()
}
for name, estimator in estimators.items():
    model = AdaBoostRegressor(estimator=estimator, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"{name} - MSE: {mse:.4f}")
Running the example gives an output like:
Default - MSE: 3767.9097
DecisionTree(max_depth=1) - MSE: 6149.6108
LinearRegression - MSE: 0.0097
LinearRegression scores so well here because make_regression produces a linear target, so a linear base learner fits it almost perfectly; on nonlinear problems, tree-based weak learners are usually the stronger choice.
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Create AdaBoostRegressor instances with different base estimators
- Train the models and evaluate them using mean squared error
- Compare the performance of different base estimators
Some tips and heuristics for setting the estimator parameter:
- Choose weak learners that are simple and fast to train
- Balance between weak learners (e.g., shallow trees) and stronger base models; one way to search this trade-off is sketched after this list
- Consider the computational cost, especially for large datasets or many boosting rounds
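To navigate that trade-off in practice, the base tree’s depth can be cross-validated together with the number of boosting rounds. This is a minimal sketch that reuses X_train and y_train from the example above and assumes the scikit-learn 1.2+ parameter name, so the nested parameter is addressed as estimator__max_depth:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
# Nested parameters of the base estimator are reachable via the
# estimator__ prefix in the parameter grid
param_grid = {
    'estimator__max_depth': [1, 2, 3, 5],
    'n_estimators': [50, 100],
}
search = GridSearchCV(
    AdaBoostRegressor(estimator=DecisionTreeRegressor(), random_state=42),
    param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)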
Issues to consider:
- The choice of base estimator can affect model interpretability
- Complex base estimators may lead to overfitting; a quick train-versus-test check is sketched after this list
- Different base estimators create trade-offs between bias and variance in the final model
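One simple way to watch for overfitting is to compare training and test error as the base tree deepens. A minimal sketch, again reusing the train/test split from the example above (the depths 1 and 8 are arbitrary illustrative choices); a train/test gap that widens with depth points to overfitting:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
for depth in (1, 8):
    model = AdaBoostRegressor(
        estimator=DecisionTreeRegressor(max_depth=depth),
        random_state=42,
    )
    model.fit(X_train, y_train)
    # A much lower train MSE than test MSE signals high variance
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"max_depth={depth} - train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")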