The `n_estimators` parameter in scikit-learn's `AdaBoostRegressor` controls the number of weak learners in the ensemble.
AdaBoost (Adaptive Boosting) is an ensemble method that combines multiple weak learners, typically decision trees, into a strong predictor. The `n_estimators` parameter determines how many weak learners are trained sequentially.
Increasing `n_estimators` generally improves model performance up to a point, after which returns diminish and overfitting may occur. The optimal value depends on the dataset and the problem.
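The diminishing returns can be observed directly with `staged_predict`, which yields test-set predictions after each boosting round, so the whole error curve comes from a single fit. A minimal sketch on synthetic data (the dataset parameters here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# staged_predict returns predictions after each boosting iteration,
# giving the MSE curve without refitting the model per value
errors = [mean_squared_error(y_test, y_pred) for y_pred in ada.staged_predict(X_test)]
print(f"MSE after 10 rounds:  {errors[9]:.1f}")
print(f"MSE after 200 rounds: {errors[-1]:.1f}")
```

Plotting `errors` against the round number typically shows a steep early drop that flattens out, the point where extra estimators stop paying off.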
The default value for `n_estimators` in `AdaBoostRegressor` is 50. In practice, values between 50 and 500 are common, but the best choice varies widely with the complexity of the regression task and the characteristics of the dataset.
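The default can be confirmed from the estimator itself:

```python
from sklearn.ensemble import AdaBoostRegressor

# The default number of weak learners is 50
print(AdaBoostRegressor().get_params()["n_estimators"])  # 50
```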
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_estimators values
n_estimators_values = [10, 50, 100, 200, 500]
mse_scores = []

for n in n_estimators_values:
    ada = AdaBoostRegressor(n_estimators=n, random_state=42)
    ada.fit(X_train, y_train)
    y_pred = ada.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"n_estimators={n}, MSE: {mse:.4f}")

# Find best n_estimators
best_n = n_estimators_values[np.argmin(mse_scores)]
print(f"\nBest n_estimators: {best_n}")
```
Running the example gives an output like:

```
n_estimators=10, MSE: 16101.0718
n_estimators=50, MSE: 10253.6570
n_estimators=100, MSE: 9149.6702
n_estimators=200, MSE: 8349.3847
n_estimators=500, MSE: 8023.3465

Best n_estimators: 500
```
The key steps in this example are:

- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train `AdaBoostRegressor` models with different `n_estimators` values
- Evaluate the mean squared error (MSE) of each model on the test set
- Identify the best-performing `n_estimators` value
Some tips and heuristics for setting `n_estimators` in `AdaBoostRegressor`:
- Start with the default value of 50 and incrementally increase
- Monitor performance on a validation set to avoid overfitting
- Consider the trade-off between model performance and training time
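One way to put these tips into practice is a cross-validated search over a small grid of `n_estimators` values, widening the grid only if the best value lands at an edge. A sketch (grid values and dataset settings are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=0)

# Cross-validated search over a small n_estimators grid,
# scored with negative MSE so higher is better
grid = GridSearchCV(
    AdaBoostRegressor(random_state=0),
    param_grid={"n_estimators": [25, 50, 100, 200]},
    scoring="neg_mean_squared_error",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

Using cross-validation rather than a single validation split gives a more stable estimate, at the cost of refitting the model once per fold and grid point.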
Issues to consider:
- Higher `n_estimators` values increase computational cost
- The optimal number of estimators can vary greatly depending on the dataset
- AdaBoost can be sensitive to noisy data and outliers, which may affect the optimal `n_estimators` value
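The computational cost is easy to measure: since the weak learners are trained one after another, fit time grows roughly linearly with `n_estimators`. A quick timing sketch (dataset size and values chosen arbitrarily):

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# Time the fit for a small and a larger ensemble;
# expect roughly 4x the training time for 4x the estimators
times = {}
for n in [50, 200]:
    start = time.perf_counter()
    AdaBoostRegressor(n_estimators=n, random_state=0).fit(X, y)
    times[n] = time.perf_counter() - start
    print(f"n_estimators={n}: {times[n]:.2f}s")
```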