The n_estimators parameter in scikit-learn’s GradientBoostingRegressor controls the number of boosting stages (trees) in the ensemble.

Gradient Boosting is an ensemble technique that builds trees sequentially, each new tree fit to the residual errors of the trees before it. The n_estimators parameter determines how many boosting stages are run.

Generally, using more boosting stages reduces the model’s bias and can improve performance, but it also increases the risk of overfitting and computational cost. The default value for n_estimators is 100. In practice, values between 100 and 1000 are commonly used, depending on the dataset size and complexity.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_estimators values
n_estimators_values = [50, 100, 200, 500]
mse_values = []
for n in n_estimators_values:
    gbr = GradientBoostingRegressor(n_estimators=n, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)
    print(f"n_estimators={n}, MSE: {mse:.3f}")
Running the example gives an output like:
n_estimators=50, MSE: 2394.832
n_estimators=100, MSE: 1234.753
n_estimators=200, MSE: 881.858
n_estimators=500, MSE: 776.403
The key steps in this example are:
- Generate a synthetic regression dataset with make_regression, adding Gaussian noise to the target.
- Split the data into train and test sets.
- Train GradientBoostingRegressor models with different n_estimators values.
- Evaluate the mean squared error of each model on the test set.
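To see where performance levels off, one option is to plot the test MSE against the number of boosting stages. This is a small optional sketch that assumes matplotlib is installed and reuses the n_estimators_values and mse_values lists from the example above:

import matplotlib.pyplot as plt

# Plot test MSE against the number of boosting stages tried above
plt.plot(n_estimators_values, mse_values, marker="o")
plt.xlabel("n_estimators")
plt.ylabel("Test MSE")
plt.title("Test MSE vs. number of boosting stages")
plt.show()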
Some tips and heuristics for setting n_estimators:
- Start with the default value of 100 and increase it incrementally until the performance plateaus (see the staged_predict sketch after this list).
- Higher values of n_estimators reduce the model’s bias but may increase variance and computational cost.
- Evaluate the performance gains against computational efficiency to find the optimal balance.
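Rather than refitting a separate model for every candidate value, a single fitted GradientBoostingRegressor can report its test error after each stage via staged_predict. This is a minimal sketch, assuming the X_train/X_test split from the example above; the 500-stage upper bound is an arbitrary choice for illustration:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Fit once with a generous upper bound on the number of stages
gbr = GradientBoostingRegressor(n_estimators=500, random_state=42)
gbr.fit(X_train, y_train)

# staged_predict yields test-set predictions after each boosting stage
stage_mse = [mean_squared_error(y_test, y_pred)
             for y_pred in gbr.staged_predict(X_test)]

best_stage = int(np.argmin(stage_mse)) + 1  # stages are 1-indexed
print(f"Lowest test MSE ({stage_mse[best_stage - 1]:.3f}) at n_estimators={best_stage}")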
Issues to consider:
- The optimal number of boosting stages depends on the dataset’s complexity.
- Using too few boosting stages can result in underfitting, while too many can cause overfitting and increased computational time.
- Consider the trade-off between performance improvement and computational resources when choosing n_estimators; the early-stopping sketch below is one way to manage this automatically.
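One way to cap the computational cost without hand-tuning n_estimators is the built-in early stopping in GradientBoostingRegressor: set a large n_estimators together with n_iter_no_change and validation_fraction, and fitting stops once the internal validation score stops improving. A minimal sketch, reusing the training data from the example above (the values 1000, 0.1, and 10 are illustrative assumptions):

from sklearn.ensemble import GradientBoostingRegressor

# Allow up to 1000 stages, but stop early if the internal validation
# score has not improved for 10 consecutive stages
gbr = GradientBoostingRegressor(
    n_estimators=1000,
    validation_fraction=0.1,  # fraction of training data held out internally
    n_iter_no_change=10,
    random_state=42,
)
gbr.fit(X_train, y_train)

# n_estimators_ reports how many stages were actually fitted
print(f"Stopped after {gbr.n_estimators_} boosting stages")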