The n_estimators parameter in scikit-learn's GradientBoostingClassifier controls the number of boosting stages (decision trees) used in the ensemble.
Gradient Boosting is an ensemble method that combines weak learners (decision trees) sequentially, with each tree attempting to correct the errors of the previous ones. The n_estimators parameter determines how many trees are added to the ensemble.
Generally, using more boosting stages leads to better performance, as it allows the model to learn more complex relationships in the data. However, there is a trade-off between performance and computational cost, as using a large number of estimators can significantly increase training time.
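One way to see this trade-off concretely is staged_predict, which yields the ensemble's predictions after each boosting stage. The snippet below is a minimal sketch of that idea on a small synthetic dataset; the dataset and any accuracies it prints are illustrative assumptions, not results from the worked example later in this post.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Small synthetic problem purely for illustration
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Fit once with a generous number of stages
gb = GradientBoostingClassifier(n_estimators=300, random_state=0)
gb.fit(X_train, y_train)
# staged_predict yields test-set predictions after each added tree,
# so we can watch accuracy improve (and eventually plateau) stage by stage
for stage, y_pred in enumerate(gb.staged_predict(X_test), start=1):
    if stage % 50 == 0:
        print(f"stage {stage}: accuracy={accuracy_score(y_test, y_pred):.3f}")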
The default value for n_estimators is 100.
In practice, values between 100 and 1000 are commonly used, depending on the size and complexity of the dataset, as well as the available computational resources.
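To confirm the default on your installed scikit-learn version (a quick sanity check, assuming any recent release), you can read it off an unconfigured estimator:
from sklearn.ensemble import GradientBoostingClassifier
# get_params() on a fresh estimator reports the library defaults
print(GradientBoostingClassifier().get_params()["n_estimators"])  # 100 on current releases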
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_estimators values
n_estimators_values = [50, 100, 200, 500]
accuracies = []
for n in n_estimators_values:
    gb = GradientBoostingClassifier(n_estimators=n, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"n_estimators={n}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
n_estimators=50, Accuracy: 0.775
n_estimators=100, Accuracy: 0.785
n_estimators=200, Accuracy: 0.775
n_estimators=500, Accuracy: 0.785
The key steps in this example are:
- Generate a synthetic multi-class classification dataset with informative and noise features
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different n_estimators values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting n_estimators:
- Start with the default value of 100 and increase it until the performance plateaus (the early-stopping sketch after this list automates this)
- Higher values generally lead to better performance but also increase computation time
- Consider the trade-off between the performance gain and the additional computational cost
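A practical shortcut for the "increase until it plateaus" tip is scikit-learn's built-in early stopping: with validation_fraction and n_iter_no_change set, training stops once the internal validation score stops improving. The sketch below uses illustrative values (a ceiling of 500 stages, patience of 10 stages) that you would adapt to your own data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
# Synthetic dataset of the same shape as the example above
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
# Set n_estimators to a generous ceiling and let early stopping decide
# how many stages are actually fitted
gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting stages
    validation_fraction=0.1,   # hold out 10% of the training data internally
    n_iter_no_change=10,       # stop after 10 stages without improvement
    random_state=42,
)
gb.fit(X, y)
# n_estimators_ reports how many stages were actually built before stopping
print(gb.n_estimators_)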
Issues to consider:
- The optimal number of estimators depends on the size and complexity of the dataset (see the cross-validation sketch after this list)
- Using too few estimators can result in underfitting, while too many can lead to overfitting and long training times
- There are diminishing returns in performance after a certain number of estimators
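Because the best value is dataset-dependent, it is often simplest to let cross-validation pick it. The following is a minimal sketch using GridSearchCV; the candidate grid and fold count are illustrative assumptions rather than recommendations.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Synthetic data standing in for your real dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
# Cross-validate a few candidate values of n_estimators
param_grid = {"n_estimators": [50, 100, 200, 500]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)               # best value found on this dataset
print(f"{search.best_score_:.3f}")       # mean cross-validated accuracy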