Configure GradientBoostingClassifier "max_leaf_nodes" Parameter

The max_leaf_nodes parameter in scikit-learn’s GradientBoostingClassifier controls the maximum number of leaf nodes in each individual decision tree.

Gradient Boosting is an ensemble learning method that sequentially adds decision trees to correct the errors made by the previous trees. The max_leaf_nodes parameter limits the complexity of each tree.

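Smaller values of max_leaf_nodes produce smaller trees with fewer terminal regions, reducing the model’s capacity to fit complex relationships. Larger values allow more complex trees, potentially improving performance but also increasing the risk of overfitting. When max_leaf_nodes is set, each tree is grown best-first: the split with the largest impurity reduction is made next, until the leaf limit is reached.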

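The default value for max_leaf_nodes is None, which means there is no explicit limit on the number of leaf nodes. Tree size is then governed by the other tree parameters. In particular, the default max_depth=3 restricts each tree to at most 2^3 = 8 leaves.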
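A quick way to check what actually limits tree size is to inspect the fitted trees. The sketch below is illustrative only: the dataset sizes and n_estimators=20 are arbitrary choices, and max_leaves is a helper introduced here, not part of scikit-learn. It compares a model using the defaults with one that sets max_leaf_nodes=4 and lifts the depth cap.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic dataset (sizes chosen only to keep the sketch fast)
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)


def max_leaves(model):
    """Hypothetical helper: largest leaf count over all trees in the ensemble."""
    # estimators_ is a 2-D array of fitted DecisionTreeRegressor objects
    return max(tree.get_n_leaves() for tree in model.estimators_.ravel())


# Defaults: max_leaf_nodes=None, max_depth=3
default_gb = GradientBoostingClassifier(n_estimators=20, random_state=42)
default_gb.fit(X, y)

# Explicit leaf limit, with the depth cap removed
capped_gb = GradientBoostingClassifier(n_estimators=20, max_leaf_nodes=4,
                                       max_depth=None, random_state=42)
capped_gb.fit(X, y)

default_max = max_leaves(default_gb)
capped_max = max_leaves(capped_gb)
print(f"defaults: largest tree has {default_max} leaves")
print(f"max_leaf_nodes=4: largest tree has {capped_max} leaves")
```

With the defaults, no tree should exceed 8 leaves because of the depth limit. With max_leaf_nodes=4, no tree should exceed 4.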

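In practice, values between 10 and 100 are commonly used depending on the size and complexity of the dataset. Because max_depth still applies, values in this range only take effect when max_depth is raised (so that 2^max_depth exceeds max_leaf_nodes) or set to None.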

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_leaf_nodes values
max_leaf_nodes_values = [10, 50, 100, None]
accuracies = []

for max_leaf in max_leaf_nodes_values:
    gb = GradientBoostingClassifier(max_leaf_nodes=max_leaf, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_leaf_nodes={max_leaf}, Accuracy: {accuracy:.3f}")

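Running the example gives an output like the one below. The accuracies are identical because max_depth is left at its default of 3, which already limits each tree to at most 8 leaves, so none of the tested max_leaf_nodes values changes the trees that are built: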

max_leaf_nodes=10, Accuracy: 0.785
max_leaf_nodes=50, Accuracy: 0.785
max_leaf_nodes=100, Accuracy: 0.785
max_leaf_nodes=None, Accuracy: 0.785
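One way to make the leaf limit the binding constraint is to pass max_depth=None alongside max_leaf_nodes. The variation below is a sketch under that assumption. The candidate values 4, 16 and 64 are arbitrary, and training accuracy is reported next to test accuracy so that overfitting becomes visible. No output is shown because the numbers depend on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Same synthetic dataset and split as in the example above
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for max_leaf in [4, 16, 64]:
    # max_depth=None removes the depth cap, so max_leaf_nodes decides tree size
    gb = GradientBoostingClassifier(max_leaf_nodes=max_leaf, max_depth=None,
                                    random_state=42)
    gb.fit(X_train, y_train)
    train_acc = gb.score(X_train, y_train)
    test_acc = gb.score(X_test, y_test)
    results[max_leaf] = (train_acc, test_acc)
    print(f"max_leaf_nodes={max_leaf}, train: {train_acc:.3f}, test: {test_acc:.3f}")
```

A widening gap between training and test accuracy as max_leaf_nodes grows would suggest the trees are becoming too complex for the data.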

The key steps in this example are:

Generate a synthetic multiclass classification dataset
Split the data into train and test sets
Train GradientBoostingClassifier models with different max_leaf_nodes values
Evaluate the accuracy of each model on the test set

Some tips and heuristics for setting max_leaf_nodes:

Start with a small value like 10 and increase it until validation performance plateaus, raising max_depth or setting it to None so the leaf limit can take effect
Larger values allow more complex trees, but can lead to overfitting
Consider the trade-off between model complexity and performance
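The search described in these tips can be automated with cross-validation. The following is a minimal sketch using GridSearchCV. The grid values, cv=3, n_estimators=50 and the dataset size are assumptions chosen to keep the run short, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Candidate leaf limits (illustrative values)
param_grid = {"max_leaf_nodes": [4, 8, 16, 32]}

# max_depth=None so that each candidate leaf limit actually constrains the trees
base_model = GradientBoostingClassifier(n_estimators=50, max_depth=None,
                                        random_state=42)

search = GridSearchCV(base_model, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("Best max_leaf_nodes:", search.best_params_["max_leaf_nodes"])
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Cross-validated scores are a more reliable guide than a single train/test split, especially on small datasets.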

Issues to consider:

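The max_leaf_nodes parameter interacts with other tree parameters like max_depth: both limits apply, so a tree stops growing at whichever is reached first, and a tree of depth d can have at most 2^d leaves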
Using too many leaf nodes can cause overfitting, especially on smaller datasets
Higher values of max_leaf_nodes increase the computational cost of training
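To get a rough feel for the cost side of these issues, fit time and total tree size can be measured directly. This is a sketch, not a benchmark: the values 4, 32 and 128, n_estimators=30 and the dataset size are arbitrary, and the timings will vary by machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

timings = {}
node_counts = {}
for max_leaf in [4, 32, 128]:
    gb = GradientBoostingClassifier(n_estimators=30, max_leaf_nodes=max_leaf,
                                    max_depth=None, random_state=42)
    start = time.perf_counter()
    gb.fit(X, y)
    timings[max_leaf] = time.perf_counter() - start

    # Total number of nodes (internal nodes plus leaves) across all trees
    node_counts[max_leaf] = sum(
        tree.tree_.node_count for tree in gb.estimators_.ravel())

    print(f"max_leaf_nodes={max_leaf}: fit time {timings[max_leaf]:.2f}s, "
          f"total nodes {node_counts[max_leaf]}")
```

A binary tree with L leaves has 2L - 1 nodes, so the total node count gives a machine-independent view of how much larger the ensemble becomes.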
