The max_leaf_nodes parameter in scikit-learn's GradientBoostingRegressor controls the maximum number of leaf nodes per decision tree in the ensemble.
Gradient Boosting is an ensemble technique that builds decision trees sequentially, with each new tree fitted to correct the residual errors of the trees built before it. This approach often yields accurate and robust models.
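To make the sequential idea concrete, here is a minimal hand-rolled sketch (not scikit-learn's internal implementation) that repeatedly fits a small DecisionTreeRegressor to the residuals of the running prediction; the data, learning rate, and tree size are arbitrary illustrative choices.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy data
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # stage 0: predict the mean

# Each stage fits a small tree to the current residuals and adds a
# scaled-down version of its predictions to the running prediction.
for stage in range(3):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    print(f"stage {stage}: mean squared residual = {np.mean((y - prediction) ** 2):.2f}")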
The max_leaf_nodes parameter limits the number of leaf nodes in each tree, which helps control the complexity of the model and prevent overfitting.
The default value for max_leaf_nodes is None, meaning the parameter itself places no limit on the number of leaf nodes (tree size is then governed by other settings such as max_depth, which defaults to 3). In practice, common values range from 10 to 100, depending on the complexity and size of the dataset.
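As a quick sanity check (illustrative only, with an arbitrary limit of 10), the fitted trees of a trained model can be inspected to confirm that none of them exceeds the limit:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

gbr = GradientBoostingRegressor(max_leaf_nodes=10, n_estimators=50, random_state=0)
gbr.fit(X, y)

# estimators_ has shape (n_estimators, 1) for single-output regression;
# each entry is a fitted DecisionTreeRegressor.
leaf_counts = [tree.get_n_leaves() for tree in gbr.estimators_[:, 0]]
print(f"largest tree has {max(leaf_counts)} leaves")  # never more than 10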
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_leaf_nodes values
max_leaf_nodes_values = [5, 10, 20, 50]
r2_scores = []
for max_leaf_nodes in max_leaf_nodes_values:
    gbr = GradientBoostingRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    r2_scores.append(r2)
    print(f"max_leaf_nodes={max_leaf_nodes}, R-squared: {r2:.3f}")
Running the example gives an output like:
max_leaf_nodes=5, R-squared: 0.917
max_leaf_nodes=10, R-squared: 0.922
max_leaf_nodes=20, R-squared: 0.922
max_leaf_nodes=50, R-squared: 0.922
The key steps in this example are:
- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train GradientBoostingRegressor models with different max_leaf_nodes values
- Evaluate and compare the R-squared scores of each model
Some tips and heuristics for setting max_leaf_nodes:
- Start with lower values for max_leaf_nodes to prevent overfitting
- Increase the value to improve model performance if underfitting is observed
- Balance the complexity of individual trees with the overall ensemble strength
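One practical way to apply these heuristics is to let a small cross-validated grid search choose the value; the candidate grid below is illustrative, not a recommendation:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Search over a few candidate values; None means no explicit leaf limit.
param_grid = {"max_leaf_nodes": [None, 5, 10, 20, 50]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X, y)
print(f"best max_leaf_nodes: {search.best_params_['max_leaf_nodes']}")
print(f"best cross-validated R-squared: {search.best_score_:.3f}")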
Issues to consider:
- The optimal max_leaf_nodes value depends on the dataset complexity
- Too few leaf nodes can cause underfitting
- Too many leaf nodes can lead to overfitting and increased computational cost
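To see this trade-off empirically, one option is to compare training and test R-squared as max_leaf_nodes grows; a large gap between the two suggests overfitting, while low scores on both suggest underfitting. A minimal sketch reusing the synthetic dataset from the example above:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare train vs. test R-squared; a large gap indicates overfitting,
# while low scores on both sets indicate underfitting.
for max_leaf_nodes in [2, 10, 100]:
    gbr = GradientBoostingRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    gbr.fit(X_train, y_train)
    train_r2 = gbr.score(X_train, y_train)
    test_r2 = gbr.score(X_test, y_test)
    print(f"max_leaf_nodes={max_leaf_nodes}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")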