Configure GradientBoostingRegressor "max_leaf_nodes" Parameter

The max_leaf_nodes parameter in scikit-learn’s GradientBoostingRegressor controls the maximum number of leaf nodes per decision tree in the ensemble.

Gradient boosting is an ensemble technique that builds decision trees sequentially, fitting each new tree to the errors of the ensemble built so far. It often achieves high accuracy, but it can overfit when the individual trees are too complex.

When max_leaf_nodes is set, each tree is grown best-first: the node offering the largest reduction in impurity is split next, until the leaf limit is reached. Capping the leaf count limits the complexity of each tree and helps reduce overfitting.

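The default value for max_leaf_nodes is None, meaning this parameter imposes no limit on the number of leaf nodes. Tree size is then bounded by other settings, notably max_depth (default 3), which allows at most 2^3 = 8 leaves per tree.

In practice, common values range from 10 to 100 depending on the complexity and size of the dataset. Because max_depth still applies when max_leaf_nodes is set, values above 8 only take effect if max_depth is raised or set to None.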
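
To see how the leaf limit interacts with the other defaults, the fitted trees can be inspected directly. The following is a minimal sketch, not part of the original example: the helper max_leaves_per_tree is a hypothetical name introduced here, and it relies on the estimators_ attribute and the get_n_leaves() method of the underlying decision trees. With the default max_depth=3, no tree should exceed 8 leaves even when max_leaf_nodes is None.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Same synthetic data settings as the main example
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)


def max_leaves_per_tree(model):
    """Hypothetical helper: largest leaf count among the fitted trees."""
    # estimators_ is a 2-D array of DecisionTreeRegressor objects,
    # so flatten it before iterating
    return max(tree.get_n_leaves() for tree in model.estimators_.ravel())


# Default settings: max_leaf_nodes=None, max_depth=3
default_model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Explicit leaf limit below the 8-leaf cap implied by max_depth=3
limited_model = GradientBoostingRegressor(max_leaf_nodes=5, random_state=42).fit(X, y)

print("default, largest tree:", max_leaves_per_tree(default_model), "leaves")
print("max_leaf_nodes=5, largest tree:", max_leaves_per_tree(limited_model), "leaves")
```

Checking leaf counts this way is a quick sanity test that the parameter being tuned is actually the one constraining the trees.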

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_leaf_nodes values
max_leaf_nodes_values = [5, 10, 20, 50]
r2_scores = []

for max_leaf_nodes in max_leaf_nodes_values:
    gbr = GradientBoostingRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    r2_scores.append(r2)
    print(f"max_leaf_nodes={max_leaf_nodes}, R-squared: {r2:.3f}")

Running the example gives an output like:

max_leaf_nodes=5, R-squared: 0.917
max_leaf_nodes=10, R-squared: 0.922
max_leaf_nodes=20, R-squared: 0.922
max_leaf_nodes=50, R-squared: 0.922

The key steps in this example are:

  1. Generate a synthetic regression dataset with noise
  2. Split the data into train and test sets
  3. Train GradientBoostingRegressor models with different max_leaf_nodes values
  4. Evaluate and compare the R-squared scores of each model
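
The identical scores for 10, 20 and 50 in the output above are most likely explained by the default max_depth=3, which already caps each tree at 8 leaves, so larger limits never come into play. The sketch below repeats the sweep with max_depth=None (an assumption added here, not a setting from the original example) so that max_leaf_nodes becomes the binding constraint, and reports the largest tree alongside each score. The resulting scores will differ from those shown above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}  # leaf limit -> (largest leaf count observed, test R-squared)

for limit in [5, 10, 20, 50]:
    # max_depth=None removes the depth cap so only the leaf limit bounds tree size
    gbr = GradientBoostingRegressor(max_leaf_nodes=limit, max_depth=None, random_state=42)
    gbr.fit(X_train, y_train)

    # Largest number of leaves actually used by any tree in the ensemble
    largest = max(tree.get_n_leaves() for tree in gbr.estimators_.ravel())
    r2 = r2_score(y_test, gbr.predict(X_test))
    results[limit] = (largest, r2)
    print(f"max_leaf_nodes={limit}, largest tree: {largest} leaves, R-squared: {r2:.3f}")
```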

Some tips and heuristics for setting max_leaf_nodes:
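
One common heuristic is to choose the value by cross-validation instead of relying on a single train/test split. The following is a minimal sketch using GridSearchCV; the candidate grid, cv=3, and max_depth=None are assumptions chosen for illustration. max_depth=None is used so that the leaf limit, not the default depth of 3, bounds the size of each tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Candidate leaf limits (illustrative values, same as the main example)
param_grid = {"max_leaf_nodes": [5, 10, 20, 50]}

search = GridSearchCV(
    GradientBoostingRegressor(max_depth=None, random_state=42),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X, y)

# Mean and spread of the cross-validated R-squared for each candidate
cv = search.cv_results_
for params, mean, std in zip(cv["params"], cv["mean_test_score"], cv["std_test_score"]):
    print(f"{params}: mean R-squared {mean:.3f} (std {std:.3f})")

print("Best setting:", search.best_params_)
```

Comparing the spread across folds with the differences between candidates helps judge whether one setting is meaningfully better than another.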

Issues to consider:


