The max_leaf_nodes parameter in scikit-learn's GradientBoostingRegressor controls the maximum number of leaf nodes per decision tree in the ensemble.
Gradient Boosting is an ensemble technique that builds decision trees sequentially, with each new tree fitted to correct the residual errors of the trees built before it. This approach often yields accurate and robust models.
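To make the sequential idea concrete, here is a minimal hand-rolled sketch (not scikit-learn's internal implementation) that repeatedly fits a small DecisionTreeRegressor to the residuals of the running prediction; the data, learning rate, and tree size are arbitrary illustrative choices.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy data
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # stage 0: predict the mean

# Each stage fits a small tree to the current residuals and adds a
# scaled-down version of its predictions to the running prediction.
for stage in range(3):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    print(f"stage {stage}: mean squared residual = {np.mean((y - prediction) ** 2):.2f}")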
The max_leaf_nodes parameter limits the number of leaf nodes in each tree, which helps control the complexity of the model and prevent overfitting.
The default value for max_leaf_nodes is None, meaning the parameter itself places no limit on the number of leaf nodes (tree size is then governed by other settings such as max_depth, which defaults to 3). In practice, common values range from 10 to 100, depending on the complexity and size of the dataset.
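As a quick sanity check (illustrative only, with an arbitrary limit of 10), the fitted trees of a trained model can be inspected to confirm that none of them exceeds the limit:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

gbr = GradientBoostingRegressor(max_leaf_nodes=10, n_estimators=50, random_state=0)
gbr.fit(X, y)

# estimators_ has shape (n_estimators, 1) for single-output regression;
# each entry is a fitted DecisionTreeRegressor.
leaf_counts = [tree.get_n_leaves() for tree in gbr.estimators_[:, 0]]
print(f"largest tree has {max(leaf_counts)} leaves")  # never more than 10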
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_leaf_nodes values
max_leaf_nodes_values = [5, 10, 20, 50]
r2_scores = []
for max_leaf_nodes in max_leaf_nodes_values:
    gbr = GradientBoostingRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    r2_scores.append(r2)
    print(f"max_leaf_nodes={max_leaf_nodes}, R-squared: {r2:.3f}")
Running the example gives an output like:
max_leaf_nodes=5, R-squared: 0.917
max_leaf_nodes=10, R-squared: 0.922
max_leaf_nodes=20, R-squared: 0.922
max_leaf_nodes=50, R-squared: 0.922
The key steps in this example are:
- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train GradientBoostingRegressor models with different max_leaf_nodes values
- Evaluate and compare the R-squared scores of each model
Some tips and heuristics for setting max_leaf_nodes:
- Start with lower values for max_leaf_nodes to prevent overfitting
- Increase the value to improve model performance if underfitting is observed
- Balance the complexity of individual trees with the overall ensemble strength
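One practical way to apply these heuristics is to let a small cross-validated grid search choose the value; the candidate grid below is illustrative, not a recommendation:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Search over a few candidate values; None means no explicit leaf limit.
param_grid = {"max_leaf_nodes": [None, 5, 10, 20, 50]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X, y)
print(f"best max_leaf_nodes: {search.best_params_['max_leaf_nodes']}")
print(f"best cross-validated R-squared: {search.best_score_:.3f}")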
Issues to consider:
- The optimal max_leaf_nodes value depends on the dataset complexity
- Too few leaf nodes can cause underfitting
- Too many leaf nodes can lead to overfitting and increased computational cost
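To see this trade-off empirically, one option is to compare training and test R-squared as max_leaf_nodes grows; a large gap between the two suggests overfitting, while low scores on both suggest underfitting. A minimal sketch reusing the synthetic dataset from the example above:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare train vs. test R-squared; a large gap indicates overfitting,
# while low scores on both sets indicate underfitting.
for max_leaf_nodes in [2, 10, 100]:
    gbr = GradientBoostingRegressor(max_leaf_nodes=max_leaf_nodes, random_state=42)
    gbr.fit(X_train, y_train)
    train_r2 = gbr.score(X_train, y_train)
    test_r2 = gbr.score(X_test, y_test)
    print(f"max_leaf_nodes={max_leaf_nodes}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")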