
Configure HistGradientBoostingClassifier "max_leaf_nodes" Parameter

The max_leaf_nodes parameter in scikit-learn’s HistGradientBoostingClassifier controls the maximum number of leaf nodes in each tree.

HistGradientBoostingClassifier is a histogram-based gradient boosting algorithm that offers improved performance and scalability compared to traditional gradient boosting methods. It builds an ensemble of decision trees sequentially, with each tree correcting the errors of its predecessors.

The max_leaf_nodes parameter limits the complexity of individual trees in the ensemble. A smaller value produces shallower trees, which can reduce overfitting but may underfit if set too low. A larger value allows deeper, more complex trees that can capture more intricate patterns in the data.

By default, max_leaf_nodes is set to 31. Setting it to None removes the limit on the number of leaf nodes entirely. Common values range from 10 to 100, depending on the dataset’s complexity and size.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_leaf_nodes values
max_leaf_nodes_values = [10, 31, 100, None]
results = []

for max_leaf in max_leaf_nodes_values:
    start_time = time.time()
    hgb = HistGradientBoostingClassifier(max_leaf_nodes=max_leaf, random_state=42)
    hgb.fit(X_train, y_train)
    train_time = time.time() - start_time

    y_pred = hgb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results.append((max_leaf, accuracy, train_time))

for max_leaf, accuracy, train_time in results:
    print(f"max_leaf_nodes={max_leaf}, Accuracy: {accuracy:.3f}, Training time: {train_time:.2f}s")

Running the example gives an output like:

max_leaf_nodes=10, Accuracy: 0.889, Training time: 0.34s
max_leaf_nodes=31, Accuracy: 0.912, Training time: 0.88s
max_leaf_nodes=100, Accuracy: 0.923, Training time: 2.46s
max_leaf_nodes=None, Accuracy: 0.921, Training time: 5.67s

The key steps in this example are:

  1. Generate a synthetic multi-class classification dataset
  2. Split the data into train and test sets
  3. Train HistGradientBoostingClassifier models with different max_leaf_nodes values
  4. Evaluate the accuracy and training time for each model

Tips for setting max_leaf_nodes:

  - Start with the default value of 31 and adjust based on validation performance
  - Increase the value for large or complex datasets; decrease it if the model overfits
  - Tune it alongside learning_rate and max_iter, since tree complexity and the number of boosting iterations trade off against each other
  - Compare candidate values with cross-validation rather than a single train/test split

Issues to consider:

  - Larger values increase training time and memory use, as the timings above show
  - Setting max_leaf_nodes=None removes the limit and can be much slower without a corresponding accuracy gain
  - max_leaf_nodes interacts with max_depth; when both are set, whichever constraint is reached first stops tree growth


See Also