
Configure HistGradientBoostingClassifier "max_leaf_nodes" Parameter

The max_leaf_nodes parameter in scikit-learn’s HistGradientBoostingClassifier controls the maximum number of leaf nodes in each tree.

HistGradientBoostingClassifier is a histogram-based gradient boosting algorithm that offers improved performance and scalability compared to traditional gradient boosting methods. It builds an ensemble of decision trees sequentially, with each tree correcting the errors of its predecessors.

The max_leaf_nodes parameter limits the complexity of individual trees in the ensemble. A smaller value produces shallower trees, which can reduce overfitting but may underfit if set too low. A larger value allows deeper, more complex trees that can capture more intricate patterns in the data.

By default, max_leaf_nodes is set to 31. Setting it to None removes the limit on the number of leaf nodes entirely. Common values range from 10 to 100, depending on the dataset’s complexity and size.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_leaf_nodes values
max_leaf_nodes_values = [10, 31, 100, None]
results = []

for max_leaf in max_leaf_nodes_values:
    start_time = time.time()
    hgb = HistGradientBoostingClassifier(max_leaf_nodes=max_leaf, random_state=42)
    hgb.fit(X_train, y_train)
    train_time = time.time() - start_time

    y_pred = hgb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results.append((max_leaf, accuracy, train_time))

for max_leaf, accuracy, train_time in results:
    print(f"max_leaf_nodes={max_leaf}, Accuracy: {accuracy:.3f}, Training time: {train_time:.2f}s")

Running the example gives an output like:

max_leaf_nodes=10, Accuracy: 0.889, Training time: 0.34s
max_leaf_nodes=31, Accuracy: 0.912, Training time: 0.88s
max_leaf_nodes=100, Accuracy: 0.923, Training time: 2.46s
max_leaf_nodes=None, Accuracy: 0.921, Training time: 5.67s

The key steps in this example are:

  1. Generate a synthetic multi-class classification dataset
  2. Split the data into train and test sets
  3. Train HistGradientBoostingClassifier models with different max_leaf_nodes values
  4. Evaluate the accuracy and training time for each model

Tips for setting max_leaf_nodes:

  - Start with the default value of 31 and adjust based on validation performance
  - Increase the value for large or complex datasets; decrease it if the model overfits
  - Tune it alongside learning_rate and max_iter, since tree complexity and the number of boosting iterations trade off against each other
  - Compare candidate values with cross-validation rather than a single train/test split

Issues to consider:

  - Larger values increase training time and memory use, as the timings above show
  - Setting max_leaf_nodes=None removes the limit and can be much slower without a corresponding accuracy gain
  - max_leaf_nodes interacts with max_depth; when both are set, whichever constraint is reached first stops tree growth


See Also