The `max_depth` parameter in scikit-learn's `HistGradientBoostingClassifier` controls the maximum depth of the trees in the ensemble. `HistGradientBoostingClassifier` is a histogram-based variant of gradient boosting: it bins continuous features into histograms, which makes training substantially faster than traditional gradient boosting, especially on large datasets. The `max_depth` parameter limits the number of levels in each decision tree.
Setting `max_depth` helps control the complexity of the model. Smaller values create simpler trees, potentially reducing overfitting, while larger values allow for more complex decision boundaries.
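To see this trade-off directly, the following minimal sketch compares the gap between train and test accuracy for a very shallow model versus an unconstrained one; a larger gap suggests overfitting. The dataset and the depth values here are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset; any classification data works here
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in (2, None):  # very shallow vs. unconstrained depth
    model = HistGradientBoostingClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    gap = model.score(X_train, y_train) - model.score(X_test, y_test)
    print(f"max_depth={depth}: train/test accuracy gap = {gap:.3f}")
```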
The default value for `max_depth` is `None`, which leaves tree depth unconstrained; growth is instead limited by `max_leaf_nodes` (31 by default) and `min_samples_leaf` (20 by default). Common values range from 3 to 10, depending on the dataset's complexity.
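You can confirm these defaults on your installed version with `get_params()`; the values noted in the comments below are the defaults in recent scikit-learn releases:

```python
from sklearn.ensemble import HistGradientBoostingClassifier

params = HistGradientBoostingClassifier().get_params()
print(params["max_depth"])         # None: depth is unconstrained
print(params["max_leaf_nodes"])    # 31: caps the leaves per tree instead
print(params["min_samples_leaf"])  # 20: minimum samples per leaf
```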
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different max_depth values and evaluate each on the test set
max_depth_values = [None, 3, 5, 10]
accuracies = []

for depth in max_depth_values:
    hgbc = HistGradientBoostingClassifier(max_depth=depth, random_state=42)
    hgbc.fit(X_train, y_train)
    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_depth={depth}, Accuracy: {accuracy:.3f}")
```
Running the example gives output like:

```
max_depth=None, Accuracy: 0.912
max_depth=3, Accuracy: 0.870
max_depth=5, Accuracy: 0.901
max_depth=10, Accuracy: 0.913
```
The key steps in this example are:

- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `HistGradientBoostingClassifier` models with different `max_depth` values
- Evaluate the accuracy of each model on the test set
Tips for setting `max_depth`:

- Start with the default `None` and compare against smaller values
- Use cross-validation to find the optimal depth for your specific dataset (a sketch follows this list)
- Consider the trade-off between model complexity and training time
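As a concrete illustration of the cross-validation tip, here is a minimal sketch that scores each candidate depth with 5-fold cross-validation on the same synthetic dataset used above; the candidate list is just an example, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Compare candidate depths with 5-fold cross-validation
for depth in (None, 3, 5, 10):
    scores = cross_val_score(
        HistGradientBoostingClassifier(max_depth=depth, random_state=42),
        X, y, cv=5)
    print(f"max_depth={depth}: {scores.mean():.3f} +/- {scores.std():.3f}")
```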
Issues to consider:

- Deeper trees can lead to overfitting, especially on small datasets
- Shallow trees may underfit complex datasets
- The optimal depth depends on the nature and size of your data
- Interaction with other hyperparameters like `learning_rate` and `max_iter` can affect the optimal `max_depth` (see the joint search sketch below)
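One way to account for these interactions is to search the hyperparameters jointly rather than tuning `max_depth` in isolation. The sketch below uses `GridSearchCV` with an illustrative grid; the specific values are assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Search max_depth jointly with learning_rate and max_iter,
# since the best depth shifts as the other two change
param_grid = {
    "max_depth": [None, 3, 5, 10],
    "learning_rate": [0.05, 0.1],
    "max_iter": [100, 200],  # number of boosting iterations
}
search = GridSearchCV(HistGradientBoostingClassifier(random_state=42),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, f"cv accuracy={search.best_score_:.3f}")
```

With 16 parameter combinations and 3 folds this runs 48 fits, so expect it to take noticeably longer than the single-model example above.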