The max_depth parameter in scikit-learn’s HistGradientBoostingRegressor controls the maximum depth of the trees in the ensemble.

HistGradientBoostingRegressor is an efficient implementation of gradient boosting that uses histogram-based algorithms for faster training. It builds an ensemble of decision trees sequentially, with each tree correcting the errors of the previous ones.

The max_depth parameter limits how deep each tree can grow. Deeper trees can capture more complex patterns but may lead to overfitting, while shallower trees may generalize better but can underfit.
By default, max_depth is set to None, which leaves tree depth unconstrained; growth is instead limited by the max_leaf_nodes (31 by default) and min_samples_leaf parameters. Common fixed values range from 3 to 10, depending on the complexity of the data.
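As a quick sanity check of those defaults, the short snippet below simply instantiates the estimator and prints the relevant parameters; the exact default values can vary between scikit-learn releases, so treat the comment as an expectation rather than a guarantee.
from sklearn.ensemble import HistGradientBoostingRegressor

# In recent scikit-learn releases these print as None, 31, and 20 respectively
params = HistGradientBoostingRegressor().get_params()
print({k: params[k] for k in ("max_depth", "max_leaf_nodes", "min_samples_leaf")})
The full example below trains models with several max_depth values and compares their error on a held-out test set.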
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_depth values
max_depth_values = [None, 3, 5, 10]
mse_scores = []
for depth in max_depth_values:
    hgbr = HistGradientBoostingRegressor(max_depth=depth, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_depth={depth}, MSE: {mse:.3f}")
Running the example gives an output like:
max_depth=None, MSE: 1023.074
max_depth=3, MSE: 1122.012
max_depth=5, MSE: 984.290
max_depth=10, MSE: 1044.807
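If you want to pick the winning depth programmatically, a small follow-up to the loop above (reusing the max_depth_values and mse_scores lists) could look like this:
# Index of the lowest test MSE found in the loop above
best_index = mse_scores.index(min(mse_scores))
print(f"Best max_depth: {max_depth_values[best_index]} "
      f"(MSE: {mse_scores[best_index]:.3f})")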
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different max_depth values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting max_depth:
- Start with the default None and compare with fixed depths
- Use smaller depths (3-5) for simpler datasets or when overfitting is a concern
- Increase depth (7-10) for more complex datasets if underfitting occurs
- Consider using cross-validation to find the optimal depth for your specific dataset (a minimal grid-search sketch follows this list)
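As one way to run that cross-validation, the sketch below wraps the estimator in GridSearchCV and searches over a handful of depths; the candidate values and the synthetic data are just illustrative choices, not a recommended grid.
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative synthetic data; substitute your own dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Search a handful of candidate depths with 5-fold cross-validation
param_grid = {"max_depth": [None, 3, 5, 7, 10]}
search = GridSearchCV(
    HistGradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print("Best max_depth:", search.best_params_["max_depth"])
print("Best CV MSE:", -search.best_score_)
Because GridSearchCV refits the best configuration on the full data by default, search.best_estimator_ is ready to use afterwards.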
Issues to consider:
- Deeper trees increase model complexity and training time (a rough timing sketch follows this list)
- Very deep trees may lead to overfitting, especially on small datasets
- Shallower trees may underfit if the underlying patterns are complex
- The optimal depth often depends on the number of samples and features in the dataset
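To get a feel for the training-time cost of deeper trees, a rough timing comparison like the one below can help; absolute times depend heavily on your machine and dataset, so only the relative difference is meaningful.
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=5000, n_features=10, noise=0.1, random_state=42)

# Compare fit time for a shallow tree limit versus unconstrained depth
for depth in [3, None]:
    model = HistGradientBoostingRegressor(max_depth=depth, random_state=42)
    start = time.perf_counter()
    model.fit(X, y)
    print(f"max_depth={depth}: fit time {time.perf_counter() - start:.2f}s")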