The `max_depth` parameter in scikit-learn's `GradientBoostingRegressor` controls the maximum depth of the individual trees in the ensemble.

Gradient boosting builds an ensemble of weak learners (usually decision trees) sequentially, with each new tree correcting the errors of the current ensemble. The `max_depth` parameter determines how complex each individual tree can be.

Generally, a higher `max_depth` allows each tree to capture more information, leading to a more complex model. However, this also increases the risk of overfitting. Conversely, a lower `max_depth` can prevent overfitting but may result in underfitting if the model is too simple.

The default value for `max_depth` is 3. In practice, values between 3 and 10 are commonly used, depending on the dataset and problem complexity.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_depth values
max_depth_values = [3, 5, 7, 10]
mse_scores = []

for depth in max_depth_values:
    gbr = GradientBoostingRegressor(max_depth=depth, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_depth={depth}, Mean Squared Error: {mse:.3f}")
```
Running the example gives an output like:

```
max_depth=3, Mean Squared Error: 3052.375
max_depth=5, Mean Squared Error: 4062.380
max_depth=7, Mean Squared Error: 6255.098
max_depth=10, Mean Squared Error: 11292.818
```
The key steps in this example are:

- Generate a synthetic regression dataset with informative and noise features.
- Split the data into train and test sets.
- Train `GradientBoostingRegressor` models with different `max_depth` values.
- Evaluate the Mean Squared Error of each model on the test set.
Some tips and heuristics for setting `max_depth`:

- Start with the default value of 3 and increase it to see if performance improves.
- A higher `max_depth` allows the model to learn more complex patterns but can lead to overfitting.
- Monitor performance on a validation set to avoid overfitting.
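One way to act on the validation advice above is to let cross-validation pick `max_depth` for you. A minimal sketch using `GridSearchCV` (the depth grid and dataset here are illustrative choices, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only
X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=42)

# Cross-validated search over candidate max_depth values
param_grid = {"max_depth": [2, 3, 5, 7]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# The depth with the best mean cross-validated score
print(search.best_params_)
```

Each candidate depth is scored on held-out folds rather than the training data, so the selected value is less likely to be one that merely overfits.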
Issues to consider:

- The optimal `max_depth` depends on the size and complexity of the dataset.
- Using trees that are too shallow can result in underfitting.
- Computational cost increases with higher `max_depth` values.
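One way to keep the cost of deeper trees in check is scikit-learn's built-in early stopping: setting `n_iter_no_change` makes the model hold out `validation_fraction` of the training data and stop adding trees once the validation score stops improving. A minimal sketch (the specific depth and budget here are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Stop adding trees once the held-out validation score has not
# improved for 10 consecutive boosting iterations.
gbr = GradientBoostingRegressor(
    max_depth=7,
    n_estimators=500,       # upper budget, not necessarily reached
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=42,
)
gbr.fit(X, y)

# n_estimators_ reports how many trees were actually fitted
print(gbr.n_estimators_)
```

With deeper trees, training often stops well short of the `n_estimators` budget, which limits both overfitting and compute time.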