The max_depth parameter in scikit-learn’s GradientBoostingClassifier controls the maximum depth of the individual regression estimators (decision trees) in the ensemble.
Gradient Boosting is an ensemble learning method that trains decision trees sequentially, with each new tree fitted to the residual errors of the trees before it. The max_depth parameter determines the complexity of each individual tree.
Smaller max_depth values (shallower trees) lead to a simpler model that may underfit, while larger values (deeper trees) increase the model’s complexity and its risk of overfitting. The optimal value depends on the dataset.
The default max_depth value is 3. In practice, values between 3 and 10 are commonly used, with 5 being a good starting point for many datasets.
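You can confirm the default directly on an unconfigured estimator (a minimal sketch; every scikit-learn estimator exposes its settings via get_params):

from sklearn.ensemble import GradientBoostingClassifier

# Inspect the default max_depth of an unconfigured classifier
print(GradientBoostingClassifier().get_params()['max_depth'])  # prints: 3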
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_depth values
max_depth_values = [2, 3, 5, 10]
accuracies = []
for depth in max_depth_values:
    gb = GradientBoostingClassifier(max_depth=depth, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_depth={depth}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_depth=2, Accuracy: 0.795
max_depth=3, Accuracy: 0.835
max_depth=5, Accuracy: 0.850
max_depth=10, Accuracy: 0.780
The key steps in this example are:
- Generate a synthetic multiclass classification dataset with informative and noise features
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different max_depth values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_depth:
- Start with the default value (3) and increase it until performance plateaus
- Use cross-validation to find the optimal value for your dataset (see the grid-search sketch after this list)
- Consider the trade-off between model complexity and generalization ability
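To apply the cross-validation tip, a grid search over max_depth is a natural fit. The sketch below reuses X_train and y_train from the example above; the candidate depth grid is an illustrative assumption to adjust for your dataset:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Candidate depths to search over (an illustrative range, not a rule)
param_grid = {'max_depth': [2, 3, 4, 5, 7, 10]}

# 5-fold cross-validated search for the best depth
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best max_depth: {grid.best_params_['max_depth']}")
print(f"Best CV accuracy: {grid.best_score_:.3f}")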
Issues to consider:
- High max_depth values can lead to overfitting, especially on small datasets (see the sketch after this list)
- Deeper trees increase training time and memory usage
- The optimal max_depth depends on the size and complexity of the dataset
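A simple way to see the overfitting risk in practice is to compare training and test accuracy as depth grows; a widening gap signals that the trees are memorizing the training data. A minimal sketch, reusing the data split and imports from the example above:

# Compare train vs. test accuracy at a shallow and a deep setting
for depth in [2, 10]:
    gb = GradientBoostingClassifier(max_depth=depth, random_state=42)
    gb.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, gb.predict(X_train))
    test_acc = accuracy_score(y_test, gb.predict(X_test))
    print(f"max_depth={depth}, train={train_acc:.3f}, test={test_acc:.3f}")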