The max_features parameter in scikit-learn's GradientBoostingClassifier controls the number of features to consider when looking for the best split at each node of the decision trees.
Gradient Boosting builds an ensemble of decision trees in a stage-wise fashion. The max_features parameter determines how many features are randomly sampled as split candidates at each node.
Smaller values of max_features can help reduce overfitting by introducing more randomness and diversity among the trees. However, setting it too low may lead to underfitting.
The default value for max_features is None, which means all features are considered at each split. Common alternatives are "sqrt" (square root of the total number of features) and "log2" (base-2 logarithm of the total number of features); a float such as 0.5 is interpreted as a fraction of the total number of features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=10,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = [None, "sqrt", "log2", 0.5]
accuracies = []
for mf in max_features_values:
    gb = GradientBoostingClassifier(max_features=mf, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={mf}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_features=None, Accuracy: 0.760
max_features=sqrt, Accuracy: 0.720
max_features=log2, Accuracy: 0.720
max_features=0.5, Accuracy: 0.750
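In this run, "sqrt" and "log2" give identical accuracy because the dataset has 20 features (the make_classification default) and both settings resolve to the same number of candidate features per split. The sketch below illustrates how each setting maps to a feature count; the resolved_max_features helper is hypothetical and only approximates scikit-learn's internal conversion:

import numpy as np

# Hypothetical helper: approximate how max_features settings resolve
# to a number of candidate features per split (assumes 20 total features)
def resolved_max_features(max_features, n_features=20):
    if max_features is None:
        return n_features                              # use all features
    if max_features == "sqrt":
        return max(1, int(np.sqrt(n_features)))        # sqrt(20) -> 4
    if max_features == "log2":
        return max(1, int(np.log2(n_features)))        # log2(20) -> 4
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # 0.5 * 20 -> 10
    return int(max_features)                           # int given directly

for mf in [None, "sqrt", "log2", 0.5]:
    print(f"max_features={mf!r} -> {resolved_max_features(mf)} of 20 features per split")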
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different max_features values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_features:

- Start with the default value (None) and try smaller values like "sqrt" or "log2" (a cross-validation sketch follows this list)
- Smaller values can help reduce overfitting, especially for high-dimensional datasets
- If the model is underfitting, try increasing max_features
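One way to apply these tips systematically is to cross-validate over a small grid of max_features candidates. A minimal sketch with GridSearchCV, assuming the X_train and y_train arrays from the example above:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validate a handful of max_features candidates
param_grid = {"max_features": [None, "sqrt", "log2", 0.3, 0.5]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best max_features:", grid.best_params_["max_features"])
print(f"Best CV accuracy: {grid.best_score_:.3f}")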
Issues to consider:
- The optimal value depends on the specific dataset and problem
- Setting max_features too low can lead to underfitting and poor performance
- Using a smaller value increases randomness, which can require more trees (n_estimators) for the model to converge, as shown in the sketch after this list
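To see the interaction between max_features and n_estimators in practice, staged_predict can score the ensemble after every boosting stage. A sketch, again reusing the train/test split from the example above:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Track test accuracy as trees are added, with and without feature subsampling
for mf in [None, "sqrt"]:
    gb = GradientBoostingClassifier(n_estimators=300, max_features=mf, random_state=42)
    gb.fit(X_train, y_train)
    staged_acc = [accuracy_score(y_test, y_pred)
                  for y_pred in gb.staged_predict(X_test)]
    best_stage = staged_acc.index(max(staged_acc)) + 1
    print(f"max_features={mf}: best accuracy {max(staged_acc):.3f} at {best_stage} trees")

If the restricted setting is still improving at the final stage, increasing n_estimators is the natural next step.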