The max_features parameter in scikit-learn’s HistGradientBoostingClassifier controls the number of features considered when looking for the best split.
HistGradientBoostingClassifier is an implementation of gradient boosting that uses histogram-based decision trees. It’s known for its speed and efficiency, especially on large datasets.
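As a rough illustration of that speed advantage, the sketch below times HistGradientBoostingClassifier against the classic GradientBoostingClassifier on a synthetic dataset; the dataset size is an arbitrary choice and the exact timings depend on your hardware.
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
# Synthetic dataset large enough for the histogram-based speedup to show
X, y = make_classification(n_samples=20000, n_features=20, random_state=42)
for Model in (GradientBoostingClassifier, HistGradientBoostingClassifier):
    start = perf_counter()
    Model(random_state=42).fit(X, y)
    print(f"{Model.__name__}: {perf_counter() - start:.1f}s")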
The max_features parameter sets the proportion of features that are randomly sampled and considered at each node split. It is a float in the interval (0, 1], expressed as a fraction of the number of input features. By default, max_features is set to 1.0, which considers all features at every split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different max_features values
max_features_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]
accuracies = []
for mf in max_features_values:
    hgbc = HistGradientBoostingClassifier(max_features=mf, random_state=42)
    hgbc.fit(X_train, y_train)
    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"max_features={mf}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
max_features=0.1, Accuracy: 0.904
max_features=0.3, Accuracy: 0.917
max_features=0.5, Accuracy: 0.918
max_features=0.7, Accuracy: 0.915
max_features=0.9, Accuracy: 0.913
max_features=1.0, Accuracy: 0.912
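To see the trend at a glance, the accuracies collected in the loop above can be plotted against the max_features values (this assumes matplotlib is installed; it is not required by the example itself):
import matplotlib.pyplot as plt
# Plot test accuracy as a function of max_features
plt.plot(max_features_values, accuracies, marker="o")
plt.xlabel("max_features")
plt.ylabel("Test accuracy")
plt.title("Effect of max_features on HistGradientBoostingClassifier")
plt.show()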
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train HistGradientBoostingClassifier models with different max_features values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting max_features:
- Start with the default 1.0 and experiment with other values
- Unlike the forest estimators, this parameter does not accept the strings ‘sqrt’ or ‘log2’; the equivalent ratios sqrt(n_features)/n_features and log2(n_features)/n_features often work well and can reduce overfitting (see the sketch after this list)
- Any float in (0, 1] (e.g., 0.5) can be used, allowing fine-grained control
- Lower values can lead to more diverse trees but may miss important features
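Because max_features only accepts a float, the sqrt and log2 heuristics have to be expressed as proportions of the feature count. A minimal sketch, reusing X_train and y_train from the example above:
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
n_features = X_train.shape[1]
mf_sqrt = np.sqrt(n_features) / n_features   # ~0.22 for 20 features
mf_log2 = np.log2(n_features) / n_features   # ~0.22 for 20 features
hgbc = HistGradientBoostingClassifier(max_features=mf_sqrt, random_state=42)
hgbc.fit(X_train, y_train)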
Issues to consider:
- The optimal value depends on the dataset’s characteristics
- Values that are too low may result in underfitting, while values that are too high can lead to overfitting
- There’s a trade-off between model complexity and computational cost
- The effect of max_features can interact with other hyperparameters, so it is often tuned jointly with them (a joint tuning sketch follows this list)
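Because of those interactions, max_features is usually tuned together with other hyperparameters rather than in isolation. A minimal sketch using GridSearchCV, reusing X_train and y_train from the example above; the grid values are illustrative, not prescriptive:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
# Illustrative grid; adapt the ranges to the dataset at hand
param_grid = {
    "max_features": [0.3, 0.6, 1.0],
    "learning_rate": [0.05, 0.1],
    "max_leaf_nodes": [15, 31],
}
search = GridSearchCV(HistGradientBoostingClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)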