The `max_bins` parameter in scikit-learn's `HistGradientBoostingClassifier` controls the maximum number of bins used to discretize continuous features.
`HistGradientBoostingClassifier` is a gradient boosting algorithm that builds histogram-based decision trees. It is designed for efficiency on large datasets and supports missing values natively.
The `max_bins` parameter determines the granularity of the histograms used to approximate the continuous features. Higher values can potentially capture more detailed patterns but increase computational cost and memory usage.
The default value for `max_bins` is 255, which is also the maximum allowed (one extra bin is always reserved for missing values). In practice, values between 32 and 255 are commonly used, depending on the dataset characteristics and computational constraints.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_bins values
max_bins_values = [32, 64, 128, 255]
accuracies = []
training_times = []

for bins in max_bins_values:
    start_time = time.time()
    hgbc = HistGradientBoostingClassifier(max_bins=bins, random_state=42)
    hgbc.fit(X_train, y_train)
    training_time = time.time() - start_time

    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    accuracies.append(accuracy)
    training_times.append(training_time)

    print(f"max_bins={bins}, Accuracy: {accuracy:.3f}, Training Time: {training_time:.2f} seconds")
```
Running the example gives an output like:

```
max_bins=32, Accuracy: 0.917, Training Time: 0.67 seconds
max_bins=64, Accuracy: 0.912, Training Time: 0.69 seconds
max_bins=128, Accuracy: 0.914, Training Time: 0.75 seconds
max_bins=255, Accuracy: 0.912, Training Time: 0.87 seconds
```
The key steps in this example are:

- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `HistGradientBoostingClassifier` models with different `max_bins` values
- Measure training time and evaluate accuracy for each model
- Compare the results to understand the trade-off between accuracy and computational cost
Some tips and heuristics for setting `max_bins`:

- Start with the default value of 255 and decrease it if training time is too long
- Increase `max_bins` (back toward the ceiling of 255) if you suspect important patterns are being missed due to coarse binning
- For datasets with many samples, higher `max_bins` values may be beneficial
Issues to consider:

- Higher `max_bins` values increase memory usage and computational cost
- Very low `max_bins` values may lead to underfitting due to loss of information
- The optimal value depends on the dataset size, feature distributions, and available computational resources