Configure HistGradientBoostingClassifier "max_bins" Parameter

The max_bins parameter in scikit-learn’s HistGradientBoostingClassifier controls the maximum number of bins used to discretize continuous features.

HistGradientBoostingClassifier is a gradient boosting algorithm that uses histogram-based decision trees. It’s designed for efficiency on large datasets and supports missing values.

The max_bins parameter determines the granularity of the histograms used to approximate the continuous features. Higher values allow finer split thresholds and can capture more detailed patterns, but they increase computational cost and memory usage. Features with fewer distinct values than max_bins simply use fewer bins.
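To make the idea of histogram granularity concrete, here is a minimal NumPy sketch of quantile-based binning. It is a simplified illustration, not scikit-learn's internal implementation; the helper name quantile_bin is hypothetical and exists only for this example.

```python
import numpy as np

def quantile_bin(values, max_bins):
    """Hypothetical helper: map continuous values to bin indices in [0, max_bins - 1]."""
    # Place max_bins - 1 interior thresholds at evenly spaced quantiles
    quantiles = np.linspace(0, 100, max_bins + 1)[1:-1]
    thresholds = np.percentile(values, quantiles)
    # Each value gets the index of the first threshold that is >= the value
    binned = np.searchsorted(thresholds, values, side="left")
    return binned, thresholds

rng = np.random.default_rng(42)
feature = rng.normal(size=1000)

results = {}
for max_bins in (4, 32, 255):
    binned, thresholds = quantile_bin(feature, max_bins)
    results[max_bins] = (binned, thresholds)
    print(f"max_bins={max_bins}: {len(thresholds)} thresholds, "
          f"{len(np.unique(binned))} distinct bin indices")
```

With more bins, the candidate split points sit closer together, which is the sense in which a larger max_bins gives a finer approximation of the original feature.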

The default value for max_bins is 255.

In practice, values between 32 and 255 are commonly used, depending on the dataset characteristics and computational constraints. The value cannot exceed 255, and one additional bin is always reserved for missing values.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_bins values
max_bins_values = [32, 64, 128, 255]
accuracies = []
training_times = []

for bins in max_bins_values:
    # perf_counter is a monotonic, high-resolution clock suited to timing code
    start_time = time.perf_counter()
    hgbc = HistGradientBoostingClassifier(max_bins=bins, random_state=42)
    hgbc.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time

    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    accuracies.append(accuracy)
    training_times.append(training_time)
    print(f"max_bins={bins}, Accuracy: {accuracy:.3f}, Training Time: {training_time:.2f} seconds")

Running the example gives an output like the following (exact timings depend on the machine):

max_bins=32, Accuracy: 0.917, Training Time: 0.67 seconds
max_bins=64, Accuracy: 0.912, Training Time: 0.69 seconds
max_bins=128, Accuracy: 0.914, Training Time: 0.75 seconds
max_bins=255, Accuracy: 0.912, Training Time: 0.87 seconds

The key steps in this example are:

  1. Generate a synthetic multi-class classification dataset
  2. Split the data into train and test sets
  3. Train HistGradientBoostingClassifier models with different max_bins values
  4. Measure training time and evaluate accuracy for each model
  5. Compare the results to understand the trade-off between accuracy and computational cost

Some tips and heuristics for setting max_bins:

Issues to consider:


