The max_bins parameter in scikit-learn’s HistGradientBoostingClassifier controls the maximum number of bins used to discretize continuous features.
HistGradientBoostingClassifier is a gradient boosting algorithm that uses histogram-based decision trees. It’s designed for efficiency on large datasets and supports missing values.
The max_bins parameter determines the granularity of the histograms used to approximate continuous features. Higher values can potentially capture more detailed patterns but increase computational cost and memory usage.
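The binning idea can be sketched independently of the library. The snippet below uses quantile-based bin edges as a simplified stand-in for scikit-learn's internal binner (the real implementation uses subsampled quantiles and is not part of the public API), and a deliberately tiny max_bins so the effect is visible:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # one continuous feature

max_bins = 8  # tiny value so the compression is obvious
# Interior quantiles as bin edges -- a simplified stand-in for sklearn's binner
edges = np.quantile(x, np.linspace(0, 1, max_bins + 1)[1:-1])
binned = np.searchsorted(edges, x)  # integer bin index per sample

print("distinct raw values:   ", len(np.unique(x)))
print("distinct binned values:", len(np.unique(binned)))  # at most max_bins
```

After binning, the tree only ever has to consider at most max_bins - 1 split thresholds per feature, which is what makes histogram-based boosting fast.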
The default value for max_bins is 255, which is also the maximum allowed: scikit-learn stores binned values in 8-bit integers, with one additional bin always reserved for missing values.
In practice, values between 32 and 255 are commonly used, depending on the dataset characteristics and computational constraints.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_bins values
max_bins_values = [32, 64, 128, 255]
accuracies = []
training_times = []

for bins in max_bins_values:
    start_time = time.time()
    hgbc = HistGradientBoostingClassifier(max_bins=bins, random_state=42)
    hgbc.fit(X_train, y_train)
    training_time = time.time() - start_time

    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    accuracies.append(accuracy)
    training_times.append(training_time)
    print(f"max_bins={bins}, Accuracy: {accuracy:.3f}, Training Time: {training_time:.2f} seconds")
```
Running the example gives an output like:

```
max_bins=32, Accuracy: 0.917, Training Time: 0.67 seconds
max_bins=64, Accuracy: 0.912, Training Time: 0.69 seconds
max_bins=128, Accuracy: 0.914, Training Time: 0.75 seconds
max_bins=255, Accuracy: 0.912, Training Time: 0.87 seconds
```
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train HistGradientBoostingClassifier models with different max_bins values
- Measure training time and evaluate accuracy for each model
- Compare the results to understand the trade-off between accuracy and computational cost
Some tips and heuristics for setting max_bins:
- Start with the default value of 255 and decrease if training time is too long
- Increase max_bins if you suspect important patterns are being missed due to coarse binning
- For datasets with many samples, higher max_bins values may be beneficial
Issues to consider:
- Higher max_bins values increase memory usage and computational cost
- Very low max_bins values may lead to underfitting due to loss of information
- The optimal value depends on the dataset size, feature distributions, and available computational resources
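To get a feel for the memory side of this trade-off, consider the per-node histograms the algorithm builds during training: one histogram per feature, with one entry per bin. The estimate below assumes roughly 20 bytes per bin entry (two float64 sums plus an integer count), which is a back-of-the-envelope figure, not an exact accounting of scikit-learn's internals:

```python
n_features = 20

# Rough per-node histogram memory: one histogram per feature, one entry per
# bin, assuming ~20 bytes per entry (sum of gradients, sum of hessians, count)
for max_bins in [32, 64, 128, 255]:
    per_node_bytes = n_features * max_bins * (8 + 8 + 4)
    print(f"max_bins={max_bins:3d}: ~{per_node_bytes / 1024:.0f} KiB of histograms per node")
```

This memory is paid at every tree node (modulo the usual parent-minus-sibling subtraction trick), so it scales linearly with max_bins, exactly as the timing results in the example above suggest.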