Configure HistGradientBoostingRegressor "max_bins" Parameter

The max_bins parameter in scikit-learn’s HistGradientBoostingRegressor controls the maximum number of bins used to discretize continuous features.

Histogram-based gradient boosting uses binning to transform continuous features into discrete values, which speeds up training and reduces memory usage. The max_bins parameter determines the granularity of this discretization.
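To make the idea of binning concrete, the following is a minimal sketch of quantile-based discretization written with NumPy. It is an illustration only, not scikit-learn's internal implementation: the helper name quantile_bin and the random feature are assumptions made up for this example.

```python
import numpy as np

def quantile_bin(values, max_bins=255):
    """Map a 1-D continuous feature to integer bin indices (hypothetical helper)."""
    # max_bins bins need max_bins - 1 interior cut points.
    quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
    # Duplicate cut points are dropped, so fewer bins may be used in practice.
    thresholds = np.unique(np.quantile(values, quantiles))
    # For each value, find the index of the bin it falls into.
    binned = np.searchsorted(thresholds, values, side="right")
    return binned, thresholds

# Illustrative data: one normally distributed feature.
rng = np.random.default_rng(0)
feature = rng.normal(size=1000)

for max_bins in (4, 32, 255):
    binned, thresholds = quantile_bin(feature, max_bins)
    print(f"max_bins={max_bins}: {len(np.unique(binned))} distinct binned values")
```

With fewer bins, more raw values collapse into the same integer, which is where both the speed-up and the potential loss of information come from.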

Higher max_bins values can capture more fine-grained patterns in the data but increase training time and memory usage. Lower values are faster but may lose some information.

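The default value for max_bins is 255, which is also the largest value scikit-learn accepts; the minimum is 2. One additional bin is always reserved for missing values. In practice, values between 32 and 255 are commonly used, depending on the dataset size and the desired trade-off between accuracy and speed.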

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_bins values
max_bins_values = [32, 128, 255]
mse_scores = []
training_times = []

for bins in max_bins_values:
    # perf_counter is a monotonic, high-resolution clock suited to timing code
    start_time = time.perf_counter()
    hgbr = HistGradientBoostingRegressor(max_bins=bins, random_state=42)
    hgbr.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time

    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    mse_scores.append(mse)
    training_times.append(training_time)

    print(f"max_bins={bins}, MSE: {mse:.3f}, Training Time: {training_time:.3f} seconds")

Running the example gives an output like:

max_bins=32, MSE: 1201.878, Training Time: 0.259 seconds
max_bins=128, MSE: 952.967, Training Time: 0.295 seconds
max_bins=255, MSE: 924.053, Training Time: 0.334 seconds

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train HistGradientBoostingRegressor models with different max_bins values
Evaluate the mean squared error and training time for each model

Some tips and heuristics for setting max_bins:

Start with the default value of 255, which is also the maximum, and lower it only if training speed or memory becomes a constraint
For large datasets, keeping max_bins near 255 preserves the most split candidates per feature, at the cost of somewhat longer training
For smaller datasets or when speed is crucial, lower values can provide faster training with potentially minimal loss in accuracy

Issues to consider:

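Higher max_bins values increase the size of the per-feature histograms built during training, which may be a concern for large datasets with many features
Values near the upper limit of 255 allow finer splits, which can contribute to overfitting, especially on smaller datasets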
The optimal max_bins value may interact with other hyperparameters like learning rate and number of iterations
