The quantile
parameter in scikit-learn's HistGradientBoostingRegressor
specifies which conditional quantile of the target the model estimates when the loss parameter is set to "quantile".
Histogram-based gradient boosting is an efficient variant of gradient boosting that bins continuous features to speed up training and reduce memory usage. Binning itself is controlled by the separate max_bins parameter; the quantile
parameter instead selects the quantile level targeted by the pinball (quantile) loss.
quantile
accepts a float between 0 and 1 and takes effect only when loss="quantile". A value of 0.5 fits the conditional median; values such as 0.05 or 0.95 fit the lower and upper tails of the target distribution.
The default value for quantile
is None, because the default loss ("squared_error") does not use it.
In practice, quantile=0.5 gives a median regression that is more robust to outliers than squared-error loss, while pairing a low and a high level (for example 0.05 and 0.95) yields a prediction interval.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_pinball_loss

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train one model per quantile level
quantile_values = [0.1, 0.5, 0.9]
for q in quantile_values:
    hgbr = HistGradientBoostingRegressor(loss="quantile", quantile=q, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    loss = mean_pinball_loss(y_test, y_pred, alpha=q)
    print(f"quantile={q}, pinball loss: {loss:.3f}")
Running the example prints the mean pinball loss of each model on the test set, with each model scored at the quantile level it was trained for.
The key steps in this example are:
- Generate a synthetic regression dataset with multiple features
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with loss="quantile" at different quantile levels
- Evaluate each model on the test set with the mean pinball loss at its own quantile level
Some tips and heuristics for setting quantile
:
- Set loss="quantile" first; otherwise the parameter is ignored
- Use 0.5 for a robust, median-based point estimate
- Pair a low and a high value (for example 0.05 and 0.95) to build prediction intervals
- Experiment with several levels and compare their pinball losses
Issues to consider:
- quantile must lie between 0 and 1 and only applies when loss="quantile"
- Extreme quantiles (close to 0 or 1) are driven by few effective samples and need more data to be estimated reliably
- Independently fitted quantile models can cross (e.g. the 0.9 prediction falling below the 0.5 prediction); check for this when constructing intervals