The quantile
parameter in scikit-learn's HistGradientBoostingRegressor
specifies which conditional quantile of the target the model estimates when the loss parameter is set to "quantile".
Histogram-based gradient boosting is an efficient variant of gradient boosting that bins continuous features to speed up training and reduce memory usage. Binning itself is controlled by the separate max_bins parameter; the quantile
parameter instead selects the quantile level targeted by the pinball (quantile) loss.
quantile
accepts a float between 0 and 1 and takes effect only when loss="quantile". A value of 0.5 fits the conditional median; values such as 0.05 or 0.95 fit the lower and upper tails of the target distribution.
The default value for quantile
is None, because the default loss ("squared_error") does not use it.
In practice, quantile=0.5 gives a median regression that is more robust to outliers than squared-error loss, while pairing a low and a high level (for example 0.05 and 0.95) yields a prediction interval.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_pinball_loss

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train one model per quantile level
quantile_values = [0.1, 0.5, 0.9]
for q in quantile_values:
    hgbr = HistGradientBoostingRegressor(loss="quantile", quantile=q, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    loss = mean_pinball_loss(y_test, y_pred, alpha=q)
    print(f"quantile={q}, pinball loss: {loss:.3f}")
Running the example prints the mean pinball loss of each model on the test set, with each model scored at the quantile level it was trained for.
The key steps in this example are:
- Generate a synthetic regression dataset with multiple features
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with loss="quantile" at different quantile levels
- Evaluate each model on the test set with the mean pinball loss at its own quantile level
Some tips and heuristics for setting quantile
:
- Set loss="quantile" first; otherwise the parameter is ignored
- Use 0.5 for a robust, median-based point estimate
- Pair a low and a high value (for example 0.05 and 0.95) to build prediction intervals
- Experiment with several levels and compare their pinball losses
Issues to consider:
- quantile must lie between 0 and 1 and only applies when loss="quantile"
- Extreme quantiles (close to 0 or 1) are driven by few effective samples and need more data to be estimated reliably
- Independently fitted quantile models can cross (e.g. the 0.9 prediction falling below the 0.5 prediction); check for this when constructing intervals