The `tol` parameter in scikit-learn’s `HistGradientBoostingRegressor` controls the tolerance for the early stopping criterion.
`HistGradientBoostingRegressor` is a gradient boosting implementation that bins continuous features into histograms when building its decision trees. This makes tree construction fast and memory-efficient, so it scales well to large datasets.
The `tol` parameter sets the absolute tolerance used when comparing scores during early stopping: if the score fails to improve by at least `tol` over the last `n_iter_no_change` iterations, training stops early. The higher the tolerance, the sooner training is likely to stop.
The default value for `tol` is 1e-7. In practice, values between 1e-8 and 1e-3 are commonly used, depending on the desired trade-off between model performance and training time.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different tol values (early_stopping is left at its 'auto' default)
tol_values = [1e-8, 1e-7, 1e-5, 1e-3]
results = []

for tol in tol_values:
    start_time = time.time()
    model = HistGradientBoostingRegressor(tol=tol, random_state=42, max_iter=1000)
    model.fit(X_train, y_train)
    train_time = time.time() - start_time

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    n_iter = model.n_iter_
    results.append((tol, mse, train_time, n_iter))

    print(f"tol={tol:.0e}, MSE: {mse:.4f}, Training time: {train_time:.2f}s, Iterations: {n_iter}")
```
Running the example gives an output like:

```
tol=1e-08, MSE: 729.3384, Training time: 2.75s, Iterations: 1000
tol=1e-07, MSE: 729.3384, Training time: 2.46s, Iterations: 1000
tol=1e-05, MSE: 729.3384, Training time: 2.40s, Iterations: 1000
tol=1e-03, MSE: 729.3384, Training time: 2.42s, Iterations: 1000
```

Note that every configuration runs all 1000 iterations with identical MSE: with `early_stopping='auto'` (the default), early stopping is only enabled when the training set has more than 10,000 samples, and the 8,000-sample training set here falls below that threshold, so `tol` never comes into play. Pass `early_stopping=True` to see `tol` take effect.
The key steps in this example are:

- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train `HistGradientBoostingRegressor` models with different `tol` values
- Measure training time, number of iterations, and mean squared error for each model
- Compare the results to understand the impact of `tol` on model performance and training efficiency
Some tips and heuristics for setting `tol`:

- Start with the default value of 1e-7 and adjust based on your needs
- Lower `tol` values generally lead to better model performance but longer training times
- Higher `tol` values can speed up training but may result in suboptimal models
Issues to consider:

- The optimal `tol` value depends on the specific dataset and problem
- Very low `tol` values may lead to overfitting, especially on noisy datasets
- Very high `tol` values may cause the model to stop training prematurely, resulting in underfitting