The `tol` parameter in scikit-learn’s `HistGradientBoostingRegressor` controls the tolerance for the early stopping criterion.
`HistGradientBoostingRegressor` is a gradient boosting implementation that bins continuous features into histograms when building its decision trees. This makes tree construction fast and memory-efficient, so it scales well to large datasets.
The `tol` parameter sets the absolute tolerance used when comparing scores during early stopping: if the score fails to improve by at least `tol` over the last `n_iter_no_change` iterations, training stops early. The higher the tolerance, the sooner training is likely to stop.
The default value for `tol` is 1e-7. In practice, values between 1e-8 and 1e-3 are commonly used, depending on the desired trade-off between model performance and training time.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different tol values (early_stopping is left at its 'auto' default)
tol_values = [1e-8, 1e-7, 1e-5, 1e-3]
results = []

for tol in tol_values:
    start_time = time.time()
    model = HistGradientBoostingRegressor(tol=tol, random_state=42, max_iter=1000)
    model.fit(X_train, y_train)
    train_time = time.time() - start_time

    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    n_iter = model.n_iter_
    results.append((tol, mse, train_time, n_iter))

    print(f"tol={tol:.0e}, MSE: {mse:.4f}, Training time: {train_time:.2f}s, Iterations: {n_iter}")
```
Running the example gives an output like:

```
tol=1e-08, MSE: 729.3384, Training time: 2.75s, Iterations: 1000
tol=1e-07, MSE: 729.3384, Training time: 2.46s, Iterations: 1000
tol=1e-05, MSE: 729.3384, Training time: 2.40s, Iterations: 1000
tol=1e-03, MSE: 729.3384, Training time: 2.42s, Iterations: 1000
```

Note that every configuration runs all 1000 iterations with identical MSE: with `early_stopping='auto'` (the default), early stopping is only enabled when the training set has more than 10,000 samples, and the 8,000-sample training set here falls below that threshold, so `tol` never comes into play. Pass `early_stopping=True` to see `tol` take effect.
The key steps in this example are:

- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train `HistGradientBoostingRegressor` models with different `tol` values
- Measure training time, number of iterations, and mean squared error for each model
- Compare the results to understand the impact of `tol` on model performance and training efficiency
Some tips and heuristics for setting `tol`:

- Start with the default value of 1e-7 and adjust based on your needs
- Lower `tol` values generally lead to better model performance but longer training times
- Higher `tol` values can speed up training but may result in suboptimal models
Issues to consider:

- The optimal `tol` value depends on the specific dataset and problem
- Very low `tol` values may lead to overfitting, especially on noisy datasets
- Very high `tol` values may cause the model to stop training prematurely, resulting in underfitting