The tol parameter in scikit-learn's HistGradientBoostingClassifier controls the tolerance used by the early-stopping criterion.

HistGradientBoostingClassifier is a gradient boosting algorithm that uses histogram-based decision trees. It is designed for efficient training on large datasets and supports early stopping.

The tol parameter sets the absolute tolerance used when comparing scores during early stopping: an iteration only counts as an improvement if it beats the reference score by more than tol. If no improvement beyond tol is seen for n_iter_no_change consecutive iterations, training stops early. Note that tol only has an effect when early stopping is enabled.

The default value for tol is 1e-7. In practice, values between 1e-8 and 1e-3 are commonly used, depending on the desired trade-off between model performance and training time.
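Since tol only matters when early stopping is active, a minimal configuration might look like the sketch below (the parameter values are illustrative, not recommendations):

from sklearn.ensemble import HistGradientBoostingClassifier

# Early stopping must be enabled for tol to have any effect.
clf = HistGradientBoostingClassifier(
    early_stopping=True,      # 'auto' only enables it for large training sets
    tol=1e-7,                 # absolute tolerance when comparing scores
    n_iter_no_change=10,      # stop after 10 iterations without improvement > tol
    validation_fraction=0.1,  # held-out fraction scored for early stopping
)

The complete example below compares several tol values on a synthetic dataset: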
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different tol values
tol_values = [1e-8, 1e-7, 1e-5, 1e-3]
accuracies = []
for tol in tol_values:
    hgbc = HistGradientBoostingClassifier(tol=tol, random_state=42, max_iter=1000)
    hgbc.fit(X_train, y_train)
    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"tol={tol:.0e}, Accuracy: {accuracy:.4f}, Iterations: {hgbc.n_iter_}")
Running the example gives an output like:
tol=1e-08, Accuracy: 0.9480, Iterations: 1000
tol=1e-07, Accuracy: 0.9480, Iterations: 1000
tol=1e-05, Accuracy: 0.9480, Iterations: 1000
tol=1e-03, Accuracy: 0.9480, Iterations: 1000
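All four runs use the full 1000 iterations and reach identical accuracy because early stopping is never triggered: with the default early_stopping='auto', scikit-learn only enables early stopping when more than 10,000 samples are passed to fit, and the training split here contains only 8,000. As a result, tol has no effect in this run. To see tol actually influence training length, enable early stopping explicitly. The following sketch is one way to do that (the exact iteration counts depend on the data and the scikit-learn version):

# Force early stopping so that tol affects when training stops (a sketch).
for tol in tol_values:
    hgbc = HistGradientBoostingClassifier(
        tol=tol,
        early_stopping=True,      # enable regardless of dataset size
        n_iter_no_change=10,      # patience: iterations without improvement > tol
        validation_fraction=0.1,  # internal validation split used for scoring
        random_state=42,
        max_iter=1000,
    )
    hgbc.fit(X_train, y_train)
    print(f"tol={tol:.0e}, Iterations: {hgbc.n_iter_}")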
The key steps in this example are:
- Generate a synthetic classification dataset with informative and noise features
- Split the data into train and test sets
- Train HistGradientBoostingClassifier models with different tol values
- Evaluate the accuracy of each model on the test set
- Compare the number of iterations and performance for different tol values
Some tips and heuristics for setting tol:
- Start with the default value of 1e-7 and adjust based on model performance and training time
- Lower tol values generally lead to more accurate models but longer training times
- Higher tol values can result in faster training but may lead to underfitting; the timing sketch after this list illustrates the trade-off
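To quantify the accuracy/time trade-off on your own data, you can time each fit. A minimal sketch, assuming the X_train and y_train from the example above and enabling early stopping so that tol matters:

import time

for tol in [1e-8, 1e-7, 1e-5, 1e-3]:
    hgbc = HistGradientBoostingClassifier(
        tol=tol, early_stopping=True, random_state=42, max_iter=1000
    )
    start = time.perf_counter()
    hgbc.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    print(f"tol={tol:.0e}: {hgbc.n_iter_} iterations in {elapsed:.2f}s")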
Issues to consider:
- The optimal tol value depends on the dataset complexity and the desired trade-off between accuracy and training time; cross-validation can help choose it, as in the sketch after this list
- Very low tol values may lead to overfitting, especially on small datasets
- Very high tol values may cause the model to stop training prematurely, resulting in poor performance
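One straightforward way to choose tol empirically is a grid search with cross-validation. A sketch that reuses X_train and y_train from the example above (the parameter grid is illustrative):

from sklearn.model_selection import GridSearchCV

# Search over tol with early stopping enabled so the parameter has an effect.
grid = GridSearchCV(
    HistGradientBoostingClassifier(early_stopping=True, random_state=42),
    param_grid={"tol": [1e-8, 1e-7, 1e-5, 1e-3]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)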