Configure HistGradientBoostingClassifier "validation_fraction" Parameter

The validation_fraction parameter in scikit-learn’s HistGradientBoostingClassifier controls the proportion of training data to set aside as a validation set for early stopping.

HistGradientBoostingClassifier is a gradient boosting algorithm that uses histogram-based decision trees. It’s designed for efficiency and can handle large datasets.

The parameter only takes effect when early stopping is active, that is, when early_stopping=True or when early_stopping is left at its default of 'auto' and the training set has more than 10,000 samples. A larger fraction gives a more reliable signal for deciding when to stop, but leaves less data for fitting the trees.

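The default value for validation_fraction is 0.1 (10% of the training data). The parameter also accepts an integer, interpreted as an absolute number of validation samples, or None, in which case early stopping is evaluated on the training data itself.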

Common values range from 0.1 to 0.3, depending on the size of the dataset and the desired trade-off between training data and validation set size.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different validation_fraction values
validation_fractions = [0.1, 0.2, 0.3]
accuracies = []

for fraction in validation_fractions:
    hgbc = HistGradientBoostingClassifier(validation_fraction=fraction, random_state=42, max_iter=1000)
    hgbc.fit(X_train, y_train)
    y_pred = hgbc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"validation_fraction={fraction}, Accuracy: {accuracy:.3f}, Iterations: {hgbc.n_iter_}")

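Running the example gives an output like the one below. All three runs are identical and use the full 1000 iterations because early_stopping defaults to 'auto', which enables early stopping only when there are more than 10,000 training samples. The training set here has 8,000 rows, so no validation split is made and validation_fraction has no effect. Pass early_stopping=True to make the parameter take effect on a dataset of this size.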

validation_fraction=0.1, Accuracy: 0.948, Iterations: 1000
validation_fraction=0.2, Accuracy: 0.948, Iterations: 1000
validation_fraction=0.3, Accuracy: 0.948, Iterations: 1000

The key steps in this example are:

Generate a synthetic binary classification dataset with informative, redundant, and noise features
Split the data into train and test sets
Train HistGradientBoostingClassifier models with different validation_fraction values
Evaluate the accuracy of each model on the test set and report the number of iterations

Some tips and heuristics for setting validation_fraction:

Use larger values (e.g., 0.2 or 0.3) for smaller datasets to ensure a reliable validation set
Use smaller values (e.g., 0.1 or less) for larger datasets to maximize training data
Consider the trade-off between early stopping reliability and amount of training data

Issues to consider:

A validation set that is too small may lead to unreliable early stopping decisions
A validation set that is too large may significantly reduce the amount of training data
The optimal value depends on the dataset size, complexity, and the model’s tendency to overfit
