Configure HistGradientBoostingRegressor "validation_fraction" Parameter

The validation_fraction parameter in scikit-learn’s HistGradientBoostingRegressor controls the proportion of training data to set aside as a validation set for early stopping.

HistGradientBoostingRegressor is a gradient boosting algorithm that uses histogram-based decision trees. It’s designed for efficiency and can handle large datasets with high-dimensional features.

The validation_fraction parameter determines the size of the validation set used for early stopping, so it only has an effect when early stopping is active. Early stopping helps prevent overfitting by monitoring the model’s performance on a held-out validation set during training and halting once that score stops improving.

The default value for validation_fraction is 0.1 (10% of the training data). In practice, values between 0.1 and 0.3 are commonly used, depending on the size of the dataset and the problem complexity.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different validation_fraction values
validation_fractions = [0.1, 0.2, 0.3]
mse_scores = []

for vf in validation_fractions:
    hgbr = HistGradientBoostingRegressor(validation_fraction=vf, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"validation_fraction={vf}, MSE: {mse:.3f}")

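Running the example gives an output like the following. The three MSE values are identical because the default early_stopping='auto' only enables early stopping when the training set has more than 10,000 samples. The training set here has 8,000 samples, so no validation set is held out and validation_fraction is ignored: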

validation_fraction=0.1, MSE: 924.053
validation_fraction=0.2, MSE: 924.053
validation_fraction=0.3, MSE: 924.053

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train HistGradientBoostingRegressor models with different validation_fraction values
Evaluate the mean squared error of each model on the test set

Some tips and heuristics for setting validation_fraction:

Start with the default value of 0.1 and adjust based on dataset size and model performance
Larger datasets can often use smaller validation fractions
Smaller datasets may benefit from larger validation fractions for more reliable early stopping

Issues to consider:

Using a larger validation fraction reduces the amount of data available for training
Very small validation fractions may not provide reliable estimates for early stopping
The optimal validation fraction can depend on the noise level in the data and the complexity of the regression task
