The validation_fraction parameter in scikit-learn’s HistGradientBoostingRegressor controls the proportion of training data to set aside as a validation set for early stopping.
HistGradientBoostingRegressor is a gradient boosting algorithm that uses histogram-based decision trees. It’s designed for efficiency and can handle large datasets with high-dimensional features.
The validation_fraction parameter determines the size of the validation set used for early stopping. Early stopping helps prevent overfitting by monitoring the model’s performance on a held-out validation set during training.
The default value for validation_fraction is 0.1 (10% of the training data), and the parameter only takes effect when early stopping is actually performed. In practice, values between 0.1 and 0.3 are commonly used, depending on the size of the dataset and the problem complexity.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different validation_fraction values
validation_fractions = [0.1, 0.2, 0.3]
mse_scores = []

for vf in validation_fractions:
    hgbr = HistGradientBoostingRegressor(validation_fraction=vf, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"validation_fraction={vf}, MSE: {mse:.3f}")
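Running the example gives an output like:
validation_fraction=0.1, MSE: 924.053
validation_fraction=0.2, MSE: 924.053
validation_fraction=0.3, MSE: 924.053
The three scores are identical because early stopping never runs in this example. The early_stopping parameter defaults to 'auto', which enables early stopping only when the training set has more than 10,000 samples, and the model here is fit on 8,000. No validation set is carved out, so validation_fraction is ignored. Passing early_stopping=True makes the parameter take effect.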
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different validation_fraction values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting validation_fraction:
- Start with the default value of 0.1 and adjust based on dataset size and model performance
- Larger datasets can often use smaller validation fractions
- Smaller datasets may benefit from larger validation fractions for more reliable early stopping
Issues to consider:
- Using a larger validation fraction reduces the amount of data available for training
- Very small validation fractions may not provide reliable estimates for early stopping
- The optimal validation fraction can depend on the noise level in the data and the complexity of the regression task