The `validation_fraction` parameter in scikit-learn's `GradientBoostingRegressor` controls the fraction of training data set aside for validation during training.
Gradient Boosting is a machine learning technique for regression problems, which builds an ensemble of trees sequentially to minimize the loss function. It is known for its accuracy and ability to handle various types of data.
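The sequential idea can be sketched by hand for squared-error loss, where each new tree is fit to the residuals of the current ensemble (a simplified illustration, not scikit-learn's internal implementation):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

learning_rate = 0.1
pred = np.full(len(y), y.mean())      # start from the mean prediction
trees = []
for _ in range(50):
    residuals = y - pred              # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

# Training error shrinks as trees are added to the ensemble
print(np.mean((y - pred) ** 2))
```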
The `validation_fraction` parameter specifies the fraction of training data to set aside as validation data for early stopping. It is only used when `n_iter_no_change` is set to an integer; the held-out set is then used to monitor the validation error during training and stop early to prevent overfitting.
The default value for `validation_fraction` is 0.1 (10% of the training data).
In practice, values between 0.1 and 0.3 are commonly used depending on the dataset size and complexity.
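A minimal sketch of pairing `validation_fraction` with `n_iter_no_change`, which enables early stopping (parameter values here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# Hold out 20% of the training data for validation and stop once the
# validation score fails to improve for 10 consecutive iterations.
gbr = GradientBoostingRegressor(
    n_estimators=500,
    validation_fraction=0.2,
    n_iter_no_change=10,
    random_state=0,
)
gbr.fit(X, y)

# n_estimators_ reports how many trees were actually fit; it falls
# below n_estimators whenever early stopping triggers
print(gbr.n_estimators_)
```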
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different validation_fraction values
validation_fraction_values = [0.1, 0.2, 0.3]
mse_scores = []
for vf in validation_fraction_values:
    # n_iter_no_change enables early stopping; without it, validation_fraction is ignored
    gbr = GradientBoostingRegressor(validation_fraction=vf, n_estimators=100,
                                    n_iter_no_change=10, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"validation_fraction={vf}, MSE: {mse:.3f}")
Running the example prints the test-set MSE for each `validation_fraction` setting; the exact values depend on the dataset and on where early stopping halts each model.
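As a side note, `validation_fraction` is only consulted when `n_iter_no_change` is set; with early stopping disabled (the default), every setting produces an identical model. A quick check of that behavior:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=42)

# Early stopping is off (n_iter_no_change is None by default), so
# validation_fraction is ignored and both models are trained identically.
m1 = GradientBoostingRegressor(validation_fraction=0.1, random_state=42).fit(X, y)
m2 = GradientBoostingRegressor(validation_fraction=0.3, random_state=42).fit(X, y)

print(np.allclose(m1.predict(X), m2.predict(X)))  # True
```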
The key steps in this example are:
- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train `GradientBoostingRegressor` models with different `validation_fraction` values
- Evaluate the mean squared error (MSE) of each model on the test set
Some tips and heuristics for setting `validation_fraction`:
- Use the default value of 0.1 initially and adjust based on validation error
- Larger values of `validation_fraction` may be beneficial for larger datasets to ensure robust early stopping
- Monitor the validation error to decide when to stop training to prevent overfitting
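To monitor the error curve directly rather than relying only on the internal validation split, `staged_predict` scores the ensemble after every boosting iteration; a sketch on a held-out set:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=200, random_state=42)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after each boosting stage, so we can
# trace the validation error curve and find the best iteration count.
val_errors = [mean_squared_error(y_val, y_pred)
              for y_pred in gbr.staged_predict(X_val)]
best_iter = int(np.argmin(val_errors)) + 1
print(f"Best iteration: {best_iter} of {gbr.n_estimators}")
```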
Issues to consider:
- The optimal `validation_fraction` value depends on the dataset size and variability
- Too small a fraction might not provide a reliable validation set, while too large a fraction reduces the amount of training data
- Consider the computational cost: a larger validation fraction means more held-out samples must be scored at every boosting iteration
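As a rough sanity check on that trade-off, the held-out validation set is approximately `validation_fraction` times the training set size (the sample counts below are for illustration only):

```python
# Rough sanity check: how many samples does a given validation_fraction
# actually reserve?  With small datasets, small fractions leave very few
# validation points, making the early-stopping signal noisy.
n_train = 500  # hypothetical training set size
for vf in [0.05, 0.1, 0.2, 0.3]:
    n_val = int(n_train * vf)  # approximate; sklearn's internal split may round up
    print(f"validation_fraction={vf}: {n_val} validation samples, "
          f"{n_train - n_val} left for fitting the trees")
```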