Configure GradientBoostingRegressor "init" Parameter

The init parameter in scikit-learn’s GradientBoostingRegressor allows you to set an initial model for the boosting process.

Gradient boosting is a machine learning technique for regression and classification that builds models sequentially, with each new model fit to correct the errors of the ensemble built so far. The init parameter determines the model whose predictions the boosting process starts from.
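To make the role of the starting model concrete, here is a minimal sketch that rebuilds a fitted model's predictions by hand: the initial model's prediction plus the learning-rate-scaled output of each tree. It assumes the default squared-error loss and relies on the fitted attributes init_ and estimators_; the dataset sizes and the variable names manual and matches are illustrative choices, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small illustrative dataset (sizes are arbitrary)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=20, learning_rate=0.1, random_state=0)
gbr.fit(X, y)

# Stage 0: the prediction of the fitted initial model
manual = gbr.init_.predict(X).astype(float)

# Each later stage adds learning_rate * (tree prediction) to the running total
for stage in gbr.estimators_:
    tree = stage[0]  # one regression tree per stage
    manual += gbr.learning_rate * tree.predict(X)

# Under squared-error loss, the hand-built sum should agree with predict()
matches = np.allclose(manual, gbr.predict(X))
print(matches)
```

Whatever init is set to, it only changes the first term of this sum; the trees then fit whatever error that starting point leaves behind.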

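The default value for init is None, in which case boosting starts from a constant prediction: the mean of the training targets for the default squared-error loss, or a quantile for the other losses. init also accepts the string "zero", which starts from a prediction of 0, or any estimator that provides fit and predict, such as DummyRegressor or LinearRegression.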

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define initial models
init_models = [None, DummyRegressor(strategy="mean"), LinearRegression()]
init_model_names = ["None", "DummyRegressor", "LinearRegression"]

for init, name in zip(init_models, init_model_names):
    gbr = GradientBoostingRegressor(init=init, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"init={name}, Mean Squared Error: {mse:.3f}")

Running the example gives an output like:

init=None, Mean Squared Error: 1234.753
init=DummyRegressor, Mean Squared Error: 1234.753
init=LinearRegression, Mean Squared Error: 0.010

The first two results should match because init=None already starts from a mean prediction under the default squared-error loss, which is what DummyRegressor(strategy="mean") provides. LinearRegression does far better here because make_regression generates a target that is linear in the features, so the linear model captures almost all of the signal and the trees only have noise left to fit.

The key steps in this example are:

Generate a synthetic regression dataset.
Split the data into train and test sets.
Train GradientBoostingRegressor models with different init values.
Evaluate the mean squared error of each model on the test set.

Some tips and heuristics for setting init:

Use the default None to start with a simple mean prediction.
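Try DummyRegressor with a different strategy, such as "median", or other simple models to see how the starting point affects performance; DummyRegressor(strategy="mean") reproduces the default for squared-error loss.
Use a more sophisticated model like LinearRegression if it improves initial predictions and overall performance, for example when the target has a strong linear component.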
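
As a rough check on what the simplest starting points look like, the sketch below inspects the fitted init_ attribute for the default setting and compares it with init="zero". It assumes a recent scikit-learn release, where the default initial model for squared-error loss is a DummyRegressor; the dataset and the variable names start and first_tree are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Default init=None with squared-error loss: a constant mean prediction
default_gbr = GradientBoostingRegressor(n_estimators=10, random_state=0).fit(X, y)
print(type(default_gbr.init_).__name__)
start = default_gbr.init_.predict(X[:3])
print(start, y.mean())  # every starting prediction equals the training mean

# init="zero": boosting starts from 0, so a single-stage model's prediction
# is just learning_rate times the first tree's output
zero_gbr = GradientBoostingRegressor(init="zero", n_estimators=1, random_state=0).fit(X, y)
first_tree = zero_gbr.estimators_[0, 0]
zero_start_ok = np.allclose(
    zero_gbr.predict(X), zero_gbr.learning_rate * first_tree.predict(X)
)
print(zero_start_ok)
```

Starting from zero is rarely useful for targets with a large offset, since the first trees must spend their capacity recovering the mean.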

Issues to consider:

The choice of initial model can impact the training time and convergence of the boosting process.
More complex initial models may provide better starting points but at the cost of increased complexity and computational expense.
Experiment with different initial models to find the best balance for your specific dataset and problem.
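
One way to examine the effect on convergence is to track test error stage by stage with staged_predict. The sketch below reuses the dataset settings from the example above; the helper staged_test_mse and the choice of 50 stages are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def staged_test_mse(init):
    """Hypothetical helper: test MSE after each boosting stage for a given init."""
    gbr = GradientBoostingRegressor(init=init, n_estimators=50, random_state=42)
    gbr.fit(X_train, y_train)
    return [mean_squared_error(y_test, pred) for pred in gbr.staged_predict(X_test)]

mse_default = staged_test_mse(None)
mse_linear = staged_test_mse(LinearRegression())

# Compare the two error curves at a few stages
for stage in (0, 9, 49):
    print(
        f"stage {stage + 1}: default={mse_default[stage]:.3f}, "
        f"linear init={mse_linear[stage]:.3f}"
    )
```

On this linear synthetic dataset the linear starting point should already be close to the noise floor at the first stage, while the default has to approximate the linear trend with many trees. On real data the gap may be much smaller or absent.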

See Also