The init parameter in scikit-learn’s GradientBoostingRegressor allows you to set an initial model for the boosting process.
Gradient boosting is a machine learning technique for regression and classification problems that builds models sequentially, each one correcting the errors of the previous models. The init parameter determines the initial model that the boosting process starts with.
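The default value for init is None, which means the initial model is a dummy estimator that predicts the mean of the training targets under the default squared-error loss (and a quantile for the other losses). The string "zero" starts boosting from initial predictions of zero instead. Any estimator that provides fit and predict can also be passed; common choices include DummyRegressor and LinearRegression.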
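One way to check what the default resolves to is to inspect the fitted init_ attribute. The sketch below assumes the default squared-error loss; the dataset size, n_estimators, and seeds are arbitrary choices made for illustration, not values from the example that follows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic dataset; sizes and seed are arbitrary illustration choices
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# With init=None, the fitted init_ attribute holds the estimator that was used
gbr_default = GradientBoostingRegressor(n_estimators=10, random_state=0).fit(X, y)
default_init_name = type(gbr_default.init_).__name__
default_start = float(np.ravel(gbr_default.init_.predict(X[:1]))[0])
print(f"default init_: {default_init_name}")
print(f"initial prediction: {default_start:.4f}, mean of y: {np.mean(y):.4f}")

# init="zero" starts boosting from raw predictions of 0 instead of the mean
gbr_zero = GradientBoostingRegressor(init="zero", n_estimators=10, random_state=0).fit(X, y)
zero_preds = gbr_zero.predict(X[:3])
print(zero_preds)
```

On a squared-error model the initial prediction should coincide with the training mean, which is why passing DummyRegressor(strategy="mean") explicitly is expected to behave like the default.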
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define initial models
init_models = [None, DummyRegressor(strategy="mean"), LinearRegression()]
init_model_names = ["None", "DummyRegressor", "LinearRegression"]
for init, name in zip(init_models, init_model_names):
    gbr = GradientBoostingRegressor(init=init, random_state=42)
    gbr.fit(X_train, y_train)
    y_pred = gbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f"init={name}, Mean Squared Error: {mse:.3f}")
Running the example gives an output like:
init=None, Mean Squared Error: 1234.753
init=DummyRegressor, Mean Squared Error: 1234.753
init=LinearRegression, Mean Squared Error: 0.010
The first two results match because, with the default squared-error loss, init=None already uses a mean-predicting dummy estimator. LinearRegression scores far better here because make_regression generates a target that is linear in the features, so the initial model captures almost all of the signal before boosting starts.
The key steps in this example are:
- Generate a synthetic regression dataset.
- Split the data into train and test sets.
- Train GradientBoostingRegressor models with different init values.
- Evaluate the mean squared error of each model on the test set.
Some tips and heuristics for setting init:
- Use the default None to start with a simple mean prediction.
- Try DummyRegressor or other simple models to understand their effects on performance.
- Use a more sophisticated model like LinearRegression if it improves initial predictions and overall performance.
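The tips above can be checked with cross-validation rather than a single train/test split. The following is a minimal sketch, assuming a synthetic dataset and illustrative settings (cv=3, n_estimators=50) that are not part of the original example; the candidates dictionary is a hypothetical helper.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a linear signal; sizes and seed are illustration choices
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

# Hypothetical set of init candidates to compare
candidates = {
    "None": None,
    "DummyRegressor": DummyRegressor(strategy="mean"),
    "LinearRegression": LinearRegression(),
}

cv_mse = {}
for name, init in candidates.items():
    gbr = GradientBoostingRegressor(init=init, n_estimators=50, random_state=42)
    # scikit-learn reports negated MSE so that higher is better; flip the sign back
    scores = cross_val_score(gbr, X, y, cv=3, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"init={name}: cross-validated MSE = {cv_mse[name]:.3f}")
```

On real data the gap between candidates is usually much smaller than on this linear synthetic target, so averaging over folds gives a steadier basis for the choice.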
Issues to consider:
- The choice of initial model can impact the training time and convergence of the boosting process.
- More complex initial models may provide better starting points but at the cost of increased complexity and computational expense.
- Experiment with different initial models to find the best balance for your specific dataset and problem.
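To look at the convergence point above, one option is staged_predict, which yields predictions after each boosting stage. The sketch below is illustrative only: it assumes the same kind of synthetic data as the main example and compares just two init choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data; sizes and seeds are illustration choices
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

staged_mse = {}
for name, init in [("None", None), ("LinearRegression", LinearRegression())]:
    gbr = GradientBoostingRegressor(init=init, n_estimators=100, random_state=42)
    gbr.fit(X_train, y_train)
    # One test-set MSE value per boosting stage
    staged_mse[name] = [
        mean_squared_error(y_test, stage_pred)
        for stage_pred in gbr.staged_predict(X_test)
    ]

for name, curve in staged_mse.items():
    print(f"init={name}: MSE after stage 1 = {curve[0]:.3f}, after stage {len(curve)} = {curve[-1]:.3f}")
```

Comparing the two curves shows how much error is left for the trees to remove from each starting point; a better initial model shifts the whole curve down but adds the cost of fitting that model first.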