The `init` parameter in scikit-learn's `GradientBoostingClassifier` allows you to specify an initial estimator to be used as the first estimator in the boosting ensemble.

By default, `init` is set to `None`, which means the initial predictions come from a `DummyEstimator` that predicts the class prior probabilities of the training data. You can set `init` to any estimator that implements the `fit` and `predict_proba` methods.

Using a more sophisticated initial estimator can sometimes improve the performance of the ensemble, especially if the initial estimator is a good fit for the problem.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different init values
init_values = [None,
               DummyClassifier(strategy='most_frequent'),
               DummyClassifier(strategy='stratified'),
               DecisionTreeClassifier(max_depth=1),
               DecisionTreeClassifier(max_depth=2),
               DecisionTreeClassifier(max_depth=3)]

accuracies = []
for init in init_values:
    gb = GradientBoostingClassifier(n_estimators=100, init=init, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"init={init}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
init=None, Accuracy: 0.785
init=DummyClassifier(strategy='most_frequent'), Accuracy: 0.575
init=DummyClassifier(strategy='stratified'), Accuracy: 0.645
init=DecisionTreeClassifier(max_depth=1), Accuracy: 0.800
init=DecisionTreeClassifier(max_depth=2), Accuracy: 0.775
init=DecisionTreeClassifier(max_depth=3), Accuracy: 0.740
```
The key steps in this example are:

- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `init` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `init`:

- Consider using an initial estimator if the default `DummyEstimator` is not providing good results
- A `DummyClassifier` with `strategy='stratified'` can be a good choice for imbalanced datasets
- A `DecisionTreeClassifier` with a low `max_depth` can capture simple patterns in the data
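To illustrate the imbalanced-data tip, here is a minimal sketch (the 90/10 class weighting and the use of balanced accuracy as the metric are assumptions for illustration, not part of the example above) comparing the default initial estimator with a stratified `DummyClassifier` as `init`:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Imbalanced binary dataset: roughly 90% of samples in one class (assumed setup)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for init in [None, DummyClassifier(strategy='stratified', random_state=42)]:
    gb = GradientBoostingClassifier(init=init, random_state=42)
    gb.fit(X_train, y_train)
    # Balanced accuracy weights each class equally, which matters here
    score = balanced_accuracy_score(y_test, gb.predict(X_test))
    print(f"init={init}, balanced accuracy: {score:.3f}")
```

Whether the stratified initializer actually helps depends on the dataset; the point is that it is a drop-in swap because `DummyClassifier` provides both `fit` and `predict_proba`.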
Issues to consider:

- Using a complex initial estimator can slow down training
- If the initial estimator is too strong, it may dominate the ensemble and negate the benefits of boosting, since the boosting stages then have little residual error left to correct
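One way to check whether a strong `init` is dominating is to compare test accuracy at the first and last boosting stages using `staged_predict`. This sketch reuses the dataset from the example above; the choice of a depth-8 tree as the "too strong" initializer is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for init in [None, DecisionTreeClassifier(max_depth=8)]:
    gb = GradientBoostingClassifier(n_estimators=100, init=init, random_state=42)
    gb.fit(X_train, y_train)
    # staged_predict yields test predictions after each boosting stage
    staged = [accuracy_score(y_test, y_pred) for y_pred in gb.staged_predict(X_test)]
    print(f"init={init}: stage 1 acc={staged[0]:.3f}, stage 100 acc={staged[-1]:.3f}")
```

If accuracy barely moves between the first and last stages, the initial estimator is doing most of the work and the boosting iterations are adding little.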