The `init` parameter in scikit-learn's `GradientBoostingClassifier` allows you to specify an initial estimator to be used as the first estimator in the boosting ensemble.

By default, `init` is set to `None`, which means the initial predictions come from a `DummyEstimator` that predicts the class prior probabilities of the training data. You can set `init` to any estimator that implements the `fit` and `predict_proba` methods.

Using a more sophisticated initial estimator can sometimes improve the performance of the ensemble, especially if the initial estimator is a good fit for the problem.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different init values
init_values = [None,
               DummyClassifier(strategy='most_frequent'),
               DummyClassifier(strategy='stratified'),
               DecisionTreeClassifier(max_depth=1),
               DecisionTreeClassifier(max_depth=2),
               DecisionTreeClassifier(max_depth=3)]

accuracies = []
for init in init_values:
    gb = GradientBoostingClassifier(n_estimators=100, init=init, random_state=42)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"init={init}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
init=None, Accuracy: 0.785
init=DummyClassifier(strategy='most_frequent'), Accuracy: 0.575
init=DummyClassifier(strategy='stratified'), Accuracy: 0.645
init=DecisionTreeClassifier(max_depth=1), Accuracy: 0.800
init=DecisionTreeClassifier(max_depth=2), Accuracy: 0.775
init=DecisionTreeClassifier(max_depth=3), Accuracy: 0.740
```
The key steps in this example are:

- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `GradientBoostingClassifier` models with different `init` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `init`:

- Consider using an initial estimator if the default `DummyEstimator` is not providing good results
- A `DummyClassifier` with `strategy='stratified'` can be a good choice for imbalanced datasets
- A `DecisionTreeClassifier` with a low `max_depth` can capture simple patterns in the data
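To illustrate the imbalanced-data tip, here is a minimal sketch (the 90/10 class weighting and the use of balanced accuracy as the metric are assumptions for illustration, not part of the example above) comparing the default initial estimator with a stratified `DummyClassifier` as `init`:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Imbalanced binary dataset: roughly 90% of samples in one class (assumed setup)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for init in [None, DummyClassifier(strategy='stratified', random_state=42)]:
    gb = GradientBoostingClassifier(init=init, random_state=42)
    gb.fit(X_train, y_train)
    # Balanced accuracy weights each class equally, which matters here
    score = balanced_accuracy_score(y_test, gb.predict(X_test))
    print(f"init={init}, balanced accuracy: {score:.3f}")
```

Whether the stratified initializer actually helps depends on the dataset; the point is that it is a drop-in swap because `DummyClassifier` provides both `fit` and `predict_proba`.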
Issues to consider:

- Using a complex initial estimator can slow down training
- If the initial estimator is too strong, it may dominate the ensemble and negate the benefits of boosting, since the boosting stages then have little residual error left to correct
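One way to check whether a strong `init` is dominating is to compare test accuracy at the first and last boosting stages using `staged_predict`. This sketch reuses the dataset from the example above; the choice of a depth-8 tree as the "too strong" initializer is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for init in [None, DecisionTreeClassifier(max_depth=8)]:
    gb = GradientBoostingClassifier(n_estimators=100, init=init, random_state=42)
    gb.fit(X_train, y_train)
    # staged_predict yields test predictions after each boosting stage
    staged = [accuracy_score(y_test, y_pred) for y_pred in gb.staged_predict(X_test)]
    print(f"init={init}: stage 1 acc={staged[0]:.3f}, stage 100 acc={staged[-1]:.3f}")
```

If accuracy barely moves between the first and last stages, the initial estimator is doing most of the work and the boosting iterations are adding little.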