The min_samples_split parameter in scikit-learn’s GradientBoostingClassifier controls the minimum number of samples required to split an internal node in the decision trees.
Gradient Boosting is an ensemble learning method that sequentially trains decision trees, with each tree learning to correct the errors of the previous ones. The min_samples_split parameter applies during the construction of each of these trees, setting the minimum number of samples a node must contain before it can be split.
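This sequential error correction can be observed with staged_predict, which yields the ensemble's predictions as each successive tree is added. A minimal sketch (the synthetic dataset and stage checkpoints here are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=42)
gb = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X, y)
# Training accuracy improves as later trees correct earlier errors
for i, y_pred in enumerate(gb.staged_predict(X), start=1):
    if i in (1, 10, 50):
        print(f"{i} trees: train accuracy {accuracy_score(y, y_pred):.3f}")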
Smaller values of min_samples_split allow the trees to make more splits and grow more complex, potentially leading to overfitting. Larger values restrict the trees to simpler splits, which can lead to underfitting but also reduces computational cost.
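One way to see this trade-off directly is to compare tree size: each fitted tree in the ensemble's estimators_ array exposes its node count. A rough sketch (max_depth is raised above its default of 3 purely so the difference is easy to see):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=42)
for mss in (2, 100):
    gb = GradientBoostingClassifier(max_depth=5, min_samples_split=mss,
                                    random_state=42).fit(X, y)
    # estimators_ holds one DecisionTreeRegressor per boosting stage
    nodes = sum(t[0].tree_.node_count for t in gb.estimators_)
    print(f"min_samples_split={mss}: {nodes} nodes across all trees")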
The default value for min_samples_split is 2. In practice, values between 2 and 100 are commonly used depending on the size and complexity of the dataset.
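Note that min_samples_split also accepts a float: a value in (0.0, 1.0] is interpreted as a fraction of the training samples, i.e. ceil(min_samples_split * n_samples). A brief illustration:

from sklearn.ensemble import GradientBoostingClassifier

gb_abs = GradientBoostingClassifier(min_samples_split=50)    # at least 50 samples to split
gb_frac = GradientBoostingClassifier(min_samples_split=0.01) # at least 1% of the training set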
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_split values
min_samples_split_values = [2, 10, 50, 100]
accuracies = []
train_times = []
for mss in min_samples_split_values:
    start_time = time.time()
    gb = GradientBoostingClassifier(min_samples_split=mss, random_state=42)
    gb.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    train_times.append(train_time)
    print(f"min_samples_split={mss}, Accuracy: {accuracy:.3f}, Train Time: {train_time:.3f}s")
Running the example gives an output like:
min_samples_split=2, Accuracy: 0.914, Train Time: 5.239s
min_samples_split=10, Accuracy: 0.914, Train Time: 5.386s
min_samples_split=50, Accuracy: 0.913, Train Time: 5.753s
min_samples_split=100, Accuracy: 0.913, Train Time: 5.575s
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and redundant features
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different min_samples_split values
- Evaluate the accuracy and training time of each model on the test set
Some tips and heuristics for setting min_samples_split:
- Start with the default value of 2 and increase it until the performance plateaus or degrades (a grid-search sketch follows this list)
- Smaller values lead to more complex trees, which can overfit but may be necessary for complex datasets
- Larger values result in simpler trees, which can underfit but are less computationally expensive
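Following the first tip, a small cross-validated grid search is a straightforward way to find where performance plateaus. A sketch using GridSearchCV (the candidate values and synthetic dataset are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
# 5-fold cross-validation over a coarse grid of min_samples_split values
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid={"min_samples_split": [2, 10, 50, 100]},
                    cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, f"CV accuracy: {grid.best_score_:.3f}")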
Issues to consider:
- The computational cost increases as min_samples_split decreases due to the creation of more complex trees
- Very small values can cause overfitting, especially on smaller datasets
- Very large values can lead to underfitting by oversimplifying the model
- The optimal value depends on the characteristics of the dataset and the problem at hand; the validation-curve sketch below is one way to explore this empirically
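One way to explore that dependence empirically is a validation curve over min_samples_split; a sketch (again with an arbitrary synthetic dataset and value grid):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
param_range = [2, 10, 50, 100, 200]
# Cross-validated score for each candidate value of min_samples_split
train_scores, test_scores = validation_curve(
    GradientBoostingClassifier(random_state=42), X, y,
    param_name="min_samples_split", param_range=param_range, cv=5)
for mss, scores in zip(param_range, test_scores):
    print(f"min_samples_split={mss}: CV accuracy {scores.mean():.3f}")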