
Configure GradientBoostingClassifier "min_samples_split" Parameter

The min_samples_split parameter in scikit-learn’s GradientBoostingClassifier controls the minimum number of samples required to split an internal node in the decision trees.

Gradient Boosting is an ensemble learning method that sequentially trains decision trees, with each tree learning to correct the errors of the previous ones. The min_samples_split parameter determines the minimum number of samples needed to split a node during the construction of each tree.

Smaller values of min_samples_split allow nodes containing only a few samples to be split, producing deeper, more complex trees that are prone to overfitting. Larger values restrict splitting to nodes with many samples, yielding simpler trees that may underfit but can also reduce computational cost.
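To make this trade-off concrete, here is a minimal sketch comparing train and test accuracy at a small and a large value of min_samples_split; the dataset parameters are illustrative, chosen so the effect is easier to see.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Small, noisy dataset where restricting splits matters more
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

for mss in [2, 200]:
    gb = GradientBoostingClassifier(min_samples_split=mss, random_state=42)
    gb.fit(X_train, y_train)
    # A large gap between train and test accuracy suggests overfitting
    print(f"min_samples_split={mss}: "
          f"train={gb.score(X_train, y_train):.3f}, "
          f"test={gb.score(X_test, y_test):.3f}")

On a dataset like this, the smaller value typically shows a noticeably larger train/test gap than the larger one.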

The default value for min_samples_split is 2.

In practice, values between 2 and 100 are commonly used depending on the size and complexity of the dataset.
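Note that min_samples_split also accepts a float in (0.0, 1.0], which scikit-learn interprets as ceil(fraction * n_samples), so the threshold scales automatically with the training set size:

from sklearn.ensemble import GradientBoostingClassifier

# A float min_samples_split is treated as a fraction of the training samples:
# with 10000 training samples, 0.01 means at least 100 samples to split a node
gb = GradientBoostingClassifier(min_samples_split=0.01, random_state=42)

The complete example below compares several integer values: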

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 10, 50, 100]
accuracies = []
train_times = []

for mss in min_samples_split_values:
    start_time = time.time()
    gb = GradientBoostingClassifier(min_samples_split=mss, random_state=42)
    gb.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    train_times.append(train_time)
    print(f"min_samples_split={mss}, Accuracy: {accuracy:.3f}, Train Time: {train_time:.3f}s")

Running the example gives an output like:

min_samples_split=2, Accuracy: 0.914, Train Time: 5.239s
min_samples_split=10, Accuracy: 0.914, Train Time: 5.386s
min_samples_split=50, Accuracy: 0.913, Train Time: 5.753s
min_samples_split=100, Accuracy: 0.913, Train Time: 5.575s

The key steps in this example are:

  1. Generate a synthetic binary classification dataset with informative and redundant features
  2. Split the data into train and test sets
  3. Train GradientBoostingClassifier models with different min_samples_split values
  4. Evaluate the accuracy and training time of each model on the test set
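To visualize the trade-off, a short matplotlib sketch (reusing the min_samples_split_values, accuracies, and train_times lists collected in the example above) might look like:

import matplotlib.pyplot as plt

# Plot test accuracy and training time against min_samples_split on twin axes
fig, ax1 = plt.subplots()
ax1.plot(min_samples_split_values, accuracies, "o-", color="tab:blue")
ax1.set_xlabel("min_samples_split")
ax1.set_ylabel("Test accuracy", color="tab:blue")

ax2 = ax1.twinx()
ax2.plot(min_samples_split_values, train_times, "s--", color="tab:red")
ax2.set_ylabel("Train time (s)", color="tab:red")

plt.title("Effect of min_samples_split on accuracy and training time")
plt.tight_layout()
plt.show()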

Some tips and heuristics for setting min_samples_split:

  - Start with the default of 2 and increase it only if the model shows signs of overfitting, such as a large gap between train and test accuracy
  - Larger and noisier datasets generally benefit from larger values
  - Tune min_samples_split jointly with related parameters such as min_samples_leaf and max_depth, since all of them constrain tree growth
  - Prefer cross-validation over a single train/test split when selecting the value (see the sketch below)

Issues to consider:

  - min_samples_split interacts with min_samples_leaf and max_depth, so changing one can mask or amplify the effect of another
  - Very large values on small datasets can prevent trees from splitting at all, degrading accuracy
  - As the output above shows, accuracy can be insensitive to this parameter on large, well-behaved datasets; its effect is typically stronger on small or noisy data
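A common way to select min_samples_split in practice is cross-validated grid search. Below is a minimal sketch using GridSearchCV; the dataset and grid values are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Candidate values spanning the commonly used range
param_grid = {"min_samples_split": [2, 5, 10, 50, 100]}

grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)

print(f"Best min_samples_split: {grid.best_params_['min_samples_split']}")
print(f"Best CV accuracy: {grid.best_score_:.3f}")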
