The min_samples_split parameter in scikit-learn’s GradientBoostingClassifier controls the minimum number of samples required to split an internal node in the decision trees.
Gradient Boosting is an ensemble learning method that sequentially trains decision trees, with each tree learning to correct the errors of the previous ones. The min_samples_split parameter applies during the construction of each of these trees, setting the minimum number of samples a node must contain before it can be split.
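This sequential error correction can be observed with staged_predict, which yields the ensemble's predictions as each successive tree is added. A minimal sketch (the synthetic dataset and stage checkpoints here are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=42)
gb = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X, y)
# Training accuracy improves as later trees correct earlier errors
for i, y_pred in enumerate(gb.staged_predict(X), start=1):
    if i in (1, 10, 50):
        print(f"{i} trees: train accuracy {accuracy_score(y, y_pred):.3f}")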
Smaller values of min_samples_split allow the trees to make more splits and grow more complex, potentially leading to overfitting. Larger values restrict the trees to simpler splits, which can lead to underfitting but also reduces computational cost.
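One way to see this trade-off directly is to compare tree size: each fitted tree in the ensemble's estimators_ array exposes its node count. A rough sketch (max_depth is raised above its default of 3 purely so the difference is easy to see):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=42)
for mss in (2, 100):
    gb = GradientBoostingClassifier(max_depth=5, min_samples_split=mss,
                                    random_state=42).fit(X, y)
    # estimators_ holds one DecisionTreeRegressor per boosting stage
    nodes = sum(t[0].tree_.node_count for t in gb.estimators_)
    print(f"min_samples_split={mss}: {nodes} nodes across all trees")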
The default value for min_samples_split is 2. In practice, values between 2 and 100 are commonly used depending on the size and complexity of the dataset.
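Note that min_samples_split also accepts a float: a value in (0.0, 1.0] is interpreted as a fraction of the training samples, i.e. ceil(min_samples_split * n_samples). A brief illustration:

from sklearn.ensemble import GradientBoostingClassifier

gb_abs = GradientBoostingClassifier(min_samples_split=50)    # at least 50 samples to split
gb_frac = GradientBoostingClassifier(min_samples_split=0.01) # at least 1% of the training set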
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_split values
min_samples_split_values = [2, 10, 50, 100]
accuracies = []
train_times = []
for mss in min_samples_split_values:
    start_time = time.time()
    gb = GradientBoostingClassifier(min_samples_split=mss, random_state=42)
    gb.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    train_times.append(train_time)
    print(f"min_samples_split={mss}, Accuracy: {accuracy:.3f}, Train Time: {train_time:.3f}s")
Running the example gives an output like:
min_samples_split=2, Accuracy: 0.914, Train Time: 5.239s
min_samples_split=10, Accuracy: 0.914, Train Time: 5.386s
min_samples_split=50, Accuracy: 0.913, Train Time: 5.753s
min_samples_split=100, Accuracy: 0.913, Train Time: 5.575s
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and redundant features
- Split the data into train and test sets
- Train GradientBoostingClassifier models with different min_samples_split values
- Evaluate the accuracy and training time of each model on the test set
Some tips and heuristics for setting min_samples_split:
- Start with the default value of 2 and increase it until the performance plateaus or degrades (a grid-search sketch follows this list)
- Smaller values lead to more complex trees, which can overfit but may be necessary for complex datasets
- Larger values result in simpler trees, which can underfit but are less computationally expensive
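Following the first tip, a small cross-validated grid search is a straightforward way to find where performance plateaus. A sketch using GridSearchCV (the candidate values and synthetic dataset are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
# 5-fold cross-validation over a coarse grid of min_samples_split values
grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid={"min_samples_split": [2, 10, 50, 100]},
                    cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, f"CV accuracy: {grid.best_score_:.3f}")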
Issues to consider:
- The computational cost increases as min_samples_split decreases due to the creation of more complex trees
- Very small values can cause overfitting, especially on smaller datasets
- Very large values can lead to underfitting by oversimplifying the model
- The optimal value depends on the characteristics of the dataset and the problem at hand; the validation-curve sketch below is one way to explore this empirically
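One way to explore that dependence empirically is a validation curve over min_samples_split; a sketch (again with an arbitrary synthetic dataset and value grid):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
param_range = [2, 10, 50, 100, 200]
# Cross-validated score for each candidate value of min_samples_split
train_scores, test_scores = validation_curve(
    GradientBoostingClassifier(random_state=42), X, y,
    param_name="min_samples_split", param_range=param_range, cv=5)
for mss, scores in zip(param_range, test_scores):
    print(f"min_samples_split={mss}: CV accuracy {scores.mean():.3f}")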