The min_samples_split parameter in scikit-learn's DecisionTreeClassifier controls the minimum number of samples required to split an internal node during tree growth.
This parameter helps control overfitting by preventing the tree from splitting nodes that contain only a few samples. Higher values lead to smaller trees that are less prone to overfitting, but may underfit the data if set too high.
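To see this effect directly, the sketch below fits trees with increasing min_samples_split values and compares their size via get_n_leaves(). The dataset and the candidate values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset; size and values here are arbitrary, for illustration only
X, y = make_classification(n_samples=500, random_state=0)

# Larger min_samples_split blocks more splits, so the tree ends up smaller
leaf_counts = {}
for value in [2, 20, 100]:
    tree = DecisionTreeClassifier(min_samples_split=value, random_state=0)
    tree.fit(X, y)
    leaf_counts[value] = tree.get_n_leaves()
    print(f"min_samples_split={value}: {leaf_counts[value]} leaves")
```

Because raising min_samples_split can only block splits (the greedy splitter is otherwise unchanged), the leaf count can only shrink as the value grows.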
The default value for min_samples_split is 2, meaning that a node will only be split if it contains at least 2 samples. In practice, values between 2 and 20 are commonly used, depending on the size and complexity of the dataset.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
accuracies = []
for min_split in min_samples_split_values:
    dt = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_split={min_split}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
min_samples_split=2, Accuracy: 0.740
min_samples_split=5, Accuracy: 0.730
min_samples_split=10, Accuracy: 0.735
min_samples_split=20, Accuracy: 0.735
```
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train DecisionTreeClassifier models with different min_samples_split values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting min_samples_split:
- Start with the default value of 2 and increase it to reduce overfitting
- Be cautious not to set the value too high, as it can lead to underfitting
- Consider the size of the dataset and the desired complexity of the tree
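One way to apply these heuristics systematically is to search over a grid of candidate values with cross-validation rather than picking one by hand. A minimal sketch using GridSearchCV, where the candidate grid and dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Same kind of synthetic dataset as the example above
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# 5-fold cross-validated search over candidate min_samples_split values
param_grid = {"min_samples_split": [2, 5, 10, 20, 50]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best min_samples_split:", search.best_params_["min_samples_split"])
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Cross-validation gives a more reliable estimate than a single train/test split, at the cost of fitting one model per value per fold.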
Issues to consider:
- The optimal value depends on the characteristics of the dataset
- High values can result in very simple trees that may not capture complex patterns
- Low values allow more complex trees that may overfit to noise in the training data
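The last point can be made concrete by comparing train and test accuracy: an unconstrained tree (min_samples_split=2) typically fits the training set perfectly while generalizing worse. A minimal sketch, reusing the dataset setup from the example above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same dataset setup as the example above
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

scores = {}
for min_split in [2, 50]:
    dt = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    # A large gap between train and test accuracy suggests overfitting
    scores[min_split] = (dt.score(X_train, y_train), dt.score(X_test, y_test))
    print(f"min_samples_split={min_split}: train={scores[min_split][0]:.3f}, "
          f"test={scores[min_split][1]:.3f}")
```

With min_samples_split=2 the train accuracy reaches 1.0 while test accuracy stays well below it; the higher value narrows that gap by keeping the tree from memorizing noise.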