The min_samples_split parameter in scikit-learn's DecisionTreeClassifier controls the minimum number of samples required to split an internal node during tree growth.
This parameter helps control overfitting by preventing the tree from splitting nodes that contain only a few samples. Higher values lead to smaller trees that are less prone to overfitting, but may underfit the data if set too high.
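To see this effect directly, the sketch below fits trees with increasing min_samples_split values and compares their size via get_n_leaves(). The dataset and the candidate values are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset; size and values here are arbitrary, for illustration only
X, y = make_classification(n_samples=500, random_state=0)

# Larger min_samples_split blocks more splits, so the tree ends up smaller
leaf_counts = {}
for value in [2, 20, 100]:
    tree = DecisionTreeClassifier(min_samples_split=value, random_state=0)
    tree.fit(X, y)
    leaf_counts[value] = tree.get_n_leaves()
    print(f"min_samples_split={value}: {leaf_counts[value]} leaves")
```

Because raising min_samples_split can only block splits (the greedy splitter is otherwise unchanged), the leaf count can only shrink as the value grows.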
The default value for min_samples_split is 2, meaning that a node will only be split if it contains at least 2 samples. In practice, values between 2 and 20 are commonly used, depending on the size and complexity of the dataset.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
accuracies = []
for min_split in min_samples_split_values:
    dt = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_split={min_split}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
min_samples_split=2, Accuracy: 0.740
min_samples_split=5, Accuracy: 0.730
min_samples_split=10, Accuracy: 0.735
min_samples_split=20, Accuracy: 0.735
```
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train DecisionTreeClassifier models with different min_samples_split values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting min_samples_split:
- Start with the default value of 2 and increase it to reduce overfitting
- Be cautious not to set the value too high, as it can lead to underfitting
- Consider the size of the dataset and the desired complexity of the tree
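One way to apply these heuristics systematically is to search over a grid of candidate values with cross-validation rather than picking one by hand. A minimal sketch using GridSearchCV, where the candidate grid and dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Same kind of synthetic dataset as the example above
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# 5-fold cross-validated search over candidate min_samples_split values
param_grid = {"min_samples_split": [2, 5, 10, 20, 50]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best min_samples_split:", search.best_params_["min_samples_split"])
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Cross-validation gives a more reliable estimate than a single train/test split, at the cost of fitting one model per value per fold.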
Issues to consider:
- The optimal value depends on the characteristics of the dataset
- High values can result in very simple trees that may not capture complex patterns
- Low values allow more complex trees that may overfit to noise in the training data
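The last point can be made concrete by comparing train and test accuracy: an unconstrained tree (min_samples_split=2) typically fits the training set perfectly while generalizing worse. A minimal sketch, reusing the dataset setup from the example above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same dataset setup as the example above
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

scores = {}
for min_split in [2, 50]:
    dt = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    # A large gap between train and test accuracy suggests overfitting
    scores[min_split] = (dt.score(X_train, y_train), dt.score(X_test, y_test))
    print(f"min_samples_split={min_split}: train={scores[min_split][0]:.3f}, "
          f"test={scores[min_split][1]:.3f}")
```

With min_samples_split=2 the train accuracy reaches 1.0 while test accuracy stays well below it; the higher value narrows that gap by keeping the tree from memorizing noise.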