
Configure DecisionTreeClassifier "min_samples_split" Parameter

The min_samples_split parameter in scikit-learn’s DecisionTreeClassifier controls the minimum number of samples required to split an internal node during the tree growth process.

This parameter can be used to control overfitting by preventing the tree from splitting nodes that contain only a few samples. Higher values will lead to smaller trees that are less prone to overfitting, but may underfit the data if set too high.
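The effect on tree size can be checked directly on a fitted model. The following is a minimal sketch, not part of the original example: the dataset settings and the values 2, 20, and 200 are arbitrary choices for illustration. It prints the node count, depth, and number of leaves for each setting.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the sizes and seed are arbitrary choices for illustration
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

sizes = {}
for min_split in [2, 20, 200]:
    dt = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    dt.fit(X, y)
    # tree_.node_count counts internal nodes and leaves together
    sizes[min_split] = dt.tree_.node_count
    print(f"min_samples_split={min_split}: nodes={dt.tree_.node_count}, "
          f"depth={dt.get_depth()}, leaves={dt.get_n_leaves()}")
```

On data like this, the tree fitted with the largest value should have far fewer nodes than the one fitted with the default.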

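The default value for min_samples_split is 2, meaning that any node containing at least 2 samples is eligible to be split. With the other defaults left unchanged, this typically lets the tree grow until its leaves are pure. The parameter also accepts a float in the range (0.0, 1.0], which is interpreted as a fraction of the training samples. In practice, integer values between 2 and 20 are commonly used, depending on the size and complexity of the dataset.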
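Besides an integer count, scikit-learn also accepts a float for min_samples_split, which it treats as a fraction of the training samples and converts to a count with ceil(fraction * n_samples). The sketch below illustrates this under stated assumptions: the fraction 0.05 and the dataset settings are hypothetical choices, not values from the example in this article. It fits one tree with the fraction and one with the equivalent integer, then inspects the fitted tree to confirm that no node smaller than the threshold was split.

```python
import math

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

fraction = 0.05                            # hypothetical choice: 5% of the samples
threshold = math.ceil(fraction * len(X))   # the equivalent absolute count

dt_frac = DecisionTreeClassifier(min_samples_split=fraction, random_state=42).fit(X, y)
dt_int = DecisionTreeClassifier(min_samples_split=threshold, random_state=42).fit(X, y)

# Internal nodes have a left child; leaves are marked with children_left == -1
tree = dt_frac.tree_
is_internal = tree.children_left != -1
smallest_split_node = tree.n_node_samples[is_internal].min()

print(f"threshold={threshold}")
print(f"nodes (float form)={dt_frac.tree_.node_count}, "
      f"nodes (int form)={dt_int.tree_.node_count}")
print(f"smallest node that was split: {smallest_split_node} samples")
```

The fractional form can be convenient when the same setting is reused across datasets of different sizes, since the threshold scales with the number of training samples.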

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
accuracies = []

for min_split in min_samples_split_values:
    dt = DecisionTreeClassifier(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_split={min_split}, Accuracy: {accuracy:.3f}")

Running the example gives output like the following (exact numbers may differ between scikit-learn versions):

min_samples_split=2, Accuracy: 0.740
min_samples_split=5, Accuracy: 0.730
min_samples_split=10, Accuracy: 0.735
min_samples_split=20, Accuracy: 0.735

The key steps in this example are:

  1. Generate a synthetic multiclass classification dataset
  2. Split the data into train and test sets
  3. Train DecisionTreeClassifier models with different min_samples_split values
  4. Evaluate the accuracy of each model on the test set
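The steps above score each value on a single train/test split, where differences of one percentage point or less can easily come from the particular split. One common alternative is to compare the same candidate values with cross-validation. The following is a sketch using GridSearchCV; the 5-fold setting and the use of the full dataset are assumptions made for illustration, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           n_redundant=0, random_state=42)

# Same candidate values as the example above
param_grid = {"min_samples_split": [2, 5, 10, 20]}

# cv=5 is an assumed setting; each candidate is scored on 5 held-out folds
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

results = search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(f"{params}: mean accuracy={mean:.3f} (std={std:.3f})")

print("Best:", search.best_params_)
```

Reporting the standard deviation across folds alongside the mean helps show whether the gap between two settings is larger than the fold-to-fold variation.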

Some tips and heuristics for setting min_samples_split:

Issues to consider:



See Also