The min_samples_split parameter in scikit-learn's RandomForestClassifier controls the minimum number of samples required to split an internal node in the decision trees. This parameter affects the growth of the trees in the ensemble. A higher value of min_samples_split requires more samples at a node to consider splitting further, leading to shallower trees. Conversely, a lower value allows the trees to grow deeper by considering splits for nodes with fewer samples.

The default value of min_samples_split is 2. In practice, values between 2 and 20 are commonly used, depending on the size and complexity of the dataset.
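To make the depth effect concrete, here is a minimal sketch (an illustration, not part of the main example below) that trains two small forests on synthetic data and compares the average depth of their trees; exact depths will vary, but a larger min_samples_split should yield shallower trees:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Synthetic binary classification data, just for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

for min_split in [2, 50]:
    rf = RandomForestClassifier(min_samples_split=min_split, random_state=42)
    rf.fit(X, y)
    # Average depth of the individual decision trees in the forest
    mean_depth = np.mean([tree.get_depth() for tree in rf.estimators_])
    print(f"min_samples_split={min_split}, mean tree depth: {mean_depth:.1f}")

The complete example below goes a step further and evaluates test-set accuracy for several min_samples_split values.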
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
accuracies = []
for min_split in min_samples_split_values:
    rf = RandomForestClassifier(min_samples_split=min_split, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_split={min_split}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
min_samples_split=2, Accuracy: 0.860
min_samples_split=5, Accuracy: 0.840
min_samples_split=10, Accuracy: 0.845
min_samples_split=20, Accuracy: 0.845
The key steps in this example are:
- Generate a synthetic multiclass classification dataset with informative and noise features
- Split the data into train and test sets
- Train RandomForestClassifier models with different min_samples_split values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting min_samples_split:

- Start with the default value of 2 and increase it if the model appears to be overfitting
- Increasing min_samples_split can reduce overfitting by requiring more samples to make a split, but setting it too high may lead to underfitting
- Adjust the value based on the model's performance on a validation set or using cross-validation (a minimal sketch follows this list)
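As a sketch of the cross-validation approach, min_samples_split can be tuned with GridSearchCV (this reuses the X_train and y_train arrays from the example above; the candidate values are just the ones tried earlier, not a recommendation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search over candidate min_samples_split values with 5-fold cross-validation
param_grid = {"min_samples_split": [2, 5, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)  # X_train, y_train from the example above
print(f"Best min_samples_split: {grid.best_params_['min_samples_split']}, "
      f"CV accuracy: {grid.best_score_:.3f}")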
Issues to consider:

- The optimal value of min_samples_split depends on the size and complexity of the dataset
- Setting min_samples_split too low can result in overfitting by growing deep, complex trees
- Setting min_samples_split too high can cause underfitting by creating shallow, simplistic trees (see the sketch after this list)
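One rough way to check where a given dataset sits between these extremes (a sketch reusing the train/test split from the example above, not a definitive diagnostic) is to compare train and test accuracy across a range of values: a large gap between the two suggests overfitting, while low accuracy on both sets suggests underfitting.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Compare train vs. test accuracy for a range of min_samples_split values
for min_split in [2, 10, 50, 200]:
    rf = RandomForestClassifier(min_samples_split=min_split, random_state=42)
    rf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, rf.predict(X_train))
    test_acc = accuracy_score(y_test, rf.predict(X_test))
    print(f"min_samples_split={min_split}, train: {train_acc:.3f}, test: {test_acc:.3f}")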