The `min_samples_split` parameter in scikit-learn's `ExtraTreesClassifier` determines the minimum number of samples required to split an internal node.

ExtraTreesClassifier is an ensemble method that builds multiple randomized decision trees and uses averaging to improve predictive accuracy and control over-fitting. The `min_samples_split` parameter affects the growth of the individual trees in the ensemble.
This parameter controls the trade-off between underfitting and overfitting. Lower values allow for more complex trees, potentially leading to overfitting, while higher values result in shallower trees, which may underfit.
The default value for `min_samples_split` is 2, which allows the algorithm to grow very detailed trees.
In practice, values between 2 and 20 are commonly used, depending on the dataset size and complexity.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
accuracies = []
for mss in min_samples_split_values:
    etc = ExtraTreesClassifier(n_estimators=100, min_samples_split=mss, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"min_samples_split={mss}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
min_samples_split=2, Accuracy: 0.925
min_samples_split=5, Accuracy: 0.920
min_samples_split=10, Accuracy: 0.890
min_samples_split=20, Accuracy: 0.910
```
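Note that `min_samples_split` also accepts a float in the interval (0.0, 1.0], which scikit-learn interprets as a fraction of the training samples: the effective minimum becomes `ceil(min_samples_split * n_samples)`. A minimal sketch on the same kind of synthetic data:

```python
from math import ceil

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# A float is a fraction of the training samples:
# ceil(0.01 * 1000) = 10, so this behaves like min_samples_split=10 here
etc = ExtraTreesClassifier(n_estimators=100, min_samples_split=0.01,
                           random_state=42)
etc.fit(X, y)
print(ceil(0.01 * X.shape[0]))  # effective minimum samples per split
```

The fractional form is convenient when the same configuration is reused across datasets of different sizes.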
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and redundant features
- Split the data into train and test sets
- Train `ExtraTreesClassifier` models with different `min_samples_split` values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting `min_samples_split`:
- Start with the default value of 2 and increase it if overfitting is observed
- Use cross-validation to find the optimal value for your specific dataset
- Consider the size of your dataset; larger datasets may benefit from higher values
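The cross-validation tip above can be sketched with `GridSearchCV`, reusing the same synthetic data as the main example (the candidate values in the grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Search candidate min_samples_split values with 5-fold cross-validation
param_grid = {"min_samples_split": [2, 5, 10, 20]}
search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=100, random_state=42),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```

Because the grid is one-dimensional and small, the search is cheap; for larger grids or slower models, `RandomizedSearchCV` is a common alternative.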
Issues to consider:
- Lower values lead to more complex trees, which may overfit but can capture intricate patterns
- Higher values result in simpler trees, which may underfit but are less prone to noise
- The optimal value depends on the noise level and complexity of your data
- Computational cost increases with lower values, since the trees grow deeper and contain more nodes
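To see the depth effect behind the last point, one rough check (on synthetic data, with illustrative parameter values) is to compare the average tree depth of two ensembles fit with different `min_samples_split` values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

avg_depth = {}
for mss in [2, 20]:
    etc = ExtraTreesClassifier(n_estimators=100, min_samples_split=mss,
                               random_state=42).fit(X, y)
    # Average tree depth across the 100 trees in the ensemble
    avg_depth[mss] = sum(t.get_depth() for t in etc.estimators_) / len(etc.estimators_)
    print(f"min_samples_split={mss}: average tree depth = {avg_depth[mss]:.1f}")
```

The ensemble fit with `min_samples_split=2` should report deeper trees on average, which is where the extra training and prediction cost comes from.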