The min_weight_fraction_leaf parameter in scikit-learn’s ExtraTreesClassifier controls the minimum weighted fraction of the total sample weight (summed over all input samples) that must end up at a leaf node.
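ExtraTreesClassifier is an ensemble method that fits a number of randomized decision trees and averages their predictions to improve predictive accuracy and control overfitting. By default (bootstrap=False) each tree is built on the whole training set; the trees differ because their split thresholds are drawn at random.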
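The averaging step can be checked directly. The following is a minimal sketch, assuming a small synthetic dataset chosen only for illustration: it averages the per-tree class probabilities by hand and compares the result with the ensemble's own predict_proba output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Small illustrative dataset (sizes are arbitrary assumptions)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = ExtraTreesClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities of the individual trees by hand
manual_proba = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# Compare with the probabilities reported by the ensemble itself
matches_forest = np.allclose(manual_proba, forest.predict_proba(X))
print("manual average matches ensemble:", matches_forest)
print("bootstrap sampling enabled:", forest.bootstrap)
```

Because every tree votes with its leaf-level class proportions, anything that changes leaf size, including min_weight_fraction_leaf, changes the probabilities being averaged.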
The min_weight_fraction_leaf parameter shapes the tree by requiring every leaf to hold at least a given fraction of the total sample weight. When no sample_weight is passed to fit, all samples carry equal weight, so the threshold is simply a fraction of the number of training samples. This can help prevent overfitting by ensuring that leaf nodes represent a meaningful proportion of the data.
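The arithmetic behind the threshold is simple enough to sketch in plain Python. The helper min_leaf_weight below is a hypothetical name used for illustration, not part of scikit-learn, and the 720/80 class split and the weight of 9.0 are made-up numbers.

```python
def min_leaf_weight(sample_weights, min_weight_fraction_leaf):
    """Illustrative helper: the smallest total weight a leaf may hold."""
    return min_weight_fraction_leaf * sum(sample_weights)

# Unweighted case: 800 training samples, each counting as 1.0
uniform = [1.0] * 800
uniform_threshold = min_leaf_weight(uniform, 0.1)    # about 80 samples per leaf

# Weighted case (hypothetical): 720 majority samples at weight 1.0,
# 80 minority samples up-weighted to 9.0
weighted = [1.0] * 720 + [9.0] * 80
weighted_threshold = min_leaf_weight(weighted, 0.1)  # 0.1 * 1440, about 144

print(f"uniform threshold:  {uniform_threshold:.1f}")
print(f"weighted threshold: {weighted_threshold:.1f}")
```

In the weighted case a leaf containing only 16 minority samples (16 * 9.0 = 144) would meet the threshold, whereas 144 majority samples would be needed to do the same.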
The default value for min_weight_fraction_leaf is 0.0, which means the parameter imposes no minimum; with the default min_samples_leaf=1, a leaf can then hold a single sample. Valid values lie between 0.0 and 0.5, with smaller values allowing more complex trees and larger values forcing simpler ones. At 0.5 a tree can make at most one split, since each of the two children must hold at least half of the total weight.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import balanced_accuracy_score

# Generate synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_weight_fraction_leaf values
min_weight_fractions = [0.0, 0.1, 0.2, 0.3]
balanced_accuracies = []

for fraction in min_weight_fractions:
    etc = ExtraTreesClassifier(n_estimators=100, min_weight_fraction_leaf=fraction, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
    balanced_accuracies.append(balanced_accuracy)
    print(f"min_weight_fraction_leaf={fraction}, Balanced Accuracy: {balanced_accuracy:.3f}")
Running the example gives an output like:
min_weight_fraction_leaf=0.0, Balanced Accuracy: 0.565
min_weight_fraction_leaf=0.1, Balanced Accuracy: 0.500
min_weight_fraction_leaf=0.2, Balanced Accuracy: 0.500
min_weight_fraction_leaf=0.3, Balanced Accuracy: 0.500
A balanced accuracy of 0.500 on a binary problem is what a model scores when it predicts the same class for every test sample, which suggests that from 0.1 upward the trees are too shallow to separate out the minority class in this dataset.
The key steps in this example are:
- Generate a synthetic imbalanced binary classification dataset
- Split the data into train and test sets
- Train ExtraTreesClassifier models with different min_weight_fraction_leaf values
- Evaluate the balanced accuracy of each model on the test set
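To see the constraint at work inside the fitted trees, one option is to inspect each tree's leaves. The sketch below assumes no sample weights and the default bootstrap=False, so the total weight equals the number of training samples; the dataset size and number of trees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.9, 0.1], random_state=42)

fraction = 0.1
model = ExtraTreesClassifier(n_estimators=20, min_weight_fraction_leaf=fraction,
                             random_state=42).fit(X, y)

# Without sample weights, the total weight is the number of training samples
min_weight = fraction * X.shape[0]

smallest_leaves = []
leaf_counts = []
for est in model.estimators_:
    tree = est.tree_
    is_leaf = tree.children_left == -1                  # leaf nodes have no children
    leaf_weights = tree.weighted_n_node_samples[is_leaf]
    smallest_leaves.append(leaf_weights.min())
    leaf_counts.append(int(is_leaf.sum()))

print(f"required leaf weight:       {min_weight:.1f}")
print(f"smallest leaf weight found: {min(smallest_leaves):.1f}")
print(f"most leaves in any tree:    {max(leaf_counts)}")
```

With a fraction of 0.1 no tree can have more than ten leaves, which is why the models in the example above become so coarse.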
Some tips and heuristics for setting min_weight_fraction_leaf:
- Start with the default value of 0.0 and increase it if you suspect overfitting
- Use larger values for smaller datasets or when you want to create simpler trees
- Consider using cross-validation to find the optimal value for your specific dataset
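The cross-validation tip can be sketched with GridSearchCV. The candidate grid, fold count, and dataset below are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.9, 0.1], random_state=42)

# Candidate values to compare (an arbitrary grid for illustration)
param_grid = {"min_weight_fraction_leaf": [0.0, 0.01, 0.05, 0.1]}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=cv,
)
search.fit(X, y)

best_fraction = search.best_params_["min_weight_fraction_leaf"]
print(f"best min_weight_fraction_leaf: {best_fraction}")
print(f"mean cross-validated balanced accuracy: {search.best_score_:.3f}")
```

Scoring with balanced accuracy keeps the search consistent with the metric used in the example above; another metric may suit a different problem better.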
Issues to consider:
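- Increasing min_weight_fraction_leaf can lead to underfitting if set too high
- For imbalanced datasets, the threshold is weight-based, so it mainly helps minority classes when class or sample weights are supplied (for example class_weight="balanced"); without weights, large values can leave no leaf dominated by the minority class, as the 0.500 scores above suggest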
- The effect of this parameter may vary depending on other tree parameters and dataset characteristics
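To explore how the parameter interacts with class weighting on imbalanced data, a comparison along the following lines may help. It is a rough sketch: the fraction of 0.05 and the stratified split are assumptions made for illustration, and the resulting scores will depend on the dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scores = {}
for class_weight in (None, "balanced"):
    # With class_weight="balanced", minority samples carry more weight,
    # so they count for more toward the leaf weight threshold
    model = ExtraTreesClassifier(n_estimators=100, min_weight_fraction_leaf=0.05,
                                 class_weight=class_weight, random_state=42)
    model.fit(X_train, y_train)
    scores[class_weight] = balanced_accuracy_score(y_test, model.predict(X_test))
    print(f"class_weight={class_weight}, Balanced Accuracy: {scores[class_weight]:.3f}")
```

Comparing the two runs across several fractions shows whether the weight-based threshold is working for or against the minority class on a given dataset.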