Configure ExtraTreesClassifier "min_weight_fraction_leaf" Parameter

The min_weight_fraction_leaf parameter in scikit-learn’s ExtraTreesClassifier controls the minimum weighted fraction of the sum total of weights required to be at a leaf node.

ExtraTreesClassifier is an ensemble method that fits a number of randomized decision trees (extra-trees) and averages their predictions to improve predictive accuracy and control overfitting. By default (bootstrap=False), each tree is trained on the whole dataset rather than on a bootstrap sample.

The min_weight_fraction_leaf parameter shapes the trees by requiring every leaf node to hold at least a given fraction of the total sample weight. When no sample_weight is passed to fit, all samples have equal weight, so the threshold is simply a fraction of the number of training samples. This can help prevent overfitting by ensuring that leaf nodes represent a meaningful proportion of the data.
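
The threshold described above can be checked directly on a fitted model. The following is a minimal sketch, assuming the default bootstrap=False and no sample_weight, so that each sample counts with weight 1; the dataset size and the value 0.1 are arbitrary choices for illustration. It reads the per-node weights from each tree's tree_ attribute and compares the lightest leaf against the required minimum.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Small synthetic dataset (sizes are illustrative assumptions)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

fraction = 0.1
model = ExtraTreesClassifier(n_estimators=10, min_weight_fraction_leaf=fraction,
                             random_state=0)
model.fit(X, y)

# With no sample_weight, every sample has weight 1, so the total weight is len(X)
min_leaf_weight = fraction * len(X)

# Find the lightest leaf across all trees in the ensemble
smallest_leaf = np.inf
for est in model.estimators_:
    tree = est.tree_
    is_leaf = tree.children_left == -1  # leaves have no children
    smallest_leaf = min(smallest_leaf, tree.weighted_n_node_samples[is_leaf].min())

print(f"required weight per leaf: {min_leaf_weight:.1f}")
print(f"smallest leaf weight found: {smallest_leaf:.1f}")
```

No leaf should fall below the required weight; if sample weights were supplied, the same comparison would use their sum in place of len(X).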

The default value for min_weight_fraction_leaf is 0.0, which imposes no weight-based minimum, so leaf size is limited only by min_samples_leaf (default 1) and a leaf node can be created even with a single sample. Valid values lie between 0.0 and 0.5, with smaller values allowing for more complex trees and larger values creating simpler trees.
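
To make the link between the value and tree complexity concrete, the sketch below counts leaves per tree for a few settings. It relies on a simple bound: if every leaf must hold at least a fraction f of the total weight, a tree can have at most 1/f leaves. The dataset and the chosen fractions are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Small synthetic dataset (sizes are illustrative assumptions)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

max_leaves = {}
for fraction in [0.0, 0.1, 0.25, 0.5]:
    model = ExtraTreesClassifier(n_estimators=10,
                                 min_weight_fraction_leaf=fraction,
                                 random_state=0)
    model.fit(X, y)
    # Largest leaf count among the trees of this ensemble
    max_leaves[fraction] = max(est.get_n_leaves() for est in model.estimators_)
    print(f"min_weight_fraction_leaf={fraction}: at most {max_leaves[fraction]} leaves per tree")
```

At 0.5 a tree can split at most once, and only if the split divides the weight exactly in half, which is why values near the upper limit tend to produce trees that barely split at all.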

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import balanced_accuracy_score

# Generate synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_weight_fraction_leaf values
min_weight_fractions = [0.0, 0.1, 0.2, 0.3]
balanced_accuracies = []

for fraction in min_weight_fractions:
    etc = ExtraTreesClassifier(n_estimators=100, min_weight_fraction_leaf=fraction, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
    balanced_accuracies.append(balanced_accuracy)
    print(f"min_weight_fraction_leaf={fraction}, Balanced Accuracy: {balanced_accuracy:.3f}")

Running the example gives an output like:

min_weight_fraction_leaf=0.0, Balanced Accuracy: 0.565
min_weight_fraction_leaf=0.1, Balanced Accuracy: 0.500
min_weight_fraction_leaf=0.2, Balanced Accuracy: 0.500
min_weight_fraction_leaf=0.3, Balanced Accuracy: 0.500

A balanced accuracy of exactly 0.500 is what a binary classifier scores when it assigns every test sample to one class, which suggests that on this imbalanced dataset values of 0.1 and above already make the trees too simple to separate the minority class.

The key steps in this example are:

Generate a synthetic imbalanced binary classification dataset
Split the data into train and test sets
Train ExtraTreesClassifier models with different min_weight_fraction_leaf values
Evaluate the balanced accuracy of each model on the test set

Some tips and heuristics for setting min_weight_fraction_leaf:

Start with the default value of 0.0 and increase it if you suspect overfitting
Use larger values for smaller datasets or when you want to create simpler trees
Consider using cross-validation to find the optimal value for your specific dataset
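
The cross-validation tip can be sketched with GridSearchCV. This is one possible setup rather than a recommended recipe: the candidate values, the 5-fold split, the balanced_accuracy scorer, and the dataset are all assumptions chosen to mirror the example above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Imbalanced synthetic dataset, similar in spirit to the main example
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)

# Candidate values are illustrative; small values are included on purpose
param_grid = {"min_weight_fraction_leaf": [0.0, 0.01, 0.05, 0.1]}

search = GridSearchCV(
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    param_grid,
    scoring="balanced_accuracy",  # accounts for the class imbalance
    cv=5,
)
search.fit(X, y)

print("Best value:", search.best_params_["min_weight_fraction_leaf"])
print(f"Best cross-validated balanced accuracy: {search.best_score_:.3f}")
```

Because the useful range is often well below 0.1, a grid that is denser near zero is usually more informative than evenly spaced values up to 0.5.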

Issues to consider:

Increasing min_weight_fraction_leaf can lead to underfitting if set too high
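For imbalanced datasets, the threshold is measured in sample weight, so it helps minority class representation mainly when combined with sample_weight or class_weight; with equal weights, a large value can remove the minority class from the predictions entirely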
The effect of this parameter may vary depending on other tree parameters and dataset characteristics
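
The interaction with class weighting can be explored with a short comparison. The sketch below assumes that class_weight="balanced" is folded into the per-sample weights against which the leaf threshold is measured, so minority samples count for more toward the minimum; the fixed value of 0.1 and the dataset are illustrative choices, and no particular outcome is claimed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Same style of imbalanced dataset as in the main example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

scores = {}
for class_weight in [None, "balanced"]:
    model = ExtraTreesClassifier(n_estimators=100,
                                 min_weight_fraction_leaf=0.1,
                                 class_weight=class_weight,
                                 random_state=42)
    model.fit(X_train, y_train)
    # Score on held-out data with a metric that respects the imbalance
    scores[class_weight] = balanced_accuracy_score(y_test, model.predict(X_test))
    print(f"class_weight={class_weight}, Balanced Accuracy: {scores[class_weight]:.3f}")
```

Comparing the two scores on your own data shows whether weighting changes how the leaf threshold treats the minority class.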
