The min_weight_fraction_leaf parameter in scikit-learn's RandomForestClassifier sets the minimum weighted fraction of the total sample weight (over all input samples) required to be at a leaf node.
This parameter affects the growth of the decision trees in the ensemble by controlling leaf creation based on the proportion of the total sample weight a leaf would contain. A higher value restricts leaves to those carrying a significant fraction of the total sample weight, resulting in smaller, less complex trees.
The default value for min_weight_fraction_leaf is 0.0, meaning there is no restriction on the weighted fraction of samples at a leaf node. The parameter accepts values in the range [0.0, 0.5]; in practice, the value is chosen based on the dataset's characteristics and the desired balance between model complexity and generalization performance.
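To make the definition concrete, the following sketch (with illustrative numbers) checks that with uniform sample weights and min_weight_fraction_leaf=0.1, every leaf in every tree holds at least 10% of the total training weight. It relies on the fitted tree's tree_ attributes (children_left, weighted_n_node_samples), which are part of scikit-learn's tree structure API:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With uniform sample weights, min_weight_fraction_leaf=0.1 means every
# leaf must hold at least 10% of the total training weight.
X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=10, min_weight_fraction_leaf=0.1,
                            random_state=0).fit(X, y)

fractions = []
for est in rf.estimators_:
    tree = est.tree_
    leaves = tree.children_left == -1  # leaf nodes have no children
    # weighted_n_node_samples holds each node's total sample weight;
    # node 0 is the root, so dividing gives each leaf's weight fraction
    fractions.append(tree.weighted_n_node_samples[leaves].min()
                     / tree.weighted_n_node_samples[0])

print(min(fractions))  # every leaf fraction is >= 0.1
```

The same constraint applies when you pass non-uniform weights via sample_weight to fit: the fraction is then computed against the weighted total, not the raw sample count.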
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Generate imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                           n_features=10, n_informative=5, n_redundant=0,
                           random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different min_weight_fraction_leaf values
min_weight_fraction_leaf_values = [0.0, 0.1, 0.2, 0.3]
f1_scores = []
for value in min_weight_fraction_leaf_values:
    rf = RandomForestClassifier(min_weight_fraction_leaf=value, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    f1_scores.append(f1)
    print(f"min_weight_fraction_leaf={value}, F1 Score: {f1:.3f}")
```
Running the example gives an output like:

```
min_weight_fraction_leaf=0.0, F1 Score: 0.800
min_weight_fraction_leaf=0.1, F1 Score: 0.000
min_weight_fraction_leaf=0.2, F1 Score: 0.000
min_weight_fraction_leaf=0.3, F1 Score: 0.000
```

Note what happens here: with a 90/10 class split, requiring each leaf to hold at least 10% of the total sample weight makes pure minority-class leaves almost impossible, so the trees predict the majority class everywhere and the minority-class F1 score collapses to 0.
The key steps in this example are:
- Generate an imbalanced synthetic binary classification dataset
- Split the data into train and test sets
- Train RandomForestClassifier models with different min_weight_fraction_leaf values
- Evaluate the F1 score of each model on the test set
Tips and heuristics for setting min_weight_fraction_leaf:
- Be cautious with higher values (e.g., 0.1 to 0.5) on imbalanced datasets: they reduce overfitting, but as the F1 scores above show, they can also prevent the trees from forming minority-class leaves at all, so prefer small non-zero values and validate on a minority-sensitive metric
- Consider the trade-off between model complexity and generalization performance when tuning this parameter
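One way to navigate this trade-off is to tune the parameter with cross-validation rather than a single train/test split. A minimal sketch using GridSearchCV (the candidate values here are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Same kind of imbalanced dataset as in the example above
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                           n_features=10, n_informative=5, n_redundant=0,
                           random_state=42)

# Search over candidate values, scoring by cross-validated F1 so the
# minority class is taken into account
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"min_weight_fraction_leaf": [0.0, 0.01, 0.05, 0.1]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

For imbalanced data you may also want to combine this with class_weight="balanced" or a stratified resampling strategy, which changes the effective sample weights the fraction is computed against.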
Issues to consider:
- Higher values lead to smaller, less complex decision trees, which can help prevent overfitting but may also cause underfitting if set too high
- The optimal value depends on the dataset’s characteristics, such as class imbalance and the presence of noisy or irrelevant features
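The claim that higher values yield smaller trees is easy to verify directly. This sketch (with illustrative values) compares the mean number of leaves per tree across settings, using the get_n_leaves method of the fitted decision trees:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=42)

mean_leaves = {}
for value in [0.0, 0.1, 0.3]:
    rf = RandomForestClassifier(min_weight_fraction_leaf=value,
                                random_state=42).fit(X, y)
    # each estimator in the forest is a fitted DecisionTreeClassifier
    mean_leaves[value] = np.mean([est.get_n_leaves()
                                  for est in rf.estimators_])
    print(f"min_weight_fraction_leaf={value}: "
          f"mean leaves per tree={mean_leaves[value]:.1f}")
```

At 0.3 each leaf must carry at least 30% of the total weight, so no tree can have more than three leaves; at 0.0 the trees are grown out fully, so the leaf counts shrink sharply as the value increases.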