The `min_weight_fraction_leaf` parameter in scikit-learn's `DecisionTreeClassifier` controls the minimum weighted fraction of the total sum of sample weights (over all input samples) required at a leaf node. By default it is 0.0, which imposes no restriction on leaf size. Increasing the value constrains how small a leaf is allowed to be, which helps control overfitting and the overall complexity of the tree. Common values for `min_weight_fraction_leaf` range from 0.0 to 0.5 (values above 0.5 are not accepted), depending on characteristics of the dataset such as class imbalance.
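Because the threshold is expressed as a fraction of the total sample weight rather than as a raw sample count, it interacts with any `sample_weight` passed to `fit`. The minimal sketch below uses made-up weights purely for illustration: with `min_weight_fraction_leaf=0.1`, every leaf must carry at least 10% of the total weight (0.6 out of 6.0 here), regardless of how many samples that corresponds to.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: 6 samples, 2 features (values chosen arbitrarily for illustration)
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 2], [3, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

# Hypothetical per-sample weights; they sum to 6.0
sample_weight = np.array([0.5, 0.5, 1.0, 1.0, 1.5, 1.5])

# Each leaf must hold at least 10% of the total weight (0.6),
# not at least 10% of the number of samples
dt = DecisionTreeClassifier(min_weight_fraction_leaf=0.1, random_state=0)
dt.fit(X, y, sample_weight=sample_weight)
print("Number of leaves:", dt.get_n_leaves())
```

The fuller example below compares several `min_weight_fraction_leaf` values on an imbalanced synthetic dataset; no explicit sample weights are passed, so every sample counts equally.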
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate an imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.8, 0.2],
                           random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_weight_fraction_leaf values
min_weight_fraction_leaf_values = [0.0, 0.1, 0.2, 0.3]
accuracies = []
tree_depths = []

for value in min_weight_fraction_leaf_values:
    dt = DecisionTreeClassifier(min_weight_fraction_leaf=value, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    tree_depths.append(dt.get_depth())
    print(f"min_weight_fraction_leaf={value}, Accuracy: {accuracy:.3f}, Tree Depth: {dt.get_depth()}")
```
Running the example produces output like the following:

```
min_weight_fraction_leaf=0.0, Accuracy: 0.870, Tree Depth: 10
min_weight_fraction_leaf=0.1, Accuracy: 0.870, Tree Depth: 5
min_weight_fraction_leaf=0.2, Accuracy: 0.870, Tree Depth: 3
min_weight_fraction_leaf=0.3, Accuracy: 0.840, Tree Depth: 2
```
The key steps in this example are:
- Generate an imbalanced synthetic binary classification dataset
- Split the data into train and test sets
- Train `DecisionTreeClassifier` models with different `min_weight_fraction_leaf` values
- Evaluate the accuracy and tree depth of each model on the test set
Some tips and heuristics for setting `min_weight_fraction_leaf`:
- Higher values limit the tree depth and can help prevent overfitting
- Setting the value too high may lead to underfitting
- The value is often set based on the class imbalance ratio in the dataset
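One way to act on the last tip above (an illustrative heuristic, not a rule prescribed by scikit-learn) is to keep the threshold comfortably below the minority class's share of the training data, so that leaves consisting purely of minority-class samples remain possible. Continuing from the train/test split created earlier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Share of the minority class in the training labels (roughly 0.2 for this dataset)
minority_share = np.bincount(y_train).min() / len(y_train)

# Hypothetical rule of thumb: stay well below the minority share,
# and never exceed 0.5, the maximum scikit-learn accepts
candidate = min(0.5, minority_share / 2)

dt = DecisionTreeClassifier(min_weight_fraction_leaf=candidate, random_state=42)
dt.fit(X_train, y_train)
print(f"candidate={candidate:.3f}, test accuracy={dt.score(X_test, y_test):.3f}")
```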
Issues to consider:
- The effect of `min_weight_fraction_leaf` depends on the class imbalance and the size of the dataset
- It may be necessary to tune this parameter in conjunction with other parameters like `max_depth` (a cross-validated search over both is sketched below)
- The optimal value is dataset-specific and requires experimentation to determine
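A minimal sketch of such a joint search, continuing from the training split created earlier; the grid values and the balanced-accuracy scorer are illustrative choices rather than recommendations:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Search min_weight_fraction_leaf together with max_depth;
# adapt the grid values to the dataset at hand
param_grid = {
    "min_weight_fraction_leaf": [0.0, 0.05, 0.1, 0.2],
    "max_depth": [None, 3, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,                          # stratified 5-fold CV for a classifier
    scoring="balanced_accuracy",   # less misleading than plain accuracy on imbalanced data
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV balanced accuracy:", round(search.best_score_, 3))
```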