The `min_impurity_decrease` parameter in scikit-learn's `RandomForestClassifier` controls the minimum decrease in impurity required to split an internal node during tree construction.
Random Forest is an ensemble learning method that combines multiple decision trees to improve generalization performance. During tree construction, the algorithm splits nodes based on the feature that provides the greatest decrease in impurity (e.g., Gini impurity or entropy).
The `min_impurity_decrease` parameter sets a threshold for the minimum decrease in impurity required to make a split. If the best split doesn't reduce the impurity by at least this amount, the node becomes a leaf.
The default value for `min_impurity_decrease` is 0.0, which means that any split that reduces impurity is allowed.
In practice, setting `min_impurity_decrease` to a small positive value (e.g., 0.01 or 0.1) can help prune the trees and reduce overfitting, but setting it too high may result in underfitting.
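To make the threshold concrete: scikit-learn compares each candidate split against a *weighted* impurity decrease, scaled by the fraction of samples reaching the node (per the scikit-learn documentation). A minimal sketch of that computation, using hypothetical node sizes and Gini values:

```python
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    """Weighted impurity decrease, as defined in the scikit-learn docs:
    N_t / N * (impurity - N_t_R / N_t * right_impurity
                        - N_t_L / N_t * left_impurity)
    where N is the total number of samples, N_t the samples at the node,
    and N_t_L / N_t_R the samples in the left/right child.
    """
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)

# Hypothetical node: 200 of 1000 samples reach it; Gini falls from 0.48
# to 0.30 (left child, 120 samples) and 0.10 (right child, 80 samples).
decrease = weighted_impurity_decrease(N=1000, N_t=200, N_t_L=120, N_t_R=80,
                                      impurity=0.48,
                                      left_impurity=0.30,
                                      right_impurity=0.10)
print(round(decrease, 4))  # 0.052
```

The split is made only if this value is at least `min_impurity_decrease`, so deep nodes (small `N_t / N`) are pruned first as the threshold rises.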
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.01, 0.1, 0.5]
f1_scores = []

for min_impurity_decrease in min_impurity_decrease_values:
    rf = RandomForestClassifier(min_impurity_decrease=min_impurity_decrease,
                                random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    f1_scores.append(f1)
    print(f"min_impurity_decrease={min_impurity_decrease}, F1-score: {f1:.3f}")
```
Running the example gives an output like:

```
min_impurity_decrease=0.0, F1-score: 0.919
min_impurity_decrease=0.01, F1-score: 0.896
min_impurity_decrease=0.1, F1-score: 0.618
min_impurity_decrease=0.5, F1-score: 0.649
```
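One way to see *why* the F1-score drops at higher thresholds is to measure how aggressively each setting prunes the trees. This sketch (regenerating the same synthetic data for self-containment) uses the `tree_.node_count` attribute of each fitted estimator to report the average tree size:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, random_state=42)

avg_nodes = {}
for mid in [0.0, 0.01, 0.1]:
    rf = RandomForestClassifier(min_impurity_decrease=mid, random_state=42)
    rf.fit(X, y)
    # tree_.node_count is the total number of nodes in each fitted tree
    avg_nodes[mid] = np.mean([est.tree_.node_count for est in rf.estimators_])
    print(f"min_impurity_decrease={mid}: "
          f"avg nodes per tree = {avg_nodes[mid]:.1f}")
```

Higher thresholds yield monotonically smaller trees: the greedy split search is unchanged, but splits whose weighted impurity decrease falls below the threshold are rejected and the node becomes a leaf.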
The key steps in this example are:

- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train `RandomForestClassifier` models with different `min_impurity_decrease` values
- Evaluate the F1-score of each model on the test set
Some tips and heuristics for setting `min_impurity_decrease`:

- Start with the default value of 0.0 and increase it if the model appears to be overfitting
- Higher values can lead to smaller trees and potentially underfitting
- Lower values can lead to larger trees and potentially overfitting
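A simple way to check whether the model "appears to be overfitting" is to compare train and test scores: a large gap suggests overfitting, and a higher threshold may help. A sketch of that diagnostic, reusing the same synthetic setup as the main example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

results = {}
for mid in [0.0, 0.01]:
    rf = RandomForestClassifier(min_impurity_decrease=mid, random_state=42)
    rf.fit(X_train, y_train)
    train_f1 = f1_score(y_train, rf.predict(X_train))
    test_f1 = f1_score(y_test, rf.predict(X_test))
    results[mid] = (train_f1, test_f1)
    print(f"min_impurity_decrease={mid}: train F1={train_f1:.3f}, "
          f"test F1={test_f1:.3f}, gap={train_f1 - test_f1:.3f}")
```

Note that a shrinking train/test gap alone is not the goal: if the test score also falls, the threshold is pruning useful structure, not just noise.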
Issues to consider:

- The optimal value of `min_impurity_decrease` depends on the dataset and problem
- Setting the value too high can result in underfitting, while setting it too low may not prevent overfitting
- The effect of `min_impurity_decrease` may vary depending on other hyperparameters, such as `max_depth` or `min_samples_split`
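Because of these interactions, it can make sense to tune `min_impurity_decrease` jointly with related parameters rather than in isolation. One possible approach (a sketch, not the only option) is a cross-validated grid search over `min_impurity_decrease` and `max_depth` together:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, random_state=42)

# Search the two pruning-related parameters jointly, since their
# effects overlap (a shallow max_depth can mask the impurity threshold)
param_grid = {
    "min_impurity_decrease": [0.0, 0.001, 0.01],
    "max_depth": [None, 5, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="f1", cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

The grid values here are illustrative; in practice the useful range for `min_impurity_decrease` is problem-dependent and usually small, since the decrease is weighted by the fraction of samples at the node.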