
Configure RandomForestClassifier "min_impurity_decrease" Parameter

The min_impurity_decrease parameter in scikit-learn’s RandomForestClassifier controls the minimum decrease in impurity required to split an internal node during tree construction.

Random Forest is an ensemble learning method that combines multiple decision trees to improve generalization performance. During tree construction, the algorithm splits nodes based on the feature that provides the greatest decrease in impurity (e.g., Gini impurity or entropy).

The min_impurity_decrease parameter sets a threshold for the minimum decrease in impurity required to make a split. If the best split doesn’t reduce the impurity by at least this amount, the node becomes a leaf.

The default value for min_impurity_decrease is 0.0, which means that any split that reduces impurity is allowed.
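Concretely, scikit-learn does not compare raw impurities: the decrease produced by a candidate split is weighted by the fraction of samples that reach the node, and the split is only kept if that weighted decrease is at least min_impurity_decrease. The helper below is a minimal illustration of that calculation with made-up node sizes and impurities; it is not part of the library's API.

def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    # Weighted impurity decrease as described in the scikit-learn documentation:
    # N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)

# Hypothetical split: a node holding 200 of 1000 samples with Gini impurity 0.48
# splits into children of 120 and 80 samples with impurities 0.30 and 0.20.
decrease = weighted_impurity_decrease(N=1000, N_t=200, N_t_L=120, N_t_R=80,
                                      impurity=0.48, left_impurity=0.30,
                                      right_impurity=0.20)
print(f"Weighted impurity decrease: {decrease:.4f}")  # 0.0440

With these made-up numbers, the split would survive a threshold of 0.01 but not a threshold of 0.1.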

In practice, setting min_impurity_decrease to a small positive value (e.g., 0.01) can help prune the trees and reduce overfitting, but setting it too high leads to underfitting, as the example below shows.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_impurity_decrease values
min_impurity_decrease_values = [0.0, 0.01, 0.1, 0.5]
f1_scores = []

for min_impurity_decrease in min_impurity_decrease_values:
    rf = RandomForestClassifier(min_impurity_decrease=min_impurity_decrease, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    f1_scores.append(f1)
    print(f"min_impurity_decrease={min_impurity_decrease}, F1-score: {f1:.3f}")

Running the example gives an output like:

min_impurity_decrease=0.0, F1-score: 0.919
min_impurity_decrease=0.01, F1-score: 0.896
min_impurity_decrease=0.1, F1-score: 0.618
min_impurity_decrease=0.5, F1-score: 0.649
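The drop in F1-score at higher thresholds reflects heavier pruning: many splits no longer clear the bar, so the individual trees stay shallow. A quick way to confirm this, reusing X_train and y_train from above, is to refit a couple of forests and inspect the size of their trees via estimators_, get_depth(), and get_n_leaves():

import numpy as np

# Compare average tree size at two thresholds
for value in [0.0, 0.1]:
    rf = RandomForestClassifier(min_impurity_decrease=value, random_state=42)
    rf.fit(X_train, y_train)
    depths = [tree.get_depth() for tree in rf.estimators_]
    leaves = [tree.get_n_leaves() for tree in rf.estimators_]
    print(f"min_impurity_decrease={value}: "
          f"mean depth={np.mean(depths):.1f}, mean leaves={np.mean(leaves):.1f}")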

The key steps in this example are:

  1. Generate a synthetic binary classification dataset
  2. Split the data into train and test sets
  3. Train RandomForestClassifier models with different min_impurity_decrease values
  4. Evaluate the F1-score of each model on the test set

Some tips and heuristics for setting min_impurity_decrease:

  1. Start with the default of 0.0 and only raise it if the forest appears to overfit.
  2. Prefer small values (e.g., 0.001 or 0.01); as the results above show, even 0.1 prunes aggressively enough to hurt performance on this dataset.
  3. Choose the value empirically with cross-validation rather than by guesswork (see the sketch after the next list).

Issues to consider:

  1. The best threshold is data-dependent; values that work well on this synthetic dataset may not transfer to other problems.
  2. min_impurity_decrease interacts with other parameters that limit tree growth, such as max_depth, min_samples_split, and min_samples_leaf.
  3. Because a Random Forest averages many trees, it is already fairly resistant to overfitting at the default of 0.0, so aggressive pruning is rarely needed.

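To choose the value empirically, one option is to tune it with cross-validation. The following is a minimal sketch using GridSearchCV on the training data from the example above; the candidate grid is only illustrative.

from sklearn.model_selection import GridSearchCV

# Cross-validated search over candidate min_impurity_decrease values
param_grid = {"min_impurity_decrease": [0.0, 0.001, 0.01, 0.05, 0.1]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("Best min_impurity_decrease:", search.best_params_["min_impurity_decrease"])
print(f"Best cross-validated F1-score: {search.best_score_:.3f}")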
See Also