Configure ExtraTreesRegressor "min_impurity_decrease" Parameter

The min_impurity_decrease parameter in scikit-learn’s ExtraTreesRegressor controls the minimum impurity decrease required to split a node during tree growth.

Extra Trees (Extremely Randomized Trees) is an ensemble method similar to Random Forest, with two key differences: it draws candidate split thresholds at random instead of searching for the optimal cut-point, and by default it grows each tree on the whole training sample rather than a bootstrap sample.
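The two differences can be checked directly on fitted estimators. The snippet below is a minimal sketch on a small synthetic dataset (the dataset sizes and n_estimators are arbitrary choices for illustration); it reads the bootstrap setting of each ensemble and the splitter used by its first tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Small illustrative dataset (sizes chosen arbitrarily)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

etr = ExtraTreesRegressor(n_estimators=5, random_state=0).fit(X, y)
rfr = RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y)

# bootstrap: whether each tree is trained on a resampled copy of the data
# splitter: how the individual trees pick split thresholds
print("ExtraTrees   bootstrap:", etr.bootstrap, "| splitter:", etr.estimators_[0].splitter)
print("RandomForest bootstrap:", rfr.bootstrap, "| splitter:", rfr.estimators_[0].splitter)
```

With default settings, Extra Trees should report no bootstrapping and a random splitter, while Random Forest should report bootstrapping and a best-split search.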

The min_impurity_decrease parameter sets a threshold for node splitting. A node is split only if the split produces a weighted decrease in impurity greater than or equal to this value; otherwise the node becomes a leaf. This limits tree growth and can help control overfitting.
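To make the threshold concrete, here is a minimal sketch of the weighted impurity decrease as described in the scikit-learn documentation: N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity). It assumes unweighted samples and the default squared_error criterion, for which the impurity of a node is the variance of its targets. The helper names and the toy node values are hypothetical and are not part of scikit-learn.

```python
def variance(values):
    """Mean squared deviation from the mean (impurity under squared_error)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of splitting `parent` into `left` and `right`.

    n_total is the number of samples in the whole training set.
    """
    n_t = len(parent)
    n_left, n_right = len(left), len(right)
    return (n_t / n_total) * (
        variance(parent)
        - (n_right / n_t) * variance(right)
        - (n_left / n_t) * variance(left)
    )


# Hypothetical node holding 4 of the 8 training targets
parent = [1.0, 2.0, 9.0, 10.0]
left, right = [1.0, 2.0], [9.0, 10.0]

decrease = weighted_impurity_decrease(parent, left, right, n_total=8)
min_impurity_decrease = 1.0  # illustrative threshold, not a recommendation

print(f"weighted decrease: {decrease:.3f}")
print("split allowed:", decrease >= min_impurity_decrease)
```

Because the decrease is scaled by the fraction of samples that reach the node, the same threshold is harder to meet deep in the tree, where nodes hold few samples.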

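The default value for min_impurity_decrease is 0.0, which means no split is rejected on impurity grounds. Small positive values (e.g., 1e-7 to 1e-3) are a common starting point for pruning the trees and reducing model complexity. For regression with the default squared_error criterion, however, impurity is measured in squared target units, so the useful range depends on the scale of y.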

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_impurity_decrease values
min_impurity_values = [0.0, 1e-5, 1e-4, 1e-3]
mse_scores = []

for value in min_impurity_values:
    etr = ExtraTreesRegressor(n_estimators=100, min_impurity_decrease=value, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_impurity_decrease={value}, MSE: {mse:.3f}")

Running the example gives an output like:

min_impurity_decrease=0.0, MSE: 2036.183
min_impurity_decrease=1e-05, MSE: 2036.184
min_impurity_decrease=0.0001, MSE: 2020.745
min_impurity_decrease=0.001, MSE: 2005.388

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train ExtraTreesRegressor models with different min_impurity_decrease values
Evaluate the Mean Squared Error (MSE) of each model on the test set

Some tips and heuristics for setting min_impurity_decrease:

Start with small values (e.g., 1e-7) and gradually increase
Consider the size, noise level, and target scale of your dataset; larger or noisier datasets may benefit from higher values
Use cross-validation to find the optimal value for your specific problem
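The cross-validation tip above can be sketched with GridSearchCV. The candidate grid, dataset size, and number of trees below are assumptions chosen to keep the example quick, not recommended settings; the grid includes larger values because the synthetic targets have a large variance.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, kept small so the search runs quickly
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

# Illustrative candidate values; adapt the range to the scale of your target
param_grid = {"min_impurity_decrease": [0.0, 1e-3, 1e-1, 1.0, 10.0]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
)
search.fit(X, y)

print("best min_impurity_decrease:", search.best_params_["min_impurity_decrease"])
print(f"best cross-validated MSE: {-search.best_score_:.3f}")
```

The selected value is specific to this dataset and grid; on real data, the search should be rerun with a range suited to the target's scale.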

Issues to consider:

Higher values lead to simpler trees, which can improve interpretability but may underfit
Lower values allow for more complex trees, potentially capturing more nuanced patterns but risking overfitting
There’s a trade-off between model complexity and computational cost: higher values produce smaller trees, which generally train and predict faster
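One way to observe the complexity effect described above is to count the nodes in each fitted tree. This is a minimal sketch; the thresholds 1.0 and 100.0 are assumptions picked because the targets produced by make_regression have a large variance, so much smaller thresholds may prune very little here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

avg_nodes = {}
for value in [0.0, 1.0, 100.0]:  # illustrative thresholds
    etr = ExtraTreesRegressor(
        n_estimators=20, min_impurity_decrease=value, random_state=42
    )
    etr.fit(X, y)
    # tree_.node_count is the total number of nodes (internal + leaves) in a tree
    counts = [tree.tree_.node_count for tree in etr.estimators_]
    avg_nodes[value] = sum(counts) / len(counts)
    print(f"min_impurity_decrease={value}: average nodes per tree = {avg_nodes[value]:.1f}")
```

Smaller node counts indicate simpler trees, which is also where any savings in training and prediction time would come from.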
