The min_impurity_decrease parameter in scikit-learn’s ExtraTreesRegressor controls the minimum impurity decrease required to split a node during tree growth.
Extra Trees (Extremely Randomized Trees) is an ensemble method similar to Random Forest, but with two key differences: it draws split cut-points at random instead of searching for the optimal one, and by default it uses the whole learning sample (no bootstrap resampling) to grow each tree.
The min_impurity_decrease parameter sets a threshold for node splitting. A split occurs only if it produces a weighted decrease in impurity greater than or equal to this value, which limits tree growth and helps control overfitting.
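To make the threshold concrete, here is a minimal sketch of the weighted impurity decrease that the scikit-learn documentation gives for this parameter, N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity), using variance as the impurity measure (as with the squared-error criterion). The helper name weighted_impurity_decrease and the toy arrays are illustrative assumptions, not part of scikit-learn, and sample weights are ignored.

```python
import numpy as np


def weighted_impurity_decrease(y_parent, y_left, y_right, n_total):
    """Hypothetical helper: weighted impurity decrease of one candidate split.

    Uses variance as the impurity measure and assumes unweighted samples.
    """
    n_t, n_l, n_r = len(y_parent), len(y_left), len(y_right)
    parent_impurity = np.var(y_parent)
    left_impurity = np.var(y_left)
    right_impurity = np.var(y_right)
    return (n_t / n_total) * (
        parent_impurity
        - (n_r / n_t) * right_impurity
        - (n_l / n_t) * left_impurity
    )


# Toy node holding 6 of the 12 training samples (values chosen for illustration)
y_node = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_left, y_right = y_node[:3], y_node[3:]

decrease = weighted_impurity_decrease(y_node, y_left, y_right, n_total=12)
print(f"weighted impurity decrease: {decrease:.3f}")  # approximately 10.125

# The split is kept only if the decrease reaches the threshold
min_impurity_decrease = 1.0
print("split allowed:", decrease >= min_impurity_decrease)
```

Note the N_t / N factor: the same local improvement counts for less in a node that holds only a small share of the training data, so deep nodes are the first to stop splitting as the threshold rises.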
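The default value for min_impurity_decrease is 0.0, which means the parameter imposes no additional restriction on splitting. In practice, small values (e.g., 1e-7 to 1e-3) are often used to prune the trees and reduce model complexity. Because the threshold is measured in the units of the impurity criterion (the variance of the target for the default squared-error criterion), suitable values depend on the scale of y.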
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_impurity_decrease values
min_impurity_values = [0.0, 1e-5, 1e-4, 1e-3]
mse_scores = []
for value in min_impurity_values:
    etr = ExtraTreesRegressor(n_estimators=100, min_impurity_decrease=value, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_impurity_decrease={value}, MSE: {mse:.3f}")
Running the example gives an output like:
min_impurity_decrease=0.0, MSE: 2036.183
min_impurity_decrease=1e-05, MSE: 2036.184
min_impurity_decrease=0.0001, MSE: 2020.745
min_impurity_decrease=0.001, MSE: 2005.388
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train ExtraTreesRegressor models with different min_impurity_decrease values
- Evaluate the Mean Squared Error (MSE) of each model on the test set
Some tips and heuristics for setting min_impurity_decrease:
- Start with small values (e.g., 1e-7) and gradually increase
- Consider the size and noise level of your dataset; larger datasets may benefit from higher values
- Use cross-validation to find the optimal value for your specific problem
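The cross-validation tip can be sketched with GridSearchCV. This is one possible setup rather than a recommended recipe: the dataset size, the number of trees, the 3-fold split, and the idea of expressing candidate thresholds as fractions of the target variance are all assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search runs quickly
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Assumption: scale candidate thresholds by the target variance, since the
# squared-error impurity is measured in the same units as var(y)
fractions = [0.0, 1e-4, 1e-3, 1e-2]
param_grid = {"min_impurity_decrease": [float(f * y.var()) for f in fractions]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

print("best min_impurity_decrease:", search.best_params_["min_impurity_decrease"])
print("cross-validated MSE:", -search.best_score_)
```

Scaling the grid by the target variance keeps the candidates meaningful when the scale of y changes; a fixed absolute grid would work equally well if the target scale is known in advance.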
Issues to consider:
- Higher values lead to simpler trees, which can improve interpretability but may underfit
- Lower values allow for more complex trees, potentially capturing more nuanced patterns but risking overfitting
- There’s a trade-off between model complexity and computational efficiency; higher values result in faster training and prediction times
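The complexity trade-off can be checked directly by counting nodes in the fitted trees. The sketch below is illustrative only: the threshold fractions and the small forest size are assumptions, and it relies on the fitted estimators_ list and the tree_.node_count attribute of each tree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
y_var = float(np.var(y))

# Average number of nodes per tree for each threshold
# (thresholds given as fractions of the target variance, an illustrative choice)
avg_nodes = {}
for fraction in [0.0, 1e-3, 1e-2]:
    etr = ExtraTreesRegressor(
        n_estimators=20,
        min_impurity_decrease=fraction * y_var,
        random_state=42,
    )
    etr.fit(X, y)
    avg_nodes[fraction] = np.mean([tree.tree_.node_count for tree in etr.estimators_])
    print(f"fraction={fraction}: average nodes per tree = {avg_nodes[fraction]:.1f}")
```

Fewer nodes mean less work at both fit and predict time, which is where the speed-up from higher thresholds comes from.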