The `min_samples_leaf` parameter in scikit-learn's `ExtraTreesRegressor` controls the minimum number of samples required to be at a leaf node.
`ExtraTreesRegressor` is an ensemble method that fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
The `min_samples_leaf` parameter affects the complexity of the trees: a smaller value allows more complex trees, while a larger value restricts tree growth and can help prevent overfitting.
The default value for `min_samples_leaf` is 1, which allows trees to have leaves with only one sample. In practice, values between 1 and 10 are commonly used, depending on the dataset size and noise level.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20, 50]
mse_scores = []

for min_samples in min_samples_leaf_values:
    etr = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=min_samples, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_leaf={min_samples}, MSE: {mse:.3f}")

# Find the best min_samples_leaf value
best_min_samples = min_samples_leaf_values[np.argmin(mse_scores)]
print(f"\nBest min_samples_leaf value: {best_min_samples}")
```
Running the example gives an output like:
```
min_samples_leaf=1, MSE: 2036.183
min_samples_leaf=5, MSE: 2721.550
min_samples_leaf=10, MSE: 3426.148
min_samples_leaf=20, MSE: 4466.804
min_samples_leaf=50, MSE: 6881.153

Best min_samples_leaf value: 1
```
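To see directly how the parameter shapes the trees, you can inspect the fitted estimators. The sketch below (reusing the synthetic data from the example above) compares mean tree depth and leaf count at two settings, via the `tree_.max_depth` and `tree_.n_leaves` attributes of each tree in `estimators_`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
import numpy as np

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

results = {}
for min_samples in [1, 20]:
    etr = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=min_samples,
                              random_state=42)
    etr.fit(X, y)
    # Average structure statistics across the 100 trees in the ensemble
    depths = [est.tree_.max_depth for est in etr.estimators_]
    leaves = [est.tree_.n_leaves for est in etr.estimators_]
    results[min_samples] = (np.mean(depths), np.mean(leaves))
    print(f"min_samples_leaf={min_samples}: "
          f"mean depth={np.mean(depths):.1f}, mean leaves={np.mean(leaves):.0f}")
```

With `min_samples_leaf=20`, each leaf must hold at least 20 samples, so the trees end up shallower and with far fewer leaves than at the default of 1.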
The key steps in this example are:
- Generate a synthetic regression dataset with some noise
- Split the data into train and test sets
- Train `ExtraTreesRegressor` models with different `min_samples_leaf` values
- Evaluate the Mean Squared Error (MSE) of each model on the test set
- Identify the `min_samples_leaf` value that yields the lowest MSE
Some tips and heuristics for setting `min_samples_leaf`:
- Start with the default value of 1 and increase it if you suspect overfitting
- For smaller datasets, use larger values to prevent overfitting
- For larger datasets, smaller values can be used to capture more complex patterns
- Consider the noise level in your data; higher noise may require larger values
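Rather than comparing values on a single train/test split, the tips above can be applied more robustly with cross-validation. A minimal sketch using `GridSearchCV` (the grid values and ensemble size here are illustrative, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Widen or narrow this range based on dataset size and noise level
param_grid = {"min_samples_leaf": [1, 2, 5, 10]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

Cross-validation averages the MSE over several folds, which reduces the chance of picking a value that only happens to work well on one particular split.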
Issues to consider:
- The optimal value depends on the dataset size, number of features, and noise level
- Very small values can lead to overfitting, especially on noisy data
- Large values may cause underfitting if the trees become too shallow
- There’s often a trade-off between model complexity and generalization performance
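One related detail: `min_samples_leaf` also accepts a float, which scikit-learn interprets as a fraction of the training samples (`ceil(fraction * n_samples)`), so the constraint scales automatically with dataset size. A minimal sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# A float is read as a fraction of n_samples: 0.01 * 1000 -> at least
# 10 samples per leaf, however large the training set grows
etr = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=0.01, random_state=42)
etr.fit(X, y)

# Verify on the first tree: leaves are nodes with no left child (-1)
tree = etr.estimators_[0].tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]
print(leaf_sizes.min())
```

This fractional form can be convenient when the same pipeline is run on datasets of very different sizes.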