The `min_samples_leaf` parameter in scikit-learn's `ExtraTreesRegressor` controls the minimum number of samples required to be at a leaf node.
`ExtraTreesRegressor` is an ensemble method that fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
The `min_samples_leaf` parameter affects the complexity of the trees: a smaller value allows more complex trees, while a larger value restricts tree growth and can help prevent overfitting.
The default value for `min_samples_leaf` is 1, which allows trees to have leaves with only one sample. In practice, values between 1 and 10 are commonly used, depending on the dataset size and noise level.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20, 50]
mse_scores = []

for min_samples in min_samples_leaf_values:
    etr = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=min_samples, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_leaf={min_samples}, MSE: {mse:.3f}")

# Find the best min_samples_leaf value
best_min_samples = min_samples_leaf_values[np.argmin(mse_scores)]
print(f"\nBest min_samples_leaf value: {best_min_samples}")
```
Running the example gives an output like:
```
min_samples_leaf=1, MSE: 2036.183
min_samples_leaf=5, MSE: 2721.550
min_samples_leaf=10, MSE: 3426.148
min_samples_leaf=20, MSE: 4466.804
min_samples_leaf=50, MSE: 6881.153

Best min_samples_leaf value: 1
```
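To see directly how the parameter shapes the trees, you can inspect the fitted estimators. The sketch below (reusing the synthetic data from the example above) compares mean tree depth and leaf count at two settings, via the `tree_.max_depth` and `tree_.n_leaves` attributes of each tree in `estimators_`:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
import numpy as np

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

results = {}
for min_samples in [1, 20]:
    etr = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=min_samples,
                              random_state=42)
    etr.fit(X, y)
    # Average structure statistics across the 100 trees in the ensemble
    depths = [est.tree_.max_depth for est in etr.estimators_]
    leaves = [est.tree_.n_leaves for est in etr.estimators_]
    results[min_samples] = (np.mean(depths), np.mean(leaves))
    print(f"min_samples_leaf={min_samples}: "
          f"mean depth={np.mean(depths):.1f}, mean leaves={np.mean(leaves):.0f}")
```

With `min_samples_leaf=20`, each leaf must hold at least 20 samples, so the trees end up shallower and with far fewer leaves than at the default of 1.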
The key steps in this example are:
- Generate a synthetic regression dataset with some noise
- Split the data into train and test sets
- Train `ExtraTreesRegressor` models with different `min_samples_leaf` values
- Evaluate the Mean Squared Error (MSE) of each model on the test set
- Identify the `min_samples_leaf` value that yields the lowest MSE
Some tips and heuristics for setting `min_samples_leaf`:
- Start with the default value of 1 and increase it if you suspect overfitting
- For smaller datasets, use larger values to prevent overfitting
- For larger datasets, smaller values can be used to capture more complex patterns
- Consider the noise level in your data; higher noise may require larger values
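Rather than comparing values on a single train/test split, the tips above can be applied more robustly with cross-validation. A minimal sketch using `GridSearchCV` (the grid values and ensemble size here are illustrative, not recommendations):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Widen or narrow this range based on dataset size and noise level
param_grid = {"min_samples_leaf": [1, 2, 5, 10]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```

Cross-validation averages the MSE over several folds, which reduces the chance of picking a value that only happens to work well on one particular split.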
Issues to consider:
- The optimal value depends on the dataset size, number of features, and noise level
- Very small values can lead to overfitting, especially on noisy data
- Large values may cause underfitting if the trees become too shallow
- There’s often a trade-off between model complexity and generalization performance
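One related detail: `min_samples_leaf` also accepts a float, which scikit-learn interprets as a fraction of the training samples (`ceil(fraction * n_samples)`), so the constraint scales automatically with dataset size. A minimal sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# A float is read as a fraction of n_samples: 0.01 * 1000 -> at least
# 10 samples per leaf, however large the training set grows
etr = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=0.01, random_state=42)
etr.fit(X, y)

# Verify on the first tree: leaves are nodes with no left child (-1)
tree = etr.estimators_[0].tree_
leaf_sizes = tree.n_node_samples[tree.children_left == -1]
print(leaf_sizes.min())
```

This fractional form can be convenient when the same pipeline is run on datasets of very different sizes.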