
Configure ExtraTreesRegressor "min_samples_leaf" Parameter

The min_samples_leaf parameter in scikit-learn’s ExtraTreesRegressor controls the minimum number of samples required to be at a leaf node.

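ExtraTreesRegressor is an ensemble method that fits a number of randomized decision trees and averages their predictions to improve predictive accuracy and control overfitting. By default (bootstrap=False) each tree is trained on the whole training set, and the randomness comes from drawing split thresholds at random for each candidate feature.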

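The min_samples_leaf parameter affects the complexity of the trees. A split is only considered if it leaves at least min_samples_leaf training samples in each of the two branches. A smaller value allows deeper, more complex trees, while a larger value stops splitting earlier, which smooths the predictions and can help prevent overfitting.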
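The effect on tree complexity can be inspected directly. The following is a minimal sketch, assuming a small synthetic dataset and an arbitrary set of leaf sizes chosen only for illustration; it reports the average number of leaves and depth per tree, along with the smallest leaf found in the ensemble.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Small synthetic dataset (sizes are illustrative assumptions)
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

complexity = {}  # leaf_size -> (mean leaves, mean depth, smallest leaf)
for leaf_size in [1, 10, 50]:
    model = ExtraTreesRegressor(n_estimators=20, min_samples_leaf=leaf_size, random_state=0)
    model.fit(X, y)

    # Average structural statistics over the fitted trees
    mean_leaves = np.mean([t.get_n_leaves() for t in model.estimators_])
    mean_depth = np.mean([t.get_depth() for t in model.estimators_])

    # Leaves are nodes without children (children_left == -1);
    # n_node_samples holds the number of training samples in each node.
    smallest_leaf = min(
        int(t.tree_.n_node_samples[t.tree_.children_left == -1].min())
        for t in model.estimators_
    )

    complexity[leaf_size] = (mean_leaves, mean_depth, smallest_leaf)
    print(
        f"min_samples_leaf={leaf_size}: "
        f"mean leaves={mean_leaves:.1f}, mean depth={mean_depth:.1f}, "
        f"smallest leaf={smallest_leaf}"
    )
```

The smallest leaf should never fall below the configured value, and the number of leaves per tree should shrink as the value grows.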

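The default value for min_samples_leaf is 1, which allows trees to have leaves with only one sample. An integer is treated as an absolute count, while a float is treated as a fraction of the training samples and converted to ceil(min_samples_leaf * n_samples). In practice, integer values between 1 and 10 are commonly used, depending on the dataset size and noise level.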
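As a rough illustration of the two ways to specify the value, the sketch below fits one model with a fractional min_samples_leaf and one with the corresponding integer count. The dataset size and the fraction 0.02 are assumptions picked for the example.

```python
import math

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# A float is interpreted as a fraction of the training samples
fraction = 0.02
equivalent_count = math.ceil(fraction * X.shape[0])

by_fraction = ExtraTreesRegressor(
    n_estimators=20, min_samples_leaf=fraction, random_state=42
).fit(X, y)
by_count = ExtraTreesRegressor(
    n_estimators=20, min_samples_leaf=equivalent_count, random_state=42
).fit(X, y)

# Smallest leaf (in training samples) across the fraction-based ensemble
smallest_leaf = min(
    int(t.tree_.n_node_samples[t.tree_.children_left == -1].min())
    for t in by_fraction.estimators_
)

# With the same random_state, both settings are expected to grow the same trees
same_predictions = np.allclose(by_fraction.predict(X), by_count.predict(X))

print(f"fraction={fraction} corresponds to {equivalent_count} samples per leaf")
print(f"smallest leaf in the fraction-based model: {smallest_leaf}")
print(f"predictions match the integer setting: {same_predictions}")
```

A fractional value can be convenient when the same configuration is reused across datasets of different sizes, since the minimum leaf size then scales with the number of training samples.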

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20, 50]
mse_scores = []

for min_samples in min_samples_leaf_values:
    etr = ExtraTreesRegressor(n_estimators=100, min_samples_leaf=min_samples, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_leaf={min_samples}, MSE: {mse:.3f}")

# Find the best min_samples_leaf value
best_min_samples = min_samples_leaf_values[np.argmin(mse_scores)]
print(f"\nBest min_samples_leaf value: {best_min_samples}")

Running the example gives an output like:

min_samples_leaf=1, MSE: 2036.183
min_samples_leaf=5, MSE: 2721.550
min_samples_leaf=10, MSE: 3426.148
min_samples_leaf=20, MSE: 4466.804
min_samples_leaf=50, MSE: 6881.153

Best min_samples_leaf value: 1

On this low-noise synthetic dataset the test error rises steadily as min_samples_leaf grows, so the default value of 1 performs best. On noisier data a larger value may give better results.

The key steps in this example are:

  1. Generate a synthetic regression dataset with some noise
  2. Split the data into train and test sets
  3. Train ExtraTreesRegressor models with different min_samples_leaf values
  4. Evaluate the Mean Squared Error (MSE) of each model on the test set
  5. Identify the min_samples_leaf value that yields the lowest MSE
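
The steps above pick the value using the same test set that is then used to report the error, which can make the reported score optimistic. One possible alternative, sketched below under the assumption that 5-fold cross-validation on the training data is acceptable, is to let GridSearchCV choose the value and keep the test set for a single final evaluation. The grid mirrors the values used in the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate values, matching the loop in the example
param_grid = {"min_samples_leaf": [1, 5, 10, 20, 50]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=100, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X_train, y_train)

# The best estimator is refit on the full training set by default
best_leaf = search.best_params_["min_samples_leaf"]
cv_mse = -search.best_score_
test_mse = mean_squared_error(y_test, search.predict(X_test))

print(f"Best min_samples_leaf by cross-validation: {best_leaf}")
print(f"Cross-validated MSE: {cv_mse:.3f}")
print(f"Held-out test MSE: {test_mse:.3f}")
```

Here the test set plays no part in choosing the parameter, so the final MSE is a cleaner estimate of performance on unseen data.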

Some tips and heuristics for setting min_samples_leaf:

Issues to consider:


