Configure ExtraTreesRegressor "min_samples_split" Parameter

The min_samples_split parameter in scikit-learn’s ExtraTreesRegressor controls the minimum number of samples required to split an internal node.

Extra Trees Regressor is an ensemble method that averages the predictions of many randomized decision trees. Each tree draws candidate split thresholds at random and keeps the best of them, and by default (bootstrap=False) every tree is trained on the full training set. During tree construction, a node that holds fewer than min_samples_split samples is never split and becomes a leaf.

This parameter affects the complexity of the trees. Lower values create more complex trees, potentially leading to overfitting, while higher values result in simpler trees that might underfit.
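
The effect on tree complexity can be inspected directly. The following is a minimal sketch, assuming a small synthetic dataset and the illustrative values 2, 10, and 40 (none of these come from the example further down); it reports the average node count and depth of the fitted trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Small synthetic dataset; the sizes are assumptions chosen only for illustration
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=0)

tree_stats = {}
for min_samples in [2, 10, 40]:
    etr = ExtraTreesRegressor(n_estimators=20, min_samples_split=min_samples, random_state=0)
    etr.fit(X, y)
    # Each fitted tree in the ensemble exposes its node count and depth
    n_trees = len(etr.estimators_)
    mean_nodes = sum(tree.tree_.node_count for tree in etr.estimators_) / n_trees
    mean_depth = sum(tree.get_depth() for tree in etr.estimators_) / n_trees
    tree_stats[min_samples] = (mean_nodes, mean_depth)
    print(f"min_samples_split={min_samples}: mean nodes={mean_nodes:.0f}, mean depth={mean_depth:.1f}")
```

The average node count should shrink as min_samples_split grows, since nodes below the threshold are left as leaves.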

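The default value for min_samples_split is 2, the smallest integer allowed, so by default trees keep splitting until a node can no longer be divided or another limit such as max_depth or min_samples_leaf stops them.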
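
Besides an integer count, scikit-learn also accepts a float for min_samples_split, which it reads as a fraction of the training samples and rounds up. The sketch below assumes 200 training samples, so a fraction of 0.25 should behave like the integer 50; the dataset and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# 200 samples, all used for training here (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# 0.25 * 200 training samples = 50, so these two settings should be equivalent
etr_fraction = ExtraTreesRegressor(n_estimators=20, min_samples_split=0.25, random_state=0).fit(X, y)
etr_count = ExtraTreesRegressor(n_estimators=20, min_samples_split=50, random_state=0).fit(X, y)

# With the same random_state, both ensembles should make the same predictions
same_predictions = np.allclose(etr_fraction.predict(X), etr_count.predict(X))
print(f"Identical predictions: {same_predictions}")
```

A fractional value can be convenient when the same configuration is reused across datasets of very different sizes.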

In practice, values between 2 and 20 are commonly used, depending on the dataset size and the desired trade-off between bias and variance.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []

for min_samples in min_samples_split_values:
    etr = ExtraTreesRegressor(n_estimators=100, min_samples_split=min_samples, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_samples}, MSE: {mse:.3f}")

Running the example gives an output like:

min_samples_split=2, MSE: 2036.183
min_samples_split=5, MSE: 1917.775
min_samples_split=10, MSE: 2196.227
min_samples_split=20, MSE: 2757.848

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train ExtraTreesRegressor models with different min_samples_split values
Evaluate the mean squared error of each model on the test set

Some tips and heuristics for setting min_samples_split:

Start with the default value of 2 and increase it if you suspect overfitting
Use cross-validation to find the optimal value for your specific dataset
Consider the size of your dataset; larger datasets may benefit from higher values
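
The cross-validation tip above can be sketched with GridSearchCV. This is one possible setup, not the only one: the candidate grid, the 5-fold split, and the scoring choice are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

# Candidate values to compare (illustrative grid)
param_grid = {"min_samples_split": [2, 5, 10, 20]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
)
search.fit(X, y)

print(f"Best min_samples_split: {search.best_params_['min_samples_split']}")
print(f"Cross-validated MSE: {-search.best_score_:.3f}")
```

Averaging over folds makes the comparison less sensitive to a single train/test split than the loop shown earlier.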

Issues to consider:

Lower values create more complex trees, which can lead to overfitting
Higher values create simpler trees, which may underfit the data
The optimal value depends on the noise level and complexity of your data
Training time, prediction time, and model size increase with lower values, because the trees are split further and contain more nodes
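
One way to check for the overfitting and underfitting described above is to compare training and test error for several settings. The sketch below assumes a synthetic dataset and the illustrative values 2, 20, and 100; a training error far below the test error suggests overfitting, while both errors rising together suggests underfitting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data; the sizes are assumptions chosen only for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_mse = {}
test_mse = {}
for min_samples in [2, 20, 100]:
    etr = ExtraTreesRegressor(n_estimators=50, min_samples_split=min_samples, random_state=42)
    etr.fit(X_train, y_train)
    # Compare the error on data the model has seen with the error on held-out data
    train_mse[min_samples] = mean_squared_error(y_train, etr.predict(X_train))
    test_mse[min_samples] = mean_squared_error(y_test, etr.predict(X_test))
    print(f"min_samples_split={min_samples}, train MSE: {train_mse[min_samples]:.3f}, "
          f"test MSE: {test_mse[min_samples]:.3f}")
```

With the default of 2 and no bootstrapping, each tree can isolate every training sample, so the training error is expected to be close to zero and says little about generalization.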
