
Configure ExtraTreesRegressor "min_samples_split" Parameter

The min_samples_split parameter in scikit-learn’s ExtraTreesRegressor controls the minimum number of samples required to split an internal node.

Extra Trees Regressor is an ensemble method that fits a number of randomized decision trees and averages their predictions. With the default bootstrap=False, every tree is built on the full training set, and the randomness comes from randomly drawn split thresholds. During tree construction, a node that holds fewer than min_samples_split samples is not split further and becomes a leaf.

This parameter affects the complexity of the trees. Lower values let nodes keep splitting until they are very small, which produces deeper, more complex trees that can overfit. Higher values stop splitting earlier and yield simpler trees that might underfit.
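One way to see this effect directly is to inspect the fitted trees. The following is a minimal sketch, not part of the original example: it fits small forests on a synthetic dataset and reports the average node count and depth per tree. The helper name tree_stats, the dataset size, and n_estimators=20 are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Small synthetic dataset (sizes chosen only for illustration)
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

def tree_stats(min_samples_split):
    """Fit a forest and return (mean node count, mean depth) over its trees."""
    model = ExtraTreesRegressor(
        n_estimators=20, min_samples_split=min_samples_split, random_state=42
    )
    model.fit(X, y)
    node_counts = [est.tree_.node_count for est in model.estimators_]
    depths = [est.get_depth() for est in model.estimators_]
    return float(np.mean(node_counts)), float(np.mean(depths))

stats = {m: tree_stats(m) for m in [2, 5, 10, 20]}
for m, (nodes, depth) in stats.items():
    print(f"min_samples_split={m}: mean nodes={nodes:.1f}, mean depth={depth:.1f}")
```

With min_samples_split=2 the trees keep splitting until the leaves are pure, so the node counts should be the highest; larger values should give noticeably smaller trees.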

The default value for min_samples_split is 2, the smallest integer value allowed, so by default this parameter places no limit on tree growth.
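Besides an integer count, the scikit-learn documentation also describes a float form: a fraction of the training samples, converted to a count with ceil(min_samples_split * n_samples). The sketch below is an illustration under that assumption and checks that a fraction and its equivalent integer count yield the same predictions. The variable names fraction and equivalent_count, and the chosen sizes, are hypothetical.

```python
import math
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=400, n_features=10, noise=0.1, random_state=42)

# A float is read as a fraction of the number of training samples
fraction = 0.05
equivalent_count = math.ceil(fraction * X.shape[0])

model_frac = ExtraTreesRegressor(
    n_estimators=10, min_samples_split=fraction, random_state=42
).fit(X, y)
model_int = ExtraTreesRegressor(
    n_estimators=10, min_samples_split=equivalent_count, random_state=42
).fit(X, y)

# With the same random_state, both models should build the same trees
same = np.allclose(model_frac.predict(X), model_int.predict(X))
print(f"fraction={fraction} -> count={equivalent_count}, identical predictions: {same}")
```

The fractional form can be convenient when the same setting has to carry over to datasets of different sizes.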

In practice, values between 2 and 20 are commonly used, depending on the dataset size and the desired trade-off between bias and variance.
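Because the best value depends on the data, it is usually chosen by cross-validation rather than fixed in advance. A minimal sketch using GridSearchCV follows. The candidate grid [2, 5, 10, 20], the 5-fold split, and the dataset size are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Candidate values to compare (illustrative grid)
param_grid = {"min_samples_split": [2, 5, 10, 20]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X, y)

print("Best min_samples_split:", search.best_params_["min_samples_split"])
print(f"Cross-validated MSE: {-search.best_score_:.3f}")
```

Cross-validation averages over several splits, so its ranking of candidate values tends to be more stable than a ranking based on a single train/test split.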

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []

for min_samples in min_samples_split_values:
    etr = ExtraTreesRegressor(n_estimators=100, min_samples_split=min_samples, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_samples}, MSE: {mse:.3f}")

Running the example gives an output like:

min_samples_split=2, MSE: 2036.183
min_samples_split=5, MSE: 1917.775
min_samples_split=10, MSE: 2196.227
min_samples_split=20, MSE: 2757.848

The key steps in this example are:

  1. Generate a synthetic regression dataset
  2. Split the data into train and test sets
  3. Train ExtraTreesRegressor models with different min_samples_split values
  4. Evaluate the mean squared error of each model on the test set
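The test-set MSE alone does not show how closely each model fits the training data. As a hedged extension of the example above, the sketch below also records the training MSE so that the gap between the two can be compared. The dictionaries train_mse and test_mse are names introduced here for illustration, and n_estimators is reduced to 50 only to keep the run short.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_mse = {}
test_mse = {}
for min_samples in [2, 5, 10, 20]:
    etr = ExtraTreesRegressor(n_estimators=50, min_samples_split=min_samples, random_state=42)
    etr.fit(X_train, y_train)
    # Error on the data the model was fitted on versus held-out data
    train_mse[min_samples] = mean_squared_error(y_train, etr.predict(X_train))
    test_mse[min_samples] = mean_squared_error(y_test, etr.predict(X_test))
    print(
        f"min_samples_split={min_samples}, "
        f"train MSE: {train_mse[min_samples]:.3f}, test MSE: {test_mse[min_samples]:.3f}"
    )
```

With min_samples_split=2 and no bootstrapping, the trees can memorize the training set, so the training MSE should be essentially zero. As the value increases, the training error rises, which reflects the simpler trees.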



