The min_samples_split parameter in scikit-learn’s ExtraTreesRegressor controls the minimum number of samples required to split an internal node.
Extra Trees Regressor is an ensemble method that fits a number of randomized decision trees and averages their predictions. During tree construction, a node that holds fewer samples than min_samples_split is not split further and becomes a leaf.
This parameter affects the complexity of the trees. Lower values create more complex trees, potentially leading to overfitting, while higher values result in simpler trees that might underfit.
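One way to see the effect on tree complexity is to count the nodes in the fitted trees. The following is a minimal sketch, not part of the original example: the dataset size, the number of estimators, the tested values, and the helper name tree_stats are all illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Small synthetic dataset (sizes chosen only for illustration)
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)


def tree_stats(min_samples_split):
    """Hypothetical helper: return (mean node count, mean depth) over the ensemble."""
    model = ExtraTreesRegressor(
        n_estimators=20, min_samples_split=min_samples_split, random_state=0
    )
    model.fit(X, y)
    node_counts = [est.tree_.node_count for est in model.estimators_]
    depths = [est.get_depth() for est in model.estimators_]
    return sum(node_counts) / len(node_counts), sum(depths) / len(depths)


# Compare tree size for a few settings
stats = {m: tree_stats(m) for m in (2, 10, 50)}
for m, (nodes, depth) in stats.items():
    print(f"min_samples_split={m}: mean nodes={nodes:.1f}, mean depth={depth:.1f}")
```

The mean node count should shrink as min_samples_split grows, since nodes stop splitting earlier.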
The default value for min_samples_split is 2, which allows for maximum tree growth.
In practice, values between 2 and 20 are commonly used, depending on the dataset size and the desired trade-off between bias and variance.
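Besides an integer count, scikit-learn also accepts a float for min_samples_split, which it interprets as a fraction of the training samples, rounded up. The sketch below assumes a synthetic dataset of 160 samples, chosen so that a fraction of 0.125 corresponds exactly to 20 samples. Under that assumption, the two settings should produce the same model.

```python
import math

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# 160 samples so that 0.125 * 160 is exactly 20 (illustrative choice)
X, y = make_regression(n_samples=160, n_features=10, noise=0.1, random_state=42)

fraction = 0.125
equivalent_count = math.ceil(fraction * X.shape[0])

# Same seed, same effective threshold: one given as a fraction, one as a count
by_fraction = ExtraTreesRegressor(
    n_estimators=20, min_samples_split=fraction, random_state=42
).fit(X, y)
by_count = ExtraTreesRegressor(
    n_estimators=20, min_samples_split=equivalent_count, random_state=42
).fit(X, y)

same_predictions = np.allclose(by_fraction.predict(X), by_count.predict(X))
print(f"fraction={fraction} -> count={equivalent_count}, same predictions: {same_predictions}")
```

The fractional form can be convenient when the same configuration is reused across datasets of different sizes.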
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []
for min_samples in min_samples_split_values:
    etr = ExtraTreesRegressor(n_estimators=100, min_samples_split=min_samples, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_samples}, MSE: {mse:.3f}")
Running the example gives an output like:
min_samples_split=2, MSE: 2036.183
min_samples_split=5, MSE: 1917.775
min_samples_split=10, MSE: 2196.227
min_samples_split=20, MSE: 2757.848
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train ExtraTreesRegressor models with different min_samples_split values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting min_samples_split:
- Start with the default value of 2 and increase it if you suspect overfitting
- Use cross-validation to find the optimal value for your specific dataset
- Consider the size of your dataset; larger datasets may benefit from higher values
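The cross-validation tip above can be sketched with GridSearchCV. This is one possible setup rather than a recommended recipe: the candidate grid, the five folds, the scoring metric, and the reduced dataset size are assumptions made to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Smaller synthetic dataset to keep the search fast (illustrative choice)
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

# Candidate values to compare (assumed grid)
param_grid = {"min_samples_split": [2, 5, 10, 20]}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to zero) is better
    cv=5,
)
search.fit(X, y)

best_split = search.best_params_["min_samples_split"]
print(f"Best min_samples_split: {best_split}")
print(f"Cross-validated MSE: {-search.best_score_:.3f}")
```

Because the score is averaged over several folds, the selected value is less sensitive to one particular train/test split than a single hold-out comparison.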
Issues to consider:
- Lower values create more complex trees, which can lead to overfitting
- Higher values create simpler trees, which may underfit the data
- The optimal value depends on the noise level and complexity of your data
- Training time and memory use increase with lower values, because the trees grow deeper and contain more nodes
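A simple check for the overfitting and underfitting issues listed above is to compare training and test error for the same setting. The sketch below assumes a noisier synthetic dataset (noise=10.0) and a wider range of values than the main example, purely to make the gap easier to see.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisier data than the main example (assumed, to make the effect visible)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for m in (2, 20, 100):
    model = ExtraTreesRegressor(n_estimators=50, min_samples_split=m, random_state=0)
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[m] = (train_mse, test_mse)
    print(f"min_samples_split={m}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

With the default of 2 and no bootstrapping, the trees can memorize the training set, so a training error near zero alongside a much larger test error points to overfitting. Training and test errors that are both high suggest the trees are too simple.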