The min_samples_split parameter in scikit-learn’s DecisionTreeRegressor controls the minimum number of samples required to split an internal node in the tree.
Decision trees recursively split the data based on feature values until a stopping criterion is met. The min_samples_split parameter sets a threshold on the number of samples needed at a node for it to be considered for splitting.
Increasing min_samples_split results in smaller trees, as it requires more samples at a node to trigger a split. This can help reduce overfitting by limiting the tree’s complexity and depth.
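The effect on tree size can be checked directly. The following is a minimal sketch, assuming a small synthetic dataset and an illustrative set of thresholds (neither is a recommendation); it fits one tree per threshold and reports the depth and leaf count using the fitted estimator's get_depth() and get_n_leaves() methods.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; the sizes and noise level here are illustrative assumptions
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Illustrative thresholds, from the default up to a deliberately large value
split_values = [2, 10, 50, 200]
tree_sizes = {}

for min_split in split_values:
    tree = DecisionTreeRegressor(min_samples_split=min_split, random_state=42)
    tree.fit(X, y)
    # Record (depth, number of leaves) for each fitted tree
    tree_sizes[min_split] = (tree.get_depth(), tree.get_n_leaves())
    print(f"min_samples_split={min_split}: "
          f"depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

The leaf count should fall as the threshold rises, since nodes holding fewer samples than the threshold are no longer split.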
The default value for min_samples_split is 2.
In practice, values between 2 and 20 are commonly used depending on the size and noisiness of the dataset.
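Because a suitable threshold depends on dataset size, it can be convenient to express it as a fraction of the training set. scikit-learn accepts a float for min_samples_split and interprets it as ceil(min_samples_split * n_samples). The sketch below, with an arbitrarily chosen dataset size and fraction, checks that a fraction of 0.125 on 800 samples behaves like the integer threshold 100.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# 800 samples chosen so that 0.125 * 800 is exactly 100 (illustrative values)
X, y = make_regression(n_samples=800, n_features=10, noise=0.1, random_state=42)

# Integer form: a node needs at least 100 samples to be split
tree_int = DecisionTreeRegressor(min_samples_split=100, random_state=42).fit(X, y)

# Float form: interpreted as ceil(0.125 * 800) = 100 samples
tree_frac = DecisionTreeRegressor(min_samples_split=0.125, random_state=42).fit(X, y)

# Compare the two fitted trees by size and by their predictions
same_size = tree_int.get_n_leaves() == tree_frac.get_n_leaves()
same_predictions = np.allclose(tree_int.predict(X), tree_frac.predict(X))
print(f"Same number of leaves: {same_size}")
print(f"Same predictions: {same_predictions}")
```

With the fractional form, the effective threshold scales automatically if the training set grows or shrinks.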
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []
for min_split in min_samples_split_values:
    dt = DecisionTreeRegressor(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_split}, MSE: {mse:.3f}")
Running the example gives an output like:
min_samples_split=2, MSE: 6350.428
min_samples_split=5, MSE: 6502.799
min_samples_split=10, MSE: 5895.029
min_samples_split=20, MSE: 5827.258
The key steps in this example are:
- Generate a synthetic regression dataset with informative features and some noise
- Split the data into train and test sets
- Train DecisionTreeRegressor models with different min_samples_split values
- Evaluate the mean squared error (MSE) of each model on the test set
Some tips and heuristics for setting min_samples_split:
- Start with the default value of 2 and increase it until performance on held-out data starts to degrade
- Larger values lead to smaller trees, which can help prevent overfitting
- The optimal value depends on the size and complexity of the dataset
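One way to act on these tips is to let cross-validation choose the value. The following is a sketch only, assuming a synthetic dataset, a hand-picked candidate grid, and 5-fold cross-validation; it uses scikit-learn's GridSearchCV with negative mean squared error as the score.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; the sizes and noise level here are illustrative assumptions
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Candidate thresholds to compare (an assumed grid, not a recommendation)
param_grid = {"min_samples_split": [2, 5, 10, 20, 50]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X, y)

best_split = search.best_params_["min_samples_split"]
print(f"Best min_samples_split: {best_split}")
print(f"Cross-validated MSE: {-search.best_score_:.3f}")
```

The selected value applies to this dataset only; a different dataset or candidate grid may lead to a different choice.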
Issues to consider:
- Setting the value too low can lead to overly complex trees that overfit the training data
- Setting the value too high can result in oversimplified trees that underfit the data
- The best value is dataset-specific and often requires experimentation to determine
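Comparing training and test error side by side is a simple way to spot both failure modes. The sketch below reuses the dataset settings from the example above and adds two deliberately large thresholds (200 and 600, chosen only for illustration). A training error near zero alongside a much larger test error suggests overfitting, while high error on both suggests underfitting.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# From the default up to thresholds that allow only a few splits
split_values = [2, 20, 200, 600]
train_mse = {}
test_mse = {}

for min_split in split_values:
    dt = DecisionTreeRegressor(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    # Error on the data the tree was fitted to versus error on unseen data
    train_mse[min_split] = mean_squared_error(y_train, dt.predict(X_train))
    test_mse[min_split] = mean_squared_error(y_test, dt.predict(X_test))
    print(f"min_samples_split={min_split}, "
          f"train MSE: {train_mse[min_split]:.3f}, "
          f"test MSE: {test_mse[min_split]:.3f}")
```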