Configure DecisionTreeRegressor "min_samples_split" Parameter

The min_samples_split parameter in scikit-learn’s DecisionTreeRegressor controls the minimum number of samples required to split an internal node in the tree.

Decision trees recursively split the data based on feature values until a stopping criterion is met. The min_samples_split parameter sets a threshold on the number of samples needed at a node for it to be considered for splitting. A node that holds fewer samples than this threshold is not split further and becomes a leaf.

Increasing min_samples_split results in smaller trees, as it requires more samples at a node to trigger a split. This can help reduce overfitting by limiting the tree’s complexity and depth.
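The effect on tree size can be checked directly. The following is a minimal sketch, assuming the same kind of synthetic data used in this article; the values 2, 20, and 200 are chosen only for illustration. It fits one tree per setting and reports its depth and leaf count using the get_depth() and get_n_leaves() methods of the fitted estimator.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative; any regression dataset would do)
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

tree_sizes = {}
for min_split in [2, 20, 200]:
    dt = DecisionTreeRegressor(min_samples_split=min_split, random_state=42)
    dt.fit(X, y)
    # Record the depth and number of leaves of the fitted tree
    tree_sizes[min_split] = (dt.get_depth(), dt.get_n_leaves())
    print(f"min_samples_split={min_split}: depth={dt.get_depth()}, leaves={dt.get_n_leaves()}")
```

The leaf count should drop as min_samples_split grows, because nodes with fewer samples than the threshold are left unsplit.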

The default value for min_samples_split is 2.

In practice, values between 2 and 20 are commonly used depending on the size and noisiness of the dataset.
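Besides an integer count, scikit-learn also accepts a float for min_samples_split, which it interprets as a fraction of the training samples: the threshold becomes ceil(min_samples_split * n_samples). The sketch below, with an arbitrarily chosen fraction of 0.02, illustrates that a fractional setting and the equivalent integer count should produce trees of the same size.

```python
from math import ceil

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

fraction = 0.02  # illustrative value, not a recommendation
# Integer threshold that the fraction corresponds to for this dataset
equivalent_count = max(2, int(ceil(fraction * X.shape[0])))

dt_fraction = DecisionTreeRegressor(min_samples_split=fraction, random_state=42).fit(X, y)
dt_count = DecisionTreeRegressor(min_samples_split=equivalent_count, random_state=42).fit(X, y)

print(f"fraction={fraction} corresponds to at least {equivalent_count} samples per split")
print(f"leaves (fraction): {dt_fraction.get_n_leaves()}, leaves (count): {dt_count.get_n_leaves()}")
```

A fraction can be convenient when the same configuration is reused across datasets of different sizes, since the threshold then scales with the number of training samples.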

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []

for min_split in min_samples_split_values:
    dt = DecisionTreeRegressor(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_split}, MSE: {mse:.3f}")

Running the example gives an output like:

min_samples_split=2, MSE: 6350.428
min_samples_split=5, MSE: 6502.799
min_samples_split=10, MSE: 5895.029
min_samples_split=20, MSE: 5827.258

The key steps in this example are:

Generate a synthetic regression dataset with informative features and some noise
Split the data into train and test sets
Train DecisionTreeRegressor models with different min_samples_split values
Evaluate the mean squared error (MSE) of each model on the test set

Some tips and heuristics for setting min_samples_split:

Start with the default value of 2 and increase it until performance on held-out data (a validation set or cross-validation) starts to degrade
Larger values lead to smaller trees, which can help prevent overfitting
The optimal value depends on the size and complexity of the dataset
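One way to apply these tips without tuning against the test set is a cross-validated grid search. The sketch below is one possible setup, not the only one: the candidate grid, the 5-fold split, and the MSE-based scoring are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate values to try (illustrative grid)
param_grid = {"min_samples_split": [2, 5, 10, 20, 50]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X_train, y_train)

best_split = search.best_params_["min_samples_split"]
cv_mse = -search.best_score_
# score() uses the same scorer, applied to the refit best model
test_mse = -search.score(X_test, y_test)

print(f"Best min_samples_split: {best_split}")
print(f"Cross-validated MSE: {cv_mse:.3f}, test MSE: {test_mse:.3f}")
```

The test set is only used once at the end, so the reported test MSE is not biased by the parameter search.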

Issues to consider:

Setting the value too low can lead to overly complex trees that overfit the training data
Setting the value too high can result in oversimplified trees that underfit the data
The best value is dataset-specific and often requires experimentation to determine
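The trade-off described above can be made visible by comparing training and test scores. This is a rough sketch under assumed settings (the values 2, 20, and 400 are picked only to contrast a very low, a moderate, and a very high threshold); it uses the estimator's score() method, which returns R² for regressors.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for min_split in [2, 20, 400]:
    dt = DecisionTreeRegressor(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    # Store (train R^2, test R^2) for each setting
    results[min_split] = (dt.score(X_train, y_train), dt.score(X_test, y_test))
    print(f"min_samples_split={min_split}: train R^2={results[min_split][0]:.3f}, "
          f"test R^2={results[min_split][1]:.3f}")
```

A large gap between the training and test scores at low values points to overfitting, while a low training score at high values points to underfitting.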
