
Configure DecisionTreeRegressor "min_samples_split" Parameter

The min_samples_split parameter in scikit-learn’s DecisionTreeRegressor controls the minimum number of samples required to split an internal node in the tree.

Decision trees recursively split the data based on feature values until a stopping criterion is met. The min_samples_split parameter sets a threshold on the number of samples needed at a node for it to be considered for splitting.

Increasing min_samples_split produces smaller trees: any node with fewer samples than the threshold is not split and becomes a leaf. This can help reduce overfitting by limiting the tree’s complexity and depth.
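
To make the effect on tree size concrete, the following minimal sketch fits the same regressor at several thresholds and reports node count, depth, and leaf count. The dataset size, noise level, and list of thresholds are illustrative assumptions; tree_.node_count, get_depth(), and get_n_leaves() are standard on a fitted scikit-learn tree.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; the sizes and noise level here are illustrative assumptions
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

tree_sizes = {}
for min_split in [2, 10, 50, 200]:
    dt = DecisionTreeRegressor(min_samples_split=min_split, random_state=0)
    dt.fit(X, y)
    # node_count includes both internal nodes and leaves
    tree_sizes[min_split] = (dt.tree_.node_count, dt.get_depth(), dt.get_n_leaves())
    print(f"min_samples_split={min_split}: nodes={dt.tree_.node_count}, "
          f"depth={dt.get_depth()}, leaves={dt.get_n_leaves()}")
```

The node and leaf counts should shrink as the threshold grows, since fewer nodes qualify for splitting.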

The default value for min_samples_split is 2.

In practice, values between 2 and 20 are commonly used depending on the size and noisiness of the dataset.
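
Because a suitable threshold depends on dataset size, it can be convenient to express it as a fraction of the training samples. According to the scikit-learn documentation, a float value for min_samples_split is interpreted as a fraction and converted to ceil(min_samples_split * n_samples). The sketch below checks that equivalence on a synthetic dataset; the 5% fraction and the dataset are assumptions chosen for illustration.

```python
import math

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; the size and the 5% fraction are illustrative assumptions
X, y = make_regression(n_samples=800, n_features=10, noise=0.1, random_state=42)

fraction = 0.05
# A float threshold is converted to ceil(fraction * n_samples)
equivalent_count = math.ceil(fraction * X.shape[0])

dt_fraction = DecisionTreeRegressor(min_samples_split=fraction, random_state=42).fit(X, y)
dt_count = DecisionTreeRegressor(min_samples_split=equivalent_count, random_state=42).fit(X, y)

print(f"fraction={fraction} corresponds to min_samples_split={equivalent_count}")
print(f"nodes (fraction): {dt_fraction.tree_.node_count}")
print(f"nodes (count):    {dt_count.tree_.node_count}")
```

Both models should grow the same tree, since the float is resolved to the same integer threshold before fitting.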

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []

for min_split in min_samples_split_values:
    dt = DecisionTreeRegressor(min_samples_split=min_split, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_split}, MSE: {mse:.3f}")

Running the example gives an output like:

min_samples_split=2, MSE: 6350.428
min_samples_split=5, MSE: 6502.799
min_samples_split=10, MSE: 5895.029
min_samples_split=20, MSE: 5827.258

In this run the test error does not fall steadily as min_samples_split increases: a value of 5 scores slightly worse than 2, while 10 and 20 score better than both.

The key steps in this example are:

  1. Generate a synthetic regression dataset with informative features and some noise
  2. Split the data into train and test sets
  3. Train DecisionTreeRegressor models with different min_samples_split values
  4. Evaluate the mean squared error (MSE) of each model on the test set
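
A single train/test split can give a noisy ranking of candidate values. As one possible alternative, the sketch below selects min_samples_split with 5-fold cross-validation on the training set using GridSearchCV, then scores the refitted model on the held-out test set. The choice of 5 folds and the reuse of the same candidate grid are assumptions, not part of the original example.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Same synthetic setup as the example above
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate thresholds; cv=5 is an illustrative assumption
param_grid = {"min_samples_split": [2, 5, 10, 20]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, so MSE is negated
    cv=5,
)
search.fit(X_train, y_train)

best_split = search.best_params_["min_samples_split"]
cv_mse = -search.best_score_               # mean cross-validated MSE of the best candidate
test_mse = -search.score(X_test, y_test)   # MSE of the refitted best model on the test set

print(f"Best min_samples_split: {best_split}")
print(f"Cross-validated MSE: {cv_mse:.3f}")
print(f"Test MSE: {test_mse:.3f}")
```

Keeping the test set out of the search means the final test MSE remains an unbiased check on the chosen value.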

Some tips and heuristics for setting min_samples_split:

Issues to consider:


