The min_samples_split parameter in scikit-learn's RandomForestRegressor controls the minimum number of samples required to split an internal node during the construction of each decision tree in the forest.
Random Forest is an ensemble learning method that combines predictions from multiple decision trees to improve generalization performance. The min_samples_split parameter helps control overfitting by requiring a minimum number of samples before a node can be split.
The default value for min_samples_split is 2, meaning a node will split as long as it contains at least 2 samples. In practice, values between 5 and 20 are commonly used, depending on the size and complexity of the dataset.
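To see the effect of this constraint in isolation, the sketch below (on an assumed synthetic dataset, separate from the main example) fits single decision trees with increasing min_samples_split values and counts the nodes in each fitted tree; higher values should yield smaller trees.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data purely for illustration
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)

for value in [2, 10, 50]:
    tree = DecisionTreeRegressor(min_samples_split=value, random_state=0)
    tree.fit(X, y)
    # tree_.node_count reports the total number of nodes in the fitted tree
    print(f"min_samples_split={value}: {tree.tree_.node_count} nodes")
```

With min_samples_split=2 the tree grows until leaves are essentially pure; raising the value prunes away splits on small groups of samples, so the node count shrinks.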
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       n_targets=1, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train with different min_samples_split values
min_samples_split_values = [2, 5, 10, 20]
mse_scores = []

for min_samples_split in min_samples_split_values:
    rf = RandomForestRegressor(min_samples_split=min_samples_split, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_split={min_samples_split}, MSE: {mse:.3f}")
```
Running the example gives an output like:
```
min_samples_split=2, MSE: 208.093
min_samples_split=5, MSE: 214.610
min_samples_split=10, MSE: 235.431
min_samples_split=20, MSE: 293.128
```
The key steps in this example are:
- Generate a synthetic regression dataset with informative and noise features
- Split the data into train and test sets
- Train RandomForestRegressor models with different min_samples_split values
- Evaluate the mean squared error of each model on the test set
Some tips and heuristics for setting min_samples_split:
- Higher values can reduce overfitting by requiring more samples to make a split
- But setting it too high can lead to underfitting
- The default of 2 is often too low; values of 5-20 are common, depending on dataset size
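Rather than picking a value from these heuristics alone, min_samples_split can be tuned with cross-validation. A minimal sketch using GridSearchCV on an assumed synthetic dataset (the small n_estimators is only to keep the search fast):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

# Search over candidate min_samples_split values with 5-fold CV
param_grid = {"min_samples_split": [2, 5, 10, 20]}
grid = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=42),
                    param_grid, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)

print("Best min_samples_split:", grid.best_params_["min_samples_split"])
```

The same grid can of course be extended with max_depth or min_samples_leaf to tune the related constraints jointly.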
Issues to consider:
- The optimal value depends on the dataset size and complexity
- It should be balanced against other parameters like max_depth and min_samples_leaf
- Computational cost decreases for higher values, since more constrained trees grow fewer nodes and are cheaper to fit and evaluate
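To make the cost point concrete, the sketch below (again on assumed synthetic data) compares the average tree size across the forest for a low and a high min_samples_split value; larger values produce smaller trees and therefore cheaper training and prediction.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

for value in [2, 20]:
    rf = RandomForestRegressor(n_estimators=50, min_samples_split=value,
                               random_state=42)
    rf.fit(X, y)
    # Average node count over the fitted trees in the ensemble
    avg_nodes = np.mean([est.tree_.node_count for est in rf.estimators_])
    print(f"min_samples_split={value}: {avg_nodes:.0f} nodes per tree on average")
```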