The min_samples_leaf parameter in scikit-learn's RandomForestRegressor controls the minimum number of samples required to be at a leaf node. It determines when to stop splitting a node during the construction of each decision tree in the forest.
Random Forest is an ensemble learning method that combines predictions from multiple decision trees to improve generalization performance. During training, each tree is grown by recursively splitting nodes until a stopping criterion is met, such as reaching a minimum number of samples at a leaf node.
Increasing min_samples_leaf results in smaller trees with fewer splits, as the splitting process stops earlier once the minimum number of samples is reached. This can help prevent overfitting, especially on small or noisy datasets, by reducing the complexity of individual trees.
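As a quick illustration of this effect (using a small synthetic dataset and n_estimators=10, chosen here only for demonstration), we can count the nodes in the first tree of each forest and watch the trees shrink as min_samples_leaf grows:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

counts = []
for leaf in [1, 10, 50]:
    rf = RandomForestRegressor(n_estimators=10, min_samples_leaf=leaf, random_state=42)
    rf.fit(X, y)
    # node_count includes both internal nodes and leaves
    counts.append(rf.estimators_[0].tree_.node_count)
    print(f"min_samples_leaf={leaf}: {counts[-1]} nodes in first tree")
```

The exact node counts will vary with the data, but they decrease sharply as the leaf-size constraint tightens.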
The default value for min_samples_leaf is 1, meaning that a node can be split until each leaf holds only a single sample. In practice, values between 1 and 10 are commonly used, depending on the size and complexity of the dataset.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20]
mse_scores = []
for min_samples in min_samples_leaf_values:
    rf = RandomForestRegressor(min_samples_leaf=min_samples, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_leaf={min_samples}, MSE: {mse:.3f}")
Running the example gives an output like:
min_samples_leaf=1, MSE: 2621.793
min_samples_leaf=5, MSE: 2943.948
min_samples_leaf=10, MSE: 3685.958
min_samples_leaf=20, MSE: 4799.949
The key steps in this example are:
- Generate a synthetic regression dataset with informative features and some noise
- Split the data into train and test sets
- Train RandomForestRegressor models with different min_samples_leaf values
- Evaluate the mean squared error (MSE) of each model on the test set
Some tips and heuristics for setting min_samples_leaf:
- Start with the default value of 1 and increase it until the performance plateaus or starts to degrade
- Higher values can help reduce overfitting, especially on small or noisy datasets
- Consider the trade-off between model complexity and generalization ability
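One practical way to apply these tips is a small cross-validated grid search. The sketch below (the grid values, n_estimators=50, and 3-fold CV are illustrative choices, not recommendations) finds the best min_samples_leaf for a synthetic dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Search a small grid of min_samples_leaf values with 3-fold CV
param_grid = {"min_samples_leaf": [1, 2, 5, 10]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("Best min_samples_leaf:", search.best_params_["min_samples_leaf"])
```

On a real dataset you would widen or narrow the grid based on where the cross-validation score plateaus.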
Issues to consider:
- Setting min_samples_leaf too high can lead to underfitting, as the trees become too simplistic
- The optimal value depends on the size and complexity of the dataset
- Very high values can result in trees that are unable to capture important patterns in the data
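The underfitting risk is easy to see by comparing training scores at an extreme setting. In this sketch (min_samples_leaf=200 on 500 samples is a deliberately extreme value, chosen for illustration), each tree can make at most a split or two, so the forest cannot fit the training data well:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

scores = {}
for leaf in [1, 200]:
    rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=leaf, random_state=42)
    rf.fit(X, y)
    # Training R^2: a very large min_samples_leaf forces shallow, underfit trees
    scores[leaf] = rf.score(X, y)
    print(f"min_samples_leaf={leaf}: training R^2 = {scores[leaf]:.3f}")
```

A large gap between the two training scores indicates that the high setting has made the trees too simple to capture the signal.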