The min_samples_leaf parameter in scikit-learn's DecisionTreeRegressor controls the minimum number of samples required to be at a leaf node.
DecisionTreeRegressor is a non-parametric supervised learning model for regression tasks. The min_samples_leaf parameter requires that a candidate split leave at least min_samples_leaf training samples in each of the left and right branches; splits that would produce a smaller leaf are not considered.
Increasing min_samples_leaf leads to smaller trees with higher bias and lower variance, while decreasing it results in larger trees with lower bias and higher variance. This parameter therefore helps control the complexity and generalization ability of the model.
The default value for min_samples_leaf is 1. In practice, values between 1 and 20 are commonly used, depending on the size and noise level of the dataset.
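A quick way to see this effect directly is to compare tree sizes rather than accuracy. The following sketch (with an illustrative small synthetic dataset, not the one used in the example below) inspects each fitted tree's depth and leaf count via get_depth() and get_n_leaves():

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic dataset; sizes are illustrative only
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)

n_leaves = {}
for leaf in [1, 5, 20]:
    # Larger min_samples_leaf forbids splits that create small leaves,
    # so the tree ends up shallower with fewer leaves
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0).fit(X, y)
    n_leaves[leaf] = tree.get_n_leaves()
    print(f"min_samples_leaf={leaf}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

With min_samples_leaf=1 and continuous targets, the tree grows until every leaf is pure, so the leaf count shrinks sharply as the parameter increases.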
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20]
mse_scores = []
for min_samples in min_samples_leaf_values:
    dt = DecisionTreeRegressor(min_samples_leaf=min_samples, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_leaf={min_samples}, MSE: {mse:.3f}")
Running the example gives an output like:
min_samples_leaf=1, MSE: 6350.428
min_samples_leaf=5, MSE: 5201.241
min_samples_leaf=10, MSE: 5062.167
min_samples_leaf=20, MSE: 6321.089
The key steps in this example are:
- Generate a synthetic regression dataset with informative features and some noise
- Split the data into train and test sets
- Train DecisionTreeRegressor models with different min_samples_leaf values
- Evaluate the mean squared error (MSE) of each model on the test set
Some tips and heuristics for setting min_samples_leaf:
- Start with the default value of 1 and increase it until the performance starts to degrade
- Larger values lead to smaller trees with higher bias and lower variance
- Smaller values result in larger trees with lower bias and higher variance
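Rather than adjusting the value by hand, the search over candidate values can be automated with cross-validation. A minimal sketch using GridSearchCV (the candidate grid below is illustrative, not a recommendation):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Same kind of synthetic dataset as in the example above
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# 5-fold cross-validated search over candidate min_samples_leaf values
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20, 50]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print("Best min_samples_leaf:", grid.best_params_["min_samples_leaf"])
print("Best CV MSE:", -grid.best_score_)
```

Cross-validation gives a more stable estimate than a single train/test split, which matters here because small min_samples_leaf values produce high-variance trees.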
Issues to consider:
- Very small min_samples_leaf values can cause overfitting, especially with noisy data
- Large min_samples_leaf values may lead to underfitting and poor performance
- Using small min_samples_leaf values can be computationally expensive for large datasets
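One related detail worth knowing: min_samples_leaf also accepts a float between 0 and 1, interpreted as a fraction of the training samples (ceil(min_samples_leaf * n_samples) per leaf). This is convenient when the same pipeline is reused across datasets of different sizes. A short sketch:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# 0.01 means each leaf must contain at least ceil(0.01 * 1000) = 10 samples
dt = DecisionTreeRegressor(min_samples_leaf=0.01, random_state=42)
dt.fit(X, y)
print("Leaves:", dt.get_n_leaves())
```

Because every leaf must hold at least 1% of the training data, the tree can have at most 100 leaves here, regardless of how noisy the targets are.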