The min_samples_leaf parameter in scikit-learn's DecisionTreeRegressor controls the minimum number of samples required to be at a leaf node.
DecisionTreeRegressor is a non-parametric supervised learning model for regression tasks. The min_samples_leaf parameter requires that a candidate split leave at least min_samples_leaf training samples in each of the left and right branches; splits that would produce a smaller leaf are not considered.
Increasing min_samples_leaf leads to smaller trees with higher bias and lower variance, while decreasing it results in larger trees with lower bias and higher variance. This parameter therefore helps control the complexity and generalization ability of the model.
The default value for min_samples_leaf is 1. In practice, values between 1 and 20 are commonly used, depending on the size and noise level of the dataset.
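A quick way to see this effect directly is to compare tree sizes rather than accuracy. The following sketch (with an illustrative small synthetic dataset, not the one used in the example below) inspects each fitted tree's depth and leaf count via get_depth() and get_n_leaves():

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic dataset; sizes are illustrative only
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)

n_leaves = {}
for leaf in [1, 5, 20]:
    # Larger min_samples_leaf forbids splits that create small leaves,
    # so the tree ends up shallower with fewer leaves
    tree = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0).fit(X, y)
    n_leaves[leaf] = tree.get_n_leaves()
    print(f"min_samples_leaf={leaf}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

With min_samples_leaf=1 and continuous targets, the tree grows until every leaf is pure, so the leaf count shrinks sharply as the parameter increases.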
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different min_samples_leaf values
min_samples_leaf_values = [1, 5, 10, 20]
mse_scores = []
for min_samples in min_samples_leaf_values:
    dt = DecisionTreeRegressor(min_samples_leaf=min_samples, random_state=42)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_samples_leaf={min_samples}, MSE: {mse:.3f}")
Running the example gives an output like:
min_samples_leaf=1, MSE: 6350.428
min_samples_leaf=5, MSE: 5201.241
min_samples_leaf=10, MSE: 5062.167
min_samples_leaf=20, MSE: 6321.089
The key steps in this example are:
- Generate a synthetic regression dataset with informative features and some noise
- Split the data into train and test sets
- Train DecisionTreeRegressor models with different min_samples_leaf values
- Evaluate the mean squared error (MSE) of each model on the test set
Some tips and heuristics for setting min_samples_leaf:
- Start with the default value of 1 and increase it until the performance starts to degrade
- Larger values lead to smaller trees with higher bias and lower variance
- Smaller values result in larger trees with lower bias and higher variance
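Rather than adjusting the value by hand, the search over candidate values can be automated with cross-validation. A minimal sketch using GridSearchCV (the candidate grid below is illustrative, not a recommendation):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Same kind of synthetic dataset as in the example above
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# 5-fold cross-validated search over candidate min_samples_leaf values
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20, 50]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print("Best min_samples_leaf:", grid.best_params_["min_samples_leaf"])
print("Best CV MSE:", -grid.best_score_)
```

Cross-validation gives a more stable estimate than a single train/test split, which matters here because small min_samples_leaf values produce high-variance trees.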
Issues to consider:
- Very small min_samples_leaf values can cause overfitting, especially with noisy data
- Large min_samples_leaf values may lead to underfitting and poor performance
- Using small min_samples_leaf values can be computationally expensive for large datasets
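One related detail worth knowing: min_samples_leaf also accepts a float between 0 and 1, interpreted as a fraction of the training samples (ceil(min_samples_leaf * n_samples) per leaf). This is convenient when the same pipeline is reused across datasets of different sizes. A short sketch:

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# 0.01 means each leaf must contain at least ceil(0.01 * 1000) = 10 samples
dt = DecisionTreeRegressor(min_samples_leaf=0.01, random_state=42)
dt.fit(X, y)
print("Leaves:", dt.get_n_leaves())
```

Because every leaf must hold at least 1% of the training data, the tree can have at most 100 leaves here, regardless of how noisy the targets are.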