Configure RandomForestRegressor "min_weight_fraction_leaf" Parameter

The min_weight_fraction_leaf parameter in scikit-learn’s RandomForestRegressor controls the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.

This parameter helps to control the model’s complexity and can be used to mitigate overfitting. Higher values impose stronger regularization: each leaf must hold a larger share of the total sample weight, so the trees grow fewer, larger leaves.
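One way to see the regularizing effect is to count the leaves in the fitted trees. The following is a minimal sketch, assuming a small synthetic dataset and a 10-tree forest chosen only for illustration. Because the leaf weights in a tree sum to the total weight, a tree cannot have more than 1 / min_weight_fraction_leaf leaves.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Illustrative data; the sizes here are arbitrary assumptions
X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)

leaf_counts = {}
for fraction in [0.0, 0.1, 0.25, 0.5]:
    rf = RandomForestRegressor(n_estimators=10,
                               min_weight_fraction_leaf=fraction,
                               random_state=0)
    rf.fit(X, y)
    # Largest number of leaves found in any tree of the forest
    leaf_counts[fraction] = max(est.get_n_leaves() for est in rf.estimators_)
    print(f"min_weight_fraction_leaf={fraction}, max leaves per tree: {leaf_counts[fraction]}")
```

With a value of 0.5, every leaf must hold at least half of the total weight, so each tree can make at most one split.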

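The default value for min_weight_fraction_leaf is 0.0, meaning there is no minimum weighted fraction requirement by default. When no sample_weight is passed to fit(), all samples have equal weight, so the parameter acts as a minimum fraction of the training samples per leaf.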
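The parameter becomes distinct from a plain sample count once weights are passed to fit(). The sketch below uses a hypothetical weighting scheme (the first 20 samples count ten times as much as the rest) and inspects the fitted trees' tree_ attributes to check that every leaf holds at least the requested share of that tree's total weight. The weighting, dataset, and forest size are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Hypothetical weights: the first 20 samples count ten times as much
sample_weight = np.ones(len(y))
sample_weight[:20] = 10.0

fraction = 0.1
rf = RandomForestRegressor(n_estimators=10,
                           min_weight_fraction_leaf=fraction,
                           random_state=0)
rf.fit(X, y, sample_weight=sample_weight)

# For each tree, find the leaf with the smallest share of the tree's total weight
min_shares = []
for est in rf.estimators_:
    tree = est.tree_
    is_leaf = tree.children_left == -1           # -1 marks a leaf node
    total_weight = tree.weighted_n_node_samples[0]  # weight at the root
    min_shares.append(tree.weighted_n_node_samples[is_leaf].min() / total_weight)

print(f"smallest leaf weight share across trees: {min(min_shares):.3f}")
```

A leaf containing only a handful of heavily weighted samples can satisfy the constraint, which is something min_samples_leaf would not allow at an equivalent setting.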

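Valid values lie in the range 0 to 0.5 inclusive. In practice, the right value depends on the dataset’s characteristics and the desired level of regularization, and small values near 0 are a sensible starting point because the constraint tightens quickly as the value grows.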
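Values above 0.5 are rejected, since two child nodes cannot each hold more than half of the total weight. The short sketch below assumes the check happens when fit() is called and that the error is a ValueError (or a subclass of it), which matches the scikit-learn releases this was written against; the exact message may differ between versions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

rejected = False
try:
    # 0.6 is outside the accepted range, so fitting is expected to fail
    RandomForestRegressor(n_estimators=5, min_weight_fraction_leaf=0.6).fit(X, y)
except ValueError as exc:
    rejected = True
    print(f"Rejected: {exc}")
```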

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_weight_fraction_leaf values
min_weight_fraction_leaf_values = [0, 0.1, 0.25, 0.5]
mse_scores = []

for value in min_weight_fraction_leaf_values:
    rf = RandomForestRegressor(min_weight_fraction_leaf=value, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_weight_fraction_leaf={value}, MSE: {mse:.3f}")

Running the example gives an output like:

min_weight_fraction_leaf=0, MSE: 208.093
min_weight_fraction_leaf=0.1, MSE: 837.138
min_weight_fraction_leaf=0.25, MSE: 1735.592
min_weight_fraction_leaf=0.5, MSE: 2169.169

The key steps in this example are:

Generate a synthetic regression dataset with informative and noise features
Split the data into train and test sets
Train RandomForestRegressor models with different min_weight_fraction_leaf values
Evaluate the mean squared error of each model on the test set

Some tips and heuristics for setting min_weight_fraction_leaf:

Start with the default value of 0 and increase it to add regularization if the model is overfitting
Higher values force larger leaves and fewer splits, which can reduce overfitting but may lead to underfitting if too high
The optimal value depends on the dataset size, complexity, and noise level
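To judge whether a model is overfitting before raising the value, it can help to compare training and validation scores. The following is a rough sketch using cross_validate with return_train_score=True; the dataset, the candidate values, and the choice of R^2 as the metric are all assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Noisier synthetic data so the train/validation gap is visible
X, y = make_regression(n_samples=300, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)

results = {}
for fraction in [0.0, 0.01, 0.05, 0.2]:
    rf = RandomForestRegressor(n_estimators=30,
                               min_weight_fraction_leaf=fraction,
                               random_state=0)
    cv = cross_validate(rf, X, y, cv=3, scoring="r2", return_train_score=True)
    results[fraction] = (cv["train_score"].mean(), cv["test_score"].mean())
    print(f"min_weight_fraction_leaf={fraction}: "
          f"train R^2={results[fraction][0]:.3f}, "
          f"validation R^2={results[fraction][1]:.3f}")
```

A large gap between the two scores suggests overfitting, while low scores on both suggest the value is already too high.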

Issues to consider:

Setting the value too high can lead to underfitting and decreased model performance
Very small values may not provide enough regularization to prevent overfitting
This parameter interacts with others like max_depth and min_samples_leaf, so tune them together
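Tuning these parameters together can be sketched with GridSearchCV. The grid below is a small, hypothetical one chosen to keep the search quick; a real search would likely use a finer grid and more trees.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, n_informative=5,
                       noise=10.0, random_state=0)

# Hypothetical grid covering the three interacting parameters
param_grid = {
    "min_weight_fraction_leaf": [0.0, 0.01, 0.05],
    "min_samples_leaf": [1, 5],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated MSE: {-search.best_score_:.3f}")
```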
