The min_weight_fraction_leaf parameter in scikit-learn’s ExtraTreesRegressor controls the minimum fraction of the total sample weight (summed over all input samples) that each leaf node must hold.
The Extra Trees algorithm, short for Extremely Randomized Trees, is an ensemble method that builds multiple decision trees with increased randomization. It differs from Random Forests in how it selects split points and uses the entire learning sample to grow trees.
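One of the differences noted above, growing each tree on the entire learning sample, shows up in the estimator defaults. The following is a minimal sketch that only inspects constructor defaults; it assumes a scikit-learn release in which these defaults have not been changed.

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Extra Trees grows each tree on the full training set by default,
# while Random Forest draws a bootstrap sample for each tree.
et_bootstrap = ExtraTreesRegressor().get_params()["bootstrap"]
rf_bootstrap = RandomForestRegressor().get_params()["bootstrap"]

print(f"ExtraTreesRegressor bootstrap default: {et_bootstrap}")
print(f"RandomForestRegressor bootstrap default: {rf_bootstrap}")
```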
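The min_weight_fraction_leaf parameter sets a threshold for the minimum fraction of the total sample weight required at a leaf node. When no sample_weight is passed to fit, all samples have equal weight, so the threshold is simply a fraction of the number of training samples. A split is only considered if both resulting children meet the threshold, which limits the depth and complexity of the trees in the ensemble.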
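To make the threshold concrete, the sketch below checks the lightest leaf in a small ensemble, once without sample weights and once with them. The helper `smallest_leaf_weight` and the random weights are illustrative assumptions, not part of scikit-learn; the sketch reads the fitted trees' `tree_` attributes and relies on the default `bootstrap=False`, so every tree sees all samples.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
fraction = 0.1


def smallest_leaf_weight(model):
    """Hypothetical helper: smallest total sample weight held by any leaf."""
    smallest = np.inf
    for est in model.estimators_:
        tree = est.tree_
        is_leaf = tree.children_left == -1  # leaves have no children
        smallest = min(smallest, tree.weighted_n_node_samples[is_leaf].min())
    return smallest


# Unweighted: every sample counts as 1, so the floor is fraction * n_samples
unweighted = ExtraTreesRegressor(
    n_estimators=10, min_weight_fraction_leaf=fraction, random_state=0
).fit(X, y)
floor_unweighted = fraction * X.shape[0]

# Weighted: the floor becomes fraction * sum(sample_weight)
rng = np.random.default_rng(0)
weights = rng.uniform(0.5, 2.0, size=X.shape[0])  # illustrative weights
weighted = ExtraTreesRegressor(
    n_estimators=10, min_weight_fraction_leaf=fraction, random_state=0
).fit(X, y, sample_weight=weights)
floor_weighted = fraction * weights.sum()

print(f"unweighted: floor={floor_unweighted:.1f}, "
      f"lightest leaf={smallest_leaf_weight(unweighted):.1f}")
print(f"weighted:   floor={floor_weighted:.1f}, "
      f"lightest leaf={smallest_leaf_weight(weighted):.1f}")
```

In both cases the lightest leaf should sit at or above the floor, which is the main practical difference from min_samples_leaf: the limit follows the sample weights rather than the raw sample count.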
The default value for min_weight_fraction_leaf is 0.0, which means no minimum fraction is imposed. Valid values range from 0.0 to 0.5: a split produces two children that must each meet the threshold, so a value above 0.5 could never be satisfied. Smaller values allow for more complex trees.
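The upper bound can be checked directly. This is a minimal sketch, assuming a scikit-learn version that validates the parameter when fit is called and raises a ValueError (or a subclass of it); the exact exception class and message vary between releases, so only the type name is printed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# 0.6 is above the 0.5 ceiling, so fitting is expected to fail
model = ExtraTreesRegressor(
    n_estimators=5, min_weight_fraction_leaf=0.6, random_state=0
)
try:
    model.fit(X, y)
    rejected = False
except ValueError as exc:
    rejected = True
    print(f"Rejected: {type(exc).__name__}")
```

Note that the constructor itself does not complain; the check only happens at fit time.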
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different min_weight_fraction_leaf values
leaf_fractions = [0.0, 0.1, 0.2, 0.3]
mse_scores = []
for fraction in leaf_fractions:
    etr = ExtraTreesRegressor(n_estimators=100, min_weight_fraction_leaf=fraction, random_state=42)
    etr.fit(X_train, y_train)
    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"min_weight_fraction_leaf={fraction}, MSE: {mse:.3f}")
Running the example gives an output like:
min_weight_fraction_leaf=0.0, MSE: 2036.183
min_weight_fraction_leaf=0.1, MSE: 8603.181
min_weight_fraction_leaf=0.2, MSE: 12409.299
min_weight_fraction_leaf=0.3, MSE: 14671.269
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train ExtraTreesRegressor models with different min_weight_fraction_leaf values
- Evaluate the Mean Squared Error (MSE) of each model on the test set
Some tips and heuristics for setting min_weight_fraction_leaf:
- Start with the default value of 0.0 and increase it to control overfitting
- Higher values lead to shallower trees, which can reduce overfitting but may increase bias
- Consider the size of your dataset; smaller datasets may benefit from higher values
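The link between higher values and shallower trees can be inspected on the fitted ensemble. The sketch below is one way to do it, assuming a small synthetic dataset and the `get_depth` and `get_n_leaves` methods of the individual trees; the `summary` dictionary is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=0.1, random_state=0)

summary = {}  # fraction -> (mean depth, mean number of leaves)
for fraction in [0.0, 0.1, 0.2, 0.3]:
    model = ExtraTreesRegressor(
        n_estimators=20, min_weight_fraction_leaf=fraction, random_state=0
    ).fit(X, y)
    depths = [est.get_depth() for est in model.estimators_]
    leaves = [est.get_n_leaves() for est in model.estimators_]
    summary[fraction] = (np.mean(depths), np.mean(leaves))
    print(f"fraction={fraction}: mean depth={summary[fraction][0]:.1f}, "
          f"mean leaves={summary[fraction][1]:.1f}")
```

Because every leaf must hold at least the given fraction of the total weight, a tree can have at most 1 / fraction leaves, so a value of 0.3 caps each tree at three leaves regardless of dataset size.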
Issues to consider:
- The optimal value depends on the noise level and complexity of your data
- Very high values can lead to underfitting, especially on complex datasets
- This parameter interacts with other tree-specific parameters like max_depth
- Balancing between bias and variance is key when tuning this parameter
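Since the best value depends on the data and on other tree parameters, one reasonable approach is to tune them together with cross-validation. The sketch below uses GridSearchCV over min_weight_fraction_leaf and max_depth; the grid values, dataset size, and fold count are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)

# Illustrative grid: small fractions plus a few depth limits
param_grid = {
    "min_weight_fraction_leaf": [0.0, 0.01, 0.05, 0.1],
    "max_depth": [None, 5, 10],
}

search = GridSearchCV(
    ExtraTreesRegressor(n_estimators=20, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated MSE: {-search.best_score_:.3f}")
```

Searching both parameters at once avoids tuning one against a fixed, possibly poor, setting of the other.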