The n_estimators parameter in scikit-learn's ExtraTreesRegressor controls the number of trees in the ensemble.
Extra Trees, short for Extremely Randomized Trees, is an ensemble method that builds multiple decision trees with increased randomness: candidate split thresholds are drawn at random rather than optimized, which decorrelates the trees. The n_estimators parameter determines how many trees are built in the forest.
Increasing the number of trees generally improves model performance by reducing variance, but comes with diminishing returns and increased computational cost.
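To make the variance-reduction mechanism concrete, here is a minimal sketch (dataset shape and tree count chosen purely for illustration) showing that the ensemble's prediction is simply the average of its individual trees' predictions, which is why adding trees smooths the output:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Fit a small forest on a toy dataset
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
et = ExtraTreesRegressor(n_estimators=20, random_state=0).fit(X, y)

# Each fitted tree is exposed via estimators_; averaging their
# predictions reproduces the ensemble's output
per_tree = np.stack([tree.predict(X) for tree in et.estimators_])
assert np.allclose(per_tree.mean(axis=0), et.predict(X))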
The default value for n_estimators is 100 (since scikit-learn 0.22; earlier releases defaulted to 10).
In practice, values between 100 and 1000 are commonly used, depending on the dataset’s size and complexity.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_estimators values
n_estimators_values = [10, 100, 500, 1000]
mse_scores = []
for n in n_estimators_values:
    et = ExtraTreesRegressor(n_estimators=n, random_state=42)
    et.fit(X_train, y_train)
    y_pred = et.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"n_estimators={n}, MSE: {mse:.3f}")
Running the example gives an output like:
n_estimators=10, MSE: 2252.659
n_estimators=100, MSE: 2036.183
n_estimators=500, MSE: 1883.620
n_estimators=1000, MSE: 1875.810
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train ExtraTreesRegressor models with different n_estimators values
- Evaluate the Mean Squared Error of each model on the test set
Some tips and heuristics for setting n_estimators:
- Start with the default value of 100 and increase it until performance plateaus
- Use cross-validation to find a good number of trees for your specific dataset (a sketch using GridSearchCV follows this list)
- Consider the trade-off between model performance and training/prediction time
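As mentioned above, one way to run that search is with GridSearchCV. This sketch reuses the synthetic dataset from the main example; the grid values are purely illustrative, and scoring is set to negated MSE to match the metric used earlier:

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

# 5-fold cross-validation over a small, illustrative grid of tree counts
grid = GridSearchCV(
    ExtraTreesRegressor(random_state=42),
    param_grid={"n_estimators": [50, 100, 200, 500]},
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)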
Issues to consider:
- The optimal number of trees depends on the dataset’s characteristics
- Using too few trees leaves the ensemble's predictions noisy and unstable; unlike many hyperparameters, raising n_estimators does not cause overfitting, the gains simply flatten out
- There’s a computational cost associated with increasing the number of trees
- The performance gain typically diminishes as you increase the number of trees beyond a certain point, as the sketch below illustrates
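One way to watch the plateau without refitting a forest from scratch at every size is scikit-learn's warm_start option, which keeps the trees already built and only adds new ones on each fit call. This is a sketch under the same synthetic-data setup as the main example, with tree counts chosen for illustration:

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# warm_start=True reuses already-fitted trees, so each fit() call
# only grows the forest by the newly requested trees
et = ExtraTreesRegressor(warm_start=True, random_state=42)
for n in [50, 100, 200, 400, 800]:
    et.n_estimators = n
    et.fit(X_train, y_train)
    mse = mean_squared_error(y_test, et.predict(X_test))
    print(f"n_estimators={n}, MSE: {mse:.3f}")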