The n_estimators parameter in scikit-learn's ExtraTreesClassifier controls the number of trees in the ensemble.

Extremely Randomized Trees (Extra Trees) is an ensemble method that builds multiple decision trees with more randomness than Random Forests: candidate split thresholds are drawn at random rather than optimized. As with other tree ensembles, n_estimators determines how many trees are built.
Increasing n_estimators generally improves model performance by reducing variance, but it also increases computational cost. There’s often a point of diminishing returns where adding more trees provides minimal benefit.
The default value for n_estimators in ExtraTreesClassifier is 100.
In practice, values between 50 and 1000 are commonly used, depending on dataset size and complexity.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_estimators values
n_estimators_values = [10, 50, 100, 500]
accuracies = []

for n in n_estimators_values:
    etc = ExtraTreesClassifier(n_estimators=n, random_state=42)
    etc.fit(X_train, y_train)
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"n_estimators={n}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
n_estimators=10, Accuracy: 0.755
n_estimators=50, Accuracy: 0.825
n_estimators=100, Accuracy: 0.845
n_estimators=500, Accuracy: 0.855
```
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train ExtraTreesClassifier models with different n_estimators values
- Evaluate the accuracy of each model on the test set
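The loop above retrains each ensemble from scratch. As a side note, scikit-learn's warm_start parameter lets you grow an existing ensemble incrementally instead, which is a cheaper way to watch accuracy plateau as trees are added. A sketch reusing the same synthetic data (results may differ slightly from full retraining):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# warm_start=True reuses already-fitted trees when n_estimators grows
etc = ExtraTreesClassifier(n_estimators=10, warm_start=True, random_state=42)
scores = {}
for n in [10, 50, 100, 500]:
    etc.set_params(n_estimators=n)
    etc.fit(X_train, y_train)  # only the newly added trees are trained
    scores[n] = accuracy_score(y_test, etc.predict(X_test))
    print(f"n_estimators={n}, Accuracy: {scores[n]:.3f}")
```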
Some tips and heuristics for setting n_estimators:
- Start with the default value of 100 and increase if performance improves
- Monitor the trade-off between model performance and training time
- Use cross-validation to find the optimal number of trees for your specific dataset
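The cross-validation tip can be sketched as follows, reusing the synthetic dataset from the example. The candidate grid here is arbitrary, and GridSearchCV would work equally well:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Score each candidate n_estimators value with 5-fold cross-validation
best_n, best_score = None, -1.0
for n in [50, 100, 200, 500]:
    scores = cross_val_score(ExtraTreesClassifier(n_estimators=n, random_state=42),
                             X, y, cv=5)
    print(f"n_estimators={n}, mean CV accuracy: {scores.mean():.3f}")
    if scores.mean() > best_score:
        best_n, best_score = n, scores.mean()

print(f"Best n_estimators: {best_n}")
```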
Issues to consider:
- Larger values of n_estimators increase training time and memory usage
- The benefit of adding more trees often plateaus after a certain point
- Extra Trees are generally faster to train than Random Forests, so higher n_estimators values may be more feasible
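To check the relative training speed on your own data, a rough timing comparison like the following can help (absolute numbers will vary by machine):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Fit each ensemble with the same settings and record wall-clock time
times = {}
for cls in (ExtraTreesClassifier, RandomForestClassifier):
    start = time.perf_counter()
    cls(n_estimators=500, random_state=42).fit(X, y)
    times[cls.__name__] = time.perf_counter() - start
    print(f"{cls.__name__}: {times[cls.__name__]:.2f}s for 500 trees")
```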