Comparing the performance and features of ExtraTreesClassifier and RandomForestClassifier in scikit-learn provides insight into their differences and use cases.
ExtraTreesClassifier is an ensemble method that builds multiple decision trees, by default on the full training set, selecting split thresholds at random from a random subset of features. Key hyperparameters include n_estimators (number of trees), max_features (number of features to consider at each split), and bootstrap (whether bootstrap samples are used; False by default).
RandomForestClassifier constructs multiple decision trees on bootstrap samples of the training data, searching for the best split among a random subset of features. Its key hyperparameters are the same: n_estimators, max_features, and bootstrap (True by default).
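For illustration, both classifiers accept these hyperparameters with the same names; the values below are arbitrary examples, not tuned recommendations:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Both classifiers expose the same core hyperparameters.
etc = ExtraTreesClassifier(n_estimators=200, max_features="sqrt", bootstrap=False)
rfc = RandomForestClassifier(n_estimators=200, max_features="sqrt", bootstrap=True)

print(etc.get_params()["n_estimators"], etc.get_params()["max_features"])
print(rfc.get_params()["n_estimators"], rfc.get_params()["max_features"])
```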
The primary difference is that ExtraTreesClassifier, by default, trains each tree on the entire dataset rather than a bootstrap sample, and picks split thresholds at random rather than searching for the optimal one. This extra randomness tends to reduce variance at the cost of potentially higher bias compared to RandomForestClassifier.
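This difference is visible directly in the scikit-learn defaults:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Default bootstrap settings reflect the difference described above:
# ExtraTreesClassifier trains each tree on the full dataset,
# RandomForestClassifier draws a bootstrap sample per tree.
print(ExtraTreesClassifier().bootstrap)   # False
print(RandomForestClassifier().bootstrap) # True
```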
Both methods reduce overfitting relative to a single decision tree. ExtraTreesClassifier usually trains faster because it does not search for optimal split thresholds, while RandomForestClassifier can perform slightly better on some datasets thanks to bootstrap sampling. In practice, which model wins depends on the data, so it is worth evaluating both.
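The training-speed claim can be checked with a quick timing sketch (absolute times depend on your machine; the dataset size here is an arbitrary choice):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Time the fit of each ensemble on the same data.
for Model in (ExtraTreesClassifier, RandomForestClassifier):
    start = time.perf_counter()
    Model(n_estimators=100, random_state=0).fit(X, y)
    elapsed = time.perf_counter() - start
    print(f"{Model.__name__}: {elapsed:.2f}s")
```

ExtraTreesClassifier is typically the faster of the two here, since each split only evaluates randomly drawn thresholds.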
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate ExtraTreesClassifier
etc = ExtraTreesClassifier(n_estimators=100, random_state=42)
etc.fit(X_train, y_train)
y_pred_etc = etc.predict(X_test)
print(f"ExtraTreesClassifier accuracy: {accuracy_score(y_test, y_pred_etc):.3f}")
print(f"ExtraTreesClassifier F1 score: {f1_score(y_test, y_pred_etc):.3f}")
# Fit and evaluate RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)
y_pred_rfc = rfc.predict(X_test)
print(f"RandomForestClassifier accuracy: {accuracy_score(y_test, y_pred_rfc):.3f}")
print(f"RandomForestClassifier F1 score: {f1_score(y_test, y_pred_rfc):.3f}")
Running the example gives an output like:
ExtraTreesClassifier accuracy: 0.965
ExtraTreesClassifier F1 score: 0.965
RandomForestClassifier accuracy: 0.940
RandomForestClassifier F1 score: 0.941
The steps are as follows:
- Generate a synthetic classification dataset with make_classification.
- Split the data into training and test sets using train_test_split.
- Instantiate ExtraTreesClassifier, fit it on the training data, and evaluate its performance on the test set.
- Instantiate RandomForestClassifier, fit it on the training data, and evaluate its performance on the test set.
- Compare the test set performance (accuracy and F1 score) of both models.
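A single train/test split can favor either model by chance. As a follow-up, cross-validation gives a more stable comparison; this sketch reuses the same synthetic dataset with 5-fold cross_val_score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Same synthetic dataset as above.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=42)

# 5-fold cross-validation averages out the luck of a single split.
for model in (ExtraTreesClassifier(n_estimators=100, random_state=42),
              RandomForestClassifier(n_estimators=100, random_state=42)):
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{type(model).__name__}: {scores.mean():.3f} +/- {scores.std():.3f}")
```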