The oob_score parameter in scikit-learn's ExtraTreesClassifier enables out-of-bag (OOB) error estimation, providing an approximately unbiased estimate of generalization accuracy without setting aside a validation set.
ExtraTreesClassifier is an ensemble method that fits a number of randomized decision trees on various sub-samples of the dataset. It uses averaging to improve predictive accuracy and control over-fitting.
The oob_score parameter, when set to True, uses out-of-bag samples to estimate generalization accuracy. This can be useful for model evaluation without the need for a separate validation set. Note that ExtraTreesClassifier defaults to bootstrap=False (each tree is trained on the full dataset), so oob_score=True must be paired with bootstrap=True to have any out-of-bag samples at all.
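A minimal sketch of the bootstrap requirement: in scikit-learn, fitting with oob_score=True while bootstrap is left at its default of False raises a ValueError (the exact message may vary between versions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, random_state=0)

# bootstrap defaults to False in ExtraTreesClassifier, so oob_score=True
# on its own raises an error at fit time
try:
    ExtraTreesClassifier(oob_score=True, random_state=0).fit(X, y)
except ValueError as exc:
    print(f"ValueError: {exc}")

# Pairing oob_score=True with bootstrap=True enables OOB estimation
clf = ExtraTreesClassifier(bootstrap=True, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"OOB score: {clf.oob_score_:.3f}")
```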
By default, oob_score is set to False. It is commonly used in scenarios where you want to monitor model performance during training without using a separate validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train ExtraTreesClassifier with oob_score=True
et_with_oob = ExtraTreesClassifier(n_estimators=100, random_state=42, bootstrap=True, oob_score=True)
et_with_oob.fit(X_train, y_train)
# Train ExtraTreesClassifier with oob_score=False
et_without_oob = ExtraTreesClassifier(n_estimators=100, random_state=42, oob_score=False)
et_without_oob.fit(X_train, y_train)
# Calculate OOB score for the model with oob_score=True
oob_score = et_with_oob.oob_score_
# Evaluate models on test set
y_pred_with_oob = et_with_oob.predict(X_test)
y_pred_without_oob = et_without_oob.predict(X_test)
accuracy_with_oob = accuracy_score(y_test, y_pred_with_oob)
accuracy_without_oob = accuracy_score(y_test, y_pred_without_oob)
print(f"OOB Score: {oob_score:.3f}")
print(f"Test Accuracy (oob_score=True): {accuracy_with_oob:.3f}")
print(f"Test Accuracy (oob_score=False): {accuracy_without_oob:.3f}")
Running the example gives an output like:
OOB Score: 0.935
Test Accuracy (oob_score=True): 0.925
Test Accuracy (oob_score=False): 0.925
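Why out-of-bag samples exist at all: each bootstrap sample is drawn with replacement, so a given training point is left out of one tree's sample with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. A quick check:

```python
import numpy as np

n = 1000  # training set size, as in the example above
# Probability that a given sample is out-of-bag for a single tree
p_oob = (1 - 1 / n) ** n
print(f"P(out-of-bag for one tree) = {p_oob:.4f}")
print(f"1/e = {1 / np.e:.4f}")
```

So with 100 trees, each sample is out-of-bag for roughly 37 of them, which is what makes per-sample OOB predictions possible.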
The key steps in this example are:
- Generate a synthetic classification dataset
- Split the data into train and test sets
- Train two ExtraTreesClassifier models, one with oob_score=True (plus bootstrap=True) and one with oob_score=False
- Calculate the OOB score for the model with oob_score=True
- Evaluate both models on the test set and compare accuracies
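Beyond the aggregate score, a fitted model with oob_score=True also exposes oob_decision_function_, the per-sample class probabilities averaged over only the trees that did not see each sample. As a sketch, the OOB accuracy can be reproduced from it by hand (this should match oob_score_ for classifiers):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)
clf = ExtraTreesClassifier(n_estimators=100, bootstrap=True, oob_score=True,
                           random_state=42)
clf.fit(X, y)

# Predict each training sample from trees that never saw it, then score
oob_pred = np.argmax(clf.oob_decision_function_, axis=1)
manual_oob = np.mean(oob_pred == y)
print(f"Manual OOB accuracy: {manual_oob:.3f}")
print(f"oob_score_:          {clf.oob_score_:.3f}")
```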
Some tips and heuristics for using oob_score:
- Use oob_score=True when you want to monitor model performance without a separate validation set
- OOB score can be used for model selection or hyperparameter tuning
- OOB estimation is generally less expensive than cross-validation
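As an illustration of the tuning point, here is a minimal sketch that picks max_features by OOB score alone, with no validation split or cross-validation loop (the winning setting will depend on the dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)

# Compare candidate max_features values using only the OOB estimate
best = None
for max_features in ["sqrt", "log2", None]:
    clf = ExtraTreesClassifier(n_estimators=100, max_features=max_features,
                               bootstrap=True, oob_score=True, random_state=42)
    clf.fit(X, y)
    print(f"max_features={max_features}: OOB score {clf.oob_score_:.3f}")
    if best is None or clf.oob_score_ > best[1]:
        best = (max_features, clf.oob_score_)

print(f"Best by OOB score: max_features={best[0]}")
```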
Issues to consider:
- Enabling oob_score increases computational cost and memory usage
- OOB score may be less reliable for small datasets or with a small number of trees
- The OOB estimate tends to be pessimistic compared to cross-validation estimates
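The last point is easy to check empirically. A rough comparison of the OOB estimate against 5-fold cross-validation on the same synthetic data (the size and direction of the gap will vary by dataset and random seed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)

# OOB estimate from a single fit on all the data
clf = ExtraTreesClassifier(n_estimators=100, bootstrap=True, oob_score=True,
                           random_state=42)
clf.fit(X, y)

# 5-fold cross-validation estimate for the same configuration
cv_scores = cross_val_score(
    ExtraTreesClassifier(n_estimators=100, bootstrap=True, random_state=42),
    X, y, cv=5)

print(f"OOB score:        {clf.oob_score_:.3f}")
print(f"5-fold CV (mean): {cv_scores.mean():.3f}")
```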