The oob_score parameter in scikit-learn's ExtraTreesClassifier enables out-of-bag (OOB) error estimation, providing an approximately unbiased estimate of generalization accuracy without setting aside a validation set.
ExtraTreesClassifier is an ensemble method that fits a number of randomized decision trees on various sub-samples of the dataset. It uses averaging to improve predictive accuracy and control over-fitting.
The oob_score parameter, when set to True, uses out-of-bag samples to estimate generalization accuracy. This can be useful for model evaluation without the need for a separate validation set. Note that ExtraTreesClassifier defaults to bootstrap=False (each tree is trained on the full dataset), so oob_score=True must be paired with bootstrap=True to have any out-of-bag samples at all.
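A minimal sketch of the bootstrap requirement: in scikit-learn, fitting with oob_score=True while bootstrap is left at its default of False raises a ValueError (the exact message may vary between versions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, random_state=0)

# bootstrap defaults to False in ExtraTreesClassifier, so oob_score=True
# on its own raises an error at fit time
try:
    ExtraTreesClassifier(oob_score=True, random_state=0).fit(X, y)
except ValueError as exc:
    print(f"ValueError: {exc}")

# Pairing oob_score=True with bootstrap=True enables OOB estimation
clf = ExtraTreesClassifier(bootstrap=True, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"OOB score: {clf.oob_score_:.3f}")
```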
By default, oob_score is set to False. It is commonly used in scenarios where you want to monitor model performance during training without using a separate validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train ExtraTreesClassifier with oob_score=True
et_with_oob = ExtraTreesClassifier(n_estimators=100, random_state=42, bootstrap=True, oob_score=True)
et_with_oob.fit(X_train, y_train)
# Train ExtraTreesClassifier with oob_score=False
et_without_oob = ExtraTreesClassifier(n_estimators=100, random_state=42, oob_score=False)
et_without_oob.fit(X_train, y_train)
# Calculate OOB score for the model with oob_score=True
oob_score = et_with_oob.oob_score_
# Evaluate models on test set
y_pred_with_oob = et_with_oob.predict(X_test)
y_pred_without_oob = et_without_oob.predict(X_test)
accuracy_with_oob = accuracy_score(y_test, y_pred_with_oob)
accuracy_without_oob = accuracy_score(y_test, y_pred_without_oob)
print(f"OOB Score: {oob_score:.3f}")
print(f"Test Accuracy (oob_score=True): {accuracy_with_oob:.3f}")
print(f"Test Accuracy (oob_score=False): {accuracy_without_oob:.3f}")
Running the example gives an output like:
OOB Score: 0.935
Test Accuracy (oob_score=True): 0.925
Test Accuracy (oob_score=False): 0.925
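Why out-of-bag samples exist at all: each bootstrap sample is drawn with replacement, so a given training point is left out of one tree's sample with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. A quick check:

```python
import numpy as np

n = 1000  # training set size, as in the example above
# Probability that a given sample is out-of-bag for a single tree
p_oob = (1 - 1 / n) ** n
print(f"P(out-of-bag for one tree) = {p_oob:.4f}")
print(f"1/e = {1 / np.e:.4f}")
```

So with 100 trees, each sample is out-of-bag for roughly 37 of them, which is what makes per-sample OOB predictions possible.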
The key steps in this example are:
- Generate a synthetic classification dataset
- Split the data into train and test sets
- Train two ExtraTreesClassifier models, one with oob_score=True (plus bootstrap=True) and one with oob_score=False
- Calculate the OOB score for the model with oob_score=True
- Evaluate both models on the test set and compare accuracies
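Beyond the aggregate score, a fitted model with oob_score=True also exposes oob_decision_function_, the per-sample class probabilities averaged over only the trees that did not see each sample. As a sketch, the OOB accuracy can be reproduced from it by hand (this should match oob_score_ for classifiers):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)
clf = ExtraTreesClassifier(n_estimators=100, bootstrap=True, oob_score=True,
                           random_state=42)
clf.fit(X, y)

# Predict each training sample from trees that never saw it, then score
oob_pred = np.argmax(clf.oob_decision_function_, axis=1)
manual_oob = np.mean(oob_pred == y)
print(f"Manual OOB accuracy: {manual_oob:.3f}")
print(f"oob_score_:          {clf.oob_score_:.3f}")
```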
Some tips and heuristics for using oob_score:
- Use oob_score=True when you want to monitor model performance without a separate validation set
- OOB score can be used for model selection or hyperparameter tuning
- OOB estimation is generally less expensive than cross-validation
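As an illustration of the tuning point, here is a minimal sketch that picks max_features by OOB score alone, with no validation split or cross-validation loop (the winning setting will depend on the dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)

# Compare candidate max_features values using only the OOB estimate
best = None
for max_features in ["sqrt", "log2", None]:
    clf = ExtraTreesClassifier(n_estimators=100, max_features=max_features,
                               bootstrap=True, oob_score=True, random_state=42)
    clf.fit(X, y)
    print(f"max_features={max_features}: OOB score {clf.oob_score_:.3f}")
    if best is None or clf.oob_score_ > best[1]:
        best = (max_features, clf.oob_score_)

print(f"Best by OOB score: max_features={best[0]}")
```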
Issues to consider:
- Enabling oob_score increases computational cost and memory usage
- OOB score may be less reliable for small datasets or with a small number of trees
- The OOB estimate tends to be pessimistic compared to cross-validation estimates
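The last point is easy to check empirically. A rough comparison of the OOB estimate against 5-fold cross-validation on the same synthetic data (the size and direction of the gap will vary by dataset and random seed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)

# OOB estimate from a single fit on all the data
clf = ExtraTreesClassifier(n_estimators=100, bootstrap=True, oob_score=True,
                           random_state=42)
clf.fit(X, y)

# 5-fold cross-validation estimate for the same configuration
cv_scores = cross_val_score(
    ExtraTreesClassifier(n_estimators=100, bootstrap=True, random_state=42),
    X, y, cv=5)

print(f"OOB score:        {clf.oob_score_:.3f}")
print(f"5-fold CV (mean): {cv_scores.mean():.3f}")
```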