The `oob_score` parameter in scikit-learn’s `BaggingClassifier` determines whether to use out-of-bag samples to estimate the generalization accuracy.
Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple subsets of the training data, trains a base estimator on each subset, and combines their predictions. Out-of-bag (OOB) samples are those not used for training a particular base estimator.
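To make the OOB idea concrete, here is a minimal NumPy sketch (independent of the example below) showing that a single bootstrap sample of size n leaves roughly 37% of the rows out-of-bag, since each row is missed with probability (1 - 1/n)^n ≈ 1/e:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# One bootstrap sample: n indices drawn with replacement
boot_idx = rng.integers(0, n, size=n)

# Out-of-bag rows are those that were never drawn
oob_mask = ~np.isin(np.arange(n), boot_idx)
print(f"OOB fraction: {oob_mask.mean():.3f}")  # roughly 0.368, i.e. about 1/e
```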
When `oob_score` is set to `True`, the classifier uses OOB samples to estimate the generalization accuracy without the need for a separate validation set.
The default value for `oob_score` is `False`. Setting it to `True` is common when you want an estimate of the model’s generalization performance without setting aside a separate validation set.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train BaggingClassifier with oob_score=False
bc_without_oob = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100,
                                   random_state=42, oob_score=False)
bc_without_oob.fit(X_train, y_train)

# Create and train BaggingClassifier with oob_score=True
bc_with_oob = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100,
                                random_state=42, oob_score=True)
bc_with_oob.fit(X_train, y_train)

# Evaluate models
y_pred_without_oob = bc_without_oob.predict(X_test)
y_pred_with_oob = bc_with_oob.predict(X_test)

print("BaggingClassifier without OOB:")
print(f"Test accuracy: {accuracy_score(y_test, y_pred_without_oob):.4f}")
print("\nBaggingClassifier with OOB:")
print(f"Test accuracy: {accuracy_score(y_test, y_pred_with_oob):.4f}")
print(f"OOB score: {bc_with_oob.oob_score_:.4f}")
```
Running the example gives an output like:
```
BaggingClassifier without OOB:
Test accuracy: 0.8900

BaggingClassifier with OOB:
Test accuracy: 0.8900
OOB score: 0.8662
```
The key steps in this example are:
- Generate a synthetic classification dataset
- Split the data into train and test sets
- Create two `BaggingClassifier` instances, one with `oob_score=False` and another with `oob_score=True`
- Train both models and evaluate their performance on the test set
- For the model with `oob_score=True`, report the OOB score
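Beyond the scalar `oob_score_`, a `BaggingClassifier` fitted with `oob_score=True` also exposes `oob_decision_function_`, the class-probability estimates for each training sample computed only from the estimators that did not see that sample. Continuing from the example above (reusing `bc_with_oob`), you can inspect it like this:

```python
# Per-sample OOB class-probability estimates, shape (n_train_samples, n_classes);
# each row is averaged only over estimators that left that sample out-of-bag
oob_probs = bc_with_oob.oob_decision_function_
print(oob_probs.shape)  # (800, 2) for this 80/20 split of 1000 samples
print(oob_probs[:3])    # OOB probabilities for the first three training samples
```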
Some tips for using `oob_score`:
- Enable `oob_score` when you want an estimate of model performance without a separate validation set
- The OOB score can be used for model selection or hyperparameter tuning (see the sketch after this list)
- OOB estimation is generally less biased than using the training error
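As an example of OOB-based tuning, the sketch below compares a few `max_samples` values (the bootstrap size as a fraction of the training set) using only `oob_score_`, with no validation split. The choice of `max_samples` as the tuned hyperparameter is just illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=0, random_state=42)

# Rank candidate settings by their OOB score alone
for max_samples in (0.5, 0.7, 1.0):
    model = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=100, max_samples=max_samples,
                              oob_score=True, random_state=42)
    model.fit(X, y)
    print(f"max_samples={max_samples}: OOB score = {model.oob_score_:.4f}")
```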
Issues to consider:
- Enabling `oob_score` increases computational cost and memory usage
- The OOB score may be less reliable for small datasets or with a small number of base estimators (see the sketch below)
- The OOB estimate tends to be pessimistic and may underestimate the true performance
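To see the small-ensemble caveat in practice, the following sketch fits a bagger with only two estimators; with so few bootstrap samples, some training rows are never out-of-bag, and scikit-learn emits a warning to that effect (the exact warning text may vary between versions):

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=42)

# With only 2 bootstrap samples, many rows appear in both,
# so they have no OOB prediction at all
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    small = BaggingClassifier(estimator=DecisionTreeClassifier(),
                              n_estimators=2, oob_score=True,
                              random_state=42).fit(X, y)

for w in caught:
    if issubclass(w.category, UserWarning):
        print(w.message)  # warns that some inputs have no OOB score
print(f"OOB score with 2 estimators: {small.oob_score_:.4f}")
```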