
Configure RandomForestClassifier "oob_score" Parameter

The oob_score parameter in scikit-learn’s RandomForestClassifier controls whether to use out-of-bag (OOB) samples to estimate the generalization error.

Random Forest is an ensemble method that trains each tree on a bootstrap sample of the training data. The OOB samples are the data points that were not included in a particular bootstrap sample for a given tree. By setting oob_score=True, the model can use these samples to provide an unbiased estimate of its performance without needing a separate validation set.
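
To make the mechanism concrete, the short sketch below (a standalone illustration using NumPy, not part of scikit-learn's API) draws one bootstrap sample and lists the indices that end up out of bag for that hypothetical tree. On average about a third of the samples (roughly 1/e, or 37%) are out of bag for any given tree, so in a reasonably large forest every sample is scored by many trees.

import numpy as np

rng = np.random.default_rng(42)
n_samples = 10

# A bootstrap sample draws n_samples indices with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# The out-of-bag (OOB) indices are the ones never drawn for this tree
oob_idx = np.setdiff1d(np.arange(n_samples), bootstrap_idx)

print("Bootstrap sample indices:", bootstrap_idx)
print("Out-of-bag indices:", oob_idx)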

The default value for oob_score is False. Setting it to True provides an efficient way to assess the model’s performance but incurs additional computational cost during training.
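
Two details follow from this and are easy to verify (a minimal sketch, separate from the main example below): OOB estimation is only available with bootstrap sampling, so combining oob_score=True with bootstrap=False raises a ValueError at fit time, and the oob_score_ attribute only exists on the fitted model when the option was enabled.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# oob_score=True requires bootstrap sampling (bootstrap=True is the default)
try:
    RandomForestClassifier(n_estimators=10, oob_score=True,
                           bootstrap=False, random_state=42).fit(X, y)
except ValueError as e:
    print("ValueError:", e)

# With oob_score=False (the default) the fitted model has no oob_score_ attribute
rf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
print(hasattr(rf, "oob_score_"))  # False

The full example below compares test set accuracy with and without OOB scoring and prints the OOB score itself.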

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with oob_score=False
rf_no_oob = RandomForestClassifier(n_estimators=100, oob_score=False, random_state=42)
rf_no_oob.fit(X_train, y_train)
y_pred_no_oob = rf_no_oob.predict(X_test)
accuracy_no_oob = accuracy_score(y_test, y_pred_no_oob)

# Train with oob_score=True
rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
y_pred_oob = rf_oob.predict(X_test)
accuracy_oob = accuracy_score(y_test, y_pred_oob)

print(f"Test Accuracy (oob_score=False): {accuracy_no_oob:.3f}")
print(f"Test Accuracy (oob_score=True): {accuracy_oob:.3f}")
print(f"OOB Score: {rf_oob.oob_score_:.3f}")

The output will look like:

Test Accuracy (oob_score=False): 0.920
Test Accuracy (oob_score=True): 0.920
OOB Score: 0.930

The key steps in this example are:

  1. Generate a synthetic binary classification dataset
  2. Split the data into train and test sets
  3. Train a RandomForestClassifier with oob_score=False and evaluate test set accuracy
  4. Train a RandomForestClassifier with oob_score=True, evaluate test set accuracy, and print the OOB score
  5. Compare the test set accuracies and the OOB score
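
When oob_score=True, the fitted forest also exposes oob_decision_function_, which holds each training sample's OOB class probabilities. As a short follow-up (reusing rf_oob and y_train from the example above), the reported OOB score can be reproduced from it:

import numpy as np

# OOB class probabilities for each training sample, averaged over the
# trees for which that sample was out of bag (shape: n_samples x n_classes)
oob_proba = rf_oob.oob_decision_function_
print(oob_proba.shape)

# Turning these probabilities into predictions reproduces the OOB score
# (the classes here are 0 and 1, so the argmax index is the class label)
oob_pred = np.argmax(oob_proba, axis=1)
print(f"Manual OOB accuracy: {np.mean(oob_pred == y_train):.3f}")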

Some tips and heuristics for using oob_score:

  - OOB estimation requires bootstrap sampling; it is only available with bootstrap=True (the default).
  - For classifiers, oob_score_ is the accuracy of the OOB predictions on the training data, so it can be compared directly with a held-out accuracy such as the one above.
  - The OOB estimate becomes more stable as n_estimators grows, because each sample is scored by more trees in a larger forest.
  - The OOB score can guide choices such as n_estimators without holding out a separate validation set, as shown in the sketch below.
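
As a minimal sketch of that last tip (reusing X_train and y_train from the example above), warm_start=True lets the forest grow incrementally so the OOB score can be tracked as trees are added:

# Grow one forest incrementally and watch the OOB score stabilize
rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42)

for n_estimators in [25, 50, 100, 200, 400]:
    rf.set_params(n_estimators=n_estimators)
    rf.fit(X_train, y_train)  # adds new trees, keeps the ones already fitted
    print(f"n_estimators={n_estimators:4d}  OOB score: {rf.oob_score_:.3f}")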

Issues to consider:

  - The OOB score is computed from the training data, so it is a convenient internal estimate rather than a replacement for evaluation on a truly held-out test set.
  - With very few trees, some training samples may never be out of bag, which makes the estimate noisy (scikit-learn warns when this happens).
  - Enabling oob_score adds computation during fit, which can matter for large datasets or large forests.
