The oob_score parameter in scikit-learn's RandomForestRegressor allows using out-of-bag (OOB) samples to estimate the generalization error of the model.
In a random forest, each tree is trained on a bootstrap sample of the original dataset: the same number of samples, drawn with replacement. On average, roughly one-third of the samples (about 37%) are left out of each bootstrap sample and play no part in constructing that tree. These left-out samples are called out-of-bag samples.
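The "roughly one-third" figure follows from the bootstrap itself: a given sample is missed by a single draw with probability (1 - 1/n), so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368. A quick simulation (an illustrative sketch, not part of scikit-learn) confirms this:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Draw one bootstrap sample: n draws with replacement from n indices
bootstrap_idx = rng.integers(0, n, size=n)

# Samples never drawn are the out-of-bag (OOB) samples for this tree
oob_mask = np.ones(n, dtype=bool)
oob_mask[bootstrap_idx] = False

print(f"OOB fraction: {oob_mask.mean():.3f}")  # close to 1/e, about 0.368
```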
Setting oob_score=True causes the model to use these OOB samples to estimate its own performance, providing a built-in validation estimate without holding out additional data. This is useful when data is limited and you can't afford a separate validation set, though it comes at the cost of some additional computation.
By default, oob_score is set to False, meaning the OOB samples are not used for performance estimation.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train two models, one with oob_score=True and one with default False
rf_with_oob = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf_without_oob = RandomForestRegressor(n_estimators=100, random_state=42)
rf_with_oob.fit(X_train, y_train)
rf_without_oob.fit(X_train, y_train)
# Evaluate performance on test set
y_pred_with_oob = rf_with_oob.predict(X_test)
y_pred_without_oob = rf_without_oob.predict(X_test)
mse_with_oob = mean_squared_error(y_test, y_pred_with_oob)
mse_without_oob = mean_squared_error(y_test, y_pred_without_oob)
print(f"MSE with oob_score=True: {mse_with_oob:.3f}")
print(f"MSE with oob_score=False: {mse_without_oob:.3f}")
print(f"OOB score estimate: {rf_with_oob.oob_score_:.3f}")
Running the example gives an output like:
MSE with oob_score=True: 208.093
MSE with oob_score=False: 208.093
OOB score estimate: 0.958
The key steps in this example are:
- Generate a synthetic regression dataset with informative features and some noise
- Split the data into train and test sets
- Train two RandomForestRegressor models, one with oob_score=True and one with the default False
- Evaluate the performance of both models on the test set using mean squared error
- Compare the OOB score estimate from the model with oob_score=True to the test set performance
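One caveat when making that comparison: for RandomForestRegressor, oob_score_ is an R² value, so it is not directly comparable to the MSE numbers above. To compare like with like, compute R² on the test set as well (a small, self-contained extension of the example above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Same synthetic dataset and split as the main example
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X_train, y_train)

# oob_score_ is R^2 on the OOB samples; compare it with R^2 on the test set
test_r2 = r2_score(y_test, rf.predict(X_test))
print(f"OOB R^2:  {rf.oob_score_:.3f}")
print(f"Test R^2: {test_r2:.3f}")
```

The two R² values should land close together, which is the point of the OOB estimate: it approximates held-out performance using only the training data.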
Some tips and heuristics for using oob_score:
- Use OOB score when you have a limited amount of data and can’t afford a separate validation set
- OOB score can provide a reasonable estimate of generalization error, but isn’t a perfect substitute for a true validation set
- Computational cost is higher when oob_score=True due to the additional calculations required
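If you want the OOB estimate on the same scale as the test MSE, RandomForestRegressor (when fitted with oob_score=True) also exposes oob_prediction_, the aggregated OOB prediction for each training sample. A sketch built on the same synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# oob_prediction_ holds each training sample's prediction from the trees
# that did not see it, so an OOB MSE needs no held-out data at all
oob_mse = mean_squared_error(y, rf.oob_prediction_)
print(f"OOB MSE: {oob_mse:.3f}")
```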
Issues to consider:
- OOB score is less reliable with small datasets because each tree only uses a subset of the data
- The estimate can be biased for some types of data or hyperparameter settings; for example, with few trees each sample's OOB prediction aggregates only the minority of trees that excluded it, which tends to make the OOB error slightly pessimistic
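The small-dataset caveat can be checked empirically by comparing the OOB estimate against a larger held-out set at different training sizes. The snippet below is an illustrative sketch (exact numbers will vary with the random seed); the expectation is that the OOB-versus-holdout gap shrinks as the training set grows:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=5000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)
# Reserve a large holdout set to serve as a trustworthy reference
X_pool, X_holdout, y_pool, y_holdout = train_test_split(
    X, y, test_size=0.5, random_state=42)

for n in (50, 200, 2000):
    rf = RandomForestRegressor(n_estimators=100, oob_score=True,
                               random_state=42)
    rf.fit(X_pool[:n], y_pool[:n])
    holdout_r2 = r2_score(y_holdout, rf.predict(X_holdout))
    gap = abs(rf.oob_score_ - holdout_r2)
    print(f"n={n:4d}  OOB R^2={rf.oob_score_:.3f}  "
          f"holdout R^2={holdout_r2:.3f}  gap={gap:.3f}")
```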