
Configure RandomForestRegressor "oob_score" Parameter

The oob_score parameter in scikit-learn’s RandomForestRegressor allows using out-of-bag (OOB) samples to estimate the generalization error of the model.

In a random forest, each tree is trained on a bootstrap sample drawn with replacement from the original dataset. On average, about one-third of the samples (more precisely, roughly 37%, or 1/e) are left out of a given tree's bootstrap sample and play no part in constructing that tree. These left-out samples are called out-of-bag samples.
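The out-of-bag fraction is easy to verify with a quick simulation (a standalone sketch, not part of scikit-learn): draw a bootstrap sample and count the indices that were never selected.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# A bootstrap sample: n_samples indices drawn with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Out-of-bag samples are those never drawn for this tree
oob_mask = np.ones(n_samples, dtype=bool)
oob_mask[bootstrap_idx] = False
oob_fraction = oob_mask.mean()

print(f"OOB fraction: {oob_fraction:.3f}")  # close to 1/e ≈ 0.368
```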

Setting oob_score=True will cause the model to use these OOB samples to estimate its own performance, providing a built-in validation set without the need to hold out additional data. This can be useful when you have a limited amount of data and can’t afford a separate validation set. However, this comes at the cost of additional computation.

By default, oob_score is set to False, meaning the OOB samples are not used for performance estimation.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train two models, one with oob_score=True and one with default False
rf_with_oob = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf_without_oob = RandomForestRegressor(n_estimators=100, random_state=42)

rf_with_oob.fit(X_train, y_train)
rf_without_oob.fit(X_train, y_train)

# Evaluate performance on test set
y_pred_with_oob = rf_with_oob.predict(X_test)
y_pred_without_oob = rf_without_oob.predict(X_test)

mse_with_oob = mean_squared_error(y_test, y_pred_with_oob)
mse_without_oob = mean_squared_error(y_test, y_pred_without_oob)

print(f"MSE with oob_score=True: {mse_with_oob:.3f}")
print(f"MSE with oob_score=False: {mse_without_oob:.3f}")
# For regression, oob_score_ is the R^2 computed on the out-of-bag samples
print(f"OOB score estimate: {rf_with_oob.oob_score_:.3f}")

Running the example gives an output like:

MSE with oob_score=True: 208.093
MSE with oob_score=False: 208.093
OOB score estimate: 0.958

Note that the test-set MSE is identical for both models: setting oob_score=True does not change how the forest is trained or how it predicts; it only computes an extra performance estimate from the out-of-bag samples. For regression, that estimate is the R² score on those samples.

The key steps in this example are:

  1. Generate a synthetic regression dataset with informative features and some noise
  2. Split the data into train and test sets
  3. Train two RandomForestRegressor models, one with oob_score=True and one with default False
  4. Evaluate the performance of both models on the test set using mean squared error
  5. Compare the OOB score estimate from the model with oob_score=True to the test set performance
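For a like-for-like comparison with the test-set MSE, a fitted model with oob_score=True also exposes per-sample out-of-bag predictions via the oob_prediction_ attribute, so an OOB MSE can be computed directly. A short sketch on the same synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)

# oob_prediction_ holds each training sample's prediction from
# only the trees that did not see it during training
oob_mse = mean_squared_error(y, rf.oob_prediction_)
print(f"OOB MSE: {oob_mse:.3f}")
print(f"OOB R^2: {rf.oob_score_:.3f}")
```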

Some tips and heuristics for using oob_score:

  * Use it when data is scarce and you cannot spare a separate validation set; the OOB estimate provides validation essentially for free.
  * Make n_estimators large enough that every sample is out-of-bag for several trees; with very few trees the estimate is noisy.
  * oob_score requires bootstrap=True (the default); combining it with bootstrap=False raises an error.

Issues to consider:

  * Each OOB prediction averages only the subset of trees (roughly 37%) that did not see that sample, so the estimate can be slightly pessimistic relative to the full forest.
  * Computing the OOB score adds overhead to training.
  * The OOB estimate is computed on the training distribution; it does not replace a final held-out test set for reporting results.
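How quickly the OOB estimate stabilizes with forest size can be checked empirically. This sketch (an illustrative workflow, not part of the example above) uses warm_start=True to grow one forest incrementally and track the OOB R² as trees are added:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
                       noise=0.1, random_state=42)

# warm_start=True reuses the existing trees when n_estimators grows,
# so each fit only adds the new trees
rf = RandomForestRegressor(oob_score=True, warm_start=True, random_state=42)
for n in [25, 50, 100, 200]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"n_estimators={n:4d}  OOB R^2={rf.oob_score_:.3f}")
```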
