The oob_score parameter in scikit-learn's ExtraTreesRegressor enables out-of-bag (OOB) error estimation during training.
Extra Trees (Extremely Randomized Trees) is an ensemble method similar to Random Forests, but with additional randomization in the tree-building process. It creates multiple decision trees and aggregates their predictions.
The oob_score parameter, when set to True, uses the samples left out of each tree's bootstrap sample to estimate the model's generalization performance. This provides an approximately unbiased estimate without needing a separate validation set. Note that ExtraTreesRegressor defaults to bootstrap=False, so bootstrap=True must also be set for OOB scoring to work.
By default, oob_score is set to False. It's commonly enabled when you want to monitor model performance during training without using a separate validation set.
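Because ExtraTreesRegressor does not bootstrap by default, enabling oob_score on its own fails at fit time. A minimal sketch of that failure mode (on a small synthetic dataset, for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

# ExtraTreesRegressor defaults to bootstrap=False, so requesting an
# OOB score without also setting bootstrap=True raises a ValueError
try:
    ExtraTreesRegressor(oob_score=True, random_state=0).fit(X, y)
    error_message = None
except ValueError as err:
    error_message = str(err)
    print(f"ValueError: {error_message}")
```

Passing bootstrap=True alongside oob_score=True, as in the full example below, avoids this error.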
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and fit ExtraTreesRegressor with oob_score=True
et_oob = ExtraTreesRegressor(n_estimators=100, bootstrap=True, oob_score=True, random_state=42)
et_oob.fit(X_train, y_train)
# Print OOB score
print(f"OOB Score: {et_oob.oob_score_:.3f}")
# Evaluate on test set
y_pred = et_oob.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {mse:.3f}")
# Compare with model without OOB scoring
et_no_oob = ExtraTreesRegressor(n_estimators=100, bootstrap=True, oob_score=False, random_state=42)
et_no_oob.fit(X_train, y_train)
y_pred_no_oob = et_no_oob.predict(X_test)
mse_no_oob = mean_squared_error(y_test, y_pred_no_oob)
print(f"Test MSE (without OOB): {mse_no_oob:.3f}")
Running the example gives an output like:
OOB Score: 0.859
Test MSE: 2122.805
Test MSE (without OOB): 2122.805
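Note that the OOB Score (0.859) and the Test MSE are not in the same units: for regressors, oob_score_ is the R² coefficient of determination, while the test metric here is mean squared error. To compare like with like, the fitted model also exposes oob_prediction_, the raw out-of-bag predictions, from which any metric can be computed. A short sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

et = ExtraTreesRegressor(n_estimators=100, bootstrap=True,
                         oob_score=True, random_state=42)
et.fit(X, y)

# oob_score_ is R^2 for regressors; oob_prediction_ holds the
# per-sample OOB predictions, so an OOB MSE can be computed directly
oob_mse = mean_squared_error(y, et.oob_prediction_)
print(f"OOB R^2: {et.oob_score_:.3f}")
print(f"OOB MSE: {oob_mse:.3f}")
```

The OOB MSE computed this way is directly comparable to a test-set MSE.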
Key steps in this example:
- Generate a synthetic regression dataset
- Split data into train and test sets
- Create ExtraTreesRegressor with bootstrap=True and oob_score=True
- Fit model and print OOB score
- Compare OOB score with test set performance
- Create and evaluate model with oob_score=False for comparison
Tips for using oob_score:
- Enable when you want to monitor model performance without a separate validation set
- OOB score is typically slightly pessimistic compared to the true test error
- Increases computation time and memory usage, especially for large datasets
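One practical way to use OOB scoring as a monitor is to grow the forest incrementally with warm_start=True and read oob_score_ after each fit; this is a sketch, with the n_estimators schedule chosen arbitrarily for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# warm_start=True reuses the trees already fitted, so each call to
# fit only adds new trees; oob_score_ is recomputed after each fit
et = ExtraTreesRegressor(bootstrap=True, oob_score=True,
                         warm_start=True, random_state=42)
for n in [25, 50, 100, 200]:
    et.set_params(n_estimators=n)
    et.fit(X, y)
    print(f"n_estimators={n:3d}  OOB score: {et.oob_score_:.3f}")
```

Watching how the OOB score levels off can help choose a reasonable number of trees without holding out a validation set.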
Issues to consider:
- OOB estimation may be less reliable for small datasets or models with few trees
- OOB score doesn’t replace proper cross-validation for final model evaluation
- Enabling oob_score slightly increases training time and memory usage
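To see how the OOB estimate lines up with cross-validation, you can compare oob_score_ against cross_val_score, which also reports R² for regressors by default. A minimal sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

et = ExtraTreesRegressor(n_estimators=100, bootstrap=True,
                         oob_score=True, random_state=42)
et.fit(X, y)

# Default scoring for regressors is R^2, matching oob_score_
cv_scores = cross_val_score(
    ExtraTreesRegressor(n_estimators=100, bootstrap=True, random_state=42),
    X, y, cv=5)

print(f"OOB R^2: {et.oob_score_:.3f}")
print(f"5-fold CV R^2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The two estimates are usually close, but cross-validation remains the more thorough choice for final model evaluation.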