
Scikit-Learn "BaggingRegressor" versus "RandomForestRegressor"

Evaluating BaggingRegressor and RandomForestRegressor on the same regression task helps clarify their differences and typical use cases.

BaggingRegressor is a versatile ensemble algorithm that trains each base estimator on a bootstrapped sample and can wrap any regressor as its base estimator. Key hyperparameters include n_estimators (number of base estimators), estimator (the base estimator to wrap; named base_estimator before scikit-learn 1.2), and max_samples (fraction of samples to draw for each base estimator).
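As a quick illustration, here is a minimal sketch that sets these hyperparameters explicitly (the values are arbitrary, and the estimator parameter name assumes scikit-learn 1.2 or later):

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Configure the key BaggingRegressor hyperparameters explicitly
bagging = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # base estimator to wrap (any regressor works)
    n_estimators=50,                    # number of base estimators in the ensemble
    max_samples=0.8,                    # fraction of training samples drawn per estimator
    random_state=42,
)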

RandomForestRegressor, on the other hand, is a specialized ensemble method where the base estimators are decision trees, and it incorporates feature randomness in tree construction. Important hyperparameters are n_estimators (number of trees), max_features (number of features to consider when looking for the best split), and max_depth (maximum depth of the tree).
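For comparison, an equivalent sketch for RandomForestRegressor (again with illustrative values; a float max_features is interpreted as a fraction of the features):

from sklearn.ensemble import RandomForestRegressor

# Configure the key RandomForestRegressor hyperparameters explicitly
random_forest = RandomForestRegressor(
    n_estimators=100,  # number of trees in the forest
    max_features=0.5,  # fraction of features considered at each split
    max_depth=10,      # cap tree depth to limit overfitting
    random_state=42,
)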

The main difference is that BaggingRegressor offers flexibility in choosing the base estimator, which can be useful for custom models. RandomForestRegressor generally provides better performance with less tuning because it combines decision-tree base estimators with per-split feature randomness.

BaggingRegressor is useful when you want to use a custom base estimator, while RandomForestRegressor is preferred for its robustness and ease of use with decision trees.
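To make that distinction concrete, the sketch below wraps a non-tree model (KNeighborsRegressor, chosen here purely for illustration) in a BaggingRegressor, something RandomForestRegressor cannot do:

from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor

# Bag a non-tree base estimator -- not possible with RandomForestRegressor
knn_bagging = BaggingRegressor(
    estimator=KNeighborsRegressor(n_neighbors=5),
    n_estimators=25,
    random_state=42,
)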

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate BaggingRegressor
bagging = BaggingRegressor(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
print(f"BaggingRegressor MSE: {mean_squared_error(y_test, y_pred_bagging):.3f}")

# Fit and evaluate RandomForestRegressor
random_forest = RandomForestRegressor(n_estimators=50, random_state=42)
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)
print(f"RandomForestRegressor MSE: {mean_squared_error(y_test, y_pred_rf):.3f}")

Running the example gives an output like:

BaggingRegressor MSE: 7111.178
RandomForestRegressor MSE: 7270.551

The steps are as follows:

  1. Generate a synthetic regression dataset using make_regression.
  2. Split the data into training and test sets using train_test_split.
  3. Instantiate BaggingRegressor with 50 base estimators, fit it on the training data, and evaluate its performance on the test set.
  4. Instantiate RandomForestRegressor with 50 trees, fit it on the training data, and evaluate its performance on the test set.
  5. Compare the test set performance (mean squared error) of both models and discuss the results; an optional cross-validation sketch follows below.
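
A single train/test split can be noisy, so a cross-validated comparison is often steadier. The following sketch (an optional addition that reuses X, y, bagging, and random_forest from the script above) scores both models with 5-fold cross-validation:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated MSE for a more stable comparison
for name, model in [("Bagging", bagging), ("RandomForest", random_forest)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"{name} CV MSE: {-scores.mean():.3f} (std {scores.std():.3f})")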

