
Scikit-Learn "BaggingClassifier" versus "RandomForestClassifier"

Comparing the performance and use-cases of BaggingClassifier and RandomForestClassifier in ensemble learning.

In scikit-learn, the BaggingClassifier class provides a general ensemble method that can use any base estimator. Key hyperparameters include n_estimators (number of base estimators), estimator (type of base estimator), and max_samples (number of samples to draw from X to train each base estimator).
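For example, here is a minimal sketch of configuring these hyperparameters with a non-default base estimator (the k-NN choice and the specific values are illustrative, not recommendations):

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Illustrative settings: 25 k-NN base estimators, each trained on a
# bootstrap sample containing 50% of the training rows
bagging = BaggingClassifier(
    estimator=KNeighborsClassifier(n_neighbors=5),
    n_estimators=25,
    max_samples=0.5,
    random_state=42
)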

On the other hand, the RandomForestClassifier class is a specific type of bagging that uses decision trees and adds random feature selection. Its key hyperparameters include n_estimators (number of trees in the forest), max_features (number of features to consider when looking for the best split), and bootstrap (whether bootstrap samples are used when building trees).
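Similarly, a minimal sketch of the corresponding RandomForestClassifier hyperparameters (the values are illustrative only):

from sklearn.ensemble import RandomForestClassifier

# Illustrative settings: 100 trees, sqrt(n_features) candidate features per
# split, and bootstrap sampling of the training rows for each tree
random_forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",
    bootstrap=True,
    random_state=42
)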

The main difference is that BaggingClassifier offers flexibility in the choice of base estimator, while RandomForestClassifier is restricted to decision trees but adds random feature subsetting at each split, which further decorrelates the trees and often improves performance for tree ensembles.
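To illustrate that flexibility, the same BaggingClassifier wrapper can be placed around very different base learners; a brief sketch (LogisticRegression and SVC are arbitrary choices here):

from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# The ensemble wrapper stays the same; only the base estimator changes
bagging_logreg = BaggingClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=25, random_state=42)
bagging_svc = BaggingClassifier(estimator=SVC(), n_estimators=25, random_state=42)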

BaggingClassifier is ideal when you want to use a specific base estimator or experiment with different types of base learners. RandomForestClassifier is preferred when you need a robust ensemble method specifically for decision trees.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate BaggingClassifier
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
print(f"BaggingClassifier accuracy: {accuracy_score(y_test, y_pred_bagging):.3f}")
print(f"BaggingClassifier F1 score: {f1_score(y_test, y_pred_bagging):.3f}")

# Fit and evaluate RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=50, random_state=42)
random_forest.fit(X_train, y_train)
y_pred_rf = random_forest.predict(X_test)
print(f"\nRandomForestClassifier accuracy: {accuracy_score(y_test, y_pred_rf):.3f}")
print(f"RandomForestClassifier F1 score: {f1_score(y_test, y_pred_rf):.3f}")

Running the example gives an output like:

BaggingClassifier accuracy: 0.885
BaggingClassifier F1 score: 0.886

RandomForestClassifier accuracy: 0.880
RandomForestClassifier F1 score: 0.882

The steps are as follows:

  1. Generate a synthetic binary classification dataset using make_classification.
  2. Split the data into training and test sets using train_test_split.
  3. Instantiate BaggingClassifier with DecisionTreeClassifier as the base estimator, fit it on the training data, and evaluate its performance on the test set.
  4. Instantiate RandomForestClassifier, fit it on the training data, and evaluate its performance on the test set.
  5. Compare the test set performance (accuracy and F1 score) of both models (a cross-validation variant is sketched after this list).
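For a comparison that is less sensitive to a single train/test split, cross-validation could be used instead; a minimal sketch that reuses the dataset and the two ensembles from the example above (5 folds chosen arbitrarily):

from sklearn.model_selection import cross_val_score

# Mean accuracy across 5 folds for each ensemble (X, y, bagging and
# random_forest are taken from the example above)
print(f"Bagging CV accuracy: {cross_val_score(bagging, X, y, cv=5).mean():.3f}")
print(f"Random forest CV accuracy: {cross_val_score(random_forest, X, y, cv=5).mean():.3f}")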

