The random_state parameter in scikit-learn's BaggingClassifier controls the randomness of the ensemble learning process, ensuring reproducibility of results.
Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple subsets of the original dataset, trains a classifier on each subset, and combines their predictions. The random_state parameter affects both the random sampling of these subsets and the randomness of the base estimators.
Setting random_state to a fixed integer ensures that the random number generator produces the same sequence of random numbers each time the code is run, making results reproducible.
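The mechanism is the same one used by NumPy's random number generators, which scikit-learn relies on internally; a minimal sketch:

```python
import numpy as np

# Two generators seeded with the same integer yield identical sequences
rng1 = np.random.RandomState(42)
rng2 = np.random.RandomState(42)
seq1 = rng1.randint(0, 100, size=5)
seq2 = rng2.randint(0, 100, size=5)
print(np.array_equal(seq1, seq2))  # True
```

Passing an integer to random_state seeds such a generator once, so every run replays the same stream of draws.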
The default value for random_state is None, which means a different random seed is used each time. For reproducibility, it's common to use any fixed integer value (e.g., 42).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different random_state values
random_states = [None, 0, 42, 100]
for rs in random_states:
    clf = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=10, random_state=rs)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"random_state={rs}, Accuracy: {accuracy:.4f}")
# Demonstrate reproducibility: two models with the same seed learn identical trees
clf1 = BaggingClassifier(random_state=42)
clf2 = BaggingClassifier(random_state=42)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
identical = all((clf1.estimators_[i].tree_.feature ==
                 clf2.estimators_[i].tree_.feature).all()
                for i in range(len(clf1.estimators_)))
print(f"Models identical: {identical}")
Running the example gives an output like:
random_state=None, Accuracy: 0.8900
random_state=0, Accuracy: 0.8450
random_state=42, Accuracy: 0.8850
random_state=100, Accuracy: 0.8900
Models identical: True
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train BaggingClassifier models with different random_state values
- Evaluate the accuracy of each model on the test set
- Demonstrate reproducibility by comparing two models with the same random_state
Some tips for setting random_state:
- Use a fixed integer value for reproducibility in research or production environments
- Keep the same random_state value across your entire pipeline for consistency
- Document the random_state value used in your experiments or deployed models
Issues to consider:
- Using None as random_state can lead to different results each run, which may be desirable for some applications like cross-validation
- The same random_state value doesn't guarantee identical results across different machines or scikit-learn versions
- Fixing random_state may hide the variability in model performance due to random initialization