The random_state parameter in scikit-learn's BaggingClassifier controls the randomness of the ensemble learning process, ensuring reproducibility of results.
Bagging (Bootstrap Aggregating) is an ensemble method that creates multiple subsets of the original dataset, trains a classifier on each subset, and combines their predictions. The random_state parameter affects both the random sampling of these subsets and the randomness of the base estimators.
Setting random_state to a fixed integer ensures that the random number generator produces the same sequence of random numbers each time the code is run, making results reproducible.
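The mechanism is the same one used by NumPy's random number generators, which scikit-learn relies on internally; a minimal sketch:

```python
import numpy as np

# Two generators seeded with the same integer yield identical sequences
rng1 = np.random.RandomState(42)
rng2 = np.random.RandomState(42)
seq1 = rng1.randint(0, 100, size=5)
seq2 = rng2.randint(0, 100, size=5)
print(np.array_equal(seq1, seq2))  # True
```

Passing an integer to random_state seeds such a generator once, so every run replays the same stream of draws.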
The default value for random_state is None, which means a different random seed is used each time. For reproducibility, it's common to use any fixed integer value (e.g., 42).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different random_state values
random_states = [None, 0, 42, 100]
for rs in random_states:
    clf = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=10, random_state=rs)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"random_state={rs}, Accuracy: {accuracy:.4f}")
# Demonstrate reproducibility: two models with the same seed learn identical trees
clf1 = BaggingClassifier(random_state=42)
clf2 = BaggingClassifier(random_state=42)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)
identical = all((clf1.estimators_[i].tree_.feature ==
                 clf2.estimators_[i].tree_.feature).all()
                for i in range(len(clf1.estimators_)))
print(f"Models identical: {identical}")
Running the example gives an output like:
random_state=None, Accuracy: 0.8900
random_state=0, Accuracy: 0.8450
random_state=42, Accuracy: 0.8850
random_state=100, Accuracy: 0.8900
Models identical: True
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train BaggingClassifier models with different random_state values
- Evaluate the accuracy of each model on the test set
- Demonstrate reproducibility by comparing two models with the same random_state
Some tips for setting random_state:
- Use a fixed integer value for reproducibility in research or production environments
- Keep the same random_state value across your entire pipeline for consistency
- Document the random_state value used in your experiments or deployed models
Issues to consider:
- Using None as random_state can lead to different results each run, which may be desirable for some applications like cross-validation
- The same random_state value doesn't guarantee identical results across different machines or scikit-learn versions
- Fixing random_state may hide the variability in model performance due to random initialization