The `bootstrap` parameter in scikit-learn's `RandomForestClassifier` determines whether bootstrap sampling is used when building trees.
Bootstrap sampling involves creating multiple datasets by repeatedly sampling with replacement from the original dataset. This introduces randomness and diversity into the ensemble.
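To make this concrete, here is a minimal sketch of drawing one bootstrap sample with NumPy (not part of the example below). Because sampling is with replacement, a sample the same size as the dataset typically contains only about 63% of the distinct original rows; the rest are duplicates:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)  # a toy "dataset" of 100 row indices

# Sample with replacement: same size as the original, duplicates allowed
sample = rng.choice(data, size=data.size, replace=True)

unique_fraction = np.unique(sample).size / data.size
print(f"Distinct original rows in the sample: {unique_fraction:.0%}")
```

The rows left out of a given tree's sample are its "out-of-bag" rows, which random forests can use for internal validation.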
By default, `bootstrap` is set to `True`, meaning each tree is trained on a bootstrap sample of the training data. Setting it to `False` means every tree is trained on the entire original dataset.
Using bootstrap sampling (`bootstrap=True`) generally improves performance by reducing variance, though it can slightly increase bias. Disabling it (`bootstrap=False`) may be beneficial for high-dimensional datasets.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with bootstrap=True and bootstrap=False
rf_bootstrap = RandomForestClassifier(bootstrap=True, random_state=42)
rf_no_bootstrap = RandomForestClassifier(bootstrap=False, random_state=42)
rf_bootstrap.fit(X_train, y_train)
rf_no_bootstrap.fit(X_train, y_train)

# Evaluate accuracy
y_pred_bootstrap = rf_bootstrap.predict(X_test)
y_pred_no_bootstrap = rf_no_bootstrap.predict(X_test)
accuracy_bootstrap = accuracy_score(y_test, y_pred_bootstrap)
accuracy_no_bootstrap = accuracy_score(y_test, y_pred_no_bootstrap)
print(f"Accuracy with bootstrap: {accuracy_bootstrap:.3f}")
print(f"Accuracy without bootstrap: {accuracy_no_bootstrap:.3f}")
```
Running the example gives an output like:
```
Accuracy with bootstrap: 0.855
Accuracy without bootstrap: 0.850
```
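A related point not shown in the example: when `bootstrap=True`, the rows each tree never saw can serve as a built-in validation set via the `oob_score` parameter. A minimal sketch, reusing the same synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# oob_score requires bootstrap=True: each sample is scored only by the
# trees that did not see it during training (its "out-of-bag" trees)
rf = RandomForestClassifier(bootstrap=True, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"Out-of-bag accuracy estimate: {rf.oob_score_:.3f}")
```

This gives a rough generalization estimate without holding out a separate test set, and it is unavailable when `bootstrap=False`.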
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train `RandomForestClassifier` with `bootstrap=True` and `bootstrap=False`
- Evaluate the accuracy of each model on the test set
Tips and heuristics for setting `bootstrap`:
- Use the default `bootstrap=True` unless you have a specific reason not to
- Disabling bootstrap sampling may be beneficial for high-dimensional datasets
- Bootstrap sampling reduces variance but can slightly increase bias
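One rough way to probe the variance point (an extra check, not part of the original example) is to compare cross-validated scores for the two settings; the spread of scores across folds is an indirect proxy for variance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

for use_bootstrap in (True, False):
    rf = RandomForestClassifier(bootstrap=use_bootstrap, random_state=42)
    scores = cross_val_score(rf, X, y, cv=5)  # 5-fold CV accuracy
    print(f"bootstrap={use_bootstrap}: "
          f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

On a single small dataset the difference can be within noise, so treat this as a diagnostic rather than a definitive comparison.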
Issues to consider:
- Bootstrap sampling adds a small overhead for drawing a fresh sample per tree; by default each sample is the same size as the training set, so per-tree training cost is similar either way
- The `bootstrap` parameter interacts with `max_samples`, which controls the size of each bootstrap sample; `max_samples` only applies when `bootstrap=True`
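As a sketch of that interaction (the 0.5 value here is illustrative): `max_samples` caps how many rows each tree's bootstrap sample draws, and recent scikit-learn versions reject it outright when `bootstrap=False`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Each tree is trained on a bootstrap sample of 50% of the rows
rf = RandomForestClassifier(bootstrap=True, max_samples=0.5, random_state=42)
rf.fit(X, y)
print(f"Training accuracy with max_samples=0.5: {rf.score(X, y):.3f}")

# Setting max_samples without bootstrap sampling is an error
try:
    RandomForestClassifier(bootstrap=False, max_samples=0.5).fit(X, y)
except ValueError as e:
    print(f"bootstrap=False with max_samples raises: {e}")
```

Smaller `max_samples` values make the trees more diverse (and cheaper to train) at the cost of each tree seeing less data.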