The `bootstrap` parameter in scikit-learn's `RandomForestClassifier` determines whether bootstrap sampling is used when building trees.
Bootstrap sampling involves creating multiple datasets by repeatedly sampling with replacement from the original dataset. This introduces randomness and diversity into the ensemble.
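To make this concrete, here is a minimal sketch of drawing one bootstrap sample with NumPy (not part of the example below). Because sampling is with replacement, a sample the same size as the dataset typically contains only about 63% of the distinct original rows; the rest are duplicates:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)  # a toy "dataset" of 100 row indices

# Sample with replacement: same size as the original, duplicates allowed
sample = rng.choice(data, size=data.size, replace=True)

unique_fraction = np.unique(sample).size / data.size
print(f"Distinct original rows in the sample: {unique_fraction:.0%}")
```

The rows left out of a given tree's sample are its "out-of-bag" rows, which random forests can use for internal validation.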
By default, `bootstrap` is set to `True`, meaning each tree is trained on a bootstrap sample of the training data. Setting it to `False` means every tree is trained on the entire original dataset.
Using bootstrap sampling (`bootstrap=True`) generally improves performance by reducing variance, though it can slightly increase bias. Disabling it (`bootstrap=False`) may be beneficial for high-dimensional datasets.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with bootstrap=True and bootstrap=False
rf_bootstrap = RandomForestClassifier(bootstrap=True, random_state=42)
rf_no_bootstrap = RandomForestClassifier(bootstrap=False, random_state=42)
rf_bootstrap.fit(X_train, y_train)
rf_no_bootstrap.fit(X_train, y_train)

# Evaluate accuracy
y_pred_bootstrap = rf_bootstrap.predict(X_test)
y_pred_no_bootstrap = rf_no_bootstrap.predict(X_test)
accuracy_bootstrap = accuracy_score(y_test, y_pred_bootstrap)
accuracy_no_bootstrap = accuracy_score(y_test, y_pred_no_bootstrap)
print(f"Accuracy with bootstrap: {accuracy_bootstrap:.3f}")
print(f"Accuracy without bootstrap: {accuracy_no_bootstrap:.3f}")
```
Running the example gives an output like:
```
Accuracy with bootstrap: 0.855
Accuracy without bootstrap: 0.850
```
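A related point not shown in the example: when `bootstrap=True`, the rows each tree never saw can serve as a built-in validation set via the `oob_score` parameter. A minimal sketch, reusing the same synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# oob_score requires bootstrap=True: each sample is scored only by the
# trees that did not see it during training (its "out-of-bag" trees)
rf = RandomForestClassifier(bootstrap=True, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"Out-of-bag accuracy estimate: {rf.oob_score_:.3f}")
```

This gives a rough generalization estimate without holding out a separate test set, and it is unavailable when `bootstrap=False`.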
The key steps in this example are:
- Generate a synthetic multiclass classification dataset
- Split the data into train and test sets
- Train `RandomForestClassifier` with `bootstrap=True` and `bootstrap=False`
- Evaluate the accuracy of each model on the test set
Tips and heuristics for setting `bootstrap`:
- Use the default `bootstrap=True` unless you have a specific reason not to
- Disabling bootstrap sampling may be beneficial for high-dimensional datasets
- Bootstrap sampling reduces variance but can slightly increase bias
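One rough way to probe the variance point (an extra check, not part of the original example) is to compare cross-validated scores for the two settings; the spread of scores across folds is an indirect proxy for variance:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

for use_bootstrap in (True, False):
    rf = RandomForestClassifier(bootstrap=use_bootstrap, random_state=42)
    scores = cross_val_score(rf, X, y, cv=5)  # 5-fold CV accuracy
    print(f"bootstrap={use_bootstrap}: "
          f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

On a single small dataset the difference can be within noise, so treat this as a diagnostic rather than a definitive comparison.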
Issues to consider:
- Bootstrap sampling adds a small overhead for drawing a fresh sample per tree; by default each sample is the same size as the training set, so per-tree training cost is similar either way
- The `bootstrap` parameter interacts with `max_samples`, which controls the size of each bootstrap sample; `max_samples` only applies when `bootstrap=True`
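As a sketch of that interaction (the 0.5 value here is illustrative): `max_samples` caps how many rows each tree's bootstrap sample draws, and recent scikit-learn versions reject it outright when `bootstrap=False`:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_features=10,
                           n_informative=5, random_state=42)

# Each tree is trained on a bootstrap sample of 50% of the rows
rf = RandomForestClassifier(bootstrap=True, max_samples=0.5, random_state=42)
rf.fit(X, y)
print(f"Training accuracy with max_samples=0.5: {rf.score(X, y):.3f}")

# Setting max_samples without bootstrap sampling is an error
try:
    RandomForestClassifier(bootstrap=False, max_samples=0.5).fit(X, y)
except ValueError as e:
    print(f"bootstrap=False with max_samples raises: {e}")
```

Smaller `max_samples` values make the trees more diverse (and cheaper to train) at the cost of each tree seeing less data.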