The `bootstrap` parameter in scikit-learn's `RandomForestRegressor` controls whether bootstrap samples are used when building trees.
Bootstrapping is a resampling technique where each tree is trained on a random sample of the training data, drawn with replacement. This introduces randomness and diversity into the ensemble, which can help reduce overfitting.
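To make the mechanics concrete, here is a minimal sketch of a single bootstrap draw using NumPy; this illustrates the resampling idea only and is not scikit-learn's internal code:
import numpy as np
# Draw a bootstrap sample: n indices sampled with replacement, so some rows
# repeat while others are left out entirely (the "out-of-bag" rows).
rng = np.random.default_rng(42)
n_samples = 10
boot_idx = rng.integers(0, n_samples, size=n_samples)
oob_idx = sorted(set(range(n_samples)) - set(boot_idx))
print("Bootstrap indices:", boot_idx)
print("Out-of-bag indices:", oob_idx)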
By default, `bootstrap` is set to `True`, meaning that each tree is trained on its own bootstrap sample of the data. A bootstrap sample has the same number of rows as the training set, but because it is drawn with replacement, roughly a third of the rows are left out of any given tree's sample. Setting `bootstrap` to `False` means that each tree is trained on the entire training dataset.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
n_targets=1, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with bootstrap=True and bootstrap=False
rf_bootstrap = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=42)
rf_no_bootstrap = RandomForestRegressor(n_estimators=100, bootstrap=False, random_state=42)
rf_bootstrap.fit(X_train, y_train)
rf_no_bootstrap.fit(X_train, y_train)
# Evaluate performance
y_pred_bootstrap = rf_bootstrap.predict(X_test)
y_pred_no_bootstrap = rf_no_bootstrap.predict(X_test)
mse_bootstrap = mean_squared_error(y_test, y_pred_bootstrap)
mse_no_bootstrap = mean_squared_error(y_test, y_pred_no_bootstrap)
print(f"Bootstrap MSE: {mse_bootstrap:.3f}")
print(f"No Bootstrap MSE: {mse_no_bootstrap:.3f}")
Running the example produces output like:
Bootstrap MSE: 208.093
No Bootstrap MSE: 458.141
The key steps in this example are:
- Generate a synthetic regression dataset with informative and noise features
- Split the data into train and test sets
- Train `RandomForestRegressor` models with `bootstrap=True` and `bootstrap=False`
- Evaluate and compare the mean squared error of the models on the test set
Some tips and heuristics for setting `bootstrap`:
- Bootstrapping generally improves performance by reducing variance, especially on smaller datasets
- Using `bootstrap=False` does not reduce training time: by default a bootstrap sample is the same size as the training set, so each tree sees the same number of rows either way; on large datasets, the more effective lever is `max_samples` together with `bootstrap=True` (see the sketch after this list)
- If using `bootstrap=False`, it's recommended to increase `n_estimators` to maintain performance, since feature subsampling then becomes the main source of tree diversity
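As a sketch of that last point, and continuing from the example above: scikit-learn's `max_samples` parameter (honoured only when `bootstrap=True`) caps the size of each tree's bootstrap sample, and `oob_score=True` scores the forest on each tree's left-out rows, giving a validation-style R² essentially for free.
from sklearn.ensemble import RandomForestRegressor
# Keep bootstrapping, but train each tree on ~50% of the rows and score the
# forest on its out-of-bag samples (both options require bootstrap=True).
rf_subsampled = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,
    max_samples=0.5,
    oob_score=True,
    random_state=42,
)
rf_subsampled.fit(X_train, y_train)
print(f"OOB R^2: {rf_subsampled.oob_score_:.3f}")
print(f"Test MSE: {mean_squared_error(y_test, rf_subsampled.predict(X_test)):.3f}")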
Issues to consider:
- Bootstrapping adds only a small sampling overhead per tree; by default each bootstrap sample has as many rows as the training set, so overall training time is roughly the same as with `bootstrap=False`
- The randomness introduced by bootstrapping can make impurity-based feature importances less stable from run to run, which slightly complicates interpretation (see the check after this list)
- The impact of bootstrapping may vary depending on the specific characteristics of the dataset
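One quick way to see the feature-importance point is to put the two fitted forests from the example side by side; comparing runs with different `random_state` values would show the instability more fully, but even this simple check is informative:
# Compare impurity-based feature importances from the two fitted forests above.
for i, (imp_b, imp_nb) in enumerate(
        zip(rf_bootstrap.feature_importances_, rf_no_bootstrap.feature_importances_)):
    print(f"feature {i}: bootstrap={imp_b:.3f}, no_bootstrap={imp_nb:.3f}")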