The `bootstrap` parameter in scikit-learn's `RandomForestRegressor` controls whether bootstrap samples are used when building trees.
Bootstrapping is a resampling technique where each tree is trained on a random sample of the training data, drawn with replacement. This introduces randomness and diversity into the ensemble, which can help reduce overfitting.
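To make the mechanics concrete, here is a minimal sketch of a single bootstrap draw using NumPy; this illustrates the resampling idea only and is not scikit-learn's internal code:
import numpy as np
# Draw a bootstrap sample: n indices sampled with replacement, so some rows
# repeat while others are left out entirely (the "out-of-bag" rows).
rng = np.random.default_rng(42)
n_samples = 10
boot_idx = rng.integers(0, n_samples, size=n_samples)
oob_idx = sorted(set(range(n_samples)) - set(boot_idx))
print("Bootstrap indices:", boot_idx)
print("Out-of-bag indices:", oob_idx)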
By default, `bootstrap` is set to `True`, meaning that each tree is trained on its own bootstrap sample of the data. A bootstrap sample has the same number of rows as the training set, but because it is drawn with replacement, roughly a third of the rows are left out of any given tree's sample. Setting `bootstrap` to `False` means that each tree is trained on the entire training dataset.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5,
n_targets=1, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with bootstrap=True and bootstrap=False
rf_bootstrap = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=42)
rf_no_bootstrap = RandomForestRegressor(n_estimators=100, bootstrap=False, random_state=42)
rf_bootstrap.fit(X_train, y_train)
rf_no_bootstrap.fit(X_train, y_train)
# Evaluate performance
y_pred_bootstrap = rf_bootstrap.predict(X_test)
y_pred_no_bootstrap = rf_no_bootstrap.predict(X_test)
mse_bootstrap = mean_squared_error(y_test, y_pred_bootstrap)
mse_no_bootstrap = mean_squared_error(y_test, y_pred_no_bootstrap)
print(f"Bootstrap MSE: {mse_bootstrap:.3f}")
print(f"No Bootstrap MSE: {mse_no_bootstrap:.3f}")
Running the example produces output like:
Bootstrap MSE: 208.093
No Bootstrap MSE: 458.141
The key steps in this example are:
- Generate a synthetic regression dataset with informative and noise features
- Split the data into train and test sets
- Train `RandomForestRegressor` models with `bootstrap=True` and `bootstrap=False`
- Evaluate and compare the mean squared error of the models on the test set
Some tips and heuristics for setting `bootstrap`:
- Bootstrapping generally improves performance by reducing variance, especially on smaller datasets
- Using `bootstrap=False` does not reduce training time: by default a bootstrap sample is the same size as the training set, so each tree sees the same number of rows either way; on large datasets, the more effective lever is `max_samples` together with `bootstrap=True` (see the sketch after this list)
- If using `bootstrap=False`, it's recommended to increase `n_estimators` to maintain performance, since feature subsampling then becomes the main source of tree diversity
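As a sketch of that last point, and continuing from the example above: scikit-learn's `max_samples` parameter (honoured only when `bootstrap=True`) caps the size of each tree's bootstrap sample, and `oob_score=True` scores the forest on each tree's left-out rows, giving a validation-style R² essentially for free.
from sklearn.ensemble import RandomForestRegressor
# Keep bootstrapping, but train each tree on ~50% of the rows and score the
# forest on its out-of-bag samples (both options require bootstrap=True).
rf_subsampled = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,
    max_samples=0.5,
    oob_score=True,
    random_state=42,
)
rf_subsampled.fit(X_train, y_train)
print(f"OOB R^2: {rf_subsampled.oob_score_:.3f}")
print(f"Test MSE: {mean_squared_error(y_test, rf_subsampled.predict(X_test)):.3f}")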
Issues to consider:
- Bootstrapping adds only a small sampling overhead per tree; by default each bootstrap sample has as many rows as the training set, so overall training time is roughly the same as with `bootstrap=False`
- The randomness introduced by bootstrapping can make impurity-based feature importances less stable from run to run, which slightly complicates interpretation (see the check after this list)
- The impact of bootstrapping may vary depending on the specific characteristics of the dataset
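One quick way to see the feature-importance point is to put the two fitted forests from the example side by side; comparing runs with different `random_state` values would show the instability more fully, but even this simple check is informative:
# Compare impurity-based feature importances from the two fitted forests above.
for i, (imp_b, imp_nb) in enumerate(
        zip(rf_bootstrap.feature_importances_, rf_no_bootstrap.feature_importances_)):
    print(f"feature {i}: bootstrap={imp_b:.3f}, no_bootstrap={imp_nb:.3f}")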