The pre_dispatch parameter in RandomizedSearchCV controls the number of jobs that get dispatched during parallel execution. Random search is a hyperparameter optimization method that tries random combinations of parameters to find the best performing model.
The default value for pre_dispatch is '2*n_jobs', which means that twice as many jobs as there are parallel workers are dispatched up front. For example, if n_jobs is set to 4, then 8 jobs are dispatched initially, and further jobs are dispatched as running ones finish.
Common values for pre_dispatch are '2*n_jobs', 'all', or an integer. Setting pre_dispatch to 'all' dispatches all jobs at once, which can consume a lot of memory. Using an integer dispatches exactly that number of jobs.
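To make the three kinds of value concrete, here is a minimal sketch that works out how many jobs each setting would dispatch up front. The helper initial_dispatch_count is hypothetical and written only for illustration; it is not how scikit-learn or joblib resolve the setting internally, and it only understands simple products such as '2*n_jobs'.

```python
def initial_dispatch_count(pre_dispatch, n_jobs, n_tasks):
    """Hypothetical helper: estimate how many jobs are dispatched up front.

    pre_dispatch: 'all', an integer, or a simple product such as '2*n_jobs'
    n_jobs:       number of parallel workers
    n_tasks:      total number of fits (for example n_iter * cv)
    """
    if pre_dispatch == "all":
        return n_tasks
    if isinstance(pre_dispatch, str):
        # Substitute the worker count, then multiply the factors together.
        expression = pre_dispatch.replace("n_jobs", str(n_jobs))
        count = 1
        for factor in expression.split("*"):
            count *= int(factor)
    else:
        count = int(pre_dispatch)
    # There is never more to dispatch than the total number of tasks.
    return min(count, n_tasks)


# 50 sampled candidates evaluated with 5-fold cross-validation give 250 fits.
n_tasks = 50 * 5
for value in ["2*n_jobs", "all", 10]:
    print(value, "->", initial_dispatch_count(value, n_jobs=4, n_tasks=n_tasks))
```

With 4 workers and 250 fits, the default queues 8 jobs, an integer of 10 queues 10, and 'all' queues all 250 immediately, which is where the extra memory use comes from.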
As a heuristic, set pre_dispatch to '2*n_jobs' for a balance between memory consumption and parallelism, or use an integer to fine-tune the number of dispatched jobs based on available resources.
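One possible way to turn this heuristic into code is sketched below. The function choose_pre_dispatch, the per-job memory estimate task_mb, and the budget budget_mb are all assumptions introduced for illustration; the memory model (each dispatched job costs a fixed amount) is a deliberate simplification, and the estimate has to come from your own measurements.

```python
def choose_pre_dispatch(n_jobs, task_mb, budget_mb):
    """Hypothetical heuristic: keep the default unless it exceeds a memory budget.

    n_jobs:    number of parallel workers
    task_mb:   assumed memory cost of one dispatched job, in MB (user estimate)
    budget_mb: memory you are willing to spend on dispatched jobs, in MB
    """
    default_count = 2 * n_jobs
    if default_count * task_mb <= budget_mb:
        # The default fits within the budget, so keep it.
        return "2*n_jobs"
    # Otherwise fall back to the largest integer that fits, but at least 1.
    affordable = int(budget_mb // task_mb)
    return max(1, affordable)


print(choose_pre_dispatch(n_jobs=4, task_mb=100, budget_mb=1000))  # default fits
print(choose_pre_dispatch(n_jobs=4, task_mb=300, budget_mb=1000))  # capped integer
```

The returned value can be passed straight to the pre_dispatch argument. Note that a result smaller than n_jobs respects the budget but leaves some workers idle.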
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import time

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=10, random_state=42)

# Define a parameter distribution for RandomForestClassifier hyperparameters
param_dist = {'n_estimators': randint(5, 50),
              'max_depth': [None, 5, 10, 20],
              'min_samples_split': randint(2, 20)}

# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)

# List of pre_dispatch values to test
pre_dispatch_values = ['2*n_jobs', 'all', 10]

for pre_dispatch in pre_dispatch_values:
    start_time = time.perf_counter()

    # Run RandomizedSearchCV with the current pre_dispatch value
    search = RandomizedSearchCV(rf, param_dist, n_iter=50, cv=5, n_jobs=4, pre_dispatch=pre_dispatch, random_state=42)
    search.fit(X, y)

    end_time = time.perf_counter()
    execution_time = end_time - start_time

    print(f"Execution time for pre_dispatch='{pre_dispatch}': {execution_time:.2f} seconds")
    print()
Running the example gives an output like:
Execution time for pre_dispatch='2*n_jobs': 3.84 seconds
Execution time for pre_dispatch='all': 2.80 seconds
Execution time for pre_dispatch='10': 2.74 seconds
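Single timings on a dataset this small can vary from run to run, so differences of a second or so may largely reflect noise rather than the effect of pre_dispatch. A minimal sketch of one way to get steadier numbers is to repeat each measurement and report the median. The helper time_repeated and its repeats argument are assumptions added here for illustration and are not part of the original example.

```python
import statistics
import time


def time_repeated(fn, repeats=3):
    """Call fn() several times and return (median duration, all durations)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), durations


# Stand-in workload so the sketch runs on its own. In the example above,
# fn would instead build a RandomizedSearchCV and call search.fit(X, y).
median_seconds, all_seconds = time_repeated(lambda: sum(range(100_000)), repeats=5)
print(f"median: {median_seconds:.4f} s over {len(all_seconds)} runs")
```

The median is used rather than the mean so that a single slow run, for example the first one that has to start the worker processes, does not dominate the result.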
The steps are as follows:
- Generate a synthetic binary classification dataset using make_classification().
- Define a parameter distribution dictionary param_dist for RandomForestClassifier hyperparameters.
- Create a base RandomForestClassifier model rf.
- Iterate over different pre_dispatch values ('2*n_jobs', 'all', 10).
- For each pre_dispatch value:
  - Record the start time using time.perf_counter().
  - Run RandomizedSearchCV with 50 iterations, 5-fold cross-validation, and 4 jobs.
  - Record the end time and calculate the execution time.
  - Print the execution time for the current pre_dispatch value.