Scikit-Learn Configure RandomizedSearchCV "pre_dispatch" Parameter

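RandomizedSearchCV performs random search, a hyperparameter optimization method that evaluates randomly sampled parameter combinations to find the best-performing model. Its pre_dispatch parameter controls how many jobs are dispatched to the workers ahead of time during parallel execution. Limiting this number can avoid excessive memory use when more jobs are dispatched than the CPUs can process.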

The default value for pre_dispatch is '2*n_jobs', which means that twice as many jobs as n_jobs are dispatched up front. For example, if n_jobs is set to 4, then 8 jobs are queued initially, and further jobs are dispatched as earlier ones finish.
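To make this arithmetic concrete, the sketch below resolves a pre_dispatch setting into a number of queued tasks. The helper names resolve_pre_dispatch and _eval_arithmetic and the total_tasks argument are hypothetical and exist only for illustration; this is a model of the described behavior, not joblib's actual implementation. The total of 250 assumes 50 sampled candidates, each evaluated with 5-fold cross-validation.

```python
import ast
import operator

# Arithmetic operators allowed inside a pre_dispatch expression (illustrative subset).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}


def _eval_arithmetic(node):
    """Evaluate a tiny numeric arithmetic expression tree."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval_arithmetic(node.left)
        right = _eval_arithmetic(node.right)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported pre_dispatch expression")


def resolve_pre_dispatch(pre_dispatch, n_jobs, total_tasks):
    """Hypothetical helper: number of tasks queued up front for a given setting."""
    if pre_dispatch == "all":
        return total_tasks
    if isinstance(pre_dispatch, str):
        # Substitute the worker count into the expression, e.g. '2*n_jobs' -> '2*4'.
        expression = pre_dispatch.replace("n_jobs", str(n_jobs))
        pre_dispatch = _eval_arithmetic(ast.parse(expression, mode="eval").body)
    # There is never more to queue than the total amount of work.
    return min(int(pre_dispatch), total_tasks)


total_tasks = 50 * 5  # assumed: n_iter=50 candidates x cv=5 folds
for setting in ["2*n_jobs", "all", 10]:
    queued = resolve_pre_dispatch(setting, n_jobs=4, total_tasks=total_tasks)
    print(f"pre_dispatch={setting!r} -> {queued} tasks queued up front")
```

With n_jobs=4, the three settings resolve to 8, 250, and 10 tasks respectively.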

Common values for pre_dispatch are a string expression in terms of n_jobs (such as '2*n_jobs'), 'all', or an integer. Setting pre_dispatch to 'all' dispatches all jobs at once, which can consume a lot of memory. Using an integer dispatches exactly that many jobs up front.
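Scikit-learn forwards pre_dispatch to joblib.Parallel, so the difference between these values can be observed with joblib alone. The sketch below feeds Parallel from a generator and records, at the moment each task starts, how many tasks had already been pulled from that generator. The names run_with, producer, square, consumed, and pulled_at_start are illustrative, and the threading backend is an assumption made here only so that the workers can read the shared counter. The recorded numbers depend on scheduling and on the joblib version, so no expected output is shown.

```python
import time

from joblib import Parallel, delayed


def run_with(pre_dispatch, n_tasks=12):
    """Run n_tasks small jobs and record how far ahead the generator was consumed."""
    consumed = []         # task ids pulled from the generator so far
    pulled_at_start = []  # len(consumed) observed when each task started

    def producer():
        for i in range(n_tasks):
            consumed.append(i)
            yield i

    def square(i):
        pulled_at_start.append(len(consumed))
        time.sleep(0.01)  # simulate a little work so tasks overlap
        return i * i

    results = Parallel(n_jobs=2, backend="threading", pre_dispatch=pre_dispatch)(
        delayed(square)(i) for i in producer()
    )
    return results, pulled_at_start


for setting in ["2*n_jobs", "all", 3]:
    results, pulled_at_start = run_with(setting)
    print(f"pre_dispatch={setting!r}")
    print(f"  results: {results}")
    print(f"  tasks pulled when each task started: {pulled_at_start}")
```

The results are the same for every setting; only how far ahead of the workers the task generator is consumed should differ.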

As a heuristic, set pre_dispatch to '2*n_jobs' for a balance between memory consumption and parallelism, or use an integer to fine-tune the number of dispatched jobs based on available resources.
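One possible way to turn this heuristic into a number is sketched below. The function suggest_pre_dispatch, its parameters, and the memory figures are all hypothetical: scikit-learn provides no such helper, and the per-task memory estimate is something you would have to measure for your own data and estimator. The sketch caps the value at the default of twice n_jobs and keeps at least one task per worker so that no worker sits idle.

```python
def suggest_pre_dispatch(n_jobs, memory_budget_bytes, bytes_per_queued_task):
    """Hypothetical heuristic: an integer pre_dispatch that fits a memory budget."""
    default = 2 * n_jobs  # equivalent to the default '2*n_jobs'
    affordable = memory_budget_bytes // bytes_per_queued_task
    # Stay at or below the default, but keep at least one task per worker.
    return max(n_jobs, min(default, affordable))


MB = 1024 ** 2
for budget_mb in [8000, 500, 200]:
    value = suggest_pre_dispatch(
        n_jobs=4,
        memory_budget_bytes=budget_mb * MB,
        bytes_per_queued_task=100 * MB,  # assumed figure, for illustration only
    )
    print(f"budget={budget_mb} MB -> pre_dispatch={value}")
```

The returned integer can be passed directly as pre_dispatch=value when constructing RandomizedSearchCV.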

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import time

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_redundant=10, random_state=42)

# Define a parameter distribution for RandomForestClassifier hyperparameters
param_dist = {'n_estimators': randint(5, 50),
              'max_depth': [None, 5, 10, 20],
              'min_samples_split': randint(2, 20)}

# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)

# List of pre_dispatch values to test
pre_dispatch_values = ['2*n_jobs', 'all', 10]

for pre_dispatch in pre_dispatch_values:
    start_time = time.perf_counter()

    # Run RandomizedSearchCV with the current pre_dispatch value
    search = RandomizedSearchCV(rf, param_dist, n_iter=50, cv=5, n_jobs=4, pre_dispatch=pre_dispatch, random_state=42)
    search.fit(X, y)

    end_time = time.perf_counter()
    execution_time = end_time - start_time

    print(f"Execution time for pre_dispatch='{pre_dispatch}': {execution_time:.2f} seconds")
    print()

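Running the example gives output like the following. Exact timings vary by machine and run, and the first configuration may also absorb the one-time cost of starting the worker processes, so on a dataset this small the differences should not be read as a ranking of the settings: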

Execution time for pre_dispatch='2*n_jobs': 3.84 seconds

Execution time for pre_dispatch='all': 2.80 seconds

Execution time for pre_dispatch='10': 2.74 seconds

The steps are as follows:

1. Generate a synthetic binary classification dataset using make_classification().
2. Define a parameter distribution dictionary param_dist for RandomForestClassifier hyperparameters.
3. Create a base RandomForestClassifier model rf.
4. Iterate over different pre_dispatch values ('2*n_jobs', 'all', 10).
5. For each pre_dispatch value:
   - Record the start time using time.perf_counter().
   - Run RandomizedSearchCV with 50 iterations, 5-fold cross-validation, and 4 jobs.
   - Record the end time and calculate the execution time.
   - Print the execution time for the current pre_dispatch value.
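The timing loop measures speed only. As a complementary check, the sketch below reruns a smaller version of the same search and compares the selected hyperparameters across pre_dispatch values. Because pre_dispatch affects only how jobs are scheduled, a search with a fixed random_state should select the same parameters each time. The reduced n_iter, cv, and n_jobs values and the best_params dictionary are choices made here to keep the check quick; they are not part of the original example.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Same synthetic dataset and search space as the timing example
X, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           n_redundant=10, random_state=42)
param_dist = {'n_estimators': randint(5, 50),
              'max_depth': [None, 5, 10, 20],
              'min_samples_split': randint(2, 20)}

best_params = {}
for pre_dispatch in ['2*n_jobs', 'all', 10]:
    # Smaller search than the timing example so the check runs quickly
    search = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_dist,
                                n_iter=5, cv=3, n_jobs=2,
                                pre_dispatch=pre_dispatch, random_state=42)
    search.fit(X, y)
    best_params[str(pre_dispatch)] = search.best_params_

for setting, params in best_params.items():
    print(f"pre_dispatch='{setting}': {params}")
```

If the printed dictionaries match, the choice of pre_dispatch can be made purely on memory and throughput grounds.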

See Also