The n_jobs parameter in RandomizedSearchCV controls the parallelization of the hyperparameter search. Random search is a hyperparameter optimization method that tries random combinations of parameters to find the best-performing model.
The n_jobs parameter determines how many CPU cores are used in parallel during the search. The default is None, which means one core (no parallelization) unless the call runs inside a joblib parallel_backend context. Setting n_jobs to -1 uses all available cores, while a specific positive integer uses that number of cores.
As a heuristic, -1 is often best for maximum parallelization unless memory is limited, in which case a lower number of cores may be more appropriate.
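One quick way to pick a specific value is to check how many cores the machine has. Below is a minimal sketch using the standard library; the leave-one-core-free heuristic is just one possible choice, not a rule from scikit-learn:

```python
import os

# Total logical cores visible to the OS; n_jobs=-1 uses all of them
total_cores = os.cpu_count()
print(f"Available cores: {total_cores}")

# One possible middle ground: leave a core free for the OS and other processes
n_jobs = max(1, total_cores - 1)
print(f"Using n_jobs={n_jobs}")
```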
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import time
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=42)
# Define a parameter distribution for RandomForestClassifier hyperparameters
param_dist = {'n_estimators': [100, 200, 300],
'max_depth': [None, 5, 10],
'min_samples_split': [2, 5, 10]}
# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)
# List of n_jobs values to test
n_jobs_values = [1, 2, -1]
for n_jobs in n_jobs_values:
    start_time = time.perf_counter()
    # Run RandomizedSearchCV with the current n_jobs value
    search = RandomizedSearchCV(rf, param_dist, n_iter=10, cv=5, n_jobs=n_jobs, random_state=42)
    search.fit(X, y)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"Best score for n_jobs={n_jobs}: {search.best_score_:.3f}")
    print(f"Execution time for n_jobs={n_jobs}: {execution_time:.2f} seconds")
    print()
Running the example gives an output like:
Best score for n_jobs=1: 0.939
Execution time for n_jobs=1: 17.94 seconds
Best score for n_jobs=2: 0.939
Execution time for n_jobs=2: 10.39 seconds
Best score for n_jobs=-1: 0.939
Execution time for n_jobs=-1: 6.36 seconds
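The benefit of parallelization can be quantified as a speedup relative to the single-core run. A small calculation using the timings reported above (which will differ on your machine):

```python
# Execution times reported above (seconds); these will vary by machine
times = {1: 17.94, 2: 10.39, -1: 6.36}
baseline = times[1]

for n_jobs, t in times.items():
    speedup = baseline / t
    print(f"n_jobs={n_jobs}: {t:.2f}s, speedup {speedup:.2f}x")
```

Here two cores give roughly a 1.7x speedup and all cores roughly 2.8x. The scaling is sub-linear because each worker adds scheduling and data-copying overhead on top of the model fitting itself.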
The steps are as follows:
- Generate a synthetic binary classification dataset using make_classification() from scikit-learn.
- Define a parameter distribution dictionary param_dist for RandomForestClassifier hyperparameters.
- Create a base RandomForestClassifier model rf.
- Iterate over different n_jobs values (1, 2, -1). For each n_jobs value:
  - Record the start time using time.perf_counter().
  - Run RandomizedSearchCV with 10 iterations and 5-fold cross-validation.
  - Record the end time and calculate the execution time.
  - Print the best score and execution time for the current n_jobs value.
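Once the search finishes, scikit-learn exposes the winning configuration via best_params_ and a model refit on the full dataset via best_estimator_. A minimal sketch, using a smaller grid than the example above so it runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
param_dist = {'n_estimators': [50, 100], 'max_depth': [None, 5]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=4, cv=3, n_jobs=-1,
                            random_state=42)
search.fit(X, y)

# The best hyperparameter combination found during the search
print(search.best_params_)

# A model refit on the full dataset with those hyperparameters (refit=True by default)
best_model = search.best_estimator_
print(best_model.predict(X[:5]))
```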