The n_iter parameter in RandomizedSearchCV controls the number of parameter settings that are sampled during the hyperparameter search process. Random search tries random combinations of hyperparameters to find the best performing model.
The default value for n_iter is 10, which means 10 different parameter settings will be sampled and evaluated.
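Higher values of n_iter explore more of the search space and can find better-performing settings, but the number of model fits grows in proportion (n_iter times the number of cross-validation folds), so runtimes get longer.
As a heuristic, start with the default value of 10 and increase it until the performance plateaus or the computational budget is reached. When every parameter is given as a list, settings are sampled without replacement, so n_iter values above the total number of combinations add nothing: scikit-learn issues a warning and evaluates each combination once.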
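To get a feel for why returns diminish as n_iter grows, the following is a minimal sketch of a common back-of-the-envelope calculation. It assumes each sampled setting is an independent draw and asks how likely it is that at least one draw lands in the best 5% of the search space. The helper names prob_hit_top_fraction and min_n_iter are illustrative only and are not part of scikit-learn.

```python
import math


def prob_hit_top_fraction(n_iter: int, top_fraction: float) -> float:
    """Probability that at least one of n_iter independent draws
    falls in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter


def min_n_iter(top_fraction: float, confidence: float) -> int:
    """Smallest n_iter for which prob_hit_top_fraction >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))


# Assumption for illustration: "good enough" means the best 5% of settings.
for n_iter in (10, 50, 100):
    p = prob_hit_top_fraction(n_iter, top_fraction=0.05)
    print(f"n_iter={n_iter}: P(at least one top-5% setting) = {p:.3f}")

print(f"n_iter needed for 95% confidence: {min_n_iter(0.05, 0.95)}")
```

Under these assumptions the probability rises quickly at first and then flattens, which matches the plateau the heuristic above looks for. The calculation says nothing about how much better the top 5% of settings actually are for a given model and dataset.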
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# Define a parameter distribution for RandomForestClassifier hyperparameters.
# All values are lists, so the space holds only 3 * 3 * 3 = 27 combinations.
param_dist = {'n_estimators': [10, 50, 100],
              'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5, 10]}

# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)

# List of n_iter values to test.
# Values above 27 are capped at 27 here (scikit-learn emits a warning).
n_iter_values = [10, 50, 100]

for n_iter in n_iter_values:
    start_time = time.perf_counter()

    # Run RandomizedSearchCV with the current n_iter value
    search = RandomizedSearchCV(rf, param_dist, n_iter=n_iter, cv=5,
                                n_jobs=-1, random_state=42)
    search.fit(X, y)

    end_time = time.perf_counter()
    execution_time = end_time - start_time

    print(f"Best score for n_iter={n_iter}: {search.best_score_:.3f}")
    print(f"Execution time for n_iter={n_iter}: {execution_time:.2f} seconds")
    print()
Running the example gives an output like:
Best score for n_iter=10: 0.933
Execution time for n_iter=10: 2.54 seconds
Best score for n_iter=50: 0.939
Execution time for n_iter=50: 3.47 seconds
Best score for n_iter=100: 0.939
Execution time for n_iter=100: 3.79 seconds
The steps in this example are:
- Generate a synthetic binary classification dataset using make_classification().
- Define a parameter distribution dictionary param_dist for RandomForestClassifier hyperparameters.
- Create a base RandomForestClassifier model rf.
- Iterate over different n_iter values.
- For each n_iter value:
  - Record the start time using time.perf_counter().
  - Run RandomizedSearchCV with the current n_iter value and 5-fold cross-validation.
  - Record the end time and calculate the execution time.
  - Print the best score and execution time for the current n_iter value.
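One detail of the example is worth checking: the search space is made up entirely of lists, so it contains a finite number of combinations, and the runs with n_iter=50 and n_iter=100 cannot evaluate more settings than that. The sketch below counts the combinations with ParameterGrid and draws from the same space with ParameterSampler. It then shows one possible way around the cap by swapping the lists for scipy.stats.randint distributions. The name param_dist_wide, its integer ranges, and the smaller dataset are assumptions made for illustration and are not part of the original example.

```python
import warnings

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (ParameterGrid, ParameterSampler,
                                     RandomizedSearchCV)

# Same list-only search space as in the example above.
param_dist = {'n_estimators': [10, 50, 100],
              'max_depth': [None, 5, 10],
              'min_samples_split': [2, 5, 10]}

# Number of distinct combinations in the list-only space: 3 * 3 * 3.
grid_size = len(ParameterGrid(param_dist))

# Ask for more samples than the space holds. scikit-learn warns and
# falls back to one draw per combination, so the warning is silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    sampled = list(ParameterSampler(param_dist, n_iter=50, random_state=42))

print(f"Grid size: {grid_size}, candidates drawn for n_iter=50: {len(sampled)}")

# Hypothetical wider space: integer distributions instead of short lists.
# With distributions, n_iter is no longer capped by a finite grid.
# Note: this illustrative space drops the max_depth=None option.
param_dist_wide = {'n_estimators': randint(10, 101),      # integers 10..100
                   'max_depth': randint(2, 11),           # integers 2..10
                   'min_samples_split': randint(2, 11)}   # integers 2..10

# Small dataset and few iterations keep this sketch quick to run.
X_small, y_small = make_classification(n_samples=200, n_features=10,
                                       n_informative=5, n_redundant=5,
                                       random_state=42)

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist_wide, n_iter=5, cv=3, random_state=42)
search.fit(X_small, y_small)

n_candidates = len(search.cv_results_['params'])
print(f"Candidates evaluated with distributions: {n_candidates}")
```

The first print reports 27 for both the grid size and the number of candidates drawn, which suggests why the scores for n_iter=50 and n_iter=100 in the output above are identical: both runs evaluate the same 27 settings. Whether a wider space of distributions actually improves the score depends on the model and data.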