The `cv` parameter in `RandomizedSearchCV` controls the cross-validation splitting strategy used during hyperparameter tuning. Random search is a hyperparameter optimization method that samples random combinations of parameters to find the best-performing model.
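The `cv` parameter can be set to an integer, which specifies the number of folds: `StratifiedKFold` is used when the estimator is a classifier and the target is binary or multiclass, and `KFold` is used otherwise. It can also be set to an object that implements the `split` method, such as `ShuffleSplit` or `TimeSeriesSplit`, or to an iterable of (train, test) index pairs.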
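As a minimal sketch of these two forms, the snippet below uses `check_cv` from `sklearn.model_selection` to show how an integer is resolved, then passes a `ShuffleSplit` object as `cv`. The small dataset, the `LogisticRegression` estimator, and the range for `C` are illustrative assumptions and not part of the example later in this article.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit, check_cv

# Small illustrative dataset (assumed sizes, chosen only to keep the run fast)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# An integer cv resolves to StratifiedKFold for a classifier with a binary target
resolved_clf = check_cv(cv=3, y=y, classifier=True)
print(type(resolved_clf).__name__)  # StratifiedKFold

# Without the classifier flag, the same integer resolves to plain KFold
resolved_other = check_cv(cv=3, y=y, classifier=False)
print(type(resolved_other).__name__)  # KFold

# A splitter object can be passed directly as cv
splitter = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": loguniform(1e-3, 1e2)},
    n_iter=5,
    cv=splitter,
    random_state=0,
)
search.fit(X, y)
print(search.n_splits_)  # 4, taken from the ShuffleSplit object
```

Passing a splitter object is mainly useful when the data has structure that plain folds would ignore, for example `TimeSeriesSplit` for temporally ordered samples.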
The default value of `cv` is `None`, which performs 5-fold cross-validation. Common values for `cv` are 3, 5, and 10.
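The default can be checked on a fitted search object. This is a small sketch assuming scikit-learn 0.22 or later (earlier releases used 3 folds by default); the `DecisionTreeClassifier` and the `max_depth` range are illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset (assumed sizes)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# cv is left unspecified, so the constructor default applies
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": randint(1, 10)},
    n_iter=3,
    random_state=0,
)
print(search.cv)  # None

search.fit(X, y)
print(search.n_splits_)  # 5 folds were used
```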
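As a heuristic, use a higher value of `cv` (e.g., 10) when the dataset is large enough that each test fold still contains a representative sample; averaging over more folds then gives a more reliable performance estimate. For smaller datasets, where ten folds would leave very few samples in each test fold, or when computational resources are limited, a lower value (e.g., 3) can be used. The number of model fits grows in proportion to `cv`, so a lower value also reduces computation time.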
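The computational side of this trade-off can be counted directly: each sampled candidate is fitted once per fold, so a search performs `n_iter * cv` fits, plus one final refit on the full data when `refit=True` (the default). The sketch below confirms the count from `cv_results_`; the dataset, estimator, and `n_iter=4` are assumptions chosen to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset (assumed sizes)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

n_iter = 4
fits = {}
for cv in (3, 5, 10):
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),
        {"max_depth": randint(1, 10)},
        n_iter=n_iter,
        cv=cv,
        random_state=0,
    )
    search.fit(X, y)
    # One entry in "params" per sampled candidate
    n_candidates = len(search.cv_results_["params"])
    # One "split<i>_test_score" column per fold
    n_folds = sum(
        key.startswith("split") and key.endswith("_test_score")
        for key in search.cv_results_
    )
    fits[cv] = n_candidates * n_folds

print(fits)  # {3: 12, 5: 20, 10: 40}, excluding the final refit
```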
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint
    import time

    # Generate a synthetic binary classification dataset
    X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                               n_redundant=5, random_state=42)

    # Define a parameter distribution for RandomForestClassifier hyperparameters
    param_dist = {'n_estimators': randint(10, 100),
                  'max_depth': [None, 5, 10],
                  'min_samples_split': randint(2, 10)}

    # Create a base RandomForestClassifier model
    rf = RandomForestClassifier(random_state=42)

    # List of cv values to test
    cv_values = [3, 5, 10]

    for cv in cv_values:
        start_time = time.perf_counter()

        # Run RandomizedSearchCV with the current cv value
        search = RandomizedSearchCV(rf, param_dist, n_iter=10, cv=cv, random_state=42)
        search.fit(X, y)

        end_time = time.perf_counter()
        execution_time = end_time - start_time

        print(f"Best score for cv={cv}: {search.best_score_:.3f}")
        print(f"Execution time for cv={cv}: {execution_time:.2f} seconds")
        print()

The example gives an output like:

    Best score for cv=3: 0.936
    Execution time for cv=3: 3.76 seconds

    Best score for cv=5: 0.939
    Execution time for cv=5: 6.58 seconds

    Best score for cv=10: 0.933
    Execution time for cv=10: 14.69 seconds

The steps in the example are:
- Generate a synthetic binary classification dataset using `make_classification()`.
- Define a parameter distribution `param_dist` for `RandomForestClassifier` hyperparameters.
- Create a base `RandomForestClassifier` model `rf`.
- Iterate over different `cv` values (3, 5, 10).
- For each `cv` value:
  - Record the start time.
  - Run `RandomizedSearchCV` with 10 iterations and the current `cv` value.
  - Record the end time and calculate the execution time.
  - Print the best score and execution time for the current `cv` value.
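
The steps above report only the mean score of the best candidate. As a possible extension, the fold-to-fold spread can be read from `cv_results_` to judge how stable that estimate is. In the sketch below, `summarize` is a hypothetical helper introduced for illustration, and the smaller dataset and narrower `n_estimators` range are assumptions made to keep the run short.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV


def summarize(search):
    """Hypothetical helper: mean and std of the best candidate's fold scores."""
    idx = search.best_index_
    mean = search.cv_results_["mean_test_score"][idx]
    std = search.cv_results_["std_test_score"][idx]
    return mean, std


# Smaller illustrative dataset than in the main example (assumed sizes)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

summaries = {}
for cv in (3, 5):
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=42),
        {"n_estimators": randint(10, 30), "max_depth": [None, 5, 10]},
        n_iter=3,
        cv=cv,
        random_state=42,
    )
    search.fit(X, y)
    mean, std = summarize(search)
    summaries[cv] = (mean, std, search.best_score_)
    print(f"cv={cv}: best mean score {mean:.3f}, std across folds {std:.3f}")
```

The mean returned by the helper is the same value as `search.best_score_`; the standard deviation is the extra information that the printed output of the main example does not show.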