The cv parameter in RandomizedSearchCV controls the cross-validation splitting strategy during hyperparameter tuning. Random search is a hyperparameter optimization method that samples random combinations of parameters to find the best performing model.
The cv parameter can be set to an integer, which specifies the number of folds. With an integer, scikit-learn uses StratifiedKFold when the estimator is a classifier and the target is binary or multiclass, and KFold otherwise. It can also be set to a splitter object that implements the split method, such as ShuffleSplit or TimeSeriesSplit, or to an iterable that yields (train, test) index pairs.
The default value of cv is None, which performs 5-fold cross-validation. Common integer values for cv are 3, 5, and 10.
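As a rough illustration of the two forms of cv, the sketch below passes an integer and two splitter objects to RandomizedSearchCV and reads back the number of splits from the fitted search. The dataset size, the parameter ranges, and the n_splits_seen dictionary are assumptions chosen only to keep the run short; they are not part of the original example.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit, StratifiedKFold

# Small synthetic dataset (illustrative size, chosen for speed)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Deliberately narrow search space so the sketch runs quickly
param_dist = {"n_estimators": randint(10, 30), "max_depth": [None, 5]}

# cv given as an integer and as two different splitter objects
cv_options = {
    "int 3": 3,
    "StratifiedKFold": StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    "ShuffleSplit": ShuffleSplit(n_splits=5, test_size=0.25, random_state=0),
}

n_splits_seen = {}  # hypothetical helper dict for this sketch
for name, cv in cv_options.items():
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_dist,
        n_iter=3,
        cv=cv,
        random_state=0,
    )
    search.fit(X, y)
    # n_splits_ reports how many train/test splits were actually used
    n_splits_seen[name] = search.n_splits_
    print(f"{name}: {search.n_splits_} splits, best score {search.best_score_:.3f}")
```

A splitter object is useful when the integer shortcut does not match the data, for example TimeSeriesSplit for ordered observations, where shuffled folds would leak future information into training.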
As a heuristic, use a higher value of cv (e.g., 10) for larger datasets to obtain more reliable performance estimates. For smaller datasets or when computational resources are limited, a lower value (e.g., 3) can be used to reduce computation time.
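The computational side of this trade-off can be estimated before running anything. RandomizedSearchCV fits every sampled candidate once per split and, with the default refit=True, fits the best candidate one more time on the full dataset. The helper total_fits below is a hypothetical name introduced for this sketch, not a scikit-learn function.

```python
def total_fits(n_iter, cv, refit=True):
    """Estimate how many model fits a randomized search performs.

    Each of the n_iter sampled candidates is fitted once per split;
    refit=True adds one final fit of the best candidate on all data.
    """
    return n_iter * cv + (1 if refit else 0)


for cv in (3, 5, 10):
    print(f"cv={cv}: {total_fits(n_iter=10, cv=cv)} fits")
# cv=3: 31 fits
# cv=5: 51 fits
# cv=10: 101 fits
```

The fit count grows linearly with cv. Actual runtime tends to grow slightly faster, because each training split also contains a larger share of the data as cv increases.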
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import time
# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=42)
# Define a parameter distribution for RandomForestClassifier hyperparameters
param_dist = {'n_estimators': randint(10, 100),
              'max_depth': [None, 5, 10],
              'min_samples_split': randint(2, 10)}
# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)
# List of cv values to test
cv_values = [3, 5, 10]
for cv in cv_values:
    start_time = time.perf_counter()
    # Run RandomizedSearchCV with the current cv value
    search = RandomizedSearchCV(rf, param_dist, n_iter=10, cv=cv, random_state=42)
    search.fit(X, y)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"Best score for cv={cv}: {search.best_score_:.3f}")
    print(f"Execution time for cv={cv}: {execution_time:.2f} seconds")
    print()
The example gives an output like:
Best score for cv=3: 0.936
Execution time for cv=3: 3.76 seconds
Best score for cv=5: 0.939
Execution time for cv=5: 6.58 seconds
Best score for cv=10: 0.933
Execution time for cv=10: 14.69 seconds
The steps in the example are:
- Generate a synthetic binary classification dataset using make_classification().
- Define a parameter distribution param_dist for RandomForestClassifier hyperparameters.
- Create a base RandomForestClassifier model rf.
- Iterate over different cv values (3, 5, 10).
- For each cv value:
  - Record the start time.
  - Run RandomizedSearchCV with 10 iterations and the current cv value.
  - Record the end time and calculate the execution time.
  - Print the best score and execution time for the current cv value.
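The best scores reported for the three cv values differ only in the third decimal place, so it can help to look at how much the score varies from fold to fold before reading anything into such gaps. The sketch below pulls the per-fold scores of the best candidate out of cv_results_. The smaller dataset, the narrower search space, and the summary dictionary are assumptions made to keep the run short; the printed numbers will not match the output above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Smaller dataset than the main example, purely to keep this sketch fast
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

param_dist = {"n_estimators": randint(10, 30), "max_depth": [None, 5, 10]}

summary = {}  # hypothetical container: cv -> (best_score, mean, std, fold_scores)
for cv in (3, 5):
    search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                                param_dist, n_iter=3, cv=cv, random_state=42)
    search.fit(X, y)

    best = search.best_index_  # row of cv_results_ for the best candidate
    mean = search.cv_results_["mean_test_score"][best]
    std = search.cv_results_["std_test_score"][best]
    # One score per split is stored under split0_test_score, split1_test_score, ...
    fold_scores = [search.cv_results_[f"split{i}_test_score"][best]
                   for i in range(cv)]

    summary[cv] = (search.best_score_, mean, std, fold_scores)
    print(f"cv={cv}: mean={mean:.3f}, std={std:.3f}, "
          f"folds={[round(s, 3) for s in fold_scores]}")
```

If the standard deviation across folds is larger than the gap between two best scores, that gap is probably within the noise of the estimate rather than a real difference.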