Scikit-Learn Configure RandomizedSearchCV "cv" Parameter

The cv parameter in RandomizedSearchCV controls the cross-validation splitting strategy used to score each candidate during hyperparameter tuning. Random search is a hyperparameter optimization method that samples a fixed number of parameter combinations (n_iter) from the specified distributions and keeps the combination with the best cross-validated score.

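The cv parameter can be set to an integer, which specifies the number of folds. With an integer, StratifiedKFold is used when the estimator is a classifier and the target is binary or multiclass; in all other cases KFold is used. It can also be set to a cross-validation splitter object that implements the split method, such as ShuffleSplit or TimeSeriesSplit, or to an iterable of (train, test) index pairs.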
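
As a minimal sketch of the second option, the snippet below passes splitter objects instead of an integer. The dataset size, the LogisticRegression model, the loguniform range for C, and the splitter settings are illustrative assumptions, not values from the main example.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit, StratifiedKFold

# Small synthetic dataset (sizes chosen only to keep the sketch fast)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)
param_dist = {"C": loguniform(1e-3, 1e2)}  # assumed search range

# Splitter objects passed in place of an integer
splitters = {
    "StratifiedKFold": StratifiedKFold(n_splits=4, shuffle=True, random_state=0),
    "ShuffleSplit": ShuffleSplit(n_splits=6, test_size=0.25, random_state=0),
}

n_splits_used = {}
for name, splitter in splitters.items():
    search = RandomizedSearchCV(model, param_dist, n_iter=5, cv=splitter, random_state=0)
    search.fit(X, y)
    # n_splits_ reports how many train/test splits the search actually used
    n_splits_used[name] = search.n_splits_
    print(f"{name}: {search.n_splits_} splits, best score {search.best_score_:.3f}")
```

Passing a splitter is also the way to control shuffling and the random seed of the folds, which an integer cv does not expose.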

The default value of cv is None, which performs 5-fold cross-validation (before scikit-learn 0.22 the default was 3-fold). Common values for cv are 3, 5, and 10.
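
One way to see what a given cv value resolves to is the public check_cv helper in sklearn.model_selection. The sketch below assumes a recent scikit-learn release (0.22 or later) and uses small made-up target arrays purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_class = np.array([0, 1] * 10)     # illustrative binary class labels
y_reg = np.linspace(0.0, 1.0, 20)   # illustrative continuous targets

# An integer cv with a classifier and class labels resolves to a stratified splitter
cv_clf = check_cv(5, y_class, classifier=True)
# The same integer for a non-classifier resolves to plain KFold
cv_reg = check_cv(5, y_reg, classifier=False)
# cv=None falls back to the default number of folds
cv_default = check_cv(None, y_class, classifier=True)

print(type(cv_clf).__name__)
print(type(cv_reg).__name__)
print(cv_default.get_n_splits())
```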

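As a heuristic, a higher value of cv (e.g., 10) trains each candidate on a larger share of the data and averages the score over more folds, which tends to give more reliable performance estimates, especially on smaller datasets. The cost grows roughly linearly with the number of folds, so for large datasets or when computational resources are limited, a lower value (e.g., 3) can be used to reduce computation time.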
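
The computational cost can be made concrete by counting model fits: each sampled candidate is fitted once per fold, plus one final refit on the full data when refit=True (the default). The sketch below assumes a DecisionTreeClassifier and small illustrative sizes so that it runs quickly.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
param_dist = {"max_depth": randint(1, 10), "min_samples_split": randint(2, 10)}

n_iter = 8
fit_counts = {}
for cv in [3, 5, 10]:
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_dist,
        n_iter=n_iter,
        cv=cv,
        random_state=0,
    )
    search.fit(X, y)
    n_candidates = len(search.cv_results_["params"])
    # candidates x folds, plus one refit of the best candidate on all data
    total_fits = n_candidates * search.n_splits_ + 1
    fit_counts[cv] = total_fits
    print(f"cv={cv}: {n_candidates} candidates x {search.n_splits_} folds + 1 refit = {total_fits} fits")
```

Each fit at a higher cv also trains on a larger fraction of the data, so wall-clock time can grow somewhat faster than the fit count alone suggests.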

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import time

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=42)

# Define a parameter distribution for RandomForestClassifier hyperparameters
param_dist = {'n_estimators': randint(10, 100),
              'max_depth': [None, 5, 10],
              'min_samples_split': randint(2, 10)}

# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)

# List of cv values to test
cv_values = [3, 5, 10]

for cv in cv_values:
    start_time = time.perf_counter()

    # Run RandomizedSearchCV with the current cv value
    search = RandomizedSearchCV(rf, param_dist, n_iter=10, cv=cv, random_state=42)
    search.fit(X, y)

    end_time = time.perf_counter()
    execution_time = end_time - start_time

    print(f"Best score for cv={cv}: {search.best_score_:.3f}")
    print(f"Execution time for cv={cv}: {execution_time:.2f} seconds")
    print()

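The example gives output like the following (execution times will vary with the machine, and scores may differ slightly across library versions):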

Best score for cv=3: 0.936
Execution time for cv=3: 3.76 seconds

Best score for cv=5: 0.939
Execution time for cv=5: 6.58 seconds

Best score for cv=10: 0.933
Execution time for cv=10: 14.69 seconds

The steps in the example are:

1. Generate a synthetic binary classification dataset using make_classification().
2. Define a parameter distribution param_dist for RandomForestClassifier hyperparameters.
3. Create a base RandomForestClassifier model rf.
4. Iterate over different cv values (3, 5, 10).
5. For each cv value:
   - Record the start time.
   - Run RandomizedSearchCV with n_iter=10 and the current cv value.
   - Record the end time and calculate the execution time.
   - Print the best score and execution time for the current cv value.
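
The best score printed in the last step is the mean test score across folds for the best candidate. As a rough sketch of how the individual fold scores behind that mean can be inspected, the snippet below reads them from cv_results_; the DecisionTreeClassifier and the dataset sizes are assumptions made to keep it fast.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
param_dist = {"max_depth": randint(1, 10)}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_dist,
    n_iter=5,
    cv=5,
    random_state=0,
)
search.fit(X, y)

# Collect the test score of the best candidate on every fold
best = search.best_index_
fold_scores = np.array(
    [search.cv_results_[f"split{i}_test_score"][best] for i in range(search.n_splits_)]
)

print("Per-fold scores:", np.round(fold_scores, 3))
print(f"Mean: {fold_scores.mean():.3f} (best_score_: {search.best_score_:.3f})")
print(f"Std across folds: {fold_scores.std():.3f}")
```

A large spread across folds suggests the mean score is a noisy estimate, which is one reason to consider more folds or a repeated splitter.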

See Also