The ‘cv’ parameter in scikit-learn’s GridSearchCV
controls the cross-validation splitting strategy used during hyperparameter tuning. Setting ‘cv’ appropriately ensures that the model’s performance is evaluated reliably.
Grid search is a method for exhaustively searching over a specified set of parameter values to find the best combination. It trains and evaluates the model for each combination of parameters, using the specified cross-validation strategy to assess performance.
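Because the search is exhaustive, the total number of model fits is the number of parameter combinations multiplied by the number of cross-validation folds. A quick sketch of that arithmetic, using the same grid as the example below:

```python
from itertools import product

# parameter grid matching the example below
param_grid = {'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01]}
n_folds = 5

# every combination of parameter values
combinations = list(product(*param_grid.values()))
print(len(combinations))            # 9 combinations
print(len(combinations) * n_folds)  # 45 model fits in total
```

With 3 values of C, 3 values of gamma, and 5 folds, GridSearchCV performs 45 fits, which is why grid size matters for runtime.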
The ‘cv’ parameter can be set to an integer to specify the number of folds, or to a cross-validation object such as KFold or StratifiedKFold.
Choosing the right cross-validation strategy is crucial for ensuring that the hyperparameter tuning process is robust and reliable, especially for imbalanced datasets where stratification is important.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold, StratifiedKFold
# create a synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
# define the parameter and grid values
param_grid = {'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01]}
# define and perform a grid search with cv=5
grid_cv5 = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=5)
grid_cv5.fit(X, y)
# define and perform a grid search with KFold cv
grid_kfold = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=KFold(n_splits=5))
grid_kfold.fit(X, y)
# define and perform a grid search with StratifiedKFold cv
grid_stratified = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=StratifiedKFold(n_splits=5))
grid_stratified.fit(X, y)
# report the best parameters for cv=5
print("Best parameters found with cv=5:")
print(grid_cv5.best_params_)
# report the best parameters for KFold cv
print("Best parameters found with KFold cv:")
print(grid_kfold.best_params_)
# report the best parameters for StratifiedKFold cv
print("Best parameters found with StratifiedKFold cv:")
print(grid_stratified.best_params_)
Running the example gives an output like:
Best parameters found with cv=5:
{'C': 1, 'gamma': 0.01}
Best parameters found with KFold cv:
{'C': 10, 'gamma': 0.01}
Best parameters found with StratifiedKFold cv:
{'C': 1, 'gamma': 0.01}
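To see why stratification matters for imbalanced data, the sketch below (an illustration separate from the main example, using a deliberately imbalanced dataset) compares the fraction of the minority class in each test fold produced by KFold and StratifiedKFold:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# a roughly 90/10 imbalanced binary dataset
X, y = make_classification(n_samples=200, n_classes=2,
                           weights=[0.9, 0.1], random_state=42)

for cv in (KFold(n_splits=5), StratifiedKFold(n_splits=5)):
    # fraction of positive (minority) samples in each test fold
    rates = [y[test].mean() for _, test in cv.split(X, y)]
    print(type(cv).__name__, [round(r, 2) for r in rates])
```

KFold's fold proportions can drift from the overall class balance, while StratifiedKFold keeps each fold's class ratio close to the dataset's, which makes the per-fold scores more comparable.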
The key steps in this example are:
- Generate a synthetic binary classification dataset using make_classification.
- Define a parameter grid for SVC with C and gamma values to search over.
- Create three GridSearchCV objects with different cv settings:
  - Integer cv=5 for 5-fold cross-validation.
  - cv=KFold(n_splits=5) for 5-fold cross-validation.
  - cv=StratifiedKFold(n_splits=5) for 5-fold stratified cross-validation.
- Fit each grid search object to find the best parameters for each cross-validation strategy.
- Print out the best parameters found by each grid search to demonstrate how different cross-validation strategies can affect the hyperparameter tuning results.
This example highlights the importance of selecting an appropriate cross-validation strategy to ensure reliable evaluation during hyperparameter tuning with GridSearchCV.
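Beyond best_params_, each fitted GridSearchCV object also reports the mean cross-validated score of the winning combination, which is often more informative than the parameters alone. A short sketch extending the example above (reusing the same dataset and grid):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# same dataset and grid as the main example
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
param_grid = {'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01]}

grid = GridSearchCV(SVC(), param_grid, cv=StratifiedKFold(n_splits=5))
grid.fit(X, y)

print(grid.best_score_)      # mean cross-validated accuracy of the best combination
print(grid.best_estimator_)  # by default, refit on the full dataset with the best parameters
```

Comparing best_score_ across cross-validation strategies, not just best_params_, gives a fuller picture of how the splitting choice affects the tuning result.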