Tuning hyperparameters effectively can significantly improve the performance of a machine learning model. GridSearchCV and RandomizedSearchCV are two methods provided by scikit-learn that automate this process.
GridSearchCV exhaustively explores every combination of hyperparameters within a specified grid. Its key arguments are param_grid (a dictionary mapping parameter names to the values to search), cv (the number of cross-validation folds), and scoring (the evaluation metric).
RandomizedSearchCV randomly samples a fixed number of hyperparameter combinations. Its key arguments are param_distributions (a dictionary mapping parameter names to distributions or lists of values), n_iter (the number of parameter settings sampled), and cv (the number of cross-validation folds).
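Because param_distributions accepts distributions as well as lists, parameters can be sampled from ranges instead of enumerating every value. The sketch below is illustrative only; the scipy.stats ranges are assumptions, not tuned settings.
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
# Distributions and plain lists can be mixed freely in param_distributions
param_distributions = {
    'n_estimators': randint(5, 100),      # integers sampled from [5, 100)
    'max_depth': [None, 10, 20],          # list entries sampled uniformly
    'min_samples_split': randint(2, 11),  # integers sampled from [2, 11)
    'max_features': uniform(0.1, 0.9)     # fraction of features in [0.1, 1.0)
}
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_distributions=param_distributions, n_iter=20, cv=5, scoring='accuracy', random_state=42)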
The main difference is that GridSearchCV evaluates every combination in the grid, which guarantees the best setting within the grid is found but can be computationally expensive. RandomizedSearchCV evaluates only a randomly sampled subset of combinations, trading some thoroughness for speed.
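To make the cost difference concrete, you can count the model fits each search performs. This sketch assumes the 3x3x3 grid and 5-fold cross-validation used in the example that follows, and ignores the final refit on the best parameters.
from sklearn.model_selection import ParameterGrid
param_grid = {
    'n_estimators': [5, 10, 50],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
n_combinations = len(ParameterGrid(param_grid))  # 3 * 3 * 3 = 27
cv_folds = 5
print(f"GridSearchCV fits: {n_combinations * cv_folds}")        # 27 * 5 = 135
print(f"RandomizedSearchCV fits (n_iter=10): {10 * cv_folds}")  # 10 * 5 = 50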
GridSearchCV is suitable for smaller hyperparameter spaces or when accuracy is paramount, whereas RandomizedSearchCV is better for larger hyperparameter spaces or when computational efficiency is critical.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the model
model = RandomForestClassifier(random_state=42)
# Define hyperparameter grid for GridSearchCV
param_grid = {
    'n_estimators': [5, 10, 50],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
# Perform GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
y_pred_grid = grid_search.predict(X_test)
print(f"GridSearchCV accuracy: {accuracy_score(y_test, y_pred_grid):.3f}")
print(f"GridSearchCV F1 score: {f1_score(y_test, y_pred_grid):.3f}")
print(f"Best hyperparameters (GridSearchCV): {grid_search.best_params_}")
# Define hyperparameter distributions for RandomizedSearchCV
param_distributions = {
    'n_estimators': [5, 10, 50],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
# Perform RandomizedSearchCV
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions, n_iter=10, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)
y_pred_random = random_search.predict(X_test)
print(f"\nRandomizedSearchCV accuracy: {accuracy_score(y_test, y_pred_random):.3f}")
print(f"RandomizedSearchCV F1 score: {f1_score(y_test, y_pred_random):.3f}")
print(f"Best hyperparameters (RandomizedSearchCV): {random_search.best_params_}")
Running the example gives an output like:
GridSearchCV accuracy: 0.900
GridSearchCV F1 score: 0.903
Best hyperparameters (GridSearchCV): {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 50}
RandomizedSearchCV accuracy: 0.885
RandomizedSearchCV F1 score: 0.888
Best hyperparameters (RandomizedSearchCV): {'n_estimators': 10, 'min_samples_split': 10, 'max_depth': 10}
- Generate a synthetic classification dataset using make_classification.
- Split the dataset into training and test sets using train_test_split.
- Define RandomForestClassifier as the model.
- Set up GridSearchCV with a defined hyperparameter grid and perform the search.
- Evaluate the performance of the best model from GridSearchCV on the test set.
- Set up RandomizedSearchCV with hyperparameter distributions and perform the search.
- Evaluate the performance of the best model from RandomizedSearchCV on the test set.
- Compare the results from both hyperparameter tuning methods in terms of accuracy, F1 score, and selected hyperparameters.
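Beyond the headline metrics, both fitted search objects expose best_score_ and cv_results_, which make it easy to compare how the candidate settings performed during cross-validation. A short sketch, assuming grid_search and random_search have been fitted as above and pandas is available:
import pandas as pd
# Cross-validated accuracy of the best candidate found by each search
print(f"GridSearchCV best CV accuracy: {grid_search.best_score_:.3f}")
print(f"RandomizedSearchCV best CV accuracy: {random_search.best_score_:.3f}")
# Inspect the top grid search candidates ranked by mean cross-validated accuracy
results = pd.DataFrame(grid_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').head())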