Scikit-Learn GridSearchCV Hyperparameter Optimization

GridSearchCV is a scikit-learn tool for systematically tuning the hyperparameters of a model. It performs an exhaustive search over a specified parameter grid, evaluating every combination with cross-validation and reporting the combination with the best mean score.

The key parameters of GridSearchCV are estimator (the model to tune), param_grid (the hyperparameter values to search), scoring (the metric to optimize), and cv (the cross-validation splitting strategy).
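As a rough illustration of how these four parameters fit together, the sketch below passes an explicit splitter for cv and a non-default metric for scoring. The dataset size, the tiny grid, and the choice of 'f1_macro' with StratifiedKFold are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# small synthetic dataset, used only for illustration
X, y = make_classification(n_samples=200, random_state=0)

# cv accepts a splitter object as well as an integer number of folds
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),  # the model to tune
    param_grid={'max_depth': [2, 4]},                  # values to try
    scoring='f1_macro',                                # metric to optimize
    cv=cv,                                             # splitting strategy
)
search.fit(X, y)

print("Best parameters: {}".format(search.best_params_))
print("Candidates evaluated: {}".format(len(search.cv_results_['params'])))
```

Passing an integer such as cv=5 is shorthand; a splitter object is mainly useful when you want shuffling or a fixed random_state.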

GridSearchCV is appropriate for tuning the hyperparameters of any supervised learning model, including both classification and regression models.
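For the regression case, a minimal sketch might look like the following. The DecisionTreeRegressor, the synthetic data from make_regression, and the grid values are assumptions chosen for illustration. Note that scikit-learn scorers follow a higher-is-better convention, so mean squared error is exposed as 'neg_mean_squared_error'.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# synthetic regression dataset, used only for illustration
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [3, 5, 7]},
    scoring='neg_mean_squared_error',
    cv=5,
)
search.fit(X, y)

# the score is the negated MSE, so flip the sign to recover the MSE
best_mse = -search.best_score_
print("Best parameters: {}".format(search.best_params_))
print("Best cross-validated MSE: {:.3f}".format(best_mse))
```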

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# generate synthetic classification dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=42)

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# define the model to tune
model = DecisionTreeClassifier(random_state=42)

# define the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10, 20]
}

# perform grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

# report the best mean cross-validated score and the best parameters
print("Best score: {:.3f}".format(grid_search.best_score_))
print("Best parameters: {}".format(grid_search.best_params_))

# evaluate the best model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.3f}".format(accuracy))

Running the example gives an output like:

Best score: 0.741
Best parameters: {'max_depth': 7, 'min_samples_split': 2}
Test set accuracy: 0.810

The steps are as follows:

1. A synthetic multiclass classification dataset is generated using make_classification(). The dataset is split into training and test sets using train_test_split().
2. A DecisionTreeClassifier is defined as the model to tune. A parameter grid is defined with different values for the max_depth and min_samples_split hyperparameters.
3. GridSearchCV is instantiated with the model, parameter grid, accuracy scoring metric, and 5-fold cross-validation. It is then fitted on the training data to find the best hyperparameter combination.
4. The best mean cross-validated score and the best hyperparameters are reported. Finally, the best model, which GridSearchCV refits on the full training set by default (refit=True), is evaluated on the test set to assess its performance on unseen data.
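Beyond the single best result, the fitted search also records a score for every combination in its cv_results_ attribute. The sketch below reuses the same grid (4 x 4 = 16 candidates) and lists the top three by rank. For brevity it fits on the full synthetic dataset rather than a training split, which is an assumption of this illustration and not part of the workflow above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=42)

param_grid = {
    'max_depth': [3, 5, 7, 9],
    'min_samples_split': [2, 5, 10, 20]
}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                           scoring='accuracy', cv=5)
grid_search.fit(X, y)

results = grid_search.cv_results_
n_candidates = len(results['params'])  # one entry per grid combination

# sort candidates by rank (rank 1 is the best mean test score)
ranked = sorted(
    zip(results['rank_test_score'], results['mean_test_score'], results['params']),
    key=lambda row: row[0],
)
for rank, mean_score, params in ranked[:3]:
    print("rank {}: {:.3f} {}".format(rank, mean_score, params))
```

Looking at the runners-up can show whether the best combination wins by a clear margin or whether several settings perform about the same.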

This example demonstrates how to use GridSearchCV to systematically search for the best hyperparameters of a scikit-learn model. By defining a model and a parameter grid, you can find the best-performing combination within that grid for your dataset; values outside the grid are never tried, so the result is only as good as the grid you define.

See Also