
Scikit-Learn GridSearchCV MultinomialNB

Hyperparameter tuning is a crucial step in getting the best performance out of a machine learning model. In this example, we’ll demonstrate how to use scikit-learn’s GridSearchCV to perform hyperparameter tuning for Multinomial Naive Bayes, an algorithm commonly used for classification tasks with discrete data.

Grid search is a method for evaluating different combinations of model hyperparameters to find the best performing configuration. It exhaustively searches through a specified parameter grid, trains and evaluates the model for each combination using cross-validation, and selects the hyperparameters that yield the best performance metric.
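To make the search space concrete, we can enumerate the candidate combinations ourselves with scikit-learn's ParameterGrid, which performs the same expansion GridSearchCV runs internally. A minimal sketch, using the same grid we define later in this example:

from sklearn.model_selection import ParameterGrid

# 3 alpha values x 2 fit_prior values = 6 candidate configurations
param_grid = {'alpha': [0.1, 0.5, 1.0], 'fit_prior': [True, False]}
for params in ParameterGrid(param_grid):
    print(params)

Each of these six dictionaries becomes one model that GridSearchCV trains and scores with cross-validation.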

Multinomial Naive Bayes is a probabilistic classifier based on Bayes’ theorem. It models features as counts (for example, word counts in a document), which makes it particularly effective for text classification and other applications involving discrete features. The algorithm calculates the conditional probability of each class given the feature values and assigns the class with the highest probability.
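As a quick, self-contained illustration of the classifier itself (a toy sketch with made-up word counts, independent of the grid search example below), MultinomialNB learns class-conditional feature probabilities from counts and exposes the resulting class probabilities via predict_proba:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy document-term matrix: rows are documents, columns are word counts
X_counts = np.array([[2, 1, 0],
                     [3, 0, 0],
                     [0, 2, 3],
                     [0, 1, 4]])
y_labels = np.array([0, 0, 1, 1])

clf = MultinomialNB()
clf.fit(X_counts, y_labels)
print(clf.predict_proba([[1, 0, 2]]))  # probability of each class for a new document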

The key hyperparameters for Multinomial Naive Bayes include alpha, the additive (Laplace/Lidstone) smoothing parameter that prevents zero probabilities for feature values unseen in a class, and fit_prior, which determines whether the model learns class prior probabilities from the data or uses a uniform prior.
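The effect of alpha is easiest to see on a feature that never occurs in one class. A rough sketch (again with toy counts): with a very small alpha the estimated probability of the unseen feature stays close to zero, while a larger alpha smooths it toward the other features:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Feature 2 never appears in class 0
X_counts = np.array([[2, 1, 0],
                     [3, 1, 0],
                     [0, 2, 3]])
y_labels = np.array([0, 0, 1])

for alpha in (0.01, 1.0):
    clf = MultinomialNB(alpha=alpha).fit(X_counts, y_labels)
    # feature_log_prob_[0] holds log P(feature | class 0)
    print(alpha, np.exp(clf.feature_log_prob_[0]))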

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
import numpy as np

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_classes=2, random_state=42)
# MultinomialNB requires non-negative feature values, so take absolute values
X = np.abs(X)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'alpha': [0.1, 0.5, 1.0],
    'fit_prior': [True, False]
}

# Perform grid search
grid_search = GridSearchCV(estimator=MultinomialNB(),
                           param_grid=param_grid,
                           cv=5,
                           scoring='accuracy')
grid_search.fit(X_train, y_train)

# Report best score and parameters
print(f"Best score: {grid_search.best_score_:.3f}")
print(f"Best parameters: {grid_search.best_params_}")

# Evaluate on test set
best_model = grid_search.best_estimator_
accuracy = best_model.score(X_test, y_test)
print(f"Test set accuracy: {accuracy:.3f}")

Running the example gives an output like:

Best score: 0.710
Best parameters: {'alpha': 0.1, 'fit_prior': True}
Test set accuracy: 0.705

The steps are as follows:

  1. Generate a synthetic classification dataset using make_classification.
  2. Split the dataset into train and test sets using train_test_split.
  3. Define the parameter grid with different values for alpha and fit_prior.
  4. Perform grid search using GridSearchCV with MultinomialNB, 5-fold cross-validation, and accuracy scoring.
  5. Report the best cross-validation score and the best hyperparameters found.
  6. Evaluate the best model on the test set and report the accuracy (see the note after this list).
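
A detail worth knowing for step 6: by default (refit=True), GridSearchCV refits the best configuration on the whole training set, so best_estimator_ is ready to use, and predict or score called on the fitted search object delegate to that refitted model. The explicit best_estimator_ step above is therefore equivalent to scoring the search object directly:

# Equivalent to best_model.score(X_test, y_test), since refit=True by default
print(f"Test set accuracy: {grid_search.score(X_test, y_test):.3f}")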

By using GridSearchCV, we can efficiently identify the optimal hyperparameters for MultinomialNB, improving model performance and saving time compared to manual tuning.
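
Beyond the single best configuration, the fitted search object keeps per-combination results in cv_results_, which is handy for judging how sensitive the model is to each hyperparameter. A brief sketch, continuing from the fitted grid_search above (this one assumes pandas is installed for the tabular view):

import pandas as pd

# Mean cross-validation accuracy for every parameter combination
results = pd.DataFrame(grid_search.cv_results_)
print(results[['param_alpha', 'param_fit_prior', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False))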


