The priors parameter in GaussianNB allows you to set prior probabilities for each class.
GaussianNB is a naive Bayes classifier based on applying Bayes’ theorem with the assumption that the features follow a Gaussian (normal) distribution within each class. The priors parameter specifies the prior probabilities of the classes as an array of length n_classes whose values must sum to 1. If not specified, the class priors are estimated from the data.

The default value for priors is None, meaning that class priors are inferred from the training data as the relative class frequencies. In practice, setting specific priors can be useful when you have prior knowledge about the class distribution, for example when the training set is sampled differently from the population the model will be applied to.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Train and evaluate models with different priors values
priors_values = [None, [0.2, 0.5, 0.3], [0.33, 0.33, 0.34]]
accuracies = []
for priors in priors_values:
    gnb = GaussianNB(priors=priors)
    gnb.fit(X_train, y_train)
    y_pred = gnb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"priors={priors}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
priors=None, Accuracy: 0.705
priors=[0.2, 0.5, 0.3], Accuracy: 0.695
priors=[0.33, 0.33, 0.34], Accuracy: 0.710
```
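You can verify which priors a fitted model actually uses by inspecting its class_prior_ attribute. The sketch below (using the same synthetic dataset as above) shows that with priors=None the attribute holds the estimated class frequencies, while with explicit priors it holds exactly the values you passed:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=42)

# With priors=None, class_prior_ is estimated from the class frequencies,
# so it is roughly uniform for this (approximately balanced) dataset
gnb_default = GaussianNB().fit(X, y)
print(gnb_default.class_prior_)

# With explicit priors, class_prior_ is exactly what was passed
gnb_fixed = GaussianNB(priors=[0.2, 0.5, 0.3]).fit(X, y)
print(gnb_fixed.class_prior_)  # [0.2 0.5 0.3]
```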
The key steps in this example are:
- Generate a synthetic multi-class classification dataset with informative and noise features.
- Split the data into train and test sets.
- Train GaussianNB models with different priors values.
- Evaluate the accuracy of each model on the test set.
Some tips and heuristics for setting priors:
- Use domain knowledge to set priors if you have prior probabilities available.
- If you do not have prior probabilities, let the model infer them from the training data by setting priors to None.
- Adjust priors to see whether model performance improves under different assumptions about the class distribution.
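One simple way to follow the last tip is to compare a few candidate priors settings with cross-validation rather than a single train/test split. This is a sketch with arbitrary illustrative candidates, not a recommendation of specific values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=42)

# Candidate priors to compare (values chosen only for illustration)
candidates = [None, [0.2, 0.5, 0.3], [0.5, 0.25, 0.25]]
for priors in candidates:
    # 5-fold cross-validated accuracy for each setting
    scores = cross_val_score(GaussianNB(priors=priors), X, y, cv=5)
    print(f"priors={priors}: mean CV accuracy={scores.mean():.3f}")
```

Averaging over folds gives a more stable comparison than the single-split accuracies reported earlier.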
Issues to consider:
- Incorrectly setting priors can bias the model towards certain classes.
- The optimal priors depend on the specific dataset and problem context.
- If the data is imbalanced, setting appropriate priors can help improve model performance.
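To illustrate the imbalanced-data point, the sketch below trains on a skewed synthetic dataset (the 0.8/0.15/0.05 class weights are arbitrary). With priors=None the model absorbs the skewed training frequencies; if you expect classes to be balanced at prediction time, you can override them with uniform priors, which typically shifts predictions towards the minority classes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Imbalanced dataset where class 0 dominates (weights are illustrative)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=3,
                           weights=[0.8, 0.15, 0.05], random_state=42)

# Without priors, the learned class_prior_ mirrors the skewed frequencies
gnb = GaussianNB().fit(X, y)
print(gnb.class_prior_)  # heavily weighted towards class 0

# Override with uniform priors if balanced classes are expected at prediction time
gnb_uniform = GaussianNB(priors=[1/3, 1/3, 1/3]).fit(X, y)

# Compare how often each class is predicted under the two settings
pred_default = gnb.predict(X)
pred_uniform = gnb_uniform.predict(X)
print(np.bincount(pred_default, minlength=3))
print(np.bincount(pred_uniform, minlength=3))
```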