Normalized Mutual Information (NMI) is a metric for evaluating the agreement between two labelings, such as the true and predicted labels in classification and clustering tasks. It is defined as the mutual information (MI) shared between the two labelings, normalized by the average of their individual entropies.
The normalized_mutual_info_score() function in scikit-learn computes this ratio directly. The score ranges from 0 to 1, where 1 indicates perfect agreement and 0 indicates no mutual information.
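To make the normalization concrete, here is a minimal sketch that reproduces the score by hand from scikit-learn's mutual_info_score() and the label entropies. The toy labelings are made up for illustration, and the arithmetic mean matches scikit-learn's default average_method='arithmetic'.
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score
# Toy labelings, made up for illustration
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [0, 0, 1, 2, 2, 2]
# Mutual information between the two labelings (in nats)
mi = mutual_info_score(labels_true, labels_pred)
# Entropy of each labeling, computed from the label frequencies
h_true = entropy(np.bincount(labels_true))
h_pred = entropy(np.bincount(labels_pred))
# Normalize MI by the arithmetic mean of the two entropies
print(mi / np.mean([h_true, h_pred]))
print(normalized_mutual_info_score(labels_true, labels_pred))  # same value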
NMI is particularly useful in clustering tasks, where it can compare two labelings regardless of how the cluster IDs happen to be assigned. However, it may not provide meaningful insights for very small datasets or when the true labels are not well-defined.
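For example, because NMI depends only on how the points are grouped and not on the label values themselves, relabeling the clusters does not change the score. A short made-up example:
from sklearn.metrics import normalized_mutual_info_score
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster IDs
print(normalized_mutual_info_score(a, b))  # 1.0: label permutations do not matter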
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import normalized_mutual_info_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train an SVM classifier
clf = SVC(kernel='linear', C=1, random_state=42)
clf.fit(X_train, y_train)
# Predict on test set
y_pred = clf.predict(X_test)
# Calculate normalized mutual information score
nmi = normalized_mutual_info_score(y_test, y_pred)
print(f"Normalized Mutual Information Score: {nmi:.2f}")
Running the example gives an output like:
Normalized Mutual Information Score: 0.45
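The exact score depends on the random split and on the classifier's errors. If a different normalization is preferred, normalized_mutual_info_score() also accepts an average_method argument; a quick sketch, reusing y_test and y_pred from the listing above:
# Alternative normalizations of the mutual information (default is 'arithmetic')
for method in ("min", "geometric", "arithmetic", "max"):
    score = normalized_mutual_info_score(y_test, y_pred, average_method=method)
    print(method, round(score, 4))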
The steps are as follows:
- Generate a synthetic binary classification dataset using make_classification().
- Split the dataset into training and test sets using train_test_split().
- Train an SVC classifier on the training set.
- Use the trained classifier to make predictions on the test set with predict().
- Calculate the normalized mutual information score using normalized_mutual_info_score() by comparing the predicted labels to the true labels.
First, we generate a synthetic binary classification dataset with 1000 samples and 2 classes using the make_classification() function.
Next, we split the dataset into training and test sets with an 80-20 ratio using the train_test_split() function, so that we can evaluate the classifier on unseen data.
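As a quick sanity check, reusing the variables from the listing above: the split leaves 800 training and 200 test samples, and make_classification() uses 20 features by default.
print(X_train.shape, X_test.shape)  # (800, 20) and (200, 20)
print(y_train.shape, y_test.shape)  # (800,) and (200,)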
We then train an SVM classifier using the SVC class from scikit-learn, specifying a linear kernel and a regularization parameter C of 1. The classifier is fitted to the training data with the fit() method.
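Since the kernel is linear, the fitted model exposes the weights and bias of the separating hyperplane, which can be inspected after fit(); a small sketch reusing clf from above:
# coef_ is only available for linear kernels
print(clf.coef_.shape)  # (1, 20): one weight per feature for a binary SVC
print(clf.intercept_)   # bias term of the hyperplane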
After training, we make predictions on the test set using the predict() method.
Finally, we calculate the normalized mutual information score with the normalized_mutual_info_score() function by comparing the true labels to the predicted labels, giving us a measure of agreement between the two labelings.
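Note that NMI is symmetric in its arguments, so swapping the true and predicted labels gives the same score; it can also be reported alongside accuracy for context. A quick check, reusing y_test and y_pred from above:
from sklearn.metrics import accuracy_score
print(normalized_mutual_info_score(y_pred, y_test))  # same score: NMI is symmetric
print(accuracy_score(y_test, y_pred))                # accuracy for comparison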