Scikit-Learn adjusted_mutual_info_score() Metric

adjusted_mutual_info_score() measures the agreement between two clusterings, adjusted for chance.

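The chance correction depends on both the number of clusters and the size of each cluster, because unadjusted mutual information tends to increase as the number of clusters grows, even when the two labelings are unrelated.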
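To make the chance adjustment concrete, the sketch below compares the unadjusted mutual_info_score() with adjusted_mutual_info_score() on pairs of purely random labelings. The sample size, the cluster counts, and the seed are arbitrary choices for illustration, not values from the example that follows.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

# Illustrative settings (assumptions): 1000 samples, three cluster counts
rng = np.random.default_rng(0)
n_samples = 1000

results = {}
for k in (2, 10, 50):
    # Two independent random labelings with k clusters each
    a = rng.integers(0, k, size=n_samples)
    b = rng.integers(0, k, size=n_samples)
    mi = mutual_info_score(a, b)
    ami = adjusted_mutual_info_score(a, b)
    results[k] = (mi, ami)
    print(f"k={k:>2}  MI={mi:.3f}  AMI={ami:.3f}")
```

The raw MI should climb as k increases even though the labelings share no structure, while the adjusted score should stay close to zero.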

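The score is 1.0 when the two clusterings are identical (up to a renaming of labels) and close to 0 for random, independent labelings. It can be slightly negative when the agreement is worse than expected by chance. Higher values indicate better agreement.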
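A small sketch of the two ends of the scale, using hand-made toy labels (hypothetical data, chosen only to make the behavior easy to see):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Identical labelings give the maximum score
same = adjusted_mutual_info_score(labels, labels)

# Interleaved labels: every cluster in `labels` is spread evenly across
# the clusters in `mixed`, so the two labelings share no information
mixed = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
unrelated = adjusted_mutual_info_score(labels, mixed)

print(f"identical labelings: {same:.2f}")
print(f"unrelated labelings: {unrelated:.2f}")
```

The second score falls below zero because the observed mutual information (zero here) is lower than the value expected by chance for clusters of these sizes.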

This metric is commonly used in clustering problems rather than traditional supervised classification.

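However, it requires ground-truth labels to compare against, and cluster balance can influence which chance-adjusted metric is the better fit. A commonly cited rule of thumb is to prefer adjusted_mutual_info_score() when the reference clustering is unbalanced and contains small clusters, and adjusted_rand_score() when the reference clusters are large and roughly equal in size.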

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_clusters_per_class=1, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a KMeans clustering algorithm
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_train)

# Predict on test set
y_pred = kmeans.predict(X_test)

# Calculate Adjusted Mutual Information score
ami_score = adjusted_mutual_info_score(y_test, y_pred)
print(f"Adjusted Mutual Information Score: {ami_score:.2f}")

Running the example gives an output like:

Adjusted Mutual Information Score: 0.44

The steps are as follows:

1. Generate a synthetic dataset using make_classification() to simulate a clustering problem.
2. Split the dataset into training and test sets with train_test_split().
3. Train a KMeans clustering model on the training data.
4. Predict cluster labels on the test set with predict().
5. Evaluate the clustering performance using adjusted_mutual_info_score() by comparing true labels to predicted cluster labels.

First, we generate a synthetic dataset with three classes using the make_classification() function from scikit-learn. This allows us to simulate a clustering problem without using real-world data.

Next, we split the dataset into training and test sets using the train_test_split() function. Holding out a test set lets us check how well the learned cluster assignments carry over to samples the model has not seen. We use 80% of the data for training and reserve 20% for testing.

With our data prepared, we train a KMeans clustering model using the KMeans class from scikit-learn. We specify three clusters and set the random state for reproducibility. The fit() method is called on the model object, passing in the training features (X_train) to learn the underlying patterns in the data.

After training, we use the trained model to make predictions on the test set by calling the predict() method with X_test. This generates predicted cluster labels for each sample in the test set.
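Because KMeans numbers its clusters arbitrarily, it can help to look at how predicted cluster IDs line up with the true classes before computing a score. The sketch below rebuilds the same data and model and prints a contingency table using contingency_matrix() from sklearn.metrics.cluster. Setting n_init=10 explicitly is an addition here, intended only to keep the behavior consistent across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics.cluster import contingency_matrix
from sklearn.model_selection import train_test_split

# Same data and split as the main example
X, y = make_classification(n_samples=1000, n_clusters_per_class=1, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_init=10 is an explicit choice made for this sketch
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)
y_pred = kmeans.predict(X_test)

# Rows are true classes, columns are predicted cluster IDs
table = contingency_matrix(y_test, y_pred)
print(table)
```

A row whose counts are concentrated in a single column indicates a class that the model recovered as one cluster, regardless of which ID that cluster was given.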

Finally, we evaluate the performance of our clustering model using the adjusted_mutual_info_score() function. This function takes the true labels (y_test) and the predicted cluster labels (y_pred) as input and calculates the agreement between the two clusterings, adjusted for chance. The score does not depend on how the cluster IDs are numbered, so cluster 0 does not need to correspond to class 0. The resulting AMI score is printed, giving us a quantitative measure of our model's performance.
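The comparison of true labels with cluster labels works even though the two use unrelated numbering. A minimal sketch with hypothetical toy labels shows that renaming the cluster IDs, or swapping the two arguments, leaves the score unchanged:

```python
from sklearn.metrics import adjusted_mutual_info_score

# Toy labels for illustration only (not taken from the example above)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [2, 2, 2, 0, 0, 1, 1, 1, 1]

# Rename the predicted cluster IDs with an arbitrary permutation
relabel = {0: 1, 1: 2, 2: 0}
y_pred_renamed = [relabel[c] for c in y_pred]

score = adjusted_mutual_info_score(y_true, y_pred)
score_renamed = adjusted_mutual_info_score(y_true, y_pred_renamed)
score_swapped = adjusted_mutual_info_score(y_pred, y_true)

print(f"original: {score:.3f}")
print(f"renamed:  {score_renamed:.3f}")
print(f"swapped:  {score_swapped:.3f}")
```

All three printed values should match, which is why no step to align cluster IDs with class IDs is needed before scoring.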
