adjusted_mutual_info_score() measures the agreement between two clusterings, adjusted for chance.
The metric considers both the total number of clusters and the size of each cluster.
The score is at most 1, which indicates identical clusterings. Scores near 0 indicate agreement no better than chance, and slightly negative values are possible.
This metric is commonly used in clustering problems rather than traditional supervised classification.
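However, the size distribution of the clusters matters when choosing a measure: a common guideline is to prefer AMI when the reference clustering is unbalanced and contains small clusters, and the adjusted Rand index when the clusters are large and roughly equal in size.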
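As a quick illustration of how the score behaves, here is a small standalone sketch on made-up label arrays (the arrays, the seed, and the variable names are assumptions chosen for illustration, not part of the main example). It checks three properties: renaming the clusters does not change the score, unrelated labelings score close to 0, and the order of the two arguments does not matter.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)

# Hypothetical reference labels: 1000 samples spread over 3 groups
labels_true = rng.integers(0, 3, size=1000)

# Same partition with the cluster ids renamed (0->2, 1->0, 2->1)
mapping = np.array([2, 0, 1])
labels_renamed = mapping[labels_true]

# A labeling drawn independently of labels_true
labels_random = rng.integers(0, 3, size=1000)

ami_renamed = adjusted_mutual_info_score(labels_true, labels_renamed)
ami_random = adjusted_mutual_info_score(labels_true, labels_random)
ami_swapped = adjusted_mutual_info_score(labels_random, labels_true)

print(f"Renamed clusters:  {ami_renamed:.3f}")   # identical partitions
print(f"Random labeling:   {ami_random:.3f}")    # close to zero
print(f"Arguments swapped: {ami_swapped:.3f}")   # same as the line above
```

Because only the grouping matters and not the label values, there is no need to match cluster ids to class ids before computing the score.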
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_clusters_per_class=1, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a KMeans clustering algorithm
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_train)
# Predict on test set
y_pred = kmeans.predict(X_test)
# Calculate Adjusted Mutual Information score
ami_score = adjusted_mutual_info_score(y_test, y_pred)
print(f"Adjusted Mutual Information Score: {ami_score:.2f}")
Running the example gives an output like:
Adjusted Mutual Information Score: 0.44
The steps are as follows:
- Generate a synthetic dataset using make_classification() to simulate a clustering problem.
- Split the dataset into training and test sets with train_test_split().
- Train a KMeans clustering model on the training data.
- Predict cluster labels on the test set with predict().
- Evaluate the clustering performance using adjusted_mutual_info_score() by comparing true labels to predicted cluster labels.
First, we generate a synthetic dataset with three classes using the make_classification() function from scikit-learn. This allows us to simulate a clustering problem without using real-world data.
Next, we split the dataset into training and test sets using the train_test_split() function. This step is crucial for evaluating the performance of our clustering model on unseen data. We use 80% of the data for training and reserve 20% for testing.
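To confirm the 80/20 split, a minimal sketch like the following prints the shapes of the two sets. It reuses the dataset settings from the example. The stratify=y variant at the end is an optional addition not used in the original example; it keeps the class proportions similar in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic dataset as in the main example (20 features by default)
X, y = make_classification(n_samples=1000, n_clusters_per_class=1, n_classes=3, random_state=42)

# 80% training, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)

# Optional variant (an assumption, not in the original): stratify on the true labels
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(Xs_train.shape, Xs_test.shape)
```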
With our data prepared, we train a KMeans clustering model using the KMeans class from scikit-learn. We specify three clusters and set the random state for reproducibility. The fit() method is called on the model object, passing in the training features (X_train) to learn the underlying patterns in the data.
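Note that fit() receives only the features: KMeans is unsupervised, so y_train is never used during training. The sketch below inspects what the fitted model stores. Setting n_init=10 explicitly is an assumption made here so the behavior does not depend on the scikit-learn version's default; the original example relies on the default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_clusters_per_class=1, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_init=10 is set explicitly for illustration (not in the original example)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_train)  # no labels are passed

centers_shape = kmeans.cluster_centers_.shape  # one centroid per cluster
n_train_labels = kmeans.labels_.shape[0]       # one cluster id per training sample

print("Centroid array shape:", centers_shape)
print("Training labels assigned:", n_train_labels)
print("Inertia (within-cluster sum of squares):", kmeans.inertia_)
```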
After training, we use the trained model to make predictions on the test set by calling the predict() method with X_test. This generates predicted cluster labels for each sample in the test set.
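The predicted cluster ids are arbitrary, so cluster 0 need not correspond to class 0. One way to see how the clusters line up with the true classes is a contingency table, sketched below with contingency_matrix() from sklearn.metrics.cluster. As before, n_init=10 is an explicit choice made for this sketch rather than something taken from the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics.cluster import contingency_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_clusters_per_class=1, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)
y_pred = kmeans.predict(X_test)

# Rows: true classes, columns: predicted clusters; each cell counts test samples
table = contingency_matrix(y_test, y_pred)
print(table)
```

A table with one dominant cell per row suggests the clusters follow the classes closely, whatever ids KMeans happened to assign.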
Finally, we evaluate the performance of our clustering model using the adjusted_mutual_info_score() function. This function takes the true labels (y_test) and the predicted cluster labels (y_pred) as input and calculates the agreement between the two clusterings, adjusted for chance. The resulting AMI score is printed, giving us a quantitative measure of our model’s performance.
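To see what "adjusted for chance" buys us, the following sketch compares AMI with the unadjusted normalized_mutual_info_score() on two labelings that are unrelated by construction. The sample size, the number of clusters, and the seed are assumptions picked so that many small clusters appear, which is where the unadjusted score tends to be inflated.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)

# Two independent random labelings: 500 samples, up to 50 clusters each
n_samples = 500
labels_a = rng.integers(0, 50, size=n_samples)
labels_b = rng.integers(0, 50, size=n_samples)

nmi = normalized_mutual_info_score(labels_a, labels_b)
ami = adjusted_mutual_info_score(labels_a, labels_b)

print(f"NMI (unadjusted): {nmi:.3f}")  # noticeably above zero despite no real relationship
print(f"AMI (adjusted):   {ami:.3f}")  # close to zero
```

The gap between the two numbers is the agreement that would be expected from chance alone, which AMI subtracts out.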