
Scikit-Learn adjusted_rand_score() Metric

The Adjusted Rand Index (ARI) measures the similarity between two clusterings of the same samples. It considers every pair of samples and counts the pairs that are placed in the same cluster, or in different clusters, in both the predicted and the true clustering.
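The pair counting described above can be sketched in plain Python. This is an illustrative helper, not scikit-learn's implementation: the name ari_from_pairs is hypothetical, it uses the standard contingency-table form of the ARI, and returning 1.0 for degenerate inputs is an assumption made for this sketch.

```python
from collections import Counter
from math import comb


def ari_from_pairs(labels_true, labels_pred):
    """Adjusted Rand Index computed by counting pairs (hypothetical helper)."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label sequences must have the same length")

    # How many samples fall into each (true label, predicted label) cell
    contingency = Counter(zip(labels_true, labels_pred))
    true_sizes = Counter(labels_true)
    pred_sizes = Counter(labels_pred)

    # Pairs grouped together in both labelings, in the true one, in the predicted one
    together_both = sum(comb(c, 2) for c in contingency.values())
    together_true = sum(comb(c, 2) for c in true_sizes.values())
    together_pred = sum(comb(c, 2) for c in pred_sizes.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # assumption: fewer than two samples counts as agreement

    # Agreement expected by chance, and the largest agreement attainable
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # assumption: both labelings are trivial and identical
    return (together_both - expected) / (maximum - expected)


print(round(ari_from_pairs([0, 0, 1, 1], [0, 0, 1, 2]), 2))  # prints 0.57
```

Here one pair is grouped together in both labelings, while 1/3 of a pair would be expected by chance, which gives (1 - 1/3) / (1.5 - 1/3) = 4/7.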

The adjusted_rand_score() function in scikit-learn computes the ARI, which corrects this pair-counting agreement for chance groupings. A score of 1 indicates perfect agreement (up to a renaming of the cluster labels), scores close to 0 indicate agreement no better than random labeling, and negative values, bounded below by -0.5, indicate agreement worse than expected by chance.
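A few tiny labelings make the score range concrete. The label lists below are made up for illustration; they show a perfect match with swapped cluster IDs, a labeling that agrees less than chance, and the symmetry of the metric.

```python
from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 1, 1]

# Same grouping with swapped cluster IDs: still perfect agreement
perfect = adjusted_rand_score(labels_true, [1, 1, 0, 0])

# Every predicted cluster mixes both true clusters: below-chance agreement
discordant = adjusted_rand_score(labels_true, [0, 1, 0, 1])

# Swapping the two arguments does not change the score
forward = adjusted_rand_score([0, 0, 1, 2], [0, 0, 1, 1])
backward = adjusted_rand_score([0, 0, 1, 1], [0, 0, 1, 2])

print(f"perfect: {perfect:.2f}")        # perfect: 1.00
print(f"discordant: {discordant:.2f}")  # discordant: -0.50
print(f"symmetric: {forward:.2f} == {backward:.2f}")  # symmetric: 0.57 == 0.57
```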

ARI is used in clustering tasks to compare a clustering result against ground-truth labels. It is effective for comparing clustering algorithms, but it requires known labels, so it does not apply to purely unsupervised settings, and scores should be interpreted with care when the predicted and true numbers of clusters differ significantly.
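To see how the score behaves when the number of predicted clusters differs from the ground truth, the sketch below scores a few hand-written candidate labelings against the same two-cluster truth. The labelings and their names are hypothetical and chosen only for illustration.

```python
from sklearn.metrics import adjusted_rand_score

# Ground truth: two clusters of three samples each
labels_true = [0, 0, 0, 1, 1, 1]

# Hypothetical predictions with one, two, three and six clusters
candidates = {
    "one cluster": [0, 0, 0, 0, 0, 0],
    "two clusters (exact)": [1, 1, 1, 0, 0, 0],
    "three clusters": [0, 0, 0, 1, 1, 2],
    "six clusters": [0, 1, 2, 3, 4, 5],
}

scores = {
    name: adjusted_rand_score(labels_true, labels_pred)
    for name, labels_pred in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: ARI = {score:.2f}")
```

Both degenerate predictions (everything in one cluster, or every sample in its own cluster) score 0, while splitting one true cluster in two still earns partial credit.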

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X, y = make_blobs(n_samples=1000, centers=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a KMeans clustering model (unsupervised: y_train is not used)
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_train)

# Assign each test sample to its nearest learned cluster center
y_pred = kmeans.predict(X_test)

# Calculate adjusted rand score
ari = adjusted_rand_score(y_test, y_pred)
print(f"Adjusted Rand Index: {ari:.2f}")

Running the example gives an output like:

Adjusted Rand Index: 1.00

We generate a synthetic dataset using the make_blobs() function from scikit-learn, creating 1000 samples with three centers (clusters).

The dataset is split into training and test sets using the train_test_split() function to ensure the model is evaluated on unseen data.

We fit a KMeans clustering model using the KMeans class from scikit-learn, specifying three clusters and setting a random state for reproducibility. KMeans is unsupervised, so only X_train is passed to fit().

The fitted model assigns a cluster label to each test sample using the predict() method, which picks the nearest learned cluster center.

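The adjusted_rand_score() function calculates the Adjusted Rand Index by comparing the true cluster labels (y_test) with the predicted cluster labels (y_pred), providing a measure of clustering similarity adjusted for chance. Because ARI is invariant to label permutations, the cluster IDs assigned by KMeans do not need to match the IDs used in y_test.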
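As a follow-up sketch, the same metric can be used to compare several candidate cluster counts against the known blob labels. The candidate values 2, 3 and 5 and the explicit n_init=10 are choices made for this illustration, not part of the example above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same synthetic data as in the main example
X, y = make_blobs(n_samples=1000, centers=3, random_state=42)

scores = {}
for k in (2, 3, 5):
    # fit_predict clusters X and returns one label per sample
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = adjusted_rand_score(y, labels)

for k, score in scores.items():
    print(f"n_clusters={k}: ARI = {score:.2f}")
```

With three well-separated blobs, the score should peak at n_clusters=3 and drop when blobs are merged or split.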


