The Rand Index is a measure of the similarity between two data clusterings. It is the proportion of sample pairs on which the two clusterings agree, and it takes a value between 0 and 1. The Rand Index is useful for evaluating the performance of clustering algorithms against known labels.
The rand_score() function in scikit-learn calculates the Rand Index by counting the pairs of samples that are either in the same cluster in both clusterings or in different clusters in both clusterings, and dividing that count by the total number of pairs. It takes the true labels and predicted labels as input and returns a float between 0 and 1, with 1 indicating perfect agreement.
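To make the pair-counting definition concrete, here is a minimal pure-Python sketch. The helper name rand_index() is hypothetical and is not part of scikit-learn, and returning 1.0 when there are no pairs is an assumption made for this illustration.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("Both labelings must have the same length.")

    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        # Assumption for this sketch: with fewer than two samples
        # there is nothing to disagree on.
        return 1.0

    agreements = 0
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        # A pair agrees if it is together in both labelings
        # or apart in both labelings.
        if same_true == same_pred:
            agreements += 1
    return agreements / len(pairs)


labels_true = [0, 0, 1, 1]
labels_pred = [0, 0, 1, 2]
score = rand_index(labels_true, labels_pred)
print(f"Rand Index: {score:.3f}")  # 5 of the 6 pairs agree -> 0.833
```

Only the pair (2, 3) disagrees here: the two samples share a true label but were placed in different predicted clusters.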
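The Rand Index can be used to evaluate a clustering against ground-truth labels with any number of classes, binary or multiclass. However, it is not adjusted for chance: even unrelated labelings can score well above 0, particularly when there are many clusters or when the two clusterings have very different numbers of clusters. Keep this limitation in mind when interpreting the results; scikit-learn also provides adjusted_rand_score(), a chance-corrected variant.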
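The effect of chance can be seen with a small sketch that compares two independent random labelings. The cluster count (10), sample size (1000), and seed are arbitrary choices for this illustration. With 10 equally likely labels, a pair is expected to agree with probability 0.1 * 0.1 + 0.9 * 0.9 = 0.82, so the unadjusted score should land near that value even though the labelings are unrelated.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, rand_score

# Two independent random labelings with 10 clusters each
# (sizes and seed are arbitrary choices for this sketch).
rng = np.random.default_rng(0)
labels_a = rng.integers(0, 10, size=1000)
labels_b = rng.integers(0, 10, size=1000)

ri = rand_score(labels_a, labels_b)
ari = adjusted_rand_score(labels_a, labels_b)

# The unadjusted score is high despite there being no real agreement;
# the adjusted score stays close to 0.
print(f"Rand Index: {ri:.2f}")
print(f"Adjusted Rand Index: {ari:.2f}")
```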
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a KMeans clustering model (unsupervised, so y_train is not used)
clf = KMeans(n_clusters=2, random_state=42)
clf.fit(X_train)
# Predict on test set
y_pred = clf.predict(X_test)
# Calculate Rand Index
rand_index = rand_score(y_test, y_pred)
print(f"Rand Index: {rand_index:.2f}")
Running the example gives an output like:
Rand Index: 0.66
The steps are as follows:
- Generate a synthetic binary classification dataset using make_classification().
- Split the dataset into training and test sets using train_test_split().
- Fit a KMeans clustering model on the training set.
- Use the fitted model to assign the test-set samples to clusters with predict().
- Calculate the Rand Index using rand_score() by comparing the predicted cluster labels to the true labels.
First, we generate a synthetic binary classification dataset using the make_classification() function from scikit-learn. This function creates a dataset with 1000 samples and 2 classes, allowing us to simulate a classification problem without using real-world data.
Next, we split the dataset into training and test sets using the train_test_split() function. This lets us evaluate the model on data it did not see during fitting. We use 80% of the data for training and reserve 20% for testing.
With our data prepared, we fit a clustering model using the KMeans class from scikit-learn. We specify 2 clusters and set the random state to 42. The fit() method is called on the model object, passing in the training features (X_train) so that it can learn the cluster centers. KMeans is unsupervised, so the training labels (y_train) are not used.
After fitting, we use the model to assign each test sample to a cluster by calling the predict() method with X_test. This generates a predicted cluster label for each sample in the test set.
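For KMeans, predict() assigns each sample to the cluster with the nearest learned center. The sketch below checks this by computing Euclidean distances to cluster_centers_ by hand; the dataset size, the 150/50 split, and n_init=10 are assumptions chosen only for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Small synthetic dataset; sizes are arbitrary choices for this sketch.
X, _ = make_classification(n_samples=200, n_classes=2, random_state=42)
X_fit, X_new = X[:150], X[150:]

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_fit)

# Distance from every new sample to every cluster center,
# shape (n_samples, n_clusters).
distances = np.linalg.norm(
    X_new[:, None, :] - model.cluster_centers_[None, :, :], axis=2
)
nearest = distances.argmin(axis=1)

# The manual nearest-center assignment should match predict().
matches = np.array_equal(nearest, model.predict(X_new))
print(f"Manual assignment matches predict(): {matches}")
```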
Finally, we compute the Rand Index of the clustering using the rand_score() function. This function takes the true labels (y_test) and the predicted cluster labels (y_pred) as input and calculates the proportion of sample pairs on which the two labelings agree. The resulting Rand Index score is printed, giving us a quantitative measure of how well the clusters match the true classes.
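Because the Rand Index compares pairs of samples rather than label values, the arbitrary cluster IDs produced by KMeans do not need to line up with the class labels. The toy labels below are made up for illustration; they show a labeling with the two IDs swapped scoring a perfect Rand Index while plain accuracy drops to zero.

```python
from sklearn.metrics import accuracy_score, rand_score

# Toy labels invented for this sketch: same grouping, IDs swapped.
y_true = [0, 0, 0, 1, 1, 1]
y_swapped = [1, 1, 1, 0, 0, 0]

ri = rand_score(y_true, y_swapped)
acc = accuracy_score(y_true, y_swapped)

print(f"Rand Index: {ri:.2f}")  # 1.00, the grouping is identical
print(f"Accuracy: {acc:.2f}")   # 0.00, every label value differs
```

This is why a pair-based metric is a more natural fit than accuracy when scoring cluster assignments against class labels.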
This example demonstrates how to use the rand_score() function from scikit-learn to evaluate the performance of a clustering model.