completeness_score() is a metric for evaluating the performance of clustering algorithms.
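Completeness measures the extent to which all members of a given class are assigned to the same cluster. It is computed as 1 - H(K|C) / H(K), where H(K|C) is the conditional entropy of the cluster assignments given the true classes and H(K) is the entropy of the cluster assignments. Scores range from 0.0 to 1.0: a score of 1.0 indicates perfect completeness, while a score close to 0 indicates that members of the same class are scattered across many clusters.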
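To make the entropy-based definition concrete, here is a minimal sketch that computes completeness by hand and compares it with scikit-learn's result. The helper name manual_completeness is hypothetical and exists only for illustration; it assumes the standard definition 1 - H(K|C) / H(K), with the score taken to be 1.0 when all samples fall in a single cluster.

```python
import numpy as np
from sklearn.metrics import completeness_score


def manual_completeness(labels_true, labels_pred):
    """Illustrative helper (not part of scikit-learn): 1 - H(K|C) / H(K)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size

    # Build the class-by-cluster contingency table of counts
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    counts = np.zeros((classes.size, clusters.size))
    np.add.at(counts, (class_idx, cluster_idx), 1)

    # H(K): entropy of the cluster assignments
    p_k = counts.sum(axis=0) / n
    h_k = -np.sum(p_k * np.log(p_k))
    if h_k == 0:
        # Only one cluster: no class can be split, so completeness is 1.0
        return 1.0

    # H(K|C): conditional entropy of clusters given classes
    n_c = counts.sum(axis=1, keepdims=True)
    ratio = counts / n_c
    nonzero = counts > 0
    h_k_given_c = -np.sum((counts[nonzero] / n) * np.log(ratio[nonzero]))

    return 1.0 - h_k_given_c / h_k


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # class 1 is split across clusters 1 and 2

manual = manual_completeness(y_true, y_pred)
reference = completeness_score(y_true, y_pred)
print(f"manual={manual:.4f} sklearn={reference:.4f}")
```

The two values should agree up to floating-point error, since the mutual-information form used by the library is algebraically equivalent to the conditional-entropy form.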
Completeness is useful in scenarios where it’s important that members of a class are not split across multiple clusters, such as customer segmentation or document classification.
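However, completeness says nothing about homogeneity (whether each cluster contains members of only one class), so it is not suitable on its own when cluster purity matters. It also does not check whether samples are assigned to a specific cluster, because cluster labels are treated as arbitrary identifiers.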
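A small sketch of this limitation, using toy labels invented for illustration: placing every sample in a single cluster keeps each class together, so completeness is perfect even though the clustering carries no information about the classes. homogeneity_score() is used here only for contrast.

```python
from sklearn.metrics import completeness_score, homogeneity_score

# Toy labels (made up for this illustration): two classes, one big cluster
y_true = [0, 0, 1, 1]
y_one_cluster = [0, 0, 0, 0]

comp = completeness_score(y_true, y_one_cluster)
homo = homogeneity_score(y_true, y_one_cluster)

print(f"Completeness: {comp:.2f}")  # no class is split across clusters
print(f"Homogeneity:  {homo:.2f}")  # the single cluster mixes both classes
```

For this reason completeness is usually worth reading alongside a purity-oriented metric rather than in isolation.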
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import completeness_score
# Generate synthetic dataset
X, y = make_blobs(n_samples=1000, centers=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a KMeans clustering model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_train)
# Predict cluster labels for the test set
y_pred = kmeans.predict(X_test)
# Calculate completeness score
comp_score = completeness_score(y_test, y_pred)
print(f"Completeness Score: {comp_score:.2f}")
Running the example gives an output like:
Completeness Score: 1.00
The steps are as follows:
- Generate dataset: Use make_blobs() to create a dataset with 1000 samples and 3 distinct clusters.
- Split dataset: Divide the data into training and test sets with train_test_split(), reserving 20% for testing.
- Train model: Train a KMeans clustering model with 3 clusters on the training set using the fit() method.
- Predict clusters: Predict the cluster assignments for the test set using the predict() method.
- Calculate metric: Use completeness_score() to evaluate how well the clustering algorithm assigns all members of a class to the same cluster, comparing the true labels (y_test) to the predicted clusters (y_pred).
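Comparing y_test with y_pred directly works even though KMeans numbers its clusters arbitrarily, because the score does not depend on the actual label values. The sketch below illustrates this with toy labels and a hypothetical renaming of the cluster ids; neither comes from the example above.

```python
import numpy as np
from sklearn.metrics import completeness_score

# Toy labels (invented for illustration); class 1 is split across two clusters
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# Hypothetical renaming of the cluster ids: 0 -> 2, 1 -> 0, 2 -> 1
mapping = {0: 2, 1: 0, 2: 1}
y_pred_renamed = np.array([mapping[k] for k in y_pred])

score = completeness_score(y_true, y_pred)
score_renamed = completeness_score(y_true, y_pred_renamed)

print(f"Original ids: {score:.4f}")
print(f"Renamed ids:  {score_renamed:.4f}")
```

Both calls should print the same value, so no step is needed to match cluster ids to class labels before scoring.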