Homogeneity score is a useful metric for evaluating the quality of clustering models when ground-truth class labels are available. It measures how uniformly the members of a cluster belong to a single class. A score close to 1 indicates that each cluster consists almost entirely of members of one class.
The homogeneity_score() function in scikit-learn calculates this metric by comparing the entropy of the class distribution within clusters to the entropy of the overall class distribution. The formula is 1 - H(C|K) / H(C), where H(C|K) is the conditional entropy of the class distribution given the cluster assignments and H(C) is the entropy of the class distribution.
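To make the formula concrete, here is a minimal sketch that computes 1 - H(C|K) / H(C) by hand and compares it with homogeneity_score(). The helper names entropy() and manual_homogeneity() and the toy label arrays are made up for illustration and are not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import homogeneity_score


def entropy(labels):
    """Shannon entropy (natural log) of a label array. Hypothetical helper."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))


def manual_homogeneity(labels_true, labels_pred):
    """Compute 1 - H(C|K) / H(C) directly from the definition."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    h_c = entropy(labels_true)
    if h_c == 0.0:
        # Only one class present: every cluster is trivially homogeneous
        return 1.0
    n = labels_true.size
    h_c_given_k = 0.0
    for k in np.unique(labels_pred):
        # Class labels of the samples assigned to cluster k
        members = labels_true[labels_pred == k]
        # Weight each cluster's class entropy by the cluster's share of samples
        h_c_given_k += (members.size / n) * entropy(members)
    return 1.0 - h_c_given_k / h_c


# Toy labels (illustrative only): cluster 1 mixes one sample from each class
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

manual = manual_homogeneity(labels_true, labels_pred)
reference = homogeneity_score(labels_true, labels_pred)
print(f"manual={manual:.4f} sklearn={reference:.4f}")
```

In this toy case only the mixed cluster contributes to H(C|K), so both values come out to 2/3.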
A homogeneity score of 1.0 means that each cluster contains only members of a single class, indicating perfect homogeneity. Lower scores indicate that clusters contain members of multiple classes. This metric is particularly useful in clustering problems to assess the purity of clusters. However, it should not be used alone, as it does not consider the completeness of the clusters, only their homogeneity.
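The caveat about completeness can be illustrated with a deliberately degenerate clustering. In the sketch below, which uses made-up toy labels, every sample is placed in its own cluster: homogeneity is perfect, but completeness_score() and v_measure_score() show that the clustering is still poor.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

# Toy labels (illustrative only): two classes of three samples each
labels_true = [0, 0, 0, 1, 1, 1]
# Degenerate clustering: every sample gets its own cluster
singletons = [0, 1, 2, 3, 4, 5]

h = homogeneity_score(labels_true, singletons)
c = completeness_score(labels_true, singletons)
v = v_measure_score(labels_true, singletons)

print(f"Homogeneity:  {h:.2f}")
print(f"Completeness: {c:.2f}")
print(f"V-measure:    {v:.2f}")
```

Each singleton cluster trivially contains one class, so homogeneity reaches 1.0 while completeness stays well below it. Reporting completeness or the V-measure (the harmonic mean of the two) alongside homogeneity guards against this kind of over-splitting.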
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_clusters_per_class=1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit KMeans clustering algorithm
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_train)
# Predict on test set
y_pred = kmeans.predict(X_test)
# Calculate homogeneity score
homogeneity = homogeneity_score(y_test, y_pred)
print(f"Homogeneity Score: {homogeneity:.2f}")
Running the example gives an output like:
Homogeneity Score: 0.43
The steps are as follows:
- Generate a synthetic dataset using make_classification() with 3 classes and 1 cluster per class.
- Split the dataset into training and test sets using train_test_split().
- Fit the KMeans clustering algorithm on the training set.
- Predict cluster labels for the test set using predict().
- Calculate the homogeneity score using homogeneity_score() by comparing the true labels (y_test) to the predicted cluster labels (y_pred).
This example demonstrates how to use the homogeneity_score() function from scikit-learn to evaluate the homogeneity of clusters in a clustering model.