Evaluating the quality of clustering results is crucial in clustering tasks, and homogeneity_completeness_v_measure() provides a comprehensive way to do this when ground-truth class labels are available.
This function helps us understand how well our clustering algorithm has performed by calculating three related scores: homogeneity, completeness, and V-measure.
Homogeneity measures if each cluster contains only members of a single class, ensuring that clusters are pure. Completeness measures if all members of a given class are assigned to the same cluster, ensuring that all class members are grouped together. V-measure is the harmonic mean of homogeneity and completeness, providing a balanced measure of both.
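To make the three definitions concrete, here is a minimal sketch on a hand-made toy labelling. The label lists and variable names are illustrative assumptions, not part of the main example. It contrasts over-splitting, where every point gets its own cluster, with under-splitting, where all points share one cluster, and checks that V-measure matches the harmonic mean of the other two scores.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Toy ground truth (made up for illustration): two classes of two points each
labels_true = [0, 0, 1, 1]

# Over-splitting: every point is its own cluster, so each cluster is pure
# but the members of each class are scattered across clusters
over_split = [0, 1, 2, 3]
h1, c1, v1 = homogeneity_completeness_v_measure(labels_true, over_split)

# Under-splitting: one big cluster, so each class stays together
# but the single cluster mixes both classes
one_cluster = [0, 0, 0, 0]
h2, c2, v2 = homogeneity_completeness_v_measure(labels_true, one_cluster)

# V-measure (with the default weighting) is the harmonic mean of the two
harmonic = 2 * h1 * c1 / (h1 + c1)

print(f"Over-split:  h={h1:.2f}, c={c1:.2f}, v={v1:.2f}, harmonic={harmonic:.2f}")
print(f"One cluster: h={h2:.2f}, c={c2:.2f}, v={v2:.2f}")
```

With these toy labels, over-splitting gives perfect homogeneity but a completeness of only 0.5, while the single cluster gives perfect completeness but zero homogeneity. Neither score alone is enough, which is why V-measure combines them.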
The homogeneity_completeness_v_measure() function returns three float values between 0 and 1, with 1 indicating perfect clustering. Good values are close to 1 for all three measures, indicating high-quality clustering. Bad values are close to 0, indicating poor clustering quality. Because the scores compare cluster assignments against true class labels, the metric applies only when ground-truth labels are available. It is meant for evaluating clusterings and is not a substitute for accuracy-style metrics when scoring an ordinary classifier.
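One practical consequence is worth seeing in code: the scores depend only on how points are grouped, not on which integer each cluster happens to receive. The short sketch below uses made-up labels (an assumption for illustration) in which the clustering recovers the classes exactly but numbers them differently.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Hypothetical ground truth: three classes of two points each
labels_true = [0, 0, 1, 1, 2, 2]

# Same grouping, but the cluster IDs are a permutation of the class IDs
relabelled = [2, 2, 0, 0, 1, 1]

scores = homogeneity_completeness_v_measure(labels_true, relabelled)
print("Homogeneity: {:.2f}, Completeness: {:.2f}, V-measure: {:.2f}".format(*scores))
```

All three scores come out at 1 here, even though no predicted label equals its true label. This is why the function suits KMeans output, where cluster numbering is arbitrary.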
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_completeness_v_measure
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=3, n_clusters_per_class=1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a KMeans clustering model
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_train)
# Predict clusters on test set
y_pred = kmeans.predict(X_test)
# Evaluate clustering performance
homogeneity, completeness, v_measure = homogeneity_completeness_v_measure(y_test, y_pred)
print(f"Homogeneity: {homogeneity:.2f}, Completeness: {completeness:.2f}, V-measure: {v_measure:.2f}")
Running the example gives an output like:
Homogeneity: 0.43, Completeness: 0.46, V-measure: 0.44
The steps are as follows:
- Generate a synthetic multi-class classification dataset using make_classification().
- Split the dataset into training and test sets using train_test_split().
- Train a KMeans clustering model on the training set.
- Use the trained model to predict clusters for the test set.
- Evaluate the clustering performance using homogeneity_completeness_v_measure().
The example starts by creating a synthetic dataset with three classes, simulating a clustering problem. The dataset is split into training and testing portions to evaluate the model’s performance on unseen data.
A KMeans clustering model is trained on the training set. This model attempts to group the data into three clusters, matching the number of classes, although it never sees the class labels during training.
The trained model predicts cluster assignments for the test set. These predicted clusters are then compared to the true class labels using the homogeneity_completeness_v_measure() function.
Finally, the homogeneity, completeness, and V-measure scores are printed, providing insights into the quality of the clustering.
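If only one of the three scores is needed, scikit-learn also exposes them as separate functions. The sketch below, using made-up label lists (an assumption for illustration, named y_true and y_found here), checks that homogeneity_score(), completeness_score(), and v_measure_score() agree with the combined call.

```python
from sklearn.metrics import (
    completeness_score,
    homogeneity_completeness_v_measure,
    homogeneity_score,
    v_measure_score,
)

# Hypothetical labels, invented purely for this comparison
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_found = [0, 0, 1, 1, 1, 1, 2, 2, 0]

# One call that returns all three scores
combined = homogeneity_completeness_v_measure(y_true, y_found)

# The same scores computed one at a time
separate = (
    homogeneity_score(y_true, y_found),
    completeness_score(y_true, y_found),
    v_measure_score(y_true, y_found),
)

names = ("homogeneity", "completeness", "v_measure")
for name, a, b in zip(names, combined, separate):
    print(f"{name}: combined={a:.4f}, separate={b:.4f}")
```

The combined function is the more convenient choice when all three values are reported together, as in the example above.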