The calinski_harabasz_score() function evaluates clustering performance based on the dispersion of clusters. It calculates the ratio of between-cluster dispersion to within-cluster dispersion, with each term scaled by its degrees of freedom. Higher scores indicate denser, better-separated clusters.
The calinski_harabasz_score() function in scikit-learn computes this score from squared distances: between each cluster centroid and the overall center of the data, and between each point and the centroid of its own cluster. It takes the dataset and the predicted cluster labels as input and returns a float value, with higher values indicating better clustering performance.
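To make the ratio concrete, here is a minimal sketch that recomputes the score by hand and compares it with scikit-learn's result. It assumes the standard definition, (between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k)), where k is the number of clusters and n the number of samples. The helper name manual_calinski_harabasz is hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score


def manual_calinski_harabasz(X, labels):
    """Hypothetical helper: recompute the score from its definition."""
    n_samples = X.shape[0]
    cluster_ids = np.unique(labels)
    k = len(cluster_ids)
    overall_mean = X.mean(axis=0)

    between = 0.0  # size-weighted squared distances from centroids to the overall mean
    within = 0.0   # squared distances from points to their own centroid
    for cluster_id in cluster_ids:
        members = X[labels == cluster_id]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)

    # Scale each dispersion by its degrees of freedom before taking the ratio
    return (between / (k - 1)) / (within / (n_samples - k))


X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

manual = manual_calinski_harabasz(X, labels)
reference = calinski_harabasz_score(X, labels)
print(f"manual: {manual:.4f}, scikit-learn: {reference:.4f}")
```

The two values should agree up to floating-point error. The sketch assumes at least two clusters and non-zero within-cluster dispersion.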
This metric is particularly useful for clustering problems, such as those solved with k-means, where no ground-truth labels are available. However, it has limitations: it tends to favor convex, compact clusters, so it may not be suitable for non-globular clusters, and it is sensitive to the number of clusters.
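Because the score depends on the number of clusters, one common way to use it is to compare several candidate values of k on the same data. The following is a rough sketch of that idea. The range of k values and the dataset parameters are arbitrary choices made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data; the sample count and seed are illustrative assumptions
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

scores = {}
for k in range(2, 7):
    # Fit k-means with k clusters and score the resulting labels
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)
    print(f"k={k}: {scores[k]:.2f}")

# The k with the highest score is a candidate, not a guaranteed answer
best_k = max(scores, key=scores.get)
print(f"Highest score at k={best_k}")
```

The highest-scoring k is best treated as a hint to check against other evidence, since the score is only defined for two or more clusters and can shift with the data and initialization.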
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
# Generate synthetic dataset
X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
# Fit k-means algorithm
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Retrieve the cluster labels assigned during fitting
labels = kmeans.labels_
# Calculate Calinski-Harabasz score
ch_score = calinski_harabasz_score(X, labels)
print(f"Calinski-Harabasz Score: {ch_score:.2f}")
Running the example gives an output like:
Calinski-Harabasz Score: 16512.26
The steps are as follows:
- Generate a synthetic clustering dataset using make_blobs(). This function creates a dataset with 1000 samples and 3 centers, allowing us to simulate a clustering problem without using real-world data.
- Fit a k-means clustering algorithm using the KMeans class from scikit-learn. Specify the number of clusters (3) and set the random_state for reproducibility.
- Retrieve the cluster labels for the dataset from the labels_ attribute of the fitted k-means model.
- Calculate the Calinski-Harabasz score using the calinski_harabasz_score() function, which takes the dataset and predicted labels as input. The score measures the ratio of between-cluster dispersion to within-cluster dispersion.
- Print the Calinski-Harabasz score to evaluate the clustering performance. Higher scores indicate better-defined clusters.
This example demonstrates how to use the calinski_harabasz_score() function from scikit-learn to evaluate the performance of a clustering model.
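The limitation with non-globular clusters mentioned earlier can be illustrated with a small sketch. It assumes the two-moons dataset from make_moons() as a stand-in for non-globular data and compares the score of the true moon labels with the score of a k-means partition of the same points. The noise level and seed are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import calinski_harabasz_score

# Two interleaving half-circles: a classic non-globular layout
X, true_labels = make_moons(n_samples=500, noise=0.05, random_state=42)

# k-means splits the data into two compact groups, cutting across the moons
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

truth_score = calinski_harabasz_score(X, true_labels)
kmeans_score = calinski_harabasz_score(X, kmeans_labels)

print(f"True moon labels: {truth_score:.2f}")
print(f"k-means labels:   {kmeans_score:.2f}")
```

On data like this, the k-means partition would be expected to score higher than the true grouping, because the score rewards compact groups around centroids and not the curved shapes themselves. A high score therefore does not by itself show that the recovered clusters match the underlying structure.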