Scikit-Learn calinski_harabasz_score() Metric

The Calinski-Harabasz score, also known as the Variance Ratio Criterion, evaluates a clustering by comparing how far apart the clusters are with how compact each cluster is. It is the ratio of between-cluster dispersion to within-cluster dispersion, with each term scaled by its degrees of freedom. Higher scores indicate denser, better-separated clusters.
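For reference, the ratio can be written out as below. The symbols are notation introduced here for illustration: n is the number of samples, k the number of clusters, C_q the set of points in cluster q, n_q its size, c_q its centroid, and c the centroid of the whole dataset.

```latex
\mathrm{CH} = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \times \frac{n - k}{k - 1},
\qquad
\operatorname{tr}(B_k) = \sum_{q=1}^{k} n_q \,\lVert c_q - c \rVert^2,
\qquad
\operatorname{tr}(W_k) = \sum_{q=1}^{k} \sum_{x \in C_q} \lVert x - c_q \rVert^2
```

The factor (n - k) / (k - 1) rescales the two dispersions by their degrees of freedom, which is why the score is only defined for at least 2 and at most n - 1 clusters.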

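The calinski_harabasz_score() function in scikit-learn measures between-cluster dispersion from the squared distances between each cluster centroid and the centroid of the whole dataset, and within-cluster dispersion from the squared distances between each point and its own cluster centroid. It takes the dataset and the cluster labels as input and returns a float. The score has no fixed upper bound, so it is most useful for comparing different clusterings of the same dataset.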
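To make the calculation concrete, here is a minimal sketch that recomputes the score by hand with NumPy and compares it with the library result. The variable names (between, within, manual_score) are illustrative, and n_init=10 is set explicitly as an assumption so the run behaves the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Same kind of synthetic data as in the main example
X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

n_samples = X.shape[0]
cluster_ids = np.unique(labels)
k = len(cluster_ids)
overall_mean = X.mean(axis=0)

between = 0.0  # size-weighted squared distance of each centroid to the overall mean
within = 0.0   # squared distance of each point to its own centroid
for cluster_id in cluster_ids:
    members = X[labels == cluster_id]
    centroid = members.mean(axis=0)
    between += len(members) * np.sum((centroid - overall_mean) ** 2)
    within += np.sum((members - centroid) ** 2)

# Scale each dispersion by its degrees of freedom before taking the ratio
manual_score = (between / (k - 1)) / (within / (n_samples - k))
library_score = calinski_harabasz_score(X, labels)

print(f"Manual score:  {manual_score:.2f}")
print(f"Library score: {library_score:.2f}")
```

The two printed values should agree up to floating-point rounding.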

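This metric is particularly useful when no ground-truth labels are available, as is typical for clusterings produced by k-means. However, it has limitations. Because it is built on distances to centroids, the Calinski-Harabasz score favors convex, globular clusters and may rate a correct partition of non-globular clusters poorly. Its value also changes with the number of clusters, so scores are best compared across candidate clusterings of the same dataset.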
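The non-globular limitation can be illustrated with a small sketch on the two-moons dataset. Here the generating labels from make_moons() are compared with the labels found by k-means. The dataset size and noise level are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import calinski_harabasz_score

# Two interleaving half-circles: the true clusters are not globular
X_moons, true_labels = make_moons(n_samples=500, noise=0.05, random_state=42)

# k-means splits the data into two compact, roughly convex groups
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

score_true = calinski_harabasz_score(X_moons, true_labels)
score_kmeans = calinski_harabasz_score(X_moons, kmeans_labels)

print(f"Score for the true moon labels: {score_true:.2f}")
print(f"Score for the k-means labels:   {score_kmeans:.2f}")
```

The k-means partition receives the higher score even though it cuts across the two moons, because k-means directly minimizes the within-cluster dispersion that the metric rewards.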

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Generate synthetic dataset
X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

# Fit k-means algorithm
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Retrieve the cluster labels assigned during fitting
labels = kmeans.labels_

# Calculate Calinski-Harabasz score
ch_score = calinski_harabasz_score(X, labels)
print(f"Calinski-Harabasz Score: {ch_score:.2f}")

Running the example gives an output like:

Calinski-Harabasz Score: 16512.26

The steps are as follows:

1. Generate a synthetic clustering dataset using make_blobs(). This function creates a dataset with 1000 samples and 3 centers, allowing us to simulate a clustering problem without using real-world data.
2. Fit a k-means clustering algorithm using the KMeans class from scikit-learn. Specify the number of clusters (3) and set the random_state for reproducibility.
3. Retrieve the cluster labels for the dataset from the labels_ attribute of the fitted k-means model. These are the assignments made during fitting, so no separate predict() call is needed.
4. Calculate the Calinski-Harabasz score using the calinski_harabasz_score() function, which takes the dataset and the cluster labels as input. The score measures the ratio of between-cluster dispersion to within-cluster dispersion.
5. Print the Calinski-Harabasz score to evaluate the clustering performance. Higher scores indicate better-defined clusters.
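Because a single score is hard to interpret on its own, the same steps are often repeated for several candidate values of n_clusters. The sketch below is one way to do that. The range of 2 to 6 clusters and the names scores and best_k are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Same synthetic dataset as above: 3 well-separated blobs
X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

# Score one k-means clustering per candidate number of clusters
scores = {}
for n_clusters in range(2, 7):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    candidate_labels = model.fit_predict(X)
    scores[n_clusters] = calinski_harabasz_score(X, candidate_labels)

for n_clusters, score in scores.items():
    print(f"n_clusters={n_clusters}: {score:.2f}")

# The candidate with the highest score is the one the metric prefers
best_k = max(scores, key=scores.get)
print(f"Highest score at n_clusters={best_k}")
```

On this dataset the score is highest at the number of centers used to generate the data, but on real data the peak may be less clear-cut and should be read alongside other evidence.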

This example demonstrates how to use the calinski_harabasz_score() function from scikit-learn to evaluate the performance of a clustering model.
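One practical detail when using the function: the score is only defined when there are at least 2 clusters and fewer clusters than samples. The following sketch shows what happens when every point carries the same label. The all-zeros label array is a contrived input used only to trigger the check.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Contrived labelling: every sample is assigned to the same cluster
single_cluster_labels = np.zeros(X.shape[0], dtype=int)

try:
    calinski_harabasz_score(X, single_cluster_labels)
    raised = False
except ValueError as err:
    # scikit-learn rejects labellings with only one cluster
    raised = True
    print(f"ValueError: {err}")
```

Guarding against this case matters for algorithms such as DBSCAN, which may return a single cluster for some parameter settings.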

See Also