Scikit-Learn silhouette_samples() Metric

Silhouette samples are used to evaluate the quality of clustering results. They measure how similar each sample is to its own cluster compared to other clusters, providing a way to assess the goodness of the clustering.

The silhouette_samples() function in scikit-learn calculates the silhouette coefficient for each sample from two quantities: a, the mean distance between the sample and all other samples in its own cluster, and b, the mean distance between the sample and all samples in the nearest cluster it does not belong to. The silhouette coefficient for the sample is (b - a) / max(a, b). Scores range from -1 to 1: values close to 1 indicate a sample that sits well inside its cluster, values near 0 indicate a sample on or near the boundary between two clusters, and negative values indicate a sample that may have been assigned to the wrong cluster. The coefficient is only defined when the number of distinct labels is at least 2 and at most n_samples - 1.
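To make the formula concrete, the following is a minimal sketch that computes (b - a) / max(a, b) by hand and compares it with silhouette_samples(). The six-point dataset and the helper name manual_silhouette() are illustrative assumptions, not part of scikit-learn, and the loop is written for readability rather than speed.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

# Illustrative toy data: two well-separated groups on a line
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Full pairwise distance matrix (Euclidean by default)
D = pairwise_distances(X)


def manual_silhouette(D, labels):
    """Hypothetical helper: per-sample silhouette from a distance matrix."""
    scores = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        n_same = same.sum()
        if n_same == 1:
            # A cluster with a single sample gets a score of 0
            continue
        # a: mean distance to the other members of the sample's own cluster
        # (the self-distance D[i, i] is 0, so divide by n_same - 1)
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            D[i, labels == other].mean()
            for other in np.unique(labels)
            if other != own
        )
        scores[i] = (b - a) / max(a, b)
    return scores


manual = manual_silhouette(D, labels)
reference = silhouette_samples(X, labels)

print(manual)
print(np.allclose(manual, reference))
```

For the first point, a = (1 + 2) / 2 = 1.5 and b = (10 + 11 + 12) / 3 = 11, so its coefficient is 9.5 / 11, roughly 0.864.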

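Silhouette samples are commonly used to evaluate the output of clustering algorithms such as k-means, hierarchical clustering, and DBSCAN. However, because the coefficient rewards compact, well-separated groups, it can be misleading for clusters of varying density or non-globular shapes. Note also that DBSCAN marks noise points with the label -1, which silhouette_samples() treats as just another cluster label.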
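The limitation with non-globular shapes can be illustrated with a small sketch. It assumes the two-moons dataset from make_moons() as a stand-in for non-globular clusters, and compares the mean silhouette of the true moon labels with that of a KMeans partition that cuts across the moons. The parameter values are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_samples

# Two interleaved half-circles: the "correct" clusters are not globular
X_moons, y_moons = make_moons(n_samples=500, noise=0.05, random_state=0)

# KMeans splits the plane with a straight boundary, cutting across both moons
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Mean per-sample silhouette for each labelling
true_mean = silhouette_samples(X_moons, y_moons).mean()
kmeans_mean = silhouette_samples(X_moons, kmeans_labels).mean()

print(f"Mean silhouette, true moon labels: {true_mean:.3f}")
print(f"Mean silhouette, KMeans labels:    {kmeans_mean:.3f}")
```

On data like this the KMeans partition typically scores higher than the true labels, even though it is the worse description of the structure, so a high silhouette value should not be read as proof of a correct clustering.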

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Generate synthetic dataset
X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)

# Fit KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)

# Calculate silhouette samples
silhouette_vals = silhouette_samples(X, labels)
print(f"Silhouette samples: {silhouette_vals[:10]}")

Running the example gives an output like:

Silhouette samples: [0.91092226 0.89832016 0.87495121 0.84861634 0.79964852 0.91290724
 0.86535873 0.86053709 0.81838091 0.84240909]

The steps are as follows:

Generate a synthetic dataset using make_blobs() with 3 centers.
Fit the KMeans clustering algorithm on the dataset and predict the cluster labels in a single call to fit_predict().
Calculate silhouette scores for each sample using silhouette_samples().
Print the silhouette scores for the first 10 samples.

First, generate a synthetic dataset using the make_blobs() function from scikit-learn. This function creates a dataset with 1000 samples and 3 centers, allowing us to simulate a clustering problem.

Next, fit the KMeans clustering algorithm on the dataset using the KMeans class from scikit-learn. Specify the number of clusters as 3 and set the random state for reproducibility. The fit_predict() method is called on the KMeans object, passing in the dataset (X) to compute the cluster labels.

After clustering, calculate the silhouette scores for each sample using the silhouette_samples() function. This function takes the dataset (X) and the predicted labels (labels) as input and returns the silhouette scores for each sample.
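silhouette_samples() also accepts a metric argument, which defaults to "euclidean". The sketch below is one way to try a different metric or to pass a distance matrix computed in advance with metric="precomputed". The smaller dataset size and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

# Smaller dataset so the full distance matrix stays cheap to build
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Default: Euclidean distance computed from the feature matrix
default_vals = silhouette_samples(X, labels)

# Same labels scored with Manhattan distance instead
manhattan_vals = silhouette_samples(X, labels, metric="manhattan")

# Precomputed: pass a square distance matrix in place of X
D = pairwise_distances(X, metric="euclidean")
precomputed_vals = silhouette_samples(D, labels, metric="precomputed")

print(default_vals[:3])
print(manhattan_vals[:3])
print(np.allclose(default_vals, precomputed_vals))
```

The precomputed route can be useful when the distances come from a custom function or are reused across several clusterings of the same data.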

Finally, print the silhouette scores for the first 10 samples, giving us insight into the clustering quality for these samples. This example demonstrates how to use the silhouette_samples() function from scikit-learn to evaluate the quality of clustering results.
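Ten raw values say little about the clustering as a whole. One possible follow-up, sketched below under the same dataset and model settings as the example, is to summarize the per-sample values by cluster and to check that their mean matches silhouette_score(). The choice of summary statistics is an assumption, not a prescribed recipe.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Same data and model settings as the main example
X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
silhouette_vals = silhouette_samples(X, labels)

# Summarize the per-sample values cluster by cluster
for cluster in np.unique(labels):
    vals = silhouette_vals[labels == cluster]
    print(
        f"Cluster {cluster}: n={vals.size}, mean={vals.mean():.3f}, "
        f"min={vals.min():.3f}, negative={(vals < 0).sum()}"
    )

# The overall silhouette score is the mean of the per-sample values
overall = silhouette_vals.mean()
print(f"Mean of samples: {overall:.3f}")
print(np.isclose(overall, silhouette_score(X, labels)))
```

A cluster with a low mean or many negative values is a candidate for closer inspection, for example by trying a different number of clusters.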

See Also