Scikit-Learn BisectingKMeans Model

BisectingKMeans is a variant of k-means that builds clusters by repeated splitting, which makes it a divisive hierarchical method. It starts with all samples in a single cluster and, at each step, selects one existing cluster and splits it into two sub-clusters. This process continues until the desired number of clusters is reached.
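The splitting loop described above can be sketched by hand to make the idea concrete. This is a minimal illustration and not scikit-learn's internal implementation: the function name bisecting_sketch is hypothetical, each split is done with an ordinary two-cluster KMeans, and the cluster chosen for splitting is assumed to be the one with the largest sum of squared distances to its own mean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def bisecting_sketch(X, n_clusters, random_state=0):
    """Illustrative bisecting loop (hypothetical helper, not sklearn internals)."""
    # start with every sample in cluster 0
    labels = np.zeros(len(X), dtype=int)
    for new_label in range(1, n_clusters):
        # sum of squared distances to the mean for each existing cluster
        sse = [((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(new_label)]
        # assumption: split the cluster with the largest sum of squares
        target = int(np.argmax(sse))
        idx = np.flatnonzero(labels == target)
        # split the chosen cluster in two with ordinary 2-means
        halves = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        # one half keeps the old label, the other half gets a new one
        labels[idx[halves == 1]] = new_label
    return labels


# generate synthetic dataset
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)

# run the sketch and report the size of each cluster
labels = bisecting_sketch(X, n_clusters=3)
print(np.bincount(labels))
```

In practice the BisectingKMeans class should be preferred; the sketch only shows why each step needs to look at the points of a single cluster.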

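The key hyperparameters of BisectingKMeans include n_clusters (the number of clusters), init (the centroid initialization method), and bisecting_strategy (the rule for choosing which cluster to split next: the one with the largest inertia or the one with the most samples).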

This algorithm is suitable for clustering problems where a hierarchy of clusters is useful, or where the number of clusters is large, since each split only processes the samples of a single cluster.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.cluster import BisectingKMeans
from sklearn.metrics import silhouette_score

# generate synthetic dataset
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)

# split into train and test sets
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

# create model (fixed random_state for repeatable results)
model = BisectingKMeans(n_clusters=3, random_state=1)

# fit model
model.fit(X_train)

# evaluate model
labels = model.predict(X_test)
score = silhouette_score(X_test, labels)
print('Silhouette Score: %.3f' % score)

# make a prediction
sample = [[-5.5, 2.3]]
label = model.predict(sample)
print('Predicted Cluster: %d' % label[0])

Running the example gives an output like the following (the cluster index is an arbitrary label, so the exact values may differ):

Silhouette Score: 0.796
Predicted Cluster: 2

The steps are as follows:

First, a synthetic clustering dataset is generated using the make_blobs() function. This creates a dataset with a specified number of samples (n_samples), centers (centers), and features (n_features); the true labels are discarded because clustering is unsupervised. The dataset is then split into training and test sets using train_test_split().
Next, a BisectingKMeans model is instantiated with the desired number of clusters (n_clusters). The model is then fit on the training data using the fit() method.
The performance of the model is evaluated by predicting the cluster labels for the test data and calculating the silhouette score, which measures the cohesion and separation of the clusters. The score ranges from -1 to 1, and higher values indicate tighter, better-separated clusters.
A single prediction can be made by passing a new data sample to the predict() method to determine its cluster assignment.

This example demonstrates how to apply the BisectingKMeans algorithm to a clustering task in scikit-learn: fit the model, evaluate the quality of the resulting clusters with the silhouette score, and assign new samples to a cluster.

See Also