Scikit-Learn MiniBatchKMeans Model

MiniBatchKMeans is a variant of the KMeans clustering algorithm designed to handle large datasets more efficiently. It processes small random batches of data to update the cluster centroids, making it faster and more scalable.
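
To give a sense of the trade-off, here is a minimal sketch that fits both KMeans and MiniBatchKMeans on the same synthetic data and compares their inertia. The dataset size, batch_size=256, and the n_init values are illustrative assumptions, not recommendations.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# synthetic data: 10,000 points around 3 centers (sizes chosen for illustration)
X, _ = make_blobs(n_samples=10000, centers=3, n_features=2, random_state=42)

# standard KMeans uses every sample in each iteration
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# MiniBatchKMeans updates the centroids from random batches of 256 samples
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=3, random_state=42).fit(X)

# inertia: sum of squared distances from samples to their nearest centroid
print('KMeans inertia: %.1f' % kmeans.inertia_)
print('MiniBatchKMeans inertia: %.1f' % mbk.inertia_)
```

On well-separated data like this, the two inertia values are typically close; the mini-batch variant trades a small loss in cluster quality for less computation per update.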

Key hyperparameters include n_clusters (number of clusters), batch_size (size of the mini-batches), and max_iter (maximum number of passes over the complete dataset).

This algorithm is suitable for unsupervised clustering problems, particularly when the dataset is large enough that standard KMeans becomes slow.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

# generate a synthetic dataset with blobs
X, _ = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=42)

# split into train and test sets
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# create the MiniBatchKMeans model (random_state fixed so results are reproducible)
model = MiniBatchKMeans(n_clusters=3, batch_size=100, max_iter=100, random_state=42)

# fit the model
model.fit(X_train)

# evaluate the model using silhouette score
labels = model.predict(X_test)
score = silhouette_score(X_test, labels)
print('Silhouette Score: %.3f' % score)

# make a prediction for a new sample
new_sample = [[-6.2, -6.5]]
label = model.predict(new_sample)
print('Predicted cluster: %d' % label[0])

Running the example gives an output like:

Silhouette Score: 0.851
Predicted cluster: 1

The steps are as follows:

First, a synthetic dataset is generated using the make_blobs() function, creating a dataset with three clusters and two features. The dataset is split into training and test sets using train_test_split().
Next, a MiniBatchKMeans model is instantiated with 3 clusters, a batch size of 100, and a maximum of 100 iterations. The model is then fit on the training data using the fit() method.
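The model is evaluated by predicting cluster labels for the test set and computing the silhouette score, which measures how similar each sample is to its own cluster compared to other clusters. Scores range from -1 to 1, with higher values indicating better-separated clusters.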
A single prediction can be made by passing a new data sample to the predict() method to determine its cluster assignment.
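
The prediction step can be sketched by hand: predict() assigns each sample to the cluster whose centroid is nearest. The snippet below is a minimal illustration that recomputes the labels from the fitted cluster_centers_ attribute with NumPy and compares them to predict(); the variable names distances and manual_labels are introduced here for illustration only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=3, n_features=2, random_state=42)
model = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=3, random_state=42).fit(X)

# Euclidean distance from every sample to every centroid, shape (n_samples, n_clusters)
distances = np.linalg.norm(X[:, None, :] - model.cluster_centers_[None, :, :], axis=2)

# the index of the nearest centroid is the cluster label
manual_labels = distances.argmin(axis=1)

print('Matches predict():', np.array_equal(manual_labels, model.predict(X)))
```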

This example shows the basic workflow for using MiniBatchKMeans in scikit-learn. The dataset here is small, but the same steps apply to much larger datasets, where mini-batch updates make the algorithm faster than standard KMeans.
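
For data that does not fit in memory, MiniBatchKMeans also provides a partial_fit() method that updates the centroids one chunk at a time. The following is a minimal sketch under the assumption that the data arrives in chunks; here the chunks are simulated by splitting an in-memory array with np.array_split, and the chunk count of 10 is arbitrary.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# simulated "large" dataset; in practice each chunk might be read from disk
X, _ = make_blobs(n_samples=10000, centers=3, n_features=2, random_state=42)

model = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=42)

# feed the data in 10 chunks of 1,000 samples each
for chunk in np.array_split(X, 10):
    model.partial_fit(chunk)

# the fitted centroids can be used for prediction as usual
labels = model.predict(X)
print('Centroids after incremental fitting:')
print(model.cluster_centers_)
```

The first call to partial_fit() initializes the centroids from that chunk, so the first chunk should contain at least n_clusters samples and be reasonably representative of the data.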
