
Scikit-Learn AffinityPropagation Model

Affinity Propagation is a clustering algorithm that identifies exemplars (representative data points) by passing messages between pairs of data points, then forms clusters around those exemplars.

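AffinityPropagation does not require the number of clusters to be specified in advance. The key hyperparameters are damping, a factor in the range [0.5, 1.0) that controls how much of the previous message values is retained at each update, and preference, which controls how likely each point is to be chosen as an exemplar (larger values tend to produce more clusters). This makes the algorithm suitable for clustering problems where the number of clusters is unknown.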
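As a rough illustration of how preference affects the result, the sketch below fits the model with two preference values derived from the pairwise similarities (the negative squared Euclidean distance, which is the default affinity). When preference is not set, scikit-learn uses the median of the input similarities. The comparison against the minimum similarity, and the values damping=0.9, max_iter=1000, and random_state=0, are assumptions made for this illustration, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import euclidean_distances

# same synthetic dataset as in the main example
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)

# similarities as used by the default 'euclidean' affinity
S = -euclidean_distances(X, squared=True)

# illustrative preference values (an assumption of this sketch)
candidates = {
    'minimum similarity': float(np.min(S)),
    'median similarity': float(np.median(S)),
}

cluster_counts = {}
for name, pref in candidates.items():
    model = AffinityPropagation(damping=0.9, preference=pref,
                                max_iter=1000, random_state=0)
    model.fit(X)
    # one exemplar per cluster, so the number of exemplars is the cluster count
    cluster_counts[name] = len(model.cluster_centers_indices_)
    print('%s (preference=%.1f): %d clusters' % (name, pref, cluster_counts[name]))
```

Lower preference values generally lead to fewer exemplars, so the first run would typically report no more clusters than the second.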

from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

# generate a synthetic dataset
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)

# create the model
model = AffinityPropagation()

# fit the model
model.fit(X)

# evaluate the model
labels = model.labels_
score = silhouette_score(X, labels)
print('Silhouette Score: %.3f' % score)

# make a prediction
new_data = [[0.5, 2.5]]
prediction = model.predict(new_data)
print('Predicted Cluster: %d' % prediction[0])

Running the example gives an output like:

Silhouette Score: 0.770
Predicted Cluster: 1

The steps are as follows:

  1. First, a synthetic dataset is generated using the make_blobs() function. This creates a dataset with 100 samples, 3 centers, and 2 features, ensuring reproducibility with a fixed random seed (random_state).

  2. Next, an AffinityPropagation model is instantiated with default parameters and then fit on the synthetic dataset using the fit() method.

  3. The performance of the model is evaluated by calculating the silhouette score, which measures how similar each sample is to its own cluster compared to other clusters. The score ranges from -1 to 1, with higher values indicating better-separated clusters.

  4. A single prediction is made by passing a new data sample to the predict() method, which assigns it to the cluster of the nearest exemplar.
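The fitted model from the steps above can also be inspected directly. The sketch below is one way to do this, assuming the default Euclidean affinity; random_state=0 and the variable names are choices made for this illustration. It reads the exemplars from cluster_centers_indices_, guards the silhouette calculation (which needs at least two clusters and fewer clusters than samples), and compares the prediction with the nearest exemplar.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)
model = AffinityPropagation(random_state=0).fit(X)

# indices of the samples chosen as exemplars (empty if none were found)
exemplar_idx = np.asarray(model.cluster_centers_indices_, dtype=int)
n_clusters = len(exemplar_idx)
exemplars = X[exemplar_idx]  # exemplars are actual rows of X
print('Clusters found: %d' % n_clusters)

# the silhouette score is only defined for 2 .. n_samples - 1 distinct labels
n_labels = len(np.unique(model.labels_))
score = None
if 2 <= n_labels <= len(X) - 1:
    score = silhouette_score(X, model.labels_)
    print('Silhouette Score: %.3f' % score)

# compare predict() with a manual nearest-exemplar lookup
new_data = np.array([[0.5, 2.5]])
prediction = model.predict(new_data)
nearest = None
if n_clusters > 0:
    nearest = int(np.argmin(np.linalg.norm(exemplars - new_data[0], axis=1)))
    print('Predicted Cluster: %d, nearest exemplar: %d' % (prediction[0], nearest))
```

If the model fails to find any exemplars, predict() labels every sample as -1, which is why the lookup is skipped in that case.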

This example demonstrates how to use AffinityPropagation for clustering tasks in scikit-learn. The model determines the number of clusters automatically, so it can be used when the number of clusters is not known a priori.


