The Fowlkes-Mallows score is a metric for evaluating the similarity between two clusterings, or between a set of predicted labels and a set of ground-truth labels.
It is the geometric mean of the pairwise precision and recall. Equivalently, it is the number of true positive pairs (pairs of samples grouped together in both labelings) divided by the geometric mean of the number of predicted positive pairs and the number of actual positive pairs.
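The pair-counting definition can be checked by hand on a tiny example. The sketch below is illustrative only: the toy labels and the variable names (`tp`, `fp`, `fn`, `fm_manual`) are assumptions made for this example, not part of scikit-learn. It counts pairs directly and compares the result with the library function.

```python
from itertools import combinations
from math import sqrt

from sklearn.metrics import fowlkes_mallows_score

# Toy labels chosen only for illustration (assumption, not from the original example)
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

tp = fp = fn = 0
for i, j in combinations(range(len(labels_true)), 2):
    same_true = labels_true[i] == labels_true[j]
    same_pred = labels_pred[i] == labels_pred[j]
    if same_true and same_pred:
        tp += 1  # pair grouped together in both labelings
    elif same_pred:
        fp += 1  # pair grouped together only in the prediction
    elif same_true:
        fn += 1  # pair grouped together only in the ground truth

precision = tp / (tp + fp)  # pairwise precision
recall = tp / (tp + fn)     # pairwise recall
fm_manual = sqrt(precision * recall)
fm_sklearn = fowlkes_mallows_score(labels_true, labels_pred)

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"manual: {fm_manual:.4f}, scikit-learn: {fm_sklearn:.4f}")
```

The two values should agree up to floating-point rounding. The explicit loop over pairs is quadratic in the number of samples, so it is only meant for checking small cases.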
The fowlkes_mallows_score() function in scikit-learn calculates this score by taking the true labels and the predicted labels as inputs and returning a float value between 0 and 1, with 1 indicating perfect similarity.
The Fowlkes-Mallows score is useful for both clustering and classification problems where two clusterings or sets of labels need to be compared. However, it may be less informative on imbalanced datasets, and its value is sensitive to the number of clusters or classes being compared.
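One way to get a feel for the sensitivity to the number of clusters is to score purely random labelings against a fixed ground truth. The following is a rough sketch under stated assumptions: a balanced two-class ground truth, uniformly random cluster assignments, and an arbitrary seed. The names `y_random` and `random_scores` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.metrics import fowlkes_mallows_score

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
y_true = np.repeat([0, 1], 500)  # assumed balanced two-class ground truth

random_scores = {}
for k in (2, 5, 20):
    # Assign every sample to one of k clusters uniformly at random
    y_random = rng.integers(0, k, size=y_true.size)
    random_scores[k] = fowlkes_mallows_score(y_true, y_random)
    print(f"k={k:2d} random clusters -> score {random_scores[k]:.3f}")
```

In this setup the score for random labels is not zero, and it shrinks as the number of random clusters grows, so scores obtained with different numbers of clusters are not directly comparable.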
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_clusters_per_class=1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Apply KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=42)
y_pred = kmeans.fit_predict(X_test)
# Calculate Fowlkes-Mallows score
fm_score = fowlkes_mallows_score(y_test, y_pred)
print(f"Fowlkes-Mallows Score: {fm_score:.2f}")
Running the example gives an output like:
Fowlkes-Mallows Score: 0.71
The steps are as follows:
- Generate a synthetic dataset using make_classification() with 1000 samples, 10 features, and 5 informative features.
- Split the dataset into training and test sets using train_test_split() with an 80-20 split.
- Apply KMeans clustering with 2 clusters on the test set using fit_predict() to get cluster labels.
- Calculate the Fowlkes-Mallows score using fowlkes_mallows_score() by comparing the true labels and the predicted cluster labels.
- Print the Fowlkes-Mallows score, which indicates the similarity between the true labels and the clustering results.
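Because the score is computed from pairs of samples rather than from the label values themselves, the cluster IDs returned by KMeans do not need to match the class IDs in the true labels. The short sketch below illustrates this with toy labels invented for the example; it also shows the two ends of the 0 to 1 range.

```python
from sklearn.metrics import fowlkes_mallows_score

labels_true = [0, 0, 1, 1]

# Same grouping as labels_true, but with the cluster IDs swapped
labels_swapped = [1, 1, 0, 0]
# Every sample in its own cluster, so no pair is grouped together
labels_split = [0, 1, 2, 3]

score_swapped = fowlkes_mallows_score(labels_true, labels_swapped)
score_split = fowlkes_mallows_score(labels_true, labels_split)

print(f"Swapped IDs:  {score_swapped:.2f}")  # prints 1.00
print(f"All separate: {score_split:.2f}")    # prints 0.00
```

This is why the example above can compare y_test and y_pred directly, without first mapping cluster numbers to class numbers.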