pair_confusion_matrix() evaluates a clustering by comparing it with a reference labeling one pair of samples at a time. For every pair, it records whether the two samples share a label in the true labels and whether they share a cluster in the predicted labels, then counts how often each combination occurs. Because it only looks at whether pairs are grouped together, the result does not depend on how the clusters are numbered, which makes it well suited to clustering and other unsupervised learning tasks.
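The pair_confusion_matrix() function in scikit-learn returns a 2x2 matrix of pair counts. Entry [0, 0] counts pairs that are separated in both the true and predicted labels, and entry [1, 1] counts pairs that are grouped together in both. The off-diagonal entries count disagreements: [0, 1] holds pairs separated in the true labels but grouped together in the prediction, and [1, 0] holds pairs grouped together in the true labels but separated in the prediction. Each pair is counted in both orders, so the entries sum to n * (n - 1) for n samples. High values on the diagonal relative to the off-diagonal entries indicate good agreement between predicted and true labels.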
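To make the layout of the matrix concrete, here is a minimal sketch on hand-made labels. The four-sample label lists are illustrative assumptions, not data from this example. Rows correspond to the true labels (row 0: pair separated, row 1: pair together) and columns to the predicted labels in the same order.

```python
from sklearn.metrics import pair_confusion_matrix

# Toy labels chosen for illustration only.
labels_true = [0, 0, 1, 2]
labels_pred = [0, 0, 1, 1]

pcm_toy = pair_confusion_matrix(labels_true, labels_pred)
print(pcm_toy)
# [[8 2]
#  [0 2]]
```

The single disagreement is the pair of samples 2 and 3: they have different true labels but land in the same predicted cluster, so they appear (counted in both orders) in entry [0, 1].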
This metric is most effective for clustering tasks, where the predicted cluster IDs are arbitrary and cannot be matched to the true labels directly. It is less useful for ordinary binary or multiclass classification, where predictions can be compared to true labels one sample at a time and a standard confusion matrix is more informative. A good result has most pairs on the diagonal; a bad result has a large share of pairs in the off-diagonal entries.
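One way to turn "good" and "bad" into a single number is the fraction of pairs on the diagonal, which is the Rand index. The sketch below, on made-up labels, computes that fraction by hand and compares it with scikit-learn's rand_score(); the label lists are assumptions for illustration.

```python
from sklearn.metrics import pair_confusion_matrix, rand_score

# Toy labels chosen for illustration only.
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 2, 2, 2]

pcm = pair_confusion_matrix(labels_true, labels_pred)

# Share of pairs on which the two labelings agree (diagonal / total).
agreement = (pcm[0, 0] + pcm[1, 1]) / pcm.sum()
reference = rand_score(labels_true, labels_pred)

print(agreement, reference)
```

The two printed values should match, since the Rand index is defined as the diagonal share of the pair confusion matrix.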
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import pair_confusion_matrix
# Generate synthetic dataset
X, y = make_classification(n_samples=100, n_clusters_per_class=1, n_features=5, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit a KMeans clustering model (unsupervised: y_train is not used)
clf = KMeans(n_clusters=3, random_state=42)
clf.fit(X_train)
# Predict on test set
y_pred = clf.predict(X_test)
# Calculate pair confusion matrix
pcm = pair_confusion_matrix(y_test, y_pred)
print("Pair Confusion Matrix:")
print(pcm)
Running the example gives an output like:
Pair Confusion Matrix:
[[204  58]
 [ 30  88]]
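As a quick sanity check on this output, the sketch below copies the printed matrix into an array and works out the totals. It assumes the test set has 20 samples (20% of 100), so there are 20 * 19 = 380 ordered pairs.

```python
import numpy as np

# Matrix copied from the example output above.
pcm = np.array([[204, 58], [30, 88]])

n_test = 20  # 20% of the 100 generated samples
total_pairs = pcm.sum()
print(total_pairs == n_test * (n_test - 1))  # True: 380 ordered pairs

# Share of pairs on which true and predicted labels agree.
agreement = (pcm[0, 0] + pcm[1, 1]) / total_pairs
print(f"{agreement:.3f}")  # 0.768
```

So roughly 77% of the pairs are treated consistently by the true labels and the predicted clusters, while the remaining 88 ordered pairs (58 + 30) are disagreements.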
The steps are as follows:
- Generate a synthetic classification dataset with make_classification().
- Split the dataset into training and testing sets with train_test_split().
- Train a KMeans clustering model on the training set.
- Predict the clusters on the test set using predict().
- Calculate the pair confusion matrix using pair_confusion_matrix() to evaluate the clustering performance.
First, we generate a synthetic classification dataset using the make_classification() function from scikit-learn. This creates a dataset with 100 samples, 5 features, and 3 classes, with one cluster per class, so the class labels can serve as the ground-truth grouping for a clustering problem without using real-world data.
Next, we split the dataset into training and test sets using the train_test_split() function, so the clustering model is evaluated on data it was not fitted on. We use 80% of the data for training and reserve 20% (20 samples) for testing.
With our data prepared, we fit a KMeans clustering model using the KMeans class from scikit-learn. We specify 3 clusters and set the random state for reproducibility. The fit() method is called with only the training features (X_train); KMeans is unsupervised, so the labels in y_train are not used.
After fitting, we assign the test samples to clusters by calling the predict() method with X_test. This returns a cluster index for each sample in the test set. The indices are arbitrary identifiers and need not match the class numbers in y_test.
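Cluster indices returned by KMeans carry no meaning beyond grouping, which is why a pair-based metric fits here. The sketch below, using made-up labels rather than the data from this example, renames the predicted clusters and checks that the pair confusion matrix is unchanged.

```python
import numpy as np
from sklearn.metrics import pair_confusion_matrix

# Toy labels chosen for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 0])

# Rename the predicted clusters: 0 -> 1, 1 -> 2, 2 -> 0.
renamed = np.array([1, 2, 0])[y_pred]

same = np.array_equal(
    pair_confusion_matrix(y_true, y_pred),
    pair_confusion_matrix(y_true, renamed),
)
print(same)  # True
```

A per-sample comparison such as accuracy would change under this renaming, whereas the pair counts depend only on which samples are grouped together.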
Finally, we evaluate the clustering using the pair_confusion_matrix() function. It takes the true labels (y_test) and the predicted labels (y_pred) and returns the pair confusion matrix, which summarizes how often the two labelings agree on grouping a pair of samples together or apart. The resulting matrix is printed, giving us a quantitative view of how well the clusters recover the true classes.
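To show what the function counts, here is a brute-force sketch that loops over every ordered pair of samples and tallies the four cases. The helper name pair_confusion_bruteforce and the toy labels are hypothetical, introduced only for illustration; scikit-learn's own implementation is vectorized and does not loop this way.

```python
from itertools import permutations

import numpy as np
from sklearn.metrics import pair_confusion_matrix


def pair_confusion_bruteforce(labels_true, labels_pred):
    """Tally ordered sample pairs by same/different grouping (hypothetical helper)."""
    counts = np.zeros((2, 2), dtype=np.int64)
    for i, j in permutations(range(len(labels_true)), 2):
        same_true = int(labels_true[i] == labels_true[j])  # row index
        same_pred = int(labels_pred[i] == labels_pred[j])  # column index
        counts[same_true, same_pred] += 1
    return counts


# Toy labels chosen for illustration only.
labels_true = [0, 0, 1, 1, 2]
labels_pred = [0, 1, 1, 1, 2]

slow = pair_confusion_bruteforce(labels_true, labels_pred)
fast = pair_confusion_matrix(labels_true, labels_pred)
print(slow)
print(np.array_equal(slow, fast))
```

The loop is quadratic in the number of samples, so it is only practical for small inputs, but it makes the definition of each entry explicit.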