Evaluating the performance of a classification model involves understanding the distribution of predicted and true labels. The contingency_matrix() function from scikit-learn helps by creating a matrix that shows the counts of true vs. predicted labels.
The contingency_matrix() function creates a matrix where rows represent true labels and columns represent predicted labels. Each cell shows the count of instances for the corresponding true-predicted label pair. The function compares each true label with the predicted label and increments the corresponding cell in the matrix.
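The counting described above can be sketched by hand. This is a minimal illustration on made-up labels, not scikit-learn's actual implementation; the variable names (manual, true_idx, pred_idx) are chosen here for illustration only.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Made-up labels for illustration
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

# Map each label to a row/column index (sorted unique values)
true_classes, true_idx = np.unique(y_true, return_inverse=True)
pred_classes, pred_idx = np.unique(y_pred, return_inverse=True)

# Increment one cell per (true, predicted) pair
manual = np.zeros((len(true_classes), len(pred_classes)), dtype=int)
for i, j in zip(true_idx, pred_idx):
    manual[i, j] += 1

cm = contingency_matrix(y_true, y_pred)
print(manual)
print(np.array_equal(manual, cm))
```

The hand-built table should match the one returned by contingency_matrix() for these labels.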
When the rows and columns cover the same classes in the same order, a matrix with non-zero counts only on the diagonal indicates perfect classification, while off-diagonal values indicate misclassifications. The matrix is useful for both binary and multiclass classification problems. However, it has limitations: raw counts are harder to interpret on highly imbalanced datasets, where the majority class dominates the totals, and the matrix does not account for the severity of misclassifications.
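To see why imbalance makes the raw counts harder to read, here is a small sketch on made-up labels with a hypothetical classifier that misses most of the minority class. It assumes both classes appear in the true and predicted labels, so the matrix is square and its diagonal holds the correct predictions.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Toy imbalanced labels: 95 negatives, 5 positives (made-up data)
y_true = np.array([0] * 95 + [1] * 5)
# A hypothetical classifier that finds only 1 of the 5 positives
y_pred = np.array([0] * 95 + [1] + [0] * 4)

cm = contingency_matrix(y_true, y_pred)

# Share of all samples on the diagonal
accuracy = np.trace(cm) / cm.sum()
# Share of each true class that was predicted correctly
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(cm)
print(accuracy)
print(recall_per_class)
```

The overall share of correct predictions looks high because the majority class dominates, while dividing each row by its total exposes the weak minority class.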
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics.cluster import contingency_matrix
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_clusters_per_class=1, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train an SVM classifier
clf = SVC(kernel='linear', C=1, random_state=42)
clf.fit(X_train, y_train)
# Predict on test set
y_pred = clf.predict(X_test)
# Calculate contingency matrix
cm = contingency_matrix(y_test, y_pred)
print(f"Contingency Matrix:\n{cm}")
Running the example gives an output like:
Contingency Matrix:
[[60  2  4]
 [ 9 51  1]
 [ 5  0 68]]
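One way to read a matrix like this is to normalize each row and look for the largest off-diagonal cell. The sketch below copies the counts from the sample output above into an array by hand; your own run may produce different numbers.

```python
import numpy as np

# Counts copied from the sample output above (rows: true, columns: predicted)
cm = np.array([[60, 2, 4],
               [9, 51, 1],
               [5, 0, 68]])

# Number of test samples per true class
support = cm.sum(axis=1)
# Fraction of each true class assigned to each predicted class
row_rates = cm / support[:, None]

# Locate the most frequent misclassification
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_label, pred_label = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(np.round(row_rates, 3))
print(f"Most common error: true class {true_label} predicted as class {pred_label}")
```

For these counts, the most common error is true class 1 being predicted as class 0.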
The steps are as follows:
- Generate a synthetic multiclass classification dataset using make_classification().
- Split the dataset into training and test sets using train_test_split().
- Train an SVC classifier with a linear kernel.
- Predict labels on the test set using the trained classifier.
- Calculate the contingency matrix using contingency_matrix() by comparing true and predicted labels.
This example demonstrates how to use contingency_matrix() from scikit-learn to evaluate a classification model by tabulating the distribution of true versus predicted labels.
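One caveat worth checking in practice: contingency_matrix() lives in sklearn.metrics.cluster and builds its rows and columns from the labels that actually occur, so the result may not be square if the model never predicts some class. The sketch below, on made-up labels, compares it with confusion_matrix() from sklearn.metrics, where the labels argument fixes the class order on both axes.

```python
from sklearn.metrics import confusion_matrix
from sklearn.metrics.cluster import contingency_matrix

# Made-up labels in which class 2 is never predicted
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 1, 1, 0]

# Columns cover only the predicted labels that occur (0 and 1)
cont = contingency_matrix(y_true, y_pred)
# Explicit labels keep the matrix square, with an all-zero column for class 2
conf = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(cont.shape)
print(conf.shape)
```

If you rely on the diagonal to count correct predictions, a square matrix with a fixed label order is the safer input.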