The Matthews correlation coefficient (MCC) is a metric that measures the quality of binary classifications. It provides a balanced measure even when the classes are of very different sizes.
The MCC value ranges from -1 to 1, where 1 indicates a perfect prediction, 0 indicates no better than random prediction, and -1 indicates total disagreement between the predictions and the actual labels.
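To make the range concrete, here is a minimal sketch that computes the binary MCC by hand from the confusion matrix counts, using the standard formula (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and compares it with matthews_corrcoef(). The small label lists are made up for illustration only.

```python
import math

from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical labels, chosen only to illustrate the formula
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

# Binary MCC from the four confusion matrix cells
numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc_manual = numerator / denominator

print(f"Manual MCC:  {mcc_manual:.4f}")
print(f"sklearn MCC: {matthews_corrcoef(y_true, y_pred):.4f}")

# The two extremes of the range: perfect and fully inverted predictions
y_inverted = [1 - v for v in y_true]
mcc_perfect = matthews_corrcoef(y_true, y_true)
mcc_inverted = matthews_corrcoef(y_true, y_inverted)
print(f"Perfect: {mcc_perfect:.1f}, inverted: {mcc_inverted:.1f}")
```

The manual value and the scikit-learn value should agree up to floating-point error, and the perfect and inverted predictions land on the two ends of the range.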
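MCC is particularly useful for binary classification problems, especially when dealing with imbalanced datasets. In such cases, MCC gives a more informative and truthful measure of the classifier’s performance than accuracy, because it takes all four cells of the confusion matrix into account. MCC also generalizes to multiclass problems, and matthews_corrcoef() accepts multiclass labels, but the score is harder to interpret there: the maximum is still 1, while the minimum falls somewhere between -1 and 0 depending on the number and distribution of the true labels.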
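The gap between accuracy and MCC on imbalanced data can be sketched with a made-up example: 95 negatives, 5 positives, and a hypothetical classifier that almost always predicts the majority class. The labels below are invented for illustration and do not come from the dataset used later.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Hypothetical imbalanced ground truth: 95 negatives followed by 5 positives
y_true = [0] * 95 + [1] * 5

# Hypothetical predictions: one false positive, every real positive missed
y_pred = [1] + [0] * 99

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"Accuracy: {acc:.2f}")
print(f"MCC:      {mcc:.2f}")
```

Accuracy looks strong because the negatives dominate, while MCC stays near zero because none of the positives are found.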
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a logistic regression classifier
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)
# Predict on test set
y_pred = clf.predict(X_test)
# Calculate MCC
mcc = matthews_corrcoef(y_test, y_pred)
print(f"Matthews Correlation Coefficient: {mcc:.2f}")
Running the example gives an output like:
Matthews Correlation Coefficient: 0.72
The steps are as follows:
- Generate a synthetic binary classification dataset using make_classification().
- Split the dataset into training and test sets using train_test_split().
- Train a LogisticRegression classifier on the training set.
- Use the trained classifier to make predictions on the test set with predict().
- Calculate the MCC of the predictions using matthews_corrcoef() by comparing the predicted labels to the true labels.
First, we generate a synthetic binary classification dataset using the make_classification() function from scikit-learn. This function creates a dataset with 1000 samples and 2 classes, simulating a classification problem.
Next, we split the dataset into training and test sets using the train_test_split() function. We use 80% of the data for training and reserve 20% for testing.
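As a quick check on the split, the sketch below prints the resulting shapes and class proportions. It adds stratify=y, which is not part of the example above and is an assumption here: it keeps the class ratio the same in both subsets, which tends to matter more when the classes are imbalanced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)

# stratify=y is an addition to the original example (assumption):
# it preserves the class proportions in the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape: ", X_test.shape)
print("Train class proportions:", np.bincount(y_train) / len(y_train))
print("Test class proportions: ", np.bincount(y_test) / len(y_test))
```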
With our data prepared, we train a logistic regression classifier using the LogisticRegression class from scikit-learn. We call the fit() method on the classifier object, passing in the training features (X_train) and labels (y_train) to learn the patterns in the data.
After training, we use the trained classifier to make predictions on the test set by calling the predict() method with X_test. This generates predicted labels for each sample in the test set.
Finally, we evaluate the performance of our classifier using the matthews_corrcoef() function. This function takes the true labels (y_test) and the predicted labels (y_pred) as input and calculates the MCC. The resulting MCC score is printed, providing a measure of our classifier’s performance.
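A single train/test split gives one MCC value, which can vary with the split. As an optional extension that is not part of the example above, the sketch below wraps matthews_corrcoef() with make_scorer() so that cross_val_score() can report MCC per fold. The choice of 5 folds is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)

# Wrap the metric so scikit-learn's model selection tools can call it
mcc_scorer = make_scorer(matthews_corrcoef)

clf = LogisticRegression(random_state=42)

# cv=5 is an illustrative choice; each fold yields its own MCC
scores = cross_val_score(clf, X, y, cv=5, scoring=mcc_scorer)

print("MCC per fold:", [f"{s:.2f}" for s in scores])
print(f"Mean MCC: {scores.mean():.2f}")
```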
This example demonstrates how to use the matthews_corrcoef() function from scikit-learn to evaluate the performance of a binary classification model.