Hamming Loss is a metric used to evaluate the performance of classification models, particularly in multilabel classification. It is the fraction of individual label assignments that are incorrect, counted across all samples and all labels, so it shows how often the model’s predictions differ from the true labels.
The hamming_loss() function in scikit-learn computes this by dividing the number of incorrect labels by the total number of labels. It takes the true labels and predicted labels as input and returns a float between 0 and 1, where 0 means every label was predicted correctly.
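To make the calculation concrete, here is a minimal sketch that counts the mismatched labels by hand and compares the result with hamming_loss(). The small y_true and y_pred arrays are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Illustrative data (not from a real model): 4 samples, 3 labels each
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 1],
                   [0, 0, 1]])

# Count every position where the prediction differs from the truth
n_wrong = int(np.sum(y_true != y_pred))

# Divide by the total number of label slots (samples * labels)
manual_loss = n_wrong / y_true.size

# scikit-learn's implementation should agree with the manual count
sk_loss = hamming_loss(y_true, y_pred)

print(f"Wrong labels: {n_wrong} of {y_true.size}")
print(f"Manual: {manual_loss:.2f}, hamming_loss(): {sk_loss:.2f}")
```

Three of the twelve label slots differ here, so both values come out at 0.25.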
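Hamming Loss is used primarily in multilabel classification problems, where a prediction can be partly right. It adds little for single-label classification, where it is simply 1 minus accuracy, and it is a poor fit when the focus is on other types of errors, such as whether the full set of labels for a sample is predicted exactly.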
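One reason the metric is rarely reported for single-label problems is that it carries no information beyond accuracy there. The following sketch, using made-up class labels, checks that hamming_loss() and 1 - accuracy_score() coincide in that case.

```python
from sklearn.metrics import accuracy_score, hamming_loss

# Illustrative single-label (multiclass) targets, one class per sample
y_true_single = [0, 2, 1, 2, 0]
y_pred_single = [0, 1, 1, 2, 2]

# With one label per sample, each wrong prediction is one wrong label
single_loss = hamming_loss(y_true_single, y_pred_single)
single_acc = accuracy_score(y_true_single, y_pred_single)

print(f"Hamming Loss: {single_loss:.2f}")
print(f"1 - accuracy: {1 - single_acc:.2f}")
```

Two of the five predictions are wrong, so both lines print 0.40.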
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
# Generate synthetic multilabel dataset
X, y = make_multilabel_classification(n_samples=1000, n_classes=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a RandomForest classifier with multi-label support
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
clf.fit(X_train, y_train)
# Predict on test set
y_pred = clf.predict(X_test)
# Calculate Hamming Loss
h_loss = hamming_loss(y_test, y_pred)
print(f"Hamming Loss: {h_loss:.2f}")
Running the example gives an output like:
Hamming Loss: 0.18
- Generate a synthetic multilabel classification dataset using make_multilabel_classification(). This function creates a dataset with 1000 samples and 5 classes, simulating a multilabel classification problem.
- Split the dataset into training and test sets using train_test_split(), reserving 20% for testing.
- Train a RandomForest classifier with multi-label support using the MultiOutputClassifier class and the RandomForestClassifier as the base estimator. Call the fit() method with the training data.
- Make predictions on the test set using the predict() method of the trained classifier.
- Calculate the Hamming Loss using the hamming_loss() function. This function takes the true labels and the predicted labels, computing the fraction of labels that are incorrect. The result is printed to provide a measure of the classifier’s performance.
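When interpreting the score, it can help to see which labels contribute the errors. The sketch below is an optional extension rather than part of the example above: it uses small made-up arrays and hypothetical variable names to compute a per-label error rate, whose mean equals the Hamming Loss, and contrasts it with the stricter exact-match accuracy that accuracy_score() reports for multilabel input.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Illustrative data (not model output): 4 samples, 3 labels each
y_true_demo = np.array([[1, 0, 1],
                        [0, 1, 0],
                        [1, 1, 0],
                        [0, 0, 1]])
y_pred_demo = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [0, 1, 1],
                        [0, 0, 1]])

# Error rate for each label column separately
per_label_error = np.mean(y_true_demo != y_pred_demo, axis=0)

# Averaging the per-label error rates gives the overall Hamming Loss
overall = hamming_loss(y_true_demo, y_pred_demo)

# For multilabel input, accuracy_score counts a sample as correct
# only if every one of its labels matches
exact_match = accuracy_score(y_true_demo, y_pred_demo)

print("Per-label error:", per_label_error)
print(f"Hamming Loss: {overall:.2f}")
print(f"Exact-match accuracy: {exact_match:.2f}")
```

In this toy data the third label accounts for most of the errors, and only half of the samples are predicted exactly even though the Hamming Loss is 0.25.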