Scikit-Learn roc_auc_score() Metric

ROC AUC (Receiver Operating Characteristic Area Under the Curve) is a valuable metric for evaluating the performance of classification models.

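It quantifies how well the model can distinguish between classes, that is, how reliably it scores positive examples above negative ones. The roc_auc_score() function in scikit-learn computes this metric as the area under the ROC curve, which traces the true positive rate against the false positive rate across all classification thresholds. The function returns a single number; it does not draw the curve.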
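To make the link to the ROC curve concrete, here is a minimal sketch that builds the curve with roc_curve(), integrates it with auc(), and compares the result with roc_auc_score(). The labels and scores are made-up illustrative values, not data from the example later in this article.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Illustrative labels and scores (made up for this sketch)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# False/true positive rates at each threshold taken from the scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve via the trapezoidal rule
area_from_curve = auc(fpr, tpr)

# The same quantity computed in one call
area_direct = roc_auc_score(y_true, y_score)

print(f"auc(fpr, tpr):   {area_from_curve:.4f}")
print(f"roc_auc_score(): {area_direct:.4f}")
```

Both lines should print the same value. Here 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9.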

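The ROC AUC score ranges from 0 to 1, with 1 indicating perfect classification and 0.5 indicating performance no better than random guessing. A score below 0.5 means the model ranks the classes worse than random, which usually signals inverted predictions or labels.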
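The landmark values on this scale can be reproduced with a small sketch. The four labels and the score lists are hypothetical values chosen only to hit each case.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Every positive is scored above every negative: perfect ranking
perfect = roc_auc_score(y_true, [0.1, 0.2, 0.8, 0.9])

# Every positive is scored below every negative: fully inverted ranking
inverted = roc_auc_score(y_true, [0.9, 0.8, 0.2, 0.1])

# All samples share one score: no ranking information at all
uninformative = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])

print(perfect, inverted, uninformative)
```

The three results are 1.0, 0.0, and 0.5 respectively.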

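This metric can be used for both binary and multiclass classification problems. For multiclass targets, roc_auc_score() requires the multi_class parameter to be set to "ovr" or "ovo" and expects a probability for every class. On heavily imbalanced datasets the score can look optimistic, because the false positive rate stays low when negatives vastly outnumber positives; a precision-recall metric such as average_precision_score() is often more informative in that setting.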
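For the multiclass case, a minimal sketch might look like the following. The three-class synthetic dataset, the n_informative=4 setting, and the LogisticRegression model are assumptions chosen for illustration and are not part of the binary example in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class dataset; n_informative is raised so three classes fit
X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Multiclass scoring needs the full probability matrix, one column per class
y_proba = clf.predict_proba(X_test)

# One-vs-rest: average the AUC of each class against all the others
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr")

# One-vs-one: average the AUC over all pairs of classes
auc_ovo = roc_auc_score(y_test, y_proba, multi_class="ovo")

print(f"OvR ROC AUC: {auc_ovr:.2f}")
print(f"OvO ROC AUC: {auc_ovo:.2f}")
```

Leaving multi_class at its default raises an error for multiclass targets, so one of the two strategies has to be chosen explicitly.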

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForest classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict probabilities on test set
y_pred_proba = clf.predict_proba(X_test)[:, 1]

# Calculate ROC AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC AUC Score: {roc_auc:.2f}")

Running the example gives an output like:

ROC AUC Score: 0.94

The steps are as follows:

1. Generate a synthetic binary classification dataset using make_classification(). The call creates 1000 samples split across 2 classes.
2. Split the dataset into training and test sets using train_test_split(), so that the classifier is evaluated on data it has not seen. With test_size=0.2, 80% of the data is used for training and 20% is held out for testing.
3. Train a random forest using the RandomForestClassifier class from scikit-learn. Calling fit() with the training features (X_train) and labels (y_train) fits the model.
4. Predict probabilities on the test set using the predict_proba() method. It returns one column per class, and [:, 1] selects the probability of the positive class, which is the score roc_auc_score() expects in the binary case.
5. Calculate the ROC AUC score using the roc_auc_score() function. It takes the true labels (y_test) and the predicted probabilities (y_pred_proba) and returns the area under the ROC curve. The score is printed to two decimal places, giving a quantitative measure of the classifier’s performance.
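The example passes probabilities rather than predicted labels to roc_auc_score(). As a rough illustration of why, the sketch below repeats the same setup and scores the model both ways. Hard labels reflect a single threshold, so the ROC curve built from them has only one interior point and the resulting number generally differs from the probability-based one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Same setup as the example above
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Continuous scores: probability of the positive class (column 1)
auc_from_proba = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Hard 0/1 labels: only one threshold is represented, so ranking detail is lost
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))

print(f"ROC AUC from probabilities: {auc_from_proba:.2f}")
print(f"ROC AUC from hard labels:   {auc_from_labels:.2f}")
```

The probability-based value is the one normally reported as ROC AUC, since it summarizes the ranking across every threshold.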

See Also