Mutual information score is a useful metric for evaluating the dependency between two categorical variables. It quantifies the amount of information obtained about one variable through another variable.
The mutual_info_score function in scikit-learn measures mutual information based on the entropies of the two variables and their joint entropy. It takes the true labels and the predicted labels as input and returns a float. Higher values indicate stronger dependency, while a value of zero indicates that the variables are independent.
Mutual information applies to both binary and multiclass classification problems. However, it does not indicate the direction of the relationship or whether the relationship is linear. Despite these limitations, it provides valuable insight into the dependency between variables.
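To make the entropy-based definition concrete, here is a minimal sketch that computes mutual information by hand as MI(a, b) = H(a) + H(b) - H(a, b) and checks it against scikit-learn. The entropy() helper below is our own illustration, not part of the scikit-learn API; mutual_info_score reports the result in nats (natural log).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(labels):
    """Empirical entropy in nats, estimated from label frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

a = [0, 0, 1, 1, 2, 2]
b = [0, 0, 1, 1, 1, 2]

# Encode each (a, b) pair as a single joint category
joint = [f"{x}-{y}" for x, y in zip(a, b)]

# MI(a, b) = H(a) + H(b) - H(a, b)
manual_mi = entropy(a) + entropy(b) - entropy(joint)
print(manual_mi)
print(mutual_info_score(a, b))  # matches the manual computation
```

Because both quantities are computed from the same empirical frequencies, the two printed values agree to floating-point precision.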
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mutual_info_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_clusters_per_class=1, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Predict on test set
y_pred = clf.predict(X_test)
# Calculate mutual information score
mi_score = mutual_info_score(y_test, y_pred)
print(f"Mutual Information Score: {mi_score:.2f}")
Running the example gives an output like:
Mutual Information Score: 0.71
The steps are as follows:
- Generate a synthetic classification dataset using make_classification().
- Split the dataset into training and test sets using train_test_split().
- Train a DecisionTreeClassifier on the training set.
- Use the trained classifier to make predictions on the test set with predict().
- Calculate the mutual information score using mutual_info_score() by comparing the true labels (y_test) and the predicted labels (y_pred).
First, we generate a synthetic classification dataset using the make_classification() function from scikit-learn. This creates a dataset with 1000 samples and 3 classes, simulating a classification problem without relying on real-world data.
Next, we split the dataset into training and test sets using the train_test_split() function. This step is crucial for evaluating the classifier on unseen data: we use 80% of the data for training and reserve 20% for testing.
With our data prepared, we train a Decision Tree classifier using the DecisionTreeClassifier class from scikit-learn. The fit() method is called on the classifier, passing the training features (X_train) and labels (y_train) so it can learn the underlying patterns in the data.
After training, we use the classifier to make predictions on the test set by calling predict() with X_test. This generates a predicted label for each sample in the test set.
Finally, we evaluate the dependency between the true and predicted labels using the mutual_info_score() function. It takes the true labels (y_test) and the predicted labels (y_pred) as input and returns the mutual information score, which is printed as a quantitative measure of the dependency between the two sets of labels.
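To build intuition for how to read the score, the short sketch below (our own illustration, using made-up labels) probes its extremes: perfect predictions yield a score equal to the entropy of the labels, a consistent relabeling of the predictions yields the exact same score (mutual information measures dependency, not agreement), and random predictions score near zero.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 3, size=1000)

# Perfect predictions: MI equals the entropy of the true labels
perfect = mutual_info_score(y_true, y_true)

# Consistently relabeled predictions (0->1, 1->2, 2->0): identical MI,
# because MI measures dependency, not label-for-label agreement
relabeled = mutual_info_score(y_true, (y_true + 1) % 3)

# Independent random predictions: MI close to zero
random_mi = mutual_info_score(y_true, rng.integers(0, 3, size=1000))

print(perfect, relabeled, random_mi)
```

The relabeling case is worth noting: a classifier that systematically swaps two classes would have poor accuracy but maximal mutual information, which is why this metric complements rather than replaces accuracy.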
This example demonstrates how to use the mutual_info_score() function from scikit-learn to evaluate the performance of a classification model.
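One practical caveat: the raw score is measured in nats and is not bounded above by 1, so it can be hard to compare across problems with different numbers of classes. scikit-learn also provides normalized_mutual_info_score (scaled to [0, 1]) and adjusted_mutual_info_score (corrected for chance agreement); a brief sketch with made-up labels:

```python
from sklearn.metrics import (
    mutual_info_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
)

y_true = [0, 0, 1, 1, 2, 2, 0, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2, 0, 1, 1, 2]

mi = mutual_info_score(y_true, y_pred)             # raw score, in nats
nmi = normalized_mutual_info_score(y_true, y_pred) # scaled to [0, 1]
ami = adjusted_mutual_info_score(y_true, y_pred)   # corrected for chance
print(mi, nmi, ami)
```

When reporting results, the normalized or adjusted variant is often easier for readers to interpret than the raw value.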