Mutual information measures the statistical dependency between two variables and can be used for feature selection in classification problems. The mutual_info_classif()
function from scikit-learn returns a score for each feature against the target, allowing the most relevant features to be selected.
Higher scores indicate stronger dependency, making the function a useful tool for dimensionality reduction and for improving model performance by focusing on the most informative features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest
# generate dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
print(X.shape, y.shape)
# calculate the mutual information score for each feature against the target
scores = mutual_info_classif(X, y)
# select top 5 features
k = 5
top_k_features = SelectKBest(mutual_info_classif, k=k)
top_k_features.fit(X, y)
mask = top_k_features.get_support()
# report scores and selected features
print(scores)
print(mask)
# create new dataset with selected features
X_new = top_k_features.transform(X)
print(X_new.shape)
Running the example gives an output like the following (exact scores may vary between runs because the mutual information estimate involves randomness):
(1000, 10) (1000,)
[0.07269592 0.08224524 0.04956311 0.16258819 0.01641024 0.0819586
0.06882626 0.03653101 0.02723712 0. ]
[ True True False True False True True False False False]
(1000, 5)
The steps are as follows:
1. A synthetic classification dataset is generated using make_classification(). The shape of the dataset is reported, showing the number of samples and features.
2. The mutual_info_classif() function is used to calculate the mutual information score between each feature and the target variable. The scores are printed, indicating the level of dependency between each feature and the target (a seeded, repeatable variant is sketched after this list).
3. The SelectKBest class is used to select the top 5 features with the highest mutual information scores. The get_support() method returns a boolean mask indicating which features were selected.
4. A new dataset X_new is created by transforming the original dataset X using the transform() method of the fitted SelectKBest instance. This new dataset contains only the selected features, effectively reducing the dimensionality of the data.
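Note that mutual_info_classif() estimates mutual information with a nearest-neighbor method that involves randomness, so the scores printed directly and the scores computed inside SelectKBest can differ slightly between runs. A minimal sketch of one way to make the selection repeatable, fixing the estimator's random_state with functools.partial (the variable names are illustrative):
from functools import partial
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
# same synthetic dataset as above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# fix the seed of the estimator so repeated runs give identical scores
seeded_mi = partial(mutual_info_classif, random_state=1)
# select the top 5 features using the seeded score function
selector = SelectKBest(seeded_mi, k=5)
selector.fit(X, y)
# get_support(indices=True) returns column indices instead of a boolean mask
print(selector.get_support(indices=True))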
This example demonstrates how to use mutual information to select the most informative features for a classification problem. By reducing the number of features, model complexity is lowered, potentially leading to improved performance and faster training times.
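One way to check that claim is to chain the selector with a model and compare cross-validated scores. A minimal sketch, assuming a LogisticRegression classifier chosen purely for illustration; wrapping SelectKBest and the model in a Pipeline re-fits the feature selection on each training fold, which avoids leaking information from the test folds:
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# same synthetic dataset as above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# pipeline: select the top 5 features by mutual information, then fit the model
pipeline = Pipeline([('select', SelectKBest(mutual_info_classif, k=5)), ('model', LogisticRegression())])
# cross-validated accuracy with feature selection
selected = cross_val_score(pipeline, X, y, scoring='accuracy', cv=10)
print('Top-5 features: %.3f' % selected.mean())
# baseline: the same model fit on all 10 features
baseline = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print('All features: %.3f' % baseline.mean())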