Use the chi2() function from scikit-learn to perform feature selection on a dataset. chi2() is a statistical test that measures the dependence between non-negative features and the target variable, which makes it useful for selecting the most relevant features for classification tasks. chi2() has no hyperparameters of its own; it is typically passed as the score function to SelectKBest or a similar selector, which specifies how many features to keep. Because chi2() accepts only non-negative feature values, features may first need to be scaled (for example, to the [0, 1] range). This method is appropriate for feature selection in classification tasks.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
# generate classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
# chi2 requires non-negative features, so scale to the [0, 1] range first
X = MinMaxScaler().fit_transform(X)
# perform chi-squared feature selection
chi2_selector = SelectKBest(score_func=chi2, k=5)
X_kbest = chi2_selector.fit_transform(X, y)
# summarize selected features
print("Selected features' indices:", chi2_selector.get_support(indices=True))
print("Selected features' scores:", chi2_selector.scores_[chi2_selector.get_support(indices=True)])
Running the example gives an output like:
Selected features' indices: [0 2 5 7 8]
Selected features' scores: [0.9923403 1.91998273 0.36959358 1.45143897 0.64637542]
The steps are as follows:

1. Generate a synthetic binary classification dataset using make_classification() with 100 samples, 10 features, and 2 classes. A fixed random seed (random_state) makes the dataset reproducible.
2. Apply SelectKBest with chi2 as the score function to select the top 5 features. The selector evaluates the chi-squared statistic between each feature and the target variable. Note that chi2 accepts only non-negative inputs, so the features must be scaled to a non-negative range (e.g., [0, 1]) before fitting.
3. Fit the SelectKBest object on the dataset and transform the dataset to retain only the selected features using the fit_transform() method.
4. Print the indices of the selected features and their chi-squared scores to see which features the chi-squared test deemed most relevant.
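Once fitted, the selector can also be inspected and reused: its pvalues_ attribute holds one p-value per original feature, and transform() applies the same column selection to new data. A minimal sketch of this, assuming the same synthetic dataset and a hypothetical train/test split added for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset as above, scaled to [0, 1] because chi2
# accepts only non-negative feature values
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the selector on training data only
selector = SelectKBest(score_func=chi2, k=5)
selector.fit(X_train, y_train)

# one p-value per original feature (smaller = stronger dependence on y)
print("p-values:", selector.pvalues_.round(4))
# apply the same column selection to held-out data
print("Transformed test shape:", selector.transform(X_test).shape)
```

Because transform() keeps the same 5 columns chosen during fitting, the transformed test set has shape (25, 5) with the default 25% test split.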
This example demonstrates how to use the chi2() function in conjunction with SelectKBest to identify and select the most relevant features for a classification problem. Focusing on the most informative features reduces dimensionality and can improve model performance.
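In practice, the scaler and selector are often combined with a model in a Pipeline so that feature selection is refitted on each training fold during cross-validation rather than on the full dataset. A sketch of this pattern, where the choice of LogisticRegression as the downstream classifier is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# scale to non-negative values, select the top 5 features, then classify
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=chi2, k=5)),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy: %.3f" % scores.mean())
```

Wrapping the selection step in the pipeline avoids leaking information from validation folds into the feature-selection decision.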