Use the chi2() function from scikit-learn to perform feature selection on a dataset. chi2() is a statistical test that measures the dependence between non-negative features and the target variable, which makes it useful for selecting the most relevant features for classification tasks. chi2() has no hyperparameters of its own; it is typically passed as the score function to SelectKBest or a similar selector, which specifies how many features to keep. Because chi2() accepts only non-negative feature values, features may first need to be scaled (for example, to the [0, 1] range). This method is appropriate for feature selection in classification tasks.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler
# generate classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
# chi2 requires non-negative features, so scale to the [0, 1] range first
X = MinMaxScaler().fit_transform(X)
# perform chi-squared feature selection
chi2_selector = SelectKBest(score_func=chi2, k=5)
X_kbest = chi2_selector.fit_transform(X, y)
# summarize selected features
print("Selected features' indices:", chi2_selector.get_support(indices=True))
print("Selected features' scores:", chi2_selector.scores_[chi2_selector.get_support(indices=True)])
Running the example gives an output like:
Selected features' indices: [0 2 5 7 8]
Selected features' scores: [0.9923403 1.91998273 0.36959358 1.45143897 0.64637542]
The steps are as follows:

1. Generate a synthetic binary classification dataset using make_classification() with 100 samples, 10 features, and 2 classes. A fixed random seed (random_state) makes the dataset reproducible.
2. Apply SelectKBest with chi2 as the score function to select the top 5 features. The selector evaluates the chi-squared statistic between each feature and the target variable. Note that chi2 accepts only non-negative inputs, so the features must be scaled to a non-negative range (e.g., [0, 1]) before fitting.
3. Fit the SelectKBest object on the dataset and transform the dataset to retain only the selected features using the fit_transform() method.
4. Print the indices of the selected features and their chi-squared scores to see which features the chi-squared test deemed most relevant.
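Once fitted, the selector can also be inspected and reused: its pvalues_ attribute holds one p-value per original feature, and transform() applies the same column selection to new data. A minimal sketch of this, assuming the same synthetic dataset and a hypothetical train/test split added for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset as above, scaled to [0, 1] because chi2
# accepts only non-negative feature values
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the selector on training data only
selector = SelectKBest(score_func=chi2, k=5)
selector.fit(X_train, y_train)

# one p-value per original feature (smaller = stronger dependence on y)
print("p-values:", selector.pvalues_.round(4))
# apply the same column selection to held-out data
print("Transformed test shape:", selector.transform(X_test).shape)
```

Because transform() keeps the same 5 columns chosen during fitting, the transformed test set has shape (25, 5) with the default 25% test split.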
This example demonstrates how to use the chi2() function in conjunction with SelectKBest to identify and select the most relevant features for a classification problem. Focusing on the most informative features reduces dimensionality and can improve model performance.
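In practice, the scaler and selector are often combined with a model in a Pipeline so that feature selection is refitted on each training fold during cross-validation rather than on the full dataset. A sketch of this pattern, where the choice of LogisticRegression as the downstream classifier is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# scale to non-negative values, select the top 5 features, then classify
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=chi2, k=5)),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy: %.3f" % scores.mean())
```

Wrapping the selection step in the pipeline avoids leaking information from validation folds into the feature-selection decision.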