
Scikit-Learn GenericUnivariateSelect for Feature Selection

GenericUnivariateSelect is a flexible feature selection method that allows you to select the best features based on univariate statistical tests.

It supports several selection modes: 'percentile', 'k_best', 'fpr' (false positive rate), 'fdr' (false discovery rate), and 'fwe' (family-wise error rate).

This example demonstrates how to use GenericUnivariateSelect to select important features from a dataset, showcasing its application in preprocessing steps for machine learning tasks.

The key hyperparameters include score_func (the scoring function), mode (the selection mode), and param (the mode-dependent parameter, such as the number of top features to select).

This method is appropriate for classification and regression problems where feature selection is required to improve model performance or reduce computational cost.
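Before the main example, here is a minimal sketch of how mode and param work together, using the 'percentile' mode with f_classif (the ANOVA F-test) instead of chi2; the dataset parameters mirror the example below:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

# same synthetic dataset shape as the main example
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=42)

# keep the top 50% of features ranked by ANOVA F-value
selector = GenericUnivariateSelect(score_func=f_classif, mode='percentile', param=50)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # 50% of 10 features -> (100, 5)
```

In 'percentile' mode, param is a percentage of features to keep; in 'k_best' mode it is an absolute count; in the 'fpr', 'fdr', and 'fwe' modes it is a p-value threshold.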

from sklearn.datasets import make_classification
from sklearn.feature_selection import GenericUnivariateSelect, chi2
import numpy as np

# generate a classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=2, random_state=42)
X = np.abs(X)  # chi2 requires non-negative feature values

# print dataset shape before feature selection
print("Original shape:", X.shape)

# configure the feature selection method
transformer = GenericUnivariateSelect(score_func=chi2, mode='k_best', param=5)

# fit and transform the dataset
X_new = transformer.fit_transform(X, y)

# print dataset shape after feature selection
print("Transformed shape:", X_new.shape)

Running the example gives an output like:

Original shape: (100, 10)
Transformed shape: (100, 5)

The steps are as follows:

  1. First, a synthetic classification dataset is generated using make_classification(). It has 100 samples and 10 features, of which 5 are informative and 2 are redundant. np.abs() is applied because chi2 requires non-negative feature values.

  2. The shape of the dataset before feature selection is printed to show the original number of features.

  3. A GenericUnivariateSelect transformer is configured with chi2 as the scoring function, k_best mode, and param set to 5, indicating that the top 5 features will be selected.

  4. The transformer is fit on the dataset and the data is transformed to select the top features.

  5. The shape of the dataset after feature selection is printed to show the reduced number of features.

This example demonstrates how to use GenericUnivariateSelect to perform feature selection, reducing the number of features in the dataset while retaining the most important ones according to the specified scoring function and selection mode.
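In practice, the transformer is typically placed inside a Pipeline so that feature selection is refit on each training fold during cross-validation, avoiding leakage from the test folds. A sketch with LogisticRegression as an illustrative downstream estimator (any classifier would do):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import GenericUnivariateSelect, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=2, random_state=42)

# selection and classification chained so selection is refit per CV fold
pipe = Pipeline([
    ('select', GenericUnivariateSelect(score_func=f_classif, mode='k_best', param=5)),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", round(scores.mean(), 3))
```

f_classif is used here rather than chi2 so the raw (possibly negative) features can be passed in unchanged.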
