
Scikit-Learn SelectFdr for Feature Selection

SelectFdr is a univariate feature selection method for high-dimensional datasets. It scores each feature with a statistical test and keeps the features whose p-values pass the Benjamini-Hochberg procedure, which controls the expected proportion of false discoveries among the selected features.

The key hyperparameter of SelectFdr is alpha, an upper bound on the expected false discovery rate. The method works for both classification and regression problems, provided the scoring function matches the task: for example chi2 or f_classif for classification and f_regression for regression.
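To make the role of alpha concrete, the following is a minimal NumPy sketch of the Benjamini-Hochberg rule applied to a list of p-values. The function name benjamini_hochberg_mask and the p-values are made up for illustration; this is not scikit-learn's own code, only the textbook form of the procedure.

```python
import numpy as np

def benjamini_hochberg_mask(pvalues, alpha=0.05):
    """Return a boolean mask of the p-values kept by the Benjamini-Hochberg rule.

    Hypothetical helper for illustration only.
    """
    pvalues = np.asarray(pvalues, dtype=float)
    m = pvalues.size
    sorted_p = np.sort(pvalues)
    # The k-th smallest p-value is compared with alpha * k / m
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = sorted_p[sorted_p <= thresholds]
    if passing.size == 0:
        # Nothing passes: no feature would be kept
        return np.zeros(m, dtype=bool)
    # Keep every p-value up to the largest one that passed its threshold
    return pvalues <= passing.max()

# Illustrative p-values, one per feature
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
mask = benjamini_hochberg_mask(pvals, alpha=0.05)
print(mask)
```

Here only the first two p-values are kept. The values 0.039 and 0.041 are below 0.05, yet they fail their rank-adjusted thresholds of 0.03 and 0.04, which shows that alpha is not a plain per-feature p-value cutoff.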

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, chi2
import numpy as np

# generate a synthetic classification dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2, random_state=1)
# chi2 requires non-negative feature values, so take absolute values
X = np.abs(X)

# print the shape of the dataset before feature selection
print('Shape before feature selection:', X.shape)

# configure and apply SelectFdr
selector = SelectFdr(chi2, alpha=0.05)
X_new = selector.fit_transform(X, y)

# print the shape of the dataset after feature selection
print('Shape after feature selection:', X_new.shape)

Running the example gives an output like:

Shape before feature selection: (100, 20)
UserWarning: No features were selected: either the data is too noisy or the selection test too strict.
Shape after feature selection: (100, 0)

The steps are as follows:

  1. Generate a synthetic classification dataset with 100 samples and 20 features using make_classification(), then take absolute values of the features because chi2 requires non-negative input. This dataset is used to demonstrate the feature selection process.

  2. Print the shape of the dataset before applying feature selection to show the initial dimensions. This helps in understanding the impact of feature selection on the dataset.

  3. Instantiate SelectFdr with chi2 as the scoring function and alpha=0.05 as the threshold for the false discovery rate. These parameters control how the features are evaluated and selected based on their relevance.

  4. Apply SelectFdr to the dataset using the fit_transform() method. This step selects the most relevant features based on the specified alpha value.

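  5. Print the shape of the dataset after applying feature selection. In this run no feature passes the test at alpha=0.05, so scikit-learn issues a warning and the result has zero columns. A likely contributor is the np.abs() step, which makes the data valid for chi2 but can discard the sign information that separates the classes.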
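As a variation on the steps above, the sketch below swaps chi2 for f_classif, which accepts negative values, so the np.abs() step is not needed. This is an assumption about what suits this dataset better, not part of the original example. It also prints the selected column indices and the smallest p-values so the selection can be inspected. How many features survive depends on the data, so no expected output is given.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, f_classif

# Same synthetic dataset as the main example, without taking absolute values
X, y = make_classification(n_samples=100, n_features=20, n_informative=2, random_state=1)

# f_classif (ANOVA F-test) works on signed, continuous features
selector = SelectFdr(f_classif, alpha=0.05)
X_new = selector.fit_transform(X, y)

print('Shape after feature selection:', X_new.shape)
# Column indices of the features that were kept
print('Selected feature indices:', selector.get_support(indices=True))
# The five smallest p-values computed by the scoring function
print('Smallest p-values:', sorted(selector.pvalues_)[:5])
```

If the warning about no selected features still appears, the p-values show how far each feature is from passing, which helps decide whether to change the scoring function or relax alpha.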

This example demonstrates how to use SelectFdr to select features in a dataset, which can help improve model performance by reducing dimensionality and removing irrelevant features. The SelectFdr method is particularly useful when working with high-dimensional datasets where controlling the false discovery rate is crucial.
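One way to connect the selection step to model performance is to place SelectFdr in a pipeline ahead of a classifier, so that selection is fitted on the training split only. The sketch below is a hedged illustration: the dataset settings, the choice of f_classif, and the LogisticRegression model are all assumptions made for the example, not part of the original.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Illustrative dataset: settings chosen so that some features are clearly informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Feature selection is learned on the training data, then reused at prediction time
model = make_pipeline(SelectFdr(f_classif, alpha=0.05),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
n_kept = int(model.named_steps['selectfdr'].get_support().sum())
score = model.score(X_test, y_test)
print('Features kept:', n_kept)
print('Test accuracy:', score)
```

Fitting the selector inside the pipeline keeps the test split out of the p-value computation, which avoids an optimistic accuracy estimate.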


