
Scikit-Learn SelectFpr for Feature Selection

The SelectFpr method selects features based on a false positive rate (FPR) test: it keeps every feature whose test p-value falls below a chosen significance level.

SelectFpr performs univariate feature selection, scoring each feature independently against the target with a statistical test.

Common hyperparameters include score_func (the scoring function, typically chi2 or f_classif) and alpha (the highest p-value a feature may have and still be kept).

It is appropriate for classification problems where removing irrelevant or redundant features can improve model performance.
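As a quick illustration of what alpha means, the sketch below (using the iris dataset rather than the synthetic data in the example that follows) shows that the features SelectFpr keeps are exactly those whose chi2 p-values fall below alpha:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFpr, chi2

# load a small dataset with non-negative features (chi2 requires non-negativity)
X, y = load_iris(return_X_y=True)

# keep features whose chi2 test p-value is below alpha
selector = SelectFpr(score_func=chi2, alpha=0.01)
selector.fit(X, y)

# the selector's mask matches a manual p-value comparison
scores, pvalues = chi2(X, y)
print(selector.get_support())
print(pvalues < 0.01)
```

Both printed masks are identical, which makes the role of alpha concrete: it is simply a p-value cutoff applied per feature.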

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, chi2
import numpy as np

# generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100, n_features=30, n_informative=2, n_classes=2, random_state=1)
# chi2 requires non-negative feature values, so take absolute values
X = np.abs(X)

# summarize the dataset before feature selection
print('Original shape:', X.shape)

# configure to select features based on false positive rate
selector = SelectFpr(score_func=chi2, alpha=0.05)

# fit the model and transform the dataset
X_new = selector.fit_transform(X, y)

# summarize the dataset after feature selection
print('Transformed shape:', X_new.shape)

Running the example gives an output like:

Original shape: (100, 30)
UserWarning: No features were selected: either the data is too noisy or the selection test too strict.
Transformed shape: (100, 0)
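The warning shows that no feature passed the chi2 test at alpha=0.05 on this data. One way to diagnose this is to inspect selector.pvalues_ or to try a different scoring function; the sketch below (an assumption, not part of the original example) uses f_classif, the ANOVA F-test, which handles signed continuous features directly so the abs() transform is unnecessary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif

# same synthetic dataset as above, without the abs() transform
X, y = make_classification(n_samples=100, n_features=30, n_informative=2,
                           n_classes=2, random_state=1)

# f_classif works on signed continuous features
selector = SelectFpr(score_func=f_classif, alpha=0.05)
X_new = selector.fit_transform(X, y)

print('Selected features:', X_new.shape[1])
print('Smallest p-values:', np.sort(selector.pvalues_)[:5])
```

With an F-test suited to the data, the informative features produce very small p-values and survive the alpha threshold.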

The steps are as follows:

  1. Generate a synthetic binary classification dataset using make_classification(). This creates a dataset with a specified number of samples (n_samples), features (n_features), and a fixed random seed (random_state) for reproducibility. The dataset is summarized before applying feature selection to show the original number of features.

  2. Configure SelectFpr with the chi2 scoring function and an alpha of 0.05. This sets up the feature selector to retain features with a p-value below the specified alpha threshold.

  3. Fit the SelectFpr selector to the dataset and transform it, retaining only the features whose p-value falls below alpha. In this run no feature passes the chi2 test at alpha=0.05, so the transformed dataset has zero columns and scikit-learn emits a warning.

  4. Summarize the transformed dataset by printing its shape, showing the reduced number of features after selection.

This example demonstrates how to use SelectFpr for feature selection in scikit-learn. When features are selected, pruning the rest can improve model performance by reducing noise and complexity in the data; when none are selected, as here, consider relaxing alpha or choosing a score_func better matched to the data.
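In practice, SelectFpr is often placed inside a Pipeline so the p-value test is re-fit on each training fold rather than on the full dataset, avoiding selection leakage. The sketch below (using f_classif and LogisticRegression as illustrative choices, not from the original example) shows the idea:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# same synthetic dataset as above
X, y = make_classification(n_samples=100, n_features=30, n_informative=2,
                           n_classes=2, random_state=1)

# the selector is re-fit on each training fold inside cross-validation
pipe = Pipeline([
    ('select', SelectFpr(score_func=f_classif, alpha=0.05)),
    ('model', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print('Mean accuracy: %.3f' % scores.mean())
```

Fitting the selector per fold keeps the held-out data out of the feature-selection step, so the cross-validated score is an honest estimate.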



See Also