SelectFdr and SelectFpr are both feature selection methods in scikit-learn that rely on statistical tests to determine which features to keep. However, they differ in their approaches to controlling errors, making them suitable for different scenarios.
SelectFdr (False Discovery Rate) controls the expected proportion of false discoveries among the selected features by applying the Benjamini-Hochberg procedure to the p-values produced by score_func. Key hyperparameters are alpha (the highest allowed false discovery rate) and score_func (the statistical test to use, such as f_classif for classification tasks). This method is useful when it is important to limit the proportion of false positives among the selected features.
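For intuition, the cutoff SelectFdr applies can be reproduced by hand with the Benjamini-Hochberg procedure on the same p-values. This is a minimal sketch; the dataset and variable names are illustrative, not part of the main example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, f_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# SelectFdr runs score_func, then applies Benjamini-Hochberg to the p-values
selector = SelectFdr(score_func=f_classif, alpha=0.05).fit(X, y)
print("kept by SelectFdr:", selector.get_support(indices=True))

# Benjamini-Hochberg by hand: keep every p-value at or below the largest
# sorted p-value p_(k) satisfying p_(k) <= alpha * k / n
p = selector.pvalues_
n = len(p)
sorted_p = np.sort(p)
below = sorted_p <= 0.05 * np.arange(1, n + 1) / n
threshold = sorted_p[below].max() if below.any() else -1.0
print("kept by manual BH: ", np.nonzero(p <= threshold)[0])
```

Both printed index sets should match, since SelectFdr keeps every feature whose p-value falls at or below the Benjamini-Hochberg threshold.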
SelectFpr (False Positive Rate) controls the per-feature type I error rate: it keeps every feature whose test p-value is below alpha, so each uninformative feature has at most an alpha chance of being selected. Key hyperparameters are alpha (the highest allowed false positive rate) and score_func. This method is appropriate when the goal is to control the rate of type I errors directly, without adjusting for multiple comparisons.
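SelectFpr, by contrast, involves no multiple-testing correction: its selection mask is simply a per-feature threshold on the p-values. A small sketch with an illustrative dataset (not part of the main example):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# SelectFpr keeps every feature whose test p-value is strictly below alpha
selector = SelectFpr(score_func=f_classif, alpha=0.05).fit(X, y)
kept = selector.get_support()
print("features kept:", kept.sum())

# The same mask follows directly from the raw, unadjusted p-values
manual = selector.pvalues_ < 0.05
print("masks agree:", bool((kept == manual).all()))
```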
The main difference between SelectFdr and SelectFpr is in how they control statistical errors. SelectFdr adjusts for multiple comparisons and limits the expected proportion of false discoveries, making it more conservative; SelectFpr applies the alpha threshold to each feature independently, so at the same alpha it typically selects at least as many features and is therefore more permissive.
SelectFdr is ideal when you need to be very cautious about including false features, often in exploratory analysis where the cost of false discoveries is high. SelectFpr, on the other hand, is useful when you want to ensure a low false positive rate and can tolerate a few false positives among the selected features.
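To see this difference concretely, one can generate data where the informative columns are known (with shuffle=False and no redundant features, make_classification places the informative features in the first columns) and count how many noise features each selector lets through. This is a sketch under those assumptions, separate from the main example below:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, SelectFpr, f_classif

# shuffle=False keeps the 10 informative features in the first 10 columns,
# so any selected index >= 10 is a known false positive
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=0, shuffle=False, random_state=42)

for name, sel in [("SelectFdr", SelectFdr(score_func=f_classif, alpha=0.05)),
                  ("SelectFpr", SelectFpr(score_func=f_classif, alpha=0.05))]:
    idx = sel.fit(X, y).get_support(indices=True)
    false_pos = (idx >= 10).sum()
    print(f"{name}: kept {len(idx)} features, {false_pos} false positives")
```

SelectFpr is expected to admit roughly alpha times the 90 noise features, while SelectFdr's Benjamini-Hochberg adjustment typically keeps fewer of them.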
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFdr, SelectFpr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic dataset with many features
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Apply SelectFdr for feature selection
select_fdr = SelectFdr(score_func=f_classif, alpha=0.05)
X_train_fdr = select_fdr.fit_transform(X_train, y_train)
X_test_fdr = select_fdr.transform(X_test)
# Evaluate with LogisticRegression
lr_fdr = LogisticRegression(random_state=42)
lr_fdr.fit(X_train_fdr, y_train)
y_pred_fdr = lr_fdr.predict(X_test_fdr)
print(f"SelectFdr accuracy: {accuracy_score(y_test, y_pred_fdr):.3f}")
print(f"SelectFdr F1 score: {f1_score(y_test, y_pred_fdr):.3f}")
# Apply SelectFpr for feature selection
select_fpr = SelectFpr(score_func=f_classif, alpha=0.05)
X_train_fpr = select_fpr.fit_transform(X_train, y_train)
X_test_fpr = select_fpr.transform(X_test)
# Evaluate with LogisticRegression
lr_fpr = LogisticRegression(random_state=42)
lr_fpr.fit(X_train_fpr, y_train)
y_pred_fpr = lr_fpr.predict(X_test_fpr)
print(f"\nSelectFpr accuracy: {accuracy_score(y_test, y_pred_fpr):.3f}")
print(f"SelectFpr F1 score: {f1_score(y_test, y_pred_fpr):.3f}")
Running the example gives an output like:
SelectFdr accuracy: 0.825
SelectFdr F1 score: 0.828
SelectFpr accuracy: 0.820
SelectFpr F1 score: 0.825
The steps are as follows:
- Generate a synthetic classification dataset with many features using make_classification.
- Split the data into training and testing sets using train_test_split.
- Apply SelectFdr for feature selection:
  - Use SelectFdr with f_classif and alpha=0.05.
  - Fit the selector to the training data.
  - Transform the training and testing data using the fitted selector.
  - Evaluate the model performance using LogisticRegression and print the results.
- Apply SelectFpr for feature selection:
  - Use SelectFpr with f_classif and alpha=0.05.
  - Fit the selector to the training data.
  - Transform the training and testing data using the fitted selector.
  - Evaluate the model performance using LogisticRegression and print the results.
- Compare the results of SelectFdr and SelectFpr:
  - Discuss the differences in the number and type of features selected.
  - Compare the model performance metrics (accuracy, F1 score, etc.).
  - Highlight scenarios where one method might be preferred over the other.
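One caveat when evaluating either selector: fitting it once on the full training set and then cross-validating only the classifier leaks information from the held-out folds into the feature selection. Wrapping selector and model in a Pipeline refits the selection inside every fold. A sketch of this variation on the example above (max_iter=1000 is added here only to avoid convergence warnings):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFdr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=42)

# The Pipeline refits SelectFdr on each training fold, so feature
# selection never sees the fold being evaluated
pipe = Pipeline([
    ("select", SelectFdr(score_func=f_classif, alpha=0.05)),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```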