`SelectFdr` and `SelectFpr` are both univariate feature selection methods in scikit-learn that rely on statistical tests to determine which features to keep. However, they differ in which error rate they control, making them suitable for different scenarios.
`SelectFdr` (False Discovery Rate) controls the expected proportion of false discoveries among the selected features, using the Benjamini-Hochberg procedure. Key hyperparameters include `alpha` (the highest allowed false discovery rate) and `score_func` (the statistical test to use, such as `f_classif` for classification tasks). This method is useful when it is important to limit the proportion of false positives among the selected features.
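To make the selection rule concrete, here is a minimal sketch of the Benjamini-Hochberg thresholding that `SelectFdr` applies to the p-values produced by `score_func`. The helper name `benjamini_hochberg_mask` is illustrative only, not part of scikit-learn:

```python
import numpy as np

def benjamini_hochberg_mask(pvalues, alpha=0.05):
    """Boolean mask of features kept under the Benjamini-Hochberg rule
    (a sketch of the thresholding SelectFdr performs with its alpha)."""
    pvalues = np.asarray(pvalues)
    m = len(pvalues)
    ranked = np.sort(pvalues)
    # Largest rank k with p_(k) <= (k / m) * alpha defines the cutoff
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    if not below.any():
        return np.zeros(m, dtype=bool)  # nothing survives: keep no features
    threshold = ranked[below].max()
    return pvalues <= threshold
```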
`SelectFpr` (False Positive Rate) controls the per-feature type I error rate: it keeps every feature whose test p-value falls below `alpha`, so each truly uninformative feature has at most an `alpha` chance of being selected. Key hyperparameters include `alpha` (the highest allowed false positive rate) and `score_func`. This method is appropriate when the goal is to control the rate of type I errors directly, with no multiple-testing correction.
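Because no multiple-testing correction is involved, `SelectFpr`'s selection can be reproduced directly from the fitted selector's `pvalues_` attribute. A minimal sketch, using a small synthetic dataset purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFpr, f_classif

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

fpr = SelectFpr(score_func=f_classif, alpha=0.05).fit(X, y)
# SelectFpr keeps exactly the features whose test p-value falls below alpha
manual_mask = fpr.pvalues_ < 0.05
assert np.array_equal(manual_mask, fpr.get_support())
print(f"Kept {manual_mask.sum()} of {X.shape[1]} features")
```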
The main difference between `SelectFdr` and `SelectFpr` is which statistical error they control. `SelectFdr` bounds the expected proportion of false discoveries among the selected features, which makes it more conservative, while `SelectFpr` applies the `alpha` threshold to each feature's p-value independently, which can be considerably more permissive when many features are tested.
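The difference is easiest to see on pure noise, where every selected feature is a false positive. In a sketch like the following, `SelectFpr` keeps roughly a fraction `alpha` of the features while `SelectFdr` usually keeps none (exact counts vary with the random seed):

```python
import numpy as np
from sklearn.feature_selection import SelectFdr, SelectFpr, f_classif

rng = np.random.RandomState(0)
X_noise = rng.randn(500, 1000)    # 1000 features of pure noise
y = rng.randint(0, 2, size=500)   # labels unrelated to the features

n_fpr = SelectFpr(score_func=f_classif, alpha=0.05).fit(X_noise, y).get_support().sum()
n_fdr = SelectFdr(score_func=f_classif, alpha=0.05).fit(X_noise, y).get_support().sum()

print(f"SelectFpr kept {n_fpr} noise features")  # on the order of 50 (5% of 1000)
print(f"SelectFdr kept {n_fdr} noise features")  # typically 0
```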
`SelectFdr` is ideal when you need to be cautious about admitting spurious features, for example in exploratory analysis where the cost of false discoveries is high. `SelectFpr`, on the other hand, is useful when you want a guaranteed per-feature false positive rate and can tolerate a few false positives among the selected features.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFdr, SelectFpr, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic dataset with many features
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Apply SelectFdr for feature selection
select_fdr = SelectFdr(score_func=f_classif, alpha=0.05)
X_train_fdr = select_fdr.fit_transform(X_train, y_train)
X_test_fdr = select_fdr.transform(X_test)
# Evaluate with LogisticRegression
lr_fdr = LogisticRegression(random_state=42)
lr_fdr.fit(X_train_fdr, y_train)
y_pred_fdr = lr_fdr.predict(X_test_fdr)
print(f"SelectFdr accuracy: {accuracy_score(y_test, y_pred_fdr):.3f}")
print(f"SelectFdr F1 score: {f1_score(y_test, y_pred_fdr):.3f}")
# Apply SelectFpr for feature selection
select_fpr = SelectFpr(score_func=f_classif, alpha=0.05)
X_train_fpr = select_fpr.fit_transform(X_train, y_train)
X_test_fpr = select_fpr.transform(X_test)
# Evaluate with LogisticRegression
lr_fpr = LogisticRegression(random_state=42)
lr_fpr.fit(X_train_fpr, y_train)
y_pred_fpr = lr_fpr.predict(X_test_fpr)
print(f"\nSelectFpr accuracy: {accuracy_score(y_test, y_pred_fpr):.3f}")
print(f"SelectFpr F1 score: {f1_score(y_test, y_pred_fpr):.3f}")
Running the example gives an output like:
```
SelectFdr accuracy: 0.825
SelectFdr F1 score: 0.828

SelectFpr accuracy: 0.820
SelectFpr F1 score: 0.825
```
The steps are as follows:
- Generate a synthetic classification dataset with many features using `make_classification`.
- Split the data into training and testing sets using `train_test_split`.
- Apply `SelectFdr` for feature selection:
  - Use `SelectFdr` with `f_classif` and `alpha=0.05`.
  - Fit the selector to the training data.
  - Transform the training and testing data using the fitted selector.
  - Evaluate the model performance using `LogisticRegression` and print the results.
- Apply `SelectFpr` for feature selection:
  - Use `SelectFpr` with `f_classif` and `alpha=0.05`.
  - Fit the selector to the training data.
  - Transform the training and testing data using the fitted selector.
  - Evaluate the model performance using `LogisticRegression` and print the results.
- Compare the results of `SelectFdr` and `SelectFpr`:
  - Discuss the differences in the number and type of features selected (see the sketch after this list).
  - Compare the model performance metrics (accuracy, F1 score).
  - Highlight scenarios where one method might be preferred over the other.
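To make the comparison concrete, you can append a couple of lines to the script above (reusing the fitted `select_fdr` and `select_fpr` objects) to report how many of the 100 features each method kept:

```python
# Appended to the script above; reuses the fitted selectors and X_train
print(f"\nSelectFdr kept {select_fdr.get_support().sum()} of {X_train.shape[1]} features")
print(f"SelectFpr kept {select_fpr.get_support().sum()} of {X_train.shape[1]} features")
```

With signal as strong as in this synthetic dataset, the two methods often keep similar feature sets; the gap tends to widen as the signal weakens or the number of candidate features grows.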