
Scikit-Learn SelectFwe for Feature Selection

SelectFwe is a univariate feature selection method that selects features based on a statistical test. It controls the family-wise error rate (FWER), the probability of making one or more false discoveries, i.e. incorrectly rejecting at least one true null hypothesis.

The key hyperparameters of SelectFwe include the score_func (statistical test function) and alpha (desired family-wise error rate).

The algorithm is appropriate for both classification and regression problems where feature selection is needed.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.model_selection import train_test_split

# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# summarize dataset shape before feature selection
print('Shape before feature selection:', X_train.shape)

# configure the selector to control the family-wise error rate at alpha
selector = SelectFwe(score_func=f_classif, alpha=0.05)

# fit on the training data
X_train_selected = selector.fit_transform(X_train, y_train)

# summarize dataset shape after feature selection
print('Shape after feature selection:', X_train_selected.shape)

Running the example gives an output like:

Shape before feature selection: (80, 10)
Shape after feature selection: (80, 2)
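After fitting, the selector exposes which features survived the test. A short sketch of how you might inspect the result (refitting on the same synthetic dataset for self-containment; `get_support()` and `pvalues_` are the standard SelectFwe attributes):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFwe, f_classif

# recreate the dataset and fit the selector as above
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
selector = SelectFwe(score_func=f_classif, alpha=0.05)
selector.fit(X, y)

# boolean mask of retained features, their indices, and per-feature p-values
mask = selector.get_support()
print('Selected feature indices:', selector.get_support(indices=True))
print('Per-feature p-values:', selector.pvalues_.round(4))
```

Inspecting `pvalues_` alongside the mask makes it clear that only features with very small p-values are kept.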

The steps are as follows:

  1. First, a synthetic binary classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), classes (n_classes), and a fixed random seed (random_state) for reproducibility. The dataset is split into training and test sets using train_test_split().

  2. The shape of the dataset before feature selection is printed to provide a baseline for comparison.

  3. Next, a SelectFwe selector is instantiated with f_classif as the scoring function and an alpha value of 0.05. This configuration sets up the feature selection method to control the family-wise error rate at 5%.

  4. The selector is then fit to the training data with the fit_transform() method, which computes the per-feature test statistics and p-values and, in the same step, reduces the dataset to the selected features.

  5. Finally, the shape of the dataset after feature selection is printed to show the reduction in the number of features.
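To see what "controlling the family-wise error rate" means concretely: in current scikit-learn versions, SelectFwe applies a Bonferroni-style rule, keeping features whose p-value is below alpha divided by the number of features. A minimal sketch verifying this against the fitted selector (the exact rule is an implementation detail and could change between versions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFwe, f_classif

X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
selector = SelectFwe(score_func=f_classif, alpha=0.05).fit(X, y)

# Bonferroni-style rule: keep features with p-value < alpha / n_features
manual_mask = selector.pvalues_ < 0.05 / X.shape[1]
print('Manual mask matches get_support():',
      np.array_equal(manual_mask, selector.get_support()))
```

This is why SelectFwe is more conservative than SelectFdr or SelectKBest: the per-feature threshold shrinks as the number of features grows.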

This example demonstrates how to use SelectFwe for feature selection by controlling the family-wise error rate, reducing the number of features in the dataset based on statistical tests. Because the test is univariate, the method is best suited to discarding features that are individually irrelevant to the target; it cannot detect redundancy between features.
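As noted above, SelectFwe also works for regression by swapping in a regression scoring function. A hedged sketch using f_regression on a synthetic regression dataset (the specific make_regression parameters here are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFwe, f_regression

# regression dataset with a few informative features among ten
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=0.1, random_state=1)

# same selector, regression-appropriate score function
selector = SelectFwe(score_func=f_regression, alpha=0.05)
X_selected = selector.fit_transform(X, y)

print('Shape before feature selection:', X.shape)
print('Shape after feature selection:', X_selected.shape)
```

The only change from the classification example is the score_func; the alpha semantics are identical.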



See Also