
Scikit-Learn f_classif() for Feature Selection

The f_classif function is a scoring function for feature selection in scikit-learn. It computes the ANOVA F-statistic between each feature and the target variable, returning the F-statistics along with their p-values; higher F-scores indicate more informative features.

This function is useful for selecting a subset of relevant features before training a model, especially when dealing with high-dimensional datasets. It is appropriate for classification problems.
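Before walking through the full example, it helps to see the shape of what f_classif returns: a pair of arrays (F-statistics and p-values), one entry per feature. A minimal sketch using the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import f_classif

X, y = load_iris(return_X_y=True)

# f_classif returns a tuple: (F-statistics, p-values), one value per feature
F, p = f_classif(X, y)
print(F.shape, p.shape)  # (4,) (4,)
```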

from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# generate synthetic classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, n_redundant=5, random_state=1)

# calculate ANOVA F-statistics and p-values for each feature
scores = f_classif(X, y)
print(scores)

# select top k features
k = 10
selector = SelectKBest(f_classif, k=k)
X_selected = selector.fit_transform(X, y)

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=1)

# create and fit logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# evaluate model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

# make a prediction on a new row of 10 (selected) feature values
row = [[0.2, 0.8, -0.5, 1.0, -0.1, 0.3, -0.8, 0.7, -0.3, 0.6]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])

Running the example gives an output like:

(array([2.31990677e+00, 1.28485946e+02, 1.70040267e-02, 6.36326260e+00,
       4.03160910e+01, 1.35065816e-04, 8.20419935e+01, 1.53545904e+02,
       2.49952753e-01, 1.15432085e+02, 3.96419080e-01, 5.44730385e+01,
       1.26585336e+02, 3.10783653e+02, 1.10047025e+00, 5.38738843e-01,
       4.99731196e-02, 2.99454871e+00, 2.07535565e+00, 2.89558697e-02]), array([1.28044469e-01, 4.22053646e-28, 8.96276650e-01, 1.18052410e-02,
       3.27501316e-10, 9.90729691e-01, 6.85640635e-19, 6.66504434e-33,
       6.17218559e-01, 1.48489352e-25, 5.29088519e-01, 3.31527814e-13,
       9.86587901e-28, 9.16567148e-61, 2.94416895e-01, 4.63128259e-01,
       8.23155804e-01, 8.38543046e-02, 1.50008685e-01, 8.64915459e-01]))
Accuracy: 0.815
Predicted: 0

The steps are as follows:

  1. First, the f_classif function is imported from sklearn.feature_selection, along with the SelectKBest class for selecting the top features, and a synthetic classification dataset is generated using make_classification.

  2. The f_classif function is then used to calculate the ANOVA F-statistics between each feature and the target variable. It returns a tuple of F-statistics and p-values, which are printed to provide insight into the relative importance of each feature.

  3. The SelectKBest class is used to select the top k features based on the f_classif scores. The data is then transformed to include only these selected features.

  4. The transformed data is split into training and test sets, and a logistic regression model is created and fit on the training data.

  5. The model’s performance is evaluated on the test set using accuracy score, and a prediction is made on a new data sample to demonstrate how the model can be used for new, unseen data.
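The steps above can be cross-checked directly. To see which columns SelectKBest keeps, and to confirm that f_classif scores are ordinary one-way ANOVA F-statistics, a short check (using scipy.stats.f_oneway, which is not part of the original listing):

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# same synthetic dataset as in the example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=1)

# indices of the 10 features retained by SelectKBest
selector = SelectKBest(f_classif, k=10).fit(X, y)
print(selector.get_support(indices=True))

# f_classif is a per-feature one-way ANOVA across the class groups,
# so scipy's f_oneway on any single column should agree
F, _ = f_classif(X, y)
F0, _ = f_oneway(X[y == 0, 0], X[y == 1, 0])
print(np.isclose(F[0], F0))  # True
```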

This example showcases how the f_classif function can be used for feature selection prior to model training. By selecting a subset of informative features, the dimensionality of the problem can be reduced, potentially leading to improved model performance and faster training times.
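Note that the example above fits the selector on the full dataset before splitting, so the test rows influence which features are chosen. In practice, wrapping SelectKBest and the classifier in a Pipeline keeps feature selection fit on the training data only; a sketch of that variant (the k=10 and max_iter values are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)

# selection and classification are fit together on the training split only
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=10)),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print('Accuracy: %.3f' % accuracy_score(y_test, pipe.predict(X_test)))
```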



See Also