
Scikit-Learn f_regression() for Feature Selection

The f_regression() function from sklearn.feature_selection scores each feature in a regression problem by computing its correlation with the target variable and converting it to a univariate F-statistic.

It returns F-scores and p-values that can be used to identify the most informative features.

This example demonstrates how to use f_regression() for feature selection on a synthetic regression dataset.

from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# generate regression dataset
X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=1)

# compute f-scores and p-values
f_scores, p_values = f_regression(X, y)

# report f-scores and p-values
print(f"F-scores: {f_scores}")
print(f"p-values: {p_values}")

# select top k features
k = 5
top_k_idx = f_scores.argsort()[-k:][::-1]
X_selected = X[:, top_k_idx]

# report shape before and after selection
print(f"Original shape: {X.shape}")
print(f"Shape after selection: {X_selected.shape}")

Running the example gives an output like:

F-scores: [1.61217693e+00 8.62555675e+00 4.99074096e+00 1.11492668e+02
 8.09988673e-03 6.95950654e+00 1.11498805e-01 2.24689360e+00
 2.04896259e-02 1.20213191e+01]
p-values: [2.07192835e-01 4.13151543e-03 2.77561720e-02 7.43441719e-18
 9.28471400e-01 9.69834515e-03 7.39157646e-01 1.37097428e-01
 8.86471960e-01 7.82576808e-04]
Original shape: (100, 10)
Shape after selection: (100, 5)

The steps are as follows:

  1. First, a synthetic regression dataset is generated using make_regression(). The dataset has 100 samples, 10 features, and 5 informative features. The random_state is set for reproducibility.

  2. The f_regression() function is used to compute the F-scores and p-values for each feature. The F-score measures the strength of the linear relationship between a feature and the target, while the p-value estimates the probability of observing such a relationship by chance; a sketch after this list shows how the F-score is derived from the correlation.

  3. The F-scores and p-values are reported to provide insight into the importance of each feature.

  4. The top k features are selected based on their F-scores. Here, k is set to 5. The indices of the top k features are obtained by sorting the F-scores in descending order and taking the first k indices; an equivalent selection using SelectKBest is sketched after this list.

  5. The shape of the dataset before and after feature selection is reported. This shows the reduction in the number of features while preserving the number of samples.
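As referenced in step 2, the F-score for each feature is derived from its Pearson correlation with the target: with the default center=True, f_regression() computes the correlation r and converts it to F = r^2 / (1 - r^2) * (n - 2). A minimal sketch verifying this for the first feature, reusing X, y, and f_scores from the example above:

import numpy as np

# Pearson correlation between the first feature and the target
r = np.corrcoef(X[:, 0], y)[0, 1]

# convert the correlation to an F-statistic with n - 2 degrees of freedom
n = X.shape[0]
f_manual = r**2 / (1 - r**2) * (n - 2)

# should match f_scores[0] up to floating-point error
print(f"Manual F-score:       {f_manual}")
print(f"f_regression F-score: {f_scores[0]}")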
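As noted in step 4, the manual argsort selection is equivalent to scikit-learn's SelectKBest transformer, which wraps the same scoring-and-ranking logic and can also be placed inside a Pipeline:

from sklearn.feature_selection import SelectKBest, f_regression

# select the 5 features with the highest F-scores
selector = SelectKBest(score_func=f_regression, k=5)
X_kbest = selector.fit_transform(X, y)

# indices of the retained features (matches top_k_idx, up to ordering)
print(f"Selected feature indices: {selector.get_support(indices=True)}")
print(f"Shape after selection: {X_kbest.shape}")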

This example demonstrates how f_regression() can be used to identify the most informative features in a regression problem. By selecting a subset of features with the highest F-scores, the dimensionality of the dataset can be reduced without losing much information. This can lead to simpler models, faster training times, and improved interpretability.
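Because the dataset is synthetic, the selection can be sanity-checked against the ground truth: make_regression() can also return the true coefficients via coef=True, and the nonzero entries mark the truly informative features. An illustrative check, assuming the same random_state reproduces the identical dataset:

import numpy as np
from sklearn.datasets import make_regression

# regenerate the same dataset, additionally returning the true coefficients
X, y, coef = make_regression(n_samples=100, n_features=10, n_informative=5,
                             coef=True, random_state=1)

# nonzero coefficients identify the informative features; these should
# typically coincide with the indices selected by F-score above
print(f"Truly informative features: {coef.nonzero()[0]}")
print(f"Selected by F-score:        {np.sort(top_k_idx)}")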
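To make the closing claim concrete, one illustrative check is to compare a cross-validated linear model on all 10 features against the 5 selected ones (note that in a rigorous evaluation the selection should happen inside the cross-validation, e.g. via a Pipeline, to avoid information leakage):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# mean cross-validated R^2 before and after feature selection
score_all = cross_val_score(LinearRegression(), X, y, cv=5).mean()
score_top = cross_val_score(LinearRegression(), X_selected, y, cv=5).mean()

print(f"Mean R^2, all 10 features: {score_all:.4f}")
print(f"Mean R^2, top 5 features:  {score_top:.4f}")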


