
Scikit-Learn SelectKBest for Feature Selection

SelectKBest is a univariate feature selection method that keeps the top k features according to a scoring function. It is commonly used as a preprocessing step to improve model performance by reducing the number of input variables.

The key parameter of SelectKBest is k, the number of top-scoring features to keep. The choice of scoring function (e.g., f_classif for classification or f_regression for regression) also strongly influences which features are selected.
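A fitted selector exposes the per-feature scores and p-values computed by the scoring function, which can help in choosing a sensible k. A minimal sketch (variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# generate a small synthetic classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# fitting computes an ANOVA F-score and p-value for every feature
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# scores_ and pvalues_ rank all features, not just the k selected ones
for i, (score, p) in enumerate(zip(selector.scores_, selector.pvalues_)):
    print(f'feature {i}: F={score:.2f}, p={p:.4f}')
```

Features with high F-scores (low p-values) are the ones SelectKBest will retain.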

This method is appropriate for both classification and regression problems where feature selection is necessary to improve model performance or reduce overfitting.
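For a regression problem the pattern is the same, only with a regression scoring function. A minimal sketch assuming f_regression as the scorer:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# generate a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=1)

# keep the 3 features most linearly associated with the target
selector = SelectKBest(score_func=f_regression, k=3)
X_new = selector.fit_transform(X, y)

print('Original shape:', X.shape)
print('Reduced shape:', X_new.shape)
```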

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
import pandas as pd

# generate a synthetic dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# display dataset shape before selection
print('Original shape:', X.shape)

# apply SelectKBest feature selection
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

# display dataset shape after selection
print('Reduced shape:', X_new.shape)

# display a few rows of the dataset before and after feature selection
df_original = pd.DataFrame(X)
df_reduced = pd.DataFrame(X_new)
print('Original dataset sample:\n', df_original.head())
print('Reduced dataset sample:\n', df_reduced.head())

Running the example gives an output like:

Original shape: (100, 10)
Reduced shape: (100, 5)
Original dataset sample:
           0         1         2  ...         7         8         9
0  1.033451 -1.958169  0.790061  ... -0.857497  0.585309 -0.146273
1 -0.557495  0.727813 -0.892246  ... -1.943323 -0.701586  0.939169
2  1.044994  1.765335 -0.588964  ... -0.069329 -1.124787  0.608515
3 -0.106460 -0.478672  1.037633  ... -0.456118  1.703480 -2.790996
4 -1.637450  0.081768 -0.921486  ...  0.648644 -0.540548  3.958603

[5 rows x 10 columns]
Reduced dataset sample:
           0         1         2         3         4
0  0.790061 -0.014784  1.060806 -0.857497  0.585309
1 -0.892246 -0.059622 -1.103254 -1.943323 -0.701586
2 -0.588964 -1.283933  0.816958 -0.069329 -1.124787
3  1.037633  1.738814 -0.789718 -0.456118  1.703480
4 -0.921486  0.284574 -1.569173  0.648644 -0.540548

The steps are as follows:

  1. First, a synthetic dataset is generated using the make_classification() function, creating a dataset with a specified number of samples (n_samples), features (n_features), and classes (n_classes). The dataset is displayed to show its original shape.

  2. The SelectKBest method is then applied, selecting the top k features based on the f_classif scoring function. The new dataset, with reduced features, is displayed to show the new shape.

  3. Finally, the first few rows of both the original and reduced datasets are printed to illustrate the effect of feature selection on the data.
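Because fit_transform() returns a plain array, it is often useful to know which of the original columns survived. The fitted selector can report this through get_support(); a short sketch repeating the setup from the example above:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# same synthetic dataset as in the main example
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

# get_support(indices=True) returns the column indices of the retained features
kept = selector.get_support(indices=True)
print('Selected feature indices:', kept)
```

These indices can be used, for example, to label the columns of the reduced DataFrame with their original feature names.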

This example demonstrates how to use SelectKBest for feature selection, showing how it reduces the number of features while retaining the most relevant ones, which can be beneficial for improving model performance and reducing overfitting.
