
Scikit-Learn SelectPercentile for Feature Selection

SelectPercentile is a feature selection method that scores each feature independently against the target with a univariate statistical test and keeps only the highest-scoring percentage of features. This reduces the dimensionality of the dataset by discarding the features that score lowest on that test.
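To make the idea concrete, the following is a minimal sketch of univariate scoring done by hand. It is not how SelectPercentile is implemented internally; it only assumes that "top features" means the features with the largest F-statistics from f_classif, and the variable names (f_scores, top_idx, X_top) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif

# same synthetic dataset as the main example
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# score every feature on its own against the target (ANOVA F-test)
f_scores, p_values = f_classif(X, y)

# keep the half of the features with the largest F-statistics,
# preserving the original column order
n_keep = X.shape[1] // 2
top_idx = np.sort(np.argsort(f_scores)[-n_keep:])
X_top = X[:, top_idx]

print("F-scores:", np.round(f_scores, 2))
print("Kept column indices:", top_idx)
print("Reduced shape:", X_top.shape)
```

Each feature is judged in isolation, so interactions between features are not taken into account by this kind of test.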

Its key hyperparameters are score_func (the statistical test used to score each feature, f_classif by default) and percentile (the percentage of features to keep, 10 by default).
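As a rough illustration of how percentile controls the number of retained features, the sketch below fits the selector with a few values on a 10-feature dataset. The specific percentile values and the kept dictionary are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# record how many features survive for each percentile setting
kept = {}
for pct in (20, 50, 100):
    selector = SelectPercentile(score_func=f_classif, percentile=pct)
    X_new = selector.fit_transform(X, y)
    kept[pct] = X_new.shape[1]
    print(f"percentile={pct}: {X_new.shape[1]} of {X.shape[1]} features kept")
```

In practice percentile is often treated as a tunable hyperparameter and chosen by cross-validation rather than fixed up front.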

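This method is suitable for both classification and regression problems, provided score_func matches the task: for example, f_classif, chi2, or mutual_info_classif for classification, and f_regression or mutual_info_regression for regression.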
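For a regression target, a sketch along the following lines should work, swapping in f_regression as the scoring function. The dataset parameters (n_informative=3, noise=0.1) and the choice of percentile=30 are assumptions made only for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression

# synthetic regression dataset; only 3 of the 10 features drive the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=0.1, random_state=1)

# f_regression scores each feature with a univariate linear F-test
selector = SelectPercentile(score_func=f_regression, percentile=30)
X_new = selector.fit_transform(X, y)

print("Original shape:", X.shape)
print("Reduced shape:", X_new.shape)
```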

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# generate a synthetic binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# print the shape of the dataset before feature selection
print("Original shape:", X.shape)

# configure the selector to keep the top 50% of features
selector = SelectPercentile(score_func=f_classif, percentile=50)

# fit the selector and transform the dataset in one step
X_new = selector.fit_transform(X, y)

# print the shape of the dataset after feature selection
print("Reduced shape:", X_new.shape)

# print the selected features (first 5 samples)
print("Selected features (first 5 samples):\n", X_new[:5])

Running the example gives an output like:

Original shape: (100, 10)
Reduced shape: (100, 5)
Selected features (first 5 samples):
 [[ 0.79006105 -0.01478415  1.06080576 -0.85749682  0.58530898]
 [-0.89224592 -0.05962247 -1.10325445 -1.94332341 -0.70158632]
 [-0.58896393 -1.28393266  0.81695766 -0.0693287  -1.12478707]
 [ 1.03763331  1.73881441 -0.78971776 -0.45611756  1.70347973]
 [-0.92148585  0.28457414 -1.5691733   0.64864364 -0.54054806]]

The steps are as follows:

  1. Generate a synthetic binary classification dataset with 10 features using make_classification().
  2. Print the shape of the original dataset to show the number of features before feature selection.
  3. Configure SelectPercentile to keep the top 50% of features using f_classif as the scoring function.
  4. Fit SelectPercentile on the dataset and transform it to retain only the selected features.
  5. Print the shape of the reduced dataset to show the number of features after feature selection.
  6. Display the selected features for the first 5 samples to illustrate the result of the feature selection process.
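The transformed array alone does not show which of the original columns survived. One way to inspect that, sketched here with the same dataset and settings as the example above, is to use the fitted selector's get_support() method together with its scores_ and pvalues_ attributes.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
selector = SelectPercentile(score_func=f_classif, percentile=50).fit(X, y)

# indices of the original columns that were kept
selected = selector.get_support(indices=True)
print("Selected column indices:", selected)

# F-statistic and p-value for each retained feature
for idx in selected:
    print(f"feature {idx}: F={selector.scores_[idx]:.2f}, p={selector.pvalues_[idx]:.4f}")
```

Called without arguments, get_support() returns a boolean mask over the original columns instead of integer indices.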

