Scikit-Learn RFECV for Feature Selection

Feature selection can improve model performance and reduce overfitting by removing irrelevant or redundant inputs.

RFECV (Recursive Feature Elimination with Cross-Validation) automates this by using cross-validation to choose how many features to keep.

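RFECV repeatedly fits the estimator, removes the least important features (ranked by the estimator's coef_ or feature_importances_), and uses cross-validation to score each candidate number of features, keeping the number that scores best.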
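The idea can be sketched by hand with RFE and cross_val_score. This is a simplified illustration, not how RFECV is implemented internally: it reruns the elimination for every candidate feature count, which is slower, and the names candidate, mean_scores, and best_k are made up for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)

mean_scores = {}
for k in range(1, X.shape[1] + 1):
    # RFE drops the weakest feature one at a time until k remain.
    # Wrapping it in a pipeline keeps the selection inside each CV fold.
    candidate = make_pipeline(
        RFE(LogisticRegression(max_iter=1000), n_features_to_select=k, step=1),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(candidate, X, y, cv=5, scoring="accuracy")
    mean_scores[k] = scores.mean()

# Keep the feature count with the highest mean cross-validated accuracy
# (ties go to the smaller count because of dict insertion order).
best_k = max(mean_scores, key=mean_scores.get)
print("Best number of features:", best_k)
```

RFECV produces the same kind of result with a single fit() call and also exposes the final feature mask.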

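Key hyperparameters include estimator (the model whose coef_ or feature_importances_ is used to rank the features), step (the number of features to remove at each iteration, or the fraction of features to remove if given as a float between 0 and 1), and cv (the cross-validation splitting strategy).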

This method is appropriate for both classification and regression problems.
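For a regression problem, the same API might be used as in the sketch below. The choices of LinearRegression, a shuffled KFold, the "r2" scorer, step=2, and min_features_to_select=2 are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data: 10 features, 4 of which carry signal
X_reg, y_reg = make_regression(n_samples=200, n_features=10, n_informative=4,
                               noise=5.0, random_state=1)

# An explicit splitter can be passed as cv instead of an integer
cv = KFold(n_splits=5, shuffle=True, random_state=1)

rfecv_reg = RFECV(
    estimator=LinearRegression(),  # exposes coef_, which RFECV uses for ranking
    step=2,                        # remove two features per iteration
    cv=cv,
    scoring="r2",
    min_features_to_select=2,      # never go below two features
)
rfecv_reg.fit(X_reg, y_reg)

print("Optimal number of features:", rfecv_reg.n_features_)
print("Selected features:", rfecv_reg.support_)
```

A larger step shortens the search at the cost of a coarser set of candidate feature counts.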

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Generate a synthetic dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Create the RFECV object with a LogisticRegression estimator
model = LogisticRegression()
rfecv = RFECV(estimator=model, step=1, cv=5, scoring='accuracy')

# Fit the model
rfecv.fit(X_train, y_train)

# Transform the dataset
X_train_transformed = rfecv.transform(X_train)
X_test_transformed = rfecv.transform(X_test)

# Print the selected features
print("Optimal number of features: %d" % rfecv.n_features_)
print("Selected features: %s" % rfecv.support_)

Running the example gives an output like:

Optimal number of features: 4
Selected features: [False False  True  True False  True False  True False False]

The steps are as follows:

1. Generate a synthetic dataset using make_classification() with 10 features, 5 informative and 5 redundant, and split it into training and testing sets using train_test_split().
2. Instantiate an RFECV object with a LogisticRegression estimator, specifying the step size, cross-validation strategy, and scoring metric, then fit it on the training data to select the optimal features.
3. Transform the training and testing sets using the fitted RFECV object so that only the selected columns remain.
4. Print the optimal number of features and the boolean mask marking which features were selected.

This example demonstrates how to use RFECV for feature selection in scikit-learn by fitting the selector, transforming the data, and inspecting which features were kept. Whether the reduced feature set actually improves model performance should be checked on held-out data.
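One way to check the effect of the selection is to compare a model trained on all features with a pipeline that applies RFECV before the final classifier. This is a minimal sketch, and the step names "select" and "clf" are arbitrary. With only 20 test samples the two accuracies are noisy, so the comparison illustrates the workflow and does not show that selection helps.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)

# Baseline: logistic regression on all 10 features
baseline = LogisticRegression().fit(X_train, y_train)

# Pipeline: RFECV picks the features, then a fresh classifier is trained on them
pipe = Pipeline([
    ("select", RFECV(estimator=LogisticRegression(), step=1, cv=5,
                     scoring="accuracy")),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
selected_acc = pipe.score(X_test, y_test)
print(f"All features:      {baseline_acc:.2f}")
print(f"Selected features: {selected_acc:.2f}")
```

Keeping the selector inside the pipeline means the test set is never seen during feature selection, and the same selected columns are applied automatically at prediction time.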

See Also