
Scikit-Learn RFE for Feature Selection

RFE (Recursive Feature Elimination) is a feature selection method that selects a subset of relevant features by recursively fitting a model and removing the least important feature(s) at each step until the desired number remains. Reducing the feature set this way can improve model performance and reduce overfitting.
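As a rough sketch of that elimination loop (not the scikit-learn implementation itself; the synthetic dataset and the target of 5 remaining features are illustrative assumptions), the idea looks like this:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy data and an arbitrary target of 5 remaining features (illustrative assumptions)
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)
remaining = list(range(X.shape[1]))

while len(remaining) > 5:
    # fit the model on the features still in play
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # rank features by the magnitude of their coefficients
    importance = np.abs(model.coef_).ravel()
    # drop the single least important feature and repeat
    remaining.remove(remaining[int(np.argmin(importance))])

print("Features kept:", remaining)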

RFE ranks features by the importance the underlying model assigns to them, taken from its coefficients (coef_) or feature importances (feature_importances_). It is commonly used with estimators such as Support Vector Machines (SVM), Logistic Regression, and tree-based models.
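For instance, a linear-kernel SVM exposes coef_ after fitting, so it can serve as the base estimator in the same way. The sketch below assumes the same kind of synthetic data as the main example:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)

# RFE wrapping a linear-kernel SVM instead of logistic regression
svm_rfe = RFE(SVC(kernel="linear"), n_features_to_select=5)
svm_rfe.fit(X, y)
print("SVM-based selection:", svm_rfe.support_)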

RFE is suitable for classification and regression problems, where feature selection can significantly impact model performance.
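For a regression problem the usage is the same; here is a minimal sketch, assuming a synthetic make_regression dataset and a LinearRegression base model:

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=10, n_informative=5, random_state=1)

# select 5 of the 10 features using the regression coefficients
reg_rfe = RFE(LinearRegression(), n_features_to_select=5)
X_selected = reg_rfe.fit_transform(X, y)
print("Shape after RFE:", X_selected.shape)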

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pandas as pd

# generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=2, random_state=1)

# convert to DataFrame for better visualization
X_df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(10)])

# print dataset shape before transformation
print("Dataset shape before RFE:", X_df.shape)

# create model
model = LogisticRegression()

# create RFE model
rfe = RFE(model, n_features_to_select=5)

# fit RFE
rfe.fit(X, y)

# transform dataset
X_transformed = rfe.transform(X)

# convert to DataFrame for better visualization
X_transformed_df = pd.DataFrame(X_transformed, columns=[f'Selected_Feature_{i}' for i in range(5)])

# print dataset shape after transformation
print("Dataset shape after RFE:", X_transformed_df.shape)

# summarize the selected features
print("Selected features:", rfe.support_)
print("Feature ranking:", rfe.ranking_)

Running the example gives an output like:

Dataset shape before RFE: (100, 10)
Dataset shape after RFE: (100, 5)
Selected features: [False False  True  True False  True False  True False  True]
Feature ranking: [2 6 1 1 5 1 3 1 4 1]

The steps are as follows:

  1. First, a synthetic classification dataset is generated using make_classification() with 100 samples and 10 features (5 informative and 2 redundant). The dataset is converted to a DataFrame for better visualization.

  2. The shape of the dataset before applying RFE is printed to show the initial number of features.

  3. Next, a LogisticRegression model is created as the base model for RFE.

  4. An RFE model is created with the logistic regression model as the estimator and configured via n_features_to_select=5 to keep the top 5 features.

  5. The RFE model is fit on the dataset using the fit() method, and the dataset is transformed to retain only the selected features.

  6. The transformed dataset is converted to a DataFrame for better visualization, and its shape is printed to show the reduced number of features.

  7. Finally, the selected features are summarized by printing the support mask (rfe.support_) and the feature ranking (rfe.ranking_), where selected features are marked True and assigned a rank of 1. A short sketch after this list shows how to map the mask back to the original column names.
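As a small follow-up to step 7 (assuming the X_df and rfe objects from the example above are still in scope), the boolean mask can be mapped back to the original column names:

# map the support mask back to the DataFrame's column names
selected_columns = X_df.columns[rfe.support_]
print("Selected columns:", list(selected_columns))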

This example demonstrates how to use RFE for feature selection, highlighting its ability to simplify models by retaining only the most important features, which can lead to improved model performance and reduced overfitting.
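One way to check that claim on this dataset (a sketch, not part of the example above) is to compare cross-validated accuracy with and without the RFE step, wrapping the selection in a Pipeline so it is refit inside each fold:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=2, random_state=1)

baseline = LogisticRegression(max_iter=1000)
with_rfe = Pipeline([
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

print("Baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("With RFE accuracy:", cross_val_score(with_rfe, X, y, cv=5).mean())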


