RFE (Recursive Feature Elimination) and RFECV (Recursive Feature Elimination with Cross-Validation) are powerful techniques for feature selection in machine learning.
RFE works by recursively fitting a model, ranking the features by importance, and removing the weakest ones until the desired number remains, which reveals the features that contribute most to the model’s performance. Key hyperparameters of RFE include `estimator` (the model to fit), `n_features_to_select` (the number of features to keep), and `step` (the number, or fraction, of features to remove at each iteration).
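As a quick sketch of these hyperparameters in action (the dataset here is a made-up toy example), a fitted RFE exposes which features survived through its `support_` and `ranking_` attributes:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Toy dataset: 10 features, 5 of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=0)
# Keep 5 features, removing one feature per iteration (step=1)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask: True for the 5 selected features
print(rfe.ranking_)   # selected features get rank 1; higher ranks were eliminated earlier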
RFECV extends RFE by using cross-validation to find the optimal number of features. Its key hyperparameters include `estimator`, `step`, and `cv` (the number of cross-validation folds).
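A minimal sketch of those hyperparameters, assuming scikit-learn 1.0+ for the `cv_results_` attribute (the dataset and parameter values are illustrative):
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=0)
# 5-fold CV decides how many features to keep; never go below 3
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5, min_features_to_select=3)
rfecv.fit(X, y)
print(rfecv.n_features_)                     # number of features CV found optimal
print(rfecv.cv_results_["mean_test_score"])  # mean CV score for each candidate feature count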
The main difference is that RFE requires you to specify the number of features to select up front, whereas RFECV determines the optimal number automatically using cross-validation. RFE is therefore faster; RFECV repeats the elimination procedure on every fold, making it more computationally intensive but fully automated. The complete example below compares both.
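A rough way to see the cost difference is to time both fits on the same data; this is an illustrative sketch, and absolute times will vary by machine:
import time
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=500, n_features=20, n_informative=10, random_state=0)
# Single elimination pass down to 10 features
start = time.perf_counter()
RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
print(f"RFE fit:   {time.perf_counter() - start:.2f}s")
# Elimination pass repeated on each of 5 folds, plus a final refit
start = time.perf_counter()
RFECV(LogisticRegression(max_iter=1000), cv=5).fit(X, y)
print(f"RFECV fit: {time.perf_counter() - start:.2f}s")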
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE, RFECV
from sklearn.metrics import accuracy_score
# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate RFE with LogisticRegression
estimator = LogisticRegression(random_state=42)
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe.fit(X_train, y_train)
y_pred_rfe = rfe.predict(X_test)
print(f"RFE accuracy: {accuracy_score(y_test, y_pred_rfe):.3f}")
# Fit and evaluate RFECV with LogisticRegression
rfecv = RFECV(estimator, step=1, cv=5)
rfecv.fit(X_train, y_train)
y_pred_rfecv = rfecv.predict(X_test)
print(f"RFECV accuracy: {accuracy_score(y_test, y_pred_rfecv):.3f}")
print(f"Optimal number of features: {rfecv.n_features_}")
Running the example gives an output like:
RFE accuracy: 0.815
RFECV accuracy: 0.790
Optimal number of features: 16
Note that on this particular train/test split RFECV selects 16 features yet scores slightly lower than RFE: the cross-validated optimum is chosen on the training folds and is not guaranteed to win on a single held-out set.
The steps are as follows:
- Generate a synthetic dataset with 20 features using `make_classification`.
- Split the dataset into training and test sets using `train_test_split`.
- Instantiate `RFE` with `LogisticRegression`, selecting 10 features, and fit it on the training data.
- Evaluate the `RFE` model’s performance on the test set using accuracy.
- Instantiate `RFECV` with `LogisticRegression` and 5-fold cross-validation, and fit it on the training data.
- Evaluate the `RFECV` model’s performance on the test set using accuracy and note the optimal number of features.
- Compare the accuracy of the models and the number of selected features; a sketch for inspecting which features each method kept follows this list.
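Continuing from the fitted `rfe` and `rfecv` objects in the example above (this snippet is not part of the original listing), you can compare the selected feature indices directly:
import numpy as np
# Indices of the features each selector kept
print("RFE kept:  ", np.where(rfe.support_)[0])
print("RFECV kept:", np.where(rfecv.support_)[0])
# Features both methods agree on
print("Overlap:   ", np.where(rfe.support_ & rfecv.support_)[0])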