
Scikit-Learn "RFE" versus "RFECV"

RFE (Recursive Feature Elimination) and RFECV (Recursive Feature Elimination with Cross-Validation) are powerful techniques for feature selection in machine learning.

RFE works by repeatedly fitting a model, ranking the features by importance, and eliminating the least important ones until the requested number remains. Key hyperparameters of RFE include estimator (the model to fit), n_features_to_select (the number of features to keep), and step (the number of features removed at each iteration).
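
After fitting, the attributes support_ (a boolean mask of the kept features) and ranking_ (rank 1 for selected features, higher ranks for features eliminated earlier) let you inspect the result. A minimal sketch, using illustrative data and hyperparameter values:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Illustrative toy data: 8 features, of which 4 are informative
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

# Keep 4 features, removing one feature per iteration (step=1)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the retained features
print(rfe.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier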

RFECV extends RFE by using cross-validation to find the optimal number of features automatically. Its key hyperparameters include estimator, step, cv (the cross-validation strategy: an integer number of folds or a splitter object), scoring, and min_features_to_select (a lower bound on the search).
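
As a quick sketch (again with illustrative data and values), cv also accepts a splitter object, and min_features_to_select puts a floor on the search:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

# cv takes an int or a splitter; min_features_to_select bounds the search from below
rfecv = RFECV(
    LogisticRegression(max_iter=1000),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
    min_features_to_select=2,
)
rfecv.fit(X, y)

print(rfecv.n_features_)  # number of features chosen by cross-validation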

The main difference is that RFE requires the user to specify the number of features to select, whereas RFECV automatically determines the optimal number using cross-validation. This makes RFECV more computationally intensive but more automated.

In short, RFE is faster but requires you to choose the number of features up front, while RFECV is more computationally intensive but determines that number automatically.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE, RFECV
from sklearn.metrics import accuracy_score

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate RFE with LogisticRegression
estimator = LogisticRegression(random_state=42)
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe.fit(X_train, y_train)
y_pred_rfe = rfe.predict(X_test)
print(f"RFE accuracy: {accuracy_score(y_test, y_pred_rfe):.3f}")

# Fit and evaluate RFECV with LogisticRegression
rfecv = RFECV(estimator, step=1, cv=5)
rfecv.fit(X_train, y_train)
y_pred_rfecv = rfecv.predict(X_test)
print(f"RFECV accuracy: {accuracy_score(y_test, y_pred_rfecv):.3f}")
print(f"Optimal number of features: {rfecv.n_features_}")

Running the example gives an output like:

RFE accuracy: 0.815
RFECV accuracy: 0.790
Optimal number of features: 16

The steps are as follows:

  1. Generate a synthetic dataset with 20 features using make_classification.
  2. Split the dataset into training and test sets using train_test_split.
  3. Instantiate RFE with LogisticRegression, selecting 10 features, and fit it on the training data.
  4. Evaluate the RFE model’s performance on the test set using accuracy.
  5. Instantiate RFECV with LogisticRegression and 5-fold cross-validation, and fit it on the training data.
  6. Evaluate the RFECV model’s performance on the test set using accuracy and note the optimal number of features.
  7. Compare the accuracy of the models and the number of selected features (see the sketch after this list).
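
As an optional follow-up to step 7, a short sketch that reuses the fitted rfe and rfecv objects from the example above to see which features each selector kept:

import numpy as np

print(f"RFE kept:   {np.flatnonzero(rfe.support_)}")
print(f"RFECV kept: {np.flatnonzero(rfecv.support_)}")
print(f"Kept by both: {np.flatnonzero(rfe.support_ & rfecv.support_)}")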


See Also