
Scikit-Learn SelectFromModel for Feature Selection

SelectFromModel is a meta-transformer in scikit-learn that selects features based on importance weights computed by an estimator, such as the coefficients of a linear model or the feature importances of a tree-based model. Features whose importance falls below a threshold are discarded.

The key hyperparameters of SelectFromModel are the estimator (the base estimator from which the importance weights are taken) and threshold (the cutoff for keeping a feature, which can be a float or a string such as "mean", "median", or a scaled variant like "1.5*mean").
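As a quick sketch of how the threshold hyperparameter behaves, the snippet below fits one model and compares several threshold settings on the same synthetic data (the specific threshold values are illustrative choices; prefit=True tells SelectFromModel to reuse the already-fitted model rather than refit it):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# small synthetic dataset, fit the model once
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# threshold accepts a float or strings like "mean", "median", "1.5*mean"
counts = {}
for threshold in [0.05, "mean", "median", "1.5*mean"]:
    selector = SelectFromModel(estimator=model, threshold=threshold, prefit=True)
    counts[str(threshold)] = selector.transform(X).shape[1]
    print(threshold, '->', counts[str(threshold)], 'features kept')
```

Higher thresholds keep fewer features, so "1.5*mean" never retains more than "mean" does.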

This method is appropriate for classification and regression problems where reducing the feature set may improve model performance, speed up training, or reduce overfitting.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# create RandomForest model
model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# perform feature selection
selector = SelectFromModel(estimator=model, threshold="mean")
selector.fit(X_train, y_train)

# transform the data
X_train_transformed = selector.transform(X_train)
X_test_transformed = selector.transform(X_test)

# print original and transformed dataset shapes
print('Original shape:', X_train.shape)
print('Transformed shape:', X_train_transformed.shape)

Running the example gives an output like:

Original shape: (80, 10)
Transformed shape: (80, 3)

The steps are as follows:

  1. First, a synthetic binary classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples), total features (n_features), informative features (n_informative), and a fixed random seed (random_state) for reproducibility. The dataset is split into training and test sets using train_test_split().

  2. Next, a RandomForestClassifier model is instantiated and fit on the training data using the fit() method.

  3. SelectFromModel is used with the trained model to select important features based on the mean importance threshold.

  4. The training and test data are then transformed to retain only the selected features using the transform() method.

  5. The shapes of the original and transformed datasets are printed to show the effect of feature selection.
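Beyond the shapes, it is often useful to see exactly which features survived the cut. A short sketch building on the same setup: get_support() returns a boolean mask (or integer indices), and estimator_ holds the copy of the model that the selector fitted internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=1), threshold="mean"
)
selector.fit(X, y)

# integer indices of the retained features and their importances
indices = selector.get_support(indices=True)
importances = selector.estimator_.feature_importances_
print('Selected feature indices:', indices)
print('Their importances:', importances[indices])
```

With threshold="mean", every retained feature has an importance at or above the mean importance across all features.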

This example demonstrates how to use SelectFromModel for feature selection, making it easier to improve model performance and reduce overfitting by retaining only the most important features.
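In practice, SelectFromModel is often placed inside a Pipeline so that feature selection is fitted only on the training folds. A minimal sketch under that setup (the LogisticRegression downstream classifier is an illustrative choice, not part of the example above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# selection and classification are fitted together as one estimator
pipe = Pipeline([
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1))),
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print('Test accuracy:', pipe.score(X_test, y_test))
```

Because the selector lives inside the pipeline, cross-validation and grid search over it will not leak test data into the selection step.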
