Scikit-Learn r_regression() for Feature Selection

The r_regression() function in scikit-learn computes Pearson's correlation coefficient (r) between each feature and the target variable in a regression problem. Each score lies between -1 and 1 and measures the strength and direction of the linear relationship between that feature and the target.

These scores can be used for feature selection, helping to identify the most informative features for a predictive model. The function does not fit a model; it returns an array with one correlation coefficient per feature.

from sklearn.datasets import make_regression
from sklearn.feature_selection import r_regression

# generate regression dataset
X, y = make_regression(n_samples=100, n_features=10, noise=0.5, random_state=1)

# compute Pearson's r between each feature and the target
scores = r_regression(X, y)
print(scores)

# indices of the 5 highest scores, best first
top_features = scores.argsort()[-5:][::-1]
print(top_features)

# keep only the columns for the top features
X_selected = X[:, top_features]

# summarize selected dataset
print(X.shape, X_selected.shape)

Running the example gives an output like:

[ 0.17000308  0.14924683  0.04360784  0.56935469  0.25124696  0.14630747
 -0.03381099  0.52602067  0.19406379  0.42532461]
[3 7 9 4 8]
(100, 10) (100, 5)

The steps are as follows:

  1. First, a synthetic regression dataset is generated using the make_regression() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), and noise in the target variable (noise). A fixed random seed (random_state) is set for reproducibility.

  2. The r_regression() function is called with the input features (X) and target variable (y). This computes Pearson's correlation coefficient between each feature and the target and returns the scores as an array, which is printed to the console.

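  3. The top 5 features are selected by sorting the scores with argsort() and taking the indices of the 5 highest values, in descending order. These indices are printed, showing which features have the strongest positive correlation with the target. Because the ranking uses the signed score, a feature with a strong negative correlation would rank last.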

  4. A new feature matrix X_selected is created by selecting only the top 5 features from the original dataset X.

  5. Finally, the shapes of the original dataset X and the selected dataset X_selected are printed, confirming that the number of features has been reduced.

This example demonstrates how to use the r_regression() function to score the features of a regression dataset by their linear correlation with the target. These scores can guide feature selection, allowing you to identify and retain the most informative features for building a predictive model.


