
Scikit-Learn VarianceThreshold for Feature Selection

VarianceThreshold is a feature selection method that removes features with low variance.

It is a simple, unsupervised preprocessing step: it looks only at the feature values, never at the target, and removes features that vary too little to carry much information.

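It works by setting a threshold and dropping every feature whose variance, computed on the data passed to fit(), does not exceed that threshold. With the default threshold of 0.0, only constant features are removed.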
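To make the rule concrete, here is a minimal sketch on a hand-made array. The values in X_toy are invented for illustration only; variances_ and get_support() are the fitted attribute and method that scikit-learn exposes on the selector.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data (made up for illustration):
# column 0 is constant, column 1 barely varies, column 2 varies a lot.
X_toy = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 0.1, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 0.1, 40.0],
])

toy_selector = VarianceThreshold(threshold=0.1)
X_toy_reduced = toy_selector.fit_transform(X_toy)

# per-feature variance computed during fit
print("Variances:", toy_selector.variances_)
# boolean mask: True for features that are kept
print("Kept mask:", toy_selector.get_support())
print("Reduced shape:", X_toy_reduced.shape)
```

The three variances are roughly 0, 0.0025, and 125, so with a threshold of 0.1 only the last column should survive.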

It is useful when you want to reduce dimensionality with little risk to the model’s predictive power, since constant or near-constant features rarely help distinguish one sample from another.

from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# print dataset shape before feature selection
print("Original shape:", X.shape)

# define the feature selection
selector = VarianceThreshold(threshold=0.1)

# fit and transform the dataset
X_selected = selector.fit_transform(X)

# print dataset shape after feature selection
print("Selected shape:", X_selected.shape)

# print the first row before and after transformation
print("Original first row:", X[0])
print("Selected first row:", X_selected[0])

Running the example gives an output like:

Original shape: (100, 10)
Selected shape: (100, 10)
Original first row: [ 1.0334508  -1.95816909  0.79006105 -0.01478415 -0.93418184  0.54264529
  1.06080576 -0.85749682  0.58530898 -0.14627327]
Selected first row: [ 1.0334508  -1.95816909  0.79006105 -0.01478415 -0.93418184  0.54264529
  1.06080576 -0.85749682  0.58530898 -0.14627327]

The steps are as follows:

  1. A synthetic binary classification dataset is generated using make_classification(), with 100 samples and 10 features.
  2. The original shape of the dataset is printed to show the number of features before feature selection.
  3. A VarianceThreshold feature selector is instantiated with a threshold of 0.1.
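  4. The fit_transform() method is applied to the dataset to compute each feature's variance and drop any feature whose variance does not exceed the threshold.
  5. The new shape of the dataset is printed. In this run it is unchanged at (100, 10), because all 10 features have a variance above 0.1 and none are removed.
  6. The first row of the dataset is printed before and after feature selection; the two rows are identical here because no feature was dropped.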
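When the shape comes back unchanged, as it does above, it can help to look at the variances the selector actually computed. The following is a small sketch that reuses the same dataset and threshold; the variable names are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

# same dataset and threshold as the main example
X, y = make_classification(n_samples=100, n_features=10, random_state=1)
fitted = VarianceThreshold(threshold=0.1).fit(X)

# variance of each of the 10 features, rounded for readability
print("Per-feature variances:", fitted.variances_.round(3))

# number of features whose variance exceeds the threshold
kept = int(fitted.get_support().sum())
print("Features kept:", kept, "of", X.shape[1])
```

If every printed variance is well above 0.1, that explains why the selected shape matches the original.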

This example demonstrates how to use VarianceThreshold for feature selection. The selector reduces the number of features by removing those with low variance, which can potentially improve the performance of machine learning models. On this particular dataset every feature clears the 0.1 threshold, so the data passes through unchanged.
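To see the selector actually remove something, one option is to append low-variance columns to the same dataset. The sketch below is an illustration rather than part of the original example: the constant column, the small-noise column, and the NumPy seed are all assumptions made for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold

X, y = make_classification(n_samples=100, n_features=10, random_state=1)

# two artificial columns (hypothetical, added only for this demo)
rng = np.random.default_rng(0)
constant_col = np.full((X.shape[0], 1), 3.0)             # variance exactly 0
noisy_col = rng.normal(0.0, 0.01, size=(X.shape[0], 1))  # variance near 1e-4

X_padded = np.hstack([X, constant_col, noisy_col])

padded_selector = VarianceThreshold(threshold=0.1)
X_padded_selected = padded_selector.fit_transform(X_padded)

print("Padded shape:", X_padded.shape)
print("Selected shape:", X_padded_selected.shape)
# column indices of the features that were kept
print("Kept columns:", padded_selector.get_support(indices=True))
```

The two appended columns (indices 10 and 11) fall below the threshold and should be absent from the kept indices.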


