Scikit-Learn FeatureAgglomeration Model

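Feature Agglomeration is a dimensionality reduction method that applies hierarchical (agglomerative) clustering to the features rather than the samples, then pools each group of similar features into a single feature (the mean of the group by default). It is useful for reducing dimensionality in high-dimensional datasets.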
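
To make the merging concrete, the minimal sketch below fits FeatureAgglomeration on a small synthetic dataset and checks that each output column is the mean of the original columns assigned to that cluster. It assumes the default pooling function (np.mean); the dataset sizes and the variable name manual are chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification

# small synthetic dataset: 100 samples, 20 features
X, _ = make_classification(n_samples=100, n_features=20, random_state=1)

# group the 20 features into 5 clusters and pool each cluster
agglo = FeatureAgglomeration(n_clusters=5)
X_reduced = agglo.fit_transform(X)

# labels_[j] is the cluster index assigned to original feature j
print(agglo.labels_)
print(X.shape, '->', X_reduced.shape)

# with the default pooling function (np.mean), output column k is the
# mean of the original columns assigned to cluster k
for k in range(agglo.n_clusters):
    manual = X[:, agglo.labels_ == k].mean(axis=1)
    print(k, np.allclose(X_reduced[:, k], manual))
```

The labels_ attribute is the place to look when you need to know which original features were merged together.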

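The key hyperparameters of FeatureAgglomeration include n_clusters (the number of feature clusters to find, and therefore the number of output features), metric (the distance metric used to compute the linkage; this parameter was named affinity in scikit-learn versions before 1.2), and linkage (which linkage criterion to use: ward, complete, average, or single).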
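
As a rough sketch of setting these hyperparameters, the snippet below uses average linkage with Manhattan distance. Because the name of the distance parameter depends on the installed scikit-learn version, it inspects the constructor signature to pick metric or affinity; the helper variable distance_kw and the specific values are illustrative assumptions, not recommendations.

```python
import inspect

from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification

X, _ = make_classification(n_samples=100, n_features=20, random_state=1)

# the distance parameter is named `metric` in recent releases and `affinity`
# in older ones, so look up which name this installation accepts
params = inspect.signature(FeatureAgglomeration.__init__).parameters
distance_kw = 'metric' if 'metric' in params else 'affinity'

# average linkage with Manhattan distance; ward linkage only supports
# Euclidean distance, so a different linkage is needed here
agglo = FeatureAgglomeration(
    n_clusters=4,
    linkage='average',
    **{distance_kw: 'manhattan'},
)
X_reduced = agglo.fit_transform(X)
print(X_reduced.shape)
```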

The algorithm is suited to feature reduction, particularly as a preprocessing step before fitting a machine learning model.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# generate a synthetic high-dimensional dataset
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

# apply FeatureAgglomeration to reduce dimensionality
agglo = FeatureAgglomeration(n_clusters=5)
X_reduced = agglo.fit_transform(X)

# split the reduced dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=1)

# create and fit a LogisticRegression model
model = LogisticRegression()
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

# make a prediction
row = [X_test[0]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])

Running the example gives an output like:

Accuracy: 0.850
Predicted: 0

The steps are as follows:

1. A synthetic dataset is generated using make_classification(). It has 100 samples, 20 features, and a fixed random seed for reproducibility.
2. The FeatureAgglomeration class is instantiated with n_clusters=5, and the fit_transform() method is applied to the dataset to reduce the 20 original features to 5 pooled features.
3. The reduced dataset is then split into training and test sets using train_test_split().
4. A LogisticRegression model is created and trained on the reduced training data using the fit() method.
5. The model’s performance is evaluated on the test set using the accuracy score metric.
6. A single prediction is made by passing the first row of the reduced test set to the predict() method.

This example shows how to use FeatureAgglomeration to reduce the dimensionality of a dataset before fitting a machine learning model in scikit-learn.
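
The example above fits FeatureAgglomeration on the full dataset before splitting it, so the test rows influence how features are clustered. One way to avoid this, sketched below under the same dataset settings, is to wrap both steps in a Pipeline so the agglomeration is fit only on the training portion of each cross-validation fold. The step names 'agglo' and 'model' and the choice of 5 folds are arbitrary choices for illustration.

```python
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# same synthetic dataset as in the main example
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

# chain feature reduction and the classifier into one estimator
pipeline = Pipeline([
    ('agglo', FeatureAgglomeration(n_clusters=5)),
    ('model', LogisticRegression()),
])

# each fold fits FeatureAgglomeration on its training rows only,
# then applies the learned feature grouping to the held-out rows
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
print('Mean accuracy: %.3f' % scores.mean())
```

Because the pipeline behaves like a single estimator, the same object can be passed to a hyperparameter search to tune n_clusters alongside the model.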

See Also