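Feature Agglomeration is a clustering-based feature extraction method that applies hierarchical (agglomerative) clustering to the features rather than the samples, merging each group of similar features into a single feature. It is useful for reducing dimensionality in high-dimensional datasets.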
The key hyperparameters of FeatureAgglomeration include n_clusters (the number of clusters to find), metric (the distance metric used to compute the linkage; this parameter was named affinity before scikit-learn 1.2), and linkage (which linkage criterion to use).
The algorithm is appropriate for feature reduction tasks and is especially useful as a preprocessing step before fitting machine learning models.
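To make the hyperparameters concrete, here is a minimal sketch of what the merge produces. It assumes the same kind of synthetic data used in this article, sets linkage="ward" explicitly (the library default, chosen here only for illustration), and relies on the default pooling function, np.mean. The labels_ attribute records which cluster each original feature was assigned to, and each reduced column should equal the mean of the original columns in that cluster.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification

# synthetic data with 20 features (illustrative settings)
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

# n_clusters and linkage set explicitly; "ward" is also the default linkage
agglo = FeatureAgglomeration(n_clusters=5, linkage="ward")
X_reduced = agglo.fit_transform(X)

# labels_ holds one cluster index per original feature
print('Cluster label per feature:', agglo.labels_)
print('Shape:', X.shape, '->', X_reduced.shape)

# with the default pooling_func (np.mean), column k of the reduced data
# is the mean of the original columns assigned to cluster k
matches = [
    np.allclose(X_reduced[:, k], X[:, agglo.labels_ == k].mean(axis=1))
    for k in range(agglo.n_clusters)
]
print('Reduced columns equal per-cluster means:', all(matches))
```

Passing a different pooling_func (for example np.median) changes how the features in a cluster are combined, while linkage and the distance metric change which features end up grouped together.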
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# generate a synthetic high-dimensional dataset
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)
# apply FeatureAgglomeration to reduce dimensionality
agglo = FeatureAgglomeration(n_clusters=5)
X_reduced = agglo.fit_transform(X)
# split the reduced dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.2, random_state=1)
# create and fit a LogisticRegression model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a prediction
row = [X_test[0]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])
Running the example gives an output like:
Accuracy: 0.850
Predicted: 0
The steps are as follows:
1. A synthetic high-dimensional dataset is generated using make_classification(). This creates a dataset with 20 features and a fixed random seed for reproducibility.
2. The FeatureAgglomeration class is instantiated to reduce the number of features to 5 clusters. The fit_transform() method is applied to the dataset to perform the reduction.
3. The reduced dataset is then split into training and test sets using train_test_split().
4. A LogisticRegression model is created and trained on the reduced training data using the fit() method.
5. The model’s performance is evaluated on the test set using the accuracy score metric.
6. A single prediction is made by passing a new data sample to the predict() method.
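Note that the steps above fit FeatureAgglomeration on the full dataset before splitting. The transform is unsupervised, so no labels are used, but the test rows still influence which features get merged. One possible variant, offered as a sketch rather than as this article's method, splits first and wraps both steps in a scikit-learn pipeline so that the clustering is learned from the training rows only. The dataset settings are copied from the example; make_pipeline is an addition not used in the original code.

```python
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

# split first so the test rows play no part in clustering the features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# the pipeline fits FeatureAgglomeration on X_train only,
# then reuses the learned feature clusters when transforming X_test
pipe = make_pipeline(FeatureAgglomeration(n_clusters=5), LogisticRegression())
pipe.fit(X_train, y_train)

acc = accuracy_score(y_test, pipe.predict(X_test))
print('Accuracy: %.3f' % acc)
```

The accuracy may differ from the figure shown earlier, since the feature clusters are now derived from 80 rows instead of 100.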
This example shows how to use FeatureAgglomeration in scikit-learn to reduce the dimensionality of a dataset before fitting a machine learning model.
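Whether the reduction helps depends on the dataset, and a single train/test split on 100 samples is a noisy estimate. One way to check, sketched here under the assumption that 5-fold cross-validated accuracy is a reasonable yardstick, is to score the same classifier with and without the agglomeration step. The names baseline, reduced, and results are illustrative only.

```python
from sklearn.cluster import FeatureAgglomeration
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

# same classifier, with and without the feature reduction step
baseline = LogisticRegression()
reduced = make_pipeline(FeatureAgglomeration(n_clusters=5), LogisticRegression())

results = {}
for name, estimator in [('all 20 features', baseline), ('5 agglomerated features', reduced)]:
    # 5-fold cross-validation; the pipeline refits the clustering inside each fold
    scores = cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
    results[name] = scores.mean()
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))
```

If the reduced model scores close to the baseline, the agglomerated features retain most of the predictive signal with a quarter of the columns; a large drop suggests trying a larger n_clusters or a different linkage.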