Multidimensional Scaling (MDS) is a technique for dimensionality reduction that seeks to preserve the pairwise distances between samples while representing the data in a lower-dimensional space. This makes MDS useful for visualizing high-dimensional data.
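The key hyperparameters of MDS include n_components (the number of dimensions in the low-dimensional space), metric (in most scikit-learn releases, a boolean that selects metric MDS when True and non-metric MDS when False, rather than the name of a distance function), and max_iter (the maximum number of optimizer iterations for a single run).
A common use is plotting the 2D or 3D embedding to look for patterns or clusters in high-dimensional data.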
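To make the effect of n_components concrete, the following minimal sketch fits MDS with 1, 2, and 3 output dimensions on the same kind of synthetic data and reports the stress_ attribute of each fitted model. The choices n_init=1 and max_iter=300 are assumptions made to keep the run short, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import MDS

# Synthetic data: 100 samples with 20 features
X, _ = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)

results = {}
for k in (1, 2, 3):
    # n_init=1 runs a single optimization; max_iter caps its iterations
    mds = MDS(n_components=k, max_iter=300, n_init=1, random_state=1)
    embedding = mds.fit_transform(X)
    results[k] = (embedding.shape, mds.stress_)
    print(f"n_components={k}: shape={embedding.shape}, stress={mds.stress_:.1f}")
```

Stress typically falls as more output dimensions are allowed, although the optimizer only finds a local minimum, so a strict decrease is not guaranteed on every run.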
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances
# generate synthetic dataset with high dimensionality
X, _ = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# split into train and test sets
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)
# create MDS model
model = MDS(n_components=2, max_iter=500, random_state=1)
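# fit model on training data (MDS has no transform(), so this fit is not reused below)
model.fit(X_train)
# embed test data in 2D; fit_transform() refits the model on X_test from scratch
X_test_2d = model.fit_transform(X_test)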
# calculate stress (measure of how well distances are preserved)
dist_high = euclidean_distances(X_test)
dist_low = euclidean_distances(X_test_2d)
stress = ((dist_high - dist_low)**2).sum() / 2
print(f'Stress: {stress:.3f}')
# print the 2D coordinates of the first five test samples
print(X_test_2d[:5])
Running the example gives an output like:
Stress: 3996.743
[[ -7.38311504 -4.14428618]
[ -3.61935719 -9.28525924]
[-26.49828244 -6.65364579]
[ 9.07762016 -26.84195501]
[ 19.74342038 10.7711413 ]]
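A raw stress value such as the one above is hard to judge on its own, because it grows with the number of sample pairs and with the scale of the distances. One common remedy is to divide by the sum of squared original distances and take the square root. The sketch below does this with a hypothetical helper, normalized_stress, which is not part of scikit-learn, and checks it on two toy cases where the answer is known.

```python
import numpy as np
from sklearn.metrics import euclidean_distances

def normalized_stress(X_high, X_low):
    """Hypothetical helper: sqrt(sum (d_high - d_low)^2 / sum d_high^2)."""
    d_high = euclidean_distances(X_high)
    d_low = euclidean_distances(X_low)
    return float(np.sqrt(((d_high - d_low) ** 2).sum() / (d_high ** 2).sum()))

# Toy data: 3D points whose third coordinate is always zero
rng = np.random.RandomState(0)
X_high = np.hstack([rng.rand(10, 2), np.zeros((10, 1))])

# Dropping the constant coordinate leaves all pairwise distances unchanged
perfect = normalized_stress(X_high, X_high[:, :2])
# Collapsing every point onto the origin loses all distance information
collapsed = normalized_stress(X_high, np.zeros((10, 2)))
print(perfect, collapsed)
```

Under this convention a value near 0 indicates that distances are almost perfectly preserved, and a value near 1 is what an embedding with no spread at all would score.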
The steps are:
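1. A high-dimensional synthetic dataset is generated using make_classification() and split into training and test sets.
2. An MDS model is instantiated, specifying the desired number of output dimensions (n_components). The model is then fit on the training data.
3. The test data is embedded in 2D with fit_transform(). scikit-learn's MDS has no transform() method for unseen samples, so this call refits the model on the test data alone, and the earlier fit on the training data does not influence the result.
4. The stress, a measure of how well the pairwise distances are preserved, is calculated as half the sum of squared differences between the pairwise Euclidean distances in the high-dimensional space (dist_high) and the low-dimensional space (dist_low). The sum is halved because each pair appears twice in the symmetric distance matrices. Lower values mean better preservation.
5. The 2D coordinates of the first five samples are printed; these could be used for visualization.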
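Because scikit-learn's MDS provides fit() and fit_transform() but no transform() for new samples, one way to place training and test points in the same 2D space is to embed them jointly and slice the result afterwards. The following is a minimal sketch of that workaround, with n_init=1 assumed only to keep the run short; it is an illustration, not the only way to handle out-of-sample points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import MDS
from sklearn.model_selection import train_test_split

X, _ = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

# Stack both subsets so a single fit_transform() call embeds them together
X_all = np.vstack([X_train, X_test])
mds = MDS(n_components=2, max_iter=500, n_init=1, random_state=1)
X_all_2d = mds.fit_transform(X_all)

# Slice the joint embedding back into the two subsets
X_train_2d = X_all_2d[:len(X_train)]
X_test_2d = X_all_2d[len(X_train):]
print(X_train_2d.shape, X_test_2d.shape)  # (80, 2) (20, 2)
```

The trade-off is that the test points take part in the optimization, so this suits visualization rather than any workflow that needs a strictly held-out test set.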
This example demonstrates how to use MDS to reduce the dimensionality of data and evaluate the quality of the low-dimensional representation. The resulting 2D data could be plotted to visually explore patterns or clusters in the original high-dimensional space.
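As a rough illustration of the plotting step, the sketch below embeds the full dataset and saves a scatter plot colored by class label. It assumes matplotlib is installed, keeps the labels that the example above discards, and writes to a hypothetical file name, mds_embedding.png.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.manifold import MDS

# Keep the class labels y this time so they can be used to color the points
X, y = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
X_2d = MDS(n_components=2, max_iter=500, n_init=1, random_state=1).fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="coolwarm", s=25)
ax.set_xlabel("MDS dimension 1")
ax.set_ylabel("MDS dimension 2")
ax.set_title("2D MDS embedding of the synthetic dataset")
fig.savefig("mds_embedding.png", dpi=100)  # hypothetical output file name
```

MDS never sees the labels, so any visible separation between the two colors reflects structure in the pairwise distances themselves.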