Scikit-Learn MDS Model

Multidimensional Scaling (MDS) is a technique for dimensionality reduction that seeks to preserve the pairwise distances between samples while representing the data in a lower-dimensional space. This makes MDS useful for visualizing high-dimensional data.

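The key hyperparameters of MDS include n_components (the number of dimensions in the low-dimensional space), metric (in the long-standing scikit-learn API, a boolean that selects metric MDS when True and non-metric MDS when False, rather than a choice of distance function), and max_iter (the maximum number of iterations of the SMACOF optimizer in a single run). The type of input is set separately by dissimilarity, which is either 'euclidean' or 'precomputed'. Newer scikit-learn releases have been revising these parameter names, so check the documentation for the installed version.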
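
As a rough illustration of how n_components affects the result, the sketch below fits MDS with a few output dimensionalities and reads the final stress from the fitted model's stress_ attribute. The dataset size, the values tried, and the name stress_by_dim are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import MDS

# small synthetic dataset (illustrative values, not tuned)
X, _ = make_classification(n_samples=60, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)

stress_by_dim = {}
for n_components in (1, 2, 10):
    # n_init is passed explicitly so the run does not depend on version defaults
    model = MDS(n_components=n_components, max_iter=300, n_init=1, random_state=1)
    model.fit(X)
    # stress_ holds the final value of the objective minimized by the optimizer
    stress_by_dim[n_components] = model.stress_

for n_components, stress in stress_by_dim.items():
    print(f'n_components={n_components}: stress={stress:.1f}')
```

Fewer output dimensions leave less room to reproduce the original distances, so the stress is typically much higher for a 1D embedding than for a 10D one.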

A common use is to embed the data in two dimensions and inspect the result for patterns or clusters.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances

# generate synthetic dataset with high dimensionality
X, _ = make_classification(n_samples=100, n_features=20, n_informative=15, n_redundant=5, random_state=1)

# split into train and test sets
X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

# create MDS model
model = MDS(n_components=2, max_iter=500, random_state=1)

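# fit model on training data (the result is stored in model.embedding_ and is not reused below)
model.fit(X_train)

# MDS has no transform() method, so fit_transform() fits a new 2D embedding on the test data
X_test_2d = model.fit_transform(X_test)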

# calculate stress (measure of how well distances are preserved)
dist_high = euclidean_distances(X_test)
dist_low = euclidean_distances(X_test_2d)
stress = ((dist_high - dist_low)**2).sum() / 2

print(f'Stress: {stress:.3f}')

# print the 2D coordinates of the first five samples
print(X_test_2d[:5])

Running the example gives an output like:

Stress: 3996.743
[[ -7.38311504  -4.14428618]
 [ -3.61935719  -9.28525924]
 [-26.49828244  -6.65364579]
 [  9.07762016 -26.84195501]
 [ 19.74342038  10.7711413 ]]
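
The raw stress value depends on the scale of the data and the number of samples, which makes it hard to interpret on its own. One option, sketched below, is to normalize it by the sum of squared original distances. This is only one of several normalization conventions, and the helper name normalized_stress is an assumption made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import MDS
from sklearn.metrics import euclidean_distances


def normalized_stress(X_high, X_low):
    """Square root of the squared distance errors over the squared original distances."""
    dist_high = euclidean_distances(X_high)
    dist_low = euclidean_distances(X_low)
    raw = ((dist_high - dist_low) ** 2).sum()
    # each pair is counted twice in both sums, so the factor of 2 cancels
    return np.sqrt(raw / (dist_high ** 2).sum())


X, _ = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
X_2d = MDS(n_components=2, max_iter=500, n_init=1, random_state=1).fit_transform(X)

print(f'Normalized stress: {normalized_stress(X, X_2d):.3f}')
```

A value of 0 means the distances are reproduced exactly, and values closer to 1 indicate that little of the distance structure survives in the embedding.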

The steps are:

A high-dimensional synthetic dataset is generated using make_classification() and split into training and test sets.
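An MDS model is instantiated, specifying the desired number of output dimensions (n_components), and fit on the training data.
Because MDS has no transform() method and cannot project unseen samples, the 2D representation of the test data is computed with fit_transform(), which fits a new embedding on the test samples alone; the earlier fit on the training data is not reused.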
The stress, a measure of how well the pairwise distances are preserved, is calculated by comparing the pairwise Euclidean distances in the high-dimensional space (dist_high) and the low-dimensional space (dist_low).
The 2D coordinates of a few samples are printed, which could be used for visualization.
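
One detail worth checking directly: scikit-learn's MDS exposes fit() and fit_transform() but no transform(), so each call computes an embedding only for the samples it is given. The short sketch below illustrates this. The split into X_a and X_b and the variable names are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import MDS

X, _ = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
X_a, X_b = X[:80], X[80:]

model = MDS(n_components=2, max_iter=300, n_init=1, random_state=1)

# check whether the estimator offers a transform() method for unseen data
print(hasattr(model, 'transform'))

model.fit(X_a)
shape_after_fit = model.embedding_.shape    # one row per sample in X_a

X_b_2d = model.fit_transform(X_b)           # refits from scratch on X_b
shape_after_refit = model.embedding_.shape  # one row per sample in X_b

print(shape_after_fit, shape_after_refit)
```

If a single embedding of all samples is the goal, calling fit_transform() once on the full dataset avoids the split altogether.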

This example demonstrates how to use MDS to reduce the dimensionality of data and evaluate the quality of the low-dimensional representation. The resulting 2D data could be plotted to visually explore patterns or clusters in the original high-dimensional space.
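
A minimal plotting sketch follows, assuming matplotlib is installed. It embeds the full dataset and colors the points by the class labels that make_classification() returns. The output filename mds_embedding.png, the colormap, and the figure size are arbitrary choices for this illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.manifold import MDS

# keep the labels this time so the points can be colored by class
X, y = make_classification(n_samples=100, n_features=20, n_informative=15,
                           n_redundant=5, random_state=1)
X_2d = MDS(n_components=2, max_iter=500, n_init=1, random_state=1).fit_transform(X)

fig, ax = plt.subplots(figsize=(6, 5))
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='coolwarm', s=25)
ax.set_xlabel('MDS dimension 1')
ax.set_ylabel('MDS dimension 2')
ax.set_title('2D MDS embedding')
fig.savefig('mds_embedding.png', dpi=100)
```

The axes of an MDS embedding have no intrinsic meaning; only the relative distances between points are interpretable.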
