
Scikit-Learn nan_euclidean_distances() Metric

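The nan_euclidean_distances() function in scikit-learn calculates the Euclidean distance between pairs of observations in n-dimensional space while tolerating missing values (NaNs). When a coordinate is missing in either observation of a pair, that coordinate is skipped and the squared distance over the remaining coordinates is scaled up by the ratio of total coordinates to present coordinates. If a pair has no coordinates in common, the distance for that pair is NaN.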
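
To make the treatment of missing values concrete, the sketch below recomputes one distance by hand and compares it with the library result. It assumes the rule described in the scikit-learn documentation: skip coordinates missing in either row, then weight the squared sum by total coordinates divided by present coordinates. The helper manual_nan_euclidean is a hypothetical name used only for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import nan_euclidean_distances


def manual_nan_euclidean(x, y):
    """Hypothetical helper: NaN-aware Euclidean distance between two 1D rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Coordinates that are observed in both rows
    present = ~(np.isnan(x) | np.isnan(y))
    if not present.any():
        return np.nan  # no shared coordinates, distance is undefined
    # Scale up to compensate for the skipped coordinates
    weight = x.size / present.sum()
    return np.sqrt(weight * np.sum((x[present] - y[present]) ** 2))


a = [3, np.nan, np.nan, 6]
b = [1, np.nan, 4, 5]

# Only coordinates 0 and 3 are shared: sqrt(4/2 * ((3-1)^2 + (6-5)^2)) = sqrt(10)
by_hand = manual_nan_euclidean(a, b)
by_sklearn = nan_euclidean_distances([a], [b])[0, 0]

print(by_hand, by_sklearn)
```

Both values should agree, which shows that the function rescales the partial distance and does not simply treat missing coordinates as zero.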

This metric is commonly used in tasks such as clustering, nearest neighbor searches, and outlier detection. It takes a 2D array X and an optional second 2D array Y, and returns a 2D array of shape (n_samples_X, n_samples_Y) holding the pairwise distances between their rows. If Y is omitted, distances are computed between the rows of X itself.
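
As a small illustration of the call signature, the sketch below passes a single array, so distances are computed between the rows of X itself, and then uses the squared parameter to request squared distances. The data values are arbitrary and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import nan_euclidean_distances

X = np.array([[0.0, 1.0],
              [1.0, np.nan]])

# With Y omitted, the result is the square matrix of distances among rows of X
D = nan_euclidean_distances(X)

# squared=True returns squared distances instead of taking the square root
D_sq = nan_euclidean_distances(X, squared=True)

print(D)
print(D_sq)
```

The off-diagonal entry uses only the first coordinate, weighted by 2/1, which gives a distance of sqrt(2).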

One limitation of the Euclidean distance is that it is sensitive to the scale of the features. If the features have different scales, the larger-scaled features will dominate the distance calculations. Additionally, the result has one entry for every pair of rows, so time and memory grow with the product of the number of rows in the two arrays, which can be expensive for large datasets.
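
One common way to reduce the scale sensitivity is to standardize the features before computing distances. The sketch below assumes StandardScaler is an acceptable choice for the data; it disregards NaNs when fitting and leaves them in place when transforming, so the missing values remain for nan_euclidean_distances() to handle. The data are hypothetical.

```python
import numpy as np
from sklearn.metrics import nan_euclidean_distances
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is in small units, column 1 in large units
X = np.array([[1.0, 1000.0],
              [2.0, np.nan],
              [3.0, 3000.0],
              [4.0, 2000.0]])

# On the raw data, column 1 dominates every distance where it is present
raw = nan_euclidean_distances(X)

# Standardize each column; NaNs are ignored in fit and preserved in transform
X_scaled = StandardScaler().fit_transform(X)
scaled = nan_euclidean_distances(X_scaled)

print(raw.round(2))
print(scaled.round(2))
```

After scaling, both columns contribute on a comparable footing, so differences in the first column are no longer swamped by the second.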

import numpy as np
from sklearn.metrics import nan_euclidean_distances

# Generate synthetic dataset with missing values
X = np.array([[1, 2, 3],
              [4, np.nan, 6],
              [7, 8, np.nan]])

Y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Calculate pairwise distances
distances = nan_euclidean_distances(X, Y)

print("Pairwise distances:")
print(distances)

Running the example gives an output like:

Pairwise distances:
[[ 0.          5.19615242 10.39230485]
 [ 5.19615242  0.          5.19615242]
 [10.39230485  5.19615242  0.        ]]

The steps are as follows:

  1. Generate a synthetic dataset using NumPy. We create two 2D arrays, X and Y, where X contains some missing values represented by np.nan and Y is fully observed.

  2. Use the nan_euclidean_distances() function to calculate the pairwise distances between the observations in X and Y. The function skips any coordinate that is missing in either observation and rescales the result to compensate, so no imputation is needed beforehand.

  3. Print the resulting distance matrix, which shows the pairwise Euclidean distances between the observations in X and Y.

This example demonstrates how to use the nan_euclidean_distances() function from scikit-learn to calculate pairwise distances between observations in the presence of missing values. It provides a convenient way to measure the similarity or dissimilarity between data points, even when some feature values are missing.
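
As a possible follow-on, the distance matrix can be used to find, for each row of X, the closest row of Y. The sketch below reuses the same X and Y as the example and applies np.argmin along each row; this is one simple way to read off nearest neighbors, not the only one.

```python
import numpy as np
from sklearn.metrics import nan_euclidean_distances

X = np.array([[1, 2, 3],
              [4, np.nan, 6],
              [7, 8, np.nan]])

Y = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

distances = nan_euclidean_distances(X, Y)

# Index of the closest row in Y for each row in X
nearest = np.argmin(distances, axis=1)

print(nearest)  # [0 1 2]
```

Each row of X is matched to its fully observed counterpart in Y, even though two of the rows have a missing coordinate.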



See Also