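The paired_cosine_distances() function in scikit-learn calculates the cosine distance between corresponding rows of two arrays: given X and Y of the same shape, it returns one distance per row, comparing row i of X with row i of Y. Cosine distance measures the difference in orientation between two vectors, ignoring their magnitudes. It ranges from 0 (for vectors pointing in the same direction) through 1 (for orthogonal vectors) to 2 (for oppositely oriented vectors).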
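To make the 0-to-2 range concrete, here is a minimal NumPy sketch of the definition: one minus the cosine of the angle between two vectors. The helper name cosine_distance is hypothetical and exists only for this illustration; it is not part of scikit-learn.

```python
import numpy as np

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v). Hypothetical helper for illustration."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Same direction, different length: distance close to 0
same = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: distance close to 1
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
# Opposite direction: distance close to 2
opposite = cosine_distance([1.0, 2.0], [-1.0, -2.0])

print(same, orthogonal, opposite)
```

The printed values may differ from 0, 1, and 2 by a tiny floating-point error.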
Cosine distance is commonly used to measure similarity between text documents, where each document is represented as a high-dimensional vector of word frequencies or embeddings. A smaller cosine distance indicates higher similarity between documents.
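paired_cosine_distances() is useful when samples come in aligned pairs, for example when comparing each document with a revised version of itself or each query with its candidate match. For tasks that need every pair, such as clustering similar documents, finding nearest neighbors, or building recommender systems based on item similarity, the related cosine_distances() function returns the full distance matrix instead. Cosine distance ignores the overall length of each vector, so the inputs do not need to be normalized beforehand. It is still sensitive to the relative scaling of individual features, because rescaling one feature changes the direction of the vectors.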
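A quick way to see how scaling affects the result is to compare three cases. This is a small sketch: the two vectors and the weights array are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import paired_cosine_distances

X = np.array([[0.5, 0.7, 0.1, 0.2]])
Y = np.array([[0.1, 0.3, 0.9, 0.8]])

base = paired_cosine_distances(X, Y)

# Multiplying a whole vector by a constant changes its length, not its direction
scaled_vector = paired_cosine_distances(10 * X, Y)

# Rescaling one feature (column) changes the direction of both vectors
weights = np.array([1.0, 1.0, 10.0, 1.0])  # arbitrary illustrative weights
scaled_feature = paired_cosine_distances(X * weights, Y * weights)

print("original:        ", base)
print("vector scaled:   ", scaled_vector)
print("feature rescaled:", scaled_feature)
```

The first two results should agree up to floating-point error, while the third differs, which is why per-feature scaling still matters when using cosine distance.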
from sklearn.metrics.pairwise import paired_cosine_distances
# Synthetic dataset representing text document vectors
docs = [
[0.5, 0.7, 0.1, 0.2],
[0.6, 0.8, 0.0, 0.1],
[0.1, 0.3, 0.9, 0.8],
[0.0, 0.2, 0.7, 0.9]
]
# Calculate the cosine distance between row i of the first argument
# and row i of the second (here each document is paired with itself)
distances = paired_cosine_distances(docs, docs)
print("Paired Cosine Distances:")
print(distances)
Running this example produces output like:
Paired Cosine Distances:
[0. 0. 0. 0.]
The key steps are:
1. We create a synthetic dataset docs representing vectors for four text documents. Each document is a list of feature values, such as word frequencies or embeddings.
2. We pass docs as both arguments to paired_cosine_distances(), which calculates the cosine distance between row i of the first array and row i of the second. The result, distances, is a one-dimensional array with one entry per row, not a square matrix.
3. We print distances. Because each document is paired with itself here, every entry is zero, indicating that each document is identical to itself.
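Pairing a dataset with itself only ever yields zeros, so a more informative call passes two different arrays of the same shape. The sketch below reuses the four document vectors from the example; the variable names within and across are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import paired_cosine_distances

docs = np.array([
    [0.5, 0.7, 0.1, 0.2],
    [0.6, 0.8, 0.0, 0.1],
    [0.1, 0.3, 0.9, 0.8],
    [0.0, 0.2, 0.7, 0.9],
])

# Pair document 0 with document 1, and document 2 with document 3
within = paired_cosine_distances(docs[[0, 2]], docs[[1, 3]])

# Pair document 0 with document 2, and document 1 with document 3
across = paired_cosine_distances(docs[[0, 1]], docs[[2, 3]])

print("0 vs 1, 2 vs 3:", within)
print("0 vs 2, 1 vs 3:", across)
```

Each call returns two distances, one per row pair. The distances in within should be much smaller than those in across, since documents 0 and 1 weight the first two features while documents 2 and 3 weight the last two.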
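Because every document is compared only with itself in this example, the output says nothing about how the documents relate to one another. To compare different documents, pass two different arrays of the same shape, or use cosine_distances() to obtain the full distance matrix for downstream tasks like clustering the documents or finding the nearest neighbors of a given document. In this dataset, documents 0 and 1 put most of their weight on the first two features while documents 2 and 3 put it on the last two, so documents 0 and 1 are more similar to each other (smaller cosine distance) than to documents 2 and 3.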
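When every pair of documents matters, a full matrix is the more natural tool. The following is a minimal sketch using cosine_distances() from the same module on the four document vectors; the nearest-neighbor lookup with argmin is one simple illustrative approach, not the only one.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, paired_cosine_distances

docs = np.array([
    [0.5, 0.7, 0.1, 0.2],
    [0.6, 0.8, 0.0, 0.1],
    [0.1, 0.3, 0.9, 0.8],
    [0.0, 0.2, 0.7, 0.9],
])

# Full 4 x 4 matrix: entry (i, j) is the distance between document i and document j
matrix = cosine_distances(docs)

# A paired distance corresponds to a single entry of that matrix
pair_0_1 = paired_cosine_distances(docs[[0]], docs[[1]])

# Mask self-distances, then take the closest other document in each row
masked = matrix.copy()
np.fill_diagonal(masked, np.inf)
nearest = masked.argmin(axis=1)

print(matrix.round(3))
print("Distance between documents 0 and 1:", pair_0_1)
print("Nearest neighbor of each document:", nearest)
```

The matrix costs quadratic memory in the number of samples, whereas paired_cosine_distances() returns only one value per row pair, so the paired form is the lighter choice when the pairs of interest are known in advance.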
This example demonstrates how to use paired_cosine_distances() from scikit-learn to calculate cosine distances between corresponding rows of two arrays, which is particularly useful for measuring similarity between aligned pairs of samples in text data and other high-dimensional datasets.