Cosine distance is a measure of dissimilarity between two non-zero vectors. It is defined as 1 minus the cosine of the angle between the vectors, so it takes values between 0 and 2. A cosine distance of 0 indicates that the vectors point in the same direction, a distance of 1 means they are orthogonal, and a distance of 2 means they point in opposite directions. Cosine distance is commonly used in text analysis and information retrieval to compare documents or measure the similarity of text features.
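As a quick sanity check of the 0-to-2 range, the sketch below evaluates three hand-picked 2-D vector pairs. The vectors are illustrative values chosen for this check, not data from the example later in the article.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Reference vector (illustrative)
a = np.array([[1.0, 0.0]])

# Same direction, different magnitude: distance should be close to 0
same_direction = cosine_distances(a, np.array([[3.0, 0.0]]))[0, 0]
# Perpendicular vectors: distance should be close to 1
orthogonal = cosine_distances(a, np.array([[0.0, 2.0]]))[0, 0]
# Opposite direction: distance should be close to 2
opposite = cosine_distances(a, np.array([[-1.0, 0.0]]))[0, 0]

print(same_direction, orthogonal, opposite)
```

Note that scaling a vector does not change the result, since only the angle between the vectors enters the calculation.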
The cosine_distances() function in scikit-learn calculates pairwise cosine distances between samples. It takes a dataset as input and returns a square matrix where each element represents the cosine distance between two samples. This function is particularly useful when dealing with high-dimensional sparse data, such as text data represented by word counts or TF-IDF vectors.
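To make the text use case concrete, here is a minimal sketch that applies cosine_distances() to TF-IDF vectors. The three short documents are made up for illustration, and TfidfVectorizer is used with its default settings as an assumption; a real pipeline would likely tune tokenization and stop words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Toy corpus (hypothetical documents, for illustration only)
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]

# fit_transform returns a SciPy sparse matrix with one row per document
tfidf = TfidfVectorizer().fit_transform(docs)

# cosine_distances accepts sparse input and returns a dense 3x3 array
text_distances = cosine_distances(tfidf)
print(text_distances.round(3))
```

The first two documents share most of their words, so their distance should be small, while the third shares no vocabulary with the first and should sit at a distance of about 1.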
To calculate cosine distances, the cosine_distances() function first normalizes the input vectors to unit length. Then, it computes the dot product between each pair of normalized vectors, which directly gives the cosine similarity. Finally, it subtracts the cosine similarity from 1 to obtain the cosine distance.
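The three steps described above can be reproduced by hand and compared against the library result. This is a sketch of the described procedure on a small made-up array, not a copy of scikit-learn's internal code.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from sklearn.preprocessing import normalize

# Small illustrative dataset: 3 samples, 3 features
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [4.0, 0.0, 1.0]])

# Step 1: scale each row to unit L2 length
X_unit = normalize(X)

# Step 2: dot products of unit vectors are cosine similarities
similarity = X_unit @ X_unit.T

# Step 3: distance = 1 - similarity
manual = 1.0 - similarity

# Compare with the library function (up to floating-point tolerance)
print(np.allclose(manual, cosine_distances(X)))
```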
Note that cosine distance is not a proper metric because it does not satisfy the triangle inequality. It is still a useful measure of dissimilarity in many applications, especially when only the direction of the vectors matters and not their magnitude.
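A small counterexample illustrates the triangle inequality failure. The three vectors below, at 0, 45 and 90 degrees, are chosen for illustration: the direct distance between the first and last exceeds the sum of the two intermediate distances.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Vectors at 0, 45 and 90 degrees (illustrative)
points = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 1.0]])

d = cosine_distances(points)

direct = d[0, 2]            # 0 degrees to 90 degrees
detour = d[0, 1] + d[1, 2]  # via the 45 degree vector

# A metric would require direct <= detour; here that fails
print(direct, detour, direct > detour)
```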
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import cosine_distances
from scipy.sparse import csr_matrix
# Generate synthetic dataset
X, _ = make_classification(n_samples=5, n_features=4, random_state=42)
# Convert to sparse matrix representation
X_sparse = csr_matrix(X)
# Calculate pairwise cosine distances
distances = cosine_distances(X_sparse)
print("Cosine Distance Matrix:")
print(distances)
Running the example gives an output like:
Cosine Distance Matrix:
[[0. 1.92084752 1.73811109 1.33956554 0.89920678]
[1.92084752 0. 0.05723938 1.0540665 0.70487804]
[1.73811109 0.05723938 0. 1.38395428 0.40315328]
[1.33956554 1.0540665 1.38395428 0. 1.97001829]
[0.89920678 0.70487804 0.40315328 1.97001829 0. ]]
The steps are as follows:
- Generate a synthetic dataset using make_classification() to create a binary classification problem with 5 samples and 4 features.
- Convert the dataset to a sparse matrix representation using csr_matrix() from scipy.sparse, which stores only the non-zero entries and is therefore efficient for high-dimensional data that is mostly zeros.
- Use the cosine_distances() function to calculate pairwise cosine distances between all samples in the dataset.
- Print the resulting distance matrix, where each element represents the cosine distance between two samples.
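A few properties of the resulting matrix can be checked programmatically. The sketch below regenerates the same synthetic data and confirms, within floating-point tolerance, that sparse and dense inputs agree, that the matrix is symmetric, that the diagonal is zero, and that all values lie in the 0-to-2 range.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import cosine_distances

# Same synthetic dataset as in the example
X, _ = make_classification(n_samples=5, n_features=4, random_state=42)

dense_result = cosine_distances(X)
sparse_result = cosine_distances(csr_matrix(X))

# Sparse and dense inputs should give the same distances
print(np.allclose(dense_result, sparse_result))
# Distance from i to j equals distance from j to i
print(np.allclose(dense_result, dense_result.T))
# Each sample is at distance 0 from itself
print(np.allclose(np.diag(dense_result), 0.0))
# All values fall within [0, 2]
print(dense_result.min() >= 0.0 and dense_result.max() <= 2.0)
```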
This example demonstrates how to use the cosine_distances() function from scikit-learn to calculate pairwise cosine distances between samples in a dataset. By using a sparse matrix representation, it showcases how to efficiently handle high-dimensional data, which is common in text analysis and information retrieval tasks.