by Audrey M. Roy Greenfeld | Thu, Jan 16, 2025
A mathematical breakdown of cosine similarity, with copy-pastable LaTeX.
After hearing so much about cosine similarity, I thought it was something extremely difficult to understand. It turns out it's just the similarity of 2 word vectors, calculated as the cosine of the angle between them.
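The result ranges from -1 to 1, where:
- 1 means the vectors point in the same direction
- 0 means the vectors are perpendicular (orthogonal)
- -1 means the vectors point in opposite directions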
Similarity is just about the angle, not the lengths of the vectors. The vectors are typically normalized to unit length so that comparisons make more sense intuitively, and so that magnitudes drop out of the picture.
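Note: I'm reading that the range is [0,1] when the vectors have only non-negative components, as with word-count or TF-IDF vectors. Learned word embeddings can have negative components, so their similarities can fall below 0.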
The cosine similarity formula in LaTeX:
%%latex $$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}$$
If you're wondering why it looks so familiar, it's probably because you can get it by rewriting the definition of the dot product:
%%latex $$\mathbf{A} \cdot \mathbf{B} = {\|\mathbf{A}\| \|\mathbf{B}\|}\cos(\theta)$$
In the cosine similarity formula, the numerator is the dot product $\mathbf{A} \cdot \mathbf{B}$. Expanding that, you get the sum of the products of corresponding components:
%%latex $$\mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^n A_i B_i$$
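The sum above translates almost word for word into code. Here's a minimal sketch in plain Python; the function name `dot` and the sample vectors are just illustrative choices of mine, not part of any library:

```python
def dot(a, b):
    """Sum of the products of corresponding components of a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return sum(x * y for x, y in zip(a, b))

a = [1, 2, 3]
b = [4, 5, 6]
result = dot(a, b)  # 1*4 + 2*5 + 3*6 = 32
```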
The magnitude ‖A‖ is the square root of the sum of the squares of its components:
%%latex $$\|\mathbf{A}\| = \sqrt{\sum_{i=1}^{n} A_i^2}$$
It's just the length of the vector.
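As a quick sketch of the magnitude formula, again in plain Python (the name `magnitude` is my own, chosen for illustration):

```python
import math

def magnitude(a):
    """Length of a vector: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in a))

length = magnitude([3, 4])  # sqrt(3^2 + 4^2) = sqrt(25) = 5.0
```

This is the Pythagorean theorem, extended to as many components as the vector has.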
In the cosine similarity formula, the denominator is the product of the magnitudes $\|\mathbf{A}\| \|\mathbf{B}\|$:
%%latex $$\|\mathbf{A}\| \|\mathbf{B}\| = {\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$
Putting it all together:
%%latex $$\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$
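To see the full formula at work, here is a small sketch in plain Python. The function name `cosine_similarity`, the error checks, and the sample vectors are my own assumptions for illustration. I picked vectors whose lengths come out to whole numbers so the arithmetic is easy to follow by hand:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    dot = sum(x * y for x, y in zip(a, b))      # numerator: A . B
    norm_a = math.sqrt(sum(x * x for x in a))   # magnitude of A
    norm_b = math.sqrt(sum(y * y for y in b))   # magnitude of B
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat the zero vector as an error, since it has no direction.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([3, 4], [6, 8])    # 50 / (5 * 10)
perpendicular = cosine_similarity([3, 4], [-4, 3])    # 0 / (5 * 5)
opposite = cosine_similarity([3, 4], [-3, -4])        # -25 / (5 * 5)
```

The three results land on 1, 0, and -1, the two ends and the midpoint of the range.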
Cosine similarity measures how similar two vectors are by finding the cosine of the angle between them.
The math is not as bad as it looks: we're just comparing the directions vectors point in, normalized by their lengths.
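One way to convince yourself that only direction matters: normalize each vector to length 1 first, then take a plain dot product. The sketch below does that, with helper names (`normalize`, `dot`) and sample vectors that are my own illustrative choices. Stretching one vector by a factor of 10 should leave the result unchanged, up to floating-point rounding:

```python
import math

def normalize(v):
    """Scale a non-zero vector so that its length becomes 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 1.0]
a_stretched = [10 * x for x in a]   # same direction as a, ten times longer

sim = dot(normalize(a), normalize(b))
sim_stretched = dot(normalize(a_stretched), normalize(b))

# The two values agree up to rounding: length does not affect the similarity.
print(math.isclose(sim, sim_stretched))
```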