Text Embeddings and Cosine Similarity

by Audrey M. Roy Greenfeld | Sun, Feb 2, 2025

I use Gemini's text-embedding-004 model to generate embeddings for sentences. Then I define a cosine similarity function and see what it returns for a few pairs of related and unrelated sentences.


Setup

from fastcore.utils import *
import google.generativeai as genai
import matplotlib.pyplot as plt
import numpy as np
import os
import seaborn as sns
genai.configure(api_key=os.getenv('GEMINI_API_KEY'))
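
If you're reproducing this, the imports above come from these PyPI packages (the install names below are my best recollection, so double-check them):

# pip install fastcore google-generativeai matplotlib numpy seaborn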

Generating an Embedding

result = genai.embed_content(model="models/text-embedding-004", content="What is the meaning of life?")
str(result)[:100]
e = L(result['embedding'])
e
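
The embedding is a long list of floats. As a quick sanity check, we can see how many dimensions it has (text-embedding-004 should produce 768-dimensional vectors, if I'm remembering the model's spec correctly):

len(e)  # number of dimensions in the embedding vector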

Generate More Embeddings

sentences = [
    "Why do we exist?",
    "What's the purpose of existence?",
    "What is the meaning of cookies?",
    "How do I bake cookies?",
    "What's the meaning of life?"
]

Get an embedding for each sentence:

def embed(s): return genai.embed_content("models/text-embedding-004", s)
ems = L(sentences).map(embed)
ems
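
Each result is a dict like the single-embedding result above, so we can peek at the keys of the first one to confirm where the vector lives:

ems[0].keys()  # each result dict contains an 'embedding' key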

Convert the embeddings to numpy arrays:

vectors = L(ems).attrgot('embedding').map(np.array)
vectors

Cosine Similarity

def cos_sim(a,b): 
    "Normalized dot product of 2 embedding vectors"
    return (a@b)/(np.linalg.norm(a)*np.linalg.norm(b))

Credit: This function is from How to Solve It With Code, Lesson 14.
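
Before trusting it on embeddings, we can sanity-check cos_sim on toy vectors where the answer is obvious: identical directions give 1.0, orthogonal directions give 0.0, and opposite directions give -1.0 (my own quick check, not part of the lesson):

a, b = np.array([1., 0.]), np.array([0., 1.])
cos_sim(a, a), cos_sim(a, b), cos_sim(a, -a)  # (1.0, 0.0, -1.0)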

I looked for a predefined version of cos_sim to import and use. The closest I found was scikit-learn's cosine_similarity function, which returns a whole similarity matrix rather than a single score. In a future notebook I'd love to use that. Right now, however, I just want to see how similar 2 embeddings are.
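
For reference, here's a sketch of what that scikit-learn version would look like (assuming scikit-learn is installed; cosine_similarity takes a 2D array of row vectors and returns all pairwise similarities in one call):

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(np.stack(list(vectors)))  # 5x5 matrix of pairwise similarities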

2 Existential Questions: High Similarity

f"{sentences[0]} vs. {sentences[1]}"
cos_sim(vectors[0],vectors[1])

Existential Question vs. Cookie Question: Low Similarity

f"{sentences[0]} vs. {sentences[3]}"
cos_sim(vectors[0],vectors[3])

2 "Meaning of" Questions: Low Similarity

f"{sentences[4]} vs. {sentences[2]}"
cos_sim(vectors[4],vectors[2])

Similarity Matrix

Comparing every sentence with every other sentence, including itself:

sim_matrix = np.zeros((len(vectors), len(vectors)))
for i in range(len(vectors)):
    for j in range(len(vectors)):
        sim_matrix[i,j] = cos_sim(vectors[i], vectors[j])
sim_matrix
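
The double loop reads clearly, but the same matrix can be computed in one shot by unit-normalizing the rows and taking a single matrix product (a vectorized sketch, not the original code):

M = np.stack(list(vectors))
M = M / np.linalg.norm(M, axis=1, keepdims=True)  # make each row a unit vector
M @ M.T  # all pairwise cosine similarities at once; should match sim_matrix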

Now we can plot a heatmap of these, to build our intuition:

plt.figure(figsize=(10,8))
sns.heatmap(sim_matrix, annot=True, fmt='.2f', 
            xticklabels=sentences, yticklabels=sentences)
plt.title('Sentence Similarity Matrix')
plt.tight_layout()
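
If you run this as a plain script rather than in a notebook (where the figure renders automatically), you'd also want to display or save it:

plt.show()  # or: plt.savefig('similarity_heatmap.png')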

Summary

Embeddings can capture meaningful semantic relationships between sentences or other text: the 2 existential questions scored high, while the cookie questions scored low against them, even where the surface wording looked similar.