The Problem We're Solving
You have 10,000 document chunks — all embedded as vectors — sitting in a vector database.
A user asks a question. You embed that question too.
Now you have one question vector and 10,000 document vectors.
How do you find which documents are most similar to the question?
You need a way to measure the distance — or similarity — between vectors.
There are three main ways to do this:
1. Cosine Similarity
2. Euclidean Distance
3. Dot Product
Each one measures "closeness" slightly differently. By the end of this module you'll know exactly what each one does and when to use which.
First — A Simple Setup
Let's use tiny 2D vectors so we can actually visualize what's happening.
Imagine these four pieces of content embedded as 2D vectors:
"Cats are great pets" → A = (3, 4)
"Dogs make wonderful pets" → B = (2, 5)
"Pizza is delicious food" → C = (8, 1)
"I love my cat" → D = (4, 3)
User question:
"Tell me about pet cats" → Q = (3, 3)
Visualized:
  ↑
5 │     B (2,5)
  │
4 │        A (3,4)
  │
3 │        Q  D      ← Q (3,3) is the question, D is (4,3)
  │
2 │
  │
1 │                       C (8,1)
  │
0 └─────────────────────────→
  0  1  2  3  4  5  6  7  8
Just by looking — A, B, D are close to Q. C is far away.
That makes sense — A, B, D are about pets. C is about pizza.
Now let's see how each metric measures this.
Method 1 — Euclidean Distance
What it is
Euclidean distance is the straight line distance between two points.
You've used this your whole life without knowing it had a name: draw a straight line between two points and measure how long it is.
  ↑
4 │        • A (3,4)
  │        │
3 │        • Q (3,3)
  │
  └──────────────→
  0  1  2  3
The straight line from Q to A — that's the Euclidean distance.
The Formula
Don't worry about memorizing this — just understand what it does:
Distance = √[(x2-x1)² + (y2-y1)²]
Q = (3,3), A = (3,4):
Distance = √[(3-3)² + (4-3)²]
= √[0 + 1]
= √1
= 1.0
Q = (3,3), C = (8,1):
Distance = √[(8-3)² + (1-3)²]
= √[25 + 4]
= √29
≈ 5.39
What the Numbers Mean
Euclidean Distance:
→ Small number = close = SIMILAR
→ Large number = far = DIFFERENT
Q to A = 1.0 ← very close, very similar ✓
Q to B = 2.24 ← fairly close, similar ✓
Q to D = 1.0 ← very close, very similar ✓
Q to C = 5.39 ← far away, not similar ✗
Euclidean distance correctly finds that A and D are the most similar to Q.
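If you'd like to check these numbers yourself, here is a minimal sketch in JavaScript. The function name euclideanDistance and the docs object are just names picked for this illustration; they don't come from any library.

```javascript
// Straight-line (Euclidean) distance between two vectors of equal length.
function euclideanDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  // Sum the squared differences per dimension, then take the square root.
  const sumOfSquares = a.reduce((sum, val, i) => sum + (val - b[i]) ** 2, 0);
  return Math.sqrt(sumOfSquares);
}

// The toy vectors from the setup above.
const Q = [3, 3];
const docs = { A: [3, 4], B: [2, 5], C: [8, 1], D: [4, 3] };

for (const [name, vector] of Object.entries(docs)) {
  console.log(`Q to ${name}:`, euclideanDistance(Q, vector).toFixed(2));
}
// Q to A: 1.00
// Q to B: 2.24
// Q to C: 5.39   (√29 ≈ 5.385)
// Q to D: 1.00
```

The same function works for vectors of any size, because reduce simply walks every dimension.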
The Problem With Euclidean Distance
Euclidean distance cares about both direction AND magnitude (how long the vector is).
This can be a problem.
Imagine:
Short document (2 sentences about cats):
→ Vector: (1, 2) ← small numbers, short doc
Long document (50 sentences about cats):
→ Vector: (8, 16) ← large numbers, long doc
Same topic. But very different vector lengths.
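Euclidean distance would say they're far apart, even though they're about the exact same thing.
For text, some ways of building vectors (raw word counts, for example) produce larger vectors for longer documents, and embeddings in general can differ in length unless they are normalized. Euclidean distance penalizes that difference unfairly.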
This is why we usually prefer Cosine Similarity for text.
Method 2 — Cosine Similarity
The Core Idea
Cosine Similarity ignores vector length completely. It only cares about direction.
Think of it like this — two arrows pointing in the same direction are similar, even if one arrow is short and one is long.
Same direction, different lengths:
→ →→→→→
These are "similar" in cosine terms — same direction.
Completely different directions:
→
↑
These are "not similar" — different directions.
The Angle Between Vectors
Cosine Similarity measures the angle between two vectors.
Small angle (pointing roughly same direction):
→ HIGH similarity (close to 1.0)
Large angle (pointing different directions):
→ LOW similarity (close to 0)
Opposite directions (180° angle):
→ NEGATIVE similarity (close to -1.0)
Visualized:
↑
│ ↗ Document about cats (3,4)
│ ↗
│↗ Question about cats (3,3)
└──────────────→
Small angle between them → HIGH cosine similarity
↑
│↗ Question about cats (3,3)
│
│
└──────────────→ Pizza document (8,1)
Large angle → LOW cosine similarity
What the Numbers Mean
Cosine Similarity range: -1.0 to 1.0
1.0 = identical direction = very similar
0.7 = small angle = fairly similar
0.3 = medium angle = somewhat similar
0.0 = 90° angle = not related at all
-1.0 = opposite direction = opposite meaning
For our example:
Q and A (both about cats/pets):
→ Cosine similarity ≈ 0.99 ← almost identical direction ✓
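Q and C (pets vs pizza):
→ Cosine similarity ≈ 0.79 ← clearly lower, different direction ✗
(With only two dimensions and all-positive numbers, scores can't drop very low. What matters is that C ranks last.)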
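To see how all four toy documents rank against the question, here is a small sketch. The names cosine, question, documents and ranked are made up for this example. It prints one score per document, highest first; A and D should tie at the top and the pizza document C should come out last.

```javascript
// Cosine similarity: dot product divided by the product of the two lengths.
function cosine(a, b) {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const lengthA = Math.hypot(...a); // square root of the sum of squares
  const lengthB = Math.hypot(...b);
  return dot / (lengthA * lengthB);
}

const question = [3, 3];
const documents = { A: [3, 4], B: [2, 5], C: [8, 1], D: [4, 3] };

// Score every document against the question, highest score first.
const ranked = Object.entries(documents)
  .map(([name, vector]) => ({ name, score: cosine(question, vector) }))
  .sort((x, y) => y.score - x.score);

for (const { name, score } of ranked) {
  console.log(name, score.toFixed(2));
}
```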
Why Cosine Similarity is Best for Text
Short document: "I love cats" → (1, 2)
Long document: 50 sentences about cats → (8, 16)
Both point in the SAME direction — just different lengths.
(1,2) and (8,16) have the SAME angle from origin.
→ Cosine similarity = 1.0 (perfect match)
→ Euclidean distance = large (would say they're different)
Cosine Similarity correctly identifies them as the same topic.
This is why cosine similarity is the default choice for text embeddings in RAG systems.
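Here is the short-versus-long example as a quick sketch, using the same made-up vectors (1, 2) and (8, 16). The helper names dot and magnitude are assumptions for this illustration.

```javascript
const shortDoc = [1, 2];  // short text about cats
const longDoc = [8, 16];  // long text about cats: same direction, 8x the length

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const magnitude = (v) => Math.sqrt(dot(v, v)); // length of a vector

// Cosine looks only at the angle between the two vectors.
const cosineScore =
  dot(shortDoc, longDoc) / (magnitude(shortDoc) * magnitude(longDoc));

// Euclidean distance is the length of the difference vector.
const euclidean = magnitude(shortDoc.map((x, i) => x - longDoc[i]));

console.log(cosineScore.toFixed(2)); // 1.00
console.log(euclidean.toFixed(2));   // 15.65
```

Same topic, same direction: cosine reports a perfect match while the Euclidean distance is large.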
Method 3 — Dot Product
What it is
Dot Product is the simplest calculation of the three.
You multiply each pair of matching numbers and add them all up.
Vector A = [3, 4]
Vector B = [2, 5]
Dot Product = (3×2) + (4×5)
= 6 + 20
= 26
That's it. No square roots. No angles. Just multiply and add.
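In JavaScript that could look like the sketch below. The function name dotProduct is chosen for this example.

```javascript
// Multiply matching numbers and add the results.
function dotProduct(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

console.log(dotProduct([3, 4], [2, 5])); // 26
```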
Dot Product vs Cosine Similarity
Here's the relationship between them:
Dot Product = Cosine Similarity × (length of A) × (length of B)
So Dot Product is basically Cosine Similarity — but it also considers how long the vectors are.
This means:
If two vectors point in the same direction:
→ Both have high Dot Product AND high Cosine Similarity
But if one vector is much longer:
→ Dot Product becomes much larger
→ Cosine Similarity stays the same
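A quick way to convince yourself is to stretch one vector and watch both numbers. This sketch reuses A = (3, 4) and B = (2, 5) from above; bTimesTen and the helper names are invented for the illustration.

```javascript
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const magnitude = (v) => Math.sqrt(dot(v, v));
const cosine = (a, b) => dot(a, b) / (magnitude(a) * magnitude(b));

const a = [3, 4];
const b = [2, 5];
const bTimesTen = b.map((x) => x * 10); // same direction, 10x longer

console.log(dot(a, b));                       // 26
console.log(dot(a, bTimesTen));               // 260
console.log(cosine(a, b).toFixed(4));         // 0.9656
console.log(cosine(a, bTimesTen).toFixed(4)); // 0.9656

// Check the relationship: dot product = cosine × length of A × length of B
console.log((cosine(a, b) * magnitude(a) * magnitude(b)).toFixed(4)); // 26.0000
```

The dot product grew tenfold, but the cosine similarity did not move.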
When to Use Dot Product
Dot Product is best when vectors are normalized — meaning they all have the same length (length = 1).
Normalized vectors: all have length 1.0
→ Dot Product = Cosine Similarity (they become the same thing)
→ Cheaper to compute (no square roots or division needed)
Many modern embedding models output normalized vectors by default. When this is the case — Dot Product is preferred because:
→ Same result as Cosine Similarity
→ Faster computation
→ Vector databases can optimize it better
OpenAI's text-embedding-3-small outputs normalized vectors. So when using OpenAI embeddings — Dot Product works just as well as Cosine Similarity, and is often faster.
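If your vectors are not normalized yet, you can normalize them yourself. The sketch below is one way to do it; normalize, unitA and unitB are names invented for this example, and the two vectors are the toy A and B from earlier, not real embeddings.

```javascript
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Scale a vector so its length becomes 1 (its direction stays the same).
function normalize(v) {
  const magnitude = Math.sqrt(dot(v, v));
  if (magnitude === 0) {
    throw new Error("Cannot normalize a zero vector");
  }
  return v.map((x) => x / magnitude);
}

const a = [3, 4];
const b = [2, 5];

// Cosine similarity of the original vectors.
const cosine = dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));

// Plain dot product of the normalized vectors.
const unitA = normalize(a);
const unitB = normalize(b);

console.log(Math.sqrt(dot(unitA, unitA)).toFixed(4)); // 1.0000
console.log(cosine.toFixed(4));                       // 0.9656
console.log(dot(unitA, unitB).toFixed(4));            // 0.9656
```

Once every vector has length 1, the division in the cosine formula is a division by 1, which is why the two scores match.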
The Three Methods — Side by Side
┌─────────────────┬──────────────────┬────────────────────┐
│ │ │ │
│ EUCLIDEAN │ COSINE │ DOT PRODUCT │
│ DISTANCE │ SIMILARITY │ │
│ │ │ │
├─────────────────┼──────────────────┼────────────────────┤
│ │ │ │
│ Straight line │ Angle between │ Multiply + add │
│ between points │ two vectors │ matching numbers │
│ │ │ │
├─────────────────┼──────────────────┼────────────────────┤
│ │ │ │
│ Small = similar │ Close to 1 = │ Higher = more │
│ Large = diff │ similar │ similar │
│ │ Close to 0 = │ │
│ │ different │ │
│ │ │ │
├─────────────────┼──────────────────┼────────────────────┤
│ │ │ │
│ Affected by │ Ignores vector │ Affected by │
│ vector length │ length │ vector length │
│ │ │ │
├─────────────────┼──────────────────┼────────────────────┤
│ │ │ │
│ Good for: │ Good for: │ Good for: │
│ Physical/ │ Text embeddings │ Normalized │
│ geographic data │ (most common) │ vectors (fastest) │
│ │ │ │
└─────────────────┴──────────────────┴────────────────────┘
In Code — How This Looks
Here's how you calculate cosine similarity between two embeddings in JavaScript:
// Calculate cosine similarity between two vectors
function cosineSimilarity(vectorA, vectorB) {
  // Step 1: Dot product (multiply matching numbers, add them up)
  const dotProduct = vectorA.reduce(
    (sum, val, i) => sum + val * vectorB[i],
    0
  );

  // Step 2: Length of each vector
  const magnitudeA = Math.sqrt(
    vectorA.reduce((sum, val) => sum + val * val, 0)
  );
  const magnitudeB = Math.sqrt(
    vectorB.reduce((sum, val) => sum + val * val, 0)
  );

  // Step 3: Cosine similarity = dot product / (length A × length B)
  return dotProduct / (magnitudeA * magnitudeB);
}

// Example usage
const catEmbedding = [0.9, 0.8, 0.1, 0.05];   // "cats are great pets"
const dogEmbedding = [0.8, 0.9, 0.1, 0.06];   // "dogs make wonderful pets"
const pizzaEmbedding = [0.1, 0.2, 0.9, 0.80]; // "pizza is delicious"

const catVsDog = cosineSimilarity(catEmbedding, dogEmbedding);
const catVsPizza = cosineSimilarity(catEmbedding, pizzaEmbedding);

console.log("Cat vs Dog similarity:  ", catVsDog);   // → ≈ 0.99 (very similar)
console.log("Cat vs Pizza similarity:", catVsPizza); // → ≈ 0.26 (very different)
In practice — your vector database does this calculation for you automatically. You don't calculate it manually. But knowing what's happening underneath makes you a better developer.
How Vector Search Actually Works
Here's the full flow of what happens when a user asks a question in a RAG system:
Step 1 — User asks a question
"What are the side effects of aspirin?"
Step 2 — Embed the question
Question → [0.23, 0.87, 0.41, ...] (1536 numbers)
Step 3 — Compare against all stored embeddings
Vector DB calculates similarity between question
vector and every stored document vector
Step 4 — Rank by similarity score
Doc 1: "Aspirin risks and warnings" → 0.94 ← very similar
Doc 2: "Common medication side effects" → 0.87 ← similar
Doc 3: "History of aspirin" → 0.71 ← somewhat similar
Doc 4: "Pizza recipe" → 0.12 ← not similar
Doc 5: "Car maintenance guide" → 0.08 ← not similar
Step 5 — Return top K results (e.g. top 3)
Returns Doc 1, Doc 2, Doc 3
Step 6 — Feed to LLM with question
LLM generates answer based on these relevant chunks
This is Top-K search — return the K most similar results. We'll see this in detail in Phase 4.
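To make steps 3 to 5 concrete, here is a brute-force sketch: score every stored chunk, sort, keep the best K. It is only an illustration. The topK helper is hypothetical, the 3-number vectors are made up to stand in for real 1536-number embeddings, and a real vector database typically uses an index instead of comparing against every vector one by one.

```javascript
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (magA * magB);
}

// Hypothetical helper: score every chunk, sort by score, keep the k best.
function topK(queryVector, chunks, k) {
  return chunks
    .map((chunk) => ({
      text: chunk.text,
      score: cosineSimilarity(queryVector, chunk.vector),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Made-up vectors for illustration only (not real embeddings).
const chunks = [
  { text: "History of aspirin", vector: [0.6, 0.5, 0.3] },
  { text: "Pizza recipe", vector: [0.1, 0.1, 0.9] },
  { text: "Aspirin risks and warnings", vector: [0.9, 0.4, 0.1] },
  { text: "Common medication side effects", vector: [0.7, 0.7, 0.1] },
  { text: "Car maintenance guide", vector: [0.05, 0.2, 0.9] },
];
const questionVector = [0.9, 0.5, 0.1]; // stands in for the embedded question

const results = topK(questionVector, chunks, 3);
console.log(results.map((r) => r.text));
// [ 'Aspirin risks and warnings',
//   'Common medication side effects',
//   'History of aspirin' ]
```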
Which One Should You Use?
Simple decision guide:
Are you using OpenAI embeddings?
→ YES → Use Cosine Similarity or Dot Product
(both work equally well, dot product is faster)
Are your vectors normalized (length = 1)?
→ YES → Use Dot Product (faster, same result as cosine)
→ NO → Use Cosine Similarity (safer, handles different lengths)
Are you working with geographic or physical data?
→ YES → Use Euclidean Distance
Are you working with text for RAG?
→ Almost always → Cosine Similarity ✓
For everything we build in this course — RAG, semantic search, document retrieval — Cosine Similarity is the default choice.
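If you're not sure whether your vectors are normalized, you can simply check. The two helpers below, allNormalized and suggestMetric, are hypothetical names for this sketch. They cover only the normalized-or-not branch of the guide for text embeddings, and the tolerance value is an assumption you may want to adjust.

```javascript
// Hypothetical helper: true when every vector has length 1 (within a tolerance).
function allNormalized(vectors, tolerance = 1e-6) {
  return vectors.every((v) => {
    const magnitude = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
    return Math.abs(magnitude - 1) <= tolerance;
  });
}

// Hypothetical helper: normalized vectors → dot product, otherwise cosine.
function suggestMetric(vectors) {
  return allNormalized(vectors) ? "dot product" : "cosine similarity";
}

console.log(suggestMetric([[0.6, 0.8], [1, 0]])); // dot product
console.log(suggestMetric([[3, 4], [2, 5]]));     // cosine similarity
```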
A Real Life Analogy — Finding Similar Songs
Think of Spotify's "similar songs" feature.
Every song has an embedding — capturing tempo, energy, mood, genre, instruments, vocals.
Song embeddings (simplified):
"Bohemian Rhapsody" → [0.9, 0.3, 0.8, 0.7]
(rock, complex, dramatic, energetic)
"Stairway to Heaven" → [0.8, 0.4, 0.7, 0.6]
(rock, complex, dramatic, medium energy)
"Baby Shark" → [0.1, 0.9, 0.1, 0.2]
(pop, simple, happy, low energy)
Cosine similarity:
Bohemian Rhapsody vs Stairway to Heaven → 0.99 (very similar direction)
Bohemian Rhapsody vs Baby Shark → 0.44 (much less similar direction)
A recommender working this way would suggest Stairway to Heaven after Bohemian Rhapsody, not Baby Shark.
This is cosine similarity in action, applied at scale across millions of songs.
3-Line Summary
- Euclidean Distance measures straight-line distance between points — small number means similar — but it's affected by vector length, making it less ideal for text of different sizes.
- Cosine Similarity measures the angle between two vectors — close to 1.0 means very similar — it ignores vector length so a short and long document about the same topic correctly score as similar.
- Dot Product is the fastest calculation and equals Cosine Similarity when vectors are normalized — use it with OpenAI embeddings since they output normalized vectors by default.
Module 3.3 — Complete ✅
Coming up — Module 3.4 — Practical: Convert Words to Vectors and Compare Them
This is the hands-on module of Phase 3. We write real code — generate actual embeddings using the OpenAI API, compare them using cosine similarity, and SEE the numbers that prove similar words have similar vectors. You'll finally feel embeddings in a tangible way.