Module 3.3 — Similarity Search: Cosine, Euclidean & Dot Product

The Problem We're Solving

You have 10,000 document chunks — all embedded as vectors — sitting in a vector database.

A user asks a question. You embed that question too.

Now you have one question vector and 10,000 document vectors.

How do you find which documents are most similar to the question?

You need a way to measure the distance — or similarity — between vectors.

There are three main ways to do this:

1. Cosine Similarity
2. Euclidean Distance
3. Dot Product

Each one measures "closeness" slightly differently. By the end of this module you'll know exactly what each one does and when to use which.


First — A Simple Setup

Let's use tiny 2D vectors so we can actually visualize what's happening.

Imagine these four pieces of content embedded as 2D vectors:

"Cats are great pets"          → A = (3, 4)
"Dogs make wonderful pets"     → B = (2, 5)
"Pizza is delicious food"      → C = (8, 1)
"I love my cat"                → D = (4, 3)

User question:
"Tell me about pet cats"       → Q = (3, 3)

Visualized:

        ↑
    5   │       • B
        │
    4   │           • A
        │
    3   │           • Q • D
        │
    2   │
        │
    1   │                               • C
        │
    0   └───────────────────────────────────→
        0   1   2   3   4   5   6   7   8

(Q is the question. Coordinates are listed above.)

Just by looking — A, B, D are close to Q. C is far away.

That makes sense — A, B, D are about pets. C is about pizza.

Now let's see how each metric measures this.


Method 1 — Euclidean Distance

What it is

Euclidean distance is the straight line distance between two points.

You've used this your whole life without knowing it had a name. It's literally just — draw a straight line between two points, how long is it?

        ↑
    4   │   • A (3,4)
        │   |
    3   │   • Q (3,3)
        │
        └──────────────→
        0   1   2   3

The straight line from Q to A — that's the Euclidean distance.

The Formula

Don't worry about memorizing this — just understand what it does:

Distance = √[(x2-x1)² + (y2-y1)²]

Q = (3,3), A = (3,4):
Distance = √[(3-3)² + (4-3)²]
         = √[0 + 1]
         = √1
         = 1.0

Q = (3,3), C = (8,1):
Distance = √[(8-3)² + (1-3)²]
         = √[25 + 4]
         = √29
         ≈ 5.39

What the Numbers Mean

Euclidean Distance:
→ Small number = close = SIMILAR
→ Large number = far = DIFFERENT

Q to A = 1.0    ← very close, very similar ✓
Q to B = 2.24   ← fairly close, similar ✓
Q to D = 1.0    ← very close, very similar ✓
Q to C = 5.39   ← far away, not similar ✗

Euclidean distance correctly finds that A and D (tied at 1.0) are most similar to Q, and that C is the least similar.
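
To check these numbers yourself, here is a minimal JavaScript sketch. The helper name `euclideanDistance` and the `docs` object are illustrative choices made for this example, not part of any library. The function works for any number of dimensions, not just 2D.

```javascript
// Euclidean distance between two vectors with the same number of dimensions.
function euclideanDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  // Sum the squared differences per dimension, then take the square root.
  const sumOfSquares = a.reduce((sum, val, i) => sum + (val - b[i]) ** 2, 0);
  return Math.sqrt(sumOfSquares);
}

// The toy vectors from the setup section.
const Q = [3, 3];
const docs = { A: [3, 4], B: [2, 5], C: [8, 1], D: [4, 3] };

// Rank the documents from nearest to farthest.
const ranked = Object.entries(docs)
  .map(([label, vec]) => ({ label, distance: euclideanDistance(Q, vec) }))
  .sort((x, y) => x.distance - y.distance);

for (const { label, distance } of ranked) {
  console.log(`Q to ${label}: ${distance.toFixed(2)}`);
}
// Q to A: 1.00
// Q to D: 1.00
// Q to B: 2.24
// Q to C: 5.39
```

Sorting in ascending order matters here: with a distance, the smallest value is the best match.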

The Problem With Euclidean Distance

Euclidean distance cares about both direction AND magnitude (how long the vector is).

This can be a problem.

Imagine:
Short document (2 sentences about cats):
→ Vector: (1, 2)  ← small numbers, short doc

Long document (50 sentences about cats):
→ Vector: (8, 16) ← large numbers, long doc

Same topic. But very different vector lengths.
Euclidean distance would say they're far apart — even 
though they're about the exact same thing.

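With count-based text vectors (word counts, for example), longer documents naturally produce larger vectors, and raw embeddings can also differ in length for reasons that have nothing to do with topic. Euclidean distance penalizes this unfairly.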

This is why we usually prefer Cosine Similarity for text.


Method 2 — Cosine Similarity

The Core Idea

Cosine Similarity ignores vector length completely. It only cares about direction.

Think of it like this — two arrows pointing in the same direction are similar, even if one arrow is short and one is long.

Same direction, different lengths:
→ →→→→→

These are "similar" in cosine terms — same direction.

Completely different directions:
→
↑
These are "not similar" — different directions.

The Angle Between Vectors

Cosine Similarity is the cosine of the angle between two vectors, so the score depends only on how far apart their directions are.

Small angle (pointing roughly same direction):
→ HIGH similarity (close to 1.0)

Large angle (close to 90°, pointing in unrelated directions):
→ LOW similarity (close to 0)

Opposite directions (180° angle):
→ NEGATIVE similarity (close to -1.0)

Visualized:

        ↑
        │    ↗ Document about cats (3,4)
        │  ↗  
        │↗ Question about cats (3,3)
        └──────────────→

Small angle between them → HIGH cosine similarity

        ↑
        │↗ Question about cats (3,3)
        │
        │
        └──────────────→ Pizza document (8,1)

Larger angle (about 38°) → LOWER cosine similarity

What the Numbers Mean

Cosine Similarity range: -1.0 to 1.0

1.0   = identical direction (0° angle) = very similar
0.7   = about 45° angle = fairly similar
0.3   = about 73° angle = weakly similar
0.0   = 90° angle = not related at all
-1.0  = opposite direction (180° angle) = opposite meaning (rare with real text embeddings)

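For our example:

Q and A (both about cats/pets):
→ Cosine similarity ≈ 0.99  ← almost identical direction ✓

Q and C (pets vs pizza):
→ Cosine similarity ≈ 0.79  ← clearly lower, different direction ✗

The gap looks small because every toy vector here has only two positive numbers, so they all point into the same quarter of the plane. Real embeddings have hundreds of dimensions, which tends to leave far more room for unrelated content to point elsewhere.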
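
To make the "angle" idea concrete, here is a small sketch that measures the direction of each toy vector and takes the cosine of the angle between them. It only works for 2D vectors, and the helper names (`directionDegrees`, `angleBetween`, `cosineFromAngle`) are made up for this illustration. For real embeddings you would use the dot-product formula instead.

```javascript
// Direction of a 2D vector: its angle from the x-axis, in degrees.
function directionDegrees([x, y]) {
  return (Math.atan2(y, x) * 180) / Math.PI;
}

// Angle between two 2D vectors, in degrees (0 to 180).
function angleBetween(a, b) {
  const diff = Math.abs(directionDegrees(a) - directionDegrees(b));
  return diff > 180 ? 360 - diff : diff;
}

// Cosine similarity is the cosine of that angle.
function cosineFromAngle(a, b) {
  return Math.cos((angleBetween(a, b) * Math.PI) / 180);
}

// The toy vectors from the setup section.
const Q = [3, 3];
const docs = { A: [3, 4], B: [2, 5], C: [8, 1], D: [4, 3] };

for (const [label, vec] of Object.entries(docs)) {
  const angle = angleBetween(Q, vec).toFixed(1);
  const cosine = cosineFromAngle(Q, vec).toFixed(2);
  console.log(`Q vs ${label}: angle ${angle}°, cosine ${cosine}`);
}
// Q vs A: angle 8.1°, cosine 0.99
// Q vs B: angle 23.2°, cosine 0.92
// Q vs C: angle 37.9°, cosine 0.79
// Q vs D: angle 8.1°, cosine 0.99
```

Notice that the ranking is the same as with Euclidean distance here (A and D first, C last), because the toy vectors all have similar lengths.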

Why Cosine Similarity is Best for Text

Short document: "I love cats" → (1, 2)
Long document:  50 sentences about cats → (8, 16)

Both point in the SAME direction — just different lengths.

(1,2) and (8,16) have the SAME angle from origin.
→ Cosine similarity = 1.0 (perfect match)
→ Euclidean distance ≈ 15.65 (would say they're far apart)

Cosine Similarity correctly identifies them as the same topic.

This is why cosine similarity is the default choice for text embeddings in RAG systems.
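
Here is a quick sketch that runs both metrics on the short-document and long-document vectors from above. The helper names are illustrative, and the vectors are the made-up (1, 2) and (8, 16) from the example, not real embeddings.

```javascript
// Small vector helpers (names chosen for this example).
const dotProduct = (a, b) => a.reduce((sum, val, i) => sum + val * b[i], 0);
const magnitude = (v) => Math.sqrt(dotProduct(v, v));

const cosineSimilarity = (a, b) =>
  dotProduct(a, b) / (magnitude(a) * magnitude(b));

// Euclidean distance is the length of the difference vector.
const euclideanDistance = (a, b) => magnitude(a.map((val, i) => val - b[i]));

const shortDoc = [1, 2];  // 2 sentences about cats
const longDoc = [8, 16];  // 50 sentences about cats

console.log("Euclidean:", euclideanDistance(shortDoc, longDoc).toFixed(2)); // 15.65
console.log("Cosine:   ", cosineSimilarity(shortDoc, longDoc).toFixed(2));  // 1.00
```

Same two vectors, opposite verdicts: Euclidean distance sees them as far apart, cosine similarity sees a perfect match.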


Method 3 — Dot Product

What it is

Dot Product is the simplest calculation of the three.

You multiply each pair of matching numbers and add them all up.

Vector A = [3, 4]
Vector B = [2, 5]

Dot Product = (3×2) + (4×5)
            = 6 + 20
            = 26

That's it. No square roots. No angles. Just multiply and add.

Dot Product vs Cosine Similarity

Here's the relationship between them:

Dot Product = Cosine Similarity × (length of A) × (length of B)

So Dot Product is basically Cosine Similarity — but it also considers how long the vectors are.

This means:

If two vectors point in the same direction:
→ Both have high Dot Product AND high Cosine Similarity

But if one vector is much longer:
→ Dot Product becomes much larger
→ Cosine Similarity stays the same
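
A short sketch makes this visible. It stretches vector B to 10 times its length and compares both scores before and after. The helper names and the `longerB` vector are assumptions made for this example.

```javascript
// Small vector helpers (names chosen for this example).
const dotProduct = (a, b) => a.reduce((sum, val, i) => sum + val * b[i], 0);
const magnitude = (v) => Math.sqrt(dotProduct(v, v));
const cosineSimilarity = (a, b) =>
  dotProduct(a, b) / (magnitude(a) * magnitude(b));

const A = [3, 4];
const B = [2, 5];
const longerB = B.map((val) => val * 10); // same direction as B, 10x the length

console.log(dotProduct(A, B));       // 26
console.log(dotProduct(A, longerB)); // 260

console.log(cosineSimilarity(A, B).toFixed(3));       // 0.966
console.log(cosineSimilarity(A, longerB).toFixed(3)); // 0.966

// Dot Product = Cosine Similarity × (length of A) × (length of B)
const rebuilt = cosineSimilarity(A, B) * magnitude(A) * magnitude(B);
console.log(rebuilt.toFixed(2)); // 26.00
```

The dot product grew tenfold while the cosine similarity did not move, which is exactly the difference between the two metrics.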

When to Use Dot Product

Dot Product is best when vectors are normalized — meaning they all have the same length (length = 1).

Normalized vectors: all have length 1.0
→ Dot Product = Cosine Similarity (they become the same thing)
→ Much faster to compute (no square roots needed)

Many modern embedding models output normalized vectors by default. When this is the case — Dot Product is preferred because:

→ Same result as Cosine Similarity
→ Faster computation
→ Vector databases can optimize it better

OpenAI's text-embedding-3-small outputs normalized vectors. So when using OpenAI embeddings — Dot Product works just as well as Cosine Similarity, and is often faster.
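
If you are not sure whether your vectors are normalized, you can measure them. The sketch below shows one way to normalize a vector and to check for unit length. `normalize`, `isNormalized` and the tolerance value are illustrative choices for this example, not functions from any SDK.

```javascript
// Small vector helpers (names chosen for this example).
const dotProduct = (a, b) => a.reduce((sum, val, i) => sum + val * b[i], 0);
const magnitude = (v) => Math.sqrt(dotProduct(v, v));

// Scale a vector to length 1 while keeping its direction.
function normalize(v) {
  const len = magnitude(v);
  if (len === 0) throw new Error("Cannot normalize a zero vector");
  return v.map((val) => val / len);
}

// True when a vector already has length 1, within a small tolerance.
const isNormalized = (v, tolerance = 1e-6) =>
  Math.abs(magnitude(v) - 1) < tolerance;

const A = [3, 4];
const B = [2, 5];
const cosine = dotProduct(A, B) / (magnitude(A) * magnitude(B));

const unitA = normalize(A);
const unitB = normalize(B);

console.log(isNormalized(A));     // false (length is 5)
console.log(isNormalized(unitA)); // true

// After normalizing, the plain dot product gives the cosine similarity.
console.log(cosine.toFixed(4));                   // 0.9656
console.log(dotProduct(unitA, unitB).toFixed(4)); // 0.9656
```

Running a check like `isNormalized` on a few vectors from your embedding model is a cheap way to confirm that dot product is safe to use.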


The Three Methods — Side by Side

┌─────────────────┬──────────────────┬────────────────────┐
│                 │                  │                    │
│  EUCLIDEAN      │  COSINE          │  DOT PRODUCT       │
│  DISTANCE       │  SIMILARITY      │                    │
│                 │                  │                    │
├─────────────────┼──────────────────┼────────────────────┤
│                 │                  │                    │
│ Straight line   │ Angle between    │ Multiply + add     │
│ between points  │ two vectors      │ matching numbers   │
│                 │                  │                    │
├─────────────────┼──────────────────┼────────────────────┤
│                 │                  │                    │
│ Small = similar │ Close to 1 =     │ Higher = more      │
│ Large = diff    │ similar          │ similar            │
│                 │ Close to 0 =     │                    │
│                 │ different        │                    │
│                 │                  │                    │
├─────────────────┼──────────────────┼────────────────────┤
│                 │                  │                    │
│ Affected by     │ Ignores vector   │ Affected by        │
│ vector length   │ length           │ vector length      │
│                 │                  │                    │
├─────────────────┼──────────────────┼────────────────────┤
│                 │                  │                    │
│ Good for:       │ Good for:        │ Good for:          │
│ Physical/       │ Text embeddings  │ Normalized         │
│ geographic data │ (most common)    │ vectors (fastest)  │
│                 │                  │                    │
└─────────────────┴──────────────────┴────────────────────┘

In Code — How This Looks

Here's how you calculate cosine similarity between two embeddings in JavaScript:


    // Calculate cosine similarity between two vectors of equal length
    function cosineSimilarity(vectorA, vectorB) {
        // Step 1: Dot product (multiply matching numbers, add them up)
        const dotProduct = vectorA.reduce(
            (sum, val, i) => sum + val * vectorB[i], 0
        );

        // Step 2: Length of each vector
        const magnitudeA = Math.sqrt(
            vectorA.reduce((sum, val) => sum + val * val, 0)
        );
        const magnitudeB = Math.sqrt(
            vectorB.reduce((sum, val) => sum + val * val, 0)
        );

        // Step 3: Cosine similarity = dot product / (length A × length B)
        return dotProduct / (magnitudeA * magnitudeB);
    }

    // Example usage (made-up 4D vectors, not real model output)
    const catEmbedding   = [0.9, 0.8, 0.1, 0.05]; // "cats are great pets"
    const dogEmbedding   = [0.8, 0.9, 0.1, 0.06]; // "dogs make wonderful pets"
    const pizzaEmbedding = [0.1, 0.2, 0.9, 0.80]; // "pizza is delicious"

    const catVsDog   = cosineSimilarity(catEmbedding, dogEmbedding);
    const catVsPizza = cosineSimilarity(catEmbedding, pizzaEmbedding);

    console.log("Cat vs Dog similarity:  ", catVsDog.toFixed(2));   // → 0.99 (very similar)
    console.log("Cat vs Pizza similarity:", catVsPizza.toFixed(2)); // → 0.26 (very different)

In practice — your vector database does this calculation for you automatically. You don't calculate it manually. But knowing what's happening underneath makes you a better developer.


How Vector Search Actually Works

Here's the full flow of what happens when a user asks a question in a RAG system:

Step 1 — User asks a question
"What are the side effects of aspirin?"

Step 2 — Embed the question
Question → [0.23, 0.87, 0.41, ...] (1536 numbers)

Step 3 — Compare against all stored embeddings
Vector DB calculates similarity between the question
vector and the stored document vectors (conceptually
all of them; most databases use an index to avoid a full scan)

Step 4 — Rank by similarity score
Doc 1: "Aspirin risks and warnings"      → 0.94 ← very similar
Doc 2: "Common medication side effects"  → 0.87 ← similar
Doc 3: "History of aspirin"             → 0.71 ← somewhat similar
Doc 4: "Pizza recipe"                   → 0.12 ← not similar
Doc 5: "Car maintenance guide"          → 0.08 ← not similar

Step 5 — Return top K results (e.g. top 3)
Returns Doc 1, Doc 2, Doc 3

Step 6 — Feed to LLM with question
LLM generates answer based on these relevant chunks

This is Top-K search — return the K most similar results. We'll see this in detail in Phase 4.
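
Steps 3 to 5 can be sketched as a brute-force search: score every chunk, sort, keep the best K. This is only an illustration of the idea. The `topK` function, the `store` array and its 3D vectors are made up for this example, and a real vector database would typically use an index rather than scanning every vector.

```javascript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  return dot / (Math.hypot(...a) * Math.hypot(...b));
}

// Score every stored chunk against the query and keep the k best matches.
function topK(queryVector, chunks, k) {
  return chunks
    .map((chunk) => ({
      text: chunk.text,
      score: cosineSimilarity(queryVector, chunk.vector),
    }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}

// Made-up 3D vectors standing in for real 1536-number embeddings.
const store = [
  { text: "Pizza recipe", vector: [0.1, 0.1, 0.9] },
  { text: "Aspirin risks and warnings", vector: [0.9, 0.8, 0.1] },
  { text: "History of aspirin", vector: [0.9, 0.2, 0.3] },
  { text: "Common medication side effects", vector: [0.5, 0.9, 0.2] },
];
const question = [0.8, 0.9, 0.1]; // "What are the side effects of aspirin?"

for (const { text, score } of topK(question, store, 3)) {
  console.log(`${score.toFixed(2)}  ${text}`);
}
// 0.99  Aspirin risks and warnings
// 0.97  Common medication side effects
// 0.79  History of aspirin
```

Note the sort direction: similarity scores are sorted from high to low, the opposite of what you would do with a distance.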


Which One Should You Use?

Simple decision guide:

Are you using OpenAI embeddings?
→ YES → Use Cosine Similarity or Dot Product
         (both work equally well, dot product is faster)

Are your vectors normalized (length = 1)?
→ YES → Use Dot Product (faster, same result as cosine)
→ NO  → Use Cosine Similarity (safer, handles different lengths)

Are you working with geographic or physical data?
→ YES → Use Euclidean Distance

Are you working with text for RAG?
→ Almost always → Cosine Similarity ✓

For everything we build in this course — RAG, semantic search, document retrieval — Cosine Similarity is the default choice.


A Real Life Analogy — Finding Similar Songs

Think of Spotify's "similar songs" feature.

Every song has an embedding — capturing tempo, energy, mood, genre, instruments, vocals.

Song embeddings (simplified):

"Bohemian Rhapsody" → [0.9, 0.3, 0.8, 0.7]
                       (rock, complex, dramatic, energetic)

"Stairway to Heaven" → [0.8, 0.4, 0.7, 0.6]
                        (rock, complex, dramatic, medium energy)

"Baby Shark"         → [0.1, 0.9, 0.1, 0.2]
                        (pop, simple, happy, low energy)

Cosine similarity:

Bohemian Rhapsody vs Stairway to Heaven → 0.99 (very similar direction)
Bohemian Rhapsody vs Baby Shark         → 0.44 (very different direction)

A recommender built on these scores suggests Stairway to Heaven after Bohemian Rhapsody. Not Baby Shark.

This is the idea behind similarity-based recommendations, applied at scale across millions of songs.


3-Line Summary

  1. Euclidean Distance measures straight-line distance between points — small number means similar — but it's affected by vector length, making it less ideal for text of different sizes.
  2. Cosine Similarity measures the angle between two vectors — close to 1.0 means very similar — it ignores vector length so a short and long document about the same topic correctly score as similar.
  3. Dot Product is the fastest calculation and equals Cosine Similarity when vectors are normalized — use it with OpenAI embeddings since they output normalized vectors by default.

Module 3.3 — Complete ✅

Coming up — Module 3.4 — Practical: Convert Words to Vectors and Compare Them

This is the hands-on module of Phase 3. We write real code — generate actual embeddings using the OpenAI API, compare them using cosine similarity, and SEE the numbers that prove similar words have similar vectors. You'll finally feel embeddings in a tangible way.
