Module 3.2 — Vectors, Vector Space & Dimensions

Start With Something You Already Know

Remember Module 3.1 — we said an embedding is a list of numbers:

"cat" → [0.9, 0.8, 0.1, 0.05]
"dog" → [0.8, 0.9, 0.1, 0.06]

But what does this list of numbers actually mean geometrically?

This module answers that — by building up from something very simple.


Start With 1 Dimension

Imagine a straight line. Just a number line.

←──────────────────────────────→
-10    -5     0     5     10

Every point on this line can be described by one number.

Point A = 3
Point B = 7
Point C = -2

This is a 1-dimensional space. One number = one dimension = one axis.

Distance between A and B = |7 - 3| = 4, so they are 4 units apart.

Simple. You've known this since school.


Move to 2 Dimensions

Now add a second axis — vertical.

        ↑
    4   │         • C (2, 4)
        │
    2   │   • A (1, 2)
        │                    • B (5, 1)
    0   └──────────────────────────→
        0    1    2    3    4    5

Now every point needs two numbers to describe it:

Point A = (1, 2)   → 1 on horizontal, 2 on vertical
Point B = (5, 1)   → 5 on horizontal, 1 on vertical
Point C = (2, 4)   → 2 on horizontal, 4 on vertical

This is 2-dimensional space.

A and C are close together. B is far from both.

You can see this visually on the graph.
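
If you'd rather check "close" and "far" with numbers than by eye, here is a minimal sketch using only Python's standard library. It uses straight-line (Euclidean) distance, which is just one way to measure closeness, and the variable names are only for illustration.

```python
import math

# The three points from the 2D graph above
A = (1, 2)
B = (5, 1)
C = (2, 4)

# math.dist returns the straight-line (Euclidean) distance between two points
dist_ac = math.dist(A, C)
dist_ab = math.dist(A, B)
dist_bc = math.dist(B, C)

print(f"A to C: {dist_ac:.2f}")  # A to C: 2.24
print(f"A to B: {dist_ab:.2f}")  # A to B: 4.12
print(f"B to C: {dist_bc:.2f}")  # B to C: 4.24
```

A to C is the smallest number, which matches what the graph shows: A and C are neighbors, and B sits off on its own.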


Move to 3 Dimensions

Add one more axis — depth.

Now every point needs THREE numbers:

Point A = (1, 2, 3)
Point B = (5, 1, 0)
Point C = (2, 4, 2)

This is 3-dimensional space. Like the real physical world we live in — length, width, height.

You can still kind of visualize this — imagine a 3D room with coordinates.
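
Measuring distance in 3D needs nothing new: you just have one more axis to account for. The sketch below writes the straight-line distance formula out by hand (the helper name `distance` is my own choice, not something from a library) and applies it to the three points above.

```python
import math

A = (1, 2, 3)
B = (5, 1, 0)
C = (2, 4, 2)

def distance(p, q):
    """Square the difference on each axis, add them up, take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(f"A to B: {distance(A, B):.2f}")  # A to B: 5.10
print(f"A to C: {distance(A, C):.2f}")  # A to C: 2.45
print(f"B to C: {distance(B, C):.2f}")  # B to C: 4.69
```

Notice the function never mentions "3". It loops over however many numbers the points have.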


Now — What is a Vector?

A vector is simply an arrow from the origin (0,0) to a point.

2D example:

        ↑
    4   │
        │         ↗ vector for "cat" = (3, 4)
    2   │      ↗
        │   ↗
    0   └──────────────→
        0    1    2    3

The vector is the arrow. Its direction and length encode information.

Two vectors that point in similar directions → similar meaning. Two vectors pointing in very different directions → different meaning.

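In AI, an embedding is a vector: a list of numbers that marks a point in space (and a direction from the origin to that point). Strictly speaking, "vector" is the broader math term, and an embedding is a vector that a model produced to represent meaning.

In everyday AI conversations, though, people use "vector" and "embedding" interchangeably.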

Embedding = Vector = List of numbers = Point in space
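
To make "direction and length" concrete, here is a small sketch that computes both for the 2D "cat" vector (3, 4) from the diagram. It only uses Python's standard `math` module, and the variable names are illustrative.

```python
import math

cat = (3, 4)  # the 2D "cat" vector from the diagram

# Length: how far the arrow's tip is from the origin
length = math.hypot(*cat)

# Direction: the angle of the arrow, measured from the horizontal axis
angle_deg = math.degrees(math.atan2(cat[1], cat[0]))

print(length)               # 5.0
print(round(angle_deg, 1))  # 53.1
```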

Scaling Up — 1,536 Dimensions

Here's where people's brains freeze.

"I understand 2D and 3D — but 1,536 dimensions?? That's impossible to visualize."

You're right — you cannot visualize it. Nobody can.

But here's the key insight:

    The math works exactly the same way — whether it's 2 dimensions or 1,536 dimensions.

The rules don't change. Points that are close together are similar. Points far apart are different. Calculations work the same.

We just can't draw a picture of it.

Think of it like this — you understand that a city has a population number. You can't "see" a population — it's abstract. But you can still compare cities:

Mumbai population:  20,000,000
Delhi population:   32,000,000
My hometown:           50,000

Delhi > Mumbai > My hometown

You compared abstract numbers without needing to visualize them.

Same with 1,536-dimensional vectors — you compare them mathematically even though you can't see them.
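
Here is a rough sketch of what "the math works the same way" looks like in code. The same `distance` helper (a name I made up for this example) handles a 2-number vector and a 1,536-number vector without any changes. The long vectors are random stand-ins, not real embeddings.

```python
import math
import random

def distance(p, q):
    """Euclidean distance for vectors of any (matching) length."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 2 dimensions
print(distance([1, 2], [5, 1]))

# 1,536 dimensions: random placeholder numbers, NOT real embeddings
random.seed(0)
u = [random.random() for _ in range(1536)]
v = [random.random() for _ in range(1536)]
print(len(u), distance(u, v))
```

The only thing that changed between the two calls is the length of the lists.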


What Do the Dimensions Actually Represent?

This is a great question — and the honest answer is:

Nobody fully knows.

The dimensions are learned automatically during training. No human decides "dimension 1 = animal-ness" or "dimension 2 = size."

The model figures out on its own what each dimension should capture — based on billions of examples.

But researchers have found some interesting patterns. In simpler models you can sometimes see:

Dimension 1 might roughly capture:
→ Is this a living thing? (high = yes, low = no)

"cat"   → 0.9  (yes, living)
"dog"   → 0.8  (yes, living)
"car"   → 0.1  (no, not living)
"rock"  → 0.05 (no, not living)

Dimension 2 might roughly capture:
→ Is this a natural animal or something human-made? (high = animal, low = human-made)

"cat"   → 0.8  (natural animal)
"dog"   → 0.9  (natural animal)
"car"   → 0.2  (human-made)
"pizza" → 0.1  (human-made)

In reality, 1,536 dimensions capture incredibly subtle, complex patterns of meaning — far beyond what humans could label.
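
To make the "one dimension, one rough trait" idea concrete, here is a toy sketch using the made-up Dimension 1 scores from above. Real models do not label their dimensions like this; the dictionary and the 0.5 cutoff are purely illustrative assumptions.

```python
# Made-up "dimension 1" scores from the example above
# (roughly: is this a living thing?)
dimension_1 = {"cat": 0.9, "dog": 0.8, "car": 0.1, "rock": 0.05}

# Hypothetical cutoff, chosen only for this toy example
living = [word for word, score in dimension_1.items() if score > 0.5]
not_living = [word for word, score in dimension_1.items() if score <= 0.5]

print(living)      # ['cat', 'dog']
print(not_living)  # ['car', 'rock']
```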


Vector Space — The Full Picture

Vector space is just the name for the entire space where all vectors live.

1D vector space = a line
2D vector space = a flat plane (like a map)
3D vector space = a 3D room (like the real world)
1536D vector space = impossible to visualize but mathematically valid

When you embed 10,000 documents — you place 10,000 points into this vector space.

Vector Space (shown in simplified 2D):

                    ↑ technical
                    │
  • Python docs     │  • JavaScript docs
  • coding tutorial │  • React guide
                    │  • Node.js intro
────────────────────┼──────────────────────→ programming
                    │           specific
  • cooking recipe  │
  • pizza guide     │  • travel blog
  • food history    │  • hotel review
                    ↓ non-technical

Documents about similar topics cluster together in the space.

When a user asks a question — that question also gets placed as a point in the same space. The closest document points are the most relevant results.
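
A minimal sketch of that last step, assuming a tiny 2D space. The coordinates below are invented to loosely follow the picture above (real systems get them from an embedding model), and `nearest` is a hypothetical helper, not a vector database API.

```python
import math

# Hypothetical 2D coordinates, loosely following the picture above
documents = {
    "Python docs":    (-2.0, 2.0),
    "React guide":    (2.0, 1.5),
    "cooking recipe": (-2.0, -1.0),
    "pizza guide":    (-2.0, -1.5),
    "travel blog":    (2.0, -1.5),
}

def nearest(query_point, docs, k=2):
    """Return the names of the k documents whose points are closest to the query."""
    ranked = sorted(docs, key=lambda name: math.dist(query_point, docs[name]))
    return ranked[:k]

# Pretend the question "how do I make pizza at home?" landed at this point
query = (-1.8, -1.4)
print(nearest(query, documents))  # ['pizza guide', 'cooking recipe']
```

This version checks every document one by one, which is fine for five points. Real vector databases use indexes so they don't have to scan everything.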


Why This is Called "Semantic Space"

Because similar meanings end up close together — this vector space is also called Semantic Space or Embedding Space.

"Semantic" just means "related to meaning."

Semantic Space clusters:

Animals cluster:
• cat, dog, fish, bird, wolf → all near each other

Vehicles cluster:
• car, truck, bus, bike, train → all near each other

Food cluster:
• pizza, burger, pasta, rice → all near each other

The clusters emerge automatically from training.
Nobody created them manually.

A Real Life Analogy — The City Map

Think of embedding space like a city map.

Every business is a point on the map. Businesses of the same type naturally cluster in areas:

City Map:

[Hospital District]     [Tech District]
 • Hospital A            • Google HQ
 • Clinic B              • Startup X
 • Medical Lab           • Dev Agency

[Restaurant Row]        [University Area]
 • Italian place         • Main University
 • Pizza shop            • Library
 • Sushi bar             • Student cafe

You want to find "a place that treats injuries."

You don't search all businesses. You go to the Hospital District — because that's where similar businesses cluster.

Vector search works the same way. Your question goes to a "location" in embedding space. The nearest neighbors are the most relevant results.


Dimensions and Information Capacity

Here's an important practical point: more dimensions generally means more nuance captured, but you pay for it in speed and storage.

Low dimensions (e.g. 384):
→ Faster to compute
→ Cheaper to store
→ Less nuance in meaning
→ Good for simple use cases

High dimensions (e.g. 3,072):
→ Slower to compute
→ More expensive to store
→ More nuance in meaning
→ Better for complex use cases

For many applications, 1,536 dimensions (the default for OpenAI's text-embedding-3-small) is a reasonable sweet spot:

Dimension options for text-embedding-3-small:
→ You can actually REQUEST fewer dimensions
→ 512  dimensions: fast, cheap, decent quality
→ 1536 dimensions: balanced (default)

For text-embedding-3-large:
→ Up to 3,072 dimensions: best quality, more expensive
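
To get a feel for the storage side of this trade-off, here is a back-of-the-envelope sketch. It assumes each number is stored as a 32-bit float (4 bytes) and ignores any index or metadata overhead, so treat the results as rough lower bounds. The function name is my own.

```python
BYTES_PER_NUMBER = 4  # assumption: each dimension is stored as a 32-bit float

def storage_mb(num_vectors, dimensions):
    """Approximate raw vector storage in megabytes (1 MB = 1,000,000 bytes)."""
    return num_vectors * dimensions * BYTES_PER_NUMBER / 1_000_000

for dims in (384, 512, 1536, 3072):
    print(f"{dims:>5} dims x 10,000 vectors = {storage_mb(10_000, dims):.2f} MB")

# Output:
#   384 dims x 10,000 vectors = 15.36 MB
#   512 dims x 10,000 vectors = 20.48 MB
#  1536 dims x 10,000 vectors = 61.44 MB
#  3072 dims x 10,000 vectors = 122.88 MB
```

Doubling the dimensions doubles the storage, and every distance calculation has twice as many numbers to go through.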

Embeddings in the Same Space — Critical Rule

Here is a rule that will save you a lot of debugging pain later:

    You must always use the same embedding model for everything in one system.

When you embed documents → use model X
When you embed user questions → use the SAME model X

✓ Correct:
Documents embedded with: text-embedding-3-small
Questions embedded with: text-embedding-3-small
→ They live in the same space → comparison works

✗ Wrong:
Documents embedded with: text-embedding-3-small
Questions embedded with: text-embedding-3-large
→ Different spaces (and, by default, different lengths: 1,536 vs 3,072 numbers) → comparison fails or gives garbage results

Different models create different spaces. Mixing them is like comparing GPS coordinates from two different planets — the numbers mean completely different things.
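
One way to protect yourself is to store the model name next to every vector and check it before comparing. The sketch below is a hypothetical guard, not part of any library: the `Embedded` class, the `check_same_space` function, and the short placeholder vectors are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Embedded:
    text: str
    model: str    # name of the embedding model that produced the vector
    vector: list  # placeholder numbers here, not real embeddings

def check_same_space(a, b):
    """Raise an error if two embeddings cannot be meaningfully compared."""
    if a.model != b.model:
        raise ValueError(f"model mismatch: {a.model!r} vs {b.model!r}")
    if len(a.vector) != len(b.vector):
        raise ValueError("dimension mismatch")

doc = Embedded("refund policy", "text-embedding-3-small", [0.1, 0.2, 0.3])
good_q = Embedded("how do refunds work?", "text-embedding-3-small", [0.1, 0.2, 0.4])
bad_q = Embedded("how do refunds work?", "text-embedding-3-large", [0.1, 0.2, 0.4])

check_same_space(doc, good_q)  # same model, same length: passes silently

try:
    check_same_space(doc, bad_q)
except ValueError as err:
    print(err)  # model mismatch: 'text-embedding-3-small' vs 'text-embedding-3-large'
```

The length check also catches the case where the model name matches but the vectors were requested with different dimension counts.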


What "Similar" Actually Means Mathematically

We've been saying "close in space = similar meaning."

But how do you actually calculate closeness between two vectors?

That's what the next module (3.3) covers — Similarity Search and Distance Metrics.

For now just understand:

Three vectors:
A = [0.9, 0.8, 0.1, 0.05]   ("cat")
B = [0.8, 0.9, 0.1, 0.06]   ("dog")
C = [0.1, 0.2, 0.9, 0.80]   ("car")

A and B → numbers are very similar → close in space → similar meaning
A and C → numbers are very different → far in space → different meaning

There are specific mathematical formulas to calculate this closeness. We'll cover exactly those in Module 3.3.
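
As a small preview, here is a sketch that applies one such formula, plain straight-line (Euclidean) distance, to the cat, dog, and car vectors above. This is only one possible measure, chosen here because Python's standard library has it built in.

```python
import math

A = [0.9, 0.8, 0.1, 0.05]  # "cat"
B = [0.8, 0.9, 0.1, 0.06]  # "dog"
C = [0.1, 0.2, 0.9, 0.80]  # "car"

cat_dog = math.dist(A, B)
cat_car = math.dist(A, C)

print(f"cat to dog: {cat_dog:.3f}")  # cat to dog: 0.142
print(f"cat to car: {cat_car:.3f}")  # cat to car: 1.484
```

The cat-to-car distance is roughly ten times the cat-to-dog distance, which is the numeric version of "close in space" versus "far in space."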


Putting it All Together

Text
  ↓
Embedding Model
  ↓
Vector (list of 1536 numbers)
  ↓
Each number = one dimension
  ↓
The vector is a point in 1536-dimensional space
  ↓
Similar texts → similar vectors → close points in space
  ↓
Finding similar texts = finding close points
  ↓
This is called Vector Search / Semantic Search

3-Line Summary

  1. A vector is just a list of numbers — each number is one dimension — and together they describe a point in a multi-dimensional space where similar meanings live close together.
  2. Real embeddings have hundreds or thousands of dimensions — you can't visualize this, but the math works exactly the same as 2D or 3D space — close vectors mean similar meaning.
  3. All embeddings in one system must come from the same model — mixing models is like using two different maps for the same city — the coordinates won't match and your search will break.

Module 3.2 — Complete ✅

Coming up — Module 3.3 — Similarity Search: Cosine, Euclidean & Dot Product

You know what vectors are and how they live in space. Now we cover the three ways to measure how close two vectors are — cosine similarity, euclidean distance, and dot product. This is how your RAG system will actually find relevant documents. Simple explanations, real intuition, and you'll know exactly which one to use and when.
