Module 2.2 — Vocabulary, Embeddings, Parameters & Model Weights

Start With a Simple Question

After tokenization, the model has a list of numbers.

But here is the problem.

Let's say "cat" becomes token number 2481 and "dog" becomes token number 6391.

These are just ID numbers — like roll numbers in a class. They don't mean anything on their own.

Token 2481 and token 6391 are just... two different numbers. There is nothing in those numbers that tells the model:

  • "cat" and "dog" are both animals
  • "cat" and "dog" are similar to each other
  • "cat" and "pizza" are very different from each other

The model just sees 2481 and 6391: two arbitrary numbers with no relationship.

This is a big problem, because the meaning of words and the relationships between them are exactly what language understanding requires.

So how do we solve this?

This is where Embeddings come in.


First — What is a Vocabulary?

Before embeddings, let's quickly understand vocabulary.

The model's vocabulary is simply its complete list of all tokens it knows.

Token ID    Token
────────────────────
0           [START]
1           "the"
2           "a"
3           "is"
...
2481        "cat"
...
6391        "dog"
...
50,000      [END]

Think of it like a dictionary — but instead of definitions, it's just a numbered list of every token the model understands.

GPT-4 has about 100,000 tokens in its vocabulary.

Common words, punctuation marks, and sub-word chunks each sit somewhere in this list with a number attached. Rarer words don't get their own entry; they are built from several sub-word tokens.
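
To make the "numbered list" idea concrete, here is a minimal sketch of a vocabulary as a lookup table. The tokens and IDs are copied from the illustrative table above, not from a real tokenizer, and the `encode` / `decode` helper names are made up for this example.

```python
# Toy vocabulary: IDs mirror the illustrative table above, not a real tokenizer.
vocab = {"[START]": 0, "the": 1, "a": 2, "is": 3, "cat": 2481, "dog": 6391}

# Reverse mapping, so IDs can be turned back into tokens.
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens):
    """Look up the ID of each token (raises KeyError for unknown tokens)."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Turn a list of IDs back into tokens."""
    return [id_to_token[i] for i in ids]

ids = encode(["the", "cat", "is", "a", "dog"])
print(ids)          # [1, 2481, 3, 2, 6391]
print(decode(ids))  # ['the', 'cat', 'is', 'a', 'dog']
```

Notice that nothing in the lookup says anything about meaning. It only maps text to a number and back.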


The Problem With Just Using Token IDs

Let's make this very concrete.

Imagine you're the model. Someone gives you these token IDs:

[2481, 312, 8734, 6391]

You have no idea what these mean. Are 2481 and 6391 related? Are they opposites? Are they the same type of thing?

You literally cannot tell from the numbers alone.

Now imagine instead of a single number, each token was represented by a list of numbers — where those numbers actually captured meaning.

"cat"  → [0.9, 0.8, 0.1, 0.05, ...]
"dog"  → [0.8, 0.9, 0.1, 0.06, ...]
"fish" → [0.6, 0.5, 0.3, 0.04, ...]
"car"  → [0.1, 0.0, 0.9, 0.80, ...]

Now look at "cat" and "dog" — their numbers are very similar. Look at "cat" and "car" — their numbers are very different.

The numbers are capturing something about meaning. Words that mean similar things have similar numbers.

This list of numbers that represents a token's meaning — that is an Embedding.
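
One common way to put a number on "similar" is cosine similarity, which is close to 1 when two lists point in the same direction and close to 0 when they don't. The sketch below uses only the four numbers shown for each word above (the real lists would be much longer), so treat the scores as illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# First four numbers of each toy embedding from the text above.
cat = [0.9, 0.8, 0.1, 0.05]
dog = [0.8, 0.9, 0.1, 0.06]
car = [0.1, 0.0, 0.9, 0.80]

cat_dog = cosine_similarity(cat, dog)
cat_car = cosine_similarity(cat, car)
print(round(cat_dog, 2))  # 0.99
print(round(cat_car, 2))  # 0.15
```

"cat" and "dog" score near 1, "cat" and "car" score near 0. That is the "similar numbers" intuition, turned into a single score.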


What is an Embedding?

An embedding is a list of numbers that represents the meaning of a token.

That's it. Simple definition.

But let's really understand why this is powerful.

Forget AI for a second. Think about maps.

On a map, every place has two numbers, latitude and longitude:

New York    → [40.7, -74.0]
New Jersey  → [40.0, -74.4]
London      → [51.5,  -0.1]
Tokyo       → [35.6, 139.6]

Just from these two numbers, you can tell:

  • New York and New Jersey are very close (similar numbers)
  • New York and Tokyo are very far (very different numbers)

The two numbers capture a real relationship — physical distance.

Embeddings do the same thing — but for meaning instead of geography.

Instead of 2 numbers (latitude and longitude), embeddings use hundreds or thousands of numbers to capture the meaning of a word from many different angles.
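
Here is the map analogy as a tiny sketch. It measures the plain straight-line distance between the two-number lists above. That is not a true geographic distance (the Earth is curved), but it is enough to show that "close numbers" means "close places".

```python
import math

# Coordinates copied from the list above: [latitude, longitude].
places = {
    "New York": [40.7, -74.0],
    "New Jersey": [40.0, -74.4],
    "London": [51.5, -0.1],
    "Tokyo": [35.6, 139.6],
}

def distance(a, b):
    """Straight-line distance between the number lists of two places."""
    return math.dist(places[a], places[b])

near = distance("New York", "New Jersey")
far = distance("New York", "Tokyo")
print(round(near, 2))  # 0.81
print(far > near)      # True
```

The same `math.dist` call works on lists of any length, which is why the idea carries over from 2 numbers to hundreds or thousands.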


Understanding Dimensions

Each number in an embedding is called a dimension.

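Real embeddings have hundreds or thousands of dimensions. Some published sizes:

OpenAI text-embedding-3-small:      1,536
Voyage AI voyage-3:                 1,024
all-MiniLM-L6-v2 (small, open):       384

These are dedicated embedding models. The token embeddings inside a chat model have their own size, which is published for some models (12,288 for the largest GPT-3) and not for others, such as GPT-4. This module uses 1,536 as its running example.

So "cat" is not represented by 2 numbers. In our running example it's represented by 1,536 numbers.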

Why so many?

Because meaning is complex. Words have many properties:

Dimension 1  might capture: is this a living thing?
Dimension 2  might capture: is this an animal?
Dimension 3  might capture: is this domestic/pet?
Dimension 4  might capture: is this big or small?
Dimension 5  might capture: is this a noun or verb?
...
Dimension 1536 might capture: some very subtle pattern

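No one programs these dimensions manually. The model learns them during training. The labels above are a teaching simplification: in real models a single dimension rarely maps to one clean idea, and meaning is spread across many dimensions at once.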


The Magic of Embedding Space

When you plot words by their embeddings, something incredible happens.

Words with similar meanings end up close together in space.

Imagine a flat 2D picture (real embeddings have hundreds or thousands of dimensions, but a flat sketch is easier to visualize):

                 animals
                    ↑
          cat • • dog
         fish •         • wolf
                         • bear
                    
                         • car
           pizza •       • truck
          burger •  • bus
                    ↓
                  non-animals

Animals cluster together. Vehicles cluster together. Foods cluster together.

The model was never told "cat and dog are both animals." It figured this out purely from reading billions of sentences and noticing that cat and dog appear in similar contexts.


The Famous Example — King, Queen, Man, Woman

This is one of the most famous examples in AI, popularized by the word2vec word embeddings, and it shows why embeddings feel magical.

Embedding("king") - Embedding("man") + Embedding("woman") ≈ Embedding("queen")

In plain English:

king minus man plus woman ≈ queen (the nearest word to the result, not an exact match)

The model learned that:

  • king and queen have a royalty relationship
  • man and woman have a gender relationship
  • These relationships are consistent in the embedding space

This kind of math on meaning — that's why embeddings changed everything.

You can do arithmetic on concepts. Not just words.
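
The arithmetic can be sketched with hand-made vectors. Everything below is invented for illustration: each word gets just two numbers, loosely "royalty" and "maleness", whereas real embeddings learn hundreds of dimensions and the match is only approximate.

```python
import math

# Hand-made 2-number vectors: [royalty, maleness]. Purely illustrative.
emb = {
    "king":  [0.95, 0.9],
    "queen": [0.95, 0.1],
    "man":   [0.10, 0.9],
    "woman": [0.10, 0.1],
    "apple": [0.00, 0.5],
}

# king - man + woman, computed one dimension at a time.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Find the closest remaining word, skipping the three words used in the sum.
candidates = [word for word in emb if word not in ("king", "man", "woman")]
nearest = min(candidates, key=lambda word: math.dist(emb[word], target))
print(nearest)  # queen
```

Skipping the input words is the usual convention for this kind of analogy test. Without it, the closest word to the result is often one of the inputs.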


What Are Parameters?

Now let's talk about parameters.

When people say "GPT-3 has 175 billion parameters", what does that actually mean?

A parameter is simply a number inside the model that was learned during training.

Think of it like this.

Imagine you're teaching a child to recognize cats. You show them thousands of pictures. Slowly, their brain builds up internal patterns:

"If it has pointy ears..."
"If it has whiskers..."
"If it moves a certain way..."

These internal patterns — stored in the brain's connections — are like parameters.

In an LLM:

Parameters = the millions/billions of numbers stored 
             inside the model that determine how it 
             processes and generates language

These numbers are NOT programmed by humans. They are learned automatically by the model during training — by reading billions of words and adjusting the numbers over and over until the model gets good at predicting language.
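
To see "adjusting the numbers over and over" in action, here is a one-parameter sketch. It is not how an LLM is trained in detail (real training adjusts billions of parameters at once), but the loop has the same shape: make a prediction, measure the error, nudge the number. The data, learning rate, and step count are all made up for this example.

```python
# Training data generated from the hidden rule y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

w = 0.0               # the single parameter, starting from an arbitrary value
learning_rate = 0.05  # how big each nudge is

for step in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    # Nudge the parameter in the direction that reduces the error.
    w -= learning_rate * grad

print(round(w, 3))  # 3.0
```

Nobody typed in the value 3. The loop found it from the examples, which is what "learned, not programmed" means.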


What Are Model Weights?

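Model weights and parameters are basically the same thing: two words for the same concept. (Strictly, parameters also include small extra numbers called biases, but people use the two words interchangeably.)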

"Weights" comes from how neural networks work internally. Each connection between neurons has a "weight" — a number that says how strong or weak that connection is.

Neural Network (simplified):

Input        Hidden Layer       Output
  ●  ──0.8──→  ●
  ●  ──0.3──→  ●  ──0.9──→  ●
  ●  ──0.6──→  ●

The numbers on the arrows (0.8, 0.3, 0.6, 0.9) 
are the WEIGHTS / PARAMETERS

During training, these weights get adjusted millions of times until the model produces good outputs.
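
Here is one possible reading of the diagram as code: three inputs feed one hidden neuron through the weights 0.8, 0.3 and 0.6, and that neuron feeds the output through the weight 0.9. The input values and the choice of ReLU as the activation are assumptions added for the sketch.

```python
def relu(x):
    """A common activation function: negative values become 0."""
    return max(0.0, x)

inputs = [1.0, 0.5, 2.0]          # made-up input values
hidden_weights = [0.8, 0.3, 0.6]  # the three left-hand arrows in the diagram
output_weight = 0.9               # the right-hand arrow

# Each input is multiplied by the weight on its connection, then summed.
hidden = relu(sum(x * w for x, w in zip(inputs, hidden_weights)))
output = hidden * output_weight
print(round(output, 3))  # 1.935
```

Change any weight and the output changes. Training is the process of finding weights that make the outputs useful.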

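When you download a model, you're downloading a huge file full of these learned weight numbers. The size is roughly the number of parameters times the bytes used to store each one.

GPT-2  (small)   →  548 MB file
GPT-3  (175B)    →  ~700 GB at 32-bit precision (an estimate; the weights were never released for download)
LLaMA 2 (7B)     →  ~13 GB file (16-bit precision)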

The file IS the weights. The weights ARE the model.
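
A rough way to check file sizes like these is to multiply the parameter count by the bytes used per number. The sketch below assumes 4 bytes for 32-bit floats and 2 bytes for 16-bit floats, and ignores any extra metadata stored in the file. The helper name `weights_size_gb` is made up for this example.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weights_size_gb(num_params, dtype):
    """Approximate size of the raw weights in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

print(weights_size_gb(175e9, "float32"))  # 700.0
print(weights_size_gb(7e9, "float16"))    # 14.0
```

So the ~700 GB figure matches 175 billion parameters stored as 32-bit floats, and 14 × 10^9 bytes is about 13 GiB, which lines up with the ~13 GB figure for a 7B model stored as 16-bit floats.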


How Do Embeddings, Parameters and Weights Connect?

Here's how they all fit together:

Step 1 — Tokenization
"I love cats" → [40, 1842, 8765]

Step 2 — Embedding Lookup
Each token ID gets converted to its embedding
(a list of 1536 numbers)

40    → [0.2, 0.8, 0.1, 0.5, ... 1536 numbers]
1842  → [0.7, 0.1, 0.9, 0.3, ... 1536 numbers]
8765  → [0.9, 0.8, 0.1, 0.05,... 1536 numbers]

The embedding table is stored as MODEL WEIGHTS
(learned during training)

Step 3 — Processing
These embeddings flow through the Transformer
(more weights doing calculations)

Step 4 — Output
Model produces next token prediction

The embedding table is part of the model weights. It was learned during training — not programmed by hand.
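
Step 2 can be sketched in a few lines: the embedding table is a big grid with one row per token ID, and "lookup" is plain indexing. The sizes below are shrunk to keep it fast, and the values are random stand-ins, since a real table holds learned numbers.

```python
import random

VOCAB_SIZE = 10_000   # toy value; real vocabularies are larger
EMBED_DIM = 4         # toy value; the text uses 1536 as its example

rng = random.Random(0)
# One row per token ID. In a real model these values are learned, not random.
embedding_table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Embedding lookup is plain indexing: row i is the vector for token i."""
    return [embedding_table[i] for i in token_ids]

token_ids = [40, 1842, 8765]          # "I love cats" in the example above
vectors = embed(token_ids)
print(len(vectors), len(vectors[0]))  # 3 4
print(VOCAB_SIZE * EMBED_DIM)         # 40000
```

That last number is the parameter count of this toy table alone. With a 100,000-token vocabulary and 1,536 dimensions, the same multiplication gives about 154 million parameters.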


A Real Life Analogy for All of This

Think of a very experienced doctor.

After years of medical school and treating thousands of patients, a doctor builds up internal knowledge:

"When I see these symptoms together → likely this disease"
"This drug works well with this condition"
"These two symptoms almost never appear together"

This knowledge is not written anywhere. It's stored in the doctor's brain — in the connections between neurons.

Now:

  • Parameters/Weights = all the knowledge stored in the doctor's brain
  • Embeddings = the doctor's internal "concept map" of how diseases, symptoms, and treatments relate to each other
  • Training = the years of medical school and patient experience that built up this knowledge

When you call an LLM API — you're essentially consulting that doctor. All that learned knowledge (weights) is being used to answer your question.


Why This Matters for What We're Building

Here's why you need to understand this:

In Phase 3 — we'll use embeddings to measure how similar two pieces of text are. This is the core of search in AI applications.

In Phase 4 — we'll store embeddings in a Vector Database. You'll understand exactly what you're storing and why.

In Phase 5 (RAG) — when a user asks a question, we'll convert that question to an embedding and find the most similar document chunks. Now you understand what "similar" actually means mathematically.

All of RAG — all of AI search — runs on embeddings.
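
As a preview, the retrieval step might look something like this sketch. The three-number embeddings are invented by hand, and the query vector is a pretend embedding of a pet-care question. In a real application both would come from an embedding model, and the chunks would live in a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 3-number embeddings for three document chunks.
chunks = {
    "Cats sleep most of the day.": [0.9, 0.1, 0.0],
    "Dogs need daily walks.": [0.8, 0.3, 0.1],
    "Change engine oil regularly.": [0.0, 0.2, 0.9],
}

# Pretend embedding of the question "How do I care for my pet?"
query_embedding = [0.85, 0.15, 0.05]

def top_k(query, k):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(chunks, key=lambda text: cosine(query, chunks[text]), reverse=True)
    return ranked[:k]

results = top_k(query_embedding, 2)
print(results)  # ['Cats sleep most of the day.', 'Dogs need daily walks.']
```

The engine-oil chunk is left out because its numbers point in a different direction from the query. No keyword matching is involved.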


3-Line Summary

  1. A vocabulary is the model's complete list of known tokens — but token ID numbers alone carry no meaning, just like roll numbers don't tell you anything about a student.
  2. An embedding converts each token into a list of hundreds of numbers that capture its meaning — words with similar meanings get similar numbers, so the model can understand relationships between concepts.
  3. Parameters and weights are the same thing — all the numbers learned inside the model during training — the embedding table is part of these weights, learned automatically by reading billions of words.

Module 2.2 — Complete ✅

Coming up — Module 2.3 — The Transformer Architecture

This is the engine behind every LLM — ChatGPT, Claude, Gemini — all of them. You'll finally understand what a "Transformer" actually is, why it was such a big deal when it was invented, and how it processes your text. Explained simply — no math, just clear ideas.
