Module 2.3 — The Transformer Architecture

Start With The Big Picture

Every major AI model you've heard of:

ChatGPT    ✓ Transformer
Claude     ✓ Transformer
Gemini     ✓ Transformer
LLaMA      ✓ Transformer
Copilot    ✓ Transformer

They all use the same fundamental architecture — the Transformer.

It was invented in 2017 by researchers at Google in a paper called "Attention Is All You Need." That one paper changed the entire field of AI.

Before Transformers — language models were slow, struggled with long text, and couldn't understand context well.

After Transformers — everything exploded. GPT, BERT, Claude, Gemini — all of it became possible because of this one architecture.

So what makes it so special?


The Problem It Was Solving

Before Transformers, the most popular way to process language was something called an RNN — Recurrent Neural Network.

An RNN reads text like a human reads a book — one word at a time, left to right.

"The cat sat on the mat"

RNN reads:
"The" → processes → remembers something
"cat" → processes → updates memory
"sat" → processes → updates memory
"on"  → processes → updates memory
"the" → processes → updates memory
"mat" → processes → done

This had two serious problems:

Problem 1 — It forgot things from the beginning

By the time the RNN reached the end of a long sentence or paragraph, it had already mostly forgotten what was at the beginning. Like trying to remember the first sentence of a page after reading the whole page — the early stuff fades.

Problem 2 — It was slow

Because it processed words one by one in sequence — you couldn't speed it up. Word 5 had to wait for word 4. Word 4 had to wait for word 3. No way to process in parallel.
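
Both problems can be seen in a few lines of code. The sketch below is a toy recurrent update, not a real RNN: the function names and the weights 0.5 and 0.8 are made up for illustration, and each word is reduced to a single invented number.

```javascript
// Toy recurrent update: each step needs the hidden state from the step before.
// The weights 0.5 and 0.8 are made-up illustration values, not trained ones.
function rnnStep(hidden, input) {
    return Math.tanh(0.5 * hidden + 0.8 * input);
}

function runRnn(inputs) {
    let hidden = 0;                    // memory starts empty
    const history = [];
    for (const x of inputs) {
        hidden = rnnStep(hidden, x);   // cannot start until the previous step finished
        history.push(hidden);
    }
    return history;
}

// One invented number per word of "The cat sat on the mat".
const states = runRnn([1.0, 0.2, 0.4, 0.1, 0.3, 0.6]);
console.log(states);

// How much does the first word still influence the final memory?
const withFirst = states[states.length - 1];
const withoutFirst = runRnn([0.0, 0.2, 0.4, 0.1, 0.3, 0.6])[5];
console.log(Math.abs(withFirst - withoutFirst)); // small: the early signal has faded
```

The loop is the slow part (step 5 needs the result of step 4), and the tiny final difference is the forgetting part.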

The Transformer was designed to tackle BOTH of these problems.


The Transformer's Big Idea

Instead of reading text word by word — the Transformer looks at all words at the same time.

RNN approach:
"The" → "cat" → "sat" → "on" → "the" → "mat"
(one at a time, sequential)

Transformer approach:
"The"  "cat"  "sat"  "on"  "the"  "mat"
  ↕      ↕      ↕      ↕      ↕      ↕
Every word looks at every other word simultaneously

Every single word looks at every other word at the same time and asks:

"How much should I pay attention to you right now?"

This is called the Attention Mechanism — and it's so important it gets its own module next (Module 2.4).

For now just understand — the Transformer's superpower is processing all words in parallel, while understanding how every word relates to every other word.
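
To make "every word looks at every other word" concrete, here is a heavily simplified sketch. It assumes made-up 2-number vectors for three tokens and scores each pair with a plain dot product. Real attention adds learned projections and scaling, which Module 2.4 covers.

```javascript
// Made-up 2-number vectors for three tokens; real models learn these.
const tokens = ["cat", "sat", "mat"];
const vectors = [[1.0, 0.2], [0.1, 0.9], [0.9, 0.3]];

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Softmax turns raw scores into weights that are positive and sum to 1.
function softmax(scores) {
    const max = Math.max(...scores);
    const exps = scores.map((s) => Math.exp(s - max));
    const total = exps.reduce((a, b) => a + b, 0);
    return exps.map((e) => e / total);
}

// Every token scores every token (itself included).
// No row depends on another row, so all rows could be computed in parallel.
const attention = vectors.map((query) => softmax(vectors.map((key) => dot(query, key))));

attention.forEach((row, i) => {
    console.log(tokens[i], row.map((w) => w.toFixed(2)));
});
```

Each printed row answers "how much should I pay attention to you?" for one token.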


The Transformer — Inside the Box

Let's open it up and see what's inside.

The Transformer has two main parts:

┌─────────────────────────────────────┐
│           TRANSFORMER               │
│                                     │
│  ┌─────────────┐  ┌──────────────┐  │
│  │   ENCODER   │  │   DECODER    │  │
│  │             │  │              │  │
│  │ Reads and   │  │ Generates    │  │
│  │ understands │  │ output       │  │
│  │ the input   │  │ token by     │  │
│  │             │  │ token        │  │
│  └─────────────┘  └──────────────┘  │
└─────────────────────────────────────┘

Encoder — reads your input and builds a deep understanding of it.

Decoder — takes that understanding and generates the output, one token at a time.

Different models use different parts:

GPT (ChatGPT)  → Decoder only
               → Great at generating text
               → This is what most of today's LLMs use

BERT (Google)  → Encoder only  
               → Great at understanding text
               → Used for search, classification

T5, original   → Encoder + Decoder
Transformer    → Great at translation tasks

Since we're focused on LLMs — we care most about the Decoder.


What Happens Inside One Transformer Layer

The Transformer is not one thing — it's many identical layers stacked on top of each other.

GPT-3 has 96 layers. Each layer does the same type of processing, but learns different patterns.

Here's what happens inside one layer — simply:

Input tokens (as embeddings)
          ↓
┌─────────────────────────┐
│   ATTENTION             │
│                         │
│   Every token looks at  │
│   every other token and │
│   figures out which     │
│   ones matter most      │
└─────────────────────────┘
          ↓
┌─────────────────────────┐
│   FEED FORWARD          │
│   NETWORK               │
│                         │
│   Each token gets       │
│   processed through     │
│   a small neural        │
│   network individually  │
└─────────────────────────┘
          ↓
Richer, more meaningful 
token representations

Then this output feeds into the NEXT layer. And the next. And the next.

Each layer builds a deeper understanding of the text.
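
The stacking idea can be sketched as a loop. This is a toy under loud assumptions: the vectors are invented, the feed-forward constants are placeholders rather than learned weights, every layer here shares the same placeholders (real layers each have their own), and real layers also include pieces such as residual connections that are left out.

```javascript
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

function softmax(scores) {
    const max = Math.max(...scores);
    const exps = scores.map((s) => Math.exp(s - max));
    const total = exps.reduce((a, b) => a + b, 0);
    return exps.map((e) => e / total);
}

// Attention step: each token becomes a weighted mix of all token vectors.
function attention(vectors) {
    return vectors.map((query) => {
        const weights = softmax(vectors.map((key) => dot(query, key)));
        return query.map((_, d) => vectors.reduce((sum, v, j) => sum + weights[j] * v[d], 0));
    });
}

// Feed-forward step: the same small function applied to each token on its own.
// The constants are placeholders standing in for learned weights.
function feedForward(vectors) {
    return vectors.map((v) => v.map((x) => Math.max(0, 1.1 * x - 0.05)));
}

// One layer = attention, then feed-forward. Output has the same shape as input.
const layer = (vectors) => feedForward(attention(vectors));

// Stack layers: the output of one is the input of the next.
function runLayers(vectors, layerCount) {
    let current = vectors;
    for (let i = 0; i < layerCount; i++) current = layer(current);
    return current;
}

const input = [[1.0, 0.2], [0.1, 0.9], [0.9, 0.3]];
const output = runLayers(input, 3);
console.log(output);
```

The key design point is that a layer returns the same shape it receives, which is what makes stacking 96 of them possible.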


A Real Life Analogy — The Team Meeting

Imagine a team of 10 people in a meeting, each person representing one word in a sentence.

The old way (RNN):

Person 1 speaks. Then passes a note to Person 2. Person 2 reads the note, speaks, passes a note to Person 3. And so on. By the time Person 10 gets the note — it's been rewritten 9 times. The original message is barely there.

The Transformer way:

All 10 people sit in a circle. Everyone can see everyone else. Before anyone speaks — everyone looks around the room and decides:

"Who in this room is most relevant to what I need to say?"

Person 1 (the word "bank") looks around:

  • Sees Person 3 said "river" → pays HIGH attention to them
  • Sees Person 7 said "money" → pays LOW attention to them
  • Now Person 1 knows — in this sentence, "bank" means river bank

Everyone has full context. Nobody is waiting. Everything happens at the same time.

That's the Transformer.


The Four Steps From Input to Output

Let's trace your actual message through a Transformer LLM step by step:


Step 1 — Input Preparation

You type: "What is the capital of France?"

Tokenizer splits it:
["What", " is", " the", " capital", " of", " France", "?"]

Each token gets its embedding (list of 1536 numbers):
"What"    → [0.2, 0.8, 0.1, ...]
" is"     → [0.5, 0.3, 0.7, ...]
" the"    → [0.1, 0.9, 0.2, ...]
" capital"→ [0.8, 0.4, 0.6, ...]
" of"     → [0.3, 0.2, 0.8, ...]
" France" → [0.9, 0.7, 0.3, ...]
"?"       → [0.1, 0.1, 0.9, ...]

Step 2 — Positional Encoding

Wait — there's a problem.

The Transformer processes all tokens at the same time. But order matters in language.

"Dog bites man"  ≠  "Man bites dog"

Same words, completely different meaning. The Transformer needs to know the order.

So before processing, we add positional encoding — extra numbers added to each embedding that tell the model:

"What"    → embedding + "I am token 1"
" is"     → embedding + "I am token 2"
" the"    → embedding + "I am token 3"
...

Now the model knows both the meaning AND the position of each token.
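
One concrete way to build those "extra numbers" is the sine/cosine scheme from the original Transformer paper, sketched below. The function names and the 4-number embedding are invented for illustration, and positions are counted from 0. Many newer models use other schemes (learned or rotary position embeddings), so treat this as one example rather than what every model does.

```javascript
// Sinusoidal positional encoding from the original Transformer paper:
// even slots use sine, odd slots use cosine, at different wavelengths.
function positionalEncoding(position, dimensions) {
    const encoding = [];
    for (let i = 0; i < dimensions; i++) {
        const pairIndex = Math.floor(i / 2);
        const angle = position / Math.pow(10000, (2 * pairIndex) / dimensions);
        encoding.push(i % 2 === 0 ? Math.sin(angle) : Math.cos(angle));
    }
    return encoding;
}

// Add position information to a token embedding, element by element.
function addPosition(embedding, position) {
    const pe = positionalEncoding(position, embedding.length);
    return embedding.map((x, i) => x + pe[i]);
}

// The same made-up embedding at two different positions gives different vectors.
const dog = [0.2, 0.8, 0.1, 0.5];
console.log(positionalEncoding(0, 4)); // [0, 1, 0, 1]
console.log(addPosition(dog, 0));
console.log(addPosition(dog, 2));
```

Because the word "dog" now produces a different vector at position 0 than at position 2, "Dog bites man" and "Man bites dog" no longer look identical to the model.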


Step 3 — Through the Layers

The embeddings (with position info) flow through all the Transformer layers.

Layer 1:  Basic patterns — grammar, word types
Layer 2:  Slightly deeper — phrases, simple relationships  
Layer 3:  Deeper — sentence structure
...
Layer 96: Very deep — complex reasoning, context, meaning

Each layer transforms the embeddings into richer, more meaningful representations.

By the final layer — the model has a very deep understanding of your question.


Step 4 — Output Generation

After the final layer — the model produces a probability distribution:

Say the model has already started its answer with "The capital of France is". What comes next?

"Paris"    → 94%
"Lyon"     → 2%
"London"   → 1%
"Berlin"   → 0.5%
...

Picks "Paris." Adds it. Runs through all layers again. Picks the next token. And so on.
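
The pick, append, repeat loop can be sketched as follows. The `fakeModel` table is a hypothetical stand-in with hard-coded probabilities. A real model would compute them by running all of its layers on the text so far.

```javascript
// Hypothetical stand-in for the model: maps the text so far to next-token probabilities.
const fakeModel = {
    "The capital of France is": { " Paris": 0.94, " Lyon": 0.02, " London": 0.01 },
    "The capital of France is Paris": { ".": 0.9, ",": 0.05 },
    "The capital of France is Paris.": { "<end>": 0.99 },
};

// Greedy decoding: always take the single most likely token.
function pickMostLikely(probabilities) {
    return Object.entries(probabilities).sort((a, b) => b[1] - a[1])[0][0];
}

function generate(prompt, maxTokens) {
    let text = prompt;
    for (let i = 0; i < maxTokens; i++) {
        const probabilities = fakeModel[text];
        if (!probabilities) break;               // our toy table has no entry
        const next = pickMostLikely(probabilities);
        if (next === "<end>") break;             // the model signals it is finished
        text += next;                            // append, then go around again
    }
    return text;
}

const answer = generate("The capital of France is", 10);
console.log(answer); // "The capital of France is Paris."
```

Always taking the top token is called greedy decoding. Real systems usually sample from the distribution instead, which is where settings like temperature come in.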

"The" → "capital" → "of" → "France" → "is" → "Paris" → "."

Why So Many Layers?

Think about how you understand a sentence.

"The trophy didn't fit in the suitcase because it was too big."

What does "it" refer to — the trophy or the suitcase?

To answer this you need to:

  1. Identify all the nouns (trophy, suitcase)
  2. Understand "fit" implies a size comparison
  3. Understand "too big" applies to whatever didn't fit
  4. Conclude "it" = the trophy

That's multiple levels of reasoning — each building on the previous.

Transformer layers work the same way:

Early layers  → simple patterns (this is a noun, this is a verb)
Middle layers → relationships (this noun is the subject)
Later layers  → complex reasoning (what does "it" refer to?)

More layers = ability to handle more complex language.


How Big is a Transformer Actually?

Let's put this in perspective:

GPT-2 (2019)
→ 48 layers
→ 1.5 billion parameters
→ Could write decent paragraphs

GPT-3 (2020)
→ 96 layers
→ 175 billion parameters
→ Could write essays, answer questions, write code

GPT-4 (2023)
→ Architecture and size not published by OpenAI
→ Unofficial estimates run to around 1 trillion parameters or more (unconfirmed)
→ Human-level scores on many professional and academic exams

More layers and parameters = more capacity to learn complex patterns = better language understanding.
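
The link between layers and parameter count can be checked with a common rule of thumb: parameters are roughly 12 × layers × width², where "width" is the size of each token vector inside the model. The widths used below (1600 for the largest GPT-2, 12288 for GPT-3) come from the published model descriptions. The formula ignores the embedding table, so it is an approximation.

```javascript
// Rule of thumb: parameters ≈ 12 × layers × width², ignoring the embedding table.
// "width" is the size of each token vector inside the model.
function estimateParameters(layers, width) {
    return 12 * layers * width * width;
}

const gpt2 = estimateParameters(48, 1600);   // GPT-2 (largest version): 48 layers, width 1600
const gpt3 = estimateParameters(96, 12288);  // GPT-3: 96 layers, width 12288

console.log((gpt2 / 1e9).toFixed(2) + " billion"); // 1.47 billion
console.log((gpt3 / 1e9).toFixed(0) + " billion"); // 174 billion
```

Both estimates land close to the official 1.5 billion and 175 billion figures. Notice that width is squared, so making the model wider adds parameters much faster than adding layers does.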


The Full Picture — What You Now Know

Your text
    ↓
Tokenizer → splits into tokens
    ↓
Embedding table → each token becomes 
                  a list of numbers
    ↓
Positional encoding → adds position info
    ↓
Transformer layers (many of them)
  → Each layer runs Attention
    (every token looks at every other token)
  → Then Feed Forward Network
  → Embeddings get richer and richer
    ↓
Final layer → probability distribution
              over all vocabulary tokens
    ↓
Sample next token (controlled by temperature)
    ↓
Add to sequence → repeat until done
    ↓
Tokenizer converts numbers back to text
    ↓
Response appears on your screen

3-Line Summary

  1. The Transformer processes all tokens at the same time — not one by one — which makes it faster and better at understanding relationships across long text.
  2. It's made of many identical layers stacked on top of each other — early layers learn simple patterns, later layers learn complex reasoning — all using an Attention mechanism.
  3. More layers and more parameters mean more capacity to model language, which is a large part of why GPT-3 (175 billion parameters) is so much more capable than GPT-2 (1.5 billion).

Module 2.3 — Complete ✅

Coming up — Module 2.4 — The Attention Mechanism

This is the actual secret sauce inside every Transformer. The word "attention" gets thrown around a lot — but almost nobody explains what it actually does in simple terms. We'll fix that completely. By the end you'll genuinely understand why that 2017 paper was called "Attention Is All You Need."

Module 2.2 — Vocabulary, Embeddings, Parameters & Model Weights

Start With a Simple Question

After tokenization, the model has a list of numbers.

But here is the problem.

Let's say "cat" becomes token number 2481 and "dog" becomes token number 6391.

These are just ID numbers — like roll numbers in a class. They don't mean anything on their own.

Token 2481 and token 6391 are just... two different numbers. There is nothing in those numbers that tells the model:

  • "cat" and "dog" are both animals
  • "cat" and "dog" are similar to each other
  • "cat" and "pizza" are very different from each other

The model just sees 2481 and 6391 — two random numbers with no relationship.

This is a big problem. Because meaning and relationships between words is exactly what language understanding requires.

So how do we solve this?

This is where Embeddings come in.


First — What is a Vocabulary?

Before embeddings, let's quickly understand vocabulary.

The model's vocabulary is simply its complete list of all tokens it knows.

Token ID    Token
────────────────────
0           [START]
1           "the"
2           "a"
3           "is"
...
2481        "cat"
...
6391        "dog"
...
50,000      [END]

Think of it like a dictionary — but instead of definitions, it's just a numbered list of every token the model understands.

GPT-4 has about 100,000 tokens in its vocabulary.

Every word, punctuation mark, and sub-word chunk you can think of — it's somewhere in this list with a number attached.
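
In code, a vocabulary is little more than an array plus a reverse lookup. The tiny vocabulary and its IDs below are invented for illustration.

```javascript
// A toy vocabulary. Real ones hold tens of thousands of entries; these IDs are made up.
const vocabulary = ["[START]", "the", "a", "is", "cat", "dog", "[END]"];

// Build the reverse lookup: token text -> token ID (its index in the list).
const tokenToId = new Map(vocabulary.map((token, id) => [token, id]));

const encode = (tokens) => tokens.map((t) => tokenToId.get(t));
const decode = (ids) => ids.map((id) => vocabulary[id]);

const ids = encode(["the", "cat", "is", "a", "cat"]);
console.log(ids);          // [1, 4, 3, 2, 4]
console.log(decode(ids));  // ["the", "cat", "is", "a", "cat"]
```

The IDs are just positions in the list, which is exactly why they carry no meaning by themselves.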


The Problem With Just Using Token IDs

Let's make this very concrete.

Imagine you're the model. Someone gives you these token IDs:

[2481, 312, 8734, 6391]

You have no idea what these mean. Are 2481 and 6391 related? Are they opposites? Are they the same type of thing?

You literally cannot tell from the numbers alone.

Now imagine instead of a single number, each token was represented by a list of numbers — where those numbers actually captured meaning.

"cat"  → [0.9, 0.8, 0.1, 0.05, ...]
"dog"  → [0.8, 0.9, 0.1, 0.06, ...]
"fish" → [0.6, 0.5, 0.3, 0.04, ...]
"car"  → [0.1, 0.0, 0.9, 0.80, ...]

Now look at "cat" and "dog" — their numbers are very similar. Look at "cat" and "car" — their numbers are very different.

The numbers are capturing something about meaning. Words that mean similar things have similar numbers.

This list of numbers that represents a token's meaning — that is an Embedding.
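
"Similar numbers" can be measured. A common choice is cosine similarity, sketched here on the first four numbers of the made-up vectors above. The function names are my own.

```javascript
const embeddings = {
    cat: [0.9, 0.8, 0.1, 0.05],
    dog: [0.8, 0.9, 0.1, 0.06],
    car: [0.1, 0.0, 0.9, 0.80],
};

const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);

// Cosine similarity: near 1 means pointing the same way, near 0 means unrelated.
function cosineSimilarity(a, b) {
    return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const catDog = cosineSimilarity(embeddings.cat, embeddings.dog);
const catCar = cosineSimilarity(embeddings.cat, embeddings.car);
console.log(catDog.toFixed(2)); // 0.99
console.log(catCar.toFixed(2)); // 0.15
```

The score for cat and dog is close to 1, while cat and car score close to 0, which matches what your eye sees in the raw numbers.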


What is an Embedding?

An embedding is a list of numbers that represents the meaning of a token.

That's it. Simple definition.

But let's really understand why this is powerful.

Forget AI for a second. Think about maps.

On a map, every city has two numbers: latitude and longitude.

New York    → [40.7, -74.0]
Newark      → [40.7, -74.2]
London      → [51.5,  -0.1]
Tokyo       → [35.7, 139.7]

Just from these two numbers, you can tell:

  • New York and Newark are very close (similar numbers)
  • New York and Tokyo are very far (very different numbers)

The two numbers capture a real relationship — physical distance.

Embeddings do the same thing — but for meaning instead of geography.

Instead of 2 numbers (latitude and longitude), embeddings use hundreds or thousands of numbers to capture the meaning of a word from many different angles.


Understanding Dimensions

Each number in an embedding is called a dimension.

Real embeddings have hundreds or thousands of dimensions:

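OpenAI text-embedding-3-small:         1,536
Many mid-size embedding models:        1,024
Small open models (all-MiniLM-L6-v2):    384

These are the sizes of standalone embedding models. The token embeddings inside large chat models are often bigger still (GPT-3 uses 12,288).

So "cat" in a 1,536-dimension model is not represented by 2 numbers. It is represented by 1,536 numbers.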

Why so many?

Because meaning is complex. Words have many properties:

Dimension 1  might capture: is this a living thing?
Dimension 2  might capture: is this an animal?
Dimension 3  might capture: is this domestic/pet?
Dimension 4  might capture: is this big or small?
Dimension 5  might capture: is this a noun or verb?
...
Dimension 1536 might capture: some very subtle pattern

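No one programs these dimensions manually. The model learns them during training, and in practice a single dimension rarely maps to one clean concept like the ones above; meaning is spread across many dimensions at once.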


The Magic of Embedding Space

When you plot words by their embeddings, something incredible happens.

Words with similar meanings end up close together in space.

Imagine a 3D space (real embeddings are 1536D but 
let's simplify to 3D for visualization):

                 animals
                    ↑
          cat • • dog
         fish •         • wolf
                         • bear
                    
                         • car
           pizza •       • truck
          burger •  • bus
                    ↓
                  non-animals

Animals cluster together. Vehicles cluster together. Foods cluster together.

The model never got told "cat and dog are both animals." It figured this out purely from reading billions of sentences and noticing that cat and dog appear in similar contexts.


The Famous Example — King, Queen, Man, Woman

This is one of the most famous examples in NLP, popularized by the word2vec research in 2013, and it shows why embeddings are so powerful.

Embedding("king") - Embedding("man") + Embedding("woman") 
≈ Embedding("queen")

In plain English:

king minus man plus woman ≈ queen

The model learned that:

  • king and queen have a royalty relationship
  • man and woman have a gender relationship
  • These relationships are consistent in the embedding space

This kind of math on meaning — that's why embeddings changed everything.

You can do arithmetic on concepts. Not just words.
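
Here is the arithmetic worked through on made-up 2-number embeddings, where the first number loosely stands for "royalty" and the second for "femininity". Real embeddings are learned, have far more dimensions, and only approximately behave this way.

```javascript
// Made-up 2-D embeddings: [royalty, femininity]. Real embeddings are learned.
const embeddings = {
    king:  [0.95, 0.05],
    queen: [0.90, 0.95],
    man:   [0.05, 0.10],
    woman: [0.10, 0.90],
    apple: [0.00, 0.50],
};

const add = (a, b) => a.map((x, i) => x + b[i]);
const subtract = (a, b) => a.map((x, i) => x - b[i]);
const distance = (a, b) => Math.hypot(...subtract(a, b));

// king - man + woman
const target = add(subtract(embeddings.king, embeddings.man), embeddings.woman);

// Find the closest word, skipping the three words used in the arithmetic.
const exclude = new Set(["king", "man", "woman"]);
const nearest = Object.keys(embeddings)
    .filter((word) => !exclude.has(word))
    .sort((a, b) => distance(embeddings[a], target) - distance(embeddings[b], target))[0];

console.log(nearest); // queen
```

The result does not land exactly on "queen". It lands near it, which is why the formula is written with ≈ rather than =.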


What Are Parameters?

Now let's talk about parameters.

When people say "GPT-3 has 175 billion parameters", what does that actually mean?

A parameter is simply a number inside the model that was learned during training.

Think of it like this.

Imagine you're teaching a child to recognize cats. You show them thousands of pictures. Slowly, their brain builds up internal patterns:

"If it has pointy ears..."
"If it has whiskers..."
"If it moves a certain way..."

These internal patterns — stored in the brain's connections — are like parameters.

In an LLM:

Parameters = the millions/billions of numbers stored 
             inside the model that determine how it 
             processes and generates language

These numbers are NOT programmed by humans. They are learned automatically by the model during training — by reading billions of words and adjusting the numbers over and over until the model gets good at predicting language.


What Are Model Weights?

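In everyday usage, model weights and parameters mean the same thing. Strictly speaking, parameters also include a smaller set of learned numbers called biases, but the two words are used interchangeably.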

"Weights" comes from how neural networks work internally. Each connection between neurons has a "weight" — a number that says how strong or weak that connection is.

Neural Network (simplified):

Input        Hidden Layer       Output
  ●  ──0.8──→  ●
  ●  ──0.3──→  ●  ──0.9──→  ●
  ●  ──0.6──→  ●

The numbers on the arrows (0.8, 0.3, 0.6, 0.9) 
are the WEIGHTS / PARAMETERS

During training, these weights get adjusted millions of times until the model produces good outputs.

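When you download an open model, you're downloading a huge file full of these learned weight numbers. The size is roughly the number of parameters times the bytes used per parameter.

GPT-2  (small, 124M)  →  548 MB file
GPT-3  (175B)         →  ~700 GB at 32-bit precision (weights never released for download)
LLaMA 2 (7B)          →  ~13 GB file at 16-bit precision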

The file IS the weights. The weights ARE the model.
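
File sizes like these follow from simple arithmetic: number of parameters times bytes per parameter. A rough sketch, assuming 4 bytes per number at 32-bit precision and 2 bytes at 16-bit, and using roughly 6.74 billion as the parameter count of the "7B" LLaMA 2 model:

```javascript
// Approximate size in gigabytes (1 GB = 1e9 bytes) = parameters × bytes per parameter.
function weightFileSizeGB(parameterCount, bytesPerParameter) {
    return (parameterCount * bytesPerParameter) / 1e9;
}

// 32-bit floats take 4 bytes each; 16-bit floats take 2.
const largeModel = weightFileSizeGB(175e9, 4);   // 175 billion parameters at 32-bit
const smallModel = weightFileSizeGB(6.74e9, 2);  // about 6.74 billion parameters at 16-bit

console.log(largeModel);             // 700
console.log(smallModel.toFixed(1));  // about 13.5
```

Actual files can be a little larger than this estimate because they also store some bookkeeping data alongside the weights.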


How Do Embeddings, Parameters and Weights Connect?

Here's how they all fit together:

Step 1 — Tokenization
"I love cats" → [40, 1842, 8765]

Step 2 — Embedding Lookup
Each token ID gets converted to its embedding
(a list of 1536 numbers)

40    → [0.2, 0.8, 0.1, 0.5, ... 1536 numbers]
1842  → [0.7, 0.1, 0.9, 0.3, ... 1536 numbers]
8765  → [0.9, 0.8, 0.1, 0.05,... 1536 numbers]

The embedding table is stored as MODEL WEIGHTS
(learned during training)

Step 3 — Processing
These embeddings flow through the Transformer
(more weights doing calculations)

Step 4 — Output
Model produces next token prediction

The embedding table is part of the model weights. It was learned during training — not programmed by hand.
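
The lookup step itself is simple enough to show in full. The table below is a made-up miniature (5 tokens, 3 numbers each), but the mechanism is the same at any size: a token ID is used as a row index.

```javascript
// The embedding table is a matrix with one row per token ID.
// Tiny made-up example: 5 tokens, 3 numbers each (real tables are far larger).
const embeddingTable = [
    [0.2, 0.8, 0.1],  // token ID 0
    [0.7, 0.1, 0.9],  // token ID 1
    [0.9, 0.8, 0.1],  // token ID 2
    [0.4, 0.4, 0.6],  // token ID 3
    [0.1, 0.3, 0.5],  // token ID 4
];

// "Embedding lookup" is just picking rows by index. No calculation happens here.
const lookup = (tokenIds) => tokenIds.map((id) => embeddingTable[id]);

const vectors = lookup([0, 2, 2]);
console.log(vectors); // [[0.2, 0.8, 0.1], [0.9, 0.8, 0.1], [0.9, 0.8, 0.1]]

// Every entry in the table counts toward the model's parameters: rows × columns.
const tableParameters = embeddingTable.length * embeddingTable[0].length;
console.log(tableParameters); // 15
```

With a 100,000-token vocabulary and 1,536 numbers per token, the same rows × columns calculation gives about 154 million parameters for the table alone.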


A Real Life Analogy for All of This

Think of a very experienced doctor.

After years of medical school and treating thousands of patients, a doctor builds up internal knowledge:

"When I see these symptoms together → likely this disease"
"This drug works well with this condition"
"These two symptoms almost never appear together"

This knowledge is not written anywhere. It's stored in the doctor's brain — in the connections between neurons.

Now:

  • Parameters/Weights = all the knowledge stored in the doctor's brain
  • Embeddings = the doctor's internal "concept map" of how diseases, symptoms, and treatments relate to each other
  • Training = the years of medical school and patient experience that built up this knowledge

When you call an LLM API — you're essentially consulting that doctor. All that learned knowledge (weights) is being used to answer your question.


Why This Matters for What We're Building

Here's why you need to understand this:

In Phase 3 — we'll use embeddings to measure how similar two pieces of text are. This is the core of search in AI applications.

In Phase 4 — we'll store embeddings in a Vector Database. You'll understand exactly what you're storing and why.

In Phase 5 (RAG) — when a user asks a question, we'll convert that question to an embedding and find the most similar document chunks. Now you understand what "similar" actually means mathematically.

All of RAG — all of AI search — runs on embeddings.


3-Line Summary

  1. A vocabulary is the model's complete list of known tokens — but token ID numbers alone carry no meaning, just like roll numbers don't tell you anything about a student.
  2. An embedding converts each token into a list of hundreds of numbers that capture its meaning — words with similar meanings get similar numbers, so the model can understand relationships between concepts.
  3. Parameters and weights are the same thing — all the numbers learned inside the model during training — the embedding table is part of these weights, learned automatically by reading billions of words.

Module 2.2 — Complete ✅

Coming up — Module 2.3 — The Transformer Architecture

This is the engine behind every LLM — ChatGPT, Claude, Gemini — all of them. You'll finally understand what a "Transformer" actually is, why it was such a big deal when it was invented, and how it processes your text. Explained simply — no math, just clear ideas.

Module 2.1 — What is NLP, Words vs Tokens & Tokenization

Start With a Simple Problem

Computers are dumb in one specific way.

They only understand numbers. That's it. Everything inside a computer — images, videos, music, text — is secretly just numbers underneath.

So when you type:

"I love pizza"

The computer sees this as a bunch of characters. But it has no idea what "love" means. It has no idea "pizza" is a food. It just sees symbols.

The big question is:

How do we turn human language into something a computer can actually understand and work with?

This is exactly what NLP solves.


What is NLP?

NLP stands for Natural Language Processing.

Break it down:

  • Natural Language = the language humans speak and write. English, Hindi, Spanish — any human language.
  • Processing = making a computer work with it.

So NLP = teaching computers to read, understand, and work with human language.

NLP is not new. It's been around since the 1950s. But it used to be very basic — simple rule-based stuff like "if the word is 'not', flip the meaning."

Today's NLP — powered by LLMs — is on a completely different level. The computer doesn't just follow rules anymore. It actually understands context, meaning, and nuance.


The Core Problem NLP Solves

Think about how hard human language actually is.

Same word, completely different meanings:

"I went to the bank"
→ river bank? or money bank?

"She couldn't bear the pain"
→ bear = tolerate? or bear = the animal?

"He is so cool"
→ temperature? or personality?

Humans figure this out instantly from context. Computers used to completely fail at this.

NLP is the field of techniques and models that teach computers to handle exactly this kind of complexity.


Step 1 — How Does a Computer Start Reading Text?

Let's say you give a computer this sentence:

"Dogs are great pets"

First question — how does the computer even break this down?

The naive answer is: split by spaces.

"Dogs" | "are" | "great" | "pets"
→ 4 words

Simple. But this breaks immediately with real language:

"I can't do this"
→ split by space → "I" | "can't" | "do" | "this"
→ but "can't" is actually "can" + "not"
→ are these 1 word or 2?

"New York"
→ is this 1 thing or 2 separate words?

"state-of-the-art"
→ 1 word? 4 words? something else?

Splitting by spaces is too simple. Real language is messy.

So instead of splitting into words, we split into tokens.
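
You can watch the naive approach fall short in a few lines. `naiveSplit` is just a name chosen for this illustration.

```javascript
const naiveSplit = (text) => text.split(" ");

console.log(naiveSplit("Dogs are great pets"));  // ["Dogs", "are", "great", "pets"]
console.log(naiveSplit("I can't do this"));      // ["I", "can't", "do", "this"]
console.log(naiveSplit("I live in New York"));   // "New" and "York" come out as separate items
console.log(naiveSplit("state-of-the-art"));     // ["state-of-the-art"], one unbroken item
```

The split never looks inside a word and never joins two words, so it cannot express "can" + "not" or treat "New York" as one thing.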


Words vs Tokens — The Real Difference

A word is what you and I understand — a unit of meaning in language.

A token is what the computer uses — a chunk of text that the model has learned to work with.

They are close — but not the same.

Here's the key difference:

Word:  "unbelievable"
→ You see 1 word
→ Computer sees 3 tokens: ["un", "believ", "able"]

Word:  "cat"
→ You see 1 word
→ Computer sees 1 token: ["cat"]

Word:  "I"
→ You see 1 word
→ Computer sees 1 token: ["I"]

Common, short words = usually 1 token. Long or rare words = split into multiple tokens.


Why Split Into Tokens and Not Words?

Great question. Three simple reasons:

Reason 1 — Handles words it has never seen

Imagine someone types a brand new made-up word:

"Anthropicization"

The model has never seen this word. If we treated words as the unit — the model is completely lost.

But with tokens:

"Anthropicization" → ["Anthrop", "ic", "ization"]

The model knows these pieces. It can make sense of the word even though it's never seen the full thing.

Reason 2 — Works across languages

English: "hello"     → 1 token
Spanish: "hola"      → 1 token  
Hindi:   "नमस्ते"    → 2-3 tokens

One tokenizer handles all languages without needing separate systems for each.

Reason 3 — Keeps the vocabulary manageable

There are millions of words across all languages. If every word was a separate entry — the model would need a list of millions of items.

With sub-word tokens — you only need about 50,000 to 100,000 tokens to cover almost everything. Much more manageable.
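
A minimal sketch of sub-word splitting, using greedy longest-match against a small hand-written list of pieces. This is a simplified stand-in: production tokenizers typically learn their pieces from data (for example with byte-pair encoding), and the piece list and `tokenize` function here are invented for illustration.

```javascript
// Toy list of known pieces. A real tokenizer learns its pieces from data.
const pieces = new Set(["un", "believ", "able", "Anthrop", "ic", "ization", "cat"]);

// Greedy longest-match: at each position take the longest known piece.
// Falls back to a single character when nothing matches.
function tokenize(word) {
    const tokens = [];
    let start = 0;
    while (start < word.length) {
        let end = word.length;
        while (end > start + 1 && !pieces.has(word.slice(start, end))) end--;
        tokens.push(word.slice(start, end));
        start = end;
    }
    return tokens;
}

console.log(tokenize("unbelievable"));      // ["un", "believ", "able"]
console.log(tokenize("Anthropicization"));  // ["Anthrop", "ic", "ization"]
console.log(tokenize("cat"));               // ["cat"]
```

Seven known pieces were enough to cover a word the list has never seen whole, which is the same reason a vocabulary of 50,000 to 100,000 pieces can cover almost any text.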


What is Tokenization?

Tokenization is simply the process of splitting text into tokens.

It's step one — before the model does anything with your text, the tokenizer chops it up first.

Let's trace a real example:

Input text:
"I love building AI apps!"

After tokenization:
["I", " love", " building", " AI", " apps", "!"]

After converting to numbers:
[40, 1842, 2615, 9552, 5181, 0]

Now the model has something it can actually work with — a list of numbers.


Let's See This With Real Examples

Here's how some common text might get tokenized (exact splits differ from one tokenizer to another):

Text: "Hello world"
Tokens: ["Hello", " world"]
Count: 2 tokens

Text: "ChatGPT is amazing"
Tokens: ["Chat", "G", "PT", " is", " amazing"]
Count: 5 tokens

Text: "I can't stop learning"
Tokens: ["I", " can", "'t", " stop", " learning"]
Count: 5 tokens

Text: "2024"
Tokens: ["2024"]
Count: 1 token

Text: "$1,299.99"
Tokens: ["$", "1", ",", "299", ".", "99"]
Count: 6 tokens

Notice a few things:

  • Spaces are often attached to the NEXT word, not left separate
  • Punctuation becomes its own token
  • Numbers can split in unexpected ways
  • Contractions like "can't" split into "can" + "'t"

The Tokenizer is Separate From the Model

This is important to understand.

The tokenizer and the model are two different things:

Your Text
    ↓
[TOKENIZER]          ← splits text into tokens
    ↓
Tokens (numbers)
    ↓
[LLM MODEL]          ← processes the numbers
    ↓
Output tokens
    ↓
[TOKENIZER]          ← converts numbers back to text
    ↓
Response Text

The tokenizer runs first, converts text to numbers. The model works with those numbers. Then the tokenizer runs again at the end, converting the model's output numbers back into readable text.
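
The separation can be sketched as two independent pieces of code. Everything here is hypothetical: the five-entry vocabulary, the `tokenizer` object, and `fakeModel`, which returns fixed token IDs instead of computing anything. To keep it short, `encode` takes text that is already split into tokens.

```javascript
// Toy tokenizer with a made-up vocabulary; the index in the array is the token ID.
const vocabulary = ["Hello", " world", "!", " there", "Hi"];

const tokenizer = {
    encode: (tokens) => tokens.map((t) => vocabulary.indexOf(t)),
    decode: (ids) => ids.map((id) => vocabulary[id]).join(""),
};

// Stand-in for the model: it only ever sees and returns numbers.
// This fake one just answers with fixed token IDs.
function fakeModel(inputIds) {
    return [4, 3, 2];
}

const inputIds = tokenizer.encode(["Hello", " world"]);
const outputIds = fakeModel(inputIds);
const reply = tokenizer.decode(outputIds);

console.log(inputIds);  // [0, 1]
console.log(reply);     // "Hi there!"
```

The model function never touches a string, and the tokenizer never does any predicting. That is why a model only works with the tokenizer it was trained with: swap the vocabulary and the same IDs decode to different text.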


A Simple Real Life Analogy

Think of tokenization like a mail sorting room.

When letters arrive at a post office, the sorting room doesn't read every letter as a whole story. It breaks down the address into pieces:

"123 Main Street, New York, USA 10001"
→ House Number: 123
→ Street: Main Street
→ City: New York
→ Country: USA
→ ZIP: 10001

Each piece is a "token" — a chunk the sorting system can work with. The full address as one blob of text is hard to route. Broken into meaningful chunks — easy.

Tokenization does the same thing with language.


Why Should You Care About This as a Developer?

Because tokens directly affect three things in your apps:

1. Cost: Every API call charges you per token. If you understand tokenization, you can write efficient prompts and save money.

"Please be so kind as to summarize the following text"
→ 11 tokens — wordy, expensive

"Summarize:"
→ 2 tokens — same instruction, much cheaper

2. Speed: More tokens = more processing = slower response. Tight prompts are faster prompts.

3. Context Limit: Remember the context window from Module 1.3? It's measured in tokens. Knowing how tokenization works helps you estimate how much space you have left.
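
All three effects come down to counting tokens, so a rough estimator is handy. The sketch below uses the common rule of thumb of about 4 characters per token for English text. The price and context window values are hypothetical placeholders, and exact counts come only from your provider's tokenizer.

```javascript
// Rough rule of thumb for English text: about 4 characters per token.
// Real counts come from the provider's tokenizer; this is only an estimate.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// HYPOTHETICAL numbers for illustration; check your provider's pricing and limits.
const PRICE_PER_MILLION_INPUT_TOKENS = 3.0; // dollars
const CONTEXT_WINDOW = 8000;                // tokens

function promptBudget(prompt) {
    const tokens = estimateTokens(prompt);
    return {
        tokens,
        costDollars: (tokens / 1e6) * PRICE_PER_MILLION_INPUT_TOKENS,
        contextLeft: CONTEXT_WINDOW - tokens,
    };
}

const wordy = promptBudget("Please be so kind as to summarize the following text");
const tight = promptBudget("Summarize:");
console.log(wordy);
console.log(tight);
```

The estimate will not match a real tokenizer exactly, but the comparison holds: the tight prompt uses fewer tokens, costs less, and leaves more of the context window free.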


Quick Summary of the Flow So Far

You type text
      ↓
Tokenizer splits it into chunks (tokens)
      ↓
Each token gets converted to a number
      ↓
List of numbers goes into the model
      ↓
Model processes the numbers
      ↓
Model outputs numbers
      ↓
Tokenizer converts numbers back to text
      ↓
You see the response

3-Line Summary

  1. NLP is the field of teaching computers to read and understand human language — tokenization is the very first step in that process.
  2. A token is a chunk of text — not exactly a word, not a letter — common words are one token, long or rare words get split into multiple tokens.
  3. Tokenization matters to you as a developer because every token costs money, takes time, and uses up your context window — writing efficient prompts means understanding how your text gets split.

Module 2.1 — Complete ✅

Coming up — Module 2.2 — Vocabulary, Embeddings, Parameters & Model Weights

This is where it gets really interesting. You'll learn what "embeddings" actually are at the most basic level — and why they are the single most important concept in all of modern AI. We'll build up to it simply, step by step.

Module 1.4 — Prompts: System Prompt vs User Prompt vs Completion

Why Prompting is an Engineering Skill

Most people treat prompts like Google search queries — throw some words in, hope for the best, tweak randomly when it doesn't work.

That's not how good AI developers think.

A prompt is an instruction set. You are programming the model using natural language. And just like code, the way you write it determines exactly what you get back.

Bad prompt → unpredictable output → broken app → frustrated users.

Good prompt → consistent, structured output → reliable app → happy users.

By the end of this module you'll understand the exact structure of how messages reach the model, how to write prompts that actually work, and patterns you'll reuse in every AI application you build.


The Three Roles — How the Model Sees a Conversation

When you call any LLM API, the conversation is not just raw text. It's structured into roles. Every piece of text is tagged with who sent it.

There are three roles:

┌─────────────────────────────────────────────────┐
│  SYSTEM                                         │
│  Instructions for how the model should behave.  │
│  Set by YOU, the developer. User never sees it. │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│  USER                                           │
│  The message from the human in the conversation.│
│  This is what the user types.                   │
└─────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────┐
│  ASSISTANT                                      │
│  The model's response.                          │
│  Also called "completion" or "assistant turn."  │
└─────────────────────────────────────────────────┘

Every API call you make — whether it's a simple chatbot or a complex RAG pipeline — is built from combinations of these three roles.


Part 1 — The System Prompt

What it is

The system prompt is your developer instruction layer. The model reads it before anything else in the conversation. It tells the model:

  • Who it is
  • What it should and shouldn't do
  • What tone to use
  • What format to respond in
  • What domain it operates in
  • What to do in edge cases

The user never sees the system prompt. But it shapes every single response the model gives.

Think of it like this — before an employee starts a customer support call, their manager briefs them:

"You are a support agent for TechCorp. 
Always be polite. Never discuss pricing.
If you don't know something, say 'I'll 
check on that for you.' Keep answers short."

The customer calling in has no idea this briefing happened. But every answer the agent gives is shaped by it.

That briefing is your system prompt.


What a Weak System Prompt Looks Like

SYSTEM:
You are a helpful assistant.

This is what most beginners write. It's almost useless.

"Helpful" is vague. "Assistant" is vague. The model will guess what you want — and guessing means inconsistency.


What a Strong System Prompt Looks Like

Here's a system prompt for a customer support bot for a software product:

SYSTEM:
You are a customer support specialist for DevTool Pro, 
a developer productivity SaaS application.

Your behavior rules:
- Answer ONLY questions related to DevTool Pro features, 
  bugs, billing, and account management
- If a question is unrelated to DevTool Pro, politely 
  decline and redirect: "I can only help with DevTool 
  Pro related questions."
- Never speculate about features that don't exist
- If unsure, say: "I don't have that information right 
  now — let me connect you with our team."
- Always respond in plain English, no jargon
- Keep responses under 150 words unless the question 
  genuinely requires more detail

Response format:
- Start with a direct answer to the question
- Add explanation if needed
- End with one follow-up offer if relevant

Tone: Professional but warm. Never robotic.

This system prompt does five things well:

1. Defines identity       → who the model IS
2. Sets boundaries        → what it will and won't do
3. Handles edge cases     → what to do when it doesn't know
4. Controls format        → how the output is structured
5. Sets tone              → how it sounds

System Prompt in Code


    const response = await fetch("https://api.anthropic.com/v1/messages", {
        method: "POST",
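        headers: {
            "Content-Type": "application/json",
            "x-api-key": process.env.ANTHROPIC_API_KEY,  // API key read from the environment
            "anthropic-version": "2023-06-01"            // required API version header
        },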
        body: JSON.stringify({
            model: "claude-sonnet-4-6",
            max_tokens: 1024,
            system: `You are a customer support specialist for DevTool Pro.
                Answer only questions related to DevTool Pro.
                If unsure, say you'll connect them with the team.
                Keep responses under 150 words.
                Tone: Professional but warm.`,
            messages: [
                {
                    role: "user",
                    content: "How do I reset my password?"
                }
            ]
        })
    });

Notice — in the Anthropic API, system is a separate field, not part of the messages array. In OpenAI's API it's a message with role "system" inside the array. Different APIs, same concept.
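
The difference between the two request shapes can be sketched with two small builder functions. This is a minimal sketch: the function names are made up for illustration, and "gpt-4o" is only a placeholder model name.

```javascript
// Hypothetical helpers: the names are illustrative, not part of either SDK.

// Anthropic Messages API: the system prompt is a top-level "system" field.
function buildAnthropicBody(systemPrompt, userText) {
    return {
        model: "claude-sonnet-4-6",   // model name taken from the example above
        max_tokens: 1024,
        system: systemPrompt,
        messages: [{ role: "user", content: userText }]
    };
}

// OpenAI Chat Completions API: the system prompt is the first message in the array.
function buildOpenAIBody(systemPrompt, userText) {
    return {
        model: "gpt-4o",              // placeholder model name
        messages: [
            { role: "system", content: systemPrompt },
            { role: "user", content: userText }
        ]
    };
}

const anthropicBody = buildAnthropicBody("You are a support agent.", "How do I reset my password?");
const openaiBody = buildOpenAIBody("You are a support agent.", "How do I reset my password?");
console.log(JSON.stringify(anthropicBody, null, 2));
console.log(JSON.stringify(openaiBody, null, 2));
```

Keeping the system prompt as a plain string in your own code and converting it at the last step makes it easier to switch providers later.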


Part 2 — The User Prompt

What it is

The user prompt is the actual message from the human. In a chat application, this is what the user types. In a backend pipeline (like RAG), this might be constructed programmatically.

Simple case — user just types naturally:

USER:
How do I reset my password?

But as a developer, you'll often construct the user prompt yourself — adding context, formatting, injected data — before it reaches the model.


Constructed User Prompts

In real applications, what looks like a "user message" is often built by your code. This is normal and powerful.

Example — a document summarizer:


    const userDocument = "...500 words of content from uploaded PDF...";
    const userQuestion = "What are the key action items?";

    const constructedUserPrompt = `
    Here is the document the user uploaded:

    <document>
    ${userDocument}
    </document>

    User's question: ${userQuestion}

    Please answer based only on the document above.
    `;

    // This constructed prompt is sent as the user message

The actual user only typed "What are the key action items?" — but your code wrapped it with the document content and clear instructions before sending.

This is a pattern you'll use constantly in RAG applications.


Prompt Engineering Techniques

Here are the core techniques that actually work — not magic phrases, but structural patterns:


Technique 1 — Be Specific, Not Vague

❌ Vague:
"Write something about climate change"

✅ Specific:
"Write a 3-paragraph summary of the causes of 
climate change, written for a high school student 
with no science background. Use simple language 
and one real-world example per paragraph."

Specificity removes guessing. Less guessing = more consistent output.


Technique 2 — Specify the Output Format

❌ No format specified:
"List the pros and cons of React vs Vue"

✅ Format specified:
"Compare React and Vue. Return your response as 
a JSON object with this exact structure:

{
  "react": {
    "pros": ["...", "..."],
    "cons": ["...", "..."]
  },
  "vue": {
    "pros": ["...", "..."],
    "cons": ["...", "..."]
  }
}

Return only the JSON. No explanation before or after."

When you specify format precisely, your code can parse the output reliably. This is critical for building real applications.
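
Even with a precise format instruction, it is safer to validate the reply than to trust it. A minimal sketch of that check for the React/Vue structure above; the function name and the sample replies are made up for illustration.

```javascript
// Parse a model reply that is supposed to be the React/Vue comparison JSON.
// Returns the parsed object, or null if the reply is not in the expected shape.
function parseComparison(replyText) {
    let parsed;
    try {
        parsed = JSON.parse(replyText.trim());
    } catch (err) {
        return null; // the model added prose, or the JSON is malformed
    }
    if (parsed === null || typeof parsed !== "object") return null;
    for (const framework of ["react", "vue"]) {
        const entry = parsed[framework];
        if (!entry || !Array.isArray(entry.pros) || !Array.isArray(entry.cons)) {
            return null; // a required key is missing or has the wrong type
        }
    }
    return parsed;
}

// Made-up replies for illustration.
const goodReply = '{"react":{"pros":["Large ecosystem"],"cons":["More boilerplate"]},' +
                  '"vue":{"pros":["Gentle learning curve"],"cons":["Smaller ecosystem"]}}';
const badReply = "Sure! Here is the comparison: { ... }";

console.log(parseComparison(goodReply).vue.pros);
console.log(parseComparison(badReply)); // null
```

Returning null instead of throwing lets the calling code decide whether to retry the request or show a fallback.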


Technique 3 — Give Examples (Few-Shot Prompting)

This is one of the most powerful techniques. Show the model exactly what you want by example:

Classify customer messages as: BILLING, TECHNICAL, or GENERAL

Examples:
Message: "My invoice shows the wrong amount"
Category: BILLING

Message: "The app crashes when I click export"  
Category: TECHNICAL

Message: "What are your business hours?"
Category: GENERAL

Now classify this message:
Message: "I was charged twice this month"
Category:

The model has seen the pattern three times. It knows exactly what to do. No ambiguity.

This is called few-shot prompting — giving a few examples before the actual task.

Zero-shot = no examples, just instruction. Few-shot = a few examples before the task. One-shot = exactly one example.
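
In application code, the examples usually live in an array and the prompt is assembled from them. A minimal sketch, assuming a hypothetical helper named buildFewShotPrompt:

```javascript
// Build a few-shot classification prompt from labeled examples.
// With an empty examples array this degrades to a zero-shot prompt.
function buildFewShotPrompt(labels, examples, newMessage) {
    const lines = [`Classify customer messages as: ${labels.join(", ")}`, ""];
    if (examples.length > 0) {
        lines.push("Examples:");
        for (const ex of examples) {
            lines.push(`Message: "${ex.message}"`, `Category: ${ex.category}`, "");
        }
    }
    lines.push("Now classify this message:", `Message: "${newMessage}"`, "Category:");
    return lines.join("\n");
}

const fewShotPrompt = buildFewShotPrompt(
    ["BILLING", "TECHNICAL", "GENERAL"],
    [
        { message: "My invoice shows the wrong amount", category: "BILLING" },
        { message: "The app crashes when I click export", category: "TECHNICAL" },
        { message: "What are your business hours?", category: "GENERAL" }
    ],
    "I was charged twice this month"
);
console.log(fewShotPrompt);
```

Passing one example gives a one-shot prompt, and passing none gives a zero-shot prompt, so the same function covers all three cases.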


Technique 4 — Chain of Thought

For complex reasoning tasks, ask the model to think step by step before giving the answer:

❌ Direct answer (often wrong on complex problems):
"What is 15% of 847?"

✅ Chain of thought:
"What is 15% of 847? Think step by step before 
giving the final answer."

Model output:
"Step 1: 10% of 847 = 84.7
 Step 2: 5% of 847 = 84.7 / 2 = 42.35
 Step 3: 15% = 84.7 + 42.35 = 127.05
 Answer: 127.05"

Making the model reason explicitly before answering dramatically improves accuracy on math, logic, and multi-step problems.

This is the technique behind OpenAI's "o1" model — it was trained to think before answering, not just immediately generate responses.
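
When you use chain of thought in an application, your code usually needs only the final answer, not the reasoning. A minimal sketch, assuming the prompt asked the model to end with a line that starts with "Answer:" (the function name is hypothetical):

```javascript
// Pull the final answer out of a step-by-step reply.
// Assumes the prompt asked the model to finish with a line starting "Answer:".
function extractFinalAnswer(replyText) {
    const lines = replyText.split("\n").map((line) => line.trim());
    // Search from the end so intermediate steps are never mistaken for the answer.
    for (let i = lines.length - 1; i >= 0; i--) {
        if (lines[i].startsWith("Answer:")) {
            return lines[i].slice("Answer:".length).trim();
        }
    }
    return null; // the model did not follow the requested format
}

const cotReply = [
    "Step 1: 10% of 847 = 84.7",
    "Step 2: 5% of 847 = 84.7 / 2 = 42.35",
    "Step 3: 15% = 84.7 + 42.35 = 127.05",
    "Answer: 127.05"
].join("\n");

console.log(extractFinalAnswer(cotReply)); // prints 127.05
```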


Technique 5 — Constrain What the Model Can and Cannot Do

You are a data extraction assistant.

Rules:
- Extract ONLY information that is explicitly stated 
  in the document
- If information is not in the document, return null 
  for that field
- NEVER infer or guess missing information
- NEVER add information from your own knowledge

This is critical — return null rather than guessing.

Explicit constraints reduce hallucination, although they don't eliminate it. In production apps this is not optional: you must constrain the model's behavior, and you should still check what comes back.
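
Prompt rules can be backed up by a check in code. The sketch below nulls out any extracted value that does not appear verbatim in the source document. The function name, the invoice text, and the model output are all made up for illustration, and this is only a coarse guard rather than a full verification step.

```javascript
// Null out any extracted string value that does not appear verbatim in the source.
// This is a coarse check: it catches invented values, but not subtle misreadings.
function dropUnsupportedFields(documentText, extracted) {
    const haystack = documentText.toLowerCase();
    const verified = {};
    for (const [field, value] of Object.entries(extracted)) {
        const supported =
            typeof value === "string" && haystack.includes(value.toLowerCase());
        verified[field] = supported ? value : null;
    }
    return verified;
}

// Made-up sample data for illustration.
const invoiceText = "Invoice 1042 issued to Acme Ltd. Total due: $1,250.00.";
const modelOutput = { customer: "Acme Ltd", total: "$1,250.00", dueDate: "2024-03-01" };

const verifiedFields = dropUnsupportedFields(invoiceText, modelOutput);
console.log(verifiedFields); // dueDate becomes null because the document never states it
```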


Part 3 — The Completion (Assistant Response)

What it is

The completion is the model's response — everything it generates back. In the API it's tagged with role "assistant."

Simple example:

USER:    "What is 2 + 2?"
ASSISTANT: "4"

The completion is "4."


Why You Sometimes Write the Assistant Turn Yourself

Here's something that surprises developers — you can pre-fill the assistant's response. You write the beginning of the assistant's answer, and the model continues from there.

This is called assistant prefilling and it's a powerful technique:

messages: [
  {
    role: "user",
    content: "Give me the user data as JSON"
  },
  {
    role: "assistant",
    content: "{"    // ← you start the JSON, model continues
  }
]

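By starting with {, you strongly steer the model toward JSON. It can't open with an explanation or preamble, because its output has to continue from {. The returned text picks up after your prefill, so prepend the { yourself before parsing, and still validate the result, since the model can produce malformed JSON.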

This is useful when you need:

  • Pure JSON output with no surrounding text
  • Responses that start in a specific way
  • Format enforcement without relying on instructions alone
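
One practical detail when using a prefill: the reply contains only the continuation, so the prefilled text has to be joined back on before parsing. A minimal sketch, assuming the continuation arrives as a string (the function name and sample strings are hypothetical):

```javascript
// The reply holds only the continuation, not the prefilled text,
// so the prefill has to be glued back on before parsing.
function parsePrefilledJson(prefill, continuation) {
    try {
        return JSON.parse(prefill + continuation);
    } catch (err) {
        return null; // the continuation did not complete a valid JSON document
    }
}

// Hypothetical continuation text, standing in for data.content[0].text.
const continuationText = '"name": "Arjun", "plan": "pro"}';

console.log(parsePrefilledJson("{", continuationText));   // parsed object
console.log(parsePrefilledJson("{", '"name": "Arjun"'));  // null (the closing brace is missing)
```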

Multi-Turn Conversations — How History is Structured

In a real conversation, messages alternate between user and assistant:

messages: [
  {
    role: "user",
    content: "My name is Arjun"
  },
  {
    role: "assistant", 
    content: "Nice to meet you, Arjun! How can I help you today?"
  },
  {
    role: "user",
    content: "What is my name?"
  }
  // Model will respond with "Arjun" because it can see the full history
]

Your application is responsible for maintaining this history array and sending it with every request. The model itself stores nothing between calls.

This is the exact reason why building a chatbot is more than just calling the API — you need to manage conversation state.
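
A minimal sketch of that state management: a small class that collects turns and hands back the full array for each request. The class name is made up for illustration, and storage, trimming, and the API call itself are left out.

```javascript
// Minimal conversation state holder.
// Persistence, trimming, and the actual API call are intentionally left out.
class Conversation {
    constructor() {
        this.messages = [];
    }
    addUser(text) {
        this.messages.push({ role: "user", content: text });
    }
    addAssistant(text) {
        this.messages.push({ role: "assistant", content: text });
    }
    // The full array is what gets sent with every request.
    toRequestMessages() {
        return this.messages.map((m) => ({ ...m }));
    }
}

const chat = new Conversation();
chat.addUser("My name is Arjun");
chat.addAssistant("Nice to meet you, Arjun! How can I help you today?");
chat.addUser("What is my name?");
console.log(chat.toRequestMessages().length); // 3
```

After each API response, you would call addAssistant with the reply so the next request includes it.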


Putting It All Together — A Real Example

Here's a complete, production-style prompt structure for a code review assistant:


    const codeToReview = `
    function calculateTotal(items) {
    let total = 0;
    for (let i = 0; i <= items.length; i++) {
        total += items[i].price;
    }
    return total;
    }
    `;

    const response = await fetch("https://api.anthropic.com/v1/messages", {
        method: "POST",
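        headers: {
            "Content-Type": "application/json",
            "x-api-key": process.env.ANTHROPIC_API_KEY,  // API key read from the environment
            "anthropic-version": "2023-06-01"            // required API version header
        },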
        body: JSON.stringify({
            model: "claude-sonnet-4-6",
            max_tokens: 1024,

            // SYSTEM — defines behavior
            system: `You are a senior JavaScript developer doing
    code review. You identify bugs, performance issues,
    and style problems.

    Always respond in this exact JSON format:
    {
    "bugs": ["description of each bug"],
    "performance": ["performance issues"],
    "suggestions": ["style and best practice suggestions"],
    "severity": "low | medium | high"
    }

    Return only valid JSON. No explanation outside the JSON.`,

            messages: [
                // USER — the actual request
                {
                    role: "user",
                    content: `Please review this JavaScript function:

    \`\`\`javascript
    ${codeToReview}
    \`\`\`

    Identify all issues.`
                }
            ]
        })
    });

    const data = await response.json();
    const review = JSON.parse(data.content[0].text);
    console.log(review);

    /*
    Output:
    {
    "bugs": [
        "Off-by-one error: loop condition should be i < items.length,
        not i <= items.length. The last iteration accesses
        items[items.length] which is undefined, causing a
        TypeError when accessing .price"
    ],
    "performance": [
        "Consider using Array.reduce() instead of a for loop
        for cleaner, more idiomatic JavaScript"
    ],
    "suggestions": [
        "Add input validation to handle empty arrays",
        "Consider handling cases where items[i].price might be
        undefined or NaN"
    ],
    "severity": "high"
    }
    */

Notice what's happening here:
System prompt  → defines role + enforces JSON format
User prompt    → injects the actual code + asks the question
Output         → structured JSON your code can work with

This is the pattern for every production AI feature you'll build.


The Mental Model for Prompts

System Prompt   = The employee's job description + rules
User Prompt     = The customer's request
Completion      = The employee's response

Your job as developer:
Write a job description clear enough that 
the employee never has to guess what to do.

3-Line Summary

  1. Every LLM interaction has three roles — system (your instructions as developer), user (the human's message), and assistant (the model's response) — understanding these lets you control exactly what the model does.
  2. Strong system prompts define identity, boundaries, format, tone, and edge cases — vague prompts lead to inconsistent apps, specific prompts lead to reliable ones.
  3. Prompt engineering is a structural skill — specifying output format, using few-shot examples, adding chain-of-thought, and constraining behavior are the techniques that make AI applications actually work in production.

Module 1.4 — Complete ✅

Phase 1 is done. 🎉

You now understand:

  • The full AI/ML/DL/LLM hierarchy
  • How generative AI works and why hallucination happens
  • Tokens, context windows, and temperature
  • System prompts, user prompts, and how to engineer them properly

Coming Up — Phase 2: LLM Internals

Module 2.1 — What is NLP, Words vs Tokens, and Tokenization

We go inside the black box. You'll understand exactly how text becomes numbers, why tokenization works the way it does, and what's really happening before the Transformer even sees your input.

Module 1.3 — Tokens, Context Window & Temperature

Why This Module Matters for Developers

These are not just theoretical concepts. As a developer building AI applications, you will make decisions every single day based on tokens, context windows, and temperature.

  • Your billing is based on tokens
  • Your app's memory limit is defined by context window
  • Your output quality vs creativity is controlled by temperature

If you don't understand these deeply, you'll either build broken apps or waste money. Let's fix that right now.


Part 1 — Tokens

What Exactly is a Token?

You already know a model can't read raw text — it converts text to numbers first. But it doesn't convert letter by letter, and it doesn't convert word by word either.

It converts token by token.

A token is a chunk of text. The size of that chunk depends on the word:

Common short words    → usually 1 token
"cat"       = 1 token
"the"       = 1 token
"is"        = 1 token

Longer or rarer words → split into multiple tokens
"tallest"   = 2 tokens  → ["tall", "est"]
"unbelievable" = 3 tokens → ["un", "believ", "able"]

Punctuation and spaces → their own tokens
"Hello!"    = 2 tokens  → ["Hello", "!"]
"\n"        = 1 token

Numbers
"2024"      = 1-2 tokens depending on the model

The component that decides how to split text into tokens is called a Tokenizer. Each LLM family has its own tokenizer with its own rules, so the splits shown above are illustrative and the exact pieces vary by model.


The Real-World Rule of Thumb

OpenAI gives a simple approximation that's useful for estimation:

100 tokens ≈ 75 words

OR

1 token ≈ 4 characters of English text

So if you write a 300-word prompt, that's roughly 400 tokens. A 1000-word essay is roughly 1300 tokens.
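
Both rules of thumb are easy to turn into estimator functions. This is a rough sketch for budgeting only; the function names are made up, and real counts come from the model's own tokenizer.

```javascript
// Rough token estimates from the two rules of thumb above.
// Real counts come from the model's own tokenizer and can differ noticeably.
function estimateTokensFromChars(text) {
    return Math.ceil(text.length / 4);   // 1 token is about 4 characters
}

function estimateTokensFromWords(wordCount) {
    return Math.ceil(wordCount / 0.75);  // 100 tokens is about 75 words
}

console.log(estimateTokensFromWords(300));  // 400
console.log(estimateTokensFromWords(1000)); // 1334
console.log(estimateTokensFromChars("What is the tallest mountain in the world?")); // 11
```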


Why Does the Model Use Tokens Instead of Words?

Three reasons:

Reason 1 — Handles unknown words gracefully

If someone types a brand new word — a typo, a name, a technical term the model has never seen — breaking it into sub-word tokens means the model can still process it.

"Anthropicization"  (made-up word)
→ ["Anthrop", "ic", "ization"]

Model can still understand the pieces even if 
it's never seen the full word.

Reason 2 — Efficient vocabulary size

If tokens were full words, the vocabulary would need millions of entries (every word in every language). With sub-word tokenization, you can cover almost all languages with a vocabulary of just 50,000–100,000 tokens.

Reason 3 — Handles code and symbols well

"function()" → ["function", "(", ")"]
"==="        → ["==="] or ["==", "="]
"$100"       → ["$", "100"]

Code, math, and symbols tokenize cleanly this way.


Tokens and Cost — The Direct Connection

Every major LLM API charges you per token — both for the tokens you send in (input) and the tokens the model generates back (output).

OpenAI's pricing example (approximate, for understanding):

GPT-4o
Input tokens:  $5 per 1 million tokens
Output tokens: $15 per 1 million tokens

This means every single character in your prompt costs money. And the model's response costs even more per token.

Now imagine you're building a RAG application that injects 5 pages of PDF content into every prompt. That's thousands of tokens — on every single request — from every single user.

System Prompt        →   200 tokens
Injected PDF context →  3000 tokens
User question        →    50 tokens
Model response       →   400 tokens
─────────────────────────────────────
Total per request    →  3650 tokens

1000 users/day       →  3,650,000 tokens/day

Token awareness is cost awareness. You'll think about this constantly in production.
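
The same numbers can be turned into a daily cost estimate. This sketch uses the approximate GPT-4o prices quoted above and assumes each of the 1000 users sends exactly one request per day; the function and variable names are made up for illustration.

```javascript
// Prices are the approximate GPT-4o figures quoted above (USD per 1M tokens).
const INPUT_PRICE_PER_MILLION = 5;
const OUTPUT_PRICE_PER_MILLION = 15;

function requestCostUsd(inputTokens, outputTokens) {
    return (inputTokens * INPUT_PRICE_PER_MILLION +
            outputTokens * OUTPUT_PRICE_PER_MILLION) / 1_000_000;
}

const ragInputTokens = 200 + 3000 + 50; // system prompt + PDF context + user question
const ragOutputTokens = 400;            // model response

const perRequest = requestCostUsd(ragInputTokens, ragOutputTokens);
const perDay = perRequest * 1000;       // assumes 1000 users, one request each

console.log(perRequest.toFixed(5)); // 0.02225
console.log(perDay.toFixed(2));     // 22.25
```

Note that the 400 output tokens account for 27% of the cost ($0.006 of $0.02225) even though they are only about 11% of the tokens, because output is priced three times higher.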


Tokens and Speed

More tokens in the input = more computation = slower response.

More tokens in the output = model has to generate more = slower response.

This is why when you build applications, you'll learn to:

  • Write tight, efficient prompts
  • Limit output length when you don't need long responses
  • Chunk documents instead of dumping everything in at once

Part 2 — Context Window

What is a Context Window?

The context window is the maximum number of tokens the model can read at one time — input AND output combined.

Think of it like a whiteboard. The model can only see what's written on the whiteboard right now. Anything that doesn't fit on the whiteboard — the model simply cannot see.

Context Window = The whiteboard

What goes on the whiteboard:
┌────────────────────────────────┐
│  System Prompt                 │
│  Conversation History          │
│  Injected Documents (RAG)      │
│  Current User Message          │
│  Model's Response (being gen.) │
└────────────────────────────────┘

Total must be ≤ Context Window limit

Context Window Sizes — Then vs Now

This has changed dramatically in just a few years:

GPT-3 (2020)          →     2,048 tokens  (~1,500 words)
GPT-3.5 (2022)        →     4,096 tokens  (~3,000 words)
GPT-4 (2023)          →     8,192 tokens  (~6,000 words)
GPT-4 Turbo (2023)    →   128,000 tokens  (~96,000 words)
GPT-4o (2024)         →   128,000 tokens  (~96,000 words)
Claude 3.5 (2024)     →   200,000 tokens  (~150,000 words)
Gemini 1.5 Pro (2024) → 1,000,000 tokens  (~750,000 words)

128,000 tokens is roughly the length of an entire novel. 1,000,000 tokens is roughly seven or eight novels.


Why Context Window Matters For Your Apps

Scenario 1 — Chatbot with long conversations

Every message in the conversation takes up tokens. Eventually, the conversation gets too long and older messages have to be dropped. Your chatbot "forgets" what was said at the beginning.

Message 1:  "My name is Arjun"           ← might get dropped
Message 2:  "I work at a startup"        ← might get dropped
...
Message 47: "What did I tell you my name was?"
Model: "I don't have that information"   ← frustrating for user

This is why building proper memory systems matters — Phase 8 covers this.
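
The simplest way messages get dropped is a sliding window that keeps only the newest turns that fit. A minimal sketch, assuming the rough 4-characters-per-token estimate instead of a real tokenizer (the function name is hypothetical):

```javascript
// Keep the most recent messages that fit in a token budget.
// Token counts use the rough 4-characters-per-token estimate, not a real tokenizer.
function trimHistoryToBudget(messages, maxTokens) {
    const kept = [];
    let used = 0;
    // Walk backwards so the newest messages are kept first.
    for (let i = messages.length - 1; i >= 0; i--) {
        const cost = Math.ceil(messages[i].content.length / 4);
        if (used + cost > maxTokens) break;
        kept.unshift(messages[i]);
        used += cost;
    }
    return kept;
}

const longHistory = [
    { role: "user", content: "My name is Arjun" },                 // 16 chars, about 4 tokens
    { role: "assistant", content: "Nice to meet you, Arjun!" },    // 24 chars, about 6 tokens
    { role: "user", content: "What did I tell you my name was?" }  // 32 chars, about 8 tokens
];

console.log(trimHistoryToBudget(longHistory, 14).length); // 2
```

With a budget of 14 tokens, the first message is the one that gets dropped, which is exactly the message that contained the name.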

Scenario 2 — RAG Applications

When you inject document content into a prompt, it eats up context window space. If a user uploads a 500-page PDF, you cannot dump the whole thing into the context window — it won't fit (on smaller models) and it'll be extremely expensive even if it does fit.

This is exactly why chunking exists — you break documents into pieces and only inject the most relevant pieces. We'll build this in Phase 5.
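
As a preview, here is a minimal character-based chunker with overlap. The sizes are illustrative and the function name is made up; real pipelines often split on sentences, paragraphs, or tokens instead.

```javascript
// Split text into overlapping chunks by character count.
// Sizes are illustrative; real pipelines often split on sentences or tokens instead.
function chunkText(text, chunkSize, overlap) {
    if (overlap >= chunkSize) {
        throw new Error("overlap must be smaller than chunkSize");
    }
    const chunks = [];
    const step = chunkSize - overlap;
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        if (start + chunkSize >= text.length) break; // last chunk reached the end
    }
    return chunks;
}

const sampleChunks = chunkText("abcdefghijklmnopqrstuvwxyz", 10, 2);
console.log(sampleChunks); // ["abcdefghij", "ijklmnopqr", "qrstuvwxyz"]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.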

Scenario 3 — The Lost in the Middle Problem

Research has shown that LLMs actually don't pay equal attention to all parts of the context window:

Context Window

[Beginning]   ← model pays HIGH attention here
[Middle]      ← model pays LOW attention here ⚠️
[End]         ← model pays HIGH attention here

If you stuff 50 documents into the context and the relevant one lands in the middle — the model might miss it. This is called the "Lost in the Middle" problem. Knowing this helps you design better RAG systems.
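
One mitigation that is sometimes used is to reorder retrieved documents so the most relevant ones sit at the edges of the context. A minimal sketch, assuming the input is already sorted from most to least relevant; the function name is made up, and whether this helps depends on the model.

```javascript
// Given documents sorted from most to least relevant, place the strongest ones
// at the start and end of the context and let the weakest fall in the middle.
function reorderForEdges(docsByRelevance) {
    const front = [];
    const back = [];
    docsByRelevance.forEach((doc, index) => {
        if (index % 2 === 0) front.push(doc); // ranks 1, 3, 5, ... fill from the start
        else back.unshift(doc);               // ranks 2, 4, ... fill from the end
    });
    return front.concat(back);
}

console.log(reorderForEdges(["d1", "d2", "d3", "d4", "d5"]));
// ["d1", "d3", "d5", "d4", "d2"]
```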


Input Tokens vs Output Tokens

The context window covers both:

Input tokens  = everything you send to the model
Output tokens = everything the model generates back

Input + Output ≤ Context Window

If your context window is 128,000 tokens and you send 120,000 tokens of input — the model can only generate 8,000 tokens of output. The whiteboard is almost full.

You also typically set a max_tokens parameter when calling the API — this limits how long the response can be. Useful for controlling cost and keeping responses concise.
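
The two limits combine into a simple calculation. A minimal sketch with a hypothetical function name; it ignores any separate per-model cap on output length that a provider may impose.

```javascript
// How many output tokens can actually be produced for a given request.
function availableOutputTokens(contextWindow, inputTokens, maxTokens) {
    const roomLeft = Math.max(0, contextWindow - inputTokens);
    return Math.min(roomLeft, maxTokens);
}

console.log(availableOutputTokens(128000, 120000, 16000)); // 8000 (limited by the context window)
console.log(availableOutputTokens(128000, 3250, 1024));    // 1024 (limited by max_tokens)
```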


Part 3 — Temperature

What is Temperature?

Temperature controls how much randomness is introduced when the model selects the next token.

Remember from Module 1.2 — the model produces a probability distribution over all possible next tokens:

Next token probabilities (simplified):

"Paris"      → 72%
"France"     → 14%
"Europe"     → 8%
"London"     → 4%
"pizza"      → 0.5%
"banana"     → 0.001%

Temperature decides how the model samples from this distribution.


Temperature = 0

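At temperature 0, the model always picks the highest probability token (greedy decoding), so sampling adds no randomness. Hosted APIs can still show small run-to-run differences, so treat the output as near-deterministic rather than guaranteed identical.

Every single run:
"Paris" gets picked every time → effectively deterministic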

Use temperature 0 when you need:

  • Consistent, repeatable outputs
  • Factual question answering
  • Data extraction
  • Classification tasks
  • Anything where the same input should always give the same output

Temperature = 1

At temperature 1, the model samples according to the actual probability distribution. High probability tokens get picked often, low probability ones occasionally.

Run 1: "Paris"   (picked 72% of the time)
Run 2: "Paris"
Run 3: "France"  (picked 14% of the time)
Run 4: "Paris"
Run 5: "Europe"  (picked 8% of the time)

This gives natural variation — different phrasings, different word choices — while still being coherent.

Use temperature 1 when you need:

  • Natural conversation
  • Writing assistance
  • General purpose chatbots

Temperature > 1

At high temperatures, the model flattens the probability distribution — making unlikely tokens more likely to get picked.

Temperature = 1.5

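"Paris"      → now about 50%   (reduced from 72%)
"France"     → now about 17%
"Europe"     → now about 12%
"London"     → now about 7%
"pizza"      → now about 2%    ← several times more likely now
"banana"     → now about 0.03% ← still rare, but about 30x more likely

(Each probability is raised to the power 1/1.5 and the six values are renormalized.)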

The output gets creative, unusual, unexpected — and also potentially incoherent or wrong.

Use high temperature when you need:

  • Brainstorming wildly different ideas
  • Creative writing with unexpected twists
  • Generating varied options to choose from

Temperature Visualized

Temperature: 0          Temp: 0.7          Temp: 1.5
────────────────        ──────────         ──────────────
█████████░░░░░░░        ██████░░░░░        ████░░░░░░░░░░
Very peaked             Balanced           Very flat
One clear winner        Natural mix        Everything possible

Deterministic           Varied             Chaotic
Factual                 Creative           Unpredictable
Consistent              Natural            Risky

Temperature in Practice — Real Code

Here's how you set these parameters when calling the Anthropic API in JavaScript:


    const response = await fetch("https://api.anthropic.com/v1/messages", {
        method: "POST",
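        headers: {
            "Content-Type": "application/json",
            "x-api-key": process.env.ANTHROPIC_API_KEY,  // API key read from the environment
            "anthropic-version": "2023-06-01"            // required API version header
        },
        body: JSON.stringify({
            model: "claude-sonnet-4-6",
            max_tokens: 1024,      // maximum output tokens
            temperature: 0.7,      // 0 = most deterministic, 1 = most varied (Anthropic accepts 0 to 1)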
            messages: [
                {
                    role: "user",
                    content: "What is the tallest mountain in the world?"
                }
            ]
        })
    });

    const data = await response.json();
    console.log(data.content[0].text);


Choosing the Right Temperature — Quick Guide

Use Case                          Recommended Temperature
──────────────────────────────────────────────────────────
Data extraction from documents    0
Factual Q&A                       0 to 0.3
Summarization                     0.3
General chatbot                   0.7
Email / content writing           0.7
Creative writing                  1.0
Brainstorming                     1.0 to 1.2
Poetry / experimental             1.2+

For most production AI applications — RAG systems, customer support bots, document analyzers — you'll stay between 0 and 0.5. You want accuracy over creativity.


How All Three Connect

User sends message
        ↓
[CONTEXT WINDOW starts filling up]
  System Prompt      → tokens consumed
  Chat History       → tokens consumed
  RAG Documents      → tokens consumed
  User Message       → tokens consumed
        ↓
[Must stay within Context Window limit]
        ↓
Model processes all input tokens
        ↓
Generates output token by token
[TEMPERATURE controls how each token is picked]
        ↓
Stops at max_tokens or [END] token
        ↓
Output tokens also counted against context window
        ↓
You get billed for input tokens + output tokens

3-Line Summary

  1. A token is a chunk of text (roughly 4 characters) — not a word, not a letter — and every API call is billed by how many tokens go in and come out.
  2. The context window is the model's total working memory — everything (system prompt, history, documents, response) must fit inside it, and content in the middle gets the least attention.
  3. Temperature controls randomness in token selection — use 0 for factual precision and consistency, 0.7 for natural conversation, and higher for creative tasks.

Module 1.3 — Complete ✅

Coming up: Module 1.4 — Prompts: System Prompt vs User Prompt vs Completion

This is where you learn the skill that separates average AI developers from great ones. Prompt engineering isn't about magic phrases — it's about understanding exactly how the model reads and interprets instructions, and structuring your prompts so the model has no choice but to give you exactly what you need.

Module 1.2 — What is Generative AI & How ChatGPT Actually Works

Where We Left Off

In Module 1.1 you learned that an LLM generates text by predicting the next most probable token — one at a time. That's the core mechanic.

But now you might be wondering:

  • What exactly makes AI "generative"?
  • What's actually happening inside ChatGPT when I send a message?
  • Why does it sometimes get things wrong if it's so powerful?
  • Why does it respond differently every time to the same question?

All of that — by the end of this module.


What Does "Generative" Actually Mean?

There are two broad categories of AI models:

Discriminative AI — learns to tell things apart.

Input: [image of a cat]
Task:  Is this a cat or a dog?
Output: "Cat" — 94% confidence

It draws a boundary between categories. It classifies. It judges. It does NOT create anything new.

Generative AI — learns the underlying patterns of data so deeply that it can create brand new data that looks like it came from the same distribution.

Input: "Write me a poem about rain"
Task:  Generate new text that fits this request
Output: A poem that has never existed before

It doesn't retrieve a stored poem. It doesn't copy-paste from its training data. It generates something new — token by token — using patterns it learned during training.

This is the core idea:

Generative AI has learned the patterns of its training data well enough to produce new, original content that fits those same patterns.

A generative image model doesn't store millions of images. It learns what makes an image look realistic, and then constructs a new one from scratch.

A generative language model doesn't store billions of sentences — it learns what makes text sound coherent, and constructs new sentences token by token.


Types of Generative AI

Generative AI is not just ChatGPT. It's a family of models:

Model Type          What it Generates       Examples
──────────────────────────────────────────────────────────────────────────────
LLM                 Text                    GPT-4, Claude, Gemini
Image Generation    Images                  DALL-E, Midjourney, Stable Diffusion
Audio Generation    Music, Speech           Suno, ElevenLabs
Video Generation    Video                   Sora, Runway
Code Generation     Code                    GitHub Copilot, Cursor
Multimodal          Text + Image + Audio    GPT-4o, Gemini Ultra

All of these are "generative" — they create new content by learning from existing content. We will focus on LLMs because that is what powers everything in this course — RAG, Agents, LangChain — all of it is built on top of LLMs.


How ChatGPT Actually Works — The Full Journey

Let's trace one message, start to finish.

You open ChatGPT and type:

"What is the tallest mountain in the world?"

Here is exactly what happens:


Step 1 — Your Message Gets Combined With a System Prompt

You only see your message. But ChatGPT doesn't receive just your message. It receives something like this:

SYSTEM:
You are ChatGPT, a helpful, harmless, and honest 
AI assistant made by OpenAI. Answer questions 
clearly and concisely. Do not make up information.

USER:
What is the tallest mountain in the world?

The System Prompt is a hidden set of instructions that tells the model who it is and how to behave. You never see it. It's already there before you type anything.

This is important — you'll use System Prompts heavily when building AI applications.


Step 2 — Text Gets Broken Into Tokens

The model cannot read words the way you do. It converts your text into tokens first.

Tokens are not exactly words. They are chunks of text — sometimes a full word, sometimes part of a word, sometimes punctuation.

"What is the tallest mountain in the world?"

→ ["What", " is", " the", " tall", "est", " mountain", 
   " in", " the", " world", "?"]

Each token gets converted to a number — because neural networks only understand numbers.

"What"     → 2061
" is"      → 318
" the"     → 262
" tall"    → 9857
"est"      → 395
" mountain"→ 8598
...

We will go very deep on tokens and tokenization in Module 1.3. For now, just know this conversion happens.
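
To make the text-to-numbers step concrete, here is a toy tokenizer that greedily matches the longest known piece. It is only a sketch: real tokenizers learn their vocabulary from data, the first six IDs are the illustrative ones shown above, and the last three IDs are made up.

```javascript
// Toy greedy longest-match tokenizer over a tiny hand-written vocabulary.
// Real tokenizers (for example byte-pair encoding) learn their vocabulary from data.
const toyVocab = new Map([
    ["What", 2061], [" is", 318], [" the", 262], [" tall", 9857],
    ["est", 395], [" mountain", 8598],
    [" in", 1], [" world", 2], ["?", 3] // made-up IDs for illustration
]);

function toyTokenize(text, vocab) {
    const ids = [];
    let pos = 0;
    while (pos < text.length) {
        let matched = null;
        // Try the longest possible piece first, then shrink.
        for (let end = text.length; end > pos; end--) {
            const piece = text.slice(pos, end);
            if (vocab.has(piece)) { matched = piece; break; }
        }
        if (matched === null) {
            throw new Error(`no token for text at position ${pos}`);
        }
        ids.push(vocab.get(matched));
        pos += matched.length;
    }
    return ids;
}

console.log(toyTokenize("What is the tallest mountain in the world?", toyVocab));
// [2061, 318, 262, 9857, 395, 8598, 1, 262, 2, 3]
```

Notice that " the" maps to 262 both times it appears: the same piece of text always gets the same ID.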


Step 3 — Tokens Pass Through the Transformer

The sequence of numbers (tokens) gets fed into the Transformer — the neural network that is the heart of every LLM.

The Transformer does one thing extremely well:

It looks at ALL tokens simultaneously and figures out how each token relates to every other token.

For your sentence, it understands:

  • "tallest" is directly related to "mountain"
  • "world" gives global scope to the question
  • The whole sentence is asking for a factual comparison

This is the Attention Mechanism — the model pays different amounts of "attention" to different tokens depending on context. We'll cover this deeply in Phase 2.


Step 4 — The Model Predicts the Next Token

After processing your input, the model now produces a probability distribution over its entire vocabulary — which is typically 50,000 to 100,000 tokens.

For every single token in its vocabulary, it assigns a probability:

Next token probabilities:

"Mount"     → 31.2%
"The"       → 18.7%
"Everest"   → 14.3%
"Mt"        → 11.1%
"K2"        → 2.1%
"Mars"      → 0.0001%
...

It picks the most probable one (or near-most-probable, depending on temperature — we'll cover this in Module 1.3).

Let's say it picks "Mount".


Step 5 — The Token Gets Added, Process Repeats

Now the model has:

Input + "Mount"

It runs the whole process again. New probability distribution:

"Everest"   → 89.4%
"Fuji"      → 3.1%
"Kilimanjaro" → 1.2%
...

Picks "Everest". Now it has:

Input + "Mount Everest"

Runs again. Picks " is". Then " the". Then " tallest". Then " mountain". And so on.

"Mount Everest is the tallest mountain in 
the world, standing at 8,849 meters 
(29,032 feet) above sea level."

This process is called autoregressive generation. Every token depends on all previous tokens. The model keeps generating until it produces a special "end of sequence" token that signals it's done.
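
The loop itself is short. In this toy sketch a lookup table stands in for the Transformer, and all names are made up; a real model would compute a probability distribution from the whole sequence at every step.

```javascript
// Toy autoregressive loop. The lookup table stands in for the Transformer:
// a real model computes a probability distribution from the whole sequence.
const END = "[END]";
const toyNextToken = {
    "": "Mount",
    "Mount": " Everest",
    "Mount Everest": " is",
    "Mount Everest is": " the",
    "Mount Everest is the": " tallest",
    "Mount Everest is the tallest": " mountain",
    "Mount Everest is the tallest mountain": END
};

function generate(predictNext, maxTokens) {
    let output = "";
    for (let step = 0; step < maxTokens; step++) {
        const token = predictNext(output); // prediction depends on everything so far
        if (token === END) break;          // end-of-sequence token stops generation
        output += token;                   // append, then feed the longer sequence back in
    }
    return output;
}

const generated = generate((soFar) => toyNextToken[soFar] ?? END, 20);
console.log(generated); // Mount Everest is the tallest mountain
```

The maxTokens argument plays the same role as the max_tokens API parameter: generation stops there even if no end token has appeared.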


The Full Flow Visualized

You type a message
        ↓
System Prompt + Your Message combined
        ↓
Text converted to Tokens
        ↓
Tokens converted to Numbers
        ↓
Numbers fed into Transformer
        ↓
Transformer runs Attention across all tokens
        ↓
Probability distribution over vocabulary
        ↓
Most probable token selected
        ↓
Token added to sequence
        ↓
Repeat until [END] token
        ↓
Numbers converted back to Text
        ↓
Response streams to your screen

Why Does ChatGPT Sometimes Get Things Wrong?

This is one of the most important things to understand before building AI apps.

Remember — the model is not looking anything up. It is not connected to a database of facts. It is predicting the most probable next token based on patterns it saw during training.

This means:

If the training data had wrong information → the model learned wrong patterns → it will confidently produce wrong answers.

If the training data didn't cover something → the model has no pattern to follow → it may "hallucinate" — generate plausible-sounding but completely false information.

If something happened after the training cutoff → the model simply doesn't know → it might guess or make something up.

This is called hallucination — and it's not a bug. It's a fundamental property of how these models work. The model always generates something — it doesn't know how to say "I have no pattern for this."

This is exactly why RAG (Retrieval Augmented Generation) exists, which we'll cover in Phase 5. RAG is the most common mitigation for hallucination, though not a complete cure. Instead of relying on the model's internal knowledge, you inject real, current, verified information directly into the prompt. The model then generates based on that real context instead of guessing.


Why Does It Give Different Answers Every Time?

This is where temperature comes in — and we'll go deep on this in Module 1.3.

Short version: the model doesn't always pick the single highest probability token. It picks probabilistically — meaning sometimes the 2nd or 3rd most likely token gets picked. This introduces variation.

Same question, two runs:

Run 1: "Mount Everest is the tallest mountain..."
Run 2: "The tallest mountain in the world is Mount Everest..."

Both correct. Different phrasing. Because different tokens got sampled.

Turn temperature to 0 → it always picks the highest probability token → nearly deterministic, almost always the same answer (hosted models can still show small variations between runs).

Turn temperature up → more randomness → more creative, more varied, sometimes more wrong.
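
A minimal sketch of how temperature could be applied when picking a token, assuming the common approach of dividing the logits by the temperature before softmax. The function name sample_with_temperature, the vocabulary, and the logit values are hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=random):
    """Return the index of the chosen token."""
    if temperature == 0:
        # Greedy: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Dividing by the temperature flattens (T > 1) or sharpens (T < 1)
    # the distribution before sampling.
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    weights = [math.exp(x - highest) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Hypothetical first-token candidates and scores, for illustration only.
vocab = ["Mount", "The", "Everest"]
logits = [4.0, 2.0, 1.0]

print(vocab[sample_with_temperature(logits, 0)])  # greedy pick
print([vocab[sample_with_temperature(logits, 1.5)] for _ in range(5)])  # varies per run
```

At temperature 0 the first line always prints the top-scoring token. The second line can differ between runs, which is the same effect that produces the two differently phrased Everest answers.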


One More Thing — ChatGPT Has No Memory By Default

This surprises a lot of people.

Every time you send a message in a conversation, the entire conversation history is sent back to the model from the beginning. The model itself stores nothing.

Message 1: "My name is Arjun"
Message 2: "What is my name?"

What actually gets sent to the model for Message 2:

USER: My name is Arjun
ASSISTANT: Nice to meet you, Arjun!
USER: What is my name?

The model reads all of it every single time and responds. It feels like memory — but it's just the conversation being replayed in full on every request.
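
The replay can be sketched like this. fake_model is a hypothetical stand-in for a real chat model, hard-coded to handle only this one example. What matters is that it is a pure function of the messages it receives: it keeps nothing between calls, and the application is what stores the history.

```python
def fake_model(messages):
    """Hypothetical stateless model: it only knows what is in `messages`."""
    name = None
    for message in messages:
        text = message["content"]
        if message["role"] == "user" and text.startswith("My name is "):
            name = text[len("My name is "):]
    if messages[-1]["content"] == "What is my name?":
        return f"Your name is {name}." if name else "I don't know your name."
    return f"Nice to meet you, {name}!" if name else "Hello!"


def send(history, user_text):
    """Append the user message, send the FULL history, store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = fake_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


chat = []
print(send(chat, "My name is Arjun"))
print(send(chat, "What is my name?"))

# Without the history, the same question cannot be answered.
print(fake_model([{"role": "user", "content": "What is my name?"}]))
```

The last call shows the point: send the question alone and the "memory" is gone, because it only ever lived in the list the application kept.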

This has a limit: the Context Window, which is the maximum number of tokens the model can read at once. When the conversation gets too long, the application has to drop or summarize older messages, so the model no longer sees them.

This is also why Agent Memory is a whole topic in Phase 8 — building systems that give AI actual persistent memory beyond a single conversation.
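
One simple way an application might handle the context window limit described above is to keep only the most recent messages that fit in a token budget. This is a sketch under loud assumptions: count_tokens and trim_history are hypothetical helpers, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
def count_tokens(text):
    """Crude assumption: one token per whitespace-separated word."""
    return len(text.split())


def trim_history(messages, max_tokens):
    """Keep the newest messages whose combined size fits the budget."""
    kept = []
    total = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message["content"])
        if total + cost > max_tokens:
            break  # everything older than this is dropped
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "My name is Arjun"},
    {"role": "assistant", "content": "Nice to meet you, Arjun!"},
    {"role": "user", "content": "What is my name?"},
]

for message in trim_history(history, max_tokens=9):
    print(message["role"].upper() + ":", message["content"])
```

With a budget of 9 "tokens", the first message no longer fits and is dropped. In a real application that dropped message would be the only place the user stated their name, which is the kind of loss persistent memory systems are meant to prevent.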


What is Generative AI — Clean Definition

Generative AI is a class of models that learn the statistical patterns of their training data well enough to produce new, original content — whether text, images, audio, or video — that follows those same patterns.

ChatGPT is a Generative AI that:

  1. Takes your message combined with a system prompt
  2. Tokenizes the input
  3. Runs it through a Transformer
  4. Generates output one token at a time based on probability
  5. Stops when it produces an end-of-sequence token
  6. Streams the result to your screen

3-Line Summary

  1. Generative AI doesn't retrieve stored answers — it generates new content by learning statistical patterns from training data and producing output token by token.
  2. ChatGPT works by combining your message with a system prompt, tokenizing it, running it through a Transformer, and repeatedly choosing a probable next token until the response is complete.
  3. Hallucination happens because the model always generates something and has no built-in way to check whether it actually knows the answer, which is the problem RAG helps address later in this course.

Module 1.2 — Complete ✅

Coming up: Module 1.3 — Tokens, Context Window & Temperature

This is where things get really practical. You'll understand exactly what a token is, why token count matters for cost and performance, what context window limits mean for your applications, and how temperature controls creativity vs accuracy.


Module 2.3 — The Transformer Architecture
