Module 1.3 — Tokens, Context Window & Temperature

Why This Module Matters for Developers

These are not just theoretical concepts. As a developer building AI applications, you will make decisions every single day based on tokens, context windows, and temperature.

  • Your billing is based on tokens
  • Your app's memory limit is defined by the context window
  • Your output quality vs creativity is controlled by temperature

If you don't understand these deeply, you'll either build broken apps or waste money. Let's fix that right now.


Part 1 — Tokens

What Exactly is a Token?

You already know a model can't read raw text — it converts text to numbers first. But it doesn't convert letter by letter, and it doesn't convert word by word either.

It converts token by token.

A token is a chunk of text. The size of that chunk depends on the word:

Common short words    → usually 1 token
"cat"       = 1 token
"the"       = 1 token
"is"        = 1 token

Longer or rarer words → split into multiple tokens
"tallest"   = 2 tokens  → ["tall", "est"]
"unbelievable" = 3 tokens → ["un", "believ", "able"]

Punctuation and whitespace → often their own tokens (a leading space usually attaches to the next word)
"Hello!"    = 2 tokens  → ["Hello", "!"]
"\n"        = 1 token

Numbers
"2024"      = 1-2 tokens depending on the model

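The component that decides how to split text into tokens is called a tokenizer. Each LLM family has its own tokenizer with its own rules, so the same text can produce different token counts on different models, and the splits shown above are illustrative.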


The Real-World Rule of Thumb

OpenAI gives a simple approximation that's useful for estimation:

100 tokens ≈ 75 words

OR

1 token ≈ 4 characters of English text

So if you write a 300-word prompt, that's roughly 400 tokens. A 1000-word essay is roughly 1300 tokens.
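The two rules of thumb above can be turned into a quick estimator. This is a minimal sketch: the function names are made up for illustration, and the results are rough guesses for English text, not real tokenizer output.

```javascript
// Rough token estimates based on the two rules of thumb above.
// These are approximations for English text, not real tokenizer counts.

// 1 token ≈ 4 characters
function estimateTokensFromChars(text) {
    return Math.ceil(text.length / 4);
}

// 100 tokens ≈ 75 words
function estimateTokensFromWords(text) {
    const words = text.trim().split(/\s+/).filter(Boolean);
    return Math.ceil((words.length * 100) / 75);
}

const sample = "The quick brown fox jumps over the lazy dog";
console.log("By characters:", estimateTokensFromChars(sample));
console.log("By words:", estimateTokensFromWords(sample));
```

The two estimates will rarely match exactly. For billing-grade numbers, use the token counts the API returns or the provider's own tokenizer.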


Why Does the Model Use Tokens Instead of Words?

Three reasons:

Reason 1 — Handles unknown words gracefully

If someone types a brand new word — a typo, a name, a technical term the model has never seen — breaking it into sub-word tokens means the model can still process it.

"Anthropicization"  (made-up word)
→ ["Anthrop", "ic", "ization"]

The model can still work with the pieces even if it has never seen the full word.

Reason 2 — Efficient vocabulary size

If tokens were full words, the vocabulary would need millions of entries (every word in every language). With sub-word tokenization, a vocabulary of roughly 50,000 to 200,000 tokens can cover almost all languages.

Reason 3 — Handles code and symbols well

"function()" → ["function", "(", ")"]
"==="        → ["==="] or ["==", "="]
"$100"       → ["$", "100"]

Code, math, and symbols tokenize cleanly this way.


Tokens and Cost — The Direct Connection

Every major LLM API charges you per token — both for the tokens you send in (input) and the tokens the model generates back (output).

OpenAI's pricing example (GPT-4o launch pricing, approximate; prices change often, so check the current pricing page):

GPT-4o
Input tokens:  $5 per 1 million tokens
Output tokens: $15 per 1 million tokens

This means every single character in your prompt costs money. And the model's response costs even more per token.

Now imagine you're building a RAG application that injects 5 pages of PDF content into every prompt. That's thousands of tokens — on every single request — from every single user.

System Prompt        →   200 tokens
Injected PDF context →  3000 tokens
User question        →    50 tokens
Model response       →   400 tokens
─────────────────────────────────────
Total per request    →  3650 tokens

1000 users/day       →  3,650,000 tokens/day (at one request per user)

Token awareness is cost awareness. You'll think about this constantly in production.
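To turn the token budget above into dollars, here is a small sketch using the illustrative $5 and $15 per million figures from this section. The function name and the price object are assumptions for the example, and real prices will differ.

```javascript
// Illustrative prices in USD per 1 million tokens, taken from the example above.
// Real prices change often; treat these numbers as placeholders.
const PRICE_PER_MILLION = { input: 5, output: 15 };

function requestCostUsd(inputTokens, outputTokens, prices = PRICE_PER_MILLION) {
    const inputCost = (inputTokens / 1e6) * prices.input;
    const outputCost = (outputTokens / 1e6) * prices.output;
    return inputCost + outputCost;
}

// The RAG request above: 200 + 3000 + 50 input tokens, 400 output tokens.
const perRequest = requestCostUsd(200 + 3000 + 50, 400);
const perDay = perRequest * 1000; // assuming 1000 requests per day

console.log("Per request: $" + perRequest.toFixed(5));
console.log("Per day:     $" + perDay.toFixed(2));
```

At these rates the request costs a little over two cents, which is about $22 per day at 1000 requests. Note that the 400 output tokens account for more than a quarter of the bill even though they are only about a tenth of the tokens.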


Tokens and Speed

More tokens in the input = more computation = slower response.

More tokens in the output = model has to generate more = slower response.

This is why when you build applications, you'll learn to:

  • Write tight, efficient prompts
  • Limit output length when you don't need long responses
  • Chunk documents instead of dumping everything in at once

Part 2 — Context Window

What is a Context Window?

The context window is the maximum number of tokens the model can read at one time — input AND output combined.

Think of it like a whiteboard. The model can only see what's written on the whiteboard right now. Anything that doesn't fit on the whiteboard — the model simply cannot see.

Context Window = The whiteboard

What goes on the whiteboard:
┌────────────────────────────────┐
│  System Prompt                 │
│  Conversation History          │
│  Injected Documents (RAG)      │
│  Current User Message          │
│  Model's Response (being gen.) │
└────────────────────────────────┘

Total must be ≤ Context Window limit

Context Window Sizes — Then vs Now

This has changed dramatically in just a few years:

GPT-3 (2020)          →      2,048 tokens   (~1,500 words)
GPT-3.5 (2022)        →      4,096 tokens   (~3,000 words; later versions reached 16,384)
GPT-4 (2023)          →      8,192 tokens   (~6,000 words)
GPT-4 Turbo (2023)    →    128,000 tokens   (~96,000 words)
GPT-4o (2024)         →    128,000 tokens   (~96,000 words)
Claude 3.5 (2024)     →    200,000 tokens   (~150,000 words)
Gemini 1.5 Pro (2024) →  1,000,000 tokens   (~750,000 words)

128,000 tokens is roughly the length of an entire novel. 1,000,000 tokens is roughly 750,000 words, or about eight novels of that size.


Why Context Window Matters For Your Apps

Scenario 1 — Chatbot with long conversations

Every message in the conversation takes up tokens. Eventually, the conversation gets too long and older messages have to be dropped. Your chatbot "forgets" what was said at the beginning.

Message 1:  "My name is Arjun"           ← might get dropped
Message 2:  "I work at a startup"        ← might get dropped
...
Message 47: "What did I tell you my name was?"
Model: "I don't have that information"   ← frustrating for user

This is why building proper memory systems matters — Phase 8 covers this.
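The "older messages get dropped" behavior can be sketched as a simple sliding window. This is one basic strategy under stated assumptions, not a full memory system: `trimHistory` and `roughTokenCount` are hypothetical helpers, and token counts are estimated with the 4-characters rule rather than a real tokenizer.

```javascript
// Hypothetical helper: approximate token count using the 4-characters rule.
function roughTokenCount(text) {
    return Math.ceil(text.length / 4);
}

// Keep the newest messages that fit in the token budget.
// Walks backwards from the latest message and stops at the first one
// that would exceed the budget, so the oldest messages are dropped first.
function trimHistory(messages, maxTokens) {
    const kept = [];
    let used = 0;
    for (let i = messages.length - 1; i >= 0; i--) {
        const cost = roughTokenCount(messages[i].content);
        if (used + cost > maxTokens) break;
        kept.unshift(messages[i]); // preserve chronological order
        used += cost;
    }
    return kept;
}

const history = [
    { role: "user", content: "My name is Arjun" },
    { role: "user", content: "I work at a startup" },
    { role: "user", content: "What did I tell you my name was?" },
];
console.log(trimHistory(history, 13));
```

With a tight enough budget, the message containing the name is exactly the one that falls off, which is the failure shown in the scenario above.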

Scenario 2 — RAG Applications

When you inject document content into a prompt, it eats up context window space. If a user uploads a 500-page PDF, you cannot dump the whole thing into the context window — it won't fit (on smaller models) and it'll be extremely expensive even if it does fit.

This is exactly why chunking exists — you break documents into pieces and only inject the most relevant pieces. We'll build this in Phase 5.
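As a preview of the idea, here is a minimal chunking sketch. It is an assumption-laden simplification: `chunkText` is a made-up helper, it measures size in characters (about 4 per token) instead of real tokens, and it does not cover the step of picking the most relevant chunks.

```javascript
// Split text into chunks of at most maxChars characters, breaking on whitespace.
// A single word longer than maxChars is kept whole as its own oversized chunk.
function chunkText(text, maxChars) {
    const words = text.split(/\s+/).filter(Boolean);
    const chunks = [];
    let current = "";
    for (const word of words) {
        const candidate = current ? current + " " + word : word;
        if (candidate.length > maxChars && current) {
            chunks.push(current); // current chunk is full, start a new one
            current = word;
        } else {
            current = candidate;
        }
    }
    if (current) chunks.push(current);
    return chunks;
}

// About 500 tokens per chunk under the 4-characters rule of thumb.
const documentText = "Replace this string with the text extracted from your PDF.";
console.log(chunkText(documentText, 2000));
```

Production chunkers usually split on sentence or section boundaries and overlap neighboring chunks a little, so that a fact spanning two chunks is not cut in half.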

Scenario 3 — The Lost in the Middle Problem

Research has shown that LLMs actually don't pay equal attention to all parts of the context window:

Context Window

[Beginning]   ← model pays HIGH attention here
[Middle]      ← model pays LOW attention here ⚠️
[End]         ← model pays HIGH attention here

If you stuff 50 documents into the context and the relevant one lands in the middle — the model might miss it. This is called the "Lost in the Middle" problem. Knowing this helps you design better RAG systems.
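One common mitigation is to reorder retrieved documents so the strongest ones sit at the edges of the context. The sketch below is one possible approach, not a guaranteed fix: `reorderForEdges` is a hypothetical helper, and it assumes the input array is already sorted from most to least relevant.

```javascript
// docs must be sorted from most relevant to least relevant.
// Returns a new array with the strongest documents at the beginning and end
// and the weakest ones in the middle.
function reorderForEdges(docs) {
    const front = [];
    const back = [];
    docs.forEach((doc, i) => {
        if (i % 2 === 0) {
            front.push(doc);   // ranks 1, 3, 5, ... fill from the start
        } else {
            back.unshift(doc); // ranks 2, 4, 6, ... fill from the end
        }
    });
    return front.concat(back);
}

const ranked = ["doc1", "doc2", "doc3", "doc4", "doc5"]; // doc1 is most relevant
console.log(reorderForEdges(ranked));
```

Here the best document lands first, the second best lands last, and the least relevant one ends up in the middle, where missing it costs the least.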


Input Tokens vs Output Tokens

The context window covers both:

Input tokens  = everything you send to the model
Output tokens = everything the model generates back

Input + Output ≤ Context Window

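If your context window is 128,000 tokens and you send 120,000 tokens of input, the model can generate at most 8,000 tokens of output. The whiteboard is almost full. Many models also enforce a separate, smaller cap on output tokens, so check the model's documentation.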

You also typically set a max_tokens parameter when calling the API — this limits how long the response can be. Useful for controlling cost and keeping responses concise.
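The arithmetic above is easy to check before you send a request. This is a minimal sketch with a made-up helper name, `outputBudget`. It only models the shared context window, not any model-specific output cap.

```javascript
// How many output tokens can the model actually produce?
// It is limited by both the space left in the context window and max_tokens.
function outputBudget(contextWindow, inputTokens, maxTokens) {
    const remaining = contextWindow - inputTokens;
    if (remaining <= 0) {
        throw new Error("Input does not fit in the context window");
    }
    return Math.min(remaining, maxTokens);
}

// The example from the text: 128,000 window, 120,000 tokens of input.
console.log(outputBudget(128000, 120000, 16000)); // 8000
```

Running a check like this before each call lets you trim history or drop a document chunk instead of getting a truncated answer or an API error.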


Part 3 — Temperature

What is Temperature?

Temperature controls how much randomness is introduced when the model selects the next token.

Remember from Module 1.2 — the model produces a probability distribution over all possible next tokens:

Next token probabilities (simplified):

"Paris"      → 72%
"France"     → 14%
"Europe"     → 8%
"London"     → 4%
"pizza"      → 0.5%
"banana"     → 0.001%

Temperature decides how the model samples from this distribution.


Temperature = 0

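At temperature 0, the model always picks the highest-probability token (greedy decoding), so sampling adds no randomness.

Every single run:
"Paris" gets picked every time → effectively deterministic

In practice, hosted APIs can still return slightly different outputs at temperature 0 because of floating-point and batching effects, so treat it as nearly deterministic rather than guaranteed.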

Use temperature 0 when you need:

  • Consistent, repeatable outputs
  • Factual question answering
  • Data extraction
  • Classification tasks
  • Anything where the same input should always give the same output

Temperature = 1

At temperature 1, the model samples according to the actual probability distribution. High probability tokens get picked often, low probability ones occasionally.

Run 1: "Paris"   (picked 72% of the time)
Run 2: "Paris"
Run 3: "France"  (picked 14% of the time)
Run 4: "Paris"
Run 5: "Europe"  (picked 8% of the time)

This gives natural variation — different phrasings, different word choices — while still being coherent.
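Sampling "according to the distribution" can be sketched in a few lines. This is a simplified illustration using the made-up probabilities from above: `sampleToken` is a hypothetical helper, and a real model does this over tens of thousands of tokens at every step.

```javascript
// Pick a token from a { token: probability } map.
// The random source is a parameter so the behavior can be tested repeatably.
function sampleToken(probabilities, random = Math.random) {
    const entries = Object.entries(probabilities);
    const total = entries.reduce((sum, [, p]) => sum + p, 0);
    let threshold = random() * total; // a point somewhere along the total mass
    for (const [token, p] of entries) {
        threshold -= p;
        if (threshold < 0) return token;
    }
    return entries[entries.length - 1][0]; // guard against rounding error
}

const nextToken = {
    Paris: 0.72, France: 0.14, Europe: 0.08,
    London: 0.04, pizza: 0.005, banana: 0.00001,
};

// Five runs: mostly "Paris", with an occasional "France" or "Europe".
for (let run = 1; run <= 5; run++) {
    console.log("Run " + run + ":", sampleToken(nextToken));
}
```

Each token owns a slice of the number line proportional to its probability, so a random point lands on "Paris" most of the time and on "banana" almost never.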

Use temperature 1 when you need:

  • Natural conversation
  • Writing assistance
  • General purpose chatbots

Temperature > 1

At high temperatures, the model flattens the probability distribution — making unlikely tokens more likely to get picked.

Temperature = 1.5

"Paris"      → now maybe 40% (reduced from 72%)
"France"     → now maybe 20%
"Europe"     → now maybe 15%
"London"     → now maybe 12%
"pizza"      → now maybe 8%   ← much more likely now
"banana"     → now maybe 3%   ← actually possible now

The output gets creative, unusual, unexpected — and also potentially incoherent or wrong.

Use high temperature when you need:

  • Brainstorming wildly different ideas
  • Creative writing with unexpected twists
  • Generating varied options to choose from
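The flattening effect can be computed directly. The sketch below assumes the standard formulation in which logits are divided by the temperature before the softmax. Working from probabilities, that is equivalent to raising each one to the power 1/T and renormalizing. `applyTemperature` is a hypothetical helper, and the exact figures it prints will differ from the "maybe" numbers above, which are only illustrative.

```javascript
// Rescale a { token: probability } map by temperature.
// p^(1/T) followed by renormalization matches dividing logits by T before softmax.
function applyTemperature(probabilities, temperature) {
    const entries = Object.entries(probabilities);
    if (temperature === 0) {
        // Greedy decoding: all probability mass goes to the most likely token.
        const best = entries.reduce((a, b) => (b[1] > a[1] ? b : a));
        return Object.fromEntries(entries.map(([t]) => [t, t === best[0] ? 1 : 0]));
    }
    const scaled = entries.map(([t, p]) => [t, Math.pow(p, 1 / temperature)]);
    const total = scaled.reduce((sum, [, p]) => sum + p, 0);
    return Object.fromEntries(scaled.map(([t, p]) => [t, p / total]));
}

const distribution = {
    Paris: 0.72, France: 0.14, Europe: 0.08,
    London: 0.04, pizza: 0.005, banana: 0.00001,
};

for (const t of [0, 0.7, 1, 1.5]) {
    const adjusted = applyTemperature(distribution, t);
    const line = Object.entries(adjusted)
        .map(([token, p]) => token + " " + (p * 100).toFixed(2) + "%")
        .join(", ");
    console.log("T=" + t + ": " + line);
}
```

Below 1 the top token pulls further ahead, at 1 the distribution is unchanged apart from normalization, and above 1 the tail tokens gain share at the expense of "Paris".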

Temperature Visualized

Temperature: 0          Temp: 0.7          Temp: 1.5
────────────────        ──────────         ──────────────
█████████░░░░░░░        ██████░░░░░        ████░░░░░░░░░░
Very peaked             Balanced           Very flat
One clear winner        Natural mix        Everything possible

Deterministic           Varied             Chaotic
Factual                 Creative           Unpredictable
Consistent              Natural            Risky

Temperature in Practice — Real Code

Here's how you set these parameters when calling the Anthropic Messages API in JavaScript:


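    // Assumes Node 18+ (built-in fetch), an ES module (for top-level await),
    // and an API key stored in the ANTHROPIC_API_KEY environment variable.
    const response = await fetch("https://api.anthropic.com/v1/messages", {
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            "x-api-key": process.env.ANTHROPIC_API_KEY,  // required for authentication
            "anthropic-version": "2023-06-01",            // required API version header
        },
        body: JSON.stringify({
            model: "claude-sonnet-4-6",
            max_tokens: 1024,      // maximum output tokens
            temperature: 0.7,      // this API accepts 0 to 1; lower = more deterministic
            messages: [
                {
                    role: "user",
                    content: "What is the tallest mountain in the world?"
                }
            ]
        })
    });

    // Surface HTTP errors (bad key, rate limit) instead of failing on the parse below
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }

    const data = await response.json();
    console.log(data.content[0].text);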


Choosing the Right Temperature — Quick Guide

Use Case                          Recommended Temperature
─────────────────────────────────────────────────────────
Data extraction from documents  →  0
Factual Q&A                     →  0 to 0.3
Summarization                   →  0.3
General chatbot                 →  0.7
Email / content writing         →  0.7
Creative writing                →  1.0
Brainstorming                   →  1.0 to 1.2
Poetry / experimental           →  1.2+

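For most production AI applications (RAG systems, customer support bots, document analyzers) you'll stay between 0 and 0.5, because you want accuracy over creativity. The allowed range depends on the provider: OpenAI's API accepts 0 to 2, while Anthropic's accepts 0 to 1, so values above 1 are not available everywhere.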


How All Three Connect

User sends message
        ↓
[CONTEXT WINDOW starts filling up]
  System Prompt      → tokens consumed
  Chat History       → tokens consumed
  RAG Documents      → tokens consumed
  User Message       → tokens consumed
        ↓
[Must stay within Context Window limit]
        ↓
Model processes all input tokens
        ↓
Generates output token by token
[TEMPERATURE controls how each token is picked]
        ↓
Stops at max_tokens or at an end-of-sequence token
        ↓
Output tokens also counted against context window
        ↓
You get billed for input tokens + output tokens

3-Line Summary

  1. A token is a chunk of text (roughly 4 characters) — not a word, not a letter — and every API call is billed by how many tokens go in and come out.
  2. The context window is the model's total working memory — everything (system prompt, history, documents, response) must fit inside it, and content in the middle gets the least attention.
  3. Temperature controls randomness in token selection — use 0 for factual precision and consistency, 0.7 for natural conversation, and higher for creative tasks.

Module 1.3 — Complete ✅

Coming up: Module 1.4 — Prompts: System Prompt vs User Prompt vs Completion

This is where you learn the skill that separates average AI developers from great ones. Prompt engineering isn't about magic phrases — it's about understanding exactly how the model reads and interprets instructions, and structuring your prompts so the model has no choice but to give you exactly what you need.
