CodeWithGagan | Programming Language and IT Lectures: Module 2.1 — What is NLP, Words vs Tokens & Tokenization

Start With a Simple Problem

Computers are dumb in one specific way.

They only understand numbers. That's it. Everything inside a computer — images, videos, music, text — is secretly just numbers underneath.

So when you type:

"I love pizza"

The computer sees this as a bunch of characters. But it has no idea what "love" means. It has no idea "pizza" is a food. It just sees symbols.
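To make the "just numbers" point concrete, here is a minimal Python sketch using only built-ins (`ord` and `str.encode`). It prints the numbers hiding behind each character. Notice that nothing in those numbers says anything about love or food.

```python
text = "I love pizza"

# Each character maps to a Unicode code point (an integer).
code_points = [ord(ch) for ch in text]

# When stored or sent over a network, the text becomes bytes (UTF-8 here).
utf8_bytes = list(text.encode("utf-8"))

print(code_points)
print(utf8_bytes)
```

For plain English text the two lists are identical, because these characters take one byte each in UTF-8.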

The big question is:

How do we turn human language into something a computer can actually understand and work with?

This is exactly what NLP solves.

What is NLP?

NLP stands for Natural Language Processing.

Break it down:

Natural Language = the language humans speak and write. English, Hindi, Spanish — any human language.
Processing = making a computer work with it.

So NLP = teaching computers to read, understand, and work with human language.

NLP is not new. It's been around since the 1950s. But it used to be very basic — simple rule-based stuff like "if the word is 'not', flip the meaning."

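Today's NLP, powered by LLMs, is on a completely different level. The computer doesn't just follow hand-written rules anymore. It learns patterns from huge amounts of text, which lets it handle context, meaning, and nuance far better than rule-based systems ever could.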

The Core Problem NLP Solves

Think about how hard human language actually is.

Same word, completely different meanings:

"I went to the bank"
→ river bank? or money bank?

"She couldn't bear the pain"
→ bear = tolerate? or bear = the animal?

"He is so cool"
→ temperature? or personality?

Humans figure this out instantly from context. Computers used to completely fail at this.

NLP is the field of techniques and models that teach computers to handle exactly this kind of complexity.

Step 1 — How Does a Computer Start Reading Text?

Let's say you give a computer this sentence:

"Dogs are great pets"

First question — how does the computer even break this down?

The naive answer is: split by spaces.

"Dogs" | "are" | "great" | "pets"
→ 4 words

Simple. But this breaks immediately with real language:

"I can't do this"
→ split by space → "I" | "can't" | "do" | "this"
→ but "can't" is actually "can" + "not"
→ are these 1 word or 2?

"New York"
→ is this 1 thing or 2 separate words?

"state-of-the-art"
→ 1 word? 4 words? something else?

Splitting by spaces is too simple. Real language is messy.
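Here is the naive approach as a short Python sketch, using the built-in `str.split`. The function name `split_on_spaces` is made up for this example. Run it on the sentences above and you can see exactly where it stops being useful.

```python
def split_on_spaces(text: str) -> list[str]:
    """Naive 'tokenizer': cut the text wherever there is whitespace."""
    return text.split()

examples = ["Dogs are great pets", "I can't do this", "New York", "state-of-the-art"]

for sentence in examples:
    pieces = split_on_spaces(sentence)
    print(len(pieces), pieces)
```

"can't" stays glued together, "New York" falls apart into two unrelated pieces, and "state-of-the-art" comes back as one long blob.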

So instead of splitting into words, we split into tokens.

Words vs Tokens — The Real Difference

A word is what you and I understand — a unit of meaning in language.

A token is what the computer uses — a chunk of text that the model has learned to work with.

They are close — but not the same.

Here's the key difference:

Word:  "unbelievable"
→ You see 1 word
→ Computer may see 3 tokens: ["un", "believ", "able"] (the exact split depends on the tokenizer)

Word:  "cat"
→ You see 1 word
→ Computer sees 1 token: ["cat"]

Word:  "I"
→ You see 1 word
→ Computer sees 1 token: ["I"]

Common, short words are usually 1 token. Long or rare words are usually split into multiple tokens.
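One simple way to picture this is a greedy longest-match split: at each position, take the longest piece that exists in the vocabulary. The sketch below is a toy under stated assumptions. The `VOCAB` set and the function name are hypothetical, and real tokenizers learn their vocabulary from large amounts of text and follow their own splitting rules.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
VOCAB = {"un", "believ", "able", "cat", "I"}

def greedy_subword_split(word: str, vocab: set[str]) -> list[str]:
    """Split a word by repeatedly taking the longest vocabulary piece that
    matches at the current position, falling back to a single character."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary piece matches here: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces

for word in ["unbelievable", "cat", "I"]:
    print(word, "->", greedy_subword_split(word, VOCAB))
```

With this toy vocabulary, "unbelievable" comes out as three pieces while "cat" and "I" stay whole.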

Why Split Into Tokens and Not Words?

Great question. Three simple reasons:

Reason 1 — Handles words it has never seen

Imagine someone types a brand new made-up word:

"Anthropicization"

The model has never seen this word. If we treated whole words as the unit, the model would be completely lost.

But with tokens:

"Anthropicization" → ["Anthrop", "ic", "ization"]

The model has seen these pieces before, so it can make a reasonable guess at the word even though it has never seen the full thing. (The exact split depends on the tokenizer.)
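The sketch below shows what "completely lost" looks like in code. With a word-level vocabulary, every unseen word collapses into the same placeholder ID, so two very different new words become indistinguishable. The vocabularies, the `<unk>` placeholder name, and the second made-up word are all hypothetical, chosen only for illustration.

```python
# Hypothetical word-level vocabulary: every known word gets an ID.
WORD_VOCAB = {"<unk>": 0, "I": 1, "love": 2, "pizza": 3}

def word_level_ids(text: str) -> list[int]:
    """Look up whole words; anything unseen collapses to the <unk> ID."""
    return [WORD_VOCAB.get(word, WORD_VOCAB["<unk>"]) for word in text.split()]

# Hypothetical piece vocabulary holding the pieces from the example above.
PIECE_VOCAB = {"Anthrop": 0, "ic": 1, "ization": 2}

def piece_ids(pieces: list[str]) -> list[int]:
    """Look up each sub-word piece; the word keeps its own ID sequence."""
    return [PIECE_VOCAB[piece] for piece in pieces]

# Two different unseen words get the exact same ID at word level.
print(word_level_ids("Anthropicization"), word_level_ids("Blorptastic"))

# With pieces, the unseen word still maps to meaningful IDs.
print(piece_ids(["Anthrop", "ic", "ization"]))
```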

Reason 2 — Works across languages

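English: "hello"     → typically 1 token
Spanish: "hola"      → typically 1 token
Hindi:   "नमस्ते"    → often several tokens (the count varies by tokenizer)

One tokenizer can handle many languages without needing a separate system for each, although languages that were rare in its training data usually need more tokens per word.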
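One reason a single tokenizer can cover many languages: any text, in any script, can be reduced to UTF-8 bytes, and there are only 256 possible byte values. Some tokenizers build their pieces on top of bytes. Whether a particular model does so is an assumption here, and the helper name `utf8_byte_ids` is made up for this sketch.

```python
def utf8_byte_ids(word: str) -> list[int]:
    """Turn any text into a list of integers between 0 and 255."""
    return list(word.encode("utf-8"))

greetings = {"English": "hello", "Spanish": "hola", "Hindi": "नमस्ते"}

for language, word in greetings.items():
    ids = utf8_byte_ids(word)
    print(f"{language}: {len(word)} characters, {len(ids)} bytes")
```

Devanagari characters take three bytes each in UTF-8, which is one reason Hindi text can end up using more tokens than English text of similar length.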

Reason 3 — Keeps the vocabulary manageable

There are millions of words across all languages. If every word were a separate entry, the model would need a vocabulary with millions of items.

With sub-word tokens, a vocabulary of roughly 50,000 to 100,000 tokens (some recent models use more) covers almost everything. Much more manageable.
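Why do pieces keep the list small? Because a handful of pieces can be combined into many words. The toy sketch below counts this with `itertools.product`. The prefixes, stems, and suffixes are hypothetical and picked only to show the arithmetic.

```python
from itertools import product

# Hypothetical pieces, chosen only for illustration.
prefixes = ["", "un", "re"]
stems = ["load", "lock", "pack", "fold"]
suffixes = ["", "ed", "ing"]

# Every prefix + stem + suffix combination is one word form.
words = {p + s + x for p, s, x in product(prefixes, stems, suffixes)}

# The pieces the vocabulary actually has to store (empty strings excluded).
pieces = {piece for piece in prefixes + stems + suffixes if piece}

print(len(pieces), "pieces ->", len(words), "distinct word forms")
```

Eight stored pieces cover 36 word forms here. The gap gets much wider as the number of pieces grows.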

What is Tokenization?

Tokenization is simply the process of splitting text into tokens.

It's step one — before the model does anything with your text, the tokenizer chops it up first.

Let's trace a real example:

Input text:
"I love building AI apps!"

After tokenization:
["I", " love", " building", " AI", " apps", "!"]

After converting to numbers (the exact IDs depend on the tokenizer):
[40, 1842, 2615, 9552, 5181, 0]

Now the model has something it can actually work with — a list of numbers.
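The "convert to numbers" step is just a table lookup. Here is a minimal sketch that reuses the tokens and IDs from the trace above. The table `TOKEN_TO_ID` and the function `to_ids` are made up for this example, and the IDs should be treated as illustrative since every tokenizer has its own table.

```python
# Lookup table copied from the trace above; the IDs are illustrative.
TOKEN_TO_ID = {
    "I": 40,
    " love": 1842,
    " building": 2615,
    " AI": 9552,
    " apps": 5181,
    "!": 0,
}

def to_ids(tokens: list[str]) -> list[int]:
    """Replace each token with its integer ID from the lookup table."""
    return [TOKEN_TO_ID[token] for token in tokens]

tokens = ["I", " love", " building", " AI", " apps", "!"]
ids = to_ids(tokens)

print(ids)
print("".join(tokens))  # gluing the tokens back together restores the text
```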

Let's See This With Real Examples

Here's how some common text gets tokenized by a typical GPT-style tokenizer (exact splits vary from one tokenizer to another):

Text: "Hello world"
Tokens: ["Hello", " world"]
Count: 2 tokens

Text: "ChatGPT is amazing"
Tokens: ["Chat", "G", "PT", " is", " amazing"]
Count: 5 tokens

Text: "I can't stop learning"
Tokens: ["I", " can", "'t", " stop", " learning"]
Count: 5 tokens

Text: "2024"
Tokens: ["2024"]
Count: 1 token

Text: "$1,299.99"
Tokens: ["$", "1", ",", "299", ".", "99"]
Count: 6 tokens

Notice a few things:

Spaces are often attached to the NEXT word, not left separate
Punctuation usually becomes its own token
Numbers can split in unexpected ways
Contractions like "can't" often split into "can" + "'t"
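Several of these patterns come from a first pass that many tokenizers run before doing any sub-word splitting: a regular expression that cuts text into rough chunks. The pattern below is a simplified sketch, loosely modeled on GPT-2-style pre-tokenization. That is an assumption, and real tokenizers use a Unicode-aware pattern plus a learned vocabulary on top. The names `PATTERN` and `pre_tokenize` are made up for this example.

```python
import re

# Simplified, ASCII-only pattern (an assumption, not any library's exact rule):
#   1. common contraction endings such as 't or 's
#   2. an optional leading space followed by letters
#   3. an optional leading space followed by digits
#   4. an optional leading space followed by other symbols
#   5. any leftover whitespace
PATTERN = re.compile(r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

def pre_tokenize(text: str) -> list[str]:
    """Cut text into rough chunks; spaces stay attached to the next chunk."""
    return PATTERN.findall(text)

for text in ["Hello world", "I can't stop learning", "$1,299.99"]:
    print(pre_tokenize(text))
```

This step alone reproduces the space, punctuation, and contraction behaviour listed above. Splitting inside a word, such as "ChatGPT" into smaller pieces, needs the learned vocabulary and is not covered by this sketch.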

The Tokenizer is Separate From the Model

This is important to understand.

The tokenizer and the model are two different things:

Your Text
    ↓
[TOKENIZER]          ← splits text into tokens
    ↓
Tokens (numbers)
    ↓
[LLM MODEL]          ← processes the numbers
    ↓
Output tokens
    ↓
[TOKENIZER]          ← converts numbers back to text
    ↓
Response Text

The tokenizer runs first and converts text to numbers. The model works with those numbers. Then the tokenizer runs again at the end, converting the model's output numbers back into readable text.
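The round trip can be sketched in a few lines. Everything here is hypothetical: the tiny `VOCAB`, the `encode` and `decode` helpers, and especially `fake_model`, which is only a stand-in so you can see that the model's side of the pipeline deals in integers and nothing else.

```python
# Hypothetical toy vocabulary; real tokenizers ship with a learned one.
VOCAB = ["Hello", " world", "!", " there"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

def encode(tokens: list[str]) -> list[int]:
    """Tokenizer, first pass: tokens -> IDs."""
    return [TOKEN_TO_ID[token] for token in tokens]

def decode(ids: list[int]) -> str:
    """Tokenizer, second pass: IDs -> text."""
    return "".join(VOCAB[i] for i in ids)

def fake_model(input_ids: list[int]) -> list[int]:
    """Stand-in for an LLM: it only sees and returns integers.
    Here it echoes the input and appends the ID for '!'."""
    return input_ids + [TOKEN_TO_ID["!"]]

input_ids = encode(["Hello", " world"])
output_ids = fake_model(input_ids)

print(input_ids, "->", output_ids, "->", decode(output_ids))
```

Because the two parts are separate, the same tokenizer table has to be used on the way in and on the way out, otherwise the IDs would point at the wrong pieces.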

A Simple Real Life Analogy

Think of tokenization like a mail sorting room.

When letters arrive at a post office, the sorting room doesn't read every letter as a whole story. It breaks down the address into pieces:

"123 Main Street, New York, USA 10001"
→ House Number: 123
→ Street: Main Street
→ City: New York
→ Country: USA
→ ZIP: 10001

Each piece is a "token" — a chunk the sorting system can work with. The full address as one blob of text is hard to route. Broken into meaningful chunks — easy.

Tokenization does the same thing with language.

Why Should You Care About This as a Developer?

Because tokens directly affect three things in your apps:

1. Cost: Most hosted LLM APIs charge per token, for both the prompt you send and the response you get back. If you understand tokenization, you can write efficient prompts and save money.

"Please be so kind as to summarize the following text"
→ about 11 tokens: wordy, expensive

"Summarize:"
→ about 2 tokens: same instruction, much cheaper

2. Speed: More tokens = more processing = slower response. Tight prompts are faster prompts.

3. Context Limit: Remember the context window from Module 1.3? It's measured in tokens. Knowing how tokenization works helps you estimate how much space you have left.
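Here is a back-of-the-envelope sketch of the cost and context arithmetic. The price, the number of calls, and the context window size are all made-up numbers for illustration, and the function names are hypothetical. Check your provider's pricing page and model documentation for real values.

```python
def estimate_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of processing num_tokens at a flat per-1,000-token price."""
    return num_tokens / 1000 * price_per_1k_tokens

def tokens_left(context_window: int, used_tokens: int) -> int:
    """How many tokens still fit in the context window (never negative)."""
    return max(context_window - used_tokens, 0)

# Hypothetical numbers, for illustration only.
PRICE_PER_1K = 0.50   # made-up price in dollars
CALLS = 1_000_000     # made-up number of API calls

wordy, tight = 11, 2  # instruction lengths from the example above
saved_tokens = (wordy - tight) * CALLS

print(saved_tokens, "tokens saved")
print(estimate_cost(saved_tokens, PRICE_PER_1K), "dollars saved at the made-up price")
print(tokens_left(context_window=8_000, used_tokens=6_500), "tokens of context left")
```

Nine tokens per call looks like nothing, but multiplied across a million calls it becomes nine million tokens you no longer pay for.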

Quick Summary of the Flow So Far

You type text
      ↓
Tokenizer splits it into chunks (tokens)
      ↓
Each token gets converted to a number
      ↓
List of numbers goes into the model
      ↓
Model processes the numbers
      ↓
Model outputs numbers
      ↓
Tokenizer converts numbers back to text
      ↓
You see the response

3-Line Summary

NLP is the field of teaching computers to read and understand human language — tokenization is the very first step in that process.
A token is a chunk of text, not exactly a word and not exactly a letter: common words are usually one token, while long or rare words get split into multiple tokens.
Tokenization matters to you as a developer because every token costs money, takes time, and uses up your context window — writing efficient prompts means understanding how your text gets split.

Module 2.1 — Complete ✅

Coming up — Module 2.2 — Vocabulary, Embeddings, Parameters & Model Weights

This is where it gets really interesting. You'll learn what "embeddings" actually are at the most basic level — and why they are the single most important concept in all of modern AI. We'll build up to it simply, step by step.
