The Real Problem with Accuracy
Let's start with a story.
You are a doctor. 100 patients walk into your hospital. After pathology tests, the reality is confirmed:
Actually have Cancer → 10 patients
Actually Healthy → 90 patients
Now you just say "Nobody has cancer" for all 100 patients.
Percentage = (Part / Whole) × 100
Your accuracy:
Accuracy = (Correct predictions / Total) × 100
         = (90 / 100) × 100
         = 90% ✅
90% sounds amazing. But you just missed every single cancer patient. You are the worst doctor ever — but accuracy says you are great.
This is the problem. Accuracy lies when data is imbalanced.
Imbalanced simply means not equal, or not properly balanced: things are uneven or not in proper proportion.
Examples:
Work-life imbalance → too much work, no personal time
Diet imbalance → eating too much of one type of food, not enough of others
Data imbalance (in coding/ML) → one category has way more data than another
In one line:
👉 Imbalanced = something is out of balance or not evenly distributed.
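You can see the accuracy trap in a few lines of plain Python. This is a sketch of the doctor story above: a "model" that just answers Healthy for everyone still scores 90%.

```python
# The lazy baseline from the story: predict Healthy (0) for all 100 patients.
y_true = [1] * 10 + [0] * 90   # reality: 10 cancer, 90 healthy
y_pred = [0] * 100             # "Nobody has cancer"

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print(f"Accuracy: {accuracy:.0%}")  # 90% — yet every cancer case was missed
```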
F1 Score fixes this lie.
How F1 Score Works — Step by Step
Step 1 — Understand the Setup
You build an ML Model that looks at patient data and predicts — "Cancer or No Cancer?"
The model has never seen the pathology results. It makes its own independent guesses based on symptoms, age, reports, and so on.
Ground Truth (Reality) → 10 patients have cancer [FIXED, never changes]
Model's Predictions → Model thinks 10 patients have cancer [Its own guess]
These are two separate lists. Now we compare them.
Step 2 — Compare Model Predictions vs Reality
The model flagged these 10 patients as "Cancer":
Model's Cancer List → P1, P2, P3, P4, P5, P6, P7, P8, P9, P10
Reality's Cancer List (from pathology):
Actual Cancer List → P1, P2, P3, P4, P5, P6, P7, P11, P12, P13
Now compare both lists:
Both lists have → P1, P2, P3, P4, P5, P6, P7 → 7 patients ✅
(Model was RIGHT about these)
Only in Model list → P8, P9, P10 → 3 patients ❌
(Model said Cancer but they were Healthy)
Only in Real list → P11, P12, P13 → 3 patients ❌
(Actually had Cancer but Model MISSED them)
This is where 7 comes from — the overlap between both lists.
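The list comparison above is exactly what Python sets do. A quick sketch with the same patient IDs:

```python
# The two lists from Step 2, as Python sets.
model_cancer  = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"}
actual_cancer = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P11", "P12", "P13"}

overlap     = model_cancer & actual_cancer   # model was RIGHT about these
model_only  = model_cancer - actual_cancer   # said Cancer, actually Healthy
actual_only = actual_cancer - model_cancer   # had Cancer, model MISSED them

print(len(overlap), len(model_only), len(actual_only))  # 7 3 3
```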
Step 3 — Learn the 4 Terms
Every prediction falls into one of these 4 boxes:
                       MODEL SAID
                  Cancer      No Cancer
               ┌───────────┬───────────┐
REALITY Cancer │     7     │     3     │ ← 10 actual cancer patients
               ├───────────┼───────────┤
       Healthy │     3     │    87     │ ← 90 actual healthy patients
               └───────────┴───────────┘
| Term | Full Name | Meaning | Count |
|------|-----------|---------|-------|
| TP | True Positive | Actually Cancer + Model said Cancer ✅ | 7 |
| FP | False Positive | Actually Healthy + Model said Cancer ❌ | 3 |
| FN | False Negative | Actually Cancer + Model said Healthy ❌ | 3 |
| TN | True Negative | Actually Healthy + Model said Healthy ✅ | 87 |
TP, FP, FN, TN — Decision Logic
Just look at 2 things:
- What is the Reality? (Ground Truth)
- What did the Model predict? (Prediction)
Decision Table
| Reality | Model Said | Name | How to Remember |
|---------|------------|------|-----------------|
| Cancer | Cancer ✅ | True Positive | Both same → True. Model said Positive (cancer) → Positive |
| Healthy | Cancer ❌ | False Positive | Both different → False. Model said Positive → Positive |
| Cancer | Healthy ❌ | False Negative | Both different → False. Model said Negative (healthy) → Negative |
| Healthy | Healthy ✅ | True Negative | Both same → True. Model said Negative → Negative |
The Naming Formula
TRUE / FALSE → Was the model correct or wrong?
POSITIVE/NEGATIVE → What did the model predict?
That's it. True/False = model right/wrong, Positive/Negative = model's prediction.
Quick Practice
Patient P8 → Reality: Healthy, Model said: Cancer
- Was the model correct? ❌ → False
- What did the model predict? Cancer → Positive
- Answer: False Positive ✅
Patient P11 → Reality: Cancer, Model said: Healthy
- Was the model correct? ❌ → False
- What did the model predict? Healthy → Negative
- Answer: False Negative ✅
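The decision table is small enough to write as code. A minimal sketch (the `classify` helper is my own name, not from any library) that asks the same two questions:

```python
# Two questions, one name:
#   True/False  → was the model correct?
#   Pos/Neg     → what did the model predict?
def classify(reality: str, prediction: str) -> str:
    correct  = "True" if reality == prediction else "False"
    polarity = "Positive" if prediction == "Cancer" else "Negative"
    return f"{correct} {polarity}"

print(classify("Healthy", "Cancer"))   # False Positive  (patient P8)
print(classify("Cancer",  "Healthy"))  # False Negative  (patient P11)
```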
Why is FN the most dangerous?
FP → Told a Healthy person "You have Cancer"
→ Unnecessary stress, extra tests
→ But the patient will be monitored ✅
FN → Told a Cancer patient "You are Healthy"
→ Patient went home, no treatment
→ Disease keeps growing ❌ DANGEROUS
This is why Recall matters more in medical cases — because Recall directly tracks how many FNs slipped through.
Recall = TP / (TP + FN)
                    ↑
          More FN = Lower Recall
One Line to Remember
The name tells you two things at once — was the model right, and what did it predict.
Step 4 — Calculate Precision
Question Precision answers: "When the model said Cancer — how often was it actually right?"
Model said "Cancer" to 10 people. Out of those 10, only 7 actually had cancer.
Precision = TP / (TP + FP)
= 7 / (7 + 3)
= 7 / 10
= 0.70 → 70%
Real world meaning: If the model tells you "You have cancer" — there is a 70% chance it is correct. 30% chance it is wrong.
Step 5 — Calculate Recall
Question Recall answers: "Out of all patients who actually had cancer — how many did the model catch?"
10 people actually had cancer. Model caught only 7. It missed 3.
Recall = TP / (TP + FN)
= 7 / (7 + 3)
= 7 / 10
= 0.70 → 70%
Real world meaning: 3 cancer patients went home thinking they are healthy. They will never get treated. This is dangerous.
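Both formulas, with the counts from Steps 4 and 5 plugged in:

```python
# Counts from the story: 7 caught, 3 false alarms, 3 missed.
TP, FP, FN = 7, 3, 3

precision = TP / (TP + FP)   # of everyone flagged, how many were right?
recall    = TP / (TP + FN)   # of everyone sick, how many were caught?

print(f"Precision: {precision:.0%}")  # 70%
print(f"Recall   : {recall:.0%}")     # 70%
```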
Step 6 — The Tug of War Between Precision and Recall
These two always fight each other. You cannot simply maximize one.
Scenario A — Model is too aggressive (flags everyone as Cancer):
Model said "Cancer" to all 100 patients
TP = 10, FP = 90, FN = 0
Precision = 10 / (10 + 90) = 10% ← Terrible
Recall = 10 / (10 + 0) = 100% ← Perfect
Recall is perfect but Precision is destroyed. You scared 90 healthy people unnecessarily.
Scenario B — Model is too cautious (flags only 1 person as Cancer):
Model said "Cancer" to only 1 patient — and that 1 was correct
TP = 1, FP = 0, FN = 9
Precision = 1 / (1 + 0) = 100% ← Perfect
Recall = 1 / (1 + 9) = 10% ← Terrible
Precision is perfect but Recall is destroyed. 9 sick people went home undetected.
Both extremes are bad. You need balance. That is what F1 gives you.
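A quick sketch of both scenarios (the `scores` helper is mine, not a library function) shows that F1 punishes each extreme equally hard:

```python
# Precision, Recall, and F1 from raw counts.
def scores(tp, fp, fn):
    p  = tp / (tp + fp)
    r  = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

for name, (tp, fp, fn) in [("Aggressive", (10, 90, 0)), ("Cautious", (1, 0, 9))]:
    p, r, f1 = scores(tp, fp, fn)
    print(f"{name}: P={p:.0%}  R={r:.0%}  F1={f1:.0%}")
# Aggressive: P=10%  R=100%  F1=18%
# Cautious:   P=100% R=10%   F1=18%
```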
Step 7 — F1 Score Calculation
F1 = 2 × (Precision × Recall) / (Precision + Recall)
= 2 × (0.70 × 0.70) / (0.70 + 0.70)
= 2 × 0.49 / 1.40
= 0.98 / 1.40
= 0.70 → 70%
Why not just take a normal average?
Look at Scenario A above:
Normal Average = (10% + 100%) / 2 = 55% ← Sounds okay
F1 Score = 2 × (0.10 × 1.0) / (0.10 + 1.0) = 18% ← Shows the truth
Normal average hides the problem. F1 exposes it immediately.
F1 is strict — both Precision AND Recall must be good. If either one is bad, F1 will be bad.
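You can verify the two averages yourself with Scenario A's numbers:

```python
# Scenario A: Precision 10%, Recall 100%.
p, r = 0.10, 1.00

arithmetic = (p + r) / 2           # normal average — hides the weak side
harmonic   = 2 * p * r / (p + r)   # F1 — dragged toward the smaller value

print(f"Arithmetic mean: {arithmetic:.0%}")  # 55%
print(f"F1 (harmonic)  : {harmonic:.0%}")    # 18%
```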
Step 8 — Why Accuracy Still Fails Here
Accuracy = (TP + TN) / Total
= (7 + 87) / 100
= 94 / 100
= 94% ← Looks great!
F1 Score = 70% ← Shows the real picture
Accuracy is high because there are 87 True Negatives dragging the number up. But those 87 are just healthy people — easy to get right. The hard part is catching cancer patients — and the model only got 70% of those.
F1 ignores True Negatives completely. It only cares about how well you handle the positive cases.
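To see this concretely, here is a sketch where 900 extra healthy patients (a made-up extension of the story) are added and the model gets them all right. Accuracy climbs; F1 does not move.

```python
# Same TP/FP/FN as the story — only the number of True Negatives changes.
TP, FP, FN = 7, 3, 3
p  = TP / (TP + FP)
r  = TP / (TP + FN)
f1 = 2 * p * r / (p + r)

for TN in (87, 987):
    acc = (TP + TN) / (TP + TN + FP + FN)
    print(f"TN={TN}: accuracy={acc:.0%}, F1={f1:.0%}")
# TN=87  → accuracy 94%, F1 70%
# TN=987 → accuracy 99%, F1 70%
```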
Step 9 — Python Code
from sklearn.metrics import f1_score, precision_score, recall_score

# Ground Truth — confirmed by pathology tests
# 1 = Cancer, 0 = Healthy
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

# Model's predictions — its own independent guesses
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

print(f"Accuracy : {accuracy:.2f}")   # 0.85 — looks good but misleading
print(f"Precision: {precision:.2f}")  # 0.86 — when it said cancer, 86% correct
print(f"Recall   : {recall:.2f}")     # 0.75 — caught 75% of actual cases
print(f"F1 Score : {f1:.2f}")         # 0.80 — the honest combined score
Step 10 — When to Use What
| Situation | Bigger Mistake | What to Focus On | Use |
|-----------|----------------|------------------|-----|
| 🏥 Cancer Detection | Missing a sick person | High Recall | F1 |
| 📧 Spam Filter | Deleting a real email | High Precision | F1 |
| 💳 Fraud Detection | Missing a fraud | High Recall | F1 |
| 🐶 Dog vs Cat | Balanced data, both equal | Either is fine | Accuracy |
Golden Rule: If your dataset is imbalanced (example: 95% Healthy, 5% Cancer) — always use F1. Never trust accuracy.
Complete Summary in One Place
100 Patients Total
├── 10 actually had Cancer (Ground Truth — FIXED)
└── 90 actually Healthy (Ground Truth — FIXED)
Model independently predicted:
├── Said "Cancer" → 10 patients (model's own guess)
│ ├── 7 correct → TP = 7 (overlap between both lists)
│ └── 3 wrong → FP = 3 (healthy people wrongly flagged)
└── Said "Healthy" → 90 patients
├── 87 correct → TN = 87
└── 3 wrong → FN = 3 (cancer patients model missed)
Precision = TP / (TP + FP) = 7 / 10 = 70%
Recall = TP / (TP + FN) = 7 / 10 = 70%
F1 = 2 × (0.7 × 0.7) / (0.7 + 0.7) = 70%
Accuracy = (7 + 87) / 100 = 94% ← Misleading!
F1 Score Scale
| F1 Score | Meaning |
|----------|---------|
| 1.00 | Perfect model 🏆 |
| 0.75 and above | Good — ready for production ✅ |
| 0.50 to 0.75 | Average — needs improvement ⚠️ |
| Below 0.50 | Bad — do not use ❌ |
| 0.00 | Completely useless 🗑️ |
One line to remember forever:
Accuracy tells you how often you were right overall. F1 tells you how well you handled the cases that actually mattered.