Why Does F1 Score Even Exist?
Let me show you the problem first.
You work at a bank. You built a model to detect fraudulent transactions. Your dataset looks like this:
Total Transactions → 10,000
Legitimate → 9,900 (99%)
Fraudulent → 100 (1%)
Your lazy model just says "Everything is legitimate!" for all 10,000 transactions.
Accuracy = 9,900 / 10,000 = 99% 🎉
Looks incredible. But your model caught zero fraud. Not a single one. The bank is losing money every day and your model is celebrating 99% accuracy.
This is the lie accuracy tells when data is imbalanced. F1 Score exists to expose this lie.
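The trap is easy to reproduce in a few lines of scikit-learn. This is a minimal sketch of the lazy model above; `zero_division=0` just silences the warning sklearn raises when a model predicts no positives at all:

```python
from sklearn.metrics import accuracy_score, f1_score

# The 10,000-transaction split from the text: 100 fraud (1), 9,900 legit (0)
y_true = [1] * 100 + [0] * 9_900
# Lazy model: calls everything "legit"
y_pred = [0] * 10_000

print("Accuracy:", accuracy_score(y_true, y_pred))             # 0.99 — looks great
print("F1      :", f1_score(y_true, y_pred, zero_division=0))  # 0.0  — caught nothing
```

Same model, two metrics, opposite verdicts — F1 is the one telling the truth.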
The 4 Things Your Model Can Do
Every single prediction your model makes falls into one of these 4 boxes. Think of it like a courtroom:
                      MODEL PREDICTED
                    Fraud      Not Fraud
               ┌────────────┬────────────┐
ACTUAL  Fraud  │     TP     │     FN     │
               ├────────────┼────────────┤
        Legit  │     FP     │     TN     │
               └────────────┴────────────┘
| Term | Full Name | Plain English | Good or Bad? |
|---|---|---|---|
| TP | True Positive | Actually fraud + Model said fraud | ✅ Good |
| TN | True Negative | Actually legit + Model said legit | ✅ Good |
| FP | False Positive | Actually legit + Model said fraud | ❌ Bad |
| FN | False Negative | Actually fraud + Model said legit | ❌ Worst |
FN is the most dangerous one — real fraud slipped through completely undetected.
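Counting the four boxes is just a pairwise comparison of truth and prediction. A minimal sketch (the helper name `confusion_counts` is invented here for illustration):

```python
# Sort each (actual, predicted) pair into TP / TN / FP / FN (1 = fraud, 0 = legit)
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```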
Precision — "How Trustworthy Are Your Alarms?"
Your model flagged 50 transactions as fraud.
You investigate all 50 manually. Only 40 turn out to be actual fraud. The other 10 were legitimate transactions that got wrongly accused.
Precision = TP / (TP + FP)
= 40 / (40 + 10)
= 40 / 50
= 0.80 → 80%
What this means in real life:
Every time your model raises an alarm, there is an 80% chance it is real fraud. 20% of the time it is a false alarm — a genuine customer getting their card blocked for no reason. That customer is now angry, calling support, and the bank is wasting investigation resources.
Low Precision = Smoke detector that rings every time you cook eggs 🍳🔔
Recall — "How Much Did You Actually Catch?"
There were 100 actual fraud cases in the dataset.
Your model caught only 40 of them. The remaining 60 fraud transactions were silently approved. Money gone.
Recall = TP / (TP + FN)
= 40 / (40 + 60)
= 40 / 100
= 0.40 → 40%
What this means in real life:
60 fraudsters successfully stole money. Your model was running the whole time and still let 60% of fraud pass through. It is completely failing at its main job.
Low Recall = Security guard sleeping on the job 💤
The Tug of War — Why You Cannot Just Fix One
This is the tricky part. Precision and Recall always fight each other.
If you make your model STRICT (only flags when 99% sure):
Model flags only 10 transactions as fraud
All 10 are actually fraud → No false alarms
TP = 10, FP = 0, FN = 90
Precision = 10 / (10 + 0) = 100% ← Perfect ✅
Recall = 10 / (10 + 90) = 10% ← Terrible ❌
Zero false alarms — great! But 90 fraudsters walked away free.
If you make your model LOOSE (flags everything suspicious):
Model flags 500 transactions as fraud
Only 100 are actually fraud → Tons of false alarms
TP = 100, FP = 400, FN = 0
Precision = 100 / (100 + 400) = 20% ← Terrible ❌
Recall = 100 / (100 + 0) = 100% ← Perfect ✅
Caught every fraudster — great! But 400 innocent customers got their cards blocked.
Both extremes destroy the model's usefulness. You need both to be good simultaneously. That is exactly what F1 measures.
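The strict/loose trade-off can be sketched with one set of probability scores and two thresholds. The scores and labels below are invented purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score

# Invented example: 4 actual frauds, 6 legit, with model confidence scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.60, 0.40, 0.80, 0.55, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.85, 0.35):  # strict vs loose
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

The strict threshold gives perfect precision but misses half the fraud; the loose one catches everything but floods you with false alarms — same model, same scores, opposite failure modes.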
F1 Score — The Formula Explained
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Using our main example (Precision = 80%, Recall = 40%):
F1 = 2 × (0.80 × 0.40) / (0.80 + 0.40)
= 2 × 0.32 / 1.20
= 0.64 / 1.20
= 0.53 → 53%
Even though Precision was 80%, the F1 came out only 53% because Recall was dragging it down. F1 punished the imbalance immediately.
Why Harmonic Mean — Not Regular Average?
This is the most important thing to understand about F1.
Strict model example:
Precision = 100%, Recall = 10%
Regular Average = (100 + 10) / 2 = 55% ← Sounds okay
F1 Score = 2 × (1.0 × 0.1) / (1.0 + 0.1) = 18% ← Shows the truth
Regular average said 55% — sounds acceptable. F1 said 18% — exposed the model immediately.
The rule is simple: if either Precision or Recall is very low, F1 will be very low. No hiding. No averaging away the problem.
| Precision | Recall | Regular Average | F1 Score | Verdict |
|---|---|---|---|---|
| 100% | 2% | 51% 😬 | 4% ✅ | Useless model exposed |
| 80% | 80% | 80% | 80% | Both good — fair score |
| 90% | 50% | 70% | 64% | Recall dragging it down |
| 40% | 100% | 70% | 57% | Precision dragging it down |
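The harmonic mean's punishment is easy to verify yourself — here is a self-contained sketch recomputing a few of the rows above:

```python
# F1 = harmonic mean of precision and recall
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

# (precision, recall) pairs from the comparison table
for p, r in [(0.8, 0.8), (0.9, 0.5), (0.4, 1.0)]:
    print(f"P={p:.0%} R={r:.0%}  avg={(p + r) / 2:.0%}  F1={f1(p, r):.0%}")
```

Whenever the two inputs are equal, average and F1 agree; the moment they diverge, F1 drops below the average, and the bigger the gap, the harder it drops.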
Full Python Code 🐍
from sklearn.metrics import (
f1_score, precision_score,
recall_score, classification_report
)
# 1 = Fraud, 0 = Legit
y_true = [1,0,1,1,0,0,1,0,1,0, 1,0,0,0,1,0,0,1,0,0]
y_pred = [1,0,1,0,0,0,1,1,1,0, 0,0,0,0,1,0,0,1,0,1]
print("Precision :", precision_score(y_true, y_pred))
print("Recall :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
# Full breakdown in one shot
print(classification_report(y_true, y_pred,
target_names=["Legit", "Fraud"]))
Output:
              precision    recall  f1-score   support
       Legit       0.83      0.83      0.83        12
       Fraud       0.75      0.75      0.75         8
    accuracy                           0.80        20
   macro avg       0.79      0.79      0.79        20
weighted avg       0.80      0.80      0.80        20
How to read this output:
- Support = how many actual samples of that class exist (12 legit, 8 fraud)
- Legit row = how well the model handled legitimate transactions
- Fraud row = how well the model handled fraud — this is what matters most
- Macro avg = simple average of the two rows; weighted avg = average weighted by support
- Accuracy = 80% — but F1 for Fraud is only 75% — see the difference
F1 Variants — Three Flavors 🔥
When you have more than 2 classes (not just Fraud/Legit but maybe 5 categories), you need to combine F1 scores across all classes. Three ways to do it:
Macro F1 — Equal Importance to Every Class
f1_score(y_true, y_pred, average='macro')
Calculate F1 separately for each class, then take a simple average. Every class gets equal weight regardless of how many samples it has.
Use when: Every class matters equally — e.g., rare disease detection, where performing badly on any one class is equally unacceptable.
Weighted F1 — Bigger Classes Get More Weight
f1_score(y_true, y_pred, average='weighted')
Calculate F1 for each class, then average — but weighted by how many samples each class has. Bigger classes pull the average more.
Use when: Class sizes are very different and you care more about the bigger class performing well.
Micro F1 — Pool Everything Together
f1_score(y_true, y_pred, average='micro')
Add up all TP, FP, FN across every class first, then calculate one single F1 from those totals.
Use when: You care about overall aggregate performance across the entire dataset.
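A small sketch on an invented 3-class example (labels 0/1/2) shows how the three averages diverge on the exact same predictions:

```python
from sklearn.metrics import f1_score

# Invented 3-class example: class supports are 4, 2, and 4
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 0, 2, 2, 1, 2]

print("macro   :", f1_score(y_true, y_pred, average='macro'))
print("weighted:", f1_score(y_true, y_pred, average='weighted'))
# For single-label multiclass, micro F1 equals plain accuracy: 0.7 here
print("micro   :", f1_score(y_true, y_pred, average='micro'))
```

Macro comes out lowest because the small, badly-handled class 1 drags the simple average down; weighted sits higher because the two big classes dominate; micro just matches overall accuracy.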
Real World — Which Industry Uses F1 and Why
| Industry | Problem | Why F1 and Not Accuracy |
|---|---|---|
| 🏥 Healthcare | Cancer detection | 99% patients are healthy — accuracy would always say "healthy" |
| 💳 Banking | Credit card fraud | 99% transactions are legit — same trap |
| 📧 Email | Spam detection | Most emails are real — missing spam AND deleting real emails both matter |
| 🏭 Manufacturing | Defective products | Most products are fine — missing defects is a safety risk |
| 🔐 Cybersecurity | Intrusion detection | Most traffic is normal — imbalanced by nature |
| 💊 Pharma | Side effect prediction | Side effects are rare events — classic imbalanced problem |
When to Use Accuracy vs F1 — Decision Guide
Is your data balanced?
(roughly equal number of each class)
│
├── YES → Accuracy is fine ✅
│         (both kinds of mistakes cost roughly the same)
│
└── NO → Data is imbalanced
         │
         Is it a critical domain? (health, fraud, security)
         │
         ├── YES → Use F1 Score 🎯 — missing real cases is very costly
         │
         └── NO → F1 is still better than accuracy here
Complete Summary — Everything in One Place
Dataset: 10,000 transactions
├── 9,900 Legitimate (99%)
└── 100 Fraudulent (1%)
Lazy model says "All Legit" → Accuracy = 99% ← Complete lie
Good model results:
├── TP = 40 (caught fraud correctly)
├── FP = 10 (innocent people wrongly flagged)
├── FN = 60 (fraud that slipped through)
└── TN = 9890 (legit correctly cleared)
Precision = 40 / (40 + 10) = 80%
Recall = 40 / (40 + 60) = 40%
F1 = 2 × (0.80 × 0.40) / (0.80 + 0.40) = 53%
Accuracy = (40 + 9890) / 10000 = 99.3% ← Still lying!
F1 = 53% ← Showing the real picture
F1 Score Scale
| F1 Score | What It Means |
|---|---|
| 1.00 | Perfect — catches everything, zero false alarms 🏆 |
| 0.75 and above | Good — generally production ready ✅ |
| 0.50 to 0.75 | Average — needs improvement ⚠️ |
| Below 0.50 | Bad — do not deploy ❌ |
| 0.00 | Completely useless — worse than random 🗑️ |
The one rule to tattoo in your brain:
Use Accuracy when your data is balanced and both types of mistakes cost equally.
Use F1 Score when your data is imbalanced OR when missing a real case (False Negative) is more dangerous than a false alarm (False Positive).