Why Does F1 Score Even Exist?
Let me show you the problem first.
You work at a bank. You built a model to detect fraudulent transactions. Your dataset looks like this:
Total Transactions → 10,000
Legitimate → 9,900 (99%)
Fraudulent → 100 (1%)
Your lazy model just says "Everything is legitimate!" for all 10,000 transactions.
Accuracy = 9,900 / 10,000 = 99% 🎉
Looks incredible. But your model caught zero fraud. Not a single one. The bank is losing money every day and your model is celebrating 99% accuracy.
This is the lie accuracy tells when data is imbalanced. F1 Score exists to expose this lie.
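The trap is easy to reproduce in a few lines of scikit-learn. This is a minimal sketch of the lazy model above; `zero_division=0` just silences the warning sklearn raises when a model predicts no positives at all:

```python
from sklearn.metrics import accuracy_score, f1_score

# The 10,000-transaction split from the text: 100 fraud (1), 9,900 legit (0)
y_true = [1] * 100 + [0] * 9_900
# Lazy model: calls everything "legit"
y_pred = [0] * 10_000

print("Accuracy:", accuracy_score(y_true, y_pred))             # 0.99 — looks great
print("F1      :", f1_score(y_true, y_pred, zero_division=0))  # 0.0  — caught nothing
```

Same model, two metrics, opposite verdicts — F1 is the one telling the truth.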
The 4 Things Your Model Can Do
Every single prediction your model makes falls into one of these 4 boxes. Think of it like a courtroom:
                      MODEL PREDICTED
                    Fraud      Not Fraud
               ┌────────────┬────────────┐
ACTUAL  Fraud  │     TP     │     FN     │
               ├────────────┼────────────┤
        Legit  │     FP     │     TN     │
               └────────────┴────────────┘
| Term | Full Name | Plain English | Good or Bad? |
|---|---|---|---|
| TP | True Positive | Actually fraud + Model said fraud | ✅ Good |
| TN | True Negative | Actually legit + Model said legit | ✅ Good |
| FP | False Positive | Actually legit + Model said fraud | ❌ Bad |
| FN | False Negative | Actually fraud + Model said legit | ❌ Worst |
FN is the most dangerous one — real fraud slipped through completely undetected.
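Counting the four boxes is just a pairwise comparison of truth and prediction. A minimal sketch (the helper name `confusion_counts` is invented here for illustration):

```python
# Sort each (actual, predicted) pair into TP / TN / FP / FN (1 = fraud, 0 = legit)
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```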
Precision — "How Trustworthy Are Your Alarms?"
Your model flagged 50 transactions as fraud.
You investigate all 50 manually. Only 40 turn out to be actual fraud. The other 10 were legitimate transactions that got wrongly accused.
Precision = TP / (TP + FP)
= 40 / (40 + 10)
= 40 / 50
= 0.80 → 80%
What this means in real life:
Every time your model raises an alarm, there is an 80% chance it is real fraud. 20% of the time it is a false alarm — a genuine customer getting their card blocked for no reason. That customer is now angry, calling support, and the bank is wasting investigation resources.
Low Precision = Smoke detector that rings every time you cook eggs 🍳🔔
Recall — "How Much Did You Actually Catch?"
There were 100 actual fraud cases in the dataset.
Your model caught only 40 of them. The remaining 60 fraud transactions were silently approved. Money gone.
Recall = TP / (TP + FN)
= 40 / (40 + 60)
= 40 / 100
= 0.40 → 40%
What this means in real life:
60 fraudsters successfully stole money. Your model was running the whole time and still let 60% of fraud pass through. It is completely failing at its main job.
Low Recall = Security guard sleeping on the job 💤
The Tug of War — Why You Cannot Just Fix One
This is the tricky part. Precision and Recall always fight each other.
If you make your model STRICT (only flags when 99% sure):
Model flags only 10 transactions as fraud
All 10 are actually fraud → No false alarms
TP = 10, FP = 0, FN = 90
Precision = 10 / (10 + 0) = 100% ← Perfect ✅
Recall = 10 / (10 + 90) = 10% ← Terrible ❌
Zero false alarms — great! But 90 fraudsters walked away free.
If you make your model LOOSE (flags everything suspicious):
Model flags 500 transactions as fraud
Only 100 are actually fraud → Tons of false alarms
TP = 100, FP = 400, FN = 0
Precision = 100 / (100 + 400) = 20% ← Terrible ❌
Recall = 100 / (100 + 0) = 100% ← Perfect ✅
Caught every fraudster — great! But 400 innocent customers got their cards blocked.
Both extremes destroy the model's usefulness. You need both to be good simultaneously. That is exactly what F1 measures.
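The strict/loose trade-off can be sketched with one set of probability scores and two thresholds. The scores and labels below are invented purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score

# Invented example: 4 actual frauds, 6 legit, with model confidence scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.60, 0.40, 0.80, 0.55, 0.30, 0.20, 0.10, 0.05]

for threshold in (0.85, 0.35):  # strict vs loose
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

The strict threshold gives perfect precision but misses half the fraud; the loose one catches everything but floods you with false alarms — same model, same scores, opposite failure modes.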
F1 Score — The Formula Explained
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Using our main example (Precision = 80%, Recall = 40%):
F1 = 2 × (0.80 × 0.40) / (0.80 + 0.40)
= 2 × 0.32 / 1.20
= 0.64 / 1.20
= 0.53 → 53%
Even though Precision was 80%, the F1 came out only 53% because Recall was dragging it down. F1 punished the imbalance immediately.
Why Harmonic Mean — Not Regular Average?
This is the most important thing to understand about F1.
Strict model example:
Precision = 100%, Recall = 10%
Regular Average = (100 + 10) / 2 = 55% ← Sounds okay
F1 Score = 2 × (1.0 × 0.1) / (1.0 + 0.1) = 18% ← Shows the truth
Regular average said 55% — sounds acceptable. F1 said 18% — exposed the model immediately.
The rule is simple: if either Precision or Recall is very low, F1 will be very low. No hiding. No averaging away the problem.
| Precision | Recall | Regular Average | F1 Score | Verdict |
|---|---|---|---|---|
| 100% | 2% | 51% 😬 | 4% ✅ | Useless model exposed |
| 80% | 80% | 80% | 80% | Both good — fair score |
| 90% | 50% | 70% | 64% | Recall dragging it down |
| 40% | 100% | 70% | 57% | Precision dragging it down |
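The harmonic mean's punishment is easy to verify yourself — here is a self-contained sketch recomputing a few of the rows above:

```python
# F1 = harmonic mean of precision and recall
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

# (precision, recall) pairs from the comparison table
for p, r in [(0.8, 0.8), (0.9, 0.5), (0.4, 1.0)]:
    print(f"P={p:.0%} R={r:.0%}  avg={(p + r) / 2:.0%}  F1={f1(p, r):.0%}")
```

Whenever the two inputs are equal, average and F1 agree; the moment they diverge, F1 drops below the average, and the bigger the gap, the harder it drops.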
Full Python Code 🐍
from sklearn.metrics import (
f1_score, precision_score,
recall_score, classification_report
)
# 1 = Fraud, 0 = Legit
y_true = [1,0,1,1,0,0,1,0,1,0, 1,0,0,0,1,0,0,1,0,0]
y_pred = [1,0,1,0,0,0,1,1,1,0, 0,0,0,0,1,0,0,1,0,1]
print("Precision :", precision_score(y_true, y_pred))
print("Recall :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
# Full breakdown in one shot
print(classification_report(y_true, y_pred,
target_names=["Legit", "Fraud"]))
Output:
              precision    recall  f1-score   support
       Legit       0.83      0.83      0.83        12
       Fraud       0.75      0.75      0.75         8
    accuracy                           0.80        20
   macro avg       0.79      0.79      0.79        20
weighted avg       0.80      0.80      0.80        20
How to read this output:
- Support = how many actual samples of that class exist (12 legit, 8 fraud)
- Legit row = how well the model handled legitimate transactions
- Fraud row = how well the model handled fraud — this is what matters most
- Macro avg = simple average of the two rows; weighted avg = average weighted by support
- Accuracy = 80% — but F1 for Fraud is only 75% — see the difference
F1 Variants — Three Flavors 🔥
When you have more than 2 classes (not just Fraud/Legit but maybe 5 categories), you need to combine F1 scores across all classes. Three ways to do it:
Macro F1 — Equal Importance to Every Class
f1_score(y_true, y_pred, average='macro')
Calculate F1 separately for each class, then take a simple average. Every class gets equal weight regardless of how many samples it has.
Use when: Every class matters equally — e.g., rare disease detection, where performing badly on any one class is equally unacceptable.
Weighted F1 — Bigger Classes Get More Weight
f1_score(y_true, y_pred, average='weighted')
Calculate F1 for each class, then average — but weighted by how many samples each class has. Bigger classes pull the average more.
Use when: Class sizes are very different and you care more about the bigger class performing well.
Micro F1 — Pool Everything Together
f1_score(y_true, y_pred, average='micro')
Add up all TP, FP, FN across every class first, then calculate one single F1 from those totals.
Use when: You care about overall aggregate performance across the entire dataset.
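A small sketch on an invented 3-class example (labels 0/1/2) shows how the three averages diverge on the exact same predictions:

```python
from sklearn.metrics import f1_score

# Invented 3-class example: class supports are 4, 2, and 4
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 0, 2, 2, 1, 2]

print("macro   :", f1_score(y_true, y_pred, average='macro'))
print("weighted:", f1_score(y_true, y_pred, average='weighted'))
# For single-label multiclass, micro F1 equals plain accuracy: 0.7 here
print("micro   :", f1_score(y_true, y_pred, average='micro'))
```

Macro comes out lowest because the small, badly-handled class 1 drags the simple average down; weighted sits higher because the two big classes dominate; micro just matches overall accuracy.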
Real World — Which Industry Uses F1 and Why
| Industry | Problem | Why F1 and Not Accuracy |
|---|---|---|
| 🏥 Healthcare | Cancer detection | 99% patients are healthy — accuracy would always say "healthy" |
| 💳 Banking | Credit card fraud | 99% transactions are legit — same trap |
| 📧 Email | Spam detection | Most emails are real — missing spam AND deleting real emails both matter |
| 🏭 Manufacturing | Defective products | Most products are fine — missing defects is a safety risk |
| 🔐 Cybersecurity | Intrusion detection | Most traffic is normal — imbalanced by nature |
| 💊 Pharma | Side effect prediction | Side effects are rare events — classic imbalanced problem |
When to Use Accuracy vs F1 — Decision Guide
Is your data balanced?
(roughly equal number of each class)
│
├── YES → Accuracy is fine ✅
│         (both kinds of mistakes cost roughly the same)
│
└── NO → Data is imbalanced
         │
         Is it a critical domain? (health, fraud, security)
         │
         ├── YES → Use F1 Score 🎯 — missing real cases is very costly
         │
         └── NO → F1 is still better than accuracy here
Complete Summary — Everything in One Place
Dataset: 10,000 transactions
├── 9,900 Legitimate (99%)
└── 100 Fraudulent (1%)
Lazy model says "All Legit" → Accuracy = 99% ← Complete lie
Good model results:
├── TP = 40 (caught fraud correctly)
├── FP = 10 (innocent people wrongly flagged)
├── FN = 60 (fraud that slipped through)
└── TN = 9890 (legit correctly cleared)
Precision = 40 / (40 + 10) = 80%
Recall = 40 / (40 + 60) = 40%
F1 = 2 × (0.80 × 0.40) / (0.80 + 0.40) = 53%
Accuracy = (40 + 9890) / 10000 = 99.3% ← Still lying!
F1 = 53% ← Showing the real picture
F1 Score Scale
| F1 Score | What It Means |
|---|---|
| 1.00 | Perfect — catches everything, zero false alarms 🏆 |
| 0.75 and above | Good — generally production ready ✅ |
| 0.50 to 0.75 | Average — needs improvement ⚠️ |
| Below 0.50 | Bad — do not deploy ❌ |
| 0.00 | Completely useless — worse than random 🗑️ |
The one rule to tattoo in your brain:
Use Accuracy when your data is balanced and both types of mistakes cost equally.
Use F1 Score when your data is imbalanced OR when missing a real case (False Negative) is more dangerous than a false alarm (False Positive).