What is F1 Score in Machine Learning

The Real Problem with Accuracy

Let's start with a story.

You are a doctor. 100 patients walk into your hospital. After pathology tests, the reality is confirmed:

Actually have Cancer  →  10 patients
Actually Healthy      →  90 patients

Now you just say "Nobody has cancer" for all 100 patients.

Percentage = (Part / Whole) × 100

Your accuracy:

Accuracy = Correct predictions / Total
         = (90 / 100) × 100
         = 90% ✅

90% sounds amazing. But you just missed every single cancer patient. You are the worst doctor ever — but accuracy says you are great.

This is the problem. Accuracy lies when data is imbalanced.

Imbalanced means not equal or not properly balanced.

Simple meaning:

  • When things are uneven, unequal, or not in proper proportion, they are called imbalanced.

Examples:

  • Work-life imbalance → too much work, no personal time

  • Diet imbalance → eating too much of one type of food, not enough of others

  • Data imbalance (in coding/ML) → one category has way more data than another


In one line:

👉 Imbalanced = something is out of balance or not evenly distributed.

F1 Score fixes this lie.
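The lie is easy to reproduce. Here is a minimal sketch (assuming scikit-learn is installed) of the lazy "nobody has cancer" doctor from the story: accuracy looks great while F1 exposes the failure.

```python
from sklearn.metrics import accuracy_score, f1_score

# The 100-patient story: 1 = Cancer (10 patients), 0 = Healthy (90 patients)
y_true = [1] * 10 + [0] * 90

# The "lazy doctor" model: tells all 100 patients "No cancer"
y_pred = [0] * 100

# zero_division=0 avoids a warning when the model never predicts the positive class
print(accuracy_score(y_true, y_pred))             # 0.9 (looks amazing)
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 (missed every cancer patient)
```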


How F1 Score Works — Step by Step

Step 1 — Understand the Setup

You build an ML Model that looks at patient data and predicts — "Cancer or No Cancer?"

The model has never seen the pathology results. It makes its own independent guesses based on symptoms, age, reports, and so on.

Ground Truth (Reality)  →  10 patients have cancer  [FIXED, never changes]
Model's Predictions     →  Model thinks 10 patients have cancer  [Its own guess]

These are two separate lists. Now we compare them.


Step 2 — Compare Model Predictions vs Reality

The model flagged these 10 patients as "Cancer":

Model's Cancer List  →  P1, P2, P3, P4, P5, P6, P7, P8, P9, P10

Reality's Cancer List (from pathology):

Actual Cancer List   →  P1, P2, P3, P4, P5, P6, P7, P11, P12, P13

Now compare both lists:

Both lists have     →  P1, P2, P3, P4, P5, P6, P7  →  7 patients  ✅
                       (Model was RIGHT about these)

Only in Model list  →  P8, P9, P10                  →  3 patients  ❌
                       (Model said Cancer but they were Healthy)

Only in Real list   →  P11, P12, P13                →  3 patients  ❌
                       (Actually had Cancer but Model MISSED them)

This is where 7 comes from — the overlap between both lists.
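This list comparison is exactly what Python sets do. A quick sketch using the hypothetical patient IDs from the story:

```python
# Hypothetical patient IDs from the story above
model_list  = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10"}
actual_list = {"P1", "P2", "P3", "P4", "P5", "P6", "P7", "P11", "P12", "P13"}

tp = model_list & actual_list   # in both lists: model was right
fp = model_list - actual_list   # only in model's list: healthy but flagged
fn = actual_list - model_list   # only in reality's list: sick but missed

print(len(tp), len(fp), len(fn))   # 7 3 3
```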


Step 3 — Learn the 4 Terms

Every prediction falls into one of these 4 boxes:

                        MODEL SAID
                   Cancer      No Cancer
              ┌───────────┬───────────┐
REALITY Cancer│     7     │     3     │ ← 10 actual cancer patients
              ├───────────┼───────────┤
       Healthy│     3     │    87     │ ← 90 actual healthy patients
              └───────────┴───────────┘
                    ↑            ↑
               Model said    Model said
                Cancer        No Cancer

Term | Full Name      | Meaning                                | Count
-----|----------------|----------------------------------------|------
TP   | True Positive  | Actually Cancer + Model said Cancer    |   7
FP   | False Positive | Actually Healthy + Model said Cancer   |   3
FN   | False Negative | Actually Cancer + Model said Healthy   |   3
TN   | True Negative  | Actually Healthy + Model said Healthy  |  87
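scikit-learn can build this same 2×2 box for you. A sketch reconstructing the 100-patient scenario (the encoding 1 = Cancer, 0 = Healthy is an assumption for illustration):

```python
from sklearn.metrics import confusion_matrix

# Reconstruct the scenario: 1 = Cancer, 0 = Healthy
y_true = [1] * 7 + [1] * 3 + [0] * 3 + [0] * 87   # 10 cancer, 90 healthy
y_pred = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 87   # 7 caught, 3 missed, 3 wrongly flagged

# sklearn orders the matrix as [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")   # TP=7 FP=3 FN=3 TN=87
```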


TP, FP, FN, TN — Decision Logic

Just look at 2 things:

  1. What is the Reality? (Ground Truth)
  2. What did the Model predict? (Prediction)

Decision Table

Reality | Model Said | Name           | How to Remember
--------|------------|----------------|------------------------------------------------------
Cancer  | Cancer     | True Positive  | Both same → True. Model said Positive (cancer)
Healthy | Cancer     | False Positive | Both different → False. Model said Positive
Cancer  | Healthy    | False Negative | Both different → False. Model said Negative (healthy)
Healthy | Healthy    | True Negative  | Both same → True. Model said Negative


The Naming Formula

TRUE / FALSE     →  Was the model correct or wrong?
POSITIVE/NEGATIVE  →  What did the model predict?

That's it. True/False = model right/wrong, Positive/Negative = model's prediction.


Quick Practice

Patient P8 → Reality: Healthy, Model said: Cancer

  • Was the model correct? ❌ → False
  • What did the model predict? Cancer → Positive
  • Answer: False Positive

Patient P11 → Reality: Cancer, Model said: Healthy

  • Was the model correct? ❌ → False
  • What did the model predict? Healthy → Negative
  • Answer: False Negative
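The two-question rule is mechanical enough to write as a tiny helper function. A sketch (the string labels are just for this example):

```python
def classify(reality: str, prediction: str) -> str:
    """Name a prediction: True/False = was the model right, Positive/Negative = what it said."""
    correctness = "True" if reality == prediction else "False"
    sign = "Positive" if prediction == "Cancer" else "Negative"
    return f"{correctness} {sign}"

print(classify("Healthy", "Cancer"))   # False Positive  (patient P8)
print(classify("Cancer", "Healthy"))   # False Negative  (patient P11)
print(classify("Cancer", "Cancer"))    # True Positive
```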

Why is FN the most dangerous?

FP → Told a Healthy person "You have Cancer"
     → Unnecessary stress, extra tests
     → But the patient will be monitored ✅

FN → Told a Cancer patient "You are Healthy"
     → Patient went home, no treatment
     → Disease keeps growing ❌ DANGEROUS

This is why Recall matters more in medical cases — because Recall directly tracks how many FNs slipped through.

Recall = TP / (TP + FN)
                  ↑
       More FN  =  Lower Recall

One Line to Remember

The name tells you two things at once — was the model right, and what did it predict.


Step 4 — Calculate Precision

Question Precision answers: "When the model said Cancer — how often was it actually right?"

Model said "Cancer" to 10 people. Out of those 10, only 7 actually had cancer.

Precision = TP / (TP + FP)
          = 7 / (7 + 3)
          = 7 / 10
          = 0.70 → 70%

Real world meaning: If the model tells you "You have cancer" — there is a 70% chance it is correct. 30% chance it is wrong.


Step 5 — Calculate Recall

Question Recall answers: "Out of all patients who actually had cancer — how many did the model catch?"

10 people actually had cancer. Model caught only 7. It missed 3.

Recall = TP / (TP + FN)
       = 7 / (7 + 3)
       = 7 / 10
       = 0.70 → 70%

Real world meaning: 3 cancer patients went home thinking they are healthy. They will never get treated. This is dangerous.
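Both formulas are one line each. A sketch using the counts from the story:

```python
tp, fp, fn = 7, 3, 3   # counts from the 100-patient story

# Precision: of everyone the model flagged as Cancer, how many really had it?
precision = tp / (tp + fp)

# Recall: of everyone who really had cancer, how many did the model catch?
recall = tp / (tp + fn)

print(precision, recall)   # 0.7 0.7
```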


Step 6 — The Tug of War Between Precision and Recall

These two always fight each other. You cannot simply maximize one.

Scenario A — Model is too aggressive (flags everyone as Cancer):

Model said "Cancer" to all 100 patients

TP = 10,  FP = 90,  FN = 0

Precision = 10 / (10 + 90) = 10%   ← Terrible
Recall    = 10 / (10 + 0)  = 100%  ← Perfect

Recall is perfect but Precision is destroyed. You scared 90 healthy people unnecessarily.

Scenario B — Model is too cautious (flags only 1 person as Cancer):

Model said "Cancer" to only 1 patient — and that 1 was correct

TP = 1,  FP = 0,  FN = 9

Precision = 1 / (1 + 0) = 100%  ← Perfect
Recall    = 1 / (1 + 9) = 10%   ← Terrible

Precision is perfect but Recall is destroyed. 9 sick people went home undetected.

Both extremes are bad. You need balance. That is what F1 gives you.
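The two extremes above can be checked in a few lines. A sketch:

```python
def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

# Scenario A: flag everyone as Cancer (too aggressive)
p_a, r_a = precision_recall(tp=10, fp=90, fn=0)
print(p_a, r_a)   # 0.1 1.0 (perfect recall, terrible precision)

# Scenario B: flag only one sure case (too cautious)
p_b, r_b = precision_recall(tp=1, fp=0, fn=9)
print(p_b, r_b)   # 1.0 0.1 (perfect precision, terrible recall)
```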


Step 7 — F1 Score Calculation

F1 = 2 × (Precision × Recall) / (Precision + Recall)
   = 2 × (0.70 × 0.70) / (0.70 + 0.70)
   = 2 × 0.49 / 1.40
   = 0.98 / 1.40
   = 0.70 → 70%

Why not just take a normal average?

Look at Scenario A above:

Normal Average  =  (10% + 100%) / 2  =  55%  ← Sounds okay
F1 Score        =  2 × (0.10 × 1.0) / (0.10 + 1.0)  =  18%  ← Shows the truth

Normal average hides the problem. F1 exposes it immediately.

F1 is strict — both Precision AND Recall must be good. If either one is bad, F1 will be bad.
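This strictness comes from F1 being the harmonic mean, not the arithmetic mean. A sketch comparing the two on Scenario A:

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def f1(p, r):
    # F1 is the harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Scenario A from above: Precision = 10%, Recall = 100%
print(arithmetic_mean(0.10, 1.0))   # 0.55 (hides the problem)
print(round(f1(0.10, 1.0), 2))      # 0.18 (exposes it)

# If either input is near zero, F1 collapses toward zero
print(round(f1(0.01, 1.0), 2))      # 0.02
```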


Step 8 — Why Accuracy Still Fails Here

Accuracy = (TP + TN) / Total
         = (7 + 87) / 100
         = 94 / 100
         = 94%  ← Looks great!

F1 Score = 70%  ← Shows the real picture

Accuracy is high because there are 87 True Negatives dragging the number up. But those 87 are just healthy people — easy to get right. The hard part is catching cancer patients — and the model only got 70% of those.

F1 ignores True Negatives completely. It only cares about how well you handle the positive cases.
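You can verify this: pile on extra healthy patients that the model classifies correctly and watch accuracy inflate while F1 does not move. A sketch (assuming scikit-learn):

```python
from sklearn.metrics import accuracy_score, f1_score

# The 100-patient scenario: TP=7, FP=3, FN=3, TN=87
y_true = [1] * 7 + [1] * 3 + [0] * 3 + [0] * 87
y_pred = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 87

print(accuracy_score(y_true, y_pred))       # 0.94
print(round(f1_score(y_true, y_pred), 2))   # 0.7

# Add 900 more healthy patients, all classified correctly (900 extra TNs)
y_true += [0] * 900
y_pred += [0] * 900

print(round(accuracy_score(y_true, y_pred), 2))   # 0.99 (accuracy inflates)
print(round(f1_score(y_true, y_pred), 2))         # 0.7  (F1 does not move)
```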


Step 9 — Python Code


    from sklearn.metrics import (
        accuracy_score, f1_score,
        precision_score, recall_score
    )

    # Ground Truth — confirmed by pathology tests
    # 1 = Cancer,  0 = Healthy
    y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0,
              0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

    # Model's predictions — its own independent guesses
    y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0,
              0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

    accuracy  = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall    = recall_score(y_true, y_pred)
    f1        = f1_score(y_true, y_pred)

    print(f"Accuracy  : {accuracy:.2f}")   # 0.85 — looks good but misleading
    print(f"Precision : {precision:.2f}")  # 0.86 — when it said cancer, 86% correct
    print(f"Recall    : {recall:.2f}")     # 0.75 — caught 6 of the 8 actual cases
    print(f"F1 Score  : {f1:.2f}")         # 0.80 — the honest combined score


Step 10 — When to Use What

Situation           | Bigger Mistake            | What to Focus On | Use
--------------------|---------------------------|------------------|---------
🏥 Cancer Detection | Missing a sick person     | High Recall      | F1
📧 Spam Filter      | Deleting a real email     | High Precision   | F1
💳 Fraud Detection  | Missing a fraud           | High Recall      | F1
🐶 Dog vs Cat       | Balanced data, both equal | Either is fine   | Accuracy

Golden Rule: If your dataset is imbalanced (example: 95% Healthy, 5% Cancer) — always use F1. Never trust accuracy.


Complete Summary in One Place

100 Patients Total
├── 10 actually had Cancer   (Ground Truth — FIXED)
└── 90 actually Healthy      (Ground Truth — FIXED)

Model independently predicted:
├── Said "Cancer"    → 10 patients (model's own guess)
│   ├── 7 correct   → TP = 7  (overlap between both lists)
│   └── 3 wrong     → FP = 3  (healthy people wrongly flagged)
└── Said "Healthy"  → 90 patients
    ├── 87 correct  → TN = 87
    └── 3 wrong     → FN = 3  (cancer patients model missed)

Precision = TP / (TP + FP) = 7 / 10 = 70%
Recall    = TP / (TP + FN) = 7 / 10 = 70%
F1        = 2 × (0.7 × 0.7) / (0.7 + 0.7) = 70%
Accuracy  = (7 + 87) / 100 = 94%  ← Misleading!

F1 Score Scale

F1 Score       | Meaning
---------------|--------------------------------
1.00           | Perfect model 🏆
0.75 and above | Good — ready for production
0.50 to 0.75   | Average — needs improvement ⚠️
Below 0.50     | Bad — do not use
0.00           | Completely useless 🗑️



One line to remember forever:

Accuracy tells you how often you were right overall. F1 tells you how well you handled the cases that actually mattered.
