Mean — Complete Chapter for ML & Statistics

What is Mean?

We calculate the mean (average) because it gives a single value that represents the whole dataset.

Why we need mean (benefits):

  • Easy understanding: Instead of looking at many numbers, one value summarizes everything.
  • Quick comparison: You can easily compare different groups (e.g., average salary of two companies).
  • Decision making: Helps in making decisions based on overall performance (e.g., average marks, sales).
  • Finding trends: Shows general behavior of data (high, low, normal).
  • Used in formulas: Mean is the base for many calculations like variance, standard deviation, etc.

Mean is the average of a set of numbers. You add all values together, then divide by how many values there are.

Simple idea: If 5 friends scored 60, 70, 80, 90, 100 in a test — what was the "typical" score? You find the mean.

Formula:

Mean (μ or x̄) = Sum of all values / Total count
              = (x₁ + x₂ + x₃ + ... + xₙ) / n

Example:

Values: 60, 70, 80, 90, 100
Sum = 60 + 70 + 80 + 90 + 100 = 400
Count = 5
Mean = 400 / 5 = 80

Why Do We Use Mean?

Because we need one number that represents the whole dataset.

In ML, you can't feed 10,000 raw values into every formula. You need summaries. Mean is the most fundamental summary of data.

It answers: "If everything was equal, what would each value be?"


Types of Mean (All Used in ML)


1. Arithmetic Mean

This is the standard mean everyone knows. Add everything, divide by count.


    import numpy as np

    scores = [60, 70, 80, 90, 100]
    mean = np.mean(scores)
    print(mean)  # 80.0

Used in: Loss functions, accuracy calculation, gradient descent, feature scaling.


2. Weighted Mean

Weighted mean is used when all values are not equally important.

👉 In a normal mean, every value has the same importance
👉 In a weighted mean, some values have more importance (weight) than others

Some values matter MORE than others, so you assign a weight to each value.

Formula:

Weighted Mean = (w₁x₁ + w₂x₂ + ... + wₙxₙ) / (w₁ + w₂ + ... + wₙ)

Example 1: You have 3 exams. Final exam is worth more.


    import numpy as np
    scores  = [70,  80,  90]
    weights = [1,   1,   3]   # Final exam has weight 3

    weighted_mean = np.average(scores, weights=weights)
    print(weighted_mean)  # 84.0

    # Manual: (70*1 + 80*1 + 90*3) / (1+1+3) = 420/5 = 84


Example 2:

Marks:

  • Math = 90 (weight = 50%)
  • English = 80 (weight = 30%)
  • Science = 70 (weight = 20%)

Now we don’t treat all subjects equally.

Weighted Mean =

(90 × 0.5) + (80 × 0.3) + (70 × 0.2)
= 45 + 24 + 14
= 83
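Example 2 can be checked with the same `np.average` call used in Example 1:

```python
import numpy as np

marks   = [90, 80, 70]        # Math, English, Science
weights = [0.5, 0.3, 0.2]     # 50%, 30%, 20%

weighted = np.average(marks, weights=weights)
print(weighted)  # 83.0
```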

Used in: Ensemble models (XGBoost, Random Forest voting), class imbalance handling, recommendation systems.


3. Geometric Mean

🤔 First Understand the Problem — Why Arithmetic Mean Fails?

Suppose you have ₹100. You invest it:

Year     Return   Your Money
──────────────────────────────
Year 1   +100%    ₹100 → ₹200
Year 2   -50%     ₹200 → ₹100

Arithmetic Mean says:

(+100% + (−50%)) / 2 = +25% average return per year

Reality check: Your money went ₹100 → ₹200 → ₹100 back. Real return = 0% 😐

So arithmetic mean showed 25% profit when actually there was 0% profit. This exact problem is solved by Geometric Mean.

🧠 Core Idea — Growth That Multiplies

Whenever one value grows on top of the previous value (compounding), use Geometric Mean.

Type              Operation            Use When
──────────────────────────────────────────────────────────────────
Arithmetic Mean   Adds numbers         Values are independent
Geometric Mean    Multiplies numbers   Each value depends on the previous one

📐 Formula

Two steps only:

  1. Multiply all numbers together
  2. Take the nth root (n = how many numbers you have)

💰 Example 1 — Investment Returns

You have ₹1000. Returns over 3 years:

  • Year 1: +10%
  • Year 2: -20%
  • Year 3: +30%

🔄 Step 0 — Convert % to Multiplier (Most Important Step)

Why do we convert? Because we need to multiply, not add. A multiplier tells us what to multiply the current amount by.

Year     Return   How to Convert   Multiplier
──────────────────────────────────────────────
Year 1   +10%     1.00 + 0.10      1.10
Year 2   -20%     1.00 − 0.20      0.80
Year 3   +30%     1.00 + 0.30      1.30
Rule: Always write it as 1 + (percent/100)

  +10% → 1 + (10/100) = 1 + 0.10 = 1.10
  -20% → 1 + (-20/100) = 1 − 0.20 = 0.80

📊 Step 1 — Multiply All Multipliers

Calculate left to right:

1.10 × 0.80 = 0.88
0.88 × 1.30 = 1.144
Product = 1.144

🌱 Step 2 — Take the nth Root

Here n = 3 (three years), so we take the cube root:

GM = (1.144)^(1/3) ≈ 1.0459

What does the 1/3 power mean? (1.144)^(1/3) asks: "what number multiplied by itself 3 times gives 1.144?"

🎯 Step 3 — Convert Back to Percentage

(1.0459 − 1) × 100 ≈ 4.59% average return per year

✅ Step 4 — Verify the Answer (Proof)

Actual path of money:

₹1000 × 1.10 × 0.80 × 1.30 = ₹1144

Using GM (≈ 4.59% every year):

₹1000 × 1.0459 × 1.0459 × 1.0459 ≈ ₹1144

Both give the same final amount — so GM is correct!

What Arithmetic Mean would have given (wrong):

(+10 − 20 + 30) / 3 ≈ +6.67% per year → ₹1000 × 1.0667 × 1.0667 × 1.0667 ≈ ₹1214, which is not what you actually ended with.
👨‍👩‍👧 Example 2 — Population Growth

City population = 10,00,000. Growth over 3 years:

  • Year 1: +5%
  • Year 2: +8%
  • Year 3: +6%

🔄 Step 0 — Convert to Multipliers

Year     Growth   Conversion   Multiplier
──────────────────────────────────────────
Year 1   +5%      1 + 0.05     1.05
Year 2   +8%      1 + 0.08     1.08
Year 3   +6%      1 + 0.06     1.06

📊 Step 1 — Multiply All Multipliers

1.05 × 1.08 = 1.134
1.134 × 1.06 = 1.20204
Product = 1.20204

🌱 Step 2 — Take the Cube Root (n = 3)

GM = (1.20204)^(1/3) ≈ 1.0633

🎯 Step 3 — Convert to Percentage

(1.0633 − 1) × 100 ≈ 6.33% average growth per year

✅ Step 4 — Verify

Actual population growth:

10,00,000 × 1.05 × 1.08 × 1.06 = 12,02,040

Using GM (≈ 6.33% every year):

10,00,000 × 1.0633 × 1.0633 × 1.0633 ≈ 12,02,040

Both match!
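The population example follows the exact same recipe; a quick numeric check:

```python
# Multipliers for +5%, +8%, +6%
growth = [1.05, 1.08, 1.06]

product = 1.0
for g in growth:
    product *= g              # 1.20204

gm = product ** (1 / 3)       # ≈ 1.0633

final_actual = 1_000_000 * 1.05 * 1.08 * 1.06   # 12,02,040
final_gm     = 1_000_000 * gm ** 3              # same, up to rounding

print(round(gm, 4))  # 1.0633
```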

🔁 Revisiting the ₹100 Problem (Now With GM)

+100% and -50% become multipliers 2.00 and 0.50:

GM = √(2.00 × 0.50) = √1 = 1.00 → 0% return

GM correctly said 0% return. Arithmetic mean had wrongly said +25%.

🐍 Python Code With Explanation


    from scipy.stats import gmean

    # Step 1: Write returns as multipliers
    returns = [1.10, 0.80, 1.30]   # +10%, -20%, +30%

    # Step 2: gmean multiplies all and takes nth root automatically
    gm = gmean(returns)

    # Step 3: Convert back to percentage
    print(f"Geometric Mean : {gm:.4f}")              # 1.0459
    print(f"Avg return/year: {(gm - 1) * 100:.2f}%") # 4.59%

📌 When to Use — Quick Reference

Situation                        Correct Mean
──────────────────────────────────────────────
Average marks, height, weight    Arithmetic Mean
Investment / stock returns       Geometric Mean
Population growth                Geometric Mean
Any % change over time           Geometric Mean
Each value builds on previous    Geometric Mean

🎯 One Line Summary

Whenever money or any quantity grows on top of the previous result (compounding), always use Geometric Mean — Arithmetic Mean will give you a wrong answer.


4. Harmonic Mean

Harmonic Mean is used when values are related to speed, rate, or “per unit” things.

👉 Like:

  • speed (km/h)
  • price per item
  • work per hour

What it actually means

It gives the true average when things are divided (not added or multiplied)

👉 Special case: When you travel the same distance with different speeds

Simple example idea

You go:

  • Half distance at 60 km/h

  • Half distance at 40 km/h

👉 Normal average = (60 + 40) / 2 = 50 ❌ WRONG
👉 Because the time spent at each speed is different

👉 Harmonic Mean gives the correct average speed: 2 / (1/60 + 1/40) = 48 km/h

Why we need it

  • When values are rates (per unit)

  • When denominator matters (time, distance, etc.)

  • Gives real accurate result in such cases


In one line:

Harmonic mean is used to find the correct average when dealing with speeds or rates (per unit values).

Reciprocal of the arithmetic mean of reciprocals. Sounds complex — but the use case makes it click.

Formula:

Harmonic Mean = n / (1/x₁ + 1/x₂ + ... + 1/xₙ)

✅ Example 1 (Average Speed with multiple values)

You travel equal distances at speeds: 30 km/h, 40 km/h, 60 km/h

Step 1: Formula

HM = 3 / (1/30 + 1/40 + 1/60)

Step 2: Solve
1/30 + 1/40 + 1/60
LCM = 120

= (4 + 3 + 2) / 120 = 9/120 = 3/40

Step 3: Final
3 ÷ (3/40) = 40 km/h

👉 Final Answer: Average speed = 40 km/h
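The same answer comes straight from the formula in code (scipy.stats.hmean gives an identical result):

```python
# Harmonic mean of equal-distance speeds
speeds = [30, 40, 60]

hm = len(speeds) / sum(1 / s for s in speeds)
print(round(hm, 4))  # 40.0
```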

✅ Example 2 (Work Rate)

Two machines complete same work:

  • Machine A → 6 hours
  • Machine B → 12 hours

Step 1: Formula

HM = 2 / (1/6 + 1/12)

Step 2: Solve

1/6 + 1/12 = (2 + 1) / 12 = 3/12 = 1/4

Step 3: Final

2 ÷ (1/4) = 8 hours

👉 Final Answer: Average time = 8 hours
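And the machine example, with the same formula:

```python
# Harmonic mean of the two completion times
times = [6, 12]

hm = len(times) / sum(1 / t for t in times)
print(round(hm, 4))  # 8.0
```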

Example:


    from scipy.stats import hmean

    values = [4, 1]
    h_mean = hmean(values)
    print(h_mean)  # 1.6

The most important use in ML — F1 Score:

    # F1 Score IS the harmonic mean of precision and recall
    precision = 0.80
    recall    = 0.60

    f1 = 2 * (precision * recall) / (precision + recall)
    print(f1)  # ≈ 0.686

    # Why harmonic and not arithmetic?
    # Arithmetic mean of 0.8 and 0.6 = 0.70 (too generous)
    # Harmonic mean punishes imbalance — if either is low, F1 is low

Used in: F1 Score, averaging rates, anywhere balance between two metrics matters.


5. Moving Average (Rolling Mean)
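This heading has no body in the source, so here is a minimal sketch (assuming pandas and made-up prices): a rolling mean replaces each point with the average of the last k values, smoothing short-term noise in a time series.

```python
import pandas as pd

prices = pd.Series([10, 12, 14, 13, 15, 18, 17])

# Rolling mean with window k = 3: average of the 3 most recent values
rolling = prices.rolling(window=3).mean()
print(rolling.tolist())
# First two entries are NaN because the window is not full yet
```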


6. Exponential Moving Average (EMA)
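Also only a heading in the source; a minimal sketch (assuming pandas, same made-up prices): EMA smooths a series too, but weights recent values exponentially more, which is why it reacts faster than a plain rolling mean and why Adam-style optimizers use it internally.

```python
import pandas as pd

prices = pd.Series([10, 12, 14, 13, 15, 18, 17])

# EMA with alpha = 0.5: new_ema = 0.5 * value + 0.5 * previous_ema
ema = prices.ewm(alpha=0.5, adjust=False).mean()
print(ema.round(3).tolist())
```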


Mean in Core ML Concepts


Mean Absolute Error (MAE)
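The source leaves this section empty; a minimal sketch using the same house-price numbers the RMSE chapter later in this document works through (summary-table formula: mean(|actual − pred|)):

```python
import numpy as np

actual    = np.array([50, 80, 60, 90, 70])
predicted = np.array([45, 85, 58, 95, 65])

# MAE: average absolute error, in the same unit as the target
mae = np.mean(np.abs(actual - predicted))
print(mae)  # 4.4
```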


Mean Squared Error (MSE)
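Also empty in the source; the same sketch data, with the summary-table formula mean((actual − pred)²):

```python
import numpy as np

actual    = np.array([50, 80, 60, 90, 70])
predicted = np.array([45, 85, 58, 95, 65])

# MSE: average squared error; squaring punishes big mistakes harder
mse = np.mean((actual - predicted) ** 2)
print(mse)  # 20.8
```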


Root Mean Squared Error (RMSE)
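Empty here too (the full RMSE chapter appears later in this document); the one-line computation on the sketch data:

```python
import numpy as np

actual    = np.array([50, 80, 60, 90, 70])
predicted = np.array([45, 85, 58, 95, 65])

# RMSE: square root of MSE, back in the target's original unit
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(float(rmse), 4))  # 4.5607
```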


Mean in Feature Scaling — Standardization (Z-score)

Before feeding data into ML models, you scale features. Mean is the center point.

Formula:

z = (x - mean) / standard_deviation
    from sklearn.preprocessing import StandardScaler

    data = [[25], [30], [35], [40], [45]]
    scaler = StandardScaler()
    scaled = scaler.fit_transform(data)

    print(scaled)
    # After scaling: mean becomes 0, std becomes 1
    # [-1.41, -0.71, 0.0, 0.71, 1.41]

Why? Algorithms like Linear Regression, SVM, KNN, Neural Networks assume features are on similar scales. Without this, the feature with larger numbers dominates unfairly.


Mean in Gradient Descent

When you train a model, the loss function uses mean over all training examples.

Loss = (1/n) × Σ (predicted - actual)²

The gradient (direction to update weights) is also the mean of gradients across all samples. The model learns by minimizing this average error.
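As a minimal sketch of that idea (toy one-weight model, made-up data), each update uses the mean of per-sample gradients:

```python
import numpy as np

# Toy data generated by y = 2x; the model is predicted = w * x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w  = 0.0    # start from a bad guess
lr = 0.1    # learning rate

for _ in range(100):
    pred = w * x
    # d/dw of mean((pred - y)^2) is the MEAN of per-sample gradients
    grad = np.mean(2 * (pred - y) * x)
    w -= lr * grad

print(round(w, 3))  # converges to 2.0
```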


Mean Imputation (Handling Missing Data)

When data has missing values, a simple strategy is to fill them with the mean of that column.

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({'age': [25, 30, np.nan, 40, np.nan, 35]})

    mean_age = df['age'].mean()  # 32.5
    df['age'] = df['age'].fillna(mean_age)  # assignment avoids pandas inplace warnings

    print(df)
    # NaN values replaced with 32.5

When to use: Works well when data is roughly normally distributed and not too many values are missing.


Mean in Batch Normalization (Neural Networks)

Inside deep neural networks, after each layer, the activations are normalized using mean and standard deviation of the current batch. This keeps training stable and fast.

    import torch
    import torch.nn as nn

    # PyTorch example
    bn = nn.BatchNorm1d(num_features=4)
    x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                      [5.0, 6.0, 7.0, 8.0]])

    output = bn(x)
    # Internally: subtracts mean, divides by std, for each feature

Mean vs Median — When Mean Fails

Mean has one big weakness: outliers destroy it.

    import numpy as np

    salaries = [30000, 32000, 31000, 29000, 500000]  # one very high earner in the group

    mean_salary   = np.mean(salaries)    # 124,400  ← completely misleading
    median_salary = np.median(salaries)  # 31,000   ← represents the group better

Rule of thumb:

  • Data has no extreme outliers → use Mean
  • Data has outliers or is skewed → use Median
  • Always check with a histogram or box plot before deciding

Quick Reference Summary

Type               Formula                   ML Use Case
─────────────────────────────────────────────────────────────────────
Arithmetic Mean    sum / n                   Loss functions, scaling
Weighted Mean      Σ(wᵢxᵢ) / Σwᵢ            Ensembles, class weights
Geometric Mean     (x₁×x₂×...×xₙ)^(1/n)    Growth rates, log-scale eval
Harmonic Mean      n / Σ(1/xᵢ)              F1 Score, rate averaging
Rolling Mean       Mean of last k values     Time series smoothing
EMA                Weighted recent average   Adam optimizer, forecasting
MAE                mean(|actual - pred|)     Regression evaluation
MSE                mean((actual - pred)²)    Regression loss function
RMSE               √MSE                      Regression evaluation

One-Line Memory Hook for Each

  • Arithmetic → "The everyday average"
  • Weighted → "Some things matter more"
  • Geometric → "For growth and multiplication"
  • Harmonic → "For rates and balance — F1 lives here"
  • Rolling → "Sliding window over time"
  • EMA → "Recent past matters more"
  • MAE → "Average of how wrong you were"
  • MSE → "Punish big mistakes harder"
  • RMSE → "MSE in original units"

That's the complete Mean chapter — from the basic definition all the way to how it powers neural network training, model evaluation, and data preprocessing in real ML pipelines.


Root Mean Squared Error (RMSE)

Start With The Problem — Where MSE Falls Short

You just calculated MSE for your house price model:

MSE = 20.8

Your manager asks — "so how wrong is our model on average?"

You say — "MSE is 20.8"

They ask — "20.8 what? Lakhs? Lakhs squared? What does that mean?"

You have no good answer. Because MSE unit is Lakhs² — completely uninterpretable in real world.

You want MSE's superpower (punishing big errors) but in a unit that actually makes sense.

Simple fix — just take the square root of MSE. That's RMSE.


What is RMSE?

RMSE = Square Root of MSE

That's the entire definition. Nothing new to learn conceptually — it's just MSE with a square root on top to fix the unit problem.

RMSE = √MSE = √( (1/n) × Σ (Actual - Predicted)² )

The Formula

RMSE = √ [ (1/n) × Σ (Actual - Predicted)² ]

Step by step:

  1. Find error (Actual − Predicted)
  2. Square each error
  3. Take average → this is MSE
  4. Take square root of MSE → this is RMSE

Manual Walkthrough — Step by Step

Same house price data:

House   Actual   Predicted   Error   Error²
────────────────────────────────────────────
1       50       45           5      25
2       80       85          -5      25
3       60       58           2       4
4       90       95          -5      25
5       70       65           5      25

Step 1 — Sum of squared errors:

25 + 25 + 4 + 25 + 25 = 104

Step 2 — MSE:

MSE = 104 / 5 = 20.8

Step 3 — RMSE:

RMSE = √20.8 = 4.56

Result: RMSE = 4.56 Lakhs

Now you can tell your manager — "on average, our model is wrong by ₹4.56 Lakhs" — and they actually understand it.


MAE vs RMSE — Same Unit, Different Behavior

Both are now in Lakhs. Let's compare on same data:

MAE  = 4.4  Lakhs
RMSE = 4.56 Lakhs

RMSE is slightly higher. Why? Because it penalizes bigger errors more, so it naturally comes out a bit higher than MAE.

This relationship is always true:

RMSE >= MAE   (always, without exception)

The gap between RMSE and MAE tells you something important about your model's errors.


The Gap Between RMSE and MAE — This is Gold

Gap                       What it means
─────────────────────────────────────────────────────────────────
RMSE ≈ MAE (small gap)    Errors are consistent — no big outlier mistakes
RMSE >> MAE (big gap)     Model is making some very large errors somewhere

    mae  = 4.4
    rmse = 4.56
    gap  = rmse - mae   # small gap = consistent errors, model is stable

    mae2  = 4.4
    rmse2 = 18.7
    gap2  = rmse2 - mae2  # huge gap = some predictions are badly wrong

In real projects, checking this gap is a quick way to detect if your model has an outlier problem.


Python Program


    import numpy as np
    import pandas as pd
    from sklearn.metrics import mean_squared_error, mean_absolute_error
    import matplotlib.pyplot as plt

    # --- Data ---
    actual    = [50, 80, 60, 90, 70]
    predicted = [45, 85, 58, 95, 65]

    # --- Manual Calculation ---
    errors         = [a - p for a, p in zip(actual, predicted)]
    squared_errors = [e**2 for e in errors]
    mse_manual     = sum(squared_errors) / len(squared_errors)
    rmse_manual    = mse_manual ** 0.5   # square root

    print("=== Manual Calculation ===")
    print(f"Errors         : {errors}")
    print(f"Squared Errors : {squared_errors}")
    print(f"MSE            : {mse_manual}")
    print(f"RMSE           : {rmse_manual:.4f}")

    # --- Using NumPy ---
    rmse_numpy = np.sqrt(np.mean((np.array(actual) - np.array(predicted))**2))
    print(f"\nRMSE (numpy)   : {rmse_numpy:.4f}")

    # --- Using Scikit-learn ---
    mse     = mean_squared_error(actual, predicted)
    rmse    = np.sqrt(mse)
    mae     = mean_absolute_error(actual, predicted)

    print(f"RMSE (sklearn) : {rmse:.4f}")

    # --- The Gap Analysis ---
    print("\n=== MAE vs RMSE Gap Analysis ===")
    print(f"MAE  : {mae:.4f}")
    print(f"RMSE : {rmse:.4f}")
    print(f"Gap  : {rmse - mae:.4f}  ({'small - model is consistent' if (rmse - mae) < 2 else 'large - model has big error somewhere'})")

    # --- Outlier Comparison ---
    actual2    = [50, 80, 60, 90, 70]
    predicted2 = [50, 80, 60, 89, 30]   # last one is way off

    mae2  = mean_absolute_error(actual2, predicted2)
    rmse2 = np.sqrt(mean_squared_error(actual2, predicted2))

    print("\n=== Normal Model vs Outlier Model ===")
    print(f"Normal Model  → MAE: {mae:.2f}  | RMSE: {rmse:.2f}  | Gap: {rmse-mae:.2f}")
    print(f"Outlier Model → MAE: {mae2:.2f} | RMSE: {rmse2:.2f} | Gap: {rmse2-mae2:.2f}")
    print("Notice how RMSE explodes for outlier model but MAE stays modest")

    # --- Plot ---
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Plot 1 - Actual vs Predicted with error lines
    axes[0].plot(range(1, 6), actual,    label='Actual',    marker='o', linewidth=2)
    axes[0].plot(range(1, 6), predicted, label='Predicted', marker='s', linewidth=2, linestyle='--')
    for i in range(5):
        axes[0].vlines(i+1, min(actual[i], predicted[i]),
                            max(actual[i], predicted[i]),
                            colors='red', linewidth=2, alpha=0.6)
    axes[0].set_title(f'Actual vs Predicted\nMAE={mae:.2f} | RMSE={rmse:.2f}')
    axes[0].set_xlabel('House')
    axes[0].set_ylabel('Price (Lakhs)')
    axes[0].legend()
    axes[0].grid(True)

    # Plot 2 - MAE vs RMSE bar comparison across both models
    metrics  = ['MAE', 'RMSE']
    normal   = [mae, rmse]
    outlier  = [mae2, rmse2]
    x        = np.arange(len(metrics))
    width    = 0.35

    axes[1].bar(x - width/2, normal,  width, label='Normal Model',  color='steelblue', alpha=0.8)
    axes[1].bar(x + width/2, outlier, width, label='Outlier Model', color='tomato',    alpha=0.8)
    axes[1].set_title('MAE vs RMSE — Normal vs Outlier Model')
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(metrics)
    axes[1].set_ylabel('Error Value')
    axes[1].legend()
    axes[1].grid(True, axis='y')

    plt.tight_layout()
    plt.savefig('rmse_plot.png')
    plt.show()
    print("\nPlot saved!")


Output:

    === Manual Calculation ===
    Errors         : [5, -5, 2, -5, 5]
    Squared Errors : [25, 25, 4, 25, 25]
    MSE            : 20.8
    RMSE           : 4.5607

    RMSE (numpy)   : 4.5607
    RMSE (sklearn) : 4.5607

    === MAE vs RMSE Gap Analysis ===
    MAE  : 4.4000
    RMSE : 4.5607
    Gap  : 0.1607  (small - model is consistent)

    === Normal Model vs Outlier Model ===
    Normal Model  → MAE: 4.40  | RMSE: 4.56  | Gap: 0.16
    Outlier Model → MAE: 8.20 | RMSE: 17.89 | Gap: 9.69
    Notice how RMSE explodes for outlier model but MAE stays modest

    Plot saved!

How to Read RMSE in Real Projects

RMSE is in the same unit as your target. So interpretation is direct:

Target Variable

RMSE = 4.56 means

House Price (Lakhs)

Wrong by ₹4.56L on average (with big error penalty)

Temperature (°C)

Wrong by 4.56°C on average

Sales (units)

Wrong by 4.56 units on average

Quick sanity check in code:


    rmse = np.sqrt(mean_squared_error(y_test, y_pred))

    target_range = y_test.max() - y_test.min()
    rmse_pct     = (rmse / target_range) * 100

    print(f"RMSE             : {rmse:.2f}")
    print(f"Target Range     : {target_range:.2f}")
    print(f"RMSE as % range  : {rmse_pct:.1f}%")

    # Rule of thumb
    if rmse_pct < 10:
        print("Model is very good")
    elif rmse_pct < 20:
        print("Model is decent")
    else:
        print("Model needs improvement")


MAE vs MSE vs RMSE — Full Picture

                        MAE               MSE                RMSE
──────────────────────────────────────────────────────────────────────
Formula                 avg of |errors|   avg of errors²     √MSE
Unit                    Same as target    Squared            Same as target
Big error penalty       No                Yes, very heavy    Yes, heavy
Outlier sensitive       No — robust       Very sensitive     Sensitive
Interpretable           Best              Worst              Good
Used as loss function   Sometimes         Yes                Sometimes
Use when                Outliers exist    Training models    Evaluating models


The Golden Rule in Real Projects


    # Always report all three together
    mae  = mean_absolute_error(y_test, y_pred)
    mse  = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)

    print(f"MAE  : {mae:.2f}")    # average error, simple
    print(f"MSE  : {mse:.2f}")    # for reference, used in training
    print(f"RMSE : {rmse:.2f}")   # main reporting metric

    # Then check the gap
    print(f"Gap (RMSE - MAE): {rmse - mae:.2f}")
    # Small gap = consistent model
    # Large gap = outlier errors hiding somewhere

In job interviews and real projects — RMSE is the most commonly reported regression metric. MAE is used when you need simplicity or have lots of outliers. MSE is mostly seen inside model training.


One Line Summary

RMSE is MSE with a square root — it keeps MSE's ability to punish large errors heavily, but brings the unit back to the same scale as your data, making it the most widely used and reported regression evaluation metric.
