Statistics & Math Roadmap for Data Science / ML

PHASE 1 — Descriptive Statistics (Start Here)

Why: Before any ML model, you need to understand and summarize your data. Pandas .describe(), .mean(), .std() — all of this is descriptive stats.

Topics:

  • Types of Data — Nominal, Ordinal (categorical); Discrete, Continuous (numerical)
  • Measures of Central Tendency — Mean, Median, Mode (and when to use which)
  • Measures of Spread — Variance, Standard Deviation, Range, IQR
  • Skewness & Kurtosis — Is your data symmetric? Heavy-tailed?
  • Percentiles & Quantiles — Box plots, outlier detection
  • Covariance & Correlation — How two variables move together (Pearson, Spearman)
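Most of these measures are one-liners in pandas. A minimal sketch on a small hypothetical series with one deliberate outlier (the numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: daily website visits, with one outlier day
visits = pd.Series([120, 135, 128, 140, 132, 125, 900])

mean = visits.mean()      # pulled upward by the outlier
median = visits.median()  # robust to the outlier
std = visits.std()        # sample standard deviation (ddof=1)
q1, q3 = visits.quantile([0.25, 0.75])
iqr = q3 - q1             # basis of the box-plot outlier rule

# Pearson vs Spearman on a monotonic but nonlinear relationship
x = pd.Series(range(1, 8))
y = x ** 3
pearson = x.corr(y)                      # < 1: measures linear association
spearman = x.corr(y, method="spearman")  # == 1: measures monotonic association
```

Note how the mean gets dragged toward the outlier while the median stays put, and how Spearman, being rank-based, scores a perfect 1 on a relationship Pearson undersells.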

PHASE 2 — Probability Theory

Why: ML models are probabilistic at their core. Naive Bayes, Logistic Regression, Neural Networks — all built on probability.

Topics:

  • Basic Probability — Events, Sample Space, P(A), Complement
  • Conditional Probability — P(A|B), independence
  • Bayes' Theorem — The backbone of Naive Bayes classifier
  • Random Variables — Discrete vs Continuous
  • Probability Distributions:
    • Bernoulli, Binomial (for classification problems)
    • Normal / Gaussian (most important — appears everywhere)
    • Poisson (event counting)
    • Uniform, Exponential
  • Expected Value & Variance of distributions
  • Central Limit Theorem — Why we assume normality in many algorithms
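A few of these distributions in scipy.stats, plus a simulated glimpse of the Central Limit Theorem. The seed and sample sizes here are arbitrary choices for the sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Normal distribution: ~68% of probability mass lies within one std dev
p_within_1sd = stats.norm.cdf(1) - stats.norm.cdf(-1)  # ~0.6827

# Binomial: P(exactly 7 heads in 10 fair coin flips)
p_7_heads = stats.binom.pmf(7, n=10, p=0.5)  # C(10,7) / 2**10

# Central Limit Theorem: means of samples drawn from a skewed
# (exponential) distribution are approximately normal, centered
# on the population mean of 1.0
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)
```

Plotting a histogram of `sample_means` (matplotlib) makes the CLT visible: the exponential is heavily right-skewed, but the distribution of its sample means looks like a bell curve.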

PHASE 3 — Inferential Statistics

Why: You work with samples, not entire populations. This phase teaches you how to draw conclusions and measure confidence in your findings.

Topics:

  • Population vs Sample
  • Sampling Methods — Random, Stratified, etc.
  • Hypothesis Testing:
    • Null Hypothesis (H0) vs Alternate Hypothesis (H1)
    • p-value — what it actually means (very misunderstood)
    • Significance Level (commonly alpha = 0.05)
    • Type I Error (False Positive) and Type II Error (False Negative)
  • Z-test and T-test — Comparing means
  • Chi-Square Test — For categorical data relationships
  • ANOVA — Comparing means across multiple groups
  • Confidence Intervals — a range built so that, across repeated samples, it captures the true mean 95% of the time (not "a 95% chance the true mean is in this one interval" — a subtle but common misreading)
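A minimal hypothesis-testing sketch with scipy.stats, on simulated data for a hypothetical A/B test. The group names, effect size, and seed are all invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical A/B test: a metric for two page variants,
# where variant B genuinely performs slightly better
group_a = rng.normal(loc=10.0, scale=2.0, size=1000)
group_b = rng.normal(loc=10.6, scale=2.0, size=1000)

# Two-sample t-test. H0: the two group means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
reject_h0 = p_value < 0.05  # compare against significance level alpha

# 95% confidence interval for the mean of group_a
ci_low, ci_high = stats.t.interval(
    0.95,
    df=len(group_a) - 1,
    loc=group_a.mean(),
    scale=stats.sem(group_a),
)
```

With 1000 observations per group and a real underlying difference, the p-value comes out tiny and H0 is rejected; shrink the sample size or the effect and watch the p-value climb.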

PHASE 4 — Linear Algebra

Why: Every ML model operates on matrices and vectors internally. Neural networks, PCA, SVD, image data — pure linear algebra.

Topics:

  • Scalars, Vectors, Matrices, Tensors — What they are and how NumPy maps to them
  • Matrix Operations — Addition, Multiplication, Transpose
  • Dot Product — Core operation in every neural network layer
  • Identity Matrix & Inverse Matrix
  • Determinant — When is a matrix invertible (i.e., when does a linear system have a unique solution)?
  • Eigenvalues & Eigenvectors — Critical for PCA (dimensionality reduction)
  • Singular Value Decomposition (SVD) — Used in recommendation systems, NLP
  • Norms (L1, L2) — Used in regularization (Ridge, Lasso regression)
  • Orthogonality — Basis of PCA and feature independence
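Nearly all of these operations are one-liners in numpy.linalg. A small sketch with a hand-picked 2x2 matrix (chosen so the numbers work out cleanly):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# The inverse exists because the determinant is nonzero
det = np.linalg.det(A)       # 4*3 - 2*1 = 10
A_inv = np.linalg.inv(A)
identity = A @ A_inv         # ~ the identity matrix

# Eigendecomposition: A v = lambda v (the core of PCA)
eigvals, eigvecs = np.linalg.eig(A)  # eigenvalues 5 and 2

# SVD: A = U S V^T (recommendation systems, NLP)
U, S, Vt = np.linalg.svd(A)
reconstructed = U @ np.diag(S) @ Vt

# Norms used in regularization
x = np.array([3.0, -4.0])
l1 = np.linalg.norm(x, 1)  # |3| + |-4| = 7  (Lasso penalty)
l2 = np.linalg.norm(x)     # sqrt(9 + 16) = 5  (Ridge penalty)
```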

PHASE 5 — Calculus (Focused, not full course)

Why: Gradient Descent — the workhorse algorithm behind training most ML models — is pure calculus. You don't need deep calculus, but these specific concepts are non-negotiable.

Topics:

  • Functions & Limits — Basic understanding
  • Derivatives — Rate of change, slope of a curve
  • Chain Rule — Essential for backpropagation in neural networks
  • Partial Derivatives — When your function has multiple variables (it always does in ML)
  • Gradient — Vector of all partial derivatives; tells you which direction to move
  • Gradient Descent — How models learn by minimizing loss
  • Minima & Maxima — Finding where the function is lowest (minimizing error)
  • Integrals (light) — Area under curve; used in probability distributions
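Gradient descent on a one-variable function makes the whole loop concrete. A minimal sketch minimizing f(w) = (w - 3)^2; the learning rate and step count are arbitrary choices:

```python
# Minimize f(w) = (w - 3)^2 with gradient descent.
# The derivative f'(w) = 2 * (w - 3) points uphill,
# so each step moves in the opposite direction.
def gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0    # starting point
lr = 0.1   # learning rate (step size)
for _ in range(100):
    w -= lr * gradient(w)

# w has converged to the minimum at w = 3
```

Real models do exactly this, just with millions of parameters: the gradient is then a vector of partial derivatives (one per weight), computed by backpropagation via the chain rule.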

PHASE 6 — Information Theory (Before NLP / Advanced ML)

Why: Decision Trees, Random Forests, and most NLP and deep-learning classifiers use these concepts directly.

Topics:

  • Entropy — Measure of uncertainty/randomness in data
  • Information Gain — How much a feature reduces uncertainty (used in Decision Trees)
  • Cross-Entropy Loss — The loss function used in classification models
  • KL Divergence — Difference between two probability distributions
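Each of these quantities is a few lines of numpy. A sketch using base-2 logarithms so the results are in bits:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# A fair coin is maximally uncertain: exactly 1 bit per flip
h_fair = entropy([0.5, 0.5])
# A biased coin is more predictable, so less entropy
h_biased = entropy([0.9, 0.1])

def kl_divergence(p, q):
    """KL(p || q) in bits; an asymmetric gap between two distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Cross-entropy decomposes as H(p, q) = H(p) + KL(p || q):
# the cost of encoding data from p using a model that believes q
p = [0.5, 0.5]
q = [0.9, 0.1]
cross_entropy = entropy(p) + kl_divergence(p, q)
```

Information gain in a Decision Tree is just `entropy(parent)` minus the weighted entropy of the children after a split.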

PHASE 7 — Optimization (Before Deep Learning)

Why: Training any model = solving an optimization problem.

Topics:

  • Loss Functions — MSE, MAE, Cross-Entropy — what they measure and when to use
  • Convex vs Non-Convex problems
  • Gradient Descent variants — Batch, Stochastic (SGD), Mini-Batch
  • Learning Rate — Too high vs too low
  • Momentum, Adam Optimizer — Why plain gradient descent is often not enough
  • Regularization (L1/L2) — Preventing overfitting using math
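The learning-rate point is easy to see directly by minimizing f(w) = w^2 with three different rates. The specific values below are illustrative:

```python
# Minimize f(w) = w^2 (gradient: 2w) with different learning rates.
def grad(w):
    return 2.0 * w

def run(lr, steps=50, w0=5.0):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_good = run(lr=0.1)    # converges smoothly toward the minimum at 0
w_tiny = run(lr=0.001)  # crawls: still far from 0 after 50 steps
w_huge = run(lr=1.1)    # diverges: every step overshoots further
```

Momentum and Adam exist to soften exactly this sensitivity: they adapt the effective step size so you depend less on picking the perfect learning rate by hand.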

Recommended Study Order

Phase 1 → Phase 2 → Phase 3 → Phase 4 → Phase 5 → Phase 6 → Phase 7

You don't need to fully complete one phase before starting the next. Once you're 70-80% comfortable, move forward and come back when a concept blocks you.


Practical Mapping (Math → Python Library)

  • Descriptive Stats → pandas, numpy
  • Probability & Distributions → scipy.stats
  • Hypothesis Testing → scipy.stats, statsmodels
  • Linear Algebra → numpy.linalg
  • Calculus / Optimization → Conceptual understanding, then PyTorch/TensorFlow autograd
  • Visualization of all above → matplotlib, seaborn

What You Can Skip (for now)

  • Real Analysis, Topology — pure math, not needed for applied ML
  • Full integral calculus — you need the concept, not manual computation
  • Complex Number theory — not relevant for standard ML

Honest Time Estimate

  • Phase 1-3 (Stats): 3-4 weeks at a comfortable pace
  • Phase 4 (Linear Algebra): 2-3 weeks
  • Phase 5 (Calculus): 2 weeks (focused, not full course)
  • Phase 6-7: 1-2 weeks each

Total: roughly 3-4 months if studying alongside your Python/ML work. The stats and linear algebra will immediately make your pandas and numpy usage much more intuitive.

