Statistics & Math Roadmap for Data Science / ML

PHASE 1 — Descriptive Statistics (Start Here)

Why: Before any ML model, you need to understand and summarize your data. Pandas .describe(), .mean(), .std() — all of this is descriptive stats.

Topics:

  • Types of Data — Nominal, Ordinal, Continuous, Discrete
  • Measures of Central Tendency — Mean, Median, Mode (and when to use which)
  • Measures of Spread — Variance, Standard Deviation, Range, IQR
  • Skewness & Kurtosis — Is your data symmetric? Heavy-tailed?
  • Percentiles & Quantiles — Box plots, outlier detection
  • Covariance & Correlation — How two variables move together (Pearson, Spearman)
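
Almost all of these map directly to one-line NumPy calls. A quick sketch (the numbers are made up for illustration):

```python
import numpy as np

# Made-up daily sales counts, with one obvious outlier (95)
data = np.array([12, 15, 14, 10, 18, 95, 13, 16])

mean = np.mean(data)      # dragged upward by the outlier
median = np.median(data)  # robust to the outlier
std = np.std(data)        # spread around the mean

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1             # interquartile range, the basis of box-plot outlier detection

print(mean, median)       # roughly 24.1 vs 14.5; the median tells the truer story here

# Correlation: how two (invented) variables move together
heights = np.array([150, 160, 170, 180, 190])
weights = np.array([50, 58, 66, 77, 85])
r = np.corrcoef(heights, weights)[0, 1]   # Pearson r, close to 1 here
```

Notice how a single outlier pulls the mean far above the median: exactly why "when to use which" matters.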

PHASE 2 — Probability Theory

Why: ML models are probabilistic at their core. Naive Bayes, Logistic Regression, Neural Networks — all built on probability.

Topics:

  • Basic Probability — Events, Sample Space, P(A), Complement
  • Conditional Probability — P(A|B), independence
  • Bayes' Theorem — The backbone of Naive Bayes classifier
  • Random Variables — Discrete vs Continuous
  • Probability Distributions:
    • Bernoulli, Binomial (for classification problems)
    • Normal / Gaussian (most important — appears everywhere)
    • Poisson (event counting)
    • Uniform, Exponential
  • Expected Value & Variance of distributions
  • Central Limit Theorem — Why we assume normality in many algorithms
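
You can see the Central Limit Theorem for yourself: take many samples from a heavily skewed distribution and the sample means still cluster in a bell curve around the true mean. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential distribution: heavily skewed, definitely not normal (true mean = 1)
samples = rng.exponential(scale=1.0, size=(10_000, 50))

# Mean of each 50-observation sample
sample_means = samples.mean(axis=1)

# The distribution of these means is approximately normal,
# centered on the true mean, with std close to 1/sqrt(50)
print(sample_means.mean())   # close to 1.0
print(sample_means.std())    # close to 1/50**0.5, about 0.14
```

This is why so many algorithms can assume normality of averages even when the raw data is nothing like normal.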

PHASE 3 — Inferential Statistics

Why: You work with samples, not entire populations. This phase teaches you how to draw conclusions and measure confidence in your findings.

Topics:

  • Population vs Sample
  • Sampling Methods — Random, Stratified, etc.
  • Hypothesis Testing:
    • Null Hypothesis (H0) vs Alternate Hypothesis (H1)
    • p-value — what it actually means (very misunderstood)
    • Significance Level (alpha = 0.05)
    • Type I Error (False Positive) and Type II Error (False Negative)
  • Z-test and T-test — Comparing means
  • Chi-Square Test — For categorical data relationships
  • ANOVA — Comparing means across multiple groups
  • Confidence Intervals — "I am 95% confident the true mean lies here"
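
As a concrete taste of hypothesis testing, here is a two-sample t-test with scipy.stats (install it with `pip install scipy` if needed; the data is invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented example: page-load times (seconds) for two site variants
group_a = rng.normal(loc=10.0, scale=2.0, size=100)
group_b = rng.normal(loc=12.0, scale=2.0, size=100)

# Two-sample t-test: are the means of the two groups different?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the difference is statistically significant")
else:
    print("Fail to reject H0")
```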

PHASE 4 — Linear Algebra

Why: Every ML model operates on matrices and vectors internally. Neural networks, PCA, SVD, image data — pure linear algebra.

Topics:

  • Scalars, Vectors, Matrices, Tensors — What they are and how NumPy maps to them
  • Matrix Operations — Addition, Multiplication, Transpose
  • Dot Product — Core operation in every neural network layer
  • Identity Matrix & Inverse Matrix
  • Determinant — When does a system of equations have a unique solution?
  • Eigenvalues & Eigenvectors — Critical for PCA (dimensionality reduction)
  • Singular Value Decomposition (SVD) — Used in recommendation systems, NLP
  • Norms (L1, L2) — Used in regularization (Ridge, Lasso regression)
  • Orthogonality — Basis of PCA and feature independence
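
Nearly all of this lives in numpy.linalg. A small sketch tying the topics together:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# Determinant: nonzero means A is invertible (Ax = b has a unique solution)
print(np.linalg.det(A))          # close to 10 (4*3 - 2*1)

# Inverse: A @ inv(A) gives the identity matrix
inv = np.linalg.inv(A)
print(A @ inv)                   # approximately [[1, 0], [0, 1]]

# Eigenvalues and eigenvectors, the machinery behind PCA
eigvals, eigvecs = np.linalg.eig(A)
v = eigvecs[:, 0]
assert np.allclose(A @ v, eigvals[0] * v)   # defining property: A v = lambda v

# Norms used in Lasso (L1) and Ridge (L2) regularization
x = np.array([3.0, -4.0])
print(np.linalg.norm(x, 1))      # 7.0 (|3| + |-4|)
print(np.linalg.norm(x, 2))      # 5.0 (sqrt(9 + 16))
```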

PHASE 5 — Calculus (Focused, not full course)

Why: Gradient Descent — the algorithm that trains most ML models — is pure calculus. You don't need deep calculus, but these specific concepts are non-negotiable.

Topics:

  • Functions & Limits — Basic understanding
  • Derivatives — Rate of change, slope of a curve
  • Chain Rule — Essential for backpropagation in neural networks
  • Partial Derivatives — When your function has multiple variables (it always does in ML)
  • Gradient — Vector of all partial derivatives; tells you which direction to move
  • Gradient Descent — How models learn by minimizing loss
  • Minima & Maxima — Finding where the function is lowest (minimizing error)
  • Integrals (light) — Area under curve; used in probability distributions
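
All of these ideas condense into a few lines. A minimal sketch of gradient descent on f(x) = (x - 3)^2, whose derivative is 2(x - 3):

```python
def f(x):
    return (x - 3) ** 2       # loss function with its minimum at x = 3

def grad(x):
    return 2 * (x - 3)        # derivative of f

x = 0.0        # starting guess
lr = 0.1       # learning rate

for _ in range(100):
    x = x - lr * grad(x)      # step in the direction that decreases f

print(x)       # converges toward 3.0, the minimum of f
```

Each step moves x opposite the gradient. With many variables, the same update applies to the whole gradient vector at once, and that is exactly how neural networks train.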

PHASE 6 — Information Theory (Before NLP / Advanced ML)

Why: Decision Trees, Random Forests, and all NLP models use these concepts directly.

Topics:

  • Entropy — Measure of uncertainty/randomness in data
  • Information Gain — How much a feature reduces uncertainty (used in Decision Trees)
  • Cross-Entropy Loss — The loss function used in classification models
  • KL Divergence — Difference between two probability distributions
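
Entropy and cross-entropy are short formulas in practice. A minimal sketch in NumPy (the probabilities are invented):

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))    # 1.0: a fair coin has maximum uncertainty
print(entropy([0.9, 0.1]))    # about 0.47: a biased coin is more predictable

# Cross-entropy loss for one classification example:
# the true class is index 0, and the model predicted these probabilities
y_pred = np.array([0.7, 0.2, 0.1])
loss = -np.log(y_pred[0])
print(loss)                   # about 0.357, smaller when the model is confident and correct
```

Decision Trees pick the split that reduces entropy the most; classifiers minimize exactly this cross-entropy during training.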

PHASE 7 — Optimization (Before Deep Learning)

Why: Training any model = solving an optimization problem.

Topics:

  • Loss Functions — MSE, MAE, Cross-Entropy — what they measure and when to use
  • Convex vs Non-Convex problems
  • Gradient Descent variants — Batch, Stochastic (SGD), Mini-Batch
  • Learning Rate — Too high vs too low
  • Momentum, Adam Optimizer — Why plain gradient descent is often not enough
  • Regularization (L1/L2) — Preventing overfitting using math
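
The common loss functions are a few lines of NumPy each. A sketch with invented predictions:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)    # squares errors, so big misses dominate
mae = np.mean(np.abs(y_true - y_pred))   # treats every unit of error equally

print("MSE:", mse)   # 0.375
print("MAE:", mae)   # 0.5

# L2 regularization adds a penalty on the weights themselves,
# discouraging large weights and hence overfitting
weights = np.array([0.5, -1.2, 3.0])
lam = 0.01
regularized_loss = mse + lam * np.sum(weights ** 2)
```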

Recommended Study Order

Phase 1 → Phase 2 → Phase 3 → Phase 4 → Phase 5 → Phase 6 → Phase 7

You don't need to fully complete one phase before starting the next. Once you're 70-80% comfortable, move forward and come back when a concept blocks you.


Practical Mapping (Math → Python Library)

  • Descriptive Stats → pandas, numpy
  • Probability & Distributions → scipy.stats
  • Hypothesis Testing → scipy.stats, statsmodels
  • Linear Algebra → numpy.linalg
  • Calculus / Optimization → Conceptual understanding, then PyTorch/TensorFlow autograd
  • Visualization of all above → matplotlib, seaborn

What You Can Skip (for now)

  • Real Analysis, Topology — pure math, not needed for applied ML
  • Full integral calculus — you need the concept, not manual computation
  • Complex Number theory — not relevant for standard ML

Honest Time Estimate

  • Phase 1-3 (Stats): 3-4 weeks at comfortable pace
  • Phase 4 (Linear Algebra): 2-3 weeks
  • Phase 5 (Calculus): 2 weeks (focused, not full course)
  • Phase 6-7: 1-2 weeks each

Total: roughly 3-4 months if studying alongside your Python/ML work. The stats and linear algebra will immediately make your pandas and numpy usage much more intuitive.


What is NumPy | Setup — Jupyter Notebook

What is Jupyter Notebook?

When you were learning Python, you wrote code in .py files and ran them in the terminal. That works for building apps like FastAPI.

But for Data Science — everyone uses Jupyter Notebook. It's a different way of writing code where:

  • Code is written in cells — small blocks
  • You run one cell at a time and see output immediately below it
  • You can mix code, output, charts, and text in one file
  • Perfect for exploring data step by step

It looks like this:

┌─────────────────────────────────┐
│ import numpy as np              │  ← code cell
└─────────────────────────────────┘
  Output: nothing

┌─────────────────────────────────┐
│ a = np.array([1, 2, 3])         │  ← code cell
│ print(a)                        │
└─────────────────────────────────┘
  Output: [1 2 3]                    ← output appears right below

┌─────────────────────────────────┐
│ # this is a chart cell          │  ← code cell
│ plt.plot(a)                     │
└─────────────────────────────────┘
  Output: 📈 chart appears here

Every data scientist in the world uses this tool. You'll love it after 10 minutes.


Setting Up

Step 1 — Create Project Folder

mkdir data-science-learning
cd data-science-learning

Step 2 — Create Virtual Environment

python -m venv venv

# Windows
venv\Scripts\activate

# Mac/Linux
source venv/bin/activate

You should see (venv) in your terminal now.

Step 3 — Install Everything

pip install numpy pandas matplotlib seaborn jupyter

This installs all five packages at once. It may take a minute or two, since they're large packages.

Step 4 — Launch Jupyter Notebook

jupyter notebook

Your browser will automatically open at http://localhost:8888 showing a file explorer interface.


Creating Your First Notebook

In the Jupyter browser interface:

  1. Click "New" button on the top right
  2. Click "Python 3 (ipykernel)"
  3. A new tab opens — this is your notebook
  4. Click on "Untitled" at the top and rename it to numpy-basics

You'll see an empty cell waiting for you. This is where you write code.


How to Use Jupyter Cells

Write code in the cell
Press Shift + Enter → runs the cell and moves to next
Press Ctrl + Enter  → runs the cell and stays
Press A             → add cell Above current
Press B             → add cell Below current
Press DD            → delete current cell
Press M             → change cell to Markdown (text)
Press Y             → change cell back to Code

These shortcuts will become muscle memory quickly.


Your First Cell — Import NumPy

In your first cell type:

import numpy as np
print("NumPy version:", np.__version__)

Press Shift + Enter. Output (your version number may differ):

NumPy version: 2.1.0

import numpy as np — you import NumPy and give it the alias np. This is a universal convention — every data scientist in the world writes np. Never write import numpy without the alias.


What is NumPy and Why Does It Exist?

The Problem with Python Lists

You already know Python lists. They work but they're slow for math:


    # Python list — slow way
    numbers = [1, 2, 3, 4, 5]

    # Multiply every number by 2
    doubled = []
    for n in numbers:
        doubled.append(n * 2)

    print(doubled)    # [2, 4, 6, 8, 10]

This works, but imagine doing it on 10 million numbers. A Python loop over a list is very slow.

NumPy Solution — Arrays


    import numpy as np

    numbers = np.array([1, 2, 3, 4, 5])
    doubled = numbers * 2

    print(doubled)    # [2 4 6 8 10]

No loop needed. NumPy does the operation on all elements at once — and it's 50-100x faster than a Python loop because NumPy is written in C under the hood.

This is called vectorization — applying an operation to an entire array at once instead of looping.
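
You don't have to take the speed claim on faith. You can time the difference yourself with Python's built-in timeit (the exact speedup depends on your machine):

```python
import timeit
import numpy as np

numbers = list(range(1_000_000))
arr = np.arange(1_000_000)

def double_with_loop():
    doubled = []
    for n in numbers:
        doubled.append(n * 2)
    return doubled

# Run each version 10 times and total the elapsed time
loop_time = timeit.timeit(double_with_loop, number=10)
numpy_time = timeit.timeit(lambda: arr * 2, number=10)

print(f"Python loop: {loop_time:.3f}s")
print(f"NumPy:       {numpy_time:.3f}s")
print(f"Speedup:     {loop_time / numpy_time:.0f}x")
```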


Step 2: Creating NumPy Arrays

In a new cell:


    import numpy as np

    # From a Python list
    arr1 = np.array([1, 2, 3, 4, 5])
    print(arr1)           # [1 2 3 4 5]
    print(type(arr1))     # <class 'numpy.ndarray'>

Notice the output — NumPy arrays print without commas between elements, unlike Python lists. ndarray = n-dimensional array.


Array Data Types

Every NumPy array has a single data type — all elements must be the same type:


    # Integer array
    int_arr = np.array([1, 2, 3, 4, 5])
    print(int_arr.dtype)     # int64

    # Float array
    float_arr = np.array([1.5, 2.5, 3.5])
    print(float_arr.dtype)   # float64

    # String array
    str_arr = np.array(["apple", "banana", "mango"])
    print(str_arr.dtype)     # <U6  (unicode string)

    # Mixed — NumPy converts everything to same type
    mixed = np.array([1, 2.5, 3])
    print(mixed)             # [1.  2.5 3. ]  — all converted to float
    print(mixed.dtype)       # float64

You can specify type explicitly:


    arr = np.array([1, 2, 3], dtype=float)
    print(arr)        # [1. 2. 3.]

    arr = np.array([1.9, 2.8, 3.7], dtype=int)
    print(arr)        # [1 2 3]  — decimal part cut off


Array Properties


    arr = np.array([10, 20, 30, 40, 50])

    print(arr.shape)      # (5,)  — 5 elements, 1 dimension
    print(arr.ndim)       # 1     — number of dimensions
    print(arr.size)       # 5     — total number of elements
    print(arr.dtype)      # int64 — data type


Ways to Create Arrays — Very Important

You'll use these constantly:

np.zeros() — array filled with zeros


    zeros = np.zeros(5)
    print(zeros)       # [0. 0. 0. 0. 0.]

    zeros_int = np.zeros(5, dtype=int)
    print(zeros_int)   # [0 0 0 0 0]

np.ones() — array filled with ones


    ones = np.ones(5)
    print(ones)        # [1. 1. 1. 1. 1.]

np.arange() — like Python range() but returns array


    arr = np.arange(10)
    print(arr)         # [0 1 2 3 4 5 6 7 8 9]

    arr = np.arange(1, 11)
    print(arr)         # [ 1  2  3  4  5  6  7  8  9 10]

    arr = np.arange(0, 20, 2)
    print(arr)         # [ 0  2  4  6  8 10 12 14 16 18]

    arr = np.arange(10, 0, -1)
    print(arr)         # [10  9  8  7  6  5  4  3  2  1]

np.linspace() — evenly spaced numbers between two values


    arr = np.linspace(0, 1, 5)
    print(arr)         # [0.   0.25 0.5  0.75 1.  ]

    arr = np.linspace(0, 100, 11)
    print(arr)         # [  0.  10.  20.  30.  40.  50.  60.  70.  80.  90. 100.]

linspace(start, stop, num) — gives you num evenly spaced points from start to stop. Unlike arange — the stop value IS included.

np.full() — array filled with a specific value


    arr = np.full(5, 7)
    print(arr)         # [7 7 7 7 7]

    arr = np.full(4, 3.14)
    print(arr)         # [3.14 3.14 3.14 3.14]


Step 3: 2D Arrays — The Real Power

A 2D array is like a table with rows and columns. This is how real data looks — datasets, images, matrices.

    # Create 2D array from list of lists
    matrix = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])

    print(matrix)
    print()
    print("Shape:", matrix.shape)    # (3, 3) — 3 rows, 3 columns
    print("Dimensions:", matrix.ndim) # 2
    print("Total elements:", matrix.size) # 9

Output:

[[1 2 3]
 [4 5 6]
 [7 8 9]]

Shape: (3, 3)
Dimensions: 2
Total elements: 9

Creating 2D Arrays


    # 3x4 array of zeros (3 rows, 4 columns)
    zeros_2d = np.zeros((3, 4))
    print(zeros_2d)

Output:

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

    # 2x3 array of ones
    ones_2d = np.ones((2, 3), dtype=int)
    print(ones_2d)

Output:

[[1 1 1]
 [1 1 1]]

    # Identity matrix — 1s on diagonal
    identity = np.eye(4)
    print(identity)

Output:

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

Step 4: Indexing and Slicing

1D Array Indexing


    arr = np.array([10, 20, 30, 40, 50])
    #               0    1   2   3   4

    print(arr[0])     # 10  — first element
    print(arr[2])     # 30
    print(arr[-1])    # 50  — last element
    print(arr[-2])    # 40  — second from last

1D Array Slicing


    arr = np.array([10, 20, 30, 40, 50, 60, 70])

    print(arr[1:4])     # [20 30 40]  — index 1 to 3
    print(arr[:3])      # [10 20 30]  — first 3
    print(arr[3:])      # [40 50 60 70] — from index 3 to end
    print(arr[::2])     # [10 30 50 70] — every 2nd element
    print(arr[::-1])    # [70 60 50 40 30 20 10] — reversed

2D Array Indexing


    matrix = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])

    # Access single element — [row, column]
    print(matrix[0, 0])    # 1  — row 0, col 0
    print(matrix[1, 2])    # 6  — row 1, col 2
    print(matrix[2, 1])    # 8  — row 2, col 1
    print(matrix[-1, -1])  # 9  — last row, last column

2D Array Slicing


    matrix = np.array([
        [1,  2,  3,  4],
        [5,  6,  7,  8],
        [9, 10, 11, 12]
    ])

    # Get entire row
    print(matrix[0])        # [1 2 3 4]  — first row
    print(matrix[1, :])     # [5 6 7 8]  — second row (explicit)

    # Get entire column
    print(matrix[:, 0])     # [1 5 9]   — first column
    print(matrix[:, 2])     # [3 7 11]  — third column

    # Get submatrix
    print(matrix[0:2, 1:3]) # rows 0-1, cols 1-2

Output of last line:

[[2 3]
 [6 7]]

Step 5: Array Operations

This is where NumPy really shines. Operations apply to every element automatically.

Basic Math


    arr = np.array([1, 2, 3, 4, 5])

    print(arr + 10)     # [11 12 13 14 15]
    print(arr - 3)      # [-2 -1  0  1  2]
    print(arr * 2)      # [ 2  4  6  8 10]
    print(arr / 2)      # [0.5 1.  1.5 2.  2.5]
    print(arr ** 2)     # [ 1  4  9 16 25]
    print(arr % 2)      # [1 0 1 0 1]  — remainder

Operations Between Two Arrays


    a = np.array([1, 2, 3, 4, 5])
    b = np.array([10, 20, 30, 40, 50])

    print(a + b)     # [11 22 33 44 55]
    print(a * b)     # [ 10  40  90 160 250]
    print(b / a)     # [10. 10. 10. 10. 10.]
    print(b - a)     # [ 9 18 27 36 45]

Element-wise — first element with first, second with second, and so on.

Math Functions


    arr = np.array([1, 4, 9, 16, 25])

    print(np.sqrt(arr))    # [1. 2. 3. 4. 5.]  — square root
    print(np.log(arr))     # natural log of each element
    print(np.exp(arr))     # e^x for each element
    print(np.abs(np.array([-3, -1, 0, 2, 4])))  # [3 1 0 2 4]


Step 6: Statistical Functions

These are used constantly in data analysis:


    data = np.array([23, 45, 12, 67, 34, 89, 56, 78, 43, 21])

    print("Sum:", np.sum(data))           # 468
    print("Mean:", np.mean(data))         # 46.8
    print("Median:", np.median(data))     # 44.0
    print("Std Dev:", np.std(data))       # 24.23...
    print("Variance:", np.var(data))      # 587.16
    print("Min:", np.min(data))           # 12
    print("Max:", np.max(data))           # 89
    print("Min index:", np.argmin(data))  # 2  — index of minimum value
    print("Max index:", np.argmax(data))  # 5  — index of maximum value
    print("Range:", np.max(data) - np.min(data))  # 77

On 2D Arrays — axis parameter


    scores = np.array([
        [85, 90, 78],    # student 1 — 3 subjects
        [92, 88, 95],    # student 2
        [76, 82, 79]     # student 3
    ])

    print(np.mean(scores))           # 85.0 — mean of all values

    print(np.mean(scores, axis=1))   # [84.33 91.67 79.0] — mean per row (per student)
    print(np.mean(scores, axis=0))   # [84.33 86.67 84.0] — mean per column (per subject)

    print(np.sum(scores, axis=1))    # [253 275 237] — total per student
    print(np.max(scores, axis=0))    # [92 90 95] — best score per subject

axis=0 runs the operation down each column (collapsing the rows), giving a column-wise result.
axis=1 runs the operation across each row (collapsing the columns), giving a row-wise result.

This confuses everyone at first. Just remember:

  • axis=1 → result has one value per row
  • axis=0 → result has one value per column

Step 7: Boolean Indexing — Very Powerful

This is one of the most useful NumPy features for data filtering:


    marks = np.array([85, 42, 90, 38, 75, 55, 29, 91, 66, 48])

    # Create a boolean mask
    passing = marks >= 50
    print(passing)
    # [ True False  True False  True  True False  True  True False]

    # Use mask to filter
    print(marks[passing])
    # [85 90 75 55 91 66]

    # One liner
    print(marks[marks >= 50])
    # [85 90 75 55 91 66]

    # Multiple conditions
    print(marks[(marks >= 50) & (marks < 80)])
    # [75 55 66]  — between 50 and 80

    # How many students passed?
    print(np.sum(marks >= 50))     # 6

    # What percentage passed?
    print(np.mean(marks >= 50) * 100)  # 60.0%


Step 8: Random Numbers

Used constantly in ML for creating test data, initializing weights, and more:

# Set seed for reproducibility — same "random" numbers every run
np.random.seed(42)

# Random floats between 0 and 1
print(np.random.random(5))
# [0.374 0.951 0.732 0.599 0.156]

# Random integers
print(np.random.randint(1, 100, size=5))
# [52 93 15 72 61]

# Random 2D array
print(np.random.randint(0, 10, size=(3, 4)))
# [[6 3 7 4]
#  [6 9 2 6]
#  [7 4 3 7]]

# Random from normal distribution (bell curve)
# mean=0, std=1
normal = np.random.normal(0, 1, 1000)
print("Mean:", np.mean(normal).round(2))    # ~0.0
print("Std:", np.std(normal).round(2))      # ~1.0

# Random choice from array
options = np.array(["rock", "paper", "scissors"])
print(np.random.choice(options, size=5))
# ['paper' 'rock' 'scissors' 'rock' 'paper']

np.random.seed(42) — setting a seed means your "random" numbers are always the same. Crucial in ML so your experiments are reproducible.


Step 9: Reshaping Arrays


    arr = np.arange(12)
    print(arr)          # [ 0  1  2  3  4  5  6  7  8  9 10 11]
    print(arr.shape)    # (12,)

    # Reshape to 3x4 matrix
    matrix = arr.reshape(3, 4)
    print(matrix)
    print(matrix.shape)    # (3, 4)

Output:

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
(3, 4)

    # -1 means "figure it out automatically"
    arr = np.arange(12)
    print(arr.reshape(3, -1))    # 3 rows, NumPy calculates 4 columns
    print(arr.reshape(-1, 6))    # NumPy calculates 2 rows, 6 columns

    # Flatten 2D back to 1D
    matrix = np.array([[1,2,3],[4,5,6]])
    print(matrix.flatten())    # [1 2 3 4 5 6]


Real World Example — Student Grade Analysis

Let's put everything together in one practical example:

import numpy as np

# Exam scores for 5 students across 4 subjects
# Rows = students, Columns = subjects (Math, Science, English, History)
np.random.seed(42)
scores = np.random.randint(40, 100, size=(5, 4))

students = ["Rahul", "Priya", "Gagan", "Amit", "Neha"]
subjects = ["Math", "Science", "English", "History"]

print("=== Raw Scores ===")
print(scores)
print()

# Average per student (axis=1 = across columns)
student_avg = np.mean(scores, axis=1)
print("=== Student Averages ===")
for i, name in enumerate(students):
    print(f"{name}: {student_avg[i]:.1f}")

print()

# Average per subject (axis=0 = across rows)
subject_avg = np.mean(scores, axis=0)
print("=== Subject Averages ===")
for i, subject in enumerate(subjects):
    print(f"{subject}: {subject_avg[i]:.1f}")

print()

# Best and worst students
best_idx = np.argmax(student_avg)
worst_idx = np.argmin(student_avg)
print(f"Best student: {students[best_idx]} ({student_avg[best_idx]:.1f})")
print(f"Needs help: {students[worst_idx]} ({student_avg[worst_idx]:.1f})")

print()

# How many students passed each subject (>=50)
passed_per_subject = np.sum(scores >= 50, axis=0)
print("=== Pass Count Per Subject ===")
for i, subject in enumerate(subjects):
    print(f"{subject}: {passed_per_subject[i]}/5 passed")

print()

# Grade each student
print("=== Grades ===")
for i, name in enumerate(students):
    avg = student_avg[i]
    if avg >= 85:
        grade = "A"
    elif avg >= 70:
        grade = "B"
    elif avg >= 55:
        grade = "C"
    else:
        grade = "F"
    print(f"{name}: {avg:.1f} → Grade {grade}")

Output:

=== Raw Scores ===
[[71 60 57 85]
 [74 77 55 74]
 [49 78 95 80]
 [54 68 95 65]
 [65 71 47 93]]

=== Student Averages ===
Rahul: 68.2
Priya: 70.0
Gagan: 75.5
Amit: 70.5
Neha: 69.0

=== Subject Averages ===
Math: 62.6
Science: 70.8
English: 69.8
History: 79.4

Best student: Gagan (75.5)
Needs help: Rahul (68.2)

=== Pass Count Per Subject ===
Math: 4/5 passed
Science: 5/5 passed
English: 4/5 passed
History: 5/5 passed

=== Grades ===
Rahul: 68.2 → Grade C
Priya: 70.0 → Grade B
Gagan: 75.5 → Grade B
Amit: 70.5 → Grade B
Neha: 69.0 → Grade C

This is actual data analysis — loading data, computing statistics, finding insights. You just did your first data analysis with NumPy.


Quick Reference — NumPy Cheat Sheet


    import numpy as np

    # Creating arrays
    np.array([1,2,3])              # from list
    np.zeros(5)                    # [0. 0. 0. 0. 0.]
    np.ones((3,4))                 # 3x4 matrix of ones
    np.arange(0, 10, 2)            # [0 2 4 6 8]
    np.linspace(0, 1, 5)           # 5 evenly spaced points
    np.random.randint(0, 100, 10)  # 10 random integers
    np.random.seed(42)             # reproducibility

    # Array info
    arr.shape      # dimensions
    arr.ndim       # number of dimensions
    arr.size       # total elements
    arr.dtype      # data type

    # Indexing
    arr[0]          # first element
    arr[-1]         # last element
    arr[1:4]        # slice
    arr[::2]        # every 2nd
    matrix[1, 2]    # row 1, col 2
    matrix[:, 0]    # entire first column
    matrix[0, :]    # entire first row

    # Operations
    arr + 5         # add 5 to all
    arr * 2         # multiply all by 2
    arr ** 2        # square all
    np.sqrt(arr)    # square root

    # Statistics
    np.sum(arr)
    np.mean(arr)
    np.median(arr)
    np.std(arr)
    np.min(arr)
    np.max(arr)
    np.argmin(arr)  # index of min
    np.argmax(arr)  # index of max

    # 2D statistics
    np.mean(matrix, axis=0)   # column means
    np.mean(matrix, axis=1)   # row means

    # Filtering
    arr[arr > 50]              # values greater than 50
    arr[(arr > 20) & (arr < 80)]  # between 20 and 80

    # Reshape
    arr.reshape(3, 4)    # reshape to 3x4
    arr.flatten()        # back to 1D


Exercise 🏋️

Create a new Jupyter notebook cell and solve this:

Sales Analysis:

# Monthly sales data for 4 products over 6 months
# Each row = one product, each column = one month
sales = np.array([
    [1200, 1500, 1100, 1800, 2000, 1600],  # Product A
    [800,  950,  870,  1100, 1250, 900],   # Product B
    [2100, 1900, 2300, 2100, 1800, 2500],  # Product C
    [500,  600,  550,  700,  650,  720]    # Product D
])

products = ["Product A", "Product B", "Product C", "Product D"]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

Find and print:

  1. Total sales per product over 6 months
  2. Average monthly sales per product
  3. Best performing product (highest total)
  4. Worst performing product (lowest total)
  5. Best sales month overall (sum across all products)
  6. Which product had sales above 1000 in every month
  7. Total company revenue per month


NumPy — Advanced Concepts

What We're Covering Today:

  • Copy vs View — one of the most important NumPy concepts
  • Fancy Indexing
  • np.where — conditional operations
  • ...