What is Jupyter Notebook?
When you were learning Python, you wrote code in .py files and ran them in the terminal. That works well for building apps with frameworks like FastAPI.
For data science, though, Jupyter Notebook is the standard tool. It's a different way of writing code where:
- Code is written in cells — small blocks
- You run one cell at a time and see output immediately below it
- You can mix code, output, charts, and text in one file
- Perfect for exploring data step by step
It looks like this:
┌─────────────────────────────────┐
│ import numpy as np │ ← code cell
└─────────────────────────────────┘
Output: nothing
┌─────────────────────────────────┐
│ a = np.array([1, 2, 3]) │ ← code cell
│ print(a) │
└─────────────────────────────────┘
Output: [1 2 3] ← output appears right below
┌─────────────────────────────────┐
│ # this is a chart cell │ ← code cell
│ import matplotlib.pyplot as plt │
│ plt.plot(a) │
└─────────────────────────────────┘
Output: 📈 chart appears here
Jupyter is an everyday tool for most data scientists. You'll probably feel at home in it after 10 minutes.
Setting Up
Step 1 — Create Project Folder
mkdir data-science-learning
cd data-science-learning
Step 2 — Create Virtual Environment
python -m venv venv
# Windows
venv\Scripts\activate
# Mac/Linux
source venv/bin/activate
You should see (venv) in your terminal now.
Step 3 — Install Everything
pip install numpy pandas matplotlib seaborn jupyter
This installs the four libraries plus Jupyter itself in one command. It will take a minute because they're large packages.
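If you want to confirm the install worked before launching Jupyter, a quick check like the one below can help. It is only a minimal sketch: the helper name check_packages is made up for this example, and it merely asks Python whether each library can be found in the active environment. Run it from a Python prompt while (venv) is active.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The four libraries installed in Step 3
status = check_packages(["numpy", "pandas", "matplotlib", "seaborn"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'MISSING'}")
```

Anything reported as MISSING usually means the virtual environment was not active when you ran pip install.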
Step 4 — Launch Jupyter Notebook
jupyter notebook
Your browser should open automatically at http://localhost:8888 (or the next free port if 8888 is already in use), showing a file browser interface.
Creating Your First Notebook
In the Jupyter browser interface:
- Click the "New" button at the top right
- Click "Python 3 (ipykernel)"
- A new tab opens — this is your notebook
- Click on "Untitled" at the top and rename it to numpy-basics
You'll see an empty cell waiting for you. This is where you write code.
How to Use Jupyter Cells
Write code in the cell
Press Shift + Enter → runs the cell and moves to next
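Press Ctrl + Enter → runs the cell and stays
The shortcuts below work in command mode: press Esc first, so you are no longer typing inside the cell.
Press A → add cell Above current
Press B → add cell Below current
Press D twice (DD) → delete current cell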
Press M → change cell to Markdown (text)
Press Y → change cell back to Code
These shortcuts will become muscle memory quickly.
Your First Cell — Import NumPy
In your first cell type:
import numpy as np
print("NumPy version:", np.__version__)
Press Shift + Enter. Output (your version number may differ):
NumPy version: 2.1.0
import numpy as np: you import NumPy and give it the alias np. This is a near-universal convention, so almost all NumPy code you read will use np. Stick to it rather than writing import numpy without the alias.
What is NumPy and Why Does It Exist?
The Problem with Python Lists
You already know Python lists. They work but they're slow for math:
# Python list — slow way
numbers = [1, 2, 3, 4, 5]
# Multiply every number by 2
doubled = []
for n in numbers:
doubled.append(n * 2)
print(doubled) # [2, 4, 6, 8, 10]
This works, but imagine doing it on 10 million numbers. A Python loop over a list is very slow at that scale.
NumPy Solution — Arrays
import numpy as np
numbers = np.array([1, 2, 3, 4, 5])
doubled = numbers * 2
print(doubled) # [2 4 6 8 10]
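No loop needed. NumPy applies the operation to all elements at once, and on large arrays it is often 10-100x faster than a Python loop because the work runs in compiled C code under the hood. The exact speedup depends on the operation, the array size, and your machine.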
This is called vectorization — applying an operation to an entire array at once instead of looping.
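You can measure the difference yourself. The sketch below times both approaches on one million numbers; the size n is just an assumed value for illustration, and the timings you see will vary from machine to machine, so no expected numbers are shown.

```python
import time

import numpy as np

n = 1_000_000  # assumed size; increase it to make the gap more visible
numbers_list = list(range(n))
numbers_arr = np.arange(n)

# Python loop over a list
start = time.perf_counter()
doubled_list = []
for x in numbers_list:
    doubled_list.append(x * 2)
loop_seconds = time.perf_counter() - start

# Vectorized NumPy operation
start = time.perf_counter()
doubled_arr = numbers_arr * 2
numpy_seconds = time.perf_counter() - start

print(f"Loop:  {loop_seconds:.4f} s")
print(f"NumPy: {numpy_seconds:.4f} s")
print("Same result:", doubled_list == doubled_arr.tolist())
```

Both versions produce the same values; only the time taken differs.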
Step 2: Creating NumPy Arrays
In a new cell:
import numpy as np
# From a Python list
arr1 = np.array([1, 2, 3, 4, 5])
print(arr1) # [1 2 3 4 5]
print(type(arr1)) # <class 'numpy.ndarray'>
Notice the output — NumPy arrays print without commas between elements, unlike Python lists. ndarray = n-dimensional array.
Array Data Types
Every NumPy array has a single data type — all elements must be the same type:
# Integer array
int_arr = np.array([1, 2, 3, 4, 5])
print(int_arr.dtype) # int64 (int32 on Windows with NumPy 1.x)
# Float array
float_arr = np.array([1.5, 2.5, 3.5])
print(float_arr.dtype) # float64
# String array
str_arr = np.array(["apple", "banana", "mango"])
print(str_arr.dtype) # <U6 (Unicode strings of up to 6 characters)
# Mixed — NumPy converts everything to same type
mixed = np.array([1, 2.5, 3])
print(mixed) # [1. 2.5 3. ] — all converted to float
print(mixed.dtype) # float64
You can specify type explicitly:
arr = np.array([1, 2, 3], dtype=float)
print(arr) # [1. 2. 3.]
arr = np.array([1.9, 2.8, 3.7], dtype=int)
print(arr) # [1 2 3] — decimal part cut off
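Converting floats to integers cuts off the decimal part instead of rounding, which matters most with negative numbers. Here is a small sketch of the difference using astype, the method for converting an existing array to another type. The array name prices is just an example.

```python
import numpy as np

prices = np.array([1.9, 2.8, 3.7, -1.9])

truncated = prices.astype(int)          # drops the decimal part (toward zero)
rounded = np.round(prices).astype(int)  # rounds to nearest first, then converts

print(truncated)  # [ 1  2  3 -1]
print(rounded)    # [ 2  3  4 -2]
print(prices.dtype, truncated.dtype)    # original array keeps its float type
```

astype returns a new array, so prices itself is unchanged.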
Array Properties
arr = np.array([10, 20, 30, 40, 50])
print(arr.shape) # (5,) — 5 elements, 1 dimension
print(arr.ndim) # 1 — number of dimensions
print(arr.size) # 5 — total number of elements
print(arr.dtype) # int64 — data type
Ways to Create Arrays — Very Important
You'll use these constantly:
np.zeros() — array filled with zeros
zeros = np.zeros(5)
print(zeros) # [0. 0. 0. 0. 0.]
zeros_int = np.zeros(5, dtype=int)
print(zeros_int) # [0 0 0 0 0]
np.ones() — array filled with ones
ones = np.ones(5)
print(ones) # [1. 1. 1. 1. 1.]
np.arange() — like Python range() but returns array
arr = np.arange(10)
print(arr) # [0 1 2 3 4 5 6 7 8 9]
arr = np.arange(1, 11)
print(arr) # [ 1 2 3 4 5 6 7 8 9 10]
arr = np.arange(0, 20, 2)
print(arr) # [ 0 2 4 6 8 10 12 14 16 18]
arr = np.arange(10, 0, -1)
print(arr) # [10 9 8 7 6 5 4 3 2 1]
np.linspace() — evenly spaced numbers between two values
arr = np.linspace(0, 1, 5)
print(arr) # [0. 0.25 0.5 0.75 1. ]
arr = np.linspace(0, 100, 11)
print(arr) # [ 0. 10. 20. 30. 40. 50. 60. 70. 80. 90. 100.]
linspace(start, stop, num) — gives you num evenly spaced points from start to stop. Unlike arange — the stop value IS included.
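To see the two side by side, here is a short sketch over the same range. With arange you pick the step, with linspace you pick the number of points. The retstep=True option asks linspace to also return the step it worked out.

```python
import numpy as np

# arange: you choose the step; stop is excluded
a = np.arange(0, 1, 0.25)
print(a)       # [0.   0.25 0.5  0.75]

# linspace: you choose how many points; stop is included
b = np.linspace(0, 1, 5)
print(b)       # [0.   0.25 0.5  0.75 1.  ]

# The step linspace used can be returned with retstep=True
c, step = np.linspace(0, 1, 5, retstep=True)
print(step)    # 0.25
```

A reasonable rule of thumb: use arange for integer steps and linspace when you need a fixed number of decimal points.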
np.full() — array filled with a specific value
arr = np.full(5, 7)
print(arr) # [7 7 7 7 7]
arr = np.full(4, 3.14)
print(arr) # [3.14 3.14 3.14 3.14]
Step 3: 2D Arrays — The Real Power
A 2D array is like a table with rows and columns. This is how real data looks — datasets, images, matrices.
# Create 2D array from list of lists
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
print(matrix)
print()
print("Shape:", matrix.shape) # (3, 3) — 3 rows, 3 columns
print("Dimensions:", matrix.ndim) # 2
print("Total elements:", matrix.size) # 9
Output:
[[1 2 3]
[4 5 6]
[7 8 9]]
Shape: (3, 3)
Dimensions: 2
Total elements: 9
Creating 2D Arrays
# 3x4 array of zeros (3 rows, 4 columns)
zeros_2d = np.zeros((3, 4))
print(zeros_2d)
Output:
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
# 2x3 array of ones
ones_2d = np.ones((2, 3), dtype=int)
print(ones_2d)
Output:
[[1 1 1]
[1 1 1]]
# Identity matrix — 1s on diagonal
identity = np.eye(4)
print(identity)
Output:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
Step 4: Indexing and Slicing
1D Array Indexing
arr = np.array([10, 20, 30, 40, 50])
# 0 1 2 3 4
print(arr[0]) # 10 — first element
print(arr[2]) # 30
print(arr[-1]) # 50 — last element
print(arr[-2]) # 40 — second from last
1D Array Slicing
arr = np.array([10, 20, 30, 40, 50, 60, 70])
print(arr[1:4]) # [20 30 40] — index 1 to 3
print(arr[:3]) # [10 20 30] — first 3
print(arr[3:]) # [40 50 60 70] — from index 3 to end
print(arr[::2]) # [10 30 50 70] — every 2nd element
print(arr[::-1]) # [70 60 50 40 30 20 10] — reversed
2D Array Indexing
matrix = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
])
# Access single element — [row, column]
print(matrix[0, 0]) # 1 — row 0, col 0
print(matrix[1, 2]) # 6 — row 1, col 2
print(matrix[2, 1]) # 8 — row 2, col 1
print(matrix[-1, -1]) # 9 — last row, last column
2D Array Slicing
matrix = np.array([
[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]
])
# Get entire row
print(matrix[0]) # [1 2 3 4] — first row
print(matrix[1, :]) # [5 6 7 8] — second row (explicit)
# Get entire column
print(matrix[:, 0]) # [1 5 9] — first column
print(matrix[:, 2]) # [3 7 11] — third column
# Get submatrix
print(matrix[0:2, 1:3]) # rows 0-1, cols 1-2
Output of last line:
[[2 3]
[6 7]]
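One thing to watch out for: a slice of a NumPy array is a view, not a copy. It shares memory with the original, so changing the slice changes the original too. The sketch below shows this with the same matrix, and uses .copy() when you want an independent array. The names sub and safe are only for illustration.

```python
import numpy as np

matrix = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])

sub = matrix[0:2, 1:3]   # basic slice: a view onto matrix
sub[0, 0] = 99           # writes through to the original
print(matrix[0, 1])      # 99

safe = matrix[0:2, 1:3].copy()  # independent copy
safe[0, 0] = -1
print(matrix[0, 1])      # still 99
```

This is different from Python lists, where slicing always gives you a new list.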
Step 5: Array Operations
This is where NumPy really shines. Operations apply to every element automatically.
Basic Math
arr = np.array([1, 2, 3, 4, 5])
print(arr + 10) # [11 12 13 14 15]
print(arr - 3) # [-2 -1 0 1 2]
print(arr * 2) # [ 2 4 6 8 10]
print(arr / 2) # [0.5 1. 1.5 2. 2.5]
print(arr ** 2) # [ 1 4 9 16 25]
print(arr % 2) # [1 0 1 0 1] — remainder
Operations Between Two Arrays
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
print(a + b) # [11 22 33 44 55]
print(a * b) # [ 10 40 90 160 250]
print(b / a) # [10. 10. 10. 10. 10.]
print(b - a) # [ 9 18 27 36 45]
Element-wise — first element with first, second with second, and so on.
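This is a real difference from Python lists, where + joins two lists together instead of adding their elements. A short sketch of the contrast, including what happens when the array lengths don't match:

```python
import numpy as np

list_a = [1, 2, 3]
list_b = [10, 20, 30]
print(list_a + list_b)        # [1, 2, 3, 10, 20, 30]  (concatenation)

arr_a = np.array(list_a)
arr_b = np.array(list_b)
print(arr_a + arr_b)          # [11 22 33]  (element-wise sum)

# Arrays of incompatible lengths cannot be combined element-wise
try:
    arr_a + np.array([1, 2])
except ValueError as err:
    print("ValueError:", err)
```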
Math Functions
arr = np.array([1, 4, 9, 16, 25])
print(np.sqrt(arr)) # [1. 2. 3. 4. 5.] — square root
print(np.log(arr)) # natural log of each element
print(np.exp(arr)) # e^x for each element
print(np.abs(np.array([-3, -1, 0, 2, 4]))) # [3 1 0 2 4]
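The log and exp lines above don't show their output, so here is a small sketch that makes the numbers easier to read. It rounds the logs for display and checks that exp reverses log. np.log10 and np.log2 are the base-10 and base-2 versions.

```python
import numpy as np

arr = np.array([1, 4, 9, 16, 25])

logs = np.log(arr)                 # natural log, base e
print(np.round(logs, 3))           # [0.    1.386 2.197 2.773 3.219]

# exp undoes log (up to floating-point rounding)
print(np.allclose(np.exp(logs), arr))    # True

# Other bases have their own functions
print(np.log10(np.array([1, 10, 100])))  # [0. 1. 2.]
print(np.log2(np.array([1, 2, 8])))      # [0. 1. 3.]
```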
Step 6: Statistical Functions
These are used constantly in data analysis:
data = np.array([23, 45, 12, 67, 34, 89, 56, 78, 43, 21])
print("Sum:", np.sum(data)) # 468
print("Mean:", np.mean(data)) # 46.8
print("Median:", np.median(data)) # 44.0
print("Std Dev:", np.std(data)) # 24.23...
print("Variance:", np.var(data)) # 587.16
print("Min:", np.min(data)) # 12
print("Max:", np.max(data)) # 89
print("Min index:", np.argmin(data)) # 2 — index of minimum value
print("Max index:", np.argmax(data)) # 5 — index of maximum value
print("Range:", np.max(data) - np.min(data)) # 77
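One detail worth knowing: by default np.std and np.var divide by n (the population formulas). Some tools, pandas for instance, divide by n - 1 by default (the sample formulas), so their numbers come out slightly larger. Passing ddof=1 to NumPy gives you the sample version. The sketch below uses a small made-up array called values so the arithmetic is easy to check by hand.

```python
import numpy as np

# Made-up example data: mean is 5, squared deviations sum to 32
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

pop_var = np.var(values)             # 32 / 8
sample_var = np.var(values, ddof=1)  # 32 / 7

print(f"Population variance: {pop_var:.2f}")                  # 4.00
print(f"Sample variance:     {sample_var:.2f}")               # 4.57
print(f"Population std:      {np.std(values):.2f}")           # 2.00
print(f"Sample std:          {np.std(values, ddof=1):.2f}")   # 2.14
```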
On 2D Arrays — axis parameter
scores = np.array([
[85, 90, 78], # student 1 — 3 subjects
[92, 88, 95], # student 2
[76, 82, 79] # student 3
])
print(np.mean(scores)) # 85.0 — mean of all values
print(np.mean(scores, axis=1)) # [84.33 91.67 79.0] — mean per row (per student)
print(np.mean(scores, axis=0)) # [84.33 86.67 84.0] — mean per column (per subject)
print(np.sum(scores, axis=1)) # [253 275 237] — total per student
print(np.max(scores, axis=0)) # [92 90 95] — best score per subject
axis=0 means collapse the rows: NumPy moves down each column (column-wise result)
axis=1 means collapse the columns: NumPy moves across each row (row-wise result)
This confuses everyone at first. Just remember:
axis=1 → result has one value per row
axis=0 → result has one value per column
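A non-square array makes this easier to see, because the shape of the result tells you which axis was collapsed. A minimal sketch with 2 students and 3 subjects (the variable names per_subject and per_student are just for this example):

```python
import numpy as np

# 2 rows (students) x 3 columns (subjects)
scores = np.array([
    [85, 90, 78],
    [92, 88, 95]
])

per_subject = scores.sum(axis=0)  # rows collapse -> one value per column
per_student = scores.sum(axis=1)  # columns collapse -> one value per row

print(scores.shape)        # (2, 3)
print(per_subject.shape)   # (3,)
print(per_student.shape)   # (2,)
print(per_subject)         # [177 178 173]
print(per_student)         # [253 275]
```

The axis you pass is the one that disappears from the shape.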
Step 7: Boolean Indexing — Very Powerful
This is one of the most useful NumPy features for data filtering:
marks = np.array([85, 42, 90, 38, 75, 55, 29, 91, 66, 48])
# Create a boolean mask
passing = marks >= 50
print(passing)
# [ True False True False True True False True True False]
# Use mask to filter
print(marks[passing])
# [85 90 75 55 91 66]
# One liner
print(marks[marks >= 50])
# [85 90 75 55 91 66]
# Multiple conditions
print(marks[(marks >= 50) & (marks < 80)])
# [75 55 66] — between 50 and 80
# How many students passed?
print(np.sum(marks >= 50)) # 6
# What percentage passed?
print(np.mean(marks >= 50) * 100) # 60.0 (percent)
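Besides & (and), masks can be combined with | (or) and flipped with ~ (not). The sketch below also shows np.where, which can give you the positions of matching elements or pick between two values. It reuses the same marks array.

```python
import numpy as np

marks = np.array([85, 42, 90, 38, 75, 55, 29, 91, 66, 48])

# OR: very low or very high marks
extremes = marks[(marks < 40) | (marks >= 90)]
print(extremes)                      # [90 38 29 91]

# NOT: invert a mask
passing = marks >= 50
failed = marks[~passing]
print(failed)                        # [42 38 29 48]

# np.where with one argument gives positions instead of values
positions = np.where(passing)[0]
print(positions)                     # [0 2 4 5 7 8]

# np.where with three arguments picks between two values
labels = np.where(passing, "pass", "fail")
print(labels[:3])                    # ['pass' 'fail' 'pass']
```

Remember the parentheses around each condition, as in the example above. Without them Python evaluates & and | before the comparisons and you get an error.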
Step 8: Random Numbers
Used constantly in ML for creating test data, initializing weights, etc:
# Set seed for reproducibility — same "random" numbers every run
np.random.seed(42)
# Random floats between 0 and 1
print(np.random.random(5))
# [0.374 0.951 0.732 0.599 0.156]
# Random integers
print(np.random.randint(1, 100, size=5))
# [52 93 15 72 61]
# Random 2D array
print(np.random.randint(0, 10, size=(3, 4)))
# [[6 3 7 4]
# [6 9 2 6]
# [7 4 3 7]]
# Random from normal distribution (bell curve)
# mean=0, std=1
normal = np.random.normal(0, 1, 1000)
print("Mean:", np.mean(normal).round(2)) # ~0.0
print("Std:", np.std(normal).round(2)) # ~1.0
# Random choice from array
options = np.array(["rock", "paper", "scissors"])
print(np.random.choice(options, size=5))
# ['paper' 'rock' 'scissors' 'rock' 'paper']
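np.random.seed(42): setting a seed means your "random" numbers are the same on every run. This is crucial in ML so your experiments are reproducible. Treat the sample outputs above as illustrative, though: the exact values depend on every random call made since the seed was set, so running the snippets in a different order gives different numbers.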
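Newer NumPy code often uses a Generator object created with np.random.default_rng instead of the global np.random.seed. The idea is the same, but each generator carries its own seed, so one part of your code can't disturb another. Here is a minimal sketch of the equivalent calls. No sample values are shown because they differ from the np.random.seed outputs.

```python
import numpy as np

rng = np.random.default_rng(42)   # a Generator with its own seed

floats = rng.random(5)                      # floats in [0, 1)
ints = rng.integers(1, 100, size=5)         # integers from 1 to 99
normal = rng.normal(0, 1, size=1000)        # mean 0, std 1
picks = rng.choice(["rock", "paper", "scissors"], size=5)

# A second generator with the same seed reproduces the same draws
rng2 = np.random.default_rng(42)
print(np.array_equal(floats, rng2.random(5)))   # True
```

Note that the integer function is called integers here, not randint.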
Step 9: Reshaping Arrays
arr = np.arange(12)
print(arr) # [ 0 1 2 3 4 5 6 7 8 9 10 11]
print(arr.shape) # (12,)
# Reshape to 3x4 matrix
matrix = arr.reshape(3, 4)
print(matrix)
print(matrix.shape) # (3, 4)
Output:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
(3, 4)
# -1 means "figure it out automatically"
arr = np.arange(12)
print(arr.reshape(3, -1)) # 3 rows, NumPy calculates 4 columns
print(arr.reshape(-1, 6)) # NumPy calculates 2 rows, 6 columns
# Flatten 2D back to 1D
matrix = np.array([[1,2,3],[4,5,6]])
print(matrix.flatten()) # [1 2 3 4 5 6]
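Two details are worth a quick sketch. First, reshape only works when the total number of elements fits the new shape. Second, flatten gives you a copy, so editing it leaves the original alone. The example also shows that reshape is not limited to 2D.

```python
import numpy as np

arr = np.arange(12)

# 12 elements cannot fill 5 rows evenly, so this fails
try:
    arr.reshape(5, -1)
except ValueError as err:
    print("ValueError:", err)

matrix = arr.reshape(3, 4)
flat = matrix.flatten()   # flatten always returns a copy
flat[0] = 100
print(matrix[0, 0])       # 0, the original is untouched

print(arr.reshape(2, 2, 3).shape)  # (2, 2, 3)
```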
Real World Example — Student Grade Analysis
Let's put everything together in one practical example:
import numpy as np
# Exam scores for 5 students across 4 subjects
# Rows = students, Columns = subjects (Math, Science, English, History)
np.random.seed(42)
scores = np.random.randint(40, 100, size=(5, 4))
students = ["Rahul", "Priya", "Gagan", "Amit", "Neha"]
subjects = ["Math", "Science", "English", "History"]
print("=== Raw Scores ===")
print(scores)
print()
# Average per student (axis=1 = across columns)
student_avg = np.mean(scores, axis=1)
print("=== Student Averages ===")
for i, name in enumerate(students):
print(f"{name}: {student_avg[i]:.1f}")
print()
# Average per subject (axis=0 = across rows)
subject_avg = np.mean(scores, axis=0)
print("=== Subject Averages ===")
for i, subject in enumerate(subjects):
print(f"{subject}: {subject_avg[i]:.1f}")
print()
# Best and worst students
best_idx = np.argmax(student_avg)
worst_idx = np.argmin(student_avg)
print(f"Best student: {students[best_idx]} ({student_avg[best_idx]:.1f})")
print(f"Needs help: {students[worst_idx]} ({student_avg[worst_idx]:.1f})")
print()
# How many students passed each subject (>=50)
passed_per_subject = np.sum(scores >= 50, axis=0)
print("=== Pass Count Per Subject ===")
for i, subject in enumerate(subjects):
print(f"{subject}: {passed_per_subject[i]}/5 passed")
print()
# Grade each student
print("=== Grades ===")
for i, name in enumerate(students):
avg = student_avg[i]
if avg >= 85:
grade = "A"
elif avg >= 70:
grade = "B"
elif avg >= 55:
grade = "C"
else:
grade = "F"
print(f"{name}: {avg:.1f} → Grade {grade}")
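Sample output (if the raw scores on your machine differ, the averages and grades below will differ accordingly):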
=== Raw Scores ===
[[71 60 57 85]
[74 77 55 74]
[49 78 95 80]
[54 68 95 65]
[65 71 47 93]]
=== Student Averages ===
Rahul: 68.2
Priya: 70.0
Gagan: 75.5
Amit: 70.5
Neha: 69.0
=== Subject Averages ===
Math: 62.6
Science: 70.8
English: 69.8
History: 79.4
Best student: Gagan (75.5)
Needs help: Rahul (68.2)
=== Pass Count Per Subject ===
Math: 4/5 passed
Science: 5/5 passed
English: 4/5 passed
History: 5/5 passed
=== Grades ===
Rahul: 68.2 → Grade C
Priya: 70.0 → Grade B
Gagan: 75.5 → Grade B
Amit: 70.5 → Grade B
Neha: 69.0 → Grade C
This is real data analysis in miniature: taking a table of data, computing statistics, and pulling out insights. You just did your first data analysis with NumPy.
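The grading loop in the example works fine, but the same if/elif logic can also be written without a loop. The sketch below is one possible approach using np.select, which checks a list of conditions in order and takes the first match. The averages are typed in from the sample scores above so the snippet runs on its own.

```python
import numpy as np

# Student averages computed from the sample scores above
student_avg = np.array([68.25, 70.0, 75.5, 70.5, 69.0])
students = ["Rahul", "Priya", "Gagan", "Amit", "Neha"]

# Conditions are checked in order; the first match wins
conditions = [student_avg >= 85, student_avg >= 70, student_avg >= 55]
grades = np.select(conditions, ["A", "B", "C"], default="F")

for name, avg, grade in zip(students, student_avg, grades):
    print(f"{name}: {avg:.1f} -> Grade {grade}")
```

For five students the loop is just as good. The vectorized version starts to pay off when you have thousands of rows.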
Quick Reference — NumPy Cheat Sheet
import numpy as np
# Creating arrays
np.array([1,2,3]) # from list
np.zeros(5) # [0. 0. 0. 0. 0.]
np.ones((3,4)) # 3x4 matrix of ones
np.arange(0, 10, 2) # [0 2 4 6 8]
np.linspace(0, 1, 5) # 5 evenly spaced points
np.random.randint(0, 100, 10) # 10 random integers
np.random.seed(42) # reproducibility
# Array info
arr.shape # size along each dimension, e.g. (3, 4)
arr.ndim # number of dimensions
arr.size # total elements
arr.dtype # data type
# Indexing
arr[0] # first element
arr[-1] # last element
arr[1:4] # slice
arr[::2] # every 2nd
matrix[1, 2] # row 1, col 2
matrix[:, 0] # entire first column
matrix[0, :] # entire first row
# Operations
arr + 5 # add 5 to all
arr * 2 # multiply all by 2
arr ** 2 # square all
np.sqrt(arr) # square root
# Statistics
np.sum(arr)
np.mean(arr)
np.median(arr)
np.std(arr)
np.min(arr)
np.max(arr)
np.argmin(arr) # index of min
np.argmax(arr) # index of max
# 2D statistics
np.mean(matrix, axis=0) # column means
np.mean(matrix, axis=1) # row means
# Filtering
arr[arr > 50] # values greater than 50
arr[(arr > 20) & (arr < 80)] # between 20 and 80
# Reshape
arr.reshape(3, 4) # reshape to 3x4
arr.flatten() # back to 1D
Exercise 🏋️
Create a new Jupyter notebook cell and solve this:
Sales Analysis:
# Monthly sales data for 4 products over 6 months
# Each row = one product, each column = one month
sales = np.array([
[1200, 1500, 1100, 1800, 2000, 1600], # Product A
[800, 950, 870, 1100, 1250, 900], # Product B
[2100, 1900, 2300, 2100, 1800, 2500], # Product C
[500, 600, 550, 700, 650, 720] # Product D
])
products = ["Product A", "Product B", "Product C", "Product D"]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
Find and print:
- Total sales per product over 6 months
- Average monthly sales per product
- Best performing product (highest total)
- Worst performing product (lowest total)
- Best sales month overall (sum across all products)
- Which products had sales above 1000 in every month
- Total company revenue per month