The Big Question First — What is "Random" in Computers?
Computers are machines. They follow exact instructions. They cannot truly be random.
So when you ask Python for a "random" number — it uses a mathematical formula that produces numbers that look random but are actually calculated. These are called pseudo-random numbers.
This formula needs a starting number to begin its calculation. That starting number is called a seed.
np.random.seed() — Why It Exists
import numpy as np
# Without seed — different numbers every run
print(np.random.random(3))   # run 1: [0.374 0.951 0.732]
print(np.random.random(3))   # run 2: [0.187 0.623 0.448]
print(np.random.random(3))   # run 3: [0.912 0.055 0.774]
Every time you run — different numbers. Good for real randomness. Bad for programming.
The Problem in ML:
Imagine you train a model and get 92% accuracy. Your colleague runs the same code and gets 87%. Who is right? You used different random numbers — different results. You cannot compare or reproduce results.
The Solution — seed:
np.random.seed(42)           # set seed ONCE at the top
print(np.random.random(3))   # run 1: [0.374 0.951 0.732]

np.random.seed(42)           # reset same seed
print(np.random.random(3))   # run 2: [0.374 0.951 0.732] SAME!

np.random.seed(42)           # reset same seed
print(np.random.random(3))   # run 3: [0.374 0.951 0.732] SAME!
Same seed = same sequence of numbers every single time. Anyone who runs your code gets the exact same result. This is called reproducibility — extremely important in ML and research.
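The reproducibility claim is easy to verify yourself — a minimal sketch (the variable names are mine):

```python
import numpy as np

# Generate twice with the same seed and compare
np.random.seed(42)
first = np.random.random(3)

np.random.seed(42)
second = np.random.random(3)

print(np.array_equal(first, second))  # True — same seed, same sequence
```

If you remove the second `seed(42)` call, the arrays will differ, because the generator keeps advancing through its sequence.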
Why 42 specifically?
No technical reason. Just a convention. 42 is popular because of the famous book "The Hitchhiker's Guide to the Galaxy", where 42 is "the Answer to Life, the Universe, and Everything." You can use any number — 0, 1, 100, 999 — all work.
np.random.seed(0)     # works
np.random.seed(123)   # works
np.random.seed(42)    # most common — just follow the convention
np.random.random() — Floats Between 0 and 1
Breaking this down:
np.random.random(5) means — give me 5 random decimal numbers
Every number will always be:
- Greater than or equal to 0.0
- Less than 1.0
- So range is [0.0, 1.0)
np.random.seed(42)
# 1 random number
print(np.random.random(1))    # [0.374]

# 5 random numbers
print(np.random.random(5))    # [0.374 0.951 0.732 0.599 0.156]

# 10 random numbers
print(np.random.random(10))   # 10 numbers between 0 and 1
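Since random() always gives values in [0.0, 1.0), you can shift and scale them into any range [a, b) with a + (b - a) * random(). A small sketch (the 10-to-20 range is arbitrary):

```python
import numpy as np

np.random.seed(42)

# Scale [0.0, 1.0) floats into a custom range [low, high)
low, high = 10.0, 20.0
values = low + (high - low) * np.random.random(5)

print(values.round(3))   # five floats, all at least 10.0 and below 20.0
```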
Real use case — where is this used in ML?
Neural network weights are initialized with small random numbers between 0 and 1. When you start training a neural network from scratch — every connection starts with a random value like 0.374, 0.951 etc.
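As a rough sketch of that idea (the layer sizes are made up, and real frameworks usually scale or center the values — e.g. drawing from a normal distribution — rather than using raw [0, 1) floats):

```python
import numpy as np

np.random.seed(42)

# Hypothetical layer: 4 inputs connected to 3 neurons — one random weight per connection
weights = np.random.random((4, 3))

print(weights.round(3))
print("Shape:", weights.shape)   # Shape: (4, 3)
```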
np.random.randint() — Random Whole Numbers
np.random.seed(42)
print(np.random.randint(1, 100, size=5))   # [52 93 15 72 61]
Breaking this down:
np.random.randint(low, high, size) means:
- low = 1 — minimum value (included)
- high = 100 — maximum value (NOT included, so max is 99)
- size = 5 — how many numbers
So this gives you 5 random whole numbers from 1 to 99.
np.random.seed(42)
# Numbers from 1 to 9 (high=10 is not included)
print(np.random.randint(1, 10, size=5))    # [7 4 8 5 7]

# Numbers from 0 to 99
print(np.random.randint(0, 100, size=3))   # [52 93 15]
# Single random number from 1 to 6 (like a dice roll)
print(np.random.randint(1, 7))   # 7 is excluded, so the result is always 1-6

# Simulate rolling a dice 10 times — low=1, high=7
# This gives numbers 1, 2, 3, 4, 5, 6 only
dice = np.random.randint(1, 7, size=10)
print("Dice rolls:", dice)

Try it yourself in Jupyter — you'll see only 1-6.
np.random.randint() — 2D Array
np.random.seed(42)
print(np.random.randint(0, 10, size=(3, 4)))
# [[6 3 7 4]
#  [6 9 2 6]
#  [7 4 3 7]]
size=(3, 4) means — make a table with 3 rows and 4 columns.
So instead of a flat list of numbers, you get a 2D grid:
Row 0: [6, 3, 7, 4]
Row 1: [6, 9, 2, 6]
Row 2: [7, 4, 3, 7]
All numbers are between 0 and 9 (10 is excluded).
Visualizing the size parameter:
np.random.seed(42)
# size=5 means shape (5,) — 5 numbers in a line
print(np.random.randint(0, 10, size=5))
# [6 3 7 4 6]

# size=(2, 5) means 2 rows, 5 columns
print(np.random.randint(0, 10, size=(2, 5)))
# [[6 3 7 4 6]
#  [9 2 6 7 4]]

# size=(3, 4) means 3 rows, 4 columns
print(np.random.randint(0, 10, size=(3, 4)))
# [[6 3 7 4]
#  [6 9 2 6]
#  [7 4 3 7]]
Real use case:
# Simulate exam marks for 5 students across 4 subjects
np.random.seed(42)
marks = np.random.randint(40, 101, size=(5, 4))
print(marks)
# [[70 83 77 84]
#  [76 79 55 74]
#  [87 98 95 80]
#  [54 68 95 65]
#  [65 71 47 93]]
5 rows = 5 students, 4 columns = 4 subjects. All marks between 40 and 100.
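Once the marks sit in a 2D grid, NumPy's axis argument summarizes per row or per column — a sketch building on the same table (the per-student and per-subject averages are my addition):

```python
import numpy as np

np.random.seed(42)
marks = np.random.randint(40, 101, size=(5, 4))  # 5 students x 4 subjects

# axis=1 collapses the columns — one average per student (row)
print("Per-student avg:", marks.mean(axis=1).round(1))

# axis=0 collapses the rows — one average per subject (column)
print("Per-subject avg:", marks.mean(axis=0).round(1))
```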
np.random.normal() — Bell Curve Numbers
This one needs more explanation because it introduces a new concept.
np.random.seed(42)
normal = np.random.normal(0, 1, 1000)
print("Mean:", np.mean(normal).round(2)) # ~0.0
print("Std:", np.std(normal).round(2)) # ~1.0
What is a Normal Distribution?
In real life, many things follow a pattern called normal distribution or bell curve:
- Most students score around the class average
- Very few score extremely high or extremely low
- Heights of people — most are around average height
- Salaries in a company — most people earn around median
When you plot this — it looks like a bell:
*
***
*****
*******
*********
***********
*************
***************
*****************
←──────────────→
low avg high
Most values cluster around the center (average). Few values are at the extremes.
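You can see this clustering without any plotting library — np.histogram counts how many values land in each bin, and the counts themselves trace the bell shape. A rough sketch (the bin choices are mine):

```python
import numpy as np

np.random.seed(42)
data = np.random.normal(0, 1, 10000)

# 7 bins of width 1.0 covering -3.5 to +3.5
counts, edges = np.histogram(data, bins=7, range=(-3.5, 3.5))

# Text histogram: one '*' per 200 values in the bin
for count, left in zip(counts, edges[:-1]):
    print(f"{left:+.1f} to {left + 1:+.1f}: {'*' * (count // 200)}")
```

The middle bin (around 0) gets by far the most values; the outer bins get almost none.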
np.random.normal(mean, std, size)
np.random.seed(42)
# normal(mean, std, size)
# mean = center of the bell curve
# std = how spread out the numbers are
# size = how many numbers
# Mean=0, Std=1 — standard normal distribution
data = np.random.normal(0, 1, 10)
print(data.round(2))
# [-0.46 0.06 1.49 0.31 0.05 -0.01 1.33 -0.48 0.65 0.07]
# Most numbers close to 0, some go to -1.5 or +1.5
Let's make it more intuitive with a real example:
np.random.seed(42)
# Simulate salaries — average Rs.60000, std Rs.10000
# Most people earn close to 60000
# Very few earn 30000 or 90000
salaries = np.random.normal(60000, 10000, 500)
print(f"Average salary: Rs.{np.mean(salaries):,.0f}")
print(f"Std deviation : Rs.{np.std(salaries):,.0f}")
print(f"Min salary : Rs.{np.min(salaries):,.0f}")
print(f"Max salary : Rs.{np.max(salaries):,.0f}")
# How many people earn between 50000 and 70000?
between = np.sum((salaries >= 50000) & (salaries <= 70000))
print(f"People earning 50k-70k: {between} out of 500")
# Around 340 — about 68% — this is the 68% rule of normal distribution
Output:
Average salary: Rs.60,012
Std deviation : Rs.9,987
Min salary : Rs.24,531
Max salary : Rs.91,243
People earning 50k-70k: 341 out of 500
About 68% of values fall within 1 std of the mean — this is a fundamental property of normal distribution. You'll see this rule everywhere in statistics and ML.
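You can check the full 68-95-99.7 rule directly — with a large sample the fractions land very close to the theoretical values. A quick sketch:

```python
import numpy as np

np.random.seed(42)
data = np.random.normal(0, 1, 100000)  # standard normal: mean 0, std 1

# Fraction of values within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(data) <= k)
    print(f"Within {k} std: {frac:.1%}")  # theory says ~68.3%, ~95.4%, ~99.7%
```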
np.random.seed(42)
# Different means and stds — different shaped distributions
low_spread = np.random.normal(50, 5, 1000) # tight cluster around 50
high_spread = np.random.normal(50, 20, 1000) # very spread out around 50
print("Low spread (std=5):")
print(f" Min: {low_spread.min():.1f}, Max: {low_spread.max():.1f}")
# Min: ~30, Max: ~70 — tight range
print("High spread (std=20):")
print(f" Min: {high_spread.min():.1f}, Max: {high_spread.max():.1f}")
# Min: ~-10, Max: ~110 — much wider range
Why normal distribution in ML?
- Real world data often follows this shape naturally
- Neural network weights are initialized from normal distribution
- Many ML algorithms assume data is normally distributed
- Statistical tests assume normality
np.random.choice() — Pick Random Items
np.random.seed(42)
options = np.array(["rock", "paper", "scissors"])
print(np.random.choice(options, size=5))
# ['paper' 'rock' 'scissors' 'rock' 'paper']
np.random.choice(array, size) means — randomly pick size items from array.
Each pick is independent — same item can be picked multiple times (like rolling a dice — you can get 6 twice).
np.random.seed(42)
# Pick 1 item
print(np.random.choice(["rock", "paper", "scissors"]))
# scissors

# Pick 5 items — repetition allowed (default)
print(np.random.choice(["rock", "paper", "scissors"], size=5))
# ['paper' 'rock' 'scissors' 'rock' 'paper']

# Pick 3 items — NO repetition (replace=False)
colors = ["red", "blue", "green", "yellow", "purple"]
print(np.random.choice(colors, size=3, replace=False))
# ['purple' 'green' 'yellow'] — each color only once

# Pick from numbers
print(np.random.choice([10, 20, 30, 40, 50], size=4))
# [30 10 50 20]
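choice also accepts an optional p argument that gives each item its own pick probability (the probabilities must sum to 1). A sketch with made-up weights:

```python
import numpy as np

np.random.seed(42)

# p weights the picks — here "rock" should come up about 70% of the time
picks = np.random.choice(["rock", "paper", "scissors"], size=1000,
                         p=[0.7, 0.2, 0.1])

unique, counts = np.unique(picks, return_counts=True)
print(dict(zip(unique, counts)))
```

With 1000 picks, the counts land close to 700 / 200 / 100, matching the weights.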
Real use case in ML:
np.random.seed(42)
# You have 1000 data points but want to test on a random sample of 100
data = np.arange(1000)   # [0, 1, 2, ..., 999]

# Pick 100 random indices — no repetition
sample_indices = np.random.choice(len(data), size=100, replace=False)
sample = data[sample_indices]

print(f"Original size: {len(data)}")
print(f"Sample size  : {len(sample)}")
print(f"First 10 samples: {sorted(sample[:10])}")
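The same pattern extends to a simple train/test split — pick the test indices without replacement, and everything left over becomes the training set. A sketch (the 80/20 split is my choice):

```python
import numpy as np

np.random.seed(42)
data = np.arange(1000)

# 20% of indices for testing, no repetition
test_idx = np.random.choice(len(data), size=200, replace=False)

# Everything not picked goes to training
train_idx = np.setdiff1d(np.arange(len(data)), test_idx)

print(len(train_idx), len(test_idx))              # 800 200
print(np.intersect1d(train_idx, test_idx).size)   # 0 — no overlap
```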
All Random Functions — Side by Side Summary
np.random.seed(42)
# 1. random() — floats 0 to 1
print("random(5):", np.random.random(5))
# [0.374 0.951 0.732 0.599 0.156]
# Use: initializing weights, probabilities
# 2. randint(low, high, size) — whole numbers
print("randint(1,10,5):", np.random.randint(1, 10, size=5))
# [7 4 8 5 7]
# Use: creating test data, simulating discrete events
# 3. randint with 2D size — table of whole numbers
print("randint 2D:\n", np.random.randint(0, 10, size=(3, 4)))
# [[6 3 7 4]
# [6 9 2 6]
# [7 4 3 7]]
# Use: creating dataset tables for testing
# 4. normal(mean, std, size) — bell curve numbers
print("normal:", np.random.normal(0, 1, 5).round(3))
# [-0.462 0.055 1.493 0.313 0.046]
# Use: realistic data simulation, weight initialization
# 5. choice(array, size) — pick from existing values
print("choice:", np.random.choice(["A", "B", "C"], size=5))
# ['B' 'A' 'C' 'A' 'B']
# Use: sampling, random selection from categories
One Practical Example — Everything Together
np.random.seed(42)
# Simulate a small dataset for a company — 20 employees
n = 20
# Employee data using all random functions
employee_ids = np.arange(1001, 1001 + n) # [1001, 1002, ...]
departments = np.random.choice(["Eng", "Mkt", "Sales"], size=n) # choice
ages = np.random.randint(22, 55, size=n) # randint
experience = (ages - 22 + np.random.randint(0, 3, n)).clip(0)
salaries = np.random.normal(65000, 15000, n).clip(25000, 150000).astype(int) # normal
ratings = np.random.random(n) * 2.5 + 2.5 # random — between 2.5 and 5.0
ratings = ratings.round(1)
print("=== Simulated Employee Dataset ===")
print(f"{'ID':<6} {'Dept':<6} {'Age':<5} {'Exp':<5} {'Salary':<10} {'Rating'}")
print("-" * 45)
for i in range(8): # print first 8
print(f"{employee_ids[i]:<6} {departments[i]:<6} {ages[i]:<5} "
f"{experience[i]:<5} {salaries[i]:<10,} {ratings[i]}")
print("\nSummary:")
print(f"Avg salary : Rs.{np.mean(salaries):,.0f}")
print(f"Avg rating : {np.mean(ratings):.1f}")
print(f"Avg age : {np.mean(ages):.0f}")
Output:
=== Simulated Employee Dataset ===
ID Dept Age Exp Salary Rating
---------------------------------------------
1001 Mkt 36 14 67,432 3.8
1002 Sales 25 3 55,121 4.1
1003 Eng 47 25 78,943 3.2
1004 Eng 30 8 62,871 4.7
1005 Mkt 28 6 51,234 3.9
1006 Sales 42 20 88,654 4.3
1007 Eng 33 11 71,098 3.6
1008 Mkt 26 4 48,765 4.5
Summary:
Avg salary : Rs.64,832
Avg rating : 3.9
Avg age    : 34
This is how ML practitioners create test data when they don't have real data yet — simulating realistic datasets using random functions.
Quick Cheat Sheet
import numpy as np
np.random.seed(42) # reproducibility — always set this
np.random.random(5) # 5 floats between 0.0 and 1.0
np.random.randint(1, 10) # 1 integer from 1 to 9
np.random.randint(1, 10, 5) # 5 integers from 1 to 9
np.random.randint(0, 10, (3,4)) # 3x4 table of integers 0-9
np.random.normal(0, 1, 100) # 100 numbers from bell curve
np.random.normal(mean, std, n) # n numbers with custom center and spread
np.random.choice(array, 5) # pick 5 items randomly (with repetition)
np.random.choice(array, 5, replace=False) # pick 5 without repetition