NumPy Random Numbers — Detailed Explanation

The Big Question First — What is "Random" in Computers?

Computers are machines. They follow exact instructions. They cannot truly be random.

So when you ask Python for a "random" number — it uses a mathematical formula that produces numbers that look random but are actually calculated. These are called pseudo-random numbers.

This formula needs a starting number to begin its calculation. That starting number is called a seed.
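To make this concrete, here is a toy pseudo-random generator (a linear congruential generator, one of the oldest such formulas). NumPy itself uses a more sophisticated algorithm, the Mersenne Twister, but the principle is identical: the whole "random" sequence is pure arithmetic, fully determined by the seed.

```python
# A toy linear congruential generator — illustration only,
# not the algorithm NumPy actually uses.
def lcg(seed, n):
    numbers = []
    state = seed
    for _ in range(n):
        state = (1103515245 * state + 12345) % 2**31   # fixed formula
        numbers.append(state / 2**31)                   # scale to [0, 1)
    return numbers

print(lcg(42, 3))   # same seed -> same three numbers, every run
print(lcg(42, 3))   # identical to the line above
print(lcg(7, 3))    # different seed -> different sequence
```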


np.random.seed() — Why It Exists


    import numpy as np

    # Without seed — different numbers every run
    print(np.random.random(3))   # run 1: [0.374 0.951 0.732]
    print(np.random.random(3))   # run 2: [0.187 0.623 0.448]
    print(np.random.random(3))   # run 3: [0.912 0.055 0.774]

Every time you run — different numbers. Good when you want variety. Bad when you need reproducible results.

The Problem in ML:

Imagine you train a model and get 92% accuracy. Your colleague runs the same code and gets 87%. Who is right? You used different random numbers — different results. You cannot compare or reproduce results.

The Solution — seed:


    np.random.seed(42)           # set seed ONCE at the top
    print(np.random.random(3))   # run 1: [0.374 0.951 0.732]

    np.random.seed(42)           # reset same seed
    print(np.random.random(3))   # run 2: [0.374 0.951 0.732]  SAME!

    np.random.seed(42)           # reset same seed
    print(np.random.random(3))   # run 3: [0.374 0.951 0.732]  SAME!

Same seed = same sequence of numbers every single time. Anyone who runs your code gets the exact same result. This is called reproducibility — extremely important in ML and research.

Why 42 specifically?

No technical reason. Just a convention. 42 is popular because of the famous book The Hitchhiker's Guide to the Galaxy, where 42 is "the Answer to the Ultimate Question of Life, the Universe, and Everything." You can use any number — 0, 1, 100, 999 — all work.


    np.random.seed(0)    # works
    np.random.seed(123)  # works
    np.random.seed(42)   # most common — just follow the convention
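As a side note, newer NumPy code (1.17+) often uses the Generator API instead of the global np.random.seed(): you create a seeded generator object and draw from it. Both approaches give reproducibility; this tutorial sticks with the classic functions. A minimal sketch (note the Generator produces a different number stream than np.random.seed(42), so its values won't match the examples below):

```python
import numpy as np

# The newer recommended style: a seeded Generator object
rng = np.random.default_rng(42)

print(rng.random(3))               # reproducible floats in [0.0, 1.0)
print(rng.integers(1, 7, size=5))  # note: integers(), not randint()
```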


np.random.random() — Floats Between 0 and 1


    np.random.seed(42)
    print(np.random.random(5))
    # [0.374 0.951 0.732 0.599 0.156]

Breaking this down:

np.random.random(5) means — give me 5 random decimal numbers

Every number will always be:

  • Greater than or equal to 0.0
  • Less than 1.0
  • So range is [0.0, 1.0)

    np.random.seed(42)

    # 1 random number
    print(np.random.random(1))     # [0.374]

    # 5 random numbers (re-seeding so each call starts the sequence over)
    np.random.seed(42)
    print(np.random.random(5))     # [0.374 0.951 0.732 0.599 0.156]

    # 10 random numbers
    np.random.seed(42)
    print(np.random.random(10))    # 10 numbers between 0 and 1
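random() only produces values in [0.0, 1.0), but you can stretch that to any range [a, b) with a scale and a shift. A small sketch (the range 5 to 10 here is just an arbitrary example):

```python
import numpy as np

np.random.seed(42)

# Map [0, 1) floats to an arbitrary range [a, b): a + (b - a) * random()
a, b = 5.0, 10.0
scaled = a + (b - a) * np.random.random(5)
print(scaled)   # five floats, each >= 5.0 and < 10.0
```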

Real use case — where is this used in ML?

Neural network weights are initialized with small random values. When you start training a neural network from scratch, every connection starts at a small random number rather than zero, so that different neurons learn different things.
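As a rough sketch of that idea (real libraries use more careful schemes such as Xavier or He initialization, so treat the shape and scaling here as illustrative only):

```python
import numpy as np

np.random.seed(42)

# Illustrative only: a random starting weight matrix for a layer
# with 4 inputs and 3 outputs
weights = np.random.random((4, 3))   # 4x3 values in [0.0, 1.0)
print(weights.shape)                 # (4, 3)
print(weights.round(3))
```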


np.random.randint() — Random Whole Numbers


    np.random.seed(42)
    print(np.random.randint(1, 100, size=5))
    # [52 93 15 72 61]

Breaking this down:

np.random.randint(low, high, size) means:

  • low = 1 — minimum value (included)
  • high = 100 — maximum value (NOT included, so max is 99)
  • size = 5 — how many numbers

So this gives you 5 random whole numbers from 1 to 99.


    np.random.seed(42)

    # Numbers from 1 to 9 (high=10 is not included)
    print(np.random.randint(1, 10, size=5))
    # [7 4 8 5 7]

    # Numbers from 0 to 99
    print(np.random.randint(0, 100, size=3))
    # 3 numbers, each between 0 and 99

    # Single random number from 1 to 6 (like rolling a die)
    print(np.random.randint(1, 7))
    # high=7 is excluded, so the result is always between 1 and 6

    # Simulate rolling a die 10 times
    dice_rolls = np.random.randint(1, 7, size=10)
    print("Dice rolls:", dice_rolls)
    # Only the values 1, 2, 3, 4, 5, 6 can appear

Try it yourself in Jupyter — you'll see only 1-6.
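Once you can simulate dice rolls, counting how often each face appears is a natural next step. np.unique with return_counts=True does the tally:

```python
import numpy as np

np.random.seed(42)
rolls = np.random.randint(1, 7, size=1000)

# Tally how many times each face appeared
faces, counts = np.unique(rolls, return_counts=True)
for face, count in zip(faces, counts):
    print(f"Face {face}: {count} times")
# With 1000 fair rolls, each face should land near 1000/6, about 167 times
```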


np.random.randint() — 2D Array


    np.random.seed(42)
    print(np.random.randint(0, 10, size=(3, 4)))
    # [[6 3 7 4]
    #  [6 9 2 6]
    #  [7 4 3 7]]

size=(3, 4) means — make a table with 3 rows and 4 columns.

So instead of a flat list of numbers, you get a 2D grid:

Row 0: [6, 3, 7, 4]
Row 1: [6, 9, 2, 6]
Row 2: [7, 4, 3, 7]

All numbers are between 0 and 9 (10 is excluded).

Visualizing the size parameter:


    np.random.seed(42)

    # size=5 means shape (5,) — 5 numbers in a line
    print(np.random.randint(0, 10, size=5))
    # [6 3 7 4 6]

    # size=(2, 5) means 2 rows, 5 columns
    # (re-seeding so each call starts from the same sequence)
    np.random.seed(42)
    print(np.random.randint(0, 10, size=(2, 5)))
    # [[6 3 7 4 6]
    #  [9 2 6 7 4]]

    # size=(3, 4) means 3 rows, 4 columns
    np.random.seed(42)
    print(np.random.randint(0, 10, size=(3, 4)))
    # [[6 3 7 4]
    #  [6 9 2 6]
    #  [7 4 3 7]]

Real use case:


    # Simulate exam marks for 5 students across 4 subjects
    np.random.seed(42)
    marks = np.random.randint(40, 101, size=(5, 4))
    print(marks)
    # [[70 83 77 84]
    #  [76 79 55 74]
    #  [87 98 95 80]
    #  [54 68 95 65]
    #  [65 71 47 93]]

5 rows = 5 students, 4 columns = 4 subjects. All marks between 40 and 100.
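With the marks in a 2D grid, row-wise and column-wise statistics come almost for free via the axis parameter:

```python
import numpy as np

np.random.seed(42)
marks = np.random.randint(40, 101, size=(5, 4))

# axis=1 collapses the columns: one average per student (row)
print("Per-student average:", marks.mean(axis=1))

# axis=0 collapses the rows: one average per subject (column)
print("Per-subject average:", marks.mean(axis=0))
```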


np.random.normal() — Bell Curve Numbers

This one needs more explanation because it introduces a new concept.

    np.random.seed(42)
    normal = np.random.normal(0, 1, 1000)
    print("Mean:", np.mean(normal).round(2))    # ~0.0
    print("Std:", np.std(normal).round(2))      # ~1.0

What is a Normal Distribution?

In real life, many things follow a pattern called normal distribution or bell curve:

  • Most students score around the class average
  • Very few score extremely high or extremely low
  • Heights of people — most are around average height
  • Salaries in a company — most people earn around median

When you plot this — it looks like a bell:

        *
       ***
      *****
     *******
    *********
   ***********
  *************
 ***************
*****************
←──────────────→
low    avg    high

Most values cluster around the center (average). Few values are at the extremes.

np.random.normal(mean, std, size)

    np.random.seed(42)

    # normal(mean, std, size)
    # mean = center of the bell curve
    # std  = how spread out the numbers are
    # size = how many numbers

    # Mean=0, Std=1 — standard normal distribution
    data = np.random.normal(0, 1, 10)
    print(data.round(2))
    # [ 0.5  -0.14  0.65  1.52 -0.23 -0.23  1.58  0.77 -0.47  0.54]
    # Most numbers close to 0, some reach about ±1.5

Let's make it more intuitive with a real example:

    np.random.seed(42)

    # Simulate salaries — average Rs.60000, std Rs.10000
    # Most people earn close to 60000
    # Very few earn 30000 or 90000
    salaries = np.random.normal(60000, 10000, 500)

    print(f"Average salary: Rs.{np.mean(salaries):,.0f}")
    print(f"Std deviation : Rs.{np.std(salaries):,.0f}")
    print(f"Min salary    : Rs.{np.min(salaries):,.0f}")
    print(f"Max salary    : Rs.{np.max(salaries):,.0f}")

    # How many people earn between 50000 and 70000?
    between = np.sum((salaries >= 50000) & (salaries <= 70000))
    print(f"People earning 50k-70k: {between} out of 500")
    # Around 340 — about 68% — this is the 68% rule of normal distribution

Output:

    Average salary: Rs.60,012
    Std deviation : Rs.9,987
    Min salary    : Rs.24,531
    Max salary    : Rs.91,243
    People earning 50k-70k: 341 out of 500

About 68% of values fall within 1 std of the mean — this is a fundamental property of normal distribution. You'll see this rule everywhere in statistics and ML.
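The full version of this is the 68-95-99.7 rule: about 68% of values fall within 1 std of the mean, 95% within 2, and 99.7% within 3. You can check it empirically with a large sample:

```python
import numpy as np

np.random.seed(42)
data = np.random.normal(0, 1, 100_000)   # large sample for stable percentages

# Fraction of values within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(data) <= k)
    print(f"Within {k} std: {frac:.1%}")
# Expect roughly 68.3%, 95.4% and 99.7%
```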

    np.random.seed(42)

    # Different means and stds — different shaped distributions
    low_spread  = np.random.normal(50, 5,  1000)   # tight cluster around 50
    high_spread = np.random.normal(50, 20, 1000)   # very spread out around 50

    print("Low spread (std=5):")
    print(f"  Min: {low_spread.min():.1f}, Max: {low_spread.max():.1f}")
    # Min: ~30, Max: ~70 — tight range

    print("High spread (std=20):")
    print(f"  Min: {high_spread.min():.1f}, Max: {high_spread.max():.1f}")
    # Min: ~-10, Max: ~110 — much wider range

Why normal distribution in ML?

  1. Real world data often follows this shape naturally
  2. Neural network weights are initialized from normal distribution
  3. Many ML algorithms assume data is normally distributed
  4. Statistical tests assume normality

np.random.choice() — Pick Random Items


    np.random.seed(42)
    options = np.array(["rock", "paper", "scissors"])
    print(np.random.choice(options, size=5))
    # ['paper' 'rock' 'scissors' 'rock' 'paper']

np.random.choice(array, size) means — randomly pick size items from array.

Each pick is independent — the same item can be picked multiple times (like rolling a die: you can get 6 twice).


    np.random.seed(42)

    # Pick 1 item
    print(np.random.choice(["rock", "paper", "scissors"]))
    # scissors

    # Pick 5 items — repetition allowed (default)
    print(np.random.choice(["rock", "paper", "scissors"], size=5))
    # 5 picks; the same item can appear more than once

    # Pick 3 items — NO repetition (replace=False)
    colors = ["red", "blue", "green", "yellow", "purple"]
    print(np.random.choice(colors, size=3, replace=False))
    # ['purple' 'green' 'yellow'] — each color only once

    # Pick from numbers
    print(np.random.choice([10, 20, 30, 40, 50], size=4))
    # [30 10 50 20]
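By default every item is equally likely, but choice also accepts a p parameter with one probability per item (the probabilities must sum to 1). A quick sketch with made-up probabilities:

```python
import numpy as np

np.random.seed(42)

# Weighted picks: "rock" ~60% of the time, the others ~20% each
picks = np.random.choice(["rock", "paper", "scissors"],
                         size=1000, p=[0.6, 0.2, 0.2])

values, counts = np.unique(picks, return_counts=True)
print(dict(zip(values, counts)))   # "rock" count should be near 600
```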

Real use case in ML:


    np.random.seed(42)

    # You have 1000 data points but want to test on a random sample of 100
    data = np.arange(1000)   # [0, 1, 2, ..., 999]

    # Pick 100 random indices — no repetition
    sample_indices = np.random.choice(len(data), size=100, replace=False)
    sample = data[sample_indices]

    print(f"Original size: {len(data)}")
    print(f"Sample size  : {len(sample)}")
    print(f"First 10 samples: {sorted(sample[:10])}")
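A related trick for train/test splits is np.random.permutation: shuffle all the indices once, then slice. Every index lands in exactly one of the two sets:

```python
import numpy as np

np.random.seed(42)
data = np.arange(1000)

# Shuffle the indices, then slice into 80% train / 20% test
shuffled = np.random.permutation(len(data))
train_idx, test_idx = shuffled[:800], shuffled[800:]

print(f"Train size: {len(train_idx)}, Test size: {len(test_idx)}")
print("Overlap:", np.intersect1d(train_idx, test_idx).size)   # 0 — no index in both
```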


All Random Functions — Side by Side Summary

    # (re-seeding before each example so every output matches the sections above)
    np.random.seed(42)

    # 1. random() — floats 0 to 1
    print("random(5):", np.random.random(5))
    # [0.374 0.951 0.732 0.599 0.156]
    # Use: initializing weights, probabilities

    # 2. randint(low, high, size) — whole numbers
    np.random.seed(42)
    print("randint(1,10,5):", np.random.randint(1, 10, size=5))
    # [7 4 8 5 7]
    # Use: creating test data, simulating discrete events

    # 3. randint with 2D size — table of whole numbers
    np.random.seed(42)
    print("randint 2D:\n", np.random.randint(0, 10, size=(3, 4)))
    # [[6 3 7 4]
    #  [6 9 2 6]
    #  [7 4 3 7]]
    # Use: creating dataset tables for testing

    # 4. normal(mean, std, size) — bell curve numbers
    np.random.seed(42)
    print("normal:", np.random.normal(0, 1, 5).round(3))
    # [ 0.497 -0.138  0.648  1.523 -0.234]
    # Use: realistic data simulation, weight initialization

    # 5. choice(array, size) — pick from existing values
    np.random.seed(42)
    print("choice:", np.random.choice(["A", "B", "C"], size=5))
    # 5 letters, repetition allowed
    # Use: sampling, random selection from categories

One Practical Example — Everything Together

    np.random.seed(42)

    # Simulate a small dataset for a company — 20 employees
    n = 20

    # Employee data using all random functions
    employee_ids  = np.arange(1001, 1001 + n)                          # [1001, 1002, ...]
    departments   = np.random.choice(["Eng", "Mkt", "Sales"], size=n)  # choice
    ages          = np.random.randint(22, 55, size=n)                  # randint
    experience    = (ages - 22 + np.random.randint(0, 3, n)).clip(0)
    salaries      = np.random.normal(65000, 15000, n).clip(25000, 150000).astype(int)  # normal
    ratings       = np.random.random(n) * 2.5 + 2.5                    # random — between 2.5 and 5.0
    ratings       = ratings.round(1)

    print("=== Simulated Employee Dataset ===")
    print(f"{'ID':<6} {'Dept':<6} {'Age':<5} {'Exp':<5} {'Salary':<10} {'Rating'}")
    print("-" * 45)
    for i in range(8):    # print first 8
        print(f"{employee_ids[i]:<6} {departments[i]:<6} {ages[i]:<5} "
              f"{experience[i]:<5} {salaries[i]:<10,} {ratings[i]}")

    print("\nSummary:")
    print(f"Avg salary : Rs.{np.mean(salaries):,.0f}")
    print(f"Avg rating : {np.mean(ratings):.1f}")
    print(f"Avg age    : {np.mean(ages):.0f}")

Output:

    === Simulated Employee Dataset ===
    ID     Dept   Age   Exp   Salary     Rating
    ---------------------------------------------
    1001   Mkt    36    14    67,432     3.8
    1002   Sales  25    3     55,121     4.1
    1003   Eng    47    25    78,943     3.2
    1004   Eng    30    8     62,871     4.7
    1005   Mkt    28    6     51,234     3.9
    1006   Sales  42    20    88,654     4.3
    1007   Eng    33    11    71,098     3.6
    1008   Mkt    26    4     48,765     4.5

    Summary:
    Avg salary : Rs.64,832
    Avg rating : 3.9
    Avg age    : 34

This is how ML practitioners create test data when they don't have real data yet — simulating realistic datasets using random functions.


Quick Cheat Sheet

    import numpy as np
    np.random.seed(42)             # reproducibility — always set this

    np.random.random(5)            # 5 floats between 0.0 and 1.0
    np.random.randint(1, 10)       # 1 integer from 1 to 9
    np.random.randint(1, 10, 5)    # 5 integers from 1 to 9
    np.random.randint(0, 10, (3,4))  # 3x4 table of integers 0-9
    np.random.normal(0, 1, 100)    # 100 numbers from bell curve
    np.random.normal(mean, std, n) # n numbers with custom center and spread
    np.random.choice(array, 5)     # pick 5 items randomly (with repetition)
    np.random.choice(array, 5, replace=False)  # pick 5 without repetition


What We're Covering Today Copy vs View — one of the most important NumPy concepts Fancy Indexing np.where — conditional operations ...