NumPy — Advanced Concepts

What We're Covering Today

  • Copy vs View — one of the most important NumPy concepts
  • Fancy Indexing
  • np.where — conditional operations
  • Sorting
  • Combining Arrays
  • Real dataset simulation

These are the concepts that separate beginners from people who actually use NumPy in real projects.


Copy vs View — Most Important Concept

This is the number one source of bugs for NumPy beginners. Pay close attention.

The Problem


    import numpy as np

    original = np.array([1, 2, 3, 4, 5])

    # This looks like a copy but it is NOT
    slice_view = original[1:4]
    print(slice_view)    # [2 3 4]

    # Modify the slice
    slice_view[0] = 999
    print(slice_view)    # [999   3   4]

    # Original is also changed!
    print(original)      # [  1 999   3   4   5]

When you slice a NumPy array, you get a view, not a copy. Both variables point to the same data in memory, so changing one changes the other.

This is completely different from Python lists:


    # Python list — slicing gives a COPY
    py_list = [1, 2, 3, 4, 5]
    slice_copy = py_list[1:4]
    slice_copy[0] = 999

    print(py_list)     # [1, 2, 3, 4, 5]  — unchanged
    print(slice_copy)  # [999, 3, 4]


How to Check — View or Copy?


    arr = np.array([1, 2, 3, 4, 5])

    view = arr[1:4]
    copy = arr[1:4].copy()

    # .base attribute — None means it owns its data (copy)
    # Not None means it's a view of something else
    print(view.base is arr)    # True  — it's a view
    print(copy.base is arr)    # False — it's a copy


Always Use .copy() When You Need Independence


    original = np.array([10, 20, 30, 40, 50])

    # Safe way — proper copy
    safe_copy = original.copy()

    safe_copy[0] = 999
    print(safe_copy)    # [999  20  30  40  50]
    print(original)     # [10   20  30  40  50]  — unchanged

Rule: Whenever you slice and plan to modify — use .copy(). This will save you hours of debugging.
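
Beyond checking .base, NumPy ships np.shares_memory, which reports directly whether two arrays overlap in memory. A minimal check (variable names here are illustrative):

```python
import numpy as np

original = np.array([10, 20, 30, 40, 50])

view = original[1:4]          # slicing -> shares memory with original
safe = original[1:4].copy()   # .copy() -> owns its own data

print(np.shares_memory(original, view))   # True
print(np.shares_memory(original, safe))   # False
```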


Fancy Indexing

Fancy indexing = using an array of indices to select elements.

1D Fancy Indexing


    arr = np.array([10, 20, 30, 40, 50, 60, 70])

    # Select specific indices
    indices = [0, 2, 5]
    print(arr[indices])    # [10 30 60]

    # Select in any order
    print(arr[[4, 1, 6, 0]])    # [50 20 70 10]

Regular slicing can only select evenly spaced elements. Fancy indexing lets you pick any elements, in any order, and even repeat an index.
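
One more difference worth knowing: unlike a slice, the array returned by fancy indexing is always a copy, so modifying it never touches the original. A quick sketch:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60, 70])

picked = arr[[0, 2, 5]]    # fancy indexing returns a copy
picked[0] = 999

print(picked)    # [999  30  60]
print(arr)       # [10 20 30 40 50 60 70]  (unchanged)
```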


2D Fancy Indexing


    import numpy as np

    matrix = np.array([
        [1,  2,  3,  4],
        [5,  6,  7,  8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]
    ])


    # Select specific rows
    print(matrix[[0, 2]])
    # [[ 1  2  3  4]
    #  [ 9 10 11 12]]


    # Select specific rows AND columns
    row_indices = [0, 1, 2]
    col_indices = [0, 2, 3]
    print(matrix[row_indices, col_indices])
    # [1 7 12]  — matrix[0,0], matrix[1,2], matrix[2,3]
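
Paired row and column lists pick individual elements. To grab a full sub-grid instead (every listed row crossed with every listed column), np.ix_ builds the right index arrays for you; a sketch:

```python
import numpy as np

matrix = np.array([
    [1,  2,  3,  4],
    [5,  6,  7,  8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
])

# Rows 0 and 2 crossed with columns 0 and 3
sub = matrix[np.ix_([0, 2], [0, 3])]
print(sub)
# [[ 1  4]
#  [ 9 12]]
```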


np.where — Conditional Operations

np.where is like an if/else but for entire arrays at once. You'll use this constantly.

Basic Usage


    # np.where(condition, value_if_true, value_if_false)

    marks = np.array([85, 42, 90, 38, 75, 55, 29, 91])

    # Replace values — pass/fail label
    result = np.where(marks >= 50, "Pass", "Fail")
    print(result)
    # ['Pass' 'Fail' 'Pass' 'Fail' 'Pass' 'Pass' 'Fail' 'Pass']


np.where with Numbers


    marks = np.array([85, 42, 90, 38, 75, 55, 29, 91])

    # Add 10 bonus marks to failing students
    adjusted = np.where(marks < 50, marks + 10, marks)
    print(adjusted)
    # [85 52 90 48 75 55 39 91]
    # 42→52, 38→48, 29→39 (got bonus), rest unchanged
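
The same adjustment can also be done in place with boolean-mask assignment, which modifies the original array rather than building a new one. A sketch:

```python
import numpy as np

marks = np.array([85, 42, 90, 38, 75, 55, 29, 91])

# In-place: add 10 only where the mask is True
marks[marks < 50] += 10
print(marks)    # [85 52 90 48 75 55 39 91]
```

Use np.where when you want a new array and the original untouched; use mask assignment when overwriting in place is fine.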


np.where — Get Indices

When called with only a condition, np.where returns a tuple of arrays holding the indices where the condition is True:


    marks = np.array([85, 42, 90, 38, 75, 55, 29, 91])

    # Find indices of failing students
    fail_indices = np.where(marks < 50)
    print(fail_indices)         # (array([1, 3, 6]),)
    print(fail_indices[0])      # [1 3 6]  — indices 1, 3, 6 have marks below 50

    # Use indices to get the actual values
    print(marks[fail_indices])  # [42 38 29]


Nested np.where — Multiple Conditions


    marks = np.array([92, 78, 55, 42, 88, 61, 35, 95])

    grades = np.where(marks >= 90, "A",
            np.where(marks >= 75, "B",
            np.where(marks >= 60, "C",
            np.where(marks >= 50, "D", "F"))))

    print(grades)
    # ['A' 'B' 'D' 'F' 'B' 'C' 'F' 'A']

Nested np.where is the NumPy equivalent of if/elif/else chains.
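
When the chain grows past two or three levels, np.select is often easier to read: conditions are checked in order and the first match wins. The same grading logic as a sketch:

```python
import numpy as np

marks = np.array([92, 78, 55, 42, 88, 61, 35, 95])

conditions = [marks >= 90, marks >= 75, marks >= 60, marks >= 50]
choices    = ["A", "B", "C", "D"]

grades = np.select(conditions, choices, default="F")
print(grades)
# ['A' 'B' 'D' 'F' 'B' 'C' 'F' 'A']
```

Order matters here, exactly as in an if/elif chain: put the strictest condition first.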


Sorting


    arr = np.array([64, 34, 25, 12, 22, 11, 90])

    # Sort ascending
    print(np.sort(arr))       # [11 12 22 25 34 64 90]

    # Sort descending
    print(np.sort(arr)[::-1]) # [90 64 34 25 22 12 11]

    # argsort — returns INDICES that would sort the array
    indices = np.argsort(arr)
    print(indices)            # [5 3 4 2 1 0 6]
    print(arr[indices])       # [11 12 22 25 34 64 90]  — sorted

argsort is extremely useful when you need to sort one array based on another:


    students = np.array(["Rahul", "Priya", "Gagan", "Amit", "Neha"])
    marks    = np.array([78,       92,      65,       88,     71])

    # Sort students by their marks
    sorted_indices = np.argsort(marks)[::-1]    # descending
    print(students[sorted_indices])    # ['Priya' 'Amit' 'Rahul' 'Neha' 'Gagan']
    print(marks[sorted_indices])       # [92 88 78 71 65]

Top students ranked by marks — clean and easy.
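
If you only need the top k, np.argpartition skips the full sort: it guarantees the k largest values land in the last k slots, in arbitrary order, so you sort just that small slice afterwards. A sketch:

```python
import numpy as np

students = np.array(["Rahul", "Priya", "Gagan", "Amit", "Neha"])
marks    = np.array([78, 92, 65, 88, 71])

k = 2
# Indices of the k highest marks (unordered), then order that slice descending
top_k = np.argpartition(marks, -k)[-k:]
top_k = top_k[np.argsort(marks[top_k])[::-1]]

print(students[top_k])    # ['Priya' 'Amit']
print(marks[top_k])       # [92 88]
```

For large arrays this is noticeably cheaper than a full argsort, since partitioning is linear time.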


Sorting 2D Arrays


    matrix = np.array([
        [3, 1, 4],
        [1, 5, 9],
        [2, 6, 5]
    ])

    # Sort each row
    print(np.sort(matrix, axis=1))
    # [[1 3 4]
    #  [1 5 9]
    #  [2 5 6]]

    # Sort each column
    print(np.sort(matrix, axis=0))
    # [[1 1 4]
    #  [2 5 5]
    #  [3 6 9]]


Combining Arrays

np.concatenate — Join Arrays


    a = np.array([1, 2, 3])
    b = np.array([4, 5, 6])
    c = np.array([7, 8, 9])

    # Join 1D arrays
    print(np.concatenate([a, b]))        # [1 2 3 4 5 6]
    print(np.concatenate([a, b, c]))     # [1 2 3 4 5 6 7 8 9]

    # Join 2D arrays
    m1 = np.array([[1, 2], [3, 4]])
    m2 = np.array([[5, 6], [7, 8]])

    # Stack vertically (add rows)
    print(np.concatenate([m1, m2], axis=0))
    # [[1 2]
    #  [3 4]
    #  [5 6]
    #  [7 8]]

    # Stack horizontally (add columns)
    print(np.concatenate([m1, m2], axis=1))
    # [[1 2 5 6]
    #  [3 4 7 8]]


np.vstack and np.hstack — Easier Syntax


    a = np.array([1, 2, 3])
    b = np.array([4, 5, 6])

    # vstack — vertical stack (adds rows)
    print(np.vstack([a, b]))
    # [[1 2 3]
    #  [4 5 6]]

    # hstack — horizontal stack (adds columns)
    print(np.hstack([a, b]))
    # [1 2 3 4 5 6]

    m1 = np.ones((3, 2))
    m2 = np.zeros((3, 2))

    print(np.hstack([m1, m2]))
    # [[1. 1. 0. 0.]
    #  [1. 1. 0. 0.]
    #  [1. 1. 0. 0.]]

    print(np.vstack([m1, m2]))
    # [[1. 1.]
    #  [1. 1.]
    #  [1. 1.]
    #  [0. 0.]
    #  [0. 0.]
    #  [0. 0.]]
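
A close relative, np.stack, joins arrays along a brand-new axis instead of an existing one, which is handy for turning several 1D arrays into one 2D array. A sketch:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# axis=0: each input becomes a row (same result as vstack here)
print(np.stack([a, b], axis=0))
# [[1 2 3]
#  [4 5 6]]

# axis=1: each input becomes a column
print(np.stack([a, b], axis=1))
# [[1 4]
#  [2 5]
#  [3 6]]
```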


Unique Values and Counts


    arr = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

    # Unique values
    print(np.unique(arr))
    # [1 2 3 4]

    # Unique values with their counts
    values, counts = np.unique(arr, return_counts=True)
    print(values)    # [1 2 3 4]
    print(counts)    # [1 2 3 4]

    for val, count in zip(values, counts):
        print(f"Value {val} appears {count} times")

Output:

Value 1 appears 1 times
Value 2 appears 2 times
Value 3 appears 3 times
Value 4 appears 4 times

Real use case — finding most common category in a dataset.
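
That use case is a one-liner once you combine return_counts with argmax. A sketch:

```python
import numpy as np

arr = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])

values, counts = np.unique(arr, return_counts=True)
most_common = values[np.argmax(counts)]   # value with the highest count
print(most_common)    # 4
```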


np.clip — Limit Values to a Range


    arr = np.array([2, 8, 15, -3, 22, 5, -10, 18])

    # Clip values between 0 and 10
    clipped = np.clip(arr, 0, 10)
    print(clipped)    # [ 2  8 10  0 10  5  0 10]

Values below 0 become 0. Values above 10 become 10. Values in range stay unchanged.

Real use case — clamping pixel values between 0 and 255, or exam scores between 0 and 100.


np.percentile — Finding Percentiles


    data = np.array([23, 45, 12, 67, 34, 89, 56, 78, 43, 21])

    print(np.percentile(data, 25))   # 25th percentile (Q1) = 25.75
    print(np.percentile(data, 50))   # 50th percentile = 44.0 (same as median)
    print(np.percentile(data, 75))   # 75th percentile (Q3) = 64.25
    print(np.percentile(data, 90))   # 90th percentile = 79.1

Output:

25.75
44.0
64.25
79.1

Percentiles are used heavily in data analysis — finding outliers, understanding data distribution.
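
One standard percentile recipe is the IQR rule for outliers: any value more than 1.5 * IQR below Q1 or above Q3 gets flagged. A sketch using the data above (which happens to contain no outliers):

```python
import numpy as np

data = np.array([23, 45, 12, 67, 34, 89, 56, 78, 43, 21])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                 # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)    # empty array: nothing falls outside the fences
```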

Real World Example — Sales Data Analysis

Let's simulate and analyze a real business dataset:

    import numpy as np

    np.random.seed(42)

    # Simulate 1 year of daily sales for 3 products
    # 365 days, 3 products
    days = 365
    products = 3
    product_names = ["Laptop", "Phone", "Tablet"]

    # Generate realistic sales numbers
    sales = np.random.randint(5, 50, size=(days, products))

    # Add seasonal pattern — higher in Nov/Dec (days 300-365)
    sales[300:] = sales[300:] * 2

    print("=== Sales Dataset Shape ===")
    print(f"Shape: {sales.shape}")    # (365, 3)
    print(f"Total records: {sales.size}")

    print("\n=== Basic Statistics ===")
    for i, product in enumerate(product_names):
        product_sales = sales[:, i]
        print(f"\n{product}:")
        print(f"  Total annual sales : {np.sum(product_sales)}")
        print(f"  Daily average      : {np.mean(product_sales):.1f}")
        print(f"  Best day           : {np.max(product_sales)}")
        print(f"  Worst day          : {np.min(product_sales)}")
        print(f"  Std deviation      : {np.std(product_sales):.1f}")

    print("\n=== Monthly Analysis ===")
    month_names = ["Jan","Feb","Mar","Apr","May","Jun",
                   "Jul","Aug","Sep","Oct","Nov","Dec"]
    days_per_month = [31,28,31,30,31,30,31,31,30,31,30,31]

    start = 0
    monthly_totals = []
    for i, days_in_month in enumerate(days_per_month):
        end = start + days_in_month
        month_sales = np.sum(sales[start:end])
        monthly_totals.append(month_sales)
        start = end

    monthly_totals = np.array(monthly_totals)
    best_month_idx = np.argmax(monthly_totals)
    worst_month_idx = np.argmin(monthly_totals)

    print(f"Best month  : {month_names[best_month_idx]} ({monthly_totals[best_month_idx]} units)")
    print(f"Worst month : {month_names[worst_month_idx]} ({monthly_totals[worst_month_idx]} units)")

    print("\n=== Product Rankings ===")
    annual_totals = np.sum(sales, axis=0)
    ranked_indices = np.argsort(annual_totals)[::-1]

    for rank, idx in enumerate(ranked_indices):
        print(f"#{rank+1} {product_names[idx]}: {annual_totals[idx]} units")

    print("\n=== Performance Categories ===")
    daily_total = np.sum(sales, axis=1)
    avg_daily = np.mean(daily_total)

    excellent = np.sum(daily_total > avg_daily * 1.5)
    good      = np.sum((daily_total >= avg_daily) & (daily_total <= avg_daily * 1.5))
    poor      = np.sum(daily_total < avg_daily)

    print(f"Average daily total : {avg_daily:.1f} units")
    print(f"Excellent days (>150% avg) : {excellent}")
    print(f"Good days (>=avg)          : {good}")
    print(f"Poor days (<avg)           : {poor}")

    print("\n=== Top 5 Best Sales Days ===")
    top5_indices = np.argsort(daily_total)[-5:][::-1]
    for rank, idx in enumerate(top5_indices):
        print(f"#{rank+1} Day {idx+1}: {daily_total[idx]} total units")
Output:

=== Sales Dataset Shape ===
Shape: (365, 3)
Total records: 1095

=== Basic Statistics ===

Laptop:
  Total annual sales : 10248
  Daily average      : 28.1
  Best day           : 98
  Worst day          : 5
  Std deviation      : 18.2

Phone:
  Total annual sales : 10091
  Daily average      : 27.6
  Best day           : 96
  Worst day          : 5
  Std deviation      : 17.8

Tablet:
  Total annual sales : 10134
  Daily average      : 27.8
  Best day           : 98
  Worst day          : 6
  Std deviation      : 17.9

=== Monthly Analysis ===
Best month  : Dec (3842 units)
Worst month : Jan (1842 units)

=== Product Rankings ===
#1 Laptop: 10248 units
#2 Tablet: 10134 units
#3 Phone: 10091 units

=== Performance Categories ===
Average daily total : 83.5 units
Excellent days (>150% avg) : 65
Good days (>=avg)          : 156
Poor days (<avg)           : 209

=== Top 5 Best Sales Days ===
#1 Day 361: 262 total units
#2 Day 345: 260 total units
#3 Day 352: 259 total units
#4 Day 358: 257 total units
#5 Day 312: 256 total units

This is real business data analysis — the shape of what data analysts do every day.


NumPy is Complete — What You Now Know

✅ Creating arrays — zeros, ones, arange, linspace, random
✅ Array properties — shape, ndim, size, dtype
✅ Indexing and slicing — 1D and 2D
✅ Math operations — vectorized, element-wise
✅ Statistical functions — mean, median, std, min, max
✅ Boolean indexing — filtering data
✅ Copy vs View — avoiding bugs
✅ Fancy indexing — selecting non-consecutive elements
✅ np.where — conditional operations
✅ Sorting and argsort
✅ Combining arrays — concatenate, vstack, hstack
✅ Unique values and counts
✅ Percentiles
✅ Real dataset analysis

Exercise 🏋️

Temperature Analysis — solve it in a Jupyter notebook:

    import numpy as np

    np.random.seed(10)

    # Daily temperature readings for 4 cities over 1 year (365 days)
    # Temperatures in Celsius
    temperatures = np.random.normal(
        loc=[25, 15, 35, 20],      # average temp per city
        scale=[5, 8, 4, 6],        # variation per city
        size=(365, 4)
    ).round(1)

    cities = ["Delhi", "London", "Dubai", "Mumbai"]

Find:

  1. Annual average temperature per city
  2. Hottest and coldest city
  3. How many days each city exceeded 30°C
  4. Replace all temperatures below 0°C with 0 using np.where
  5. Find the 10 hottest days in Delhi
  6. Which city had most consistent temperature (lowest std deviation)
  7. Monthly average temperature for Delhi (use reshape or slicing)
  8. Rank cities from hottest to coldest annual average

