Pandas — A Python Library (Advanced Topics)

What We're Covering Today

  • Merging and Joining DataFrames
  • Pivot Tables
  • DateTime Operations
  • Real Kaggle Dataset Analysis
  • Data Cleaning on Messy Real Data

These are the skills that make you actually useful in a real data job.


Part 1 — Merging DataFrames

In real projects data lives in multiple tables. You need to combine them. This is exactly like SQL JOINs — same concept, Pandas syntax.


Setup — Create Sample Tables


    import pandas as pd
    import numpy as np

    # Employees table
    employees = pd.DataFrame({
        "emp_id":     [1, 2, 3, 4, 5],
        "name":       ["Rahul", "Priya", "Gagan", "Amit", "Neha"],
        "dept_id":    [101, 102, 101, 103, 102],
        "salary":     [85000, 62000, 92000, 48000, 71000]
    })

    # Departments table
    departments = pd.DataFrame({
        "dept_id":    [101, 102, 103, 104],
        "dept_name":  ["Engineering", "Marketing", "Sales", "HR"],
        "location":   ["Bangalore", "Mumbai", "Delhi", "Hyderabad"]
    })

    # Projects table
    projects = pd.DataFrame({
        "project_id": [1, 2, 3],
        "name":       ["Website Redesign", "Mobile App", "Data Pipeline"],
        "lead_emp_id":[1, 3, 1]
    })

    print("Employees:")
    print(employees)
    print("\nDepartments:")
    print(departments)


pd.merge() — The Main Merge Function

Inner Join — Only Matching Rows


    # Merge employees with departments
    result = pd.merge(employees, departments, on="dept_id")
    print(result)

Output:

   emp_id   name  dept_id  salary    dept_name   location
0       1  Rahul      101   85000  Engineering  Bangalore
1       2  Priya      102   62000    Marketing     Mumbai
2       3  Gagan      101   92000  Engineering  Bangalore
3       4   Amit      103   48000        Sales      Delhi
4       5   Neha      102   71000    Marketing     Mumbai

HR department (104) has no employees — not included. All employees have matching departments — all included.


Left Join — Keep All Left Rows


    result = pd.merge(employees, departments, on="dept_id", how="left")
    print(result)

Output:

   emp_id   name  dept_id  salary    dept_name   location
0       1  Rahul      101   85000  Engineering  Bangalore
1       2  Priya      102   62000    Marketing     Mumbai
2       3  Gagan      101   92000  Engineering  Bangalore
3       4   Amit      103   48000        Sales      Delhi
4       5   Neha      102   71000    Marketing     Mumbai

All employees kept. If no matching department — NaN. Same as SQL LEFT JOIN.


Right Join — Keep All Right Rows


    result = pd.merge(employees, departments, on="dept_id", how="right")
    print(result)

Output:

   emp_id   name  dept_id   salary    dept_name   location
0     1.0  Rahul      101  85000.0  Engineering  Bangalore
1     3.0  Gagan      101  92000.0  Engineering  Bangalore
2     2.0  Priya      102  62000.0    Marketing     Mumbai
3     5.0   Neha      102  71000.0    Marketing     Mumbai
4     4.0   Amit      103  48000.0        Sales      Delhi
5     NaN    NaN      104      NaN           HR  Hyderabad

Output includes HR department even though it has no employees — with NaN in employee columns.


Outer Join — Keep Everything


    result = pd.merge(employees, departments, on="dept_id", how="outer")
    print(result)

Output:

   emp_id   name  dept_id   salary    dept_name   location
0     1.0  Rahul      101  85000.0  Engineering  Bangalore
1     3.0  Gagan      101  92000.0  Engineering  Bangalore
2     2.0  Priya      102  62000.0    Marketing     Mumbai
3     5.0   Neha      102  71000.0    Marketing     Mumbai
4     4.0   Amit      103  48000.0        Sales      Delhi
5     NaN    NaN      104      NaN           HR  Hyderabad

All rows from both tables. Missing values filled with NaN.
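A handy extra when debugging joins: indicator=True (a built-in pd.merge option) adds a _merge column telling you which side each row came from. A small sketch with trimmed-down tables:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2], "name": ["Rahul", "Priya"], "dept_id": [101, 102]
})
departments = pd.DataFrame({
    "dept_id": [101, 104], "dept_name": ["Engineering", "HR"]
})

# indicator=True adds a _merge column: "both", "left_only", or "right_only"
result = pd.merge(employees, departments, on="dept_id", how="outer", indicator=True)
print(result[["dept_id", "_merge"]])

# Classic "anti-join": departments that have no employees
empty_depts = result[result["_merge"] == "right_only"]
print(empty_depts)
```

This is also the standard way to find rows that failed to match after a merge.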


Merge on Different Column Names


    # left table has 'lead_emp_id', right has 'emp_id'
    result = pd.merge(
        projects,
        employees,
        left_on="lead_emp_id",
        right_on="emp_id"
    )
    print(result[["name_x", "name_y", "salary"]])
    # name_x = project name, name_y = employee name

Output:

             name_x name_y  salary
0  Website Redesign  Rahul   85000
1        Mobile App  Gagan   92000
2     Data Pipeline  Rahul   85000

When both tables have columns with the same name, Pandas automatically adds _x (left table) and _y (right table) suffixes.
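Instead of accepting _x/_y, you can pick the suffixes yourself via the suffixes parameter of pd.merge. A small sketch with trimmed-down tables:

```python
import pandas as pd

projects = pd.DataFrame({
    "project_id":  [1, 2],
    "name":        ["Website Redesign", "Mobile App"],
    "lead_emp_id": [1, 3]
})
employees = pd.DataFrame({
    "emp_id": [1, 3],
    "name":   ["Rahul", "Gagan"],
    "salary": [85000, 92000]
})

# suffixes applies to every clashing column name
result = pd.merge(
    projects, employees,
    left_on="lead_emp_id", right_on="emp_id",
    suffixes=("_project", "_lead")
)
print(result[["name_project", "name_lead", "salary"]])
```

This gives readable column names in one step, with no rename needed afterwards.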


Rename After Merge to Clean Up


    result = pd.merge(
        projects,
        employees,
        left_on="lead_emp_id",
        right_on="emp_id"
    ).rename(columns={
        "name_x": "project_name",
        "name_y": "lead_name"
    })

    print(result[["project_name", "lead_name", "salary"]])

Output:

       project_name lead_name  salary
0  Website Redesign     Rahul   85000
1        Mobile App     Gagan   92000
2     Data Pipeline     Rahul   85000

Multiple Merges — Chaining


    # Merge employees with departments, then with projects
    full_data = pd.merge(employees, departments, on="dept_id") \
                .merge(projects, left_on="emp_id", right_on="lead_emp_id", how="left")

    print(full_data[["name_x", "dept_name", "salary", "name_y"]])

Output:

  name_x    dept_name  salary            name_y
0  Rahul  Engineering   85000  Website Redesign
1  Rahul  Engineering   85000     Data Pipeline
2  Priya    Marketing   62000               NaN
3  Gagan  Engineering   92000        Mobile App
4   Amit        Sales   48000               NaN
5   Neha    Marketing   71000               NaN

The \ at end of line is Python line continuation — lets you split one expression across multiple lines.
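When chaining merges it is easy to duplicate rows by accident if a key you assumed was unique is not. The validate parameter of pd.merge catches this early; a small sketch:

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3], "dept_id": [101, 102, 101]})
departments = pd.DataFrame({"dept_id": [101, 102], "dept_name": ["Eng", "Mkt"]})

# Each employee maps to exactly one department: many-to-one
ok = pd.merge(employees, departments, on="dept_id", validate="many_to_one")
print(len(ok))  # 3

# If the right table accidentally had duplicate keys, the same call would
# raise pandas.errors.MergeError instead of silently duplicating rows
bad_departments = pd.DataFrame({"dept_id": [101, 101], "dept_name": ["Eng", "Eng2"]})
try:
    pd.merge(employees, bad_departments, on="dept_id", validate="many_to_one")
except Exception as e:
    print("Merge rejected:", type(e).__name__)
```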


concat — Stack DataFrames

Different from merge — concat stacks DataFrames on top of each other (same columns):


    df_2022 = pd.DataFrame({
        "name":   ["Rahul", "Priya"],
        "salary": [80000, 58000],
        "year":   [2022, 2022]
    })

    df_2023 = pd.DataFrame({
        "name":   ["Rahul", "Priya", "Gagan"],
        "salary": [85000, 62000, 92000],
        "year":   [2023, 2023, 2023]
    })

    combined = pd.concat([df_2022, df_2023], ignore_index=True)
    print(combined)

Output:

    name  salary  year
0  Rahul   80000  2022
1  Priya   58000  2022
2  Rahul   85000  2023
3  Priya   62000  2023
4  Gagan   92000  2023

ignore_index=True — resets index to 0,1,2... instead of keeping original indices.
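Two more concat options worth knowing: keys labels which source DataFrame each row came from, and axis=1 glues tables side by side instead of stacking. A small sketch:

```python
import pandas as pd

df_2022 = pd.DataFrame({"name": ["Rahul", "Priya"], "salary": [80000, 58000]})
df_2023 = pd.DataFrame({"name": ["Rahul", "Priya"], "salary": [85000, 62000]})

# keys= builds a MultiIndex marking the source of each row
labeled = pd.concat([df_2022, df_2023], keys=[2022, 2023])
print(labeled.loc[2023])          # just the 2023 rows

# axis=1 joins on the index, side by side
side_by_side = pd.concat(
    [df_2022.set_index("name"), df_2023.set_index("name")],
    axis=1, keys=["y2022", "y2023"]
)
print(side_by_side)
```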


Part 2 — Pivot Tables

  • A pivot table summarizes, analyzes, and reorganizes data quickly, without changing the original dataset (the same idea as Excel pivot tables).
  • It turns raw data into meaningful insights by grouping and aggregating values.

What Pivot Table Can Do

  • Group data → by category (e.g., department)
  • Aggregate values → sum, average, count, min, max
  • Filter data → show only what you need
  • Rearrange structure → rows ↔ columns (pivot)

Why it's called "Pivot"

Because you can rotate (pivot) the data view:

  • Rows → Columns
  • Columns → Rows
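The inverse rotation, columns back into rows, is pd.melt. It is not used in the examples below, but seeing it next to pivot makes the idea concrete:

```python
import pandas as pd

# Wide format: one column per month
wide = pd.DataFrame({
    "name": ["Rahul", "Priya"],
    "Jan":  [50000, 72000],
    "Feb":  [61000, 68000]
})

# melt "un-pivots" back to long format: one row per (name, month)
long_df = pd.melt(wide, id_vars="name", var_name="month", value_name="sales")
print(long_df)
```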

Basic Pivot Table


    data = {
        "name":       ["Rahul", "Priya", "Gagan", "Rahul", "Priya", "Gagan"],
        "month":      ["Jan",   "Jan",   "Jan",   "Feb",   "Feb",   "Feb"],
        "sales":      [50000,   72000,   43000,   61000,   68000,   55000],
        "department": ["Eng",   "Mkt",   "Eng",   "Eng",   "Mkt",   "Eng"]
    }

    df = pd.DataFrame(data)

    # Pivot — rows=name, columns=month, values=sales
    pivot = df.pivot_table(
        values="sales",
        index="name",
        columns="month",
        aggfunc="sum"
    )

    print(pivot)

Output:

month    Feb    Jan
name
Gagan  55000  43000
Priya  68000  72000
Rahul  61000  50000

Instantly see each person's sales per month — just like Excel.
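A pivot table like this is equivalent to a groupby followed by unstack; comparing the two routes is a good way to demystify pivot_table:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Rahul", "Priya", "Rahul", "Priya"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "sales": [50000, 72000, 61000, 68000]
})

# Route 1: pivot_table
via_pivot = df.pivot_table(values="sales", index="name",
                           columns="month", aggfunc="sum")

# Route 2: groupby on both keys, then rotate months into columns
via_groupby = df.groupby(["name", "month"])["sales"].sum().unstack()

print(via_pivot.equals(via_groupby))  # True: same values, same labels
```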


Pivot with Multiple Aggregations


    pivot = df.pivot_table(
        values="sales",
        index="name",
        columns="month",
        aggfunc=["sum", "mean"],
        fill_value=0           # fill missing combinations with 0
    )
    print(pivot)


Output:
         sum            mean         
month    Feb    Jan      Feb      Jan
name                                 
Gagan  55000  43000  55000.0  43000.0
Priya  68000  72000  68000.0  72000.0
Rahul  61000  50000  61000.0  50000.0

Pivot Table with Margins — Row/Column Totals


    pivot = df.pivot_table(
        values="sales",
        index="department",
        columns="month",
        aggfunc="sum",
        margins=True,          # add Total row and column
        margins_name="Total"
    )
    print(pivot)

Output:

month       Feb     Jan   Total
department
Eng       116000   93000  209000
Mkt        68000   72000  140000
Total     184000  165000  349000

Part 3 — DateTime Operations

Real-world data almost always has dates. Pandas has powerful datetime tools.

Creating DateTime Data


    df = pd.DataFrame({
        "order_id": [1, 2, 3, 4, 5],
        "date":     ["2024-01-15", "2024-02-20", "2024-03-10",
                    "2024-04-05", "2024-05-22"],
        "amount":   [5000, 12000, 8500, 15000, 9200]
    })

    print(df.dtypes)
    # date column is still plain text (str/object dtype), not a real date


Output:

order_id    int64
date          str
amount      int64
dtype: object

(Older Pandas versions print object for the date column instead of str.)

Converting to DateTime


    df["date"] = pd.to_datetime(df["date"])
    print(df.dtypes)
    # date is now datetime64 (unit ns or us, depending on Pandas version)

Output:

order_id             int64
date        datetime64[us]
amount               int64
dtype: object

Now date column has proper datetime type — you can do date operations.
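Real files rarely contain clean ISO dates. pd.to_datetime can be told the exact format, and errors="coerce" turns unparseable values into NaT instead of raising. A small sketch:

```python
import pandas as pd

raw = pd.Series(["15/01/2024", "20/02/2024", "not a date"])

# format= tells Pandas exactly how to read the string;
# errors="coerce" turns anything unparseable into NaT (datetime NaN)
dates = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(dates)
print(dates.isna().sum())  # 1 bad value
```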


Extracting Date Components


    df["year"]    = df["date"].dt.year
    df["month"]   = df["date"].dt.month
    df["day"]     = df["date"].dt.day
    df["weekday"] = df["date"].dt.day_name()    # Monday, Tuesday...
    df["quarter"] = df["date"].dt.quarter

    print(df[["date", "year", "month", "day", "weekday", "quarter"]])

Output:

        date  year  month  day   weekday  quarter
0 2024-01-15  2024      1   15    Monday        1
1 2024-02-20  2024      2   20   Tuesday        1
2 2024-03-10  2024      3   10    Sunday        1
3 2024-04-05  2024      4    5    Friday        2
4 2024-05-22  2024      5   22  Wednesday       2

DateTime Filtering


    # Filter by date range
    mask = (df["date"] >= "2024-02-01") & (df["date"] <= "2024-04-30")
    filtered = df[mask]
    print(filtered)

    # Filter by month
    march_orders = df[df["date"].dt.month == 3]
    print(march_orders)

    # Filter by weekday
    weekends = df[df["date"].dt.dayofweek >= 5]  # 5=Sat, 6=Sun
    print(weekends)


Output:

   order_id       date  amount  year  month  day  weekday  quarter
1         2 2024-02-20   12000  2024      2   20  Tuesday        1
2         3 2024-03-10    8500  2024      3   10   Sunday        1
3         4 2024-04-05   15000  2024      4    5   Friday        2
   order_id       date  amount  year  month  day weekday  quarter
2         3 2024-03-10    8500  2024      3   10  Sunday        1
   order_id       date  amount  year  month  day weekday  quarter
2         3 2024-03-10    8500  2024      3   10  Sunday        1

DateTime Arithmetic


    from datetime import datetime

    df["date"] = pd.to_datetime(df["date"])

    # Days since order
    today = pd.Timestamp("2024-06-01")
    df["days_ago"] = (today - df["date"]).dt.days

    # Add days to a date
    df["followup_date"] = df["date"] + pd.Timedelta(days=30)

    print(df[["date", "days_ago", "followup_date"]])

Output:

        date  days_ago followup_date
0 2024-01-15       138    2024-02-14
1 2024-02-20       101    2024-03-21
2 2024-03-10        83    2024-04-09
3 2024-04-05        57    2024-05-05
4 2024-05-22        10    2024-06-21

Resampling — Group by Time Period


    # Create daily sales data
    dates = pd.date_range(start="2024-01-01", end="2024-03-31", freq="D")
    np.random.seed(42)
    daily_sales = pd.DataFrame({
        "date":  dates,
        "sales": np.random.randint(1000, 10000, len(dates))
    })

    daily_sales = daily_sales.set_index("date")

    # Resample to weekly total
    weekly = daily_sales.resample("W").sum()
    print(weekly.head())

    # Resample to monthly total
    monthly = daily_sales.resample("ME").sum()   # "ME" = month end; use "M" on Pandas < 2.2
    print(monthly)

    # Resample to monthly average
    monthly_avg = daily_sales.resample("ME").mean()
    print(monthly_avg)

Output:

            sales
date             
2024-01-07  38176
2024-01-14  37162
2024-01-21  39302
2024-01-28  33790
2024-02-04  30647
             sales
date
2024-01-31  160992
2024-02-29  172585
2024-03-31  189367
                  sales
date
2024-01-31  5193.290323
2024-02-29  5951.206897
2024-03-31  6108.612903

resample() is like groupby but specifically for time series data.


Part 4 — Real Dataset Analysis

Now let's put everything together on a complete dataset. We'll use a sales dataset that simulates the messiness of real-world data.

Create real_sales.csv:

order_id,customer_name,product,category,quantity,unit_price,order_date,city,status
1001,Rahul Sharma,Laptop Pro,Electronics,1,75000,2024-01-05,Delhi,Delivered
1002,Priya Patel,iPhone 15,Electronics,2,80000,2024-01-08,Mumbai,Delivered
1003,gagan singh,Office Desk,Furniture,1,15000,2024-01-10,bangalore,Delivered
1004,AMIT KUMAR,Wireless Mouse,Electronics,3,1500,2024-01-15,Delhi,Cancelled
1005,Neha Gupta,Standing Desk,Furniture,1,22000,2024-01-18,Mumbai,Delivered
1006,Ravi Verma,Laptop Pro,Electronics,1,75000,2024-01-22,Hyderabad,Returned
1007,sneha joshi,Office Chair,Furniture,4,8500,2024-01-25,Chennai,Delivered
1008,Kiran Rao,iPad Air,Electronics,2,55000,2024-02-02,Delhi,Delivered
1009,Arjun Nair,Bookshelf,Furniture,2,6000,2024-02-05,Mumbai,Delivered
1010,Pooja Shah,Laptop Pro,Electronics,1,75000,2024-02-10,Bangalore,Delivered
1011,Dev Mishra,Keyboard,Electronics,5,3500,2024-02-15,Delhi,Delivered
1012,Ananya Roy,Monitor,Electronics,2,18000,2024-02-20,Mumbai,Delivered
1013,Rahul Sharma,Office Chair,Furniture,2,8500,2024-03-01,Delhi,Delivered
1014,priya patel,Tablet Stand,Electronics,3,2500,2024-03-05,Mumbai,Delivered
1015,Vikram Singh,Laptop Pro,Electronics,2,75000,2024-03-10,Pune,Delivered
1016,,Wireless Mouse,Electronics,2,1500,2024-03-15,Delhi,Delivered
1017,Meena Iyer,Standing Desk,Furniture,,22000,2024-03-20,Chennai,Delivered
1018,Suresh Kumar,Monitor,Electronics,1,18000,2024-03-25,Hyderabad,Delivered
1019,Kavya Reddy,Bookshelf,Furniture,3,6000,2024-04-02,Bangalore,Delivered
1020,Rohit Gupta,iPad Air,Electronics,1,55000,2024-04-08,Delhi,Cancelled

Complete Data Cleaning + Analysis

    import pandas as pd
    import numpy as np

    # ── Step 1: Load Data ─────────────────────────────
    df = pd.read_csv("real_sales.csv")

    print("=== RAW DATA ===")
    print(f"Shape: {df.shape}")
    print(df.head())
    print("\nMissing values:")
    print(df.isnull().sum())


    # ── Step 2: Clean Data ───────────────────────────

    # Fix customer names — inconsistent casing
    df["customer_name"] = df["customer_name"].str.strip().str.title()
    df["city"] = df["city"].str.strip().str.title()

    # Fill missing customer name
    df["customer_name"] = df["customer_name"].fillna("Unknown Customer")

    # Fill missing quantity with median
    df["quantity"] = df["quantity"].fillna(df["quantity"].median())
    df["quantity"] = df["quantity"].astype(int)

    # Convert date
    df["order_date"] = pd.to_datetime(df["order_date"])

    print("\n=== CLEANED DATA ===")
    print(df.head())
    print("\nMissing values after cleaning:")
    print(df.isnull().sum())


    # ── Step 3: Feature Engineering ──────────────────

    # Revenue column
    df["revenue"] = df["quantity"] * df["unit_price"]

    # Date features
    df["month"]   = df["order_date"].dt.month
    df["month_name"] = df["order_date"].dt.strftime("%B")
    df["weekday"] = df["order_date"].dt.day_name()

    # Order size category
    df["order_size"] = pd.cut(
        df["revenue"],
        bins=[0, 10000, 50000, 100000, float("inf")],
        labels=["Small", "Medium", "Large", "Enterprise"]
    )

    print("\n=== AFTER FEATURE ENGINEERING ===")
    print(df[["order_id", "customer_name", "revenue", "order_size"]].head(8))


    # ── Step 4: Analysis ──────────────────────────────

    print("\n" + "=" * 50)
    print("         SALES ANALYSIS REPORT")
    print("=" * 50)

    # Only delivered orders for revenue analysis
    delivered = df[df["status"] == "Delivered"]

    print(f"\nTotal Orders     : {len(df)}")
    print(f"Delivered        : {len(df[df['status'] == 'Delivered'])}")
    print(f"Cancelled        : {len(df[df['status'] == 'Cancelled'])}")
    print(f"Returned         : {len(df[df['status'] == 'Returned'])}")

    print(f"\nTotal Revenue (Delivered): Rs.{delivered['revenue'].sum():,.0f}")
    print(f"Average Order Value      : Rs.{delivered['revenue'].mean():,.0f}")

    print("\n--- Revenue by Category ---")
    cat_revenue = delivered.groupby("category")["revenue"].agg(
        total="sum",
        orders="count",
        avg="mean"
    ).round(0)
    print(cat_revenue.to_string())

    print("\n--- Monthly Revenue ---")
    monthly = delivered.groupby("month_name")["revenue"].sum()
    months_order = ["January", "February", "March", "April"]
    for month in months_order:
        if month in monthly.index:
            amount = monthly[month]
            bar = "█" * int(amount / 20000)
            print(f"  {month:<10} {bar} Rs.{amount:,.0f}")

    print("\n--- Top 5 Products ---")
    top_products = delivered.groupby("product")["revenue"].sum() \
                            .sort_values(ascending=False).head(5)
    for product, revenue in top_products.items():
        print(f"  {product:<20} Rs.{revenue:,.0f}")

    print("\n--- Revenue by City ---")
    city_revenue = delivered.groupby("city")["revenue"].sum() \
                            .sort_values(ascending=False)
    for city, revenue in city_revenue.items():
        print(f"  {city:<12} Rs.{revenue:,.0f}")

    print("\n--- Best Customers ---")
    customer_revenue = delivered.groupby("customer_name")["revenue"].sum() \
                                .sort_values(ascending=False).head(5)
    for customer, revenue in customer_revenue.items():
        print(f"  {customer:<20} Rs.{revenue:,.0f}")

    print("\n--- Order Size Distribution ---")
    size_dist = df["order_size"].value_counts()
    for size, count in size_dist.items():
        print(f"  {size:<12} {count} orders")


    # ── Step 5: Save Clean Data ───────────────────────
    df.to_csv("cleaned_sales.csv", index=False)
    print("\n✅ Cleaned data saved to cleaned_sales.csv")

Output:

==================================================
         SALES ANALYSIS REPORT
==================================================

Total Orders     : 20
Delivered        : 17
Cancelled        : 2
Returned         : 1

Total Revenue (Delivered): Rs.814,000
Average Order Value      : Rs.47,882

--- Revenue by Category ---
              total  orders      avg
category
Electronics  652000      10  65200.0
Furniture    162000       7  23143.0

--- Monthly Revenue ---
  January    ███████████████ Rs.306,000
  February   ████████████ Rs.250,500
  March      ███████████ Rs.239,500
  April       Rs.18,000

--- Top 5 Products ---
  Laptop Pro           Rs.300,000
  iPhone 15            Rs.160,000
  iPad Air             Rs.110,000
  Standing Desk        Rs.66,000
  Monitor              Rs.54,000

--- Revenue by City ---
  Mumbai       Rs.237,500
  Delhi        Rs.222,500
  Pune         Rs.150,000
  Bangalore    Rs.108,000
  Chennai      Rs.78,000
  Hyderabad    Rs.18,000

--- Best Customers ---
  Priya Patel          Rs.167,500
  Vikram Singh         Rs.150,000
  Kiran Rao            Rs.110,000
  Rahul Sharma         Rs.92,000
  Pooja Shah           Rs.75,000

--- Order Size Distribution ---
  Medium       10 orders
  Large        4 orders
  Small        3 orders
  Enterprise   3 orders

✅ Cleaned data saved to cleaned_sales.csv

pd.cut and pd.qcut — Binning Data

You saw pd.cut above. Let's understand both:

    ages = pd.Series([5, 15, 25, 35, 45, 55, 65, 75])

    # pd.cut — fixed bins you define
    age_groups = pd.cut(
        ages,
        bins=[0, 18, 35, 60, 100],
        labels=["Child", "Young Adult", "Adult", "Senior"]
    )
    print(age_groups)
    # 0         Child
    # 1         Child
    # 2    Young Adult
    # 3    Young Adult
    # 4         Adult
    # 5         Adult
    # 6        Senior
    # 7        Senior

    # pd.qcut — equal sized bins (by quantile)
    salary = pd.Series([25000, 35000, 45000, 55000, 65000, 75000, 85000, 95000])
    quartiles = pd.qcut(salary, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
    print(quartiles)

pd.cut — you define the boundary values. pd.qcut — you define the number of groups, and Pandas figures out the boundaries so each group has (roughly) equal count.
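If you want to see exactly where qcut drew the lines, retbins=True (a standard pd.qcut option) also returns the computed boundaries:

```python
import pandas as pd

salary = pd.Series([25000, 35000, 45000, 55000, 65000, 75000, 85000, 95000])

# retbins=True also returns the quantile edges Pandas chose
quartiles, bins = pd.qcut(salary, q=4, labels=["Q1", "Q2", "Q3", "Q4"],
                          retbins=True)
print(bins)                      # 5 edges for 4 buckets
print(quartiles.value_counts())  # 2 values in each bucket
```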


Method Chaining — Professional Pandas Style

Instead of creating many intermediate variables, chain operations:

    # Unpythonic — too many intermediate variables
    df1 = df[df["status"] == "Delivered"]
    df2 = df1.groupby("category")["revenue"].sum()
    df3 = df2.sort_values(ascending=False)
    df4 = df3.reset_index()
    print(df4)

    # Pythonic — method chaining
    result = (
        df
        .loc[df["status"] == "Delivered"]
        .groupby("category")["revenue"]
        .sum()
        .sort_values(ascending=False)
        .reset_index()
    )
    print(result)

Both give the same result. Method chaining is cleaner and more readable. Wrap in parentheses () to split across lines.
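Two helpers make longer chains practical: assign adds columns mid-chain, and query filters rows without a separate mask variable. A small sketch on made-up sales-style data:

```python
import pandas as pd

df = pd.DataFrame({
    "status":     ["Delivered", "Cancelled", "Delivered"],
    "quantity":   [2, 1, 3],
    "unit_price": [500, 800, 200]
})

result = (
    df
    .assign(revenue=lambda d: d["quantity"] * d["unit_price"])  # new column mid-chain
    .query("status == 'Delivered'")                             # filter, no mask variable
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)
print(result)
```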


Summary — What You Now Know in Pandas

✅ Series and DataFrame creation
✅ Loading CSV, Excel, JSON
✅ Exploring data — head, info, describe, shape
✅ Selecting — columns, iloc, loc, conditions
✅ Adding and modifying columns
✅ Missing value handling — isnull, fillna, dropna
✅ GroupBy and aggregation
✅ Sorting and ranking
✅ String operations with .str
✅ apply() with functions and lambdas
✅ Merging DataFrames — inner, left, right, outer
✅ Concatenating DataFrames
✅ Pivot tables
✅ DateTime operations
✅ Resampling time series
✅ pd.cut and pd.qcut for binning
✅ Method chaining
✅ Full data cleaning workflow
✅ Saving cleaned data

Exercise 🏋️

Complete Data Analysis Project:

Download the Titanic dataset from Kaggle (free, but it requires an account), or load it directly from this URL in Pandas:

    url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
    df = pd.read_csv(url)

Analyze it and answer:

  1. How many passengers survived vs died?
  2. Survival rate by passenger class (Pclass)
  3. Average age of survivors vs non-survivors
  4. Survival rate by gender
  5. Fill missing Age values with median age per class
  6. Create age groups: Child (0-12), Teen (13-19), Adult (20-60), Senior (60+)
  7. Survival rate per age group
  8. Top 10 most expensive tickets
  9. Which embarkation port had the highest survival rate? (the Embarked column: C, Q, S)
  10. Create a clean summary table — class, gender, survival rate, avg fare

This is a classic data science dataset used in every interview. Knowing it well is genuinely valuable.
