What We're Covering Today
- Merging and Joining DataFrames
- Pivot Tables
- DateTime Operations
- Real Kaggle Dataset Analysis
- Data Cleaning on Messy Real Data
These are the skills that make you actually useful in a real data job.
Part 1 — Merging DataFrames
In real projects data lives in multiple tables. You need to combine them. This is exactly like SQL JOINs — same concept, Pandas syntax.
Setup — Create Sample Tables
import pandas as pd
import numpy as np
# Employees table
employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4, 5],
    "name": ["Rahul", "Priya", "Gagan", "Amit", "Neha"],
    "dept_id": [101, 102, 101, 103, 102],
    "salary": [85000, 62000, 92000, 48000, 71000]
})
# Departments table
departments = pd.DataFrame({
    "dept_id": [101, 102, 103, 104],
    "dept_name": ["Engineering", "Marketing", "Sales", "HR"],
    "location": ["Bangalore", "Mumbai", "Delhi", "Hyderabad"]
})
# Projects table
projects = pd.DataFrame({
    "project_id": [1, 2, 3],
    "name": ["Website Redesign", "Mobile App", "Data Pipeline"],
    "lead_emp_id": [1, 3, 1]
})
print("Employees:")
print(employees)
print("\nDepartments:")
print(departments)
pd.merge() — The Main Merge Function
Inner Join — Only Matching Rows
# Merge employees with departments
result = pd.merge(employees, departments, on="dept_id")
print(result)
Output:
emp_id name dept_id salary dept_name location
0 1 Rahul 101 85000 Engineering Bangalore
1 3 Gagan 101 92000 Engineering Bangalore
2 2 Priya 102 62000 Marketing Mumbai
3 5 Neha 102 71000 Marketing Mumbai
4 4 Amit 103 48000 Sales Delhi
HR department (104) has no employees — not included. All employees have matching departments — all included.
Left Join — Keep All Left Rows
result = pd.merge(employees, departments, on="dept_id", how="left")
print(result)
Output:
   emp_id   name  dept_id  salary    dept_name   location
0       1  Rahul      101   85000  Engineering  Bangalore
1       2  Priya      102   62000    Marketing     Mumbai
2       3  Gagan      101   92000  Engineering  Bangalore
3       4   Amit      103   48000        Sales      Delhi
4       5   Neha      102   71000    Marketing     Mumbai
All employees kept. If no matching department — NaN. Same as SQL LEFT JOIN.
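Every employee above has a matching department, so the NaN case never actually shows up in this output. A tiny sketch with a hypothetical extra employee (made-up data, not from the tables above) in a department that doesn't exist makes it visible:

```python
import pandas as pd

# Hypothetical data: Tara sits in dept_id 999, which has no departments row
employees = pd.DataFrame({
    "emp_id": [1, 6],
    "name": ["Rahul", "Tara"],
    "dept_id": [101, 999],
})
departments = pd.DataFrame({
    "dept_id": [101],
    "dept_name": ["Engineering"],
})

# Left join keeps Tara's row; the unmatched dept_name becomes NaN
result = pd.merge(employees, departments, on="dept_id", how="left")
print(result)
```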
Right Join — Keep All Right Rows
result = pd.merge(employees, departments, on="dept_id", how="right")
print(result)
Output:
   emp_id   name  dept_id   salary    dept_name   location
0     1.0  Rahul      101  85000.0  Engineering  Bangalore
1     3.0  Gagan      101  92000.0  Engineering  Bangalore
2     2.0  Priya      102  62000.0    Marketing     Mumbai
3     5.0   Neha      102  71000.0    Marketing     Mumbai
4     4.0   Amit      103  48000.0        Sales      Delhi
5     NaN    NaN      104      NaN           HR  Hyderabad
Output includes HR department even though it has no employees — with NaN in employee columns.
Outer Join — Keep Everything
result = pd.merge(employees, departments, on="dept_id", how="outer")
print(result)
Output:
   emp_id   name  dept_id   salary    dept_name   location
0     1.0  Rahul      101  85000.0  Engineering  Bangalore
1     3.0  Gagan      101  92000.0  Engineering  Bangalore
2     2.0  Priya      102  62000.0    Marketing     Mumbai
3     5.0   Neha      102  71000.0    Marketing     Mumbai
4     4.0   Amit      103  48000.0        Sales      Delhi
5     NaN    NaN      104      NaN           HR  Hyderabad
All rows from both tables. Missing values filled with NaN.
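When debugging outer joins, merge's `indicator=True` adds a `_merge` column telling you which table each row came from. A small sketch on trimmed versions of the tables:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2],
    "dept_id": [101, 102],   # 102 has no departments row
})
departments = pd.DataFrame({
    "dept_id": [101, 104],   # 104 has no employees
    "dept_name": ["Engineering", "HR"],
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
result = pd.merge(employees, departments, on="dept_id",
                  how="outer", indicator=True)
print(result)
```

Filtering on `_merge == "left_only"` is a quick way to find rows that failed to match.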
Merge on Different Column Names
# Left table has 'lead_emp_id', right table has 'emp_id'
result = pd.merge(
    projects, employees,
    left_on="lead_emp_id", right_on="emp_id"
)
print(result[["name_x", "name_y", "salary"]])  # name_x = project name, name_y = employee name
Output:
             name_x name_y  salary
0  Website Redesign  Rahul   85000
1        Mobile App  Gagan   92000
2     Data Pipeline  Rahul   85000
When both tables have a column with the same name, Pandas adds _x and _y suffixes automatically.
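If you'd rather pick the suffixes yourself instead of accepting _x/_y, merge takes a `suffixes` parameter. A quick sketch:

```python
import pandas as pd

projects = pd.DataFrame({"project_id": [1], "name": ["Website Redesign"],
                         "lead_emp_id": [1]})
employees = pd.DataFrame({"emp_id": [1], "name": ["Rahul"], "salary": [85000]})

# suffixes controls what gets appended to clashing column names
result = pd.merge(projects, employees,
                  left_on="lead_emp_id", right_on="emp_id",
                  suffixes=("_project", "_employee"))
print(result[["name_project", "name_employee"]])
```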
Rename After Merge to Clean Up
result = pd.merge(
    projects, employees,
    left_on="lead_emp_id", right_on="emp_id"
).rename(columns={
    "name_x": "project_name",
    "name_y": "lead_name"
})
print(result[["project_name", "lead_name", "salary"]])
Output:
project_name lead_name salary
0 Website Redesign Rahul 85000
1 Mobile App Gagan 92000
2 Data Pipeline Rahul 85000
Multiple Merges — Chaining
# Merge employees with departments, then with projects
full_data = pd.merge(employees, departments, on="dept_id") \
    .merge(projects, left_on="emp_id", right_on="lead_emp_id", how="left")
print(full_data[["name_x", "dept_name", "salary", "name_y"]])
Output:
  name_x    dept_name  salary            name_y
0  Rahul  Engineering   85000  Website Redesign
1  Rahul  Engineering   85000     Data Pipeline
2  Priya    Marketing   62000               NaN
3  Gagan  Engineering   92000        Mobile App
4   Amit        Sales   48000               NaN
5   Neha    Marketing   71000               NaN
The \ at end of line is Python line continuation — lets you split one expression across multiple lines.
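The same chain can also be written without backslashes by wrapping the whole expression in parentheses, which is the style most Pandas code uses (tables recreated inline so this sketch runs standalone):

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1], "dept_id": [101]})
departments = pd.DataFrame({"dept_id": [101], "dept_name": ["Engineering"]})
projects = pd.DataFrame({"lead_emp_id": [1], "project_id": [1]})

# Parentheses let the expression span lines — no trailing backslashes needed
full_data = (
    pd.merge(employees, departments, on="dept_id")
    .merge(projects, left_on="emp_id", right_on="lead_emp_id", how="left")
)
print(full_data)
```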
concat — Stack DataFrames
Different from merge — concat stacks DataFrames on top of each other (same columns):
df_2022 = pd.DataFrame({
    "name": ["Rahul", "Priya"],
    "salary": [80000, 58000],
    "year": [2022, 2022]
})
df_2023 = pd.DataFrame({
    "name": ["Rahul", "Priya", "Gagan"],
    "salary": [85000, 62000, 92000],
    "year": [2023, 2023, 2023]
})
combined = pd.concat([df_2022, df_2023], ignore_index=True)
print(combined)
Output:
name salary year
0 Rahul 80000 2022
1 Priya 58000 2022
2 Rahul 85000 2023
3 Priya 62000 2023
4 Gagan 92000 2023
ignore_index=True — resets index to 0,1,2... instead of keeping original indices.
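concat can also glue DataFrames side by side with `axis=1`, aligning rows on the index — a quick sketch:

```python
import pandas as pd

names = pd.DataFrame({"name": ["Rahul", "Priya"]})
pay = pd.DataFrame({"salary": [85000, 62000]})

# axis=1 joins column-wise; rows are matched by index label
side_by_side = pd.concat([names, pay], axis=1)
print(side_by_side)
```

If the indexes don't line up, concat fills the gaps with NaN, so make sure both frames share the same index before stacking sideways.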
Part 2 — Pivot Tables
- Pivot tables reorganize data for analysis — the same idea as Excel pivot tables.
- A pivot table summarizes, analyzes, and reorganizes data quickly, without changing the original dataset.
- It turns raw data into meaningful insights by grouping and aggregating values.
What Pivot Table Can Do
- Group data → by category (e.g., department)
- Aggregate values → sum, average, count, min, max
- Filter data → show only what you need
- Rearrange structure → rows ↔ columns (pivot)
Why it's called "Pivot"
Because you can rotate (pivot) the data view:
- Rows → Columns
- Columns → Rows
Basic Pivot Table
data = {
    "name": ["Rahul", "Priya", "Gagan", "Rahul", "Priya", "Gagan"],
    "month": ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "sales": [50000, 72000, 43000, 61000, 68000, 55000],
    "department": ["Eng", "Mkt", "Eng", "Eng", "Mkt", "Eng"]
}
df = pd.DataFrame(data)
# Pivot — rows=name, columns=month, values=sales
pivot = df.pivot_table(
    values="sales",
    index="name",
    columns="month",
    aggfunc="sum"
)
print(pivot)
Output:
month Feb Jan
name
Gagan 55000 43000
Priya 68000 72000
Rahul 61000 50000
Instantly see each person's sales per month — just like Excel.
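Under the hood, a pivot table is essentially a groupby on two keys followed by unstack(). A sketch showing the equivalence on a trimmed version of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Rahul", "Priya", "Rahul", "Priya"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "sales": [50000, 72000, 61000, 68000],
})

# The pivot-table way
pivot = df.pivot_table(values="sales", index="name",
                       columns="month", aggfunc="sum")

# The groupby way: group on both keys, then pivot the inner key into columns
via_groupby = df.groupby(["name", "month"])["sales"].sum().unstack()

print(pivot.equals(via_groupby))
```

Knowing this equivalence helps when a pivot table gets awkward — you can always drop down to groupby for more control.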
Pivot with Multiple Aggregations
pivot = df.pivot_table(
    values="sales",
    index="name",
    columns="month",
    aggfunc=["sum", "mean"],
    fill_value=0  # fill missing combinations with 0
)
print(pivot)
Output:
sum mean
month Feb Jan Feb Jan
name
Gagan 55000 43000 55000.0 43000.0
Priya 68000 72000 68000.0 72000.0
Rahul 61000 50000 61000.0 50000.0
Pivot Table with Margins — Row/Column Totals
pivot = df.pivot_table(
    values="sales",
    index="department",
    columns="month",
    aggfunc="sum",
    margins=True,        # add Total row and column
    margins_name="Total"
)
print(pivot)
Output:
month Feb Jan Total
department
Eng 116000 93000 209000
Mkt 68000 72000 140000
Total 184000 165000 349000
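When all you want is counts, `pd.crosstab` is a handy shortcut — it behaves like a pivot table with a count aggregation, and it also supports margins. A sketch on small illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Mkt", "Eng", "Eng"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
})

# Count how many rows fall in each department/month combination
counts = pd.crosstab(df["department"], df["month"],
                     margins=True, margins_name="Total")
print(counts)
```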
Part 3 — DateTime Operations
Real world data almost always has dates. Pandas has powerful datetime tools.
Creating DateTime Data
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "date": ["2024-01-15", "2024-02-20", "2024-03-10", "2024-04-05", "2024-05-22"],
    "amount": [5000, 12000, 8500, 15000, 9200]
})
print(df.dtypes) # date column is object (string) right now
Output:
order_id     int64
date        object
amount       int64
dtype: object
Converting to DateTime
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)  # date is now datetime64[ns]
Output:
order_id             int64
date        datetime64[ns]
amount               int64
dtype: object
Now date column has proper datetime type — you can do date operations.
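Real files rarely use clean ISO dates. to_datetime accepts a `format` string for non-standard layouts, and `errors="coerce"` turns unparseable values into NaT instead of raising — a small sketch:

```python
import pandas as pd

s = pd.Series(["15/01/2024", "20/02/2024", "not a date"])

# Day/month/year format; the bad value becomes NaT instead of crashing
dates = pd.to_datetime(s, format="%d/%m/%Y", errors="coerce")
print(dates)
# Rows 0 and 1 parse; row 2 becomes NaT
```

After coercing, `dates.isna()` flags exactly the rows that failed to parse, which is useful for auditing messy date columns.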
Extracting Date Components
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["weekday"] = df["date"].dt.day_name()  # Monday, Tuesday...
df["quarter"] = df["date"].dt.quarter
print(df[["date", "year", "month", "day", "weekday", "quarter"]])
Output:
date year month day weekday quarter
0 2024-01-15 2024 1 15 Monday 1
1 2024-02-20 2024 2 20 Tuesday 1
2 2024-03-10 2024 3 10 Sunday 1
3 2024-04-05 2024 4 5 Friday 2
4 2024-05-22 2024 5 22 Wednesday 2
DateTime Filtering
# Filter by date range
mask = (df["date"] >= "2024-02-01") & (df["date"] <= "2024-04-30")
filtered = df[mask]
print(filtered)
# Filter by month
march_orders = df[df["date"].dt.month == 3]
print(march_orders)
# Filter by weekday
weekends = df[df["date"].dt.dayofweek >= 5]  # 5=Sat, 6=Sun
print(weekends)
Output:
order_id date amount year month day weekday quarter
1 2 2024-02-20 12000 2024 2 20 Tuesday 1
2 3 2024-03-10 8500 2024 3 10 Sunday 1
3 4 2024-04-05 15000 2024 4 5 Friday 2
order_id date amount year month day weekday quarter
2 3 2024-03-10 8500 2024 3 10 Sunday 1
order_id date amount year month day weekday quarter
2 3 2024-03-10 8500 2024 3 10 Sunday 1
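A neat alternative to boolean masks: set the date column as the index and slice with partial date strings (Pandas calls this partial string indexing):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-15", "2024-02-20", "2024-03-10"]),
    "amount": [5000, 12000, 8500],
}).set_index("date")

# All rows from February 2024 — no mask needed
print(df.loc["2024-02"])

# Date-range slice (both ends inclusive for datetime indexes)
print(df.loc["2024-02-01":"2024-03-31"])
```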
DateTime Arithmetic
# Make sure the date column is datetime (no-op if already converted)
df["date"] = pd.to_datetime(df["date"])
# Days since order
today = pd.Timestamp("2024-06-01")
df["days_ago"] = (today - df["date"]).dt.days
# Add days to a date
df["followup_date"] = df["date"] + pd.Timedelta(days=30)
print(df[["date", "days_ago", "followup_date"]])
Output:
date days_ago followup_date
0 2024-01-15 138 2024-02-14
1 2024-02-20 102 2024-03-21
2 2024-03-10 83 2024-04-09
3 2024-04-05 57 2024-05-05
4 2024-05-22 10 2024-06-21
Resampling — Group by Time Period
# Create daily sales data
dates = pd.date_range(start="2024-01-01", end="2024-03-31", freq="D")
np.random.seed(42)
daily_sales = pd.DataFrame({
    "date": dates,
    "sales": np.random.randint(1000, 10000, len(dates))
})
daily_sales = daily_sales.set_index("date")
# Resample to weekly total
weekly = daily_sales.resample("W").sum()
print(weekly.head())
# Resample to monthly total
monthly = daily_sales.resample("ME").sum()
print(monthly)
# Resample to monthly average
monthly_avg = daily_sales.resample("ME").mean()
print(monthly_avg)
Output:
sales
date
2024-01-07 38176
2024-01-14 37162
2024-01-21 39302
2024-01-28 33790
2024-02-04 30647
sales
date
2024-01-31 160992
2024-02-29 172585
2024-03-31 189367
sales
date
2024-01-31 5193.290323
2024-02-29 5951.206897
2024-03-31 6108.612903
resample() is like groupby but specifically for time series data.
Part 4 — Real Dataset Analysis
Now let's work with a real dataset. We'll use a sales dataset that simulates real world messy data.
Create real_sales.csv:
order_id,customer_name,product,category,quantity,unit_price,order_date,city,status
1001,Rahul Sharma,Laptop Pro,Electronics,1,75000,2024-01-05,Delhi,Delivered
1002,Priya Patel,iPhone 15,Electronics,2,80000,2024-01-08,Mumbai,Delivered
1003,gagan singh,Office Desk,Furniture,1,15000,2024-01-10,bangalore,Delivered
1004,AMIT KUMAR,Wireless Mouse,Electronics,3,1500,2024-01-15,Delhi,Cancelled
1005,Neha Gupta,Standing Desk,Furniture,1,22000,2024-01-18,Mumbai,Delivered
1006,Ravi Verma,Laptop Pro,Electronics,1,75000,2024-01-22,Hyderabad,Returned
1007,sneha joshi,Office Chair,Furniture,4,8500,2024-01-25,Chennai,Delivered
1008,Kiran Rao,iPad Air,Electronics,2,55000,2024-02-02,Delhi,Delivered
1009,Arjun Nair,Bookshelf,Furniture,2,6000,2024-02-05,Mumbai,Delivered
1010,Pooja Shah,Laptop Pro,Electronics,1,75000,2024-02-10,Bangalore,Delivered
1011,Dev Mishra,Keyboard,Electronics,5,3500,2024-02-15,Delhi,Delivered
1012,Ananya Roy,Monitor,Electronics,2,18000,2024-02-20,Mumbai,Delivered
1013,Rahul Sharma,Office Chair,Furniture,2,8500,2024-03-01,Delhi,Delivered
1014,priya patel,Tablet Stand,Electronics,3,2500,2024-03-05,Mumbai,Delivered
1015,Vikram Singh,Laptop Pro,Electronics,2,75000,2024-03-10,Pune,Delivered
1016,,Wireless Mouse,Electronics,2,1500,2024-03-15,Delhi,Delivered
1017,Meena Iyer,Standing Desk,Furniture,,22000,2024-03-20,Chennai,Delivered
1018,Suresh Kumar,Monitor,Electronics,1,18000,2024-03-25,Hyderabad,Delivered
1019,Kavya Reddy,Bookshelf,Furniture,3,6000,2024-04-02,Bangalore,Delivered
1020,Rohit Gupta,iPad Air,Electronics,1,55000,2024-04-08,Delhi,Cancelled
Complete Data Cleaning + Analysis
import pandas as pd
import numpy as np
# ── Step 1: Load Data ─────────────────────────────
df = pd.read_csv("real_sales.csv")
print("=== RAW DATA ===")
print(f"Shape: {df.shape}")
print(df.head())
print("\nMissing values:")
print(df.isnull().sum())
# ── Step 2: Clean Data ───────────────────────────
# Fix customer names — inconsistent casing
df["customer_name"] = df["customer_name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.title()
# Fill missing customer name
df["customer_name"] = df["customer_name"].fillna("Unknown Customer")
# Fill missing quantity with median
df["quantity"] = df["quantity"].fillna(df["quantity"].median())
df["quantity"] = df["quantity"].astype(int)
# Convert date
df["order_date"] = pd.to_datetime(df["order_date"])
print("\n=== CLEANED DATA ===")
print(df.head())
print("\nMissing values after cleaning:")
print(df.isnull().sum())
# ── Step 3: Feature Engineering ──────────────────
# Revenue column
df["revenue"] = df["quantity"] * df["unit_price"]
# Date features
df["month"] = df["order_date"].dt.month
df["month_name"] = df["order_date"].dt.strftime("%B")
df["weekday"] = df["order_date"].dt.day_name()
# Order size category
df["order_size"] = pd.cut(
df["revenue"],
bins=[0, 10000, 50000, 100000, float("inf")],
labels=["Small", "Medium", "Large", "Enterprise"]
)
print("\n=== AFTER FEATURE ENGINEERING ===")
print(df[["order_id", "customer_name", "revenue", "order_size"]].head(8))
# ── Step 4: Analysis ──────────────────────────────
print("\n" + "=" * 50)
print(" SALES ANALYSIS REPORT")
print("=" * 50)
# Only delivered orders for revenue analysis
delivered = df[df["status"] == "Delivered"]
print(f"\nTotal Orders : {len(df)}")
print(f"Delivered : {len(df[df['status'] == 'Delivered'])}")
print(f"Cancelled : {len(df[df['status'] == 'Cancelled'])}")
print(f"Returned : {len(df[df['status'] == 'Returned'])}")
print(f"\nTotal Revenue (Delivered): Rs.{delivered['revenue'].sum():,.0f}")
print(f"Average Order Value : Rs.{delivered['revenue'].mean():,.0f}")
print("\n--- Revenue by Category ---")
cat_revenue = delivered.groupby("category")["revenue"].agg(
total="sum",
orders="count",
avg="mean"
).round(0)
print(cat_revenue.to_string())
print("\n--- Monthly Revenue ---")
monthly = delivered.groupby("month_name")["revenue"].sum()
months_order = ["January", "February", "March", "April"]
for month in months_order:
if month in monthly.index:
amount = monthly[month]
bar = "█" * int(amount / 20000)
print(f" {month:<10} {bar} Rs.{amount:,.0f}")
print("\n--- Top 5 Products ---")
top_products = delivered.groupby("product")["revenue"].sum() \
.sort_values(ascending=False).head(5)
for product, revenue in top_products.items():
print(f" {product:<20} Rs.{revenue:,.0f}")
print("\n--- Revenue by City ---")
city_revenue = delivered.groupby("city")["revenue"].sum() \
.sort_values(ascending=False)
for city, revenue in city_revenue.items():
print(f" {city:<12} Rs.{revenue:,.0f}")
print("\n--- Best Customers ---")
customer_revenue = delivered.groupby("customer_name")["revenue"].sum() \
.sort_values(ascending=False).head(5)
for customer, revenue in customer_revenue.items():
print(f" {customer:<20} Rs.{revenue:,.0f}")
print("\n--- Order Size Distribution ---")
size_dist = df["order_size"].value_counts()
for size, count in size_dist.items():
print(f" {size:<12} {count} orders")
# ── Step 5: Save Clean Data ───────────────────────
df.to_csv("cleaned_sales.csv", index=False)
print("\n✅ Cleaned data saved to cleaned_sales.csv")
Output:
==================================================
SALES ANALYSIS REPORT
==================================================
Total Orders : 20
Delivered : 17
Cancelled : 2
Returned : 1

Total Revenue (Delivered): Rs.814,000
Average Order Value : Rs.47,882

--- Revenue by Category ---
              total  orders      avg
category
Electronics  652000      10  65200.0
Furniture    162000       7  23143.0

--- Monthly Revenue ---
  January    ███████████████ Rs.306,000
  February   ████████████ Rs.250,500
  March      ███████████ Rs.239,500
  April       Rs.18,000

--- Top 5 Products ---
  Laptop Pro           Rs.300,000
  iPhone 15            Rs.160,000
  iPad Air             Rs.110,000
  Standing Desk        Rs.66,000
  Monitor              Rs.54,000

--- Revenue by City ---
  Mumbai       Rs.237,500
  Delhi        Rs.222,500
  Pune         Rs.150,000
  Bangalore    Rs.108,000
  Chennai      Rs.78,000
  Hyderabad    Rs.18,000

--- Best Customers ---
  Priya Patel          Rs.167,500
  Vikram Singh         Rs.150,000
  Kiran Rao            Rs.110,000
  Rahul Sharma         Rs.92,000
  Pooja Shah           Rs.75,000

--- Order Size Distribution ---
  Medium       10 orders
  Large        4 orders
  Small        3 orders
  Enterprise   3 orders

✅ Cleaned data saved to cleaned_sales.csv
✅ Cleaned data saved to cleaned_sales.csv
pd.cut and pd.qcut — Binning Data
You saw pd.cut above. Let's understand both:
ages = pd.Series([5, 15, 25, 35, 45, 55, 65, 75])
# pd.cut — fixed bins you define
age_groups = pd.cut(
ages,
bins=[0, 18, 35, 60, 100],
labels=["Child", "Young Adult", "Adult", "Senior"]
)
print(age_groups)
# 0 Child
# 1 Child
# 2 Young Adult
# 3 Young Adult
# 4 Adult
# 5 Adult
# 6 Senior
# 7 Senior
# pd.qcut — equal sized bins (by quantile)
salary = pd.Series([25000, 35000, 45000, 55000, 65000, 75000, 85000, 95000])
quartiles = pd.qcut(salary, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartiles)
- pd.cut — you define the boundary values
- pd.qcut — you define the number of groups, and Pandas picks boundaries so each group holds a (roughly) equal count
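To see the boundaries qcut actually chose, pass `retbins=True` (or drop the labels to see the intervals directly):

```python
import pandas as pd

salary = pd.Series([25000, 35000, 45000, 55000, 65000, 75000, 85000, 95000])

# retbins=True also returns the boundary values qcut computed
quartiles, bins = pd.qcut(salary, q=4, retbins=True)
print(bins)                      # edges: min, 25th, 50th, 75th percentile, max
print(quartiles.value_counts())  # two values land in each bin
```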
Method Chaining — Professional Pandas Style
Instead of creating many intermediate variables, chain operations:
# Unpythonic — too many intermediate variables
df1 = df[df["status"] == "Delivered"]
df2 = df1.groupby("category")["revenue"].sum()
df3 = df2.sort_values(ascending=False)
df4 = df3.reset_index()
print(df4)
# Pythonic — method chaining
result = (
df
.loc[df["status"] == "Delivered"]
.groupby("category")["revenue"]
.sum()
.sort_values(ascending=False)
.reset_index()
)
print(result)
Both give the same result. Method chaining is cleaner and more readable. Wrap in parentheses () to split across lines.
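.assign() fits naturally into chains — it adds computed columns without breaking out of the pipeline. A sketch on a small made-up frame:

```python
import pandas as pd

# Hypothetical mini order table, just for illustration
orders = pd.DataFrame({
    "quantity": [1, 2, 3],
    "unit_price": [75000, 1500, 8500],
})

result = (
    orders
    .assign(revenue=lambda d: d["quantity"] * d["unit_price"])  # add a column mid-chain
    .query("revenue > 10000")                                   # filter on the new column
    .sort_values("revenue", ascending=False)
)
print(result)
```

The lambda receives the DataFrame as it exists at that point in the chain, so assigned columns can build on earlier steps.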
Summary — What You Now Know in Pandas
✅ Series and DataFrame creation
✅ Loading CSV, Excel, JSON
✅ Exploring data — head, info, describe, shape
✅ Selecting — columns, iloc, loc, conditions
✅ Adding and modifying columns
✅ Missing value handling — isnull, fillna, dropna
✅ GroupBy and aggregation
✅ Sorting and ranking
✅ String operations with .str
✅ apply() with functions and lambdas
✅ Merging DataFrames — inner, left, right, outer
✅ Concatenating DataFrames
✅ Pivot tables
✅ DateTime operations
✅ Resampling time series
✅ pd.cut and pd.qcut for binning
✅ Method chaining
✅ Full data cleaning workflow
✅ Saving cleaned data
Exercise 🏋️
Complete Data Analysis Project:
Download the Titanic dataset from Kaggle (it's free, though you'll need a Kaggle account to download — just search "titanic csv"). Or skip the download and read it straight from a URL in Pandas:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
Analyze it and answer:
- How many passengers survived vs died?
- Survival rate by passenger class (Pclass)
- Average age of survivors vs non-survivors
- Survival rate by gender
- Fill missing Age values with median age per class
- Create age groups: Child (0-12), Teen (13-19), Adult (20-60), Senior (60+)
- Survival rate per age group
- Top 10 most expensive tickets
- Which embarkation port had the highest survival rate? (the Embarked column — values C, Q, S)
- Create a clean summary table — class, gender, survival rate, avg fare
This is a classic data science dataset used in every interview. Knowing it well is genuinely valuable.