Pandas — A Python Library (Advanced Topics)

What We're Covering Today

  • Merging and Joining DataFrames
  • Pivot Tables
  • DateTime Operations
  • Real Kaggle Dataset Analysis
  • Data Cleaning on Messy Real Data

These are the skills that make you actually useful in a real data job.


Part 1 — Merging DataFrames

In real projects data lives in multiple tables. You need to combine them. This is exactly like SQL JOINs — same concept, Pandas syntax.


Setup — Create Sample Tables


    import pandas as pd
    import numpy as np

    # Employees table
    employees = pd.DataFrame({
        "emp_id":     [1, 2, 3, 4, 5],
        "name":       ["Rahul", "Priya", "Gagan", "Amit", "Neha"],
        "dept_id":    [101, 102, 101, 103, 102],
        "salary":     [85000, 62000, 92000, 48000, 71000]
    })

    # Departments table
    departments = pd.DataFrame({
        "dept_id":    [101, 102, 103, 104],
        "dept_name":  ["Engineering", "Marketing", "Sales", "HR"],
        "location":   ["Bangalore", "Mumbai", "Delhi", "Hyderabad"]
    })

    # Projects table
    projects = pd.DataFrame({
        "project_id": [1, 2, 3],
        "name":       ["Website Redesign", "Mobile App", "Data Pipeline"],
        "lead_emp_id":[1, 3, 1]
    })

    print("Employees:")
    print(employees)
    print("\nDepartments:")
    print(departments)


pd.merge() — The Main Merge Function

Inner Join — Only Matching Rows


    # Merge employees with departments
    result = pd.merge(employees, departments, on="dept_id")
    print(result)

Output:

   emp_id   name  dept_id  salary    dept_name   location
0       1  Rahul      101   85000  Engineering  Bangalore
1       3  Gagan      101   92000  Engineering  Bangalore
2       2  Priya      102   62000    Marketing     Mumbai
3       5   Neha      102   71000    Marketing     Mumbai
4       4   Amit      103   48000        Sales      Delhi

HR department (104) has no employees — not included. All employees have matching departments — all included.


Left Join — Keep All Left Rows


    result = pd.merge(employees, departments, on="dept_id", how="left")
    print(result)

Output:

   emp_id   name  dept_id  salary    dept_name   location
0       1  Rahul      101   85000  Engineering  Bangalore
1       2  Priya      102   62000    Marketing     Mumbai
2       3  Gagan      101   92000  Engineering  Bangalore
3       4   Amit      103   48000        Sales      Delhi
4       5   Neha      102   71000    Marketing     Mumbai

All employees kept. If no matching department — NaN. Same as SQL LEFT JOIN.


Right Join — Keep All Right Rows


    result = pd.merge(employees, departments, on="dept_id", how="right")
    print(result)

Output:

   emp_id   name  dept_id   salary    dept_name   location
0     1.0  Rahul      101  85000.0  Engineering  Bangalore
1     3.0  Gagan      101  92000.0  Engineering  Bangalore
2     2.0  Priya      102  62000.0    Marketing     Mumbai
3     5.0   Neha      102  71000.0    Marketing     Mumbai
4     4.0   Amit      103  48000.0        Sales      Delhi
5     NaN    NaN      104      NaN           HR  Hyderabad

Output includes HR department even though it has no employees — with NaN in employee columns.


Outer Join — Keep Everything


    result = pd.merge(employees, departments, on="dept_id", how="outer")
    print(result)

Output:

   emp_id   name  dept_id   salary    dept_name   location
0     1.0  Rahul      101  85000.0  Engineering  Bangalore
1     3.0  Gagan      101  92000.0  Engineering  Bangalore
2     2.0  Priya      102  62000.0    Marketing     Mumbai
3     5.0   Neha      102  71000.0    Marketing     Mumbai
4     4.0   Amit      103  48000.0        Sales      Delhi
5     NaN    NaN      104      NaN           HR  Hyderabad

All rows from both tables are kept. Missing values on either side are filled with NaN.
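When debugging an outer join, the `indicator` parameter is handy — it adds a `_merge` column telling you which table each row came from. A small sketch with made-up mini tables:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id":  [1, 2],
    "name":    ["Rahul", "Priya"],
    "dept_id": [101, 105],               # 105 has no department row
})
departments = pd.DataFrame({
    "dept_id":   [101, 104],
    "dept_name": ["Engineering", "HR"],  # 104 has no employees
})

# indicator=True adds a _merge column: "both", "left_only", or "right_only"
result = pd.merge(employees, departments, on="dept_id",
                  how="outer", indicator=True)
print(result[["dept_id", "name", "dept_name", "_merge"]])
```

Rows marked `left_only` or `right_only` are exactly the ones an inner join would have dropped.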


Merge on Different Column Names


    # left table has 'lead_emp_id', right has 'emp_id'
    result = pd.merge(
        projects,
        employees,
        left_on="lead_emp_id",
        right_on="emp_id"
    )
    print(result[["name_x", "name_y", "salary"]])
    # name_x = project name, name_y = employee name

Output:

             name_x name_y  salary
0  Website Redesign  Rahul   85000
1        Mobile App  Gagan   92000
2     Data Pipeline  Rahul   85000

When both tables have a column with the same name, Pandas automatically adds _x and _y suffixes.


Rename After Merge to Clean Up


    result = pd.merge(
        projects,
        employees,
        left_on="lead_emp_id",
        right_on="emp_id"
    ).rename(columns={
        "name_x": "project_name",
        "name_y": "lead_name"
    })

    print(result[["project_name", "lead_name", "salary"]])

Output:

       project_name lead_name  salary
0  Website Redesign     Rahul   85000
1        Mobile App     Gagan   92000
2     Data Pipeline     Rahul   85000
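
Instead of renaming afterwards, you can also set meaningful suffixes up front with the `suffixes` parameter. A small sketch (the suffix names are just a choice):

```python
import pandas as pd

projects = pd.DataFrame({
    "project_id":  [1, 2],
    "name":        ["Website Redesign", "Mobile App"],
    "lead_emp_id": [1, 3],
})
employees = pd.DataFrame({
    "emp_id": [1, 3],
    "name":   ["Rahul", "Gagan"],
    "salary": [85000, 92000],
})

# suffixes applies only to overlapping column names ("name" here)
result = pd.merge(projects, employees,
                  left_on="lead_emp_id", right_on="emp_id",
                  suffixes=("_project", "_lead"))
print(result[["name_project", "name_lead", "salary"]])
```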

Multiple Merges — Chaining


    # Merge employees with departments, then with projects
    full_data = pd.merge(employees, departments, on="dept_id") \
                .merge(projects, left_on="emp_id", right_on="lead_emp_id", how="left")

    print(full_data[["name_x", "dept_name", "salary", "name_y"]])

Output:

  name_x    dept_name  salary            name_y
0  Rahul  Engineering   85000  Website Redesign
1  Rahul  Engineering   85000     Data Pipeline
2  Gagan  Engineering   92000        Mobile App
3  Priya    Marketing   62000               NaN
4   Neha    Marketing   71000               NaN
5   Amit        Sales   48000               NaN

The \ at the end of a line is Python line continuation — it lets you split one expression across multiple lines. Wrapping the whole expression in parentheses works too, and avoids the fragile trailing backslash.


concat — Stack DataFrames

Different from merge — concat stacks DataFrames on top of each other (same columns):


    df_2022 = pd.DataFrame({
        "name":   ["Rahul", "Priya"],
        "salary": [80000, 58000],
        "year":   [2022, 2022]
    })

    df_2023 = pd.DataFrame({
        "name":   ["Rahul", "Priya", "Gagan"],
        "salary": [85000, 62000, 92000],
        "year":   [2023, 2023, 2023]
    })

    combined = pd.concat([df_2022, df_2023], ignore_index=True)
    print(combined)

Output:

    name  salary  year
0  Rahul   80000  2022
1  Priya   58000  2022
2  Rahul   85000  2023
3  Priya   62000  2023
4  Gagan   92000  2023

ignore_index=True resets the index to 0, 1, 2... instead of keeping the original indices (without it, you'd see 0, 1, 0, 1, 2).
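
concat can also glue DataFrames side by side with `axis=1`, matching rows by index. A quick sketch with made-up columns:

```python
import pandas as pd

names  = pd.DataFrame({"name": ["Rahul", "Priya"]})
scores = pd.DataFrame({"score": [85, 92]})

# axis=1 joins column-wise, aligning rows by index
side_by_side = pd.concat([names, scores], axis=1)
print(side_by_side)
```

If the indices don't line up, you get NaN — for key-based combining, prefer merge.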


Part 2 — Pivot Tables

  • A pivot table summarizes, analyzes, and reorganizes data quickly — without changing the original dataset. Same idea as Excel pivot tables.
  • It turns raw data into meaningful insights by grouping and aggregating values.

What Pivot Table Can Do

  • Group data → by category (e.g., department)
  • Aggregate values → sum, average, count, min, max
  • Filter data → show only what you need
  • Rearrange structure → rows ↔ columns (pivot)

Why it's called "Pivot"

Because you can rotate (pivot) the data view:

  • Rows → Columns
  • Columns → Rows
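
A tiny sketch of that rotation — the same sales numbers (made up here) viewed both ways, just by swapping index and columns:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Rahul", "Priya", "Rahul", "Priya"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "sales": [50000, 72000, 61000, 68000],
})

# names as rows, months as columns...
by_name = df.pivot_table(values="sales", index="name",
                         columns="month", aggfunc="sum")

# ...or pivoted the other way: months as rows, names as columns
by_month = df.pivot_table(values="sales", index="month",
                          columns="name", aggfunc="sum")

print(by_name)
print(by_month)
```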

Basic Pivot Table


    data = {
        "name":       ["Rahul", "Priya", "Gagan", "Rahul", "Priya", "Gagan"],
        "month":      ["Jan",   "Jan",   "Jan",   "Feb",   "Feb",   "Feb"],
        "sales":      [50000,   72000,   43000,   61000,   68000,   55000],
        "department": ["Eng",   "Mkt",   "Eng",   "Eng",   "Mkt",   "Eng"]
    }

    df = pd.DataFrame(data)

    # Pivot — rows=name, columns=month, values=sales
    pivot = df.pivot_table(
        values="sales",
        index="name",
        columns="month",
        aggfunc="sum"
    )

    print(pivot)

Output:

month    Feb    Jan
name
Gagan  55000  43000
Priya  68000  72000
Rahul  61000  50000

Instantly see each person's sales per month — just like Excel.


Pivot with Multiple Aggregations


    pivot = df.pivot_table(
        values="sales",
        index="name",
        columns="month",
        aggfunc=["sum", "mean"],
        fill_value=0           # fill missing combinations with 0
    )
    print(pivot)


Output:
         sum            mean         
month    Feb    Jan      Feb      Jan
name                                 
Gagan  55000  43000  55000.0  43000.0
Priya  68000  72000  68000.0  72000.0
Rahul  61000  50000  61000.0  50000.0

Pivot Table with Margins — Row/Column Totals


    pivot = df.pivot_table(
        values="sales",
        index="department",
        columns="month",
        aggfunc="sum",
        margins=True,          # add Total row and column
        margins_name="Total"
    )
    print(pivot)

Output:

month       Feb     Jan   Total
department
Eng       116000   93000  209000
Mkt        68000   72000  140000
Total     184000  165000  349000

Part 3 — DateTime Operations

Real world data almost always has dates. Pandas has powerful datetime tools.

Creating DateTime Data


    df = pd.DataFrame({
        "order_id": [1, 2, 3, 4, 5],
        "date":     ["2024-01-15", "2024-02-20", "2024-03-10",
                    "2024-04-05", "2024-05-22"],
        "amount":   [5000, 12000, 8500, 15000, 9200]
    })

    print(df.dtypes)
    # date column is object (string) right now


Output:

order_id     int64
date        object
amount       int64
dtype: object

Converting to DateTime


    df["date"] = pd.to_datetime(df["date"])
    print(df.dtypes)
    # date    datetime64[ns]

Output:

order_id             int64
date        datetime64[ns]
amount               int64
dtype: object

Now date column has proper datetime type — you can do date operations.


Extracting Date Components


    df["year"]    = df["date"].dt.year
    df["month"]   = df["date"].dt.month
    df["day"]     = df["date"].dt.day
    df["weekday"] = df["date"].dt.day_name()    # Monday, Tuesday...
    df["quarter"] = df["date"].dt.quarter

    print(df[["date", "year", "month", "day", "weekday", "quarter"]])

Output:

        date  year  month  day   weekday  quarter
0 2024-01-15  2024      1   15    Monday        1
1 2024-02-20  2024      2   20   Tuesday        1
2 2024-03-10  2024      3   10    Sunday        1
3 2024-04-05  2024      4    5    Friday        2
4 2024-05-22  2024      5   22  Wednesday       2

DateTime Filtering


    # Filter by date range
    mask = (df["date"] >= "2024-02-01") & (df["date"] <= "2024-04-30")
    filtered = df[mask]
    print(filtered)

    # Filter by month
    march_orders = df[df["date"].dt.month == 3]
    print(march_orders)

    # Filter by weekday
    weekends = df[df["date"].dt.dayofweek >= 5]  # 5=Sat, 6=Sun
    print(weekends)


Output:

   order_id       date  amount  year  month  day  weekday  quarter
1         2 2024-02-20   12000  2024      2   20  Tuesday        1
2         3 2024-03-10    8500  2024      3   10   Sunday        1
3         4 2024-04-05   15000  2024      4    5   Friday        2
   order_id       date  amount  year  month  day weekday  quarter
2         3 2024-03-10    8500  2024      3   10  Sunday        1

   order_id       date  amount  year  month  day weekday  quarter
2         3 2024-03-10    8500  2024      3   10  Sunday        1

DateTime Arithmetic


    from datetime import datetime

    df["date"] = pd.to_datetime(df["date"])

    # Days since order
    today = pd.Timestamp("2024-06-01")
    df["days_ago"] = (today - df["date"]).dt.days

    # Add days to a date
    df["followup_date"] = df["date"] + pd.Timedelta(days=30)

    print(df[["date", "days_ago", "followup_date"]])

Output:

        date  days_ago followup_date
0 2024-01-15       138    2024-02-14
1 2024-02-20       101    2024-03-21
2 2024-03-10        83    2024-04-09
3 2024-04-05        57    2024-05-05
4 2024-05-22        10    2024-06-21

Resampling — Group by Time Period


    # Create daily sales data
    dates = pd.date_range(start="2024-01-01", end="2024-03-31", freq="D")
    np.random.seed(42)
    daily_sales = pd.DataFrame({
        "date":  dates,
        "sales": np.random.randint(1000, 10000, len(dates))
    })

    daily_sales = daily_sales.set_index("date")

    # Resample to weekly total
    weekly = daily_sales.resample("W").sum()
    print(weekly.head())

    # Resample to monthly total
    monthly = daily_sales.resample("ME").sum()
    print(monthly)

    # Resample to monthly average
    monthly_avg = daily_sales.resample("ME").mean()
    print(monthly_avg)

Output:

            sales
date             
2024-01-07  38176
2024-01-14  37162
2024-01-21  39302
2024-01-28  33790
2024-02-04  30647
             sales
date              
2024-01-31  160992
2024-02-29  172585
2024-03-31  189367

                  sales
date                   
2024-01-31  5193.290323
2024-02-29  5951.206897
2024-03-31  6108.612903

resample() is like groupby but specifically for time series data. Common frequency codes: "D" = daily, "W" = weekly, "ME" = month end ("M" in pandas versions before 2.2).


Part 4 — Real Dataset Analysis

Now let's work with a real dataset. We'll use a sales dataset that simulates real world messy data.

Create real_sales.csv:

order_id,customer_name,product,category,quantity,unit_price,order_date,city,status
1001,Rahul Sharma,Laptop Pro,Electronics,1,75000,2024-01-05,Delhi,Delivered
1002,Priya Patel,iPhone 15,Electronics,2,80000,2024-01-08,Mumbai,Delivered
1003,gagan singh,Office Desk,Furniture,1,15000,2024-01-10,bangalore,Delivered
1004,AMIT KUMAR,Wireless Mouse,Electronics,3,1500,2024-01-15,Delhi,Cancelled
1005,Neha Gupta,Standing Desk,Furniture,1,22000,2024-01-18,Mumbai,Delivered
1006,Ravi Verma,Laptop Pro,Electronics,1,75000,2024-01-22,Hyderabad,Returned
1007,sneha joshi,Office Chair,Furniture,4,8500,2024-01-25,Chennai,Delivered
1008,Kiran Rao,iPad Air,Electronics,2,55000,2024-02-02,Delhi,Delivered
1009,Arjun Nair,Bookshelf,Furniture,2,6000,2024-02-05,Mumbai,Delivered
1010,Pooja Shah,Laptop Pro,Electronics,1,75000,2024-02-10,Bangalore,Delivered
1011,Dev Mishra,Keyboard,Electronics,5,3500,2024-02-15,Delhi,Delivered
1012,Ananya Roy,Monitor,Electronics,2,18000,2024-02-20,Mumbai,Delivered
1013,Rahul Sharma,Office Chair,Furniture,2,8500,2024-03-01,Delhi,Delivered
1014,priya patel,Tablet Stand,Electronics,3,2500,2024-03-05,Mumbai,Delivered
1015,Vikram Singh,Laptop Pro,Electronics,2,75000,2024-03-10,Pune,Delivered
1016,,Wireless Mouse,Electronics,2,1500,2024-03-15,Delhi,Delivered
1017,Meena Iyer,Standing Desk,Furniture,,22000,2024-03-20,Chennai,Delivered
1018,Suresh Kumar,Monitor,Electronics,1,18000,2024-03-25,Hyderabad,Delivered
1019,Kavya Reddy,Bookshelf,Furniture,3,6000,2024-04-02,Bangalore,Delivered
1020,Rohit Gupta,iPad Air,Electronics,1,55000,2024-04-08,Delhi,Cancelled

Complete Data Cleaning + Analysis

import pandas as pd
import numpy as np

# ── Step 1: Load Data ─────────────────────────────
df = pd.read_csv("real_sales.csv")

print("=== RAW DATA ===")
print(f"Shape: {df.shape}")
print(df.head())
print("\nMissing values:")
print(df.isnull().sum())


# ── Step 2: Clean Data ───────────────────────────

# Fix customer names — inconsistent casing
df["customer_name"] = df["customer_name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.title()

# Fill missing customer name
df["customer_name"] = df["customer_name"].fillna("Unknown Customer")

# Fill missing quantity with median
df["quantity"] = df["quantity"].fillna(df["quantity"].median())
df["quantity"] = df["quantity"].astype(int)

# Convert date
df["order_date"] = pd.to_datetime(df["order_date"])

print("\n=== CLEANED DATA ===")
print(df.head())
print("\nMissing values after cleaning:")
print(df.isnull().sum())


# ── Step 3: Feature Engineering ──────────────────

# Revenue column
df["revenue"] = df["quantity"] * df["unit_price"]

# Date features
df["month"]   = df["order_date"].dt.month
df["month_name"] = df["order_date"].dt.strftime("%B")
df["weekday"] = df["order_date"].dt.day_name()

# Order size category
df["order_size"] = pd.cut(
    df["revenue"],
    bins=[0, 10000, 50000, 100000, float("inf")],
    labels=["Small", "Medium", "Large", "Enterprise"]
)

print("\n=== AFTER FEATURE ENGINEERING ===")
print(df[["order_id", "customer_name", "revenue", "order_size"]].head(8))


# ── Step 4: Analysis ──────────────────────────────

print("\n" + "=" * 50)
print("         SALES ANALYSIS REPORT")
print("=" * 50)

# Only delivered orders for revenue analysis
delivered = df[df["status"] == "Delivered"]

print(f"\nTotal Orders     : {len(df)}")
print(f"Delivered        : {len(df[df['status'] == 'Delivered'])}")
print(f"Cancelled        : {len(df[df['status'] == 'Cancelled'])}")
print(f"Returned         : {len(df[df['status'] == 'Returned'])}")

print(f"\nTotal Revenue (Delivered): Rs.{delivered['revenue'].sum():,.0f}")
print(f"Average Order Value      : Rs.{delivered['revenue'].mean():,.0f}")

print("\n--- Revenue by Category ---")
cat_revenue = delivered.groupby("category")["revenue"].agg(
    total="sum",
    orders="count",
    avg="mean"
).round(0)
print(cat_revenue.to_string())

print("\n--- Monthly Revenue ---")
monthly = delivered.groupby("month_name")["revenue"].sum()
months_order = ["January", "February", "March", "April"]
for month in months_order:
    if month in monthly.index:
        amount = monthly[month]
        bar = "█" * int(amount / 20000)
        print(f"  {month:<10} {bar} Rs.{amount:,.0f}")

print("\n--- Top 5 Products ---")
top_products = delivered.groupby("product")["revenue"].sum() \
                        .sort_values(ascending=False).head(5)
for product, revenue in top_products.items():
    print(f"  {product:<20} Rs.{revenue:,.0f}")

print("\n--- Revenue by City ---")
city_revenue = delivered.groupby("city")["revenue"].sum() \
                        .sort_values(ascending=False)
for city, revenue in city_revenue.items():
    print(f"  {city:<12} Rs.{revenue:,.0f}")

print("\n--- Best Customers ---")
customer_revenue = delivered.groupby("customer_name")["revenue"].sum() \
                            .sort_values(ascending=False).head(5)
for customer, revenue in customer_revenue.items():
    print(f"  {customer:<20} Rs.{revenue:,.0f}")

print("\n--- Order Size Distribution ---")
size_dist = df["order_size"].value_counts()
for size, count in size_dist.items():
    print(f"  {size:<12} {count} orders")


# ── Step 5: Save Clean Data ───────────────────────
df.to_csv("cleaned_sales.csv", index=False)
print("\n✅ Cleaned data saved to cleaned_sales.csv")

Output:

==================================================
         SALES ANALYSIS REPORT
==================================================

Total Orders     : 20
Delivered        : 17
Cancelled        : 2
Returned         : 1

Total Revenue (Delivered): Rs.814,000
Average Order Value      : Rs.47,882

--- Revenue by Category ---
              total  orders      avg
category                            
Electronics  652000      10  65200.0
Furniture    162000       7  23143.0

--- Monthly Revenue ---
  January    ███████████████ Rs.306,000
  February   ████████████ Rs.250,500
  March      ███████████ Rs.239,500
  April       Rs.18,000

--- Top 5 Products ---
  Laptop Pro           Rs.300,000
  iPhone 15            Rs.160,000
  iPad Air             Rs.110,000
  Standing Desk        Rs.66,000
  Monitor              Rs.54,000

--- Revenue by City ---
  Mumbai       Rs.237,500
  Delhi        Rs.222,500
  Pune         Rs.150,000
  Bangalore    Rs.108,000
  Chennai      Rs.78,000
  Hyderabad    Rs.18,000

--- Best Customers ---
  Priya Patel          Rs.167,500
  Vikram Singh         Rs.150,000
  Kiran Rao            Rs.110,000
  Rahul Sharma         Rs.92,000
  Pooja Shah           Rs.75,000

--- Order Size Distribution ---
  Medium       10 orders
  Large        4 orders
  Small        3 orders
  Enterprise   3 orders

✅ Cleaned data saved to cleaned_sales.csv

pd.cut and pd.qcut — Binning Data

You saw pd.cut above. Let's understand both:

ages = pd.Series([5, 15, 25, 35, 45, 55, 65, 75])

# pd.cut — fixed bins you define
age_groups = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 100],
    labels=["Child", "Young Adult", "Adult", "Senior"]
)
print(age_groups)
# 0         Child
# 1         Child
# 2    Young Adult
# 3    Young Adult
# 4         Adult
# 5         Adult
# 6        Senior
# 7        Senior

# pd.qcut — equal sized bins (by quantile)
salary = pd.Series([25000, 35000, 45000, 55000, 65000, 75000, 85000, 95000])
quartiles = pd.qcut(salary, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartiles)

pd.cut — you define the boundary values.
pd.qcut — you define the number of groups; Pandas picks the boundaries so each group has a roughly equal count.
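
If you want to see the boundaries qcut chose, `retbins=True` returns them alongside the binned values — a small sketch reusing the salary series above:

```python
import pandas as pd

salary = pd.Series([25000, 35000, 45000, 55000, 65000, 75000, 85000, 95000])

# retbins=True also returns the computed bin edges
quartiles, edges = pd.qcut(salary, q=4,
                           labels=["Q1", "Q2", "Q3", "Q4"], retbins=True)
print(edges)                       # 5 quantile boundaries, min to max
print(quartiles.value_counts())    # each bin gets 2 of the 8 values
```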


Method Chaining — Professional Pandas Style

Instead of creating many intermediate variables, chain operations:

# Unpythonic — too many intermediate variables
df1 = df[df["status"] == "Delivered"]
df2 = df1.groupby("category")["revenue"].sum()
df3 = df2.sort_values(ascending=False)
df4 = df3.reset_index()
print(df4)

# Pythonic — method chaining
result = (
    df
    .loc[df["status"] == "Delivered"]
    .groupby("category")["revenue"]
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
print(result)

Both give the same result. Method chaining is cleaner and more readable. Wrap in parentheses () to split across lines.


Summary — What You Now Know in Pandas

✅ Series and DataFrame creation
✅ Loading CSV, Excel, JSON
✅ Exploring data — head, info, describe, shape
✅ Selecting — columns, iloc, loc, conditions
✅ Adding and modifying columns
✅ Missing value handling — isnull, fillna, dropna
✅ GroupBy and aggregation
✅ Sorting and ranking
✅ String operations with .str
✅ apply() with functions and lambdas
✅ Merging DataFrames — inner, left, right, outer
✅ Concatenating DataFrames
✅ Pivot tables
✅ DateTime operations
✅ Resampling time series
✅ pd.cut and pd.qcut for binning
✅ Method chaining
✅ Full data cleaning workflow
✅ Saving cleaned data

Exercise 🏋️

Complete Data Analysis Project:

Download the Titanic dataset from Kaggle (search "titanic csv download" — you may need a free account), or use this URL directly in Pandas:

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

Analyze it and answer:

  1. How many passengers survived vs died?
  2. Survival rate by passenger class (Pclass)
  3. Average age of survivors vs non-survivors
  4. Survival rate by gender
  5. Fill missing Age values with median age per class
  6. Create age groups: Child (0-12), Teen (13-19), Adult (20-60), Senior (60+)
  7. Survival rate per age group
  8. Top 10 most expensive tickets
  9. Which embarkation port had the highest survival rate? (the Embarked column — values C, Q, S)
  10. Create a clean summary table — class, gender, survival rate, avg fare

This is a classic data science dataset used in every interview. Knowing it well is genuinely valuable.

Data Cleaning Checklist — Interview & Industry Ready

1. Understand the Data First

What to do: Before touching anything, explore your dataset using df.head(), df.info(), and df.describe().

Why: You cannot clean what you don't understand. Jumping straight into cleaning without exploration leads to wrong assumptions — you might fill missing values incorrectly or drop columns that were actually useful.

Practical Example: You load a customer dataset and run df.info(). You notice a column called "age" has dtype "object" instead of "int". That tells you something is wrong — maybe there are string values like "N/A" or "unknown" hiding in there. Without this step, you'd never catch it.
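
A minimal sketch of that red flag, with a made-up column — one stray string is enough to turn a numeric column into object:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, "N/A", 31]})   # a string hiding in a numeric column

print(df.dtypes)   # age shows up as object, not int64 — the red flag
```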


2. Find Missing Values

What to do: Run df.isnull().sum() to count missing values per column. Also calculate the percentage: df.isnull().sum() / len(df) * 100

Why: Not all missing data is equal. A column with 2% missing is handled very differently from one with 60% missing. If a column is mostly empty, dropping it might be smarter than filling it.

Practical Example: In an HR dataset, "performance_score" has 65% missing values. Rather than filling all those blanks with the mean (which would fabricate data), you decide to drop the column entirely because it has too little real data to be useful.
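
A sketch of that percentage check, using a hypothetical 50% threshold (the cutoff is a judgment call, not a rule):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name":              ["Rahul", "Priya", "Gagan", "Amit"],
    "performance_score": [np.nan, 3.8, np.nan, np.nan],  # 75% missing
})

missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct)

# drop any column that is more than 50% missing
df = df.drop(columns=missing_pct[missing_pct > 50].index)
print(df.columns.tolist())
```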


3. Handle Missing Values

What to do: Choose the right fill strategy based on the column type and data distribution.

Why: Wrong imputation introduces bias. Using mean on skewed salary data pulls values toward outliers. Using a placeholder like "Unknown" for categories keeps the data honest without losing rows.

Practical Example:

  • Age column (roughly normal distribution) → fill with mean
  • Salary column (right-skewed, some people earn 10x others) → fill with median
  • City column (categorical) → fill with "Unknown"
  • Special case: salary has 0 values that actually mean "not recorded" → replace 0 with NaN first, then fill
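
The four strategies above, sketched on a tiny made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 30, np.nan, 35],
    "salary": [40000, 0, 60000, 500000],   # 0 here means "not recorded"
    "city":   ["Delhi", np.nan, "Mumbai", "Delhi"],
})

df["age"]    = df["age"].fillna(df["age"].mean())          # roughly normal → mean
df["salary"] = df["salary"].replace(0, np.nan)             # fake missing → NaN first
df["salary"] = df["salary"].fillna(df["salary"].median())  # skewed → median
df["city"]   = df["city"].fillna("Unknown")                # categorical → placeholder
print(df)
```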

4. Handle Fake Missing Values

What to do: Identify values like 0, -1, 999, "NA", "none", "null" that were used as placeholders instead of actual NaN. Replace them with np.nan before doing any filling.

Why: If someone recorded missing salary as 0, and you calculate the mean salary including all those zeros, your mean will be completely wrong. Your model will also learn that salary = 0 is a valid data point.

Practical Example: A medical dataset uses -1 to represent "test not conducted" for blood pressure readings. If you don't replace -1 with NaN first, your average blood pressure calculation will give a meaningless negative number.
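
The blood-pressure example as a sketch — note how the mean changes once the placeholder is gone:

```python
import pandas as pd
import numpy as np

bp = pd.DataFrame({"blood_pressure": [120, -1, 135, -1, 110]})

# naive mean includes the -1 placeholders and is meaningless
print(bp["blood_pressure"].mean())

# replace the placeholder with NaN, then aggregate over real readings only
bp["blood_pressure"] = bp["blood_pressure"].replace(-1, np.nan)
print(bp["blood_pressure"].mean())
```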


5. Remove Duplicates

What to do: Run df.drop_duplicates() to remove fully duplicate rows. For partial duplicates (same person, different entries), decide which row to keep based on business logic.

Why: Duplicate rows inflate your dataset, skew statistics, and cause models to overfit on repeated data. A machine learning model trained on duplicates will perform worse on real unseen data.

Practical Example: An e-commerce order dataset has the same order recorded twice due to a system glitch. If you train a sales forecasting model on this, it will think that customer ordered double what they actually did, inflating predictions.
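
A sketch of both cases — full duplicates, and partial duplicates resolved with `subset` and `keep`:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 1, 2],
    "customer": ["Rahul", "Rahul", "Priya"],
    "amount":   [500, 500, 900],   # order 1 recorded twice by a glitch
})

# fully duplicate rows
deduped = orders.drop_duplicates()
print(deduped)

# partial duplicates: keep only the last entry per order_id
latest = orders.drop_duplicates(subset="order_id", keep="last")
print(latest)
```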


6. Fix Data Types

What to do: Convert columns to their correct types — integers, floats, datetime, category — using astype() and pd.to_datetime().

Why: Wrong data types cause silent bugs. Calculations on a numeric column stored as string will either fail or give wrong results. Date columns stored as strings can't be sorted, filtered by range, or used for time-based features.

Practical Example: A "date_of_joining" column stored as a string means you can't calculate how many years an employee has worked. Once you convert it with pd.to_datetime(), you can do df["years_at_company"] = (today - df["date_of_joining"]).dt.days / 365 instantly.
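
That tenure calculation as a sketch — the "today" is pinned to a fixed date so the numbers are reproducible:

```python
import pandas as pd

df = pd.DataFrame({"date_of_joining": ["2019-03-10", "2023-01-15"]})

df["date_of_joining"] = pd.to_datetime(df["date_of_joining"])
today = pd.Timestamp("2024-06-01")   # fixed reference date for reproducibility
df["years_at_company"] = (today - df["date_of_joining"]).dt.days / 365
print(df["years_at_company"].round(1))
```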


7. Handle Outliers

What to do: Use df.describe() to spot extreme min/max values. Use boxplots visually. Remove or cap values that are statistically impossible or clearly data entry errors.

Why: A single outlier can distort the mean dramatically. In ML models, outliers can pull regression lines and skew predictions. However, not all outliers should be removed — some are real and meaningful.

Practical Example: In a salary dataset, one entry shows salary = 99999999. That's clearly a data entry error (someone held "9" too long). Keeping it would make the average salary look like crores when it's actually lakhs. You filter it out with df = df[df["salary"] < 500000].
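
A sketch of that filter — one fat-fingered entry versus the rest (the 500000 threshold is domain judgment, not a formula):

```python
import pandas as pd

salaries = pd.Series([45000, 52000, 61000, 48000, 99999999])  # one entry error

print(salaries.mean())   # hugely inflated by the outlier

# drop statistically impossible values
clean = salaries[salaries < 500000]
print(clean.mean())      # a realistic average
```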


8. Clean Text Data

What to do: Strip extra spaces with str.strip(), standardize case with str.lower() or str.upper(), fix inconsistent spellings or categories.

Why: "Mumbai", " mumbai", "MUMBAI" are the same city but Python treats them as three different values. This breaks groupby operations, value counts, and any category-based analysis.

Practical Example: You're grouping sales by city. Without cleaning, "Delhi", " delhi", "DELHI", "delhi " appear as four separate groups. After str.strip().str.lower(), they all correctly merge into one "delhi" group.
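
The four-Delhis problem as a sketch — before and after normalization:

```python
import pandas as pd

cities = pd.Series(["Delhi", " delhi", "DELHI", "delhi "])

print(cities.nunique())      # 4 "different" cities

cleaned = cities.str.strip().str.lower()
print(cleaned.nunique())     # 1 — all the same city
print(cleaned.value_counts())
```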


9. Feature Engineering (Optional but Powerful)

What to do: Create new meaningful columns from existing ones — ratios, date parts, bins, flags.

Why: Raw data isn't always the most informative form. Derived features often carry more signal for models and make analysis more intuitive.

Practical Example: Instead of raw salary in rupees, create salary_in_lakhs = salary / 100000 for easier readability. Or extract join_year and join_month from a date column to analyze hiring trends by time period.
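
Both derived features from the example, sketched on made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "salary":          [450000, 7200000],
    "date_of_joining": pd.to_datetime(["2021-06-20", "2019-03-10"]),
})

df["salary_in_lakhs"] = df["salary"] / 100000
df["join_year"]  = df["date_of_joining"].dt.year
df["join_month"] = df["date_of_joining"].dt.month
print(df[["salary_in_lakhs", "join_year", "join_month"]])
```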


10. Final Validation

What to do: After all cleaning steps, run a final check: df.isnull().sum() should be 0 (or acceptable), df.info() should show correct types, df.shape should match expected row count.

Why: Cleaning steps can sometimes introduce new issues — for example, converting a column type might fail silently and produce NaN for some rows. Always verify the end state matches your expectations.

Practical Example: After all your cleaning, df.isnull().sum() shows "age" still has 3 missing values. You investigate and realize those rows had age stored as "unknown" (string), which your mean-fill didn't catch because it was still treated as a string. Validation caught a bug that would have broken your model training.
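
A sketch of catching exactly that bug — `pd.to_numeric` with `errors="coerce"` surfaces string values hiding in a numeric column as NaN, so the validation check finds them:

```python
import pandas as pd

df = pd.DataFrame({"age": ["25", "unknown", "31"]})

# "unknown" is a string, so a numeric fill would never touch it —
# coercing to numeric turns it into NaN, where isnull() can see it
df["age"] = pd.to_numeric(df["age"], errors="coerce")
assert df["age"].isnull().sum() == 1       # caught the hidden bad value

df["age"] = df["age"].fillna(df["age"].median())
assert df.isnull().sum().sum() == 0        # final state verified
print(df)
```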


One-Liner Interview Summary

Understand the data → Identify and handle missing values (including fake ones) → Remove duplicates → Fix data types → Handle outliers → Clean text → Engineer features if needed → Validate everything at the end.

What is Pandas - A Python Library

What is Pandas?

NumPy is great for numerical arrays where all data is the same type. But real world data looks like this:

Name    Age  City       Salary  Joined
Gagan   22   Delhi      45000   2023-01-15
Priya   28   Mumbai     72000   2021-06-20
Rahul   35   Bangalore  95000   2019-03-10

Mixed types — strings, numbers, dates — all in one table. NumPy can't handle this well. Pandas is built exactly for this.

Pandas gives you a data structure called DataFrame — essentially a spreadsheet / database table inside Python. It's the most used tool in all of data science.


Installing and Importing

Already installed. In a new Jupyter notebook:


    import pandas as pd
    import numpy as np

    print("Pandas version:", pd.__version__)

pd is the universal alias — everyone in the world writes pd. Never write import pandas without the alias.


Part 1 — Core Data Structures

Pandas has two main data structures:

  • Series — one column (1D)
  • DataFrame — full table (2D)

Series — Single Column of Data


    import pandas as pd

    # Create a Series from a list
    marks = pd.Series([85, 92, 78, 65, 90])
    print(marks)

Output:

0    85
1    92
2    78
3    65
4    90
dtype: int64

Left side = index, right side = values. Every Series has an index — by default 0, 1, 2...


Series with Custom Index


    marks = pd.Series(
        [85, 92, 78, 65, 90],
        index=["Rahul", "Priya", "Gagan", "Amit", "Neha"]
    )
    print(marks)

Output:

Rahul    85
Priya    92
Gagan    78
Amit     65
Neha     90
dtype: int64

Now you access by name instead of number:


    print(marks["Gagan"])     # 78
    print(marks["Priya"])     # 92


Series Operations


    marks = pd.Series([85, 92, 78, 65, 90])

    print(marks.mean())       # 82.0
    print(marks.sum())        # 410
    print(marks.max())        # 92
    print(marks.min())        # 65
    print(marks.std())        # 10.8...
    print(marks.describe())   # full statistical summary

Output of describe():

count     5.000000
mean     82.000000
std      10.816654
min      65.000000
25%      78.000000
50%      85.000000
75%      90.000000
max      92.000000
dtype: float64

One line gives you count, mean, std, min, quartiles, max. You'll use .describe() constantly.


DataFrame — The Main Event

A DataFrame is a table — multiple Series side by side sharing the same index.


    # Create DataFrame from dictionary
    data = {
        "name":   ["Rahul", "Priya", "Gagan", "Amit", "Neha"],
        "age":    [25,      28,      22,      35,     30],
        "city":   ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai"],
        "salary": [45000,   72000,   38000,   95000,  68000],
        "active": [True,    True,    True,    False,  True]
    }

    df = pd.DataFrame(data)
    print(df)

Output:

    name  age       city  salary  active
0  Rahul   25      Delhi   45000    True
1  Priya   28     Mumbai   72000    True
2  Gagan   22      Delhi   38000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True

This is a proper table — rows and columns, exactly like a spreadsheet.
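The "multiple Series sharing the same index" idea is literal — pull out any column and you get a Series whose index is the DataFrame's index. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Rahul", "Priya", "Gagan"],
    "age":  [25, 28, 22],
})

# A single column is a Series, not a copy of the table
ages = df["age"]
print(type(ages))                   # pandas Series
print(ages.index.equals(df.index))  # True — the column shares the frame's index
```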


Part 2 — Exploring a DataFrame

These are the first things you do with any new dataset:


    print(df.shape)       # (5, 5) — 5 rows, 5 columns
    print(df.ndim)        # 2
    print(df.size)        # 25 — total cells
    print(len(df))        # 5 — number of rows
    print()

    print(df.columns)
    # Index(['name', 'age', 'city', 'salary', 'active'], dtype='object')
    print()

    print(df.dtypes)
    # name      object
    # age        int64
    # city      object
    # salary     int64
    # active      bool
    # dtype: object
    print()

    # First 3 rows
    print(df.head(3))
    print()

    # Last 3 rows
    print(df.tail(3))
    print()

    # Random sample of 2 rows
    print(df.sample(2))
    print()

    # Most important — full summary
    print(df.info())

Output:

(5, 5)
2
25
5

Index(['name', 'age', 'city', 'salary', 'active'], dtype='object')

name      object
age        int64
city      object
salary     int64
active      bool
dtype: object

    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True

    name  age       city  salary  active
2  Gagan   22      Delhi   38000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True

    name  age       city  salary  active
3   Amit   35  Bangalore   95000   False
1  Priya   28     Mumbai   72000    True

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    5 non-null      object
 1   age     5 non-null      int64 
 2   city    5 non-null      object
 3   salary  5 non-null      int64 
 4   active  5 non-null      bool  
dtypes: bool(1), int64(2), object(2)
memory usage: 297.0+ bytes
None

df.info() shows column names, data types, and how many non-null values. Missing values show up here.


    # Statistical summary of numeric columns
    print(df.describe())

Output:

             age        salary
count   5.000000      5.000000
mean   28.000000  63600.000000
std     4.949747  22788.154818
min    22.000000  38000.000000
25%    25.000000  45000.000000
50%    28.000000  68000.000000
75%    30.000000  72000.000000
max    35.000000  95000.000000

Part 3 — Selecting Data

Selecting Columns


    # Single column — returns Series
    print(df["name"])
    print(df["salary"])

    # Multiple columns — returns DataFrame
    print(df[["name", "salary"]])
    print(df[["name", "age", "city"]])


Output:

0    Rahul
1    Priya
2    Gagan
3     Amit
4     Neha
Name: name, dtype: object
0    45000
1    72000
2    38000
3    95000
4    68000
Name: salary, dtype: int64
    name  salary
0  Rahul   45000
1  Priya   72000
2  Gagan   38000
3   Amit   95000
4   Neha   68000
    name  age       city
0  Rahul   25      Delhi
1  Priya   28     Mumbai
2  Gagan   22      Delhi
3   Amit   35  Bangalore
4   Neha   30     Mumbai

Selecting Rows — loc and iloc

This is where beginners get confused. Two ways to select rows:

iloc — select by integer position (like NumPy indexing)


    print(df.iloc[0])       # first row
    print(df.iloc[-1])      # last row
    print(df.iloc[1:3])     # rows 1 and 2
    print(df.iloc[[0, 2, 4]])   # rows 0, 2, 4

Output:

name      Rahul
age          25
city      Delhi
salary    45000
active     True
Name: 0, dtype: object
name       Neha
age          30
city     Mumbai
salary    68000
active     True
Name: 4, dtype: object
    name  age    city  salary  active
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True
    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
2  Gagan   22   Delhi   38000    True
4   Neha   30  Mumbai   68000    True

loc — select by label/condition


    print(df.loc[0])      # row with index label 0
    print(df.loc[0:2])    # rows with labels 0, 1, 2 (INCLUSIVE — different from iloc)


Output:

name      Rahul
age          25
city      Delhi
salary    45000
active     True
Name: 0, dtype: object
    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True
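The label-vs-position difference only really bites once the index stops being 0, 1, 2... — a quick sketch using a name index:

```python
import pandas as pd

# A frame indexed by labels instead of the default 0..n-1
df = pd.DataFrame(
    {"salary": [45000, 72000, 38000]},
    index=["Rahul", "Priya", "Gagan"]
)

print(df.loc["Priya"])   # select by label
print(df.iloc[1])        # select by position — same row here

try:
    df.loc[1]            # no label 1 in this index
    label_found = True
except KeyError:
    label_found = False
print("label 1 exists:", label_found)
```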

loc with Conditions — Most Used Pattern


    # Get all rows where city is Delhi
    delhi_employees = df.loc[df["city"] == "Delhi"]
    print(delhi_employees)

Output:

    name  age   city  salary  active
0  Rahul   25  Delhi   45000    True
2  Gagan   22  Delhi   38000    True

    # High earners
    high_earners = df.loc[df["salary"] > 60000]
    print(high_earners)

    # Active employees only
    active_df = df.loc[df["active"] == True]
    print(active_df)

    # Multiple conditions — & for AND, | for OR
    # Must use parentheses around each condition
    delhi_active = df.loc[(df["city"] == "Delhi") & (df["active"] == True)]
    print(delhi_active)

    young_or_senior = df.loc[(df["age"] < 25) | (df["age"] > 33)]
    print(young_or_senior)


Output:

    name  age       city  salary  active
1  Priya   28     Mumbai   72000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True
    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True
4   Neha   30  Mumbai   68000    True
    name  age   city  salary  active
0  Rahul   25  Delhi   45000    True
2  Gagan   22  Delhi   38000    True
    name  age       city  salary  active
2  Gagan   22      Delhi   38000    True
3   Amit   35  Bangalore   95000   False
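The parentheses rule is worth seeing break: in Python, `&` and `|` bind tighter than comparisons like `<` and `>`, so without parentheses the expression is grouped the wrong way and pandas raises an error. A small demonstration:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 28, 22, 35, 30]})

# Correct — each condition wrapped in parentheses
ok = df.loc[(df["age"] < 25) | (df["age"] > 33)]
print(ok)

# Wrong — Python evaluates `25 | df["age"]` first, and the
# resulting chained comparison is ambiguous for a Series
try:
    df.loc[df["age"] < 25 | df["age"] > 33]
    raised = False
except ValueError:
    raised = True
print("error without parentheses:", raised)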

Select Specific Rows AND Columns


    # loc[rows, columns]
    print(df.loc[df["salary"] > 60000, ["name", "salary"]])

Output:

    name  salary
1  Priya   72000
3   Amit   95000
4   Neha   68000

This is the most useful selection pattern — filter rows AND select specific columns at the same time.
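A related pitfall: doing the filter and the column pick in two separate steps (`df[df["salary"] > 60000]["name"]`) is "chained indexing" — fine for reading, but assigning through it may silently modify a temporary copy instead of `df`. The single `loc[rows, cols]` form is the safe one for writing; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["Rahul", "Priya", "Amit"],
    "salary": [45000, 72000, 95000],
})

# One loc call does the row filter AND the column, so the
# assignment reliably hits the original DataFrame
mask = df["salary"] > 60000
df.loc[mask, "name"] = df.loc[mask, "name"].str.upper()
print(df)
```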


Part 4 — Adding and Modifying Data

Adding New Column


    # Add tax column — 20% of salary
    df["tax"] = df["salary"] * 0.20
    print(df[["name", "salary", "tax"]])

Output:

    name  salary      tax
0  Rahul   45000   9000.0
1  Priya   72000  14400.0
2  Gagan   38000   7600.0
3   Amit   95000  19000.0
4   Neha   68000  13600.0

    # Net salary after tax
    df["net_salary"] = df["salary"] - df["tax"]

    # Category based on salary using np.where
    df["level"] = np.where(df["salary"] >= 70000, "Senior", "Junior")
    print(df[["name", "salary", "level"]])

Output:

    name  salary   level
0  Rahul   45000  Junior
1  Priya   72000  Senior
2  Gagan   38000  Junior
3   Amit   95000  Senior
4   Neha   68000  Junior

Modifying Existing Values

    # Give all Junior employees a 10% raise
    df.loc[df["level"] == "Junior", "salary"] = df.loc[df["level"] == "Junior", "salary"] * 1.10

    print(df[["name", "salary", "level"]])

Renaming Columns


    df = df.rename(columns={
        "name": "full_name",
        "age": "employee_age"
    })
    print(df.columns)


Dropping Columns and Rows


    # Drop column
    df_clean = df.drop(columns=["tax", "net_salary"])

    # Drop row by index
    df_clean = df.drop(index=2)

    # Drop multiple rows
    df_clean = df.drop(index=[0, 3])

    # axis=1 for columns, axis=0 for rows (older syntax you'll see)
    df_clean = df.drop("tax", axis=1)

Important — drop() returns a new DataFrame by default. The original df is unchanged unless you use inplace=True or reassign.
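A quick sketch of the trap:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "tax": [5, 6]})

df.drop(columns=["tax"])         # returns a new frame — result thrown away
print(df.columns.tolist())       # 'tax' is still there

df = df.drop(columns=["tax"])    # reassign to keep the change
print(df.columns.tolist())       # now it's gone
```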


Part 5 — Loading Real Data

This is where Pandas becomes powerful. Loading a CSV file:


    # Read CSV file
    df = pd.read_csv("data.csv")

    # Read Excel
    df = pd.read_excel("data.xlsx")

    # Read JSON
    df = pd.read_json("data.json")

    # Read from URL
    df = pd.read_csv("https://example.com/data.csv")

Let's use a real dataset. Create a CSV file called employees.csv in your project folder:

id,name,department,salary,experience,city,rating
1,Rahul Sharma,Engineering,85000,5,Delhi,4.2
2,Priya Patel,Marketing,62000,3,Mumbai,3.8
3,Gagan Singh,Engineering,92000,7,Bangalore,4.5
4,Amit Kumar,Sales,48000,2,Delhi,3.2
5,Neha Gupta,HR,55000,4,Mumbai,4.0
6,Ravi Verma,Engineering,78000,4,Hyderabad,3.9
7,Sneha Joshi,Marketing,67000,5,Bangalore,4.3
8,Kiran Rao,Sales,52000,3,Chennai,3.6
9,Arjun Nair,Engineering,105000,9,Bangalore,4.7
10,Pooja Shah,HR,58000,6,Delhi,4.1
11,Dev Mishra,Engineering,88000,6,Delhi,4.4
12,Ananya Roy,Marketing,71000,4,Mumbai,4.0

Now load and explore it:


    df = pd.read_csv("employees.csv")

    print("Shape:", df.shape)
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nData types:")
    print(df.dtypes)
    print("\nSummary:")
    print(df.describe())
    print("\nInfo:")
    print(df.info())


Part 6 — Handling Missing Values

Real world data always has missing values. This is one of the most important data cleaning skills.

Let's create data with missing values:


    data = {
        "name":   ["Rahul", "Priya", None, "Amit", "Neha"],
        "age":    [25, None, 22, 35, 30],
        "salary": [45000, 72000, 38000, None, 68000],
        "city":   ["Delhi", "Mumbai", "Delhi", None, "Mumbai"]
    }

    df = pd.DataFrame(data)
    print(df)

Output:

    name   age   salary    city
0  Rahul  25.0  45000.0   Delhi
1  Priya   NaN  72000.0  Mumbai
2   None  22.0  38000.0   Delhi
3   Amit  35.0      NaN    None
4   Neha  30.0  68000.0  Mumbai

NaN = Not a Number = missing value in Pandas.
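One quirk worth knowing: NaN never compares equal to anything — not even itself — which is why pandas gives you isnull()/isna() instead of letting you write `== np.nan`:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)          # False — NaN is not equal to itself

s = pd.Series([1.0, np.nan, 3.0])
print((s == np.nan).tolist())    # [False, False, False] — the wrong way
print(s.isnull().tolist())       # [False, True, False]  — the right way
```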


Detecting Missing Values


    print(df.isnull())          # True where value is missing
    print(df.isnull().sum())    # count of missing per column
    print(df.isnull().sum().sum())  # total missing in entire df
    print(df.notnull())         # opposite — True where value exists

Output:

    name    age  salary   city
0  False  False   False  False
1  False   True   False  False
2   True  False   False  False
3  False  False    True   True
4  False  False   False  False
name      1
age       1
salary    1
city      1
dtype: int64
4
    name    age  salary   city
0   True   True    True   True
1   True  False    True   True
2  False   True    True   True
3   True   True   False  False
4   True   True    True   True

Dropping Missing Values


    # Drop rows with ANY missing value
    df_clean = df.dropna()
    print(df_clean)    # only rows with no missing values

    # Drop rows only if ALL values are missing
    df_clean = df.dropna(how="all")

    # Drop rows missing in specific columns
    df_clean = df.dropna(subset=["name", "salary"])


Filling Missing Values


    # Fill all missing with a value
    df_filled = df.fillna(0)

    # Fill a specific column with a fixed value (on a copy here,
    # so the mean/median fills below still have something to fill)
    df_zero = df.copy()
    df_zero["salary"] = df_zero["salary"].fillna(0)

    # Fill with mean/median of that column — most common
    df["age"] = df["age"].fillna(df["age"].mean())
    df["salary"] = df["salary"].fillna(df["salary"].median())

    # Fill string columns with placeholder
    df["city"] = df["city"].fillna("Unknown")
    df["name"] = df["name"].fillna("No Name")

    print(df)

Output:

      name   age   salary    city
0    Rahul  25.0  45000.0   Delhi
1    Priya  28.0  72000.0  Mumbai
2  No Name  22.0  38000.0   Delhi
3     Amit  35.0  56500.0  Unknown
4     Neha  30.0  68000.0  Mumbai

Missing age filled with mean (28.0), missing salary filled with median (56500.0).


Part 7 — Grouping and Aggregation

This is one of the most powerful Pandas features. Same as SQL GROUP BY.


    df = pd.read_csv("employees.csv")
    print(df)

    # Group by department — get mean salary
    dept_salary = df.groupby("department")["salary"].mean()
    print(dept_salary)

Output:

    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
department
Engineering    89600.000000
HR             56500.000000
Marketing      66666.666667
Sales          50000.000000
Name: salary, dtype: float64

Multiple Aggregations


    # Multiple functions at once
    dept_stats = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
    print(dept_stats)

Output:

                     mean     min     max  count
department                                      
Engineering  89600.000000   78000  105000      5
HR           56500.000000   55000   58000      2
Marketing    66666.666667   62000   71000      3
Sales        50000.000000   48000   52000      2

Group by Multiple Columns


    # Average salary by department AND city
    dept_city = df.groupby(["department", "city"])["salary"].mean()
    print(dept_city)


Output:

department   city     
Engineering  Bangalore    98500.0
             Delhi        86500.0
             Hyderabad    78000.0
HR           Delhi        58000.0
             Mumbai       55000.0
Marketing    Bangalore    67000.0
             Mumbai       66500.0
Sales        Chennai      52000.0
             Delhi        48000.0
Name: salary, dtype: float64

agg with Different Functions per Column


    result = df.groupby("department").agg({
        "salary":     ["mean", "max"],
        "experience": "mean",
        "rating":     "mean",
        "name":       "count"
    })
    print(result)


Output:
                   salary         experience    rating  name
                     mean     max       mean      mean count
department                                                  
Engineering  89600.000000  105000        6.2  4.340000     5
HR           56500.000000   58000        5.0  4.050000     2
Marketing    66666.666667   71000        4.0  4.033333     3
Sales        50000.000000   52000        2.5  3.400000     2

value_counts — Frequency Count


    # How many employees per department
    print(df["department"].value_counts())

Output:

department
Engineering    5
Marketing      3
Sales          2
HR             2
Name: count, dtype: int64

    # How many per city
    print(df["city"].value_counts())

    # As percentage
    print(df["department"].value_counts(normalize=True) * 100)


Output:

city
Delhi        4
Mumbai       3
Bangalore    3
Hyderabad    1
Chennai      1
Name: count, dtype: int64
department
Engineering    41.666667
Marketing      25.000000
Sales          16.666667
HR             16.666667
Name: proportion, dtype: float64

Part 8 — Sorting


    df = pd.read_csv("employees.csv")
    print(df)

    # Sort by salary ascending
    print(df.sort_values("salary").head())

    # Sort by salary descending
    print(df.sort_values("salary", ascending=False).head())

    # Sort by multiple columns
    print(df.sort_values(["department", "salary"], ascending=[True, False]))

    # Sort by rating descending — top performers
    top_performers = df.sort_values("rating", ascending=False)
    print(top_performers[["name", "department", "salary", "rating"]].head(5))


Output:

    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
   id         name department  salary  experience     city  rating
3   4   Amit Kumar      Sales   48000           2    Delhi     3.2
7   8    Kiran Rao      Sales   52000           3  Chennai     3.6
4   5   Neha Gupta         HR   55000           4   Mumbai     4.0
9  10   Pooja Shah         HR   58000           6    Delhi     4.1
1   2  Priya Patel  Marketing   62000           3   Mumbai     3.8
    id          name   department  salary  experience       city  rating
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
    id          name   department  salary  experience       city  rating
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
            name   department  salary  rating
8     Arjun Nair  Engineering  105000     4.7
2    Gagan Singh  Engineering   92000     4.5
10    Dev Mishra  Engineering   88000     4.4
6    Sneha Joshi    Marketing   67000     4.3
0   Rahul Sharma  Engineering   85000     4.2

Part 9 — String Operations

Pandas has built-in string operations via .str:


    df = pd.read_csv("employees.csv")
    print(df)

    # Uppercase all names
    print(df["name"].str.upper())

    # Check if name contains a word
    print(df["name"].str.contains("Kumar"))

    # Extract first name
    df["first_name"] = df["name"].str.split(" ").str[0]
    print(df[["name", "first_name"]])

    # String length
    df["name_length"] = df["name"].str.len()

    # Replace
    df["city"] = df["city"].str.replace("Bangalore", "Bengaluru")

    # Strip whitespace — always do this on text data
    df["name"] = df["name"].str.strip()
    df["city"] = df["city"].str.strip().str.title()


Output:
    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
0     RAHUL SHARMA
1      PRIYA PATEL
2      GAGAN SINGH
3       AMIT KUMAR
4       NEHA GUPTA
5       RAVI VERMA
6      SNEHA JOSHI
7        KIRAN RAO
8       ARJUN NAIR
9       POOJA SHAH
10      DEV MISHRA
11      ANANYA ROY
Name: name, dtype: object
0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: name, dtype: bool
            name first_name
0   Rahul Sharma      Rahul
1    Priya Patel      Priya
2    Gagan Singh      Gagan
3     Amit Kumar       Amit
4     Neha Gupta       Neha
5     Ravi Verma       Ravi
6    Sneha Joshi      Sneha
7      Kiran Rao      Kiran
8     Arjun Nair      Arjun
9     Pooja Shah      Pooja
10    Dev Mishra        Dev
11    Ananya Roy     Ananya

Part 10 — Applying Custom Functions

apply() — Apply Function to Column


    df = pd.read_csv("employees.csv")

    # Apply function to each value in a column
    def get_grade(rating):
        if rating >= 4.5:
            return "Excellent"
        elif rating >= 4.0:
            return "Good"
        elif rating >= 3.5:
            return "Average"
        else:
            return "Needs Improvement"

    df["performance"] = df["rating"].apply(get_grade)
    print(df[["name", "rating", "performance"]])

Output:

            name  rating        performance
0   Rahul Sharma     4.2               Good
1    Priya Patel     3.8            Average
2    Gagan Singh     4.5          Excellent
3     Amit Kumar     3.2  Needs Improvement
4     Neha Gupta     4.0               Good
5     Ravi Verma     3.9            Average
6    Sneha Joshi     4.3               Good
7      Kiran Rao     3.6            Average
8     Arjun Nair     4.7          Excellent
9     Pooja Shah     4.1               Good
10    Dev Mishra     4.4               Good
11    Ananya Roy     4.0               Good

Lambda with apply()


    # Quick one-liner transformations
    df["salary_in_lakhs"] = df["salary"].apply(lambda x: round(x / 100000, 2))
    df["experience_level"] = df["experience"].apply(
        lambda x: "Junior" if x <= 3 else ("Mid" if x <= 6 else "Senior")
    )
    print(df[["name", "experience", "experience_level", "salary_in_lakhs"]])


Output:
            name  experience experience_level  salary_in_lakhs
0   Rahul Sharma           5              Mid             0.85
1    Priya Patel           3           Junior             0.62
2    Gagan Singh           7           Senior             0.92
3     Amit Kumar           2           Junior             0.48
4     Neha Gupta           4              Mid             0.55
5     Ravi Verma           4              Mid             0.78
6    Sneha Joshi           5              Mid             0.67
7      Kiran Rao           3           Junior             0.52
8     Arjun Nair           9           Senior             1.05
9     Pooja Shah           6              Mid             0.58
10    Dev Mishra           6              Mid             0.88
11    Ananya Roy           4              Mid             0.71

apply() on Entire Row — axis=1


    def calculate_bonus(row):
        if row["rating"] >= 4.5:
            return row["salary"] * 0.20
        elif row["rating"] >= 4.0:
            return row["salary"] * 0.15
        else:
            return row["salary"] * 0.10

    df["bonus"] = df.apply(calculate_bonus, axis=1)
    print(df[["name", "salary", "rating", "bonus"]])

Output:

            name  salary  rating    bonus
0   Rahul Sharma   85000     4.2  12750.0
1    Priya Patel   62000     3.8   6200.0
2    Gagan Singh   92000     4.5  18400.0
3     Amit Kumar   48000     3.2   4800.0
4     Neha Gupta   55000     4.0   8250.0
5     Ravi Verma   78000     3.9   7800.0
6    Sneha Joshi   67000     4.3  10050.0
7      Kiran Rao   52000     3.6   5200.0
8     Arjun Nair  105000     4.7  21000.0
9     Pooja Shah   58000     4.1   8700.0
10    Dev Mishra   88000     4.4  13200.0
11    Ananya Roy   71000     4.0  10650.0

axis=1 means apply the function to each row — row inside the function is a Series of that row's values.
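One caveat: apply with axis=1 calls your Python function once per row, which gets slow on large frames. The same bonus logic can be written with np.select, which evaluates whole columns at once — a sketch on a small hypothetical frame, not part of the original example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [85000, 62000, 92000],
    "rating": [4.2, 3.8, 4.5],
})

# Conditions are checked in order; the first match wins,
# and `default` covers everything else (like the final else)
conditions = [df["rating"] >= 4.5, df["rating"] >= 4.0]
rates      = [0.20, 0.15]
df["bonus"] = df["salary"] * np.select(conditions, rates, default=0.10)
print(df)
```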


Complete Real World Analysis

Let's do a full analysis of our employees dataset:


    import pandas as pd
    import numpy as np

    df = pd.read_csv("employees.csv")

    print("=" * 50)
    print("   EMPLOYEE DATA ANALYSIS REPORT")
    print("=" * 50)

    print(f"\nTotal Employees: {len(df)}")
    print(f"Departments: {df['department'].nunique()}")
    print(f"Cities: {df['city'].nunique()}")

    print("\n--- Salary Statistics ---")
    print(f"Average Salary : Rs.{df['salary'].mean():,.0f}")
    print(f"Median Salary  : Rs.{df['salary'].median():,.0f}")
    print(f"Highest Salary : Rs.{df['salary'].max():,.0f}")
    print(f"Lowest Salary  : Rs.{df['salary'].min():,.0f}")

    print("\n--- Department Summary ---")
    dept = df.groupby("department").agg(
        headcount=("name", "count"),
        avg_salary=("salary", "mean"),
        avg_rating=("rating", "mean")
    ).round(2)
    print(dept.to_string())

    print("\n--- Top 3 Earners ---")
    top3 = df.nlargest(3, "salary")[["name", "department", "salary"]]
    for _, row in top3.iterrows():
        print(f"  {row['name']} ({row['department']}): Rs.{row['salary']:,}")

    print("\n--- Performance Distribution ---")
    df["performance"] = df["rating"].apply(
        lambda r: "Excellent" if r >= 4.5 else ("Good" if r >= 4.0 else "Average")
    )
    print(df["performance"].value_counts().to_string())

    print("\n--- City Distribution ---")
    city_counts = df["city"].value_counts()
    for city, count in city_counts.items():
        bar = "█" * count
        print(f"  {city:<12} {bar} ({count})")

    print("\n--- Salary by Experience Level ---")
    df["exp_level"] = df["experience"].apply(
        lambda x: "Junior (0-3)" if x <= 3 else ("Mid (4-6)" if x <= 6 else "Senior (7+)")
    )
    exp_salary = df.groupby("exp_level")["salary"].mean().sort_values(ascending=False)
    for level, avg in exp_salary.items():
        print(f"  {level}: Rs.{avg:,.0f}")

Output:

==================================================
   EMPLOYEE DATA ANALYSIS REPORT
==================================================

Total Employees: 12
Departments: 4
Cities: 5

--- Salary Statistics ---
Average Salary : Rs.71,750
Median Salary  : Rs.69,000
Highest Salary : Rs.105,000
Lowest Salary  : Rs.48,000

--- Department Summary ---
              headcount  avg_salary  avg_rating
department
Engineering           5    89600.00        4.34
HR                    2    56500.00        4.05
Marketing             3    66666.67        4.03
Sales                 2    50000.00        3.40

--- Top 3 Earners ---
  Arjun Nair (Engineering): Rs.105,000
  Gagan Singh (Engineering): Rs.92,000
  Dev Mishra (Engineering): Rs.88,000

--- Performance Distribution ---
Good         6
Average      4
Excellent    2

--- City Distribution ---
  Delhi        ████ (4)
  Mumbai       ███ (3)
  Bangalore    ███ (3)
  Hyderabad    █ (1)
  Chennai      █ (1)

--- Salary by Experience Level ---
  Senior (7+): Rs.98,500
  Mid (4-6): Rs.71,714
  Junior (0-3): Rs.54,000

Saving Data


    # Save to CSV
    df.to_csv("output.csv", index=False)    # index=False — don't save row numbers

    # Save to Excel
    df.to_excel("output.xlsx", index=False)

    # Save to JSON
    df.to_json("output.json", orient="records", indent=2)


Quick Reference


    # Loading
    df = pd.read_csv("file.csv")

    # Exploring
    df.shape          # dimensions
    df.head()         # first 5 rows
    df.tail()         # last 5 rows
    df.info()         # column types and nulls
    df.describe()     # statistics
    df.columns        # column names
    df.dtypes         # data types

    # Selecting
    df["col"]                    # single column
    df[["col1", "col2"]]         # multiple columns
    df.iloc[0]                   # row by position
    df.loc[df["col"] > 50]       # rows by condition
    df.loc[condition, ["col1"]]  # rows + columns

    # Modifying
    df["new_col"] = values        # add column
    df.drop(columns=["col"])      # remove column
    df.rename(columns={"old": "new"})  # rename
    df.fillna(value)              # fill missing
    df.dropna()                   # remove missing rows

    # Analysis
    df["col"].value_counts()     # frequency count
    df.groupby("col").mean()     # group and aggregate
    df.sort_values("col")        # sort
    df["col"].apply(func)        # apply function
    df.nlargest(5, "col")        # top 5 rows
    df.nsmallest(5, "col")       # bottom 5 rows

    # Saving
    df.to_csv("file.csv", index=False)


Exercise 🏋️

Download the free "World Population Dataset" from Kaggle, or use this mock data:

Create sales_data.csv:

date,product,category,quantity,price,city,salesperson
2024-01-05,Laptop,Electronics,2,75000,Delhi,Rahul
2024-01-08,Phone,Electronics,5,25000,Mumbai,Priya
2024-01-10,Desk,Furniture,1,15000,Delhi,Rahul
2024-01-15,Phone,Electronics,3,25000,Bangalore,Gagan
2024-01-18,Chair,Furniture,4,8000,Delhi,Amit
2024-01-22,Laptop,Electronics,1,75000,Mumbai,Priya
2024-01-25,Tablet,Electronics,2,35000,Chennai,Neha
2024-02-02,Phone,Electronics,6,25000,Delhi,Rahul
2024-02-05,Desk,Furniture,2,15000,Mumbai,Priya
2024-02-10,Laptop,Electronics,3,75000,Bangalore,Gagan
2024-02-15,Chair,Furniture,5,8000,Hyderabad,Ravi
2024-02-20,Tablet,Electronics,4,35000,Delhi,Rahul
2024-03-01,Phone,Electronics,8,25000,Mumbai,Priya
2024-03-05,Laptop,Electronics,2,75000,Delhi,Amit
2024-03-10,Desk,Furniture,3,15000,Bangalore,Gagan

Tasks:

  1. Load and explore the dataset (shape, info, describe)
  2. Add a revenue column (quantity × price)
  3. Total revenue per category
  4. Top performing salesperson by total revenue
  5. Which city has highest total sales
  6. Monthly revenue trend (group by month)
  7. Best selling product by quantity
  8. Average order value per salesperson
  9. Find all Electronics sales above Rs.1,00,000 revenue
  10. Save the cleaned dataset with revenue column to a new CSV
