Data Cleaning Checklist — Interview & Industry Ready

1. Understand the Data First

What to do: Before touching anything, explore your dataset using df.head(), df.info(), and df.describe().

Why: You cannot clean what you don't understand. Jumping straight into cleaning without exploration leads to wrong assumptions — you might fill missing values incorrectly or drop columns that were actually useful.

Practical Example: You load a customer dataset and run df.info(). You notice a column called "age" has dtype "object" instead of "int". That tells you something is wrong — maybe there are string values like "N/A" or "unknown" hiding in there. Without this step, you'd never catch it.
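The scenario above can be reproduced with a tiny made-up frame (the column values here are illustrative, not from any real dataset):

```python
import pandas as pd

# Hypothetical customer data where "age" hides a string placeholder
df = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "age": [34, "N/A", 29],   # the "N/A" forces the whole column to dtype object
})

df.info()                     # age shows as object, not int64 — the red flag
print(df["age"].dtype)        # object
```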


2. Find Missing Values

What to do: Run df.isnull().sum() to count missing values per column. Also calculate the percentage: df.isnull().sum() / len(df) * 100

Why: Not all missing data is equal. A column with 2% missing is handled very differently from one with 60% missing. If a column is mostly empty, dropping it might be smarter than filling it.

Practical Example: In an HR dataset, "performance_score" has 65% missing values. Rather than filling all those blanks with the mean (which would fabricate data), you decide to drop the column entirely because it has too little real data to be useful.
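A minimal sketch of that decision, using a made-up HR frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "employee": ["A", "B", "C", "D", "E"],
    "performance_score": [4.2, np.nan, np.nan, np.nan, 3.8],  # 60% missing
})

missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct)

# Drop any column that is more than 50% empty instead of fabricating values
too_empty = missing_pct[missing_pct > 50].index
df = df.drop(columns=too_empty)
print(df.columns.tolist())    # ['employee']
```

The 50% cutoff is a judgment call, not a fixed rule; pick a threshold that fits the business context.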


3. Handle Missing Values

What to do: Choose the right fill strategy based on the column type and data distribution.

Why: Wrong imputation introduces bias. Using mean on skewed salary data pulls values toward outliers. Using a placeholder like "Unknown" for categories keeps the data honest without losing rows.

Practical Example:

  • Age column (roughly normal distribution) → fill with mean
  • Salary column (right-skewed, some people earn 10x others) → fill with median
  • City column (categorical) → fill with "Unknown"
  • Special case: salary has 0 values that actually mean "not recorded" → replace 0 with NaN first, then fill
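All four bullets above can be sketched in one pass on a small made-up frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 22, 35, 30],            # roughly symmetric → mean
    "salary": [45000, 72000, 0, np.nan, 68000],    # 0 actually means "not recorded"
    "city":   ["Delhi", None, "Mumbai", "Delhi", "Mumbai"],
})

# Special case first: turn the fake-missing 0 into a real NaN
df["salary"] = df["salary"].replace(0, np.nan)

df["age"]    = df["age"].fillna(df["age"].mean())          # mean for normal-ish data
df["salary"] = df["salary"].fillna(df["salary"].median())  # median for skewed data
df["city"]   = df["city"].fillna("Unknown")                # placeholder for categories

print(df)
```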

4. Handle Fake Missing Values

What to do: Identify values like 0, -1, 999, "NA", "none", "null" that were used as placeholders instead of actual NaN. Replace them with np.nan before doing any filling.

Why: If someone recorded missing salary as 0, and you calculate the mean salary including all those zeros, your mean will be completely wrong. Your model will also learn that salary = 0 is a valid data point.

Practical Example: A medical dataset uses -1 to represent "test not conducted" for blood pressure readings. If you don't replace -1 with NaN first, your average blood pressure calculation will give a meaningless negative number.
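A quick sketch of the blood-pressure case with invented readings:

```python
import pandas as pd
import numpy as np

bp = pd.Series([120, -1, 135, -1, 128])   # -1 means "test not conducted"

print(bp.mean())                 # 76.2 — meaningless, dragged down by the -1s

bp = bp.replace(-1, np.nan)      # turn the placeholder into a real NaN
print(bp.mean())                 # ~127.67 — computed only from real readings
```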


5. Remove Duplicates

What to do: Run df.drop_duplicates() to remove fully duplicate rows. For partial duplicates (same person, different entries), decide which row to keep based on business logic.

Why: Duplicate rows inflate your dataset, skew statistics, and cause models to overfit on repeated data. A machine learning model trained on duplicates will perform worse on real unseen data.

Practical Example: An e-commerce order dataset has the same order recorded twice due to a system glitch. If you train a sales forecasting model on this, it will think that customer ordered double what they actually did, inflating predictions.
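The glitched-order example, sketched with made-up order data:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 101, 102],      # order 101 recorded twice by a system glitch
    "customer": ["Rahul", "Rahul", "Priya"],
    "amount":   [500, 500, 750],
})

orders = orders.drop_duplicates()
print(orders["amount"].sum())         # 1250, not the inflated 1750
```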


6. Fix Data Types

What to do: Convert columns to their correct types — integers, floats, datetime, category — using astype() and pd.to_datetime().

Why: Wrong data types cause silent bugs. Calculations on a numeric column stored as string will either fail or give wrong results. Date columns stored as strings can't be sorted, filtered by range, or used for time-based features.

Practical Example: A "date_of_joining" column stored as a string means you can't calculate how many years an employee has worked. Once you convert it with pd.to_datetime(), you can do df["years_at_company"] = (pd.Timestamp.today() - df["date_of_joining"]).dt.days / 365 instantly.
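End to end, that conversion looks like this (dates invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"date_of_joining": ["2023-01-15", "2019-03-10"]})
print(df["date_of_joining"].dtype)        # object — just strings so far

# Convert strings to real datetimes, then date arithmetic works
df["date_of_joining"] = pd.to_datetime(df["date_of_joining"])

today = pd.Timestamp.today()
df["years_at_company"] = (today - df["date_of_joining"]).dt.days / 365
print(df)
```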


7. Handle Outliers

What to do: Use df.describe() to spot extreme min/max values. Use boxplots visually. Remove or cap values that are statistically impossible or clearly data entry errors.

Why: A single outlier can distort the mean dramatically. In ML models, outliers can pull regression lines and skew predictions. However, not all outliers should be removed — some are real and meaningful.

Practical Example: In a salary dataset, one entry shows salary = 99999999. That's clearly a data entry error (someone held "9" too long). Keeping it would make the average salary look like crores when it's actually lakhs. You filter it out with df = df[df["salary"] < 500000].
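Instead of a hard-coded cutoff like 500000, a common heuristic (not the only one, and not mandated by the text above) is the 1.5×IQR rule. A sketch with made-up salaries:

```python
import pandas as pd

salary = pd.Series([45000, 52000, 48000, 61000, 99999999])  # one fat-finger entry

q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr          # anything above this is flagged as an outlier

clean = salary[salary <= upper]
print(clean.mean())             # a sane average, no longer distorted
```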


8. Clean Text Data

What to do: Strip extra spaces with str.strip(), standardize case with str.lower() or str.upper(), fix inconsistent spellings or categories.

Why: "Mumbai", " mumbai", "MUMBAI" are the same city but Python treats them as three different values. This breaks groupby operations, value counts, and any category-based analysis.

Practical Example: You're grouping sales by city. Without cleaning, "Delhi", " delhi", "DELHI", "delhi " appear as four separate groups. After str.strip().str.lower(), they all correctly merge into one "delhi" group.
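The Delhi example in two lines:

```python
import pandas as pd

cities = pd.Series(["Delhi", " delhi", "DELHI", "delhi "])
print(cities.nunique())           # 4 — same city, four spellings

cities = cities.str.strip().str.lower()
print(cities.nunique())           # 1
print(cities.value_counts())      # delhi    4
```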


9. Feature Engineering (Optional but Powerful)

What to do: Create new meaningful columns from existing ones — ratios, date parts, bins, flags.

Why: Raw data isn't always the most informative form. Derived features often carry more signal for models and make analysis more intuitive.

Practical Example: Instead of raw salary in rupees, create salary_in_lakhs = salary / 100000 for easier readability. Or extract join_year and join_month from a date column to analyze hiring trends by time period.
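Both derived features, sketched on invented values:

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [450000, 720000],
    "date_of_joining": pd.to_datetime(["2023-01-15", "2021-06-20"]),
})

# Ratio feature: salary in lakhs for readability
df["salary_in_lakhs"] = df["salary"] / 100000

# Date-part features: hiring trends by year/month
df["join_year"]  = df["date_of_joining"].dt.year
df["join_month"] = df["date_of_joining"].dt.month

print(df[["salary_in_lakhs", "join_year", "join_month"]])
```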


10. Final Validation

What to do: After all cleaning steps, run a final check: df.isnull().sum() should be 0 (or acceptable), df.info() should show correct types, df.shape should match expected row count.

Why: Cleaning steps can sometimes introduce new issues — for example, converting a column type might fail silently and produce NaN for some rows. Always verify the end state matches your expectations.

Practical Example: After all your cleaning, df.isnull().sum() shows "age" still has 3 missing values. You investigate and realize those rows had age stored as "unknown" (string), which your mean-fill didn't catch because it was still treated as a string. Validation caught a bug that would have broken your model training.
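The "unknown"-as-string bug from the example, and one way to both fix and guard against it (pd.to_numeric with errors="coerce" is one option; the asserts make the validation fail loudly instead of silently):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, "unknown", 30]})

# Coerce: anything that isn't a number becomes NaN instead of staying a string
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["age"] = df["age"].fillna(df["age"].mean())

# Final checks — fail loudly if the clean-up left anything behind
assert df.isnull().sum().sum() == 0, "still have missing values"
assert df["age"].dtype == np.float64, "age is still not numeric"
print(df)
```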


One-Liner Interview Summary

Understand the data → Identify and handle missing values (including fake ones) → Remove duplicates → Fix data types → Handle outliers → Clean text → Engineer features if needed → Validate everything at the end.

What is Pandas - A Python Library

What is Pandas?

NumPy is great for numerical arrays where all data is the same type. But real world data looks like this:

Name    Age   City        Salary   Joined
Gagan   22    Delhi       45000    2023-01-15
Priya   28    Mumbai      72000    2021-06-20
Rahul   35    Bangalore   95000    2019-03-10

Mixed types — strings, numbers, dates — all in one table. NumPy can't handle this well. Pandas is built exactly for this.

Pandas gives you a data structure called DataFrame — essentially a spreadsheet / database table inside Python. It's the most used tool in all of data science.


Installing and Importing

Already installed. In a new Jupyter notebook:


    import pandas as pd
    import numpy as np

    print("Pandas version:", pd.__version__)

pd is the universal alias — everyone in the world writes pd. Never write import pandas without the alias.


Part 1 — Core Data Structures

Pandas has two main data structures:

  • Series — one column (1D)
  • DataFrame — full table (2D)

Series — Single Column of Data


    import pandas as pd

    # Create a Series from a list
    marks = pd.Series([85, 92, 78, 65, 90])
    print(marks)

Output:

0    85
1    92
2    78
3    65
4    90
dtype: int64

Left side = index, right side = values. Every Series has an index — by default 0, 1, 2...


Series with Custom Index


    marks = pd.Series(
        [85, 92, 78, 65, 90],
        index=["Rahul", "Priya", "Gagan", "Amit", "Neha"]
    )
    print(marks)

Output:

Rahul    85
Priya    92
Gagan    78
Amit     65
Neha     90
dtype: int64

Now you access by name instead of number:


    print(marks["Gagan"])     # 78
    print(marks["Priya"])     # 92


Series Operations


    marks = pd.Series([85, 92, 78, 65, 90])

    print(marks.mean())       # 82.0
    print(marks.sum())        # 410
    print(marks.max())        # 92
    print(marks.min())        # 65
    print(marks.std())        # 10.93...
    print(marks.describe())   # full statistical summary

Output of describe():

count     5.000000
mean     82.000000
std      10.931606
min      65.000000
25%      78.000000
50%      85.000000
75%      90.000000
max      92.000000
dtype: float64

One line gives you count, mean, std, min, quartiles, max. You'll use .describe() constantly.


DataFrame — The Main Event

A DataFrame is a table — multiple Series side by side sharing the same index.


    # Create DataFrame from dictionary
    data = {
        "name":   ["Rahul", "Priya", "Gagan", "Amit", "Neha"],
        "age":    [25,      28,      22,      35,     30],
        "city":   ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai"],
        "salary": [45000,   72000,   38000,   95000,  68000],
        "active": [True,    True,    True,    False,  True]
    }

    df = pd.DataFrame(data)
    print(df)

Output:

    name  age       city  salary  active
0  Rahul   25      Delhi   45000    True
1  Priya   28     Mumbai   72000    True
2  Gagan   22      Delhi   38000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True

This is a proper table — rows and columns, exactly like a spreadsheet.


Part 2 — Exploring a DataFrame

These are the first things you do with any new dataset:


    print(df.shape)       # (5, 5) — 5 rows, 5 columns
    print(df.ndim)        # 2
    print(df.size)        # 25 — total cells
    print(len(df))        # 5 — number of rows
    print()

    print(df.columns)
    # Index(['name', 'age', 'city', 'salary', 'active'], dtype='object')
    print()

    print(df.dtypes)
    # name      object
    # age        int64
    # city      object
    # salary     int64
    # active      bool
    # dtype: object
    print()

    # First 3 rows
    print(df.head(3))
    print()

    # Last 3 rows
    print(df.tail(3))
    print()

    # Random sample of 2 rows
    print(df.sample(2))
    print()

    # Most important — full summary
    print(df.info())

Output:

(5, 5)
2
25
5

Index(['name', 'age', 'city', 'salary', 'active'], dtype='object')

name      object
age        int64
city      object
salary     int64
active      bool
dtype: object

    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True

    name  age       city  salary  active
2  Gagan   22      Delhi   38000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True

    name  age       city  salary  active
3   Amit   35  Bangalore   95000   False
1  Priya   28     Mumbai   72000    True

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    5 non-null      object
 1   age     5 non-null      int64 
 2   city    5 non-null      object
 3   salary  5 non-null      int64 
 4   active  5 non-null      bool  
dtypes: bool(1), int64(2), object(2)
memory usage: 297.0+ bytes
None

df.info() shows column names, data types, and how many non-null values. Missing values show up here.


    # Statistical summary of numeric columns
    print(df.describe())

Output:

             age         salary
count   5.000000       5.000000
mean   28.000000   63600.000000
std     4.949747   22788.154818
min    22.000000   38000.000000
25%    25.000000   45000.000000
50%    28.000000   68000.000000
75%    30.000000   72000.000000
max    35.000000   95000.000000

Part 3 — Selecting Data

Selecting Columns


    # Single column — returns Series
    print(df["name"])
    print(df["salary"])

    # Multiple columns — returns DataFrame
    print(df[["name", "salary"]])
    print(df[["name", "age", "city"]])


Output:

0    Rahul
1    Priya
2    Gagan
3     Amit
4     Neha
Name: name, dtype: object

0    45000
1    72000
2    38000
3    95000
4    68000
Name: salary, dtype: int64

    name  salary
0  Rahul   45000
1  Priya   72000
2  Gagan   38000
3   Amit   95000
4   Neha   68000

    name  age       city
0  Rahul   25      Delhi
1  Priya   28     Mumbai
2  Gagan   22      Delhi
3   Amit   35  Bangalore
4   Neha   30     Mumbai

Selecting Rows — loc and iloc

This is where beginners get confused. Two ways to select rows:

iloc — select by integer position (like NumPy indexing)


    print(df.iloc[0])       # first row
    print(df.iloc[-1])      # last row
    print(df.iloc[1:3])     # rows 1 and 2
    print(df.iloc[[0, 2, 4]])   # rows 0, 2, 4

Output:

name      Rahul
age          25
city      Delhi
salary    45000
active     True
Name: 0, dtype: object
name        Neha
age           30
city      Mumbai
salary     68000
active      True
Name: 4, dtype: object

    name  age    city  salary  active
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True

    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
2  Gagan   22   Delhi   38000    True
4   Neha   30  Mumbai   68000    True

loc — select by label/condition


    print(df.loc[0])      # row with index label 0
    print(df.loc[0:2])    # rows with labels 0, 1, 2 (INCLUSIVE — different from iloc)


Output:

name      Rahul
age          25
city      Delhi
salary    45000
active     True
Name: 0, dtype: object
    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True

loc with Conditions — Most Used Pattern


    # Get all rows where city is Delhi
    delhi_employees = df.loc[df["city"] == "Delhi"]
    print(delhi_employees)

Output:

    name  age   city  salary  active
0  Rahul   25  Delhi   45000    True
2  Gagan   22  Delhi   38000    True

    # High earners
    high_earners = df.loc[df["salary"] > 60000]
    print(high_earners)

    # Active employees only
    active_df = df.loc[df["active"] == True]
    print(active_df)

    # Multiple conditions — & for AND, | for OR
    # Must use parentheses around each condition
    delhi_active = df.loc[(df["city"] == "Delhi") & (df["active"] == True)]
    print(delhi_active)

    young_or_senior = df.loc[(df["age"] < 25) | (df["age"] > 33)]
    print(young_or_senior)


Output:

    name  age       city  salary  active
1  Priya   28     Mumbai   72000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True
name age city salary active 0 Rahul 25 Delhi 45000 True 1 Priya 28 Mumbai 72000 True 2 Gagan 22 Delhi 38000 True 4 Neha 30 Mumbai 68000 True
name age city salary active 0 Rahul 25 Delhi 45000 True 2 Gagan 22 Delhi 38000 True
name age city salary active 2 Gagan 22 Delhi 38000 True 3 Amit 35 Bangalore 95000 False

Select Specific Rows AND Columns


    # loc[rows, columns]
    print(df.loc[df["salary"] > 60000, ["name", "salary"]])

Output:

    name  salary
1  Priya   72000
3   Amit   95000
4   Neha   68000

This is the most useful selection pattern — filter rows AND select specific columns at the same time.


Part 4 — Adding and Modifying Data

Adding New Column


    # Add tax column — 20% of salary
    df["tax"] = df["salary"] * 0.20
    print(df[["name", "salary", "tax"]])

Output:

    name  salary      tax
0  Rahul   45000   9000.0
1  Priya   72000  14400.0
2  Gagan   38000   7600.0
3   Amit   95000  19000.0
4   Neha   68000  13600.0

    # Net salary after tax
    df["net_salary"] = df["salary"] - df["tax"]

    # Category based on salary using np.where
    df["level"] = np.where(df["salary"] >= 70000, "Senior", "Junior")
    print(df[["name", "salary", "level"]])

Output:

    name  salary   level
0  Rahul   45000  Junior
1  Priya   72000  Senior
2  Gagan   38000  Junior
3   Amit   95000  Senior
4   Neha   68000  Junior

Modifying Existing Values

    # Give all Junior employees a 10% raise
    df.loc[df["level"] == "Junior", "salary"] = df.loc[df["level"] == "Junior", "salary"] * 1.10

    print(df[["name", "salary", "level"]])

Renaming Columns


    df = df.rename(columns={
        "name": "full_name",
        "age": "employee_age"
    })
    print(df.columns)


Dropping Columns and Rows


    # Drop column
    df_clean = df.drop(columns=["tax", "net_salary"])

    # Drop row by index
    df_clean = df.drop(index=2)

    # Drop multiple rows
    df_clean = df.drop(index=[0, 3])

    # axis=1 for columns, axis=0 for rows (older syntax you'll see)
    df_clean = df.drop("tax", axis=1)

Important: drop() returns a new DataFrame by default. The original df is unchanged unless you use inplace=True or reassign.
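A two-line demo of that gotcha (tiny made-up frame, reassignment is the recommended style over inplace=True):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "tax": [10, 20]})

df.drop(columns=["tax"])          # result thrown away — df still has "tax"
print(df.columns.tolist())        # ['a', 'tax']

df = df.drop(columns=["tax"])     # reassign to actually keep the change
print(df.columns.tolist())        # ['a']
```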


Part 5 — Loading Real Data

This is where Pandas becomes powerful. Loading a CSV file:


    # Read CSV file
    df = pd.read_csv("data.csv")

    # Read Excel
    df = pd.read_excel("data.xlsx")

    # Read JSON
    df = pd.read_json("data.json")

    # Read from URL
    df = pd.read_csv("https://example.com/data.csv")

Let's use a real dataset. Create a CSV file called employees.csv in your project folder:

id,name,department,salary,experience,city,rating
1,Rahul Sharma,Engineering,85000,5,Delhi,4.2
2,Priya Patel,Marketing,62000,3,Mumbai,3.8
3,Gagan Singh,Engineering,92000,7,Bangalore,4.5
4,Amit Kumar,Sales,48000,2,Delhi,3.2
5,Neha Gupta,HR,55000,4,Mumbai,4.0
6,Ravi Verma,Engineering,78000,4,Hyderabad,3.9
7,Sneha Joshi,Marketing,67000,5,Bangalore,4.3
8,Kiran Rao,Sales,52000,3,Chennai,3.6
9,Arjun Nair,Engineering,105000,9,Bangalore,4.7
10,Pooja Shah,HR,58000,6,Delhi,4.1
11,Dev Mishra,Engineering,88000,6,Delhi,4.4
12,Ananya Roy,Marketing,71000,4,Mumbai,4.0

Now load and explore it:


    df = pd.read_csv("employees.csv")

    print("Shape:", df.shape)
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nData types:")
    print(df.dtypes)
    print("\nSummary:")
    print(df.describe())
    print("\nInfo:")
    print(df.info())


Part 6 — Handling Missing Values

Real world data always has missing values. This is one of the most important data cleaning skills.

Let's create data with missing values:


    data = {
        "name":   ["Rahul", "Priya", None, "Amit", "Neha"],
        "age":    [25, None, 22, 35, 30],
        "salary": [45000, 72000, 38000, None, 68000],
        "city":   ["Delhi", "Mumbai", "Delhi", None, "Mumbai"]
    }

    df = pd.DataFrame(data)
    print(df)

Output:

    name   age   salary    city
0  Rahul  25.0  45000.0   Delhi
1  Priya   NaN  72000.0  Mumbai
2   None  22.0  38000.0   Delhi
3   Amit  35.0      NaN    None
4   Neha  30.0  68000.0  Mumbai

NaN = Not a Number = missing value in Pandas.


Detecting Missing Values


    print(df.isnull())          # True where value is missing
    print(df.isnull().sum())    # count of missing per column
    print(df.isnull().sum().sum())  # total missing in entire df
    print(df.notnull())         # opposite — True where value exists

Output:

    name    age  salary   city
0  False  False   False  False
1  False   True   False  False
2   True  False   False  False
3  False  False    True   True
4  False  False   False  False
name      1
age       1
salary    1
city      1
dtype: int64

4

    name    age  salary   city
0   True   True    True   True
1   True  False    True   True
2  False   True    True   True
3   True   True   False  False
4   True   True    True   True

Dropping Missing Values


    # Drop rows with ANY missing value
    df_clean = df.dropna()
    print(df_clean)    # only rows with no missing values

    # Drop rows only if ALL values are missing
    df_clean = df.dropna(how="all")

    # Drop rows missing in specific columns
    df_clean = df.dropna(subset=["name", "salary"])


Filling Missing Values


    # Fill all missing with a single value — returns a new DataFrame
    df_filled = df.fillna(0)

    # Fill with mean/median of that column — most common
    # (note: do NOT fillna(0) a column first, or no NaN is left for the
    # mean/median fill to replace)
    df["age"] = df["age"].fillna(df["age"].mean())
    df["salary"] = df["salary"].fillna(df["salary"].median())

    # Fill string columns with placeholder
    df["city"] = df["city"].fillna("Unknown")
    df["name"] = df["name"].fillna("No Name")

    print(df)

Output:

      name   age   salary     city
0    Rahul  25.0  45000.0    Delhi
1    Priya  28.0  72000.0   Mumbai
2  No Name  22.0  38000.0    Delhi
3     Amit  35.0  56500.0  Unknown
4     Neha  30.0  68000.0   Mumbai

Missing age filled with mean (28.0), missing salary filled with median (56500.0).


Part 7 — Grouping and Aggregation

This is one of the most powerful Pandas features. Same as SQL GROUP BY.


    df = pd.read_csv("employees.csv")
    print(df)

    # Group by department — get mean salary
    dept_salary = df.groupby("department")["salary"].mean()
    print(dept_salary)

Output:

    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
department
Engineering    89600.000000
HR             56500.000000
Marketing      66666.666667
Sales          50000.000000
Name: salary, dtype: float64

Multiple Aggregations


    # Multiple functions at once
    dept_stats = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
    print(dept_stats)

Output:

                     mean    min     max  count
department                                     
Engineering  89600.000000  78000  105000      5
HR           56500.000000  55000   58000      2
Marketing    66666.666667  62000   71000      3
Sales        50000.000000  48000   52000      2

Group by Multiple Columns


    # Average salary by department AND city
    dept_city = df.groupby(["department", "city"])["salary"].mean()
    print(dept_city)


Output:

department   city     
Engineering  Bangalore    98500.0
             Delhi        86500.0
             Hyderabad    78000.0
HR           Delhi        58000.0
             Mumbai       55000.0
Marketing    Bangalore    67000.0
             Mumbai       66500.0
Sales        Chennai      52000.0
             Delhi        48000.0
Name: salary, dtype: float64

agg with Different Functions per Column


    result = df.groupby("department").agg({
        "salary":     ["mean", "max"],
        "experience": "mean",
        "rating":     "mean",
        "name":       "count"
    })
    print(result)


Output:
                   salary         experience    rating  name
                     mean     max       mean      mean count
department                                                  
Engineering  89600.000000  105000        6.2  4.340000     5
HR           56500.000000   58000        5.0  4.050000     2
Marketing    66666.666667   71000        4.0  4.033333     3
Sales        50000.000000   52000        2.5  3.400000     2

value_counts — Frequency Count


    # How many employees per department
    print(df["department"].value_counts())

Output:

department
Engineering    5
Marketing      3
Sales          2
HR             2
Name: count, dtype: int64

    # How many per city
    print(df["city"].value_counts())

    # As percentage
    print(df["department"].value_counts(normalize=True) * 100)


Output:

city
Delhi        4
Mumbai       3
Bangalore    3
Hyderabad    1
Chennai      1
Name: count, dtype: int64
department
Engineering    41.666667
Marketing      25.000000
Sales          16.666667
HR             16.666667
Name: proportion, dtype: float64

Part 8 — Sorting


    df = pd.read_csv("employees.csv")
    print(df)

    # Sort by salary ascending
    print(df.sort_values("salary").head())

    # Sort by salary descending
    print(df.sort_values("salary", ascending=False).head())

    # Sort by multiple columns
    print(df.sort_values(["department", "salary"], ascending=[True, False]))

    # Sort by rating descending — top performers
    top_performers = df.sort_values("rating", ascending=False)
    print(top_performers[["name", "department", "salary", "rating"]].head(5))


Output:

    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
    id          name   department  salary  experience     city  rating
3    4    Amit Kumar        Sales   48000           2    Delhi     3.2
7    8     Kiran Rao        Sales   52000           3  Chennai     3.6
4    5    Neha Gupta           HR   55000           4   Mumbai     4.0
9   10    Pooja Shah           HR   58000           6    Delhi     4.1
1    2   Priya Patel    Marketing   62000           3   Mumbai     3.8

    id          name   department  salary  experience       city  rating
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9

    id          name   department  salary  experience       city  rating
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2

            name   department  salary  rating
8     Arjun Nair  Engineering  105000     4.7
2    Gagan Singh  Engineering   92000     4.5
10    Dev Mishra  Engineering   88000     4.4
6    Sneha Joshi    Marketing   67000     4.3
0   Rahul Sharma  Engineering   85000     4.2

Part 9 — String Operations

Pandas has built-in string operations via .str:


    df = pd.read_csv("employees.csv")
    print(df)

    # Uppercase all names
    print(df["name"].str.upper())

    # Check if name contains a word
    print(df["name"].str.contains("Kumar"))

    # Extract first name
    df["first_name"] = df["name"].str.split(" ").str[0]
    print(df[["name", "first_name"]])

    # String length
    df["name_length"] = df["name"].str.len()

    # Replace
    df["city"] = df["city"].str.replace("Bangalore", "Bengaluru")

    # Strip whitespace — always do this on text data
    df["name"] = df["name"].str.strip()
    df["city"] = df["city"].str.strip().str.title()


Output:
    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
0     RAHUL SHARMA
1      PRIYA PATEL
2      GAGAN SINGH
3       AMIT KUMAR
4       NEHA GUPTA
5       RAVI VERMA
6      SNEHA JOSHI
7        KIRAN RAO
8       ARJUN NAIR
9       POOJA SHAH
10      DEV MISHRA
11      ANANYA ROY
Name: name, dtype: object

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: name, dtype: bool

            name first_name
0   Rahul Sharma      Rahul
1    Priya Patel      Priya
2    Gagan Singh      Gagan
3     Amit Kumar       Amit
4     Neha Gupta       Neha
5     Ravi Verma       Ravi
6    Sneha Joshi      Sneha
7      Kiran Rao      Kiran
8     Arjun Nair      Arjun
9     Pooja Shah      Pooja
10    Dev Mishra        Dev
11    Ananya Roy     Ananya

Part 10 — Applying Custom Functions

apply() — Apply Function to Column


    df = pd.read_csv("employees.csv")

    # Apply function to each value in a column
    def get_grade(rating):
        if rating >= 4.5:
            return "Excellent"
        elif rating >= 4.0:
            return "Good"
        elif rating >= 3.5:
            return "Average"
        else:
            return "Needs Improvement"

    df["performance"] = df["rating"].apply(get_grade)
    print(df[["name", "rating", "performance"]])

Output:

            name  rating        performance
0   Rahul Sharma     4.2               Good
1    Priya Patel     3.8            Average
2    Gagan Singh     4.5          Excellent
3     Amit Kumar     3.2  Needs Improvement
4     Neha Gupta     4.0               Good
5     Ravi Verma     3.9            Average
6    Sneha Joshi     4.3               Good
7      Kiran Rao     3.6            Average
8     Arjun Nair     4.7          Excellent
9     Pooja Shah     4.1               Good
10    Dev Mishra     4.4               Good
11    Ananya Roy     4.0               Good

Lambda with apply()


    # Quick one-liner transformations
    df["salary_in_lakhs"] = df["salary"].apply(lambda x: round(x / 100000, 2))
    df["experience_level"] = df["experience"].apply(
        lambda x: "Junior" if x <= 3 else ("Mid" if x <= 6 else "Senior")
    )
    print(df[["name", "experience", "experience_level", "salary_in_lakhs"]])


Output:
            name  experience experience_level  salary_in_lakhs
0   Rahul Sharma           5              Mid             0.85
1    Priya Patel           3           Junior             0.62
2    Gagan Singh           7           Senior             0.92
3     Amit Kumar           2           Junior             0.48
4     Neha Gupta           4              Mid             0.55
5     Ravi Verma           4              Mid             0.78
6    Sneha Joshi           5              Mid             0.67
7      Kiran Rao           3           Junior             0.52
8     Arjun Nair           9           Senior             1.05
9     Pooja Shah           6              Mid             0.58
10    Dev Mishra           6              Mid             0.88
11    Ananya Roy           4              Mid             0.71

apply() on Entire Row — axis=1


    def calculate_bonus(row):
        if row["rating"] >= 4.5:
            return row["salary"] * 0.20
        elif row["rating"] >= 4.0:
            return row["salary"] * 0.15
        else:
            return row["salary"] * 0.10

    df["bonus"] = df.apply(calculate_bonus, axis=1)
    print(df[["name", "salary", "rating", "bonus"]])

Output:

            name  salary  rating    bonus
0   Rahul Sharma   85000     4.2  12750.0
1    Priya Patel   62000     3.8   6200.0
2    Gagan Singh   92000     4.5  18400.0
3     Amit Kumar   48000     3.2   4800.0
4     Neha Gupta   55000     4.0   8250.0
5     Ravi Verma   78000     3.9   7800.0
6    Sneha Joshi   67000     4.3  10050.0
7      Kiran Rao   52000     3.6   5200.0
8     Arjun Nair  105000     4.7  21000.0
9     Pooja Shah   58000     4.1   8700.0
10    Dev Mishra   88000     4.4  13200.0
11    Ananya Roy   71000     4.0  10650.0

axis=1 applies the function to each row; inside the function, row is a Series containing that row's values, indexed by column name.
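
One caveat: `apply(..., axis=1)` calls a Python function once per row, which gets slow on large DataFrames. The same bonus logic can usually be vectorized, for example with `np.select` (a sketch on a three-row toy frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [85000, 62000, 92000],
                   "rating": [4.2, 3.8, 4.5]})

# np.select checks the conditions in order and picks the first match,
# falling back to `default` — one pass over whole columns, no per-row calls
conditions = [df["rating"] >= 4.5, df["rating"] >= 4.0]
rates = [0.20, 0.15]
df["bonus"] = df["salary"] * np.select(conditions, rates, default=0.10)
print(df)
```

The conditions are evaluated top to bottom, exactly like the if/elif chain in calculate_bonus, so the order matters.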


Complete Real World Analysis

Let's do a full analysis of our employees dataset:


    import pandas as pd
    import numpy as np

    df = pd.read_csv("employees.csv")

    print("=" * 50)
    print("   EMPLOYEE DATA ANALYSIS REPORT")
    print("=" * 50)

    print(f"\nTotal Employees: {len(df)}")
    print(f"Departments: {df['department'].nunique()}")
    print(f"Cities: {df['city'].nunique()}")

    print("\n--- Salary Statistics ---")
    print(f"Average Salary : Rs.{df['salary'].mean():,.0f}")
    print(f"Median Salary  : Rs.{df['salary'].median():,.0f}")
    print(f"Highest Salary : Rs.{df['salary'].max():,.0f}")
    print(f"Lowest Salary  : Rs.{df['salary'].min():,.0f}")

    print("\n--- Department Summary ---")
    dept = df.groupby("department").agg(
        headcount=("name", "count"),
        avg_salary=("salary", "mean"),
        avg_rating=("rating", "mean")
    ).round(2)
    print(dept.to_string())

    print("\n--- Top 3 Earners ---")
    top3 = df.nlargest(3, "salary")[["name", "department", "salary"]]
    for _, row in top3.iterrows():
        print(f"  {row['name']} ({row['department']}): Rs.{row['salary']:,}")

    print("\n--- Performance Distribution ---")
    df["performance"] = df["rating"].apply(
        lambda r: "Excellent" if r >= 4.5 else ("Good" if r >= 4.0 else "Average")
    )
    print(df["performance"].value_counts().to_string())

    print("\n--- City Distribution ---")
    city_counts = df["city"].value_counts()
    for city, count in city_counts.items():
        bar = "" * count
        print(f"  {city:<12} {bar} ({count})")

    print("\n--- Salary by Experience Level ---")
    df["exp_level"] = df["experience"].apply(
        lambda x: "Junior (0-3)" if x <= 3 else ("Mid (4-6)" if x <= 6 else "Senior (7+)")
    )
    exp_salary = df.groupby("exp_level")["salary"].mean().sort_values(ascending=False)
    for level, avg in exp_salary.items():
        print(f"  {level}: Rs.{avg:,.0f}")

Output:

==================================================
   EMPLOYEE DATA ANALYSIS REPORT
==================================================

Total Employees: 12
Departments: 4
Cities: 5

--- Salary Statistics ---
Average Salary : Rs.71,750
Median Salary  : Rs.69,000
Highest Salary : Rs.105,000
Lowest Salary  : Rs.48,000

--- Department Summary ---
              headcount  avg_salary  avg_rating
department
Engineering           5     89600.0        4.34
HR                    2     56500.0        4.05
Marketing             3     66666.7        4.03
Sales                 2     50000.0        3.40

--- Top 3 Earners ---
  Arjun Nair (Engineering): Rs.105,000
  Gagan Singh (Engineering): Rs.92,000
  Dev Mishra (Engineering): Rs.88,000

--- Performance Distribution ---
Good         6
Average      4
Excellent    2

--- City Distribution ---
  Delhi        ████ (4)
  Bangalore    ███ (3)
  Mumbai       ███ (3)
  Hyderabad    █ (1)
  Chennai      █ (1)

--- Salary by Experience Level ---
  Senior (7+): Rs.98,500
  Mid (4-6): Rs.71,714
  Junior (0-3): Rs.54,000

Saving Data


    # Save to CSV
    df.to_csv("output.csv", index=False)    # index=False — don't save row numbers

    # Save to Excel
    df.to_excel("output.xlsx", index=False)

    # Save to JSON
    df.to_json("output.json", orient="records", indent=2)
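
A quick way to convince yourself that index=False did the right thing is to round-trip the data. This sketch uses an in-memory buffer instead of a real file (the to_csv/read_csv calls are identical with a path):

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Rahul", "Priya"], "salary": [85000, 62000]})

# Write to a buffer, then read it straight back
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

# index=False means no stray "Unnamed: 0" column appears on reload
print(df2.equals(df))
```

If you forget index=False, the reloaded frame gains an extra "Unnamed: 0" column and the equality check fails. (Note that to_excel additionally requires the openpyxl package to be installed.)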


Quick Reference


    # Loading
    df = pd.read_csv("file.csv")

    # Exploring
    df.shape          # dimensions
    df.head()         # first 5 rows
    df.tail()         # last 5 rows
    df.info()         # column types and nulls
    df.describe()     # statistics
    df.columns        # column names
    df.dtypes         # data types

    # Selecting
    df["col"]                    # single column
    df[["col1", "col2"]]         # multiple columns
    df.iloc[0]                   # row by position
    df.loc[df["col"] > 50]       # rows by condition
    df.loc[condition, ["col1"]]  # rows + columns

    # Modifying (most of these return a new DataFrame — reassign to keep the result)
    df["new_col"] = values        # add column
    df.drop(columns=["col"])      # remove column
    df.rename(columns={"old": "new"})  # rename
    df.fillna(value)              # fill missing
    df.dropna()                   # remove missing rows

    # Analysis
    df["col"].value_counts()     # frequency count
    df.groupby("col").mean()     # group and aggregate
    df.sort_values("col")        # sort
    df["col"].apply(func)        # apply function
    df.nlargest(5, "col")        # top 5 rows
    df.nsmallest(5, "col")       # bottom 5 rows

    # Saving
    df.to_csv("file.csv", index=False)


Exercise 🏋️

Download the free "World Population Dataset" from Kaggle, or use this mock data:

Create sales_data.csv:

date,product,category,quantity,price,city,salesperson
2024-01-05,Laptop,Electronics,2,75000,Delhi,Rahul
2024-01-08,Phone,Electronics,5,25000,Mumbai,Priya
2024-01-10,Desk,Furniture,1,15000,Delhi,Rahul
2024-01-15,Phone,Electronics,3,25000,Bangalore,Gagan
2024-01-18,Chair,Furniture,4,8000,Delhi,Amit
2024-01-22,Laptop,Electronics,1,75000,Mumbai,Priya
2024-01-25,Tablet,Electronics,2,35000,Chennai,Neha
2024-02-02,Phone,Electronics,6,25000,Delhi,Rahul
2024-02-05,Desk,Furniture,2,15000,Mumbai,Priya
2024-02-10,Laptop,Electronics,3,75000,Bangalore,Gagan
2024-02-15,Chair,Furniture,5,8000,Hyderabad,Ravi
2024-02-20,Tablet,Electronics,4,35000,Delhi,Rahul
2024-03-01,Phone,Electronics,8,25000,Mumbai,Priya
2024-03-05,Laptop,Electronics,2,75000,Delhi,Amit
2024-03-10,Desk,Furniture,3,15000,Bangalore,Gagan

Tasks:

  1. Load and explore the dataset (shape, info, describe)
  2. Add a revenue column (quantity × price)
  3. Total revenue per category
  4. Top performing salesperson by total revenue
  5. Which city has highest total sales
  6. Monthly revenue trend (group by month)
  7. Best selling product by quantity
  8. Average order value per salesperson
  9. Find all Electronics sales above Rs.1,00,000 revenue
  10. Save the cleaned dataset with revenue column to a new CSV
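
To get you started on tasks 2 and 6, here is one possible approach on a trimmed-down version of the data (the full CSV above has more columns and rows; `parse_dates` and `dt.to_period` are the key ideas):

```python
import io
import pandas as pd

csv = """date,product,quantity,price
2024-01-05,Laptop,2,75000
2024-02-02,Phone,6,25000
2024-02-10,Laptop,3,75000"""

# parse_dates turns the date column into real datetimes at load time
df = pd.read_csv(io.StringIO(csv), parse_dates=["date"])

df["revenue"] = df["quantity"] * df["price"]   # task 2: revenue column

# task 6: collapse dates to calendar months, then sum revenue per month
monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()
print(monthly)
```

dt.to_period("M") maps every date to its month (2024-01-05 → 2024-01), so the groupby produces one row per month — the monthly trend the task asks for.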

Complete Machine Learning Terminology Guide

🗂️ PART 1: DATA TERMS Dataset A dataset is simply a collection of data organized in rows and columns — like an Excel sheet. Every ML proje...