What is Pandas - A Python Library

What is Pandas?

NumPy is great for numerical arrays where all the data is the same type. But real-world data looks like this:

Name    Age   City        Salary   Joined
Gagan   22    Delhi       45000    2023-01-15
Priya   28    Mumbai      72000    2021-06-20
Rahul   35    Bangalore   95000    2019-03-10

Mixed types — strings, numbers, dates — all in one table. NumPy can't handle this well. Pandas is built exactly for this.

Pandas gives you a data structure called DataFrame — essentially a spreadsheet / database table inside Python. It's the most used tool in all of data science.


Installing and Importing

If you installed Anaconda, Pandas is already included; otherwise install it once with pip install pandas. In a new Jupyter notebook:


    import pandas as pd
    import numpy as np

    print("Pandas version:", pd.__version__)

pd is the universal alias — everyone in the world writes pd. Never write import pandas without the alias.


Part 1 — Core Data Structures

Pandas has two main data structures:

  • Series — one column (1D)
  • DataFrame — full table (2D)

Series — Single Column of Data


    import pandas as pd

    # Create a Series from a list
    marks = pd.Series([85, 92, 78, 65, 90])
    print(marks)

Output:

0    85
1    92
2    78
3    65
4    90
dtype: int64

Left side = index, right side = values. Every Series has an index — by default 0, 1, 2...
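Both halves are available as attributes — a quick check, re-creating the same marks Series:

```python
import pandas as pd

marks = pd.Series([85, 92, 78, 65, 90])

print(marks.index)    # RangeIndex(start=0, stop=5, step=1)
print(marks.values)   # the underlying NumPy array: [85 92 78 65 90]
print(marks[0])       # 85 — look up a value by its index label
```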


Series with Custom Index


    marks = pd.Series(
        [85, 92, 78, 65, 90],
        index=["Rahul", "Priya", "Gagan", "Amit", "Neha"]
    )
    print(marks)

Output:

Rahul    85
Priya    92
Gagan    78
Amit     65
Neha     90
dtype: int64

Now you access by name instead of number:


    print(marks["Gagan"])     # 78
    print(marks["Priya"])     # 92


Series Operations


    marks = pd.Series([85, 92, 78, 65, 90])

    print(marks.mean())       # 82.0
    print(marks.sum())        # 410
    print(marks.max())        # 92
    print(marks.min())        # 65
    print(marks.std())        # 10.93...
    print(marks.describe())   # full statistical summary

Output of describe():

count     5.000000
mean     82.000000
std      10.931606
min      65.000000
25%      78.000000
50%      85.000000
75%      90.000000
max      92.000000
dtype: float64

One line gives you count, mean, std, min, quartiles, max. You'll use .describe() constantly.


DataFrame — The Main Event

A DataFrame is a table — multiple Series side by side sharing the same index.


    # Create DataFrame from dictionary
    data = {
        "name":   ["Rahul", "Priya", "Gagan", "Amit", "Neha"],
        "age":    [25,      28,      22,      35,     30],
        "city":   ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai"],
        "salary": [45000,   72000,   38000,   95000,  68000],
        "active": [True,    True,    True,    False,  True]
    }

    df = pd.DataFrame(data)
    print(df)

Output:

    name  age       city  salary  active
0  Rahul   25      Delhi   45000    True
1  Priya   28     Mumbai   72000    True
2  Gagan   22      Delhi   38000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True

This is a proper table — rows and columns, exactly like a spreadsheet.
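Compare this with the NumPy limitation from the start of the post: a NumPy array would coerce all of these mixed values to a single dtype, while a DataFrame keeps one dtype per column. A minimal sketch of that contrast:

```python
import numpy as np
import pandas as pd

# NumPy coerces mixed data to a single dtype — here everything becomes a string
arr = np.array([["Gagan", 22, 45000], ["Priya", 28, 72000]])
print(arr.dtype)          # a Unicode string dtype like <U5

# A DataFrame keeps one dtype per column
mixed = pd.DataFrame({"name":   ["Gagan", "Priya"],
                      "age":    [22, 28],
                      "salary": [45000, 72000]})
print(mixed.dtypes)       # name is object, age and salary stay int64
```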


Part 2 — Exploring a DataFrame

These are the first things you do with any new dataset:


    print(df.shape)       # (5, 5) — 5 rows, 5 columns
    print(df.ndim)        # 2
    print(df.size)        # 25 — total cells
    print(len(df))        # 5 — number of rows
    print()

    print(df.columns)
    # Index(['name', 'age', 'city', 'salary', 'active'], dtype='object')
    print()

    print(df.dtypes)
    # name      object
    # age        int64
    # city      object
    # salary     int64
    # active      bool
    # dtype: object
    print()

    # First 3 rows
    print(df.head(3))
    print()

    # Last 3 rows
    print(df.tail(3))
    print()

    # Random sample of 2 rows
    print(df.sample(2))
    print()

    # Most important — full summary
    print(df.info())

Output:

(5, 5)
2
25
5

Index(['name', 'age', 'city', 'salary', 'active'], dtype='object')

name      object
age        int64
city      object
salary     int64
active      bool
dtype: object

    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True

    name  age       city  salary  active
2  Gagan   22      Delhi   38000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True

    name  age       city  salary  active
3   Amit   35  Bangalore   95000   False
1  Priya   28     Mumbai   72000    True

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    5 non-null      object
 1   age     5 non-null      int64 
 2   city    5 non-null      object
 3   salary  5 non-null      int64 
 4   active  5 non-null      bool  
dtypes: bool(1), int64(2), object(2)
memory usage: 297.0+ bytes
None

df.info() shows column names, data types, and how many non-null values. Missing values show up here.
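A small sketch of how a missing value shows up in info() — the non-null count drops below the row count (df2 here is a throwaway frame, not the tutorial's df):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"name":   ["Rahul", "Priya", "Gagan"],
                    "salary": [45000, np.nan, 38000]})

df2.info()                            # salary shows "2 non-null" — one value missing
print(df2["salary"].isnull().sum())   # 1
```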


    # Statistical summary of numeric columns
    print(df.describe())

Output:

             age         salary
count   5.000000       5.000000
mean   28.000000   63600.000000
std     4.949747   22788.154818
min    22.000000   38000.000000
25%    25.000000   45000.000000
50%    28.000000   68000.000000
75%    30.000000   72000.000000
max    35.000000   95000.000000

Part 3 — Selecting Data

Selecting Columns


    # Single column — returns Series
    print(df["name"])
    print(df["salary"])

    # Multiple columns — returns DataFrame
    print(df[["name", "salary"]])
    print(df[["name", "age", "city"]])


Output:

0    Rahul
1    Priya
2    Gagan
3     Amit
4     Neha
Name: name, dtype: object

0    45000
1    72000
2    38000
3    95000
4    68000
Name: salary, dtype: int64

    name  salary
0  Rahul   45000
1  Priya   72000
2  Gagan   38000
3   Amit   95000
4   Neha   68000

    name  age       city
0  Rahul   25      Delhi
1  Priya   28     Mumbai
2  Gagan   22      Delhi
3   Amit   35  Bangalore
4   Neha   30     Mumbai

Selecting Rows — loc and iloc

This is where beginners get confused. Two ways to select rows:

iloc — select by integer position (like NumPy indexing)


    print(df.iloc[0])       # first row
    print(df.iloc[-1])      # last row
    print(df.iloc[1:3])     # rows 1 and 2
    print(df.iloc[[0, 2, 4]])   # rows 0, 2, 4

Output:

name      Rahul
age          25
city      Delhi
salary    45000
active     True
Name: 0, dtype: object
name       Neha
age          30
city     Mumbai
salary    68000
active     True
Name: 4, dtype: object

    name  age    city  salary  active
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True

    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
2  Gagan   22   Delhi   38000    True
4   Neha   30  Mumbai   68000    True

loc — select by label/condition


    print(df.loc[0])      # row with index label 0
    print(df.loc[0:2])    # rows with labels 0, 1, 2 (INCLUSIVE — different from iloc)


Output:

name      Rahul
age          25
city      Delhi
salary    45000
active     True
Name: 0, dtype: object
    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True
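The inclusive/exclusive difference is easiest to see side by side on a tiny throwaway Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])

print(len(s.iloc[0:2]))   # 2 — positions 0 and 1, end excluded
print(len(s.loc[0:2]))    # 3 — labels 0, 1 AND 2, end included
```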

loc with Conditions — Most Used Pattern


    # Get all rows where city is Delhi
    delhi_employees = df.loc[df["city"] == "Delhi"]
    print(delhi_employees)

Output:

    name  age   city  salary  active
0  Rahul   25  Delhi   45000    True
2  Gagan   22  Delhi   38000    True

    # High earners
    high_earners = df.loc[df["salary"] > 60000]
    print(high_earners)

    # Active employees only
    active_df = df.loc[df["active"] == True]
    print(active_df)

    # Multiple conditions — & for AND, | for OR
    # Must use parentheses around each condition
    delhi_active = df.loc[(df["city"] == "Delhi") & (df["active"] == True)]
    print(delhi_active)

    young_or_senior = df.loc[(df["age"] < 25) | (df["age"] > 33)]
    print(young_or_senior)


Output:

    name  age       city  salary  active
1  Priya   28     Mumbai   72000    True
3   Amit   35  Bangalore   95000   False
4   Neha   30     Mumbai   68000    True
    name  age    city  salary  active
0  Rahul   25   Delhi   45000    True
1  Priya   28  Mumbai   72000    True
2  Gagan   22   Delhi   38000    True
4   Neha   30  Mumbai   68000    True

    name  age   city  salary  active
0  Rahul   25  Delhi   45000    True
2  Gagan   22  Delhi   38000    True

    name  age       city  salary  active
2  Gagan   22      Delhi   38000    True
3   Amit   35  Bangalore   95000   False

Select Specific Rows AND Columns


    # loc[rows, columns]
    print(df.loc[df["salary"] > 60000, ["name", "salary"]])

Output:

    name  salary
1  Priya   72000
3   Amit   95000
4   Neha   68000

This is the most useful selection pattern — filter rows AND select specific columns at the same time.
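The same loc[rows, columns] form is also the safe way to write values. Chained indexing like df[df["salary"] > 60000]["salary"] = ... may silently modify a copy (pandas can warn with SettingWithCopyWarning), while a single .loc call does not. A sketch on a throwaway frame:

```python
import pandas as pd

demo = pd.DataFrame({"name": ["A", "B"], "salary": [50000, 80000]})

# One .loc call: filter the rows and assign in a single step
demo.loc[demo["salary"] > 60000, "salary"] = 90000
print(demo)   # only B's salary changed
```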


Part 4 — Adding and Modifying Data

Adding New Column


    # Add tax column — 20% of salary
    df["tax"] = df["salary"] * 0.20
    print(df[["name", "salary", "tax"]])

Output:

    name  salary      tax
0  Rahul   45000   9000.0
1  Priya   72000  14400.0
2  Gagan   38000   7600.0
3   Amit   95000  19000.0
4   Neha   68000  13600.0

    # Net salary after tax
    df["net_salary"] = df["salary"] - df["tax"]

    # Category based on salary using np.where
    df["level"] = np.where(df["salary"] >= 70000, "Senior", "Junior")
    print(df[["name", "salary", "level"]])

Output:

    name  salary   level
0  Rahul   45000  Junior
1  Priya   72000  Senior
2  Gagan   38000  Junior
3   Amit   95000  Senior
4   Neha   68000  Junior

Modifying Existing Values

    # Give all Junior employees a 10% raise
    df.loc[df["level"] == "Junior", "salary"] = df.loc[df["level"] == "Junior", "salary"] * 1.10

    print(df[["name", "salary", "level"]])

Renaming Columns


    df = df.rename(columns={
        "name": "full_name",
        "age": "employee_age"
    })
    print(df.columns)


Dropping Columns and Rows


    # Drop column
    df_clean = df.drop(columns=["tax", "net_salary"])

    # Drop row by index
    df_clean = df.drop(index=2)

    # Drop multiple rows
    df_clean = df.drop(index=[0, 3])

    # axis=1 for columns, axis=0 for rows (older syntax you'll see)
    df_clean = df.drop("tax", axis=1)

Important: drop() returns a new DataFrame by default. The original df is unchanged unless you use inplace=True or reassign.
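A quick demonstration of that behaviour on a throwaway frame:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "tax": [5, 6]})

dropped = demo.drop(columns=["tax"])
print("tax" in demo.columns)       # True  — the original is untouched
print("tax" in dropped.columns)    # False — the new frame lacks it

demo = demo.drop(columns=["tax"])  # reassign to make the change stick
print("tax" in demo.columns)       # False
```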


Part 5 — Loading Real Data

This is where Pandas becomes powerful. Loading a CSV file:


    # Read CSV file
    df = pd.read_csv("data.csv")

    # Read Excel
    df = pd.read_excel("data.xlsx")

    # Read JSON
    df = pd.read_json("data.json")

    # Read from URL
    df = pd.read_csv("https://example.com/data.csv")

Let's use a real dataset. Create a CSV file called employees.csv in your project folder:

id,name,department,salary,experience,city,rating
1,Rahul Sharma,Engineering,85000,5,Delhi,4.2
2,Priya Patel,Marketing,62000,3,Mumbai,3.8
3,Gagan Singh,Engineering,92000,7,Bangalore,4.5
4,Amit Kumar,Sales,48000,2,Delhi,3.2
5,Neha Gupta,HR,55000,4,Mumbai,4.0
6,Ravi Verma,Engineering,78000,4,Hyderabad,3.9
7,Sneha Joshi,Marketing,67000,5,Bangalore,4.3
8,Kiran Rao,Sales,52000,3,Chennai,3.6
9,Arjun Nair,Engineering,105000,9,Bangalore,4.7
10,Pooja Shah,HR,58000,6,Delhi,4.1
11,Dev Mishra,Engineering,88000,6,Delhi,4.4
12,Ananya Roy,Marketing,71000,4,Mumbai,4.0

Now load and explore it:


    df = pd.read_csv("employees.csv")

    print("Shape:", df.shape)
    print("\nFirst 5 rows:")
    print(df.head())
    print("\nData types:")
    print(df.dtypes)
    print("\nSummary:")
    print(df.describe())
    print("\nInfo:")
    print(df.info())


Part 6 — Handling Missing Values

Real world data always has missing values. This is one of the most important data cleaning skills.

Let's create data with missing values:


    data = {
        "name":   ["Rahul", "Priya", None, "Amit", "Neha"],
        "age":    [25, None, 22, 35, 30],
        "salary": [45000, 72000, 38000, None, 68000],
        "city":   ["Delhi", "Mumbai", "Delhi", None, "Mumbai"]
    }

    df = pd.DataFrame(data)
    print(df)

Output:

    name   age   salary    city
0  Rahul  25.0  45000.0   Delhi
1  Priya   NaN  72000.0  Mumbai
2   None  22.0  38000.0   Delhi
3   Amit  35.0      NaN    None
4   Neha  30.0  68000.0  Mumbai

NaN = Not a Number = missing value in Pandas.
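Notice in the output that age printed as 25.0 even though we typed integers. NaN is a float, so any integer column containing a missing value is upcast to float64:

```python
import pandas as pd

ages = pd.Series([25, None, 22])
print(ages.dtype)   # float64 — the NaN forced the upcast
print(ages)

# A column with no missing values stays int64
print(pd.Series([25, 28, 22]).dtype)
```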


Detecting Missing Values


    print(df.isnull())          # True where value is missing
    print(df.isnull().sum())    # count of missing per column
    print(df.isnull().sum().sum())  # total missing in entire df
    print(df.notnull())         # opposite — True where value exists

Output:

    name    age  salary   city
0  False  False   False  False
1  False   True   False  False
2   True  False   False  False
3  False  False    True   True
4  False  False   False  False
name 1 age 1 salary 1 city 1 dtype: int64
4
name age salary city 0 True True True True 1 True False True True 2 False True True True 3 True True False False 4 True True True True

Dropping Missing Values


    # Drop rows with ANY missing value
    df_clean = df.dropna()
    print(df_clean)    # only rows with no missing values

    # Drop rows only if ALL values are missing
    df_clean = df.dropna(how="all")

    # Drop rows missing in specific columns
    df_clean = df.dropna(subset=["name", "salary"])


Filling Missing Values


    # Fill every missing value with 0 — returns a new DataFrame
    df_filled = df.fillna(0)

    # Fill a specific column with its mean or median — most common
    df["age"] = df["age"].fillna(df["age"].mean())
    df["salary"] = df["salary"].fillna(df["salary"].median())

    # Fill string columns with a placeholder
    df["city"] = df["city"].fillna("Unknown")
    df["name"] = df["name"].fillna("No Name")

    print(df)

Output:

      name   age   salary     city
0    Rahul  25.0  45000.0    Delhi
1    Priya  28.0  72000.0   Mumbai
2  No Name  22.0  38000.0    Delhi
3     Amit  35.0  56500.0  Unknown
4     Neha  30.0  68000.0   Mumbai

Missing age filled with the column mean (28.0), missing salary with the column median (56500.0).


Part 7 — Grouping and Aggregation

This is one of the most powerful Pandas features. Same as SQL GROUP BY.


    df = pd.read_csv("employees.csv")
    print(df)

    # Group by department — get mean salary
    dept_salary = df.groupby("department")["salary"].mean()
    print(dept_salary)

Output:

    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
department
Engineering    89600.000000
HR             56500.000000
Marketing      66666.666667
Sales          50000.000000
Name: salary, dtype: float64

Multiple Aggregations


    # Multiple functions at once
    dept_stats = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
    print(dept_stats)

Output:

                   mean     min     max  count
department
Engineering    89600.0   78000  105000      5
HR             56500.0   55000   58000      2
Marketing      66666.7   62000   71000      3
Sales          50000.0   48000   52000      2

Group by Multiple Columns


    # Average salary by department AND city
    dept_city = df.groupby(["department", "city"])["salary"].mean()
    print(dept_city)


Output:

department   city     
Engineering  Bangalore    98500.0
             Delhi        86500.0
             Hyderabad    78000.0
HR           Delhi        58000.0
             Mumbai       55000.0
Marketing    Bangalore    67000.0
             Mumbai       66500.0
Sales        Chennai      52000.0
             Delhi        48000.0
Name: salary, dtype: float64
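A grouped result like the one above carries a two-level index. If you would rather get a flat table back (e.g. to save as CSV), chain .reset_index() — a small sketch on a cut-down frame:

```python
import pandas as pd

mini = pd.DataFrame({
    "department": ["Eng", "Eng", "HR"],
    "city":       ["Delhi", "Delhi", "Mumbai"],
    "salary":     [85000, 88000, 55000],
})

flat = mini.groupby(["department", "city"])["salary"].mean().reset_index()
print(flat)   # department and city are ordinary columns again
```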

agg with Different Functions per Column


    result = df.groupby("department").agg({
        "salary":     ["mean", "max"],
        "experience": "mean",
        "rating":     "mean",
        "name":       "count"
    })
    print(result)


Output:
                   salary         experience    rating  name
                     mean     max       mean      mean count
department                                                  
Engineering  89600.000000  105000        6.2  4.340000     5
HR           56500.000000   58000        5.0  4.050000     2
Marketing    66666.666667   71000        4.0  4.033333     3
Sales        50000.000000   52000        2.5  3.400000     2

value_counts — Frequency Count


    # How many employees per department
    print(df["department"].value_counts())

Output:

Engineering    5
Marketing      3
Sales          2
HR             2
Name: department, dtype: int64

    # How many per city
    print(df["city"].value_counts())

    # As percentage
    print(df["department"].value_counts(normalize=True) * 100)


Output:

city
Delhi        4
Mumbai       3
Bangalore    3
Hyderabad    1
Chennai      1
Name: count, dtype: int64
department
Engineering    41.666667
Marketing      25.000000
Sales          16.666667
HR             16.666667
Name: proportion, dtype: float64

Part 8 — Sorting


    df = pd.read_csv("employees.csv")
    print(df)

    # Sort by salary ascending
    print(df.sort_values("salary").head())

    # Sort by salary descending
    print(df.sort_values("salary", ascending=False).head())

    # Sort by multiple columns
    print(df.sort_values(["department", "salary"], ascending=[True, False]))

    # Sort by rating descending — top performers
    top_performers = df.sort_values("rating", ascending=False)
    print(top_performers[["name", "department", "salary", "rating"]].head(5))


Output:

    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
   id         name department  salary  experience     city  rating
3   4   Amit Kumar      Sales   48000           2    Delhi     3.2
7   8    Kiran Rao      Sales   52000           3  Chennai     3.6
4   5   Neha Gupta         HR   55000           4   Mumbai     4.0
9  10   Pooja Shah         HR   58000           6    Delhi     4.1
1   2  Priya Patel  Marketing   62000           3   Mumbai     3.8

    id          name   department  salary  experience       city  rating
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9

    id          name   department  salary  experience       city  rating
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2

            name   department  salary  rating
8     Arjun Nair  Engineering  105000     4.7
2    Gagan Singh  Engineering   92000     4.5
10    Dev Mishra  Engineering   88000     4.4
6    Sneha Joshi    Marketing   67000     4.3
0   Rahul Sharma  Engineering   85000     4.2

Part 9 — String Operations

Pandas has built-in string operations via .str:


    df = pd.read_csv("employees.csv")
    print(df)

    # Uppercase all names
    print(df["name"].str.upper())

    # Check if name contains a word
    print(df["name"].str.contains("Kumar"))

    # Extract first name
    df["first_name"] = df["name"].str.split(" ").str[0]
    print(df[["name", "first_name"]])

    # String length
    df["name_length"] = df["name"].str.len()

    # Replace
    df["city"] = df["city"].str.replace("Bangalore", "Bengaluru")

    # Strip whitespace — always do this on text data
    df["name"] = df["name"].str.strip()
    df["city"] = df["city"].str.strip().str.title()


Output:
    id          name   department  salary  experience       city  rating
0    1  Rahul Sharma  Engineering   85000           5      Delhi     4.2
1    2   Priya Patel    Marketing   62000           3     Mumbai     3.8
2    3   Gagan Singh  Engineering   92000           7  Bangalore     4.5
3    4    Amit Kumar        Sales   48000           2      Delhi     3.2
4    5    Neha Gupta           HR   55000           4     Mumbai     4.0
5    6    Ravi Verma  Engineering   78000           4  Hyderabad     3.9
6    7   Sneha Joshi    Marketing   67000           5  Bangalore     4.3
7    8     Kiran Rao        Sales   52000           3    Chennai     3.6
8    9    Arjun Nair  Engineering  105000           9  Bangalore     4.7
9   10    Pooja Shah           HR   58000           6      Delhi     4.1
10  11    Dev Mishra  Engineering   88000           6      Delhi     4.4
11  12    Ananya Roy    Marketing   71000           4     Mumbai     4.0
0     RAHUL SHARMA
1      PRIYA PATEL
2      GAGAN SINGH
3       AMIT KUMAR
4       NEHA GUPTA
5       RAVI VERMA
6      SNEHA JOSHI
7        KIRAN RAO
8       ARJUN NAIR
9       POOJA SHAH
10      DEV MISHRA
11      ANANYA ROY
Name: name, dtype: object

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
Name: name, dtype: bool

            name first_name
0   Rahul Sharma      Rahul
1    Priya Patel      Priya
2    Gagan Singh      Gagan
3     Amit Kumar       Amit
4     Neha Gupta       Neha
5     Ravi Verma       Ravi
6    Sneha Joshi      Sneha
7      Kiran Rao      Kiran
8     Arjun Nair      Arjun
9     Pooja Shah      Pooja
10    Dev Mishra        Dev
11    Ananya Roy     Ananya

Part 10 — Applying Custom Functions

apply() — Apply Function to Column


    df = pd.read_csv("employees.csv")

    # Apply function to each value in a column
    def get_grade(rating):
        if rating >= 4.5:
            return "Excellent"
        elif rating >= 4.0:
            return "Good"
        elif rating >= 3.5:
            return "Average"
        else:
            return "Needs Improvement"

    df["performance"] = df["rating"].apply(get_grade)
    print(df[["name", "rating", "performance"]])

Output:

            name  rating        performance
0   Rahul Sharma     4.2               Good
1    Priya Patel     3.8            Average
2    Gagan Singh     4.5          Excellent
3     Amit Kumar     3.2  Needs Improvement
4     Neha Gupta     4.0               Good
5     Ravi Verma     3.9            Average
6    Sneha Joshi     4.3               Good
7      Kiran Rao     3.6            Average
8     Arjun Nair     4.7          Excellent
9     Pooja Shah     4.1               Good
10    Dev Mishra     4.4               Good
11    Ananya Roy     4.0               Good

Lambda with apply()


    # Quick one-liner transformations
    df["salary_in_lakhs"] = df["salary"].apply(lambda x: round(x / 100000, 2))
    df["experience_level"] = df["experience"].apply(
        lambda x: "Junior" if x <= 3 else ("Mid" if x <= 6 else "Senior")
    )
    print(df[["name", "experience", "experience_level", "salary_in_lakhs"]])


Output:
            name  experience experience_level  salary_in_lakhs
0   Rahul Sharma           5              Mid             0.85
1    Priya Patel           3           Junior             0.62
2    Gagan Singh           7           Senior             0.92
3     Amit Kumar           2           Junior             0.48
4     Neha Gupta           4              Mid             0.55
5     Ravi Verma           4              Mid             0.78
6    Sneha Joshi           5              Mid             0.67
7      Kiran Rao           3           Junior             0.52
8     Arjun Nair           9           Senior             1.05
9     Pooja Shah           6              Mid             0.58
10    Dev Mishra           6              Mid             0.88
11    Ananya Roy           4              Mid             0.71

apply() on Entire Row — axis=1


    def calculate_bonus(row):
        if row["rating"] >= 4.5:
            return row["salary"] * 0.20
        elif row["rating"] >= 4.0:
            return row["salary"] * 0.15
        else:
            return row["salary"] * 0.10

    df["bonus"] = df.apply(calculate_bonus, axis=1)
    print(df[["name", "salary", "rating", "bonus"]])

Output:

            name  salary  rating    bonus
0   Rahul Sharma   85000     4.2  12750.0
1    Priya Patel   62000     3.8   6200.0
2    Gagan Singh   92000     4.5  18400.0
3     Amit Kumar   48000     3.2   4800.0
4     Neha Gupta   55000     4.0   8250.0
5     Ravi Verma   78000     3.9   7800.0
6    Sneha Joshi   67000     4.3  10050.0
7      Kiran Rao   52000     3.6   5200.0
8     Arjun Nair  105000     4.7  21000.0
9     Pooja Shah   58000     4.1   8700.0
10    Dev Mishra   88000     4.4  13200.0
11    Ananya Roy   71000     4.0  10650.0

axis=1 means apply the function to each row — row inside the function is a Series of that row's values.
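apply with axis=1 runs plain Python once per row, which gets slow on large data. The same bonus logic can usually be vectorized — one equivalent sketch using np.select, shown on the first three rows:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"salary": [85000, 62000, 92000],
                     "rating": [4.2, 3.8, 4.5]})

# Conditions are checked in order; the first match wins, like the if/elif chain
conditions = [demo["rating"] >= 4.5, demo["rating"] >= 4.0]
rates = [0.20, 0.15]
demo["bonus"] = demo["salary"] * np.select(conditions, rates, default=0.10)
print(demo["bonus"].tolist())   # [12750.0, 6200.0, 18400.0]
```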


Complete Real World Analysis

Let's do a full analysis of our employees dataset:

    import pandas as pd
    import numpy as np

    df = pd.read_csv("employees.csv")

    print("=" * 50)
    print("   EMPLOYEE DATA ANALYSIS REPORT")
    print("=" * 50)

    print(f"\nTotal Employees: {len(df)}")
    print(f"Departments: {df['department'].nunique()}")
    print(f"Cities: {df['city'].nunique()}")

    print("\n--- Salary Statistics ---")
    print(f"Average Salary : Rs.{df['salary'].mean():,.0f}")
    print(f"Median Salary  : Rs.{df['salary'].median():,.0f}")
    print(f"Highest Salary : Rs.{df['salary'].max():,.0f}")
    print(f"Lowest Salary  : Rs.{df['salary'].min():,.0f}")

    print("\n--- Department Summary ---")
    dept = df.groupby("department").agg(
        headcount=("name", "count"),
        avg_salary=("salary", "mean"),
        avg_rating=("rating", "mean")
    ).round(2)
    print(dept.to_string())

    print("\n--- Top 3 Earners ---")
    top3 = df.nlargest(3, "salary")[["name", "department", "salary"]]
    for _, row in top3.iterrows():
        print(f"  {row['name']} ({row['department']}): Rs.{row['salary']:,}")

    print("\n--- Performance Distribution ---")
    df["performance"] = df["rating"].apply(
        lambda r: "Excellent" if r >= 4.5 else ("Good" if r >= 4.0 else "Average")
    )
    print(df["performance"].value_counts().to_string())

    print("\n--- City Distribution ---")
    city_counts = df["city"].value_counts()
    for city, count in city_counts.items():
        bar = "█" * count
        print(f"  {city:<12} {bar} ({count})")

    print("\n--- Salary by Experience Level ---")
    df["exp_level"] = df["experience"].apply(
        lambda x: "Junior (0-3)" if x <= 3 else ("Mid (4-6)" if x <= 6 else "Senior (7+)")
    )
    exp_salary = df.groupby("exp_level")["salary"].mean().sort_values(ascending=False)
    for level, avg in exp_salary.items():
        print(f"  {level}: Rs.{avg:,.0f}")

Output:

==================================================
   EMPLOYEE DATA ANALYSIS REPORT
==================================================

Total Employees: 12
Departments: 4
Cities: 6

--- Salary Statistics ---
Average Salary : Rs.71,750
Median Salary  : Rs.69,000
Highest Salary : Rs.105,000
Lowest Salary  : Rs.48,000

--- Department Summary ---
              headcount  avg_salary  avg_rating
department
Engineering           5     89600.0        4.34
HR                    2     56500.0        4.05
Marketing             3     66666.7        4.03
Sales                 2     50000.0        3.40

--- Top 3 Earners ---
  Arjun Nair (Engineering): Rs.105,000
  Gagan Singh (Engineering): Rs.92,000
  Dev Mishra (Engineering): Rs.88,000

--- Performance Distribution ---
Good         6
Average      4
Excellent    2

--- City Distribution ---
  Delhi        ████ (4)
  Mumbai       ███ (3)
  Bangalore    ███ (3)
  Hyderabad    █ (1)
  Chennai      █ (1)

--- Salary by Experience Level ---
  Senior (7+): Rs.98,500
  Mid (4-6): Rs.71,714
  Junior (0-3): Rs.54,000

Saving Data

    # Save to CSV
    df.to_csv("output.csv", index=False)    # index=False — don't save row numbers

    # Save to Excel (requires the openpyxl package)
    df.to_excel("output.xlsx", index=False)

    # Save to JSON
    df.to_json("output.json", orient="records", indent=2)

Quick Reference

    # Loading
    df = pd.read_csv("file.csv")

    # Exploring
    df.shape          # dimensions
    df.head()         # first 5 rows
    df.tail()         # last 5 rows
    df.info()         # column types and nulls
    df.describe()     # statistics
    df.columns        # column names
    df.dtypes         # data types

    # Selecting
    df["col"]                    # single column
    df[["col1", "col2"]]         # multiple columns
    df.iloc[0]                   # row by position
    df.loc[df["col"] > 50]       # rows by condition
    df.loc[condition, ["col1"]]  # rows + columns

    # Modifying
    df["new_col"] = values        # add column
    df.drop(columns=["col"])      # remove column
    df.rename(columns={"old": "new"})  # rename
    df.fillna(value)              # fill missing
    df.dropna()                   # remove missing rows

    # Analysis
    df["col"].value_counts()     # frequency count
    df.groupby("col").mean()     # group and aggregate
    df.sort_values("col")        # sort
    df["col"].apply(func)        # apply function
    df.nlargest(5, "col")        # top 5 rows
    df.nsmallest(5, "col")       # bottom 5 rows

    # Saving
    df.to_csv("file.csv", index=False)

Exercise 🏋️

Download this free dataset from Kaggle — "World Population Dataset" or use this mock data:

Create sales_data.csv:

date,product,category,quantity,price,city,salesperson
2024-01-05,Laptop,Electronics,2,75000,Delhi,Rahul
2024-01-08,Phone,Electronics,5,25000,Mumbai,Priya
2024-01-10,Desk,Furniture,1,15000,Delhi,Rahul
2024-01-15,Phone,Electronics,3,25000,Bangalore,Gagan
2024-01-18,Chair,Furniture,4,8000,Delhi,Amit
2024-01-22,Laptop,Electronics,1,75000,Mumbai,Priya
2024-01-25,Tablet,Electronics,2,35000,Chennai,Neha
2024-02-02,Phone,Electronics,6,25000,Delhi,Rahul
2024-02-05,Desk,Furniture,2,15000,Mumbai,Priya
2024-02-10,Laptop,Electronics,3,75000,Bangalore,Gagan
2024-02-15,Chair,Furniture,5,8000,Hyderabad,Ravi
2024-02-20,Tablet,Electronics,4,35000,Delhi,Rahul
2024-03-01,Phone,Electronics,8,25000,Mumbai,Priya
2024-03-05,Laptop,Electronics,2,75000,Delhi,Amit
2024-03-10,Desk,Furniture,3,15000,Bangalore,Gagan

Tasks:

  1. Load and explore the dataset (shape, info, describe)
  2. Add a revenue column (quantity × price)
  3. Total revenue per category
  4. Top performing salesperson by total revenue
  5. Which city has highest total sales
  6. Monthly revenue trend (group by month)
  7. Best selling product by quantity
  8. Average order value per salesperson
  9. Find all Electronics sales above Rs.1,00,000 revenue
  10. Save the cleaned dataset with revenue column to a new CSV
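If you get stuck, here is a hedged starting sketch for tasks 1, 2 and 6 (it uses two inline sample rows so it runs on its own; with the real file you would read sales_data.csv instead):

```python
import pandas as pd
from io import StringIO

# With the real file: sales = pd.read_csv("sales_data.csv")
sample = StringIO(
    "date,product,quantity,price\n"
    "2024-01-05,Laptop,2,75000\n"
    "2024-02-02,Phone,6,25000\n"
)
sales = pd.read_csv(sample)

# Task 2 — revenue column
sales["revenue"] = sales["quantity"] * sales["price"]

# Task 6 — parse the dates, then group by calendar month
sales["date"] = pd.to_datetime(sales["date"])
monthly = sales.groupby(sales["date"].dt.to_period("M"))["revenue"].sum()
print(monthly)
```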
