What is Pandas?
NumPy is great for numerical arrays where all the data is the same type. But real-world data looks like this:
| Name | Age | City | Salary | Joined |
|-------|-----|-----------|--------|------------|
| Gagan | 22 | Delhi | 45000 | 2023-01-15 |
| Priya | 28 | Mumbai | 72000 | 2021-06-20 |
| Rahul | 35 | Bangalore | 95000 | 2019-03-10 |
Mixed types — strings, numbers, dates — all in one table. NumPy can't handle this well (a mixed array falls back to dtype=object and loses its speed advantage). Pandas is built exactly for this.
Pandas gives you a data structure called DataFrame — essentially a spreadsheet / database table inside Python. It's the most used tool in all of data science.
Installing and Importing
It should already be installed from the earlier setup (if not: pip install pandas). In a new Jupyter notebook:
import pandas as pd
import numpy as np
print("Pandas version:", pd.__version__)
pd is the universal alias — virtually all Pandas code you'll read uses it. Always write import pandas as pd, never a bare import pandas.
Part 1 — Core Data Structures
Pandas has two main data structures:
- Series — one column (1D)
- DataFrame — full table (2D)
Series — Single Column of Data
import pandas as pd
# Create a Series from a list
marks = pd.Series([85, 92, 78, 65, 90])
print(marks)
Output:
0 85
1 92
2 78
3 65
4 90
dtype: int64
Left side = index, right side = values. Every Series has an index — by default 0, 1, 2...
Series with Custom Index
marks = pd.Series(
    [85, 92, 78, 65, 90],
    index=["Rahul", "Priya", "Gagan", "Amit", "Neha"]
)
print(marks)
Output:
Rahul 85
Priya 92
Gagan 78
Amit 65
Neha 90
dtype: int64
Now you access by name instead of number:
print(marks["Gagan"])  # 78
print(marks["Priya"])  # 92
Series Operations
marks = pd.Series([85, 92, 78, 65, 90])
print(marks.mean())      # 82.0
print(marks.sum())       # 410
print(marks.max())       # 92
print(marks.min())       # 65
print(marks.std())       # 10.93...
print(marks.describe())  # full statistical summary
Output of describe():
count 5.000000
mean 82.000000
std 10.931606
min 65.000000
25% 78.000000
50% 85.000000
75% 90.000000
max 92.000000
dtype: float64
One line gives you count, mean, std, min, quartiles, max. You'll use .describe() constantly.
DataFrame — The Main Event
A DataFrame is a table — multiple Series side by side sharing the same index.
# Create DataFrame from dictionary
data = {
    "name": ["Rahul", "Priya", "Gagan", "Amit", "Neha"],
    "age": [25, 28, 22, 35, 30],
    "city": ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai"],
    "salary": [45000, 72000, 38000, 95000, 68000],
    "active": [True, True, True, False, True]
}

df = pd.DataFrame(data)
print(df)
Output:
name age city salary active
0 Rahul 25 Delhi 45000 True
1 Priya 28 Mumbai 72000 True
2 Gagan 22 Delhi 38000 True
3 Amit 35 Bangalore 95000 False
4 Neha 30 Mumbai 68000 True
This is a proper table — rows and columns, exactly like a spreadsheet.
Part 2 — Exploring a DataFrame
These are the first things you do with any new dataset:
print(df.shape)  # (5, 5) — 5 rows, 5 columns
print(df.ndim)   # 2
print(df.size)   # 25 — total cells
print(len(df))   # 5 — number of rows
print()

print(df.columns)
# Index(['name', 'age', 'city', 'salary', 'active'], dtype='object')
print()

print(df.dtypes)
# name      object
# age        int64
# city      object
# salary     int64
# active      bool
# dtype: object
print()

# First 3 rows
print(df.head(3))
print()

# Last 3 rows
print(df.tail(3))
print()

# Random sample of 2 rows
print(df.sample(2))
print()

# Most important — full summary
print(df.info())
Output:
(5, 5)
2
25
5
Index(['name', 'age', 'city', 'salary', 'active'], dtype='object')
name object
age int64
city object
salary int64
active bool
dtype: object
name age city salary active
0 Rahul 25 Delhi 45000 True
1 Priya 28 Mumbai 72000 True
2 Gagan 22 Delhi 38000 True
name age city salary active
2 Gagan 22 Delhi 38000 True
3 Amit 35 Bangalore 95000 False
4 Neha 30 Mumbai 68000 True
name age city salary active
3 Amit 35 Bangalore 95000 False
1 Priya 28 Mumbai 72000 True
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 5 non-null object
1 age 5 non-null int64
2 city 5 non-null object
3 salary 5 non-null int64
4 active 5 non-null bool
dtypes: bool(1), int64(2), object(2)
memory usage: 297.0+ bytes
None
df.info() shows column names, data types, and how many non-null values. Missing values show up here.
# Statistical summary of numeric columns
print(df.describe())
Output:
age salary
count 5.000000 5.000000
mean 28.000000 63600.000000
std 4.949747 22788.154818
min 22.000000 38000.000000
25% 25.000000 45000.000000
50% 28.000000 68000.000000
75% 30.000000 72000.000000
max 35.000000 95000.000000
Part 3 — Selecting Data
Selecting Columns
# Single column — returns Series
print(df["name"])
print(df["salary"])

# Multiple columns — returns DataFrame
print(df[["name", "salary"]])
print(df[["name", "age", "city"]])
Output:
0 Rahul
1 Priya
2 Gagan
3 Amit
4 Neha
Name: name, dtype: object
0 45000
1 72000
2 38000
3 95000
4 68000
Name: salary, dtype: int64
name salary
0 Rahul 45000
1 Priya 72000
2 Gagan 38000
3 Amit 95000
4 Neha 68000
name age city
0 Rahul 25 Delhi
1 Priya 28 Mumbai
2 Gagan 22 Delhi
3 Amit 35 Bangalore
4 Neha 30 Mumbai
Selecting Rows — loc and iloc
This is where beginners get confused. Two ways to select rows:
iloc — select by integer position (like NumPy indexing)
print(df.iloc[0])          # first row
print(df.iloc[-1])         # last row
print(df.iloc[1:3])        # rows 1 and 2
print(df.iloc[[0, 2, 4]])  # rows 0, 2, 4
Output:
name Rahul
age 25
city Delhi
salary 45000
active True
Name: 0, dtype: object
name Neha
age 30
city Mumbai
salary 68000
active True
Name: 4, dtype: object
name age city salary active
1 Priya 28 Mumbai 72000 True
2 Gagan 22 Delhi 38000 True
name age city salary active
0 Rahul 25 Delhi 45000 True
2 Gagan 22 Delhi 38000 True
4 Neha 30 Mumbai 68000 True
loc — select by label/condition
print(df.loc[0])    # row with index label 0
print(df.loc[0:2])  # rows with labels 0, 1, 2 (INCLUSIVE — different from iloc)
Output:
name Rahul
age 25
city Delhi
salary 45000
active True
Name: 0, dtype: object
name age city salary active
0 Rahul 25 Delhi 45000 True
1 Priya 28 Mumbai 72000 True
2 Gagan 22 Delhi 38000 True
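The inclusive vs exclusive slicing difference is easiest to see side by side. A quick sketch (rebuilding a small df so it runs standalone):

```python
import pandas as pd

# Same kind of data as above, rebuilt so this snippet runs on its own
df = pd.DataFrame({
    "name": ["Rahul", "Priya", "Gagan", "Amit", "Neha"],
    "age": [25, 28, 22, 35, 30],
})

# iloc slices by position and EXCLUDES the end
print(len(df.iloc[0:2]))  # 2 rows

# loc slices by label and INCLUDES the end
print(len(df.loc[0:2]))   # 3 rows
```

With the default integer index the labels happen to match positions, which is exactly why mixing the two up goes unnoticed until the index changes.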
loc with Conditions — Most Used Pattern
# Get all rows where city is Delhi
delhi_employees = df.loc[df["city"] == "Delhi"]
print(delhi_employees)
Output:
name age city salary active
0 Rahul 25 Delhi 45000 True
2 Gagan 22 Delhi 38000 True
# High earners
high_earners = df.loc[df["salary"] > 60000]
print(high_earners)

# Active employees only
active_df = df.loc[df["active"] == True]
print(active_df)

# Multiple conditions — & for AND, | for OR
# Must use parentheses around each condition
delhi_active = df.loc[(df["city"] == "Delhi") & (df["active"] == True)]
print(delhi_active)

young_or_senior = df.loc[(df["age"] < 25) | (df["age"] > 33)]
print(young_or_senior)
Output:
name age city salary active
1 Priya 28 Mumbai 72000 True
3 Amit 35 Bangalore 95000 False
4 Neha 30 Mumbai 68000 True
name age city salary active
0 Rahul 25 Delhi 45000 True
1 Priya 28 Mumbai 72000 True
2 Gagan 22 Delhi 38000 True
4 Neha 30 Mumbai 68000 True
name age city salary active
0 Rahul 25 Delhi 45000 True
2 Gagan 22 Delhi 38000 True
name age city salary active
2 Gagan 22 Delhi 38000 True
3 Amit 35 Bangalore 95000 False
Select Specific Rows AND Columns
# loc[rows, columns]
print(df.loc[df["salary"] > 60000, ["name", "salary"]])
Output:
name salary
1 Priya 72000
3 Amit 95000
4 Neha 68000
This is the most useful selection pattern — filter rows AND select specific columns at the same time.
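As a side note, the same filter can also be written with df.query(), which some people find more readable for long conditions (an alternative, not used in the rest of this post):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Rahul", "Priya", "Amit"],
    "salary": [45000, 72000, 95000],
})

# query() takes the condition as a string; column names are used directly
high = df.query("salary > 60000")[["name", "salary"]]
print(high)
```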
Part 4 — Adding and Modifying Data
Adding New Column
# Add tax column — 20% of salary
df["tax"] = df["salary"] * 0.20
print(df[["name", "salary", "tax"]])
Output:
name salary tax
0 Rahul 45000 9000.0
1 Priya 72000 14400.0
2 Gagan 38000 7600.0
3 Amit 95000 19000.0
4 Neha 68000 13600.0
# Net salary after tax
df["net_salary"] = df["salary"] - df["tax"]

# Category based on salary using np.where
df["level"] = np.where(df["salary"] >= 70000, "Senior", "Junior")
print(df[["name", "salary", "level"]])
Output:
name salary level
0 Rahul 45000 Junior
1 Priya 72000 Senior
2 Gagan 38000 Junior
3 Amit 95000 Senior
4 Neha 68000 Junior
Modifying Existing Values
# Give all Junior employees a 10% raise
df.loc[df["level"] == "Junior", "salary"] = df.loc[df["level"] == "Junior", "salary"] * 1.10
print(df[["name", "salary", "level"]])
Renaming Columns
df = df.rename(columns={
    "name": "full_name",
    "age": "employee_age"
})
print(df.columns)
Dropping Columns and Rows
# Drop column
df_clean = df.drop(columns=["tax", "net_salary"])

# Drop row by index
df_clean = df.drop(index=2)

# Drop multiple rows
df_clean = df.drop(index=[0, 3])

# axis=1 for columns, axis=0 for rows (older syntax you'll see)
df_clean = df.drop("tax", axis=1)
Important — drop() returns a new DataFrame by default. Original df is unchanged unless you use inplace=True or reassign.
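A quick way to convince yourself of this:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# drop() hands back a new DataFrame; the original keeps both columns
dropped = df.drop(columns=["b"])
print(list(df.columns))       # ['a', 'b']
print(list(dropped.columns))  # ['a']

# To keep the change, reassign
df = df.drop(columns=["b"])
print(list(df.columns))       # ['a']
```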
Part 5 — Loading Real Data
This is where Pandas becomes powerful. Loading a CSV file:
# Read CSV file
df = pd.read_csv("data.csv")

# Read Excel
df = pd.read_excel("data.xlsx")

# Read JSON
df = pd.read_json("data.json")

# Read from URL
df = pd.read_csv("https://example.com/data.csv")
Let's use a real dataset. Create a CSV file called employees.csv in your project folder:
id,name,department,salary,experience,city,rating
1,Rahul Sharma,Engineering,85000,5,Delhi,4.2
2,Priya Patel,Marketing,62000,3,Mumbai,3.8
3,Gagan Singh,Engineering,92000,7,Bangalore,4.5
4,Amit Kumar,Sales,48000,2,Delhi,3.2
5,Neha Gupta,HR,55000,4,Mumbai,4.0
6,Ravi Verma,Engineering,78000,4,Hyderabad,3.9
7,Sneha Joshi,Marketing,67000,5,Bangalore,4.3
8,Kiran Rao,Sales,52000,3,Chennai,3.6
9,Arjun Nair,Engineering,105000,9,Bangalore,4.7
10,Pooja Shah,HR,58000,6,Delhi,4.1
11,Dev Mishra,Engineering,88000,6,Delhi,4.4
12,Ananya Roy,Marketing,71000,4,Mumbai,4.0
Now load and explore it:
df = pd.read_csv("employees.csv")
print("Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nData types:")
print(df.dtypes)
print("\nSummary:")
print(df.describe())
print("\nInfo:")
print(df.info())
Part 6 — Handling Missing Values
Real-world data almost always has missing values. Handling them is one of the most important data cleaning skills.
Let's create data with missing values:
data = {
    "name": ["Rahul", "Priya", None, "Amit", "Neha"],
    "age": [25, None, 22, 35, 30],
    "salary": [45000, 72000, 38000, None, 68000],
    "city": ["Delhi", "Mumbai", "Delhi", None, "Mumbai"]
}

df = pd.DataFrame(data)
print(df)
Output:
name age salary city
0 Rahul 25.0 45000.0 Delhi
1 Priya NaN 72000.0 Mumbai
2 None 22.0 38000.0 Delhi
3 Amit 35.0 NaN None
4 Neha 30.0 68000.0 Mumbai
NaN = Not a Number = missing value in Pandas.
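A detail worth knowing: NaN is technically a float, which is why the age column above prints as 25.0 instead of 25, and why == comparisons can't detect it. A small demonstration:

```python
import numpy as np
import pandas as pd

# NaN never equals itself, so use isna()/isnull() instead of ==
print(np.nan == np.nan)  # False
print(pd.isna(np.nan))   # True

# An integer column containing a missing value is upcast to float
s = pd.Series([25, None, 22])
print(s.dtype)           # float64
```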
Detecting Missing Values
print(df.isnull())              # True where value is missing
print(df.isnull().sum())        # count of missing per column
print(df.isnull().sum().sum())  # total missing in entire df
print(df.notnull())             # opposite — True where value exists
Output:
name age salary city
0 False False False False
1 False True False False
2 True False False False
3 False False True True
4 False False False False
name 1
age 1
salary 1
city 1
dtype: int64
4
name age salary city
0 True True True True
1 True False True True
2 False True True True
3 True True False False
4 True True True True
Dropping Missing Values
# Drop rows with ANY missing value
df_clean = df.dropna()
print(df_clean)  # only rows with no missing values

# Drop rows only if ALL values are missing
df_clean = df.dropna(how="all")

# Drop rows missing in specific columns
df_clean = df.dropna(subset=["name", "salary"])
Filling Missing Values
# Fill all missing with a single value (returns a new DataFrame)
df_filled = df.fillna(0)

# Fill a specific column with a fixed value (an alternative; if you run this
# first, the median fill below has nothing left to fill)
# df["salary"] = df["salary"].fillna(0)

# Fill with the mean/median of that column — most common
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].median())

# Fill string columns with placeholder
df["city"] = df["city"].fillna("Unknown")
df["name"] = df["name"].fillna("No Name")

print(df)
print(df)
Output:
name age salary city
0 Rahul 25.0 45000.0 Delhi
1 Priya 28.0 72000.0 Mumbai
2 No Name 22.0 38000.0 Delhi
3 Amit 35.0 56500.0 Unknown
4 Neha 30.0 68000.0 Mumbai
Missing age filled with the mean (28.0), missing salary filled with the median (56500.0).
Part 7 — Grouping and Aggregation
This is one of the most powerful Pandas features. Same as SQL GROUP BY.
df = pd.read_csv("employees.csv")
print(df)

# Group by department — get mean salary
dept_salary = df.groupby("department")["salary"].mean()
print(dept_salary)
Output:
id name department salary experience city rating
0 1 Rahul Sharma Engineering 85000 5 Delhi 4.2
1 2 Priya Patel Marketing 62000 3 Mumbai 3.8
2 3 Gagan Singh Engineering 92000 7 Bangalore 4.5
3 4 Amit Kumar Sales 48000 2 Delhi 3.2
4 5 Neha Gupta HR 55000 4 Mumbai 4.0
5 6 Ravi Verma Engineering 78000 4 Hyderabad 3.9
6 7 Sneha Joshi Marketing 67000 5 Bangalore 4.3
7 8 Kiran Rao Sales 52000 3 Chennai 3.6
8 9 Arjun Nair Engineering 105000 9 Bangalore 4.7
9 10 Pooja Shah HR 58000 6 Delhi 4.1
10 11 Dev Mishra Engineering 88000 6 Delhi 4.4
11 12 Ananya Roy Marketing 71000 4 Mumbai 4.0
department
Engineering 89600.000000
HR 56500.000000
Marketing 66666.666667
Sales 50000.000000
Name: salary, dtype: float64
Multiple Aggregations
# Multiple functions at once
dept_stats = df.groupby("department")["salary"].agg(["mean", "min", "max", "count"])
print(dept_stats)
Output:
mean min max count
department
Engineering 89600.000000 78000 105000 5
HR 56500.000000 55000 58000 2
Marketing 66666.666667 62000 71000 3
Sales 50000.000000 48000 52000 2
Group by Multiple Columns
# Average salary by department AND city
dept_city = df.groupby(["department", "city"])["salary"].mean()
print(dept_city)
Output:
department city
Engineering Bangalore 98500.0
Delhi 86500.0
Hyderabad 78000.0
HR Delhi 58000.0
Mumbai 55000.0
Marketing Bangalore 67000.0
Mumbai 66500.0
Sales Chennai 52000.0
Delhi 48000.0
Name: salary, dtype: float64
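Grouping by two columns returns a Series with a MultiIndex (department, city). If you want a flat table instead, chain .reset_index() to turn the index levels back into ordinary columns. A minimal sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["Eng", "Eng", "HR"],
    "city": ["Delhi", "Delhi", "Mumbai"],
    "salary": [80000, 90000, 55000],
})

# reset_index() converts the (department, city) MultiIndex into columns
flat = df.groupby(["department", "city"])["salary"].mean().reset_index()
print(flat)
```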
agg with Different Functions per Column
result = df.groupby("department").agg({
    "salary": ["mean", "max"],
    "experience": "mean",
    "rating": "mean",
    "name": "count"
})
print(result)
Output:
salary experience rating name
mean max mean mean count
department
Engineering 89600.000000 105000 6.2 4.340000 5
HR 56500.000000 58000 5.0 4.050000 2
Marketing 66666.666667 71000 4.0 4.033333 3
Sales 50000.000000 52000 2.5 3.400000 2
value_counts — Frequency Count
# How many employees per department
print(df["department"].value_counts())
Output:
department
Engineering 5
Marketing 3
Sales 2
HR 2
Name: count, dtype: int64
# How many per city
print(df["city"].value_counts())

# As percentage
print(df["department"].value_counts(normalize=True) * 100)
Output:
city
Delhi 4
Mumbai 3
Bangalore 3
Hyderabad 1
Chennai 1
Name: count, dtype: int64
department
Engineering 41.666667
Marketing 25.000000
Sales 16.666667
HR 16.666667
Name: proportion, dtype: float64
Part 8 — Sorting
df = pd.read_csv("employees.csv")
print(df)

# Sort by salary ascending
print(df.sort_values("salary").head())

# Sort by salary descending
print(df.sort_values("salary", ascending=False).head())

# Sort by multiple columns
print(df.sort_values(["department", "salary"], ascending=[True, False]))

# Sort by rating descending — top performers
top_performers = df.sort_values("rating", ascending=False)
print(top_performers[["name", "department", "salary", "rating"]].head(5))
Output:
id name department salary experience city rating
0 1 Rahul Sharma Engineering 85000 5 Delhi 4.2
1 2 Priya Patel Marketing 62000 3 Mumbai 3.8
2 3 Gagan Singh Engineering 92000 7 Bangalore 4.5
3 4 Amit Kumar Sales 48000 2 Delhi 3.2
4 5 Neha Gupta HR 55000 4 Mumbai 4.0
5 6 Ravi Verma Engineering 78000 4 Hyderabad 3.9
6 7 Sneha Joshi Marketing 67000 5 Bangalore 4.3
7 8 Kiran Rao Sales 52000 3 Chennai 3.6
8 9 Arjun Nair Engineering 105000 9 Bangalore 4.7
9 10 Pooja Shah HR 58000 6 Delhi 4.1
10 11 Dev Mishra Engineering 88000 6 Delhi 4.4
11 12 Ananya Roy Marketing 71000 4 Mumbai 4.0
id name department salary experience city rating
3 4 Amit Kumar Sales 48000 2 Delhi 3.2
7 8 Kiran Rao Sales 52000 3 Chennai 3.6
4 5 Neha Gupta HR 55000 4 Mumbai 4.0
9 10 Pooja Shah HR 58000 6 Delhi 4.1
1 2 Priya Patel Marketing 62000 3 Mumbai 3.8
id name department salary experience city rating
8 9 Arjun Nair Engineering 105000 9 Bangalore 4.7
2 3 Gagan Singh Engineering 92000 7 Bangalore 4.5
10 11 Dev Mishra Engineering 88000 6 Delhi 4.4
0 1 Rahul Sharma Engineering 85000 5 Delhi 4.2
5 6 Ravi Verma Engineering 78000 4 Hyderabad 3.9
id name department salary experience city rating
8 9 Arjun Nair Engineering 105000 9 Bangalore 4.7
2 3 Gagan Singh Engineering 92000 7 Bangalore 4.5
10 11 Dev Mishra Engineering 88000 6 Delhi 4.4
0 1 Rahul Sharma Engineering 85000 5 Delhi 4.2
5 6 Ravi Verma Engineering 78000 4 Hyderabad 3.9
9 10 Pooja Shah HR 58000 6 Delhi 4.1
4 5 Neha Gupta HR 55000 4 Mumbai 4.0
11 12 Ananya Roy Marketing 71000 4 Mumbai 4.0
6 7 Sneha Joshi Marketing 67000 5 Bangalore 4.3
1 2 Priya Patel Marketing 62000 3 Mumbai 3.8
7 8 Kiran Rao Sales 52000 3 Chennai 3.6
3 4 Amit Kumar Sales 48000 2 Delhi 3.2
name department salary rating
8 Arjun Nair Engineering 105000 4.7
2 Gagan Singh Engineering 92000 4.5
10 Dev Mishra Engineering 88000 4.4
6 Sneha Joshi Marketing 67000 4.3
0 Rahul Sharma Engineering 85000 4.2
Part 9 — String Operations
Pandas has built-in string operations via .str:
df = pd.read_csv("employees.csv")
print(df)

# Uppercase all names
print(df["name"].str.upper())

# Check if name contains a word
print(df["name"].str.contains("Kumar"))

# Extract first name
df["first_name"] = df["name"].str.split(" ").str[0]
print(df[["name", "first_name"]])

# String length
df["name_length"] = df["name"].str.len()

# Replace
df["city"] = df["city"].str.replace("Bangalore", "Bengaluru")

# Strip whitespace — always do this on text data
df["name"] = df["name"].str.strip()
df["city"] = df["city"].str.strip().str.title()
Output:
id name department salary experience city rating
0 1 Rahul Sharma Engineering 85000 5 Delhi 4.2
1 2 Priya Patel Marketing 62000 3 Mumbai 3.8
2 3 Gagan Singh Engineering 92000 7 Bangalore 4.5
3 4 Amit Kumar Sales 48000 2 Delhi 3.2
4 5 Neha Gupta HR 55000 4 Mumbai 4.0
5 6 Ravi Verma Engineering 78000 4 Hyderabad 3.9
6 7 Sneha Joshi Marketing 67000 5 Bangalore 4.3
7 8 Kiran Rao Sales 52000 3 Chennai 3.6
8 9 Arjun Nair Engineering 105000 9 Bangalore 4.7
9 10 Pooja Shah HR 58000 6 Delhi 4.1
10 11 Dev Mishra Engineering 88000 6 Delhi 4.4
11 12 Ananya Roy Marketing 71000 4 Mumbai 4.0
0 RAHUL SHARMA
1 PRIYA PATEL
2 GAGAN SINGH
3 AMIT KUMAR
4 NEHA GUPTA
5 RAVI VERMA
6 SNEHA JOSHI
7 KIRAN RAO
8 ARJUN NAIR
9 POOJA SHAH
10 DEV MISHRA
11 ANANYA ROY
Name: name, dtype: object
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
Name: name, dtype: bool
name first_name
0 Rahul Sharma Rahul
1 Priya Patel Priya
2 Gagan Singh Gagan
3 Amit Kumar Amit
4 Neha Gupta Neha
5 Ravi Verma Ravi
6 Sneha Joshi Sneha
7 Kiran Rao Kiran
8 Arjun Nair Arjun
9 Pooja Shah Pooja
10 Dev Mishra Dev
11 Ananya Roy Ananya
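The strip/title combination matters because real files often carry stray spaces and inconsistent casing, which quietly break filters and groupbys. A tiny sketch with deliberately messy values (made up for illustration):

```python
import pandas as pd

s = pd.Series(["  delhi ", "MUMBAI", " Delhi"])

# Normalize whitespace and casing so identical cities compare equal
clean = s.str.strip().str.title()
print(clean.tolist())   # ['Delhi', 'Mumbai', 'Delhi']
print(clean.nunique())  # 2 distinct cities after cleaning
```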
Part 10 — Applying Custom Functions
apply() — Apply Function to Column
df = pd.read_csv("employees.csv")
# Apply function to each value in a column
def get_grade(rating):
    if rating >= 4.5:
        return "Excellent"
    elif rating >= 4.0:
        return "Good"
    elif rating >= 3.5:
        return "Average"
    else:
        return "Needs Improvement"

df["performance"] = df["rating"].apply(get_grade)
print(df[["name", "rating", "performance"]])
Output:
name rating performance
0 Rahul Sharma 4.2 Good
1 Priya Patel 3.8 Average
2 Gagan Singh 4.5 Excellent
3 Amit Kumar 3.2 Needs Improvement
4 Neha Gupta 4.0 Good
5 Ravi Verma 3.9 Average
6 Sneha Joshi 4.3 Good
7 Kiran Rao 3.6 Average
8 Arjun Nair 4.7 Excellent
9 Pooja Shah 4.1 Good
10 Dev Mishra 4.4 Good
11 Ananya Roy 4.0 Good
Lambda with apply()
# Quick one-liner transformations
df["salary_in_lakhs"] = df["salary"].apply(lambda x: round(x / 100000, 2))
df["experience_level"] = df["experience"].apply(
    lambda x: "Junior" if x <= 3 else ("Mid" if x <= 6 else "Senior")
)
print(df[["name", "experience", "experience_level", "salary_in_lakhs"]])
Output:
name experience experience_level salary_in_lakhs
0 Rahul Sharma 5 Mid 0.85
1 Priya Patel 3 Junior 0.62
2 Gagan Singh 7 Senior 0.92
3 Amit Kumar 2 Junior 0.48
4 Neha Gupta 4 Mid 0.55
5 Ravi Verma 4 Mid 0.78
6 Sneha Joshi 5 Mid 0.67
7 Kiran Rao 3 Junior 0.52
8 Arjun Nair 9 Senior 1.05
9 Pooja Shah 6 Mid 0.58
10 Dev Mishra 6 Mid 0.88
11 Ananya Roy 4 Mid 0.71
apply() on Entire Row — axis=1
def calculate_bonus(row):
    if row["rating"] >= 4.5:
        return row["salary"] * 0.20
    elif row["rating"] >= 4.0:
        return row["salary"] * 0.15
    else:
        return row["salary"] * 0.10

df["bonus"] = df.apply(calculate_bonus, axis=1)
print(df[["name", "salary", "rating", "bonus"]])
Output:
name salary rating bonus
0 Rahul Sharma 85000 4.2 12750.0
1 Priya Patel 62000 3.8 6200.0
2 Gagan Singh 92000 4.5 18400.0
3 Amit Kumar 48000 3.2 4800.0
4 Neha Gupta 55000 4.0 8250.0
5 Ravi Verma 78000 3.9 7800.0
6 Sneha Joshi 67000 4.3 10050.0
7 Kiran Rao 52000 3.6 5200.0
8 Arjun Nair 105000 4.7 21000.0
9 Pooja Shah 58000 4.1 8700.0
10 Dev Mishra 88000 4.4 13200.0
11 Ananya Roy 71000 4.0 10650.0
axis=1 means apply the function to each row — row inside the function is a Series of that row's values.
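One caveat: row-wise apply calls a Python function once per row, so it is the slowest option on large data. For threshold logic like the bonus rules above, np.select computes the same result vectorized. A sketch (an alternative approach, not from the original example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary": [85000, 62000, 92000],
    "rating": [4.2, 3.8, 4.5],
})

# Conditions are checked in order; the first match wins, like if/elif/else
conditions = [df["rating"] >= 4.5, df["rating"] >= 4.0]
rates = [0.20, 0.15]
df["bonus"] = df["salary"] * np.select(conditions, rates, default=0.10)
print(df["bonus"].tolist())  # [12750.0, 6200.0, 18400.0]
```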
Complete Real World Analysis
Let's do a full analysis of our employees dataset:
import pandas as pd
import numpy as np
df = pd.read_csv("employees.csv")
print("=" * 50)
print(" EMPLOYEE DATA ANALYSIS REPORT")
print("=" * 50)
print(f"\nTotal Employees: {len(df)}")
print(f"Departments: {df['department'].nunique()}")
print(f"Cities: {df['city'].nunique()}")
print("\n--- Salary Statistics ---")
print(f"Average Salary : Rs.{df['salary'].mean():,.0f}")
print(f"Median Salary : Rs.{df['salary'].median():,.0f}")
print(f"Highest Salary : Rs.{df['salary'].max():,.0f}")
print(f"Lowest Salary : Rs.{df['salary'].min():,.0f}")
print("\n--- Department Summary ---")
dept = df.groupby("department").agg(
headcount=("name", "count"),
avg_salary=("salary", "mean"),
avg_rating=("rating", "mean")
).round(2)
print(dept.to_string())
print("\n--- Top 3 Earners ---")
top3 = df.nlargest(3, "salary")[["name", "department", "salary"]]
for _, row in top3.iterrows():
print(f" {row['name']} ({row['department']}): Rs.{row['salary']:,}")
print("\n--- Performance Distribution ---")
df["performance"] = df["rating"].apply(
lambda r: "Excellent" if r >= 4.5 else ("Good" if r >= 4.0 else "Average")
)
print(df["performance"].value_counts().to_string())
print("\n--- City Distribution ---")
city_counts = df["city"].value_counts()
for city, count in city_counts.items():
bar = "█" * count
print(f" {city:<12} {bar} ({count})")
print("\n--- Salary by Experience Level ---")
df["exp_level"] = df["experience"].apply(
lambda x: "Junior (0-3)" if x <= 3 else ("Mid (4-6)" if x <= 6 else "Senior (7+)")
)
exp_salary = df.groupby("exp_level")["salary"].mean().sort_values(ascending=False)
for level, avg in exp_salary.items():
print(f" {level}: Rs.{avg:,.0f}")
Output:
==================================================
EMPLOYEE DATA ANALYSIS REPORT
==================================================
Total Employees: 12
Departments: 4
Cities: 5
--- Salary Statistics ---
Average Salary : Rs.71,583
Median Salary : Rs.69,000
Highest Salary : Rs.105,000
Lowest Salary : Rs.48,000
--- Department Summary ---
headcount avg_salary avg_rating
department
Engineering 5 89600.00 4.34
HR 2 56500.00 4.05
Marketing 3 66666.67 4.03
Sales 2 50000.00 3.40
--- Top 3 Earners ---
Arjun Nair (Engineering): Rs.105,000
Gagan Singh (Engineering): Rs.92,000
Dev Mishra (Engineering): Rs.88,000
--- Performance Distribution ---
Good 6
Average 4
Excellent 2
--- City Distribution ---
Delhi        ████ (4)
Mumbai       ███ (3)
Bangalore    ███ (3)
Hyderabad    █ (1)
Chennai      █ (1)
--- Salary by Experience Level ---
Senior (7+): Rs.98,500
Mid (4-6): Rs.71,714
Junior (0-3): Rs.54,000
Saving Data
# Save to CSV
df.to_csv("output.csv", index=False) # index=False — don't save row numbers
# Save to Excel
df.to_excel("output.xlsx", index=False)
# Save to JSON
df.to_json("output.json", orient="records", indent=2)
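To see why index=False matters: without it, the row numbers are written as an extra unnamed column that comes back when you re-read the file. A sketch using an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Rahul", "Priya"], "salary": [45000, 72000]})

# Write to a StringIO buffer so we can inspect the CSV text directly
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```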
Quick Reference
# Loading
df = pd.read_csv("file.csv")
# Exploring
df.shape # dimensions
df.head() # first 5 rows
df.tail() # last 5 rows
df.info() # column types and nulls
df.describe() # statistics
df.columns # column names
df.dtypes # data types
# Selecting
df["col"] # single column
df[["col1", "col2"]] # multiple columns
df.iloc[0] # row by position
df.loc[df["col"] > 50] # rows by condition
df.loc[condition, ["col1"]] # rows + columns
# Modifying
df["new_col"] = values # add column
df.drop(columns=["col"]) # remove column
df.rename(columns={"old": "new"}) # rename
df.fillna(value) # fill missing
df.dropna() # remove missing rows
# Analysis
df["col"].value_counts() # frequency count
df.groupby("col").mean() # group and aggregate
df.sort_values("col") # sort
df["col"].apply(func) # apply function
df.nlargest(5, "col") # top 5 rows
df.nsmallest(5, "col") # bottom 5 rows
# Saving
df.to_csv("file.csv", index=False)
Exercise 🏋️
Download a free dataset from Kaggle (for example the "World Population Dataset"), or use this mock data:
Create sales_data.csv:
date,product,category,quantity,price,city,salesperson
2024-01-05,Laptop,Electronics,2,75000,Delhi,Rahul
2024-01-08,Phone,Electronics,5,25000,Mumbai,Priya
2024-01-10,Desk,Furniture,1,15000,Delhi,Rahul
2024-01-15,Phone,Electronics,3,25000,Bangalore,Gagan
2024-01-18,Chair,Furniture,4,8000,Delhi,Amit
2024-01-22,Laptop,Electronics,1,75000,Mumbai,Priya
2024-01-25,Tablet,Electronics,2,35000,Chennai,Neha
2024-02-02,Phone,Electronics,6,25000,Delhi,Rahul
2024-02-05,Desk,Furniture,2,15000,Mumbai,Priya
2024-02-10,Laptop,Electronics,3,75000,Bangalore,Gagan
2024-02-15,Chair,Furniture,5,8000,Hyderabad,Ravi
2024-02-20,Tablet,Electronics,4,35000,Delhi,Rahul
2024-03-01,Phone,Electronics,8,25000,Mumbai,Priya
2024-03-05,Laptop,Electronics,2,75000,Delhi,Amit
2024-03-10,Desk,Furniture,3,15000,Bangalore,Gagan
Tasks:
- Load and explore the dataset (shape, info, describe)
- Add a revenue column (quantity × price)
- Total revenue per category
- Top performing salesperson by total revenue
- Which city has highest total sales
- Monthly revenue trend (group by month)
- Best selling product by quantity
- Average order value per salesperson
- Find all Electronics sales above Rs.1,00,000 revenue
- Save the cleaned dataset with revenue column to a new CSV