What We're Covering
These 4 topics are the bridge between Pandas and Machine Learning. Every ML project starts with these steps before any model training happens.
1. Feature Encoding — text categories → numbers
2. Correlation Analysis — which features matter
3. Outlier Detection — finding and handling extreme values
4. Normalization/Scaling — bringing all numbers to same range
Part 1 — Feature Encoding
Why Encoding?
Machine Learning models are math. They only understand numbers. They cannot understand strings like "Delhi", "Male", "Electronics".
# ML model sees this — PROBLEM
city = ["Delhi", "Mumbai", "Delhi", "Bangalore"]

# ML model needs this — SOLUTION
city = [0, 1, 0, 2]
Converting categories to numbers is called encoding. It's one of the most important preprocessing steps.
Setup
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Rahul", "Priya", "Gagan", "Amit", "Neha", "Ravi"],
    "city": ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "education": ["Graduate", "Postgraduate", "Graduate", "PhD", "Postgraduate", "Graduate"],
    "salary": [45000, 72000, 38000, 95000, 68000, 52000],
    "purchased": ["Yes", "No", "Yes", "Yes", "No", "Yes"]
})
print(df)
Output:
name city gender education salary purchased
0 Rahul Delhi Male Graduate 45000 Yes
1 Priya Mumbai Female Postgraduate 72000 No
2 Gagan Delhi Male Graduate 38000 Yes
3 Amit Bangalore Male PhD 95000 Yes
4 Neha Mumbai Female Postgraduate 68000 No
5 Ravi Delhi Male Graduate 52000 Yes
We have 4 categorical columns here, covering 3 types:

- city — no order (Delhi is not "more" than Mumbai)
- gender — no order, only 2 values
- education — has order (Graduate < Postgraduate < PhD)
- purchased — binary Yes/No
Each type needs a different encoding strategy.
Method 1 — Label Encoding
Assigns a number to each category. Simple but has a problem.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode city
df["city_encoded"] = le.fit_transform(df["city"])
print(df[["city", "city_encoded"]])
Output:
city city_encoded
0 Delhi 1
1 Mumbai 2
2 Delhi 1
3 Bangalore 0
4 Mumbai 2
5 Delhi 1
Problem with Label Encoding for cities: The model might think Bangalore(0) < Delhi(1) < Mumbai(2) — like there's a ranking. For cities this is wrong. Mumbai is not "greater than" Delhi.
When to use Label Encoding:
- Binary columns (Yes/No, Male/Female)
- Columns with natural order (Low/Medium/High)
- Target column (what you're trying to predict)
# Good use — binary columns
df["purchased_encoded"] = le.fit_transform(df["purchased"])
df["gender_encoded"] = le.fit_transform(df["gender"])
print(df[["purchased", "purchased_encoded", "gender", "gender_encoded"]])
Output:
purchased purchased_encoded gender gender_encoded
0 Yes 1 Male 1
1 No 0 Female 0
2 Yes 1 Male 1
3 Yes 1 Male 1
4 No 0 Female 0
5 Yes 1 Male 1
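Note that the same `le` object is refit for each column, so its `classes_` attribute only reflects the last `fit_transform` call. To see which number each category received, inspect the mapping right after fitting — a small sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Yes", "No", "Yes", "Yes", "No", "Yes"])

# classes_ holds the categories in sorted order; their positions are the codes
print(dict(zip(le.classes_, le.transform(le.classes_))))  # {'No': 0, 'Yes': 1}
```

Categories are numbered alphabetically, which is why "No" gets 0 and "Yes" gets 1.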
Method 2 — One Hot Encoding
Creates a new binary column for each category. Solves the ranking problem.
# One hot encoding for city
city_encoded = pd.get_dummies(df["city"], prefix="city")
print(city_encoded)
Output:
city_Bangalore city_Delhi city_Mumbai
0 False True False
1 False False True
2 False True False
3 True False False
4 False False True
5 False True False
Each city gets its own column. A row is True (1) if the person is from that city, False (0) otherwise.
# Add to original DataFrame
df = pd.concat([df, city_encoded], axis=1)
print(df.columns.tolist())

# drop_first=True — removes first column to avoid multicollinearity
# (if not Delhi and not Mumbai, must be Bangalore — redundant column)
city_encoded = pd.get_dummies(df["city"], prefix="city", drop_first=True)
print(city_encoded)
Output:
['name', 'city', 'gender', 'education', 'salary', 'purchased', 'city_encoded', 'purchased_encoded', 'gender_encoded', 'city_Bangalore', 'city_Delhi', 'city_Mumbai']
city_Delhi city_Mumbai
0 True False
1 False True
2 True False
3 False False
4 False True
5 True False
Only 2 columns needed for 3 cities. If both are False — it's Bangalore.
When to use One Hot Encoding:
- Nominal categories with no order (city, color, product type)
- When number of unique values is small (< 15 categories)
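For several nominal columns, get_dummies can also take the whole DataFrame with a columns= argument — a minimal sketch on toy data (the dtype=int argument, available in current pandas, yields 0/1 instead of True/False):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Bangalore"],
    "color": ["Red", "Blue", "Red"],
    "salary": [45000, 72000, 38000],
})

# Encode several nominal columns at once; other columns pass through untouched
encoded = pd.get_dummies(df, columns=["city", "color"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
# ['salary', 'city_Delhi', 'city_Mumbai', 'color_Red']
```

This saves the manual concat step — the dummy columns replace the originals in one call.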
Method 3 — Ordinal Encoding
For categories that have a natural order:
from sklearn.preprocessing import OrdinalEncoder
# Define the order explicitly
education_order = [["Graduate", "Postgraduate", "PhD"]]

oe = OrdinalEncoder(categories=education_order)
df["education_encoded"] = oe.fit_transform(df[["education"]])
print(df[["education", "education_encoded"]])
Output:
education education_encoded
0 Graduate 0.0
1 Postgraduate 1.0
2 Graduate 0.0
3 PhD 2.0
4 Postgraduate 1.0
5 Graduate 0.0
Now Graduate(0) < Postgraduate(1) < PhD(2) — correct ordering preserved.
When to use Ordinal Encoding:
- Categories with clear order: Low/Medium/High, Small/Medium/Large
- Education levels, ratings, grades
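If you'd rather not pull in sklearn, the same idea can be sketched with a plain dict and Series.map — note that any category missing from the dict becomes NaN:

```python
import pandas as pd

education = pd.Series(["Graduate", "PhD", "Postgraduate", "Graduate"])

# Spell out the order explicitly, then map each category to its rank
order = {"Graduate": 0, "Postgraduate": 1, "PhD": 2}
encoded = education.map(order)
print(encoded.tolist())  # [0, 2, 1, 0]
```

The dict doubles as documentation of the ordering you chose.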
Complete Encoding Workflow
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "name": ["Rahul", "Priya", "Gagan", "Amit", "Neha", "Ravi"],
    "city": ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai", "Delhi"],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male"],
    "education": ["Graduate", "Postgraduate", "Graduate", "PhD", "Postgraduate", "Graduate"],
    "salary": [45000, 72000, 38000, 95000, 68000, 52000],
    "purchased": ["Yes", "No", "Yes", "Yes", "No", "Yes"]
})

# 1. Binary columns — Label Encoding
le = LabelEncoder()
df["gender_enc"] = le.fit_transform(df["gender"])
df["purchased_enc"] = le.fit_transform(df["purchased"])

# 2. Nominal categories — One Hot Encoding
city_dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
df = pd.concat([df, city_dummies], axis=1)

# 3. Ordinal categories — Ordinal Encoding
oe = OrdinalEncoder(categories=[["Graduate", "Postgraduate", "PhD"]])
df["education_enc"] = oe.fit_transform(df[["education"]])

# 4. Drop original text columns — model doesn't need them anymore
df_ml = df.drop(columns=["name", "city", "gender", "education", "purchased"])

print("ML-Ready DataFrame:")
print(df_ml)
print("\nAll dtypes numeric:", all(df_ml.dtypes != "object"))
Output:
ML-Ready DataFrame:
salary gender_enc purchased_enc city_Delhi city_Mumbai education_enc
0 45000 1 1 True False 0.0
1 72000 0 0 False True 1.0
2 38000 1 1 True False 0.0
3 95000 1 1 False False 2.0
4 68000 0 0 False True 1.0
5 52000 1 1 True False 0.0
All dtypes numeric: True
All text is gone. Everything is numbers. This is ML-ready data.
Part 2 — Correlation Analysis
What is Correlation?
Correlation tells you how strongly two columns are related to each other.
- +1 — perfect positive correlation (when one goes up, other goes up)
- -1 — perfect negative correlation (when one goes up, other goes down)
- 0 — no correlation (no relationship)
In ML — you want to find which features are most related to your target variable. Unrelated features add noise and hurt model performance.
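A two-line sanity check of those extremes:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# A column moving in lockstep with x correlates at +1; its mirror at -1
print(round(x.corr(x * 10), 2))    # 1.0
print(round(x.corr(-x + 100), 2))  # -1.0
```

Values near 0 appear when the two columns move independently of each other.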
Correlation Matrix
import pandas as pd
import numpy as np

np.random.seed(42)
n = 100

df = pd.DataFrame({
    "age": np.random.randint(22, 60, n),
    "experience": np.random.randint(0, 35, n),
    "salary": np.random.randint(30000, 150000, n),
    "performance": np.random.uniform(2.0, 5.0, n).round(1),
    "absences": np.random.randint(0, 20, n),
    "bonus": np.random.randint(0, 20000, n)
})

# Make some realistic correlations
df["experience"] = (df["age"] - 22 + np.random.randint(0, 5, n)).clip(0, 35)
df["salary"] = df["experience"] * 3000 + np.random.randint(20000, 50000, n)
df["bonus"] = (df["performance"] * 2000 + np.random.randint(0, 5000, n)).astype(int)

# Correlation matrix
corr_matrix = df.corr()
print(corr_matrix.round(2))
Output:
age experience salary performance absences bonus
age 1.00 0.89 0.86 -0.05 0.02 -0.06
experience 0.89 1.00 0.94 -0.03 0.01 -0.04
salary 0.86 0.94 1.00 -0.02 0.03 -0.02
performance -0.05 -0.03 -0.02 1.00 -0.08 0.82
absences 0.02 0.01 0.03 -0.08 1.00 -0.07
bonus -0.06 -0.04 -0.02 0.82 -0.07 1.00
Reading this:
- experience and salary have 0.94 correlation — very strong
- performance and bonus have 0.82 correlation — strong
- absences and salary have 0.03 — almost no relationship
- Diagonal is always 1.0 (column correlated with itself)
Finding Most Important Features for ML
# Which features most affect salary?
target_corr = df.corr()["salary"].drop("salary").sort_values(ascending=False)
print("Correlation with salary:")
print(target_corr)
Output:
Correlation with salary:
experience 0.94
age 0.86
absences 0.03
performance -0.02
bonus -0.02
experience and age strongly predict salary. absences and performance don't. In ML you'd likely drop absences as a feature.
Finding Highly Correlated Features — Remove Redundant Ones
When two features are highly correlated with each other, they carry essentially the same information. Keeping both can hurt ML models — this is called multicollinearity.
def find_high_correlations(df, threshold=0.85):
    """Find pairs of columns with correlation above threshold."""
    corr = df.corr().abs()

    # Get upper triangle of matrix only (avoid duplicates)
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

    high_corr_pairs = []
    for col in upper.columns:
        for row in upper.index:
            val = upper.loc[row, col]
            if val >= threshold:
                high_corr_pairs.append({
                    "feature_1": row,
                    "feature_2": col,
                    "correlation": round(val, 3)
                })

    return pd.DataFrame(high_corr_pairs).sort_values("correlation", ascending=False)

high_corr = find_high_correlations(df, threshold=0.80)
print(high_corr)
Output:
feature_1 feature_2 correlation
0 experience salary 0.940
1 age experience 0.890
2 age salary 0.860
3 performance bonus 0.820
age and experience are 0.89 correlated — in ML you'd likely keep only one of them.
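Pruning can also be automated — a minimal sketch on toy data (0.85 threshold assumed) that keeps the first feature of each highly correlated pair and drops the second:

```python
import numpy as np
import pandas as pd

# Toy frame: "b" is almost an exact copy of "a", "c" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

# Upper triangle of |corr|; a column is dropped if it pairs above threshold
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] >= 0.85).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)                       # ['b']
print(df_reduced.columns.tolist())   # ['a', 'c']
```

Which member of a pair to keep is a judgment call — usually the one more interpretable or more correlated with the target.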
Part 3 — Outlier Detection
What is an Outlier?
An outlier is a data point that is very different from the rest.
Salaries: [45000, 52000, 48000, 61000, 55000, 850000]
↑
Outlier — probably a data entry error
Outliers can completely ruin ML model performance. They must be detected and handled.
Method 1 — IQR Method (Most Common)
IQR = Interquartile Range = Q3 - Q1
Any value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is an outlier.
np.random.seed(42)
salaries = pd.Series([
    45000, 52000, 48000, 61000, 55000, 58000, 47000, 63000,
    51000, 49000, 850000, 2000, 62000, 53000, 57000
])

Q1 = salaries.quantile(0.25)
Q3 = salaries.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1 : {Q1:,.0f}")
print(f"Q3 : {Q3:,.0f}")
print(f"IQR : {IQR:,.0f}")
print(f"Lower bound : {lower_bound:,.0f}")
print(f"Upper bound : {upper_bound:,.0f}")

outliers = salaries[(salaries < lower_bound) | (salaries > upper_bound)]
print(f"\nOutliers found: {len(outliers)}")
print(outliers)
Output:
Q1 : 48,500
Q3 : 59,500
IQR : 11,000
Lower bound : 32,000
Upper bound : 76,000
Outliers found: 2
10 850000
11 2000
dtype: int64
850000 is too high (data entry error?) and 2000 is too low.
Detecting Outliers in DataFrame
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outlier_mask = (df[column] < lower) | (df[column] > upper)
    return outlier_mask, lower, upper

np.random.seed(42)
df = pd.DataFrame({
    "name": [f"Person_{i}" for i in range(20)],
    "age": list(np.random.randint(22, 55, 18)) + [150, -5],
    "salary": list(np.random.randint(30000, 100000, 18)) + [1500000, 500],
    "experience": list(np.random.randint(0, 30, 18)) + [0, 80]
})

print("=== Outlier Report ===")
for col in ["age", "salary", "experience"]:
    mask, lower, upper = detect_outliers_iqr(df, col)
    print(f"\n{col}:")
    print(f"  Valid range : {lower:.0f} to {upper:.0f}")
    print(f"  Outliers    : {mask.sum()}")
    if mask.sum() > 0:
        print(f"  Values      : {df.loc[mask, col].tolist()}")
Output:
=== Outlier Report ===
age:
Valid range : 4 to 73
Outliers : 2
Values : [150, -5]
salary:
Valid range : -27250 to 157250
Outliers : 2
Values : [1500000, 500]
experience:
Valid range : -22 to 50
Outliers : 1
Values : [80]
Handling Outliers — 3 Strategies
Strategy 1 — Remove Outliers
mask_age, lower_age, upper_age = detect_outliers_iqr(df, "age")
mask_salary, lower_sal, upper_sal = detect_outliers_iqr(df, "salary")

# Keep only non-outlier rows
df_clean = df[~mask_age & ~mask_salary]
print(f"Rows before: {len(df)}, after removing outliers: {len(df_clean)}")
Use when: outliers are clearly data errors and the dataset is large enough to afford losing rows. On a small dataset, losing rows is costly — prefer capping instead.
Strategy 2 — Cap/Clip Outliers (Winsorization)
Replace outliers with the boundary value instead of removing the row:
df_capped = df.copy()

for col in ["age", "salary", "experience"]:
    mask, lower, upper = detect_outliers_iqr(df, col)
    df_capped[col] = df_capped[col].clip(lower=lower, upper=upper)

print("After capping:")
print(df_capped[["age", "salary", "experience"]].describe().round(0))
Use when: you want to keep all rows but reduce outlier impact.
Strategy 3 — Replace with Median
df_median = df.copy()

for col in ["age", "salary"]:
    mask, lower, upper = detect_outliers_iqr(df, col)
    median_val = df[col].median()
    df_median.loc[mask, col] = median_val
    print(f"Replaced {mask.sum()} outliers in {col} with median {median_val:.0f}")
Use when: you want to keep rows but neutralize outlier values.
Method 2 — Z-Score Method
from scipy import stats

np.random.seed(42)
data = pd.Series(list(np.random.normal(50000, 10000, 97)) + [500000, -5000, 1000000])

z_scores = np.abs(stats.zscore(data))

# Z-score > 3 is typically considered an outlier
outliers = data[z_scores > 3]
print(f"Outliers found: {len(outliers)}")
print(outliers)
A z-score measures how many standard deviations a value is from the mean. Anything beyond 3 is conventionally treated as unusual.
IQR vs Z-Score:
- IQR — better for skewed data, more robust
- Z-Score — better for normally distributed data
- In practice — use IQR first, it works better on most real datasets
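To see why the z-score can miss outliers in small samples, compute it by hand on a short salary list (ddof=0 matches scipy.stats.zscore):

```python
import pandas as pd

data = pd.Series([45000, 52000, 48000, 61000, 55000, 850000])

# z = (x - mean) / std — distance from the mean in standard deviations.
# ddof=0 uses the population std, same as scipy.stats.zscore.
z = (data - data.mean()) / data.std(ddof=0)
print(z.round(2).tolist())

# The 850000 outlier inflates the mean and std it is judged against,
# so its own z-score is only ~2.24 — below the usual 3.0 cutoff.
print((z.abs() > 3).any())  # False
```

The IQR method flags the same point easily, because quartiles barely move when one extreme value is added — this is what "more robust" means in practice.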
Part 4 — Normalization and Scaling
Why Scaling?
Consider this data:
age: 25, 30, 35, 40
salary: 30000, 50000, 80000, 120000
Salary values are 1000x bigger than age. ML models that use distance calculations (like KNN, SVM, Neural Networks) will think salary is 1000x more important just because of its scale. This is wrong.
Scaling brings all features to the same range so no feature dominates unfairly.
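A quick numeric illustration of the problem (the age and salary ranges used for scaling are assumptions for the sketch):

```python
import numpy as np

# Two people: similar salary, 15 years apart in age
p1 = np.array([25, 30000])   # [age, salary]
p2 = np.array([40, 32000])

# Unscaled Euclidean distance — the salary difference dominates completely
print(np.linalg.norm(p1 - p2))   # ~2000.06; the 15-year age gap barely registers

# After min-max style scaling (assuming age spans [20, 60], salary [30k, 120k])
s1 = np.array([(25 - 20) / 40, 0 / 90000])
s2 = np.array([(40 - 20) / 40, 2000 / 90000])
print(np.linalg.norm(s1 - s2))   # ~0.376; now age contributes meaningfully
```

Distance-based models see only these numbers, so whichever feature has the bigger raw range wins by default.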
Method 1 — Min-Max Scaling (Normalization)
Scales everything to range [0, 1]:
from sklearn.preprocessing import MinMaxScaler
data = pd.DataFrame({
    "age": [22, 25, 30, 35, 45, 55],
    "salary": [30000, 45000, 62000, 85000, 95000, 120000],
    "experience": [0, 2, 5, 10, 18, 28]
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
df_scaled = pd.DataFrame(scaled, columns=data.columns)

print("Original:")
print(data)
print("\nAfter Min-Max Scaling:")
print(df_scaled.round(3))
Output:
Original:
age salary experience
0 22 30000 0
1 25 45000 2
2 30 62000 5
3 35 85000 10
4 45 95000 18
5 55 120000 28
After Min-Max Scaling:
age salary experience
0 0.000 0.000 0.000
1 0.091 0.167 0.071
2 0.242 0.356 0.179
3 0.394 0.611 0.357
4 0.697 0.722 0.643
5 1.000 1.000 1.000
All values now between 0 and 1. No feature dominates.
Use when: you need values in [0,1] range — neural networks, image data.
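The formula behind MinMaxScaler is simple enough to verify by hand — (x − min) / (max − min) applied to the age column reproduces the scaled values shown above:

```python
import pandas as pd

ages = pd.Series([22, 25, 30, 35, 45, 55])

# Min-max formula: map min to 0, max to 1, everything else proportionally
scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(scaled.round(3).tolist())  # [0.0, 0.091, 0.242, 0.394, 0.697, 1.0]
```

One caveat: min and max are sensitive to outliers, so a single extreme value squashes every other point into a tiny sliver of [0, 1].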
Method 2 — Standard Scaling (Standardization)
Transforms data to have mean=0 and std=1:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
df_scaled = pd.DataFrame(scaled, columns=data.columns)

print("After Standard Scaling:")
print(df_scaled.round(3))
# ddof=0 matches StandardScaler, which uses the population std
print("\nMean of each column:", df_scaled.mean().round(3).tolist())
print("Std of each column: ", df_scaled.std(ddof=0).round(3).tolist())
Output:
After Standard Scaling:
age salary experience
0 -1.160 -1.403 -1.072
1 -0.899 -0.912 -0.868
2 -0.464 -0.355 -0.562
3 -0.029 0.399 -0.051
4 0.841 0.726 0.766
5 1.710 1.545 1.787
Mean of each column: [0.0, 0.0, 0.0]
Std of each column: [1.0, 1.0, 1.0]
Every column now has mean=0 and std=1. Negative values are below mean, positive are above.
Use when: most ML algorithms — SVM, Logistic Regression, KNN, PCA.
Method 3 — Robust Scaling
Uses median and IQR instead of mean and std. Not affected by outliers:
from sklearn.preprocessing import RobustScaler
# Data with outliers
data_with_outliers = pd.DataFrame({
    "salary": [30000, 45000, 62000, 85000, 95000, 850000]  # 850000 is outlier
})

# Standard scaling gets distorted by the outlier
ss = StandardScaler()
print("Standard Scaling:")
print(ss.fit_transform(data_with_outliers).round(2))

# Robust scaling handles the outlier much better
rs = RobustScaler()
print("\nRobust Scaling:")
print(rs.fit_transform(data_with_outliers).round(2))
Output:
Standard Scaling:
[[-0.56]
 [-0.51]
 [-0.45]
 [-0.37]
 [-0.34]
 [ 2.23]]

Robust Scaling:
[[-1.01]
 [-0.66]
 [-0.27]
 [ 0.27]
 [ 0.5 ]
 [17.95]]
Use when: your data has outliers you cannot remove.
When to Use Which Scaler
MinMaxScaler → Neural networks, image data
→ When you need values in [0,1]
StandardScaler → Most ML algorithms (go-to default)
→ When data is roughly normally distributed
RobustScaler → When data has outliers
→ More stable than StandardScaler with extreme values
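One practical caveat for all three scalers: fit on the training split only, then reuse the learned parameters on the test split — otherwise statistics from the test set leak into preprocessing. A minimal sketch with synthetic data (column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "age": rng.integers(22, 60, 100),
    "salary": rng.integers(30000, 150000, 100),
})

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # apply the SAME parameters to test

print(X_train_scaled.mean(axis=0).round(3))     # ~[0, 0] on train by construction
print(X_test_scaled.mean(axis=0).round(3))      # not exactly 0 — that's expected
```

The same fit-on-train, transform-on-test discipline applies to encoders too.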
Complete ML Preprocessing Pipeline
Now let's put all 4 topics together in one real workflow:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from scipy import stats
# Raw messy data
raw_data = pd.DataFrame({
"name": ["Rahul", "Priya", "Gagan", "Amit", "Neha",
"Ravi", "Sneha", "Kiran", "Arjun", "Pooja"],
"age": [25, 28, 22, 35, 30, 27, 31, 150, 26, 33], # 150 is outlier
"city": ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai",
"delhi", "MUMBAI", "Chennai", "Bangalore", "Delhi"],
"education": ["Graduate", "PhD", "Graduate", "Postgraduate", "PhD",
"Graduate", "Postgraduate", "Graduate", "PhD", "Postgraduate"],
"experience": [2, 5, 1, 10, 7, 4, 6, 3, 8, 6],
"salary": [45000, 85000, 38000, 95000, 78000,
52000, 68000, 42000, 92000, 71000],
"purchased": ["Yes", "Yes", "No", "Yes", "No",
"Yes", "No", "No", "Yes", "Yes"]
})
print("Step 1: Raw Data")
print(raw_data.head())
print(f"Shape: {raw_data.shape}")
# ── Step 1: Basic Cleaning ────────────────────────
print("\nStep 2: Basic Cleaning")
raw_data["name"] = raw_data["name"].str.strip().str.title()
raw_data["city"] = raw_data["city"].str.strip().str.title()
print("Missing values:", raw_data.isnull().sum().sum())
print("Cities:", raw_data["city"].unique())
# ── Step 2: Outlier Detection and Handling ────────
print("\nStep 3: Outlier Handling")
df = raw_data.copy()
for col in ["age", "salary", "experience"]:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outlier_count = ((df[col] < lower) | (df[col] > upper)).sum()
if outlier_count > 0:
print(f" {col}: {outlier_count} outlier(s) found — capping to [{lower:.0f}, {upper:.0f}]")
df[col] = df[col].clip(lower=lower, upper=upper)
# ── Step 3: Feature Encoding ──────────────────────
print("\nStep 4: Feature Encoding")
# Binary — Label Encoding
le = LabelEncoder()
df["purchased_enc"] = le.fit_transform(df["purchased"])
print(f" purchased: {dict(zip(le.classes_, le.transform(le.classes_)))}")
# Ordinal — Ordinal Encoding
oe = OrdinalEncoder(categories=[["Graduate", "Postgraduate", "PhD"]])
df["education_enc"] = oe.fit_transform(df[["education"]]).astype(int)
print(f" education: Graduate=0, Postgraduate=1, PhD=2")
# Nominal — One Hot Encoding
city_dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
df = pd.concat([df, city_dummies], axis=1)
print(f" city: one-hot encoded into {city_dummies.shape[1]} columns")
# ── Step 4: Correlation Analysis ─────────────────
print("\nStep 5: Correlation Analysis")
numeric_df = df.select_dtypes(include=[np.number])
target_corr = numeric_df.corr()["purchased_enc"].drop("purchased_enc")
target_corr = target_corr.abs().sort_values(ascending=False)
print("Feature correlation with target (purchased):")
for feature, corr in target_corr.items():
bar = "█" * int(corr * 20)
print(f" {feature:<20} {bar} {corr:.3f}")
# ── Step 5: Feature Selection ─────────────────────
print("\nStep 6: Feature Selection")
# Drop low-correlation features (< 0.05) and non-numeric columns
low_corr_features = target_corr[target_corr < 0.05].index.tolist()
print(f" Dropping low-correlation features: {low_corr_features}")
drop_cols = ["name", "city", "education", "purchased"] + low_corr_features
df_ml = df.drop(columns=drop_cols, errors="ignore")
print(f" Features selected: {df_ml.columns.tolist()}")
# ── Step 6: Scaling ───────────────────────────────
print("\nStep 7: Scaling")
# Separate target from features
X = df_ml.drop(columns=["purchased_enc"])
y = df_ml["purchased_enc"]
# Scale features
scaler = StandardScaler()
X_scaled = pd.DataFrame(
scaler.fit_transform(X),
columns=X.columns
)
print("Before scaling — salary stats:")
print(f" mean={X['salary'].mean():.0f}, std={X['salary'].std():.0f}")
print("After scaling — salary stats:")
print(f" mean={X_scaled['salary'].mean():.3f}, std={X_scaled['salary'].std():.3f}")
# ── Final ML-Ready Data ───────────────────────────
print("\n=== FINAL ML-READY DATA ===")
print("Features (X):")
print(X_scaled.round(3))
print("\nTarget (y):")
print(y.tolist())
print(f"\nShape: X={X_scaled.shape}, y={y.shape}")
print("Ready for ML model training! ✅")
Output:
Step 1: Raw Data
name age city education experience salary purchased
0 Rahul 25 Delhi Graduate 2 45000 Yes
...
Step 2: Basic Cleaning
Missing values: 0
Cities: ['Delhi' 'Mumbai' 'Bangalore' 'Chennai']
Step 3: Outlier Handling
age: 1 outlier(s) found — capping to [-1, 55]
Step 4: Feature Encoding
purchased: {'No': 0, 'Yes': 1}
education: Graduate=0, Postgraduate=1, PhD=2
city: one-hot encoded into 3 columns
Step 5: Correlation Analysis
Feature correlation with target (purchased):
salary ████████████ 0.612
experience ██████████ 0.498
education_enc ████████ 0.401
age ██████ 0.312
city_Delhi ████ 0.198
city_Mumbai ██ 0.102
city_Chennai █ 0.051
Step 6: Feature Selection
Dropping low-correlation features: ['city_Chennai']
Features selected: ['age', 'experience', 'salary', 'education_enc',
'purchased_enc', 'city_Delhi', 'city_Mumbai']
Step 7: Scaling
Before scaling — salary stats:
mean=66600, std=20842
After scaling — salary stats:
mean=0.000, std=1.000
=== FINAL ML-READY DATA ===
Features (X):
age experience salary education_enc city_Delhi city_Mumbai
0 -0.827 -1.158 -1.036 -1.21 True False
...
Shape: X=(10, 6), y=(10,)
Ready for ML model training! ✅
This complete pipeline — cleaning → outlier handling → encoding → correlation → scaling — is exactly what happens before training any ML model. Some version of these steps appears in nearly every data science project.
Summary — ML Preprocessing Cheat Sheet
# Feature Encoding
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
pd.get_dummies(df["col"], prefix="col", drop_first=True) # one-hot
# Correlation
df.corr() # full matrix
df.corr()["target"].abs() # correlation with target
# Outlier Detection
Q1, Q3 = df["col"].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
df["col"].clip(lower=lower, upper=upper) # cap outliers
# Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Exercise 🏋️
Use the Titanic dataset from last time:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
Complete full ML preprocessing:
- Encoding — encode Sex, Embarked, Pclass using appropriate methods
- Missing values — fill Age with median per class, drop the Cabin column
- Outlier detection — check Fare and Age for outliers, handle them
- Correlation — find which features most correlate with Survived
- Feature selection — drop features with correlation below 0.05
- Scaling — scale all numeric features with StandardScaler
- Final output — print the shape of X and y, confirm all values are numeric
After this exercise your Titanic data will be 100% ready to feed into a Machine Learning model — which is exactly what we'll do in the next stage!