Pandas — ML-Specific Topics

What We're Covering

These 4 topics are the bridge between Pandas and Machine Learning. Every ML project starts with these steps before any model training happens.

1. Feature Encoding     — text categories → numbers
2. Correlation Analysis — which features matter
3. Outlier Detection    — finding and handling extreme values
4. Normalization/Scaling — bringing all numbers to same range

Part 1 — Feature Encoding

Why Encoding?

Machine Learning models are math. They only understand numbers. They cannot understand strings like "Delhi", "Male", "Electronics".


    # ML model sees this — PROBLEM
    city = ["Delhi", "Mumbai", "Delhi", "Bangalore"]

    # ML model needs this — SOLUTION
    city = [0, 1, 0, 2]

Converting categories to numbers is called encoding. It's one of the most important preprocessing steps.
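As a standalone illustration (made-up data, not the lesson's dataset), pandas can already produce such integer codes on its own via the category dtype:

```python
import pandas as pd

city = pd.Series(["Delhi", "Mumbai", "Delhi", "Bangalore"])

# categories are sorted alphabetically: Bangalore=0, Delhi=1, Mumbai=2
codes = city.astype("category").cat.codes
print(codes.tolist())  # [1, 2, 1, 0]
```

The dedicated encoders below give you more control over exactly this mapping.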


Setup


    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "name":       ["Rahul", "Priya", "Gagan", "Amit", "Neha", "Ravi"],
        "city":       ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai", "Delhi"],
        "gender":     ["Male", "Female", "Male", "Male", "Female", "Male"],
        "education":  ["Graduate", "Postgraduate", "Graduate", "PhD", "Postgraduate", "Graduate"],
        "salary":     [45000, 72000, 38000, 95000, 68000, 52000],
        "purchased":  ["Yes", "No", "Yes", "Yes", "No", "Yes"]
    })

    print(df)

Output:

    name       city  gender     education  salary purchased
0  Rahul      Delhi    Male      Graduate   45000       Yes
1  Priya     Mumbai  Female  Postgraduate   72000        No
2  Gagan      Delhi    Male      Graduate   38000       Yes
3   Amit  Bangalore    Male           PhD   95000       Yes
4   Neha     Mumbai  Female  Postgraduate   68000        No
5   Ravi      Delhi    Male      Graduate   52000       Yes

We have 4 categorical columns here, covering 3 types:

  • city — nominal, no order (Delhi is not "more" than Mumbai)
  • gender — nominal, only 2 values
  • education — ordinal (Graduate < Postgraduate < PhD)
  • purchased — binary Yes/No

Each type needs a different encoding strategy.


Method 1 — Label Encoding

Assigns a number to each category. Simple but has a problem.


    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()

    # Encode city
    df["city_encoded"] = le.fit_transform(df["city"])
    print(df[["city", "city_encoded"]])

Output:

        city  city_encoded
0      Delhi             1
1     Mumbai             2
2      Delhi             1
3  Bangalore             0
4     Mumbai             2
5      Delhi             1

Problem with Label Encoding for cities: The model might think Bangalore(0) < Delhi(1) < Mumbai(2) — like there's a ranking. For cities this is wrong. Mumbai is not "greater than" Delhi.

When to use Label Encoding:

  • Binary columns (Yes/No, Male/Female)
  • Columns with natural order (Low/Medium/High)
  • Target column (what you're trying to predict)

    # Good use — binary column
    df["purchased_encoded"] = le.fit_transform(df["purchased"])
    df["gender_encoded"] = le.fit_transform(df["gender"])

    print(df[["purchased", "purchased_encoded", "gender", "gender_encoded"]])

Output:

  purchased  purchased_encoded  gender  gender_encoded
0       Yes                  1    Male               1
1        No                  0  Female               0
2       Yes                  1    Male               1
3       Yes                  1    Male               1
4        No                  0  Female               0
5       Yes                  1    Male               1

Method 2 — One Hot Encoding

Creates a new binary column for each category. Solves the ranking problem.


    # One hot encoding for city
    city_encoded = pd.get_dummies(df["city"], prefix="city")
    print(city_encoded)

Output:

   city_Bangalore  city_Delhi  city_Mumbai
0           False        True        False
1           False       False         True
2           False        True        False
3            True       False        False
4           False       False         True
5           False        True        False

Each city gets its own column. Row is True(1) if person is from that city, False(0) otherwise.


    # Add to original DataFrame
    df = pd.concat([df, city_encoded], axis=1)
    print(df.columns.tolist())

    # drop_first=True — removes first column to avoid multicollinearity
    # (if not Delhi and not Mumbai, must be Bangalore — redundant column)
    city_encoded = pd.get_dummies(df["city"], prefix="city", drop_first=True)
    print(city_encoded)

Output:

['name', 'city', 'gender', 'education', 'salary', 'purchased', 'city_encoded', 'purchased_encoded', 'gender_encoded', 'city_Bangalore', 'city_Delhi', 'city_Mumbai']
   city_Delhi  city_Mumbai
0        True        False
1       False         True
2        True        False
3       False        False
4       False         True
5        True        False

Only 2 columns needed for 3 cities. If both are False — it's Bangalore.

When to use One Hot Encoding:

  • Nominal categories with no order (city, color, product type)
  • When number of unique values is small (< 15 categories)
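One-hot encoding is also reversible. Assuming pandas 1.5 or newer, pd.from_dummies can reconstruct the original column; with drop_first=True you must tell it which category an all-False row means:

```python
import pandas as pd

city = pd.Series(["Delhi", "Mumbai", "Delhi", "Bangalore"])
dummies = pd.get_dummies(city, prefix="city", drop_first=True)

# all-False rows correspond to the dropped first category (Bangalore)
restored = pd.from_dummies(dummies, sep="_", default_category="Bangalore")
print(restored["city"].tolist())  # ['Delhi', 'Mumbai', 'Delhi', 'Bangalore']
```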

Method 3 — Ordinal Encoding

For categories that have a natural order:


    from sklearn.preprocessing import OrdinalEncoder

    # Define the order explicitly
    education_order = [["Graduate", "Postgraduate", "PhD"]]

    oe = OrdinalEncoder(categories=education_order)
    df["education_encoded"] = oe.fit_transform(df[["education"]])

    print(df[["education", "education_encoded"]])

Output:

      education  education_encoded
0      Graduate                0.0
1  Postgraduate                1.0
2      Graduate                0.0
3           PhD                2.0
4  Postgraduate                1.0
5      Graduate                0.0

Now Graduate(0) < Postgraduate(1) < PhD(2) — correct ordering preserved.

When to use Ordinal Encoding:

  • Categories with clear order: Low/Medium/High, Small/Medium/Large
  • Education levels, ratings, grades
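If you'd rather avoid sklearn for a single ordinal column, a plain dictionary with .map() does the same job. A minimal sketch:

```python
import pandas as pd

education = pd.Series(["Graduate", "Postgraduate", "Graduate", "PhD"])

# explicit rank for each level — values missing from the dict would become NaN
rank = {"Graduate": 0, "Postgraduate": 1, "PhD": 2}
encoded = education.map(rank)
print(encoded.tolist())  # [0, 1, 0, 2]
```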

Complete Encoding Workflow


    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

    df = pd.DataFrame({
        "name":       ["Rahul", "Priya", "Gagan", "Amit", "Neha", "Ravi"],
        "city":       ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai", "Delhi"],
        "gender":     ["Male", "Female", "Male", "Male", "Female", "Male"],
        "education":  ["Graduate", "Postgraduate", "Graduate", "PhD", "Postgraduate", "Graduate"],
        "salary":     [45000, 72000, 38000, 95000, 68000, 52000],
        "purchased":  ["Yes", "No", "Yes", "Yes", "No", "Yes"]
    })

    # 1. Binary columns — Label Encoding
    le = LabelEncoder()
    df["gender_enc"]    = le.fit_transform(df["gender"])
    df["purchased_enc"] = le.fit_transform(df["purchased"])

    # 2. Nominal categories — One Hot Encoding
    city_dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
    df = pd.concat([df, city_dummies], axis=1)

    # 3. Ordinal categories — Ordinal Encoding
    oe = OrdinalEncoder(categories=[["Graduate", "Postgraduate", "PhD"]])
    df["education_enc"] = oe.fit_transform(df[["education"]])

    # 4. Drop original text columns — model doesn't need them anymore
    df_ml = df.drop(columns=["name", "city", "gender", "education", "purchased"])

    print("ML-Ready DataFrame:")
    print(df_ml)
    print("\nAll dtypes numeric:", all(df_ml.dtypes != "object"))

Output:

ML-Ready DataFrame:
   salary  gender_enc  purchased_enc  city_Delhi  city_Mumbai  education_enc
0   45000           1              1        True        False            0.0
1   72000           0              0       False         True            1.0
2   38000           1              1        True        False            0.0
3   95000           1              1       False        False            2.0
4   68000           0              0       False         True            1.0
5   52000           1              1        True        False            0.0

All dtypes numeric: True

All text is gone. Everything is numbers. This is ML-ready data.


Part 2 — Correlation Analysis

What is Correlation?

Correlation tells you how strongly two columns are related to each other.

  • +1 — perfect positive correlation (when one goes up, other goes up)
  • -1 — perfect negative correlation (when one goes up, other goes down)
  • 0 — no correlation (no relationship)

In ML — you want to find which features are most related to your target variable. Unrelated features add noise and hurt model performance.
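A tiny sketch (made-up numbers) to make the endpoints concrete:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# y = 2x rises exactly with x — correlation 1 (up to float precision)
print(x.corr(2 * x))

# y = -3x falls exactly as x rises — correlation -1
print(x.corr(-3 * x))
```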


Correlation Matrix


    import pandas as pd
    import numpy as np

    np.random.seed(42)
    n = 100

    df = pd.DataFrame({
        "age":         np.random.randint(22, 60, n),
        "experience":  np.random.randint(0, 35, n),
        "salary":      np.random.randint(30000, 150000, n),
        "performance": np.random.uniform(2.0, 5.0, n).round(1),
        "absences":    np.random.randint(0, 20, n),
        "bonus":       np.random.randint(0, 20000, n)
    })

    # Make some realistic correlations
    df["experience"] = (df["age"] - 22 + np.random.randint(0, 5, n)).clip(0, 35)
    df["salary"]     = df["experience"] * 3000 + np.random.randint(20000, 50000, n)
    df["bonus"]      = (df["performance"] * 2000 + np.random.randint(0, 5000, n)).astype(int)

    # Correlation matrix
    corr_matrix = df.corr()
    print(corr_matrix.round(2))

Output:

              age  experience  salary  performance  absences  bonus
age          1.00        0.89    0.86        -0.05      0.02  -0.06
experience   0.89        1.00    0.94        -0.03      0.01  -0.04
salary       0.86        0.94    1.00        -0.02      0.03  -0.02
performance -0.05       -0.03   -0.02         1.00     -0.08   0.82
absences     0.02        0.01    0.03        -0.08      1.00  -0.07
bonus       -0.06       -0.04   -0.02         0.82     -0.07   1.00

Reading this:

  • experience and salary have 0.94 correlation — very strong
  • performance and bonus have 0.82 correlation — strong
  • absences and salary have 0.03 — almost no relationship
  • Diagonal is always 1.0 (column correlated with itself)
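Rather than scanning the matrix by eye, you can read any single cell with .loc. A quick sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "experience": [1, 3, 5, 7, 9],
    "salary":     [35, 50, 62, 80, 98],   # roughly linear in experience
})

# one cell of the correlation matrix: row "experience", column "salary"
r = df.corr().loc["experience", "salary"]
print(round(r, 3))  # 0.997
```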

Finding Most Important Features for ML


    # Which features most affect salary?
    target_corr = df.corr()["salary"].drop("salary").sort_values(ascending=False)
    print("Correlation with salary:")
    print(target_corr)

Output:

Correlation with salary:
experience     0.94
age            0.86
absences       0.03
performance   -0.02
bonus         -0.02

experience and age strongly predict salary. absences and performance don't. In ML you'd likely drop absences as a feature.


Finding Highly Correlated Features — Remove Redundant Ones

When two features are highly correlated with each other, they carry the same information. Keeping both can hurt ML models (this is called multicollinearity).


    def find_high_correlations(df, threshold=0.85):
        """Find pairs of columns with correlation above threshold."""
        corr = df.corr().abs()

        # Get upper triangle of matrix only (avoid duplicates)
        upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

        high_corr_pairs = []
        for col in upper.columns:
            for row in upper.index:
                val = upper.loc[row, col]
                if val >= threshold:
                    high_corr_pairs.append({
                        "feature_1": row,
                        "feature_2": col,
                        "correlation": round(val, 3)
                    })

        return pd.DataFrame(high_corr_pairs).sort_values("correlation", ascending=False)

    high_corr = find_high_correlations(df, threshold=0.80)
    print(high_corr)

Output:

     feature_1   feature_2  correlation
0   experience      salary        0.940
1          age  experience        0.890
2          age      salary        0.860
3  performance       bonus        0.820

age and experience are 0.89 correlated — in ML you'd likely keep only one of them.
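A natural next step is to drop one column from each flagged pair automatically. Here's a minimal sketch of that idea on a small made-up frame (keeping the earlier column, dropping the later one):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":        [25, 30, 35, 40, 45],
    "experience": [3, 8, 13, 18, 23],   # perfectly tied to age (redundant)
    "absences":   [4, 1, 3, 0, 2],
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# a column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] >= 0.85).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)                      # ['experience']
print(df_reduced.columns.tolist())  # ['age', 'absences']
```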


Part 3 — Outlier Detection

What is an Outlier?

An outlier is a data point that is very different from the rest.

Salaries: [45000, 52000, 48000, 61000, 55000, 850000]
                                                ↑
                                          Outlier — probably a data entry error

Outliers can completely ruin ML model performance. They must be detected and handled.


Method 1 — IQR Method (Most Common)

IQR = Interquartile Range = Q3 - Q1

Any value below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is an outlier.


    np.random.seed(42)
    salaries = pd.Series([
        45000, 52000, 48000, 61000, 55000,
        58000, 47000, 63000, 51000, 49000,
        850000, 2000, 62000, 53000, 57000
    ])

    Q1  = salaries.quantile(0.25)
    Q3  = salaries.quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    print(f"Q1            : {Q1:,.0f}")
    print(f"Q3            : {Q3:,.0f}")
    print(f"IQR           : {IQR:,.0f}")
    print(f"Lower bound   : {lower_bound:,.0f}")
    print(f"Upper bound   : {upper_bound:,.0f}")

    outliers = salaries[(salaries < lower_bound) | (salaries > upper_bound)]
    print(f"\nOutliers found: {len(outliers)}")
    print(outliers)

Output:

Q1            : 48,500
Q3            : 59,500
IQR           : 11,000
Lower bound   : 32,000
Upper bound   : 76,000

Outliers found: 2
10    850000
11      2000
dtype: int64

850000 is too high (data entry error?) and 2000 is too low.


Detecting Outliers in DataFrame


    def detect_outliers_iqr(df, column):
        Q1  = df[column].quantile(0.25)
        Q3  = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR

        outlier_mask = (df[column] < lower) | (df[column] > upper)
        return outlier_mask, lower, upper


    np.random.seed(42)
    df = pd.DataFrame({
        "name":       [f"Person_{i}" for i in range(20)],
        "age":        list(np.random.randint(22, 55, 18)) + [150, -5],
        "salary":     list(np.random.randint(30000, 100000, 18)) + [1500000, 500],
        "experience": list(np.random.randint(0, 30, 18)) + [0, 80]
    })

    print("=== Outlier Report ===")
    for col in ["age", "salary", "experience"]:
        mask, lower, upper = detect_outliers_iqr(df, col)
        outlier_rows = df[mask]
        print(f"\n{col}:")
        print(f"  Valid range : {lower:.0f} to {upper:.0f}")
        print(f"  Outliers    : {mask.sum()}")
        if mask.sum() > 0:
            print(f"  Values      : {df.loc[mask, col].tolist()}")

Output:

=== Outlier Report ===

age:
  Valid range : 4 to 73
  Outliers    : 2
  Values      : [150, -5]

salary:
  Valid range : -27250 to 157250
  Outliers    : 2
  Values      : [1500000, 500]

experience:
  Valid range : -22 to 50
  Outliers    : 1
  Values      : [80]

Handling Outliers — 3 Strategies

Strategy 1 — Remove Outliers


    mask_age, lower_age, upper_age = detect_outliers_iqr(df, "age")
    mask_salary, lower_sal, upper_sal = detect_outliers_iqr(df, "salary")

    # Keep only non-outlier rows
    df_clean = df[~mask_age & ~mask_salary]
    print(f"Rows before: {len(df)}, after removing outliers: {len(df_clean)}")

Use when: outliers are clearly data errors. Be careful with small datasets, where losing rows is costly.


Strategy 2 — Cap/Clip Outliers (Winsorization)

Replace outliers with the boundary value instead of removing the row:


    df_capped = df.copy()

    for col in ["age", "salary", "experience"]:
        mask, lower, upper = detect_outliers_iqr(df, col)
        df_capped[col] = df_capped[col].clip(lower=lower, upper=upper)

    print("After capping:")
    print(df_capped[["age", "salary", "experience"]].describe().round(0))

Use when: you want to keep all rows but reduce outlier impact.


Strategy 3 — Replace with Median


    df_median = df.copy()

    for col in ["age", "salary"]:
        mask, lower, upper = detect_outliers_iqr(df, col)
        median_val = df[col].median()
        df_median.loc[mask, col] = median_val
        print(f"Replaced {mask.sum()} outliers in {col} with median {median_val:.0f}")

Use when: you want to keep rows but neutralize outlier values.


Method 2 — Z-Score Method


    from scipy import stats

    np.random.seed(42)
    data = pd.Series(list(np.random.normal(50000, 10000, 97)) + [500000, -5000, 1000000])

    z_scores = np.abs(stats.zscore(data))

    # Z-score > 3 is typically considered an outlier
    outliers = data[z_scores > 3]
    print(f"Outliers found: {len(outliers)}")
    print(outliers)

A z-score measures how many standard deviations a value is from the mean. Anything beyond 3 is considered unusual.

IQR vs Z-Score:

  • IQR — better for skewed data, more robust
  • Z-Score — better for normally distributed data
  • In practice — use IQR first, it works better on most real datasets
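scipy isn't required, since a z-score is just (value − mean) / std. A pandas-only sketch, which also shows a caveat: with only a handful of points, even a huge outlier cannot reach |z| > 3 (the z-score is mathematically capped on tiny samples), one more reason IQR is the safer default on small data:

```python
import pandas as pd

s = pd.Series([50, 52, 49, 51, 48, 500])    # 500 is an obvious outlier

z = (s - s.mean()) / s.std(ddof=0)          # ddof=0 matches scipy.stats.zscore
print(z.round(2).tolist())  # [-0.45, -0.44, -0.45, -0.44, -0.46, 2.24]
```

Even the blatant outlier scores only 2.24 here, so a z > 3 rule would miss it entirely with six data points.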

Part 4 — Normalization and Scaling

Why Scaling?

Consider this data:

age:    25, 30, 35, 40
salary: 30000, 50000, 80000, 120000

Salary values are roughly 1000x bigger than age. Models that rely on distances (KNN, SVM) or gradients (Neural Networks) will effectively treat salary as far more important, purely because of its scale. This is wrong.

Scaling brings all features to the same range so no feature dominates unfairly.
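To see that domination concretely, here's a small sketch (made-up numbers) comparing Euclidean distances before and after scaling both features to [0, 1]:

```python
import numpy as np

# two people: [age, salary] — similar salary, very different age
a = np.array([25, 50000])
b = np.array([55, 51000])

# raw distance: the salary axis dominates completely
raw = np.linalg.norm(a - b)
print(round(raw, 1))  # 1000.4 — the 30-year age gap barely registers

# rescale each feature to [0, 1] (assumed ranges: age 25-55, salary 50000-51000)
a_scaled = np.array([(25 - 25) / 30, (50000 - 50000) / 1000])
b_scaled = np.array([(55 - 25) / 30, (51000 - 50000) / 1000])
print(round(np.linalg.norm(a_scaled - b_scaled), 3))  # 1.414 — both features now contribute
```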


Method 1 — Min-Max Scaling (Normalization)

Scales everything to range [0, 1]:


    from sklearn.preprocessing import MinMaxScaler

    data = pd.DataFrame({
        "age":    [22, 25, 30, 35, 45, 55],
        "salary": [30000, 45000, 62000, 85000, 95000, 120000],
        "experience": [0, 2, 5, 10, 18, 28]
    })

    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(data)
    df_scaled = pd.DataFrame(scaled, columns=data.columns)

    print("Original:")
    print(data)
    print("\nAfter Min-Max Scaling:")
    print(df_scaled.round(3))

Output:

Original:
   age  salary  experience
0   22   30000           0
1   25   45000           2
2   30   62000           5
3   35   85000          10
4   45   95000          18
5   55  120000          28

After Min-Max Scaling:
     age  salary  experience
0  0.000   0.000       0.000
1  0.091   0.167       0.071
2  0.242   0.356       0.179
3  0.394   0.611       0.357
4  0.697   0.722       0.643
5  1.000   1.000       1.000

All values now between 0 and 1. No feature dominates.

Use when: you need values in [0,1] range — neural networks, image data.
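Under the hood the formula is just (x − min) / (max − min). A sketch verifying the age column from the output above by hand:

```python
import pandas as pd

age = pd.Series([22, 25, 30, 35, 45, 55])

# min-max formula: shift to zero, divide by the full range
manual = (age - age.min()) / (age.max() - age.min())
print(manual.round(3).tolist())  # [0.0, 0.091, 0.242, 0.394, 0.697, 1.0]
```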


Method 2 — Standard Scaling (Standardization)

Transforms data to have mean=0 and std=1:


    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaled = scaler.fit_transform(data)
    df_scaled = pd.DataFrame(scaled, columns=data.columns)

    print("After Standard Scaling:")
    print(df_scaled.round(3))
    print("\nMean of each column:", df_scaled.mean().round(3).tolist())
    # ddof=0 gives the population std, which is what StandardScaler normalizes to 1
    print("Std of each column: ", df_scaled.std(ddof=0).round(3).tolist())

Output:

After Standard Scaling:
     age  salary  experience
0 -1.160  -1.403      -1.072
1 -0.899  -0.912      -0.868
2 -0.464  -0.355      -0.562
3 -0.029   0.399      -0.051
4  0.841   0.726       0.766
5  1.710   1.545       1.787

Mean of each column: [0.0, 0.0, 0.0]
Std of each column:  [1.0, 1.0, 1.0]

Every column now has mean=0 and std=1. Negative values are below mean, positive are above.

Use when: most ML algorithms — SVM, Logistic Regression, KNN, PCA.


Method 3 — Robust Scaling

Uses median and IQR instead of mean and std. Not affected by outliers:


    from sklearn.preprocessing import RobustScaler

    # Data with outliers
    data_with_outliers = pd.DataFrame({
        "salary": [30000, 45000, 62000, 85000, 95000, 850000]  # 850000 is outlier
    })

    # Standard scaling gets distorted by outlier
    ss = StandardScaler()
    print("Standard Scaling:")
    print(ss.fit_transform(data_with_outliers).round(2))

    # Robust scaling handles outlier much better
    rs = RobustScaler()
    print("\nRobust Scaling:")
    print(rs.fit_transform(data_with_outliers).round(2))

Output:

Standard Scaling:
[[-0.56]
 [-0.51]
 [-0.45]
 [-0.37]
 [-0.34]
 [ 2.23]]

Robust Scaling:
[[-1.01]
 [-0.66]
 [-0.27]
 [ 0.27]
 [ 0.5 ]
 [17.95]]

Reading the comparison: the single 850000 outlier drags StandardScaler's mean and std so far that the five normal salaries get squashed into a narrow band (-0.56 to -0.34). RobustScaler, built on the median and IQR, keeps the normal values on a sensible scale (about -1 to 0.5) while the outlier stands out clearly at 17.95.

Use when: your data has outliers you cannot remove.
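RobustScaler's formula is (x − median) / IQR. A sketch reproducing the salary column above by hand:

```python
import pandas as pd

salary = pd.Series([30000, 45000, 62000, 85000, 95000, 850000])

q1, q3 = salary.quantile([0.25, 0.75])
manual = (salary - salary.median()) / (q3 - q1)   # center on median, divide by IQR
print(manual.round(2).tolist())  # [-1.01, -0.66, -0.27, 0.27, 0.5, 17.95]
```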


When to Use Which Scaler

MinMaxScaler      → Neural networks, image data
                  → When you need values in [0,1]

StandardScaler    → Most ML algorithms (go-to default)
                  → When data is roughly normally distributed

RobustScaler      → When data has outliers
                  → More stable than StandardScaler with extreme values

Complete ML Preprocessing Pipeline

Now let's put all 4 topics together in one real workflow:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from scipy import stats

# Raw messy data
raw_data = pd.DataFrame({
    "name":       ["Rahul", "Priya", "Gagan", "Amit", "Neha",
                   "Ravi", "Sneha", "Kiran", "Arjun", "Pooja"],
    "age":        [25, 28, 22, 35, 30, 27, 31, 150, 26, 33],  # 150 is outlier
    "city":       ["Delhi", "Mumbai", "Delhi", "Bangalore", "Mumbai",
                   "delhi", "MUMBAI", "Chennai", "Bangalore", "Delhi"],
    "education":  ["Graduate", "PhD", "Graduate", "Postgraduate", "PhD",
                   "Graduate", "Postgraduate", "Graduate", "PhD", "Postgraduate"],
    "experience": [2, 5, 1, 10, 7, 4, 6, 3, 8, 6],
    "salary":     [45000, 85000, 38000, 95000, 78000,
                   52000, 68000, 42000, 92000, 71000],
    "purchased":  ["Yes", "Yes", "No", "Yes", "No",
                   "Yes", "No", "No", "Yes", "Yes"]
})

print("Step 1: Raw Data")
print(raw_data.head())
print(f"Shape: {raw_data.shape}")


# ── Step 2: Basic Cleaning ────────────────────────
print("\nStep 2: Basic Cleaning")

raw_data["name"] = raw_data["name"].str.strip().str.title()
raw_data["city"] = raw_data["city"].str.strip().str.title()

print("Missing values:", raw_data.isnull().sum().sum())
print("Cities:", raw_data["city"].unique())


# ── Step 3: Outlier Detection and Handling ────────
print("\nStep 3: Outlier Handling")

df = raw_data.copy()

for col in ["age", "salary", "experience"]:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outlier_count = ((df[col] < lower) | (df[col] > upper)).sum()
    if outlier_count > 0:
        print(f"  {col}: {outlier_count} outlier(s) found — capping to [{lower:.0f}, {upper:.0f}]")
        df[col] = df[col].clip(lower=lower, upper=upper)


# ── Step 4: Feature Encoding ──────────────────────
print("\nStep 4: Feature Encoding")

# Binary — Label Encoding
le = LabelEncoder()
df["purchased_enc"] = le.fit_transform(df["purchased"])
print(f"  purchased: {dict(zip(le.classes_, le.transform(le.classes_)))}")

# Ordinal — Ordinal Encoding
oe = OrdinalEncoder(categories=[["Graduate", "Postgraduate", "PhD"]])
df["education_enc"] = oe.fit_transform(df[["education"]]).astype(int)
print(f"  education: Graduate=0, Postgraduate=1, PhD=2")

# Nominal — One Hot Encoding
city_dummies = pd.get_dummies(df["city"], prefix="city", drop_first=True)
df = pd.concat([df, city_dummies], axis=1)
print(f"  city: one-hot encoded into {city_dummies.shape[1]} columns")


# ── Step 5: Correlation Analysis ─────────────────
print("\nStep 5: Correlation Analysis")

numeric_df = df.select_dtypes(include=[np.number, "bool"])  # "bool" keeps the one-hot city columns
target_corr = numeric_df.corr()["purchased_enc"].drop("purchased_enc")
target_corr = target_corr.abs().sort_values(ascending=False)
print("Feature correlation with target (purchased):")
for feature, corr in target_corr.items():
    bar = "█" * int(corr * 20)
    print(f"  {feature:<20} {bar} {corr:.3f}")


# ── Step 6: Feature Selection ─────────────────────
print("\nStep 6: Feature Selection")

# Drop features whose correlation with the target is very low (< 0.05)
low_corr_features = target_corr[target_corr < 0.05].index.tolist()
print(f"  Dropping low-correlation features: {low_corr_features}")

drop_cols = ["name", "city", "education", "purchased"] + low_corr_features
df_ml = df.drop(columns=drop_cols, errors="ignore")

print(f"  Features selected: {df_ml.columns.tolist()}")


# ── Step 7: Scaling ───────────────────────────────
print("\nStep 7: Scaling")

# Separate target from features
X = df_ml.drop(columns=["purchased_enc"])
y = df_ml["purchased_enc"]

# Scale features
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X),
    columns=X.columns
)

print("Before scaling — salary stats:")
print(f"  mean={X['salary'].mean():.0f}, std={X['salary'].std(ddof=0):.0f}")
print("After scaling — salary stats:")
# ddof=0 gives the population std, which is what StandardScaler normalizes to 1
print(f"  mean={X_scaled['salary'].mean():.3f}, std={X_scaled['salary'].std(ddof=0):.3f}")


# ── Final ML-Ready Data ───────────────────────────
print("\n=== FINAL ML-READY DATA ===")
print("Features (X):")
print(X_scaled.round(3))
print("\nTarget (y):")
print(y.tolist())

print(f"\nShape: X={X_scaled.shape}, y={y.shape}")
print("Ready for ML model training! ✅")

Output:

Step 1: Raw Data
    name  age       city  education  experience  salary purchased
0  Rahul   25      Delhi   Graduate           2   45000       Yes
...

Step 2: Basic Cleaning
Missing values: 0
Cities: ['Delhi' 'Mumbai' 'Bangalore' 'Chennai']

Step 3: Outlier Handling
  age: 1 outlier(s) found — capping to [17, 42]

Step 4: Feature Encoding
  purchased: {'No': 0, 'Yes': 1}
  education: Graduate=0, Postgraduate=1, PhD=2
  city: one-hot encoded into 3 columns

Step 5: Correlation Analysis
Feature correlation with target (purchased):
  salary               ████████ 0.410
  city_Chennai         ████████ 0.408
  city_Mumbai          ███████ 0.356
  experience           █████ 0.294
  city_Delhi           █████ 0.250
  age                  ████ 0.200
  education_enc        ██ 0.147

Step 6: Feature Selection
  Dropping low-correlation features: []
  Features selected: ['age', 'experience', 'salary', 'purchased_enc',
                      'education_enc', 'city_Chennai', 'city_Delhi', 'city_Mumbai']

Step 7: Scaling
Before scaling — salary stats:
  mean=66600, std=20111
After scaling — salary stats:
  mean=0.000, std=1.000

=== FINAL ML-READY DATA ===
Features (X):
     age  experience  salary  education_enc  city_Chennai  city_Delhi  city_Mumbai
0 -0.902      -1.213  -1.074         -1.083        -0.333       1.225       -0.655
...

Shape: X=(10, 7), y=(10,)
Ready for ML model training! ✅

This complete pipeline — cleaning → outlier handling → encoding → correlation → scaling — is the standard workflow before training almost any ML model.


Summary — ML Preprocessing Cheat Sheet

# Feature Encoding
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
pd.get_dummies(df["col"], prefix="col", drop_first=True)  # one-hot

# Correlation
df.corr()                          # full matrix
df.corr()["target"].abs()          # correlation with target

# Outlier Detection
Q1, Q3 = df["col"].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
df["col"].clip(lower=lower, upper=upper)  # cap outliers

# Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Exercise 🏋️

Use the Titanic dataset from last time:

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

Complete full ML preprocessing:

  1. Encoding — encode Sex, Embarked, Pclass using appropriate methods
  2. Missing values — fill Age with median per class, drop Cabin column
  3. Outlier detection — check Fare and Age for outliers, handle them
  4. Correlation — find which features most correlate with Survived
  5. Feature selection — drop features with correlation below 0.05
  6. Scaling — scale all numeric features with StandardScaler
  7. Final output — print shape of X and y, confirm all values are numeric

After this exercise your Titanic data will be 100% ready to feed into a Machine Learning model — which is exactly what we'll do in the next stage!
