Data Cleaning Checklist — Interview & Industry Ready

1. Understand the Data First

What to do: Before touching anything, explore your dataset using df.head(), df.info(), and df.describe().

Why: You cannot clean what you don't understand. Jumping straight into cleaning without exploration leads to wrong assumptions — you might fill missing values incorrectly or drop columns that were actually useful.

Practical Example: You load a customer dataset and run df.info(). You notice a column called "age" has dtype "object" instead of "int". That tells you something is wrong — maybe there are string values like "N/A" or "unknown" hiding in there. Without this step, you'd never catch it.
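A minimal sketch of that first look, using a small hypothetical customer table (the column values here are invented for illustration):

```python
import pandas as pd

# Hypothetical customer data where "age" hides string placeholders
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": ["34", "N/A", "29", "unknown"],
})

# df.info() would list "age" with dtype object, not int -- a red flag
age_dtype = df["age"].dtype
print(age_dtype)  # object
```

Seeing `object` where you expected a numeric dtype is the cue to dig for hidden string values before any cleaning.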


2. Find Missing Values

What to do: Run df.isnull().sum() to count missing values per column. Also calculate the percentage missing: df.isnull().sum() / len(df) * 100.

Why: Not all missing data is equal. A column with 2% missing is handled very differently from one with 60% missing. If a column is mostly empty, dropping it might be smarter than filling it.

Practical Example: In an HR dataset, "performance_score" has 65% missing values. Rather than filling all those blanks with the mean (which would fabricate data), you decide to drop the column entirely because it has too little real data to be useful.
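The HR scenario above can be sketched like this, with invented data and a hypothetical 50% drop threshold:

```python
import pandas as pd
import numpy as np

# Hypothetical HR data with heavy missingness in one column
df = pd.DataFrame({
    "employee_id": range(1, 11),
    "performance_score": [3.5, np.nan, np.nan, 4.0, np.nan,
                          np.nan, np.nan, 2.8, np.nan, np.nan],
})

# Percentage of missing values per column
missing_pct = df.isnull().sum() / len(df) * 100

# performance_score is 70% missing -> drop rather than fabricate values
if missing_pct["performance_score"] > 50:
    df = df.drop(columns=["performance_score"])
```

The threshold is a judgment call, not a rule; the point is that the percentage, not just the raw count, drives the decision.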


3. Handle Missing Values

What to do: Choose the right fill strategy based on the column type and data distribution.

Why: Wrong imputation introduces bias. Using mean on skewed salary data pulls values toward outliers. Using a placeholder like "Unknown" for categories keeps the data honest without losing rows.

Practical Example:

  • Age column (roughly normal distribution) → fill with mean
  • Salary column (right-skewed, some people earn 10x others) → fill with median
  • City column (categorical) → fill with "Unknown"
  • Special case: salary has 0 values that actually mean "not recorded" → replace 0 with NaN first, then fill
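All four cases above can be sketched in one pass, using invented values (note the 0-to-NaN replacement happens before any filling):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, 30, np.nan, 35],
    "salary": [30000, 40000, 0, np.nan],  # 0 here means "not recorded"
    "city": ["Mumbai", np.nan, "Delhi", "Pune"],
})

# Special case first: treat placeholder 0 in salary as missing
df["salary"] = df["salary"].replace(0, np.nan)

df["age"] = df["age"].fillna(df["age"].mean())            # ~normal -> mean
df["salary"] = df["salary"].fillna(df["salary"].median()) # skewed -> median
df["city"] = df["city"].fillna("Unknown")                 # categorical -> label
```

Had the zeros stayed in, they would have dragged the median down before imputation, which is exactly the trap the special case describes.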

4. Handle Fake Missing Values

What to do: Identify values like 0, -1, 999, "NA", "none", "null" that were used as placeholders instead of actual NaN. Replace them with np.nan before doing any filling.

Why: If someone recorded missing salary as 0, and you calculate the mean salary including all those zeros, your mean will be completely wrong. Your model will also learn that salary = 0 is a valid data point.

Practical Example: A medical dataset uses -1 to represent "test not conducted" for blood pressure readings. If you don't replace -1 with NaN first, your average blood pressure calculation will give a meaningless negative number.
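A sketch of the blood-pressure example with invented readings, showing how the placeholder distorts the mean:

```python
import pandas as pd
import numpy as np

# Hypothetical readings where -1 means "test not conducted"
df = pd.DataFrame({"blood_pressure": [120, -1, 135, -1, 110]})

naive_mean = df["blood_pressure"].mean()  # dragged down by the -1 placeholders

# Replace the fake missing value with a real NaN, then recompute
df["blood_pressure"] = df["blood_pressure"].replace(-1, np.nan)
true_mean = df["blood_pressure"].mean()   # computed from real readings only
```

The naive mean here comes out around 73 — a medically meaningless number — while the corrected mean uses only the three genuine readings.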


5. Remove Duplicates

What to do: Run df.drop_duplicates() to remove fully duplicate rows. For partial duplicates (same person, different entries), decide which row to keep based on business logic.

Why: Duplicate rows inflate your dataset, skew statistics, and cause models to overfit on repeated data. A machine learning model trained on duplicates will perform worse on real unseen data.

Practical Example: An e-commerce order dataset has the same order recorded twice due to a system glitch. If you train a sales forecasting model on this, it will think that customer ordered double what they actually did, inflating predictions.
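The glitched-order scenario, sketched with invented rows:

```python
import pandas as pd

# Hypothetical orders where a glitch recorded order 101 twice
df = pd.DataFrame({
    "order_id": [101, 101, 102],
    "amount": [500, 500, 750],
})

# Remove fully duplicate rows
df = df.drop_duplicates()

# For partial duplicates (same key, different entries), business logic
# picks the survivor, e.g. keep the latest entry per order:
# df = df.drop_duplicates(subset="order_id", keep="last")
```

Before deduplication the total revenue reads 1,750; after, the correct 1,250 — exactly the inflation the forecasting example warns about.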


6. Fix Data Types

What to do: Convert columns to their correct types — integers, floats, datetime, category — using astype() and pd.to_datetime().

Why: Wrong data types cause silent bugs. Calculations on a numeric column stored as string will either fail or give wrong results. Date columns stored as strings can't be sorted, filtered by range, or used for time-based features.

Practical Example: A "date_of_joining" column stored as a string means you can't calculate how many years an employee has worked. Once you convert it with pd.to_datetime(), you can do df["years_at_company"] = (today - df["date_of_joining"]).dt.days / 365 instantly.
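A runnable sketch of that conversion, pinning "today" to a fixed date so the arithmetic is reproducible (the dates are invented):

```python
import pandas as pd

df = pd.DataFrame({"date_of_joining": ["2019-06-01", "2021-01-15"]})

# String -> datetime unlocks date arithmetic
df["date_of_joining"] = pd.to_datetime(df["date_of_joining"])

today = pd.Timestamp("2024-06-01")  # fixed reference date for reproducibility
df["years_at_company"] = (today - df["date_of_joining"]).dt.days / 365
```

With strings, the subtraction would raise an error; after pd.to_datetime, tenure falls out in one line.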


7. Handle Outliers

What to do: Use df.describe() to spot extreme min/max values, and visualize distributions with boxplots. Remove or cap values that are statistically impossible or clearly data entry errors.

Why: A single outlier can distort the mean dramatically. In ML models, outliers can pull regression lines and skew predictions. However, not all outliers should be removed — some are real and meaningful.

Practical Example: In a salary dataset, one entry shows salary = 99999999. That's clearly a data entry error (someone held "9" too long). Keeping it would make the average salary look like crores when it's actually lakhs. You filter it out with df = df[df["salary"] < 500000].
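The fat-finger salary example, sketched with invented values and the same domain-knowledge cutoff:

```python
import pandas as pd

# Hypothetical salaries with one data-entry error
df = pd.DataFrame({"salary": [35000, 42000, 99999999, 38000]})

mean_before = df["salary"].mean()   # wildly inflated by a single row

# Filter using a cutoff justified by domain knowledge, not blindly
df = df[df["salary"] < 500000]
mean_after = df["salary"].mean()    # back to a plausible figure
```

One row pushes the mean above 25 million before filtering; after, it sits near 38,000. The cutoff itself should always be defensible from domain context, since some extreme values are real.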


8. Clean Text Data

What to do: Strip extra spaces with str.strip(), standardize case with str.lower() or str.upper(), fix inconsistent spellings or categories.

Why: "Mumbai", " mumbai", "MUMBAI" are the same city but Python treats them as three different values. This breaks groupby operations, value counts, and any category-based analysis.

Practical Example: You're grouping sales by city. Without cleaning, "Delhi", " delhi", "DELHI", "delhi " appear as four separate groups. After str.strip().str.lower(), they all correctly merge into one "delhi" group.
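That four-groups-into-one merge, as a sketch:

```python
import pandas as pd

# Four spellings of the same city
df = pd.DataFrame({"city": ["Delhi", " delhi", "DELHI", "delhi "]})

# Strip whitespace, then normalize case
df["city"] = df["city"].str.strip().str.lower()

n_groups = df["city"].nunique()  # 1 instead of 4
```

Any groupby or value_counts on the cleaned column now treats all four entries as one category.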


9. Feature Engineering (Optional but Powerful)

What to do: Create new meaningful columns from existing ones — ratios, date parts, bins, flags.

Why: Raw data isn't always the most informative form. Derived features often carry more signal for models and make analysis more intuitive.

Practical Example: Instead of raw salary in rupees, create salary_in_lakhs = salary / 100000 for easier readability. Or extract join_year and join_month from a date column to analyze hiring trends by time period.
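Both derived features, sketched with invented rows:

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [450000, 1200000],
    "date_of_joining": pd.to_datetime(["2020-03-10", "2022-11-05"]),
})

# Ratio feature: rupees -> lakhs for readability
df["salary_in_lakhs"] = df["salary"] / 100000

# Date parts: enable hiring-trend analysis by year/month
df["join_year"] = df["date_of_joining"].dt.year
df["join_month"] = df["date_of_joining"].dt.month
```

The raw columns stay untouched; the derived ones simply expose the same information in a form that models and readers digest more easily.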


10. Final Validation

What to do: After all cleaning steps, run a final check: df.isnull().sum() should show zero (or an acceptable number of) missing values, df.info() should show the correct types, and df.shape should match the expected row count.

Why: Cleaning steps can sometimes introduce new issues — for example, converting a column type might fail silently and produce NaN for some rows. Always verify the end state matches your expectations.

Practical Example: After all your cleaning, df.isnull().sum() shows "age" still has 3 missing values. You investigate and realize those rows originally stored age as the string "unknown" — your mean-fill skipped them because they weren't NaN at the time, and a later type conversion coerced them into genuine NaN. Validation caught a bug that would have broken your model training.
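A sketch of catching and fixing that exact bug, with invented data: coerce the strings to NaN first, then fill, then gate on a final null check.

```python
import pandas as pd

# Hypothetical data where "unknown" would hide from a naive mean-fill
df = pd.DataFrame({"age": ["25", "unknown", "31"]})

# errors="coerce" turns non-numeric strings into NaN
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["age"] = df["age"].fillna(df["age"].mean())

# Final validation gate: no missing values anywhere
remaining_nulls = int(df.isnull().sum().sum())
assert remaining_nulls == 0
```

Had the fillna run before the coercion, the "unknown" rows would have sailed through as strings and surfaced as NaN only later — which is precisely what the final validation exists to catch.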


One-Liner Interview Summary

Understand the data → Identify and handle missing values (including fake ones) → Remove duplicates → Fix data types → Handle outliers → Clean text → Engineer features if needed → Validate everything at the end.
