Data Cleaning
What
Fixing data quality issues before modeling. In practice, this is where most of a project's time goes.
Common issues and fixes
Missing values
```python
# How much is missing?
df.isnull().sum()

# Options:
df.dropna()                             # drop rows (if few missing)
df["col"].fillna(df["col"].median())    # fill numeric with median
df["col"].fillna(df["col"].mode()[0])   # fill categorical with mode
df["col"].fillna("unknown")             # explicit unknown category
```

Rule of thumb: if a column is >50% missing, consider dropping it. If a row has many missing values, consider dropping it. Otherwise, impute.
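A concrete sketch of the imputation options above, on toy data (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in a numeric and a categorical column (illustrative data).
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, np.nan],
    "city": ["ny", "la", None, "ny", "la"],
})

# Numeric: the median is robust to outliers.
df["age"] = df["age"].fillna(df["age"].median())     # median of [25, 40, 35] is 35

# Categorical: mode() returns a Series (there can be ties), hence the [0].
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df.isnull().sum().sum())  # 0
```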
Duplicates
```python
df.duplicated().sum()    # count duplicates
df.drop_duplicates()     # remove exact duplicates
```

Wrong types
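A toy example of what the conversions in this section do; note that `errors="coerce"` silently turns unparsable entries into NaN:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-05", "2024-02-10"],
    "price": ["19.99", "n/a"],   # "n/a" is not parseable as a number
})

df["date"] = pd.to_datetime(df["date"])
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # "n/a" becomes NaN

print(df["price"].isna().sum())  # 1 value was coerced to NaN
```

Worth checking `isna()` counts before and after coercion, so silently introduced NaNs don't go unnoticed.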
```python
df["date"] = pd.to_datetime(df["date"])
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # invalid → NaN
df["category"] = df["category"].astype("category")
```

Outliers
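A toy run of the IQR rule from this section (values are made up); the z-score variant assumes roughly normal data, while IQR makes no such assumption:

```python
import pandas as pd

# Seven tightly clustered values and one obvious outlier (illustrative data).
df = pd.DataFrame({"col": [10, 11, 9, 10, 12, 11, 10, 100]})

# Keep only points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
Q1, Q3 = df["col"].quantile([0.25, 0.75])
IQR = Q3 - Q1
mask = (df["col"] >= Q1 - 1.5 * IQR) & (df["col"] <= Q3 + 1.5 * IQR)
df_clean = df[mask]

print(len(df_clean))  # 7: the 100 falls outside the fences
```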
```python
# Z-score method (assumes roughly normal data; handle missing values first,
# since any NaN in the column makes every z-score NaN)
from scipy import stats
z = stats.zscore(df["col"])
df_clean = df[abs(z) < 3]

# IQR method (robust; makes no normality assumption)
Q1, Q3 = df["col"].quantile([0.25, 0.75])
IQR = Q3 - Q1
df_clean = df[(df["col"] >= Q1 - 1.5*IQR) & (df["col"] <= Q3 + 1.5*IQR)]
```

Inconsistent categories
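A toy sketch of normalize-then-map (values illustrative): lowercase and strip first, then collapse known aliases onto one canonical label:

```python
import pandas as pd

s = pd.Series(["  NY ", "New York", "nyc", "new york"])

s = s.str.lower().str.strip()  # "  NY " -> "ny", "New York" -> "new york"
# Map aliases; keys must be lowercase because we already lowercased the values.
s = s.replace({"ny": "new york", "nyc": "new york"})

print(s.nunique())  # 1: all four collapse to "new york"
```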
```python
df["col"] = df["col"].str.lower().str.strip()
# Map aliases after normalizing, so the keys must be lowercase ("NYC" would never match here)
df["col"] = df["col"].replace({"ny": "new york", "nyc": "new york"})
```
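Putting the steps above together, a single cleaning pass might look like this (toy data; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["  NY ", "nyc", "la", "la", "la"],
    "price": ["10", "12", "n/a", "8", "8"],
})

# Normalize categories, coerce types, impute, then deduplicate.
df["city"] = df["city"].str.lower().str.strip()
df["city"] = df["city"].replace({"ny": "new york", "nyc": "new york"})
df["price"] = pd.to_numeric(df["price"], errors="coerce")    # "n/a" -> NaN
df["price"] = df["price"].fillna(df["price"].median())
df = df.drop_duplicates()                                    # the two identical la/8 rows collapse

print(len(df))  # 4
```

Order matters: coerce types before imputing (so the coerced NaNs get filled), and deduplicate after normalizing (so "NY" and "nyc" rows can be recognized as duplicates).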