Data Fundamentals Roadmap

Garbage in, garbage out. Most ML work is data work. A clean, well-understood dataset beats a fancier model every time.

Topics

  1. Loading and Inspecting Data — CSV, JSON, Parquet, first look
  2. Data Cleaning — missing values, duplicates, outliers, types
  3. Exploratory Data Analysis — distributions, correlations, patterns
  4. Feature Engineering — creating useful inputs from raw data
  5. Feature Scaling — normalization, standardization, when and why
  6. Train-Test Split — why you must separate data, how to avoid leakage
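Topic 1 in practice is usually a few lines of pandas. A minimal sketch of that first look (assuming pandas; the CSV text and column names are made up, and inlined so the example is self-contained — in real use you would pass a file path to pd.read_csv):

```python
import io
import pandas as pd

# Toy CSV inlined for a self-contained example
csv_text = """age,income,signed_up
34,52000,yes
29,,no
41,78000,yes
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # rows x columns
print(df.dtypes)        # check types were inferred correctly
print(df.isna().sum())  # missing values per column — feeds into cleaning
print(df.head())        # eyeball the first rows
```

Note that pandas silently infers types: the missing income turns that column into floats, which is exactly the kind of thing the cleaning step (topic 2) has to catch.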

The workflow

raw data → inspect → clean → explore → engineer features → scale → split → model

Every step feeds into the next. Shortcuts here create bugs that are invisible until production.

This isn’t a one-pass pipeline. You’ll loop: explore the data, realize a feature needs different cleaning, go back, re-explore. The diagram is linear but the real process is iterative. Expect to revisit earlier stages after you see model results.
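The linear path can be sketched end to end in a few lines, assuming scikit-learn; the toy data is random and stands in for the cleaned, engineered features of a real project. The key structural point is that the split happens before any fitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # stand-in for cleaned features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy target

# Split first, so everything fit afterwards sees only training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scaling lives inside the pipeline, so it is fit on X_train only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))         # held-out accuracy
```

Putting the scaler inside the pipeline is what makes the later iteration loops safe: refitting the model automatically refits the transform on training data only.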

Common pitfalls

These are subtle bugs that produce models that look great in training but fail in production.

  Data leakage
    What happens: information from the test set leaks into training (e.g., fitting a scaler on the full dataset before splitting).
    How to avoid: always split first, then fit transforms on the training set only.

  Target leakage
    What happens: a feature encodes the target (e.g., “treatment outcome” as a feature when predicting “will the patient be treated”).
    How to avoid: audit features for a causal relationship with the target.

  Look-ahead bias
    What happens: future data is used to predict the past (in time series, training on Tuesday to predict Monday).
    How to avoid: respect temporal order; use time-based splits.

  Class imbalance
    What happens: with 99% negative and 1% positive examples, a model that always predicts negative scores 99% accuracy.
    How to avoid: use stratified splits, appropriate metrics (F1, PR-AUC), resampling, or class weights.
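The first pitfall is worth seeing concretely. A sketch assuming scikit-learn, with a toy feature column; the only difference between the leaky and the correct version is what the scaler is fit on:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy feature
y = (X.ravel() > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# WRONG: statistics computed from the full dataset include test rows
leaky_scaler = StandardScaler().fit(X)

# RIGHT: fit on the training split only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(leaky_scaler.mean_, scaler.mean_)  # the learned statistics differ
```

The leaky version often looks harmless because the numbers are close, but the test set has influenced training, and validation metrics will be optimistic.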

The worst part: all of these can produce excellent validation metrics while being completely useless. Always sanity-check your results — if they look too good, something is probably leaking.
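One cheap sanity check is to compare against a trivial baseline. A sketch assuming scikit-learn, on a synthetic 1%-positive dataset like the class-imbalance pitfall above; if your real model barely beats the dummy, or exceeds it by an implausible margin, investigate:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.01).astype(int)  # ~1% positive class
X = np.zeros((1000, 1))                         # features are irrelevant here

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y_true)
pred = baseline.predict(X)

print((pred == y_true).mean())                  # accuracy near 0.99
print(f1_score(y_true, pred, zero_division=0))  # F1 is 0.0 — it finds nothing
```

High accuracy next to an F1 of zero is the "looks great, completely useless" failure mode in miniature, and it is why imbalanced problems need metrics beyond accuracy.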