Data Fundamentals Roadmap

Garbage in, garbage out. Most ML work is data work. A clean, well-understood dataset beats a fancier model every time.

Topics

  1. Loading and Inspecting Data — CSV, JSON, Parquet, first look
  2. Data Cleaning — missing values, duplicates, outliers, types
  3. Exploratory Data Analysis — distributions, correlations, patterns
  4. Feature Engineering — creating useful inputs from raw data
  5. Feature Scaling — normalization, standardization, when and why
  6. Train-Test Split — why you must separate data, how to avoid leakage
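Topic 1 in practice is usually a few lines of pandas. A minimal sketch of that first look (assuming pandas; the CSV text and column names are made up, and inlined so the example is self-contained — in real use you would pass a file path to pd.read_csv):

```python
import io
import pandas as pd

# Toy CSV inlined for a self-contained example
csv_text = """age,income,signed_up
34,52000,yes
29,,no
41,78000,yes
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # rows x columns
print(df.dtypes)        # check types were inferred correctly
print(df.isna().sum())  # missing values per column — feeds into cleaning
print(df.head())        # eyeball the first rows
```

Note that pandas silently infers types: the missing income turns that column into floats, which is exactly the kind of thing the cleaning step (topic 2) has to catch.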

The workflow

raw data → inspect → clean → explore → engineer features → scale → split → model

Every step feeds into the next. Shortcuts here create bugs that are invisible until production.

This isn’t a one-pass pipeline. You’ll loop: explore the data, realize a feature needs different cleaning, go back, re-explore. The diagram is linear but the real process is iterative. Expect to revisit earlier stages after you see model results.
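The linear path can be sketched end to end in a few lines, assuming scikit-learn; the toy data is random and stands in for the cleaned, engineered features of a real project. The key structural point is that the split happens before any fitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # stand-in for cleaned features
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy target

# Split first, so everything fit afterwards sees only training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scaling lives inside the pipeline, so it is fit on X_train only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))         # held-out accuracy
```

Putting the scaler inside the pipeline is what makes the later iteration loops safe: refitting the model automatically refits the transform on training data only.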

Common pitfalls

These are subtle bugs that produce models that look great in training but fail in production.

  Data leakage
    What happens: information from the test set leaks into training (e.g., fitting a scaler on the full dataset before splitting).
    How to avoid: always split first, then fit transforms on the training set only.

  Target leakage
    What happens: a feature encodes the target (e.g., “treatment outcome” as a feature when predicting “will the patient be treated”).
    How to avoid: audit features for a causal relationship with the target.

  Look-ahead bias
    What happens: future data is used to predict the past (in time series, training on Tuesday to predict Monday).
    How to avoid: respect temporal order; use time-based splits.

  Class imbalance
    What happens: with 99% negative and 1% positive examples, a model that always predicts negative scores 99% accuracy.
    How to avoid: use stratified splits, appropriate metrics (F1, PR-AUC), resampling, or class weights.
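The first pitfall is worth seeing concretely. A sketch assuming scikit-learn, with a toy feature column; the only difference between the leaky and the correct version is what the scaler is fit on:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy feature
y = (X.ravel() > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# WRONG: statistics computed from the full dataset include test rows
leaky_scaler = StandardScaler().fit(X)

# RIGHT: fit on the training split only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(leaky_scaler.mean_, scaler.mean_)  # the learned statistics differ
```

The leaky version often looks harmless because the numbers are close, but the test set has influenced training, and validation metrics will be optimistic.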

The worst part: all of these can produce excellent validation metrics while being completely useless. Always sanity-check your results — if they look too good, something is probably leaking.
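One cheap sanity check is to compare against a trivial baseline. A sketch assuming scikit-learn, on a synthetic 1%-positive dataset like the class-imbalance pitfall above; if your real model barely beats the dummy, or exceeds it by an implausible margin, investigate:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)
y_true = (rng.random(1000) < 0.01).astype(int)  # ~1% positive class
X = np.zeros((1000, 1))                         # features are irrelevant here

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y_true)
pred = baseline.predict(X)

print((pred == y_true).mean())                  # accuracy near 0.99
print(f1_score(y_true, pred, zero_division=0))  # F1 is 0.0 — it finds nothing
```

High accuracy next to an F1 of zero is the "looks great, completely useless" failure mode in miniature, and it is why imbalanced problems need metrics beyond accuracy.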