The AI/ML Mind Map

Everything else is detail. This page is the thinking framework — the patterns that repeat across every model, every algorithm, every technique.


The Three Core Ideas

1. Learning is optimization

Every ML algorithm:
  Define a loss function (how wrong am I?)
  Compute the gradient (which direction reduces wrongness?)
  Update parameters (step in that direction)
  Repeat

Linear regression, neural networks, gradient boosting, RLHF — all are gradient descent on different loss functions with different parameterizations. The math is the same. The architecture changes.
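The four-step loop above can be sketched in a few lines. This is a minimal gradient descent for 1-D linear regression on made-up data (the learning rate and iteration count are arbitrary choices, not tuned values):

```python
import numpy as np

# Hypothetical data: y = 3x + 0.5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    error = (w * x + b) - y
    loss = np.mean(error ** 2)          # 1. loss: how wrong am I?
    grad_w = 2 * np.mean(error * x)     # 2. gradient: which direction reduces wrongness?
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w                    # 3. update: step in that direction
    b -= lr * grad_b                    # 4. repeat

print(round(w, 2), round(b, 2))         # recovers roughly w=3, b=0.5
```

Swap the loss, the parameterization, or the way the gradient is computed, and this same loop becomes any of the models above.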

When understanding any model, ask: what is the loss function, and what is being optimized?

2. All models trade bias against variance

Simple model (few parameters): high bias, low variance → underfits
  Misses real patterns. Predictions are wrong but stable.

Complex model (many parameters): low bias, high variance → overfits
  Memorizes noise. Predictions are accurate on training data, wild on new data.

The art: find the complexity that captures the signal without fitting the noise.

Regularization, dropout, early stopping, cross-validation, ensemble methods — all are techniques for navigating this tradeoff. They look different but solve the same problem.
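The tradeoff is easy to see with polynomial fits of increasing degree on noisy samples of a smooth curve (hypothetical data; the degrees 1, 4, 9 are illustrative choices):

```python
import numpy as np

# Noisy samples of sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(0, 1, 20))
x_test = np.sort(rng.uniform(0, 1, 200))
f = lambda x: np.sin(2 * np.pi * x)
y_train = f(x_train) + rng.normal(0, 0.2, 20)
y_test = f(x_test) + rng.normal(0, 0.2, 200)

results = {}
for degree in (1, 4, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(degree, round(train_mse, 3), round(test_mse, 3))
# Degree 1 underfits: both errors high, the line misses the curve.
# Degree 9 drives TRAIN error down by fitting noise; test error suffers.
# Degree 4 sits near the sweet spot between the two.
```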

When a model fails, ask: is it underfitting (need more capacity) or overfitting (need more constraint)?

3. Representation is everything

Raw data → useful representation → simple model works

Bad representation: try to classify images from raw pixel values with logistic regression → fails
Good representation: extract features (edges, textures, shapes) → logistic regression works

The revolution: deep learning LEARNS the representation
  Raw pixels → conv layers learn edges → deeper layers learn shapes → final layers learn objects
  Raw text → embedding layers learn word meaning → attention learns relationships

Feature engineering (classical ML) and architecture design (deep learning) are both about finding the right representation. PCA, embeddings, attention, convolution — all are representation transformers.

When building any model, ask: does the model see the data in a form where the pattern is obvious?
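A tiny concrete case of "make the pattern obvious": two concentric rings of points (hypothetical data) are not linearly separable in raw (x, y), but one engineered feature, the radius, separates them with a single threshold:

```python
import numpy as np

# Two rings: inner radius ~1, outer radius ~3, plus noise.
rng = np.random.default_rng(2)
n = 200
angles = rng.uniform(0, 2 * np.pi, n)
radii = np.where(np.arange(n) < n // 2, 1.0, 3.0) + rng.normal(0, 0.1, n)
x, y = radii * np.cos(angles), radii * np.sin(angles)
labels = (np.arange(n) >= n // 2).astype(int)   # 0 = inner ring, 1 = outer

# Raw representation: no single threshold on x or y separates the rings.
# Engineered representation: the radius does it perfectly.
radius_feature = np.sqrt(x ** 2 + y ** 2)
pred = (radius_feature > 2.0).astype(int)
accuracy = np.mean(pred == labels)
print(accuracy)   # ~1.0: the "hard" problem became trivial
```

Deep learning automates exactly this step: the conv layers and embeddings learn the `radius_feature` equivalents for you.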


The Universal ML Pipeline

Every ML project — from Kaggle competition to production system — follows this structure:

UNDERSTAND → PREPARE → MODEL → EVALUATE → DEPLOY → MONITOR

1. UNDERSTAND the problem
   - What are you predicting? Why?
   - What data exists? What's missing?
   - What would a human expert do?
   - What's the baseline? (always start with the simplest possible approach)

2. PREPARE the data
   - Explore: distributions, correlations, anomalies
   - Clean: missing values, outliers, inconsistencies
   - Engineer features: create useful inputs from raw data
   - Split: train/validation/test (NEVER leak between them)

3. MODEL
   - Start simple (linear model, decision tree) → establish baseline
   - Increase complexity only if needed (random forest → gradient boosting → neural net)
   - Tune hyperparameters (but only after the approach is right)

4. EVALUATE honestly
   - Right metric for the problem (accuracy is usually wrong for imbalanced data)
   - Cross-validation (not a single split)
   - Test set touched ONCE at the very end
   - Sanity check: does the model make sense? Feature importances reasonable?

5. DEPLOY (if applicable)
   - Model serving (API, batch, edge)
   - Monitoring for drift

6. MONITOR
   - Data drift: is the input distribution changing?
   - Concept drift: is the relationship between inputs and outputs changing?
   - Performance degradation: retrain when metrics drop
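Steps 2-4 hinge on a leak-free split and on tuning against validation only. A minimal sketch on synthetic data (the ridge model, the alpha grid, and the 70/15/15 split are illustrative choices, not recommendations):

```python
import numpy as np

# Hypothetical regression data: y = Xw + noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 1000)

idx = rng.permutation(1000)
train, val, test = idx[:700], idx[700:850], idx[850:]   # shuffle once, slice

def fit_ridge(X, y, alpha):
    # Closed-form ridge regression: (X'X + alpha*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

# Hyperparameter chosen on the VALIDATION slice only.
best_alpha = min((0.01, 0.1, 1.0, 10.0),
                 key=lambda a: mse(fit_ridge(X[train], y[train], a), X[val], y[val]))

# The test slice is touched ONCE, at the very end.
w = fit_ridge(X[train], y[train], best_alpha)
final_mse = mse(w, X[test], y[test])
print(best_alpha, round(final_mse, 4))
```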

The Five Types of ML Problems

Every ML task maps to one of these. Recognize the type and you know which techniques apply.

1. Classification (predict a category)

Input: features → Output: class label
Binary: spam/not-spam, fraud/legitimate
Multi-class: digit 0-9, animal species
Multi-label: image tags (can have multiple)

Loss: cross-entropy
Metrics: accuracy, precision, recall, F1, AUC-ROC
Models: logistic regression → random forest → gradient boosting → neural net
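Why accuracy misleads on imbalanced classes, with precision/recall/F1 computed from raw confusion counts (the counts are made up for illustration):

```python
# 100 examples, 90% negative. The classifier looks great by accuracy
# while handling the minority class badly.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 85 + [1] * 5 + [0] * 4 + [1] * 6   # 5 false pos, 4 false neg

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)   # of predicted positives, how many are real?
recall = tp / (tp + fn)      # of real positives, how many were found?
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, round(precision, 3), recall, round(f1, 3))
# accuracy 0.91 looks fine; F1 0.571 tells the real story.
```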

2. Regression (predict a number)

Input: features → Output: continuous value
House price, temperature, stock price, age

Loss: MSE, MAE, Huber
Metrics: RMSE, MAE, R²
Models: linear regression → random forest → gradient boosting → neural net
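The three metrics side by side on a handful of hypothetical predictions; note that R² is measured against the "predict the mean" baseline (1.0 perfect, 0.0 no better than the mean, negative worse):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0, 4.0])
y_pred = np.array([2.5, 5.5, 2.0, 8.0, 3.5])   # made-up model outputs

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))   # penalizes big misses
mae = np.mean(np.abs(y_pred - y_true))            # robust, in original units
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(round(rmse, 3), round(mae, 3), round(r2, 3))   # → 0.592 0.5 0.882
```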

3. Sequence modeling (predict next in sequence)

Input: sequence → Output: next element or transformed sequence
Language modeling, translation, speech recognition, time series

Loss: cross-entropy per token (discrete sequences), MSE (numeric time series)
Models: RNN/LSTM (legacy) → Transformer (current standard)
Key: attention mechanism
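The attention mechanism itself is small. A single-head scaled dot-product attention sketched in NumPy (the sequence length 3 and dimension 8 are arbitrary illustrative shapes):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # each query vs each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted mix of values

rng = np.random.default_rng(4)
Q = rng.normal(size=(3, 8))   # 3 tokens, dim 8 (hypothetical)
K = rng.normal(size=(3, 8))
V = rng.normal(size=(3, 8))
out = attention(Q, K, V)
print(out.shape)   # (3, 8): one contextualized vector per token
```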

4. Representation learning (learn a useful embedding)

Input: raw data → Output: dense vector in meaningful space
Word embeddings, image features, speaker embeddings, sentence encodings

Loss: contrastive (similar things close, different things far)
Models: Word2Vec, BERT, CLIP, autoencoders, SimCLR
Application: similarity search, transfer learning, clustering

5. Generation (create new data)

Input: noise / prompt / condition → Output: new data (image, text, audio)
Text generation, image synthesis, music, voice cloning

Loss: varies (adversarial, diffusion, autoregressive likelihood)
Models: GPT (autoregressive), Diffusion (denoising), GAN (adversarial)
Key challenge: quality + diversity + controllability
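The autoregressive recipe behind GPT-style generation, shrunk to a toy: sample the next token from P(next | current), append, repeat. The character bigram probabilities below are invented purely for illustration:

```python
import random

random.seed(0)
# Made-up conditional distributions P(next char | current char).
bigram = {
    "a": {"b": 0.9, "a": 0.1},
    "b": {"a": 0.5, "c": 0.5},
    "c": {"a": 1.0},
}

def generate(start, length):
    seq = [start]
    for _ in range(length):
        dist = bigram[seq[-1]]                 # condition on the last token
        tokens, probs = zip(*dist.items())
        seq.append(random.choices(tokens, weights=probs)[0])  # sample
    return "".join(seq)

sample = generate("a", 10)
print(sample)
```

A real language model replaces the lookup table with a neural net over the whole prefix, but the sampling loop is the same.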

The Recurring Patterns

Pattern: Compression is understanding

A model that predicts well has learned to compress the data.
Compression = discarding irrelevant information, keeping structure.
PCA compresses by finding principal directions.
Autoencoders compress through a bottleneck.
Language models compress by predicting the next token.
Neural nets compress by learning hierarchical features.

If you can compress it, you understand it.
If you can predict it, you've captured its structure.

Pattern: The unreasonable effectiveness of simple baselines

Before building a complex model:
  - Classification: what does "predict the majority class" give you?
  - Regression: what does "predict the mean" give you?
  - NLP: what does TF-IDF + logistic regression give you?
  - Vision: what does a pretrained ResNet give you?

Often: 80% of the final performance with 5% of the complexity.
The gap between baseline and SOTA is where you decide if complexity is worth it.
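The first two baselines are one-liners, which is exactly the point (labels and targets below are hypothetical):

```python
from collections import Counter

# Classification baseline: predict the majority class.
labels = ["spam"] * 20 + ["ham"] * 80
majority = Counter(labels).most_common(1)[0][0]
baseline_acc = labels.count(majority) / len(labels)
print(majority, baseline_acc)   # ham 0.8 -- a classifier at 81% barely helps

# Regression baseline: predict the mean.
targets = [3.0, 5.0, 2.0, 7.0, 4.0]
mean_pred = sum(targets) / len(targets)
baseline_mae = sum(abs(t - mean_pred) for t in targets) / len(targets)
print(mean_pred, baseline_mae)  # any model must beat MAE 1.44 to matter
```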

Pattern: Regularization is everywhere (just in different disguises)

L1/L2 penalties on weights = explicit regularization
Dropout = implicit regularization (ensemble of subnetworks)
Early stopping = regularization by limiting training
Data augmentation = regularization by expanding apparent data
Batch normalization = regularization by adding noise
Smaller model = regularization by limiting capacity
Ensemble methods = regularization by averaging

All do the same thing: prevent the model from fitting noise.
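The explicit case is the easiest to see in code: an L2 (ridge) penalty adds alpha * ||w||² to the loss, which shrinks the weights and tames variance. A sketch on synthetic data with mostly-irrelevant features (alpha = 10 is an arbitrary illustrative value):

```python
import numpy as np

# Hypothetical data: 10 features, only the first 2 carry signal.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -1.0]
y = X @ true_w + rng.normal(0, 0.5, 30)

def ridge(X, y, alpha):
    # Minimizes ||Xw - y||^2 + alpha * ||w||^2 in closed form.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

w_free = ridge(X, y, alpha=0.0)    # unpenalized least squares
w_reg = ridge(X, y, alpha=10.0)    # penalized
print(round(np.linalg.norm(w_free), 3), round(np.linalg.norm(w_reg), 3))
# The penalized weights have a strictly smaller norm: capacity is constrained,
# so there is less room to fit noise.
```

Dropout, early stopping, and the rest achieve the same shrinkage effect implicitly rather than through an explicit penalty term.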

Pattern: More data beats better algorithms

A simple model on lots of data usually beats
a complex model on little data.

Scaling laws (Kaplan et al., 2020): test loss falls smoothly as a
power law in data, compute, and parameter count.


This is why foundation models (trained on internet-scale data)
are so powerful — and why fine-tuning beats training from scratch.

Pattern: The feature importance hierarchy

In tabular data:    feature engineering > model choice > hyperparameter tuning
In NLP:             pretraining data > architecture > fine-tuning > prompting
In vision:          data quality > augmentation > architecture > training tricks
In all domains:     data quality > everything else

Pattern: Everything is a vector

Words → vectors (embeddings)
Images → vectors (CNN features, CLIP embeddings)
Audio → vectors (speaker embeddings, MFCCs)
Users → vectors (collaborative filtering)
Graphs → vectors (node embeddings)

Once everything is a vector, the same math works everywhere:
  cosine similarity, nearest neighbors, clustering, classification
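Once items live in the same vector space, nearest-neighbor search is one dot product away. The 4-dim "embeddings" below are made up for illustration:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product of the normalized vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

items = {
    "cat": np.array([0.9, 0.1, 0.0, 0.2]),
    "dog": np.array([0.8, 0.2, 0.1, 0.3]),
    "car": np.array([0.0, 0.9, 0.8, 0.1]),
}
query = np.array([0.88, 0.1, 0.02, 0.2])   # hypothetical query embedding

nearest = max(items, key=lambda k: cosine(query, items[k]))
print(nearest)   # the same lookup works for words, images, users, graph nodes
```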

The Meta-Questions

When studying ANY ML topic, always ask:

  1. What is the loss function? (what is being optimized)
  2. What is the inductive bias? (what assumptions does the architecture encode)
  3. What would the simplest baseline be? (before getting fancy)
  4. Where could data leak? (train/test contamination)
  5. What representation does the model learn? (look at embeddings/features)
  6. What fails at scale? (data size, latency, cost)
  7. What does the model NOT capture? (limitations, failure modes)

Map to the Vault