Glossary
Quick reference for terms you’ll encounter constantly.
A
- Activation function: nonlinear function applied after a neuron’s linear operation → Neurons and Activation Functions
- Adam: adaptive-learning-rate optimizer, a common default for deep learning → Optimizers
- Attention: mechanism for models to focus on relevant parts of input → Attention Mechanism
- Autoencoder: net that compresses and reconstructs input → Autoencoders
- AUC: area under the ROC curve, classification metric → Evaluation Metrics
B
- Backpropagation: computing gradients via chain rule → Backpropagation
- Batch normalization: normalize activations within a batch → Batch Normalization
- Bias (model): error from oversimplified assumptions → Bias-Variance Tradeoff
- Bias (parameter): the b in wx + b, an offset term
- BPE: byte-pair encoding, subword tokenization → Tokenization
C
- CNN: convolutional neural network → Convolutional Neural Networks
- Cosine similarity: measures angle between vectors → Cosine Similarity
- Cross-entropy: classification loss function → Cross-Entropy and KL Divergence
- Cross-validation: evaluate model on multiple data splits → Cross-Validation
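The cosine similarity entry above, as a minimal pure-Python sketch (the function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```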
D
- Diffusion model: generates data by reversing noise process → Diffusion Models
- DPO: direct preference optimization, alternative to RLHF → RLHF and Alignment
- Dropout: randomly zero neurons during training for regularization → Dropout
- DQN: deep Q-network → Q-Learning and DQN
E
- Embedding: dense vector representation of discrete data → Embeddings
- Ensemble: combining multiple models (bagging, boosting, stacking) → Ensemble Methods
- Epoch: one full pass through the training data
F
- F1 score: harmonic mean of precision and recall → Evaluation Metrics
- Feature: an input variable to the model
- Feature store: centralized repository for ML features → Feature Stores
- Fine-tuning: adapting a pretrained model → Transfer Learning
G
- GAN: generative adversarial network → Image Generation
- Gradient: vector of partial derivatives → Gradient
- Gradient descent: optimization by following negative gradient → Gradient Descent
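The gradient descent and learning rate entries, as a one-line update sketch (the function name and toy loss are illustrative):

```python
def gradient_descent_step(w, grad, lr=0.1):
    # Move opposite the gradient, scaled by the learning rate (step size).
    return w - lr * grad

# Minimize f(w) = w**2, whose gradient is 2*w, starting from w = 1.0.
w = 1.0
for _ in range(50):
    w = gradient_descent_step(w, 2 * w, lr=0.1)
print(w)  # close to 0, the minimum of f
```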
H
- Hyperparameter: setting chosen before training (lr, batch size, depth) → Hyperparameter Tuning
I
- Inference: using a trained model to make predictions
K
- KV cache: stored key-value pairs for efficient transformer inference
L
- Learning rate: step size in gradient descent
- LLM: large language model → Language Models
- Loss function: measures prediction error → Loss Functions
- LoRA: low-rank adaptation for efficient fine-tuning → LoRA and PEFT
M
- MLP: multi-layer perceptron (feedforward neural net)
- MSE: mean squared error → Loss Functions
O
- Overfitting: model memorizes training data rather than generalizing → Bias-Variance Tradeoff
P
- PCA: principal component analysis → PCA
- PPO: proximal policy optimization → Actor-Critic Methods
R
- RAG: retrieval augmented generation → Retrieval Augmented Generation
- Regularization: penalties to prevent overfitting → Regularization
- ReLU: max(0, x), the default activation → Neurons and Activation Functions
- Residual connection: skip connection, y = F(x) + x → Residual Networks
- RLHF: reinforcement learning from human feedback → RLHF and Alignment
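Two entries above (ReLU and residual connections) in a minimal sketch, assuming a simple list-of-floats representation:

```python
def relu(x):
    # ReLU: max(0, x), applied elementwise.
    return [max(0.0, v) for v in x]

def residual_block(x, f):
    # Residual connection: y = F(x) + x, where F is any learned transform.
    return [fx + xi for fx, xi in zip(f(x), x)]

print(relu([-1.0, 0.5, 2.0]))                         # [0.0, 0.5, 2.0]
print(residual_block([1.0, 2.0], lambda x: relu(x)))  # [2.0, 4.0]
```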
S
- Scaling laws: power-law relationships between compute/data/performance → Scaling Laws
- SGD: stochastic gradient descent → Gradient Descent
- Softmax: converts logits to probabilities summing to 1
- SVD: singular value decomposition → Matrix Decomposition
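The softmax entry can be sketched in pure Python; subtracting the max before exponentiating is the standard numerical-stability trick:

```python
import math

def softmax(logits):
    # Shift by the max for numerical stability; the probabilities are unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # each value in (0, 1)
print(sum(probs))   # 1.0 (up to floating-point error)
```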
T
- Tensor: n-dimensional array (generalization of matrices)
- Tokenization: splitting text into units → Text Preprocessing
- Transfer learning: reuse pretrained models → Transfer Learning
- Transformer: the dominant architecture in modern deep learning → Transformers
U
- Underfitting: model is too simple → Bias-Variance Tradeoff
V
- Variance (model): error from oversensitivity to training data → Bias-Variance Tradeoff
- ViT: vision transformer (transformer for images) → Vision Transformers
W
- Word2Vec: neural word embedding model → Word Embeddings