Key Papers

Read these to understand the foundations and frontiers of modern AI. Ordered by topic, not chronology.

Transformers & Attention

  • Attention Is All You Need (Vaswani et al., 2017) — the transformer architecture · arXiv:1706.03762
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019, NAACL) — encoder-only, masked LM · arXiv:1810.04805
  • Language Models are Unsupervised Multitask Learners (Radford et al., 2019) — GPT-2, emergent zero-shot · OpenAI PDF
  • Language Models are Few-Shot Learners (Brown et al., 2020, NeurIPS) — GPT-3, in-context learning emergence · arXiv:2005.14165
  • The Llama 3 Herd of Models (Grattafiori et al., 2024) — 405B open weights, 128K context, competitive with GPT-4 Turbo · arXiv:2407.21783
  • Llama 2: Open Foundation and Fine-Tuned Chat Models (Touvron et al., 2023) — open-source RLHF · arXiv:2307.09288
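The core mechanism behind every paper in this section is scaled dot-product attention from Vaswani et al. (2017): Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch (single head, no masking or batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Real implementations add multiple heads, causal masking, and batching on top of this single primitive.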

Reasoning & Chain-of-Thought

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022) — CoT discovery · arXiv:2201.11903
  • Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (Snell et al., 2024) — adaptive test-time compute; a smaller model given more inference compute can beat a larger one · arXiv:2408.03314
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-AI, 2025) — GRPO, pure RL emergent reasoning · arXiv:2501.12948
  • Let’s Verify Step by Step (Lightman et al., 2023) — process reward models beat outcome models for reasoning · arXiv:2305.20050
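Chain-of-thought prompting (Wei et al., 2022) amounts to prepending worked exemplars so the model imitates step-by-step reasoning before answering. A toy prompt builder, using a variant of the paper's well-known tennis-ball exemplar:

```python
def cot_prompt(question: str) -> str:
    """Prepend one worked exemplar so the model reasons step by step
    before giving its final answer (few-shot CoT, Wei et al., 2022)."""
    exemplar = (
        "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
        "How many balls does he have now?\n"
        "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
    )
    return exemplar + f"Q: {question}\nA:"

prompt = cot_prompt("A farm has 3 fields with 12 cows each. How many cows in total?")
print(prompt.endswith("A:"))  # True
```

The exemplar text here is illustrative; the papers above study which exemplars, and how much test-time compute, actually help.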

Vision

  • Deep Residual Learning for Image Recognition (He et al., 2016, CVPR) — ResNet, skip connections, deeper is better · arXiv:1512.03385
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021, ICLR) — ViT, transformers for images · arXiv:2010.11929
  • Segment Anything (Kirillov et al., 2023, ICCV) — SAM, promptable segmentation · arXiv:2304.02643
  • Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (Wang et al., 2023, CVPR) — BEiT-3, unified vision-language · arXiv:2208.10442
  • KAN: Kolmogorov-Arnold Networks (Liu et al., 2024) — learnable activations on edges, not nodes · arXiv:2404.19756

Object Detection

  • Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (Ren et al., 2015, NeurIPS) — two-stage detector · arXiv:1506.01497
  • You Only Look Once: Unified, Real-Time Object Detection (Redmon et al., 2016) — YOLO, one-stage detector · arXiv:1506.02640
  • End-to-End Object Detection with Transformers (Carion et al., 2020, ECCV) — DETR, transformer-based detection · arXiv:2005.12872

Generative Models

  • Generative Adversarial Nets (Goodfellow et al., 2014, NeurIPS) — GANs, adversarial training · arXiv:1406.2661
  • Auto-Encoding Variational Bayes (Kingma & Welling, 2014, ICLR) — VAE, latent variable models · arXiv:1312.6114
  • Denoising Diffusion Probabilistic Models (Ho et al., 2020, NeurIPS) — DDPM, modern generative baseline · arXiv:2006.11239
  • High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022, CVPR) — Stable Diffusion, efficiency · arXiv:2112.10752
  • Scalable Diffusion Models with Transformers (Peebles & Xie, 2023, ICCV) — DiT, transformer backbone for diffusion · arXiv:2212.09748
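The DDPM forward process (Ho et al., 2020) admits a closed form: x_t can be sampled directly from x_0 as x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t is the cumulative product of (1 − β_s). A minimal sketch with the paper's linear β schedule:

```python
import numpy as np

def ddpm_forward(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)[t]          # abar_t = prod_{s<=t} alpha_s
    eps = rng.normal(size=x0.shape)       # the noise the network learns to predict
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # linear schedule from the paper
x0 = rng.normal(size=(16,))
x_t, eps = ddpm_forward(x0, t=999, betas=betas, rng=rng)
print(x_t.shape)  # (16,)
```

Training then regresses the network's prediction of ε at a random t; latent diffusion (Rombach et al., 2022) runs the same process in an autoencoder's latent space.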

Alignment & Training

  • Training language models to follow instructions with human feedback (Ouyang et al., 2022, NeurIPS) — InstructGPT, RLHF pipeline · arXiv:2203.02155
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023, NeurIPS) — DPO, simpler than RLHF · arXiv:2305.18290
  • Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022, Anthropic) — self-critique alignment · arXiv:2212.08073
  • Scaling Instruction-Finetuned Language Models (Chung et al., 2022) — Flan-T5/PaLM, instruction tuning scaling · arXiv:2210.11416
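DPO (Rafailov et al., 2023) replaces the RLHF reward model with a single loss on log-probabilities of chosen (w) vs. rejected (l) responses, measured relative to a frozen reference policy. A scalar sketch of that loss:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected)."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-logits)))  # -log sigmoid(logits)

# If the policy already prefers the chosen response more strongly than the
# reference does, the loss drops below log(2); at parity it equals log(2).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
print(loss < np.log(2))  # True
```

In practice the log-probabilities are sequence-level sums from the policy and reference models; the log-prob values above are illustrative.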

Efficiency & Scale

  • LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2022, ICLR) — parameter-efficient fine-tuning · arXiv:2106.09685
  • QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, NeurIPS) — 4-bit + fine-tuning · arXiv:2305.14314
  • Scaling Laws for Neural Language Models (Kaplan et al., 2020) — power-law scaling, compute-optimal · arXiv:2001.08361
  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017, ICLR) — MoE, sparse activation · arXiv:1701.06538
  • Mixtral of Experts (Jiang et al., 2024) — 8x7B sparse MoE, 12B active params · arXiv:2401.04088
  • DeepSeek-V3 Technical Report (DeepSeek-AI, 2024) — 671B MoE, $5.6M training cost, MLA attention · arXiv:2412.19437
  • Distilling the Knowledge in a Neural Network (Hinton et al., 2015) — knowledge distillation · arXiv:1503.02531
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (Dao, 2023, ICLR 2024) — IO-aware attention, ~2x faster than FlashAttention · arXiv:2307.08691
  • MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (Hu et al., 2024) — 1B-3B params competitive with larger models · arXiv:2404.06395
  • Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (Abdin et al., 2024, Microsoft) — 3.8B params, capability via data quality · arXiv:2404.14219
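LoRA (Hu et al., 2022) freezes the pretrained weight W and learns only a low-rank update: y = xWᵀ + (α/r)·x AᵀBᵀ, with B initialized to zero so training starts from the frozen model. A minimal forward-pass sketch:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """y = x W^T + (alpha / r) * x A^T B^T: frozen W plus the low-rank
    update B @ A (Hu et al., 2022). Only A and B are trained."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # zero init: update starts as a no-op
x = rng.normal(size=(8, d_in))
y = lora_forward(x, W, A, B)
print(np.allclose(y, x @ W.T))  # True: B = 0 means no change at init
```

With rank r much smaller than d_in and d_out, the trainable parameter count drops by orders of magnitude; QLoRA applies the same trick on top of a 4-bit-quantized W.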

Multimodal

  • Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021, ICML) — CLIP, contrastive vision-language · arXiv:2103.00020
  • Gemini: A Family of Highly Capable Multimodal Models (Google DeepMind, 2023) — natively multimodal model family · arXiv:2312.11805
  • GPT-4V(ision) System Card (OpenAI, 2023) — vision-language GPT-4 · OpenAI PDF
  • BLIP-2: Bootstrapping Language-Image Pre-training (Li et al., 2023, ICML) — frozen LLM + visual encoder · arXiv:2301.12597
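CLIP's training objective (Radford et al., 2021) is a symmetric contrastive loss over a batch: normalize image and text embeddings, compute all pairwise cosine similarities, and train matched pairs (the diagonal) to out-score all mismatches in both directions. A NumPy sketch of the loss:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch: matched image/text pairs on the
    diagonal should score higher than all mismatched pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature              # (n, n) scaled cosine sims
    labels = np.arange(len(logits))
    def xent(l):                                    # cross-entropy, diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return (xent(logits) + xent(logits.T)) / 2      # image->text and text->image

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
loss = clip_loss(emb, emb)  # identical pairs: well below chance level log(4)
```

In the paper the temperature is a learned parameter and the embeddings come from separate image and text encoders; both are simplified here.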

Sequence Modeling (State Space Models)

  • Efficiently Modeling Long Sequences with Structured State Spaces (Gu et al., 2022, ICLR) — S4, long-range dependencies · arXiv:2111.00396
  • Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023) — selective SSM, closes gap with transformers · arXiv:2312.00752
  • Transformers are SSMs (Dao & Gu, 2024, ICML) — Mamba-2, state space duality, 8x faster · arXiv:2405.21060
  • RWKV: Reinventing RNNs for the Transformer Era (Peng et al., 2023, EMNLP) — linear attention, O(1) inference · arXiv:2305.13048
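All of these architectures build on the same linear recurrence: h_t = A·h_{t−1} + B·x_t, y_t = C·h_t. Because each step touches only a fixed-size state h, inference cost per token is constant, unlike attention's growing KV cache. A scalar-input sketch:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    Only the fixed-size state h is carried between steps."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                    # sequential scan; S4/Mamba parallelize this
        h = A @ h + B * x_t          # state update (B has shape (d_state,))
        ys.append(C @ h)             # readout     (C has shape (d_state,))
    return np.array(ys)

rng = np.random.default_rng(0)
d = 4
A = np.diag(rng.uniform(0.5, 0.9, size=d))   # stable diagonal dynamics
B, C = rng.normal(size=d), rng.normal(size=d)
y = ssm_scan(rng.normal(size=100), A, B, C)
print(y.shape)  # (100,)
```

S4's contribution is parameterizing and computing this recurrence efficiently over long sequences; Mamba's is making A, B, C input-dependent ("selective") while keeping the linear-time scan.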

Agents & Tool Use

  • ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) — reasoning + acting loop · arXiv:2210.03629
  • Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) — verbal reflection improves agents · arXiv:2303.11366
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023) — exploration over reasoning trees · arXiv:2305.10601
  • Tool Learning with Foundation Models (Qin et al., 2023) — survey of when and how models use tools · arXiv:2304.08354

Speech & Audio

  • HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (Hsu et al., 2021, NeurIPS) — self-supervised speech · arXiv:2106.07447
  • Robust Speech Recognition via Large-Scale Weak Supervision (Radford et al., 2023, ICML) — Whisper, OpenAI multilingual ASR · arXiv:2212.04356

Graph Neural Networks

  • Semi-Supervised Classification with Graph Convolutional Networks (Kipf & Welling, 2017, ICLR) — GCN · arXiv:1609.02907
  • Inductive Representation Learning on Large Graphs (Hamilton et al., 2017, NeurIPS) — GraphSAGE · arXiv:1706.02216
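A GCN layer (Kipf & Welling, 2017) is one round of normalized neighborhood averaging followed by a learned linear map: H' = σ(D̂^(−1/2)(A + I)D̂^(−1/2) H W), with self-loops added so each node keeps its own features. A dense-matrix sketch on a toy graph:

```python
import numpy as np

def gcn_layer(adj, H, W):
    """One GCN propagation step: H' = relu(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = adj + np.eye(len(adj))                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0)           # aggregate neighbors, then ReLU

# Toy graph: a path 0-1-2 with 2-dim node features projected to 4 dims.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 2))
W = rng.normal(size=(2, 4))
out = gcn_layer(adj, H, W)
print(out.shape)  # (3, 4)
```

GraphSAGE generalizes this to unseen nodes by sampling and aggregating neighborhoods instead of using the full normalized adjacency.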

How to Read a Paper

  1. Abstract + Conclusion first — get the main contribution and results
  2. Figures and tables — often contain the key insights in accessible form
  3. Introduction — motivation and problem statement
  4. Method — focus on the key insight, not every equation
  5. Results — what was compared, what improved, by how much?
  6. Related work — where does this fit in the landscape?
  7. Implementation details — only if you want to reproduce

Reading Roadmap

Week 1 — Foundations: Attention (1706.03762) → BERT (1810.04805) → GPT-2 (OpenAI PDF) → GPT-3 (2005.14165)

Week 2 — Alignment: InstructGPT (2203.02155) → DPO (2305.18290) → Constitutional AI (2212.08073)

Week 3 — Scaling & Efficiency: Scaling Laws (2001.08361) → LoRA (2106.09685) → QLoRA (2305.14314) → FlashAttention (2307.08691)

Week 4 — Generative Models: DDPM (2006.11239) → Stable Diffusion (2112.10752) → DiT (2212.09748)

Week 5 — Reasoning: Chain-of-Thought (2201.11903) → Test-Time Compute (2408.03314) → DeepSeek-R1 (2501.12948)

Week 6 — Modern Architectures: ViT (2010.11929) → CLIP (2103.00020) → Mamba (2312.00752) → Mixtral (2401.04088)