Reinforcement Learning Roadmap
An agent learns by interacting with an environment, receiving rewards, and improving its policy.
Core concepts
- Agent: the learner/decision-maker
- Environment: the world the agent interacts with
- State: the environment's current situation, as observed by the agent
- Action: what the agent can do
- Reward: feedback signal (maximize this)
- Policy: strategy mapping states → actions
- Value function: expected cumulative future reward from a state (or state-action pair)
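These concepts can be made concrete with a minimal interaction loop. The toy CorridorEnv below is illustrative (not from any library): the agent's policy maps states to actions, the environment returns the next state and a reward, and the reward is the signal to maximize.

```python
class CorridorEnv:
    """Toy environment: a 5-cell corridor; reaching cell 4 gives reward +1.
    States are cell indices 0..4; actions are -1 (left) or +1 (right)."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def policy(state):
    """A fixed policy mapping states to actions: always move right."""
    return +1

env = CorridorEnv()
state = env.reset()
total_reward = 0.0
for _ in range(10):                         # one episode, capped at 10 steps
    action = policy(state)                  # agent picks an action via its policy
    state, reward, done = env.step(action)  # environment responds
    total_reward += reward                  # reward is the feedback to maximize
    if done:
        break
```

The same loop shape appears in gymnasium, just with a richer `step` return signature.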
Topics
Foundations
- RL Fundamentals — MDPs, policies, value functions, Bellman equation
- Multi-Armed Bandits — exploration vs exploitation in the simplest setting
- Q-Learning and DQN — value-based methods
Policy optimization
- Policy Gradient Methods — directly optimize the policy
- Actor-Critic Methods — combine value and policy methods (overview)
- Actor-Critic and PPO — deep dive into PPO, the most widely used deep RL algorithm
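The core of PPO is its clipped surrogate objective, which limits how far one update can move the policy from the one that collected the data. A minimal per-sample sketch (epsilon and the sample values are illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for a single sample.
    ratio = pi_new(a|s) / pi_old(a|s). Taking the min of the clipped
    and unclipped terms is pessimistic: gains from large policy moves
    are capped, but penalties are not clipped away."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once ratio exceeds 1 + eps
print(ppo_clip_objective(1.5, advantage=1.0))    # 1.2, not 1.5
# Negative advantage: the min keeps the larger penalty
print(ppo_clip_objective(0.5, advantage=-1.0))   # -0.8, not -0.5
```

In a real implementation this is averaged over a minibatch and maximized alongside a value loss and an entropy bonus.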
Advanced methods
- Model-Based RL — learn a world model, plan with it (Dyna, Dreamer, MuZero)
- Multi-Agent RL — cooperative, competitive, and mixed multi-agent settings
Practical
- Reward Design and Curriculum — reward shaping, hacking, curiosity, RLHF
- Tutorial - Sim-to-Real Transfer — bridging the reality gap for real-world deployment
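One standard tool from the reward-design literature is potential-based shaping (Ng et al.): adding F(s, s') = gamma*phi(s') - phi(s) to the reward speeds learning without changing the optimal policy. A minimal sketch, with an illustrative corridor potential:

```python
def shaped_reward(r, s, s2, phi, gamma=0.99, done=False):
    """Potential-based reward shaping: r + gamma*phi(s') - phi(s).
    phi is a user-chosen potential over states; the terminal
    potential is conventionally taken to be 0."""
    phi2 = 0.0 if done else phi(s2)
    return r + gamma * phi2 - phi(s)

# Example: corridor with goal at cell 4, potential = negative distance to goal.
# Moving from cell 1 to cell 2 earns a shaping bonus even though r = 0.
phi = lambda s: -abs(4 - s)
print(shaped_reward(0.0, 1, 2, phi))   # 1.02
```

Hand-tuned bonus terms that are not potential-based can change the optimal policy, which is one common source of reward hacking.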
Hands-on
- Tutorial - PPO from Scratch — implement PPO from scratch in PyTorch
- Tutorial - Multi-Agent Training — train multiple agents in PettingZoo
Design judgment
- Case Study - RL System Design — drone navigation, multi-agent pursuit, RLHF failure analysis
Learning order
Phase 1: Foundations
1. RL Fundamentals (MDP framework)
2. Multi-Armed Bandits (exploration vs exploitation)
3. Q-Learning and DQN (value-based methods)
Phase 2: Policy optimization
4. Policy Gradient Methods (REINFORCE)
5. Actor-Critic Methods (overview)
6. Actor-Critic and PPO (the algorithm you'll actually use)
7. Tutorial - PPO from Scratch (implement it)
Phase 3: Advanced topics
8. Model-Based RL (when interactions are expensive)
9. Reward Design and Curriculum (the hardest part of RL)
10. Multi-Agent RL (multiple learning agents)
11. Tutorial - Multi-Agent Training (build it)
Phase 4: Real-world deployment
12. Tutorial - Sim-to-Real Transfer (simulation to hardware)
13. Case Study - RL System Design (design judgment)
Key libraries
- gymnasium (formerly gym) — standard RL environments
- stable-baselines3 — reliable RL algorithm implementations
- pettingzoo — multi-agent environments
- torch — neural network framework for custom implementations
Applications
- Game playing (AlphaGo, Atari, DOTA, StarCraft)
- Robotics and control (locomotion, manipulation, drones)
- RLHF — aligning language models with human preferences
- Autonomous systems (navigation, swarm coordination)
- Resource allocation, scheduling, network routing
- Defense: swarm tactics, EW strategy, pursuit-evasion
Links
- Deep Learning Roadmap
- Language Models — RLHF connection
- RLHF and Alignment — alignment techniques
- Computer Vision Roadmap — vision for autonomous RL agents