Reinforcement Learning Roadmap
An agent learns by interacting with an environment, receiving rewards, and improving its policy.
Core concepts
- Agent: the learner/decision-maker
- Environment: the world the agent interacts with
- State: the environment's current situation, as observed by the agent
- Action: what the agent can do
- Reward: feedback signal (maximize this)
- Policy: strategy mapping states → actions
- Value function: expected cumulative future reward from a state (or state-action pair)
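These concepts can be made concrete with a minimal interaction loop. The toy CorridorEnv below is illustrative (not from any library): the agent's policy maps states to actions, the environment returns the next state and a reward, and the reward is the signal to maximize.

```python
class CorridorEnv:
    """Toy environment: a 5-cell corridor; reaching cell 4 gives reward +1.
    States are cell indices 0..4; actions are -1 (left) or +1 (right)."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def policy(state):
    """A fixed policy mapping states to actions: always move right."""
    return +1

env = CorridorEnv()
state = env.reset()
total_reward = 0.0
for _ in range(10):                         # one episode, capped at 10 steps
    action = policy(state)                  # agent picks an action via its policy
    state, reward, done = env.step(action)  # environment responds
    total_reward += reward                  # reward is the feedback to maximize
    if done:
        break
```

The same loop shape appears in gymnasium, just with a richer `step` return signature.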
Topics
Foundations
- RL Fundamentals — MDPs, policies, value functions, Bellman equation
- Multi-Armed Bandits — exploration vs exploitation in the simplest setting
- Q-Learning and DQN — value-based methods
Policy optimization
- Policy Gradient Methods — directly optimize the policy
- Actor-Critic Methods — combine value and policy methods (overview)
- Actor-Critic and PPO — deep dive into PPO, the most widely used deep RL algorithm
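The core of PPO is its clipped surrogate objective, which limits how far one update can move the policy from the one that collected the data. A minimal per-sample sketch (epsilon and the sample values are illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for a single sample.
    ratio = pi_new(a|s) / pi_old(a|s). Taking the min of the clipped
    and unclipped terms is pessimistic: gains from large policy moves
    are capped, but penalties are not clipped away."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once ratio exceeds 1 + eps
print(ppo_clip_objective(1.5, advantage=1.0))    # 1.2, not 1.5
# Negative advantage: the min keeps the larger penalty
print(ppo_clip_objective(0.5, advantage=-1.0))   # -0.8, not -0.5
```

In a real implementation this is averaged over a minibatch and maximized alongside a value loss and an entropy bonus.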
Advanced methods
- Model-Based RL — learn a world model, plan with it (Dyna, Dreamer, MuZero)
- Multi-Agent RL — cooperative, competitive, and mixed multi-agent settings
Practical
- Reward Design and Curriculum — reward shaping, hacking, curiosity, RLHF
- Tutorial - Sim-to-Real Transfer — bridging the reality gap for real-world deployment
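One standard tool from the reward-design literature is potential-based shaping (Ng et al.): adding F(s, s') = gamma*phi(s') - phi(s) to the reward speeds learning without changing the optimal policy. A minimal sketch, with an illustrative corridor potential:

```python
def shaped_reward(r, s, s2, phi, gamma=0.99, done=False):
    """Potential-based reward shaping: r + gamma*phi(s') - phi(s).
    phi is a user-chosen potential over states; the terminal
    potential is conventionally taken to be 0."""
    phi2 = 0.0 if done else phi(s2)
    return r + gamma * phi2 - phi(s)

# Example: corridor with goal at cell 4, potential = negative distance to goal.
# Moving from cell 1 to cell 2 earns a shaping bonus even though r = 0.
phi = lambda s: -abs(4 - s)
print(shaped_reward(0.0, 1, 2, phi))   # 1.02
```

Hand-tuned bonus terms that are not potential-based can change the optimal policy, which is one common source of reward hacking.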
Hands-on
- Tutorial - PPO from Scratch — implement PPO from scratch in PyTorch
- Tutorial - Multi-Agent Training — train multiple agents in PettingZoo
Design judgment
- Case Study - RL System Design — drone navigation, multi-agent pursuit, RLHF failure analysis
Learning order
Phase 1: Foundations
1. RL Fundamentals (MDP framework)
2. Multi-Armed Bandits (exploration vs exploitation)
3. Q-Learning and DQN (value-based methods)
Phase 2: Policy optimization
4. Policy Gradient Methods (REINFORCE)
5. Actor-Critic Methods (overview)
6. Actor-Critic and PPO (the algorithm you'll actually use)
7. Tutorial - PPO from Scratch (implement it)
Phase 3: Advanced topics
8. Model-Based RL (when interactions are expensive)
9. Reward Design and Curriculum (the hardest part of RL)
10. Multi-Agent RL (multiple learning agents)
11. Tutorial - Multi-Agent Training (build it)
Phase 4: Real-world deployment
12. Tutorial - Sim-to-Real Transfer (simulation to hardware)
13. Case Study - RL System Design (design judgment)
Key libraries
- gymnasium (formerly gym) — standard RL environments
- stable-baselines3 — reliable RL algorithm implementations
- pettingzoo — multi-agent environments
- torch — neural network framework for custom implementations
Applications
- Game playing (AlphaGo, Atari, DOTA, StarCraft)
- Robotics and control (locomotion, manipulation, drones)
- RLHF — aligning language models with human preferences
- Autonomous systems (navigation, swarm coordination)
- Resource allocation, scheduling, network routing
- Defense: swarm tactics, EW strategy, pursuit-evasion
Links
- Deep Learning Roadmap
- Language Models — RLHF connection
- RLHF and Alignment — alignment techniques
- Computer Vision Roadmap — vision for autonomous RL agents