Computer Vision Roadmap
Teaching machines to understand images and video.
Topics
Core tasks
- Image Classification — what is in this image?
- Object Detection — where are the objects? (bounding boxes)
- Image Segmentation — pixel-level classification
- Image Generation — creating new images (GANs, diffusion)
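To make the bounding-box idea in Object Detection concrete, here is a minimal sketch of intersection over union (IoU), a common way to score how well a predicted box overlaps a reference box. IoU is not named in this roadmap; the `(x_min, y_min, x_max, y_max)` box layout and the function name `iou` are assumptions chosen for illustration.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield 0.0 instead of dividing by zero.
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one pixel in each direction share an area of 1 out of a union of 7, so `iou((0, 0, 2, 2), (1, 1, 3, 3))` is 1/7.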
Motion and tracking
- Multi-Object Tracking — track objects across video frames with persistent IDs
- Optical Flow — pixel-level motion estimation between consecutive frames
- Video Understanding — temporal analysis: action recognition, anomaly detection
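The "persistent IDs" part of Multi-Object Tracking can be sketched as a greedy association step: each new detection inherits the ID of the previous-frame box it overlaps most, and unmatched detections get fresh IDs. This is a deliberately simplified illustration, not a production tracker; the class name `GreedyIoUTracker`, the 0.3 threshold, and the choice to drop unmatched tracks immediately are all assumptions.

```python
from itertools import count


def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Hypothetical minimal tracker: greedy IoU matching between frames."""

    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}            # track_id -> box from the previous frame
        self._next_id = count(1)    # source of fresh IDs

    def update(self, detections):
        """Assign an ID to every detection; returns [(track_id, box), ...]."""
        # Score every (track, detection) pair, best overlap first.
        pairs = sorted(
            ((iou(box, det), tid, j)
             for tid, box in self.tracks.items()
             for j, det in enumerate(detections)),
            reverse=True,
        )
        assigned, used_tracks = {}, set()
        for score, tid, j in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to match
            if tid in used_tracks or j in assigned:
                continue
            assigned[j] = tid
            used_tracks.add(tid)

        results, new_tracks = [], {}
        for j, det in enumerate(detections):
            tid = assigned.get(j)
            if tid is None:
                tid = next(self._next_id)  # unmatched detection starts a new track
            new_tracks[tid] = det
            results.append((tid, det))

        # Simplification: tracks with no match this frame are dropped at once.
        self.tracks = new_tracks
        return results
```

Real trackers usually add a motion model and keep lost tracks alive for a few frames; the sketch leaves both out to keep the ID logic visible.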
3D and depth
- 3D Vision and Depth — stereo vision, monocular depth, point clouds, structure from motion (SfM)
- Tutorial - Visual SLAM Concepts — simultaneous localization and mapping
Human understanding
- Pose Estimation — detect body keypoints, skeleton-based action recognition
Techniques
- Convolutional Neural Networks — the foundational architecture for vision
- Transfer Learning — pretrained models (ResNet, EfficientNet, ViT)
- Data Augmentation — expand training data artificially
Learning order
Phase 1: Foundations
1. Convolutional Neural Networks (how vision models work)
2. Image Classification (the "hello world")
3. Transfer Learning (pretrained models)
Phase 2: Spatial tasks
4. Object Detection (bounding boxes)
5. Image Segmentation (pixel-level)
6. Pose Estimation (body keypoints)
Phase 3: Temporal tasks
7. Optical Flow (pixel motion)
8. Multi-Object Tracking (identity across frames)
9. Video Understanding (what's happening over time)
Phase 4: 3D understanding
10. 3D Vision and Depth (depth estimation, point clouds)
11. Visual SLAM Concepts (localization + mapping)
Phase 5: Applied
12. Tutorial - Object Tracking Pipeline (build a tracker)
13. Tutorial - Aerial Image Analysis (satellite/drone imagery)
14. Case Study - CV Pipeline Design (design judgment)
The modern workflow
1. Pick a pretrained model (torchvision, timm, huggingface)
2. Replace the classification head
3. Apply data augmentation
4. Fine-tune on your data
5. Evaluate and iterate
You almost never train a vision model from scratch.
Tutorials and applied
- Tutorial - Object Tracking Pipeline — build a multi-object tracker from scratch
- Tutorial - Aerial Image Analysis — satellite imagery, change detection, NDVI
- Tutorial - Visual SLAM Concepts — visual odometry and mapping
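For reference, the NDVI mentioned under Aerial Image Analysis is the normalized difference between near-infrared and red reflectance, (NIR - Red) / (NIR + Red). The sketch below computes it on plain nested lists so it runs without any dependencies; in practice the bands would more likely be arrays loaded with a library such as rasterio. The function names and the choice to return 0.0 where both bands are zero are assumptions.

```python
def ndvi(nir: float, red: float) -> float:
    """NDVI for one pixel: (NIR - Red) / (NIR + Red), in the range [-1, 1]."""
    denom = nir + red
    # Assumption: treat pixels with no signal in either band as 0.0.
    return (nir - red) / denom if denom != 0 else 0.0


def ndvi_image(nir_band, red_band):
    """Apply NDVI pixel by pixel to two equally sized 2D bands (lists of rows)."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]
```

Dense vegetation reflects strongly in near-infrared and absorbs red, so it pushes NDVI toward 1, while bare soil and water sit near or below 0.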
Design judgment
- Case Study - CV Pipeline Design — three real scenarios: drone tracking, satellite change detection, indoor activity monitoring
Key libraries
- torchvision — datasets, models, transforms
- timm — huge collection of pretrained models
- albumentations — fast image augmentation
- ultralytics — YOLO for detection and tracking
- opencv-python — classical CV, optical flow, feature matching
- open3d — point cloud processing and visualization
- rasterio — geospatial image loading (satellite data)
- mediapipe — pose estimation, face/hand detection (on-device)
Links
- Deep Learning Roadmap
- Convolutional Neural Networks
- Transfer Learning
- Reinforcement Learning Roadmap — RL for autonomous vision systems