Computer Vision Roadmap

Teaching machines to understand images and video.

Topics

Core tasks

  • Image Classification — assign a label to a whole image
  • Object Detection — localize objects with bounding boxes
  • Image Segmentation — label images at the pixel level

Motion and tracking

  • Optical Flow — estimate per-pixel motion between frames
  • Multi-Object Tracking — maintain object identities across frames
  • Video Understanding — recognize what's happening over time

3D and depth

  • 3D Vision and Depth — depth estimation, point clouds
  • Visual SLAM — simultaneous localization and mapping

Human understanding

  • Pose Estimation — detect body keypoints, skeleton-based action recognition

Techniques

  • Convolutional Neural Networks — how vision models work
  • Transfer Learning — adapt pretrained models to new tasks
  • Data Augmentation — expand training data with label-preserving transforms

Learning order

Phase 1: Foundations
  1. Convolutional Neural Networks (how vision models work)
  2. Image Classification (the "hello world" of vision)
  3. Transfer Learning (pretrained models)

Phase 2: Spatial tasks
  4. Object Detection (bounding boxes)
  5. Image Segmentation (pixel-level)
  6. Pose Estimation (body keypoints)
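
The spatial tasks in Phase 2 are evaluated (and their predictions matched to ground truth) with intersection over union, the overlap ratio between two boxes. A minimal pure-Python sketch — the `iou` helper and the (x1, y1, x2, y2) box format are my own conventions, not something the roadmap prescribes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlap rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes don't overlap at all.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection typically counts as correct when its IoU with a ground-truth box exceeds a threshold, commonly 0.5.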

Phase 3: Temporal tasks
  7. Optical Flow (pixel motion)
  8. Multi-Object Tracking (identity across frames)
  9. Video Understanding (what's happening over time)
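
The heart of Phase 3's multi-object tracking is data association: deciding which detection in the current frame continues which existing track. Production trackers (SORT and its descendants) pair a motion model with optimal assignment; the sketch below is only a greedy nearest-centroid toy to show the idea, and every name and threshold in it is an assumption:

```python
from math import hypot

def assign_ids(prev, detections, max_dist=50.0, next_id=0):
    """Greedy nearest-neighbour ID assignment between frames.

    prev: dict id -> (x, y) track centroid from the last frame.
    detections: list of (x, y) centroids in the current frame.
    Returns (dict id -> centroid for this frame, next unused id).
    """
    # All track/detection pairs, closest first.
    pairs = sorted(
        (hypot(px - dx, py - dy), tid, j)
        for tid, (px, py) in prev.items()
        for j, (dx, dy) in enumerate(detections)
    )
    assigned, used_tracks, used_dets = {}, set(), set()
    for dist, tid, j in pairs:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if tid in used_tracks or j in used_dets:
            continue  # each track/detection matched at most once
        assigned[tid] = detections[j]
        used_tracks.add(tid)
        used_dets.add(j)
    # Unmatched detections start new tracks; unmatched tracks are
    # simply dropped (no re-identification in this toy version).
    for j, det in enumerate(detections):
        if j not in used_dets:
            assigned[next_id] = det
            next_id += 1
    return assigned, next_id
```

Running it over two frames keeps identities stable as the objects drift: track 0 follows the object near (10, 10) even though the detection order changes.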

Phase 4: 3D understanding
  10. 3D Vision and Depth (depth estimation, point clouds)
  11. Visual SLAM Concepts (localization + mapping)

Phase 5: Applied
  12. Tutorial - Object Tracking Pipeline (build a tracker)
  13. Tutorial - Aerial Image Analysis (satellite/drone imagery)
  14. Case Study - CV Pipeline Design (design judgment)

The modern workflow

1. Pick a pretrained model (torchvision, timm, huggingface)
2. Replace the classification head
3. Apply data augmentation
4. Fine-tune on your data
5. Evaluate and iterate

You almost never train a vision model from scratch.

Tutorials and applied

  • Object Tracking Pipeline — build a tracker end to end
  • Aerial Image Analysis — work with satellite and drone imagery

Design judgment

  • CV Pipeline Design — case study in choosing the right approach for a problem

Key libraries

  • torchvision — datasets, models, transforms
  • timm — huge collection of pretrained models
  • albumentations — fast image augmentation
  • ultralytics — YOLO for detection and tracking
  • opencv-python — classical CV, optical flow, feature matching
  • open3d — point cloud processing and visualization
  • rasterio — geospatial image loading (satellite data)
  • mediapipe — pose estimation, face/hand detection (on-device)