Knowledge Distillation
What
Train a smaller “student” model to mimic a larger “teacher” model. The student learns from the teacher’s soft probability outputs — not just hard labels.
Why soft targets help
Teacher outputs: [cat: 0.7, dog: 0.2, horse: 0.1]
Hard label: [cat: 1, dog: 0, horse: 0]
The soft targets carry dark knowledge: “this looks a bit like a dog too” — richer signal than a one-hot label. The teacher assigns non-trivial probability to incorrect classes, revealing the model’s understanding of conceptual similarity.
Process
Teacher (large, accurate) → generate soft labels on training data
Student (small, fast) → train on both:
- Soft labels from teacher (KL divergence loss)
- Hard labels from data (cross-entropy loss)
- Total loss = α × soft_loss + (1-α) × hard_loss
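The combined objective above can be sketched in PyTorch (the function name and the defaults for α and T are illustrative, not prescribed):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Soft loss: KL divergence between temperature-softened distributions.
    # F.kl_div expects log-probabilities as its first argument.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)  # T^2 keeps gradient magnitudes comparable across temperatures
    # Hard loss: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```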
Distillation temperature
The softmax temperature T controls how “soft” the teacher’s distribution is:
```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    T = temperature
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T**2 rescales the loss so gradient magnitudes stay comparable across T
    return T**2 * F.kl_div(soft_student, soft_teacher, reduction='batchmean')
```
- T=1: standard softmax
- T>1: softer probability distribution over more classes
- High T amplifies dark knowledge from teacher
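A quick way to see the effect, in plain Python with made-up logits:

```python
import math

def softmax_T(logits, T=1.0):
    # Temperature-scaled softmax: divide logits by T before normalizing.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]          # e.g. cat, dog, horse
print(softmax_T(logits, T=1.0))   # peaked: most mass on the top class
print(softmax_T(logits, T=4.0))   # softer: incorrect classes get more mass
```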
Types of distillation
Response distillation
Student learns to match teacher’s final output layer. Simplest form — used in DistilBERT.
Feature distillation
Student learns to match intermediate representations. The teacher's intermediate layers serve as hints:
```python
# Feature matching: align hidden states (assumes matching dimensions;
# otherwise project student_hidden through a learned linear layer)
feature_loss = F.mse_loss(student_hidden, teacher_hidden)
```
Relationship distillation
Student learns the relationships between teacher’s representations — attention maps, similarity matrices.
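One common variant compares pairwise cosine-similarity matrices over a batch. A sketch (`relation_loss` and the variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def relation_loss(student_emb, teacher_emb):
    # Cosine-similarity matrices over the batch: (B, D) -> (B, B).
    # Matching relations between samples, not raw features, means the
    # student and teacher embedding dimensions need not agree.
    s_norm = F.normalize(student_emb, dim=-1)
    t_norm = F.normalize(teacher_emb, dim=-1)
    s_sim = s_norm @ s_norm.T
    t_sim = t_norm @ t_norm.T
    return F.mse_loss(s_sim, t_sim)
```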
Modern distillation techniques
1. Self-distillation (Born-Again Networks)
A model distilled into an identical architecture. Iterative self-distillation often improves performance without a larger teacher — the student becomes its own teacher.
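One generation of the loop can be sketched with a toy linear model (`self_distill` and the hyperparameters are illustrative; real setups train full networks with hard labels mixed in):

```python
import torch
import torch.nn.functional as F

def self_distill(student, teacher, X, steps=300, lr=1.0, T=2.0):
    # Train `student` (same architecture as `teacher`) to match the
    # teacher's temperature-softened outputs on inputs X.
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    targets = F.softmax(teacher(X).detach() / T, dim=-1)
    for _ in range(steps):
        loss = (T ** 2) * F.kl_div(F.log_softmax(student(X) / T, dim=-1),
                                   targets, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student  # becomes the teacher for the next generation
```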
2. Language model distillation (LLM compression)
Distilling large language models into smaller ones:
- Special loss for token-level knowledge
- Logit matching at the final layer
- Intermediate layer matching for deeper architectures
- Example: TinyLlama (1.1B) distilled from Llama 2 (7B+)
3. Task-specific distillation
Fine-tune a general teacher on a specific domain, then distill: for example, GPT-4 distilled into a 7B model for code generation or instruction following.
4. Data-free distillation
When you don’t have access to the original training data, generate synthetic data from the teacher (use the teacher to label generated samples), or use an adversarial setup to create informative samples.
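The first option can be sketched with random inputs and a toy linear teacher (`synth_batch` is hypothetical; real systems use a learned generator or adversarial search rather than pure noise):

```python
import torch
import torch.nn.functional as F

def synth_batch(teacher, n=64, T=2.0):
    # Data-free sketch: draw random inputs and let the teacher label
    # them with temperature-softened soft targets.
    X = torch.randn(n, teacher.in_features)  # toy: teacher is an nn.Linear
    with torch.no_grad():
        y_soft = F.softmax(teacher(X) / T, dim=-1)
    return X, y_soft  # train the student on (X, y_soft) as usual
```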
Distillation vs other compression techniques
| | Distillation | Quantization | Pruning |
|---|---|---|---|
| Mechanism | Train small from large | Reduce weight precision | Remove weights |
| Quality | Best — leverages teacher knowledge | Good (minor loss at INT8) | Variable |
| Speed | Smaller model = faster | Faster matmuls | Depends on sparsity |
| Combination | Can combine with quantization | Can combine with distillation | Can combine with distillation |
Applications
- Deploy smaller models in production (DistilBERT is 60% of BERT’s size and retains 97% of its performance)
- Mobile/edge deployment
- Using GPT-4 outputs to train smaller open models
- Specialized models from general models
Key papers
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015) — arXiv:1503.02531
- Born-Again Neural Networks (Furlanello et al., 2018) — self-distillation
- TinyBERT (Jiao et al., 2020) — BERT distillation
- MiniLM (Wang et al., 2020) — deep self-attention distillation for language models
- A Survey on Knowledge Distillation (Gou et al., 2021) — arXiv:2006.05525
Links
- Quantization — complementary compression technique
- LoRA and PEFT — parameter-efficient fine-tuning
- Transfer Learning — foundation
- Cross-Entropy and KL Divergence — the losses used