Knowledge Distillation
What
Train a smaller “student” model to mimic a larger “teacher” model. The student learns from the teacher’s soft probability outputs — not just hard labels.
Why soft targets help
Teacher outputs: [cat: 0.7, dog: 0.2, horse: 0.1]
Hard label: [cat: 1, dog: 0, horse: 0]
The soft targets carry dark knowledge: “this looks a bit like a dog too” — richer signal than a one-hot label. The teacher assigns non-trivial probability to incorrect classes, revealing the model’s understanding of conceptual similarity.
Process
Teacher (large, accurate) → generate soft labels on training data
Student (small, fast) → train on both:
- Soft labels from teacher (KL divergence loss)
- Hard labels from data (cross-entropy loss)
- Total loss = α × soft_loss + (1-α) × hard_loss
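The combined objective above can be sketched in PyTorch (the function name and the defaults for α and T are illustrative, not prescribed):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    # Soft loss: KL divergence between temperature-softened distributions.
    # F.kl_div expects log-probabilities as its first argument.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)  # T^2 keeps gradient magnitudes comparable across temperatures
    # Hard loss: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```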
Distillation temperature
The softmax temperature T controls how “soft” the teacher’s distribution is:
```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    T = temperature
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T**2 rescales the loss so gradient magnitudes stay comparable across T
    return T**2 * F.kl_div(soft_student, soft_teacher, reduction='batchmean')
```
- T=1: standard softmax
- T>1: softer probability distribution over more classes
- High T amplifies dark knowledge from teacher
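A quick way to see the effect, in plain Python with made-up logits:

```python
import math

def softmax_T(logits, T=1.0):
    # Temperature-scaled softmax: divide logits by T before normalizing.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]          # e.g. cat, dog, horse
print(softmax_T(logits, T=1.0))   # peaked: most mass on the top class
print(softmax_T(logits, T=4.0))   # softer: incorrect classes get more mass
```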
Types of distillation
Response distillation
Student learns to match teacher’s final output layer. Simplest form — used in DistilBERT.
Feature distillation
Student learns to match intermediate representations. The teacher's intermediate layers serve as hints:
```python
# Feature matching: align hidden states (assumes matching dimensions;
# otherwise project student_hidden through a learned linear layer)
feature_loss = F.mse_loss(student_hidden, teacher_hidden)
```
Relationship distillation
Student learns the relationships between teacher’s representations — attention maps, similarity matrices.
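One common variant compares pairwise cosine-similarity matrices over a batch. A sketch (`relation_loss` and the variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def relation_loss(student_emb, teacher_emb):
    # Cosine-similarity matrices over the batch: (B, D) -> (B, B).
    # Matching relations between samples, not raw features, means the
    # student and teacher embedding dimensions need not agree.
    s_norm = F.normalize(student_emb, dim=-1)
    t_norm = F.normalize(teacher_emb, dim=-1)
    s_sim = s_norm @ s_norm.T
    t_sim = t_norm @ t_norm.T
    return F.mse_loss(s_sim, t_sim)
```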
Modern distillation techniques
1. Self-distillation (Born-Again Networks)
A model distilled into an identical architecture. Iterative self-distillation often improves performance without a larger teacher — the student becomes its own teacher.
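One generation of the loop can be sketched with a toy linear model (`self_distill` and the hyperparameters are illustrative; real setups train full networks with hard labels mixed in):

```python
import torch
import torch.nn.functional as F

def self_distill(student, teacher, X, steps=300, lr=1.0, T=2.0):
    # Train `student` (same architecture as `teacher`) to match the
    # teacher's temperature-softened outputs on inputs X.
    opt = torch.optim.SGD(student.parameters(), lr=lr)
    targets = F.softmax(teacher(X).detach() / T, dim=-1)
    for _ in range(steps):
        loss = (T ** 2) * F.kl_div(F.log_softmax(student(X) / T, dim=-1),
                                   targets, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student  # becomes the teacher for the next generation
```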
2. Language model distillation (LLM compression)
Distilling large language models into smaller ones:
- Special loss for token-level knowledge
- Logit matching at the final layer
- Intermediate layer matching for deeper architectures
- Example: TinyLlama (1.1B) distilled from Llama 2 (7B+)
3. Task-specific distillation
Fine-tune a general teacher on a specific domain, then distill: for example, GPT-4 distilled into a 7B model for code generation or instruction following.
4. Data-free distillation
When you don’t have access to the original training data, generate synthetic data from the teacher (use the teacher to label generated samples), or use an adversarial setup to create informative samples.
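The first option can be sketched with random inputs and a toy linear teacher (`synth_batch` is hypothetical; real systems use a learned generator or adversarial search rather than pure noise):

```python
import torch
import torch.nn.functional as F

def synth_batch(teacher, n=64, T=2.0):
    # Data-free sketch: draw random inputs and let the teacher label
    # them with temperature-softened soft targets.
    X = torch.randn(n, teacher.in_features)  # toy: teacher is an nn.Linear
    with torch.no_grad():
        y_soft = F.softmax(teacher(X) / T, dim=-1)
    return X, y_soft  # train the student on (X, y_soft) as usual
```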
Distillation vs other compression techniques
| | Distillation | Quantization | Pruning |
|---|---|---|---|
| Mechanism | Train small from large | Reduce weight precision | Remove weights |
| Quality | Best — leverages teacher knowledge | Good (minor loss at INT8) | Variable |
| Speed | Smaller model = faster | Faster matmuls | Depends on sparsity |
| Combination | Can combine with quantization | Can combine with distillation | Can combine with distillation |
Applications
- Deploy smaller models in production (DistilBERT is 60% of BERT’s size and retains 97% of its performance)
- Mobile/edge deployment
- Using GPT-4 outputs to train smaller open models
- Specialized models from general models
Key papers
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015) — arXiv:1503.02531
- Born-Again Neural Networks (Furlanello et al., 2018) — self-distillation
- TinyBERT (Jiao et al., 2020) — BERT distillation
- MiniLM (Wang et al., 2020) — deep self-attention distillation for language models
- A Survey on Knowledge Distillation (Gou et al., 2021) — arXiv:2006.05525
Links
- Quantization — complementary compression technique
- LoRA and PEFT — parameter-efficient fine-tuning
- Transfer Learning — foundation
- Cross-Entropy and KL Divergence — the losses used