There are three ways to make a large language model smaller, and only one of them changes what the model is. You can quantize it — store the same weights at lower numeric precision, FP16 down to INT8 or INT4. You can prune it — delete the weights that turn out not to matter. Both keep the original model and make it cheaper to run. The third, knowledge distillation, throws the original architecture away and trains a brand-new, smaller model to behave like the big one.
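To make the first two options concrete, here is a minimal sketch on a plain list of weights. It assumes symmetric per-tensor INT8 quantization and simple magnitude pruning, which are only the simplest variants of each idea; the function names and the example weights are made up for illustration, and production methods are considerably more elaborate.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: integer codes plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map the integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    k = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Illustrative weights, not taken from any real model.
weights = [0.82, -0.03, 0.41, -1.27, 0.002, 0.56]

codes, scale = quantize_int8(weights)      # same weights, coarser storage
restored = dequantize(codes, scale)        # close to the originals, not identical
pruned = prune_by_magnitude(weights, 0.5)  # half the weights set to zero
```

In both cases the output is still the original model, approximated. Nothing here could turn it into a different, smaller architecture.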
That distinction is the whole point. Quantization and pruning shrink a model. Distillation moves a capability across a size class — from a teacher too expensive to serve into a student you can actually deploy. It is the only one of the three that can hand a 70B model's skill to a 7B model, or a transformer's skill to a different shape entirely. And because the student is trained, distillation is really a flavor of fine-tuning — one where the labels come from a model instead of a human.
The founding trick: copy the doubt, not just the answer
Hinton, Vinyals, and Dean named the field in 2015 with one observation. When a trained network classifies an image of a dog, it doesn't just output "dog" — it outputs a small probability for "wolf," a smaller one for "cat," and a vanishingly small one for "car." Those ratios encode what the model has learned about how the classes relate. Hard labels throw all of it away and keep only the winner. Soft targets — the teacher's full distribution — keep what they called the dark knowledge, and a student trained to match the distribution learns far more per example than one trained on the answer alone. (They raised the softmax temperature to make the small probabilities legible, the same temperature knob that governs sampling.)
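A minimal sketch of the soft-target idea, in plain Python. The logits below are invented for the dog example, and the loss is written as KL(teacher || student), which differs from the soft-target cross-entropy only by a term that does not depend on the student. The T² factor follows the scaling Hinton et al. recommend so gradient magnitudes stay comparable across temperatures.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) at the given temperature, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


classes = ["dog", "wolf", "cat", "car"]
teacher_logits = [9.0, 5.0, 3.0, -4.0]  # made-up logits for an image of a dog

sharp = softmax(teacher_logits, temperature=1.0)  # nearly all mass on "dog"
soft = softmax(teacher_logits, temperature=4.0)   # "wolf" and "cat" become visible

untrained_student = [0.0, 0.0, 0.0, 0.0]
loss = soft_target_loss(untrained_student, teacher_logits, temperature=4.0)
```

At temperature 1 the teacher looks almost like a hard label. At temperature 4 the ordering dog > wolf > cat > car is unchanged, but the smaller probabilities are large enough to contribute a meaningful training signal.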
The proof that this is practical, not just elegant, is DistilBERT (Sanh et al., 2019): distilled during pretraining, it came out roughly 40% smaller (66M parameters to BERT-base's 110M) and 60% faster, while retaining about 97% of BERT's score on the GLUE language-understanding benchmark. That ratio — keep almost all the quality, pay a fraction of the cost — is why distillation became a default rather than a research curiosity.
The shift that matters: from copying answers to grading attempts
Everything above is offline distillation: you collect a fixed pile of teacher outputs and train the student to reproduce them. Kim & Rush (2016) had already pushed this from matching per-token distributions to training on the teacher's whole generated sequences for translation. But offline distillation has a structural flaw that the LLM era made impossible to ignore — exposure bias.
An offline student only ever sees the teacher's sequences in training, so every prefix it learns to continue is one the teacher wrote. At inference it has to continue from its own prefixes, mistakes included, and recover from them: a situation it never once practiced.
The fix is on-policy distillation. In Agarwal et al.'s Generalized Knowledge Distillation (GKD, 2023), the student generates its own outputs during training, and the teacher grades them token by token. Now the student trains on exactly the distribution it will face at inference, learning to recover from its own errors instead of from a perfection it can't reproduce. A parallel line — MiniLLM (Gu et al., 2023) — swapped the usual forward KL for reverse KL, so a small student stops wastefully trying to cover every mode of the teacher's distribution and instead concentrates its limited capacity on the dominant ones. The supervision signal had moved: from "reproduce this answer" to "make your own attempt and I'll tell you where it went wrong."
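The on-policy loop can be sketched with a toy model. Everything here is an assumption made for illustration and not the GKD implementation: a three-token vocabulary, a "teacher" that is just a fixed table of next-token distributions keyed by the previous token, and a "student" that is a table of logits. The sketch uses forward KL for the per-token grade because its gradient with respect to the logits is simply (student probabilities minus teacher probabilities); swapping in reverse KL would change only that line.

```python
import math
import random

VOCAB = 3      # toy vocabulary: token ids 0, 1, 2
START = -1     # hypothetical context id meaning "beginning of sequence"
SEQ_LEN = 4

# Hypothetical teacher: a fixed next-token distribution for each previous token.
TEACHER = {
    START: [0.7, 0.2, 0.1],
    0: [0.1, 0.8, 0.1],
    1: [0.2, 0.2, 0.6],
    2: [0.5, 0.3, 0.2],
}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sample_from_student(student_logits, rng):
    """The student writes its own attempt; return the contexts it passed through."""
    contexts, context = [], START
    for _ in range(SEQ_LEN):
        contexts.append(context)
        q = softmax(student_logits[context])
        context = rng.choices(range(VOCAB), weights=q)[0]
    return contexts


def on_policy_step(student_logits, rng, lr=0.5):
    """One update: student generates, teacher grades every position it visited."""
    for context in sample_from_student(student_logits, rng):
        q = softmax(student_logits[context])
        p = TEACHER[context]  # teacher's opinion at the student's own state
        # Gradient of KL(p || q) with respect to the student's logits is (q - p).
        student_logits[context] = [
            z - lr * (qi - pi) for z, qi, pi in zip(student_logits[context], q, p)
        ]


rng = random.Random(0)
student = {ctx: [0.0] * VOCAB for ctx in TEACHER}  # starts uniform everywhere

before = kl(TEACHER[START], softmax(student[START]))
for _ in range(200):
    on_policy_step(student, rng)
after = kl(TEACHER[START], softmax(student[START]))
```

The detail that matters is in `sample_from_student`: the contexts being graded come from the student's own samples, not from a stored set of teacher sequences. An offline variant would draw those contexts from the teacher instead, and the student would only be corrected in states the teacher tends to reach.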
DeepSeek-R1 said the quiet part
The blunt evidence arrived in January 2025. The DeepSeek-R1 team distilled their reasoning model into six smaller dense students (Qwen and Llama, 1.5B to 70B) using nothing fancier than supervised fine-tuning on ~800,000 teacher-generated reasoning traces. The distilled 32B scored 72.6% on AIME 2024 and 94.3% on MATH-500, and the team ran the control everyone wanted: they applied large-scale RL directly to the 32B base model, with no distillation. The distilled version won. In their words, small models relying on large-scale RL "may not even achieve the performance of distillation."
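"Supervised fine-tuning on teacher traces" is mostly a data-preparation job, and the shape of that job fits in a few lines. The sketch below is an illustration under stated assumptions, not the DeepSeek pipeline: `fake_teacher` stands in for a call to a real teacher model, the `<think>` tag layout and the final-answer check are assumed formats, and the keep-only-correct filter is one common choice and is not described in this article.

```python
import json


def build_sft_records(problems, teacher_generate, is_correct):
    """Collect teacher traces as prompt/completion pairs, keeping only correct ones."""
    records = []
    for problem in problems:
        trace = teacher_generate(problem["question"])
        if is_correct(trace, problem["answer"]):
            records.append({"prompt": problem["question"], "completion": trace})
    return records


def fake_teacher(question):
    """Hypothetical stand-in for a teacher model; the second trace is deliberately wrong."""
    canned = {
        "What is 2 + 3?": "<think>2 plus 3 is 5.</think> 5",
        "What is 7 * 6?": "<think>7 times 6 is 48.</think> 48",
    }
    return canned[question]


def final_answer_matches(trace, answer):
    """Assumed format: the final answer is whatever follows the closing think tag."""
    return trace.split("</think>")[-1].strip() == answer


problems = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 7 * 6?", "answer": "42"},
]

records = build_sft_records(problems, fake_teacher, final_answer_matches)
jsonl = "\n".join(json.dumps(r) for r in records)  # one training example per line
```

The student is then fine-tuned on these pairs with an ordinary next-token loss. No teacher logits are needed, which is part of why the recipe is so cheap to run.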
Read that as a division of labor, not a contradiction of RL. Reinforcement learning — GRPO, PPO and their kin — is how you grow a new capability in a frontier model that doesn't have it yet. Distillation is how you copy an existing capability into a small model for a fraction of the compute. Capability is cheaper to transfer than to discover.
What this means for builders
Distillation has quietly become a product. OpenAI ships Model Distillation in its API (stored completions from a large model, an eval harness, fine-tuning on the captured pairs); Google offers it in Vertex AI; the recipe is no longer a research artifact. The practitioner takeaway is to stop treating the compression axes as rivals: distill the behavior you need into a student in the right size class, then quantize that student for the last cost cut. The teacher's expensive intelligence is the thing you're buying once and serving cheaply forever, and the lesson of the last decade is that you transfer it best not by making the student memorize the teacher's answers, but by making it practice and having the teacher mark the work.



