There is a strange fact at the center of modern LLM inference: when one user is waiting on one response, the GPU doing the work is almost entirely idle. Not idle as in "off", but idle as in its thousands of arithmetic units are sitting around waiting for memory. Generating a single token means streaming every one of the model's weights out of high-bandwidth memory and through the compute units once. The math is trivial; the moving is everything. Roofline analyses of single-stream decoding land deep in the memory-bound region, with arithmetic intensity near 1 FLOP per byte and the compute hardware utilized in the low single digits of percent.
Speculative decoding is the trick that notices this idle compute and spends it.
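The memory-bound claim is easy to sanity-check with arithmetic. The sketch below assumes a hypothetical 70-billion-parameter model stored in 16-bit weights on an accelerator with 3.35 TB/s of memory bandwidth. Both figures are illustrative assumptions, not measurements, and the estimate ignores KV-cache traffic and activations.

```python
# Back-of-envelope bound for single-stream decoding.
# All numbers are illustrative assumptions, not measurements.

def min_ms_per_token(n_params: float, bytes_per_param: float,
                     mem_bw_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when every weight is read once per token."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / mem_bw_bytes_per_s * 1e3


def arithmetic_intensity(bytes_per_param: float, tokens_per_pass: int = 1) -> float:
    """FLOPs per byte of weights moved: ~2 FLOPs (multiply + add) per weight per token."""
    return 2.0 * tokens_per_pass / bytes_per_param


# Hypothetical setup: 70e9 parameters, 2 bytes each, 3.35e12 bytes/s of bandwidth.
ms = min_ms_per_token(70e9, 2.0, 3.35e12)
ai = arithmetic_intensity(2.0)
print(f"at least {ms:.1f} ms per token, arithmetic intensity ~{ai:.0f} FLOP/byte")
```

Under these assumptions the floor is about 42 ms per token no matter how fast the arithmetic units are, and scoring five tokens in one pass raises the intensity fivefold without moving any more weights.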
The bet
Autoregressive generation is sequential by definition: token n+1 depends on every token up to n, so you take one forward pass per token, and each pass pays the full memory tax. Speculative decoding breaks the sequence into a guess-and-check loop. A cheap drafter proposes several tokens ahead, say five. Then the expensive target model runs one forward pass over all five at once and decides how many it agrees with.
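If you're paying to load the weights anyway, verifying five tokens in that single load is nearly free. When the drafter is good, you advance five tokens (plus one bonus token the target supplies itself) for the price of one slow pass plus some cheap drafting. When it's bad, you fall back to roughly normal speed, minus the time wasted on drafting. Either way you never go backward: every verification pass yields at least one token the target itself chose.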
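In code, one round of the loop looks roughly like this. It is a minimal sketch for greedy decoding only, with toy next-token functions (`toy_target` and `toy_draft`, both hypothetical) standing in for real models. A production engine gets all of the target's predictions from one batched forward pass rather than the per-position calls used here for clarity.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function (toy stand-in for a model)


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One guess-and-check round under greedy decoding. Returns newly committed tokens."""
    # 1. Draft k tokens autoregressively with the cheap model.
    guesses: List[int] = []
    for _ in range(k):
        guesses.append(draft(prefix + guesses))

    # 2. Verify. A real engine computes these k+1 target predictions in ONE pass.
    committed: List[int] = []
    for i in range(k):
        expected = target(prefix + guesses[:i])
        if guesses[i] != expected:
            committed.append(expected)  # first mismatch: keep the target's token and stop
            return committed
        committed.append(guesses[i])
    committed.append(target(prefix + guesses))  # all accepted: bonus token
    return committed


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, n_new: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also return how many verification rounds it took."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < n_new:
        out += speculative_step(out, draft, target, k)
        rounds += 1
    return out[: len(prefix) + n_new], rounds


# Toy models: the target counts upward mod 10; the drafter agrees except after a 7.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10


tokens, rounds = generate([0], toy_draft, toy_target, k=4, n_new=12)
print(tokens, rounds)
```

With these toy models the output matches plain greedy decoding with the target token for token, but the 12 new tokens take 3 verification rounds instead of 12 target passes. Every round commits at least one token, which is the "never go backward" property.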
The part people get wrong: this is lossless. It is not "a smaller model good enough most of the time." The verification step (Leviathan et al., 2022; DeepMind's Chen et al., 2023) uses rejection sampling so that the accepted sequence is distributed exactly as if the big model had produced every token itself.
Speculative decoding doesn't trade quality for speed. It trades idle silicon for speed, and proves the trade was free.
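The losslessness claim can be checked directly. Below is a sketch of the single-token accept-or-resample rule described in the speculative sampling papers, where `p` is the target's distribution over the vocabulary and `q` is the drafter's. The function names and the three-token vocabulary are mine, chosen for illustration. `output_distribution` computes analytically what the rule emits.

```python
import random
from typing import List, Tuple


def verify_token(x: int, p: List[float], q: List[float],
                 rng: random.Random) -> Tuple[bool, int]:
    """x was sampled from the draft distribution q. Returns (accepted, emitted_token)."""
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return True, x
    # On rejection, resample from the residual distribution max(p - q, 0), renormalized.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return False, rng.choices(range(len(p)), weights=residual)[0]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token under the rule above."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q[x] * min(1, p[x] / q[x])
    reject_mass = 1.0 - sum(accept)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    return [a + (reject_mass * r / total if total > 0 else 0.0)
            for a, r in zip(accept, residual)]


p = [0.5, 0.3, 0.2]  # target distribution (made-up numbers)
q = [0.2, 0.2, 0.6]  # draft distribution (made-up numbers)
print(output_distribution(p, q))
```

Up to floating-point rounding, the emitted distribution equals `p` even though the drafter here is badly miscalibrated. A poor drafter costs you acceptances, not correctness.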
The only number that matters
Strip away the architectures and one relationship governs all of it: your speedup is a function of acceptance rate (the fraction of drafted tokens the target keeps) against draft cost (how expensive the guessing is, relative to a target pass). A drafter that's brilliant but as slow as the target buys you nothing. A drafter that's instant but always wrong buys you nothing. Every method below is an attempt to push acceptance up while keeping the drafter cheap.
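Leviathan et al. give this relationship in closed form under simplifying assumptions: each drafted token is accepted independently with probability `alpha`, drafting one token costs a fraction `c` of a target pass, and verifying `gamma` drafted tokens costs the same as one ordinary target pass. A sketch of that formula follows; the function names are mine.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target pass, counting the bonus/correction token."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-clock speedup over plain decoding: tokens gained / time spent."""
    return expected_tokens_per_pass(alpha, gamma) / (gamma * c + 1.0)


# The two degenerate drafters from the text:
print(expected_speedup(alpha=1.0, gamma=5, c=1.0))   # brilliant but as slow as the target
print(expected_speedup(alpha=0.0, gamma=5, c=0.0))   # instant but always wrong
# A plausible middle case (illustrative numbers only):
print(expected_speedup(alpha=0.8, gamma=5, c=0.05))
```

Both degenerate drafters come out at exactly 1.0, meaning no speedup, while the middle case lands a little under 3x. Note that a drafter with 80% acceptance that costs as much as the target (`c=1.0`) comes out below 1.0, which is a slowdown.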
The first generation used a separate small model as the drafter — a 7B model drafting for a 70B, say. It works, but it has a tax the papers undersell: you now have to find, align, and serve a second model whose vocabulary and behavior track the big one closely enough to be accepted. Two models in the serving path, two things to deploy.
Self-speculation: deleting the draft model
Medusa (Cai et al., 2024) made the field's first real leap by deleting the separate model. Instead it bolts a handful of extra decoding heads directly onto the base model (head 1 predicts the token two positions ahead, head 2 three positions ahead, and so on), then uses tree attention to verify several candidate continuations in a single pass. No second model to align; the heads ride on features the base model already computed. Medusa-1 keeps the backbone frozen, trains only the heads, and stays lossless; Medusa-2 fine-tunes the backbone together with the heads for higher acceptance, at the cost of a careful training recipe.
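Tree attention is what lets one pass score several branching continuations: the candidates are packed into a single sequence, and a mask lets each candidate token see only its own ancestors (plus the committed prefix, omitted here). The following is a generic sketch of building such a mask from a parent list. It is not Medusa's actual code, and the `parents` encoding is an assumption made for illustration.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when candidate node i may attend to candidate node j.

    parents[i] is the index of node i's parent, or -1 if node i hangs directly
    off the committed prefix. Parents must be listed before their children.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, parent in enumerate(parents):
        if parent >= 0:
            mask[i] = list(mask[parent])  # start from everything the parent can see
        mask[i][i] = True                 # every node sees itself
    return mask


# Two first-position candidates (nodes 0 and 1); node 0 has two children, node 1 has one.
parents = [-1, -1, 0, 0, 1]
for row in tree_attention_mask(parents):
    print([int(v) for v in row])
```

Nodes 2 and 3 each see node 0 but not each other, so the two branches are scored as if each were the only continuation, all in the same forward pass.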
Medusa's weakness is structural: its heads predict each future position independently. Token three is guessed without conditioning on the guess for token two. Real language isn't independent across positions, so acceptance falls off as you reach further ahead.
EAGLE (Li et al., 2024) fixes exactly that. It runs a tiny autoregressive drafter, but (the non-obvious move) it drafts at the feature level, extrapolating the model's second-to-top-layer hidden states rather than predicting tokens directly, then mapping those features to tokens at the end. Drafting in feature space restores the position-to-position dependency Medusa threw away, which is why EAGLE accepts longer runs. The original paper clocks it at 3x over vanilla decoding and 1.6x over Medusa on MT-Bench, with no fine-tuning of the target model. EAGLE-3 (2025) drops EAGLE's feature-prediction constraint in favor of predicting tokens directly, fuses low-, mid-, and high-level features from several layers of the target, and simulates test-time drafting during training, pushing the reported speedup into the ~3–6.5x range and topping the speculative-sampling leaderboard.
The catch the benchmarks hide
Here is the line item that should change how you deploy this: the entire win is a batch-size-1 win. The free compute exists because a single stream is memory-bound. Stack up concurrent requests and the server fills its arithmetic units with real work — it becomes compute-bound, the idle capacity speculative decoding was spending evaporates, and the extra verification FLOPs now compete with paying customers. On a saturated serving fleet, naive speculative decoding can reduce throughput.
That's why this isn't a default. In vLLM and SGLang speculative decoding is a per-deployment switch — point it at an EAGLE head or an n-gram drafter, set the speculative token count, and the engine handles the rejection-sampling correctness. The right setting depends entirely on your traffic shape. If you're serving one latency-sensitive user at a time — a coding assistant, a local model, a low-QPS internal tool — turn it on; EAGLE-3 is close to free real estate. If you're running a high-throughput batched endpoint at capacity, measure before you trust it, because the GPU fact that makes speculative decoding brilliant for one user is the same fact that makes it a liability for a thousand.
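The crossover from memory-bound to compute-bound can be sketched with a crude roofline model: a forward pass takes whichever is longer, streaming the weights or doing the math. The hardware figures below (3.35 TB/s of bandwidth, 1e15 FLOP/s of usable 16-bit compute) and the 70-billion-parameter model are illustrative assumptions, and the model ignores attention cost and KV-cache traffic.

```python
def step_time_ms(batch: int, tokens_per_seq: int, n_params: float,
                 bytes_per_param: float, mem_bw: float, peak_flops: float) -> float:
    """Roofline estimate of one forward pass: max(weight-streaming time, compute time)."""
    memory_s = n_params * bytes_per_param / mem_bw
    compute_s = 2.0 * n_params * batch * tokens_per_seq / peak_flops
    return max(memory_s, compute_s) * 1e3


# Hypothetical hardware and model; none of these are measured values.
HW = dict(n_params=70e9, bytes_per_param=2.0, mem_bw=3.35e12, peak_flops=1e15)

for batch in (1, 32, 256):
    plain = step_time_ms(batch, 1, **HW)    # ordinary decoding: 1 token per sequence
    verify = step_time_ms(batch, 6, **HW)   # verifying 5 drafted tokens + 1 bonus position
    print(f"batch={batch:4d}  plain={plain:7.1f} ms  verify={verify:7.1f} ms  "
          f"cost ratio={verify / plain:.2f}")
```

Under these assumptions the verification pass costs the same as a plain pass at batch 1 and at batch 32, but at batch 256 it costs roughly five times as much, so the drafter would need to land more than five of six tokens per round just to break even.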



