Speculative decoding has spent two years as the inference trick everyone had heard of and few had turned on. It works: a small draft model guesses the next several tokens, the big model verifies the whole guess in one pass, and you keep every token it accepts almost for free. But wiring it up meant choosing a draft method, tuning a drafter, and living on a code path that felt like a lab feature. SGLang's v0.5.13, shipped in June 2026, quietly changes that posture. The release makes Spec V2 the default speculative-decoding path and deprecates Spec V1. The interesting word there isn't default. It's deprecates.
Why a default flip is a bigger deal than a benchmark
Vendors announce speedups constantly, and the numbers age badly. A change to which path is the default, paired with a deprecation of the old one, is a more durable signal: it means the maintainers now trust the feature enough to hand it to every user who doesn't opt out, and are confident enough in the replacement to start sunsetting what came before. Speculative decoding just crossed from "expert knob you enable and tune" to "the way SGLang decodes, unless you say otherwise." That's the graduation.
The consolidation underneath it is the substance. Previously EAGLE and MTP — two different ways of producing the draft tokens — lived on separate code paths. V2 folds both onto a single unified worker. One worker instead of two means the drafting logic, the verification, and the backends all share a maintained core rather than diverging, which is how a feature stops being fragile. It's the unglamorous plumbing that turns a research capability into a default you can leave on in production.
Tree drafting is where the throughput actually is
The capability that graduated alongside the default is tree drafting with topk > 1, now production-ready across the Triton, FlashAttention-3, MLA, and AITER backends. This is the part worth understanding if you care about the numbers.
Plain speculative decoding proposes one linear guess: the drafter says "the next five tokens are probably A B C D E," and the target verifies that single line. The moment the target disagrees at token C, you lose D and E along with it. Tree drafting proposes a tree instead, with several candidate tokens at each position, so the target verifies many possible continuations in one pass and accepts the longest path that holds up. Higher topk means more branches per step, which means a higher chance that a long run survives verification. Because the whole tree is checked in a single forward pass, more accepted tokens per pass is close to free throughput. This is the mechanism that separates a modest speculative speedup from a large one, and it's the reason EAGLE-style methods beat the earlier draft-model approaches: better-shaped drafts get accepted more often.
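The difference between a linear draft and a tree draft can be sketched in a few lines. This is a toy illustration, not SGLang's implementation: `target_next` is a hypothetical stand-in for the target model's greedy choice, the draft trees are written out by hand, and a real engine scores every node in one batched forward pass rather than walking the tree in a loop.

```python
from typing import Dict, List, Sequence, Tuple

Token = str
# A draft tree maps a path (tuple of tokens) to the candidates proposed after it.
DraftTree = Dict[Tuple[Token, ...], List[Token]]


def target_next(prefix: Sequence[Token]) -> Token:
    """Hypothetical stand-in for the target model's greedy next token."""
    truth = ["A", "B", "X", "D", "E"]
    return truth[len(prefix)] if len(prefix) < len(truth) else "<eos>"


def accept_longest_path(tree: DraftTree) -> List[Token]:
    """Follow the draft tree for as long as the target agrees with a candidate."""
    path: Tuple[Token, ...] = ()
    while True:
        wanted = target_next(path)
        if wanted not in tree.get(path, []):
            return list(path)  # no drafted candidate matches; stop here
        path = path + (wanted,)


# topk = 1: a single chain A B C D E.
linear: DraftTree = {
    (): ["A"],
    ("A",): ["B"],
    ("A", "B"): ["C"],
    ("A", "B", "C"): ["D"],
    ("A", "B", "C", "D"): ["E"],
}

# topk = 2: a second candidate at each position along the explored branches.
tree: DraftTree = {
    (): ["A", "Q"],
    ("A",): ["B", "R"],
    ("A", "B"): ["C", "X"],
    ("A", "B", "C"): ["D"],
    ("A", "B", "X"): ["D", "S"],
    ("A", "B", "X", "D"): ["E"],
}

linear_accepted = accept_longest_path(linear)
tree_accepted = accept_longest_path(tree)
print("linear:", linear_accepted)
print("tree:  ", tree_accepted)
```

The linear draft dies at the third position because its only guess there is C. The tree also carries X at that position, so the same verification step keeps going down the branch that matches.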
None of this changes what the model outputs. Speculative decoding is lossless by construction — the target model verifies every token, so you get exactly the distribution you'd have gotten decoding one token at a time, just faster. (It does not fix the nondeterminism that comes from batching and floating-point reduction order; that's a separate problem living a layer down.)
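For sampled (non-greedy) decoding, the usual way this guarantee is obtained is the accept-or-resample rule from the speculative sampling literature: accept a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalised positive part of p - q. The sketch below checks, with exact fractions, that one emitted token then follows the target distribution p. The distributions are made up for illustration, and whether SGLang's sampling path uses exactly this rule is an assumption here, not something the release notes state.

```python
from fractions import Fraction as F
from typing import Dict

Dist = Dict[str, F]


def speculative_output_dist(p: Dist, q: Dist) -> Dist:
    """Exact distribution of one emitted token under accept-or-resample.

    p is the target distribution, q the drafter's; assumes q[x] > 0 for all x.
    """
    # Probability that x is drawn from q and then accepted.
    accepted = {x: q[x] * min(F(1), p[x] / q[x]) for x in p}
    reject_mass = 1 - sum(accepted.values())
    # On rejection, resample from the normalised positive part of p - q.
    residual = {x: max(F(0), p[x] - q[x]) for x in p}
    norm = sum(residual.values())
    return {
        x: accepted[x] + (reject_mass * residual[x] / norm if norm else F(0))
        for x in p
    }


# Toy distributions, assumed for illustration only.
p = {"a": F(1, 2), "b": F(3, 10), "c": F(1, 5)}
q = {"a": F(1, 5), "b": F(1, 2), "c": F(3, 10)}

out = speculative_output_dist(p, q)
print(out == p)
```

The drafter here is badly miscalibrated, and the output still matches p exactly. A worse drafter costs acceptance rate, not correctness.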
The migration nobody put in the headline
Here's the cost the release notes don't lead with. If you never used speculative decoding, v0.5.13 is a clean upgrade and you get the new default for nothing. If you did — if you tuned an EAGLE or MTP drafter, set worker parallelism, or built deployment configs around V1's separate path — merging the two implementations onto one worker moved the knobs. Your old configuration isn't guaranteed to map cleanly onto the unified worker, and V1 is now the deprecated path you don't want to keep building on. This is the recurring shape of maturing infrastructure: the feature gets easier for newcomers and slightly disruptive for the people who invested early in the old surface.
So treat it as a migration, not a version bump. Re-verify your drafter and worker config against V2, confirm your acceptance rates didn't move, and get off V1 while it's still present rather than after it's removed. The through-line for anyone choosing a serving engine right now is that speculative decoding is no longer a differentiator you switch on — it's becoming table stakes, on by default, and the engines are competing on how well their default is tuned. SGLang just made its bet on which implementation that default should be.
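A minimal sketch of the acceptance-rate check, assuming you can record how many draft tokens were accepted at each verification step on both paths. The function names, the 5% tolerance, and the sample numbers are all hypothetical; how you collect the counts depends on your deployment's own logging or metrics.

```python
from typing import Sequence


def mean_accept_length(accepted_per_step: Sequence[int]) -> float:
    """Average number of draft tokens accepted per verification pass."""
    return sum(accepted_per_step) / len(accepted_per_step)


def regressed(before: Sequence[int], after: Sequence[int],
              tolerance: float = 0.05) -> bool:
    """True if the new mean is more than `tolerance` (relative) below the old."""
    old = mean_accept_length(before)
    new = mean_accept_length(after)
    return new < old * (1 - tolerance)


# Made-up sample data; substitute counts recorded from your own V1 and V2 runs
# on the same prompts.
v1_steps = [3, 2, 4, 3, 3, 2, 4, 3]
v2_steps = [3, 3, 4, 3, 2, 3, 4, 3]

migration_regressed = regressed(v1_steps, v2_steps)
print(mean_accept_length(v1_steps), mean_accept_length(v2_steps), migration_regressed)
```

Run the comparison on the same prompt set for both paths. Acceptance depends heavily on the workload, so numbers from different traffic are not comparable.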



