There are two different conversations hiding inside "I need the model to return valid JSON," and most teams only have the first one. The first is "how do I describe the shape I want?": a Pydantic model, a JSON schema, a grammar. That's the structured-output library layer, and it's well-trodden. The second conversation happens one level down, inside the inference server, on every single decoding step: given the grammar, which of the 100,000-odd tokens in the vocabulary are legal right now, and how fast can I mask the rest? Three projects own that layer, and the way to tell them apart is to follow how the problem itself moved.

Outlines: proving you can constrain at all

The foundational move belongs to Outlines and the paper behind it, Willard & Louf (2023). The idea is elegant: take a regex or a JSON schema and compile it into a finite-state machine. (Context-free grammars get the same treatment, with a parser tracking the nesting that a plain FSM cannot.) At each generation step you know which FSM state you're in, so you know exactly which tokens keep the output valid. Mask the logits of all the others to negative infinity, sample from what's left, and malformed output becomes not unlikely but impossible. No retries, no "please respond only with JSON," no parsing prayers.
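As a rough illustration of the masking step, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the eight-token vocabulary, the hand-written DFA for the pattern -?[0-9]+, and the helper names (step, walk, build_index, constrained_pick). It is not Outlines' API, and greedy argmax stands in for sampling.

```python
import math

DIGITS = "0123456789"
# Toy DFA for -?[0-9]+ : 0 = start, 1 = saw '-', 2 = saw at least one digit.
STATES = (0, 1, 2)

def step(state, ch):
    """Return the next DFA state, or None if `ch` is illegal in `state`."""
    if ch in DIGITS:
        return 2
    if ch == "-" and state == 0:
        return 1
    return None

def walk(state, token):
    """Run a whole (possibly multi-character) token through the DFA."""
    for ch in token:
        state = step(state, ch)
        if state is None:
            return None
    return state

# Hypothetical vocabulary; real tokenizers have ~100k entries.
VOCAB = ["-", "1", "42", "-7", "abc", "3x", " ", "007"]

def build_index():
    """Precompute, once, which token ids are legal from each DFA state."""
    return {
        s: {i for i, tok in enumerate(VOCAB) if walk(s, tok) is not None}
        for s in STATES
    }

INDEX = build_index()

def constrained_pick(logits, state):
    """Mask illegal tokens to -inf, then take the best remaining token."""
    allowed = INDEX[state]
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    best = max(range(len(masked)), key=masked.__getitem__)
    token = VOCAB[best]
    return token, walk(state, token)

# The model "prefers" the illegal token "abc", but the mask removes it.
logits = [0.1, 0.5, 0.2, 0.3, 9.0, 8.0, 7.0, 0.0]
print(constrained_pick(logits, 0))
```

The point of the index is that the per-step work becomes a dictionary lookup plus a mask; the expensive token-by-token walk happens once, up front.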

dottxt-ai/outlines (Python, ★ 14k): the library that pioneered FSM-indexed guided generation, taking regex, JSON schema, and CFG to a logit mask

Outlines made structured generation correct by construction, and it remains an excellent library to reach for directly in Python. (It's now maintained by the .txt team under dottxt-ai, with a separate Rust outlines-core for the hot path.) But once "can we constrain?" was answered yes, a second question surfaced — and it's a performance question.

The throughput tax

Here's the cost nobody mentions in the demo. That mask has to be recomputed every token. For a JSON schema with nested objects and a six-figure vocabulary, working out the legal set at each step is real work, and it sits squarely on the critical path of generation. Do it naively and you pay a per-token latency tax that's invisible on a single request and brutal at serving scale, where it stacks against batching and throughput. Correctness was solved; the bill came due on speed.
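A back-of-the-envelope model makes the scaling visible. The numbers below are invented for illustration, not measurements, and the model assumes the worst case: masks are computed one sequence at a time on a single CPU core before sampling, while the forward pass takes the same time at every batch size.

```python
def decode_step_us(forward_us, mask_us_per_seq, batch_size):
    """Latency of one decoding step when masks are computed serially."""
    return forward_us + mask_us_per_seq * batch_size

def overhead_pct(forward_us, mask_us_per_seq, batch_size):
    """Extra latency from masking, as a percentage of the forward pass."""
    total = decode_step_us(forward_us, mask_us_per_seq, batch_size)
    return 100 * (total - forward_us) // forward_us

# Hypothetical figures: 20 ms forward pass, 1 ms of mask work per sequence.
FORWARD_US = 20_000
MASK_US = 1_000

for batch in (1, 8, 64):
    print(batch, decode_step_us(FORWARD_US, MASK_US, batch),
          overhead_pct(FORWARD_US, MASK_US, batch))
```

Under these assumptions the overhead is 5% for a single request and 320% at a batch of 64, which is why the cost goes unnoticed in a demo and dominates in production.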

The first generation of constrained decoding made invalid output impossible. The second made valid output cheap. Those are different engineering problems.

XGrammar is the answer that won. It splits the vocabulary into context-independent tokens, whose validity depends only on the current position in the grammar and can therefore be pre-checked and cached once, and the smaller set of context-dependent tokens, whose validity also depends on the contents of the stack and must be evaluated at runtime. It keeps a persistent execution stack for fast pushdown-automaton transitions through nested structures, and crucially it co-designs with the inference engine to overlap grammar computation with the GPU's forward pass, so the masking hides behind work you were already doing. Its paper reports up to a 100x speedup on per-token grammar processing and, end-to-end, near-zero overhead.
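The token split can be sketched on a toy grammar. What follows is a heavily simplified illustration, not XGrammar's implementation or API: the language is nested brackets, the only grammar position considered is "inside at least one open bracket", the stack is reduced to a depth counter, and the vocabulary and function names are hypothetical.

```python
VOCAB = ["[", "]", "[]", "]]", "][", "[[", "a", "[a", "[]]"]

ACCEPT, REJECT, DEPENDENT = "accept", "reject", "dependent"

def is_valid(token, depth):
    """Full check that needs the real stack depth (the runtime path)."""
    for ch in token:
        if depth == 0:          # outermost bracket already closed
            return False
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        else:
            return False
    return True

def classify(token):
    """Decide what can be known without looking at the stack."""
    if any(ch not in "[]" for ch in token):
        return REJECT           # illegal at any depth
    rel = 0                     # depth relative to the current frame
    exited = False
    for ch in token:
        if exited:
            return DEPENDENT    # continues into the parent frame
        if ch == "[":
            rel += 1
        elif rel > 0:
            rel -= 1
        else:
            exited = True       # closes the current frame
    return ACCEPT               # legal at any depth >= 1

# Precomputed once, before any request arrives.
CACHED_ACCEPT = {t for t in VOCAB if classify(t) == ACCEPT}
DEPENDENT_TOKENS = [t for t in VOCAB if classify(t) == DEPENDENT]

def fast_mask(depth):
    """Per-step mask: cached set plus a runtime check of the few leftovers."""
    allowed = set(CACHED_ACCEPT)
    allowed.update(t for t in DEPENDENT_TOKENS if is_valid(t, depth))
    return allowed

def slow_mask(depth):
    """Naive mask: run the full check on every token in the vocabulary."""
    return {t for t in VOCAB if is_valid(t, depth)}

print(DEPENDENT_TOKENS)
print(all(fast_mask(d) == slow_mask(d) for d in range(1, 5)))
```

In this toy only two of the nine tokens need the runtime check; the other seven are settled in advance. The same asymmetry, at the scale of a real vocabulary, is what makes the cache worthwhile.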

mlc-ai/xgrammar (C++, ★ 1.8k): near-zero-overhead structured generation engine; context-independent token caching + GPU overlap

The third entrant attacks the same wall from a different language. llguidance is a Rust constraint engine — the one that powers the guidance library — built for raw mask-computation speed: it reports computing a full token mask in roughly 50 microseconds of single-core CPU time for a 128k-token tokenizer. It speaks a Lark-like grammar format and a large subset of JSON schema, and it's available inside vLLM as the guidance backend.

llguidance (Rust): grammar engine computing a token mask in ~50µs; backs the guidance library and serves as a vLLM backend

You probably don't pick the backend

The twist for anyone actually shipping: the choice is increasingly made upstream of you. vLLM's structured-outputs default is auto, which selects a backend per request and prefers XGrammar, falling back to guidance, outlines, or lm-format-enforcer when a request needs something XGrammar can't express. SGLang defaults to XGrammar and exposes the others behind --grammar-backend, explicitly recommending XGrammar "for its better performance." XGrammar is also wired into TensorRT-LLM and MLC-LLM.
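The per-request fallback can be pictured as a walk down a preference list. This sketch is not vLLM's code: the order comes from the paragraph above, while the feature names and the capability table are placeholders invented for the example and say nothing about what any backend actually supports.

```python
# Preference order as described in the text.
PREFERENCE = ["xgrammar", "guidance", "outlines", "lm-format-enforcer"]

# Invented placeholder capabilities, for illustration only.
HYPOTHETICAL_SUPPORT = {
    "xgrammar": {"feature_a"},
    "guidance": {"feature_a", "feature_b"},
    "outlines": {"feature_a", "feature_c"},
    "lm-format-enforcer": {"feature_a", "feature_d"},
}

def pick_backend(required_features):
    """Return the first preferred backend that covers every required feature."""
    for name in PREFERENCE:
        if set(required_features) <= HYPOTHETICAL_SUPPORT[name]:
            return name
    raise ValueError(f"no backend supports {sorted(required_features)}")

print(pick_backend({"feature_a"}))
print(pick_backend({"feature_a", "feature_c"}))
```

Most requests stop at the first entry; the rest of the list only matters for the ones that need something the preferred backend can't express.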

So the realistic decision tree is short. If you're self-hosting on vLLM or SGLang, you are almost certainly already on XGrammar, and the right move is to use structured outputs aggressively rather than fearing the latency — that fear is a 2023 reflex the engine has since fixed. Reach for Outlines as a library when you want grammar-constrained generation in Python without committing to a particular server, or for its broader, batteries-included ergonomics. Reach for llguidance when you're building on the guidance library or want a fast constraint engine to embed in a Rust stack. And reach for XGrammar deliberately only if your serving layer hasn't already reached for it on your behalf — which, more and more, it has.