Three of the most-cited names in LLM systems (FlashAttention, PagedAttention, FlashInfer) share vocabulary and a marketing instinct. All three are about attention. All three imply fast. So engineers treat them as a menu: pick one, ship it, move on.
That's the wrong shape. These are not competitors. They sit at three different layers of the stack, solving three different problems, and a serious inference server runs all three at once. The mental model you want is kernel vs allocator vs engine — not a bake-off.
FlashAttention: a kernel that respects the memory hierarchy
FlashAttention (Dao et al., 2022) is a GPU kernel. Its job is to compute the exact attention output, softmax(QKᵀ/√d)V with d the head dimension, without ever writing the full N×N score matrix to the GPU's main memory.
Why does that matter? A GPU has two relevant tiers of memory: large, slow HBM (high-bandwidth memory, the multi-gigabyte pool) and tiny, fast on-chip SRAM. A naive attention kernel computes the N×N scores, writes that whole matrix to HBM, reads it back to apply softmax, writes again, reads again for the V multiply. For a long sequence the score matrix is enormous, and the kernel spends most of its wall-clock time shuttling it across the slow HBM bus. Implemented this way, attention is memory-bound, not compute-bound: the FLOPs are cheap, the IO is not.
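To put rough numbers on "enormous", here is a back-of-the-envelope sketch. The sequence length, head count, and head dimension are illustrative assumptions, not figures from the paper, and the helper names are hypothetical.

```python
def score_matrix_bytes(seq_len, num_heads, bytes_per_elem=2):
    """Bytes needed to materialize one N x N score matrix per head (FP16 by default)."""
    return seq_len * seq_len * num_heads * bytes_per_elem


def qkv_bytes(seq_len, num_heads, head_dim, bytes_per_elem=2):
    """Bytes occupied by the Q, K, and V inputs themselves."""
    return 3 * seq_len * num_heads * head_dim * bytes_per_elem


# Assumed shapes: 8192 tokens, 32 heads, head dimension 128, one sequence, one layer.
n, heads, dim = 8192, 32, 128
scores = score_matrix_bytes(n, heads)   # 4 GiB
inputs = qkv_bytes(n, heads, dim)       # 192 MiB
ratio = scores / inputs                 # equals n / (3 * dim)

print(f"scores: {scores / 2**30:.0f} GiB, inputs: {inputs / 2**20:.0f} MiB, ratio: {ratio:.1f}x")
```

Under these assumptions the intermediate matrix is about twenty times larger than the inputs it was computed from, and the ratio grows linearly with sequence length.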
FlashAttention fixes this with tiling and an online (streaming) softmax. It loads blocks of Q, K, and V into SRAM, computes a partial result, carries running softmax statistics (the row maximum and the row sum) forward, and never writes the full intermediate matrix to HBM. Each tile of the score matrix is born and dies in SRAM. The original paper reported up to a 7.6× speedup on the attention computation, and crucially the result is mathematically the standard attention: outputs match a reference implementation up to floating-point rounding, because only the order of summation changes.
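The online softmax can be sketched in plain Python for a single query row. This is a minimal illustration of the rescaling trick, not the CUDA kernel: the real implementation also tiles over Q and runs on-chip, and the function names and block size here are assumptions made for the example.

```python
import math
import random


def reference_attention(q, keys, values):
    """Attention for one query row, materializing every score first."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def streaming_attention(q, keys, values, block=4):
    """Same output, but K/V are visited one block at a time.

    Only three running quantities survive between blocks: the largest score
    seen so far, the softmax denominator, and the unnormalized output row.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(8)]
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]
values = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(10)]

ref = reference_attention(q, keys, values)
out = streaming_attention(q, keys, values, block=4)
max_diff = max(abs(a - b) for a, b in zip(ref, out))
print(f"largest difference from the reference: {max_diff:.2e}")
```

The two functions agree to within rounding error even though the streaming version never holds more than one block of scores at a time. The sequence length (10) is deliberately not a multiple of the block size, so the last block is partial.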
FlashAttention is exact. It does not approximate softmax, drop tokens, or sparsify anything — it just refuses to write the big matrix to slow memory.
This is the point people miss most often. FlashAttention is not in the family of approximate-attention tricks. It changes the memory-access pattern, not the math. And because it's a kernel that makes the core op faster and lighter, it helps both training and inference.
FlashAttention-2 (2023) re-partitioned the work across thread blocks and warps for roughly another 2× on A100. FlashAttention-3 (2024) targets Hopper (H100) specifically, exploiting asynchronous Tensor Cores and adding FP8 support — reaching up to ~740 TFLOPs/s in FP16 and close to 1.2 PFLOPs/s in FP8. The lineage is the same idea, tuned to each generation of silicon.
PagedAttention: an allocator, not a kernel
PagedAttention (Kwon et al., SOSP 2023 — the vLLM paper) lives one layer up and solves a completely different problem: the KV cache.
During serving, every token a model processes appends key and value vectors (one pair per layer and KV head) to a per-request cache. Older systems allocated that cache as one big contiguous slab sized for the maximum possible sequence length. Most requests never fill it, so the reserved-but-unused tail is dead weight (internal fragmentation), and the gaps left between slabs are external fragmentation. The vLLM paper measured existing systems filling only 20–38% of reserved KV memory with actual token state.
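A quick calculation shows how slab allocation wastes memory. The request lengths and the maximum length below are made-up illustrative values, not measurements from the paper, and the helper name is hypothetical.

```python
def slab_utilization(actual_lengths, max_len):
    """Fraction of reserved KV slots that hold real tokens when every
    request gets one contiguous slab of max_len slots."""
    reserved = max_len * len(actual_lengths)
    return sum(actual_lengths) / reserved


# Hypothetical batch: four requests, each reserved for 2048 tokens.
lengths = [120, 512, 64, 900]
util = slab_utilization(lengths, max_len=2048)
print(f"utilization: {util:.1%}")
```

With these assumed lengths only 1596 of 8192 reserved slots are used, which is the internal fragmentation described above. External fragmentation between slabs would lower the figure further and is not modeled here.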
The fix borrows directly from operating systems. PagedAttention stores the KV cache in fixed-size blocks that need not be contiguous in memory, with a block table mapping logical positions to physical blocks — exactly how an OS pages virtual memory onto physical RAM. Fragmentation collapses; the paper reports packing roughly 96% of memory with real data. Blocks can also be shared across requests with a common prefix, which is how prefix caching and parallel sampling stop duplicating identical KV. The payoff is throughput: 2–4× over prior systems like FasterTransformer and Orca at the same latency.
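The block-table idea can be sketched as a toy allocator. This is an illustration under assumptions, not vLLM's implementation: the class and method names are hypothetical, no KV data is actually stored, and the copy-on-write step only swaps block ids where a real system would also copy the block's contents.

```python
class PagedKVAllocator:
    """Toy allocator mapping logical token positions to physical block ids."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # unused physical block ids
        self.refcount = {}                   # block id -> sequences sharing it
        self.block_tables = {}               # seq id -> list of physical block ids
        self.lengths = {}                    # seq id -> number of tokens stored

    def add_sequence(self, seq_id):
        self.block_tables[seq_id] = []
        self.lengths[seq_id] = 0

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables[seq_id]
        offset = self.lengths[seq_id] % self.block_size
        if offset == 0:
            # No block yet, or the last block is full: take a fresh one.
            block = self.free.pop()
            self.refcount[block] = 1
            table.append(block)
        elif self.refcount[table[-1]] > 1:
            # Copy-on-write: the partly filled tail block is shared, so this
            # sequence switches to a private block before writing into it.
            self.refcount[table[-1]] -= 1
            block = self.free.pop()
            self.refcount[block] = 1
            table[-1] = block
        self.lengths[seq_id] += 1
        return table[-1], offset

    def fork(self, parent_id, child_id):
        """Let a child sequence share the parent's blocks (a common prefix)."""
        table = list(self.block_tables[parent_id])
        for block in table:
            self.refcount[block] += 1
        self.block_tables[child_id] = table
        self.lengths[child_id] = self.lengths[parent_id]

    def free_sequence(self, seq_id):
        """Release a sequence; a block returns to the pool once nobody uses it."""
        for block in self.block_tables.pop(seq_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)
        del self.lengths[seq_id]


alloc = PagedKVAllocator(num_blocks=8, block_size=4)
alloc.add_sequence("a")
for _ in range(6):
    alloc.append_token("a")      # 6 tokens occupy 2 blocks
alloc.fork("a", "b")             # "b" shares both blocks, no new memory
alloc.append_token("b")          # tail block was shared, so "b" gets its own
print(alloc.block_tables, "free blocks:", len(alloc.free))
```

After the fork and one more token, the two sequences still share their first block and differ only in the tail. Waste per sequence is bounded by one partly filled block, which is why utilization climbs so sharply compared with max-length slabs.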
Note what PagedAttention does not do: it doesn't change how attention is computed. It changes where the cache lives. It's an allocator. (If you're tuning the cache itself, KV cache quantization is the orthogonal lever — cheaper bytes, same paging.)
FlashInfer: the engine that composes the other two
Here's the seam. A flash-style kernel wants nicely laid-out, contiguous-ish blocks of K and V. PagedAttention deliberately scatters them. Someone has to write the kernel that runs flash-style attention over paged, non-contiguous KV, and do it for prefill, decode, prefix-shared batches, and several GPU generations.
That someone is FlashInfer (best paper, MLSys 2025). It's an attention engine: a kernel library that represents the KV cache as a block-sparse / composable format, JIT-compiles attention templates for your specific shapes, and load-balances across the ragged batch sizes that real serving produces. Under the hood it dispatches to FlashAttention-2/3-style kernels — but over the paged layout, not a flat tensor.
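What "flash-style attention over a paged layout" means can be shown with a small pure-Python sketch: the kernel walks a block table, fetches each physical block wherever it happens to live, and folds it into a running softmax. This is not FlashInfer's API or its kernel code; the pool layout, function names, and block size are assumptions chosen for illustration.

```python
import math
import random

BLOCK_SIZE = 4  # tokens per physical block (assumed for the example)


def contiguous_attention(q, keys, values):
    """Reference: one query row over a flat, contiguous KV cache."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def paged_attention(q, k_pool, v_pool, block_table, seq_len):
    """Same result, reading KV block by block through a block table."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(v_pool[0][0])
    running_max, running_sum, acc = -math.inf, 0.0, [0.0] * dim
    for logical_idx, physical in enumerate(block_table):
        # The final block may be only partly filled.
        valid = min(BLOCK_SIZE, seq_len - logical_idx * BLOCK_SIZE)
        k_blk = k_pool[physical][:valid]
        v_blk = v_pool[physical][:valid]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(1)
dim, seq_len, num_physical = 8, 10, 6
q = [rng.gauss(0, 1) for _ in range(dim)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]

# Scatter the sequence across non-adjacent physical blocks 5, 0, 3.
block_table = [5, 0, 3]
k_pool = [[[0.0] * dim for _ in range(BLOCK_SIZE)] for _ in range(num_physical)]
v_pool = [[[0.0] * dim for _ in range(BLOCK_SIZE)] for _ in range(num_physical)]
for pos in range(seq_len):
    physical = block_table[pos // BLOCK_SIZE]
    k_pool[physical][pos % BLOCK_SIZE] = keys[pos]
    v_pool[physical][pos % BLOCK_SIZE] = values[pos]

flat = contiguous_attention(q, keys, values)
paged = paged_attention(q, k_pool, v_pool, block_table, seq_len)
paged_diff = max(abs(a - b) for a, b in zip(flat, paged))
print(f"largest difference, paged vs contiguous: {paged_diff:.2e}")
```

The paged path produces the same output as the contiguous one to within rounding, which is the property a production engine has to preserve while also handling batching, partial blocks, and hardware-specific tuning.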
This is why it's the unifying layer. FlashInfer is where the compute kernel and the memory layout actually meet, which is why it now ships inside vLLM, SGLang, TensorRT-LLM, and TGI. The split between prefill and decode, two phases with wildly different shapes, is exactly the kind of dynamism its scheduler exists to absorb.
The stack, top to bottom
So when you run vLLM with FlashInfer on an H100, you're not picking a winner. PagedAttention decides where the KV bytes live. FlashInfer decides which kernel to launch over them and JIT-compiles it. A FlashAttention-3 kernel does the exact attention math in SRAM, FP8 if you ask. Three layers, three problems, one request.
The "vs" in the title is a trap the vocabulary set for you. If you've ever debugged a serving stack and wondered why your engine has both a paged allocator and a flash kernel and an attention library, that's not redundancy — that's the architecture. The interesting comparisons (like vLLM vs TensorRT-LLM vs TGI) are between engines that assemble these layers differently, not between the layers themselves. Choose your engine. The kernel, the allocator, and the composition layer come as a set.



