Generating text is a memory problem before it is a math problem. To produce the next token, an autoregressive model has to read every weight it plans to use — and at a batch size of one, the interactive case, the chip spends most of its time waiting on memory, not computing. A modern GPU reads those weights from HBM at a few terabytes per second. That sounds enormous until you do the arithmetic on a 70-billion-parameter model and realize the bandwidth, not the FLOPs, is what caps your tokens per second.

Groq, Cerebras, and SambaNova all noticed the same thing and made the same bet: skip HBM, keep the weights in on-chip SRAM, which moves data roughly an order of magnitude faster. That is the whole story of why custom inference silicon beats GPUs on latency. Everything interesting is in how each one pays for that SRAM, and what it costs you.

Groq: determinism as a feature

Groq's Language Processing Unit carries roughly 500MB of SRAM per chip and no HBM at all. The clever part is not just the memory — it is that Groq's compiler schedules the entire execution graph, including chip-to-chip communication, down to the clock cycle. Nothing is decided at runtime, so there is no scheduler jitter and no cache-miss surprise. Groq's slogan, "determinism is speed," is marketing, but it is also accurate: predictable timing is what lets you pipeline many chips into one long, smooth assembly line.

On GroqCloud you get an OpenAI-compatible endpoint serving the usual open menu — Llama, Qwen, DeepSeek distills, gpt-oss, Mixtral, Gemma — at prices like $0.59 in / $0.79 out per million tokens for Llama 3.3 70B. On gpt-oss-120b, Artificial Analysis's independent measurement puts Groq around 476 tokens/sec. Respectable, and cheaper than the competition, but not the speed crown.

The catch is baked into that 500MB. A 70B model in FP8 is ~70GB of weights. You cannot fit that on one LPU, so Groq shards it across a rack of them. Speed comes from the rack, not the chip.

Cerebras: the whole model on one wafer

Cerebras took the brute-force route and refused to cut the wafer into chips at all. The WSE-3 is a single piece of silicon the size of a dinner plate: 4 trillion transistors, 900,000 cores, and 44GB of on-chip SRAM delivering an aggregate ~21 petabytes/second of memory bandwidth. That is thousands of times an H100's HBM bandwidth, and it means a 70B-class model can live entirely on-chip with no off-wafer hop for weights.

The numbers are the loudest in the category. Cerebras claims 2,100 tokens/sec on Llama 3.1-70B and up to 3,000 tokens/sec on gpt-oss-120b. Note claims — those are company figures on Cerebras's own cloud. Artificial Analysis, measuring the live API, clocked gpt-oss-120b nearer 1,758 tokens/sec. Still first by a wide margin, and a useful reminder to read benchmark bylines.

Cerebras's own number for gpt-oss-120b is 3,000 tokens/sec; the independent measurement is ~1,758. Both are true; only one is the marketing one.

The catch: 44GB is generous for one chip but small for the frontier. Models bigger than the wafer require wafer-scale-cluster gymnastics, the model menu is narrow, and pricing is famously hard to pin down. You rent speed; you do not buy a card.

SambaNova: the one that admits SRAM is too small

SambaNova is the interesting outlier because it designed around the SRAM ceiling instead of pretending it away. The SN40L RDU is a reconfigurable dataflow chip with three memory tiers: 520MB of on-chip SRAM, 64GB of co-packaged HBM, and up to 1.5TB of off-package DDR per socket. The published technical work (the "Scaling the AI Memory Wall" paper) is explicit that this hierarchy exists to hold trillion-parameter and many-expert workloads — cold weights sit in DDR, hot ones stream up into HBM and SRAM as the dataflow graph needs them.

That is why SambaNova can serve the full, non-distilled DeepSeek-R1 671B on 16 RDUs at a per-user speed it pegs around 200–250 tokens/sec (company claim), and why it can host hundreds of model checkpoints on one node. On gpt-oss-120b, Artificial Analysis measured it around 693 tokens/sec — behind Cerebras, ahead of Groq. The tradeoff is honest: tiering costs you peak latency versus an all-SRAM design, and you buy capacity with it.


How to actually choose

The decision is not "which is fastest." It is "am I even bound by the thing these chips fix?" Token-generation latency only dominates when you are decoding interactively at low batch size. The moment you are running a throughput-bound batch job, you can saturate a GPU with a large batch, amortize the weight reads, and running open models yourself on GPUs with self-hosted serving engines often wins on cost per token. Speed silicon shines per-request, not per-token-at-scale.

The honest summary: these are not faster computers, they are differently-shaped memory systems. SRAM is the fast lane and SRAM is tiny, so the real product is how many chips it takes to hold your model — and whether your workload is the kind that the fast lane even helps.