For most of the last three years, choosing between a TPU and a GPU to serve a large language model was a decision you could make with two facts and no benchmark. Fact one: TPUs were cheaper per token. Fact two: getting your model onto one meant a rewrite in JAX. The second fact is the one that ended the conversation. Almost nobody ships their inference stack twice, so almost everybody stayed on the GPU they already had, filed "TPUs are cheaper" under someday, and moved on.

Both facts are now wrong — or at least, they've stopped being the facts that matter. And the reason to look again isn't that TPUs got faster. It's that the two things that used to separate the platforms — the silicon and the software port — have quietly collapsed, which pushes the real decision onto an axis most teams have never had to think about.

The chips converged

Line up Google's Ironwood (TPU v7) against Nvidia's B200 and the spec sheet is almost boring. Ironwood posts about 4,614 TFLOPS of FP8 compute per chip; the B200 lands around 4,500. Both carry 192 GB of HBM3E. Memory bandwidth is 7.37 TB/s on Ironwood versus 8.0 TB/s on the B200. On the three numbers that used to decide a serving buy (compute, capacity, bandwidth) the gap is inside ten percent, and it runs in different directions depending on which number you pick: Ironwood is ahead by about 2.5% on FP8 compute, the B200 is ahead by about 8% on bandwidth, and capacity is a dead tie.
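To make the "inside ten percent" claim concrete, here is a minimal sketch that computes the signed gap on each of the three numbers. The figures are the approximate per-chip values quoted above; the dictionary keys and function names are illustrative only.

```python
# Per-chip figures as quoted in the text (approximate vendor numbers).
SPECS = {
    "ironwood": {"fp8_tflops": 4614.0, "hbm_gb": 192.0, "hbm_tb_s": 7.37},
    "b200": {"fp8_tflops": 4500.0, "hbm_gb": 192.0, "hbm_tb_s": 8.0},
}


def relative_gap(a: float, b: float) -> float:
    """Signed gap of a versus b, as a fraction of the larger value."""
    return (a - b) / max(a, b)


def compare(specs: dict) -> dict:
    """Positive values mean Ironwood leads; negative values mean the B200 leads."""
    tpu, gpu = specs["ironwood"], specs["b200"]
    return {metric: relative_gap(tpu[metric], gpu[metric]) for metric in tpu}


gaps = compare(SPECS)
for metric, gap in gaps.items():
    leader = "tie" if gap == 0 else ("Ironwood" if gap > 0 else "B200")
    print(f"{metric:>10}: {abs(gap):5.1%}  ({leader})")
```

The sign flips between compute and bandwidth, which is the point: no single chip wins all three rows.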

When the per-chip numbers agree to within a rounding error, the per-chip numbers stop being the answer.

So if you're comparing a single Ironwood chip to a single B200 to serve a model that fits on one of them, stop. It genuinely does not matter, and the GPU comes with a deeper toolchain. This is the same lesson the H100-vs-H200-vs-A100 shootout keeps landing for single-accelerator workloads: past a point, the spec sheet is a tiebreaker, not a decision.

The port stopped being a rewrite

The bigger change is in software, and it's the one that actually reopens the question. In October 2025, vLLM shipped tpu-inference, a hardware plugin that gives TPUs a single lowering path for both JAX and PyTorch. A standard PyTorch model — the thing you already have — is translated to optimized TPU code through the XLA compiler via Torchax, with SPMD (Single Program, Multiple Data) as the default so the compiler shards the model across chips for you. There's no CUDA in the picture and, crucially, no rewrite: you change the serving backend, not the model.

It's also just faster than the TPU path that came before it. vLLM reported 2–5x throughput improvements over the prior backend — 3.6x on Llama 3.1-8B on a single v6e, 2.1x on Llama 3.1-70B across a v6e-8. That's TPU-vs-old-TPU, not TPU-vs-GPU, but it matters here for a different reason: it means the portability win didn't cost you a performance tax. The reason to stay on Nvidia "because the model already runs there" is now a config flag away from being untrue. This is the same trajectory the Trainium-vs-Nvidia story followed — the non-CUDA accelerator stops being a rewrite and starts being a backend — and it's why "just use vLLM" is increasingly a hardware-agnostic sentence.
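As a rough illustration of "change the serving backend, not the model," the sketch below builds a `vllm serve` command line. The `serve` subcommand and the `--tensor-parallel-size` and `--max-model-len` options are standard vLLM CLI arguments; the helper function, the model choice, and the context length are assumptions made for the example. The article does not name the flag or install step that selects the TPU backend, so neither appears here.

```python
import shlex


def vllm_serve_command(model: str, tensor_parallel_size: int, max_model_len: int) -> list[str]:
    """Build a `vllm serve` invocation (hypothetical helper).

    Nothing in the argument list names the accelerator: the model ID and the
    parallelism degree are the same whichever backend ends up serving it.
    """
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
    ]


# Illustrative values: a 70B model sharded eight ways, echoing the v6e-8 figure above.
cmd = vllm_serve_command("meta-llama/Llama-3.1-70B-Instruct", 8, 8192)
print(shlex.join(cmd))
```

The model is referenced by the same identifier it would have on a GPU host, which is what makes the port a deployment change and not a code change.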

So what's actually left to decide? The network.

Here is the one number on the comparison that is not a coin flip, and it's off by two orders of magnitude. Nvidia's scale-up fabric, NVLink, connects up to 72 GPUs into one coherent domain in a GB200 NVL72 rack. Google's ICI fabric connects up to 9,216 Ironwood chips into a single pod. That's not a ten-percent edge; it's a 128x difference in how many accelerators can talk to each other at scale-up speed before traffic has to fall back to the slower scale-out network.
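The arithmetic behind that gap is short enough to check by hand. This sketch uses only the domain sizes quoted above plus the 192 GB per-chip HBM figure from the spec comparison; the aggregate-memory lines are a derived illustration, in decimal terabytes.

```python
import math

NVL72_GPUS = 72            # GPUs in one GB200 NVL72 NVLink domain (from the text)
IRONWOOD_POD_CHIPS = 9216  # chips in one Ironwood ICI pod (from the text)
HBM_GB_PER_CHIP = 192      # per-chip HBM on both parts (from the spec comparison)

# How many times larger the ICI pod is than the NVLink domain.
ratio = IRONWOOD_POD_CHIPS / NVL72_GPUS
orders_of_magnitude = math.log10(ratio)

# Aggregate HBM reachable at scale-up speed, in decimal TB.
nvl72_hbm_tb = NVL72_GPUS * HBM_GB_PER_CHIP / 1000
pod_hbm_tb = IRONWOOD_POD_CHIPS * HBM_GB_PER_CHIP / 1000

print(f"domain size ratio: {ratio:.0f}x (~{orders_of_magnitude:.1f} orders of magnitude)")
print(f"HBM per NVL72 domain: {nvl72_hbm_tb:.1f} TB")
print(f"HBM per Ironwood pod: {pod_hbm_tb:.1f} TB")
```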

For a lot of workloads this is a spec you will never touch. A dense model that fits — with its KV cache — inside one NVLink domain doesn't care that a bigger fabric exists, any more than a laptop app cares about a supercomputer. The fabric only becomes the deciding feature when your serving unit — the smallest chunk of hardware one copy of the model needs — spills past a single domain. Three things push you there:

That's the real decision rule, and it's not "which chip is faster." It's: does my model's serving topology exceed one NVLink domain? If the answer is no — and for most teams serving most models, it is — you should stay on the GPU you already run, because you gain nothing and you keep the broader ecosystem. If the answer is yes, the TPU's fabric is the actual product you're buying, and the cheaper-per-token headline is a consequence of that fabric, not an independent reason.
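One part of that rule can be sketched as a capacity check: do the weights plus the KV cache for one replica fit in the HBM of a single NVLink domain? This is a minimal sketch under stated assumptions. It tests memory capacity only; the `usable_fraction` headroom is a guess, the class and function names are hypothetical, and the second model shape is invented purely to show the failing case. The first shape uses the published Llama 3.1 70B layer and head counts.

```python
from dataclasses import dataclass

GB = 10**9


@dataclass(frozen=True)
class ModelShape:
    """Illustrative description of one model replica."""
    params: float            # total parameter count
    bytes_per_param: float   # 1.0 for FP8 weights, 2.0 for BF16
    layers: int
    kv_heads: int
    head_dim: int
    kv_bytes: float          # bytes per cached K or V element


def weight_bytes(m: ModelShape) -> float:
    return m.params * m.bytes_per_param


def kv_cache_bytes(m: ModelShape, tokens_in_flight: int) -> float:
    # One K and one V vector per layer, per KV head, per cached token.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.kv_bytes * tokens_in_flight


def fits_in_domain(m: ModelShape, tokens_in_flight: int, chips: int = 72,
                   hbm_gb: float = 192, usable_fraction: float = 0.9) -> bool:
    """True if weights plus KV cache fit in the domain's usable HBM.

    usable_fraction is an assumed allowance for activations and runtime overhead.
    """
    budget = chips * hbm_gb * GB * usable_fraction
    return weight_bytes(m) + kv_cache_bytes(m, tokens_in_flight) <= budget


dense_70b = ModelShape(70e9, 2.0, layers=80, kv_heads=8, head_dim=128, kv_bytes=2.0)
hypothetical_large = ModelShape(2e12, 2.0, layers=120, kv_heads=16, head_dim=128, kv_bytes=2.0)

print(fits_in_domain(dense_70b, tokens_in_flight=1_000_000))
print(fits_in_domain(hypothetical_large, tokens_in_flight=10_000_000))
```

With these inputs the 70B model sits comfortably inside one domain, while the invented 2-trillion-parameter shape with ten million cached tokens does not. Only the second case makes the fabric size a live question.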

Read the cost claim carefully

Which brings us to the money, because "TPUs are cheaper" is still the sentence everyone repeats. Google's own accounting, as reported by SemiAnalysis, puts the per-chip total cost of ownership for a full Ironwood pod at roughly 44% below a GB200 server, enough to swamp the ~10% shortfall on peak FLOPs and bandwidth in that same GB200 comparison. That's a real and large number. It is also a cost-of-goods number, computed by the one company that both builds the chip and runs the datacenter. You are not that company. You rent the chip on Google Cloud at a price Google sets, and, unlike a GPU, you can't buy the silicon and run it in your own rack or a competitor's cloud. So the TPU economics question isn't "is the hardware cheaper," which you can't act on directly; it's "is a GCP commitment at Google's TPU rental price cheaper than my GPU option," which is a procurement negotiation, not a benchmark. Weigh it next to the self-hosting-vs-API cost math with that framing, not the datacenter TCO framing.
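The comparison a renter can actually run looks something like the sketch below: turn an hourly quote and a measured throughput into cost per million tokens. Every number here is a placeholder, not a real price or benchmark; substitute the quotes from your own negotiation and the throughput you measure on your own model. The function name and the `utilization` parameter are assumptions for illustration.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Rental cost per million generated tokens for one serving unit.

    utilization is the fraction of each paid hour spent serving traffic.
    """
    if hourly_price_usd < 0 or tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("price must be >= 0, throughput > 0, utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_price_usd / tokens_per_hour * 1_000_000


# Placeholder inputs only: hypothetical quotes and hypothetical measured throughput.
gpu_cost = cost_per_million_tokens(hourly_price_usd=10.0, tokens_per_second=5000.0)
tpu_cost = cost_per_million_tokens(hourly_price_usd=7.0, tokens_per_second=4500.0)

print(f"GPU option: ${gpu_cost:.3f} per 1M tokens")
print(f"TPU option: ${tpu_cost:.3f} per 1M tokens")
```

Note that the datacenter TCO figure never enters this calculation. The only inputs are the price you are quoted and the throughput you observe.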

The honest 2026 summary: the chips tied, the software port evaporated, and the decision migrated to the interconnect — where it always secretly lived. If your model fits in one NVLink domain, none of this changes your Tuesday. If it doesn't, the question is finally worth re-opening, because for the first time the only thing standing between you and a 9,216-chip fabric is a backend flag instead of a rewrite.