The question arrives the moment an LLM feature gets traction: we're spending real money on API calls — should we just run the model ourselves? It feels like an engineering question with an engineering answer, and the napkin math looks seductive. An open 70B model is free. A rented H100 is a few dollars an hour. The API is billing you per token. Surely, at volume, owning the silicon wins.

It can. But the number that decides it is not the one everyone reaches for. It is not the API's token price, and it is not the GPU's hourly rate. It is utilization — what fraction of the time your GPU is actually serving a request instead of sitting warm and idle, billing you for nothing.

The fixed-cost trap

Here is the asymmetry that the per-token framing hides. An API bills you for tokens. A GPU bills you for time. A rented H100 runs you somewhere between $2 and $4 an hour on-demand — community clouds like RunPod at the low end, hyperscalers and Lambda higher — and it charges that whether it served a million tokens that hour or zero.

So your real cost per token is a division problem:

cost per token = (GPU $/hour) ÷ (tokens served that hour)

Plug in real figures. A modern serving stack like vLLM with continuous batching can push a 70B model (quantized so the weights fit on one card) to anywhere from a few hundred to roughly 800 aggregate output tokens per second on a single H100-class GPU. Take $2.50/hour and 700 tokens/second:

$2.50 ÷ (700 × 3,600) × 1,000,000 ≈ $0.99 per million output tokens
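The same division can be written as a small helper so it is easy to rerun with your own numbers. This is a minimal sketch: the function name and parameters are illustrative, and the $2.50/hour and 700 tokens/second inputs are simply the figures used above, not measurements.

```python
def cost_per_million_tokens(gpu_dollars_per_hour: float,
                            tokens_per_second: float) -> float:
    """Dollars per one million tokens for a GPU that is billed by the hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600  # seconds in an hour
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


# Figures from the text: $2.50/hour, 700 aggregate output tokens/second.
print(f"${cost_per_million_tokens(2.50, 700):.2f} per million output tokens")
```

Note that the throughput argument is whatever the GPU actually delivers over the billed hour, which is where utilization enters.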

That's genuinely competitive — it's in the neighborhood of what the cheapest open-model API hosts charge for a 70B model (Together AI and similar). For a moment the napkin math looks vindicated.

Now change one variable. Suppose your traffic only keeps that GPU busy 10% of the time, a normal pattern for an internal tool, a B2B product with business-hours load, or anything bursty. You still pay $2.50 for the full hour, but you only served 360 seconds' worth of tokens, which averages out to 70 tokens per second across the hour. The same arithmetic now reads:

$2.50 ÷ (70 × 3,600) × 1,000,000 ≈ $9.92 per million output tokens

Same hardware. Same model. Ten times the cost per token — and now the API, at well under a dollar per million, beats you by an order of magnitude. Nothing about the model changed. Only how busy it stayed.
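To see how quickly the cost climbs as the GPU idles, and where it crosses an API price, the calculation can be swept over utilization. The sketch below makes assumptions: the function names are illustrative, peak throughput is taken as the 700 tokens/second used above, and the $2.00 per million API price is a hypothetical placeholder to be replaced with your provider's posted rate.

```python
def cost_at_utilization(gpu_dollars_per_hour: float,
                        peak_tokens_per_second: float,
                        utilization: float) -> float:
    """Dollars per million tokens when the GPU is busy `utilization` of the hour."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_tps = peak_tokens_per_second * utilization
    return gpu_dollars_per_hour / (effective_tps * 3600) * 1_000_000


def break_even_utilization(gpu_dollars_per_hour: float,
                           peak_tokens_per_second: float,
                           api_dollars_per_million: float) -> float:
    """Utilization at which the GPU's cost per token equals the API's.

    A result above 1.0 means raw GPU rent alone never beats that API price.
    """
    cost_when_full = cost_at_utilization(
        gpu_dollars_per_hour, peak_tokens_per_second, 1.0)
    return cost_when_full / api_dollars_per_million


HYPOTHETICAL_API_PRICE = 2.00  # dollars per million tokens; an assumption

for u in (1.0, 0.5, 0.25, 0.10):
    print(f"{u:>4.0%} busy: ${cost_at_utilization(2.50, 700, u):.2f} per million")

threshold = break_even_utilization(2.50, 700, HYPOTHETICAL_API_PRICE)
print(f"break-even utilization: {threshold:.0%}")
```

Cost per token scales as one over utilization, so halving how busy the GPU stays doubles what each token costs.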

A GPU is a taxi with the meter always running. The API is a bus. Below a certain ridership, the bus is absurdly cheaper per passenger — not because its engine is better, but because it's full.

Why the API wins below the line

This is the part the "it's free at scale" pitch leaves out. When you self-host, you are the only tenant on that GPU, so its idle time is pure waste you eat. When you call an API, the provider is packing your requests in with thousands of other customers' requests onto the same hardware. From the GPU's perspective it's near-100% utilized; from your perspective you only paid for your slice. Statistical multiplexing — the same principle that lets an airline oversell seats or a cloud oversubscribe CPUs — means the API can sell you effectively-full-utilization economics that a single tenant almost never reaches alone. You are not paying a markup for convenience. Below the break-even, you are paying less than your own idle GPU would cost. (For the related batch-vs-realtime tier decision, see LLM batch API vs realtime cost, and for trimming the bill on either path, how to reduce AI agent token costs.)
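The multiplexing effect can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: each tenant independently needs one GPU with a fixed probability in any time slot, and the shared pool is sized to cover demand in 99.9% of slots. Real provider fleets are more complicated, so treat the output as a picture of the principle rather than an estimate of any provider's utilization.

```python
from math import comb


def pooled_utilization(tenants: int,
                       busy_probability: float,
                       coverage: float = 0.999) -> float:
    """Approximate average utilization of a shared GPU pool.

    Concurrent demand is binomial(tenants, busy_probability). The pool is
    sized to the smallest capacity that covers `coverage` of time slots.
    """
    cumulative = 0.0
    capacity = tenants
    for k in range(tenants + 1):
        cumulative += (comb(tenants, k)
                       * busy_probability ** k
                       * (1 - busy_probability) ** (tenants - k))
        if cumulative >= coverage:
            capacity = max(k, 1)  # a pool needs at least one GPU
            break
    mean_demand = tenants * busy_probability
    return mean_demand / capacity


# One tenant alone versus progressively larger shared pools,
# each tenant busy 10% of the time.
for n in (1, 10, 200):
    print(f"{n:>3} tenants: {pooled_utilization(n, 0.10):.0%} utilized")
```

A single tenant gets exactly its own busy fraction, while the pool's utilization rises with tenant count because independent bursts rarely line up.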

The hidden multiplier

Even the optimistic $1/million figure is too low, because raw GPU rent is not the bill. Self-hosting adds costs the API quietly absorbs for you.

Industry break-even analyses tend to land on a 2–3x multiplier over raw GPU cost once these are counted, and on the same conclusion: you need sustained, high volume — the kind that keeps the GPU genuinely busy — before owning the stack pays off. Before you size any of it, how much VRAM it takes to serve an LLM and where to run a long-running agent set the floor on what you're actually renting.

The decision, honestly

Run the division for your own traffic. Estimate the tokens per second you'll actually sustain (your average, not your peak), divide the GPU's hourly cost by the tokens that rate produces in an hour, and compare the result to the API's posted rate. If your sustained utilization is high and predictable, self-hosting wins and keeps winning as you grow, with inference prices trending downward on both sides. If it's bursty, seasonal, or just early, the API is cheaper and less work, and it will stay that way until your traffic keeps the GPU busy all day.
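One way to run that division from real traffic is to start with a log of tokens served per hour. The sketch below is illustrative only: the function names are made up for this example, the hourly counts are invented rather than measured, and the 2.5x overhead multiplier is simply the midpoint of the 2–3x range quoted earlier.

```python
def average_tokens_per_second(hourly_token_counts: list[int]) -> float:
    """Sustained throughput over the whole log, not the busiest hour."""
    return sum(hourly_token_counts) / (len(hourly_token_counts) * 3600)


def all_in_cost_per_million(gpu_dollars_per_hour: float,
                            hourly_token_counts: list[int],
                            overhead_multiplier: float = 2.5) -> float:
    """Dollars per million tokens for a GPU rented for every hour in the log."""
    total_tokens = sum(hourly_token_counts)
    if total_tokens <= 0:
        raise ValueError("the log contains no served tokens")
    total_cost = (gpu_dollars_per_hour * len(hourly_token_counts)
                  * overhead_multiplier)
    return total_cost / total_tokens * 1_000_000


# Hypothetical business-hours workload: quiet overnight, busy 09:00 to 18:00.
hypothetical_day = [50_000] * 9 + [1_000_000] * 9 + [50_000] * 6

peak_tps = max(hypothetical_day) / 3600
avg_tps = average_tokens_per_second(hypothetical_day)
print(f"peak {peak_tps:.0f} tok/s, average {avg_tps:.0f} tok/s")
print(f"${all_in_cost_per_million(2.50, hypothetical_day):.2f} per million, all-in")
```

Compare the printed figure against the API's posted rate for the same model; sizing the estimate from the peak hour instead of the average would understate the self-hosted cost.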

The token price on the pricing page is the number everyone argues about. It's the wrong number. The one that decides whether you should own a GPU is how often it would be idle — and most teams, if they measured honestly before they built, would find the answer is "too often."