Here is the thing that makes 2026 agent invoices genuinely confusing: the price of a token fell all year, and the bills went up anyway. Claude Opus-tier models are $5 per million input tokens; Sonnet-tier and open models are cheaper still. If cost were just price times volume and price is dropping, the arrow should point down. Instead, finance teams that were comfortable with a chatbot's spend opened their first month of agent bills and found a number with an extra digit.
The instinct is to blame the model, the framework, or a runaway prompt. All three are usually innocent. The real culprit is a property of the API itself — one that's easy to miss because it's invisible in a single call and only shows up when you multiply.
The API has no memory, so you pay for it every step
A chat-completions API is stateless. The model doesn't remember your last message; you remind it by re-sending the whole conversation on every request. For a chatbot that's cheap — one system prompt, a few turns of history, one reply. You pay for that context roughly once per user turn.
An agent is different in exactly one way that turns out to matter enormously: it runs a loop. Plan, call a tool, read the result, reason, call another tool, read that result, and so on — often twenty or fifty times to finish one task. And because the API is stateless, every step of that loop re-sends everything that came before it: the system prompt, the full tool schema, and the growing pile of prior tool results.
A chatbot pays for its system prompt once per turn. An agent on step 20 has paid for that same system prompt twenty times.
That repetition is the whole story. The new tokens a step adds are trivial. The tokens it re-reads are the bill.
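To make the re-sending concrete, here is a toy simulation of a stateless loop. It is only a sketch: the two token constants are illustrative assumptions (they match the round numbers used in the next section), and `simulate_loop` is a made-up helper, not part of any SDK.

```python
# Toy simulation of a stateless agent loop. Token counts are assumed
# constants, not measurements from a real model or tokenizer.
SYSTEM_AND_TOOLS = 10_000   # fixed prefix re-sent on every request (assumed)
NEW_PER_STEP = 2_000        # tool call + result appended each step (assumed)


def simulate_loop(steps: int) -> list[tuple[int, int, int]]:
    """Return (step, re_sent_tokens, new_tokens) for each request in the loop."""
    history = 0  # tokens of prior tool calls and results accumulated so far
    rows = []
    for step in range(1, steps + 1):
        # The API remembers nothing, so the whole prefix goes out again.
        re_sent = SYSTEM_AND_TOOLS + history
        rows.append((step, re_sent, NEW_PER_STEP))
        history += NEW_PER_STEP  # this step's call and result join the history
    return rows


step, re_sent, new = simulate_loop(20)[-1]
print(f"step {step}: re-sent {re_sent:,} tokens, new {new:,} tokens")
```

By step 20 the request carries 48,000 tokens the model has already seen and only 2,000 it hasn't.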
Why it's quadratic, with real numbers
Say the fixed overhead — system prompt plus tool definitions — is 10K tokens, and each step of the loop adds about 2K tokens of new context (a tool call and its result). At step n, the input you send is roughly 10K + 2K × n. Sum that across a 30-step task:
- Fixed overhead re-sent 30 times: 30 × 10K = 300K tokens.
- Growing history: 2K × (1 + 2 + … + 30) = 2K × 465 ≈ 930K tokens.
- Total: ~1.23M input tokens, for a single task.
At $5 per million, that's about $6 of input on one run, before you count a single output token. Now double the task to 60 steps. Linear intuition says the bill doubles. It doesn't: input climbs to about 4.26M tokens, roughly 3.5x, and the multiple approaches 4x on longer runs, because that 1 + 2 + … + n term grows with the square of the step count. The history you drag behind you is the dominant cost, and it compounds with depth.
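The arithmetic is easy to check in a few lines. This sketch uses the closed form of the sum, with the 10K and 2K figures and the $5 price from the text as defaults; `input_tokens` is a hypothetical helper name.

```python
def input_tokens(steps: int, fixed: int = 10_000, per_step: int = 2_000) -> int:
    """Total input tokens over a loop where step n sends fixed + per_step * n."""
    # Closed form of sum_{n=1..steps} (fixed + per_step * n).
    return fixed * steps + per_step * steps * (steps + 1) // 2


PRICE_PER_MILLION = 5.00  # USD per million input tokens, as quoted in the text

for steps in (30, 60):
    tokens = input_tokens(steps)
    cost = tokens / 1_000_000 * PRICE_PER_MILLION
    print(f"{steps} steps: {tokens:,} input tokens, ${cost:.2f}")
```

With these numbers, 30 steps comes to 1,230,000 tokens ($6.15) and 60 steps to 4,260,000 ($21.30). The multiple is about 3.5x here and drifts toward 4x as the history term swamps the fixed overhead.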
This is why "agents burn far more tokens than chatbots" keeps getting reported as a shocking multiple — 30x, 50x — in every FinOps writeup of the year. It isn't a shocking anomaly. It's arithmetic. The moment your product goes from "answer one message" to "run a reasoning loop over tools," you switch from a linear cost curve to a quadratic one, and no per-token price cut outruns a squared term.
It also explains the horror stories: two agents left in a loop passing requests back and forth, no ceiling, discovered days later by a billing dashboard. Those aren't exotic failures. They're the quadratic with the brakes removed.
The fix is caching, and it's fragile
The good news is that the expensive part of an agent's context is also the most repetitive part. The system prompt is identical every step. The tool schema is identical every step. The early tool results don't change once they've happened. That's precisely what prompt caching is for.
Under Anthropic's caching model, a cached prefix token bills at roughly one-tenth of the normal input price, against a one-time write premium of about 1.25x. Apply that to the loop above and the math inverts: the 10K of fixed overhead you were re-paying for at full price thirty times now bills at ~0.1x after the first write. The quadratic re-read is still happening, but on the cached portion you're paying cache-read rates, not input rates. On a long loop that's commonly a 5-10x cut, and it bends the curve back toward linear.
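Here is a rough model of that effect on the same 30-step loop. It assumes the idealized case: every step hits the cache for everything sent on the previous step, and only the newly appended tokens pay the write premium. It ignores cache expiry, any provider limits on what can be cached, and output tokens, so read it as a best-case sketch rather than a billing calculator. The multipliers are the approximate figures quoted above.

```python
BASE = 5.00 / 1_000_000   # USD per input token, the price quoted in the text
READ_MULT = 0.10          # approximate cache-read multiplier from the text
WRITE_MULT = 1.25         # approximate cache-write multiplier from the text


def loop_cost(steps: int, fixed: int = 10_000, per_step: int = 2_000,
              cached: bool = True) -> float:
    """Input cost in USD for one loop; with cached=True, assumes no misses."""
    total = 0.0
    prefix = 0  # tokens already written to the cache by earlier steps
    for n in range(1, steps + 1):
        sent = fixed + per_step * n
        if cached:
            new = sent - prefix                  # tokens not yet in the cache
            total += prefix * BASE * READ_MULT   # re-read at the cache-read rate
            total += new * BASE * WRITE_MULT     # first appearance pays the premium
            prefix = sent
        else:
            total += sent * BASE                 # everything at full input price
    return total


uncached = loop_cost(30, cached=False)
cached = loop_cost(30, cached=True)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, {uncached / cached:.1f}x cheaper")
```

Under these assumptions the 30-step run drops from about $6.15 to about $1.02 of input, roughly a 6x cut. Real hit rates will be lower, which is what the next paragraph is about.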
Here's the catch, and it's the part teams get wrong. Caching is a prefix match: a single changed byte anywhere in the prefix invalidates the cache for everything after it. And agent harnesses are byte-instability machines. Inject the current timestamp into the system prompt and every request is a cache miss. Reorder your tool list per request and nothing caches. Let a sub-agent rebuild the system prompt with one word different and it misses the parent's cache entirely. The result is the worst of both worlds: you pay the quadratic re-read at full price, plus the write premium, and the cache_read_input_tokens field sits at zero while you wonder where the money went.
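One way to guard against this is to build the prefix deterministically and fingerprint it locally, so drift shows up in your own logs before it shows up on the invoice. The sketch below is an assumption-heavy illustration: `stable_prefix` and `fingerprint` are hypothetical helpers, the tool dicts are invented, and the hash is only a local proxy, since the provider matches on the request it actually receives.

```python
import hashlib
import json


def stable_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the cacheable prefix the same way on every request."""
    ordered = sorted(tools, key=lambda t: t["name"])  # fixed tool order
    # sort_keys and fixed separators remove dict-order and whitespace drift.
    return system_prompt + "\n" + json.dumps(
        ordered, sort_keys=True, separators=(",", ":")
    )


def fingerprint(prefix: str) -> str:
    """Short hash of the prefix, suitable for logging next to each call."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


SYSTEM = "You are a coding agent."  # illustrative prompt, no timestamp inside
tools_a = [
    {"name": "search", "description": "Web search"},
    {"name": "read_file", "description": "Read a file"},
]
tools_b = list(reversed(tools_a))  # same tools, different per-request order

# Reordered tools still produce the same prefix once serialization is stable.
assert fingerprint(stable_prefix(SYSTEM, tools_a)) == fingerprint(stable_prefix(SYSTEM, tools_b))

# A timestamp in the system prompt changes the prefix on every request.
drifting = stable_prefix(SYSTEM + " Now: 2026-01-01T00:00:00Z", tools_a)
assert fingerprint(drifting) != fingerprint(stable_prefix(SYSTEM, tools_a))
```

If the fingerprint changes between two steps of the same task, something in the harness is rewriting the prefix. Values that really do vary per request belong after the cached portion, not inside it.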
What to actually do about it
Three things, in order of leverage.
Budget on depth, not requests. The inputs that predict an agent's cost are the expected number of loop steps and the cache-hit rate, not the number of user requests. A cost model that multiplies "requests × average tokens" will be wrong by a factor that grows with how agentic your product is. Model the square.
Treat cache-read rate as a first-class metric. Log cache_read_input_tokens versus input_tokens on every call. If reads are near zero across a long-running agent, you have a silent invalidator (a timestamp, an unsorted JSON blob, a per-request ID) sitting in your prefix. Removing it is usually a small code change, and it can move the bill's leading digit.
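A minimal version of that metric might look like the sketch below. It assumes each response's usage has been turned into a plain dict, and that alongside the two fields named above there is a `cache_creation_input_tokens` field for cache writes, which is my understanding of Anthropic's usage object; check the field names against your provider. The helper names and the 0.5 threshold are arbitrary choices for illustration.

```python
def cache_read_ratio(usage: dict) -> float:
    """Share of one call's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0  # assumed field name
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0


def looks_like_silent_invalidator(usages: list[dict], threshold: float = 0.5) -> bool:
    """True if the later calls of a multi-step run mostly miss the cache."""
    later = usages[1:]  # the first call can only write to the cache
    if not later:
        return False
    mean = sum(cache_read_ratio(u) for u in later) / len(later)
    return mean < threshold


run = [
    {"input_tokens": 2_000, "cache_creation_input_tokens": 10_000, "cache_read_input_tokens": 0},
    {"input_tokens": 2_000, "cache_creation_input_tokens": 12_000, "cache_read_input_tokens": 0},
    {"input_tokens": 2_000, "cache_creation_input_tokens": 14_000, "cache_read_input_tokens": 0},
]
print(looks_like_silent_invalidator(run))
```

The sample run writes the prefix on every call and never reads it back, which is the signature of something changing in the prefix between steps.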
Put a hard token ceiling on every loop. Not a suggestion to the model — an enforced cap that stops the loop. Every "the agents ran for 264 hours" story is missing this one guardrail. It is the cheapest insurance in the stack.
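The guardrail itself can be very small. This is a sketch with hypothetical names (`TokenBudget`, `run_loop`, `runaway_step`), assuming your harness can report the tokens each step consumed. The cap is checked after each step is charged, so a run can overshoot by at most one step before it is stopped.

```python
from typing import Callable


class TokenBudgetExceeded(RuntimeError):
    """Raised by the harness, not the model, when a loop passes its cap."""


class TokenBudget:
    def __init__(self, max_tokens: int, max_steps: int) -> None:
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.used = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Record one step's usage and stop the loop if either cap is passed."""
        self.used += tokens
        self.steps += 1
        if self.used > self.max_tokens or self.steps > self.max_steps:
            raise TokenBudgetExceeded(
                f"stopped after {self.steps} steps and {self.used:,} tokens"
            )


def run_loop(step: Callable[[], tuple[int, bool]], budget: TokenBudget) -> int:
    """Call step() until it reports done; step returns (tokens_used, done)."""
    while True:
        tokens, done = step()
        budget.charge(tokens)  # enforced outside the model's control
        if done:
            return budget.used


def runaway_step() -> tuple[int, bool]:
    """Stand-in for an agent step that never finishes."""
    return 12_000, False


budget = TokenBudget(max_tokens=50_000, max_steps=100)
try:
    run_loop(runaway_step, budget)
except TokenBudgetExceeded as exc:
    print(exc)
```

Capping both tokens and steps is deliberate: a loop of cheap steps can run for days without tripping a token cap quickly, and a loop of expensive steps can blow the budget long before a step cap notices.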
The uncomfortable summary is that the agent era quietly changed the shape of your cost curve while the marketing was all about falling prices. Both things are true: tokens got cheaper, and your bill got bigger, because you started buying a lot more of the same tokens over and over. Price the square, cache the prefix, cap the loop — or keep being surprised.



