You build an agent. It works. You ship it. Then the invoice arrives and it's an order of magnitude past your back-of-the-envelope, and your first instinct — everyone's first instinct — is to reach for a cheaper model. Hold that thought, because for an agent it's usually the wrong lever, and understanding why tells you which levers are the right ones.
The bill has a shape, and the shape is quadratic
A chatbot has a simple cost: one prompt in, one answer out, pay once. An agent does not work that way. An agent loops. It calls the model, gets back a tool call, runs the tool, appends the result to the conversation, and calls the model again — and that next call re-sends the entire transcript so far, because the model is stateless and the only way it "remembers" step three is that you paste steps one and two back in.
So a twenty-step task does not pay for twenty steps. It pays for the transcript at step one, plus the transcript at step two, plus … plus the transcript at step twenty — each one longer than the last. The total is the triangular number of the conversation length: cost grows with roughly the square of how long the run gets, not linearly with the work done. Most of what you pay for on step twenty is the re-transmission of context the model already saw nineteen times.
An agent doesn't pay per task. It pays per step, times a transcript that gets longer every step. The bill is in the replay, not the reasoning.
Once you see the bill this way, the cheap-model reflex reveals its flaw. If 90% of your tokens are re-sent history, swapping models just makes the same wasteful pattern slightly cheaper. The real savings live in the repeated context. So attack that first.
Lever 1: prompt caching — discount the part that repeats
Your system prompt, your tool definitions, your few-shot examples, and every prior turn are identical across calls within a run. That is exactly what prompt caching is for: the provider stores the computed prefix and bills you a fraction to re-read it.
The numbers are not marginal. Anthropic charges cache reads at 0.1x the standard input rate — a 90% discount on the part of your prompt that doesn't change — against a one-time write premium of 1.25x (5-minute TTL) or 2x (1-hour). OpenAI caches automatically with no code change: 50% off cached input on the 4o family, and up to 90% on its newer models. The catch is that caching matches a prefix, so order your prompt deliberately — stable content (system, tools, instructions) first, volatile content (the live user turn) last. Put a timestamp near the top and you bust the cache on every call and pay full freight.
This one lever, correctly applied, often halves an agent's bill before you touch anything else.
Lever 2: compaction, not truncation
Caching makes the repeated context cheaper; compaction makes there be less of it. A long-running agent accumulates dead weight — the full text of a file it read twelve steps ago, a verbose API response it already extracted one number from. Replaying that on every subsequent turn is pure waste.
The naive fix is truncation: drop the oldest messages. But blind truncation throws away decisions and constraints the agent still needs, and that causes its own expensive failure — the agent re-discovers what it forgot, burning more tokens than you saved. The disciplined fix is compaction: once the transcript crosses a threshold, summarize the older stretch into a compact note that preserves the decisions and drops the bulk. Both major providers now ship this as a managed feature (Anthropic's context editing and memory tooling is one example), and the economics are lopsided — the cost of summarizing proactively is far below the cost of dragging raw context through the rest of the run.
Lever 3: route by difficulty
Not every model call in an agent is hard. Deciding which tool to use, extracting a field, classifying an intent, formatting an answer — these are easy calls that a small model handles fine, and small models can cost 15-50x less per token than a flagship. Reserve the expensive model for the calls that genuinely reason: planning, ambiguous judgment, synthesis.
The trap is doing the routing decision with an expensive model, which eats the savings. Use a cheap classifier or a dedicated router to triage, and accept that routing is a quality dial, not a free lunch — measure the easy tier's outputs before you trust it with more.
Lever 4: batch the work that can wait
A surprising share of "agent" workload is not interactive at all: nightly evals, dataset enrichment, bulk classification, backfills. None of it needs an answer in the next two seconds. Both OpenAI and Anthropic offer a Batch API that is a flat 50% off on input and output, in exchange for asynchronous turnaround (up to 24 hours). For anything offline, that's half your bill for the cost of patience.
Lever 5: mind the output rate
One last asymmetry teams forget: output tokens are billed far higher than input — commonly 3-5x. An agent that narrates its reasoning at length, or returns prose where a JSON object would do, is paying the premium rate to generate tokens nobody reads. Cap max_tokens, ask for structured outputs, and stop instructing the model to "explain your thinking" in production paths where the explanation is never consumed.
If you pull these in order — caching, then compaction, then routing, then batching, then output discipline — you'll usually find the bill falls by more than half before the question of which model to use even comes up. The cheaper model is a real lever. It's just the last one, not the first. An agent's cost is a property of how you manage its context, and context is something you control.



