Most teams still instrument an AI agent the way they instrumented a chatbot two years ago: a cost dashboard, one row per API call, sliced by model. For a single-shot completion, that view is honest — one request in, one answer out, one price. For an agent it is not just incomplete. It is actively misleading, because an agent's cost does not live in any single request. It lives in the trajectory: the whole loop of model calls and tool calls the agent takes to finish one task. Watch the requests and you will optimize the wrong thing with great precision.
The cost is in the loop, and the loop is quadratic
Here is the structural fact that breaks the per-request view. LLM APIs are stateless. The model remembers nothing between turns, so on every turn your agent re-sends the entire accumulated conversation: the system prompt, the tool schemas, every prior thought, every tool result. The input to each turn grows roughly linearly with the step count, so cumulative input tokens across n steps grow roughly as O(n²), the quadratic curve that surprises every team the first time they plot it. Doubling an agent's steps does not double its token bill; once the history outweighs the fixed prompt, it comes close to quadrupling it. A ten-cycle reflection loop can burn on the order of fifty times the tokens of a single pass.
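To make the curve concrete, here is a minimal sketch of the arithmetic. The function name, the 1,000-token base prompt, and the 1,000 tokens added per step are illustrative assumptions, not measurements, and the model ignores anything that would flatten the curve, such as provider-side caching or context truncation.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens billed over a run in which every turn re-sends the full history."""
    total = 0
    context = base_tokens  # system prompt + tool schemas, present on every turn
    for _ in range(steps):
        total += context            # this turn's input is the whole context so far
        context += tokens_per_step  # the turn's output and tool result join the history
    return total


if __name__ == "__main__":
    # Assumed sizes: 1,000-token base prompt, 1,000 new tokens per step.
    for n in (1, 10, 20, 40):
        print(n, cumulative_input_tokens(n, base_tokens=1_000, tokens_per_step=1_000))
```

Under these assumptions ten steps cost 55,000 input tokens against 1,000 for a single pass, and going from ten steps to twenty takes the total to 210,000, a bit under four times as much.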
This isn't a rounding error hiding in the tail. In one published analysis of 1,127 production agent runs, context re-accumulation accounted for 52% of total spend — more than half of every dollar went to the model re-reading tokens it had already read. A per-request dashboard takes that one steep quadratic curve and smears it across forty innocuous-looking rows. Each row is cheap. The task is expensive. You will never see it by staring at rows.
An agent's most expensive request is rarely where the expense is. The expense is the shape of the whole loop, and you can only see a shape if you plot the whole loop.
So the first move in tracking agent cost is a unit change: instrument the session, not the request. Give every trajectory an ID and group all of its model and tool spans under it. This is exactly what the agent-native observability tools reconstruct for you — Helicone stitches spans into sessions at the gateway, Langfuse (MIT-licensed and self-hostable) builds the trace and prices it, AgentOps attributes cost per agent inside a multi-agent system. The capability to check for when you pick one is not "can it sum tokens by model" — everything sums tokens by model. It is "can it reassemble a trajectory and let me attribute it."
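The unit change itself is small. The sketch below assumes a hypothetical `Span` record with a `trajectory_id` field and a precomputed `cost_usd`; it is not the data model of Helicone, Langfuse, or AgentOps, only an illustration of grouping spans by trajectory instead of by request.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One model call or tool call. All field names here are hypothetical."""
    trajectory_id: str  # one ID per task attempt, assigned when the agent starts
    kind: str           # "model" or "tool"
    cost_usd: float     # priced cost of this single span


def cost_per_trajectory(spans: list[Span]) -> dict[str, float]:
    """Sum span costs under their trajectory ID, so the task becomes the row."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.trajectory_id] += span.cost_usd
    return dict(totals)


if __name__ == "__main__":
    spans = [
        Span("t1", "model", 0.25),
        Span("t1", "tool", 0.0),
        Span("t1", "model", 0.5),
        Span("t2", "model", 0.125),
    ]
    print(cost_per_trajectory(spans))
```

The per-request view would show four cheap rows here; the per-trajectory view shows that one task cost six times as much as the other.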
The number that actually matters is dollars per resolved task
Getting the unit right lets you ask the only question finance and engineering can agree on: what did it cost to actually get the thing done? Not per token. Per outcome — per resolved ticket, per merged pull request, per qualified lead — with the retries and the dead ends folded in, because a task that took three attempts really did cost three attempts. This is the metric FinOps X 2026 floated as "cost per verified outcome," and it is the one worth putting on the wall.
It matters because it inverts an intuition that the token dashboard trains into you. On a per-token view, a cheaper model is always better; that is the entire content of the view. On a per-task view, a model that is twice as expensive per token but needs half as many retries is cheaper per delivered result. The expensive model that finishes in two turns beats the cheap model that flails for ten, and not by luck: because cost grows with the square of the turn count, a saving in turns is worth more than an equal-sized price premium costs. Optimizing cost-per-token is precisely how a team drives the dashboard number down while its real cost per outcome quietly climbs, and nobody notices until the invoice does.
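A rough worked comparison shows the inversion. The prices ($1 and $2 per million input tokens), the prompt sizes, and the function name are all made-up assumptions for illustration, and the sketch counts input tokens only, leaving output-token pricing out.

```python
def run_cost_usd(
    turns: int,
    usd_per_million_input_tokens: float,
    base_tokens: int = 1_000,      # assumed fixed prompt re-sent every turn
    tokens_per_turn: int = 1_000,  # assumed history added by each turn
) -> float:
    """Input-token cost of one run in which each turn re-sends the full history."""
    # Closed form of: base + (base + step) + (base + 2*step) + ... over `turns` turns.
    input_tokens = turns * base_tokens + tokens_per_turn * turns * (turns - 1) // 2
    return input_tokens * usd_per_million_input_tokens / 1_000_000


if __name__ == "__main__":
    cheap_flailing = run_cost_usd(turns=10, usd_per_million_input_tokens=1.0)
    pricey_direct = run_cost_usd(turns=2, usd_per_million_input_tokens=2.0)
    print(f"cheap model, 10 turns:  ${cheap_flailing:.4f}")
    print(f"pricey model, 2 turns:  ${pricey_direct:.4f}")
```

With these assumed numbers the model that costs twice as much per token finishes the task for roughly a ninth of the money, which is the opposite of what a per-token dashboard would report.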
What to actually build
The instrumentation that follows from this is not exotic. Three things:
- Tag every trajectory. Attribute cost along the axes you'll actually slice by: provider, model, route (which workflow), user, tenant (which customer — this is what lets you bill), and experiment (which benchmark run). Attribution you didn't capture at write time is attribution you don't have.
- Join cost to outcome. Emit a success/failure signal at the end of each trajectory and store it next to the cost. Without the denominator, "cost per successful task" is just "cost," and you're back to counting tokens.
- Report the distribution, not the mean. Agents have long tails: a small fraction of runaway trajectories (the one that looped forty times, the one that re-planned into a corner) drives a large fraction of spend. Track the p95 and the variance of cost-per-task, not only the average, because the average smooths over exactly the runs that will blow your budget. This is also where a hard token budget per agent earns its keep: not as a cost-saver on the mean, but as a governor on the tail.
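The three items above can be sketched together in a few lines. The `Trajectory` record and its field names are hypothetical, the percentile uses the simple nearest-rank definition instead of any particular tool's method, and the sample costs are invented.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """One finished task attempt. Field names are illustrative assumptions."""
    trajectory_id: str
    tenant: str      # which customer, captured at write time
    model: str
    route: str       # which workflow
    cost_usd: float
    resolved: bool   # outcome signal emitted when the trajectory ends


def cost_per_resolved_task(runs: list[Trajectory]) -> float:
    """Total spend, failed attempts included, divided by the number of resolved tasks."""
    resolved = sum(1 for run in runs if run.resolved)
    if resolved == 0:
        return math.inf  # money spent, nothing delivered
    return sum(run.cost_usd for run in runs) / resolved


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    runs = [
        Trajectory("t1", "acme", "model-a", "triage", 0.5, True),
        Trajectory("t2", "acme", "model-a", "triage", 0.25, False),
        Trajectory("t3", "globex", "model-a", "triage", 0.25, True),
    ]
    costs = [run.cost_usd for run in runs]
    print("cost per resolved task:", cost_per_resolved_task(runs))
    print("mean:", sum(costs) / len(costs), "p95:", percentile(costs, 95))
```

The failed attempt stays in the numerator on purpose: its tokens were billed, so dropping it would understate what each delivered result cost.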
None of this requires new infrastructure. It requires a decision to stop treating the request as the unit of cost and start treating the task as the unit of value. The token meter will keep spinning either way. The question is whether you're measuring the meter or measuring what you got for it — and only one of those two numbers is the one your CFO is going to ask about.



