Here is a scene that plays out in a surprising number of well-run engineering organizations. The monthly model-provider invoice lands. It is larger than expected. Someone asks the obvious question — which customer drove that? — and the room goes quiet, because the honest answer is that nobody instrumented the thing that would let them know, and now the tokens are spent and the information is gone.

That quiet is the whole subject of this piece. LLM cost attribution has a reputation as an observability chore you get to later. It is actually a schema decision you make at request time, and the reason it can't wait is unforgiving: a token you don't tag when you emit it is unattributable forever. The provider's bill aggregates every call under your API key. It cannot reconstruct which of your users, features, or tenants produced which tokens, because that mapping only ever existed in your application — for the instant the request was in flight. Miss it then and no amount of later log-joining fully recovers it.

Tag at emission, or don't bother

The unit of attribution is the individual model call, and the discipline is to attach enough metadata to it that you can answer product questions without re-instrumenting. A workable minimum is six fields:

From those six you can produce per-user, per-feature, and per-tenant rollups and rotate between them at will. The Braintrust playbook makes the sharp version of the point: build per-user, per-task, and per-tenant views from the start, because the alternative is re-instrumentation, and re-instrumentation only fixes future traffic.
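To make the rollup idea concrete, here is a minimal sketch of a tagged call record and a single function that pivots it by any tag. The field names (`user_id`, `tenant_id`, `feature`, `agent_run_id`, `model`, `cost_usd`) and the sample values are illustrative assumptions, not a canonical list; the point is only that the tags are written once, at emission, and every view afterward is a cheap group-by.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One model call, tagged at emission time. Field names are hypothetical."""
    user_id: str
    tenant_id: str
    feature: str
    agent_run_id: str
    model: str
    cost_usd: float


def rollup(records, key):
    """Sum cost by any tag, e.g. key='tenant_id' or key='feature'."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost_usd
    return dict(totals)


# Made-up sample traffic for illustration only.
records = [
    CallRecord("u1", "acme", "summarize", "run-1", "model-a", 0.25),
    CallRecord("u2", "acme", "search", "run-2", "model-a", 0.5),
    CallRecord("u3", "globex", "summarize", "run-3", "model-b", 1.0),
]

by_tenant = rollup(records, "tenant_id")   # {'acme': 0.75, 'globex': 1.0}
by_feature = rollup(records, "feature")    # {'summarize': 1.25, 'search': 0.5}
```

Rotating between views is a change of `key`, nothing more. A record that was never tagged has no `key` to rotate on, which is the whole argument for tagging at emission.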

The most expensive attribution mistake is not a wrong number. It's a deferred decision — shipping first, instrumenting "once there's traffic," then spending a quarter retrofitting tags onto tokens that are already gone.

One more refinement that separates teams who can optimize from teams who can only stare at a big number: track four token layers, not two. Prompt, tool, memory, and response tokens each behave differently and each has a different lever: prompt tokens respond to caching, tool tokens to schema pruning, memory tokens to retrieval limits, and response tokens to output-length limits. Collapsing them into a single input/output bucket hides where the money goes, and for agents the money hides in exactly the layers a two-bucket view erases.
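A small sketch shows what the two-bucket view erases. The `TokenLayers` class and the token counts are hypothetical, and the sketch assumes tool and memory tokens are billed as input, which will not hold for every provider or every kind of tool traffic.

```python
from dataclasses import dataclass


@dataclass
class TokenLayers:
    """Token counts for one agent run, split into four layers (hypothetical shape)."""
    prompt: int
    tool: int
    memory: int
    response: int

    def shares(self):
        """Fraction of total tokens in each layer."""
        total = self.prompt + self.tool + self.memory + self.response
        if total == 0:
            return {}
        return {name: count / total for name, count in vars(self).items()}

    def two_bucket(self):
        """The collapsed view; assumes tool and memory tokens count as input."""
        return {
            "input": self.prompt + self.tool + self.memory,
            "output": self.response,
        }


# Made-up numbers for a tool-heavy agent run.
run = TokenLayers(prompt=1000, tool=6000, memory=2000, response=1000)

layer_shares = run.shares()   # tool is 0.6 of the total
collapsed = run.two_bucket()  # {'input': 9000, 'output': 1000}
```

The collapsed view says "90% input," which suggests prompt caching. The layered view says most of that input is tool traffic, which points at schema pruning instead.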

The gateway total lies about agents

Now the non-obvious part, and the reason agents deserve their own treatment. The fastest way to start metering is an LLM gateway — LiteLLM, Portkey, Kong AI Gateway, Helicone — a proxy you point your base URL at, after which every request is logged at the wire. For a single-shot app (one prompt, one completion) that is genuinely enough, and Helicone will track cost across 300-plus models without you touching model-price tables.

But an agent is not one call. One agent run is a sequence — a planning call, several tool-augmented calls, a memory read, a retry after a malformed tool result, a final synthesis. The gateway sees each of these as an independent prompt-in/completion-out event. It can give you a correct total. What it cannot give you is the sentence you actually need: "60% of this tenant's spend went into a retry loop before the model ever produced an answer." That fact lives in the relationship between calls, and the wire-level view has thrown the relationship away.
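The difference between the two views can be sketched in a few lines. The tuple layout, the `step` label, and the costs below are all hypothetical; the sketch assumes each call was tagged with a run ID, a tenant, and the role it played in the run, which is exactly the information a wire-level log does not carry on its own.

```python
from collections import defaultdict

# (agent_run_id, tenant_id, step, cost_usd); made-up data for illustration.
calls = [
    ("run-1", "acme", "plan", 0.25),
    ("run-1", "acme", "tool", 0.25),
    ("run-1", "acme", "retry", 0.75),
    ("run-1", "acme", "retry", 0.75),
    ("run-1", "acme", "synthesis", 0.5),
    ("run-2", "globex", "plan", 0.5),
    ("run-2", "globex", "synthesis", 0.5),
]


def gateway_total(calls):
    """What an untagged wire-level view can report: one correct number."""
    return sum(cost for _run_id, _tenant, _step, cost in calls)


def retry_share_by_tenant(calls):
    """Fraction of each tenant's spend that went to calls labeled 'retry'."""
    total = defaultdict(float)
    retry = defaultdict(float)
    for _run_id, tenant, step, cost in calls:
        total[tenant] += cost
        if step == "retry":
            retry[tenant] += cost
    return {t: retry[t] / total[t] for t in total if total[t] > 0}


total_spend = gateway_total(calls)            # 3.5
retry_shares = retry_share_by_tenant(calls)   # {'acme': 0.6, 'globex': 0.0}
```

Both functions read the same rows. Only the second one can produce the retry-loop sentence, and it can only do so because the `step` and tenant labels were attached when the calls were made.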

Recovering it means span-level tracing, which is where the OpenTelemetry GenAI semantic conventions, developed by a dedicated SIG since April 2024, earn their setup cost. Each model call becomes a span carrying gen_ai.usage.input_tokens and gen_ai.usage.output_tokens; LiteLLM additionally emits cost as gen_ai.cost.{key}. Crucially, you attach your agent_run_id and customer_id once and propagate them to every nested span using OpenTelemetry Baggage with a BaggageSpanProcessor, so the retry buried three calls deep is still tagged to the right tenant. Backends like Langfuse ingest this over OTLP, so the instrumentation isn't a bet on one vendor: the same spans feed whichever observability platform you land on.

The rule that makes it cheap

All of this sounds like a lot until you notice it collapses to a single rule: decide your attribution schema before you launch, and emit it on every call. The gateway-versus-spans choice, the six tags, the four token layers — none of them are hard to implement. They are only expensive when postponed, because the cost of postponement is denominated in a currency you can't get back: the tokens you already burned without a name on them.

Cost attribution isn't a dashboard you'll add when the numbers get scary. By the time the numbers are scary, the data to explain them either exists or doesn't — and which one it is was decided months earlier, in a request handler, by whoever chose whether to write down a customer_id.