You run a multi-tenant app on top of an LLM. The monthly provider bill arrives as one number, and finance wants to know which customers incurred it: to set pricing, to find the tenant whose runaway agent is eating your margin, to bill usage-based plans honestly. So you reach for the obvious lever, the provider's per-user field. That is the first wrong turn.
The per-user field is not a billing dimension
OpenAI's safety_identifier and Anthropic's metadata.user_id both take a stable, opaque ID for your end user. Read the docs and the purpose is explicit: they exist so the provider can trace activity back to an individual for abuse and safety monitoring. They are not a meter. The provider will not return you a per-end-user invoice, and you should hash the ID before sending it — it is a policy hook, not an accounting one.
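One way to produce that hashed ID is a keyed hash, sketched below. The secret, the function name opaque_user_id, and the example tenant and user IDs are all hypothetical; the only thing taken from the text is that the value sent to the provider should be stable and opaque.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; load the real one from a secret manager.
TENANT_ID_KEY = b"replace-with-a-real-secret"


def opaque_user_id(tenant_id: str, end_user_id: str, key: bytes = TENANT_ID_KEY) -> str:
    """Return a stable identifier that reveals neither the tenant nor the user."""
    message = f"{tenant_id}:{end_user_id}".encode("utf-8")
    # A keyed hash (HMAC) resists guessing IDs by hashing candidate values.
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# The result is what you would pass as safety_identifier or metadata.user_id.
user_field = opaque_user_id("tenant-42", "user-1001")
```

The same input always maps to the same 64-character hex string, so the provider can still correlate abuse reports while your own ledger stays keyed by the real tenant ID.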
What the provider does give you is the Cost API, which rolls spend up by project_id, api_key_id, and line_item — and nothing finer. So your two honest options are structural (a project or key per tenant, which the Cost API can then attribute) or computational (log each request's usage yourself and keep your own ledger). The structural path is clean until you have thousands of tenants, or until a shared prompt cache starts crossing key boundaries. For most apps, you end up computing it. Which means you have to know what a request actually costs.
Raw tokens are not priced tokens
Here is the part that quietly breaks every naive tokens × rate spreadsheet: the same token is priced differently depending on the lane it took.
Anthropic prompt caching bills a cache read at 0.1x base input and the cache write that warmed it at 1.25x (5-minute TTL) or 2x (1-hour). That is a spread of 12.5 to 1 at the short TTL and 20 to 1 at the long one, on the same shared prefix. In a multi-tenant app with one common system prompt and tool schema, one tenant's request arrives cold, pays the write premium, and warms the cache, and every tenant who follows before the cache expires pays a tenth of the input rate on those tokens. If you attribute by raw token count, you overcharge the cold-path customer for a cost the whole tenant pool consumed, and you undercharge everyone who rode the warm cache that customer paid for.
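To make the spread concrete, here is a small worked calculation. The multipliers come from the paragraph above; the base price of 3.00 USD per million input tokens, the 10,000-token prefix, and the 100-request pool are hypothetical numbers chosen only for illustration.

```python
# Multipliers from the text; the base price is a hypothetical placeholder.
BASE_INPUT_USD_PER_MTOK = 3.00
CACHE_WRITE_5M = 1.25
CACHE_READ = 0.10


def prefix_cost(tokens: int, multiplier: float) -> float:
    """Cost in USD of sending `tokens` prefix tokens through one pricing lane."""
    return tokens / 1_000_000 * BASE_INPUT_USD_PER_MTOK * multiplier


PREFIX_TOKENS = 10_000   # shared system prompt plus tool schema (assumed size)
REQUESTS = 100           # one cold request, the rest warm (assumed pool)

cold_cost = prefix_cost(PREFIX_TOKENS, CACHE_WRITE_5M)  # the request that warms the cache
warm_cost = prefix_cost(PREFIX_TOKENS, CACHE_READ)      # every request after it
flat_cost = prefix_cost(PREFIX_TOKENS, 1.0)             # what tokens x rate assumes

actual_total = cold_cost + (REQUESTS - 1) * warm_cost
flat_total = REQUESTS * flat_cost
amortized_share = actual_total / REQUESTS  # equal split of the joint cost

print(f"cold {cold_cost:.4f}  warm {warm_cost:.4f}  flat {flat_cost:.4f}")
print(f"actual pool {actual_total:.4f}  flat pool {flat_total:.4f}")
print(f"amortized per request {amortized_share:.6f}")
```

Under these assumed numbers the pool's real prefix spend is about 11% of what a flat rate would suggest, and the cold request alone costs 12.5 times a warm one.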
The honest unit of attribution is the priced token, not the raw token — and a shared cache warm-up is a joint cost you amortize, not a bill you hand to whoever tripped it.
The same distortion hides in two other lanes. The OpenAI Batch API discounts input and output 50% for asynchronous work, so a tenant who runs evals or backfills through batch incurs half the effective unit cost of a tenant doing the identical work synchronously. And reasoning/thinking tokens bill at the higher output rate even though the customer never sees them — a 500-token answer can carry thousands of billed thinking tokens underneath it. Cost includes tokens the tenant can't read.
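A per-request pricing function that respects these lanes might look like the sketch below. The Usage and RateCard types, their field names, and the example prices are hypothetical. Two simplifying assumptions are baked in and flagged in the comments: reasoning tokens are counted separately from visible output (if your provider already folds them into the output count, do not add them twice), and the batch discount is applied to every lane (check your provider's pricing for how batch and cache discounts combine).

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int = 0        # uncached input
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    output_tokens: int = 0       # visible output only (assumption)
    reasoning_tokens: int = 0    # hidden, billed at the output rate


@dataclass
class RateCard:
    input_per_mtok: float        # USD per million tokens
    output_per_mtok: float
    cache_read_mult: float = 0.1
    cache_write_mult: float = 1.25
    batch_mult: float = 0.5


def priced_cost(u: Usage, r: RateCard, batch: bool = False) -> float:
    """Convert one request's usage to USD, applying a multiplier per lane."""
    input_equiv = (
        u.input_tokens
        + u.cache_read_tokens * r.cache_read_mult
        + u.cache_write_tokens * r.cache_write_mult
    )
    output_equiv = u.output_tokens + u.reasoning_tokens
    cost = (input_equiv * r.input_per_mtok + output_equiv * r.output_per_mtok) / 1_000_000
    # Assumption: the batch discount applies uniformly to the whole request.
    return cost * r.batch_mult if batch else cost


rates = RateCard(input_per_mtok=3.0, output_per_mtok=15.0)  # hypothetical prices
answer = Usage(input_tokens=1_000, output_tokens=500, reasoning_tokens=4_000)
visible_only = Usage(input_tokens=1_000, output_tokens=500)

sync_cost = priced_cost(answer, rates)
batch_cost = priced_cost(answer, rates, batch=True)
naive_cost = priced_cost(visible_only, rates)
```

With these made-up prices, the 500-token answer with 4,000 thinking tokens underneath costs several times what the visible tokens alone would suggest, and the batch lane halves it.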
A model that survives an audit
The defensible attribution model has four moving parts, and you already have most of the data in each API response:
- Record priced usage, not raw counts. Every response reports input, output, cache-read, cache-write, and (where applicable) reasoning tokens separately. Store all of them per request, tagged with a hashed tenant ID, alongside the model and the lane (sync vs batch). The OpenTelemetry GenAI conventions give you stable attribute names (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) if you want this to ride your existing tracing.
- Amortize the shared warm-up. Pool cache-write costs against the tenants that consumed the cheap reads in the same window, instead of letting one cold request carry a whole pool's overhead.
- Convert with the right rate per lane. Cache reads, batch tokens, and output-rate thinking each get their own multiplier. This is the step the spreadsheet skips.
- Reconcile. Your computed ledger is an estimate until you check it against the provider Cost API each cycle. If they diverge, your lane model is wrong somewhere — usually thinking tokens or a cache rate.
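The amortize and reconcile steps above can be sketched as follows. Everything here is illustrative: the LedgerEntry fields, the window label, the 2% tolerance, and the choice to split the pooled write cost in proportion to shared-prefix tokens consumed (cache reads plus cache writes, so the cold tenant still carries its own share) are assumptions, not a prescribed method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    tenant: str               # hashed tenant ID
    window: str               # amortization bucket, e.g. an hour label (assumed)
    direct_cost: float        # priced cost excluding cache writes, in USD
    cache_write_cost: float   # priced cost of cache writes, in USD
    cache_read_tokens: int
    cache_write_tokens: int


def amortize(entries: list[LedgerEntry]) -> dict[str, float]:
    """Per-tenant cost with cache-write spend pooled and shared within each window."""
    totals: dict[str, float] = defaultdict(float)
    by_window: dict[str, list[LedgerEntry]] = defaultdict(list)
    for e in entries:
        totals[e.tenant] += e.direct_cost
        by_window[e.window].append(e)

    for window_entries in by_window.values():
        pool = sum(e.cache_write_cost for e in window_entries)
        weight = sum(e.cache_read_tokens + e.cache_write_tokens for e in window_entries)
        for e in window_entries:
            if weight > 0:
                share = (e.cache_read_tokens + e.cache_write_tokens) / weight
                totals[e.tenant] += pool * share
            else:
                # No prefix usage recorded: leave the write cost where it landed.
                totals[e.tenant] += e.cache_write_cost
    return dict(totals)


def reconcile(ledger_total: float, provider_total: float, tolerance: float = 0.02) -> bool:
    """True if the computed ledger is within `tolerance` of the provider's figure."""
    if provider_total == 0:
        return ledger_total == 0
    return abs(ledger_total - provider_total) / provider_total <= tolerance


entries = [
    LedgerEntry("tenant-a", "w1", 0.01, 0.0375, 0, 10_000),   # cold request
    LedgerEntry("tenant-b", "w1", 0.02, 0.0, 30_000, 0),      # three warm requests
]
per_tenant = amortize(entries)
ledger_total = sum(per_tenant.values())
```

The split conserves the total, so the sum of per-tenant figures is what you compare against the provider's cost report; a failed reconcile call is the signal to recheck the lane multipliers.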
If you don't want to build all of this, a gateway buys you the first mile: LiteLLM issues a virtual key per tenant and tracks spend and budgets per key, user, and team at the proxy; an observability layer like Langfuse computes cost per generation and aggregates it by user, session, or tag. Both still rely on a correct model of what a priced token costs — so you own that model either way.
The same caching that makes a per-customer invoice hard is the thing keeping your bill down — see how to reduce agent token costs and prefix caching vs prompt caching for the levers, and the prompt-caching price cards for the exact multipliers you'll plug into the ledger above. Attribute by the priced token, amortize the shared warm-up, reconcile monthly. Bill the raw token and you will be wrong in the direction of your most cost-conscious customers.



