The pricing pages all agree on a number that sounds like a promotion: roughly half off, every major provider, if you can wait. OpenAI's Batch API, Anthropic's Message Batches, Google's Gemini Batch Mode, Mistral, Together — all knock about 50% off input and output tokens in exchange for an asynchronous turnaround of up to 24 hours. Most finish far faster; Anthropic says the majority of its batches complete in under an hour.
Read as a coupon, this invites the wrong question — is 50% worth a day's wait? — and most teams answer "not for us" and move on, paying realtime rates for everything. That's the mistake. The discount isn't the interesting part. The contract is.
A batch isn't cheaper realtime. It's a different promise.
A synchronous API call makes you a bundle of guarantees: an immediate response, token streaming, a connection you can retry inside a loop. A batch request hands all of that back. What you get instead is an envelope that is asynchronous, best-effort, and partial-failure-tolerant — you submit a file of requests, you poll, and results return per-request, possibly out of order, matched by a custom_id you assigned.
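To make the "matched by a custom_id" part concrete, here is a minimal sketch of the join you end up writing on the client side. The result shape (a `status` and an `output` field) is an assumption for illustration, not any provider's actual schema; only `custom_id` comes from the contract described above.

```python
from typing import Iterable


def join_results(requests: list[dict], results: Iterable[dict]) -> list[dict]:
    """Re-associate out-of-order batch results with their requests by custom_id."""
    # Index results by the id we assigned at submission time.
    by_id = {r["custom_id"]: r for r in results}
    joined = []
    for req in requests:
        res = by_id.get(req["custom_id"])
        joined.append({
            "custom_id": req["custom_id"],
            # A request with no result at all (e.g. the batch expired) is "missing".
            "status": res["status"] if res else "missing",
            "output": res.get("output") if res else None,
        })
    return joined


# Hypothetical data: results arrive out of order and one request never finished.
requests = [{"custom_id": "doc-1"}, {"custom_id": "doc-2"}, {"custom_id": "doc-3"}]
results = [
    {"custom_id": "doc-3", "status": "succeeded", "output": "spam"},
    {"custom_id": "doc-1", "status": "errored", "output": None},
]

joined = join_results(requests, results)
# Anything that did not succeed goes into the next batch rather than a retry loop.
retry_ids = [j["custom_id"] for j in joined if j["status"] != "succeeded"]
```

The design point is that partial failure is the normal case: the unit of retry is the individual request, resubmitted in a later batch, not the batch as a whole.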
The cleanest proof that batch is a different animal is what it forbids. Anthropic's batch docs reject streaming, threads, fast mode, and prewarming requests with validation errors. Those are precisely the features only an interactive agent needs. Strip them out and what remains is a bulk compute primitive wearing the same API.
The limits follow the same logic. Anthropic caps a batch at 100,000 requests or 256 MB, keeps results for 29 days, and expires the whole batch at 24 hours — at which point you simply aren't billed for what didn't finish. OpenAI governs throughput by per-model enqueued-token quotas rather than your realtime requests-per-minute. None of these is an SLA with a credit attached; the 24 hours is a ceiling, not a guarantee.
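In practice those caps mean a large backlog has to be split before submission. The sketch below packs requests greedily under both limits. It assumes the size cap can be approximated by the JSON-serialized length of each request plus a newline, and it treats "256 MB" as 256 MiB; check how your provider actually measures size before relying on either assumption.

```python
import json

MAX_REQUESTS = 100_000
MAX_BYTES = 256 * 1024 * 1024  # assumption: "256 MB" interpreted as MiB


def chunk_requests(requests, max_requests=MAX_REQUESTS, max_bytes=MAX_BYTES):
    """Yield lists of requests, each within both the count cap and the size cap."""
    batch, size = [], 0
    for req in requests:
        # Approximate the on-the-wire size: serialized JSON plus one separator byte.
        n = len(json.dumps(req).encode("utf-8")) + 1
        if n > max_bytes:
            raise ValueError("a single request exceeds the size cap")
        # Close the current batch if adding this request would break either cap.
        if batch and (len(batch) >= max_requests or size + n > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(req)
        size += n
    if batch:
        yield batch


# Hypothetical backlog with tiny caps so the splitting is visible.
backlog = [{"custom_id": f"r{i}", "prompt": "x" * 10} for i in range(7)]
by_count = [len(b) for b in chunk_requests(backlog, max_requests=3)]

one_request = len(json.dumps(backlog[0]).encode("utf-8")) + 1
by_size = [len(b) for b in chunk_requests(backlog, max_bytes=2 * one_request)]
```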
Once you treat batch as a reliability contract instead of a discount, the question stops being "is 50% worth it?" and becomes "is a human waiting on this token?"
The two planes of an agent's token budget
That reframed question cleaves an agent system neatly in two.
There's the realtime plane: the interactive loop, the user typing, the tool call whose result the next step depends on. This needs streaming and low latency. Keep it synchronous.
Then there's the offline plane, and it's almost always bigger than teams admit. Eval suites and LLM-as-judge runs. Bulk classification and extraction over a backlog. Synthetic data generation for fine-tuning. Embeddings backfills after a model upgrade. Content enrichment and moderation sweeps. No human is blocked on any of it — yet teams routinely generate these tokens at full realtime price, and burn realtime rate limits doing it. Anthropic's own documentation names exactly this set — large-scale evals, content moderation, bulk generation — as the batch sweet spot.
Splitting the budget along that seam does two things. The obvious one: it halves the cost of everything offline. The overlooked one: it removes that work from your realtime rate-limit pool, because batch runs against separate enqueued-token quotas. For teams who hit limits before they hit their bill, that second effect is the bigger prize — the offline backfill stops starving the live product.
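The cost half of that split is simple arithmetic. The sketch below uses hypothetical volumes and a single blended price per million tokens, which is a simplification, since real pricing separates input and output tokens.

```python
def monthly_cost(realtime_tokens_m, offline_tokens_m, price_per_m, batch_discount=0.5):
    """Return (all_realtime_cost, split_cost) for token volumes given in millions."""
    # Baseline: everything is billed at the realtime rate.
    all_realtime = (realtime_tokens_m + offline_tokens_m) * price_per_m
    # Split: only the offline plane gets the batch discount.
    split = (
        realtime_tokens_m * price_per_m
        + offline_tokens_m * price_per_m * (1 - batch_discount)
    )
    return all_realtime, split


# Hypothetical month: 200M interactive tokens, 600M offline tokens, $3 per million.
before, after = monthly_cost(realtime_tokens_m=200, offline_tokens_m=600, price_per_m=3.0)
savings_fraction = (before - after) / before
print(before, after, savings_fraction)  # 2400.0 1500.0 0.375
```

The overall saving is smaller than 50% because the interactive share still pays full price; the larger the offline plane, the closer the total gets to half.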
Stacking the discounts (carefully)
Batch composes with prompt caching, but not uniformly, so check before you assume:
- Anthropic states it outright: the discounts from prompt caching and Message Batches can stack. Cache hits inside a batch are best-effort.
- Gemini supports context caching in batch — but the 50% batch discount and the caching discount are not multiplicative, so don't budget for a fictional 95% off.
- OpenAI is the trap: prompt caching does not apply inside the Batch API. The cheap-and-cached path on OpenAI is instead Flex processing (service_tier="flex", roughly 50% off, best-effort), which sits between realtime and batch in a service-tier ladder that runs Priority → Standard → Flex → Batch → Scale.
And a category error to avoid: DeepSeek's off-peak pricing is not a batch API. It discounts realtime calls (reportedly around 50% on its V3-class chat model and 75% on its R1-class reasoner), but only during a daily window of about 16:30–00:30 UTC. You still get a synchronous answer. It's a time-of-day price, not an asynchronous contract, and it solves a different problem: it changes when you run, not how long you're willing to wait.
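Because that window wraps past midnight, a naive "start <= now < end" check gets it wrong. A minimal sketch of the test, taking the approximate 16:30–00:30 UTC window above at face value and assuming the end time is exclusive:

```python
from datetime import datetime, time, timedelta, timezone

OFF_PEAK_START = time(16, 30)  # UTC, approximate, from the window quoted above
OFF_PEAK_END = time(0, 30)     # UTC, treated as exclusive (an assumption)


def in_off_peak(now: datetime) -> bool:
    """Return True if a timezone-aware datetime falls inside the off-peak window."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    t = now.astimezone(timezone.utc).time()
    # The window wraps midnight, so it is the union of two ranges, not one.
    return t >= OFF_PEAK_START or t < OFF_PEAK_END


evening = in_off_peak(datetime(2025, 1, 1, 17, 0, tzinfo=timezone.utc))
after_midnight = in_off_peak(datetime(2025, 1, 2, 0, 15, tzinfo=timezone.utc))
midday = in_off_peak(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))

# 01:00 at UTC+8 is 17:00 UTC the previous day, so it is inside the window.
utc_plus_8 = timezone(timedelta(hours=8))
local_night = in_off_peak(datetime(2025, 1, 2, 1, 0, tzinfo=utc_plus_8))
```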
The move
If you run agents in production, audit your token spend for one thing: which calls have a human on the other end. The ones that don't — and there are more than you think — belong in a queue, not a loop. Wire them through a gateway that can route by tier so the same prompt can take the patient lane when nothing is waiting on it. You'll pay roughly half, and you'll stop spending your realtime capacity on work that was never realtime to begin with.
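As a starting point, the routing rule can be sketched in a few lines. Everything named here (`Job`, `Lane`, `route`, the `deadline_hours` field) is hypothetical and not part of any gateway's API. The one piece of logic taken from the argument above is that the 24 hours is a ceiling rather than a guarantee, so work with a tighter deadline stays out of the batch lane.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Lane(Enum):
    REALTIME = "realtime"
    BATCH = "batch"


@dataclass
class Job:
    name: str
    human_waiting: bool
    deadline_hours: Optional[float] = None  # None means no deadline at all


def route(job: Job, batch_ceiling_hours: float = 24.0) -> Lane:
    """Pick a lane by asking whether anyone is blocked on the result."""
    if job.human_waiting:
        return Lane.REALTIME
    # Batch may take up to the ceiling, so a tighter deadline cannot rely on it.
    if job.deadline_hours is not None and job.deadline_hours < batch_ceiling_hours:
        return Lane.REALTIME
    return Lane.BATCH


# Hypothetical workload audit.
jobs = [
    Job("chat-turn", human_waiting=True),
    Job("nightly-evals", human_waiting=False),
    Job("moderation-sweep", human_waiting=False, deadline_hours=72),
    Job("hourly-digest", human_waiting=False, deadline_hours=1),
]
lanes = {job.name: route(job).value for job in jobs}
```

A production version would likely add an intermediate tier for the deadline-bound cases instead of sending them straight back to realtime, but the first cut of the audit is just this one boolean.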



