There is a line in every provider's batch documentation that gets quoted in every "cut your LLM bill" post, and it is the least important thing on the page. The line is 50% off. OpenAI, Anthropic, and Google Gemini all now sell essentially the same deal: submit your requests as a file, accept that they will finish sometime in the next 24 hours instead of the next few seconds, and pay half price on both input and output tokens for the models that support batch. It is a good deal. It is not the reason to use batch.

The reason is a sentence buried further down, in OpenAI's Batch API FAQ and echoed by the others: batch requests run against a separate rate-limit pool and do not consume your standard per-model limits. That is the whole product. The discount is a rebate you happen to collect on the way.

The problem batch actually solves

Picture the job that sends developers looking for batch in the first place. You changed embedding models, and now two million documents need re-embedding. Or you have an eval suite of 500,000 rows and you want to run it against three candidate models before Friday. Or a nightly classification pass over every new record.

Try to push that through the synchronous API and you hit a wall that has nothing to do with cost. Your tokens-per-minute (TPM) limit meters the whole thing to a trickle. To go faster you have to throttle your own production traffic, because the bulk job draws from the same quota: the eval and the live request path are now fighting over one meter. Do the arithmetic on a few million requests at your tier's TPM and the honest answer is days, maybe weeks, and a paged on-call engineer somewhere in the middle of it.
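To make that arithmetic concrete, here is a minimal sketch. Every number in it is hypothetical (document count, tokens per document, TPM limit, and the share of quota the job is allowed to take), so substitute your own tier's figures. It only computes a lower bound: it ignores retries, request-per-minute caps, and output tokens.

```python
def sync_duration_hours(num_requests: int, tokens_per_request: int,
                        tpm_limit: int, quota_share: float = 1.0) -> float:
    """Lower bound on wall-clock hours to push a job through a TPM limit.

    quota_share is the fraction of the limit the bulk job may use;
    the rest is left for production traffic on the same quota.
    """
    total_tokens = num_requests * tokens_per_request
    minutes = total_tokens / (tpm_limit * quota_share)
    return minutes / 60


# Hypothetical job: 2,000,000 documents at 1,500 tokens each, 200,000 TPM.
whole_quota = sync_duration_hours(2_000_000, 1_500, 200_000)
half_quota = sync_duration_hours(2_000_000, 1_500, 200_000, quota_share=0.5)

print(f"{whole_quota:.0f} hours, about {whole_quota / 24:.1f} days")  # 250 hours, about 10.4 days
print(f"{half_quota:.0f} hours, about {half_quota / 24:.1f} days")    # 500 hours, about 20.8 days
```

Under those assumed numbers, taking the entire quota still costs more than ten days, and leaving half of it for live traffic doubles that.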

Batch takes the job off that meter. It draws from a pool that is separate and dramatically larger, so the two-million-document job runs in its own lane without ever touching the quota your users depend on.

The discount is what they advertise. The separate rate-limit pool is what you're actually buying.

That reframing changes when you reach for it. Batch is not "the sync API but cheaper and slower." It is the mechanism for work that is too big to meter through the front door at all — and the 50% is a bonus that makes the finance conversation trivial.

What the shape of a batch job costs you

The tradeoff is not only latency; it is a different failure model, and this is where teams get surprised. A synchronous call is one thing that either works or throws, and you wrap it in a try/except. A batch is a JSONL file — one line per request, each with a custom_id — that comes back as a results file, one line per outcome. Some succeeded. Some failed, independently, for their own reasons. Your client is no longer an exception handler; it is a reconciliation loop that joins outputs back to inputs by ID and decides what to retry.
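A reconciliation loop of that kind might look like the sketch below. It is provider-neutral on purpose: the `body` key, the `error` field on failed lines, and the helper names `build_batch_lines` and `reconcile` are assumptions for illustration, not any provider's actual schema. Only the `custom_id` join comes from the description above. Check the real request and result shapes in your provider's documentation before adapting it.

```python
import json


def build_batch_lines(inputs: dict[str, dict]) -> str:
    """Serialize {custom_id: request_body} as JSONL, one request per line."""
    return "\n".join(
        json.dumps({"custom_id": cid, "body": body}) for cid, body in inputs.items()
    )


def reconcile(inputs: dict[str, dict], results_jsonl: str):
    """Join result lines back to inputs by custom_id.

    Returns (succeeded, failed, missing):
      succeeded: {custom_id: record} for lines without an error
      failed:    {custom_id: record} for lines carrying an error
      missing:   custom_ids that were submitted but never came back
    """
    succeeded, failed = {}, {}
    for line in results_jsonl.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the results file
        record = json.loads(line)
        cid = record["custom_id"]
        # Assumption: a failed outcome carries a non-empty "error" field.
        if record.get("error"):
            failed[cid] = record
        else:
            succeeded[cid] = record
    missing = set(inputs) - set(succeeded) - set(failed)
    return succeeded, failed, missing


# Hypothetical inputs and a hand-written results file, deliberately out of order.
inputs = {"doc-1": {"text": "a"}, "doc-2": {"text": "b"}, "doc-3": {"text": "c"}}
results = "\n".join([
    json.dumps({"custom_id": "doc-2", "error": {"message": "rate limited"}}),
    json.dumps({"custom_id": "doc-1", "response": {"label": "ok"}}),
])

ok, bad, missing = reconcile(inputs, results)
retry = {cid: inputs[cid] for cid in list(bad) + sorted(missing)}
print(sorted(retry))  # ['doc-2', 'doc-3']
```

The join never relies on line order, which is why the `custom_id` matters. Treating "never came back" as its own bucket keeps a dropped request from silently passing as a success.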

The limits are generous but real: OpenAI takes up to 50,000 requests and 200 MB per batch; Anthropic up to 100,000 requests or 256 MB; Gemini bounds by file size, accepting JSONL up to 2 GB. Bigger jobs mean chunking into multiple batches and tracking them.
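Chunking against those limits can be sketched as a generator that cuts a new batch whenever either cap would be exceeded. The function name `chunk_requests` is hypothetical, and reading "200 MB" as 200,000,000 bytes is a deliberately conservative assumption; confirm how your provider counts bytes.

```python
def chunk_requests(lines, max_requests, max_bytes):
    """Yield lists of JSONL lines, each within a request-count and byte-size cap."""
    chunk, size = [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        if line_bytes > max_bytes:
            raise ValueError("a single request exceeds the batch size cap")
        # Start a new batch if adding this line would break either limit.
        if chunk and (len(chunk) >= max_requests or size + line_bytes > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_bytes
    if chunk:
        yield chunk


# Hypothetical job: 120,000 small requests against the OpenAI figures quoted above.
lines = [f'{{"custom_id": "doc-{i}"}}' for i in range(120_000)]
batches = list(chunk_requests(lines, max_requests=50_000, max_bytes=200_000_000))
print([len(b) for b in batches])  # [50000, 50000, 20000]
```

Each yielded chunk becomes one batch submission, so the tracking problem reduces to keeping a record of which batch ID holds which range of `custom_id` values.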

Three sharp edges worth pinning to the wall:

The rule of thumb

Route by who is waiting. If the answer is a person or a live agent loop, it belongs on the synchronous path, and no discount changes that. If the answer is "nobody — this is an eval, a backfill, a bulk labeling pass, a synthetic-data run," then batch is not merely the cheaper option. It is the only one that lets you move the volume without strangling everything else you serve.

The 50% is the sticker in the window. The separate lane is the engine.