Here is a launch-day story you have heard in some form. The app sailed through every demo. Then real traffic arrived — a front-page link, a viral post, an enterprise pilot flipping the switch on a Monday — and within minutes it was returning errors and timing out. The team pulls up the dashboards expecting to find the model crawling under load, a GPU pinned, a queue backing up. Instead the model looks fine. The thing that fell over was something they never thought to test, because it isn't theirs.
If your app is built on a hosted LLM API — and most are — a load test is not measuring the model. It is measuring the provider's rate limiter and your own code's reaction to it. Get that one sentence wrong and you will spend a week optimizing throughput nobody was waiting on, while the actual failure mode sits untouched.
The wall belongs to someone else
A conventional load test pushes a service until something you own saturates: CPU, a connection pool, a database. You watch a resource curve bend and you call that your capacity. LLM APIs break the assumption. Long before your servers strain, you hit a quota the vendor set, and the request comes back 429.
And it isn't one quota. For text models, OpenAI's rate limits are enforced across four dimensions at once (requests per minute, tokens per minute, requests per day, and tokens per day), and exceeding any single one returns the same 429. For an agent, the one that bites first is almost never RPM. A handful of long-context calls, each with a big system prompt or a retrieved document stuffed into the window, will blow your tokens-per-minute while you're nowhere near your request count. Which means a load profile that counts requests is measuring the wrong axis. You can be at 20% of your "limit" by the number you watched and fully throttled by the one you didn't.
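To make the two-axis point concrete, here is a minimal sketch that computes how much of each per-minute limit a load profile consumes. The tier numbers (500 RPM, 200,000 TPM) and the profile (100 requests per minute at 2,500 tokens each) are hypothetical, chosen only to illustrate the arithmetic; substitute the limits your own account reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    rpm: int  # requests per minute allowed by the tier
    tpm: int  # tokens per minute allowed by the tier


def utilization(requests_per_min: float, avg_tokens_per_request: float,
                limits: Limits) -> dict[str, float]:
    """Fraction of each per-minute limit that a load profile consumes."""
    return {
        "rpm": requests_per_min / limits.rpm,
        "tpm": requests_per_min * avg_tokens_per_request / limits.tpm,
    }


def binding_limit(util: dict[str, float]) -> str:
    """Name of the limit that is closest to (or furthest past) exhaustion."""
    return max(util, key=util.get)


# Hypothetical tier and load profile, for illustration only.
limits = Limits(rpm=500, tpm=200_000)
profile = utilization(requests_per_min=100, avg_tokens_per_request=2_500,
                      limits=limits)

for name, fraction in profile.items():
    print(f"{name}: {fraction:.0%}")
# rpm: 20%
# tpm: 125%
print("binds first:", binding_limit(profile))  # binds first: tpm
```

The same request rate reads as 20% on one axis and 125% on the other, which is why a load profile should be specified in tokens as well as requests.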
You can't load-test your way out of a number someone else sets. You can only find where it is — and decide what your app does when it gets there.
Your load tool is lying to you, in three specific ways
Reach for the standard harnesses and they will quietly mismeasure an LLM endpoint.
k6 can't see the stream. k6 treats a response as one unit and records the time from request to final byte. For a streamed completion that is total generation time — which tells you nothing about time-to-first-token, the number that actually governs whether the UI feels alive. k6 has no native server-sent-events support; you need a community extension to even observe the stream. A green k6 report can hide a TTFT that doubled.
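The difference between the two numbers is easy to see in code. The sketch below times any iterable of streamed chunks and reports time-to-first-token separately from total time. `time_stream` and `fake_stream` are hypothetical helper names, and the fake generator stands in for a real streaming client, which is assumed to yield text chunks as they arrive.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream and record TTFT and total duration in seconds.

    The clock starts before the first chunk is requested, so connection
    and queueing time count toward TTFT, as they do for a real user.
    """
    start = clock()
    first: Optional[float] = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if first is None:
            first = clock()
        count += 1
    end = clock()
    return {
        "ttft_s": None if first is None else first - start,
        "total_s": end - start,
        "chunks": count,
    }


def fake_stream(first_delay_s: float, gap_s: float, n: int) -> Iterator[str]:
    """Stand-in for a streaming client: slow first chunk, then steady output."""
    time.sleep(first_delay_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i} "


result = time_stream(fake_stream(first_delay_s=0.05, gap_s=0.01, n=5))
print(result)
```

A tool that only records request-to-final-byte reports `total_s`. If the first-chunk delay doubles while the stream gets shorter, `total_s` can stay flat while `ttft_s` regresses.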
Locust fights its own measurement. Locust can parse the stream, but tokenizing streamed responses is CPU work, and in Python that work runs under the GIL. The Locust docs are blunt that one process can't use more than one core — you're told to run one worker per core via --processes. Skip that and, under exactly the high concurrency you're testing for, the tokenization backlog inflates the latencies you're reading. The tool's overhead becomes part of your p99.
The test itself costs real money. This is the trap teams find late. A realistic soak at moderate concurrency burns millions of tokens, and at list prices a thorough load-testing program adds up to a real bill every night it runs. You cannot treat "run it overnight" as free the way you do for a REST service.
The fix for the third one also speeds up work on the first two: most of what a load test exercises is model-agnostic. Connection pooling, your queue, retry logic, autoscaling, your own latency under fan-out: none of it cares whether a real model or a mock answered. Point the bulk of your runs at a stub endpoint or a cheap small model, and spend full-price tokens only on the few questions that genuinely require the production model: real TTFT, real token-length distributions, and how your tier's limiter actually behaves at the edge.
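A stub endpoint does not need to be elaborate. The following is a minimal sketch using only the Python standard library: it accepts any POST, ignores the prompt, and streams a fixed number of server-sent events with configurable delays. The event payload shape, the `[DONE]` sentinel, the delay values, and the names `StubHandler` and `start_stub` are all assumptions for illustration; mirror whatever your real provider emits so your client code parses both the same way.

```python
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubHandler(BaseHTTPRequestHandler):
    # Hypothetical knobs: tune these to resemble your production latencies.
    tokens = 8
    first_token_delay_s = 0.02
    inter_token_delay_s = 0.005

    def do_POST(self):
        # Drain the request body; the stub never looks at the prompt.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)

        self.send_response(200)
        self.send_header("Content-Type", "text/event-stream")
        self.send_header("Cache-Control", "no-cache")
        self.end_headers()

        time.sleep(self.first_token_delay_s)  # simulated time-to-first-token
        for i in range(self.tokens):
            if i:
                time.sleep(self.inter_token_delay_s)
            event = json.dumps({"delta": f"tok{i} "})
            self.wfile.write(f"data: {event}\n\n".encode())
            self.wfile.flush()
        self.wfile.write(b"data: [DONE]\n\n")

    def log_message(self, format, *args):
        pass  # keep load-test output quiet


def start_stub(port: int = 0) -> ThreadingHTTPServer:
    """Start the stub in a daemon thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), StubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Call `start_stub()`, read the chosen port from `server.server_address[1]`, and point the load tool at it. Because the handler answers every request, it exercises pooling, queues, and autoscaling at zero token cost; it says nothing about the provider's limiter, which still needs a small number of runs against the real API.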
Test the fall, not just the ceiling
Once you accept that the ceiling is exogenous, the test's job changes. Finding the 429 boundary is the easy half. The valuable half is proving what your app does at the boundary — because the default behavior is catastrophic.
A naive retry-on-429 is the canonical own-goal: the limit trips, every client immediately retries, the retries land on an already-throttled endpoint, and a momentary limit becomes sustained overload that drains your quota faster than if you'd done nothing. (This is the fan-out failure that backpressure exists to prevent: the fix is upstream flow control, not a politer retry.) A correct load test deliberately drives you past the limit and asserts the answers: does backoff honor the Retry-After header instead of guessing? Does fan-out have admission control so twenty sub-agents don't become a self-inflicted DDoS? When TPM is exhausted, does the app shed, queue, or fall back to a smaller model, or does it just hang?
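One way to make the first of those questions testable is to isolate the wait computation from the HTTP client. The sketch below prefers the server's Retry-After value when it is present and falls back to capped exponential backoff with full jitter when it is not. It assumes the caller supplies a `send` function returning `(status, headers, body)` with lowercased header names, and it handles only the delay-in-seconds form of Retry-After; the function names and default values are illustrative, not any library's API.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def backoff_delay(attempt: int, headers: Mapping[str, str],
                  base_s: float = 0.5, cap_s: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retrying after failed attempt `attempt` (0-based)."""
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # the server knows best
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    # Full jitter: spread clients across the window so retries don't synchronize.
    return rng() * min(cap_s, base_s * 2 ** attempt)


def call_with_backoff(send: Callable[[], Response], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Retry on 429 up to max_attempts, then hand the last 429 to the caller."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send()
        if response[0] != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, response[1]))
    return response  # still throttled: the caller sheds, queues, or falls back
```

Injecting `sleep` and `rng` lets a test assert the exact waits without waiting for them. Returning the final 429 instead of raising keeps the degradation decision in the caller, where the runbook says it belongs.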
That is the actual deliverable. Not a tokens-per-second figure (that's a benchmarking question, and it's the serving engine's to answer, not your app's). The output of an LLM load test is a one-page runbook: the concurrency at which you begin shedding, the degradation path you chose, and proof that the path fires under load instead of in a postmortem.
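The "concurrency at which you begin shedding" can be an explicit number in code, not something discovered in production. The sketch below is one possible shape for it: a non-blocking admission gate that admits up to a fixed number of in-flight requests and tells the caller to shed the rest. `AdmissionGate` and its counters are hypothetical names, and shedding (as opposed to queueing or falling back) is just the degradation path chosen for this example.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class AdmissionGate:
    """Admit at most `max_in_flight` concurrent requests; shed the rest."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self.admitted = 0  # counters a load test can assert on
        self.shed = 0

    @contextmanager
    def admit(self) -> Iterator[bool]:
        # Non-blocking: a full gate answers immediately instead of hanging.
        ok = self._slots.acquire(blocking=False)
        with self._lock:
            if ok:
                self.admitted += 1
            else:
                self.shed += 1
        try:
            yield ok
        finally:
            if ok:
                self._slots.release()


def handle(gate: AdmissionGate, prompt: str) -> str:
    """Hypothetical request handler showing where the shed path lives."""
    with gate.admit() as ok:
        if not ok:
            return "busy, try again shortly"  # the degradation path
        return f"(model call for: {prompt})"  # placeholder for the real call
```

Under load, the `admitted` and `shed` counters give the test something concrete to assert: that shedding begins at the configured threshold and that every shed request got the fallback response.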
Tools that already speak LLM — Ray's LLMPerf, the vLLM project's GuideLLM for SLO-driven sweeps — will give you TTFT, a fixed tokenizer for honest counts, and a load curve without the streaming blind spots. But the tool is the cheap part. The expensive part is deciding, before launch day, what your product is supposed to feel like the moment the limit you don't own says no.



