Your LLM call returns 200 OK. No exception, no error code, no retry triggered. The JSON parses — right up until the line where it doesn't, because the closing brace never came. The model was three sentences into a summary, or halfway through a tool argument, and then it simply stopped. Nothing in your try/except fired, because as far as HTTP is concerned, nothing went wrong.
This is the failure the reliability playbook keeps missing. We've all wired up retries, backoff, and fallback chains for the 4xx and 5xx errors, and we put a deadline on the whole loop. But truncation isn't in that family. It's a successful response that happens to be incomplete, and the only evidence is a status field you have to go looking for.
It's a field, not an exception#
Every major API tells you it truncated. It just doesn't tell you loudly. The signal is a stop-reason enum on the response body:
- OpenAI Chat Completions:
choices[0].finish_reason == "length". A natural finish is"stop";"length"means it ran out of room. - Anthropic Messages:
stop_reason == "max_tokens". The docs are blunt about the remedy — "Raisemax_tokensor continue the response" — and list a separatemodel_context_window_exceededfor when the whole window, not just your cap, is the wall. - Google Gemini:
candidates[0].finishReason == "MAX_TOKENS", with the added hazard that readingresponse.textcan throw when the truncated candidate came back with no usable part at all.
The load-bearing point is that a truncated call and a clean one are both 200. If your code path checks the HTTP status and then hands the body straight to json.loads or your Pydantic model, you have no truncation handling — you have a latent bug that surfaces as a parse error three layers away from its cause. The single highest-leverage thing you can do is log the stop field on every call. It converts an invisible failure into a countable one.
A truncated response is not the API failing. It's the API succeeding at giving you less than you needed — and only whispering that it did.
Why "just raise max_tokens" is a trap#
The reflex fix is to crank the output ceiling. Sometimes that's correct. Often it's a mistake wearing the mask of a fix.
Two reasons. First, some outputs are effectively unbounded: an agent that's looping, a model padding a list, a generation with no natural stopping point will happily consume whatever ceiling you give it, so a huge max_tokens trades a truncation bug for a runaway-cost bug and removes your only backstop.
Second — and this is the part the 2024-era advice never covered — reasoning models share one budget between thinking and output. On OpenAI's o-series and GPT-5, max_completion_tokens pays for the hidden reasoning tokens and the visible answer. On Gemini 2.5's thinking models, the thinking tokens count against maxOutputTokens. You cannot see these tokens, but they are billed and they are counted, so a budget that looks generous for the answer can be entirely consumed before a single visible character is emitted. The result, reported over and over across OpenAI's and Google's forums, is an empty response with the truncation flag already set — the model spent your whole budget reasoning and returned nothing. OpenAI's own guidance is to reserve on the order of 25,000 tokens for reasoning plus output when you start. If you're tuning a thinking budget separately, that ceiling has to sit above your visible-output estimate, not equal to it.
Continuation is safe for prose, hostile to JSON#
Once you've detected truncation, the textbook move is to continue: append the partial assistant message plus a short "continue" and call again. For prose, this works well — and there's a real cost lever hiding in it. As Anthropic's docs note, you should not resend the giant original prompt on the continuation; append only the partial output and the nudge, so prompt caching keeps the expensive prefix warm and you pay a fraction of the input cost per continuation.
But continuation quietly breaks on structured output, and this is where teams get burned. A JSON object is only valid once its closing brace lands, so a truncated JSON isn't "most of the answer" — it's unparseable. Stitching a second generation onto the fragment assumes the model resumes token-exactly, and it doesn't; it re-generates, drifts, sometimes re-opens a key it already wrote. Gemini users hit exactly this in batch mode: MAX_TOKENS yields invalid JSON that no amount of concatenation repairs. So the rule splits by output type. Prose: continue from the partial. Structured output: discard the fragment and retry the whole call with more headroom (or accumulate a streamed object with a tolerant partial parser and validate once at the end). Trying to "continue" a broken JSON is how a truncation bug becomes a data-corruption bug.
What to actually do#
Truncation handling is control flow, not a magic number. Four rules cover it:
- Read the stop field on every call, including the final chunk of a stream — that's where the signal rides, so a stream you stop reading early hides the truncation entirely.
- Branch on it: prose continues, JSON retries with headroom, and either way you never parse a
length/max_tokensbody as if it were complete. - Budget for reasoning separately on thinking models — reserve real headroom above the visible answer, or accept the occasional empty response as the cost of a tight cap.
- Log the field so silent truncation becomes a metric you can alert on instead of a bug a user reports.
The retries and timeouts you already built answer "what if the call fails?" Truncation answers a stranger question — "what if the call succeeds and still isn't done?" — and the whole trick is remembering to ask it.



