Ask an engineer whether a large language model is deterministic and you'll usually get a confident answer: set the temperature to zero and it is. Greedy decoding picks the single most probable token at every step; the same prompt should retrace the same path forever. It is the kind of thing that sounds obviously true and turns out to be wrong the moment you test it — even when you own the GPU and nobody else is touching it.
Run the experiment and the cracks show immediately. In September 2025, Thinking Machines Lab, the research outfit founded by former OpenAI CTO Mira Murati, published a piece called "Defeating Nondeterminism in LLM Inference" that did exactly this. They sent one prompt ("Tell me about Richard Feynman") to Qwen3-235B at temperature 0, one thousand times. They got 80 distinct completions. The thousand runs agreed perfectly for the first 102 tokens, then forked at token 103.
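As a minimal sketch of how a result like that might be summarized, the snippet below counts distinct completions and finds the first token position where any run departs from the others. The function names and the toy token lists are illustrative assumptions, not the authors' tooling.

```python
from collections import Counter


def count_distinct(completions):
    """Return how many different completions appear, plus their frequencies."""
    counts = Counter(tuple(tokens) for tokens in completions)
    return len(counts), counts


def first_divergence(completions):
    """Return the first 0-based token index where the runs disagree.

    Returns None if every run is identical.
    """
    shortest = min(len(tokens) for tokens in completions)
    for i in range(shortest):
        if len({tokens[i] for tokens in completions}) > 1:
            return i
    # The shared prefix agrees; the runs differ only if their lengths differ.
    if len({len(tokens) for tokens in completions}) > 1:
        return shortest
    return None


# Toy data standing in for tokenized model outputs (hypothetical).
runs = [
    ["Richard", "Feynman", "was", "born", "in", "Queens"],
    ["Richard", "Feynman", "was", "born", "in", "Queens"],
    ["Richard", "Feynman", "was", "born", "in", "New", "York"],
]

distinct, _ = count_distinct(runs)
print(distinct)                # 2
print(first_divergence(runs))  # 5
```

Run against real outputs, the first number is the "distinct completions" figure and the second is the point where the runs fork.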
The explanation everyone gives is incomplete
Press a developer on why and you'll hear the standard story: floating-point arithmetic is non-associative, meaning (a + b) + c need not equal a + (b + c) once each intermediate result is rounded, and GPUs sum thousands of values concurrently, in whatever order the hardware finishes, so the totals wobble run to run. Every clause of that is true. NVIDIA documents how atomicAdd and parallel reductions produce order-dependent results, and there's peer-reviewed work on exactly this reproducibility problem.
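The non-associativity is easy to see on a CPU with ordinary double-precision floats. This is a small illustration of the rounding effect only; it involves no GPU or concurrency.

```python
# Grouping changes the result because each intermediate sum is rounded.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# The effect is starkest when magnitudes differ: the small term is
# absorbed when it is added to the large one first.
absorbed = (0.1 + 1e20) - 1e20
kept = 0.1 + (1e20 - 1e20)
print(absorbed, kept)  # 0.0 0.1
```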
But Thinking Machines makes a sharper claim: in the typical forward pass of an LLM, there is essentially no atomic add to blame. The concurrency-plus-floating-point story explains a hazard that mostly isn't present. So it can't be the main reason your served model keeps changing its mind.
Your model's output doesn't depend only on your request. It depends on how many other people's requests happened to be in the batch with yours.
The real culprit: your output depends on the other tenants
The actual cause is something they name precisely: a lack of batch invariance. A production inference server doesn't run your request alone. It continuously groups incoming requests into batches and runs them together for throughput. The numerical result of the core kernels — the matrix multiplications, the attention computation, the RMSNorm — depends on how big that batch is, because the batch size changes the internal reduction strategy and therefore the order in which floating-point numbers get added.
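A toy model can make the dependence concrete. In the sketch below, `pick_num_splits` is a hypothetical heuristic standing in for whatever strategy a real kernel chooses (real GPU kernels decide this differently and in far more detail), and the row values are picked to make the rounding visible.

```python
def pick_num_splits(batch_size, num_workers=4):
    """Hypothetical heuristic: with few rows in the batch, split each row's
    reduction across idle workers; with many rows, give each row one worker."""
    return max(1, num_workers // batch_size)


def split_reduce(values, num_splits):
    """Sum `values` by reducing contiguous chunks, then adding the partials."""
    chunk = -(-len(values) // num_splits)  # ceiling division
    partials = []
    for start in range(0, len(values), chunk):
        total = 0.0
        for v in values[start:start + chunk]:
            total += v
        partials.append(total)
    result = 0.0
    for p in partials:
        result += p
    return result


# One request's row of numbers; the same data in every batch.
my_row = [0.1, 1e20, -1e20, 0.1]

for batch_size in (1, 2, 4, 8):
    splits = pick_num_splits(batch_size)
    print(batch_size, splits, split_reduce(my_row, splits))
# Batch size 2 yields 0.0; batch sizes 1, 4, and 8 yield 0.1.
```

The row never changes. Only the number of neighbors in the batch does, and that alone is enough to change the sum.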
And the batch size is set by load. How many other users are hitting the endpoint in the same tens of milliseconds is, from your seat, pure noise. So the chain runs: other people's traffic → the server's batch size → the kernel's reduction order → tiny shifts in your logits → a flipped argmax on a close call → a different token → a different completion. The randomness was never in your prompt. It leaked in from the tenants you can't see.
This reframes the whole problem. Nondeterminism stops being an inevitable property of floating-point hardware and becomes a kernel-engineering choice. Thinking Machines wrote batch-invariant versions of the three offending kernels — ones whose reduction order is fixed no matter the batch size — and re-ran the experiment. All 1,000 completions came back identical. They open-sourced the kernels with a vLLM integration. Bitwise reproducibility on shared serving infrastructure turns out to be achievable; it just costs some throughput, and nobody had bothered to pay for it.
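The idea of a fixed reduction order can be sketched in a few lines. This is a toy illustration of the principle, not the open-sourced kernels: the chunk size is an assumed constant that is never derived from the batch size, so every row is summed in the same order no matter how many other rows share the batch.

```python
FIXED_CHUNK = 2  # assumed constant; chosen once, never derived from batch size


def invariant_reduce(values, chunk_size=FIXED_CHUNK):
    """Sum `values` with a chunking that depends only on the row itself,
    so the order of additions is the same at every batch size."""
    partials = []
    for start in range(0, len(values), chunk_size):
        total = 0.0
        for v in values[start:start + chunk_size]:
            total += v
        partials.append(total)
    result = 0.0
    for p in partials:
        result += p
    return result


def run_batch(rows):
    """Process a batch; each row is reduced the same way regardless of len(rows)."""
    return [invariant_reduce(row) for row in rows]


my_row = [0.1, 1e20, -1e20, 0.1]
filler = [1.0, 2.0, 3.0, 4.0]  # stands in for other users' requests

results = {run_batch([my_row] + [filler] * extra)[0] for extra in (0, 1, 3, 7)}
print(len(results))  # 1: the row's result no longer depends on batch size
```

Note that the invariant answer here is 0.0, which is no more accurate than the alternatives. Batch invariance buys the same answer every time, not a better one, and a fixed strategy can leave hardware idle at small batch sizes, which is where the throughput cost comes from.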
Why temperature 0 and seeds don't rescue you
Greedy decoding doesn't help because the failure is upstream of the sampler. Temperature 0 faithfully takes the argmax, but when the top two logits differ by a hair, a perturbation on the order of floating-point rounding error is enough to swap their order. That's the mechanism behind the divergence at token 103: everything is identical until the first near-tie flips, and then the paths separate for good, because every later token is conditioned on a different prefix.
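A small sketch of the near-tie flip, using made-up logits. The size of the perturbation is an assumption for illustration; the point is only that the same shift is harmless when the leader is clear and decisive when it is not.

```python
def argmax(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for three candidate tokens.
near_tie = [2.5000001, 2.5000000, -1.0]  # top two almost equal
clear_win = [5.0, 2.5, -1.0]             # top token well ahead

# A tiny shift on the first logit, standing in for a different reduction order.
noise = [-2e-7, 0.0, 0.0]


def perturb(logits):
    return [x + n for x, n in zip(logits, noise)]


print(argmax(near_tie), argmax(perturb(near_tie)))    # 0 1
print(argmax(clear_win), argmax(perturb(clear_win)))  # 0 0
```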
Seeds are the next thing people reach for, and the hosted APIs are honest about their limits if you read the docs. OpenAI's own guidance says a seed makes the system "make a best effort to sample deterministically" and that "determinism is not guaranteed," directing you to the system_fingerprint field to detect when the backend configuration changes underneath you. It goes further: outputs can differ even when the seed and the fingerprint both match. On a shared endpoint you cannot pin the batch size, so a seed buys you "usually the same," not "always the same."
Why an agent builder should care
For a chat feature, "usually the same" is fine. Two places it is not:
- Evals. If a benchmark run isn't reproducible, you can't tell a real regression from run-to-run noise, and a flaky test that fails one run in fifty erodes trust in the whole suite. That noise stacks on top of every other source of variance in your evaluation harness.
- Reinforcement learning. This is the one Thinking Machines leads with, and it's the least obvious. RL fine-tuning assumes the model that generated a sample and the model that trains on it compute the same numbers. If inference and training use different kernels — different batch sizes, different reduction orders — your "on-policy" rollouts are subtly off-policy, and the optimization quietly degrades. Reproducible inference isn't a nicety there; it's a correctness requirement for the training loop.
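For the reinforcement learning case, one rough way to check the assumption is to compare the log-probabilities the sampler assigned to its own tokens with the ones the trainer computes for the same tokens. The sketch below is a hedged illustration with made-up numbers; the function names are hypothetical and not taken from any RL library.

```python
import math


def importance_ratios(sampler_logprobs, trainer_logprobs):
    """Per-token ratio pi_trainer(token) / pi_sampler(token)."""
    return [math.exp(t - s) for s, t in zip(sampler_logprobs, trainer_logprobs)]


def is_bitwise_on_policy(sampler_logprobs, trainer_logprobs):
    """True only if both systems assigned exactly the same log-probabilities."""
    return list(sampler_logprobs) == list(trainer_logprobs)


# Hypothetical log-probabilities of the sampled tokens under each system.
sampler = [-0.105, -2.302, -0.693]
trainer = [-0.105, -2.299, -0.693]  # one token differs slightly

print(is_bitwise_on_policy(sampler, trainer))  # False
print(max(abs(r - 1.0) for r in importance_ratios(sampler, trainer)))
```

When the two systems agree bitwise, every ratio is exactly 1.0 and the rollouts are truly on-policy. Any deviation from 1.0 measures how far off-policy the data has drifted.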
So the practical posture: for product work, set temperature 0, pin a seed and the model version, hold parameters constant, and log system_fingerprint so you notice drift. For anything that demands bitwise agreement — rigorous evals, RL — accept that the hosted API can't give it to you, and reach for self-hosted batch-invariant kernels. The useful mental correction is to stop treating nondeterminism as physics. It's a property of how the server batches you with strangers, and that, it turns out, is fixable.
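The logging half of that posture might look something like the sketch below. It makes no API calls: it assumes you have already extracted the output text and the system_fingerprint value from each response, and the record layout, function names, and fingerprint strings are all illustrative assumptions.

```python
import hashlib


def _sha(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_record(prompt, seed, system_fingerprint, output_text):
    """Build a log record; the field layout here is an illustrative assumption."""
    return {
        "prompt_sha": _sha(prompt),
        "seed": seed,
        "system_fingerprint": system_fingerprint,
        "output_sha": _sha(output_text),
    }


def classify_drift(previous, current):
    """Compare two records that should describe the same prompt and seed."""
    if (previous["prompt_sha"], previous["seed"]) != (current["prompt_sha"], current["seed"]):
        raise ValueError("records are for different prompts or seeds")
    if previous["output_sha"] == current["output_sha"]:
        return "same output"
    if previous["system_fingerprint"] != current["system_fingerprint"]:
        return "output changed; backend fingerprint changed"
    return "output changed; same fingerprint"


# Made-up values for demonstration only.
monday = make_record("Tell me about Richard Feynman", 7, "fp_example_a", "He was a physicist.")
tuesday = make_record("Tell me about Richard Feynman", 7, "fp_example_a", "He was a theorist.")
print(classify_drift(monday, tuesday))  # output changed; same fingerprint
```

The last category is the interesting one: a changed output under the same seed and the same fingerprint is the case a seed cannot prevent on a shared endpoint.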



