The most expensive mistake in model selection right now is treating a reasoning model as a strictly better LLM: the one you reach for when you want good answers instead of cheap ones. It is not that. A reasoning model is a compute-allocation decision, and like every allocation decision it is wrong for most of the places you would reflexively apply it.

Here is the mechanism, stripped of marketing. A standard instruct model starts writing its answer straight from your prompt, with no separate deliberation phase. A reasoning model (OpenAI's o1 and o3, DeepSeek-R1, the "thinking" modes now in Claude, Gemini, and Qwen) has been trained, usually with large-scale reinforcement learning, to first generate a long internal chain of thought and only then answer. OpenAI's own framing is that o1 "thinks before it answers," and that performance "consistently improves… with more time spent thinking." The accuracy comes from spending tokens at inference time. That is the whole trick, and also the whole catch.

The bill is hidden, not absent

Those thinking tokens are real output tokens. OpenAI states plainly that reasoning tokens "are billed as output tokens" even though they are not returned through the API; Anthropic's extended-thinking docs and Google's Gemini thinking docs say the same about their thinking budgets. So a reasoning model does not just answer your question — it writes an essay to itself first, at full output-token price, and then answers. A single hard query can burn thousands of invisible tokens before the first visible character appears, which is also why latency climbs: the model is, quite literally, taking its time.
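
To make the hidden bill concrete, here is a minimal sketch of the arithmetic, assuming reasoning tokens are billed at the output rate as described above. The prices, the token counts, and the names Usage and request_cost_usd are illustrative placeholders, not any vendor's rates or API, and the sketch assumes the same per-token prices for both requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int  # hidden from the response, but billed as output


def request_cost_usd(usage: Usage, input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, counting hidden reasoning tokens as output tokens."""
    billed_output = usage.visible_output_tokens + usage.reasoning_tokens
    total = usage.input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


# Placeholder prices in USD per million tokens (assumption, not real rates).
INPUT_PRICE, OUTPUT_PRICE = 2.0, 8.0

# Same prompt and same visible answer; only the hidden thinking differs.
plain = Usage(input_tokens=500, visible_output_tokens=200, reasoning_tokens=0)
thinking = Usage(input_tokens=500, visible_output_tokens=200, reasoning_tokens=4000)

plain_cost = request_cost_usd(plain, INPUT_PRICE, OUTPUT_PRICE)
thinking_cost = request_cost_usd(thinking, INPUT_PRICE, OUTPUT_PRICE)
print(f"plain: ${plain_cost:.4f}  thinking: ${thinking_cost:.4f}  "
      f"ratio: {thinking_cost / plain_cost:.1f}x")
```

With these placeholder numbers, 4,000 hidden tokens turn a $0.0026 request into a $0.0346 one, roughly 13x, before any difference in per-token price between the two models is counted.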

For a high-volume, latency-sensitive endpoint, that is a poor trade. For a once-a-day hard planning step, it can be the best money you spend. The point is that this is a genuine trade, more accuracy on some tasks in exchange for more cost and latency on all of them, not a strict upgrade.

Where the accuracy actually lives

The gains are not spread evenly across tasks; they are concentrated on a specific shape of problem, and the reason is structural. Both the RL reward used to train reasoning and the "coverage" you get from sampling many candidate solutions at inference need something that can check an answer. The "Large Language Monkeys" work makes this explicit: where answers can be automatically verified, throwing more inference compute at a set of problems translates almost directly into solving more of them (in one of their coding-benchmark experiments, the share of problems solved rose from about 16% at one sample to 56% at 250 samples). DeepSeek-R1's reported strengths sit in the same place (mathematics, competition coding, STEM), all domains with a grader. The "s1" project and the DeepMind test-time-compute scaling paper round out the picture: under the right conditions, spending compute at inference can beat spending it on more parameters.
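
Coverage of this kind is usually measured as pass@k: the probability that at least one of k samples is correct. The sketch below uses the standard unbiased pass@k estimator from the code-generation literature; that the papers cited above compute coverage with exactly this formula is an assumption here, and the function name and the per-problem tallies are made up for illustration.

```python
from math import comb


def coverage_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given that c of n samples drawn for one problem passed the verifier."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up tallies: verified-correct samples out of n=250 for four problems.
n = 250
correct_counts = [0, 1, 5, 40]

for k in (1, 10, 250):
    mean_cov = sum(coverage_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
    print(f"k={k:>3}: coverage={mean_cov:.3f}")
```

Note what the estimator depends on: the count c only exists because a verifier marked each sample as passing or failing. Without one, more samples raise the chance that a correct answer is somewhere in the pile, but give you no way to pick it out.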

The corollary is the part people skip. On tasks without a crisp verifier — open-ended writing, summarization, fuzzy judgment calls, simple lookups — the extra reasoning has far less to bite on, and the cost and latency penalty arrives anyway. There is even a documented failure mode: a 2024 paper titled, memorably, "Do NOT Think That Much for 2+3=?" shows o1-like models spending wildly more tokens than a plain model to answer trivial arithmetic, with no accuracy benefit. That is overthinking as a measurable tax.

The dial is the real abstraction

The tell that the industry has internalized all of this is what every major vendor shipped next: not "a reasoning model" as a separate product you switch to, but a dial. OpenAI exposes reasoning_effort. Anthropic exposes an effort setting and a budget_tokens cap on extended thinking. Google exposes a thinkingBudget. Qwen3 ships a single model that toggles between think and no-think modes. The durable primitive is not the binary "reasoning vs standard" — it is how much thinking you authorize, per request, against the difficulty of that request.
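
One way to picture the dial is as a function from estimated difficulty to a thinking allowance. The sketch below is vendor-neutral and hypothetical: the tier thresholds, the level names, the token budgets, and the names ThinkingSetting and setting_for are all placeholders. It does not model the accepted values or limits of reasoning_effort, budget_tokens, or thinkingBudget, which differ by vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThinkingSetting:
    effort: str         # coarse label, for APIs that take a named level
    budget_tokens: int  # numeric cap, for APIs that take a token budget


# (upper difficulty bound, setting). Placeholder values to tune on real traffic.
TIERS = [
    (0.30, ThinkingSetting("none", 0)),
    (0.60, ThinkingSetting("low", 1_024)),
    (0.85, ThinkingSetting("medium", 4_096)),
    (1.00, ThinkingSetting("high", 16_384)),
]


def setting_for(difficulty: float) -> ThinkingSetting:
    """Pick the cheapest tier whose upper bound covers the difficulty score."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    for upper, setting in TIERS:
        if difficulty <= upper:
            return setting
    return TIERS[-1][1]  # unreachable for valid input; kept as a safe default


for d in (0.1, 0.5, 0.8, 0.95):
    print(d, setting_for(d))
```

The design choice worth noticing is that the setting is computed per request. How the difficulty score itself is produced (a heuristic, a small classifier, a retry after a failed cheap attempt) is left open here.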

That reframes the architecture. The right production pattern is not to standardize your stack on a reasoning model and eat the bill, nor to avoid reasoning models and lose the hard cases. It is to route by difficulty: a cheap instruct model for the easy 80% (classification, extraction, tool selection) and a reasoning model (or a high effort setting) reserved for the genuinely hard, verifiable steps where the extra tokens pay for themselves. This is the same triage logic that governs agentic versus single-pass retrieval, and it is why the benchmarks that matter for agents increasingly grade verifiable artifacts: a verifier is exactly what makes the extra compute worth spending.
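
A minimal sketch of that routing rule follows, under stated assumptions: each step arrives already labeled with a task type, a verifiability flag, and a difficulty score. The model names, the 0.7 threshold, and the Step, Route, and route names are hypothetical.

```python
from dataclasses import dataclass

# Task types the text names as the easy bulk of traffic.
EASY_TASKS = {"classification", "extraction", "tool_selection"}


@dataclass(frozen=True)
class Step:
    task: str
    verifiable: bool   # can the output be checked automatically?
    difficulty: float  # 0.0 (trivial) to 1.0 (hard); the estimator is not shown


@dataclass(frozen=True)
class Route:
    model: str  # hypothetical model names
    think: bool


def route(step: Step, hard_threshold: float = 0.7) -> Route:
    """Send only hard, verifiable steps to the reasoning model."""
    if step.task in EASY_TASKS:
        return Route("instruct-small", think=False)
    if step.verifiable and step.difficulty >= hard_threshold:
        return Route("reasoning-large", think=True)
    # Hard but unverifiable work stays on the cheap path by default.
    return Route("instruct-small", think=False)


steps = [
    Step("classification", verifiable=True, difficulty=0.2),
    Step("summarization", verifiable=False, difficulty=0.8),
    Step("code_fix", verifiable=True, difficulty=0.9),
]
for s in steps:
    print(s.task, "->", route(s))
```

Only the code_fix step reaches the reasoning model here. Keeping hard but unverifiable steps on the cheap path is a deliberately conservative default for this sketch; a real router might escalate some of them.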

And none of this is locked behind a frontier API. DeepSeek-R1 is open-weights under the MIT license, with reasoning distilled into dense models from 1.5B to 70B — so you can buy down the cost by running a smaller reasoning model where a frontier one would be overkill. The capability is now a commodity. The skill that is not commoditized is knowing, query by query, how much of it to buy.

Benchmark figures above are each paper's or vendor's own published claims, attributed as such; no live leaderboard numbers are quoted, as they go stale quickly. Token-billing and thinking-budget behavior is drawn from the current OpenAI, Anthropic, and Google documentation as of 2026-06-22.