Most developers met these three techniques as a staircase. Zero-shot is the ground floor: just ask. Few-shot is one flight up: show a couple of examples. Chain-of-thought is the penthouse: tell the model to think step by step and watch the accuracy climb. Pick the highest one you can afford and you win.

That mental model is wrong in a way that costs real money in tokens, and on the newest models it is actively backwards. The three are not a ladder. They are different tools that fix different problems, and at least one of them now hurts on the class of model most teams are reaching for.

Where each one actually came from

The techniques arrived as separate discoveries, not rungs of one ladder.

Few-shot is the founding trick of the GPT-3 era. Brown et al.'s 2020 paper, "Language Models are Few-Shot Learners," showed that a 175-billion-parameter model could do new tasks specified purely as a few text demonstrations in the prompt, with no gradient updates and no fine-tuning. The examples don't teach the model anything new; they locate a capability it already has. That origin is the key to using it well, and almost everyone forgets it.

Chain-of-thought came two years later. Wei et al. (2022) found that prepending worked examples that show the reasoning steps sharply improved multi-step problems — but only above a scale threshold. Below roughly 100B parameters the curve is flat; above it, chain-of-thought unlocks a jump. The headline: PaLM 540B prompted with just eight chain-of-thought exemplars went from around 18% to roughly 57% on the GSM8K grade-school math benchmark, beating a fine-tuned GPT-3 with a verifier. Chain-of-thought was never a universal upgrade. It was an emergent one.

Zero-shot chain-of-thought is the punchline. Kojima et al. (2022) showed you often don't need the examples at all — just append "Let's think step by step." On a 2022-era model (text-davinci-002), that one phrase lifted MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7%. A sentence did most of the work the eight hand-built exemplars did.
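To make the difference concrete, here is a minimal sketch of the prompt shapes described above. The helper names, the `Q:`/`A:` layout, and the sample questions are illustrative assumptions, not the exact templates from either paper; only the trigger phrase is taken from Kojima et al.

```python
# Hypothetical helpers; names, layout, and sample questions are illustrative only.
COT_TRIGGER = "Let's think step by step."  # phrase from Kojima et al. (2022)


def zero_shot(question: str) -> str:
    """Plain question: no examples, no reasoning cue."""
    return f"Q: {question}\nA:"


def zero_shot_cot(question: str) -> str:
    """Same question with the trigger phrase placed in the answer slot."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def few_shot_cot(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Prepend worked examples whose answers spell out the reasoning steps."""
    shots = "\n\n".join(f"Q: {q}\nA: {worked}" for q, worked in exemplars)
    return f"{shots}\n\n{zero_shot(question)}"


exemplars = [
    (
        "Tom has 3 apples and buys 2 more. How many apples does he have?",
        "Tom starts with 3 apples. 3 + 2 = 5. The answer is 5.",
    ),
]
question = "A shelf holds 4 rows of 6 books. How many books are on it?"

print(zero_shot_cot(question))
print()
print(few_shot_cot(question, exemplars))
```

The only thing separating the zero-shot and zero-shot chain-of-thought prompts is one appended sentence, which is why the Kojima result was so striking.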

Chain-of-thought was never a free upgrade you bolt onto any model. It was a capability that switched on at scale — which is exactly why it can switch off again when the model changes.

Few-shot's real job (and its hidden tax)

Because few-shot locates behavior rather than teaching it, its sweet spot is narrow and specific: pinning an output format, demonstrating a labeling scheme, showing the shape of an edge case. If you need strict JSON, three examples of strict JSON beat three paragraphs describing JSON. That is the job.
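A rough sketch of format pinning, assuming a made-up review-classification task: the demonstrations, the `label` and `topic` keys, and both function names are hypothetical. The point is that the examples carry the output shape, and a cheap check can confirm a reply follows it.

```python
import json

# Hypothetical demonstrations: every output uses the same two keys,
# and each label appears exactly once.
DEMOS = [
    ("Refund took three weeks to arrive.", {"label": "negative", "topic": "billing"}),
    ("Setup was done in five minutes.", {"label": "positive", "topic": "onboarding"}),
    ("It works, nothing special.", {"label": "neutral", "topic": "general"}),
]


def build_format_prompt(text: str) -> str:
    """Show the output shape through examples rather than describing it."""
    lines = ["Classify the review. Reply with JSON only."]
    for review, output in DEMOS:
        lines.append(f"Review: {review}")
        lines.append(f"Output: {json.dumps(output)}")
    lines.append(f"Review: {text}")
    lines.append("Output:")
    return "\n".join(lines)


def matches_demo_shape(reply: str) -> bool:
    """True if the reply parses as a JSON object with exactly the demo keys."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == set(DEMOS[0][1])


print(build_format_prompt("Great support team."))
```

Running `matches_demo_shape` on each model reply gives a quick signal on whether the failure really is format, before anything heavier gets added to the prompt.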

Few-shot is not free accuracy, though, and this is the part that rarely makes it into tutorials. The examples you choose bias the output. Zhao et al.'s 2021 paper, bluntly titled "Calibrate Before Use," showed GPT-3's few-shot accuracy "varies drastically" with the choice and ordering of examples, driven by three measurable biases: majority-label bias (the model drifts toward whichever label appears most in your demos), recency bias (it over-weights the last example), and common-token bias (it favors tokens that are frequent in its training data). Their calibration fix improved accuracy by up to 30 points. The lesson isn't "calibrate" so much as: your examples are not neutral. Four positive examples and one negative one is a thumb on the scale, not a demonstration.
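One cheap guard is to audit the label mix before the prompt ships. The sketch below is a heuristic lint, not the calibration method from Zhao et al.; the function name, the `max_share` threshold, and the `tail` window are arbitrary assumptions chosen for illustration.

```python
from collections import Counter


def audit_demos(labels: list[str], max_share: float = 0.6, tail: int = 2) -> list[str]:
    """Flag label mixes that could tilt a few-shot prompt.

    `labels` holds the demonstration labels in prompt order. `max_share`
    and `tail` are illustrative defaults, not values from the paper.
    """
    warnings: list[str] = []
    if not labels:
        return warnings

    # Majority-label check: does one label dominate the demonstrations?
    top_label, top_count = Counter(labels).most_common(1)[0]
    share = top_count / len(labels)
    if share > max_share:
        warnings.append(f"majority-label risk: {top_label!r} is {share:.0%} of the demos")

    # Recency check: do the final demonstrations all carry the same label?
    if len(set(labels)) > 1 and len(set(labels[-tail:])) == 1:
        warnings.append(f"recency risk: the last {tail} demos are all {labels[-1]!r}")

    return warnings


print(audit_demos(["positive", "positive", "positive", "positive", "negative"]))
print(audit_demos(["positive", "negative", "negative", "positive"]))
```

The four-positive, one-negative mix from above trips the majority-label warning, while the balanced, interleaved list passes clean. A clean audit does not prove the prompt is unbiased; it only rules out the two most obvious thumbprints.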

The inversion nobody updated their prompts for

Now the part that breaks the staircase entirely.

Reasoning models in the 2025–2026 wave, OpenAI's o-series and its peers among them, are trained to produce an internal reasoning trace before they answer. They do chain-of-thought inside the model, every time, without being asked. So what happens when you bolt the old techniques on top?

OpenAI's own reasoning guidance is unusually blunt about it. It tells you to "avoid chain-of-thought prompts" because "prompting them to 'think step by step' or 'explain your reasoning' is unnecessary," and that these models "often don't need few-shot examples," so you should "write prompts without examples first" — adding that examples which don't align tightly with your instructions "may produce poor results." Read that twice: on a reasoning model, two of the three rungs of the ladder range from redundant to harmful. The penthouse technique is built into the elevator, and the few-shot examples that used to help can now drag.

This is where it gets genuinely interesting, because the providers disagree. Anthropic's extended-thinking guidance says to prefer high-level "think deeply about this" instructions over prescriptive step-by-step ones — agreeing with OpenAI on chain-of-thought — but explicitly states that multishot prompting with example reasoning patterns still works well with Claude's extended thinking. One vendor says drop the examples; the other says keep them. They are not contradicting each other so much as describing different models.

That disagreement is the whole point of this piece. If zero-shot, few-shot, and chain-of-thought were a quality ladder, two leading labs could not give opposite advice about the same rung. They give opposite advice because these are tools matched to a model class, not universal best practices. The technique that wins depends on which model you're prompting, so the choice belongs right next to the decision to pick a reasoning model over a standard one in the first place. And when you want the prompt itself tuned by data rather than by hand, that is a job for an automatic prompt optimizer.

The actual decision

Strip away the staircase and the rule is short. Reach for zero-shot first — on a capable model it's the cheapest thing that works, and on a reasoning model it's also the recommended thing. Add few-shot when the failure is format, not thinking — and when you do, watch your example mix for the majority-label and recency thumbprints. Add explicit chain-of-thought only on non-reasoning models doing multi-step work, where it still does what it did in 2022. And on a reasoning model, write the plain instruction, then stop — the model is already standing in the penthouse you were trying to climb to.
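The rule above is short enough to write down as a function. This is only a sketch of the article's heuristic: the `Task` fields and the `choose_techniques` name are hypothetical, and it assumes that a format failure observed after a plain zero-shot attempt justifies examples on either model class.

```python
from dataclasses import dataclass


@dataclass
class Task:
    reasoning_model: bool   # the model produces its own reasoning trace
    multi_step: bool        # the task needs several dependent steps
    format_failures: bool   # a zero-shot attempt returned the wrong shape or labels


def choose_techniques(task: Task) -> list[str]:
    """Return the prompt ingredients suggested by the rule of thumb above."""
    # Zero-shot is always the starting point.
    techniques = ["zero-shot instruction"]

    # Few-shot fixes format problems, not thinking problems.
    if task.format_failures:
        techniques.append("few-shot examples (audited for label mix and order)")

    # Explicit chain-of-thought only pays off where the model does not
    # already reason internally.
    if task.multi_step and not task.reasoning_model:
        techniques.append("explicit chain-of-thought")

    return techniques


print(choose_techniques(Task(reasoning_model=True, multi_step=True, format_failures=False)))
print(choose_techniques(Task(reasoning_model=False, multi_step=True, format_failures=True)))
```

Note that nothing in the function ranks the techniques. Each branch answers a different question about the task and the model, which is the whole argument against the staircase.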