There is a reflex nearly every agent builder has had. The loop returns an answer, you don't fully trust it, so you add a step: now review your reasoning and correct any mistakes. It feels free — the model already has the answer in context; surely a second look can only help.

On reasoning tasks, it usually doesn't. It often makes the answer worse.

The finding that named the ceiling#

In 2024, a team from Google DeepMind and the University of Illinois — Jie Huang, Xinyun Chen, Denny Zhou and four co-authors — published a paper at ICLR with a title that left little room for hedging: Large Language Models Cannot Self-Correct Reasoning Yet. Their result was that intrinsic self-correction — a model revising its own answer using only its own judgment, with no external feedback — did not reliably improve reasoning accuracy. Across benchmarks, it tended to hold flat or drop.

The mechanism is intuitive once you say it out loud. The model doing the review is the model that produced the answer. It has the same training, the same context, the same blind spots. When it was wrong, it was usually confidently wrong — so on review it re-endorses the mistake. And when it was right but not certain, the act of second-guessing is exactly what talks it out of the correct answer. You are asking one mind to catch errors that same mind couldn't see the first time, using nothing it didn't already have.

But Reflexion works — so what gives?#

Here is the apparent contradiction. Reflexion, from Noah Shinn and colleagues at NeurIPS 2023, is built on a self-correction loop, and it works spectacularly: it raised pass@1 on HumanEval from GPT-4's 80% to 91% by having the agent write a natural-language post-mortem after each attempt and retry with that reflection in memory. Self-correction lifted the ceiling by eleven points. So which is it?

Look at what the reflection is anchored to. In Reflexion's coding setup, each attempt is run against unit tests, and the reflection is the model's analysis of why those tests failed. The critique is not the model staring at its own code and vibing about quality. It is the model reading a ground-truth verdict — this test, this input, this expected output, your actual output — and reasoning about the gap. The reflection carries information the model did not have when it wrote the code the first time.

That is the whole difference. Self-Refine (Madaan et al.) sits on the same fault line: it iterates productively exactly when its feedback step is grounded in something real. Strip the external signal out of either method and you are back in Huang et al.'s world, where the extra pass buys nothing.

Reflexion isn't self-correction. It's correction against an oracle, wearing self-correction's clothes.

The one idea: the generator-verifier gap#

The principle underneath all of this is the generator-verifier gap — the fact that, in many domains, checking a solution is easier than producing one. It is why we can grade an exam faster than sit it, why a failing test localizes a bug you couldn't foresee.

But the gap is conditional, and the condition is the part people skip. Verifying is easier than generating only when the verifier has an advantage the generator lacks: it can run the code, consult a test suite, hold ground-truth labels, use a tool the generator couldn't, or judge from a larger or more specialized model. Take that advantage away — make the verifier the same model, in the same context, with the same information — and the gap collapses to zero. There is nothing to exploit. "Check your work" is the zero-gap case. That's why it costs you double the tokens and returns a regression toward the mean.

So the useful question is never "should the agent self-correct?" It's "what does my verification step know that the generation step didn't?" If the honest answer is nothing, don't add the step.

Why coding agents self-improve and strategy memos don't#

This also explains a pattern every practitioner has felt. Agents get visibly better at code and math and visibly stuck on open-ended work. The reason is that verifiable domains ship a free, near-perfect verifier: a compiler, a unit-test runner, a proof checker, an exact-match grader. That oracle is what makes reinforcement learning from verifiable rewards viable, and it's what gives an agent a hard surface to bounce corrections off of. Trained generative verifiers and per-step process reward models push this further, turning "is this right?" into a signal you can optimize.

The tasks where agents flail — "write a persuasive memo," "design the schema," "is this the right architecture?" — are the ones with no cheap ground truth. There's nothing solid to correct against, so a self-critique loop is just the model negotiating with itself. Building trustworthy verifiers for those domains is the actual frontier, and it's why so much 2026 agent work is really verifier work in disguise.

What to actually do#

The trap in "just have it check its work" is that it sounds like diligence. It's usually the opposite: a second draw from the same flawed distribution, billed twice, dressed up as rigor. Self-correction is real — but only ever as strong as the thing you're correcting against.