There is a tempting story about AI agents that goes: the model writes something, looks at its own work, notices the mistake, fixes it, and converges on a correct answer. It is the story behind every "reflection" demo and every agent diagram with an arrow curving back on itself. It is also, in its purest form, mostly false — and the four most-cited self-correction methods are really four different answers to the question that story skips: who decides the first answer was wrong?

Three methods that ask the world

Reflexion (Shinn et al., NeurIPS 2023) is usually described as an agent that "reflects on its mistakes," which undersells what makes it work. It doesn't reflect on a hunch. It takes a concrete environment signal — a unit test that failed, a game it lost, a wrong-answer flag — and converts that signal into a sentence of natural-language advice it files in an episodic memory, then re-reads on the next attempt. The reported 91% pass@1 on HumanEval is real, but the load-bearing part is the failed test, not the introspection. Reflexion is verbal reinforcement learning, and like all reinforcement learning it needs a reward to exist.
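To make the shape of that loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `toy_actor` and `reflect` are hypothetical stand-ins for model calls, and `run_tests` plays the environment. It is not the paper's implementation; it only shows that the lesson filed in memory is derived from a failed test, not from the model's opinion of itself.

```python
from typing import Callable, List, Optional, Tuple


def run_tests(candidate: Callable[[int], int]) -> Optional[str]:
    """External signal: None if every test passes, else a failure message."""
    for x, expected in [(0, 0), (2, 4), (-3, 9)]:
        got = candidate(x)
        if got != expected:
            return f"f({x}) returned {got}, expected {expected}"
    return None


def toy_actor(memory: List[str]) -> Callable[[int], int]:
    """Hypothetical stand-in for the model. Its first attempt is buggy and
    it only changes its answer once memory contains a lesson."""
    if not memory:
        return lambda x: x * 2   # passes f(0) and f(2), fails f(-3)
    return lambda x: x * x       # the revised attempt


def reflect(failure: str) -> str:
    """Hypothetical stand-in: turn the environment signal into verbal advice."""
    return f"Last attempt failed: {failure}. Handle that case next time."


def reflexion_loop(max_trials: int = 3) -> Tuple[Optional[int], List[str]]:
    memory: List[str] = []                    # episodic memory of lessons
    for trial in range(1, max_trials + 1):
        candidate = toy_actor(memory)         # act, conditioned on memory
        failure = run_tests(candidate)        # ask the world, not the model
        if failure is None:
            return trial, memory              # the tests say it is done
        memory.append(reflect(failure))       # file the lesson for next trial
    return None, memory
```

In this toy setup `reflexion_loop()` succeeds on the second trial, and only because the first trial's failed test put a lesson into memory. Remove `run_tests` and nothing ever tells the actor to change.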

CRITIC (Gou et al., ICLR 2024) is even more honest about this in its own abstract. Its premise is that LLMs struggle to self-verify without external feedback, so it doesn't ask the model to grade itself — it has the model call tools. A search engine checks a factual claim; a code interpreter runs the snippet; the tool's output, not the model's opinion, is the critic. The correction is downstream of a measurement the model couldn't fake.
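A small Python sketch of the tool-as-critic idea, under stated assumptions: the "tool" here is a tiny arithmetic evaluator standing in for a code interpreter, and `critic_step` is a hypothetical name, not CRITIC's actual interface. The point it illustrates is that the verdict comes from the tool's output and the model's claim is only ever compared against it.

```python
import ast
import operator

# Arithmetic operators the toy tool is willing to execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def run_tool(expression: str):
    """Toy stand-in for a code interpreter: evaluate plain arithmetic only."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)


def critic_step(expression: str, claimed):
    """Compare the model's claimed value with what the tool measures."""
    measured = run_tool(expression)
    if measured == claimed:
        return claimed, "verified by tool"
    return measured, f"tool returned {measured}, model claimed {claimed}; corrected"


# A hypothetical model claims 17 * 24 = 418; the tool disagrees.
answer, note = critic_step("17 * 24", 418)
print(answer, "|", note)
```

The critique string is generated from a measurement. A real system would hand that string back to the model for revision, but the model never gets to overrule the number.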

LATS (Zhou et al., ICML 2024) scales the same idea with money, spent as extra inference compute. It wraps Monte Carlo Tree Search around the reason-act loop: generate many candidate actions, evaluate them against environment feedback and a value estimate, reflect on dead ends, and search toward the branch that scores best. It reports up to roughly 94.4% on HumanEval, the strongest of the four, and it gets there by evaluating more candidates against a real signal, not by thinking harder about one.
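The sketch below is a deliberately reduced stand-in, not LATS itself: it uses plain best-first search over a tiny tree instead of Monte Carlo Tree Search, has no reflection step, and replaces the model with exhaustive expansion over three toy operations. The names `OPS`, `TESTS`, `env_score`, and `search` are all assumptions made for the example. What it keeps is the part that matters here: many candidates, each scored by an external test signal.

```python
import heapq
from itertools import count

# Toy action space: a "program" is a sequence of these operations.
OPS = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: x * 2,
    "sq": lambda x: x * x,
}

# Environment feedback: input/output tests for the target f(x) = (x + 1) * 2.
TESTS = [(0, 2), (1, 4), (3, 8)]


def run(program, x):
    for name in program:
        x = OPS[name](x)
    return x


def env_score(program):
    """Fraction of tests passed; this number comes from outside the 'model'."""
    return sum(run(program, x) == want for x, want in TESTS) / len(TESTS)


def search(max_expansions=50, max_depth=3):
    tie = count()  # tie-breaker so the heap never has to compare programs
    frontier = [(-env_score(()), next(tie), ())]
    evaluated = 1
    for _step in range(max_expansions):
        if not frontier:
            break
        neg_score, _tie, program = heapq.heappop(frontier)  # best score first
        if -neg_score == 1.0:
            return program, evaluated   # every test passes
        if len(program) >= max_depth:
            continue                    # dead end: do not expand further
        for name in OPS:                # expand: one child per next action
            child = program + (name,)
            heapq.heappush(frontier, (-env_score(child), next(tie), child))
            evaluated += 1
    return None, evaluated


program, evaluated = search()
print(program, evaluated)
```

On this toy problem the search returns `("inc", "dbl")`. No candidate was improved by introspection; the winner was found by trying more of them against the tests.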

Reflexion, CRITIC, and LATS look like three techniques. They're three delivery mechanisms for the same thing: a verdict from outside the model.

The one method that asks itself

Self-Refine (Madaan et al., NeurIPS 2023) is the method everyone pictures when they say "self-correction," because it is the only one where the correction truly comes from the model itself. One model generates an answer, the same model writes feedback on that answer, and the same model revises: no test, no tool, no reward. It reports around 20% absolute improvement on average across a spread of tasks, and on the tasks it suits (making prose clearer, fixing format, reversing sentiment, tightening a response) it genuinely helps. The trouble starts when people reach for it on tasks where being wrong is a matter of fact rather than taste.
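For contrast with the externally grounded methods, here is a minimal Python sketch of the generate, feedback, refine loop on a "tighten this response" task. The three functions are hypothetical, rule-based stand-ins for three prompts sent to one and the same model; nothing here is Self-Refine's actual code. Note where the stopping decision comes from.

```python
from typing import Optional, Tuple

FILLER = ("basically", "really", "very")


def generate(prompt: str) -> str:
    """Stand-in for the model's first draft (the prompt is ignored here)."""
    return "This is basically a really very simple fix."


def feedback(draft: str) -> Optional[str]:
    """Stand-in for the same model critiquing its own draft.
    Returns None when it believes the draft needs no more work."""
    found = [w for w in FILLER if w in draft.split()]
    return f"remove filler word '{found[0]}'" if found else None


def refine(draft: str, note: str) -> str:
    """Stand-in for the same model applying its own feedback."""
    word = note.split("'")[1]
    return " ".join(tok for tok in draft.split() if tok != word)


def self_refine(prompt: str, max_rounds: int = 5) -> Tuple[str, int]:
    draft = generate(prompt)
    for rounds in range(max_rounds):
        note = feedback(draft)
        if note is None:            # the model, and only the model, says stop
            return draft, rounds
        draft = refine(draft, note)
    return draft, max_rounds


print(self_refine("Tighten this sentence."))
```

For a matter of taste such as wordiness, the model's own feedback is a reasonable judge. The loop has no slot where a test, tool, or reward could disagree with it, which is exactly what goes wrong on factual tasks.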

The result that named names

In 2023 the reflection literature was euphoric. Then Huang et al. published a paper with a title that functioned as a thesis: "Large Language Models Cannot Self-Correct Reasoning Yet" (ICLR 2024). Their finding, replicated since, is that intrinsic self-correction — looping with no oracle and no external tool — fails to improve reasoning and frequently makes it worse, turning correct answers into incorrect ones about as often as it rescues wrong ones.

The sharpest part was the autopsy of earlier optimism. Several results that looked like successful self-correction had quietly used an oracle to decide when to stop the loop — stop as soon as the answer is correct. That stopping rule leaks the ground-truth answer into the procedure. Remove the oracle and let the model decide for itself when it's done, and the gains evaporate or invert. Stechly et al. (2024) found the same collapse on planning and graph problems: self-critique hurt, sound external verification helped. Kamoi et al.'s TACL survey (2024) put it as a rule of thumb — reliable self-correction shows up when there's reliable external feedback, and largely not otherwise.
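The leak is easy to see with a little arithmetic. The Python sketch below uses made-up illustrative rates, not figures from any of the papers, and assumes a single revision pass in which a wrong answer is fixed with probability `fix_rate` and a right answer is broken with probability `break_rate`. The function name and parameters are assumptions for the example.

```python
def accuracy_after_revision(p_correct, fix_rate, break_rate, oracle_stop):
    """Expected accuracy after one revision pass.

    oracle_stop=True  : answers that are already correct are never revised,
                        because a ground-truth check halts the loop for them.
    oracle_stop=False : the model revises everything it feels like revising,
                        so correct answers can be broken too.
    """
    rescued = (1.0 - p_correct) * fix_rate
    broken = 0.0 if oracle_stop else p_correct * break_rate
    return p_correct - broken + rescued


# Illustrative numbers only: 70% initial accuracy, 10% fix and break rates.
with_oracle = accuracy_after_revision(0.70, 0.10, 0.10, oracle_stop=True)
without_oracle = accuracy_after_revision(0.70, 0.10, 0.10, oracle_stop=False)

print(f"with oracle stopping:    {with_oracle:.2f}")     # about 0.73
print(f"without oracle stopping: {without_oracle:.2f}")  # about 0.66
```

Same model, same revision skill. The only thing that changed is whether the ground truth was allowed to protect the correct answers, and that alone flips the result from a gain to a loss.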

The mechanism is almost tautological once you say it out loud. Correcting an error requires detecting it first, and a model that could reliably detect its own reasoning errors would mostly not have made them. Without an external check, the critic is the same fallible reasoner in a different hat — which is why it confidently defends wrong answers and invents flaws in right ones.

Why this is the whole ballgame for agents

A long-running agent is a self-correction loop that runs hundreds of times without a human in it. If each iteration's "did I get it right?" is answered by the agent's own confidence, the errors don't get caught — they compound down the chain, and the failure looks like an agent that drifts off-task rather than one that crashed. This is the difference between an agent and a workflow: a workflow's checks are written by an engineer; an agent has to source its own verdicts, and the only trustworthy sources are external — a test suite, a compiler, a type-checker, a tool that returns the real value. That's also why tool-use evaluation is really verifier evaluation in disguise.
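A back-of-the-envelope Python sketch of that compounding, under loudly simplifying assumptions: every step fails independently with the same probability, an external check catches a fixed fraction of failures, and a caught failure is always repaired. The function name and the numbers are illustrative, not measurements.

```python
def clean_run_probability(steps, error_rate, catch_rate=0.0):
    """Probability that a run of `steps` iterations ends with no uncaught error.

    error_rate : chance that any single step goes wrong
    catch_rate : fraction of those errors an external verifier catches
                 (0.0 models an agent that only consults its own confidence)
    """
    uncaught = error_rate * (1.0 - catch_rate)
    return (1.0 - uncaught) ** steps


# 100 steps, 5% error per step (illustrative values).
unchecked = clean_run_probability(100, 0.05)
checked = clean_run_probability(100, 0.05, catch_rate=0.9)

print(f"no external check:      {unchecked:.4f}")
print(f"check catches 90%:      {checked:.4f}")
```

With these numbers the unchecked agent finishes cleanly well under 1% of the time, while the checked one does so roughly 60% of the time. The per-step error rate is identical in both cases; the verifier is the whole difference.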

The frontier now is honest about this. The named problem is the generation-verification gap: the distance between being able to produce a correct answer and being able to recognize one. Song et al. (ICLR 2025) formalize it and find, uncomfortably, that it can widen with model scale. Two escape routes are live: SCoRe (Kumar et al., ICLR 2025) trains self-correction into the model with reinforcement learning instead of prompting for it, and Weaver (Saad-Falcon et al., 2025) ensembles many weak verifiers into one strong enough to close the gap. Both are concessions to the same fact: you cannot prompt your way to a good critic. You have to build one, or borrow the world's.
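Why ensembling weak verifiers can work at all is worth a quick calculation. The Python sketch below is not Weaver's method; it swaps in a plain unweighted majority vote and assumes the verifiers err independently, which real verifiers generally do not. The function name is an assumption for the example.

```python
from math import comb


def majority_vote_accuracy(n_verifiers, q):
    """Chance that a majority of n independent verifiers, each correct with
    probability q, reaches the right verdict. n must be odd to avoid ties."""
    if n_verifiers % 2 == 0:
        raise ValueError("use an odd number of verifiers")
    need = n_verifiers // 2 + 1   # smallest winning number of correct votes
    return sum(
        comb(n_verifiers, k) * q**k * (1.0 - q) ** (n_verifiers - k)
        for k in range(need, n_verifiers + 1)
    )


# Each verifier is only 60% accurate (illustrative value).
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions accuracy climbs with the number of voters, but only because each one is better than a coin flip. Set `q` below 0.5 and the same vote makes the ensemble worse than its members, so the verifiers still have to be built or borrowed, not wished into being.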

So when you wire reflection into an agent, the question is never "which loop." It's "what closes the loop." If the answer is a failed test, you have self-correction. If the answer is the model asking itself whether it's sure, you have a slightly more expensive way of being wrong.