Every retrieval-augmented system rests on one quiet assumption, and it is usually wrong: that whatever the retriever hands back is worth answering from. Naive RAG takes the top-k passages and conditions the model on them unconditionally; no step in the pipeline is allowed to say "this context is junk, don't use it." So when retrieval misses, the model doesn't fail loudly. It hallucinates fluently on top of bad evidence, which is worse. This is the same gap that pushes teams toward agentic RAG, where the model drives retrieval, but Self-RAG and CRAG go after it without handing the whole loop to an agent.
Two well-cited methods target this gap, and they get filed under the same "advanced RAG" heading as if they were competing for the same job. They aren't. Self-RAG and Corrective RAG (CRAG) intervene at different points, fix different failures, and (this is the part that should actually drive your decision) make opposite bets about where the judgment should live.
Self-RAG: teach the model to doubt itself
Self-RAG (Asai et al., 2023) moves the judgment inside the model. It fine-tunes the language model to emit special reflection tokens interleaved with its normal output, so that critiquing becomes part of generation rather than a step bolted on around it. There are four:
- Retrieve — before producing the next segment, decide whether retrieval is even needed. Sometimes the model already knows the answer and pulling documents only adds noise.
- ISREL — given a retrieved passage, is it actually relevant to the query?
- ISSUP — is the statement I just generated genuinely supported by that passage, or am I drifting past the evidence?
- ISUSE — how useful is the overall response, on a 1–5 scale?
The effect is a model that retrieves on demand and grades its own work segment by segment, even down-weighting a generation branch when ISSUP says the claim isn't backed. The intelligence is in the weights. That is its strength and its catch: you get adaptive, low-overhead self-criticism at inference time, but only after you have fine-tuned a model to do it — and you are then locked to that model.
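To make the down-weighting concrete, here is a minimal sketch of how critique signals could be folded into a per-segment score. It loosely follows the weighted-sum idea described for Self-RAG but is not the paper's implementation: the Candidate class, the weights, and every number below are illustrative assumptions, and in a real system the critique values would come from the fine-tuned model's reflection-token probabilities.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """One candidate continuation for the next segment (hypothetical structure)."""
    text: str
    lm_score: float  # generation score for the segment (made-up values here)
    is_rel: float    # probability the passage is relevant (ISREL), in [0, 1]
    is_sup: float    # probability the segment is supported (ISSUP), in [0, 1]
    is_use: float    # ISUSE rating rescaled from 1-5 onto [0, 1]


# Illustrative weights; tuning them trades fluency against groundedness.
WEIGHTS: Dict[str, float] = {"is_rel": 1.0, "is_sup": 1.0, "is_use": 0.5}


def segment_score(c: Candidate, weights: Dict[str, float] = WEIGHTS) -> float:
    """Generation score plus a weighted sum of the critique signals."""
    critique = (
        weights["is_rel"] * c.is_rel
        + weights["is_sup"] * c.is_sup
        + weights["is_use"] * c.is_use
    )
    return c.lm_score + critique


def pick_segment(candidates: List[Candidate]) -> Candidate:
    """Keep the candidate with the highest combined score."""
    return max(candidates, key=segment_score)


candidates = [
    Candidate("fluent but unsupported claim", 0.9, is_rel=0.8, is_sup=0.1, is_use=0.6),
    Candidate("plainer claim backed by the passage", 0.7, is_rel=0.9, is_sup=0.9, is_use=0.7),
]
best = pick_segment(candidates)
print(best.text)
```

In this toy run the second candidate wins despite its lower generation score, because the low ISSUP value drags the first one down. Raising the ISSUP weight makes the model stricter about staying inside the evidence.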
CRAG: judge the evidence before the model sees it
CRAG (Yan et al., 2024) makes the opposite bet: leave the LLM completely untouched and put the judgment outside it. It adds a lightweight retrieval evaluator — a small, fast classifier — that scores the retrieved documents for a query and returns a confidence, which maps to three actions:
- Correct → knowledge refinement: decompose the documents into fine-grained "knowledge strips," throw out the irrelevant strips, and recompose the clean ones. Even good retrieval carries filler; this strips it.
- Incorrect → discard the retrieved documents entirely and fall back to a large-scale web search for fresh evidence. This is the move naive RAG can't make: when the corpus has nothing, go get something.
- Ambiguous → hedge by combining the refined internal documents with web search results.
Crucially, all of this happens before the generator runs, and none of it touches the generator's weights. CRAG is plug-and-play and model-agnostic — it wraps any black-box LLM you're calling over an API.
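The three-way routing and the strip-level refinement can be sketched in a few lines. This is a simplified illustration, not CRAG's code: the thresholds are invented, overlap_score is a toy word-overlap stand-in for the trained retrieval evaluator, and fake_web_search is a hypothetical stub for a real search call. The rule used here (any document above the upper threshold means Correct, all below the lower one means Incorrect) is an assumption about how the confidence maps to actions.

```python
import re
from typing import Callable, List, Tuple

UPPER, LOWER = 0.6, 0.2  # illustrative thresholds, not values from the paper

Scorer = Callable[[str, str], float]


def overlap_score(query: str, text: str) -> float:
    """Toy evaluator: fraction of query words that appear in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0


def choose_action(scores: List[float]) -> str:
    """Map per-document evaluator scores to one of the three actions."""
    if any(s > UPPER for s in scores):
        return "correct"
    if all(s < LOWER for s in scores):  # also true when nothing was retrieved
        return "incorrect"
    return "ambiguous"


def refine(query: str, docs: List[str], score: Scorer) -> str:
    """Split documents into sentence-level strips, drop weak ones, rejoin."""
    strips = [
        s.strip()
        for d in docs
        for s in re.split(r"(?<=[.!?])\s+", d)
        if s.strip()
    ]
    return " ".join(s for s in strips if score(query, s) > LOWER)


def build_context(
    query: str, docs: List[str], score: Scorer, web_search: Callable[[str], str]
) -> Tuple[str, str]:
    """Return (action, context) to hand to an unmodified generator."""
    action = choose_action([score(query, d) for d in docs])
    if action == "correct":
        return action, refine(query, docs, score)
    if action == "incorrect":
        return action, web_search(query)
    return action, refine(query, docs, score) + "\n" + web_search(query)


def fake_web_search(query: str) -> str:
    """Hypothetical stub standing in for a real web search."""
    return f"[web results for: {query}]"


query = "python first released year"
docs = ["Python was first released in the year 1991. Subscribe to our newsletter for more."]
action, context = build_context(query, docs, overlap_score, fake_web_search)
print(action, "|", context)
```

Nothing in the sketch calls or modifies the generator, which is the point: the context string it returns can be dropped into the prompt of whatever model sits behind the API.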
Self-RAG retrains the reader to be skeptical of its sources. CRAG hires an editor to vet the sources before the reader ever opens them. Different fix, different place, different cost.
The decision is build-vs-bolt-on, not better-vs-worse
Lined up honestly, the "vs" dissolves. They fix different failures: Self-RAG improves how the model reasons over evidence; CRAG improves the evidence itself. Self-RAG can decide whether to retrieve; CRAG can decide what to do when retrieval was bad. In a serious system you might run both — CRAG cleans and, if necessary, replaces the context; Self-RAG reasons carefully over whatever survives.
So the real axis isn't quality. It's a question about your constraints: do you control the model's weights?
- If you can fine-tune and serve your own model, and you want relevance and support-checking baked in at inference time with no extra service in the loop, Self-RAG fits — at the cost of a training pipeline and being tied to that model.
- If you're calling a frontier model behind an API and need a correction layer you can ship this week, CRAG fits — at the cost of an extra evaluator pass and, on the fallback path, a web-search dependency and its latency.
Before you reach for either
One caution the papers won't give you. Both methods earn their keep only when retrieval quality genuinely varies and the cost of a confident wrong answer is high — medical, legal, support systems where a fluent hallucination is a real liability. If your corpus is clean and your retriever is already strong, you are reaching for a second model in the loop to solve a problem a reranker and a similarity threshold would have handled for a fraction of the latency. Add the self-checking machinery when you've measured that retrieval is the thing failing you. Not before — the most expensive correction step is the one guarding a pipeline that was already retrieving fine.
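For comparison, the cheap baseline looks something like this. It is a minimal sketch of the similarity-threshold half only (a reranker would usually be a separate scoring model), and the function names, the 0.75 cutoff, and the two-dimensional vectors are all illustrative assumptions; real embeddings would come from whatever encoder the retriever already uses.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gate(
    query_vec: Sequence[float],
    passages: List[Tuple[str, Sequence[float]]],
    min_sim: float = 0.75,  # illustrative cutoff; tune on your own data
    top_k: int = 3,
) -> List[str]:
    """Keep at most top_k passages whose similarity clears the threshold."""
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in passages),
        reverse=True,
    )
    return [text for sim, text in scored[:top_k] if sim >= min_sim]


passages = [
    ("on-topic passage", [0.9, 0.1]),
    ("off-topic passage", [0.1, 0.9]),
]
kept = gate([1.0, 0.0], passages)
print(kept)
```

An empty result is the signal to abstain or ask a clarifying question rather than answer from weak context. If a gate like this already removes most bad answers in your evaluation set, the heavier machinery has little left to fix.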



