The standard advice is a fork in the road. If your model needs to know things it didn't learn in pretraining, you either retrieve those things at query time or fine-tune them into the weights. RAG versus fine-tuning has launched a thousand architecture diagrams, and the usual verdict is sensible enough: retrieve when the knowledge changes, fine-tune when the behavior needs to change.
RAFT's contribution is to notice that the fork is false. There's a third road, and it runs straight down the middle.
The open-book exam nobody studied for
RAFT — Retrieval-Augmented Fine-Tuning, from the UC Berkeley team behind Gorilla — starts from an observation about what plain RAG actually is. It's an open-book exam. The model walks in, gets handed a stack of retrieved pages, and is expected to find the answer. The catch is that nobody ever taught it to read this book. A general-purpose model doing RAG over your medical corpus or your internal API docs is improvising, and it improvises worst at exactly the moment retrieval hands it three relevant pages and two irrelevant ones.
Fine-tuning has the opposite problem. It's a closed-book exam: the model has studied the domain and recites from memory, but it never practiced using a document, so when you do hand it retrieved context at inference it often can't tell the signal from the noise.
RAFT is the student who studied for the specific open-book exam. You fine-tune the model on its own retrieval task.
The whole idea is the distractors
Here's the part worth internalizing, because it's the part that's easy to skip. RAFT's training examples are not question-answer pairs. Each one is a question, a set of documents, and an answer — and the set is built on purpose to be imperfect.
Some examples contain the oracle document (the one that holds the answer) sitting alongside distractor documents that are topically plausible but wrong. And in a deliberate fraction of the examples the oracle document is removed entirely, leaving only distractors. (The paper writes this as P% of examples keeping the oracle and the remaining (1 − P)% going without it.) The model is asked to answer anyway.
That second move is the clever one. By sometimes denying the model the document it needs, you force it to actually learn the domain rather than copy from context — so when retrieval whiffs in production, it isn't helpless. And by always padding the context with distractors, you train the one skill plain RAG never teaches: ignoring the wrong chunk. The answers the model is trained to produce are chain-of-thought that quote the source passage verbatim, so it learns to cite rather than to assert — a habit that doubles as a faithfulness signal at inference.
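The construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: `RaftExample`, `build_raft_example`, and the parameters `p_oracle` (the probability that an example keeps its oracle document) and `num_docs` are hypothetical names, and the default values are arbitrary placeholders rather than recommended settings.

```python
import random
from dataclasses import dataclass


@dataclass
class RaftExample:
    question: str
    documents: list[str]  # shuffled context the model sees during training
    answer: str           # target completion (a quoted chain of thought in RAFT)
    has_oracle: bool      # whether the oracle document is in `documents`


def build_raft_example(question, oracle, distractor_pool, answer,
                       p_oracle=0.8, num_docs=5, rng=None):
    """Assemble one training example with distractors, sometimes without the oracle.

    Assumption: `distractor_pool` holds topically similar chunks, e.g. the
    retriever's own near-misses for this question.
    """
    rng = rng or random.Random()
    keep_oracle = rng.random() < p_oracle

    # Keep the context size constant whether or not the oracle is present.
    n_distractors = num_docs - 1 if keep_oracle else num_docs
    candidates = [d for d in distractor_pool if d != oracle]
    if len(candidates) < n_distractors:
        raise ValueError("distractor pool is too small for num_docs")

    docs = rng.sample(candidates, n_distractors)
    if keep_oracle:
        docs.append(oracle)
    rng.shuffle(docs)  # avoid teaching the model that position marks the oracle
    return RaftExample(question, docs, answer, keep_oracle)
```

Drawing the distractor pool from what your own retriever returns for each question, rather than from random chunks, is what makes the training set reflect the mistakes the model will actually face.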
Plain RAG leans on a model trained on the world. RAFT trains it on your retriever's mistakes.
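The verbatim-quote habit is what makes a cheap faithfulness check possible at inference: if the model marks its quotes, you can verify that each one really appears in the retrieved context. The sketch below assumes the quotes are wrapped in `##begin_quote##` and `##end_quote##` markers; treat those strings, and the function name `unsupported_quotes`, as assumptions and substitute whatever delimiters your own training template uses.

```python
import re

# Assumed quote delimiters; replace with the markers from your training template.
BEGIN, END = "##begin_quote##", "##end_quote##"
QUOTE_RE = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)


def _normalize(text):
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def unsupported_quotes(model_output, documents):
    """Return the quoted spans that do not appear verbatim in any document."""
    context = [_normalize(doc) for doc in documents]
    missing = []
    for raw in QUOTE_RE.findall(model_output):
        quote = _normalize(raw)
        if quote and not any(quote in doc for doc in context):
            missing.append(quote)
    return missing
```

An empty result means every marked quote was found in the context. It does not prove the answer is right, only that the cited text was not invented.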
What it buys, in numbers
The Berkeley paper runs RAFT across PubMedQA, HotpotQA, and the API-documentation splits from Gorilla (HuggingFace, Torch Hub, TensorFlow). The pattern is consistent on the multi-document tasks: training with distractors beats both plain domain fine-tuning and a strong RAG baseline. On the HuggingFace API split RAFT reported 74% accuracy — a large margin over a domain-fine-tuned model and over GPT-3.5 with RAG — and the HotpotQA gains over plain fine-tuning were in the same double-digit territory. Microsoft thought enough of the recipe to ship it as a fine-tuning option in Azure AI.
The honest footnote is PubMedQA. There — where answers are essentially yes/no/maybe and the reasoning is shallow — RAFT barely separates from fine-tuning plus RAG. That's not a flaw; it's the boundary of the idea. Distractor-robustness training pays off when answering means reading the right thing out of several plausible things. When the task is a coin flip, there's no noise to be robust to.
When to reach for it
The decision is less about accuracy ceilings than about what you're willing to give up.
Reach for plain RAG when your corpus moves — new docs weekly, a knowledge base that's edited by humans all day. RAFT bakes the domain into the weights, so updating means retraining, while RAG lets you re-embed and swap the index. Reach for plain fine-tuning when you're changing behavior — tone, format, a skill — and there's no retrieval step at inference at all.
Reach for RAFT when three things are true at once: the domain is fixed enough to justify a training run, the retriever is imperfect enough that wrong chunks are a real failure mode, and you have labeled questions to build the distractor-laced training set from. That's narrower than "use RAG" — but it's exactly the shape of the high-stakes vertical assistant, the one answering legal or clinical or internal-API questions over a stable corpus where a confidently-wrong answer to a misretrieved chunk is the thing that gets you fired.
And note what RAFT does not do: it doesn't replace the retriever. You still need good chunking and a real index, and a better retriever still helps. RAFT just lowers the price of the retriever's mistakes — which, if you've ever watched a RAG system answer fluently from the one chunk it should have ignored, is a price you already know you're paying.



