The question arrives pre-broken. "Should we fine-tune or use RAG?" is asked the way you'd ask whether to take the highway or the side roads — as if the two were rival paths to the same place. They aren't. They solve different failures, and the reason teams burn quarters on the wrong one is that the question hides that fact inside a fake either/or.

Here is the distinction that the framing erases. Retrieval-augmented generation — the technique Patrick Lewis and colleagues named in 2020 — gives a model access to a non-parametric store of facts at inference time: documents, a knowledge base, a vector index it consults before answering. It changes what the model can see. Fine-tuning changes the weights. It changes what the model does — its format, its tone, its tool-call syntax, the reasoning pattern it falls into when it sees a certain kind of input. One is about knowledge. The other is about behavior. They are not substitutes any more than a library is a substitute for a personality.
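
To make "changes what the model can see" concrete, here is a minimal sketch of the retrieval half, assuming a toy bag-of-words similarity in place of a real embedding model and vector index. The names `retrieve`, `build_prompt`, and `DOCS` are hypothetical, invented for illustration. The thing to notice is that the only output is a different prompt; no weight is touched anywhere.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query (toy scoring)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Retrieval changes the prompt the model sees, not the model itself."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base, for illustration only.
DOCS = [
    "Refunds are accepted within 30 days of purchase.",
    "Support is available Monday through Friday.",
]

if __name__ == "__main__":
    print(build_prompt("How many days do I have to get a refund?", DOCS))
```

A fine-tune has no equivalent of `build_prompt`: the prompt stays the same and the function that reads it changes.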

The decision rule is the failure mode

Stop asking which is better and ask what's actually going wrong.

This is not a contrarian read; it is the one the model providers themselves publish, once you get past the marketing. OpenAI's own accuracy guidance frames RAG as the move for introducing new knowledge and fine-tuning as the move for changing structure, tone, or behavior, and it is explicit that the two are not mutually exclusive. Anthropic's prompt-engineering docs push even harder in one direction: try prompting first, because most of what people reach for fine-tuning to fix is reachable with examples, structure, and a better system prompt. The uncomfortable trade secret among practitioners is that a large fraction of "we need to fine-tune" requests are knowledge problems wearing a behavior costume, or behavior problems that three good few-shot examples would have closed.
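
Written down as a rule, the diagnosis might look like the sketch below. It is an illustration of the argument, not anyone's official decision procedure: the `Failure` categories, the `recommend` and `plan` functions, and the `prompting_exhausted` flag are all hypothetical names, and a real triage would rest on eval results rather than a hand-assigned label.

```python
from enum import Enum

class Failure(Enum):
    MISSING_KNOWLEDGE = "missing_knowledge"  # model lacks or misstates facts
    WRONG_BEHAVIOR = "wrong_behavior"        # format, tone, tool-call syntax

def recommend(failure, prompting_exhausted=False):
    """Map one diagnosed failure to the remedy the text argues for."""
    if failure is Failure.MISSING_KNOWLEDGE:
        # New knowledge is a retrieval problem, not a weights problem.
        return "rag"
    if failure is Failure.WRONG_BEHAVIOR:
        # Try examples and a better system prompt before touching weights.
        return "fine-tune" if prompting_exhausted else "prompt"
    raise ValueError(f"unknown failure type: {failure!r}")

def plan(failures, prompting_exhausted=False):
    """The remedies are not mutually exclusive: collect one per failure."""
    return sorted({recommend(f, prompting_exhausted) for f in failures})

if __name__ == "__main__":
    observed = [Failure.MISSING_KNOWLEDGE, Failure.WRONG_BEHAVIOR]
    print(plan(observed))                            # ['prompt', 'rag']
    print(plan(observed, prompting_exhausted=True))  # ['fine-tune', 'rag']
```

Note that there is no branch in which a knowledge failure returns "fine-tune". That omission is the whole argument.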

The old objection is dead, and that's a trap

For years the tiebreaker was money. Fine-tuning meant full fine-tuning — retraining every weight, renting a cluster, owning the cost of a model you'd have to retrain again next quarter. So teams defaulted to RAG even for behavior problems, because RAG was cheap and fine-tuning was a capital expense.

That objection is gone. LoRA froze the base weights and trained small rank-decomposition matrices instead, reporting a 10,000x reduction in trainable parameters and a 3x cut in GPU memory versus full fine-tuning of GPT-3 175B. Then QLoRA quantized the frozen base to 4-bit and fine-tuned a 65-billion-parameter model on a single 48GB GPU while preserving full 16-bit task performance. Parameter-efficient fine-tuning turned a cluster job into something that fits on one card overnight.
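
The arithmetic behind those claims is easy to check on the back of an envelope. The sketch below assumes a single square weight matrix at GPT-3 175B's hidden size of 12,288 and an illustrative rank of 4; the rank is an assumption chosen for the example. The per-matrix ratio it prints is not the paper's 10,000x figure, which counts trainable parameters across the whole model with adapters on only some matrices. The memory estimate counts weights only and ignores adapters, activations, optimizer state, and quantization overhead.

```python
def full_params(d, k):
    """Trainable parameters when a d x k weight matrix is updated directly."""
    return d * k

def lora_params(d, k, r):
    """Trainable parameters for a rank-r update B @ A, with B: d x r and A: r x k."""
    return r * (d + k)

def weight_memory_gb(n_params, bits):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits / 8 / 1e9

if __name__ == "__main__":
    d = k = 12288   # GPT-3 175B hidden size
    r = 4           # illustrative rank, an assumption for this example

    print(full_params(d, k))                          # 150994944
    print(lora_params(d, k, r))                       # 98304
    print(full_params(d, k) // lora_params(d, k, r))  # 1536

    n = 65_000_000_000  # a 65-billion-parameter model
    print(weight_memory_gb(n, 16))  # 130.0
    print(weight_memory_gb(n, 4))   # 32.5
```

At 16 bits the frozen base alone is roughly 130 GB, which is why it needed a cluster. At 4 bits it is roughly 32.5 GB, which leaves headroom on a 48GB card for the adapters and everything else.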

The trap is treating "cheap" as "indicated." Cost falling does not convert a knowledge problem into a behavior problem. If your model is hallucinating last quarter's numbers, a QLoRA run will faithfully teach it to hallucinate them in your house style. Cheap fine-tuning just means you can now make the wrong choice faster and at lower cost — which, historically, is exactly what people do with anything that gets cheap.

What changed underneath the question

Two shifts have quietly moved the borders. Long-context windows and prompt caching have eaten part of RAG's old territory: when you can fit the whole manual in the prompt and cache it, the retrieval-and-chunking machinery — vector databases, embedding pipelines, the dark art of chunk boundaries — starts to look like overhead for corpora that used to demand it. RAG didn't die; its floor rose. You now reach for it when the knowledge is too large, too fresh, or too access-controlled to sit in context, not reflexively for every document.
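
The "too large, too fresh, or too access-controlled" test can be sketched as a small routing check. Everything here is a hypothetical illustration: the function name, the flags, and the token counts in the usage lines are invented, and a real budget would also have to leave room for the question and the answer.

```python
def knowledge_delivery(corpus_tokens, context_budget_tokens,
                       changes_often=False, access_controlled=False):
    """Decide whether knowledge can sit in the prompt or needs retrieval.

    Returns (choice, reasons), where choice is "in-context" or "retrieve".
    """
    reasons = []
    if corpus_tokens > context_budget_tokens:
        reasons.append("too large")
    if changes_often:
        # A cached prompt prefix goes stale when the corpus keeps changing.
        reasons.append("too fresh")
    if access_controlled:
        # Different users may see different documents, so one shared
        # prompt cannot hold the whole corpus.
        reasons.append("too access-controlled")
    choice = "retrieve" if reasons else "in-context"
    return choice, reasons

if __name__ == "__main__":
    # Hypothetical numbers: an 80k-token manual against a 150k-token budget.
    print(knowledge_delivery(80_000, 150_000))
    # The same manual, but with per-user permissions on its sections.
    print(knowledge_delivery(80_000, 150_000, access_controlled=True))
```

The default branch is the point: with no reason on the list, the retrieval machinery is overhead.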

And note one structural fact the vendor docs bury: OpenAI is winding down its hosted fine-tuning platform for new users even as the open-weight world makes LoRA trivial. The center of gravity for behavior-shaping is drifting toward models you can hold the weights of. That's worth knowing before you architect around a managed fine-tune that may not be there in a year.


So the mature answer to "fine-tune or RAG" is usually both, eventually, for different reasons — RAG feeding the model facts it can't carry, a light fine-tune fixing the form those facts come out in. But the honest answer for most teams reading this is neither, yet. Write the evals. Exhaust the prompt. Most failures you'd pin on architecture are failures of specification, and the cheapest fix is still the one that doesn't touch the weights or the index at all.
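
"Write the evals" can start smaller than it sounds. Below is a minimal sketch of a harness that tags each case with the kind of failure it probes and reports pass counts per kind, so the diagnosis falls out of the results. The `run_evals` function, the case fields, the example cases, and `stub_model` are all hypothetical; `stub_model` stands in for whatever function actually calls your model.

```python
def run_evals(model, cases):
    """Run each case through `model` and tally passes by failure type.

    `model` is any callable taking a prompt string and returning a string.
    Each case is a dict with "prompt", "failure_type", and "check", where
    "check" is a callable that returns True when the output is acceptable.
    """
    results = {}
    for case in cases:
        output = model(case["prompt"])
        bucket = results.setdefault(case["failure_type"],
                                    {"passed": 0, "total": 0})
        bucket["total"] += 1
        if case["check"](output):
            bucket["passed"] += 1
    return results

def stub_model(prompt):
    """Stand-in for a real model call: obeys format, knows no facts."""
    return '{"status": "ok"}' if "JSON" in prompt else "I am not sure."

# Hypothetical cases, one per failure type.
CASES = [
    {
        "prompt": "Reply in JSON with a status field.",
        "failure_type": "behavior",
        "check": lambda out: out.strip().startswith("{"),
    },
    {
        "prompt": "What is the refund window?",
        "failure_type": "knowledge",
        "check": lambda out: "30 days" in out,
    },
]

if __name__ == "__main__":
    for failure_type, tally in run_evals(stub_model, CASES).items():
        print(failure_type, f'{tally["passed"]}/{tally["total"]}')
```

A model that passes the behavior cases and fails the knowledge cases, like the stub here, is telling you which tool to pick up.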

Diagnose the failure first. The tooling has gotten cheap enough that the only expensive mistake left is reaching for the wrong one.