Every team that ships an LLM feature arrives at the same meeting. Something confidently made up a court case, a price, a config flag that doesn't exist, and the room wants a fix that sounds like a fix: tell the model not to hallucinate. Add a line to the system prompt. "Only state facts you are certain about." It feels like progress and it changes almost nothing, because it misunderstands what a hallucination is.
A hallucination is not a separate failure mode the model slips into. It is the same next-token prediction that produces every correct sentence, applied to a region where the model's parameters don't actually encode the answer. The model is not lying; it has no concept of a fact it could betray. It is doing what it always does — emitting the most plausible continuation — and plausibility and truth only correlate when the training data made them correlate. OpenAI's 2025 analysis put a sharper point on it: the way we train and grade models rewards confident guessing over saying "I don't know," because a blank gets zero and a guess sometimes gets full marks (Kalai et al., 2025). We taught it to bluff. The survey literature has named the two flavors for years — intrinsic (the output contradicts the source) and extrinsic (the source can't confirm it either way) — and neither is an exception to the mechanism (Ji et al.).
That reframing is the whole game. If you can't make the model "more honest," then stop trying. Optimize for something you can control: making every claim cheap to check.
The wrong axis and the right one
The instinctive axis is accuracy — push the model toward being right more often. The problem is you have almost no lever on it at inference time, and the levers you do have (a better model, a better prompt) have flat, expensive returns.
The productive axis is verifiability. A fluent unsupported claim and a fluent true claim look identical on the page — same confidence, same prose. The difference that matters operationally is whether anything downstream can tell them apart. So the design goal becomes: never let the system emit a load-bearing claim that isn't traceable to a source a second process can inspect. You're not making the model honest. You're making dishonesty visible.
The levers, ranked by leverage
1. Ground it, then make it cite. Retrieval-augmented generation is the big one, because it converts "answer from your parameters" into "answer from this text," and copying is far more reliable than recall. But RAG alone leaks — when retrieval is thin, the model quietly falls back to its parameters and you can't tell. The fix that makes the gain durable is forced attribution: require the model to tag each claim with the passage it came from, and treat any claim it can't attribute as a claim to drop. Now an unsupported sentence isn't a silent error; it's a missing citation your parser catches. (If you haven't tuned this layer, start with how to add citations to a RAG pipeline and agentic vs. naive RAG.)
2. Let it say "I don't know." This is the cheapest high-value change and the one teams skip because it feels like giving up. If abstention isn't an explicitly allowed, explicitly rewarded output, the model does exactly what its training taught it to do under uncertainty: guess fluently. Anthropic's guidance is blunt about it — give the model permission to decline, and restrict it to the provided context (docs). The catch is in evaluation: if your eval scores a confident wrong answer the same as "I don't know," you're re-teaching the bluff every time you tune against it.
3. Make it check its own work. Chain-of-Verification has the model draft an answer, generate independent verification questions about its own claims, answer those in isolation, and revise — and it reports 50–70% hallucination reduction on QA and long-form tasks (Dhuliawala et al.). It costs extra calls, so reserve it for long, multi-fact outputs where one bad claim poisons the whole thing.
4. Sample and compare. For questions with a single answer, self-consistency — sample several reasoning paths, take the majority — buys real accuracy (about +10 points on GSM8K over a single sample) (Wang et al.). The same trick runs in reverse for detection: SelfCheckGPT flags hallucinations precisely because a made-up fact varies wildly across samples while a known one stays stable. Disagreement is the uncertainty signal.
A claim with a citation can be checked by a model a tenth the size. A fluent claim with no source can only be checked by a human who happens to know the answer — which is the exact thing you deployed the system to avoid.
What the cheap knobs actually buy you
Lowering temperature and adding constrained decoding belong in the stack, but be honest about what they do. Low temperature reduces low-probability token noise and makes outputs repeatable — genuinely useful for factual lookups — but a confident wrong answer is generated at temperature 0 just as readily as at 0.8. Constrained decoding (and structured-output libraries) guarantees the shape is valid — well-formed JSON, a legal enum value — which kills an entire class of format-level "hallucination," but the value inside the valid field can still be invented. These are noise filters, not truth filters. Ship them; don't expect them to save you.
Measure faithfulness, not vibes
You can't manage what you grade by gut. The metric that maps to this problem is faithfulness: of the claims in the output, what fraction is actually supported by the retrieved context? RAGAS operationalizes exactly this by extracting claims and checking each against the source (docs) — and the RAG evaluation playbook and LLM-as-a-judge cover how to run it at scale. For a model-level reference point, Vectara's HHEM leaderboard is sobering: top models hallucinate under 2% on the easy summarization task but 10–14% on the harder 2025 long-document version (leaderboard). The lesson isn't the exact number — it's that the rate is a property of your task's difficulty, so measure it on your own data or you don't know it.
None of this makes the model trustworthy. That was never available. What it makes is a system where the untrustworthy parts announce themselves — a missing citation, a sample that disagrees with its siblings, a faithfulness score that drops on a new corpus. You don't catch hallucinations by asking the model to stop. You catch them by building a pipeline where a wrong answer has nowhere to hide.



