The first thing to fix is the word. "Hallucination" is doing the work of two different concepts, and the tool you buy to detect one will be blind to the other.
The canonical taxonomy — set out in the standard survey on the topic — splits the failure into faithfulness and factuality. A faithfulness hallucination is an answer that isn't supported by the context the model was given: the retrieved passage, the source document, the conversation so far. A factuality hallucination is an answer that's false about the real world. These sound like the same defect described twice. They are not, and the gap between them is exactly where production systems get burned.
An answer can be perfectly faithful to a document that is itself wrong. Faithful, grounded, well-cited — and false.
Why one is cheap and the other is brutal
The reason almost every shipping detector measures faithfulness is that faithfulness is tractable. To check whether an answer is grounded in its context, you need only two things you already have on hand: the context and the answer. The question — "does this context support each claim in this answer?" — is a Natural Language Inference problem, the same entailment task NLP has had decent models for since well before the LLM era. No external knowledge required.
Factuality has no such shortcut. To check whether an answer is true, you have to ground it against the open world (retrieve evidence, consult a knowledge base, search the web) and then trust that evidence. That process is open-ended, expensive, and itself prone to the same grounding failures it's trying to catch. So the market quietly did the rational thing: it built faithfulness detectors and called them hallucination detectors. That's defensible, as long as you know which one you bought.
What the popular tools actually measure
Run down the common options and the pattern is unmistakable.
- Patronus Lynx is an open, fine-tuned Llama-3 model (8B and 70B) that takes a (document, question, answer) triple and returns PASS/FAIL with reasoning. Its explicit criterion is that the answer must not add information beyond the document or contradict it: faithfulness, scored on the HaluBench benchmark (~87% accuracy at both sizes, per the paper).
- RAGAS Faithfulness decomposes the answer into atomic claims and checks each for entailment by the retrieved context. The metric is literally supported claims / total claims. Pure faithfulness, and worth knowing it can be brittle: in one independent benchmark it failed to return a score on 83.5% of FinanceBench examples.
- Vectara HHEM is a small "factual consistency" classifier that scores (source, output) from 0 to 1 and runs on a CPU in under 600MB. Despite the word "factual," it measures consistency with the provided source, not world-truth: the same thing the others do, packaged as a fast gate.
The same lens explains the evaluation frameworks teams already run. Arize Phoenix, for example, ships LLM-as-judge templates that score an answer against reference context: faithfulness again. None of these is wrong. They are all answering the grounded-in-context question, which is the answerable one.
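To make the supported claims / total claims ratio concrete, here is a minimal sketch of a claim-level faithfulness score. It is not the RAGAS implementation: `split_claims` is a naive sentence splitter, and `toy_entails` is a hypothetical word-overlap stand-in for a real NLI model or LLM judge. All function names are assumptions made for illustration.

```python
import re
from typing import Callable, List, Set

STOPWORDS = {"the", "a", "an", "is", "was", "in", "of", "and", "to"}


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_entails(context: str, claim: str) -> bool:
    """Hypothetical stand-in for an NLI model: a claim counts as supported
    when all of its non-stopword tokens appear somewhere in the context."""
    claim_words = _tokens(claim) - STOPWORDS
    return bool(claim_words) and claim_words <= _tokens(context)


def split_claims(answer: str) -> List[str]:
    """Naive claim decomposition: one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def faithfulness(
    context: str,
    answer: str,
    entails: Callable[[str, str], bool] = toy_entails,
) -> float:
    """Fraction of claims in the answer that the context supports."""
    claims = split_claims(answer)
    if not claims:
        raise ValueError("answer contains no claims to score")
    supported = sum(1 for claim in claims if entails(context, claim))
    return supported / len(claims)


# A deliberately wrong source document.
context = "The Eiffel Tower is in Berlin. It was completed in 1889."

faithful_but_false = faithfulness(context, "The Eiffel Tower is in Berlin.")
true_but_unfaithful = faithfulness(
    context, "The Eiffel Tower is in Paris. It was completed in 1889."
)
print(faithful_but_false)   # 1.0
print(true_but_unfaithful)  # 0.5
```

The false answer scores 1.0 because it repeats the wrong document, while the answer that corrects the document is penalized. Note that the score only ever compares the answer with the context; nothing in it consults the world.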
The partial exceptions worth knowing
Two approaches reach past faithfulness without claiming to fact-check the world. SelfCheckGPT rests on one clean idea: if a model actually knows something, its independently sampled answers stay consistent; if it's confabulating, the samples diverge and contradict each other. Sample several responses, measure sentence-level agreement (via NLI or QA), and inconsistency becomes a hallucination signal — no context or labels needed. Cleanlab TLM operationalizes the same instinct as a product, combining self-reflection with consistency sampling to emit a 0–1 trustworthiness score over any base model. These catch a class of reasoning and self-uncertainty errors the faithfulness tools miss. But notice what they still don't do: verify against ground truth. A model can be confidently, consistently wrong.
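The sample-and-compare idea can be sketched in a few lines. This is a simplified pairwise variant, not the scoring procedure used by SelfCheckGPT or Cleanlab TLM: the `jaccard` word-overlap function is a hypothetical placeholder for an NLI or QA agreement check, and the sample lists are hard-coded stand-ins for independently sampled model responses.

```python
from itertools import combinations
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy agreement measure: word overlap between two answers, from 0 to 1."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def inconsistency(
    samples: List[str],
    agree: Callable[[str, str], float] = jaccard,
) -> float:
    """1 minus the mean pairwise agreement; higher means the samples diverge."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    pairs = list(combinations(samples, 2))
    mean_agreement = sum(agree(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_agreement


# Stand-ins for repeated samples of the same prompt.
stable = ["paris", "paris", "paris"]
drifting = ["paris", "lyon", "marseille"]
confidently_wrong = ["berlin", "berlin", "berlin"]

print(inconsistency(stable))             # 0.0
print(inconsistency(drifting))           # 1.0
print(inconsistency(confidently_wrong))  # 0.0
```

The last case is the limitation described above: a consistently wrong model produces the same clean score as a consistently right one.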
How to actually build detection
Stop shopping for "the hallucination detector." Build a layered pipeline and assign each layer the job it's good at:
- Gate with a fast classifier. A small model like HHEM scores every answer's consistency with its context in milliseconds — cheap enough to run on all traffic and flag the suspicious tail.
- Localize with a claim-level judge. On flagged answers, RAGAS-style claim decomposition or a Lynx PASS/FAIL tells you which sentence broke from the source, not just that something did.
- Catch reasoning drift with self-consistency. For high-stakes, low-context generation where a faithfulness check has nothing to ground against, sample-and-compare (SelfCheckGPT / TLM) surfaces the model's own uncertainty.
- Fix the input, not just the output. Most "hallucinations" in a RAG system are faithful answers to bad retrieval. A faithfulness score near 1 on a wrong answer is a retrieval bug wearing a generation costume — which is why detection belongs next to your retrieval evals, not bolted on after.
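The first three layers above can be wired together roughly as follows. This is a sketch under stated assumptions, not a reference design: `review`, `Verdict`, and both thresholds are hypothetical, and the three callables are stubs standing in for a fast classifier such as HHEM, a claim-level judge, and a sample-and-compare scorer. The fourth layer, checking retrieval itself, belongs in your retrieval evals and is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Verdict:
    flagged: bool
    gate_score: Optional[float] = None       # consistency with context, 0 to 1
    unsupported: List[str] = field(default_factory=list)
    inconsistency: Optional[float] = None    # sample disagreement, 0 to 1


def review(
    prompt: str,
    answer: str,
    context: Optional[str],
    gate: Callable[[str, str], float],
    localize: Callable[[str, str], List[str]],
    sample_inconsistency: Callable[[str], float],
    gate_threshold: float = 0.5,    # assumed cutoff; tune on your own data
    drift_threshold: float = 0.5,   # assumed cutoff; tune on your own data
) -> Verdict:
    # No context to ground against: fall back to self-consistency.
    if not context:
        drift = sample_inconsistency(prompt)
        return Verdict(flagged=drift > drift_threshold, inconsistency=drift)

    # Layer 1: cheap gate on every answer.
    score = gate(context, answer)
    if score >= gate_threshold:
        return Verdict(flagged=False, gate_score=score)

    # Layer 2: only flagged answers pay for claim-level localization.
    return Verdict(
        flagged=True,
        gate_score=score,
        unsupported=localize(context, answer),
    )


# Stub layers, for illustration only.
def stub_gate(context: str, answer: str) -> float:
    return 1.0 if answer in context else 0.0


def stub_localize(context: str, answer: str) -> List[str]:
    return [answer]


def stub_inconsistency(prompt: str) -> float:
    return 0.8


policy = "Refunds are accepted within 30 days."
layers = (stub_gate, stub_localize, stub_inconsistency)

ok = review("Refund window?", "Refunds are accepted within 30 days.", policy, *layers)
bad = review("Refund window?", "Refunds are accepted within 90 days.", policy, *layers)
ungrounded = review("Refund window?", "Probably 30 days.", None, *layers)

print(ok.flagged, bad.flagged, ungrounded.flagged)  # False True True
```

Passing the layers in as callables keeps the routing logic separate from any particular vendor or model, so each layer can be swapped without touching the others.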
The one thing not to do is read a green faithfulness score as "this answer is true." It means the answer matches what you fed the model. Whether what you fed the model was true is a different question, with a different — and much shorter — list of tools that even try to answer it.



