Somewhere in most RAG deployments there is a metric called faithfulness, and somewhere near it is a team that believes it measures whether the answers are right. It does not. This is the most consequential confusion in retrieval-augmented generation, and it hides in plain sight because the metric works exactly as designed — it just measures a different thing than the one people are worried about.

Faithfulness asks a narrow, mechanical question: does every claim in the generated answer follow from the passages the model retrieved? Ragas computes it by decomposing the answer into individual claims and checking each against the source text. Vectara's HHEM and Azure's Groundedness do a similar job with a trained classifier. All of them audit the generator — the step that turns retrieved text into prose — and only that step.
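To make the mechanics concrete, here is a minimal sketch of a claim-level faithfulness score: the share of answer claims that the retrieved context supports. The function names, the refund example, and the substring-based `naive_support` check are all illustrative assumptions. A real detector would use an LLM or NLI judge for the support decision, and this is not Ragas's actual implementation.

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claims:
        return 0.0  # assumption: an answer with no claims scores 0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_support(claim: str, context: str) -> bool:
    # Toy stand-in for an LLM/NLI judge: case-insensitive substring match.
    return claim.lower() in context.lower()


# Hypothetical retrieved passage and already-decomposed answer claims.
context = (
    "The refund window is 30 days. "
    "Refunds are issued to the original payment method."
)
claims = [
    "the refund window is 30 days",   # supported by the context
    "refunds take 5 business days",   # not in the context
]

score = faithfulness_score(claims, context, naive_support)
print(score)  # 0.5
```

Note that nothing in this computation consults any source other than `context`, which is the scope limit the rest of this piece is about.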

The blind spot is the whole point

Here is what that scope leaves out. A faithfulness check never looks at whether the retrieved passage is correct. It compares the answer to the context, not the context to reality.

If retrieval surfaces a stale, wrong, or adversarial document, a faithful model will faithfully repeat the error — and every faithfulness detector will score it a perfect 1.0.

Read that failure mode again, because it is not a corner case. It is the single most common way production RAG goes wrong: retrieval pulls the wrong revision of a policy, an outdated price, a superseded API, and the generator does its job flawlessly by grounding its answer in that bad source. The output is confidently wrong and perfectly faithful. The check you deployed specifically to stop wrong answers is blind to it by construction.

So faithfulness is not accuracy. Accuracy — what the eval literature calls answer correctness — asks whether the answer is true, which requires a gold reference to compare against. Faithfulness needs no reference at all, which is exactly why it's cheap and popular, and exactly why it cannot see ground truth. The two metrics are orthogonal. You can be faithful and wrong (bad context, good generation) or correct and unfaithful (the model added a true fact from its own weights that wasn't in the context). Optimizing one tells you almost nothing about the other.
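The two off-diagonal cases can be shown with a toy example. Everything here is an assumption made for illustration: the prices, the stale pricing page, and the string-matching checks, which stand in for a real grounding detector and a real correctness metric.

```python
def is_faithful(answer: str, context: str) -> bool:
    # Toy grounding check: the answer text appears in the retrieved context.
    return answer.lower() in context.lower()


def is_correct(answer: str, gold: str) -> bool:
    # Toy correctness check: exact match against a gold reference.
    return answer.strip().lower() == gold.strip().lower()


# Hypothetical setup: the true price is $20, but retrieval surfaced a stale page.
gold = "the price is $20"
stale_context = "Pricing page (2022): the price is $15."

cases = {
    # The generator repeats the stale context exactly.
    "faithful_and_wrong": "the price is $15",
    # The generator answers from its own weights, ignoring the context.
    "correct_and_unfaithful": "the price is $20",
}

results = {
    name: (is_faithful(answer, stale_context), is_correct(answer, gold))
    for name, answer in cases.items()
}

for name, (faithful, correct) in results.items():
    print(f"{name}: faithful={faithful}, correct={correct}")
```

The first case passes the grounding check and fails the correctness check; the second does the reverse. Only the check that has access to `gold` can tell which answer is true.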

Even within grounding, the detector picks your errors

Suppose you accept the narrow scope and just want the best grounding check. There is still no single best answer, because the detectors have measurably different error profiles.

Groundedness classifiers like HHEM and Azure's Groundedness tend toward high precision and low recall: when they flag an unsupported claim, they're usually right, but they let subtle ones slip. Ragas-style faithfulness runs the other way, with high recall and low precision: it catches more hallucinations and raises more false alarms. Neither is "better." They're different bets about whether a missed hallucination or a false alarm costs you more.
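The trade-off is easier to see with numbers. The sketch below computes precision and recall for two hypothetical detectors on eight labeled claims. The labels and flags are invented to illustrate the two error profiles and are not measurements of any real detector.

```python
from typing import List, Tuple


def precision_recall(flags: List[bool], truth: List[bool]) -> Tuple[float, float]:
    """Precision and recall of hallucination flags against ground-truth labels."""
    tp = sum(1 for f, t in zip(flags, truth) if f and t)
    fp = sum(1 for f, t in zip(flags, truth) if f and not t)
    fn = sum(1 for f, t in zip(flags, truth) if not f and t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# True = the claim really is a hallucination (invented labels).
truth = [True, True, True, True, False, False, False, False]

# A cautious detector: flags only the obvious cases, never false-alarms.
cautious = [True, True, False, False, False, False, False, False]

# An eager detector: catches every hallucination, plus two false alarms.
eager = [True, True, True, True, True, True, False, False]

cautious_pr = precision_recall(cautious, truth)
eager_pr = precision_recall(eager, truth)

print(cautious_pr)  # (1.0, 0.5)
print(eager_pr)
```

The cautious profile misses half of the hallucinations but never cries wolf; the eager profile misses none, and a third of its flags are wrong. Which one is cheaper depends on what a miss and a false alarm cost in your product.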

There's a second axis the 2026 real-time evaluation studies surfaced clearly: context length. Sentence-level fact-checkers such as Bespoke-MiniCheck are excellent on short question-answering but degrade on long documents. Long-context-tuned judges like Patronus Lynx — an open 8B/70B model that edges Claude 3.5 Sonnet on HaluBench — excel at summarization and falter on short NLI tasks. Deploy the wrong one for your output length and your "hallucination rate" is measuring the detector's weakness, not your system's.
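One way to act on this is to route each check to a detector family based on input length. The sketch below is only an illustration: the 1,500-word threshold, the whitespace word count used as a proxy for tokens, and the returned labels are all assumptions, and a real cutoff should come from evaluating candidate detectors on your own data.

```python
def pick_detector(context: str, long_context_words: int = 1500) -> str:
    """Choose a detector family from a rough length estimate of the context."""
    # Whitespace word count is a crude proxy for token count (assumption).
    n_words = len(context.split())
    if n_words >= long_context_words:
        return "long_context_judge"      # e.g. a summarization-tuned judge
    return "sentence_level_checker"      # e.g. a short-form fact-checker


short_ctx = "The API rate limit is 100 requests per minute."
long_ctx = "word " * 5000  # stand-in for a long retrieved document

print(pick_detector(short_ctx))  # sentence_level_checker
print(pick_detector(long_ctx))   # long_context_judge
```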

What to actually build

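The instinct to reach for one number is the mistake. Grounding (does the answer follow from the retrieved context?), retrieval quality (was the retrieved context the right one?), and truth (does the answer match a gold reference?) are three separate audits, and a RAG system needs all three.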
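A minimal sketch of reporting the three audits side by side instead of collapsing them into one score follows. The function name, the audit labels, and the stale-policy example are hypothetical; each boolean would come from its own evaluator in practice.

```python
from typing import List


def failed_audits(grounded: bool, retrieval_ok: bool, correct: bool) -> List[str]:
    """Return the names of the audits that failed, in a fixed order."""
    checks = {
        "grounding": grounded,      # answer follows from the retrieved context
        "retrieval": retrieval_ok,  # retrieved context is the right source
        "truth": correct,           # answer matches a gold reference
    }
    return [name for name, ok in checks.items() if not ok]


# Hypothetical case: retrieval returned a superseded policy revision,
# and the generator repeated it exactly.
stale_policy = failed_audits(grounded=True, retrieval_ok=False, correct=False)
print(stale_policy)  # ['retrieval', 'truth']
```

A dashboard that tracks only the first flag would show this case as a pass.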

The reason this matters commercially: a team that ships only a faithfulness gate will watch its dashboard stay green while users hit confidently wrong answers, because the green number was never measuring the thing that turned red. Faithfulness is a real, useful signal, and the work from Vectara and the evolving leaderboards has made it genuinely rigorous. It just answers "did the model stay on-script?" and not "was the script true?" Buy the first check knowing it doesn't sell you the second. For where the wrong context comes from in the first place, the retrieval side of this is its own discipline; see contextual retrieval vs naive RAG.