A team ships a RAG chatbot. A week later the complaints arrive: it "hallucinates," it "makes things up," it "ignores our docs." So they do the obvious thing — they tune the generation prompt. They add "only answer from the provided context." They swap to a bigger model. The hallucinations persist, and three weeks evaporate.
The diagnosis was wrong from the first hour. Here is the idea that reorganizes the whole problem:
Most RAG failures are retrieval failures wearing a generation costume. If the right chunk was never fetched, no prompt, no model, and no temperature setting can save the answer.
This is why you cannot evaluate a RAG pipeline by reading final answers and grading them. A bad answer tells you the system failed; it does not tell you which half failed. You have to measure the two halves separately.
The pipeline has a seam, so your metrics need one too
A RAG pipeline does two distinct jobs. First it retrieves — it searches an index and returns some chunks. Then it generates — an LLM reads those chunks and writes an answer. The seam between them is where diagnosis lives. The Ragas framework draws the same line: context precision and context recall score the retrieval step, while faithfulness and answer relevancy score the generation step (Ragas docs).
The single most important consequence is a ceiling. Retrieval recall sets the maximum quality of the entire system. If the chunk containing the answer is not in the candidate set you hand to the model, the answer is unrecoverable downstream — the model is being asked to cite a source it never saw. So before you touch a prompt, ask the only question that can't be patched later: did retrieval even fetch the right chunk?
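The ceiling can be made explicit with the law of total probability. This is a sketch under assumptions that are not in the cited sources: R is the event that the answer-bearing chunk is in the context handed to the model, C is the event that the final answer is correct, and the model is assumed unable to answer correctly from its own parametric memory when the chunk is missing.

```latex
\begin{aligned}
P(C) &= P(C \mid R)\,P(R) + P(C \mid \neg R)\,P(\neg R) \\
     &\approx P(C \mid R)\,P(R) \qquad \text{assuming } P(C \mid \neg R) \approx 0 \\
     &\le P(R)
\end{aligned}
```

Under that assumption, if retrieval surfaces the right chunk for 70% of queries, even a perfect generator tops out at roughly 70% correct answers.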
Retrieval metrics: was the right chunk fetched, and ranked well?
These are classic information-retrieval metrics, computed against a labeled set of queries where you know which chunks are relevant.
- Recall@k: of all the relevant chunks for a query, the fraction that landed in the top k you retrieved. This is the ceiling check. A relevant chunk at position 1 and one at position 10 count the same as long as both fall inside the top k; recall@k only asks whether each made the cut (IR metrics reference).
- Precision@k: of the k chunks you fetched, the fraction that were actually relevant. Low precision means you're stuffing the context window with noise, which invites hallucination downstream.
- MRR (Mean Reciprocal Rank): for each query, take 1/rank of the first relevant result, then average across queries. It rewards getting one good chunk to the top fast, which suits single-best-answer lookups (IR metrics reference).
- nDCG (Normalized Discounted Cumulative Gain): rewards both relevance and position, discounting hits that appear lower down, then normalizing against the ideal ordering so the score sits between 0 and 1 (Evidently AI).
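The four metrics above are short enough to write by hand. The following is a minimal sketch, assuming binary relevance labels (a chunk is relevant or it is not), unique chunk IDs in each ranked list, and a log2 position discount for nDCG. The function names and the sample chunk IDs are illustrative, not taken from any particular library.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in ranked[:k] if chunk_id in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots occupied by relevant chunks."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in ranked[:k] if chunk_id in relevant)
    return hits / k


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(ranked[:k], start=1)
        if chunk_id in relevant
    )
    # The ideal ordering places every relevant chunk at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_reciprocal_rank(runs: Sequence[tuple]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical query: two relevant chunks, found at positions 2 and 5.
ranked = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c1"}
print(recall_at_k(ranked, relevant, 3))     # 0.5
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(reciprocal_rank(ranked, relevant))    # 0.5
print(ndcg_at_k(ranked, relevant, 5))
```

In this sample, recall@5 is a perfect 1.0 while nDCG@5 comes out at roughly 0.62. Both relevant chunks were fetched, but neither sits at the top of the list.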
Recall@k tells you whether retrieval can succeed. MRR and nDCG tell you whether your ranking, and your reranker, puts the good chunk where the model will actually weight it. If recall@k is high but nDCG is low, you have a ranking problem, not a search-coverage problem, and the fix lives in the reranker or in the choice between hybrid and semantic search, not in the LLM.
Generation metrics: did the model use the context it was given?
Once you trust retrieval, you measure whether the model honored it. Ragas leans on two reference-free scores, and the original paper frames them as the model's ability to exploit retrieved passages faithfully and to answer the actual question (Ragas, arXiv 2309.15217):
- Faithfulness / groundedness — is every claim in the answer supported by the retrieved context? This is the direct hallucination check. TruLens calls it groundedness: the extent to which the answer's claims can be attributed back to the source text (TruLens).
- Answer relevance — does the answer address the question that was asked, rather than wandering off into something adjacent and correct-sounding?
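One common way to turn the faithfulness question into a number is the share of the answer's claims that the context supports. The sketch below assumes the answer has already been split into atomic claims upstream, and it takes the support check as a pluggable function. In practice that check would be an LLM judge or an entailment model; `naive_substring_check` here is a deliberately crude, hypothetical stand-in so the example runs on its own.

```python
from typing import Callable, Sequence


def faithfulness_score(
    claims: Sequence[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Share of claims that the checker judges to be supported by the context."""
    if not claims:
        # Conservative choice for this sketch: an answer with no claims scores 0.
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_substring_check(claim: str, context: str) -> bool:
    """Toy checker: treats a claim as supported only if it appears verbatim."""
    return claim.lower() in context.lower()


# Hypothetical example: one claim is grounded, one is invented.
context = "Refunds take 5 days to process."
claims = ["Refunds take 5 days", "Refunds are free"]
print(faithfulness_score(claims, context, naive_substring_check))  # 0.5
```

Keeping the checker as a parameter means the scoring logic stays the same when the toy check is swapped for a real judge.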
Two retrieval-flavored metrics also live in Ragas and bridge the seam: context precision (are the retrieved chunks relevant?) and context recall (does the retrieved context contain everything needed to answer?). The rule of thumb is clean: low context recall is a retrieval problem; low faithfulness is a generation problem.
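That rule of thumb can be written as a small triage function. This is only a sketch: the 0.8 thresholds and the function name `diagnose` are placeholders chosen for illustration, and sensible cutoffs depend on your own eval set.

```python
def diagnose(
    context_recall: float,
    faithfulness: float,
    recall_floor: float = 0.8,        # hypothetical threshold
    faithfulness_floor: float = 0.8,  # hypothetical threshold
) -> str:
    """Point at the half of the pipeline that most likely needs work."""
    # Retrieval is checked first: if the context is missing the needed
    # material, a low faithfulness score cannot be fixed downstream.
    if context_recall < recall_floor:
        return "retrieval"
    if faithfulness < faithfulness_floor:
        return "generation"
    return "ok"


print(diagnose(context_recall=0.55, faithfulness=0.90))  # retrieval
print(diagnose(context_recall=0.92, faithfulness=0.60))  # generation
print(diagnose(context_recall=0.92, faithfulness=0.91))  # ok
```

When both scores are low, the function still reports retrieval, since generation cannot be judged fairly until the context is right.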
The triad, the eval set, and the judge
TruLens packages this into the RAG triad: context relevance, groundedness, and answer relevance — one retrieval check and two generation checks (TruLens). Pass all three and you have real evidence the system is grounded, not just a vibe.
Two practical notes on how you measure. First, build an offline eval set — a frozen list of representative queries with known-relevant chunks and ideally reference answers. This is what makes recall@k and nDCG computable at all, and it's the asset most production teams are missing. Second, for the fuzzy generation metrics that have no clean ground truth, the standard move is LLM-as-a-judge: a carefully prompted model scoring faithfulness and relevance at scale. It's powerful and cheap, but it is itself a model with biases, so calibrate it against human labels before you trust its numbers — a discipline worth its own treatment in LLM-as-a-Judge.
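Calibrating a judge against human labels can start with two numbers: raw agreement and Cohen's kappa, which corrects agreement for chance. The sketch below assumes binary verdicts (for example, "faithful" versus "not faithful") on the same set of answers. The label lists are made-up illustration data, not results from any real judge.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Fraction of items on which the judge and the human gave the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    p_judge_yes = sum(judge) / n
    p_human_yes = sum(human) / n
    # Agreement expected if both raters labeled at random with these base rates.
    expected = p_judge_yes * p_human_yes + (1 - p_judge_yes) * (1 - p_human_yes)
    if expected == 1.0:
        # Degenerate case: both raters gave one identical constant label.
        # Kappa is undefined here; this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up verdicts on eight answers.
judge_labels = [True, True, False, False, True, False, True, True]
human_labels = [True, True, False, True, True, False, False, True]
print(agreement_rate(judge_labels, human_labels))  # 0.75
print(cohens_kappa(judge_labels, human_labels))
```

Here raw agreement is 0.75 but kappa is about 0.47, a reminder that a judge can look good on agreement alone when one label dominates.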
Component evaluation finds the broken half. End-to-end evaluation confirms the whole thing serves users. You need both — but you start with the seam, because that's the only place that tells you what to fix.



