Here is a small irony at the heart of RAG evaluation. The metric the embedding leaderboards crown (NDCG@10, the headline retrieval number on MTEB and BEIR) is, for a lot of RAG pipelines, the wrong thing to optimize. Teams copy it anyway, because it's what "good retrieval" is supposed to mean. NDCG measures ranking quality with surgical care. Your generator may not need most of that care.

To see why, you have to separate the metrics by the question each one actually asks.

Three families, one axis: does the metric care where?

Retrieval metrics sort cleanly once you ask whether they care about the position of the right chunk, not just its presence.

Presence-only (binary, rank-insensitive). Recall@k is the fraction of all truly relevant chunks that land in the top-k: four relevant, three retrieved in the top-5, Recall@5 = 0.75. Hit@k (Hit Rate) is its coarse sibling — 1 if any relevant chunk made the top-k, averaged over queries; the two coincide when exactly one chunk is relevant per query. Precision@k flips the question to noise: what fraction of the k you retrieved is relevant. None of these care whether the right chunk is at rank 1 or rank k. Inside the window, position is invisible to them.
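The three presence-only metrics are short enough to write out. What follows is a minimal sketch, not any library's implementation: the function names and the chunk ids are made up for illustration, and it assumes each retrieved list contains no duplicate ids. It reuses the worked example above (four relevant chunks, three of them in the top-5) and shows that reversing the order inside the window changes nothing.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant chunk made the top-k, else 0.0 (average over queries)."""
    return 1.0 if any(c in relevant for c in retrieved[:k]) else 0.0


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the k retrieved slots that hold a relevant chunk."""
    if k <= 0:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / k


# Hypothetical ids: four relevant chunks, three of which are retrieved.
relevant = {"a", "b", "c", "d"}
ranking = ["a", "x", "b", "y", "c"]
shuffled = ["c", "y", "b", "x", "a"]  # same chunks, reversed order

print(recall_at_k(ranking, relevant, 5))     # 0.75
print(recall_at_k(shuffled, relevant, 5))    # 0.75, order is invisible
print(precision_at_k(ranking, relevant, 5))  # 0.6
print(hit_at_k(ranking, relevant, 5))        # 1.0
```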

Rank-sensitive (binary). MRR is the mean over queries of 1/(rank of the first relevant result): a first hit at rank 1 scores 1.0, at rank 3 scores about 0.33, and everything after the first hit is ignored. MAP is more thorough: for each query, take precision@i at every rank i that holds a relevant result, average those values over all relevant results (one that was never retrieved contributes zero), then take the mean across queries. Both still use yes/no relevance, but both reward putting relevant docs earlier.
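Here is a rough sketch of both rank-sensitive metrics, with illustrative function names and made-up chunk ids. It assumes the common convention of dividing average precision by the total number of relevant chunks, so a relevant chunk that was never retrieved contributes zero. The two example rankings contain exactly the same chunks; only the order differs.

```python
from typing import Sequence, Set


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@i taken at each rank i that holds a relevant result."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / len(relevant)


relevant = {"a", "b"}
early = ["a", "b", "x", "y"]  # relevant chunks first
late = ["x", "y", "a", "b"]   # same chunks, relevant ones last

# MRR and MAP are these per-query values averaged over the query set.
mrr = (reciprocal_rank(early, relevant) + reciprocal_rank(late, relevant)) / 2
map_score = (average_precision(early, relevant) + average_precision(late, relevant)) / 2

print(round(mrr, 3))        # 0.667
print(round(map_score, 3))  # 0.708
```

Recall@4 is 1.0 for both rankings, which is the distinction the families are built on: the presence-only metrics cannot tell `early` from `late`, and these two can.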

Graded and rank-sensitive. NDCG is the only one here that uses multi-level relevance grades. DCG@k sums, over ranks i = 1..k, each result's gain divided by a log2(i+1) positional discount; IDCG is the DCG of the ideal ordering; NDCG = DCG/IDCG, which normalizes the score to [0,1]. The discount is the whole point: a relevant item earns less the further down it sits, and a "highly relevant" item is worth more than a "marginally relevant" one. That fine-grained sensitivity to position and grade is what makes NDCG the right metric for a ranked results page a human reads top-down, and what makes it borrowed clothes for RAG.
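The definition translates almost line for line into code. This sketch assumes linear gain (the gain is the grade itself); some evaluation setups use an exponential gain of 2^grade - 1 instead, which weights high grades more heavily. The function names and the 0-to-3 grade scale are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(grades: Sequence[float], k: int) -> float:
    """Sum of grade / log2(i + 1) over ranks i = 1..k (linear gain assumed)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades[:k], start=1))


def ndcg_at_k(ranked_grades: Sequence[float], all_grades: Sequence[float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ordering.

    ranked_grades: grades of the retrieved results, in the order retrieved.
    all_grades: grades of every judged result for the query, in any order.
    """
    ideal = dcg_at_k(sorted(all_grades, reverse=True), k)
    return dcg_at_k(ranked_grades, k) / ideal if ideal > 0 else 0.0


judged = [3, 2, 0]  # one highly relevant, one marginal, one irrelevant

print(ndcg_at_k([3, 2, 0], judged, 3))            # 1.0, the ideal ordering
print(round(ndcg_at_k([3, 0, 2], judged, 3), 3))  # 0.939
print(round(ndcg_at_k([0, 2, 3], judged, 3), 3))  # 0.648
```

All three rankings return the same three results, so a presence-only metric scores them identically. NDCG separates them purely on where the grades landed.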

Why recall is the floor for RAG

A search engine's user scans results in order and usually clicks near the top, so where the best result lands is the product. A RAG generator does not scan. It receives the entire top-k context block at once, and it can read a relevant chunk whether that chunk arrived at rank 1 or rank 8.

That single architectural fact reorders the metrics. The failure mode that actually destroys a RAG answer is a relevant chunk being absent from the top-k — because that's the one failure the generator cannot recover from. A chunk ranked 5th instead of 1st is still in the prompt; a chunk that never made the cut is gone, and no amount of reasoning conjures it back.

Recall@k is a necessary condition: it asks whether the evidence is in the window at all. Rank metrics are a correction term: they ask whether the evidence that's present will actually get read.

So the first-order metric for a RAG retriever is Recall@k, with k pinned to your real context budget — the number of chunks you can actually afford to send after ordering and packing the prompt. Optimizing NDCG@10 when you only feed the model three chunks is measuring a ranking you then throw away. This is also why a reranker helps less than its demos suggest when your recall is already high and k is small: it reshuffles chunks the generator was going to see regardless.
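To make the "pin k to your budget" point concrete, here is a small sketch with a made-up two-query eval set. `CONTEXT_BUDGET`, `mean_recall_at_k`, and the chunk ids are all hypothetical; the only thing being shown is how much the number moves when k goes from the leaderboard's 10 to the 3 chunks that actually reach the model.

```python
from typing import List, Sequence, Set, Tuple

CONTEXT_BUDGET = 3  # hypothetical: chunks that actually fit in the prompt


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Recall@k over queries; each run is (ranked ids, relevant ids)."""
    scores = [
        len(set(retrieved[:k]) & relevant) / len(relevant)
        for retrieved, relevant in runs
        if relevant  # skip queries with no labelled relevant chunk
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Two made-up queries, each with one relevant chunk among ten retrieved.
runs = [
    ([f"q1_c{i}" for i in range(1, 11)], {"q1_c2"}),  # relevant chunk at rank 2
    ([f"q2_c{i}" for i in range(1, 11)], {"q2_c7"}),  # relevant chunk at rank 7
]

print(mean_recall_at_k(runs, 10))              # 1.0
print(mean_recall_at_k(runs, CONTEXT_BUDGET))  # 0.5
```

At k=10 the retriever looks perfect. At the k the generator actually sees, half the queries arrive with no evidence at all.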

Why position sneaks back in

If recall were the whole story, you'd retrieve a giant k, guarantee the chunk is in there, and stop. Two effects stop you.

Truncation. Retrieve more than fits and the low-ranked relevant chunks get cut before they reach the model — which silently converts a ranking problem back into a recall problem. The bigger your k relative to the window, the more rank position is quietly deciding recall.
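A toy illustration of truncation eating recall. The packing rule here (fill the prompt in rank order and stop at the first chunk that does not fit) and the whitespace token count are simplifying assumptions; a real pipeline would use the model's tokenizer and may pack differently. All names and ids are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def pack_by_rank(chunks: Sequence[Tuple[str, str]], token_budget: int) -> List[str]:
    """Keep chunks in rank order until the next one would exceed the budget.

    chunks: (chunk_id, text) pairs, best-ranked first.
    Token cost is approximated by whitespace-separated words (an assumption).
    """
    kept, used = [], 0
    for chunk_id, text in chunks:
        cost = len(text.split())
        if used + cost > token_budget:
            break  # everything ranked below this point is cut
        kept.append(chunk_id)
        used += cost
    return kept


def recall(ids: Sequence[str], relevant: Set[str]) -> float:
    return len(set(ids) & relevant) / len(relevant) if relevant else 0.0


# Four retrieved chunks of 40 "tokens" each; the relevant ones sit at ranks 1 and 4.
chunks = [(cid, "tok " * 40) for cid in ["a", "b", "c", "d"]]
relevant = {"a", "d"}

retrieved_ids = [cid for cid, _ in chunks]
sent_ids = pack_by_rank(chunks, token_budget=100)

print(recall(retrieved_ids, relevant))  # 1.0, as measured at retrieval time
print(sent_ids)                         # ['a', 'b']
print(recall(sent_ids, relevant))       # 0.5, as seen by the generator
```

The retriever's Recall@4 is perfect, but the generator's effective recall is half that, and the only thing that decided which relevant chunk survived was its rank.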

Lost in the middle. Even when the chunk fits, where it sits in the assembled context changes whether the model uses it. Liu et al. (2023) showed that models use information at the very start or end of a long context far more reliably than information in the middle: a U-shaped accuracy curve that held even for explicitly long-context models. A relevant chunk buried at position 14 of 20 is present, counted by your recall metric, and substantially less likely to be read. This is the same family of failure as general long-context degradation: more context is not freely more signal.
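One way to act on the U-shaped curve is to reorder chunks at prompt-assembly time so the strongest ones sit at the edges. The sketch below is an illustrative heuristic, not a method taken from Liu et al.: it alternates ranked chunks between the front and the back of the context so the weakest land in the middle. The function name is hypothetical, and whether this helps a given model is something to measure, not assume.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_edges(ranked: Sequence[T]) -> List[T]:
    """Place best-ranked items at the start and end, weakest in the middle.

    ranked: items in rank order, best first.
    Even-indexed items fill the front; odd-indexed items fill the back,
    reversed so the second-best item ends up last.
    """
    front = [item for i, item in enumerate(ranked) if i % 2 == 0]
    back = [item for i, item in enumerate(ranked) if i % 2 == 1]
    return front + back[::-1]


ranked = ["r1", "r2", "r3", "r4", "r5"]  # r1 is the best-ranked chunk
print(reorder_for_edges(ranked))  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

Note that this only pays off when the ranking is trustworthy, which is one more reason rank quality matters once recall is in hand.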

Put together, the two effects are why rank metrics aren't useless — they're the second-order correction once recall is handled. A high MRR or a rank-weighted Context Precision tells you the useful chunk is near the top, which is exactly where truncation and lost-in-the-middle want it.

A measurement order that matches the architecture

The practical sequence falls out of all this, and it's nearly the reverse of how a search team would work:

  1. Measure Recall@k first, with k set to what you actually send. If recall is low, nothing downstream matters — fix retrieval (embeddings, hybrid search, chunking) before touching anything else.
  2. Once recall is high, reach for rank metrics — MRR, NDCG, or RAGAS Context Precision — and a reranker, specifically to fight truncation and lost-in-the-middle by getting the best chunk to the top or the edges.
  3. Use end-to-end RAG eval (faithfulness, answer correctness) as the outer loop, because retrieval metrics are necessary but not sufficient — perfect recall feeding a model that ignores the evidence still fails.
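The sequence above can be sketched as a small triage function. Everything here is hypothetical scaffolding: the function name, the returned labels, and especially the thresholds, which are placeholders to be replaced with values that make sense for your own eval set. The only thing the sketch encodes is the ordering, with recall checked before ranking and ranking before generation.

```python
def next_step(
    recall_at_k: float,
    rank_metric: float,
    answer_score: float,
    recall_floor: float = 0.9,  # placeholder threshold, not a recommendation
    rank_floor: float = 0.7,    # placeholder threshold, not a recommendation
    answer_floor: float = 0.8,  # placeholder threshold, not a recommendation
) -> str:
    """Return the first stage whose metric is below its floor.

    recall_at_k: Recall@k with k set to the number of chunks actually sent.
    rank_metric: any rank-sensitive score (MRR, NDCG, Context Precision).
    answer_score: an end-to-end score such as faithfulness or correctness.
    """
    if recall_at_k < recall_floor:
        return "fix retrieval"       # embeddings, hybrid search, chunking
    if rank_metric < rank_floor:
        return "improve ranking"     # reranker, chunk ordering
    if answer_score < answer_floor:
        return "inspect generation"  # evidence present but not used
    return "ok"


print(next_step(recall_at_k=0.6, rank_metric=0.9, answer_score=0.9))   # fix retrieval
print(next_step(recall_at_k=0.95, rank_metric=0.5, answer_score=0.9))  # improve ranking
print(next_step(recall_at_k=0.95, rank_metric=0.9, answer_score=0.4))  # inspect generation
```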

The headline number on the leaderboard isn't wrong. It's just answering a question — how well is this list ranked? — that your generator, reading the whole list at once, didn't ask. Ask the question it does: is the answer in here at all? That's recall, and for RAG it's the floor everything else stands on.