A reranker is the upgrade everyone recommends and almost nobody measures honestly. It is stateless, it bolts on after retrieval without touching your index, and the demos are persuasive: feed it the messy top-50 from your vector search and it floats the genuinely relevant chunks to the top. So the evaluation collapses into a single question — which reranker scores highest on the benchmark — and a team picks the leaderboard winner, wires it in, and watches the answer quality move by approximately nothing. The leaderboard wasn't lying. It was answering a question that wasn't the one that mattered.
The thing to internalize before you score a single model is what a reranker structurally is. It is not a retriever. It does not go find documents. It re-scores the candidates that first-stage retrieval already handed it, and reorders them. That one sentence contains the whole evaluation strategy, because it means there is a number that caps the reranker's value before the reranker runs — and that number belongs to your retriever, not to the reranker.
Measure the ceiling first
Call it the recall ceiling. If the chunk that answers the query is sitting at rank 300 in similarity space and your pipeline only fetches the top 100 candidates, the reranker never sees it. It cannot promote a document that isn't in the pool. The retrieve-rerank-generate pattern formalized in systems like Re2G makes this explicit: the reranker's job is to reorder stage-one's output, so stage-one's recall is the hard upper bound on everything downstream.
So the first measurement in any reranker evaluation isn't a reranker at all. It is recall@fetch_k for your retriever, plotted as you grow the candidate pool. This tells you two things a reranker leaderboard cannot. First, whether reranking can help you at all — if your retriever already has the right chunk at rank 2, a reranker has almost nothing to fix, and the cheapest large win is somewhere else. Second, it surfaces the real lever, which is fetch_k itself, the size of the pool you rerank over. Set it too low and you starve the reranker of the document it was supposed to find for you; this is the single most common reason a "good" reranker shows no lift.
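A minimal sketch of that first measurement, assuming you already have, per query, the retriever's ranked chunk ids and a labeled set of chunks that answer the query. The names `recall_at_k`, `recall_curve`, `rankings`, and `gold` are illustrative and the data is a toy, not a result.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of a query's relevant chunks that appear in the top-k candidates."""
    if not relevant_ids:
        raise ValueError("query has no labeled relevant chunks")
    found = relevant_ids.intersection(ranked_ids[:k])
    return len(found) / len(relevant_ids)


def recall_curve(
    rankings: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    fetch_ks: Sequence[int],
) -> Dict[int, float]:
    """Mean recall@fetch_k across all labeled queries, one entry per pool size."""
    curve = {}
    for k in fetch_ks:
        scores = [recall_at_k(rankings[q], gold[q], k) for q in gold]
        curve[k] = sum(scores) / len(scores)
    return curve


# Toy data: q1's answer sits at rank 1, q2's answer sits at rank 4.
rankings = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["x", "y", "z", "w"],
}
gold = {"q1": {"a"}, "q2": {"w"}}

curve = recall_curve(rankings, gold, fetch_ks=[1, 2, 4])
print(curve)
```

In this toy case recall stays at 0.5 until the pool reaches 4 candidates, so a reranker working over a pool of 2 could never surface the answer to `q2`, however good it is.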
A reranker converts recall into precision. It cannot manufacture recall. Evaluate the thing it depends on before you evaluate the thing itself.
There is a ceiling on the pool side too. Quality climbs as you enlarge the candidate set and then flattens — vendor benchmarks put the knee somewhere around 50 to 100 candidates for typical NDCG@10, past which it plateaus while a cross-encoder's cost and latency keep rising about linearly. So the candidate count is a budget knob with a clear shape: the smallest pool whose recall clears your target, not the largest one you can afford.
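The "smallest pool whose recall clears your target" rule can be sketched as a lookup over a measured recall curve. The function name `smallest_pool` and the curve values below are made up for illustration; substitute the curve you measured on your own queries.

```python
from typing import Dict, Optional


def smallest_pool(curve: Dict[int, float], target_recall: float) -> Optional[int]:
    """Return the smallest fetch_k whose measured recall meets the target.

    Returns None when no measured pool size reaches the target, which points
    at the retriever rather than at the pool size or the reranker.
    """
    for k in sorted(curve):
        if curve[k] >= target_recall:
            return k
    return None


# Hypothetical recall@fetch_k measurements: recall flattens past 100.
measured = {25: 0.71, 50: 0.88, 100: 0.93, 200: 0.94}

print(smallest_pool(measured, target_recall=0.90))
print(smallest_pool(measured, target_recall=0.99))
```

With these illustrative numbers, doubling the pool from 100 to 200 buys one point of recall while roughly doubling cross-encoder work, which is the shape the paragraph above describes.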
The metric you debug with isn't the metric you decide with
Once the pool is real, you reach for the ranking metrics, and they are genuinely useful for diagnosis. NDCG, MRR, Recall@k, and MAP each see something different. NDCG@k is the most informative single number because it credits graded relevance (a perfect chunk counts more than a merely on-topic one) and discounts lower ranks on a logarithmic curve. MRR only cares where the first relevant hit lands. Recall@k is position-blind: it counts how many relevant chunks made the top k, not where they sit. MAP averages precision at the rank of each relevant hit, so it rewards getting all of them high but treats relevance as binary. They are the right instruments for answering "is my reranker putting better chunks higher", which is exactly what you want when you're debugging.
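For reference, here is a small sketch of three of these diagnostics over a list of graded labels in ranked order. It assumes the linear-gain form of DCG (grade divided by log2 of rank plus one); the exponential-gain variant is also common. It also assumes the list holds the grades for the whole labeled pool, so the ideal ordering can be derived by sorting it. The `before` and `after` lists are toy data.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: each grade is divided by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same grades."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(gains: Sequence[float], threshold: float = 1) -> float:
    """1 / rank of the first item graded at or above the threshold, else 0."""
    for i, g in enumerate(gains):
        if g >= threshold:
            return 1.0 / (i + 1)
    return 0.0


def average_precision(gains: Sequence[float], threshold: float = 1) -> float:
    """Mean of precision@rank taken at each relevant hit (binary relevance)."""
    hits, total = 0, 0.0
    for i, g in enumerate(gains):
        if g >= threshold:
            hits += 1
            total += hits / (i + 1)
    return total / hits if hits else 0.0


# Grades (0-3) of the same four chunks, in retriever order and reranked order.
before = [0, 3, 1, 2]
after = [3, 2, 1, 0]

for name, gains in (("before", before), ("after", after)):
    print(name, round(ndcg_at_k(gains, 4), 3), reciprocal_rank(gains), round(average_precision(gains), 3))
```

Averaging `reciprocal_rank` and `average_precision` over queries gives MRR and MAP respectively.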
They are the wrong instruments for the ship/don't-ship decision, and this is the second trap. NDCG scores a list the way a human reads it, one position at a time. RAG does not read the list; it staples the top-k together and hands the bundle to a generator. A reranker can lift NDCG by promoting five passages that are each relevant and all say the same thing, or that quietly contradict each other, and that redundant or conflicting set can leave the answer unchanged or worse. The argument that rank-centric metrics are misaligned with RAG's set-consumption model is the formal version of an effect most teams have felt: the diversity of the set matters, and a list metric is blind to it. So the decision metric has to be end-to-end (faithfulness and correctness of the generated answer on your task), even though it costs a generator call per query and is the step teams most often skip.
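A toy illustration of that blindness, under invented labels: each chunk carries a relevance grade and a hypothetical set of facts it supplies. The fact sets are a stand-in used only to make the point visible; they are not a substitute for scoring the generated answer.

```python
import math
from typing import List, Set, Tuple

Chunk = Tuple[int, Set[str]]  # (relevance grade, facts the chunk supplies)


def dcg(gains: List[int]) -> float:
    """Rank-discounted gain over the whole list (linear gain, log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def facts_covered(chunks: List[Chunk]) -> Set[str]:
    """Union of the facts the generator would actually receive from the set."""
    return set().union(*(facts for _, facts in chunks))


# Two top-3 sets with identical grades. Labels and facts are hypothetical.
redundant: List[Chunk] = [
    (3, {"refund_window"}),
    (3, {"refund_window"}),
    (3, {"refund_window"}),
]
diverse: List[Chunk] = [
    (3, {"refund_window"}),
    (3, {"refund_method"}),
    (3, {"exceptions"}),
]

dcg_redundant = dcg([g for g, _ in redundant])
dcg_diverse = dcg([g for g, _ in diverse])

print(dcg_redundant == dcg_diverse)
print(len(facts_covered(redundant)), len(facts_covered(diverse)))
```

Both lists score identically on the list metric, yet one hands the generator a single fact three times and the other hands it three different facts.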
Three axes, not a winner
Which is why the honest output of a reranker evaluation is not a name. It is a Pareto frontier across three axes: answer quality, latency, and cost. Quality without the other two ships the $50-per-thousand-queries reranker that adds 300ms to every request. A cross-encoder scores every query-document pair, so its latency grows with the pool; an LLM listwise reranker buys a higher quality ceiling but can add seconds and real cents per query. Cost has a shape you have to model on your own traffic: Cohere meters a query of up to 100 documents as a single search unit, while Voyage prices per token, so the same pool size lands very differently on the two bills. RAG-serving work like RAGO treats the reranker as one stage inside an end-to-end latency budget for exactly this reason — it is a component in a system, not a number on a board.
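One way to produce that frontier, sketched under the assumption that you have already measured an end-to-end quality score, added latency, and cost for each configuration on your own traffic. The `Candidate` fields, the configuration names, and every number below are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float      # end-to-end answer score, higher is better
    latency_ms: float   # latency added per request, lower is better
    cost_per_1k: float  # dollars per thousand queries, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True when a is at least as good as b on all three axes and better on one."""
    no_worse = (
        a.quality >= b.quality
        and a.latency_ms <= b.latency_ms
        and a.cost_per_1k <= b.cost_per_1k
    )
    strictly_better = (
        a.quality > b.quality
        or a.latency_ms < b.latency_ms
        or a.cost_per_1k < b.cost_per_1k
    )
    return no_worse and strictly_better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Hypothetical measurements, not benchmark results.
candidates = [
    Candidate("no_rerank", 0.62, 0, 0.0),
    Candidate("small_cross_encoder", 0.71, 40, 1.0),
    Candidate("slow_mid", 0.70, 120, 2.0),
    Candidate("large_cross_encoder", 0.72, 300, 50.0),
    Candidate("llm_listwise", 0.76, 1800, 30.0),
]

frontier = pareto_frontier(candidates)
print([c.name for c in frontier])
```

Only `slow_mid` drops out here, because `small_cross_encoder` beats it on every axis. Everything left on the frontier is a defensible choice under some latency and cost budget, which is why the output is a set of trade-offs and not a single name.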
One last prerequisite that sinks evaluations before they start: you cannot compute NDCG without graded relevance labels, and most teams don't have them. Bootstrap a golden set — sample real queries, retrieve a pool, and have an LLM-as-judge grade each query-document pair on a 0–3 scale, which is most valuable for the cold-start tail with no click data. But treat the judge as a biased instrument: LLM judges tend to over-rate, so an uncalibrated golden set will flatter every reranker equally and tell you nothing. Calibrate it against a few hundred human labels first. And remember that a benchmark like BEIR measures zero-shot generalization across someone else's domains — a useful prior, never a substitute for the number on your queries.
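A rough sketch of the calibration check, assuming you have judge grades and human grades on the same 0–3 scale for the same query-document pairs. The function name `calibration_report` and the eight label pairs are illustrative; a real check would use the few hundred human labels mentioned above.

```python
from typing import Dict, Sequence


def calibration_report(judge: Sequence[int], human: Sequence[int]) -> Dict[str, float]:
    """Compare LLM-judge grades with human grades for the same pairs."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge)
    diffs = [j - h for j, h in zip(judge, human)]
    return {
        # Positive mean bias means the judge grades higher than humans do.
        "mean_bias": sum(diffs) / n,
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "over_rated": sum(d > 0 for d in diffs) / n,
        "under_rated": sum(d < 0 for d in diffs) / n,
    }


# Toy labels on a 0-3 scale for eight query-document pairs.
judge_grades = [3, 2, 3, 1, 2, 3, 0, 2]
human_grades = [2, 2, 3, 0, 1, 3, 0, 2]

report = calibration_report(judge_grades, human_grades)
print(report)
```

A clearly positive `mean_bias` with few under-rated pairs is the over-rating pattern described above. What to do about it (adjusting the judge prompt, remapping grades, or raising the relevance threshold) is a judgment call this sketch does not make.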
The reranker really is one of the cheapest large wins left in a RAG pipeline. But "cheapest large win" is a claim about your specific stack, and it's only true when your retriever's recall ceiling is high enough to be worth converting, your pool is sized to reach it, and you measured the answer instead of the list. Evaluate the ceiling, then evaluate the climb.



