Most of the heavy machinery in a RAG pipeline runs once, offline. You chunk a corpus once, embed it once, build the index once. The reranker is the exception: it is a model that runs on every single query, after retrieval has already happened, while the user waits. That one structural fact is why the reranker's architecture, not just its benchmark score, decides whether it belongs in your stack. And in the last year, reranking quietly stopped being one architecture and became three.
The job hasn't changed. Your first-stage retriever — vector search, BM25, or a hybrid of both — is tuned for recall: cast a wide net, return the top 50 to 200 candidate chunks, don't miss the right one. It is not tuned for precision, because a bi-encoder compresses each document into a vector ahead of time and never sees your query beside that document. The reranker fixes exactly that, reading the query and each candidate together. What changed is that there are now three fundamentally different ways to do the reading, and they make the same trade — accuracy for latency — in three different places.
Pointwise cross-encoders: the fast default
The classic design, and still the workhorse. A model with a classification head reads one query-document pair and emits a score in a single forward pass. The commodity options are bge-reranker-v2-m3 (568M params, Apache 2.0, multilingual) and Mixedbread's mxbai-rerank-v2 (0.5B and 1.5B, Apache 2.0, reinforcement-tuned); ZeroEntropy's zerank models are cross-encoders too, LoRA-tuned on a Qwen3 backbone.
The virtue here is mechanical, not glamorous: one pass per pair, fully parallel, trivially batched. As long as the batch fits in GPU memory, scoring 200 candidates takes far less than 200 times as long as scoring one, because the GPU processes the pairs side by side. For something sitting on the critical path, "boring and parallel" is the highest praise there is.
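The pointwise shape described above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: `score_pairs` is a hypothetical callable standing in for one batched forward pass of a cross-encoder, and `toy_overlap_scorer` is a token-overlap stand-in that exists only so the sketch runs without a model.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]
# Hypothetical interface: takes a batch of (query, document) pairs and
# returns one relevance score per pair, like one batched forward pass.
ScoreFn = Callable[[Sequence[Pair]], List[float]]


def toy_overlap_scorer(pairs: Sequence[Pair]) -> List[float]:
    """Stand-in for a real cross-encoder: fraction of query tokens found in the doc."""
    scores = []
    for query, doc in pairs:
        q_tokens = set(query.lower().split())
        d_tokens = set(doc.lower().split())
        scores.append(len(q_tokens & d_tokens) / max(len(q_tokens), 1))
    return scores


def rerank_pointwise(query: str, docs: List[str], score_pairs: ScoreFn,
                     batch_size: int = 32, top_k: int = 5):
    """Score every (query, doc) pair independently, then sort by score."""
    pairs = [(query, doc) for doc in docs]
    scores: List[float] = []
    # Each pair is independent, so batches can be any size the hardware allows.
    for start in range(0, len(pairs), batch_size):
        scores.extend(score_pairs(pairs[start:start + batch_size]))
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


candidates = [
    "BM25 is a lexical ranking function",
    "a reranker scores each query document pair",
    "vector search returns approximate neighbours",
]
top = rerank_pointwise("how does a reranker score a pair", candidates,
                       toy_overlap_scorer, top_k=2)
```

Because no pair depends on any other, the only tuning knob in this shape is `batch_size`, which is why the design batches so easily.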
Generative rerankers: the prestige tier
The new prestige option inverts the design. Qwen3-Reranker (0.6B / 4B / 8B, Apache 2.0, released mid-2025) is not a classifier at all. It is an LLM prompted to answer "is this document relevant: yes or no?", and the relevance score is the probability the model assigns to the "yes" token. That framing is what pushes it to the top of the open reranking charts.
It is also what makes it slow. The model carries the full autoregressive-decoder machinery to produce what amounts to a single bit of information, and independent testing repeatedly clocks the 4B model at over a second per query — several times the latency of a cross-encoder for a handful of benchmark points. You are renting a language model's entire reasoning stack to get a yes/no it then has to decode.
The generative reranker wins the leaderboard the same way it loses the deployment: by spending more compute per document. On a benchmark that's an advantage. On your critical path it's the bill.
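To make the "probability of the yes token" score concrete, here is a minimal sketch of the last step only. It assumes you already have the model's next-token logits for the "yes" and "no" tokens, and it normalizes over just those two, which is one common way to do it and is an assumption here rather than a documented requirement. The logit values are made up for illustration.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax over just the two answer tokens, returning P("yes")."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    exp_yes = math.exp(logit_yes - m)
    exp_no = math.exp(logit_no - m)
    return exp_yes / (exp_yes + exp_no)


# Hypothetical (yes, no) next-token logits for three candidate documents.
logits = {"doc_a": (4.0, 1.0), "doc_b": (0.5, 0.5), "doc_c": (-2.0, 3.0)}
scores = {doc: yes_probability(y, n) for doc, (y, n) in logits.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The arithmetic is trivial. The cost is everything upstream of it: a full LLM forward pass per candidate to obtain those two logits.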
Listwise rerankers: the genuinely new idea
The third family is the most interesting and the least understood. A listwise reranker doesn't score documents one at a time at all. jina-reranker-v3 (0.6B, late 2025) attends over the entire candidate set in one context window and ranks the documents against each other. Jina calls it "last but not late" interaction — richer than a late-interaction model like ColBERT, which encodes each document independently and matches vectors afterward, because here a document is scored knowing what it's competing with. At 0.6B it posts numbers that embarrass pointwise models several times its size.
The catch is structural: because it reasons over the set, it processes candidates in bounded batches (jina-v3 tops out around 64 documents at a time) rather than as embarrassingly parallel pairs. Cross-document context is real signal — sometimes the only way to tell two near-identical passages apart is to see them side by side — but you pay for it in how the work batches.
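One way to live with a bounded window is a tournament: rank each window jointly, keep the best few from each, then rank the finalists together. The sketch below is an assumption about how you might handle it, not Jina's documented method. `rank_window` is a hypothetical callable standing in for one listwise pass, and `toy_listwise` is a token-overlap stand-in so the code runs without a model.

```python
from typing import Callable, List

# Hypothetical interface: takes a query and at most `window` documents and
# returns the same documents reordered best-first, in one joint pass.
ListwiseFn = Callable[[str, List[str]], List[str]]


def rerank_in_windows(query: str, docs: List[str], rank_window: ListwiseFn,
                      window: int = 64, top_k: int = 10) -> List[str]:
    """Rank each window jointly, then rank the per-window finalists together."""
    if top_k >= window:
        raise ValueError("top_k must be smaller than the window size")
    if len(docs) <= window:
        return rank_window(query, docs)[:top_k]
    finalists: List[str] = []
    for start in range(0, len(docs), window):
        finalists.extend(rank_window(query, docs[start:start + window])[:top_k])
    # Finalists may still exceed one window; recurse until they fit.
    return rerank_in_windows(query, finalists, rank_window, window, top_k)


def toy_listwise(query: str, docs: List[str]) -> List[str]:
    """Stand-in ranker: orders documents by how many query tokens they share."""
    q_tokens = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q_tokens & set(d.lower().split())),
                  reverse=True)


docs = ["x", "a b", "a", "y", "a b c"]
best = rerank_in_windows("a b c", docs, toy_listwise, window=2, top_k=1)
```

The trade-off is visible in the loop: windows run one after another, and a document only ever competes with the others in its own window, so some of the cross-document signal is lost once the candidate set outgrows a single pass.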
The leaderboard is optimizing against you
Here is the part worth slowing down on. Rank these architectures on BEIR, MTEB-R, or the head-to-head Elo boards, and the generative and large models lead. Now remember where the reranker lives: in the request, after you've paid for retrieval, scoring dozens of candidates before the LLM can even start. Its true cost is latency × candidate count, on the user's clock.
So the top of a reranker leaderboard is close to an anti-recommendation for a latency-sensitive product. Those models win the metric by spending more per pair — bigger backbones, autoregressive decoding, joint attention — which is precisely the resource a hot-path component cannot spend. The benchmark answers "which architecture ranks best with unlimited time?" You are asking "which ranks well enough inside my latency budget at top-50?" Those have different winners, and only the second one is yours.
It's worth being precise about why the gap exists, because it isn't that the leaderboard is wrong. It's measuring offline quality honestly. It simply isn't measuring the axis your users feel. A reranker that adds 80ms and recovers one buried-but-relevant chunk per query is a triumph; the same quality at 1.2 seconds is a regression you'll rip out by Friday.
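The difference between the two questions fits in a few lines of arithmetic. The sketch below picks the best-quality option that fits a latency budget at a given candidate count, using a simple sequential-batch cost model. Every number in it is a placeholder for illustration, not a measurement of any real model, and the `Candidate` fields are assumptions about what you would measure yourself.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    quality: float       # your own eval metric, e.g. nDCG@10 on your data
    ms_per_batch: float  # measured latency of one forward batch
    batch_size: int      # documents scored per batch


def added_latency_ms(c: Candidate, n_docs: int) -> float:
    """Sequential-batch estimate: number of batches times per-batch latency."""
    return math.ceil(n_docs / c.batch_size) * c.ms_per_batch


def pick(cands: List[Candidate], n_docs: int, budget_ms: float) -> Optional[Candidate]:
    """Best quality among the options that fit the budget, or None if none fit."""
    feasible = [c for c in cands if added_latency_ms(c, n_docs) <= budget_ms]
    return max(feasible, key=lambda c: c.quality, default=None)


# Placeholder numbers for illustration only; not measurements of any real model.
options = [
    Candidate("small_pointwise", quality=0.70, ms_per_batch=40.0, batch_size=64),
    Candidate("listwise", quality=0.74, ms_per_batch=90.0, batch_size=64),
    Candidate("generative", quality=0.76, ms_per_batch=300.0, batch_size=16),
]
by_quality = max(options, key=lambda c: c.quality)      # the leaderboard's answer
by_budget = pick(options, n_docs=50, budget_ms=100.0)   # your answer
```

With these made-up inputs the two selections disagree, which is the whole point: the ranking by quality alone and the ranking under a budget are different orderings of the same models.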
How to actually choose
Choose the architecture before the model, and choose it by latency position. Start with a fast pointwise cross-encoder — an Apache-2.0 model you can host, like bge-reranker-v2-m3 or mxbai-rerank-v2 — wired into your real retrieval pipeline, and measure both the accuracy lift and the added latency on your data at your candidate count. That baseline clears the bar far more often than the discourse suggests.
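A baseline measurement of that kind can be very small. The sketch below is a minimal harness, assuming a hypothetical `rerank(query, candidates)` callable and an eval set of (query, first-stage candidates in retrieval order, the one known-relevant chunk). It reports mean reciprocal rank before and after reranking plus a rough p95 of the added latency. The toy reranker and the two examples exist only so the code runs.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Tuple

Ranker = Callable[[str, List[str]], List[str]]  # returns docs reordered best-first
Example = Tuple[str, List[str], str]            # (query, candidates, relevant doc)


def reciprocal_rank(ranked: List[str], relevant: str) -> float:
    """1/rank of the relevant doc, or 0.0 if it is missing."""
    return 1.0 / (ranked.index(relevant) + 1) if relevant in ranked else 0.0


def measure(eval_set: List[Example], rerank: Ranker) -> Dict[str, float]:
    before, after, latencies_ms = [], [], []
    for query, candidates, relevant in eval_set:
        before.append(reciprocal_rank(candidates, relevant))
        start = time.perf_counter()
        reranked = rerank(query, candidates)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        after.append(reciprocal_rank(reranked, relevant))
    latencies_ms.sort()
    # Rough p95: index into the sorted latencies, clamped to the last element.
    p95 = latencies_ms[min(len(latencies_ms) - 1, int(0.95 * len(latencies_ms)))]
    return {"mrr_before": mean(before), "mrr_after": mean(after),
            "mrr_lift": mean(after) - mean(before), "p95_ms": p95}


def toy_rerank(query: str, docs: List[str]) -> List[str]:
    """Stand-in reranker: orders documents by shared query tokens."""
    q_tokens = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q_tokens & set(d.lower().split())),
                  reverse=True)


eval_set = [
    ("why do dogs bark", ["cats sleep a lot", "dogs bark loudly"], "dogs bark loudly"),
    ("how long do cats sleep", ["cats sleep a lot", "fish swim"], "cats sleep a lot"),
]
report = measure(eval_set, toy_rerank)
```

Swapping `toy_rerank` for a real model call, and the toy examples for queries from your own logs at your real candidate count, gives you both numbers the decision needs: the lift and what it costs on the user's clock.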
Escalate deliberately. Reach for a listwise model when you can show, on your own eval, that cross-document context separates results a pointwise scorer confuses — and when your candidate sets are small enough that batch limits don't bite. Reach for a generative reranker only when reranking quality is genuinely decisive and your queries-per-second and latency headroom can absorb the autoregressive cost, or when the reranking happens off the request path entirely. Which specific model wins inside each tier — and whether to self-host or call a managed API like Cohere Rerank or Voyage — is a real question, but it's a separate one, downstream of this: the reranker doesn't run in the benchmark harness. It runs in front of your user, once per query, forever.



