Most teams tuning a retrieval pipeline reach for the dramatic levers first — a fancier embedding model, a bigger chunk strategy, a graph on top. The lever that usually pays out fastest is the dull one almost nobody starts with: a reranker. It's the cheapest large accuracy win left in RAG, and the reason is structural. A reranker is stateless and additive. You don't re-embed your corpus, you don't rebuild your index, you don't migrate a vector store. You insert one re-scoring step between retrieval and the LLM, and a meaningful chunk of the relevance your vector search left on the floor comes back.
Why the step works at all
First-stage retrieval — the vector database lookup over your embeddings — is fast because it cheats. It compresses the query into one vector and the chunk into another and compares them in isolation. That's what makes it scale to millions of documents in milliseconds, and it's also why it ranks "close-but-wrong" chunks above "exactly-right-but-phrased-differently" ones.
A reranker is a cross-encoder: it reads the query and a candidate chunk together, in one forward pass, and scores how well that specific chunk answers that specific query. That joint attention is exactly what the bi-encoder retriever threw away for speed. It's far too slow to run over your whole corpus — but you don't. You run it over the ~50 candidates the retriever already shortlisted, and keep the top 3–5. Cheap stage casts a wide net; expensive stage picks the keepers. That two-stage shape is the whole game, and it's why reranking buys most of the accuracy for a fraction of the compute.
The reranker isn't competing with your retriever. It's cleaning up after it — re-reading the shortlist with attention the fast stage couldn't afford.
The contenders, and what they actually trade
Cohere Rerank is the lowest-friction path. It's a hosted API — no GPU, no weights, no ops. The current rerank-v3.5 covers 100-plus languages with a 4,096-token context and is the default "I just want it to work" choice. You pay per call and you send your text to a third party; in exchange you operate nothing. For a team without GPU infrastructure, that trade is often correct, and the integration is an afternoon.
BGE-reranker-v2-m3 is the open answer, and it's the one that punctures the leaderboard mindset. It's a lightweight cross-encoder with strong multilingual coverage, openly licensed, and on a GPU it can match a hosted API's latency — at zero marginal cost per query and with your data never leaving your boundary. If you already run a GPU, the honest question isn't "is the top-of-leaderboard model 2% better," it's "is 2% worth a per-call bill and a data-egress story forever." Usually it isn't.
Jina Reranker is excellent and contains the sharpest trap in the category: licensing. The jina-reranker-v2-base-multilingual weights ship under CC-BY-NC-4.0 — non-commercial. You can self-host them to evaluate, but deploying those weights in a commercial product is a license violation; commercial use routes you to Jina's hosted API or a paid arrangement. (Jina has since released a v3.) This is the line item that never appears in a benchmark table and absolutely belongs in your decision: a model can be both genuinely good and the wrong thing to bolt into your product, on license alone.
Don't pick the model, pick the harness
Here's the move that makes the whole decision reversible: don't hardcode a vendor, adopt a unified interface. The rerankers library wraps Cohere, Jina, BGE, ColBERT and the rest behind one small API, so swapping the model under your pipeline is a one-line change instead of a refactor. That matters because the right reranker is the one that lifts your eval set, and you can't know which that is from someone else's leaderboard — corpus, language mix, and query style decide it. Wire it through a harness, run all three over your own retrieval traces, and let your numbers choose.
The decision, in one line
Stop ranking rerankers by a public ELO and rank them by your two real constraints. No GPU and you want it working today → Cohere Rerank. You run a GPU and want zero marginal cost with full data control → BGE-reranker-v2-m3. Either way, put it behind rerankers so the choice stays cheap to revisit — and read the license before you ship, because the best-scoring open model in this space is the one you're not allowed to self-host commercially. The reranker is the highest-leverage step in RAG precisely because it's so easy to add; the only way to get it wrong is to choose it for a reason that has nothing to do with your data.



