Frequently asked

What does a reranker do in a RAG pipeline?

After your vector search returns the top 50 candidate chunks for a query, a reranker (a cross-encoder) re-scores each chunk by reading the query and the chunk *together*, then keeps the best 3–5 to send to the LLM. The first-stage retriever is fast but approximate; the reranker is slow but precise, so running it only over the small candidate set buys most of the accuracy gain for little cost.

Is a reranker worth adding to RAG?

For most pipelines, yes — it's usually the single highest-ROI change because it's stateless and additive: you don't re-embed your corpus or rebuild your index, you just insert a re-scoring step before the LLM. Measure the lift on your own eval set, but reranking commonly recovers relevant chunks that pure vector similarity buried below the cutoff.

Should I use Cohere Rerank or an open-source reranker like BGE?

Use Cohere Rerank (a hosted API) when you want the lowest-friction managed path and broad multilingual coverage with no GPU to operate. Use an open model like BGE-reranker-v2-m3 when you can run a GPU and want zero marginal cost per query plus full data control — on a GPU it can match hosted-API latency. The deciding factors are your latency budget at the candidate-set size, the license, and the lift measured on your corpus.

The Best Reranker for RAG in 2026: Cohere vs Jina vs BGE

Most teams tuning a retrieval pipeline reach for the dramatic levers first: a fancier embedding model, a new chunking strategy, a graph on top. The lever that usually pays out fastest is the dull one almost nobody starts with: a reranker. It is often the cheapest large accuracy win left in RAG, and the reason is structural. A reranker is stateless and additive. You don't re-embed your corpus, you don't rebuild your index, you don't migrate a vector store. You insert one re-scoring step between retrieval and the LLM, and a meaningful share of the relevance your vector search left on the floor comes back.

Why the step works at all

First-stage retrieval, the vector database lookup over your embeddings, is fast because it cheats. It compresses the query into one vector and each chunk into another, and the two are only ever compared in isolation: chunk vectors are computed ahead of time, before any query exists. That's what makes it scale to millions of documents in milliseconds, and it's also why it can rank "close-but-wrong" chunks above "exactly-right-but-phrased-differently" ones.
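To make "compared in isolation" concrete, here is a minimal sketch of bi-encoder-style scoring. The `embed` function is a toy bag-of-words stand-in, not a real embedding model, and the two chunks are invented for illustration. A real model is far more semantic than this, but the structural limit is the same: the chunk vector is built without ever seeing the query.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


chunks = [
    # Close but wrong: shares the word "password", answers a different question.
    "Our password policy requires a password change every 90 days.",
    # Exactly right, phrased differently: no words in common with the query.
    "To regain access to your account, use the Forgot Credentials link on the sign-in page.",
]

# Chunk vectors are computed once, offline, with no knowledge of any query.
chunk_vectors = [embed(c) for c in chunks]

# At query time only the query is embedded; scoring is one cheap comparison per chunk.
query = "how do I reset my password"
query_vector = embed(query)
scores = [cosine(query_vector, v) for v in chunk_vectors]
ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)

print(ranked)  # chunk indices, best first
```

In this toy setup the policy chunk outranks the one that actually answers the question, purely because of surface overlap.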

A reranker is a cross-encoder: it reads the query and a candidate chunk together, in one forward pass, and scores how well that specific chunk answers that specific query. That joint attention is exactly what the bi-encoder retriever threw away for speed. A cross-encoder is far too slow to run over your whole corpus, but it never has to. You run it over the ~50 candidates the retriever already shortlisted, and keep the top 3–5. The cheap stage casts a wide net; the expensive stage picks the keepers. That two-stage shape is the whole game, and it's why reranking buys most of the accuracy for a fraction of the compute.
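The two-stage shape can be sketched in a few lines. Everything named here is hypothetical: `word_overlap` stands in for the vector lookup, `stand_in_cross_encoder` uses hand-assigned scores in place of a real model, and the corpus is invented. The point is only where each scorer runs, not how good either one is.

```python
from typing import Callable, Sequence

ScoreFn = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: Sequence[str],
    cheap_score: ScoreFn,
    precise_score: ScoreFn,
    n_candidates: int = 50,
    n_keep: int = 5,
) -> list[str]:
    """Shortlist with the cheap scorer, then reorder only the shortlist."""
    # Stage 1: wide net. A real system would query a vector index here
    # instead of sorting the whole corpus.
    shortlist = sorted(corpus, key=lambda c: cheap_score(query, c), reverse=True)
    shortlist = shortlist[:n_candidates]
    # Stage 2: the expensive scorer runs at most n_candidates times.
    reranked = sorted(shortlist, key=lambda c: precise_score(query, c), reverse=True)
    return reranked[:n_keep]


def word_overlap(query: str, chunk: str) -> float:
    """Toy first-stage scorer: number of shared words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


ROUTER = "reset my router password with the button on the back"
POLICY = "the password policy requires a new password every 90 days"
RECOVER = "recover access to my account with the forgot credentials link"
REVENUE = "quarterly revenue grew by four percent"

# Hand-assigned relevance standing in for a cross-encoder's judgment.
HAND_SCORES = {RECOVER: 0.9, POLICY: 0.3, ROUTER: 0.2, REVENUE: 0.0}


def stand_in_cross_encoder(query: str, chunk: str) -> float:
    return HAND_SCORES[chunk]


result = retrieve_then_rerank(
    "how do i reset my password",
    [ROUTER, POLICY, RECOVER, REVENUE],
    cheap_score=word_overlap,
    precise_score=stand_in_cross_encoder,
    n_candidates=3,
    n_keep=2,
)
print(result)
```

The router chunk wins the first stage on word overlap, and the second stage demotes it. Note that a chunk can only be rescued if it made the shortlist, which is why the first-stage cutoff is kept wide.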

The reranker isn't competing with your retriever. It's cleaning up after it — re-reading the shortlist with attention the fast stage couldn't afford.

The contenders, and what they actually trade

Cohere Rerank is the lowest-friction path. It's a hosted API — no GPU, no weights, no ops. The current rerank-v3.5 covers 100-plus languages with a 4,096-token context and is the default "I just want it to work" choice. You pay per call and you send your text to a third party; in exchange you operate nothing. For a team without GPU infrastructure, that trade is often correct, and the integration is an afternoon.
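For a sense of how small the integration is, here is a sketch of a wrapper around the Cohere call. It assumes the Python SDK's `ClientV2.rerank` method and its `results[i].index` / `results[i].relevance_score` response fields as I understand them; check the current API reference before relying on the exact names. The `rerank_with_cohere` helper itself is hypothetical, and the client is passed in so the function can be exercised without a network call.

```python
from typing import Any


def rerank_with_cohere(
    client: Any,
    query: str,
    chunks: list[str],
    top_n: int = 3,
    model: str = "rerank-v3.5",
) -> list[tuple[str, float]]:
    """Return the top_n chunks with their relevance scores, best first."""
    response = client.rerank(model=model, query=query, documents=chunks, top_n=top_n)
    # Each result points back at the input list by index and carries a score.
    return [(chunks[r.index], r.relevance_score) for r in response.results]


# Usage, assuming the `cohere` package is installed and an API key is
# available in the environment (CO_API_KEY, to my knowledge):
#
#   import cohere
#   client = cohere.ClientV2()
#   top = rerank_with_cohere(client, "how do I reset my password", candidates)
```

Keeping the vendor call inside one small function is also what makes it cheap to swap out later.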

FlagOpen/FlagEmbedding (Python, ★ 11.8k)

The BGE family of open retrieval models, including bge-reranker-v2-m3, a lightweight, strongly multilingual cross-encoder that's become the default open self-host reranker.

BGE-reranker-v2-m3 is the open answer, and it's the one that punctures the leaderboard mindset. It's a lightweight cross-encoder with strong multilingual coverage, openly licensed, and on a GPU it can match a hosted API's latency — at zero marginal cost per query and with your data never leaving your boundary. If you already run a GPU, the honest question isn't "is the top-of-leaderboard model 2% better," it's "is 2% worth a per-call bill and a data-egress story forever." Usually it isn't.
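The self-hosted path is about as short. This sketch assumes the `FlagReranker` class from the FlagEmbedding package and its `compute_score(pairs, normalize=True)` method, as shown on the model card; the `rerank_with_bge` helper is hypothetical, and the reranker is passed in so the model is loaded once, outside the hot path.

```python
from typing import Any


def rerank_with_bge(
    reranker: Any, query: str, chunks: list[str], top_n: int = 3
) -> list[tuple[str, float]]:
    """Score every (query, chunk) pair jointly and return the best top_n."""
    pairs = [[query, chunk] for chunk in chunks]
    # normalize=True maps raw scores into the 0..1 range.
    scores = reranker.compute_score(pairs, normalize=True)
    # Defensive: in case a single pair yields a bare number rather than a list.
    if isinstance(scores, (int, float)):
        scores = [scores]
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [(chunks[i], float(scores[i])) for i in order[:top_n]]


# Usage, assuming FlagEmbedding is installed and a GPU is available:
#
#   from FlagEmbedding import FlagReranker
#   reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)
#   top = rerank_with_bge(reranker, "how do I reset my password", candidates)
```

Nothing in this path leaves your machine, which is the data-control half of the trade.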

Jina Reranker is excellent and contains the sharpest trap in the category: licensing. The jina-reranker-v2-base-multilingual weights ship under CC-BY-NC-4.0 — non-commercial. You can self-host them to evaluate, but deploying those weights in a commercial product is a license violation; commercial use routes you to Jina's hosted API or a paid arrangement. (Jina has since released a v3.) This is the line item that never appears in a benchmark table and absolutely belongs in your decision: a model can be both genuinely good and the wrong thing to bolt into your product, on license alone.

Don't pick the model, pick the harness

AnswerDotAI/rerankers (Python, ★ 1.6k)

A lightweight, low-dependency unified API across reranking and cross-encoder models (Cohere, Jina, BGE, ColBERT, and more) behind one interface.

Here's the move that makes the whole decision reversible: don't hardcode a vendor, adopt a unified interface. The rerankers library wraps Cohere, Jina, BGE, ColBERT and the rest behind one small API, so swapping the model under your pipeline is a one-line change instead of a refactor. That matters because the right reranker is the one that lifts your eval set, and you can't know which that is from someone else's leaderboard — corpus, language mix, and query style decide it. Wire it through a harness, run all three over your own retrieval traces, and let your numbers choose.
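"Let your numbers choose" can be as simple as the sketch below. It does not use the rerankers library's own API; it assumes a hypothetical interface where a reranker is any callable taking `(query, chunks)` and returning the chunks reordered best-first, which a thin adapter around any backend could satisfy. The trace format and the two toy rerankers are invented for illustration.

```python
import statistics
import time
from typing import Callable, Sequence

# Hypothetical interface: (query, candidate chunks) -> chunks, best first.
RerankFn = Callable[[str, Sequence[str]], list[str]]


def evaluate(rerank: RerankFn, traces: Sequence[dict], k: int = 3) -> dict:
    """Hit rate@k, MRR, and median latency over saved retrieval traces.

    Each trace is {"query": str, "candidates": [str, ...], "relevant": {str, ...}}.
    """
    if not traces:
        raise ValueError("need at least one trace")
    hits, reciprocal_ranks, latencies = [], [], []
    for trace in traces:
        start = time.perf_counter()
        ordered = rerank(trace["query"], trace["candidates"])
        latencies.append(time.perf_counter() - start)
        # 1-based rank of the first relevant chunk, or None if it never appears.
        first = next(
            (i for i, c in enumerate(ordered, 1) if c in trace["relevant"]), None
        )
        hits.append(first is not None and first <= k)
        reciprocal_ranks.append(1.0 / first if first else 0.0)
    return {
        "hit_rate@k": sum(hits) / len(traces),
        "mrr": sum(reciprocal_ranks) / len(traces),
        "median_latency_s": statistics.median(latencies),
    }


traces = [
    {"query": "q1", "candidates": ["a", "b", "c", "d"], "relevant": {"d"}},
    {"query": "q2", "candidates": ["e", "f", "g", "h"], "relevant": {"g"}},
]

# Toy candidates: the retriever's own order as the baseline, and its reverse.
candidates = {
    "retriever_order": lambda q, chunks: list(chunks),
    "reversed": lambda q, chunks: list(reversed(chunks)),
}

report = {name: evaluate(fn, traces, k=3) for name, fn in candidates.items()}
for name, metrics in report.items():
    print(name, metrics)
```

Always include the no-reranker baseline in the comparison, so the lift is measured rather than assumed, and run the latency column at your real candidate-set size.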

The decision, in one line

Stop ranking rerankers by a public Elo score and rank them by your two real constraints. No GPU and you want it working today → Cohere Rerank. You run a GPU and want zero marginal cost with full data control → BGE-reranker-v2-m3. Either way, put it behind rerankers so the choice stays cheap to revisit, and read the license before you ship: Jina's open weights may score well on your data and still be off-limits for commercial self-hosting. The reranker is the highest-leverage step in RAG precisely because it's so easy to add; the only way to get it wrong is to choose it for a reason that has nothing to do with your data.

Dex Mareno
