A user types ERR_NGX_502 into your support search. Your beautifully embedded, state-of-the-art vector retriever returns six articles about gateway timeouts, load balancing philosophy, and one about a completely different error. It does not return the one runbook that contains the literal string ERR_NGX_502, because the embedding model was never trained on that token and quietly mapped it to the nearest fuzzy concept it knew.
This is the failure mode nobody warns you about when they sell you on semantic search. Dense retrieval is extraordinary at meaning and bad at strings — product SKUs, function names, error codes, dosage numbers, case citations, the rare jargon that is often the entire point of the query.
What BM25 actually does
BM25 is a lexical ranking function: it scores documents by which query terms they contain, weighting each term by how rare it is across the corpus (inverse document frequency), letting repeated occurrences count with diminishing returns, and normalizing for document length. It does not understand anything. It does not need to. If ERR_NGX_502 appears in exactly one document, BM25 finds that document with surgical confidence, because the token is rare and the match is exact.
That is also its ceiling. Ask BM25 for "why is my reverse proxy failing" when the doc says "gateway error" and it shrugs — no token overlap, no match.
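Both behaviors, the exact hit and the shrug, show up in a few lines of code. What follows is a minimal sketch of the standard Okapi BM25 formula, not any particular engine's implementation: the three documents are invented for illustration, the tokenizer is a plain whitespace split, and k1=1.5, b=0.75 are common defaults assumed here rather than values from this article.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no token overlap, no contribution
            # Rare terms get a larger weight (the always-positive IDF variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 makes term frequency saturate; b normalizes for document length.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Invented toy corpus, split on whitespace (an assumption, not a real tokenizer).
docs = [
    "gateway error when the upstream server times out".split(),
    "runbook for ERR_NGX_502 restart the upstream pool".split(),
    "load balancing philosophy for reverse proxy fleets".split(),
]

exact_scores = bm25_scores(["ERR_NGX_502"], docs)
paraphrase_scores = bm25_scores("why is my reverse proxy failing".split(), docs)
print(exact_scores)
print(paraphrase_scores)
```

On this toy corpus the exact-code query scores only the runbook. The paraphrase query gives the "gateway error" document a flat zero and instead rewards the load-balancing article, purely because it happens to share the tokens "reverse" and "proxy".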
What dense retrieval actually does
A dense retriever embeds your query and your documents into the same vector space and returns nearest neighbors by cosine or dot product. It catches paraphrase, synonymy, and intent: "feline" finds "cat," "reverse proxy failing" finds "gateway error." But anything outside the embedding model's effective vocabulary — novel identifiers, exact codes — gets smeared into approximate neighbors. The strengths and weaknesses are almost exactly inverted from BM25's.
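The nearest-neighbor step itself is small. This sketch ranks documents by cosine similarity using hand-written three-dimensional vectors; they are stand-ins invented for illustration, not output from any real embedding model, and the document names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors written by hand (assumption): a real model would produce
# hundreds of dimensions, and nothing here reflects actual model behavior.
doc_vectors = {
    "gateway error troubleshooting": [0.9, 0.1, 0.0],
    "ERR_NGX_502 runbook": [0.6, 0.6, 0.1],
    "cat care basics": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stand-in for "reverse proxy failing"

ranked = sorted(
    doc_vectors,
    key=lambda name: cosine(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked)
```

Nothing in this ranking looks at tokens. The query lands next to "gateway error" because the vectors point the same way, which is the strength, and an identifier the model cannot place gets whatever vector it gets, which is the weakness.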
That inversion is the whole argument for hybrid. The empirical finding repeated across the BEIR benchmark and MS MARCO is that BM25 and dense retrievers have complementary recall: documents one misses, the other often catches. You are not combining them because two searches sound thorough. You are combining them because they fail in different places.
Hybrid search isn't about running two retrievers. It's about reconciling two scoreboards that were never designed to be read together.
The hard part is the scoreboard, not the search
Here is the non-obvious thing, and it is where most homegrown hybrid implementations go wrong. Running BM25 and a vector search in parallel is trivial. Merging their results is not, because their scores are not on the same scale and were never meant to be.
Pinecone's own documentation spells out the mismatch: dense vectors scored by dot product land roughly in [-1, 1] (assuming unit-normalized embeddings), while BM25-style sparse weights are unbounded positive numbers that grow with term frequency and with the number of matching query terms. Naively add a dense score of 0.83 to a BM25 score of 19.4 and the keyword side silently dominates every result. The number on one scoreboard means something completely different from the number on the other.
There are two ways out.
Score-based fusion normalizes both score sets onto a common range and combines them — a convex combination α · dense + (1−α) · sparse, as Pinecone exposes, or a min-max normalization plus weighted mean, as OpenSearch's normalization processor does. Weaviate calls its version relativeScoreFusion and made it the default in v1.24, on the reasoning that it preserves more information than rank-only methods because it keeps the magnitude of each score, not just the order. OpenSearch's benchmarks on BEIR and Amazon ESCI found min-max normalization with an arithmetic mean gave the best results among the combinations it tested. Score fusion can be excellent — but it lives or dies on getting that normalization right for your score distributions, which shift with corpus and query.
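To make the score-based route concrete, here is a minimal sketch of min-max normalization followed by a convex combination. It illustrates the general technique only and does not reproduce Pinecone's, OpenSearch's, or Weaviate's implementation; the function names, the document ids, and the scores are all invented for the example.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping onto [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Degenerate case: every score is identical, so nothing to separate.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def convex_fusion(dense, sparse, alpha=0.5):
    """alpha * dense + (1 - alpha) * sparse, computed on normalized scores."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    docs = set(dense_n) | set(sparse_n)
    # A document missing from one list contributes 0 from that side (assumption).
    return {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in docs
    }

# Invented scores: cosine-like on the dense side, BM25-like on the sparse side.
dense = {"a": 0.83, "b": 0.79, "c": 0.41}
sparse = {"b": 19.4, "c": 7.2, "d": 3.1}

naive = {d: dense.get(d, 0.0) + sparse.get(d, 0.0) for d in set(dense) | set(sparse)}
naive_order = sorted(naive, key=naive.get, reverse=True)

fused = convex_fusion(dense, sparse, alpha=0.5)
fused_order = sorted(fused, key=fused.get, reverse=True)
print(naive_order)
print(fused_order)
```

With the raw sum, document "a", the top dense hit, falls to last place behind everything the keyword side returned. After normalization it sits second, behind the one document both retrievers liked. Note that min-max depends on the extremes of each result set, which is one reason the right normalization shifts with corpus and query.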
Rank-based fusion refuses to play the normalization game at all. This is Reciprocal Rank Fusion (RRF), from Cormack, Clarke, and Büttcher's 2009 SIGIR paper. RRF throws the scores away and keeps only the ranks. Each document gets a fused score:
RRF(d) = Σ_i 1 / (k + rank_i(d))
summed over each ranked list the document appears in, where rank_i(d) is its position in list i and k is a constant the paper set to 60. A document ranked #1 in both lists scores 1/61 + 1/61 ≈ 0.033; a document ranked #1 in one list and #50 in the other scores 1/61 + 1/110 ≈ 0.025, still roughly three-quarters of the maximum. The k constant dampens the influence of any single list's top results so one runaway retriever can't hijack the fusion.
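The formula translates almost directly into code. This is a minimal sketch of RRF over plain Python lists, with k=60 as in the paper; the function name and the document ids are hypothetical, and ranks are 1-based to match the arithmetic above.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top document contributes 1 / (k + 1).
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] += 1.0 / (k + rank)
    # Highest fused score first; raw retriever scores are never consulted.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Invented rankings from two retrievers over the same three documents.
bm25_ranking = ["runbook", "timeouts", "lb-philosophy"]
dense_ranking = ["timeouts", "lb-philosophy", "runbook"]

fused_ranking = rrf([bm25_ranking, dense_ranking])
for doc, score in fused_ranking:
    print(f"{doc}: {score:.5f}")
```

The only inputs are orderings. Whether the BM25 side scored its top hit 19.4 or 1,940 makes no difference to the fused result.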
Because RRF never touches the raw scores, the incompatible-scale problem evaporates by construction. A dot product in [-1, 1] and an unbounded BM25 weight both become ordinal positions, which are directly comparable. That is why RRF, a literal one-line formula from 2009, became the de facto default for hybrid search across the industry. Qdrant ships it built in as Fusion.RRF; Elasticsearch's RRF retriever defaults rank_constant to 60; Weaviate's rankedFusion is the same idea. The 60 has held up through more than a decade of use, and values of k in roughly the 40–80 range are commonly reported to perform about the same.
The trade is real: RRF discards information. It cannot tell a document that barely edged into #2 from one that crushed it. When your retrievers produce well-calibrated, trustworthy scores, normalized score fusion can do better. When they don't — which is most of the time, across heterogeneous queries — rank fusion's refusal to trust the numbers is exactly what makes it robust.
Where a reranker fits
Fusion gives you a good candidate set, cheaply. A reranker — typically a cross-encoder that reads the query and each document together — gives you a good ordering of that set, expensively. The standard pipeline is: hybrid-retrieve a few hundred candidates, fuse, then rerank the top 50–100 before they hit your LLM. Skip the reranker when latency is tight and fused results are already clean; add it when the right answer keeps landing at position 12 instead of position 2.
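The pipeline above can be sketched with the retrievers and the reranker passed in as plain callables. Everything here is hypothetical scaffolding: hybrid_search, its parameters, the two stub retrievers, and the lookup-table "reranker" are invented for illustration, and a real cross-encoder would replace the stub with a model call.

```python
def hybrid_search(query, retrievers, rerank_score,
                  fetch_k=200, rerank_n=50, top_k=5, rrf_k=60):
    """Retrieve from each source, fuse with RRF, rerank the head of the list."""
    fused = {}
    for retrieve in retrievers:
        # Each retriever returns doc ids, best first, at most fetch_k of them.
        for rank, doc in enumerate(retrieve(query, fetch_k), start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (rrf_k + rank)
    # Only the top rerank_n fused candidates pay the reranker's cost.
    candidates = sorted(fused, key=fused.get, reverse=True)[:rerank_n]
    reranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return reranked[:top_k]

# Stub retrievers with fixed outputs (assumption: stand-ins for BM25 and dense).
def lexical(query, limit):
    return ["doc-b", "doc-a", "doc-c"][:limit]

def dense(query, limit):
    return ["doc-c", "doc-b", "doc-d"][:limit]

# Stub relevance table standing in for a cross-encoder's query-document score.
relevance = {"doc-a": 0.2, "doc-b": 0.6, "doc-c": 0.9, "doc-d": 0.1}

def toy_reranker(query, doc):
    return relevance[doc]

results = hybrid_search("reverse proxy failing", [lexical, dense], toy_reranker, top_k=2)
print(results)
```

Fusion alone would put doc-b first, since it ranks high in both lists. The reranker's job is the last-mile correction that promotes doc-c, and rerank_n is the knob that trades that correction against latency.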
How to choose
- Corpus full of exact tokens — code, specs, legal citations, logs, part numbers: you need BM25 in the mix, full stop. Dense alone will lose the rare-term queries that matter most.
- Conversational, paraphrase-heavy corpus with little exotic vocabulary: dense alone may match hybrid, and you skip the second search. Test it; don't assume hybrid is free.
- Mixed real-world traffic (most RAG systems): hybrid with RRF is the sane default. It's cheap, it's robust to score-scale weirdness, and it has almost nothing to tune — leave k at 60 and move on.
- Precision at the very top matters: add a reranker after fusion, and budget the latency.
The lesson worth keeping is the one that sounds boring: hybrid search is not an AI problem. It's a units problem. Two retrievers handed you two scoreboards in different units, and the entire engineering question is whether you trust the numbers enough to normalize them or trust only the order. Most days, the order is all you should trust.



