Why does semantic search fail on exact matches like error codes or SKUs?

Embedding models are trained to generalize meaning, so a rare literal token like ERR_TLS_CERT_ALTNAME_INVALID or PROD-SKU-7842X lands in a generic region of vector space near similar-looking-but-wrong neighbors. Cosine similarity then returns plausible passages that don't contain the exact string. BM25 matches the literal token against an inverted index, so it either finds it or it doesn't — no blurring.

What is Reciprocal Rank Fusion (RRF) and why is it the default?

RRF combines multiple ranked lists by summing 1/(k + rank) for each document across lists, with k defaulting to 60. It uses only rank position, never raw scores, which sidesteps the fact that BM25 scores are unbounded while cosine similarity sits in a small range, so the two can't be added directly. It needs essentially no tuning, ships as a first-class rrf retriever in Elasticsearch, and is a built-in option in Weaviate and Qdrant.

When is pure semantic or pure lexical search good enough?

Pure semantic is fine when queries are natural-language and paraphrase-heavy with no identifiers — FAQ-style support, conceptual Q&A. Pure lexical (BM25) is fine when queries are almost always exact tokens — code symbol search, log/error-code lookup, catalog part numbers. Everything in between, which is most production traffic, should default to hybrid.

Hybrid Search vs Semantic Search: Why Vector RAG Misses Exact Matches

A user pastes ERR_TLS_CERT_ALTNAME_INVALID into your support bot. Your pure-vector RAG, the one that demoed beautifully on "how do I rotate a certificate," returns three calm, well-written passages about TLS handshakes, none of which mention that error code. The page that actually documents it — the one with the literal string in a fenced code block — sits at rank 14, below a paragraph about certificate alternative names that the embedding model decided was "close enough."

That is not a tuning problem. That is what semantic search is.

Embeddings smear the literal into the vicinity

Dense embedding models are trained to generalize across language — to put "car" near "automobile" and "refund" near "money back." That generalization is the whole point, and it is also exactly what kills exact-match retrieval. A rare token — an error code, a SKU, a function name, an acronym, a part number — is something the model has barely seen. So the query vector lands in a generic neighborhood, and cosine similarity happily returns the nearest plausible thing.

The canonical failure: dense retrieval sees PROD-SKU-7842X and confidently returns PROD-SKU-7842Y. Wrong answer, high score. Identifiers like 0x80070005, INV-2024-00847, or ENOMEM carry almost no semantic signal; the model has no principled way to keep 7842X and 7842Y apart, because in meaning-space they are the same thing.

BM25 — the lexical workhorse from the 1990s that still anchors every serious search stack — does not have this problem, because it does not think. It scores query terms against an inverted index of exact tokens: term frequency with diminishing returns, inverse document frequency so rare terms count more, length normalization. It either finds ERR_TLS_CERT_ALTNAME_INVALID or it doesn't. There is no "close enough."
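To make the "either finds it or it doesn't" behavior concrete, here is a minimal sketch of one common BM25 formulation. The whitespace tokenizer, the three toy documents, and the k1 = 1.2 and b = 0.75 parameters are assumptions for illustration; a real engine uses its own analyzer and defaults.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace so identifiers such as
    # ERR_TLS_CERT_ALTNAME_INVALID survive as one token (an assumption;
    # production analyzers tokenize differently).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no literal match, no contribution
            # Rare terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus: only the second document contains the literal code.
docs = [
    "tls handshakes negotiate a cipher suite and verify the certificate",
    "err_tls_cert_altname_invalid means the hostname is missing from the certificate alt names",
    "rotate a certificate before it expires",
]
scores = bm25_scores("ERR_TLS_CERT_ALTNAME_INVALID", docs)
```

Documents without the literal token score exactly zero here, which is the whole appeal: there is no nearest plausible neighbor to fall back on.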

Semantic search fails loudly on paraphrase and silently on identifiers. The silent failures are the ones that reach production.

The flip side is just as real. Ask BM25 "my site won't load after I changed the cert" and it flails — no shared tokens with the doc titled Resolving certificate validation errors. Vectors nail that. The two methods fail in opposite directions, which is the entire argument for running both.

Hybrid is the default — and that's where the work starts

"Run both and combine" sounds trivial until you try to combine. You have two ranked lists. The vector side hands you cosine similarities clustered in a narrow band — 0.78, 0.81, 0.83. The BM25 side hands you unbounded scores — 4.2, 11.7, 28.0 — whose magnitude depends on corpus statistics and query length. You cannot add 0.81 and 11.7 and expect the sum to mean anything. The scales aren't just different; they're incomparable, and any fixed weighting you pick is an arbitrary scaling decision wearing a lab coat.

There are two honest ways out.

Weighted score fusion (normalize-then-add): rescale both lists to [0, 1], using min/max or L2 normalization, then take a weighted sum. This is what OpenSearch's normalization-processor does, and what Weaviate calls relativeScoreFusion. It works, but it's sensitive: under min/max, one outlier BM25 score compresses every other normalized score toward zero, and the right weight drifts with your corpus.
Reciprocal Rank Fusion (RRF): throw the scores away entirely. Use only each document's position in each list.
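The first option, normalize-then-add, can be sketched in a few lines. This is a minimal illustration using min/max normalization and the example scores from above; the document ids, the alpha convention, and the function names are hypothetical, not any particular engine's API.

```python
def min_max(scores):
    # Rescale a {doc_id: score} mapping to [0, 1].
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(bm25, dense, alpha=0.5):
    # alpha weights the dense side and 1 - alpha the BM25 side
    # (a convention assumed for this sketch).
    b, d = min_max(bm25), min_max(dense)
    return {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(b) | set(d)
    }


bm25 = {"a": 4.2, "b": 11.7, "c": 28.0}
dense = {"a": 0.83, "b": 0.81, "c": 0.78}
fused = weighted_fusion(bm25, dense)

# Same corpus, but document "c" now has an outlier BM25 score.
bm25_outlier = {"a": 4.2, "b": 11.7, "c": 280.0}
before = min_max(bm25)["b"]          # roughly 0.32
after = min_max(bm25_outlier)["b"]   # roughly 0.03
```

Document "b" did not change, yet its normalized BM25 score dropped by an order of magnitude because a different document's score moved. That is the sensitivity the weighted approach has to live with.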

Why RRF won

RRF, from Cormack, Clarke, and Büttcher's 2009 SIGIR paper, is almost insultingly simple. For each document, sum across the lists it appears in:

score(d) = Σ_i 1 / (k + rank_i(d))

rank_i(d) is the document's position in list i; k is a constant that defaults to 60, the value from the original paper, and it has survived nearly two decades of benchmarks. That's it. A rank of 1 contributes 1/61; a rank of 14 contributes 1/74. A document ranked 10th in both lists scores 2/70 ≈ 0.029 and beats one ranked #1 in a single list and absent from the other (1/61 ≈ 0.016). RRF rewards agreement, not the loudness of any single retriever's vote.
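The formula translates almost directly into code. Below is a minimal sketch of RRF over two ranked lists; the document ids and the two rankings are made up for illustration.

```python
from collections import defaultdict


def rrf(ranked_lists, k=60):
    # Sum 1 / (k + rank) for every list a document appears in.
    # Ranks are 1-based; documents missing from a list contribute nothing.
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the two retrievers, best match first.
bm25_ranking = ["err_doc", "tls_overview", "cert_rotation"]
dense_ranking = ["handshake_guide", "tls_overview", "err_doc"]

fused = rrf([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

The two documents that appear in both lists (err_doc at 1/61 + 1/63, tls_overview at 2/62) land well ahead of handshake_guide, which is #1 on the dense side but absent from the BM25 list and so only earns 1/61.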

The reason it won production is the reason it looks too simple: by using ranks, it never has to reconcile BM25's unbounded scores with cosine's narrow band. There is nothing to normalize and nothing to tune. Elasticsearch ships it as a first-class rrf retriever with rank_constant defaulting to 60; Weaviate offers it as rankedFusion alongside an alpha knob that slides between pure-keyword and pure-vector; Qdrant exposes it in its Query API; Pinecone, OpenSearch, and pgvector-plus-ParadeDB all give you a hybrid path. Native hybrid is now table stakes for a vector database, not a differentiator.
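For a sense of what the Elasticsearch side looks like, here is a sketch that builds a search request body using the rrf retriever with one lexical and one kNN sub-retriever. The field names body and body_vector, the window sizes, and the toy query vector are assumptions; check the shape against the retriever documentation for your Elasticsearch version before relying on it.

```python
import json


def rrf_search_body(query_text, query_vector,
                    text_field="body", vector_field="body_vector"):
    # text_field and vector_field are hypothetical index field names.
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    # Lexical (BM25) side: a plain match query.
                    {"standard": {"query": {"match": {text_field: query_text}}}},
                    # Dense side: approximate kNN over the vector field.
                    {
                        "knn": {
                            "field": vector_field,
                            "query_vector": query_vector,
                            "k": 50,
                            "num_candidates": 200,
                        }
                    },
                ],
                "rank_window_size": 50,
                "rank_constant": 60,  # the k in the RRF formula
            }
        }
    }


body = rrf_search_body("ERR_TLS_CERT_ALTNAME_INVALID", [0.1, 0.2, 0.3])
print(json.dumps(body, indent=2))
```

The sketch only constructs the JSON payload; sending it is left to whatever client you already use.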

Hybrid is not free

Two honest costs, since nobody in a comparison table mentions them.

First, you now maintain two indexes — a dense vector index and an inverted index — over the same corpus. More storage, more to keep in sync at ingest, two retrieval calls per query instead of one. Your chunking now has to serve both masters: chunks small enough for clean embeddings but with enough literal tokens left intact for BM25 to grip. Contextual retrieval — prepending a short context blurb to each chunk before indexing — happens to help both sides at once, which is part of why it caught on.
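As a rough illustration of the contextual-retrieval idea, the sketch below prepends a short blurb to a chunk and feeds the same string to both indexes. The Chunk type and the fixed title-and-section template are assumptions; how the blurb is actually generated is a design choice this sketch does not settle.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    section: str
    text: str


def contextualize(chunk):
    # Hypothetical template: the title and section stand in for the blurb.
    blurb = f"From '{chunk.doc_title}', section '{chunk.section}'."
    return f"{blurb}\n{chunk.text}"


chunk = Chunk(
    doc_title="TLS troubleshooting",
    section="Certificate errors",
    text="ERR_TLS_CERT_ALTNAME_INVALID is raised when the hostname is not listed in the certificate.",
)

# The same contextualized string is embedded for the dense index and
# tokenized for the inverted index, so the literal code stays intact
# while the blurb adds topical words for both retrievers.
indexed_text = contextualize(chunk)
```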

Second, fusion gives you a merged candidate set, not a final answer. The standard production shape is hybrid retrieve → fuse → rerank, where a cross-encoder re-scores the top fused candidates with full query-document attention. If you're assembling this pipeline, the reranker is the usual final stage, and it's where a lot of the real relevance lift lives.
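The retrieve, fuse, rerank shape can be sketched with pluggable callables. Everything here is a stand-in: the two retrievers return hard-coded rankings, and toy_rerank counts shared tokens where a real pipeline would call a cross-encoder.

```python
def hybrid_search(query, lexical, dense, rerank_score,
                  k=60, candidates=10, top_n=3):
    # Stage 1 and 2: run both retrievers and fuse their rankings with RRF.
    fused = {}
    for ranking in (lexical(query), dense(query)):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    shortlist = sorted(fused, key=fused.get, reverse=True)[:candidates]
    # Stage 3: re-score only the fused shortlist with the reranker.
    reranked = sorted(shortlist,
                      key=lambda doc_id: rerank_score(query, doc_id),
                      reverse=True)
    return reranked[:top_n]


# Hypothetical corpus and retrievers for illustration only.
corpus = {
    "err_doc": "ERR_TLS_CERT_ALTNAME_INVALID hostname missing from certificate alt names",
    "tls_overview": "overview of TLS handshakes",
    "cert_rotation": "how to rotate a certificate",
}


def lexical(query):
    return ["err_doc"]


def dense(query):
    return ["tls_overview", "cert_rotation", "err_doc"]


def toy_rerank(query, doc_id):
    # Token-overlap count, a placeholder for a cross-encoder score.
    query_tokens = set(query.lower().split())
    return len(query_tokens & set(corpus[doc_id].lower().split()))


results = hybrid_search("ERR_TLS_CERT_ALTNAME_INVALID", lexical, dense, toy_rerank)
```

Keeping the reranker behind a plain function argument means the expensive model only ever sees the fused shortlist, not the whole corpus.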

The verdict

Default to hybrid. For nearly all real corpora — docs, code, support tickets, catalogs — queries are a mix of paraphrase and literal, and you cannot predict which a given user will type. Hybrid + RRF + a reranker is the boring, correct baseline, and RRF means you get it with essentially zero fusion tuning.

Skip hybrid only when your traffic is genuinely one-shaped:

Pure semantic is fine for natural-language, paraphrase-heavy queries with no identifiers — conceptual Q&A, FAQ bots, "explain this to me" traffic.
Pure lexical (BM25) is fine when queries are almost always exact tokens — code-symbol search, log and error-code lookup, part-number catalogs.

Everything between those poles is where vector search fails silently, letting a plausible passage outrank the one document the user actually needed. That document is sitting at rank 14, with the literal string right there in it. Hybrid is how you stop shipping that.


Dex Mareno
