Contextual Retrieval vs Naive RAG: Fix the Chunk, Not the Model

A chunk is a hostage. You split a document to fit a retriever's window, and in the cut you lose the thing that made the text legible. "The company's revenue grew 3% over the previous quarter." Which company? Which quarter? The sentence was clear inside its document and is noise outside it. The embedding you compute is an embedding of the noise.

This is the failure mode naive RAG never fixes, and the reason most teams blame the wrong layer. Retrieval misses, so they reach for a bigger embedding model, or a graph index, or a query-rewriting prompt. The defect is upstream of all of it. It happened at the chunk boundary, at index time, the moment context was stripped away.

The cheapest fix is also the most durable

Anthropic's contextual retrieval, published September 2024, treats the boundary as the bug. Before each chunk is embedded and indexed, a model writes a short, chunk-specific context — what document this is from, what it refers to, the named entity and quarter the orphaned sentence assumed you already knew — and prepends it to the chunk. The orphan gets threaded back onto its document. You embed the contextualized text and you index it for keyword search too.
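The index-time step above can be sketched in a few lines. This is a minimal illustration, not Anthropic's published code: the prompt wording, the function names, and the `generate` callable (a stand-in for whatever model client you use) are all assumptions, and the company and quarter in the stub are invented placeholders.

```python
from typing import Callable, List

# Hypothetical prompt template; the wording is an assumption, not a quoted prompt.
PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context that situates this chunk within the document."
)


def contextualize(document: str, chunks: List[str],
                  generate: Callable[[str], str]) -> List[str]:
    """Return each chunk with a model-written context prepended to it."""
    contextualized = []
    for chunk in chunks:
        context = generate(PROMPT.format(document=document, chunk=chunk)).strip()
        # The combined text is what gets embedded and indexed for keyword search.
        contextualized.append(f"{context}\n\n{chunk}")
    return contextualized


def fake_generate(prompt: str) -> str:
    # Stub so the sketch runs offline; a real call would go to an LLM.
    return "From ACME Corp's Q2 2023 filing."


doc = "ACME Corp Q2 2023 report. The company's revenue grew 3% over the previous quarter."
chunks = ["The company's revenue grew 3% over the previous quarter."]
result = contextualize(doc, chunks, fake_generate)
print(result[0])
```

The orphaned sentence now carries the company and the quarter with it, so both the embedding and the keyword index see them.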

The reported numbers are specific. Contextual embeddings alone cut the top-20 retrieval failure rate from 5.7% to 3.7% — a 35% reduction. Add contextual BM25 and the failure rate drops to 2.9%, a 49% reduction. Layer a reranker on top and it reaches 1.9%, a 67% reduction. Those are Anthropic's figures, and the direction is the point: each stage compounds, and none of them is a bigger embedding model.
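The percentages follow directly from the failure rates quoted above. A quick check of the arithmetic, using only those four figures (the helper name is ours):

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction of baseline failures eliminated."""
    return (baseline - new) / baseline


baseline = 5.7  # top-20 retrieval failure rate, in percent, before any fix
stages = {
    "contextual embeddings": 3.7,
    "+ contextual BM25": 2.9,
    "+ reranker": 1.9,
}

for name, rate in stages.items():
    reduction = relative_reduction(baseline, rate)
    print(f"{name}: {rate}% failures, {reduction:.0%} reduction")
```

Each reduction is measured against the same 5.7% baseline, not against the previous stage.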

The fix lives at index time, not query time — you pay once to contextualize, and every future query inherits a corpus that no longer lies about itself.

The objection writes itself: calling a model on every chunk in a corpus is expensive. It was, until prompt caching made it cheap. You load the full document into the cache once, then generate per-chunk context against the cached document at a fraction of the input cost. The economics flip. Contextualizing a corpus becomes a one-time write expense, not a recurring tax, which is exactly why fixing retrieval at index time beats fixing it at query time. Query-time tricks run forever. The index-time fix runs once and stays fixed.
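To see why caching flips the economics, here is a rough cost sketch in units of "one uncached input token". The cache-write and cache-read multipliers, the document size, and the chunk counts are illustrative assumptions, not quoted prices; check your provider's current pricing. The sketch ignores output tokens and assumes the cache stays warm for every call on a document.

```python
def input_cost_units(doc_tokens: int, n_chunks: int, per_call_tokens: int,
                     cached: bool, cache_write: float = 1.25,
                     cache_read: float = 0.10) -> float:
    """Input cost of contextualizing one document, in base-input-token units."""
    fresh = n_chunks * per_call_tokens  # chunk + instructions, never cached
    if not cached:
        # Every call re-sends the whole document at full price.
        return float(n_chunks * doc_tokens + fresh)
    first_call = doc_tokens * cache_write  # write the document to the cache once
    later_calls = (n_chunks - 1) * doc_tokens * cache_read  # cheap cache reads
    return first_call + later_calls + fresh


# Hypothetical document: 8,000 tokens, 10 chunks, 850 fresh tokens per call.
without = input_cost_units(8000, 10, 850, cached=False)
with_cache = input_cost_units(8000, 10, 850, cached=True)
print(f"uncached: {without:.0f}  cached: {with_cache:.0f}  "
      f"ratio: {with_cache / without:.2f}")
```

Under these assumptions the document itself stops being the dominant cost; the more chunks per document, the larger the gap.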

Hybrid search: two retrievers that fail in opposite directions

Contextual retrieval is not one signal. It is two, deliberately.

Dense embeddings retrieve by meaning. Ask for "how the firm's earnings changed" and they will find the chunk about revenue growth even with no shared words. But embeddings smear exact tokens. A part number, an error code, a function name, a customer ID — ERR_4042, acct_9913 — gets blurred into a neighborhood of plausible neighbors, and the exact match you needed ranks below five approximate ones.

BM25 fails in the opposite direction. It is lexical: it rewards rare exact tokens and ignores meaning entirely. It nails ERR_4042 and whiffs on paraphrase. Dense and sparse retrievers miss orthogonally — what one drops, the other catches — which is why hybrid search is the correct architecture for any corpus that mixes prose with identifiers, and almost all of them do.
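The lexical half is easy to demonstrate. Below is a minimal sketch of Okapi BM25 with common default parameters (k1 = 1.5, b = 0.75) and naive whitespace tokenization; the two toy documents and both queries are invented for illustration, and a production system would use a real search engine rather than this loop.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase + whitespace split is enough for this toy corpus.
    return text.lower().split()


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no shared token, no credit
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    tokenize("service returned ERR_4042 during checkout"),
    tokenize("revenue grew 3% over the previous quarter"),
]

exact = bm25_scores(tokenize("ERR_4042"), docs)
paraphrase = bm25_scores(tokenize("how earnings changed"), docs)
print("exact identifier:", exact)
print("paraphrase:", paraphrase)
```

The identifier query scores only the document that contains the token; the paraphrase shares no words with either document, so BM25 gives it nothing to rank. That second case is the one the dense retriever exists to catch.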

The two ranked lists are merged with reciprocal rank fusion (Cormack et al., 2009). RRF throws away the raw scores, which live on incompatible scales and resist normalization, and fuses by rank alone: each document's fused score is the sum, across both lists, of 1/(k + rank), where k is a small smoothing constant (60 in the original paper). It is parameter-light and refuses to let one retriever's score distribution dominate. A chunk that both retrievers rank highly floats to the top; a chunk only one of them loves still gets a fair hearing.
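A minimal sketch of the fusion step, assuming each retriever returns an ordered list of chunk IDs. The smoothing constant k = 60 follows the Cormack et al. paper; the chunk IDs and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists by summing 1 / (k + rank) for each document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["c3", "c1", "c7"]   # hypothetical dense-retriever ranking
sparse = ["c1", "c9", "c3"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

The two chunks that appear in both lists outrank the two that appear in only one, and no raw score from either retriever was ever consulted.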

Reranking is the last stage, not a substitute

Hybrid retrieval is recall-oriented: cast a wide net, pull twenty candidates, accept some junk. The final stage is a cross-encoder reranker — Cohere Rerank and its peers — that reads the query and each candidate together, rather than comparing two pre-computed vectors, and reorders them by true relevance. It is slower per pair, which is precisely why it runs last, on a short list, not across the corpus. In Anthropic's results it is the layer that takes a 49% failure reduction to 67%. It does not replace contextualization or hybrid search. It cleans up after them.
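The shape of the final stage, sketched with a pluggable scorer. `score_pair` is a hypothetical stand-in for a cross-encoder or a hosted rerank API, and `toy_score` is a word-overlap placeholder that exists only so the example runs offline; it is not how a real reranker judges relevance.

```python
from typing import Callable, List


def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], top_k: int = 5) -> List[str]:
    """Score each (query, candidate) pair jointly and keep the best top_k."""
    scored = [(score_pair(query, c), i, c) for i, c in enumerate(candidates)]
    # Sort by score descending; original position breaks ties.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [c for _, _, c in scored[:top_k]]


def toy_score(query: str, candidate: str) -> float:
    # Placeholder relevance: fraction of query words found in the candidate.
    q = set(query.lower().split())
    c = set(candidate.lower().split())
    return len(q & c) / len(q) if q else 0.0


candidates = [
    "office relocation schedule",
    "revenue grew 3% over the previous quarter",
    "quarter results",
]
best = rerank("revenue growth last quarter", candidates, toy_score, top_k=2)
print(best)
```

Because the scorer sees the query and the candidate together, its cost grows with the number of pairs, which is why it runs on the short fused list rather than on the corpus.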

The order of operations

The spine is an ordering, not a menu. Before you swap embedding models, before you stand up a GraphRAG vs vector RAG comparison, before you add a query-rewriting layer that runs on every request forever — contextualize your chunks, retrieve hybrid, rerank the top candidates.

Most RAG retrieval failures are not a model problem. They are context-loss at chunk boundaries, and the cheapest, most durable fix is to stop tearing the context off the chunk in the first place. Fix the chunk, not the model.

Frequently asked

What is contextual retrieval?

A technique from Anthropic that prepends a short, LLM-generated chunk-specific context to each chunk before embedding and BM25 indexing, so chunks stay findable after being split from their source document.

Does contextual retrieval replace reranking?

No. They stack. Contextual embeddings plus contextual BM25 cut failures ~49%; adding a reranker on top reaches ~67%.

Is hybrid search better than vector search alone?

For real corpora, yes — dense embeddings catch semantics, BM25 catches exact tokens, IDs, and codes; reciprocal rank fusion merges both ranked lists.

Should I use GraphRAG instead?

Usually not first. Most retrieval failures are context loss at chunk boundaries, which contextualization fixes far more cheaply than a graph index.

Dimension	Naive RAG	Contextual Retrieval
Chunk context	Stripped at split — orphaned text	LLM-generated context prepended per chunk
Retrieval signal	Dense embeddings only	Hybrid: dense + BM25 fused via RRF
Where the fix happens	Query time (rewriting, expansion)	Index time (one-time contextualization)
Failure rate	Baseline 5.7% top-20 misses	Down to 1.9% with rerank (-67%)
Cost	Cheap to build, brittle to run	One-time write cost, slashed by prompt caching
When to use	Tiny, self-contained corpora	Real documents with codes, IDs, cross-references


Dex Mareno
