There is a checklist that every team builds the second their RAG demo stops impressing people. Add a reranker. Add query rewriting. Add MMR for diversity. Each line item is a real technique with a real paper behind it, and the list has the satisfying shape of progress. The problem is that two of those items are quietly fixing the same thing in your head and completely different things in the retriever — and one of them, on the benchmark people cite to justify the whole list, does approximately nothing.
Start with what the reranker actually does, because the disappointment begins there. First-stage retrieval embeds your query and your chunks separately and ranks by cosine similarity; it is fast and approximate because the model that embedded each chunk had no idea, at the time, what query it would face. A cross-encoder reranker fixes that: it feeds the query and one candidate chunk through a model together, with full attention between them, and scores how relevant that chunk is to that query. It is often the biggest single precision upgrade available in retrieval. It is also, structurally, scoring each chunk in isolation against the query. It never looks at the other chunks in your result set. So a reranker will happily hand you the five most relevant chunks in the candidate set, and if those five chunks are the same sentence phrased five ways, that is exactly what you get, now sorted more confidently.
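To make the "scored in isolation" point concrete, here is a minimal sketch of pointwise reranking. The `toy_relevance` function is a hypothetical stand-in for a real cross-encoder (it only counts shared words), and the chunk texts are invented for illustration; the part that matters is the shape of `rerank`, which never compares one chunk with another.

```python
from typing import Callable, List, Tuple


def toy_relevance(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder score.

    Returns the fraction of query terms that appear in the chunk.
    A real reranker would run a model over the (query, chunk) pair.
    """
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    return len(query_terms & chunk_terms) / len(query_terms)


def rerank(
    query: str,
    chunks: List[str],
    score: Callable[[str, str], float] = toy_relevance,
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    # Each chunk is scored against the query alone; no chunk sees its neighbors.
    scored = [(score(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


chunks = [
    "refund window is 30 days",
    "the refund window is 30 days",
    "our refund window is 30 days long",
    "refunds are issued to the original payment method",
]

top = rerank("refund window days", chunks)
for score_value, chunk in top:
    print(f"{score_value:.2f}  {chunk}")
```

All three slots go to rephrasings of the same sentence, each with a perfect score, and the one chunk carrying a different fact is left out. Nothing in the scoring loop could have noticed.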
Why top-k clusters
The reason near-duplicates pile up is almost tautological once you see it. The embedding signal that makes a chunk relevant to your query is the same signal that makes near-duplicate chunks relevant to each other. They sit in a tight knot in vector space, and top-k, having no notion of "I already have this one," reaches into the knot and pulls out the whole thing. It is worst exactly where real corpora are messy: the same policy quoted across three documents, boilerplate repeated on every page, an FAQ that restates the manual. You ask for the top 8 and get one fact, octupled, while the complementary facts a complete answer needed sit at rank 9 and never make the context window.
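The knot is easy to reproduce with toy numbers. The sketch below uses made-up three-dimensional vectors in place of real embeddings, and the chunk names are hypothetical; it only shows that plain cosine top-k has no mechanism for skipping a near-copy of something it already picked.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: List[Tuple[str, Sequence[float]]], k: int) -> List[str]:
    # Rank every chunk by similarity to the query and keep the first k.
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy "embeddings": three near-identical policy chunks and one complementary chunk.
index = [
    ("policy_v1", [1.0, 0.20, 0.0]),
    ("policy_copy_a", [1.0, 0.19, 0.0]),
    ("policy_copy_b", [1.0, 0.18, 0.0]),
    ("exceptions", [0.1, 1.0, 0.3]),
    ("unrelated", [0.0, 0.1, 1.0]),
]
query = [1.0, 1.0, 0.0]

result = top_k(query, index, k=3)
print(result)
```

The three copies fill every slot, and "exceptions", the chunk that would have completed the answer, lands just outside the cut.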
A reranker makes each result more relevant. Only a diversity step makes the set less redundant. They are not two strengths of the same upgrade — they are repairs to opposite failures.
This is the job Maximal Marginal Relevance was built for, and it has been around since Carbonell and Goldstein described it in 1998. MMR re-ranks candidates by relevance to the query minus a penalty for similarity to the chunks it has already selected, iteratively, so each pick is the most useful thing you don't already have. The trade-off is exposed as a single dial — LangChain calls it lambda_mult, where 1.0 is pure relevance (plain similarity again) and 0.0 is maximum diversity, defaulting to a hedged 0.5. Unlike the reranker, MMR is defined by looking at the rest of the set. That is the whole point. It is the only step in the pipeline that can see redundancy at all.
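The selection loop described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not LangChain's code: it uses cosine similarity for both relevance and redundancy, takes the maximum similarity to any already-selected chunk as the penalty, and runs on the same kind of toy vectors used for illustration elsewhere. The parameter name `lambda_mult` is borrowed from the text.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    # Assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query: Vector, candidates: List[Vector], k: int, lambda_mult: float = 0.5) -> List[int]:
    """Return indices of k candidates chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query, vec) for vec in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            # Penalty: similarity to the closest chunk already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Indices 0-2 are near-duplicates; index 3 is a complementary chunk.
candidates = [
    [1.0, 0.20, 0.0],
    [1.0, 0.19, 0.0],
    [1.0, 0.18, 0.0],
    [0.1, 1.0, 0.3],
    [0.0, 0.1, 1.0],
]
query = [1.0, 1.0, 0.0]

pure_relevance = mmr(query, candidates, k=3, lambda_mult=1.0)
balanced = mmr(query, candidates, k=3, lambda_mult=0.5)
print(pure_relevance)  # the three near-duplicates, in relevance order
print(balanced)        # the complementary chunk (index 3) is picked second
```

At `lambda_mult=1.0` the penalty term vanishes and the function degenerates to plain top-k. At 0.5 the second pick jumps to the complementary chunk, because every near-duplicate is now charged for its resemblance to the first pick.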
The knob that actually matters
Here is the part the tutorials underplay. Everyone tunes lambda_mult and nobody tunes fetch_k, and fetch_k is the load-bearing parameter. MMR does not search your whole index; it diversifies over a candidate pool: the fetch_k chunks pulled by ordinary similarity first. LangChain defaults fetch_k to 20, and vendor guidance is to keep it comfortably larger than k. If fetch_k is barely above k, MMR is choosing diversity from a bucket that is already mostly duplicates, and all it can do is reshuffle them. The diversity lives in the size of the pool, not in the aggressiveness of the dial. Crank lambda_mult toward 0 with a tiny fetch_k and you get the worst of both worlds: less relevant results that are still redundant.
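A small experiment shows why the pool size dominates. The sketch below is a simplified two-stage search written for illustration (it is not LangChain's implementation), with hypothetical toy vectors: four near-duplicates outrank one complementary chunk, so whether MMR can reach that chunk depends entirely on `fetch_k`.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    # Assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_search(
    query: Vector,
    index: List[Vector],
    k: int,
    fetch_k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    relevance = [cosine(query, vec) for vec in index]
    # Stage 1: ordinary similarity search builds the candidate pool.
    pool = sorted(range(len(index)), key=lambda i: relevance[i], reverse=True)[:fetch_k]
    # Stage 2: MMR picks k items, but only from that pool.
    selected: List[int] = []
    while pool and len(selected) < k:

        def score(i: int) -> float:
            redundancy = max((cosine(index[i], index[j]) for j in selected), default=0.0)
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


# Indices 0-3 are near-duplicates; index 4 is the complementary chunk.
index = [
    [1.0, 0.20, 0.0],
    [1.0, 0.19, 0.0],
    [1.0, 0.18, 0.0],
    [1.0, 0.17, 0.0],
    [0.1, 1.0, 0.3],
    [0.0, 0.1, 1.0],
]
query = [1.0, 1.0, 0.0]

tight_pool = mmr_search(query, index, k=3, fetch_k=4)
wide_pool = mmr_search(query, index, k=3, fetch_k=6)
aggressive = mmr_search(query, index, k=3, fetch_k=4, lambda_mult=0.0)

print("fetch_k=4:", tight_pool)                   # only near-duplicates are available
print("fetch_k=6:", wide_pool)                    # index 4 is now reachable
print("fetch_k=4, lambda_mult=0.0:", aggressive)  # still only near-duplicates
```

With `fetch_k=4` the pool holds nothing but copies, so no value of `lambda_mult` can surface the complementary chunk. Widening the pool does what turning the dial could not.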
And now the inconvenient evidence, because it reframes the whole exercise. The ARAGOG study graded a battery of "advanced RAG" techniques against a naive baseline, and the headline most people quote is that HyDE and LLM reranking meaningfully improved retrieval precision. The line people skip: MMR and Cohere rerank showed no notable advantage over naive RAG. Not that diversity is fake — that it is conditional. MMR only pays off when redundancy is genuinely your bottleneck, which means broad, multi-faceted, or multi-hop queries over a corpus that actually contains overlapping content. Point it at a narrow factual lookup, where the right answer is one specific chunk, and trading relevance for novelty does the one thing you never wanted: it demotes the chunk that held the answer to make room for a "diverse" one that doesn't.
So the decision is not "should I turn on diversity." It is "is redundancy my failure mode at all," and you answer that by reading your retrievals, not your config. If the top-k for a typical query is genuinely five takes on one fact, MMR earns its place. If it's already five distinct facts that are merely mis-ranked, you wanted a reranker (see also the cross-encoder vs. LLM vs. listwise tradeoff), and MMR will make the ranking worse by trading away relevance you needed.
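Reading retrievals by eye is the real test, but a rough number helps decide where to look first. The sketch below is one possible diagnostic, not an established metric: it takes the embeddings of a single top-k result and reports the mean pairwise cosine similarity plus the count of pairs above a near-duplicate cutoff. The function name and the 0.95 threshold are assumptions to tune against your own embedding model.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    # Assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def redundancy_report(top_k_vectors: List[Vector], dup_threshold: float = 0.95) -> Dict[str, float]:
    """Summarize how repetitive one retrieved set is (hypothetical helper)."""
    pairs = list(combinations(range(len(top_k_vectors)), 2))
    if not pairs:
        return {"mean_pairwise_sim": 0.0, "near_duplicate_pairs": 0, "total_pairs": 0}
    sims = [cosine(top_k_vectors[i], top_k_vectors[j]) for i, j in pairs]
    return {
        "mean_pairwise_sim": sum(sims) / len(sims),
        "near_duplicate_pairs": sum(1 for s in sims if s >= dup_threshold),
        "total_pairs": len(pairs),
    }


# One fact, three ways: every pair is a near-duplicate.
repetitive = [[1.0, 0.20, 0.0], [1.0, 0.19, 0.0], [1.0, 0.18, 0.0]]
# Three distinct facts: no pair is close.
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

repetitive_report = redundancy_report(repetitive)
distinct_report = redundancy_report(distinct)
print(repetitive_report)
print(distinct_report)
```

Run it over a sample of real queries. If most result sets look like the first report, redundancy is plausibly the bottleneck; if they look like the second, a diversity step has nothing to remove.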
The fix upstream of the knob
There is a deeper reading available, which is that MMR is a query-time band-aid for an index-time wound. If the same fact is sitting in your store five times, the durable fix is to not store it five times — dedup near-identical chunks during ingestion, before they ever compete for a slot, which also sharpens the chunking strategy the whole pipeline rests on. Query-time diversity then handles only the redundancy that's genuinely query-specific: two distinct chunks that happen to be interchangeable for this question.
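Ingestion-time dedup can start very simply. The sketch below is one assumed approach among several: it compares chunks by Jaccard overlap of their lowercase word sets and keeps a chunk only if it is not too close to one already kept. The 0.8 threshold, the helper names, and the sample chunks are all illustrative. The pairwise comparison is quadratic, so a large corpus would likely need something like MinHash or embedding-based clustering instead.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    # Lowercase word set; punctuation and casing differences disappear.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup_chunks(chunks: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the first occurrence of each near-identical chunk (O(n^2) sketch)."""
    kept: List[str] = []
    kept_tokens: List[Set[str]] = []
    for chunk in chunks:
        chunk_tokens = tokens(chunk)
        if any(jaccard(chunk_tokens, seen) >= threshold for seen in kept_tokens):
            continue  # near-duplicate of something already stored
        kept.append(chunk)
        kept_tokens.append(chunk_tokens)
    return kept


chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "REFUNDS are accepted within 30 days of purchase",
    "Refunds are accepted within 30 days of the purchase.",
    "Refunds go back to the original payment method.",
]

unique_chunks = dedup_chunks(chunks)
for chunk in unique_chunks:
    print(chunk)
```

The two restatements of the 30-day rule are dropped before they reach the index, and the chunk about payment method survives because it shares only one word with the first.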
The clean way to hold all of it in your head: cosine top-k optimizes per-chunk similarity and is blind to redundancy; a reranker optimizes per-chunk relevance and is also blind to redundancy; MMR is the only one that optimizes the set. Reach for the reranker when your results are relevant-but-mis-ranked, and for MMR (or better, dedup) when they're relevant-but-repetitive. Pick the wrong one and you will spend a sprint making each of five identical chunks individually more convincing. And if you can't yet say which failure you have, that's the actual first task, the one no retrieval metric reports until you go and look.



