The retrieval library, read in order — from the architecture call (is RAG the right tool, or long context / fine-tuning?) through chunking, the embedding models that encode your corpus, the vector databases and indexes that store it, the retrieval quality layer (hybrid search and reranking), the advanced patterns (GraphRAG, hierarchical, self-correcting), and the evaluation that tells you whether any of it works.
Most RAG retrieval failures are context lost at chunk boundaries — contextual retrieval fixes them at index time, cheaper than a bigger embedding model or GraphRAG.
Naive RAG retrieves once and hopes. Agentic RAG turns retrieval into a decision the model makes at runtime — paying for it on every query to win the queries that silently fail.
Million-token windows were supposed to kill retrieval. The benchmarks say something stranger — the choice is really between two different failure modes, and only one of them is loud.
Cache-augmented generation deletes the retriever and preloads your whole knowledge base into the KV cache. The real question isn't speed — it's whether your corpus fits and how often it changes.
RAG and fine-tuning are not two answers to one question. RAG fixes what the model doesn't know; fine-tuning fixes what it won't do the way you need. Pick by the failure, not the fashion.
The chunk-size A/B test is the most over-run experiment in RAG. The teams winning on retrieval stopped tuning how they split and started fixing what each chunk forgets.
Your chunks lose the document around them before they're ever embedded. Jina and Anthropic solve it in opposite places — one in vector space for free, one in the text for a price.
The 'reorder so the best chunks sit at the start and end' trick everyone copies from LangChain is a 2023 patch for a 2023 problem. On a tight, well-reranked context it can quietly demote your second-best evidence to the worst seat in the room.
Voyage, OpenAI, Gemini, Cohere, and open-weight BGE all top some leaderboard. The MTEB score you're comparing is the least important number in the decision.
The embedding model you pick barely moves your bill. The dimensions you store and the precision you keep — that's the recurring cost, and it's the decision almost nobody makes on purpose.
A Matryoshka-trained embedding lets you chop off the tail of every vector and still search well — and a two-pass trick gets you the storage savings and the accuracy at the same time.
Re-embedding your corpus is cheap. The expensive part is that two models live in two incompatible vector spaces — and a naive rolling reindex hides the damage behind green dashboards.
Approximate nearest-neighbor search is a tax you pay to survive scale you may not have. Below a few hundred thousand vectors, exact brute-force is faster, perfectly accurate, and has no index to rot.
The benchmarks everyone argues about measure the thing that almost never decides the choice. The real axis is where your vectors live — and whether you can afford to keep them there.
All three clear the recall-and-latency bar for almost any agent you'll build. The real decision is where the operational cost lives — and there's a query volume where the answer flips.
They all scale now, and they all do hybrid search. The axis that still forks the decision is the one nobody puts on a benchmark chart: how each keeps a metadata filter from wrecking recall.
Almost every vector-index comparison argues about query speed. Below ten million vectors that is the one thing that rarely decides it. The real choice is where your vectors live, and what it costs to change them.
M, ef_construction, and ef_search decide whether your vector search is fast, accurate, or neither. Only one of them can be changed after you build the index — and it's the one most teams never touch.
Vector search quietly fails on product codes and function names. Here's why, what BM25 fixes, and why rank-based fusion beats score-mixing.
A reranker is the cheapest large win left in a RAG pipeline — a stateless model you bolt on after retrieval. The trap is choosing one by leaderboard rank instead of the two things that actually decide it.
They read like rivals you choose between. They're two stages of one pipeline, forced apart by a single computational fact — and that fact tells you exactly where each one belongs.
Dense, sparse, and late-interaction retrieval aren't a quality ladder. They're three answers to one question — where does the matching cost live — and the answer decides your storage bill.
Microsoft GraphRAG, LightRAG, and LazyGraphRAG all promise smarter retrieval. The honest question isn't which to pick — it's whether your queries are the kind a graph can even help.
Flat top-k retrieval returns the chunks most similar to your query. For "what is this document about?" that's exactly the wrong thing. RAPTOR retrieves at the right altitude instead.
Both bolt a quality check onto RAG, but they fix different failures at different points — and the choice comes down to one question: do you control the model's weights?
Most RAG failures are retrieval failures wearing a generation costume — so measure the two halves separately or you'll tune the wrong one for weeks.
Search teams optimize NDCG. RAG teams copy them — and measure the wrong thing. For a pipeline that hands the whole top-k to a generator, recall is the floor and rank position is a second-order correction.
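To make the last point concrete, here is a minimal sketch comparing recall@k with binary-relevance NDCG@k on two toy result lists. The document ids, the two runs, and the helper names `recall_at_k` and `ndcg_at_k` are all illustrative assumptions, not taken from any of the articles above.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear anywhere in the top k."""
    top = set(retrieved[:k])
    return len(top & relevant) / len(relevant)


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: rewards placing relevant ids near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal


# Hypothetical query with two chunks the generator needs.
relevant = {"a", "b"}

# Run 1: one relevant chunk at rank 1, the other never retrieved.
run_ranked_well = ["a", "x", "y", "z", "w"]
# Run 2: both relevant chunks retrieved, but at ranks 4 and 5.
run_full_recall = ["x", "y", "z", "a", "b"]

for name, run in [("ranked_well", run_ranked_well), ("full_recall", run_full_recall)]:
    print(name, recall_at_k(run, relevant, 5), round(ndcg_at_k(run, relevant, 5), 3))
```

On this toy data NDCG@5 scores the first run higher even though it never retrieves one of the two needed chunks, while recall@5 prefers the second run. If the generator reads the whole top-k, the second run is the one that can actually answer the question.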