Topic

RAG & Retrieval

The retrieval library, read in order — from the architecture call (is RAG the right tool, or long context / fine-tuning?) through chunking, the embedding models that encode your corpus, the vector databases and indexes that store it, the retrieval quality layer (hybrid search and reranking), the advanced patterns (GraphRAG, hierarchical, self-correcting), and the evaluation that tells you whether any of it works.

Contextual Retrieval vs Naive RAG: Fix the Chunk, Not the Model

Most RAG retrieval failures are context lost at chunk boundaries — contextual retrieval fixes them at index time, cheaper than a bigger embedding model or GraphRAG.

Agentic RAG vs Naive RAG: When to Let the Model Drive Retrieval

Naive RAG retrieves once and hopes. Agentic RAG turns retrieval into a decision the model makes at runtime — paying for it on every query to win the queries that silently fail.

RAG vs Long Context: When to Retrieve and When to Stuff the Window

Million-token windows were supposed to kill retrieval. The benchmarks say something stranger — the choice is really between two different failure modes, and only one of them is loud.

CAG vs RAG: When Cache-Augmented Generation Beats Retrieval

Cache-augmented generation deletes the retriever and preloads your whole knowledge base into the KV cache. The real question isn't speed — it's whether your corpus fits and how often it changes.

Fine-Tuning vs RAG: When to Actually Fine-Tune an LLM in 2026

They are not two answers to one question. RAG fixes what the model doesn't know; fine-tuning fixes what it won't do the way you need. Pick by the failure, not the fashion.

The Best Chunking Strategy for RAG in 2026: Fixed vs Semantic vs Late Chunking

The chunk-size A/B test is the most over-run experiment in RAG. The teams winning on retrieval stopped tuning how they split and started fixing what each chunk forgets.

Late Chunking vs Contextual Retrieval: Two Fixes for RAG's Context Problem

Your chunks lose the document around them before they're ever embedded. Jina and Anthropic solve it in opposite places — one in vector space for free, one in the text for a price.

RAG Context Ordering: Where to Put Your Best Chunk in the Prompt

The 'reorder so the best chunks sit at the start and end' trick everyone copies from LangChain is a 2023 patch for a 2023 problem. On a tight, well-reranked context it can quietly demote your second-best evidence to the worst seat in the room.

The Best Embedding Model for RAG Is the One You Benchmark Yourself

Voyage, OpenAI, Gemini, Cohere, and open-weight BGE all top some leaderboard. The MTEB score you're comparing is the least important number in the decision.

Voyage vs OpenAI vs Cohere vs Gemini: Choosing a Text Embedding API in 2026

The embedding model you pick barely moves your bill. The dimensions you store and the precision you keep — that's the recurring cost, and it's the decision almost nobody makes on purpose.

Matryoshka Embeddings: How to Shrink Vectors Without Wrecking Recall

A Matryoshka-trained embedding lets you chop off the tail of every vector and still search well — and a two-pass trick gets you the storage savings and the accuracy at the same time.
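
As a rough illustration of the two-pass idea, the sketch below shortlists candidates using only the leading dimensions of each vector, then rescores that shortlist with the full vectors. It assumes the embeddings come from a Matryoshka-trained model, so a truncated prefix is still meaningful once renormalized. The function names and the `short_dims` and `shortlist_size` parameters are hypothetical, chosen for this example only.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def truncate(vec, dims):
    """Keep the first `dims` components and renormalize (the Matryoshka prefix)."""
    return normalize(vec[:dims])


def two_pass_search(query, full_vectors, short_dims, shortlist_size, k):
    """Pass 1: rank by truncated vectors. Pass 2: rescore the shortlist at full width."""
    q_short = truncate(query, short_dims)
    q_full = normalize(query)

    # In a real system the truncated vectors would be precomputed and stored;
    # they are rebuilt here only to keep the sketch self-contained.
    short_index = [truncate(v, short_dims) for v in full_vectors]

    shortlist = sorted(
        range(len(full_vectors)),
        key=lambda i: dot(q_short, short_index[i]),
        reverse=True,
    )[:shortlist_size]

    rescored = sorted(
        shortlist,
        key=lambda i: dot(q_full, normalize(full_vectors[i])),
        reverse=True,
    )
    return rescored[:k]
```

Only the short vectors need to sit in the fast index under this arrangement; the full vectors are read for the shortlist alone, which is where the storage saving would come from.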

How to Migrate Embedding Models in Production Without Wrecking Retrieval

Re-embedding your corpus is cheap. The expensive part is that two models live in two incompatible vector spaces — and a naive rolling reindex hides the damage behind green dashboards.

Brute-Force vs Approximate Vector Search: Do You Even Need a Vector Database?

Approximate nearest-neighbor search is a tax you pay to survive scale you may not have. Below a few hundred thousand vectors, exact brute-force is faster, perfectly accurate, and has no index to rot.
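
For a sense of how little machinery exact search needs, here is a minimal sketch of brute-force cosine search using only the Python standard library. The names `brute_force_search` and `normalize` are illustrative, not taken from any library, and a production version would more likely use a vectorized array library and store pre-normalized vectors.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def brute_force_search(query, corpus, k=3):
    """Score the query against every vector and return the top-k (index, score) pairs."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(doc))), idx)
        for idx, doc in enumerate(corpus)
    )
    # nlargest keeps only k candidates in memory while scanning the whole corpus.
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]
```

Because every vector is scored, the result is exact by construction: there is no index to build, tune, or rebuild when vectors change.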

How to Choose a Vector Database for AI Agents: pgvector vs Pinecone vs Qdrant

The benchmarks everyone argues about measure the thing that almost never decides the choice. The real axis is where your vectors live — and whether you can afford to keep them there.

pgvector vs Pinecone vs Qdrant: Picking a Vector Database in 2026

All three clear the recall-and-latency bar for almost any agent you'll build. The real decision is where the operational cost lives — and there's a query volume where the answer flips.

Qdrant vs Milvus vs Weaviate: Filtered Search Is the Question That Separates Them

They all scale now, and they all do hybrid search. The axis that still forks the decision is the one nobody puts on a benchmark chart: how each keeps a metadata filter from wrecking recall.

HNSW vs IVF vs DiskANN: Choosing a Vector Index

Almost every vector-index comparison argues about query speed. Below ten million vectors that is the one thing that rarely decides it. The real choice is where your vectors live, and what it costs to change them.

How to Tune HNSW: The Three Knobs Behind Vector Search Recall

M, ef_construction, and ef_search decide whether your vector search is fast, accurate, or neither. Only one of them can be changed after you build the index — and it's the one most teams never touch.

BM25 vs Dense vs Hybrid Search: How to Actually Combine Them for RAG

Vector search quietly fails on product codes and function names. Here's why, what BM25 fixes, and why rank-based fusion beats score-mixing.
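
One common form of rank-based fusion is reciprocal rank fusion (RRF), sketched below. It is offered as an example of the general idea rather than as the specific method the article uses. Each document earns `1 / (k + rank)` from every ranked list it appears in, so BM25 scores and cosine similarities never have to be put on a shared scale. The constant `k = 60` is the value commonly used for RRF, and the document IDs in the usage example are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword retriever and a dense retriever.
bm25_ranking = ["sku-42", "doc-a", "doc-b"]
dense_ranking = ["doc-a", "doc-c", "sku-42"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Only rank positions enter the formula, so a retriever that emits unusually large raw scores cannot dominate the fused list.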

The Best Reranker for RAG in 2026: Cohere vs Jina vs BGE

A reranker is the cheapest large win left in a RAG pipeline — a stateless model you bolt on after retrieval. The trap is choosing one by leaderboard rank instead of the two things that actually decide it.

Cross-Encoder vs Bi-Encoder: Why Your Retriever and Your Reranker Can't Be the Same Model

They read like rivals you choose between. They're two stages of one pipeline, forced apart by a single computational fact — and that fact tells you exactly where each one belongs.

ColBERT vs Dense vs Sparse Retrieval: When Late Interaction Is Worth It

Dense, sparse, and late-interaction retrieval aren't a quality ladder. They're three answers to one question — where does the matching cost live — and the answer decides your storage bill.

GraphRAG vs Vector RAG: When a Knowledge Graph Actually Earns Its Cost

Microsoft GraphRAG, LightRAG, and LazyGraphRAG all promise smarter retrieval. The honest question isn't which to pick — it's whether your queries are the kind a graph can even help.

RAPTOR vs Naive RAG: When Hierarchical Retrieval Actually Wins

Flat top-k retrieval returns the chunks most similar to your query. For "what is this document about?" that's exactly the wrong thing. RAPTOR retrieves at the right altitude instead.

Self-RAG vs Corrective RAG: Two Ways to Make Retrieval Check Itself

Both bolt a quality check onto RAG, but they fix different failures at different points — and the choice comes down to one question: do you control the model's weights?

How to Evaluate a RAG Pipeline: The Metrics That Predict Quality

Most RAG failures are retrieval failures wearing a generation costume — so measure the two halves separately or you'll tune the wrong one for weeks.

Retrieval Metrics for RAG: Recall@k vs MRR vs NDCG (and Which One Actually Matters)

Search teams optimize NDCG. RAG teams copy them — and measure the wrong thing. For a pipeline that hands the whole top-k to a generator, recall is the floor and rank position is a second-order correction.
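
To make the distinction concrete, the sketch below computes the three metrics for a single query under binary relevance. The function names are illustrative, and averaging over a query set (which turns reciprocal rank into MRR) is left out for brevity.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# Two hypothetical runs that retrieve the same documents in different orders.
relevant = {"a", "b"}
well_ranked = ["a", "b", "x", "y"]
badly_ranked = ["x", "y", "a", "b"]
```

Both runs have identical recall at k = 4 because the same relevant documents reach the generator. Reciprocal rank and NDCG drop for the second run, since those two metrics are the ones that measure ordering.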