The Best Chunking Strategy for RAG in 2026: Fixed vs Semantic vs Late Chunking

Open any RAG post-mortem and you'll find the same experiment, run again: someone has swept chunk size from 256 to 1,024 tokens, plotted retrieval accuracy, and declared a winner by a few points. It is the most over-run test in the field, and it has a quiet conclusion almost nobody acts on. The chunk-size knob has a low ceiling. The teams who actually moved their numbers in the last year stopped optimizing how they cut documents and started fixing what each cut throws away.

First, the boring answer is the right one

Start with the unglamorous default, because the benchmarks keep vindicating it. In Firecrawl's 2025 comparison, plain recursive character splitting at ~512 tokens landed around 69% accuracy — and the fancier-sounding option, semantic chunking, came in at 54%. The reason is mundane: semantic chunkers split wherever embedding similarity dips, which on real prose produces a spray of tiny fragments — averaging about 43 tokens in that test. A 43-token chunk is too small to answer anything; it retrieves a sentence and strands the LLM without the surrounding claim.
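To see why similarity-based splitting fragments so easily, here is a minimal sketch of the general idea, not any particular library's implementation. The fixed `threshold` and the toy 2-d vectors are assumptions for illustration; real chunkers use actual sentence embeddings and often a percentile-based cutoff.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_split(sentences, vectors, threshold=0.5):
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append([])  # similarity dipped: open a new chunk
        chunks[-1].append(sentences[i])
    return [" ".join(c) for c in chunks]

# Toy vectors standing in for real sentence embeddings (hypothetical values).
sentences = [
    "We shipped v2.",
    "It cut latency by 40%.",
    "Pricing is unchanged.",
    "Support hours moved.",
]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (1.0, 0.2)]
chunks = semantic_split(sentences, vectors)
```

Every topic shift opens a new chunk, so prose that changes subject sentence by sentence comes out as one-sentence fragments unless something enforces a minimum size.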

"Semantic chunking" sounds like the sophisticated choice and benchmarks like the naive one. The word semantic is doing marketing the algorithm can't cash.

So the practical floor is clear: recursive splitting at ~512 tokens with 10-20% overlap. If you insist on semantic boundaries, enforce a minimum chunk size and merge fragments up to 200-400 tokens. That gets most teams a working pipeline. It will not get them a great one, because all of these methods share the same defect.
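A rough sketch of that default, assuming whitespace-separated words as a stand-in for tokens (swap in your embedding model's tokenizer). The separator list and the helper names `recursive_split`, `pack`, and `chunk` are illustrative choices, not a specific library's API. Note that this simplified version drops the separators it splits on.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")  # coarsest to finest

def n_tokens(text):
    # Crude proxy: whitespace-separated words, not real model tokens.
    return len(text.split())

def recursive_split(text, max_tokens, seps=SEPARATORS):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if n_tokens(text) <= max_tokens or not seps:
        return [text]
    pieces = []
    for part in text.split(seps[0]):
        if part.strip():
            pieces.extend(recursive_split(part, max_tokens, seps[1:]))
    return pieces

def pack(pieces, max_tokens=512, overlap=64):
    """Greedily pack pieces into chunks, carrying the last `overlap` words forward."""
    chunks, current = [], []
    for piece in pieces:
        words = piece.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks

def chunk(text, max_tokens=512, overlap=64):
    # Pieces are capped at max_tokens - overlap so a chunk that starts with
    # carried-over words still fits within max_tokens.
    return pack(recursive_split(text, max_tokens - overlap), max_tokens, overlap)

document = " ".join(f"w{i}" for i in range(1200))
chunks = chunk(document)  # 64 of 512 words is 12.5% overlap
```

The same `pack` helper can act as the size floor for a semantic chunker: something like `pack(fragments, max_tokens=400, overlap=0)` merges adjacent fragments until each chunk approaches 400 words.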

The defect every splitter shares

Cut a document into pieces and embed each piece alone, and every chunk loses the thing that made it meaningful: its place in the document. The sentence "It cut latency by 40%" is useless in isolation: what cut latency, in which release, for which workload? The bi-encoder embeds that orphaned sentence as if the surrounding section never existed. No chunk-size sweep fixes this, because the information is discarded by the cut itself, whatever the token count.

This is the realization that reframed the whole problem in the past year. The frontier moved from splitting to context injection: don't just decide where to cut — give each chunk back what the cut removed. Two methods do this, and they trade off on exactly one axis: cost.

Method one: contextual retrieval (pay an LLM per chunk)

Anthropic's contextual retrieval, published in September 2024, is the brute-force version, and it works. Before embedding each chunk, you call an LLM to write a short blurb situating that chunk in its document ("This is from the Q3 earnings call, discussing the inference-cost reduction") and prepend it. The chunk now embeds with its context. Anthropic reported that contextual embeddings combined with contextual BM25 cut failed retrievals by 49%, and by 67% once a reranker was added.
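In code, the indexing step might look something like this. It is a sketch under assumptions: the prompt wording is a paraphrase of the idea rather than Anthropic's published prompt, `llm` is a hypothetical callable you would back with a real model client, and `fake_llm` exists only so the example runs offline.

```python
from typing import Callable, List

# Illustrative prompt, not the published one.
PROMPT = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write one or two sentences situating this chunk within the document "
    "to improve search retrieval. Answer with the context only."
)

def contextualize(document: str, chunks: List[str], llm: Callable[[str], str]) -> List[str]:
    """Prepend an LLM-written context blurb to each chunk before embedding."""
    out = []
    for chunk in chunks:
        blurb = llm(PROMPT.format(document=document, chunk=chunk)).strip()
        out.append(f"{blurb}\n\n{chunk}")
    return out

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (hypothetical).
    return "This is from the Q3 earnings call, discussing the inference-cost reduction."

texts = contextualize("(full transcript here)", ["It cut latency by 40%."], fake_llm)
```

The strings in `texts` are what you embed (and index for keyword search) in place of the bare chunks. Note the loop makes one LLM call per chunk.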

The catch is in the method name: you are running an LLM call over every chunk in your corpus. For a large, static knowledge base, that's a real one-time bill (prompt caching softens it, but doesn't erase it). For a corpus that churns, you pay it again and again. Contextual retrieval is the right tool when accuracy dominates and the corpus is bounded.
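A back-of-envelope way to size that bill, counted in input tokens rather than dollars. Everything here is an assumption for illustration: each call is taken to send the whole document plus one chunk, `cache_read_rate` is a hypothetical fraction of normal price charged for cached document tokens (1.0 means no caching), and cache-write surcharges and output tokens are ignored.

```python
import math

def context_pass_tokens(doc_tokens, chunk_tokens=512, cache_read_rate=1.0):
    """Full-price-equivalent input tokens to contextualize one document."""
    n_chunks = math.ceil(doc_tokens / chunk_tokens)
    # The first call pays for the document in full.
    first_call = doc_tokens + chunk_tokens
    # Later calls re-read the document at the (possibly discounted) cached rate.
    later_calls = (n_chunks - 1) * (doc_tokens * cache_read_rate + chunk_tokens)
    return first_call + later_calls

uncached = context_pass_tokens(8192)
cached = context_pass_tokens(8192, cache_read_rate=0.1)  # 0.1 is a made-up rate
```

Without caching the cost grows with document length times chunk count, so it is roughly quadratic in document length. Caching flattens that curve, but the per-chunk term remains, and a churning corpus pays it on every re-index.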

Method two: late chunking (the embedding model already knows)

Jina's late chunking is the cheaper insight, and the more elegant one. Instead of chunk-then-embed, you embed-then-chunk: run a long-context embedding model over the entire document first, producing token-level embeddings that have already attended across the whole text — then pool those token vectors into chunk vectors afterward. Each chunk's embedding carries the document's context not because you wrote a blurb, but because the transformer saw the whole document before you ever drew the boundaries.
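The pooling step is small enough to sketch directly. This assumes mean pooling and takes the token vectors as given: the toy `token_vectors` below stand in for the per-token output of a long-context embedding model run once over the full document, and the span boundaries are hypothetical.

```python
def late_chunk(token_vectors, spans):
    """Mean-pool contextualized token vectors over each (start, end) token span."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled

# Toy per-token vectors for ONE whole document (made-up values).
token_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [4.0, 0.0]]
spans = [(0, 2), (2, 5)]  # chunk boundaries, chosen after the model has run
chunk_vectors = late_chunk(token_vectors, spans)
```

The only difference from ordinary chunking is where the token vectors come from. Here each one was computed with attention over the entire document, so the pooled chunk vector inherits that context.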

The decisive property: late chunking uses only the embedding model, with no separate LLM pass. You get much of contextual retrieval's context preservation at roughly the cost of ordinary embedding, and the gains grow with document length, which is exactly where naive chunking hurts most. It needs a long-context embedding model (Jina's embeddings v2, v3, and v4 support it), and on short documents the benefit is small. But for long, reference-style corpora, it's the most cost-efficient move on the board.

The decision, in one line

Don't run the chunk-size sweep first; run it last. Start with recursive splitting at ~512 tokens: it beats the clever-sounding alternatives and gets you a baseline today. Then spend your effort where the ceiling actually is. If your corpus is bounded and accuracy is everything, layer on contextual retrieval and accept the per-chunk LLM bill. If you care about cost and your documents are long, reach for late chunking and let the embedding model carry the context with no extra LLM pass. The chunk-size debate isn't wrong; it's just finished. The next jump in retrieval accuracy isn't hiding in a better place to cut. It's in refusing to let the cut erase what the chunk was about.
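For readers who prefer the rule as code, here is one way to write it down. The function name and boolean flags are hypothetical, and what counts as "long" or "bounded" is left for you to judge against your own corpus.

```python
def choose_strategy(bounded_corpus, accuracy_first, long_documents, cost_sensitive):
    """Return the order of work suggested above, as a list of steps."""
    steps = ["recursive splitting, ~512 tokens, 10-20% overlap"]
    if bounded_corpus and accuracy_first:
        steps.append("contextual retrieval")
    elif cost_sensitive and long_documents:
        steps.append("late chunking")
    steps.append("chunk-size sweep")  # last, not first
    return steps

plan = choose_strategy(
    bounded_corpus=False,
    accuracy_first=False,
    long_documents=True,
    cost_sensitive=True,
)
```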

Frequently asked

What is the best chunk size for RAG?

For most corpora, recursive character splitting at roughly 512 tokens with ~10-20% overlap is the strongest default. One widely cited 2025 benchmark put it around 69% accuracy, ahead of semantic chunking at 54%. Treat that as a starting point and confirm on your own eval set, because the right size depends on document type and query style.

Is semantic chunking better than fixed-size chunking?

Usually not, despite the name. Semantic chunking splits on embedding-similarity boundaries, which often yields very short fragments (averaging about 43 tokens in one benchmark) that underperform plain recursive splitting. If you use it, enforce a minimum-size floor and merge fragments up to 200-400 tokens.

What is late chunking and how is it different from contextual retrieval?

Both fix the same problem — a chunk losing the context of the document around it — but differently. Contextual retrieval uses an LLM to prepend a short context blurb to each chunk before embedding (accurate, but you pay per chunk). Late chunking runs the embedding model over the whole document first and pools the token embeddings into chunks afterward, so each chunk vector already carries document context — using only the embedding model, which is much cheaper at index time.

Priya Sundaram
