Open any RAG post-mortem and you'll find the same experiment, run again: someone has swept chunk size from 256 to 1,024 tokens, plotted retrieval accuracy, and declared a winner by a few points. It is the most over-run test in the field, and it has a quiet conclusion almost nobody acts on. The chunk-size knob has a low ceiling. The teams who actually moved their numbers in the last year stopped optimizing how they cut documents and started fixing what each cut throws away.
First, the boring answer is the right one
Start with the unglamorous default, because the benchmarks keep vindicating it. In Firecrawl's 2025 comparison, plain recursive character splitting at ~512 tokens landed around 69% accuracy — and the fancier-sounding option, semantic chunking, came in at 54%. The reason is mundane: semantic chunkers split wherever embedding similarity dips, which on real prose produces a spray of tiny fragments — averaging about 43 tokens in that test. A 43-token chunk is too small to answer anything; it retrieves a sentence and strands the LLM without the surrounding claim.
"Semantic chunking" sounds like the sophisticated choice and benchmarks like the naive one. The word semantic is doing marketing the algorithm can't cash.
So the practical floor is clear: recursive splitting, ~512 tokens, 10-20% overlap, and if you insist on semantic boundaries, enforce a minimum chunk size and merge fragments up to 200-400 tokens. That gets most teams a working pipeline. It will not get them a great one, because all of these methods share the same defect.
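As a rough illustration of that floor, here is a minimal sketch of a recursive splitter with overlap and a fragment-merging pass. It is not any particular library's implementation: the function names, the separator list, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions made for the example.

```python
from typing import List, Sequence

# Assumption: separators are tried from coarsest to finest.
SEPARATORS = ("\n\n", "\n", ". ", " ")


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def recursive_split(text: str, max_tokens: int = 512,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split on the coarsest separator, recursing only into oversized pieces."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every max_tokens words.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        candidate = buffer + sep + piece if buffer else piece
        if count_tokens(candidate) <= max_tokens:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if count_tokens(piece) > max_tokens:
            # The piece alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_tokens, finer))
            buffer = ""
        else:
            buffer = piece
    if buffer:
        chunks.append(buffer)
    return chunks


def add_overlap(chunks: List[str], overlap_tokens: int = 64) -> List[str]:
    """Prefix each chunk with the tail of the previous one (~10-20% of 512)."""
    out: List[str] = []
    for i, chunk in enumerate(chunks):
        if i == 0 or overlap_tokens <= 0:
            out.append(chunk)
        else:
            tail = chunks[i - 1].split()[-overlap_tokens:]
            out.append(" ".join(tail) + " " + chunk)
    return out


def merge_small(chunks: List[str], min_tokens: int = 200) -> List[str]:
    """Fold each chunk into its predecessor while the predecessor is too small."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and count_tokens(merged[-1]) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```

In this sketch the overlap is added after splitting, so overlapped chunks can exceed `max_tokens` by up to `overlap_tokens`. `merge_small` is the guard you would run over a semantic chunker's output; it can still leave one undersized chunk at the very end.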
The defect every splitter shares
Cut a document into pieces and embed each piece alone, and every chunk loses the thing that made it meaningful: its place in the document. The sentence "It cut latency by 40%" is useless in isolation: what cut latency, in which release, for which workload? The bi-encoder, which encodes each chunk on its own, embeds that orphaned sentence as if the surrounding section never existed. No chunk-size sweep fixes this, because the information is lost by cutting at all, not by cutting at the wrong size.
This is the realization that reframed the whole problem in the past year. The frontier moved from splitting to context injection: don't just decide where to cut — give each chunk back what the cut removed. Two methods do this, and they trade off on exactly one axis: cost.
Method one: contextual retrieval (pay an LLM per chunk)
Anthropic's contextual retrieval, published in September 2024, is the brute-force version and it works. Before embedding each chunk, you call an LLM to write a short blurb situating that chunk in its document ("This is from the Q3 earnings call, discussing the inference-cost reduction") and prepend it. The chunk now embeds with its context. Anthropic reported that the contextualized embeddings alone cut failed retrievals by 35%, by 49% when the contextualized chunks were also indexed with BM25, and by 67% with a reranker on top.
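The prepend step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Anthropic's code: `generate` is a hypothetical callable wrapping whatever LLM client you use, and the prompt wording is illustrative rather than the published prompt.

```python
from typing import Callable, List

# Assumption: illustrative prompt wording, not the prompt Anthropic published.
PROMPT_TEMPLATE = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from that document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context that situates this chunk within the document, "
    "to improve search retrieval of the chunk. Answer with the context only."
)


def contextualize_chunks(document: str, chunks: List[str],
                         generate: Callable[[str], str]) -> List[str]:
    """Return chunks prefixed with an LLM-written blurb, ready to embed.

    `generate` is a hypothetical prompt -> completion function; it is called
    once per chunk, which is where the per-chunk cost comes from.
    """
    contextualized: List[str] = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(document=document, chunk=chunk)
        blurb = generate(prompt).strip()
        # Fall back to the bare chunk if the model returns nothing.
        contextualized.append(f"{blurb}\n\n{chunk}" if blurb else chunk)
    return contextualized
```

The returned strings are what you embed and index; the original chunk text can still be stored separately if you want to show it to the user without the blurb.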
The catch is in the method name: you are running an LLM call over every chunk in your corpus. For a large, static knowledge base, that's a real one-time bill (prompt caching softens it, but doesn't erase it). For a corpus that churns, you pay it again and again. Contextual retrieval is the right tool when accuracy dominates and the corpus is bounded.
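To get a feel for the size of that bill, a back-of-the-envelope sketch helps. It assumes each call sends the full document plus one chunk, ignores overlap and prompt caching, and uses hypothetical function names; treat the result as an uncached upper bound on input tokens, not a price.

```python
import math
from typing import Iterable


def contextualization_calls(doc_token_counts: Iterable[int],
                            chunk_tokens: int = 512) -> int:
    """One LLM call per chunk, summed over every document in the corpus."""
    return sum(math.ceil(n / chunk_tokens) for n in doc_token_counts)


def uncached_input_tokens(doc_token_counts: Iterable[int],
                          chunk_tokens: int = 512) -> int:
    """Input tokens if every call resends the whole document plus one chunk.

    Assumption: no prompt caching and no chunk overlap.
    """
    total = 0
    for n in doc_token_counts:
        calls = math.ceil(n / chunk_tokens)
        # calls * n for the repeated document, plus n for the chunks themselves.
        total += calls * n + n
    return total


# A single 10,240-token document at 512-token chunks:
print(contextualization_calls([10_240]))   # 20
print(uncached_input_tokens([10_240]))     # 215040
```

The repeated-document term grows with the square of document length, which is why prompt caching matters most on long documents. For a corpus that churns, rerun the same arithmetic over the documents that changed.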
Method two: late chunking (the embedding model already knows)
Jina's late chunking is the cheaper insight, and the more elegant one. Instead of chunk-then-embed, you embed-then-chunk: run a long-context embedding model over the entire document first, producing token-level embeddings that have already attended across the whole text — then pool those token vectors into chunk vectors afterward. Each chunk's embedding carries the document's context not because you wrote a blurb, but because the transformer saw the whole document before you ever drew the boundaries.
The decisive property: late chunking uses only the embedding model, with no separate LLM pass. You get much of contextual retrieval's context-preservation at roughly the cost of ordinary embedding, and the gains grow with document length, which is exactly where naive chunking hurts most. It needs a long-context embedding model (Jina's v2, v3, and v4 models support it), and on short documents the benefit is small. But for long, reference-style corpora, it's the most cost-efficient move on the board.
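The embed-then-chunk order comes down to a pooling step, sketched below. This is a minimal illustration, not Jina's implementation: it assumes you already have one contextualized vector per token from a single pass of a long-context embedding model over the whole document, and it uses mean pooling over hypothetical `(start, end)` token spans.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average a non-empty run of equal-length vectors, dimension by dimension."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def late_chunk(token_embeddings: Sequence[Vector],
               spans: Sequence[Tuple[int, int]]) -> List[Vector]:
    """Pool document-level token embeddings into one vector per chunk.

    token_embeddings: per-token vectors produced by ONE forward pass over the
        full document (assumption: supplied by your embedding model).
    spans: (start, end) token indices per chunk, end exclusive.
    """
    chunk_vectors: List[Vector] = []
    for start, end in spans:
        if not 0 <= start < end <= len(token_embeddings):
            raise ValueError(f"invalid span ({start}, {end})")
        chunk_vectors.append(mean_pool(token_embeddings[start:end]))
    return chunk_vectors


# Toy 2-dimensional "token embeddings" for a 4-token document, cut into 2 chunks.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
print(late_chunk(tokens, [(0, 2), (2, 4)]))  # [[2.0, 1.0], [6.0, 5.0]]
```

The boundaries in `spans` can come from any splitter, including plain recursive splitting. The only difference from the usual pipeline is that the transformer has already attended across the full document before the vectors are averaged.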
The decision, in one line
Don't run the chunk-size sweep first; run it last. Start with recursive splitting at ~512 tokens: it beats the clever-sounding alternatives and gets you a baseline today. Then spend your effort where the ceiling actually is. If your corpus is bounded and accuracy is everything, layer on contextual retrieval and accept the per-chunk LLM bill. If you care about cost and your documents are long, reach for late chunking and let the embedding model carry the context without a separate LLM pass. The chunk-size debate isn't wrong; it's just finished. The next 20 points of retrieval accuracy aren't hiding in a better place to cut. They're in refusing to let the cut erase what the chunk was about.



