You chunk your documents at 256 tokens because small chunks retrieve well — the embedding of a short passage is sharp, undiluted, and lands close to a focused query. Then a user asks a question, your retriever nails the exact passage, and the answer is wrong anyway, because the 256 tokens that matched the query don't contain enough surrounding context for the model to actually answer. The pronoun's antecedent was two sentences up. The number was in the table caption you split away. The match was perfect and useless.
This is the central tension of chunking: the chunk size that retrieves best is not the chunk size that answers best. Small chunks win retrieval; large chunks win synthesis. For years the standard advice was to pick a compromise size and live with it. The better answer is to refuse the compromise.
Decouple the two chunks
LlamaIndex's production-RAG documentation states the principle outright: "decouple chunks used for retrieval vs. chunks used for synthesis." Embed and search a small unit so your vectors are precise and your matches are sharp. But when a small unit hits, don't hand it to the LLM — hand it the larger unit it lives inside. Retrieve small, synthesize big.
The community label for this is small-to-big retrieval, and three named patterns implement it. They are not really three different ideas. They are three variations on one idea (expand a precise hit into its surrounding context) that differ only in how they define "the surrounding context" and when they expand. Knowing which to reach for is mostly a question of what your documents look like.
Retrieve on the unit that matches the query. Generate on the unit that contains the answer. They are almost never the same size.
Parent Document Retrieval (LangChain)
LangChain's ParentDocumentRetriever is the most literal implementation. You give it two splitters. The child_splitter cuts documents into small chunks that get embedded and stored in the vector store. The parent_splitter (optional) defines the larger unit. At query time the retriever searches the small child embeddings, then looks up and returns the parents of whatever children matched.
It has two modes. Leave parent_splitter unset and the parent is the entire original document — search small, return whole files. Set both splitters (say parent at 2000 characters, child at 400) and the parent is a larger chunk, not the whole document. LangChain's own framing of the tradeoff is exactly the tension above: you want documents "small enough that their embeddings can most accurately reflect their meaning" yet "long enough that the context of each chunk is retained."
Reach for this when your corpus has natural parents — documents with clear sections, pages, or files where "the thing that contains the answer" is an obvious larger unit. The failure mode is an oversized parent: return a full 40-page PDF for a one-sentence match and you've spent your context budget and reintroduced the lost-in-the-middle problem that small chunks were supposed to avoid.
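To make the two-splitter flow concrete, here is a minimal library-free sketch. It is not LangChain's implementation: the helper names (split_chars, build_index, retrieve_parents) are hypothetical, the splitter is a naive fixed-width character cut, and a toy word-overlap score stands in for embedding similarity.

```python
import re
from collections import Counter

def split_chars(text, size):
    """Naive splitter: consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def word_counts(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query, passage):
    """Toy relevance: shared word occurrences (assumed stand-in for embeddings)."""
    q, p = word_counts(query), word_counts(passage)
    return sum(min(q[w], p[w]) for w in q)

def build_index(docs, parent_size, child_size):
    parents = {}   # parent_id -> parent text (plays the role of a docstore)
    children = []  # (child text, parent_id) pairs (plays the role of a vector store)
    for doc_id, doc in enumerate(docs):
        for p_idx, parent in enumerate(split_chars(doc, parent_size)):
            parent_id = (doc_id, p_idx)
            parents[parent_id] = parent
            for child in split_chars(parent, child_size):
                children.append((child, parent_id))
    return parents, children

def retrieve_parents(query, parents, children, k=2):
    """Search the small children, return their de-duplicated parents."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c[0]), reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results

docs = [
    "The pump is rated for 40 psi. It must be primed before first use. Never run it dry.",
    "The warranty lasts two years. Claims require the original receipt.",
]
parents, children = build_index(docs, parent_size=200, child_size=40)
hits = retrieve_parents("must it be primed", parents, children, k=1)
```

The best-matching child here is a 40-character fragment that is cut off mid-sentence, yet what comes back is the full parent it belongs to. De-duplicating parents matters once k grows, since several children of the same parent often match the same query.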
Sentence-Window Retrieval (LlamaIndex)
LlamaIndex's SentenceWindowNodeParser takes the idea to its smallest possible retrieval unit: a single sentence. Each sentence becomes a node, embedded on its own, but the node also stashes a window of the surrounding sentences in its metadata. The default window_size is 3 — three sentences on each side. (Note for the blog-skimmers: several write-ups claim the default is five; the source code says three.)
The expansion happens at synthesis, not retrieval. After the best-matching sentences come back, a MetadataReplacementPostProcessor swaps each retrieved sentence for its stored window before the text reaches the LLM. So you get the precision of sentence-level matching with the context of a short paragraph, and the size of that paragraph is fixed and predictable: up to window_size sentences on each side of the hit.
This is the right tool for dense prose — research papers, contracts, documentation — where the match is genuinely a single sentence but understanding it requires the sentences around it. The knob to watch is window_size: too tight and you're back to context-starved answers; too wide and every hit balloons.
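The store-then-swap mechanism can be sketched in plain Python. This is an illustration under simplifying assumptions, not LlamaIndex's code: the regex sentence splitter is naive, the node is a plain dict, and the function names and the "window" metadata key are chosen here for illustration.

```python
import re

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def build_sentence_nodes(text, window_size=3):
    """One node per sentence; the surrounding window is stashed in metadata."""
    sentences = split_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,  # the unit that is embedded and matched
            "metadata": {"window": " ".join(sentences[lo:hi])},  # the unit the LLM reads
        })
    return nodes

def replace_with_window(retrieved_nodes):
    """Post-retrieval step: swap each hit's text for its stored window."""
    return [node["metadata"]["window"] for node in retrieved_nodes]

text = ("Acme was founded in 1999. It moved to Oslo in 2004. "
        "Revenue doubled the next year. The CEO resigned in 2010. "
        "A new board was appointed.")
nodes = build_sentence_nodes(text, window_size=1)
hit = nodes[2]  # pretend the retriever matched this sentence
context = replace_with_window([hit])[0]
```

In this toy input the matched sentence says "the next year", which only resolves once the preceding sentence is back in view. A window of one sentence on each side is enough to restore it; the window is clipped at the document's start and end, so edge sentences get a smaller context.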
Auto-Merging Retrieval (LlamaIndex)
The third pattern makes expansion dynamic. HierarchicalNodeParser parses each document into a hierarchy of chunk sizes at once — the default chunk_sizes are [2048, 512, 128], so every document exists simultaneously as big, medium, and small nodes, with children pointing at their parents. You embed and retrieve the leaves (the 128-token nodes).
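A rough sketch of what "a hierarchy of chunk sizes at once" means, assuming a document that is already tokenized into a flat list. The function build_hierarchy is hypothetical, and the fixed-width cuts are a simplification; the real parser's splitting rules are not reproduced here.

```python
def build_hierarchy(tokens, chunk_sizes=(2048, 512, 128)):
    """Split tokens at every level; each node records a pointer to its parent."""
    nodes = {}  # node_id -> {"level", "tokens", "parent"}

    def split(span, level, parent_id):
        size = chunk_sizes[level]
        for start in range(0, len(span), size):
            piece = span[start:start + size]
            node_id = len(nodes)
            nodes[node_id] = {"level": level, "tokens": piece, "parent": parent_id}
            if level + 1 < len(chunk_sizes):
                split(piece, level + 1, node_id)  # children are cut from their parent

    split(tokens, 0, None)
    return nodes

doc_tokens = list(range(4096))  # stand-in for a 4096-token document
nodes = build_hierarchy(doc_tokens)
last_level = 2  # index of the smallest chunk size (128)
leaves = [n for n in nodes.values() if n["level"] == last_level]
```

Only the leaves would be embedded. The larger nodes exist so that a leaf can be traded up to its parent later, which is why every node keeps a parent pointer.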
Here's the move: AutoMergingRetriever watches which leaves come back, and if enough leaves of the same parent are retrieved, it removes those leaves from the result set and substitutes the parent — recursively, up the tree. "Enough" is the simple_ratio_thresh parameter, default 0.5: for each parent it computes the fraction of that parent's children that were retrieved, and if the ratio clears the threshold it merges them into the parent (scoring the parent as the average of the merged children's scores).
The behavior this produces is exactly what you want. A single stray leaf that matched some incidental phrase stays small and isolated. But when a query's answer is genuinely spread across one section, several of that section's leaves all hit, the ratio clears the threshold, and they collapse into one coherent parent chunk instead of five fragments your LLM has to stitch together. It's parent-document retrieval that only fires when the evidence concentrates.
This earns its complexity on long, structured documents with a real size hierarchy — manuals, books, large codebases. On flat, short documents there's nothing meaningful to merge and you've added machinery for no gain.
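The merge rule itself is small enough to sketch in plain Python. This is an illustration of the ratio test described above, not the library's code: the function auto_merge and the dict-based tree are hypothetical, and the strictly-greater-than comparison against the threshold is an assumption of this sketch.

```python
def auto_merge(hits, parent_of, children_of, ratio_thresh=0.5):
    """hits maps retrieved node ids to scores; returns the merged result set."""
    merged = dict(hits)
    changed = True
    while changed:  # repeat so merges can propagate up the tree
        changed = False
        by_parent = {}
        for node_id in merged:
            parent = parent_of.get(node_id)
            if parent is not None:
                by_parent.setdefault(parent, []).append(node_id)
        for parent, kids in by_parent.items():
            ratio = len(kids) / len(children_of[parent])
            if ratio > ratio_thresh:  # assumed strict comparison
                average = sum(merged[k] for k in kids) / len(kids)
                for k in kids:
                    del merged[k]     # drop the fragments
                merged[parent] = average  # substitute the parent
                changed = True
    return merged

children_of = {
    "doc": ["sec1", "sec2"],
    "sec1": ["a", "b", "c", "d"],
    "sec2": ["e", "f", "g", "h"],
}
parent_of = {kid: parent for parent, kids in children_of.items() for kid in kids}

# Evidence concentrates in sec1; "e" is a stray match.
concentrated = auto_merge({"a": 0.75, "b": 0.5, "c": 0.25, "e": 0.4}, parent_of, children_of)

# Evidence concentrates in both sections, so the merge continues up to the document.
spread = auto_merge({k: 0.5 for k in "abcefg"}, parent_of, children_of)
```

In the first call three of sec1's four leaves hit, so they collapse into sec1 with the average score, while the stray leaf "e" stays as it is. In the second call both sections merge, and then the two sections together clear the threshold for their shared parent.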
How to choose
- Documents with obvious parents (files, pages, sections), and you're in the LangChain ecosystem: ParentDocumentRetriever. Start with larger parent chunks, not whole documents, unless your documents are short.
- Dense prose where the answer is a sentence-in-context: sentence-window. It's the cheapest contextual win: predictable expansion, one knob.
- Long structured documents and you want expansion to follow the evidence: auto-merging. Pay the setup cost only when the hierarchy is real.
- Short, self-contained chunks already: none of the above. The chunk is the answer; expansion buys nothing and costs tokens.
The unifying lesson is worth stating plainly, because it generalizes past these three classes. A chunking strategy that optimizes a single chunk size is solving the wrong problem — it's trying to make one unit good at two jobs that pull in opposite directions. Retrieval wants precision; synthesis wants context. The fix isn't a better compromise size. It's two sizes, and a rule for getting from the small one to the big one.



