"How do I summarize a document that's too long for the model?" is one of the most-asked questions in applied LLM work, and it has a canonical set of answers — stuff, map-reduce, refine — that were codified back when a long context was 8,000 tokens. Two things are worth knowing before you reach for any of them: each classic strategy fails in a specific way you can predict, and in 2026 the honest answer to a lot of these questions is that you shouldn't be summarizing at all.

The three classic chains, and where each breaks

Stuff is the trivial one: put the whole document in a single prompt and ask for a summary. It is lossless — no intermediate compression — and it is one call. It's also not a strategy for "too long," because the entire premise is that the document fits. Use it whenever it fits; the rest of this piece is about when it doesn't.

Map-reduce is the reflex answer. Chunk the document, summarize each chunk independently (this maps beautifully across workers), then combine the partial summaries into a final one. It has two failure modes, and the first is genuinely counterintuitive: the reduce step can overflow on its own. If you have hundreds of chunks, their partial summaries can add up to more than the context window — so the combine step hits the exact wall you adopted map-reduce to get around. LangChain's implementation handles this with a recursive collapse: summarize groups of summaries until the set is small enough to reduce in one call. It works, but every collapse layer is another lossy pass stacked on the last.

The second failure mode is quieter: a summary-of-summaries loses cross-chunk connections. If the argument on page 3 only makes sense given the definition on page 40, map-reduce never sees them together — each chunk was summarized in isolation, and the reduce step is working from lossy fragments, not the source.
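The map-reduce flow, including a recursive collapse of the kind described above, can be sketched roughly as follows. This is a minimal illustration and not LangChain's code: `summarize` is a hypothetical callable standing in for a model call, `count_tokens` is a crude word count rather than a real tokenizer, and `budget` is an assumed token limit for the reduce prompt.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude stand-in (assumption): whitespace-separated words, not a real tokenizer.
    return len(text.split())


def group_by_budget(texts: List[str], budget: int) -> List[List[str]]:
    """Pack consecutive texts into groups whose combined size stays within budget."""
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text in texts:
        size = count_tokens(text)
        if current and used + size > budget:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += size
    if current:
        groups.append(current)
    return groups


def map_reduce_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],  # hypothetical model call
    budget: int,
) -> str:
    # Map: every chunk is summarized in isolation (this loop could run in parallel).
    partials = [summarize(chunk) for chunk in chunks]

    # Collapse: while the partial summaries would overflow the reduce prompt,
    # summarize groups of them. Each pass is another lossy layer.
    while len(partials) > 1 and sum(count_tokens(p) for p in partials) > budget:
        groups = group_by_budget(partials, budget)
        if len(groups) == len(partials):
            break  # nothing could be merged; stop rather than loop forever
        partials = [summarize("\n\n".join(group)) for group in groups]

    # Reduce: one final call over whatever is left.
    return summarize("\n\n".join(partials))
```

Note that nothing in this loop ever revisits the source chunks: once the map step has run, every later call sees only summaries, which is where the cross-chunk connections get lost.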

Refine trades parallelism for continuity. It walks the chunks in order, building a running summary: summarize chunk one, then hand that summary plus chunk two to the model and ask it to refine, and so on. Because each step conditions on the accumulated summary, it preserves the continuity that map-reduce loses. The cost is the other half of that trade: the process is strictly sequential (chunk n can't start until chunk n−1 finishes), so it doesn't parallelize and is slow on long inputs. It is also fragile, because an error introduced early rides along through every later step.
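A minimal sketch of the refine loop makes the sequential dependency visible. The two callables are hypothetical stand-ins for model calls: `initial` summarizes the first chunk, and `refine` takes the running summary plus the next chunk and returns an updated summary.

```python
from typing import Callable, Iterable, Optional


def refine_summarize(
    chunks: Iterable[str],
    initial: Callable[[str], str],      # hypothetical: summarize the first chunk
    refine: Callable[[str, str], str],  # hypothetical: (running summary, next chunk) -> summary
) -> str:
    summary: Optional[str] = None
    for chunk in chunks:
        if summary is None:
            summary = initial(chunk)
        else:
            # Each step needs the previous step's output, so the loop
            # cannot be parallelized across chunks.
            summary = refine(summary, chunk)
    if summary is None:
        raise ValueError("no chunks to summarize")
    return summary
```

Because `summary` is the only state carried forward, anything the model drops or distorts at step one is all that later steps have to work with.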

Map-reduce is fast and forgets the connections. Refine remembers the connections and can't be made fast. That tension is the whole design space — until you change the size of the window.

The 2026 reframe: maybe don't summarize

Here's what's changed. The classic chains exist because the document didn't fit. In 2026, a lot of documents fit. Context windows reach 1M tokens on Claude Opus and Sonnet and on GPT-4.1, and 2M tokens on Gemini 1.5 Pro. A contract, a codebase, a research paper, a quarter's worth of support tickets: things that forced a summarization pipeline three years ago now drop into a single prompt. And when the document fits, stuffing it whole tends to beat a multi-stage pipeline, because you've stopped throwing information away at every intermediate step.

So the first question isn't "map-reduce or refine?" It's "does this even need a summarization chain, or can I just put the whole thing in the window and ask?"

The catch — and it is a real one — is that a big window is not big recall. The "Lost in the Middle" work showed model accuracy follows a U-shaped curve across the context: information at the very beginning and the very end is used well, and information in the middle is used worst, with a dip large enough to matter even on models built for long context. The 2025-2026 framing of "context rot" generalizes it: accuracy drifts down as the input grows, sometimes noticeably before you hit the advertised limit. Stuffing a million tokens is a legitimate move, but you should place the material you most need answered near the edges of the context, not bury it in the center, and you should not assume the model attended to everything just because it all fit.
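One way to act on the edge-placement advice is to assemble the prompt so that the passages you care most about sit at the start and end, with the bulk material in between. The sketch below is an illustration only: the function name, the even split of key passages between head and tail, and the choice to put the question last are all assumptions, not an established recipe.

```python
from typing import List


def build_edge_weighted_prompt(
    question: str,
    key_passages: List[str],  # material the answer most depends on
    background: List[str],    # everything else, placed in the middle
) -> str:
    # Split the key passages so roughly half open the context and half close it.
    split = (len(key_passages) + 1) // 2
    head, tail = key_passages[:split], key_passages[split:]

    # Assumed layout: key material, bulk, key material, then the question.
    parts = [*head, *background, *tail, f"Question: {question}"]
    return "\n\n".join(parts)
```

This only changes where material sits; it does not guarantee the model attends to it, so spot-checking answers against the source still matters.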

There's a budget angle too. This is the same long-context-versus-RAG tradeoff that decides so many pipeline designs: under a real token or latency constraint, recent work on stronger RAG baselines finds that a simple structured retrieve-then-read — pull the handful of relevant passages, keep their original order, answer — matches or beats intricate multi-stage summarization pipelines. A lot of the elaborate tree-summarizers were solving a problem that a bigger window, or a bit of retrieval, dissolves.
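The retrieve-then-read pattern mentioned above, with passages kept in their original order, might look like this in miniature. The keyword-overlap scorer is a toy stand-in (an assumption for illustration) for whatever retriever you actually use, such as embeddings or BM25.

```python
from typing import Callable, List


def keyword_overlap(query: str) -> Callable[[str], float]:
    """Toy scorer (assumption): count passage words that appear in the query."""
    terms = set(query.lower().split())
    return lambda passage: float(sum(word in terms for word in passage.lower().split()))


def retrieve_in_order(
    passages: List[str],
    score: Callable[[str], float],
    k: int,
) -> List[str]:
    # Rank passage indices by relevance and keep the top k...
    top = sorted(range(len(passages)), key=lambda i: score(passages[i]), reverse=True)[:k]
    # ...then restore document order so the reader sees them as written.
    return [passages[i] for i in sorted(top)]


passages = ["alpha beta", "gamma", "beta beta alpha", "delta alpha"]
selected = retrieve_in_order(passages, keyword_overlap("alpha beta"), k=2)
context = "\n\n".join(selected)  # this is what would go into the prompt
```

Sorting the selected indices back into document order is the "keep their original order" step: the model reads the passages in the sequence the author wrote them, not in relevance order.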

The actual decision

Skip the question of which chain and answer three others:

  1. Does it fit long context? If yes and you can afford the tokens, stuff it whole — it's lossless and simplest. Put the key material near the start or end.
  2. Global summary or targeted question? If you need a specific answer, don't summarize the whole thing — retrieve the relevant passages and answer from them. Summarization is for when you genuinely need a condensed view of the whole.
  3. What's the budget? Under a tight token or latency budget, structured retrieval usually wins. When you truly need a global summary of something that exceeds even long context, then pick between the chains: map-reduce when the content is embarrassingly parallel and cross-chunk links don't carry the meaning, refine when order and continuity do.
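The three questions above can be written down as a small decision function. This is one possible reading of the list, not a rule: the parameter names are hypothetical, the inputs are judgment calls you would have to make yourself, and the order in which the checks fire is an assumption.

```python
from enum import Enum


class Strategy(Enum):
    STUFF = "stuff"
    RETRIEVE = "retrieve-then-read"
    MAP_REDUCE = "map-reduce"
    REFINE = "refine"


def choose_strategy(
    doc_tokens: int,
    window_tokens: int,
    needs_global_summary: bool,
    tight_budget: bool,
    order_matters: bool,
) -> Strategy:
    # 1. Does it fit, and can you afford the tokens? Then stuff it whole.
    if doc_tokens <= window_tokens and not tight_budget:
        return Strategy.STUFF
    # 2. A targeted question does not need a summary of everything.
    if not needs_global_summary:
        return Strategy.RETRIEVE
    # 3. Under a tight budget, structured retrieval usually wins.
    if tight_budget:
        return Strategy.RETRIEVE
    # Otherwise a chain is needed: refine when order and continuity
    # carry the meaning, map-reduce when chunks stand alone.
    return Strategy.REFINE if order_matters else Strategy.MAP_REDUCE
```

For example, a 3M-token archive that needs a global summary, with no tight budget and independent chunks, lands on `Strategy.MAP_REDUCE`, while the same archive with a single targeted question lands on `Strategy.RETRIEVE`.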

The classic strategies aren't wrong; they're answers to a question the frontier has partly moved past. Map-reduce still overflows on its reduce step, refine still can't parallelize, and both still compress away detail you might need. Before you pay those costs, check whether the document simply fits now — and if it does, the best summary is often no summary at all, just the whole thing in the window and a question aimed at the edges.