Here is a sentence that sounds like a bug report and is actually a research program: to make your agent's long history cheaper, stop sending it as text and send it as a picture of text.

The logic is less silly than it sounds. A language model pays by the token, and a page of prose is a lot of tokens. But a vision model reads an image with a small, roughly fixed budget of vision tokens — and an image of that same page carries the same words. So if a vision-language model can read the picture back accurately enough, you've just fit a page's worth of context into a fraction of the token cost. The whole 2025–2026 line of work under the banner optical context compression is an attempt to cash that arbitrage.

The anchor result: DeepSeek-OCR

The paper that made everyone look was DeepSeek-OCR, released in October 2025 under the deliberately provocative subtitle Contexts Optical Compression. Its encoder compresses an image of a document page into a small set of vision tokens (64, 100, 256, or 400, depending on the resolution mode), and a small MoE decoder reads them back out as text.

The headline number is the compression ratio: how many text tokens' worth of content you pack per vision token. DeepSeek-OCR reports that when the source text runs to no more than 10 times the number of vision tokens, decoding precision holds around 97%. Push to 20x, and it drops to about 60%. That second number is the important one, and it's rarely quoted: the accuracy cliff is real, and it's close. On the OmniDocBench document benchmark the method matches a 256-token-per-page OCR baseline using only 100 vision tokens, which is a genuine efficiency win for the OCR job it was built for.
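To make the ratio concrete, here is a minimal sketch that computes it for a page and labels which side of the reported figures it lands on. The function name, the regime labels, and the 900-token page are all made up for illustration; only the 10x and 20x thresholds come from the numbers above, and nothing is interpolated between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressionEstimate:
    ratio: float  # text tokens packed per vision token
    regime: str   # coarse, hypothetical label; not a predicted accuracy


def estimate_compression(text_tokens: int, vision_tokens: int) -> CompressionEstimate:
    """Compute the text-to-vision token ratio and label its regime.

    Thresholds mirror the two reported figures: about 97% precision at
    10x or below, about 60% at 20x. The labels are illustrative only.
    """
    if text_tokens < 0 or vision_tokens <= 0:
        raise ValueError("text_tokens must be >= 0 and vision_tokens > 0")
    ratio = text_tokens / vision_tokens
    if ratio <= 10:
        regime = "safe"        # at or below the ~97% figure's range
    elif ratio < 20:
        regime = "degrading"   # between the two reported points
    else:
        regime = "cliff"       # at or past the ~60% figure
    return CompressionEstimate(ratio=ratio, regime=regime)


# A hypothetical 900-token page at the 100-token mode sits at 9x.
print(estimate_compression(900, 100))
# The same page at the 64-token mode is already past 14x.
print(estimate_compression(900, 64))
```

The point of the second call: dropping one resolution mode is enough to move an ordinary page out of the comfortable range.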

The 10x headline is real. The 20x cliff sitting right behind it is realer, and it's the number that decides whether you can ship this.

From documents to context windows: Glyph and AgentOCR

DeepSeek-OCR is about reading documents. The interesting move for the rest of us is applying the same trick to a model's working context.

Glyph, from Zhipu, does exactly that: it renders long text input into images so a vision-language model can process it, claiming 3–4x token compression while matching a text LLM on long-context benchmarks like LongBench and RULER. The framing is that a 128K-token VLM can, by reading rendered pages, cover tasks that would otherwise need a much larger context window. Note the gap from DeepSeek's headline: once the target is task performance rather than OCR reconstruction, the safe compression ratio drops to 3–4x. That's your real number.

For agent builders, the one to read is AgentOCR (an ACL 2026 oral). It takes the accumulated observation–action history of a running agent — the transcript that grows every turn and eventually dominates your bill — and renders it into images. Two ideas make it more than a demo. Segment optical caching hashes history segments so already-rendered spans aren't re-drawn, giving a ~20x rendering speedup. And agentic self-compression trains the agent, with a compression-aware reward, to emit its own compression rate turn by turn — sharp for the recent step, blurry for the ancient one. The reported payoff: roughly 55% fewer tokens on ALFWorld and 70% on search-based QA, while retaining over 95% of the text-history agent's task performance. If your long-horizon agent's problem is that its context tax scales with its transcript, that is a lever worth knowing about — and it maps cleanly onto the compaction-vs-context-editing decisions you're already making.
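The caching idea is easy to picture in code. What follows is a minimal sketch of hash-keyed segment caching in general, not AgentOCR's implementation: `SegmentRenderCache` and `fake_render` are hypothetical names, and the renderer is a stand-in that returns bytes instead of drawing an image.

```python
import hashlib
from typing import Callable, Dict, List


class SegmentRenderCache:
    """Cache rendered history segments, keyed by a hash of their text."""

    def __init__(self, render: Callable[[str], bytes]) -> None:
        self._render = render
        self._cache: Dict[str, bytes] = {}
        self.misses = 0  # number of times the renderer actually ran

    def get(self, segment: str) -> bytes:
        key = hashlib.sha256(segment.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._render(segment)
        return self._cache[key]

    def render_history(self, segments: List[str]) -> List[bytes]:
        # Only segments not seen before reach the renderer.
        return [self.get(s) for s in segments]


def fake_render(segment: str) -> bytes:
    # Hypothetical stand-in for a real text-to-image renderer.
    return b"IMG:" + segment.encode("utf-8")


cache = SegmentRenderCache(fake_render)
history = ["obs: you are in the kitchen", "act: open fridge"]
cache.render_history(history)        # renders both segments
history.append("obs: the fridge is open")
cache.render_history(history)        # renders only the new segment
print(cache.misses)                  # 3 renders instead of 5
```

Because an agent's transcript is append-only, almost every segment on a given turn has been seen before, which is why this kind of cache pays off so quickly.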

The non-obvious part: what is it actually competing with?

Here is where the naive reading — free 10x context! — falls apart, and where the one idea worth taking away lives.

Every method above is lossy. DeepSeek-OCR's own most elegant framing is as a forgetting mechanism: render recent context at high resolution and older context at progressively lower resolution, so distant memory literally blurs, the way ours does. That's a lovely picture. It's also an admission. A forgetting mechanism does not compete with keeping the full text. It competes with the other ways you throw information away: summarization and compaction, and the crudest baseline of all — truncation, just deleting the old turns.
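As a sketch of what "older context at progressively lower resolution" could look like, the snippet below assigns each turn a vision-token budget by age, stepping down through the four resolution modes mentioned earlier. The schedule itself (one step down every two turns) is an arbitrary assumption for illustration, not something any of the papers prescribe.

```python
from typing import List

# Vision-token budgets of the four resolution modes, largest first.
MODES = [400, 256, 100, 64]


def budget_for_age(age: int, turns_per_step: int = 2) -> int:
    """Return a vision-token budget for a turn that is `age` turns old.

    Age 0 is the most recent turn. Every `turns_per_step` turns the budget
    drops one mode, bottoming out at the smallest. The step size is a
    hypothetical choice made for this example.
    """
    if age < 0 or turns_per_step <= 0:
        raise ValueError("age must be >= 0 and turns_per_step > 0")
    step = min(age // turns_per_step, len(MODES) - 1)
    return MODES[step]


def schedule(num_turns: int, turns_per_step: int = 2) -> List[int]:
    """Budgets for a history of `num_turns` turns, ordered oldest first."""
    return [
        budget_for_age(num_turns - 1 - i, turns_per_step)
        for i in range(num_turns)
    ]


print(schedule(8))  # [64, 64, 100, 100, 256, 256, 400, 400]
```

Seen this way, the budget for an old turn never reaches zero, but it does hit a floor where fine detail is unlikely to survive. That is forgetting with a gradient, not retention.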

And against that bar, the evidence is unflattering. A December 2025 paper with a title that does not hedge, Optical Context Compression Is Just (Bad) Autoencoding, ran the comparison the hype skipped. Its finding: for language modeling, the vision route performs no better than truncation, and loses to a cheap, near-zero-parameter hierarchical text encoder at every compression ratio. Its argument is almost insulting in its simplicity: your text already lives as learned token embeddings inside the model; rendering it out to pixels and asking a vision encoder to recover it throws away the representation you already had and charges you encoder FLOPs for the privilege. The compression you're quoting is measured in decoder tokens, but you paid a vision encoder to produce them, so "10x fewer tokens" is not "10x cheaper end to end."
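A back-of-the-envelope cost model shows why the token ratio overstates the saving. This is a deliberately crude sketch: it assumes cost is linear in tokens, ignores rendering time and attention's super-linear scaling, and every number passed in is invented for illustration.

```python
def end_to_end_speedup(
    text_tokens: int,
    compression_ratio: float,
    decoder_cost_per_token: float,
    encoder_cost_per_vision_token: float,
) -> float:
    """Ratio of plain-text cost to optical cost under a linear cost model.

    Costs are in arbitrary units. Each vision token is charged once for
    the vision encoder and once for the decoder that consumes it.
    """
    if text_tokens <= 0 or compression_ratio <= 0:
        raise ValueError("text_tokens and compression_ratio must be > 0")
    if decoder_cost_per_token <= 0 or encoder_cost_per_vision_token < 0:
        raise ValueError("costs must be positive (encoder cost may be 0)")
    vision_tokens = text_tokens / compression_ratio
    text_cost = text_tokens * decoder_cost_per_token
    optical_cost = vision_tokens * (
        decoder_cost_per_token + encoder_cost_per_vision_token
    )
    return text_cost / optical_cost


# Hypothetical: 10x compression, but each vision token costs 4 units to encode.
print(end_to_end_speedup(10_000, 10, 1.0, 4.0))  # 2.0
# Only a free encoder turns 10x fewer tokens into 10x cheaper.
print(end_to_end_speedup(10_000, 10, 1.0, 0.0))  # 10.0
```

Under these assumptions the speedup is the compression ratio scaled by d / (d + e), where d and e are the per-token decoder and encoder costs, so the headline ratio is an upper bound that is reached only when encoding costs nothing.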

None of this means the idea is dead. AgentOCR's agent-specific numbers are real and its caching makes the encoder cost tractable; for a document-heavy pipeline, DeepSeek-OCR is a legitimately efficient front end. But it reframes the buying decision. Optical context compression is not a bigger context window. It is a particular, image-shaped way of forgetting, and you should reach for it exactly when forgetting was already the plan — a long-horizon agent about to compact, a giant corpus you were going to chunk and drop. Reach for it expecting to later attend to an exact order number you rendered into a 12-pixel-tall smudge, and the long-context-vs-retrieval tradeoffs you thought you'd escaped come back with worse handwriting.

The most useful thing the skeptics did was restore the right question. Not how much can I compress? — the demos already answered that, loudly. The question is: compared to just deleting it, how much did the picture actually buy me? On today's evidence, for a lot of workloads, the honest answer is: less than the headline, and sometimes nothing at all.
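One way to put that question into a number is to ask what fraction of the gap between truncation and full text the optical method actually recovers. The metric below is a hypothetical yardstick, not one taken from any of the papers, and it assumes all three scores are higher-is-better and measured at the same token budget.

```python
def gain_over_truncation(
    optical_score: float,
    truncation_score: float,
    full_text_score: float,
) -> float:
    """Fraction of the truncation-to-full-text gap recovered by optical compression.

    1.0 means it matches full text, 0.0 means it is no better than
    deleting old turns, and a negative value means it is worse.
    """
    gap = full_text_score - truncation_score
    if gap <= 0:
        raise ValueError("truncation already matches full text; nothing to recover")
    return (optical_score - truncation_score) / gap


# Made-up scores: full text 0.80, truncation 0.70, optical 0.72.
print(round(gain_over_truncation(0.72, 0.70, 0.80), 2))  # 0.2
```

A result near zero is the skeptics' finding in miniature: the picture cost you an encoder pass and bought almost nothing that deletion would not have.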