Take a page of text, say a thousand words. Feed it to a language model the normal way and the tokenizer turns it into roughly a thousand-plus tokens, each one paying the model's quadratic attention tax. Now render that same page as an image, run it through a vision encoder, and you get back about a hundred vision tokens. Hand those hundred tokens to a decoder and it reconstructs the original text at around 97% accuracy. You stored a page of words in about a tenth of the tokens, and almost nothing fell off. At that rate, ten pages of words fit in one page's worth of tokens.
That is the trick at the center of DeepSeek-OCR: Contexts Optical Compression, a paper from DeepSeek's Haoran Wei, Yaofeng Sun, and Yukun Li, posted in October 2025. It looks like an OCR release. It is really an argument about what a context window is made of.
The non-obvious part: a pixel can be a denser carrier of text than a token
We treat text tokens as the natural unit of language for a model, and a vision token as something for photographs. DeepSeek-OCR inverts that. Its claim, stated plainly in the paper's abstract, is that when the number of text tokens is within 10x the number of vision tokens, the decoder reconstructs the text at about 97% precision. Ten text tokens of information, carried by one vision token, nearly losslessly.
If a page of text fits into a hundred vision tokens at 97% fidelity, then "long context" was never a capacity problem. It was a compression problem we hadn't been treating as one.
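The arithmetic behind the page example is worth seeing once. The sketch below uses the round numbers from the opening (1,000 text tokens, 100 vision tokens) and counts only pairwise self-attention work. The function name `compression_summary` is made up for illustration, and the estimate ignores the cost of the vision encoder and every non-attention layer, so treat it as a rough picture, not a benchmark.

```python
def compression_summary(text_tokens: int, vision_tokens: int) -> dict:
    """Compare token counts and pairwise attention work for two encodings of one page."""
    if text_tokens <= 0 or vision_tokens <= 0:
        raise ValueError("token counts must be positive")

    # Self-attention compares every token with every other token,
    # so the number of pairs grows with the square of the sequence length.
    pairs_text = text_tokens ** 2
    pairs_vision = vision_tokens ** 2

    return {
        "compression_ratio": text_tokens / vision_tokens,
        "attention_pairs_text": pairs_text,
        "attention_pairs_vision": pairs_vision,
        "attention_savings": pairs_text / pairs_vision,
    }


# Round numbers from the article's opening example (illustrative, not measured).
summary = compression_summary(text_tokens=1000, vision_tokens=100)
print(summary)
```

A 10x cut in tokens is a 100x cut in attention pairs, which is why the compression ratio matters more than it first appears.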
The architecture that does this has two parts: a DeepEncoder that ingests a high-resolution image at low activation cost and squeezes it into a small token count, and a DeepSeek3B-MoE-A570M decoder (a mixture-of-experts model with about 570M active parameters) that reads those tokens back into text. The official repo exposes the dial directly: Tiny mode is 64 vision tokens, Small is 100, Base is 256, Large is 400. You choose how hard to compress.
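To make the dial concrete, here is a minimal sketch of how one might pick a mode for a page. The four token budgets are the ones listed above. The helper `pick_mode` and its `max_ratio` parameter are hypothetical and not part of the official repo, and the 10x default simply mirrors the ratio at which the paper reports about 97% precision.

```python
from typing import Optional

# Vision-token budgets per mode, as listed in the article.
MODES = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}


def pick_mode(text_tokens: int, max_ratio: float = 10.0) -> Optional[str]:
    """Return the cheapest mode whose compression ratio stays at or below max_ratio.

    Hypothetical helper for illustration; returns None when even the largest
    mode would compress harder than max_ratio.
    """
    for name, vision_tokens in sorted(MODES.items(), key=lambda item: item[1]):
        if text_tokens / vision_tokens <= max_ratio:
            return name
    return None


print(pick_mode(600))   # 600 / 64 is about 9.4x, so the smallest mode is enough
print(pick_mode(1000))  # 1000 / 64 is too aggressive, 1000 / 100 is exactly 10x
print(pick_mode(5000))  # 5000 / 400 is 12.5x, beyond the target in every mode
```

Under these assumptions a dense page simply needs a bigger mode, and a page too dense for Large has to be split or accepted at lower fidelity.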
This is why Andrej Karpathy used the paper as a springboard for a larger thesis: maybe text tokens are simply a wasteful input format, and all inputs to a language model should be images — compressing context, preserving layout and styling, enabling bidirectional attention at the input, and deleting the tokenizer along with its Unicode baggage. That is speculation built on top of a measured result, and worth keeping in that order.
The benchmark numbers are the load-bearing evidence
The compression claim would be cheap talk without a parsing benchmark behind it, so the paper runs OmniDocBench, the CVPR 2025 document-parsing suite of 1,651 real PDF pages. Two comparisons matter, per VentureBeat's writeup and the DigitalOcean tutorial:
- It beats GOT-OCR2.0 — which uses 256 tokens per page — while using only 100 vision tokens.
- It beats MinerU2.0 — which averages 6,000+ tokens per page — while using fewer than 800.
Better scores, with roughly two-fifths of the tokens in the first comparison and less than one-seventh in the second. And it is not a lab toy on throughput: a single A100-40G processes 200,000+ pages per day, which is what makes it plausible as a data-generation engine, not just a demo.
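A quick back-of-the-envelope check of those figures, using only the numbers quoted above. Note that "6,000+" and "fewer than 800" are bounds, so the MinerU2.0 fraction computed here is an upper bound on the real one, and the throughput figure is a floor.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# (baseline tokens per page, DeepSeek-OCR tokens per page), from the comparisons above.
# 6000 and 800 stand in for "6,000+" and "fewer than 800", so that ratio is a bound.
comparisons = {
    "GOT-OCR2.0": (256, 100),
    "MinerU2.0": (6000, 800),
}

# Fraction of the baseline's token budget that DeepSeek-OCR uses.
fractions = {name: ours / baseline for name, (baseline, ours) in comparisons.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} of baseline tokens")

# 200,000 pages per day on one GPU, expressed per second.
pages_per_second = 200_000 / SECONDS_PER_DAY
print(f"{pages_per_second:.2f} pages/second")
```

That works out to about 39% and 13% of the baseline token counts, and a bit over two pages per second on a single card.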
The efficiency framing rhymes with everything we already know about why long context degrades and how much it costs — feeding fewer, denser tokens is a direct lever on agent token spend, and it sits in the same problem space as prompt compression methods like LLMLingua, just operating in pixel space instead of token space.
Where the skepticism goes
Read the claim precisely and the limits are right there in it. The 97% figure holds under 10x compression. Push to 20x and accuracy falls to about 60% — a number the authors report and do not hide. So this is lossy compression with a clearly sloped curve, not a free lunch. The right mental model is JPEG, not ZIP: crank the ratio and the artifacts arrive.
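One way to keep both numbers in view is a lookup that refuses to guess. The sketch below encodes only the two operating points quoted above and returns nothing beyond them. It assumes accuracy never improves as compression gets harder, which is an assumption of this sketch, not a result from the paper, and `accuracy_floor` is a hypothetical name.

```python
from typing import Optional

# The two operating points quoted in the article: (max compression ratio, approx. accuracy).
REPORTED_POINTS = [(10.0, 0.97), (20.0, 0.60)]


def accuracy_floor(text_tokens: int, vision_tokens: int) -> Optional[float]:
    """Return a rough accuracy floor for this compression ratio, or None if unreported.

    Assumes accuracy does not increase as the ratio grows, so the value
    reported at the next-higher ratio serves as a conservative floor.
    """
    ratio = text_tokens / vision_tokens
    for max_ratio, accuracy in REPORTED_POINTS:
        if ratio <= max_ratio:
            return accuracy
    return None  # harder than 20x: no figure quoted, so do not extrapolate


print(accuracy_floor(1000, 100))  # 10x
print(accuracy_floor(1000, 64))   # about 15.6x, between the two reported points
print(accuracy_floor(2000, 64))   # about 31x, outside the reported range
```

Returning `None` past 20x is deliberate: the curve is only known where it was measured.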
Two more cautions are worth stating out loud. First, OCR reconstruction is not reasoning over the page. Proving the decoder can transcribe an image of text back into text at 97% does not prove a model can reason over content stored that way as well as it reasons over native tokens. The paper measures reconstruction fidelity, and inference quality over optically compressed context is a separate question it does not fully close. Second, the authors themselves frame this as an "initial investigation" into optical compression, not a finished system. The honest reading is: promising mechanism, early evidence.
The idea worth watching: optical memory decay
Here is the part that earns the "memory" framing. If old context can be rendered as an image and compressed, you can compress old context harder than recent context — render the last hour at full resolution, render last week at Tiny mode, let the distant past blur into a handful of tokens. The paper points at exactly this, noting the approach's promise for historical long-context compression and forgetting mechanisms. It is a built-in decay curve: a way to manage context in a long-running agent where memory fades with age the way human memory does, and where the tradeoff isn't retrieval versus a long window but resolution versus recency.
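The decay idea can be sketched as a schedule that maps a page's age to a resolution mode. The token budgets are the four modes from the official repo as listed earlier. The age thresholds, `mode_for_age`, and `context_budget` are invented for illustration and are not something the paper specifies.

```python
# Vision-token budgets per mode, as listed earlier in the article.
MODES = {"Large": 400, "Base": 256, "Small": 100, "Tiny": 64}

# Hypothetical age thresholds in hours (not from the paper):
# up to 1 hour -> Large, up to 1 day -> Base, up to 3 days -> Small, older -> Tiny.
AGE_TIERS = [(1, "Large"), (24, "Base"), (72, "Small")]


def mode_for_age(age_hours: float) -> str:
    """Pick a resolution mode for a page of context based on how old it is."""
    for max_age, mode in AGE_TIERS:
        if age_hours <= max_age:
            return mode
    return "Tiny"


def context_budget(page_ages_hours) -> int:
    """Total vision tokens needed to hold one page per listed age."""
    return sum(MODES[mode_for_age(age)] for age in page_ages_hours)


ages = [0.5, 3, 30, 200]  # four pages of history, from recent to more than a week old
decayed = context_budget(ages)
uniform = len(ages) * MODES["Large"]
print(decayed, uniform)
```

With these made-up thresholds, four pages cost 820 vision tokens instead of 1,600 at full resolution, and the saving grows as the history gets longer and older.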
That is the contribution. Not "DeepSeek built a good OCR model" — though by OmniDocBench it did. The contribution is treating vision tokens as a tunable, lossy storage medium for text, with a measured fidelity-versus-compression curve you can engineer against. Whether that becomes how models hold their context, or stays a clever OCR result, depends on evidence this paper deliberately doesn't claim to have yet.
For now: a page of text fits in a hundred vision tokens at about 97% accuracy, and it breaks down at 20x compression. Both halves of that sentence are the finding.



