Ask an engineer why they bolted a memory layer onto their agent and you'll usually hear a version of the same sentence: we couldn't keep pasting the whole conversation into the prompt, it got too expensive. Memory, in this telling, is a cost-control move. You stop shipping 40,000 tokens of transcript on every turn and start shipping a tidy handful of retrieved facts instead.
For some memory systems, that's exactly right. For others, it is off by a factor of five hundred — in the wrong direction.
A June 2026 paper, "Memory is Reconstructed, Not Retrieved," makes the gap impossible to ignore. Its authors propose a graph-memory system called MRAgent and, to argue for it, benchmark it against two widely used alternatives, A-Mem and LangChain's LangMem, on LongMemEval, a hard long-term-memory benchmark that runs queries against histories spanning dozens of sessions. The accuracy story is the usual one: the new method wins, by up to 23 points on the hardest question types, using Gemini 2.5 Flash and Claude Sonnet 4.5 as backbones. The number that should stop you is in the cost column.
On the same benchmark, under the same accounting, the three "memory" systems consumed:
- MRAgent — 118K prompt tokens per query
- A-Mem — 632K tokens per query
- LangMem — 3.26M tokens per query
These are all products sold under the same word. The most expensive one burns 28 times more tokens per question than the least expensive, and all three sit far above the ~7K tokens per query that Mem0 reports for its pipeline on the easier LoCoMo benchmark: roughly 17 times above for MRAgent and about 470 times for LangMem. "Add a memory layer" turns out to be a decision with a blast radius of nearly three orders of magnitude, and almost nobody prices it before they ship.
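The ratios behind those claims are easy to check. The sketch below uses only the per-query figures quoted above; the variable and function names are illustrative, and the Mem0 ratio is indicative only, since that figure comes from a different benchmark with different accounting.

```python
# Tokens per query as quoted in the text above (LongMemEval, one harness).
LONGMEMEVAL_TOKENS = {
    "MRAgent": 118_000,
    "A-Mem": 632_000,
    "LangMem": 3_260_000,
}
# Approximate figure Mem0 reports on LoCoMo: a different benchmark,
# so any ratio against it is indicative, not a measurement.
MEM0_LOCOMO_TOKENS = 7_000


def spread(costs):
    """Ratio of the most expensive entry to the cheapest one."""
    return max(costs.values()) / min(costs.values())


within_paper = spread(LONGMEMEVAL_TOKENS)
cross_benchmark = {
    name: tokens / MEM0_LOCOMO_TOKENS
    for name, tokens in LONGMEMEVAL_TOKENS.items()
}

print(f"within-paper spread: {within_paper:.1f}x")
for name, ratio in cross_benchmark.items():
    print(f"{name} vs Mem0 (indicative only): {ratio:.0f}x")
```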
The cost isn't in what you store. It's in where the LLM runs.
The intuition that memory saves tokens quietly assumes that a memory system is a database — you write facts in, you read a few out, the model only ever sees the few. That describes one family of systems and badly misdescribes another.
Extract-and-store systems — Mem0 is the archetype — do their expensive thinking once, at write time. When a conversation turn lands, an LLM call distills it into atomic facts and files them. At query time, retrieval is close to free: an embedding lookup, a short list of facts, a few thousand tokens into the prompt. The read path is cheap because the reasoning already happened and got cached as text.
Agentic and reconstruction memory systems make the opposite bet. Instead of trusting a single extraction pass and a static top-k lookup, they spend far more model calls per memory, and the most aggressive of them put the model into the read path. A-Mem, inspired by the Zettelkasten note-taking method, issues multiple LLM calls when it ingests a memory (constructing a note, generating links to neighbors, evolving the existing graph), which is why its heavy cost is largely a write-side tax that the per-sample accounting then surfaces as a big number. MRAgent goes further and reasons at query time, walking a cue-tag-content graph and iteratively pruning retrieval paths as evidence accumulates. That's the "reconstruction" in the title: the paper's thesis, borrowed from human cognition, is that memory should be actively rebuilt against the current question rather than fetched intact. It's a genuinely good idea for accuracy. It is also, unavoidably, more LLM calls, and LLM calls are the token bill.
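A toy sketch makes the difference in call placement concrete. Nothing here is Mem0's, A-Mem's, or MRAgent's actual algorithm: `CountingLLM`, `ExtractAndStore`, and `ReconstructAtRead` are hypothetical stand-ins, keyword overlap substitutes for an embedding lookup, and the "model" only counts how often each path invokes it.

```python
class CountingLLM:
    """Stand-in for a model client; it only counts calls per path."""

    def __init__(self):
        self.calls = {"write": 0, "read": 0}

    def call(self, path, text):
        self.calls[path] += 1
        return text  # a real model would return a distilled or judged result


class ExtractAndStore:
    """Pays one LLM call per ingested turn; reads are lookup-only."""

    def __init__(self, llm):
        self.llm = llm
        self.facts = []

    def write(self, turn):
        self.facts.append(self.llm.call("write", turn))

    def read(self, question, k=3):
        # No LLM call: keyword overlap stands in for an embedding search.
        words = set(question.lower().split())
        ranked = sorted(
            self.facts,
            key=lambda fact: len(words & set(fact.lower().split())),
            reverse=True,
        )
        return ranked[:k]


class ReconstructAtRead:
    """Stores turns raw; spends LLM calls while answering each query."""

    def __init__(self, llm):
        self.llm = llm
        self.turns = []

    def write(self, turn):
        self.turns.append(turn)  # no LLM call at ingest time

    def read(self, question, max_steps=4):
        evidence = []
        # Each step asks the model to weigh one candidate against the question.
        for candidate in self.turns[:max_steps]:
            evidence.append(self.llm.call("read", f"{question} | {candidate}"))
        return evidence


turns = ["I upgraded to the Pro plan", "My dog is named Ada", "I moved to Lisbon",
         "I cancelled the add-on", "I prefer email over chat"]
questions = ["what plan am I on", "where do I live"]

for system_cls in (ExtractAndStore, ReconstructAtRead):
    llm = CountingLLM()
    system = system_cls(llm)
    for turn in turns:
        system.write(turn)
    for question in questions:
        system.read(question)
    print(system_cls.__name__, llm.calls)
```

With five turns and two questions, the first system makes all of its calls while ingesting and the second makes all of its calls while answering, which is the whole distinction the token bill ends up reflecting.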
So the spread, 28x inside one paper and close to 500x once Mem0's ~7K is set beside LangMem's 3.26M, is not a quality ranking. It's a map of where each system decided to spend inference. LangMem's 3.26M is not a bug that a patch removes; it's the visible cost of re-embedding and reprocessing large memory stores at test time. MRAgent's 118K is what "reasoning in the read path" costs even after you optimize it hard.
Why the raw numbers still mislead, and how to read them anyway
One caution, because this publication has made a sport of catching memory benchmarks lying. MRAgent's 118K and Mem0's ~7K are not an apples-to-apples comparison. They're measured on different benchmarks, LongMemEval versus LoCoMo, with different history lengths and different token accounting. LoCoMo conversations fit inside a modern context window; LongMemEval deliberately does not. Reading the two figures as a head-to-head ranking would commit exactly the sin the leaderboards commit.
The comparison that is clean is the one inside a single paper: MRAgent, A-Mem, and LangMem, same benchmark, same harness, 118K to 3.26M. That spread is real, and it's the one that matters for a build decision, because it isolates the design choice from the benchmark.
The practical read, then, is a single question you can answer about any memory system before adopting it: where does it call the model, and how often? If the LLM runs mostly at write time, your per-query cost will be small and roughly flat, and your risk is lossiness — the pipeline compressed away a detail some future question needed. If the LLM runs in the read path, your per-query cost scales with how hard the question is and how much history it has to reconstruct, and your risk is a token bill that grows with your users' tenure on the product.
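That trade-off can be written as a back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement: both function names and every number in the example calls are hypothetical, chosen only to show the shape of each cost curve.

```python
def write_heavy_cost(extract_tokens_per_turn, turns, read_tokens_per_query, queries):
    """Per-query tokens when the LLM runs mostly at write time.

    The extraction cost is paid once per ingested turn and amortized over
    all queries; each read adds a small, flat prompt of retrieved facts.
    """
    amortized_write = extract_tokens_per_turn * turns / queries
    return amortized_write + read_tokens_per_query


def read_heavy_cost(tokens_per_step, steps):
    """Per-query tokens when the LLM reasons in the read path.

    `steps` is the number of model calls one query triggers, which grows
    with question difficulty and with how much history must be rebuilt.
    """
    return tokens_per_step * steps


# Hypothetical workload: 200 ingested turns, 100 queries.
flat = write_heavy_cost(1_500, turns=200, read_tokens_per_query=3_000, queries=100)
easy = read_heavy_cost(12_000, steps=3)
hard = read_heavy_cost(12_000, steps=10)

print(f"write-heavy: {flat:.0f} tokens/query regardless of question difficulty")
print(f"read-heavy:  {easy} tokens/query (easy) vs {hard} tokens/query (hard)")
```

The first function depends on how much you ingest per query served; the second depends on the question itself, which is why its bill is harder to forecast.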
Neither is wrong. A support bot answering "what plan am I on" wants the cheap read path and will never miss the lost detail. An agent doing multi-hop reasoning over a year of a user's history may genuinely need reconstruction, and 118K tokens a query may be the honest price of getting it right. But those are different products with different unit economics, and the word "memory" hides the difference until the invoice arrives.
Measure tokens per query on your own traffic, split write path from read path, and you'll know which system you actually bought — not the one on the landing page.
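One way to take that measurement is a small ledger wrapped around every model call. The `TokenLedger` class below is a hypothetical helper, not part of any memory library; it assumes you can read prompt and completion token counts from your provider's usage metadata and that you tag each call as belonging to the write path or the read path.

```python
from collections import defaultdict


class TokenLedger:
    """Accumulates token usage per path and reports tokens per query."""

    PATHS = ("write", "read")

    def __init__(self):
        self.tokens = defaultdict(int)
        self.queries = 0

    def record(self, path, prompt_tokens, completion_tokens=0):
        """Log one model call; `path` says where in the pipeline it ran."""
        if path not in self.PATHS:
            raise ValueError(f"path must be one of {self.PATHS}, got {path!r}")
        self.tokens[path] += prompt_tokens + completion_tokens

    def end_query(self):
        """Call once per user-facing query that has been answered."""
        self.queries += 1

    def per_query(self):
        """Average tokens per query, split by path, plus the total."""
        if self.queries == 0:
            raise ValueError("no queries recorded yet")
        report = {path: self.tokens[path] / self.queries for path in self.PATHS}
        report["total"] = sum(report.values())
        return report


# Example with made-up numbers: two ingests, then two answered queries.
ledger = TokenLedger()
ledger.record("write", prompt_tokens=1_200, completion_tokens=300)
ledger.record("write", prompt_tokens=900, completion_tokens=100)
ledger.record("read", prompt_tokens=4_000, completion_tokens=500)
ledger.end_query()
ledger.record("read", prompt_tokens=9_000, completion_tokens=500)
ledger.end_query()

print(ledger.per_query())
```

If the read figure dominates and climbs as accounts age, you bought a reconstruction system, whatever the landing page calls it.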



