If you want to know what a field believes is hard, look at the benchmark it just built. For agent memory, three benchmarks in three years tell the whole story — and the story is that the thing we were measuring in 2024 has already become too easy to bother measuring in 2026.
LoCoMo set the bar, then got cleared
LoCoMo, introduced in early 2024, was the first benchmark to take very long-term conversational memory seriously. Its conversations run up to 35 sessions and average roughly 300 turns and 9,000 tokens, all in one coherent narrative, with questions spanning single-hop, multi-hop, temporal, and open-domain recall plus event summarization. For a while it was the number everyone quoted.
It is now close to saturated. Managed memory systems report scores in the low 90s on its QA split (Mem0 cites 92.5), and once scores cluster that high, the benchmark stops separating a good system from a great one. A near-perfect LoCoMo score in 2026 is table stakes, not evidence. Treat it as a regression test: if you drop on LoCoMo you broke something, but topping it proves little. (This is exactly the trap that reading an agent-memory benchmark carefully is supposed to keep you out of: a headline number means nothing without the saturation curve behind it.)
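The regression-test framing can be made concrete with a small gate. This is a minimal sketch, not any vendor's tooling: the function name `locomo_regression_gate`, the `GateResult` type, and the one-point `tolerance` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    message: str


def locomo_regression_gate(baseline: float, current: float,
                           tolerance: float = 1.0) -> GateResult:
    """Fail only when the score drops by more than `tolerance` points.

    A score at or above the baseline passes without fanfare: on a
    saturated benchmark, a gain is not treated as evidence of a better
    system. The tolerance value is a hypothetical default.
    """
    drop = baseline - current
    if drop > tolerance:
        return GateResult(False, f"regression: down {drop:.1f} points from {baseline:.1f}")
    return GateResult(True, "no regression (not evidence of improvement)")
```

Called as `locomo_regression_gate(92.5, 89.0)`, the gate fails; called with a current score of 95.0, it passes but says nothing about quality, which is the asymmetry the paragraph above argues for.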
LongMemEval kept the conversational frame but raised the stakes
LongMemEval (ICLR 2025) sharpened the question. It curated 500 questions across five named abilities — information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention — embedded in freely scalable chat histories. Its headline finding was a gut-check: commercial assistants and long-context models lost about 30% accuracy as histories grew long. Critically, it decomposed memory into indexing, retrieval, and reading, which is the right mental model — most "memory" failures are really retrieval or reading failures wearing a memory costume.
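The indexing, retrieval, and reading split suggests a simple way to triage your own failures. The sketch below is an illustration under assumptions, not LongMemEval's evaluation code: the `QuestionTrace` fields are hypothetical flags you would have to log yourself for each question.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QuestionTrace:
    evidence_indexed: bool    # was the needed fact ever written to the store?
    evidence_retrieved: bool  # did retrieval surface it for this question?
    answer_correct: bool      # did the model answer correctly?


def attribute_failure(trace: QuestionTrace) -> str:
    """Blame the earliest stage that failed; return "ok" for correct answers."""
    if trace.answer_correct:
        return "ok"
    if not trace.evidence_indexed:
        return "indexing"
    if not trace.evidence_retrieved:
        return "retrieval"
    # The evidence was in front of the model and it still answered wrong.
    return "reading"


def failure_breakdown(traces) -> Counter:
    """Count outcomes per stage across a batch of question traces."""
    return Counter(attribute_failure(t) for t in traces)
```

If most of the count lands on "retrieval" or "reading", the store itself is probably not the problem, which is the costume the paragraph above describes.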
LongMemEval is still discriminative, which is exactly why it is more useful than LoCoMo right now. But it remains fundamentally a conversational benchmark, with histories measured in the tens to hundreds of thousands of tokens. The next benchmark decided that wasn't long enough.
BEAM moves the finish line to 10 million tokens
BEAM (ICLR 2026), from a team at the University of Alberta and UMass Amherst, is the one that should reset how you think about this. It builds 100 conversations that scale to 10 million tokens — roughly a year of daily conversation, or a software project's entire trace across hundreds of sessions — with 2,000 validated questions across ten ability categories: preference following, instruction following, information extraction, knowledge update, multi-session reasoning, summarization, temporal reasoning, event ordering, abstention, and contradiction resolution.
The construction is the point. BEAM's authors call out that many "long" benchmarks cheat by stitching together unrelated sessions from different users. That creates abrupt topic shifts and weak continuity, which ironically makes the task easier — a model can lean on local retrieval instead of maintaining a consistent internal state. BEAM instead generates a single, topically coherent narrative, so the system actually has to remember rather than re-find.
Stitching unrelated sessions into a "long" conversation doesn't test memory. It tests search, and then congratulates the model for passing.
The result that matters: long context is not long memory
Here is the finding to take to your architecture review. On BEAM, LLMs with 1M-token context windows — with and without retrieval augmentation — still degrade as the dialogue lengthens. The window does not save you. The benchmark cannot be solved by buying more context, which is precisely what makes it relevant to production.
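To see this kind of degradation in your own evaluation, slice accuracy by history length instead of reporting one number. A minimal sketch follows; the row format, the length labels, and the sample rows are assumptions made up for illustration and are not BEAM results.

```python
from collections import defaultdict


def accuracy_by(results, key):
    """Group graded answers by `key` and return {group: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in results:
        totals[row[key]] += 1
        hits[row[key]] += int(row["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative rows only; these are not BEAM results.
results = [
    {"length": "128K", "ability": "knowledge update", "correct": True},
    {"length": "128K", "ability": "event ordering", "correct": True},
    {"length": "1M", "ability": "knowledge update", "correct": True},
    {"length": "1M", "ability": "event ordering", "correct": False},
    {"length": "10M", "ability": "knowledge update", "correct": False},
    {"length": "10M", "ability": "event ordering", "correct": False},
]

by_length = accuracy_by(results, "length")    # one score per history length
by_ability = accuracy_by(results, "ability")  # same grouping, per ability
```

The same helper groups by ability, so one pass over the graded answers shows both where the curve bends and which abilities bend it.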
What does help is structure, the same architectural bet that separates a real memory layer from a vector store you query like RAG. The BEAM paper also introduces LIGHT, a memory framework modeled on human cognition: a long-term episodic memory, a short-term working memory, and a scratchpad that accumulates salient facts. Across backbones from GPT-4.1-nano and Gemini-2.0-flash to Qwen2.5-32B and Llama-4-Maverick, LIGHT improved on the strongest baselines by 3.5% to 12.69% on average. At the 10M-token extreme, where no baseline natively supports the full context, the gains ballooned to +155.7% for Llama-4-Maverick and +107.3% for GPT-4.1-nano. Mem0's own numbers tell the same story from the other side: a system scoring 92.5 on LoCoMo falls to 64.1 on BEAM at 1M and 48.6 at 10M. The cliff is real.
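To make the three-part shape concrete, here is a toy sketch of an episodic store, a bounded working memory, and a scratchpad feeding one context. It is not LIGHT's implementation: the class name `ThreeTierMemory`, its methods, and the word-overlap retriever are all assumptions standing in for whatever the paper actually uses.

```python
from collections import deque
from typing import Optional


class ThreeTierMemory:
    """Toy illustration of episodic + working + scratchpad memory."""

    def __init__(self, working_size: int = 4):
        self.episodic = []                          # every turn, kept long-term
        self.working = deque(maxlen=working_size)   # only the most recent turns
        self.scratchpad = []                        # salient facts, accumulated

    def observe(self, turn: str, salient_fact: Optional[str] = None) -> None:
        """Record a turn; the caller decides what counts as salient."""
        self.episodic.append(turn)
        self.working.append(turn)
        if salient_fact is not None:
            self.scratchpad.append(salient_fact)

    def _retrieve(self, query: str, k: int) -> list:
        # Stand-in retriever (an assumption): rank turns by shared lowercase
        # words, break ties by age, and drop turns with no overlap.
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), i, t)
                  for i, t in enumerate(self.episodic)]
        hits = sorted((s for s in scored if s[0] > 0),
                      key=lambda s: (-s[0], s[1]))
        return [t for _, _, t in hits[:k]]

    def build_context(self, query: str, k: int = 2) -> dict:
        """Assemble what the model would see for this query."""
        return {
            "episodic": self._retrieve(query, k),
            "working": list(self.working),
            "scratchpad": list(self.scratchpad),
        }
```

The design point is that the scratchpad survives even after a fact has scrolled out of working memory and even when retrieval misses it, so the answer no longer depends on a single lookup succeeding.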
The practical lesson for anyone shipping an agent that lives longer than a single session: stop benchmarking memory on LoCoMo, stop assuming a bigger context window is a memory strategy, and start measuring the abilities that actually break at scale — contradiction resolution, knowledge update, temporal ordering — on a benchmark that won't let your retriever do memory's job for it. The window is rented. The architecture is yours.



