There is a comfortable story about agent memory that goes like this: context windows keep growing, so memory is a solved problem in waiting — just put the whole history in the prompt and let attention sort it out. It is a tidy story, and the two benchmarks that matter most were built specifically to break it.
The reason "just use long context" fails is not capacity. It is that accuracy falls apart as the history fills up — the effect now widely called context rot. A model can technically fit a hundred sessions of conversation and still answer worse than it would on five. So evaluating memory cannot mean checking whether a fact is somewhere in the window. It has to mean checking whether the agent can find it, reason over it, and know when it isn't there. That is what LoCoMo and LongMemEval actually grade.
LoCoMo: the long conversation
LoCoMo (Long Conversational Memory) is the closest thing the field has to a standard for multi-session recall. Its dialogues average 27.2 sessions of back-and-forth, the kind of accumulated relationship a personal assistant would have with a user over months. On top of those transcripts sit 1,540 questions, and the category split is the whole point: 841 single-hop (the fact is in one place), 282 multi-hop (you must connect facts across sessions), 321 temporal (what was true when), and 96 open-domain.
The headline result is sobering. Even with strategies like long-context models or retrieval-augmented memory, systems substantially trail human performance — and the gap is widest on exactly the categories that matter for a real assistant: temporal reasoning and long-range causal understanding. Single-hop recall looks fine. "What did I tell you about my sister's wedding three months ago, and has the date changed since?" does not.
LongMemEval: the assistant that has to say no
LongMemEval attacks the same problem from the chat-assistant angle and adds the test I find most underrated. It places 500 questions inside freely scalable histories — the standard setting is around 115k tokens of prior interaction — and grades five distinct abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
Two of those deserve emphasis. Knowledge update asks whether the agent notices that a fact changed — you moved cities, you switched jobs — rather than dutifully reciting the stale version it learned first. Abstention asks whether it will admit the memory simply doesn't contain the answer instead of confabulating one. Most memory demos never test either, because both are about restraint, and restraint doesn't screenshot well.
The numbers justify the pessimism. On LongMemEval, long-context LLMs show a 30–60% accuracy drop as the interaction history grows. More context, less reliable memory. The benchmark's own framing splits memory design into indexing, retrieval, and reading — a useful reminder that "memory" is a pipeline, not a window, and each stage can be the one that fails you.
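To make the indexing, retrieval, and reading split concrete, here is a minimal sketch of memory as three separate stages. It is not LongMemEval's implementation: the function names, the word-overlap scoring, and the placeholder reader are all assumptions chosen to keep the example self-contained.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; deliberately crude."""
    return re.findall(r"[a-z0-9']+", text.lower())


def index(sessions):
    """Stage 1, indexing: store each turn with its session id and a bag of words."""
    return [(sid, turn, Counter(tokenize(turn)))
            for sid, turns in enumerate(sessions)
            for turn in turns]


def retrieve(records, question, k=2):
    """Stage 2, retrieval: rank turns by word overlap with the question."""
    query = Counter(tokenize(question))
    scored = [(sum((bag & query).values()), sid, turn) for sid, turn, bag in records]
    scored = [item for item in scored if item[0] > 0]
    # Assumption: on equal overlap, prefer the later session so updates can win.
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [(sid, turn) for _, sid, turn in scored[:k]]


def read(evidence, question):
    """Stage 3, reading: placeholder for an LLM call over the retrieved evidence."""
    if not evidence:
        return "I don't have that."
    return evidence[0][1]


sessions = [
    ["I live in Lisbon.", "My dog is named Biscuit."],
    ["I moved last month and now live in Porto."],
]
records = index(sessions)
evidence = retrieve(records, "Where do I live now?")
print(read(evidence, "Where do I live now?"))
```

Keeping the stages as separate functions means each can be tested on its own: a wrong answer can be traced to a missing record, a bad ranking, or a reader that ignored good evidence.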
The benchmarks agree on the uncomfortable part: simple recall is easy and stays easy. It's temporal reasoning, knowledge updates, and knowing when to abstain that collapse — and a single average score hides all three.
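A quick bit of arithmetic shows how much an average can hide. The sketch below uses the LoCoMo question counts quoted earlier; the per-category accuracies are hypothetical numbers invented for illustration, not results from any system.

```python
# Question counts per category, as quoted for LoCoMo above.
counts = {"single-hop": 841, "multi-hop": 282, "temporal": 321, "open-domain": 96}

# HYPOTHETICAL accuracies, made up purely to illustrate the masking effect.
accuracy = {"single-hop": 0.90, "multi-hop": 0.55, "temporal": 0.40, "open-domain": 0.60}

total = sum(counts.values())
# The headline number is a question-weighted mean, so the largest bucket dominates.
overall = sum(counts[c] * accuracy[c] for c in counts) / total

print(f"overall: {overall:.1%} across {total} questions")
for category in counts:
    print(f"  {category:12s} {accuracy[category]:.0%}  ({counts[category] / total:.0%} of questions)")
```

With these made-up inputs the overall figure comes out near 71%, even though the temporal bucket sits at 40%. Single-hop questions make up more than half the set, so they pull the average toward their own score.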
How to actually evaluate your agent's memory
The practical takeaway is not "go run LoCoMo." Most teams won't, and their data doesn't look like a benchmark's anyway. The takeaway is to steal the benchmarks' structure for your own evals.
- Score by category, never by average. A single memory-accuracy number is a trap: it lets strong single-hop recall mask weak temporal and update handling. Break your eval into the same buckets — single-hop, multi-hop, temporal, knowledge-update, abstention — and watch which one is dragging.
- Test for change, not just recall. Seed your eval with facts that get overwritten later in the history, then ask the current value. An agent that always returns the first thing it learned has no memory; it has a cache with no invalidation. This is the failure that quietly poisons personalization.
- Reward abstention. Add questions whose answers are genuinely not in the history and score a confident wrong answer as worse than "I don't have that." A memory system you can't trust to say no is one you'll have to double-check on everything, which defeats the point of having it.
- Vary the history length. Run the same questions at 5k, 50k, and 150k tokens of surrounding chatter. If accuracy slides as the haystack grows, you've found your context-rot ceiling — and probably your argument for external memory over a longer window.
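The four practices above fit in a small harness. What follows is a minimal sketch under stated assumptions, not a reference implementation: the `Case` schema, the `ABSTAIN` string, the +1/0/-1 scoring, the substring grader, and the use of word counts as a stand-in for tokens are all hypothetical choices. The `agent` is any callable that takes a history and a question and returns a string.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

ABSTAIN = "I don't have that."  # hypothetical string the agent uses to abstain
FILLER = "We chatted about the weather and weekend plans."  # distractor turn, 8 words


@dataclass
class Case:
    category: str            # single-hop, multi-hop, temporal, knowledge-update, abstention
    history: list[str]       # prior turns, oldest first
    question: str
    expected: Optional[str]  # None means the answer is NOT in the history


def score(answer: str, expected: Optional[str]) -> float:
    """+1 if correct, 0 for an unnecessary abstention, -1 for a confident wrong answer."""
    abstained = answer.strip() == ABSTAIN
    if expected is None:
        return 1.0 if abstained else -1.0
    if abstained:
        return 0.0
    # Crude substring match; a real eval would need a stricter grader.
    return 1.0 if expected.lower() in answer.lower() else -1.0


def pad(history: list[str], filler_words: int) -> list[str]:
    """Prepend distractor turns worth roughly `filler_words` words (a rough token proxy)."""
    return [FILLER] * (filler_words // len(FILLER.split())) + history


def run_eval(agent, cases, filler_words=(0, 5_000, 50_000)):
    """Mean score per (category, filler size); deliberately no single global average."""
    buckets = defaultdict(list)
    for case in cases:
        for size in filler_words:
            answer = agent(pad(case.history, size), case.question)
            buckets[(case.category, size)].append(score(answer, case.expected))
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def first_mention_agent(history, question):
    """Toy baseline: answers with the FIRST turn containing the question's last word."""
    key = question.rstrip("?").split()[-1].lower()
    return next((turn for turn in history if key in turn.lower()), ABSTAIN)


cases = [
    Case("single-hop", ["My dog is named Biscuit."], "What is the name of my dog?", "Biscuit"),
    Case("knowledge-update", ["I live in Lisbon.", "I moved, so now I live in Porto."],
         "Where do I live?", "Porto"),
    Case("abstention", ["My dog is named Biscuit."], "When is my sister's birthday?", None),
]

report = run_eval(first_mention_agent, cases)
for (category, size), mean in sorted(report.items()):
    print(f"{category:17s} filler={size:>6d} words  mean score={mean:+.1f}")
```

The toy agent is the "cache with no invalidation" from the list: it passes single-hop and abstention but scores -1 on the knowledge-update case at every history length, because it always returns the first mention. It is far too simple to show context rot. To look for that, swap in a real agent and compare each category's score across the filler sizes.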
None of this requires a research budget. It requires treating memory as something you measure on purpose, in pieces, the way LoCoMo and LongMemEval do — rather than something you assume the next model release will hand you. The industry ships agents far faster than it ships memory, and the gap between a memory feature and a memory you can trust is exactly the set of questions these benchmarks ask and most internal evals don't. The window keeps getting bigger. The forgetting, measured honestly, has not gone away — it has just moved further down the transcript, where nobody is looking.



