There is a comfortable story about agent memory that goes like this: context windows keep growing, so memory is a solved problem in waiting — just put the whole history in the prompt and let attention sort it out. It is a tidy story, and the two benchmarks that matter most were built specifically to break it.

The reason "just use long context" fails is not capacity. It is that accuracy falls apart as the history fills up — the effect now widely called context rot. A model can technically fit a hundred sessions of conversation and still answer worse than it would on five. So evaluating memory cannot mean checking whether a fact is somewhere in the window. It has to mean checking whether the agent can find it, reason over it, and know when it isn't there. That is what LoCoMo and LongMemEval actually grade.

LoCoMo: the long conversation

LoCoMo (Long Conversational Memory) is the closest thing the field has to a standard for multi-session recall. Its dialogues average 27.2 sessions of back-and-forth, the kind of accumulated relationship a personal assistant would have with a user over months. On top of those transcripts sit 1,540 questions, and the category split is the whole point: 841 single-hop (the fact is in one place), 282 multi-hop (you must connect facts across sessions), 321 temporal (what was true when), and 96 open-domain.
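The split matters arithmetically, too. Here is a minimal sketch of how a pooled score weights the four categories. The question counts are the ones quoted above; the per-category accuracies are invented purely for illustration and are not results from any system.

```python
# Question counts per category, as quoted above for LoCoMo.
COUNTS = {"single_hop": 841, "multi_hop": 282, "temporal": 321, "open_domain": 96}

# Hypothetical per-category accuracies, made up for illustration only.
ASSUMED_ACCURACY = {
    "single_hop": 0.90,
    "multi_hop": 0.50,
    "temporal": 0.40,
    "open_domain": 0.60,
}


def micro_average(counts: dict, accuracy: dict) -> float:
    """Accuracy over all questions pooled together (the usual headline number)."""
    total = sum(counts.values())
    return sum(counts[name] * accuracy[name] for name in counts) / total


def macro_average(accuracy: dict) -> float:
    """Unweighted mean of the per-category accuracies."""
    return sum(accuracy.values()) / len(accuracy)


if __name__ == "__main__":
    total = sum(COUNTS.values())
    for name, n in COUNTS.items():
        print(f"{name:12s} {n:5d} questions  {n / total:6.1%} of the benchmark")
    print(f"pooled score: {micro_average(COUNTS, ASSUMED_ACCURACY):.3f}")
    print(f"macro score:  {macro_average(ASSUMED_ACCURACY):.3f}")
```

Single-hop questions are more than half the set, so with these made-up numbers the pooled score lands near 0.70 while the assumed temporal accuracy is 0.40. Reporting each category on its own line is the cheap fix.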

The headline result is sobering. Even with strategies like long-context models or retrieval-augmented memory, systems substantially trail human performance — and the gap is widest on exactly the categories that matter for a real assistant: temporal reasoning and long-range causal understanding. Single-hop recall looks fine. "What did I tell you about my sister's wedding three months ago, and has the date changed since?" does not.

LongMemEval: the assistant that has to say no

LongMemEval attacks the same problem from the chat-assistant angle and adds the test I find most underrated. It places 500 questions inside freely scalable histories — the standard setting is around 115k tokens of prior interaction — and grades five distinct abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

Two of those deserve emphasis. Knowledge update asks whether the agent notices that a fact changed — you moved cities, you switched jobs — rather than dutifully reciting the stale version it learned first. Abstention asks whether it will admit the memory simply doesn't contain the answer instead of confabulating one. Most memory demos never test either, because both are about restraint, and restraint doesn't screenshot well.
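Both checks are easy to write down for your own agent. What follows is a rough sketch of a string-matching grader, not LongMemEval's scoring method: `MemoryCase`, `grade`, and the abstention phrases are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Phrases treated as an abstention. This list is an assumption; tune it to
# however your agent actually words a refusal.
ABSTAIN_MARKERS = ("i don't know", "i do not know", "no record", "not in my memory")


@dataclass
class MemoryCase:
    question: str
    kind: str                       # "knowledge_update" or "abstention"
    current: Optional[str] = None   # latest value; None when the history has no answer
    stale: Tuple[str, ...] = ()     # superseded values the agent must not recite


def is_abstention(answer: str) -> bool:
    text = answer.lower()
    return any(marker in text for marker in ABSTAIN_MARKERS)


def grade(case: MemoryCase, answer: str) -> bool:
    """Return True when the answer shows the restraint the case is testing."""
    text = answer.lower()
    if case.kind == "abstention":
        # The history holds no answer, so anything but a refusal is confabulation.
        return is_abstention(answer)
    if case.kind == "knowledge_update":
        has_current = case.current is not None and case.current.lower() in text
        recites_stale = any(old.lower() in text for old in case.stale)
        return has_current and not recites_stale and not is_abstention(answer)
    raise ValueError(f"unknown case kind: {case.kind}")


if __name__ == "__main__":
    moved = MemoryCase("Where do I live?", "knowledge_update",
                       current="Denver", stale=("Boston",))
    print(grade(moved, "You live in Denver."))   # current value, no stale one
    print(grade(moved, "You live in Boston."))   # recites the stale value
```

Substring matching is deliberately crude. It will mark "you moved from Boston to Denver" as a failure because the stale city appears, so a stricter rubric or a model-based judge may be a better fit once the cases get less tidy.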

The numbers justify the pessimism. On LongMemEval, long-context LLMs show a 30–60% accuracy drop as the interaction history grows. More context, less reliable memory. The benchmark's own framing splits memory design into indexing, retrieval, and reading — a useful reminder that "memory" is a pipeline, not a window, and each stage can be the one that fails you.
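If memory is a pipeline, a wrong answer can be traced to the stage that lost it. The sketch below assumes you can log, per question, which memory entry holds the evidence, what the store kept, and what retrieval returned. The `Trace` fields are hypothetical, and the attribution rule is one simple choice among several.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Trace:
    """One graded, answerable question. All field names are hypothetical."""
    evidence_id: str          # id of the memory entry that holds the answer
    indexed_ids: Set[str]     # everything the memory store kept
    retrieved_ids: Set[str]   # what retrieval returned for this question
    answer_correct: bool      # verdict from whatever grader you use


def failing_stage(trace: Trace) -> str:
    """Blame the earliest stage that dropped the evidence, or return "ok"."""
    if trace.answer_correct:
        return "ok"
    if trace.evidence_id not in trace.indexed_ids:
        return "indexing"     # the fact was never stored
    if trace.evidence_id not in trace.retrieved_ids:
        return "retrieval"    # stored, but not surfaced for this question
    return "reading"          # surfaced, and the model still answered wrong


def stage_report(traces: List[Trace]) -> Counter:
    """Count outcomes per stage across a batch of traces."""
    return Counter(failing_stage(t) for t in traces)


if __name__ == "__main__":
    batch = [
        Trace("m1", {"m1", "m2"}, {"m1"}, True),
        Trace("m3", {"m1"}, set(), False),
        Trace("m2", {"m1", "m2"}, {"m1"}, False),
        Trace("m1", {"m1"}, {"m1"}, False),
    ]
    print(stage_report(batch))
```

This only applies to questions that have evidence somewhere in the history. Abstention cases have none by construction and need their own check.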

The benchmarks agree on the uncomfortable part: simple recall is easy and stays easy. It's temporal reasoning, knowledge updates, and knowing when to abstain that collapse — and a single average score hides all three.

How to actually evaluate your agent's memory

The practical takeaway is not "go run LoCoMo." Most teams won't, and their data doesn't look like a benchmark's anyway. The takeaway is to steal the benchmarks' structure for your own evals.
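One way to read "steal the structure" is to tag each of your own questions with an ability, then rerun the same questions while unrelated sessions pile up on top of the evidence. The harness below is a sketch under those assumptions: `sweep`, `toy_agent`, and the session strings are hypothetical, and the containment check stands in for a real grader.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (ability, question, expected answer). Ability names follow LongMemEval's list.
Case = Tuple[str, str, str]
Agent = Callable[[List[str], str], str]


def sweep(agent: Agent, evidence: List[str], filler: List[str],
          cases: List[Case], filler_counts: List[int]) -> Dict[int, Dict[str, float]]:
    """Per-ability accuracy at each amount of unrelated history."""
    results: Dict[int, Dict[str, float]] = {}
    for n in filler_counts:
        # Evidence sessions first, then n unrelated sessions stacked after them.
        history = evidence + filler[:n]
        hits = defaultdict(list)
        for ability, question, expected in cases:
            answer = agent(history, question)
            hits[ability].append(expected.lower() in answer.lower())
        results[n] = {ability: sum(v) / len(v) for ability, v in hits.items()}
    return results


def toy_agent(history: List[str], question: str) -> str:
    """Stand-in agent that only reads the last three sessions. Replace with yours."""
    recent = " ".join(history[-3:])
    return "Denver" if "Denver" in recent else "I don't know."


if __name__ == "__main__":
    evidence = ["User: I moved to Denver last week."]
    filler = [f"User: small talk, session {i}" for i in range(10)]
    cases = [("information_extraction", "Where do I live?", "Denver")]
    for n, scores in sweep(toy_agent, evidence, filler, cases, [0, 2, 10]).items():
        print(n, scores)
```

The output is a small table of ability by history size rather than one number, which is the point: it shows where accuracy starts to slide and for which kind of question.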

None of this requires a research budget. It requires treating memory as something you measure on purpose, in pieces, the way LoCoMo and LongMemEval do — rather than something you assume the next model release will hand you. The industry ships agents far faster than it ships memory, and the gap between a memory feature and a memory you can trust is exactly the set of questions these benchmarks ask and most internal evals don't. The window keeps getting bigger. The forgetting, measured honestly, has not gone away — it has just moved further down the transcript, where nobody is looking.