If you are choosing an agent-memory system in 2026 — Mem0, Zep, Letta, one of the newer entrants — you have probably seen a chart. One vendor is at 84%. Another publishes a replication showing that same vendor at 58%. The first vendor responds with 75%. The numbers are precise, confident, and irreconcilable.
They are irreconcilable for a reason worth understanding before you spend a quarter integrating the wrong thing. Almost every one of these figures comes from a single benchmark called LOCOMO — and LOCOMO is smaller and softer than the decimal points suggest. (If you're still deciding whether you even need a dedicated memory layer versus retrieval over your own store, start with agent memory vs RAG and the types of agent memory first — this piece assumes you've decided you want one.)
What everyone is actually measuring
LOCOMO comes from the paper "Evaluating Very Long-Term Conversational Memory of LLM Agents" (Maharana et al., February 2024). It's a genuinely thoughtful dataset: machine-generated dialogues grounded in personas and temporal event graphs, then verified and edited by human annotators for long-range consistency.
It is also ten conversations. On average, each one runs 27.2 sessions of 21.6 turns apiece, around 16.6K tokens per conversation: long, but there are only ten of them. The tasks are question answering, event summarization, and multimodal dialogue. When a vendor tells you their memory layer scores X% "on LOCOMO," X% is a grade on those ten conversations.
With a sample that small, a handful of disputed items moves the headline by whole points. That is the structural reason the leaderboard is so noisy — and it's before anyone touches how the test is run.
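One rough way to see how soft a ten-conversation score is: resample whole conversations rather than individual questions, since questions drawn from the same conversation are not independent. The sketch below is illustrative only. The per-conversation tallies are made-up numbers, not LOCOMO results, and `bootstrap_accuracy_interval` is a hypothetical helper, not part of any vendor's tooling.

```python
import random


def bootstrap_accuracy_interval(per_conversation, n_resamples=2000, seed=0):
    """Percentile bootstrap over conversations (clusters), not questions.

    per_conversation: list of (num_correct, num_questions) tuples.
    Returns (low, high), an approximate 95% interval for overall accuracy.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Draw as many conversations as we have, with replacement.
        sample = [rng.choice(per_conversation) for _ in per_conversation]
        correct = sum(c for c, _ in sample)
        total = sum(n for _, n in sample)
        estimates.append(correct / total)
    estimates.sort()
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return low, high


# Made-up (correct, asked) tallies for ten conversations; not real LOCOMO data.
tallies = [(120, 150), (95, 160), (130, 155), (88, 140), (110, 150),
           (140, 165), (75, 145), (125, 150), (100, 155), (115, 160)]

point = sum(c for c, _ in tallies) / sum(n for _, n in tallies)
low, high = bootstrap_accuracy_interval(tallies)
print(f"accuracy {point:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

If the interval you get on a vendor's per-conversation breakdown is wider than the gap between two vendors, the ranking between them is not something the data can support.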
Three numbers for one system
The clearest illustration is the Zep dispute. Zep originally reported roughly 84% on LOCOMO. Then Mem0 re-ran Zep's system and scored it at 58.44%, alleging methodology errors. Zep rebutted with 75.14%.
That's an 84 → 58 → 75 spread for one product on one dataset. None of the parties is necessarily lying. They are running the same ten conversations through different machinery:
- **Different retrieval configs**: how much memory is fetched, how it's ranked, what's injected into context.
- **Different judge models**: the headline scores in this dispute are graded by an LLM judge, and a stricter or more lenient judge shifts the score without touching the memory system at all.
- **Different prompt formats**: how the question and retrieved memories are framed for the answering model.
When every vendor runs the same benchmark under its own configuration, the benchmark stops measuring the systems and starts measuring the configurations.
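If you do want to compare two reported scores, the first question is whether the setup around the memory system was the same. A minimal sketch of one way to make that checkable, assuming you record the three knobs above alongside every score: `EvalConfig`, its field names, and the model names are all hypothetical placeholders, not anything a vendor publishes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalConfig:
    """Settings outside the memory system that can still move the score."""
    top_k: int             # how many memories are fetched per question
    judge_model: str       # model that grades each answer
    answer_model: str      # model that answers from the retrieved memories
    prompt_template: str   # how the question and memories are framed

    def fingerprint(self) -> str:
        """Short, stable hash of the full configuration."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def comparable(a: EvalConfig, b: EvalConfig) -> bool:
    """Two scores are only worth comparing if their configs match exactly."""
    return a.fingerprint() == b.fingerprint()


# Placeholder model names; substitute whatever you actually run.
vendor_a_run = EvalConfig(
    top_k=10,
    judge_model="judge-model-a",
    answer_model="answer-model-a",
    prompt_template="Memories:\n{memories}\n\nQuestion: {question}",
)
# Same setup, but graded by a different judge.
vendor_b_run = replace(vendor_a_run, judge_model="judge-model-b")

print(vendor_a_run.fingerprint(), vendor_b_run.fingerprint())
print("comparable:", comparable(vendor_a_run, vendor_b_run))
```

Storing the fingerprint next to each score makes it obvious when two numbers came from different machinery, even if everything else on the chart looks identical.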
Mem0's own numbers sit in the same fog: about 66.9% accuracy, with independent reruns landing closer to 58–66%. The dataset also carries documented flaws — speaker misattribution, where an answer is credited to the wrong participant, and ambiguous questions with more than one defensible answer. On ten conversations, those aren't rounding errors; they're swing votes.
The numbers that actually price your product
Here is the part the accuracy war buries. Alongside its 66.9%, Mem0 reports a 0.71s median latency and roughly 1,800 tokens per conversation. Those two figures, not the accuracy percentage, are what determine whether a memory system is viable in your product.
A memory layer runs on every turn. If it adds a second of latency and a couple thousand tokens per exchange, that cost compounds across every user, every session, every day — and it shows up on your inference bill and in your p95 response time long before anyone notices a two-point accuracy difference. A system that wins the leaderboard by three points while doubling per-turn token consumption is, for most production workloads, the worse purchase.
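The compounding is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic, not a pricing model: every input (tokens per turn, turns per user, user count, and the price per million tokens) is a hypothetical placeholder to replace with your own traffic and your provider's rates.

```python
def monthly_memory_overhead(tokens_per_turn, turns_per_user_per_day,
                            daily_active_users, usd_per_million_tokens,
                            days=30):
    """Return (tokens, dollars) added by the memory layer over `days` days."""
    tokens = tokens_per_turn * turns_per_user_per_day * daily_active_users * days
    dollars = tokens / 1_000_000 * usd_per_million_tokens
    return tokens, dollars


# All inputs are hypothetical; plug in your own traffic and pricing.
baseline_tokens, baseline_usd = monthly_memory_overhead(
    tokens_per_turn=2_000, turns_per_user_per_day=20,
    daily_active_users=10_000, usd_per_million_tokens=3.0)

# Same traffic, but a system that injects twice as many tokens per turn.
doubled_tokens, doubled_usd = monthly_memory_overhead(
    tokens_per_turn=4_000, turns_per_user_per_day=20,
    daily_active_users=10_000, usd_per_million_tokens=3.0)

print(f"baseline: {baseline_tokens:,} tokens, ${baseline_usd:,.0f} per month")
print(f"doubled:  {doubled_tokens:,} tokens, ${doubled_usd:,.0f} per month")
```

Under these assumed inputs the memory layer alone accounts for billions of tokens a month, and doubling per-turn consumption doubles that line item regardless of what it does to accuracy.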
How to actually shop
Treat published LOCOMO scores as evidence that a system is in the credible range, not as a ranking. Then evaluate on your own traffic, measuring the triangle together:
- **Accuracy on your questions**: build a small eval set from real user sessions in your domain. Ten generic conversations don't predict your recall.
- **Added latency per turn**: memory is on the hot path; measure what it does to p95, not just the mean.
- **Tokens (dollars) per turn**: the recurring cost that scales with usage. This is the same discipline that separates the credible vendors when you compare them head-to-head, as in Telemem vs Mem0.
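The three measurements above fit in one small harness. This is a minimal sketch under stated assumptions: `answer_fn` is a hypothetical callable wrapping whatever memory system you are testing and returning an answer plus the tokens it consumed, the eval set is a toy stand-in for questions drawn from your own sessions, and exact-match grading is a simplification you would likely swap for a rubric or a pinned judge.

```python
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(answer_fn, eval_set, is_correct):
    """Run every question once and report accuracy, p95 latency, mean tokens.

    answer_fn(question) must return (answer_text, tokens_used).
    """
    correct, latencies, tokens = [], [], []
    for question, expected in eval_set:
        start = time.perf_counter()
        answer, tokens_used = answer_fn(question)
        latencies.append(time.perf_counter() - start)
        correct.append(is_correct(answer, expected))
        tokens.append(tokens_used)
    return {
        "accuracy": sum(correct) / len(correct),
        "p95_latency_s": percentile(latencies, 95),
        "mean_tokens": sum(tokens) / len(tokens),
    }


def stub_answer_fn(question):
    """Stand-in for a real memory system; replace with your integration."""
    canned = {"Where does Ana work?": "a bakery"}
    return canned.get(question, "unknown"), 1500


def exact_match(answer, expected):
    return answer.strip().lower() == expected.strip().lower()


# Toy eval set; in practice, build this from real sessions in your domain.
eval_set = [("Where does Ana work?", "a bakery"),
            ("What is Ana's dog called?", "Miso")]

report = evaluate(stub_answer_fn, eval_set, exact_match)
print(report)
```

Running the same harness, with the same eval set and the same grader, against each candidate gives you three numbers per system that are at least comparable with each other, which is more than the published leaderboards can offer.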
The uncomfortable truth of the 2026 agent-memory market is that no single accuracy number is comparable across vendors. That's not a scandal to wait out; it's the permanent condition of a field benchmarking itself on ten conversations with home-field configs. The vendors will keep publishing decimals. Your job is to stop reading them as a scoreboard and start running the only benchmark that predicts your bill: yours.



