If you are choosing an agent-memory system in 2026 — Mem0, Zep, Letta, one of the newer entrants — you have probably seen a chart. One vendor is at 84%. Another publishes a replication showing that same vendor at 58%. The first vendor responds with 75%. The numbers are precise, confident, and irreconcilable.

They are irreconcilable for a reason worth understanding before you spend a quarter integrating the wrong thing. Almost every one of these figures comes from a single benchmark called LOCOMO — and LOCOMO is smaller and softer than the decimal points suggest. (If you're still deciding whether you even need a dedicated memory layer versus retrieval over your own store, start with agent memory vs RAG and the types of agent memory first — this piece assumes you've decided you want one.)

What everyone is actually measuring

LOCOMO comes from the paper Evaluating Very Long-Term Conversational Memory of LLM Agents (Maharana et al., February 2024). It's a genuinely thoughtful dataset: machine-generated dialogues grounded in personas and temporal event graphs, then verified and edited by human annotators for long-range consistency.

It is also only ten conversations. Each averages 27.2 sessions at 21.6 turns per session (roughly 590 turns), or about 16.6K tokens per conversation: long, but still only ten of them. The tasks are question answering, event summarization, and multimodal dialogue. When a vendor tells you their memory layer scores X% "on LOCOMO," X% is a grade on those ten conversations.

With a sample that small, a handful of disputed items moves the headline by whole points. That is the structural reason the leaderboard is so noisy — and it's before anyone touches how the test is run.
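To get a feel for how wide the uncertainty can be with ten conversations, here is a minimal sketch of a bootstrap that resamples whole conversations rather than individual questions, since questions from the same dialogue are not independent. The per-conversation counts are invented for illustration and are not real LOCOMO results; `conversation_bootstrap_ci` is a hypothetical helper name.

```python
import random


def conversation_bootstrap_ci(per_conv, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for pooled accuracy.

    per_conv: list of (n_correct, n_questions) tuples, one per conversation.
    Whole conversations are resampled with replacement.
    """
    rng = random.Random(seed)
    k = len(per_conv)
    stats = []
    for _ in range(n_resamples):
        sample = [per_conv[rng.randrange(k)] for _ in range(k)]
        correct = sum(c for c, _ in sample)
        total = sum(n for _, n in sample)
        stats.append(correct / total)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-conversation results (NOT real LOCOMO numbers).
results = [(150, 200), (120, 200), (160, 200), (110, 200), (140, 200),
           (155, 200), (125, 200), (145, 200), (100, 200), (135, 200)]

point = sum(c for c, _ in results) / sum(n for _, n in results)
low, high = conversation_bootstrap_ci(results)
print(f"accuracy {point:.1%}, 95% interval [{low:.1%}, {high:.1%}]")
```

If the interval you get on a vendor's per-conversation breakdown is wider than the gap between two systems, the ranking between them is not something this dataset can settle.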

Three numbers for one system

The clearest illustration is the Zep dispute. Zep originally reported roughly 84% on LOCOMO. Then Mem0 re-ran Zep's system and scored it at 58.44%, alleging methodology errors. Zep rebutted with 75.14%.

That's an 84 → 58 → 75 spread, roughly 26 points from top to bottom, for one product on one dataset. None of the parties is necessarily lying. They are running the same ten conversations through different machinery.

When every vendor runs the same benchmark under its own configuration, the benchmark stops measuring the systems and starts measuring the configurations.

Mem0's own numbers sit in the same fog: about 66.9% accuracy, with independent reruns landing closer to 58–66%. The dataset also carries documented flaws — speaker misattribution, where an answer is credited to the wrong participant, and ambiguous questions with more than one defensible answer. On ten conversations, those aren't rounding errors; they're swing votes.

The numbers that actually price your product

Here is the part the accuracy war buries. Alongside its 66.9%, Mem0 reports a 0.71s median latency and roughly 1,800 tokens per conversation. Those two figures, not the accuracy percentage, are what determine whether a memory system is viable in your product.

A memory layer runs on every turn. If it adds a second of latency and a couple thousand tokens per exchange, that cost compounds across every user, every session, every day — and it shows up on your inference bill and in your p95 response time long before anyone notices a two-point accuracy difference. A system that wins the leaderboard by three points while doubling per-turn token consumption is, for most production workloads, the worse purchase.
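The compounding is easy to put numbers on. The sketch below is a back-of-the-envelope model, not a pricing tool: the traffic volume, the token price, and the two overhead profiles are all hypothetical placeholders to be replaced with your own figures.

```python
from dataclasses import dataclass


@dataclass
class MemoryOverhead:
    extra_tokens_per_turn: int   # tokens the memory layer adds to each prompt
    extra_latency_s: float       # seconds the memory layer adds to each turn


def daily_overhead(overhead, turns_per_day, usd_per_million_tokens):
    """Return (extra USD per day, extra user-facing wait in hours per day)."""
    tokens = overhead.extra_tokens_per_turn * turns_per_day
    usd = tokens / 1_000_000 * usd_per_million_tokens
    wait_hours = overhead.extra_latency_s * turns_per_day / 3600
    return usd, wait_hours


# All values below are assumptions for illustration only.
TURNS_PER_DAY = 500_000
USD_PER_MILLION_TOKENS = 3.0
profiles = {
    "lean": MemoryOverhead(extra_tokens_per_turn=2000, extra_latency_s=1.0),
    "heavy": MemoryOverhead(extra_tokens_per_turn=4000, extra_latency_s=1.0),
}

costs = {
    name: daily_overhead(o, TURNS_PER_DAY, USD_PER_MILLION_TOKENS)
    for name, o in profiles.items()
}
for name, (usd, hours) in costs.items():
    print(f"{name}: ${usd:,.0f}/day extra spend, {hours:,.0f} hours/day extra waiting")
```

Under these made-up inputs, doubling per-turn tokens takes the daily overhead from $3,000 to $6,000, which is the kind of difference a three-point accuracy lead has to justify.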

How to actually shop

Treat published LOCOMO scores as evidence that a system is in the credible range, not as a ranking. Then evaluate on your own traffic, measuring the triangle together: answer accuracy, added latency, and token cost per turn.
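One way to capture accuracy, latency, and token use in a single pass is a small harness like the sketch below. Everything in it is an assumption for illustration: `Reply`, `Report`, `evaluate`, and the `ask` callable are hypothetical names, not any vendor's API, and the substring grader stands in for whatever correctness check fits your data.

```python
import math
import statistics
import time
from typing import NamedTuple


class Reply(NamedTuple):
    text: str
    tokens_used: int   # total tokens the call consumed, as reported by your stack


class Report(NamedTuple):
    accuracy: float
    p50_latency_s: float
    p95_latency_s: float
    mean_tokens: float


def percentile(values, q):
    """Nearest-rank percentile, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def evaluate(ask, cases, is_correct):
    """Run every (question, expected) pair through `ask` and summarize."""
    latencies, tokens, hits = [], [], 0
    for question, expected in cases:
        start = time.perf_counter()
        reply = ask(question)
        latencies.append(time.perf_counter() - start)
        tokens.append(reply.tokens_used)
        hits += bool(is_correct(reply.text, expected))
    return Report(
        accuracy=hits / len(cases),
        p50_latency_s=percentile(latencies, 50),
        p95_latency_s=percentile(latencies, 95),
        mean_tokens=statistics.fmean(tokens),
    )


def stub_ask(question):
    # Stand-in for a real memory-backed agent call.
    canned = {"Where does Ana work?": "Ana works at a bakery."}
    return Reply(canned.get(question, "I don't know."), tokens_used=1200)


cases = [("Where does Ana work?", "bakery"), ("What is Ben's dog called?", "Rex")]
report = evaluate(stub_ask, cases, lambda got, want: want.lower() in got.lower())
print(report)
```

Swapping `stub_ask` for a thin wrapper around each candidate system, with the same cases and the same grader, gives you numbers that are at least comparable with each other, which the published scores are not.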

The uncomfortable truth of the 2026 agent-memory market is that no single accuracy number is comparable across vendors. That's not a scandal to wait out; it's the permanent condition of a field benchmarking itself on ten conversations with home-field configs. The vendors will keep publishing decimals. Your job is to stop reading them as a scoreboard and start running the only benchmark that predicts your bill: yours.