Every few weeks someone building an agent asks the same sharp question: isn't memory just RAG with extra steps? You embed a query, you search a vector store, you paste the matching text into the prompt. That's RAG. That's also, apparently, how every "agent memory" library works. So why does memory get its own product category, its own startups, its own benchmarks?
The question is fair, because at read time the two really are nearly identical. The honest answer is that they diverge at a step RAG doesn't have — and once you see where, the rest of the differences fall out of it.
They retrieve the same way. They're written differently.
RAG was formalized in 2020 by Lewis et al. as a pairing: a parametric model (the LLM's weights) plus a non-parametric memory (a dense vector index of documents the model can retrieve from). The defining property, in AWS's phrasing, is that RAG references "an authoritative knowledge base outside of its training data" before answering. Someone builds that knowledge base offline (your docs, your tickets, Wikipedia) and the agent reads from it. It never writes to it. The corpus at the end of the session is byte-for-byte the corpus at the start.
Agent memory inverts exactly that one property. The store is something the agent writes to, during the conversation, about the conversation. AWS's own AgentCore docs draw the line cleanly: RAG is query-time and read-only, while long-term memory has a distinct write phase at conversation time and a read phase at query time. Mem0 makes the same cut for a developer audience. Memory is not a different retrieval algorithm. It's RAG where the agent is also the author of the corpus.
Memory is RAG where the agent writes the index. Everything hard about it follows from that one inversion.
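To make the inversion concrete, here is a minimal sketch of the two interfaces side by side. Everything in it is illustrative: the class names RagIndex and AgentMemory, the write and search methods, and the word-overlap score standing in for vector similarity are assumptions made for this example, not any library's API.

```python
from dataclasses import dataclass, field


def overlap(query: str, text: str) -> int:
    """Toy relevance score: shared lowercase words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


@dataclass(frozen=True)
class RagIndex:
    """Hypothetical read-only corpus: built offline, never changed by the agent."""
    docs: tuple[str, ...]

    def search(self, query: str, k: int = 1) -> list[str]:
        return sorted(self.docs, key=lambda d: overlap(query, d), reverse=True)[:k]


@dataclass
class AgentMemory:
    """Hypothetical memory store: same read path, plus a write path used mid-conversation."""
    facts: list[str] = field(default_factory=list)

    def write(self, fact: str) -> None:
        self.facts.append(fact)

    def search(self, query: str, k: int = 1) -> list[str]:
        return sorted(self.facts, key=lambda f: overlap(query, f), reverse=True)[:k]


rag = RagIndex(docs=("Refunds are accepted within 30 days.", "Shipping takes 5 days."))
memory = AgentMemory()
memory.write("User prefers email over phone calls.")  # written during the conversation

rag_hit = rag.search("are refunds accepted")
mem_hit = memory.search("does the user prefer email")
```

The two search methods are deliberately identical. The only structural difference is that AgentMemory exposes write and RagIndex does not.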
The write phase is the whole game
If writing were just appending, this would be a footnote. It isn't, because new facts contradict old ones. In March the user said they were vegetarian; in June they ordered ribs. They used to prefer Nike; now they prefer On. An append-only log turns into a pile of mutually contradictory statements, and vector search, which returns whatever is most similar rather than most current, will happily hand back the stale one. Zep's canonical example is exactly this: a user changes their sneaker preference, and RAG-as-memory keeps recommending the old brand because that text is still the closest match.
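The stale-match problem is easy to reproduce in a toy setting. The sketch below assumes an append-only log and replaces real embeddings with a word-overlap score. The strings are contrived so that the older entry shares more words with the query, which is the situation the sneaker example describes.

```python
def overlap(query: str, text: str) -> int:
    """Toy similarity: shared lowercase words (stand-in for cosine similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Append-only log of (month written, text). Nothing is ever revised or removed.
log = [
    ("March", "user prefers Nike as their sneaker brand"),
    ("June", "user switched to On running shoes"),
]

query = "which sneaker brand does the user prefer"

# Rank purely by similarity; the month plays no part in the score.
best_month, best_text = max(log, key=lambda entry: overlap(query, entry[1]))
print(best_month, "->", best_text)
```

With these inputs the March entry wins, because it shares three words with the query and the June entry shares one. Nothing in the ranking knows that June supersedes March.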
So production memory does real work on write. The Mem0 paper describes a two-stage pipeline: an extraction step where an LLM pulls candidate facts from the latest exchange, then a consolidation step that compares each fact against what's stored and chooses one of four operations — ADD a new fact, UPDATE an existing one, DELETE a contradicted one, or NOOP. Letta's MemGPT lineage frames the same idea as the agent self-editing tiered memory through tool calls. None of these operations exist in RAG, because RAG never decides what to keep. Its corpus is someone else's problem, settled before the agent ran.
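A minimal sketch of the consolidation step, under heavy assumptions. Facts are modeled as key/value pairs, the rule-based consolidate function stands in for the LLM's judgment, and the mapping from situation to operation is this example's own simplification, not Mem0's implementation. Only the four operation names come from the paper as described above.

```python
from enum import Enum
from typing import Optional


class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


def consolidate(store: dict[str, str], key: str, value: Optional[str]) -> Op:
    """Compare one candidate fact against the store, apply the change, report the op.

    Assumption for this sketch: value=None means the fact was retracted outright.
    """
    current = store.get(key)
    if value is None:
        if current is None:
            return Op.NOOP       # nothing stored, nothing to retract
        del store[key]
        return Op.DELETE         # stored fact no longer holds
    if current is None:
        store[key] = value
        return Op.ADD            # first time this fact is seen
    if current == value:
        return Op.NOOP           # already known, no change
    store[key] = value
    return Op.UPDATE             # newer value replaces the older one


store: dict[str, str] = {}
ops = [
    consolidate(store, "sneaker_brand", "Nike"),  # first mention
    consolidate(store, "sneaker_brand", "Nike"),  # repeated, nothing new
    consolidate(store, "sneaker_brand", "On"),    # preference changed
    consolidate(store, "diet", "vegetarian"),     # new fact
    consolidate(store, "diet", None),             # retracted after the ribs order
]
```

After these five candidates the store holds a single entry, the current sneaker brand. An append-only log given the same inputs would hold four statements, two of them out of date.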
Which is why they fail differently
Here's the part that should actually drive your architecture. RAG's failure mode is retrieval: it pulls a wrong, irrelevant, or stale document, and the model grounds an answer in it. Bad, but bounded — the corpus was curated externally, so its errors are someone's editorial mistakes, fixable by re-indexing.
Memory's signature failure is worse and stranger: it can poison itself. Because the agent authors its own store, a wrong conclusion the agent reaches can be written back as a fact and then retrieved later as verified truth. Redis describes context poisoning as exactly this loop — stale or self-written memory surfaces, gets treated as ground truth, and "every future interaction references the same wrong info." The reasoning looks coherent the entire time, which is what makes it hard to catch. A RAG corpus can't do this to itself, because the agent has no pen.
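The loop can be sketched in a few lines. This is a toy illustration, not a model of any real system: retrieve and build_prompt are hypothetical helpers, retrieval is reduced to word matching, and the wrong inference is hard-coded rather than produced by a model.

```python
def retrieve(memory: list[str], query: str) -> list[str]:
    """Toy retrieval: entries sharing at least one word with the query."""
    q = set(query.lower().split())
    return [m for m in memory if q & set(m.lower().split())]


def build_prompt(memory: list[str], user_msg: str) -> str:
    """Paste retrieved memories into the prompt as plain facts."""
    facts = retrieve(memory, user_msg)
    return "Known facts: " + "; ".join(facts) + "\nUser: " + user_msg


memory: list[str] = []

# Session 1: the user shops for a gift. The agent wrongly infers the shoes are
# for the user and writes that guess back in the same form as a stated fact.
memory.append("user wears size 12 shoes")

# Later sessions: the stored string carries no trace of having been a guess.
later_msgs = ["recommend running shoes", "what size should I order", "any shoes on sale"]
prompts = [build_prompt(memory, msg) for msg in later_msgs]
poisoned = [p for p in prompts if "size 12" in p]
```

All three later prompts carry the guess under "Known facts", and nothing in the stored text distinguishes it from something the user actually said.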
So: which one?
The split is clean once you stop treating them as competitors. Reach for RAG when the question is "what do trusted sources say?": product docs, policies, a knowledge base, anything authoritative and externally maintained. Reach for memory when the question is "who is this user, and what happened before?": preferences, past decisions, the running state of a long task that has to survive across sessions.
Most serious agents need both, in separate stores, for separate reasons. The mistake isn't choosing wrong between them; it's pointing one vector store at both jobs and discovering, a few weeks in, that your "memory" is just a RAG index quietly accumulating contradictions it has no way to resolve.



