Every argument about agent memory this year has been about shape. Mem0, Zep, and Letta each pick a different structure — a vector store you bolt on, a temporal knowledge graph that reasons about facts that change, an OS-style memory the model pages itself. The debate is real, and the benchmarks are worth reading. But Redis's agent-memory-server quietly answers a different question, and it's the one that actually bites you in production: not how should memory be structured, but where should the work run.

Its answer is that memory should be a server. Not a library you import, not a class you instantiate inside your agent — a standalone service, Apache-2.0 and written in Python, that your agents reach over an API. That sounds like a boring packaging detail. It is the whole idea.

Two tiers, borrowed from the OS#

The architecture splits memory into two tiers that map cleanly onto a computer's own hierarchy. Working memory is session-scoped: the recent messages, any structured facts you've attached, and a running summary that automatically compacts older turns as the context window fills — the RAM and page cache of the system. Long-term memory is the persistent tier, the disk: records that survive the session, searchable by semantic (vector), keyword (full-text), or hybrid query.

The OS analogy isn't decoration — it maps onto the usual taxonomy of agent memory and tells you exactly what the server is optimizing. Working memory is cheap, fast, and disposable; long-term memory is durable and indexed. The interesting engineering lives in the swap between them.

The part everyone else runs on the hot path#

Here is the design decision that separates this from an embedded memory library. When something in a session is worth keeping, something has to read the conversation, call an LLM to extract the durable facts, embed them into vectors, model their topics, recognize entities, deduplicate against what's already stored, and write the result. That is a pile of latency — several model and embedding calls — and in a library-shaped memory system, it happens inline, on the turn, while your user waits.

Redis moves all of it off the request path. Promotion from working to long-term memory is a background job, queued through Docket, a distributed task backend, and processed by a separate worker fleet. Your API server stays responsive even while extraction is grinding through expensive model calls somewhere else. For local development you can run the same pipeline inline with an asyncio backend; for production you point it at workers and scale them independently of your API.

Two smaller details show the same instinct. Extraction is debounced — a five-minute default window coalesces repeated writes, so a chatty session isn't re-extracted on every single turn. And thread extraction resolves cross-message references before storing, so "he said he'd prefer the second option" becomes a fact with actual referents attached, not a dangling pronoun.

The hard part of agent memory was never the storage. It's the LLM deciding what's worth remembering — and that decision is expensive enough that where you run it is an architecture choice, not a config flag.

Two doors into the same memory#

The server exposes its memory twice over. There's a REST API for your application code — PUT /v1/working-memory/{session_id} to stash a session, POST /v1/long-term-memory/search to query, POST /v1/memory/prompt to get a memory-enriched prompt back. And there's an MCP server, over stdio or SSE, that surfaces the same operations as tools: search_long_term_memory, create_long_term_memory.

That second door is more consequential than it looks. In the usual framework model, an SDK silently injects retrieved context into your prompt — the memory decides what the model sees. Expose memory as MCP tools instead, and the relationship inverts: the model decides when to search, when to save, what to look up. The agent pages its own memory over a protocol, the way Letta intends, except the memory itself is a shared service any number of agents — or services in other languages — can point at. Extraction rides on LiteLLM, so the provider doing the remembering can be OpenAI, Anthropic, Bedrock, or a local Ollama model, independent of whatever runs your agent.

What the server model costs you#

None of this is free, and the honest read matters more than the pitch. Choosing the server means you now operate a distributed system: an API, a worker pool, a task queue, and Redis, where an embedded library was a dependency and nothing more. If you're building one agent, that's overhead you don't need — reach for a library and move on.

And "automatic extraction" is doing a lot of quiet work in the marketing. What gets promoted to long-term memory is chosen by a nondeterministic LLM applying an extraction policy. Tune it loose and you accumulate noise the agent will later retrieve with confidence; tune it tight and it forgets things you wanted kept. That recall-versus-precision dial is the actual product, and it isn't fully in your hands — it's in a prompt and a model you configure but don't control.

The server model earns its complexity in exactly one situation, and it's a common one: many agents, many sessions, memory that should be shared and searched behind a single API, and extraction costs you refuse to pay on the turn. If that's your shape, Redis's contribution isn't a new theory of memory. It's the recognition that memory, like every other stateful thing agents depend on, eventually wants to be a service.