The most quietly influential design decision of the last year in agents wasn't a model. It was the shape of Claude Code: a coding agent that could run for an hour, keep a to-do list, spawn helpers, and write to files instead of drowning in its own context. That shape has a name now — the deep agent — and it's being cloned in the open.
If you want the concept from the ground up, we've unpacked what deep agents are before; the clearest articulation comes from LangChain, which shipped a reference harness and, in its "Deep Agents" writeup, named the four pillars that separate a deep agent from a plain tool-calling loop:
- Planning tools. A write_todos tool that, mechanically, does almost nothing: it just maintains a structured task list in state. Its value is that it gives the model a place to plan and re-plan out loud, which measurably steadies long runs.
- Sub-agents. A task tool that spawns an ephemeral agent with clean, isolated context. The orchestrator delegates a chunk of work, gets back a result, and never pays the token cost of the sub-agent's scratch reasoning.
- A filesystem. Files as offloaded memory. Instead of stuffing every intermediate result into a ~200K-token window until it collapses, the agent writes to disk and passes references.
- A long, detailed system prompt. Hundreds of lines, Claude Code-style. Deep agents are prompt-heavy on purpose.
A shallow agent is tools in a loop. A deep agent adds planning, context offloading, and delegation — the three things that let a run survive past the context window.
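To make the planning and filesystem pillars concrete, here is a minimal sketch in plain Python. It is not the reference harness's code: the names AgentState, Todo, and write_file are hypothetical, and an in-memory dict stands in for a real disk. It only illustrates how little the planning tool does and how a file write returns a short reference instead of the full content.

```python
from dataclasses import dataclass, field
from typing import Literal

Status = Literal["pending", "in_progress", "done"]


@dataclass
class Todo:
    content: str
    status: Status = "pending"


@dataclass
class AgentState:
    # Hypothetical run state: a task list plus a dict standing in for a disk.
    todos: list[Todo] = field(default_factory=list)
    files: dict[str, str] = field(default_factory=dict)


def write_todos(state: AgentState, todos: list[Todo]) -> str:
    """Replace the task list wholesale; the tool does nothing else."""
    state.todos = list(todos)
    return f"{len(state.todos)} todos recorded"


def write_file(state: AgentState, path: str, content: str) -> str:
    """Offload a large result and hand back only a short reference."""
    state.files[path] = content
    return f"saved {len(content)} chars to {path}"


state = AgentState()
msg = write_todos(state, [Todo("survey repos"), Todo("draft summary", "in_progress")])
ref = write_file(state, "notes/survey.md", "x" * 5000)
print(msg)
print(ref)
```

The model's context only ever sees the two short return strings; the 5,000-character payload stays in the file store until something asks for it by path.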
The reference, and its one lock-in
The obvious starting point is the original:
It's mature, widely used, and MIT-licensed. It also carries one architectural commitment worth naming — one we've compared against LangChain and LangGraph directly: it's built on LangGraph. If you already live in that runtime, that's a feature. If you don't — if you want the deep-agent shape without adopting a graph runtime and its state model — you've been out of luck.
That's the gap a small, newer cluster of repos is filling on a different base: Pydantic AI, the type-validated, genuinely model-agnostic framework from the Pydantic team.
The self-hosted Claude Code
The headline project rebuilds the whole harness end to end:
Its own description is the pitch: "Open-source, self-hosted Claude Code — a terminal AI assistant and the Python framework behind it. Tool-calling, sandboxed execution, multi-agent teams, skills, checkpoints, unlimited context — on Pydantic AI, any model." Under the hood that's Docker-sandboxed execution with persistent named workspaces, multi-agent teams that share a to-do list and message each other, SKILL.md skills loaded on demand, checkpoints you can save/rewind/fork, and auto-summarization to stretch context. It is, deliberately, the four pillars plus a terminal.
The honest caveat: it's young (first commit was late November 2025) and small — under a thousand stars. Treat it as a promising community project, not infrastructure. The load-bearing maturity lives one layer down, in pydantic-ai itself.
Skills, without the prompt bloat
If you don't want the whole terminal and just want to bolt capabilities onto a Pydantic AI agent, there's a narrower tool:
It implements the Agent Skills standard — the open SKILL.md format Anthropic released in December 2025 — with progressive disclosure: the agent sees a one-line skill description first and pulls the full instructions only when a task actually calls for it. Both filesystem skills (folders of markdown) and programmatic skills (Python decorators) are supported. The point is context economy: you can carry fifty skills without paying for fifty in your system prompt.
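Progressive disclosure can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the library's API: Skill, SkillRegistry, catalog, and load_skill are hypothetical names, and the skill bodies are filler text.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str   # one line, always visible to the model
    instructions: str  # full body, loaded only on demand


class SkillRegistry:
    def __init__(self, skills: list[Skill]) -> None:
        self._skills = {s.name: s for s in skills}

    def catalog(self) -> str:
        """What goes into the system prompt: one line per skill."""
        return "\n".join(f"- {s.name}: {s.description}" for s in self._skills.values())

    def load_skill(self, name: str) -> str:
        """The tool the agent calls once a task actually needs the skill."""
        return self._skills[name].instructions


registry = SkillRegistry([
    Skill("pdf-report", "Render a PDF report from markdown.",
          "Step 1: convert the markdown to HTML. " * 200),
    Skill("sql-audit", "Review SQL for unsafe patterns.",
          "Check every query for string interpolation. " * 200),
])

prompt_cost = len(registry.catalog())
full_cost = sum(len(registry.load_skill(n)) for n in ("pdf-report", "sql-audit"))
print(registry.catalog())
print(prompt_cost, full_cost)
```

The catalog grows by one short line per skill, while the full instructions cost nothing until load_skill is called for that specific skill.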
Why the base framework is the real story
Here's the non-obvious part, and the reason to care which foundation you pick. In a deep agent, the orchestrator hands tool arguments to sub-agents across dozens of hops over hours. The failure mode that eats these runs isn't a crash. It's a subtly malformed handoff that raises no error and quietly corrupts everything downstream until you notice, much later, that the last two hours were garbage.
Pydantic AI validates structured tool inputs and outputs at every boundary. A bad call fails loudly and locally, at the hop where it happened, instead of poisoning the trajectory. Deep-agent architecture is what creates the long horizons; the validation boundary is what keeps them from silently rotting. That's a different value proposition than "types are nice" — it's the specific defect class that long runs generate, caught at the specific place it's cheapest to catch.
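A validation boundary of this kind can be sketched with plain Pydantic models. This is not Pydantic AI's own tool API, and the ResearchTask schema and delegate function are hypothetical; the sketch only shows a malformed handoff being rejected at the hop where it occurs, before any sub-agent work starts.

```python
from pydantic import BaseModel, Field, ValidationError


class ResearchTask(BaseModel):
    """Hypothetical handoff schema between orchestrator and sub-agent."""
    topic: str = Field(min_length=1)
    max_sources: int = Field(ge=1, le=20)


def delegate(raw_args: dict) -> str:
    """Validate at the boundary, before the sub-agent does any work."""
    try:
        task = ResearchTask(**raw_args)
    except ValidationError as exc:
        # Fail loudly and locally instead of passing bad arguments downstream.
        return f"rejected: {len(exc.errors())} invalid field(s)"
    return f"accepted: {task.topic} ({task.max_sources} sources)"


ok = delegate({"topic": "agent sandboxes", "max_sources": 5})
bad = delegate({"topic": "", "max_sources": "many"})
print(ok)
print(bad)
```

In a real harness the rejection would more likely be returned to the model so it can retry with corrected arguments; the point here is only that the error surfaces at the handoff rather than hours later.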
And because Pydantic AI is model-agnostic, the whole harness — sandbox, skills, sub-agents — runs on whatever model you own. "Self-hosted Claude Code" turns out to describe the shape of the tool, not a dependency on Anthropic. That's the part the name undersells.



