There is a moment, somewhere around the fortieth connected tool, when an MCP agent stops feeling clever and starts feeling expensive. The transcript fills with tool definitions the model will never use on this task. Each intermediate result — a 4,000-row spreadsheet, a verbose API blob — gets dutifully piped back through the context window so the model can look at it and decide what to do next. You are paying, per token, to shuttle data the model doesn't need to read.
This is not a tuning problem. It is structural, and it has a name now: two of them, depending on who you ask.
The two costs nobody budgeted for
The Model Context Protocol made tools composable. It did not make them free. Connect enough servers and you hit two compounding bills.
The first is definition bloat. Every tool's schema (name, description, parameters) has to sit in context so the model knows it exists. A few servers is fine. Hundreds of tools is tens of thousands of tokens spent before the task begins. This isn't only a cost story; it's an accuracy story. The RAG-MCP paper measured it directly: as the tool pool grows, selection accuracy collapses. With every tool description in the prompt, the baseline picked the right tool 13.62% of the time, against 43.13% when only the relevant tools were retrieved and shown. More tools make the model worse at choosing among them, the same way a menu with four hundred items makes you order worse.
The second is result round-tripping. The dominant pattern is one tool call per turn, with the full result returned to the model. Chain five calls and you've passed five intermediate payloads through the context — including the parts you only wanted to filter, count, or hand to the next call. If you've already weighed the case for an MCP server over raw function calling, you know the protocol solved discovery and auth. It did not solve the arithmetic of moving data through a language model one turn at a time.
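To put rough numbers on the two bills, here is a back-of-envelope sketch. Every constant in it is an assumption chosen for illustration (the four-characters-per-token rule of thumb, the schema size, the payload sizes); none of it is a measurement from a real deployment.

```typescript
// Back-of-envelope estimate of the two context costs described above.
// All constants are illustrative assumptions, not measurements.

const CHARS_PER_TOKEN = 4; // rough rule of thumb for English text and JSON

function estimateTokens(chars: number): number {
  return Math.ceil(chars / CHARS_PER_TOKEN);
}

// Bill 1: definition bloat. Every tool schema is loaded before the task starts.
function definitionTokens(toolCount: number, avgSchemaChars: number): number {
  return toolCount * estimateTokens(avgSchemaChars);
}

// Bill 2: result round-tripping. Each intermediate payload passes through the context.
function roundTripTokens(payloadChars: number[]): number {
  return payloadChars.reduce((sum, chars) => sum + estimateTokens(chars), 0);
}

// Assumed workload: 300 tools at ~600 characters of schema each,
// then five chained calls with payloads of the sizes listed.
const upFront = definitionTokens(300, 600);
const chained = roundTripTokens([40000, 40000, 8000, 8000, 2000]);

console.log({ upFront, chained, total: upFront + chained });
```

Under these assumed sizes the definitions alone cost more than the five payloads combined, which is why the two bills are worth tracking separately.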
The fix: make the model write code
The emerging answer is almost insolent in its simplicity. Stop handing the model tools. Hand it an API, and let it write code.
In Anthropic's framing, the MCP client presents each server as a set of code modules on a filesystem — TypeScript files the model can import. The model explores ./servers/, reads only the tool files it needs, and writes code that imports and composes them. The code runs in a sandbox. Loops, filters, and joins happen there, in the execution environment, not in the context window. Only the final result comes back to the model.
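A minimal sketch of what such model-written code might look like. The two wrapper functions are hypothetical stand-ins for generated files under ./servers/; in a real setup they would forward to MCP tool calls and would likely be async. Here they are synchronous in-memory stubs so the example runs on its own.

```typescript
// Hypothetical stand-ins for generated wrappers that would live under ./servers/.
// They are in-memory stubs so the sketch is self-contained.

interface Row {
  id: number;
  status: "pending" | "shipped";
  total: number;
}

// Stand-in for something like ./servers/sheets/getRows.ts (hypothetical name).
function getRows(rowCount: number): Row[] {
  const rows: Row[] = [];
  for (let i = 1; i <= rowCount; i++) {
    rows.push({ id: i, status: i % 100 === 0 ? "pending" : "shipped", total: i * 2 });
  }
  return rows;
}

// Stand-in for something like ./servers/crm/flagOrders.ts (hypothetical name).
function flagOrders(orderIds: number[]): { flagged: number } {
  return { flagged: orderIds.length };
}

// The kind of script a model might write: fetch, filter, and hand off inside
// the sandbox, then return only a small summary to the context window.
function run(): { scanned: number; flagged: number; pendingTotal: number } {
  const rows = getRows(4000); // all 4,000 rows stay in the sandbox
  const pending = rows.filter((r) => r.status === "pending");
  const { flagged } = flagOrders(pending.map((r) => r.id));
  const pendingTotal = pending.reduce((sum, r) => sum + r.total, 0);
  return { scanned: rows.length, flagged, pendingTotal };
}

const summary = run();
console.log(JSON.stringify(summary));
```

The model's context receives one short JSON object rather than the 4,000 rows the filter ran over.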
The number Anthropic put on it is the one everyone quotes, and it's worth quoting accurately: one workflow that consumed roughly 150,000 tokens dropped to about 2,000 — a 98.7% reduction. Simon Willison, walking through the post, landed on the same framing: this is progressive disclosure for tools, the catalog living in code rather than in the prompt.
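Progressive disclosure can be sketched as two levels of lookup: cheap names first, full descriptions only on request. The catalog below is an in-memory stand-in for a ./servers/ tree, and every path, description, and function name in it is hypothetical.

```typescript
// In-memory stand-in for a ./servers/ directory; paths and descriptions are hypothetical.
const catalog: Record<string, string> = {
  "servers/sheets/getRows.ts": "Read rows from a spreadsheet by sheet id.",
  "servers/sheets/appendRow.ts": "Append one row to a spreadsheet.",
  "servers/crm/updateRecord.ts": "Update a CRM record by id.",
  "servers/mail/sendMessage.ts": "Send an email message.",
};

// Level 1: paths only. Cheap enough to show the model up front.
function listTools(): string[] {
  return Object.keys(catalog).sort();
}

// Level 2: descriptions, returned only for entries matching the model's query.
function searchTools(query: string): { path: string; description: string }[] {
  const q = query.toLowerCase();
  return Object.entries(catalog)
    .filter(
      ([path, description]) =>
        path.toLowerCase().includes(q) || description.toLowerCase().includes(q),
    )
    .map(([path, description]) => ({ path, description }));
}

const allPaths = listTools();
const hits = searchTools("spreadsheet");
console.log(allPaths.length, hits.map((h) => h.path));
```

A real client would presumably read files from disk rather than a literal object, but the shape of the trade is the same: the prompt carries the index, not the catalog.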
Cloudflare shipped the same idea under a blunter name — Code Mode — with a sharper justification: models are already trained on enormous corpora of code, and comparatively little on whatever bespoke tool-call format you invented this quarter. So let them do the thing they're good at. Cloudflare reports cutting token usage by up to 81% in general use, and roughly 99.9% across its full surface of more than 2,500 API endpoints, by converting them into a typed SDK the agent writes against.
The model was never bad at using tools. It was bad at using tools the way we made it — one definition at a time, one result at a time.
The data-privacy bonus is real and underrated: when records flow from one server to another inside the sandbox, the names and emails and phone numbers never pass through the model at all. You can't accidentally log what the model never saw.
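For the text that does have to come back to the model, such as a log line or a summary, one extra guard is to mask identifiers before they leave the sandbox. The sketch below handles only email-like strings, and its placeholder format and function name are assumptions for illustration, not any vendor's implementation.

```typescript
// Sketch: mask email-like strings before anything leaves the sandbox for the model.
// The [EMAIL_n] placeholder format and the function name are illustrative assumptions.

const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function maskForModel(text: string): { masked: string; lookup: Map<string, string> } {
  const lookup = new Map<string, string>(); // placeholder -> real value, kept in the sandbox
  const seen = new Map<string, string>(); // real value -> placeholder, so repeats share one
  const masked = text.replace(EMAIL_PATTERN, (email) => {
    let placeholder = seen.get(email);
    if (placeholder === undefined) {
      placeholder = `[EMAIL_${seen.size + 1}]`;
      seen.set(email, placeholder);
      lookup.set(placeholder, email);
    }
    return placeholder;
  });
  return { masked, lookup };
}

const sandboxLog = "Synced ada@example.com and bob@example.org; retry ada@example.com";
const result = maskForModel(sandboxLog);
console.log(result.masked);
```

The lookup table stays on the sandbox side, so a later call can restore the real values without the model ever holding them.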
The bill doesn't vanish. It moves.
Here is the part the token-reduction headlines skip. Code execution doesn't make the problem disappear. It relocates it — out of the model and into your infrastructure.
You are now running code a language model wrote, which is to say untrusted code, which is to say you need a real sandbox: resource limits, network egress control, monitoring, isolation that holds when the generated code does something stupid or hostile. Anthropic says this plainly — the benefits should be weighed against the operational overhead and security considerations that direct tool calls simply avoid. A trade, not a free lunch.
This is the non-obvious turn. We swapped a model problem (context bloat, tool-selection accuracy) for an infrastructure problem. And infrastructure problems are someone's pager. The reason Cloudflare's version is interesting isn't the token math; it's that they had the sandbox already. Their Dynamic Workers, now in open beta, run each agent's code in a V8 isolate that starts in milliseconds instead of a container that takes seconds, and use bindings that inject credentials on the way out, so the agent's code never sees an API key. The agent gets an authorized client; the secrets stay with the supervisor.
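The credential-injection idea can be illustrated with a plain capability pattern. This is not Cloudflare's bindings API; every name below is hypothetical, and a closure in one process is only a teaching aid, since real isolation needs an isolate or process boundary between supervisor and agent code.

```typescript
// Illustration of the pattern only; this is NOT Cloudflare's bindings API.
// All names (ApiClient, makeAuthorizedClient, agentCode) are hypothetical.

interface OutboundRequest {
  path: string;
  headers: Record<string, string>;
}

interface ApiResponse {
  status: number;
  body: string;
}

interface ApiClient {
  get(path: string): ApiResponse;
}

// Supervisor side: holds the secret and attaches it as the request leaves.
function makeAuthorizedClient(apiKey: string, sent: OutboundRequest[]): ApiClient {
  return {
    get(path: string): ApiResponse {
      sent.push({ path, headers: { Authorization: `Bearer ${apiKey}` } }); // stand-in for a network call
      return { status: 200, body: `ok ${path}` }; // the response carries no credential
    },
  };
}

// Agent side: generated code receives only the client, never the key.
function agentCode(client: ApiClient): string {
  const response = client.get("/v1/invoices");
  return JSON.stringify(response);
}

const wire: OutboundRequest[] = [];
const visibleToAgent = agentCode(makeAuthorizedClient("sk-demo-not-a-real-key", wire));
console.log(visibleToAgent);
```

The outbound request recorded in `wire` carries the Authorization header, while the string the agent code produced contains no trace of the key.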
That detail is the whole ballgame. If your sandbox is just a container with network access, you have not built a sandbox — a point worth sitting with, because your container is probably not the boundary you think it is. And if you don't already operate one, code mode hands you a build-or-buy decision before you write a line of agent logic. The sandbox vendors exist precisely because this is the hard part now.
So which do you use?
Direct tool calls are not dead, and pretending otherwise is the kind of overcorrection that gets you a sandbox you didn't need. For a handful of tools and single-step tasks, loading definitions and calling them directly is simpler, lower-latency, and carries no code-execution attack surface. The break-even arrives when the tool count climbs, the workflows go multi-step, or you find yourself piping fat intermediate results through the model just to filter them.
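The break-even can be written down as a rule of thumb. Every threshold in this sketch is an assumption to tune against your own workload, not a benchmark, and the type and function names are hypothetical.

```typescript
// Rule-of-thumb chooser. Every threshold is an assumption to tune, not a benchmark.

interface Workload {
  toolCount: number; // tools the agent could reach
  stepsPerTask: number; // typical chained calls per task
  largestIntermediateTokens: number; // biggest payload passed between steps
  hasSandbox: boolean; // is an isolation layer already in operation?
}

type Mode = "direct-tool-calls" | "code-execution";

function chooseMode(w: Workload): Mode {
  const manyTools = w.toolCount > 30;
  const multiStep = w.stepsPerTask > 3;
  const fatResults = w.largestIntermediateTokens > 5000;
  const pressure = [manyTools, multiStep, fatResults].filter(Boolean).length;
  // Without an existing sandbox, ask for stronger evidence before taking on its cost.
  const needed = w.hasSandbox ? 1 : 2;
  return pressure >= needed ? "code-execution" : "direct-tool-calls";
}

const small = chooseMode({
  toolCount: 5,
  stepsPerTask: 1,
  largestIntermediateTokens: 500,
  hasSandbox: false,
});
const large = chooseMode({
  toolCount: 200,
  stepsPerTask: 6,
  largestIntermediateTokens: 20000,
  hasSandbox: false,
});
console.log({ small, large });
```

The `hasSandbox` flag encodes the operational side of the trade: the same workload can tip either way depending on whether the isolation layer is already someone's job.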
The honest read: MCP solved the wrong half of the problem first. It standardized how agents reach tools. Code execution is the industry quietly admitting that how the model uses them — one definition, one result, one turn at a time — was the part that didn't scale. The protocol was never the bottleneck. The sandbox is.



