The first time you connect an agent to ten MCP servers it feels like progress. The tenth time, it feels like the agent got slower and dumber, and you are not imagining it.

Here is the arithmetic nobody puts on the demo slide. Tool definitions are not free. They are JSON schemas — names, descriptions, every argument and its description — and they all load into the context window before the model reads your first request. A typical multi-server setup, GitHub plus Slack plus Sentry plus Grafana plus Splunk, runs about 55,000 tokens of definitions just sitting there, per Anthropic's own docs. Scale that the way real deployments do — Anthropic's code-execution post sketches an agent wired to dozens of servers, thousands of tools, roughly 150 tokens each — and you are spending six figures of context before the agent does anything at all.
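To make that arithmetic concrete, here is a minimal sketch. The only input taken from the text is the rough 150-tokens-per-tool figure; the function name and the tool counts are illustrative assumptions, not measurements.

```python
def definition_overhead(tool_count: int, tokens_per_tool: int = 150) -> int:
    """Tokens consumed by tool definitions before the model reads any request.

    tokens_per_tool defaults to the rough per-tool figure quoted above;
    real schemas vary widely with description length and argument count.
    """
    return tool_count * tokens_per_tool


# Illustrative tool counts only (assumed, not taken from any deployment).
for tool_count in (10, 100, 1_000, 2_000):
    overhead = definition_overhead(tool_count)
    print(f"{tool_count:>5} tools -> {overhead:>7,} tokens of definitions")
```

At 150 tokens apiece, the six-figure mark arrives at around 667 tools, which is well inside "dozens of servers" territory.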

And it is not only a cost problem. It is an accuracy problem. The same docs note that Claude's ability to pick the right tool degrades significantly once you exceed 30-50 available tools. More tools, worse selection. The thing you added to make the agent more capable is the thing making it choose wrong.

So the industry shipped two fixes in late 2025. They look like competitors. They are not.

Fix one: stop loading schemas you won't use

The first fix attacks the schema-loading layer. Anthropic's Tool Search Tool, part of its advanced tool use release, lets you mark tools with defer_loading: true. Deferred tools never enter the system prompt. The model sees only a small search tool and your three-to-five always-on favorites. When it needs something else, it searches the catalog — by regex or BM25 — and the API expands the 3-5 most relevant matches into full definitions inline.
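The mechanics are easier to see in a toy model. The sketch below is a local simulation of the idea, not Anthropic's implementation and not its API: the catalog entries, tool names, and both helper functions are hypothetical, and only the defer_loading flag and the regex-style search come from the description above.

```python
import re

# Hypothetical catalog; in a real setup these would be full JSON schemas.
CATALOG = [
    {"name": "github_create_pull_request", "description": "Open a pull request on GitHub", "defer_loading": True},
    {"name": "github_list_issues", "description": "List issues in a GitHub repository", "defer_loading": True},
    {"name": "slack_post_message", "description": "Post a message to a Slack channel", "defer_loading": False},
    {"name": "sentry_get_issue", "description": "Fetch a Sentry error by id", "defer_loading": True},
    {"name": "grafana_query_dashboard", "description": "Query a Grafana dashboard panel", "defer_loading": True},
]


def initial_tools(catalog: list[dict]) -> list[str]:
    """Names of the always-on tools the model sees up front."""
    return [tool["name"] for tool in catalog if not tool["defer_loading"]]


def search_tools(catalog: list[dict], pattern: str, limit: int = 5) -> list[str]:
    """Regex search over deferred tools, matching name and description.

    Returns at most `limit` names in catalog order; a real search tool
    would rank by relevance rather than position.
    """
    matcher = re.compile(pattern, re.IGNORECASE)
    hits = [
        tool["name"]
        for tool in catalog
        if tool["defer_loading"]
        and matcher.search(f'{tool["name"]} {tool["description"]}')
    ]
    return hits[:limit]


print(initial_tools(CATALOG))
print(search_tools(CATALOG, "github"))
```

The point of the shape: the model starts with one always-on tool in view, and the two GitHub definitions only show up after a search asks for them.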

The payoff is the inverse of the problem. That 55k-token multi-server setup shrinks by over 85%, because you load the tools a given request actually needs instead of every tool the agent could theoretically reach. The catalog can hold up to 10,000 tools and the context stays lean. Crucially, because deferred schemas stay out of the cached prefix, prompt caching survives intact — the cheap part of your bill stays cheap.

This is just-in-time retrieval applied to tools instead of documents. It is the obvious fix, and it works. But notice what it does not touch.

Fix two: stop loading results you don't need to see

Tool search trims the menu. It does nothing about the meal. Every tool the model calls still routes its full output back through the context window, and every call is a separate model turn. Check budget compliance across 20 employees and you pay 20 round trips, each dragging thousands of expense line-items into context so the model can eyeball them.

Programmatic Tool Calling attacks that layer instead. Mark a tool with allowed_callers: ["code_execution_20260120"] and the model stops calling it directly; it writes Python that calls the tool as a function inside a sandboxed container. The container loops, filters, aggregates — and only the final printed summary comes back. Intermediate results never enter the model's context and, notably, do not count toward your token bill.
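Here is roughly what that model-written script could look like for the budget check, as a sketch under stated assumptions. The two tool functions are hypothetical in-memory stand-ins with fabricated data, the budget figure is invented, and the calls are synchronous for brevity; a real sandbox would expose your actual tools, possibly as async functions.

```python
# Hypothetical stand-ins for tools the sandbox would expose as functions.
def get_team_members(team: str) -> list[str]:
    """Pretend tool: return 20 employee ids for any team."""
    return [f"emp{index:02d}" for index in range(20)]


def get_expenses(employee_id: str) -> list[dict]:
    """Pretend tool: return 50 fake line items per employee."""
    index = int(employee_id[3:])
    return [{"employee": employee_id, "amount": 100 + index * 10} for _ in range(50)]


BUDGET = 13_000  # assumed flat budget, purely for illustration


def over_budget(team: str) -> dict[str, int]:
    """Loop, aggregate, and filter inside the sandbox."""
    flagged = {}
    for employee_id in get_team_members(team):
        total = sum(item["amount"] for item in get_expenses(employee_id))
        if total > BUDGET:
            flagged[employee_id] = total
    return flagged


summary = over_budget("engineering")

# Only this printed summary would travel back to the model.
for employee_id, total in sorted(summary.items()):
    print(f"{employee_id}: {total} (over by {total - BUDGET})")
```

The script touches 1,000 line items across 20 tool calls, and the model sees three short lines. That asymmetry is the whole feature.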

The numbers are concrete and honestly reported. On a 75-tool project-management benchmark, programmatic calling cut billed input tokens by roughly 38% with no change in accuracy. Across production traffic with 10-49 tools, typical savings ran 20-40%. On agentic-search benchmarks it improved performance ~11% while using 24% fewer input tokens. And Anthropic is candid about the failure case: on τ²-bench, where each turn is one or two sequential calls, it left scores flat and cost ~8% more. Strictly sequential reasoning, where the model must think between every call, gets no benefit — the script cannot skip the turn it exists to skip.

Tool search cuts the schemas you load before the work. Code execution cuts the results you load during it. They are not rival answers to one question — they are answers to two.

The non-obvious part: they compose

Treat these as an either/or and you will choose badly, because they operate at different layers of the same pipeline. Tool search governs what definitions enter context. Code execution governs what results leave the sandbox and how many model turns you pay. One is about the front door, the other about the plumbing.

Stack them and the effects multiply. The code-execution-with-MCP approach does exactly this: present each MCP server as code modules on a filesystem, let the agent import only the tools it needs (discovery), and let it process data in the execution environment before anything returns (result-handling). The headline example — a workflow that consumed about 150,000 tokens rebuilt to run on roughly 2,000, a 98.7% reduction — is not one trick. It is both, working at both layers at once.
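A rough sketch of the filesystem pattern, with heavy caveats: the directory layout, the three tool modules, and their run functions are all invented for illustration, and the original write-up describes the idea rather than this code. What it shows is discovery by listing files, loading only the modules a task needs, and keeping a bulky intermediate value inside the execution environment.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical wrappers: one file per tool, grouped by server.
FAKE_TOOLS = {
    "crm/update_record.py": "def run(record_id, note):\n    return {'id': record_id, 'note_chars': len(note)}\n",
    "drive/get_document.py": "def run(doc_id):\n    return 'line of transcript\\n' * 1000\n",
    "slack/post_message.py": "def run(channel, text):\n    return {'channel': channel, 'ok': True}\n",
}


def write_fake_servers(root: Path) -> None:
    """Materialize the hypothetical <server>/<tool>.py tree under root."""
    for relative_path, source in FAKE_TOOLS.items():
        path = root / relative_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(source)


def list_tools(root: Path) -> list[str]:
    """Discovery: file names only, no definitions loaded."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))


def load_tool(root: Path, relative_path: str):
    """Load one tool module on demand and return its run function."""
    spec = importlib.util.spec_from_file_location(Path(relative_path).stem, root / relative_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_fake_servers(root)
    available = list_tools(root)
    get_document = load_tool(root, "drive/get_document.py")
    update_record = load_tool(root, "crm/update_record.py")
    transcript = get_document("doc-1")  # large value stays in the sandbox
    result = update_record("rec-1", transcript)

print(available)
print(result)
```

The Slack module is listed but never loaded, and the transcript moves from one tool to the other without being printed. Only the file names and the small result dict would reach the model.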

There is a third effect the token math undersells. Code execution does not just keep results out of context; it collapses multi-tool chaining into a single model turn. Twenty lookups, a filter, and a verdict that were twenty round trips become one script the model writes once. For a small tool set, the extra machinery of a sandbox and generated code is not worth it. For a sprawling one it is the difference between an agent that reasons over conclusions and one that drowns in raw data it requested itself.

What to actually do

If you have fewer than ten tools and they fire on most requests, do nothing; the machinery costs more than it saves. If you have a large catalog where any one request touches a handful, turn on tool search first — it is the cheapest win and it directly buys back selection accuracy. If your workloads fan out, filter big results, or chain steps, add programmatic calling on top. And if you are wiring an agent to a wall of MCP servers, assume you want both, because the bloat is coming from both layers and trimming one just relocates the problem to the other.
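Those rules of thumb can be written down as a small function. This is one reading of the advice, not a prescription: the function, its parameter names, and the LARGE_CATALOG threshold are assumptions. The threshold is borrowed from the low end of the 30-50 tool range where selection accuracy is said to degrade.

```python
LARGE_CATALOG = 30  # assumed cutoff, taken from the low end of the 30-50 range


def recommend(
    tool_count: int,
    most_tools_fire: bool,
    heavy_results_or_chaining: bool,
    many_mcp_servers: bool = False,
) -> list[str]:
    """Return which fixes to enable, following the rules of thumb above."""
    if many_mcp_servers:
        # A wall of MCP servers bloats both layers, so assume both fixes.
        return ["tool_search", "programmatic_calling"]
    if tool_count < 10 and most_tools_fire:
        # Small, busy tool set: the machinery costs more than it saves.
        return []
    picks = []
    if tool_count >= LARGE_CATALOG and not most_tools_fire:
        picks.append("tool_search")
    if heavy_results_or_chaining:
        picks.append("programmatic_calling")
    return picks


print(recommend(tool_count=6, most_tools_fire=True, heavy_results_or_chaining=False))
print(recommend(tool_count=200, most_tools_fire=False, heavy_results_or_chaining=True))
```

Treat the cutoffs as starting points to tune against your own traffic rather than as fixed limits.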

The lesson under all of this: connecting more tools was never the hard part. Connecting them without quietly poisoning the context window is. The fix is not fewer tools — it is fewer of them in the model's head at any given moment.