There's a quiet assumption baked into how most people build agents: that capability scales with the tool list. Wire up more MCP servers, register more functions, and the agent can do more. It feels obviously true. It is, past a surprisingly low threshold, false. The agent with two hundred tools is usually worse at picking the right one than the agent with twenty — and the reason has nothing to do with the model's intelligence and everything to do with where those tools live.

Every tool is a tax on attention

When you give an agent a tool, its name, description, and JSON schema get serialized into the prompt — and they stay there, every single turn, whether the agent uses the tool or not. Twenty tools is a manageable preamble. Two hundred is a wall of near-identical descriptions the model has to read, hold, and disambiguate before it does any actual work. The options don't just cost tokens; they cost clarity. A paper studying function calling on small models put it with refreshing bluntness: "the large number of available options confuses the LLM." Their fix — present fewer, more relevant tools — raised accuracy and cut latency at the same time.
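
To make the per-turn cost concrete, here is a rough sketch of how the preamble grows with the tool list. Everything in it is an assumption for illustration: the tool definitions are synthetic, `make_tool` and `overhead_per_turn` are hypothetical helpers, and the four-characters-per-token rule is a crude heuristic rather than a real tokenizer.

```python
import json


def make_tool(i: int) -> dict:
    """Build a synthetic tool definition: name, description, JSON schema."""
    return {
        "name": f"tool_{i}",
        "description": f"Hypothetical tool number {i}; does one narrow job.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text input."}
            },
            "required": ["query"],
        },
    }


def estimate_tokens(text: str) -> int:
    """Crude heuristic (assumption): roughly 4 characters per token."""
    return len(text) // 4


def overhead_per_turn(tools: list) -> int:
    """Estimated tokens spent on tool definitions in a single turn."""
    return estimate_tokens(json.dumps(tools))


def overhead_per_conversation(tools: list, turns: int) -> int:
    """The definitions are resent every turn, so the cost multiplies."""
    return overhead_per_turn(tools) * turns


small = [make_tool(i) for i in range(20)]
large = [make_tool(i) for i in range(200)]

for label, tools in (("20 tools", small), ("200 tools", large)):
    print(label, overhead_per_turn(tools), "tokens per turn,",
          overhead_per_conversation(tools, turns=10), "over 10 turns")
```

These synthetic schemas are far smaller than real ones, so the absolute numbers understate the problem. The point is the shape: the cost is linear in the number of tools and is paid again on every turn.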

The research that names the mechanism most directly is RAG-MCP. The authors run a stress test inspired by needle-in-a-haystack: give the model one correct tool buried among a growing pile of distractors and watch selection accuracy fall as the pile grows. Then they try the obvious alternative — retrieve the relevant tools first, and put only those in the prompt — and tool-selection accuracy roughly triples against the naive baseline while prompt tokens drop by more than half. The model didn't get smarter. It got a smaller, cleaner menu.

The agent never needed all the tools. It needed the right three. Everything past that is noise you're paying to inject into every decision.

MCP turned a slow leak into a flood

This was a manageable problem when "tools" meant a dozen functions you hand-wrote. MCP changed the scale. An MCP server typically mirrors an entire product's API, one tool per endpoint and each with a verbose schema, so connecting a few servers can load dozens or even hundreds of tool definitions before the user types anything. Anthropic's own engineering team has put rough numbers on it: a single mid-sized server can run to tens of thousands of tokens, and a setup of five servers exposing 58 tools occupied roughly 55,000 tokens of context before the conversation began. The MCP maintainers themselves track this as a first-class problem: there's an open spec proposal (SEP-1576) titled, plainly, "Mitigating Token Bloat in MCP."
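
A quick back-of-the-envelope check on those figures, as a sketch. The 55,000-token and 58-tool values come from the numbers quoted above; the 200,000-token context window is an assumption chosen only to give the total a sense of scale.

```python
SETUP_TOKENS = 55_000      # quoted above: five servers, before any conversation
TOOL_COUNT = 58            # quoted above
CONTEXT_WINDOW = 200_000   # assumption: an illustrative context window size

avg_tokens_per_tool = SETUP_TOKENS / TOOL_COUNT
share_of_window = SETUP_TOKENS / CONTEXT_WINDOW

print(f"{avg_tokens_per_tool:.0f} tokens per tool on average")
print(f"{share_of_window:.1%} of the assumed window used before the first message")
```

At roughly 950 tokens per definition, each additional tool costs about as much as a few paragraphs of prose, on every turn.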

The crude mitigation is a hard cap. Cursor, for instance, limits an agent to 40 MCP tools and truncates the rest. That keeps the context bounded, but it solves the problem by amputation: the forty-first tool simply doesn't exist, regardless of whether it was the one you needed. A cap is an admission that the all-tools-in-the-prompt model doesn't scale, dressed up as a feature.
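
A minimal sketch of what a hard cap does. The limit of 40 is the figure quoted above; keeping the first 40 tools in registration order is an assumption made for illustration, not a description of how any particular client chooses which tools survive. `apply_cap` and the tool names are hypothetical.

```python
MAX_TOOLS = 40  # cap quoted in the text; the selection rule below is an assumption


def apply_cap(tool_names: list, limit: int = MAX_TOOLS) -> tuple:
    """Keep the first `limit` tools in registration order; return (kept, dropped)."""
    return tool_names[:limit], tool_names[limit:]


# 58 hypothetical tools, registered in order tool_000 .. tool_057.
registered = [f"tool_{i:03d}" for i in range(58)]
kept, dropped = apply_cap(registered)

needed = "tool_040"  # the forty-first tool registered
print(len(kept), "kept,", len(dropped), "dropped")
print("needed tool visible to the agent:", needed in kept)
```

The cap never looks at the request, so whether the needed tool survives depends only on where it sits in the list.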

Tool selection is a retrieval problem

The better framing is the one RAG-MCP's name gives away: this is RAG, pointed at your tools instead of your documents. You index every tool — usually by embedding its description — and for each incoming request you retrieve the top-k most relevant tools and expose only those to the model. The catalog can hold thousands; the prompt only ever sees a handful.
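
The index-then-retrieve loop can be sketched in a few lines. This is an illustration under stated assumptions, not RAG-MCP's implementation: `embed` is a bag-of-words stand-in for a real embedding model, and `ToolIndex`, `CATALOG`, and the tool names are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToolIndex:
    """Index tool descriptions once; retrieve the top-k per request."""

    def __init__(self, tools: dict):
        self._vectors = {name: embed(desc) for name, desc in tools.items()}

    def top_k(self, request: str, k: int = 3) -> list:
        query = embed(request)
        ranked = sorted(
            self._vectors,
            key=lambda name: cosine(query, self._vectors[name]),
            reverse=True,
        )
        return ranked[:k]


# Hypothetical catalog; a real one could hold thousands of entries.
CATALOG = {
    "create_invoice": "Create a new invoice for a customer with line items and amounts.",
    "refund_payment": "Refund a payment that a customer already made.",
    "send_email": "Send an email message to a recipient address.",
    "search_tickets": "Search support tickets by keyword or status.",
    "get_weather": "Get current weather forecast for a city.",
}

index = ToolIndex(CATALOG)
selected = index.top_k("refund the payment this customer made yesterday", k=2)
print(selected)  # only these definitions would be placed in the prompt
```

Swapping `embed` for a real embedding model and the dictionary for a vector store changes the quality of the ranking but not the shape of the loop: the catalog is indexed once, and each request pays only for the k definitions it retrieves.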

This is now a supported pattern, not a research curiosity. LangGraph BigTool gives an agent a long-term store of tool descriptions and semantically retrieves the relevant ones per query; LlamaIndex's ObjectIndex does the same for arbitrary tool objects, explicitly to "remove the complexity of having too many functions to fit in the prompt." The one caveat the ToolRet benchmark adds is sobering: off-the-shelf retrievers are mediocre at tool retrieval — tool descriptions are short and confusable in ways document passages aren't — so the retriever itself needs attention. Bad tool retrieval just moves the failure upstream.

The newest move: tools as code you import

The most aggressive reframing drops the menu metaphor entirely. Instead of presenting tools as a list of schemas, you present them as a filesystem of code the agent can explore and import on demand — list the available servers, read a specific tool's definition only when it decides it needs that tool, and call it from a code block rather than round-tripping a JSON object through the model. Intermediate results stay in the execution environment instead of flowing back through the context window on every step.
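
A toy version of that workflow, to show its shape. Every name here is hypothetical: the `servers/<server>/<tool>.py` layout, the helper functions, and the three stub tools are invented for illustration, and the code runs in-process with no sandbox, which a real deployment would need.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical layout: one directory per server, one module per tool.
ROOT = Path(tempfile.mkdtemp()) / "servers"
FILES = {
    "crm/get_customer.py": (
        "def run(customer_id):\n"
        "    return {'id': customer_id, 'plan': 'pro'}\n"
    ),
    "billing/list_invoices.py": (
        "def run(customer_id):\n"
        "    return [{'customer': customer_id, 'amount': 120},\n"
        "            {'customer': customer_id, 'amount': 80}]\n"
    ),
    "billing/refund_payment.py": (
        "def run(payment_id):\n"
        "    return {'payment': payment_id, 'status': 'refunded'}\n"
    ),
}
for rel_path, source in FILES.items():
    path = ROOT / rel_path
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(source)


def list_servers() -> list:
    """Cheap discovery step: directory names only, no schemas loaded."""
    return sorted(p.name for p in ROOT.iterdir() if p.is_dir())


def list_tools(server: str) -> list:
    """Tool names for one server, still without reading any definition."""
    return sorted(p.stem for p in (ROOT / server).glob("*.py"))


def read_definition(server: str, tool: str) -> str:
    """Read a single tool's definition, only once the agent wants it."""
    return (ROOT / server / f"{tool}.py").read_text()


def load_tool(server: str, tool: str):
    """Import one tool module on demand."""
    path = ROOT / server / f"{tool}.py"
    spec = importlib.util.spec_from_file_location(f"{server}_{tool}", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# What agent-written code might look like: the invoice list stays in the
# execution environment, and only the short summary returns to the model.
invoices = load_tool("billing", "list_invoices").run("c-42")
total = sum(invoice["amount"] for invoice in invoices)
summary = f"{len(invoices)} invoices, total {total}"
print(summary)
```

Of the three tool definitions on disk, the agent touched one, and the raw invoice data never entered the context window.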

Anthropic's report on this "code execution with MCP" pattern contains the number that should reset everyone's intuition: a complex task that consumed roughly 150,000 tokens under the load-everything approach dropped to about 2,000 tokens when the agent read tool definitions on demand and ran the orchestration in code, a cut of roughly 98.7%. The tradeoff is real: you now need a sandbox to execute that code, which is its own security surface. But the direction is unmistakable.

The throughline across all three fixes — retrieval, code execution, even the crude cap — is the same correction to that original assumption. An agent's power was never in how many tools it could see. It's in how few it has to consider to pick the right one. Build for that, and "add another server" stops being a slow poison. Ignore it, and every integration you're proud of shipping is quietly making the agent a little bit worse.