Connect three or four MCP servers to an agent and watch the system prompt swell. A single busy server can spend several thousand tokens describing its tools; a realistic stack pushes tool schemas past 50,000 tokens before the user has typed a word. The usual reaction is to blame the context window and start counting tokens. That instinct is right about the symptom and wrong about the disease.

Here is the non-obvious part, and it is the whole reason this problem is worth an article. The expensive failure isn't running out of room. It's that tool-selection accuracy collapses long before the window fills. Every additional tool, especially a near-duplicate (get_user next to fetch_user next to lookup_account), is a distractor that raises the odds the model picks the wrong one or invents arguments. The RAG-MCP paper makes the size of the effect brutally concrete: on an MCP stress test, baseline tool-selection accuracy with the full catalog in context was 13.62%, and pre-filtering the candidate set with retrieval lifted it to 43.13%, more than triple, while also cutting prompt tokens by over half. Anthropic reports the same shape from the other direction: with its Tool Search Tool enabled, Opus 4.5's tool-use accuracy on their internal eval rose from 79.5% to 88.1%, and the older Opus 4 jumped from 49% to 74%.

Retrieving tools isn't primarily a cost optimization. It's an accuracy intervention — you'd want it even if tokens were free.

So the move everyone is converging on is: don't put all the tools in front of the model. Retrieve them. But "retrieve them" hides three genuinely different architectures, and the difference is what they retrieve and when.

Three shapes of the fix

Tool search keeps your agent loop exactly as it is and swaps eager loading for lazy. In Anthropic's version you still register every tool, but mark the long tail with defer_loading: true. The model boots seeing only a built-in search tool plus your handful of always-on tools; when it needs something else, it searches — by regex (tool_search_tool_regex_20251119) or by natural-language BM25 (tool_search_tool_bm25_20251119) — and the matching definitions expand into context on demand. Anthropic puts the saving at roughly 85% of tool-definition tokens while the full library stays reachable. This is the lowest-effort migration: same mental model, one flag.
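To make the "one flag" migration concrete, here is a minimal sketch of building the tools list. The catalog, the `ALWAYS_ON` set, and the `build_tools` helper are hypothetical, and the `name` given to the search tool entry is an assumption; only `defer_loading` and the `tool_search_tool_regex_20251119` type come from the description above. How the list is then passed to the API (including any beta flags) is left out, so check the current reference before relying on this shape.

```python
import json

# Hypothetical catalog: names, descriptions, and schemas are illustrative only.
ALWAYS_ON = {"search_docs"}


def tool(name, description):
    """Build a minimal tool definition with an empty input schema."""
    return {
        "name": name,
        "description": description,
        "input_schema": {"type": "object", "properties": {}},
    }


catalog = [
    tool("search_docs", "Search internal documentation."),
    tool("create_invoice", "Create a draft invoice for a customer."),
    tool("refund_payment", "Refund a captured payment."),
]


def build_tools(catalog, always_on):
    """Register every tool, but defer everything outside the always-on set."""
    # The search tool itself is listed first and is never deferred.
    # Its "name" value here is an assumption, not a confirmed API constant.
    tools = [{"type": "tool_search_tool_regex_20251119",
              "name": "tool_search_tool_regex"}]
    for definition in catalog:
        entry = dict(definition)  # copy so the catalog is not mutated
        if definition["name"] not in always_on:
            entry["defer_loading"] = True
        tools.append(entry)
    return tools


tools = build_tools(catalog, ALWAYS_ON)
print(json.dumps(tools, indent=2))
```

The design choice worth noticing is that the always-on set is explicit data rather than something scattered across tool definitions, which makes it easy to review which tools the model sees at boot.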

Tool-RAG (the RAG-MCP shape, and what Red Hat calls Tool RAG) goes a step earlier. It puts a semantic retriever in front of the model: tool names, descriptions, and parameters are embedded into a vector index, and for each query you retrieve the top-k most relevant tools and surface only those. The model never sees the catalog; it sees a curated shortlist. Search defers loading; RAG decides candidacy. This is the right tool when the catalog is enormous or full of overlapping capabilities — and it's why this belongs in the same conversation as how many tools an agent can actually handle and the uncountable sprawl of MCP servers.
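The retrieval step described above can be sketched in a few lines. This is not the RAG-MCP implementation: to stay self-contained it swaps the embedding model for a bag-of-words vector and the vector index for a plain dict, and the tool names and descriptions are made up. The shape is what matters: index the tool text once, score each query against it, and hand the model only the top-k.

```python
import math
import re
from collections import Counter

# Hypothetical catalog: tool name -> description.
TOOLS = {
    "get_user": "Fetch a user profile by user id",
    "create_invoice": "Create a draft invoice for a customer account",
    "refund_payment": "Refund a captured card payment",
    "send_email": "Send an email message to a recipient",
}


def embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Index once: embed each tool's name and description together.
INDEX = {
    name: embed(f"{name.replace('_', ' ')} {description}")
    for name, description in TOOLS.items()
}


def retrieve(query, k=2):
    """Return the k tool names most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(INDEX, key=lambda name: cosine(query_vec, INDEX[name]),
                    reverse=True)
    return ranked[:k]


shortlist = retrieve("refund the payment on this card", k=2)
print(shortlist)  # only these definitions would be shown to the model
```

With a real embedding model the lexical overlap requirement goes away, but the failure mode stays the same: whatever falls outside the top-k is invisible to the model.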

Code execution is the most aggressive. Instead of tool calls, the model writes code against an API of your tools — Anthropic's code execution with MCP presents them as files the model can list and import, with only names and ~60-character descriptions in the prompt (progressive disclosure). The payoff isn't just fewer definitions; it's that intermediate results stay in the execution sandbox and never re-enter the context window. A pipeline that fetches a 10,000-row sheet, filters it, and passes three rows to the next tool keeps the 9,997 rows out of the model entirely. Anthropic reports one task dropping from roughly 150,000 to 2,000 tokens — about 98.7%. The price is operational: you're now running a sandbox and orchestrating code, not just parsing JSON tool calls. It pairs naturally with how you already think about MCP versus plain function calling.
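The spreadsheet example above might look like this as model-written code. Everything here is a hypothetical stand-in: `fetch_sheet` and `send_reminders` are fake wrappers playing the role of generated tool modules, and the sheet ID is invented. The point of the sketch is where the data lives: the 10,000 rows exist only inside `run_task`, and the only thing returned toward the model is a small summary.

```python
# Hypothetical row numbers that should match the filter.
OVERDUE_ROWS = {17, 4200, 9001}


def fetch_sheet(sheet_id):
    """Fake tool wrapper: returns 10,000 rows, three of them overdue."""
    return [
        {"row": i, "status": "overdue" if i in OVERDUE_ROWS else "ok"}
        for i in range(10_000)
    ]


def send_reminders(rows):
    """Fake tool wrapper: pretends to queue one reminder per row."""
    return f"queued {len(rows)} reminders"


def run_task():
    """What the model's generated code might do inside the sandbox."""
    rows = fetch_sheet("hypothetical-sheet-id")  # large result stays here
    overdue = [r for r in rows if r["status"] == "overdue"]
    receipt = send_reminders(overdue)
    # Only this compact summary would re-enter the model's context.
    return {
        "fetched": len(rows),
        "matched": len(overdue),
        "rows": [r["row"] for r in overdue],
        "receipt": receipt,
    }


summary = run_task()
print(summary)
```

In a plain tool-calling loop, the full `fetch_sheet` result would have been serialized into the conversation before the model could filter it.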

The failure mode you just bought

Now the kicker, the part that should change how you build, not just which API you call. Retrieval reintroduces a recall ceiling. If the correct tool isn't in the retrieved top-k, the agent cannot call it, full stop. And notice what that does to your failure surface: a model staring at all the tools and picking the wrong one usually fails loudly and recoverably (the call errors or returns something visibly off, and the model retries). A retriever that silently omits the right tool fails invisibly: the agent confidently does the wrong thing, or gives up, and there is no error to retry against. You have traded a recoverable error for a silent one.

Which means the engineering question quietly changed underneath you. It is no longer "how many tools fit in the window." It is "what is my tool-retrieval recall@k, and what is the behavior on a miss?" That is the exact discipline retrieval-augmented generation already forced on the document layer — you measure recall@k, MRR, and nDCG, you tune k against precision, you decide on a fallback. The same rigor now applies to tools. Build the eval set: real user requests labeled with the tool that should fire, and measure whether your index surfaces it — the way you'd evaluate a RAG pipeline. Then engineer the miss path: keep your five or ten genuinely-always-needed tools resident (don't defer those), widen k when confidence is low, and let the model fall back to a broad search rather than hallucinate a tool that retrieval hid.
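A minimal sketch of that discipline, under loud assumptions: the eval set is four invented requests, the retriever is a canned lookup table standing in for a real index, and the `min_top_score` threshold and `wide_k` value in `shortlist` are arbitrary placeholders you would tune against your own data. It computes recall@k and MRR over labeled requests, then shows one possible miss path: pinned tools are always included, and k widens when the top score looks weak.

```python
# Hypothetical eval set: (user request, tool that should fire).
EVAL_SET = [
    ("refund order 1182", "refund_payment"),
    ("who owns this account", "get_user"),
    ("email the receipt", "send_email"),
    ("bill acme for march", "create_invoice"),
]

# Canned rankings standing in for a real retriever's output, best match first.
RANKINGS = {
    "refund order 1182": ["refund_payment", "create_invoice", "get_user"],
    "who owns this account": ["create_invoice", "get_user", "refund_payment"],
    "email the receipt": ["get_user", "refund_payment", "send_email"],
    "bill acme for march": ["create_invoice", "send_email", "get_user"],
}


def retrieve(query):
    """Stand-in retriever: look up a precomputed ranking."""
    return RANKINGS[query]


def recall_at_k(eval_set, k):
    """Fraction of requests whose expected tool appears in the top k."""
    hits = sum(1 for query, expected in eval_set
               if expected in retrieve(query)[:k])
    return hits / len(eval_set)


def mean_reciprocal_rank(eval_set):
    """Average of 1/rank of the expected tool (0 when it is missing)."""
    total = 0.0
    for query, expected in eval_set:
        ranked = retrieve(query)
        if expected in ranked:
            total += 1.0 / (ranked.index(expected) + 1)
    return total / len(eval_set)


def shortlist(scored, pinned, k=2, wide_k=5, min_top_score=0.3):
    """Miss path: always include pinned tools; widen k on low confidence.

    `scored` is a list of (tool_name, score) pairs sorted best first.
    """
    top_score = scored[0][1] if scored else 0.0
    limit = k if top_score >= min_top_score else wide_k
    names = list(pinned)
    for name, _score in scored[:limit]:
        if name not in names:
            names.append(name)
    return names


for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(EVAL_SET, k):.2f}")
print(f"MRR = {mean_reciprocal_rank(EVAL_SET):.3f}")
```

On this toy data recall climbs from 0.50 at k=1 to 1.00 at k=3, which is the trade the paragraph describes: a larger k buys recall at the cost of putting more distractors back in front of the model.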

The framing to leave with: tool retrieval is not a context-budget trick you bolt on at the end. It moves the agent's reliability from "did the model choose well among everything" to "did my retriever surface the right candidates" — and that second system is one you have to measure, not assume. The token savings are real and large. They are also the least interesting thing about the change.