The agent-tooling conversation has spent two years on the input side of a tool: the name, the description, the JSON schema, and how many tools to put in front of the model at once. It's worthwhile work — we wrote a whole guide on writing tool descriptions, because those tokens ship on every call and the model reads them more carefully than your system prompt.
But walk the trace of a production agent that is quietly failing and the wound is usually somewhere else. The model picked the right tool. It called it with the right arguments. And then it drowned in what came back.
Here is the canonical version. A search_orders tool returns 200 orders, each a 40-field object: full shipping addresses, line items, tax breakdowns, internal status flags. The agent's actual question was "has this customer's refund shipped?" The answer is a single boolean, and it is now buried in a few hundred kilobytes of JSON that the model has to carry, token by token and re-read on every subsequent turn, for the rest of the conversation.
A tool's return value is not a data structure your code consumes. It is a prompt fragment you pay for on input, and the model has to reason over every byte of it.
That reframing is the whole piece. Once you treat the return value as a prompt rather than a payload, the design rules write themselves.
Shape the response before it leaves the tool
The first lever is verbosity, and it is nearly free. Anthropic's tool-writing guidance recommends making a tool's response configurable, with a concise mode that returns just the essentials alongside a detailed mode with full metadata, and reports that the concise version of their own example, which drops IDs and ancillary fields, used roughly a third of the tokens of the detailed one. The model almost never needed the uuid or the mime_type; it needed the name and the file_type. Return the second kind of field. The same agent-versus-engineer instinct that should govern your inputs governs your outputs.
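A minimal sketch of that verbosity switch, in Python. The get_file tool, its response_format parameter, and the sample record are hypothetical; only the field names uuid, mime_type, name, and file_type come from the example above.

```python
import json

# Hypothetical backend record; a real API would return many more fields.
_RECORD = {
    "uuid": "3f2b8c1e-9d4a-4c7b-8e21-5a6f0b9d7c33",
    "mime_type": "application/pdf",
    "name": "Q3 refund policy",
    "file_type": "pdf",
    "size_bytes": 48213,
    "created_by_id": "u_10482",
}

# Fields an agent is likely to act on. This selection is an assumption of the sketch.
_CONCISE_FIELDS = ("name", "file_type")


def get_file(response_format: str = "concise") -> str:
    """Return the record as a JSON string; concise is the default."""
    if response_format == "detailed":
        payload = _RECORD
    elif response_format == "concise":
        payload = {key: _RECORD[key] for key in _CONCISE_FIELDS}
    else:
        raise ValueError("response_format must be 'concise' or 'detailed'")
    return json.dumps(payload)


concise = get_file()
detailed = get_file("detailed")
```

The agent can still ask for the detailed form when it needs an ID for a follow-up call; it just has to opt in.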
The second lever is size. Any tool whose result could be large needs guardrails baked in: pagination, range selection, filtering, and truncation, each with a sensible default. This isn't optional politeness. Claude Code enforces it structurally, restricting tool responses to 25,000 tokens by default, and Anthropic's guidance is that a truncated response should carry a note telling the agent how to narrow the query or fetch the rest. A read tool with no upper bound on its output is a context overflow waiting for the wrong query.
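One way such a guardrail might look, as a sketch. The paginate helper, the MAX_ITEMS cap, and the wording of the note are assumptions for illustration, not any particular framework's API.

```python
from typing import Any

# Hypothetical cap for this sketch; pick one that suits your context budget.
MAX_ITEMS = 20


def paginate(rows: list[Any], offset: int = 0, limit: int = MAX_ITEMS) -> dict[str, Any]:
    """Return one bounded page, plus a note on how to get the rest."""
    limit = max(1, min(limit, MAX_ITEMS))  # clamp: the caller cannot exceed the cap
    offset = max(0, offset)
    page = rows[offset:offset + limit]
    next_offset = offset + len(page)
    result: dict[str, Any] = {"items": page, "total": len(rows)}
    if next_offset < len(rows):
        # Tell the agent exactly how to continue instead of silently cutting off.
        result["note"] = (
            f"Showing {offset + 1}-{next_offset} of {len(rows)}. "
            f"Call again with offset={next_offset}, or add a filter to narrow the results."
        )
    return result


orders = [{"order_id": i} for i in range(200)]
first = paginate(orders)
last = paginate(orders, offset=190, limit=500)
```

The clamp matters as much as the default: an agent that asks for 500 items still gets at most MAX_ITEMS, so one bad query cannot flood the context.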
Format is the third lever, and the smallest. JSON, XML, and Markdown are not interchangeable — models predict the next token best on structures they saw most in training, so the same data can score differently depending on how you wrap it. Worth measuring, but don't mistake it for the main event. How much you return dominates how you format it.
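If you do want to measure it, a rough starting point is to render the same rows in two formats and compare. The sketch below uses character counts only as a crude stand-in; for real numbers you would count tokens with your model's tokenizer and, more importantly, score the agent's answers under each format. The sample rows and function names are made up.

```python
import json

# Made-up sample rows for the comparison.
rows = [
    {"order_id": 1001, "status": "shipped", "refund_shipped": True},
    {"order_id": 1002, "status": "pending", "refund_shipped": False},
]


def as_json(rows):
    """Serialize the rows as a compact JSON array."""
    return json.dumps(rows)


def as_markdown(rows):
    """Render the rows as a Markdown table, using the first row's keys as headers."""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


# Character counts per format: a proxy only, not a token count.
sizes = {"json": len(as_json(rows)), "markdown": len(as_markdown(rows))}
```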
The best result is often one the model never sees
Shaping helps, but there's a more radical move: don't put the data in the context window at all.
This is the thesis behind Anthropic's code execution with MCP work. Instead of every tool result flowing back through the model, the agent writes code that calls the tools, and the intermediate data stays in the execution environment. The model sees only what the code chooses to surface. Anthropic measured one workflow that consumed about 150,000 tokens when tools and intermediate results passed directly through the model, and re-implemented it with code execution at roughly 2,000 tokens — a 98.7% reduction. The ten-thousand-row export never touched the context; a three-line summary did.
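A toy version of the pattern, assuming a hypothetical fetch_export tool and made-up data. The point is only the shape: the large list lives and dies inside the function, and the returned string is the only thing that would reach the model.

```python
def fetch_export() -> list[dict]:
    """Hypothetical stand-in for a tool call that returns a large export."""
    return [
        {"order_id": i, "status": "refunded" if i % 50 == 0 else "paid", "amount": 20.0}
        for i in range(10_000)
    ]


def summarize_refunds() -> str:
    """Runs in the execution environment; only the returned string is surfaced."""
    rows = fetch_export()  # all 10,000 rows stay in this process
    refunded = [row for row in rows if row["status"] == "refunded"]
    total = sum(row["amount"] for row in refunded)
    return "\n".join([
        f"rows scanned: {len(rows)}",
        f"refunded orders: {len(refunded)}",
        f"refunded amount: {total:.2f}",
    ])


summary = summarize_refunds()
```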
The Model Context Protocol already gives you the vocabulary for this gradient. A tool result carries content, the output the model reads and the place to optimize for readability and token efficiency, separately from structuredContent, a JSON object for programmatic use that is validated against the tool's output schema when the tool declares one. And it can return a resource_link instead of the bytes: a handle the agent dereferences only if it actually needs the payload. Return the pointer, not the file.
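For a sense of how those pieces sit together, here is a plain Python dict shaped like an MCP tool result. It is an illustration, not an SDK call: the build_result helper, the order data, and the orders:// URI are invented, and the spec for your protocol version is the authority on which fields are required.

```python
import json


def build_result(order_id: str, refund_shipped: bool) -> dict:
    """Build a dict shaped like an MCP tool result (illustrative only)."""
    answer = "yes" if refund_shipped else "no"
    return {
        "content": [
            # Short text for the model to read.
            {"type": "text", "text": f"Refund for order {order_id} shipped: {answer}."},
            # A pointer to the full record instead of its bytes. The URI is made up.
            {
                "type": "resource_link",
                "uri": f"orders://{order_id}/full",
                "name": f"order-{order_id}",
                "mimeType": "application/json",
            },
        ],
        # Machine-readable copy for code that consumes the result.
        "structuredContent": {"order_id": order_id, "refund_shipped": refund_shipped},
    }


result = build_result("A-1042", True)
serialized = json.dumps(result)
```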
Errors are tool results too
The failure path is where most tool designs go silent. When a call fails, the model's only information about what happened is what you return — and a raw stack trace tells it nothing it can act on. MCP makes this a first-class field: a tool that fails during execution sets isError: true and describes the failure in the content. Spend that message well. "order_id not found — call search_orders first to get a valid ID" is a result the agent can recover from. A 500 with a Python traceback is a loop.
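A sketch of the recoverable version. The get_order handler and its in-memory orders dict are hypothetical; the isError flag and the text content block follow the MCP result shape described above.

```python
def get_order(order_id: str, orders: dict) -> dict:
    """Look up an order; on failure, return an error the agent can act on."""
    try:
        order = orders[order_id]
    except KeyError:
        # Say what went wrong and what to do next, not where the code crashed.
        return {
            "isError": True,
            "content": [{
                "type": "text",
                "text": (
                    f"order_id {order_id!r} not found. "
                    "Call search_orders first to get a valid ID."
                ),
            }],
        }
    return {
        "isError": False,
        "content": [{"type": "text", "text": f"Order {order_id}: status={order['status']}"}],
    }


orders = {"A-1042": {"status": "refund_shipped"}}
ok = get_order("A-1042", orders)
missing = get_order("A-9999", orders)
```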
The rule under all of it
There's one principle that generates every tactic above: return the smallest thing that lets the model decide its next action. Sometimes that's three fields. Sometimes it's a file path. Sometimes it's an error sentence. It is almost never the API's raw response.
The model's context is a finite attention budget, not a database cursor you can page through for free. You spent real effort getting the agent to call the right tool. Don't undo it on the way back out.



