You wrote a tool. It calls an API, the schema validates, the unit tests pass. Then you hand it to an agent and it calls the wrong tool, or calls the right one with garbage arguments, or — worse — refuses to call it at all and apologizes instead. Nothing is broken. The code is fine. The problem is that you wrote documentation, and the model needed a prompt.

The description is the highest-reread text in your agent

Here is the thing most teams miss: a tool's name, its description, and each parameter's description are the model's entire interface to that tool. The model cannot read your implementation. It reads the words. As Anthropic's tool-writing guidance puts it, every word in a tool's name, description, and parameter documentation shapes how the agent understands and uses it. The description is not metadata attached to the real thing — to the model, it is the thing.

And it is a prompt you pay for repeatedly. Tool schemas are re-sent on every model call and billed as input tokens — roughly 200 tokens for a moderately documented tool, so five tools quietly add a thousand tokens to every single turn, before the user says anything. The description is simultaneously the most-reread and most-rebilled text in your whole agent. Your system prompt gets skimmed once per turn; your tool descriptions get consulted every time the model decides what to do next. Write them like the prompts they are.
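To see how that bill scales, here is a back-of-the-envelope sketch. The 200-token figure is the rough estimate above; the call volume is a hypothetical number picked for illustration, and `schema_overhead` is not part of any real library.

```python
# Back-of-the-envelope cost of re-sending tool schemas on every model call.
# TOKENS_PER_TOOL is the rough estimate from the text; the daily call volume
# below is a hypothetical number, not a measurement.

TOKENS_PER_TOOL = 200  # approximate size of one moderately documented tool


def schema_overhead(num_tools: int, calls: int = 1) -> int:
    """Input tokens spent on tool schemas alone across `calls` model calls."""
    return num_tools * TOKENS_PER_TOOL * calls


per_call = schema_overhead(num_tools=5)
per_day = schema_overhead(num_tools=5, calls=1_000)

print(per_call)  # 1000
print(per_day)   # 1000000
```

Real numbers depend on your tokenizer and how verbose your schemas are, so measure your own payloads before trusting the estimate.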

A tool description is a prompt that ships on every call and gets read more carefully than your system prompt. Stop writing it like an API docstring.

The bigger lever is fewer tools, not better prose

Before you polish a single description, count your tools. The instinct to expose your whole API surface, one tool per endpoint, is exactly backwards, because tool-selection accuracy collapses as the candidate set grows. The RAG-MCP benchmark makes this concrete: when every available tool was injected into the prompt, the model picked the right one only 13.62% of the time. When only the handful of tools relevant to the task were retrieved first, accuracy rose to 43.13%, roughly three times higher, while prompt tokens fell by more than half.

This is the same "more context, worse performance" curve that haunts long prompts generally. OpenAI's guidance is to keep fewer than about 20 tools available at the start of a turn. Past a few dozen, the right architecture is not a longer list but a tool-retrieval step: a search over your tools that surfaces only the ones this task needs. Consolidate, too: one well-described search_orders beats list_orders, filter_orders, and get_order_by_date competing for the model's attention with overlapping, vague descriptions. Overlap is poison. When two tools sound alike, the model calls the wrong one or freezes.
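A tool-retrieval step can be sketched in a few lines. This is a minimal illustration, not the RAG-MCP method: it scores tools by plain word overlap where a real system would likely use embeddings, and the catalog, the stopword list, and `retrieve_tools` are all hypothetical.

```python
import re

# Hypothetical tool catalog: name -> description. None of these are real APIs.
TOOLS = {
    "search_orders": "Find customer orders by date range, status, or customer email.",
    "refund_payment": "Refund a captured payment for an order, fully or partially.",
    "get_weather": "Get the current weather forecast for a city.",
    "send_email": "Send an email message to a customer address.",
}

# Tiny illustrative stopword list so filler words do not count as matches.
STOPWORDS = {"a", "an", "the", "for", "to", "by", "or", "of", "in", "is"}


def tokenize(text: str) -> set[str]:
    """Lowercase word set minus stopwords; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def retrieve_tools(task: str, tools: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k tool names whose name and description overlap the task."""
    task_words = tokenize(task)
    scored = []
    for name, description in tools.items():
        tool_words = tokenize(name.replace("_", " ") + " " + description)
        overlap = len(task_words & tool_words)
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


selected = retrieve_tools("refund the payment for order 1234", TOOLS)
print(selected)  # ['refund_payment']
```

Only the selected tools' schemas would then be sent to the model for this turn. Exact word matching is brittle ("order" does not match "orders" here), which is one reason a production version would swap in semantic search.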

Write agent-facing, not engineer-facing

Once the surface is small, the craft is in the wording, and the rule is simple: write for the agent, not for the next engineer.

Make malformed calls impossible

Good prose reduces errors; constraints eliminate them. Use a strict JSON-schema or structured-output mode, such as OpenAI's strict: true, and derive the schema from typed code (LangChain's docstring-and-type-hint schemas, Pydantic models) so the generated arguments must conform. Use enums to make invalid states unrepresentable: a status field that can only be open or closed can never be hallucinated into pending. And don't ask the model to supply what your code already knows. If you're holding the order_id, inject it server-side and let the model provide only the intent. Every argument you don't delegate is an error you can't have.
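Here is one way those three ideas can fit together, as a sketch under stated assumptions. The tool definition is loosely modeled on the JSON-schema style that function-calling APIs accept (the exact envelope varies by provider and API version), and `update_order_status`, `validate_arguments`, and the session dictionary are hypothetical names invented for this example.

```python
# Hypothetical tool definition. The model sees only `status`; there is no
# order_id parameter for it to get wrong.
UPDATE_ORDER_STATUS = {
    "type": "function",
    "name": "update_order_status",
    "description": "Set the status of the order the user is currently viewing.",
    "strict": True,
    "parameters": {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "enum": ["open", "closed"],  # invalid states are unrepresentable
                "description": "New status for the order.",
            }
        },
        "required": ["status"],
        "additionalProperties": False,
    },
}


def validate_arguments(args: dict, schema: dict) -> None:
    """Minimal server-side check of required keys, extra keys, and enums."""
    props = schema["properties"]
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    extra = [key for key in args if key not in props]
    if extra:
        raise ValueError(f"unexpected arguments: {extra}")
    for key, value in args.items():
        allowed = props[key].get("enum")
        if allowed is not None and value not in allowed:
            raise ValueError(f"{key} must be one of {allowed}, got {value!r}")


def handle_update_order_status(model_args: dict, session: dict) -> dict:
    """The model supplies only the intent; order_id comes from server state."""
    validate_arguments(model_args, UPDATE_ORDER_STATUS["parameters"])
    return {"order_id": session["order_id"], "status": model_args["status"]}


result = handle_update_order_status({"status": "closed"}, {"order_id": "ord_123"})
print(result)  # {'order_id': 'ord_123', 'status': 'closed'}
```

The server-side validator is a belt-and-braces choice: strict mode constrains what the model can generate, and the handler still refuses anything that slips through from a client that does not enforce it.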

For anything that writes, the MCP spec gives you behavior annotations (readOnlyHint, destructiveHint, idempotentHint), but it also tells clients to treat them as untrusted unless they come from a trusted server, and to keep a human in the loop before sensitive operations. Mark your destructive tools, gate them behind confirmation, and when something fails, return an error the agent can act on ("no order found for that email") rather than a stack trace it will paste back to the user.
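A confirmation gate plus actionable errors might look like the sketch below. The annotation names come from the MCP spec; everything else (the in-memory `ORDERS` store, the `call_tool` dispatcher, the `confirmed` flag) is a hypothetical stand-in for your own server code. The gate reads annotations from the server's own registry, so the trust question does not arise here.

```python
# Hypothetical in-memory data and tool registry, for illustration only.
ORDERS = {"ana@example.com": {"id": "ord_1", "status": "open"}}

TOOL_ANNOTATIONS = {
    "lookup_order": {"readOnlyHint": True},
    "cancel_order": {"destructiveHint": True, "idempotentHint": True},
}


def lookup_order(email: str) -> dict:
    order = ORDERS.get(email)
    if order is None:
        return {"ok": False, "error": "no order found for that email"}
    return {"ok": True, "order": order}


def cancel_order(email: str) -> dict:
    order = ORDERS.get(email)
    if order is None:
        return {"ok": False, "error": "no order found for that email"}
    order["status"] = "closed"
    return {"ok": True, "order": order}


HANDLERS = {"lookup_order": lookup_order, "cancel_order": cancel_order}


def call_tool(name: str, args: dict, confirmed: bool = False) -> dict:
    """Dispatch a tool call, gating destructive tools behind confirmation."""
    annotations = TOOL_ANNOTATIONS.get(name, {})
    # Assume the worst when a hint is absent: not read-only, and destructive.
    is_destructive = (
        not annotations.get("readOnlyHint", False)
        and annotations.get("destructiveHint", True)
    )
    if is_destructive and not confirmed:
        return {"ok": False, "error": f"{name} needs user confirmation before it runs"}
    try:
        return HANDLERS[name](**args)
    except Exception:
        # Keep the stack trace in server logs; give the agent a next step.
        return {"ok": False, "error": f"{name} failed; check the arguments or ask the user"}


print(call_tool("lookup_order", {"email": "nobody@example.com"}))
print(call_tool("cancel_order", {"email": "ana@example.com"}))
```

In this sketch `confirmed` would be set by your application after a human approves the action, never by the model itself.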

A tool, as Anthropic frames it, is a contract between a deterministic system and a non-deterministic agent. The description is where you write the contract. Most teams write it last, for the wrong reader. Write it first, for the model, and measure the calls — the same way you'd measure any other prompt that ships a thousand times a day. It's the cheapest reliability win in your agent, and it's hiding in plain text.