You wrote the system prompt. "You are a helpful, expert assistant. You are careful, thorough, and friendly." You gave the model some tools and turned it loose. And it loops on the same file four times, or fires off a dozen searches for a fact it already has, or declares victory after one step and hands a half-finished job back to the user. Nothing crashed. The prose was fine. You wrote a personality where the agent needed a policy.

This is the quiet category error in agent prompting. The skills that make a great chatbot system prompt — a vivid role, a warm tone, a tidy output format — are nearly orthogonal to the skills that make a great agent system prompt. A chatbot prompt frames a single reply. An agent prompt is control logic for a loop.

A chatbot prompt frames a reply; an agent prompt runs a loop

Start from what an agent actually is. Anthropic's Building Effective Agents draws the line cleanly: workflows are systems where "LLMs and tools are orchestrated through predefined code paths," while agents are systems where "LLMs dynamically direct their own processes and tool usage." In a workflow you own the plumbing. In an agent, the model owns the plumbing — and the system prompt is the only place you get to tell it how to behave while it does.

And it is not read once. The agent runs a loop — call a tool, read the result, decide the next move, repeat — and on every pass the entire context is re-sent to the model: system prompt, tool definitions, and the whole accreting history of tool calls and outputs. Your chatbot prompt gets skimmed once to set the mood. Your agent prompt gets re-executed every single turn, as the standing instruction the model consults before each decision. That changes what belongs in it.
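A minimal sketch of that loop makes the re-sending concrete. Everything here is an assumption for illustration: `scripted_model` stands in for a real LLM call, `read_file` is a made-up tool, and the prompt text is hypothetical. The only point is that the system prompt sits at the head of the context on every pass while the history grows behind it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy text; a real prompt would be longer and task-specific.
SYSTEM_PROMPT = "Read each file at most once. Stop as soon as the task is complete."


@dataclass
class Step:
    """One model decision: call a tool, or stop with a final answer."""
    tool: Optional[str] = None  # None means the model chose to stop
    argument: str = ""
    answer: str = ""


def run_agent(task, model, tools, max_turns=10):
    history = [f"user: {task}"]
    contexts_sent = []  # kept only so the re-sending is visible afterwards
    for _ in range(max_turns):
        # The whole context is rebuilt and re-sent on every pass.
        context = [SYSTEM_PROMPT, *history]
        contexts_sent.append(context)
        step = model(context)
        if step.tool is None:
            return step.answer, contexts_sent
        result = tools[step.tool](step.argument)
        history.append(f"tool_call: {step.tool}({step.argument})")
        history.append(f"tool_result: {result}")
    return "stopped: turn limit reached", contexts_sent


def scripted_model(context):
    """Stand-in for an LLM call: reads one file, then declares the task done."""
    if not any(line.startswith("tool_result:") for line in context):
        return Step(tool="read_file", argument="notes.txt")
    return Step(answer="Summary written.")


tools = {"read_file": lambda path: f"<contents of {path}>"}
answer, contexts = run_agent("Summarize notes.txt", scripted_model, tools)

for turn, context in enumerate(contexts, start=1):
    is_first = context[0] == SYSTEM_PROMPT
    print(f"turn {turn}: {len(context)} entries, system prompt first: {is_first}")
```

In this toy run the model sees the same system prompt twice, once before the tool call and once after, with two more history entries on the second pass. A real agent repeats that for every turn it takes.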

A chatbot prompt is read once to set a tone. An agent prompt is reread every turn to make a decision. Write the second one like the control logic it is.

The persona is the least load-bearing part

The instinct is to spend the opening lines on identity: "you are a world-class senior engineer with twenty years of experience." It feels like the foundation. The evidence says it's decoration. Zheng et al., in the bluntly titled When "A Helpful Assistant" Is Not Really Helpful, tested 162 distinct personas in system prompts across several model families and thousands of factual questions. Adding a persona did not reliably improve performance over a plain prompt; the effect of any given persona was, in their words, "largely random."

A role still earns its place when the job is voice — you genuinely want a terse SRE register or a patient-tutor tone. But it does not change which tool the agent picks, whether it stops, or whether it respects a constraint. Those are decisions, and decisions are governed by rules, not by an adjective. Spend your tokens accordingly.

What actually belongs in there

Treat the system prompt as the agent's operating manual, written in rough priority order:

The right altitude, and the minimum dose

There's a failure mode on each side of good. Hardcode a brittle decision tree into prose and you get a fragile prompt that breaks the moment reality deviates. Wave your hands with "use good judgment" and you've given the model nothing to act on. Anthropic's context-engineering guidance names the target the "right altitude": specific enough to steer behavior, general enough to leave the model room. It pairs that with the discipline of a minimum effective dose: the smallest possible set of high-signal tokens that gets the outcome.

That frugality isn't aesthetic. Because the prompt is reread every turn against an ever-growing history, length is a tax on attention. The "Lost in the Middle" work (Liu et al.) showed that models use information best when it sits at the start or end of a long context and markedly worse when it is stranded in the middle, and a rule buried deep in a bloated system prompt is exactly the kind of thing that gets stranded as the context fills up. Structure helps the model find what matters: clear sections or explicit tags beat a wall of prose. Format is a real lever.

Last, treat the prompt like code, because it behaves like code. Sclar et al. found that semantically identical prompts (same meaning, different formatting) can swing few-shot benchmark accuracy by up to 76 points on a single model. A reword you'd call cosmetic can quietly change your agent's behavior. So version it, and test it against an eval set instead of vibes. The persona you'll get right on the first try. The policy you'll have to earn.
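What "version it and test it" might look like, as a rough sketch: fingerprint the exact prompt text so that even a whitespace change produces a new version, then score each version against a fixed set of cases. The `fake_agent` below is a deliberately trivial stand-in with the outcome baked in, and the cases are hypothetical. A real harness would call your actual agent and use checks drawn from real tasks.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def prompt_version(prompt: str) -> str:
    """Short fingerprint of the exact prompt text, formatting included."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


@dataclass
class EvalCase:
    task: str
    check: Callable[[str], bool]  # True if the agent's output is acceptable


def run_evals(agent, prompt: str, cases: list[EvalCase]) -> float:
    """Fraction of cases whose output passes its check."""
    passed = sum(1 for case in cases if case.check(agent(prompt, case.task)))
    return passed / len(cases)


def fake_agent(prompt: str, task: str) -> str:
    """Stand-in agent: finishes the job only if the prompt has a stopping rule."""
    if "stop when" in prompt.lower():
        return f"finished: {task}"
    return f"partial: {task}"


cases = [
    EvalCase("rename the config file", lambda out: out.startswith("finished")),
    EvalCase("update both READMEs", lambda out: out.startswith("finished")),
]

persona_prompt = "You are a helpful, expert assistant."
policy_prompt = "Stop when every requested change is made and verified."

for prompt in (persona_prompt, policy_prompt):
    score = run_evals(fake_agent, prompt, cases)
    print(f"{prompt_version(prompt)}  pass rate {score:.0%}")
```

Logging the fingerprint next to the pass rate is the useful habit here: when behavior shifts, you can tell whether the prompt text changed, even by a character, before you go looking anywhere else.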