Your agent did the wrong thing, and you open the file where it lives, and you start adding print statements. Stop — you're about to instrument the one part of the system that almost certainly isn't broken. The loop iterated correctly. The tool call executed. The JSON parsed. Every line of your code did exactly what you told it to. What went wrong was a decision: the model, handed some context, chose the wrong tool, or filled in the wrong argument, or quietly gave up. And that decision was made in a place your debugger can't reach — the message array you never printed.

The unit of debugging is the transcript

The mental shift is the whole thing. In an ordinary program, the interesting state is your variables. In an agent, the interesting state is what the model saw — the rendered system prompt, the tool definitions exactly as they were serialized, the full running message history, and the raw completion that came back. The bug is overwhelmingly in there: a tool whose description doesn't say what it actually does, a retrieved chunk that crowded out the instruction, a prior tool result the model misread. Anthropic makes this point structurally with what it calls the agent–computer interface — the model can only act on what it actually receives, so your tool documentation deserves the same care as your prompt. Debugging an agent is, mostly, reading.

The agent's source code is the last place to look. The model didn't run your code wrong; it reasoned correctly over a context you've never actually read.

So step one is to capture the context, completely. This is what agent tracing tools exist for: LangSmith, Langfuse, and Arize Phoenix each record the full input and output of every model call and every tool invocation, laid out as a tree you can walk. If you'd rather not be locked to a vendor, OpenTelemetry's GenAI semantic conventions standardize the span attributes — the gen_ai.* fields for model, tokens, and finish reason — so the same trace reads across tools. Whatever you use, the bar is the same: you must be able to open the failing run and see, verbatim, what the model was looking at when it went wrong.
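
If you want to see the shape of that capture before adopting a tracing tool, here is a minimal sketch using only the standard library. It assumes a hypothetical `call_model` function that accepts the system prompt, tool definitions, and messages as keyword arguments and returns a JSON-serializable dict. The name `traced_call` and the record layout are illustrative, not any vendor's format.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable


def traced_call(
    call_model: Callable[..., dict[str, Any]],  # hypothetical model client
    log_path: Path,
    *,
    system: str,
    tools: list[dict[str, Any]],
    messages: list[dict[str, Any]],
) -> dict[str, Any]:
    """Call the model and append the verbatim request and response to a JSONL log."""
    # Snapshot the request before the call, so later mutation of `messages`
    # cannot change what was recorded.
    request = json.loads(
        json.dumps({"system": system, "tools": tools, "messages": messages})
    )
    started = time.time()
    response = call_model(system=system, tools=tools, messages=messages)
    record = {
        "ts": started,
        "latency_s": time.time() - started,
        "request": request,    # exactly what the model was given
        "response": response,  # the raw completion, unparsed by agent logic
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return response
```

One record per model call is enough to reopen a failing run and read what the model was looking at. A real tracing tool adds the tree view and tool-invocation spans on top of the same idea.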

Replay the inputs, not the outputs

Once you can see the bad run, you want to test a fix (change a line in the system prompt, swap the model, rewrite a tool's description) and re-run the same context to see if it helps. Langfuse's playground is built for this kind of loop: open a captured generation, edit one thing, run it again.

The trap is expecting the replay to be deterministic. It isn't, and not because of anything in your code. LLM inference isn't bit-for-bit reproducible even at temperature 0: Thinking Machines ran the same temperature-0 prompt 1,000 times and got 80 distinct completions, and traced the cause not to sampling randomness but to how concurrent requests get batched on the server. The practical consequence is that "reproduce the bug" means reproduce the inputs. You freeze the context, change one variable, and judge the fix by whether the failure rate drops across several runs, not by whether one re-run happens to match. One green replay proves nothing; a lower failure rate across ten proves something.
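
The comparison described above can be sketched in a few lines. This is an illustration under stated assumptions: `call_model` is a hypothetical function that takes a frozen context dict and returns a completion, `is_failure` is whatever check you write for the specific bug, and the helper names are invented for this example.

```python
from typing import Any, Callable


def failure_rate(
    call_model: Callable[[dict[str, Any]], Any],  # hypothetical model client
    frozen_context: dict[str, Any],
    is_failure: Callable[[Any], bool],
    runs: int = 10,
) -> float:
    """Replay one frozen context `runs` times and return the fraction that fail."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = sum(
        1 for _ in range(runs) if is_failure(call_model(frozen_context))
    )
    return failures / runs


def with_system_prompt(context: dict[str, Any], new_prompt: str) -> dict[str, Any]:
    """Return a copy of the context with only the system prompt changed."""
    return {**context, "system": new_prompt}
```

You would call `failure_rate` once on the captured context and once on `with_system_prompt(context, candidate_prompt)`, then compare the two numbers. Ten runs is a small sample, so treat a modest difference as a hint rather than a verdict.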

Read thirty traces before you trust a metric

The instinct after fixing one trace is to write an automated checker and move on. Do the unglamorous thing first. Hamel Husain's repeated finding from real LLM projects is that error analysis — sitting down and reading your actual traces, by hand — is the highest-leverage activity in the entire eval loop, and it comes before any automated metric. The recipe is plain: pull 30-odd representative runs, write a short open-ended note on what went wrong in each, then group those notes into a failure taxonomy. The payoff is that your thirty individual bugs almost always collapse into three or four categories — "tool called with a stale ID," "model hedged instead of acting," "retrieval missed the relevant doc" — and now you know which one to fix first because you can count it.
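
The counting step needs nothing more than a tally. A minimal sketch, assuming you have already read the traces and assigned each note a category by hand; the `TraceNote` structure and `rank_failures` name are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceNote:
    trace_id: str
    note: str      # open-ended observation written while reading the trace
    category: str  # label assigned afterwards, when grouping the notes


def rank_failures(notes: list[TraceNote]) -> list[tuple[str, int]]:
    """Return failure categories ordered from most to least frequent."""
    return Counter(n.category for n in notes).most_common()
```

The first entry of the result is the category to fix first. The code is trivial on purpose: the reading and labeling is where the work is, and this only makes the result countable.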

Turn the fix into a test, or you'll debug it again

The last step is what separates debugging from whack-a-mole. Every failure you diagnose should leave behind an eval case: the input that triggered it, captured, with an assertion about what should happen instead. Drop it into an eval dataset that runs in CI. Without that, the loop is brutal — you fix a bug today, edit the prompt for something unrelated in three weeks, and silently reintroduce the exact failure with nothing to catch it, because prompt edits have side effects no type system will warn you about. The fixed bug isn't done when it stops happening; it's done when a test would scream if it started again. A related discipline, hallucination detection, is just this pattern aimed at a specific failure class.
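
As a sketch of what such an eval case might look like, assume each case is a dict holding a captured context and the tool the model should have chosen, and that a hypothetical `call_model` returns a dict with a `"tool"` key. The field names and the `run_regressions` helper are illustrative, not a particular eval framework's API.

```python
from typing import Any, Callable


def run_regressions(
    call_model: Callable[[dict[str, Any]], dict[str, Any]],  # hypothetical client
    cases: list[dict[str, Any]],
    runs: int = 5,
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Replay each captured case several times; return ids that fail too often."""
    failing = []
    for case in cases:
        misses = sum(
            1
            for _ in range(runs)
            if call_model(case["context"]).get("tool") != case["expected_tool"]
        )
        # Replays are not deterministic, so judge by rate, not a single run.
        if misses / runs > max_failure_rate:
            failing.append(case["id"])
    return failing
```

In CI, a wrapper would load the cases from your eval dataset and fail the build when the returned list is non-empty. Whether the threshold should be zero or a small tolerance is a judgment call that depends on how flaky the underlying behavior is.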

Capture the transcript, replay the inputs, read your data, lock the fix. The debugger you reach for isn't a stepper — it's the trace, and the willingness to read what the model actually saw.