Picture the dashboard the morning after you ship an agent. Request rate: steady. Error rate: 0.2%. p99 latency: comfortably under budget. By every signal a web app lives or dies on, the agent is healthy. It is also, on roughly one run in twenty, confidently answering the wrong question — calling a refund tool when it should have called a lookup, or citing a document it never retrieved. Your monitoring has no idea, because it was built to watch a different kind of program.
An agent can be 200-OK, fast, and cheap while being completely wrong. That sentence is the entire reason agent monitoring is its own discipline.
Accept that RED metrics are structurally blind to agent failure
Rate, errors, duration — the RED method — assumes failure looks like a non-200 or a slow response. Agents break inside the success case. As one production post-mortem of the pattern puts it, "each individual LLM call may succeed — 200 OK, valid JSON, within latency thresholds — while the overall chain produces a wrong or harmful outcome." Traditional APM "cannot detect when an agent selects the wrong tool or gets trapped in a reasoning loop," because it cannot model the multi-step causal chain at all.
This is not a tooling gap you close by adding more counters. The unit your APM watches — the request — is the wrong unit. The thing that succeeded or failed is the trajectory: the sequence of decisions, model calls, and tool calls that produced the answer. Until that's the unit you observe, you are measuring the wrapper and ignoring the program. (The trace, not the log, is the new primitive — and for an agent the trace is the whole reasoning path.)
Instrument the trajectory as spans
Make every step its own span, nested under the run. The model call is a span. The tool call is a span. The retrieval is a span. The agent invocation that orchestrates them is the parent. Now a wrong-tool failure isn't an invisible 200 — it's a span in the tree you can point at.
The encouraging news is that there's finally a standard shape for this. OpenTelemetry's GenAI semantic conventions define the attributes: gen_ai.operation.name (with values like chat, execute_tool, embeddings), gen_ai.request.model, and gen_ai.usage.input_tokens / output_tokens on each span, with a span name of {operation} {model}. There's a dedicated agent-span convention too — create_agent, invoke_agent, execute_tool, with gen_ai.agent.name and gen_ai.agent.id — so a multi-step run renders as a labeled tree instead of a pile of HTTP calls.
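To make the tree shape concrete, here is a minimal sketch using only the Python standard library. The `Span` and `Tracer` classes are hypothetical stand-ins for a real OpenTelemetry tracer, not its API; the agent, model, and tool names are invented. The attribute keys follow the conventions named above, and `gen_ai.tool.name` is my reading of the same spec, so check it against the version you pin.

```python
from __future__ import annotations

from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for a tracer span: name, attributes, children."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)


class Tracer:
    """Toy tracer that nests spans by keeping a stack of open spans."""

    def __init__(self) -> None:
        self.roots: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, attributes: dict):
        current = Span(name, dict(attributes))
        # Attach to the innermost open span, or record a new root.
        siblings = self._stack[-1].children if self._stack else self.roots
        siblings.append(current)
        self._stack.append(current)
        try:
            yield current
        finally:
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten the span tree into indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


MODEL = "example-model"  # hypothetical model name
tracer = Tracer()

# One agent run: the parent span orchestrates two model calls and a tool call.
with tracer.span("invoke_agent support-agent", {
    "gen_ai.operation.name": "invoke_agent",
    "gen_ai.agent.name": "support-agent",
}):
    with tracer.span(f"chat {MODEL}", {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": MODEL,
    }) as llm:
        # Token usage is only known once the response arrives.
        llm.attributes["gen_ai.usage.input_tokens"] = 812
        llm.attributes["gen_ai.usage.output_tokens"] = 64
    with tracer.span("execute_tool lookup_order", {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "lookup_order",
    }):
        pass
    with tracer.span(f"chat {MODEL}", {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": MODEL,
    }):
        pass

tree = render(tracer.roots[0])
print("\n".join(tree))
```

If the agent had called a refund tool here instead, the `execute_tool` line would say so, which is the whole point: the wrong decision is a named node rather than something buried inside a 200.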
You don't have to hand-roll the instrumentation. Vendor specs and libraries sit on the same OTel base: Arize's OpenInference defines span kinds (LLM, RETRIEVER, RERANKER, TOOL, AGENT, GUARDRAIL, EVALUATOR), where an AGENT span is explicitly "a reasoning block that acts on tools using the guidance of an LLM." Traceloop's OpenLLMetry auto-instruments the providers, vector DBs, and frameworks you already use (the OTel-native options compared). One honest caveat: the OTel GenAI conventions still carry a Status: Development badge. They are the right bet, but pin your versions — attribute names will move.
Put a dollar figure and a step count on every run
With trajectory spans in place, the agent-specific operational metrics fall out of the data instead of being guessed at. OTel's GenAI metrics give you gen_ai.client.token.usage (split into input and output) and gen_ai.client.operation.duration — the raw material for the numbers that actually predict agent trouble:
- Tokens and cost per run, broken down by step and model. A p99-latency chart hides the run that quietly burned 200K tokens; a per-trace cost chart doesn't (why your eval needs a dollar axis).
- Steps per run. A creeping average step count is the early signature of a loop forming before it trips your max-step guard.
- Tool error rate per tool, plus per-tool latency and model time-to-first-token as percentiles, so a degrading dependency shows up as its own span getting slower, not as a vague whole-system slowdown.
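As a rough illustration of how those numbers fall out of span data, the sketch below reduces one run's spans to a cost, a token total, a step count, and a per-tool error rate. Everything specific is an assumption: the flat dict layout, the `tool` and `error` keys, the model name, and above all the prices, which are placeholders rather than any provider's real rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"example-small": {"input": 0.50, "output": 1.50}}

# Child spans of a single agent run, flattened to dicts (assumed layout).
run_spans = [
    {"gen_ai.operation.name": "chat", "gen_ai.request.model": "example-small",
     "gen_ai.usage.input_tokens": 1000, "gen_ai.usage.output_tokens": 200},
    {"gen_ai.operation.name": "execute_tool", "tool": "lookup_order", "error": True},
    {"gen_ai.operation.name": "execute_tool", "tool": "lookup_order", "error": False},
    {"gen_ai.operation.name": "chat", "gen_ai.request.model": "example-small",
     "gen_ai.usage.input_tokens": 2000, "gen_ai.usage.output_tokens": 100},
]


def summarize_run(spans):
    """Reduce one run's spans to cost, tokens, steps, and tool error rates."""
    tokens = 0
    micro_usd = 0.0  # accumulate price-weighted tokens, divide once at the end
    tool_calls = defaultdict(int)
    tool_errors = defaultdict(int)
    for span in spans:
        op = span["gen_ai.operation.name"]
        if op == "chat":
            price = PRICE_PER_MTOK[span["gen_ai.request.model"]]
            n_in = span["gen_ai.usage.input_tokens"]
            n_out = span["gen_ai.usage.output_tokens"]
            tokens += n_in + n_out
            micro_usd += n_in * price["input"] + n_out * price["output"]
        elif op == "execute_tool":
            tool_calls[span["tool"]] += 1
            tool_errors[span["tool"]] += int(span["error"])
    return {
        "steps": len(spans),  # every child span counts as one step
        "tokens": tokens,
        "cost_usd": micro_usd / 1_000_000,
        "tool_error_rate": {
            tool: tool_errors[tool] / calls for tool, calls in tool_calls.items()
        },
    }


summary = summarize_run(run_spans)
print(summary)
```

Charting `steps` and `cost_usd` per run, rather than averaging them into a request-level latency figure, is what makes the expensive or looping run visible.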
Add an online quality signal: the one infra can never give you
Everything so far tells you what the agent did. None of it tells you whether the answer was good. For that you need evaluation running against live traffic, not just in CI.
The pattern that's emerged is the online eval: sample a slice of production traces and score them asynchronously with an LLM-as-a-judge (or cheaper rule-based checks). Langfuse runs "fully managed LLM-as-a-judge evaluations on production traces," queuing each matching trace and scoring it out of band; it recommends sampling 5–10% of traffic to hold cost to roughly a cent to a nickel per eval. Arize Phoenix frames the same idea as tasks that "continuously run your evaluators on incoming data," with a configurable sampling rate so you grade a representative subset rather than every call. The judge runs on a strong model even when production runs on a cheap one — quality grading is the one place you pay up.
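The sampling half of that pattern can be sketched in a few lines. This is not how Langfuse or Phoenix implement it; it is one common approach, hashing the trace id so the same trace always gets the same decision. The trace layout and the `rule_based_score` check are hypothetical, with the rule standing in for the cheaper end of the judge spectrum: it flags an answer that cites a document the run never retrieved.

```python
import hashlib

SAMPLE_RATE = 0.10  # upper end of the 5-10% range mentioned above


def should_evaluate(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64


def rule_based_score(trace: dict) -> float:
    """Cheap stand-in for a judge: every cited doc must have been retrieved."""
    cited = set(trace["cited_doc_ids"])
    retrieved = set(trace["retrieved_doc_ids"])
    return 1.0 if cited <= retrieved else 0.0


def evaluate_sampled(traces, judge=rule_based_score, rate=SAMPLE_RATE):
    """Score only the sampled traces; in production this would run out of band."""
    return {
        t["trace_id"]: judge(t)
        for t in traces
        if should_evaluate(t["trace_id"], rate)
    }


# Hypothetical traces: the second cites a document it never retrieved.
traces = [
    {"trace_id": "t-1", "cited_doc_ids": ["d1"], "retrieved_doc_ids": ["d1", "d2"]},
    {"trace_id": "t-2", "cited_doc_ids": ["d9"], "retrieved_doc_ids": ["d1"]},
]

# rate=1.0 here only so the tiny example scores both traces.
print(evaluate_sampled(traces, rate=1.0))
```

Hash-based sampling has one practical advantage over a random draw: a retry or a re-run of the evaluation job picks the same traces, so scores stay comparable across runs. Swapping `rule_based_score` for an LLM judge changes the cost per eval, not the structure.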
A sampled judge score, trended over time and broken down by route, is the metric that would have caught the one-in-twenty wrong-answer rate from the opening — long before a customer did. Pair it with simulated-user tests before you ship, and you've closed the loop on both sides of the deploy.
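Trending the score by route is a small aggregation once the scores exist. The sketch below assumes each scored trace has been reduced to a `(day, route, score)` tuple; the dates, route names, and the 0.9 alert threshold are all invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge scores: (ISO day, route, score in [0, 1]).
records = [
    ("2025-01-01", "refunds", 1.0),
    ("2025-01-01", "refunds", 1.0),
    ("2025-01-01", "lookup", 1.0),
    ("2025-01-02", "refunds", 1.0),
    ("2025-01-02", "refunds", 0.0),
    ("2025-01-02", "lookup", 1.0),
]


def daily_route_means(rows):
    """Average judge score for each (route, day) pair."""
    buckets = defaultdict(list)
    for day, route, score in rows:
        buckets[(route, day)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


def regressed_routes(rows, threshold=0.9):
    """Routes whose most recent daily mean fell below the threshold."""
    latest = {}
    for (route, day), avg in daily_route_means(rows).items():
        # ISO date strings sort chronologically, so string comparison works.
        if route not in latest or day > latest[route][0]:
            latest[route] = (day, avg)
    return sorted(route for route, (_, avg) in latest.items() if avg < threshold)


print(daily_route_means(records))
print(regressed_routes(records))
```

A whole-system average over these six scores would still look respectable; splitting by route is what isolates the one that broke.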
The reframe is the whole job. Monitoring a web service asks: did the request succeed? Monitoring an agent asks: did the right things happen, in the right order, for the right cost, to produce a good answer? Those are four questions, and a green RED dashboard answers none of them. Instrument the trajectory, price every step, and judge a sample of the output, and the failures that used to hide inside HTTP 200 finally have somewhere to show up. When one does, you'll be debugging a labeled trace instead of guessing at a black box.



