Topic

AI Agent Evaluation & Observability

The evaluation-and-observability library, read in order: why you eval (eval-driven development, online vs offline), building the eval (datasets, CI/CD gates), the judge that does the measuring (LLM-as-a-judge and its biases, agent-as-a-judge), evaluating a specific capability (tool use, coding, deep research, voice), reliability metrics (pass@k, cost-aware eval), the standardized benchmarks (SWE-bench, τ-bench, Terminal-Bench, GAIA), and production observability with the eval and tracing platforms (Langfuse, LangSmith, Braintrust, Arize, Phoenix, OpenLLMetry).

Eval-Driven Development: How to Ship an AI Agent Without Guessing

Write the eval before the prompt. The test suite you build first is the only thing that lets you change models next month without praying — and in 2026, you will change models.

The Evals Are the Product

Agents got trivial to build and impossible to trust. The repos worth starring now aren't frameworks — they're the eval and tracing layer that tells you whether the thing actually works.

Online vs Offline Evals for AI Agents: Why Production Traces Need a Different Scorer

Offline evals ask whether the agent matched a known answer. Online evals can't — there is no answer. Treating them as one pipeline with one metric is the mistake that lets agents pass every test and still fail in production.

How to Build an LLM Eval Dataset

The scoring framework is the commodity. The hard, valuable, un-buyable work is looking at your own outputs and distilling real failures into labeled cases — your eval set is a precipitate of error analysis, not a download.

How to Add LLM Evals to CI/CD Without Building a Flaky Gate

You wire your eval into GitHub Actions, gate the merge on it, and a week later it's red on a PR that changed nothing. The fix isn't a retry — it's admitting an eval is a measurement, not an assertion.
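One way to treat an eval as a measurement is to gate on an interval rather than a point score. The sketch below is a minimal illustration, not the article's method: the function names, the 95% Wilson score interval, and the rule "fail only when the interval's upper bound falls below the baseline" are all assumptions chosen for the example.

```python
from math import sqrt


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def gate_passes(passes: int, n: int, baseline_rate: float) -> bool:
    """Hypothetical merge gate: block only a statistically visible regression.

    The gate stays green unless even the optimistic end of the interval
    is below the baseline pass rate.
    """
    _, upper = wilson_interval(passes, n)
    return upper >= baseline_rate


# 45/50 against a 0.92 baseline is within noise, so the gate stays green.
noisy_but_fine = gate_passes(45, 50, baseline_rate=0.92)
# 30/50 against the same baseline is a real drop, so the gate goes red.
real_regression = gate_passes(30, 50, baseline_rate=0.92)
```

With only 50 cases the interval is wide, so a gate like this catches large regressions and ignores run-to-run noise; a tighter gate needs more cases or repeated runs, not a retry loop.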

LLM-as-a-Judge: How to Build an Eval That Doesn't Quietly Lie to You

Using a model to grade your model feels like measurement. Until you learn what the judge is actually rewarding — verbosity, position, and its own prose — it's closer to a focus group of one.

Your LLM Judge Is Biased: Position, Verbosity, and Self-Preference — and Which Ones You Can Fix

An LLM judge flips up to a third of its verdicts when you swap the answer order, and scores its own writing 10–25% higher. Three biases corrupt your evals — and only one has a cheap fix.
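The order-swap check can be sketched in a few lines. This is an illustration under stated assumptions: `Judge` is a hypothetical callable that wraps whatever model you use and returns "first" or "second", and the two stub judges exist only so the example runs without a model.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (question, answer_shown_first, answer_shown_second)
# returns "first" or "second".
Judge = Callable[[str, str, str], str]


def order_robust_verdict(judge: Judge, question: str, a: str, b: str) -> Optional[str]:
    """Ask the judge twice with the answers swapped.

    Returns "a" or "b" when both orderings agree, None when the verdict flips.
    """
    winner_ab = "a" if judge(question, a, b) == "first" else "b"
    winner_ba = "b" if judge(question, b, a) == "first" else "a"
    return winner_ab if winner_ab == winner_ba else None


def flip_rate(judge: Judge, pairs: list[tuple[str, str, str]]) -> float:
    """Fraction of (question, a, b) pairs whose verdict changes with order."""
    flips = sum(order_robust_verdict(judge, q, a, b) is None for q, a, b in pairs)
    return flips / len(pairs)


# Stub judges for demonstration only.
def always_first(question: str, x: str, y: str) -> str:
    return "first"  # pure position bias


def prefers_longer(question: str, x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"  # pure verbosity bias


pairs = [("q1", "short", "a much longer answer"), ("q2", "tiny", "also quite long")]
position_flip_rate = flip_rate(always_first, pairs)
verbosity_flip_rate = flip_rate(prefers_longer, pairs)
```

The second stub shows the limit of the technique: a judge that always rewards length is perfectly consistent under swapping, so a low flip rate rules out position bias only.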

Agent-as-a-Judge vs LLM-as-a-Judge: Grading the Trajectory, Not Just the Answer

An LLM judge scores the final answer. For a multi-step agent, that signal is sparse, late, and easy to fool — a broken trajectory can still land on a right answer, and you'd never know.

How to Evaluate an AI Agent's Tool Use, Not Just Its Answer

There is rarely one correct path through a task, so grading an agent against a golden trajectory fails. Grade invariants over the path, and the final state, instead.
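Grading invariants plus final state might look like the sketch below. Everything named here is hypothetical: the tool names (`lookup_order`, `issue_refund`, `delete_account`), the trajectory shape (a list of dicts with a "tool" key), and the state fields are invented for illustration.

```python
from typing import Any, Callable

ToolCall = dict[str, Any]
Invariant = Callable[[list[ToolCall]], bool]


def called_before(first: str, then: str) -> Invariant:
    """If `then` is ever called, `first` must appear earlier in the path."""
    def check(trajectory: list[ToolCall]) -> bool:
        names = [call["tool"] for call in trajectory]
        if then not in names:
            return True
        return first in names[: names.index(then)]
    return check


def never_called(tool: str) -> Invariant:
    """The path must not contain `tool` at all."""
    def check(trajectory: list[ToolCall]) -> bool:
        return all(call["tool"] != tool for call in trajectory)
    return check


def grade(trajectory, final_state, invariants, expected_state) -> dict[str, Any]:
    """Pass when every invariant holds and the final state matches."""
    failed = [name for name, check in invariants.items() if not check(trajectory)]
    state_ok = all(final_state.get(k) == v for k, v in expected_state.items())
    return {"passed": not failed and state_ok, "failed_invariants": failed, "state_ok": state_ok}


invariants = {
    "lookup_before_refund": called_before("lookup_order", "issue_refund"),
    "no_account_deletion": never_called("delete_account"),
}
expected = {"refund_status": "issued"}

# Two different paths that should both pass.
path_1 = [{"tool": "lookup_order"}, {"tool": "issue_refund"}]
path_2 = [{"tool": "search_faq"}, {"tool": "lookup_order"}, {"tool": "issue_refund"}]
# Right final state, wrong path: the refund went out before the lookup.
path_3 = [{"tool": "issue_refund"}, {"tool": "lookup_order"}]

result_1 = grade(path_1, {"refund_status": "issued"}, invariants, expected)
result_2 = grade(path_2, {"refund_status": "issued"}, invariants, expected)
result_3 = grade(path_3, {"refund_status": "issued"}, invariants, expected)
```

A golden-trajectory comparison would reject `path_2` for the extra FAQ search; the invariant grader accepts it and still catches `path_3`, which reaches the right end state by an unsafe route.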

How to Evaluate an AI Coding Agent

Public leaderboards answer 'which model is smartest,' not 'will it fix my bugs' — the only test that predicts your outcome is a private eval built from your own repo.

How to Evaluate a Deep Research Agent: Report Quality vs. Citation Accuracy

A deep research agent hands you a long, confident, well-structured report. Grading it means measuring two different things at once: how well it reads, and whether each individual sentence is actually supported.

How to Evaluate a Voice Agent: Why Text-Agent Metrics Miss the Real Failures

Transcription accuracy is table stakes. The failure surface that actually loses calls is conversational timing — turn-taking, barge-in, and an end-to-end latency budget you have to measure component by component.

Pass@k vs Pass^k: Measuring Whether an Agent Is Reliable, Not Just Capable

pass@k asks whether an agent can ever solve a task. pass^k asks whether it solves it every single time. For long-horizon agents those are different questions — and the gap is where production failures live.
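The two metrics can be estimated from the same set of runs. The sketch below assumes n independent trials of one task with c successes and uses the standard combinatorial estimators: the chance that at least one of k sampled trials succeeds for pass@k, and the chance that all k succeed for pass^k. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled trials succeeds)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled trials succeed)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# An agent that succeeds on 7 of 10 trials of the same task:
capable = pass_at_k(n=10, c=7, k=3)    # 1 - 1/120, about 0.99
reliable = pass_hat_k(n=10, c=7, k=3)  # 35/120, about 0.29
```

The same 70% per-trial success rate reads as near-certain capability under pass@3 and as a roughly one-in-three chance of three clean runs under pass^3.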

Cost-Aware Agent Evaluation: Why Your Benchmark Needs a Dollar Axis

An agent leaderboard that ranks only on accuracy is secretly ranking on willingness to spend. Add the cost axis and the board's #1 is often not even on the frontier.
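Adding the cost axis amounts to computing a Pareto frontier over (cost, accuracy). The sketch below is illustrative only: the agent names, dollar costs, and accuracies are invented, and the dominance rule used (no other entry is at least as cheap and strictly more accurate) is one simple choice among several.

```python
def pareto_frontier(entries: list[tuple[str, float, float]]) -> list[str]:
    """Return names of entries not dominated on (lower cost, higher accuracy).

    Each entry is (name, cost_per_task_usd, accuracy).
    """
    frontier: list[str] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best accuracy first.
    for name, _cost, accuracy in sorted(entries, key=lambda e: (e[1], -e[2])):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


# Hypothetical leaderboard rows: (name, cost per task in USD, accuracy).
leaderboard = [
    ("agent_a", 12.0, 0.81),
    ("agent_b", 1.5, 0.78),
    ("agent_c", 4.0, 0.76),
    ("agent_d", 9.0, 0.80),
]

frontier = pareto_frontier(leaderboard)
```

In this made-up data an accuracy-only ranking puts `agent_a` first and hides that `agent_b` lands within three points at one eighth of the cost, while `agent_c` is dominated outright: it costs more than `agent_b` and scores lower.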

SWE-bench vs τ-bench vs GAIA: Which Agent Benchmark Actually Predicts Production

They look like a difficulty ladder. They're three orthogonal axes — and only one of them measures the thing that decides whether your agent survives contact with real users.

SWE-bench Pro vs SWE-bench Verified: Why Top Coding Agents Dropped From 70% to 23%

The same models that ace SWE-bench Verified collapse on its successor. The gap isn't difficulty — it's the size of an illusion, and the only durable fix turned out to be a software license.

Terminal-Bench vs SWE-bench: Why Patching Code and Operating a Shell Are Different Skills

SWE-bench hands an agent a broken test and a healthy repo. Terminal-Bench hands it a live machine and lets it break things. That's why a top SWE-bench score tells you almost nothing about the Terminal-Bench score.

τ-bench vs τ²-bench: The Agent Benchmark That Scores Whether You Can Guide a Human

Most agent benchmarks hand the whole task to the model. τ-bench keeps the user in the loop, and τ²-bench gives the user their own hands — which is where frontier agents quietly fall apart.

SWE-EVO vs SWE-bench: The Long-Horizon Test Coding Agents Fail

A new benchmark drops the same models from ~73% to ~25% — not by making the bugs harder, but by taking away the one thing SWE-bench always handed over: a map to the change.

GAIA2: The Agent Benchmark Where the Clock Never Stops

Static benchmarks freeze the world while an agent thinks. Meta's GAIA2 lets time run — and the smartest model, GPT-5, turns out to be the one that misses deadlines.

The Benchmarks Are Theater Now

When every frontier model clusters within a tenth of a point on the same saturated tests, the leaderboard stops measuring quality and starts measuring marketing.

How to Monitor an AI Agent in Production

Your agent can be HTTP-200, fast, and cheap while being completely wrong. The metrics that keep a web app healthy are blind to the ways an agent actually fails.

The Trace Is the New Log

Agent observability didn't invent a standard. It surrendered to a boring one from 2019 — and in doing so quietly retired the log as the unit of truth.

OpenLLMetry vs OpenInference: OpenTelemetry for LLM Agents in 2026

Both libraries emit OpenTelemetry spans for your agent. They disagree on what to name the attributes — and that disagreement, not the instrumentation, is your real lock-in.

Langfuse vs LangSmith vs Arize Phoenix: Choosing LLM & Agent Observability in 2026

The real choice isn't which dashboard looks nicer — it's what unit of work you trace and who owns the trace data after the agent finishes.

Langfuse vs LangSmith vs Braintrust: LLM Observability and Evals Compared

Three platforms that look like competitors but optimize for different primary jobs, with lock-in profiles that diverge sharply once you read the fine print.

Braintrust vs Arize vs Opik: Choosing an LLM Eval Platform in 2026

The eval-tooling field just split into three camps and lost two players to acquisition in a single month. Pick on philosophy and independence, not the feature grid.

DeepEval vs Ragas vs Promptfoo: Choosing an LLM Eval Framework

Three popular eval frameworks that look interchangeable answer three different questions — pick the one that matches the question you actually have.

Prompt Management: Langfuse vs PromptLayer vs Agenta (and Why a Registry Isn't Enough)

A prompt registry lets you change prompts without a deploy. On its own, that just lets you change them faster — not better. The tools that compound tie every version to an eval.