The evaluation-and-observability library, read in order: why you eval (eval-driven development, online vs. offline), building the eval (datasets, CI/CD gates), the judge that does the measuring (LLM-as-a-judge and its biases, agent-as-a-judge), evaluating a specific capability (tool use, coding, deep research, voice), reliability metrics (pass@k, cost-aware eval), the standardized benchmarks (SWE-bench, τ-bench, Terminal-Bench, GAIA), and production observability with the eval and tracing platforms (Langfuse, LangSmith, Braintrust, Arize, Phoenix, OpenLLMetry).
Write the eval before the prompt. The test suite you build first is the only thing that lets you change models next month without praying — and in 2026, you will change models.
Agents got trivial to build and impossible to trust. The repos worth starring now aren't frameworks — they're the eval and tracing layer that tells you whether the thing actually works.
Offline evals ask whether the agent matched a known answer. Online evals can't — there is no answer. Treating them as one pipeline with one metric is the mistake that lets agents pass every test and still fail in production.
The scoring framework is the commodity. The hard, valuable, un-buyable work is looking at your own outputs and distilling real failures into labeled cases — your eval set is a precipitate of error analysis, not a download.
You wire your eval into GitHub Actions, gate the merge on it, and a week later it's red on a PR that changed nothing. The fix isn't a retry — it's admitting an eval is a measurement, not an assertion.
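One way to read "a measurement, not an assertion" is to gate on a confidence interval instead of a raw pass rate. The sketch below is an assumption about how such a gate might look, not the approach any particular article or CI tool prescribes: `wilson_interval` and `gate` are hypothetical helper names, and the 95% level (`z = 1.96`) is an arbitrary default.

```python
from math import sqrt


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate observed over `total` eval cases."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need 0 <= passes <= total and total > 0")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def gate(passes: int, total: int, baseline_rate: float) -> bool:
    """Hypothetical merge gate: block only when even the optimistic bound
    of the interval falls below the baseline pass rate."""
    _, high = wilson_interval(passes, total)
    return high >= baseline_rate


# 46/50 is below a 0.94 baseline as a point estimate, but not distinguishable from it.
print(gate(46, 50, 0.94))
# 30/50 is a real regression: the whole interval sits under the baseline.
print(gate(30, 50, 0.94))
```

With only 50 cases the interval is wide, so a two-case swing does not turn the build red, while a large drop still does.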
Using a model to grade your model feels like measurement. Until you learn what the judge is actually rewarding — verbosity, position, and its own prose — it's closer to a focus group of one.
An LLM judge flips up to a third of its verdicts when you swap the answer order, and scores its own writing 10–25% higher. Three biases corrupt your evals — and only one has a cheap fix.
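Order sensitivity can be measured directly by asking the judge twice with the answers swapped. This is a minimal sketch under stated assumptions: `judge` is a hypothetical callable that takes a question and two answers and returns `"first"` or `"second"`; it stands in for whatever model call is actually used.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask in both orders; accept a winner only when both orders agree."""
    a_wins_forward = judge(question, answer_a, answer_b) == "first"
    a_wins_backward = judge(question, answer_b, answer_a) == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "inconsistent"


def flip_rate(judge: Judge, cases: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of cases where the verdict changes when the order is swapped."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one case")
    flips = sum(swapped_verdict(judge, q, a, b) == "inconsistent" for q, a, b in cases)
    return flips / len(cases)


# Stub judges for illustration only.
def always_first(question: str, first: str, second: str) -> str:
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


cases = [("q1", "short", "a much longer answer"), ("q2", "detailed reply here", "ok")]
print(flip_rate(always_first, cases))
print(flip_rate(prefers_longer, cases))
```

The second stub never flips, yet it is still biased toward length. A low flip rate rules out order sensitivity only, not the other biases.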
An LLM judge scores the final answer. For a multi-step agent, that signal is sparse, late, and easy to fool — a broken trajectory can still land on a right answer, and you'd never know.
There is rarely one correct path through a task, so grading an agent against a golden trajectory fails. Grade invariants over the path, and the final state, instead.
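Grading invariants plus final state might look like the sketch below. Everything here is hypothetical: the trajectory is assumed to be a list of dicts with a `"tool"` key, and the tool names (`verify_identity`, `issue_refund`, `delete_account`) are invented for illustration.

```python
from typing import Any, Callable, Dict, List

Step = Dict[str, Any]
Invariant = Callable[[List[Step]], bool]


def never_calls(tool: str) -> Invariant:
    """Invariant: the given tool is never invoked anywhere on the path."""
    return lambda trajectory: all(step["tool"] != tool for step in trajectory)


def happens_before(first: str, later: str) -> Invariant:
    """Invariant: every call to `later` is preceded by some call to `first`."""
    def check(trajectory: List[Step]) -> bool:
        seen_first = False
        for step in trajectory:
            if step["tool"] == first:
                seen_first = True
            elif step["tool"] == later and not seen_first:
                return False
        return True
    return check


def grade(trajectory: List[Step], final_state: Dict[str, Any],
          invariants: Dict[str, Invariant], expected_state: Dict[str, Any]) -> Dict[str, Any]:
    """Pass when no invariant is violated and the final state matches on the expected keys."""
    violated = [name for name, inv in invariants.items() if not inv(trajectory)]
    state_ok = all(final_state.get(k) == v for k, v in expected_state.items())
    return {"passed": not violated and state_ok, "violated": violated, "state_ok": state_ok}


invariants = {
    "verify_before_refund": happens_before("verify_identity", "issue_refund"),
    "no_account_deletion": never_calls("delete_account"),
}
expected = {"refund_status": "issued"}

# Two different paths, both acceptable.
path_1 = [{"tool": "verify_identity"}, {"tool": "issue_refund"}]
path_2 = [{"tool": "lookup_order"}, {"tool": "verify_identity"}, {"tool": "issue_refund"}]
# Right final state, wrong path: the refund was issued before verification.
path_3 = [{"tool": "issue_refund"}, {"tool": "verify_identity"}]

for path in (path_1, path_2, path_3):
    print(grade(path, {"refund_status": "issued"}, invariants, expected))
```

The third path reaches the expected final state but is still rejected, which is the case a final-answer-only grader would miss.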
Public leaderboards answer 'which model is smartest,' not 'will it fix my bugs' — the only test that predicts your outcome is a private eval built from your own repo.
A deep research agent hands you a long, confident, well-structured report. Grading it means measuring two different things at once — how good it reads, and whether a single sentence is actually supported.
Transcription accuracy is table stakes. The failure surface that actually loses calls is conversational timing — turn-taking, barge-in, and an end-to-end latency budget you have to measure component by component.
pass@k asks whether an agent can ever solve a task. pass^k asks whether it solves it every single time. For long-horizon agents those are different questions — and the gap is where production failures live.
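The gap between the two metrics is easy to see numerically. The sketch below uses the standard combinatorial estimators from `n` recorded trials with `c` successes: pass@k as the chance that at least one of `k` sampled trials succeeds, pass^k as the chance that all `k` do. The function names and the example numbers are illustrative assumptions.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k trials drawn from n (c successes) passes."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k trials drawn from n (c successes) pass."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# Illustrative numbers: an agent that solves a task in 7 of 10 recorded trials.
n, c = 10, 7
for k in (1, 3, 5):
    print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

At `k = 1` the two agree on the plain success rate. As `k` grows, pass@k climbs toward 1 while pass^k falls, so the same 70% agent looks nearly perfect on one metric and unreliable on the other.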
An agent leaderboard that ranks only on accuracy is secretly ranking on willingness to spend. Add the cost axis and the board's #1 is often not even on the frontier.
They look like a difficulty ladder. They're three orthogonal axes — and only one of them measures the thing that decides whether your agent survives contact with real users.
The same models that ace SWE-bench Verified collapse on its successor. The gap isn't difficulty — it's the size of an illusion, and the only durable fix turned out to be a software license.
SWE-bench hands an agent a broken test and a healthy repo. Terminal-Bench hands it a live machine and lets it break things. That's why a top SWE-bench score tells you almost nothing about the second number.
Most agent benchmarks hand the whole task to the model. τ-bench keeps the user in the loop, and τ²-bench gives the user their own hands — which is where frontier agents quietly fall apart.
A new benchmark drops the same models from ~73% to ~25% — not by making the bugs harder, but by taking away the one thing SWE-bench always handed over: a map to the change.
Static benchmarks freeze the world while an agent thinks. Meta's GAIA2 lets time run — and the smartest model, GPT-5, turns out to be the one that misses deadlines.
When every frontier model clusters within a tenth of a point on the same saturated tests, the leaderboard stops measuring quality and starts measuring marketing.
Your agent can be HTTP-200, fast, and cheap while being completely wrong. The metrics that keep a web app healthy are blind to the ways an agent actually fails.
Agent observability didn't invent a standard. It surrendered to a boring one from 2019 — and in doing so quietly retired the log as the unit of truth.
Both libraries emit OpenTelemetry spans for your agent. They disagree on what to name the attributes — and that disagreement, not the instrumentation, is your real lock-in.
The real choice isn't which dashboard looks nicer — it's what unit of work you trace and who owns the trace data after the agent finishes.
Three platforms that look like competitors but optimize for different primary jobs, with lock-in profiles that diverge sharply once you read the fine print.
The eval-tooling field just split into three camps and lost two players to acquisition in a single month. Pick on philosophy and independence, not the feature grid.
Three popular eval frameworks that look interchangeable answer three different questions — pick the one that matches the question you actually have.
A prompt registry lets you change prompts without a deploy. On its own, that just lets you change them faster — not better. The tools that compound tie every version to an eval.