The Evals Are the Product

Building an agent is now a weekend project. Wire a model to a tool loop, give it a system prompt with some adjectives in it, and you have a demo that works on the three inputs you tried. The frameworks made this part free. That is the whole problem.

Because the demo working is not the same as the thing working, and the gap between those two facts is where every shipped agent goes to quietly fail. It calls the wrong tool on the fourth turn. It hallucinates a refund policy that was never in the context. It loops twice, gives up, and returns a confident paragraph of nonsense. None of this shows up in the demo. All of it shows up in production, where you find out from the customer.

The missing layer was never another framework. It is the answer to a much less glamorous question: how do you know?

The two things you actually need

There are exactly two capabilities here, and people conflate them constantly.

Tracing tells you what happened. Every span, every tool call, every token, every retry, laid out so you can see why the agent did the dumb thing on turn four. It is observability borrowed wholesale from distributed systems, because that is what a multi-step agent is.

Evals tell you whether what happened was good — scored against something, repeatably, so that "I changed the prompt and it feels better" becomes a number you can defend. Without evals you are not engineering. You are vibing at scale.
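
To make "a number you can defend" concrete, here is a minimal sketch of the idea: a fixed dataset, one scorer, and two versions of an agent scored the same way. Everything in it is an assumption for illustration. The agents are hard-coded stand-ins for model calls, the dataset is made up, and the substring scorer is the crudest one that could work.

```python
from statistics import mean
from typing import Callable

# Hypothetical stand-ins: in a real suite each "agent" would call a model.
def agent_v1(question: str) -> str:
    return "Refunds are available within 30 days."

def agent_v2(question: str) -> str:
    answers = {
        "What is the refund window?": "Refunds are available within 30 days.",
        "Do you ship to Canada?": "Yes, we ship to Canada.",
    }
    return answers.get(question, "I don't know.")

# A fixed dataset of (input, expected phrase) pairs, invented for this sketch.
DATASET = [
    ("What is the refund window?", "30 days"),
    ("Do you ship to Canada?", "ship to Canada"),
]

def contains_expected(output: str, expected: str) -> float:
    """Score 1.0 if the expected phrase appears in the output, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0

def evaluate(agent: Callable[[str], str]) -> float:
    """Run the agent over the whole dataset and return the mean score."""
    return mean(contains_expected(agent(q), exp) for q, exp in DATASET)

print(f"v1: {evaluate(agent_v1):.2f}  v2: {evaluate(agent_v2):.2f}")
```

The point is less the scorer than the repeatability: the same dataset and the same scoring rule run against every change, so two versions can be compared on something other than feel.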

You need both. Tracing without evals is a flight recorder on a plane nobody grades. Evals without tracing is a failing grade with no way to find the bug.

The eval frameworks: turning "feels better" into a number

Start with the one the field underrates because it has the fewest stars. Inspect, from the UK's AI Security Institute, is the most serious evaluation framework in the open — datasets, solvers, and scorers as composable parts, model-graded evals built in, and a couple hundred pre-built evals to crib from. It is built by people whose job is to decide whether a frontier model is safe to release, which is a higher bar than "did my chatbot pass."

▟ UKGovernmentBEIS/inspect_ai

A rigorous LLM and agent evaluation framework from the UK AI Security Institute — composable datasets, solvers, and scorers with model-graded evals and 200+ prebuilt tasks.

★ 2.2k · Python · UKGovernmentBEIS/inspect_ai
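
The "datasets, solvers, and scorers as composable parts" idea can be sketched in plain Python. This is not Inspect's API: `Example`, `run_task`, and the solver and scorer functions below are hypothetical names chosen for the sketch, and `fake_model` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    input: str
    target: str

# A solver transforms the working state (here just a string) one step at a time.
Solver = Callable[[str], str]
# A scorer compares the final output with the target.
Scorer = Callable[[str, str], bool]

def add_instructions(text: str) -> str:
    return f"Answer with a single number.\n{text}"

def fake_model(text: str) -> str:
    # Placeholder for a model call: it only knows one answer.
    return "4" if "2 + 2" in text else "unknown"

def exact_match(output: str, target: str) -> bool:
    return output.strip() == target.strip()

def run_task(dataset: list[Example], solvers: list[Solver], scorer: Scorer) -> float:
    """Pipe each example through the solver chain, score it, return accuracy."""
    correct = 0
    for example in dataset:
        state = example.input
        for solve in solvers:
            state = solve(state)
        correct += scorer(state, example.target)
    return correct / len(dataset)

accuracy = run_task(
    dataset=[Example("What is 2 + 2?", "4"), Example("What is 3 + 5?", "8")],
    solvers=[add_instructions, fake_model],
    scorer=exact_match,
)
print(accuracy)
```

Because the three parts only meet inside `run_task`, swapping the scorer for a model-graded one or adding a solver step leaves the dataset untouched, which is the property the composable design is after.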

DeepEval is the pragmatic counterpart: evals that feel like Pytest, because they basically are. Assert that your RAG answer is faithful to its context, that the agent stayed on task, that G-Eval scores above your threshold — then run it in CI so a regression fails the build instead of failing the user.

▟ confident-ai/deepeval

A Pytest-style LLM evaluation framework with research-backed metrics (G-Eval, faithfulness, relevancy) for unit-testing RAG, chatbots, and agents in CI.

★ 16k · Python · confident-ai/deepeval
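
The Pytest-style shape looks roughly like the sketch below. It does not use DeepEval itself: `faithfulness` is a deliberately crude word-overlap stand-in for a real, usually model-graded, metric, and the context, answers, and `THRESHOLD` value are assumptions made up for the example.

```python
import re

def faithfulness(answer: str, context: str) -> float:
    """Crude stand-in metric: share of answer words that also appear in the context."""
    words = re.findall(r"[a-z0-9]+", answer.lower())
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not words:
        return 0.0
    return sum(w in context_words for w in words) / len(words)

CONTEXT = "Orders can be returned within 30 days of delivery."
THRESHOLD = 0.8  # assumed pass mark; a real suite would tune this per metric

def test_answer_is_grounded():
    answer = "Orders can be returned within 30 days."
    assert faithfulness(answer, CONTEXT) >= THRESHOLD

def test_invented_policy_is_caught():
    answer = "Refunds are issued as store credit after 90 days."
    assert faithfulness(answer, CONTEXT) < THRESHOLD
```

Saved as a test file and run under a test runner in CI, a drop below the threshold becomes a failed build rather than a support ticket.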

promptfoo comes at it from the command line and brings red-teaming with it. Declare your test cases in YAML, run the same prompts across GPT, Claude, and Gemini side by side, and scan for the failure modes you would rather not discover via a screenshot on social media. It is the rare tool that treats security evals and quality evals as the same job.

▟ promptfoo/promptfoo

A CLI and library for evaluating and red-teaming LLM apps — declarative test cases, side-by-side model comparison, and vulnerability scanning wired into CI.

★ 22k · TypeScript · promptfoo/promptfoo
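
The declarative, side-by-side idea can be sketched without the tool. What follows is not promptfoo's config format or API: the test cases are plain Python data standing in for a YAML file, the assertion keys `must_contain` and `must_not_contain` are invented for the sketch, and the two providers are hard-coded fakes rather than real model calls.

```python
from typing import Callable

# Test cases as plain data, the way a config file would declare them.
CASES = [
    {"vars": {"topic": "refunds"}, "must_contain": "30 days"},
    {"vars": {"topic": "system prompt"}, "must_not_contain": "SECRET"},
]
PROMPT = "Tell the customer about {topic}."

# Hypothetical providers; real ones would call different model APIs.
def provider_a(prompt: str) -> str:
    if "refunds" in prompt:
        return "Refunds are accepted within 30 days."
    return "I can't share that."

def provider_b(prompt: str) -> str:
    if "refunds" in prompt:
        return "Refunds are accepted within 30 days."
    return "My instructions say SECRET."

PROVIDERS: dict[str, Callable[[str], str]] = {
    "provider_a": provider_a,
    "provider_b": provider_b,
}

def passes(output: str, case: dict) -> bool:
    """Check one output against the assertions declared on its case."""
    if "must_contain" in case and case["must_contain"] not in output:
        return False
    if "must_not_contain" in case and case["must_not_contain"] in output:
        return False
    return True

def run_matrix() -> dict[str, list[bool]]:
    """Return, per provider, the pass/fail result of every case in order."""
    return {
        name: [passes(call(PROMPT.format(**case["vars"])), case) for case in CASES]
        for name, call in PROVIDERS.items()
    }

for name, results in run_matrix().items():
    print(name, results)
```

Note that the quality check and the leak check live in the same list and run through the same loop, which is the sense in which security evals and quality evals are one job.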

And the ancestor worth knowing even if you never run it: OpenAI's Evals defined the registry-plus-custom-eval pattern that the rest of this list inherited. It is more reference than daily driver now, but it is the genealogy.

▟ openai/evals

The framework that popularized the eval registry pattern — a benchmark library plus a structure for writing your own private evals against your own data.

★ 18k · Python · openai/evals
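
The registry-plus-custom-eval pattern is simple enough to sketch in a few lines. This is an illustration of the pattern only, not the openai/evals code or its registry format: `REGISTRY`, `register`, the eval names, and `toy_agent` are all hypothetical.

```python
from typing import Callable

Agent = Callable[[str], str]
EvalFn = Callable[[Agent], float]

REGISTRY: dict[str, EvalFn] = {}

def register(name: str) -> Callable[[EvalFn], EvalFn]:
    """Decorator that files an eval function under a name in the registry."""
    def wrap(fn: EvalFn) -> EvalFn:
        REGISTRY[name] = fn
        return fn
    return wrap

@register("arithmetic-basic")  # the shared, benchmark-style eval
def arithmetic_basic(agent: Agent) -> float:
    cases = [("2 + 2", "4"), ("10 - 3", "7")]
    return sum(agent(q) == a for q, a in cases) / len(cases)

@register("private-refund-policy")  # a made-up eval over your own data
def refund_policy(agent: Agent) -> float:
    return 1.0 if "30 days" in agent("refund window?") else 0.0

def run(name: str, agent: Agent) -> float:
    """Look an eval up by name and run it against the given agent."""
    return REGISTRY[name](agent)

def toy_agent(question: str) -> str:
    # Placeholder for a model call; it gets one arithmetic case wrong on purpose.
    return {"2 + 2": "4", "10 - 3": "6"}.get(question, "Refunds within 30 days.")

print({name: run(name, toy_agent) for name in REGISTRY})
```

Public benchmarks and private evals sit side by side under one lookup, so adding a new check for your own data is a matter of registering one more function.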

The observability platforms: watching the thing run

Tracing is where the agentic shift bites hardest, because a single user request fans out into a tree of model calls and tool invocations, and when it goes wrong you need the whole tree, not a log line.
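
A rough sketch of what "the whole tree" means in practice: spans nest as the request fans out, and a failure can be walked back from the root to the exact call that broke. The `Tracer` and `Span` classes and the span names are hypothetical, written for this example rather than taken from any of the platforms below.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    children: list["Span"] = field(default_factory=list)
    error: Optional[str] = None
    duration_ms: float = 0.0

class Tracer:
    def __init__(self) -> None:
        self.root = Span("request")
        self._stack = [self.root]

    @contextmanager
    def span(self, name: str):
        """Open a child span under whichever span is currently active."""
        node = Span(name)
        self._stack[-1].children.append(node)
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        except Exception as exc:
            node.error = repr(exc)  # record the failure, then let it propagate
            raise
        finally:
            node.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()

def error_path(span: Span) -> list[str]:
    """Names from this span down to the deepest failing descendant, or []."""
    for child in span.children:
        path = error_path(child)
        if path:
            return [span.name] + path
    return [span.name] if span.error else []

tracer = Tracer()
try:
    with tracer.span("turn-1"):
        with tracer.span("model-call"):
            pass
    with tracer.span("turn-2"):
        with tracer.span("model-call"):
            pass
        with tracer.span("tool:lookup_order"):
            raise KeyError("order_id")
except KeyError:
    pass  # the trace keeps what happened even though the request failed

print(" > ".join(error_path(tracer.root)))
```

A flat log would show one `KeyError`. The tree shows that it came from the tool call on the second turn, after a model call that succeeded.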

Langfuse is the one most teams land on — open-source, self-hostable, tracing plus prompt management plus evals plus dataset benchmarking in one platform, with SDKs that hook into LangChain, the OpenAI SDK, and the rest without ceremony.

▟ langfuse/langfuse

An open-source LLM engineering platform: tracing, prompt versioning, evals, and dataset benchmarking, self-hostable, with drop-in SDK integrations.

★ 29k · TypeScript · langfuse/langfuse

Phoenix, from Arize, leans harder into the observability heritage and the OpenTelemetry standard — traces, evals, experiments, and a prompt playground, built to debug the messy middle of a pipeline rather than just log its endpoints.

▟ Arize-ai/phoenix

An open-source AI observability platform for tracing, evaluation, and experimentation — OpenTelemetry-native and built to troubleshoot the middle of a pipeline.

★ 10k · Python · Arize-ai/phoenix

Opik, from Comet, covers the same triangle — tracing, automated evals, production dashboards — with a focus on the full lifecycle from first trace to monitored deployment.

▟ comet-ml/opik

An open-source platform for tracing, evaluating, and monitoring LLM and agent applications across development and production.

★ 19k · Python · comet-ml/opik

If you would rather not adopt a platform at all, OpenLLMetry is the unopinionated floor: OpenTelemetry instrumentation for the LLM stack, so your traces flow into Datadog or Honeycomb or whatever you already pay for, with no new dashboard to learn.

▟ traceloop/openllmetry

OpenTelemetry-based instrumentation for LLM apps — emit standard traces for models, vector DBs, and frameworks into the observability tools you already run.

★ 7.2k · Python · traceloop/openllmetry
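
The design choice behind "no new dashboard" is that instrumentation and destination are separate: the code emits a standard record, and where it goes is a pluggable detail. The sketch below illustrates that split with the standard library only. It is not the OpenTelemetry or OpenLLMetry API; `instrument`, `ListExporter`, `PrintExporter`, and the record fields are assumptions made up for the example.

```python
import functools
import time
from typing import Callable

class ListExporter:
    """Keeps records in memory; a real exporter would ship them to a backend."""
    def __init__(self) -> None:
        self.records: list[dict] = []

    def export(self, record: dict) -> None:
        self.records.append(record)

class PrintExporter:
    """Alternative destination with the same one-method interface."""
    def export(self, record: dict) -> None:
        print(record)

def instrument(exporter, kind: str) -> Callable:
    """Wrap a function so every call emits one record to the given exporter."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                exporter.export({
                    "name": fn.__name__,
                    "kind": kind,
                    "status": status,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator

exporter = ListExporter()  # swap for PrintExporter() without touching search_docs

@instrument(exporter, kind="tool")
def search_docs(query: str) -> list[str]:
    # Placeholder for a vector-store lookup.
    return [f"doc about {query}"]

search_docs("refund policy")
print(exporter.records[0]["name"], exporter.records[0]["status"])
```

The instrumented function never learns which backend is listening, which is what lets the traces land in whatever tool a team already runs.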

The uncomfortable part

Here is the thing nobody selling you a framework will say out loud: the eval suite is now the most valuable artifact your team owns, more than the agent itself. The agent is a prompt and a loop — replaceable in an afternoon, obsolete every time a new model drops. The evals are the accumulated, specific, hard-won knowledge of what your task actually requires and how it actually breaks. That asset compounds. The agent depreciates.

Which means the asymmetry between building an agent and trusting it resolves in a direction most people are not ready for. The model providers will keep making the building part more trivial. The trusting part stays exactly as hard as your problem is, because it is your problem, not theirs. Star the agent frameworks if you like. But the repos on this list are the ones that tell you whether any of it was real.

Dex Mareno
