Building an agent is now a weekend. Wire a model to a tool loop, give it a system prompt with some adjectives in it, and you have a demo that works on the three inputs you tried. The frameworks made this part free. That is the whole problem.
Because the demo working is not the same as the thing working, and the gap between those two facts is where every shipped agent goes to quietly fail. It calls the wrong tool on the fourth turn. It hallucinates a refund policy that was never in the context. It loops twice, gives up, and returns a confident paragraph of nonsense. None of this shows up in the demo. All of it shows up in production, where you find out from the customer.
The missing layer was never another framework. It is the answer to a much less glamorous question: how do you know?
The two things you actually need
There are exactly two capabilities here, and people conflate them constantly.
Tracing tells you what happened. Every span, every tool call, every token, every retry, laid out so you can see why the agent did the dumb thing on turn four. It is observability borrowed wholesale from distributed systems, because that is what a multi-step agent is.
Evals tell you whether what happened was good — scored against something, repeatably, so that "I changed the prompt and it feels better" becomes a number you can defend. Without evals you are not engineering. You are vibing at scale.
You need both. Tracing without evals is a flight recorder on a plane nobody grades. Evals without tracing is a failing grade with no way to find the bug.
Tracing without evals is a flight recorder on a plane nobody grades. Evals without tracing is a failing grade with no way to find the bug.
The eval frameworks: turning "feels better" into a number
Start with the one the field underrates because it has the fewest stars. Inspect, from the UK's AI Security Institute, is the most serious evaluation framework in the open — datasets, solvers, and scorers as composable parts, model-graded evals built in, and a couple hundred pre-built evals to crib from. It is built by people whose job is to decide whether a frontier model is safe to release, which is a higher bar than "did my chatbot pass."
DeepEval is the pragmatic counterpart: evals that feel like Pytest, because they basically are. Assert that your RAG answer is faithful to its context, that the agent stayed on task, that G-Eval scores above your threshold — then run it in CI so a regression fails the build instead of failing the user.
promptfoo comes at it from the command line and brings red-teaming with it. Declare your test cases in YAML, run the same prompts across GPT, Claude, and Gemini side by side, and scan for the failure modes you would rather not discover via a screenshot on social media. It is the rare tool that treats security evals and quality evals as the same job.
And the ancestor worth knowing even if you never run it: OpenAI's Evals defined the registry-plus-custom-eval pattern that the rest of this list inherited. It is more reference than daily driver now, but it is the genealogy.
The observability platforms: watching the thing run
Tracing is where the agentic shift bites hardest, because a single user request fans out into a tree of model calls and tool invocations, and when it goes wrong you need the whole tree, not a log line.
Langfuse is the one most teams land on — open-source, self-hostable, tracing plus prompt management plus evals plus dataset benchmarking in one platform, with SDKs that hook into LangChain, the OpenAI SDK, and the rest without ceremony.
Phoenix, from Arize, leans harder into the observability heritage and the OpenTelemetry standard — traces, evals, experiments, and a prompt playground, built to debug the messy middle of a pipeline rather than just log its endpoints.
Opik, from Comet, covers the same triangle — tracing, automated evals, production dashboards — with a focus on the full lifecycle from first trace to monitored deployment.
If you would rather not adopt a platform at all, OpenLLMetry is the unopinionated floor: OpenTelemetry instrumentation for the LLM stack, so your traces flow into Datadog or Honeycomb or whatever you already pay for, with no new dashboard to learn.
The uncomfortable part
Here is the thing nobody selling you a framework will say out loud: the eval suite is now the most valuable artifact your team owns, more than the agent itself. The agent is a prompt and a loop — replaceable in an afternoon, obsolete every time a new model drops. The evals are the accumulated, specific, hard-won knowledge of what your task actually requires and how it actually breaks. That asset compounds. The agent depreciates.
Which means the build-versus-trust asymmetry resolves in a direction most people are not ready for. The model providers will keep making the building part more trivial. The trusting part stays exactly as hard as your problem is, because it is your problem — not theirs. Star the frameworks if you like. But the repos on this list are the ones that tell you whether any of it was real.



