Every team that ships something on top of a language model arrives at the same uncomfortable moment: the demo worked, the stakeholders nodded, and now somebody asks how you know it still works after the last prompt change. You do not know. You have vibes and a screenshot. So you go looking for an eval framework, and the same three names come back every time — DeepEval, Ragas, and Promptfoo.

The lazy version of this comparison ranks them by stars and tells you to pick the winner. That is useless, because they are not competing. They answer three different questions, and most of the pain in LLM evaluation comes from grabbing the one that answers a question you did not ask.

Promptfoo: which prompt or model is better?

Config/CLI-driven eval matrix and red-teaming for prompts, models, and agents
★ 22k · TypeScript · promptfoo/promptfoo

Promptfoo is the comparison engine. You write a YAML config — a set of prompts, a set of providers (GPT, Claude, Gemini, whatever), and a set of test cases with assertions — and it runs the full matrix and hands you a side-by-side grid. It is the only one of the three written in TypeScript, and it is by some distance the most popular, with north of 22k stars at the time of writing. Its tagline is blunt: "Test your prompts, agents, and RAGs. Red teaming / pentesting / vulnerability scanning for AI."
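To make the matrix idea concrete, here is a minimal sketch in plain Python. It is not Promptfoo's config format or API; the two "providers" are hypothetical stub functions standing in for real models, and the prompt names, test cases, and the "contains" check are all invented for illustration. The point is only the shape: every prompt crossed with every provider crossed with every test case, then rolled up into a pass rate per cell.

```python
from itertools import product

# Hypothetical stand-ins for real model providers: each maps a prompt to an output.
def upper_model(prompt):
    return prompt.upper()

def echo_model(prompt):
    return prompt

PROMPTS = {
    "plain": "Translate to French: {text}",
    "polite": "Please translate to French: {text}",
}
PROVIDERS = {"upper": upper_model, "echo": echo_model}
# Each test case supplies template variables plus one substring assertion.
TESTS = [
    {"vars": {"text": "hello"}, "contains": "hello"},
    {"vars": {"text": "cat"}, "contains": "Please"},
]

def run_matrix():
    """Run every prompt x provider x test combination and record pass/fail."""
    rows = []
    for (p_name, template), (m_name, model), test in product(
        PROMPTS.items(), PROVIDERS.items(), TESTS
    ):
        output = model(template.format(**test["vars"]))
        rows.append({
            "prompt": p_name,
            "provider": m_name,
            "passed": test["contains"] in output,
        })
    return rows

def pass_rates(rows):
    """Collapse the rows into one pass rate per (prompt, provider) cell."""
    counts = {}
    for row in rows:
        key = (row["prompt"], row["provider"])
        passed, total = counts.get(key, (0, 0))
        counts[key] = (passed + int(row["passed"]), total + 1)
    return {key: passed / total for key, (passed, total) in counts.items()}

if __name__ == "__main__":
    for cell, rate in sorted(pass_rates(run_matrix()).items()):
        print(cell, rate)
```

The grid this produces is the thing you actually read: one cell per prompt and provider pair, so a bad prompt and a bad model show up as a bad row or a bad column rather than as one blended score.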

That red-teaming half is the real differentiator. Promptfoo will generate adversarial inputs and probe for jailbreaks and leaks, then emit a vulnerability report. Nothing else here ships that out of the box. Reach for Promptfoo when the question is comparative — should I use this prompt or that one, this model or the cheaper one — and when you want a CI gate that fails the build when a regression slips in. It is config-first, which engineers either love or resent depending on how they feel about YAML.
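The CI gate is simple enough to sketch by hand. The version below is an illustration in plain Python, not how Promptfoo implements it: the suite names, the baseline numbers, and the 0.02 tolerance are all assumptions. It compares current pass rates against a stored baseline and reports anything that dropped by more than the tolerance.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return the suites whose pass rate fell by more than `tolerance`."""
    regressions = []
    for name, old_rate in baseline.items():
        # A suite missing from the current run counts as a pass rate of zero.
        new_rate = current.get(name, 0.0)
        if old_rate - new_rate > tolerance:
            regressions.append(name)
    return sorted(regressions)

def gate(baseline, current, tolerance=0.02):
    """Print each regression and return a process exit code (0 = pass, 1 = fail)."""
    failed = find_regressions(baseline, current, tolerance)
    for name in failed:
        print(f"REGRESSION {name}: {baseline[name]:.2f} -> {current.get(name, 0.0):.2f}")
    return 1 if failed else 0
```

In a pipeline you would pass the return value of gate to sys.exit so a regression fails the build. The tolerance matters more than it looks: set it to zero and ordinary judge noise will fail builds that changed nothing.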

DeepEval: does this output pass the assertion?

Pytest-style unit testing for LLM outputs with a deep metric library

DeepEval, around 16k stars, sells itself as "the LLM evaluation framework," but the honest description is Pytest for LLMs. You write test_*.py files, assert that an output's faithfulness or relevancy clears a threshold, and run them with the deepeval test run command. If you already test Python code, the ergonomics are instantly familiar: fixtures, assertions, the red/green loop.

Its metric catalog is the broadest of the three: G-Eval (define a criterion in plain English and let a model grade against it), faithfulness, hallucination, contextual recall, answer relevancy, plus task-completion and tool-correctness metrics aimed at agents. Reach for DeepEval when the question is absolute — does this specific output meet a standard I can write down — and when you want evals to live next to your unit tests rather than in a separate YAML world.
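The pattern of asserting that a score clears a threshold looks roughly like the sketch below. It deliberately does not use DeepEval's API. The toy_relevancy function is a crude token-overlap stand-in for a metric that a real framework would compute with a judge model, and assert_metric is a hypothetical helper. Only the shape of the test carries over.

```python
import re

def tokens(text):
    """Lowercased word tokens, punctuation dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_relevancy(question, answer):
    """Fraction of question tokens that reappear in the answer.

    A toy stand-in for a judged metric, used here only so the test can run.
    """
    question_tokens = tokens(question)
    if not question_tokens:
        return 0.0
    return len(question_tokens & tokens(answer)) / len(question_tokens)

def assert_metric(score, threshold, name):
    """Fail like any other unit test when the score is under the threshold."""
    assert score >= threshold, f"{name} {score:.2f} is below threshold {threshold:.2f}"

def test_refund_answer_is_relevant():
    question = "What is the refund window?"
    answer = "The refund window is 30 days."
    assert_metric(toy_relevancy(question, answer), threshold=0.7, name="relevancy")
```

Because the test is an ordinary function whose name starts with test_, a Pytest-style runner can collect it alongside your existing unit tests, which is the whole appeal.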

Ragas: is my RAG faithful to its context?

Reference-free RAG metrics: faithfulness, context precision/recall, answer relevancy

Ragas (about 14k stars; the repo now lives under the vibrantlabsai org but the explodinggradients URL still resolves) is the specialist. It is built for one architecture, retrieval-augmented generation, and it decomposes that pipeline into measurable parts: context precision and context recall for the retriever, faithfulness and answer relevancy for the generator. Its defining trait, laid out in the 2024 EACL paper, is that the core metrics are reference-free: you do not need a hand-written golden answer, because the metrics judge the answer against the question and the retrieved context instead. The caveat is that not every metric in the library shares this trait; context recall, for one, is scored against a reference answer, so check what each metric needs before assuming you can skip labels entirely.

If you are debugging a RAG system, this separation is gold. A low faithfulness score points at the generator; poor context precision points at the retriever. Reach for Ragas when the question is diagnostic and RAG-shaped. Reach elsewhere when it is not — Ragas has little to say about a general chatbot or a tool-using agent.
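The diagnostic split can be sketched in a few lines. This is a simplification under stated assumptions, not the Ragas implementation: the real metrics use a language model to extract and verify claims, and Ragas's context precision is rank-aware rather than a flat ratio. Here the supports and is_relevant callables are hypothetical judges passed in by the caller, and the 0.7 threshold is arbitrary.

```python
def faithfulness(claims, contexts, supports):
    """Share of answer claims backed by at least one retrieved chunk."""
    if not claims:
        return 0.0
    backed = sum(
        any(supports(claim, chunk) for chunk in contexts) for claim in claims
    )
    return backed / len(claims)

def context_precision(question, contexts, is_relevant):
    """Share of retrieved chunks judged relevant to the question (flat ratio)."""
    if not contexts:
        return 0.0
    return sum(bool(is_relevant(question, chunk)) for chunk in contexts) / len(contexts)

def diagnose(faith_score, precision_score, threshold=0.7):
    """Point at the pipeline stage whose metric is under the threshold."""
    suspects = []
    if precision_score < threshold:
        suspects.append("retriever")
    if faith_score < threshold:
        suspects.append("generator")
    return suspects or ["none"]

# Toy judges based on substring checks; a real setup would call a judge model.
def substring_supports(claim, chunk):
    return claim.lower() in chunk.lower()

def mentions_refund(question, chunk):
    return "refund" in chunk.lower()
```

Fed three retrieved chunks of which one mentions refunds, and an answer with one supported claim out of two, this flags both stages. Fix the retriever first, since the generator cannot be faithful to context it never received.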

The trap underneath all three

Here is the non-obvious part, and it is the one nobody puts on the landing page. G-Eval, faithfulness, hallucination, answer relevancy, context precision — almost every metric that makes these tools feel rigorous is itself a language model grading a language model. The judge is the same kind of system whose output you distrusted enough to start evaluating in the first place.

An LLM-as-judge eval does not remove the model's bias and non-determinism from your pipeline. It launders them into a number with two decimal places.

Run the same eval twice and the scores drift. Swap the judge model and your "objective" thresholds shift under you. A verbose, confident, wrong answer can score higher on relevancy than a terse correct one, because the judge has the same preferences as the thing it is judging. None of this makes the tools worthless — a noisy signal still beats a screenshot and a prayer. But it means a green eval suite is evidence, not proof, and you should pin your judge model, log its version, and keep a small set of human-checked cases as ground truth against the grader itself.
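Checking the grader against humans takes very little machinery. The sketch below assumes you already have boolean pass/fail verdicts from both a human reviewer and the judge for the same cases, plus repeated judge scores for a single case. The JUDGE_CONFIG values are placeholders, not real model names.

```python
from statistics import mean, pstdev

# Placeholder values: record whatever judge you actually pinned.
JUDGE_CONFIG = {"model": "example-judge-model", "version": "example-version", "temperature": 0}

def judge_agreement(human_labels, judge_labels):
    """Fraction of human-checked cases where the judge reached the same verdict."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labelled case")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

def score_spread(repeated_scores):
    """Summarise how much one case's score moves across repeated judge runs."""
    return {
        "mean": mean(repeated_scores),
        "stdev": pstdev(repeated_scores),
        "range": max(repeated_scores) - min(repeated_scores),
    }

def eval_record(agreement, spread):
    """Bundle the results with the judge config so every run is traceable."""
    return {"judge": dict(JUDGE_CONFIG), "agreement": agreement, "spread": spread}
```

If agreement with your human-checked set drops after a judge upgrade, the thresholds downstream no longer mean what they meant last week. If the range on a single case is wider than the gap between your pass and fail thresholds, the metric is not resolving anything.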

So, which one

If you want one rule: Promptfoo to choose between options and gate CI, DeepEval to assert pass/fail in a Pytest workflow, Ragas to dissect a RAG pipeline. They compose — plenty of teams run Ragas metrics inside a Promptfoo or DeepEval harness — so the question is never really "which framework," but "which question am I asking today." Answer that honestly and the choice makes itself. Just don't mistake the resulting number for the truth. It's a model's opinion, wearing a lab coat.