Every team building on an LLM eventually buys an eval framework — DeepEval, Ragas, Promptfoo, a hosted dashboard — wires it up, watches a number appear, and feels productive. Then the number doesn't move when the product gets better, or moves when it gets worse, and they conclude "evals don't work." Evals work. What didn't work was buying the commodity and skipping the craft. The grader is the easy part. The dataset is the eval.
The dataset is a precipitate of error analysis
There is a persistent fantasy that an eval set is something you download — a benchmark, a public leaderboard, a synthetic dump from a generator. But a generic benchmark tells you almost nothing about your task. OpenAI's own guide to building an eval refuses to name a minimum example count and insists instead on quality over quantity and thematic consistency around your specific use case. The value isn't in the rows; it's in the judgment baked into them.
That judgment comes from one unglamorous activity: looking at your outputs. Hamel Husain's much-cited argument is that you read your model's outputs, do open coding — categorize the kinds of errors you see — and when you hit a failure, you write a test that captures it. Your eval set is the precipitate of that error analysis, not a thing that exists before it. This is why Anthropic advises sourcing realistic tasks from the failures you actually observe, and warns that evals only get harder to build the longer you wait: early on, your product requirements translate naturally into test cases; later, you're reverse-engineering success criteria from a live system you no longer fully understand.
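As a rough illustration of what open coding produces, here is a minimal sketch that tallies failure labels from a review session. The trace ids and label names are made up for the example; in practice the labels emerge from reading your own outputs, and the most frequent ones are the first candidates for test cases.

```python
from collections import Counter

# Hypothetical review notes: (trace_id, failure label, or None if the output was fine).
reviewed = [
    ("t1", "missed_handoff"),
    ("t2", None),
    ("t3", "hallucinated_policy"),
    ("t4", "missed_handoff"),
    ("t5", "wrong_tone"),
    ("t6", "missed_handoff"),
]

def tally_failures(notes):
    """Count how often each open-coded failure label appears, most common first."""
    return Counter(label for _, label in notes if label is not None).most_common()

failure_counts = tally_failures(reviewed)
for label, count in failure_counts:
    print(f"{label}: {count}")
```

Each labeled trace can then be turned into an eval case by pairing its input with the expected behavior.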
Teams over-invest in graders and dashboards and under-invest in the only artifact that encodes their quality bar: a small set of labeled, real failures.
Start with 20-50 real failures
The counterintuitive part is the size. You do not need ten thousand cases. Anthropic puts the starting point at 20-50 tasks drawn from real failures, and the reasoning is statistical, not lazy: early in development each change has a large effect size, so a small, representative sample gives clear signal. Hamel Husain and Shreya Shankar frame the same number as a recurring discipline: spend thirty minutes reviewing 20-50 outputs every time you make a significant change. Representativeness is the constraint, not volume. Fifty real failures you understand beat five hundred synthetic prompts you don't.
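To make the statistical point concrete, here is a back-of-the-envelope sketch of how much noise a pass rate carries at different set sizes. It uses the normal approximation to a binomial proportion, which is crude at small n, so treat the numbers as orders of magnitude rather than exact bounds. The function name is illustrative.

```python
import math

def pass_rate_margin(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate (normal approximation)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# Worst case for the margin is a pass rate of 0.5.
margin_50 = pass_rate_margin(0.5, 50)    # roughly +/- 14 percentage points
margin_500 = pass_rate_margin(0.5, 500)  # roughly +/- 4 percentage points

print(f"n=50:  +/- {margin_50:.3f}")
print(f"n=500: +/- {margin_500:.3f}")
```

Under these assumptions, a change that moves the pass rate by 30 or 40 points is visible on 50 cases, while a 3-point change is not. That is why a small set is enough early on and stops being enough once the improvements get smaller.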
When you have no traffic yet, synthetic generation is a legitimate bootstrap — especially for RAG, where Ragas builds test sets from your documents via a knowledge-graph-and-evolution method and claims to cut authoring time by around 90%. But treat synthetic cases as scaffolding. The moment real users hit your system, mine the production traces where it underperformed and have an expert add the expected output. That's your golden set.
Score binary, and own the bar
How you grade matters as much as what you grade. Anthropic's cookbook names three grading methods: code-based (string, regex, exact match — "by far the best" where it applies), human (most capable, incredibly slow), and model-based (a capable LLM judging the output). Reach for code-based first; it's free, fast, and deterministic. And whatever the method, prefer binary pass/fail tied to a specific question — "did the agent hand off to a human? Yes/No" — over a fuzzy 1-5 quality score that hides disagreement and resists action.
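A code-based binary grader can be very small. The sketch below shows two: a regex check for the handoff question and a normalized exact match. The pattern and function names are assumptions made for the example; a real handoff check would more likely inspect a tool call or a structured field than match phrasing.

```python
import re

# Hypothetical pattern: phrases that signal a handoff to a human in this example product.
HANDOFF_PATTERN = re.compile(
    r"\b(?:transferring|connecting) you (?:to|with) (?:a|an|our) "
    r"(?:human|live|support) (?:agent|representative)\b",
    re.IGNORECASE,
)

def handed_off_to_human(output: str) -> bool:
    """Binary grade: did the reply announce a handoff to a human?"""
    return HANDOFF_PATTERN.search(output) is not None

def exact_match(output: str, expected: str) -> bool:
    """Binary grade: same answer after trimming whitespace and ignoring case."""
    return output.strip().lower() == expected.strip().lower()

print(handed_off_to_human("I'm transferring you to a human agent now."))
print(handed_off_to_human("Have you tried restarting the router?"))
print(exact_match(" Paris ", "paris"))
```

Each function answers one yes/no question, so a failing case points directly at what went wrong.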
On labeling, resist the committee. Husain and Shankar's "benevolent dictator" model gives one domain expert (a lawyer for a legal tool, a clinician for a health bot) definitive authority over the quality bar. This eliminates the annotation deadlock that kills eval projects. Bring in multiple annotators only at scale, and when you do, measure their agreement with Cohen's kappa rather than assuming it.
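For reference, Cohen's kappa compares the agreement two annotators actually reach with the agreement they would reach by chance given how often each uses every label. A minimal pure-Python sketch, with invented labels for the example:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on the same ten outputs.
annotator_a = ["pass"] * 6 + ["fail"] * 4
annotator_b = ["pass"] * 5 + ["fail"] * 4 + ["pass"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 items, but because both say "pass" 60% of the time, chance alone would produce 52% agreement, and kappa comes out well below the raw 80%.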
Validate the judge before you trust it
The most common silent failure is an LLM-as-a-judge you never checked against a human. An unvalidated judge is an unmeasured instrument — you've replaced "I don't know if my product is good" with "I don't know if my grader is good," which feels like progress and isn't. OpenAI bakes the fix into its framework: a meta-eval scores the grader against human-provided choice labels, and a good model-graded eval should approach a metascore near 1.0. The practitioner version, from EvidentlyAI and others: label a sample yourself, then measure judge-human agreement with precision and recall — not raw accuracy, which class imbalance flatters — and treat roughly 75-90% agreement as the bar to scale. Notice that even validating the metric reduces back to a labeled dataset. It's datasets all the way down.
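A small sketch of that check, assuming binary pass/fail labels and treating "fail" as the positive class, since catching failures is the judge's job. The labels are invented to show how accuracy can look healthy while recall is poor.

```python
def judge_precision_recall(human, judge, positive="fail"):
    """Precision and recall of the judge against human labels for one class."""
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical sample: the human found 2 failures in 10 outputs; the judge caught 1.
human_labels = ["pass"] * 8 + ["fail"] * 2
judge_labels = ["pass"] * 9 + ["fail"]

precision, recall = judge_precision_recall(human_labels, judge_labels)
accuracy = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

The judge agrees with the human on 9 of 10 items, yet it misses half of the real failures, which is exactly what raw accuracy hides when most outputs pass.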
Finally, the set is alive. Keep a held-out golden set for regression so a "fix" can't quietly degrade what already worked, version it so you can pin an eval to a snapshot, add new failure modes as production surfaces them, and re-check judge alignment on a schedule because both your traffic and the underlying models drift. The teams that win at evals aren't the ones with the fanciest grader. They're the ones who kept looking at their data.
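One lightweight way to pin an eval to a snapshot and catch regressions is sketched below, assuming each case is a small dict with an "id" field and each run is a mapping from case id to pass/fail. The field names and helper functions are illustrative, not taken from any particular framework.

```python
import hashlib
import json

def dataset_version(cases):
    """Short content hash of the golden set, independent of case order."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def regressions(baseline, candidate):
    """Case ids that passed in the baseline run but do not pass in the candidate run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )

# Hypothetical golden set and two runs against it.
golden = [
    {"id": "c1", "input": "cancel my plan", "expected": "handoff"},
    {"id": "c2", "input": "reset password", "expected": "self_serve"},
]
version = dataset_version(golden)

baseline_run = {"c1": True, "c2": True, "c3": False}
candidate_run = {"c1": True, "c2": False, "c3": True}
broken = regressions(baseline_run, candidate_run)

print(version, broken)
```

Storing the version hash next to each result makes it clear which snapshot a score refers to, and a non-empty regression list flags a "fix" that broke something that used to work.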



