Most teams build their evaluation story as a ladder. First you assemble a test set with known-good answers and run it in CI. Then, once you're "mature," you graduate to watching production. The implied promise is that online evaluation is just offline evaluation pointed at real traffic — same metrics, bigger dataset.
It isn't, and the gap is not a matter of scale. It is that the two regimes disagree about whether a right answer exists.
Offline knows the answer. Online never does.
An offline eval runs against a dataset you built. You chose the inputs, and for most of them you wrote down (or can compute) the correct output. That is what lets you score with reference-based metrics: exact match against the gold answer, or structured checks against an expected value. The eval is fundamentally a comparison against ground truth, and it answers one question: did the agent match the known answer?
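As a rough illustration of reference-based scoring, here is a minimal Python sketch. The function names, the normalization rule (trim and lowercase), and the dataset shape with `input` and `expected` keys are all assumptions made for this example, not any particular tool's API.

```python
import json
from typing import Callable

def exact_match(output: str, expected: str) -> float:
    """1.0 if the output equals the gold answer, ignoring case and outer whitespace."""
    return float(output.strip().lower() == expected.strip().lower())

def structured_match(output: str, expected: dict) -> float:
    """1.0 if the output parses as JSON and equals the expected object, else 0.0."""
    try:
        return float(json.loads(output) == expected)
    except json.JSONDecodeError:
        return 0.0

def run_offline(agent: Callable[[str], str], dataset: list[dict]) -> float:
    """Mean exact-match score over a dataset of {"input", "expected"} cases.

    `agent` is a hypothetical callable standing in for the system under test.
    """
    scores = [exact_match(agent(case["input"]), case["expected"]) for case in dataset]
    return sum(scores) / len(scores)
```

Every scorer here takes an `expected` argument. That argument is exactly what a production trace cannot supply.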
An online eval runs against production traces. The user supplied the input, the agent produced an output, and nobody knows what the right output was — there is no reference, and there never will be. As the evaluation literature now puts it plainly, online evals operate on "messy, reference-free production traces." Every metric that depends on a gold answer is dead on arrival. You cannot compute exact match against a value you don't have.
So online scoring has to be reference-free, and that is a different toolbox:
- Rubric-based judges. An LLM-as-a-judge scoring the trace against a standard of acceptable behavior — grounded in the retrieved context, on-policy, no hallucinated tool calls — rather than against a specific correct string.
- Guardrail and policy checks. Deterministic signals: did the agent leak PII, call a tool it shouldn't, or violate a format contract? These need no ground truth because the rule is the truth.
- Implicit user signals. Retries, manual edits, thumbs-down, conversation abandonment. The user never labels the trace, but their behavior scores it for you.
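The two deterministic kinds of scorer in this list can be sketched in a few lines of Python. Everything named here is hypothetical: the `Trace` fields, the tool allowlist, the email-only PII pattern, and the penalty weights are placeholders you would replace with your own policy. The LLM judge is left out because it depends on whichever model client you use.

```python
import re
from dataclasses import dataclass, field

# Assumption: a deliberately crude PII check that only looks for email addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
# Assumption: a hypothetical allowlist of tools the agent may call.
ALLOWED_TOOLS = {"search_docs", "get_order"}

@dataclass
class Trace:
    output: str
    tool_calls: list[str] = field(default_factory=list)
    retries: int = 0
    thumbs_down: bool = False
    abandoned: bool = False

def guardrail_violations(trace: Trace) -> list[str]:
    """Deterministic policy checks; no gold answer is consulted."""
    violations = []
    if EMAIL_RE.search(trace.output):
        violations.append("pii_email")
    for tool in trace.tool_calls:
        if tool not in ALLOWED_TOOLS:
            violations.append(f"disallowed_tool:{tool}")
    return violations

def implicit_signal_score(trace: Trace) -> float:
    """Start at 1.0 and subtract an assumed penalty for each negative user signal."""
    score = 1.0
    score -= 0.2 * min(trace.retries, 2)  # count at most two retries
    if trace.thumbs_down:
        score -= 0.4
    if trace.abandoned:
        score -= 0.2
    return max(score, 0.0)
```

Neither function takes an `expected` value. Each one scores the trace against a rule or against the user's own behavior.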
This is why the better tooling — Braintrust, Galileo, the LangSmith/Langfuse/Phoenix tier we mapped in our observability comparison — keeps the scoring framework shared but the scorers different. Same harness, different graders, because the questions are different.
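The "same harness, different graders" shape might look something like the sketch below. It does not reproduce the API of any of the tools named above. `run_eval`, the record keys, and the 200-character budget are assumptions chosen to keep the example small.

```python
from typing import Callable

Record = dict
Scorer = Callable[[Record], float]

def run_eval(records: list[Record], scorers: dict[str, Scorer]) -> list[dict[str, float]]:
    """Shared harness: apply every scorer to every record and collect named scores."""
    return [{name: scorer(record) for name, scorer in scorers.items()} for record in records]

def exact_match(record: Record) -> float:
    """Reference-based: needs record["expected"], so it only works offline."""
    return float(record["output"] == record["expected"])

def within_length_budget(record: Record) -> float:
    """Reference-free: checks a hypothetical 200-character format contract."""
    return float(len(record["output"]) <= 200)

# Offline records carry a gold answer, so both kinds of scorer apply.
OFFLINE_SCORERS = {"exact_match": exact_match, "within_length_budget": within_length_budget}
# Production records have no "expected" key, so only reference-free scorers apply.
ONLINE_SCORERS = {"within_length_budget": within_length_budget}
```

Calling `run_eval` with `OFFLINE_SCORERS` on a production record would raise a `KeyError` on the missing `expected` key, which is the point of the section in miniature.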
Offline evaluation measures correctness against a known answer. Online evaluation measures behavior against a standard — because in production there is no answer, only conduct.
The arrow points backward
Here is the part the maturity-ladder framing hides. The valuable flow between the two isn't offline → online. It's online → offline.
Offline evals have a fixed, fatal limitation: they can only test for failures you already imagined. The dataset is a museum of yesterday's bugs. Production, meanwhile, is an endless generator of inputs you never thought to write down — the distribution drift, the adversarial phrasing, the tool that times out only on Tuesdays. Microsoft's production guidance and Anthropic's evals advice converge on the same point: static tests cannot surface the novel, real-world failures that post-launch monitoring catches.
So the move that actually compounds is the harvest. When online monitoring flags a low-scoring trace — a judge fail, a guardrail trip, a user who abandoned — you label it and fold it into your offline set. The next CI run tests for it forever. Online eval stops being a dashboard and becomes a sourcing pipeline for the only test cases that matter: the ones that already bit you.
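A minimal sketch of that harvest step, in Python, could look like this. The trace and case shapes, the 0.5 threshold, and the `labels` mapping (trace id to a human-supplied correct answer) are assumptions for illustration. Persisting the merged set to disk is left out.

```python
def harvest(traces: list[dict], labels: dict[str, str], threshold: float = 0.5) -> list[dict]:
    """Turn flagged, human-labeled production traces into offline test cases.

    A trace is flagged when its online score falls below `threshold`.
    Flagged traces without a label are skipped until someone triages them.
    """
    cases = []
    for trace in traces:
        label = labels.get(trace["id"])
        if trace["score"] < threshold and label is not None:
            cases.append({
                "input": trace["input"],
                "expected": label,
                "source_trace": trace["id"],  # provenance, also used for dedupe
            })
    return cases

def promote(offline_set: list[dict], new_cases: list[dict]) -> list[dict]:
    """Merge harvested cases into the offline set, skipping traces already promoted."""
    seen = {case.get("source_trace") for case in offline_set}
    merged = list(offline_set)
    for case in new_cases:
        if case["source_trace"] not in seen:
            merged.append(case)
            seen.add(case["source_trace"])
    return merged
```

The label is the step that cannot be automated away. A low score only says the trace is worth a look. The human-supplied `expected` value is what turns it into a case that reference-based metrics can check on every later CI run.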
This is the same realization we reached from the training side — that an eval and an RL environment are the same artifact. A scored production trace is not just a number on a chart; it's a labeled example. Offline is where examples accumulate. Online is where they're born.
What to actually do
Build offline first — you can't debug what you can't reproduce, and a trusted fixed set is the spine of every release. But don't mistake it for coverage. Stand up online evals with reference-free scorers from day one of production, and treat their lowest-scoring traces as your highest-value backlog: triage, label, promote into the offline set.
LangChain's 2026 survey found 57% of organizations already running agents in production and named quality the top barrier to deploying more. The teams on the right side of that number aren't the ones with a bigger test set. They're the ones who wired production failure back into it.



