You wrote an agent that talks to customers. Now you have to test it, and you hit the wall every conversational-agent team hits: you can't write the test cases. A unit test pins an input to an expected output, but a conversation has no fixed input — what the user says on turn three depends on what your agent said on turn two, which depends on the model's sampling that run. Script the user and you only ever test the one path you imagined. Real users don't follow your path.
So the field converged on a clever move: put a second LLM in the user's chair. Give it a persona and a goal — "you're cancelling order #1234, you're mildly annoyed, you won't volunteer your email unless asked" — and let the two models talk until the task resolves or falls apart. Then grade the transcript, usually by comparing the final system state to a goal state, or with an LLM judge on a rubric. This is now a first-class feature. LangChain's OpenEvals ships create_simulated_user; LangGraph has a tutorial that wires a simulated-user node against a chatbot node; DeepEval has a ConversationSimulator; the open-source LangWatch Scenario names the three roles outright — agent, user simulator, judge. Sierra's τ-bench, the benchmark everyone quotes, is this pattern at scale, and the voice-agent vendors — Coval, Hamming, Cekura, Vapi — sell it as "run ten thousand simulated calls overnight."
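To make the pattern concrete, here is a minimal sketch of the loop described above: a user simulator and an agent alternate turns, and the run is graded by comparing final system state to a goal state. Everything in it is hypothetical and for illustration only. `Persona`, `simulate`, the `[DONE]` stop marker, and the scripted stand-ins are not the API of OpenEvals, DeepEval, or any other tool named here, and in a real setup both speakers would be LLM calls rather than hand-written functions.

```python
from dataclasses import dataclass, field
from typing import Callable

Turn = tuple[str, str]                  # (role, text)
Speaker = Callable[[list[Turn]], str]   # transcript so far -> next message


@dataclass
class Persona:
    goal: str
    traits: list[str] = field(default_factory=list)


def simulate(agent: Speaker, user: Speaker, max_turns: int = 10,
             stop_token: str = "[DONE]") -> list[Turn]:
    """Alternate user and agent turns until the user signals completion."""
    transcript: list[Turn] = []
    for _ in range(max_turns):
        user_msg = user(transcript)
        transcript.append(("user", user_msg))
        if stop_token in user_msg:
            break
        transcript.append(("agent", agent(transcript)))
    return transcript


def make_scripted_user(persona: Persona) -> Speaker:
    """Stand-in for an LLM user: only gives the email when asked for it."""
    def user(transcript: list[Turn]) -> str:
        if not transcript:
            return f"I want to {persona.goal}."
        last_agent = transcript[-1][1].lower()
        if "email" in last_agent:
            return "It's pat@example.com."
        if "cancelled" in last_agent:
            return "Thanks. [DONE]"
        return "Can you just cancel it?"
    return user


def make_stub_agent(state: dict) -> Speaker:
    """Stand-in for the agent under test: cancels once it has an email."""
    def agent(transcript: list[Turn]) -> str:
        last_user = transcript[-1][1]
        if "@" in last_user:
            state["order_1234"] = "cancelled"
            return "Order #1234 is cancelled."
        return "Sure, what is the email on the account?"
    return agent


persona = Persona(goal="cancel order #1234",
                  traits=["mildly annoyed", "won't volunteer email"])
state = {"order_1234": "open"}
goal_state = {"order_1234": "cancelled"}

transcript = simulate(make_stub_agent(state), make_scripted_user(persona))
passed = state == goal_state  # grade on final state, not on wording

for role, text in transcript:
    print(f"{role}: {text}")
print("passed:", passed)
```

Grading on state rather than on the transcript's wording is what lets the conversation vary from run to run while the pass criterion stays fixed.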
It works. You get multi-turn coverage, cheaply, at a volume no human QA team could match. And then it quietly lies to you.
The simulator is the measurement instrument now
Here is the part the tooling pages skip. The moment you delegate the user's role to an LLM, the realism of that LLM becomes the ceiling on your eval's validity. You haven't removed the measurement problem; you've moved it into a component you probably never validated. And the bias isn't random noise — it has a direction, and the direction flatters you.
Default LLM simulators are too cooperative. They answer the question you asked on the first try, never misread the agent, never get impatient, never go off on a tangent, never supply half the information and make the agent dig for the rest. They are the easiest customer your agent will ever meet. A 2026 study, Mind the Sim2Real Gap, ran the τ-bench protocol against 451 real people across 165 tasks and benchmarked 31 different LLM simulators on how closely they matched real interactive behavior. The best simulator scored 76.0 on their user-sim index; real humans scored 92.9. That gap is not a rounding error — it is your inflated pass rate, and your production users are the ones who pay the difference. The kicker: throwing a bigger, more capable model at the simulator did not close the gap. Fidelity to a human is a different axis than raw capability, and scaling the wrong axis buys you nothing.
The flip side proves the point. When researchers built deliberately non-collaborative user simulators — users who request unavailable services, digress, get impatient, send incomplete utterances — state-of-the-art tool agents degraded sharply. Same agents, same tasks; the only thing that changed was a more honest user. Your agent's score is not a property of your agent. It's a property of the user you tested it against.
A simulated-user eval doesn't measure your agent. It measures your agent's performance against a customer you invented — and by default you invented a saint.
The user model is a free variable, and it's biased
It gets worse, because the simulator isn't just optimistic — it's a knob you didn't know you were turning. Lost in Simulation, which ran a real user study across the US, India, Kenya, and Nigeria, found agent success rates swing by up to 9 percentage points depending only on which LLM plays the user, with systematic miscalibration: simulators overstate performance on moderate tasks and understate it on hard ones. So the same agent "passes" or "fails" your bar based on a configuration choice nobody on the team treated as load-bearing.
And the choice carries hidden bias. The same work found that pairing an agent with a same-family user model produced higher mean scores and markedly lower variance: a self-preference effect where a model treats conversations with its own kin more kindly. Test your GPT agent against a GPT user and you've built a flattering mirror, the same trap LLM-as-judge evals fall into. The bias also lands unevenly: simulated users were a worse proxy for AAVE and Indian-English speakers, so the eval is least accurate exactly for the users it already serves worst.
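One way to make that knob visible is to report the pass rate per simulator instead of as a single number, and to flag any simulator that shares a model family with the agent. The sketch below assumes you already have per-task pass/fail outcomes for each simulator. The simulator names, family labels, and outcome lists are placeholders invented for illustration, not measurements from any study.

```python
AGENT_FAMILY = "family_a"  # hypothetical label for the agent's model family

# Placeholder outcomes (1 = task passed). Keys are (simulator, family).
runs = {
    ("sim_a", "family_a"): [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    ("sim_b", "family_b"): [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    ("sim_c", "family_c"): [1, 1, 0, 1, 0, 1, 1, 1, 0, 1],
}


def pass_rates(runs: dict) -> dict:
    """Pass rate per simulator, as a fraction between 0 and 1."""
    return {name: sum(o) / len(o) for (name, _family), o in runs.items()}


def spread_points(rates: dict) -> float:
    """Gap between the kindest and harshest simulator, in percentage points."""
    return (max(rates.values()) - min(rates.values())) * 100


def same_family_simulators(runs: dict, agent_family: str) -> list:
    """Simulators whose scores may be inflated by self-preference."""
    return [name for (name, family) in runs if family == agent_family]


rates = pass_rates(runs)
spread = spread_points(rates)
suspect = same_family_simulators(runs, AGENT_FAMILY)

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
print(f"spread across simulators: {spread:.0f} points")
print("same family as agent:", suspect)
```

If the spread is comparable to the margin by which the agent clears your bar, the pass or fail verdict is being decided by the simulator choice rather than by the agent.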
Calibrate the fake user, or don't trust its number
None of this means simulation is a dead end — it's the only way to get multi-turn coverage at scale, and it's worth doing. It means the simulator is an instrument, and instruments get calibrated. This isn't new wisdom; agenda-based user simulators were calibrated against real dialogues back in 2007, long before LLMs made them fluent. LLMs gave us simulators that sound human without being faithful to one — fluency masquerading as validity. The fix is to measure the masquerade: benchmarks like MirrorBench and clem:todd score a simulator on human-likeness directly — lexical diversity plus a Turing-style test of whether its turns are distinguishable from a real person's — decoupled from whether the agent succeeds.
So the actual checklist is shorter than the tooling makes it look. Write adversarial personas, not just happy-path ones — the impatient, the underspecified, the ones who change their mind. Run every eval across at least two simulator models, and use a different model family for the user than for the agent. Before you trust a sim pass-rate, collect a few dozen real human transcripts for the same scenarios and check that your simulator isn't obviously easier than they are. And read the sim number as a ceiling: it tells you the best your agent could plausibly do, against a user kinder than the ones who are coming. The reliability your queue actually lives on — whether the agent gets it right every time, not once — only shows up when the fake user stops being so nice.
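The human-transcript check can start out very crude. The sketch below compares simulated and human episodes for the same scenarios on two rough difficulty proxies: the agent's pass rate, and how many user turns the conversation took. Both proxies, the `Episode` record, and the thresholds are assumptions made for illustration, and the episode data is placeholder. Human-likeness benchmarks of the kind mentioned above measure fidelity far more carefully.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    scenario: str
    source: str       # "human" or "sim"
    user_turns: int   # number of user messages in the conversation
    passed: bool      # did the agent reach the goal state?


def summarize(episodes: list, source: str) -> dict:
    """Pass rate and mean user-turn count for one source of episodes."""
    rows = [e for e in episodes if e.source == source]
    return {
        "pass_rate": sum(e.passed for e in rows) / len(rows),
        "mean_user_turns": sum(e.user_turns for e in rows) / len(rows),
    }


def looks_too_easy(episodes: list, max_pass_gap: float = 0.10,
                   min_turn_ratio: float = 0.8) -> bool:
    """Flag a simulator whose runs pass more often, or finish in far fewer
    user turns, than human runs do. Thresholds are arbitrary starting points."""
    human = summarize(episodes, "human")
    sim = summarize(episodes, "sim")
    pass_gap = sim["pass_rate"] - human["pass_rate"]
    turn_ratio = sim["mean_user_turns"] / human["mean_user_turns"]
    return pass_gap > max_pass_gap or turn_ratio < min_turn_ratio


# Placeholder episodes, not real data.
episodes = [
    Episode("cancel_order", "human", 6, True),
    Episode("cancel_order", "human", 8, False),
    Episode("change_address", "human", 5, True),
    Episode("change_address", "human", 7, False),
    Episode("cancel_order", "sim", 3, True),
    Episode("cancel_order", "sim", 4, True),
    Episode("change_address", "sim", 3, True),
    Episode("change_address", "sim", 4, False),
]

too_easy = looks_too_easy(episodes)
print("simulator looks easier than real users:", too_easy)
```

A flag here does not say what is wrong with the simulator, only that its number should be read as a ceiling until the personas are made harder.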



