Type "best LLM for function calling" into a search box and you're asking the wrong question — not because the answer changes monthly (it does), but because the metric most people reach for doesn't predict the thing they actually care about. The number at the top of a function-calling leaderboard tells you a model can emit a correctly-shaped tool call once. Production doesn't ask for one call. It asks for the right call, every time, across a conversation, with the discipline to call nothing when nothing is warranted. Those are different abilities, and they come from different numbers.
What BFCL actually scores
The Berkeley Function-Calling Leaderboard (BFCL) is the canonical answer to "can this model do function calling at all," and it's a good one. It has grown in versions. v1 scored single-turn calls — simple, parallel, multiple, and parallel-multiple — using AST evaluation, which parses the model's output into an abstract syntax tree and checks it against a reference call without running anything. v2 added live, community- and enterprise-contributed data to resist contamination, and leaned harder on relevance/irrelevance detection. v3 introduced multi-turn and multi-step tasks, scored by comparing backend state after the calls execute. v4 pushes into agentic territory — web search, memory management, format sensitivity — and weights those new agentic categories heavily in its overall score.
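To make the AST idea concrete, here is a minimal sketch of a structural check in Python. It is not BFCL's actual checker: the function names `parse_call` and `ast_match`, the keyword-arguments-only restriction, and the "list of acceptable values per parameter" format are all simplifying assumptions for illustration.

```python
import ast


def parse_call(source: str) -> tuple[str, dict]:
    """Parse a Python-style call string into (function name, keyword args).

    Nothing is executed: the string is only parsed into a syntax tree.
    """
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("output is not a function call")
    if node.args:
        # Assumption of this sketch: calls use keyword arguments only.
        raise ValueError("positional arguments are not handled here")
    name = ast.unparse(node.func)  # e.g. "get_weather" or "math.sqrt"
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def ast_match(model_output: str, expected_name: str, allowed: dict) -> bool:
    """Return True if the call has the expected name, exactly the expected
    parameters, and each value is one of the acceptable values listed."""
    try:
        name, kwargs = parse_call(model_output)
    except (SyntaxError, ValueError, TypeError):
        return False  # malformed output counts as a miss
    if name != expected_name or set(kwargs) != set(allowed):
        return False
    return all(kwargs[param] in allowed[param] for param in allowed)


reference = {"city": ["Paris"], "unit": ["celsius", "C"]}
print(ast_match('get_weather(city="Paris", unit="celsius")', "get_weather", reference))
print(ast_match('get_weather(city="Paris")', "get_weather", reference))
```

The first call matches the reference; the second drops a required argument and fails. A real checker also has to deal with optional parameters, type coercion, and several calls in one response, which this sketch ignores.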
Two evaluation modes run underneath: AST accuracy (does the parsed call match the reference structure?) and executable accuracy (does invoking it in a sandbox return the ground-truth output?). Both are mostly answering one question: did the model produce a well-formed, correct call for this prompt?
That's necessary. It is not sufficient. And one BFCL category quietly proves the point.
The skill that doesn't look like a skill
BFCL scores irrelevance detection — the no_call case, where the correct move is to emit no function call. This is the failure mode nobody benchmarks for in their head and everybody hits in production: the agent that fires a tool on a query that needed a plain answer, or invents an argument to satisfy a schema. A model can post a spotless score on well-formed calls and still flunk knowing when to keep its hands off the API.
The hardest part of tool use isn't making the call. It's not making the call.
If your agent's expensive failure is firing a tool when none was warranted, the headline accuracy number is not where you look. The relevance/irrelevance column is. Treating those as one number is the first mistake the search query encourages.
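A small sketch shows how a blended score can hide this failure. The `Case` record and `split_scores` helper are hypothetical, and the ten cases are made-up illustration data, not results from any leaderboard.

```python
from dataclasses import dataclass


@dataclass
class Case:
    should_call: bool           # True if a tool call is the correct response
    did_call: bool              # True if the model emitted a tool call
    call_correct: bool = False  # only meaningful when a call was expected and made


def split_scores(cases: list[Case]) -> dict[str, float]:
    """Report call accuracy and restraint separately, plus the blended number."""
    relevant = [c for c in cases if c.should_call]
    irrelevant = [c for c in cases if not c.should_call]
    good_calls = sum(c.did_call and c.call_correct for c in relevant)
    good_abstains = sum(not c.did_call for c in irrelevant)
    return {
        "call_accuracy": good_calls / len(relevant) if relevant else float("nan"),
        "restraint": good_abstains / len(irrelevant) if irrelevant else float("nan"),
        "blended": (good_calls + good_abstains) / len(cases),
    }


# Illustrative data: eight prompts that need a tool (all handled correctly)
# and two that need a plain answer (the model calls a tool on both anyway).
cases = [Case(True, True, True)] * 8 + [Case(False, True)] * 2
print(split_scores(cases))
```

On this toy set the blended score is 0.8 while restraint is 0.0: the single number looks healthy and the model never once declined to call a tool.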
The number production actually feels
Here's the non-obvious anchor. BFCL, even in its multi-turn v3 form, mostly gives each task one shot. Production gives the same task to a thousand customers. The gap between those is where tau-bench (and its successor tau2-bench) from Sierra lives.
tau-bench drops the model into a simulated customer-service conversation, with a second LLM playing the user, across retail and airline domains (tau2-bench adds telecom), each with real policy constraints: return windows, fare rules, account verification. The agent has to gather information over multiple turns, obey the policy, and execute database operations through tools. That alone is harder than BFCL. But the metric is the real contribution.
tau-bench reports pass^k: the probability that all k independent trials of the same task succeed. This is not pass@k ("at least one of k worked") but its stricter counterpart. If trials are independent with per-attempt success rate p, then pass^k = p^k, so it decays exponentially in k. A model that's 90% reliable per attempt is at roughly 43% over eight (0.9^8 ≈ 0.43). In tau-bench's own results, a GPT-4o retail agent that looked respectable at pass^1 fell below 25% at pass^8, meaning a less than one-in-four chance that it resolves the same issue cleanly for eight different customers in a row. That is the number a support queue lives and dies on, and no single-shot leaderboard surfaces it.
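The arithmetic is easy to check. The sketch below computes the idealized p^k decay and a combinatorial estimate of pass^k from a finite batch of trials: the fraction of size-k subsets of the trials in which every trial succeeded. I believe this is the form of estimator the tau-bench paper uses, but treat that as an assumption; the function names are my own.

```python
from math import comb


def pass_hat_k_ideal(p: float, k: int) -> float:
    """pass^k when every trial independently succeeds with probability p."""
    return p ** k


def pass_hat_k_estimate(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task from `trials` runs with `successes` passes.

    C(successes, k) / C(trials, k) is the fraction of size-k subsets of the
    trials in which every trial succeeded.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    return comb(successes, k) / comb(trials, k)


# Idealized decay for a model that is 90% reliable per attempt.
for k in (1, 2, 4, 8):
    print(k, round(pass_hat_k_ideal(0.9, k), 2))

# One task, eight trials, seven passes: fine at k=1, zero at k=8.
print(pass_hat_k_estimate(7, 8, 1), pass_hat_k_estimate(7, 8, 8))
```

A benchmark-level figure would average the per-task estimate over all tasks. Note how a single failed trial out of eight drives that task's pass^8 to zero, which is exactly the strictness the metric is designed to have.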
This is the same lesson the rest of the field keeps relearning: the evals are the product, and an eval that measures the wrong unit flatters the wrong model.
Reading the standings without getting fooled
Model rankings on both boards move fast. New frontier releases reshuffle the top of BFCL and the tau-bench tables on a timescale of weeks, so any specific name I print here is stale by the time you read it. Go to the live leaderboard for current standings. What doesn't go stale is the methodology, and the methodology is what you should be choosing on:
- If you need well-formed calls against many APIs, BFCL's AST and executable accuracy scores are your signal, and they set a high bar that most current frontier models clear.
- If your costly failure is a tool fired when none was needed, ignore the headline and read BFCL's irrelevance/relevance detection.
- If you're shipping a multi-turn agent that repeats the same task at volume, BFCL's top score is a vanity number. tau-bench's pass^k is the one that predicts your incident count.
The trap in "best LLM for function calling" is that it implies a single ranked list with a single winner. There isn't one. There's a model that's best at shape, a model that's best at restraint, and a model that's best at consistency under repetition — and on a given week they may not be the same model. Decide which failure you can't live with first. The leaderboard doesn't lie about what it measures. It only lies if you let its top row answer a question it was never scoring.



