If you want to know which model handles function calling best, you can look it up. The Berkeley Function-Calling Leaderboard has answered that question for two years, and it answers it well — for the setup it assumes. That setup is a small, curated set of tools, handed to the model, which then has to emit a correctly-shaped call. It is a real skill and the benchmark measures it honestly.
It is also not the setup you ship. An agent connected over the Model Context Protocol does not receive a tidy list of four tools chosen by a benchmark author. It connects to live servers, discovers whatever tools those servers advertise, and has to choose — often among dozens or hundreds it never saw during training. That is a different test, and over the last year a different benchmark family grew up to run it.
The question stopped being "can the model format the call" and became "can it pick the right tool when nine wrong ones sit next to it." Those are measured by different benchmarks because they are different abilities.
Three benchmarks, three tool-set scales
The cleanest way to read the MCP benchmark family is by one axis: how far each one lets the tool set sprawl.
MCP-Bench keeps it small on purpose. It wires an agent to 28 live MCP servers exposing 250 tools across finance, travel, scientific computing, and academic search, then scores the thing a small tool set makes hard: planning. The tasks give fuzzy instructions with no tool names, so the agent has to retrieve the right tool from a description, chain multi-hop trajectories, ground its answer in intermediate outputs, and coordinate across domains. Its evaluation runs at three levels — schema understanding, trajectory planning, and task completion.
MCPToolBench++ scales up. It draws on a marketplace of over 4,000 MCP servers across 40-plus categories, with 1,500 queries spanning search, browsing, maps, and more, in several languages. Its contribution is less about size than about honesty in metrics: it separates Tool Call Success Rate — did the call run without an error — from Pass@K, which also checks that the parameters were correct and the result matched the expected ground truth. The two numbers diverge exactly where you'd fear: a model can "succeed" on every call while failing Pass@K because it called the wrong tool with arguments that happened to execute.
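To make the gap between the two metrics concrete, here is a minimal sketch. The `Attempt` record layout and both function names are hypothetical, and Pass@K is simplified to "at least one of the first k attempts was fully correct"; it is not the benchmark's own harness or estimator.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One tool call made for a query (hypothetical record layout)."""
    executed_ok: bool     # the call returned without an error
    tool_correct: bool    # the intended tool was the one called
    params_correct: bool  # the arguments matched the expected ones
    result_matches: bool  # the output matched the ground truth


def tool_call_success_rate(runs: dict[str, list[Attempt]]) -> float:
    """Fraction of all calls that executed without an error."""
    attempts = [a for per_query in runs.values() for a in per_query]
    return sum(a.executed_ok for a in attempts) / len(attempts)


def pass_at_k(runs: dict[str, list[Attempt]], k: int) -> float:
    """Fraction of queries with a fully correct attempt among the first k."""
    def passed(a: Attempt) -> bool:
        return (a.executed_ok and a.tool_correct
                and a.params_correct and a.result_matches)

    hits = sum(any(passed(a) for a in per_query[:k])
               for per_query in runs.values())
    return hits / len(runs)


# Toy data, not benchmark results: q2 calls the wrong tool, but the call runs.
runs = {
    "q1": [Attempt(True, True, True, True)],
    "q2": [Attempt(True, False, False, False)],
}
print(tool_call_success_rate(runs))  # 1.0
print(pass_at_k(runs, k=1))          # 0.5
```

Every call "succeeds", yet only half the queries pass, which is the divergence described above.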
MCPAgentBench goes to the far end. After deduplication it holds definitions for 9,714 MCP servers and more than 20,000 tools, run in an AutoGen-based sandbox against locally maintained servers for reproducibility. Its design choice is the tell: it deliberately injects distractor tools into the candidate list. That isolates the one skill the other two mostly assume the model already has: selection and discrimination, picking the correct tool when wrong-but-plausible ones are adjacent. It scores this with a Task Efficiency Finish Score (TEFS) and reports the result you would predict: as the count of candidate tools grows, TEFS drifts down.
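Distractor injection can be sketched in a few lines. This is an illustration under stated assumptions, not MCPAgentBench's sandbox: `build_candidates` is a hypothetical helper, tools are represented by bare names, and distractors are sampled at random, whereas a real harness would prefer plausible near-neighbours of the correct tool.

```python
import random


def build_candidates(gold_tool: str, catalog: list[str],
                     n_distractors: int, seed: int = 0) -> list[str]:
    """Return the correct tool hidden among sampled distractors.

    Assumption: random sampling stands in for whatever similarity-based
    distractor selection a real benchmark would use.
    """
    rng = random.Random(seed)  # seeded so the candidate list is reproducible
    pool = [t for t in catalog if t != gold_tool]
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    candidates = distractors + [gold_tool]
    rng.shuffle(candidates)    # do not leak the answer through position
    return candidates


catalog = [f"tool_{i}" for i in range(50)]  # hypothetical tool names
candidates = build_candidates("tool_7", catalog, n_distractors=9)
print(len(candidates))          # 10
print("tool_7" in candidates)   # True
```

Raising `n_distractors` while holding the task fixed is what lets a harness attribute a drop in score to selection rather than to planning or formatting.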
A fourth, MCP-Atlas, arrived this year as the newest large-scale competency benchmark built on real MCP servers — a sign the category is consolidating rather than fading.
The one finding they all point at
Read across the three and the same result keeps surfacing, which is more convincing than any single leaderboard. The basic mechanics of tool calling have converged. In MCP-Bench, tool-name validity and schema compliance clear roughly 95% even for mid-scale models; the part everyone worried about in 2024 is effectively solved. What has not converged is everything downstream of formatting. Frontier systems score above 0.63 on task fulfillment and above 0.70 on grounding, while smaller models sit below 0.35 on fulfillment and below 0.45 on grounding. The gap is not "can it call a tool." It is "can it plan across several, and pick correctly among many."
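For reference, the "solved" layer amounts to checks like the following. This is a simplified sketch, not MCP-Bench's scorer: `check_call` is a hypothetical helper, the tool and its schema are invented for illustration, and only required fields and top-level JSON types are checked, where a full JSON Schema validator would cover much more.

```python
# Mapping from JSON Schema type names to Python types (top-level only).
JSON_TYPES = {
    "string": str, "number": (int, float), "integer": int,
    "boolean": bool, "array": list, "object": dict,
}


def _type_ok(value, type_name) -> bool:
    expected = JSON_TYPES.get(type_name)
    if expected is None:
        return True   # type not covered by this sketch
    if isinstance(value, bool) and type_name in ("number", "integer"):
        return False  # bool is an int subclass in Python, but not in JSON
    return isinstance(value, expected)


def check_call(call: dict, tools: dict) -> tuple[bool, bool]:
    """Return (name_valid, schema_compliant) for one emitted call.

    `tools` maps each advertised tool name to its input schema.
    """
    schema = tools.get(call.get("name"))
    if schema is None:
        return False, False
    args = call.get("arguments", {})
    if any(key not in args for key in schema.get("required", [])):
        return True, False
    props = schema.get("properties", {})
    typed_ok = all(_type_ok(value, props[key].get("type"))
                   for key, value in args.items() if key in props)
    return True, typed_ok


# Hypothetical tool definition, for illustration only.
tools = {
    "get_weather": {
        "type": "object",
        "properties": {"city": {"type": "string"},
                       "days": {"type": "integer"}},
        "required": ["city"],
    }
}
print(check_call({"name": "get_weather",
                  "arguments": {"city": "Oslo", "days": 3}}, tools))
# (True, True)
```

A call can pass both checks and still be the wrong call for the task, which is why these numbers say little about fulfillment or grounding.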
That reframes a claim this publication has made qualitatively — that agents get worse as you add tools, and that tool sprawl is a first-order problem, not a rounding error. The MCP benchmarks now put a curve under it. Selection accuracy degrades with candidate-set size; the degradation is the finding, not the noise.
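If you log your own agent's tool choices, you can draw the same kind of curve for your host. The sketch below assumes a hypothetical log format of `(number_of_candidate_tools, chose_correct_tool)` pairs; the sample values are toy data, not results from any benchmark.

```python
from collections import defaultdict


def accuracy_by_size(trials):
    """Group selection outcomes by candidate-set size.

    `trials` is an iterable of (n_candidates, chose_correct_tool) pairs.
    Returns {n_candidates: accuracy}, ordered by increasing size.
    """
    totals = defaultdict(lambda: [0, 0])  # size -> [hits, count]
    for n_candidates, correct in trials:
        totals[n_candidates][0] += int(correct)
        totals[n_candidates][1] += 1
    return {n: hits / count for n, (hits, count) in sorted(totals.items())}


# Toy log: two trials with 10 candidates, two with 100.
trials = [(10, True), (10, True), (100, True), (100, False)]
print(accuracy_by_size(trials))  # {10: 1.0, 100: 0.5}
```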
What this changes for you
The practical consequence is that "best model for MCP tool use" is an incomplete question until you say how many tools your host exposes. A model that tops a 250-tool planning benchmark is being measured in a regime most production hosts left behind the moment they connected their third server. If your deployment connects a broad catalog — and MCP's whole appeal is that it makes connecting catalogs cheap — the number that predicts your outcome is the one measured under distractors at scale, not the one measured against a curated handful.
So pick the benchmark that matches your tool-set scale. Few servers, hard planning: MCP-Bench. Broad catalog, structural correctness: MCPToolBench++, and read Pass@K, not Tool Call Success Rate. Large sprawling tool set where selection is the risk: MCPAgentBench, because it is the only one of the three that deliberately places wrong-but-plausible tools in front of the model. And whichever you use, evaluate the tool use, not just the answer: by the time a wrong tool returns a plausible result, the answer looks fine and the trajectory is already lost.
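Scoring the trajectory alongside the answer can be as simple as the sketch below. `evaluate_run` is a hypothetical helper and the tool names are invented; exact sequence matching is an assumption chosen for brevity, and a real harness might accept reordered or alternative valid trajectories.

```python
def evaluate_run(expected_tools: list[str], called_tools: list[str],
                 answer_ok: bool) -> dict:
    """Score a run on both its final answer and the tools it used.

    Assumption: the trajectory counts as correct only if the called tools
    match the expected sequence exactly.
    """
    trajectory_ok = called_tools == expected_tools
    return {
        "answer_ok": answer_ok,
        "trajectory_ok": trajectory_ok,
        # The case answer-only grading cannot see: a plausible answer
        # produced by the wrong tool.
        "silent_failure": answer_ok and not trajectory_ok,
    }


report = evaluate_run(
    expected_tools=["search_flights", "book_flight"],
    called_tools=["search_hotels", "book_flight"],
    answer_ok=True,  # the final text looked fine to an answer-only grader
)
print(report["silent_failure"])  # True
```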



