There are two ways to let an agent act, and the difference is what it emits when it decides to do something. A tool-calling agent returns a structured JSON object — {"name": "search", "arguments": {"q": "..."}} — that your runtime parses and executes. A code agent returns a snippet of executable code — results = search("..."); top = results[0] — that your runtime runs in a sandbox. The first is the function-calling style every provider ships, descended from ReAct (Yao et al., 2022). The second is the CodeAct paradigm (Wang et al., ICML 2024), and it's what Hugging Face's smolagents means by "agents that think in code."
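To make the contrast concrete, here is a minimal sketch of what a runtime does with each kind of action. Everything in it is an assumption for illustration: the `search` tool is a canned stand-in, the `TOOLS` registry and both runner functions are hypothetical, and `exec()` is used only to show the shape of a code action. It is not a sandbox.

```python
import json

def search(q: str) -> list[str]:
    """Hypothetical stand-in tool: returns canned results for a query."""
    return [f"{q} result {i}" for i in range(3)]

# Hypothetical registry mapping tool names to callables.
TOOLS = {"search": search}

def run_tool_call(raw: str):
    """Tool-calling style: parse the JSON action, look up the tool, call it."""
    call = json.loads(raw)
    return TOOLS[call["name"]](**call["arguments"])

def run_code_action(source: str) -> dict:
    """Code style: execute the snippet with the tools in scope.

    exec() gives no isolation; it only illustrates the mechanism.
    """
    namespace = dict(TOOLS)
    exec(source, namespace)
    return namespace

json_result = run_tool_call('{"name": "search", "arguments": {"q": "agents"}}')
code_ns = run_code_action('results = search("agents"); top = results[0]')

print(json_result[0])  # agents result 0
print(code_ns["top"])  # agents result 0
```

Both paths reach the same tool. The difference is that the code action can keep `results` and `top` around as variables for whatever it does next, while the JSON action hands one result back to the model.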
You've probably seen the number: code agents win by up to 20%. It's true. It's also one of the most context-dependent stats in the agent literature, and quoting it as a blanket law is how teams pick the wrong paradigm.
The 20% is real — and it's conditional
CodeAct's own evaluation tested 17 LLMs on two benchmarks. On M³ToolEval, whose tasks need multiple tools and several interaction turns, code actions delivered up to a 20-point absolute improvement in success rate while using up to 30% fewer actions. Across the 17 models, code came out ahead on both success rate and step count in 12 of them. That's a strong result, and it holds for most of the models tested.
But on API-Bank, re-scoped to atomic single-tool calls, code actions were merely "comparable." The gap didn't vanish because the method got worse — it vanished because the thing that makes code better wasn't being exercised.
The advantage of code isn't that it's a smarter format. It's that one code block can loop, branch, and pass data between tool calls — work that, in JSON, is a separate round-trip to the model for every single step.
That's the whole mechanism. A task that needs "call tool A, filter the results, call tool B on each survivor, sum the outputs" is one code snippet with a for loop and a variable. In JSON tool-calling, it's a dozen sequential calls, each one a fresh LLM turn that re-reads the growing transcript. smolagents puts the same point in numbers: writing actions as code uses roughly 30% fewer steps, and fewer steps means fewer LLM calls means lower cost. Anthropic's "code execution with MCP" work pushes it further — letting the model write code that orchestrates tools, instead of stuffing every tool definition and result through the context window, took one example workflow from ~150,000 tokens to ~2,000.
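The "call tool A, filter, call tool B on each survivor, sum" task can be sketched as follows. The tools (`list_orders`, `refund_amount`) and their data are hypothetical, and the turn count for the JSON style is a simplifying assumption: one model turn to call tool A, one per tool B call, and one final turn to add things up and answer.

```python
# Hypothetical data and tools, invented for illustration.
ORDERS = [
    {"id": 1, "status": "returned"},
    {"id": 2, "status": "shipped"},
    {"id": 3, "status": "returned"},
]
REFUNDS = {1: 20.0, 3: 35.5}

def list_orders() -> list[dict]:
    """Tool A: list all orders."""
    return ORDERS

def refund_amount(order_id: int) -> float:
    """Tool B: refund owed for one order."""
    return REFUNDS[order_id]

def code_action() -> float:
    """What a code agent could emit in a single model turn."""
    returned = [o for o in list_orders() if o["status"] == "returned"]
    total = 0.0
    for order in returned:  # loop and variables live inside the action
        total += refund_amount(order["id"])
    return total

def json_style_turns(n_survivors: int) -> int:
    """Assumed model turns for JSON tool calling: A, then B per survivor, then answer."""
    return 1 + n_survivors + 1

print(code_action())         # 55.5
print(json_style_turns(2))   # 4
print(json_style_turns(10))  # 12
```

With ten survivors, the JSON style is at the dozen turns mentioned above, and each of those turns re-reads the transcript so far. The code action stays at one.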
So the real question isn't "which paradigm is better." It's "how compositional is your agent's work?" The more your steps chain together with data flowing between them, the more code pulls ahead. The more each step is a discrete, isolated action, the more the two converge, and then the tiebreakers are all on the other side.
What JSON tool calling buys you back
Code actions are not free. Two costs are structural, not incidental.
First, you have to run the code somewhere safe. smolagents is blunt about it: the only way to execute model-generated code with robust isolation is a remote sandbox like E2B or Docker. So choosing code agents means a sandbox is now mandatory infrastructure — an operational and security commitment a JSON tool-calling agent simply doesn't incur, because it never executes free-form code.
Second, code agents assume the model is genuinely good at code. That's a safe bet for frontier models and a shaky one below them; CodeAct found the best open-source model still managed only 13.4% on M³ToolEval, the harder multi-tool benchmark, even with code actions. The format doesn't rescue a weak reasoner.
Against that, structured tool calling has three quiet advantages. Guaranteed schema: provider constrained decoding (OpenAI's Structured Outputs) makes it literally impossible to emit a token that breaks the schema — 100% adherence, versus the sub-40% that older models managed by prompting alone. Auditability: a JSON trace is trivially parseable, loggable, and replayable; a code-execution log is messier to reason about after the fact. Zero code-exec surface: there's nothing to sandbox. For a system where each step is one bounded action, those aren't consolation prizes — they're the right defaults, which is exactly why smolagents recommends its ToolCallingAgent for "simple systems that don't require variable handling or complex tool calls."
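Constrained decoding happens inside the provider, so it can't be reproduced in a snippet. The auditability point can be sketched, though. Below is a minimal, hand-rolled example under stated assumptions: the `SCHEMA` format, `validate`, and `record` are all hypothetical, not any provider's or library's API. It checks each tool call against a declared argument spec, logs it as one JSON line, and replays the log.

```python
import json

# Hypothetical spec: tool name -> required argument names and types.
SCHEMA = {"search": {"q": str}}

def validate(call: dict) -> list[str]:
    """Return a list of schema problems for one tool call (empty if valid)."""
    spec = SCHEMA.get(call.get("name"))
    if spec is None:
        return [f"unknown tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    errors = []
    for key, typ in spec.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif not isinstance(args[key], typ):
            errors.append(f"argument {key} should be {typ.__name__}")
    for key in args:
        if key not in spec:
            errors.append(f"unexpected argument: {key}")
    return errors

trace: list[dict] = []

def record(raw: str) -> dict:
    """Parse one JSON action, validate it, and append it to the trace."""
    call = json.loads(raw)
    entry = {"step": len(trace) + 1, "call": call, "errors": validate(call)}
    trace.append(entry)
    return entry

record('{"name": "search", "arguments": {"q": "agents"}}')
record('{"name": "search", "arguments": {"query": "agents"}}')

# The whole trace serializes to JSON lines and loads back unchanged.
log = "\n".join(json.dumps(entry) for entry in trace)
replayed = [json.loads(line) for line in log.splitlines()]

print(trace[1]["errors"])  # ['missing argument: q', 'unexpected argument: query']
print(replayed == trace)   # True
```

Every step is a small, self-describing record, so checking, storing, and replaying a run needs nothing beyond a JSON parser.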
The binary is already dissolving
Here's the part that should change how you read the whole debate: the frontier isn't code or JSON — it's code inside JSON. Hugging Face's "structured CodeAgent" has the model emit its reasoning and its code as fields of a structured object, and it outperformed the plain CodeAgent approach by 2–7 percentage points on average. The reason is the unglamorous one: reliability. In their traces, 2.4% of first calls had parsing errors, and runs without parsing errors succeeded 21.3% more often than runs with them. Code gives you composition; structure gives you the parse-reliability that composition was quietly losing.
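A minimal sketch of "code inside JSON", assuming the model emits an object with a reasoning field and a code field. The field names `thoughts` and `code`, the `prices` tool, and both functions are assumptions for illustration, not the smolagents implementation. As before, `exec()` stands in for a real sandbox and provides no isolation.

```python
import json

def parse_structured_action(raw: str) -> tuple[str, str]:
    """Return (thoughts, code) from a structured action, or raise ValueError."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"not valid JSON: {err}") from err
    if not isinstance(obj, dict):
        raise ValueError("action must be a JSON object")
    for field in ("thoughts", "code"):
        if not isinstance(obj.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return obj["thoughts"], obj["code"]

def run_structured_action(raw: str, tools: dict) -> dict:
    """Parse the wrapper first; only well-formed actions reach execution."""
    _thoughts, code = parse_structured_action(raw)
    namespace = dict(tools)
    exec(code, namespace)  # illustration only: exec() is not a sandbox
    return namespace

def prices() -> list[float]:
    """Hypothetical tool returning canned data."""
    return [4.0, 6.0]

good = json.dumps({"thoughts": "Sum the prices.", "code": "total = sum(prices())"})
result = run_structured_action(good, {"prices": prices})
print(result["total"])  # 10.0

try:
    run_structured_action('{"thoughts": "no code field"}', {"prices": prices})
except ValueError as err:
    print(err)  # missing or non-string field: code
```

The wrapper gives the runtime an unambiguous place to find the code, so a malformed action is rejected at parse time instead of failing partway through execution.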
So don't read "code agents vs tool-calling agents" as a fork in the road. Read it as a gradient. If your agent fires one discrete tool per step and you want a guaranteed schema with nothing to sandbox, JSON tool calling is the correct, boring choice. If your agent orchestrates many tools with data flowing between them, code actions will cut your steps, your tokens, and your error rate — and you'll pay for a sandbox to get it. And if you're building something serious in 2026, the answer is increasingly both: code as the action, structure as the wrapper.
For the closely related question of running tools as code through the protocol layer, see "MCP code execution vs direct tool calls" and "parallel vs sequential tool calling"; for the reasoning-loop ancestor of all of this, see "ReAct vs Plan-and-Execute vs Reflexion."



