Someone on your team reads that the model supports parallel tool calling, flips the flag on, and reports back that nothing got faster. They're not wrong, and they're not doing it wrong. They've just run into the thing the feature's name hides: "parallel tool calling" is two separate decisions, and the API only makes one of them for you.
The two decisions
Decision one belongs to the model: when it plans its next step, it can return several tool calls in a single assistant turn instead of one. Anthropic's docs put it plainly — by default Claude "may use multiple tools to answer a user query," emitting several tool_use blocks at once. OpenAI does the same; parallel_tool_calls defaults to true, and the model bundles its calls into one response.
Decision two belongs to you. Here is the sentence from Anthropic's own documentation that the whole topic turns on: when the model returns multiple tool-use blocks, "how you run them is your decision." You can run them concurrently with Promise.all or asyncio.gather, sequentially, or in any mix. The provider hands you a list. It does not run anything.
So the flag your teammate flipped only affects decision one — whether the model is allowed to emit a batch. If the code that receives that batch still loops through it one call at a time, every request runs in series and the latency is identical. The model parallelized its intent; the runtime serialized its execution. Nothing got faster because the half that makes things faster was never touched.
The API parallelizes the model's request, not your I/O. Turning on parallel tool calling and then awaiting each call in a for-loop is the most common way to ship the feature and get none of the benefit.
This is why the frameworks earn their keep. LangGraph's prebuilt ToolNode is explicit that "if multiple tool calls are requested, they will be run in parallel," returning one message per call. The OpenAI Agents SDK surfaces a parallel_tool_calls field in ModelSettings. They aren't just passing the flag through — they're closing the gap by actually dispatching the returned calls concurrently, which is the part that costs latency if you do it by hand and skip.
The half that can corrupt your results
Get past the "it does nothing" trap and a sharper one waits, because parallelism isn't only a performance setting — it's a correctness setting.
Two tool calls can run at the same time only if neither needs the other's output and neither steps on shared state. Anthropic spells out the rule: "independent, read-only operations are usually safe to run in parallel," while "tools with side effects, shared state, or ordering requirements might be better run sequentially." Search three databases at once: fine. Write a record and then read it back: not fine — run those in parallel and the read races the write. The model, eager to look efficient, will happily batch calls that have a hidden dependency, and the parallel result is computed against inputs that don't exist yet.
The defense is a system-prompt instruction Anthropic recommends verbatim — "Only batch tool calls that are independent of each other" — but the real insight is structural. The 2024 LLMCompiler work (arXiv 2312.04511) reframes the whole problem as a dependency graph: plan the calls as a DAG, dispatch every node whose inputs are ready, and make dependent nodes wait for theirs. Parallelism stops being a flag and becomes a property you can prove from the graph. Planning that graph instead of looping is where the wins live — LLMCompiler reported up to 3.7x lower latency and 6.7x lower cost than a sequential ReAct loop, and even edged out naive parallel function calling by planning dependencies the flat batch couldn't see. The number that matters isn't "parallel is faster"; it's "knowing what's independent is faster, and the only safe way to parallelize is to know."
The settings that quietly turn it off
The last surprise is that several common configurations disable parallelism without telling you, which is why "why aren't my calls parallel?" is such a frequent question.
The big one is strict structured outputs. OpenAI states that Structured Outputs is not compatible with parallel function calls — when the model generates a parallel call, it may not match the supplied schema — so enabling strict mode forces parallel_tool_calls to false. Reach for guaranteed-valid JSON and you have silently traded away batching. Forcing tool_choice to a specific named function does the same thing from the other direction: you've told the model the only call it may make is that one, so there's nothing to parallelize. And some reasoning models (OpenAI's o-series among the reported cases) don't emit parallel tool calls at all — worth checking your model's capabilities before you design around the feature. There's even a formatting trap: return each tool result in its own separate message instead of bundling them into one, and you teach the model to stop batching on the next turn.
What to actually do
Treat parallel tool calling as two checkboxes, not one. First, make sure your runtime — or a framework like LangGraph that does it for you — actually executes the returned calls concurrently, or the flag is decorative. Second, only let the model batch calls you can show are independent, ideally by modeling them as a dependency graph rather than trusting the model's eagerness. And remember that the moment you turn on strict Structured Outputs or pin a model to a single forced function call, you've turned parallelism off, whatever the flag says. The feature is real, and it's worth having — but only when you own both halves of it. If you're still deciding whether you even need a server in front of these tools, that's the MCP-vs-function-calling question, and it comes first.



