Here is a cost you will not find on any invoice, and it is one of the largest ones you control. Take a single agent task — retrieve two records, compare them, write a sentence — and run it through six different agent frameworks on the same model. The model charges the same price per token to all of them. And yet the bill at the end differs by two to three times, because the framework, not the model, decides how many tokens the task actually spends.

That gap is the most under-examined line item in agent engineering. Teams agonize over which model to call and then reach for whichever framework the tutorial used, as if the wrapper were free. It isn't. In a 2026 benchmark that ran six frameworks — LangGraph, LangChain's AgentExecutor, AutoGen, CrewAI, Semantic Kernel, and Haystack — through five production-style tasks, 100 runs each, all pinned to GPT-4o precisely so the model couldn't be the variable, the frameworks fanned out across a 2–3× token spread. LangChain came out leanest, roughly 22% under the cross-framework median. CrewAI sat at about double the pack on tokens and ran several times slower on the simple flows.

The model sets the price per token. The framework decides how many tokens each task spends. Only one of those shows up in your model dashboard.

The headline number is real, and it's the wrong thing to memorize#

It is tempting to read "CrewAI costs 2× more" and file it as a verdict. Don't, because the ranking flips depending on the task, and the flip is the actual lesson. The usual framework comparison stops at features and control flow — who owns the loop, LangGraph vs CrewAI vs AutoGen — and that framing is right about fit. It just leaves the money on the table, because the same design choices that decide who drives also decide what you pay.

On a one-tool-call flow — the bread and butter of most production agents — CrewAI's overhead is worst, because every agent in a crew carries a role, a goal, and a backstory prepended to each turn. For a task that needs none of that structure, you are paying to re-send a persona on every call. But hand the same frameworks a long, branching, many-step task and a different cost driver takes over: LangGraph, whose explicit state machine folds the growing history of each manual tool call back into context at every node, spiked to just over 15,000 prompt tokens on a single call in the heaviest task. The lean framework on simple work is not the lean framework on complex work.

So the benchmark's real output isn't a leaderboard. It's a warning that token cost is a function of task shape crossed with framework architecture — and if you memorize a single winner, you will pick the wrong tool for half your workloads.

Three drivers, and you can read them off the architecture#

The useful move is to stop treating token cost as something you discover after a month of bills and start predicting it from the framework's control-flow model before you write a line. Three mechanisms do almost all of the work, and every popular framework leans on one of them by design:

None of these are bugs. Each is the price of a real capability: resumability, role structure, emergent reasoning. The mistake is buying the capability on a task that doesn't need it. A simple retrieval agent inside a role-heavy crew pays the persona tax for nothing. A long branching workflow on a framework that re-sends history pays the accumulation tax every step. The 2–3× spread lives almost entirely in these mismatches.

The rule that actually saves money#

Which turns the question around. "Which framework is cheapest" has no stable answer, but "does this framework's cost structure match the shape of my task" does, and you can answer it up front:

The discipline is to pick the framework whose dominant cost driver aligns with what your task actually is, then let the tokens fall where the architecture puts them. Only once you've matched the shape do the tactical token-cost reductions — trimming context, caching, smaller models for sub-steps — start compounding instead of fighting the framework. Do that and the scary 2–3× headline mostly evaporates, because a well-matched framework rarely loses enough on tokens to justify a rewrite. Ignore it — reach for the framework in the tutorial and hope — and you will ship the mismatch, then spend a quarter wondering why an agent that calls a cheap model somehow runs an expensive bill. The number was never the model's. It was the wrapper's, and the wrapper told you it would be, in its architecture, before you ever ran it.