"Which small model should I use for my agent?" is the most reasonable question a team can ask in 2026, and the reason the answers feel contradictory is that it is secretly two questions wearing one coat. The case for using a small model at all is settled — most of what an agent does is narrow, repetitive, format-constrained work that a frontier model is wildly overqualified for. What is not settled, and what the leaderboards actively obscure, is that the word "small" now splits along a seam, and which side you land on decides the whole shortlist.

The seam: are you short on memory, or short on tokens-per-dollar?#

There are two ways to be small, and they optimize opposite resources.

The first is small in footprint. A dense four-billion-parameter model — Qwen3-4B, Microsoft's 3.8B Phi-4-mini, a Gemma edge size — loads in a couple of gigabytes and runs on a laptop, a handset, a Jetson, or an air-gapped box with no network at all. Every parameter is active on every token, so the compute is modest and the memory is tiny. This is the model you reach for when the deployment target is the constraint: the agent has to run there, and "there" has 8GB of RAM.

The second is small in active compute. NVIDIA's Nemotron 3 Nano is a Mixture-of-Experts model that activates roughly 3.2 billion of its 31.6 billion parameters per token. It computes like a 3B model — NVIDIA reports it serving several times faster than a 30B dense peer on a single H200 — but you still have to hold all 31.6B in memory to route between the experts. It is small in FLOPs and large in footprint, which is exactly backwards from the dense edge models. This is the model you reach for when the constraint is the bill: you serve high volume on datacenter GPUs and you pay by the token, so tokens-per-dollar is the number that matters and VRAM is cheap by comparison.

A dense 4B is small where the RAM is scarce. A small-active MoE is small where the tokens are expensive. They are not competing for the same slot.

The mistake I keep seeing is a team that needs an on-device model benchmarking Nemotron Nano's throughput, loving it, and then discovering it will never fit on the device — or a team serving millions of server-side calls picking a dense 4B and leaving a large multiple of throughput on the table. Same word, opposite hardware.

The score that actually predicts a good agent#

Once you're on the right side of the seam, the second trap is choosing by the wrong benchmark. The reflex is to sort small models by a general-reasoning score — MMLU, GPQA — and take the top of the column. For an agent, that column is close to irrelevant.

An agent's small-model nodes almost never do open-ended reasoning. They pick a tool and fill a JSON schema. The failure that actually breaks the loop is a malformed argument, a hallucinated parameter, or a tool call fired when the right move was to call nothing at all. None of that is what a knowledge benchmark measures. It is precisely what the Berkeley Function Calling Leaderboard (BFCL) measures: abstract-syntax-tree accuracy on the emitted call, whether the call actually executes, multi-turn tool interactions, and — the one everyone forgets — relevance detection, whether the model correctly declines when no function fits. Its newer agentic tiers push into multi-hop web search and memory, closer still to what a real agent does. The multi-turn, multi-domain τ²-bench tests the same muscle under conversation.

This is why a model can top the reasoning charts and still be a worse agent than a purpose-built 4B. Phi-4-mini leads with function calling as a headline capability, and Qwen3 ships tool use as a first-class feature of even its smallest sizes. Their sibling Phi-4-mini-reasoning, by contrast, is superb at competition math and the wrong tool for an agent node — a clean reminder that you have to select on the axis the node is actually graded on, not the axis that's easiest to rank.

A shortlist that survives contact#

None of this is a knock on frontier models. It's the heterogeneous pattern doing its job: the smallest model that clears each node's bar, a big model reserved for the steps that genuinely need one. Just don't buy a model for a resource you weren't short on — and don't grade a tool-caller on a test it never has to take. Before you check the VRAM math, decide which "small" you actually meant.