If you are choosing an open-weight model for an agent by reading this week's leaderboard, you are optimizing a number with a half-life measured in weeks. Qwen, DeepSeek, Mistral, and the rest ship new versions almost monthly; by the time your evaluation harness finishes, the model you tested has a successor. Chasing the top score is a treadmill. The way off it is to decide on the two things about a model family that don't churn between releases: its license, and the economics of its architecture.
The license is the only stable spec
Every other property — context length, benchmark scores, parameter count — changes with each release. The license usually doesn't, and it's the constraint that follows you for the entire life of whatever you build. In 2026 the open-weight licensing map looks nothing like the one developers carry in their heads from 2023.
The most permissively licensed serious models now come from labs many Western teams still think of as the challengers. Qwen ships its current open-weight line under Apache 2.0 (a few earlier checkpoints carried a custom Qwen license, so check the model card if you pin an older size). Mistral's open line (Mistral Small, the Mixtral mixture-of-experts models, the Devstral coding model, the Magistral reasoning model) is Apache 2.0. DeepSeek releases its code under MIT, with weights under a license that permits commercial use. These are the no-asterisk options.
The asterisks belong to the incumbents, and they don't point the same way. Meta's Llama 4 (Scout and Maverick) ships under the Llama Community License, which is not OSI-approved open source. It carries three live restrictions: a clause requiring a separate, discretionary license from Meta if your products exceed 700 million monthly active users; a requirement to display "Built with Llama" prominently; and a rule that any distributed fine-tune must put "Llama" at the start of its name. Usable, widely used — but with strings the Apache models don't have.
The surprise of 2026 isn't which model scores highest. It's that the most permissive licenses moved to the labs people still call the upstarts.
Google supplies the clearest sign that "open" is now a gradient rather than a switch. For years Gemma shipped under a custom Gemma Terms of Use — commercial use allowed, but with a prohibited-use policy and downstream flow-down obligations that kept it off the OSI list. With its 2026 generation, Google moved Gemma to Apache 2.0. A vendor relaxing its license between generations is exactly why you choose the family on its licensing trajectory, not a single checkpoint.
The architecture is your serving bill
The second durable property is how the model is built, because that sets what it costs to run — and an agent runs the model constantly, one sequential call after another.
The pivotal distinction is dense versus mixture-of-experts (MoE). A dense model activates all its parameters on every token. An MoE model has a large total parameter count but routes each token through only a fraction of it. DeepSeek-V3 is the canonical example: 671 billion total parameters, but only 37 billion activated per token. Your per-token inference compute and latency track the active count; the total mostly determines how much memory you need to hold the weights. The counterintuitive result is that a "huge" MoE model can be cheaper and faster per token than a dense model with far fewer total parameters, as long as that dense model still activates more parameters per token than the MoE does and you can fit the MoE's full weights in memory. DeepSeek's later work pushed this further with sparse attention to cut the long-context cost that punishes agents stuffing tool outputs back into the window. If serving cost is your constraint, the active-parameter number is the spec to read, not the headline size (the full tradeoff is worth its own look: mixture-of-experts vs dense models for agents).
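As a back-of-envelope sketch of that tradeoff, the snippet below estimates weight memory from total parameters and per-token compute from active parameters. The rule of thumb of roughly 2 FLOPs per active parameter, the 8-bit (one byte per parameter) weight format, and the dense 70-billion-parameter comparison model are all assumptions chosen for illustration, not measurements. It also ignores KV cache, attention cost, and batching effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters activated per token, in billions


def weight_memory_gb(spec: ModelSpec, bytes_per_param: float = 1.0) -> float:
    """Memory needed just to hold the weights (ignores KV cache and activations)."""
    # billions of parameters * bytes per parameter = gigabytes (1e9 bytes)
    return spec.total_params_b * bytes_per_param


def gflops_per_token(spec: ModelSpec) -> float:
    """Rule-of-thumb forward-pass compute: about 2 FLOPs per active parameter."""
    return 2.0 * spec.active_params_b


# DeepSeek-V3 figures come from the text above; the dense model is hypothetical.
deepseek_v3 = ModelSpec("DeepSeek-V3", total_params_b=671, active_params_b=37)
dense_70b = ModelSpec("hypothetical dense 70B", total_params_b=70, active_params_b=70)

for spec in (deepseek_v3, dense_70b):
    print(
        f"{spec.name}: ~{weight_memory_gb(spec):.0f} GB of weights at 8-bit, "
        f"~{gflops_per_token(spec):.0f} GFLOPs per token"
    )
```

Under these assumptions the MoE model needs roughly ten times the memory of the dense one but about half the compute per token, which is the shape of the tradeoff described above.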
For agents, reliability beats raw intelligence
When you do benchmark — and you should, on your own tasks — measure the right thing. An agent's failure mode is rarely that the model wasn't smart enough; it's that the model emitted a malformed tool call, hallucinated an argument, or lost the thread across a dozen steps. That's why the Berkeley Function Calling Leaderboard moving to its agentic v4 matters: it grades multi-step tool use and memory, not the single-shot function call its earlier versions tested. A model that tops a knowledge benchmark but can't reliably complete a 20-step tool sequence is the wrong pick for an agent, no matter how it ranks (more on why the leaderboard misleads here: best LLM for function calling).
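One way to measure those failure modes in your own eval is to check every tool call a model emits against the tool's declared arguments, and to count an episode as a success only if every step passes. The sketch below is a minimal illustration, not how any leaderboard scores models: the `get_weather` tool, its schema layout, and the `name`/`arguments` call format are hypothetical, and a real harness would also check whether the call was the right one to make.

```python
import json

# Hypothetical tool registry: argument names mapped to their expected Python types.
TOOL_SCHEMAS = {
    "get_weather": {
        "required": {"city": str},
        "optional": {"units": str},
    },
}


def check_tool_call(raw: str) -> list[str]:
    """Return the problems found in one model-emitted tool call (empty means OK)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["malformed JSON"]
    if not isinstance(call, dict):
        return ["call is not a JSON object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOL_SCHEMAS:
        return [f"unknown tool: {name!r}"]

    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments is not a JSON object"]

    schema = TOOL_SCHEMAS[name]
    allowed = {**schema["required"], **schema["optional"]}
    problems = []
    for key in schema["required"]:
        if key not in args:
            problems.append(f"missing argument: {key}")
    for key, value in args.items():
        if key not in allowed:
            # An argument the tool never declared, i.e. a hallucinated one.
            problems.append(f"unexpected argument: {key}")
        elif not isinstance(value, allowed[key]):
            problems.append(f"wrong type for argument: {key}")
    return problems


def episode_success(raw_calls: list[str]) -> bool:
    """A multi-step episode counts only if every call in it is well-formed."""
    return all(not check_tool_call(raw) for raw in raw_calls)
```

Scoring whole episodes rather than individual calls is the design choice that matters here: a model that gets 95% of single calls right can still fail most 20-step sequences.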
The decision, made to last
Filter the field by license first, because that constraint is permanent: if you need true open source, Qwen, Mistral, DeepSeek, and 2026-era Gemma qualify and Llama 4 does not. Then sort by serving economics — MoE active parameters against your hardware. Only then run the current versions through your own agent eval, weighting tool-calling reliability over trivia, and pick a winner you'll happily replace next quarter when the same family ships its next checkpoint. You're not choosing a model. You're choosing a family to follow — and the license and the architecture are what tell you where it's going.
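The first two steps of that procedure can be sketched as a filter followed by a sort. Everything in the candidate list below is a placeholder: the family names, license strings, and parameter counts are invented for illustration, and the set of licenses treated as "true open source" is an assumption you would replace with your own legal requirements. The third step, running your own agent eval on the shortlist, is deliberately left out.

```python
from dataclasses import dataclass

# Assumption: license identifiers that satisfy a "true open source" requirement.
OPEN_SOURCE_LICENSES = {"Apache-2.0", "MIT"}


@dataclass(frozen=True)
class Family:
    name: str
    license: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters activated per token, in billions


def shortlist(families, memory_budget_gb, bytes_per_param=1.0):
    """Filter by license, then by whether the weights fit, then sort by active size."""
    eligible = [
        f for f in families
        if f.license in OPEN_SOURCE_LICENSES
        and f.total_params_b * bytes_per_param <= memory_budget_gb
    ]
    # Fewer active parameters means cheaper per-token serving, so those sort first.
    return sorted(eligible, key=lambda f: f.active_params_b)


# Hypothetical candidates; none of these rows describes a real model.
candidates = [
    Family("family-a", "Apache-2.0", total_params_b=30, active_params_b=3),
    Family("family-b", "Apache-2.0", total_params_b=24, active_params_b=24),
    Family("family-c", "custom-community", total_params_b=109, active_params_b=17),
    Family("family-d", "MIT", total_params_b=671, active_params_b=37),
]

for family in shortlist(candidates, memory_budget_gb=80):
    print(family.name, family.license, family.active_params_b)
```

With an 80 GB budget at one byte per parameter, family-c drops out on license and family-d on memory, leaving the two Apache-licensed candidates ordered by active parameters.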



