For two years the open-weight question was "Qwen or Llama or DeepSeek," and the answer was mostly about who topped MMLU last month. That framing is dead. The models that actually run agents in 2026 are a different cohort, all mixture-of-experts, all post-trained specifically for tool use: Kimi K2 from Moonshot AI, GLM-4.6 from Zhipu, MiniMax M2, and the latest Qwen3. They are genuinely downloadable — modified MIT, MIT, MIT, and Apache 2.0 respectively — and choosing among them rewards looking at exactly the numbers the launch tweets bury.

The headline number is the wrong number

Kimi K2 is a one-trillion-parameter model. That sounds like the obvious heavyweight until you read the second number: it activates 32 billion parameters per token, because it's a 384-expert MoE that routes each token to a handful of experts. GLM-4.6 is 355B total and also activates 32B. So the model that is nearly 3x larger on paper does the same amount of work per token, and active parameters, not total, are what set your per-token compute, and with it most of your latency and serving cost. Total parameters still matter for one thing: every expert has to sit in memory, so they set your VRAM-per-replica.
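As a quick check on that comparison, the arithmetic can be sketched in a few lines. The parameter counts are the ones quoted above; the dictionary layout and the helper name `active_fraction` are only illustrative.

```python
# Parameter counts in billions, as quoted in the text above.
MODELS = {
    "Kimi K2": {"total_b": 1000, "active_b": 32},
    "GLM-4.6": {"total_b": 355, "active_b": 32},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's weights used for any single token."""
    return active_b / total_b


for name, spec in MODELS.items():
    share = active_fraction(spec["total_b"], spec["active_b"])
    print(f"{name}: {spec['total_b']}B total, {spec['active_b']}B active ({share:.1%} per token)")

# How much larger Kimi K2 is on paper versus per token.
size_ratio = MODELS["Kimi K2"]["total_b"] / MODELS["GLM-4.6"]["total_b"]
active_ratio = MODELS["Kimi K2"]["active_b"] / MODELS["GLM-4.6"]["active_b"]
print(f"total-size ratio: {size_ratio:.2f}x, active-size ratio: {active_ratio:.2f}x")
```

The two ratios are the point: roughly 2.8x apart on the headline number, 1.0x apart on the number that is exercised for every token.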

This is the lens that reorders the whole field. MiniMax M2 is a 230B model that activates only 10B. In a chatbot, where you pay for one model call per turn, that's a modest efficiency note. In an agent, which fires dozens to hundreds of sequential model calls to plan, call a tool, read the result, and plan again, that per-step cost compounds into the dominant line on your bill. M2's headline "230B" makes it sound mid-pack; its 10B active makes it the cheapest loop to run in the group, full stop.
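To put rough numbers on that compounding, here is a back-of-the-envelope sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, and it ignores attention cost, caching, and pricing entirely. The call counts and `TOKENS_PER_CALL` are hypothetical values chosen only for illustration; the active-parameter figures are the ones quoted in the text.

```python
# Active parameters in billions, as quoted in the text.
ACTIVE_PARAMS_B = {"Kimi K2": 32, "GLM-4.6": 32, "MiniMax M2": 10}

TOKENS_PER_CALL = 2_000  # hypothetical: tokens processed per model call
WORKLOADS = {"chat turn": 1, "agent task": 100}  # hypothetical call counts


def task_compute_pflop(active_b: float, calls: int, tokens_per_call: int) -> float:
    """Rough forward-pass compute for one task, in PFLOP (1e15 FLOPs).

    Assumes ~2 FLOPs per active parameter per token; this is an
    approximation, not a measurement of any real deployment.
    """
    return 2 * active_b * 1e9 * tokens_per_call * calls / 1e15


for workload, calls in WORKLOADS.items():
    for name, active_b in ACTIVE_PARAMS_B.items():
        cost = task_compute_pflop(active_b, calls, TOKENS_PER_CALL)
        print(f"{workload:>10} | {name:<10} | {cost:8.3f} PFLOP")
```

Under these assumptions the ratio between a 32B-active and a 10B-active model is 3.2x in both workloads. What changes in the agent case is the absolute size of the gap, which is multiplied by every call in the loop.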

Total parameters tell you how impressive the model sounds. Active parameters tell you what the agent costs. They are not the same story, and the launch post only tells the first one.

The moat is post-training, not capacity

If active params are the cost story, the quality story is even less visible on a leaderboard. The benchmark everyone screenshots is SWE-bench Verified: Kimi K2 Thinking's vendor-reported 71.3%, MiniMax M2's 69.4%, GLM-4.6's ~68%, Qwen3-Coder's 67% (rising to ~69.6% in a 500-turn agentic harness). Treat every vendor-reported figure here as optimistic. These are single-task, often single-shot scores, and they tell you almost nothing about the failure mode that actually breaks production agents: degradation over a long run.
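One way to see why a single-shot score says so little about a long run is a toy model, offered here as an illustration rather than a description of how any of these models actually fail. It assumes each call succeeds independently with the same probability and that one unrecovered failure sinks the task. Real agents can retry and recover, so this is a pessimistic simplification; the reliability values and run lengths are hypothetical.

```python
def completion_probability(per_call_success: float, calls: int) -> float:
    """Chance of getting through `calls` sequential steps with no failure,
    assuming independent steps and no recovery (a deliberate simplification)."""
    return per_call_success ** calls


def required_per_call_success(target: float, calls: int) -> float:
    """Per-call reliability needed to finish `calls` steps with probability `target`."""
    return target ** (1.0 / calls)


# Hypothetical per-call reliabilities and run lengths.
for p in (0.99, 0.999):
    for calls in (1, 50, 300):
        done = completion_probability(p, calls)
        print(f"per-call {p:.3f}, {calls:>3} calls -> finishes {done:.1%} of the time")

needed = required_per_call_success(0.9, 300)
print(f"per-call reliability needed for 90% completion over 300 calls: {needed:.5f}")
```

Two models that look nearly identical on a single call can end up far apart after a few hundred, which is the gap a one-shot benchmark cannot show.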

Kimi K2 Thinking's standout claim isn't a benchmark at all. It's that the model stays coherent across roughly 200 to 300 sequential tool calls — the difference between an agent that finishes a multi-step task and one that quietly loses the plot on call #150. That is a property of the reinforcement-learning and post-training recipe, not of parameter count, and it's the single hardest thing to fake. It's also why the most credible signal in this whole comparison isn't a vendor table: Kimi K2 Thinking sits at #2 on Artificial Analysis's Agentic Index (behind GPT-5), and was formally evaluated by NIST's CAISI in late 2025 — third-party scrutiny the others haven't matched.

The flip side is that "agentic intelligence index" claims deserve the same skepticism as any other vendor number. MiniMax's self-reported intelligence scores have diverged sharply from independent re-runs, so weight its cost advantage, which is structural and verifiable, over its quality claims, which aren't yet.

Pick by failure mode

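The mistake is asking which model is best. Ask which way your agent fails, and the field sorts itself. If it fails on cost, because the loop runs long and every call is billed, MiniMax M2's 10B active parameters are the structural answer. If it fails on coherence, losing the plot deep into a run, Kimi K2 Thinking's stability across 200 to 300 tool calls, and the third-party scrutiny behind it, is what you are paying for.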

None of these is the "best open model," and the MoE economics are why: the line that decides your bill is the active-parameter count, and the property that decides whether your agent finishes its task isn't on the spec sheet at all. The leaderboard score lies about both. Choose for how your agent runs, not for how the model demos.