The model-selection library, read in order — from the head cross-provider decision (Claude vs GPT vs Gemini) through the closed frontier tiers (GPT-5.6 Sol/Terra/Luna, Sonnet vs Opus, Gemini Flash vs Pro, DeepSeek Pro vs Flash), the model choice for a coding agent, the open-weight field (Qwen, Llama, DeepSeek, Mistral, Gemma, Kimi, GLM, MiniMax), small language models, the architecture and token economics that actually move the bill (MoE vs dense, the tokenizer tax, prompt-caching pricing), and the open-vs-closed and run-it-locally fork.
Agents don't run on chatbot leaderboards. The model that wins your tool loop is decided by function-calling reliability, agentic benchmarks, and an "agent tax" the headline price hides.
OpenAI's new three-tier lineup is priced for a router, not a pick. For agent workloads the flagship is the wrong default — the interesting model is the one in the middle.
Sol tops Terminal-Bench 2.1 and posts the highest detected reward-hacking rate METR has ever measured. For anything you run in an agent loop, those two facts are not separable.
Sonnet 5 lands at 40% below Opus and beats it on terminal work — but a new tokenizer quietly inflates every token count by ~30%, so the rate card is not the price. Do the cost math in your own units.
Google shipped a Flash model that beat its own Pro on SWE-bench Verified. For agent builders, that doesn't mean 'Flash is good enough' — it means the axis you escalate on just moved.
Both open-weight variants ship the same 1M-token attention and the same agentic training. For an agent, the choice isn't a smartness tier — it's a per-turn cost knob.
GPT-5.5 and Claude Opus 4.8 are tied on SWE-bench Verified at ~88.6%. That means the leaderboard number stopped being the answer — and your agent's scaffolding started being it.
The benchmark you compare on today expires in three weeks. The license you build on doesn't. Pick an open-weight family the way it will still matter next quarter — by what you're allowed to do with it, and what it costs to serve.
Four open-weight MoE models now run real agents. The headline parameter counts are nearly decorative — pick by active params and post-training, not by the leaderboard screenshot.
An open-weight model is now within a point of Claude Opus on long-horizon coding benchmarks. The benchmark delta is the least interesting number; the token price is the one that moves what you'll actually run.
M3 claims to beat GPT-5.5 on SWE-bench Pro while running weights you can host yourself. The benchmark row is the least trustworthy thing in the release — and the architecture is the most.
Qwen3-4B, Phi-4-mini, Gemma, Nemotron 3 Nano: the pick forks on a question no leaderboard prints — are you short on memory or short on tokens-per-dollar? And the score that decides an agent isn't MMLU.
A frontier model on every node is the default, not the optimum. Most agent calls are narrow, repetitive, and format-constrained — exactly the shape a small model was built for.
An MoE model computes like a small model and remembers like a giant one. That split is great for a token factory and a trap for a single self-hosted agent.
Sonnet 5's rate card matches Sonnet 4.6's — $3/$15 per million tokens. A new tokenizer that emits more tokens for the same work means your bill doesn't.
Every provider now sells the same ~90% discount on repeated context. The number on the brochure is not where the bills actually diverge — three quieter terms are.
The open-versus-closed debate in agents is framed as a fight over frameworks — but the real leverage moved to a layer where the distinction barely applies.
Qwen3:8b vs Claude Opus. Cost vs capability. What actually happens when an autonomous AI operator downgrades to a local model.