On June 1, MiniMax released M3, and the headline wrote itself: an open-weight model that scores 59.0% on SWE-bench Pro — above GPT-5.5, above Gemini 3.1 Pro, within reach of Claude Opus 4.7 — while carrying a one-million-token context window and native computer use. The weights landed on Hugging Face within a week, quantized down to NVFP4 and GGUF before the technical report was even out. If you build agents, the pitch is intoxicating: frontier coding, on hardware you control, for a fraction of the API bill.
Here is the uncomfortable part. The single number everyone quoted — the SWE-bench Pro row — is the least trustworthy thing in the entire launch. And the parts nobody screenshotted are the ones that actually hold up.
What the benchmark table is, and isn't
Every figure in M3's launch materials (the 59.0% on SWE-bench Pro, the 66.0% on Terminal-Bench 2.1, the 74.2% on MCP Atlas, and the 83.5 on BrowseComp that edges out Claude Opus 4.7; see the table above) was produced by MiniMax, on MiniMax's infrastructure, in evaluation harnesses MiniMax configured, against baselines MiniMax selected. In several cases, Claude Code served as the agent scaffolding around the model. That is not an accusation of bad faith; it is the default condition of every model launch. But it means the numbers are claims, not measurements.
At launch, the two most-cited independent services — Artificial Analysis and LMArena — had no M3 score posted. The reviewer Thomas Wiegold put it plainly: "every one of those numbers is vendor-run, on MiniMax's own infrastructure, with baselines they picked." The historical regularity is that self-reported scores compress by three to eight points once a held-out harness gets hold of them. Knock five points off that 59.0% and the "beats GPT-5.5" claim quietly becomes "competitive with the pack" — a different, and much less viral, sentence.
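The haircut arithmetic is simple enough to keep in a scratch file. The sketch below applies the three-to-eight-point range quoted above to the vendor-reported 59.0%; the function name `adjusted_range` is invented for illustration, and the range is this article's rule of thumb, not a measured quantity.

```python
def adjusted_range(vendor_score: float,
                   min_drop: float = 3.0,
                   max_drop: float = 8.0) -> tuple[float, float]:
    """Return (low, high) for a vendor-reported score after an assumed haircut.

    min_drop and max_drop are the rule-of-thumb bounds from the text,
    not values derived from any independent evaluation.
    """
    if min_drop > max_drop:
        raise ValueError("min_drop must not exceed max_drop")
    return (round(vendor_score - max_drop, 1), round(vendor_score - min_drop, 1))


low, high = adjusted_range(59.0)
print(f"Vendor-reported: 59.0  plausible independent range: {low} to {high}")
```

Whether "beats" survives depends on where in that range the independent number lands, which is exactly why the row is a hypothesis until someone else runs it.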
So the model got the most attention for the claim that is easiest to inflate and slowest to verify. That inversion — trust running opposite to virality — is worth internalizing before you migrate a single agent.
The benchmark row is the easiest number to fake and the one that traveled fastest. The latency numbers are the hardest to fake and nobody screenshotted them.
The part that's structurally hard to lie about
Now look at what M3 is underneath. It's a Mixture-of-Experts model — roughly 428B total parameters, ~23B active per token — built on MiniMax Sparse Attention (MSA), a sparse-attention operator aimed squarely at the million-token regime. MiniMax's own figures: versus the prior generation, MSA cuts per-token compute at 1M context to about one-twentieth, with more than 9× faster prefill and 15× faster decoding.
Here is the thing about those claims: unlike a SWE-bench score, they cannot be flattered by a hand-picked baseline. From the day the weights hit Hugging Face, anyone with the GPUs could load M3 and time it. Prefill and decode throughput at 128K, 512K, and 1M context are measurements, not assertions: reproducible on your own hardware, with your own harness, in an afternoon. If MSA's economics were fiction, the community would have said so by now; instead the quantizers moved first, which is its own kind of vote.
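For a sense of what "time it" means in practice, here is a minimal sketch of a harness that separates prefill from decode. It assumes your serving stack can hand you tokens as a lazy stream; `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in you would replace with a real streaming call to whatever runtime you load the weights into.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(
    start_stream: Callable[[], Iterable[str]],
    prompt_tokens: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Time one generation: prefill is the wait for the first token, decode is the rest."""
    start = clock()                      # start timing before the request is issued
    first = last = None
    count = 0
    for _ in start_stream():
        last = clock()
        if first is None:
            first = last                 # arrival of the first token ends prefill
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    decode_s = last - first
    return {
        "ttft_s": ttft,
        "prefill_tok_per_s": prompt_tokens / ttft if ttft > 0 else 0.0,
        # The first token belongs to prefill, so decode covers count - 1 tokens.
        "decode_tok_per_s": (count - 1) / decode_s if decode_s > 0 else 0.0,
    }


def fake_stream(prefill_s: float = 0.05, per_token_s: float = 0.002, n_tokens: int = 20):
    """Placeholder for a real model: sleeps to imitate prefill, then emits tokens."""
    time.sleep(prefill_s)
    for _ in range(n_tokens):
        time.sleep(per_token_s)
        yield "tok"


result = measure_stream(fake_stream, prompt_tokens=128_000)
print({name: round(value, 2) for name, value in result.items()})
```

Run it once per context length you care about, with a prompt of that length, and the figures you get are yours rather than the vendor's.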
This is the real news. Long context has been "supported" by many models and affordable on almost none, because the cost of standard attention grows roughly quadratically with context length, so it scales against you exactly when the context gets interesting. An agent that reads a whole repository, or holds a multi-hour computer-use session, is precisely the workload that makes standard attention uneconomic. MSA is a bet that the sparse-attention operator, not the leaderboard row, is what unlocks agents you can actually afford to run long.
"Open" has an asterisk
One more thing the excitement skated past: M3 is not Apache, and not MIT. The weights ship under the MiniMax Community License — you can download and self-host, but the terms are the company's own, not an OSI-approved permissive license. For a weekend project that distinction is noise. For anyone putting M3 in a product, it is the first paragraph of the due-diligence memo, above any benchmark. "You can run it" and "you can run it under terms your lawyer will sign" are different sentences, and the gap between them is exactly where a lot of "open" models quietly stop being useful.
How to actually read this release
Treat the SWE-bench Pro headline as a hypothesis with a pending independent test, not a result. Watch Artificial Analysis and LMArena; expect the number to settle a few points lower and value it accordingly. Then verify for yourself the claim that matters for your bill: pull the weights, measure decode throughput at the context length your agent actually uses, and price it against your current API and against the other open-weight models agents are built on, where M3's predecessor M2 already sat in the mix. If MSA holds up (and the architecture, unlike the leaderboard, is measurable), M3's contribution isn't that an open model "won" a benchmark. It's that a million-token, multimodal, tool-using agent became something you can host, on terms you'll want to read twice.
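The pricing step is back-of-the-envelope arithmetic once you have a measured throughput. A minimal sketch, in which every number is a placeholder to be replaced with your own GPU rate, GPU count, measured tokens per second, and API price (none of them come from MiniMax or any provider's price list):

```python
def self_host_cost_per_million(gpu_hourly_usd: float,
                               num_gpus: int,
                               tokens_per_second: float) -> float:
    """USD per one million generated tokens on hardware billed by the hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# Hypothetical inputs: substitute your own measurements and prices.
GPU_HOURLY_USD = 2.0            # assumed rental price per GPU-hour
NUM_GPUS = 8                    # assumed node size
MEASURED_TOK_PER_S = 1000.0     # aggregate decode throughput you measured
API_USD_PER_MILLION = 10.0      # assumed price of the API you use today

cost = self_host_cost_per_million(GPU_HOURLY_USD, NUM_GPUS, MEASURED_TOK_PER_S)
print(f"self-hosted: ${cost:.2f} per 1M tokens, API: ${API_USD_PER_MILLION:.2f}")
print("self-hosting is cheaper" if cost < API_USD_PER_MILLION else "the API is cheaper")
```

The comparison assumes the hardware stays busy; idle GPU hours raise the effective per-token cost, so utilization belongs in the memo alongside the license.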
The frontier-coding claim will be adjudicated by a harness nobody at MiniMax controls. The latency claim you can adjudicate before lunch. Spend your skepticism where it's cheap to resolve.



