There is a tell in how a company names a model. When Unisound (the Beijing speech-and-language lab better known by its Chinese name, 云知声) released U2 at the end of June, it did not call it a chat model, an assistant, or a reasoning model. It called it a "native agentic large model built for execution," and the verb it kept reaching for in the launch material was "complete": U2, the company says, can autonomously decompose and finish workflows of 100-plus steps.

That phrasing is doing more work than the benchmark table underneath it.

The number people will quote, and the one that matters

The spec sheet is respectable and, in 2026, unsurprising. U2 is a sparse mixture-of-experts model: 266B total parameters, ~10B active per token, which Unisound frames with the slogan "high intelligence density × high token value" and a claim that it burns roughly a quarter of the tokens a trillion-scale dense model would on the same work. It lists for $0.15 per million input tokens and $0.30 per million output on Unisound's Token Hub. Its self-reported scores — GPQA Diamond in the high 80s, SWE-bench Verified around 75, GDPval in the low 70s — are placed against GLM-5.1, DeepSeek-V4-Flash, and MiniMax M2.7 rather than the Western frontier.
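To make the sparsity and pricing figures concrete, here is a rough back-of-the-envelope sketch. The parameter counts and list prices are the ones quoted above; the workload (steps and tokens per step) is purely an assumption for illustration, not anything Unisound has published.

```python
# Figures quoted in the article (self-reported by Unisound).
TOTAL_PARAMS_B = 266.0    # total parameters, in billions
ACTIVE_PARAMS_B = 10.0    # approximate active parameters per token, in billions
INPUT_USD_PER_M = 0.15    # list price per million input tokens
OUTPUT_USD_PER_M = 0.30   # list price per million output tokens


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights that participate in each token's forward pass."""
    return active_b / total_b


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """List-price cost of one run, given its total token counts."""
    input_cost = (input_tokens / 1e6) * INPUT_USD_PER_M
    output_cost = (output_tokens / 1e6) * OUTPUT_USD_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    share = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    print(f"active share of weights: {share:.1%}")

    # ASSUMPTION: a 100-step workflow averaging 20,000 input tokens and
    # 1,000 output tokens per step. These numbers are hypothetical.
    steps, in_per_step, out_per_step = 100, 20_000, 1_000
    cost = run_cost_usd(steps * in_per_step, steps * out_per_step)
    print(f"hypothetical 100-step run: ${cost:.2f}")
```

Under those assumed token counts, roughly 3.8% of the weights are active per token and a full 100-step run lists at about $0.33. The real figure depends entirely on how much context each step re-reads.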

Set those numbers aside for a second, because they are the least interesting thing here. A SWE-bench Verified of ~75 is mid-pack: the June leaderboards put Fable 5 near 95 and Opus 4.8 near 89 on the same test. If you were shopping for a coding model on raw capability, U2 would not be the headline.

The headline is the word "native."

What "native agentic" is actually claiming

For two years the dominant way to build an agent has been to take a conversational model and wrap it in scaffolding — a framework like LangGraph or CrewAI that holds the plan, decides which tool to call, feeds the result back, and loops until something looks done. The model is a stateless oracle; the agency lives in your Python.
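The scaffolded pattern described above can be sketched in a few lines. This is a minimal illustration, not LangGraph's or CrewAI's actual API: `stub_model`, `TOOLS`, and `run_scaffolded` are hypothetical names, and the "model" is a keyword stub standing in for a stateless chat-completion call.

```python
from typing import Callable, Dict, List

Tool = Callable[[str], str]


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a stateless chat model.

    It sees only the prompt it is handed and returns a tool name.
    """
    last_line = prompt.strip().splitlines()[-1]
    return "search" if "find" in last_line.lower() else "write"


# Hypothetical tool registry owned by the framework.
TOOLS: Dict[str, Tool] = {
    "search": lambda arg: f"results for: {arg}",
    "write": lambda arg: f"draft of: {arg}",
}


def run_scaffolded(
    plan: List[str],
    model: Callable[[str], str],
    tools: Dict[str, Tool],
    max_steps: int = 10,
) -> List[str]:
    """The framework holds the plan, the history, and the stop condition."""
    history: List[str] = []
    for step in plan[:max_steps]:
        # The model keeps no state, so the harness re-sends the history each call.
        prompt = "\n".join(history + [f"Next step: {step}"])
        tool_name = model(prompt)
        if tool_name not in tools:
            # Error handling also lives in the harness, not in the model.
            history.append(f"{step} -> error: unknown tool {tool_name!r}")
            continue
        history.append(f"{step} -> {tools[tool_name](step)}")
    return history


if __name__ == "__main__":
    for line in run_scaffolded(["find pricing data", "draft summary"], stub_model, TOOLS):
        print(line)
```

Everything that makes this an "agent" (the plan, the memory, the loop bound, the error branch) sits in `run_scaffolded`; the model only answers one question at a time.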

"Native agentic" inverts that. The claim is that the loop — planning, sub-task decomposition, tool selection, error recovery, knowing when to stop — has been trained into the weights through long-horizon post-training, so the model drives its own execution instead of waiting to be driven. This isn't a marketing invention unique to Unisound; it's a research direction with a name. A recent survey, Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI, traces exactly this move — from "orchestrated pipelines wrapped around a chat model" toward "a unified policy that internalizes perception, planning, grounding, and action," optimized over long horizons with reinforcement learning. UI-TARS and GUI-Owl made the argument for screen agents; U2 is making it for general workflow execution.

The interesting thing about U2 isn't whether it beats GPT on a benchmark. It's where the agent loop lives — and Unisound is betting it should live inside the model, not inside your framework.

Why this is a threat to the layer above it

Here is the non-obvious consequence, and it has nothing to do with China-versus-West model races. If model-native agency genuinely works, it re-prices the orchestration layer.

The entire premise of the framework ecosystem — the harnesses, the graph builders, the "agent middleware" — is that the model can't be trusted to run itself, so you need an external state machine to keep it on the rails. Every retry policy, every planner node, every hand-rolled "reflect on your last step" prompt is scaffolding that exists because the model wasn't trained to do it. A model that decomposes and completes 100 steps on its own doesn't eliminate that layer, but it thins it: the framework stops being the brain and goes back to being plumbing — auth, observability, the tool registry, the audit trail. The reasoning it was faking with prompt engineering moves down a level.
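What "plumbing" might look like once the model owns the loop: a harness that only checks permissions, executes tools, and records an audit trail. This is a sketch under stated assumptions. `StubNativeAgent` is a hypothetical stand-in for a model that chooses its own next action and decides when to stop; it does not reflect U2's actual interface, which the launch material does not describe.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Action = Tuple[str, str]  # (tool name, argument)


class StubNativeAgent:
    """Hypothetical model-native agent: it plans and stops on its own."""

    def __init__(self, goal: str) -> None:
        # A real model would plan internally; this stub uses a fixed script.
        self._pending: List[Action] = [("search", goal), ("write", goal)]

    def next_action(self, last_result: Optional[str]) -> Optional[Action]:
        """Return the next tool call, or None when the agent decides it is done."""
        return self._pending.pop(0) if self._pending else None


def run_thin_harness(
    agent: StubNativeAgent,
    registry: Dict[str, Callable[[str], str]],
    allowed: Set[str],
    max_actions: int = 200,
) -> List[Dict[str, str]]:
    """No planner and no retry logic: only permission checks, execution, logging."""
    audit: List[Dict[str, str]] = []
    result: Optional[str] = None
    for _ in range(max_actions):  # safety cap, not a plan
        action = agent.next_action(result)
        if action is None:
            break
        name, arg = action
        if name in allowed and name in registry:
            result = registry[name](arg)
        else:
            result = "denied"
        audit.append({"tool": name, "arg": arg, "result": result})
    return audit


if __name__ == "__main__":
    registry = {"search": lambda a: f"results for: {a}", "write": lambda a: f"draft of: {a}"}
    for entry in run_thin_harness(StubNativeAgent("Q3 report"), registry, {"search"}):
        print(entry)
```

The harness never decides what happens next. It can refuse a call or cap the run, and it keeps the record a reviewer would need afterwards.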

That also changes what you should measure. Benchmarking a native-agentic model on single-turn GPQA is like judging a marathoner by a 40-yard dash. The honest test is the thing Unisound is actually claiming and the thing no press release can prove: what fraction of 100-step real-world tasks does it finish, end to end, without a human catching it drifting? That's completion rate over a long horizon, not accuracy on a question.
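A short sketch shows why long-horizon completion is a much harsher metric than per-question accuracy. It assumes, purely for illustration, that each step succeeds independently with the same probability; real agent errors are correlated and sometimes recoverable, so treat this as intuition rather than a model of U2. The record fields (`steps_done`, `steps_total`, `human_intervened`) are hypothetical.

```python
from typing import Dict, List


def compounded_completion(per_step_success: float, steps: int) -> float:
    """End-to-end success if every step must succeed (independence ASSUMED)."""
    return per_step_success ** steps


def required_per_step(target_completion: float, steps: int) -> float:
    """Per-step reliability needed to hit a target end-to-end rate (same assumption)."""
    return target_completion ** (1.0 / steps)


def completion_rate(runs: List[Dict]) -> float:
    """Fraction of runs that finished every step with no human intervention."""
    if not runs:
        return 0.0
    finished = sum(
        1
        for r in runs
        if r["steps_done"] == r["steps_total"] and not r["human_intervened"]
    )
    return finished / len(runs)


if __name__ == "__main__":
    for p in (0.95, 0.99, 0.999):
        print(f"per-step {p:.3f} -> 100-step completion {compounded_completion(p, 100):.3f}")
    print(f"needed per step for 90% over 100 steps: {required_per_step(0.90, 100):.5f}")

    # Hypothetical evaluation log: four 100-step tasks.
    runs = [
        {"steps_done": 100, "steps_total": 100, "human_intervened": False},
        {"steps_done": 100, "steps_total": 100, "human_intervened": True},
        {"steps_done": 63, "steps_total": 100, "human_intervened": False},
        {"steps_done": 100, "steps_total": 100, "human_intervened": False},
    ]
    print(f"observed completion rate: {completion_rate(runs):.2f}")
```

Under the independence assumption, a model that is right 99% of the time at each step finishes a 100-step task only about 37% of the time, which is why a strong single-turn score says little about the claim Unisound is actually making.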

The skeptic's column

So treat U2 as a signal, not a recommendation. Three things keep it in the "watch" pile rather than the "switch" pile:

None of that makes U2 unimportant. The model-native bet is the most consequential architectural argument in agents right now: it says the last two years of framework-building were a stopgap for a capability that belongs in the weights. Unisound is not the first to make that bet, and U2 is unlikely to be the model that settles it. But the naming is a forecast — and once a few labs start shipping models that run their own loops, the question every agent team will face is no longer "which framework," but "how much of my framework was just compensating for a model that hadn't been taught to drive."