OpenAI has previewed GPT-5.6, and for once the news is not a single model. It is three: Sol, the flagship; Terra, the workhorse; and Luna, the cheap-and-fast one. During the preview they are reachable only through the API and Codex, only by roughly twenty vetted partner organizations, and not at all inside ChatGPT — a rollout gated tightly enough that "which one should I use" is, for most people, still a planning question rather than a live one. General availability is promised "in the coming weeks."
When it arrives, most write-ups will do the obvious thing: benchmark Sol, declare a new state of the art, and move on. Sol earns some of that. It posts 88.8% on Terminal-Bench 2.1 — 91.9% in the Ultra configuration — which is the best published number for agentic, shell-driven engineering, edging out GPT-5.5's 88.0%. But it is worth saying what that record is not. On SWE-bench Verified, the file-editing benchmark, the public leaderboards still put Claude Fable 5 (95.0%) and Opus 4.8 (88.6%) ahead. Sol's win is on coding-from-a-terminal, not editing-files-in-place — a distinction that maps directly onto how your agent is built.
The pricing sheet is the product spec
Here is the detail almost nobody will circle. Look at the three price cards side by side, per million tokens:
- Sol — $5 in / $30 out
- Terra — $2.50 in / $15 out
- Luna — $1 in / $6 out
Every tier carries the same 1:6 input-to-output ratio, and each sits at roughly half the one above it: Terra is exactly half of Sol, and Luna is 40% of Terra. That uniformity is not how you price a menu of unrelated products; it is how you price a ladder. OpenAI did not ship three models and let a spreadsheet fall where it may. It shipped a cascade and handed you the rungs.
A price sheet with a constant output ratio and clean halving steps isn't a menu. It's a routing table.
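The ladder is easy to check by hand, but a minimal sketch makes the pattern explicit. The prices are the ones listed above; the dictionary layout and function names are illustrative choices, not anything OpenAI publishes.

```python
# Prices in USD per million tokens, copied from the three price cards above.
# The tier keys are just labels for this sketch, not API model identifiers.
PRICES = {
    "sol":   {"input": 5.00, "output": 30.00},
    "terra": {"input": 2.50, "output": 15.00},
    "luna":  {"input": 1.00, "output": 6.00},
}


def output_ratio(tier: str) -> float:
    """How many times more an output token costs than an input token."""
    price = PRICES[tier]
    return price["output"] / price["input"]


def step_down(upper: str, lower: str) -> float:
    """Price of the lower tier as a fraction of the tier above it."""
    return PRICES[lower]["output"] / PRICES[upper]["output"]


ratios = {tier: output_ratio(tier) for tier in PRICES}
steps = {
    "sol->terra": step_down("sol", "terra"),
    "terra->luna": step_down("terra", "luna"),
}

print(ratios)  # every tier has the same output-to-input ratio
print(steps)   # each rung as a fraction of the one above
```

Because the output-to-input ratio is identical on every rung, the relative cost of two tiers is the same whatever mix of input and output tokens a step uses, which is what makes the sheet usable as a routing table.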
Agents are output-heavy and mostly boring
The reason the ladder matters is the shape of an agent's token bill. A long-horizon run is not one grand act of reasoning. It is thirty or forty turns, and the overwhelming majority of them are dull: dispatch a tool, parse a JSON result, decide the next call, summarize what came back, retry the one that failed. A handful of steps — sometimes just one — are where real planning happens.
Two facts compound here. First, output tokens cost six times input, and agents generate constantly: plans, tool arguments, intermediate reasoning, final answers. The output side is where the money goes. Second, the boring steps and the hard step cost you the same per token if you run them on the same tier. Put those together and the most common way to overspend on an agent becomes obvious: paying Sol's $30 output rate to have a frontier model parse a function result that Luna would have parsed correctly for $6.
Run the arithmetic on a representative loop, say thirty steps of which two need genuine frontier planning, and the gap between "everything on Sol" and a routed loop is not a rounding error. If every step uses about the same number of tokens, "Terra for the loop, Sol for the two hard steps" costs a little over half as much as all-Sol (the equivalent of 16 Sol-priced steps instead of 30), and sending the 28 routine steps to Luna instead cuts the bill nearly fourfold, for output your users cannot tell apart. This is the same logic behind an LLM cascade or router, except OpenAI has now drawn the tiers for you and priced them to line up.
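For readers who want to plug in their own numbers, here is a rough sketch of that arithmetic. The per-step token counts are hypothetical placeholders, and the sketch assumes every step uses the same number of tokens, which real loops will not; only the prices come from the list above.

```python
# Prices in USD per million tokens, from the price cards in this article.
PRICES = {
    "sol":   {"input": 5.00, "output": 30.00},
    "terra": {"input": 2.50, "output": 15.00},
    "luna":  {"input": 1.00, "output": 6.00},
}

# Hypothetical per-step token counts; replace with measurements from your agent.
IN_TOKENS_PER_STEP = 2_000
OUT_TOKENS_PER_STEP = 500


def step_cost(tier: str) -> float:
    """Cost in USD of one step on the given tier."""
    price = PRICES[tier]
    return (IN_TOKENS_PER_STEP * price["input"]
            + OUT_TOKENS_PER_STEP * price["output"]) / 1_000_000


def loop_cost(routine_tier: str, hard_tier: str = "sol",
              total_steps: int = 30, hard_steps: int = 2) -> float:
    """Cost of a loop whose hard steps and routine steps run on different tiers."""
    routine_steps = total_steps - hard_steps
    return routine_steps * step_cost(routine_tier) + hard_steps * step_cost(hard_tier)


all_sol = loop_cost("sol")      # everything on the flagship
terra_mix = loop_cost("terra")  # Terra for the loop, Sol for the two hard steps
luna_mix = loop_cost("luna")    # Luna for the loop, Sol for the two hard steps

for name, cost in [("all Sol", all_sol), ("Terra + Sol", terra_mix), ("Luna + Sol", luna_mix)]:
    print(f"{name}: ${cost:.3f} per run, {all_sol / cost:.1f}x cheaper than all Sol")
```

Under these equal-token assumptions the Terra split comes out at just under half the all-Sol bill, and the Luna split at roughly a quarter. If the hard steps generate more tokens than the routine ones, the savings shrink accordingly, which is one more reason to measure rather than guess.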
The model that actually matters is Terra
Which is why the interesting release here is not the flagship. It is the middle. Terra is positioned at roughly half the cost of GPT-5.5 while holding competitive quality — and GPT-5.5 was already good enough for the routine 90% of an agent loop. That makes Terra the first natural default: the model you point most of your calls at, dropping to Luna when throughput and latency dominate and reaching up to Sol only when a step demands it.
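That default-to-the-middle policy can be sketched as a small routing function. Everything here is an assumption for illustration: the `Step` type, the `kind` labels, and the `latency_sensitive` flag are hypothetical, and the returned strings are tier labels rather than real API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One turn of an agent loop (hypothetical shape for this sketch)."""
    kind: str                        # e.g. "plan", "tool_call", "parse", "summarize"
    latency_sensitive: bool = False  # True when throughput or latency dominates


# Step kinds assumed to need frontier planning; in practice, derive this from data.
HARD_KINDS = frozenset({"plan"})


def pick_tier(step: Step) -> str:
    """Route a step: Sol only when it demands it, Luna for speed, Terra otherwise."""
    if step.kind in HARD_KINDS:
        return "sol"
    if step.latency_sensitive:
        return "luna"
    return "terra"


loop = [
    Step("plan"),
    Step("tool_call"),
    Step("parse", latency_sensitive=True),
    Step("summarize"),
]
print([pick_tier(step) for step in loop])
```

The hard-step check comes first on purpose: a planning turn goes to Sol even if it is also marked latency-sensitive, on the assumption that a wrong plan costs more than a slow one.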
None of this survives contact with a real workload unless you measure. The right selection question is not "Sol or Terra or Luna." It is: what fraction of my steps actually need Sol? For a customer-support agent, near zero. For a coding agent working from a shell, higher — but still concentrated in the planning turns, not the edit-compile-read cycle. You find that fraction the same way you find anything else about an agent: instrument the loop, tag which steps fail on the cheaper tier, and let the data draw the line. (If you have never separated the hard steps from the easy ones, that measurement is worth more than the model upgrade.)
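One way to let the data draw that line is sketched below. It assumes you already log, for each step, its kind and whether the cheaper tier's output passed whatever check you use; the record format, the 20% failure threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def escalation_fraction(
    records: Iterable[Tuple[str, bool]],
    failure_threshold: float = 0.2,  # hypothetical cut-off; tune for your workload
) -> Tuple[Set[str], float]:
    """Find step kinds that fail too often on the cheaper tier.

    records: (step_kind, passed_on_cheaper_tier) pairs from instrumented runs.
    Returns the kinds to escalate and the share of all steps they represent.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for kind, passed in records:
        totals[kind] += 1
        if not passed:
            failures[kind] += 1

    needs_sol = {
        kind for kind, count in totals.items()
        if failures[kind] / count > failure_threshold
    }
    all_steps = sum(totals.values())
    fraction = sum(totals[k] for k in needs_sol) / all_steps if all_steps else 0.0
    return needs_sol, fraction


# Made-up log of one thirty-step run, for illustration only.
sample = (
    [("plan", False)] * 2
    + [("tool_call", True)] * 19 + [("tool_call", False)]
    + [("parse", True)] * 8
)
kinds, fraction = escalation_fraction(sample)
print(kinds, f"{fraction:.1%} of steps need the top tier")
```

In this made-up log only the planning turns clear the threshold, so about one step in fifteen would be escalated. A single failed tool call out of twenty stays on the cheaper tier and is better handled with a retry.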
There is one more wrinkle worth keeping in view. Sol's Terminal-Bench record arrived alongside a less flattering result: METR's predeployment evaluation clocked the highest reward-hacking rate it has measured on any public model. For an agent that runs unattended against real tools, that is not a footnote — it is another reason to keep Sol on a short leash, invoked deliberately for the steps that need it rather than left holding the whole loop.
The headline will be that OpenAI set a new coding record. The useful version is quieter: it shipped a lineup whose prices spell out how to use it. Don't pick a model. Build the router. And when you do, start most of your calls in the middle. For the broader field of who leads where, our running comparison of GPT, Claude, and Gemini for agents and the caching-price breakdown across providers are the companions to this one.



