Every team that watches its inference bill eventually has the same idea: most prompts are easy, so why pay frontier prices for all of them? Send the easy ones to a cheap model, keep the hard ones on the expensive one, and the bill falls without anyone noticing a drop in quality. That idea has a name now, model routing, and three projects that will offer it to you in different forms. The interesting question in 2026 isn't which one is best. It's whether the idea works as well as the pitch.

First, the distinction that the word "routing" quietly smudges. A gateway like LiteLLM or Portkey also "routes," but it routes by rules you wrote: this virtual key goes to that model, fall back to a second provider on a 529, cut traffic off at this budget. It never guesses. A model router guesses on purpose. It looks at the prompt and predicts whether the cheap model will produce an answer as good as the expensive one would, then sends it there. Gateways are deterministic plumbing. Routers are a learned bet placed on every request.
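The contrast can be made concrete with a minimal sketch. Everything here is hypothetical: the model names, the `rules` table, the 0.7 threshold, and especially `toy_predictor`, which stands in for a trained classifier and is not how any of the projects below actually score prompts.

```python
from typing import Callable

# Hypothetical model names, used only for illustration.
STRONG = "strong-model"
WEAK = "weak-model"


def gateway_route(virtual_key: str, rules: dict[str, str], default: str = STRONG) -> str:
    """Gateway-style routing: a lookup in rules you wrote. It never guesses."""
    return rules.get(virtual_key, default)


def router_route(
    prompt: str,
    predict_weak_ok: Callable[[str], float],
    threshold: float = 0.7,
) -> str:
    """Router-style routing: a bet on each request.

    predict_weak_ok returns an estimated probability that the weak model's
    answer would be as good as the strong model's. The request goes to the
    weak model only when that estimate clears the threshold.
    """
    return WEAK if predict_weak_ok(prompt) >= threshold else STRONG


def toy_predictor(prompt: str) -> float:
    # Stand-in for a learned model (assumption): shorter prompts score as easier.
    return max(0.0, 1.0 - len(prompt.split()) / 50)
```

The gateway function returns the same model for the same key every time. The router function can send two requests from the same key to different models, and it can be wrong about either one.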

The open one that started the conversation

A framework for serving and evaluating learned LLM routers, with pre-trained routers and an OpenAI-compatible server
★ 5k · Python · lm-sys/RouteLLM

RouteLLM, from the LMSYS team behind Chatbot Arena, is the project that made routing legible. It trains routers — a matrix factorization model, a BERT classifier, a few others — on human preference data: pairs of answers people judged, augmented with GPT-4-as-judge labels, to learn where a weak model's answer is indistinguishable from a strong one's. The ICLR 2025 paper is where the famous numbers come from: on MT-Bench, the matrix-factorization router reaches 95% of GPT-4's quality while calling GPT-4 on only 26% of prompts, and with data augmentation that drops to 14% — the headline "up to 85% cheaper" figure.

Read those numbers precisely, because they are doing more work than the slogan suggests. They are measured on MT-Bench, a benchmark of broad, chatty questions where a cheap model is often good enough. The further your traffic sits from that distribution (narrow domains, structured extraction, agent tool calls where one wrong route breaks a chain), the less the router's learned win rate transfers. RouteLLM is the honest place to start precisely because it hands you the eval harness too: you can measure the routable share of your own traffic instead of inheriting someone else's.

The commercial pair that sells the managed version

An open "routing on random forest" framework — pre-trained model-pair routers using embeddings plus a tunable confidence threshold
★ 200 · Python · Not-Diamond/RoRF

NotDiamond and Martian take the same premise and remove the homework. Both expose a drop-in, OpenAI-compatible endpoint: you change a base URL, and a hosted meta-model decides where each request goes, with a max-cost knob and cross-provider failover. NotDiamond will train a custom router on your own eval data and open-sources part of its research stack (the RoRF repo above); Martian keeps its "model mapping" approach closed and is the most heavily funded of the three. Their marketing lands in the same place — NotDiamond claims 20–40% savings with no quality loss; Martian advertises cuts as steep as 97% while "often beating GPT-4."

The managed version buys you a real thing: you skip building, calibrating, and maintaining a router, and you get one that improves from production feedback. What you give up is visibility into the bet being placed on your traffic, and you add a network hop and a per-request prediction whose cost the savings have to clear.

What the neutral scorekeeper found

Here is what the three pitches have in common, and what they leave out: every savings figure above was measured by the party promoting the router. When an independent group built RouterArena to score routers on a common footing, the result was not a clean leaderboard win for the commercial options. It found no router optimal across all metrics, and it ranked a leading commercial router 12th, specifically for over-selecting expensive models. The thing you'd buy to stop overpaying was, on a neutral bench, the one overpaying.

That isn't a reason to dismiss routing. It's the reason to frame it correctly.

A router is not free money. It's an inference-shaped cost you add in the hope of removing a larger one — and that trade only clears above a threshold.

The router itself is a model call or an embedding pass: NotDiamond's own figures put the added decision latency around 100–200ms per step. So routing pays off when two conditions hold together — the price gap between your strong and weak model is large, and a meaningful fraction of your traffic is genuinely routable to the cheap one. A workload that's mostly hard reasoning, or where strong and weak models are close in price, will spend latency to save pennies. A workload that's mostly easy chat over a 10× price gap is where the 85% headline lives.
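The arithmetic behind that threshold is short enough to write down. This is a back-of-the-envelope sketch, not a pricing model: it assumes a flat per-request cost for each model, a fixed per-request router overhead, and that every request sent to the cheap model is answered acceptably. All numbers in the demo are made up, and latency is ignored.

```python
def routed_cost(strong: float, weak: float, routable: float, overhead: float) -> float:
    """Average cost per request when a share `routable` goes to the weak model.

    strong, weak: per-request cost of each model (same currency unit).
    routable:     fraction of traffic sent to the weak model, between 0 and 1.
    overhead:     per-request cost of the routing decision itself.
    """
    return routable * weak + (1.0 - routable) * strong + overhead


def savings_fraction(strong: float, weak: float, routable: float, overhead: float) -> float:
    """Fraction of the all-strong bill that routing removes (negative means a loss)."""
    return 1.0 - routed_cost(strong, weak, routable, overhead) / strong


def break_even_share(strong: float, weak: float, overhead: float) -> float:
    """Smallest routable share at which routing stops losing money."""
    gap = strong - weak
    if gap <= 0:
        return float("inf")  # no price gap, so routing can never pay for itself
    return overhead / gap


if __name__ == "__main__":
    # Illustrative numbers only: a 10x price gap with mostly easy traffic,
    # then a narrow gap with mostly hard traffic.
    print(savings_fraction(strong=10.0, weak=1.0, routable=0.8, overhead=0.1))
    print(savings_fraction(strong=10.0, weak=8.0, routable=0.2, overhead=0.1))
```

Savings per request work out to `routable * (strong - weak) - overhead`, which is why both conditions have to hold together: a wide gap multiplied by a tiny routable share, or a large share multiplied by a narrow gap, leaves little to set against the overhead.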

How to actually decide

Don't pick a router. Measure your routable share first. Run RouteLLM's evaluator against a sample of real production prompts and see what fraction the cheap model answers within tolerance — that number, not the vendor's, is your ceiling. If it's small, the bill isn't a routing problem and no product will fix it. If it's large and the price gap is wide, you have a real lever, and the only question left is build-versus-buy: self-host RouteLLM when you want to own the decision and audit it, pay NotDiamond or Martian when you'd rather rent the calibration and the upkeep.
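As a rough illustration of what "within tolerance" means, here is a minimal sketch of the measurement. It is not RouteLLM's evaluator. It assumes you have already run both models on the same sample of prompts and scored each answer on a common scale (by a judge model, a rubric, or human review); the function name, the scores, and the 0.05 tolerance are all hypothetical.

```python
from typing import Sequence


def routable_share(
    strong_scores: Sequence[float],
    weak_scores: Sequence[float],
    tolerance: float = 0.05,
) -> float:
    """Fraction of prompts where the weak model scores within `tolerance`
    of the strong model (or beats it).

    The two sequences must be aligned: index i in each refers to the same prompt.
    """
    if len(strong_scores) != len(weak_scores):
        raise ValueError("score lists must be the same length")
    if not strong_scores:
        raise ValueError("need at least one scored prompt")
    within = sum(
        1 for s, w in zip(strong_scores, weak_scores) if w >= s - tolerance
    )
    return within / len(strong_scores)
```

The result is an upper bound, not a forecast: it is what a perfect router could send to the cheap model on this sample. A real router will capture less of it, and the sample only speaks for traffic that looks like it.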

The deeper signal is in where the category is drifting. Gateways are bolting on auto-routers; routers are shipping drop-in OpenAI endpoints that look exactly like gateways. They're converging because the durable layer is the gateway — the place every request already passes through — and learned routing is best understood as one optional feature of that layer, switched on where the math works, rather than a product you buy on the promise of a number someone else measured.