Every team picking an embedding model starts in the same place: the MTEB leaderboard. They sort by average score, take whatever is on top, and ship it. This is the wrong first move, and the reason is structural — the leaderboard measures the one thing that barely affects your bill, and ignores the three things that do.

The leaderboard is the least useful number

MTEB has become a victim of its own success. With hundreds of models clustered within a point or two of each other, the board now rewards overfitting to its own task distribution. Worse, its retrieval datasets (the BEIR suite) are no longer truly zero-shot, because those datasets routinely end up in training pipelines. The community knows this: MMTEB, the 2025 expansion to 500-plus tasks across 250-plus languages, found that the best publicly available model was a 560M-parameter encoder, not the largest one on offer. Neither scale nor leaderboard rank is the same thing as retrieval quality, and retrieval quality on a benchmark is not the same thing as retrieval quality on your data.

So treat the headline as a coarse filter — it tells you which models are credible — and then make the decision on the three levers that actually compound.

Lever one: the cost is in the database, not the API

Here is the accounting mistake almost everyone makes. The embedding API is a one-time charge: you pay per token to encode a document, roughly $0.02 to $0.18 per million tokens depending on the model, and then you never pay to encode that document again. OpenAI's text-embedding-3-small sits at the floor (~$0.02/1M); 3-large and Voyage's larger models run higher. These differences are real but small, and they're paid once.

The recurring cost, the one that scales with your corpus forever, lives in the vector database. It is a function of dimensions × precision × number of vectors. A 3072-dimensional float32 vector takes 12,288 bytes: four times the storage of a 768-dim float32 vector, and 192 times that of a 512-dim binary one (64 bytes). Multiply that across a hundred million chunks and the gap between "expensive model" and "cheap model" at the API disappears next to the gap between "stored at full width" and "stored compressed."
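The arithmetic is easy to check. The sketch below is a back-of-the-envelope calculator, not a pricing tool: the function name and the four configurations are illustrative choices, and it counts raw vector bytes only, ignoring index structures, metadata, and replication, all of which add overhead in a real vector database.

```python
def index_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector storage in bytes: vectors x dimensions x precision."""
    return num_vectors * dims * bits_per_dim // 8


N = 100_000_000  # a hundred million chunks, as in the text

# (dimensions, bits per dimension) - illustrative configurations
configs = {
    "3072-dim float32": (3072, 32),
    "768-dim float32": (768, 32),
    "512-dim int8": (512, 8),
    "512-dim binary": (512, 1),
}

footprints = {name: index_bytes(N, d, b) for name, (d, b) in configs.items()}
baseline = footprints["3072-dim float32"]

for name, size in footprints.items():
    ratio = baseline / size
    print(f"{name:>17}: {size / 1e9:8.1f} GB  ({ratio:.0f}x smaller than full width)")
```

Under these assumptions the full-width index comes to roughly 1.2 TB of raw vectors, and the 512-dim binary one to about 6.4 GB.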

You don't choose an embedding model. You choose a storage footprint, and the model is downstream of it.

Lever two: Matryoshka and quantization beat switching vendors

This is why the most important feature in 2026 isn't accuracy — it's shrinkability. Matryoshka Representation Learning trains a model so that information is nested coarse-to-fine inside a single vector. Keep only the first 512 of 3072 dimensions and you still retrieve well, at no extra inference cost. Every serious provider now exposes this: OpenAI through a dimensions parameter, Voyage and Cohere and Gemini through explicit tiers (Google reports its 768-dim setting costs about a quarter-percent of quality versus 3072, at a quarter of the storage).
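If you already hold full-width vectors from a Matryoshka-trained model, truncation can also be done client-side. A minimal sketch, assuming the model really was trained with nested representations (on an ordinary model the leading dimensions carry no special meaning): keep the leading components, then rescale to unit length, since cosine and dot-product scoring usually assume normalized vectors. The function name and the toy vector are made up for illustration.

```python
import math


def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


# Toy 8-dim "embedding" with made-up values, standing in for a 3072-dim vector.
full = [0.42, -0.31, 0.27, 0.55, -0.08, 0.12, -0.05, 0.03]
short = truncate_and_normalize(full, 4)

print(len(short))                                    # dimensions kept
print(round(math.sqrt(sum(x * x for x in short)), 6))  # length after rescaling
```

Queries have to be truncated to the same width as the stored documents, or the two sides are no longer comparable.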

Stack quantized output on top (int8 or binary vectors, which Voyage and Cohere both emit directly) and you can cut a vector-DB footprint 4x to 32x. The two-tier trick makes binary survivable: search the binary vectors in RAM, then rescore the top candidates against full-precision copies on disk. The upshot is blunt: truncating dimensions and quantizing output usually saves more money than any provider switch, and it's a config change, not a migration. This is the same compression story playing out one layer down in the index, which is worth understanding on its own (see "binary vs scalar vs product quantization").
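The two-tier search can be sketched in a few lines. This is a toy illustration, not any provider's or database's implementation: it assumes sign-based binarization (one bit per dimension), a brute-force Hamming scan in place of a real index, and a plain Python list standing in for the full-precision copies on disk. All function names and vectors are hypothetical.

```python
def to_binary(vec):
    """Sign-quantize: one bit per dimension, packed into a Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_tier_search(query, docs, shortlist=3, top_k=1):
    codes = [to_binary(d) for d in docs]  # in practice precomputed and kept in RAM
    q_code = to_binary(query)
    # Tier 1: cheap Hamming scan over the binary codes.
    candidates = sorted(range(len(docs)), key=lambda i: hamming(q_code, codes[i]))[:shortlist]
    # Tier 2: rescore only the shortlist against the full-precision vectors.
    rescored = sorted(candidates, key=lambda i: dot(query, docs[i]), reverse=True)
    return rescored[:top_k]


# Made-up 4-dim vectors; three of them share the same sign pattern as the query.
docs = [
    [0.9, 0.1, -0.3, 0.2],
    [0.2, 0.8, -0.1, 0.5],
    [-0.7, -0.2, 0.6, -0.1],
    [0.1, 0.7, -0.4, 0.6],
]
query = [0.15, 0.75, -0.2, 0.55]

print(two_tier_search(query, docs, shortlist=3, top_k=1))
```

In this toy data three documents collapse to the same binary code, so the first tier cannot tell them apart; the full-precision rescoring is what picks the actual nearest neighbour. That is why the shortlist should be several times larger than the number of results you finally return.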

Lever three: input length and domain fit

The least-discussed spec is the one that silently corrupts results: maximum input length. Gemini's gemini-embedding-001 caps at about 2,048 tokens — anything longer is quietly truncated, so a long document loses its tail without warning. OpenAI sits at ~8,191, Voyage at 32K, and Cohere's v4 at a remarkable 128K, enough to embed a 200-page document in one call. A short cap isn't disqualifying, but it forces aggressive chunking, which multiplies your vector count (back to lever one) and risks splitting meaning across chunks. Voyage's contextual-chunk model, voyage-context-3, exists precisely to fight that loss by encoding each chunk with awareness of the whole document.
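To see how the cap feeds back into vector count, here is a rough estimator. It is a sketch under stated assumptions: the caps are the approximate figures quoted above (with 32K and 128K rounded to 32,000 and 128,000), token counts are taken as given even though each provider's tokenizer counts differently, and the sliding-window chunking with optional overlap is one common scheme rather than anyone's recommended one.

```python
import math


def chunks_needed(doc_tokens: int, max_input: int, overlap: int = 0) -> int:
    """Windows of at most `max_input` tokens, neighbours sharing `overlap` tokens."""
    if overlap >= max_input:
        raise ValueError("overlap must be smaller than max_input")
    if doc_tokens <= max_input:
        return 1
    stride = max_input - overlap
    return 1 + math.ceil((doc_tokens - max_input) / stride)


def tokens_lost_if_unchunked(doc_tokens: int, max_input: int) -> int:
    """Tokens silently dropped if the whole document is sent in one call."""
    return max(0, doc_tokens - max_input)


# Approximate input caps from the text; treat them as assumptions, not specs.
caps = {"2,048 cap": 2_048, "8,191 cap": 8_191, "32K cap": 32_000, "128K cap": 128_000}

doc_tokens = 100_000  # hypothetical long document
for name, cap in caps.items():
    print(f"{name:>10}: {chunks_needed(doc_tokens, cap):3d} vectors, "
          f"{tokens_lost_if_unchunked(doc_tokens, cap):6d} tokens lost if sent whole")
```

For this hypothetical 100,000-token document, the shortest cap needs 49 vectors where the longest needs one, before any overlap is added.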

And domain fit is where the leaderboard most misleads. A generalist that tops MTEB can lose, on your corpus, to a model trained for it: Voyage ships code- and finance-tuned variants; Cohere v4 is natively multimodal for visual-document retrieval; the right specialist beats the right generalist on the only benchmark that pays rent.

How to actually choose

Filter on credibility with the leaderboard, then decide on the levers. If you want the cheapest competent baseline with the deepest ecosystem, OpenAI is the safe default. If you embed long documents, Cohere's 128K input or Voyage's 32K context model removes the chunking tax. If you live on Google infrastructure and want the strongest general scores, Gemini delivers them — just respect the 2,048-token cap. If retrieval quality on a specialized corpus is the whole game, Voyage's domain models are the sharpest tool.

But before you sign with any of them, run your own queries against your own documents and read the results. Then truncate the dimensions and quantize the output until quality just starts to bend, and store that. The provider is a starting point. The footprint is the decision. For the methodology of benchmarking on your own data rather than the leaderboard, the companion argument is "the best embedding model is the one you benchmark yourself."
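One way to find the point where quality "starts to bend" is to sweep truncation widths and measure how much of the full-width top-k survives. The sketch below is a stand-in, not a benchmark: it uses random vectors instead of real embeddings (so the numbers it prints mean nothing in themselves), brute-force cosine search, and agreement with the full-width ranking as a proxy for quality. With real data you would plug in your own query and document vectors and, ideally, human relevance judgments. All names are hypothetical.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, docs, k):
    """Indices of the k documents most similar to the query (brute force)."""
    return sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:k]


def truncation_sweep(queries, docs, dims_to_try, k=5):
    """Mean overlap between full-width top-k and top-k at each truncated width."""
    reference = [top_k(q, docs, k) for q in queries]
    agreement = {}
    for d in dims_to_try:
        short_docs = [doc[:d] for doc in docs]
        overlaps = [
            len(set(ref) & set(top_k(q[:d], short_docs, k))) / k
            for q, ref in zip(queries, reference)
        ]
        agreement[d] = sum(overlaps) / len(overlaps)
    return agreement


# Placeholder data: random vectors, NOT Matryoshka embeddings.
rng = random.Random(0)
docs = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(200)]
queries = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(20)]

results = truncation_sweep(queries, docs, dims_to_try=[64, 32, 16, 8], k=5)
for d, score in results.items():
    print(f"{d:3d} dims: {score:.2f} of full-width top-5 retained")
```

Random vectors degrade quickly under truncation because nothing about them is nested; a Matryoshka-trained model should hold up much longer, and the width at which your own curve drops is the one to store.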