Every team running a vector database eventually hits the same wall: the embeddings are too big. A few million documents at 3072 dimensions is tens of gigabytes of float32 that wants to live in RAM, and the index that searches it scales with that width. The reflexive fix is quantization: store the vectors at lower precision. There is a second lever, older and less discussed, that changes the length of the vector instead of its precision, and it comes with a trick that lets you win back almost all of what the usual tradeoff costs.

The dolls inside the vector

The technique is Matryoshka Representation Learning, introduced by Kusupati and colleagues at NeurIPS 2022 and named for Russian nesting dolls. The idea is a training-time intervention. A normal embedding model optimizes one loss on the full vector, so information is smeared across all 3072 dimensions with no particular order. An MRL model applies the loss at many nested sizes at once — the first 8 dimensions, the first 16, 32, 64, all the way up — so each prefix is independently pushed to be a complete, usable embedding.
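
To make the training-time idea concrete, here is a minimal NumPy sketch of a nested loss, not the authors' implementation. It assumes a classification setup with one linear head per prefix length and equal weight on every size; the names `matryoshka_loss` and `heads` and the sizes chosen are hypothetical, picked only for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def matryoshka_loss(embeddings, labels, heads):
    """Sum the same loss over every nested prefix of the embedding.

    `heads` maps a prefix length m to an (m, num_classes) weight matrix:
    one hypothetical linear classifier per nested size, equally weighted.
    """
    total = 0.0
    for m, weights in heads.items():
        prefix = embeddings[:, :m]  # first m dimensions only
        total += softmax_cross_entropy(prefix @ weights, labels)
    return total

rng = np.random.default_rng(0)
nested_sizes = [8, 16, 32, 64]          # assumed sizes, for illustration
embeddings = rng.normal(size=(4, 64))   # batch of 4 toy embeddings
labels = np.array([0, 1, 2, 1])
heads = {m: rng.normal(size=(m, 3)) * 0.01 for m in nested_sizes}
loss = matryoshka_loss(embeddings, labels, heads)
```

Because the first 8 dimensions appear in every term of the sum, they receive gradient from every nested size, which is one way to see why the model learns to put the most useful information up front.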

The effect is that the model learns to front-load. The most important semantic information lands in the earliest dimensions, and the tail carries diminishing detail. Now truncation means something: keep the first 256 numbers and discard the rest, and you have dropped the least informative part of the vector rather than a random slice. As the Hugging Face writeup puts it, the loss is applied "on both full-size embeddings and truncated portions of the embeddings" — which is the whole game.

This is why OpenAI's text-embedding-3 models expose a dimensions parameter. They are MRL-trained, so you can ask for a 256-dim vector from a model whose native size is 3072. The number OpenAI reports is the one worth memorizing: a text-embedding-3-large vector shortened to 256 dimensions still outperforms a full 1536-dimension text-embedding-ada-002 vector on the MTEB benchmark. That is a sixfold reduction in storage against the previous generation, and twelvefold against the model's own native size, with no loss in benchmark quality: the kind of win that is usually a typo.
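
The storage arithmetic is easy to check by hand. The sketch below assumes a corpus of three million vectors stored as float32 and counts only the raw vector bytes, ignoring whatever overhead a particular index adds; `raw_vector_gib` is a hypothetical helper name.

```python
def raw_vector_gib(num_vectors, dims, bytes_per_value=4):
    """Raw size of the stored vectors in GiB, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 2**30

num_vectors = 3_000_000  # assumed corpus size, for illustration

for dims in (3072, 1536, 256):
    print(f"{dims:>4} dims: {raw_vector_gib(num_vectors, dims):6.2f} GiB")

# Storage ratios for a 256-dim vector against the two reference sizes.
ratio_vs_native = 3072 / 256  # against text-embedding-3-large at full width
ratio_vs_ada = 1536 / 256     # against text-embedding-ada-002
```

At this corpus size the full-width vectors come to roughly 34 GiB, while the 256-dim prefixes fit in under 3 GiB.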

Truncation only feels like a free lunch because the model paid for it during training.

The two-pass trick that refuses the tradeoff

Shortening vectors looks like a straight tradeoff: smaller index, faster search, slightly worse recall at very small sizes. The interesting part of the MRL paper is that you do not have to accept the recall hit. You can have the small-index speed and the full-vector accuracy, through a pattern the authors call adaptive retrieval.

It is two passes. First, search a cheap low-dimensional index, built from the truncated prefixes, to pull a candidate shortlist. Then rerank only that shortlist using the full-dimensional vectors. The expensive comparison happens against a few hundred candidates, not the whole corpus, so you pay full-dimension cost on a tiny set and small-index cost on the large one.
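
The two passes can be sketched in a few lines of NumPy. This is a brute-force illustration under stated assumptions, not a production setup: a real system would precompute the prefixes and put them in an approximate-nearest-neighbor index, and the toy corpus here fakes front-loading by scaling later dimensions down. The function name `adaptive_search` and its default sizes are hypothetical.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit L2 length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def adaptive_search(query, corpus, short_dims=64, shortlist=100, k=10):
    """Shortlist on truncated prefixes, then rerank on full vectors."""
    # Pass 1: cosine similarity on renormalized prefixes, whole corpus.
    small_corpus = normalize(corpus[:, :short_dims])
    small_query = normalize(query[:short_dims])
    coarse = small_corpus @ small_query
    candidates = np.argpartition(-coarse, shortlist - 1)[:shortlist]
    # Pass 2: full-dimensional scores, shortlisted candidates only.
    fine = corpus[candidates] @ query
    return candidates[np.argsort(-fine)[:k]]

rng = np.random.default_rng(0)
decay = 1.0 / np.sqrt(np.arange(1, 257))  # toy stand-in for MRL front-loading
corpus = normalize(rng.normal(size=(2000, 256)) * decay)
query = normalize(corpus[42] + 0.01 * rng.normal(size=256))
top = adaptive_search(query, corpus)
```

The first pass touches every row but only 64 columns; the second touches all 256 columns but only 100 rows. The shortlist size is the knob: a larger shortlist recovers more of the full-vector ranking at the cost of a slower rerank.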

The numbers are stark. In the original paper, shortlisting on 16 dimensions and reranking on 2048 matched full-2048 retrieval accuracy on ImageNet at roughly 128× fewer FLOPs per query and a 14× wall-clock speedup. Supabase reproduced the pattern on OpenAI vectors: a single-pass search at 1536 dimensions scored 89.2% accuracy, while a two-stage search — a 512-dim first pass reranked at 3072 — scored 99% at nearly the same throughput. You spend a few percent of QPS to recover almost all the lost accuracy.

This is the non-obvious payoff. The usual framing is "trade accuracy for storage." Adaptive retrieval says: index small, but keep the full vectors around for the cheap second pass, and the accuracy comes back. What shrinks is the index you search, not the total bytes you keep.

Two ways to get this wrong

Two mistakes account for most of the trouble, and both are easy to make.

The first: truncating a model that was not trained for it. If you take an ordinary embedding and chop off everything past dimension 256, you will get worse results than a model built to produce 256-dim vectors. The information is not ordered in a non-MRL model; the tail you are discarding is not the unimportant part. This is a property you have to confirm on the model card — OpenAI text-embedding-3, Nomic Embed v1.5 (64–768 dims), Jina v3, Cohere embed-v4.0, and Snowflake Arctic Embed v2.0 have it; most models do not.

The second: forgetting to renormalize. A vector normalized to unit length at 3072 dimensions is no longer unit length once you cut it to 256. If your similarity metric assumes normalized vectors — most cosine setups do — you have to renormalize after truncating, or your scores quietly drift.
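
A minimal sketch of the fix, assuming you are truncating stored vectors yourself with NumPy; `truncate_and_renormalize` is a hypothetical helper, and the random vector stands in for a real embedding.

```python
import numpy as np

def truncate_and_renormalize(vectors, dims):
    """Keep the first `dims` values and rescale back to unit L2 length."""
    prefix = np.asarray(vectors, dtype=np.float32)[..., :dims]
    norm = np.linalg.norm(prefix, axis=-1, keepdims=True)
    return prefix / np.clip(norm, 1e-12, None)  # guard against a zero prefix

rng = np.random.default_rng(0)
full = rng.normal(size=3072).astype(np.float32)
full /= np.linalg.norm(full)                  # unit length at 3072 dims

naive = full[:256]                            # truncated, not renormalized
fixed = truncate_and_renormalize(full, 256)

naive_norm = float(np.linalg.norm(naive))     # well below 1
fixed_norm = float(np.linalg.norm(fixed))     # back to 1, within float error
```

With the naive prefix, a dot product is no longer a cosine similarity, and because each vector loses a different share of its length, the error is not a constant factor that cancels out in the ranking.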

Get those two right and Matryoshka stacks cleanly with the other size levers. It composes with vector quantization — shrink the dimension and the precision — and it should inform which embedding model you pick in the first place, because a model that can resize is a model that can grow with your corpus. The cheapest vector is the one you trained to survive being cut.