A retrieval index is mostly numbers you will never look at closely. A single OpenAI text-embedding-3-large vector is 3072 floats; at four bytes each that is roughly 12 KB before you store a single word of the document it points to. At ten million chunks, your vector database is carrying about 120 GB of RAM, most of it holding decimal places that don't change which document comes back first. That is the quiet bill that embedding quantization is built to stop paying. The interesting part is not the savings, which are obvious, but how cheaply you buy back the quality you give up.

Three precisions, one decision

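The vector you store does not have to keep the precision the model emitted. There are three rungs: float32 as the model emits it (32 bits per dimension), int8 scalar quantization (8 bits per dimension), and binary (1 bit per dimension).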

So the storage math is not subtle: 1x, 4x, 32x. The whole argument is about what 32x costs you in answers.
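The arithmetic is easy to check. The sketch below counts raw vector bytes only and ignores index overhead; the function names and the ten-million-chunk corpus size are illustrative choices, not anything a particular database exposes.

```python
def bytes_per_vector(dims: int, bits_per_dim: int) -> int:
    """Raw storage for one vector, ignoring any index overhead."""
    return dims * bits_per_dim // 8

DIMS = 3072            # text-embedding-3-large
N_CHUNKS = 10_000_000  # illustrative corpus size

# Bits stored per dimension at each rung.
RUNGS = {"float32": 32, "int8": 8, "binary": 1}

def storage_table(dims: int = DIMS, n: int = N_CHUNKS) -> dict:
    """Map rung name -> (bytes per vector, total decimal gigabytes)."""
    table = {}
    for name, bits in RUNGS.items():
        per_vec = bytes_per_vector(dims, bits)
        table[name] = (per_vec, per_vec * n / 1e9)
    return table

for name, (per_vec, total_gb) in storage_table().items():
    print(f"{name:8s} {per_vec:6d} B/vector  {total_gb:7.2f} GB total")
```

For 3072 dimensions this works out to 12,288 bytes per vector at float32 (about 122.9 GB for ten million chunks), 3,072 bytes at int8 (about 30.7 GB), and 384 bytes at binary (about 3.8 GB).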

The catch is recall — and the fix is cheaper than the catch

Binarizing a vector throws away nearly everything. Intuitively that should wreck retrieval, and naively it does dent it: the Hugging Face / mixedbread study measured binary quantization retaining about 92.5% of retrieval performance on its own. The move that makes this practical is oversample + rescore, and it is the one idea worth taking from this piece.

You search the compressed index for more candidates than you actually want — say you need the top 100, so you ask the binary index for the top 300 — and then you re-score that small shortlist with the full-precision (or int8) vectors. The binary scan is what makes the index cheap and fast; the rescore is a few hundred dot products, which is nothing. With it, the HF/mixedbread numbers jump back to ~96% for binary (mxbai-embed-large-v1 hits 96.45%) and ~99-100% for int8.

The compression is the cheap part. Recall is bought back by re-ranking a shortlist — so you keep both the 32x memory win and almost all the quality.
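The oversample-and-rescore loop described above fits in a few lines of NumPy. This is a minimal brute-force sketch, not how any particular vector database implements it: the function names are made up for illustration, the documents are random unit vectors, and the binary scan is a plain Hamming-distance pass rather than an ANN index.

```python
import numpy as np

def binarize(x: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(x > 0, axis=-1)

# Popcount lookup table for every possible byte value.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming(query_bits: np.ndarray, index_bits: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed query to every packed index row."""
    xor = np.bitwise_xor(index_bits, query_bits)
    return _POPCOUNT[xor].sum(axis=1, dtype=np.int32)

def search(query, index_bits, full_vectors, k=100, oversample=3):
    """Binary scan for k * oversample candidates, then float rescore to k."""
    shortlist_size = min(k * oversample, len(full_vectors))
    dists = hamming(binarize(query), index_bits)
    # Cheap pass: the shortlist_size nearest rows by Hamming distance.
    shortlist = np.argpartition(dists, shortlist_size - 1)[:shortlist_size]
    # Rescore only the shortlist with full-precision dot products.
    scores = full_vectors[shortlist] @ query
    return shortlist[np.argsort(-scores)[:k]]

# Toy data: random unit vectors stand in for real embeddings.
rng = np.random.default_rng(0)
docs = rng.standard_normal((5_000, 1024)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
index_bits = binarize(docs)  # 128 bytes per document instead of 4096

# A query that is a lightly perturbed copy of document 42.
noise = rng.standard_normal(1024).astype(np.float32)
query = docs[42] + 0.3 * noise / np.linalg.norm(noise)
query /= np.linalg.norm(query)

hits = search(query, index_bits, docs, k=10, oversample=3)
print(hits[0])
```

In a real deployment the full-precision vectors used for rescoring can sit on disk, since only the shortlist is ever read; the packed binary codes are the only thing the scan touches.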

Qdrant's benchmarks make the recovery concrete. With binary quantization and 3x oversampling, recall lands at 0.9966 for text-embedding-3-large and 0.9847 for text-embedding-3-small; ada-002 reaches 0.98 at 4x oversampling. Meanwhile RAM for 100K OpenAI vectors falls from roughly 900 MB to about 128 MB, and Qdrant clocks the binary path at up to 40x faster retrieval. Cohere reports the same shape from the model side: int8 Embed v3 retains ~99% of search quality with a rescore multiplier of 4, and binary gives a 32x memory reduction.

Why binary needs big vectors

Binary is not a free default. Collapsing a dimension to one bit only preserves enough signal when there are many dimensions voting. Qdrant says so directly: binary gives "poorer results for small embeddings i.e. less than 1024 dimensions". The data agrees: Mistral Embed at 768 dims only reached 0.9445 recall where the 1536- and 3072-dim OpenAI models cleared 0.98. This is why the models marketed for binarization (mxbai-embed-large-v1 at 1024 dims, Cohere Embed v3/v4, OpenAI text-embedding-3-large at 3072) are all high-dimensional.

Below ~1024 dims, int8 is the right rung: 4x smaller, ~99% quality, and far more forgiving. Scalar quantization's one real chore is calibration: computing per-dimension min/max, where Qdrant's quantile parameter (e.g. 0.99) trims the outliers that would otherwise stretch the range and blur everything else.
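To see why trimming outliers matters, here is a small NumPy sketch of int8 calibration with a quantile-clipped range. It illustrates the idea only and is not Qdrant's internal implementation: the helper names are hypothetical, the symmetric two-tailed trim is an assumption about how to read the quantile, and the data is synthetic with one deliberately planted outlier.

```python
import numpy as np

def calibrate(sample: np.ndarray, quantile: float = 0.99):
    """Per-dimension clipping range from a calibration sample.

    Assumption: keep the central `quantile` mass of each dimension,
    trimming both tails equally. quantile=1.0 is plain min/max.
    """
    tail = (1.0 - quantile) / 2.0
    lo = np.quantile(sample, tail, axis=0)
    hi = np.quantile(sample, 1.0 - tail, axis=0)
    return lo, hi

def quantize_int8(x, lo, hi):
    """Clip to [lo, hi], then map linearly onto the 256 int8 levels."""
    scale = (hi - lo) / 255.0
    q = np.round((np.clip(x, lo, hi) - lo) / scale) - 128
    return q.astype(np.int8)

def dequantize_int8(q, lo, hi):
    """Approximate inverse of quantize_int8."""
    scale = (hi - lo) / 255.0
    return (q.astype(np.float32) + 128) * scale + lo

rng = np.random.default_rng(0)
sample = rng.standard_normal((10_000, 64))
sample[0, 0] = 50.0  # one wild outlier in dimension 0

def mean_abs_error(quantile: float) -> float:
    """Round-trip error on dimension 0, excluding the outlier row itself."""
    lo, hi = calibrate(sample, quantile)
    restored = dequantize_int8(quantize_int8(sample, lo, hi), lo, hi)
    return float(np.abs(restored[1:, 0] - sample[1:, 0]).mean())

err_minmax = mean_abs_error(1.0)    # range stretched by the outlier
err_trimmed = mean_abs_error(0.99)  # outlier clipped away
print(f"min/max: {err_minmax:.4f}  quantile 0.99: {err_trimmed:.4f}")
```

With plain min/max, the single outlier stretches the range of dimension 0 so that the 256 levels are spread thin over values that almost never occur. Clipping at the 0.99 quantile sacrifices accuracy on the rare extremes and spends the levels where the data actually lives.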

Matryoshka stacks on top

The other lever, Matryoshka representation learning, is orthogonal to this one and they multiply. Matryoshka lets you truncate a vector to fewer dimensions (OpenAI's dimensions parameter, mxbai's shortenable output) with graceful quality loss; quantization shrinks each dimension you keep. You can do both — shorten 3072 dims to 1024, then binarize — and Vespa documents the combination explicitly. The two knobs answer different questions: how many numbers, and how many bits each.
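The two knobs compose in a couple of lines. The sketch below uses random data purely to show shapes and sizes; it assumes the embedding came from a Matryoshka-trained model, since truncating the leading dimensions of an ordinary embedding is not expected to preserve quality. The function name is hypothetical.

```python
import numpy as np

def truncate_and_binarize(x: np.ndarray, keep_dims: int) -> np.ndarray:
    """Keep the first keep_dims dimensions, then store one sign bit each."""
    head = x[..., :keep_dims]              # knob 1: how many numbers
    return np.packbits(head > 0, axis=-1)  # knob 2: how many bits each

rng = np.random.default_rng(0)
vecs = rng.standard_normal((4, 3072)).astype(np.float32)  # stand-in embeddings

codes = truncate_and_binarize(vecs, keep_dims=1024)
print(vecs[0].nbytes, "->", codes[0].nbytes, "bytes per vector")
```

Going from 3072 float32 values to 1024 bits takes each vector from 12,288 bytes to 128 bytes, a 96x reduction: 3x from truncation multiplied by 32x from binarization.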

So which rung

The mental model that makes this easy: precision is not a property of the embedding, it is a dial on the index. The model hands you float32 because it has to hand you something. What you store is your call, and for retrieval, the last 31 of those 32 bits were almost never doing any work.