Ask three retrieval engineers which is better — dense embeddings, sparse learned retrieval, or ColBERT-style late interaction — and you'll get a fight, because the question is malformed. These are not three rungs on a quality ladder where you climb to the best one your budget allows. They are three different answers to a single design question: where does the matching cost live, and what do you pay to put it there? Once you see the axis, the choice stops being tribal and starts being arithmetic.

Dense: cheap to store, lossy by construction

Dense retrieval is the default of the embedding era. You run a passage through an encoder, get one fixed-length vector, store it in an approximate-nearest-neighbor index, and compare it to the query's single vector. It is wonderful at meaning — paraphrases, synonyms, "the thing I mean even though I didn't say the word" — and it is cheap, because each document is one point in space.
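As a minimal sketch of that scoring model, the snippet below ranks documents by cosine similarity between one query vector and one vector per document. The three-dimensional vectors and document names are hypothetical stand-ins for real encoder output, and the brute-force scan stands in for the approximate-nearest-neighbor index a production system would use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dense_search(query_vec, doc_vecs, k=2):
    """Score every document against the query and return the top k.

    Brute force for clarity; an ANN index would approximate this ranking.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings: one fixed-length vector per document.
doc_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "returns-faq": [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]

top = dense_search(query_vec, doc_vecs)
```

Each document costs exactly one vector of storage and one similarity computation, which is where the cheapness comes from.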

The cost is structural, and it has a name: the single-vector bottleneck. You are compressing an entire passage — every entity, number, and clause — into one vector of a few hundred floats. Subtle or multi-topic passages get averaged into a blur, and a query that hinges on one specific term can lose to something that is merely, vaguely on-topic. Dense retrieval doesn't fail loudly; it fails by quietly ranking the almost-right thing first. That's why the pragmatic stack bolts a reranker on top, and why pure vector RAG misses exact matches — a failure mode worth understanding on its own before you reach for heavier machinery (see hybrid search vs semantic search).
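A toy illustration of the bottleneck, assuming mean pooling over token vectors (real encoders may pool differently) and hypothetical three-dimensional token embeddings. One passage mentions the exact term the query wants but is mostly about something else; the other is only loosely related throughout.

```python
import math

def mean_pool(token_vecs):
    """Collapse per-token vectors into one vector by averaging each dimension."""
    n = len(token_vecs)
    return [sum(col) / n for col in zip(*token_vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical axes: [billing, error code E42, general troubleshooting]
multi_topic_doc = [
    [1.0, 0.0, 0.0],  # billing
    [1.0, 0.0, 0.0],  # billing
    [1.0, 0.0, 0.0],  # billing
    [0.0, 1.0, 0.0],  # the one exact mention of E42
]
vague_doc = [
    [0.0, 0.6, 0.8],  # loosely about errors in general
    [0.0, 0.6, 0.8],
]
query = [0.0, 1.0, 0.0]  # the query is specifically about E42

score_exact = cosine(query, mean_pool(multi_topic_doc))
score_vague = cosine(query, mean_pool(vague_doc))
```

In this constructed case the pooled vector of the passage that contains the exact match scores about 0.32, while the merely on-topic passage scores about 0.6, so the almost-right document ranks first.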

Sparse: exact terms, learned weights

Sparse retrieval is the modern descendant of BM25. Instead of a dense point, a model like SPLADE produces a high-dimensional sparse vector of learned term weights over the vocabulary — including expansion terms the document didn't literally contain — and matches via the same inverted index that has powered keyword search for decades. You get exact-term precision and interpretability (you can read which terms fired), with a learned model's sense of which terms matter.
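The matching step can be sketched as a dot product over an inverted index. The term weights below are invented for illustration (a real SPLADE-style model would produce them), and "timeout" plays the role of an expansion term the document did not literally contain.

```python
from collections import defaultdict

def build_inverted_index(doc_term_weights):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, weights in doc_term_weights.items():
        for term, weight in weights.items():
            index[term].append((doc_id, weight))
    return index

def sparse_search(query_weights, index):
    """Score documents by summing query_weight * doc_weight over shared terms."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical learned term weights per document.
doc_term_weights = {
    "kb-101": {"e42": 2.1, "error": 1.2, "timeout": 0.8},  # "timeout" is an expansion term
    "kb-202": {"error": 1.0, "billing": 1.5},
}
index = build_inverted_index(doc_term_weights)

query_weights = {"e42": 1.8, "error": 0.5}
results = sparse_search(query_weights, index)
```

Only postings for the query's terms are touched, and the per-term products show exactly which terms contributed to each score.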

The bill arrives as index size and compute: term expansion inflates postings, scoring is heavier than classic BM25, and the learned weights are language-bound in a way dense multilingual encoders aren't. Sparse shines exactly where dense is weakest — queries that turn on a specific token, code, identifier, or rare entity — which is why the two are so often fused rather than chosen between.
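One common way to fuse the two, shown here as an assumption rather than anything this article prescribes, is reciprocal rank fusion (RRF). It combines ranked lists using only positions, so dense and sparse scores never need to be put on the same scale. The constant k=60 is the conventional default, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; each appearance adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["a", "b", "c"]   # hypothetical dense retriever output
sparse_ranking = ["c", "a", "d"]  # hypothetical sparse retriever output

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives in the fused list.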

Late interaction: keep every token, match at query time

ColBERT takes the most expensive-sounding position and makes it pay. Instead of one vector per passage, it keeps one vector per token, and it defers the comparison. At query time, each query token looks across all document tokens and takes its best match — the MaxSim operation — and those maxima sum to the score. Nothing is collapsed into a summary vector, so the token-level precision dense throws away is preserved. This is "late" interaction precisely because the query and document representations are computed independently and only meet at the end, which keeps documents pre-encodable while restoring fine-grained matching.
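The MaxSim scoring described above can be sketched in a few lines. The two-dimensional token vectors are hypothetical; real ColBERT embeddings are higher-dimensional and produced by a trained encoder.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token take its best-matching document token, then sum the maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical per-token embeddings.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

doc_a = [[1.0, 0.0], [0.0, 1.0]]  # has a strong match for each query token
doc_b = [[0.6, 0.8]]              # one token that partially matches both

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
```

Because document token vectors do not depend on the query, they can be computed and stored ahead of time; only the max-and-sum step runs at query time.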

The historical objection was storage, and it was damning: a vector per token is a lot of vectors. Two advances dissolved it. ColBERTv2 introduced residual compression (cluster every token vector, then store the nearest centroid ID plus a 1–2-bit-per-dimension quantized residual), cutting per-vector storage from roughly 256 bytes to about 20–36 bytes and the overall index footprint by 6–10x; on MS MARCO the whole index fits in 16–25 GiB. Then PLAID made it fast, using those same centroids to prune irrelevant passages before their residuals are ever decompressed, and reaching 6.8x speedups on GPU and up to 45x on CPU over vanilla ColBERTv2 at matched quality.
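The per-vector byte counts can be reproduced with simple arithmetic. This sketch assumes 128-dimensional token vectors stored as 16-bit floats before compression, and a 4-byte centroid ID after it; none of these parameters is stated in the text above, but together they yield the same 256, 20, and 36 byte figures.

```python
def uncompressed_bytes(dim=128, bytes_per_value=2):
    """Bytes for one token vector stored as 16-bit floats (assumed layout)."""
    return dim * bytes_per_value

def compressed_bytes(dim=128, bits_per_dim=1, centroid_id_bytes=4):
    """Bytes for one token vector stored as centroid ID plus quantized residual."""
    residual_bytes = dim * bits_per_dim // 8
    return centroid_id_bytes + residual_bytes

raw = uncompressed_bytes()                   # 128 * 2
one_bit = compressed_bytes(bits_per_dim=1)   # 4 + 128 / 8
two_bit = compressed_bytes(bits_per_dim=2)   # 4 + 256 / 8
```

Under these assumptions `raw` is 256, `one_bit` is 20, and `two_bit` is 36.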

Late interaction stopped being a research luxury the moment your existing vector database learned to store more than one vector per row.

What actually changed in 2026

For years, running ColBERT in production meant adopting a bespoke engine — PLAID, or a homegrown MaxSim service — separate from whatever held your dense vectors. That operational tax, more than any quality argument, is what kept late interaction in the papers and out of the stacks.

That tax is gone. As of 2026, native multi-vector indexing is a first-class feature in the mainstream vector databases: Qdrant stores and searches multi-vectors out of the box and treats ColBERT/ColPali as ordinary inputs; LanceDB indexes multi-vector columns and runs MaxSim search; Vespa and Weaviate support late-interaction representations directly. The same databases people already run for dense retrieval now hold per-token vectors, so adopting late interaction is a schema choice, not a new piece of infrastructure to operate. The convergence the research community has been chasing — SPLATE and SLIM mapping late-interaction outputs onto sparse inverted indexes — is the same story from the other direction: the boundaries between the three approaches are turning into knobs on one engine.

The decision, made plainly

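Hold all three against the cost axis and the call is usually clear. Dense is the cheap default when queries turn on meaning and you can tolerate the occasional almost-right top hit. Sparse earns its larger index when queries hinge on a specific token, code, identifier, or rare entity. Late interaction is worth the per-token storage when you need token-level precision and your vector database already stores multi-vectors.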

The mistake is treating these as better-and-worse rather than cheaper-and-richer. You are not buying the best retriever; you are choosing where to spend — in storage, in compute, or in the precision you're willing to lose. Price the loss, not the hype, and the architecture chooses itself.