Retrieval has two famous answers and they fight in every RAG thread. BM25 matches words: fast, exact, decades-proven, and blind to the fact that "car" and "automobile" mean the same thing. Dense embeddings match meaning: they collapse "car" and "automobile" to nearby points, and then quietly miss the part number A-1138 because nothing in a 1024-float vector is built to land an exact token. The standard fix is to run both and fuse them — hybrid search. But there is a third answer that keeps getting left out of the fight, and it is the one that actually tries to be both: learned sparse retrieval, whose best-known model is SPLADE.
What SPLADE actually is
SPLADE (Sparse Lexical and Expansion model) comes from the 2021 SIGIR paper by Formal, Piwowarski, and Clinchant, since iterated through v2, SPLADE++, and v3. The mechanism is the clever part. You run a BERT over the text, but instead of taking the pooled embedding, you take the masked-language-model head: the layer that, during pretraining, predicts a probability over the entire 30,522-term WordPiece vocabulary. Pass each logit w through log(1 + ReLU(w)), a ReLU followed by a log saturation, then max-pool across the input positions (the original paper summed; max pooling arrived with v2), and you get one learned weight per vocabulary term. Most are zero. What survives is a sparse vector: a bag of weighted terms, exactly the shape BM25 produces.
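To make the pooling step concrete, here is a minimal NumPy sketch. It assumes you already have the MLM-head logits as a (sequence length × vocabulary) array; the function name splade_weights, the toy shapes, and the random logits are illustrative only, not taken from the SPLADE codebase.

```python
import numpy as np

def splade_weights(mlm_logits, attention_mask):
    """Turn MLM logits into one non-negative weight per vocabulary term.

    mlm_logits:     (seq_len, vocab_size) array of raw MLM-head logits
    attention_mask: (seq_len,) array of 0/1, where 0 marks padding
    """
    # log(1 + ReLU(w)): negative logits become 0, large ones are dampened.
    saturated = np.log1p(np.maximum(mlm_logits, 0.0))
    # Zero out padding positions. Safe before a max because every
    # saturated value is already >= 0.
    saturated = saturated * attention_mask[:, None]
    # Max-pool over input positions: one weight per vocabulary term.
    return saturated.max(axis=0)

# Toy input: 6 positions (the last 2 are padding), 50-term vocabulary.
# Logits are centered well below zero, so most terms end up with weight 0.
rng = np.random.default_rng(0)
logits = rng.normal(loc=-3.0, scale=1.5, size=(6, 50))
mask = np.array([1, 1, 1, 1, 0, 0])

weights = splade_weights(logits, mask)
nonzero = np.flatnonzero(weights)
print(f"{nonzero.size} of {weights.size} vocabulary terms have non-zero weight")
```

In a real encoder the logits would come from a trained model and the vocabulary would have tens of thousands of entries, but the shape of the computation is the same.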
The difference from BM25 is twofold. The weights are learned, not handed down by a TF-IDF formula. And because the MLM head can fire on terms the text never contained, SPLADE performs expansion: a document about a "car" gets non-zero weight on "vehicle," "automobile," "sedan." That is the cure for the vocabulary mismatch that hobbles BM25, and per the SPLADE authors, the expansion terms, most of which aren't in the original passage, are exactly what drive its zero-shot strength. Sparsity doesn't come for free, either: it is trained in with a FLOPS regularizer, a differentiable estimate of retrieval cost added to the loss as a penalty, so the model learns to spend its non-zero terms where they earn their keep.
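A rough sketch of the FLOPS penalty, under the usual formulation: average each term's weight over a batch, square it, and sum over the vocabulary. The function name flops_penalty and the two toy batches are made up for illustration; in training this value would be scaled by a coefficient and added to the ranking loss.

```python
import numpy as np

def flops_penalty(batch_weights):
    """FLOPS-style penalty for a batch of sparse vectors.

    batch_weights: (batch_size, vocab_size) array of non-negative term weights.
    Returns the sum over terms of (mean weight of that term in the batch)^2.
    """
    mean_activation = batch_weights.mean(axis=0)   # average weight per term
    return float(np.sum(mean_activation ** 2))

# Two toy batches with the same total weight (vocabulary of 4 terms).
# In the first, both documents fire on the same term; in the second,
# each document fires on its own term.
shared_term = np.array([[1.0, 0.0, 0.0, 0.0],
                        [1.0, 0.0, 0.0, 0.0]])
distinct_terms = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0]])

print(flops_penalty(shared_term))     # 1.0
print(flops_penalty(distinct_terms))  # 0.5
```

The term that every document activates is penalized hardest, because it would produce the longest postings list. That is the sense in which the regularizer estimates retrieval cost rather than just counting non-zeros.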
SPLADE keeps the inverted index and borrows the transformer's judgment about which words a document is really about. That's the whole pitch — and the whole bill.
The payoff that the dense camp can't match: a SPLADE vector is still a sparse bag of terms, so it runs on a standard inverted index (Lucene, Anserini, PISA, Elasticsearch, OpenSearch, Vespa). No ANN graph, no HNSW tuning; scoring is a dot product over the non-zero terms a query and a document share. On the numbers, it earns the seat: the naver/splade README reports MS MARCO dev MRR@10 climbing from 34.0 (v2) to 36.8 (v2-distil), and the v3 paper reports 40.2 MRR@10 with a 51.7 average nDCG@10 across BEIR, comfortably past BM25's strong-but-flat zero-shot baseline and, the authors note, competitive with cross-encoder rerankers.
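The inverted-index scoring described above can be sketched in a few lines of plain Python. This is a toy, in-memory illustration: the names build_index and search, the document IDs, and every weight are invented, and a production engine would add compression, pruning, and top-k optimizations on top.

```python
from collections import defaultdict

def build_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vector, k=10):
    """Score documents by the dot product over terms shared with the query."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        # Only the postings lists of the query's own terms are walked.
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical learned sparse vectors; the weights are made up.
docs = {
    "d1": {"car": 1.8, "vehicle": 1.1, "automobile": 0.9},
    "d2": {"bicycle": 2.0, "vehicle": 0.7},
}
index = build_index(docs)
results = search(index, {"automobile": 1.5, "vehicle": 0.5})
for doc_id, score in results:
    print(doc_id, round(score, 2))
```

Every extra non-zero term in the query means one more postings list to walk, which is where the latency discussion in the next section comes from.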
The catch is latency, not relevance
Here is the part the leaderboard hides. Expansion is a double-edged sword: the same trick that adds "automobile" to your query also means the query now touches far more postings lists, and walking them costs time. An efficiency study of SPLADE models measured short-query latency exceeding six times BM25's before tuning. On top of that, a naive SPLADE query needs a full BERT forward pass before retrieval even starts, a GPU tax BM25 simply doesn't pay.
This is why "is SPLADE fast" has no honest yes/no. The escape hatch is document-only mode (OpenSearch ships it as the default for neural sparse search, and it's the spirit of Efficient-SPLADE and the v3-Doc variant): do all the expansion at index time, and at query time just tokenize the query and look up the learned term weights — no transformer, no query-side expansion. OpenSearch's own docs call this mode "as efficient as BM25." You give back a little relevance for it, but you erase both the encoder and the long-postings cost. The decision you're actually making, then, isn't sparse-versus-dense. It's where you can afford to spend the transformer — at index time, where it's amortized, or at query time, where it's a per-request bill.
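Here is a minimal sketch of what document-only query encoding amounts to, assuming a precomputed token-weight table. The whitespace tokenizer, the TOKEN_WEIGHTS table, the default_weight fallback, and all weights are hypothetical stand-ins; a real deployment would use the model's own tokenizer and whatever lookup table ships with it.

```python
def encode_query_doc_only(text, token_weights, default_weight=1.0):
    """Build a sparse query vector with no transformer call.

    Each query token gets a weight from a precomputed table; there is
    no query-side expansion, so the vector holds only the query's own tokens.
    """
    query_vector = {}
    for token in text.lower().split():   # stand-in for a real tokenizer
        query_vector[token] = token_weights.get(token, default_weight)
    return query_vector

def dot(query_vector, doc_vector):
    """Dot product over the terms the query and the document share."""
    return sum(w * doc_vector.get(t, 0.0) for t, w in query_vector.items())

# Hypothetical lookup table of learned per-token query weights.
TOKEN_WEIGHTS = {"cheap": 0.6, "automobile": 1.4}

# A document vector that was expanded at index time: the text said "car",
# and the encoder added weight on "automobile" and "vehicle".
doc_vector = {"car": 1.8, "automobile": 0.9, "vehicle": 1.1}

query_vector = encode_query_doc_only("cheap automobile", TOKEN_WEIGHTS)
score = dot(query_vector, doc_vector)
print(query_vector, round(score, 2))
```

The query never mentions "car", yet it still matches, because the expansion already happened on the document side. The query touches only two postings lists and needs no model inference.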
Where it sits in 2026
The vendors have quietly made this a productized choice. Elastic ships ELSER, its own learned-sparse encoder, and reports its v2 winning 10 of 12 BEIR-subset tasks against BM25 with roughly +18% average nDCG@10. OpenSearch's neural sparse runs on the inverted index with the inference-free query mode above. Pinecone added sparse-only indexes plus its own pinecone-sparse-english-v0, which it clocks at ~23% better average nDCG@10 than BM25 on TREC. Qdrant and Vespa both take native sparse vectors. The pattern across all of them is the same: learned sparse is sold as the upgrade you can make without leaving the search engine you already run.
So which one
- You can fine-tune on your domain and in-domain quality is everything: a tuned dense retriever, or a BM25 + dense hybrid. Hybrid often matches SPLADE here while being built from off-the-shelf parts, with nothing new to train or serve. If you want exact-token precision back, a late-interaction model like ColBERT is the other sparse-adjacent option.
- You can't fine-tune and your corpus is unlike anything public — legal, biomedical, internal jargon: this is SPLADE's home court. Its zero-shot, out-of-domain generalization is the edge that's hardest to reproduce with a dense model you can't train, which is precisely the gap Elastic markets ELSER into.
- You want learned-sparse quality but live on a tight latency budget: document-only / inference-free mode. Pay the transformer once, at index time, and serve queries at near-BM25 speed.
The reason SPLADE keeps falling out of the BM25-versus-dense argument is that it refuses to pick a side — and that's exactly its value. It is not a faster BM25 or a cheaper embedding. It's the bet that the inverted index was never the problem; the fixed term weights were. Fix those, decide where to pay for it, and you don't have to choose between matching words and matching meaning.



