There is a ritual at the start of every retrieval project. You open the MTEB leaderboard, sort by the overall score, copy the name in the first row, and ship it. It feels like diligence. It is closer to a coin flip with extra steps.
The Massive Text Embedding Benchmark is a real achievement — Muennighoff and colleagues measured 33 models across 8 task types and 58 datasets, and the leaderboard has only grown since. The problem is not the benchmark. The problem is the column you sorted by.
The number on the leaderboard is an average of things you are not doing
That headline MTEB score is a mean across classification, clustering, pair classification, reranking, retrieval, semantic textual similarity, and summarization, with bitext mining as the eighth task type in the full benchmark. Seven or eight different jobs, collapsed into one figure so that a hundred models can be ranked in a single line.
But you are doing exactly one of those jobs: retrieval. You embed a corpus, embed a query, and pull back the nearest neighbors. A model that is brilliant at semantic textual similarity — scoring how alike two sentences are — can sit near the top of the overall board while being mediocre at finding the right document. The averaging hides it.
The first discipline is trivial and almost nobody does it: sort by the retrieval sub-score, not the overall score.
That alone moves the ranking. But it is still someone else's retrieval task. The retrieval datasets in MTEB are things like MS MARCO and web-scraped QA: general-knowledge queries against general-knowledge corpora. Your corpus is insurance policies, or internal runbooks, or three years of support tickets in your company's particular dialect. Your queries are not phrased like a search engine. The distribution that produced the leaderboard ranking is not the distribution you serve.
Build the eval before you look at the leaderboard
The fix is unglamorous and cheap, which is probably why it gets skipped. Label a small set of your own data, and measure on that.
Concretely:
- Collect 50–200 real queries. Pull them from logs if you have them, or write them in your users' actual phrasing if you do not. Cover the spread — the easy lookups and the oblique, underspecified ones where retrieval actually earns its keep.
- Map each query to its known-relevant document(s). This is the labor. If you understand your knowledge base, you can do a hundred in an afternoon. You are building a small answer key: for this question, these chunks are the ones that should come back.
- Pick the metric that matches what you ship. Compute recall@k and nDCG@k at the k your retriever actually uses — if you feed the model the top 10 chunks, measure at 10, not at the leaderboard's default. Recall@k asks whether the right document was in the set you retrieved at all; nDCG rewards putting it near the top. If your application returns a single answer, add MRR, which punishes a correct hit that lands in position 7.
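The three metrics named in the last step can be written out in a few lines each. This is a minimal sketch that assumes binary relevance (a chunk is either in the answer key or it is not); the function names and the document ids in the example are illustrative, not taken from any library.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the known-relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: each hit is discounted by log2 of its position."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Ideal ordering puts every relevant document first, up to k of them.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical example: two labeled chunks, one of which comes back in position 3.
ranked = ["d3", "d9", "d1", "d7"]
relevant = {"d1", "d4"}
print(recall_at_k(ranked, relevant, k=3))   # 0.5
print(reciprocal_rank(ranked, relevant))    # one third: first hit is at rank 3
```

MRR is the mean of `reciprocal_rank` over all queries; recall@k and nDCG@k are likewise averaged per query before models are compared.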
Then run each shortlisted model over the same queries and read the numbers off your data, not Hugging Face's. This is the retrieval half of a full RAG evaluation — get it wrong and no amount of prompt tuning downstream will save the answer. You do not need a framework for this — it is twenty lines around a vector search and a set-membership check — but if you want one, Sentence Transformers can drive MTEB's own evaluators against a custom task, and tools like Evidently wrap the retrieval metrics directly.
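As a rough illustration of how small that loop is, here is a sketch of a brute-force search plus a set-membership check. Everything in it is an assumption made for the example: `toy_embed` is a bag-of-words stand-in for a real model's encode call, the corpus and labeled queries are invented, and the exhaustive cosine scan would be replaced by your actual vector store.

```python
import math
from collections import Counter

def toy_embed(text):
    """Placeholder embedder (word counts). Swap in the model under test."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mean_recall_at_k(embed, corpus, labeled_queries, k=10):
    """corpus: {doc_id: text}; labeled_queries: [(query, set_of_relevant_ids)]."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    total = 0.0
    for query, relevant in labeled_queries:
        q = embed(query)
        # Brute-force nearest neighbors: rank every document by similarity.
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        # Set-membership check against the answer key.
        total += len(set(ranked[:k]) & relevant) / len(relevant)
    return total / len(labeled_queries)

# Invented data, standing in for your corpus and your labeled queries.
corpus = {
    "p1": "water damage is covered under section four",
    "p2": "claims must be filed within thirty days",
    "p3": "flood damage requires a separate rider",
}
labeled = [
    ("is water damage covered", {"p1"}),
    ("deadline to file claims", {"p2"}),
]
print(mean_recall_at_k(toy_embed, corpus, labeled, k=1))
```

To compare candidates, call `mean_recall_at_k` once per model with the same `corpus` and `labeled` arguments and only the `embed` function changed, so that the model is the only variable.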
Do this before you read the leaderboard, not after. If you rank the candidates on your own data first, the public ranking can't anchor you into explaining away a result you don't like.
What the leaderboard cannot tell you, and what to do about it
Even a perfect retrieval sub-score is silent on the three things that usually decide the choice once two models are within noise on recall:
- Embedding dimension. A 1536-dim model and a 768-dim model can retrieve equally well on your set, but one doubles your vector storage and your approximate-nearest-neighbor index size — and the per-query latency that comes with it. Matryoshka models let you truncate; measure the truncated dimension on your eval, not the full one.
- Latency and price. A hosted model that wins by half a recall point and costs 4× per million tokens, or adds 80ms to every query, is not winning. Put cost and p95 latency in the same table as recall.
- Max sequence length. If your documents are long and the model silently truncates at 512 tokens, your recall ceiling is set by what got dropped, not by the model's quality.
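The first two items reduce to arithmetic you can put next to the recall column. The sketch below makes several assumptions: vectors are stored as float32 (4 bytes per value), the size estimate covers raw vectors only and ignores whatever overhead your ANN index adds, and the helper names are made up for illustration. Truncating and renormalizing is only meaningful for models trained with Matryoshka-style objectives.

```python
import math

def raw_vector_storage_gb(num_vectors, dim, bytes_per_value=4):
    """Raw embedding storage in GB, excluding ANN index overhead (float32 assumed)."""
    return num_vectors * dim * bytes_per_value / 1e9

def truncate_and_renormalize(vector, dim):
    """Keep the first `dim` values and rescale to unit length for cosine search."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm else head

def p95(samples):
    """Nearest-rank 95th percentile of a list of latency measurements."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

# Ten million chunks: the 1536-dim model needs twice the raw storage of the 768-dim one.
print(raw_vector_storage_gb(10_000_000, 1536))  # about 61.44
print(raw_vector_storage_gb(10_000_000, 768))   # about 30.72
```

If you evaluate a truncated Matryoshka embedding, apply `truncate_and_renormalize` to both the corpus and the query vectors before running the retrieval eval, so the numbers in the table describe the configuration you would actually deploy.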
A reasonable acceptance bar for a narrow-domain system, per practitioner targets, is Precision@5 around 0.7 and Recall@20 around 0.8 — but treat those as a sanity floor, not a goal. The real goal is a number that is yours: a frozen eval set that every future model swap has to clear before it ships. The first time a vendor's shiny new model regresses your recall by four points and your gate catches it, the afternoon of labeling will have paid for itself many times over.
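One way to turn the frozen eval set into a gate is a small comparison that runs before any model swap. This is a sketch under stated assumptions: the metric names, the scores, and the one-point tolerance are illustrative choices, not recommendations, and `passes_gate` is a hypothetical helper.

```python
def passes_gate(candidate, baseline, max_regression=0.01):
    """Compare candidate metrics with the current baseline on the frozen eval set.

    Both arguments map metric name -> score. Returns (ok, failures), where
    failures maps each regressed metric to (baseline_score, candidate_score).
    A metric missing from the candidate counts as a failure.
    """
    failures = {}
    for name, base_score in baseline.items():
        cand_score = candidate.get(name, 0.0)
        if cand_score < base_score - max_regression:
            failures[name] = (base_score, cand_score)
    return (not failures, failures)

# Illustrative numbers: the new model drops recall@20 by four points.
baseline = {"recall@20": 0.84, "precision@5": 0.72}
candidate = {"recall@20": 0.80, "precision@5": 0.73}

ok, failures = passes_gate(candidate, baseline)
print(ok)        # False
print(failures)  # {'recall@20': (0.84, 0.8)}
```

Run as a CI step, a check like this turns "every future model swap has to clear it" from an intention into a failing build.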
The leaderboard is a fine place to start a shortlist. It is a terrible place to end a decision.



