There's a question that shows up in every RAG design doc and almost every retrieval interview: cross-encoder or bi-encoder? It's phrased like a fork in the road, two interchangeable options where you pick the better one. It isn't. The two models do different jobs in the same pipeline, and they're forced apart by one fact about how transformers compute — a fact worth getting straight, because it dictates the entire shape of a retrieval system.

The fork is a single architectural choice: encode together, or encode apart

A bi-encoder, also called a dual-encoder, runs the query through the model and each document through the model separately. Each side collapses to one fixed-length vector, and relevance is just the similarity between the query vector and a document vector, typically cosine similarity or a dot product. The thing that matters here isn't the accuracy; it's the independence. Because a document's vector doesn't depend on the query, you can compute it once, offline, and store it. At query time you encode only the query and run an approximate-nearest-neighbor search over a precomputed index of millions of document vectors. That's why vector databases and bi-encoders go together: the index a vector database serves is exactly this set of precomputed embeddings.
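
To make the offline/online split concrete, here is a minimal sketch. Everything in it is an illustrative assumption: `embed` is a toy hashed bag-of-words function standing in for a real bi-encoder model, the 64-dimension size is arbitrary, and a brute-force scan stands in for the approximate-nearest-neighbor index. The part to look at is the shape: documents are encoded with no query in sight, and the query is the only thing encoded at search time.

```python
import math
import re
import zlib

DIM = 64  # assumed toy embedding size; real models use hundreds of dimensions


def embed(text: str) -> list[float]:
    """Toy stand-in for a bi-encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is used instead of hash() so the result is stable across runs.
        vec[zlib.crc32(token.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(docs: list[str]) -> list[list[float]]:
    """Offline step: each document is encoded once, independently of any query."""
    return [embed(d) for d in docs]


def search(query: str, index: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    """Query-time step: encode only the query, then compare to stored vectors.

    A brute-force scan stands in for approximate-nearest-neighbor search.
    """
    q = embed(query)
    scores = [(i, sum(a * b for a, b in zip(q, d))) for i, d in enumerate(index)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]


docs = [
    "how to reset a forgotten password",
    "quarterly revenue report for the sales team",
    "password policy and account lockout rules",
]
index = build_index(docs)  # computed once and stored
hits = search("reset my password", index)  # list of (doc_id, similarity)
```

Nothing in `build_index` takes a query argument, which is the whole point: that function can run as a batch job long before the first user shows up.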

A cross-encoder does the opposite. It feeds the query and a single document into the model together, as one input, so the transformer's self-attention runs across both at once — every query token can attend to every document token and vice versa. The output is a single relevance score. This is strictly more expressive: the model can notice that a word in the query is disambiguated by a phrase in the document, the kind of fine-grained interaction a bi-encoder threw away when it squashed each side into one vector. The Sentence Transformers docs put the trade-off in one line: cross-encoders "achieve higher performance than Bi-Encoders, however, they do not scale well for large datasets."

The reason they don't scale is the same fact, inverted. Because the document's representation is now entangled with the specific query, there is nothing to precompute. A cross-encoder "does not produce a sentence embedding." Every document you want to score has to go through the full model again, paired with this exact query.

The cost that ends the argument

Put a number on it. Sentence-BERT's own framing: to find the most similar pair among just 10,000 sentences, a BERT cross-encoder needs n·(n−1)/2 = 49,995,000 inference passes — about 65 hours on a V100 GPU. Encode them once with a bi-encoder and compare the resulting vectors, and the same task takes about 5 seconds, while keeping nearly the same accuracy.
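
The arithmetic behind those figures is short enough to check by hand. The sketch below only rearranges the numbers quoted above; the per-pass time is what the 65-hour figure implies, not a separately measured benchmark.

```python
n = 10_000                      # sentences in the collection
pair_passes = n * (n - 1) // 2  # cross-encoder: one forward pass per unordered pair
encode_passes = n               # bi-encoder: one forward pass per sentence

cross_hours = 65                # wall-clock figure quoted above
ms_per_pass = cross_hours * 3600 / pair_passes * 1000

print(pair_passes)                   # 49995000
print(pair_passes // encode_passes)  # 4999
print(round(ms_per_pass, 2))         # 4.68
```

Each individual pass is cheap, a few milliseconds. What sinks the cross-encoder is the count: it grows with the square of the collection, while the bi-encoder's grows linearly.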

A cross-encoder is the only model accurate enough to be the last word on relevance, and the only model too expensive to ever be the first.

Five seconds versus sixty-five hours is not a tuning detail. It is the reason you cannot run a cross-encoder as a first-stage retriever over any real corpus, and the reason the first stage always goes to a model whose document vectors can be precomputed. The two models sit at opposite ends of the same trade: the bi-encoder spends accuracy to buy scale; the cross-encoder spends scale to buy accuracy.

So you don't choose — you stage

Once you see that, the standard architecture writes itself, and it's not "pick one." It's retrieve-and-rerank, the pattern the reranker pieces describe and the Sentence Transformers "Retrieve & Re-Rank" docs lay out directly:

  1. Retrieve. A bi-encoder (often combined with BM25 in a hybrid) searches the whole corpus and returns the top ~100 candidates in milliseconds. This stage optimizes recall — get the relevant documents into the set at all.
  2. Rerank. A cross-encoder scores each of those ~100 candidates against the query and reorders them. This stage optimizes precision — get the best of the set to the top.
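
The two stages above can be sketched as a single function. This is a hedged outline, not a production pipeline: `retrieve_and_rerank`, `toy_retrieve`, and `toy_score_pair` are hypothetical names, and the word-overlap scorers are stand-ins for a real bi-encoder (or BM25) and a real cross-encoder. The counter is there to show that the expensive scorer runs once per candidate, not once per corpus document.

```python
from typing import Callable, Sequence


def retrieve_and_rerank(
    query: str,
    corpus: Sequence[str],
    retrieve: Callable[[str, int], list[int]],
    score_pair: Callable[[str, str], float],
    top_k: int = 100,
    final_k: int = 3,
) -> list[tuple[int, float]]:
    """Two-stage search: a cheap recall stage, then an expensive precision stage."""
    # Stage 1: the retriever covers the whole corpus but returns only top_k ids.
    candidate_ids = retrieve(query, top_k)
    # Stage 2: the pair scorer runs once per candidate.
    scored = [(i, score_pair(query, corpus[i])) for i in candidate_ids]
    scored.sort(key=lambda s: s[1], reverse=True)
    return scored[:final_k]


corpus = [
    "python is a programming language",
    "the python is a large snake",
    "java is a programming language",
    "snakes of south america",
]
pair_calls = {"n": 0}  # counts how often the expensive scorer is invoked


def toy_retrieve(query: str, k: int) -> list[int]:
    """Stand-in for the first stage: rank documents by shared word count."""
    q = set(query.split())
    overlap = lambda i: len(q & set(corpus[i].split()))
    return sorted(range(len(corpus)), key=overlap, reverse=True)[:k]


def toy_score_pair(query: str, doc: str) -> float:
    """Stand-in for one cross-encoder forward pass on a (query, document) pair."""
    pair_calls["n"] += 1
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)  # Jaccard overlap as a toy relevance score


top = retrieve_and_rerank(
    "python programming language", corpus, toy_retrieve, toy_score_pair,
    top_k=2, final_k=1,
)
print(top[0][0], pair_calls["n"])  # 0 2
```

With `top_k=2`, the pair scorer is called twice even though the corpus holds four documents. Scale the corpus to millions and that call count stays at `top_k`, which is why the pattern works.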

The cross-encoder's crippling per-query cost stops being crippling, because you only ever point it at a short list the bi-encoder already narrowed down. You apply the expensive model exactly where it's affordable. This is also why "which is better" is a category error: the bi-encoder isn't competing with the cross-encoder for the reranking job, and the cross-encoder isn't auditioning for the retrieval job. Each is the only option for its stage.

The decision you actually face is narrower and more useful: does your pipeline need the rerank stage at all? A reranker earns its latency when recall is already high and the ordering within your retrieved set is what's costing you — when you can fit only a few chunks in the prompt and it matters which few, or which sit at the start and end of the context rather than buried mid-list. If your generator reads the whole retrieved block and uses a relevant chunk wherever it lands, a second model that merely reshuffles those chunks may not pay for itself.

ColBERT: the model that refused to pick a side

There's a third design that's worth knowing precisely because it dissolves the binary. ColBERT (Khattab & Zaharia, 2020) keeps the bi-encoder's superpower — it encodes query and document separately, so document representations precompute offline and it scales to retrieval — but instead of crushing each document into one vector, it stores one vector per token. At query time it scores with late interaction, or MaxSim: for each query token, take its highest similarity to any document token, then sum those maxima. That brings back a slice of the cross-encoder's token-level matching without re-running the model on every pair, which is how ColBERT lands roughly two orders of magnitude faster than a BERT cross-encoder at four orders of magnitude fewer FLOPs per query.
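
MaxSim as described above fits in a few lines. This is a minimal pure-Python sketch, assuming the token vectors are already L2-normalized so that a dot product is a cosine similarity; the two-dimensional vectors are made-up values chosen to keep the arithmetic easy to follow.

```python
def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query token, take its best match among
    the document's token vectors, then sum those maxima."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


# Toy unit-length token vectors (illustrative values only).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]  # one exact match for the first query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]  # only partial matches for both query tokens

score_a = maxsim(query_vecs, doc_a)  # 1.0 + 0.8
score_b = maxsim(query_vecs, doc_b)  # 0.8 + 0.8
```

The document vectors never see the query: `doc_a` and `doc_b` can be computed and stored ahead of time, and only the cheap max-and-sum happens per query.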

Nothing is free. The bill for late interaction is storage: a per-token index is much larger than a per-document one, which is the entire reason ColBERTv2's residual compression exists, cutting that footprint by roughly 6 to 10 times. So the spectrum is really three points, all governed by the same lever: how much do query and document get to interact, and when do you pay for it? A bi-encoder: no interaction, all paid offline. A cross-encoder: full interaction, all paid at query time. ColBERT: token-level interaction, mostly paid offline in storage.
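
A back-of-the-envelope estimate shows where the storage bill comes from. Every number here is an assumption picked for illustration (corpus size, average length, vector dimensions, and precisions), and it covers uncompressed vectors only; real indexes will differ.

```python
n_docs = 1_000_000    # assumed corpus size
tokens_per_doc = 100  # assumed average document length in tokens

# Assumed single-vector layout: one 768-dim float32 vector per document.
single_vector_bytes = n_docs * 768 * 4
# Assumed per-token layout: one 128-dim float16 vector per token.
per_token_bytes = n_docs * tokens_per_doc * 128 * 2

print(single_vector_bytes / 1e9)                        # 3.072
print(per_token_bytes / 1e9)                            # 25.6
print(round(per_token_bytes / single_vector_bytes, 1))  # 8.3
```

Under these assumptions the per-token index is several times larger even with smaller, lower-precision vectors, because the multiplier is the number of tokens per document.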

Stop asking which one wins. Ask where in the pipeline you are. The retrieve stage wants the model that precomputes; the rerank stage wants the model that interacts; and if the gap between them is hurting your recall, there's a model that splits the difference and bills you in disk.