You have a retrieval pipeline and a self-hosting decision. The hosted embedding APIs work fine until one of four things pushes back — cost at volume, tail latency, data residency, or the simple fact that the open model you want to use isn't on anyone's menu. So you go looking for something to put a GPU behind, and three names keep surfacing: Text Embeddings Inference, Infinity, and vLLM.
The comparison everyone reaches for is throughput: whose tokens-per-second figure is highest on an H100. That benchmark is real, and it is also the least useful way to choose, because the published numbers contradict each other depending on who paid for the blog post and how long the sequences were. The decision that actually follows you for years is architectural: should embeddings be served by a dedicated specialist, or should they ride on the same engine that already runs your LLM? Answer that first, and the shortlist collapses on its own.
The thing embedding traffic does that generation doesn't
Start with the workload, because it shapes everything. Embedding traffic is bimodal. There is the bulk job — re-indexing a corpus, where you feed the GPU millions of documents and want pure throughput — and there is the trickle, query-time embedding of one user question at a time, where you want low latency and the batch is almost empty. A good embedding server has to be excellent at both, and the lever for that is batching: pulling concurrent requests into one forward pass so the accelerator is never idle on a batch of size one.
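To make the batching lever concrete, here is a minimal sketch of the packing step a dynamic batcher performs: queued requests are grouped, in arrival order, under a token budget and a request cap. The names (Request, form_batches, max_batch_tokens, max_batch_size) are illustrative assumptions, not any server's actual internals, and real servers add timeouts, padding-aware accounting, and priority handling on top.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: str   # hypothetical identifier, for illustration only
    num_tokens: int   # tokenized length of the input text


def form_batches(
    pending: List[Request],
    max_batch_tokens: int,
    max_batch_size: int,
) -> List[List[Request]]:
    """Greedily pack queued requests into batches, preserving arrival order.

    A batch is closed as soon as adding the next request would exceed
    either the token budget or the request cap.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    current_tokens = 0

    for req in pending:
        if req.num_tokens > max_batch_tokens:
            raise ValueError(f"{req.request_id} exceeds the token budget on its own")

        would_overflow = (
            current_tokens + req.num_tokens > max_batch_tokens
            or len(current) >= max_batch_size
        )
        if current and would_overflow:
            batches.append(current)
            current, current_tokens = [], 0

        current.append(req)
        current_tokens += req.num_tokens

    if current:
        batches.append(current)
    return batches


# Mixed traffic: two short queries interleaved with two long documents.
queue = [
    Request("q1", 12),
    Request("d1", 400),
    Request("d2", 300),
    Request("q2", 8),
]
batched = form_batches(queue, max_batch_tokens=512, max_batch_size=32)
print([[r.request_id for r in batch] for batch in batched])
```

With these assumed limits, the four requests need two forward passes instead of four, and the short queries share a pass with the bulk documents instead of waiting behind them.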
That is why "raw transformers in a loop" is the wrong baseline. A naive loop processes one request per forward pass and leaves most of the GPU dark. All three of these servers exist to fix that, and once batching is correct, the gap between two well-configured servers is far smaller than the gap between either of them and the loop you were tempted to ship. Keep that proportion in mind every time a benchmark waves a 2x at you.
TEI: the Rust specialist
Text Embeddings Inference is Hugging Face's bet that embedding serving should be a small, fast, single-purpose binary. It is written in Rust — about 87% of the repo — with Candle and ONNX backends, Metal for Apple Silicon, and experimental ROCm for AMD. It serves embedding models (BERT, XLM-RoBERTa, GTE, ModernBERT, NomicBERT, Qwen3-Embedding, and friends) and sequence-classification rerankers, with dynamic batching and a deliberately tiny footprint.
The detail worth noticing is its pooling support: CLS, mean, last-token, and SPLADE. That last one matters if you're doing learned-sparse retrieval, because it means TEI can produce sparse term-weight vectors, not just dense ones — a capability the dense-only crowd quietly lacks.
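The four pooling modes are easier to tell apart in code. This is a plain-Python sketch of what each one computes from a model's per-token outputs. It is not TEI's implementation, and the SPLADE variant shown (max over tokens of log(1 + ReLU(logit))) is the commonly described max-pooling form, assumed here for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token


def cls_pool(hidden: Matrix) -> List[float]:
    """Use the first token's hidden state as the sentence vector."""
    return list(hidden[0])


def mean_pool(hidden: Matrix, mask: List[int]) -> List[float]:
    """Average hidden states over real (non-padding) tokens."""
    kept = [row for row, m in zip(hidden, mask) if m]
    dim = len(hidden[0])
    return [sum(row[d] for row in kept) / len(kept) for d in range(dim)]


def last_token_pool(hidden: Matrix, mask: List[int]) -> List[float]:
    """Use the hidden state of the last real token (typical for decoder models)."""
    last = max(i for i, m in enumerate(mask) if m)
    return list(hidden[last])


def splade_pool(logits: Matrix, mask: List[int]) -> List[float]:
    """Sparse term weights: per vocabulary entry, max over tokens of log(1 + ReLU(logit)).

    Input rows are vocabulary-sized logits, not hidden states, so the output
    has one weight per vocabulary term, and most of them are zero.
    """
    kept = [row for row, m in zip(logits, mask) if m]
    vocab = len(logits[0])
    return [max(math.log1p(max(row[v], 0.0)) for row in kept) for v in range(vocab)]


# Three token positions, the last one padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(cls_pool(hidden), mean_pool(hidden, mask), last_token_pool(hidden, mask))
```

The dense modes all return a vector the size of the hidden dimension. SPLADE pooling returns one the size of the vocabulary, which is why it slots into an inverted index instead of a vector index.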
Choose TEI when embeddings (and maybe reranking) are the whole job and you want the leanest, fastest dedicated service you can operate. The cost is that its world ends at embeddings and rerankers: if your pipeline reaches for late-interaction or multimodal models, you'll be standing up a second server anyway.
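From the client side, a TEI deployment looks roughly like the sketch below, which builds requests for its /embed and /rerank routes using only the standard library. The base URL and port are assumptions about your deployment, and the helper names are made up for this example. Check the payload shapes against the API documentation for the TEI version you run.

```python
import json
import urllib.request
from typing import Any, List


def _post_json(url: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_embed_request(base_url: str, texts: List[str]) -> urllib.request.Request:
    """Request dense embeddings for a list of texts."""
    return _post_json(f"{base_url.rstrip('/')}/embed", {"inputs": texts})


def build_rerank_request(
    base_url: str, query: str, texts: List[str]
) -> urllib.request.Request:
    """Request relevance scores for candidate texts against a query."""
    return _post_json(
        f"{base_url.rstrip('/')}/rerank", {"query": query, "texts": texts}
    )


def send(req: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send a prepared request and decode the JSON body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Assumed local deployment; nothing is sent until send() is called.
embed_req = build_embed_request("http://localhost:8080", ["what is paged attention?"])
print(embed_req.full_url)
```

Sending many texts in one call, or many calls concurrently, is what gives the server something to batch. A client that sends one text at a time and waits for each response recreates the naive loop over HTTP.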
Infinity: the all-in-one RAG model server
Infinity makes the opposite bet: that a modern RAG pipeline needs more than dense vectors, so the server should speak the whole zoo. It is a Python server (torch, with optimum/ONNX, TensorRT, and CTranslate2 backends, plus FlashAttention) that serves embeddings, rerankers, ColBERT late-interaction models, ColPali for document-image retrieval, CLIP for images, and CLAP for audio — all behind one REST API with dynamic batching.
That breadth is the reason to pick it. If your retrieval stack is dense embeddings plus a reranker plus multi-vector late interaction plus multimodal, Infinity is one service and one deployment instead of three. (If you don't yet know whether you need late interaction, that's worth settling on its own terms — see ColBERT vs dense vs sparse retrieval before you provision for it.)
Choose Infinity when your bottleneck is model-type sprawl, not raw speed on a single embedding model. The tradeoff is the usual one for a generalist: on a pure dense-embedding bulk job, a Rust specialist will likely edge it on throughput-per-dollar.
vLLM: don't run a second server at all
vLLM's pitch is the one most teams overlook. If you are already running vLLM for generation, it can also serve embeddings from the same engine — the same paged attention, the same continuous batching, the same GPU. You point it at a pooling model (vllm serve <model> --task embed) and embedding requests flow through the identical scheduler, each one a single forward pass plus a pooling step. Continuous batching lets those requests join and leave the batch freely, which is exactly the property bimodal embedding traffic wants.
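Because vLLM exposes an OpenAI-compatible HTTP API, the client side of this can be sketched with the standard library alone. The base URL and the model name below are placeholders, the helper names are invented for this example, and the response shape assumed is the OpenAI embeddings format (a "data" list of items carrying "index" and "embedding").

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_embeddings_request(
    base_url: str, model: str, texts: List[str]
) -> urllib.request.Request:
    """Build (but do not send) a POST to the OpenAI-style embeddings route."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(body: Dict[str, Any]) -> List[List[float]]:
    """Return vectors in input order, sorting by the index each item reports."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(base_url: str, model: str, texts: List[str]) -> List[List[float]]:
    """Send the request to a running server and return one vector per text."""
    req = build_embeddings_request(base_url, model, texts)
    with urllib.request.urlopen(req, timeout=30.0) as resp:
        return parse_embeddings_response(json.loads(resp.read().decode("utf-8")))


# Placeholder deployment details; nothing is sent until embed() is called.
req = build_embeddings_request(
    "http://localhost:8000", "my-embedding-model", ["first doc", "second doc"]
)
print(req.full_url)
```

If the same engine also serves your generation traffic, the only client-side difference between the two workloads is the route and the payload, which is much of the operational appeal.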
The non-obvious win here isn't a throughput number; it's the service you don't deploy. Consolidating embeddings onto an engine you already operate removes a whole component from your architecture — one fewer thing to scale, monitor, secure, and page someone about. That's the same consolidation logic that makes vLLM the default in the generation tier, extended one model-class to the left.
The caveats are real: vLLM is the heaviest of the three to stand up if you don't already have it, pooling-task configuration has shifted across releases (multitask pooling was removed; you now set the task explicitly, so check the flag against the documentation for the version you run), and not every small encoder-style embedding model is a first-class citizen the way decoder-style ones are. If embeddings are your only GPU workload, paying vLLM's operational weight to avoid a lightweight specialist is the wrong trade.
How to actually choose
Run the decision in this order and you'll rarely be wrong:
- Already operate vLLM with GPU headroom? Serve embeddings from it and delete a service. Stop here unless your embedding models aren't well supported.
- Embeddings (± reranking) are the whole job? TEI — smallest footprint, fastest dedicated path, SPLADE if you need sparse.
- Pipeline needs many model types — rerankers, ColBERT, ColPali, multimodal? Infinity — one API for the zoo beats three deployments.
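The ordering above matters more than any single rule, so here it is as a small function. This is a sketch of the heuristic as written, with field names invented for the example. It treats "not only embeddings and reranking" as meaning the pipeline needs the broader model zoo.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    runs_vllm_with_headroom: bool       # already operating vLLM with spare GPU capacity
    vllm_supports_your_models: bool     # your embedding models work well on vLLM
    only_embeddings_and_reranking: bool # no ColBERT, ColPali, CLIP, CLAP, etc.


def choose_server(s: Situation) -> str:
    """Apply the three checks in order; the first match wins."""
    if s.runs_vllm_with_headroom and s.vllm_supports_your_models:
        return "vLLM"
    if s.only_embeddings_and_reranking:
        return "TEI"
    return "Infinity"


# A team with no existing vLLM deployment and a dense-plus-reranker pipeline.
print(choose_server(Situation(False, False, True)))
```

Note that the vLLM check comes first on purpose: a team that already runs it with headroom and good model support stops there, even if TEI would also fit.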
Only after that should you open the benchmark tab, and only to break a tie between two servers that both clear the bar. The largest performance variable in your retrieval stack is not which of these three you picked — it's whether you let it batch. Get the architecture right, turn batching on, and the rest is a rounding error you can measure later.



