Pick a multimodal embedding model and the first number anyone quotes is its ImageNet zero-shot accuracy. It is the headline on every model card, the column everyone sorts by, the figure that decides which model your team will use to index a million documents. For multimodal RAG, it is close to irrelevant, and optimizing for it quietly doubles your infrastructure.
The benchmark measures the wrong job
ImageNet zero-shot accuracy asks a classification question: shown an image, can the model pick the right phrase from a fixed list of labels? That is a real skill, and the CLIP lineage is genuinely good at it. But retrieval is a different job. RAG asks the model to take a query and rank thousands of candidate passages or images by relevance, then hand the top few to a language model. A model can recognize a cat with 80% zero-shot accuracy and still be mediocre at retrieving the paragraph sitting next to the cat.
A CLIP score tells you the model can name a picture. It says nothing about whether it can find the document.
The skill RAG actually needs is twofold: strong cross-modal retrieval (a text query finds the right image, and vice versa) and strong text-to-text retrieval (a text query finds the right passage). Most corpora are mostly text with images sprinkled in. If your embedder is great cross-modal but weak text-to-text, you have not solved retrieval; you have solved only the part that covers a minority of your corpus.
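One way to make the two legs concrete is to measure them separately instead of reading a single classification score. The sketch below computes recall@k for a text-to-text leg and a cross-modal leg. The function names and the 2-d vectors are invented for illustration; in practice the vectors would come from whichever embedder you are evaluating, and the relevance labels from your own corpus.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_at_k(queries, corpus, relevant, k):
    """Fraction of queries whose relevant item appears in the top k.

    queries:  list of query vectors
    corpus:   list of candidate vectors (passages or images)
    relevant: relevant[i] is the corpus index that answers query i
    """
    hits = 0
    for q, target in zip(queries, relevant):
        ranked = sorted(range(len(corpus)),
                        key=lambda j: cosine(q, corpus[j]),
                        reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Toy vectors, made up for this sketch only.
text_queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]]  # text-to-text leg
images = [[0.1, 0.9], [0.8, 0.2]]                # cross-modal leg

text_leg = recall_at_k(text_queries, passages, relevant=[0, 1], k=1)
image_leg = recall_at_k(text_queries, images, relevant=[1, 0], k=1)
print(f"text-to-text recall@1: {text_leg:.2f}")
print(f"cross-modal recall@1:  {image_leg:.2f}")
```

Reporting the two numbers side by side, on a sample of your own queries, shows whether a candidate model is strong on both legs or only on one.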
Why CLIP and SigLIP leave you with two indexes
This is where the original contrastive image-text models betray a RAG pipeline. OpenAI's CLIP trains its text encoder on short web captions, and the encoder is hard-capped at 77 tokens — the positional embeddings simply do not go further. A model trained to match "a photo of a golden retriever" to a picture never learns to embed a 600-word technical passage so that a related passage lands nearby. Its text tower is a caption matcher, not a document retriever.
SigLIP, Google's sigmoid-loss reformulation, is a better-trained model — its per-pair loss avoids the giant global batches CLIP's softmax needs, and SigLIP 2 adds multilingual data, captioning objectives, self-distillation, and strong dense features for localization and OCR. But it is still optimized for the cross-modal and classification jobs. The text tower is not the product.
The practical consequence is the part nobody benchmarks. If you standardize on CLIP or SigLIP for a corpus of documents and images, you end up running a second model — a real text embedder — for the text-to-text leg, and maintaining a second index. The "multimodal" model handles only the image leg. You wanted one model and one vector store; the benchmark talked you into two of each.
The models built to collapse the two indexes
The newer entrants treat "be good at text-to-text too" as a design requirement, not an afterthought. The tell is in the title of the Jina paper: "Your CLIP Model Is Also Your Text Retriever."
- Jina CLIP v2 pairs an image tower with a genuine 8192-token, 89-language text encoder (its text tower is a full text-embedding model), trained multi-task so the same model handles cross-modal and text-to-text retrieval. It supports Matryoshka representations — you can truncate the output from 1024 dims down to 64 to trade a little accuracy for a lot of storage. The catch lives in the license, not the math: the weights are CC BY-NC 4.0, non-commercial. To ship it in a product you need a commercial license or the hosted API.
- Nomic Embed Vision v1.5 takes the other route to one index: instead of one model with two towers, it trains a vision encoder aligned to the existing latent space of Nomic Embed Text v1.5. Because the two share a space, every text embedding you already computed becomes multimodal — text queries hit image embeddings and vice versa, in the same index, with no re-embedding. Nomic reports this unified space beating OpenAI CLIP (cross-modal) and OpenAI's text-embedding-3-small (text) simultaneously — the exact both-legs win the others miss. The weights are Apache-2.0.
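The Matryoshka trade-off mentioned for Jina CLIP v2 can be sketched with plain arithmetic. The helper names below are hypothetical, the storage figure assumes float32 values and ignores any index overhead, and the truncation is only meaningful for models trained with Matryoshka representations.

```python
from math import sqrt

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def index_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 and no index overhead."""
    return num_vectors * dims * bytes_per_value

# One million embeddings at the full and the smallest output size.
full = index_bytes(1_000_000, 1024)
small = index_bytes(1_000_000, 64)
print(f"1024 dims: {full / 1e9:.2f} GB")
print(f"  64 dims: {small / 1e9:.2f} GB")
print(f"reduction: {full // small}x")
```

Truncated vectors should be renormalized before cosine or dot-product search, which is what `truncate_and_normalize` does. How much accuracy the smaller size costs depends on the model and the corpus, so it is worth checking on your own queries.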
How to actually choose
Start from the index, not the leaderboard. If your corpus is mostly text with images and you want a single store, you need a model with a real text tower: Jina CLIP v2 if multilingual and you can live with the non-commercial license (or pay for it), Nomic Embed Vision if you want permissive weights or already run Nomic Embed Text. If you are doing pure image search or zero-shot classification with no document-retrieval leg, the classical models are fine and SigLIP 2 is the strongest of them.
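The rules of thumb above can be written down as a small function, if only to make the branches explicit. This is a sketch of the reasoning in this section, not a recommendation engine: the function and parameter names are hypothetical, and where the text does not rank the options, the sketch returns both candidates.

```python
def suggest_embedders(needs_text_retrieval, multilingual=False,
                      can_use_noncommercial_or_pay=False,
                      already_runs_nomic_text=False):
    """Return candidate models following the rules of thumb in the text."""
    if not needs_text_retrieval:
        # Pure image search or zero-shot classification.
        return ["SigLIP 2"]
    candidates = []
    if multilingual and can_use_noncommercial_or_pay:
        candidates.append("Jina CLIP v2")
    if already_runs_nomic_text or not can_use_noncommercial_or_pay:
        candidates.append("Nomic Embed Vision v1.5")
    # The text does not rank the remaining cases, so list both.
    return candidates or ["Jina CLIP v2", "Nomic Embed Vision v1.5"]

print(suggest_embedders(needs_text_retrieval=False))
print(suggest_embedders(True, multilingual=True,
                        can_use_noncommercial_or_pay=True))
print(suggest_embedders(True))
```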
And if your "documents" are really scanned PDFs (tables, figures, layout), note that every model in this class embeds an entire image as a single vector, which blurs dense pages. That is a different problem with a different answer: late-interaction visual document models like ColPali, or a strong text embedder fed by a good document parser. Match the embedder to the shape of the data, and stop letting a classification score pick your retrieval stack.
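To see why one vector per page can blur a dense page, compare it with late-interaction scoring, where each query token is matched against its best page patch and the matches are summed (the MaxSim operation used by ColBERT-style models such as ColPali). The sketch below uses invented 2-d vectors, and mean pooling is only a stand-in for "one vector per image"; real single-vector models produce their embedding differently.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_pool(patches):
    """Stand-in for a single-vector page embedding (an assumption)."""
    return [sum(col) / len(patches) for col in zip(*patches)]

def maxsim_score(query_tokens, page_patches):
    """Late interaction: best patch per query token, summed over tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)

# Toy data: the query has two distinct token vectors.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0]]  # two sharply different regions
page_b = [[0.5, 0.5], [0.5, 0.5]]  # two uniform regions

# Pooled into one vector each, the two pages are indistinguishable.
print(mean_pool(page_a) == mean_pool(page_b))

# Late interaction keeps the per-patch detail and separates them.
score_a = maxsim_score(query_tokens, page_a)
score_b = maxsim_score(query_tokens, page_b)
print(score_a, score_b)
```

The cost of keeping per-patch vectors is a much larger index per page, which is part of why this is a separate design decision rather than a drop-in swap.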
Architecture, context-length, language-count, and license figures are drawn from each model's paper, official announcement, or model card, cited above. Retrieval-quality claims are reported by the model makers; as always with vendor-reported numbers, treat the direction as more reliable than the decimal.



