Pick a multimodal embedding model and the first number anyone quotes is its ImageNet zero-shot accuracy. It is the headline on every model card, the column everyone sorts by, the figure that decides which model your team uses to index a million documents. For multimodal RAG, it is close to irrelevant, and optimizing for it quietly doubles your infrastructure.

The benchmark measures the wrong job

ImageNet zero-shot accuracy asks a classification question: shown an image, can the model pick the right phrase from a fixed list of labels? That is a real skill, and the CLIP lineage is genuinely good at it. But retrieval is a different job. RAG asks the model to take a query and rank thousands of candidate passages or images by relevance, then hand the top few to a language model. A model can post 80% zero-shot accuracy, reliably recognizing the cat, and still be mediocre at retrieving the paragraph sitting next to the cat.
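
The difference between the two jobs can be sketched in a few lines. This is a minimal illustration, assuming toy 3-dimensional vectors in place of real embeddings; the function names, labels, and document IDs are hypothetical and not taken from any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, label_vecs):
    """Classification: pick the single best label from a small fixed list."""
    return max(label_vecs, key=lambda name: cosine(image_vec, label_vecs[name]))

def retrieve(query_vec, corpus_vecs, k):
    """Retrieval: rank every candidate in the corpus and keep the top k."""
    ranked = sorted(
        corpus_vecs,
        key=lambda doc_id: cosine(query_vec, corpus_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy vectors with made-up values, for illustration only
labels = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
image = [0.9, 0.1, 0.0]
corpus = {
    "passage_about_cats": [0.8, 0.1, 0.1],
    "passage_about_dogs": [0.1, 0.9, 0.0],
    "passage_about_tax": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.1]

predicted_label = zero_shot_classify(image, labels)
top_docs = retrieve(query, corpus, k=2)
```

Classification only has to get one argmax right over a handful of labels. Retrieval has to get the ordering right across the entire corpus, which is why a good score on the first says little about the second.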

A CLIP score tells you the model can name a picture. It says nothing about whether it can find the document.

The skill RAG actually needs is twofold: strong cross-modal retrieval (a text query finds the right image, and vice versa) and strong text-to-text retrieval (a text query finds the right passage). Most real corpora are predominantly text with images sprinkled in. If your embedder is great cross-modal but weak text-to-text, you have not solved retrieval. You have solved the smaller part of it.
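
One way to make the two legs visible is to score them separately on your own queries instead of reading a single headline number. The sketch below assumes you already have ranked results and relevance labels per query; the leg names, IDs, and the `recall_by_leg` helper are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def recall_by_leg(results, k):
    """Average recall@k separately for each retrieval leg.

    `results` is a list of (leg, ranked_ids, relevant_ids) tuples, where
    `leg` is a label such as "text-to-text" or "text-to-image".
    """
    totals, counts = {}, {}
    for leg, ranked_ids, relevant_ids in results:
        totals[leg] = totals.get(leg, 0.0) + recall_at_k(ranked_ids, relevant_ids, k)
        counts[leg] = counts.get(leg, 0) + 1
    return {leg: totals[leg] / counts[leg] for leg in totals}

# Made-up evaluation runs, for illustration only
runs = [
    ("text-to-image", ["img1", "img7", "img3"], ["img1"]),
    ("text-to-image", ["img4", "img2", "img9"], ["img2"]),
    ("text-to-text", ["p5", "p8", "p2"], ["p1"]),
    ("text-to-text", ["p3", "p6", "p4"], ["p3"]),
]
scores = recall_by_leg(runs, k=2)
```

In this made-up data the cross-modal leg looks perfect while the text-to-text leg finds only half of what it should, which is the failure pattern described above.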

Why CLIP and SigLIP leave you with two indexes

This is where the original contrastive image-text models betray a RAG pipeline. OpenAI's CLIP trains its text encoder on short web captions, and the encoder is hard-capped at 77 tokens — the positional embeddings simply do not go further. A model trained to match "a photo of a golden retriever" to a picture never learns to embed a 600-word technical passage so that a related passage lands nearby. Its text tower is a caption matcher, not a document retriever.
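
To see what a 77-token cap means for a document-length passage, here is a rough sketch. It assumes whitespace splitting as a stand-in for CLIP's real BPE tokenizer (which also reserves positions for start and end tokens, so the usable budget is slightly smaller), and `truncate_for_encoder` is a hypothetical helper, not a library function. Depending on the library, over-long input is either truncated like this or rejected outright.

```python
MAX_TOKENS = 77  # CLIP's text context length, per the paragraph above

def truncate_for_encoder(text, max_tokens=MAX_TOKENS):
    """Return (kept_tokens, dropped_count) under a hard token cap.

    Whitespace splitting is only an approximation of the real tokenizer.
    """
    tokens = text.split()
    kept = tokens[:max_tokens]
    return kept, len(tokens) - len(kept)

# A synthetic 600-word passage
passage = " ".join(f"word{i}" for i in range(600))

kept, dropped = truncate_for_encoder(passage)
fraction_seen = len(kept) / (len(kept) + dropped)
```

Under these assumptions the encoder sees roughly the first eighth of the passage, and everything after that never reaches the embedding.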

SigLIP, Google's sigmoid-loss reformulation, is a better-trained model — its per-pair loss avoids the giant global batches CLIP's softmax needs, and SigLIP 2 adds multilingual data, captioning objectives, self-distillation, and strong dense features for localization and OCR. But it is still optimized for the cross-modal and classification jobs. The text tower is not the product.
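
The batch-size point is easier to see with the two losses side by side. This is a simplified sketch in plain Python, assuming `sim` is a square matrix of already temperature-scaled image-text logits with matching pairs on the diagonal; it is not either project's training code, and the `bias` default is a placeholder.

```python
import math

def softmax_contrastive_loss(sim):
    """CLIP-style loss: every row and column is normalized over the whole batch."""
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            log_norm = math.log(sum(math.exp(s) for s in row))
            total += log_norm - row[i]  # negative log-probability of the true pair
        return total / n

    cols = [list(col) for col in zip(*sim)]
    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))

def sigmoid_pairwise_loss(sim, bias=0.0):
    """SigLIP-style loss: each (image, text) pair is an independent binary decision."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            # -log(sigmoid(x)) written as log(1 + exp(-x))
            total += math.log1p(math.exp(-label * (sim[i][j] + bias)))
    return total / n

aligned = [[4.0, 0.0], [0.0, 4.0]]   # matching pairs score high
shuffled = [[0.0, 4.0], [4.0, 0.0]]  # mismatched pairs score high

softmax_aligned = softmax_contrastive_loss(aligned)
sigmoid_aligned = sigmoid_pairwise_loss(aligned)
```

The softmax version needs the full row and column of similarities before it can score a single pair, so the normalization spans the whole batch. The sigmoid version scores each pair on its own.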

The practical consequence is the part nobody benchmarks. If you standardize on CLIP or SigLIP for a corpus of documents and images, you end up running a second model — a real text embedder — for the text-to-text leg, and maintaining a second index. The "multimodal" model handles only the image leg. You wanted one model and one vector store; the benchmark talked you into two of each.
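
Part of the cost of two indexes is that their scores live in different embedding spaces and cannot be compared directly, so every query needs a merge step. A common choice is reciprocal rank fusion, sketched below. This is an assumption about how such a pipeline might merge results, not something the models above prescribe; the document IDs are made up and `k=60` is the conventional default for this technique.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists from separate indexes by summing 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for one query, one ranked list per index
text_index_hits = ["passage_12", "passage_40", "passage_7"]
image_index_hits = ["figure_3", "figure_9"]

merged = reciprocal_rank_fusion([text_index_hits, image_index_hits])
```

With a single model and a single index this step disappears: one query embedding, one nearest-neighbor search, one ranked list.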

The models built to collapse the two indexes

The newer entrants treat "be good at text-to-text too" as a design requirement, not an afterthought. The tell is in the title of the Jina paper: "Jina CLIP: Your CLIP Model Is Also Your Text Retriever".

How to actually choose

Start from the index, not the leaderboard. If your corpus is mostly text with images and you want a single store, you need a model with a real text tower: Jina CLIP v2 if your corpus is multilingual and you can live with the non-commercial license (or pay for a commercial one), Nomic Embed Vision if you want permissive weights or already run Nomic Embed Text. If you are doing pure image search or zero-shot classification with no document-retrieval leg, the classical models are fine and SigLIP 2 is the strongest of them.

And if your "documents" are really scanned PDFs (tables, figures, layout), note that every model in this class embeds an entire image as one vector, which blurs dense pages. That is a different problem with a different answer: late-interaction visual document models like ColPali, or a strong text embedder fed by a good document parser. Match the embedder to the shape of the data, and stop letting a classification score pick your retrieval stack.
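
Late interaction keeps many vectors per page instead of one and scores with a ColBERT-style MaxSim sum. The sketch below shows only that scoring rule, assuming toy 2-dimensional embeddings; the vectors and the `late_interaction_score` name are illustrative, not ColPali's actual API.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def late_interaction_score(query_token_vecs, page_patch_vecs):
    """MaxSim: each query token takes its best-matching patch, and the maxima are summed."""
    return sum(
        max(dot(q, p) for p in page_patch_vecs)
        for q in query_token_vecs
    )

# Made-up embeddings, for illustration only
query_tokens = [[1.0, 0.0], [0.0, 1.0]]              # two query tokens
page_patches = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]  # three regions of one page

score = late_interaction_score(query_tokens, page_patches)
```

Each query token is free to match a different region of the page, so a table in one corner and a caption in another can both contribute. The trade-off is storage: the index holds one vector per patch rather than one per page.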

Architecture, context-length, language-count, and license figures are drawn from each model's paper, official announcement, or model card, cited above. Retrieval-quality claims are reported by the model makers; as always with vendor-reported numbers, treat the direction as more reliable than the decimal.