The standard RAG pipeline for a PDF does something faintly absurd: it takes a document a human reads with their eyes, throws away everything visual, OCRs it into a string, guesses where the columns and tables were, chunks the wreckage, and embeds the chunks. Every stage is a place to lose information, and on the documents people actually care about — financial filings, research papers, slide decks, scanned forms — the losses compound. Visual document retrieval asks the obvious question back: why not just look at the page?

The idea: retrieve on pixels, not parsed text

ColPali (Faysse et al., 2024) renders each page as an image and embeds it with a vision-language model, skipping OCR and layout parsing entirely. A table is retrieved as a table, a chart as a chart, because the model was trained to understand the image. On the ViDoRe benchmark — built specifically to measure visual document retrieval — this approach beats strong text pipelines, and it deletes an entire class of brittle preprocessing.

The repo that owns this layer is the engine itself:

Trains and runs the ColVision models — ColPali (on PaliGemma-3B), ColQwen2, ColSmol — that turn page images into retrieval embeddings
★ 2.7k · Python · illuin-tech/colpali

This is where the models live and where you'd build if you wanted full control. But using it directly surfaces the thing that makes visual RAG hard, and it isn't the model.

The catch nobody warns you about: one page, a thousand vectors

ColPali is a late-interaction model, the same family as ColBERT. It doesn't compress a page into a single embedding. At 448x448 resolution with 14x14-pixel patches, the vision transformer emits 1,024 patch tokens per page; a handful of prompt tokens bring the total to roughly 1,030, each projected to a 128-dimensional vector. So a single page becomes ~1,030 vectors, and a 1,000-page corpus becomes over a million.
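To make the late-interaction scoring concrete, here is a minimal NumPy sketch of the MaxSim operator used by ColBERT-style models: every query token is compared against every page vector, each token keeps its best match, and the per-token maxima are summed. The function names (`l2_normalize`, `maxsim_score`, `rank_pages`) and the random stand-in embeddings are illustrative assumptions, not colpali-engine's API.

```python
import numpy as np

N_VECTORS, DIM = 1030, 128  # per-page vector count and width quoted in the text


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Late-interaction score between one query and one page.

    query_vecs: (n_query_tokens, dim); page_vecs: (n_page_vectors, dim).
    """
    # Similarity of every query token against every page vector.
    sim = query_vecs @ page_vecs.T  # (n_query_tokens, n_page_vectors)
    # Each query token keeps only its best-matching page vector.
    return float(sim.max(axis=1).sum())


def rank_pages(query_vecs: np.ndarray, pages: list) -> list:
    """Return page indices sorted from best to worst MaxSim score."""
    scores = [maxsim_score(query_vecs, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Random stand-ins for real embeddings (assumption: no model is loaded here).
rng = np.random.default_rng(0)
pages = [l2_normalize(rng.standard_normal((N_VECTORS, DIM))) for _ in range(3)]

# A 5-token "query" built from slightly perturbed vectors of page 1,
# so page 1 should come out on top.
query = l2_normalize(pages[1][:5] + 0.01 * rng.standard_normal((5, DIM)))

ranking = rank_pages(query, pages)
print("best page:", ranking[0])
```

Note that scoring a single page already means a 5 x 1,030 similarity matrix. That per-page cost, multiplied across the corpus, is why the scoring side of the problem matters as much as the storage side.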

That's the real engineering problem, and it's a storage-and-scoring problem, not a modeling one. A standard vector database that expects one vector per document can't represent this at all. You need native multi-vector support (Qdrant, Milvus, or Postgres/pgvector with a schema built for it), and your storage grows with pages times vectors per page. The multi-vector explosion is exactly the regime where embedding quantization stops being optional: binary quantization with rescoring, or pgvector's half-precision halfvec type, is often what makes the bill survivable.
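The storage arithmetic and the "binary quantization with rescoring" idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not how any of the databases above implement it: the helper names (`bytes_per_page`, `binarize`, `hamming_maxsim`, `search`), the shortlist size, and the random embeddings are all hypothetical. Stage one ranks pages using only sign bits and Hamming distance; stage two rescores a short candidate list with the full-precision vectors.

```python
import numpy as np

VECTORS_PER_PAGE, DIM = 1030, 128  # figures quoted in the text


def bytes_per_page(bits_per_dim: int) -> int:
    """Raw embedding bytes for one page, ignoring index overhead."""
    return VECTORS_PER_PAGE * DIM * bits_per_dim // 8


def maxsim(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """Full-precision late-interaction score (sum of per-token maxima)."""
    return float((query_vecs @ page_vecs.T).max(axis=1).sum())


def binarize(vecs: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(vecs > 0, axis=-1)  # (n_vectors, DIM // 8), uint8


def hamming_maxsim(q_bits: np.ndarray, p_bits: np.ndarray) -> float:
    """Coarse MaxSim on binary codes: matching bits instead of dot products."""
    xor = q_bits[:, None, :] ^ p_bits[None, :, :]
    dist = np.unpackbits(xor, axis=-1).sum(axis=-1, dtype=np.int64)
    return float((DIM - dist).max(axis=1).sum())


def search(query, pages_float, pages_bits, shortlist=5, k=1):
    """Stage 1: rank all pages on binary codes. Stage 2: rescore the shortlist."""
    q_bits = binarize(query)
    coarse = [hamming_maxsim(q_bits, b) for b in pages_bits]
    candidates = np.argsort(coarse)[::-1][:shortlist]
    rescored = sorted(candidates, key=lambda i: maxsim(query, pages_float[i]),
                      reverse=True)
    return [int(i) for i in rescored[:k]]


for name, bits in [("float32", 32), ("float16", 16), ("binary", 1)]:
    print(f"{name}: {bytes_per_page(bits):,} bytes per page")

# Random unit vectors stand in for real page embeddings (assumption).
rng = np.random.default_rng(0)
pages = [rng.standard_normal((VECTORS_PER_PAGE, DIM)) for _ in range(20)]
pages = [p / np.linalg.norm(p, axis=-1, keepdims=True) for p in pages]
pages_bits = [binarize(p) for p in pages]

# A 5-token query made of lightly perturbed vectors from page 7.
query = pages[7][:5] + 0.01 * rng.standard_normal((5, DIM))
top = search(query, pages, pages_bits)
print("top page after rescoring:", top)
```

At these sizes a page costs 527,360 bytes in float32, 263,680 in float16, and 16,480 as binary codes, so a 1,000-page corpus drops from roughly 527 MB to about 16.5 MB of raw vectors in the first-stage index. The full-precision vectors still have to live somewhere for rescoring, but they no longer need to sit in the hot path.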

This is the axis the three tools actually differ on. Not who has the best model — they all lean on colpali-engine — but who owns the storage problem.

Byaldi: the wrapper that punts on scale (on purpose)

A thin wrapper over colpali-engine — RAGatouille's sister project — that indexes and searches with ColPali-class models in a few lines of code
★ 850 · Python · AnswerDotAI/byaldi

Byaldi's whole design goal is to get you from zero to a working visual-RAG demo in minutes. It mirrors RAGatouille's "fewest lines possible" philosophy, hides colpali-engine behind a clean API, and keeps the embeddings in memory. That's a feature for prototyping and a hard ceiling for production: nothing persists, nothing filters on metadata, and the moment your corpus outgrows RAM you're done. Reach for byaldi to find out whether visual retrieval helps your documents at all. Don't reach for it to serve them.

ColiVara: the layer that solves the hard part

A production visual-document-retrieval platform: Postgres + pgvector storage, REST API with Python/TS SDKs, 100+ file formats, built on ColPali-class models
★ 1.5k · Python · tjmlabs/ColiVara

ColiVara starts where byaldi stops. It persists the multi-vectors in Postgres + pgvector, runs a separate embeddings service (ColiVarE, which wants 8GB+ of GPU VRAM), exposes a REST API with Python and TypeScript SDKs, ingests 100+ file formats, and supports metadata filtering on collections and documents. It posts about 86.8 average on ViDoRe with end-to-end latencies in the few-seconds range. In other words, it's taken the storage-and-serving problem the model layer hands you and made it someone's product. The tradeoff is the usual one for a platform: you adopt its database, its services, and its opinions.

A fourth repo worth knowing if you'd rather compare techniques than commit to one is adithya-s-k/VARAG (~497 stars), a vision-first engine that puts Simple RAG, Vision RAG, ColPali RAG, and a hybrid side by side — useful as a sandbox before you pick a lane.

How to actually choose

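These three aren't rivals so much as three rungs on one ladder, and you climb it as your corpus grows: byaldi to find out whether visual retrieval helps your documents at all, ColiVara once the index has to persist and be served, and colpali-engine itself when you want full control over the models and the storage.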

And the prior question is whether you need visual retrieval at all. If your documents are clean and text-native, a good parsing-and-chunking pipeline is cheaper and nearly as good. Visual RAG earns its GPU and storage costs precisely where text extraction fails — the messy, structured, image-heavy documents that broke your pipeline in the first place.