For a couple of years, choosing among the three serious open-source vector databases had a staircase answer. Lean deployment? Qdrant. Mid-size app that wants hybrid search and modules? Weaviate. Hundreds of millions or billions of vectors? Milvus. You found yourself on the scale axis and read off a name.
That staircase has flattened. Qdrant runs across clusters; Weaviate runs as a single binary; Milvus ships Milvus Lite, an embedded Python mode you can pip install. All three now do hybrid search — BM25-style keyword scoring fused with dense vectors — which used to be Weaviate's signature and is now table stakes. And the benchmark everyone quotes, raw queries-per-second at 95% recall, has converged into a band where the differences are noise.
Worse than converged: that benchmark was always measuring a query you will never run.
Nobody runs unfiltered search in production
Open the ANN benchmark leaderboards and you are looking at unfiltered nearest-neighbor search: pure "find the closest vectors in the whole set." No real retrieval system does that. Every production query carries a predicate riding alongside the vector: tenant_id = 4491, permission_scope IN (...), published_at >= '2026-01-01', doc_type = 'contract'. The filter is not a garnish. In multi-tenant RAG it is a hard security boundary: return one row from the wrong tenant and you have a data breach, not a bad search result.
And here is the thing the marketing charts hide: filtering and approximate search do not compose cleanly — the pre-filter vs. post-filter tradeoff is the whole ballgame. An HNSW index is a navigable graph built for the whole dataset. Restrict it to the 0.2% of nodes that pass your filter and two failure modes appear. Pre-filter — mask the graph to matching nodes only — and the graph can shatter into disconnected islands, so the greedy traversal gets stranded and recall collapses. Post-filter — search first, then drop non-matches — and you either over-fetch wildly or come back with fewer than the topK you asked for. The gap between a database that handles this well and one that doesn't is not 10%; it is the 2–3× latency cliff that shows up exactly when your filters get selective.
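The post-filter shortfall is easy to reproduce without any database. The sketch below is a toy, assuming brute-force exact distances in place of an HNSW index and a made-up dataset of 10,000 rows spread over 500 tenants, so a single-tenant filter matches 0.2% of rows. The function names and the `fetch` parameter are hypothetical, chosen only for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def post_filter_search(query, rows, predicate, k, fetch):
    """Search first, filter second: keep the `fetch` nearest rows, then drop non-matches."""
    ranked = sorted(rows, key=lambda r: sq_dist(query, r["vec"]))[:fetch]
    return [r for r in ranked if predicate(r)][:k]

def exact_filtered_search(query, rows, predicate, k):
    """Reference answer: filter first, then rank only the matching rows."""
    matches = [r for r in rows if predicate(r)]
    return sorted(matches, key=lambda r: sq_dist(query, r["vec"]))[:k]

rng = random.Random(0)
rows = [
    {"id": i, "tenant_id": i % 500, "vec": [rng.random() for _ in range(8)]}
    for i in range(10_000)
]
query = [0.5] * 8

def is_tenant_7(row):
    # Matches 20 of 10,000 rows, i.e. a 0.2% selective filter.
    return row["tenant_id"] == 7

k = 10
naive = post_filter_search(query, rows, is_tenant_7, k, fetch=k)         # fetch exactly k
wide = post_filter_search(query, rows, is_tenant_7, k, fetch=len(rows))  # over-fetch everything
exact = exact_filtered_search(query, rows, is_tenant_7, k)

print(len(naive), len(wide), len(exact))
```

With `fetch=k` the post-filter pass almost always comes back short of `k`, and the only way this toy recovers the exact answer is to over-fetch the entire collection, which is the cost the paragraph above describes.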
This is the one axis where Qdrant, Milvus, and Weaviate still genuinely diverge — and each solved it a different way.
Three architectures for the same hard problem
Qdrant builds the filter into the graph walk. Its payload (metadata) indexes are integrated directly into HNSW traversal, so the predicate is evaluated during the search rather than as a pre- or post-pass. Written in Rust, it consistently posts the lowest p50 filtered latency of the purpose-built engines — roughly 4 ms in independent 2026 comparisons — and its TurboQuant mode delivers about 8× compression at near-baseline recall, with memory-mapped storage for collections larger than RAM. If your filters are high-cardinality and latency is the SLA you're judged on, this is the default.
Weaviate ships ACORN, its default filter strategy since v1.34, based on the "predicate-agnostic" ACORN research. Instead of masking the graph and hoping it stays connected, ACORN skips non-matching nodes in distance calculations, does a two-hop neighborhood expansion when a connecting node fails the filter, and seeds extra matching entry points so the walk reaches relevant regions fast. The payoff is that filtered performance stays predictable regardless of what you filter on: arbitrary boolean predicates, not just a blessed field. Weaviate also keeps the RAG batteries inside the database: BM25 hybrid, rerankers, vectorizer and generative modules.
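The two-hop idea can be illustrated on a toy graph. What follows is a minimal sketch, not Weaviate's implementation: it assumes an eight-node chain standing in for an HNSW layer, one-dimensional positions, an entry point that already passes the filter, and an exhaustive best-first walk with no early termination. `filtered_walk` and its parameters are hypothetical names.

```python
import heapq

def filtered_walk(graph, pos, passes, entry, query, k, two_hop):
    """Best-first walk that only scores nodes passing the filter.

    With two_hop=False a failing neighbor is a dead end (masked graph).
    With two_hop=True a failing neighbor is skipped, but its own
    neighbors are still considered, which bridges the gap it leaves.
    """
    def dist(node):
        return abs(pos[node] - query)

    visited = {entry}
    frontier = [(dist(entry), entry)]
    found = []
    while frontier:
        d, node = heapq.heappop(frontier)
        found.append((d, node))
        for nb in graph[node]:
            if passes(nb):
                reachable = [nb]
            elif two_hop:
                # No distance is computed for nb itself; look through it.
                reachable = [n2 for n2 in graph[nb] if passes(n2)]
            else:
                reachable = []
            for n in reachable:
                if n not in visited:
                    visited.add(n)
                    heapq.heappush(frontier, (dist(n), n))
    return [node for _, node in sorted(found)[:k]]

# Chain 0-1-2-...-7; only even nodes pass the filter, so every
# matching node is separated from the next by a failing one.
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
pos = {i: float(i) for i in range(8)}

def passes(node):
    return node % 2 == 0

masked = filtered_walk(graph, pos, passes, entry=0, query=5.5, k=2, two_hop=False)
expanded = filtered_walk(graph, pos, passes, entry=0, query=5.5, k=2, two_hop=True)
print(masked, expanded)  # → [0] [6, 4]
```

The masked walk is stranded on its entry node because its only neighbor fails the filter, while the two-hop walk reaches the matching nodes nearest the query.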
Milvus takes the distributed-systems route: iterative filtering runs the vector search in batches, scalar-filtering each batch until it accumulates topK matches, which avoids both the disconnected-graph and the too-few-results traps. Its sharper tool is the partition key — declare a field like tenant_id as the partition key and Milvus physically groups data by it, so a tenant-scoped query prunes entire segments before any vector math happens. Pair that with its index zoo (IVF, HNSW, DiskANN, GPU) and you have the engine that still earns the "billions of vectors" label — at the cost of an etcd/object-store/message-queue footprint you'll want ops staff to run.
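Both mechanisms can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Milvus API: `ranked_ids` stands in for the stream of candidates a vector index would yield nearest-first, partitions are modeled as a plain dictionary keyed by field value, and every function name here is hypothetical.

```python
from collections import defaultdict

def iterative_filter_search(ranked_ids, passes, top_k, batch_size):
    """Consume nearest-first candidates in batches, filtering each batch,
    and stop as soon as top_k matches have accumulated."""
    hits, scanned = [], 0
    for start in range(0, len(ranked_ids), batch_size):
        batch = ranked_ids[start:start + batch_size]
        scanned += len(batch)
        hits.extend(i for i in batch if passes(i))
        if len(hits) >= top_k:
            break
    return hits[:top_k], scanned

def build_partitions(rows, key):
    """Group rows by a partition-key field so a query scoped to one
    value never has to touch the other groups."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

# Iterative filtering: 1,000 candidates already ordered by distance,
# with a filter that passes one id in fifty.
ranked_ids = list(range(1000))
hits, scanned = iterative_filter_search(
    ranked_ids, passes=lambda i: i % 50 == 0, top_k=5, batch_size=100
)
print(hits, scanned)  # → [0, 50, 100, 150, 200] 300

# Partition key: a tenant-scoped query only sees its own group.
rows = [{"id": i, "tenant_id": i % 4} for i in range(20)]
parts = build_partitions(rows, "tenant_id")
print(len(parts[3]), "of", len(rows), "rows scanned for tenant 3")
```

The batch loop always returns a full `top_k` when enough matches exist, at the price of scanning further as the filter gets more selective. The partition lookup shrinks the candidate set before any distance is computed.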
The decision rule
Stop choosing by vector count and start choosing by filter shape. If your queries carry high-cardinality, per-request predicates — permissions, user IDs, arbitrary tag sets — and you're measured on tail latency, Qdrant's in-graph filtering is the cleanest fit. If you filter on arbitrary boolean expressions and want the retrieval pipeline (hybrid, rerank, generate) to live in the database, Weaviate's ACORN plus its modules earns its keep. If one low-cardinality field dominates your access pattern and you're genuinely at billion-scale, Milvus's partition keys turn that filter into free pruning.
Quantization, once a real tiebreaker, no longer is: all three now compress hard enough that "which fits in memory" is a config flag. The honest 2026 differentiator is the query you actually run — filtered, credential-scoped, and nothing like the benchmark on the box.



