Almost no production retrieval query is a pure nearest-neighbor lookup. It's a nearest-neighbor lookup with a WHERE clause: the ten most relevant chunks for this tenant, in English, from documents this user is allowed to see. The metadata predicate is the difference between a demo and a system. It is also the part that quietly breaks.
The reason it breaks is that "search, then filter" and "filter, then search" are both traps, and the obvious one is worse than it looks.
The two-stage traps
Post-filtering is what you write first: ask the index for the top 50 vectors, then drop the ones whose metadata doesn't match. It feels safe because the filter is exact. The problem is arithmetic. If your predicate matches 10% of the corpus and you fetch 50 candidates, you keep about five — and if you asked for ten results, you just silently returned half. Make the filter more selective and you return zero, from an index that contains plenty of valid answers. Pinecone calls this the missing WHERE clause for a reason: post-filtering can hand back fewer results than you asked for, and nothing in the response tells you it happened (Pinecone).
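To make the arithmetic concrete, here is a minimal sketch of post-filtering over a toy in-memory corpus. Everything in it is an assumption for illustration: the `post_filter_search` helper, the `tenant` field, and the evenly striped data that makes the predicate match exactly 10% of rows. A sorted list stands in for the vector index.

```python
import math

def post_filter_search(query, corpus, k, fetch, predicate):
    """Rank by distance first, then drop rows that fail the predicate."""
    ranked = sorted(corpus, key=lambda row: math.dist(row["vec"], query))
    candidates = ranked[:fetch]                      # what the index hands back
    return [row for row in candidates if predicate(row)][:k]

# Hypothetical toy corpus: 10,000 points on a line, tenants striped evenly,
# so exactly 10% of rows (1,000 of them) belong to tenant 3.
corpus = [{"id": i, "vec": (float(i), 0.0), "tenant": i % 10}
          for i in range(10_000)]

hits = post_filter_search((5000.0, 0.0), corpus, k=10, fetch=50,
                          predicate=lambda row: row["tenant"] == 3)
print(len(hits))  # 5: half of the requested k, and no error is raised
```

The corpus holds a thousand valid rows for this tenant, yet the caller gets five and has no way to tell from the return value alone.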
Brute-force pre-filtering is the correct-but-slow answer: compute the subset that matches the predicate, then compare the query against every vector in it. The results are exactly right. You've also thrown away the approximate-nearest-neighbor index that was the whole point of using a vector database, and once the matching subset runs to a few million vectors, that exhaustive scan consumes your entire latency budget.
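A matching sketch of brute-force pre-filtering, under the same assumptions (a hypothetical `pre_filter_search` helper and a striped toy corpus). It also returns how many distance computations it paid for, since that count is the cost that grows with the matching subset.

```python
import heapq
import math

def pre_filter_search(query, corpus, k, predicate):
    """Filter exactly, then score every surviving vector (no ANN index)."""
    subset = [row for row in corpus if predicate(row)]
    top = heapq.nsmallest(k, subset,
                          key=lambda row: math.dist(row["vec"], query))
    return top, len(subset)   # len(subset) = distance computations paid

# Hypothetical toy corpus: tenants striped evenly, 10% of rows per tenant.
corpus = [{"id": i, "vec": (float(i), 0.0), "tenant": i % 10}
          for i in range(10_000)]

hits, scanned = pre_filter_search((5000.0, 0.0), corpus, k=10,
                                  predicate=lambda row: row["tenant"] == 3)
print(len(hits), scanned)  # 10 1000
```

The result is complete and exact, and the work is linear in the number of matching rows: a thousand comparisons here, millions in a real corpus.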
So the real question isn't pre versus post. It's: can you apply the filter inside the index without wrecking it? And the answer reveals something most teams never see.
Why a filter breaks the graph
HNSW — the index under most vector search — is a navigable graph. You enter at one node and greedily hop to whichever neighbor is closest to your query, again and again, until you can't get closer. It is fast precisely because it only ever looks at a handful of nodes along the path.
Now delete 90% of the nodes mid-walk because they failed the filter. The greedy hop has nowhere to go. Worse, the surviving nodes can fragment into disconnected islands, and there may be no path at all from the entry point to the island where your matching results live. The search terminates early in the wrong neighborhood and returns whatever it found there, with nothing in the response to flag the miss.
A selective filter doesn't just shrink the haystack. It cuts the threads the index uses to walk through it — and recall falls off a cliff while latency still looks fine.
Qdrant describes this in the language of percolation theory: past a certain filtering ratio the graph decomposes into small components and search stops working (Qdrant). This is the non-obvious fact that reorganizes the whole topic. Filtering is not a post-processing step you bolt onto retrieval. It is a property the index has to be built for.
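The fragmentation is easy to reproduce on a toy graph. The sketch below is not HNSW: it is a single-layer graph on a line where each node links to its four nearest neighbors, with hypothetical helper names, and a predicate that keeps every tenth node. It is only meant to show the greedy walk stalling once the filter removes the nodes it would have hopped through.

```python
from collections import deque

def build_graph(n, reach=2):
    """Toy single-layer graph: node i links to i-reach .. i+reach."""
    return {i: [j for d in range(1, reach + 1)
                for j in (i - d, i + d) if 0 <= j < n]
            for i in range(n)}

def greedy_search(graph, query, entry, allowed):
    """Hop to the closest allowed neighbor until no neighbor improves."""
    current = entry
    while True:
        options = [nb for nb in graph[current] if nb in allowed]
        best = min(options, key=lambda nb: abs(nb - query), default=current)
        if abs(best - query) >= abs(current - query):
            return current
        current = best

def count_components(graph, allowed):
    """Connected components among allowed nodes, using only allowed edges."""
    seen, components = set(), 0
    for start in allowed:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in graph[node]:
                if nb in allowed and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return components

graph = build_graph(100)
everything = set(range(100))
filtered = {i for i in range(100) if i % 10 == 0}   # predicate keeps 10%

print(greedy_search(graph, 73, entry=0, allowed=everything))  # 73
print(greedy_search(graph, 73, entry=0, allowed=filtered))    # 0, though 70 matches
print(count_components(graph, everything),
      count_components(graph, filtered))                      # 1 10
```

With the filter applied, the ten surviving nodes are ten separate islands. The walk never leaves the entry point, and it returns node 0 even though node 70 passes the filter and sits right next to the query.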
Four engines, four different repairs
The mature vector databases all converged on "filter during the search," and each rebuilt the index to survive it differently.
Qdrant extends the HNSW graph with additional links derived from indexed payload values, so that for a given filter the matching nodes stay connected and traversal still has a path. It also keeps a query planner: estimate the predicate's cardinality from a payload index, and when the matching set is small enough, skip the graph and just do a full scan — the brute-force path, used deliberately, exactly where it's cheap (Qdrant).
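The planner idea can be sketched in a few lines. This is an illustration of the decision only, not Qdrant's code: `PayloadIndex`, `plan`, `search`, the `full_scan_threshold` argument, and the `graph_search` callback are all hypothetical names, and the cardinality here is an exact count where a real planner would work from an estimate.

```python
import math
from collections import defaultdict

class PayloadIndex:
    """Hypothetical inverted index: payload value -> set of point ids."""
    def __init__(self):
        self._ids = defaultdict(set)

    def add(self, point_id, value):
        self._ids[value].add(point_id)

    def ids(self, value):
        return self._ids.get(value, set())

    def cardinality(self, value):
        return len(self.ids(value))

def plan(index, value, full_scan_threshold):
    """Pick a strategy from the number of points matching the filter."""
    if index.cardinality(value) <= full_scan_threshold:
        return "full_scan"
    return "filtered_graph"

def search(query, vectors, index, value, k, full_scan_threshold, graph_search):
    if plan(index, value, full_scan_threshold) == "full_scan":
        # Small matching set: an exact scan is cheap and cannot miss.
        matching = index.ids(value)
        return sorted(matching,
                      key=lambda pid: math.dist(vectors[pid], query))[:k]
    # Large matching set: hand off to a filter-aware graph search (not shown).
    return graph_search(query, value, k)

vectors = {i: (float(i), 0.0) for i in range(1000)}
index = PayloadIndex()
for i in vectors:
    index.add(i, "a" if i < 5 else "b")   # tenant "a" is tiny, "b" is large

print(plan(index, "a", 100), plan(index, "b", 100))  # full_scan filtered_graph
print(search((3.2, 0.0), vectors, index, "a", k=2,
             full_scan_threshold=100, graph_search=lambda q, v, k: []))  # [3, 4]
```

The threshold is the knob that matters: set it too low and small tenants go through a fragmented graph, set it too high and large tenants pay for exhaustive scans.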
Weaviate historically resolved the filter into an allow-list from its inverted index and carried that list through HNSW traversal, only scoring allowed nodes. Its newer ACORN strategy attacks the disconnection problem head-on: it expands the neighbor list at runtime to route through non-matching nodes without scoring them, and seeds extra entry points inside the filtered region so the walk converges (Weaviate). The same idea anchors the ACORN paper, which builds a predicate-agnostic filtered search by densifying the graph and traversing only the predicate's subgraph (arXiv 2403.04871).
pgvector took the iterative route. Before 0.8.0, Postgres applied filters after the index returned its candidates: the classic post-filter shortfall, where a 10%-selective filter at the default hnsw.ef_search of 40 leaves roughly four rows (40 candidates × 10%). Version 0.8.0 added iterative index scans: if the filtered result set comes up short, it keeps pulling more candidates from the index until it has enough or hits a cap, plus better cost estimates so the planner can pick a B-tree on the filter column when that's actually faster (pgvector 0.8.0).
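The difference between the two behaviors fits in a short simulation. This is a simplified model, not pgvector's implementation: a sorted generator stands in for the HNSW index, the helper names are hypothetical, and the `ef_search` and `max_scan_tuples` arguments are only named after the pgvector settings `hnsw.ef_search` and `hnsw.max_scan_tuples`.

```python
import itertools
import math

def index_scan(query, corpus):
    """Stand-in for an ANN index: yields rows in ascending distance."""
    yield from sorted(corpus, key=lambda row: math.dist(row["vec"], query))

def single_pass(query, corpus, k, ef_search, predicate):
    """Old behavior: take ef_search candidates once, then filter."""
    candidates = itertools.islice(index_scan(query, corpus), ef_search)
    return [row for row in candidates if predicate(row)][:k]

def iterative_scan(query, corpus, k, max_scan_tuples, predicate):
    """Keep pulling candidates until k rows match or the cap is reached."""
    hits, scanned = [], 0
    for row in index_scan(query, corpus):
        if len(hits) >= k or scanned >= max_scan_tuples:
            break
        scanned += 1
        if predicate(row):
            hits.append(row)
    return hits, scanned

# Hypothetical toy corpus: tenants striped evenly, 10% of rows per tenant.
corpus = [{"id": i, "vec": (float(i), 0.0), "tenant": i % 10}
          for i in range(10_000)]
query = (5000.0, 0.0)
is_tenant_3 = lambda row: row["tenant"] == 3

print(len(single_pass(query, corpus, k=10, ef_search=40,
                      predicate=is_tenant_3)))          # 4
hits, scanned = iterative_scan(query, corpus, k=10, max_scan_tuples=20_000,
                               predicate=is_tenant_3)
print(len(hits))                                         # 10
```

The cap is what keeps a highly selective filter from turning the iterative scan into a full table walk. When it trips, the query can still return fewer than k rows, so the short-result check stays relevant.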
Pinecone folds the metadata index into the vector index so the predicate is evaluated during the scan — single-stage filtering that aims for pre-filter accuracy without the post-filter shortfall (Pinecone).
What to actually do
Three moves cover most teams. First, stop trusting post-filter result counts — if a query can return fewer than k after filtering, treat that as a bug, not an edge case, and measure recall on a labeled eval set with your real filters applied, not on unfiltered queries. Second, index the fields you filter on; every engine above needs a payload/metadata index to estimate selectivity and stay fast. Third, know your selectivity distribution: if most queries filter to a tiny slice (one tenant out of thousands), a partition or a brute-force fallback often beats any clever graph trick, and the engines that expose a full-scan threshold let you say so.
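The first move can be turned into a small eval harness. The sketch below assumes a labeled set of (query, filter) cases and a hypothetical `search_fn(query, corpus, k, predicate)` that returns ids. Ground truth comes from an exact filtered scan, and the system under test in the demo is a plain post-filter, so the numbers show what the harness should catch.

```python
import heapq
import math

def exact_top_k(query, corpus, k, predicate):
    """Ground truth: brute force over the rows that pass the filter."""
    matching = (row for row in corpus if predicate(row))
    nearest = heapq.nsmallest(k, matching,
                              key=lambda row: math.dist(row["vec"], query))
    return [row["id"] for row in nearest]

def evaluate(search_fn, corpus, cases, k):
    """Mean filtered recall@k, plus how many queries came back short."""
    recalls, short = [], 0
    for query, predicate in cases:
        truth = exact_top_k(query, corpus, k, predicate)
        got = search_fn(query, corpus, k, predicate)
        if len(got) < len(truth):        # fewer rows than were available
            short += 1
        if truth:
            recalls.append(len(set(got) & set(truth)) / len(truth))
    mean = sum(recalls) / len(recalls) if recalls else None
    return {"recall_at_k": mean, "short_results": short}

def post_filter(query, corpus, k, predicate, fetch=50):
    """System under test for this demo: top-`fetch`, then filter."""
    ranked = heapq.nsmallest(fetch, corpus,
                             key=lambda row: math.dist(row["vec"], query))
    return [row["id"] for row in ranked if predicate(row)][:k]

# Hypothetical toy corpus: tenants striped evenly, 10% of rows per tenant.
corpus = [{"id": i, "vec": (float(i), 0.0), "tenant": i % 10}
          for i in range(10_000)]
cases = [((5000.0, 0.0), lambda row: row["tenant"] == 3)]

report = evaluate(post_filter, corpus, cases, k=10)
print(report)  # {'recall_at_k': 0.5, 'short_results': 1}
```

Run against unfiltered queries, the same post-filter would score perfect recall. The gap only appears when the eval cases carry the filters production traffic actually uses.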
The WHERE clause was never the easy part. It's the part that decides whether your retrieval is correct.



