Everyone benchmarks the model that writes the code. Almost no one benchmarks the step before it: the one where the agent has to find the right twenty lines inside a repository of two million. That retrieval step is where coding agents quietly diverge, and the divergence is sharper than you'd expect. Two of the most prominent agents disagree not on tuning but on architecture. One builds a vector index of your entire codebase. The other deleted the index and runs grep.

Camp one: index everything

The textbook approach treats code like any other corpus. Chunk the repository, embed each chunk into a vector, store the vectors, and at query time embed the user's request and pull the nearest neighbors. Cursor is the reference implementation, and the most instructive thing about it isn't the search — it's how much machinery the freshness problem demands.
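
To make the chunk, embed, store, query loop concrete, here is a minimal sketch. It is not Cursor's pipeline: the `embed` function is a bag-of-tokens counter standing in for a real embedding model, and the names `chunk`, `build_index`, and `search` are hypothetical helpers chosen for illustration.

```python
import math
import re
from collections import Counter


def chunk(source: str, lines_per_chunk: int = 3) -> list[str]:
    """Split a file into fixed-size line windows (a deliberately naive chunker)."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def embed(text: str) -> Counter:
    """ASSUMPTION: a sparse token-count vector stands in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(files: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Embed every chunk of every file once, ahead of query time."""
    return [(path, text, embed(text))
            for path, source in files.items()
            for text in chunk(source)]


def search(index, query: str, k: int = 2) -> list[tuple[str, str]]:
    """Embed the query and return the k nearest chunks as (path, text)."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(path, text) for path, text, _ in ranked[:k]]
```

Note that `build_index` runs once against a snapshot of the files, while `search` can run many times afterwards. Everything that follows about freshness comes from that gap.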

Per Cursor's own security writeup, indexing computes a Merkle tree of file hashes so that syncing a changed repo only walks the branches whose hashes differ, instead of re-uploading everything. Files are chunked locally, the chunks are sent up to compute embeddings, and the vectors land in a server-side vector database (with obfuscated file paths and line ranges as metadata) while the raw source is not persisted past the request. There are even "content proofs" so a teammate can't pull chunks for code they don't actually have. That is a lot of distributed-systems engineering, and nearly all of it exists for one reason: an index of a thing that changes every few seconds is perpetually trying to catch up to the truth.
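
The Merkle-tree idea can be sketched in a few lines. This is an illustration of the general technique, not Cursor's implementation: the nested-dict repository model and the names `build` and `changed_files` are assumptions made for the example, and deletions are not handled.

```python
import hashlib


def _sha(data: str) -> str:
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def build(tree) -> dict:
    """Build a Merkle node from a nested dict; string leaves are file contents."""
    if isinstance(tree, str):
        return {"hash": _sha(tree), "children": None}
    children = {name: build(sub) for name, sub in sorted(tree.items())}
    # A directory's hash depends on the names and hashes of its children,
    # so any change below it changes every hash on the path up to the root.
    combined = "".join(f"{name}:{node['hash']};" for name, node in children.items())
    return {"hash": _sha(combined), "children": children}


def changed_files(old, new, prefix: str = "") -> list[str]:
    """Return paths of files in `new` that are new or differ from `old`."""
    if old is not None and old["hash"] == new["hash"]:
        return []  # identical subtree: skip it without visiting any children
    if new["children"] is None:
        return [prefix]  # a changed or newly added file
    old_children = (old or {}).get("children") or {}
    out = []
    for name, node in new["children"].items():
        path = f"{prefix}/{name}" if prefix else name
        out += changed_files(old_children.get(name), node, path)
    return out
```

With a tree like this, one edited file in a large repository means comparing hashes only along the path from the root to that file. Only the returned paths would need re-chunking and re-embedding.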

Camp two: don't index at all

The other camp looked at that machinery and walked away. The team behind Claude Code is unusually blunt about it. As its creator put it, early versions "used RAG + a local vector db, but we found pretty quickly that agentic search generally works better. It is also simpler and doesn't have the same issues around security, privacy, staleness, and reliability." So Claude Code navigates a repo with the same tools a human uses — grep for content, glob for filenames, read for specific files — and lets the model decide where to look next.
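
As a rough picture of what such a toolset looks like, here is a sketch of three tools built on the Python standard library. These are not Claude Code's actual tool definitions: the function names, signatures, and return shapes are assumptions for illustration. In a real agent the model would choose which tool to call next based on the previous result.

```python
import re
from pathlib import Path


def glob_files(root: str, pattern: str = "*") -> list[str]:
    """List files under root whose names match a glob pattern, e.g. '*.py'."""
    base = Path(root)
    return sorted(p.relative_to(base).as_posix()
                  for p in base.rglob(pattern) if p.is_file())


def grep(root: str, regex: str, file_pattern: str = "*") -> list[tuple[str, int, str]]:
    """Search current file contents; returns (path, line_number, line) hits."""
    rx = re.compile(regex)
    hits = []
    for rel in glob_files(root, file_pattern):
        try:
            text = (Path(root) / rel).read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if rx.search(line):
                hits.append((rel, number, line.strip()))
    return hits


def read_file(root: str, rel_path: str, start: int = 1, end=None) -> str:
    """Return lines start..end (1-indexed, inclusive) of one file."""
    lines = (Path(root) / rel_path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1:end])
```

A typical sequence might be `grep(root, r"def parse_config")` to locate a definition, then `read_file` on the hit to pull in the surrounding lines. Every call reads the files as they are on disk at that moment, so there is no snapshot to go stale.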

A code embedding is a photograph of a moving target. Rename one symbol and the index is subtly wrong everywhere that symbol appears — and you won't get an error, just worse retrieval.

This is the heart of the matter, and it's why "which is more accurate, embeddings or grep?" is the wrong question. The deciding variable is the staleness tax. An embedding is computed from a snapshot; the instant you refactor, the vectors drift away from the live code, silently. Keeping them honest costs real infrastructure — the Merkle trees, the re-embedding, the cache invalidation. Agentic grep pays nothing here because it always reads the current files. What it pays instead is per-query latency and tokens: every search is a live tool call, not a precomputed lookup.
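
The failure mode is easy to reproduce in miniature. The sketch below uses a precomputed symbol-to-file lookup as a stand-in for any snapshot-based index (an assumption: a real embedding index degrades more gradually, but for the same reason). The names `build_symbol_index` and `live_search` are hypothetical.

```python
import re


def build_symbol_index(files: dict[str, str]) -> dict[str, set[str]]:
    """Precompute which files define which functions, from a snapshot."""
    index: dict[str, set[str]] = {}
    for path, source in files.items():
        for name in re.findall(r"def\s+(\w+)", source):
            index.setdefault(name, set()).add(path)
    return index


def live_search(files: dict[str, str], name: str) -> set[str]:
    """Scan the current contents at query time, as a grep-style tool would."""
    rx = re.compile(rf"\b{re.escape(name)}\b")
    return {path for path, source in files.items() if rx.search(source)}


files = {"auth.py": "def check_token(t):\n    return bool(t)\n"}
snapshot = build_symbol_index(files)  # index built before the refactor

# Refactor: rename the function. The snapshot is not rebuilt.
files["auth.py"] = files["auth.py"].replace("check_token", "verify_token")

stale_miss = snapshot.get("verify_token", set())  # the new name is unknown
stale_hit = snapshot.get("check_token", set())    # the old name still "exists"
fresh = live_search(files, "verify_token")        # reads the current source
```

Neither stale lookup raises an error. The index answers confidently from the old snapshot, while the live scan pays for a full pass over the files on every query.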

The tell, and the middle paths

If embeddings were clearly winning for code, a company that sold code embeddings wouldn't have removed them. But Sourcegraph did exactly that: it replaced Cody's embeddings with its native keyword search, citing privacy, the operational burden of keeping embeddings current, and the fact that vector search over codebases with more than 100,000 repositories was too resource-intensive to scale. Code, it turns out, is unusually hostile to dense retrieval. It is full of exact identifiers (symbol names, API names, error strings) that lexical matching nails and semantic similarity fumbles, which is the long-standing argument for exact lexical match in retrieval.

The smart money is increasingly on hybrids that refuse the false choice. Aider builds a structural index instead of a semantic one: it parses 130+ languages with tree-sitter, builds a graph of which files reference which symbols, and ranks it with a personalized PageRank biased toward the current conversation — a map that's cheap enough to regenerate that staleness never accrues. Relace keeps embeddings but bolts a code reranker on top, retrieving broadly then reordering precisely (it reports recall@k of 0.71 versus 0.61 for the next-best system on a UI-generation task — a vendor benchmark, so treat it as directional). And the lesson from RAG generally applies in full force here: how you chunk code decides more than which embedding model you pick.
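
To show what a PageRank "biased toward the current conversation" means, here is a small power-iteration sketch of personalized PageRank over a file reference graph. It is not Aider's code: the edge convention (referencing file points to defining file), the damping value, and the function name are assumptions for illustration.

```python
def personalized_pagerank(edges, personalization, damping=0.85, iterations=50):
    """Rank nodes by random walks that restart at the personalized nodes.

    edges: iterable of (src, dst) pairs, read as "src references dst".
    personalization: dict of node -> weight, e.g. files named in the chat.
    """
    total = sum(personalization.values())
    if total <= 0:
        raise ValueError("personalization needs at least one positive weight")
    nodes = sorted({n for edge in edges for n in edge} | set(personalization))
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for dst in out[n]:
                    nxt[dst] += share
            else:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank
```

If the conversation mentions `app.py`, passing `{"app.py": 1.0}` as the personalization pushes rank toward the files `app.py` references, directly or transitively. Files that are unreachable from it score zero. Because the graph comes from parsing the current source, rebuilding it is a local computation rather than a sync with a remote index.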

Pick by your constraint, not by fashion. If code can't leave the machine, or your repo churns constantly, agentic grep is the honest default, and the very thing that makes it feel "dumber" (no fancy vectors) is what keeps it correct. If you're searching enormous repos cold and your queries are vague, an index earns its keep, provided you're willing to fund the sync. Either way, retrieval is only the first half of the agent's job: once it has found the code, it still has to write the edit back fast, a problem with its own dedicated models.