The pitch for text-to-SQL is irresistible: someone in operations types "what were our top five products by margin last quarter," and an agent writes the query, runs it, and answers. No analyst in the loop. The demos work. Then you point one at your actual warehouse — the one with a cust_t table, a status column that holds seven undocumented integer codes, and a "revenue" figure everyone knows excludes refunds but nobody wrote down — and the accuracy falls off a cliff.

That cliff is the whole story, and it tells you what to build. The mistake is to treat text-to-SQL as a model-quality problem and go shopping for the smartest LLM. The benchmarks say it is a context problem, and the popular open-source tools are the ones that took that seriously.

The benchmark that reframes the problem

For years the reference benchmark was Spider 1.0, built on small, clean databases — under ten tables, tidy names, no garbage values. Models crossed 90% execution accuracy on it, and text-to-SQL started to look solved. It wasn't; the benchmark was just easy.

BIRD ("Can LLM Already Serve as A Database Interface?", NeurIPS 2023) was built specifically to break that illusion: 12,751 question–SQL pairs over 95 large real databases across dozens of domains, with what the authors call "dirty and noisy database values," external knowledge that has to be grounded against the data, and SQL-efficiency concerns at scale. On BIRD, a human baseline of data engineers and database students hits 92.96% execution accuracy. At publication, GPT-4 managed about 46%. Today, after two years of furious leaderboard climbing, the top systems sit around 80–82% — still a double-digit gap behind humans, on the benchmark that actually resembles your job.

Spider is solved and BIRD is not, and the only thing that changed between them is whether the database is messy. That is the entire argument for context-first tooling.

The lesson is not "models are bad at SQL." A frontier model writes flawless SQL against a schema it understands. The lesson is that understanding your schema — the joins, the codes, the business definitions — is the binding constraint, and that knowledge does not live in the model. It lives in your database and in the heads of your analysts. The job of a text-to-SQL tool is to get it into the prompt.

Three bets on the same insight

The most-starred open-source projects all attack context, and the differences are in how they package it.

Model-agnostic text-to-SQL via "agentic retrieval": you train it on your DDL, documentation, and example queries, and it retrieves that context to generate SQL with any LLM
★ 24kPythonvanna-ai/vanna

Vanna is the purest expression of the thesis. It is a RAG library, not an application: you embed your DDL, your documentation, and — crucially — a set of known-good example queries, and Vanna retrieves the relevant pieces at question time to ground the generation. It is model-agnostic by design (OpenAI, Anthropic, Ollama, Gemini, Bedrock, and more) and storage-agnostic for the vector layer. The implicit claim is blunt: swap the LLM all you like, the retrieved context is what moves accuracy. If you want to understand the pattern, Vanna is the one to read.

GenBI platform: a governed semantic layer (its Modeling Definition Language) in front of the model that turns natural language into SQL, charts, and dashboards across 20+ data sources
★ 16kPythonCanner/WrenAI

WrenAI takes the same insight and hardens it into governance. Instead of retrieving raw schema, it asks you to define a semantic layer — its Modeling Definition Language (MDL), which encodes models, relationships, metrics, and access rules. The model then generates against your definitions, not against the raw tables, which is how you make "revenue" mean the one thing it's supposed to mean. It runs an Apache DataFusion engine across many sources and produces charts and dashboards, positioning itself as "GenBI" rather than a SQL box. The cost is the upfront modeling work; the payoff is answers your finance team will actually trust.

NL-to-SQL engine and agent with a context store for schema understanding, plus an enterprise API, admin console, and Slackbot

Dataherald sits between the two: a deployable NL-to-SQL engine with a context store, an API, an admin console, and a Slack integration. Smaller community, but the shape is the same — a place to put context, and an engine that uses it.

The contrast that proves the rule is the project that bets the other way:

A fine-tuned open model purpose-built to convert natural-language questions into SQL — the model-weights approach to the same problem
★ 4kJupyter Notebookdefog-ai/sqlcoder

SQLCoder is genuinely good, and it is the control group. A fine-tuned SQL model still does not know that your status = 3 means "refunded." Fine-tuning teaches a model the grammar and idioms of SQL; it cannot teach it your particular tables. That is why the retrieval and semantic-layer tools out-star it: they solve the part that actually fails in production. (The deeper version of this trade-off is the same one behind fine-tuning vs RAG everywhere else — teach the model a skill, or feed it the facts.)

How to choose

Reach for Vanna if you want a library to embed in your own agent and you're comfortable curating example queries — it's the lightest way to test whether context-grounding fixes your accuracy. Reach for WrenAI if multiple non-technical people will ask questions and the answers have to be governed — the MDL is the feature, not overhead. Reach for Dataherald if you want a standalone engine with enterprise plumbing already attached.

But whichever you pick, budget your effort the way the benchmark tells you to: not on choosing the cleverest model, but on writing down what your schema actually means. The model is the cheap, interchangeable part. Your context is the product.