Frequently asked

What does "optimizing a prompt" actually mean here?

Instead of you editing the wording of a prompt by hand, the library treats the prompt (and sometimes few-shot examples or weights) as a parameter and runs a search loop that proposes variations and keeps the ones that score better — either against a metric you define or against an LLM's written critique of the output.

Do I need a labeled dataset to use these?

DSPy effectively requires a trainset plus a metric to compile against: it bootstraps few-shot demonstrations and searches instructions to maximize that metric. TextGrad needs almost no labeled data: the "gradient" is an LLM's natural-language feedback, so a handful of examples or even one task can drive it. AdalFlow supports both modes depending on how you configure its trainer.

Is AdalFlow the same as the LightRAG graph-RAG project?

No, and the name collision is real. This AdalFlow was published as `lightrag` on PyPI before its 2024 rename, but it is unrelated to the popular graph-based retrieval project also called LightRAG. If you `pip install lightrag` expecting prompt optimization, you'll get the wrong library — install `adalflow`.

DSPy vs TextGrad vs AdalFlow: Optimizing Prompts Instead of Writing Them

For two years the craft of getting good output from a language model was, embarrassingly, string editing. You wrote a prompt, eyeballed the result, added "think step by step," moved an instruction to the top, threatened the model, bribed it with a tip, and re-ran. The prompt was a parameter you tuned by hand, with your eyes as the loss function. Everyone knew this was beneath the field's dignity and nobody had a better loop.

The better loop is the premise these three Python libraries share: stop hand-editing the string and let an algorithm optimize it. You define what "good" looks like, hand over a few examples, and a search process proposes prompt variations and keeps the ones that score higher. It is the move from prompt engineering to prompt optimization — and once you accept it, the only real question is what the optimizer needs from you before it can start. That is exactly where these three disagree.

The one that compiles against a metric

▟ stanfordnlp/dspy

A framework for programming — not prompting — language models: write modules with typed signatures and let optimizers compile the prompts against a metric

★ 35k · Python · stanfordnlp/dspy

DSPy is the category's center of gravity, and "DSPy alternatives" is the search that sends most people to the other two. Its bet is that a prompt should never be written at all — you declare a Signature (question -> answer), compose Modules (Predict, ChainOfThought, ReAct), and an Optimizer (MIPROv2, BootstrapFewShot) compiles the actual prompt by bootstrapping few-shot demonstrations and searching instruction phrasings to maximize a metric you provide. The official gloss — "Declarative Self-improving Python" — is the whole pitch: you write the program, the compiler writes the prompt.
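To make "compile" concrete, here is a toy sketch of the two moves described above: bootstrapping demonstrations the metric accepts, then searching instruction and demo combinations for the best score. It is not DSPy's API or its actual search algorithm. `fake_lm`, `bootstrap_demos`, and `compile_prompt` are hypothetical names, and the "model" is a deterministic stand-in so the loop runs offline.

```python
from itertools import product

def fake_lm(instruction, demos, question):
    """Hypothetical stand-in for an LM call (assumption: not a real model)."""
    a, b = (int(x) for x in question.split("+"))
    # The toy model only answers tersely if told to, or if shown examples.
    terse = "digits only" in instruction or len(demos) > 0
    return str(a + b) if terse else f"The answer is {a + b}."

def metric(gold, predicted):
    return 1.0 if predicted == gold else 0.0

def evaluate(instruction, demos, trainset):
    total = sum(metric(gold, fake_lm(instruction, demos, q)) for q, gold in trainset)
    return total / len(trainset)

def bootstrap_demos(instructions, trainset):
    """Keep (question, output) pairs that the metric accepts as few-shot demos."""
    demos = []
    for instruction in instructions:
        for q, gold in trainset:
            out = fake_lm(instruction, (), q)
            if metric(gold, out) == 1.0 and (q, out) not in demos:
                demos.append((q, out))
    return demos

def compile_prompt(instructions, trainset, max_demos=2):
    demos = bootstrap_demos(instructions, trainset)[:max_demos]
    # Search every instruction with and without the bootstrapped demos.
    candidates = product(instructions, [(), tuple(demos)])
    best = max(candidates, key=lambda c: evaluate(c[0], c[1], trainset))
    return {"instruction": best[0], "demos": best[1],
            "score": evaluate(best[0], best[1], trainset)}

TRAINSET = [("1+2", "3"), ("10+5", "15"), ("7+7", "14")]
INSTRUCTIONS = ["Answer the question.", "Answer with digits only."]
compiled = compile_prompt(INSTRUCTIONS, TRAINSET)
```

The point of the sketch is the shape of the loop: nobody wrote the winning prompt, it was selected because it scored best on the trainset.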

The tell is the word compile. A compiler needs an objective and inputs to optimize over, and DSPy needs the same: a metric and a trainset. That requirement is its strength when you have them — give it a hundred labeled examples and a scoring function and it will out-tune anything you'd do by hand, and re-tune for free when you swap models. It is also its barrier to entry. If you can't write down a metric, or you have no examples yet, the most powerful tool in the space has nothing to grip. DSPy is what you reach for when the task is measurable — and the discipline of defining that metric is the same discipline the eval frameworks sell.
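"Writing down a metric" is less exotic than it sounds. In DSPy a metric is, as far as I understand the convention, an ordinary Python function that takes an example, a prediction, and an optional trace, and returns a bool or a number. The sketch below follows that shape but uses `SimpleNamespace` objects as stand-ins for the library's example and prediction types, so it runs without DSPy installed. The field name `answer` and the sample data are assumptions for illustration.

```python
from types import SimpleNamespace

def exact_match(example, pred, trace=None):
    """Score one prediction: True when the normalized answers agree."""
    return example.answer.strip().lower() == pred.answer.strip().lower()

# Stand-ins for a labeled trainset and for the program's predictions.
trainset = [
    SimpleNamespace(question="Capital of France?", answer="Paris"),
    SimpleNamespace(question="Capital of Germany?", answer="Berlin"),
]
predictions = [
    SimpleNamespace(answer=" paris "),
    SimpleNamespace(answer="Bonn"),
]

# The number an optimizer would try to push upward.
accuracy = sum(exact_match(e, p) for e, p in zip(trainset, predictions)) / len(trainset)
```

If you cannot write a function like `exact_match` for your task, that is the signal that a metric-driven compiler has nothing to optimize yet.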

The one that backpropagates a critique

▟ zou-group/textgrad

Automatic "differentiation" via text: an LLM's natural-language feedback is treated as a gradient and backpropagated to optimize any text variable

★ 3.6k · Python · zou-group/textgrad

TextGrad takes the autograd metaphor literally. You mark a text variable as something to optimize, define a loss in natural language, and a TGD ("Textual Gradient Descent") optimizer runs the loop: an LLM critiques the current output, that written critique is the gradient, and it flows backward through the computation graph to edit the variable. No numbers, no labels — the feedback text is the entire training signal. The work is serious enough that it landed in Nature in 2025, not a venue that publishes prompt-tuning hacks lightly.

What this buys you is the thing DSPy can't: it runs when you have a judgment but no dataset. "This answer should cite a source and not hedge" is a usable loss in TextGrad and a non-starter in a metric-driven compiler. And because the gradient is just text, the variable doesn't have to be a prompt — it optimizes code, solutions, even molecule descriptions. The cost is that an LLM-written critique is a noisier, more expensive gradient than a numeric score, and a loop that re-queries a model to grade itself can wander. It is the right tool when the objective lives in your head, not in a spreadsheet.
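Here is a toy sketch of that loop, using the "cite a source and not hedge" objective as the loss. It is not TextGrad's API. `critic` and `apply_feedback` are hypothetical, rule-based stand-ins for the two LLM calls (one that writes the critique, one that rewrites the variable), chosen so the example is deterministic and runs offline.

```python
def critic(answer, loss_instruction):
    """Stand-in for the critic LLM: returns written feedback, never a number."""
    feedback = []
    if "cite a source" in loss_instruction and "[source:" not in answer:
        feedback.append("Add a citation.")
    if "not hedge" in loss_instruction and "maybe" in answer.lower():
        feedback.append("Remove the hedging.")
    return " ".join(feedback)

def apply_feedback(answer, feedback):
    """Stand-in for the optimizer LLM that edits the variable using the critique."""
    if "Remove the hedging." in feedback:
        answer = answer.replace("Maybe ", "").replace("maybe ", "")
        answer = answer[0].upper() + answer[1:]
    if "Add a citation." in feedback:
        # Placeholder marker only; a real system would need a real source.
        answer = answer.rstrip(".") + " [source: placeholder]."
    return answer

def textual_gradient_descent(variable, loss_instruction, max_steps=5):
    history = [variable]
    for _ in range(max_steps):
        gradient = critic(variable, loss_instruction)  # the "gradient" is text
        if not gradient:                               # nothing left to criticize
            break
        variable = apply_feedback(variable, gradient)
        history.append(variable)
    return variable, history

LOSS = "The answer should cite a source and not hedge."
final, history = textual_gradient_descent(
    "Maybe the boiling point of water is 100 C.", LOSS
)
```

Note what is missing: no labels and no numeric score. The loop stops when the critic has nothing left to say, which is also why a real critic that keeps finding new complaints can make the loop wander.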

The one that wants to be the whole app

▟ SylphAI-Inc/AdalFlow

A PyTorch-like library to build and auto-optimize LLM applications, with a trainer that unifies textual-gradient and few-shot bootstrap optimization

★ 4.2k · Python · SylphAI-Inc/AdalFlow

AdalFlow refuses to pick. Its pitch — "the library to build and auto-optimize LLM applications" — is that optimization shouldn't be a separate pass you bolt onto a finished app; it should be the framework you built the app in. So it gives you a PyTorch-shaped way to assemble chatbots, RAG, and agents, and a single Trainer that runs both paradigms: a TextGrad-style textual-gradient mode (its "LLM-AutoDiff") and DSPy-style few-shot bootstrapping, in one loop. Whichever signal you can supply, it can use.
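A rough way to picture "one Trainer, both paradigms" is a single loop that uses whichever signal it was handed. The sketch below is not AdalFlow's API. `ToyTask`, `train`, and the stand-in `run`, `critic`, and `rewrite` functions are all hypothetical, and the real trainer is considerably more involved.

```python
from dataclasses import dataclass, field

@dataclass
class ToyTask:
    prompt: str
    demos: list = field(default_factory=list)

def run(task, x):
    """Stand-in for the app's LM call: uppercases only if the prompt asks for it."""
    return x.upper() if "uppercase" in task.prompt else x

def critic(prompt):
    """Stand-in for an LLM critique of the prompt."""
    return "" if "uppercase" in prompt else "Ask for an uppercase reply."

def rewrite(prompt, feedback):
    """Stand-in for the LLM edit that applies the critique."""
    return prompt + " Reply in uppercase."

def exact(gold, out):
    return gold == out

def train(task, *, critic=None, rewrite=None, trainset=None, metric=None, steps=3):
    """Use whichever optimization signal was supplied, in one loop."""
    for _ in range(steps):
        if critic and rewrite:        # textual-gradient mode
            feedback = critic(task.prompt)
            if feedback:
                task.prompt = rewrite(task.prompt, feedback)
        if trainset and metric:       # few-shot bootstrap mode
            for x, gold in trainset:
                out = run(task, x)
                if metric(gold, out) and (x, out) not in task.demos:
                    task.demos.append((x, out))
    return task

DATA = [("a", "A"), ("b", "B")]
both = train(ToyTask("Echo the input."), critic=critic, rewrite=rewrite,
             trainset=DATA, metric=exact)
only_text = train(ToyTask("Echo the input."), critic=critic, rewrite=rewrite)
only_data = train(ToyTask("Echo the input."), trainset=DATA, metric=exact)
```

In this toy, the data-only run bootstraps nothing because the unedited prompt never produces a passing output, while the combined run fixes the prompt first and then collects demos. That interaction is the argument for running both signals in one loop.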

One disambiguation, because it bites people: this AdalFlow shipped as `lightrag` on PyPI before a 2024 rename, and it is not the graph-RAG project of the same name. `pip install lightrag` gets you the wrong library; install `adalflow` instead. With that cleared, AdalFlow's position is the integrator's: youngest of the three, betting that teams don't want to wire a build framework to a separate optimizer and would rather have one library own both. The risk of any unifier is the usual one: it can be a worse build tool than a dedicated harness and a worse optimizer than the specialist it borrows from. The payoff is one mental model from prototype to tuned system.

All three turn your prompt into a parameter. DSPy needs a metric and a dataset to optimize it, TextGrad needs only a critique, and AdalFlow will take whichever you've got.

Pick by what you can hand it

The star spread — 35k versus 4.2k versus 3.6k — is maturity and reach, not a ranking of the idea. The honest decision procedure has nothing to do with popularity and everything to do with inventory: what can you actually supply the optimizer?

You have a metric and a trainset → DSPy. A measurable objective with examples is exactly the input its compiler is built to exploit, and nothing else will out-tune it on that footing.
You have a judgment but no dataset → TextGrad. When "good" is a sentence you can write but not a number you can compute, a natural-language gradient is the only loop that runs.
You want one library for building and tuning the whole app → AdalFlow. If you'd rather not maintain a seam between your framework and your optimizer, it bundles both — and quietly supports either signal as you acquire it.
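The three rules above fit in a few lines. This is only a sketch of the article's decision procedure. The function name, the argument names, and the precedence (the one-framework preference is checked first) are my assumptions.

```python
def pick_library(has_metric, has_trainset, has_judgment, wants_one_framework=False):
    """Map what you can supply onto the library whose appetite matches it."""
    if wants_one_framework:
        return "AdalFlow"      # one library for building and tuning
    if has_metric and has_trainset:
        return "DSPy"          # measurable objective plus examples
    if has_judgment:
        return "TextGrad"      # a written critique is the only signal
    return "take inventory first"

choice = pick_library(has_metric=False, has_trainset=False, has_judgment=True)
```

The last branch matters as much as the first three: with no metric, no data, and no articulated judgment, none of these optimizers has anything to work with.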

Prompt optimization is one of the few places in the agent stack where the academic idea — treat the prompt as a learnable parameter — fully arrived in shippable libraries. The mistake is shopping by stars and then discovering you have no metric to feed the compiler, or no dataset for the tool that demanded one. Take inventory of what you can give the loop first. The right library is the one whose appetite matches your pantry.

Dimension	DSPy	TextGrad	AdalFlow
Mechanism	Compile: bootstrap few-shot demos + search instructions vs. a metric	Backpropagate an LLM's natural-language critique as a "textual gradient"	Unified trainer bundling textual-gradient + few-shot bootstrap
Requires a metric?	Yes — a measurable score to maximize	A loss expressed in natural language	Mode-dependent
Requires a dataset?	Yes — a trainset of examples	Minimal; feedback is the signal	Mode-dependent
Generality	Pipelines of LM calls	Any text variable — prompts, code, solutions	Whole LLM apps (RAG, agents, chatbots)
Pedigree	Stanford NLP (Omar Khattab)	Stanford (James Zou lab); Nature 2025	SylphAI + UT Austin VITA group
Language	Python	Python	Python

Dex Mareno
