For two years the craft of getting good output from a language model was, embarrassingly, string editing. You wrote a prompt, eyeballed the result, added "think step by step," moved an instruction to the top, threatened the model, bribed it with a tip, and re-ran. The prompt was a parameter you tuned by hand, with your eyes as the loss function. Everyone knew this was beneath the field's dignity and nobody had a better loop.
The better loop is the premise that three Python libraries, DSPy, TextGrad, and AdalFlow, share: stop hand-editing the string and let an algorithm optimize it. You define what "good" looks like, hand over a few examples, and a search process proposes prompt variations and keeps the ones that score higher. It is the move from prompt engineering to prompt optimization, and once you accept it, the only real question is what the optimizer needs from you before it can start. That is exactly where these three disagree.
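The loop described above fits in a few lines of plain Python. This is a toy sketch under stated assumptions, not the API of any of the three libraries: `fake_model`, `exact_match`, `score`, `optimize`, and the candidate prompts are all hypothetical stand-ins, and a real run would replace `fake_model` with an actual LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM call. It only returns a bare answer
# when the prompt asks for one; a real model call would go here.
def fake_model(prompt: str, question: str) -> str:
    answers = {"2+2": "4", "3*3": "9"}
    answer = answers.get(question, "?")
    if "number only" in prompt:
        return answer
    return f"The answer is {answer}."

def exact_match(prediction: str, gold: str) -> float:
    """What "good" looks like here: 1.0 for an exact match, else 0.0."""
    return 1.0 if prediction.strip() == gold else 0.0

def score(prompt: str, examples: List[Tuple[str, str]],
          model: Callable[[str, str], str],
          metric: Callable[[str, str], float]) -> float:
    """Average the metric over all examples for one candidate prompt."""
    total = sum(metric(model(prompt, q), gold) for q, gold in examples)
    return total / len(examples)

def optimize(candidates: List[str], examples: List[Tuple[str, str]],
             model: Callable[[str, str], str],
             metric: Callable[[str, str], float]) -> Tuple[str, float]:
    """Try every candidate prompt and keep the highest-scoring one."""
    best, best_score = candidates[0], -1.0
    for prompt in candidates:
        s = score(prompt, examples, model, metric)
        if s > best_score:
            best, best_score = prompt, s
    return best, best_score

examples = [("2+2", "4"), ("3*3", "9")]
candidates = [
    "Answer the question.",
    "Answer the question. Reply with the number only.",
]
best_prompt, best_score = optimize(candidates, examples, fake_model, exact_match)
```

Here the candidates are fixed up front; in a real optimizer the interesting part is how new candidates get proposed, which is where the three libraries diverge.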
The one that compiles against a metric
DSPy is the category's center of gravity, and "DSPy alternatives" is the search that sends most people to the other two. Its bet is that a prompt should never be written at all — you declare a Signature (question -> answer), compose Modules (Predict, ChainOfThought, ReAct), and an Optimizer (MIPROv2, BootstrapFewShot) compiles the actual prompt by bootstrapping few-shot demonstrations and searching instruction phrasings to maximize a metric you provide. The official gloss — "Declarative Self-improving Python" — is the whole pitch: you write the program, the compiler writes the prompt.
The tell is the word compile. A compiler needs an objective and inputs to optimize over, and DSPy needs the same: a metric and a trainset. That requirement is its strength when you have them — give it a hundred labeled examples and a scoring function and it will out-tune anything you'd do by hand, and re-tune for free when you swap models. It is also its barrier to entry. If you can't write down a metric, or you have no examples yet, the most powerful tool in the space has nothing to grip. DSPy is what you reach for when the task is measurable — and the discipline of defining that metric is the same discipline the eval frameworks sell.
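To make the role of the metric and trainset concrete, here is a toy version of bootstrapping few-shot demonstrations, written in plain Python. It deliberately does not use DSPy's API: `Example`, `teacher`, `metric`, `bootstrap_demos`, and `render_prompt` are hypothetical names, and the teacher is a hard-coded stand-in for a model. The sketch only covers demo selection, not the instruction search.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    question: str
    answer: str

def teacher(question: str) -> str:
    """Hypothetical stand-in for a model call; wrong on one input on purpose."""
    canned = {
        "capital of France?": "Paris",
        "capital of Japan?": "Kyoto",
        "capital of Italy?": "Rome",
    }
    return canned.get(question, "unknown")

def metric(example: Example, prediction: str) -> bool:
    """The user-supplied objective: case-insensitive exact match."""
    return prediction.strip().lower() == example.answer.lower()

def bootstrap_demos(trainset: List[Example], max_demos: int = 2) -> List[Tuple[str, str]]:
    """Run the teacher over the trainset and keep only outputs the metric accepts."""
    demos: List[Tuple[str, str]] = []
    for ex in trainset:
        pred = teacher(ex.question)
        if metric(ex, pred):
            demos.append((ex.question, pred))
        if len(demos) == max_demos:
            break
    return demos

def render_prompt(instruction: str, demos: List[Tuple[str, str]], question: str) -> str:
    """Assemble the final prompt: instruction, accepted demos, then the new input."""
    parts = [instruction]
    for q, a in demos:
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

trainset = [
    Example("capital of France?", "Paris"),
    Example("capital of Japan?", "Tokyo"),  # teacher answers "Kyoto", so this is filtered out
    Example("capital of Italy?", "Rome"),
]
demos = bootstrap_demos(trainset)
prompt = render_prompt("Answer with the city name.", demos, "capital of Spain?")
```

Without the metric there is no way to decide which teacher outputs deserve to become demonstrations, and without the trainset there is nothing to run the teacher on. That is the barrier to entry in miniature.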
The one that backpropagates a critique
TextGrad takes the autograd metaphor literally. You mark a text variable as something to optimize, define a loss in natural language, and a TGD ("Textual Gradient Descent") optimizer runs the loop: an LLM critiques the current output, that written critique is the gradient, and it flows backward through the computation graph to edit the variable. No numbers, no labels — the feedback text is the entire training signal. The work is serious enough that it landed in Nature in 2025, not a venue that publishes prompt-tuning hacks lightly.
What this buys you is the thing DSPy can't: it runs when you have a judgment but no dataset. "This answer should cite a source and not hedge" is a usable loss in TextGrad and a non-starter in a metric-driven compiler. And because the gradient is just text, the variable doesn't have to be a prompt — it optimizes code, solutions, even molecule descriptions. The cost is that an LLM-written critique is a noisier, more expensive gradient than a numeric score, and a loop that re-queries a model to grade itself can wander. It is the right tool when the objective lives in your head, not in a spreadsheet.
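The critique-as-gradient loop can be illustrated with a toy in plain Python. This is not TextGrad's API: `critique` and `apply_critique` are hypothetical rule-based stand-ins for the two LLM calls (the judge and the editor), and `SOURCE-1` is a placeholder rather than a real reference. The natural-language loss being imitated is the one from the text: cite a source and do not hedge.

```python
from typing import List, Tuple

HEDGES = ("I think ", "maybe ", "possibly ")

def critique(answer: str) -> str:
    """Stand-in for the LLM judge: returns written feedback, or "" if satisfied."""
    problems = []
    if any(h in answer for h in HEDGES):
        problems.append("Remove hedging words.")
    if "[source:" not in answer:
        problems.append("Cite a source.")
    return " ".join(problems)

def apply_critique(answer: str, feedback: str, source: str) -> str:
    """Stand-in for the LLM editor: rewrites the variable according to the feedback."""
    if "Remove hedging" in feedback:
        for h in HEDGES:
            answer = answer.replace(h, "")
        answer = answer[:1].upper() + answer[1:]
    if "Cite a source" in feedback:
        answer = f"{answer} [source: {source}]"
    return answer

def textual_descent(answer: str, source: str, max_steps: int = 3) -> Tuple[str, List[str]]:
    """Critique, edit, repeat. The step cap keeps a noisy loop from wandering forever."""
    history: List[str] = []
    for _ in range(max_steps):
        feedback = critique(answer)
        if not feedback:  # empty critique: nothing left to fix
            break
        history.append(feedback)
        answer = apply_critique(answer, feedback, source)
    return answer, history

final, history = textual_descent(
    "I think water boils at 100 C at sea level.", source="SOURCE-1"
)
```

No labels or numeric scores appear anywhere in the loop; the only training signal is the feedback string. With real models on both sides, each step costs at least two LLM calls, which is the expense the paragraph above warns about.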
The one that wants to be the whole app
AdalFlow refuses to pick. Its pitch — "the library to build and auto-optimize LLM applications" — is that optimization shouldn't be a separate pass you bolt onto a finished app; it should be the framework you built the app in. So it gives you a PyTorch-shaped way to assemble chatbots, RAG, and agents, and a single Trainer that runs both paradigms: a TextGrad-style textual-gradient mode (its "LLM-AutoDiff") and DSPy-style few-shot bootstrapping, in one loop. Whichever signal you can supply, it can use.
One disambiguation, because it bites people: AdalFlow shipped as lightrag on PyPI before a 2024 rename, and it is not the graph-RAG project of the same name. The current package is adalflow; pip install lightrag will not get you the current library. With that cleared, AdalFlow's position is the integrator's: youngest of the three, betting that teams don't want to wire a build framework to a separate optimizer and would rather have one library own both. The risk of any unifier is the usual one: it can be a worse build tool than a dedicated harness and a worse optimizer than the specialist it borrows from. The payoff is one mental model from prototype to tuned system.
All three turn your prompt into a parameter. DSPy needs a metric and a dataset to optimize it, TextGrad needs only a critique, and AdalFlow will take whichever you've got.
Pick by what you can hand it
The GitHub star spread, roughly 35k for DSPy against 4.2k and 3.6k for the other two, reflects maturity and reach, not a ranking of the idea. The honest decision procedure has nothing to do with popularity and everything to do with inventory: what can you actually supply the optimizer?
- You have a metric and a trainset → DSPy. A measurable objective with examples is exactly the input its compiler is built to exploit, and nothing else will out-tune it on that footing.
- You have a judgment but no dataset → TextGrad. When "good" is a sentence you can write but not a number you can compute, a natural-language gradient is the only loop that runs.
- You want one library for building and tuning the whole app → AdalFlow. If you'd rather not maintain a seam between your framework and your optimizer, it bundles both — and quietly supports either signal as you acquire it.
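The three bullets above amount to a small decision function. The sketch below is one reading of them, not an official rule: the function name, its parameters, and the order in which the checks run are assumptions made for illustration.

```python
def pick_optimizer(has_metric: bool, has_trainset: bool,
                   has_judgment: bool, wants_single_framework: bool) -> str:
    """Map what you can supply to the library whose appetite matches it."""
    # Assumption: wanting one library for build and tuning takes precedence,
    # since AdalFlow is described as accepting either kind of signal.
    if wants_single_framework:
        return "AdalFlow"
    # A measurable objective plus examples is what a metric-driven compiler needs.
    if has_metric and has_trainset:
        return "DSPy"
    # A judgment you can write down but not compute suits a textual gradient.
    if has_judgment:
        return "TextGrad"
    # Nothing to hand the loop yet.
    return "none yet"

choice = pick_optimizer(has_metric=False, has_trainset=False,
                        has_judgment=True, wants_single_framework=False)
```

The last branch is the useful one in practice: if the function returns "none yet", the next task is writing a metric or a judgment, not installing a library.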
Prompt optimization is one of the few places in the agent stack where the academic idea — treat the prompt as a learnable parameter — fully arrived in shippable libraries. The mistake is shopping by stars and then discovering you have no metric to feed the compiler, or no dataset for the tool that demanded one. Take inventory of what you can give the loop first. The right library is the one whose appetite matches your pantry.



