You typed "garak vs PyRIT vs promptfoo" expecting a winner. You will not get one here, because the question contains a category error. These three tools are the default answers when an AI team decides to red-team its models, and they are constantly listed side by side — but they are not the same kind of thing. One is a scanner. One is a framework. One is a CI gate. Choosing between them is like choosing between a smoke detector, a chemistry set, and a building inspector: the right answer is usually "more than one, at different moments."

The reason they get confused is that all three now do the thing people use as a tiebreaker — multi-turn attacks — so that axis has stopped discriminating. The axis that actually matters is what each tool treats as the unit under test.

garak: the scanner

NVIDIA's garak is the easiest to place because its own README places it for you: it analogizes itself to nmap and Metasploit, "but for LLMs." You point it at a model and it runs a battery of probes that generate adversarial interactions, routes them through generators (the model backends — Hugging Face, OpenAI, Bedrock, local GGUF, a custom REST endpoint), and grades the results with detectors that classify whether a known failure occurred: prompt injection, jailbreaks, toxicity, data leakage, hallucination.

garak is breadth-first and low-config. You don't build with it; you run it and read the hit report.

The unit under test is the model. That is its strength — broad, repeatable coverage of known vulnerability classes across many backends with almost no setup — and its boundary: it ships a fixed library of probes, so it is less suited to inventing a bespoke campaign against your specific application's logic. One correction to the folklore: garak is no longer "single-turn only." v0.15 (2026) added a multi-turn GOAT probe and an agent-breaker probe aimed at the tools a tool-using agent can reach.

LLM vulnerability scanner — probes/detectors/generators, nmap-for-LLMs
★ 8.2kPythonNVIDIA/garak

PyRIT: the framework

Microsoft's PyRIT (Python Risk Identification Tool) is the one people most often miscast as "just another scanner." It isn't a scanner at all — it's an orchestration framework, an SDK you program against. Its abstractions are the giveaway: targets (the system under test), converters (70+ stackable transforms — Base64, leetspeak, Unicode confusables, translation, LLM rephrasing), scorers (LLM-as-judge, Azure AI Content Safety, Likert), and orchestrators that drive the multi-turn flow, all backed by a memory store.

With those parts you compose automated, multi-turn attack algorithms — Crescendo (gradual escalation), TAP (Tree-of-Attacks-with-Pruning), PAIR, Skeleton Key — where one LLM attacks another and adapts across turns. The unit under test is the campaign: PyRIT is what you reach for when the fixed probes of a scanner don't fit and you need to author novel, agentic attacks at scale. The cost is the steepest learning curve of the three — you write Python, not config. (A 2026 refactor renamed the old "Red Teaming Orchestrator" to the Multi-Turn Orchestrator and introduced a composable AttackStrategy model.)

Adversarial orchestration framework — converters/scorers/orchestrators for automated multi-turn campaigns
★ 3.9kPythonmicrosoft/PyRIT

promptfoo: the CI gate

promptfoo is the one that looks least like a security tool and is the most operationally useful for shipping teams. It is a config-driven harness: you declare your target and the vulnerability classes you care about in a redteam.yaml, and plugins (50+ — jailbreak, PII, SSRF, SQL/shell injection, excessive agency, hallucination) generate adversarial inputs while strategies deliver them (iterative single-shot, Crescendo and GOAT for multi-turn). Then you run it as a step in CI, gate the build on a risk score, and get OWASP LLM Top 10, NIST AI RMF, and MITRE ATLAS mappings in the report.

The unit under test is your application and its configuration, evaluated as a release gate that sits right next to your ordinary evals. If you already lean on a prompt-eval workflow like deepeval, RAGAS, or promptfoo itself, red teaming becomes one more failing test rather than a separate quarterly exercise. Worth knowing: OpenAI acquired promptfoo in March 2026; it stays open source under MIT.

Config-driven LLM eval + red-team harness with CI gating and compliance maps
★ 22.5kTypeScriptpromptfoo/promptfoo

How to actually choose

Don't choose on multi-turn support — they all have it now. Choose on the unit you need to test:

The tell that these are layers, not rivals, is that they implement the same published algorithms — Crescendo, TAP, PAIR, GOAT all show up across the three. The attack research is shared; what differs is whether you want it delivered as a scan, a library, or a build gate. If you're approaching this from the defense side rather than offense, the complement to all three is a runtime filter — see Rebuff vs LLM Guard vs Vigil for the layer that blocks the attacks these tools find.