There's a comfortable story about synthetic training data that goes like this: frontier models are now good enough to write their own homework, so you point a strong model at a prompt, collect a few hundred thousand examples, and fine-tune. The story is half true. Generation really is solved — any capable model will produce fluent instruction/response pairs all day for the price of inference. The trouble is that the half everyone skips is the half that decides whether your dataset helps your model or quietly degrades it.
The bottleneck moved, and the tools followed
The clearest result in this corner of the literature is also the most inconvenient: train on synthetic data without verification and the model gets worse. "Beyond Model Collapse" (arXiv 2406.07515) shows that retraining on synthesized data escapes collapse only when an external verifier — a human, a stronger model, or a programmatic check — injects real information by filtering out the bad samples. Self-Instruct (2212.10560) understood this in 2022: its pipeline was generate and filter, not just generate. The filter was always the point.
So the right way to read these three tools is by where they put the verifier, and how much else they ask of you. None of them is just a generator. Each ships a curation stage. What differs is generality, scale, and how opinionated the path is.
The pipeline framework
distilabel treats a dataset like a build artifact. You declare Steps and Tasks and wire them into a pipeline that serializes — the whole process is a shareable, reproducible object — and caches per step, so a re-run skips the work it already did even after you edit the graph. The payoff is that "generate → judge → filter" is three nodes, and the judge node can be a published method rather than a prompt you improvised: distilabel ships UltraFeedback-style AI-feedback and Self-Instruct-style flows as ready-made Tasks.
That is the distinguishing bet. If you care that your preference data was built the way the Constitutional AI / RLAIF line of papers describes it — and that someone else can rebuild it byte-for-byte — distilabel is the tool whose whole design is reproducibility and method coverage. The cost is that you think in pipelines; the structure is the product.
The generation engine
Curator starts from the other end: you have a generation job that needs to run a lot, fast, and you want to see what's coming out. Its core is a single LLM class with prompt() and parse() methods, first-class Pydantic structured outputs (so the generator emits typed records, not strings you regex later), native batch-API support across providers, automatic caching, and fault recovery for runs that span hours. The piece that gives it personality is the Curator Viewer — a live dashboard for watching data as it's generated, which turns curation from a post-hoc script into something you supervise in real time.
The lesson Curator encodes is that at scale, observability is verification's front door: you can't filter what you can't see, and a million-row synthetic run that you only inspect after the fact is a million-row gamble. If your bottleneck is throughput and keeping eyes on the output, this is the one that was built for that day.
The one-command path
synthetic-data-kit refuses to be a framework. It's a four-command CLI — ingest, create, curate, save — aimed at exactly one job: turn the documents you already have (PDF, HTML, DOCX, PPTX, even YouTube transcripts) into a fine-tuning set, with an LLM-as-judge quality score in the curate step and standard export formats (Alpaca, ChatML) in the save step. It lives in the meta-llama org and its examples are framed around adding reasoning to Llama, but it emits ordinary training formats, so the Llama framing is the worked example, not a lock-in.
Two honest signals to weigh: it's MIT-licensed (the most permissive of the three) and it ships from main with no tagged releases — a maturity tell worth knowing before you wire it into something load-bearing. What you get in exchange is the shortest distance between "I have a folder of docs" and "I have a training set," with the verifier already in the loop.
How to actually choose
The wrong question is which tool generates better data, because the generator is whatever model you plug in, and they're all good now. The right question is what shape your problem has:
- You want a reproducible, research-faithful pipeline — generate, judge with a published method, filter, and hand someone an artifact they can rebuild. That's distilabel.
- Your bottleneck is volume and visibility — millions of typed records, batch APIs, and a live view of what's landing. That's Curator.
- You want your own documents turned into a training set with the fewest moving parts. That's synthetic-data-kit.
Whichever you pick, notice that all three made the same decision you should: the curator stage is not optional. Synthetic data is cheap to make and expensive to trust, and the tool's real job is the trust. If the next step after this is choosing how to actually train on the data you've curated, that's a separate decision about fine-tuning methods — and whether you even need to fine-tune instead of reaching for retrieval in the first place.



