There are three popular ways to teach a language model which of two answers a human prefers, and the temptation is to rank them — PPO the old heavyweight, DPO the modern default, ORPO the new thing. That framing misses what actually connects them. Read in order, they aren't three competing techniques. They're one pipeline being taken apart, component by component, and the interesting question at each step is the same: what did we just delete, and was it holding anything up?
PPO: the full apparatus
The method that aligned the first useful instruction models is the one with the most moving parts. InstructGPT (Ouyang et al., 2022) runs in three stages. First, supervised fine-tuning on demonstrations. Second, collect human rankings of model outputs and train a separate reward model to predict them. Third, use reinforcement learning — PPO — to push the policy toward higher reward, with a per-token KL penalty pinning it near the SFT model so it doesn't drift into gibberish that games the reward.
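To make the third stage concrete, here is a minimal Python sketch of how a KL-penalized reward is commonly assembled in RLHF code: every token pays a penalty proportional to the gap between the policy's and the SFT model's log-probabilities, and the reward model's scalar score is added on the final token. The function name, the beta value, and the toy numbers are illustrative assumptions, not taken from the InstructGPT paper.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token rewards for one sampled response.

    policy_logprobs / ref_logprobs: log-probability of each generated token
    under the policy being trained and under the frozen SFT reference.
    rm_score: scalar score the reward model gives the whole response.
    beta: KL penalty strength (illustrative value).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("policy and reference must score the same tokens")
    # Penalize every token where the policy is more confident than the reference.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    # The reward model only judges the finished response, so its score
    # is credited to the last token.
    if rewards:
        rewards[-1] += rm_score
    return rewards


# Toy response of two tokens: the policy agrees with the reference on the
# first token and is more confident than the reference on the second.
rewards = kl_shaped_rewards([-1.0, -2.0], [-1.0, -2.5], rm_score=1.0)
print(rewards)
```

Raising beta in this sketch makes drift from the SFT model more expensive than any reward it could buy, which is the pinning effect described above.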
In practice that third stage holds four models in memory at once: the policy you're training, a frozen reference (the KL anchor), the reward model (scores each generation), and a value/critic network (PPO is actor–critic, so it estimates advantages). That count isn't a line in the paper — it's the standard shape of any RLHF training rig, the one TRL and NeMo are built around. Four models, an online sampling loop, and RL's famous instability. It works, and for years it was the only thing that did. It is also a lot to stand up.
DPO: delete the reward model and the RL
DPO (Rafailov et al., 2023) starts from a sly observation in its subtitle: your language model is secretly a reward model. The authors derive a closed-form relationship between the reward function and the optimal policy under the KL-constrained RLHF objective. Rearranged, it means you never have to fit the reward model or run RL at all: you can optimize the policy directly on the same preference pairs, using a loss that looks like binary classification. Make the chosen response more likely than the rejected one, with each likelihood measured relative to what a frozen reference model assigns the same response.
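The classification-style loss can be sketched in a few lines of plain Python. This is a minimal illustration for a single preference pair, assuming you already have the summed log-probability of each response under the policy and under the frozen reference. The function names, the beta value, and the toy log-probabilities are assumptions made for the example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a whole response under
    the policy (pi_*) or the frozen reference model (ref_*).
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference.
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    return softplus(-margin)


# At initialization the policy equals the reference, so the margin is zero
# and the loss is log(2), the value of an uninformed binary classifier.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy raises the chosen response and lowers the rejected one,
# the loss falls below log(2).
after = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(at_init, after)
```

Note that the reference log-probabilities enter only as fixed offsets. That is why the reference model has to stay loaded during training, but never needs gradients.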
That is a large deletion. The separate reward model: gone. The sampling-based RL loop, with its tuning and its crashes: gone. The value network goes with it. What survives is the frozen reference model, still loaded to compute the loss. Two models instead of four, a stable offline objective, and a training run that looks like ordinary fine-tuning. This is why nearly every open-source alignment recipe now reaches for DPO first: it gets most of RLHF's benefit at a fraction of the operational weight. The same memory math that governs the parameter-efficient fine-tuning methods applies here, doubled: every model you drop is weights you no longer hold in memory, and every trained model you drop, like the critic, is gradients and optimizer state you stop paying for.
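A rough back-of-the-envelope version of that memory math, as a Python sketch. It assumes full fine-tuning with mixed-precision Adam (about 16 bytes per trained parameter: bf16 weights and gradients plus fp32 master weights and two Adam moments) and bf16 weights only for frozen models (2 bytes per parameter). It also assumes every model in the rig is the same size and ignores activations, generation caches, sharding, and parameter-efficient methods, so treat the numbers as orders of magnitude, not measurements.

```python
BYTES_PER_FROZEN_PARAM = 2    # assumption: bf16 weights only
BYTES_PER_TRAINED_PARAM = 16  # assumption: bf16 weights + bf16 grads
                              # + fp32 master weights + two fp32 Adam moments


def rig_memory_gb(params_billion, trained, frozen):
    """Approximate weight and optimizer memory in GB.

    trained / frozen: how many trainable and frozen copies of a
    params_billion-parameter model the training rig holds.
    """
    bytes_per_param = (trained * BYTES_PER_TRAINED_PARAM
                       + frozen * BYTES_PER_FROZEN_PARAM)
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


# PPO: policy and critic are trained; reference and reward model are frozen.
ppo_gb = rig_memory_gb(7, trained=2, frozen=2)
# DPO: only the policy is trained; the reference is frozen.
dpo_gb = rig_memory_gb(7, trained=1, frozen=1)
print(ppo_gb, dpo_gb)
```

Under these assumptions, most of the saving comes from dropping the trained critic. The frozen reward model is comparatively cheap to hold.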
Each method in this lineage is defined less by what it adds than by which part of the pipeline it proves you didn't need.
ORPO: delete the reference model too — and the SFT stage
ORPO (Hong et al., 2024) asks the next question: if DPO is just fine-tuning with a preference term, why run it after a separate SFT stage, and why keep a reference model around at all? Its answer — the "monolithic" in the title — is to fold everything into one objective. ORPO adds an odds-ratio penalty to the standard SFT negative-log-likelihood loss: the same single pass that teaches the model to produce the chosen responses also weakly penalizes the rejected ones. No reward model, no reference model, no second phase. One model, one stage. On UltraFeedback, Mistral-ORPO reported 12.20% on AlpacaEval 2.0 and 7.32 on MT-Bench — competitive with multi-stage recipes, from a simpler one.
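The combined objective can be sketched as follows, based on the formulation in the ORPO paper: the likelihood of a response is the length-normalized (geometric mean) token probability, its odds are p / (1 - p), and the penalty is a log-sigmoid of the log odds ratio between chosen and rejected. This is a minimal single-pair illustration in plain Python. The function names, the weight `lam`, and the toy token log-probabilities are assumptions made for the example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    """Return (total, sft_nll, odds_ratio_penalty) for one preference pair.

    Both inputs are per-token log-probabilities under the single model
    being trained. No reference model appears anywhere.
    """
    avg_chosen = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_rejected = sum(rejected_token_logps) / len(rejected_token_logps)
    # Ordinary SFT term: negative log-likelihood of the chosen response.
    sft_nll = -avg_chosen
    # Odds-ratio term: small when the chosen response has much higher
    # odds than the rejected one.
    log_odds_ratio = log_odds(avg_chosen) - log_odds(avg_rejected)
    penalty = softplus(-log_odds_ratio)  # -log(sigmoid(log_odds_ratio))
    return sft_nll + lam * penalty, sft_nll, penalty


total, sft_nll, penalty = orpo_loss([-0.5, -0.5], [-2.0, -2.0])
print(total, sft_nll, penalty)
```

With a small `lam`, the SFT term dominates and the odds-ratio term acts as the weak penalty described above.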
The trajectory is now obvious, and it isn't unique to ORPO. KTO (Ethayarajh et al., 2024) drops the requirement for paired data entirely: it learns from individual "good"/"bad" labels, which are far cheaper to collect than ranked pairs. SimPO (Meng et al., 2024) drops the reference model by using the policy's own length-normalized log-probability as the reward. Each of these methods is a subtraction.
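As one example of the pattern, SimPO's reference-free loss can be sketched in plain Python. The implicit reward is the average per-token log-probability under the policy, scaled by beta, and the loss asks the chosen reward to beat the rejected reward by a target margin gamma. The function names and the beta and gamma values below are illustrative assumptions, not recommended settings.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """SimPO-style loss for one preference pair, with no reference model.

    Inputs are per-token log-probabilities under the policy being trained.
    """
    # Length-normalized reward: average log-probability per token, so a
    # response is not rewarded or punished merely for its length.
    reward_chosen = beta * sum(chosen_token_logps) / len(chosen_token_logps)
    reward_rejected = beta * sum(rejected_token_logps) / len(rejected_token_logps)
    # -log(sigmoid(reward gap - target margin))
    return softplus(-(reward_chosen - reward_rejected - gamma))


short_pair = simpo_loss([-0.5, -0.5], [-2.0, -2.0, -2.0, -2.0])
# Same per-token quality at different lengths gives the same loss.
long_pair = simpo_loss([-0.5] * 10, [-2.0] * 3)
print(short_pair, long_pair)
```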
What you delete is sometimes load-bearing
Here's the turn. A pipeline that keeps getting simpler and also keeps getting better would be a free lunch, and there are no free lunches in optimization. The counter-evidence is direct: "Is DPO Superior to PPO for LLM Alignment?" (Xu et al., 2024) finds that a well-tuned online PPO still beats DPO on the hardest testbeds, competitive code generation among them.
The reason is exactly the thing DPO removed. DPO trains on a fixed, offline set of preference pairs. The paper shows it can find biased solutions that exploit out-of-distribution responses: answers the preference data never covers, which the offline loss therefore does nothing to discourage. PPO's despised RL loop keeps sampling fresh outputs from the current policy and scoring them, so whatever the policy drifts toward gets checked against the reward. The reward model, the value network, the online rollouts: the machinery DPO and ORPO discard for being heavy was also what kept training honest on the long tail.
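A toy illustration of that blind spot, in plain Python. Assume a prompt with only three possible responses, A, B, and C, and a preference dataset that contains a single pair, A preferred over B. The DPO loss sees only the policy's probabilities for A and B, so two policies that agree on the ratio between them score identically, even if one has moved most of its mass onto C, a response nobody ever labeled. The setup, names, and numbers are invented for the example and are not taken from Xu et al.

```python
import math


def dpo_pair_loss(policy, ref, chosen, rejected, beta=0.1):
    """DPO loss for one pair, given response probabilities as dicts."""
    margin = beta * (
        (math.log(policy[chosen]) - math.log(ref[chosen]))
        - (math.log(policy[rejected]) - math.log(ref[rejected]))
    )
    return math.log1p(math.exp(-margin))  # -log(sigmoid(margin))


ref = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

# Both policies prefer A over B by the same 10:1 ratio...
sensible = {"A": 0.90, "B": 0.09, "C": 0.01}
# ...but this one puts 89% of its mass on C, which is absent from the data.
drifted = {"A": 0.10, "B": 0.01, "C": 0.89}

loss_sensible = dpo_pair_loss(sensible, ref, chosen="A", rejected="B")
loss_drifted = dpo_pair_loss(drifted, ref, chosen="A", rejected="B")
# The offline loss cannot tell the two policies apart.
print(math.isclose(loss_sensible, loss_drifted))
```

An online loop would sample C from the drifted policy almost nine times in ten and send it to the reward model, which is the check the offline objective never performs.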
So the real decision isn't "which is best." It's how far down the deletion ladder your problem lets you go. For most instruction-tuning and preference work, ORPO or DPO will match the old apparatus at a fraction of the cost, and you should start there. When you're chasing the last few points on a genuinely hard domain — code, math, anything with a sharp correctness signal — the offline shortcut starts to leak, and the full online loop earns its four models back. The pipeline collapsed because most of the time, most of it was scaffolding. Most of the time is the part worth reading carefully.



