Here is the strange result that should make you suspicious of how everyone trains reasoning models. When the Qwen team built GSPO, their new reinforcement-learning algorithm, it ended up clipping roughly two orders of magnitude more tokens than the GRPO algorithm it replaced — throwing away far more of the gradient signal — and yet it trained more efficiently and more stably. If discarding 100× more of your training signal makes training better, the signal you were so carefully keeping was mostly noise. That is the whole story of GSPO in one sentence, and it is worth understanding why.
The setup: what GRPO got right
Start with what GRPO deserves credit for. RL from a reward model traditionally meant PPO, and PPO carries a critic: a second network, usually as large as the policy, trained to estimate the value of each token. For reasoning tasks where the reward is a single verifiable signal at the end (did the math check out, did the tests pass), that critic is expensive and hard to fit. GRPO, introduced in DeepSeek's 2024 DeepSeekMath paper, threw the critic out. Instead of learning a value function, it samples a group of N responses to the same prompt, scores them, and defines each response's advantage by normalizing its reward within the group: subtract the group mean, divide by the group standard deviation. The group is the baseline. This is cheap, it is elegant, and it is most of why the open reasoning-model boom of 2025 happened on commodity budgets. Nearly every RLVR ("RL with verifiable rewards") stack lives downstream of this idea; the DPO/ORPO preference-optimization methods it competes with are a separate lineage.
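To make "the group is the baseline" concrete, here is a minimal sketch of group-relative advantages in plain Python. The function name, the small `eps` guard against a zero standard deviation, and the choice of population rather than sample standard deviation are assumptions for illustration; real implementations differ on those details.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt; reward 1.0 means the answer verified.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
# Correct responses get a positive advantage, incorrect ones a negative one,
# and every token in a response shares that single number.
```

Note that a group in which every response earns the same reward produces all-zero advantages, so that prompt contributes no gradient.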
But GRPO kept one piece of PPO unexamined: the importance-sampling ratio, computed per token.
The category error
Importance sampling exists to fix a mismatch. In nominally on-policy RL you generate data with one version of the policy and then take several gradient steps on it, so by the time you update, the data was drawn from a slightly old policy. The ratio π_new/π_old reweights each sample to correct for that. The crucial fine print: importance sampling is a statistical estimator, and like any estimator it only means something when averaged over enough samples drawn from the old policy.
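A toy calculation shows why the sample count matters. The sketch below uses two made-up three-outcome distributions (all numbers are hypothetical) and computes everything exactly rather than by simulation: the reweighted estimate is correct on average, but a single draw has a spread larger than the quantity being estimated.

```python
# Toy 3-outcome distributions (hypothetical numbers, chosen for illustration).
old = [0.5, 0.3, 0.2]   # policy that generated the data
new = [0.2, 0.3, 0.5]   # policy being updated
f = [1.0, 2.0, 4.0]     # any quantity we want to average under `new`

weights = [n / o for n, o in zip(new, old)]  # importance ratio per outcome

# Target: the expectation of f under the new policy.
target = sum(p * v for p, v in zip(new, f))

# Expected value of the reweighted estimate w * f when sampling from `old`.
# It equals the target, which is the whole point of importance sampling.
is_mean = sum(o * w * v for o, w, v in zip(old, weights, f))

# Variance of an estimate built from ONE sample drawn from `old`.
one_sample_var = sum(
    o * (w * v - is_mean) ** 2 for o, w, v in zip(old, weights, f)
)

def variance_of_average(k):
    """Variance of the estimate when k independent samples are averaged."""
    return one_sample_var / k
```

Here the target is 2.8 while the one-sample standard deviation is about 3.67; only averaging over many independent draws shrinks that spread.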
GRPO assigns one reward to an entire response — the reward is a property of the sequence. Then it applies the correction one token at a time. And at each token position, it has exactly one sample. A single-sample importance ratio doesn't correct a distribution mismatch; there is no distribution to estimate from one draw. What it does instead is inject high-variance noise into every token's gradient.
GRPO rewards the sentence but corrects the words. With one sample per word, the correction isn't a correction — it's noise wearing the costume of math.
That noise does not stay small. It accumulates across the length of the response: longer outputs mean more positions and more variance. The clipping mechanism, which is supposed to stabilize training by bounding the ratio, instead makes things worse, because it acts on those same noisy per-token ratios, and they swing hardest exactly where you can least afford it. The visible symptom is familiar to anyone who has run long-chain-of-thought RL: training that looks fine for a while, then drifts, then sometimes collapses irreversibly. The usual response is to blame the reward model or the learning rate. GSPO's claim is that the bug was in the objective the whole time.
Where it becomes undeniable: Mixture-of-Experts
The cleanest evidence comes from MoE models, because they make the variance visible. In an MoE, a router selects which experts process each token. As the policy updates, the router's choices shift. The Qwen team measured this directly: after a single gradient update on Qwen3-30B-A3B, for the same input sample, about 10% of the activated experts changed. Ten percent of the network that produced your token probabilities is now a different network.
For a per-token importance ratio, that is catastrophic — the denominator (old policy) and numerator (new policy) are computing through different experts, so the ratio thrashes. GRPO simply will not converge on MoE without a patch called Routing Replay: you cache the experts the old policy chose and force the new policy to reuse them, just to keep the ratio sane. It works, but it is a tax — extra memory, extra bookkeeping, and a constraint that fights the very adaptation you are training for.
The fix: optimize where you reward
GSPO's resolution is almost anticlimactic. Match the unit. Since the reward is assigned to the whole sequence, do the importance sampling on the whole sequence too. GSPO defines a single, length-normalized importance ratio per response — mechanically, the geometric mean of the token ratios (the 1/|y| exponent keeps responses of different lengths on a comparable numerical scale) — and it clips, weights, and optimizes at that level.
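The sequence-level ratio and its clipped objective can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Qwen team's code: the function names are hypothetical, a single symmetric `eps` stands in for whatever clip range a real run would tune, and log-probabilities are passed as plain lists rather than tensors.

```python
import math

def sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence ratio: exp(mean of per-token log-ratios).

    This is the geometric mean of the per-token ratios pi_new / pi_old.
    """
    log_ratios = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(log_ratios) / len(log_ratios))

def clipped_term(ratio, advantage, eps):
    """PPO-style clipped surrogate applied to a single ratio."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def gspo_objective(group, eps):
    """Average the clipped sequence-level term over the responses in a group.

    `group` holds (new_logprobs, old_logprobs, advantage) per response.
    """
    terms = [
        clipped_term(sequence_ratio(new_lp, old_lp), adv, eps)
        for new_lp, old_lp, adv in group
    ]
    return sum(terms) / len(terms)

# A short and a long response whose tokens each drift by +0.01 in log-prob.
short = ([-1.00] * 10, [-1.01] * 10)
long = ([-1.00] * 1000, [-1.01] * 1000)
short_ratio = sequence_ratio(*short)
long_ratio = sequence_ratio(*long)
```

Both responses get the same ratio, about 1.01, so one clip range can serve both. Without the 1/|y| exponent the raw product of token ratios would be exp(0.1) for the short response and exp(10) for the long one, which is why the normalization is needed.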
Two things fall out for free. The variance no longer compounds token-by-token, because there is one ratio per response, averaged over its tokens in log space, rather than hundreds of single-sample ratios each reweighting its own token's gradient. And the MoE problem evaporates: a sequence's overall likelihood is stable even when individual tokens route through different experts, so GSPO trains MoE models without Routing Replay at all. The Qwen team reports better training efficiency and stronger results on AIME'24, LiveCodeBench, and CodeForces at equal compute, and credits GSPO with the gains in the latest Qwen3 models. There is also a GSPO-token variant that allows per-token advantages for cases like multi-turn shaping, while keeping the stable sequence-level scaling factor.
The lesson worth keeping
The reason to care about GSPO even if you never train a model yourself is the principle underneath it, which generalizes past RL. When the granularity of your signal and the granularity of your correction disagree, the finer-grained machinery doesn't give you precision — it gives you variance dressed up as precision. GRPO's per-token ratios looked like they were doing careful, fine-grained credit assignment. They were doing nothing of the kind, and the proof is that you can throw 100× more of them away and train better. If you are choosing an RL recipe in 2026, especially for long outputs or an MoE, the question is no longer "GRPO or PPO." It is whether your importance sampling is being done at the unit you actually reward — and if it isn't, you are paying for noise and calling it signal.



