The hottest track in post-training from 2024–2026: from PPO to critic-free GRPO / RLOO, on to long-CoT reasoning RL (DAPO / Dr.GRPO) and RLVR. Frequently asked at frontier labs (Seed / DeepSeek / Qwen / Moonshot). ⚠️ For specific paper numbers (benchmark scores, etc.) always defer to the original paper; this page focuses on mechanisms and trade-offs and deliberately avoids stacking numbers.
0. The evolution
PPO (actor + critic + ref + RM, GAE advantage) → GRPO (drops the critic, uses "group-relative" as the baseline) → DAPO / Dr.GRPO (fixes GRPO's bias and entropy collapse under long-CoT); side branch RLOO (leave-one-out baseline). Reward source: learned RM → (in verifiable domains) RLVR (rules / verifiers supply the reward).
1. PPO recap
- Four models: policy (actor), value (critic), reference, reward model.
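- Advantage computed with GAE; objective is the clipped surrogate $L^{\text{CLIP}}(\theta) = \mathbb{E}_t\big[\min\big(\rho_t \hat A_t,\ \mathrm{clip}(\rho_t, 1-\varepsilon, 1+\varepsilon)\,\hat A_t\big)\big]$, $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$; plus a KL penalty $\beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ against the reference.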
- Pain points: memory (4 models), difficulty training the value network, sparse rewards for long sequences.
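The two PPO ingredients named above (GAE and the clipped surrogate) can be sketched in a few lines. This is a minimal single-trajectory illustration in plain Python rather than torch, to stay dependency-free; the critic values are made-up numbers and `gae` / `clipped_surrogate` are illustrative names, not any library's API.

```python
import math

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t] is the reward at step t; values has len(rewards) + 1 entries,
    the last being the bootstrap value after the final step (0 if terminal).
    """
    adv = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # discounted sum of residuals
        adv[t] = running
    return adv

def clipped_surrogate(logp, logp_old, adv, clip_eps=0.2):
    """Per-token PPO objective min(rho*A, clip(rho, 1-eps, 1+eps)*A), to be maximized."""
    rho = math.exp(logp - logp_old)
    clipped = max(1.0 - clip_eps, min(rho, 1.0 + clip_eps))
    return min(rho * adv, clipped * adv)

# Sparse terminal reward, as in RLHF: only the last token is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.2, 0.3, 0.5, 0.8, 0.0]   # hypothetical critic outputs + terminal 0
advantages = gae(rewards, values)
print([round(a, 3) for a in advantages])
```

With `lam=1` and `gamma=1` the advantage reduces to "return minus value", which shows how much the whole estimate leans on the critic being accurate.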
2. GRPO — dropping the critic / Group Relative Policy Optimization
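- For each prompt, sample a group of $G$ responses $\{o_1, \dots, o_G\}$ with rewards $\{r_1, \dots, r_G\}$; use within-group statistics as the baseline in place of a value network: $\hat A_i = \dfrac{r_i - \mathrm{mean}(r_1, \dots, r_G)}{\mathrm{std}(r_1, \dots, r_G)}$.
- Objective is the same clipped surrogate as PPO, but the advantage is $\hat A_i$ (shared by every token of response $o_i$), with no critic, no GAE; KL penalty against the reference is retained (k1/k2/k3 estimators and in-reward vs in-loss placement: see llm-post-training §9.4).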
- Benefits: saves one value model and avoids training a value function; especially stable for verifiable rewards (math / code). Used by the DeepSeek family.
From-scratch implementation (group z-score advantage + per-token clip + K3 KL, in-loss):
```python
import torch

def grpo_loss(logp, logp_old, logp_ref, rewards, mask, group_size,
              clip_eps=0.2, beta=0.04):
    # logp/logp_old/logp_ref: (B, T) per-token logprobs; B = n_prompts * group_size
    r = rewards.view(-1, group_size)                    # (n_prompts, G)
    adv = (r - r.mean(1, keepdim=True)) / (r.std(1, keepdim=True) + 1e-6)
    adv = adv.reshape(-1, 1)                            # (B,1) group z-score advantage
    ratio = torch.exp(logp - logp_old)                  # importance ratio ρ
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy = torch.min(surr1, surr2)                    # clipped surrogate
    logr = logp_ref - logp                              # log(π_ref/π_θ)
    kl = torch.exp(logr) - logr - 1                     # K3 estimator, always ≥ 0
    per_tok = policy - beta * kl                        # KL placed in the loss
    seq = (per_tok * mask).sum(1) / mask.sum(1).clamp(min=1)  # 1/|o_i| length normalization
    return -seq.mean()
# Dr.GRPO de-bias: drop the /std in adv; replace 1/|o_i| with a constant (e.g. max length).
```
- Key points: ① the advantage is standardized within the group (z-score), replacing the value baseline; ② `min(surr1, surr2)` is the same clipping as PPO, but the ratio uses per-token logprobs; ③ K3 = $r - 1 - \log r$ with $r = \pi_{\text{ref}} / \pi_\theta$ is an unbiased, non-negative KL estimator (mind the direction of `logr`: $\log \pi_{\text{ref}} - \log \pi_\theta$); ④ the trailing `1/|o_i|` length normalization is the original GRPO formulation. Dr.GRPO shows it favors long wrong responses, so de-biasing replaces it with a constant.
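Point ③ can be checked numerically. The sketch below computes the exact expectation (under samples from the policy) of the k1 and k3 estimators for two toy categorical distributions; the distributions and function names are illustrative assumptions. Both expectations equal $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$, but only k3 is non-negative per sample.

```python
import math

def kl_estimators(p, q):
    """Exact expectations under x ~ p of the k1 and k3 estimators of KL(p || q).

    With r(x) = q(x) / p(x):  k1 = -log r,  k3 = r - 1 - log r.
    """
    k1 = sum(pi * -math.log(qi / pi) for pi, qi in zip(p, q))
    k3 = sum(pi * (qi / pi - 1 - math.log(qi / pi)) for pi, qi in zip(p, q))
    return k1, k3

def k3_sample(logp, logp_ref):
    """Per-token k3 value, same direction as logr = logp_ref - logp."""
    logr = logp_ref - logp
    return math.exp(logr) - logr - 1

policy = [0.7, 0.2, 0.1]      # toy pi_theta over 3 tokens (assumption)
reference = [0.5, 0.3, 0.2]   # toy pi_ref (assumption)
k1_mean, k3_mean = kl_estimators(policy, reference)
print(f"E[k1]={k1_mean:.6f}  E[k3]={k3_mean:.6f}")

# Per-sample values: k1 can be negative, k3 cannot.
for p, q in zip(policy, reference):
    print(f"k1={-math.log(q / p):+.4f}  k3={k3_sample(math.log(p), math.log(q)):+.4f}")
```

The extra `r - 1` term has zero mean under the policy, which is why it lowers variance without adding bias.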
3. RLOO — REINFORCE leave-one-out
- Also critic-free: the baseline for sample $i$ = the mean reward of the other $G-1$ samples, $b_i = \frac{1}{G-1}\sum_{j \neq i} r_j$, REINFORCE-style gradient.
- Simpler than PPO (no clip / critic), competitive with PPO on RLHF. Difference from GRPO lies in baseline construction (leave-one-out vs. within-group standardization) and whether clipping is applied.
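A minimal sketch of the leave-one-out baseline, next to plain group-mean centering (GRPO without the std division). The toy rewards are assumptions. The two advantages differ only by the constant factor $(G-1)/G$, which is easy to verify by hand.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantage: r_i minus the mean reward of the other G-1 samples."""
    g = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]

def group_mean_advantages(rewards):
    """GRPO-style centering without the std division: r_i minus the full group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0, 1.0]   # toy 0/1 verifier rewards, G = 5
loo = rloo_advantages(rewards)
centered = group_mean_advantages(rewards)
scale = (len(rewards) - 1) / len(rewards)

for l, c in zip(loo, centered):
    print(f"leave-one-out={l:+.3f}  mean-centered={c:+.3f}  ratio={c / l:.3f}")
```

So the practical difference between RLOO and GRPO comes from the std division and the clipping, not from the mean baseline itself.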
4. DAPO — keeping GRPO stable under long-CoT / Decoupled-clip & Dynamic-sAmpling PO
ByteDance 2025 open-source recipe — four modifications targeting long-chain reasoning RL:
- Clip-Higher: decouple the upper and lower clip bounds and raise the upper bound → gives low-probability tokens room to rise, preventing entropy collapse (policy becoming deterministic too early and ceasing to explore).
- Dynamic Sampling: discard prompts where the entire group is correct or entirely wrong (within-group advantage is always 0, yielding zero gradient), ensuring every batch contributes useful signal.
- Token-level loss: average by token rather than by sequence, preventing the gradient of long responses from being diluted (critical for long CoT).
- Overlong reward shaping: soft penalty for excessively long responses, stabilizing training.
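Two of the four modifications above, Clip-Higher and Dynamic Sampling, are small enough to sketch directly. The clip values `eps_low=0.2` and `eps_high=0.28` are illustrative defaults and should be checked against the paper; the helper names are hypothetical.

```python
import math

def decoupled_clip_term(logp, logp_old, adv, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with separate lower/upper bounds (Clip-Higher)."""
    rho = math.exp(logp - logp_old)
    clipped = max(1.0 - eps_low, min(rho, 1.0 + eps_high))
    return min(rho * adv, clipped * adv)

def keep_group(rewards):
    """Dynamic Sampling filter: keep a prompt only if its rewards are not all identical."""
    return max(rewards) > min(rewards)

def max_prob_after_clip(p_old, eps_high):
    """Largest probability a token can reach before the upper clip stops rewarding growth."""
    return min(1.0, p_old * (1.0 + eps_high))

groups = {
    "easy": [1, 1, 1, 1],    # all correct -> zero advantage -> dropped
    "mixed": [1, 0, 0, 1],   # informative -> kept
    "hard": [0, 0, 0, 0],    # all wrong -> dropped
}
kept = [name for name, r in groups.items() if keep_group(r)]
print("kept prompts:", kept)

# A rare token (p=0.01) gets almost no absolute headroom under a symmetric clip.
for eps_high in (0.2, 0.28):
    print(eps_high, max_prob_after_clip(0.01, eps_high), max_prob_after_clip(0.9, eps_high))
```

In a real training loop the filter is applied while filling the batch, so sampling continues until enough informative prompts are collected.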
5. Dr.GRPO — fixing GRPO's optimization bias / GRPO Done Right
- Identifies two biases in GRPO: std normalization in the advantage (amplifies imbalance across problem difficulty) + 1/response-length normalization in the loss (biases toward "longer wrong responses").
- Fix: remove the std division + remove length normalization (replace with a constant) → a more unbiased estimate; same performance with fewer tokens and no artificially inflated response length.
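Both biases can be made concrete with toy numbers. The sketch below compares the per-token weight under three loss aggregations (GRPO's `1/|o_i|`, DAPO's token-level mean, Dr.GRPO's constant normalizer) and the advantage with and without the std division. Lengths, rewards and the `max_len` constant are assumptions chosen for illustration.

```python
def per_token_weights(lengths, mode, max_len=None):
    """Weight each token of response i receives in the batch loss, per aggregation mode."""
    g = len(lengths)
    if mode == "grpo":        # sequence mean, then group mean: 1 / (G * |o_i|)
        return [1.0 / (g * n) for n in lengths]
    if mode == "dapo":        # token-level: 1 / sum_j |o_j|
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if mode == "dr_grpo":     # constant normalizer: 1 / (G * max_len)
        return [1.0 / (g * max_len) for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")

def centered_advantages(rewards, use_std):
    """Group-mean-centered advantages; use_std=True adds GRPO's z-score division."""
    mean = sum(rewards) / len(rewards)
    adv = [r - mean for r in rewards]
    if use_std:
        std = (sum(a * a for a in adv) / len(adv)) ** 0.5   # population std
        adv = [a / (std + 1e-6) for a in adv]
    return adv

lengths = [100, 1000]   # a short and a long response in one group
for mode in ("grpo", "dapo", "dr_grpo"):
    print(mode, per_token_weights(lengths, mode, max_len=1000))

# A prompt with 1 of 8 samples correct vs. one with 4 of 8 correct.
hard = [1, 0, 0, 0, 0, 0, 0, 0]
balanced = [1, 1, 1, 1, 0, 0, 0, 0]
print("with std:   ", centered_advantages(hard, True)[0], centered_advantages(balanced, True)[0])
print("without std:", centered_advantages(hard, False)[0], centered_advantages(balanced, False)[0])
```

Under `grpo` each token of the long response carries a tenth of the weight of a token in the short one, so a long wrong answer is penalized less per token. With the std division, the low-variance prompt gets a correct-sample advantage of about 2.6 versus 1.0 for the balanced prompt, which is the difficulty imbalance Dr.GRPO points at.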
5.5 GSPO — sequence-level importance ratio / Group Sequence Policy Optimization
GSPO (Qwen team, Zheng et al., arXiv:2507.18071, 2025-07) lifts the granularity of importance-sampling (IS) correction from "each token" to "the whole sequence", mitigating GRPO's instability when training large-scale MoE models.
Why GRPO's token-level ratio is unstable. Following PPO, GRPO computes a separate ratio per token, $w_{i,t}(\theta) = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})$:
- A single-token ratio is a single-sample estimate, intrinsically high variance, and the noise accumulates along long CoT sequences.
- Whenever some $w_{i,t}$ strays outside $[1-\varepsilon, 1+\varepsilon]$, that token's gradient is clipped to zero, which is frequent in long sequences even when the overall policy shift is small.
- Acute for MoE: after an update the router may send the same token to a different set of experts, so numerator and denominator run through different compute paths; routing drift shows up as ratio spikes that trigger clipping, which the paper calls "catastrophic and irreversible model collapse" (its words).
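GSPO's fix: unit matching. The reward is granted to the whole sequence, so the unit of IS correction should be the sequence too. The sequence-level ratio is the length-normalized geometric mean of the per-token ratios:

$$s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}\right)^{1/|y_i|} = \exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}\right)$$

The objective has the same PPO-clip form, but with the ratio replaced by $s_i(\theta)$ and a sequence-level advantage $\hat A_i$ (within-group z-score, as in GRPO):

$$J_{\text{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\big(s_i(\theta)\,\hat A_i,\ \mathrm{clip}(s_i(\theta), 1-\varepsilon, 1+\varepsilon)\,\hat A_i\big)\right]$$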
The whole sequence is either used or clipped as a unit — a single token's routing jump can no longer trigger gradient zeroing on its own.
| Aspect | GRPO (token-level) | GSPO (sequence-level) |
|---|---|---|
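| IS ratio | $w_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})$ | $s_i = \big(\pi_\theta(y_i \mid x) / \pi_{\theta_{\text{old}}}(y_i \mid x)\big)^{1/\lvert y_i \rvert}$ |
| clip range (paper's setup) | $\varepsilon \sim 0.2$ | $\varepsilon_l = 3\times10^{-4}$, $\varepsilon_r = 4\times10^{-4}$ |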
| clipping granularity | each token independently | whole sequence |
| MoE routing drift | ratio spikes → spurious clipping | geometric mean smooths most jitter |
Don't misread the order-of-magnitude gap in $\varepsilon$. GSPO's $\varepsilon$ (order $10^{-4}$) is far smaller than GRPO's $\varepsilon$ (order $10^{-1}$), but that is a design choice flowing from the different ratio definitions, not a mathematical inevitability of the geometric mean "compressing shifts to 1". If all tokens move the same direction, $s_i$ is the same order as the token ratios and is not compressed. The geometric mean only smooths within-sequence sign-mixed jitter (lower variance); GSPO uses a tiny $\varepsilon$ to impose a tighter sequence-level proximal constraint, so in practice clipping is active almost every step.
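The same-direction vs. sign-mixed distinction is easy to see with two hand-made sequences of per-token log-ratios (toy values, an assumption for illustration):

```python
import math

def sequence_ratio(token_log_ratios):
    """GSPO-style s_i: exp of the mean per-token log-ratio (geometric mean of token ratios)."""
    return math.exp(sum(token_log_ratios) / len(token_log_ratios))

same_direction = [0.1] * 8      # every token's log-ratio is +0.1
mixed_sign = [0.1, -0.1] * 4    # same magnitude, alternating sign

s_same = sequence_ratio(same_direction)   # stays at exp(0.1): no compression
s_mixed = sequence_ratio(mixed_sign)      # jitter cancels out
print(f"token ratio exp(0.1) = {math.exp(0.1):.4f}")
print(f"s_i (same direction) = {s_same:.4f}")
print(f"s_i (mixed sign)     = {s_mixed:.4f}")
```

A coherent shift passes through the geometric mean unchanged, so the tiny clip range is what actually constrains the update.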
Stability and engineering payoffs (paper's results, no independent replication):
- MoE stability: sequence likelihood doesn't fluctuate wildly with per-token routing drift, removing the need for the earlier Routing Replay (an internal stopgap, first disclosed in this paper).
- Precision robustness: sequence-level aggregation is insensitive to per-token numerical precision, so one can feed log-probs straight from an inference engine (e.g. vLLM) without recomputing through the training engine.
- On Qwen3-30B-A3B-Base, GSPO's training curves (AIME'24 / LiveCodeBench / CodeForces Elo) beat GRPO; the paper credits it with contributing to Qwen3's performance gains (an association claim, no controlled ablation).
GMPO (arXiv:2507.20673) argues sequence-level clipping is "too aggressive" and discards gradient information, advocating token-level clipping with geometric-mean weighting instead; the two trade off differently and there is no settled verdict yet.
CISPO (MiniMax, arXiv:2506.13585, 2025-06) attacks the clip-zeroes-gradients problem from another angle: instead of clipping the probability ratio (which zeroes the gradient of out-of-range tokens), it clips the scalar IS weight itself while keeping every token's gradient. The paper reports ~2× training speedup over DAPO on Qwen2.5-32B. GSPO does unit-matching at the sequence level, CISPO preserves gradient integrity at the token level — two complementary ways to repair GRPO's clipping.
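The difference between clipping the surrogate and clipping the IS weight shows up in the coefficient that multiplies $\nabla \log \pi_\theta$ for a single token. The sketch below writes both coefficients out by hand; the symmetric `eps` values are illustrative and are not the CISPO paper's settings.

```python
def ppo_clip_grad_coef(rho, adv, eps_low=0.2, eps_high=0.2):
    """Coefficient on grad log pi(token) under the PPO/GRPO clipped surrogate.

    The clipped branch is constant in theta, so the token's gradient is zero
    whenever clipping is active for the sign of the advantage.
    """
    if adv > 0 and rho > 1.0 + eps_high:
        return 0.0
    if adv < 0 and rho < 1.0 - eps_low:
        return 0.0
    return rho * adv

def cispo_grad_coef(rho, adv, eps_low=0.2, eps_high=0.2):
    """CISPO-style coefficient: stop_gradient(clip(rho)) * adv, never zeroed by clipping."""
    weight = max(1.0 - eps_low, min(rho, 1.0 + eps_high))
    return weight * adv

for rho in (0.7, 1.0, 1.1, 1.5):
    print(f"rho={rho}: ppo-clip={ppo_clip_grad_coef(rho, 1.0):.2f}  cispo={cispo_grad_coef(rho, 1.0):.2f}")
```

For `rho=1.5` with a positive advantage, the PPO-style coefficient is 0 while the CISPO-style one is capped at 1.2: the token still learns, just with a bounded weight.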
Toy comparison of token-level (GRPO) vs. sequence-level (GSPO) out-of-range counts:
```python
import torch

# Toy: G=3 responses of lengths 6/5/4; per-token logprobs under new vs old policy.
logp_new = torch.tensor([[-1.2,-0.8,-1.5,-0.4,-2.1,-1.0],
                         [-0.9,-1.3,-0.7,-1.8,-0.6, 0.0],
                         [-1.1,-0.5,-1.4,-0.9, 0.0, 0.0]])
logp_old = torch.tensor([[-1.3,-0.9,-1.4,-0.5,-2.0,-1.1],
                         [-1.0,-1.2,-0.8,-1.7,-0.7, 0.0],
                         [-1.0,-0.6,-1.3,-1.0, 0.0, 0.0]])
lengths = torch.tensor([6., 5., 4.])
mask = torch.arange(6)[None, :] < lengths[:, None].long()   # (G,T) real-token mask

log_ratio = logp_new - logp_old          # per-token log-ratio
w_token = torch.exp(log_ratio)           # GRPO: token-level ratio w_{i,t}

# GSPO: sequence-level ratio = length-normalized geometric mean of token ratios
mean_log_ratio = (log_ratio * mask.float()).sum(1) / lengths
s_seq = torch.exp(mean_log_ratio)        # s_i = (pi_theta/pi_old)^(1/|y_i|)

eps = 0.2                                # GRPO token-level clip
eps_l, eps_r = 3e-4, 4e-4                # GSPO sequence-level clip (asymmetric)

# Count ratios outside the clip range (padding positions are masked out).
grpo_clip = (((w_token < 1-eps) | (w_token > 1+eps)) & mask).sum().item()
gspo_clip = ((s_seq < 1-eps_l) | (s_seq > 1+eps_r)).sum().item()

for i in range(3):
    r = mask[i]
    print(f"resp{i} len={int(lengths[i])} token-ratio[{w_token[i][r].min():.3f},{w_token[i][r].max():.3f}] s_i={s_seq[i]:.4f}")
print(f"GRPO clipped {grpo_clip}/{int(mask.sum())} tokens (eps={eps})")
print(f"GSPO clipped {gspo_clip}/3 sequences (eps_l={eps_l}, eps_r={eps_r})")
# Note: with these toy numbers no token ratio leaves [0.8, 1.2] (GRPO flags 0/15),
# while s_i is about 1.034 / 1.020 / 1.000, so two of three sequences already exceed
# GSPO's tiny eps. The small eps is an intentional tight proximal constraint,
# NOT evidence that GSPO clips less than GRPO.
```
6. RLVR — RL from Verifiable Rewards
- Rewards come from rules / verifiers (math exact-match, code unit tests), not a learned neural RM.
- Advantages: almost no "neural RM being hacked" (verifier ≈ ground truth); disadvantage: only applicable to verifiable domains. This is the reward foundation for o1 / R1-style reasoning RL.
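A rule-based reward is usually just a small function. The sketch below shows one possible shape for the two cases mentioned above (math exact-match, code unit tests); the `\boxed{...}` answer convention and all function names are assumptions for illustration, and a real system would run untrusted code in a sandbox with timeouts.

```python
import re

def extract_boxed(text):
    """Return the content of the last \\boxed{...} in a response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(response, gold):
    """1.0 if the final boxed answer exactly matches the reference string, else 0.0."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0

def code_reward(func, test_cases):
    """1.0 only if func passes every (args, expected) test; any exception counts as failure."""
    try:
        return 1.0 if all(func(*args) == expected for args, expected in test_cases) else 0.0
    except Exception:
        return 0.0

print(math_reward("So the total is \\boxed{42}.", "42"))     # boxed and correct
print(math_reward("The answer is 42.", "42"))                # correct but unboxed
print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))
```

The second call scores 0 even though the number is right, a reminder that format rules are part of the reward and can themselves be gamed or missed.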
6.5 DeepSeek-R1 recipe
Chains the GRPO + RLVR pieces above into a full pipeline. R1 is not "RL all the way through" but four stages alternating SFT and RL:
| Stage | Name | What it does | Reward / data |
|---|---|---|---|
| 1 | Cold-start SFT | Fine-tune base on a small set of high-quality long-CoT samples | Supervised data (fixes readability / format / language mixing) |
| 2 | Reasoning RL | GRPO + RLVR to push reasoning on math/code | Rule rewards (answer exact-match + format + language consistency) |
| 3 | Rejection-sampling SFT | Sample heavily from the stage-2 policy, keep the correct ones, then SFT | Self-distilled data (~800k in the paper: reasoning + general mixed) |
| 4 | All-scenario RL | RL again over all prompts to align general preferences | Rule rewards for verifiable domains + helpful/harmless RM for general |
- R1-Zero: pure RL, no SFT (run GRPO + rule rewards directly from base). Proves reasoning can emerge spontaneously from RL (self-reflection / verification), but suffers poor readability / language mixing — precisely the motivation for the stage-1 cold-start SFT.
- R1-Distill: distill R1's generated reasoning data into smaller dense models (Qwen / Llama 1.5B–70B) via SFT only (no RL). Paper finding: distillation > running RL directly on small models — small models can't explore enough on their own, so feeding them the big model's reasoning traces wins.
- For the process-reward (PRM) vs outcome-reward (ORM) trade-off and Math-Shepherd-style rollout auto-labeling, see reward-modeling-eval §2; R1's main line uses a rule-based ORM (RLVR) rather than a neural PRM.
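Stage 3 (rejection-sampling SFT) reduces to "sample many, keep the verified ones". A minimal sketch with toy stand-ins: `toy_policy` and `toy_verifier` are hypothetical placeholders for the stage-2 policy and a rule-based checker, and the paper's additional filtering steps are not modeled.

```python
import random

def rejection_sample(prompts, sample_fn, verify_fn, n_samples=8, keep_per_prompt=1):
    """Sample n responses per prompt and keep only verified-correct ones as SFT pairs."""
    dataset = []
    for prompt in prompts:
        candidates = [sample_fn(prompt) for _ in range(n_samples)]
        correct = [c for c in candidates if verify_fn(prompt, c)]
        for response in correct[:keep_per_prompt]:
            dataset.append({"prompt": prompt, "response": response})
    return dataset

# Toy stand-ins (hypothetical): the "policy" guesses a sum, the verifier checks it.
rng = random.Random(0)

def toy_policy(prompt):
    a, b = prompt
    return a + b + rng.choice([0, 0, 1, -1])   # right about half the time

def toy_verifier(prompt, response):
    return response == sum(prompt)

sft_data = rejection_sample([(1, 2), (3, 4), (10, 5)], toy_policy, toy_verifier)
print(sft_data)
```

Prompts for which no sample passes simply contribute nothing, so the resulting dataset is biased toward what the current policy can already solve.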
7. long chain-of-thought & test-time scaling
- RLVR trained on long CoT → the model learns to "think longer" (more reasoning tokens), and accuracy improves with test-time compute (inference-time scaling).
- Observed phenomena: self-reflection / backtracking / spontaneous "aha moments"; evaluation shifts from "single-pass accuracy" to "accuracy given a compute budget".
8. self-rewarding / self-play
- Self-Rewarding LM: the model acts as its own judge to generate preference data and iteratively applies preference optimization, reducing dependence on human annotation.
- SPIN (self-play): the model's own previous outputs serve as "negative samples" for adversarial fine-tuning. Risk: self-preference gets amplified.
Stratified follow-ups
L1 Fundamentals
- What does GRPO save compared to PPO? How exactly is the "group-relative advantage" computed?
- Where does RLVR's reward come from? Why does it mitigate reward hacking? What are its limitations?
L2 Advanced
- In long-CoT RL, why does "token-level vs. sequence-level" loss matter? (Gradient dilution for long responses.)
- What bias does std normalization of the advantage introduce in GRPO? How does Dr.GRPO fix it?
- What problem does DAPO's clip-higher solve? What is entropy collapse and why is it harmful?
L3 Deep Dive
- Derive: why can GRPO's advantage be seen as "approximating the value baseline with the within-group mean"? How are bias and variance balanced?
- Why does dynamic sampling (discarding all-correct / all-wrong groups) improve efficiency? What is its relationship to curriculum / difficulty sampling?
- What does test-time scaling imply for the paradigm of "capabilities are acquired at training time"? What are the trade-offs between reasoning RL and "pure SFT distillation of long CoT"?
- Under what conditions does critic-free (GRPO / RLOO) actually underperform value-based PPO?
Extended L3
Q: GRPO retains the KL penalty $\beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$, but in long-CoT training the model needs to explore long reasoning paths far beyond the reference distribution. How should we understand this tension? What happens if KL is removed?
A: The role of the KL penalty is policy anchoring: preventing the policy from drifting under reward hacking (collapsing onto some reward shortcut). But in long-CoT settings, precisely what the model needs to learn is the long-chain reflection behavior that the reference model cannot do, so the KL is inherently penalizing "novel reasoning paths." Practical trade-offs: $\beta$ too large → model fails to learn long CoT, reasoning capacity is capped by the reference; $\beta$ too small → policy may degrade into reward hacking (e.g., repeating tokens to fool the verifier). DAPO's original recipe actually removes KL and instead relies on clip-higher + dynamic sampling to prevent collapse; GRPO retains KL but typically sets $\beta$ to a low value. At its core this is a tightrope walk between "not collapsing" and "being able to explore". Follow-up: If you want to remove KL to open up exploration while still preventing policy collapse, what alternative anchoring mechanisms are feasible beyond clip-higher? (e.g., EMA reference, regularization toward the SFT checkpoint, etc.)
Q: From a variance reduction perspective, what are the theoretical pros and cons of GRPO's within-group standardization baseline vs. RLOO's leave-one-out baseline?
A: Both methods are variance-reduction variants of REINFORCE; the difference lies in baseline construction. GRPO uses $b = \frac{1}{G}\sum_j r_j$ and divides by $\mathrm{std}(r)$ (i.e., a z-score); RLOO gives sample $i$ a baseline of $b_i = \frac{1}{G-1}\sum_{j \neq i} r_j$. RLOO's leave-one-out baseline is unbiased (because $r_i$ is excluded from its own baseline), whereas GRPO's mean includes $r_i$ itself, introducing a mild self-correlation bias (negligible when $G$ is large): since $r_i - b = \frac{G-1}{G}(r_i - b_i)$, the mean-centered advantage is the leave-one-out advantage shrunk by a constant factor. However, GRPO's std normalization simultaneously applies variance scaling, making it more robust when reward magnitudes are uncertain, at the cost of the problem-difficulty bias that Dr.GRPO identifies. RLOO does not apply std normalization: it is more sensitive to reward scale but less biased. The choice depends on the stability of the task's reward distribution. Follow-up: What would happen if you combined RLOO's leave-one-out baseline with GRPO's std normalization? Are there known problems with this hybrid?
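Q: DAPO's token-level loss divides the summed token losses by the total number of tokens $\sum_i |o_i|$, i.e., $\frac{1}{\sum_i |o_i|}\sum_i \sum_t \ell_{i,t}$ with $\ell_{i,t}$ the per-token loss. Does this in turn introduce an implicit bias toward "short correct responses"?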
A: There is some effect, but the direction is more complex than intuition suggests. Token-level loss gives every token the same weight $1/\sum_i |o_i|$, so a response's total contribution grows with its length $|o_i|$; it is sequence-level averaging (weight $1/|o_i|$ per token) that gives each token of a short response a larger gradient. The sign of the gradient still matters: correct responses receive positive reinforcement, wrong responses receive negative penalization. Relative to sequence-level averaging, token-level loss therefore penalizes long wrong responses more strongly in total and no longer dilutes the per-token signal of long correct ones. With binary 0/1 rewards, a long and a short correct response in the same group share the same advantage, so under token-level loss the long one receives more total reinforcement (same advantage on more tokens); pressure to compress correct reasoning chains then has to come from other terms, such as overlong reward shaping, rather than from the normalization itself. Follow-up: If DAPO token-level loss is combined with outcome reward (only the final answer's correctness), how do "compress reasoning chain length" and "compress to correct answer" compete with each other?
Q: RLVR currently only works in verifiable domains (math/code). Can a process reward model (PRM) be used as a "soft verifier" and integrated into the GRPO/DAPO framework? What are the technical obstacles?
A: The idea is feasible in principle: use the PRM to score each step of the CoT, aggregate step-level scores into a sequence-level reward, and feed this into GRPO's within-group advantage computation. But there are three obstacles: ① PRM annotation and training — step-level human annotations or automated labels (e.g., Monte Carlo rollout estimation) are required, which is expensive and of limited accuracy; ② Reward alignment — the PRM scores "reasoning step quality," which may be inconsistent with final answer correctness (good steps but wrong answer vs. rough steps but correct answer), producing conflicting RL signals; ③ Temporal credit assignment — when aggregating step-level scores into a sequence reward, the choice of weighting scheme (mean? final step? worst step?) directly affects learning dynamics. A simple mean blurs the contribution of critical steps; using only the final step degenerates into outcome reward. Follow-up: Within the GRPO framework, is it possible to use different reward aggregation strategies (adaptive weighting) for different samples in the group, rather than a uniform scheme?
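Obstacle ③ can be illustrated with three aggregation rules applied to made-up step scores (the scores and the function name are hypothetical; no real PRM is involved):

```python
def aggregate_step_scores(step_scores, how="mean"):
    """Collapse per-step PRM scores into one sequence-level reward."""
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    if how == "min":     # a chain is only as good as its weakest step
        return min(step_scores)
    if how == "last":    # only the final step counts: close to an outcome reward
        return step_scores[-1]
    raise ValueError(f"unknown aggregation: {how}")

# Hypothetical PRM scores in [0, 1] for two responses to the same prompt.
one_bad_step = [0.9, 0.9, 0.1, 0.9]    # solid reasoning with a single flawed step
uniformly_meh = [0.7, 0.7, 0.7, 0.7]

for how in ("mean", "min", "last"):
    a = aggregate_step_scores(one_bad_step, how)
    b = aggregate_step_scores(uniformly_meh, how)
    print(f"{how:>4}: one_bad_step={a:.2f}  uniformly_meh={b:.2f}")
```

The mean cannot tell the two responses apart, while `min` and `last` rank them in opposite orders, so the within-group advantage (and hence the gradient direction) depends entirely on the aggregation choice.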
Q: In reasoning RL, 0/1 binary rewards are common (correct answer = 1, wrong = 0). When prompt difficulty varies widely, a group of samples may be all correct or all wrong. Beyond dynamic sampling (discarding such prompts), what methods can extract useful training signal from an "all-wrong group"?
A: The core problem with an all-wrong group is that all rewards are identical → advantage is always zero → zero gradient. Several approaches: ① Introduce process rewards — even if all final answers are wrong, the quality of intermediate reasoning steps may differ; use a PRM or auxiliary signals such as reasoning length and format compliance to create within-group variation; ② Mixed reward design — layer format rewards, reasoning completeness rewards, and other soft signals on top of the outcome reward so that even "all-wrong" groups can still distinguish better from worse responses; ③ Cross-prompt baseline — rather than restricting to within-group comparison, use a batch-level or moving-average global baseline to provide gradient direction even for all-wrong groups; ④ Difficulty bucketing + resampling — mark all-wrong prompts as "too hard," downsample their frequency without fully discarding them, avoiding a training-set bias toward easy problems. Each approach has different costs: process rewards require extra annotation or models; cross-prompt baselines may introduce high variance. Follow-up: How would a cross-prompt baseline (e.g., using an EMA global mean as the baseline) be implemented concretely within GRPO? How should the weight be tuned when mixing with the within-group baseline?
Q: How does the choice of group size $G$ in GRPO theoretically and practically affect training? What does GRPO degenerate into at $G=1$ and $G \to \infty$?
A: At $G=1$ there is only one sample, $r_1 = \mathrm{mean}(r)$, and the advantage is always zero: no gradient at all, so GRPO is completely ineffective. At $G=2$ it degenerates into pairwise comparison, essentially an online version of pairwise preference learning. As $G \to \infty$ the within-group mean and standard deviation converge to the population expectation and standard deviation, so the baseline approximates a Monte Carlo estimate of the global value function; in theory GRPO then approaches REINFORCE with a global baseline. Practical trade-offs: $G$ too small → high baseline variance, noisy advantage estimates; $G$ too large → high sampling cost per prompt (inference cost grows linearly) and diversity may actually decrease (many similar responses in repeated sampling). DeepSeek's practice uses a moderate value of $G$. There is also coupling between $G$ and the clip range and learning rate: with larger $G$ the advantage estimate is more accurate, so more aggressive update steps are tolerable. Follow-up: Is there an adaptive-$G$ strategy, sampling more from "hard" prompts and fewer from "easy" ones? How would this coordinate with dynamic sampling?
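The two small-group cases can be checked directly. The sketch uses the population standard deviation, which is an assumption: with the sample standard deviation the $G=1$ case is undefined and the $G=2$ magnitudes become about 0.707 instead of 1.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO z-score advantage using the population std (assumption, see lead-in)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print("G=1:", group_advantages([1.0]))              # single sample: advantage 0
print("G=2:", group_advantages([1.0, 0.0]))         # pairwise: roughly +1 / -1
print("G=2, tie:", group_advantages([1.0, 1.0]))    # identical rewards: no signal
print("G=8:", [round(a, 2) for a in group_advantages([1, 0, 0, 0, 1, 0, 0, 0])])
```

At $G=2$ the advantage only encodes which of the two responses won, which is why it behaves like an online pairwise preference signal.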
Q: In the self-rewarding paradigm the model generates preference data using its own judgments and trains iteratively. From the perspective of online learning theory, under what conditions does this self-play converge, and under what conditions does it mode-collapse?
A: The core condition for convergence is that the reward signal must continuously provide effective discrimination — the model must be able to distinguish the quality of its own outputs. When model capability is far below task difficulty, self-judgment is noisy but not systematically biased; training may be slow but won't collapse. Typical triggers for mode collapse: ① Amplified anchoring effect — the model prefers responses that resemble its own style; a positive feedback loop continuously reduces output diversity until it collapses to a narrow "self-preferred" mode; ② Judgment saturation — when model outputs become too similar in quality, the reward signal degrades to noise and training loses direction; ③ Reward hacking its own judge — the model learns to "convince" its own judge rather than genuinely improving, analogous to Goodhart's Law applied to itself. Mitigations include: retaining an SFT anchor, periodically introducing external verification signals, and limiting the number of iterations. Follow-up: If a fixed external verifier (e.g., code unit tests) is introduced into the self-rewarding loop as an "anchor" to calibrate self-scoring, to what extent can it prevent mode collapse?
Q: DAPO's clip-higher raises the upper clip bound to give low-probability tokens room to increase. From an information geometry perspective, why does standard PPO's symmetric clip systematically suppress valuable low-probability reasoning tokens in long CoT?
A: Standard PPO's symmetric clip acts on the importance ratio $\rho_t = \pi_\theta / \pi_{\theta_{\text{old}}}$. The key insight: the bound is multiplicative, so within one update a token's probability can grow to at most $(1+\varepsilon)\,\pi_{\theta_{\text{old}}}$. In long CoT, already high-probability tokens (e.g., common reasoning connectives) have $\rho_t$ close to 1, so the symmetric clip barely affects them; but newly learned low-probability reasoning patterns (e.g., specific token sequences for backtracking or self-correction) start at very low probability, and when the advantage is positive ($\hat A > 0$), $\rho_t$ rises and hits the upper bound $1+\varepsilon$ quickly, hard-capping the growth of these tokens at a tiny absolute increase. Meanwhile, when $\hat A < 0$, the lower bound $1-\varepsilon$ equally limits the decline of high-probability tokens, but high-probability tokens have ample room to decrease anyway. Therefore the symmetric clip creates an asymmetric learning dynamic in information-geometric terms: the "downward channel" for high-probability tokens is wider than the "upward channel" for low-probability tokens. Clip-higher breaks this asymmetry by raising the upper bound. Follow-up: Apart from decoupling the clip bounds, is it possible to address this problem more elegantly from a trust-region perspective (e.g., using a KL constraint instead of a hard clip)? What is the computational cost of doing so in the long-CoT setting?
§A Key Papers Timeline
- 2017-07 · PPO — Schulman et al., arXiv preprint. arXiv:1707.06347 — Clipped surrogate objective + GAE advantage estimation; establishes the LLM-RL baseline (actor + critic + ref + RM).
- 2024-01 · Self-Rewarding LM — Yuan et al., Preprint (Meta). arXiv:2401.10020 — Model acts as its own judge via LLM-as-a-Judge to generate preference data and iterate; risk: self-preference amplification.
- 2024-01 · SPIN — Chen et al., ICML 2024. arXiv:2401.01335 — Self-play fine-tuning using the model's own older outputs as negatives.
- 2024-02 · GRPO / DeepSeekMath — Shao et al., Preprint. arXiv:2402.03300 — Removes the critic; group-relative reward (z-score) replaces the value baseline; keeps KL to ref; core DeepSeek algorithm.
- 2024-02 · RLOO — Ahmadian et al., ACL 2024. arXiv:2402.14740 — Critic-free; baseline for sample i = mean reward of the other G-1 samples (leave-one-out); pure REINFORCE, no clip; competitive with PPO on RLHF.
- 2025-01 · DeepSeek-R1 / RLVR — Guo et al., Nature 2025. arXiv:2501.12948 — Rule/verifier rewards (math exact-match, code unit tests) replace the neural RM, nearly eliminating neural-RM hacking (verifier hacking / format gaming remain); GRPO + long-CoT RL induces self-reflection; opens inference-time scaling.
- 2025-03 · DAPO — Yu et al., Preprint (ByteDance Seed / Tsinghua AIR). arXiv:2503.14476 — Four long-CoT-RL fixes: Clip-Higher (anti entropy-collapse), Dynamic Sampling, token-level loss, overlong reward shaping.
- 2025-03 · Dr.GRPO — Liu et al., Preprint. arXiv:2503.20783 — Fixes two GRPO biases (std normalization, 1/length normalization); removing both yields an unbiased estimator with better token efficiency.
- 2025-06 · CISPO — MiniMax team, Preprint (MiniMax-M1 tech report). arXiv:2506.13585 — Clips the scalar IS weight itself rather than the probability ratio, keeping every token's gradient signal; the paper reports ~2× training speedup over DAPO on Qwen2.5-32B.
- 2025-07 · GSPO — Zheng et al., Preprint (Alibaba Qwen). arXiv:2507.18071 — Lifts IS correction from token-level to sequence-level (length-normalized geometric mean), mitigating GRPO's collapse when training large-scale MoE models; used for Qwen3 training.