Cheatsheet

Reasoning-RL Frontier

The hottest track in post-training from 2024–2026: from PPO to critic-free GRPO / RLOO, on to long-CoT reasoning RL (DAPO / Dr.GRPO) and RLVR. Frequently asked at frontier labs (Seed / DeepSeek / Qwen / Moonshot). ⚠️ For specific paper numbers (benchmark scores, etc.) always defer to the original paper; this page focuses on mechanisms and trade-offs and deliberately avoids stacking numbers.

0. The evolution

PPO (actor + critic + ref + RM, GAE advantage) → GRPO (drops the critic, uses "group-relative" as the baseline) → DAPO / Dr.GRPO (fixes GRPO's bias and entropy collapse under long-CoT); side branch RLOO (leave-one-out baseline). Reward source: learned RM → (in verifiable domains) RLVR (rules / verifiers supply the reward).

1. PPO recap

2. GRPO — dropping the critic / Group Relative Policy Optimization

From-scratch implementation (group z-score advantage + per-token clip + K3 KL, in-loss):

import torch

def grpo_loss(logp, logp_old, logp_ref, rewards, mask, group_size,
              clip_eps=0.2, beta=0.04):
    # logp/logp_old/logp_ref: (B, T) per-token logprobs; B = n_prompts * group_size
    r = rewards.view(-1, group_size)                       # (n_prompts, G)
    adv = (r - r.mean(1, keepdim=True)) / (r.std(1, keepdim=True) + 1e-6)
    adv = adv.reshape(-1, 1)                               # (B,1) group z-score advantage
    ratio = torch.exp(logp - logp_old)                     # importance ratio ρ
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy = torch.min(surr1, surr2)                       # clipped surrogate
    logr = logp_ref - logp                                 # log(π_ref/π_θ)
    kl = torch.exp(logr) - logr - 1                        # K3 estimator, always ≥ 0
    per_tok = policy - beta * kl                           # KL placed in the loss
    seq = (per_tok * mask).sum(1) / mask.sum(1).clamp(min=1)  # 1/|o_i| length normalization
    return -seq.mean()
# Dr.GRPO de-bias: drop the /std in adv; replace 1/|o_i| with a constant (e.g. max length).
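
The Dr.GRPO note above can be sketched as a variant of `grpo_loss`. This is a minimal sketch under stated assumptions, not the reference implementation: the name `dr_grpo_loss` and the `max_len` argument are illustrative, the KL term is omitted for brevity, and the toy numbers are made up.

```python
import torch

def dr_grpo_loss(logp, logp_old, rewards, mask, group_size,
                 max_len, clip_eps=0.2):
    # logp/logp_old: (B, T) per-token logprobs; mask: (B, T), 1.0 on real tokens
    r = rewards.view(-1, group_size)                        # (n_prompts, G)
    adv = (r - r.mean(1, keepdim=True)).reshape(-1, 1)      # mean-centred only, no /std
    ratio = torch.exp(logp - logp_old)
    surr = torch.min(ratio * adv,
                     torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # constant normaliser (max_len) instead of the per-response length |o_i|
    seq = (surr * mask).sum(1) / max_len
    return -seq.mean()

# Toy check (hypothetical numbers): 2 prompts, G=2, T=5
torch.manual_seed(0)
B, T, G = 4, 5, 2
logp_old = -torch.rand(B, T)
logp = logp_old.clone().requires_grad_(True)
mask = torch.ones(B, T)
mask[1, 3:] = 0.0                                           # response 1 has 3 real tokens
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])                # second group is all-correct
loss = dr_grpo_loss(logp, logp_old, rewards, mask, G, max_len=T)
loss.backward()
print(loss.item(), logp.grad)
```

In this toy setup the all-correct group contributes no gradient, and padded positions receive none either.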

3. RLOO — REINFORCE leave-one-out

4. DAPO — keeping GRPO stable under long-CoT / Decoupled-clip & Dynamic-sAmpling PO

ByteDance 2025 open-source recipe — four modifications targeting long-chain reasoning RL:

  1. Clip-Higher: decouple the upper and lower clip bounds $\epsilon$ and raise the upper bound → gives low-probability tokens room to rise, preventing entropy collapse (policy becoming deterministic too early and ceasing to explore).
  2. Dynamic Sampling: discard prompts where the entire group is correct or entirely wrong (within-group advantage is always 0, yielding zero gradient), ensuring every batch contributes useful signal.
  3. Token-level loss: average by token rather than by sequence, preventing the gradient of long responses from being diluted (critical for long CoT).
  4. Overlong reward shaping: soft penalty for excessively long responses, stabilizing training.
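
The four modifications above can be sketched roughly as follows. This is a hedged illustration rather than the released DAPO code: the function names are hypothetical, the default clip values are placeholders, and the penalty shape (zero inside the budget, linear across a buffer zone, -1 beyond the hard limit) is one possible form of overlong shaping.

```python
import torch

def dynamic_sampling_keep(rewards, group_size):
    # Keep only prompts whose group has mixed outcomes (non-zero reward spread).
    r = rewards.view(-1, group_size)
    keep = (r.max(1).values - r.min(1).values) > 0              # (n_prompts,)
    return keep.unsqueeze(1).expand(-1, group_size).reshape(-1)  # (B,)

def soft_overlong_penalty(length, max_len, buffer):
    # 0 within budget, linear down to -1 across the buffer, -1 past max_len.
    start = max_len - buffer
    if length <= start:
        return 0.0
    if length <= max_len:
        return (start - length) / buffer
    return -1.0

def dapo_policy_loss(logp, logp_old, adv, mask, eps_low=0.2, eps_high=0.28):
    # adv: (B, 1); mask: (B, T). Decoupled clip bounds + token-level mean.
    ratio = torch.exp(logp - logp_old)
    surr = torch.min(ratio * adv,
                     torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv)
    return -(surr * mask).sum() / mask.sum().clamp(min=1)        # one mean over all tokens

# Toy usage (hypothetical numbers): 3 prompts, G=2
rewards = torch.tensor([1., 1., 0., 1., 0., 0.])
keep = dynamic_sampling_keep(rewards, group_size=2)

logp_old = torch.zeros(2, 3)
logp = logp_old.clone()
adv = torch.tensor([[1.0], [-1.0]])
mask = torch.tensor([[1., 1., 1.], [1., 0., 0.]])
toy_loss = dapo_policy_loss(logp, logp_old, adv, mask)
print(keep.tolist(), soft_overlong_penalty(90, 100, 20), toy_loss.item())
```

Only the middle prompt survives filtering here, because the first group is all-correct and the last is all-wrong.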

5. Dr.GRPO — fixing GRPO's optimization bias / GRPO Done Right

5.5 GSPO — sequence-level importance ratio / Group Sequence Policy Optimization

Note

GSPO (Qwen team, Zheng et al., arXiv:2507.18071, 2025-07) lifts the granularity of importance-sampling (IS) correction from "each token" to "the whole sequence", mitigating GRPO's instability when training large-scale MoE models.

Why GRPO's token-level ratio is unstable. Following PPO, GRPO computes a separate ratio per token, $w_{i,t}=\pi_\theta(y_{i,t}\mid x,y_{i,<t})/\pi_{\theta_\text{old}}(\cdots)$:

GSPO's fix: unit matching. The reward is granted to the whole sequence, so the unit of IS correction should be the sequence too. The sequence-level ratio is the length-normalized geometric mean:

$$s_i(\theta)=\left(\frac{\pi_\theta(y_i\mid x)}{\pi_{\theta_\text{old}}(y_i\mid x)}\right)^{1/|y_i|}=\exp\!\left(\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\log\frac{\pi_\theta(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_\text{old}}(\cdots)}\right)$$

The objective has the same PPO-clip form, but with the ratio replaced by $s_i$ and a sequence-level advantage $\hat A_i$ (within-group z-score, as in GRPO):

$$J_\text{GSPO}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^G\min\!\big(s_i\hat A_i,\ \mathrm{clip}(s_i,1{-}\varepsilon_l,1{+}\varepsilon_r)\,\hat A_i\big)\right]$$

The whole sequence is either used or clipped as a unit — a single token's routing jump can no longer trigger gradient zeroing on its own.
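
The objective above can be sketched as a loss function. This is a minimal sketch assuming padded `(B, T)` log-prob tensors and a 0/1 `mask`; the name `gspo_loss` is illustrative, and the default clip values only mirror the paper's setup quoted in the comparison on this page.

```python
import torch

def gspo_loss(logp, logp_old, rewards, mask, group_size,
              eps_l=3e-4, eps_r=4e-4):
    lengths = mask.sum(1).clamp(min=1)
    # s_i: length-normalized geometric mean of the token ratios, shape (B,)
    s = torch.exp(((logp - logp_old) * mask).sum(1) / lengths)
    r = rewards.view(-1, group_size)
    adv = ((r - r.mean(1, keepdim=True)) /
           (r.std(1, keepdim=True) + 1e-6)).reshape(-1)          # group z-score, (B,)
    surr = torch.min(s * adv, torch.clamp(s, 1 - eps_l, 1 + eps_r) * adv)
    return -surr.mean()

# Toy check (hypothetical numbers): one prompt, G=2, every token shifted by +0.01
logp_old = torch.tensor([[-1.0, -2.0, -0.5],
                         [-1.5, -0.7, -1.2]])
logp = (logp_old + 0.01).requires_grad_(True)
mask = torch.ones(2, 3)
rewards = torch.tensor([1.0, 0.0])
loss = gspo_loss(logp, logp_old, rewards, mask, group_size=2)
loss.backward()
print(logp.grad)
```

In this toy case the positive-advantage response has s_i above the upper bound, so all of its tokens get zero gradient together, while the negative-advantage response keeps an identical gradient on every token.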

| Aspect | GRPO (token-level) | GSPO (sequence-level) |
| --- | --- | --- |
| IS ratio | $w_{i,t}=\pi_\theta(y_{i,t})/\pi_{\theta_\text{old}}(y_{i,t})$ | $s_i=(\pi_\theta(y_i)/\pi_{\theta_\text{old}}(y_i))^{1/\lvert y_i\rvert}$ |
| clip range (paper's setup) | $\varepsilon_l{=}0.2,\ \varepsilon_r{=}0.27$ | $\varepsilon_l{=}3{\times}10^{-4},\ \varepsilon_r{=}4{\times}10^{-4}$ |
| clipping granularity | each token independently | whole sequence |
| MoE routing drift | ratio spikes → spurious clipping | geometric mean smooths most jitter |

Note

Don't misread the order-of-magnitude gap in $\varepsilon$. GSPO's $\varepsilon\sim10^{-4}$ is far smaller than GRPO's $\sim0.2$, but that is a design choice flowing from the different ratio definitions, not a mathematical inevitability of the geometric mean "compressing shifts to 1". If all tokens move in the same direction, $s_i$ is of the same order as the token ratios and is not compressed. The geometric mean only smooths within-sequence sign-mixed jitter (lower variance); GSPO uses a tiny $\varepsilon$ to impose a tighter sequence-level proximal constraint, so in practice clipping is active almost every step.

Stability and engineering payoffs (paper's results, no independent replication):

Caution

GMPO (arXiv:2507.20673) argues sequence-level clipping is "too aggressive" and discards gradient information, advocating token-level clipping with geometric-mean weighting instead; the two trade off differently and there is no settled verdict yet.

CISPO (MiniMax, arXiv:2506.13585, 2025-06) attacks the clip-zeroes-gradients problem from another angle: instead of clipping the probability ratio (which zeroes the gradient of out-of-range tokens), it clips the scalar IS weight itself while keeping every token's gradient. The paper reports ~2× training speedup over DAPO on Qwen2.5-32B. GSPO does unit-matching at the sequence level, CISPO preserves gradient integrity at the token level — two complementary ways to repair GRPO's clipping.
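
The contrast with PPO-style clipping can be sketched as below. This is a rough illustration assuming the clipped IS weight is treated as a constant (stop-gradient) that multiplies a REINFORCE-style `adv * logp` term; the name `cispo_loss` and the clip values are placeholders, not the paper's settings.

```python
import torch

def cispo_loss(logp, logp_old, adv, mask, eps_low=0.2, eps_high=0.2):
    ratio = torch.exp(logp - logp_old)
    # Clip the IS weight itself and detach it: it rescales, but never zeroes, the gradient.
    w = torch.clamp(ratio, 1 - eps_low, 1 + eps_high).detach()
    per_tok = w * adv * logp
    return -(per_tok * mask).sum() / mask.sum().clamp(min=1)

# Toy check (hypothetical numbers): the second token's ratio (~1.65) is out of range
logp_old = torch.tensor([[-1.0, -1.0]])
adv = torch.tensor([[1.0]])
mask = torch.ones(1, 2)

logp = torch.tensor([[-1.0, -0.5]], requires_grad=True)
cispo_loss(logp, logp_old, adv, mask).backward()

# PPO-style clipped surrogate on the same inputs, for comparison
logp_ppo = torch.tensor([[-1.0, -0.5]], requires_grad=True)
ratio = torch.exp(logp_ppo - logp_old)
ppo = -torch.min(ratio * adv, torch.clamp(ratio, 0.8, 1.2) * adv).mean()
ppo.backward()
print(logp.grad, logp_ppo.grad)
```

On the out-of-range token the PPO-style surrogate yields zero gradient, whereas the sketch keeps a gradient whose weight is capped at 1.2.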

import torch
# Toy: G=3 responses of lengths 6/5/4; per-token logprobs under new vs old policy.
logp_new = torch.tensor([[-1.2,-0.8,-1.5,-0.4,-2.1,-1.0],
                         [-0.9,-1.3,-0.7,-1.8,-0.6, 0.0],
                         [-1.1,-0.5,-1.4,-0.9, 0.0, 0.0]])
logp_old = torch.tensor([[-1.3,-0.9,-1.4,-0.5,-2.0,-1.1],
                         [-1.0,-1.2,-0.8,-1.7,-0.7, 0.0],
                         [-1.0,-0.6,-1.3,-1.0, 0.0, 0.0]])
lengths = torch.tensor([6., 5., 4.])
mask = torch.arange(6)[None, :] < lengths[:, None].long()   # (G,T) real-token mask

log_ratio = logp_new - logp_old                             # per-token log-ratio
w_token = torch.exp(log_ratio)                              # GRPO: token-level ratio w_{i,t}
# GSPO: sequence-level ratio = length-normalized geometric mean of token ratios
mean_log_ratio = (log_ratio * mask.float()).sum(1) / lengths
s_seq = torch.exp(mean_log_ratio)                           # s_i = (pi_theta/pi_old)^(1/|y_i|)

eps = 0.2                      # GRPO token-level clip
eps_l, eps_r = 3e-4, 4e-4      # GSPO sequence-level clip (asymmetric)
grpo_clip = (((w_token < 1-eps) | (w_token > 1+eps)) & mask).sum().item()
gspo_clip = ((s_seq < 1-eps_l) | (s_seq > 1+eps_r)).sum().item()

for i in range(3):
    r = mask[i]
    print(f"resp{i} len={int(lengths[i])}  token-ratio[{w_token[i][r].min():.3f},{w_token[i][r].max():.3f}]  s_i={s_seq[i]:.4f}")
print(f"GRPO clipped {grpo_clip}/{int(mask.sum())} tokens (eps={eps})")
print(f"GSPO clipped {gspo_clip}/3 sequences (eps_l={eps_l}, eps_r={eps_r})")
# Note: resp0/resp1 have s_i ≈ 1.03/1.02, far outside GSPO's tiny eps, so both are
# out of range, while resp2's sign-mixed token shifts cancel to s_i ≈ 1.00. No token
# ratio (0.905-1.105) leaves GRPO's ±0.2 band. The small eps is an intentional tight
# proximal constraint, NOT evidence that GSPO clips less than GRPO.

6. RLVR — RL from Verifiable Rewards

6.5 DeepSeek-R1 recipe

Chains the GRPO + RLVR pieces above into a full pipeline. R1 is not "RL all the way through" but four stages alternating SFT and RL:

| Stage | Name | What it does | Reward / data |
| --- | --- | --- | --- |
| 1 | Cold-start SFT | Fine-tune base on a small set of high-quality long-CoT samples | Supervised data (fixes readability / format / language mixing) |
| 2 | Reasoning RL | GRPO + RLVR to push reasoning on math/code | Rule rewards (answer exact-match + format + language consistency) |
| 3 | Rejection-sampling SFT | Sample heavily from the stage-2 policy, keep the correct ones, then SFT | Self-distilled data (~800k in the paper: reasoning + general mixed) |
| 4 | All-scenario RL | RL again over all prompts to align general preferences | Rule rewards for verifiable domains + helpful/harmless RM for general |
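
A stage-2 style rule reward can be sketched as a plain function. This is a hypothetical illustration: the `<think>...</think>` layout, the `format_bonus` value, and the choice to give zero reward to malformed outputs are all assumptions, and the language-consistency term is left out.

```python
import re

# Assumed layout: reasoning inside <think>...</think>, final answer after it.
THINK_RE = re.compile(r"^<think>.+?</think>\s*(.+)$", re.DOTALL)

def rule_reward(response, gold_answer, format_bonus=0.1):
    m = THINK_RE.match(response.strip())
    if m is None:
        return 0.0                                   # malformed output: no reward at all
    answer = m.group(1).strip()
    correct = 1.0 if answer == gold_answer.strip() else 0.0   # exact-match check
    return correct + format_bonus

print(rule_reward("<think>2+2=4</think> 4", "4"))    # well-formed and correct
print(rule_reward("<think>hmm</think> 5", "4"))      # well-formed but wrong
print(rule_reward("4", "4"))                         # correct but malformed
```

Real verifiers usually normalize answers (for example, numeric equivalence or unit tests for code) rather than comparing raw strings.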

7. long chain-of-thought & test-time scaling

8. self-rewarding / self-play


Stratified follow-ups

L1 Fundamentals

L2 Advanced

L3 Deep Dive

Extended L3

Q: GRPO retains the KL penalty $\beta\,\mathrm{KL}(\pi_\theta\|\pi_{ref})$, but in long-CoT training the model needs to explore long reasoning paths far beyond the reference distribution. How should we understand this tension? What happens if KL is removed?

A: The role of the KL penalty is policy anchoring: it prevents the policy from drifting under reward hacking (collapsing onto some reward shortcut). But in long-CoT settings, what the model needs to learn is precisely the long-chain reflection behavior that the reference model cannot produce, so the KL inherently penalizes "novel reasoning paths." Practical trade-offs: $\beta$ too large → the model fails to learn long CoT and reasoning capacity is capped by the reference; $\beta$ too small → the policy may degrade into reward hacking (e.g., repeating tokens to fool the verifier). DAPO's original recipe actually removes KL and instead relies on clip-higher + dynamic sampling to prevent collapse; GRPO retains KL but typically sets it to a low value. At its core this is a tightrope walk between "not collapsing" and "being able to explore". Follow-up: If you want to remove KL to open up exploration while still preventing policy collapse, what alternative anchoring mechanisms are feasible beyond clip-higher? (e.g., EMA reference, regularization toward the SFT checkpoint, etc.)


Q: From a variance reduction perspective, what are the theoretical pros and cons of GRPO's within-group standardization baseline vs. RLOO's leave-one-out baseline?

A: Both methods are variance-reduction variants of REINFORCE; the difference lies in baseline construction. GRPO uses $\mathrm{mean}(r)$ and divides by $\mathrm{std}(r)$ (i.e., a z-score); RLOO gives sample $i$ a baseline of $\frac{1}{G-1}\sum_{j\neq i}r_j$. RLOO's leave-one-out baseline is unbiased because $r_i$ is excluded from its own baseline. GRPO's $\mathrm{mean}(r)$ includes $r_i$ itself, so the mean-centred advantage equals the RLOO advantage shrunk by a factor $\frac{G-1}{G}$: a uniform rescaling that is negligible when $G$ is large. However, GRPO's std normalization also applies variance scaling, making it more robust when reward magnitudes are uncertain, at the cost of the problem-difficulty bias that Dr.GRPO identifies. RLOO does not apply std normalization: it is more sensitive to reward scale but avoids that bias. The choice depends on the stability of the task's reward distribution. Follow-up: What would happen if you combined RLOO's leave-one-out baseline with GRPO's std normalization? Are there known problems with this hybrid?
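
How the two baselines relate before any std normalization can be checked numerically. The sketch below uses made-up rewards and hypothetical helper names; it compares the mean-centred advantage with the leave-one-out advantage scaled by (G-1)/G.

```python
def mean_baseline_adv(rewards):
    # Baseline includes the sample itself (GRPO without the /std step).
    m = sum(rewards) / len(rewards)
    return [r - m for r in rewards]

def rloo_adv(rewards):
    # Baseline is the mean of the other G-1 samples.
    G, total = len(rewards), sum(rewards)
    return [r - (total - r) / (G - 1) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
G = len(rewards)
a_mean = mean_baseline_adv(rewards)
a_rloo = rloo_adv(rewards)
scaled = [(G - 1) / G * a for a in a_rloo]
print(a_mean)
print(scaled)
```

The two printed lists agree up to floating-point error, since r_i - mean(r) = (G-1)/G * (r_i - mean of the others).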


Q: DAPO's token-level loss divides the summed per-token loss by the total number of tokens $T$ in the batch, i.e., $L_{\text{token}}=\frac{1}{T}\sum_t \ell_t$. Does this in turn introduce an implicit bias toward "short correct responses"?

A: Not through the normalization itself; the direction is subtler than intuition suggests. With $T$ counted over all responses in the batch, every token carries the same weight $1/T$ whichever response it belongs to, so tokens in a short response are not upweighted. The per-token boost for short responses belongs to sequence-level averaging (GRPO's $1/|o_i|$): there a short correct response receives a larger per-token gradient and a long wrong response receives a diluted per-token penalty. Under token-level loss a response's total contribution instead scales with its length: long correct responses are reinforced more in total, and long wrong responses (for example, repetitive or rambling ones) are penalized more in total. The key factor is still the sign of the advantage: correct responses receive positive reinforcement, wrong responses receive negative penalization. With binary 0/1 rewards, a long and a short correct response earn the same reward, so any remaining pressure on length comes from which lengths tend to be correct and from overlong reward shaping rather than from the $1/T$ factor. Follow-up: If DAPO token-level loss is combined with outcome reward (only the final answer's correctness), how do "compress reasoning chain length" and "compress to correct answer" compete with each other?
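
The weighting question can be checked by computing the per-token weights each normalization implies. This is a small sketch with made-up lengths; `per_token_weights` is a hypothetical helper, where "sequence" means averaging within each response and then across responses, and "token" means a single average over all tokens in the batch.

```python
def per_token_weights(lengths, mode):
    n, total = len(lengths), sum(lengths)
    if mode == "sequence":        # weight of one token in a response of length L
        return [1.0 / (n * L) for L in lengths]
    if mode == "token":           # every token in the batch shares one weight
        return [1.0 / total for _ in lengths]
    raise ValueError(mode)

lengths = [10, 1000]              # one short and one long response
seq_w = per_token_weights(lengths, "sequence")
tok_w = per_token_weights(lengths, "token")
seq_total = [w * L for w, L in zip(seq_w, lengths)]   # total weight per response
tok_total = [w * L for w, L in zip(tok_w, lengths)]
print("sequence-level:", seq_w, seq_total)
print("token-level:   ", tok_w, tok_total)
```

Under sequence-level averaging each response has the same total weight, so a token in the short response weighs 100 times more. Under token-level averaging every token weighs the same, and the long response carries most of the total.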


Q: RLVR currently only works in verifiable domains (math/code). Can a process reward model (PRM) be used as a "soft verifier" and integrated into the GRPO/DAPO framework? What are the technical obstacles?

A: The idea is feasible in principle: use the PRM to score each step of the CoT, aggregate step-level scores into a sequence-level reward, and feed this into GRPO's within-group advantage computation. But there are three obstacles: ① PRM annotation and training — step-level human annotations or automated labels (e.g., Monte Carlo rollout estimation) are required, which is expensive and of limited accuracy; ② Reward alignment — the PRM scores "reasoning step quality," which may be inconsistent with final answer correctness (good steps but wrong answer vs. rough steps but correct answer), producing conflicting RL signals; ③ Temporal credit assignment — when aggregating step-level scores into a sequence reward, the choice of weighting scheme (mean? final step? worst step?) directly affects learning dynamics. A simple mean blurs the contribution of critical steps; using only the final step degenerates into outcome reward. Follow-up: Within the GRPO framework, is it possible to use different reward aggregation strategies (adaptive weighting) for different samples in the group, rather than a uniform scheme?
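
The aggregation choice in obstacle ③ can be made concrete with a toy example. The step scores below are invented and `aggregate_steps` is a hypothetical helper; a real PRM would supply the per-step scores.

```python
def aggregate_steps(step_scores, how="mean"):
    # Collapse per-step PRM scores into one sequence-level reward.
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    if how == "min":              # worst step dominates
        return min(step_scores)
    if how == "last":             # close to an outcome-style reward
        return step_scores[-1]
    raise ValueError(how)

# A chain with one badly flawed middle step (made-up scores)
scores = [0.9, 0.9, 0.1, 0.9]
r_mean = aggregate_steps(scores, "mean")
r_min = aggregate_steps(scores, "min")
r_last = aggregate_steps(scores, "last")
print(r_mean, r_min, r_last)
```

The same chain gets a moderate reward under the mean, a very low one under the minimum, and a high one under last-step, so the within-group ranking that GRPO sees can change with the aggregation rule.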


Q: In reasoning RL, 0/1 binary rewards are common (correct answer = 1, wrong = 0). When prompt difficulty varies widely, a group of samples may be all correct or all wrong. Beyond dynamic sampling (discarding such prompts), what methods can extract useful training signal from an "all-wrong group"?

A: The core problem with an all-wrong group is that all rewards are identical → advantage is always zero → zero gradient. Several approaches: ① Introduce process rewards — even if all final answers are wrong, the quality of intermediate reasoning steps may differ; use a PRM or auxiliary signals such as reasoning length and format compliance to create within-group variation; ② Mixed reward design — layer format rewards, reasoning completeness rewards, and other soft signals on top of the outcome reward so that even "all-wrong" groups can still distinguish better from worse responses; ③ Cross-prompt baseline — rather than restricting to within-group comparison, use a batch-level or moving-average global baseline to provide gradient direction even for all-wrong groups; ④ Difficulty bucketing + resampling — mark all-wrong prompts as "too hard," downsample their frequency without fully discarding them, avoiding a training-set bias toward easy problems. Each approach has different costs: process rewards require extra annotation or models; cross-prompt baselines may introduce high variance. Follow-up: How would a cross-prompt baseline (e.g., using an EMA global mean as the baseline) be implemented concretely within GRPO? How should the weight be tuned when mixing with the within-group baseline?
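
Approach ③ and the follow-up can be sketched as a baseline that mixes the within-group mean with an EMA of past group means. Everything here is a hypothetical design: the class name, `alpha`, `decay`, and the initial EMA value are assumptions, not a published recipe.

```python
class MixedBaseline:
    def __init__(self, alpha=0.5, decay=0.9, init=0.5):
        self.alpha = alpha        # weight on the within-group mean
        self.decay = decay        # EMA decay for the cross-prompt mean
        self.ema = init           # running estimate of the global mean reward

    def advantages(self, rewards):
        group_mean = sum(rewards) / len(rewards)
        baseline = self.alpha * group_mean + (1 - self.alpha) * self.ema
        adv = [r - baseline for r in rewards]
        # Update the cross-prompt estimate after using it.
        self.ema = self.decay * self.ema + (1 - self.decay) * group_mean
        return adv

bl = MixedBaseline()
all_wrong = bl.advantages([0.0, 0.0, 0.0, 0.0])   # no longer all zeros
mixed = bl.advantages([1.0, 0.0])
print(all_wrong, mixed, bl.ema)
```

With `alpha=1.0` this reduces to the plain within-group mean baseline. Lower values give all-wrong groups a uniformly negative advantage, which pushes down every sampled response without ranking them, so the extra signal is coarse.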


Q: How does the choice of group size $G$ in GRPO theoretically and practically affect training? What does GRPO degenerate into at $G=1$ and $G\to\infty$?

A: At $G=1$ there is only one sample, $\mathrm{mean}(r)=r_1$, and the advantage is always zero: no gradient at all, so GRPO is completely ineffective. At $G=2$ it degenerates into pairwise comparison, essentially an online version of pairwise preference learning. As $G\to\infty$ the within-group mean and standard deviation converge to the expectation and standard deviation of the reward for that prompt, so the baseline becomes a Monte Carlo estimate of the prompt's value $V(x)=\mathbb{E}[r\mid x]$; in theory GRPO then approaches REINFORCE with a per-prompt value baseline (plus std scaling). Practical trade-offs: $G$ too small → high baseline variance, noisy advantage estimates; $G$ too large → high sampling cost per prompt (inference cost grows linearly) and diversity may actually decrease (many similar responses in repeated sampling). DeepSeek's practice uses a moderate value of $G$. There is also coupling between $G$ and the clip range and learning rate: with larger $G$ the advantage estimate is more accurate, so more aggressive update steps are tolerable. Follow-up: Is there an adaptive-$G$ strategy that samples more from "hard" prompts and fewer from "easy" ones? How would this coordinate with dynamic sampling?
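
The small-G cases can be checked directly. This sketch uses a hypothetical `group_zscore` helper with the sample (G-1) standard deviation and guards G<2 explicitly, since the sample std of a single value is undefined.

```python
import math

def group_zscore(rewards, eps=1e-6):
    G = len(rewards)
    m = sum(rewards) / G
    if G < 2:
        return [0.0] * G          # centred reward is zero anyway
    std = math.sqrt(sum((r - m) ** 2 for r in rewards) / (G - 1))
    return [(r - m) / (std + eps) for r in rewards]

single = group_zscore([0.7])              # G=1: no signal
pair_a = group_zscore([1.0, 0.0])         # G=2, reward gap of 1
pair_b = group_zscore([5.0, 3.0])         # G=2, reward gap of 2
tie = group_zscore([1.0, 1.0])            # G=2, identical rewards
print(single, pair_a, pair_b, tie)
```

At G=2 the z-score is roughly ±0.707 whatever the reward gap, so only the ordering of the pair survives, which is the pairwise-comparison behavior described above.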


Q: In the self-rewarding paradigm the model generates preference data using its own judgments and trains iteratively. From the perspective of online learning theory, under what conditions does this self-play converge, and under what conditions does it mode-collapse?

A: The core condition for convergence is that the reward signal must continuously provide effective discrimination — the model must be able to distinguish the quality of its own outputs. When model capability is far below task difficulty, self-judgment is noisy but not systematically biased; training may be slow but won't collapse. Typical triggers for mode collapse: ① Amplified anchoring effect — the model prefers responses that resemble its own style; a positive feedback loop continuously reduces output diversity until it collapses to a narrow "self-preferred" mode; ② Judgment saturation — when model outputs become too similar in quality, the reward signal degrades to noise and training loses direction; ③ Reward hacking its own judge — the model learns to "convince" its own judge rather than genuinely improving, analogous to Goodhart's Law applied to itself. Mitigations include: retaining an SFT anchor, periodically introducing external verification signals, and limiting the number of iterations. Follow-up: If a fixed external verifier (e.g., code unit tests) is introduced into the self-rewarding loop as an "anchor" to calibrate self-scoring, to what extent can it prevent mode collapse?


Q: DAPO's clip-higher raises the upper clip bound to give low-probability tokens room to increase. From an information geometry perspective, why does standard PPO's symmetric clip systematically suppress valuable low-probability reasoning tokens in long CoT?

A: Standard PPO's symmetric clip $[1-\epsilon, 1+\epsilon]$ acts on the importance ratio $\rho=\pi_\theta/\pi_{\theta_{old}}$, so it bounds relative change, not absolute change. The key insight: in long CoT, already high-probability tokens (e.g., common reasoning connectives) keep $\rho$ close to 1 and are barely affected by the symmetric clip; but newly learned low-probability reasoning patterns (e.g., specific token sequences for backtracking or self-correction) start at very low probability, and when the advantage is positive ($A>0$), $\rho$ rises from 1 and hits the upper bound $1+\epsilon$ after only a tiny absolute gain. With $\epsilon=0.2$, a token at $\pi_{\theta_{old}}=0.01$ stops receiving gradient once it reaches $0.012$, while a token at $0.9$ can move all the way to $1.0$. Meanwhile, when $A<0$, the lower bound $1-\epsilon$ equally limits the decline of high-probability tokens, but those tokens have ample absolute room to decrease anyway. Therefore the symmetric clip creates an asymmetric learning dynamic in information-geometric terms: the "downward channel" for high-probability tokens is wider than the "upward channel" for low-probability tokens. Clip-higher breaks this asymmetry by raising the upper bound. Follow-up: Apart from decoupling the clip bounds, is it possible to address this problem more elegantly from a trust-region perspective (e.g., using a KL constraint instead of a hard clip)? What is the computational cost of doing so in the long-CoT setting?
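
The asymmetry is easy to see with a little arithmetic. The sketch below computes the highest probability a token can reach before the upper clip bound cuts off its gradient, relative to a fixed old policy; the probabilities and epsilon values are illustrative only.

```python
def max_prob_before_clip(p_old, eps_high):
    # rho = p_new / p_old is capped at 1 + eps_high (and p_new at 1.0).
    return min(1.0, p_old * (1.0 + eps_high))

for p_old in (0.01, 0.9):
    for eps_high in (0.2, 0.28):
        cap = max_prob_before_clip(p_old, eps_high)
        print(f"p_old={p_old:<5} eps_high={eps_high:<5} cap={cap:.4f} "
              f"abs_gain={cap - p_old:.4f}")
```

A rare token gains at most a few thousandths of probability mass per old-policy snapshot, while a common token can reach 1.0. The cap resets whenever the old policy is refreshed, so rare tokens can still rise over many updates, only slowly.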

§A Key Papers Timeline