LLM Post-Training Complete Reference Cheatsheet

Intended use: Interview preparation for LLM Post-Training Research Intern roles and everyday reference.

Language: English, with key technical terms preserved as-is.


Part 1 — Core Concepts & Formula Derivations


1. Pre-training vs Post-training Overview

| Dimension | Pre-training | Post-training |
| --- | --- | --- |
| Data scale | Trillions of tokens, web-crawled corpora | Thousands to millions of high-quality annotated/preference examples |
| Objective | Next-token prediction; acquiring language and world knowledge | Instruction following + alignment to human preferences + enhanced reasoning |
| Loss | $L = -\sum_t \log p_\theta(x_t \mid x_{<t})$ | SFT loss + RLHF/DPO/GRPO objectives |
| Learning rate | High (order of 1e-4), cosine annealing | Low (5e-6 to 1e-5), to prevent forgetting |
| Hardware | Thousands of GPUs, training for weeks to months | Hundreds of GPUs, training for hours to days |
| Output | Base model (highly capable but uncontrolled) | Instruct / Chat model (controllable, safe, helpful) |

Standard 5-Step Pipeline:

  1. SFT (Supervised Fine-Tuning): Train on high-quality (instruction, response) pairs to turn the base model into an "instruction assistant."
  2. Reward Model Training: Train a scoring model RM using human preference comparison data (two responses to the same prompt + human preference labels).
  3. RLHF / PPO: Reinforcement learning using RM feedback, with a KL constraint to prevent diverging too far from the SFT model.
  4. DPO (Offline Alternative): Bypasses the explicit RM; directly optimizes the policy from preference data, achieving simpler and more stable alignment.
  5. Iterative Loop: Current policy samples new data → new preference labels → update RM → RL again, repeated over multiple rounds.

2. SFT Data Format & Loss Masking

Chat Template (ChatML format example):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user instruction}<|im_end|>
<|im_start|>assistant
{assistant response}<|im_end|>

Different models have their own templates (Llama-2 uses [INST], Llama-3 uses <|start_header_id|>, Qwen uses <|im_start|>). Training and inference must use the same template; otherwise a distribution shift occurs.

Loss Masking:

The training objective of SFT is to teach the model "how to answer," not "how to repeat the question." Cross-entropy loss is computed only at assistant token positions:

$$L_{SFT} = -\frac{1}{|A|} \sum_{t \in A} \log p_\theta(x_t \mid x_{<t})$$

where $A$ is the set of all assistant token positions. Labels for user / system tokens are set to -100 (the default ignore_index of PyTorch's CrossEntropyLoss, so those positions are skipped).

Multi-turn Loss Masking: The user turn of every round is masked; only the assistant turn of each round contributes to the loss.
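As a minimal sketch of the masking rule above: the role names, the token IDs, and the helper build_labels are illustrative assumptions, not taken from any specific framework. It assumes labels are aligned with input_ids and that the next-token shift happens later, inside the loss computation.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(turns):
    """turns: list of (role, token_ids) in dialogue order.

    Returns (input_ids, labels) where only assistant tokens keep their id
    as a label; every other position is masked with IGNORE_INDEX.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                        # supervised
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # masked
    return input_ids, labels


# Hypothetical two-round dialogue with made-up token ids.
turns = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
    ("user", [8]),
    ("assistant", [9, 10, 11]),
]
input_ids, labels = build_labels(turns)
```

Padding positions would be masked the same way, which is one reason the label mask has to be set independently of the pad token's identity.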

Trade-off for including the system prompt in the loss:

2.1 Common tokenization / chat-template pitfalls (SFT engineering screening questions)

These are the most common failure modes in SFT engineering and are frequently tested in interviews (each: problem → fix):

Self-test (L2): Why does pad_token=eos_token break SFT if attention_mask and label mask are not set correctly? In multi-turn dialogue, how should labels for pad and prompt positions be handled?


3. Sequence Packing

Definition: Concatenate multiple short samples into a single sequence of length equal to the context window, adding EOS / separator tokens only at sample boundaries, thereby eliminating padding waste.

GPU utilization: Without packing, padding can account for 30–60% of tokens; with packing, nearly 100% of tokens are valid. By token count alone that corresponds to roughly 1.4–2.5× higher throughput ($1/(1-0.3)$ to $1/(1-0.6)$); the actual speedup depends on the length distribution.

Pitfall 1 (cross-sample attention contamination): Without a document-level attention mask, tokens of a later sample in a packed sequence can attend to tokens of earlier samples (under a causal mask), so context leaks across sample boundaries. Solution: use the cu_seqlens arguments of FlashAttention's variable-length kernels, which take the cumulative sequence lengths of the samples within the packed sequence and restrict attention to within each sample.

Pitfall 2 (loss weight imbalance): Packing implicitly weights by token count (longer samples produce more loss terms). If the original objective averaged over samples, the packing objective differs semantically; consider whether length normalization of the loss is needed.

cu_seqlens example (3 samples with lengths 5, 3, 7):

cu_seqlens = [0, 5, 8, 15]  # cumulative lengths
packed_ids = [s1_tok1, ..., s1_tok5, s2_tok1, ..., s2_tok3, s3_tok1, ..., s3_tok7]
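To make the boundary rule concrete, here is a small pure-Python sketch that derives cu_seqlens from sample lengths and builds the equivalent block-diagonal causal mask. The helper names make_cu_seqlens and doc_causal_mask are hypothetical; fused kernels never materialize this mask, it is shown only to illustrate which positions may attend to which.

```python
from itertools import accumulate


def make_cu_seqlens(lengths):
    """Cumulative sample boundaries, starting at 0."""
    return [0] + list(accumulate(lengths))


def doc_causal_mask(cu_seqlens):
    """mask[i][j] is True iff query i may attend to key j."""
    total = cu_seqlens[-1]
    doc_id = []  # doc_id[p] = index of the sample that position p belongs to
    for k in range(len(cu_seqlens) - 1):
        doc_id.extend([k] * (cu_seqlens[k + 1] - cu_seqlens[k]))
    # Allowed only within the same sample and only to the past (causal).
    return [[doc_id[i] == doc_id[j] and j <= i for j in range(total)]
            for i in range(total)]


cu = make_cu_seqlens([5, 3, 7])
mask = doc_causal_mask(cu)
```

For example, position 5 (first token of sample 2) cannot attend to position 4 (last token of sample 1), while position 7 can attend to position 5.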

4. RLHF Full Pipeline / PPO in RLHF

RLHF (Reinforcement Learning from Human Feedback) three stages:

  1. SFT: Establish the initial policy $\pi_{ref}$ (reference policy, which serves as the baseline for the KL penalty).
  2. RM Training: Fit a scalar reward function $r(x,y)$ from human preference comparison data using the Bradley-Terry model.
  3. PPO Optimization: Maximize the augmented reward with a KL constraint.

Augmented Reward:

$$r_{total}(x,y) = r_{RM}(x,y) - \beta \cdot \text{KL}\!\left(\pi_\theta(\cdot|x) \| \pi_{ref}(\cdot|x)\right)$$

PPO Clipped Objective:

$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio, and $\varepsilon$ is typically 0.1–0.2.

GAE Advantage Estimation:

$$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Here $r_t$ denotes the per-step reward, not the probability ratio $r_t(\theta)$. $\lambda$ controls the bias-variance trade-off: $\lambda=1$ gives high variance, low bias; $\lambda=0$ degenerates to one-step TD (low variance, high bias).

Recurrence (backward sweep, $O(T)$): $\hat{A}_t=\delta_t+\gamma\lambda\,\hat{A}_{t+1}$, with $\hat{A}_T=\delta_T$.
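The backward sweep can be sketched in a few lines of plain Python. The function name compute_gae and its defaults are assumptions for illustration; values is expected to carry one extra bootstrap entry (0.0 if the episode terminates), and returns is the un-normalized value target advantage + value.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """rewards: per-step rewards, length N.
    values:  value estimates, length N + 1 (last entry is the bootstrap
             value of the state after the final step; 0.0 if terminal).
    Returns (advantages, returns), each of length N.
    """
    n = len(rewards)
    advantages = [0.0] * n
    next_adv = 0.0
    for t in reversed(range(n)):                      # single O(N) backward pass
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_adv = delta + gamma * lam * next_adv     # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = next_adv
    returns = [a + v for a, v in zip(advantages, values[:n])]
    return advantages, returns


# Toy episode: sparse reward at the last step, constant value estimates.
adv_mc, ret_mc = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
adv_td, _ = compute_gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=0.0)
```

With lam=1.0 every step receives the full Monte Carlo advantage; with lam=0.0 only the final step sees a non-zero one-step TD error.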

4 Models Required by PPO (source of memory pressure):

| Model | Role | Updated? |
| --- | --- | --- |
| Actor (policy $\pi_\theta$) | The LLM policy being optimized | Yes (PPO gradient) |
| Critic (value model) | Estimates $V(s_t)$, used to compute the advantage | Yes (TD error) |
| Reference ($\pi_{ref}$) | KL penalty baseline, i.e., the SFT model | No (frozen) |
| Reward Model (RM) | Scores $(x,y)$ pairs | No (frozen) |

From-scratch implementation (clipped policy loss + clipped value loss + entropy bonus + approx_kl monitoring):

import torch

def ppo_loss(logp, logp_old, values, values_old, returns, advantages, entropy,
             clip_eps=0.2, vf_clip=0.2, vf_coef=0.5, ent_coef=0.0):
    # logp/logp_old: (B,) logprob of the taken action under current/behavior policy
    # advantages: normalized GAE advantage (for policy loss); returns: un-normalized GAE target (raw_adv + values_old, for value loss); both from upstream
    ratio = torch.exp(logp - logp_old)                      # importance ratio ρ_t
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    pg_loss = -torch.min(surr1, surr2).mean()               # clipped policy loss

    v_clip = values_old + torch.clamp(values - values_old, -vf_clip, vf_clip)
    vf_loss = 0.5 * torch.max((values - returns) ** 2,
                              (v_clip - returns) ** 2).mean()  # clipped value loss

    loss = pg_loss + vf_coef * vf_loss - ent_coef * entropy.mean()  # entropy bonus

    with torch.no_grad():                                   # diagnostics only
        logr = logp - logp_old                              # log(π_new/π_old)
        approx_kl = (torch.exp(logr) - 1 - logr).mean()     # K3: KL(π_old‖π_new) probe
        clip_frac = ((ratio - 1).abs() > clip_eps).float().mean()
    return loss, {"pg": pg_loss, "vf": vf_loss, "approx_kl": approx_kl, "clip_frac": clip_frac}

5. Bradley-Terry Reward Model

Bradley-Terry preference model: Given prompt $x$, the probability that the better response $y_w$ is preferred over the worse response $y_l$ is:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(r(x,y_w) - r(x,y_l)\right)$$

RM training loss (maximizing log likelihood of preference data):

$$L_{RM} = -\mathbb{E}_{(x,y_w,y_l)}\!\left[\log \sigma\!\left(r(x,y_w) - r(x,y_l)\right)\right]$$
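A minimal numeric sketch of this pairwise loss, assuming the scalar rewards for chosen and rejected responses have already been produced by some reward head (the function names are illustrative):

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bt_rm_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigma(r_w - r_l) over a batch of preference pairs."""
    losses = [-log_sigmoid(rw - rl)
              for rw, rl in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)


tied = bt_rm_loss([0.0], [0.0])       # no separation: loss is log 2
correct = bt_rm_loss([2.0], [0.0])    # chosen scored higher: small loss
inverted = bt_rm_loss([0.0], [2.0])   # chosen scored lower: large loss
```

Only the reward difference enters the loss, so adding a constant to both rewards leaves it unchanged.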

RM architecture:

Key risks:



6. DPO (Direct Preference Optimization) Full Derivation

6.1 Starting from the RLHF Objective

The KL-constrained RLHF optimization objective:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot|x)}\!\big[r(x, y)\big] \;-\; \beta \cdot \mathrm{KL}\!\big(\pi(\cdot|x) \;\|\; \pi_{\mathrm{ref}}(\cdot|x)\big)$$

where:

Expanding the KL divergence:

$$\max_{\pi} \; \mathbb{E}_{x,y}\!\big[r(x,y)\big] - \beta \sum_{y} \pi(y|x)\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$

Adding a Lagrange multiplier $\lambda(x)$ for the normalization constraint $\sum_y \pi(y|x)=1$, taking the derivative with respect to $\pi(y|x)$ for each $y$, and setting it to zero:

$$r(x,y) - \beta\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} - \beta - \lambda(x) = 0$$

Note: the terms $\beta + \lambda(x)$ do not depend on $y$; they are absorbed into the normalizer $Z(x) = \exp\!\big(1 + \lambda(x)/\beta\big)$.

6.2 Closed-Form Optimal Policy

Solving yields the optimal policy:

$$\boxed{\pi^*(y|x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y|x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right)}$$

where the partition function is:

$$Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y|x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right)$$

$Z(x)$ ensures $\sum_y \pi^*(y|x) = 1$, i.e., the policy is a valid probability distribution.
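A quick numeric sanity check of the closed form on a toy three-response problem. The numbers and helper names are made up for illustration. It checks that the tilted distribution is normalized, that it beats other candidate policies on the KL-regularized objective, and that the optimum equals $\beta \log Z(x)$, which follows from substituting $\pi^*$ back into the objective.

```python
import math


def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y)/beta) / Z for a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


def objective(pi, pi_ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl


pi_ref = [0.5, 0.3, 0.2]    # toy reference policy over three responses
rewards = [1.0, 2.0, 0.0]   # toy rewards
beta = 0.5
pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)
```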

6.3 Inverting for the Reward

Taking log of both sides of the optimal policy:

$$\log \pi^*(y|x) = \log \pi_{\mathrm{ref}}(y|x) + \frac{r(x,y)}{\beta} - \log Z(x)$$

Rearranging:

$$\boxed{r(x,y) = \beta \log\frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)}$$

Key insight: the reward can be expressed via the log-ratio of policy to reference, eliminating the explicit reward model.

6.4 Substituting into Bradley-Terry

Human preference model:

$$p(y_w \succ y_l \mid x) = \sigma\!\big(r(x, y_w) - r(x, y_l)\big)$$

where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function.

Substituting the inverted reward:

$$r(x,y_w) - r(x,y_l) = \beta\log\frac{\pi^*(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\log\frac{\pi^*(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} + \cancel{\beta\log Z(x)} - \cancel{\beta\log Z(x)}$$

The $Z(x)$ terms cancel exactly, because both responses share the same partition function for the same prompt $x$.

$$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta\log\frac{\pi^*(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\log\frac{\pi^*(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)$$

6.5 DPO Loss Function

Replacing $\pi^*$ with parameterized $\pi_\theta$, taking negative log-likelihood:

$$\boxed{\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right]}$$

Implicit reward defined as:

$$\hat{r}_\theta(x, y) \triangleq \beta \log\frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$

The loss simplifies to:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\!\big[\log\sigma\!\big(\hat{r}_\theta(x,y_w) - \hat{r}_\theta(x,y_l)\big)\big]$$
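In code, the loss reduces to a few operations on summed sequence log-probabilities. The sketch below assumes those log-probs (response tokens only) have already been computed for both the policy and the frozen reference; the function name and argument layout are illustrative, not a specific library's API.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Each argument is a list of summed sequence log-probs, one per pair.

    Returns (mean loss, chosen implicit rewards, rejected implicit rewards).
    """
    r_w = [beta * (p - r) for p, r in zip(policy_w, ref_w)]   # r_hat(x, y_w)
    r_l = [beta * (p - r) for p, r in zip(policy_l, ref_l)]   # r_hat(x, y_l)
    losses = [-log_sigmoid(a - b) for a, b in zip(r_w, r_l)]
    return sum(losses) / len(losses), r_w, r_l


# At initialization the policy equals the reference, so the loss is log 2.
loss_init, _, _ = dpo_loss([-10.0], [-12.0], [-10.0], [-12.0])
# After the margin widens relative to the reference, the loss drops.
loss_later, rw, rl = dpo_loss([-8.0], [-14.0], [-10.0], [-12.0])
```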

6.6 Gradient Analysis

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\,\mathbb{E}\!\left[\underbrace{\sigma(-\hat{r}_\theta(x,y_w)+\hat{r}_\theta(x,y_l))}_{\text{weight: larger gradient when the model is more wrong}}\!\left(\nabla_\theta\log\pi_\theta(y_w|x) - \nabla_\theta\log\pi_\theta(y_l|x)\right)\right]$$

6.7 DPO Advantages & Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| No separate RM training needed | Offline algorithm: can only use the static dataset $\mathcal{D}$; no online exploration |
| No online rollout needed | Distribution mismatch: training signal degrades as $\pi_\theta$ drifts from the data collection policy |
| Simplified pipeline, single optimization pass | Imprecise rejection: rejects entire responses globally rather than correcting step by step |
| More stable than PPO | Sensitive to preference data quality |
| Theoretically equivalent to RLHF (with sufficient data) | $Z(x)$ cancellation depends on the correctness of the BT model assumption |

6.8 Likelihood Displacement: chosen log-prob also decreases

Phenomenon

Intuitively, DPO training should increase the model's probability for the chosen response $y_w$ and decrease it for the rejected response $y_l$. However, Razin et al. (arXiv:2410.08847) and Pal et al. (arXiv:2402.13228) both observe that $\log\pi_\theta(y_w|x)$ and $\log\pi_\theta(y_l|x)$ tend to decrease simultaneously during training: the loss decreases only because $y_l$ decreases faster, widening the margin between them, while the absolute probability of the chosen response shrinks.

"While intuitively these methods should increase the probability of $y^+$ while decreasing that of $y^-$, several recent works observed that the probabilities of both $y^+$ and $y^-$ tend to decrease over the course of training." (Razin et al., arXiv:2410.08847)

Gradient Mechanism

The DPO loss only requires the log-prob margin (measured relative to the reference model) to widen:

$$\mathcal{L}_\text{DPO} = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]$$

$\log\pi_\theta(y_w|x)$ itself has no lower-bound constraint: as long as $y_l$ decreases faster, the objective is satisfied. Razin et al. (Theorem 1/3) note that when the hidden representations of $y_w$ and $y_l$ are similar (high CHES score), the gradient direction that suppresses $y_l$ simultaneously suppresses $y_w$, and probability mass shifts to tokens semantically opposite to $y_w$, forming "unintentional unalignment" (the phrase used in Razin et al.).

Danger Conditions

Detection

Record both chosen_logps_mean and rejected_logps_mean throughout training (most training frameworks already log these). If the chosen mean keeps falling below its value under the reference model while the margin still widens, displacement is occurring.
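A minimal monitoring sketch, assuming per-example summed log-probs for a logging batch are available for the policy and, for the chosen responses, also for the reference model. The function name and the keys of the returned dict are hypothetical.

```python
def displacement_report(chosen_logps, rejected_logps, ref_chosen_logps):
    """Summarize a logging batch; all inputs are lists of equal length."""
    n = len(chosen_logps)
    chosen_mean = sum(chosen_logps) / n
    rejected_mean = sum(rejected_logps) / n
    ref_chosen_mean = sum(ref_chosen_logps) / n
    return {
        "chosen_logps_mean": chosen_mean,
        "rejected_logps_mean": rejected_mean,
        "margin": chosen_mean - rejected_mean,
        # Negative shift: the chosen responses became less likely than
        # they were under the reference model.
        "chosen_shift_vs_ref": chosen_mean - ref_chosen_mean,
        "displaced": chosen_mean < ref_chosen_mean,
    }


# Toy numbers: the margin is large, yet the chosen log-prob has dropped.
report = displacement_report([-12.0, -9.0], [-20.0, -18.0], [-10.0, -8.0])
```

Tracking the shift over training steps, rather than at a single step, is what distinguishes a persistent trend from batch noise.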

Mitigation

(A) DPOP (Pal et al., arXiv:2402.13228): Adds a penalty term inside the DPO loss that directly discourages the chosen log-prob from falling below the reference model:

$$\mathcal{L}_\text{DPOP} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} - \lambda\cdot\max\!\left(0,\log\frac{\pi_\text{ref}(y_w|x)}{\pi_\theta(y_w|x)}\right)\right)\right]$$

The $\max(0,\cdot)$ term with $\lambda > 0$: when $\log\pi_\theta(y_w|x) < \log\pi_\text{ref}(y_w|x)$, a penalty is applied that "anchors" the chosen log-prob at or above the reference model.
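A per-pair sketch of the loss exactly as written above, with the penalty weighted by $\lambda$ alone. The default values of beta and lam are placeholders chosen for illustration, not recommendations from the paper.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpop_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """Single-pair DPOP loss from summed sequence log-probs."""
    margin = beta * (policy_w - ref_w) - beta * (policy_l - ref_l)
    # Penalty is active only when the chosen log-prob is below the reference.
    penalty = lam * max(0.0, ref_w - policy_w)
    return -log_sigmoid(margin - penalty)


above_ref = dpop_loss(-9.0, -14.0, -10.0, -12.0)   # chosen above reference
below_ref = dpop_loss(-11.0, -14.0, -10.0, -12.0)  # chosen below reference
```

When the chosen log-prob stays above the reference, the penalty vanishes and the value coincides with the plain DPO loss for the same margin.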

(B) CHES data filtering (Razin et al., arXiv:2410.08847): Filter out preference pairs where the representations of $y_w$ and $y_l$ are highly similar, cutting the gradient coupling path at the data level.

(C) SimPO: Uses length-normalized $\frac{1}{|y|}\log\pi_\theta(y|x)$ as the implicit reward and removes $\pi_\text{ref}$; the reward definition is directly aligned with generation-time likelihood, which by design weakens the driving force behind displacement (though the SimPO paper itself does not directly analyze this issue using the Razin/Pal framework).

Note: The three mitigation approaches above come from different papers and should not be cross-attributed. The $\max$ regularization term in DPOP is from Pal et al.; CHES filtering is from Razin et al.; the connection between SimPO and displacement comes from downstream work and is provided here for reference only, so it must not be attributed to either the Razin or Pal papers.


7. DPO Variants Comparison

7.1 IPO (Identity Preference Optimization / $\Psi$PO with $\Psi = \text{Id}$)

DPO problem addressed: DPO uses $\Psi(q) = \log(q/(1-q))$ (the logit function, corresponding to Bradley-Terry). When preferences approach certainty ($p^*(y_w \succ y_l) \to 1$), the logit tends to $+\infty$, driving $\pi^*(y_l) \to 0$ regardless of the KL penalty coefficient $\tau$. KL regularization thus becomes ineffective under strong preferences, and the policy overfits the preference data.

Core change: Under the $\Psi$PO framework, replace $\Psi$ with the identity mapping: for an input preference probability $p\in[0,1]$, $\Psi(p)=p$ does not diverge as $p\to 1$ the way DPO's logit mapping does. The resulting empirical loss (Azar et al., arXiv:2310.12036, Eq. 17) is a squared-loss regression:

$$\mathcal{L}_{\text{IPO}}(\pi) = \mathbb{E}_{(y_w, y_l, x) \sim \mathcal{D}} \left[ \left( h_\pi(y_w, y_l, x) - \frac{1}{2\tau} \right)^2 \right]$$

where $h_\pi(y, y', x) = \log \dfrac{\pi(y|x)\,\pi_{\text{ref}}(y'|x)}{\pi(y'|x)\,\pi_{\text{ref}}(y|x)}$ is the log-ratio difference of the policy relative to the reference (logit margin); the target constant is $\frac{1}{2\tau}$, with $\tau$ the KL regularization strength.
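The regression view is easy to see numerically: the loss is zero when the margin $h_\pi$ equals $1/(2\tau)$ and grows symmetrically when the margin undershoots or overshoots it. A single-pair sketch on summed log-probs (the function name is illustrative):

```python
def ipo_loss(policy_w, policy_l, ref_w, ref_l, tau=0.1):
    """Single-pair IPO loss: (h - 1/(2*tau))**2."""
    # h = log[pi(y_w) pi_ref(y_l) / (pi(y_l) pi_ref(y_w))]
    h = (policy_w - policy_l) - (ref_w - ref_l)
    return (h - 1.0 / (2.0 * tau)) ** 2


on_target = ipo_loss(-10.0, -15.0, -12.0, -12.0)   # h = 5 = 1/(2*tau)
undershoot = ipo_loss(-10.0, -10.0, -12.0, -12.0)  # h = 0
overshoot = ipo_loss(-10.0, -20.0, -12.0, -12.0)   # h = 10
```

Unlike the DPO loss, pushing the margin beyond the target is penalized, which is what keeps the solution anchored to the reference.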

Citation: Mohammad Gheshlaghi Azar et al., arXiv:2310.12036 (Google DeepMind, 2023)

"IPO, unlike DPO, always regularizes its solution towards $\pi_\text{ref}$ by controlling the gap between the log-likelihood ratios, thus avoiding the over-fitting to the preference dataset." (Azar et al., Section 5.2)

Properties:

7.2 KTO (Kahneman-Tversky Optimization)

DPO problems addressed: (1) DPO requires paired preference data $(x, y_w, y_l)$, whereas in practice often only pointwise positive/negative feedback (thumbs-up/down) is available; paired data is expensive and scarce. (2) DPO maximizes preference log-likelihood, which is a proxy for the true objective of "maximizing generation utility," resulting in objective mismatch.

Loss (complete form):

$$\mathcal{L}_{\text{KTO}}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}_{x,y \sim \mathcal{D}}\bigl[w(y)\bigl(1 - v_{\text{KTO}}(x,y;\beta)\bigr)\bigr]$$

where the implicit reward, KL baseline, and value function are defined as:

$$r_{\text{KTO}}(x,y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

$$z_{\text{ref}} = \mathbb{E}_{x' \sim \mathcal{D}}\bigl[\beta\,\mathrm{KL}(\pi_\theta(y' \mid x') \,\|\, \pi_{\text{ref}}(y' \mid x'))\bigr]$$

$$v_{\text{KTO}}(x,y;\beta) = \begin{cases} \sigma(r_{\text{KTO}}(x,y) - z_{\text{ref}}) & \text{if } y \sim y_{\text{desirable}} \mid x \\ \sigma(z_{\text{ref}} - r_{\text{KTO}}(x,y)) & \text{if } y \sim y_{\text{undesirable}} \mid x \end{cases}$$

$$w(y) = \begin{cases} \lambda_D & \text{if desirable} \\ \lambda_U & \text{if undesirable} \end{cases}$$

Role and implementation of $z_\text{ref}$ (KL Baseline)

$z_\text{ref}$ is the ($\beta$-scaled) expected KL divergence of the current policy relative to the reference model. In prospect theory it serves as the reference point: rewards above this point are "gains" and rewards below it are "losses," and the sigmoid is concave on the gain side (risk aversion) and convex on the loss side (risk seeking).

In practice, $z_\text{ref}$ is estimated within each mini-batch (size $m$) using mismatched $(x', y'_U)$ pairs; the factor $\beta$ keeps the estimate on the same scale as the definition of $z_\text{ref}$ above:

$$\hat{z}_{\text{ref}} = \beta\,\max\!\left(0,\,\frac{1}{m}\sum_i \log\frac{\pi_\theta(y'_{U,i} \mid x'_i)}{\pi_{\text{ref}}(y'_{U,i} \mid x'_i)}\right)$$

Prompt $x'$ is deliberately paired with an unrelated output $y'_U$ to avoid conflating the reward signal with the baseline estimate. Gradients do not propagate through the $z_\text{ref}$ term.
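The pieces can be put together in a short per-example sketch. It assumes summed sequence log-probs are available for the policy and reference, treats the baseline as a plain number (no gradient), and scales it by $\beta$ so that it matches the definition of $z_\text{ref}$ as a $\beta$-weighted KL. All function names and default weights are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def estimate_z_ref(policy_logps, ref_logps, beta):
    """Batch estimate from mismatched (x', y'_U) pairs, clamped at 0."""
    m = len(policy_logps)
    mean_log_ratio = sum(p - r for p, r in zip(policy_logps, ref_logps)) / m
    return beta * max(0.0, mean_log_ratio)


def kto_loss(policy_logp, ref_logp, desirable, z_ref,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Single-example KTO loss w(y) * (1 - v_KTO)."""
    r = beta * (policy_logp - ref_logp)          # implicit reward r_KTO
    if desirable:
        return lambda_d * (1.0 - sigmoid(r - z_ref))
    return lambda_u * (1.0 - sigmoid(z_ref - r))


z_neg = estimate_z_ref([-5.0, -6.0], [-4.0, -5.0], beta=0.1)   # clamped
z_pos = estimate_z_ref([-4.0, -5.0], [-5.0, -7.0], beta=0.1)
neutral = kto_loss(-10.0, -10.0, True, z_ref=0.0)
good_up = kto_loss(-8.0, -10.0, True, z_ref=0.0)     # desirable made likelier
bad_up = kto_loss(-8.0, -10.0, False, z_ref=0.0)     # undesirable made likelier
```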

Prospect Theory Mapping

$v_\text{KTO}$ approximates the Kahneman-Tversky S-shaped value function with a logistic function (the original power-law form is hard to optimize directly): the sign flip (desirable branch $r - z_\text{ref}$; undesirable branch $z_\text{ref} - r$) mirrors the "gain vs. loss" frame switch, and the asymmetric weights $\lambda_D / \lambda_U$ correspond to loss aversion.

"KTO only requires a binary signal of whether an output is (un)desirable for a given input. This data is much more abundant, cheaper, and faster to collect in the real world than preferences." — Ethayarajh et al., arXiv:2402.01306

Citation: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela — arXiv:2402.01306 (2024)

Properties:

7.3 ORPO (Odds Ratio Preference Optimization)

DPO problem addressed: DPO requires two-stage training (first SFT, then preference optimization) and requires maintaining a reference model $\pi_{\mathrm{ref}}$.

Core change: Directly attach an odds-ratio preference loss on top of the SFT cross-entropy loss (unified SFT + odds-ratio loss):

$$\mathcal{L}_{\mathrm{ORPO}} = \underbrace{\mathcal{L}_{\mathrm{SFT}}(y_w)}_{\text{SFT on chosen}} + \lambda \cdot \underbrace{\left[-\log\sigma\!\left(\log\frac{\mathrm{odds}_\theta(y_w|x)}{\mathrm{odds}_\theta(y_l|x)}\right)\right]}_{\text{odds ratio preference}}$$

where the odds are defined as:

$$\mathrm{odds}_\theta(y|x) = \frac{p_\theta(y|x)}{1 - p_\theta(y|x)}$$
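A single-pair sketch of this objective in log space. It assumes, for illustration, that $p_\theta(y|x)$ is represented by a length-averaged log-probability (so it is strictly negative and not vanishingly small), and that the SFT term is the corresponding negative log-likelihood of the chosen response. The function names and the default lam are placeholders.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_odds(logp):
    """log(p / (1 - p)) computed from log p; requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    """SFT term on the chosen response plus the weighted odds-ratio term."""
    sft = -avg_logp_w
    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    return sft + lam * (-log_sigmoid(log_odds_ratio))


even = log_odds(math.log(0.5))                       # p = 0.5 gives odds of 1
tied = orpo_loss(math.log(0.5), math.log(0.5))
separated = orpo_loss(math.log(0.6), math.log(0.2))
no_gap = orpo_loss(math.log(0.6), math.log(0.6))
```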

Citation: Jiwoo Hong, Noah Lee, James Thorne — arXiv:2403.07691 (2024)

"In contrast to previous works, our approach requires neither an SFT warm-up stage nor a reference model, enabling resource-efficient development of preference-based aligned models." — Hong et al., arXiv:2403.07691

Properties:

7.4 SimPO (Simple Preference Optimization)

DPO problems addressed: (1) There is a divergence between DPO's implicit reward $\beta\log(\pi_\theta/\pi_\text{ref})$ and the metric that actually guides generation (length-normalized likelihood). Meng et al. note that in UltraFeedback triplets, the proportion of cases where the DPO reward ranking is satisfied but the log-likelihood ranking is reversed approaches half, meaning the model can "win the loss" while making the chosen response harder to generate. (2) Unnormalized sequence log-probabilities shrink as length grows, so without length normalization the reward is confounded with response length, introducing length bias. (3) Maintaining a frozen $\pi_\text{ref}$ incurs memory and compute overhead.

Core change: Replace the implicit reward with a length-normalized sequence-level average log-probability, and introduce an explicit target margin $\gamma > 0$ in the Bradley-Terry objective:

$$r_{\text{SimPO}}(x, y) = \frac{\beta}{|y|}\log\pi_\theta(y|x)$$

$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w|x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l|x) - \gamma\right)\right]$$

where $|y|$ is the token count, $\beta$ is a scaling constant, and $\gamma > 0$ requires the chosen reward to exceed the rejected reward by at least $\gamma$ (not merely be larger). The loss does not contain $\pi_\text{ref}$.
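A single-pair sketch of the SimPO loss from summed log-probs and token counts. The defaults for beta and gamma are placeholders for illustration only.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_loss(sum_logp_w, len_w, sum_logp_l, len_l, beta=2.0, gamma=1.0):
    """Length-normalized rewards with a target margin; no reference model."""
    r_w = beta * sum_logp_w / len_w
    r_l = beta * sum_logp_l / len_l
    return -log_sigmoid(r_w - r_l - gamma)


# Same per-token log-prob (-1.0) at different lengths: the rewards tie,
# so only the margin gamma contributes to the loss.
same_quality = simpo_loss(-10.0, 10, -20.0, 20)
# Chosen response has the better per-token log-prob.
chosen_better = simpo_loss(-5.0, 10, -20.0, 20)
```

The first call illustrates the effect of length normalization: a response twice as long with the same per-token likelihood is not penalized for its length.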

Citation: Yu Meng, Mengzhou Xia, Danqi Chen, arXiv:2405.14734 (NeurIPS 2024)

"There is a divergence between DPO's reward formulation $r_\theta(x,y)=\beta\log\pi_\theta(y|x)/\pi_\text{ref}(y|x)$ and the average log likelihood metric $p_\theta(y|x)=\frac{1}{|y|}\log\pi_\theta(y|x)$, which directly impacts generation." (Meng et al., arXiv:2405.14734, Section 3.1)

Properties:

7.5 Online vs Offline DPO

Distribution Mismatch

Standard DPO is an offline algorithm: the preference dataset $\mathcal{D}$ is collected before training from some fixed data-generating policy $\mu$ (typically the SFT model). During training, $\pi_\theta$ is continuously updated while $\mathcal{D}$ remains static. Once $\pi_\theta$ diverges from $\mu$, the $(y_w, y_l)$ pairs in $\mathcal{D}$ no longer cover the current output distribution of $\pi_\theta$, creating an off-policy distribution mismatch.

Concrete manifestations:

Iterative / On-Policy DPO

Solution: at each iteration, sample new response pairs using the current policy $\pi_\theta^{(t)}$, then construct new preference pairs $(y_w^{(t)}, y_l^{(t)})$ via a reward model (or human/AI judge), and update the policy with this batch of distribution-matched preference data:

$$\pi_\theta^{(t+1)} \leftarrow \text{DPO-update}\!\left(\pi_\theta^{(t)},\;\mathcal{D}^{(t)}\right), \quad \mathcal{D}^{(t)} \sim \pi_\theta^{(t)}$$
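One common way to build $\mathcal{D}^{(t)}$ is best-vs-worst selection among several on-policy samples. The sketch below assumes hypothetical callables sample_fn (draws a response from the current policy) and reward_fn (an RM or judge score); the toy stand-ins at the bottom exist only to make the example runnable.

```python
import itertools


def build_preference_pairs(prompts, sample_fn, reward_fn, k=4):
    """For each prompt, sample k responses and keep (best, worst) as a pair."""
    pairs = []
    for x in prompts:
        candidates = [sample_fn(x) for _ in range(k)]
        ranked = sorted(candidates, key=lambda y: reward_fn(x, y))
        worst, best = ranked[0], ranked[-1]
        # Skip prompts where all samples tie: no preference signal.
        if reward_fn(x, best) > reward_fn(x, worst):
            pairs.append((x, best, worst))
    return pairs


# Toy stand-ins (assumptions): a "policy" cycling through three canned
# responses, and a "reward" that reads the trailing digit.
_counter = itertools.count()


def toy_sample(x):
    return f"{x}-resp{next(_counter) % 3}"


def toy_reward(x, y):
    return int(y[-1])


pairs = build_preference_pairs(["q"], toy_sample, toy_reward, k=4)
```

Each round would then run a DPO update on the returned pairs and repeat with the updated policy.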

Why online DPO is generally better:

| Dimension | Offline DPO | Online / Iterative DPO |
| --- | --- | --- |
| Source of preference data | Statically pre-collected from fixed $\mu$ | Sampled each round from current $\pi_\theta$ |
| Distribution match | Off-policy, subject to drift | On-policy, matches current policy |
| Training signal quality | Limited by old distribution | Covers current policy's output distribution |
| Compute cost | Low (one-time data collection) | High (requires online sampling + RM scoring each round) |
| Exploration ability | None, locked to $\mathcal{D}$ | Can explore new output patterns |
| Representative methods | Standard DPO (Rafailov et al.) | RLHF-PPO, Online DPO, Self-Play Fine-Tuning |

Practical guidance: When access to an RM or automatic judge is available, iterative DPO (re-sampling + updating preference data every $k$ steps) generally yields better downstream conversation quality than purely offline DPO. If only offline is feasible, removing $\pi_\text{ref}$ (as in SimPO) or adding chosen-anchoring (as in DPOP) can partially mitigate the likelihood displacement caused by distribution drift.

7.6 Precise Comparison Table

| Variant | Requires paired preferences? | Requires $\pi_\text{ref}$? | Key loss form | Main DPO problem corrected |
| --- | --- | --- | --- | --- |
| DPO | ✅ Yes | ✅ Yes | $-\log\sigma(\beta\log\frac{\pi_\theta(y_w)}{\pi_\text{ref}(y_w)} - \beta\log\frac{\pi_\theta(y_l)}{\pi_\text{ref}(y_l)})$ | Baseline (no explicit correction) |
| IPO | ✅ Yes | ✅ Yes | $\left(h_\pi(y_w,y_l) - \frac{1}{2\tau}\right)^2$, squared-loss regression | KL regularization fails under deterministic preferences; unbounded reward drift |
| KTO | ❌ No (pointwise binary label) | ✅ Yes | $w(y)(1 - v_\text{KTO}(x,y;\beta))$, asymmetric sigmoid + KL baseline $z_\text{ref}$ | Requires paired data; objective misaligned with actual generation utility |
| ORPO | ✅ Yes | ❌ No | $\mathcal{L}_\text{SFT} + \lambda(-\log\sigma(\log\frac{\text{odds}(y_w)}{\text{odds}(y_l)}))$ | Two-stage training; maintaining frozen $\pi_\text{ref}$ (doubles memory/compute) |
| SimPO | ✅ Yes | ❌ No | $-\log\sigma(\frac{\beta}{\lvert y_w\rvert}\log\pi_\theta(y_w) - \frac{\beta}{\lvert y_l\rvert}\log\pi_\theta(y_l) - \gamma)$ | Mismatch between implicit reward and generation likelihood; length bias; $\pi_\text{ref}$ overhead |

Citations: DPO — Rafailov et al., arXiv:2305.18290; IPO — Azar et al., arXiv:2310.12036; KTO — Ethayarajh et al., arXiv:2402.01306; ORPO — Hong et al., arXiv:2403.07691; SimPO — Meng et al., arXiv:2405.14734 (NeurIPS 2024)


8. GRPO vs PPO (Group Relative Policy Optimization vs Proximal Policy Optimization)

8.1 GRPO Group Advantage Estimation

The core idea of GRPO: for the same prompt $x$, sample a group of responses $\{y_1, y_2, \dots, y_G\}$ and estimate the advantage using intra-group statistics:

$$\boxed{A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \dots, r_G\})}{\mathrm{std}(\{r_1, r_2, \dots, r_G\})}}$$

where $r_i = r(x, y_i)$ is the reward for the $i$-th response.

GRPO policy gradient loss (shown here at sequence level; the original formulation applies the ratio per token and adds a KL term toward $\pi_{\mathrm{ref}}$):

$$\mathcal{L}_{\mathrm{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\left[\min\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_{\mathrm{old}}}(y_i|x)} A_i, \;\mathrm{clip}\!\left(\frac{\pi_\theta(y_i|x)}{\pi_{\theta_{\mathrm{old}}}(y_i|x)}, 1-\epsilon, 1+\epsilon\right)A_i\right)\right]$$

Same clipping mechanism as PPO, but entirely different advantage estimation.
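The group advantage is a per-prompt z-score. A minimal sketch follows; the small eps in the denominator and the use of the population standard deviation are assumptions made here so that groups with identical rewards do not divide by zero.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Z-score each reward against its own group (one prompt, G samples)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)      # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Binary verifiable rewards for G = 4 samples of one prompt.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# All samples equally rewarded: no learning signal for this prompt.
flat = grpo_advantages([1.0, 1.0, 1.0])
```

The second case shows a practical consequence: prompts that are always solved or never solved contribute zero advantage.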

8.2 Key Comparison

| Property | PPO | GRPO |
| --- | --- | --- |
| Models required | 4: Actor + Critic + Reference + RM | 2–3: Actor + Reference (+ optional RM; in RLVR settings reward comes from rules, no separate RM needed) |
| Advantage estimation | GAE (Generalized Advantage Estimation), requires a Critic network | Group-normalized rewards (z-score within the group), no Critic needed |
| Memory overhead | High (4 copies of model weights) | Low (2–3 copies of model weights) |
| Reward source | Learned neural RM | Typically verifiable/rule-based reward; neural RM can also be plugged in |
| Suitable scenarios | Open-ended generation (dialogue, creative writing) | Math reasoning, code generation (tasks with verifiable ground truth) |
| Training stability | Requires careful Critic tuning, otherwise unstable | More stable, no Critic estimation error |
| Gradient variance | Lower (GAE provides low-variance estimates) | Higher (limited group sample size $G$) |

8.3 RLVR Framework (RL from Verifiable Rewards)

The paradigm that best fits GRPO is RLVR: rewards come not from a learned RM but from automatically verifiable rules:

The core advantage of RLVR: low-noise reward (relative to a learned RM), avoiding the bias and overfitting of the RM itself.

8.4 When to Prefer Which

8.5 RLOO and ReMax (critic-free baselines)

PPO relies on a learned critic (value network) to estimate a baseline. GRPO, RLOO, and ReMax are all critic-free, replacing the value baseline with a baseline computed from sampled rewards.

RLOO (REINFORCE Leave-One-Out, Ahmadian et al., ACL 2024, arXiv:2402.14740): For a group of $G$ samples per prompt, the baseline for sample $i$ is the mean reward of the other $G-1$ samples; advantage $A_i = r_i - \frac{1}{G-1}\sum_{j\neq i} r_j$. It uses a pure REINFORCE gradient, no clipping, no critic; the policy-gradient estimate stays unbiased because the baseline does not depend on sample $i$'s own action.

ReMax (Li et al., ICML 2024, arXiv:2310.10505): The baseline is the reward of a single greedy (argmax) decode for the same prompt; advantage $A = r(\text{sample}) - r(\text{greedy})$. This requires only one extra greedy rollout per prompt, resulting in very low memory overhead and no critic network.

| Method | Baseline | Estimator | Clip? | Extra cost |
| --- | --- | --- | --- | --- |
| GRPO | Group-relative (z-score) | PPO-style | Yes | $G$ samples |
| RLOO | Leave-one-out mean | REINFORCE | No | $G$ samples |
| ReMax | Greedy decode reward | REINFORCE | No | +1 greedy rollout |
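Both baselines are one-liners once the rewards are available. A small sketch with illustrative function names:

```python
def rloo_advantages(rewards):
    """A_i = r_i - mean of the other G-1 rewards (requires G >= 2)."""
    g = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (g - 1) for r in rewards]


def remax_advantage(sample_reward, greedy_reward):
    """A = r(sample) - r(greedy decode of the same prompt)."""
    return sample_reward - greedy_reward


rloo = rloo_advantages([1.0, 0.0, 0.0, 1.0])
remax = remax_advantage(sample_reward=1.0, greedy_reward=0.0)
```

Compared with the GRPO z-score on the same rewards, the RLOO advantages keep the reward scale (here ±2/3) instead of being normalized to unit variance.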

9. Role of KL Penalty & Tuning β

9.1 Intuitive Role of the KL Term

$$\beta \cdot \mathrm{KL}\!\big(\pi_\theta(\cdot|x) \;\|\; \pi_{\mathrm{ref}}(\cdot|x)\big)$$

The KL penalty acts as regularization, with the following functions:

  1. Prevents excessive drift: Ensures $\pi_\theta$ does not deviate too far from $\pi_{\mathrm{ref}}$, preserving pre-training knowledge
  2. Mitigates reward hacking: If the policy learns to exploit flaws in the RM, the KL term grows as a penalty
  3. Maintains diversity: Prevents the policy from collapsing to a few high-reward modes (mode collapse)
  4. Stabilizes training: Constrains the exploration space, preventing excessively large policy updates

Expanding mathematically:

$$\mathrm{KL}(\pi_\theta \| \pi_{\mathrm{ref}}) = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log\frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right] \geq 0$$

When $\pi_\theta = \pi_{\mathrm{ref}}$, KL = 0 (no penalty when the policy does not deviate at all from the reference policy).

9.2 KL-RM Score Pareto Frontier

Tuning $\beta$ is fundamentally a trade-off between two objectives:

$$\underbrace{\mathbb{E}[r(x,y)]}_{\text{reward score}\uparrow} \quad \text{vs.} \quad \underbrace{\mathrm{KL}(\pi_\theta \| \pi_{\mathrm{ref}})}_{\text{degree of deviation}\uparrow}$$

| $\beta$ value | Effect |
|---|---|
| $\beta$ too large | Policy barely updates, stays close to $\pi_{\mathrm{ref}}$, small reward improvement (underfitting) |
| $\beta$ too small | Policy updates aggressively, reward may be high but distribution shift is severe, risk of reward hacking |
| $\beta$ moderate | Finds a balance on the KL-RM frontier |

Typical range: $\beta \in [0.01, 0.5]$; $\beta = 0.1$ or $\beta = 0.2$ are commonly used in practice.

Note

Quantitative version of over-optimization: the gold-RM score traces an inverted-U in $\sqrt{\mathrm{KL}}$ (BoN form $d(\alpha-\beta d)$, RL form $d(\alpha-\beta\log d)$, with $d=\sqrt{\mathrm{KL}}$); past the peak the policy drifts out of distribution and the gold score falls. Here $\alpha$ and $\beta$ are fitted coefficients of the scaling law, not the KL weight $\beta$ used elsewhere in this section. See reward-modeling-eval §3.2a (Gao et al. 2022, arXiv:2210.10760).

9.3 β in DPO vs PPO

| Dimension | $\beta$ in PPO | $\beta$ in DPO |
|---|---|---|
| Mathematical role | Controls the weight of the KL penalty term (in the loss function) | Controls the scaling of the implicit reward (in the log-ratio) |
| Where it appears | $\max_\pi \mathbb{E}[r] - \beta \cdot \mathrm{KL}$ | $\hat{r}(x,y) = \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ |
| Semantic equivalence | Theoretically, DPO's $\beta$ originates from the same $\beta$ in the RLHF objective, but in practice the training dynamics differ, so both must be tuned separately | (as left) |
| Practical effect | $\beta \uparrow$ → more conservative policy | $\beta \uparrow$ → implicit reward changes more sharply per unit of log-ratio (more sensitive to the preference signal), so the loss saturates at a smaller log-ratio margin and the policy stays closer to $\pi_{\mathrm{ref}}$ |

Conclusion: theoretically equivalent, practically different. In DPO, $\beta$ also influences the sharpness of the weight term $\sigma(\cdot)$ in the gradient.

9.4 KL Estimators & Placement

9.4.1 Three Single-Sample Estimators

Notation: Let $r = \pi_{\mathrm{ref}} / \pi_\theta$ (Schulman 2020 convention), with samples drawn from the current policy $\pi_\theta$.

Define three estimators:

$$k_1 = -\log r = \log\frac{\pi_\theta}{\pi_{\mathrm{ref}}}$$

$$k_2 = \tfrac{1}{2}(\log r)^2 = \tfrac{1}{2}\!\left(\log\frac{\pi_{\mathrm{ref}}}{\pi_\theta}\right)^{\!2}$$

$$k_3 = (r - 1) - \log r = \left(\frac{\pi_{\mathrm{ref}}}{\pi_\theta} - 1\right) - \log\frac{\pi_{\mathrm{ref}}}{\pi_\theta}$$

Verifying the expectation (samples from $\pi_\theta$):

$$\mathbb{E}_{a \sim \pi_\theta}[r] = \sum_a \pi_\theta(a) \cdot \frac{\pi_{\mathrm{ref}}(a)}{\pi_\theta(a)} = \sum_a \pi_{\mathrm{ref}}(a) = 1$$

Therefore $\mathbb{E}[r - 1] = 0$: the term $(r-1)$ is zero-mean and serves as a control variate.

Order near $r=1$ (let $\epsilon = r-1$): $k_1 = -\log r \approx -\epsilon$ is first-order (signed), whereas $k_2 \approx k_3 \approx \tfrac{1}{2}\epsilon^2$ are second-order (non-negative). All three vanish as $r \to 1$ but at different rates. $k_3$ is non-negative like $k_2$ because $\log r \le r - 1$, and unbiased like $k_1$ because it differs from $k_1$ only by the zero-mean term $(r-1)$.
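
The unbiasedness and sign claims above can be checked exactly on a tiny categorical policy. The two distributions below are made-up toy numbers, and expectations are computed in closed form (no sampling), so this is only a sanity-check sketch.

```python
import math

# Toy distributions over 3 actions (illustrative values, not from any model).
pi_theta = [0.5, 0.3, 0.2]
pi_ref = [0.4, 0.4, 0.2]
ratios = [q / p for p, q in zip(pi_theta, pi_ref)]   # r = pi_ref / pi_theta

def k1(r): return -math.log(r)
def k2(r): return 0.5 * math.log(r) ** 2
def k3(r): return (r - 1.0) - math.log(r)

def expect(f):
    """Exact expectation under a ~ pi_theta."""
    return sum(p * f(r) for p, r in zip(pi_theta, ratios))

true_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
e_k1, e_k2, e_k3 = expect(k1), expect(k2), expect(k3)

print(f"KL(pi_theta || pi_ref) = {true_kl:.6f}")
print(f"E[k1] = {e_k1:.6f}  E[k2] = {e_k2:.6f}  E[k3] = {e_k3:.6f}")
print("per-sample k1:", [round(k1(r), 4) for r in ratios])   # can be negative
print("per-sample k3:", [round(k3(r), 4) for r in ratios])   # never negative
```

On this example $k_1$ and $k_3$ recover the true KL in expectation, $k_2$ is close but biased, and only $k_1$ produces negative per-sample values.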

9.4.2 The Gradient Perspective

The analysis above concerns value estimation only. When an estimator is used as a loss term, its gradient behavior must be analyzed separately.

9.4.3 Estimator Comparison Table

| Estimator | Form | Value-unbiased? | Gradient-principled? |
|---|---|---|---|
| $k_1$ | $-\log r$ | Yes | Yes, as in-reward |
| $k_2$ | $\tfrac{1}{2}(\log r)^2$ | No | Yes, as-loss |
| $k_3$ | $(r-1)-\log r$ | Yes | No, as-loss |

Variance note: in the small-drift regime ($r \approx 1$), $k_3$ has lower variance than $k_1$ (both unbiased). $k_2$ is biased and operates in a different bias-variance regime; direct variance comparisons with $k_1$ or $k_3$ are not meaningful.

9.4.4 Two Placement Styles (Style A vs Style B)

Style A: In-Reward

Representative: InstructGPT / PPO. The KL penalty is incorporated per-token into the reward signal:

$$r_{\mathrm{total}}(x, y) = r_{\mathrm{RM}}(x, y) - \beta \cdot k_1^{(t)}, \quad k_1^{(t)} = \log\frac{\pi_\theta(a_t|s_t)}{\pi_{\mathrm{ref}}(a_t|s_t)}$$

In common implementations the sequence-level score $r_{\mathrm{RM}}(x, y)$ is credited only at the final response token, while the $-\beta \cdot k_1^{(t)}$ term is applied at every token $t$.

Style B: In-Loss

Representative: GRPO (Shao et al., DeepSeekMath), DeepSeek-R1. The KL estimator is added directly to the policy optimization loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{GRPO}} + \beta \cdot k_3$$

where $k_3 = (r - 1) - \log r$, $r = \pi_{\mathrm{ref}} / \pi_\theta$ (computed per token, then averaged over the sequence).

| Dimension | Style A (in-reward) | Style B (in-loss) |
|---|---|---|
| Representative systems | InstructGPT, PPO | GRPO, DeepSeek-R1 |
| Estimator used | $k_1$ (per-token) | $k_3$ (per-token, averaged) |
| Gradient-principled | Yes | Approximate (acceptable at small $\beta$) |
| Engineering complexity | Requires Critic | No Critic needed |
| Clip-masking risk | Present (KL gradient silently dropped for clipped tokens) | Not applicable |
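
The two placement styles can be sketched on one short response with per-token log-probabilities. This is a value-only illustration under stated assumptions: the function names are hypothetical, the RM score is credited at the last token (a common convention, assumed here), and a real trainer would compute the Style B term on autograd tensors rather than plain floats.

```python
import math

def in_reward_token_rewards(rm_score, logp_theta, logp_ref, beta):
    """Style A: per-token reward -beta * k1_t, with the RM score added at the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_theta, logp_ref)]
    rewards[-1] += rm_score          # assumption: sequence-level score lands on the final token
    return rewards

def in_loss_k3_penalty(logp_theta, logp_ref, beta):
    """Style B: beta * mean_t k3_t, with r_t = pi_ref / pi_theta per token."""
    k3 = []
    for lp, lr in zip(logp_theta, logp_ref):
        log_r = lr - lp
        k3.append(math.exp(log_r) - 1.0 - log_r)
    return beta * sum(k3) / len(k3)

logp_theta = [-1.0, -0.5, -2.0]      # toy per-token log-probs under the policy
logp_ref = [-1.2, -0.5, -1.5]        # toy per-token log-probs under the reference

rewards = in_reward_token_rewards(0.8, logp_theta, logp_ref, beta=0.1)
penalty = in_loss_k3_penalty(logp_theta, logp_ref, beta=0.04)
print("Style A per-token rewards:", rewards)
print("Style B loss penalty     :", penalty)
```

In Style A the shaped rewards then flow through advantage estimation; in Style B the penalty is simply added to the policy loss.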

9.4.5 Interview Self-Test

L2: Using the $r = \pi_{\mathrm{ref}}/\pi_\theta$ convention, why does $\mathbb{E}_{a \sim \pi_\theta}[r] = 1$? How does this result establish the unbiasedness of $k_3$?

L3: GRPO incorporates $k_3$ directly into the loss, yet $k_3$ as a loss term does not yield the principled reverse-KL gradient. Why is this usually acceptable in GRPO practice? If $\beta$ were increased from 0.04 to 0.5, how would this approximation error change?


10. Process Reward Model (PRM) vs Outcome Reward Model (ORM)

10.1 ORM: Outcome Reward Model

$$r_{\mathrm{ORM}}(x, y) = f_\phi(x, y) \in \mathbb{R}$$

10.2 PRM: Process Reward Model

$$r_{\mathrm{PRM}}(x, y, t) = f_\phi(x, s_1, \dots, s_t) \in \mathbb{R}$$

10.3 Credit Assignment Advantage

This is the core advantage of PRM. Consider a mathematical reasoning chain:

Step 1: Let $f(x) = x^2 + 3x$ → ✅ correct
Step 2: Differentiate to get $f'(x) = 2x + 3$ → ✅ correct
Step 3: Set $f'(x) = 0$, solve $x = -3/2$ → ✅ correct
Step 4: $f(-3/2) = 9/4 - 9/2 = -9/4$ → ✅ correct

ORM sees only the outcome: here the final answer is correct, so it gives a high score, but it cannot tell whether each individual step is reliable.

PRM scores every step, so it can also handle the case where the first three steps are correct but the fourth is wrong (say, an arithmetic slip in Step 4):

$$\underbrace{r_{\mathrm{PRM}}(s_1) > 0,\; r_{\mathrm{PRM}}(s_2) > 0,\; r_{\mathrm{PRM}}(s_3) > 0}_{\text{correct steps}} \quad \underbrace{r_{\mathrm{PRM}}(s_4) < 0}_{\text{erroneous step localized}}$$

This allows PRM to guide search and training more precisely.

10.4 Best-of-N Search with PRM

Given prompt $x$, sample $N$ candidate responses $\{y_1, \dots, y_N\}$ and score each step of each response:

$$\text{Score}(y_i) = \min_{t=1}^{T_i} r_{\mathrm{PRM}}(x, y_i, t)$$

or use the product form:

$$\text{Score}(y_i) = \prod_{t=1}^{T_i} \sigma\!\big(r_{\mathrm{PRM}}(x, y_i, t)\big)$$

Taking min or product ensures every step qualifies: any weak step pulls down the overall score.

Select the best response:

$$y^* = \arg\max_{y_i} \text{Score}(y_i)$$
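
The two aggregation rules can be sketched directly on lists of per-step PRM scores. The candidate names and step scores below are invented for illustration; in practice they would come from a trained PRM.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_min(step_rewards):
    """Weakest-step aggregation: min over steps."""
    return min(step_rewards)

def score_prod(step_rewards):
    """Product aggregation: product of sigmoid(step reward)."""
    return math.prod(sigmoid(r) for r in step_rewards)

# Hypothetical per-step PRM scores for N = 3 candidate responses.
candidates = {
    "y1": [2.0, 1.5, 1.8],    # uniformly solid steps
    "y2": [3.0, 3.0, -1.0],   # strong start, one bad step
    "y3": [1.0, 1.2, 0.9],    # mediocre throughout
}

best_min = max(candidates, key=lambda k: score_min(candidates[k]))
best_prod = max(candidates, key=lambda k: score_prod(candidates[k]))
print("best by min    :", best_min)
print("best by product:", best_prod)
```

Candidate `y2` has the highest individual step scores but loses under both rules because of its single weak step.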

10.5 PRM Training Data Challenges

| Challenge | Description |
|---|---|
| Expensive annotation | Every step of every reasoning chain requires expert human annotation of correctness (10–50× more expensive than ORM annotation) |
| Ambiguous step boundaries | There is no unified standard for segmenting reasoning steps; different annotators may segment them differently |
| Low inter-annotator agreement | Judgments of "whether a step is correct" may vary with the annotator's mathematical proficiency |
| Limitations of automated methods | Monte Carlo estimation (estimating the probability of reaching the correct answer after a given step via repeated sampling) has high variance |

Automated PRM annotation method: After step $t$, sample completions multiple times and compute the proportion of final answers that are correct as an estimate of $r_{\mathrm{PRM}}(s_t)$. Formula: $r_{\mathrm{PRM}}(s_t) \approx \frac{1}{K}\sum_{k=1}^{K} \mathbb{1}[\text{completion}_k \text{ leads to correct answer}]$
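
The Monte Carlo estimate above is just an average of correctness indicators over $K$ rollouts from a given prefix. A minimal sketch follows; `toy_rollout` is a made-up stand-in for sampling a completion from the policy, and its 80% / 10% success rates are arbitrary.

```python
import random

def mc_step_value(prefix_steps, rollout_fn, is_correct, K=64, seed=0):
    """Estimate r_PRM(s_t) as the fraction of K completions that reach a correct answer."""
    rng = random.Random(seed)
    hits = sum(is_correct(rollout_fn(prefix_steps, rng)) for _ in range(K))
    return hits / K

def toy_rollout(prefix_steps, rng):
    """Hypothetical policy rollout: a flawed prefix rarely recovers."""
    p_correct = 0.1 if "wrong" in prefix_steps else 0.8
    return "correct" if rng.random() < p_correct else "incorrect"

def is_correct(answer):
    return answer == "correct"

good = mc_step_value(["s1", "s2"], toy_rollout, is_correct, K=200)
bad = mc_step_value(["s1", "wrong"], toy_rollout, is_correct, K=200)
print("value after a sound prefix :", good)
print("value after a flawed prefix:", bad)
```

With small $K$ the estimate is noisy, which is the high-variance limitation listed in the table above.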


11. Alignment Tax & Weight Averaging

11.1 Definition of Alignment Tax

Alignment Tax refers to the performance degradation on base capability benchmarks after a model undergoes alignment training:

$$\text{Alignment Tax} = \text{Perf}_{\mathrm{base}}(\theta_{\mathrm{base}}) - \text{Perf}_{\mathrm{base}}(\theta_{\mathrm{aligned}})$$

where $\text{Perf}_{\mathrm{base}}$ denotes performance on pre-training benchmarks (e.g., MMLU, coding ability, math ability, etc.).

Intuitively: SFT/RL training may "forget" or "overwrite" parts of pre-trained knowledge while improving alignment quality (safety, helpfulness, format following).

11.2 WiSE-FT Linear Interpolation

WiSE-FT (Weight-space Ensembles for Finetuning) mitigates the alignment tax by interpolating in weight space:

$$\boxed{\theta_{\mathrm{merged}} = (1 - \alpha)\,\theta_{\mathrm{aligned}} + \alpha\,\theta_{\mathrm{base}}}$$

where $\alpha \in [0, 1]$ controls the trade-off between the aligned model and the base model.

| $\alpha$ | Effect |
|---|---|
| $\alpha = 0$ | fully aligned model |
| $\alpha = 1$ | base model only |
| $\alpha \in (0, 1)$ | compromise: retains some aligned behavior while recovering some base capability |
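
The interpolation is element-wise over matching parameters. A minimal sketch, assuming both checkpoints share the same parameter names and shapes; plain dicts of float lists stand in for real state dicts, and the function name is hypothetical.

```python
def wise_ft(theta_aligned, theta_base, alpha):
    """theta_merged = (1 - alpha) * theta_aligned + alpha * theta_base, per parameter."""
    assert theta_aligned.keys() == theta_base.keys(), "checkpoints must share parameter names"
    return {
        name: [(1 - alpha) * a + alpha * b
               for a, b in zip(theta_aligned[name], theta_base[name])]
        for name in theta_aligned
    }

# Toy "checkpoints" with two parameters each.
theta_base = {"w": [0.0, 1.0], "b": [0.5]}
theta_aligned = {"w": [1.0, 3.0], "b": [0.1]}

merged = wise_ft(theta_aligned, theta_base, alpha=0.25)
print(merged)
```

In practice one would sweep `alpha` and pick the value that best balances the alignment metric against the base-capability benchmarks.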

11.3 Why Interpolation Works

Task Vector perspective: alignment training is equivalent to moving in a direction within weight space:

$$\tau_{\mathrm{align}} = \theta_{\mathrm{aligned}} - \theta_{\mathrm{base}}$$

Substituting into the interpolation formula, WiSE-FT simply scales this task vector by $(1-\alpha)$:

$$\theta_{\mathrm{merged}} = \theta_{\mathrm{base}} + (1-\alpha)\,\tau_{\mathrm{align}}$$

Task-vector studies report that the weight-change directions corresponding to different tasks are near-orthogonal, so shrinking $\tau_{\mathrm{align}}$ along its own direction doesn't severely interfere with other task representations.

11.4 Advanced Model Merging Variants

| Method | Formula / Operation | Core Idea |
|---|---|---|
| Linear Interpolation | $\theta_m = (1-\alpha)\theta_a + \alpha\theta_b$ | Simplest; element-wise linear average |
| SLERP (Spherical Linear Interpolation) | $\theta_m = \frac{\sin((1-t)\Omega)}{\sin\Omega}\theta_a + \frac{\sin(t\Omega)}{\sin\Omega}\theta_b$, where $\cos\Omega = \frac{\theta_a \cdot \theta_b}{\lVert\theta_a\rVert\,\lVert\theta_b\rVert}$ | Interpolates on the hypersphere, preserving vector norms |
| DARE (Drop And REscale) | Randomly drop a fraction $p$ of the entries of the task vector $\delta = \theta_{\mathrm{aligned}} - \theta_{\mathrm{base}}$, rescale the survivors: $\delta_i \leftarrow \delta_i / (1-p)$, then merge | Sparsifies the task vector to reduce interference |
| TIES (Trim, Elect Sign & Merge) | ① Trim small-magnitude changes ② Vote on sign ③ Keep only parameters with a consistent direction | Resolves parameter conflicts when merging multiple models |

SLERP intuition: the "direction" of a weight vector matters more than its "length"; spherical interpolation preserves the geometric relationship between directions.
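
SLERP and DARE from the table above can be sketched on flat weight vectors. This is an illustrative sketch only: function names are hypothetical, vectors are plain Python lists, `p` is a drop fraction in $[0, 1)$, and the fallback to linear interpolation for nearly parallel vectors is an assumption rather than part of the table.

```python
import math
import random

def slerp(theta_a, theta_b, t, eps=1e-8):
    """Spherical interpolation between two flat weight vectors."""
    dot = sum(a * b for a, b in zip(theta_a, theta_b))
    norm_a = math.sqrt(sum(a * a for a in theta_a))
    norm_b = math.sqrt(sum(b * b for b in theta_b))
    omega = math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))
    if math.sin(omega) < eps:            # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]
    w_a = math.sin((1 - t) * omega) / math.sin(omega)
    w_b = math.sin(t * omega) / math.sin(omega)
    return [w_a * a + w_b * b for a, b in zip(theta_a, theta_b)]

def dare(theta_base, theta_aligned, p, seed=0):
    """Drop each task-vector entry with probability p, rescale survivors by 1/(1-p), add back."""
    rng = random.Random(seed)
    merged = []
    for b, a in zip(theta_base, theta_aligned):
        delta = a - b
        keep = rng.random() >= p
        merged.append(b + (delta / (1 - p) if keep else 0.0))
    return merged

mid = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
sparse = dare([0.0] * 4, [1.0] * 4, p=0.5)
print("SLERP midpoint:", mid)
print("DARE merge    :", sparse)
```

The SLERP midpoint of two orthogonal unit vectors keeps unit norm, whereas a plain linear average would shrink it to about 0.707.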


12. Catastrophic Forgetting, Mode Collapse & Reward Hacking

These are three distinct training failure modes.

12.1 Catastrophic Forgetting

Definition: when the model learns new behaviors during SFT or RL, it loses knowledge and capabilities acquired during pre-training.

Mechanism:

$$\text{Pre-training capability} \xrightarrow{\text{SFT/RL update } \theta} \text{partially overwritten/erased}$$

Model capacity is finite and the same weights are shared across tasks; gradient updates for new tasks may overwrite weights that store old knowledge.

Detection Metrics:

Mitigation Strategies:

| Strategy | Description |
|---|---|
| Mixed training data | Mix pre-training data into SFT |
| Low-rank adaptation (LoRA) | Only updates the low-rank delta $\Delta W = BA$, greatly reducing interference with original weights |
| Regularization | EWC (Elastic Weight Consolidation): $\mathcal{L} = \mathcal{L}_{\mathrm{new}} + \frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_i^*)^2$, where $F_i$ is the Fisher information and $\theta_i^*$ are the weights before the new training |
| Model merging | WiSE-FT / SLERP merges the aligned model with the base model |
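
The EWC penalty in the table is a Fisher-weighted squared distance from the old weights. A minimal sketch on flat lists, with made-up Fisher values; a real implementation would estimate $F_i$ from squared gradients on old-task data.

```python
def ewc_loss(new_task_loss, theta, theta_star, fisher, lam):
    """L = L_new + (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    penalty = sum(f * (t - ts) ** 2 for f, t, ts in zip(fisher, theta, theta_star))
    return new_task_loss + 0.5 * lam * penalty

theta_star = [1.0, -2.0]   # weights before the new training
theta = [1.5, -2.0]        # current weights: only the first one has moved
fisher = [4.0, 0.1]        # hypothetical importance of each weight for old tasks

total = ewc_loss(0.3, theta, theta_star, fisher, lam=2.0)
print("total loss:", total)
```

Moving a high-Fisher weight is penalized heavily, while low-Fisher weights remain free to adapt to the new task.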

12.2 Mode Collapse

Definition: during RL training, the model's output diversity drops sharply, repeatedly producing similar or even identical responses.

Mechanism: the policy over-optimizes a high-reward pattern, concentrating probability mass onto a small number of outputs:

$$H(\pi_\theta(\cdot|x)) = -\sum_y \pi_\theta(y|x)\log\pi_\theta(y|x) \to 0$$

Detection Metrics:

Mitigation Strategies:

| Strategy | Description |
|---|---|
| Increase KL penalty | $\beta \uparrow$ keeps the policy close to $\pi_{\mathrm{ref}}$, maintaining diversity |
| Entropy regularization | Add a $-\eta H(\pi_\theta)$ term to the loss to encourage exploration |
| Data diversity | Training data covers a diverse prompt distribution |
| Early stopping | Monitor diversity metrics and stop training promptly |
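
The entropy formula and the entropy-regularization row can be illustrated on two toy output distributions. The distributions, the policy-loss value, and $\eta$ below are arbitrary illustrative numbers.

```python
import math

def entropy(probs):
    """H(pi) = -sum_y pi(y) log pi(y), skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_regularized_loss(policy_loss, probs, eta):
    """Loss with the -eta * H(pi) term: higher entropy lowers the loss."""
    return policy_loss - eta * entropy(probs)

diverse = [0.25, 0.25, 0.25, 0.25]      # mass spread over four responses
collapsed = [0.97, 0.01, 0.01, 0.01]    # mass concentrated on one response

h_diverse, h_collapsed = entropy(diverse), entropy(collapsed)
loss_diverse = entropy_regularized_loss(1.0, diverse, eta=0.01)
loss_collapsed = entropy_regularized_loss(1.0, collapsed, eta=0.01)
print(f"H(diverse) = {h_diverse:.4f}, H(collapsed) = {h_collapsed:.4f}")
print(f"loss(diverse) = {loss_diverse:.4f}, loss(collapsed) = {loss_collapsed:.4f}")
```

At equal policy loss, the collapsed distribution pays a higher total loss, which is the pressure that entropy regularization adds against mode collapse.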

12.3 Reward Hacking

Definition: the policy learns to exploit RM weaknesses, achieving high RM scores while actual human evaluation declines. This is a direct manifestation of Goodhart's Law:

"When a measure becomes a target, it ceases to be a good measure."

$$\mathbb{E}_{y \sim \pi_\theta}[r_\phi(x,y)] \uparrow\uparrow \quad \text{but} \quad \mathbb{E}_{y \sim \pi_\theta}[r_{\mathrm{human}}(x,y)] \downarrow$$

Detection Metrics:

| Metric | Description |
|---|---|
| Divergence between RM score and human rating | $\Delta = r_{\mathrm{RM}} - r_{\mathrm{human}}$ grows |
| Continuously increasing KL | Policy keeps drifting away from $\pi_{\mathrm{ref}}$ |
| Surge of specific patterns | e.g., overuse of filler phrases like "however", "it is worth noting", etc. |
| Response length bloat | RM favors longer answers → model learns to produce redundant content |

Mitigation Strategies:

| Strategy | Description |
|---|---|
| KL penalty | Constrains the policy from drifting too far (most fundamental) |
| RM ensemble | Average over multiple RMs to reduce bias from any single RM |
| Adversarial training | Continuously update the RM to adapt to policy changes (online RLHF) |
| Human evaluation | Periodically evaluate with humans to detect RM–human divergence |
| Length penalty | Apply length normalization to RM scores |

12.4 Relationship Summary

Pre-training → SFT → RL
                ↓         ↓         ↓
       Catastrophic   Mode       Reward
         Forgetting   Collapse   Hacking
       (knowledge   (diversity  (RM being
          loss)        loss)     exploited)

| Feature | Catastrophic Forgetting | Mode Collapse | Reward Hacking |
|---|---|---|---|
| Stage | SFT / RL | RL | RL |
| Root cause | Weight overwriting | Over-optimization of a single pattern | RM weaknesses exploited |
| Symptom | Capability degradation | Repetitive, low-diversity output | High RM score but poor quality |
| Core mitigation | Regularization + mixed data | KL + entropy + diversity | KL + RM ensemble + human evaluation |

13. Constitutional AI / RLAIF

13.1 RLAIF Overview

RLAIF (Reinforcement Learning from AI Feedback) replaces human annotators with LLM-generated preference labels:

$$\text{Standard RLHF:} \quad \text{Human annotators} \xrightarrow{\text{compare } (y_1, y_2)} \text{preference label } (y_w, y_l)$$

$$\text{RLAIF:} \quad \text{LLM judge} \xrightarrow{\text{compare } (y_1, y_2)} \text{preference label } (y_w, y_l)$$

13.2 CAI Self-Critique-Revision Loop

The core of Constitutional AI (CAI) is a four-step loop:

Step 1 (Generate): given prompt $x$, use the current model to generate an initial response $y_0$: $y_0 \sim \pi_\theta(\cdot | x)$

Step 2 (Critique): use an LLM to critique $y_0$ according to the constitution principles: $\text{critique} = \mathrm{LLM}\!\left(\text{"According to principle } P_j \text{, what is wrong with the following response: } y_0\text{"}\right)$

Step 3 (Revise): based on the critique, use the LLM to revise the response: $y_1 = \mathrm{LLM}\!\left(\text{"Please revise the response based on the following critique: } [\text{critique}] \rightarrow y_0\text{"}\right)$

Can be iterated multiple times: $y_0 \to y_1 \to y_2 \to \dots$ (typically 1–3 rounds)

Step 4 (Train):

13.3 Constitution Principles

Constitution principles are a set of auditable alignment constraints, for example:

| No. | Principle Example |
|---|---|
| P₁ | "Choose the response that is most helpful, accurate, and harmless" |
| P₂ | "Choose the response that does not promote bias or discrimination" |
| P₃ | "Choose the response that does not assist with illegal activities" |

Unlike implicit human preferences, constitution principles are explicit and auditable:

$$p_{\mathrm{CAI}}(y_w \succ y_l | x) = \sigma\!\big(r_{\mathrm{LLM}}(x, y_w) - r_{\mathrm{LLM}}(x, y_l)\big)$$

where $r_{\mathrm{LLM}}$ is the LLM score based on constitution principles.
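
The critique-revision loop of §13.2 and the preference probability above can be sketched with a pluggable `llm` callable. Everything here is a hypothetical stand-in: `toy_llm` returns canned strings instead of calling a real model, the prompt wording only mirrors the templates in §13.2, and the scores passed to `preference_prob` are invented.

```python
import math

def critique_and_revise(prompt, response, principle, llm, rounds=1):
    """Run `rounds` critique -> revise passes; `llm` is any callable str -> str."""
    for _ in range(rounds):
        critique = llm(
            f"Prompt: {prompt}\nAccording to principle '{principle}', "
            f"what is wrong with the following response: {response}"
        )
        response = llm(
            f"Please revise the response based on the following critique: "
            f"{critique} -> {response}"
        )
    return response

def preference_prob(score_w, score_l):
    """p(y_w > y_l | x) = sigmoid(r_LLM(x, y_w) - r_LLM(x, y_l))."""
    return 1.0 / (1.0 + math.exp(-(score_w - score_l)))

def toy_llm(text):
    """Canned stand-in for a model call (assumption for illustration only)."""
    return "[revised]" if text.startswith("Please revise") else "[critique]"

revised = critique_and_revise("How do I stay safe online?", "y0",
                              "be helpful and harmless", toy_llm, rounds=2)
p = preference_prob(1.2, 0.2)
print("revised response:", revised)
print("p(y_w preferred):", round(p, 4))
```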

13.4 Comparison with Standard RLHF

| Dimension | Standard RLHF | RLAIF / CAI |
|---|---|---|
| Preference source | Human annotators | LLM (based on constitution principles) |
| Annotation cost | High (labor intensive) | Low (API call cost) |
| Scalability | Limited by annotator count and time | Nearly unlimited scaling |
| Consistency | Inter-annotator variance | More consistent under a fixed judge prompt, though still sensitive to prompt wording and answer order |
| Auditability | Preference criteria exist implicitly in annotators' minds | Constitution principles are explicit and auditable |
| Risk | Annotator bias | LLM's own bias + poorly designed constitution principles |
| Human involvement | Throughout | Only when designing constitution principles |

13.5 Theoretical Advantages of RLAIF

  1. Principle-guided: alignment objectives are expressed explicitly through natural-language principles, making them more controllable than implicit preferences
  2. Self-improvement loop: model critiques itself, revises itself, learns from the revised version → continuous improvement
  3. Reduced human burden: humans only need to design principles, not annotate individual examples
  4. Cross-cultural consistency: annotators from different cultural backgrounds may have different preferences, whereas constitution principles can unify the standard

Note: CAI is not fully human-free. Humans still need to:

  • Design the constitution
  • Evaluate final model quality
  • Monitor for drift during iteration

14. Distillation (Post-Training Perspective)

14.1 Comparison of Three Distillation Paradigms

SeqKD (Sequence-Level Knowledge Distillation)

The Teacher first generates complete output sequences via beam search or sampling; the Student then performs standard SFT (cross-entropy loss) on these sequences:

$$L_{\text{SeqKD}} = -\sum_{t} \log p_{\theta_S}(y_t \mid x, y_{<t}), \quad y \sim \pi_T(x)$$

Key points:

Token-Level KD (Token-Level Knowledge Distillation)

At each position $t$, align the Student's and Teacher's probability distributions over the vocabulary:

$$L_{\text{TKD}} = \sum_{t} D_{\text{KL}}\!\left(p_T(\cdot \mid x, y_{<t}) \;\Big\|\; p_{\theta_S}(\cdot \mid x, y_{<t})\right)$$

Key points:

On-Policy Distillation

The Student itself rolls out candidate sequences, which are then scored by the Teacher (or a verifiable reward); the Student updates accordingly:

$$L_{\text{on-policy}} = -\mathbb{E}_{y \sim \pi_{\theta_S}}\!\left[r_T(x, y) \cdot \log p_{\theta_S}(y \mid x)\right]$$

Key points:


14.2 CoT Distillation (R1-style Chain-of-Thought Distillation)

Core idea: use a large RL model (e.g., DeepSeek-R1-671B) to generate long reasoning sequences with complete chains of thought, then perform SFT on a small model (i.e., the CoT version of SeqKD).

The DeepSeek-R1 paper (arXiv:2501.12948) reports SFT results for Qwen and Llama students at 1.5B, 7B, 8B, 14B, 32B, and 70B parameters, trained on approximately 800K distillation samples (approximately 600K reasoning + approximately 200K non-reasoning); the reasoning capability of the small models improves substantially.

Why CoT distillation into small models is often more stable / more efficient than directly applying GRPO (per the distillation experiments in the R1 paper):

  1. Asymmetric exploration cost: GRPO requires the model to independently explore high-quality chains of thought, but small models have limited capability: random sampling rarely produces effective reasoning sequences (reward is extremely sparse), and gradient signals are noisy. The Teacher directly providing high-quality CoT effectively compresses the exploration space.
  2. No online RL loop needed: the SeqKD path only requires SFT on offline Teacher outputs, with no online rollouts and no reward model or verifier in the training loop. GRPO is already critic-free, but it still pays the GPU memory and compute cost of sampling $G$ rollouts per prompt and scoring them.
  3. Training stability: the SFT objective is a fixed supervised target and is typically smoother to optimize than RL, with no reward to hack, no RL-induced mode collapse, and fewer hyperparameters.

Hedging caveat: the above "more stable / more efficient" conclusion comes from observational results in the R1 paper under its distillation configuration (DeepSeek-R1 as the Teacher, itself trained from DeepSeek-V3-Base; Qwen and Llama Students; approximately 800K data scale), and does not imply this holds across all small models or data scales; direct RL (GRPO) may have a higher ceiling when data and compute are sufficient.


14.3 Forward KL vs Reverse KL

Definitions

Forward KL (also called inclusive KL; mean-seeking):

$$D_{\text{KL}}^{\text{fwd}}(p \| q) = \sum_y p(y) \log \frac{p(y)}{q(y)}$$

Optimization direction: minimizing the forward KL of $q$ relative to $p$ is equivalent to maximizing $\mathbb{E}_{y \sim p}[\log q(y)]$: the Student $q$ must cover all modes of the Teacher $p$ (wherever $p(y)>0$, $q$ cannot be 0, otherwise KL diverges).

Reverse KL (also called exclusive KL; mode-seeking):

$$D_{\text{KL}}^{\text{rev}}(q \| p) = \sum_y q(y) \log \frac{q(y)}{p(y)}$$

Optimization direction: minimizing this quantity takes the expectation over the support of $q$, allowing $q$ to ignore certain modes of $p$ (the term is 0 where $q(y)=0$), but $q$ will concentrate on regions where $p$ has high probability.

Why Generation Tasks Often Prefer Reverse KL / Mode-Seeking

Intuitive derivation:

Suppose the Teacher distribution $p$ is bimodal, with two modes $y_1, y_2$ each having probability $\approx 0.5$.

Mathematical statement: let $q^* = \arg\min_q D_{\text{KL}}^{\text{rev}}(q \| p)$; for a capacity-limited Student, the solution exhibits mass concentration on the dominant mode(s) of $p$, rather than "smearing" across multiple modes.

One-line intuition: Forward KL requires "don't miss any answer from the Teacher"; Reverse KL allows "only learn the Teacher's most confident answers." Generation tasks require coherent outputs, so it is better to cover less but with higher quality, hence the preference for Reverse KL.

Note: Token-level KD typically uses forward KL (Student aligns to Teacher soft labels). SeqKD / SFT at the sequence level is formally a sampled forward KL (cross-entropy on Teacher outputs), but it behaves more like reverse KL when the Teacher outputs come from beam search or low-temperature decoding, because the Student then only sees the Teacher's high-probability modes. The two are not mutually exclusive; in practice they are often mixed depending on the task.
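
The mean-seeking versus mode-seeking contrast can be reproduced numerically on a one-dimensional toy problem. The setup is an assumption chosen for illustration: the Teacher is a two-component Gaussian mixture on a grid, and the capacity-limited Student is a single fixed-width Gaussian whose only free parameter is its mean, fitted by grid search.

```python
import math

ys = [i / 10 for i in range(-80, 81)]            # evaluation grid on [-8, 8]

def normal_on_grid(mu, sigma=0.5):
    """Discretized, renormalized Gaussian on the grid."""
    w = [math.exp(-0.5 * ((y - mu) / sigma) ** 2) for y in ys]
    z = sum(w)
    return [x / z for x in w]

# Bimodal Teacher: equal-weight modes at -3 and +3.
p = [0.5 * a + 0.5 * b for a, b in zip(normal_on_grid(-3.0), normal_on_grid(3.0))]

def kl(a, b):
    """KL(a || b) on the grid."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

mus = [i / 10 for i in range(-50, 51)]           # candidate Student means
mu_fwd = min(mus, key=lambda m: kl(p, normal_on_grid(m)))   # forward KL(p || q)
mu_rev = min(mus, key=lambda m: kl(normal_on_grid(m), p))   # reverse KL(q || p)
print("forward-KL optimum mean:", mu_fwd)        # sits between the two modes
print("reverse-KL optimum mean:", mu_rev)        # locks onto one of the modes
```

The forward-KL Student parks its mass between the modes, where the Teacher has almost no probability; the reverse-KL Student commits to a single mode, which mode being a tie broken by search order.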


14.4 Distillation vs RFT vs PPO: Three-Row Comparison

| Method | Data Source | Comparison / Optimization Signal | Applicable Scale |
|---|---|---|---|
| Distillation (SeqKD) | Sequences generated by the Teacher (offline) | Teacher output sequences (cross-entropy / soft labels) | Small-to-medium models (typically ≤ 70B), Teacher significantly stronger than Student |
| RFT (Rejection Sampling FT) | Self-sampled from current policy, filtered by reward to keep high-scoring outputs | Verifiable reward / RM filtering | Medium scale (7B–70B), reward can be automatically verified |
| PPO | Online rollout from current policy | RM score + KL constraint + GAE Advantage | Large scale (typically ≥ 7B), with sufficient RM and compute resources |

14.5 Self-Assessment Questions

L2 — Distinguishing Distillation Paradigms: both SeqKD and Token-Level KD use the Teacher model as the signal source, but fundamentally one more closely resembles reverse KL and the other more closely resembles forward KL. Please explain: (a) which corresponds to which direction of KL; (b) when the Teacher distribution is bimodal, how will the Student distributions trained by each method behave differently?

L3 — Applicability Analysis of CoT Distillation: suppose you have a 3B small model and sufficient GPUs (capable of running both the 671B Teacher and the Student simultaneously). Analyze: under what data scale and task types would directly applying GRPO have an advantage over SeqKD distillation? Give at least two substantive reasons.


Part 2 — PyTorch Code Snippets / From-Scratch PyTorch Snippets


SFT loss masking — During SFT training, compute loss only on the assistant's response tokens; mask the prompt portion with label=-100.

import torch
from torch.nn.utils.rnn import pad_sequence

class SFTDataCollator:
    """
    Masks prompt tokens with label=-100 so loss only applies to assistant tokens.
    """
    def __init__(self, tokenizer):
        self.pad_id = tokenizer.pad_token_id or 0

    def __call__(self, batch):
        input_ids, labels, attention_mask = [], [], []
        for sample in batch:  # each sample: dict with 'input_ids' and 'prompt_length'
            ids = torch.tensor(sample["input_ids"], dtype=torch.long)
            prompt_len = sample["prompt_length"]
            lab = ids.clone()
            lab[:prompt_len] = -100  # mask prompt tokens
            input_ids.append(ids)
            labels.append(lab)
            # build the mask from true lengths, so real tokens equal to pad_id stay attended
            attention_mask.append(torch.ones_like(ids))
        # dynamic padding to the longest sequence in the batch
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.pad_id)
        labels = pad_sequence(labels, batch_first=True, padding_value=-100)
        attention_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)
        return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}

# --- Usage example ---
collator = SFTDataCollator(type("Tok", (), {"pad_token_id": 0})())
toy_batch = [
    {"input_ids": [10, 20, 30, 40, 50], "prompt_length": 3},  # first 3 tokens are the prompt
    {"input_ids": [11, 21, 31], "prompt_length": 2},
]
out = collator(toy_batch)
print("input_ids:\n", out["input_ids"])
print("labels (prompt positions = -100):\n", out["labels"])
# labels: tensor([[-100, -100, -100,   40,   50],
#                 [-100, -100,   31, -100, -100]])   (padding positions are also -100)

DPO loss — Compute the Direct Preference Optimization loss from log-probabilities of the policy and reference models.

import torch
import torch.nn.functional as F

def get_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """
    Computes per-token log-probs and sums over the sequence dimension.
    logits: (B, T, V),  labels: (B, T),  mask: (B, T)  (1=valid, 0=padding)
    Returns scalar log-prob per sample.
    Not decorated with torch.no_grad(): the policy log-probs must keep gradients
    for DPO training. Wrap only the reference-model call in `with torch.no_grad():`.
    """
    # shift: predict next token
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    shift_mask = mask[:, 1:].contiguous()
    log_probs = F.log_softmax(shift_logits, dim=-1)            # (B, T-1, V)
    token_logps = log_probs.gather(-1, shift_labels.unsqueeze(-1)).squeeze(-1)  # (B, T-1)
    return (token_logps * shift_mask).sum(dim=-1)               # (B,)

def dpo_loss(
    policy_logps_chosen: torch.Tensor,
    policy_logps_rejected: torch.Tensor,
    ref_logps_chosen: torch.Tensor,
    ref_logps_rejected: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """
    DPO loss: L = -E[ log σ( β·(log π_θ/π_ref)_chosen - β·(log π_θ/π_ref)_rejected ) ]
    """
    log_ratio_chosen = policy_logps_chosen - ref_logps_chosen
    log_ratio_rejected = policy_logps_rejected - ref_logps_rejected
    loss = -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected)).mean()
    return loss

# --- Example ---
B, T, V = 4, 10, 100
logits = torch.randn(B, T, V)
labels = torch.randint(0, V, (B, T))
mask = torch.ones(B, T)

logps = get_logps(logits, labels, mask)  # (B,)
# splitting chosen / rejected is done on the caller side
policy_logps_chosen, policy_logps_rejected = logps[:2], logps[2:]
ref_logps_chosen, ref_logps_rejected = logps[:2] - 0.1, logps[2:] + 0.05

loss = dpo_loss(policy_logps_chosen, policy_logps_rejected,
                ref_logps_chosen, ref_logps_rejected, beta=0.1)
print("DPO loss:", loss.item())

Reward Model — Replace the LM head on a pretrained LLM backbone with a scalar linear head and train with Bradley-Terry loss.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """
    Reward model: LLM backbone + scalar linear head on last valid hidden state.
    """
    def __init__(self, model_name: str = "Qwen/Qwen2.5-0.5B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)  # scalar reward

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state  # (B, T, H)
        # hidden state of the last valid token; this index assumes right padding
        last_idx = attention_mask.sum(dim=1) - 1  # (B,)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]  # (B, H)
        reward = self.reward_head(last_hidden).squeeze(-1)  # (B,)
        return reward

def bradley_terry_loss(rewards_chosen, rewards_rejected):
    """
    Bradley-Terry loss: L = -log σ(r_chosen - r_rejected)
    BT loss: higher reward for preferred responses.
    """
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# --- Training example ---
device = "cpu"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
rm = RewardModel("Qwen/Qwen2.5-0.5B").to(device)

chosen_text = ["The answer is 42.", "It is safe to proceed."]
rejected_text = ["I don't know.", "No, never do that."]
tok_chosen = tokenizer(chosen_text, return_tensors="pt", padding=True, truncation=True)
tok_rejected = tokenizer(rejected_text, return_tensors="pt", padding=True, truncation=True)

r_chosen = rm(tok_chosen["input_ids"], tok_chosen["attention_mask"])
r_rejected = rm(tok_rejected["input_ids"], tok_rejected["attention_mask"])
loss = bradley_terry_loss(r_chosen, r_rejected)
print("BT loss:", loss.item())

PPO complete loss — single-step actor-critic loss: clipped surrogate + clipped value loss + entropy bonus + approx_kl diagnostic (token-level).

import torch
import torch.nn.functional as F

def ppo_actor_critic_loss(
    logp, old_logp, advantages, returns, values, old_values, entropy, mask,
    clip_eps=0.2, vf_clip=0.2, vf_coef=0.5, ent_coef=0.01,
):
    """
    Token-level PPO loss: clipped policy surrogate + clipped value loss + entropy bonus.
    All tensors (B, T); mask marks valid response tokens (1=valid).
    logp/old_logp: log π(a_t|s_t) under the current / old policy; advantages: GAE A_t;
    returns: R_t; values/old_values: current / old critic predictions.
    """
    def masked_mean(x):                                         # average over valid tokens only
        return (x * mask).sum() / mask.sum().clamp(min=1)

    # --- policy loss: clipped surrogate (pessimistic lower bound) ---
    ratio = torch.exp(logp - old_logp)                          # pi_theta / pi_theta_old, (B,T)
    pg_loss = -torch.min(ratio * advantages,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)

    # --- clipped value loss (guards against critic jumps) ---
    v_clipped = old_values + torch.clamp(values - old_values, -vf_clip, vf_clip)
    v_loss = 0.5 * torch.max((values - returns) ** 2, (v_clipped - returns) ** 2)

    # --- total = policy + c_vf * value - c_ent * entropy ---
    loss = masked_mean(pg_loss) + vf_coef * masked_mean(v_loss) - ent_coef * masked_mean(entropy)

    # --- diagnostics: approx_kl via k3 = (r-1) - log r (here r = pi_theta/pi_theta_old,
    #     estimating KL(pi_old || pi_theta); estimator rationale in §9.4, but note the r
    #     convention is inverted vs §9.4's r = pi_ref/pi_theta) ---
    with torch.no_grad():
        log_ratio = logp - old_logp
        approx_kl = masked_mean((ratio - 1) - log_ratio)        # >= 0, for early-stop / adaptive KL
        clip_frac = masked_mean((torch.abs(ratio - 1) > clip_eps).float())
    return loss, {"approx_kl": approx_kl.item(), "clip_frac": clip_frac.item()}

# --- Toy example ---
torch.manual_seed(0)
B, T = 2, 5
logp       = (torch.randn(B, T) * 0.1).requires_grad_(True)
old_logp   = logp.detach() + torch.randn(B, T) * 0.05           # behavior (old) policy
advantages = torch.randn(B, T); advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
returns    = torch.randn(B, T)
values     = torch.randn(B, T, requires_grad=True)
old_values = values.detach() + torch.randn(B, T) * 0.1
entropy    = torch.rand(B, T)                                   # per-token policy entropy
mask       = torch.ones(B, T); mask[1, 3:] = 0                  # second row's tail is padding

loss, logs = ppo_actor_critic_loss(logp, old_logp, advantages, returns, values, old_values, entropy, mask)
loss.backward()
print("PPO loss:", round(loss.item(), 4), "| diag:", {k: round(v, 4) for k, v in logs.items()})

GRPO advantage — Group Relative Policy Optimization: normalize rewards within the same group to produce advantages for the policy gradient update.

import torch
import torch.nn.functional as F

def compute_grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """
    Normalize rewards within a group to obtain advantages: (r - mean) / std.
    rewards: (G,) rewards of the G responses sampled for the same prompt.
    """
    mean = rewards.mean()
    std = rewards.std().clamp(min=1e-8)  # avoid division by zero when all rewards are equal
    return (rewards - mean) / std

# --- Simplified policy gradient update ---
# simulate: given policy log-probs and group advantages, compute the loss and its gradient
G = 8  # sample 8 responses per prompt

# simulated per-sequence log-probs (already summed to sequence level)
policy_logps = torch.randn(G, requires_grad=True)

# simulated rewards (e.g., from a reward model)
rewards = torch.tensor([1.2, 0.5, 2.0, 0.3, 1.8, 0.1, 1.5, 0.9])

advantages = compute_grpo_advantages(rewards)
print("Advantages:", advantages)

# policy gradient loss = -E[advantage * log_prob]  → maximize log-prob for high-advantage responses
grpo_loss = -(advantages.detach() * policy_logps).mean()
grpo_loss.backward()
print("GRPO loss:", grpo_loss.item())
print("policy_logps.grad:", policy_logps.grad)

GRPO token-level loss — broadcast group advantage to tokens + clipped surrogate + per-token K3 KL (no critic, no GAE; token-level averaging, cf. §9.4 and DAPO).

import torch

def grpo_token_loss(logp, old_logp, ref_logp, group_adv, mask, clip_eps=0.2, beta_kl=0.04):
    """
    Token-level GRPO loss: per-sequence group advantage broadcast to tokens
    + clipped surrogate + per-token K3 KL (no critic, no GAE).
    logp/old_logp/ref_logp: (B, T) log-prob of the taken token under current / old / reference policy
    group_adv: (B,) within-group normalized advantage A_i (see compute_grpo_advantages above), broadcast per sequence
    mask: (B, T) 1=valid response token
    """
    adv = group_adv.unsqueeze(1)                                # (B,1) -> broadcast to (B,T)
    # clipped surrogate (same clip as PPO, group-relative advantage)
    ratio = torch.exp(logp - old_logp)                          # pi_theta / pi_theta_old
    pg = -torch.min(ratio * adv, torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # per-token K3 KL: r = pi_ref/pi_theta, k3 = (r-1) - log r >= 0 (same convention as §9.4)
    log_r = ref_logp - logp                                     # log(pi_ref / pi_theta)
    k3 = torch.exp(log_r) - 1 - log_r
    per_token = pg + beta_kl * k3
    # token-level averaging convention borrowed from DAPO (§3.3) so long-CoT gradients are not diluted;
    # note the KL term here is GRPO-style (beta>0), not DAPO (which sets beta=0)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

# --- Toy example ---
torch.manual_seed(0)
B, T = 4, 6                          # 4 sampled responses for one prompt
logp      = (torch.randn(B, T) * 0.1).requires_grad_(True)
old_logp  = logp.detach() + torch.randn(B, T) * 0.02
ref_logp  = logp.detach() + torch.randn(B, T) * 0.05
rewards   = torch.tensor([1.2, 0.3, 1.8, 0.5])                  # one scalar reward per response
group_adv = (rewards - rewards.mean()) / rewards.std().clamp(min=1e-8)   # within-group normalization
mask      = torch.ones(B, T); mask[1, 4:] = 0

loss = grpo_token_loss(logp, old_logp, ref_logp, group_adv, mask)
loss.backward()
print("GRPO token-level loss:", round(loss.item(), 4))

Sequence packing with cu_seqlens — Concatenate multiple variable-length sequences into a single batch, compute cu_seqlens required by Flash Attention, and correctly mask the loss over the packed output.

import torch

def pack_sequences(input_ids_list, labels_list, pad_token_id=0):
    """
    Pack variable-length sequences into one flat tensor and compute the
    cu_seqlens (cumulative sequence lengths) that Flash Attention expects.
    pad_token_id is unused here: packing concatenates sequences without padding.
    """
    # compute real lengths of each sequence
    lengths = [ids.size(0) for ids in input_ids_list]
    # cu_seqlens: [0, len_0, len_0+len_1, ...]  (int32 offsets, Flash Attention varlen format)
    cu_seqlens = torch.zeros(len(lengths) + 1, dtype=torch.int32)
    for i, l in enumerate(lengths):
        cu_seqlens[i + 1] = cu_seqlens[i] + l

    # concatenate all sequences into one flat tensor
    packed_input_ids = torch.cat(input_ids_list, dim=0)   # (total_tokens,)
    packed_labels = torch.cat(labels_list, dim=0)          # (total_tokens,)
    return packed_input_ids, packed_labels, cu_seqlens

def compute_packed_loss(logits_flat, labels_flat, cu_seqlens, ignore_index=-100):
    """
    Compute cross-entropy on a packed sequence; tokens labeled -100 are ignored.
    logits_flat: (total_tokens, V),  labels_flat: (total_tokens,)
    """
    # shift for next-token prediction: logits at position t predict the token at t+1
    shift_logits = logits_flat[:-1, :]
    shift_labels = labels_flat[1:].clone()  # clone so the caller's labels are not modified
    # mask cross-sequence targets: the last token of sequence i must not be
    # trained to predict the first token of sequence i+1
    for start in cu_seqlens[1:-1].tolist():
        if start > 0:
            # the first token of a sequence sits at shifted index start - 1
            shift_labels[start - 1] = ignore_index
    loss = torch.nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index)
    return loss

# --- Example ---
seq_a_ids = torch.tensor([101, 202, 303, 404, 505])
seq_b_ids = torch.tensor([606, 707])
seq_c_ids = torch.tensor([808, 909, 1010])

seq_a_lab = torch.tensor([-100, -100, 303, 404, 505])   # first two are prompt
seq_b_lab = torch.tensor([-100, 707])
seq_c_lab = torch.tensor([-100, 909, 1010])             # labels mirror input ids outside the prompt

packed_ids, packed_labels, cu_seqlens = pack_sequences(
    [seq_a_ids, seq_b_ids, seq_c_ids], [seq_a_lab, seq_b_lab, seq_c_lab]
)
print("packed_ids:", packed_ids)
print("cu_seqlens:", cu_seqlens)  # tensor([0, 5, 7, 10])

# simulate logits
V = 2000
logits_flat = torch.randn(packed_ids.size(0), V)
loss = compute_packed_loss(logits_flat, packed_labels, cu_seqlens)
print("Packed loss:", loss.item())

KL divergence penalty — In PPO/RLHF reward shaping, compute the per-token KL penalty between the policy and reference models.

import torch
import torch.nn.functional as F

def compute_kl_penalty(
    policy_logits: torch.Tensor,
    ref_logits: torch.Tensor,
    mask: torch.Tensor,
) -> torch.Tensor:
    """
    Per-token KL divergence KL(π_θ || π_ref), averaged over the valid tokens of
    each sequence and then over the batch.
    policy_logits / ref_logits: (B, T, V),  mask: (B, T), 1 = valid, 0 = padding
    """
    policy_logps = F.log_softmax(policy_logits, dim=-1)  # (B, T, V)
    ref_logps = F.log_softmax(ref_logits, dim=-1)        # (B, T, V)
    # KL(p||q) = sum_p p(x) * [log p(x) - log q(x)] = E_p[log p - log q]
    policy_probs = policy_logps.exp()
    token_kl = (policy_probs * (policy_logps - ref_logps)).sum(dim=-1)  # (B, T)
    # masked mean
    kl_per_seq = (token_kl * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # (B,)
    return kl_per_seq.mean()  # scalar

# --- Used in PPO reward shaping ---
B, T, V = 2, 8, 1000
policy_logits = torch.randn(B, T, V)
ref_logits = torch.randn(B, T, V)
mask = torch.ones(B, T); mask[1, 6:] = 0  # second sequence has padding in the latter half

kl = compute_kl_penalty(policy_logits, ref_logits, mask)
print("KL penalty:", kl.item())

# PPO reward shaping: r = r_raw - beta * KL
beta_kl = 0.05
shaped_reward = 1.5 - beta_kl * kl  # used at batch level
print("Shaped reward:", shaped_reward.item())

Rejection Sampling Fine-tuning (RFT) — Sample N responses from the policy model, score them with a reward function, keep the highest-scoring response as the SFT target for fine-tuning.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def rejection_sampling_finetune(model, tokenizer, prompts, reward_fn, N=4, max_new_tokens=64):
    """
    RFT loop: for each prompt, sample N responses, score them with reward_fn,
    and use the top-1 response as the SFT target.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for prompt in prompts:
        # ---- Sampling phase (eval mode so dropout does not perturb generation) ----
        model.eval()
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        all_completions, all_rewards = [], []
        with torch.no_grad():
            for _ in range(N):
                out = model.generate(input_ids, max_new_tokens=max_new_tokens,
                                     do_sample=True, temperature=0.8, top_p=0.95)
                gen_ids = out[0, input_ids.size(1):]           # keep generated portion only
                text = tokenizer.decode(gen_ids, skip_special_tokens=True)
                reward = reward_fn(prompt, text)               # scalar reward
                all_completions.append(gen_ids)
                all_rewards.append(reward)

        # ---- Select best response ----
        best_idx = int(torch.tensor(all_rewards).argmax())
        best_ids = all_completions[best_idx]

        # ---- SFT phase (back in train mode; compute loss on best response) ----
        model.train()
        full_ids = torch.cat([input_ids[0], best_ids]).unsqueeze(0)  # (1, T)
        labels = full_ids.clone()
        labels[0, :input_ids.size(1)] = -100  # mask prompt tokens
        logits = model(input_ids=full_ids).logits
        loss = F.cross_entropy(logits[:, :-1, :].reshape(-1, logits.size(-1)),
                               labels[:, 1:].reshape(-1), ignore_index=-100)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"RFT loss: {loss.item():.4f}, best reward: {all_rewards[best_idx]:.4f}")

# --- Simple reward function ---
def dummy_reward_fn(prompt, response):
    """Reward: longer is better (demo only)."""
    return float(len(response))

# Run
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
prompts = ["Explain gravity in one sentence.", "What is 2+2?"]
rejection_sampling_finetune(model, tokenizer, prompts, dummy_reward_fn, N=4)

Part 3 — Interview Question Bank

━━━ L1 Basic ━━━


Q1. What problems do pre-training and post-training each solve? What is the standard pipeline?

Answer: Pre-training has the model learn general language capabilities, world knowledge, and a foundation for reasoning from massive unlabeled text, via self-supervised next-token prediction. Post-training transforms this "knowledgeable but unruly" base model into an assistant that follows instructions and is helpful, safe, and aligned with human values. The standard pipeline is: 1) Supervised Fine-Tuning (SFT), which fine-tunes the model on high-quality instruction-response pairs; 2) Preference Alignment, which uses either RLHF (train a reward model on preference comparisons, then optimize the policy with PPO under a KL constraint) or an offline method such as DPO to further optimize model behavior on human preference data, often repeated over multiple rounds.

Follow-up: Why can't a single stage (e.g., SFT alone) complete the full transformation from a pre-trained model to a usable assistant?

Q2. What is loss masking in SFT? Why is loss computed only on assistant tokens?

Answer: Loss masking means that when computing the SFT loss, only the prediction loss for tokens corresponding to the assistant's response (i.e., the part the model needs to learn to generate) is included in the total loss, while the loss for the input/user instruction portion is ignored. This focuses the model's optimization objective on "learning how to respond correctly" rather than "parroting the user's input." Without masking the input portion, the model might waste learning capacity memorizing input formats instead of focusing on generating high-quality responses.
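The masking rule described above can be sketched in a few lines. This is a minimal illustration, assuming each token already carries a role tag; `build_sft_labels`, `roles`, and the toy token ids are hypothetical names and values for this sketch, not part of any library. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # PyTorch cross_entropy skips targets with this value by default

def build_sft_labels(token_ids, roles):
    """Copy token ids into labels, masking every non-assistant token.

    token_ids: list[int] for one tokenized conversation
    roles:     list[str] of the same length ("system" / "user" / "assistant");
               hypothetical per-token role tags produced by the chat-template code
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

# Toy conversation: 2 system tokens, 3 user tokens, 4 assistant tokens.
token_ids = [11, 12, 21, 22, 23, 31, 32, 33, 34]
roles = ["system"] * 2 + ["user"] * 3 + ["assistant"] * 4
labels = build_sft_labels(token_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32, 33, 34]
```

The masked positions still feed the forward pass as context; they are only excluded from the loss.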

Follow-up: With loss masking, the user instruction tokens never contribute to the loss. Does this mean the model cannot learn to "understand" instructions during SFT? Please explain.

Q3. What is the training objective of a Reward Model? What is the Bradley-Terry model?

Answer: The training objective of a Reward Model (RM) is to output a scalar score for a given (prompt, response) pair that reflects human preference rankings for response quality. It learns from pairs of responses (chosen vs. rejected) to the same prompt. The Bradley-Terry model is a probabilistic model for pairwise comparisons; it assumes that the probability of preferring the chosen response is the sigmoid of the reward difference, P(y_w ≻ y_l | x) = σ(r(x, y_w) - r(x, y_l)). The RM is trained by minimizing the negative log-likelihood of the observed preferences, L = -log σ(r_chosen - r_rejected), which pushes up the reward margin of the chosen response over the rejected one.
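To make the Bradley-Terry link between reward margin and preference probability concrete, here is a small pure-Python sketch. The function names are illustrative only; the formula is the standard sigmoid form P = σ(r_chosen - r_rejected).

```python
import math

def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """P(chosen is preferred) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one observed preference pair."""
    return -math.log(bt_preference_prob(r_chosen, r_rejected))

# A larger reward margin means a higher win probability and a lower loss.
for margin in (0.0, 1.0, 3.0):
    p = bt_preference_prob(margin, 0.0)
    print(f"margin={margin:.1f}  P(chosen wins)={p:.3f}  loss={bt_nll(margin, 0.0):.3f}")
```

Only the difference of the two rewards matters, so adding the same constant to both scores leaves the probability unchanged.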

Follow-up: If the preference rankings in human annotation data are inconsistent or noisy, how does this affect the Reward Model trained under the Bradley-Terry model?

Q4. What is the role of the KL penalty in RLHF? How is β tuned?

Answer: During the reinforcement learning phase of RLHF, the policy model (the LLM being optimized) maximizes rewards from the Reward Model when generating responses. However, this can cause the model to generate strange, unnatural text that deviates from its original capability distribution in pursuit of high scores. The KL penalty term is added to the optimization objective by computing the KL divergence between the current policy and the initial SFT model (reference policy). Its role is to constrain the optimized model from drifting too far from the initial model, thereby preserving language quality and diversity. β is a hyperparameter that controls the strength of the KL penalty: larger β imposes heavier penalties on deviation, making the model more conservative and closer to the initial model; smaller β gives the model more freedom, potentially pursuing higher rewards but at greater risk.

Follow-up: The KL penalty computes the divergence over the full sequence distribution. What challenges does this pose in practice? Are there more efficient or more local approximation methods?

Q5. What is DPO? What is the core difference from RLHF?

Answer: DPO (Direct Preference Optimization) is a method that directly optimizes a language model using human preference data. Through a clever mathematical transformation, it merges the two steps of RLHF — "train a Reward Model, then use it for RL optimization" — into a single supervised learning loss function. In DPO, the model directly learns to translate preference rankings into adjustments of response probabilities. The core difference is: RLHF is "explicit," involving a separate RM training step and an online RL optimization process (e.g., PPO); DPO is "implicit" — it bypasses explicit RM training and online sampling, directly optimizing the policy through an offline contrastive loss, and is generally simpler and more stable.

Follow-up: A major criticism of DPO is that it heavily depends on the quality of preference data. Why might its requirements for data quality be higher than those of RLHF?

Q6. What is sequence packing? What are its benefits and pitfalls?

Answer: Sequence packing is a training efficiency optimization technique. It concatenates multiple short sequences (e.g., multiple different instruction-response pairs) — using special separators such as <EOS> followed by the start token of a new sequence — into a single long sequence that reaches the model's maximum context length, then trains on this as a whole. Benefits: significantly improves GPU utilization, reduces computation waste from padding short sequences, and speeds up training. The main pitfalls are: 1) careful attention mask design is required to prevent the model from "seeing" information from other short sequences within the same packed sequence during training (i.e., cross-sequence attention leakage), which can cause data contamination or learning bias; 2) the model may be sensitive to sequence ordering.
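One way to picture the attention-mask requirement is to build the block-diagonal causal mask explicitly from `cu_seqlens`. The sketch below is only an illustration under that assumption: production kernels (such as Flash Attention's variable-length path) consume `cu_seqlens` directly and never materialize this matrix, and the helper names are hypothetical. It also shows position ids restarting at each sequence boundary, a common companion to the mask.

```python
import torch

def packed_attention_mask(cu_seqlens: torch.Tensor) -> torch.Tensor:
    """Boolean (total, total) mask: True where query i may attend to key j."""
    total = int(cu_seqlens[-1])
    lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).long()
    # seq_id[t] = index of the sequence that packed position t belongs to
    seq_id = torch.repeat_interleave(torch.arange(len(lengths)), lengths)
    same_seq = seq_id.unsqueeze(1) == seq_id.unsqueeze(0)   # block-diagonal part
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    return same_seq & causal

def packed_position_ids(cu_seqlens: torch.Tensor) -> torch.Tensor:
    """Position ids that restart from 0 at every sequence boundary."""
    lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
    return torch.cat([torch.arange(n) for n in lengths])

cu_seqlens = torch.tensor([0, 3, 5], dtype=torch.int32)   # two sequences: lengths 3 and 2
mask = packed_attention_mask(cu_seqlens)
pos = packed_position_ids(cu_seqlens)
print(mask.int())
print(pos)
```

In the printed mask, row 3 (the first token of the second sequence) has zeros in columns 0 to 2, which is exactly the cross-sequence leakage being blocked.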

Follow-up: In sequence packing, if two concatenated sequences are on completely unrelated topics (e.g., a math problem and a poem), what specific harm does cross-sequence attention leakage cause?

Q7. What is reward hacking? Give two examples.

Answer: Reward hacking refers to the model finding ways to "cheat" or "game" the system to obtain higher reward scores, even though the generated responses do not actually meet the true human goals of being helpful, honest, and harmless. It is over-optimization or exploitation of the reward function. Example 1: If the RM favors longer responses, the model may learn to generate verbose but hollow replies. Example 2: If the RM gives high scores to responses containing certain specific "safety" phrases (e.g., "As an AI assistant, I must comply with…"), the model may learn to mechanically insert such boilerplate into all responses, regardless of whether it is actually needed.

Follow-up: Beyond improving the Reward Model itself, what strategies can be employed during RLHF training to mitigate reward hacking?

Q8. What tensions exist among the Helpful / Harmless / Honest triad in alignment?

Answer: Helpful, Harmless, and Honest have inherent tensions among them. For example, a model that prioritizes Harmlessness excessively may refuse to answer reasonable but sensitive questions (e.g., a doctor discussing medical symptoms) out of over-caution, thereby compromising Helpfulness. A model pursuing extreme Honesty may expose unverified information or user privacy in responses, thereby compromising Harmlessness. Conversely, fabricating answers to be Helpful compromises Honesty. An ideally aligned model must dynamically balance these three objectives across different contexts; there is no fixed perfect solution.

Follow-up: Can you provide a concrete scenario in which a model unavoidably sacrifices Helpfulness and Honesty in order to achieve Harmlessness?

━━━ L2 Intermediate ━━━


Q9. What is the core difference between GRPO and PPO? How many models does GRPO require?

Answer: GRPO (Group Relative Policy Optimization) and PPO (Proximal Policy Optimization) are both policy gradient algorithms, but GRPO simplifies the RLHF training setup. The core difference: PPO maintains four models (a policy model, a reference model, a value model or critic, and a reward model), whereas GRPO drops the value model. GRPO samples a group of responses for the same prompt and uses the group's mean reward as the baseline (normalizing by the group's standard deviation) to estimate each response's advantage. GRPO therefore typically needs three models: the policy, the reference model for the KL term, and the reward model. This falls to two when a rule-based verifier replaces the reward model or when the KL term is dropped (β = 0, as in DAPO).

Follow-up: GRPO uses the group's average reward as a baseline to estimate the advantage function. What kind of bias might this introduce, and how does it affect training stability?

Q10. What problems do IPO, KTO, ORPO, and SimPO each solve with respect to DPO?

Answer: These methods are all improvements or variants of DPO:

  • IPO (Identity Preference Optimization): Addresses the problem in DPO where KL regularization breaks down and overfitting occurs when preferences are nearly deterministic; it adopts a bounded squared-loss objective (see §7.1) for more robust optimization.
  • KTO (Kahneman-Tversky Optimization): Addresses DPO's requirement for strictly paired preference data (chosen/rejected pairs). KTO only requires binary labels indicating whether each response is "good" or "bad," without pairing, making data collection more flexible.
  • ORPO (Odds Ratio Preference Optimization): Attempts to merge SFT and preference alignment into a single training stage. It directly optimizes the odds ratio of the model generating a chosen response relative to a rejected response.
  • SimPO (Simple Preference Optimization): Aims to further simplify DPO by removing the dependency on a reference model, while improving optimization stability and robustness to response length by using length-normalized log-probabilities as an implicit reward and introducing a target reward margin.
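Two of the variants above are easy to contrast with DPO on a single preference pair. The sketch below uses scalar sequence log-probabilities and follows the commonly cited forms of the IPO and SimPO objectives; the parameter names `tau`, `beta`, and `gamma` and their default values are this sketch's own assumptions and may differ from the notation in §7.1.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO: squared distance of the log-ratio gap h from the target 1/(2*tau)."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1.0 / (2.0 * tau)) ** 2

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: length-normalized log-probs as implicit reward, target margin gamma,
    and no reference model."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -log_sigmoid(margin)

# IPO is minimized when the gap hits the target (here 1/(2*0.1) = 5), not at infinity.
print(ipo_loss(-10.0, -15.0, -10.0, -10.0))   # gap = 5
print(ipo_loss(-10.0, -20.0, -10.0, -10.0))   # gap = 10, overshoot is penalized
# SimPO: equal per-token log-prob gives zero implicit margin regardless of length.
print(simpo_loss(-20.0, -40.0, len_w=10, len_l=20))
```

The overshoot penalty in `ipo_loss` is what keeps the objective bounded where the DPO log-sigmoid keeps rewarding an ever larger gap.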

Follow-up: Among these methods, which has relatively the lowest requirements for training data quality or quantity? Why?

Q11. Which matters more in SFT — data quality or data quantity? How is data curation done?

Answer: In the SFT phase, data quality is generally far more important than data quantity. High-quality, diverse, accurate, and human-value-aligned instruction data, even at a smaller scale, can significantly improve model performance. Conversely, large amounts of low-quality, erroneous, or harmful data can severely contaminate the model. A typical data curation pipeline includes: 1) Source filtering: selecting trustworthy and professional sources; 2) Quality filtering: using rules or models (e.g., an RM) to filter out low-scoring, harmful, or malformatted samples; 3) Deduplication: removing duplicate or near-duplicate samples; 4) Diversity augmentation: ensuring instructions cover a wide range of tasks, difficulty levels, and domains; 5) Format normalization: standardizing the style and length distribution of responses.
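Two of the curation steps above, quality filtering and deduplication, can be sketched with the standard library alone. This is a toy illustration: the length threshold stands in for a real quality filter (rules or a scoring model), exact hashing of normalized text stands in for near-duplicate detection, and the function names and sample records are hypothetical.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())

def curate(samples, min_response_chars=20):
    """Drop too-short responses, then drop duplicates after normalization."""
    seen, kept = set(), []
    for s in samples:
        if len(s["response"].strip()) < min_response_chars:
            continue                                  # crude quality filter
        text = normalize(s["instruction"] + "\n" + s["response"])
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in seen:
            continue                                  # duplicate after normalization
        seen.add(key)
        kept.append(s)
    return kept

samples = [
    {"instruction": "What is 2+2?", "response": "2 + 2 equals 4, because adding two and two gives four."},
    {"instruction": "what is  2+2?", "response": "2 + 2 equals 4, because adding two and two gives four. "},
    {"instruction": "Say hi", "response": "Hi"},
    {"instruction": "Name a prime.", "response": "7 is a prime number since only 1 and 7 divide it."},
]
kept = curate(samples)
print(len(kept), [s["instruction"] for s in kept])
```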

Follow-up: If you could use only a single automated model (rather than humans) to evaluate and filter quality in large-scale SFT data, what type of model would you prioritize? Why?

Q12. What are the main paradigms for synthetic data generation? Where does length bias come from?

Answer: The main paradigms are: 1) Self-Instruct: having the model generate new instructions and responses from seed tasks; 2) Evol-Instruct: evolving existing instructions through multiple rounds and multiple dimensions of complexification; 3) Bootstrapping: using a powerful "teacher" model to generate training data for a "student" model (e.g., distillation); 4) Reward-guided Generation: using an RM or rules to filter/revise multiple candidate responses generated by the model. Length bias mainly originates from: 1) Model-intrinsic bias: common responses in pre-training data (e.g., technical documentation) tend to be long; 2) Reward model bias: if human annotators in the RM's training data generally prefer more detailed, longer responses, the RM will assign higher scores to longer responses, causing the model to tend toward generating longer text when optimizing the RM; 3) Generation strategy: for example, verbose enumeration to ensure all points are covered.

Follow-up: When generating synthetic data, how can the pipeline or loss function be designed to explicitly control or reduce length bias in the final responses?

Q13. What is the difference between online and offline preference learning? What scenarios is each suited for?

Answer: Online learning (e.g., the PPO phase in standard RLHF) means that the policy model generates new responses during training and receives fresh reward signals from the environment (e.g., the RM) to update the policy. Offline learning (e.g., DPO) means optimizing the model on a pre-collected, fixed preference dataset, without generating new data during training. Online learning suits scenarios that require continuous exploration, fast adaptation to new reward signals, or correction of distribution shift, but it is computationally expensive and can be unstable. Offline learning suits scenarios where data collection is expensive and a stable training pipeline is needed, but it is constrained by the fixed data distribution and may converge to a suboptimal solution.

Follow-up: In offline learning, if the preference data distribution used for training differs greatly from the data distribution encountered during deployment, what problems arise? How can this be mitigated?

Q14. What is benchmark contamination? How can it be detected?

Answer: Benchmark contamination refers to the situation where the training data of the model being evaluated already contains test questions or answers from the evaluation benchmark. The model then achieves inflated scores on that benchmark that do not reflect its true generalization capability. Detection methods include: 1) Membership inference: comparing the model's perplexity on test-set samples against its perplexity on similar samples outside the test set; unusually low perplexity on the test samples suggests memorization; 2) n-gram overlap analysis: measuring the text overlap between the training data and the test set; 3) Data provenance auditing: rigorously auditing training data sources to exclude datasets known to contain mainstream benchmark test sets (e.g., certain versions of Common Crawl); 4) Dynamic benchmark design: using regularly updated, non-public test sets.
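The n-gram overlap check can be sketched as follows. This is a minimal illustration with whitespace tokenization and a small n; real pipelines typically use a proper tokenizer, longer n-grams, and an index over the corpus rather than an in-memory set. The function names and example strings are hypothetical.

```python
def ngrams(text: str, n: int) -> set:
    """Set of word-level n-grams (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_docs, test_item: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the training docs."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)

test_item = "what is the capital of france"
leaky_corpus = ["quiz: what is the capital of france ? paris"]
clean_corpus = ["the eiffel tower is a landmark in paris"]
print(contamination_rate(leaky_corpus, test_item, n=3))
print(contamination_rate(clean_corpus, test_item, n=3))
```

A rate near 1 flags the test item as likely present in the training data; the threshold that counts as contaminated is a judgment call.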

Follow-up: Beyond data contamination, what other methodological flaws in evaluation might lead to misjudgment of a model's capabilities?

Q15. How does catastrophic forgetting manifest in post-training? How can it be mitigated?

Answer: In post-training, catastrophic forgetting manifests as the model losing the broad knowledge, language capabilities, or ability to handle diverse tasks learned during pre-training while acquiring new capabilities (e.g., instruction following, value alignment) through SFT or RLHF. For example, an aligned model may perform well on instruction following but show significant degradation in foundational capabilities such as coding, mathematics, or multilingual tasks compared to the base model. Mitigation methods include: 1) Mixed training data: mixing pre-training data or general-capability data into SFT/RLHF data; 2) Low-rank adaptation: using parameter-efficient fine-tuning methods such as LoRA to update only a small fraction of parameters; 3) Regularization: adding to the loss an L2 penalty on the deviation of the parameters from their original values (EWC is the importance-weighted version of this idea); 4) Knowledge distillation: using the original model as a teacher to constrain the output distribution of the aligned model.
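The regularization option can be sketched as a penalty on how far the parameters drift from a frozen snapshot of the original model. This is a minimal, unweighted version on a toy `nn.Linear`; EWC would additionally weight each parameter by an importance estimate, which is omitted here. The helper names and the coefficient `lam` are assumptions of this sketch.

```python
import torch
import torch.nn as nn

def snapshot(model: nn.Module) -> dict:
    """Frozen copy of the parameters before fine-tuning."""
    return {name: p.detach().clone() for name, p in model.named_parameters()}

def l2_to_init_penalty(model: nn.Module, init_params: dict) -> torch.Tensor:
    """Sum of squared deviations of trainable parameters from their initial values."""
    return sum(((p - init_params[name]) ** 2).sum()
               for name, p in model.named_parameters() if p.requires_grad)

torch.manual_seed(0)
model = nn.Linear(4, 2)                      # stand-in for the model being fine-tuned
init_params = snapshot(model)
penalty_before = l2_to_init_penalty(model, init_params).item()

with torch.no_grad():
    model.weight.add_(0.1)                   # simulate a fine-tuning update on 8 weights
penalty_after = l2_to_init_penalty(model, init_params).item()
print(penalty_before, penalty_after)

# In training the penalty is added to the task loss, e.g.
#   loss = task_loss + lam * l2_to_init_penalty(model, init_params)
# where lam is a hypothetical coefficient trading plasticity against retention.
```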

Follow-up: In parameter-efficient fine-tuning methods (e.g., LoRA), how does the choice of which layers to fine-tune (e.g., QKV projections in attention layers vs. FFN layers) differently affect the mitigation of catastrophic forgetting and the preservation of existing capabilities?

Q16. Process Reward Model (PRM) vs. Outcome Reward Model (ORM)?

Answer: An ORM (Outcome Reward Model) gives a single reward score only for the final answer or complete response generated by the model, without regard for the intermediate reasoning process. A PRM (Process Reward Model) evaluates and scores each intermediate step in solving the problem or generating the response. The advantage of PRM lies in providing denser, more fine-grained supervision signals that help guide the model toward correct step-by-step reasoning — especially valuable for complex tasks such as mathematics and logical reasoning, as it prevents the model from arriving at the correct answer via "shortcuts" with an incorrect process. The challenge is that annotation costs are extremely high, requiring human experts to evaluate each step.

Follow-up: In practice, how can data for training a PRM be collected efficiently? Is it possible to use an ORM or other models to automatically generate training labels for a PRM?

Q17. What are the limitations of MT-Bench, AlpacaEval, and Chatbot Arena respectively?

Answer:

  • MT-Bench: Uses pre-designed multi-turn conversation questions and a powerful LLM (e.g., GPT-4) as the judge. Limitations: 1) the judge model itself may be biased; 2) fixed questions make it easy to overfit; 3) cannot evaluate long-document processing or real-world complex tasks.
  • AlpacaEval: Uses a fixed instruction set; GPT-4 is used to compare the model's responses against reference responses (typically GPT-4's own responses). Limitations: 1) strongly dependent on GPT-4's preferences, which may not reflect the preferences of a broad user base; 2) risk of "self-preference," where responses stylistically similar to GPT-4 may score higher.
  • Chatbot Arena: Conducts pairwise comparisons through anonymous votes from real users, making it the most human-preference-aligned dynamic evaluation currently available. Limitations: 1) the user base may not be fully representative (skewed toward technical users); 2) high evaluation cost and slow speed; 3) uneven distribution of conversation domains.

Follow-up: If you were to design a new, more comprehensive evaluation framework for post-trained models, what different evaluation dimensions and methods would you integrate to compensate for the shortcomings of these individual benchmarks?

━━━ L3 Deep ━━━


Q18. Why is the value model (critic) in PPO difficult to train? How does GRPO sidestep this problem?

Answer: In PPO for RLHF, the value model (Critic) must accurately estimate the expected total future reward given a current state (i.e., the current prompt and partial generation history) — that is, the state value function V(s). This estimation is extremely difficult: 1) Sparse rewards: rewards are typically given only after a complete response is generated, so intermediate states lack direct supervision signals; 2) High variance: the state space for text generation is vast and complex, leading to high variance in value estimates and unstable training; 3) Non-stationarity: the policy model updates rapidly, causing the target distribution for the value function to shift continuously, increasing the difficulty of fitting. GRPO sidesteps this problem by eliminating the value model entirely. It generates a group of responses for each prompt and uses the group's average reward as a baseline to estimate each response's advantage relative to the group average. This approach avoids training a complex value network over all possible states.

Follow-up: GRPO uses the group's average reward as a baseline, which implicitly assumes that the value of all states (i.e., different generation paths for the same prompt) is equal. Under what circumstances does this assumption become unreasonable?

Q19. Theoretical derivation of DPO: walk through the derivation from the RLHF KL-constrained optimal solution to the DPO loss.

Answer:

  1. RLHF objective: We have a KL-constrained optimization objective: max_{π} E_{x∼D, y∼π(·|x)}[r(x, y)] - β * KL[π(y|x) || π_ref(y|x)], where π is the policy, π_ref is the reference policy, r is the reward function, and β > 0 sets the strength of the KL penalty.
  2. Closed-form optimal solution: Solving the above objective with respect to π yields the closed-form optimal solution: π*(y|x) = π_ref(y|x) * exp(r(x, y) / β) / Z(x), where Z(x) is the partition function (normalization constant).
  3. Inverting for the reward function: Taking logarithms on both sides and rearranging, the reward function can be expressed as a function of the policy: r(x, y) = β * log(π*(y|x) / π_ref(y|x)) + β * log(Z(x)).
  4. Substituting into the Bradley-Terry model: For a preference pair (y_w, y_l), according to the BT model, the probability that a human selects y_w is σ(r(x, y_w) - r(x, y_l)), where σ is the sigmoid function.
  5. Canceling the partition function: Substituting the reward expression from step 3 into step 4, the log(Z(x)) terms cancel in the subtraction, yielding: P(y_w ≻ y_l | x) = σ(β * log(π*(y_w|x) / π_ref(y_w|x)) - β * log(π*(y_l|x) / π_ref(y_l|x))).
  6. DPO loss: Finally, the DPO loss function maximizes the above probability (i.e., minimizes negative log-likelihood): L_DPO(θ) = -E[log σ(β * log(π_θ(y_w|x) / π_ref(y_w|x)) - β * log(π_θ(y_l|x) / π_ref(y_l|x)))], where π_θ is the policy being optimized.
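
As a numeric check of step 6, the loss for a single preference pair can be computed from four sequence log-probabilities. This is a sketch under stated assumptions: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, and each `logp` is assumed to be the summed token log-probability of the full response.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [log-ratio of chosen - log-ratio of rejected])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the margin is 0 and the loss is log 2.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy raised the chosen response and lowered the rejected one.
loss_after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Note that Z(x) never appears: only the difference of log-ratios enters the loss, which is exactly the cancellation in step 5.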

Follow-up: In the above derivation, we assume that the reward function r can be expressed in terms of the policy π (step 3). What are the implicit conditions for this assumption to hold?

Q20. What is the difference between mode collapse and reward hacking? How can mode collapse be detected?

Answer: Reward hacking is when the model finds "shortcuts" to obtain high rewards while producing outputs that do not match human intent (e.g., generating verbose filler). Mode collapse refers to a sharp drop in the diversity of the model's outputs, where the model tends to repeatedly generate a few types of high-reward, safe, or stereotyped responses, losing the richness and creativity expected when responding to diverse prompts. It is a common failure mode in generative models. Methods for detecting mode collapse include: 1) Diversity metrics: computing lexical diversity (e.g., distinct-n) and variance in semantic embeddings of responses generated for a set of prompts, compared against a baseline model; 2) Reward distribution analysis: if the model's reward score distribution becomes highly concentrated (high mean, low variance), it may indicate that the model has found a few "high-scoring templates"; 3) Manual sampling inspection: randomly sampling multiple groups of responses and observing whether their content, structure, and word choices are highly similar.
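
The distinct-n metric from detection method 1 can be sketched as follows. Whitespace tokenization and pooling n-grams over all responses are simplifying assumptions made here; real evaluations usually use the model's tokenizer and compare against a baseline model on the same prompts.

```python
def distinct_n(responses, n=2):
    """Unique n-grams divided by total n-grams, pooled over all responses."""
    total, unique = 0, set()
    for text in responses:
        tokens = text.split()  # whitespace tokenization: an assumption
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

collapsed = ["I am happy to help with that"] * 4
varied = [
    "I am happy to help with that",
    "Sure, here is a short plan",
    "Let us break the problem down",
    "The answer depends on context",
]
collapsed_score = distinct_n(collapsed)
varied_score = distinct_n(varied)
```

A score that falls sharply between checkpoints, on a fixed prompt set and sampling temperature, is a cheap early signal worth confirming with manual inspection.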

Follow-up: Increasing the KL penalty coefficient β is an effective way to mitigate mode collapse in RLHF training. Beyond this, what methods from a data perspective or algorithmic perspective can encourage diversity?

Q21. What is alignment tax? How does weight averaging mitigate it, and what is the principle?

Answer: Alignment tax refers to the performance cost paid on certain general capabilities not directly optimized (e.g., basic language modeling, complex reasoning) — i.e., degradation in these capabilities — as the model undergoes post-training alignment to achieve better instruction following, safety, and harmlessness. Weight averaging is a simple and effective mitigation technique. It averages the weights of multiple models produced at different training checkpoints or with different random seeds to obtain a smoother, more generalizable final model. The principle is: 1) Variance reduction: averaging reduces performance instability caused by training fluctuations or randomness in any single model; 2) Exploring better solutions: different training snapshots may reside in different "good" regions of the loss landscape, and averaging may find an intermediate point that performs well across dimensions; 3) Implicit regularization effect, preventing the model from overfitting to specific patterns in training data (including biases that may exist in alignment data).
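
A minimal sketch of uniform or weighted averaging over checkpoints follows. Flat Python lists stand in for parameter tensors, and the names `average_weights`, `ckpt_a`, and `ckpt_b` are hypothetical; with a real framework the same loop would run over each checkpoint's parameter dictionary.

```python
def average_weights(state_dicts, coeffs=None):
    """Element-wise weighted average of parameter dicts with identical keys and shapes."""
    if coeffs is None:
        coeffs = [1.0 / len(state_dicts)] * len(state_dicts)
    assert abs(sum(coeffs) - 1.0) < 1e-9, "coefficients should sum to 1"
    averaged = {}
    for name in state_dicts[0]:
        # Group the i-th element of this parameter across all checkpoints.
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(c * w for c, w in zip(coeffs, col)) for col in columns]
    return averaged

ckpt_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
ckpt_b = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}
merged = average_weights([ckpt_a, ckpt_b])
```

Averaging only makes sense when the checkpoints share an architecture and lie in a connected low-loss region, which is typically the case for checkpoints fine-tuned from the same initialization.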

Follow-up: In specific implementations of weight averaging — such as Stochastic Weight Averaging (SWA) and Model Soups — how do their strategies and assumptions differ? Which is likely more effective at mitigating alignment tax?

Q22. What are the key design decisions in DeepSeek-R1's training pipeline? What is the role of cold-start SFT?

Answer: According to the DeepSeek-R1 paper (arXiv:2501.12948), it is important to distinguish between two models:

DeepSeek-R1-Zero: Applies pure RL (GRPO) directly on DeepSeek-V3-Base, completely skipping the SFT phase. The paper states: "we bypass the conventional supervised fine-tuning (SFT) phase before RL training." R1-Zero demonstrates that reasoning capabilities can emerge from pure RL, but it suffers from poor readability and language mixing.

DeepSeek-R1: Four-stage pipeline (paper Section 3):

  1. Cold-start SFT: Collects thousands of cold-start data samples with human-conversational-style chain-of-thought, then fine-tunes DeepSeek-V3-Base via SFT to produce Dev1. Note: this is "cold-start" rather than standard large-scale SFT; the data volume is small (thousands).
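  2. Reasoning-oriented RL (Stage 1 RL): Applies GRPO on Dev1 for reasoning-task reinforcement learning (rule-based rewards: accuracy + format, plus a language-consistency reward to curb language mixing) to produce Dev2.
  3. Rejection-sampling SFT: Samples reasoning data from Dev2 via rejection sampling, merges it with non-reasoning data (about 800k samples in total), and fine-tunes DeepSeek-V3-Base on the combined set to produce Dev3. This stage also improves general capabilities such as writing.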
  4. Full-scenario RL (Stage 2 RL): Applies comprehensive RL on Dev3, with reward signals combining rule-based (reasoning) + RM (general dialogue, safety), yielding the final DeepSeek-R1.

Role of cold-start: Resolves R1-Zero's readability and language-mixing issues, providing a more well-structured behavioral foundation for subsequent RL and making RL exploration more efficient.

Follow-up: The data used for cold-start SFT has very high quality requirements. If this data contains errors or biases, what cascading effects would this have on the exploration in subsequent reinforcement learning stages?

Q23. How does the self-critique-revision mechanism in RLAIF and Constitutional AI work?

Answer: The core idea of RLAIF (Reinforcement Learning from AI Feedback) and Constitutional AI is to use AI models themselves to generate preference feedback or perform corrections, reducing dependence on human annotation. The self-critique-revision mechanism typically involves a loop: 1) Generate initial response: given a prompt, the model first generates a preliminary response. 2) Self-critique: the model (or a separate critic model) reviews the initial response against a set of predefined "constitutional" principles (e.g., "answers should be objective," "avoid harmful content") and identifies potential violations. 3) Revise response: the model revises the initial response based on the generated critique to produce a new version that better conforms to the constitutional principles. 4) (Optional) Use for training: the (initial response, revised response) pair is used as a (rejected, chosen) pair to train an RM or to directly perform DPO-style optimization. This mechanism allows the model to self-improve and align without requiring real-time human intervention.
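
The four-step loop above can be sketched with the model calls abstracted away. Everything here is hypothetical scaffolding: `generate`, `critique`, and `revise` are placeholder callables for model invocations, and the toy stand-ins exist only so the sketch runs without a model.

```python
def critique_and_revise(prompt, generate, critique, revise, principles):
    """Run one critique-revision pass per principle and return a preference pair."""
    initial = generate(prompt)
    current = initial
    for principle in principles:
        feedback = critique(prompt, current, principle)
        if feedback:  # empty feedback means no violation was found
            current = revise(prompt, current, feedback)
    return {"prompt": prompt, "rejected": initial, "chosen": current}

# Toy stand-ins (not real model calls).
def toy_generate(prompt):
    return "You should definitely do it."

def toy_critique(prompt, response, principle):
    return "overconfident" if "definitely" in response else ""

def toy_revise(prompt, response, feedback):
    return response.replace("definitely ", "consider whether to ")

pair = critique_and_revise(
    "Should I quit my job?", toy_generate, toy_critique, toy_revise,
    principles=["avoid overconfident advice"],
)
```

Pairs where `chosen` equals `rejected` (no principle triggered a revision) carry no preference signal and would normally be dropped before RM or DPO training.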

Follow-up: Could this self-revision mechanism cause the model to fall into a kind of "alignment loop"? For example, in pursuing a "safer" response, the model might through multiple rounds of revision produce responses that become increasingly conservative and even useless.

Q24. How are iterative RLHF and online DPO similar and different? How can distribution mismatch be resolved?

Answer: Both address the problem of mismatch between the training data distribution (preference pairs generated by an old policy) and the current policy distribution that arises in offline methods like standard DPO. Similarities: both iteratively use the current policy model to generate new data (or responses) and update the model with this new data, so that the training data distribution tracks the policy as it changes. Differences: Iterative RLHF typically refers to alternating between "online data generation (sampling with the current policy and scoring with an RM)" and "updating the policy with new data (possibly using PPO or DPO)." Online DPO more specifically refers to generating a set of responses with the current policy at each training iteration, having an RM or human select preference pairs, and then directly computing the DPO loss and updating the model using this newly generated, distribution-matched preference data, skipping the explicit RL step.
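
One round of on-policy pair construction, as used by online DPO, might look like the sketch below. The callables `sample` (draw one response from the current policy) and `score` (reward model or other judge) are hypothetical, and picking the best and worst of k samples is one common heuristic rather than the only option.

```python
import itertools

def build_online_pairs(prompts, sample, score, k=4):
    """Sample k responses per prompt from the current policy and keep (best, worst)."""
    pairs = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(k)]
        ranked = sorted(candidates, key=lambda y: score(prompt, y))
        worst, best = ranked[0], ranked[-1]
        if score(prompt, best) > score(prompt, worst):  # skip ties: no preference signal
            pairs.append((prompt, best, worst))
    return pairs

# Toy stand-ins: a fixed cycle of drafts and a lookup table of scores.
drafts = itertools.cycle(["draft A", "draft B", "draft C"])
toy_scores = {"draft A": 0.2, "draft B": 0.9, "draft C": 0.5}
pairs = build_online_pairs(
    ["q"], sample=lambda p: next(drafts), score=lambda p, y: toy_scores[y], k=3
)
```

Because the pairs are regenerated from the current policy each round, the DPO loss is always computed on responses the policy actually produces, which is the point of the online variant.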

Follow-up: When generating preference pairs using the current policy in Online DPO, what sampling temperature should be used? Why is this parameter choice important?

Q25. Scaling laws in post-training: how do data volume and model scale affect alignment quality? How do the optimal compute allocation strategies for SFT and RL differ?

Answer: Post-training scaling laws differ from those in pre-training. For data volume: in the SFT phase, there are diminishing returns; high-quality data is more important than large volumes of low-quality data, and performance improvements slow after reaching a certain scale. For model scale: larger base models generally have stronger alignment potential and can better understand complex instructions and values, but the amount of high-quality data needed to achieve the same alignment level may not scale proportionally. Optimal allocation strategy for SFT vs. RL: SFT yields more "data-efficient" returns, and it is typically cost-effective to invest more compute early in a project to quickly establish instruction-following capability. RL (e.g., RLHF) is more "compute-intensive," with its returns manifesting in fine-grained behavioral adjustments and value alignment, requiring more online sampling and iteration. A common strategy is: use most of the compute budget to train a sufficiently good base model and SFT model, then use the remaining, relatively smaller compute budget for a few key RL iterations for fine-tuning, since the marginal returns of RL may diminish rapidly.

Follow-up: If we treat both model scale and data volume as resources, in the post-training phase, do you think it is more likely to yield a superior assistant model in real-world applications to invest in aligning a 70B model, or to invest in aligning a 7B model with a larger volume of higher-quality data? Please explain your reasoning.

More L3 Deep Dives / Extended L3

Q26: What does DPO's implicit reward actually learn? What are its fundamental limitations compared to an explicit RM?

The DPO loss is equivalent to fitting a Bradley-Terry preference model whose reward is the implicit reward $\hat{r}(x,y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$. Because sequence probabilities factorize over tokens, this implicit reward is a sum of token-level log-probability ratios between the policy and the reference policy, with no explicit modeling of generation semantics. Compared to training an independent RM, the DPO reward is bound to the policy's parameter space, leading to three core limitations: (1) distribution coupling: the reward cannot evaluate OOD responses independently of the policy, limiting exploration; (2) representation bottleneck: the policy must simultaneously serve as both "value evaluator" and "generator," creating potential parameter conflicts; (3) temporal inconsistency: as the policy changes during training the implicit reward drifts, whereas an explicit RM's reward distribution remains relatively stable. This also explains why online DPO (re-sampling with the current policy) typically outperforms offline DPO.

Follow-up: Since DPO has an off-policy problem, Rejection Sampling Fine-Tuning (RFT) is a simpler alternative — under what conditions would RFT be more effective than DPO, and under what conditions would it fail?


Q27: What statistical bias does GRPO's Group Normalization introduce? How can it be mitigated?

GRPO applies group-level z-score normalization (subtract mean, divide by standard deviation) across multiple responses to the same prompt, implicitly assuming that within-prompt comparison is sufficient. Statistically, when the group size G is small (e.g., G < 8), the estimated mean and variance are highly variable, causing high noise in advantage estimates. More critically, group normalization defines advantage entirely relative to the same group, which means: (1) if all responses in a group are of low quality, a "best of a bad bunch" dynamic still produces positive advantage, reinforcing the policy in a low-quality region; (2) conversely, if all responses in the group are high quality, even excellent answers are suppressed. This relative ranking bias means that when the reward distribution is skewed (e.g., most responses score similarly), GRPO may systematically diverge from the absolute quality signal. Mitigation approaches include introducing a baseline anchor (e.g., an EMA reference reward) or mixing absolute-relative advantage.
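
The relative-ranking bias is easy to see numerically: shifting every reward in a group by a constant leaves the z-scores unchanged. The sketch below also shows one possible form of the absolute-relative mix; `anchored_advantages`, the fixed `anchor` value, and the `mix` weight are assumptions made for illustration (in practice the anchor could be an EMA of past rewards), not a published recipe.

```python
from statistics import mean, pstdev

def zscore(rewards, eps=1e-8):
    """Group-level z-score: the purely relative advantage."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def anchored_advantages(rewards, anchor, mix=0.5, eps=1e-8):
    """Blend the group z-score with an absolute term measured against a fixed anchor."""
    relative = zscore(rewards, eps)
    absolute = [r - anchor for r in rewards]
    return [mix * rel + (1 - mix) * ab for rel, ab in zip(relative, absolute)]

weak_group = [0.1, 0.2, 0.3]    # all poor in absolute terms
strong_group = [0.7, 0.8, 0.9]  # all good in absolute terms

# Pure z-scores cannot tell the two groups apart.
same = all(abs(a - b) < 1e-6 for a, b in zip(zscore(weak_group), zscore(strong_group)))

# With an anchor, the best strong response earns more than the best weak one.
weak_best = anchored_advantages(weak_group, anchor=0.5)[-1]
strong_best = anchored_advantages(strong_group, anchor=0.5)[-1]
```

The absolute term is left unnormalized here, so its scale relative to the z-score depends on the reward range; that weighting would need tuning in any real setup.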

Follow-up: Under GRPO's KL constraint, if the group size tends to infinity, what form does GRPO's optimization objective mathematically converge to? How does it relate to standard PPO?


Q28: How does Reward Model overparameterization affect RLHF? Should the RM be the same scale as the policy, larger, or smaller?

RM overparameterization (far more parameters than training data requires) causes two problems: (1) spurious correlations — the RM may learn surface features unrelated to preference (e.g., specific writing styles, length) and achieve high accuracy, but these shortcuts break down once the policy updates; (2) calibration degradation — the scalar output of an overparameterized RM tends to be overconfident (concentrated at a few extreme values), causing advantage estimate variance to explode in PPO or the policy to be dominated by a small number of samples. In practice, RM scale selection involves a trade-off: a larger RM has stronger semantic understanding but is more prone to overfitting and is expensive to run; a smaller RM may generalize better but has limited expressiveness. One view is that the RM should be slightly larger than or equal to the policy scale to ensure sufficient reward signal resolution, while reward ensembles (averaging/voting across multiple RMs) mitigate overfitting.
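
The two ensemble rules mentioned above, averaging and voting, can be sketched as follows. The optional disagreement `penalty` is an extra conservative variant added here as an assumption; it is not part of the description above, and setting it to 0 recovers plain averaging.

```python
from statistics import mean, pstdev

def ensemble_reward(scores, penalty=0.0):
    """Mean of per-RM scores, optionally lowered where the RMs disagree."""
    return mean(scores) - penalty * pstdev(scores)

def majority_prefers_first(scores_a, scores_b):
    """Voting: True if more than half of the RMs score response A above response B."""
    votes = sum(a > b for a, b in zip(scores_a, scores_b))
    return votes > len(scores_a) / 2

agree = ensemble_reward([1.0, 1.0, 1.0])             # RMs agree: plain mean
disagree = ensemble_reward([0.0, 2.0], penalty=1.0)  # same mean, penalized for spread
vote = majority_prefers_first([0.9, 0.8, 0.1], [0.5, 0.5, 0.5])
```

Both rules only help when the ensemble members make different errors; members that share an initialization and training data can agree confidently on the same spurious feature.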

Follow-up: If multiple RMs in a reward ensemble all start from the same SFT initialization and differ only in data shuffling, under what conditions will this ensemble still fail systematically? How would you design a truly diverse RM ensemble?


Q29: How is the Credit Assignment problem solved in multi-turn dialogue RLHF? Is existing sequence-level reward sufficient?

In multi-turn dialogue, the user's final satisfaction is a function of the entire conversation history, but standard RLHF gives only a single scalar reward at the final turn, creating a severe temporal credit assignment problem: the model cannot tell which turn's response caused a positive or negative evaluation. Intuitive solutions include: (1) turn-level reward modeling — training an independent reward model for each dialogue turn, but this faces partial observability of dialogue state and high annotation costs; (2) Monte Carlo rollout — re-sampling subsequent dialogue from a given turn to estimate value, but combinatorial explosion is severe; (3) shaped reward via dialogue act — using dialogue acts (e.g., clarification, confirmation) as intermediate reward signals. Empirically, pure sequence-level reward is manageable for short dialogues (2–3 turns), but in long dialogues the policy tends to fall into early-turn over-optimization (over-optimizing the first-turn response to capture the initial reward signal while neglecting subsequent interaction quality).

Follow-up: If you want to implement reward attribution at the token level (rather than turn level), what methods could theoretically decompose a sequence-level reward down to each token? What are the theoretical guarantees and practical difficulties of such an approach?


Q30: Is the theoretically optimal solution of KL-constrained RL sensitive to β? When β deviates from optimal, how do PPO and DPO differ in their failure modes?

From a KL-regularized RL perspective, β controls the position on the exploration-exploitation Pareto frontier. Theoretically, the optimal β* depends on the scale of the reward function and the entropy of the reference policy, and cannot be determined in advance. When β is too large (over-regularization), both PPO and DPO converge toward the reference policy and alignment effects are weak. When β is too small (under-regularization), their failure modes diverge: PPO experiences a positive feedback loop of reward hacking: once the policy finds a reward loophole it is continuously reinforced, the RM is evaluated out-of-distribution, and reward collapses; DPO exhibits instability from preference reversal: the implicit reward of off-policy samples drifts during training, the margin between chosen and rejected shrinks or even flips, and the loss oscillates. In practice, PPO's β (KL penalty coefficient) typically needs to be co-tuned with the learning rate, while DPO's β behaves more like a temperature: a smaller β allows a larger chosen-rejected margin but is also more prone to overfitting.
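
The sensitivity to β can be made concrete with the closed-form optimum from Q19, π*(y|x) ∝ π_ref(y|x) * exp(r(x, y) / β), evaluated on a toy three-response distribution. The numbers and the function name `tilted_policy` are illustrative assumptions; real response spaces are far too large to enumerate.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), over a finite response set."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    m = max(logits)  # subtract the max before exponentiating for stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
near_ref = tilted_policy(ref, rewards, beta=100.0)  # over-regularized: barely moves
greedy = tilted_policy(ref, rewards, beta=0.05)     # under-regularized: near one-hot
```

With a large β the optimum stays within about 0.01 of the reference probabilities, while with a small β nearly all mass moves to the highest-reward response, which is exactly where an imperfect reward model is most likely to be exploited.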

Follow-up: Is there a theoretically grounded method to adaptively adjust β (rather than manually tuning)? What problems arise when using KL divergence itself as the signal for adaptive β?


Q31: Process Reward Models (PRM) have advantages on long-chain tasks like mathematical reasoning, but how do you handle the annotation ambiguity of "steps that are correct but part of a suboptimal reasoning path"?

The core challenge for PRMs is the multi-modal solution distribution: for the same problem, multiple valid reasoning paths exist (e.g., algebraic vs. geometric approaches), where steps within each path are internally consistent but paths are not directly comparable. During annotation, if annotators are asked "is this step correct?", they may give false negatives when unfamiliar with a particular reasoning style. More subtly, even if a step is correct within its current path, if the overall path is suboptimal, the step-level reward should be adjusted — but this requires a global view, which is fundamentally at odds with PRM's local evaluation nature. Directions for resolution include: (1) path-conditioned PRM — evaluating the current step conditioned on preceding steps, rather than in absolute terms; (2) Monte Carlo estimation — rolling out from the current step to the final answer and using the success rate as the step-level reward, though computational cost is high; (3) agreement-based filtering — annotating only the "critical steps" shared across multiple paths, avoiding path-specific steps.
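
Direction (2), Monte Carlo estimation, can be sketched as below. The callables `rollout` (complete a solution from a step prefix) and `is_correct` (check the final answer) are hypothetical, and the toy versions exist only so the sketch runs; the default of 8 rollouts is likewise an arbitrary choice.

```python
import random

def mc_step_reward(prefix_steps, rollout, is_correct, num_rollouts=8, rng=None):
    """Fraction of rollouts from this prefix that reach a correct final answer."""
    rng = rng or random.Random(0)
    hits = sum(is_correct(rollout(prefix_steps, rng)) for _ in range(num_rollouts))
    return hits / num_rollouts

# Toy stand-ins: a prefix containing a sign error can never recover;
# a clean prefix reaches the right answer about 75% of the time.
def toy_rollout(prefix_steps, rng):
    if "sign error" in prefix_steps:
        return "wrong"
    return "42" if rng.random() < 0.75 else "wrong"

def toy_is_correct(answer):
    return answer == "42"

good = mc_step_reward(["x = 6 * 7"], toy_rollout, toy_is_correct)
bad = mc_step_reward(["x = 6 * 7", "sign error"], toy_rollout, toy_is_correct)
```

The estimate scores a step by where it leads rather than by local correctness, so a valid step on a suboptimal path receives a lower value without any annotator judgment. The cost is `num_rollouts` full generations per step.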

Follow-up: If Monte Carlo rollout is used to estimate PRM's step-level reward, should the rollout policy be the current training policy or a fixed exploration policy? How does this choice affect the bias and variance of the reward estimate?


Q32: Constitutional AI (CAI) claims AI feedback can replace human feedback, but where is the theoretical ceiling of RLAIF? Can the gap between AI feedback and human feedback be eliminated?

The theoretical ceiling of RLAIF is bounded by the capability limits of the AI evaluator. The core issue is: if the AI evaluator has systematic preferences of its own (e.g., verbosity bias, sycophancy), then a policy trained on its feedback will inherit and even amplify those preferences, creating an evaluator-policy co-adaptation degeneracy loop. The deeper limitation is the unverifiability of value alignment: certain dimensions of human preference (such as honesty and harmlessness) fundamentally require human judgment, and AI cannot self-validate. CAI's "constitutional principles" attempt to circumvent this with explicit rules, but rules cannot cover all corner cases, and conflicts between rules require human arbitration. Empirically, RLAIF can approach human feedback on certain objective dimensions (e.g., format correctness), but still has a significant gap on dimensions requiring deep value judgment (e.g., nuanced harm assessment). Theoretically, RLAIF can only achieve RLHF-level performance when the AI evaluator is an unbiased and consistent estimator of human preferences, an assumption that currently cannot be guaranteed.

Follow-up: If the AI evaluator has a known bias (e.g., verbosity bias), can debiasing techniques (e.g., calibration, adversarial training) correct it before RLAIF training? What are the theoretical guarantees of such correction?


Q33: In multi-turn RLHF, how should the dynamics of user strategy be modeled? What systematic errors arise from assuming a fixed user strategy?

Standard multi-turn RLHF implicitly makes a stationary user assumption — that the user follows a fixed response strategy throughout the conversation. In reality, users adjust their questioning strategy based on the model's replies (e.g., pressing harder when the model evades a question, asking for brevity when the model is too verbose). This transforms RLHF from a single-agent MDP into a two-player Markov Game. Under a non-stationary user strategy, the fixed-user assumption causes: (1) overfitting to the simulated user — the policy learns optimal responses for a particular simulated user pattern rather than a robust strategy for real dynamic users; (2) exploitation of user patience — if the simulated user never terminates the conversation due to overly long responses, the policy learns an excessively verbose style. The more fundamental difficulty is that real user strategies are themselves a distribution and may even shift because of model behavior (user-model co-evolution), which theoretically approaches non-stationary multi-agent RL, for which no mature convergence guarantees currently exist.

Follow-up: If you want to explicitly model dynamic user strategies, could a user simulator be jointly trained with the policy? What are the known failure modes of such a self-play framework?

§A Key Papers Timeline