Intended use: Interview preparation for LLM Post-Training Research Intern roles & everyday reference Language: English, with key technical terms preserved as-is
Part 1 — Core Concepts & Formula Derivations
1. Pre-training vs Post-training Overview
| Dimension | Pre-training | Post-training |
|---|---|---|
| Data Scale | Trillions of tokens, web-crawled corpora | Thousands–millions of high-quality annotated/preference examples |
| Objective | Next-token prediction; acquiring language and world knowledge | Instruction following + alignment to human preferences + enhanced reasoning |
| Loss | Next-token cross-entropy (autoregressive LM loss) | SFT loss + RLHF/DPO/GRPO objectives |
| Learning Rate | High (order of 1e-4), cosine annealing | Low (1e-5 ~ 5e-6), to prevent forgetting |
| Hardware | Thousands of GPUs, training for weeks to months | Hundreds of GPUs, training for hours to days |
| Output | Base model (highly capable but uncontrolled) | Instruct / Chat model (controllable, safe, helpful) |
Standard 5-Step Pipeline:
- SFT (Supervised Fine-Tuning): Supervised fine-tuning on high-quality (instruction, response) pairs to transform the base model into an "instruction assistant."
- Reward Model Training: Train a scoring model RM using human preference comparison data (two responses to the same prompt + human preference labels).
- RLHF / PPO: Reinforcement learning using RM feedback, with a KL constraint to prevent diverging too far from the SFT model.
- DPO (Offline Alternative): Bypasses the explicit RM; directly optimizes the policy from preference data, achieving simpler and more stable alignment.
- Iterative Loop: Current policy samples new data → new preference labels → update RM → RL again, repeated over multiple rounds.
2. SFT Data Format & Loss Masking
Chat Template (ChatML format example):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user instruction}<|im_end|>
<|im_start|>assistant
{assistant response}<|im_end|>
Different models have their own templates (Llama-2 uses [INST], Llama-3 uses <|start_header_id|>, Qwen uses <|im_start|>). Training and inference must use the same template; otherwise a distribution shift occurs.
Loss Masking:
The training objective of SFT is to teach the model "how to answer," not "how to repeat the question." Cross-entropy loss is computed only at assistant token positions:
$$\mathcal{L}_{\text{SFT}} = -\frac{1}{|\mathcal{A}|}\sum_{t \in \mathcal{A}} \log \pi_\theta(y_t \mid y_{<t})$$
where $\mathcal{A}$ is the set of all assistant token positions. Labels for user / system tokens are set to `-100` (the default `ignore_index` of PyTorch's `CrossEntropyLoss`).
Multi-turn Loss Masking: The user turn of every round is masked; only the assistant turn of each round contributes to the loss.
Trade-off for including the system prompt in the loss:
- Exclude (mainstream practice): saves capacity, focuses training on response quality.
- Include: the model better learns to follow system instructions, but introduces more noise.
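The masking rule above can be sketched in a few lines. This is a minimal illustration, not a framework API: `build_labels` and the per-token `roles` tags are hypothetical names introduced here, and the token IDs are made up. Masking is driven by role tags rather than by token ID, so a real end-of-turn token inside an assistant turn is still trained on even when pad and EOS share an ID.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(token_ids, roles):
    """Keep labels only at assistant positions; ignore everything else.

    token_ids: list[int]
    roles: list[str], one tag per token: "system", "user", "assistant" or "pad"
           (hypothetical tagging scheme used only for this sketch).
    """
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]


# Two-turn toy example: every user turn is masked, every assistant turn is kept.
ids = [11, 12, 21, 22, 2, 13, 23, 2, 0, 0]
roles = ["system", "user", "assistant", "assistant", "assistant",
         "user", "assistant", "assistant", "pad", "pad"]
labels = build_labels(ids, roles)
print(labels)
```

The same function covers the multi-turn case: only the tags change, not the rule.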
2.1 Common tokenization / chat-template pitfalls (SFT engineering screening questions)
These are the most common failure modes in SFT engineering and are frequently tested in interviews (each: problem → fix):
- `pad_token = eos_token`: Many models (LLaMA/GPT-2) have no pad token, and a common workaround is to set pad to EOS. If pad positions are not masked in `attention_mask` and pad-position labels are not set to `-100`, the model computes loss over pad tokens and learns to output EOS everywhere. → Explicitly construct `attention_mask` to mask pad; set all prompt and pad labels to `-100`.
- Applying chat template twice: After `apply_chat_template` in data preprocessing, tokenization also sets `add_special_tokens=True` or wraps the template again → duplicate BOS / duplicate special tokens. → Apply the template only once; set `add_special_tokens=False` when tokenizing afterward.
- Missing or inconsistent BOS: BOS added during training but not at inference (or vice versa) → distribution shift (LLaMA-family is sensitive to BOS). → Verify whether the template already contains BOS; keep training/inference behavior consistent.
- Tokenizer version / vocabulary mismatch: Training and deployment use different versions or a version with added custom tokens, causing token ID misalignment. → Pin the tokenizer version; save it alongside the checkpoint.
- Uninitialized newly added special tokens: Adding new special tokens requires `resize_token_embeddings` (otherwise the token ID goes out of bounds); the newly added rows are randomly initialized and produce gibberish before training. → Initialize new rows properly (common heuristics: the mean of existing embeddings, or the vector of a semantically similar token; neither is the only option); ensure those new tokens appear in training data.
- Packing cross-sample attention contamination: When packing multiple documents into one sequence, without block-diagonal / `cu_seqlens` masking, tokens can attend across document boundaries (see §3). → Use varlen / `cu_seqlens` attention with correct separation.
- Using the wrong model's template: Llama-3 (`<|begin_of_text|>` / `<|start_header_id|>`), Qwen (`<|im_start|>`), Mistral (`[INST]`) have different formats; using the wrong one causes a sharp performance drop. → Use the target model's own `tokenizer.apply_chat_template`; do not hand-stitch concatenations.
Self-test (L2): Why does `pad_token=eos_token` break SFT if `attention_mask` and the label mask are not set correctly? In multi-turn dialogue, how should labels for pad and prompt positions be handled?
3. Sequence Packing
Definition: Concatenate multiple short samples into a single sequence of length equal to the context window, adding EOS / separator tokens only at sample boundaries, thereby eliminating padding waste.
GPU utilization: Without packing, padding can account for 30–60% of tokens; with packing, nearly all tokens are valid, which can yield roughly a 2–4× training speedup depending on the length distribution.
Pitfall 1 (cross-sample attention contamination): Without a document-level attention mask, tokens of a later sample in a packed sequence can attend (under the causal mask) to tokens of earlier samples, causing information leakage. Solution: use Flash Attention's cu_seqlens parameter, which takes the cumulative sequence lengths of each sample within the packed sequence and ensures attention is computed only within each sample.
Pitfall 2 — Loss weight imbalance: Packing implicitly weights by token count (longer samples produce more loss terms). If the original objective averaged over samples, the packing objective differs semantically; consider whether length normalization of the loss is needed.
cu_seqlens example (3 samples with lengths 5, 3, 7):
cu_seqlens = [0, 5, 8, 15] # cumulative lengths
packed_ids = [s1_tok1, ..., s1_tok5, s2_tok1, ..., s2_tok3, s3_tok1, ..., s3_tok7]
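To make the effect of `cu_seqlens` concrete, the sketch below builds the cumulative lengths and the equivalent dense block-diagonal causal mask in pure Python. It is an illustration of what a varlen attention kernel enforces, not how such a kernel is implemented; `make_cu_seqlens` and `block_causal_mask` are hypothetical helper names.

```python
from itertools import accumulate


def make_cu_seqlens(lengths):
    """Cumulative sequence lengths, starting at 0: [5, 3, 7] -> [0, 5, 8, 15]."""
    return [0] + list(accumulate(lengths))


def block_causal_mask(cu_seqlens):
    """mask[q][k] is True iff query position q may attend to key position k.

    A key is visible only if it belongs to the same sample and is not in the future.
    """
    total = cu_seqlens[-1]
    doc_id = [0] * total
    for d, (start, end) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])):
        for pos in range(start, end):
            doc_id[pos] = d
    return [[doc_id[q] == doc_id[k] and k <= q for k in range(total)]
            for q in range(total)]


cu = make_cu_seqlens([5, 3, 7])
mask = block_causal_mask(cu)
print(cu)
print(mask[5][4], mask[5][5])  # first token of sample 2 cannot see sample 1
```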
4. RLHF Full Pipeline / PPO in RLHF
RLHF (Reinforcement Learning from Human Feedback) three stages:
- SFT: Establish the initial policy (reference policy, which serves as the baseline for the KL penalty).
- RM Training: Fit a scalar reward function from human preference comparison data using the Bradley-Terry model.
- PPO Optimization: Maximize the augmented reward with a KL constraint.
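Augmented Reward:
$$r_{\text{total}}(x, y) = r_\phi(x, y) - \beta\,\mathrm{KL}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big]$$
PPO Clipped Objective:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]$$
where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio, and $\epsilon$ is typically 0.1–0.2.
GAE Advantage Estimation:
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
$\lambda$ controls the bias-variance trade-off: $\lambda = 1$ gives high variance, low bias; $\lambda = 0$ degenerates to one-step TD.
Recurrence (backward sweep, $t = T-1, \dots, 0$): $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, with $\hat{A}_T = 0$.
- $\lambda = 0$: $\hat{A}_t = \delta_t$, the one-step TD advantage (low variance, high bias).
- $\lambda = 1$: $\hat{A}_t = \sum_{l \ge 0} \gamma^l r_{t+l} - V(s_t)$, the Monte-Carlo advantage (value function as a baseline only; low bias, high variance).
- Key distinction: $\gamma < 1$ introduces bias regardless of the accuracy of $V$ (the discount itself); $\lambda < 1$ introduces bias only when $V$ is inaccurate, so with a well-fit $V$, even $\lambda < 1$ is near-unbiased. Source: Schulman et al., arXiv:1506.02438 (ICLR 2016).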
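The backward sweep for GAE can be sketched as follows. This is a minimal pure-Python version for a single trajectory, with the assumption that the value after the last step is `last_value` (0 for a finished response); `compute_gae` is a name chosen here. It also returns the un-normalized value targets (advantage plus value), which is the quantity a clipped value loss would regress to.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation for one trajectory.

    rewards[t], values[t] for t = 0..T-1; last_value is V(s_T).
    Returns (advantages, returns) with returns[t] = advantages[t] + values[t].
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_adv = 0.0            # A_T = 0
    next_value = last_value   # V(s_T)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        next_adv = delta + gamma * lam * next_adv            # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = next_adv
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


rewards = [0.0, 0.0, 1.0]      # sparse terminal reward, as in RLHF
values = [0.5, 0.5, 0.5]
adv_td, _ = compute_gae(rewards, values, gamma=1.0, lam=0.0)
adv_mc, ret_mc = compute_gae(rewards, values, gamma=1.0, lam=1.0)
print(adv_td)  # lam = 0: one-step TD errors
print(adv_mc)  # lam = 1: Monte-Carlo return minus V(s_t)
```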
4 Models Required by PPO (source of memory pressure):
| Model | Role | Updated? |
|---|---|---|
| Actor (Policy $\pi_\theta$) | The LLM policy being optimized | Yes (PPO gradient) |
| Critic (Value model) | Estimates $V(s_t)$, computes advantage | Yes (TD error) |
| Reference ($\pi_{\text{ref}}$) | KL penalty baseline, i.e., the SFT model | No (frozen) |
| Reward Model (RM) | Scores (x,y) pairs | No (frozen) |
From-scratch implementation (clipped policy loss + clipped value loss + entropy bonus + approx_kl monitoring):
import torch

def ppo_loss(logp, logp_old, values, values_old, returns, advantages, entropy,
             clip_eps=0.2, vf_clip=0.2, vf_coef=0.5, ent_coef=0.0):
    # logp/logp_old: (B,) logprob of the taken action under current/behavior policy
    # advantages: normalized GAE advantage (for policy loss), computed upstream
    # returns: un-normalized GAE target (raw_adv + values_old, for value loss), computed upstream
    ratio = torch.exp(logp - logp_old)                        # importance ratio ρ_t
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    pg_loss = -torch.min(surr1, surr2).mean()                 # clipped policy loss
    v_clip = values_old + torch.clamp(values - values_old, -vf_clip, vf_clip)
    vf_loss = 0.5 * torch.max((values - returns) ** 2,
                              (v_clip - returns) ** 2).mean()  # clipped value loss
    loss = pg_loss + vf_coef * vf_loss - ent_coef * entropy.mean()  # entropy bonus
    with torch.no_grad():                                     # diagnostics only
        logr = logp - logp_old                                # log(π_new/π_old)
        approx_kl = (torch.exp(logr) - 1 - logr).mean()       # K3: KL(π_old‖π_new) probe
        clip_frac = ((ratio - 1).abs() > clip_eps).float().mean()
    return loss, {"pg": pg_loss, "vf": vf_loss, "approx_kl": approx_kl, "clip_frac": clip_frac}
- Key points: ① three loss terms: clipped policy loss (the clipped surrogate objective above), clipped value loss (constrains the critic's single step around `values_old` to prevent value oscillation), and an entropy bonus (encourages exploration; `ent_coef` is often 0 or tiny); ② `approx_kl` uses the K3 estimator for KL(π_old‖π_new), a trust-region monitor (early-stop the epoch if it exceeds a threshold). It is a different KL from the $\beta\,\mathrm{KL}(\pi_\theta \| \pi_{\text{ref}})$ term in the reward above: the former bounds the update step, the latter anchors to the reference policy; ③ both `advantages` and `returns` come from upstream: `advantages` is the normalized GAE advantage (fed to the policy loss), while `returns` is the un-normalized GAE target (raw advantage + `values_old`, fed to the value loss). The two are not interchangeable; this function only computes the minibatch loss; ④ `vf_coef` defaults to 0.5 as a common heuristic weight (not a principled gradient balance), most relevant when actor and critic share a backbone; `clip_frac` reports the fraction of ratios outside the clip range, which is not the same as the fraction whose objective term was actually clipped (that also depends on the advantage sign).
5. Bradley-Terry Reward Model
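Bradley-Terry preference model: Given prompt $x$, the probability that the better response $y_w$ is preferred over the worse response $y_l$ is:
$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$
RM training loss (maximizing log likelihood of preference data):
$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$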
RM architecture:
- Initialized from the SFT model (inherits language understanding capability).
- The LM head is removed and replaced with a linear layer that outputs a scalar reward.
- During training, a (chosen, rejected) response pair is fed in; both are forward-passed separately, taking the scalar output at the last token position to compute the Bradley-Terry loss.
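Once the two scalar rewards are available, the Bradley-Terry loss is a one-liner. The sketch below assumes the chosen and rejected scalar outputs have already been computed by the reward head; `bt_loss` and `log_sigmoid` are helper names introduced for this illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bt_loss(chosen_rewards, rejected_rewards):
    """Mean negative log-likelihood of 'chosen beats rejected' under Bradley-Terry."""
    losses = [-log_sigmoid(rc - rr)
              for rc, rr in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)


print(bt_loss([0.0], [0.0]))  # equal scores: the model is at chance
print(bt_loss([2.0], [0.0]))  # correct ranking: lower loss
print(bt_loss([0.0], [2.0]))  # wrong ranking: higher loss
```

Only the reward difference enters the loss, so the reward scale is identified only up to an additive constant per prompt.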
Key risks:
- Reward hacking: the policy finds responses that score high under RM errors but are low quality (Goodhart's Law).
- Distribution shift: the RM breaks down on the policy's distribution when it has shifted away from the training distribution; iterative RM updates are needed.
6. DPO Full Derivation (Direct Preference Optimization Full Derivation)
6.1 Starting from the RLHF Objective
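The KL-constrained RLHF optimization objective:
$$\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\,\mathrm{KL}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big]$$
where:
- $r(x, y)$ is the scalar reward from the reward model
- $\pi_{\text{ref}}$ is the reference policy, typically the SFT model
- $\beta$ controls KL penalty strength
Expanding the KL divergence (for a fixed prompt $x$):
$$\max_{\pi}\ \sum_y \pi(y \mid x)\Big[r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\Big]$$
Taking the variational derivative per $\pi(y \mid x)$ and setting it to zero:
$$r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \beta - \lambda(x) = 0$$
Note: the normalization constraint $\sum_y \pi(y \mid x) = 1$ introduces a Lagrange multiplier $\lambda(x)$.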
6.2 Closed-Form Optimal Policy
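Solving yields the optimal policy:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\Big(\frac{r(x, y)}{\beta}\Big)$$
where the partition function is:
$$Z(x) = \sum_y \pi_{\text{ref}}(y \mid x)\,\exp\Big(\frac{r(x, y)}{\beta}\Big)$$
$Z(x)$ ensures $\sum_y \pi^*(y \mid x) = 1$, i.e., the policy is a valid probability distribution.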
6.3 Inverting for the Reward
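Taking log of both sides of the optimal policy:
$$\log \pi^*(y \mid x) = \log \pi_{\text{ref}}(y \mid x) + \frac{r(x, y)}{\beta} - \log Z(x)$$
Rearranging:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Key insight: the reward can be expressed via the log-ratio of policy to reference, eliminating the explicit reward model.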
6.4 Substituting into Bradley-Terry
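Human preference model:
$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
where $\sigma$ is the sigmoid function.
Substituting the inverted reward:
$$P(y_w \succ y_l \mid x) = \sigma\Big(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)$$
The $\beta \log Z(x)$ terms cancel exactly, because both responses share the same partition function $Z(x)$ for the same prompt $x$.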
6.5 DPO Loss Function
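Replacing $\pi^*$ with the parameterized $\pi_\theta$ and taking the negative log-likelihood:
$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$$
Implicit reward defined as:
$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$
The loss simplifies to:
$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}\Big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\Big]$$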
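As a sanity check on the derivation, the per-pair DPO loss can be computed directly from four sequence-level log-probabilities (summed over response tokens). The sketch below is a scalar illustration under that assumption; `dpo_pair` is a hypothetical helper name, and the returned weight is the factor $\sigma(\hat{r}_l - \hat{r}_w)$ that scales the gradient.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair from summed sequence log-probs.

    Returns (loss, implicit_reward_chosen, implicit_reward_rejected, grad_weight).
    """
    r_w = beta * (pol_w - ref_w)     # implicit reward of the chosen response
    r_l = beta * (pol_l - ref_l)     # implicit reward of the rejected response
    margin = r_w - r_l
    loss = -log_sigmoid(margin)
    grad_weight = math.exp(log_sigmoid(-margin))  # sigma(r_l - r_w)
    return loss, r_w, r_l, grad_weight


# At initialization the policy equals the reference: loss = log 2, weight = 0.5.
print(dpo_pair(-15.0, -15.0, -15.0, -15.0))
# Correctly ranked pair: smaller loss and smaller gradient weight.
print(dpo_pair(-10.0, -20.0, -15.0, -15.0))
```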
6.6 Gradient Analysis
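$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\,\mathbb{E}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\,\big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)\Big]$$
- When the model already ranks correctly, $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)) \to 0$ and the gradient naturally decays
- When the model ranks incorrectly, $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)) \to 1$ and the gradient is largest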
6.7 DPO Advantages & Disadvantages
| Advantages | Disadvantages |
|---|---|
| No separate RM training needed | Offline algorithm: can only use the static dataset $\mathcal{D}$; no online exploration |
| No online rollout needed | Distribution mismatch: training signal degrades as $\pi_\theta$ drifts from the data collection policy |
| Simplified pipeline, single optimization pass | Imprecise rejection: rejects entire responses globally rather than correcting step by step |
| More stable than PPO | Sensitive to preference data quality |
| Theoretically equivalent to RLHF (with sufficient data) | $Z(x)$ cancellation depends on the correctness of the BT model assumption |
6.8 Likelihood Displacement: chosen log-prob also decreases
Phenomenon
Intuitively, DPO training should increase the model's probability for the chosen response $y_w$ and decrease it for the rejected response $y_l$. However, Razin et al. (arXiv:2410.08847) and Pal et al. (arXiv:2402.13228) both observe that $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ tend to decrease simultaneously during training: the loss decreases only because $\log \pi_\theta(y_l \mid x)$ decreases faster, widening the margin between them, while the absolute probability of the chosen response shrinks.
"While intuitively these methods should increase the probability of [$y_w$] while decreasing that of [$y_l$], several recent works observed that the probabilities of both [$y_w$] and [$y_l$] tend to decrease over the course of training." (Razin et al., arXiv:2410.08847; bracketed symbols follow this document's notation)
Gradient Mechanism
The DPO loss only constrains the log-prob difference (margin) relative to the reference model to widen:
$$\beta\Big[\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big] \uparrow$$
$\log \pi_\theta(y_w \mid x)$ itself has no lower-bound constraint: as long as $\log \pi_\theta(y_l \mid x)$ decreases faster, the gradient objective is satisfied. Razin et al. (Theorem 1/3) note that when the hidden representations of $y_w$ and $y_l$ are similar (high CHES score), the gradient direction that suppresses $y_l$ simultaneously suppresses $y_w$, and probability mass shifts to tokens semantically opposite to $y_w$, forming "unintentional unalignment" (the phrase used in Razin et al.).
Danger Conditions
- Pairs in the dataset where $y_w$ and $y_l$ differ by only a few tokens (Pal et al. report a normalized edit distance of approximately 6.5% for the MetaMath subset), sharing large prefixes so that gradients are highly correlated.
- Preference pairs that are semantically similar with subtle differences (e.g., phrasing differences rather than factual differences).
Detection
Record both chosen_logps_mean and rejected_logps_mean throughout training (most training frameworks already log these). If the chosen mean consistently decreases beyond the reference baseline, displacement is occurring.
Mitigation
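(A) DPOP (Pal et al., arXiv:2402.13228): Adds a penalty term inside the DPO loss that directly prevents the chosen log-prob from falling below the reference model:
$$\mathcal{L}_{\text{DPOP}} = -\mathbb{E}\Big[\log \sigma\Big(\beta\Big(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \lambda \max\Big(0,\ \log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_\theta(y_w \mid x)}\Big)\Big)\Big)\Big]$$
The term $\max\big(0, \log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_\theta(y_w \mid x)}\big)$ with $\lambda > 0$: when $\pi_\theta(y_w \mid x) < \pi_{\text{ref}}(y_w \mid x)$, a penalty is applied that "anchors" the chosen log-prob above the reference model.
(B) CHES data filtering (Razin et al., arXiv:2410.08847): Filter out preference pairs where the representations of $y_w$ and $y_l$ are highly similar, cutting the gradient coupling path at the data level.
(C) SimPO: Uses the length-normalized $\frac{\beta}{|y|} \log \pi_\theta(y \mid x)$ as the implicit reward and removes $\pi_{\text{ref}}$; the reward definition is directly aligned with generation-time likelihood, which by design weakens the driving force behind displacement (though the SimPO paper itself does not directly analyze this issue using the Razin/Pal framework).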
Note: The three mitigation approaches above come from different papers and should not be cross-attributed — the regularization term in DPOP is from Pal et al.; CHES filtering is from Razin et al.; the connection between SimPO and displacement comes from downstream work and is provided here for reference only; it must not be attributed to either the Razin or Pal papers.
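The DPOP penalty in (A) is easy to see on a single pair. The sketch below works on summed sequence log-probs and is only an illustration of the loss form: `dpop_pair` is a hypothetical name, and the default `lam=5.0` is an arbitrary value chosen for the example, not a recommended setting. With `lam=0.0` it reduces to plain DPO.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpop_pair(pol_w, pol_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPO loss with a penalty when the chosen log-prob drops below the reference."""
    chosen_ratio = pol_w - ref_w
    rejected_ratio = pol_l - ref_l
    penalty = max(0.0, ref_w - pol_w)   # active only if chosen fell below reference
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio - lam * penalty))


# Displacement case: chosen log-prob fell (-15 -> -16) but rejected fell faster (-15 -> -30).
print(dpop_pair(-16.0, -30.0, -15.0, -15.0, lam=0.0))  # plain DPO: margin looks healthy
print(dpop_pair(-16.0, -30.0, -15.0, -15.0, lam=5.0))  # DPOP: higher loss, pushes chosen back up
```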
7. DPO Variants Comparison
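7.1 IPO (Identity Preference Optimization / ΨPO with Ψ = identity)
DPO problem addressed: DPO uses $\Psi(q) = \log \frac{q}{1-q}$ (the logit function, corresponding to Bradley-Terry). When preferences approach certainty ($p^*(y_w \succ y_l) \to 1$), the logit tends to $+\infty$, driving $\pi_\theta(y_l \mid x) \to 0$ regardless of the KL penalty coefficient: KL regularization becomes ineffective under strong preferences, and the policy overfits the preference data.
Core change: Under the ΨPO framework, replace $\Psi$ with the identity mapping (the input preference probability $p \in [0, 1]$ stays bounded and does not diverge as $p \to 1$ the way DPO's logit mapping does). The resulting empirical loss (Azar et al., arXiv:2310.12036, Eq. 17) is a squared-loss regression:
$$\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\Big(h_\theta(x, y_w, y_l) - \frac{1}{2\tau}\Big)^2\Big], \qquad h_\theta(x, y_w, y_l) = \log \frac{\pi_\theta(y_w \mid x)\,\pi_{\text{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\text{ref}}(y_w \mid x)}$$
where $h_\theta$ is the log-ratio difference of the policy relative to the reference (logit margin); the target constant is $\frac{1}{2\tau}$, with $\tau$ the KL regularization strength.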
Citation: Mohammad Gheshlaghi Azar et al., arXiv:2310.12036 (Google DeepMind, 2023)
"IPO, unlike DPO, always regularizes its solution towards [$\pi_{\text{ref}}$] by controlling the gap between the log-likelihood ratios, thus avoiding the over-fitting to the preference dataset." (Azar et al., Section 5.2)
Properties:
- The squared loss regresses the logit-margin to a fixed target $\frac{1}{2\tau}$; $\tau$ directly controls the upper bound of the learned log-ratio difference, keeping KL regularization effective at all times
- The gradient near the target may be small but is never zero — the solution cannot "escape" to an unbounded region
- Note: Azar et al.'s original validation is limited to small-scale bandit experiments; effectiveness at LLM scale requires independent verification
- Trade-off: The loss does not correspond to a BT probability model, slightly weakening theoretical interpretability
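The regression behaviour described above shows up clearly on a single pair. The sketch below assumes summed sequence log-probs and uses a hypothetical helper name `ipo_pair`; `tau` plays the role of the KL regularization strength. Unlike a sigmoid loss, overshooting the target margin is penalized as well.

```python
def ipo_pair(pol_w, pol_l, ref_w, ref_l, tau=0.1):
    """IPO squared loss for one pair: regress the log-ratio margin to 1 / (2 * tau)."""
    h = (pol_w - ref_w) - (pol_l - ref_l)   # policy-vs-reference log-ratio margin
    target = 1.0 / (2.0 * tau)
    return (h - target) ** 2


print(ipo_pair(-10.0, -10.0, -10.0, -10.0))  # margin 0, target 5: loss 25
print(ipo_pair(-7.0, -12.0, -10.0, -10.0))   # margin 5 hits the target: loss 0
print(ipo_pair(-6.0, -14.0, -10.0, -10.0))   # margin 8 overshoots: loss 9
```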
7.2 KTO (Kahneman-Tversky Optimization)
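DPO problems addressed: (1) DPO requires paired preference data $(x, y_w, y_l)$, whereas in practice often only pointwise positive/negative feedback (thumbs-up/down) is available; paired data is expensive and scarce. (2) DPO maximizes preference log-likelihood, which is a proxy for the true objective of "maximizing generation utility," resulting in objective mismatch.
Loss (complete form):
$$\mathcal{L}_{\text{KTO}}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\lambda_y - v(x, y)\big]$$
where the implicit reward, KL baseline, and value function are defined as:
$$r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}, \qquad z_0 = \mathrm{KL}\big(\pi_\theta(y' \mid x)\,\|\,\pi_{\text{ref}}(y' \mid x)\big)$$
$$v(x, y) = \begin{cases} \lambda_D\,\sigma\big(\beta\,(r_\theta(x, y) - z_0)\big) & \text{if } y \text{ is desirable} \\ \lambda_U\,\sigma\big(\beta\,(z_0 - r_\theta(x, y))\big) & \text{if } y \text{ is undesirable} \end{cases}$$
Role and implementation of $z_0$ (KL Baseline)
$z_0$ is the expected KL divergence of the current policy relative to the reference model. In prospect theory it serves as the reference point: rewards above this point are "gains" and rewards below it are "losses," and the sigmoid is concave on the gain side (risk aversion) and convex on the loss side (risk seeking).
In practice, $z_0$ is estimated within each mini-batch (size $m$) using mismatched pairs $(x_i, y_j)$ with $j = (i + 1) \bmod m$:
$$\hat{z}_0 = \max\Big(0,\ \frac{1}{m} \sum_{i=1}^{m} \log \frac{\pi_\theta(y_j \mid x_i)}{\pi_{\text{ref}}(y_j \mid x_i)}\Big)$$
Prompt $x_i$ is deliberately paired with an unrelated output $y_j$ to avoid conflating the reward signal with the baseline estimate. Gradients do not propagate through the $z_0$ term.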
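A scalar sketch of the KTO pieces described above: the clamped mini-batch baseline and the two branches of the value function. The helper names (`kto_z0`, `kto_example_loss`) and the default weights are assumptions made for illustration; inputs are policy-vs-reference log-ratios that would be computed upstream.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kto_z0(mismatched_log_ratios):
    """Batch KL baseline: mean log-ratio over mismatched (x_i, y_j) pairs, clamped at 0."""
    return max(0.0, sum(mismatched_log_ratios) / len(mismatched_log_ratios))


def kto_example_loss(log_ratio, desirable, z0, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO loss: lambda_y - v(x, y). z0 is treated as a constant (no gradient)."""
    if desirable:
        return lambda_d - lambda_d * sigmoid(beta * (log_ratio - z0))
    return lambda_u - lambda_u * sigmoid(beta * (z0 - log_ratio))


z0 = kto_z0([-1.0, 0.5, 2.0])
print(z0)
print(kto_example_loss(0.5, True, z0))    # exactly at the reference point
print(kto_example_loss(5.0, True, z0))    # desirable and well above baseline: lower loss
print(kto_example_loss(5.0, False, z0))   # undesirable but high log-ratio: higher loss
```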
Prospect Theory Mapping
$v(x, y)$ approximates the Kahneman-Tversky S-shaped value function with a logistic function (the original power-law form is hard to optimize directly): the sign flip (desirable branch $\sigma(\beta(r_\theta - z_0))$; undesirable branch $\sigma(\beta(z_0 - r_\theta))$) mirrors the "gain vs. loss" frame switch, and the asymmetric weights $\lambda_D, \lambda_U$ correspond to loss aversion.
"KTO only requires a binary signal of whether an output is (un)desirable for a given input. This data is much more abundant, cheaper, and faster to collect in the real world than preferences." — Ethayarajh et al., arXiv:2402.01306
Citation: Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela — arXiv:2402.01306 (2024)
Properties:
- No pairing required: Each $(x, y)$ only needs a desirable/undesirable label; naturally binary logs such as thumbs-up/down can be used directly
- $z_0$ provides a dynamic reference point, making the loss adaptive to KL drift
- Trade-off: Abandons the pairwise consistency constraint of the BT model, potentially lower information efficiency than pairwise methods; $\lambda_D, \lambda_U$ require manual tuning
7.3 ORPO (Odds Ratio Preference Optimization)
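DPO problem addressed: DPO requires two-stage training (first SFT, then preference optimization) and requires maintaining a reference model $\pi_{\text{ref}}$.
Core change: Directly attach an odds-ratio preference loss on top of the SFT cross-entropy loss (unified SFT + odds-ratio loss):
$$\mathcal{L}_{\text{ORPO}} = \mathbb{E}_{(x, y_w, y_l)}\big[\mathcal{L}_{\text{SFT}} + \lambda\,\mathcal{L}_{\text{OR}}\big], \qquad \mathcal{L}_{\text{OR}} = -\log \sigma\Big(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\Big)$$
where odds are defined as:
$$\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$$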
Citation: Jiwoo Hong, Noah Lee, James Thorne — arXiv:2403.07691 (2024)
"In contrast to previous works, our approach requires neither an SFT warm-up stage nor a reference model, enabling resource-efficient development of preference-based aligned models." — Hong et al., arXiv:2403.07691
Properties:
- Single-stage training, no reference model needed, halves the number of forward passes relative to DPO
- $\mathcal{L}_{\text{SFT}}$ provides domain-adaptation anchoring on chosen responses; $\mathcal{L}_{\text{OR}}$ simultaneously repels the rejected style
- Trade-off: The odds ratio has no direct theoretical connection to the BT model; the weight between the two loss terms requires tuning; numerical results have not been independently verified here
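The two ORPO terms can be sketched on a single pair as below. The assumption here is that the sequence likelihood is the length-normalized one, so the inputs are average per-token log-probs (strictly negative); `orpo_pair`, `log_odds`, and the default `lam` are illustrative names and values, not settings taken from the paper.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_odds(avg_logp):
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_pair(avg_logp_w, avg_logp_l, lam=0.1):
    """SFT negative log-likelihood on the chosen response plus weighted odds-ratio loss."""
    sft = -avg_logp_w
    odds_ratio_loss = -log_sigmoid(log_odds(avg_logp_w) - log_odds(avg_logp_l))
    return sft + lam * odds_ratio_loss


print(orpo_pair(-0.5, -0.5))  # equal likelihoods: odds-ratio term equals log 2
print(orpo_pair(-0.2, -1.0))  # chosen more likely: both terms shrink
```

No reference log-probs appear anywhere in the computation, which is where the saving in forward passes comes from.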
7.4 SimPO (Simple Preference Optimization)
DPO problems addressed: (1) There is a divergence between DPO's implicit reward and the metric actually used at generation time (length-normalized likelihood): Meng et al. note that in UltraFeedback triplets, the proportion of cases where the DPO reward ranking is satisfied but the log-likelihood ranking is reversed approaches half, meaning the model can "win the loss" while making the chosen response harder to generate. (2) Unnormalized (summed) log-probabilities decrease monotonically with length, so a sum-based reward is confounded with response length, introducing length bias. (3) Maintaining a frozen $\pi_{\text{ref}}$ incurs memory and compute overhead.
Core change: Replace the implicit reward with a length-normalized sequence-level average log-probability, and introduce an explicit target margin in the Bradley-Terry objective:
$$\mathcal{L}_{\text{SimPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\Big)\Big]$$
where $|y|$ is the token count, $\beta$ is a scaling constant, and $\gamma > 0$ requires the chosen reward to exceed the rejected reward by at least $\gamma$ (not merely be larger). The loss does not contain $\pi_{\text{ref}}$.
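A small sketch of the length-normalized reward and target margin on one pair. Inputs are summed log-probs plus token counts; `simpo_pair` and the default `beta` and `gamma` values are assumptions made for this illustration only.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def simpo_pair(sum_logp_w, len_w, sum_logp_l, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one pair: length-normalized rewards with a target margin gamma."""
    r_w = beta * sum_logp_w / len_w   # average log-prob reward, chosen
    r_l = beta * sum_logp_l / len_l   # average log-prob reward, rejected
    return -log_sigmoid(r_w - r_l - gamma)


# Same per-token likelihood (-1.0) but different lengths: the rewards tie,
# although the summed log-probs (-10 vs -20) would favour the shorter response.
print(simpo_pair(-10.0, 10, -20.0, 20))
# Chosen has a higher per-token likelihood: the loss drops.
print(simpo_pair(-5.0, 10, -20.0, 20))
```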
Citation: Yu Meng, Mengzhou Xia, Danqi Chen, arXiv:2405.14734 (NeurIPS 2024)
"There is a divergence between DPO's reward formulation and the average log likelihood metric [$\frac{1}{|y|} \log \pi_\theta(y \mid x)$], which directly impacts generation." (Meng et al., arXiv:2405.14734, Section 3.1)
Properties:
- No reference model needed; reward directly aligns with generation-time likelihood (removes the source of distribution drift)
- Length normalization eliminates the shortcut of "short rejected response cheating"
- The target margin $\gamma$ reinforces the absolute gap between chosen and rejected, not merely their relative ranking
- Trade-off: Without $\pi_{\text{ref}}$ there is no explicit KL anchor to the reference policy, and $\beta$ and $\gamma$ need tuning; performance numbers (AlpacaEval/Arena-Hard) come from the paper abstract and have not been independently verified here
7.5 Online vs Offline DPO
Distribution Mismatch
Standard DPO is an offline algorithm: the preference dataset $\mathcal{D}$ is collected before training from some fixed data-generating policy $\pi_{\text{data}}$ (typically the SFT model). During training, $\pi_\theta$ is continuously updated while $\mathcal{D}$ remains static. Once $\pi_\theta$ diverges from $\pi_{\text{data}}$, the pairs in $\mathcal{D}$ no longer cover the current output distribution of $\pi_\theta$, creating an off-policy distribution mismatch.
Concrete manifestations:
- Implicit reward drift: DPO's implicit reward $\hat{r}_\theta$ changes continuously during training; the scores assigned to the same pair shift as the policy updates, and the chosen/rejected margin may shrink or even reverse.
- Limited exploration: The policy cannot explore outputs better than those in $\mathcal{D}$, becoming stuck in a "local optimum" defined by the old preference data.
Iterative / On-Policy DPO
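Solution: at each iteration $t$, sample new response pairs using the current policy $\pi_{\theta_t}$, then construct new preference pairs via a reward model (or human/AI judge), and update the policy with this batch of distribution-matched preference data $\mathcal{D}_t$:
$$y_1, y_2 \sim \pi_{\theta_t}(\cdot \mid x), \qquad (y_w, y_l) = \text{rank}(y_1, y_2), \qquad \theta_{t+1} = \arg\min_\theta \mathcal{L}_{\text{DPO}}(\theta;\ \mathcal{D}_t)$$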
Why online DPO is generally better:
| Dimension | Offline DPO | Online / Iterative DPO |
|---|---|---|
| Source of preference data | Statically pre-collected from fixed $\pi_{\text{data}}$ | Sampled each round from current $\pi_{\theta_t}$ |
| Distribution match | Off-policy, subject to drift | On-policy, matches current policy |
| Training signal quality | Limited by old distribution | Covers current policy's output distribution |
| Compute cost | Low (one-time data collection) | High (requires online sampling + RM scoring each round) |
| Exploration ability | None, locked to $\mathcal{D}$ | Can explore new output patterns |
| Representative methods | Standard DPO (Rafailov et al.) | RLHF-PPO, Online DPO, Self-Play Fine-Tuning |
Practical guidance: When access to an RM or automatic judge is available, iterative DPO (re-sampling + updating preference data every $K$ steps) generally yields better downstream conversation quality than purely offline DPO. If only offline is feasible, removing $\pi_{\text{ref}}$ (as in SimPO) or adding chosen-anchoring (as in DPOP) can partially mitigate the likelihood displacement caused by distribution drift.
7.6 Precise Comparison Table
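| Variant | Requires paired preferences? | Requires $\pi_{\text{ref}}$? | Key loss form | Main DPO problem corrected |
|---|---|---|---|---|
| DPO | ✅ Yes | ✅ Yes | $-\log \sigma(\beta\,h_\theta)$, with $h_\theta$ the policy-vs-reference log-ratio margin | Baseline (no explicit correction) |
| IPO | ✅ Yes | ✅ Yes | $\big(h_\theta - \tfrac{1}{2\tau}\big)^2$, squared-loss regression | KL regularization fails under deterministic preferences; unbounded reward drift |
| KTO | ❌ No (pointwise binary label) | ✅ Yes | $\lambda_y - v(x, y)$, asymmetric sigmoid + KL baseline | Requires paired data; objective misaligned with actual generation utility |
| ORPO | ✅ Yes | ❌ No | $\mathcal{L}_{\text{SFT}} + \lambda\,\mathcal{L}_{\text{OR}}$ | Two-stage training; maintaining frozen $\pi_{\text{ref}}$ (doubles memory/compute) |
| SimPO | ✅ Yes | ❌ No | $-\log \sigma\big(\tfrac{\beta}{\lvert y_w \rvert} \log \pi_\theta(y_w \mid x) - \tfrac{\beta}{\lvert y_l \rvert} \log \pi_\theta(y_l \mid x) - \gamma\big)$ | Reward vs. generation-likelihood mismatch; length bias; $\pi_{\text{ref}}$ overhead |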
Citations: DPO — Rafailov et al., arXiv:2305.18290; IPO — Azar et al., arXiv:2310.12036; KTO — Ethayarajh et al., arXiv:2402.01306; ORPO — Hong et al., arXiv:2403.07691; SimPO — Meng et al., arXiv:2405.14734 (NeurIPS 2024)
8. GRPO vs PPO (Group Relative Policy Optimization vs Proximal Policy Optimization)
8.1 GRPO Group Advantage Estimation
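The core idea of GRPO: for the same prompt $x$, sample a group of $G$ responses $\{y_1, \dots, y_G\}$ and estimate the advantage using intra-group statistics:
$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$
where $r_i$ is the reward for the $i$-th response.
GRPO policy gradient loss:
$$\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\Big[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\big(\rho_{i,t}\,\hat{A}_i,\ \mathrm{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\big)\Big] + \beta\,\mathrm{KL}\big[\pi_\theta\,\|\,\pi_{\text{ref}}\big]$$
with $\rho_{i,t} = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}$. Same clipping mechanism as PPO, but entirely different advantage estimation.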
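The group-relative advantage is just a per-prompt z-score of the rewards. The sketch below uses the population standard deviation and a small `eps` to avoid division by zero when every response in the group gets the same reward; both choices are assumptions of this illustration, and `group_advantages` is a hypothetical name.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Z-score each reward against the other responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std (assumption; conventions vary)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt, binary verifiable reward.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
# All answers correct: no learning signal from this prompt.
print(group_advantages([1.0, 1.0, 1.0]))
```

Every token of response $i$ shares the same advantage, which is what removes the need for a critic.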
8.2 Key Comparison
| Property | PPO | GRPO |
|---|---|---|
| Models required | 4: Actor + Critic + Reference + RM | 2–3: Actor + Reference (+ optional RM; in RLVR settings reward comes from rules, no separate RM needed) |
| Advantage estimation | GAE (Generalized Advantage Estimation), requires a Critic network | Intra-group reward normalization (z-score), no Critic needed |
| Memory overhead | High (4 copies of model weights) | Low (2–3 copies of model weights) |
| Reward source | Learned neural RM | Typically verifiable/rule-based reward; neural RM can also be plugged in |
| Suitable scenarios | Open-ended dialogue, creative writing | Math reasoning, code generation (verifiable ground truth) |
| Training stability | Requires careful Critic tuning, otherwise unstable | More stable, no Critic estimation error |
| Gradient variance | Lower (GAE provides low-variance estimates) | Higher (limited group sample size $G$) |
8.3 RLVR Framework (RL from Verifiable Rewards)
The paradigm that best fits GRPO is RLVR: rewards come not from a learned RM but from automatically verifiable rules:
- Math: whether the answer equals the ground truth → $r \in \{0, 1\}$
- Code: whether all test cases pass → $r \in \{0, 1\}$
- Format: whether the required format is followed → $r \in \{0, 1\}$
The core advantage of RLVR: low-noise reward (relative to a learned RM), avoiding the bias and overfitting of the RM itself.
8.4 When to Prefer Which
- Prefer GRPO: rewards are automatically verifiable (math, code, logical reasoning); limited resources (cannot maintain 4 models); need stable training
- Prefer PPO: rewards require semantic/style judgment (dialogue quality, creative writing); reward signal is complex and cannot be rule-based; sufficient compute and a mature RM are available
8.5 RLOO and ReMax (critic-free baselines)
PPO relies on a learned critic (value network) to estimate a baseline. GRPO, RLOO, and ReMax are all critic-free, replacing the value baseline with a baseline computed from sampled rewards.
RLOO (REINFORCE Leave-One-Out, Ahmadian et al. 2024, ACL, arXiv:2402.14740): For a group of $K$ samples per prompt, the baseline for sample $i$ is the mean reward of the other $K - 1$ samples; advantage $A_i = r_i - \frac{1}{K-1} \sum_{j \ne i} r_j$. It uses a pure REINFORCE gradient, no clipping, no critic; the policy-gradient estimate stays unbiased because the baseline does not depend on sample $i$'s own action.
ReMax (Li et al. 2024, ICML, arXiv:2310.10505): The baseline is the reward of a single greedy (argmax) decode for the same prompt; advantage $A = r(x, y) - r(x, y_{\text{greedy}})$. This requires only one extra greedy rollout per prompt, resulting in very low memory overhead and no critic network.
| Method | baseline | estimator | clip? | extra cost |
|---|---|---|---|---|
| GRPO | Group-relative (z-score) | PPO-style | Yes | $G$ samples |
| RLOO | Leave-one-out mean | REINFORCE | No | $K$ samples |
| ReMax | Greedy decode reward | REINFORCE | No | +1 greedy rollout |
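The two critic-free baselines above differ only in where the baseline comes from, which the following sketch makes explicit. Function names are chosen here for illustration; rewards are plain floats for one prompt.

```python
def rloo_advantages(rewards):
    """Leave-one-out baseline: each sample is compared with the mean of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def remax_advantage(sample_reward, greedy_reward):
    """ReMax baseline: reward of the greedy decode for the same prompt."""
    return sample_reward - greedy_reward


print(rloo_advantages([1.0, 0.0, 0.0, 1.0]))
print(remax_advantage(1.0, 0.0))
```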
9. Role of KL Penalty & Tuning β
9.1 Intuitive Role of the KL Term
The KL penalty acts as regularization, with the following functions:
- Prevents excessive drift: Ensures $\pi_\theta$ does not deviate too far from $\pi_{\text{ref}}$, preserving pre-training knowledge
- Mitigates reward hacking: If the policy learns to exploit flaws in the RM, the KL term grows as a penalty
- Maintains diversity: Prevents the policy from collapsing to a few high-reward modes (mode collapse)
- Stabilizes training: Constrains the exploration space, preventing excessively large policy updates
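Expanding mathematically:
$$\mathrm{KL}\big[\pi_\theta\,\|\,\pi_{\text{ref}}\big] = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\Big]$$
When $\pi_\theta = \pi_{\text{ref}}$, KL = 0 (no penalty when the policy does not deviate at all from the reference policy).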
9.2 KL-RM Score Pareto Frontier
Tuning $\beta$ is fundamentally a trade-off between two objectives:
| $\beta$ value | Effect |
|---|---|
| $\beta$ too large | Policy barely updates, stays close to $\pi_{\text{ref}}$, small reward improvement (underfitting) |
| $\beta$ too small | Policy updates aggressively, reward may be high but distribution shift is severe, risk of reward hacking |
| $\beta$ moderate | Finds a balance on the KL-RM frontier |
Typical range (typical range): ; or are commonly used in practice.
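Quantitative version of over-optimization: the gold-RM score traces an inverted-U in $d$ (BoN form $R_{\text{BoN}}(d) = d\,(\alpha_{\text{BoN}} - \beta_{\text{BoN}}\,d)$, RL form $R_{\text{RL}}(d) = d\,(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d)$, with $d = \sqrt{\mathrm{KL}(\pi \,\|\, \pi_{\text{init}})}$): past the peak the policy drifts out of distribution and the gold score falls. See reward-modeling-eval §3.2a (Gao et al. 2022, arXiv:2210.10760).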
9.3 β in DPO vs PPO
| Dimension | $\beta$ in PPO | $\beta$ in DPO |
|---|---|---|
| Mathematical role | Controls the weight of the KL penalty term (in the loss function) | Controls the scaling of the implicit reward (in the log-ratio) |
| Where it appears | $r(x, y) - \beta\,\mathrm{KL}(\pi_\theta \Vert \pi_{\text{ref}})$ | $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ |
| Semantic equivalence | Theoretically, DPO's $\beta$ originates from the same $\beta$ in the RLHF objective, but in practice the training dynamics differ, so both must be tuned separately | |
| Practical effect | $\beta \uparrow$ → more conservative policy | $\beta \uparrow$ → implicit reward changes more sharply, more sensitive to preference signal |
Conclusion: Theoretically equivalent, practically different. In DPO, $\beta$ also influences the sharpness of the weight term $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ in the gradient.
9.4 KL Estimators & Placement
9.4.1 Three Single-Sample Estimators
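Notation: Let $r = \frac{\pi_{\text{ref}}(y \mid x)}{\pi_\theta(y \mid x)}$ (Schulman 2020 convention), with samples drawn from the current policy $y \sim \pi_\theta$.
Define three estimators:
$$k_1 = -\log r, \qquad k_2 = \tfrac{1}{2}(\log r)^2, \qquad k_3 = (r - 1) - \log r$$
Verifying the expectation (samples from $\pi_\theta$):
$$\mathbb{E}_{y \sim \pi_\theta}[r] = \sum_y \pi_\theta(y \mid x)\,\frac{\pi_{\text{ref}}(y \mid x)}{\pi_\theta(y \mid x)} = \sum_y \pi_{\text{ref}}(y \mid x) = 1$$
Therefore $\mathbb{E}_{\pi_\theta}[r - 1] = 0$: the term $(r - 1)$ is zero-mean and serves as a control variate.
Unbiasedness of $k_1$: $\mathbb{E}_{\pi_\theta}[k_1] = \mathbb{E}_{\pi_\theta}\left[\log \frac{\pi_\theta}{\pi_{\text{ref}}}\right] = \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$. So $k_1$ is an unbiased estimate of the KL value.
Bias of $k_2$: $k_2 \ge 0$, but its expectation under $\pi_\theta$ does not equal $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$, so $k_2$ is biased.
Unbiasedness of $k_3$ (control-variate argument): For any $\lambda$, define $k_\lambda = -\log r + \lambda (r - 1)$. Since $\mathbb{E}_{\pi_\theta}[r - 1] = 0$, we have $\mathbb{E}_{\pi_\theta}[k_\lambda] = \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ for all $\lambda$, so the family is universally unbiased. Setting $\lambda = 1$ yields $k_3$. At this choice, the added term $(r - 1)$ is negatively correlated with $k_1$, reducing variance. Hence $k_3$ is unbiased and, in the small-drift regime ($r \approx 1$) relevant to RLHF, has lower variance than $k_1$.
Non-negativity of $k_3$: By the tangent-line inequality $\log r \le r - 1$ for all $r > 0$ (equality only at $r = 1$), we have $k_3 = (r - 1) - \log r \ge 0$, consistent with the non-negativity of KL divergence.
Order near $r = 1$ (let $\delta = \log r$): $k_1 = -\delta$ is first-order (signed), whereas $k_2 = \delta^2/2$ and $k_3 = e^{\delta} - 1 - \delta \approx \delta^2/2$ are second-order (non-negative). All three vanish as $r \to 1$ but at different rates, which is also why $k_3$ is non-negative like $k_2$ yet unbiased like $k_1$.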
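The claims above can be checked exactly on a small discrete example. The sketch below enumerates a four-token vocabulary (the two distributions are arbitrary illustrative values, not from any real model) and computes the exact mean and variance of each estimator under $y \sim \pi_\theta$, using the convention $r = \pi_{\text{ref}}/\pi_\theta$.

```python
import math

def kl_estimator_stats(p_theta, p_ref):
    """Exact mean and variance of k1, k2, k3 under samples y ~ p_theta."""
    estimators = {
        "k1": lambda r: -math.log(r),
        "k2": lambda r: 0.5 * math.log(r) ** 2,
        "k3": lambda r: (r - 1.0) - math.log(r),
    }
    stats = {}
    for name, k in estimators.items():
        vals = [k(q / p) for p, q in zip(p_theta, p_ref)]  # r = pi_ref / pi_theta
        mean = sum(p * v for p, v in zip(p_theta, vals))
        var = sum(p * (v - mean) ** 2 for p, v in zip(p_theta, vals))
        stats[name] = (mean, var)
    return stats

# Illustrative distributions with small drift (assumed values).
p_theta = [0.40, 0.30, 0.20, 0.10]
p_ref = [0.35, 0.30, 0.22, 0.13]

true_kl = sum(p * math.log(p / q) for p, q in zip(p_theta, p_ref))
stats = kl_estimator_stats(p_theta, p_ref)

print(f"true KL = {true_kl:.6f}")
for name, (mean, var) in stats.items():
    print(f"{name}: mean = {mean:.6f}, variance = {var:.6f}")
```

In this example the means of $k_1$ and $k_3$ match the true KL, the mean of $k_2$ is close but not equal, and the variance of $k_3$ is far below that of $k_1$.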
9.4.2 The Gradient Perspective
The analysis above concerns value estimation only. When an estimator is used as a loss term, its gradient behavior must be analyzed separately.
$k_3$ as a loss term does not yield the exact reverse-KL gradient: Although $k_3$ is an unbiased value estimate of KL, differentiating it with respect to $\theta$ (when used as a loss) produces a gradient that is only a first-order approximation of the true reverse-KL gradient. The approximation holds when policy drift is small ($\pi_\theta \approx \pi_{\text{ref}}$, so $r \approx 1$), but introduces systematic bias as drift grows.
Per Liu et al. (arXiv:2510.01555, "Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization"): $k_1$ placed in the reward (in-reward) and $k_2$ used as a loss term (as-loss) are gradient-principled choices, while $k_3$ as a loss term lacks gradient-level justification. In practice, the small $\beta$ used in GRPO / DeepSeek-R1 keeps this approximation error minor.
9.4.3 Estimator Comparison Table
| Estimator | Form | Value-unbiased? | Gradient-principled? |
|---|---|---|---|
| $k_1$ | $-\log r$ | Yes | Yes, as in-reward |
| $k_2$ | $\tfrac{1}{2}(\log r)^2$ | No | Yes, as-loss |
| $k_3$ | $(r - 1) - \log r$ | Yes | No, as-loss |
Variance note: in the small-drift regime ($r \approx 1$), $k_3$ has lower variance than $k_1$ (both unbiased). $k_2$ is biased and operates in a different bias-variance regime; direct variance comparisons with $k_1$ or $k_3$ are not meaningful.
9.4.4 Two Placement Styles (Style A vs Style B)
Style A: In-Reward
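Representative: InstructGPT / PPO. The KL penalty is incorporated per-token into the reward signal:
$$r_t = r_{\text{RM}}(x, y)\,\mathbb{1}[t = T] - \beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$$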
- Each token receives an individual KL penalty signal; the Critic can learn stepwise KL costs.
- Caveat: PPO's clip mechanism truncates the policy ratio. For clipped tokens, the surrogate objective no longer depends on $\theta$, so the KL signal in the reward is silently masked for those tokens at the gradient level. This implementation detail is easily overlooked.
Style B: In-Loss
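Representative: GRPO (Shao et al., DeepSeekMath), DeepSeek-R1. The KL estimator is added directly to the policy optimization loss:
$$\mathcal{L}(\theta) = -\mathbb{E}\left[\min\left(\rho_t A,\ \mathrm{clip}(\rho_t, 1 - \epsilon, 1 + \epsilon)\,A\right)\right] + \beta\, k_3$$
where $\rho_t = \pi_\theta / \pi_{\theta_{\text{old}}}$ is the policy ratio, $r = \pi_{\text{ref}} / \pi_\theta$, and $k_3 = (r - 1) - \log r$ (computed per token, then averaged over the sequence).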
- Requires no Critic / Value model, reducing memory footprint.
- DAPO (arXiv:2503.14476): removes the KL penalty entirely ($\beta = 0$). The stated rationale is that during long-CoT reasoning training the policy diverges substantially from the initial SFT reference, so a tight KL constraint is counterproductive; it instead relies on asymmetric clipping (Clip-Higher: decoupled upper/lower clip bounds) to prevent entropy collapse.
| Dimension | Style A (in-reward) | Style B (in-loss) |
|---|---|---|
| Representative systems | InstructGPT, PPO | GRPO, DeepSeek-R1 |
| Estimator used | $k_1$ (per-token) | $k_3$ (per-token, averaged) |
| Gradient-principled | Yes | Approximate (acceptable at small $\beta$) |
| Engineering complexity | Requires Critic | No Critic needed |
| Clip-masking risk | Present (KL gradient silently dropped for clipped tokens) | Not applicable |
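A minimal sketch of Style A reward shaping, assuming a single response with per-token log-probs available as plain lists. The function name `shape_rewards` and the default `beta` are hypothetical; real PPO pipelines do this on batched tensors.

```python
def shape_rewards(logp, ref_logp, rm_score, beta=0.05):
    """
    In-reward KL (Style A): every token gets -beta * k1_t, where
    k1_t = log pi_theta(a_t|s_t) - log pi_ref(a_t|s_t); the scalar RM score
    is added only on the final token of the response.
    """
    rewards = [-beta * (lp - rlp) for lp, rlp in zip(logp, ref_logp)]
    rewards[-1] += rm_score
    return rewards

# Toy values: the policy is more confident than the reference on token 0 only.
shaped = shape_rewards(logp=[-1.0, -2.0], ref_logp=[-1.5, -2.0], rm_score=1.0, beta=0.1)
print(shaped)
```

The shaped per-token rewards then feed GAE, which is how the Critic ends up learning stepwise KL costs.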
9.4.5 Interview Self-Test
L2: Using the $r = \pi_{\text{ref}}/\pi_\theta$ convention, why does $\mathbb{E}_{\pi_\theta}[r] = 1$? How does this result establish the unbiasedness of $k_3$?
L3: GRPO incorporates $k_3$ directly into the loss, yet $k_3$ as a loss term does not yield the principled reverse-KL gradient. Why is this usually acceptable in GRPO practice? If $\beta$ were increased from 0.04 to 0.5, how would this approximation error change?
10. Process Reward Model (PRM) vs Outcome Reward Model (ORM)
10.1 ORM: Outcome Reward Model
- Gives a single scalar reward only at the end of the sequence (End of Sequence, EOS)
- The entire response shares the same reward value
- Training data: binary labels
10.2 PRM: Process Reward Model
- Gives an independent score for each step of the reasoning chain (step-level scoring)
- $r_k$ represents the quality of the $k$-th reasoning step
- Training data: step-level labels
10.3 Credit Assignment Advantage
This is the core advantage of PRM. Consider a mathematical reasoning chain:
Step 1: Let → ✅ correct
Step 2: Differentiate to get → ✅ correct
Step 3: Set , solve → ✅ correct
Step 4: → ✅ correct
ORM only knows "the final answer is correct" → gives a high score, but does not know whether each step is reliable.
PRM can identify the case where "the first three steps are correct but the fourth is wrong":
This allows PRM to guide search and training more precisely.
10.4 Best-of-N Search with PRM
Given prompt $x$, sample $N$ candidate responses $y^{(1)}, \dots, y^{(N)}$, score each step of each response, and aggregate with the minimum:
$$\text{score}(y^{(i)}) = \min_k r_k^{(i)}$$
or use the product form:
$$\text{score}(y^{(i)}) = \prod_k r_k^{(i)}$$
Taking min or product ensures every step qualifies: any weak step pulls down the overall score.
Select the best response:
$$y^* = \arg\max_{i} \text{score}(y^{(i)})$$
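The selection rule can be sketched in a few lines. The step scores below are made-up numbers standing in for PRM outputs, and `aggregate` / `best_of_n` are hypothetical helper names.

```python
import math

def aggregate(step_scores, how="min"):
    """Collapse per-step PRM scores into one response-level score."""
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")

def best_of_n(candidates, how="min"):
    """candidates: list of (response_text, [step scores]); returns the best response text."""
    return max(candidates, key=lambda c: aggregate(c[1], how))[0]

# Toy candidates: A has one very weak step, B is uniformly mediocre, C is strong with one soft step.
candidates = [
    ("A", [0.9, 0.9, 0.2]),
    ("B", [0.7, 0.7, 0.7]),
    ("C", [0.95, 0.6, 0.9]),
]
print("min picks:", best_of_n(candidates, "min"))
print("prod picks:", best_of_n(candidates, "prod"))
```

The two aggregations can disagree: min only looks at the weakest step, while the product also rewards strength on the remaining steps.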
10.5 PRM Training Data Challenges
| Challenge | Description |
|---|---|
| Expensive annotation | Every step of every reasoning chain requires expert human annotation of correctness — 10–50× more expensive than ORM annotation |
| Ambiguous step boundaries | There is no unified standard for segmenting reasoning steps; different annotators may segment them differently |
| Low inter-annotator agreement | Judgments of "whether a step is correct" may vary with the annotator's mathematical proficiency |
| Limitations of automated methods | Monte Carlo estimation (estimating the probability of reaching the correct answer after a given step via repeated sampling) has high variance |
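Automated PRM annotation method: After step $k$, sample $M$ completions from the prefix $s_{1:k}$ and compute the proportion of final answers that are correct as an estimate of the step label $\hat{r}_k$. Formula:
$$\hat{r}_k = \frac{1}{M} \sum_{j=1}^{M} \mathbb{1}\left[\text{completion } j \text{ from } s_{1:k} \text{ reaches the correct answer}\right]$$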
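A minimal sketch of this Monte Carlo labeling, assuming the caller supplies a rollout function and an answer checker. `mc_step_label`, `toy_rollout`, and `toy_is_correct` are hypothetical names; the toy rollout just simulates a model that usually fails after a bad step.

```python
import random

def mc_step_label(prefix_steps, rollout_fn, is_correct, num_rollouts, rng):
    """Fraction of sampled completions from this prefix that end in a correct answer."""
    hits = sum(is_correct(rollout_fn(prefix_steps, rng)) for _ in range(num_rollouts))
    return hits / num_rollouts

def toy_rollout(prefix_steps, rng):
    """Stand-in for sampling a completion from the policy (assumption, not a real model)."""
    p_correct = 0.1 if "bad step" in prefix_steps else 0.9
    return "42" if rng.random() < p_correct else "0"

def toy_is_correct(final_answer):
    return final_answer == "42"

rng = random.Random(0)
label_good = mc_step_label(["good step"], toy_rollout, toy_is_correct, 200, rng)
label_bad = mc_step_label(["good step", "bad step"], toy_rollout, toy_is_correct, 200, rng)
print(f"label after good prefix: {label_good:.2f}, after bad step: {label_bad:.2f}")
```

With few rollouts per step the estimate is noisy, which is the high-variance limitation noted in the table above.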
11. Alignment Tax & Weight Averaging
11.1 Definition of Alignment Tax
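Alignment Tax refers to the performance degradation on base capability benchmarks after a model undergoes alignment training:
$$\text{Tax} = \text{Perf}_{\text{base}}(\theta_{\text{base}}) - \text{Perf}_{\text{base}}(\theta_{\text{aligned}})$$
where $\text{Perf}_{\text{base}}$ denotes performance on pre-training benchmarks (e.g., MMLU, coding ability, math ability, etc.).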
Intuitively: SFT/RL training may "forget" or "overwrite" parts of pre-trained knowledge while improving alignment quality (safety, helpfulness, format following).
11.2 WiSE-FT Linear Interpolation
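WiSE-FT (Weight-space Ensembles for Finetuning) mitigates the alignment tax by interpolating in weight space:
$$\theta_{\text{merged}} = (1 - \alpha)\,\theta_{\text{base}} + \alpha\,\theta_{\text{aligned}}$$
where $\alpha \in [0, 1]$ controls the trade-off between the aligned model and the base model.
| $\alpha$ | Effect |
|---|---|
| $\alpha = 1$ | fully aligned model |
| $\alpha = 0$ | base model only |
| $0 < \alpha < 1$ | compromise: retains some aligned behavior while recovering some base capability |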
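A minimal sketch of the interpolation, assuming the convention that $\alpha = 1$ recovers the aligned model and $\alpha = 0$ the base model. Weights are represented as plain dicts of float lists for illustration; with real checkpoints the same loop would run over `state_dict` tensors. `wise_ft` is a hypothetical helper name.

```python
def wise_ft(theta_base, theta_aligned, alpha):
    """Element-wise (1 - alpha) * base + alpha * aligned over matching parameter names."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if theta_base.keys() != theta_aligned.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: [(1.0 - alpha) * b + alpha * a
               for b, a in zip(theta_base[name], theta_aligned[name])]
        for name in theta_base
    }

# Toy two-parameter "models".
base = {"w": [0.0, 2.0]}
aligned = {"w": [1.0, 4.0]}
for alpha in (0.0, 0.5, 1.0):
    print(alpha, wise_ft(base, aligned, alpha))
```

In practice $\alpha$ is swept on a validation set that covers both alignment metrics and base-capability benchmarks.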
11.3 Why Interpolation Works
Task Vector perspective: alignment training is equivalent to moving in a direction within weight space:
$$\tau_{\text{align}} = \theta_{\text{aligned}} - \theta_{\text{base}}$$
Research shows that the weight-change directions corresponding to different tasks are near-orthogonal, so:
$$\langle \tau_i, \tau_j \rangle \approx 0 \quad (i \ne j)$$
Linear interpolation therefore does not severely interfere with other task representations.
11.4 Advanced Model Merging Variants
| Method | Formula / Operation | Core Idea |
|---|---|---|
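| Linear Interpolation | $\theta = (1 - \alpha)\,\theta_A + \alpha\,\theta_B$ | Simplest; element-wise linear average |
| SLERP (Spherical Linear Interpolation) | $\theta = \frac{\sin((1 - t)\Omega)}{\sin \Omega}\,\theta_A + \frac{\sin(t\,\Omega)}{\sin \Omega}\,\theta_B$, where $\Omega = \arccos \frac{\theta_A \cdot \theta_B}{\lVert \theta_A \rVert \lVert \theta_B \rVert}$ | Interpolates on the hypersphere, preserving vector norms |
| DARE (Drop And REscale) | Randomly drop a fraction $p$ of parameters in the task vector $\tau$, then rescale the remaining: $\tau' = \frac{m \odot \tau}{1 - p}$ (with $m$ the random keep mask), then merge | Sparsifies the task vector to reduce interference |
| TIES (Trim, Elect, Sign) | ① Trim small-magnitude changes ② Vote on sign ③ Keep only parameters with a consistent direction | Resolves parameter conflicts when merging multiple models |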
SLERP intuition: the "direction" of a weight vector matters more than its "length"; spherical interpolation preserves the geometric relationship between directions.
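The SLERP and DARE operations from the table can be sketched on flat weight vectors. This is a simplified illustration on Python lists, not how any specific merging library implements them; `slerp` and `dare` are hypothetical helper names.

```python
import math
import random

def slerp(theta_a, theta_b, t, eps=1e-8):
    """Spherical interpolation between two flat weight vectors (lists of floats)."""
    norm_a = math.sqrt(sum(x * x for x in theta_a))
    norm_b = math.sqrt(sum(x * x for x in theta_b))
    cos_omega = sum(a * b for a, b in zip(theta_a, theta_b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:  # (anti)parallel vectors: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]
    w_a = math.sin((1 - t) * omega) / math.sin(omega)
    w_b = math.sin(t * omega) / math.sin(omega)
    return [w_a * a + w_b * b for a, b in zip(theta_a, theta_b)]

def dare(task_vector, drop_prob, rng):
    """Drop each entry with probability drop_prob; rescale survivors by 1 / (1 - drop_prob)."""
    keep_scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else v * keep_scale for v in task_vector]

midpoint = slerp([1.0, 0.0], [0.0, 1.0], t=0.5)
tau = [0.2, -0.1, 0.4, 0.3]
sparse_tau = dare(tau, drop_prob=0.5, rng=random.Random(0))
print("slerp midpoint:", midpoint)
print("DARE output:", sparse_tau)
```

For two unit vectors the SLERP midpoint stays on the unit circle, whereas a plain average would shrink the norm to about 0.71. The DARE rescaling keeps the expected value of each entry unchanged.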
12. Catastrophic Forgetting, Mode Collapse & Reward Hacking
These are three distinct training failure modes.
12.1 Catastrophic Forgetting
Definition: when the model learns new behaviors during SFT or RL, it loses knowledge and capabilities acquired during pre-training.
Mechanism:
Neural network weight space is finite; gradient updates for new tasks may overwrite weights that store old knowledge.
Detection Metrics:
- Drop in benchmark scores (MMLU, GSM8K, HumanEval, etc.)
- Perplexity rises on the pre-training distribution
- Accuracy of capability probes drops
Mitigation Strategies:
| Strategy | Description |
|---|---|
| Mixed training data | Mix pre-training data into SFT |
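| Low-rank adaptation (LoRA) | Only updates the low-rank delta $\Delta W = BA$, greatly reducing interference with original weights |
| Regularization | EWC (Elastic Weight Consolidation): $\mathcal{L} = \mathcal{L}_{\text{new}} + \frac{\lambda}{2} \sum_i F_i\,(\theta_i - \theta_i^{*})^2$, where $F_i$ is the Fisher information and $\theta^{*}$ the weights before fine-tuning |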
| Model merging | WiSE-FT / SLERP merges the aligned model with the base model |
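As a small illustration of the EWC row, the sketch below computes the quadratic penalty $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_i^*)^2$ on toy values. The Fisher values are made up; in a real setup they would be estimated from squared gradients on the old task. `ewc_penalty` is a hypothetical helper name.

```python
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC regularizer: 0.5 * lam * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher))

theta_star = [0.0, 2.0]   # weights after pre-training
fisher = [4.0, 0.1]       # weight 0 matters a lot for old knowledge, weight 1 barely

# Moving the important weight by 1.0 costs far more than moving the unimportant one.
cost_important = ewc_penalty([1.0, 2.0], theta_star, fisher, lam=1.0)
cost_unimportant = ewc_penalty([0.0, 3.0], theta_star, fisher, lam=1.0)
print(cost_important, cost_unimportant)
```

The penalty is added to the new-task loss, so gradient updates are steered toward weights with low Fisher information.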
12.2 Mode Collapse
Definition: during RL training, the model's output diversity drops sharply, repeatedly producing similar or even identical responses.
Mechanism: the policy over-optimizes a high-reward pattern, concentrating probability mass onto a small number of outputs:
Detection Metrics:
- Drop in output diversity metrics (self-BLEU ↑, distinct-n ↓, entropy of token distribution ↓)
- Responses converge across prompts
- Almost no variation under temperature sampling
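Two of these diversity metrics are easy to compute from a batch of sampled responses. The sketch below uses whitespace tokenization for simplicity, and its entropy is a unigram entropy over the sampled text, which is only a proxy for the entropy of the policy's token distribution. `distinct_n` and `unigram_entropy` are hypothetical helper names.

```python
import math
from collections import Counter

def distinct_n(responses, n=2):
    """Unique n-grams divided by total n-grams across all responses (lower = less diverse)."""
    ngrams = []
    for text in responses:
        toks = text.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

def unigram_entropy(responses):
    """Shannon entropy (nats) of the empirical token distribution over all responses."""
    counts = Counter(tok for text in responses for tok in text.split())
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

collapsed = ["the answer is 42"] * 4
diverse = ["the answer is 42", "I think it equals 42",
           "forty two seems right", "my result: 42"]

print("distinct-2:", distinct_n(collapsed), distinct_n(diverse))
print("entropy:", round(unigram_entropy(collapsed), 3), round(unigram_entropy(diverse), 3))
```

Tracking these per training step on a fixed prompt set gives an early warning before collapse becomes visible in the samples.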
Mitigation Strategies:
| Strategy | Description |
|---|---|
| Increase KL penalty | A larger $\beta$ keeps the policy close to $\pi_{\text{ref}}$, maintaining diversity |
| Entropy regularization | Add an entropy bonus term $\lambda\, H(\pi_\theta)$ to the objective to encourage exploration |
| Data diversity | Training data covers a diverse prompt distribution |
| Early stopping | Monitor diversity metrics and stop training promptly |
12.3 Reward Hacking
Definition: the policy learns to exploit RM weaknesses, achieving high RM scores while actual human evaluation declines. This is a direct manifestation of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Detection Metrics:
| Metric | Description |
|---|---|
| Divergence between RM score and human rating | The gap between RM score and human rating grows |
| Continuously increasing KL | Policy keeps drifting away from $\pi_{\text{ref}}$ |
| Surge of specific patterns | e.g., overuse of filler phrases like "however", "it is worth noting", etc. |
| Response length bloat | RM favors longer answers → model learns to produce redundant content |
Mitigation Strategies:
| Strategy | Description |
|---|---|
| KL penalty | Constrains the policy from drifting too far (most fundamental) |
| RM ensemble | Average over multiple RMs to reduce bias from any single RM |
| Adversarial training | Continuously update the RM to adapt to policy changes (online RLHF) |
| Human evaluation | Periodically evaluate with humans to detect RM–human divergence |
| Length penalty | Apply length normalization to RM scores |
12.4 Relationship Summary
Pre-training → SFT → RL
↓ ↓ ↓
Catastrophic Mode Reward
Forgetting Collapse Hacking
(knowledge (diversity (RM being
loss) loss) exploited)
| Feature | Catastrophic Forgetting | Mode Collapse | Reward Hacking |
|---|---|---|---|
| Stage | SFT / RL | RL | RL |
| Root cause | Weight overwriting | Over-optimization of a single pattern | RM weaknesses exploited |
| Symptom | Capability degradation | Monotone output | High RM score but poor quality |
| Core mitigation | Regularization + mixed data | KL + entropy + diversity | KL + RM ensemble + human evaluation |
13. Constitutional AI / RLAIF
13.1 RLAIF Overview
RLAIF (Reinforcement Learning from AI Feedback) replaces human annotators with LLM-generated preference labels:
13.2 CAI Self-Critique-Revision Loop
The core of Constitutional AI (CAI) is a four-step loop:
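Step 1 (Generate): given prompt $x$, use the current model to generate an initial response $y_0$:
$$y_0 \sim \pi_\theta(\cdot \mid x)$$
Step 2 (Critique): use an LLM to critique $y_0$ according to the constitution principles:
$$c_0 = \text{LLM}_{\text{critique}}(x, y_0, \text{principles})$$
Step 3 (Revise): based on the critique, use the LLM to revise the response:
$$y_1 = \text{LLM}_{\text{revise}}(x, y_0, c_0)$$
Can be iterated multiple times: $y_0 \to y_1 \to \dots \to y_K$ (typically 1–3 rounds)
Step 4 (Train):
- SL-CAI (SFT stage): use the final revised response $y_K$ as training data for SFT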
- RL-CAI (RL stage): use the LLM as a preference judge, generate preference pairs, train a reward model, then do RL
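The generate, critique, revise part of the loop can be sketched as plain control flow. The three callables are hypothetical stand-ins for LLM calls (no real API is assumed); the toy stubs below only make the data flow visible.

```python
def critique_revise(prompt, generate, critique, revise, principles, rounds=2):
    """Run generate -> (critique -> revise) x rounds; return the final response and all drafts."""
    response = generate(prompt)
    drafts = [response]
    for i in range(rounds):
        principle = principles[i % len(principles)]  # cycle through the constitution
        feedback = critique(prompt, response, principle)
        response = revise(prompt, response, feedback)
        drafts.append(response)
    return response, drafts

# Toy stubs standing in for model calls (assumptions for illustration).
def toy_generate(prompt):
    return "draft"

def toy_critique(prompt, response, principle):
    return f"check against: {principle}"

def toy_revise(prompt, response, feedback):
    return response + " [revised]"

principles = ["be harmless", "avoid bias"]
final, drafts = critique_revise("How do I pick a lock?", toy_generate,
                                toy_critique, toy_revise, principles, rounds=2)
print(final)
print(len(drafts), "drafts")
```

For SL-CAI the `(prompt, final)` pairs become SFT data; for RL-CAI the same judge model would instead compare pairs of responses to produce preference labels.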
13.3 Constitution Principles
Constitution principles are a set of auditable alignment constraints, for example:
| No. | Principle Example |
|---|---|
| P₁ | "Choose the response that is most helpful, accurate, and harmless" |
| P₂ | "Choose the response that does not promote bias or discrimination" |
| P₃ | "Choose the response that does not assist with illegal activities" |
Unlike implicit human preferences, constitution principles are explicit and auditable: the preference signal is an LLM score computed against the written principles.
13.4 Comparison with Standard RLHF
| Dimension | Standard RLHF | RLAIF / CAI |
|---|---|---|
| Preference source | Human annotators | LLM (based on constitution principles) |
| Annotation cost | High (labor intensive) | Low (API call cost) |
| Scalability | Limited by annotator count and time | Nearly unlimited scaling |
| Consistency | Inter-annotator variance | LLM is highly consistent |
| Auditability | Preference criteria exist implicitly in annotators' minds | Constitution principles are explicit and auditable |
| Risk | Annotator bias | LLM's own bias + poorly designed constitution principles |
| Human involvement | Throughout | Only when designing constitution principles |
13.5 Theoretical Advantages of RLAIF
- Principle-guided: alignment objectives are expressed explicitly through natural-language principles, making them more controllable than implicit preferences
- Self-improvement loop: model critiques itself, revises itself, learns from the revised version → continuous improvement
- Reduced human burden: humans only need to design principles, not annotate individual examples
- Cross-cultural consistency: annotators from different cultural backgrounds may have different preferences, whereas constitution principles can unify the standard
Note: CAI is not fully human-free. Humans still need to:
- Design the constitution
- Evaluate final model quality
- Monitor for drift during iteration
14. Distillation (Post-Training Perspective)
14.1 Comparison of Three Distillation Paradigms
SeqKD (Sequence-Level Knowledge Distillation)
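The Teacher first generates complete output sequences via beam search or sampling; the Student then performs standard SFT (cross-entropy loss) on these sequences:
$$\mathcal{L}_{\text{SeqKD}} = -\mathbb{E}_{\hat{y} \sim p_T(\cdot \mid x)}\left[\sum_t \log p_S(\hat{y}_t \mid \hat{y}_{<t}, x)\right]$$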
Key points:
- Data can be generated offline; no online Teacher inference required;
- The distillation signal comes only from discrete sequences sampled by the Teacher, losing the full distribution information across tokens ("soft labels" are hardened);
- Simplest to implement, lowest cost, suitable for most engineering scenarios.
Token-Level KD (Token-Level Knowledge Distillation)
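At each position $t$, align the Student's and Teacher's probability distributions over the vocabulary:
$$\mathcal{L}_{\text{TokenKD}} = \sum_t \mathrm{KL}\left(p_T(\cdot \mid y_{<t}, x) \,\|\, p_S(\cdot \mid y_{<t}, x)\right)$$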
Key points:
- Preserves the Teacher's soft distribution (soft labels), providing richer information, especially when multiple tokens are plausible;
- Requires the Teacher to provide logits online or offline; the Teacher must be accessible (or logits must be cached in advance);
- When the Teacher is very large (e.g., 671B MoE), caching logits for all positions is extremely expensive.
On-Policy Distillation
The Student itself rolls out candidate sequences, which are then scored by the Teacher (or a verifiable reward); the Student updates accordingly:
Key points:
- Training signal comes from the Student's own distribution, no off-policy drift;
- Equivalent to using Teacher reward as an RLVR signal; training is more complex but generalization is typically stronger;
- Representative method: GRPO + verifiable reward (Teacher itself as verifier).
14.2 CoT Distillation (R1-style Chain-of-Thought Distillation)
Core idea: use a large RL model (e.g., DeepSeek-R1-671B) to generate long reasoning sequences with complete chains of thought, then perform SFT on a small model (i.e., the CoT version of SeqKD).
The DeepSeek-R1 paper (arXiv:2501.12948) reports experimental results of SFT on Qwen and Llama models at 1.5B, 7B, 8B, 14B, 32B, and 70B parameters using approximately 800K distillation samples (approximately 600K reasoning + approximately 200K non-reasoning), with reasoning capability of small models improving substantially.
Why CoT distillation into small models is often more stable / more efficient than directly applying GRPO (per the distillation experiments in the R1 paper):
- Asymmetric exploration cost: GRPO requires the model to independently explore high-quality chains of thought, but small models have limited capability — random sampling rarely produces effective reasoning sequences (reward is extremely sparse), and gradient signals are noisy; the Teacher directly providing high-quality CoT effectively compresses the exploration space.
- No Critic / RM needed: the SeqKD path only requires SFT — no online rollout or reward model — eliminating the GPU memory and compute overhead of GRPO's online sampling and reward/critic.
- Training stability: the loss landscape of SFT is smoother than RL, with no risk of reward hacking or mode collapse and fewer hyperparameters.
Hedging caveat: the above "more stable / more efficient" conclusion comes from observational results in the R1 paper under its distillation configuration (DeepSeek-R1 as the Teacher, Qwen and Llama students, approximately 800K data scale), and does not imply this holds across all small models or data scales; direct RL (GRPO) may have a higher ceiling when data and compute are sufficient.
14.3 Forward KL vs Reverse KL
Definitions
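Forward KL (also called inclusive KL; mean-seeking):
$$\mathrm{KL}(p_T \,\|\, p_S) = \mathbb{E}_{y \sim p_T}\left[\log \frac{p_T(y)}{p_S(y)}\right]$$
Optimization direction: minimizing the forward KL of $p_S$ relative to $p_T$ is equivalent to maximizing $\mathbb{E}_{y \sim p_T}[\log p_S(y)]$, so the Student must cover all modes of the Teacher (wherever $p_T(y) > 0$, $p_S(y)$ cannot be 0, otherwise KL diverges).
Reverse KL (also called exclusive KL; mode-seeking):
$$\mathrm{KL}(p_S \,\|\, p_T) = \mathbb{E}_{y \sim p_S}\left[\log \frac{p_S(y)}{p_T(y)}\right]$$
Optimization direction: minimizing this quantity takes the expectation over the support of $p_S$, allowing $p_S$ to ignore certain modes of $p_T$ (the term is 0 where $p_S(y) = 0$), but $p_S$ will concentrate on regions where $p_T$ has high probability.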
Why Generation Tasks Often Prefer Reverse KL / Mode-Seeking
Intuitive derivation:
Suppose the Teacher distribution $p_T$ is bimodal, with two modes each having probability $\approx 1/2$.
Forward KL: to maximize $\mathbb{E}_{y \sim p_T}[\log p_S(y)]$, the Student must cover both modes, resulting in $p_S$ being spread between the two modes. This middle ground in text space often corresponds to low-quality or unnatural sequences (the "mean" is a semantically meaningless mixture). This phenomenon in generation tasks is called mode averaging: the output is an average of all modes, resembling none of the reasonable answers.
Reverse KL: the Student incurs a log penalty wherever $p_S(y) > 0$ but $p_T(y) \approx 0$, naturally choosing to concentrate on one high-probability, semantically coherent mode in $p_T$. Although the other mode is sacrificed, the generated sequences are higher quality and more natural.
Mathematical statement: let $p_S^{*} = \arg\min_{p_S} \mathrm{KL}(p_S \,\|\, p_T)$; for a capacity-limited Student, the solution exhibits mass concentration on the dominant mode(s) of $p_T$, rather than "smearing" across multiple modes.
One-line intuition: Forward KL requires "don't miss any answer from the Teacher"; Reverse KL allows "only learn the Teacher's most confident answers." Generation tasks require coherent outputs — better to cover less but with higher quality, hence the preference for Reverse KL.
Note: Token-level KD typically uses forward KL (Student aligns to Teacher soft labels), while SeqKD / SFT at the sequence level more closely resembles reverse KL behavior (Student only learns the modes sampled by the Teacher). The two are not mutually exclusive; in practice they are often mixed depending on the task.
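The mean-seeking vs mode-seeking contrast can be checked numerically on a five-bin toy distribution. The Teacher and the two candidate Students below are hand-picked illustrative values: one Student spreads mass everywhere, the other commits to a single Teacher mode.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists (q must be > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Bimodal Teacher: two strong modes at the ends, little mass in the middle.
teacher = [0.45, 0.04, 0.02, 0.04, 0.45]
# Two capacity-limited Student candidates (assumed shapes for illustration).
student_spread = [0.20, 0.20, 0.20, 0.20, 0.20]   # covers everything, including the weak middle
student_mode = [0.90, 0.07, 0.01, 0.01, 0.01]     # commits to the left mode

forward = {name: kl(teacher, s) for name, s in
           (("spread", student_spread), ("mode", student_mode))}
reverse = {name: kl(s, teacher) for name, s in
           (("spread", student_spread), ("mode", student_mode))}

print("forward KL(p_T || p_S):", {k: round(v, 3) for k, v in forward.items()})
print("reverse KL(p_S || p_T):", {k: round(v, 3) for k, v in reverse.items()})
```

Forward KL scores the spread Student as the better fit because it never misses Teacher mass, while reverse KL prefers the single-mode Student because it puts little mass where the Teacher has almost none.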
14.4 Distillation vs RFT vs PPO: Three-Row Comparison
| Method | Data Source | Comparison / Optimization Signal | Applicable Scale |
|---|---|---|---|
| Distillation (SeqKD) | Sequences generated by the Teacher (offline) | Teacher output sequences (cross-entropy / soft labels) | Small-to-medium models (typically ≤ 70B), Teacher significantly stronger than Student |
| RFT (Rejection Sampling FT) | Self-sampled from current policy, filtered by reward to keep high-scoring outputs | Verifiable reward / RM filtering | Medium scale (7B–70B), reward can be automatically verified |
| PPO | Online rollout from current policy | RM score + KL constraint + GAE Advantage | Large scale (typically ≥ 7B), with sufficient RM and compute resources |
14.5 Self-Assessment Questions
L2 — Distinguishing Distillation Paradigms: both SeqKD and Token-Level KD use the Teacher model as the signal source, but fundamentally one more closely resembles reverse KL and the other more closely resembles forward KL. Please explain: (a) which corresponds to which direction of KL; (b) when the Teacher distribution is bimodal, how will the Student distributions trained by each method behave differently?
L3 — Applicability Analysis of CoT Distillation: suppose you have a 3B small model and sufficient GPUs (capable of running both the 671B Teacher and the Student simultaneously). Analyze: under what data scale and task types would directly applying GRPO have an advantage over SeqKD distillation? Give at least two substantive reasons.
Part 2 — PyTorch Code Snippets / From-Scratch PyTorch Snippets
SFT loss masking — During SFT training, compute loss only on the assistant's response tokens; mask the prompt portion with label=-100.
import torch
from torch.nn.utils.rnn import pad_sequence


class SFTDataCollator:
    """
    Masks prompt tokens with label=-100 so loss only applies to assistant tokens.
    """

    def __init__(self, tokenizer):
        # fall back to 0 only when the tokenizer defines no pad token
        pad_id = tokenizer.pad_token_id
        self.pad_id = 0 if pad_id is None else pad_id

    def __call__(self, batch):
        input_ids, labels, attention_mask = [], [], []
        for sample in batch:  # each sample: dict with 'input_ids' and 'prompt_length'
            ids = torch.tensor(sample["input_ids"], dtype=torch.long)
            prompt_len = sample["prompt_length"]
            lab = ids.clone()
            lab[:prompt_len] = -100  # mask prompt tokens
            input_ids.append(ids)
            labels.append(lab)
            # build the mask before padding so real tokens equal to pad_id stay attended
            attention_mask.append(torch.ones_like(ids))
        # dynamic padding to the longest sequence in the batch
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.pad_id)
        labels = pad_sequence(labels, batch_first=True, padding_value=-100)
        attention_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)
        return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}


# --- Usage example ---
collator = SFTDataCollator(type("Tok", (), {"pad_token_id": 0})())
toy_batch = [
    {"input_ids": [10, 20, 30, 40, 50], "prompt_length": 3},  # prompt = first 3 tokens
    {"input_ids": [11, 21, 31], "prompt_length": 2},
]
out = collator(toy_batch)
print("input_ids:\n", out["input_ids"])
print("labels (prompt and padding positions = -100):\n", out["labels"])
# labels: tensor([[-100, -100, -100,   40,   50],
#                 [-100, -100,   31, -100, -100]])
DPO loss — Compute the Direct Preference Optimization loss from log-probabilities of the policy and reference models.
import torch
import torch.nn.functional as F


# No @torch.no_grad() on this helper: policy log-probs must keep gradients for DPO training.
# Wrap only the reference-model forward pass in torch.no_grad() at the call site.
def get_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """
    Computes per-token log-probs and sums over the sequence dimension.
    logits: (B, T, V), labels: (B, T), mask: (B, T) (1=valid, 0=padding)
    Returns a scalar log-prob per sample.
    """
    # shift: position t predicts token t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    shift_mask = mask[:, 1:].contiguous()
    log_probs = F.log_softmax(shift_logits, dim=-1)  # (B, T-1, V)
    token_logps = log_probs.gather(-1, shift_labels.unsqueeze(-1)).squeeze(-1)  # (B, T-1)
    return (token_logps * shift_mask).sum(dim=-1)  # (B,)


def dpo_loss(
    policy_logps_chosen: torch.Tensor,
    policy_logps_rejected: torch.Tensor,
    ref_logps_chosen: torch.Tensor,
    ref_logps_rejected: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """
    DPO loss: L = -E[ log σ( β·(log π_θ/π_ref)_chosen - β·(log π_θ/π_ref)_rejected ) ]
    """
    log_ratio_chosen = policy_logps_chosen - ref_logps_chosen
    log_ratio_rejected = policy_logps_rejected - ref_logps_rejected
    loss = -F.logsigmoid(beta * (log_ratio_chosen - log_ratio_rejected)).mean()
    return loss


# --- Example ---
B, T, V = 4, 10, 100
logits = torch.randn(B, T, V)
labels = torch.randint(0, V, (B, T))
mask = torch.ones(B, T)
logps = get_logps(logits, labels, mask)  # (B,)
# splitting chosen / rejected is done on the caller side
policy_logps_chosen, policy_logps_rejected = logps[:2], logps[2:]
ref_logps_chosen, ref_logps_rejected = logps[:2] - 0.1, logps[2:] + 0.05
loss = dpo_loss(policy_logps_chosen, policy_logps_rejected,
                ref_logps_chosen, ref_logps_rejected, beta=0.1)
print("DPO loss:", loss.item())
Reward Model — Replace the LM head on a pretrained LLM backbone with a scalar linear head and train with Bradley-Terry loss.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer


class RewardModel(nn.Module):
    """
    Reward model: LLM backbone + scalar linear head on last valid hidden state.
    """

    def __init__(self, model_name: str = "Qwen/Qwen2.5-0.5B"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)  # scalar reward

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state  # (B, T, H)
        # hidden state of the last valid token; this indexing assumes right padding
        last_idx = attention_mask.sum(dim=1) - 1  # (B,)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]  # (B, H)
        reward = self.reward_head(last_hidden).squeeze(-1)  # (B,)
        return reward


def bradley_terry_loss(rewards_chosen, rewards_rejected):
    """
    Bradley-Terry loss: L = -log σ(r_chosen - r_rejected)
    Pushes the preferred response toward a higher reward than the rejected one.
    """
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()


# --- Training example ---
device = "cpu"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer.padding_side = "right"  # required by the last-token indexing in forward()
rm = RewardModel("Qwen/Qwen2.5-0.5B").to(device)
chosen_text = ["The answer is 42.", "It is safe to proceed."]
rejected_text = ["I don't know.", "No, never do that."]
tok_chosen = tokenizer(chosen_text, return_tensors="pt", padding=True, truncation=True)
tok_rejected = tokenizer(rejected_text, return_tensors="pt", padding=True, truncation=True)
r_chosen = rm(tok_chosen["input_ids"], tok_chosen["attention_mask"])
r_rejected = rm(tok_rejected["input_ids"], tok_rejected["attention_mask"])
loss = bradley_terry_loss(r_chosen, r_rejected)
print("BT loss:", loss.item())
PPO complete loss — single-step actor-critic loss: clipped surrogate + clipped value loss + entropy bonus + approx_kl diagnostic (token-level).
import torch
import torch.nn.functional as F


def ppo_actor_critic_loss(
    logp, old_logp, advantages, returns, values, old_values, entropy, mask,
    clip_eps=0.2, vf_clip=0.2, vf_coef=0.5, ent_coef=0.01,
):
    """
    Token-level PPO loss: clipped policy surrogate + clipped value loss + entropy bonus.
    All tensors (B, T); mask marks valid response tokens (1=valid).
    logp/old_logp: log π(a_t|s_t) under the current / old policy; advantages: GAE A_t;
    returns: R_t; values/old_values: current / old critic predictions.
    """
    def masked_mean(x):  # average over valid tokens only
        return (x * mask).sum() / mask.sum().clamp(min=1)

    # --- policy loss: clipped surrogate (pessimistic lower bound) ---
    ratio = torch.exp(logp - old_logp)  # pi_theta / pi_theta_old, (B,T)
    pg_loss = -torch.min(ratio * advantages,
                         torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    # --- clipped value loss (guards against critic jumps) ---
    v_clipped = old_values + torch.clamp(values - old_values, -vf_clip, vf_clip)
    v_loss = 0.5 * torch.max((values - returns) ** 2, (v_clipped - returns) ** 2)
    # --- total = policy + c_vf * value - c_ent * entropy ---
    loss = masked_mean(pg_loss) + vf_coef * masked_mean(v_loss) - ent_coef * masked_mean(entropy)
    # --- diagnostics: approx_kl via k3 = (r-1) - log r (here r = pi_theta/pi_theta_old,
    # estimating KL(pi_old || pi_theta); estimator rationale in §9.4, but note the r
    # convention is inverted vs §9.4's r = pi_ref/pi_theta) ---
    with torch.no_grad():
        log_ratio = logp - old_logp
        approx_kl = masked_mean((ratio - 1) - log_ratio)  # >= 0, for early-stop / adaptive KL
        clip_frac = masked_mean((torch.abs(ratio - 1) > clip_eps).float())
    return loss, {"approx_kl": approx_kl.item(), "clip_frac": clip_frac.item()}


# --- Toy example ---
torch.manual_seed(0)
B, T = 2, 5
logp = (torch.randn(B, T) * 0.1).requires_grad_(True)
old_logp = logp.detach() + torch.randn(B, T) * 0.05  # behavior (old) policy
advantages = torch.randn(B, T)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
returns = torch.randn(B, T)
values = torch.randn(B, T, requires_grad=True)
old_values = values.detach() + torch.randn(B, T) * 0.1
entropy = torch.rand(B, T)  # per-token policy entropy
mask = torch.ones(B, T)
mask[1, 3:] = 0  # second row's tail is padding
loss, logs = ppo_actor_critic_loss(logp, old_logp, advantages, returns, values, old_values, entropy, mask)
loss.backward()
print("PPO loss:", round(loss.item(), 4), "| diag:", {k: round(v, 4) for k, v in logs.items()})
GRPO advantage — Group Relative Policy Optimization: normalize rewards within the same group to produce advantages for the policy gradient update.
import torch
import torch.nn.functional as F


def compute_grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """
    Normalize rewards within group: subtract mean, divide by std.
    rewards: (G,) rewards of the G sampled responses for the same prompt
    """
    mean = rewards.mean()
    std = rewards.std().clamp(min=1e-8)  # avoid division by zero
    return (rewards - mean) / std


# --- Simplified policy gradient update ---
# simulate: given policy log-probs and group advantages, perform one gradient ascent step
G = 8  # sample 8 responses per prompt
# simulated per-sequence log-probs (already summed to sequence level)
policy_logps = torch.randn(G, requires_grad=True)
# simulated rewards (e.g., from a reward model)
rewards = torch.tensor([1.2, 0.5, 2.0, 0.3, 1.8, 0.1, 1.5, 0.9])
advantages = compute_grpo_advantages(rewards)
print("Advantages:", advantages)
# policy gradient loss = -E[advantage * log_prob] → maximize log-prob for high-advantage responses
grpo_loss = -(advantages.detach() * policy_logps).mean()
grpo_loss.backward()
print("GRPO loss:", grpo_loss.item())
print("policy_logps.grad:", policy_logps.grad)
GRPO token-level loss — broadcast group advantage to tokens + clipped surrogate + per-token K3 KL (no critic, no GAE; token-level averaging, cf. §9.4 and DAPO).
import torch


def grpo_token_loss(logp, old_logp, ref_logp, group_adv, mask, clip_eps=0.2, beta_kl=0.04):
    """
    Token-level GRPO loss: per-sequence group advantage broadcast to tokens
    + clipped surrogate + per-token K3 KL (no critic, no GAE).
    logp/old_logp/ref_logp: (B, T) log-prob of the taken token under current / old / reference policy
    group_adv: (B,) within-group normalized advantage A_i (see compute_grpo_advantages above), broadcast per sequence
    mask: (B, T) 1=valid response token
    """
    adv = group_adv.unsqueeze(1)  # (B,1) -> broadcast to (B,T)
    # clipped surrogate (same clip as PPO, group-relative advantage)
    ratio = torch.exp(logp - old_logp)  # pi_theta / pi_theta_old
    pg = -torch.min(ratio * adv, torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    # per-token K3 KL: r = pi_ref/pi_theta, k3 = (r-1) - log r >= 0 (same convention as §9.4)
    log_r = ref_logp - logp  # log(pi_ref / pi_theta)
    k3 = torch.exp(log_r) - 1 - log_r
    per_token = pg + beta_kl * k3
    # token-level averaging convention borrowed from DAPO (§3.3) so long-CoT gradients are not diluted;
    # note the KL term here is GRPO-style (beta>0), not DAPO (which sets beta=0)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)


# --- Toy example ---
torch.manual_seed(0)
B, T = 4, 6  # 4 sampled responses for one prompt
logp = (torch.randn(B, T) * 0.1).requires_grad_(True)
old_logp = logp.detach() + torch.randn(B, T) * 0.02
ref_logp = logp.detach() + torch.randn(B, T) * 0.05
rewards = torch.tensor([1.2, 0.3, 1.8, 0.5])  # one scalar reward per response
group_adv = (rewards - rewards.mean()) / rewards.std().clamp(min=1e-8)  # within-group normalization
mask = torch.ones(B, T)
mask[1, 4:] = 0
loss = grpo_token_loss(logp, old_logp, ref_logp, group_adv, mask)
loss.backward()
print("GRPO token-level loss:", round(loss.item(), 4))
Sequence packing with cu_seqlens — Concatenate multiple variable-length sequences into a single batch, compute cu_seqlens required by Flash Attention, and correctly mask the loss over the packed output.
import torch

def pack_sequences(input_ids_list, labels_list, pad_token_id=0):
    """
    Packs variable-length sequences into a flat tensor with cu_seqlens for Flash Attention.
    pad_token_id is unused here because packing needs no padding; it is kept in the signature only.
    """
    # compute real lengths of each sequence
    lengths = [ids.size(0) for ids in input_ids_list]
    # cu_seqlens: [0, len_0, len_0+len_1, ...] (int32 cumulative offsets, the Flash Attention format)
    cu_seqlens = torch.zeros(len(lengths) + 1, dtype=torch.int32)
    for i, l in enumerate(lengths):
        cu_seqlens[i + 1] = cu_seqlens[i] + l
    # concatenate all sequences into one flat tensor
    packed_input_ids = torch.cat(input_ids_list, dim=0)  # (total_tokens,)
    packed_labels = torch.cat(labels_list, dim=0)  # (total_tokens,)
    return packed_input_ids, packed_labels, cu_seqlens

def compute_packed_loss(logits_flat, labels_flat, cu_seqlens, ignore_index=-100):
    """
    Compute cross-entropy on packed sequence; -100 labels are masked.
    logits_flat: (total_tokens, V), labels_flat: (total_tokens,)
    """
    # shift for next-token prediction: shifted position j predicts token j+1
    shift_logits = logits_flat[:-1, :]
    shift_labels = labels_flat[1:].clone()  # clone so the caller's labels are not modified in place
    # mask loss at sequence boundaries: the last token of one sequence
    # must not be trained to predict the first token of the next sequence
    boundary_mask = torch.zeros(shift_labels.size(0), dtype=torch.bool)
    for start in cu_seqlens[1:-1].tolist():  # start offsets of the 2nd, 3rd, ... sequences
        if start > 0:
            boundary_mask[start - 1] = True  # shifted position start-1 has token `start` as its target
    shift_labels[boundary_mask] = ignore_index
    loss = torch.nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=ignore_index)
    return loss

# --- Example ---
seq_a_ids = torch.tensor([101, 202, 303, 404, 505])
seq_b_ids = torch.tensor([606, 707])
seq_c_ids = torch.tensor([808, 909, 1010])
seq_a_lab = torch.tensor([-100, -100, 303, 404, 505])  # first two are prompt
seq_b_lab = torch.tensor([-100, 707])
seq_c_lab = torch.tensor([-100, 909, 1010])  # labels equal the input ids on response tokens
packed_ids, packed_labels, cu_seqlens = pack_sequences(
    [seq_a_ids, seq_b_ids, seq_c_ids], [seq_a_lab, seq_b_lab, seq_c_lab]
)
print("packed_ids:", packed_ids)
print("cu_seqlens:", cu_seqlens)  # cumulative offsets [0, 5, 7, 10]
# simulate logits
V = 2000
logits_flat = torch.randn(packed_ids.size(0), V)
loss = compute_packed_loss(logits_flat, packed_labels, cu_seqlens)
print("Packed loss:", loss.item())
KL divergence penalty — In PPO/RLHF reward shaping, compute the per-token KL penalty between the policy and reference models.
import torch
import torch.nn.functional as F

def compute_kl_penalty(
    policy_logits: torch.Tensor,
    ref_logits: torch.Tensor,
    mask: torch.Tensor,
) -> torch.Tensor:
    """
    Per-token KL divergence KL(π_θ || π_ref), computed exactly over the vocabulary,
    averaged over the valid tokens of each sequence and then over the batch.
    policy_logits / ref_logits: (B, T, V), mask: (B, T), 1=valid, 0=padding
    """
    policy_logps = F.log_softmax(policy_logits, dim=-1)  # (B, T, V)
    ref_logps = F.log_softmax(ref_logits, dim=-1)  # (B, T, V)
    # KL(p||q) = sum_x p(x) * [log p(x) - log q(x)] = E_p[log p - log q]
    policy_probs = policy_logps.exp()
    token_kl = (policy_probs * (policy_logps - ref_logps)).sum(dim=-1)  # (B, T)
    # masked mean over each sequence's valid tokens
    kl_per_seq = (token_kl * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # (B,)
    return kl_per_seq.mean()  # scalar

# --- Used in PPO reward shaping ---
B, T, V = 2, 8, 1000
policy_logits = torch.randn(B, T, V)
ref_logits = torch.randn(B, T, V)
mask = torch.ones(B, T)
mask[1, 6:] = 0  # second sequence has padding in its last two positions
kl = compute_kl_penalty(policy_logits, ref_logits, mask)
print("KL penalty:", kl.item())
# PPO reward shaping: r = r_raw - beta * KL
beta_kl = 0.05
shaped_reward = 1.5 - beta_kl * kl  # 1.5 stands in for the raw RM score; applied at batch level
print("Shaped reward:", shaped_reward.item())
Rejection Sampling Fine-tuning (RFT) — Sample N responses from the policy model, score them with a reward function, keep the highest-scoring response as the SFT target for fine-tuning.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def rejection_sampling_finetune(model, tokenizer, prompts, reward_fn, N=4, max_new_tokens=64):
    """
    RFT loop: sample N responses per prompt, score with reward_fn, keep top-1 as SFT target.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    for prompt in prompts:
        # ---- Sampling phase ----
        model.eval()  # disable dropout while sampling
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
        all_completions, all_rewards = [], []
        with torch.no_grad():
            for _ in range(N):
                out = model.generate(input_ids, max_new_tokens=max_new_tokens,
                                     do_sample=True, temperature=0.8, top_p=0.95)
                gen_ids = out[0, input_ids.size(1):]  # keep generated portion only
                text = tokenizer.decode(gen_ids, skip_special_tokens=True)
                reward = reward_fn(prompt, text)  # scalar reward
                all_completions.append(gen_ids)
                all_rewards.append(reward)
        # ---- Select best response ----
        best_idx = int(torch.tensor(all_rewards).argmax())
        best_ids = all_completions[best_idx]
        # ---- SFT phase (compute loss on best response) ----
        model.train()
        full_ids = torch.cat([input_ids[0], best_ids]).unsqueeze(0)  # (1, T)
        labels = full_ids.clone()
        labels[0, :input_ids.size(1)] = -100  # mask prompt tokens
        logits = model(input_ids=full_ids).logits
        loss = F.cross_entropy(logits[:, :-1, :].reshape(-1, logits.size(-1)),
                               labels[:, 1:].reshape(-1), ignore_index=-100)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"RFT loss: {loss.item():.4f}, best reward: {all_rewards[best_idx]:.4f}")

# --- Simple reward function ---
def dummy_reward_fn(prompt, response):
    """Reward: longer is better (demo only)."""
    return float(len(response))

# Run
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
prompts = ["Explain gravity in one sentence.", "What is 2+2?"]
rejection_sampling_finetune(model, tokenizer, prompts, dummy_reward_fn, N=4)
Part 3 — Interview Question Bank
━━━ L1 Basic ━━━
Q1. What problems do pre-training and post-training each solve? What is the standard pipeline?
Answer: Pre-training aims to have the model learn general language capabilities, world knowledge, and a foundation for reasoning from massive unlabeled text — essentially unsupervised language modeling. Post-training aims to transform this "knowledgeable but unruly" base model into an assistant that follows instructions, is helpful, safe, and aligned with human values. The standard pipeline is: 1) Supervised Fine-Tuning (SFT), which fine-tunes the model on high-quality instruction-response pairs; 2) Preference Alignment, which typically uses methods such as RLHF or DPO to further optimize model behavior based on human preference data.
Follow-up: Why can't a single stage (e.g., SFT alone) complete the full transformation from a pre-trained model to a usable assistant?
Q2. What is loss masking in SFT? Why is loss computed only on assistant tokens?
Answer: Loss masking means that when computing the SFT loss, only the prediction loss for tokens corresponding to the assistant's response (i.e., the part the model needs to learn to generate) is included in the total loss, while the loss for the input/user instruction portion is ignored. This focuses the model's optimization objective on "learning how to respond correctly" rather than "parroting the user's input." Without masking the input portion, the model might waste learning capacity memorizing input formats instead of focusing on generating high-quality responses.
Follow-up: If no loss is ever computed on the user-instruction tokens during SFT, is the model completely unable to learn to "understand" instructions? Please explain.
Q3. What is the training objective of a Reward Model? What is the Bradley-Terry model?
Answer: The training objective of a Reward Model (RM) is to output a scalar score for a given (prompt, response) pair that reflects human preference rankings for response quality. Specifically, it learns from comparisons between a pair of responses (chosen vs. rejected). The Bradley-Terry model is a probabilistic model for pairwise comparisons; it assumes that the probability of preferring the chosen response is the sigmoid of the reward difference between the two responses: P(y_w ≻ y_l | x) = σ(r(x, y_w) - r(x, y_l)). RM training minimizes the negative log-likelihood of the observed preferences, -log σ(r(x, y_w) - r(x, y_l)), which pushes the reward of the chosen response above that of the rejected response.
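As an illustration of the pairwise objective, here is a minimal sketch of the loss commonly used for RM training under the Bradley-Terry model, -log σ(r_chosen - r_rejected). The function name `bradley_terry_loss` and the toy scores are assumptions made for this example; in a real setup the scores would come from a scalar head on top of the language model.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise RM loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical scalar scores for 3 preference pairs (illustrative values only)
chosen = torch.tensor([1.5, 0.2, -0.3])
rejected = torch.tensor([0.5, 0.2, 0.7])

# Bradley-Terry preference probability: P(chosen preferred) = sigmoid(r_w - r_l)
pref_prob = torch.sigmoid(chosen - rejected)
loss = bradley_terry_loss(chosen, rejected)
print("P(chosen preferred):", pref_prob)
print("RM loss:", loss.item())
```

Only the difference between the two scores enters the loss, so adding a constant to every reward leaves it unchanged; the RM's absolute scale is not identified by pairwise data.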
Follow-up: If the preference rankings in human annotation data are inconsistent or noisy, how does this affect the Reward Model trained under the Bradley-Terry model?
Q4. What is the role of the KL penalty in RLHF? How is β tuned?
Answer: During the reinforcement learning phase of RLHF, the policy model (the LLM being optimized) maximizes rewards from the Reward Model when generating responses. However, this can cause the model to generate strange, unnatural text that deviates from its original capability distribution in pursuit of high scores. The KL penalty term is added to the optimization objective by computing the KL divergence between the current policy and the initial SFT model (reference policy). Its role is to constrain the optimized model from drifting too far from the initial model, thereby preserving language quality and diversity. β is a hyperparameter that controls the strength of the KL penalty: larger β imposes heavier penalties on deviation, making the model more conservative and closer to the initial model; smaller β gives the model more freedom, potentially pursuing higher rewards but at greater risk.
Follow-up: The KL penalty computes the divergence over the full sequence distribution. What challenges does this pose in practice? Are there more efficient or more local approximation methods?
Q5. What is DPO? What is the core difference from RLHF?
Answer: DPO (Direct Preference Optimization) is a method that directly optimizes a language model using human preference data. Through a clever mathematical transformation, it merges the two steps of RLHF — "train a Reward Model, then use it for RL optimization" — into a single supervised learning loss function. In DPO, the model directly learns to translate preference rankings into adjustments of response probabilities. The core difference is: RLHF is "explicit," involving a separate RM training step and an online RL optimization process (e.g., PPO); DPO is "implicit" — it bypasses explicit RM training and online sampling, directly optimizing the policy through an offline contrastive loss, and is generally simpler and more stable.
Follow-up: A major criticism of DPO is that it heavily depends on the quality of preference data. Why might its requirements for data quality be higher than those of RLHF?
Q6. What is sequence packing? What are its benefits and pitfalls?
Answer: Sequence packing is a training efficiency optimization technique. It concatenates multiple short sequences (e.g., multiple different instruction-response pairs), separated by special tokens such as <EOS> followed by the start token of the next sequence, into a single long sequence that fills the model's maximum context length, then trains on this as a whole. Benefits: significantly improves GPU utilization, reduces computation wasted on padding short sequences, and speeds up training. The main pitfalls are: 1) careful attention mask design is required to prevent the model from "seeing" information from other short sequences within the same packed sequence during training (i.e., cross-sequence attention leakage), which can cause data contamination or learning bias; 2) the model may be sensitive to sequence ordering.
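To make the leakage pitfall concrete, the sketch below builds a dense block-diagonal causal mask from `cu_seqlens`, so that each token attends only to earlier tokens of its own sequence. It also restarts position ids at every boundary, which is a common companion step but an assumption here, since the answer above does not mention it. The helper names are hypothetical. Flash Attention's variable-length kernels apply the same restriction implicitly from `cu_seqlens`, so the dense mask is for illustration only.

```python
import torch

def packed_attention_mask(cu_seqlens: torch.Tensor) -> torch.Tensor:
    """Boolean (total, total) mask: entry [i, j] is True where query i may attend to key j."""
    total = int(cu_seqlens[-1])
    lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).long()
    # seq_id[t] = index of the packed sequence that token t belongs to
    seq_id = torch.repeat_interleave(torch.arange(len(lengths)), lengths)
    same_seq = seq_id.unsqueeze(0) == seq_id.unsqueeze(1)
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    return same_seq & causal

def packed_position_ids(cu_seqlens: torch.Tensor) -> torch.Tensor:
    """Position ids that restart from 0 at every sequence boundary."""
    lengths = (cu_seqlens[1:] - cu_seqlens[:-1]).long()
    starts = torch.repeat_interleave(cu_seqlens[:-1].long(), lengths)
    return torch.arange(int(cu_seqlens[-1])) - starts

# Three packed sequences of lengths 5, 2, and 3
cu_seqlens = torch.tensor([0, 5, 7, 10], dtype=torch.int32)
mask = packed_attention_mask(cu_seqlens)
pos = packed_position_ids(cu_seqlens)
print(mask.int())
print("position ids:", pos.tolist())
```

A dense mask costs O(total²) memory, which is why production training code usually passes `cu_seqlens` to a variable-length attention kernel instead of materializing the mask.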
Follow-up: In sequence packing, if two concatenated sequences are on completely unrelated topics (e.g., a math problem and a poem), what specific harm does cross-sequence attention leakage cause?
Q7. What is reward hacking? Give two examples.
Answer: Reward hacking refers to the model finding ways to "cheat" or "game" the system to obtain higher reward scores, even though the generated responses do not actually meet the true human goals of being helpful, honest, and harmless. It is over-optimization or exploitation of the reward function. Example 1: If the RM favors longer responses, the model may learn to generate verbose but hollow replies. Example 2: If the RM gives high scores to responses containing certain specific "safety" phrases (e.g., "As an AI assistant, I must comply with…"), the model may learn to mechanically insert such boilerplate into all responses, regardless of whether it is actually needed.
Follow-up: Beyond improving the Reward Model itself, what strategies can be employed during RLHF training to mitigate reward hacking?
Q8. What tensions exist among the Helpful / Harmless / Honest triad in alignment?
Answer: Helpful, Harmless, and Honest have inherent tensions among them. For example, a model that prioritizes Harmless excessively may refuse to answer reasonable but sensitive questions due to over-caution, thereby compromising Helpfulness (e.g., a doctor discussing medical symptoms). A model pursuing extreme Honesty may expose unverified information or user privacy in responses, thereby compromising Harmlessness. Conversely, fabricating answers to be Helpful compromises Honesty. An ideally aligned model must dynamically balance these three objectives across different contexts; there is no fixed perfect solution.
Follow-up: Can you provide a concrete scenario in which a model unavoidably sacrifices Helpfulness and Honesty in order to achieve Harmlessness?
━━━ L2 Intermediate ━━━
Q9. What is the core difference between GRPO and PPO? How many models does GRPO require?
Answer: GRPO (Group Relative Policy Optimization) and PPO (Proximal Policy Optimization) are both policy gradient algorithms, but GRPO simplifies the RLHF training process. The core difference: PPO maintains four models (a policy model, a reference model, a value model (critic), and a reward model), whereas GRPO needs no separate value model. GRPO samples a group of responses for the same prompt and uses the group's mean reward as the baseline (and the group's standard deviation as the scale) to estimate each response's advantage, from which the policy gradient is computed. GRPO therefore typically runs with three models: the policy model, the reference model (needed for the KL term), and the reward model. The count drops further when the KL term is removed (β = 0, as in DAPO), which makes the reference model unnecessary, or when rule-based rewards replace the learned reward model.
Follow-up: GRPO uses the group's average reward as a baseline to estimate the advantage function. What kind of bias might this introduce, and how does it affect training stability?
Q10. What problems do IPO, KTO, ORPO, and SimPO each solve with respect to DPO?
Answer: These methods are all improvements or variants of DPO:
- IPO (Identity Preference Optimization): Addresses the problem in DPO where KL regularization breaks down and overfitting occurs when preferences are near-deterministic. It adopts a bounded squared-loss objective (see §7.1) for more robust optimization.
- KTO (Kahneman-Tversky Optimization): Addresses DPO's requirement for strictly paired preference data (chosen/rejected pairs). KTO only requires binary labels indicating whether each response is "good" or "bad," without pairing, making data collection more flexible.
- ORPO (Odds Ratio Preference Optimization): Merges SFT and preference alignment into a single training stage. It adds an odds-ratio penalty to the standard SFT loss, raising the odds of generating the chosen response relative to the rejected response, and it needs no reference model.
- SimPO (Simple Preference Optimization): Aims to further simplify DPO by removing the dependency on a reference model, while improving optimization stability and robustness to response length by using length-normalized log-probabilities as an implicit reward and introducing a target reward margin.
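The differences between these objectives are easiest to see side by side. The sketch below compares DPO with IPO and SimPO on sequence-level log-probabilities; KTO and ORPO are omitted to keep it short. The function names, the toy log-probs, and the default values of `beta`, `tau`, and `gamma` are illustrative assumptions, not recommended settings.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO: -log sigmoid(beta * h), with h the policy-vs-reference log-ratio gap."""
    h = (pi_w - ref_w) - (pi_l - ref_l)
    return -F.logsigmoid(beta * h).mean()

def ipo_loss(pi_w, pi_l, ref_w, ref_l, tau=0.1):
    """IPO: squared loss pulling the same gap h toward the finite target 1/(2*tau)."""
    h = (pi_w - ref_w) - (pi_l - ref_l)
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()

def simpo_loss(pi_w, pi_l, len_w, len_l, beta=2.0, gamma=1.0):
    """SimPO: no reference model; length-normalized log-probs plus a target margin gamma."""
    margin = beta * (pi_w / len_w) - beta * (pi_l / len_l) - gamma
    return -F.logsigmoid(margin).mean()

# Toy sequence-level log-probs (sum over response tokens) for 2 preference pairs
pi_w, pi_l = torch.tensor([-12.0, -20.0]), torch.tensor([-15.0, -18.0])
ref_w, ref_l = torch.tensor([-13.0, -19.0]), torch.tensor([-14.0, -19.0])
len_w, len_l = torch.tensor([10.0, 16.0]), torch.tensor([12.0, 9.0])

print("DPO  :", dpo_loss(pi_w, pi_l, ref_w, ref_l).item())
print("IPO  :", ipo_loss(pi_w, pi_l, ref_w, ref_l).item())
print("SimPO:", simpo_loss(pi_w, pi_l, len_w, len_l).item())
```

DPO's loss keeps decreasing as the gap h grows without bound, while IPO's is minimized at a finite gap, which is the mechanism behind its resistance to overfitting on near-deterministic preferences.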
Follow-up: Among these methods, which has relatively the lowest requirements for training data quality or quantity? Why?
Q11. Which matters more in SFT — data quality or data quantity? How is data curation done?
Answer: In the SFT phase, data quality is generally far more important than data quantity. High-quality, diverse, accurate, and human-value-aligned instruction data, even at a smaller scale, can significantly improve model performance. Conversely, large amounts of low-quality, erroneous, or harmful data can severely contaminate the model. A typical data curation pipeline includes: 1) Source filtering: selecting trustworthy and professional sources; 2) Quality filtering: using rules or models (e.g., an RM) to filter out low-scoring, harmful, or malformatted samples; 3) Deduplication: removing duplicate or near-duplicate samples; 4) Diversity augmentation: ensuring instructions cover a wide range of tasks, difficulty levels, and domains; 5) Format normalization: standardizing the style and length distribution of responses.
Follow-up: If you could use only a single automated model (rather than humans) to evaluate and filter quality in large-scale SFT data, what type of model would you prioritize? Why?
Q12. What are the main paradigms for synthetic data generation? Where does length bias come from?
Answer: The main paradigms are: 1) Self-Instruct: having the model generate new instructions and responses from seed tasks; 2) Evol-Instruct: evolving existing instructions through multiple rounds and multiple dimensions of complexification; 3) Bootstrapping: using a powerful "teacher" model to generate training data for a "student" model (e.g., distillation); 4) Reward-guided Generation: using an RM or rules to filter/revise multiple candidate responses generated by the model. Length bias mainly originates from: 1) Model-intrinsic bias: common responses in pre-training data (e.g., technical documentation) tend to be long; 2) Reward model bias: if human annotators in the RM's training data generally prefer more detailed, longer responses, the RM will assign higher scores to longer responses, causing the model to tend toward generating longer text when optimizing the RM; 3) Generation strategy: for example, verbose enumeration to ensure all points are covered.
Follow-up: When generating synthetic data, how can the pipeline or loss function be designed to explicitly control or reduce length bias in the final responses?
Q13. What is the difference between online and offline preference learning? What scenarios is each suited for?
Answer: Online learning (e.g., the PPO phase in standard RLHF) means that the policy model generates new responses in real time during training and receives new reward signals from the environment (e.g., the RM) to update the policy. Offline learning (e.g., DPO) means using a pre-collected, fixed preference dataset to optimize the model, without generating new data during training. Online learning is suited for scenarios that require continuous exploration, fast adaptation to new reward signals, or resolving distribution shift, but has high computational cost and instability. Offline learning is suited for scenarios where data collection is expensive and stable training pipelines are needed, but is easily constrained by a fixed data distribution and may converge to a suboptimal solution.
Follow-up: In offline learning, if the preference data distribution used for training differs greatly from the data distribution encountered during deployment, what problems arise? How can this be mitigated?
Q14. What is benchmark contamination? How can it be detected?
Answer: Benchmark contamination refers to the situation where the model being evaluated (or its training data) has already "seen" the test questions or answers from the evaluation benchmark during training. This causes the model to achieve inflated, unrealistic performance scores on that benchmark, which do not reflect its true generalization capability. Detection methods include: 1) Membership inference attacks: analyzing differences in perplexity between the model's outputs on test-set samples versus similar non-test-set samples; 2) n-gram overlap analysis: checking the degree of text overlap between the model's training data and the test set; 3) Data provenance auditing: rigorously auditing training data sources to exclude datasets known to contain mainstream benchmark test sets (e.g., certain versions of Common Crawl); 4) Dynamic benchmark design: using regularly updated, non-public test sets.
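The n-gram overlap check can be sketched in a few lines of plain Python. This is a simplified illustration that assumes whitespace tokenization and an in-memory training corpus; the function names, the n-gram size, and the threshold are hypothetical choices for the example, and real audits usually run over tokenized, deduplicated corpora with indexing.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_docs, test_items, n=8, threshold=0.5):
    """Fraction of test items whose n-gram overlap with the training corpus is >= threshold."""
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = []
    for item in test_items:
        item_ngrams = ngrams(item, n)
        if not item_ngrams:
            continue  # item shorter than n tokens: nothing to compare
        overlap = len(item_ngrams & train_ngrams) / len(item_ngrams)
        if overlap >= threshold:
            flagged.append((item, overlap))
    return len(flagged) / max(len(test_items), 1), flagged

train_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
test_items = [
    "The quick brown fox jumps over the lazy dog",       # appears verbatim in training data
    "what is the capital of france and its population",  # unseen
]
rate, flagged = contamination_rate(train_docs, test_items, n=3)
print("contamination rate:", rate)
print("flagged items:", flagged)
```

Exact n-gram matching misses paraphrased or translated test items, which is one reason it is usually combined with the other detection methods listed above.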
Follow-up: Beyond data contamination, what other methodological flaws in evaluation might lead to misjudgment of a model's capabilities?
Q15. How does catastrophic forgetting manifest in post-training? How can it be mitigated?
Answer: In post-training, catastrophic forgetting manifests as the model losing the broad knowledge, language capabilities, or ability to handle diverse tasks learned during pre-training while acquiring new capabilities (e.g., instruction following, value alignment) through SFT or RLHF. For example, an aligned model may perform well on instruction following but exhibit significant degradation in foundational capabilities such as coding, mathematics, or multilingual tasks compared to the base model. Mitigation methods include: 1) Mixed training data: mixing pre-training data or general-capability data into SFT/RLHF data; 2) Low-rank adaptation: using parameter-efficient fine-tuning methods such as LoRA to update only a small fraction of parameters; 3) Regularization: adding to the loss function an L2 penalty on the distance between the current and the original model parameters (similar in spirit to EWC, which additionally weights each parameter by its estimated importance); 4) Knowledge distillation: using the original model as a teacher to constrain the output distribution of the aligned model.
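The regularization option (item 3) can be sketched as a penalty that pulls the weights back toward their values before fine-tuning. This is a minimal illustration on a tiny linear layer standing in for the LLM; the helper name `l2_to_init_penalty` and the strength `lam` are assumptions for the example. Unlike EWC, every parameter is weighted equally here.

```python
import torch
import torch.nn as nn

def l2_to_init_penalty(model: nn.Module, init_params: dict) -> torch.Tensor:
    """Sum of squared distances between the current and the initial parameters."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        if p.requires_grad:
            penalty = penalty + ((p - init_params[name]) ** 2).sum()
    return penalty

torch.manual_seed(0)
model = nn.Linear(4, 2)  # stand-in for the model being fine-tuned
# Snapshot of the weights before fine-tuning (detached, so no gradient flows into it)
init_params = {n: p.detach().clone() for n, p in model.named_parameters()}

x, y = torch.randn(8, 4), torch.randn(8, 2)  # toy fine-tuning data
opt = torch.optim.SGD(model.parameters(), lr=0.1)
lam = 0.01  # hypothetical regularization strength
for _ in range(5):
    task_loss = nn.functional.mse_loss(model(x), y)
    loss = task_loss + lam * l2_to_init_penalty(model, init_params)
    opt.zero_grad()
    loss.backward()
    opt.step()

drift = l2_to_init_penalty(model, init_params).item()
print("task loss:", task_loss.item(), "| squared drift from init:", drift)
```

Larger `lam` keeps the model closer to its starting point at the cost of slower adaptation, mirroring the role β plays for the KL penalty in RLHF.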
Follow-up: In parameter-efficient fine-tuning methods (e.g., LoRA), how does the choice of which layers to fine-tune (e.g., QKV projections in attention layers vs. FFN layers) differently affect the mitigation of catastrophic forgetting and the preservation of existing capabilities?
Q16. Process Reward Model (PRM) vs. Outcome Reward Model (ORM)?
Answer: An ORM (Outcome Reward Model) gives a single reward score only for the final answer or complete response generated by the model, without regard for the intermediate reasoning process. A PRM (Process Reward Model) evaluates and scores each intermediate step in solving the problem or generating the response. The advantage of PRM lies in providing denser, more fine-grained supervision signals that help guide the model toward correct step-by-step reasoning — especially valuable for complex tasks such as mathematics and logical reasoning, as it prevents the model from arriving at the correct answer via "shortcuts" with an incorrect process. The challenge is that annotation costs are extremely high, requiring human experts to evaluate each step.
Follow-up: In practice, how can data for training a PRM be collected efficiently? Is it possible to use an ORM or other models to automatically generate training labels for a PRM?
Q17. What are the limitations of MT-Bench, AlpacaEval, and Chatbot Arena respectively?
Answer:
- MT-Bench: Uses pre-designed multi-turn conversation questions and a powerful LLM (e.g., GPT-4) as the judge. Limitations: 1) the judge model itself may be biased; 2) fixed questions make it easy to overfit; 3) cannot evaluate long-document processing or real-world complex tasks.
- AlpacaEval: Uses a fixed instruction set; GPT-4 is used to compare the model's responses against reference responses (typically GPT-4's own responses). Limitations: 1) strongly dependent on GPT-4's preferences, which may not reflect the preferences of a broad user base; 2) risk of "self-preference," where responses stylistically similar to GPT-4 may score higher.
- Chatbot Arena: Conducts pairwise comparisons through anonymous votes from real users, making it the most human-preference-aligned dynamic evaluation currently available. Limitations: 1) the user base may not be fully representative (skewed toward technical users); 2) high evaluation cost and slow speed; 3) uneven distribution of conversation domains.
Follow-up: If you were to design a new, more comprehensive evaluation framework for post-trained models, what different evaluation dimensions and methods would you integrate to compensate for the shortcomings of these individual benchmarks?
━━━ L3 Deep ━━━
Q18. Why is the value model (critic) in PPO difficult to train? How does GRPO sidestep this problem?
Answer: In PPO for RLHF, the value model (Critic) must accurately estimate the expected total future reward given a current state (i.e., the current prompt and partial generation history) — that is, the state value function V(s). This estimation is extremely difficult: 1) Sparse rewards: rewards are typically given only after a complete response is generated, so intermediate states lack direct supervision signals; 2) High variance: the state space for text generation is vast and complex, leading to high variance in value estimates and unstable training; 3) Non-stationarity: the policy model updates rapidly, causing the target distribution for the value function to shift continuously, increasing the difficulty of fitting. GRPO sidesteps this problem by eliminating the value model entirely. It generates a group of responses for each prompt and uses the group's average reward as a baseline to estimate each response's advantage relative to the group average. This approach avoids training a complex value network over all possible states.
Follow-up: GRPO uses the group's average reward as a baseline, which implicitly assumes that the value of all states (i.e., different generation paths for the same prompt) is equal. Under what circumstances does this assumption become unreasonable?
Q19. Theoretical derivation of DPO: walk through the derivation from the RLHF KL-constrained optimal solution to the DPO loss.
Answer:
- RLHF objective: We have a KL-constrained optimization objective:
max_{π} E_{x∼D, y∼π}[r(x, y)] - β * KL[π(y|x) || π_ref(y|x)], where π is the policy, π_ref is the reference policy, and r is the reward function.
- Closed-form optimal solution: Solving the above objective with respect to π yields the closed-form optimal solution:
π*(y|x) = π_ref(y|x) * exp(r(x, y) / β) / Z(x), where Z(x) is the partition function (normalization constant).
- Inverting for the reward function: Taking logarithms on both sides and rearranging, the reward function can be expressed as a function of the policy:
r(x, y) = β * log(π*(y|x) / π_ref(y|x)) + β * log(Z(x)).
- Substituting into the Bradley-Terry model: For a preference pair (y_w, y_l), according to the BT model, the probability that a human selects y_w is
σ(r(x, y_w) - r(x, y_l)), where σ is the sigmoid function.
- Canceling the partition function: Substituting the reward expression from step 3 into step 4, the log(Z(x)) terms cancel in the subtraction, yielding:
P(y_w ≻ y_l | x) = σ(β * log(π*(y_w|x) / π_ref(y_w|x)) - β * log(π*(y_l|x) / π_ref(y_l|x))).
- DPO loss: Finally, the DPO loss function maximizes the above probability (i.e., minimizes negative log-likelihood):
L_DPO(θ) = -E[log σ(β * log(π_θ(y_w|x) / π_ref(y_w|x)) - β * log(π_θ(y_l|x) / π_ref(y_l|x)))], where π_θ is the policy being optimized.
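The derivation can be checked numerically on a toy problem with a single prompt and five candidate responses. The sketch below builds the closed-form optimum π*, inverts it to recover the reward, and confirms that log Z(x) cancels in pairwise differences, so the DPO loss evaluated at π* equals the Bradley-Terry negative log-likelihood under the true reward. The random reward values, the discrete response set, and all variable names are assumptions made for the illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
beta = 0.5
num_responses = 5  # toy setting: one prompt x with 5 possible responses y

ref_probs = torch.softmax(torch.randn(num_responses), dim=0)  # pi_ref(y|x)
reward = torch.randn(num_responses)                           # r(x, y)

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)
unnorm = ref_probs * torch.exp(reward / beta)
Z = unnorm.sum()
pi_star = unnorm / Z

# Inverted reward: r = beta * log(pi*/pi_ref) + beta * log Z
implicit = beta * (pi_star.log() - ref_probs.log())
recovered = implicit + beta * Z.log()

# log Z cancels in any pairwise difference
w, l = 0, 1
true_gap = reward[w] - reward[l]
implicit_gap = implicit[w] - implicit[l]

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one preference pair, given sequence-level log-probs."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits)

loss_at_opt = dpo_loss(pi_star[w].log(), pi_star[l].log(),
                       ref_probs[w].log(), ref_probs[l].log(), beta)
bt_nll = -F.logsigmoid(true_gap)
print("reward recovered:", torch.allclose(recovered, reward, atol=1e-4))
print("true gap:", true_gap.item(), "| implicit gap:", implicit_gap.item())
print("DPO loss at pi*:", loss_at_opt.item(), "| BT NLL:", bt_nll.item())
```

The implicit reward β * log(π*/π_ref) differs from the true reward only by the prompt-dependent constant β * log Z(x), which is why DPO never needs to compute the partition function.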
Follow-up: In the above derivation, we assume that the reward function r can be expressed in terms of the policy π (step 3). What are the implicit conditions for this assumption to hold?
Q20. What is the difference between mode collapse and reward hacking? How can mode collapse be detected?
Answer: Reward hacking is when the model finds "shortcuts" to obtain high rewards while producing outputs that do not match human intent (e.g., generating verbose filler). Mode collapse refers to a sharp drop in the diversity of the model's outputs, where the model tends to repeatedly generate a few types of high-reward, safe, or stereotyped responses, losing the richness and creativity expected when responding to diverse prompts. It is a common failure mode in generative models. Methods for detecting mode collapse include: 1) Diversity metrics: computing lexical diversity (e.g., distinct-n) and variance in semantic embeddings of responses generated for a set of prompts, compared against a baseline model; 2) Reward distribution analysis: if the model's reward score distribution becomes highly concentrated (high mean, low variance), it may indicate that the model has found a few "high-scoring templates"; 3) Manual sampling inspection: randomly sampling multiple groups of responses and observing whether their content, structure, and word choices are highly similar.
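The distinct-n metric mentioned under diversity metrics can be computed with a short helper. This sketch assumes whitespace tokenization and pools n-grams over all responses in a set; the function name and the example responses are made up for illustration.

```python
def distinct_n(responses, n=2):
    """distinct-n: number of unique n-grams divided by total n-grams across all responses."""
    total, unique = 0, set()
    for text in responses:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

diverse = [
    "the cat sat on the mat",
    "gravity pulls objects toward each other",
    "two plus two equals four",
]
collapsed = [
    "I am happy to help",
    "I am happy to help",
    "I am happy to help you",
]
print("distinct-2 (diverse):  ", distinct_n(diverse))
print("distinct-2 (collapsed):", round(distinct_n(collapsed), 3))
```

Tracking this value for a fixed prompt set across checkpoints, and comparing it with the SFT baseline, gives an early signal when outputs start converging on a few templates.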
Follow-up: Increasing the KL penalty coefficient β is an effective way to mitigate mode collapse in RLHF training. Beyond this, what methods from a data perspective or algorithmic perspective can encourage diversity?
Q21. What is alignment tax? How does weight averaging mitigate it, and what is the principle?
Answer: Alignment tax refers to the performance cost paid on certain general capabilities not directly optimized (e.g., basic language modeling, complex reasoning) — i.e., degradation in these capabilities — as the model undergoes post-training alignment to achieve better instruction following, safety, and harmlessness. Weight averaging is a simple and effective mitigation technique. It averages the weights of multiple models produced at different training checkpoints or with different random seeds to obtain a smoother, more generalizable final model. The principle is: 1) Variance reduction: averaging reduces performance instability caused by training fluctuations or randomness in any single model; 2) Exploring better solutions: different training snapshots may reside in different "good" regions of the loss landscape, and averaging may find an intermediate point that performs well across dimensions; 3) Implicit regularization effect, preventing the model from overfitting to specific patterns in training data (including biases that may exist in alignment data).
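Mechanically, weight averaging is an element-wise average over the parameters of models that share one architecture. The sketch below uses two small linear layers as stand-ins for checkpoints; the helper name `average_state_dicts` and the uniform default weights are assumptions for the example.

```python
import torch
import torch.nn as nn

def average_state_dicts(state_dicts, weights=None):
    """Weighted average of parameter tensors from models with identical architectures."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    avg = {}
    for key in state_dicts[0]:
        if state_dicts[0][key].is_floating_point():
            avg[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        else:
            avg[key] = state_dicts[0][key].clone()  # non-float buffers are copied, not averaged
    return avg

torch.manual_seed(0)
ckpt_a, ckpt_b = nn.Linear(4, 2), nn.Linear(4, 2)  # stand-ins for two checkpoints
merged = nn.Linear(4, 2)
merged.load_state_dict(average_state_dicts([ckpt_a.state_dict(), ckpt_b.state_dict()]))
print("merged weight:\n", merged.weight.data)
```

Passing non-uniform `weights`, for instance between a pre-alignment and a post-alignment checkpoint, turns the same helper into a linear interpolation that trades alignment gains against retained general capability.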
Follow-up: In specific implementations of weight averaging — such as Stochastic Weight Averaging (SWA) and Model Soups — how do their strategies and assumptions differ? Which is likely more effective at mitigating alignment tax?
Q22. What are the key design decisions in DeepSeek-R1's training pipeline? What is the role of cold-start SFT?
Answer: According to the DeepSeek-R1 paper (arXiv:2501.12948), it is important to distinguish between two models:
DeepSeek-R1-Zero: Applies pure RL (GRPO) directly on DeepSeek-V3-Base, completely skipping the SFT phase. The paper states: "we bypass the conventional supervised fine-tuning (SFT) phase before RL training." R1-Zero demonstrates that reasoning capabilities can emerge from pure RL, but it suffers from poor readability and language mixing.
DeepSeek-R1: Four-stage pipeline (paper Section 3):
- Cold-start SFT: Collects thousands of cold-start data samples with human-conversational-style chain-of-thought, then fine-tunes DeepSeek-V3-Base via SFT to produce Dev1. Note: this is "cold-start" rather than standard large-scale SFT; the data volume is small (thousands).
- Reasoning-oriented RL (Stage 1 RL): Applies GRPO on Dev1 for reasoning-task reinforcement learning (rule-based rewards: accuracy + format) to produce Dev2.
- Rejection-sampling SFT: Samples from Dev2, merges reasoning and non-reasoning data for SFT to produce Dev3. This stage also improves general capabilities such as writing.
- Full-scenario RL (Stage 2 RL): Applies comprehensive RL on Dev3, with reward signals combining rule-based (reasoning) + RM (general dialogue, safety), yielding the final DeepSeek-R1.
Role of cold-start: Resolves R1-Zero's readability and language-mixing issues, providing a more well-structured behavioral foundation for subsequent RL and making RL exploration more efficient.
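A rule-based reward combining accuracy and format, as used in the reasoning-oriented RL stage, might look like the sketch below. The tag template, the exact-match check, and the `format_weight` value are assumptions for illustration, not the paper's actual implementation.

```python
import re

# Assumed template for illustration: reasoning inside <think>...</think>,
# followed by the final answer inside <answer>...</answer>.
TEMPLATE = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.match(response.strip()) else 0.0


def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the extracted final answer exactly matches the reference string."""
    match = TEMPLATE.match(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def rule_based_reward(response: str, reference: str, format_weight: float = 0.5) -> float:
    """Accuracy reward plus a weighted format reward (the weight is a hypothetical choice)."""
    return accuracy_reward(response, reference) + format_weight * format_reward(response)


print(rule_based_reward("<think>2 + 2 = 4</think> <answer>4</answer>", "4"))  # 1.5
print(rule_based_reward("<think>guess</think><answer>5</answer>", "4"))       # 0.5
print(rule_based_reward("The answer is 4", "4"))                              # 0.0
```

Because the reward is computed by rules rather than a learned model, it cannot be exploited through RM blind spots, but it only applies to tasks with checkable answers.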
Follow-up: The data used for cold-start SFT has very high quality requirements. If this data contains errors or biases, what cascading effects would this have on the exploration in subsequent reinforcement learning stages?
Q23. How does the self-critique-revision mechanism in RLAIF and Constitutional AI work?
Answer: The core idea of RLAIF (Reinforcement Learning from AI Feedback) and Constitutional AI is to use AI models themselves to generate preference feedback or perform corrections, reducing dependence on human annotation. The self-critique-revision mechanism typically involves a loop: 1) Generate initial response: given a prompt, the model first generates a preliminary response. 2) Self-critique: the model (or a separate critic model) reviews the initial response against a set of predefined "constitutional" principles (e.g., "answers should be objective," "avoid harmful content") and identifies potential violations. 3) Revise response: the model revises the initial response based on the generated critique to produce a new version that better conforms to the constitutional principles. 4) (Optional) Use for training: the (initial response, revised response) pair is used as a (rejected, chosen) pair to train an RM or to directly perform DPO-style optimization. This mechanism allows the model to self-improve and align without requiring real-time human intervention.
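The loop above can be sketched as follows. `generate` is a hypothetical callable standing in for any text-generation model, and the prompt wording is an assumption; the point is only the data flow from initial response to a (rejected, chosen) pair.

```python
from typing import Callable, Dict, List


def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],  # hypothetical model call: text in, text out
) -> Dict[str, str]:
    """Run one critique-revision pass per principle and return a preference pair."""
    initial = generate(prompt)
    current = initial
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {current}"
        )
        current = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {current}"
        )
    # The unrevised response is treated as rejected, the final revision as chosen.
    return {"prompt": prompt, "rejected": initial, "chosen": current}


def toy_model(text: str) -> str:
    """Deterministic stub so the sketch runs without a real model."""
    if text.startswith("Critique"):
        return "The tone is too blunt."
    if text.startswith("Rewrite"):
        return "polite answer"
    return "blunt answer"


pair = critique_and_revise("How do I reset my password?", ["Be polite."], toy_model)
print(pair)
```

The resulting pairs can feed RM training or a DPO-style loss, as described in step 4.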
Follow-up: Could this self-revision mechanism cause the model to fall into a kind of "alignment loop"? For example, in pursuing a "safer" response, the model might through multiple rounds of revision produce responses that become increasingly conservative and even useless.
Q24. How are iterative RLHF and online DPO similar and different? How can distribution mismatch be resolved?
Answer: Both address the problem of mismatch between the training data distribution (preference pairs generated by an old policy) and the current policy distribution that arises in offline methods like standard DPO. Similarities: both iteratively use the current policy model to generate new data (or responses) and update the model with this new data, so that the training data distribution tracks the policy as it changes. Differences: Iterative RLHF typically refers to alternating between "online data generation (sampling with the current policy and scoring with an RM)" and "updating the policy with new data (possibly using PPO or DPO)." Online DPO more specifically refers to generating a set of responses with the current policy at each training iteration, having an RM or human select preference pairs, and then directly computing the DPO loss and updating the model using this newly generated, distribution-matched preference data, skipping the explicit RL step.
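The data-collection step of online DPO can be sketched as below, assuming best-vs-worst selection within each sampled group. `sample` and `score` are hypothetical callables for the current policy and the RM (or a human labeler); other pair-selection rules are possible.

```python
import itertools
from typing import Callable, Dict, List


def build_online_pairs(
    prompts: List[str],
    sample: Callable[[str], str],        # hypothetical: one response from the current policy
    score: Callable[[str, str], float],  # hypothetical: RM or human score for (prompt, response)
    num_samples: int = 4,
) -> List[Dict[str, str]]:
    """Sample responses with the current policy and keep the best/worst as a pair."""
    pairs = []
    for prompt in prompts:
        candidates = [sample(prompt) for _ in range(num_samples)]
        # Score each candidate once, then sort ascending by score.
        scored = sorted((score(prompt, c), c) for c in candidates)
        (low_score, rejected), (high_score, chosen) = scored[0], scored[-1]
        if high_score > low_score:  # skip groups that carry no preference signal
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Toy stand-ins: a fixed cycle of responses and a length-based "reward".
_responses = itertools.cycle(["ok", "a longer answer", "meh", "the longest answer here"])
pairs = build_online_pairs(
    ["Explain KL."],
    sample=lambda prompt: next(_responses),
    score=lambda prompt, response: float(len(response)),
)
print(pairs)
```

Each training iteration would call this with the latest policy, so the pairs stay on-distribution.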
Follow-up: When generating preference pairs using the current policy in Online DPO, what sampling temperature should be used? Why is this parameter choice important?
Q25. Scaling laws in post-training: how do data volume and model scale affect alignment quality? How do the optimal compute allocation strategies for SFT and RL differ?
Answer: Post-training scaling laws differ from those in pre-training. For data volume: in the SFT phase, there are diminishing returns; high-quality data is more important than large volumes of low-quality data, and performance improvements slow after reaching a certain scale. For model scale: larger base models generally have stronger alignment potential and can better understand complex instructions and values, but the amount of high-quality data needed to achieve the same alignment level may not scale proportionally. Optimal allocation strategy for SFT vs. RL: SFT yields more "data-efficient" returns, and it is typically cost-effective to invest more compute early in a project to quickly establish instruction-following capability. RL (e.g., RLHF) is more "compute-intensive," with its returns manifesting in fine-grained behavioral adjustments and value alignment, requiring more online sampling and iteration. A common strategy is: use most of the compute budget to train a sufficiently good base model and SFT model, then use the remaining, relatively smaller compute budget for a few key RL iterations for fine-tuning, since the marginal returns of RL may diminish rapidly.
Follow-up: Treating both model scale and data volume as resources in the post-training phase, which is more likely to yield a better assistant in real-world applications: investing in aligning a 70B model, or aligning a 7B model with a larger volume of higher-quality data? Explain your reasoning.
More L3 Deep Dives / Extended L3
Q26: What does DPO's implicit reward actually learn? What are its fundamental limitations compared to an explicit RM?
Optimizing the DPO loss is equivalent to optimizing an implicit reward $r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$. This implicit reward is essentially a sum of token-level log-probability ratios between the policy and the reference policy, with no explicit modeling of generation semantics. Compared to training an independent RM, the DPO reward is bound to the policy's parameter space, leading to three core limitations: (1) distribution coupling: the reward cannot evaluate OOD responses independently of the policy, limiting exploration; (2) representation bottleneck: the policy must simultaneously serve as both "value evaluator" and "strategy generator," creating potential parameter conflicts; (3) temporal inconsistency: as the policy changes during training the implicit reward drifts, whereas an explicit RM's reward distribution remains relatively stable. This also explains why online DPO (re-sampling with the current policy) typically outperforms offline DPO.
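As a small numerical illustration, the implicit reward can be computed from per-token log-probabilities as β times the summed policy-minus-reference log-ratio, and the DPO loss as the negative log-sigmoid of the chosen-minus-rejected reward margin. The function names and the toy log-probabilities below are made up for this sketch.

```python
import math
from typing import Sequence


def implicit_reward(policy_logps: Sequence[float], ref_logps: Sequence[float], beta: float) -> float:
    """beta * sum over tokens of (log pi_theta - log pi_ref)."""
    return beta * sum(p - r for p, r in zip(policy_logps, ref_logps))


def dpo_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log sigmoid(margin), written in a numerically stable softplus form."""
    margin = chosen_reward - rejected_reward
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy per-token log-probs (illustrative values only).
r_chosen = implicit_reward([-1.0, -2.0], [-1.5, -2.5], beta=0.1)    # policy raised these tokens
r_rejected = implicit_reward([-3.0, -1.0], [-2.5, -0.5], beta=0.1)  # policy lowered these tokens
print(r_chosen, r_rejected, dpo_loss(r_chosen, r_rejected))
```

Note that the reward depends only on the policy's own probabilities, which is the distribution coupling described above.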
Follow-up: Since DPO has an off-policy problem, Rejection Sampling Fine-Tuning (RFT) is a simpler alternative — under what conditions would RFT be more effective than DPO, and under what conditions would it fail?
Q27: What statistical bias does GRPO's Group Normalization introduce? How can it be mitigated?
GRPO applies group-level z-score normalization (subtract the mean, divide by the standard deviation) across multiple responses to the same prompt, implicitly assuming that within-prompt comparison is sufficient. Statistically, when the group size is small, the estimated mean and variance are highly variable, causing high noise in advantage estimates. More critically, group normalization defines advantage entirely relative to the same group, which means: (1) if all responses in a group are of low quality, a "best of a bad bunch" dynamic still produces positive advantage, reinforcing the policy in a low-quality region; (2) conversely, if all responses in the group are high quality, answers that are excellent in absolute terms but fall below the group mean receive negative advantage and are suppressed. Because of this relative ranking bias, when most responses in a group score similarly, small reward differences are amplified by dividing by a small standard deviation, and GRPO may systematically diverge from the absolute quality signal. Mitigation approaches include introducing a baseline anchor (e.g., an EMA reference reward) or mixing absolute and relative advantages.
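The relative-ranking effect is easy to see numerically. The sketch below assumes population standard deviation and a small `eps` for stability (both are implementation choices, not prescribed by the text), and uses made-up reward values.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Z-score each reward against its own group (mean 0, roughly unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


low_quality = group_advantages([0.10, 0.12, 0.11, 0.15])   # every response is poor
high_quality = group_advantages([0.90, 0.92, 0.91, 0.95])  # every response is good
identical = group_advantages([1.0, 1.0, 1.0])              # no within-group signal

print(low_quality)
print(high_quality)
print(identical)
```

The first two groups receive essentially the same advantages even though their absolute quality differs by 0.8, and the third group yields all-zero advantages and therefore no gradient signal.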
Follow-up: Under GRPO's KL constraint, if the group size tends to infinity, what form does GRPO's optimization objective mathematically converge to? How does it relate to standard PPO?
Q28: How does Reward Model overparameterization affect RLHF? Should the RM be the same scale as the policy, larger, or smaller?
RM overparameterization (far more parameters than training data requires) causes two problems: (1) spurious correlations — the RM may learn surface features unrelated to preference (e.g., specific writing styles, length) and achieve high accuracy, but these shortcuts break down once the policy updates; (2) calibration degradation — the scalar output of an overparameterized RM tends to be overconfident (concentrated at a few extreme values), causing advantage estimate variance to explode in PPO or the policy to be dominated by a small number of samples. In practice, RM scale selection involves a trade-off: a larger RM has stronger semantic understanding but is more prone to overfitting and is expensive to run; a smaller RM may generalize better but has limited expressiveness. One view is that the RM should be slightly larger than or equal to the policy scale to ensure sufficient reward signal resolution, while reward ensembles (averaging/voting across multiple RMs) mitigate overfitting.
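Two simple ways to aggregate a reward ensemble are sketched below: averaging scores and majority voting on a pairwise comparison. The `uncertainty_penalty` term, which subtracts a multiple of the ensemble's disagreement, is an added assumption for illustration rather than something the text prescribes.

```python
import statistics
from typing import Sequence


def ensemble_reward(scores: Sequence[float], uncertainty_penalty: float = 0.0) -> float:
    """Mean of the RM scores, optionally penalized by their disagreement (std)."""
    mean = statistics.fmean(scores)
    spread = statistics.pstdev(scores)
    return mean - uncertainty_penalty * spread


def majority_prefers_first(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """True if more than half of the RMs score response A above response B."""
    votes = sum(a > b for a, b in zip(scores_a, scores_b))
    return votes > len(scores_a) / 2


# Scores from three hypothetical RMs for the same response.
print(ensemble_reward([1.0, 2.0, 3.0]))                           # 2.0
print(ensemble_reward([1.0, 2.0, 3.0], uncertainty_penalty=1.0))  # lower: the RMs disagree
print(majority_prefers_first([0.9, 0.2, 0.8], [0.1, 0.5, 0.3]))   # True
```

Penalizing disagreement makes the policy less likely to chase responses that only one ensemble member rates highly.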
Follow-up: If multiple RMs in a reward ensemble all start from the same SFT initialization and differ only in data shuffling, under what conditions will this ensemble still fail systematically? How would you design a truly diverse RM ensemble?
Q29: How is the Credit Assignment problem solved in multi-turn dialogue RLHF? Is existing sequence-level reward sufficient?
In multi-turn dialogue, the user's final satisfaction is a function of the entire conversation history, but standard RLHF gives only a single scalar reward at the final turn, creating a severe temporal credit assignment problem: the model cannot tell which turn's response caused a positive or negative evaluation. Intuitive solutions include: (1) turn-level reward modeling — training an independent reward model for each dialogue turn, but this faces partial observability of dialogue state and high annotation costs; (2) Monte Carlo rollout — re-sampling subsequent dialogue from a given turn to estimate value, but combinatorial explosion is severe; (3) shaped reward via dialogue act — using dialogue acts (e.g., clarification, confirmation) as intermediate reward signals. Empirically, pure sequence-level reward is manageable for short dialogues (2–3 turns), but in long dialogues the policy tends to fall into early-turn over-optimization (over-optimizing the first-turn response to capture the initial reward signal while neglecting subsequent interaction quality).
Follow-up: If you want to implement reward attribution at the token level (rather than turn level), what methods could theoretically decompose a sequence-level reward down to each token? What are the theoretical guarantees and practical difficulties of such an approach?
Q30: Is the theoretically optimal solution of KL-constrained RL sensitive to β? When β deviates from optimal, how do PPO and DPO differ in their failure modes?
From a KL-regularized RL perspective, β controls the position on the exploration-exploitation Pareto frontier. Theoretically, the optimal β depends on the scale of the reward function and the entropy of the reference policy, and cannot be determined in advance. When β is too large (over-regularization), both PPO and DPO converge toward the reference policy and alignment effects are weak. When β is too small (under-regularization), their failure modes diverge. PPO experiences a positive feedback loop of reward hacking: once the policy finds a reward loophole it is continuously reinforced, the RM is evaluated out-of-distribution, and reward collapses. DPO exhibits instability from preference reversal: the implicit reward of off-policy samples drifts during training, the margin between chosen and rejected shrinks or even flips, and the loss oscillates. In practice, PPO's β (the KL penalty coefficient) typically needs to be co-tuned with the learning rate, while DPO's β behaves more like a temperature: a smaller β allows a larger chosen-rejected margin but is also more prone to overfitting.
Follow-up: Is there a theoretically grounded method to adaptively adjust β (rather than tuning it manually)? What problems arise when using KL divergence itself as the signal for adaptive β?
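The sensitivity to β can be illustrated on a toy discrete problem using the standard closed-form optimum of the KL-regularized objective, π*(y) ∝ π_ref(y) · exp(r(y)/β). The three-response distribution and reward values below are made up for illustration.

```python
import math
from typing import List, Sequence


def optimal_policy(ref_probs: Sequence[float], rewards: Sequence[float], beta: float) -> List[float]:
    """Closed-form optimum: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


ref = [0.7, 0.2, 0.1]      # reference policy over three candidate responses
rewards = [0.0, 0.5, 1.0]  # the least likely response has the highest reward

for beta in (100.0, 1.0, 0.05):
    pi = optimal_policy(ref, rewards, beta)
    print(beta, [round(x, 4) for x in pi], round(kl_divergence(pi, ref), 4))
```

With a large β the optimum stays close to the reference; with a small β nearly all mass moves to the highest-reward response, which is the regime where any flaw in the reward gets exploited.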
Q31: Process Reward Models (PRM) have advantages on long-chain tasks like mathematical reasoning, but how do you handle the annotation ambiguity of "steps that are correct but part of a suboptimal reasoning path"?
The core challenge for PRMs is the multi-modal solution distribution: for the same problem, multiple valid reasoning paths exist (e.g., algebraic vs. geometric approaches), where steps within each path are internally consistent but paths are not directly comparable. During annotation, if annotators are asked "is this step correct?", they may give false negatives when unfamiliar with a particular reasoning style. More subtly, even if a step is correct within its current path, if the overall path is suboptimal, the step-level reward should be adjusted — but this requires a global view, which is fundamentally at odds with PRM's local evaluation nature. Directions for resolution include: (1) path-conditioned PRM — evaluating the current step conditioned on preceding steps, rather than in absolute terms; (2) Monte Carlo estimation — rolling out from the current step to the final answer and using the success rate as the step-level reward, though computational cost is high; (3) agreement-based filtering — annotating only the "critical steps" shared across multiple paths, avoiding path-specific steps.
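The Monte Carlo estimation in direction (2) can be sketched as follows: for each step prefix, run several rollouts to a final answer and use the success rate as that step's reward. `rollout_is_correct` is a hypothetical callable standing in for "complete the solution with some policy and check the final answer"; the toy version below is deterministic so the sketch runs on its own.

```python
from typing import Callable, List


def mc_step_rewards(
    steps: List[str],
    rollout_is_correct: Callable[[List[str]], bool],  # hypothetical: complete from prefix, check answer
    num_rollouts: int = 8,
) -> List[float]:
    """Estimate each step's reward as the success rate of rollouts from its prefix."""
    rewards = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        successes = sum(rollout_is_correct(prefix) for _ in range(num_rollouts))
        rewards.append(successes / num_rollouts)
    return rewards


def toy_rollout(prefix: List[str]) -> bool:
    """Toy checker: a rollout succeeds only if no step in the prefix is flawed."""
    return "flawed step" not in prefix


solution = ["set up equation", "isolate x", "flawed step", "state answer"]
print(mc_step_rewards(solution, toy_rollout))  # [1.0, 1.0, 0.0, 0.0]
```

The cost is `len(steps) * num_rollouts` full completions per solution, which is the computational burden noted above.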
Follow-up: If Monte Carlo rollout is used to estimate PRM's step-level reward, should the rollout policy be the current training policy or a fixed exploration policy? How does this choice affect the bias and variance of the reward estimate?
Q32: Constitutional AI (CAI) claims AI feedback can replace human feedback, but where is the theoretical ceiling of RLAIF? Can the gap between AI feedback and human feedback be eliminated?
The theoretical ceiling of RLAIF is bounded by the capability limits of the AI evaluator. The core issue is: if the AI evaluator has systematic preferences of its own (e.g., verbosity bias, sycophancy), then a policy trained on its feedback will inherit and even amplify those preferences, creating an evaluator-policy co-adaptation degeneracy loop. The deeper limitation is the unverifiability of value alignment: certain dimensions of human preference (such as honesty and harmlessness) fundamentally require human judgment, and AI cannot self-validate. CAI's "constitutional principles" attempt to circumvent this with explicit rules, but rules cannot cover all corner cases, and conflicts between rules require human arbitration. Empirically, RLAIF can approach human feedback on certain objective dimensions (e.g., format correctness), but still has a significant gap on dimensions requiring deep value judgment (e.g., nuanced harm assessment). Theoretically, RLAIF can only achieve RLHF-level performance when the AI evaluator is an unbiased and consistent estimator of human preferences, an assumption that currently cannot be guaranteed.
Follow-up: If the AI evaluator has a known bias (e.g., verbosity bias), can debiasing techniques (e.g., calibration, adversarial training) correct it before RLAIF training? What are the theoretical guarantees of such correction?
Q33: In multi-turn RLHF, how should the dynamics of user strategy be modeled? What systematic errors arise from assuming a fixed user strategy?
Standard multi-turn RLHF implicitly makes a stationary user assumption — that the user follows a fixed response strategy throughout the conversation. In reality, users adjust their questioning strategy based on the model's replies (e.g., pressing harder when the model evades a question, asking for brevity when the model is too verbose). This transforms RLHF from a single-agent MDP into a two-player Markov Game. Under a non-stationary user strategy, the fixed-user assumption causes: (1) overfitting to the simulated user — the policy learns optimal responses for a particular simulated user pattern rather than a robust strategy for real dynamic users; (2) exploitation of user patience — if the simulated user never terminates the conversation due to overly long responses, the policy learns an excessively verbose style. The more fundamental difficulty is that real user strategies are themselves a distribution and may even shift because of model behavior (user-model co-evolution), which theoretically approaches non-stationary multi-agent RL, for which no mature convergence guarantees currently exist.
Follow-up: If you want to explicitly model dynamic user strategies, could a user simulator be jointly trained with the policy? What are the known failure modes of such a self-play framework?
§A Key Papers Timeline
2015-06 · GAE — Schulman et al., ICLR 2016. arXiv:1506.02438 — Introduces Generalized Advantage Estimation, a TD(λ)-style exponentially-weighted multi-step return that continuously interpolates the bias-variance trade-off via λ; the standard advantage estimator underlying PPO and RLHF.
2021-09 · WiSE-FT — Wortsman et al., CVPR 2022. arXiv:2109.01903 — Linearly interpolates the weights of the fine-tuned and pre-trained (zero-shot) models in weight space, preserving pre-training robustness while retaining task performance; the same interpolation idea is used to reduce alignment tax.
2022-03 · InstructGPT — Ouyang et al., NeurIPS 2022. arXiv:2203.02155 — Establishes the canonical 3-stage RLHF pipeline (SFT → Bradley-Terry reward model → PPO with KL penalty) for aligning GPT-3 into an instruction-following assistant.
2022-12 · Constitutional AI (CAI / RLAIF) — Bai et al., arXiv preprint. arXiv:2212.08073 — Replaces human preference annotators with an LLM guided by explicit constitutional principles via a self-critique-revision loop, enabling scalable RLAIF without per-sample human labels.
2023-05 · DPO — Rafailov et al., NeurIPS 2023. arXiv:2305.18290 — Derives a closed-form reparameterization of the KL-constrained RLHF objective that eliminates the explicit reward model, reducing preference alignment to a single supervised classification loss on (chosen, rejected) pairs.
2023-10 · IPO — Azar et al., AISTATS 2024. arXiv:2310.12036 — Identifies that DPO's Bradley-Terry/logit mapping allows KL regularization to vanish under near-deterministic preferences; proposes ΨPO with Ψ = Identity, yielding a bounded squared-loss objective that preserves effective regularization.
2024-02 · DeepSeekMath / GRPO — Shao et al., arXiv preprint. arXiv:2402.03300 — Introduces Group Relative Policy Optimization (GRPO), which removes the PPO critic (value model) by normalizing rewards within a group of responses sampled for the same prompt, cutting the memory and compute cost of RL training.
2024-02 · KTO — Ethayarajh et al., ICML 2024. arXiv:2402.01306 — Replaces paired (chosen, rejected) preference data with pointwise binary desirability labels, framing alignment as prospect-theoretic utility maximization with an asymmetric sigmoid loss and a KL-based reference point.
2024-02 · DPOP (Smaug) — Pal et al., arXiv preprint. arXiv:2402.13228 — Shows that DPO can decrease the log-probability of chosen responses when chosen and rejected are near-identical; adds a max(0,·) penalty term to anchor chosen log-prob above the reference model.
2024-03 · ORPO — Hong et al., arXiv preprint. arXiv:2403.07691 — Merges SFT and preference alignment into a single stage by appending a reference-free odds-ratio contrastive term to the cross-entropy loss, eliminating the need for a frozen reference model.
2024-05 · SimPO — Meng et al., NeurIPS 2024. arXiv:2405.14734 — Aligns DPO's implicit reward with generation-time likelihood by using length-normalized average log-probability and adds an explicit target margin γ, removing the reference model and mitigating length bias.
2024-10 · Likelihood Displacement — Razin et al., ICLR 2025. arXiv:2410.08847 — Proves that DPO shifts probability mass away from chosen responses when chosen and rejected share high hidden-embedding similarity (CHES score), potentially causing "unintentional unalignment"; proposes CHES-based data filtering as a remedy.
2025-01 · DeepSeek-R1 — Guo et al., Nature 2025. arXiv:2501.12948 — Demonstrates that chain-of-thought reasoning ability can emerge from pure GRPO-based RL on a base model (R1-Zero), and that a cold-start SFT stage followed by two rounds of RL and rejection-sampling SFT yields a reasoning model competitive with OpenAI o1.