Reward Modeling & Evaluation Cheatsheet

A Complete Overview of Reward Modeling in LLM Post-Training


1. RM Training Methods

1.1 Core Objective

A reward model learns a scalar scoring function from human preferences, used to score LLM-generated candidate responses in RLHF (Reinforcement Learning from Human Feedback) or similar pipelines, guiding policy model optimization.

1.2 Pairwise / Bradley-Terry Model

Core idea: Rather than estimating absolute scores directly, learn the preference relationship between two responses.

Bradley-Terry (BT) Model:

Given prompt $x$, chosen response $y_w$, rejected response $y_l$:

$$P(y_w \succ y_l \mid x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

where $\sigma$ is the sigmoid function and $r_\theta$ is the reward model with parameters $\theta$.

Training Loss:

$$\mathcal{L}_{\text{BT}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) \right]$$

Advantages:

Limitations:

From-scratch implementation: an RM = backbone + scalar head, scoring the last non-pad token's hidden state; the loss is just BT's negative log-sigmoid.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, backbone, hidden):
        super().__init__()
        self.backbone = backbone               # returns (B, T, hidden)
        self.head = nn.Linear(hidden, 1, bias=False)   # scalar reward head

    def forward(self, input_ids, attn_mask):
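        h = self.backbone(input_ids, attn_mask)        # (B, T, H)
        # Assumes right padding: the last real token sits at index sum(mask) - 1.
        last = attn_mask.sum(1).long() - 1             # (B,) index of last non-pad token
        rows = torch.arange(h.size(0), device=h.device)
        pooled = h[rows, last]                         # (B, H) hidden state at that position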
        return self.head(pooled).squeeze(-1)           # (B,) scalar reward

def bt_loss(r_w, r_l):
    # r_w/r_l: scalar rewards for chosen / rejected; minimize -log σ(r_w - r_l)
    return -F.logsigmoid(r_w - r_l).mean()

1.3 Pointwise / Regression

Core idea: Directly regress to an absolute quality score.

Training Loss:

$$\mathcal{L}_{\text{point}} = \mathbb{E}_{(x, y, s)} \left[ (r_\theta(x, y) - s)^2 \right]$$

where $s$ is the human-annotated absolute score (e.g., a 1–5 Likert scale).
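A minimal sketch of the regression objective in plain Python, so it runs without a framework. The function name `pointwise_loss` and the toy numbers are illustrative assumptions, not part of the original.

```python
def pointwise_loss(predicted, targets):
    """Mean squared error between RM scores and annotated scores."""
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    return sum((r - s) ** 2 for r, s in zip(predicted, targets)) / len(predicted)

# Toy example: RM outputs vs. 1-5 Likert labels (made-up numbers).
loss = pointwise_loss([3.5, 4.0, 1.0], [4.0, 4.0, 2.0])
```

In a real RM the same expression would be applied to the scalar head's output, with the annotated scores typically rescaled to a fixed range first.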

Advantages:

Limitations:

1.4 Other Variants

| Method | Description |
| --- | --- |
| Listwise / Plackett-Luce | Full ranking over k responses; carries more information than pairwise |
| Regression + Ranking Hybrid | Jointly optimizes regression loss and ranking loss |
| Multi-objective RM | Assigns separate scores for different dimensions (helpfulness, safety, factuality) |
| Token-level RM | Distributes rewards at the token level; related to PRM |
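To make the listwise variant concrete, here is a small sketch of the Plackett-Luce negative log-likelihood for one ranking. It assumes the RM scores are passed in ranked order, best first; the function names are hypothetical.

```python
import math

def plackett_luce_nll(scores_best_to_worst):
    """Negative log-likelihood of one ranking under Plackett-Luce.

    scores_best_to_worst: RM scores ordered from the top-ranked response
    to the bottom-ranked one.
    """
    nll = 0.0
    for i in range(len(scores_best_to_worst) - 1):
        remaining = scores_best_to_worst[i:]
        # log-sum-exp over the responses not yet placed (shifted for stability)
        m = max(remaining)
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores_best_to_worst[i] - log_z
    return nll

def bt_nll(r_w, r_l):
    """Bradley-Terry loss for one pair: -log sigmoid(r_w - r_l)."""
    return math.log1p(math.exp(-(r_w - r_l)))
```

With k = 2 the listwise loss reduces to the Bradley-Terry loss from §1.2, which is one way to sanity-check an implementation.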

1.5 DPO Implicit Reward ≠ KL Estimator (a common interview trap)

Both DPO and the KL estimators feature the same log-ratio $\beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}$, yet they are two different objects that are frequently conflated in interviews.

DPO implicit reward: Inverting the optimal solution of KL-regularized RLHF gives

$$\hat r_\theta(x,y) = \beta\log\frac{\pi_\theta(y\mid x)}{\pi_{\text{ref}}(y\mid x)} + \beta\log Z(x)$$

This is a scalar reward for a single response, playing exactly the role of $r_\theta$ in §1.2 above. Substituting it back into the BT loss, the partition term $\beta\log Z(x)$ cancels exactly in the difference $\hat r_\theta(x,y_w)-\hat r_\theta(x,y_l)$, so DPO needs neither an explicit RM nor any computation of $Z(x)$:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$
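The cancellation of the partition term can be checked numerically. The sketch below works on sequence-level log-probabilities for a single pair; the function names and the default `beta=0.1` are assumptions made for illustration.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def implicit_reward(logp_policy, logp_ref, beta, log_z=0.0):
    """beta * log(pi_theta / pi_ref) + beta * log Z(x) for one response."""
    return beta * (logp_policy - logp_ref) + beta * log_z

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    return -log_sigmoid(margin)
```

Because `log_z` is shared by both responses to the same prompt, any value passed to `implicit_reward` drops out of the margin. When the policy equals the reference the margin is zero and the loss is log 2.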

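KL estimators (k1/k2/k3): These estimate the divergence $\mathrm{KL}(\pi_\theta\|\pi_{\text{ref}})$, a single aggregate scalar over the whole distribution, used as the regularization penalty in RLHF/GRPO (for example, the KL term in a GRPO loss).

Why they get confused: both are built from the log-ratio, but the DPO implicit reward scores one response at a time and plays the role of a reward, whereas the KL estimators are averaged over responses sampled from $\pi_\theta$ and play the role of a penalty.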

Note

For the precise definitions and gradient properties of the three KL estimators, see llm-post-training §9.4; this section only highlights how they differ from the DPO implicit reward.
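The contrast can be sketched in a few lines: the implicit reward is computed per response, while a KL estimate is an average over sampled responses. The k1/k2/k3 forms below follow the commonly used convention with ratio r = π_ref/π_θ on samples from π_θ; that convention is an assumption here, since the precise definitions live in the cross-referenced section.

```python
import math

def implicit_reward_no_z(logp_policy, logp_ref, beta):
    """Per-response quantity: beta * log-ratio (the Z(x) term is dropped)."""
    return beta * (logp_policy - logp_ref)

def kl_estimates(samples):
    """Average k1/k2/k3 over responses sampled from the policy.

    samples: list of (logp_policy, logp_ref) pairs for y ~ pi_theta.
    Returns one scalar per estimator for the whole batch.
    """
    k1 = k2 = k3 = 0.0
    for logp_policy, logp_ref in samples:
        log_r = logp_ref - logp_policy          # log(pi_ref / pi_theta)
        k1 += -log_r
        k2 += 0.5 * log_r ** 2
        k3 += math.expm1(log_r) - log_r         # (r - 1) - log r, never negative
    n = len(samples)
    return k1 / n, k2 / n, k3 / n
```

The first function returns a different number for every response; the second collapses a batch into one number per estimator.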


2. PRM vs ORM: Process Reward Model vs Outcome Reward Model

2.1 ORM — Outcome Reward Model

Definition: Assigns a single overall reward score based solely on the final output.

Input:  [Prompt + full response text]
Output: single scalar reward r

Characteristics:

2.2 PRM — Process Reward Model

Definition: Assigns a reward score to each step of the reasoning process.

Input:  [Prompt + first i steps]
Output: per-step reward r_step(i)

Characteristics:

2.3 Comparison Table

| Dimension | ORM (Outcome Reward) | PRM (Process Reward) |
| --- | --- | --- |
| Reward Granularity | Entire response | Each reasoning step |
| Signal Density | Sparse | Dense |
| Annotation Cost | Low | High |
| Credit Assignment | Poor | Good |
| Typical Applications | Dialogue, writing | Math reasoning, code, multi-step reasoning |
| Integration with Search | Best-of-N | Best-of-N, tree search, beam search |
| Annotation Method | Final answer correctness | Per-step logical correctness |

2.4 Automated PRM Data Generation

To reduce annotation cost, common approaches include:

Math-Shepherd-style auto-labeling (Wang et al., arXiv:2312.08935): no humans; use MC completions to estimate whether each step can still reach the correct answer. From the prefix up to step $i$, sample $N$ full solutions $\{a_j\}$, compare to the gold answer $a^*$, and assign a binary/soft label:

$$y_{s_i}^{\text{HE}}=\mathbb{I}\!\big[\exists\,a_j=a^*\big]\in\{0,1\}\qquad y_{s_i}^{\text{SE}}=\frac{1}{N}\sum_{j=1}^{N}\mathbb{I}\,[a_j=a^*]$$

(HE = hard estimation, positive if any completion is correct; SE = soft estimation, the correct-rate as a soft label.)
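The two labels are simple counts over the sampled completions. A minimal sketch, assuming answers can be compared by equality after normalization; the function name is hypothetical.

```python
def step_labels(completion_answers, gold_answer):
    """Hard and soft labels for one step from N sampled completions.

    completion_answers: final answers of N rollouts continued from the
    prefix ending at this step; gold_answer: the reference answer.
    """
    if not completion_answers:
        raise ValueError("need at least one completion")
    hits = sum(1 for a in completion_answers if a == gold_answer)
    hard = 1 if hits > 0 else 0                 # HE: any rollout reaches the answer
    soft = hits / len(completion_answers)       # SE: fraction of rollouts that do
    return hard, soft
```

For example, four rollouts with answers `["42", "41", "42", "7"]` and gold answer `"42"` give a hard label of 1 and a soft label of 0.5.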

Aggregating step scores into a trajectory score: Math-Shepherd uses MIN (the minimum step score $\min_i r_{s_i}$, most conservative) for best-of-N reranking; Lightman et al. (arXiv:2305.20050, PRM800K, 800k human step labels marking each step correct/neutral/wrong) instead primarily use PRODUCT (product of step scores).
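The two aggregation rules can pick different winners, which the following sketch illustrates with made-up step scores. The helper names are assumptions for illustration.

```python
import math

def aggregate_min(step_scores):
    """Trajectory score = weakest step (the MIN rule)."""
    return min(step_scores)

def aggregate_product(step_scores):
    """Trajectory score = product of step scores (the PRODUCT rule)."""
    return math.prod(step_scores)

def rerank(candidates, aggregate):
    """Return the index of the candidate with the highest aggregated score."""
    return max(range(len(candidates)), key=lambda i: aggregate(candidates[i]))

# Made-up step scores for two candidate solutions.
candidates = [[0.9, 0.9, 0.5], [0.7, 0.7, 0.7, 0.7, 0.7]]
```

Here MIN prefers the second candidate (weakest step 0.7 vs 0.5), while PRODUCT prefers the first, because multiplying probabilities below 1 penalizes solutions with more steps.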

Note

Auto (MC) vs human (PRM800K): MC labeling scales indefinitely but has false negatives (a correct step labeled wrong); human labeling has no MC noise but is very costly.


3. Reward Hacking & Over-Optimization

3.1 What is Reward Hacking

Definition: The policy model learns to exploit flaws and out-of-distribution behavior of the reward model to obtain high scores, rather than genuinely improving response quality.

Common Patterns:

| Pattern | Definition |
| --- | --- |
| Verbosity hacking | Generating long but low-quality content to exploit the RM's spurious preference for length |
| Sycophancy | Using flattering, fawning language to align with the user's stance rather than providing accurate answers |
| Format gaming | Overusing Markdown lists, headings, bold text, etc., which the RM misinterprets as quality signals |
| Spec gaming | Meeting the literal requirements of the RM or task specification while violating its intent (e.g., answers that are "formally correct" but substantively wrong) |
| OOD collapse | After the policy drifts from the training distribution, RM scores become unreliable, giving inflated or random scores for out-of-distribution generations |

3.2 Goodhart's Law Perspective

"When a measure becomes a target, it ceases to be a good measure."

$$r_{\text{true}} \neq r_{\text{RM}} \quad (\text{OOD})$$

The RM is only accurate within its training distribution; once the policy generates out-of-distribution content, RM predictions are no longer reliable.

3.2a Gao et al. 2022 — Scaling Laws for Overoptimization

Source: Gao, Schulman & Hilton, arXiv:2210.10760, using a synthetic gold RM in the InstructGPT setup.

Core variable: Let $d = \sqrt{D_{\text{KL}}(\pi \| \pi_{\text{init}})}$ (the square root of the KL distance; the paper chooses this parameterization because KL is a "quadratic measure"). The functional form of how the gold RM score $R$ changes with $d$ differs by optimization method:

Best-of-N (BoN) form:

$$R_{\text{bon}}(d) = d(\alpha_{\text{bon}} - \beta_{\text{bon}} d)$$

Reinforcement Learning (RL) form:

$$R_{\text{RL}}(d) = d(\alpha_{\text{RL}} - \beta_{\text{RL}} \log d)$$

where $\alpha_{\text{bon}}, \beta_{\text{bon}}, \alpha_{\text{RL}}, \beta_{\text{RL}}$ are fitted parameters (varying smoothly with RM parameter count); $R(0) := 0$. ⚠️ The RL form has infinite slope near the origin; the paper notes this form may not hold near the origin.
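Both forms rise and then fall, and the turning point follows from setting the derivative to zero. The sketch below evaluates the two forms and their peaks; the coefficient values are purely illustrative and are not taken from the paper.

```python
import math

def r_bon(d, alpha, beta):
    """Gold score under best-of-N: d * (alpha - beta * d)."""
    return d * (alpha - beta * d)

def r_rl(d, alpha, beta):
    """Gold score under RL: d * (alpha - beta * log d), with R(0) := 0."""
    return 0.0 if d == 0 else d * (alpha - beta * math.log(d))

def peak_bon(alpha, beta):
    """d that maximizes r_bon: set alpha - 2 * beta * d = 0."""
    return alpha / (2 * beta)

def peak_rl(alpha, beta):
    """d that maximizes r_rl: set alpha - beta * (log d + 1) = 0."""
    return math.exp(alpha / beta - 1)

# Illustrative coefficients only; these are NOT values from the paper.
alpha, beta = 1.0, 0.25
```

Under these forms the peak gold score is α²/(4β) for BoN and β·d* for RL, where d* is the RL peak location. Past the peak, further optimization against the proxy RM lowers the gold score.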

Key findings:

⚠️ The specific values of $\alpha, \beta$ above depend on RM parameter count and data volume; the paper provides no single universal constant. For specific values, consult Figure 3 of the original paper.

3.3 Mitigations

3.3.1 KL Divergence Control

Add a KL penalty term to the RLHF objective:

$$\max_{\pi} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(\cdot|x)} \left[ r_\theta(x, y) - \beta \cdot D_{\text{KL}}(\pi(y|x) \| \pi_{\text{ref}}(y|x)) \right]$$

Note

For the single-sample KL estimators k1/k2/k3, in-reward vs in-loss placement, and the gradient bias of k3-as-loss, see llm-post-training §9.4.

3.3.2 RM Ensembles

3.3.3 Length Penalty

$$r_{\text{adjusted}}(x,y) = r_\theta(x,y) - \alpha \cdot \text{len}(y)$$

or the normalized form:

$$r_{\text{adjusted}} = \frac{r_\theta(x,y)}{\text{len}(y)^\gamma}$$
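Both adjustments are one-liners. The sketch below uses hypothetical function names and default values for `alpha` and `gamma` chosen only for illustration.

```python
def length_penalized(reward, length, alpha=0.01):
    """Subtractive form: r - alpha * len(y)."""
    return reward - alpha * length

def length_normalized(reward, length, gamma=0.5):
    """Normalized form: r / len(y) ** gamma (length must be positive)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return reward / length ** gamma
```

One design point worth noting: the normalized form only discourages length when the raw reward is positive. A negative reward divided by a larger length moves toward zero, so longer responses are favored in that regime unless rewards are shifted to be positive first.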

3.3.4 Iterative Re-Training / Online RLHF

Iterative pipeline:
1. Use current policy π_k to generate candidate responses
2. Collect new preference annotations / or annotate with RM
3. Update RM: r_k → r_{k+1}
4. Apply RLHF update with new RM: π_k → π_{k+1}
5. Repeat

3.3.5 Other Methods

| Method | Description |
| --- | --- |
| Rejection Sampling / Best-of-N | No RL; sample N responses from the policy and select the one with the highest RM score |
| Constrained Optimization | Add hard safety/quality constraints |
| Pre-training anchor | Maintain proximity to the pre-trained model (e.g., L2 regularization) |
| Robust RM training | Adversarial training and data augmentation to improve RM generalization |

3.4 Preference Data Construction

The quality ceiling of the RM is determined by the quality of the preference data. The following are core design choices:

3.4.1 Absolute vs Relative Annotations

| Dimension | Absolute Annotation (Pointwise) | Relative Annotation (Pairwise/Comparative) |
| --- | --- | --- |
| Annotation form | Rate a single response 1–5 | Compare two responses and pick the better one |
| Annotation cost | High (requires internally consistent scales across annotators) | Low (cognitively simpler task) |
| Inter-annotator agreement | Low; scale drift is pronounced | High; humans make relative judgments more reliably |
| Information content | Richer (ordinal relationships recoverable) | Directly corresponds to the Bradley-Terry model |
| Typical use | Direct regression RM | Mainstream RLHF practice |

3.4.2 Margin Filtering

Motivation: Preference pairs contain many "hard to distinguish" samples (low annotator agreement); training directly on these introduces noise.

Approach:

3.4.3 Annotator Calibration

Problem: Different annotators have systematic biases (personal styles, varying strictness).

Mitigation methods:

⚠️ Annotator bias is learned and amplified by the RM, ultimately affecting policy behavior — calibration at the data construction stage is more fundamental than post-hoc remediation.


4. Length Bias & Other RM Pathologies

4.1 Length Bias

Phenomenon: RMs tend to assign higher scores to longer responses, even when longer does not mean better.

Root causes:

Consequences:

Mitigations:

4.2 Position Bias

Phenomenon: In pairwise comparisons, RMs tend to prefer responses in a specific position (e.g., the first or the last).

Mitigation: Swap positions and make two predictions; accept only consistent results.
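The swap-and-compare check can be sketched as follows. The `judge` callable and its `"first"`/`"second"` return convention are assumptions for illustration; the two toy judges exist only to exercise the logic.

```python
def debiased_preference(judge, prompt, a, b):
    """Ask a pairwise judge twice with positions swapped.

    judge(prompt, first, second) must return "first" or "second".
    Returns "a", "b", or None when the two verdicts disagree.
    """
    first_pass = "a" if judge(prompt, a, b) == "first" else "b"
    second_pass = "b" if judge(prompt, b, a) == "first" else "a"
    return first_pass if first_pass == second_pass else None

# Toy judges for illustration: one always favors the first slot,
# the other prefers the longer response regardless of position.
def position_biased_judge(prompt, first, second):
    return "first"

def length_judge(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"
```

A purely position-driven judge yields `None` on every pair, so its verdicts are discarded, while a judge whose decision depends on content survives the swap.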

4.3 Verbosity / Redundancy Bias

Phenomenon: Beyond mere length, RMs also tend to favor responses containing more redundant modifiers, filler transitions, and formatting elements.

Distinction from length bias: Even at the same length, "fancier" responses may still be preferred.

4.4 Confirmation / Sycophancy Bias

Phenomenon: RMs prefer responses that agree with the user's stated position, even if that position is incorrect.

4.5 Summary of Common RM Pathologies

| Pathology | Manifestation | Root Cause |
| --- | --- | --- |
| Length bias | Longer responses score higher | Spurious correlations in training data |
| Position bias | Preference for specific positions | Order effects in annotation |
| Verbosity bias | Preference for over-decorated responses | Annotators mistakenly treat it as a quality signal |
| Sycophancy bias | Preference for responses that flatter users | Psychological tendencies of annotators |
| Format bias | Preference for lists/Markdown formatting | Spurious format associations in training data |
| OOD collapse | Inaccurate scoring for out-of-distribution generations | Limited RM generalization |
| Surface feature shortcuts | Scores based on keywords rather than semantics | Insufficient model capacity/data |
| Annotator noise amplification | RM learns individual annotator biases | Uneven annotation quality |

5. RM & Judge Evaluation

5.1 RewardBench

Overview: RewardBench is a comprehensive reward model evaluation benchmark designed to systematically test RM performance across multiple dimensions.

Evaluation dimensions (approximate categories):

Evaluation method:

Use cases:

5.2 LLM-as-Judge

📎 Cross-reference: This section focuses on LLM-as-Judge as a training signal (how its biases can contaminate RM training data and affect RLHF optimization). For specifics on LLM-as-Judge in evaluation practice — including bias mitigation and benchmark selection — see eval-and-judges-en.html §2.

Core idea: Use a powerful LLM (e.g., GPT-4-class models) to directly score or rank response quality, replacing human evaluation.

Common Evaluation Paradigms

| Paradigm | Description |
| --- | --- |
| Pointwise Scoring | Rate a single response on a 1–10 scale |
| Pairwise Comparison | Choose the better of two responses |
| Ranking / Listwise | Rank multiple responses |
| Reference-guided | Score by comparison against a reference answer |

Common Biases in LLM-as-Judge

(a) Position Bias
(b) Verbosity Bias
(c) Self-Preference Bias
(d) Other LLM-as-Judge Biases
| Bias | Description |
| --- | --- |
| Format bias | Preference for responses using Markdown, lists, and other formatted elements |
| Capability boundary bias | LLM cannot correctly evaluate responses that exceed its own capabilities |
| Sycophancy bias | Overly "friendly"; reluctant to give low scores |
| Anchoring effect | Influenced by candidate content seen earlier |
| Keyword bias | Oversensitivity to specific terminology |

LLM-as-Judge Best Practices

  1. Structured rubrics: Clearly define evaluation dimensions and criteria
  2. Position debiasing: Evaluate twice with swapped positions
  3. Temperature set to 0: Reduce randomness, improve consistency
  4. Multi-judge: Use multiple different LLMs and average/vote
  5. Human calibration: Periodically check consistency against human evaluations
  6. Pairwise > absolute: LLMs make more reliable relative judgments than absolute ratings
  7. Chain-of-Thought evaluation: Have the LLM reason first, then assign a score

5.3 Trade-offs Between Human and Automated Evaluation

Evaluation quality ←——————————————————————→ Evaluation cost
  Expert human eval    Crowdsourcing    LLM-as-Judge    Automatic metrics (BLEU/ROUGE...)
  (highest quality)                                             (lowest cost)

5.4 Other Evaluation Methods


6. Interview Questions

L1 — Fundamentals


Q1: What is a Reward Model (RM)? What role does it play in RLHF?

A: A reward model is a scoring function learned from human preference data that outputs a scalar reward value for LLM-generated responses. In RLHF, the RM acts as a proxy for human preferences, serving as the objective function for policy optimization — the policy model improves its output quality by maximizing the RM score.

Follow-up: Why can't human annotations be used directly as rewards for RL? Because RL requires a large volume of online reward signals; human annotation is expensive and slow. The RM provides a proxy signal that can be computed in batch, instantly.


Q2: What are the core assumptions of the Bradley-Terry model? What does its training loss function look like?

A: The BT model assumes: for a given prompt, the probability that response $y_w$ is preferred over $y_l$ is determined by the sigmoid function of the difference in their reward scores. The training loss is the negative log-likelihood: $\mathcal{L} = -\log\sigma(r(y_w) - r(y_l))$. Core assumptions include that preferences can be explained by scalar score differences and that preferences are transitive.

Follow-up: What happens if preferences are not transitive? A scalar-score model cannot represent a preference cycle, so BT treats the conflicting pairs as label noise and pulls the scores involved toward each other. Plackett-Luce does not help here, since it shares the scalar-score assumption; a model that predicts the pairwise preference probability directly from both responses can represent intransitive preferences.


Q3: What is the core difference between PRM and ORM? Which scenarios is each suited for?

A: ORM assigns a single reward to the final output only; the signal is sparse. PRM assigns rewards to each reasoning step; the signal is dense. PRM is suited for multi-step reasoning tasks (math, code); ORM is suited for end-to-end evaluation tasks (dialogue, writing). PRM's advantage is precise credit assignment, but its annotation cost is far higher than ORM's.

Follow-up: How can PRM training data be obtained automatically? Monte Carlo estimation can be used: continue sampling multiple rollouts from each step and use the final answer accuracy as an approximation of that step's reward.


Q4: What is Best-of-N (BoN) sampling? How does it differ from RL training?

A: BoN samples N responses from the policy and uses the RM to select the one with the highest score as the final output. It is an inference-time optimization method that does not update model parameters; it is simple but inference cost scales proportionally with N. RL training updates parameters during training, requiring no additional sampling at inference time.

Follow-up: What is the ceiling of BoN? Diminishing returns as N grows; also limited by the policy's original distribution — it cannot generate high-quality responses outside the distribution.
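A minimal sketch of BoN with stand-in components. The toy policy and toy RM are assumptions chosen so the example runs on its own; in practice `sample` would call the LLM and `score` would call the reward model.

```python
import random

def best_of_n(sample, score, prompt, n, rng):
    """Draw n responses from the policy and keep the highest-scoring one."""
    candidates = [sample(prompt, rng) for _ in range(n)]
    return max(candidates, key=lambda y: score(prompt, y))

# Stand-ins for illustration only: a "policy" that emits a random number
# as its response, and an "RM" that scores a response by its value.
def toy_policy(prompt, rng):
    return rng.random()

def toy_rm(prompt, response):
    return response
```

For this toy policy the expected best score is N/(N+1), which makes the diminishing returns visible: going from N=1 to N=2 raises it from 1/2 to 2/3, while going from N=16 to N=32 raises it by less than 0.03. The best response is always one the policy could already produce.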


L2 — Intermediate


Q5: What is reward hacking? Give 2–3 concrete examples.

A: The policy model learns to exploit RM weaknesses for high scores rather than genuinely improving quality. Examples: (1) generating verbose content to score higher (length hacking); (2) using flattering/sycophantic language; (3) repeating high-scoring template sentences from training data; (4) overusing formatting elements (headings, lists, bold text).

Follow-up: What is the relationship between Goodhart's Law and reward hacking? Goodhart's Law states that "once a measure becomes a target, it is no longer a good measure" — the RM is an approximate measure of human preferences; when the policy specifically optimizes it, the two become decoupled.


Q6: Explain in detail the role of KL divergence penalty in RLHF and how it is tuned.

A: The KL penalty term $\beta \cdot D_{\text{KL}}(\pi \| \pi_{\text{ref}})$ constrains the policy from deviating too far from the reference model, acting as regularization. $\beta$ too large → the policy barely updates and learns nothing; $\beta$ too small → allows excessive exploration, prone to reward hacking. Adaptive KL is commonly used: set a target KL value and adjust $\beta$ dynamically.

Follow-up: Which layer of reward hacking does each of KL penalty and RM ensembles address? KL constrains the policy's scope from the optimization constraint layer; RM ensembles reduce scoring variance and over-optimism from the reward estimation layer. The two are complementary.
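Adaptive KL is often implemented as a proportional controller. The sketch below shows one such update rule; the `gain` and `clip` defaults are illustrative assumptions, not recommended settings.

```python
def update_beta(beta, observed_kl, target_kl, gain=0.1, clip=0.2):
    """One proportional-controller step for the KL coefficient.

    Raises beta when the measured KL exceeds the target and lowers it
    otherwise; the relative error is clipped so beta changes gradually.
    """
    error = observed_kl / target_kl - 1.0
    error = max(-clip, min(clip, error))
    return beta * (1.0 + gain * error)
```

Called once per training batch with the measured policy-to-reference KL, this nudges `beta` up when the policy drifts past the target and down when it stays well inside it.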


Q7: How can RM ensembles mitigate reward hacking? What are the differences between aggregation strategies?

A: Train multiple independent RMs (different seeds/data subsets/architectures) and aggregate scores. Strategies: (1) Mean: smooths scores, reduces noise from any single RM; (2) Min / conservative: takes the most conservative score, avoids over-optimism — safer when the policy drifts out of distribution; (3) Uncertainty-weighted: reduce the weight of high-variance samples. In practice, the min strategy is most effective against reward hacking but can be overly conservative.

Follow-up: How can the compute overhead of ensembles be optimized? Use a shared-parameter backbone with different heads; or train multiple lightweight variants using PEFT adapters.
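The aggregation strategies from Q7 can be sketched on a list of per-RM scores for one response. The "mean minus a multiple of the standard deviation" rule below is one possible reading of uncertainty-aware aggregation and is an assumption of this sketch, as are the function names.

```python
import statistics

def ensemble_mean(scores):
    """Average the ensemble's scores."""
    return statistics.fmean(scores)

def ensemble_min(scores):
    """Most conservative member's score."""
    return min(scores)

def ensemble_penalized(scores, penalty=1.0):
    """Mean minus a multiple of the ensemble's standard deviation."""
    return statistics.fmean(scores) - penalty * statistics.pstdev(scores)
```

When the RMs agree, all three rules return the same value. When they disagree, as tends to happen on out-of-distribution responses, the min and penalized rules pull the reward down while the mean does not.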


Q8: What causes length bias? How can it be mitigated at the data level and inference level?

A: Cause: In training preference pairs, chosen responses are typically longer than rejected ones; the RM overfits to the spurious feature "long = good." Data level: Construct preference pairs with similar lengths; analyze and remove length correlations. Inference level: Apply a length penalty $r' = r - \alpha \cdot \text{len}$; or use length normalization.

Follow-up: Why might applying a length penalty only at inference time be insufficient? If the RM has already internalized length as a strong feature, the length penalty may not fully offset it; the more fundamental approach is to debias at the training data or training method level.


Q9: How does iterative re-training mitigate distribution shift?

A: Standard RLHF trains the RM only on data generated by the initial policy; as RL training progresses, the policy distribution shifts, and the RM becomes inaccurate on the new distribution. Iterative re-training uses the current policy to generate new data → re-annotate → update the RM → continue RLHF, so the RM keeps covering the policy's current distribution. The cost is additional compute and annotation overhead in every round.

Follow-up: Does DPO also need iterative re-training? In theory, DPO also faces the distribution shift problem; iterative DPO (online DPO / online preference optimization) has been proposed to address this.


Q10: What are the main biases of LLM-as-Judge? How can they be systematically mitigated?

A: Main biases: (1) Position bias — prefers specific positions; (2) Verbosity bias — prefers long responses; (3) Self-preference — gives higher scores to models from the same family; (4) Format bias — prefers formatted content. Mitigations: swap positions for dual evaluation, define explicit scoring rubrics, vote across multiple judges, periodically calibrate against humans.

Follow-up: Why is pairwise comparison generally better than absolute scoring (pointwise)? Because absolute scoring requires the judge to maintain a consistent internal scoring scale, which is prone to drift; pairwise comparison only requires judging relative quality, which is a simpler cognitive task with higher consistency.


Q11: What is RewardBench? Which capability dimensions does it evaluate in RMs?

A: RewardBench is a comprehensive reward model evaluation benchmark that measures RM performance across multiple dimensions using pairwise accuracy. Dimensions include: general dialogue quality, challenging dialogue (distinguishing subtle differences), safety, and reasoning ability. It provides a standardized comparison framework that helps diagnose an RM's strengths and weaknesses.

Follow-up: Can RewardBench results directly predict an RM's actual performance in RLHF? Not entirely. Pairwise accuracy on a benchmark is a necessary but not sufficient condition — an RM's performance on a benchmark is not fully correlated with whether it will encounter reward hacking during RLHF optimization; online evaluation is also needed.
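Pairwise accuracy itself is straightforward to compute from RM scores on (chosen, rejected) pairs. This is a generic sketch, not RewardBench's own code; counting ties as errors is an assumption.

```python
def pairwise_accuracy(pairs):
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly.

    Ties count as errors here; that tie rule is an assumption of this sketch.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)
```

Computing this separately on in-distribution pairs and on pairs sampled from the current policy gives a rough view of how much accuracy is lost under distribution shift.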


L3 — Advanced


Q12: From a theoretical perspective, why is KL constraint necessary when the RM is imperfect?

A: When $r_{\text{RM}} \neq r_{\text{true}}$, unconstrained maximization of $r_{\text{RM}}$ can cause $r_{\text{true}}$ to decrease (reward hacking). The KL constraint is equivalent to optimizing within a local region where the RM is reliable: the RM is accurate near the training distribution, and KL ensures the policy does not leave this "trust region." This aligns with the trust region concept in TRPO/PPO. When KL is 0, $\pi = \pi_{\text{ref}}$; when KL is small, RM reliability is high.

Follow-up: Are there types of reward hacking that KL penalty cannot prevent? Yes. If reward hacking occurs within a low-KL region near $\pi_{\text{ref}}$ (e.g., simply adding a few flattering words), KL penalty cannot stop it. In such cases, a better RM is needed.


Q13: Design a complete iterative RLHF system and describe its architecture and key decision points.

A:

Architecture:
π₀ (SFT model)
   ↓
[Data generation] π_k generates → N responses per prompt
   ↓
[Preference annotation] Human annotations or RM annotations (on-policy data)
   ↓
[RM update] Fine-tune RM_k on new data → RM_{k+1}
   ↓
[Policy optimization] RLHF (PPO) with RM_{k+1}, KL→π_ref
   ↓
[Evaluation] Reward curves, KL curves, human evaluation, reward hacking detection
   ↓
Repeat k = 1, 2, ...

Key decision points:

  • Iteration frequency (how many RL steps before updating the RM)
  • Whether to collect human annotations each round (high cost) vs. using RM self-annotation
  • Choice of KL target value
  • When to stop iterating (reward saturation / human evaluation target met)

Follow-up: How do you detect when iteration should stop? What are the potential risks of continuous iteration? Stop when human evaluation scores stop improving, KL continues to increase, or mode collapse occurs. Continued iteration may cause the policy to converge toward the RM's specific preferences, losing diversity.


Q14: Compare PRM's Monte Carlo estimation method with direct human annotation; analyze the bias-variance trade-off of each.

A: MC estimation: Sample K rollouts from each step and use final-answer accuracy as the reward. Variance comes from finite sampling (estimates are noisy when K is small) and shrinks as K grows, at higher cost. Bias comes from the rollout policy: the label measures whether that policy can still reach the answer from the step, not whether the step is logically correct, so a weak completer under-rates correct steps and a strong one can recover from flawed steps. Human annotation: No sampling noise, but limited by annotator capability and consistency, which introduces its own systematic bias and label noise. MC estimation is scalable but systematically biased; human annotation is more accurate but not scalable. A hybrid approach (small-scale human annotation + MC expansion) is common practice.

Follow-up: Under what conditions does MC estimation fail severely? When the downstream search space from a given step is extremely large and correct answers are extremely rare (e.g., complex math problems), even many rollouts may all fail, causing zero reward for correct steps (false negatives).
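The false-negative risk has a simple closed form if rollouts are treated as independent. A short sketch, with a hypothetical function name:

```python
def prob_all_rollouts_fail(success_prob, k):
    """Chance that k independent rollouts from a step all miss the answer."""
    return (1.0 - success_prob) ** k
```

If a correct step leads to the right answer only 5% of the time, then with K = 8 rollouts all of them fail with probability 0.95^8, roughly 0.66. About two thirds of such correct steps would receive a hard label of 0.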


Q15: How would you design an evaluation framework that is robust to reward hacking?

A: Multiple layers of defense are needed:

  1. RM internal metrics: Pairwise accuracy (in-distribution vs. OOD), calibration
  2. Proxy metric monitoring: KL curves, reward distribution drift, generation length changes, n-gram repetition rate
  3. Blind human evaluation: Randomly sample outputs for human scoring; compare against RM scores
  4. Adversarial test sets: Construct test cases with known reward hacking patterns
  5. Multi-RM consistency: Declining score correlation across multiple RMs as an early warning signal
  6. A/B testing: End-user satisfaction as the ultimate evaluation

Follow-up: In practice, which layer is most often overlooked but most critical? OOD detection is most often overlooked — people typically only look at in-distribution accuracy, but RM performance on the OOD distribution generated by the policy is what determines whether reward hacking occurs.


Q16: Discuss the impact of LLM-as-Judge self-preference bias on benchmark leaderboards.

A: If a mainstream LLM-as-Judge (e.g., GPT-4) has self-preference bias, models stylistically similar to it will rank higher on leaderboards, distorting rankings. For example, if a model is from the same family as the judge or trained on similar data, it may receive disproportionately high scores. This creates judge sensitivity in benchmark results — rankings depend on judge selection — undermining the comparability and credibility of leaderboards.

Follow-up: How would you design a leaderboard that is robust to judge bias? Use multiple heterogeneous judges (different vendors, different sizes); report inter-judge agreement; disclose the relationship between the judge and each model; and incorporate human evaluation as an anchor.


Q17: Why is "the RM's generalization ability the most critical bottleneck in RLHF"?

A: The entire RLHF optimization loop relies on the RM as its objective function. The RM's capability ceiling determines the ceiling of policy optimization. Specifically: (1) Poor RM generalization → reward hacking; (2) Biased RM → policy inherits the bias; (3) Unreliable RM on OOD → iterative training also fails to improve it; (4) RM cannot evaluate quality dimensions beyond its training distribution → the policy cannot improve on those dimensions. All other techniques (KL, ensemble, iteration) are compensating for insufficient RM generalization.

Follow-up: Can scaling up the RM (increasing parameter count) systematically solve the generalization problem? Larger RMs do have better generalization (higher capacity, better feature representations), but cannot fully solve it — because the fundamental bottleneck is sometimes the noise and incompleteness of the preference data itself, not model capacity.


Q18: From an information-theoretic perspective, why is pairwise more efficient than pointwise?

A: Human preference judgments are fundamentally ordinal information rather than cardinal information. Pairwise methods directly leverage ordinal information (A > B) with minimal information loss; pointwise requires humans to map to an absolute scale (e.g., 1–5), introducing additional scale calibration noise. From an information-theoretic perspective, a portion of the information in each pointwise annotation is "wasted" on scale noise. Pairwise therefore has higher sample efficiency.

Follow-up: Is there a way to recover absolute scores from pairwise data? The latent scores of each response can be recovered by solving for the MLE of the BT model, but the absolute values only have relative meaning and require an additional anchor point to determine the scale.
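Recovering latent scores from pairwise outcomes can be sketched with the classic minorization-maximization update for BT strengths. This version assumes the comparison graph is strongly connected (every item both wins and loses along some chain); otherwise the MLE does not exist and the code will fail. The function name is hypothetical.

```python
import math

def fit_bradley_terry(n_items, comparisons, iterations=200):
    """Estimate BT scores from (winner, loser) index pairs.

    Iterates p_i <- wins_i / sum_j n_ij / (p_i + p_j) on strengths p_i and
    returns log-strengths shifted to have zero mean.
    """
    wins = [0] * n_items
    games = [[0] * n_items for _ in range(n_items)]
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner][loser] += 1
        games[loser][winner] += 1
    p = [1.0] * n_items
    for _ in range(iterations):
        new_p = []
        for i in range(n_items):
            denom = sum(games[i][j] / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            new_p.append(wins[i] / denom)
        total = sum(new_p)
        p = [x / total for x in new_p]      # fix the scale: strengths sum to 1
    scores = [math.log(x) for x in p]
    mean = sum(scores) / n_items
    return [s - mean for s in scores]       # zero-mean anchor
```

The zero-mean shift is the arbitrary anchor: only score differences are identified by the data. With two items where the first wins 3 of 4 comparisons, the recovered score gap is log 3, matching a win probability of 0.75.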


Q19: Compare DPO and RLHF in handling reward hacking — similarities and differences.

A: Similarities: Both rely on implicit/explicit reward signals in preference data; both are subject to the constraints of Goodhart's Law. Differences: (1) RLHF has an explicit RM that can be attacked and diagnosed; DPO implicitly encodes rewards, making it hard to inspect in isolation; (2) RLHF allows flexible adjustments via KL, ensembles, and iteration; DPO's $\beta$ is similar to a KL penalty but offers different control granularity; (3) DPO is trained on off-policy data, facing more severe distribution shift; (4) RLHF's PPO is inherently on-policy, which partially mitigates distribution shift.

Follow-up: Are there forms of reward hacking unique to RLHF that DPO would not encounter? Issues specific to RLHF include: the explicit RM's OOD scoring failures, and instability in PPO training leading to sudden policy changes. DPO does not encounter explicit RM OOD problems, but faces implicit reward distribution shift problems — different in form but similar in nature.


Q20a (L3): What practical guidance does the scaling law from Gao et al. 2022 offer for setting KL penalty strength?

A: The paper's core finding is that in their experimental setup, increasing the KL penalty coefficient $\beta$ does not improve the gold score–KL curve (the frontier); the effect is equivalent to early stopping on the same curve. This implies:

  1. $\beta$ cannot be tuned as a generalization measure: Increasing $\beta$ constrains the policy to deviate less from the reference model, reducing KL consumption, but does not make the RM more robust to the same KL shift; "being safer" is only because optimization stops earlier, not because the RM itself improved.

  2. The true role of KL penalty: Limiting how far the policy drifts from the reference model (which also stabilizes training), not fundamentally solving reward hacking. What is truly needed is improvement in RM generalization itself (more data, larger models, iterative re-training).

  3. Practical implication: One should not rely on increasing $\beta$ to "buy" more optimization headroom; if the gold score has already peaked, the RM should be retrained rather than continuing to strengthen KL constraints on the same RM.

Caution

Honesty note: The paper's authors explicitly note that this conclusion is "sensitive to hyperparameters" and is not guaranteed to hold in all settings.

Follow-up: BoN and RL have different functional forms (BoN is $d(\alpha - \beta d)$, RL is $d(\alpha - \beta \log d)$, where $d = \sqrt{D_{\text{KL}}}$ and $\alpha$, $\beta$ are fitted coefficients, not the KL penalty coefficient): what does this tell us? BoN's quadratic form means over-optimization deteriorates with acceleration (the decline is faster at larger $d$); RL's logarithmic form means deterioration is slower but sustained. Therefore, for the same KL "budget," BoN is more efficient but also collapses faster; one should not directly compare optimization quantity across methods using KL, as the two obey different over-optimization dynamics.
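The two functional forms can be compared numerically. The sketch below uses made-up coefficients (in the paper they are fitted separately per method and RM size) and the closed-form peaks obtained by setting each derivative to zero: $d^* = \alpha / (2\beta)$ for BoN and $d^* = e^{\alpha/\beta - 1}$ for RL.

```python
import math

def gold_bon(d, alpha, beta):
    # Best-of-N form: d * (alpha - beta * d), with d = sqrt(KL)
    return d * (alpha - beta * d)

def gold_rl(d, alpha, beta):
    # RL form: d * (alpha - beta * log d), defined for d > 0
    return d * (alpha - beta * math.log(d))

def peak_bon(alpha, beta):
    # d/dd [alpha*d - beta*d^2] = 0
    return alpha / (2.0 * beta)

def peak_rl(alpha, beta):
    # d/dd [alpha*d - beta*d*log d] = alpha - beta*log d - beta = 0
    return math.exp(alpha / beta - 1.0)

# Hypothetical coefficients, chosen only for illustration.
alpha, beta = 1.0, 0.2
d_bon, d_rl = peak_bon(alpha, beta), peak_rl(alpha, beta)
print(d_bon, gold_bon(d_bon, alpha, beta))
print(d_rl, gold_rl(d_rl, alpha, beta))
```

Past its peak the BoN curve falls with a slope that grows linearly in $d$, while the RL slope grows only logarithmically, which matches the "faster collapse" versus "slower but sustained" contrast above.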


Q20: If you were designing the next generation of RM, what do you think are the three most important improvement directions?

A: (Open-ended question; three key directions below)

  1. Stronger generalization and OOD robustness: Current RMs perform well in-distribution but collapse out-of-distribution. Better architectural design (e.g., uncertainty-aware RM), adversarial training, and broader training data coverage are needed.

  2. Multi-dimensional disentangled scoring: Decouple helpfulness, safety, factuality, style, and other dimensions into independent scoring heads, avoiding compression of information into a single scalar. This also allows different optimization weights for different dimensions.

  3. Self-improving RM: Give the RM self-calibration capability, so that during RLHF it continuously checks whether its own predictions are consistent with actual preferences and updates automatically. This reduces reliance on human annotation.

Follow-up: What potential conflicts exist between these three directions? Multi-dimensional scoring increases model complexity, which may affect generalization; self-improvement mechanisms may introduce systematic bias if unreliable; uncertainty estimation is computationally expensive in high-dimensional spaces. Trade-offs must be made in engineering practice.


Appendix: Key Terminology Glossary

| Term | Chinese | Brief Definition |
| --- | --- | --- |
| Reward Model (RM) | 奖励模型 | Scoring function learned from preference data |
| RLHF | 基于人类反馈的强化学习 | RL optimization using RM signals |
| Bradley-Terry Model | 布拉德利-特里模型 | Probabilistic model for pairwise comparison |
| ORM | 结果奖励模型 | Scores only the final output |
| PRM | 过程奖励模型 | Scores each reasoning step |
| Reward Hacking | 奖励欺骗 | Exploiting RM weaknesses for high scores |
| KL Divergence | KL 散度 | Measure of the difference between two distributions |
| Distribution Shift | 分布偏移 | Mismatch between training and evaluation data distributions |
| Best-of-N (BoN) | N选一 | Sample N responses and select the highest-scoring one |
| LLM-as-Judge | 大模型作评估者 | Using an LLM to replace human evaluation |
| Length Bias | 长度偏差 | RM preference for longer responses |
| Self-Preference Bias | 自我偏好偏差 | LLM preference for responses in its own style |
| Inter-annotator Agreement | 标注者间一致性 | Degree of consistency across different annotators |
| Trust Region | 信赖域 | Safe update region in optimization |
| Credit Assignment | 信用分配 | Attributing final outcomes to individual steps |
| Elo Rating | Elo 排名 | Dynamic scoring system based on win/loss outcomes |
| Plackett-Luce Model | Plackett-Luce 模型 | Probabilistic model for ranked data |
| DPO | 直接偏好优化 | Preference learning that bypasses an explicit RM |
| Conservative Estimation | 保守估计 | Strategy of taking the minimum score in an ensemble |

This cheatsheet is for study reference only. For specific numbers, refer to the original papers and official leaderboards.

Extended L3

Q: When designing RMs for "next-generation" foundation models (e.g., with stronger reasoning and planning capabilities), what fundamental challenges might existing evaluation paradigms (such as RewardBench) face?

A: The core challenge lies in the leap in complexity of the object being evaluated. Most existing benchmarks target relatively standard dialogue or simple reasoning tasks. For agents capable of long-horizon planning, tool use, or complex sub-task decomposition, the RM needs to evaluate not just the quality of a single response, but the overall effectiveness of an interaction trajectory and the soundness of long-term decisions. This requires evaluation paradigms to shift from "scoring static snippets" to "evaluating dynamic sequences," and calls for new metrics that can understand state transitions and world models — a fundamental expansion of the capability dimensions required of RMs.

Follow-up: How should RM training data be constructed under this new paradigm? Training data needs to shift from single-response "preference pairs" to "trajectory comparison" data, potentially involving multi-step interactions in simulated environments. Annotation will rely more heavily on automated verification (e.g., task success rate) and sandboxed testing in advanced simulators; the feasibility of pure human annotation will drop sharply.


Q: What are the fundamental difficulties in applying Process Reward Models (PRM) to open-ended generation tasks (e.g., creative writing)?

A: The fundamental difficulty lies in the ambiguity of "process" definition and the subjectivity of evaluation. In mathematical reasoning, "steps" have clear logical boundaries and objective correctness criteria. But in creative writing, the transitions between paragraphs, the construction of imagery, and the build-up of emotion have no objective standards; their quality is highly dependent on subjective taste and holistic context. As a result, collecting reliable step-by-step annotations for PRM is extremely difficult, and automated evaluation (e.g., MC estimation) also fails due to the lack of a clear "correct answer."

Follow-up: Should PRM be abandoned entirely for open-domain tasks, or are there compromises? A "coarse-grained PRM" or "hybrid reward" approach can be adopted. For example, divide the writing process into a small number of high-level "stages" (e.g., ideation, development, conclusion), or combine ORM's holistic reward with process rewards at key turning points (e.g., plot climaxes), to balance evaluation granularity with feasibility.


Q: From a game-theoretic perspective, can reward hacking be viewed as a "red team–blue team" dynamic game between the policy model and the reward model? What does this imply for designing mitigations?

A: Yes, this is essentially a non-cooperative game. The policy model acts as the attacker (red team): it discovers and exploits vulnerabilities in the decision boundary of the RM, which acts as the defender (blue team), in order to maximize reward. Traditional mitigation strategies (such as fixed KL penalties) are "static defenses," whereas the game-theoretic perspective suggests adopting dynamic, adaptive adversarial strategies. For example, a dedicated "red team" model can be trained whose objective is not to score responses but to actively find patterns that let the current policy obtain inflated rewards; the discovered vulnerabilities are then used to update (harden) the primary RM.

Follow-up: What stability and cost challenges might this adaptive adversarial approach face in practice? The main challenge is that training dynamics may be unstable, with both parties entering an "arms race" that causes the optimization objective to drift continuously. Maintaining multiple adversarial models also introduces significant compute and coordination overhead.


Q: How does RM calibration affect the RLHF optimization process? What specific risks does an "accurate but uncalibrated" RM pose?

A: Calibration refers to whether an RM's output scores truthfully reflect the absolute probability or expected value of response quality. An "accurate but uncalibrated" RM may rank correctly while its absolute score values or distribution have systematic bias. In RLHF, this can lead to misjudgment of optimization intensity: for example, uniformly inflated RM scores may suggest the policy is already good and cause training to stop too early; or a narrow scoring range for small quality differences may result in insufficient policy update momentum (weak gradient signal). The balance between the reward term and the KL penalty depends on the scale of the rewards, so an uncalibrated RM can distort this trade-off.

Follow-up: In engineering practice, what methods can diagnose and improve RM calibration? Calibration curves can be plotted: segment RM prediction scores and compare the true quality (human annotation or task success rate) within each segment. Improvement methods include adding calibration regularization terms to the RM training loss, or post-processing RM outputs at inference time (e.g., temperature scaling).
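A minimal sketch of both steps for a pairwise RM, assuming the RM outputs a score margin for each pair and human labels say whether the first response was preferred. The toy data, bin count, and temperature grid are illustrative assumptions; temperature is fit by a simple grid search on negative log-likelihood.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_bins(margins, labels, n_bins=5, temperature=1.0):
    """Per non-empty bin: (mean predicted prob, empirical preference rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for m, y in zip(margins, labels):
        p = sigmoid(m / temperature)
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    return [(sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b), len(b))
            for b in bins if b]

def ece(margins, labels, n_bins=5, temperature=1.0):
    # Expected calibration error: count-weighted gap between prediction and outcome
    n = len(margins)
    return sum(c / n * abs(p - acc)
               for p, acc, c in calibration_bins(margins, labels, n_bins, temperature))

def nll(margins, labels, temperature):
    total = 0.0
    for m, y in zip(margins, labels):
        p = sigmoid(m / temperature)
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(margins)

def fit_temperature(margins, labels, grid):
    return min(grid, key=lambda t: nll(margins, labels, t))

# Toy overconfident RM: margins of +/-4 (about 98% confidence), but humans agree only 75% of the time.
margins = [4.0] * 8 + [-4.0] * 8
labels = [1] * 6 + [0] * 2 + [0] * 6 + [1] * 2
t_fit = fit_temperature(margins, labels, [0.5 * k for k in range(1, 21)])
ece_before = ece(margins, labels)
ece_after = ece(margins, labels, temperature=t_fit)
print(t_fit, ece_before, ece_after)
```

Dividing margins by a positive temperature never changes which response ranks higher, so this kind of post-processing repairs calibration without touching ranking accuracy.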


Q: When integrating safety as a hard constraint (rather than an optimization objective) into the RM framework, what is the theoretically most rigorous formalization?

A: The most rigorous approach is to model the problem as a constrained optimization problem, rather than a simple multi-objective weighted sum. Specifically, the optimization objective remains maximizing the RM score on the primary quality dimension (e.g., helpfulness), subject to a set of safety constraints (e.g., RM_safety(x, y) > τ). Theoretically, this can be solved via Lagrange multipliers or projected gradient methods, ensuring the policy optimizes within the safety-feasible set. Compared with mixing safety and helpfulness into a single score, this makes it much harder for the optimizer to trade safety away for helpfulness.

Follow-up: What is the main difficulty in setting an explicit threshold τ for safety constraints (e.g., "harmful probability < 0.1%") in practice? The core difficulty is the fuzziness and context-dependence of safety boundaries. A response that is "safe" in most contexts may be unsafe in specific sensitive contexts. Therefore, a globally fixed τ is unrealistic; context-dependent dynamic adjustment is needed, which places higher demands on both the RM and the constraint system.
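The Lagrangian formulation can be illustrated on a deliberately tiny toy problem. In the sketch below the "policy" is a single number `w` in [0, 1], and `helpfulness` and `safety` are made-up closed-form stand-ins for expected RM scores; in a real pipeline these would be estimated from sampled responses. The loop alternates a primal ascent step on the Lagrangian with a dual update of the multiplier.

```python
def helpfulness(w):
    # Toy concave objective, increasing on [0, 1] (hypothetical stand-in for an RM score)
    return w - 0.5 * w * w

def safety(w):
    # Toy safety score that decreases as the policy gets bolder (hypothetical)
    return 0.95 - 0.45 * w

def solve_constrained(tau=0.8, lr=0.1, steps=2000):
    w, lam = 0.5, 0.0
    for _ in range(steps):
        # Primal ascent on L(w, lam) = helpfulness(w) + lam * (safety(w) - tau)
        grad_w = (1.0 - w) + lam * (-0.45)
        w = min(1.0, max(0.0, w + lr * grad_w))
        # Dual step: raise lam while the constraint is violated, never below zero
        lam = max(0.0, lam - lr * (safety(w) - tau))
    return w, lam

w_star, lam_star = solve_constrained()
print(w_star, lam_star, safety(w_star))
```

Unlike a fixed weighted sum, the multiplier `lam` is learned: it keeps growing until the safety constraint holds, so the effective penalty adapts to how hard the optimizer pushes against the boundary.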


Q: Beyond ensembles and uncertainty filtering, what more active roles can RM uncertainty estimation play during RLHF training?

A: Uncertainty estimation can actively guide training data collection and exploration strategies, enabling active learning. For example, within the training loop, priority can be given to sampling and human annotation in prompt-response regions where RM uncertainty is high, to improve the RM most efficiently. Furthermore, during policy optimization, the agent can be encouraged to actively explore high-uncertainty regions (i.e., the RM's knowledge boundary) to discover potentially new high-quality strategies; this is analogous to the exploration-exploitation trade-off in Bayesian optimization.

Follow-up: How can uncertainty-based exploration be specifically implemented in on-policy algorithms such as PPO in RLHF? RM uncertainty can be incorporated as part of the exploration reward, encouraging the policy to generate responses that the RM finds "surprising" or "uncertain." Specifically, add an exploration term positively correlated with uncertainty to the final reward, but carefully balance it to avoid generating meaningless random outputs.
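One simple way to realize such an exploration term, assuming uncertainty is approximated by disagreement across an RM ensemble: add a capped bonus proportional to the ensemble's standard deviation. The function name `shaped_reward` and the parameters `kappa` and `bonus_cap` are illustrative choices, not an established API.

```python
import statistics

def shaped_reward(ensemble_scores, kappa=0.5, bonus_cap=1.0):
    """Mean ensemble reward plus a capped uncertainty bonus.

    ensemble_scores: scores from several RMs for the same (prompt, response).
    """
    mean = statistics.fmean(ensemble_scores)
    std = statistics.pstdev(ensemble_scores)   # ensemble disagreement as an uncertainty proxy
    bonus = min(kappa * std, bonus_cap)        # cap so noise cannot dominate the reward
    return mean + bonus

agree = shaped_reward([1.0, 1.0, 1.0])       # no disagreement, no bonus
disagree = shaped_reward([0.0, 1.0, 2.0])    # same mean, positive bonus
capped = shaped_reward([-10.0, 10.0])        # large disagreement, bonus hits the cap
print(agree, disagree, capped)
```

Flipping the sign of the bonus (mean minus `kappa * std`) turns the same computation into a pessimistic, conservative estimate, so exploration and reward-hacking mitigation pull in opposite directions and the coefficient needs careful tuning.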

§A Key Papers Timeline