Cheatsheet

Evaluation & LLM-as-Judge

In post-training, evaluation is often the real bottleneck: training can run, but whether the model actually improved, or where it regressed, can only be established by evaluation. This page covers how to evaluate an aligned model and the pitfalls of the main evaluation approaches. ⚠️ No concrete scores are listed here (they go stale quickly and are easy to misremember); for specific numbers, refer to official benchmark/leaderboard sources.

0. TL;DR

1. Three Families of Evaluation

| Type | Measures | Examples |
| --- | --- | --- |
| Capability benchmarks (automatic, ground-truth answers) | Knowledge / reasoning / code | MMLU, GSM8K, MATH, HumanEval/MBPP, BBH, IFEval (instruction following) |
| Preference / dialogue evaluation (judge scoring) | Subjective quality of responses | AlpacaEval (LLM-judge win-rate), MT-Bench (multi-turn, judge scores), Chatbot Arena (human pairwise → Elo) |
| Reward model evaluation | Whether the RM aligns with human preferences | RewardBench (chat / safety / reasoning categories, etc.), agreement rate with human annotations |

1.1 Benchmark cheat-sheet

Caution

The tables below list only mechanism and structure (signal source, scoring method, pitfalls), not concrete scores.

Preference / dialogue evals (judge or human):

| Eval | Signal source | Scoring | Most prominent bias | Debias / control |
| --- | --- | --- | --- | --- |
| AlpacaEval 2.0 | Single LLM-judge (model vs. reference answer) | Win-rate (vs. the reference) | Verbosity bias | Length-controlled win-rate (regress out length) |
| MT-Bench | LLM-judge | Multi-turn, 1–10 scalar score or pairwise | Position / verbosity / self-preference | Average the two orderings in pairwise mode |
| Chatbot Arena | Human blind pairwise battles | Bradley-Terry / Elo → ranking + confidence intervals | User distribution / style preference, crowd noise | Massive vote volume + fresh dynamic questions resist contamination |

Capability benchmarks (ground-truth, automatic):

| Benchmark | Measures | Format | Scoring | Known pitfalls |
| --- | --- | --- | --- | --- |
| MMLU | 57-subject multiple-choice knowledge | 4-way multiple choice | Option accuracy | Option / letter-order bias; heavily contaminated |
| MATH | Competition math (7 subjects / 5 difficulty levels) | Free-form solutions | Final-answer match (verifiable) | Brittle answer parsing; partly contaminated |
| HumanEval | Python code generation | Function completion + unit tests | pass@k (unit-test pass rate) | Only 164 problems, high variance, easy to overfit |
| IFEval | Verifiable instruction following | Constrained instructions (length / format) | Programmatic verification (no judge needed) | Covers only machine-checkable constraints |

1.1a Unbiased pass@k

Naive estimator $1-(1-\hat p)^k$ (with $\hat p=c/n$): for $k\ge2$ it systematically underestimates the true pass@k at any finite $n$ (Jensen's inequality: $(1-\hat p)^k$ is convex in $\hat p$, so $\mathbb{E}[(1-\hat p)^k]\ge(1-p)^k$); the bias shrinks as $n$ grows but is never zero; for $k=1$, $c/n$ is already unbiased.

Unbiased estimator (Chen et al., arXiv:2107.03374 §3): sample $n\ge k$ candidates per problem, of which $c$ pass the unit tests, then

$$\widehat{\text{pass@}k}=1-\frac{\binom{n-c}{k}}{\binom{n}{k}}$$

Intuition: $\binom{n-c}{k}/\binom{n}{k}$ is the probability that $k$ randomly drawn candidates are all wrong; $1$ minus it is "at least one correct". Numerically stable implementation (avoids large-binomial overflow):

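import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples of which c passed (requires n >= k)."""
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a pass
    # C(n-c, k) / C(n, k) written as a running product to avoid huge binomials
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

Note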

Standard protocol (Chen et al.): $n=200$ per problem, report $k\in\{1,10,100\}$.
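
As a sanity check on the two estimators above, the following minimal sketch computes their exact expectations over $c\sim\text{Binomial}(n,p)$ instead of relying on simulation. The helper names (`pass_at_k_exact`, `pass_at_k_naive`, `expected_value`) and the values of `n`, `p`, `k` are illustrative assumptions, not part of any benchmark harness.

```python
from math import comb

def pass_at_k_exact(n, c, k):
    """Unbiased estimator, written with exact integer binomials."""
    return 1.0 - comb(n - c, k) / comb(n, k)   # comb() is 0 when n - c < k

def pass_at_k_naive(n, c, k):
    """Plug-in estimator 1 - (1 - c/n)^k."""
    return 1.0 - (1.0 - c / n) ** k

def expected_value(estimator, n, p, k):
    """Exact expectation of an estimator when c ~ Binomial(n, p)."""
    return sum(
        comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
        for c in range(n + 1)
    )

n, p, k = 20, 0.1, 10                      # illustrative values only
true_value = 1.0 - (1.0 - p) ** k          # true pass@k for per-sample pass rate p
unbiased_mean = expected_value(pass_at_k_exact, n, p, k)
naive_mean = expected_value(pass_at_k_naive, n, p, k)
print(true_value, unbiased_mean, naive_mean)
```

The unbiased estimator's expectation matches the true value up to floating-point error, while the naive estimator's expectation falls below it, as the Jensen argument predicts.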

1.2 Bradley-Terry / Elo: pairwise wins → ranking

Arena turns a massive pile of "A vs B, which is better?" pairwise votes into a total-order ranking via the Bradley-Terry (BT) model (Elo is its online approximation).

BT model: give each item $i$ a latent score $s_i$; then

$$P(i \succ j) = \sigma(s_i - s_j) = \frac{1}{1+e^{-(s_i-s_j)}}$$

Only score differences are identifiable (a global shift leaves probabilities unchanged), so fix an anchor (e.g. $s_0=0$). Fitting = logistic-regression MLE: the log-likelihood $\sum_{i,j} w_{ij}\log\sigma(s_i-s_j)$ ($w_{ij}$ = number of times $i$ beats $j$) is concave (equivalently, the negative log-likelihood is a convex loss), so this is a convex optimization with a global optimum; given the anchor, the solution is finite and unique when the win graph is strongly connected (every item is linked to every other by a chain of wins in both directions).

Core assumptions (and where they break):

  1. Unidimensional strength: quality is summarized by a single scalar $s_i$ → items are totally orderable;
  2. (Stochastic) transitivity: if A tends to beat B and B tends to beat C, then A tends to beat C, i.e. no rock-paper-scissors cyclic preferences ($A\succ B\succ C\succ A$);
  3. Independent comparisons: matches are mutually independent, no order/learning effects;
  4. Basic BT does not model ties (Rao-Kupper and other extensions handle ties).

→ When preferences are genuinely non-transitive cycles (each wins on a different dimension), a single scalar Elo flattens the cycle into one order, and the ranking's "objectivity" is overstated.

Elo = online approximation of BT: each match does one fixed-step update on the prediction error, viewable as an online gradient update on the BT logistic loss:

# Elo: the online/streaming version of Bradley-Terry, one fixed-step gradient update per match
def elo_update(r_a, r_b, score_a, K=32, scale=400):
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))  # BT/logistic predicted P(A wins)
    r_a = r_a + K * (score_a - e_a)                  # score_a in {1, 0.5, 0}
    r_b = r_b + K * ((1 - score_a) - (1 - e_a))
    return r_a, r_b

Note

In the batch setting, fitting BT by MLE (logistic regression) over all pairwise results is more stable than per-match Elo and yields confidence intervals (Arena uses BT + bootstrap for intervals).
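
To make the batch fit concrete, here is a minimal sketch of BT maximum likelihood by plain gradient ascent on the log-likelihood above, with item 0 as the anchor. The function name `fit_bradley_terry`, the step size, the iteration count, and the win counts are all illustrative assumptions; this is not the code Arena runs, and it omits ties and bootstrap intervals.

```python
import numpy as np

def fit_bradley_terry(wins, lr=0.5, steps=5000):
    """wins[i, j] = number of times item i beat item j; returns scores with s[0] = 0."""
    wins = np.asarray(wins, dtype=float)
    scores = np.zeros(wins.shape[0])
    total = wins.sum()
    for _ in range(steps):
        # p[i, j] = sigma(s_i - s_j), the modelled probability that i beats j
        p = 1.0 / (1.0 + np.exp(-(scores[:, None] - scores[None, :])))
        # d/ds_i of sum_ij w_ij log sigma(s_i - s_j)
        grad = (wins * (1.0 - p)).sum(axis=1) - (wins.T * p).sum(axis=1)
        scores += lr * grad / total          # normalise by match count for a stable step
        scores -= scores[0]                  # re-apply the anchor s_0 = 0
    return scores

two_items = fit_bradley_terry([[0, 30], [10, 0]])
three_items = fit_bradley_terry([[0, 30, 40], [10, 0, 30], [5, 10, 0]])
print(two_items, three_items)
```

With two items the optimum has a closed form, $s_0-s_1=\log(w_{01}/w_{10})$, which gives a quick correctness check for the fit.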

2. LLM-as-Judge: How to Use It + Biases

📎 Cross-reference: This section focuses on LLM-as-Judge from the perspective of evaluation practice (how to select a judge, operational details, benchmark applications). For how LLM-as-Judge biases affect RM training and reward hacking when used as an RLHF training signal, see cheatsheet-reward-modeling-eval-en.html §5.2.

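Use a strong model as a judge to score responses or do pairwise comparisons. This is cheaper and faster than human annotation, but it has systematic biases (position, verbosity, and self-preference; see the MT-Bench row in §1.1).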

Formalizing position debiasing: let the judge's preference on an ordered pair be $J(a,b)\in\{A,B\}$. With no position bias, $J(a,b)$ and $J(b,a)$ should pick the same answer (consistent). Debiased rule: count a win only if both orders pick the same answer, else call it a tie, which makes position bias show up as an explicit tie rather than leak into the win-rate. Note: AlpacaEval uses a single fixed ordering by default (model response first); MT-Bench pairwise mode uses swap augmentation (average the two orderings). Not all tools apply order averaging.

2.1 Position-debiased judge harness (code)

# LLM-as-judge: pairwise comparison + position debiasing (order-swap) harness.
# In practice judge() calls a strong model and parses its verdict; here a stub shows the protocol.

def judge_debiased(question, ans1, ans2, judge):
    """Judge each ordering once to cancel position bias.
    judge(q, A, B) returns 'A' or 'B' (which position's answer is better).
    Returns 'ans1' / 'ans2' / 'tie' (orders disagree -> tie)."""
    v1 = judge(question, ans1, ans2)            # order (A=ans1, B=ans2)
    v2 = judge(question, ans2, ans1)            # swapped (A=ans2, B=ans1)
    pick1 = 'ans1' if v1 == 'A' else 'ans2'     # answer actually chosen, call 1
    pick2 = 'ans2' if v2 == 'A' else 'ans1'     # in the swapped call A=ans2, so v2=='A' means ans2 was picked
    return pick1 if pick1 == pick2 else 'tie'   # count only if consistent

def win_rate(questions, model_answers, ref_answers, judge):
    """Debiased win-rate of model vs ref; tie counts 0.5."""
    s = 0.0
    for q, m, r in zip(questions, model_answers, ref_answers):
        out = judge_debiased(q, m, r, judge)    # ans1=model, ans2=ref
        s += 1.0 if out == 'ans1' else (0.5 if out == 'tie' else 0.0)
    return s / len(questions)

# --- Demo: an extreme "always pick the first" position-biased judge is exposed as a tie ---
def biased_judge(q, a, b):
    return 'A'                                  # extreme position bias
print(judge_debiased("q", "model", "ref", biased_judge))   # -> 'tie'
print("win_rate:", win_rate(["q"], ["model"], ["ref"], biased_judge))  # -> 0.5
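
Before trusting a debiased win-rate, it can help to measure how often the judge is position-consistent at all. The sketch below is a hypothetical diagnostic in the same spirit as the harness above; `position_consistency` and both stub judges are assumptions made for illustration, and a real judge would call a model and parse its verdict.

```python
def position_consistency(questions, answers_1, answers_2, judge):
    """Fraction of pairs on which the judge picks the same answer in both orders."""
    consistent = 0
    for q, a1, a2 in zip(questions, answers_1, answers_2):
        first = judge(q, a1, a2)     # a1 shown in position A
        second = judge(q, a2, a1)    # a1 shown in position B
        # Picking the same answer twice means the position labels differ.
        consistent += (first != second)
    return consistent / len(questions)

def always_first_judge(q, a, b):
    return 'A'                                   # pure position bias

def longer_answer_judge(q, a, b):
    return 'A' if len(a) > len(b) else 'B'       # verbosity-biased but order-independent

qs = ["q1", "q2"]
longs = ["a long detailed answer", "another long answer"]
shorts = ["short", "brief"]
print(position_consistency(qs, longs, shorts, always_first_judge))   # -> 0.0
print(position_consistency(qs, longs, shorts, longer_answer_judge))  # -> 1.0
```

The second stub shows the limit of this diagnostic: a judge can be fully position-consistent and still be biased in another way (here, toward length).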

2.2 Measuring judge–human agreement

"Calibration / agreement with humans" recurs throughout (the §1 RM row, the mitigations above, Q31), but raw agreement alone overstates "the real agreement after removing chance": when one label dominates, two annotators agree by pure guessing with high probability. Cohen's κ removes this chance agreement:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

$\kappa=1$ is perfect, $\kappa=0$ means agreement is only at the level of "guessing independently from each annotator's own marginals", and $\kappa<0$ is worse than guessing. Intuition: if both annotators label independently with 90/10 marginals, raw agreement is about $0.9^2+0.1^2=0.82$, which looks high, yet $\kappa\approx0$. So report κ, not just raw agreement, for judge–human consistency.

import numpy as np
def cohens_kappa(labels_a, labels_b):
    cats = sorted(set(labels_a) | set(labels_b))
    idx = {c: i for i, c in enumerate(cats)}
    n, K = len(labels_a), len(cats)
    conf = np.zeros((K, K))
    for a, b in zip(labels_a, labels_b):
        conf[idx[a], idx[b]] += 1
    p_o = np.trace(conf) / n                            # observed agreement
    p_e = (conf.sum(0) * conf.sum(1)).sum() / n ** 2    # chance agreement
    return (p_o - p_e) / (1 - p_e)

More than two annotators (nominal categories): use Fleiss' κ; ordinal scores: use weighted κ or Krippendorff's α (supports ordinal distances and missing data).
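
The 90/10 intuition above can be checked on a small constructed example. In this sketch the 100 label pairs are built by hand so that the two annotators are exactly independent with 90/10 marginals; the helper name `agreement_and_kappa` and the data are illustrative assumptions.

```python
from collections import Counter

def agreement_and_kappa(labels_a, labels_b):
    """Return (raw agreement p_o, Cohen's kappa) for two annotators."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # chance agreement from each annotator's own marginals
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    return p_o, (p_o - p_e) / (1 - p_e)

# Annotator A: 90 x 'good', 10 x 'bad'.
labels_a = ['good'] * 90 + ['bad'] * 10
# Annotator B is independent of A with the same 90/10 split:
# within A's 90 'good' items B says 81 good / 9 bad; within A's 10 'bad' items, 9 good / 1 bad.
labels_b = ['good'] * 81 + ['bad'] * 9 + ['good'] * 9 + ['bad'] * 1

p_o, kappa = agreement_and_kappa(labels_a, labels_b)
print(p_o, kappa)
```

Raw agreement comes out at 0.82 while κ is zero, which is exactly the gap the formula is designed to expose.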

2.3 Length-controlled win-rate: the regression form

The §1 table lists AlpacaEval 2.0's debiasing as "regress out length" — concretely (Dubois et al., arXiv:2404.04475) you fit a generalized linear model predicting the judge's preference, with the length difference as an explicit term (simplified form):

$$\text{logit}\,P(\text{model}\succ\text{ref}) = \theta_m + \gamma\cdot\Delta_{\text{len}} + \dots$$

where $\Delta_{\text{len}}$ is the two answers' length difference. To report, set the length term $\Delta_{\text{len}}=0$ and take the expectation over the instruction distribution, giving the length-controlled win-rate: intuitively, "the win-rate the model should have if both answers were the same length", which removes the length association the GLM models.

⚠️ Note: it removes the length association the model fits, so genuine quality signal correlated with length may be removed along with it, and unmodeled style effects are not removed; empirically it markedly improves the Spearman correlation with Chatbot Arena (human Elo) rankings.
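
A toy version of the regression idea, assuming only the two terms written out above (an intercept $\theta_m$ and a length slope $\gamma$) and synthetic data in which the judge's preference is driven purely by length. The function names, the standardized length difference, and the simulated judge are assumptions for illustration; the actual AlpacaEval GLM has additional terms and a different length feature.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_length_glm(delta_len, model_won, lr=0.5, steps=3000):
    """Fit logit P(model wins) = theta + gamma * delta_len by gradient descent."""
    x = np.asarray(delta_len, dtype=float)
    y = np.asarray(model_won, dtype=float)
    theta, gamma = 0.0, 0.0
    for _ in range(steps):
        err = sigmoid(theta + gamma * x) - y   # gradient of the mean logistic loss
        theta -= lr * err.mean()
        gamma -= lr * (err * x).mean()
    return theta, gamma

rng = np.random.default_rng(0)
# Synthetic setup: the model is usually longer, and the judge only rewards length.
delta_len = rng.normal(1.0, 1.0, size=4000)              # standardized length difference
model_won = rng.random(4000) < sigmoid(0.0 + 1.0 * delta_len)

theta_hat, gamma_hat = fit_length_glm(delta_len, model_won)
raw_win_rate = model_won.mean()
lc_win_rate = sigmoid(theta_hat)     # prediction with the length term set to zero
print(raw_win_rate, lc_win_rate, gamma_hat)
```

In this two-term toy model the prediction at $\Delta_{\text{len}}=0$ is the same for every instruction, so no averaging over instructions is needed; the raw win-rate is inflated by length while the length-controlled one sits near one half.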

3. Data Contamination

The training set contains test-set examples → inflated scores that do not reflect generalization.

3.1 Types of Contamination

| Type | Description |
| --- | --- |
| Direct overlap | Training data contains the exact questions or answers from the evaluation set |
| Temporal leakage | Training data cutoff is later than the evaluation set creation date; the model has "seen" test-period content |
| Near-overlap | Paraphrased or translated versions of test questions appear in the training set, undetected by string matching |
| Membership inference contamination | Training set does not contain the exact questions, but contains highly related in-distribution samples, causing score inflation |
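
The difference between the first and third rows can be illustrated with a word-level n-gram overlap check, one common way to look for direct overlap. This is a minimal sketch: the function names, the whitespace tokenizer, and the value of `n` are assumptions chosen for the toy data, not a recommended production setting.

```python
def ngrams(text, n):
    """Set of word-level n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_overlaps(test_items, training_docs, n=5):
    """Return the test items that share at least one n-gram with the training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in test_items if ngrams(item, n) & train_grams]

training_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
test_items = [
    "Question: the quick brown fox jumps over the lazy dog",   # verbatim span
    "a fast auburn fox leaps above a sleepy hound",            # paraphrase
]
flagged = flag_overlaps(test_items, training_docs)
print(len(flagged) / len(test_items), flagged)
```

Only the verbatim item is flagged; the paraphrase passes untouched, which is the near-overlap failure mode described in the table.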

3.2 Detection Methods

3.3 n-gram Deduplication

3.4 Contamination-Resistant Benchmark Design

4. Pitfalls in Preference Evaluation

5. A Practical Evaluation Protocol (Post-Training)

  1. Capability: GSM8K/MATH (math), HumanEval (code), MMLU (knowledge), IFEval (instruction following).
  2. Alignment quality: AlpacaEval / MT-Bench (judge, with position debiasing); Chatbot Arena when necessary.
  3. Safety / refusal: harmful-prompt refusal rate, over-refusal rate.
  4. Regression: compare against baselines to confirm no dimensions degraded (alignment tax).
  5. Contamination audit + multiple seeds/prompts to report variance.
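
For steps 4 and 5, one simple way to attach uncertainty to a baseline comparison is a paired bootstrap over benchmark items. The sketch below assumes per-item 0/1 correctness arrays for the baseline and the new model on the same items; the function name `paired_bootstrap_diff`, the 95% interval, and the toy data are illustrative assumptions, and seed or prompt variance would need to be handled on top of this.

```python
import numpy as np

def paired_bootstrap_diff(base_correct, new_correct, n_boot=2000, seed=0):
    """Accuracy difference (new - base) with a 95% paired-bootstrap interval."""
    base = np.asarray(base_correct, dtype=float)
    new = np.asarray(new_correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(base)
    # Resample item indices with replacement, using the same items for both models.
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = new[idx].mean(axis=1) - base[idx].mean(axis=1)
    low, high = np.percentile(diffs, [2.5, 97.5])
    return new.mean() - base.mean(), low, high

base_correct = [0] * 50 + [1] * 50       # toy baseline: 50% accuracy
new_correct = [1] * 100                  # toy new model: 100% accuracy
diff, low, high = paired_bootstrap_diff(base_correct, new_correct)
print(diff, low, high)
```

If the interval for a dimension lies entirely below zero, that dimension is a regression candidate worth investigating; an interval straddling zero means the benchmark cannot distinguish the two models at this sample size.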

Stratified Follow-ups

L1 Basics

  1. Why is evaluation considered the bottleneck of post-training? What does each of capability benchmarks and preference evaluation measure?
  2. What is LLM-as-judge? What known biases does it have?
  3. What is data contamination? Why does it make scores unreliable?
  4. What is the fundamental difference between capability benchmarks and preference evaluation? What is each suited to measure?
  5. How is AlpacaEval's win-rate computed? Against what reference is the rate measured?
  6. How do MT-Bench and Chatbot Arena differ in their evaluation signal (who does the scoring)?
  7. What does pass@k mean? Why do code evals often use it instead of single-pass accuracy?
  8. What is alignment tax? Why specifically check for regressions after post-training?
  9. Why fix the prompt template and report variance over multiple seeds?
  10. How do IFEval and MMLU fundamentally differ in how they score (programmatic verification vs. option accuracy)?

L2 Intermediate

  1. How is position bias mitigated? Why does "evaluate both orderings and average" work?
  2. How do AlpacaEval / MT-Bench / Chatbot Arena differ in their evaluation signals (automatic judge vs. human Elo)?
  3. How do you detect whether training data has contaminated a given evaluation set?
  4. Why is verbosity bias so stubborn? How does length-controlled win-rate factor out the length effect?
  5. The order-swap rule calls a tie when the two orderings disagree — compared to majority voting, what robustness advantage does this have?
  6. Chatbot Arena uses Bradley-Terry / Elo to turn pairwise outcomes into rankings — what is the core assumption of this model?
  7. List several contamination-detection methods (n-gram / Min-k% / canary / paraphrase-drop) — what does each actually catch?
  8. Why is "dedup ≠ decontamination"?
  9. Why is self-preference bias especially dangerous when using a homologous model as the judge?
  10. Why does a high judge win-rate not equal stronger human preference? Where do the two systematically diverge?

L3 Deep Dive

  1. How does Goodhart's Law manifest in leaderboard gaming? How do you design "gaming-resistant" evaluations?
  2. How is a reward model evaluated (the RewardBench approach)? What is the relationship between RM evaluation and final policy performance?
  3. For reasoning models, why should evaluation shift from "single-pass accuracy" to "accuracy under a compute budget"? What does this demand of the evaluation protocol?
  4. If online metrics (user retention) conflict with offline evaluation (judge win-rate), which do you trust, and how do you investigate the discrepancy?
  5. When summarizing a model with a single scalar score, how can it mask Pareto-style regressions like "capability ↑ but safety ↓"? How should evaluation design expose this?

Extended L3

Q26. When using LLM-as-judge to evaluate multi-turn, long-context conversations, what are the common difficulties? What methodological improvements exist?
The core challenge in evaluating multi-turn conversations is that the judge model tends to exhibit **context forgetting** or **local bias** — focusing only on the quality of the most recent one or two turns while ignoring overall conversational coherence and task completion. A key improvement is designing **process-oriented rubrics** that explicitly require evaluating each turn's contribution to the final goal, and introducing a **segment summarization** mechanism that forces the judge to summarize before scoring, thereby partially mitigating its short-sightedness.
**Follow-up**: Beyond improving the rubric, can the evaluation protocol itself be changed to reduce judging difficulty — for example, decomposing it into a series of simpler subtask evaluations?
Q27. What are the main limitations of LLM-as-judge in assessing factual accuracy and logical soundness? How can they be mitigated?
The main limitations are that the judge model's own **knowledge boundary** and **reasoning flaws** can lead to incorrect verdicts. It may fail to detect factual errors, or mistakenly accept an answer with logical gaps as sound. Mitigation typically involves a **hybrid evaluation** approach: for fact-checking, combine **retrieval-augmented verification** — retrieve authoritative information first, then compare; for logical evaluation, attempt to use **formal verification** tools or design **step-by-step verification prompts** specifically targeting reasoning chains.
**Follow-up**: Given limited resources, which judge capabilities (breadth of knowledge, reasoning ability, tool use) should be prioritized for improvement to most effectively increase evaluation accuracy?
Q28. How do you evaluate a model's "emergent capabilities"? How does this fundamentally differ from evaluating conventional capabilities?
The key difference in evaluating emergent capabilities lies in their **unpredictability** and **non-smoothness**. Conventional capabilities typically improve predictably on a benchmark as model scale or training data increases. Emergent capabilities, by contrast, appear suddenly past some threshold and are often not directly reflected in standard benchmarks. As a result, evaluation methods must shift from **fixed test sets** to **open-ended, programmatically generated probe tasks**, and must focus on detecting **behavioral pattern shifts** when the model faces entirely novel, complex task combinations.
**Follow-up**: Can one design an evaluation framework that not only discovers emergent capabilities but also, to some degree, predicts the conditions under which they will appear?
Q29. Why does calibrating the confidence of LLM-as-judge outputs matter? How is it achieved in practice?
Calibrating judge output confidence gives its scores or comparison results **interpretable probabilistic meaning**. For example, when a judge says "90% confident that A is better than B," that number should, in the long run, approximate the true frequency with which A is actually preferred over B. In practice, calibration requires a **human-annotated calibration set**. By repeatedly evaluating the judge on this set, one can analyze the distribution of its scoring deviations from human consensus, then apply **post-hoc calibration algorithms** (such as Platt Scaling or Isotonic Regression) to adjust raw scores so they better match the statistical patterns of human judgment.
**Follow-up**: If the human-annotated data used for calibration is itself low-quality or very small in scale, what effect does this have on the calibrated judge? What are the alternatives?
Q30. When evaluating the safety of conversational models, why is "over-refusal" an important metric? How do you analyze the trade-off it forms with the harmful-prompt refusal rate?
Over-refusal measures the degree to which a model incorrectly refuses **benign or borderline queries**, directly affecting user experience and model **utility**. A model with a very high over-refusal rate may be safe but becomes effectively useless. Analyzing this trade-off cannot simply pursue Pareto optimality across both metrics; instead, **risk tiering** should be introduced. Categorize harmful prompts by severity and set different refusal strictness thresholds for each tier. Evaluation should separately report refusal rates for each tier and use **cost-sensitive analysis** to assess the model's balance between overall risk exposure and user experience loss.
**Follow-up**: How do you construct a high-quality adversarial safety evaluation set that can automatically generate prompts spanning various risk tiers and "gray area" cases?
Q31. How do you conduct "meta-evaluation" — that is, how do you judge whether an evaluation benchmark or an LLM-as-judge is itself valid and reliable?
Meta-evaluation of a benchmark primarily examines its **discriminability** (can it effectively distinguish models at different capability levels), **robustness** (is it sensitive to minor prompt changes), and **ecological validity** (does the capability it measures relate to real-world needs). Meta-evaluation of an LLM-as-judge focuses on its **agreement** with human judgments (e.g., Cohen's Kappa) and its **fairness** across different subgroups. A key method is **cross-validation**: have multiple distinct, high-quality judges (or humans) evaluate the same set of data, and check whether the target judge or benchmark agrees with the consensus.
**Follow-up**: After discovering that a widely used benchmark likely has serious biases or is outdated, what responsibilities and feasible actions does a researcher have to promote its iteration or warn the community?
Q32. In domain-specific settings (e.g., medical, legal), what unique challenges arise for general-purpose LLM-as-judge evaluation? What are the key steps in building a domain-expert evaluation pipeline?
The core challenges are the **domain knowledge barrier** and the **specialized evaluation criteria** required. A general-purpose judge may not understand the nuances of domain terminology or the rigor of professional logic. Key steps in building a domain evaluation pipeline are, first, **co-defining evaluation dimensions** with domain experts — jointly determining dimensions such as "conservatism of medical advice" or "accuracy of legal citations." Second, **building a domain gold standard** — a set of authoritative reference answers or judgments annotated by experts. Finally, designing a **human-AI collaborative evaluation process** in which the AI judge handles initial screening while human experts handle edge-case review and final adjudication.
**Follow-up**: When domain experts themselves disagree on the same response (e.g., physicians from different schools of thought), how do you design a system that accommodates reasonable expert disagreement while still enabling effective automated evaluation?

§A Key Papers Timeline