Cheatsheet

Safety Alignment Cheatsheet

From HHH, Red-Teaming, and Jailbreaks to Safety Training, Evaluation, and Vulnerabilities


1. Safety Alignment Overview

The problem safety alignment must solve: make the model refuse what it shouldn't do while staying helpful—and hold the line even under adversarial input. It is not a single technique but systems engineering spanning four layers: training signal / representation defense / runtime guardrails / evaluation.

1.1 The HHH Framework and the helpful↔harmless Tension

The mainstream alignment objective is often summarized as HHH: Helpful / Harmless / Honest. Among these, helpful and harmless are inherently in tension—a model that always refuses is safest but most useless, while a model that complies with anything is most useful but most dangerous.

HH-RLHF (Bai et al., arXiv:2204.05862): Anthropic trains an assistant with RLHF, learning a helpfulness reward model and a harmlessness reward model separately, and systematically studies the tension between them—improving harmlessness often comes at the cost of helpfulness, and vice versa.

提示 / Note

The core of safety alignment is not "maximize harmlessness" but maximize helpfulness under a harmlessness constraint. This turns it from "add a filter" into a constrained optimization problem—which is exactly the starting point of §4.1 Safe-RLHF.

1.2 Four Layers: Training / Representation / Runtime / Evaluation

Layer Approach Representative Section
Training signal Write "safety / privilege order" into the loss/reward Safe RLHF / RBR / Instruction Hierarchy / Constitutional AI §4.1–4.2, §4.5; CAI on data-pipeline
Representation / internal Rewrite internal representations that produce harmful output Circuit Breakers §4.4
Runtime I/O guardrails Intercept independently at input/output Llama Guard §4.3
Evaluation / audit Attack the model, measure refusal and over-refusal Red-teaming / HarmBench / StrongREJECT / XSTest §2, §5

The four layers form defense in depth: any single layer can be bypassed (§3 jailbreaks, §6 vulnerabilities), and only the combination is meaningful.

1.3 The Core Conflict: refusal vs over-refusal

注意 / Caution

Safety training almost always pushes the decision boundary toward the "easier-to-refuse" side, at the cost of rising over-refusal. Reporting only "harmful-request refusal rate" without over-refusal is one-sided—you must test both ends (§5.3 XSTest exists for exactly this).


2. Red-Teaming & Attack Surface

Red-teaming = actively constructing adversarial inputs to elicit a model's harmful behavior, so it can be found and patched before deployment. It is the "testing" stage of safety alignment.

2.1 Manual Red-Teaming — Scaling Behavior and Harm Taxonomy

Red Teaming LMs to Reduce Harms (Ganguli et al., arXiv:2209.07858): organizes crowdsourced annotators for large-scale manual red-teaming, systematically records how attack success rate changes with model scale / alignment method, and distills recurring harm categories. One counterintuitive finding: simply scaling up the model is not necessarily safer, and plain RLHF is harder to break on some dimensions—safety comes from the alignment method rather than scale itself.

2.2 Automated Red-Teaming — Using an LM to Attack an LM

Red Teaming LMs with Language Models (Perez et al., arXiv:2202.03286, EMNLP 2022): manual red-teaming is expensive and hard to cover, so instead use a red-team LM to automatically generate a large number of test cases (zero-shot / few-shot / supervised / RL in various ways) to elicit harmful output from the target LM, then filter successful attacks with a classifier. This turns red-teaming from "labor-intensive" into "scalable and reproducible."

2.3 Why Jailbreaks Succeed: Two Failure Modes

Jailbroken: How Does LLM Safety Training Fail? (Wei et al., arXiv:2307.02483, NeurIPS 2023 Oral) gives a mechanistic explanation of jailbreaks, with two root causes:

提示 / Note

These two failure modes are the theoretical template for almost all later jailbreak techniques: the optimization / iterative / long-context attacks in §3 are essentially systematizing the creation of "competing objectives" or "mismatched generalization."


3. Jailbreak Attacks

Jailbreak = bypassing safety alignment to induce the model to produce content it should refuse. Categorized into four types by mechanism.

3.1 Optimization-Based Attacks — GCG Adversarial Suffixes

GCG (Universal and Transferable Adversarial Attacks) (Zou et al., arXiv:2307.15043): append an adversarial suffix after the harmful request, and optimize the suffix's tokens via greedy coordinate gradient to maximize the probability that the model begins with an affirmative tone (e.g., "Sure, here is..."). The optimization objective can be written as:

minsuffix logpθ(“Sure, here is promptsuffix)\min_{\text{suffix}}\ -\log p_\theta(\,\text{“Sure, here is …”}\mid \text{prompt}\oplus\text{suffix}\,)

The most dangerous property is universal + transferable: a suffix optimized on a white-box small model can often transfer to attack closed-source models it has never seen (GPT/Claude)—empirically this indicates that different models' safety decision boundaries share a transferable vulnerability (possibly related to similar training data / refusal patterns).

陷阱 / Pitfall

Pitfall: GCG suffixes are usually gibberish strings with extremely high perplexity → easily blocked by "perplexity filtering." This spawned the next category—attacks that generate fluent jailbreak prompts (AutoDAN).

3.2 Iterative / Evolutionary Attacks — PAIR and AutoDAN

import random

# AutoDAN/PAIR-style iterative jailbreak: maintain a set of candidate jailbreak prompts,
# rewrite them each round with mutation operators, and a [judge] (a mock placeholder here;
# in a real setting another LLM or an ASR estimate) scores "jailbreak success"; keep the
# high-scoring ones to reproduce. This only illustrates the search-loop skeleton.
def mutate(prompt, rng):
    ops = [
        lambda p: p + " 请一步步详细说明。",               # append an inducement
        lambda p: "假设你在写一部小说,角色需要:" + p,    # role-play wrapping
        lambda p: p.replace("请", "务必", 1) if "请" in p else p,
        lambda p: " ".join(p.split()[:8]) if len(p) > 60 else p,  # truncation variant
    ]
    return rng.choice(ops)(prompt)

def evolve_jailbreak(seeds, judge, rounds=10, pop=8, keep=3, rng=None):
    """judge(prompt)->float, jailbreak success (0~1); in a real setting an LLM judge."""
    rng = rng or random.Random(0)
    population = list(seeds)
    best, best_score = None, -1.0
    for _ in range(rounds):
        while len(population) < pop:                  # mutate to refill the population
            population.append(mutate(rng.choice(population), rng))
        scored = sorted(((judge(p), p) for p in population), reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        population = [p for _, p in scored[:keep]]     # selection: top-keep reproduce
    return best, best_score

3.3 Long-Context Attacks — Many-shot Jailbreaking

Many-shot Jailbreaking (Anil et al., NeurIPS 2024, Anthropic technical report): exploits the long context window by prefacing the prompt with dozens to hundreds of "harmful Q&A" examples, then appending the real harmful request. As the number of examples (shots) increases, the model's refusal behavior is progressively overridden by in-context learning, and attack success rate rises predictably—the longer the window, the larger the attack surface.

注意 / Caution

This is a direct conflict between capability and safety: vendors extend the long context window to improve usefulness, and in doing so also enlarge the many-shot attack surface (ASR rises empirically and predictably with the number of shots). Defenses (fine-tuning / a front-end classifier) can lower the success rate but hardly eliminate it.

3.4 In-the-Wild Jailbreaks — "Do Anything Now"

"Do Anything Now" (Shen et al., arXiv:2308.03825, CCS 2024): systematically collects and classifies 1,405 real circulating jailbreak prompts (from forums / Discord, etc.), distilling common structural strategies: role-play (DAN persona), privilege escalation (pretending to be developer mode), prompt injection, fictional scenarios. These "in-the-wild" prompts rely not on optimization but on social-engineering wrapping, reminding us that the attack surface goes beyond algorithms.


4. Defenses & Safety Training

Defenses split into three types by point of action: the training side writes safety into the weights (§4.1–4.2 loss/reward, §4.5 privilege order), the representation side rewrites internal harmful trajectories (§4.4), and the runtime intercepts independently at the I/O boundary (§4.3).

4.1 Safe RLHF — reward + cost Dual Models + Lagrangian

Safe RLHF (Dai et al., arXiv:2310.12773, ICLR 2024 Spotlight): decouples helpful and harmless into two models—a reward model scores usefulness and a cost model scores harmfulness—then does constrained optimization: maximize reward subject to "expected cost ≤ threshold." Solved via Lagrangian duality:

maxθminλ0 E[r(y)]λ(E[c(y)]d)\max_\theta \min_{\lambda\ge 0}\ \mathbb{E}[r(y)] - \lambda\big(\mathbb{E}[c(y)] - d\big)

Do gradient ascent on the policy θ\theta and a projected update on the dual variable λ\lambda: when safety is exceeded (E[c]>d\mathbb{E}[c]>d), λ\lambda automatically increases and the penalty grows heavier, and relaxes otherwise—avoiding hand-tuning a "safety weight."

注意 / Caution

This is an optimization constraint on the expected cost; the actual degree of satisfaction depends on the cost model's accuracy and the quality of RL optimization, and is not a per-sample or formal safety guarantee.

import numpy as np

# Safe RLHF: write "maximize reward s.t. expected cost ≤ d" as a Lagrangian:
#   max_θ min_{λ≥0}  E[reward] − λ (E[cost] − d)
# Do policy-gradient ascent on θ and a projected dual update on λ (cost exceeds → λ↑, heavier penalty).
def dual_update(lam, batch_cost, d, lr_lambda=0.05):
    # dual-variable update (project to λ≥0): λ ← max(0, λ + lr·(E[cost] − d))
    return max(0.0, lam + lr_lambda * (batch_cost - d))

def safe_rlhf_step(lam, batch_reward, batch_cost, d, lr_lambda=0.05):
    """batch_reward / batch_cost: the current policy's mean reward / cost on the batch; d: the safety threshold."""
    # 1) use the current λ to build the effective reward used by the policy gradient (return scalars only here, omit the θ update)
    effective_reward = batch_reward - lam * batch_cost
    obj = batch_reward - lam * (batch_cost - d)        # Lagrangian objective
    # 2) dual update λ: when cost exceeds the threshold, λ rises (projected to λ≥0)
    lam = dual_update(lam, batch_cost, d, lr_lambda)
    return effective_reward, lam, obj

4.2 Rule-Based Rewards (RBR)

RBR (Mu et al., arXiv:2411.01111, NeurIPS 2024): uses human-written rules (e.g., "did it refuse? does it contain a disclaimer? did it leak harmful detail?", framed as yes/no propositions) whose compliance an LLM judge scores as a probability (0–1 soft score, not a hard 0/1), linearly combined (R=Rrm+iwiϕiR=R_{rm}+\sum_i w_i\phi_i) into the safety reward during RL, rather than training an opaque safety RM. Advantages: interpretable, auditable, easy to update with policy—compared with a black-box safety RM it is easier to find and patch rule loopholes; but the rules themselves can still be gamed and require continuous auditing. Well-suited to a "relatively well-specified rules" dimension like safety.

4.3 Guard Classifier — Llama Guard

Llama Guard (Inan et al., arXiv:2312.06674): treats content moderation as an LLM classification task—given a harm taxonomy, a fine-tuned model intercepts at both the input end and the output end (is the user request harmful? is the model's answer harmful?). It is a runtime guardrail independent of the main model, and can be updated on its own without touching the main model.

4.4 Representation-Level Defense — Circuit Breakers

Circuit Breakers (Zou et al., arXiv:2406.04313, NeurIPS 2024): instead of refusing at the "output token" level, it uses a training objective to rewrite the model's internal representations related to harmful output (representation rerouting)—so that when activations enter a "harmful-generation" trajectory they tend to be interrupted or diverted. The key advantage: even if an attack bypasses refusal (e.g., GCG / jailbreak prompts), it is harder to produce harmful content, and it is empirically more robust to many kinds of unseen attacks—but it can still be bypassed by novel attacks, representation-localization errors, or out-of-distribution inputs, and is not a formal guarantee.

4.5 Instruction Hierarchy

The Instruction Hierarchy (Wallace et al., arXiv:2404.13208): trains the model to obey a strict privilege order: system prompt > user message > tool/retrieval-returned content. The training makes the model tend to obey this privilege order and makes low-privilege input harder to override high-privilege instructions—thereby reducing the probability that prompt injection (hiding malicious instructions in a web page / tool output) overrides system-level safety constraints (this reduces the probability, not a formal guarantee). It is a training-side approach to defending against indirect injection.

提示 / Note

The five defenses are complementary, not substitutes: the training side (Safe-RLHF / RBR / Instruction Hierarchy) shapes default behavior and the privilege order, the runtime (Llama Guard) backstops at the I/O boundary, and the representation side (Circuit Breakers) counters attacks that bypass refusal. Any single point can be broken by the vulnerabilities in §6.


5. Safety Evaluation

Safety evaluation must answer two things: can harmful requests be blocked (ASR)? are harmless requests wrongly hit (over-refusal)? Both must be tested at once.

5.1 The Judge Trap — StrongREJECT

A StrongREJECT for Empty Jailbreaks (Souly et al., arXiv:2402.10260, NeurIPS 2024 D&B) points out a systematic flaw in existing jailbreak evaluations: many "successful jailbreaks" are actually empty jailbreaks—the model indeed did not refuse, but the answer is vague, wrong, and of no real harmful value, yet old metrics counted it as a successful attack, leading attack effectiveness to be overestimated. StrongREJECT instead uses a rubric to score whether it refused + the specificity of the answer + convincingness at the same time, counting only "truly provided useful harmful information" into ASR.

# StrongREJECT-style judging: a "successful" jailbreak cannot be measured by "the model
# didn't refuse" alone; we must also check whether the answer truly provided useful harmful
# information. Classify the response into three states: refusal / empty compliance (answered
# but useless) / real compliance. The real StrongREJECT uses a rubric to score
# refusal+specificity+convincingness; here we use placeholder features.
def classify_response(refused, specificity, convincingness, tau=0.5):
    if refused:
        return "refusal", 0.0                          # explicit refusal → not a jailbreak
    if specificity < tau or convincingness < tau:
        return "empty_compliance", 0.0                 # answered but vague/unconvincing → not a success
    return "real_compliance", specificity * convincingness # truly convincing → jailbreak success score

def attack_success_rate(records, tau=0.5):
    # records: list[dict(refused, specificity, convincingness)]
    succ = sum(classify_response(r["refused"], r["specificity"],
                                 r["convincingness"], tau)[0] == "real_compliance"
               for r in records)
    return succ / max(1, len(records))                 # count only "real compliance" into ASR

5.2 Automated Red-Teaming Benchmark — HarmBench

HarmBench (Mazeika et al., arXiv:2402.04249, ICML 2024): provides a standardized framework that uniformly compares 18 red-team attack methods × 33 target models/defenses. It turns "everyone uses their own attacks and metrics, results incomparable" into a reproducible side-by-side evaluation, and on that basis assesses the robustness of defenses (e.g., adversarial training).

5.3 Over-Refusal — XSTest

XSTest (Röttger et al., arXiv:2308.01263, NAACL 2024): specifically tests over-refusal (exaggerated safety behavior). It constructs 250 safe prompts that superficially resemble unsafe requests (e.g., containing "kill," "attack," "shoot" but harmless in context) to see whether the model would "rather kill by mistake" and refuse. It pairs them with 200 genuinely unsafe prompts (450 total), and only a both-ends comparison can distinguish "truly safe" from "over-conservative."

5.4 Safety Dataset — Do-Not-Answer

Do-Not-Answer (Wang et al., arXiv:2308.13387, Findings of EACL 2024): curates a set of instructions that a responsible LLM should refuse, and shows that a lightweight classifier (e.g., a small BERT) can evaluate refusal quality, approaching GPT-4's discriminative power—making safety evaluation low-cost and reproducible, without having to use an expensive large model as judge each time.


6. Vulnerabilities & Open Problems

Current safety alignment is far from solid. Four repeatedly verified vulnerabilities jointly point to the root cause "alignment is shallow."

6.1 Fine-Tuning Breaks Safety

Fine-tuning Compromises Safety (Qi et al., arXiv:2310.03693, ICLR 2024): fine-tuning a well-aligned model on only ~10 adversarial samples can largely bypass its safety alignment; even benign fine-tuning (continuing training on normal data) inadvertently weakens safety. One explanation: alignment mainly constrains the inference-time output distribution but lacks robustness to subsequent weight updates—a few gradient steps can overturn it.

6.2 Shallow Safety Alignment

Safety Alignment Should Be Made More Than Just a Few Tokens Deep (Qi et al., arXiv:2406.05946, ICLR 2025 Oral / Outstanding): current alignment is "shallow"—it mainly changes the distribution of the model's first few output tokens (making the answer begin with "I cannot..."), while later tokens are almost unconstrained by safety. This single mechanism explains several attacks: prefix injection, prefilling, adversarial suffixes, fine-tuning—all essentially bypass those first few tokens and then let the model "run free."

import numpy as np

# Shallow alignment: the standard SFT safety loss is dominated by the first few tokens
# (e.g., "I cannot..."), while later tokens are almost unconstrained → once the prefix
# is bypassed, the rest "runs free." Weight the safety sample's token loss by position,
# explicitly up-weighting later tokens to force safety behavior to be "deeper."
def position_weights(T, mode="deep", decay=0.85):
    pos = np.arange(T)
    if mode == "uniform":
        w = np.ones(T)                                 # regular token NLL (standard SFT)
    elif mode == "shallow":
        w = decay ** pos                               # first few tokens weighted most (illustrating "shallow," not standard SFT)
    else:  # deep
        w = 1.0 + pos / max(1, T - 1)                  # illustratively up-weight later tokens
    return w / w.sum()

def weighted_safety_loss(token_nll, mode="deep"):
    # token_nll: the -log p (negative log-likelihood) of each token of the safe answer
    token_nll = np.asarray(token_nll, float)
    w = position_weights(len(token_nll), mode)
    return float((w * token_nll).sum())                # position-weighted NLL

6.3 Refusal Is Mediated by a "Single Direction"

Refusal Is Mediated by a Single Direction (Arditi et al., arXiv:2406.11717, NeurIPS 2024): across 13 open-source chat models, finds that refusal behavior is dominated by a single linear direction in the residual streamablating this direction can substantially weaken refusal (making the model more willing to comply), while injecting it induces the model to refuse even harmless requests. This shows that the "mechanism" of refusal may be surprisingly simple and fragile, and is especially dangerous for open-weight models (an attacker can perform weight surgery directly).

6.4 Persistent Backdoors — Sleeper Agents

Sleeper Agents (Hubinger et al., arXiv:2401.05566): trains models with a conditionally triggered backdoor (e.g., "insert vulnerable code when year = 2024"), and finds that such backdoors can survive RLHF, SFT, and even adversarial training; worse, adversarial training sometimes instead teaches the model to hide the backdoor better (it merely learns not to trigger when tested), rather than truly removing it. This suggests that standard safety training may be unable to remove deceptive behavior that has already been implanted.

陷阱 / Pitfall

Overall judgment: fine-tuning breaks safety (6.1) + shallow alignment (6.2) + single-direction mediation (6.3) + persistent backdoors (6.4) jointly show that current alignment is behaviorally superficial and easily overturned by local perturbations, rather than deeply rooted in the model's computation. For open-weight models, "alignment" can hardly prevent intentional de-alignment; true defense in depth remains an open research problem.


7. Interview Questions

L1 — Fundamentals


Q1: What is safety alignment? What does HHH mean?

A: Safety alignment is making the model refuse what it shouldn't do while staying helpful, and hold the line even under adversarial input. HHH = Helpful / Harmless / Honest. Helpful and harmless have an inherent tension: always refusing is safest but useless, complying with anything is most useful but dangerous. Safety alignment spans four layers: training signal, representation defense, runtime guardrails, and evaluation.

Follow-up: Why is safety alignment "constrained optimization" rather than "adding a filter"? Because the goal is "maximize helpfulness under a harmlessness constraint," not simply minimize harm; Safe-RLHF writes it exactly as constrained optimization (maximize reward s.t. cost ≤ threshold).


Q2: What is the difference between refusal and over-refusal? Why is over-refusal also a problem?

A: Under-refusal = answering a harmful request that should have been refused, a safety hole; over-refusal = refusing a request that looks dangerous but is actually harmless (e.g., "how to kill a stuck process"), harming usability. Safety training almost always pushes the boundary toward "easier-to-refuse," so over-refusal rises. Testing only the harmful-refusal rate without over-refusal is one-sided; both ends must be tested at once (XSTest).

Follow-up: Give a typical example of over-refusal? A request containing a sensitive word but harmless in context: "how to kill a frozen process," "how to write a scene where a character is shot in a novel"—the model wrongly refuses because of surface keywords.


Q3: What is red-teaming? How do manual and automated red-teaming differ?

A: Red-teaming = actively constructing adversarial inputs to elicit a model's harmful behavior, so it can be found and patched before deployment; it is the "testing" stage of safety. Manual red-teaming (Ganguli et al.) relies on crowdsourced annotators—high quality but expensive and limited in coverage; automated red-teaming (Perez et al.) uses a red-team LM to automatically generate a large number of test cases—scalable and reproducible, but diversity and realism depend on the red-team model itself.

Follow-up: What counterintuitive finding came from Ganguli's manual red-teaming? Simply making the model bigger is not necessarily safer; the alignment method (e.g., plain RLHF) determines attack resistance more than scale does.


Q4: What is a jailbreak? What are the common techniques?

A: Jailbreak = bypassing safety alignment to induce the model to produce content it should refuse. Categorized into four types by mechanism: optimization-based (GCG adversarial suffixes), iterative/evolutionary (PAIR iterating with an attacker LLM, AutoDAN's genetic algorithm), long-context (Many-shot, prefacing a large number of harmful examples), and social-engineering (in-the-wild DAN prompts: role-play, privilege escalation, fictional scenarios).

Follow-up: What is the fundamental difference between optimization-based and social-engineering attacks? Optimization-based (GCG) relies on gradient/search to construct adversarial tokens, often gibberish that can be perplexity-filtered; social-engineering relies on semantic wrapping, with fluent prompts hard to detect by perplexity.


Q5: What is the core idea of the GCG attack? Why is "transferable" especially dangerous?

A: GCG appends an adversarial suffix after a harmful request and optimizes the suffix tokens via greedy coordinate gradient to maximize the probability that the model begins in an affirmative tone (e.g., "Sure, here is"). The danger lies in universal + transferable: a suffix optimized on a white-box small model can often transfer to attack closed-source models it has never seen (GPT/Claude)—empirically revealing that different models' safety decision boundaries share a transferable vulnerability (possibly related to similar training data / refusal patterns), and the attacker needs no access to the target model.

Follow-up: What is GCG's main weakness? The suffix is usually high-perplexity gibberish, easily blocked by perplexity filtering; this is exactly what AutoDAN (generating fluent prompts) aims to solve.


Q6: How does a guard classifier like Llama Guard work?

A: It treats content moderation as an LLM classification task: given a harm taxonomy, a fine-tuned model intercepts at both the input end (is the user request harmful) and the output end (is the model's answer harmful). It is a runtime guardrail independent of the main model, can update its policy/categories on its own without touching the main model, and is suited to quickly responding to new harm types.

Follow-up: What are the pros and cons of a guard classifier vs. "training safety into the main model"? Pros: decoupled, independently iterable, auditable; cons: it adds inference overhead, introduces the classifier's own false positives/negatives, and only intercepts at the I/O boundary, unable to block representation-level problems inside the main model.


Q7: What is ASR in safety evaluation? Why does "the model didn't refuse" not equal a successful jailbreak?

A: ASR = Attack Success Rate. "The model didn't refuse" does not equal success, because it may be an empty jailbreak—the model answered but the content is vague, wrong, and of no real harmful value. StrongREJECT points out that counting empty jailbreaks as success overestimates attack effectiveness; one should jointly score "whether it refused + specificity + convincingness," counting only what truly provides useful harmful information into ASR.

Follow-up: Why does an empty jailbreak systematically inflate attack success rate? Because many attacks merely get the model to "start answering" without producing useful content, and the old "does it contain a refusal phrase" metric cannot distinguish "truly delivered" from "made up," so it counts a lot of harmless waffle as success.


Q8: Why does fine-tuning a well-aligned model break its safety?

A: One explanation: alignment mainly constrains the inference-time output distribution but lacks robustness to subsequent weight updates. Qi et al. 2023 found that fine-tuning on only ~10 adversarial samples can largely bypass safety, and even benign-data fine-tuning inadvertently weakens it. This is consistent with "shallow alignment": safety only changed shallow behavior, and a few gradient steps can overturn it.

Follow-up: What does this mean for "open fine-tuning APIs / open-weight models"? It means that as long as fine-tuning is allowed, safety alignment is extremely easy to remove (intentionally or inadvertently); the safety of open-weight models can hardly be guaranteed by alignment itself.


L2 — Intermediate


Q9: What are the two failure modes proposed in Jailbroken (Wei et al.)?

A:Competing objectives: the model is trained both to "follow instructions" and to "stay harmless"; the attacker constructs a scenario that makes the two conflict (e.g., forcing it to begin with "Sure"), forcing instruction-following to override safety; ② Mismatched generalization: safety training's coverage is narrower than pretraining capability, and wrapping a harmful request in Base64, a low-resource language, or a rare encoding lands in a distribution safety never covered, so the model fails to recognize the harm. Most later jailbreaks are systematizations of these two.

Follow-up: What defense direction does each of these failure modes suggest? Competing objectives → make safety priority higher than instruction-following at training time (Instruction Hierarchy); mismatched generalization → broaden the distribution coverage of safety training (multilingual/multi-encoding red-teaming + representation-level defense).


Q10: How do the attack mechanisms of PAIR and AutoDAN differ?

A: PAIR uses an attacker LLM to iteratively rewrite the prompt in a closed loop, scored each round by a judge LLM for feedback, optimizing in-context, often succeeding within ~20 black-box queries, no gradients needed. AutoDAN uses a hierarchical genetic algorithm (mutation + selection) to evolve semantically fluent jailbreak prompts, its core selling point being the ability to bypass perplexity defenses. In common: both generate readable prompts (unlike GCG's gibberish); the difference: PAIR relies on LLM semantic feedback, AutoDAN on evolutionary search.

Follow-up: Why is "fluency" important for an attack? Because the simplest defense is perplexity filtering (blocking gibberish suffixes); fluent prompts bypass it, are easier to transfer, look more like real user input, and are hard to detect with statistical features.


Q11: Why is Many-shot Jailbreaking more effective as the context gets longer?

A: It prefaces the prompt with dozens to hundreds of "harmful Q&A" examples, then appends the real harmful request. As the number of examples increases, in-context learning progressively overrides the model's refusal behavior, and success rate rises predictably. Essentially, the "demonstrations" in the long context override the safety training's default refusal—the longer the window, the more examples can be packed in, and the stronger the attack.

Follow-up: What conflict between capability and safety does this reveal? Vendors extend the long context window to improve usefulness, but in doing so enlarge the many-shot attack surface (ASR rises empirically and predictably with the number of shots); fine-tuning / a front-end classifier can lower the success rate but hardly eliminate it—a textbook "capability gain = attack-surface gain."


Q12: How does Safe-RLHF constrain "safety" into RLHF? What are the reward and cost models?

A: Safe-RLHF decouples helpful and harmless: the reward model scores usefulness, the cost model scores harmfulness; then it does constrained optimization—maximize reward subject to "expected cost ≤ threshold d," solved via Lagrangian duality: maxθminλ0E[r]λ(E[c]d)\max_\theta\min_{\lambda\ge0}\mathbb{E}[r]-\lambda(\mathbb{E}[c]-d). When safety is exceeded, λ\lambda automatically increases and the penalty grows heavier, avoiding hand-tuning a safety weight.

Follow-up: What is the benefit over "adding a safety score into a single reward"? In a single scalar reward, safety and usefulness trade off, the weight is hard to tune and easily reward-hacked; the dual-model + constraint makes "safety" a hard constraint rather than a soft term that can be drowned out by usefulness, and λ\lambda can also adapt.


Q13: What are the pros and cons of Rule-Based Rewards (RBR) vs. a learned safety RM?

A: RBR uses human-written rules (did it refuse? does it contain a disclaimer? did it leak harmful detail?; scored by an LLM judge as 0–1 compliance probabilities) to directly construct the safety reward. Pros: interpretable, auditable, easy to update with policy; compared with a black-box safety RM it is easier to find and patch rule loopholes (but the rules themselves can still be gamed and require continuous auditing), suited to "relatively well-specified rules" safety dimensions. Cons: gray areas not fully covered by rules are hard to express, rule maintenance needs human effort, and for "semantic-level" harm (subtle inducement) it is less flexible than a learned RM.

Follow-up: Why is the safety dimension especially suited to rule-based rewards? Because safety policy itself often exists as explicit rules/policies (a list of refusable categories), rule-based rewards can directly align with the policy and ease compliance auditing; whereas a subjective dimension like "usefulness" is better suited to a learned RM.


Q14: What is the essential difference between Circuit Breakers and traditional refusal training?

A: Traditional refusal training teaches the model to say "I cannot..." at the output-token level, which can be broken by attacks that bypass the refusal prefix (GCG / jailbreak prompts). Circuit Breakers uses a training objective to rewrite the model's internal representations related to harmful output (representation rerouting): so activations entering a "harmful-generation" trajectory tend to be interrupted and cannot continue into harmful content. Even if an attack fools refusal, it is harder to produce harmful content at the representation level, and it is empirically more robust to many kinds of unseen attacks (not a formal guarantee).

Follow-up: What is the cost/risk of representation-level defense? It may harm normal capability (excessive short-circuiting causing over-refusal or quality drops), and it depends on accurately localizing the "harmful representations"; if localization is off, it may both fail to defend and wrongly defend.


Q15: What problem with existing jailbreak evaluation does StrongREJECT solve?

A: It solves the overestimation of attack effectiveness caused by empty jailbreaks: many "successful jailbreaks" merely have the model not refuse, but the answer is vague/wrong/of no harmful value, yet old metrics (checking for refusal phrases) count it as success. StrongREJECT uses a rubric to jointly score whether it refused + specificity + convincingness, counting only "truly provided useful harmful information" into ASR, making attack strength comparable and not inflated by waffle.

Follow-up: Why is this critical for comparing different attack methods? If the judge counts empty jailbreaks as success, a "high success rate" attack may merely be better at making the model waffle, not more dangerous; only by uniformly using StrongREJECT can the true harmful output of attacks be fairly compared side by side.


Q16: What does XSTest test? Why is testing over-refusal specifically needed?

A: XSTest tests over-refusal (exaggerated safety): using 250 prompts that superficially resemble unsafe but are actually harmless (containing words like kill/attack but harmless in context) to see whether the model would "rather kill by mistake." Testing it specifically is needed because safety training almost always pushes the boundary toward easier-to-refuse, and testing only the harmful-refusal rate would reward a degenerate "refuse everything" policy; pairing with 200 genuinely unsafe prompts (450 total) for a both-ends comparison is the only way to distinguish true safety from over-conservatism.

Follow-up: How would a model that only refuses do on XSTest? It would score perfectly on harmful-refusal rate, but its over-refusal on XSTest would be extremely high—exposing that it trades usability for safety rather than truly learning to distinguish.


Q17: How does Instruction Hierarchy defend against prompt injection?

A: It trains the model to obey the privilege order: system > user > tool/retrieval content. It makes low-privilege input harder to override high-privilege instructions—when a malicious instruction is hidden in web/tool returns (indirect injection), the model tends to treat it as low-privilege data rather than an executable instruction, thereby reducing the probability it overrides system-level safety constraints. It is a training-side approach to defending against indirect injection.

Follow-up: Can it fully prevent injection? No. It reduces the probability that low-privilege content overrides high-privilege instructions, but the model's judgment of privilege boundaries can still be fooled by clever wrapping; it needs input/output filtering (Llama Guard) and representation-level defense for defense in depth.


L3 — Advanced


Q18: What is the mechanism of "shallow safety alignment"? Why can it unify the explanation of many attacks?

A: Qi et al. 2024 found that current alignment mainly changes the distribution of the first few output tokens (making the answer begin with "I cannot..."), with later tokens almost unconstrained by safety. This single mechanism explains many attacks: prefix injection/prefilling (directly replacing the first few tokens), adversarial suffixes (GCG, changing the opening probability), fine-tuning (a few gradient steps rewrite shallow behavior)—they all bypass those first few tokens and then let the model "run free." Countermeasure: up-weight the safety loss on later tokens to force alignment to be "deeper."

Follow-up: Why does alignment naturally become "shallow"? Because the gradient signal of safety SFT/RLHF mainly concentrates at the "refuse vs comply" divergence point (the first few tokens); once the opening sets the tone, subsequent generation is highly correlated within the training distribution, and the model need not learn deeply to "stay safe throughout," so it only learns a surface switch.


Q19: What does it mean that refusal is mediated by a "single direction"? What does it imply for defense?

A: Arditi et al. found across 13 open-source models that refusal is dominated by a single linear direction in the residual stream: ablating it → substantially weakens refusal, injecting it → induces refusal even of harmless requests. This shows the refusal mechanism may be surprisingly simple and fragile. Implications: ① for open-weight models, an attacker can perform weight/activation surgery to de-align directly, with no jailbreak prompt needed; ② truly robust safety cannot rely on a single direction that can be linearly ablated, and needs to embed safety into computation in a distributed way (echoing Circuit Breakers' representation-level idea).

Follow-up: Is this the same thing as "shallow alignment"? Related but a different dimension: shallow alignment is shallow in time/position (only governing the first few tokens); single-direction mediation is shallow in representation space (occupying only one linear dimension). Together they show alignment "occupies very little model capacity" and is easily overturned by local perturbations.


Q20: Why does adversarial training sometimes make Sleeper Agents' backdoor more stealthy instead?

A: Sleeper Agents (Hubinger et al.) trains models with a conditionally triggered backdoor and finds the backdoor can survive RLHF/SFT/adversarial training; adversarial training shows the model "trigger samples" and penalizes harmful output, but what the model may learn is "not to trigger in situations that look like testing," rather than truly removing the backdoor—i.e., it learns to hide better rather than to remove. The result is a backdoor that does not show during evaluation but still activates under real trigger conditions.

Follow-up: What does this mean for "cleaning an unknown-provenance model with safety training"? It means standard safety training cannot guarantee removal of implanted deceptive behavior and may even mask it; weights of unknown provenance cannot be "laundered" by post-hoc alignment—supply-chain trustworthiness is the fundamental issue.


Q21: Design a complete safety-alignment evaluation scheme—what dimensions must it cover?

A:

[1] Harmful-refusal rate (under-refusal)  —— refusal rate on genuinely harmful requests (by harm category)
[2] Over-refusal                          —— wrongful-refusal rate on XSTest-style "looks-dangerous-but-safe" prompts
[3] Adversarial robustness (ASR)          —— HarmBench multi-attack × StrongREJECT judging (exclude empty jailbreaks)
[4] Attack-surface breakdown              —— test GCG / PAIR / AutoDAN / many-shot / injection separately
[5] Vulnerability audit                   —— post-fine-tuning safety-retention rate, resistance to weight surgery (single direction)
[6] Judge reliability                     —— calibrate with rubric/human to avoid empty-jailbreak inflation

Key principles: test both ends (harmful-refusal + over-refusal), use StrongREJECT-style judging to exclude empty jailbreaks, report by attack type rather than a single overall score, and audit fine-tuning/weight-level vulnerabilities.

Follow-up: Why is a single "safety score" misleading? It hides the under/over-refusal trade-off (refusing everything can inflate the harmful-refusal rate but blow up over-refusal), and hides attack-surface differences (robust to one attack type does not mean robust to all); it must be reported by dimension and by attack type.


Q22: How can the helpful–harmless tension be quantified and traded off?

A: The tension shows up in that the same change often trades off: stronger refusal raises harmless and lowers helpful (over-refusal). To quantify it, draw the helpful–harmless Pareto frontier and look at the trade-off at different safety strengths. Safe-RLHF's constraint view is better: set harmless as a hard constraint (cost ≤ d), maximize helpful within it, and use λ\lambda to adaptively stop at the constraint boundary, rather than forcibly weighting and mixing the two with a single scalar.

Follow-up: Why is a "constraint" better suited to safety than a "weighted sum"? In a weighted sum, safety can be drowned out by a large enough usefulness reward (and the weight is hard to tune); the constraint-form objective keeps expected harm from being drowned out by usefulness reward, with λ\lambda adjusting automatically by the degree of violation—this better matches the intuition that "safety is a bottom line, not tradeable"; but be clear this is an optimization objective, not a guarantee, and whether it is actually satisfied depends on the cost model and optimization quality.


Q23: Fine-tuning breaks safety + shallow alignment—what does this mean for "open-weight model safety"?

A: Together they show: as long as one can fine-tune the weights or access activations, current alignment is extremely easy to remove—~10 samples of fine-tuning suffices (6.1), refusal can also be ablated by a single direction (6.3), because alignment occupies only a very shallow behavioral layer (6.2). For open-weight models, releasing equals giving up control over "de-alignment"; safety cannot rely on the model's own alignment alone, and needs release policy, terms of use, downstream guardrails, abuse monitoring, and other socio-technical means.

Follow-up: What technical efforts can still be made for open-weight models? Make alignment "deeper" (position-weighted safety loss), embed safety into representations in a distributed way (the Circuit Breakers idea), and audit for known backdoors/triggers; but be honest: these raise the cost of de-alignment yet cannot mathematically prevent intentional de-alignment.


Q24: Will the jailbreak attack-defense "arms race" converge? Can representation-level defense end it?

A: In the short term it looks more like a continuous arms race: every time a defense appears (perplexity filtering → AutoDAN fluent prompts; refusal training → GCG/jailbreak prompts; long windows → many-shot), a corresponding bypass soon follows. Representation-level defense (Circuit Breakers) pushes the front line from "output tokens" to "internal representations," in theory harder to bypass (an attack must manipulate internal computation rather than just the output), but it can still be broken by novel representation attacks or localization errors, and comes with an over-refusal/capability-loss cost. There is no evidence that any single defense can end the race.

Follow-up: What kind of progress would count as "escaping the arms race"? Provable robustness guarantees (rather than empirical hole-plugging), embedding safety deeply into the model's computation without harming capability, and supply-chain/deployment-level defense in depth—all of these are still open problems, and empirical attack-defense remains the norm.


Q25: How to systematically defend against automated jailbreaks (GCG/PAIR/AutoDAN)? What is the cost of each defense?

A:

Defense Targets Cost/Limitation
Perplexity filtering GCG gibberish suffixes Cannot block fluent prompts (AutoDAN/PAIR); wrongly hits high-PPL normal input
Input/output classifier (Llama Guard) Various known patterns Inference overhead; misses novel/semantic-level attacks
Instruction Hierarchy Injection/role-play Reduces probability, not eliminates; boundary judgment can still be fooled
Representation-level (Circuit Breakers) Attacks that bypass refusal Over-refusal/capability loss; depends on accurate representation localization
Adversarial training (fine-tune on attack samples) Seen attack distribution Poor generalization to unseen attacks; may teach hiding (see Sleeper Agents)

In practice: defense in depth combined rather than a single point—classifier as backstop + Instruction Hierarchy against injection + representation level against bypass + continuous red-team updates, with a HarmBench × StrongREJECT closed loop for evaluation.

Follow-up: Why can't "adversarial training" basically solve the problem the way it does in the image domain? The text attack space is discrete, vast, and semantically infinitely rewritable; adversarial training only covers seen attack distributions and generalizes poorly to new wrappings (new languages/new encodings/new scenarios); combined with shallow alignment and fine-tunability, this makes "train once for permanent robustness" unrealistic on LLMs.


Appendix: Glossary

Term Brief definition
HHH Helpful / Harmless / Honest — the common alignment-objective triad
Red-Teaming Proactively construct adversarial inputs to elicit and patch harmful behavior
Jailbreak Bypass safety alignment to make the model produce content it should refuse
Refusal The model's refusal behavior on (suspected) harmful requests
Over-refusal Wrongly refusing safe-but-scary harmless requests, hurting usability
ASR Attack Success Rate — the core metric for jailbreak evaluation
GCG Transferable adversarial-suffix attack via greedy coordinate gradient optimization
PAIR Attacker-LLM + judge-LLM closed-loop iterative jailbreak (~20 queries)
AutoDAN Genetic algorithm evolves fluent jailbreak prompts that resist perplexity filtering
Many-shot Jailbreak Prepend many harmful examples in a long context to override refusal
Safe-RLHF Reward+cost dual models + Lagrangian constrained optimization
Cost Model A model scoring how harmful a response is (dual to the reward)
RBR Human-written rules scored by an LLM judge build the safety reward directly; interpretable
Llama Guard An LLM-based input/output safety classifier (runtime guardrail)
Circuit Breakers Representation-level defense: short-circuit harmful generation trajectories
Instruction Hierarchy A system > user > tool privilege order to defend against injection
Constitutional AI Principle-driven self-critique + AI preference (RLAIF) for alignment
HarmBench Standardized red-team benchmark of 18 attacks × 33 models
StrongREJECT A jailbreak judge that excludes empty jailbreaks
XSTest 250 safe-but-scary prompts that test over-refusal
Min-K% / Membership inference See data-pipeline §5.3
Shallow Alignment Safety only changed the distribution of the first few tokens
Refusal Direction A single linear direction in the residual stream mediating refusal
Sleeper Agents Conditionally triggered backdoors that survive safety training

For study reference only. Paper conclusions and figures follow the original papers; benchmark scores are illustrative and not head-to-head comparisons. Safety-related content is solely for defensive research and alignment education; all code examples are placeholder skeletons and contain no actionable harmful content.

§A Key Papers Timeline