Self-improving LLMs

How LLMs use self-generated signals to "score → filter → train" themselves, iterating continuously without large-scale human annotation.

Caution

Study notes, not the authors' research (see README integrity statement). Numbers / conclusions follow the original papers; uncertain points are noted.

0. The core loop

Generate → Filter / Score → Train → Repeat

Each round, the current policy produces candidate answers or preference pairs; some filtering mechanism (rules, another model, self-scoring) eliminates low-quality outputs; the remaining high-quality samples are used to update weights; the next round reruns with the new model. This self-improvement loop is the shared skeleton of all methods.
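The loop above can be sketched in a few lines. This is a toy illustration under stated assumptions, not any paper's implementation: the "policy" is a single float standing in for skill level, and `toy_generate`, `toy_score`, and `toy_train` are hypothetical placeholders for sampling, the filtering signal, and the weight update.

```python
from typing import Callable, List, Tuple


def self_improve(
    policy: float,
    generate: Callable[[float, int], List[float]],
    score: Callable[[float], float],
    train: Callable[[float, List[float]], float],
    rounds: int = 3,
    n_samples: int = 4,
    threshold: float = 0.52,
) -> Tuple[float, List[int]]:
    """Generate -> filter -> train, repeated for a fixed number of rounds."""
    kept_per_round: List[int] = []
    for _ in range(rounds):
        candidates = generate(policy, n_samples)                 # generate
        kept = [c for c in candidates if score(c) >= threshold]  # filter / score
        kept_per_round.append(len(kept))
        if kept:                                                 # train only if something survived
            policy = train(policy, kept)
    return policy, kept_per_round


# Hypothetical toy components (assumptions for illustration only).
def toy_generate(policy: float, n: int) -> List[float]:
    # Candidates are spread evenly around the current skill level.
    return [policy + 0.1 * (i - (n - 1) / 2) for i in range(n)]


def toy_score(candidate: float) -> float:
    return candidate  # an idealized filter: the score equals the true quality


def toy_train(policy: float, kept: List[float]) -> float:
    return sum(kept) / len(kept)  # move the policy to the mean of the kept samples


final_policy, kept_counts = self_improve(0.4, toy_generate, toy_score, toy_train, rounds=2)
```

If no candidate clears the threshold, `kept` is empty and the policy stops changing, which is the toy version of the loop being capped by its filtering signal.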

TL;DR — quick anchors (2-minute pass)

Shared skeleton = the bootstrap loop: generate → filter / score → train → repeat; the improvement ceiling is capped by filtering-signal quality, not free capability gains.
Bootstrap family: STaR (rejection-sample correct traces + hint-retry) / RFT (simplified, leans on multiple correct solutions per problem for diversity) / ReST (Grow-Improve offline, Improve can iterate with stricter thresholds).
Self-Rewarding: one model both generates and acts as LLM-as-judge, iterative DPO; generation & judgment share parameters and co-evolve—but blind spots get inherited via self-judging.
Self-Play (SPIN): use "the previous-round self" to produce negatives, DPO-style learn to distinguish genuine human responses, approaching the human distribution until indistinguishable.
AI Feedback (CAI/RLAIF): SL-CAI self-critique + revise → RL-CAI AI-labeled preferences → RM → RL; the filtering signal comes from constitutional principles (alignment), not answer correctness.
Inference-time (training-free): Reflexion (verbal reflection into episodic memory, session-only) / Self-Refine (generate-critique-revise, no weight update); improvement is non-persistent and bounded by initial self-judgment ability.
Training-time vs inference-time: STaR/ReST/SPIN/CAI update weights and persist; Reflexion/Self-Refine don't, and reset on restart.
Three failure modes: reward hacking (Goodhart) / model collapse · distribution narrowing (keeping only top-k) / RM over-optimization (OOD decoupling); shared mitigation = KL constraint + diverse independent signals.
The filtering signal is the lifeline: verifiable rules > self-scoring (inherits blind spots); the more independent and verifiable the signal, the less the loop collapses.

1. Bootstrap-then-Train: bootstrapping from correct traces

1.1 STaR — Rejection Sampling + Iterative Fine-tuning

STaR (Self-Taught Reasoner, Zelikman 2022) is the foundational scheme for bootstrapped fine-tuning of LLM chain-of-thought. It iteratively fine-tunes on chains of thought that reached the correct answer, without requiring a large-scale rationale dataset:

Rollout: sample $K$ chain-of-thought rationales per problem.
Filter: retain only those rationales whose final answer is correct (rejection sampling).
Fine-tune: SFT on the retained set, update the model.
Hint-retry: for problems where all answers are wrong, give the correct answer and ask the model to "re-explain", then mix those into training (prevents easy problems from dominating the training set).

Over $T$ iterations the model serves as its own data generator, while filtering is done by the answer-correctness check (a rule) rather than by the model's own judgment.
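The hint-retry step can be sketched as follows. This is a minimal sketch under assumptions: `sample` is a hypothetical callable returning candidate rationales for a prompt, the hint wording is invented for illustration, and the retried rationale is stored under the hint-free prompt so that training does not depend on seeing the answer.

```python
import re
from typing import Callable, Dict, List


def last_number(text: str) -> str:
    """Toy answer extractor: the last integer in the text."""
    nums = re.findall(r"\d+", text)
    return nums[-1] if nums else ""


def build_star_batch(
    problems: List[Dict[str, str]],
    sample: Callable[[str], List[str]],
) -> List[Dict[str, str]]:
    batch: List[Dict[str, str]] = []
    for p in problems:
        prompt = f"Question: {p['question']}\nLet's think step by step:"
        kept = [r for r in sample(prompt) if last_number(r) == p["answer"]]
        if not kept:
            # Hint-retry: reveal the answer and ask for an explanation.
            hinted = (
                f"Question: {p['question']}\n"
                f"(Hint: the answer is {p['answer']}.)\n"
                "Let's think step by step:"
            )
            kept = [r for r in sample(hinted) if last_number(r) == p["answer"]]
        # Store the hint-free prompt so the hint never appears in training data.
        batch.extend({"prompt": prompt, "rationale": r} for r in kept)
    return batch


def fake_sample(prompt: str) -> List[str]:
    """Hypothetical stand-in for model sampling."""
    if "Hint" in prompt:
        return ["7 times 6 is 42"]
    if "3 + 5" in prompt:
        return ["3 plus 5 is 8"]
    return ["I think it is 40"]


batch = build_star_batch(
    [{"question": "What is 3 + 5?", "answer": "8"},
     {"question": "What is 7 * 6?", "answer": "42"}],
    fake_sample,
)
```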

1.2 RFT — Rejection Sampling Fine-tuning

RFT is a simplified variant of STaR: it skips hint-retry, and directly retains the correctly-answered samples from $K$ samples per problem, aggregating them into a richer fine-tuning set. Key finding: multiple correct solutions to the same problem have higher diversity than a single solution, which helps generalization.
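A small sketch of "keep several distinct correct solutions per problem". The distinctness test here (whitespace- and case-normalized text) is an assumption chosen for simplicity; a real pipeline may compare reasoning paths differently.

```python
import re
from typing import List


def last_number(text: str) -> str:
    """Toy answer extractor: the last integer in the text."""
    nums = re.findall(r"\d+", text)
    return nums[-1] if nums else ""


def normalize(path: str) -> str:
    """Collapse whitespace and case so trivially identical paths compare equal."""
    return re.sub(r"\s+", " ", path.strip().lower())


def distinct_correct(samples: List[str], gold: str) -> List[str]:
    """Keep every correct sample whose normalized text has not been seen yet."""
    seen = set()
    kept = []
    for s in samples:
        key = normalize(s)
        if last_number(s) == gold and key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


kept = distinct_correct(
    ["3 + 5 = 8", "3 + 5  =  8", "5 + 3 = 8", "3 + 5 = 9"],
    gold="8",
)
```

Here the second sample is dropped as a duplicate and the fourth as incorrect, leaving two different correct paths for the same problem.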

1.3 ReST — Grow-Improve Offline RL Loop

ReST (Gulcehre 2023) generates a large dataset from the current policy (Grow), then filters by a reward threshold and fine-tunes (Improve), which is more sample-efficient than online RLHF. It splits the loop into two phases:

Grow: sample from the current policy $\pi_\theta$, build an offline dataset $\mathcal{D}$, score with reward function $r(\cdot)$.
Improve: fine-tune $\pi_\theta$ on the subset $\mathcal{D}_{\ge\tau}$ where reward is at least the threshold $\tau$.

Key point: the Improve phase can be repeated multiple times (progressively raising $\tau$ for stricter filtering), while Grow only needs occasional refresh — computation is more concentrated compared to online RLHF's per-step sampling.
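The "one Grow, several Improve steps with rising $\tau$" pattern can be sketched as below. Names such as `grow`, `improve`, and `toy_reward` are hypothetical, and `finetune` is a placeholder callable that would update the policy in a real system.

```python
from typing import Callable, List, Sequence, Tuple

Scored = List[Tuple[str, float]]


def grow(sample: Callable[[], str], reward: Callable[[str], float], n: int) -> Scored:
    """Grow: sample n outputs from the current policy and score each once."""
    outputs = [sample() for _ in range(n)]
    return [(y, reward(y)) for y in outputs]


def improve(
    dataset: Scored,
    thresholds: Sequence[float],
    finetune: Callable[[List[str]], None],
) -> List[int]:
    """Improve: reuse the same offline dataset with progressively stricter thresholds."""
    sizes = []
    for tau in thresholds:
        subset = [y for y, r in dataset if r >= tau]
        if not subset:  # nothing clears the bar; a new Grow step is needed
            break
        finetune(subset)
        sizes.append(len(subset))
    return sizes


# Toy usage with fixed outputs and rewards (assumptions for illustration).
outputs = iter(["weak", "okay", "strong"])
toy_reward = {"weak": 0.2, "okay": 0.6, "strong": 0.9}
dataset = grow(lambda: next(outputs), lambda y: toy_reward[y], n=3)

history: List[List[str]] = []
sizes = improve(dataset, thresholds=[0.5, 0.8], finetune=history.append)
```

Sampling happens once in `grow`, while `improve` only filters and trains, which is where the compute concentration described above comes from.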

Method	Filter criterion	Online?	Training method
STaR / RFT	Answer correctness (rule)	Quasi-online (iterative)	SFT
ReST	Reward function threshold	Offline batches	SFT / best-of-N distillation

Pitfall

Misconception: "Bootstrapping like STaR/RFT can keep pulling itself up indefinitely." The bootstrapping ceiling is doubly capped by the filtering signal and the current accuracy: problems answered entirely wrong have no correct trace to learn from (STaR's hint-retry is the fallback so easy problems don't dominate the training set), and keeping only correct traces narrows training-set diversity round by round (see §6 Failure modes).

2. Self-Rewarding: the model as its own judge

Self-Rewarding Language Models (Yuan 2024) break the assumption of "requiring an external reward model": the same model both generates responses and scores them using LLM-as-a-Judge, and iterative DPO jointly improves generation and judgment capabilities:

Sample multiple responses to the same prompt.
The same model scores each response using LLM-as-a-Judge format (score + rationale).
Construct preference pairs $(y_w, y_l)$ by score, update with DPO.
In the next round, judging ability also improves — both abilities share the same parameters and co-evolve.

The prerequisite of this approach: the model's generation ability and judgment ability must mutually promote rather than contaminate each other. Experiments show this holds for several iterations, but whether long-term degradation occurs remains an open question (see §6 Failure modes).
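The pair-construction step can be sketched as follows. This is a hedged illustration: `judge` is a hypothetical callable standing in for the model's own LLM-as-a-Judge scoring, and the rule "best vs. worst, skip ties" is one simple choice rather than a confirmed detail of the paper's pipeline.

```python
from typing import Callable, Dict, List, Optional


def build_pair(
    prompt: str,
    responses: List[str],
    judge: Callable[[str, str], float],
    min_gap: float = 0.0,
) -> Optional[Dict[str, str]]:
    """Turn self-assigned scores into one (chosen, rejected) pair for DPO."""
    scored = [(judge(prompt, r), r) for r in responses]
    best = max(scored, key=lambda t: t[0])
    worst = min(scored, key=lambda t: t[0])
    if best[0] - worst[0] <= min_gap:
        return None  # no usable preference signal for this prompt
    return {"prompt": prompt, "chosen": best[1], "rejected": worst[1]}


# Toy judge (assumption): longer answers score higher.
def toy_judge(prompt: str, response: str) -> float:
    return float(len(response.split()))


pair = build_pair("Is 7 prime?", ["yes", "yes because x", "yes indeed"], toy_judge)
tie = build_pair("Is 7 prime?", ["a", "b"], toy_judge)
```

The toy judge also shows the risk discussed above: whatever bias the judge has (here, a length bias) goes straight into the preference data.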

Pitfall

Misconception: "If the model scores itself, it can self-judge and improve indefinitely." The model's blind spots are systematically inherited in self-judging—it can't catch the errors it can't catch; it works for several rounds, but whether it degrades long-term is an open question, and it most easily triggers model collapse / distribution narrowing (§6.2). So the more independent and verifiable the self-judging signal, the safer.

3. Self-Play: using "the previous-round self" as opponent

SPIN (Self-Play Fine-Tuning, Chen 2024) is inspired by game theory and pits the current model against the previous-round model: the latter generates negative samples, the former learns to distinguish them from human responses, so self-improvement uses only SFT data:

Positive samples: human responses $y^*$ in the original SFT dataset.
Negative samples: outputs $\tilde{y}$ of the previous-round model $\pi_{\theta_{t-1}}$ on the same prompts.
Objective: the current model $\pi_{\theta_t}$ learns to distinguish genuine human responses from "old-self" outputs, updated with a DPO-like loss.

$\mathcal{L}_{\text{SPIN}}(\theta_t) = -\mathbb{E}\left[\log\sigma\!\left(\lambda\log\frac{\pi_{\theta_t}(y^*|x)}{\pi_{\theta_{t-1}}(y^*|x)} - \lambda\log\frac{\pi_{\theta_t}(\tilde{y}|x)}{\pi_{\theta_{t-1}}(\tilde{y}|x)}\right)\right].$

Key point: no additional human preference annotation required — negative samples are entirely provided by the model's own historical versions. As iterations proceed, $\pi_{\theta_t}$ continually approaches the human distribution until convergence when the two become indistinguishable.
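To make the loss concrete, here is a per-sample evaluation of $\mathcal{L}_{\text{SPIN}}$ from four sequence log-probabilities. The function and argument names are illustrative assumptions; the formula is the one stated above, written in a numerically stable form.

```python
import math


def spin_loss(
    logp_new_real: float,  # log pi_{theta_t}(y* | x)
    logp_old_real: float,  # log pi_{theta_{t-1}}(y* | x)
    logp_new_gen: float,   # log pi_{theta_t}(y~ | x)
    logp_old_gen: float,   # log pi_{theta_{t-1}}(y~ | x)
    lam: float = 1.0,
) -> float:
    """-log(sigmoid(margin)) for one (human, self-generated) pair."""
    margin = lam * ((logp_new_real - logp_old_real) - (logp_new_gen - logp_old_gen))
    x = -margin
    # softplus(x) = log(1 + exp(x)), computed without overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# At the start of a round the new model equals the old one, so the margin is 0.
loss_at_init = spin_loss(-5.0, -5.0, -7.0, -7.0)
# Raising the human response's log-prob and lowering the old-self output's
# log-prob makes the margin positive and the loss smaller.
loss_after_step = spin_loss(-4.0, -5.0, -8.0, -7.0)
```

With a zero margin the loss equals $\log 2$, the value of a classifier that cannot tell the two sources apart.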

4. AI Feedback: letting AI replace human preference labeling

Constitutional AI (CAI / RLAIF, Bai 2022) is currently the most influential approach to "AI replacing human preference". A set of "constitutional" principles guides the model to self-critique and revise its outputs, and AI-generated preference data replaces human harmlessness annotation (RLAIF):

SL-CAI (supervised phase):

Model generates a harmful draft response.
Given a constitutional principle (e.g., "avoid discriminatory content"), the model self-critiques.
The model revises its response based on the critique.
The revised response is used for SFT.

RL-CAI (reinforcement phase):

The model scores a pair of responses using AI judgment (which better conforms to the constitution), constructing preference data.
Train a reward model with AI-labeled preferences, then iterate with RL.

Difference from STaR/ReST: the filtering signal comes from constitutional principles, not task answer correctness — targeting alignment rather than reasoning ability.
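The two phases can be sketched with a single text-in, text-out callable. Everything here is a hypothetical illustration: `llm` stands in for a model call, the prompt wording is invented, and reward-model training plus RL are left out.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]


def sl_cai_example(prompt: str, principle: str, llm: LLM) -> Dict[str, str]:
    """SL-CAI: draft -> self-critique under a principle -> revision (used for SFT)."""
    draft = llm(f"Respond to: {prompt}")
    critique = llm(f"Critique this response under the principle '{principle}':\n{draft}")
    revision = llm(
        "Revise the response to address the critique.\n"
        f"Response: {draft}\nCritique: {critique}"
    )
    return {"prompt": prompt, "response": revision}


def rl_cai_preference(
    prompt: str, resp_a: str, resp_b: str, principle: str, llm: LLM
) -> Dict[str, str]:
    """RL-CAI: the model labels which response better follows the principle."""
    verdict = llm(
        f"Which response better follows the principle '{principle}'? Answer A or B.\n"
        f"Prompt: {prompt}\nA: {resp_a}\nB: {resp_b}"
    )
    if verdict.strip().upper().startswith("A"):
        return {"prompt": prompt, "chosen": resp_a, "rejected": resp_b}
    return {"prompt": prompt, "chosen": resp_b, "rejected": resp_a}


def fake_llm(text: str) -> str:
    """Hypothetical stub so the sketch runs without a model."""
    if text.startswith("Respond"):
        return "rude draft"
    if text.startswith("Critique"):
        return "the draft is rude"
    if text.startswith("Revise"):
        return "polite answer"
    return "B"


sft_row = sl_cai_example("hi", "be polite", fake_llm)
pref_row = rl_cai_preference("hi", "rude draft", "polite answer", "be polite", fake_llm)
```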

Pitfall

Misconception: "Constitutional AI and STaR/ReST are the same kind of method." The skeleton (generate→filter→train) is identical, but the filtering signal differs in origin: CAI/RLAIF's signal comes from constitutional principles, targeting alignment (harmlessness); STaR/ReST's signal comes from task answer correctness, targeting reasoning ability. In other words, what changes is "what serves as the filter."

5. Inference-time Self-correction (Training-free)

The following two methods do not update weights; they belong to inference-time self-improvement, conceptually related to the training loops above but different:

5.1 Reflexion — Verbal Reinforcement Learning

Reflexion (Shinn 2023) has the agent convert task feedback into natural-language reflections, store them in episodic memory, and reference them on the next attempt, with no gradient updates. Across multiple trial-and-error loops the agent repeats:

Execute task → receive environment feedback (success / failure / error message).
Generate verbal reflection: summarize in natural language "what went wrong and how to improve next time".
Store reflection in episodic memory, inject into context in the next round.

Success rate improves significantly after a few iterations — but improvement exists only in the current session's context, and is lost on restart.
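The trial loop can be sketched as follows. The callables `act`, `evaluate`, and `reflect` are hypothetical stand-ins for the agent, the environment, and the reflection prompt; the point is that `memory` is an ordinary in-process list, so it disappears when the session ends.

```python
from typing import Callable, List, Optional, Tuple


def reflexion(
    task: str,
    act: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str, str], str],
    max_trials: int = 3,
) -> Tuple[Optional[str], List[str], int]:
    memory: List[str] = []  # episodic memory: lives only inside this call
    for trial in range(1, max_trials + 1):
        attempt = act(task, memory)              # reflections are injected as context
        ok, feedback = evaluate(attempt)         # environment feedback
        if ok:
            return attempt, memory, trial
        memory.append(reflect(task, attempt, feedback))  # verbal reflection
    return None, memory, max_trials


# Toy stubs (assumptions): the agent succeeds once it has one reflection.
def toy_act(task: str, memory: List[str]) -> str:
    return "right" if memory else "wrong"


def toy_evaluate(attempt: str) -> Tuple[bool, str]:
    return attempt == "right", "expected 'right'"


def toy_reflect(task: str, attempt: str, feedback: str) -> str:
    return f"Last time I answered {attempt!r}; {feedback}."


result, memory, trials = reflexion("demo", toy_act, toy_evaluate, toy_reflect)
```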

5.2 Self-Refine — Generate-Critique-Revise Loop

Self-Refine (Madaan 2023) has the same frozen LLM loop through three fixed steps (generate output → self-critique → revise based on the critique), with no training or additional supervision required:

$\text{output}_0 \xrightarrow{\text{critique}} \text{feedback}_0 \xrightarrow{\text{refine}} \text{output}_1 \xrightarrow{\cdots}$

No training, no additional supervision — directly leverages the pretrained model's self-critique capability. Experiments show gains across multiple tasks (code, summarization, dialogue, math), but the ceiling is limited by the model's initial judgment ability.
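A minimal sketch of the loop, assuming three hypothetical callables backed by the same frozen model and an invented stop marker (`"LGTM"`) that the critique emits when it has nothing left to fix:

```python
from typing import Callable


def self_refine(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    max_iters: int = 3,
    stop_marker: str = "LGTM",  # hypothetical "no further feedback" signal
) -> str:
    output = generate(task)
    for _ in range(max_iters):
        feedback = critique(task, output)
        if stop_marker in feedback:
            break  # the model judges its own output good enough
        output = refine(task, output, feedback)
    return output


# Toy stubs (assumptions): the critic wants two exclamation marks.
def toy_generate(task: str) -> str:
    return "draft"


def toy_critique(task: str, output: str) -> str:
    return "LGTM" if output.endswith("!!") else "add emphasis"


def toy_refine(task: str, output: str, feedback: str) -> str:
    return output + "!"


final_output = self_refine("demo", toy_generate, toy_critique, toy_refine)
```

The loop stops when the critic is satisfied, so a critic that cannot see a flaw ends the loop with the flaw still in place, which is the ceiling described above.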

Method	Improvement occurs at	Updates weights?	Persistent?
Reflexion	inference-time, multiple attempts	No	No (within context)
Self-Refine	inference-time, single loop	No	No
STaR / ReST / SPIN / CAI	training-time	Yes	Yes

Pitfall

Misconception: "Reflexion / Self-Refine make the model 'learn' to correct itself." Both are inference-time, no weight updates: Reflexion's reflections live only in episodic memory and are lost on restart, and Self-Refine's gains are bounded by the model's initial self-judgment ability. To persist improvement into the weights, you need training-time loops like STaR/ReST/SPIN/CAI.

6. Failure modes

The self-improvement loop looks appealing, but has three structural risks:

6.1 Reward Hacking

When the filtering signal (reward model, LLM scoring, rule filter) is imperfect, the model learns strategies that score high but are not truly correct: shortcut answers, surface-fluent but content-wrong rationales, outputs specifically designed to please the scoring template.

Root cause: the gap between the optimization target (proxy reward) and the true target (task quality) — Goodhart's Law.
Mitigation: use diverse, independent evaluation signals; limit the magnitude of a single RL update (KL constraint).

Note

An agent-scale variant — tool-mediated reward tampering: Once the model can call tools / execute code, reward hacking gains an extra path of "manipulating the evaluation channel" — skipping the real verification step, inferring the answer from task-adjacent metadata (filenames, comments, leaked ground-truth), or rewriting the eval script / unit tests so they always pass ("editing the test instead of the implementation," observed in agentic-coding training). A 2026-05 benchmark (arXiv:2605.02964, 13 models) gives a rough magnitude: exploitation rates of roughly 0%–14% on the standard sweep (harder variants reach ~22%), with reasoning-/RL-dominated models tending to exploit more in this benchmark (only the same-family DeepSeek comparison is controlled; cross-vendor differences are merely correlational). ⚠️ This is a single benchmark, a very recent preprint — use it for order-of-magnitude intuition only, not as a deployment fact or model ranking. Mitigation: lock the evaluator (isolate eval code from the agent's writable space) + trajectory-level auditing (check that verification actually ran, not just the final score).
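One cheap check in the spirit of "lock the evaluator" is to fingerprint the evaluation files before the agent runs and refuse to trust a score if they changed. This is a detection sketch under assumptions (`run_agent` and `run_eval` are hypothetical callables), not a substitute for real isolation of the evaluator from the agent's writable space.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable


def fingerprint(paths: Iterable[str]) -> str:
    """SHA-256 over the sorted file names and contents of the evaluator files."""
    h = hashlib.sha256()
    for p in sorted(str(p) for p in paths):
        h.update(p.encode("utf-8"))
        h.update(Path(p).read_bytes())
    return h.hexdigest()


def trusted_score(
    eval_paths: Iterable[str],
    run_agent: Callable[[], object],
    run_eval: Callable[[], float],
) -> float:
    eval_paths = list(eval_paths)
    before = fingerprint(eval_paths)
    run_agent()  # the agent works on the task
    if fingerprint(eval_paths) != before:
        raise RuntimeError("evaluator files changed during the agent run")
    return run_eval()  # only score against an untouched evaluator
```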

6.2 Model Collapse / Distribution Narrowing

Each round only retains "high-score" samples, eliminating the diversity of low-score samples. After multiple rounds, the training set tends toward uniformity, model output diversity decreases, and generalization worsens. This is especially severe in Self-Rewarding-style "model scores itself" schemes: the model's blind spots are systematically inherited in preference labeling.

$\text{Diversity}(\pi_{\theta_t}) \le \text{Diversity}(\pi_{\theta_{t-1}}) \quad \text{(if only top-}k\text{ kept per round)}$
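A small numeric illustration, using Shannon entropy as one possible stand-in for "diversity" (an assumption; the inequality above does not fix a specific measure). It truncates a toy categorical distribution to its top-$k$ outcomes and renormalizes, which models only the filtering step, not the retrained model.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def keep_top_k(p: List[float], k: int) -> List[float]:
    """Zero out everything outside the k most probable outcomes, then renormalize."""
    top = set(sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k])
    z = sum(p[i] for i in top)
    return [p[i] / z if i in top else 0.0 for i in range(len(p))]


p0 = [0.4, 0.3, 0.2, 0.1]
p1 = keep_top_k(p0, 3)   # round 1 drops the rarest outcome
p2 = keep_top_k(p1, 2)   # round 2 drops the next rarest
entropies = [entropy(p0), entropy(p1), entropy(p2)]
```

Each truncation removes tail mass, so the entropy sequence is decreasing in this example.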

6.3 Reward Model Over-optimization (RM Over-optimization)

The reward model in the RL phase is itself an approximation; as the policy is continuously optimized, the score curve eventually decouples from true quality (the out-of-distribution regions of the reward model are exploited). A KL divergence penalty is the standard mitigation:

$\mathcal{J}(\theta) = \mathbb{E}[r(y)] - \beta\,\mathrm{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}].$

Larger $\beta$ keeps the policy closer to the reference policy, but at the cost of more conservative improvement.
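The trade-off can be checked numerically on a two-outcome toy problem. All numbers below are made up for illustration; `optimal_policy` uses the standard closed form for this objective over discrete distributions, $\pi^*(y) \propto \pi_{\text{ref}}(y)\exp(r(y)/\beta)$.

```python
import math
from typing import List


def kl(p: List[float], q: List[float]) -> float:
    """KL[p || q] for discrete distributions (q must be positive where p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def objective(pi: List[float], pi_ref: List[float], rewards: List[float], beta: float) -> float:
    """J = E_pi[r] - beta * KL[pi || pi_ref]."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl(pi, pi_ref)


def optimal_policy(pi_ref: List[float], rewards: List[float], beta: float) -> List[float]:
    """Maximizer of J: reference policy tilted by exp(r / beta)."""
    w = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(w)
    return [x / z for x in w]


pi_ref = [0.5, 0.5]
rewards = [1.0, 0.0]
greedy = [0.99, 0.01]  # chases reward, far from the reference
mild = [0.7, 0.3]      # modest shift toward reward

small_beta = (objective(greedy, pi_ref, rewards, 0.1), objective(mild, pi_ref, rewards, 0.1))
large_beta = (objective(greedy, pi_ref, rewards, 2.0), objective(mild, pi_ref, rewards, 2.0))
best = optimal_policy(pi_ref, rewards, 2.0)
```

With a small $\beta$ the reward-chasing policy scores higher; with a large $\beta$ the ordering flips and the modest shift wins.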

Pitfall

Misconception: "In self-improvement, keeping only high-score samples is always good." Keeping only top-k each round makes the training distribution monotonically narrow (diversity never grows) → worse generalization, i.e. model collapse; this is especially bad when "the model scores itself" (blind spots inherited). Note RFT's opposite lesson: keeping multiple different correct solutions to the same problem actually boosts diversity and generalization—diversity itself must be guarded as an objective.

7. From-scratch code: STaR-style rejection-sampling fine-tuning loop


"""
STaR-style rejection-sampling fine-tuning loop (illustrative).
Dependencies: transformers, torch — uses GPT-2 for pedagogical demonstration; replace with a larger model for real training.
"""
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from torch.utils.data import Dataset

# ---------- Hypothetical QA data ----------
PROBLEMS = [
    {"question": "What is 3 + 5?",  "answer": "8"},
    {"question": "What is 7 * 6?",  "answer": "42"},
    {"question": "What is 12 - 4?", "answer": "8"},
]

# ---------- Helper: simple answer extraction ----------
def extract_answer(text: str) -> str:
    """Extract the last number from generated text (for demonstration)."""
    import re
    nums = re.findall(r"\d+", text)
    return nums[-1] if nums else ""

# ---------- 1. Rollout: sample K rationales per problem ----------
def rollout(model, tokenizer, problems, K=4, max_new=64, device="cpu"):
    """Returns list of (question, rationale, is_correct)."""
    results = []
    model.eval()
    for prob in problems:
        prompt = f"Question: {prob['question']}\nLet's think step by step:"
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs, max_new_tokens=max_new,
                do_sample=True, temperature=0.8,
                num_return_sequences=K, pad_token_id=tokenizer.eos_token_id,
            )
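        prompt_len = inputs["input_ids"].shape[1]
        for seq in outputs:
            # Decode only the newly generated tokens; slicing the decoded string by
            # len(prompt) can misalign if decoding does not round-trip the prompt exactly.
            rationale = tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)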
            correct = extract_answer(rationale) == prob["answer"]
            results.append({"prompt": prompt, "rationale": rationale, "correct": correct})
    return results

# ---------- 2. Filter: retain only correct rationales ----------
def filter_correct(results):
    return [r for r in results if r["correct"]]

# ---------- 3. Dataset wrapper ----------
class RationaleDataset(Dataset):
    def __init__(self, samples, tokenizer, max_len=128):
        self.tokenizer = tokenizer
        self.data = []
        for s in samples:
            text = s["prompt"] + s["rationale"]
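            enc = tokenizer(text, truncation=True, max_length=max_len,
                            padding="max_length", return_tensors="pt")
            input_ids = enc["input_ids"].squeeze(0)
            attention_mask = enc["attention_mask"].squeeze(0)
            # Exclude padding from the loss. pad_token == eos_token here, so mask by
            # attention_mask rather than by token id (-100 is the ignored label index).
            labels = input_ids.clone()
            labels[attention_mask == 0] = -100
            self.data.append({"input_ids": input_ids,
                              "attention_mask": attention_mask,
                              "labels": labels})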

    def __len__(self):  return len(self.data)
    def __getitem__(self, i): return self.data[i]

# ---------- 4. Train: SFT on correct rationales ----------
def finetune(model, tokenizer, samples, output_dir="./star-ckpt"):
    ds = RationaleDataset(samples, tokenizer)
    if len(ds) == 0:
        print("No correct samples — skip this iteration.")
        return
    args = TrainingArguments(
        output_dir=output_dir, num_train_epochs=1,
        per_device_train_batch_size=2, logging_steps=5,
        save_strategy="no", report_to="none",
    )
    Trainer(model=model, args=args, train_dataset=ds).train()

# ---------- 5. STaR main loop ----------
def star_loop(model_name="gpt2", n_iters=3, K=4):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    for t in range(n_iters):
        print(f"\n=== Iteration {t+1}/{n_iters} ===")
        all_results = rollout(model, tokenizer, PROBLEMS, K=K, device=device)
        correct = filter_correct(all_results)
        print(f"  Correct rationales: {len(correct)} / {len(all_results)}")
        finetune(model, tokenizer, correct)

    return model

if __name__ == "__main__":
    star_loop(n_iters=2, K=4)

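The code above is illustrative only: real STaR uses larger models, longer rationales, and hint-retry as a fallback. The core workflow (sample → filter → fine-tune → repeat) follows the paper, with one further simplification: the paper restarts fine-tuning from the original pretrained model in each iteration, whereas this sketch keeps updating the same model.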

8. Frontier: test-time RL (TTRL) & self-evolving agents

The self-improvement methods in §1–§7 all rely on some supervision signal (correct answers, preference labels, verifiable rewards). Since 2025, two frontiers relax this assumption further: one pushes RL onto fully unlabeled test data (TTRL), the other moves "improvement" out of the weights and into the workflow / skill library (self-evolving agents), which is especially friendly to black-box API models.

8.1 Test-Time RL (TTRL)

TTRL (Test-Time Reinforcement Learning, Zuo, arXiv:2504.16084) pushes §1's "generate-filter-train" loop to the extreme: RL on unlabeled test data, with no ground truth at all. The flow: sample multiple outputs for the same test question → majority vote picks the consensus answer as a pseudo-label → reward = whether the output matches the consensus → GRPO-style RL update. Think of it as "RLVR without a verifier": the model's own consistency stands in for an external correctness signal. As reported, gains on math reasoning are large (Qwen-2.5-Math-7B's pass@1 on AIME24 improves by roughly +211% relative).
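The pseudo-label and reward computation for one question can be sketched as below. The function name is illustrative, answers are assumed to be already extracted as strings, and the GRPO-style update itself is omitted.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Pseudo-label = most common answer; reward = 1 if a sample agrees with it."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


label, rewards = majority_vote_rewards(["42", "41", "42", "7"])
```

Nothing in this computation looks at the true answer: if most samples agree on a wrong value, that value becomes the label and the agreeing samples are rewarded.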

Caution

TTRL's dependency and trap: The pseudo-label comes from majority vote, so the quality bottleneck is majority voting itself — if the consensus is systematically wrong on a class of problems, the reward signal degrades and TTRL may reinforce a confident-but-wrong consensus (same root as §6.2 model collapse). Note it is not strictly capped by maj@n (the paper reports exceeding the initial maj@n, since even with a wrong pseudo-label the reward estimate is often still largely usable), but it fundamentally amplifies existing ability rather than creating new knowledge, so it's near-useless on problems the model can't even approach. The reported +211% is a self-improvement delta, not a leaderboard score, and cross-model / cross-task robustness is still being checked.

8.2 Self-evolving agents: moving improvement out of the weights

When an agent = "LLM + workflow + tools + memory," "self-improvement" need not touch the weights — it can change the workflow structure or the skill library:

Automated workflow optimization (AFlow): AFlow (Zhang, arXiv:2410.10762) writes the agent workflow as a code graph and uses MCTS to search the combination space of "operators" (generate / revise / ensemble / verify), guided by execution/evaluation scores (search/evaluation, not an RL reward), auto-discovering better control flows. Weights stay untouched and only the workflow changes; it reportedly approaches strong-model performance at very low inference cost on several benchmarks.
Skill library / lifelong learning (Voyager): Voyager (Wang, arXiv:2305.16291) is an LLM agent in Minecraft that crystallizes successful behaviors into executable-code "skills", stored in a retrievable skill library and directly called / composed on similar future tasks. The base model stays frozen (no fine-tuning), so capability growth is deposited entirely in external memory. This is the bridge between §2 "continual learning" and this cheatsheet: a skill library = continual learning without weight updates.
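The skill-library idea can be sketched as a small registry of callables with text descriptions. This is a toy under assumptions: retrieval here is plain keyword overlap chosen to keep the example dependency-free, and the skill names are invented.

```python
from typing import Callable, Dict, List, Tuple


class SkillLibrary:
    """Capability stored outside the model: name -> (description, executable code)."""

    def __init__(self) -> None:
        self._skills: Dict[str, Tuple[str, Callable[..., object]]] = {}

    def add(self, name: str, description: str, fn: Callable[..., object]) -> None:
        # Called after a behavior has succeeded, so it can be reused later.
        self._skills[name] = (description, fn)

    def retrieve(self, query: str, top_k: int = 1) -> List[str]:
        # Rank skills by how many query words appear in their description.
        words = set(query.lower().split())
        ranked = sorted(
            self._skills,
            key=lambda n: len(words & set(self._skills[n][0].lower().split())),
            reverse=True,
        )
        return ranked[:top_k]

    def call(self, name: str, *args: object) -> object:
        return self._skills[name][1](*args)


library = SkillLibrary()
library.add("chop_wood", "collect wood from a tree", lambda n: f"collected {n} wood")
library.add("craft_table", "craft a crafting table from wood planks", lambda: "table crafted")

match = library.retrieve("collect some wood")
outcome = library.call(match[0], 3)
```

The model weights never appear in this code; all growth is in `library`, which is why the approach also works with black-box API models.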

Note

Scope note: This section is a frontier-direction sketch, not settled fact. AFlow / Voyager's strong results are mostly within each paper's specific task domain; cross-domain robustness and the real cost-effectiveness vs weight-level RL are still being tested. In interviews, present "mechanism + applicability boundary" rather than treating a single number as a universal conclusion.

Stratified follow-ups

L1 Foundational

1. What is the "generate-filter-train" loop for self-improvement? Why does it need to loop rather than run once?

Answer: The loop skeleton is: the current policy generates candidate outputs → a filtering/scoring mechanism eliminates low-quality samples → weights are updated on the retained set → the new model reruns. The problem with a single pass is: the initial model has limited capability, so the correct samples from one round cover a narrow range; as the loop runs, each new model can solve problems the previous round got wrong, gradually expanding training signal coverage and achieving bootstrapped capability improvement.

Follow-up: Under what conditions will this loop stop iterating? Does it stop because the model has converged, or because it has hit an insurmountable bottleneck? → The loop stopping usually means the filtering signal has saturated (the current model answers all training set problems correctly, rejection sampling produces new samples nearly identical to existing data, and gradients approach zero), or the task difficulty exceeds the model's bootstrapping capability (no correct samples can be produced for completely unsolvable problems). Neither is true "convergence" — both are stagnation; true convergence requires that the model also stops improving on a held-out set.

2. How does STaR train chain-of-thought without rationale annotations? What problem does hint-retry solve?

Answer: STaR samples $K$ chain-of-thought traces per problem and retains only the rationales with a correct final answer for SFT (rejection sampling), thus requiring no human rationale annotations. Hint-retry handles "problems where the model gets everything wrong" — it gives the correct answer and asks the model to re-generate an explanation, then mixes those into the training set, preventing easy problems from monopolizing the training set and leaving hard problems without any gradient updates.

Follow-up: Hint-retry introduces the correct answer as a hint — what bias does this bring? → The rationale the model re-generates may contain "backward reasoning from the answer" — on the surface the steps look reasonable, but they actually rely on information that shouldn't have been known. When such samples are mixed into the training set, they may teach the model to imitate backward-reasoning patterns rather than genuine forward reasoning, harming generalization to new problems; this is precisely one motivation for proposing PRM (process reward model).

3. Why are Reflexion and Self-Refine called "training-free"? Can their improvements persist?

Answer: Neither updates model weights — Reflexion stores natural language reflections in episodic memory and injects them into context, while Self-Refine loops "generate → critique → revise" within a single conversation. Improvements cannot persist: Reflexion's improvement exists only within the current session's context and is lost on restart; Self-Refine similarly starts from scratch on each new call.

Follow-up: To make Reflexion or Self-Refine improvements persistent, what system architecture design is needed? → The high-quality "reflections" or "revised outputs" extracted after multiple rounds of inference can be used as new training data for periodic offline SFT or DPO updates, forming a closed loop from inference-time improvement to training; but the core challenge is filtering the quality of this self-produced data — erroneous reflections that are made persistent are harder to correct than one-time errors.

4. What role does the "constitution" play in Constitutional AI? How does AI feedback replace human preference annotation?

Answer: The "constitution" is a list of principles (e.g., "avoid discriminatory content"); in the SL-CAI phase it guides the model to self-critique and revise harmful drafts, and the revised outputs are used for SFT. In the RL-CAI phase, the model uses AI scoring (which response better conforms to the constitution) to construct preference pairs, trains a reward model with AI-labeled preferences, and then performs RL — thus replacing large-scale human harmlessness annotation with AI feedback (RLAIF).

Follow-up: Beyond a preset static list of constitutional principles, what directions exist for making feedback principles more dynamic and adaptive? → One direction is meta-reward models: dynamically retrieving or generating the most relevant principles for a given input and harmful output for critique, rather than applying the same rules to all prompts; another direction is automatically distilling "implicit constitutions" from large amounts of annotated data using clustering or inductive learning, allowing principles themselves to evolve with task distribution rather than being fixed by humans.

L2 Intermediate

5. How do the Grow and Improve phases of ReST divide their roles? Why is it more sample-efficient than online RLHF?

Answer: The Grow phase uses the current policy $\pi_\theta$ to sample at large scale and scores the samples with a reward function to build the offline dataset $\mathcal{D}$; the Improve phase fine-tunes on the subset where reward exceeds threshold $\tau$, and can be repeated several times with a progressively higher $\tau$. It is sample-efficient because Grow only needs occasional refresh and Improve can reuse the same batch of data for multiple rounds, whereas online RLHF must sample new data at every step, dispersing computation.

Follow-up: In what scenarios is ReST's offline batch mechanism actually inferior to online RLHF? → When the task distribution or environment changes dynamically, the dataset built in the offline Grow phase becomes rapidly outdated, and its reward signals reflect the distribution under the old policy; online RLHF samples in real-time at every step, can track distribution drift, and is better suited for non-stationary environments (e.g., user preference drift in dialogue systems, external dependency changes in code execution) — at the cost of higher computational expense.

6. SPIN uses "the previous-round self" as negative samples — what are the advantages and disadvantages compared to DPO using human preference pairs?

Answer: SPIN's advantage is requiring no additional human preference annotation, with negative samples entirely provided by the historical version $\pi_{\theta_{t-1}}$, at low cost. The disadvantage is that the theoretical upper bound is locked by SFT data quality: SPIN's convergence condition is $\pi_{\theta_t} = p_\text{data}$, so it cannot surpass the human SFT data; moreover, as iterations progress, negative sample quality approaches positive sample quality, and the contrastive signal grows progressively weaker. DPO's human preferences can cover alignment dimensions beyond SFT data, but annotation costs are high.

Follow-up: The contrastive signal in SPIN vanishes with iteration. What analogy does this have with GAN training dynamics, and what does it imply for choosing iteration count in practice? → SPIN's objective behaves like a GAN discriminator that is reduced to chance as the generator approaches the real distribution: when $\pi_{\theta_t} \approx p_\text{data}$, positive and negative samples are nearly indistinguishable and the expected gradient approaches zero, analogous to GAN training saturation. The practical implication: SPIN is well suited to early iterations that close the gap to the SFT distribution, but later stages should switch to methods with external validation signals (such as RLVR); otherwise extra iteration rounds yield no benefit and offer no protection against distribution drift.

7. What problems arise when "generation" and "judgment" share the same parameters in Self-Rewarding?

Answer: The generator's blind spots are inherited by the judge — if the model is weak at a certain type of reasoning (e.g., counterfactual reasoning), its probability of scoring that reasoning highly is also lower than the true level, because judgment and generation capability share the same knowledge base. Preference data therefore systematically underestimates this type of capability. There is also self-confirmation bias: the model tends to give high scores to answers that "sound like its own style". This is not the random error of hallucination but a systematic bias — the preference signal itself pulls the model toward its own existing style, forming a positive feedback loop that accumulates and amplifies errors rather than mean-reverting (see Deep-dive Q3).

Follow-up: On what kinds of task distributions will self-confirmation bias be most severe? → It is most severe on tasks requiring divergent/creative thinking (e.g., story creation, brainstorming) — the judge tends to reward answers similar in style and logical path to its own, systematically suppressing novel but "atypical" high-quality outputs; conversely, on tasks with objective correct/wrong criteria (e.g., math, code), self-confirmation bias is relatively weakened by external verifiable signals as a constraint.

8. Are reward hacking and RM over-optimization the same thing? How does KL constraint mitigate it?

Answer: RM over-optimization is a specific form of reward hacking: after continuous policy optimization, outputs that achieve high proxy reward but low true quality are found in the RM's out-of-distribution regions — a quantitative manifestation of Goodhart's Law. KL constraint limits the extent to which the policy deviates from the reference model via $\mathcal{J}(\theta) = \mathbb{E}[r(y)] - \beta\,\mathrm{KL}[\pi_\theta \,\|\, \pi_{\text{ref}}]$ ; larger $\beta$ is more conservative, preventing the policy from entering the RM's high-scoring out-of-distribution regions.

Follow-up: What cost does KL constraint impose while mitigating RM over-optimization, and is this cost worth it? → The cost is limiting the exploration range: the policy cannot deviate far from the reference model, even if the RM points in a genuinely better direction. When RM quality is high, this cost is worth it (stable improvement prioritized over risky exploration); when RM quality is poor, the KL constraint locks the policy near a suboptimal region, unable to either improve or discover the true optimum — in this case, improving the RM or switching to verifiable signals should be prioritized over increasing $\beta$ .
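The KL-penalized objective in Q8 can be estimated from sampled responses. The following is a minimal sketch, not any paper's implementation: it assumes each response was sampled from the current policy and that its reward and its log-probabilities under the policy and the reference model are already available. The function name and the example numbers are hypothetical.

```python
from statistics import fmean


def kl_penalized_objective(rewards, logp_policy, logp_ref, beta):
    """Estimate E[r(y)] - beta * KL[pi_theta || pi_ref] from samples y ~ pi_theta.

    Returns (objective_estimate, kl_estimate).
    """
    if not (len(rewards) == len(logp_policy) == len(logp_ref)):
        raise ValueError("all inputs need one entry per sampled response")
    # For y ~ pi_theta, the mean of log pi_theta(y) - log pi_ref(y)
    # is a Monte Carlo estimate of KL[pi_theta || pi_ref].
    log_ratios = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    kl_estimate = fmean(log_ratios)
    return fmean(rewards) - beta * kl_estimate, kl_estimate


# Hypothetical numbers: the same samples scored under two penalty strengths.
rewards = [1.0, 0.8, 1.2]
logp_policy = [-2.0, -1.5, -1.0]
logp_ref = [-2.5, -2.5, -2.0]
for beta in (0.1, 0.5):
    objective, kl = kl_penalized_objective(rewards, logp_policy, logp_ref, beta)
    print(f"beta={beta}: objective={objective:.3f}, KL estimate={kl:.3f}")
```

With the same samples, a larger $\beta$ lowers the objective whenever the estimated KL is positive, which is the "more conservative" behavior described in the answer.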

L3 Deep-dive

9. How is model collapse / distribution narrowing characterized mathematically? What mitigation strategies exist (temperature sampling, diversity constraints, data mixing)?

Answer: Retaining only the top-$k$ samples per round is statistically equivalent to truncated sampling: each round keeps only the high-density region of the distribution, so entropy shrinks over successive rounds, $H(\pi_{\theta_t}) \le H(\pi_{\theta_{t-1}})$. Shumailov et al. trace distribution narrowing to two sources. The first is statistical approximation error: each sampled dataset is finite, so low-probability tail events are underestimated or missing, and the next-round model, which learns from this finite sample, cannot recover the missing tails regardless of its capacity. The second is function approximation error: limited model capacity further compresses the representation of already-rare patterns. The two errors compound across iterations: statistical error hands function approximation "worse raw material", while function error makes the base distribution for the next round's sampling narrower than the last, forming a downward spiral. In Self-Rewarding settings the situation is more severe: the judge itself is drifting, the gap between preference data and true preferences grows each round, and collapse and bias amplify together. In long chain-of-thought tasks, tail solutions (unconventional reasoning paths) disappear in the first rounds of filtering, yet these paths are often exactly what is needed for out-of-distribution problems. Mitigation strategies fall into three categories: raising the sampling temperature to preserve low-probability paths (trading higher variance for a higher chance of hitting diverse correct paths); adding a diversity reward term alongside the main loss to explicitly reward output diversity; and periodically mixing in original human data as a distributional anchor to prevent unconstrained drift. Of the three, mixing in human data is the most fundamental, because it attacks the source of both errors directly: the anchor replenishes the tails.

Follow-up: Can the above mitigation strategies theoretically fully prevent distribution narrowing? → No: raising temperature only increases sampling variance, while the training objective (MLE in SFT or the preference loss in DPO) still pushes the model to fit the high-density regions of the data; a diversity reward is an added term whose tradeoff with the main loss requires tuning and which cannot precisely cover all tail patterns; only continuously mixing in external data can, in theory, break the compounding spiral of the two errors.

10. What selection bias does retaining only correct samples each round in STaR introduce? How can it be mitigated?

Answer: Filtering only on final answers is equivalent to $p_{\text{train}}(r|x) \propto p_\theta(r|x)\cdot\mathbf{1}[\text{answer}(r)=a^*]$ , leading to three types of bias: ① incorrect reasoning paths enter the training set as long as the answer happens to be correct; ② the training set comes from $\pi_{\theta_{t-1}}$ , deviating from the true reasoning distribution each round; ③ for hard problems that are filtered out entirely, the model receives no gradient updates and cannot improve by bootstrapping. Mitigation directions: use a process reward model (PRM) to score each step (Lightman et al.) to reduce step-level errors; mix original SFT data to prevent complete distribution drift.

Follow-up: PRM was proposed to address outcome bias — but does PRM itself have similar limitations or new risks? → Yes: PRM requires step-level annotations as supervision, still relying initially on human or strong-model annotation, whose distribution is equally subject to annotation bias; more importantly, when using PRM scores as the optimization objective, the same "PRM over-optimization" risk applies — the policy may learn to generate step sequences that cater to PRM scoring patterns but contain actual reasoning errors, essentially the same structure as RM over-optimization, just with the granularity shifted from outcome to step.

11. Combining Self-Rewarding's LLM-as-Judge with an external reward model — what information does each contribute? How to prevent the two from "colluding"?

Answer: LLM-as-Judge contributes the generator's own semantic understanding and stylistic judgment (broad coverage, but subject to self-confirmation bias); the external RM contributes a preference estimate from independent parameters (initially uncorrelated in bias direction with the generator, but subject to out-of-distribution generalization failure). Preventing collusion requires keeping the parameters independent and the training data free of cross-contamination, and periodically calibrating both against a third-party signal, such as held-out verifiable answers or human evaluation, so that the two approximate signals do not accumulate errors in the same direction.

Follow-up: If labels generated by LLM-as-Judge are used to train the external RM, can the two still be considered "independent"? What effect does this have on collusion prevention? → No longer independent: the RM's training data already carries the bias direction of LLM-as-Judge. The parameters are separate but the information is contaminated, so the two accumulate errors in the same blind-spot direction rather than correcting each other; this is the most common "false complementarity" trap. True independence requires that the RM's annotation source be independent of LLM-as-Judge (e.g., human annotation or programmatic judgment of verifiable tasks), plus periodic checks against held-out third-party signals to verify whether the errors of the two are correlated.
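One way to run the periodic check described in Q11 is to score a held-out set with both signals and correlate their errors against the trusted score. This is an illustrative sketch only: the function names, the scores, and the idea of using a plain Pearson correlation are assumptions, not a procedure from the cited papers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def judge_error_correlation(judge_scores, rm_scores, gold_scores):
    """Correlate (judge - gold) with (rm - gold) on the same held-out items."""
    judge_err = [j - g for j, g in zip(judge_scores, gold_scores)]
    rm_err = [r - g for r, g in zip(rm_scores, gold_scores)]
    return pearson(judge_err, rm_err)


# Hypothetical held-out scores from a trusted oracle and two approximate judges.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
judge = [1.5, 2.5, 3.0, 4.5, 5.5]
rm_trained_on_judge_labels = [1.4, 2.6, 3.0, 4.4, 5.6]
rm_independent = [0.7, 2.3, 3.1, 4.2, 4.7]

print(judge_error_correlation(judge, rm_trained_on_judge_labels, gold))
print(judge_error_correlation(judge, rm_independent, gold))
```

A correlation near 1 suggests the two signals share blind spots and will not correct each other; a value near 0 is what independence would look like. The threshold for "too correlated" is a judgment call that this sketch does not settle.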

12. If the self-improvement loop converges to a local optimum (the model cannot produce data better than itself), what are the ways to break out?

Answer: Based on Deep-dive Q7, stagnation has three root causes — filtering signal saturation, insufficient exploration after distribution narrowing, and task difficulty exceeding bootstrapping capability. Corresponding breakout strategies: ① curriculum learning: introduce harder or more diverse problems to expand signal coverage; ② raise sampling temperature or add diversity reward to restore exploration capability; ③ introduce an external stronger teacher model (or RLVR verifier) to provide training signals independent of the current model; ④ switch methods, from bootstrapping methods like SPIN/STaR to RL methods with external verification signals.

Follow-up: After introducing an external teacher model or RLVR verifier, what new failure modes may still arise? → Three main risk categories: ① over-optimization on verifiable tasks causing degradation on open tasks — the model specializes in fitting the verifier's judgment rules, with declining generalization to tasks without a unique answer; ② teacher-student distribution mismatch — if the teacher model or verifier's task distribution does not match the target distribution, the provided signals are ineffective or even harmful; ③ shortcut learning — the model learns to guess the teacher's output patterns or the verifier's rule boundaries rather than internalizing general reasoning capabilities, immediately failing when the verifier set is changed.

Deep-dive

Detailed analysis of advanced interview questions. ⚠️ Study notes, not the authors' research. Numbers follow the original papers.

Q1. STaR only retains "correctly answered" samples: what selection bias does this introduce? What is the formal impact on the learned distribution?

Core bias: in each iteration, STaR [1] adds to the training set only those chain-of-thought traces whose final answer is correct. Formally this is equivalent to:

$p_{\text{train}}(r \mid x) \propto p_\theta(r \mid x) \cdot \mathbf{1}[\text{answer}(r) = a^*]$

where $r$ is the rationale, $x$ is the problem, and $a^*$ is the reference answer.

Three structural consequences:

Correctness ≠ reasoning quality: a rationale may arrive at the correct answer through luck, shortcuts, or "backward reasoning from the answer", yet the reasoning steps themselves are wrong. Since filtering only looks at the final answer, incorrect reasoning paths are systematically mixed into the training set. This matches the motivation behind the process reward model (PRM) of Lightman et al. [9]: outcome supervision cannot distinguish "correct reasoning getting the right answer" from "incorrect reasoning getting the right answer".
Accumulated distribution shift: the round- $t$ training set is drawn from the conditional distribution of $\pi_{\theta_{t-1}}$ , not the true reasoning distribution $p^*(r \mid x)$ . After each iteration, $\pi_{\theta_t}$ further deviates from $p^*$ , and the "correctness rate" signal of the filter becomes increasingly self-referential.
Hard-problem blind spots: for problems the model consistently fails, the filtered training set is empty (hint-retry covers some, but cannot fully compensate). The model receives neither gradient updates nor bootstrapped improvement on these problems, creating a "Matthew effect" — the strong get stronger, hard problems stagnate.

Mitigation directions: PRM scoring each step (rather than only looking at the final answer) can reduce step-level errors; data mixing (retaining original SFT data) prevents complete distribution drift.
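The first consequence can be made concrete with a toy outcome-only filter. This is a minimal sketch of the indicator $\mathbf{1}[\text{answer}(r) = a^*]$, not the STaR codebase; the `Rationale` class, the `steps_valid` flag (ground truth that the filter never sees), and the sample rationales are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rationale:
    steps: list        # reasoning steps as strings
    answer: str        # final answer extracted from the rationale
    steps_valid: bool  # hypothetical ground truth; invisible to the filter


def star_filter(samples, reference_answer):
    """Outcome-only filter: keep rationales whose final answer matches a*."""
    target = reference_answer.strip()
    return [s for s in samples if s.answer.strip() == target]


# Problem: simplify 16/64. Reference answer: 1/4.
samples = [
    Rationale(["divide numerator and denominator by 16"], "1/4", True),
    Rationale(["cancel the 6 in 16 with the 6 in 64"], "1/4", False),
    Rationale(["divide numerator and denominator by 8", "miscopy 2/8 as 1/3"], "1/3", False),
]

kept = star_filter(samples, "1/4")
for r in kept:
    print(r.steps, "| valid reasoning:", r.steps_valid)
```

The "cancel the 6s" rationale survives the filter because only the final answer is checked, so it would enter the next round's training set alongside the valid one. A step-level scorer is what would be needed to separate the two.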

Q2. Why does iterative self-training narrow the distribution (model collapse)? Intuition + when does it bite?

Intuition: "retaining only high-score samples" each round is statistically equivalent to truncated sampling — only taking the high-density region of the distribution each time. Over multiple rounds, tail low-probability (but high-diversity) outputs are systematically eliminated.

Shumailov et al. [10] analyzed the consequences of recursively training on self-generated data at both theoretical and experimental levels:

Statistical approximation error: each sampled dataset is finite; tail events are underestimated or missing.
Function approximation error: limited model capacity further compresses representation of low-frequency patterns.

The two errors accumulate through iteration, causing the distribution to continuously narrow. Intuitively characterized with an inequality:

$H(\pi_{\theta_t}) \le H(\pi_{\theta_{t-1}}) \quad \text{(if only top-}k\text{ samples kept per round)}$

Entropy monotonically decreases; outputs trend toward repetition and uniformity.
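The statistical approximation error alone is enough to lose tails, which a toy simulation can show. This is a sketch under strong assumptions: the "model" is just a categorical distribution refit by counting (so there is no function approximation error and no top-$k$ filter), and the outcome names, sample size, and seed are arbitrary.

```python
import random
from collections import Counter
from math import log


def entropy(probs):
    """Shannon entropy (nats) of a dict mapping outcome -> probability."""
    return -sum(p * log(p) for p in probs.values() if p > 0)


def refit_from_samples(probs, n_samples, rng):
    """Draw a finite sample from probs and return the empirical distribution."""
    outcomes = list(probs)
    weights = [probs[o] for o in outcomes]
    draws = rng.choices(outcomes, weights=weights, k=n_samples)
    counts = Counter(draws)
    # Outcomes that were never drawn get no entry: the next round cannot sample them.
    return {o: c / n_samples for o, c in counts.items()}


def simulate(initial, n_samples, rounds, seed=0):
    """Repeatedly train each 'generation' on samples from the previous one."""
    rng = random.Random(seed)
    history = [dict(initial)]
    for _ in range(rounds):
        history.append(refit_from_samples(history[-1], n_samples, rng))
    return history


initial = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.05, "e": 0.03, "f": 0.02}
for t, dist in enumerate(simulate(initial, n_samples=20, rounds=10)):
    print(t, "support:", sorted(dist), "entropy: %.3f" % entropy(dist))
```

In this toy the per-round entropy can fluctuate with sampling noise, but the support can only shrink: once a tail outcome draws zero samples it never returns. That one-way loss is the mechanism the two bullet points describe, here isolated from any model-capacity effect.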

When it truly bites:

Scenario	Why it's severe
Self-Rewarding (model scores itself)	The judge itself is drifting; the gap between preference data and true preferences keeps growing
Long-chain chain-of-thought tasks	Per-step sampling variance is high; tail solutions (unconventional reasoning paths) disappear first in filtering
Multi-turn dialogue / agent loop	History in context is also self-generated data; recursive contamination effect is stronger
Using only self-generated data, without mixing human data	No anchor; distribution drift is unconstrained

Mitigation: periodically mix in original human data (anchor to prevent drift); raise sampling temperature to preserve diversity; use a diversity reward term to explicitly reward output diversity.

Q3. Why does the judge-generator coupling in Self-Rewarding fail?

In Self-Rewarding [3], the same set of parameters acts both as the generator (producing answers) and as the judge (scoring answers). This creates a structural problem: the generator's blind spots are inherited by the judge.

Specific mechanism:

Shared blind spots: if the generator is weak at a certain type of reasoning (e.g., counterfactual reasoning), its probability of scoring that reasoning highly is also lower than the true level — because judgment capability and generation capability share the same knowledge foundation. Preference data therefore systematically underestimates this type of capability.
Self-Confirmation Bias: the model tends to score answers that "sound like its own style" higher. This is not a random error from hallucination, but a systematic bias — the preference signal itself is pulling the model toward its existing style, forming a positive feedback loop.
Correlated error drift: after each DPO update, the generator and judge move synchronously toward the direction of the preference data. If the preference data itself is erroneous (coming from a judge with blind spots), the next round's judge will aggravate the bias in the same direction — errors do not mean-revert but accumulate and amplify.

Formally, let $J_\theta$ be the judgment score function, $G_\theta$ the generation function, both sharing $\theta$ . The true quality function is $q^*$ . Then:

$\mathbb{E}[J_\theta(y) - q^*(y)] \ne 0 \quad \text{and is correlated with the bias direction of } G_\theta$

Contrast: an external reward model (with independent parameters) is at least initially uncorrelated in bias direction with the generator. But it has another problem: out-of-distribution generalization failure (see Q5).

Q4. SPIN converges to the SFT data distribution — why is this an upper bound? What does it mean in practice?

For SPIN [4], the theoretical convergence condition is: the loss gradient vanishes and training stops if and only if $\pi_{\theta_t} = p_\text{data}$ (the current model is identical to the human SFT data distribution).

This mathematically provides a strict capability upper bound:

$\text{SPIN limit policy} = p_\text{data} \quad \text{(SFT data distribution)}$

Corollaries:

Cannot surpass SFT data quality: if the SFT data contains errors, biases, or capability blind spots, the model after SPIN convergence will also inherit these defects. SPIN only makes the model "more like the human SFT data" — it cannot discover new capabilities beyond that data.
Negative sample quality degrades with iteration: the round- $t$ negative samples are generated by $\pi_{\theta_{t-1}}$ ; as $\pi_{\theta_t} \to p_\text{data}$ , negative sample quality increasingly approaches positive sample quality, and the contrastive signal grows progressively weaker. In practice, this manifests as: large gains in early iterations, diminishing marginal returns in later iterations approaching zero.
Fundamental difference from STaR/ReST: STaR-type methods use task correctness as the filtering signal, and can theoretically surpass SFT data (as long as a correct rationale exists, it can be learned). SPIN targets the ability to distinguish from human data, and its ceiling is determined by human data quality.

Practical implication: SPIN is suitable as a tool to "close the SFT gap" (eliminating the gap between the model distribution and human data distribution), but is not suitable as an infinite loop for continuous self-improvement — in later iterations, methods with external verification signals should be switched to.
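The weakening contrastive signal can be seen in the per-pair loss. The sketch below assumes the DPO-style logistic form of the SPIN objective, with the human response as the positive and the previous-round model's own response as the negative; the function name, the argument names, and the value of the scale `lam` are illustrative, not taken from the paper's code.

```python
import math


def spin_pair_loss(logp_new_real, logp_old_real, logp_new_synth, logp_old_synth, lam=0.1):
    """Logistic loss on the margin between a human response and a self-generated one.

    logp_new_* : log-probability under the model being trained
    logp_old_* : log-probability under the previous-round model (the "opponent")
    """
    margin = lam * ((logp_new_real - logp_old_real) - (logp_new_synth - logp_old_synth))
    # log(1 + exp(-margin)), computed in a numerically stable way.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Start of a round: the new model equals the old one, so the margin is 0.
print(spin_pair_loss(-12.0, -12.0, -9.0, -9.0))
# After some training: the human response became more likely and the
# self-generated one less likely, so the loss drops.
print(spin_pair_loss(-10.0, -12.0, -11.0, -9.0))
```

At the start of every round the loss is $\ln 2$ regardless of the data. The round can only make progress if human and self-generated responses are distinguishable, which is why gains shrink as $\pi_{\theta_t}$ approaches $p_\text{data}$.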

Q5. Reward Model over-optimization: what is the scaling-law shape of reward rising / true quality falling?

Gao et al. [8] systematically studied the relationship between the degree of RM optimization (measured by $d$, the square root of the KL divergence between the optimized and initial policies) and true quality, finding that the curve shape depends on the optimization method:

Best-of- $N$ sampling: proxy reward roughly grows as $\sqrt{\log N}$ (the paper notes fitting is difficult; this is an approximate description); true quality first rises then plateaus — over-optimization effect is relatively mild.
RL (policy gradient): proxy reward can continue rising, but true quality monotonically decreases after some $d^*$ .

Approximate functional form for the RL case (fitted in the paper):

$\text{gold reward} \approx d\,(\alpha - \beta \ln d)$

where $\alpha, \beta > 0$ , $d = \sqrt{D_{\mathrm{KL}}}$ , optimal point $d^* = \exp\!\left(\dfrac{\alpha - \beta}{\beta}\right)$ (derived by setting $\mathrm{d}R/\mathrm{d}d = 0$ ). Beyond $d^*$ , true quality decreases as KL increases.
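The optimum can be checked numerically. This sketch simply evaluates the functional form above; the coefficient values are made up for illustration and are not fitted values from the paper.

```python
import math


def gold_reward(d, alpha, beta):
    """Gold reward as a function of d = sqrt(KL): d * (alpha - beta * ln d)."""
    return d * (alpha - beta * math.log(d))


def optimal_d(alpha, beta):
    """Stationary point from dR/dd = alpha - beta * ln(d) - beta = 0."""
    return math.exp((alpha - beta) / beta)


alpha, beta = 1.0, 0.3  # hypothetical coefficients
d_star = optimal_d(alpha, beta)
print("d* =", round(d_star, 3))
for d in (0.5 * d_star, d_star, 2.0 * d_star):
    print("d = %.3f -> gold reward = %.3f" % (d, gold_reward(d, alpha, beta)))
```

Substituting $d^*$ back into the formula gives a peak gold reward of $\beta\, d^*$, since $\alpha - \beta \ln d^* = \beta$. Pushing $d$ past $d^*$ lowers the gold reward even though the policy is still moving in the direction the proxy prefers.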

Key scaling conclusion: the coefficients $\alpha, \beta$ smoothly change with RM parameter count — larger RMs have a higher $d^*$ , meaning the over-optimization "critical point" arrives later, but a critical point always exists. This means scaling up the RM cannot eliminate over-optimization risk, only delay it.

Intuition: the RM is an approximation fitted on limited data; the policy finds shortcuts in the RM's out-of-distribution regions (high KL regions) that achieve high proxy reward but low true quality. This is Goodhart's Law expressed quantitatively in RL.

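Practical defenses: set the KL penalty coefficient in the training objective (a different quantity from the fit coefficient $\beta$ above); periodically check against a held-out gold reward (e.g., human evaluation or verifiable answers); avoid running too many iteration rounds against a single RM.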

Q6. Why is a ground-truth verifier (RLVR) safer than a learned judge?

RLVR (Reinforcement Learning with Verifiable Rewards) was systematically applied to mathematical reasoning by DeepSeekMath [11]: for tasks with deterministic answers (math problems, code unit tests), correctness is checked programmatically as the reward signal, rather than training a reward model.

Safety comparison:

Dimension	Learned Judge / RM	Ground-truth Verifier
Signal authenticity	Approximate (has fitting error)	Exact (rule / symbolic execution)
Over-optimization risk	High (policy can exploit loopholes)	Low (correctness is checked, not estimated; residual risk comes from incomplete tests or lax answer parsing)
Out-of-distribution generalization	Unreliable outside training distribution	Independent of the policy distribution; as reliable as the checker's coverage
Blind spot inheritance	May share blind spots with generator	No learned parameters, so no blind spots shared with the generator
Applicable scope	Broad (but imprecise)	Limited to mechanically verifiable tasks

Why exploiting loopholes is easy for RM but hard for verifiers: the RM's out-of-distribution behavior is unconstrained; the policy can find "high-scoring but low-quality" outputs unseen by the RM. Programmatic verifiers only check whether the final result conforms to the specification — the policy cannot cheat on "the specification itself" (specifications are exogenous).

Limitations: the applicable scope of RLVR depends on task verifiability. Tasks like natural language generation, summarization, and creative writing have no unique correct answer and cannot directly use RLVR. Therefore RLVR and learned rewards are not substitutes but complements — prefer RLVR for verifiable tasks; for open generation tasks, RM + KL constraint is necessary.
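A programmatic verifier for numeric answers can be very small. The sketch below is an assumption-laden illustration, not the reward function of any cited system: it treats the last number in the response as the final answer and compares it exactly with the reference using rational arithmetic. The function names and the parsing rule are hypothetical.

```python
import re
from fractions import Fraction

# Integers, decimals, or simple fractions such as -3, 0.25, 1/4.
_NUMBER = re.compile(r"-?\d+(?:/\d+|\.\d+)?")


def extract_final_answer(response):
    """Return the last number in the response as an exact Fraction, or None."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    try:
        return Fraction(matches[-1])
    except ZeroDivisionError:  # e.g. a literal "1/0" in the text
        return None


def verifiable_reward(response, reference):
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == Fraction(reference) else 0.0


print(verifiable_reward("16/64 reduces by 16, so the answer is 0.25.", "1/4"))
print(verifiable_reward("Cancel the 6s and get 1/3", "1/4"))
print(verifiable_reward("I am not sure.", "1/4"))
```

The reward has no learned parameters, but the parsing rule is part of the specification: "take the last number" is itself something a policy could learn to game, and an answer-only check still accepts flawed reasoning that lands on the right value. That is the narrow sense in which a verifier is exact.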

Q7. When does self-improvement stagnate? What is the role of exploration / diversity? How to empirically distinguish genuine improvement from reward hacking?

Three root causes of stagnation:

Filtering signal saturation: when the model can answer all training set problems correctly, rejection sampling produces correct samples nearly identical to existing training data — gradient signal approaches zero.
Insufficient exploration after distribution narrowing: as described in Q2, after distribution entropy decreases, the model no longer samples sufficiently diverse rationales, making it difficult to recover from erroneous paths or discover new strategies.
Task difficulty exceeds bootstrapping capability: for problems completely beyond the model's ability, no correct samples can be produced through rejection sampling; external curriculum (simpler subproblems, stronger teacher models) is needed.

Role of exploration / diversity: self-improvement is fundamentally an exploitation-exploration tradeoff. Retaining only correct samples each round is pure exploitation; to sustain improvement, the following are needed:

Higher temperature: sample more diverse paths, trading higher variance for higher probability of hitting new correct paths.
Diversity reward: add entropy regularization or a diversity term to the optimization objective to prevent mode collapse.
Curriculum learning: progressively introduce harder problems rather than repeatedly iterating on a fixed set.
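The effect of temperature on exploration can be shown on a single next-token distribution. This is a generic softmax sketch with made-up logits, not tied to any particular model or sampling library.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, with the usual max-subtraction for stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical scores for four candidate continuations
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: tail prob={probs[-1]:.4f}, entropy={entropy(probs):.3f}")
```

Raising the temperature moves probability mass from the top candidate to the tail and increases entropy, which is what keeps unconventional paths in the sample pool. The cost is that more of the sampled paths will be wrong, so the filter has to discard more of them.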

Empirical methods for distinguishing genuine improvement from reward hacking:

Metric	Signal of genuine improvement	Signal of reward hacking
Proxy reward vs Gold reward	Both rise synchronously	Proxy rises but Gold reward is flat or falls
Held-out evaluation set	Also gains on unseen problem types	Only gains within training distribution; drops out-of-distribution
Manual spot-check of output quality	Reasoning step quality visibly improves	Surface fluency, but logical gaps in steps increase
Output diversity	Distribution entropy is maintained or slightly decreases	Distribution entropy collapses rapidly; outputs highly repetitive
KL divergence trend	Grows slowly and positively correlated with Gold	KL grows rapidly, exceeding Gao et al.'s $d^*$

The gold standard is always: maintain a held-out evaluation set completely untouched by the self-training process, and regularly score with a trusted oracle (human evaluation or verifiable answers). Only if this score consistently rises can genuine improvement be confirmed.
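The first row of the table can be turned into a simple per-round check. The sketch below is a hypothetical monitoring helper, assuming one proxy score and one held-out gold score are logged per self-training round; the function name, the `tolerance` parameter, and the numbers are illustrative.

```python
def flag_reward_hacking(proxy_scores, gold_scores, tolerance=0.0):
    """Return the round indices where proxy reward rose but held-out gold reward did not.

    tolerance: minimum gold gain that counts as a real improvement.
    """
    if len(proxy_scores) != len(gold_scores):
        raise ValueError("need one proxy and one gold score per round")
    flagged = []
    for t in range(1, len(proxy_scores)):
        proxy_up = proxy_scores[t] > proxy_scores[t - 1]
        gold_up = gold_scores[t] > gold_scores[t - 1] + tolerance
        if proxy_up and not gold_up:
            flagged.append(t)
    return flagged


# Hypothetical training log: proxy keeps climbing, gold stalls and then drops.
proxy = [0.50, 0.60, 0.70, 0.80, 0.90]
gold = [0.50, 0.58, 0.63, 0.63, 0.59]
print(flag_reward_hacking(proxy, gold))  # [3, 4]
```

A flagged round is a prompt for the manual spot-check and diversity rows of the table, not a verdict on its own: gold scores from human evaluation are noisy, so a non-zero `tolerance` and several consecutive flags are more informative than a single one.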

References

All are original sources of foundational load-bearing methods, individually verified (title + arXiv ID).

[1] Zelikman et al. STaR: Bootstrapping Reasoning With Reasoning. 2022. arXiv:2203.14465 — Iteratively fine-tunes on correct chain-of-thought, without large-scale rationale annotation.
[2] Gulcehre et al. Reinforced Self-Training (ReST) for Language Modeling. 2023. arXiv:2308.08998 — Grow-Improve offline RL loop, more sample-efficient than online RLHF.
[3] Yuan et al. Self-Rewarding Language Models. 2024. arXiv:2401.10020 — Same model acts as both generator and LLM-as-Judge; jointly improves both via iterative DPO.
[4] Chen et al. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. 2024. arXiv:2401.01335 — SPIN: uses the previous-round self as opponent; self-improvement using only SFT data.
[5] Bai et al. Constitutional AI: Harmlessness from AI Feedback. 2022. arXiv:2212.08073 — Constitution-guided self-critique and revision; RLAIF replaces human harmlessness annotation with AI preferences.
[6] Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. 2023. arXiv:2303.11366 — Verbal reflections stored in episodic memory; multi-round self-correction without weight updates.
[7] Madaan et al. Self-Refine: Iterative Refinement with Self-Feedback. 2023. arXiv:2303.17651 — Frozen model self-loop: generate → critique → revise; no training, consistent gains across tasks.
[8] Gao et al. Scaling Laws for Reward Model Overoptimization. 2022. arXiv:2210.10760 — Separation curve of proxy reward vs gold reward as KL increases; RM scaling laws.
[9] Lightman et al. Let's Verify Step by Step. 2023. arXiv:2305.20050 — Process supervision (PRM) outperforms outcome supervision (ORM); PRM800K dataset.
[10] Shumailov et al. The Curse of Recursion: Training on Generated Data Makes Models Forget. 2023. arXiv:2305.17493 — Recursive training on self-generated data causes distribution tails to vanish (model collapse).
[11] Shao et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024. arXiv:2402.03300 — RLVR (RL with verifiable rewards) + GRPO; programmatic verification replaces learned RM.
[12] Zuo et al. TTRL: Test-Time Reinforcement Learning. 2025. arXiv:2504.16084 — RL on unlabeled test data with majority-vote pseudo-labels; "RLVR without a verifier," quality hinges on the majority-vote signal (not a strict maj@n ceiling).
[13] Zhang et al. AFlow: Automating Agentic Workflow Generation. 2024. arXiv:2410.10762 — MCTS over the operator space of code-ified workflows; leaves weights untouched, optimizes only workflow structure.
[14] Wang et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. 2023. arXiv:2305.16291 — Executable-code skill library + lifelong learning; frozen base, capability deposited in external memory.
[15] Thaman. Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use. 2026. arXiv:2605.02964 — Reward-hacking exploitation rates across 13 models under tool-mediated evaluation (~0%–14%); single benchmark, very recent preprint, magnitude only.