Drill · From scratch

Drill: Self-Refine Loop from scratch

Runnable from-scratch implementation with tests. The goal: be able to derive and defend every line in an interview.

Core idea

The intuition behind Self-Refine (Madaan et al., 2023, arXiv:2303.17651): use the same model both to produce a candidate and to score and critique it, then feed that critique back as context for the next generation. The output is treated as a draft that is rewritten repeatedly, and each round keeps the best result seen so far.

generation            self-scoring           reflection/edit
generate(state)  -->  score(candidate)  -->  reflect(candidate, score)
     ^                                              |
     |________ candidate for next round ____________|
                     (keep best)

This drill replaces the LLM with a toy continuous optimization problem: the generator outputs a real vector, the scorer is a known quadratic, and the reflector takes one gradient-ascent step on the score. The loop structure is the same as in the paper, but no network calls or datasets are needed.
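The toy loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the contents of the drill's own file: the function names, the random initial draft, and the defaults `alpha=0.25` and `rounds=10` are all hypothetical choices. The reflector here receives the target so it can compute the gradient of the score analytically.

```python
import random
from typing import List, Tuple

Vector = List[float]


def generate(dim: int, rng: random.Random) -> Vector:
    """Stand-in for the model's first draft: a random starting vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def score(x: Vector, target: Vector) -> float:
    """Self-evaluation: negative squared distance to the target (0 is best)."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def reflect(x: Vector, target: Vector, alpha: float) -> Vector:
    """Reflection/edit: one gradient-ascent step, x + 2*alpha*(t - x)."""
    return [xi + 2.0 * alpha * (ti - xi) for xi, ti in zip(x, target)]


def self_refine(target: Vector, alpha: float = 0.25, rounds: int = 10,
                seed: int = 0) -> Tuple[Vector, float, List[float]]:
    """Run generate -> score -> reflect, keeping the best candidate seen."""
    rng = random.Random(seed)
    x = generate(len(target), rng)
    best_x, best_s = x, score(x, target)
    history = [best_s]                 # best score after each round
    for _ in range(rounds):
        x = reflect(x, target, alpha)  # edit the current draft
        s = score(x, target)           # re-score the edited draft
        if s > best_s:                 # keep best
            best_x, best_s = x, s
        history.append(best_s)
    return best_x, best_s, history


if __name__ == "__main__":
    best_x, best_s, history = self_refine(target=[1.0, -2.0, 0.5])
    print("initial score:", history[0])
    print("best score:   ", best_s)
```

Keeping `best_x` separate from the working candidate `x` is what lets the loop tolerate a round that makes things worse; with a real model, a rewrite is not guaranteed to improve on the draft.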

The math

Let the target vector be $t$ and the candidate be $x$. The score is:

$$s(x) = -\|x - t\|^2 \in (-\infty,\, 0]$$

Reflection step (gradient ascent with step size $\alpha$):

$$x' = x + \alpha\,\nabla_x\, s(x) = x + 2\alpha\,(t - x)$$

For this quadratic, each step strictly decreases $\|x - t\|^2$ as long as $0 < \alpha < 1$ and $x \neq t$.
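The step-size condition follows from substituting the reflection step into the error $x' - t$. A short derivation, using only the symbols defined above:

```latex
\begin{aligned}
x' - t &= x + 2\alpha\,(t - x) - t = (1 - 2\alpha)\,(x - t) \\
\|x' - t\|^2 &= (1 - 2\alpha)^2\,\|x - t\|^2 \\
(1 - 2\alpha)^2 < 1 &\iff -1 < 1 - 2\alpha < 1 \iff 0 < \alpha < 1
\end{aligned}
```

Two consequences are worth having ready: $\alpha = 1/2$ lands exactly on $t$ in a single step, and after $k$ steps the squared error is $(1 - 2\alpha)^{2k}\,\|x_0 - t\|^2$, where $x_0$ denotes the initial candidate.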

Best-keeping invariant: $\text{best\_score}_k = \max_{i \le k}\, s(x_i)$. This quantity is monotonically non-decreasing in $k$, and it is the core loop invariant.
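The invariant holds even when the reflection step itself misbehaves. The sketch below deliberately picks a step size outside $(0, 1)$ so the raw scores get worse every round, while the running maximum stays flat. The helper names `best_so_far` and `scores_1d` and the one-dimensional setup are illustrative assumptions.

```python
from typing import List


def best_so_far(scores: List[float]) -> List[float]:
    """Running maximum: best_score_k = max over i <= k of s(x_i)."""
    best = []
    current = float("-inf")
    for s in scores:
        current = max(current, s)
        best.append(current)
    return best


def scores_1d(x0: float, target: float, alpha: float, rounds: int) -> List[float]:
    """Scores s(x) = -(x - t)^2 along x' = x + 2*alpha*(t - x), in one dimension."""
    x, out = x0, []
    for _ in range(rounds + 1):
        out.append(-(x - target) ** 2)
        x = x + 2.0 * alpha * (target - x)
    return out


# alpha = 1.5 lies outside (0, 1): the error doubles in magnitude every round.
raw = scores_1d(x0=0.0, target=1.0, alpha=1.5, rounds=4)
best = best_so_far(raw)
print(raw)   # [-1.0, -4.0, -16.0, -64.0, -256.0]
print(best)  # [-1.0, -1.0, -1.0, -1.0, -1.0]
```

The monotonicity comes from the `max` alone and does not depend on the reflector being a contraction, which is why it survives when the toy objective is swapped for a real model.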

Relation to Reflexion

Reflexion (Shinn et al., 2023, arXiv:2303.11366) stores reflections in an external episodic memory buffer and reads them back as context at the start of the next episode. Self-Refine is simpler: it inlines the feedback text directly into the prompt and regenerates, with no separate memory module. Both share the same loop skeleton: generate → score → reflect → regenerate.
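The difference in plumbing can be made concrete with a small sketch. Everything here is hypothetical: the prompt wording, the `EpisodicMemory` class, and the `max_items` cap are illustrative assumptions, not either paper's actual implementation. It only shows where the feedback lives: inline in the next prompt, or in a buffer that outlasts a single attempt.

```python
from collections import deque
from typing import Deque


def self_refine_prompt(task: str, draft: str, feedback: str) -> str:
    """Self-Refine style: feedback on the current draft is inlined directly."""
    return (
        f"Task: {task}\n"
        f"Previous draft: {draft}\n"
        f"Feedback: {feedback}\n"
        "Rewrite the draft to address the feedback."
    )


class EpisodicMemory:
    """Reflexion style: reflections persist across episodes in a buffer."""

    def __init__(self, max_items: int = 3) -> None:
        # max_items is an illustrative cap; the oldest reflection is dropped first.
        self._items: Deque[str] = deque(maxlen=max_items)

    def add(self, reflection: str) -> None:
        self._items.append(reflection)

    def as_context(self) -> str:
        return "\n".join(f"- {r}" for r in self._items)


def reflexion_prompt(task: str, memory: EpisodicMemory) -> str:
    """At the start of an episode, read stored reflections back as context."""
    return f"Lessons from earlier attempts:\n{memory.as_context()}\nTask: {task}"
```

In the first function the feedback is gone once the next draft is produced; in the second it accumulates, so a lesson from the first attempt can still shape a later one.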

Files

python test_self_refine.py        # plain run
python -m pytest test_self_refine.py

Stratified follow-ups