Drill · From scratch

Drill: Turn credit assignment from scratch

Runnable from-scratch implementation with tests. The goal: be able to derive and defend every line in an interview.

Background

When training an LLM agent, a trajectory consists of interleaved tokens from multiple turns:

[system prompt] [observation_1] [action_1] [tool output_1] [action_2] [tool output_2]

Credit assignment on such a trajectory requires solving three sub-problems:

  1. Discounted return — propagate a sparse terminal reward back to every step.
  2. Baseline subtraction — reduce gradient variance without adding bias.
  3. Masked loss — gradient flows only through agent action tokens; tool-output and observation tokens are masked out.
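
To make the third point concrete, here is a minimal sketch of how a per-token action mask could be derived from a trajectory like the one above. The role labels, the `(role, n_tokens)` segment format, and the helper name `build_action_mask` are assumptions made for illustration; they are not taken from the drill's files.

```python
from typing import List, Tuple

# Hypothetical role labels; the drill's own code may name these differently.
TRAINABLE_ROLES = {"action"}


def build_action_mask(segments: List[Tuple[str, int]]) -> List[int]:
    """Return a 0/1 mask with one entry per token.

    `segments` lists (role, n_tokens) pairs in trajectory order.
    Tokens in trainable segments get 1, all other tokens get 0.
    """
    mask: List[int] = []
    for role, n_tokens in segments:
        mask.extend([1 if role in TRAINABLE_ROLES else 0] * n_tokens)
    return mask


trajectory = [
    ("system", 3), ("observation", 2), ("action", 2),
    ("tool_output", 3), ("action", 1), ("tool_output", 2),
]
mask = build_action_mask(trajectory)
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
```

Only the three action tokens carry a 1, so only they would contribute to the loss.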

The math

1. Discounted return

$$G_t = \sum_{k=0}^{T-t-1} \gamma^k\, r_{t+k}$$

Backward recursion: $G_{T-1} = r_{T-1}$, $G_t = r_t + \gamma\, G_{t+1}$.
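
The backward recursion can be sketched in a few lines of plain Python. The function name `discounted_return` matches the one listed for `from_scratch.py`, but the signature and the list-based representation here are assumptions, not the file's actual contents.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> List[float]:
    """Compute G_t for every step with one backward pass.

    Uses G_t = r_t + gamma * G_{t+1}, starting from G_T = 0 so that
    G_{T-1} = r_{T-1}. Runs in O(T) instead of the O(T^2) direct sum.
    """
    returns = [0.0] * len(rewards)
    running = 0.0  # holds G_{t+1} while we compute G_t
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Sparse terminal reward: only the last step is rewarded.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

With a sparse terminal reward of 1 and $\gamma = 0.5$, each earlier step receives half the credit of the step after it.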

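2. GRPO group-relative advantage

Sample $G$ trajectories for the same prompt (one group) and normalize within the group instead of training a value network:

$$A_i = \frac{G_i - \mu_{\text{group}}}{\sigma_{\text{group}} + \varepsilon}$$

Here $G_i$ is the return of trajectory $i$ (not to be confused with the group size $G$), $\mu_{\text{group}}$ and $\sigma_{\text{group}}$ are the mean and standard deviation of the returns within the group, and $\varepsilon$ guards against division by zero when all returns in the group are equal.

Source: DeepSeekMath, Shao et al., 2024 (arXiv:2402.03300).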
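
A minimal sketch of the group normalization follows. The function name `group_relative_advantages` is the one listed for `from_scratch.py`; the signature, the default `eps`, and the use of the population standard deviation (dividing by the group size rather than the group size minus one) are assumptions made for this example.

```python
import math
from typing import List


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize one group's returns to zero mean and (roughly) unit scale.

    Assumption: population standard deviation (divide by n). Some
    implementations use the sample standard deviation instead.
    """
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((g - mean) ** 2 for g in returns) / n
    std = math.sqrt(variance)
    # eps keeps the division finite when every return in the group is equal.
    return [(g - mean) / (std + eps) for g in returns]


adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)       # two entries close to +1, two close to -1
print(sum(adv))  # close to 0: the advantages are zero-mean within the group
```

When every trajectory in a group gets the same return, all advantages are zero, so that group contributes no gradient signal.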

3. Masked policy-gradient loss

$$\mathcal{L} = -\frac{\sum_{(b,t):\, m_{b,t}=1} A_{b,t}\,\log \pi_\theta(a_{b,t})}{\sum_{b,t} m_{b,t}}$$
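
The masked loss can be sketched without any tensor library to show exactly which terms enter the sum. The function name `masked_pg_loss` is the one listed for `from_scratch.py`; the nested-list inputs, the per-token advantage layout, and the choice to return 0.0 for a fully masked batch are assumptions made for this example. A real training loop would use a tensor framework so that gradients can flow through the log-probabilities.

```python
from typing import List


def masked_pg_loss(
    logprobs: List[List[float]],    # log pi_theta(a_{b,t}), shape [B][T]
    advantages: List[List[float]],  # A_{b,t}, shape [B][T]
    mask: List[List[int]],          # m_{b,t} in {0, 1}, shape [B][T]
) -> float:
    """Negative advantage-weighted log-prob, averaged over unmasked tokens."""
    numerator = 0.0
    n_active = 0
    for lp_row, adv_row, m_row in zip(logprobs, advantages, mask):
        for lp, adv, m in zip(lp_row, adv_row, m_row):
            numerator += m * adv * lp  # masked tokens contribute exactly 0
            n_active += m
    if n_active == 0:
        return 0.0  # assumption: a fully masked batch yields zero loss
    return -numerator / n_active


logprobs = [[-1.0, -2.0, -3.0]]
advantages = [[0.5, 0.5, 0.5]]
mask = [[0, 1, 1]]  # first token is a tool-output token, so it is masked
loss = masked_pg_loss(logprobs, advantages, mask)
print(loss)  # 1.25
```

Dividing by the number of unmasked tokens, rather than by the full sequence length, keeps the loss scale independent of how many tool-output tokens a trajectory happens to contain.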


Files

File                 Contents
from_scratch.py      discounted_return, group_relative_advantages, masked_pg_loss
test_turn_credit.py  17 assertion tests: zero-mean property, mask correctness, numerical precision, end-to-end pipeline

Run the tests:

python test_turn_credit.py        # or: python -m pytest test_turn_credit.py

Stratified follow-ups

L1: Concepts

L2: Implementation

L3: Algorithms


References