Drill: Turn credit assignment from scratch

A runnable from-scratch implementation with tests. The goal: be able to derive and defend every line in an interview.

Background

When training an LLM agent, a trajectory consists of multiple turns of interleaved tokens:

[system prompt] [observation_1] [action_1] [tool output_1] [action_2] [tool output_2] …
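The token layout above can be turned into a per-token mask that marks which positions belong to agent actions. The following is a minimal sketch under stated assumptions: the `(role, tokens)` segment representation, the role names, and the helper `build_action_mask` are all hypothetical and are not taken from `from_scratch.py`.

```python
# Hypothetical segment representation: (role, tokens) pairs.
# The role names below are illustrative assumptions.
TRAINABLE_ROLES = {"action"}


def build_action_mask(segments):
    """Flatten segments into (tokens, mask); mask is 1 only on action tokens."""
    tokens, mask = [], []
    for role, seg_tokens in segments:
        tokens.extend(seg_tokens)
        flag = 1 if role in TRAINABLE_ROLES else 0
        mask.extend([flag] * len(seg_tokens))
    return tokens, mask


trajectory = [
    ("system", ["You", "are", "an", "agent"]),
    ("observation", ["task:", "add"]),
    ("action", ["call", "calc"]),
    ("tool_output", ["3"]),
    ("action", ["answer", "3"]),
]
tokens, mask = build_action_mask(trajectory)
print(mask)  # [0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
```

Only the four action tokens carry a 1; system-prompt, observation, and tool-output tokens are all 0, so they would contribute nothing to a masked loss.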

Credit assignment has to solve three sub-problems:

Discounted return — propagate a sparse terminal reward back to every step.
Baseline subtraction — reduce gradient variance without adding bias.
Masked loss — gradient flows only through agent action tokens; tool-output and observation tokens are masked out.

$G_t = \sum_{k=0}^{T-t-1} \gamma^k\, r_{t+k}$

Backward recursion: $G_{T-1} = r_{T-1}$, $G_t = r_t + \gamma\, G_{t+1}$
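The backward recursion can be sketched in a few lines of plain Python. The function name matches the one listed for `from_scratch.py`, but the signature, the default `gamma`, and the body here are assumptions, not the file's actual code.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step, right to left."""
    returns = [0.0] * len(rewards)
    running = 0.0  # convention: the return past the last step is 0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Sparse terminal reward: only the last step is rewarded.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

A single right-to-left pass costs O(T), whereas evaluating the sum in the definition separately for each step would cost O(T^2).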

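For the same prompt, sample $G$ trajectories (one group) and use within-group normalization in place of a value network. Here $G_i$ denotes the return of trajectory $i$, distinct from the group size $G$: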

$A_i = \frac{G_i - \mu_{\text{group}}}{\sigma_{\text{group}} + \varepsilon}$

Source: DeepSeekMath, Shao et al., 2024 (arXiv:2402.03300).
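The group normalization can be sketched as follows. The function name matches the one listed for `from_scratch.py`, but the signature, the `eps` default, and the choice of population standard deviation (dividing by the group size rather than the group size minus one) are assumptions of this sketch.

```python
import math


def group_relative_advantages(returns, eps=1e-8):
    """A_i = (G_i - mean) / (std + eps), computed within one prompt's group."""
    n = len(returns)
    mean = sum(returns) / n
    # Population variance (divide by n); an assumption of this sketch.
    var = sum((g - mean) ** 2 for g in returns) / n
    std = math.sqrt(var)
    return [(g - mean) / (std + eps) for g in returns]


adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)       # roughly [1, -1, -1, 1]
print(sum(adv))  # approximately 0: advantages are zero-mean within the group

# When every trajectory gets the same return, std is 0 and eps keeps the
# division finite; all advantages are then exactly 0.
print(group_relative_advantages([2.0, 2.0, 2.0]))
```

A group whose trajectories all score the same yields zero advantage everywhere, so under this estimator such a prompt contributes no gradient.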

$\mathcal{L} = -\frac{\sum_{(b,t):\, m_{b,t}=1} A_{b,t}\,\log \pi_\theta(a_{b,t})}{\sum_{b,t} m_{b,t}}$
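The loss value itself can be sketched in plain Python over nested `[batch][time]` lists. This computes only the scalar; a real training loop would evaluate the same expression on framework tensors so that gradients flow through the log-probabilities. The function name matches the one listed for `from_scratch.py`, while the signature and the zero-mask fallback are assumptions.

```python
def masked_pg_loss(logprobs, advantages, mask):
    """Return -sum(m * A * logp) / sum(m) over all (b, t) positions."""
    num, den = 0.0, 0.0
    for lp_row, adv_row, m_row in zip(logprobs, advantages, mask):
        for lp, adv, m in zip(lp_row, adv_row, m_row):
            if m:  # skip observation / tool-output tokens entirely
                num += adv * lp
                den += 1.0
    if den == 0.0:
        return 0.0  # assumption: no action tokens means zero loss
    return -num / den


logprobs = [[-1.0, -2.0, -3.0]]
advantages = [[0.5, 0.5, 0.5]]
mask = [[0, 1, 1]]
loss = masked_pg_loss(logprobs, advantages, mask)
print(loss)  # 1.25

# Changing a masked-out log-probability leaves the loss unchanged.
print(masked_pg_loss([[-100.0, -2.0, -3.0]], advantages, mask))  # 1.25
```

Dividing by the number of unmasked tokens rather than by the total token count keeps the loss scale independent of how much tool output a trajectory happens to contain.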

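| File | Contents |
| --- | --- |
| `from_scratch.py` | `discounted_return` + `group_relative_advantages` + `masked_pg_loss` |
| `test_turn_credit.py` | 17 assertion tests: zero-mean property, mask correctness, numerical precision, end-to-end pipeline |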

python test_turn_credit.py        # or: python -m pytest test_turn_credit.py

Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300. Origin of GRPO.
Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Foundation of REINFORCE.
Schulman et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. PPO, the direct predecessor of GRPO.
Schulman et al. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438. GAE, another low-variance advantage estimator.