From-Scratch GRPO Drill

Core Mathematics

1. Group-Relative Advantage

Given an input $s$, sample $K$ outputs $\{o_1, \dots, o_K\}$ and compute the standardized rewards: $\tilde{r}_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_K\})}{\text{std}(\{r_1, \dots, r_K\})}$ The advantage estimate is then: $\hat{A}_i = \tilde{r}_i$
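
The standardization above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in from_scratch.py: the function name `group_relative_advantages`, the use of the population standard deviation, and the small `eps` added to the denominator are all assumptions that the formula itself does not specify.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize one group's rewards: (r_i - mean) / (std + eps).

    eps is an assumed safeguard against a zero std; it is not in the formula.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std (assumption; sample std is also plausible)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for K = 4 outputs sampled from the same input s.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # two values close to +1 and two close to -1
```

Because the group mean is subtracted, the advantages sum to roughly zero: outputs are rewarded only for being better than their siblings, not for their absolute reward.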

2. Clipped Surrogate Objective

$L^{\text{CLIP}}(\theta) = \mathbb{E}\left[ \min\left( \frac{\pi_\theta(o_i|s)}{\pi_{\theta_{\text{old}}}(o_i|s)} \hat{A}_i,\; \text{clip}\left( \frac{\pi_\theta(o_i|s)}{\pi_{\theta_{\text{old}}}(o_i|s)}, 1-\epsilon, 1+\epsilon \right) \hat{A}_i \right) \right]$
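
As a rough sketch of how this objective evaluates on a batch, the snippet below computes it from per-output log-probabilities. The function name `clipped_surrogate`, the default `eps=0.2`, and the sample numbers are illustrative assumptions, and the expectation is approximated by a plain mean over the samples.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean over samples of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_theta(o_i|s) / pi_theta_old(o_i|s)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# Two hypothetical outputs: the first has ratio 1.5 with A = +1,
# the second has ratio 0.5 with A = -1.
logp_old = [math.log(0.2), math.log(0.2)]
logp_new = [math.log(0.3), math.log(0.1)]
value = clipped_surrogate(logp_new, logp_old, [1.0, -1.0])
print(value)  # about 0.2: the terms are capped at 1.2 and -0.8
```

In both samples the policy has already moved past the clip range in the direction the advantage favors, so the clipped branch is selected and the term no longer depends on the ratio. That is what removes the incentive to push the update further.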

3. KL Divergence Penalty

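$L(\theta) = L^{\text{CLIP}}(\theta) - \beta \, D_{\text{KL}}\left( \pi_\theta \| \pi_{\text{ref}} \right)$ where $\pi_{\text{ref}}$ is the reference policy (usually the initial policy). $L(\theta)$ is maximized, so the KL term penalizes drift away from $\pi_{\text{ref}}$.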
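
To make the penalty concrete, here is a small sketch that computes the exact KL divergence between two discrete distributions and subtracts it from a given clipped-objective value. The helper names, the toy two-token distributions, and `beta=0.04` are assumptions for illustration only; real training code usually estimates the KL from sampled tokens rather than from full distributions.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def grpo_objective(l_clip, pi_theta, pi_ref, beta=0.04):
    """L = L_CLIP - beta * D_KL(pi_theta || pi_ref); beta here is an assumed value."""
    return l_clip - beta * kl_divergence(pi_theta, pi_ref)

pi_ref = [0.5, 0.5]                            # hypothetical reference policy
kl_same = kl_divergence([0.5, 0.5], pi_ref)    # no drift, no penalty
kl_shift = kl_divergence([0.9, 0.1], pi_ref)   # policy has drifted
objective = grpo_objective(0.2, [0.9, 0.1], pi_ref)
print(kl_same, kl_shift, objective)
```

The further the current policy moves from the reference, the larger the KL term and the lower the objective for the same clipped surrogate value.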

PPO vs. GRPO

PPO	GRPO
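Uses a value network (critic) to estimate the advantage function	No value network; the within-group relative reward serves as the baseline
Advantage computation relies on GAE	Advantage comes directly from group-standardized rewards
Updates both the policy and the value network	Updates only the policy network
More complex training, higher memory requirements	Simpler training, more memory-efficient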

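Key innovation: by comparing multiple sampled outputs for the same input against one another, GRPO removes the dependence on a value function entirely.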

Files

The project contains exactly the following three files:

from_scratch.py: Core GRPO implementation and main training script
test_grpo.py: Unit tests and visualization script
README.md: This documentation

Run

# Train the GRPO model
python from_scratch.py

# Run the tests and visualization
python test_grpo.py

Stratified Follow-ups

L1 - Basic Understanding

Why can within-group normalization replace the value function?
How does clipping prevent large policy updates?
What is the role of the KL penalty?

L2 - Mechanism Deep-Dive

How does the group size $K$ affect training stability?
What numerical stability issues arise when dividing by the standard deviation?
How does this compare with variance-reduction techniques for REINFORCE?
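
As a starting point for the standard-deviation question, the sketch below shows two failure modes under assumed toy rewards: a group with identical rewards has zero std, and a group with a nearly tied reward has a tiny std that inflates a negligible difference into a large advantage. The `eps` guard is one common remedy, shown here as an assumption rather than as what from_scratch.py does.

```python
from statistics import fmean, pstdev

def standardize(rewards, eps=0.0):
    """(r_i - mean) / (std + eps); eps = 0 reproduces the raw formula."""
    mean, std = fmean(rewards), pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

identical = [1.0, 1.0, 1.0, 1.0]

# Without eps, a group with identical rewards divides by zero.
try:
    standardize(identical)
    failed = False
except ZeroDivisionError:
    failed = True

# With eps, the same group yields all-zero advantages (no learning signal).
safe = standardize(identical, eps=1e-6)

# A reward gap of only 0.001 still becomes an advantage well above 1.
near_tie = standardize([1.0, 1.0, 1.0, 1.001], eps=1e-6)

print(failed, safe, near_tie[-1])
```

The last case is the subtle one: standardization rescales every group to unit spread, so a group whose outputs are almost equally good contributes as strongly as a group with a clear winner.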

L3 - Advanced Variants

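DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization, arXiv:2503.14476)

Four improvements over GRPO for long-CoT RL:

Clip-Higher: decouples the lower and upper clipping $\epsilon$ and raises the upper bound, giving low-probability tokens room to increase and preventing entropy collapse.
Dynamic Sampling: discards prompts whose group rewards are all identical (the advantage is always 0, so there is no gradient).
Token-level loss: averages over tokens rather than over sequences, so the gradient from long responses is not diluted.
Overlong reward shaping: applies a soft penalty to overlong responses.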
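
The four changes can each be sketched as a small function. Everything below is an illustrative assumption rather than DAPO's reference code: the function names, the values `eps_low=0.2` and `eps_high=0.28`, the linear shape of the length penalty, and the toy inputs.

```python
def clip_higher_term(ratio, adv, eps_low=0.2, eps_high=0.28):
    """Clipped term with separate lower and upper bounds (Clip-Higher)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * adv, clipped * adv)

def keep_group(rewards):
    """Dynamic Sampling: keep a prompt only if its group's rewards differ."""
    return max(rewards) > min(rewards)

def sequence_mean_loss(token_losses):
    """Average inside each response first, then across responses."""
    return sum(sum(seq) / len(seq) for seq in token_losses) / len(token_losses)

def token_mean_loss(token_losses):
    """Token-level loss: average over every token in the group at once."""
    total_tokens = sum(len(seq) for seq in token_losses)
    return sum(sum(seq) for seq in token_losses) / total_tokens

def soft_overlong_penalty(length, max_len, buffer):
    """Assumed linear ramp from 0 down to -1 over the last `buffer` tokens."""
    start = max_len - buffer
    if length <= start:
        return 0.0
    if length <= max_len:
        return (start - length) / buffer
    return -1.0

# A ratio of 1.25 is capped at 1.2 by a symmetric eps = 0.2 but passes here.
term = clip_higher_term(1.25, 1.0)

# One short response (2 tokens) and one long response (8 tokens).
losses = [[1.0, 1.0], [3.0] * 8]
seq_level = sequence_mean_loss(losses)   # (1 + 3) / 2
tok_level = token_mean_loss(losses)      # (2 + 24) / 10
penalty = soft_overlong_penalty(900, max_len=1000, buffer=200)
print(term, keep_group([1, 1, 1]), seq_level, tok_level, penalty)
```

In the toy loss comparison the long response holds 8 of the 10 tokens, so the token-level mean moves toward its per-token loss, whereas the sequence-level mean gives both responses equal weight regardless of length.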

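Dr. GRPO (GRPO Done Right, Liu et al.)

Improvement: corrects GRPO's optimization bias. The std normalization in the advantage, combined with the response-length normalization in the loss, favors longer incorrect responses.
Difference from GRPO: drops the std division and the length normalization (using a constant instead), which gives a less biased objective, better token efficiency, and responses that are not artificially long. ("Dr." stands for Done Right, not Discounted.)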
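
The length bias is easy to see with toy numbers. The sketch below, with hypothetical helper names and an assumed constant normalizer of 16 tokens, compares the per-token weight that an incorrect response receives under length normalization and under a constant, and shows the mean-only advantage.

```python
from statistics import fmean

def mean_centered_advantages(rewards):
    """Advantage without the std division: A_i = r_i - mean(r)."""
    mean = fmean(rewards)
    return [r - mean for r in rewards]

def per_token_weight(adv, length, constant=None):
    """Weight on each token's gradient: adv / length, or adv / constant."""
    return adv / (length if constant is None else constant)

advantages = mean_centered_advantages([1.0, 0.0, 0.0, 1.0])

# Length normalization: a wrong answer (A = -1) is penalized less per token
# when it is longer, which is the bias toward long incorrect responses.
short_wrong = per_token_weight(-1.0, length=4)
long_wrong = per_token_weight(-1.0, length=16)

# Constant normalization (assumed constant = 16): the same penalty per token.
short_const = per_token_weight(-1.0, length=4, constant=16)
long_const = per_token_weight(-1.0, length=16, constant=16)

print(advantages, short_wrong, long_wrong, short_const, long_const)
```

With the constant divisor, the total penalty on a wrong response grows with its length instead of staying fixed, so padding out a wrong answer is no longer a way to soften the penalty on each token.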

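Note: DAPO and Dr. GRPO both keep GRPO's core, a relative advantage that needs no value function, and change only the clipping, sampling, and normalization. See ../../cheatsheets/reasoning-rl-frontier.md for details.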