Runnable from-scratch implementation with tests. Goal: be able to derive and defend every line in an interview.
These are study notes, not the author's research. All methods come from the classic papers listed in the references.
Background
The core challenge in continual learning is catastrophic forgetting: after a neural network is trained on Task 2, the weights tuned for Task 1 are largely overwritten and Task 1 performance collapses (McCloskey & Cohen 1989; Ratcliff 1990).
Method A: EWC (Elastic Weight Consolidation)
Paper: Kirkpatrick et al. 2017, arXiv:1612.00796, "Overcoming catastrophic forgetting in neural networks"
Core idea
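Bayesian view: the Task-2 posterior ∝ Task-2 likelihood × Task-1 posterior (used as the prior). Approximating the Task-1 posterior with a diagonal Laplace approximation (a Gaussian centred at the Task-1 solution, with precision given by the diagonal Fisher) yields the EWC loss:

$$\mathcal{L}_{\text{EWC}}(\theta) = \mathcal{L}_{T2}(\theta) + \frac{\lambda}{2} \sum_i F_i \, (\theta_i - \theta^*_i)^2$$

Term meanings:
- $\theta^*$: the weight anchor recorded at the end of Task 1
- $F_i$: diagonal Fisher information estimate, measuring how important $\theta_i$ is to Task 1
- $\lambda$: regularisation-strength hyperparameter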
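The quadratic penalty described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the `ewc_penalty` in `from_scratch.py` (whose signature is not shown here): the function names, the flat-list representation of weights, and the example numbers are all assumptions made for the sketch.

```python
def ewc_penalty_sketch(theta, theta_star, fisher, lam):
    """Return (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    Hypothetical helper: weights are flat lists of floats.
    """
    if not (len(theta) == len(theta_star) == len(fisher)):
        raise ValueError("theta, theta_star and fisher must have the same length")
    return 0.5 * lam * sum(
        f * (t - a) ** 2 for t, a, f in zip(theta, theta_star, fisher)
    )


def ewc_penalty_grad_sketch(theta, theta_star, fisher, lam):
    """Gradient of the penalty w.r.t. theta: lam * F_i * (theta_i - theta_star_i)."""
    return [lam * f * (t - a) for t, a, f in zip(theta, theta_star, fisher)]


theta_star = [1.0, -2.0]  # anchor recorded at the end of Task 1
fisher = [4.0, 0.0]       # first weight matters for Task 1, second does not
theta = [1.5, 3.0]        # weights after some Task-2 updates

penalty = ewc_penalty_sketch(theta, theta_star, fisher, lam=10.0)
grad = ewc_penalty_grad_sketch(theta, theta_star, fisher, lam=10.0)
```

The second weight has moved far from its anchor but contributes nothing to the penalty because its Fisher value is zero. That per-parameter weighting is what separates this term from a uniform pull towards the anchor.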
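Diagonal Fisher estimate

$$F_i \approx \frac{1}{N} \sum_{n=1}^{N} \left( \frac{\partial \log p(y_n \mid x_n; \theta)}{\partial \theta_i} \bigg|_{\theta = \theta^*} \right)^2$$

This is the mean of the squared per-sample gradients over N Task-1 samples. For regression with a Gaussian likelihood it amounts, up to a constant factor, to squaring each sample's MSE gradient and averaging. Weights whose gradients were large on Task 1 data are the important ones and will be penalised most for drifting.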
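As a worked illustration of "mean squared gradient", here is a sketch for a one-dimensional linear model. It assumes a unit-variance Gaussian likelihood y ~ N(w·x + b, 1), uses the observed labels (the empirical Fisher), and the name `estimate_fisher_diag_sketch` is hypothetical; the `estimate_fisher` in `from_scratch.py` may differ.

```python
def estimate_fisher_diag_sketch(samples, w, b):
    """Empirical diagonal Fisher for y ~ N(w*x + b, 1), evaluated at (w, b).

    samples: list of (x, y) pairs from Task 1 (assumed format).
    Returns (F_w, F_b): the mean of squared per-sample log-likelihood gradients.
    """
    if not samples:
        raise ValueError("need at least one sample")
    sum_sq_w = 0.0
    sum_sq_b = 0.0
    for x, y in samples:
        residual = y - (w * x + b)
        grad_w = residual * x  # d log p(y | x) / d w
        grad_b = residual      # d log p(y | x) / d b
        sum_sq_w += grad_w ** 2  # square each sample's gradient first ...
        sum_sq_b += grad_b ** 2
    n = len(samples)
    return sum_sq_w / n, sum_sq_b / n  # ... then average


samples = [(1.0, 2.5), (2.0, 3.5), (3.0, 6.5)]
fisher_w, fisher_b = estimate_fisher_diag_sketch(samples, w=2.0, b=0.0)
```

The order matters: squaring the averaged gradient instead would give a value near zero at a well-trained solution, since per-sample gradients largely cancel there.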
Method B: Experience Replay
Classic references: Robins 1995, "Catastrophic Forgetting, Rehearsal and Pseudorehearsal" (Connection Science); Lopez-Paz & Ranzato 2017, arXiv:1706.08840 (GEM, a gradient-constrained replay variant).
Core idea
When training on Task 2, interleave a mini-batch of Task-1 samples drawn from a ring buffer with the current Task-2 batch:

$$\mathcal{L}(\theta) = (1 - \alpha)\,\mathcal{L}_{T2}(\theta) + \alpha\,\mathcal{L}_{\text{replay}}(\theta)$$

- $\alpha$ = replay ratio (0.5 → equal weight to both tasks)
- Buffer stores Task-1 $(x, y)$ pairs; FIFO eviction when capacity is full
- No auxiliary parameters; memory cost is O(buffer size)
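A ring buffer with FIFO eviction and the mixed loss can be sketched with the standard library alone. The class name `RingReplayBuffer`, the helper `mixed_loss`, and the symbol `alpha` for the replay ratio are assumptions for this illustration; the repository's `ReplayBuffer` may be organised differently.

```python
import random
from collections import deque


class RingReplayBuffer:
    """Fixed-capacity FIFO buffer of (x, y) pairs (hypothetical sketch)."""

    def __init__(self, capacity, seed=0):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen silently drops the oldest item when full.
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)  # fixed seed keeps sampling deterministic

    def add(self, x, y):
        self._items.append((x, y))

    def sample(self, batch_size):
        """Sample without replacement, capped at the current buffer size."""
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)


def mixed_loss(loss_task2, loss_replay, alpha=0.5):
    """Convex combination of the two losses; alpha is the replay ratio."""
    return (1.0 - alpha) * loss_task2 + alpha * loss_replay


buffer = RingReplayBuffer(capacity=3)
for i in range(5):
    buffer.add(float(i), 2.0 * i)  # the two oldest pairs get evicted

stored = sorted(buffer.sample(3))
combined = mixed_loss(loss_task2=1.0, loss_replay=3.0, alpha=0.5)
```

With a capacity of 3 and five insertions, only the three most recent pairs survive, which is the FIFO behaviour described in the list above.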
Comparison
| Method | Extra storage | Raw-data requirement | Complexity |
|---|---|---|---|
| Naive fine-tuning | 0 | None | Baseline |
| EWC | 2× weights (F + θ*) | Small amount of Task-1 data (to estimate Fisher) | Low |
| Replay | Buffer size | Stored subset of raw Task-1 samples | Low |
EWC suits settings where the original data cannot be retained (e.g., privacy constraints), since it needs Task-1 data only once, to estimate the Fisher. Replay is more direct and often more effective, but requires storing raw samples.
Files
- `from_scratch.py`: `SmallMLP`, `estimate_fisher`, `ewc_penalty`, `ReplayBuffer`, `train_naive` / `train_ewc` / `train_replay`
- `test_ewc_replay.py`: 6 assertion-based tests; fully deterministic (fixed seeds)
python test_ewc_replay.py  # or: python -m pytest test_ewc_replay.py
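Test assertions:
- Naive fine-tuning does forget (Task-1 loss > 0.3 after T2 training)
- EWC forgets significantly less than the naive baseline (reduction ≥ 0.5 MSE units)
- Replay forgets significantly less than the naive baseline (reduction ≥ 0.5 MSE units)
- EWC does not block Task-2 learning (T2 loss drops ≥ 50% from random init)
- All diagonal Fisher values are ≥ 0 (guaranteed, since they are squared gradients)
- ReplayBuffer respects its capacity limit and returns correctly shaped samples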
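Stratified follow-ups
L1 (Concepts)
- Why is the diagonal Fisher the expectation of squared gradients? How does it relate to the Hessian with respect to the weights?
- What is the essential difference between EWC and L2 regularisation? (Hint: L2 anchors to 0, EWC anchors to θ*)
- Why does the replay buffer use FIFO instead of keeping samples by importance? (Leads to reservoir sampling / MIR)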
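For the reservoir-sampling follow-up, a minimal sketch of the classic single-pass scheme (Algorithm R) shows how it differs from FIFO: every item seen so far ends up in the buffer with equal probability, rather than only the most recent ones. The class name `ReservoirBuffer` and its attributes are hypothetical and not part of the repository.

```python
import random


class ReservoirBuffer:
    """Keeps a uniform random subset of a stream in fixed memory (sketch)."""

    def __init__(self, capacity, seed=0):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of items offered so far
        self._rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)  # fill phase: keep everything
            return
        # Draw j uniformly from [0, seen); the new item replaces slot j
        # only if j lands inside the buffer, i.e. with prob. capacity / seen.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


reservoir = ReservoirBuffer(capacity=10)
for sample_id in range(1000):
    reservoir.add(sample_id)
```

Unlike the FIFO ring buffer, early samples are not guaranteed to be evicted, which helps when the stream spans several tasks and the buffer should cover all of them.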
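L2 (Implementation details)
- How do you extend EWC to a longer sequence of tasks? (Accumulate one F + θ* pair per task; online EWC)
- How does the number of samples N used for the Fisher estimate affect its quality? What is the bias-variance tradeoff?
- When the replay buffer is full, how should the eviction policy be chosen? Which strategy does iCaRL use?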
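The two multi-task extensions mentioned in the first question can be sketched as follows. Both function names are hypothetical, weights are flat lists, and the decay factor `gamma` in the online variant is an assumed hyperparameter; treat this as an outline of the idea, not a reference implementation.

```python
def multitask_ewc_penalty(theta, anchors, lam):
    """Sum one quadratic penalty per earlier task.

    anchors: list of (theta_star, fisher) pairs, one per finished task.
    Memory grows linearly with the number of tasks.
    """
    total = 0.0
    for theta_star, fisher in anchors:
        total += sum(
            f * (t - a) ** 2 for t, a, f in zip(theta, theta_star, fisher)
        )
    return 0.5 * lam * total


def online_fisher_update(running_fisher, new_fisher, gamma=0.9):
    """Online variant: fold the newest Fisher into one running estimate.

    Older tasks are down-weighted by gamma, so memory stays constant.
    """
    return [gamma * old + new for old, new in zip(running_fisher, new_fisher)]


anchors = [
    ([0.0, 0.0], [1.0, 1.0]),  # (theta_star, fisher) after task 1
    ([1.0, 1.0], [2.0, 0.0]),  # (theta_star, fisher) after task 2
]
penalty = multitask_ewc_penalty([1.0, 2.0], anchors, lam=2.0)
running = online_fisher_update([1.0, 2.0], [0.5, 0.5], gamma=0.5)
```

In the online variant only the most recent anchor is kept alongside the running Fisher, trading some fidelity to old tasks for constant memory.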
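L3 (Research level)
- How does GEM (arXiv:1706.08840) use the replay buffer with gradient constraints in place of a loss term? What is the advantage?
- What does EWC's diagonal Fisher approximation ignore? (Covariance between parameters → Kronecker-factored approximation → K-FAC/KFAC-EWC)
- What is the fundamental bottleneck of each of the three continual-learning paradigms (regularisation / replay / architecture expansion)?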
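As a starting point for the GEM question, the sketch below shows the gradient constraint in its simplest case. GEM solves a small quadratic program with one constraint per earlier task; with a single earlier task that program reduces to the half-space projection here. The function name and flat-list gradients are assumptions for illustration.

```python
def project_gradient(grad, ref_grad):
    """Project grad so it does not increase the memory loss to first order.

    grad:     gradient of the current-task loss.
    ref_grad: gradient of the loss on replay-buffer samples.
    If the two gradients conflict (negative dot product), remove the
    conflicting component; otherwise return grad unchanged.
    """
    dot = sum(g * r for g, r in zip(grad, ref_grad))
    if dot >= 0.0:
        return list(grad)  # no conflict: the update is already safe
    ref_sq = sum(r * r for r in ref_grad)  # > 0, because dot < 0 rules out a zero ref_grad
    scale = dot / ref_sq
    return [g - scale * r for g, r in zip(grad, ref_grad)]


conflicting = project_gradient([1.0, -1.0], [0.0, 1.0])
compatible = project_gradient([1.0, 1.0], [0.0, 1.0])
```

The buffer here acts as a constraint on the update direction instead of an extra loss term, so there is no replay ratio to tune, at the cost of an extra backward pass on the buffer samples at each step.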
References
- Kirkpatrick et al. (2017). "Overcoming catastrophic forgetting in neural networks." PNAS. arXiv:1612.00796
- Lopez-Paz & Ranzato (2017). "Gradient Episodic Memory for Continual Learning." NeurIPS. arXiv:1706.08840
- Robins (1995). "Catastrophic Forgetting, Rehearsal and Pseudorehearsal." Connection Science, 7(2).
- McCloskey & Cohen (1989). "Catastrophic Interference in Connectionist Networks." Psychology of Learning and Motivation, 24.
- Ratcliff (1990). "Connectionist Models of Recognition Memory." Psychological Review, 97(2).