When a model learns continuously on sequential tasks, how can it acquire new knowledge without forgetting old knowledge? This is a mandatory question beyond the "train-once-then-deploy" paradigm, and a latent risk throughout the LLM post-training pipeline (pretrain → SFT → DPO → RL).
Study notes, not the author's own research (see README honesty statement). Numbers / conclusions follow the original papers; uncertain items are noted.
0. The evolution
IID one-shot training → sequential multi-task (Task 1 → Task 2 → …) → catastrophic forgetting appears: gradients directly overwrite old weights.
Stability-plasticity dilemma: a network must be plastic — accepting gradient updates for new tasks — and stable — preserving representations of old tasks. The two requirements are inherently in tension.
TL;DR — quick anchors (2-minute pass)
- Core tension = stability-plasticity: plastic (learn new) vs. stable (keep old) inherently conflict; catastrophic forgetting = new-task gradients overwriting shared parameters, not a capacity shortfall.
- Four settings: Task-IL / Domain-IL / Class-IL (hardest) / Continual pretraining (no boundaries); difficulty hinges on whether the task ID is known at test time.
- Three method families: regularization (EWC/SI/MAS, penalize moving important weights) / replay (ER/GEM/A-GEM/DER, mix old samples or soft targets) / parameter isolation (ProgNN/PackNet/LoRA-CL, separate sub-networks).
- Parameter isolation (esp. ProgNN/PackNet) gives structural zero-forgetting; regularization / replay are merely approximate; isolation's cost is parameter growth with task count.
- EWC: L + (λ/2)·Σ Fᵢ(θᵢ−θᵢ*)², where the Fisher diagonal Fᵢ measures old-task importance; λ too small still forgets, too large can't learn the new task.
- LwF = distilling the old model's soft outputs, no old data needed; but soft-target quality drops under large task drift.
- Three metrics: AA (overall retention) / BWT (forgetting, signed, ideal ≥0) / FWT (forward transfer); don't read BWT as "larger magnitude is better."
- LLM angle: continual pretrain / instruction-tuning / alignment; pretrain→SFT→DPO→RL is CL at every hop, alignment tax accumulates; KL constraint ≈ implicit EWC.
- Knowledge editing (ROME rank-1 / MEMIT batch) = targeted surgery, complementary to CL's global protection; sequential edits cause interference, so judge on reliability / generalization / locality.
1. Why catastrophic forgetting happens
A neural network's parameters are shared storage for all tasks. When running SGD on Task 2, the gradient of the loss with respect to the parameters has no knowledge that "these weights matter for Task 1", so it overwrites them. This is catastrophic forgetting.
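The overwrite can be seen in a toy setting. The sketch below assumes two hypothetical one-parameter tasks with quadratic losses whose optima sit at θ = 1 (Task 1) and θ = −1 (Task 2); these losses are illustrative only and do not come from any of the cited papers.

```python
# Toy illustration (hypothetical losses): two tasks share one parameter theta
# but have different optima, so descending on Task 2 climbs Task 1's loss.
def loss_task1(theta):
    return (theta - 1.0) ** 2


def loss_task2(theta):
    return (theta + 1.0) ** 2


def grad_task2(theta):
    return 2.0 * (theta + 1.0)


def train_on_task2(theta, lr=0.1, steps=50):
    """Plain gradient descent on Task 2 only; Task 1 is never consulted."""
    for _ in range(steps):
        theta -= lr * grad_task2(theta)
    return theta


theta_after_task1 = 1.0  # assume Task 1 was already solved exactly
theta_after_task2 = train_on_task2(theta_after_task1)

print("Task 1 loss before Task 2:", loss_task1(theta_after_task1))
print("Task 1 loss after Task 2: ", loss_task1(theta_after_task2))
print("Task 2 loss after Task 2: ", loss_task2(theta_after_task2))
```

Nothing in `grad_task2` refers to Task 1, which is the whole point: the model has capacity for either optimum, but the update rule has no reason to stay near the old one.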
Classic settings:
| Setting | Task ID known at test time | Clear task boundary |
|---|---|---|
| Task-IL | Yes | Yes |
| Domain-IL | No | Yes |
| Class-IL (hardest) | No | Yes |
| Continual pretraining | No explicit boundary | No |
Misconception: "Catastrophic forgetting is the model 'not remembering' / lacking capacity." The root cause is interference, not capacity—network parameters are shared storage for all tasks, and a new task's SGD gradient doesn't know "which weights matter for old tasks," so it simply overwrites them. So scaling the model up alone doesn't solve it—you protect important weights / replay the old distribution / isolate sub-networks.
2. Three method families
2.1 Regularization
Core idea: add a penalty that protects old weights to the new-task loss, so that weights important to old tasks change as little as possible.
EWC (Elastic Weight Consolidation; Kirkpatrick 2017) uses the diagonal of the Fisher information matrix to measure weight importance and adds a quadratic penalty that discourages moving important weights. Total loss:
$$\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \frac{\lambda}{2} \sum_i F_i \,(\theta_i - \theta_i^*)^2$$
- $\theta_i^*$: parameter values after completing the old task
- $F_i$: diagonal element of the Fisher information matrix (measures the "importance" of parameter $\theta_i$ to the old task)
- $\lambda$: hyperparameter controlling the stability vs. plasticity trade-off
SI (Synaptic Intelligence): an online version of EWC that accumulates each parameter's contribution to the loss during training, without needing to recompute the Fisher after the task ends.
MAS (Memory Aware Synapses): importance is estimated by the gradient norm of the output function with respect to parameters, requiring no labeled data.
| Method | Importance estimate | Needs old data | Compute cost |
|---|---|---|---|
| EWC | Fisher diagonal | No | Medium (one backward pass) |
| SI | Online trajectory integral | No | Low (computed during training) |
| MAS | Output gradient norm | No (only unlabeled input needed) | Low–medium |
2.2 Rehearsal & Replay
Core idea: mix in old-task samples during new-task training, so that gradients "remember" the past simultaneously.
Experience Replay: maintain an episodic memory buffer holding real samples from old tasks; randomly interleave them at a fixed ratio during new-task training.
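A minimal sketch of such a buffer follows. The notes do not say how the buffer is filled, so reservoir sampling is an assumption here, and `ReplayBuffer`, `mixed_batch`, and `replay_ratio` are hypothetical names introduced for illustration.

```python
import random


class ReplayBuffer:
    """Fixed-size episodic memory filled by reservoir sampling (an assumed policy)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Each example seen so far stays in the buffer with
            # probability capacity / n_seen.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        k = min(k, len(self.items))
        return self.rng.sample(self.items, k)


def mixed_batch(new_batch, buffer, replay_ratio=0.5):
    """Append replayed old examples; replay_ratio = old examples per new example."""
    n_replay = int(len(new_batch) * replay_ratio)
    return list(new_batch) + buffer.sample(n_replay)


buffer = ReplayBuffer(capacity=4)
for i in range(100):
    buffer.add(("task1", i))

batch = mixed_batch([("task2", i) for i in range(8)], buffer)
n_old = sum(1 for task, _ in batch if task == "task1")
print(len(batch), "examples,", n_old, "replayed from task1")
```

The gradient computed on `batch` then reflects both tasks at once, which is the sense in which replay lets gradients "remember" the past.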
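GEM (Gradient Episodic Memory; Lopez-Paz 2017) projects gradient updates into the feasible region where old-task losses do not increase, while allowing forward transfer.
For each old task $k$, GEM requires the updated gradient $\tilde{g}$ to satisfy
$$\langle \tilde{g}, g_k \rangle \ge 0,$$
where $g_k$ is the gradient of the loss on task $k$'s memory buffer; i.e., the new gradient must not point "opposite" to the old-task gradient. If the raw gradient $g$ violates a constraint, it is projected onto the feasible region by solving a QP.
A-GEM (Averaged GEM; Chaudhry 2019) merges the old-task constraints into a single average-gradient constraint, drastically reducing computation while achieving performance comparable to GEM. It uses the average gradient $g_{\text{ref}}$ over the old-task buffers as the single constraint, reducing the per-step QP projection from one constraint per old task to a single constraint with a closed form:
$$\tilde{g} = g - \frac{g^\top g_{\text{ref}}}{g_{\text{ref}}^\top g_{\text{ref}}}\, g_{\text{ref}} \quad \text{if } g^\top g_{\text{ref}} < 0, \qquad \tilde{g} = g \text{ otherwise.}$$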
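The single-constraint projection is small enough to sketch directly. The example below assumes gradients are flattened into plain Python lists; `agem_project` and the two example vectors are hypothetical and chosen only to make the conflict visible.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def agem_project(g, g_ref):
    """Return g unchanged if it does not conflict with g_ref;
    otherwise remove the component of g that points against g_ref."""
    inner = dot(g, g_ref)
    if inner >= 0.0:
        return list(g)
    scale = inner / dot(g_ref, g_ref)
    return [gi - scale * ri for gi, ri in zip(g, g_ref)]


g_new = [1.0, -2.0]  # hypothetical new-task gradient
g_ref = [0.0, 1.0]   # hypothetical average gradient on a replay-buffer batch

g_proj = agem_project(g_new, g_ref)
print(g_proj, dot(g_proj, g_ref))
```

After projection the inner product with `g_ref` is zero, so a small step along the projected direction leaves the buffer loss unchanged to first order while keeping the part of the update that did not conflict.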
Generative Replay: use a generative model (e.g., VAE, GAN) to learn the old-task distribution and synthesize "pseudo-old data" at training time — no real old samples need to be stored, but generative quality bottlenecks accumulate error.
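DER (Dark Experience Replay; Buzzega 2020) stores old-sample logits (soft targets) in the buffer and uses MSE to match them, fusing rehearsal with knowledge distillation. The buffer stores pairs $(x, z)$, where $z$ is the logit vector (dark knowledge) the model produced at the past time step; at replay time the model must also match $z$:
$$\mathcal{L} = \mathcal{L}_{\text{CE}}\big(f_\theta(x_{\text{new}}), y_{\text{new}}\big) + \alpha \, \mathbb{E}_{(x, z) \sim \mathcal{M}} \big\| f_\theta(x) - z \big\|_2^2$$
where $\mathcal{M}$ is the buffer, $f_\theta$ outputs logits, and $\alpha$ weights the replay term.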
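As a rough sketch of how the two terms combine, the snippet below computes a DER-style loss for one new example and one replayed example. It assumes logits are plain Python lists and uses a mean over logit dimensions for the MSE term; `der_loss` and `alpha` are illustrative names, not the paper's code.

```python
import math


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def der_loss(new_logits, new_label, replay_logits_now, replay_logits_stored, alpha=0.5):
    """New-task cross-entropy plus alpha * MSE between the current logits on a
    buffered input and the logits stored when that input was first seen."""
    return cross_entropy(new_logits, new_label) + alpha * mse(
        replay_logits_now, replay_logits_stored
    )


loss = der_loss(
    new_logits=[0.0, 0.0],
    new_label=0,
    replay_logits_now=[1.0, 3.0],     # current model's logits on a buffered input
    replay_logits_stored=[1.0, 1.0],  # logits stored in the buffer earlier
)
print(loss)
```

Matching stored logits rather than hard labels is what makes this a distillation-style replay: the buffer carries the old model's full output, not just its decision.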
2.3 Parameter Isolation & Architectural CL
Core idea: different tasks occupy different sub-networks; new tasks expand capacity without modifying old-task parameters — forgetting is structurally eliminated.
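Progressive Neural Networks (Rusu 2016): each task gets an independent network column; old columns are frozen; new columns read old-column activations via lateral connections, so forgetting is structurally impossible:
$$h_i^{(k)} = f\Big( W_i^{(k)} h_{i-1}^{(k)} + \sum_{j < k} U_i^{(k:j)} h_{i-1}^{(j)} \Big)$$
where $h_i^{(k)}$ is the layer-$i$ activation of column $k$, $W_i^{(k)}$ are that column's own weights, and $U_i^{(k:j)}$ are the lateral connections from frozen column $j$.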
- Advantage: zero forgetting, natural forward transfer
- Disadvantage: parameter count grows linearly with the number of tasks
PackNet: performs iterative pruning + freezing within a single network, assigning a parameter mask to each task, with no shared gradient paths between tasks.
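The freezing half of this idea can be sketched in a few lines. The example assumes ownership of each weight has already been decided (PackNet derives it by pruning, which is not reproduced here), and `masked_sgd_step` and `owner` are hypothetical names.

```python
def masked_sgd_step(weights, grads, owner, task_id, lr=0.1):
    """Update only the weights owned by task_id; every other weight is frozen."""
    return [
        w - lr * g if o == task_id else w
        for w, g, o in zip(weights, grads, owner)
    ]


weights = [0.5, -0.3, 0.8, 0.1]
owner = [1, 1, 2, 2]          # hand-assigned here; PackNet would obtain this via pruning
grads = [1.0, 1.0, 1.0, 1.0]  # hypothetical gradients from a task-2 batch

new_weights = masked_sgd_step(weights, grads, owner, task_id=2)
print(new_weights)
```

Because task-1 weights never receive an update, task-1 outputs computed through the task-1 mask are unchanged by construction, which is the structural guarantee the paragraph above describes.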
LoRA-based / Adapter-based CL: incrementally adds a set of LoRA weights or adapter modules per task; the backbone is frozen; tasks are routed to the corresponding adapter via task ID — parameter overhead is manageable and LLM-friendly.
| Method family | Zero forgetting | Forward transfer | Parameter growth | Needs old data |
|---|---|---|---|---|
| Regularization (EWC/SI/MAS) | Approximate | Limited | None | No |
| Replay (ER/GEM/A-GEM/DER) | Approximate | Yes | Buffer | Yes (partial) |
| Parameter isolation (ProgNN/PackNet/LoRA-CL) | Yes | Limited–yes | Linear–lightweight | No |
Misconception: "Regularization methods like EWC fully prevent forgetting." Regularization and replay only give approximate zero-forgetting (soft constraint / sampled mixing): EWC's λ too small still forgets, too large can't learn the new task. Truly structural zero-forgetting comes only from parameter isolation (ProgNN freezing old columns / PackNet masks), at the cost of parameter growth with task count.
3. Knowledge Distillation: LwF
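LwF (Learning without Forgetting; Li 2016): when training on a new task, uses the soft outputs of the old model as distillation targets, mitigating forgetting without storing any old data.
- Before training on the new task, run the frozen old model on the new-task data to produce soft outputs $\hat{y}_{\text{old}} = f_{\theta_{\text{old}}}(x_{\text{new}})$
- Joint optimization: $\mathcal{L} = \mathcal{L}_{\text{new}}(y_{\text{new}}, \hat{y}_{\text{new}}) + \lambda \, \mathcal{L}_{\text{distill}}(\hat{y}_{\text{old}}, \hat{y}_{\text{new}})$
- $\mathcal{L}_{\text{distill}}$ is typically temperature-scaled KL divergence or cross-entropy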
- No old data required; downside: soft-target quality degrades when task drift is large
LwF is essentially implicit replay of the old model's knowledge rather than old data — attractive in privacy-sensitive or storage-constrained settings.
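A minimal sketch of the distillation term, assuming the temperature-scaled KL variant and logits given as plain Python lists. The names `distill_kl`, `lwf_loss`, `lam`, and the default temperature are illustrative choices, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def distill_kl(old_logits, new_logits, temperature=2.0):
    """KL(p_old || p_new) between temperature-softened output distributions."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def lwf_loss(new_task_loss, old_logits, new_logits, lam=1.0, temperature=2.0):
    """New-task loss plus lam times the distillation term toward the old model."""
    return new_task_loss + lam * distill_kl(old_logits, new_logits, temperature)


same = distill_kl([1.0, 2.0], [1.0, 2.0])
drifted = distill_kl([2.0, 0.0], [0.0, 2.0])
print(same, drifted)
```

The old logits are computed on new-task inputs, so when those inputs are far from anything the old model saw, `p_old` flattens out and the term carries little signal, which is the degradation under task drift noted above.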
4. Evaluation Metrics
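Let $R_{i,j}$ denote accuracy on task $j$ measured after completing task $i$, across $T$ tasks total.
Average Accuracy (AA): average accuracy over all tasks after learning all of them:
$$\text{AA} = \frac{1}{T} \sum_{j=1}^{T} R_{T,j}$$
Backward Transfer (BWT), where more negative means more forgetting:
$$\text{BWT} = \frac{1}{T-1} \sum_{j=1}^{T-1} \big( R_{T,j} - R_{j,j} \big)$$
Forward Transfer (FWT), where positive values indicate old tasks benefited new ones:
$$\text{FWT} = \frac{1}{T-1} \sum_{j=2}^{T} \big( R_{j-1,j} - b_j \big)$$
where $b_j$ is the baseline accuracy for task $j$ trained independently from random initialization.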
| Metric | Measures | Ideal value |
|---|---|---|
| AA | Overall memory retention | Higher is better |
| BWT | Degree of forgetting | ≥ 0 (closer to 0 or > 0 is better) |
| FWT | Forward transfer | > 0 |
Forgetting is sometimes defined directly as the mean accuracy drop for each task from "when learned" to "final", which is the negation of BWT.
Misconception: "Larger BWT magnitude is better." BWT is a signed metric: negative means forgetting, more negative is worse, and the ideal is ≥0 or close to 0 (positive = old tasks are even reinforced by later ones). When reading CL papers, don't treat "large BWT" as good—check its sign and how close it is to 0.
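To make the sign conventions concrete, the sketch below computes the three metrics from an accuracy matrix. It assumes the GEM-style definitions, a zero-indexed nested list `R` where `R[i][j]` is accuracy on task j after training through task i, and a caller-supplied `baseline` list; the numbers are made up for illustration.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after the final task has been learned."""
    T = len(R)
    return sum(R[T - 1][j] for j in range(T)) / T


def backward_transfer(R):
    """Mean change on each earlier task between 'just learned' and 'final'."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def forward_transfer(R, baseline):
    """Mean gain on each task, measured just before training it, over a baseline."""
    T = len(R)
    return sum(R[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)


# Hypothetical 3-task run.
R = [
    [0.90, 0.20, 0.10],
    [0.70, 0.85, 0.30],
    [0.60, 0.75, 0.80],
]
baseline = [0.10, 0.10, 0.10]

aa = average_accuracy(R)
bwt = backward_transfer(R)
fwt = forward_transfer(R, baseline)
print(f"AA={aa:.3f}  BWT={bwt:.3f}  FWT={fwt:.3f}")
```

In this made-up run BWT is negative (task 1 fell from 0.90 to 0.60), so the model forgot even though FWT is positive; reading only the magnitude of BWT would hide that.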
5. The LLM Angle
5.1 Continual Pretraining
Continuing to train a language model on new-domain corpora after initial pretraining (e.g., medical text, code updates); core challenges:
- Old general capabilities (math reasoning, instruction following) may degrade
- Distribution shift from new corpora may be milder than task-level shift but persists longer
- Common mitigations: learning-rate warm-up restart, replaying a small amount of general data, reducing the learning rate
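One way to picture the warm-up restart is as a schedule that is re-run at the start of each continual-pretraining phase. The sketch below assumes linear warm-up followed by cosine decay; `rewarmed_lr` and all default values are hypothetical, not recommendations from the cited work.

```python
import math


def rewarmed_lr(step, total_steps, warmup_steps=100, peak_lr=1e-4, min_lr=1e-5):
    """Linear warm-up from min_lr to peak_lr, then cosine decay back to min_lr.
    `step` counts from 0 at the start of the new continual-pretraining phase."""
    if step < warmup_steps:
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, rewarmed_lr(step, total_steps=1000))
```

A lower `peak_lr` than the original pretraining run corresponds to the "reducing the learning rate" mitigation; mixing in general-domain data is a separate lever handled in the data pipeline.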
5.2 Continual Instruction-Tuning
Sequentially introducing new instruction types (e.g., coding → then math → then safety); each SFT round may overwrite behaviors learned in previous rounds. LoRA-based or adapter-per-task approaches are natural low-parameter solutions: the backbone is shared, and behaviors are separated by module routing.
5.3 Continual Alignment & Alignment Tax
Sequential alignment pipeline: pretrain → SFT → DPO → RL (RLHF/RLVR) — each step continues training from the previous step's checkpoint.
Alignment tax: alignment often trades away part of general capabilities (e.g., code generation, factuality). When steps are stacked sequentially, the tax accumulates:
- Excessive SFT may compress knowledge diversity
- Doing RLHF after DPO can lead to over-refusal or format degradation
- Every hop faces the CL problem of "new alignment objective vs. behavior learned in the previous hop"
Mitigation strategies:
- KL constraint (PPO clip / DPO reference model) — limits each step's deviation from the reference; structurally analogous to implicit EWC
- Replay old preference data — mix in data from earlier alignment stages
- LoRA with independent adapter per stage — backbone unchanged, alignment behavior localized
Misconception: "The alignment chain SFT→DPO→RL has independent steps." Each hop continues training on the previous hop's checkpoint—itself a CL problem: alignment tax accumulates hop by hop (over-SFT compresses diversity; RLHF after DPO can cause over-refusal / format degradation). This is also why the KL constraint (PPO clip / DPO reference model) is essentially an implicit analog of EWC.
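A short worked expansion shows where the "implicit EWC" reading comes from. It assumes the policy $\pi_\theta$ is smooth in $\theta$ and writes $\theta^*$ for the reference (previous-hop) parameters; both symbols are introduced here for illustration. The zeroth- and first-order terms of the KL divergence vanish at $\theta = \theta^*$, leaving a Fisher-weighted quadratic:

$$\mathrm{KL}\big(\pi_{\theta^*} \,\|\, \pi_\theta\big) \approx \frac{1}{2} (\theta - \theta^*)^\top F(\theta^*) \,(\theta - \theta^*), \qquad F(\theta^*) = \mathbb{E}_{x}\, \mathbb{E}_{y \sim \pi_{\theta^*}(\cdot \mid x)} \big[ \nabla_\theta \log \pi_{\theta^*}(y \mid x) \, \nabla_\theta \log \pi_{\theta^*}(y \mid x)^\top \big]$$

Keeping only the diagonal of $F$ and scaling by a coefficient gives the EWC form $\frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$. The correspondence is local, so it weakens as $\theta$ moves far from $\theta^*$.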
5.4 Why CL Matters for LLM Post-Training
| Scenario | CL challenge |
|---|---|
| Incremental update of a new model version | Cannot retrain from scratch; must fine-tune incrementally without losing old capabilities |
| Multiple rounds of RLHF iteration | Each policy update may overwrite alignment from the previous round |
| Personalization / continuous user feedback | Adapt to new user preferences while preserving general capabilities |
| Knowledge update (time-sensitive information) | Inject new facts without disturbing old knowledge structure |
5.5 Knowledge Editing (ROME / MEMIT)
Positioning: CL regularization (EWC etc.) is global protection—penalizing all weights important to old tasks; knowledge editing (model editing) is targeted surgery—rewriting only the few parameters that store a single fact, updating one piece of knowledge precisely without retraining. The two are complementary: CL prevents "learning new, forgetting old," while editing performs "precise revision of the old."
ROME (Rank-One Model Editing; Meng 2022): causal tracing finds that factual knowledge resides mainly in middle-layer MLP key→value associations. ROME treats that MLP's second linear projection (down-projection) weight $W$ as a linear associative memory (key $k$ maps to value $v$), so inserting a new fact $(k_*, v_*)$ is equivalent to a minimal-change constrained least squares:
$$\min_{\hat{W}} \big\| \hat{W} K - V \big\|_F^2 \quad \text{s.t.} \quad \hat{W} k_* = v_*$$
whose closed-form solution is a rank-1 update to $W$, preserving existing associations while mapping the target subject's key to the new object.
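The rank-1 solution can be checked numerically. The sketch below uses the closed form from the ROME paper, $\hat{W} = W + \Lambda (C^{-1} k_*)^\top$ with $C = K K^\top$ and $\Lambda = (v_* - W k_*) / \big((C^{-1} k_*)^\top k_*\big)$, on small random matrices. The dimensions and random data are assumptions for illustration; a real edit would take keys and values from a transformer's MLP activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_key, d_val, n_old = 4, 3, 50

W = rng.normal(size=(d_val, d_key))  # stand-in for the MLP down-projection
K = rng.normal(size=(d_key, n_old))  # previously stored keys, one per column
C = K @ K.T                          # uncentered key covariance

k_star = rng.normal(size=d_key)      # key of the fact being edited
v_star = rng.normal(size=d_val)      # value the edited key should map to


def rank_one_edit(W, C, k_star, v_star):
    """Return W + Lambda (C^{-1} k*)^T, which satisfies W_hat @ k* = v*."""
    c_inv_k = np.linalg.solve(C, k_star)
    lam = (v_star - W @ k_star) / (c_inv_k @ k_star)
    return W + np.outer(lam, c_inv_k)


W_hat = rank_one_edit(W, C, k_star, v_star)
print("constraint met:", np.allclose(W_hat @ k_star, v_star))
print("rank of update:", np.linalg.matrix_rank(W_hat - W, tol=1e-8))
```

The update direction is weighted by $C^{-1}$, so it changes outputs least along key directions that were heavily used by existing associations, which is how the edit stays targeted.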
MEMIT (Mass-Editing Memory in a Transformer; Meng 2022): extends ROME's single edit to batches of thousands of facts, amortizing the update across multiple middle layers and mitigating the degradation ROME suffers when editing many facts sequentially one by one.
Forgetting under sequential edits: when editing many facts in sequence, later edits interfere with earlier ones (edit interference) and may spill over to unrelated knowledge and general capabilities—exactly catastrophic forgetting reappearing in the "editing" paradigm. Edit quality is therefore measured on three axes simultaneously: reliability (the target fact is corrected), generalization (paraphrases / synonymous rewordings also take effect), and locality / specificity (unrelated knowledge is untouched). These three trade off against one another, isomorphic to stability-plasticity.
Misconception: "ROME/MEMIT edits one fact, and that's it—no effect on anything else." Editing many facts in sequence causes edit interference, and may spill over to unrelated knowledge and general capabilities—exactly catastrophic forgetting reappearing in the "editing" paradigm. So edit quality must be judged on three axes at once: reliability (corrected), generalization (paraphrases/synonyms also take effect), locality (unrelated knowledge untouched), which trade off against each other.
6. From-scratch EWC
```python
import torch
import torch.nn.functional as F


def compute_fisher_diagonal(model, dataloader, device="cpu"):
    """
    Estimate the Fisher information matrix diagonal using squared log-likelihood
    gradients, accumulated one sample at a time.

    dataloader: old-task data; model: model after completing the old task (parameters fixed).
    Returns dict: param_name -> F_i (same shape as parameter)
    """
    model.eval()
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_samples = 0

    for x, _ in dataloader:
        x = x.to(device)
        # Per-sample loop: squaring a batch-summed gradient would give
        # (sum_i g_i)^2 rather than the sum_i g_i^2 that the Fisher diagonal needs.
        for i in range(x.size(0)):
            model.zero_grad()
            logits = model(x[i : i + 1])
            log_prob = F.log_softmax(logits, dim=-1)
            # Use predicted label to approximate sampling. This is a simplified
            # implementation; the original EWC paper actually uses the true label y
            # to estimate Fisher (empirical Fisher = E_{(x,y)~D}[grad log p(y|x)^2]).
            pred = log_prob.argmax(dim=-1)
            loss = F.nll_loss(log_prob, pred, reduction="sum")
            loss.backward()
            for n, p in model.named_parameters():
                if p.grad is not None:
                    fisher[n] += p.grad.detach().pow(2)
            n_samples += 1

    model.zero_grad()  # do not leak Fisher-estimation gradients into training
    fisher = {n: f / n_samples for n, f in fisher.items()}
    return fisher


class EWCTrainer:
    """
    Adds an EWC quadratic penalty during new-task training to discourage
    overwriting parameters that matter for the old task.
    """

    def __init__(self, model, old_dataloader, ewc_lambda=5000.0, device="cpu"):
        self.model = model
        self.device = device
        self.ewc_lambda = ewc_lambda
        # Save a snapshot of old-task parameters
        self.theta_star = {
            n: p.detach().clone()
            for n, p in model.named_parameters()
        }
        # Compute Fisher diagonal
        self.fisher = compute_fisher_diagonal(model, old_dataloader, device)

    def ewc_penalty(self):
        """EWC penalty term: (lambda/2) sum_i F_i (theta_i - theta*_i)^2"""
        penalty = torch.tensor(0.0, device=self.device)
        for n, p in self.model.named_parameters():
            penalty = penalty + (
                self.fisher[n] * (p - self.theta_star[n]).pow(2)
            ).sum()
        return 0.5 * self.ewc_lambda * penalty

    def train_step(self, x, y, optimizer, criterion):
        """One training step on the new task: task loss + EWC penalty."""
        self.model.train()  # compute_fisher_diagonal left the model in eval mode
        x, y = x.to(self.device), y.to(self.device)
        optimizer.zero_grad()
        logits = self.model(x)
        task_loss = criterion(logits, y)
        loss = task_loss + self.ewc_penalty()
        loss.backward()
        optimizer.step()
        return task_loss.item(), loss.item()
```
`ewc_lambda` is the core hyperparameter: too small and forgetting remains severe; too large and the new task cannot be learned. In practice it is tuned by search, and multiple Fisher groups are accumulated as tasks grow (online EWC merges them via a running average).
Stratified follow-ups
L1 Foundations
1. What is catastrophic forgetting? Why are neural networks especially prone to it?
Answer: A neural network's parameters are shared storage for all tasks. When running SGD on a new task, the gradient has no knowledge of which weights matter for the old task, so it directly overwrites them and causes a sudden drop in old-task performance. This is catastrophic forgetting. The root cause is parameter sharing combined with independent optimization objectives: the new-task gradient is "noise" with respect to the old-task loss.
Follow-up: From an optimization-theory perspective, what is the essence of catastrophic forgetting? → The essence is that the gradient-descent direction for the new task is an ascent direction on the old-task loss surface — the optimal parameter regions for the two tasks do not overlap in parameter space. SGD moving along the new-task gradient simultaneously destroys the local minimum of the old task. This is the degenerate form of Pareto conflict in multi-objective optimization under single shared parameterization.
2. What is the stability-plasticity dilemma? Give an intuitive analogy.
Answer: A network must simultaneously have plasticity — accepting gradient updates for new tasks — and stability — preserving old-task representations. The two are inherently in tension. Intuitive analogy: when learning a new language, the brain must remember the mother tongue (stability) while building new language circuits (plasticity); memorizing the new language by rote may crowd out the neural pathways of the mother tongue.
Follow-up: Along the stability-plasticity axis, where do the three method families (regularization / replay / parameter isolation) each fall? How do you choose based on task requirements? → Regularization (EWC/SI) leans toward stability: the penalty constrains parameter deviation, limiting plasticity. Replay sits in the middle: mixing in old data balances both ends, but buffer size determines the lean. Parameter isolation (ProgNN/LoRA-CL) leans most toward stability: old parameters are fully frozen and only new capacity is added. When tasks are highly correlated and forward transfer is valuable, choose replay; when tasks are independent and storage/privacy is constrained, choose parameter isolation.
3. What is the core idea of EWC? What role does the Fisher information matrix play?
Answer: EWC (Elastic Weight Consolidation) adds a quadratic penalty to the new-task loss, so that weights important to old tasks change as little as possible. The Fisher information matrix diagonal $F_i$ measures the importance of parameter $\theta_i$ to the old task: a larger $F_i$ means a stronger penalty, and that parameter is "elastically" protected.
Follow-up: What are the main limitations of the Fisher information matrix diagonal approximation? What improved alternatives exist? → The diagonal approximation ignores inter-parameter covariance. When two parameters have strong coupled importance to the old-task loss (e.g., Q/K weights in attention), penalizing each independently underestimates the true curvature. Improvements include: block-diagonal approximation (K-FAC, retaining intra-layer covariance per block), Kronecker factorization, and SI's online trajectory integral (replaces Fisher with each parameter's actual contribution to loss reduction, avoiding post-hoc recomputation).
4. Both LwF and EWC avoid storing old data — how do their anti-forgetting mechanisms differ?
Answer: EWC imposes constraints in parameter space: Fisher penalties keep important old-task weights close to their old values. LwF (Learning without Forgetting) imposes constraints in output space: it uses the old model's soft output on new-task data as a distillation target $\hat{y}_{\text{old}}$, keeping the new model's output behavior close to the old model's. LwF requires no importance-score recomputation, but soft-target quality degrades as task drift accumulates.
Follow-up: Under what conditions does LwF's anti-forgetting effect severely degrade? How can it be mitigated? → When the input distributions of old and new tasks are extremely dissimilar (e.g., old task is image classification, new task is text), the old model's soft outputs on new-task data become nearly uniform or have very low confidence, so the distillation signal quality approaches random, equivalent to having no constraint. Mitigation: combine with parameter-space constraints (EWC penalties on important weights) to supplement output-space distillation; or compute the distillation loss only on the subset where old and new task input spaces overlap, filtering out low-confidence soft targets.
L2 Advanced
5. What is the core difference between GEM and A-GEM in replay methods? Why is A-GEM more efficient?
Answer: GEM imposes a separate gradient constraint $\langle \tilde{g}, g_k \rangle \ge 0$ for each old task $k$, which requires solving a QP with one inequality per old task at every step, so the cost grows with the number of tasks. A-GEM merges all old-task constraints into a single mean constraint $\langle \tilde{g}, g_{\text{ref}} \rangle \ge 0$ with a closed-form projection, so the per-step projection cost no longer depends on the number of tasks.
Follow-up: How stable is A-GEM's mean constraint under buffer-sampling noise? What improvement directions exist? → A-GEM estimates $g_{\text{ref}}$ by randomly sampling from the buffer at each step; high-variance sampling makes the constraint direction unstable. When the estimated $g_{\text{ref}}$ deviates significantly from the true mean, the projection may go in the wrong direction and introduce additional forgetting noise. Improvement directions include: increasing the buffer sample size per step to reduce variance; smoothing historical $g_{\text{ref}}$ with momentum (similar to EMA); and follow-up work such as ER-ACE, which bypasses the gradient-projection framework entirely via asymmetric cross-entropy.
6. What does each of BWT and FWT measure? What does it indicate if a model has very negative BWT but very positive FWT?
Answer: BWT (Backward Transfer) measures forgetting; more negative means greater performance drop on old tasks. FWT (Forward Transfer) measures positive transfer from old tasks to new ones; more positive means better transfer. Very negative BWT but very positive FWT means the model leveraged old knowledge to accelerate learning on new tasks (good transfer) while severely overwriting old-task parameters (heavy forgetting) — a typical high-plasticity, low-stability model.
Follow-up: What are the limitations of directly using BWT and FWT to evaluate LLMs' continual-learning capability? → LLM task boundaries are blurry (SFT/DPO/RL are soft boundaries rather than discrete task sequences), making it hard to construct a clear performance matrix. LLM "capability" is multi-dimensional (reasoning / code / safety), and single-number accuracy cannot capture capability interference. Additionally, the stability gap (D3) means that end-of-task snapshots underestimate the worst-case forgetting during training — for deployment safety, peak forgetting magnitude is more critical than BWT.
7. Why do Progressive Networks achieve "zero forgetting"? What is the cost?
Answer: Progressive Neural Networks add an independent network column for each new task; old-column parameters are fully frozen; new columns read old-column activations via lateral connections — the gradient path to old columns is cut off, so forgetting is structurally impossible. The cost is that parameter count grows linearly with the number of tasks.
Follow-up: How does LoRA-based CL mitigate the linear parameter growth compared to Progressive Networks? What is the fundamental difference in their zero-forgetting guarantees? → LoRA-CL adds only low-rank matrices (rank $r \ll d$) per task; the parameter increment per adapted weight matrix is $O(r \cdot d)$ rather than $O(d^2)$, far more scalable than full-column expansion. However, the zero-forgetting guarantees differ in nature: ProgNN achieves a hard guarantee by structurally cutting gradients to frozen old columns; LoRA-CL relies on backbone freezing plus low-rank subspace allocation. If orthogonality is not enforced (e.g., without O-LoRA), different task adapters' activations may interfere, making it a soft guarantee that depends on subspace overlap.
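The scaling gap is easy to check with arithmetic. The sketch below counts parameters for one square weight matrix, assuming a LoRA adapter made of two factors of shapes `rank x d_in` and `d_out x rank`; the hidden size, rank, and task count are hypothetical values chosen for illustration.

```python
def full_matrix_params(d_in, d_out):
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    # Two low-rank factors: A is (rank x d_in), B is (d_out x rank).
    return rank * (d_in + d_out)


d = 4096      # hypothetical hidden size
rank = 8      # hypothetical LoRA rank
n_tasks = 10  # hypothetical number of sequential tasks

full = full_matrix_params(d, d)
per_task = lora_params(d, d, rank)

print("full matrix:         ", full)
print("one LoRA adapter:    ", per_task)
print("adapters for 10 tasks:", n_tasks * per_task)
print("full / one adapter:  ", full // per_task)
```

Growth is still linear in the number of tasks, but with a much smaller constant than adding a full copy of the matrix per task.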
8. What advantages does LoRA-based CL have over EWC / replay? How is it used in the LLM post-training pipeline?
Answer: LoRA-based CL incrementally adds a set of low-rank adapters per task with the backbone frozen — parameter overhead is manageable, old tasks receive zero gradient interference, no old data needs to be stored, and no Fisher computation is required. In the LLM post-training pipeline (SFT → DPO → RL), each stage freezes the backbone and updates only one set of LoRA weights, confining the alignment tax to the adapter layer and leaving the backbone's general knowledge intact; at test time, routing by task-ID selects the corresponding adapter.
Follow-up: When using LoRA-based CL in LLMs, what engineering challenges arise from managing multiple adapter sets at deployment? Can O-LoRA's orthogonality constraint fundamentally eliminate this problem? → Engineering challenges include: needing to store and dynamically load adapters by task-ID (increasing inference latency and memory scheduling complexity), inability to merge adapter weights for mixed-task batches, and routing failure in Class-IL / continual pretraining settings where task-ID is unknown. O-LoRA's orthogonality constraint addresses only the adapter activation-interference problem; it does not solve routing failure — in settings without task-ID, additional mechanisms (such as a prototype classifier or task-inference module) are still needed to determine which adapter to use.
L3 Deep-dive
9. What are the main differences between LLM continual pretraining and the classic Task-IL setting? What unique challenges does it pose?
Answer: Classic Task-IL has explicit task boundaries and task IDs; LLM continual pretraining has no explicit boundaries. Domain corpora flow in continuously, and task granularity is fuzzy (general capabilities and new domains overlap heavily). Unique challenges include: old general capabilities (math reasoning, instruction following) may degrade in hard-to-detect ways; distribution shift from new corpora persists for a long time; forgetting cannot be quantified with the standard task-accuracy matrix; and the ratios for learning-rate warm-up restarts and general-data replay are difficult to tune.
Follow-up: In LLM continual pretraining without explicit task boundaries, how can general-capability degradation be monitored and diagnosed in real time? → A "probe benchmark matrix" must be designed (covering dimensions such as reasoning / code / math / instruction following), evaluated at fixed step intervals during training — effectively implementing per-iteration continuous evaluation at LLM scale (see stability gap D3). This should be combined with gradient/activation drift detection to locate which layers are heavily rewritten by new corpora, so as to decide whether to trigger general-data replay or lower the learning rate. The cost is significant additional compute overhead, requiring a trade-off between monitoring frequency and training efficiency.
10. What CL problem does the alignment tax in the sequential alignment pipeline (SFT → DPO → RL) fundamentally represent? What approaches mitigate it?
Answer: The alignment tax is fundamentally the accumulated forgetting tax per hop in sequential CL — each alignment step continues training from the previous step's checkpoint, and the gradient of the new alignment objective overwrites previously learned behavior. Mitigation approaches: (1) KL constraint (PPO clip / DPO reference model) limits the deviation magnitude per step; its second-order expansion is equivalent to EWC weighted by Fisher; (2) Replay old preference data mixed in from earlier alignment stages; (3) LoRA with independent adapter per stage localizes alignment behavior, leaving the backbone unchanged.
Follow-up: The KL penalty in PPO is mathematically an implicit EWC, but what key practical differences affect its anti-forgetting effectiveness? → Three key differences: (1) Anchor dynamics: the KL constraint anchors to a reference policy that may be updated each RL round, whereas the EWC anchor is a fixed snapshot taken after the old task; a dynamic anchor in long sequential alignment causes "anchor drift," continuously shifting the anti-forgetting center. (2) Full matrix vs. diagonal approximation: to second order, KL divergence corresponds to the complete Fisher matrix, while EWC uses a diagonal approximation; the former is a more accurate constraint but is never computed explicitly. (3) Coefficient tuning: PPO's KL coefficient $\beta$ must be dynamically balanced between exploration and conservatism, and too large a value prevents policy convergence; EWC's $\lambda$ is typically set statically after the task is fixed, so $\beta$ is harder to tune in dynamic alignment scenarios.
11. EWC uses the empirical Fisher (true labels) rather than the true Fisher (labels sampled from the model) — when does this cause problems?
Answer: The empirical Fisher uses the dataset's true labels $y$, and equals the true Fisher only when the model perfectly fits the data ($p_\theta(y \mid x) = p_{\text{data}}(y \mid x)$). Problematic scenarios: high estimation noise in $F_i$ when the model is far from convergence on the old task; importance directions contaminated when old-task labels are noisy; errors that propagate continuously during online multi-task accumulation (online EWC); and dual degradation from the diagonal approximation and empirical error when inter-task distribution difference is extreme, which ends up protecting the wrong directions.
Follow-up: Besides switching to the true Fisher, what structural alternatives exist to fundamentally bypass the limitations of diagonal empirical Fisher? → Two categories of fundamental alternatives: (1) Replace the importance measure: SI uses each parameter's actual contribution to loss reduction, integrated along the training trajectory as $\omega_i = -\int g_i(\theta(t)) \, \dot{\theta}_i(t) \, dt$, rather than Fisher. It is label-free and online-accumulative, which avoids the empirical-Fisher error, although the resulting penalty is still per-parameter. MAS uses output gradient norms, requiring no labels at all. (2) Replace the quadratic-penalty framework entirely: PackNet / LoRA-CL's parameter isolation schemes require no importance estimation. Old-task parameters are structurally frozen, so the accuracy of importance estimation becomes irrelevant. This shows that Fisher estimation limitations are a systemic problem of the regularization paradigm, not an engineering problem that can be improved indefinitely.
12. What are the advantages and limitations of Generative Replay compared to experience replay? Is it more feasible in the era of large models?
Answer: Generative Replay uses a generative model (VAE/GAN) to synthesize "pseudo-old data" for training. Advantages: no real old samples need to be stored, naturally resolving privacy and storage constraints. Limitations: generative-quality bottlenecks accumulate error along the task sequence — the generative model itself faces forgetting, and the deviation between synthesized data and the true distribution compounds over long task chains. In the era of large models, LLMs have extremely strong generative capabilities, and using an LLM to synthesize old-task data achieves quality far higher than a small GAN — feasibility is significantly improved, but generation cost is high and synthesized data may still have systematic distributional bias relative to the original.
Follow-up: In LLM Generative Replay, what are the core problems with using the model itself as the generator ("self-replay") versus using an independently frozen old model as the generator? → Self-replay (having the current model generate old-task data to train itself) suffers from self-reinforcing bias: the model has already partially forgotten the old task after new-task training, so the generated old-task samples are themselves lower quality, and training on these samples reinforces forgetting, forming a vicious cycle (a positive feedback loop). Using an independently frozen old model as the generator (similar to LwF's teacher) ensures generation quality does not degrade with the current model, but requires storing a full additional model copy, whose storage overhead may be higher than an experience-replay buffer; moreover, the frozen old model cannot be updated to track shifts in the new data distribution, accumulating distributional bias over long task chains.
Deep-dive
The following are interview-level advanced Q&A covering the 8 most frequently probed hard points from the notes above (D1 to D8). All conclusions come from the cited papers and contain no original research by the author.
D1. Empirical Fisher vs. True Fisher — Which does EWC use, and when does it break down?
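True Fisher is defined as the expectation of the outer product of log-likelihood gradients under the model's predictive distribution $p_\theta(y \mid x)$:
$$F(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim p_\theta(y \mid x)}\left[\nabla_\theta \log p_\theta(y \mid x)\,\nabla_\theta \log p_\theta(y \mid x)^\top\right]$$
Empirical Fisher replaces the $y$ sampled from the model distribution with the true labels $y_n$ in the dataset:
$$\hat{F}(\theta) = \frac{1}{N}\sum_{n=1}^{N} \nabla_\theta \log p_\theta(y_n \mid x_n)\,\nabla_\theta \log p_\theta(y_n \mid x_n)^\top$$
The two are equal only when the model perfectly fits the data (the predictive distribution $p_\theta(y \mid x)$ matches the data's label distribution). Research has shown that during optimization, empirical Fisher cannot generally capture second-order information, and its deviation from the true Hessian in practice can be large; pathological behavior can occur even on simple optimization problems.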
EWC uses empirical Fisher: the implementation computes the expected squared gradient over $(x, y)$ pairs from the old-task dataset $\mathcal{D}_{\text{old}}$. The code comments also note "use true label to estimate Fisher."
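To make the gap between the two estimators concrete, here is a minimal sketch for a binary logistic model $p_\theta(y{=}1 \mid x) = \sigma(w^\top x)$, whose per-sample score is $(y - p)\,x$. The function names, the toy inputs, and the deliberately badly fit weights are illustrative assumptions; this is not the notes' EWC implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def true_fisher_diag(w, xs):
    """Diagonal Fisher with y drawn from the model: E_y[(y - p)^2] = p * (1 - p)."""
    diag = [0.0] * len(w)
    for x in xs:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            diag[i] += p * (1.0 - p) * xi * xi
    return [d / len(xs) for d in diag]

def empirical_fisher_diag(w, xs, ys):
    """Diagonal empirical Fisher: squared score evaluated at the dataset labels."""
    diag = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            diag[i] += (y - p) ** 2 * xi * xi
    return [d / len(xs) for d in diag]

# Toy data (assumption): every label is 0, but the weights confidently predict 1.
xs = [[1.0, 0.0], [1.0, 1.0]]
ys = [0, 0]
badly_fit_w = [3.0, 0.0]

true_diag = true_fisher_diag(badly_fit_w, xs)
emp_diag = empirical_fisher_diag(badly_fit_w, xs, ys)
print("true Fisher diag:     ", true_diag)
print("empirical Fisher diag:", emp_diag)
```

In this toy case the empirical estimate exceeds the true Fisher by a factor of $p/(1-p) = e^3 \approx 20$ on the first coordinate, which is the kind of mismatch the "far from convergence" scenario refers to.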
When does the approximation break down?
| Scenario | Risk |
|---|---|
| Computing $F$ when the model is far from convergence on the old task | Estimate is noisy; protects wrong directions |
| Old-task labels are noisy | direction is contaminated by label noise |
| Multi-task sequential accumulation (online EWC) | Early-task errors propagate continuously in the running average |
| Extreme inter-task distribution gap (e.g., language → vision) | Diagonal approximation is already a strong assumption; empirical error compounds the degradation |
Intuition: EWC's quadratic penalty is essentially a local quadratic approximation of the old-task loss around $\theta^*$, and the diagonal empirical Fisher is the coarsest layer of this approximation. When the old task is well-converged, labels are clean, and parameter correlations are weak, the approximation is acceptable; otherwise the "importance scores" decouple from the true loss-landscape curvature, and the penalty protects the wrong directions.
D2. Online / Streaming EWC — How to accumulate Fisher across tasks?
Standard EWC stores one $(F_k, \theta_k^*)$ pair for each old task encountered, with memory growing linearly in task count $T$. Online EWC (Progress & Compress, Schwarz et al. 2018) merges all historical tasks' Fisher into a single $\tilde{F}$ via exponential moving average (EMA):
$$\tilde{F}_t = \gamma\,\tilde{F}_{t-1} + F_t, \qquad \gamma \in [0, 1]$$
The total penalty degenerates to a single penalty term:
$$\mathcal{L}(\theta) = \mathcal{L}_t(\theta) + \frac{\lambda}{2}\sum_i \tilde{F}_{t-1,i}\,\left(\theta_i - \theta^*_{t-1,i}\right)^2$$
Advantage: constant memory (stores only one $\tilde{F}$ and one $\theta^*$ snapshot).
Cost and risks:
- EMA exponentially downweights Fisher from earlier tasks — the older the task, the weaker the protection; it inherently favors protecting "recent" tasks.
- $\gamma$ is a new hyperparameter: $\gamma$ close to 1 retains history, while a smaller $\gamma$ discounts it quickly (high forgetting rate); $\gamma \to 0$ degenerates to looking only at the previous task.
- The reference point $\theta^*$ is updated each round; after each compression the new $\theta^*$ is not the common optimum for all historical tasks, so the penalty center drifts as tasks accumulate.
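The accumulation and the single penalty term can be sketched in a few lines. This assumes the decayed-running-sum form $\tilde{F}_t = \gamma\,\tilde{F}_{t-1} + F_t$ with diagonal Fishers stored as plain lists; the three task Fishers and all function names are made up for illustration.

```python
def update_online_fisher(running_fisher, new_fisher, gamma):
    """Decayed running sum, elementwise on the diagonal: F~_t = gamma * F~_{t-1} + F_t."""
    return [gamma * old + new for old, new in zip(running_fisher, new_fisher)]

def ewc_penalty(theta, theta_star, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2 with a single anchor theta*."""
    return 0.5 * lam * sum(f * (t - s) ** 2 for f, t, s in zip(fisher, theta, theta_star))

# Hypothetical diagonal Fishers: each task "cares" about exactly one parameter.
task_fishers = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gamma = 0.5

running = [0.0, 0.0, 0.0]
for fisher in task_fishers:
    running = update_online_fisher(running, fisher, gamma)

print(running)  # the first task's weight has decayed to gamma ** 2
penalty = ewc_penalty([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], running, lam=2.0)
print(penalty)
```

Moving every parameter by the same amount is penalized four times more on the newest task's parameter than on the oldest one's, which is the "favors recent tasks" effect listed above.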
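SI (Synaptic Intelligence) is another streaming approach: it online-accumulates each parameter's integral contribution to loss reduction during training as an importance measure:
$$\omega_i^{(k)} = -\int_{t_{k-1}}^{t_k} g_i(t)\,\dot{\theta}_i(t)\,dt, \qquad \Omega_i = \sum_{k} \frac{\omega_i^{(k)}}{\left(\Delta_i^{(k)}\right)^2 + \xi}$$
Here $g_i = \partial \mathcal{L} / \partial \theta_i$, $\Delta_i^{(k)}$ is the total change of $\theta_i$ over task $k$, and $\xi$ is a small damping constant. SI requires no additional backward pass after the task ends, which makes it a natural fit for streaming training (the original formulation still consolidates $\Omega$ at task boundaries), but the signal-to-noise ratio of the importance estimate is lower than EWC's.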
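A minimal sketch of the SI bookkeeping on a toy separable quadratic loss $\mathcal{L} = \sum_i \tfrac{1}{2} a_i \theta_i^2$ trained with plain SGD. The loss, the curvatures, and the name `run_si` are assumptions chosen so the path integral can be checked by hand; the discrete sum $-\sum g_i\,\Delta\theta_i$ stands in for the integral.

```python
def run_si(curvatures, theta0, lr=0.01, steps=1000, xi=1e-3):
    """SGD on L = sum_i 0.5 * a_i * theta_i^2 while tracking SI statistics."""
    theta = list(theta0)
    omega = [0.0] * len(theta)                # running path integral per parameter
    for _ in range(steps):
        grads = [a * t for a, t in zip(curvatures, theta)]
        deltas = [-lr * g for g in grads]     # the parameter update of this step
        # Accumulate -g_i * delta_theta_i: this parameter's share of the loss drop.
        omega = [w - g * d for w, g, d in zip(omega, grads, deltas)]
        theta = [t + d for t, d in zip(theta, deltas)]
    total_move = [t - t0 for t, t0 in zip(theta, theta0)]
    # Normalize by the squared total displacement, damped by xi.
    importance = [w / (m * m + xi) for w, m in zip(omega, total_move)]
    return theta, omega, importance

theta, omega, importance = run_si(curvatures=[2.0, 0.1], theta0=[1.0, 1.0])
print("omega:     ", omega)
print("importance:", importance)
```

For small learning rates each `omega[i]` closely tracks the actual loss reduction attributable to that parameter, and the high-curvature parameter ends up with the larger importance, all without any extra pass after training.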
D3. Stability Gap — Why does forgetting worsen before recovering?
De Lange et al. (arXiv 2022, ICLR 2023) discovered through per-iteration continuous evaluation that almost all mainstream CL methods (including EWC, ER, A-GEM) experience a sharp drop in old-task performance in the first few steps after switching to a new task, followed by gradual recovery, in some cases even exceeding the pre-switch level, as training progresses. This "drop then recover" phenomenon is called the stability gap.
Mechanistic intuition:
Task T1 training complete → switch to T2
↓ first few dozen steps
T2's large gradients hit shared feature layers → T1 representations temporarily disrupted → T1 performance drops sharply
↓ training continues
regularization / replay constraints begin to take effect → T1 representations gradually recover
↓ end of T2
T1 performance recovers (but may be below its peak at the end of T1 training)
Why was this not found before?
The standard evaluation protocol tests only once after each task is fully learned, which happens to skip the drop period — the stability gap is nearly invisible in the end-of-task "snapshots."
Interview key points:
- The stability gap means the BWT metric (end-of-task snapshot) underestimates the true magnitude of forgetting — in safety-critical settings (incremental updates of deployed LLMs), worst-case performance during training may be much more severe than BWT reflects.
- Mitigation directions: reduce new-task learning rate during the warm-up transition period; "probe" the new task with small batches before full-speed training; increase the old-task proportion in replay at the beginning of the task switch.
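The difference between an end-of-task snapshot and per-iteration evaluation can be sketched as a small reporting helper. The accuracy trace below is made up purely for illustration, and `stability_gap_report` is a hypothetical name, not a function from the cited paper.

```python
def stability_gap_report(old_task_acc, switch_step):
    """old_task_acc[i] is the old-task accuracy measured at training iteration i."""
    before = old_task_acc[switch_step - 1]    # last measurement before the task switch
    after = old_task_acc[switch_step:]        # measurements while training the new task
    worst = min(after)
    return {
        "end_of_task_forgetting": before - after[-1],  # what a snapshot protocol sees
        "worst_case_drop": before - worst,             # what per-iteration eval reveals
        "steps_to_worst": after.index(worst),
    }

# Made-up trace: the task switch happens at index 3.
trace = [0.90, 0.91, 0.92, 0.60, 0.55, 0.70, 0.82, 0.88]
report = stability_gap_report(trace, switch_step=3)
print(report)
```

On this trace the snapshot protocol reports about 4 points of forgetting, while the transient drop is about 37 points, which is the quantity that matters for a model serving traffic during the update.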
D4. Replay Sample Selection Strategies and Buffer Size Effects
Buffer size is one of the most critical hyperparameters in replay methods. Three main selection strategies:
① Random Reservoir Sampling
For a data stream of unknown length, maintain a buffer of size $|\mathcal{M}|$. The first $|\mathcal{M}|$ samples fill the buffer; after that, the $n$-th sample is included with probability $|\mathcal{M}|/n$ (randomly replacing an existing sample). This guarantees that the buffer is at all times a uniform random subset of all samples seen so far: no class bias beyond the stream's own class frequencies, simple to implement, and the default strategy for ER / A-GEM.
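Reservoir sampling fits in a dozen lines. The sketch below uses only the standard library; `reservoir_update` and `reservoir_sample` are illustrative names, and the integer stream stands in for training samples.

```python
import random

def reservoir_update(buffer, capacity, item, n_seen, rng):
    """Process the n_seen-th stream item (1-indexed) for a fixed-capacity reservoir."""
    if len(buffer) < capacity:
        buffer.append(item)            # fill phase: keep everything
    else:
        j = rng.randrange(n_seen)      # uniform integer in [0, n_seen)
        if j < capacity:               # true with probability capacity / n_seen
            buffer[j] = item           # overwrite a uniformly chosen slot
    return buffer

def reservoir_sample(stream, capacity, seed=0):
    rng = random.Random(seed)
    buffer = []
    for n_seen, item in enumerate(stream, start=1):
        reservoir_update(buffer, capacity, item, n_seen, rng)
    return buffer

buffer = reservoir_sample(range(1000), capacity=10)
print(buffer)
```

Because the update needs only the running count `n_seen`, it works when the stream length is unknown in advance, which is why it is a convenient default for online replay.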
② Herding (iCaRL)
iCaRL's herding algorithm: greedily and iteratively selects samples so that the feature mean of the exemplar set best approximates the feature mean of the entire class:
$$p_k = \arg\min_{x \in X} \left\lVert \mu - \frac{1}{k}\Big[\varphi(x) + \sum_{j=1}^{k-1} \varphi(p_j)\Big] \right\rVert, \qquad \mu = \frac{1}{|X|}\sum_{x \in X} \varphi(x)$$
Herding outperforms random sampling for class exemplar selection, especially when the per-class budget $m$ is very small (only a few samples per class), preserving within-class diversity better. However, herding requires having all class data available and depends on the feature space: after feature drift, old-class exemplars may no longer represent their current features.
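The greedy selection can be sketched as follows. Features are plain lists here, the 1-D toy values are made up, and the sketch assumes that an exemplar is never picked twice and that the budget does not exceed the class size; `herding_select` is an illustrative name, not iCaRL's code.

```python
def herding_select(features, m):
    """Greedy iCaRL-style herding over one class; returns indices of m exemplars."""
    dim = len(features[0])
    class_mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    chosen = []
    running_sum = [0.0] * dim          # sum of the features selected so far
    for k in range(1, m + 1):
        best_idx, best_dist = None, float("inf")
        for idx, f in enumerate(features):
            if idx in chosen:          # assumption: no repeated exemplars
                continue
            # Squared distance between the class mean and the mean of the
            # exemplar set if this candidate were added as the k-th exemplar.
            dist = sum(
                (class_mean[d] - (running_sum[d] + f[d]) / k) ** 2 for d in range(dim)
            )
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        chosen.append(best_idx)
        running_sum = [s + v for s, v in zip(running_sum, features[best_idx])]
    return chosen

feats = [[0.0], [1.0], [2.0], [3.0], [10.0]]   # toy 1-D "features", class mean = 3.2
print(herding_select(feats, 3))                # -> [3, 2, 1]
```

The outlier at 10.0 is never selected even though it pulls the class mean upward, which illustrates that herding optimizes the running mean of the exemplar set rather than coverage of extremes.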
③ Gradient-Based Selection
Selects old samples that have the greatest influence on the new-task gradient update — typically samples whose gradient direction most "conflicts" with the new-task gradient. The intuition is to constrain gradients using the hardest-to-satisfy samples. Compute cost is high (requires extra backward passes to estimate gradient influence per candidate sample); rarely used in large-scale experiments.
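One simple instance of this idea scores each candidate by the cosine between its gradient and the new-task gradient and keeps the most negative ones. This is only a sketch of the "most conflicting" heuristic under that assumption; published gradient-based selection methods use more elaborate criteria, and the gradients below are made-up 2-D vectors.

```python
def select_conflicting(sample_grads, new_task_grad, k):
    """Indices of the k candidates whose gradients conflict most with new_task_grad."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm > 0 else 0.0

    scores = [cosine(g, new_task_grad) for g in sample_grads]
    # Most negative cosine first: these samples are hurt most by the new update.
    return sorted(range(len(scores)), key=lambda i: scores[i])[:k]

# Hypothetical per-sample gradients for four buffer candidates.
grads = [[1.0, 0.0], [-1.0, 0.1], [0.0, 1.0], [-0.5, -0.5]]
new_grad = [1.0, 0.0]
picked = select_conflicting(grads, new_grad, k=2)
print(picked)
```

The per-candidate gradient is exactly the expensive part: each one costs a backward pass, which is the compute overhead mentioned above.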
Buffer size effects summary:
| Buffer size | Random/Reservoir | Herding | Gradient-based |
|---|---|---|---|
| Very small (1–5 samples per class) | Poor coverage, severe forgetting | Clearly better than random | Good effect but extremely high cost |
| Medium (20–50 per class) | Approaches herding | Gap narrows | Diminishing returns |
| Large (approaching unlimited) | All three converge, approaching joint training | Same | Same |
Core insight: as $|\mathcal{M}| \to \infty$, all replay methods degenerate to joint training (the upper bound); when $|\mathcal{M}|$ is small, exemplar representativeness matters more than randomness, and herding wins in this regime.
D5. GEM's Per-Task QP Cost vs. A-GEM's Single Mean Constraint
GEM's QP problem:
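Current new-task gradient is $g$; for each old task $k < t$, with gradient $g_k$ computed on that task's memory, there is a constraint $\langle g, g_k \rangle \ge 0$. If any constraint is violated, solve:
$$\min_{\tilde{g}}\ \tfrac{1}{2}\lVert g - \tilde{g} \rVert_2^2 \quad \text{s.t.} \quad \langle \tilde{g}, g_k \rangle \ge 0 \quad \forall k < t$$
This is a quadratic program (QP) with inequality constraints. GEM solves it in the dual, which has one variable per old task; a standard QP solver then costs roughly $O(T^3)$ time and $O(T^2)$ space in the task count $T$, which becomes expensive as tasks accumulate. Even before the solve, each step needs one gradient per old task and their inner products with $g$, i.e., $O(Td)$ work ($d$ = parameter dimension) on top of $T$ extra backward passes.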
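The per-task projection can be sketched through its dual, $\min_{v \ge 0} \tfrac{1}{2} v^\top G G^\top v + g^\top G^\top v$ with $\tilde{g} = G^\top v + g$, where the rows of $G$ are the old-task gradients. Projected gradient descent is used here as a dependency-free stand-in for a QP solver; that choice, the step-size rule, and the toy gradients are assumptions of this sketch, not the GEM reference implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gem_project(g, old_grads, steps=2000):
    """Closest vector to g that has a non-negative inner product with every old gradient."""
    if all(dot(g, gk) >= 0 for gk in old_grads):
        return list(g)                                   # no constraint violated
    T = len(old_grads)
    gram = [[dot(a, b) for b in old_grads] for a in old_grads]   # G G^T, size T x T
    lin = [dot(gk, g) for gk in old_grads]                        # G g
    step = 1.0 / sum(gram[k][k] for k in range(T))       # 1 / trace, a safe step size
    v = [0.0] * T                                        # one dual variable per old task
    for _ in range(steps):
        grad = [sum(gram[k][j] * v[j] for j in range(T)) + lin[k] for k in range(T)]
        v = [max(0.0, v[k] - step * grad[k]) for k in range(T)]   # project onto v >= 0
    return [g[i] + sum(v[k] * old_grads[k][i] for k in range(T)) for i in range(len(g))]

g = [1.0, 0.0]
old_grads = [[-1.0, 1.0], [0.5, 1.0]]    # g conflicts with the first, not the second
g_tilde = gem_project(g, old_grads)
print(g_tilde)
```

The dual has only $T$ unknowns regardless of the parameter dimension, but building `gram` and `lin` still requires every old-task gradient, which is where the per-step cost comes from.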
A-GEM's single constraint:
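A-GEM merges all old-task constraints into one average gradient $g_{\text{ref}}$, computed on a batch sampled from the episodic memory of all old tasks, and requires only:
$$\langle g, g_{\text{ref}} \rangle \ge 0$$
If violated, the projection has a closed-form solution:
$$\tilde{g} = g - \frac{g^\top g_{\text{ref}}}{g_{\text{ref}}^\top g_{\text{ref}}}\, g_{\text{ref}}$$
Per-step computation drops from $O(Td)$ to $O(d)$, independent of task count ($g_{\text{ref}}$ is computed once per step and reused for both the check and the projection).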
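The closed-form projection is short enough to write out directly. Gradients are flat lists in this sketch and `agem_project` is an illustrative name; in practice `g_ref` would come from a backward pass on a memory batch.

```python
def agem_project(g, g_ref):
    """A-GEM update direction: g itself if it agrees with g_ref, else its projection."""
    inner = sum(a * b for a, b in zip(g, g_ref))
    if inner >= 0:
        return list(g)                      # constraint satisfied, leave g untouched
    ref_sq = sum(b * b for b in g_ref)      # strictly positive whenever inner < 0
    # Remove the component of g that points against g_ref.
    return [a - (inner / ref_sq) * b for a, b in zip(g, g_ref)]

g_tilde = agem_project([1.0, 0.0], [-1.0, 1.0])
print(g_tilde)   # -> [0.5, 0.5]
```

After projection the inner product with `g_ref` is exactly zero, so the step is neutral with respect to the average old-task direction, not with respect to each old task separately.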
Cost:
A-GEM satisfies an average-direction constraint and does not guarantee $\langle \tilde{g}, g_k \rangle \ge 0$ for every individual old task $k$, so the loss on some single old task may increase. Empirically, A-GEM's AA and BWT are comparable to GEM's, but when gradients across tasks are highly heterogeneous (some task directions deviate greatly from the mean), certain old tasks occasionally experience more forgetting than with GEM.
Interview memory point: GEM = per-task constraint + quadratic program, $O(Td)$ per step plus the QP solve; A-GEM = single mean constraint + closed-form projection, $O(d)$; the trade-off is sacrificing per-task constraint guarantees for a cost that is linear in $d$ and independent of task count.
D6. Why Is Class-IL Hardest? The Role of the Output Head
The three-scenario framework of van de Ven & Tolias (2019) reveals the fundamental difficulty gap:
Output head structure comparison across three scenarios:
| Scenario | Task-ID at test time | Output head | What the model must do |
|---|---|---|---|
| Task-IL | Known | Independent head per task (only current task activated) | Classify within the task; candidate set is known |
| Domain-IL | Unknown | Shared head (fixed output dimensionality) | Classify in a fixed output space; no need to distinguish tasks |
| Class-IL | Unknown | Shared head (all task classes) | Classify among all classes across all historical tasks |
Three obstacles that make Class-IL hardest:
Task-ID inference problem: the model does not know which task the current input belongs to and cannot route to the corresponding sub-classifier — it must distinguish all classes on a single head.
Output head historical bias (recency bias): when learning a new task, only the new task's classes generate large gradient updates, skewing the logit scale of the output layer toward the new task — old-class logits are suppressed even if the feature layer still remembers the old task. This is "classifier-layer forgetting" unique to Class-IL.
Fundamental failure of regularization: EWC protects feature-layer parameters, but the new-class nodes initialized in the output head interfere with the gradient flow of old-class nodes, and the Fisher diagonal cannot capture this cross-class output-layer interference. van de Ven et al.'s experiments show that EWC's accuracy on Class-IL stays at roughly the level of naive sequential fine-tuning, i.e., the old classes are almost entirely lost.
Remedies:
- Replay + prototype classifier (e.g., iCaRL): use exemplar feature means for nearest-neighbor classification, bypassing output head bias.
- Task-agnostic feature learning: freeze a pre-trained backbone and only update a lightweight classification head — reducing feature drift.
- Empirical bias correction: at Class-IL test time, apply temperature adjustment or weighting to old-class logits to counteract recency bias.
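The prototype-classifier remedy can be sketched as nearest-mean-of-exemplars classification: prediction depends only on feature-space distances, so the biased output-layer logits are never consulted. Class names, features, and function names below are made up for illustration.

```python
def class_prototypes(exemplars):
    """exemplars: {label: [feature vectors]} -> {label: mean feature vector}."""
    protos = {}
    for label, feats in exemplars.items():
        dim = len(feats[0])
        protos[label] = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    return protos

def nearest_mean_predict(feature, protos):
    """Return the label whose prototype is closest to the feature (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(feature, protos[label]))

# Hypothetical exemplar features for one old class and one new class.
exemplars = {
    "old_cat": [[0.0, 0.0], [0.2, 0.0]],
    "new_truck": [[1.0, 1.0], [1.0, 0.8]],
}
protos = class_prototypes(exemplars)
print(nearest_mean_predict([0.3, 0.1], protos))
print(nearest_mean_predict([0.9, 0.7], protos))
```

Old and new classes are treated symmetrically here, but the prototypes must be recomputed with the current feature extractor after each task, otherwise feature drift makes them stale.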
D7. LLM Forgetting vs. Capability Interference — Measurement Methods and Task-Order Sensitivity
Key differences between LLM "forgetting" and classic CL:
| Dimension | Classic CL (small model / classification) | LLM post-training |
|---|---|---|
| Task granularity | Clear (Task 1/2/3…) | Blurry ("math reasoning" / "code" / "safety alignment" overlap heavily) |
| Manifestation of forgetting | Drop in old-task classification accuracy | Capability interference: new-capability activation paths overwrite old-capability paths; not simple "forgetting" |
| Measurement difficulty | Directly quantifiable with the accuracy matrix $R_{i,j}$ | Requires dedicated benchmarks (MMLU/GSM8K/HumanEval…) tracking each capability |
| Boundary clarity | Explicit task boundaries | No explicit boundaries; SFT→DPO→RL are soft boundaries |
How to measure LLM CL quality:
- Capability matrix tracking: after each alignment stage ends, evaluate on several "probe benchmarks" (covering general reasoning, code, math, safety), which is equivalent to constructing an $R_{i,j}$ accuracy matrix (stage $i$ × benchmark $j$) at LLM scale.
- BWT analogy for LLMs: measure "change in coding capability after SFT compared to pretrain baseline" — if negative, it is part of the alignment tax.
- Activation/gradient analysis: detect which layers' activation distributions change most before and after new-task training — locating the knowledge-storage layers that are "overwritten."
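Capability-matrix tracking reduces to comparing each checkpoint's row against a baseline row. In the sketch below the stage names, probe suites, and all scores are made-up placeholders, and `find_regressions` is a hypothetical helper, not part of any evaluation library.

```python
def find_regressions(scores, stages, benchmarks, baseline=0, tol=0.0):
    """List (stage, benchmark, delta) where a later checkpoint falls below the baseline row."""
    found = []
    for i in range(baseline + 1, len(stages)):
        for j, bench in enumerate(benchmarks):
            delta = round(scores[i][j] - scores[baseline][j], 4)
            if delta < -tol:
                found.append((stages[i], bench, delta))
    return found

stages = ["pretrain", "sft", "dpo"]
benchmarks = ["general", "code", "math"]     # hypothetical probe suites
scores = [                                   # rows: checkpoints, columns: benchmarks
    [0.60, 0.40, 0.30],
    [0.62, 0.35, 0.45],
    [0.61, 0.33, 0.44],
]
regressions = find_regressions(scores, stages, benchmarks)
print(regressions)
```

Negative deltas relative to the pretrain row are the per-capability analogue of BWT; summing them across stages gives one rough reading of the accumulated alignment tax.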
Task-order sensitivity:
LLM CL is highly sensitive to task order, because:
- The gradient directions during sequential training depend on the loss landscape shaped by previous tasks — doing math SFT before safety RLHF produces a very different result from the reverse order.
- Harder tasks (math/code) tend to involve larger gradients and parameter updates, so they do more damage to parameters that other, smaller tasks in the sequence rely on.
- Cumulative nature of alignment tax: in the SFT → DPO → RL sequence, each step's forgetting tax accumulates on the next step's checkpoint.
KL constraint is implicit EWC:
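The KL penalty in PPO-RLHF:
$$\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\Vert\,\pi_{\text{ref}}(\cdot \mid x)\big)$$
Expanding this with a Taylor expansion around the reference policy's parameters $\theta_{\text{ref}}$, the zeroth- and first-order terms vanish and the second-order term of the KL divergence is given by the Fisher information matrix:
$$\mathrm{KL}\big(\pi_\theta \,\Vert\, \pi_{\text{ref}}\big) \approx \tfrac{1}{2}\,(\theta - \theta_{\text{ref}})^\top F(\theta_{\text{ref}})\,(\theta - \theta_{\text{ref}})$$
That is, to second order PPO's KL penalty ≈ a full-matrix EWC weighted by Fisher: both are "quadratic penalties on parameter deviation from a reference point, weighted by information-geometric curvature." The difference: EWC uses a diagonal approximation and explicitly stores the old-task Fisher; the KL constraint uses the full distributional distance and dynamically anchors to the current reference policy. The KL term in DPO follows the same logic: the reference model plays the role of EWC's $\theta^*$.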
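The second-order claim can be checked numerically on the smallest possible "policy": a single categorical distribution parameterized by its logits, whose Fisher matrix is $\mathrm{diag}(p) - pp^\top$. The logits and the perturbation below are arbitrary illustrative values; a real policy has this structure per token and per prompt.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def fisher_quadratic(p, delta):
    """0.5 * delta^T (diag(p) - p p^T) delta, i.e. half the variance of delta under p."""
    mean = sum(pi * di for pi, di in zip(p, delta))
    second_moment = sum(pi * di * di for pi, di in zip(p, delta))
    return 0.5 * (second_moment - mean * mean)

theta_ref = [0.0, 1.0, -1.0]          # reference logits (arbitrary)
delta = [0.01, -0.02, 0.015]          # small parameter deviation (arbitrary)

p_ref = softmax(theta_ref)
p_new = softmax([t + d for t, d in zip(theta_ref, delta)])

exact = kl(p_ref, p_new)
approx = fisher_quadratic(p_ref, delta)
print(exact, approx)
```

The two KL directions differ only at third order, so the same quadratic form approximates $\mathrm{KL}(\pi_\theta \Vert \pi_{\text{ref}})$ as well; the approximation degrades as the policy moves far from the reference.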
D8. LoRA-based CL — Adapter Interference and Routing
Sources of adapter interference:
The naive approach is to train an independent set of LoRA weights for each task and route by task-ID at test time. Two sources of interference exist:
- Parameter-space overlap: different tasks' low-rank subspaces may overlap substantially. If $A_i^\top A_j \neq 0$ for the LoRA matrices $A_i$ and $A_j$ of two tasks, one task's adapter can interfere with another task's activations after merging.
- Routing failure without task-ID: in Class-IL or continual pretraining settings, task-ID is unavailable at test time, making it impossible to route to the correct adapter.
O-LoRA's orthogonal subspace approach:
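O-LoRA enforces, during LoRA training for task $t$, that the new task's low-rank subspace is orthogonal to all previously seen tasks' subspaces:
$$A_i^\top A_t = 0 \quad \forall i < t$$
Conceptually this keeps the update for task $t$ inside the orthogonal complement of $\mathrm{span}(A_1, \dots, A_{t-1})$, for example by projecting gradients onto it; the O-LoRA paper itself keeps earlier adapters frozen and encourages orthogonality with a penalty on the entries of $A_i^\top A_t$. Orthogonal subspaces keep different task adapters from interfering with each other, so in settings where task-ID is known this yields approximately zero forgetting without storing old-task data.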
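Both views of the orthogonality constraint can be sketched on plain lists: a hard projection that removes any component lying in earlier adapters' subspace, and an overlap measure (sum of squared inner products) that is zero exactly when the subspaces are orthogonal. Basis vectors are stored as rows here, the numbers are made up, and none of this is taken from the O-LoRA codebase.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(rows, eps=1e-12):
    """Gram-Schmidt over the basis vectors of earlier tasks' adapters."""
    basis = []
    for row in rows:
        v = list(row)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:                      # skip vectors already in the span
            basis.append([vi / norm for vi in v])
    return basis

def project_out(vec, basis):
    """Remove from vec every component that lies in span(basis)."""
    out = list(vec)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

def overlap_penalty(prev_rows, new_rows):
    """Sum of squared inner products between old and new basis vectors."""
    return sum(dot(p, n) ** 2 for p in prev_rows for n in new_rows)

prev_A = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]   # earlier task, rank 2, d = 4
raw_update = [0.5, -0.3, 0.8, 0.1]                      # hypothetical update for the new adapter

clean_update = project_out(raw_update, orthonormal_basis(prev_A))
print(clean_update)
print(overlap_penalty(prev_A, [raw_update]), overlap_penalty(prev_A, [clean_update]))
```

With $d = 4$ and rank 2, a second adapter already uses up the remaining two orthogonal directions, a small-scale picture of how the orthogonal budget shrinks as tasks accumulate.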
Limitations:
- Available orthogonal dimensions decrease as task count increases: in rank-$r$ LoRA, at most about $d/r$ fully orthogonal tasks can be supported ($d$ = weight matrix dimension).
- In Class-IL settings without task-ID, orthogonality does not solve the routing problem — additional mechanisms to infer task-ID or use prototype classification are still required.
- Enforcing orthogonality may limit forward transfer between related tasks — tasks that could share subspaces to accelerate learning lose that benefit.
Practice in LLM post-training:
In the sequential alignment pipeline (SFT → DPO → RL), freezing the backbone and updating only one set of LoRA per stage is equivalent to a parameter isolation scheme with known task-ID — the accumulation of alignment tax is largely confined to the LoRA layers and backbone knowledge is not overwritten. The cost is the need to manage multiple adapter sets and their merging/routing logic at deployment.
References
All are original sources for classic load-bearing methods, verified line by line (title + arXiv ID).
1. Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. PNAS 2017. arXiv:1612.00796. EWC: Fisher diagonal quadratic penalty against forgetting.
2. Lopez-Paz et al. Gradient Episodic Memory for Continual Learning. NeurIPS 2017. arXiv:1706.08840. GEM: gradient projection constraint + forward/backward transfer.
3. Chaudhry et al. Efficient Lifelong Learning with A-GEM. ICLR 2019. arXiv:1812.00420. A-GEM: single mean gradient constraint, efficient GEM.
4. Buzzega et al. Dark Experience for General Continual Learning: a Strong, Simple Baseline. NeurIPS 2020. arXiv:2004.07211. DER: store logits as soft distillation targets + rehearsal.
5. Rusu et al. Progressive Neural Networks. 2016. arXiv:1606.04671. Add a new column per task + lateral connections; structurally zero forgetting.
6. Li and Hoiem. Learning without Forgetting. ECCV 2016. arXiv:1606.09282. LwF: old model soft outputs as distillation targets; no old data needed.
7. Schwarz et al. Progress & Compress: A scalable framework for continual learning. ICML 2018. arXiv:1805.06370. Dual-network active column + knowledge base; online EWC accumulates Fisher across tasks via EMA, constant memory.
8. De Lange, van de Ven, and Tuytelaars. Continual evaluation for lifelong learning: Identifying the stability gap. ICLR 2023. arXiv:2205.13452. Per-iteration continuous evaluation framework; discovers the stability gap phenomenon of sharp performance drop then recovery after task switch.
9. van de Ven and Tolias. Three scenarios for continual learning. 2019. arXiv:1904.07734. Systematically defines the three scenarios Task-IL / Domain-IL / Class-IL; shows that regularization methods almost completely fail on Class-IL.
10. Meng et al. Locating and Editing Factual Associations in GPT. NeurIPS 2022. arXiv:2202.05262. ROME: causal localization + rank-1 editing of factual associations in middle-layer MLPs.
11. Meng et al. Mass-Editing Memory in a Transformer. ICLR 2023. arXiv:2210.07229. MEMIT: batch-editing thousands of facts across multiple layers.