Covers probability, linear algebra, calculus, information theory, MLE / MAP, and common distributions. Bilingual notation: primary language is English; established Chinese terms are referenced parenthetically where helpful.
Use cases: ML/DL interview prep, course-note quick reference, technical blog reference.
Part 1: Concepts & Formula Derivations
1.1 Probability Fundamentals
Basic relationships among Conditional Probability, Joint Probability, and Marginal Probability:
Bayes' Theorem:
posterior ∝ likelihood × prior.
Chain Rule of Probability:
Conditional Independence: means that given , and are independent, i.e., .
1.2 Expectation, Variance, Covariance
| Property | Formula |
|---|---|
| Linearity | |
| Variance definition | |
| Scaling | |
| Variance of a sum | |
| Covariance |
Covariance Matrix : diagonal entry ; the matrix is always positive semi-definite (PSD) because .
1.3 Common Distributions
| Distribution | Parameters | Mean | Variance | Typical Use |
|---|---|---|---|---|
| Bernoulli | Binary classification labels | |||
| Binomial | Count of successes in independent trials | |||
| Poisson | Counting rare events | |||
| Uniform | Parameter initialization | |||
| Gaussian | Noise models, weight priors | |||
| Exponential | Waiting-time modeling |
Multivariate Gaussian PDF:
68-95-99.7 Rule: For , approximately 68% of data falls within , 95% within , and 99.7% within .
1.4 Central Limit Theorem (CLT)
Let be i.i.d. with mean and variance . As :
ML connection: In mini-batch SGD, the batch gradient is an unbiased estimator of the full-data gradient, with variance ( = batch size). SGD noise provides implicit regularization in non-convex optimization, helping the model escape sharp minima.
1.5 Maximum Likelihood & Maximum A Posteriori
MLE (Maximum Likelihood Estimation):
Purely data-driven; no prior information.
MAP (Maximum A Posteriori):
Regularization equivalence:
| Prior distribution | Corresponding regularization |
|---|---|
| Gaussian prior | L2 regularization (Weight Decay) |
| Laplace prior | L1 regularization (Lasso) |
1.6 Bias-Variance Decomposition
- Bias: systematic deviation of the model → underfitting
- Variance: sensitivity of the model to perturbations in training data → overfitting
- Noise: irreducible error intrinsic to the data
Ensemble methods such as Bagging and Dropout primarily reduce Variance.
1.7 Core Linear Algebra
1.7.1 Matrix Rank and Low-Rank Factorization
The rank of a matrix is the maximum number of linearly independent rows (or columns), i.e., the dimension of the column space.
Parameter efficiency of low-rank matrices: If , then can be factored as , where and .
- Original parameter count:
- Low-rank factorization parameter count:
- When , parameter count is dramatically reduced
1.7.2 Singular Value Decomposition (SVD)
Any matrix can be decomposed as:
where is orthogonal (left singular vectors), (singular values, ), and is orthogonal (right singular vectors).
Optimal Low-Rank Approximation (Eckart–Young–Mirsky Theorem):
where , , are formed from the top singular values/vectors.
Two norms expressed in terms of singular values:
- Frobenius norm:
- Spectral norm: (largest singular value)
ML applications:
- PCA: right singular vectors of the covariance matrix = principal component directions
- Parameter compression: truncated SVD for model compression
- LoRA initialization strategy: truncated SVD of pretrained weights to initialize low-rank factors (see LoftQ and similar methods)
1.7.3 Eigendecomposition vs. SVD
| Eigendecomposition | SVD | |
|---|---|---|
| Applicability | Square matrices (must be diagonalizable) | Any matrix |
| Factored form | ||
| Symmetric matrices | , orthogonal | Same form; singular values = |eigenvalues| (eigenvalues = singular values when PSD) |
| Value range | Eigenvalues can be negative | Singular values |
When is symmetric positive definite (SPD), eigendecomposition and SVD are equivalent.
1.7.4 Positive Definite Matrix
A symmetric matrix is positive definite if and only if:
- All eigenvalues
- All leading principal minors (Sylvester's criterion)
- A Cholesky decomposition exists
1.7.5 Moore–Penrose Pseudoinverse
Given by SVD: , where replaces each nonzero singular value with its reciprocal. Used to compute the minimum-norm least-squares solution of .
Closed-form solution of linear regression: . When is singular, add L2 regularization (Ridge regression): , which also improves the condition number.
1.7.6 Matrix Multiplication Complexity
, : computing requires FLOPs.
LLM FLOPs estimate: For a Transformer model with parameters trained on tokens, the total compute is approximately:
(forward pass ≈ 2ND, backward pass ≈ 4ND; inference / forward-only ≈ 2ND.) This is the canonical estimate used in the scaling law literature.
1.8 Calculus & Gradients
1.8.1 Jacobian and Hessian
Jacobian matrix: first-order derivative of a vector function :
Hessian matrix: second-order derivative of a scalar function :
is symmetric. Positive definite → local minimum; negative definite → local maximum; indefinite → saddle point.
1.8.2 Chain Rule and Backpropagation
For a composite function :
Matrix form (vector → vector), using the vector-Jacobian product (VJP):
Backpropagation is essentially the efficient computation of the VJP chain product from output to input, making use of activations cached during the forward pass.
1.9 Information Theory
1.9.1 Entropy
Measures the uncertainty of a random variable. is maximized under a uniform distribution () and equals 0 for a deterministic distribution. Units: → bits; → nats (deep learning frameworks typically use nats).
1.9.2 Kullback–Leibler Divergence
Key properties:
- (proved via Jensen's inequality); equality holds iff
- Asymmetric: ; not a distance metric
- Forward KL (): mean-seeking; tends to cover all of 's support
- Reverse KL (): mode-seeking; tends to concentrate on the dominant mode of
ML applications: The KL penalty in RLHF constrains the policy from deviating too far from the reference model; the DPO objective implicitly contains a KL constraint.
1.9.3 Cross-Entropy
Relationship to entropy and KL divergence:
When is a deterministic distribution (one-hot label), , so minimizing cross-entropy minimizing MLE.
Perplexity (PPL) and its relationship to cross-entropy:
1.9.4 Mutual Information
Measures the amount of information shared between and . iff .
ML applications: Maximizing mutual information in representation learning (e.g., contrastive learning with InfoNCE loss); using MI for feature selection to find features most relevant to the label.
1.9.5 Jensen's Inequality
For a convex function : .
Key applications:
- Proving non-negativity of KL divergence:
- Deriving the ELBO:
1.10 Importance Sampling
When direct sampling from a target distribution is difficult, sample from a proposal distribution and correct with importance weights:
The weight is the importance weight. When and diverge too much, variance can explode.
1.11 Closed-Form KL for Gaussians
For and :
Special case in VAE: , :
1.12 Variational Inference & ELBO
Goal: infer the posterior , but is intractable.
Introduce a variational distribution and replace with the ELBO lower bound:
Equivalence: . Maximizing ELBO minimizing .
Part 2: PyTorch Snippets
2.1 Numerically Stable Softmax
import torch
def softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
"""Numerically stable softmax."""
x_max = x.max(dim=dim, keepdim=True).values
exp_x = torch.exp(x - x_max) # subtract max to prevent overflow
return exp_x / exp_x.sum(dim=dim, keepdim=True)
2.2 Cross-Entropy Loss from Scratch
import torch
def cross_entropy_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
"""
logits: (N, C) — unnormalized scores
targets: (N,) — integer class labels
returns: scalar loss
"""
# numerically stable log-softmax
log_probs = logits - logits.max(dim=-1, keepdim=True).values
log_probs = log_probs - torch.logsumexp(log_probs, dim=-1, keepdim=True)
# extract log-probability of the target class
nll = -log_probs[torch.arange(len(targets)), targets]
return nll.mean()
2.3 KL Divergence for VAE
import torch
def kl_divergence_gaussian(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
"""
Computes KL( N(mu, exp(log_var)) || N(0, I) )
mu, log_var: (N, d)
returns: (N,) per-sample KL values
"""
return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)
2.4 SVD Low-Rank Approximation
import torch
def low_rank_approx(W: torch.Tensor, r: int) -> torch.Tensor:
"""
Returns the optimal rank-r Frobenius-norm approximation of W.
W: (m, n), r: target rank
"""
U, S, Vt = torch.linalg.svd(W, full_matrices=False)
# keep top r singular values
return (U[:, :r] * S[:r]) @ Vt[:r, :]
# example
W = torch.randn(128, 64)
W_approx = low_rank_approx(W, r=8)
print(f"Approximation error (Frobenius): {(W - W_approx).norm():.4f}")
2.5 LoRA-Style Forward Pass
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
"""Simplified LoRA linear layer: freezes pretrained weights, adds a trainable low-rank bypass."""
def __init__(self, in_dim: int, out_dim: int, r: int = 8, alpha: float = 16.0):
super().__init__()
# pretrained weight (frozen)
self.W = nn.Linear(in_dim, out_dim, bias=False)
self.W.weight.requires_grad_(False)
# low-rank bypass ΔW = B @ A
self.A = nn.Linear(in_dim, r, bias=False)
self.B = nn.Linear(r, out_dim, bias=False)
# init: A with Kaiming, B with zeros → ΔW = 0 at start
nn.init.kaiming_uniform_(self.A.weight)
nn.init.zeros_(self.B.weight)
self.scaling = alpha / r
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.W(x) + self.B(self.A(x)) * self.scaling
2.6 Gaussian MLE from Scratch
import torch
def gaussian_mle(data: torch.Tensor):
"""
data: (N, d)
returns: (mu, var) — MLE estimates of mean and variance
"""
mu = data.mean(dim=0)
var = data.var(dim=0, unbiased=False) # MLE uses N (not N-1)
return mu, var
def gaussian_log_likelihood(data: torch.Tensor, mu: torch.Tensor, var: torch.Tensor) -> torch.Tensor:
"""Computes Gaussian log-likelihood."""
d = data.shape[-1]
return -0.5 * (d * torch.log(2 * torch.pi * var) + ((data - mu) ** 2) / var).sum(dim=-1).mean()
2.7 MC Estimation of KL Divergence
import torch
def mc_kl_divergence(log_p, log_q, samples, num_samples: int = 10000):
"""
Estimates KL(P || Q) via Monte Carlo.
log_p, log_q: callable functions returning log-probabilities given samples
samples: samples drawn from P
"""
return (log_p(samples) - log_q(samples)).mean()
# example: estimate KL between N(0,1) and N(1, 0.5)
def log_p(x): return -0.5 * (x ** 2 + torch.log(torch.tensor(2 * 3.14159)))
def log_q(x): return -0.5 * ((x - 1) ** 2 / 0.5 + torch.log(torch.tensor(2 * 3.14159 * 0.5)))
samples = torch.randn(100000)
print(f"MC KL estimate: {mc_kl_divergence(log_p, log_q, samples):.4f}")
Part 3: Interview Questions (25 Questions)
L1 — Basic
Q1. What is the relationship among conditional probability, joint probability, and marginal probability?
A: Joint probability ; marginal probability is obtained by summing (or integrating) over the joint: . Bayes' theorem ties all three together.
Follow-up: How does the chain rule of probability generalize to more than three variables? What does conditional independence mean?
A: . means that given , knowing does not change beliefs about : . In a probabilistic graphical model this corresponds to blocking all paths from to .
Q2. What are the basic rules for expectation and variance?
A:
- Linearity:
- Variance:
- Scaling:
- Variance of a sum of independent variables equals the sum of variances (no covariance term)
Follow-up: What do the diagonal entries of a covariance matrix represent? Why is a covariance matrix always positive semi-definite?
A: The diagonal entries are the per-dimension variances . Symmetry is immediate; PSD holds because (variance is non-negative).
Q3. Name at least four common probability distributions with their mean, variance, and typical use cases.
A:
| Distribution | Mean | Variance | Use |
|---|---|---|---|
| Bernoulli | Binary classification labels | ||
| Gaussian | Noise models, weight priors | ||
| Poisson | Counting rare events | ||
| Uniform | Parameter initialization |
Follow-up: How is the multivariate Gaussian PDF written?
A: .
Q4. What is Maximum Likelihood Estimation (MLE)? Give the mathematical definition.
A:
Select the parameters that maximize the probability of the observed data. Equivalent to minimizing the negative log-likelihood (NLL). Under a Gaussian assumption, MLE for regression is equivalent to minimizing MSE.
Follow-up: Under what conditions does MLE produce a biased estimate?
A: A classic example: the MLE for the Gaussian variance (dividing by instead of ) is biased (underestimates variance). With small samples, MLE is prone to overfitting on complex models.
Q5. What is matrix rank? What are the properties of a low-rank matrix?
A: The rank of a matrix is the maximum number of linearly independent rows (or columns), equal to the dimension of the column space. A low-rank matrix (rank ) can be factored as (), reducing the parameter count from to .
Follow-up: What is the connection between low-rank factorization and parameter-efficient fine-tuning (PEFT)?
A: LoRA assumes that the weight update during fine-tuning is low-rank, and approximates it using the factorization . Only and are trained while the original weights are frozen, dramatically reducing the number of trainable parameters and GPU memory usage.
Q6. What is the definition of entropy and its intuition?
A: , measuring the uncertainty of a random variable. Entropy is maximized under a uniform distribution () and is 0 for a deterministic distribution. The unit depends on the base of the logarithm: → bits; → nats.
Follow-up: Why does the maximum entropy principle yield the Gaussian distribution?
A: Under the constraints , , , maximizing via Lagrange multipliers yields the Gaussian distribution as the optimal solution.
Q7. What is the difference between dot product and cosine similarity?
A:
- Dot product: , influenced by both magnitude and direction
- Cosine similarity: , reflects only direction (angle), magnitude is normalized out
Follow-up: Why does Attention use scaled dot-product rather than cosine similarity?
A: Scaled dot-product is computationally efficient (matrix multiplication), and the norms of Q and K participate in the attention weights, increasing expressiveness. RoPE rotations preserve norms and encode only relative position information. Cosine similarity requires extra L2 normalization and limits expressiveness.
Q8. What are the Jacobian matrix and the Hessian matrix?
A:
- Jacobian : first-order derivative matrix of a vector function ,
- Hessian : second-order derivative matrix of a scalar function, , symmetric. Positive definite → local minimum; indefinite → saddle point
Follow-up: Why is the Hessian rarely used directly in large-model optimization?
A: The Hessian has size ( = number of parameters); storing and computing it is infeasible for models with billions of parameters. Second-order methods like K-FAC and Shampoo use Kronecker factorization or block-diagonal approximations of the Hessian to reduce cost.
L2 — Intermediate
Q9. What is the difference between MLE and MAP? How does regularization correspond to Bayesian priors?
A: MLE maximizes the likelihood ; MAP maximizes the posterior , adding a prior term . Gaussian prior → L2 regularization (Weight Decay); Laplace prior → L1 regularization (Lasso).
Follow-up: From a Bayesian perspective, what prior does AdamW's weight decay assume?
A: It is equivalent to an isotropic Gaussian prior . AdamW decouples weight decay from the gradient update (decoupled weight decay), which differs subtly from L2-regularized Adam, but the Bayesian interpretation is the same.
Q10. What is the Bias-Variance Decomposition?
A: Prediction error = Bias² + Variance + irreducible noise. High bias → underfitting; high variance → overfitting. Increasing model capacity reduces bias but may increase variance, and vice versa — this is the bias-variance tradeoff.
Follow-up: Do ensemble methods (Bagging, Dropout as ensemble) primarily reduce bias or variance?
A: Primarily reduce variance. Bagging averages multiple high-variance models, reducing variance by a factor of ( = number of models), with almost no change in bias. Boosting mainly reduces bias.
Q11. What is SVD? How is it used for low-rank approximation?
A: Any matrix , where are orthogonal and is a diagonal matrix of singular values. The Eckart–Young theorem states that the truncated SVD , retaining only the top singular values, is the optimal rank- Frobenius-norm approximation.
Follow-up: How are the Frobenius norm and spectral norm each determined by the singular values?
A: (square root of the sum of squared singular values); (largest singular value).
Q12. What is the difference between eigendecomposition and SVD?
A: Eigendecomposition applies only to square matrices (and requires diagonalizability), ; SVD applies to any matrix of any shape, , with singular values always non-negative. For symmetric positive definite matrices, eigendecomposition and SVD are equivalent.
Follow-up: What is the geometric meaning of the eigenvectors of a covariance matrix?
A: The eigenvectors of the covariance matrix are the principal axis directions (principal directions) of the data distribution, corresponding to the principal component directions in PCA. The magnitude of each eigenvalue represents the variance along that direction.
Q13. What is the definition of KL divergence and its key properties?
A: . Properties: (1) ; (2) asymmetric, not a distance metric; (3) .
Follow-up: How do Forward KL and Reverse KL differ in their fitting behavior?
A: Forward KL () is mean-seeking: tends to cover all high-probability regions of , preserving diversity. Reverse KL () is mode-seeking: tends to concentrate on a single peak of and may ignore other modes. Variational inference typically uses Reverse KL.
Q14. What is the relationship among cross-entropy, KL divergence, and entropy?
A: . When is a one-hot distribution, , so minimizing cross-entropy = minimizing KL divergence = MLE.
Follow-up: What is the relationship between a language model's perplexity (PPL) and cross-entropy?
A: , where is the cross-entropy. PPL can be understood as the average number of effective choices the model has at each token position; lower PPL means a better model.
Q15. What is Jensen's inequality? What are its applications in ML?
A: For a convex function : . Applications: (1) proving non-negativity of KL divergence; (2) deriving the variational lower bound ELBO; (3) proving convergence of the EM algorithm.
Follow-up: What is variational inference (VI)? What is the relationship between the ELBO and KL divergence?
A: VI approximates the complex posterior with a simple distribution . ; maximizing ELBO is equivalent to minimizing the KL divergence between the variational posterior and the true posterior.
Q16. What are the necessary and sufficient conditions for a positive definite matrix?
A: For a symmetric matrix to be positive definite: (1) all eigenvalues ; (2) all leading principal minors (Sylvester's criterion); (3) a Cholesky decomposition exists; (4) .
Follow-up: Why is a covariance matrix always positive semi-definite?
A: For any , (variance is non-negative). Being strictly positive definite requires the variance to be strictly greater than zero (no deterministic linear relationship exists).
Q17. What is the computational complexity of matrix multiplication? How are LLM training FLOPs estimated?
A: , : requires FLOPs. LLM training estimate: with parameters and training tokens, total FLOPs (forward ≈ 2ND, backward ≈ 4ND; inference / forward-only ≈ 2ND).
Follow-up: Why is the decode phase of LLM inference typically memory-bound rather than compute-bound?
A: Autoregressive decoding generates only 1 token per step, collapsing the batch dimension. Matrix multiplication degenerates to matrix-vector multiplication, which has very few FLOPs but requires loading the entire model weights (IO-intensive), leaving GPU compute underutilized.
L3 — Deep
Q18. What is the relationship between the Central Limit Theorem (CLT) and SGD noise? Why does SGD noise have a regularization effect?
A: The CLT guarantees that the mini-batch gradient is an unbiased estimator of the full-data gradient, with variance . SGD gradient noise helps the model escape sharp minima and tends to converge to flat minima. Flat minima generally generalize better, which amounts to implicit regularization.
Follow-up: What is the effect of increasing batch size during training?
A: Gradient variance decreases (), making optimization more stable but potentially losing the regularization effect and degrading generalization. Requires paired learning rate adjustment (linear scaling rule).
Q19. What is the fundamental difference in fitting behavior between Forward KL and Reverse KL? In what scenarios is each used?
A:
- Forward KL (, I-projection): minimizing it forces to have mass wherever → mean-seeking / zero-avoiding. Suitable for preserving multi-modal coverage.
- Reverse KL (, M-projection): minimizing it pushes to have negligible mass where → mode-seeking / zero-forcing. Suitable for concentrating on the dominant mode.
Variational inference (VI) uses Reverse KL by default (it arises naturally from the ELBO derivation); the KL constraint in RLHF is Reverse KL (mode-seeking): it pulls policy toward the reference and penalizes deviation from high-probability regions of , suppressing reward-hacking distribution drift.
Follow-up: Are there practical applications of mixing Forward/Reverse KL?
A: The Rényi- divergence family interpolates between the two for and has been used in variational inference and reinforcement learning to trade off coverage against concentration.
Q20. Why does Attention use scaled dot-product rather than cosine similarity?
A: Scaled dot-product advantages: (1) efficient matrix multiplication; (2) the norms of Q and K participate in attention weight computation, increasing expressiveness; (3) the scaling factor prevents dot products from becoming too large and causing softmax saturation. Cosine similarity loses magnitude information and requires extra normalization.
RoPE (Rotary Position Embedding) only changes the direction (angle) of Q and K, keeping norms unchanged, so positional information is naturally incorporated through the angular difference in the dot product.
Follow-up: What happens if Q and K are L2-normalized before computing the dot product?
A: This is equivalent to cosine similarity (with temperature scaling). Some work (e.g., Normalized Attention) has explored this direction; the advantage is that attention weights are norm-independent and more stable, but it may limit the ability to differentiate expressiveness among tokens. Methods like CosFormer introduce cosine reweighting while maintaining linear attention efficiency.
Q21. In what scenarios is reverse-mode automatic differentiation (AD) more efficient, and when is forward-mode AD preferred?
A:
- Reverse-mode AD: one forward pass + one backward pass computes gradients of a scalar loss with respect to all parameters, with complexity the forward computation. Best for many parameters, few outputs (i.e., ); this is the foundation of backpropagation in deep learning.
- Forward-mode AD: each pass propagates along one input direction to obtain the gradient of all outputs with respect to one input. Best for few inputs, many outputs (i.e., ), such as computing Jacobian-vector products (JVPs).
Follow-up: Can a Hessian-vector product (HVP) be computed without explicitly forming the Hessian?
A: Yes. ; first apply forward-mode differentiation to (or backpropagate through ), requiring only two differentiation passes with complexity comparable to a single gradient computation. This is the basis for implicit second-order methods such as conjugate gradient (CG).
Q22. How is importance sampling applied in PPO? What does the in PPO-Clip constrain?
A: PPO reuses trajectories collected under the old policy to train the new policy , correcting for the distributional shift via the importance ratio . The PPO-Clip objective:
(typically 0.1–0.2) constrains the range of the importance ratio (not the weight itself). When falls outside , the gradient is cut off, preventing excessively large policy updates that would destabilize training.
Follow-up: If the importance ratio variance is too large, what are the alternatives?
A: V-trace (introduced in IMPALA) truncates the ratio, or Retrace() and similar methods control variance. The core idea is to limit the effective importance weight to ensure convergence stability.
Q23. What is variational inference (VI) and the ELBO? How is it derived?
A: Approximate the true posterior with a simple distribution . Derivation:
The inequality follows from Jensen's inequality ( is concave). ELBO = reconstruction term − regularization term . -VAE uses a hyperparameter to reweight the KL term, controlling the tradeoff between disentanglement and reconstruction quality.
Follow-up: Why is the ELBO maximized (rather than minimized)?
A: The ELBO is a lower bound on . Maximizing ELBO → tighter lower bound → closer to the true posterior . Since and KL , maximizing ELBO is equivalent to minimizing the KL divergence between the variational and true posteriors.
Q24. What is the theoretical and empirical basis for the low-rank assumption in parameter-efficient fine-tuning (PEFT)?
A: Theoretical basis — research on intrinsic dimensionality shows that fine-tuning a pretrained model actually takes place in a low-dimensional subspace far smaller than the full parameter space. LoRA parameterizes weight updates as (), requiring only parameters () — orders of magnitude fewer than the original . Empirically, is sufficient to approach full fine-tuning performance across a variety of downstream tasks.
Follow-up: How should the rank and which layers to apply LoRA to be chosen?
A: Rank is typically selected via a small-scale search on the validation set. In practice, applying LoRA to the Q/K/V/O projection matrices of the attention layers works well. More advanced methods such as AdaLoRA adaptively allocate different ranks to each layer based on the singular value distribution. Choosing requires balancing expressiveness against parameter efficiency.
Q25. What is the closed-form KL divergence for Gaussian distributions? How is it used in a VAE?
A: For two multivariate Gaussians and :
In a VAE with and , this simplifies to:
This KL term appears as a regularization term in the ELBO, encouraging the encoder output to stay close to the standard normal prior.
Follow-up: What is the effect of in -VAE?
A: Increasing the KL term weight forces the encoder to match the standard normal prior more tightly, so the learned latent variable becomes more disentangled, but reconstruction quality may suffer. has the opposite effect: a more flexible latent space but potentially at the cost of disentanglement.
Last updated: 2025 | This cheat sheet is for study reference only; consult original literature for authoritative formulas.
Extended L3
Q26: What is the intrinsic connection between the Fisher Information Matrix (FIM) and the natural gradient? Why is the natural gradient the steepest descent in the KL sense?
A: The Fisher information matrix is defined as the covariance of the log-likelihood gradient: . It is also the negative expected Hessian of the log-likelihood (at the MLE), and is closely related to the local quadratic expansion of the KL divergence: . Thus the FIM is the local metric tensor of the KL divergence in parameter space. The natural gradient is defined as , i.e., the direction of steepest descent in loss under the KL-ball constraint . Intuitively: the ordinary gradient gives steepest descent in Euclidean space, but equal Euclidean steps in parameter space do not correspond to equal changes in the distribution; the natural gradient uses the FIM to correct this "distortion" so that optimization steps are uniform in distribution space.
Follow-up: Why is the natural gradient not used directly in practice? What approximations exist? (Hint: K-FAC and EKFAC use block-diagonal Kronecker factorization of the FIM to reduce the cost of inversion.)
Q27: What is the mathematical essence of the reparameterization trick? Why does it reduce the variance of the gradient estimator?
A: In a VAE, the gradient of the ELBO expectation cannot be backpropagated through the sampling node directly (sampling is not differentiable). The reparameterization trick rewrites as , where . This shifts the randomness away from the parameters to a parameter-free noise variable , while is deterministically differentiable with respect to , so gradients can flow back to and via standard backpropagation. Compared to the score function estimator (REINFORCE) , the reparameterization trick directly encodes gradient information into the computation graph without relying on the scalar value as a multiplicative weight, resulting in substantially lower variance. In essence, the score function estimator uses only the scalar value of as a weight, while reparameterization exploits the local gradient , providing a richer signal.
Follow-up: For discrete latent variables (e.g., discrete VAE), the reparameterization trick does not directly apply. What are the alternatives? (Hint: Gumbel-Softmax / Concrete distribution uses a continuous relaxation to approximate discrete sampling while maintaining differentiability.)
Q28: How does the spectral norm of a matrix control the Lipschitz constant of a neural network? What does this imply for training stability?
A: A function has Lipschitz constant if . For a linear layer , the Lipschitz constant equals exactly (largest singular value). For a -layer network , the overall Lipschitz constant is , the product of spectral norms across layers. If is too large, small perturbations to inputs are amplified exponentially in the forward pass (exploding activations), and gradients are amplified exponentially in the backward pass (exploding gradients), destabilizing training. Spectral normalization (dividing each layer's weight by its spectral norm: ) constrains the per-layer Lipschitz constant to 1 and is widely used in GAN discriminator training to prevent mode collapse. The spectral norm can be efficiently approximated via power iteration without requiring a full SVD.
Follow-up: Why is it necessary to impose a Lipschitz constraint on the GAN discriminator (e.g., via spectral normalization or gradient penalty)? What goes wrong without it? (Hint: The Wasserstein distance requires the discriminator to belong to the class of 1-Lipschitz functions; otherwise the objective is unbounded.)
Q29: What is the core identity of the trace trick (matrix trace trick)? In which ML derivations is it indispensable?
A: The core identities are (cyclic permutation invariance) and the fact that a scalar equals its own trace: . These make it feasible to differentiate scalar functions of matrix products. Typical applications: (1) matrix differentiation in linear regression , expanded using to yield the closed-form solution; (2) differentiating and in the multivariate Gaussian log-likelihood; (3) deriving the MLE of the covariance matrix ; (4) deriving the matrix Riccati equation in linear dynamical systems (Kalman filter). In short, the trace trick is the core tool for any setting involving differentiating matrix quadratic forms.
Follow-up: Use the trace trick to derive the MLE for the precision matrix (inverse covariance) of a multivariate Gaussian . Hint: the log-likelihood contains and terms. (Answer: , i.e., the MLE precision matrix is the inverse of the sample covariance.)
Q30: What is the precise mathematical correspondence between the Schur complement and the conditional distribution of a multivariate Gaussian?
A: Let and partition the covariance matrix as . The conditional distribution is again Gaussian, and the top-left block of the joint precision matrix is exactly , which is the Schur complement: . The conditional mean is . This correspondence has far-reaching implications: (1) in Gaussian Markov Random Fields, zero entries in the precision matrix correspond to conditional independence relationships, encoding conditional structure more directly than the covariance matrix; (2) the posterior predictive in Gaussian Processes (GP) is fundamentally a conditional Gaussian, yielding a closed-form solution via the Schur complement; (3) the Kalman filter update step can also be understood from the Schur complement perspective.
Follow-up: Why are Gaussian graphical models encoded using the precision matrix rather than the covariance matrix? What is the relationship between conditional independence and zero entries in the precision matrix? (Answer: iff ; the non-zero pattern of the precision matrix directly corresponds to the edge structure of the graph.)
Q31: What is the fundamental difference between Wasserstein distance and KL divergence for measuring distributional differences? Why does WGAN use Wasserstein rather than KL?
A: KL divergence is infinite when the supports of and do not overlap (even if they are geometrically "close"); in high-dimensional spaces this is almost always the case, since two low-dimensional manifold supports almost never overlap. Wasserstein distance (Earth Mover's Distance), based on optimal transport, is defined as , and provides a meaningful finite distance even when supports do not overlap. Intuitive analogy: KL looks at the ratio (and "explodes" if where ); Wasserstein measures the minimum "work" required to transport one pile of "dirt" to another. Therefore, Wasserstein provides smooth gradient signals even when distributions do not overlap, while KL or JS divergence leads to vanishing gradients. WGAN uses the Wasserstein-1 distance as the generator objective, converting it via Kantorovich–Rubinstein duality to an optimization over 1-Lipschitz discriminators, resolving the training instability of classical GANs.
Follow-up: The computational cost of Wasserstein distance is typically higher than that of KL divergence; how can it be approximated in high dimensions? (Hint: the Sinkhorn algorithm adds entropic regularization to optimal transport, converting a linear program into parallelizable matrix-scaling iterations; Sliced Wasserstein reduces high-dimensional problems to multiple one-dimensional problems via random projections.)
Q32: How does the spectral structure (spectrum) of the Hessian characterize the local geometry of the loss landscape? What does this imply for understanding saddle points and escape strategies in optimization?
A: At a critical point (where the gradient is zero), the eigenvalue distribution of the Hessian determines the local geometry: (1) all eigenvalues (positive definite) → local minimum, the surface is bowl-shaped; (2) all eigenvalues (negative definite) → local maximum; (3) a mix of positive and negative eigenvalues (indefinite) → saddle point, with descent directions along negative-curvature directions. In high-dimensional parameter spaces, saddle points far outnumber local minima (because the probability of all eigenvalues being positive decays exponentially), making them the main obstacle in non-convex optimization. Escape strategies: (1) SGD gradient noise naturally provides perturbations along negative-curvature directions, helping escape saddle points — the geometric interpretation of SGD's implicit regularization effect; (2) momentum helps traverse flat regions; (3) explicit methods such as adding perturbations along Hessian negative-curvature directions (not yet common but theoretically effective). The eigenvalue spectrum of the Hessian is also related to generalization: the larger the eigenvalues of the Hessian at a minimum (the "sharper" the surface), the worse that minimum tends to generalize — this motivates sharpness-aware minimization (SAM) and related methods.
Follow-up: Why is the saddle point problem more severe than the local minimum problem in extremely high-dimensional spaces? How does this follow from the eigenvalue distribution of the Hessian? (Answer: in an -dimensional space, the Hessian has eigenvalues; at a random critical point each eigenvalue is equally likely positive or negative, so the probability that all are positive is , exponentially small — almost any critical point is a saddle point rather than a local minimum.)