Reparametrization sensitivity, with ChatGPT

The conversation has been condensed and edited for clarity.

User: One issue with vanilla gradient descent is that the gradient direction is sensitive to parametrization. Are there practical methods to avoid this issue? Does natural gradient descent help?

Assistant: Yes, natural gradient descent addresses this by preconditioning with the Fisher Information Matrix (FIM), which measures distances in terms of the model’s output distribution rather than its parameters. This makes the updates (approximately) invariant under smooth reparameterizations such as rescaling the parameters, so steps follow meaningful directions in function space rather than depending on an arbitrary choice of coordinates.
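
To make the sensitivity concrete, here is a toy sketch (mine, not part of the original conversation): fitting the mean of a unit-variance Gaussian under an arbitrarily rescaled parameterization \(\mu = c\,\theta\). The vanilla gradient step changes \(\mu\) by an amount that depends on the arbitrary scale \(c\), while the natural gradient step does not.

```python
import numpy as np

# Toy illustration: fit the mean of a unit-variance Gaussian, but parameterize
# it as mu = c * theta for some arbitrary scale c. Vanilla gradient steps
# depend on c; natural gradient steps do not.

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)

def step_in_mu_space(c, lr=0.1, theta=0.0, natural=False):
    mu = c * theta
    # Gradient of the mean negative log-likelihood with respect to theta.
    grad = -c * np.mean(data - mu)
    if natural:
        fisher = c ** 2          # Fisher information w.r.t. theta for this model
        grad = grad / fisher     # natural gradient: F^{-1} g
    theta_new = theta - lr * grad
    return c * theta_new - mu    # change induced in mu, the quantity that matters

for c in (0.1, 1.0, 10.0):
    print(f"c = {c:5.1f}   vanilla d_mu = {step_in_mu_space(c):+9.4f}   "
          f"natural d_mu = {step_in_mu_space(c, natural=True):+9.4f}")
```

The vanilla update in \(\mu\) scales with \(c^2\); the natural gradient update is the same for every \(c\).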

User: Is K-FAC part of a “natural” family of approximations to natural gradient descent, or is it more heuristic?

Assistant: K-FAC is based on principled approximations rather than ad-hoc heuristics. It treats the FIM as block-diagonal across layers and approximates each layer’s block as a Kronecker product of two much smaller matrices, under the assumption that the layer’s input activations and backpropagated output gradients are approximately statistically independent. This captures enough of the FIM’s structure to make natural gradient descent practical in deep networks while keeping memory and computational costs manageable.
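
As a concrete, deliberately simplified sketch of that factorization for one fully connected layer (my own toy code with random stand-ins for the activations and backpropagated gradients, not a reference implementation):

```python
import numpy as np

# Shapes: activations a are (batch, d_in), backpropagated pre-activation
# gradients g are (batch, d_out), and the weight gradient dW is (d_out, d_in).

rng = np.random.default_rng(0)
batch, d_in, d_out = 256, 32, 16
a = rng.normal(size=(batch, d_in))    # layer inputs
g = rng.normal(size=(batch, d_out))   # backpropagated gradients at the layer output
dW = g.T @ a / batch                  # ordinary weight gradient

# Kronecker factors: second moments of activations and of output gradients.
A = a.T @ a / batch                   # (d_in, d_in)
G = g.T @ g / batch                   # (d_out, d_out)

# Damping keeps the factors invertible and limits steps in low-curvature directions.
damping = 1e-3
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

# Because (A kron G)^{-1} vec(dW) = vec(G^{-1} dW A^{-1}), the preconditioned
# update never materializes the full per-layer Fisher block.
precond_dW = G_inv @ dW @ A_inv
print(precond_dW.shape)               # (16, 32)
```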

User: How does K-FAC scale with model size? Does the overhead stay constant, or does it get worse as the number of parameters grows?

Assistant: K-FAC scales reasonably well, but the overhead is not constant. Each layer adds \(O(d_{\text{in}}^2 + d_{\text{out}}^2)\) memory for the Kronecker factors plus \(O(d_{\text{in}}^3 + d_{\text{out}}^3)\) compute for inverting them. Depth simply adds more such blocks, so the relative overhead is driven mainly by layer width: manageable for many networks, but the cubic inversion term can become significant for very wide layers.
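
A rough back-of-the-envelope illustration of how those terms grow (my own numbers, assuming square layers with \(d_{\text{in}} = d_{\text{out}} = d\) and ignoring constants):

```python
# Illustrative growth of the per-layer overhead for square layers (d_in = d_out = d).
# Constants and the exact inversion algorithm are ignored; the point is the
# quadratic vs. cubic growth relative to the d^2 weights in the layer itself.
for d in (256, 1024, 4096):
    weights = d * d                   # parameters in the layer
    factor_entries = 2 * d * d        # entries in the two Kronecker factors
    inversion_flops = 2 * d ** 3      # ~d^3 to invert each factor
    print(f"d = {d:5d}: {weights:>12,} weights, "
          f"{factor_entries:>12,} factor entries, "
          f"~{inversion_flops:>16,} flops per inversion")
```

In practice the factors and their inverses are typically refreshed only every few steps, which amortizes the cubic inversion term over many updates.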

User: Are there other approximation families with fewer assumptions?

Assistant: Yes, several alternatives approximate the FIM or Hessian differently. Diagonal approximations (in the spirit of Adam and related adaptive methods) keep only per-parameter statistics, dropping inter-parameter dependencies entirely in exchange for a much cheaper but lower-fidelity preconditioner. L-BFGS, a quasi-Newton method, builds an implicit estimate of the inverse Hessian from a short history of parameter and gradient differences. Other block-structured approximations exist as well, but K-FAC’s Kronecker-factored blocks strike a good balance between efficiency and accuracy.
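
For contrast, here is a minimal sketch of the diagonal end of that spectrum: a per-parameter second-moment preconditioner in the spirit of Adam-like methods (simplified, with no momentum or bias correction, so it is not Adam itself):

```python
import numpy as np

def diagonal_precondition(grad, second_moment, beta=0.999, eps=1e-8):
    # Running per-parameter estimate of E[g^2]; no cross-parameter terms at all.
    second_moment = beta * second_moment + (1.0 - beta) * grad ** 2
    return grad / (np.sqrt(second_moment) + eps), second_moment

grad = np.array([0.5, -0.01, 2.0])
m2 = np.zeros_like(grad)
step, m2 = diagonal_precondition(grad, m2)
print(step)   # each coordinate is rescaled independently of the others
```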

User: Is L-BFGS an approximation to both natural gradient descent and Newton’s method?

Assistant: L-BFGS is mainly an approximation to Newton’s method, not to natural gradient descent. It approximates the Hessian of the loss rather than the FIM, so its updates remain sensitive to how the model is parameterized. That said, for likelihood-based losses the FIM coincides with the expected Hessian when targets are drawn from the model’s own predictive distribution, so in well-behaved regions (for example near a good fit) the two notions of curvature often roughly agree and L-BFGS can behave similarly to natural gradient.
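
For reference, the core of L-BFGS is the two-loop recursion, sketched below in toy NumPy code: it applies an implicit inverse-Hessian estimate to the gradient using only a short history of parameter differences \(s_k\) and gradient differences \(y_k\), never forming a matrix. Because those differences are taken in raw parameter coordinates, the curvature estimate inherits the parameterization sensitivity mentioned above.

```python
import numpy as np

def lbfgs_direction(grad, s_history, y_history):
    q = grad.copy()
    alphas, rhos = [], []
    # First loop: newest to oldest curvature pairs.
    for s, y in zip(reversed(s_history), reversed(y_history)):
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q = q - alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    # Initial inverse-Hessian guess: a scaled identity from the newest pair.
    s, y = s_history[-1], y_history[-1]
    r = (np.dot(s, y) / np.dot(y, y)) * q
    # Second loop: oldest to newest.
    for (s, y), rho, alpha in zip(zip(s_history, y_history),
                                  reversed(rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r = r + s * (alpha - beta)
    return r   # approximately H^{-1} @ grad; the step is x_new = x - lr * r

# Example usage on a quadratic f(x) = 0.5 * x @ H @ x with one curvature pair:
H = np.diag([1.0, 10.0, 100.0])
x0, x1 = np.array([1.0, 1.0, 1.0]), np.array([0.8, 0.6, 0.4])
s_hist, y_hist = [x1 - x0], [H @ x1 - H @ x0]
print(lbfgs_direction(H @ x1, s_hist, y_hist))
```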

User: How does natural gradient compare to a naive second-order method that computes the Hessian?

Assistant: Both leverage curvature, but of different objects. A naive second-order (Newton-type) method uses the Hessian of the loss directly, while natural gradient uses the FIM, which measures curvature in the model’s output distribution rather than in the raw loss surface. In practice, naive second-order methods are usually infeasible for large models because computing and inverting the full Hessian is far too expensive, and in non-convex landscapes the Hessian can be indefinite, so a raw Newton step is not even guaranteed to point downhill. The FIM, by contrast, is always positive semi-definite, and with approximations like K-FAC natural gradient descent remains computationally viable for large models.
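
A one-parameter example of that difference (mine, not from the conversation): for \(y \sim \mathcal{N}(\sin\theta, 1)\) with squared-error loss, the Hessian of the loss can be negative at some points, while the Fisher, which here is just \(\cos^2\theta\), never is.

```python
import numpy as np

def loss_hessian(theta, y):
    # d^2/dtheta^2 of 0.5 * (y - sin(theta))^2
    return np.cos(theta) ** 2 + (y - np.sin(theta)) * np.sin(theta)

def fisher(theta):
    # For a unit-variance Gaussian output, the Fisher is f'(theta)^2 >= 0.
    return np.cos(theta) ** 2

theta, y = 1.5, -1.0
print("Hessian:", loss_hessian(theta, y))   # negative here: locally concave loss
print("Fisher :", fisher(theta))            # always non-negative
```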

Written by

Stephane Bersier