New Insights and Perspectives on the Natural Gradient Method


New Insights and Perspectives on the Natural Gradient Method

Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
March 13, 2018

Motivation

In parameter space (µ, σ), each pair of distributions has the same distance, even though the distributions themselves can differ by very different amounts.

Motivation

We often talk about the parameter space of a neural network, instead of the functions that those parameters represent.

Gradient Descent

The gradient descent update $\Delta\theta = -\alpha\,\nabla_\theta L$ is the solution to the following optimization problem:
$$\arg\min_{\Delta\theta}\; \nabla_\theta L(\theta)^\top \Delta\theta \quad \text{s.t. } \|\Delta\theta\| \le \delta.$$
We are optimizing a linear approximation of $L$ within a trust region defined by the Euclidean metric in parameter space.
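As a quick numerical illustration (a toy sketch of my own, not from the slides): over a Euclidean ball of radius $\delta$, the linearized loss is minimized by the rescaled negative gradient $-\delta\,\nabla L/\|\nabla L\|$, and random directions on the same ball never do better.

```python
# Toy check (assumed setup): over the Euclidean trust region ||d|| <= delta,
# the linearized loss g . d is minimized by d* = -delta * g / ||g||, i.e. by
# the (scaled) negative gradient. Compare against random directions.
import jax
import jax.numpy as jnp

key1, key2 = jax.random.split(jax.random.PRNGKey(0))
g = jax.random.normal(key1, (5,))          # stand-in for grad_theta L(theta)
delta = 0.1

d_star = -delta * g / jnp.linalg.norm(g)   # closed-form minimizer
d_rand = jax.random.normal(key2, (10000, 5))
d_rand = delta * d_rand / jnp.linalg.norm(d_rand, axis=1, keepdims=True)

print(g @ d_star)                          # most negative achievable value
print((d_rand @ g).min())                  # random directions never go lower
```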

Natural Gradient

Consider a family of density functions $F : \theta \mapsto p(z)$ with $\theta \in \mathbb{R}^n$, and write $p_\theta(z) := F(\theta)(z)$. We naturally obtain a notion of closeness between different values of $\theta$:
$$d(\theta_1, \theta_2) := \mathrm{KL}\!\left(p_{\theta_1}(z)\,\|\,p_{\theta_2}(z)\right).$$

Natural Gradient

$$d(\theta_1, \theta_2) := \mathrm{KL}\!\left(p_{\theta_1}(z)\,\|\,p_{\theta_2}(z)\right).$$
Let's see how $d$ behaves near a given $\theta$. Expanding to first order,
$$\mathrm{KL}\!\left(p_\theta(z)\,\|\,p_{\theta+\Delta\theta}(z)\right) = \mathbb{E}_z[\log p_\theta] - \mathbb{E}_z[\log p_\theta] - \mathbb{E}_z[\nabla_\theta \log p_\theta]^\top \Delta\theta + O(\|\Delta\theta\|^2) = O(\|\Delta\theta\|^2),$$
since $\mathbb{E}_z[\nabla_\theta \log p_\theta] = 0$. The first-order approximation therefore carries no useful information.

Natural Gradient

Expanding the KL divergence to second order:
$$\mathrm{KL}\!\left(p_\theta(z)\,\|\,p_{\theta+\Delta\theta}(z)\right) = \tfrac{1}{2}\,\Delta\theta^\top\, \mathbb{E}_z\!\left[-\nabla^2 \log p_\theta\right]\Delta\theta + O(\|\Delta\theta\|^3) \approx \tfrac{1}{2}\,\Delta\theta^\top F_\theta\, \Delta\theta,$$
where
$$F_\theta = \mathbb{E}_z\!\left[-\nabla^2 \log p_\theta\right] = \mathbb{E}_z\!\left[\nabla \log p_\theta\, \nabla \log p_\theta^\top\right].$$
The Fisher information matrix is the expected negative Hessian of $\log p_\theta$.
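A minimal numerical sketch (my own toy example with a one-dimensional Gaussian, not part of the slides): the gradient of the KL divergence with respect to the perturbation vanishes at zero, and its Hessian there matches the closed-form Fisher matrix $\mathrm{diag}(1/\sigma^2,\, 2/\sigma^2)$.

```python
# Sketch (1-D Gaussian with theta = (mu, sigma), assumed toy values): check
# that KL(p_theta || p_{theta+d}) has zero gradient at d = 0 and that its
# Hessian at d = 0 equals the Fisher matrix diag(1/sigma^2, 2/sigma^2).
import jax
import jax.numpy as jnp

def kl_gaussian(theta, d):
    mu1, s1 = theta
    mu2, s2 = theta + d
    return jnp.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

theta = jnp.array([0.3, 1.7])
zero = jnp.zeros(2)

grad_d = jax.grad(kl_gaussian, argnums=1)(theta, zero)     # first-order term
hess_d = jax.hessian(kl_gaussian, argnums=1)(theta, zero)  # second-order term
fisher = jnp.diag(jnp.array([1 / theta[1]**2, 2 / theta[1]**2]))

print(grad_d)             # ~ [0, 0]: no first-order information in the KL
print(hess_d - fisher)    # ~ 0: the Hessian of the KL at 0 is the Fisher matrix
```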


Natural Gradient

We change the constraint of
$$\arg\min_{\Delta\theta}\; \nabla_\theta L(\theta)^\top \Delta\theta \quad \text{s.t. } \|\Delta\theta\| = \text{const}$$
to
$$\text{s.t. } \mathrm{KL}\!\left(p_\theta(z)\,\|\,p_{\theta+\Delta\theta}(z)\right) = \text{const}.$$
Introducing a Lagrange multiplier, the objective becomes $\nabla_\theta L(\theta)^\top \Delta\theta - \lambda\, \Delta\theta^\top F_\theta\, \Delta\theta$, whose stationary point gives
$$\Delta\theta \propto F_\theta^{-1}\, \nabla_\theta L.$$
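A small sketch (toy gradient and Fisher matrix of my own): among directions of constant Fisher norm, the normalized natural-gradient direction $-F^{-1}\nabla L$ achieves the largest linearized decrease; the natural gradient itself is computed with a linear solve rather than an explicit inverse.

```python
# Toy check (assumed Fisher and gradient): among directions with unit Fisher
# norm d^T F d = 1, the linearized loss g . d is minimized by the normalized
# negative natural gradient -F^{-1} g, not by the plain negative gradient.
import jax
import jax.numpy as jnp

g = jnp.array([1.0, 0.1])                        # stand-in for grad_theta L
F = jnp.array([[4.0, 0.0],
               [0.0, 0.01]])                     # assumed Fisher matrix

nat = jnp.linalg.solve(F, g)                     # natural gradient F^{-1} g
nat = nat / jnp.sqrt(nat @ F @ nat)              # rescale to unit Fisher norm

d_rand = jax.random.normal(jax.random.PRNGKey(0), (5000, 2))
d_rand = d_rand / jnp.sqrt(jnp.einsum('ni,ij,nj->n', d_rand, F, d_rand))[:, None]

print(-(g @ nat))                                # best achievable decrease
print((d_rand @ g).min())                        # random directions: never lower
```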

Natural Gradient

For $\eta$ defined by $\eta - \eta_0 = F^{1/2}(\theta - \theta_0)$, the Fisher ball is a unit circle. Thus, in the parameter space of $\eta$, the natural gradient is the same as the gradient.
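A small sketch (assumed 2x2 Fisher matrix, not from the slides): by the chain rule, $\nabla_\eta L = F^{-1/2}\nabla_\theta L$ in the whitened coordinates, and mapping an ordinary gradient step in $\eta$ back to $\theta$ recovers the natural-gradient step $F^{-1}\nabla_\theta L$.

```python
# Sketch (assumed Fisher): build F^{-1/2} from an eigendecomposition and check
# that a plain gradient step in eta = F^{1/2} (theta - theta0), mapped back to
# theta, equals the natural-gradient step F^{-1} grad_theta L.
import jax.numpy as jnp

F = jnp.array([[4.0, 1.0],
               [1.0, 2.0]])
grad_theta = jnp.array([1.0, -0.5])

w, V = jnp.linalg.eigh(F)
F_half_inv = V @ jnp.diag(1.0 / jnp.sqrt(w)) @ V.T   # F^{-1/2}

grad_eta = F_half_inv @ grad_theta        # chain rule: theta = theta0 + F^{-1/2} eta
step_theta = F_half_inv @ grad_eta        # map the eta-space step back to theta

print(step_theta - jnp.linalg.solve(F, grad_theta))  # ~ [0, 0]
```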


Second-order Optimization

$$\Delta\theta = \arg\min_{\delta} M(\delta), \qquad M(\delta) := \tfrac{1}{2}\,\delta^\top B\,\delta + \nabla h(\theta)^\top \delta + h(\theta).$$
The solution is $\Delta\theta = -B^{-1}\nabla h$.

$B = \beta I$: gradient descent.
$B = H(\theta)$: Newton's method. Newton's method assumes the local quadratic model is convex ($H \succ 0$), and can fail otherwise.
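A small sketch (my own toy quadratic, not from the slides): with $B$ equal to the Hessian, the update $\delta = -B^{-1}\nabla h$ lands on the minimizer of a convex quadratic in one step, while $B = \beta I$ gives only an ordinary gradient step.

```python
# Toy quadratic h(theta) = 1/2 theta^T A theta - b^T theta (assumed values).
# With B = A (Newton) the update delta = -B^{-1} grad h jumps to the minimizer
# A^{-1} b; with B = beta * I it is just a gradient-descent step.
import jax.numpy as jnp

A = jnp.array([[3.0, 0.5],
               [0.5, 1.0]])
b = jnp.array([1.0, -2.0])
theta = jnp.zeros(2)

grad_h = A @ theta - b
newton = theta - jnp.linalg.solve(A, grad_h)   # B = A
gd = theta - 0.1 * grad_h                      # B = (1 / 0.1) * I

print(newton - jnp.linalg.solve(A, b))         # ~ [0, 0]: minimizer in one step
print(gd - jnp.linalg.solve(A, b))             # still far from the minimizer
```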

Second-order Optimization: The Generalized Gauss-Newton Matrix

Let our loss function be $L(f(x, \theta))$. Then
$$H_{ij} = \frac{\partial^2 L(f(x,\theta))}{\partial \theta_i\, \partial \theta_j}
= \frac{\partial}{\partial \theta_j} \sum_{k} \frac{\partial L(f(x,\theta))}{\partial f_k}\, \frac{\partial f_k(x,\theta)}{\partial \theta_i}
= \sum_{k}\sum_{l} \frac{\partial^2 L(f(x,\theta))}{\partial f_k\, \partial f_l}\, \frac{\partial f_k(x,\theta)}{\partial \theta_i}\, \frac{\partial f_l(x,\theta)}{\partial \theta_j}
+ \sum_{k} \frac{\partial L(f(x,\theta))}{\partial f_k}\, \frac{\partial^2 f_k(x,\theta)}{\partial \theta_i\, \partial \theta_j}.$$
In matrix form,
$$H = J_f^\top H_L\, J_f + \sum_{k} \frac{\partial L}{\partial f_k}\, H_{f_k}, \qquad G := J_f^\top H_L\, J_f.$$
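A minimal sketch (a tiny 2-3-1 tanh network with squared-error loss, my own example rather than the slides'): the GGN matrix $G = J_f^\top H_L J_f$ can be assembled directly from autodiff Jacobians, and compared with the full Hessian of the composed loss.

```python
# Sketch (toy network and data, assumed): form G = J_f^T H_L J_f with autodiff
# and compare it to the full Hessian H of theta -> L(f(x, theta)).
import jax
import jax.numpy as jnp

def f(theta, x):                                  # 2 -> 3 -> 1 network, flat params
    W1, b1 = theta[:6].reshape(3, 2), theta[6:9]
    W2, b2 = theta[9:12].reshape(1, 3), theta[12:]
    return W2 @ jnp.tanh(W1 @ x + b1) + b2

def loss_z(z, y):                                 # loss as a function of the output z
    return 0.5 * jnp.sum((z - y) ** 2)

theta = jax.random.normal(jax.random.PRNGKey(0), (13,))
x, y = jnp.array([0.5, -1.2]), jnp.array([2.0])

J = jax.jacfwd(f)(theta, x)                       # J_f = df/dtheta, shape (1, 13)
H_L = jax.hessian(loss_z)(f(theta, x), y)         # d^2 L / dz^2, shape (1, 1)
G = J.T @ H_L @ J                                 # generalized Gauss-Newton matrix

H = jax.hessian(lambda t: loss_z(f(t, x), y))(theta)

print(jnp.linalg.eigvalsh(G).min())               # >= 0 up to float error: G is PSD
print(jnp.linalg.eigvalsh(H).min())               # typically negative: H is not
```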

Second-order Optimization

Why $G$ instead of $H$? $G$ is positive semidefinite; $H$ is not. Let's expand $H$ further, assuming $f$ is a feedforward neural network with pre-activations $s_i$ and activations $a_i = \phi(s_i)$, $s_{i+1} = W_{i+1} a_i$:
$$H - G = \sum_{i=1}^{l} \sum_{j=1}^{m_i} (\nabla_{a_i} L)_j\; J_{s_i}^\top\, H\!\left[\phi(s_i)_j\right] J_{s_i}.$$
The remaining term is the sum of curvature terms coming from each intermediate activation. These curvature terms are subject to more frequent change. In ReLU networks, $H[\phi(s_i)] = 0$ almost everywhere.

F and G

Define the conditional distribution $r$ so that the model's predictive distribution is $p(y; x, \theta) = r(y \mid z)$ with $z = f(x, \theta)$, i.e. $r(y \mid f(x, \theta))$. We have
$$\nabla_\theta \log p(y; x, \theta) = J_f^\top\, \nabla_z \log r(y \mid z),$$
so
$$F = \mathbb{E}_x\!\left[\nabla_\theta \log p(y; x, \theta)\, \nabla_\theta \log p(y; x, \theta)^\top\right]
= \mathbb{E}_x\!\left[J_f^\top\, \nabla_z \log r(y \mid z)\, \nabla_z \log r(y \mid z)^\top J_f\right]
= \mathbb{E}_x\!\left[J_f^\top F_R\, J_f\right]$$
(with the expectation over $y \sim r(\cdot \mid z)$ left implicit), while
$$G = \mathbb{E}_x\!\left[J_f^\top H_L\, J_f\right].$$
Therefore $F = G$ when $F_R = H_L$.


F and G

We know that $F = G$ when $F_R = H_L$. When does this occur? Let $L(y, z) = -\log r(y \mid z)$. Then
$$F_R = \mathbb{E}_{R_{y \mid f(x,\theta)}}\!\left[-H_{\log r}\right], \qquad H_L = \mathbb{E}_{(x,y)}\!\left[-H_{\log r}\right],$$
and $F_R = H_L$ holds when
$$\mathbb{E}_{R_{y \mid f(x,\theta)}}\!\left[H_{\log r}\right] = \mathbb{E}_{(x,y)}\!\left[H_{\log r}\right].$$
This holds when $r(y \mid z)$ is an exponential family with natural parameter $z$,
$$\log r(y \mid z) = z^\top T(y) - \log Z(z),$$
since in this case the Hessian $H_{\log r} = -\nabla_z^2 \log Z(z)$ does not depend on $y$. Most commonly used losses satisfy this property, including the MSE loss and the cross-entropy loss.
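To make the exponential-family case concrete, here is a sketch (a single input and a tiny 3-class linear softmax model of my own): because the Hessian of $-\log\mathrm{softmax}$ with respect to the logits is $\mathrm{diag}(p) - pp^\top$ regardless of $y$, the Fisher matrix equals the Gauss-Newton matrix exactly.

```python
# Sketch (single input, toy linear softmax model, assumed values): for the
# cross-entropy loss, the Hessian in the logits is diag(p) - p p^T and does
# not depend on y, so the Fisher matrix F equals the Gauss-Newton matrix G.
import jax
import jax.numpy as jnp

def logits(theta, x):                              # z = f(x, theta), 3 classes
    return theta.reshape(3, 2) @ x

def nll(theta, x, y):                              # L(y, z) = -log r(y | z)
    return -jax.nn.log_softmax(logits(theta, x))[y]

theta = jax.random.normal(jax.random.PRNGKey(1), (6,))
x = jnp.array([0.7, -0.3])

p = jax.nn.softmax(logits(theta, x))
J = jax.jacfwd(logits)(theta, x)                   # J_f, shape (3, 6)
H_L = jnp.diag(p) - jnp.outer(p, p)                # Hessian of the loss in z
G = J.T @ H_L @ J

# Fisher: average the score outer product over y drawn from the model itself.
score = jnp.stack([jax.grad(nll)(theta, x, y) for y in range(3)])
F = (p[:, None, None] * score[:, :, None] * score[:, None, :]).sum(0)

print(jnp.abs(F - G).max())                        # ~ 0 (up to float error)
```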

Summary

Roughly speaking, the natural gradient is the direction of steepest descent of the loss in function space rather than in parameter space. The natural gradient is invariant under reparameterization. For most neural networks of interest, natural gradient descent is identical to a second-order method built on the generalized Gauss-Newton (GGN) matrix.

References I

[1] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 1998.
[2] James Martens. Deep learning via Hessian-free optimization. In Proceedings of the International Conference on Machine Learning (ICML), 2010.
[3] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2017.
[4] James Martens and Roger B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
[5] Hyeyoung Park, Shun-ichi Amari, and Kenji Fukumizu. Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 2000.

References II

[6] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[7] Oriol Vinyals and Daniel Povey. Krylov subspace descent for deep learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

Thank You
