Hamiltonian Descent Methods
1 Hamiltonian Descent Methods. Chris J. Maddison^{1,2}, with Daniel Paulin^1, Yee Whye Teh^{1,2}, Brendan O'Donoghue^2, Arnaud Doucet^1. ^1 Department of Statistics, University of Oxford; ^2 DeepMind, London, UK.
2 The problem. Unconstrained minimization of a differentiable f : R^d → R,
x* = arg min_{x ∈ R^d} f(x).
This talk: convex f. Paper: also briefly considers non-convex f.
3 Optimization and Machine Learning. There is an imbalance in our pipelines: time is spent designing models, but success is constrained by the optimizer. Have we discovered all the useful optimizers? If there's any doubt that optimization is a bottleneck for neural nets, consider how many architectural innovations were ways to get SGD to work better.
4 Optimization and Computer Science. The computational complexity classes of convex optimization are characterized by the information required of f [7]. Local black-box evaluation of:
0th-order: f(x)
1st-order: f(x), ∇f(x) = (∂f(x)/∂x^(n))
2nd-order: f(x), ∇f(x), ∇²f(x) = (∂²f(x)/∂x^(n)∂x^(m))
5 Optimization and Computer Science. Study the rate of convergence of iterative methods. [Plot: log(f(x_i) − f(x*)) against iteration i, with sub-linear, linear, and super-linear curves.] Distinguish between fast linear and slow sub-linear convergence.
6 Gradient descent. E.g. gradient descent is a first-order method. Iterates with step size ε > 0:
x_{i+1} = x_i − ε∇f(x_i).
[Plot: gradient descent iterates x_0, x_1, ... on level sets in the (x^(1), x^(2)) plane.]
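The update on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not from the talk: the quadratic objective, step size, and iteration count are ad hoc choices.

```python
import numpy as np

# Sketch of the fixed-step gradient descent iteration from the slide,
#   x_{i+1} = x_i - eps * grad_f(x_i),
# on an illustrative quadratic f(x) = 0.5 * x^T A x (A and eps are
# hypothetical choices, not from the talk).

def gradient_descent(grad_f, x0, eps, n_iters):
    """Run fixed-step gradient descent and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - eps * grad_f(x)
    return x

A = np.array([[2.0, 0.0], [0.0, 10.0]])  # mildly ill-conditioned quadratic
grad_f = lambda x: A @ x                 # gradient of f(x) = 0.5 x^T A x
x_final = gradient_descent(grad_f, x0=[1.0, 1.0], eps=0.1, n_iters=200)
```

With eps = 1/L = 0.1 (L = 10 being the largest eigenvalue of A), the iterates contract geometrically toward the minimizer at the origin.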
7 When is gradient descent fast? f ∈ C² is strongly convex & smooth iff there exist µ, L > 0 such that for all x ∈ R^d,
µI ⪯ ∇²f(x) ⪯ LI.
Gradient descent on smooth & strongly convex f with ε = 1/L has fast linear convergence:
f(x_i) − f(x*) ≤ O((1 − µ/L)^i).
8 Smoothness & strong convexity important? Lower bound: Nemirovski & Yudin [7] show that for every iterative first-order method and iteration i, there is a smooth convex f such that convergence is slow,
f(x_i) − f(x*) ≥ Ω(i^{−2}).
A similar result holds for non-smooth strongly convex f.
9 Summary so far. ∇²f(x) bounded above and below by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. [Plot: f(x) squeezed between quadratics of curvature µ and L.]
10 Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
11 Power functions. Power functions are useful as a study case of idealized convex functions:
f(x) = |x|^b / b, b ≥ 1, x ∈ R.
[Plot: f(x) for b = 4/3, b = 2, b = 4.]
12 Power functions. Smooth & strongly convex iff b = 2. Łojasiewicz inequality [4]: real analytic functions can be bounded by power functions at their zero locus. If f : R^d → R is real analytic & convex with unique minimum x*, then for every compact K ⊆ R^d there exist b ≥ 1 and µ > 0 such that for all x ∈ K,
f(x) − f(x*) ≥ (µ/b)‖x − x*‖₂^b.
In general, we don't know b.
13 Continuous limit of optimization algorithms. To study properties of optimizers, consider ε → 0. [Plot: discrete iterates in the (x^(1), x^(2)) plane approaching a continuous path as ε shrinks.] E.g. gradient descent iterates approximate the solution to gradient flow,
ẋ_t = −∇f(x_t), with x_0 ∈ R^d, t ≥ 0.
14 Continuous limit of optimization algorithms. Fundamental properties are revealed by studying solutions x_t : [0, ∞) → R^d of gradient flow; e.g. it is a descent method:
d/dt f(x_t) = ⟨∇f(x_t), ẋ_t⟩ = −‖∇f(x_t)‖² ≤ 0.
15 Gradient descent on power functions. For f(x) = |x|^b / b, we have d/dt f(x_t) = −b|x_t|^{b−2} f(x_t), so
f(x_t) = f(x_0) exp(−b ∫₀ᵗ |x_s|^{b−2} ds).
Two regimes in b for the rate of convergence:
1 < b ≤ 2: f(x_t) ≤ O(exp(−λt))
b > 2: f(x_t) ≥ Ω(t^{−b/(b−2)})
16 Gradient descent on power functions. [Plot: continuous-time gradient descent on f(x) = |x|^b / b; log f(x_t) against time t for b = 4/3, b = 2, b = 4.]
17 Gradient descent on power functions. Gradient descent with step size ε > 0,
x_{i+1} = x_i (1 − ε|x_i|^{b−2}),
doesn't converge for b < 2, as |x_i|^{b−2} explodes near the minimum. [Plot: log f(x_i) against iteration i for b = 4/3, b = 2, b = 4.]
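The failure described on this slide is easy to reproduce numerically. A minimal sketch (parameter values are ad hoc, not from the talk): for b = 2 the iterates contract geometrically, while for b = 4/3 the factor |x_i|^{b−2} blows up near zero and the iterates stall at a step-size-dependent distance from the minimum.

```python
import numpy as np

# Fixed-step gradient descent on f(x) = |x|^b / b in one dimension,
#   x_{i+1} = x_i * (1 - eps * |x_i|^(b-2)).

def power_gd(b, x0=1.0, eps=0.1, n_iters=2000):
    """Return the iterate trajectory as an array."""
    xs = [x0]
    x = x0
    for _ in range(n_iters):
        if x == 0.0:          # guard: 0**negative would raise for b < 2
            break
        x = x * (1.0 - eps * abs(x) ** (b - 2.0))
        xs.append(x)
    return np.array(xs)

xs_b2 = power_gd(b=2.0)        # b = 2: contracts by (1 - eps) each step
xs_b43 = power_gd(b=4.0 / 3.0) # b = 4/3: oscillates, never reaches 0
```

For b = 4/3 the iterates end up bouncing in sign around a band of magnitude roughly (eps/2)^{3/2}, matching the slide's claim of non-convergence.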
18 Gradient descent on power functions. Summary of gradient descent with fixed ε on power functions:
1 < b < 2: super-linear in continuous time, but the fixed-step discretization diverges.
b = 2: linear in discrete time.
b > 2: sub-linear in continuous time.
This mirrors the lower bounds, although specialized methods can do better in this case.
19 Summary so far. ∇²f(x) bounded by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. Power functions serve as a sandbox test case for optimization; their behavior mirrors the lower bound results.
20 Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
21 The question. What can be done using the first-order computation of two functions f, k : R^d → R?
f(x), ∇f(x), k(p), ∇k(p).
k(p), ∇k(p) must be cheap to compute (e.g., O(d)) to avoid cheating.
22 Proposed methods & key contributions. Our methods generalize the momentum method [10] to include a non-standard kinetic energy k; we call them Hamiltonian descent methods. Linear rates are possible for convex functions f that are not smooth & strongly convex. Convergence theory in continuous & discrete time.
23 Gradient descent with momentum. Polyak's heavy ball [10] with ε, γ > 0:
p_{i+1} = −ε∇f(x_i) + (1 − εγ)p_i
x_{i+1} = x_i + εp_{i+1}
Persistent motion helps in narrow valleys. [Plot: heavy ball vs. gradient descent iterates in a narrow valley in the (x^(1), x^(2)) plane.]
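The two heavy-ball updates above translate directly into code. A minimal sketch on a hypothetical narrow quadratic valley (the objective and all parameter values are illustrative choices, not from the talk):

```python
import numpy as np

# Polyak's heavy-ball iteration from the slide:
#   p_{i+1} = -eps * grad_f(x_i) + (1 - eps*gamma) * p_i
#   x_{i+1} = x_i + eps * p_{i+1}
# run on f(x) = 0.5 * (x1^2 + 25 * x2^2), a narrow quadratic valley.

def heavy_ball(grad_f, x0, eps, gamma, n_iters):
    """Run the heavy-ball method and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    p = np.zeros_like(x)
    for _ in range(n_iters):
        p = -eps * grad_f(x) + (1.0 - eps * gamma) * p
        x = x + eps * p
    return x

grad_f = lambda x: np.array([x[0], 25.0 * x[1]])
x_hb = heavy_ball(grad_f, x0=[1.0, 1.0], eps=0.2, gamma=2.0, n_iters=500)
```

The momentum term (1 − εγ)p_i carries velocity across iterations, which is what lets the iterates keep moving along the valley floor while oscillations across it are damped.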
24 Gradient descent with momentum. Continuous ε → 0 limit of Polyak's heavy ball.
Polyak's heavy ball: p_{i+1} = −ε∇f(x_i) + (1 − εγ)p_i; x_{i+1} = x_i + εp_{i+1}.
Continuous heavy ball: ẋ_t = p_t; ṗ_t = −∇f(x_t) − γp_t.
25 Hamiltonian descent methods. Generalize the position update of Polyak's heavy ball.
Continuous heavy ball: ẋ_t = p_t; ṗ_t = −∇f(x_t) − γp_t.
Continuous Hamiltonian descent: ẋ_t = ∇k(p_t); ṗ_t = −∇f(x_t) − γp_t.
Also called a conformal Hamiltonian system [6].
26 Hamiltonian descent methods. Def. In physics the total energy or Hamiltonian is defined as
H(x, p) = f(x) − f(x*) + k(p).
If k is strictly convex with minimum k(0) = 0, then solutions of conformal Hamiltonian systems descend the Hamiltonian:
d/dt H(x_t, p_t) = −γ⟨∇k(p_t), p_t⟩ ≤ 0.
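The descent property on this slide can be checked numerically. A minimal sketch, with illustrative choices not from the talk: f(x) = x⁴/4 (so f(x*) = 0) and quadratic k(p) = p²/2, integrated with small explicit Euler steps as a stand-in for the exact flow (Euler only approximately preserves monotonicity, hence the small tolerance below).

```python
import numpy as np

# Integrate the conformal Hamiltonian system from the slide,
#   x' = grad_k(p),  p' = -grad_f(x) - gamma * p,
# and record H(x, p) = f(x) + k(p) along the trajectory.

gamma, dt = 1.0, 1e-4
f = lambda x: x ** 4 / 4.0
k = lambda p: p ** 2 / 2.0
grad_f = lambda x: x ** 3
grad_k = lambda p: p

x, p = 2.0, 0.0
H_vals = []
for _ in range(50_000):
    H_vals.append(f(x) + k(p))
    x, p = x + dt * grad_k(p), p + dt * (-grad_f(x) - gamma * p)
H_vals = np.array(H_vals)
```

H decreases along the whole trajectory (up to Euler discretization error), even though this particular pairing (a = 2, b = 4) only yields a sub-linear rate per the later slides.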
27 Hamiltonian descent methods. Dual views on the k and f relationship: Given f, design k for fast convergence? Given k, on which class of f is convergence fast? That the class of smooth & strongly convex f corresponds to quadratic k(p) = ⟨p, p⟩/2 is not an accident! Develop intuition via one-dimensional power functions. Let ϕ_a(t) = t^a / a, and take
f(x) = ϕ_b(|x|), k(p) = ϕ_a(|p|).
28 Hamiltonian descent on power functions. The continuous system becomes
ẋ_t = sgn(p_t)|p_t|^{a−1}
ṗ_t = −sgn(x_t)|x_t|^{b−1} − γp_t.
[Plot: vector fields in the (position x, momentum p) plane.]
29 Hamiltonian descent on power functions. [Plot: solutions in the (position x, momentum p) plane with a = 2 and b = 2.]
30 Hamiltonian descent on power functions. The worst case is x_t & p_t both small. To escape, we want along p_t that
d/dt k(p_t) = −⟨∇k(p_t), ∇f(∫∇k(p_t)dt) + γp_t⟩ ≤ −C k(p_t),
[Plot: k(p) and f(x) side by side.] i.e. ∇k(p) ≈ (∇f)^{−1}(p).
31 Hamiltonian descent on power functions. For power functions, this is 1/a + 1/b ≥ 1. [Plot: the (b, a) plane for f(x) = |x|^b / b, k(p) = |p|^a / a, split into linear-convergence and sub-linear-convergence regions.] We show linear convergence in continuous time iff 1/a + 1/b ≥ 1.
32 Hamiltonian descent on power functions. [Plot: solutions in the (position x, momentum p) plane with a = 2 and b = 8, where 1/a + 1/b < 1.]
33 Hamiltonian descent on power functions. We study three fixed-ε discretizations; e.g. the first explicit one is
(p_{i+1} − p_i)/ε = −∇f(x_i) − γp_{i+1}
(x_{i+1} − x_i)/ε = ∇k(p_{i+1}).
If k(p) = ϕ_a(|p|), all discretizations require L > 0 such that for all x ∈ R,
|f′(x)|^a / a ≤ L(f(x) − f(x*)).
If k(p) = ϕ_a(|p|) and f(x) = ϕ_b(|x|), this is satisfied if 1/a + 1/b = 1.
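The first explicit discretization above can be sketched directly: the damping term is implicit in p_{i+1}, but it appears linearly, so the p-update solves in closed form as p_{i+1} = (p_i − ε∇f(x_i)) / (1 + εγ). A minimal sketch on a matched pair with 1/a + 1/b = 1 (step size, damping, and starting point are ad hoc choices, not from the talk):

```python
import numpy as np

# First explicit discretization from the slide,
#   (p_{i+1} - p_i)/eps = -grad_f(x_i) - gamma * p_{i+1}
#   (x_{i+1} - x_i)/eps = grad_k(p_{i+1}),
# with f(x) = |x|^4/4 (b = 4) and k(p) = |p|^(4/3)/(4/3) (a = 4/3),
# so that 1/a + 1/b = 3/4 + 1/4 = 1.

b = 4.0
a = 4.0 / 3.0
grad_f = lambda x: np.sign(x) * np.abs(x) ** (b - 1.0)
grad_k = lambda p: np.sign(p) * np.abs(p) ** (a - 1.0)

eps, gamma = 0.05, 0.9
x, p = 3.0, 0.0
for _ in range(20_000):
    p = (p - eps * grad_f(x)) / (1.0 + eps * gamma)  # implicit damping, solved
    x = x + eps * grad_k(p)                          # explicit position update
```

Note the role of ∇k near the origin: |p|^{1/3} is much larger than |p| for small momenta, which is what keeps the position moving on the flat bottom of the quartic and avoids gradient descent's sub-linear stall.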
34 Hamiltonian descent on power functions. [Plot: the (b, a) plane for f(x) = |x|^b / b, k(p) = |p|^a / a, showing the sub-linear region, the linear convergence of the 1st and 2nd explicit methods, and the quadratic point suitable for strongly convex and smooth f.] Linear convergence of fixed-ε discretizations if 1/a + 1/b = 1.
35 Hamiltonian descent on power functions. Generalize smoothness & strong convexity to power growth! [Plot: f′(x) for b = 4/3, b = 2, b = 4.] We can deal with second derivatives that shrink or explode.
36 Summary so far. ∇²f(x) bounded by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. Power functions serve as a sandbox test case for optimization and mirror the lower bound results. Hamiltonian descent can cope with ∇²f(x) shrinking or exploding.
37 Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
38 Convex Conjugate. Def. Given a convex function h : R^d → R ∪ {∞}, define the convex conjugate h* : R^d → R ∪ {∞},
h*(p) = sup_{x ∈ R^d} ⟨x, p⟩ − h(x).
E.g. h(x) = ‖x‖^b / b ⟹ h*(p) = ‖p‖^a / a, where 1/a + 1/b = 1;
h(x) = ½⟨x, Ax⟩ ⟹ h*(p) = ½⟨p, A^{−1}p⟩.
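The power-function conjugate pair on this slide can be verified numerically by approximating the sup over a grid. A minimal one-dimensional sketch (b = 4, hence a = 4/3; the grid bounds are an ad hoc choice wide enough to contain the maximizer):

```python
import numpy as np

# Check that the convex conjugate of h(x) = |x|^b / b is h*(p) = |p|^a / a
# with 1/a + 1/b = 1, by brute-force evaluation of
#   h*(p) = sup_x <x, p> - h(x)
# over a fine grid in one dimension.

b = 4.0
a = b / (b - 1.0)              # conjugate exponent: 1/a + 1/b = 1
xs = np.linspace(-5.0, 5.0, 200_001)

def conjugate(p):
    """Grid approximation of sup_x (x*p - |x|^b / b)."""
    return np.max(xs * p - np.abs(xs) ** b / b)

errors = [abs(conjugate(p) - abs(p) ** a / a) for p in (0.5, 2.0, -1.5)]
```

At p = 2, for instance, the maximizer is x = 2^{1/3} and both sides equal (3/4)·2^{4/3}, so the grid value matches the closed form to high accuracy.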
39 Choosing k. Given f, design k for fast convergence? A good choice of k(p) is related to the convex conjugate of f_c(x) = f(x + x*) − f(x*).
Assumption A. There exists α ∈ (0, 1] such that for all p ∈ R^d,
k(p) ≥ α max{f_c*(p), f_c*(−p)}.
40 Choosing k continuous. Theorem. Given f differentiable and convex with unique minimum x*, k differentiable and strictly convex with unique minimum k(0) = 0, α satisfying Assumption A, and γ ∈ (0, 1), let λ = γ(1 − γ)/4. Then the solutions of the Hamiltonian descent system satisfy
f(x_t) − f(x*) ≤ O(exp(−λαt)).
41 Choosing k discrete. Assumption B. All discretizations require a first-order assumption: there exists C_{f,k} > 0 such that for all x, p ∈ R^d,
⟨∇f(x), ∇k(p)⟩ ≤ C_{f,k} H(x, p).
42 Choosing k discrete. Assumptions C or D. Explicit discretizations require second-order assumptions on either f or k.
Assumption C. There exists D_{f,k} > 0 such that for all x ∈ R^d \ {x*} and p ∈ R^d, f is twice continuously differentiable and
⟨∇k(p), ∇²f(x)∇k(p)⟩ ≤ D_{f,k} H(x, p).
Assumption D. Switch f and k in Assumption C.
Under such assumptions, there exists C > 0 such that the discretizations converge linearly for all ε ∈ (0, C], γ ∈ (0, 1].
43 Power Kinetic Energies. Given k, on which class of f is convergence fast? Def. Given a, A ∈ [1, ∞), define for t ∈ [0, ∞)
ϕ_a^A(t) = ((t^a + 1)^{A/a} − 1)/A.
ϕ_a^A behaves like ϕ_A for large t and like ϕ_a for small t. Conditions on f are given in terms of a norm ‖·‖ and k as
k(p) = ϕ_a^A(‖p‖_*), where ‖p‖_* = sup_{‖x‖ ≤ 1} ⟨p, x⟩.
44 Power Kinetic Energies. [Plot: ϕ_a^A(|x|) for a = 8/7, 2, 8 and A = 8/7, 2, 8, showing the crossover between the small-|x| power a and the large-|x| power A.]
45 Power Kinetic Energies. Let b = a/(a − 1) and B = A/(A − 1). Assumption A is implied by: there exists µ > 0 such that
f(x) − f(x*) ≥ µ ϕ_b^B(‖x − x*‖).
This is implied by strong convexity for b = B = 2.
46 Power Kinetic Energies. Assumption B is implied by: there exists L > 0 such that
ϕ_a^A(‖∇f(x)‖_*) ≤ L(f(x) − f(x*)).
This is implied by smoothness for a = A = 2.
47 Power Kinetic Energies. Assumption C for b, B ≥ 2 is implied by twice continuous differentiability of f together with: there exists L > 0 such that for all x ∈ R^d \ {x*},
∇²f(x) ⪯ L ∇²ϕ_b^B(‖x − x*‖).
This is equivalent to smoothness for b = B = 2. Assumption D relies on smoothness of k, so it requires twice continuous differentiability of ‖·‖_*.
48 Simulations, f(x) = ϕ_4(|x|). [Plots: objective log f(x_t), log f(x_i) and solution/iterate paths x_t, x_i against time, for k(p) = 3|p|^{4/3}/4 and for k(p) = |p|²/2.]
49 Simulations, f(x) = ϕ_4(|x|). [Plots: objective log f(x_t), log f(x_i) and solution/iterate paths x_t, x_i against time, for k(p) = 7|p|^{8/7}/8.]
50 Adaptive rates. α may improve as (x_i, p_i) → (x*, 0). To capture this, our analysis is extended to
k(p) ≥ α(k(p)) max{f_c*(p), f_c*(−p)}
for α : [0, ∞) → (0, 1] differentiable, convex, and non-increasing. This allows us to provide a position-independent step-size choice with naturally adaptive rates for B ≥ A/(A − 1).
51 Relativistic Kinetic Energy. Lu et al. [5] study the relativistic kinetic energy for sampling:
k(p) = (‖p‖₂² + 1)^{1/2} − 1, ∇k(p) = p / (‖p‖₂² + 1)^{1/2}.
‖∇k(p)‖₂ is bounded, which improves stability, similar to gradient clipping [9], Adam [3], RMSProp [2], AdaGrad [1].
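The boundedness claim on this slide is a one-liner to check: for the relativistic kinetic energy, ‖∇k(p)‖ = ‖p‖ / √(‖p‖² + 1) < 1 for every p, approaching 1 as ‖p‖ grows. A minimal sketch (the test momenta are arbitrary):

```python
import numpy as np

# Gradient of the relativistic kinetic energy from the slide,
#   k(p) = sqrt(||p||^2 + 1) - 1,
# whose norm is always strictly below 1 -- the bounded "velocity" that
# makes it behave like gradient clipping.

def grad_k_relativistic(p):
    """grad_k(p) = p / sqrt(||p||^2 + 1)."""
    p = np.asarray(p, dtype=float)
    return p / np.sqrt(np.dot(p, p) + 1.0)

norms = [np.linalg.norm(grad_k_relativistic(np.full(3, s)))
         for s in (0.1, 10.0, 1e6)]
```

For tiny momenta ∇k(p) ≈ p (quadratic behavior), while for huge momenta its norm saturates near 1, which is the stability mechanism the slide compares to clipping.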
52 Relativistic Kinetic Energy. The relativistic kinetic energy is k(p) = ϕ_2^1(‖p‖₂). Suitable for strongly convex, but possibly non-smooth, f. Has adaptive rates:
α(y) ∝ (y + 1)^{1/B − 1}.
53 Simulations, f(x) = ϕ_2^8(‖x‖₂). [Plot: log f(x_i) against iteration i, and iterate paths x_i, for gradient descent, k(p) = ϕ_2^1(‖p‖₂), and k(p) = ϕ_2^{8/7}(‖p‖₂).]
54 Conclusions. Theoretical: Lower bounds assuming two first-order oracles? Optimal γ, ε?
55 Conclusions. Methodological: Kinetic energies k for specific problems of interest? Constrained optimization? The biggest limitation is that designing k requires knowledge of f near its minimum. Adaptive methods, e.g. [11]?
56 Thanks to you and my coauthors: Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, Arnaud Doucet.
57 References
[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[2] G. Hinton. Neural Networks for Machine Learning. Slides of Lecture 6.
[3] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[4] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87–89, 1963.
[5] X. Lu, V. Perrone, L. Hasenclever, Y. W. Teh, and S. Vollmer. Relativistic Monte Carlo. In Artificial Intelligence and Statistics, 2017.
[6] R. McLachlan and M. Perlmutter. Conformal Hamiltonian systems. Journal of Geometry and Physics, 39(4), 2001.
[7] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
[8] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2004.
[9] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
[10] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[11] V. Roulet and A. d'Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems, 2017.
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationCoordinate Update Algorithm Short Course Proximal Operators and Algorithms
Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow
More informationDesign and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016
Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More information1 Sparsity and l 1 relaxation
6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the
More informationStochastic Gradient Descent: The Workhorse of Machine Learning. CS6787 Lecture 1 Fall 2017
Stochastic Gradient Descent: The Workhorse of Machine Learning CS6787 Lecture 1 Fall 2017 Fundamentals of Machine Learning? Machine Learning in Practice this course What s missing in the basic stuff? Efficiency!
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationminimize x subject to (x 2)(x 4) u,
Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationNormalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training
Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network raining Adams Wei Yu, Qihang Lin, Ruslan Salakhutdinov, and Jaime Carbonell School of Computer Science, Carnegie Mellon University
More informationProbabilistic Graphical Models
10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem
More informationCSC2541 Lecture 5 Natural Gradient
CSC2541 Lecture 5 Natural Gradient Roger Grosse Roger Grosse CSC2541 Lecture 5 Natural Gradient 1 / 12 Motivation Two classes of optimization procedures used throughout ML (stochastic) gradient descent,
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationDeep Learning & Artificial Intelligence WS 2018/2019
Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target
More informationAccelerate Subgradient Methods
Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationOptimizing CNNs. Timothy Dozat Stanford. Abstract. 1. Introduction. 2. Background Momentum
Optimizing CNNs Timothy Dozat Stanford tdozat@stanford.edu Abstract This work aims to explore the performance of a popular class of related optimization algorithms in the context of convolutional neural
More informationNegative Momentum for Improved Game Dynamics
Negative Momentum for Improved Game Dynamics Gauthier Gidel Reyhane Askari Hemmat Mohammad Pezeshki Gabriel Huang Rémi Lepriol Simon Lacoste-Julien Ioannis Mitliagkas Mila & DIRO, Université de Montréal
More informationErgodic Subgradient Descent
Ergodic Subgradient Descent John Duchi, Alekh Agarwal, Mikael Johansson, Michael Jordan University of California, Berkeley and Royal Institute of Technology (KTH), Sweden Allerton Conference, September
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More information