Hamiltonian Descent Methods. Chris J. Maddison (1,2), with Daniel Paulin (1), Yee Whye Teh (1,2), Brendan O'Donoghue (2), Arnaud Doucet (1). (1) Department of Statistics, University of Oxford; (2) DeepMind, London, UK.
The problem. Unconstrained minimization of a differentiable f : ℝ^d → ℝ, x* = arg min_{x ∈ ℝ^d} f(x). This talk: convex f. Paper: also briefly considers non-convex f.
Optimization and Machine Learning. There is an imbalance in our pipelines: we spend our time designing models, but success is constrained by the optimizer. Have we discovered all the useful optimizers? If there's any doubt that optimization is a bottleneck for neural nets, consider how many architectural innovations were ways to get SGD to work better.
Optimization and Computer Science. The computational complexity classes of convex optimization are characterized by the information required of f [7]. 0th-order: local black-box evaluation of f(x). 1st-order: f(x) and the gradient ∇f(x) = (∂f(x)/∂x_(n))_n. 2nd-order: f(x), ∇f(x), and the Hessian ∇²f(x) = (∂²f(x)/∂x_(n)∂x_(m))_{n,m}.
Optimization and Computer Science. Study the rate of convergence of iterative methods. [Plot: log(f(x_i) − f(x*)) vs. iteration i, comparing sub-linear, linear, and super-linear rates.] Distinguish between fast linear and slow sub-linear convergence.
Gradient descent. E.g., gradient descent with step size ɛ > 0 is a first-order method, iterating x_{i+1} = x_i − ɛ∇f(x_i). [Plot: iterates x_0, x_1, … in the (x_(1), x_(2)) plane.]
When is gradient descent fast? f ∈ C² is strongly convex & smooth iff there exist µ, L > 0 such that for all x ∈ ℝ^d, µI ⪯ ∇²f(x) ⪯ LI. Gradient descent on smooth & strongly convex f with ɛ = 1/L has fast linear convergence, f(x_i) − f(x*) ≤ O((1 − µ/L)^i).
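A minimal sketch of this rate; the quadratic, constants µ = 1, L = 10, and starting point below are illustrative choices of mine, not from the talk:

```python
import numpy as np

# Illustrative strongly convex & smooth quadratic: f(x) = (L*x1^2 + mu*x2^2)/2,
# so mu*I <= Hessian <= L*I with mu = 1, L = 10.
mu, L = 1.0, 10.0

def f(x):
    return 0.5 * (L * x[0] ** 2 + mu * x[1] ** 2)

def grad(x):
    return np.array([L * x[0], mu * x[1]])

x = np.array([1.0, 1.0])
eps = 1.0 / L                  # step size 1/L from the slide
for _ in range(50):
    x = x - eps * grad(x)

# The error contracts by (1 - mu/L) per step in the slow coordinate,
# i.e. linear convergence.
print(f(x))
```

With ɛ = 1/L, the stiff coordinate is solved in one step and the slow coordinate contracts geometrically at rate 1 − µ/L = 0.9 per iteration.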
Smoothness & strong convexity important? Lower bound. Nemirovski & Yudin [7] show: for every iterative first-order method and iteration i, there exists a smooth convex f such that convergence is slow, f(x_i) − f(x*) ≥ Ω(i^{−2}). A similar result holds for non-smooth strongly convex f.
Summary so far. ∇²f(x) bounded above and below by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. [Plot: f(x) with curvature bounds L and µ.]
Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
Power functions. Power functions are useful as a study case of idealized convex functions: f(x) = |x|^b / b, b ≥ 1, x ∈ ℝ. [Plots: f(x) for b = 4/3, b = 2, b = 4.]
Power functions. Smooth & strongly convex iff b = 2. Łojasiewicz inequality [4]: real analytic functions can be bounded by power functions at their zero locus. If f : ℝ^d → ℝ is real analytic & convex with unique minimum x*, then for every compact K ⊆ ℝ^d there exist b ≥ 1 and µ > 0 such that for all x ∈ K, f(x) − f(x*) ≥ (µ/b)‖x − x*‖₂^b. In general, we don't know b.
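As a quick sanity check of such a power lower bound (my own illustrative example, not from the talk): f(x) = cosh(x) − 1 is real analytic and convex with minimum x* = 0, and from its power series it satisfies the bound with b = 2, µ = 1:

```python
import numpy as np

# f(x) = cosh(x) - 1 is real analytic, convex, minimized at x* = 0.
# Claimed power lower bound with b = 2, mu = 1:
#     f(x) - f(0) >= (1/2) * |x|^2,
# which follows from cosh(x) = 1 + x^2/2 + x^4/24 + ...
xs = np.linspace(-3.0, 3.0, 10001)
gap = (np.cosh(xs) - 1.0) - 0.5 * xs ** 2
print(gap.min())  # non-negative (up to rounding) everywhere on the grid
```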
Continuous limit of optimization algorithms. To study properties of optimizers, consider the limit ɛ → 0. [Plots: iterates approaching a continuous path as ɛ shrinks.] E.g., gradient descent iterates approximate the solution to the gradient flow, ẋ_t = −∇f(x_t), with x_0 ∈ ℝ^d, t ≥ 0.
Continuous limit of optimization algorithms. Fundamental properties are revealed by studying solutions x_t : [0, ∞) → ℝ^d of the gradient flow, e.g., it is a descent method: d/dt f(x_t) = ⟨∇f(x_t), ẋ_t⟩ = −‖∇f(x_t)‖² ≤ 0.
Gradient descent on power functions. For f(x) = |x|^b / b, we have d/dt f(x_t) = −b|x_t|^{b−2} f(x_t), so f(x_t) = f(x_0) exp(−b ∫₀ᵗ |x_s|^{b−2} ds). Two regimes in b for the rate of convergence: 1 < b ≤ 2: f(x_t) ≤ O(exp(−λt)); b > 2: f(x_t) ≥ Ω(t^{−b/(b−2)}).
Gradient descent on power functions. [Plot: continuous-time gradient descent on f(x) = |x|^b / b; log f(x_t) vs. time t for b = 4, b = 2, b = 4/3.]
Gradient descent on power functions. Gradient descent with step size ɛ > 0, x_{i+1} = x_i(1 − ɛ|x_i|^{b−2}), doesn't converge for b < 2, as |x_i|^{b−2} explodes near the minimum. [Plot: log f(x_i) vs. iteration i for b = 4/3, b = 2, b = 4.]
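This behaviour is easy to reproduce; the step size ɛ = 0.1, starting point, and iteration count below are my own illustrative choices:

```python
import numpy as np

def gd_on_power(b, eps=0.1, x0=1.0, iters=1000):
    """Gradient descent on f(x) = |x|^b / b, where f'(x) = sign(x)|x|^(b-1)."""
    x = x0
    for _ in range(iters):
        x = x - eps * np.sign(x) * abs(x) ** (b - 1)
    return x

# b = 4: sub-linear but steady progress toward the minimum at 0.
x_flat = gd_on_power(b=4)
# b = 4/3: |x|^(b-2) explodes near 0, so the iterates overshoot and
# settle into an oscillation around the minimum instead of converging.
x_sharp = gd_on_power(b=4 / 3)
print(abs(x_flat), abs(x_sharp))
```

For b = 4/3 the iterates get trapped in a two-cycle at the scale where the step ɛ|x|^{1/3} matches 2|x|, so the error stalls at a fixed level no matter how long you run.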
Gradient descent on power functions. Summary of gradient descent with fixed ɛ on power functions: 1 < b < 2, super-linear in continuous time; b = 2, linear in discrete time; b > 2, sub-linear in continuous time. This mirrors the lower bounds, although specialized methods can do better in this case.
Summary so far. ∇²f(x) bounded above and below by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. Power functions serve as a sandbox test case for optimization, mirroring the lower bound results.
Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
The question. What can be done using the first-order computation of two functions f, k : ℝ^d → ℝ? That is, f(x), ∇f(x), k(p), ∇k(p). We require k(p), ∇k(p) cheap to compute (e.g., O(d)) to avoid cheating.
Proposed methods & key contributions. Our methods generalize the momentum method [10] to include a non-standard kinetic energy k; we call them Hamiltonian descent methods. Linear rates are possible for convex functions f that are not smooth & strongly convex. Convergence theory in continuous & discrete time.
Gradient descent with momentum. Polyak's heavy ball [10] with ɛ, γ > 0 iterates p_{i+1} = −ɛ∇f(x_i) + (1 − ɛγ)p_i, x_{i+1} = x_i + ɛp_{i+1}. Persistent motion helps in narrow valleys. [Plot: heavy ball vs. gradient descent iterates in a narrow valley.]
Gradient descent with momentum. Continuous ɛ → 0 limit of Polyak's heavy ball. Polyak's heavy ball: x_{i+1} = x_i + ɛp_{i+1}, p_{i+1} = −ɛ∇f(x_i) + (1 − ɛγ)p_i. Continuous heavy ball: ẋ_t = p_t, ṗ_t = −∇f(x_t) − γp_t.
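A minimal sketch of the heavy-ball recursion on an ill-conditioned quadratic; the problem, step size, and friction are my own illustrative choices:

```python
import numpy as np

# Illustrative narrow valley: f(x) = (10*x1^2 + x2^2)/2.
def f(x):
    return 0.5 * (10.0 * x[0] ** 2 + x[1] ** 2)

def grad(x):
    return np.array([10.0 * x[0], x[1]])

eps, gamma = 0.1, 1.0
x = np.array([1.0, 1.0])
p = np.zeros(2)
for _ in range(2000):
    p = -eps * grad(x) + (1.0 - eps * gamma) * p  # momentum update
    x = x + eps * p                               # position update
print(f(x))
```

The persistent momentum lets both the stiff and the flat coordinate contract geometrically at a comparable rate, rather than the flat coordinate dictating the speed as in plain gradient descent.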
Hamiltonian descent methods. Generalize the position update of Polyak's heavy ball. Continuous heavy ball: ẋ_t = p_t, ṗ_t = −∇f(x_t) − γp_t. Continuous Hamiltonian descent: ẋ_t = ∇k(p_t), ṗ_t = −∇f(x_t) − γp_t. Also called a conformal Hamiltonian system [6].
Hamiltonian descent methods. Def. In physics the total energy or Hamiltonian is defined as H(x, p) = f(x) − f(x*) + k(p). If k is strictly convex with minimum k(0) = 0, then solutions of conformal Hamiltonian systems descend the Hamiltonian: d/dt H(x_t, p_t) = −γ⟨∇k(p_t), p_t⟩ ≤ 0.
Hamiltonian descent methods. Dual views on the k and f relationship: given f, design k for fast convergence? Given k, on which class of f is convergence fast? The class of smooth & strongly convex f corresponding to quadratic k(p) = ⟨p, p⟩/2 is not an accident! Develop intuition via one-dimensional power functions. Let φ_a(t) = t^a / a, and take f(x) = φ_b(|x|), k(p) = φ_a(|p|).
Hamiltonian descent on power functions. The continuous system becomes ẋ_t = sgn(p_t)|p_t|^{a−1}, ṗ_t = −sgn(x_t)|x_t|^{b−1} − γp_t. [Phase portraits: momentum p vs. position x.]
Hamiltonian descent on power functions. [Phase portrait: solutions with a = 2 and b = 2, spiralling into the minimum in the (x, p) plane.]
Hamiltonian descent on power functions. The worst case is x_t & p_t both small. To escape, we want along p_t that d/dt k(p_t) = −⟨∇k(p_t), ∇f(x_t) + γp_t⟩ ≤ −C k(p_t), where x_t = ∫∇k(p_s) ds. [Plots: k(p) and f(x).] I.e., ∇k(p) ≈ (∇f)^{−1}(p).
Hamiltonian descent on power functions. For power functions, this is 1/a + 1/b ≤ 1. [Diagram: (b, a) region with f(x) = |x|^b/b, k(p) = |p|^a/a; linear convergence when 1/a + 1/b ≤ 1, sub-linear otherwise.] We show linear convergence in continuous time iff 1/a + 1/b ≤ 1.
Hamiltonian descent on power functions. [Phase portrait: solutions with a = 2 and b = 8 (here 1/a + 1/b < 1).]
Hamiltonian descent on power functions. We study three fixed-ɛ discretizations; e.g., the first explicit method is (p_{i+1} − p_i)/ɛ = −∇f(x_i) − γp_{i+1}, (x_{i+1} − x_i)/ɛ = ∇k(p_{i+1}). If k(p) = φ_a(|p|), all discretizations require L > 0 such that for all x ∈ ℝ, |f′(x)|^a / a ≤ L(f(x) − f(x*)). If k(p) = φ_a(|p|) and f(x) = φ_b(|x|), this is satisfied if 1/a + 1/b = 1.
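A sketch of this first explicit discretization on the conjugate pairing f(x) = x⁴/4, k(p) = 3|p|^{4/3}/4 (so 1/a + 1/b = 3/4 + 1/4 = 1); the step size ɛ and friction γ below are my own illustrative choices, not tuned values from the paper:

```python
import numpy as np

# f(x) = x^4 / 4 (b = 4), k(p) = 3|p|^(4/3) / 4 (a = 4/3): 1/a + 1/b = 1.
def grad_f(x):
    return x ** 3

def grad_k(p):
    return np.sign(p) * abs(p) ** (1.0 / 3.0)

eps, gamma = 0.05, 0.9   # illustrative choices
x, p = 1.0, 0.0
H0 = 1.0 / 4.0           # initial Hamiltonian: f(1) + k(0) = 1/4
for _ in range(5000):
    # First explicit method: the friction term is treated implicitly,
    # p_{i+1}(1 + eps*gamma) = p_i - eps * grad_f(x_i).
    p = (p - eps * grad_f(x)) / (1.0 + eps * gamma)
    x = x + eps * grad_k(p)

H = x ** 4 / 4.0 + 3.0 * abs(p) ** (4.0 / 3.0) / 4.0
print(H)  # far below the initial H0 = 0.25
```

Note the position step ɛ|p|^{1/3} shrinks in proportion to |x| near the minimum, which is what allows a fixed ɛ to give linear convergence here, unlike gradient descent on the same f.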
Hamiltonian descent on power functions. [Diagram: (b, a) region for f(x) = |x|^b/b, k(p) = |p|^a/a, showing linear convergence of the 1st and 2nd explicit methods along 1/a + 1/b = 1, sub-linear convergence elsewhere, and the quadratic point a = b = 2 suitable for strongly convex and smooth f.] Linear convergence of fixed-ɛ discretizations if 1/a + 1/b = 1.
Hamiltonian descent on power functions. Generalize smoothness & strong convexity to power growth! [Plots: f(x) for b = 4/3, b = 2, b = 4.] Can deal with second derivatives that shrink or explode.
Summary so far. ∇²f(x) bounded above and below by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. Power functions serve as a sandbox test case for optimization, mirroring the lower bound results. Hamiltonian descent can cope with ∇²f(x) shrinking or exploding.
Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.
Convex Conjugate. Def. Given a convex function h : ℝ^d → ℝ ∪ {∞}, define the convex conjugate h* : ℝ^d → ℝ ∪ {∞}, h*(p) = sup_{x ∈ ℝ^d} ⟨x, p⟩ − h(x). E.g., h(x) = ‖x‖^b / b ⟹ h*(p) = ‖p‖^a / a with 1/a + 1/b = 1; h(x) = ⟨x, Ax⟩/2 ⟹ h*(p) = ⟨p, A⁻¹p⟩/2.
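A quick numeric sanity check of the first conjugate pair in one dimension (the test point p = 2 and grid are my own):

```python
import numpy as np

# Conjugate pair: h(x) = |x|^4 / 4 should give
# h*(p) = |p|^(4/3) / (4/3) = 3|p|^(4/3) / 4, since 1/(4/3) + 1/4 = 1.
p = 2.0
xs = np.linspace(-3.0, 3.0, 60001)
conj_numeric = np.max(xs * p - np.abs(xs) ** 4 / 4.0)  # sup over a fine grid
conj_formula = 3.0 * abs(p) ** (4.0 / 3.0) / 4.0
print(conj_numeric, conj_formula)  # the two values agree closely
```

The supremum is attained at x = p^{1/3} ≈ 1.26, well inside the grid, so the grid maximum matches the closed form to high accuracy.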
Choosing k. Given f, design k for fast convergence? A good choice of k(p) is related to the convex conjugate of f_c(x) = f(x + x*) − f(x*). Assumption A. There exists α ∈ (0, 1] such that for all p ∈ ℝ^d, k(p) ≥ α max{f_c*(p), f_c*(−p)}.
Choosing k, continuous. Theorem. Given f differentiable and convex with unique minimum x*, k differentiable and strictly convex with unique minimum k(0) = 0, α satisfying Assumption A, and γ ∈ (0, 1). Let λ = γ(1 − γ)/4; then the solutions of the Hamiltonian descent system satisfy f(x_t) − f(x*) ≤ O(exp(−λαt)).
Choosing k, discrete. Assumption B. All discretizations require a first-order assumption: there exists C_{f,k} > 0 such that for all x, p ∈ ℝ^d, ⟨∇f(x), ∇k(p)⟩ ≤ C_{f,k} H(x, p).
Choosing k, discrete. Assumptions C xor D. Explicit discretizations require second-order assumptions on either f or k. Assumption C. There exists D_{f,k} > 0 such that for all x ∈ ℝ^d \ {x*} and p ∈ ℝ^d, f is twice continuously differentiable and ⟨∇k(p), ∇²f(x)∇k(p)⟩ ≤ D_{f,k} H(x, p). Assumption D. Switch f and k in Assumption C. Under such assumptions, there exists C > 0 such that the discretizations converge linearly for all ɛ ∈ (0, C], γ ∈ (0, 1].
Power Kinetic Energies. Given k, on which class of f is convergence fast? Def. Given a, A ∈ [1, ∞), define for t ∈ [0, ∞), φ_a^A(t) = (1/A)((t^a + 1)^{A/a} − 1). φ_a^A behaves like φ_A for large t and like φ_a for small t. Conditions on f are given in terms of a norm ‖·‖ and k as k(p) = φ_a^A(‖p‖_*), where the dual norm is ‖p‖_* = sup{⟨p, x⟩ : ‖x‖ ≤ 1}.
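The two regimes of φ_a^A can be checked numerically; the relativistic case a = 2, A = 1 is used here as an example:

```python
def phi(t, a, A):
    """Power kinetic energy building block: phi_a^A(t) = ((t^a + 1)^(A/a) - 1) / A."""
    return ((t ** a + 1.0) ** (A / a) - 1.0) / A

# a = 2, A = 1 gives the relativistic energy sqrt(t^2 + 1) - 1.
# Small t: behaves like phi_a(t) = t^a / a = t^2 / 2.
small = phi(0.01, a=2, A=1)
# Large t: behaves like phi_A(t) = t^A / A = t.
large = phi(100.0, a=2, A=1)
print(small, large)  # approx 5e-5 and approx 100
```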
Power Kinetic Energies. [Plots: φ_a^A(|x|) for a ∈ {8/7, 2, 8} and A ∈ {8/7, 2, 8}.]
Power Kinetic Energies. Let b = a/(a − 1) and B = A/(A − 1). Assumption A is implied by: there exists µ > 0 with f(x) − f(x*) ≥ µ φ_b^B(‖x − x*‖). Implied by strong convexity for b = B = 2.
Power Kinetic Energies. Assumption B is implied by: there exists L > 0 with φ_a^A(‖∇f(x)‖_*) ≤ L(f(x) − f(x*)). Implied by smoothness for a = A = 2.
Power Kinetic Energies. Assumption C for b, B ≥ 2 is implied by: f twice continuously differentiable and there exists L > 0 such that for all x ∈ ℝ^d \ {x*}, ‖∇²f(x)‖ ≤ L ∇²φ_b^B(‖x − x*‖). Equivalent to smoothness for b = B = 2. Assumption D relies on smoothness of k, so it requires k to be twice continuously differentiable.
Simulations, f(x) = φ_4(|x|). [Plots: objective log f(x_t), log f(x_i) and solutions/iterates x_t, x_i over time, for f(x) = x⁴/4 with k(p) = 3|p|^{4/3}/4 (top) and k(p) = p²/2 (bottom).]
Simulations, f(x) = φ_4(|x|). [Plots: objective log f(x_t), log f(x_i) and solutions/iterates x_t, x_i for f(x) = x⁴/4 with k(p) = (7/8)|p|^{8/7}.]
Adaptive rates. α may improve as (x_i, p_i) → (x*, 0). To capture this, our analysis is extended to k(p) ≥ α(k(p)) max{f_c*(p), f_c*(−p)} for α : [0, ∞) → (0, 1] differentiable, convex, non-increasing. This allows us to provide a position-independent step-size choice with naturally adaptive rates for B ≥ A/(A − 1).
Relativistic Kinetic Energy. Lu et al. [5] study the relativistic kinetic energy for sampling: k(p) = √(‖p‖₂² + 1) − 1, ∇k(p) = p / √(‖p‖₂² + 1). ‖∇k(p)‖₂ is bounded, which improves stability, similar to gradient clipping [9], Adam [3], RMSProp [2], AdaGrad [1].
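The bounded-velocity property is easy to see numerically (the momentum scales below are illustrative values of mine):

```python
import numpy as np

def grad_k_rel(p):
    """Gradient of the relativistic kinetic energy k(p) = sqrt(||p||^2 + 1) - 1."""
    return p / np.sqrt(np.dot(p, p) + 1.0)

# However large the momentum grows (e.g., after a huge gradient), the
# position-update speed ||grad_k(p)|| stays strictly below 1 --
# a built-in clipping effect.
for scale in [0.1, 10.0, 1e6]:
    p = scale * np.ones(3)
    print(np.linalg.norm(grad_k_rel(p)))
```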
Relativistic Kinetic Energy. The relativistic energy is k(p) = φ₂¹(‖p‖₂). Suitable for strongly convex, but possibly non-smooth, f. Has adaptive rates, α(y) ∝ (y + 1)^{1/B − 1}.
Simulations, f(x) = φ₂⁸(|x|). [Plots: log f(x_i) and iterates x_i over 1000 iterations, comparing gradient descent, k(p) = φ₂¹(|p|), and k(p) = φ₂^{8/7}(|p|).]
Conclusions. Theoretical: lower bounds assuming two first-order oracles? Optimal γ, ɛ?
Conclusions. Methodological: kinetic energies k for specific problems of interest? Constrained optimization? The biggest limitation is that designing k requires knowledge of f near its minimum. Adaptive methods, e.g., [11]?
Thanks to you and my coauthors: Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, Arnaud Doucet.
[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[2] G. Hinton. Neural Networks for Machine Learning. url: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, 2014. Slides 26–31 of Lecture 6.
[3] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
[4] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87–89, 1963.
[5] X. Lu, V. Perrone, L. Hasenclever, Y. W. Teh, and S. Vollmer. Relativistic Monte Carlo. In Artificial Intelligence and Statistics, pages 1236–1245, 2017.
[6] R. McLachlan and M. Perlmutter. Conformal Hamiltonian systems. Journal of Geometry and Physics, 39(4):276–300, 2001.
[7] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.
[8] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[9] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
[10] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[11] V. Roulet and A. d'Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems, pages 1119–1129, 2017.