Hamiltonian Descent Methods

Hamiltonian Descent Methods. Chris J. Maddison (1,2), with Daniel Paulin (1), Yee Whye Teh (1,2), Brendan O'Donoghue (2), Arnaud Doucet (1). (1) Department of Statistics, University of Oxford. (2) DeepMind, London, UK.

The problem. Unconstrained minimization of a differentiable $f : \mathbb{R}^d \to \mathbb{R}$, $x^\star = \arg\min_{x \in \mathbb{R}^d} f(x)$. This talk: convex $f$. Paper: also briefly considers non-convex $f$.

Optimization and Machine Learning. There is an imbalance in our pipelines: we spend our time designing models, but our success is constrained by the optimizer. Have we discovered all the useful optimizers? If there is any doubt that optimization is a bottleneck for neural nets, consider how many architectural innovations were really ways to get SGD to work better.

Optimization and Computer Science. The computational complexity classes of convex optimization are characterized by the information about $f$ available from local black-box evaluation [7]: 0th-order methods use $f(x)$; 1st-order methods use $f(x)$ and $\nabla f(x) = \big(\frac{\partial f(x)}{\partial x^{(n)}}\big)_n$; 2nd-order methods use $f(x)$, $\nabla f(x)$, and $\nabla^2 f(x) = \big(\frac{\partial^2 f(x)}{\partial x^{(n)} \partial x^{(m)}}\big)_{n,m}$.

Optimization and Computer Science. Study the rate of convergence of iterative methods. [Plot: $\log(f(x_i) - f(x^\star))$ against iteration $i$ for sub-linear, linear, and super-linear convergence.] Distinguish between fast linear and slow sub-linear convergence.

Gradient descent. E.g. gradient descent with step size $\epsilon > 0$ is a first-order method, with iterates $x_{i+1} = x_i - \epsilon \nabla f(x_i)$. [Plot: gradient descent iterates $x_0, x_1, \ldots$ on the level sets of a two-dimensional objective.]
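For concreteness, a minimal sketch of this update in Python (the quadratic test function and hyperparameters are my own illustrative choices, not from the talk):

```python
import numpy as np

def gradient_descent(grad_f, x0, eps=0.1, n_iters=100):
    """Gradient descent: x_{i+1} = x_i - eps * grad_f(x_i)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - eps * grad_f(x)
    return x

# Example: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
print(gradient_descent(lambda x: x, x0=[2.0, -1.0]))  # approaches the minimizer 0
```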

When is gradient descent fast? $f \in C^2$ is strongly convex & smooth iff there exist $\mu, L > 0$ such that for all $x \in \mathbb{R}^d$, $\mu I \preceq \nabla^2 f(x) \preceq L I$. Gradient descent on smooth & strongly convex $f$ with $\epsilon = L^{-1}$ has fast linear convergence, $f(x_i) - f(x^\star) \leq O\big((1 - \tfrac{\mu}{L})^i\big)$.

Smoothness & strong convexity important? Lower bound. Nemirovski & Yudin [7] show that for every iterative first-order method there exist an iteration $i$ and a smooth convex $f$ such that convergence is slow, $f(x_i) - f(x^\star) \geq \Omega(i^{-2})$. A similar bound holds for non-smooth strongly convex $f$.

Summary so far. $\nabla^2 f(x)$ bounded above and below by positive constants (or, equivalently, first-order conditions [8]) is important for first-order methods. [Plot: the second derivative of a one-dimensional $f$ sandwiched between the constants $\mu$ and $L$.]

Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.

Power functions. Power functions are useful as a study case of idealized convex functions: $f(x) = \frac{|x|^b}{b}$, $b \geq 1$, $x \in \mathbb{R}$. [Plot: $f(x)$ over $[-2, 2]$ for $b = 4/3$, $b = 2$, and $b = 4$.]

Power functions. Smooth & strongly convex iff $b = 2$. By the Łojasiewicz inequality [4], real analytic functions can be bounded by power functions at their zero locus: if $f : \mathbb{R}^d \to \mathbb{R}$ is real analytic & convex with unique minimum $x^\star$, then for every compact $K \subset \mathbb{R}^d$ there exist $b \geq 1$ and $\mu > 0$ such that for all $x \in K$, $f(x) - f(x^\star) \geq \frac{\mu}{b} \|x - x^\star\|_2^b$. In general, we don't know $b$.

Continuous limit of optimization algorithms. To study properties of optimizers, consider $\epsilon \to 0$. [Plot: three panels of iterates in the plane approaching a continuous path as the step size shrinks.] E.g. gradient descent iterates approximately solve the gradient flow $\dot{x}_t = -\nabla f(x_t)$ with $x_0 \in \mathbb{R}^d$, $t \geq 0$.

Continuous limit of optimization algorithms. Fundamental properties are revealed by studying solutions $x_t : [0, \infty) \to \mathbb{R}^d$ of the gradient flow; e.g. it is a descent method: $\frac{d}{dt} f(x_t) = \langle \nabla f(x_t), \dot{x}_t \rangle = -\|\nabla f(x_t)\|^2 \leq 0$.

Gradient descent on power functions. For $f(x) = |x|^b / b$, we have $\frac{d}{dt} f(x_t) = -b\,|x_t|^{b-2} f(x_t)$, so $f(x_t) = f(x_0) \exp\big(-b \int_0^t |x_s|^{b-2}\, ds\big)$. There are two regimes in $b$ for the rate of convergence: $1 < b \leq 2$ gives $f(x_t) \leq O(\exp(-\lambda t))$; $b > 2$ gives $f(x_t) \geq \Omega\big(t^{-b/(b-2)}\big)$.

Gradient descent on power functions. [Plot: continuous-time gradient descent on $f(x) = |x|^b / b$; $\log f(x_t)$ against time $t$ for $b = 4$, $b = 2$, and $b = 4/3$.]

Gradient descent on power functions. Gradient descent with step size $\epsilon > 0$, $x_{i+1} = x_i \big(1 - \epsilon |x_i|^{b-2}\big)$, doesn't converge for $b < 2$, as $|x_i|^{b-2}$ explodes near the minimum. [Plot: $\log f(x_i)$ against iteration $i$ for $b = 4/3$, $b = 2$, and $b = 4$.]
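A quick numerical sketch of the three regimes (my own illustration; the step size and iteration count are arbitrary choices):

```python
import numpy as np

def power_gd(b, x0=1.0, eps=0.1, n_iters=100):
    """Gradient descent on f(x) = |x|^b / b in one dimension,
    i.e. x_{i+1} = x_i * (1 - eps * |x_i|^(b-2))."""
    x = float(x0)
    for _ in range(n_iters):
        if x == 0.0:
            break  # exact minimizer reached; avoids 0 ** negative below
        x = x * (1.0 - eps * abs(x) ** (b - 2.0))
    return x

for b in (4.0 / 3.0, 2.0, 4.0):
    print(f"b = {b:.3f}: |x_100| = {abs(power_gd(b)):.3e}")
# b = 4/3: the factor |x|^(b-2) blows up near 0, so the iterates
# overshoot and stall; b = 2: linear convergence; b = 4: sub-linear.
```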

Gradient descent on power functions. Summary of gradient descent with fixed $\epsilon$ on power functions. [Diagram over $b \in [1, 4]$: super-linear in continuous time for $b < 2$; linear in discrete time at $b = 2$; sub-linear in continuous time for $b > 2$.] This mirrors the lower bounds, although specialized methods can do better in this case.

Summary so far. $\nabla^2 f(x)$ bounded by positive constants (or equivalent first-order conditions [8]) is important for first-order methods. Power functions serve as a sandbox test case for optimization and mirror the lower bound results.

Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.

The question. What can be done using the first-order computation of two functions $f, k : \mathbb{R}^d \to \mathbb{R}$, i.e. $f(x)$, $\nabla f(x)$, $k(p)$, $\nabla k(p)$? Here $k(p)$, $\nabla k(p)$ must be cheap to compute (e.g., $O(d)$) to avoid cheating.

Proposed methods & key contributions. The methods generalize the momentum method [10] to include a non-standard kinetic energy $k$; we call them Hamiltonian descent methods. Linear rates are possible for convex functions $f$ that are not smooth & strongly convex. Convergence theory in continuous & discrete time.

Gradient descent with momentum. Polyak's heavy ball [10] with $\epsilon, \gamma > 0$ iterates $p_{i+1} = -\epsilon \nabla f(x_i) + (1 - \epsilon\gamma) p_i$, $x_{i+1} = x_i + \epsilon p_{i+1}$. Persistent motion helps in narrow valleys. [Plot: heavy ball vs. gradient descent iterates in a narrow valley.]
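A minimal sketch of these updates (the narrow-valley quadratic and the hyperparameters are my own illustrative choices):

```python
import numpy as np

def heavy_ball(grad_f, x0, eps=0.05, gamma=2.0, n_iters=200):
    """Polyak's heavy ball:
        p_{i+1} = -eps * grad_f(x_i) + (1 - eps * gamma) * p_i
        x_{i+1} = x_i + eps * p_{i+1}
    """
    x = np.asarray(x0, dtype=float)
    p = np.zeros_like(x)
    for _ in range(n_iters):
        p = -eps * grad_f(x) + (1.0 - eps * gamma) * p
        x = x + eps * p
    return x

# A narrow valley: f(x) = 0.5 * (x1^2 + 25 * x2^2).
print(heavy_ball(lambda x: np.array([x[0], 25.0 * x[1]]), x0=[1.0, 1.0]))
```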

Gradient descent with momentum. Continuous $\epsilon \to 0$ limit of Polyak's heavy ball: the discrete updates $x_{i+1} = x_i + \epsilon p_{i+1}$, $p_{i+1} = -\epsilon \nabla f(x_i) + (1 - \epsilon\gamma) p_i$ become the continuous system $\dot{x}_t = p_t$, $\dot{p}_t = -\nabla f(x_t) - \gamma p_t$.

Hamiltonian descent methods. Generalize the position update of Polyak's heavy ball: the continuous heavy ball is $\dot{x}_t = p_t$, $\dot{p}_t = -\nabla f(x_t) - \gamma p_t$; continuous Hamiltonian descent is $\dot{x}_t = \nabla k(p_t)$, $\dot{p}_t = -\nabla f(x_t) - \gamma p_t$. Also called a conformal Hamiltonian system [6].

Hamiltonian descent methods. Def. In physics the total energy or Hamiltonian is defined as $H(x, p) = f(x) - f(x^\star) + k(p)$. If $k$ is strictly convex with minimum $k(0) = 0$, then solutions of conformal Hamiltonian systems descend the Hamiltonian: $\frac{d}{dt} H(x_t, p_t) = -\gamma \langle \nabla k(p_t), p_t \rangle \leq 0$.

Hamiltonian descent methods. Dual views on the relationship between $k$ and $f$: given $f$, design $k$ for fast convergence? Given $k$, on which class of $f$ is convergence fast? The class of smooth & strongly convex functions corresponding to the quadratic $k(p) = \langle p, p \rangle / 2$ is not an accident! Develop intuition via one-dimensional power functions: let $\varphi_a(t) = t^a / a$, $f(x) = \varphi_b(|x|)$, $k(p) = \varphi_a(|p|)$.

Hamiltonian descent on power functions. The continuous system becomes $\dot{x}_t = \operatorname{sgn}(p_t)\, |p_t|^{a-1}$, $\dot{p}_t = -\operatorname{sgn}(x_t)\, |x_t|^{b-1} - \gamma p_t$. [Plots: phase portraits of momentum $p$ against position $x$.]

Hamiltonian descent on power functions. [Phase portrait: solutions with $a = 2$ and $b = 2$ in the $(x, p)$ plane.]

Hamiltonian descent on power functions. The worst case is when $x_t$ and $p_t$ are both small. To escape, we want along $p_t$ that $-\frac{d}{dt}\, k(p_t) = \big\langle \nabla k(p_t),\, \nabla f\big(\textstyle\int \nabla k(p_t)\, dt\big) + \gamma p_t \big\rangle \geq C\, k(p_t)$, i.e. $\nabla k(p) \approx (\nabla f)^{-1}(p)$. [Plots: $k(p)$ and $f(x)$ side by side.]

Hamiltonian descent on power functions. For power functions, this is $\frac{1}{a} + \frac{1}{b} \geq 1$. [Diagram in the $(b, a)$ plane for $f(x) = |x|^b/b$, $k(p) = |p|^a/a$: linear convergence in the region $\frac{1}{a} + \frac{1}{b} \geq 1$, sub-linear convergence outside it.] We show linear convergence in continuous time iff $\frac{1}{a} + \frac{1}{b} \geq 1$.

Hamiltonian descent on power functions. [Phase portrait: solutions with $a = 2$ and $b = 8$ in the $(x, p)$ plane (here $\frac{1}{a} + \frac{1}{b} < 1$).]

Hamiltonian descent on power functions. We study three fixed-$\epsilon$ discretizations; e.g. the first explicit one is $\frac{p_{i+1} - p_i}{\epsilon} = -\nabla f(x_i) - \gamma p_{i+1}$, $\frac{x_{i+1} - x_i}{\epsilon} = \nabla k(p_{i+1})$. If $k(p) = \varphi_a(|p|)$, all discretizations require that there exists $L > 0$ such that for all $x \in \mathbb{R}$, $\frac{|f'(x)|^a}{a} \leq L\,(f(x) - f(x^\star))$. If $k(p) = \varphi_a(|p|)$ and $f(x) = \varphi_b(|x|)$, this is satisfied if $\frac{1}{a} + \frac{1}{b} = 1$.
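A sketch of this first explicit discretization: the $p$-update is linear in $p_{i+1}$, so it solves to $p_{i+1} = (p_i - \epsilon \nabla f(x_i)) / (1 + \epsilon\gamma)$. The power-function pairing with $1/a + 1/b = 1$ and the hyperparameters are my own illustrative choices:

```python
import numpy as np

def hamiltonian_descent(grad_f, grad_k, x0, eps=0.1, gamma=1.0, n_iters=200):
    """First explicit discretization from the slides:
        (p_{i+1} - p_i) / eps = -grad_f(x_i) - gamma * p_{i+1}
        (x_{i+1} - x_i) / eps =  grad_k(p_{i+1})
    The p-step is linear in p_{i+1} and is solved in closed form below."""
    x = np.asarray(x0, dtype=float)
    p = np.zeros_like(x)
    for _ in range(n_iters):
        p = (p - eps * grad_f(x)) / (1.0 + eps * gamma)
        x = x + eps * grad_k(p)
    return x

# f(x) = |x|^4 / 4 paired with k(p) = (3/4) |p|^(4/3), so that 1/a + 1/b = 1.
grad_f = lambda x: np.sign(x) * np.abs(x) ** 3
grad_k = lambda p: np.sign(p) * np.abs(p) ** (1.0 / 3.0)
print(hamiltonian_descent(grad_f, grad_k, x0=[1.5]))
```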

Hamiltonian descent on power functions. [Diagram in the $(b, a)$ plane for $f(x) = |x|^b/b$, $k(p) = |p|^a/a$: the curve $\frac{1}{a} + \frac{1}{b} = 1$ marks linear convergence of the 1st and 2nd explicit methods, with sub-linear convergence away from it; the quadratic point $a = b = 2$ is the one suitable for strongly convex and smooth $f$.] Linear convergence of fixed-$\epsilon$ discretizations if $\frac{1}{a} + \frac{1}{b} = 1$.

Hamiltonian descent on power functions. Generalize smoothness & strong convexity to power growth! [Plot: the second derivative of $f(x) = |x|^b/b$ over $[-2, 2]$ for $b = 4/3$, $b = 2$, and $b = 4$.] We can deal with second derivatives that shrink or explode.

Summary so far. $\nabla^2 f(x)$ bounded by positive constants (or equivalent first-order conditions [8]) is important for first-order methods. Power functions serve as a sandbox test case for optimization and mirror the lower bound results. Hamiltonian descent can cope with $\nabla^2 f(x)$ shrinking or exploding.

Outline Gradient descent on power functions. Hamiltonian descent on power functions. A tour of our results. Conclusions.

Convex Conjugate. Def. Given a convex function $h : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$, define the convex conjugate $h^\star : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$, $h^\star(p) = \sup_{x \in \mathbb{R}^d} \langle x, p \rangle - h(x)$. E.g. $h(x) = \frac{\|x\|^b}{b} \Rightarrow h^\star(p) = \frac{\|p\|^a}{a}$ with $\frac{1}{a} + \frac{1}{b} = 1$; $h(x) = \frac{1}{2}\langle x, Ax \rangle \Rightarrow h^\star(p) = \frac{1}{2}\langle p, A^{-1} p \rangle$.
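As a quick check of the first example, here is my own worked one-dimensional derivation of the conjugate of $h(x) = |x|^b/b$ for $b > 1$:

```latex
% The supremum of px - |x|^b / b is attained where p = sgn(x) |x|^{b-1},
% i.e. |x| = |p|^{1/(b-1)}. Substituting back:
\begin{align*}
h^\star(p) &= \sup_{x \in \mathbb{R}} \Big( p x - \frac{|x|^b}{b} \Big)
            = |p|^{\frac{b}{b-1}} - \frac{1}{b}\, |p|^{\frac{b}{b-1}} \\
           &= \Big( 1 - \frac{1}{b} \Big) |p|^{\frac{b}{b-1}}
            = \frac{|p|^a}{a},
  \qquad a = \frac{b}{b-1}, \quad \frac{1}{a} + \frac{1}{b} = 1.
\end{align*}
```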

Choosing k. Given $f$, design $k$ for fast convergence? A good choice of $k(p)$ is related to the convex conjugate of $f_c(x) = f(x + x^\star) - f(x^\star)$. Assumption A. There exists $\alpha \in (0, 1]$ such that for all $p \in \mathbb{R}^d$, $k(p) \geq \alpha \max\{f_c^\star(p), f_c^\star(-p)\}$.

Choosing k, continuous. Theorem. Given $f$ differentiable and convex with unique minimum $x^\star$; $k$ differentiable and strictly convex with unique minimum $k(0) = 0$; $\alpha$ satisfying Assumption A; and $\gamma \in (0, 1)$. Let $\lambda = \frac{(1-\gamma)\gamma}{4}$; then the solutions of the Hamiltonian descent system satisfy $f(x_t) - f(x^\star) \leq O\big(\exp(-\lambda \alpha t)\big)$.

Choosing k, discrete. Assumption B. All discretizations require a first-order assumption: there exists $C_{f,k} > 0$ such that for all $x, p \in \mathbb{R}^d$, $\langle \nabla f(x), \nabla k(p) \rangle \leq C_{f,k}\, H(x, p)$.

Choosing k, discrete. Assumptions C xor D. Explicit discretizations require second-order assumptions on either $f$ or $k$. Assumption C. $f$ is twice continuously differentiable and there exists $D_{f,k} > 0$ such that for all $x \in \mathbb{R}^d \setminus \{x^\star\}$ and $p \in \mathbb{R}^d$, $\langle \nabla k(p), \nabla^2 f(x)\, \nabla k(p) \rangle \leq D_{f,k}\, H(x, p)$. Assumption D. Switch $f$ and $k$ in Assumption C. Under such assumptions, there exists $C > 0$ such that the discretizations converge linearly for $\epsilon \in (0, C]$, $\gamma \in (0, 1]$.

Power Kinetic Energies. Given $k$, on which class of $f$ is convergence fast? Def. Given $a, A \in [1, \infty)$, define $\varphi_a^A(t) = \frac{1}{A}(t^a + 1)^{A/a} - \frac{1}{A}$ for $t \in [0, \infty)$. $\varphi_a^A$ behaves like $\varphi_A$ for large $t$ and like $\varphi_a$ for small $t$. Conditions on $f$ are given in terms of a norm, with $k(p) = \varphi_a^A(\|p\|_*)$, where $\|p\|_* = \sup_{\|x\| \leq 1} \langle p, x \rangle$.
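A small numeric sketch of the two-regime behavior of $\varphi_a^A$ (my own illustration):

```python
import numpy as np

def phi(t, a, A):
    """phi_a^A(t) = ((t^a + 1)^(A/a) - 1) / A, defined for t >= 0."""
    return ((t ** a + 1.0) ** (A / a) - 1.0) / A

# phi_a^A behaves like t^a / a for small t and like t^A / A for large t.
for t in (1e-3, 1e3):
    print(t, phi(t, a=2.0, A=8.0), t ** 2 / 2.0, t ** 8 / 8.0)
```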

Power Kinetic Energies. [Plots: $\varphi_a^A(|x|)$ over $x \in [-2, 2]$ for $a \in \{8/7, 2, 8\}$ and $A \in \{8/7, 2, 8\}$.]

Power Kinetic Energies. Let $b = \frac{a}{a-1}$ and $B = \frac{A}{A-1}$. Assumption A is implied by: there exists $\mu > 0$ such that $f(x) - f(x^\star) \geq \mu\, \varphi_b^B(\|x - x^\star\|)$. This is implied by strong convexity for $b = B = 2$.

Power Kinetic Energies. Assumption B is implied by: there exists $L > 0$ such that $\varphi_a^A(\|\nabla f(x)\|_*) \leq L\, (f(x) - f(x^\star))$. This is implied by smoothness for $a = A = 2$.

Power Kinetic Energies. Assumption C for $b, B \geq 2$ is implied by $f$ twice continuously differentiable and: there exists $L > 0$ such that for all $x \in \mathbb{R}^d \setminus \{x^\star\}$, $\nabla^2 f(x) \preceq L\, \nabla^2 \varphi_b^B(\|x - x^\star\|)$. This is equivalent to smoothness for $b = B = 2$. Assumption D relies on the smoothness of $k$, so it requires twice continuous differentiability of the norm.

Simulations, $f(x) = \varphi_4(|x|)$. [Plots: objective $\log f(x_t)$, $\log f(x_i)$ and solution & iterates $x_t$, $x_i$ over time, for $f(x) = x^4/4$ with $k(p) = \frac{3}{4}|p|^{4/3}$ and for $f(x) = x^4/4$ with $k(p) = p^2/2$.]

Simulations, $f(x) = \varphi_4(|x|)$. [Plots: objective and solution & iterates for $f(x) = x^4/4$ with $k(p) = \frac{7}{8}|p|^{8/7}$.]

Adaptive rates. $\alpha$ may improve as $(x_i, p_i) \to (x^\star, 0)$. To capture this, our analysis is extended to $k(p) \geq \alpha(k(p)) \max\{f_c^\star(p), f_c^\star(-p)\}$ for $\alpha : [0, \infty) \to (0, 1]$ differentiable, convex, and non-increasing. This allows us to provide a position-independent step-size choice with naturally adaptive rates for $B \geq A/(A-1)$.

Relativistic Kinetic Energy. Lu et al. [5] study the relativistic kinetic energy for sampling: $k(p) = \sqrt{\|p\|_2^2 + 1} - 1$, $\nabla k(p) = \frac{p}{\sqrt{\|p\|_2^2 + 1}}$. $\|\nabla k(p)\|_2$ is bounded, which improves stability, similar to gradient clipping [9], Adam [3], RMSProp [2], AdaGrad [1].
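A sketch of this kinetic energy and its bounded gradient (my own illustration; the gradient function could be passed as grad_k to the explicit discretization sketched earlier):

```python
import numpy as np

def relativistic_k(p):
    """Relativistic kinetic energy k(p) = sqrt(||p||_2^2 + 1) - 1."""
    return np.sqrt(np.dot(p, p) + 1.0) - 1.0

def relativistic_grad_k(p):
    """grad k(p) = p / sqrt(||p||_2^2 + 1); its 2-norm is always < 1,
    so the position update is bounded, much like gradient clipping."""
    return p / np.sqrt(np.dot(p, p) + 1.0)

p = np.array([100.0, -50.0])
print(relativistic_k(p), np.linalg.norm(relativistic_grad_k(p)))  # norm < 1
```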

Relativistic Kinetic Energy. The relativistic energy is $k(p) = \varphi_2^1(\|p\|)$. It is suitable for strongly convex, but possibly non-smooth, $f$. It has adaptive rates, $\alpha(y) = (y + 1)^{\frac{1}{B} - 1}$.

Simulations, $f(x) = \varphi_2^8(|x|)$. [Plots: $\log f(x_i)$ and iterates $x_i$ over 1000 iterations for gradient descent, $k(p) = \varphi_2^1(|p|)$, and $k(p) = \varphi_2^{8/7}(|p|)$.]

Conclusions. Theoretical: lower bounds assuming two first-order oracles? Optimal $\gamma$, $\epsilon$?

Conclusions. Methodological: kinetic energies $k$ for specific problems of interest? Constrained optimization? The biggest limitation is that designing $k$ requires knowledge of $f$ near its minimum. Adaptive methods, e.g. [11]?

Thanks to you and my coauthors: Daniel Paulin, Yee Whye Teh, Brendan O'Donoghue, Arnaud Doucet.

[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[2] G. Hinton. Neural Networks for Machine Learning. http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf, 2014. Slides 26–31 of Lecture 6.
[3] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[4] S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. Les équations aux dérivées partielles, 117:87–89, 1963.
[5] X. Lu, V. Perrone, L. Hasenclever, Y. W. Teh, and S. Vollmer. Relativistic Monte Carlo. In Artificial Intelligence and Statistics, pages 1236–1245, 2017.
[6] R. McLachlan and M. Perlmutter. Conformal Hamiltonian systems. Journal of Geometry and Physics, 39(4):276–300, 2001.
[7] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, 1983.
[8] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[9] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
[10] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[11] V. Roulet and A. d'Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems, pages 1119–1129, 2017.