arxiv: v1 [math.oc] 13 Sep 2018

Hamiltonian Descent Methods

Chris J. Maddison 1,2,*, Daniel Paulin 1,*, Yee Whye Teh 1,2, Brendan O'Donoghue 2, and Arnaud Doucet 1

1 Department of Statistics, University of Oxford
2 DeepMind, London, UK
* Both authors contributed equally to this work.

14 September 2018

Abstract

We propose a family of optimization methods that achieve linear convergence using first-order gradient information and constant step sizes on a class of convex functions much larger than the smooth and strongly convex ones. This larger class includes functions whose second derivatives may be singular or unbounded at their minima. Our methods are discretizations of conformal Hamiltonian dynamics, which generalize the classical momentum method to model the motion of a particle with non-standard kinetic energy exposed to a dissipative force and the gradient field of the function of interest. They are first-order in the sense that they require only gradient computation. Yet, crucially the kinetic gradient map can be designed to incorporate information about the convex conjugate in a fashion that allows for linear convergence on convex functions that may be non-smooth or non-strongly convex. We study in detail one implicit and two explicit methods. For one explicit method, we provide conditions under which it converges to stationary points of non-convex functions. For all, we provide conditions on the convex function and kinetic energy pair that guarantee linear convergence, and show that these conditions can be satisfied by functions with power growth. In sum, these methods expand the class of convex functions on which linear convergence is possible with first-order computation.

1 Introduction

We consider the problem of unconstrained minimization of a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$,

\[ \min_{x \in \mathbb{R}^d} f(x), \tag{1} \]

by iterative methods that require only the partial derivatives $\nabla f(x) = (\partial f(x)/\partial x^{(n)}) \in \mathbb{R}^d$ of $f$, known also as first-order methods [38, 45, 41].
These methods produce a sequence of iterates $x_i \in \mathbb{R}^d$, and our emphasis is on those that achieve linear convergence, i.e., as a function of the iteration $i$ they satisfy $f(x_i) - f(x_\star) = O(\lambda^{-i})$ for some rate $\lambda > 1$ and $x_\star \in \mathbb{R}^d$ a global minimizer. We briefly consider non-convex differentiable $f$, but the bulk of our analysis focuses on the case of convex differentiable $f$. Our results will also occasionally require twice differentiability of $f$.

Figure 1: Optimizing $f(x) = [x^{(1)} + x^{(2)}]^4 + [x^{(1)}/2 - x^{(2)}/2]^4$ with three methods (panels: iterates in the $(x^{(1)}, x^{(2)})$ plane, and objective $\log f(x_i)$ against iteration $i$): gradient descent with fixed step size equal to $1/L_0$ where $L_0 = \lambda_{\max}(\nabla^2 f(x_0))$ is the maximum eigenvalue of the Hessian $\nabla^2 f$ at $x_0$; classical momentum, which is a particular case of our first explicit method with $k(p) = [(p^{(1)})^2 + (p^{(2)})^2]/2$ and fixed step size equal to $1/L_0$; and Hamiltonian descent, which is our first explicit method with $k(p) = (3/4)[(p^{(1)})^{4/3} + (p^{(2)})^{4/3}]$ and a fixed step size.

The convergence rates of first-order methods on convex functions can be broadly separated by the properties of strong convexity and Lipschitz smoothness. Taken together these properties for convex $f$ are equivalent to the conditions that the following left hand bound (strong convexity) and right hand bound (smoothness) hold for some $\mu, L \in (0, \infty)$ and all $x, y \in \mathbb{R}^d$,

\[ \frac{\mu}{2} \|x - y\|_2^2 \le f(x) - f(y) - \langle \nabla f(y), x - y \rangle \le \frac{L}{2} \|x - y\|_2^2, \tag{2} \]

where $\langle x, y \rangle = \sum_{n=1}^d x^{(n)} y^{(n)}$ is the standard inner product and $\|x\|_2 = \sqrt{\langle x, x \rangle}$ is the Euclidean norm. For twice differentiable $f$, these properties are equivalent to the conditions that the eigenvalues of the matrix of second-order partial derivatives $\nabla^2 f(x) = (\partial^2 f(x)/\partial x^{(n)} \partial x^{(m)}) \in \mathbb{R}^{d \times d}$ are everywhere lower bounded by $\mu$ and upper bounded by $L$, respectively. Thus, functions whose second derivatives are continuously unbounded or approaching 0 cannot be both strongly convex and smooth. Both bounds play an important role in the performance of first-order methods. On the one hand, for smooth and strongly convex $f$, the iterates of many first-order methods converge linearly. On the other hand, for any first-order method, there exist smooth convex functions and non-smooth strongly convex functions on which its convergence is sub-linear, i.e., $f(x_i) - f(x_\star) \ge O(i^{-2})$ for any first-order method on smooth convex functions. See [38, 45, 41] for these classical results and [25] for other more exotic scenarios.
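As a rough illustration of sub-linear behaviour, the sketch below runs fixed-step gradient descent on the function from Figure 1 with step size $1/L_0$. The starting point `x0 = (1.0, 0.5)` and the iteration counts are hypothetical choices for illustration, not values taken from the paper. The gradient and the largest Hessian eigenvalue are computed by hand in the coordinates $u = x^{(1)} + x^{(2)}$ and $v = (x^{(1)} - x^{(2)})/2$.

```python
def f(x1, x2):
    """Objective from Figure 1: (x1 + x2)^4 + (x1/2 - x2/2)^4."""
    return (x1 + x2) ** 4 + (x1 / 2 - x2 / 2) ** 4


def grad_f(x1, x2):
    """Gradient of f, written via u = x1 + x2 and v = (x1 - x2) / 2."""
    u, v = x1 + x2, (x1 - x2) / 2
    return 4 * u ** 3 + 2 * v ** 3, 4 * u ** 3 - 2 * v ** 3


def hessian_max_eig(x1, x2):
    """Largest Hessian eigenvalue; the eigenvectors are (1, 1) and (1, -1)."""
    u, v = x1 + x2, (x1 - x2) / 2
    return max(24 * u ** 2, 6 * v ** 2)


def gradient_descent(x0, steps):
    """Gradient descent with fixed step 1/L0, L0 = largest Hessian eigenvalue at x0."""
    x1, x2 = x0
    step = 1.0 / hessian_max_eig(x1, x2)
    for _ in range(steps):
        g1, g2 = grad_f(x1, x2)
        x1, x2 = x1 - step * g1, x2 - step * g2
    return f(x1, x2)


x0 = (1.0, 0.5)  # hypothetical starting point, chosen only for illustration
f_n = gradient_descent(x0, 10_000)
f_2n = gradient_descent(x0, 20_000)
ratio = f_n / f_2n  # close to 4 when the error decays like i^{-2}
```

Doubling the number of iterations shrinks the objective by roughly a constant factor instead of squaring the error, which is the signature of a polynomial rather than a linear rate.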
Moreover, for a given method it can sometimes be very easy to find examples on which its convergence is slow; see Figure 1, in which gradient descent with a fixed step size converges slowly on $f(x) = [x^{(1)} + x^{(2)}]^4 + [x^{(1)}/2 - x^{(2)}/2]^4$, which is not strongly convex as its Hessian is singular at $(0, 0)$. The central assumption in the worst case analyses of first-order methods is that information about $f$ is restricted to black box evaluations of $f$ and $\nabla f$ locally at points $x \in \mathbb{R}^d$, see [38, 41]. In this paper we assume additional access to first-order information of a second differentiable function $k : \mathbb{R}^d \to \mathbb{R}$ and show how $\nabla k$ can be designed to incorporate information about $f$ to yield practical methods that converge linearly on convex functions. These methods are derived by discretizing

the conformal Hamiltonian system [33]. These systems are parameterized by $f, k : \mathbb{R}^d \to \mathbb{R}$ and $\gamma \in (0, \infty)$, with solutions $(x_t, p_t) \in \mathbb{R}^{2d}$,

\[ x_t' = \nabla k(p_t), \qquad p_t' = -\nabla f(x_t) - \gamma p_t. \tag{3} \]

From a physical perspective, these systems model the dynamics of a single particle located at $x_t$ with momentum $p_t$ and kinetic energy $k(p_t)$ being exposed to a force field $\nabla f$ and a dissipative force. For this reason we refer to $k$ as the kinetic energy, and $\nabla k$ as the kinetic map. When the kinetic map $\nabla k$ is the identity, $\nabla k(p) = p$, these dynamics are the continuous time analog of Polyak's heavy ball method [44].

Figure 2: Convergence Regions for Power Functions. Shown are regions of distinct convergence types for Hamiltonian descent systems with $f(x) = |x|^b/b$, $k(p) = |p|^a/a$ for $x, p \in \mathbb{R}$ and $a, b \in (1, \infty)$. We show in Section 2 that convergence is linear in continuous time iff $1/a + 1/b \ge 1$. In Section 4 we show that the assumptions of the explicit discretizations can be satisfied if $1/a + 1/b = 1$, leaving this as the only suitable pairing for linear convergence. Light dotted line is the line occupied by classical momentum with $k(p) = p^2/2$. (Region labels: sub-linear convergence in continuous time; linear convergence in continuous time; linear convergence of 1st explicit method; linear convergence of 2nd explicit method; quadratic, suitable for strongly convex and smooth. Axes: $a$ and $b$.)

Let $f_c(x) = f(x + x_\star) - f(x_\star)$ denote the centered version of $f$, which takes its minimum at 0, with minimum value 0. Our key observation in this regard is that when $f$ is convex, and $k$ is chosen as $k(p) = (f_c^*(p) + f_c^*(-p))/2$, where $f_c^*(p) = \sup\{\langle x, p \rangle - f_c(x) : x \in \mathbb{R}^d\}$ is the convex conjugate of $f_c$, these dynamics have linear convergence with rate independent of $f$. In other words, this choice of $k$ acts as a preconditioner, a generalization of using $k(p) = \langle p, A^{-1} p \rangle/2$ for $f(x) = \langle x, Ax \rangle/2$. Thus $\nabla k$ can exploit global information provided by the conjugate $f_c^*$ to condition convergence for generic convex functions.
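To make the conjugate-based kinetic energy concrete, the following sketch approximates $k(p) = (f_c^*(p) + f_c^*(-p))/2$ in one dimension by a brute-force supremum over a grid. The helper names, the grid bounds, and the test value `p = 2.0` are illustrative assumptions. For $f_c(x) = x^4/4$ the result is compared against the closed form $(3/4)|p|^{4/3}$.

```python
def conjugate(h, p, lo=-5.0, hi=5.0, n=100_001):
    """Brute-force h*(p) = sup_x (x * p - h(x)) over a uniform grid on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max((lo + j * step) * p - h(lo + j * step) for j in range(n))


def kinetic_from_conjugate(f_c, p):
    """k(p) = (f_c*(p) + f_c*(-p)) / 2 for a centered function f_c."""
    return 0.5 * (conjugate(f_c, p) + conjugate(f_c, -p))


def f_c(x):
    """Already centered: minimum value 0 attained at x = 0."""
    return x ** 4 / 4


k_numeric = kinetic_from_conjugate(f_c, 2.0)
k_closed = 0.75 * 2.0 ** (4.0 / 3.0)  # (3/4) |p|^{4/3} evaluated at p = 2
```

The grid search only works when the maximizer lies inside `[lo, hi]`; in higher dimensions or for functions without a tractable conjugate, one would need the lower-bound conditions discussed later instead of an exact conjugate.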
To preview the flavor of our results in detail, consider the special case of optimizing the power function $f(x) = |x|^b/b$ for $x \in \mathbb{R}$ and $b \in (1, \infty)$, initialized at $x_0 > 0$, using system (3) (or discretizations of it) with $k(p) = |p|^a/a$ for $p \in \mathbb{R}$ and $a \in (1, \infty)$. For this choice of $f$, it can be shown that $f_c^*(p) = f_c^*(-p) = k(p)$ when $a = b/(b-1)$. In line with this, in Section 2 we show that (3) exhibits linear convergence in continuous time if and only if $1/a + 1/b \ge 1$. In Section 3 we propose two explicit discretizations with fixed step sizes; in Section 4 we show that the first explicit discretization converges if $1/a + 1/b = 1$ and $b \ge 2$, and the second converges if $1/a + 1/b = 1$ and $1 < b \le 2$. This means that the only suitable pairing corresponds in this case to the choice $k(p) \propto f_c^*(p) + f_c^*(-p)$. Figure 2 summarizes this discussion. Returning to Figure 1, we can compare

the use of the kinetic energy of Polyak's heavy ball with a kinetic energy that relates appropriately to the convex conjugate of $f(x) = [x^{(1)} + x^{(2)}]^4 + [x^{(1)}/2 - x^{(2)}/2]^4$.

Most convex functions are not simple power functions, and computing $f_c^*(p) + f_c^*(-p)$ exactly is rarely feasible. To make our observations useful for numerical optimization, we show that linear convergence is still achievable in continuous time even if $k(p) \ge \alpha \max\{f_c^*(p), f_c^*(-p)\}$ for some $0 < \alpha \le 1$ within a region defined by $x_0$. We study three discretizations of (3), one implicit method and two explicit ones (which are suitable for functions that grow asymptotically fast or slow, respectively). We prove linear convergence rates for these under appropriate additional assumptions. We introduce a family of kinetic energies that generalize the power functions to capture distinct power growth near zero and asymptotically far from zero. We show that the additional assumptions of discretization can be satisfied for this family of $k$. We derive conditions on $f$ that guarantee the linear convergence of our methods when paired with a specific choice of $k$ from this family. These conditions generalize the quadratic growth implied by smoothness and strong convexity, extending it to general power growth that may be distinct near the minimum and asymptotically far from the minimum, which we refer to as tail and body behavior, respectively.

Step sizes can be fixed independently of the initial position (and often dimension), and do not require adaptation, which often leads to convergence problems, see [57]. Indeed, we analyze a kinetic map $\nabla k$ that resembles the iterate updates of some popular adaptive gradient methods [13, 59, 18, 27], and show that it conditions the optimization of strongly convex functions with very fast growing tails (non-smooth). Thus, our methods provide a framework for optimizing potentially non-smooth or non-strongly convex functions with linear rates using first-order computation.

The organization of the paper is as follows. In the rest of this section, we cover notation, review a few results from convex analysis, and give an overview of the related literature.
In Section 2, we show the linear convergence of (3) under conditions on the relation between the kinetic energy $k$ and $f$. We show a partial converse that in some settings our conditions are necessary. In Section 3, we present the three discretizations of the continuous dynamics and study the assumptions under which linear rates can be guaranteed for convex functions. For one of the discretizations, we also provide conditions under which it converges to stationary points of non-convex functions. In Section 4, we study a family of kinetic energies suitable for functions with power growth. We describe the class of functions for which the assumptions of the discretizations can be satisfied when using these kinetic energies.

1.1 Notation and Convex Analysis Review

We let $\langle x, y \rangle = \sum_{n=1}^d x^{(n)} y^{(n)}$ denote the standard inner product for $x, y \in \mathbb{R}^d$ and $\|x\|_2 = \sqrt{\langle x, x \rangle}$ the Euclidean norm. For a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, the gradient $\nabla f(x) = (\partial f(x)/\partial x^{(n)}) \in \mathbb{R}^d$ is the vector of partial derivatives at $x$. For twice-differentiable $f$, the Hessian $\nabla^2 f(x) = (\partial^2 f(x)/\partial x^{(n)} \partial x^{(m)}) \in \mathbb{R}^{d \times d}$ is the matrix of second-order partial derivatives at $x$. The notation $x_t$ denotes the solution $x_t : [0, \infty) \to \mathbb{R}^d$ to a differential equation with derivative in $t$ denoted $x_t'$. $x_i$ denotes the iterates $x_i : \{0, 1, \ldots\} \to \mathbb{R}^d$ of a discrete system.

Consider a convex function $h : C \to \mathbb{R}$ that is defined on a convex domain $C \subseteq \mathbb{R}^d$ and differentiable on the interior $\mathrm{int}(C)$. The convex conjugate $h^* : \mathbb{R}^d \to \mathbb{R}$ is defined as

\[ h^*(p) = \sup\{\langle x, p \rangle - h(x) : x \in C\} \tag{4} \]

and it is itself convex. It is easy to show from the definition that if $g : C \to \mathbb{R}$ is another convex function such that $g(x) \le h(x)$ for all $x \in C$, then $h^*(p) \le g^*(p)$ for all $p \in \mathbb{R}^d$. Because we make

such extensive use of it, we remind readers of the Fenchel-Young inequality: for $x \in C$ and $p \in \mathbb{R}^d$,

\[ \langle x, p \rangle \le h(x) + h^*(p), \tag{5} \]

which is easily derived from the definition of $h^*$, or see Section 12 of [47]. For $x \in \mathrm{int}(C)$, by Theorem 26.4 of [47],

\[ \langle x, \nabla h(x) \rangle = h(x) + h^*(\nabla h(x)). \tag{6} \]

Let $y \in \mathbb{R}^d$, $c \in \mathbb{R} \setminus \{0\}$. If $g(x) = h(x + y) - c$, then $g^*(p) = h^*(p) - \langle p, y \rangle + c$ (Theorem 12.3 [47]). If $h(x) = |x|^b/b$ for $x \in \mathbb{R}$ and $b \in (1, \infty)$, then $h^*(p) = |p|^a/a$ where $a = b/(b-1)$ (page 106 of [47]). If $g(x) = c\,h(x)$, then $g^*(p) = c\,h^*(p/c)$ (Table 3.2 [6]). For these and more on $h^*$, we refer readers to [47, 8, 6].

1.2 Related Literature

Standard references on convex optimization and the convergence analysis of first-order methods include [38, 45, 3, 8, 41, 9]. The heavy ball method was introduced by Polyak in his seminal paper [44]. In that paper, local convergence with linear rate was shown (i.e., when the initial position is sufficiently close to the local minimum). For quadratic functions, it can be shown that the convergence rate for optimally chosen step sizes is proportional to the square root of the condition number of the Hessian, similarly to conjugate gradient descent (see e.g., [46]). As far as we know, global convergence of the heavy ball method for non-quadratic functions was only recently established in [19] and [30]; see [22] for an extension to stochastic average gradients. The heavy ball method forms the basis of some of the most successful optimization methods for deep learning, see e.g., [54, 27], and the recent review [7]. Hereafter, classical momentum refers to any first-order discretization of the continuous analog of Polyak's heavy ball (with possibly suboptimal step sizes). Nesterov obtained upper and lower bounds of matching order for first-order methods for smooth convex functions and smooth strongly convex functions, see [41]. In Necoara et al. [36], the assumption of strong convexity was relaxed, and under a weaker quadratic growth condition, linear rates were obtained by several well known optimization methods.
Several other authors obtained linear rates for various classes of non-strongly convex or non-uniformly smooth functions, see e.g., [37, 26, 11, 58, 14, 48].

In recent years, there has been interest in the optimization community in looking at the continuous time ODE limit of optimization methods, when the step size tends to zero. Su et al. [52, 53] have found the continuous time limit of Nesterov's accelerated gradient descent. This result improves the intuition about Nesterov's method, as the proofs of convergence rates in continuous time are rather elegant and clear, while the previous proofs in discrete time are not as transparent. Follow-ups have studied the continuous time counterparts to accelerated mirror descent [28] as well as higher order discretizations of such systems [55, 56]. Studying continuous time systems for optimization can separate the concerns of designing an optimizer from the difficulties of discretization. This perspective has resulted in numerous other recent works that propose new optimization methods, and study existing ones via their continuous time limit, see e.g., [4, 1, 15, 24, 10, 16, 17].

Conformal Hamiltonian systems (3) are studied in geometry [33, 5], because their solutions preserve symplectic area up to a constant; when $\gamma = 0$ symplectic area is exactly preserved, when $\gamma > 0$ symplectic area dissipates uniformly at an exponential rate [33]. In classical mechanics,

Hamiltonian dynamics (system (3) with $\gamma = 0$) are used to describe the motion of a particle exposed to the force field $\nabla f$. Here, the most common form for $k$ is $k(p) = \langle p, p \rangle/2m$, where $m$ is the mass, or in relativistic mechanics, $k(p) = c\sqrt{\langle p, p \rangle + m^2 c^2}$ where $c$ is the speed of light, see [21]. In the Markov Chain Monte Carlo literature, where discretized Hamiltonian dynamics (again $\gamma = 0$) are used to propose moves in a Metropolis-Hastings algorithm [34, 23, 12, 35], $k$ is viewed as a degree of freedom that can be used to improve the mixing properties of the Markov chain [20, 31]. Stochastic differential equations similar to (3) with $\gamma > 0$ have been studied from the perspective of designing $k$ [32, 51].

2 Continuous Dynamics

In this section, we motivate the discrete optimization algorithms by introducing their continuous time counterparts. These systems are differential equations described by a Hamiltonian vector field plus a dissipation field. Thus, we briefly review Hamiltonian dynamics, the continuous dynamics of Hamiltonian descent, and derive convergence rates for convex $f$ in continuous time.

2.1 Hamiltonian Systems

In the Hamiltonian formulation of mechanics, the evolution of a particle exposed to a force field $\nabla f$ is described by its location $x_t : [0, \infty) \to \mathbb{R}^d$ and momentum $p_t : [0, \infty) \to \mathbb{R}^d$ as functions of time. The system is characterized by the total energy, or Hamiltonian,

\[ H(x, p) = k(p) + f(x) - f(x_\star), \tag{7} \]

where $x_\star$ is one of the global minimizers of $f$ and $k : \mathbb{R}^d \to \mathbb{R}$ is called the kinetic energy. Throughout, we consider kinetic energies $k$ that are strictly convex functions with minimum at $k(0) = 0$. The Hamiltonian $H$ defines the trajectory of a particle $x_t$ and its momentum $p_t$ via the ordinary differential equation,

\[ x_t' = \nabla_p H(x_t, p_t) = \nabla k(p_t), \qquad p_t' = -\nabla_x H(x_t, p_t) = -\nabla f(x_t). \tag{8} \]

For any solution of this system, the value of the total energy over time $H_t = H(x_t, p_t)$ is conserved, as $H_t' = \langle \nabla k(p_t), p_t' \rangle + \langle \nabla f(x_t), x_t' \rangle = 0$. Thus, the solutions of the Hamiltonian field oscillate, exchanging energy from $x$ to $p$ and back again.

2.2 Continuously Descending the Hamiltonian

The solutions of a Hamiltonian system remain in the level set $\{(x_t, p_t) : H_t = H_0\}$.
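Energy conservation for the undamped system (8) can be observed numerically. The sketch below uses a leapfrog (kick-drift-kick) integrator, which is our choice for illustration and not a method from the paper, on the one-dimensional pair $f(x) = x^2/2$, $k(p) = p^2/2$; the step size and horizon are arbitrary.

```python
def leapfrog(x, p, grad_f, grad_k, dt, steps):
    """Integrate x' = grad_k(p), p' = -grad_f(x) with kick-drift-kick steps."""
    for _ in range(steps):
        p -= 0.5 * dt * grad_f(x)  # half step in momentum
        x += dt * grad_k(p)        # full step in position
        p -= 0.5 * dt * grad_f(x)  # second half step in momentum
    return x, p


def hamiltonian(x, p):
    """H(x, p) = k(p) + f(x) - f(x_star) with f(x) = x^2/2, k(p) = p^2/2, x_star = 0."""
    return 0.5 * p * p + 0.5 * x * x


x0, p0 = 1.0, 0.0
xT, pT = leapfrog(x0, p0, lambda x: x, lambda p: p, dt=0.01, steps=10_000)
energy_drift = abs(hamiltonian(xT, pT) - hamiltonian(x0, p0))
```

With no dissipation the trajectory keeps circulating on (approximately) the initial level set, so the objective value never settles; this is the behaviour that the dissipative term in (3) is meant to remove.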
To drive such a system towards stationary points, the total energy must reduce over time. Consider as a motivating example the continuous system $x_t'' = -\nabla f(x_t) - \gamma x_t'$, which describes Polyak's heavy ball algorithm in continuous time [44]. Letting $x_t' = p_t$, the heavy ball system can be rewritten as

\[ x_t' = p_t, \qquad p_t' = -\nabla f(x_t) - \gamma p_t. \tag{9} \]

Note that this system can be viewed as a combination of a Hamiltonian field with $k(p) = \langle p, p \rangle/2$ and a dissipation field, i.e., $(x_t', p_t') = F(x_t, p_t) + G(x_t, p_t)$ where $F(x, p) = (p, -\nabla f(x))$ and

$G(x, p) = (0, -\gamma p)$, see Figure 3 for a visualization. This is naturally extended to define the more general conformal Hamiltonian system [33],

\[ x_t' = \nabla k(p_t), \qquad p_t' = -\nabla f(x_t) - \gamma p_t, \tag{3 revisited} \]

with $\gamma \in (0, \infty)$. When $k$ is convex with a minimum $k(0) = 0$, these systems descend the level sets of the Hamiltonian. We can see this by showing that the total energy $H_t$ is reduced along the trajectory $(x_t, p_t)$,

\[ H_t' = \langle \nabla k(p_t), p_t' \rangle + \langle \nabla f(x_t), x_t' \rangle = -\gamma \langle \nabla k(p_t), p_t \rangle \le -\gamma k(p_t) \le 0, \tag{10} \]

where we have used the convexity of $k$, and the fact that it is minimised at $k(0) = 0$.

Figure 3: A visualization of a conformal Hamiltonian system. (Panels: Hamiltonian Field + Dissipation Field = Conformal Hamiltonian Field; axes: position $x$, momentum $p$.)

The following proposition shows some existence and uniqueness results for the dynamics (3). We say that $H$ is radially unbounded if $H(x, p) \to \infty$ when $\|(x, p)\|_2 \to \infty$, e.g., this would be implied if $f$ and $k$ were strictly convex with unique minima.

Proposition 2.1 (Existence and uniqueness). If $\nabla f$ and $\nabla k$ are continuous, $k$ is convex with a minimum $k(0) = 0$, and $H$ is radially unbounded, then for every $x, p \in \mathbb{R}^d$, there exists a solution $(x_t, p_t)$ of (3) defined for every $t \ge 0$ with $(x_0, p_0) = (x, p)$. If in addition, $\nabla f$ and $\nabla k$ are continuously differentiable, then this solution is unique.

Proof. First, only assuming continuity, it follows from Peano's existence theorem [42] that there exists a local solution on an interval $t \in [-a, a]$ for some $a > 0$. Let $[0, A)$ denote the right maximal interval where a solution of (3) satisfying $x_0 = x$ and $p_0 = p$ exists. From (10), it follows that $H_t' \le 0$, and hence $H_t \le H_0$ for every $t \in [0, A)$. Now by the radial unboundedness of $H$, and the fact that $H_t \le H_0$, it follows that the compact set $\{(x, p) : H(x, p) \le H_0\}$ is never left by the dynamics, and hence by Theorem 3 of [43] (page 91), we must have $A = \infty$. The uniqueness under continuous differentiability follows from the Fundamental Existence-Uniqueness Theorem on page 74 of [43].

As shown in the next proposition, (10) implies that conformal Hamiltonian systems approach stationary points of $f$.
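For readers who want the intermediate steps behind the energy inequality (10), the computation can be spelled out as follows. It substitutes the dynamics (3) into the time derivative of $H_t$, and then applies the first-order convexity inequality for $k$ at the points $p_t$ and $0$; nothing beyond the stated assumptions on $k$ is used.

```latex
\begin{align*}
H_t' &= \langle \nabla k(p_t), p_t' \rangle + \langle \nabla f(x_t), x_t' \rangle \\
     &= \langle \nabla k(p_t), -\nabla f(x_t) - \gamma p_t \rangle
        + \langle \nabla f(x_t), \nabla k(p_t) \rangle \\
     &= -\gamma \langle \nabla k(p_t), p_t \rangle, \\[4pt]
k(0) &\ge k(p_t) + \langle \nabla k(p_t), 0 - p_t \rangle
  \;\Longrightarrow\;
  \langle \nabla k(p_t), p_t \rangle \ge k(p_t) - k(0) = k(p_t) \ge 0 .
\end{align*}
```

The cross terms in $\nabla f$ and $\nabla k$ cancel exactly, so the only contribution to $H_t'$ comes from the dissipation field.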

Proposition 2.2 (Convergence to a stationary point). Let $(x_t, p_t)$ be a solution to the system (3) with initial conditions $(x_0, p_0) = (x, p) \in \mathbb{R}^{2d}$, $f$ continuously differentiable, and $k$ continuously differentiable, strictly convex with minimum at 0 and $k(0) = 0$. If $f$ is bounded below and $H$ is radially unbounded, then $\|\nabla f(x_t)\|_2 \to 0$.

Proof. Since $f$ is bounded below, $H \ge 0$. Since $H$ is radially unbounded, the set $B := \{(x, p) \in \mathbb{R}^{2d} : H(x, p) \le H(x_0, p_0) + 1\}$ is a compact set that contains $(x_0, p_0)$ in its interior. Moreover, by (10), we also have $(x_t, p_t) \in B$ for all $t > 0$. Consider the set $M = \{(x_t, p_t) : H_t' = 0\} \cap B$. Since $k$ is strictly convex, this set is equivalent to $\{(x, p) : \|p\|_2 = 0\} \cap B$. The largest invariant set of the dynamics (3) inside $M$ is $I = \{(x, p) \in \mathbb{R}^{2d} : \|p\|_2 = 0, \|\nabla f(x)\|_2 = 0\} \cap B$. By LaSalle's principle [29], all trajectories started from $B$ must approach $I$. Since $f$ is a continuous bounded function on the compact set $B$, there is a point $x_\star \in B$ such that $f(x_\star) \le f(x)$ for every $x \in B$ (i.e. the minimum is attained in $B$) by the extreme value theorem (see [49]). Moreover, due to the definition of $B$, $x_\star$ is in its interior, hence $\|\nabla f(x_\star)\|_2 = 0$ and therefore $(x_\star, 0) \in I$. Thus the set $I$ is non-empty (note that $I$ might contain other local minima as well).

Remark 1. This construction can be generalized by modifying the $-\gamma p$ component of (3) to a more general dissipation field $-\gamma D(p)$. If the dissipation field is everywhere aligned with the kinetic map, $\langle \nabla k(p), D(p) \rangle \ge 0$, then these systems dissipate energy. We have not found alternatives to $D(p) = p$ that result in linear convergence in general.

2.3 Continuous Hamiltonian Descent on Convex Functions

In this section we study how $k$ can be designed to condition the system (3) for linear convergence in $\log(f(x_t) - f(x_\star))$. Although the solutions $(x_t, p_t)$ of (3) approach stationary points under weak conditions, to derive rates we consider the case when $f$ is convex. To motivate our choice of $k$, consider the quadratic function $f(x) = \langle x, Ax \rangle/2$ with $k(p) = \langle p, A^{-1} p \rangle/2$ for positive definite symmetric $A \in \mathbb{R}^{d \times d}$. Now (3) becomes,

\[ x_t' = A^{-1} p_t, \qquad p_t' = -A x_t - \gamma p_t. \tag{11} \]

By the change of variables $v_t = A^{-1} p_t$, this is equivalent to

\[ x_t' = v_t, \qquad v_t' = -x_t - \gamma v_t, \tag{12} \]

which is a universal equation and hence the convergence rate of (11) is independent of $A$. Although this kinetic energy implements a constant preconditioner for any $f$, for this specific $f$, $k$ is its convex conjugate $f^*$. This suggests the core idea of this paper: taking $k$ related in some sense to $f^*$ for more general convex functions may condition the convergence of (3). Indeed, we show in this section that, if the kinetic energy $k(p)$ upper bounds a centered version of $f^*(p)$, then the convergence of (3) is linear.

More precisely, define the following centered function $f_c : \mathbb{R}^d \to \mathbb{R}$,

\[ f_c(x) = f(x + x_\star) - f(x_\star). \tag{13} \]

The convex conjugate of $f_c$ is given by $f_c^*(p) = f^*(p) - \langle x_\star, p \rangle + f(x_\star)$ and is minimized at $f_c^*(0) = 0$. Importantly, as we will show in the final lemma of this section, taking a kinetic energy such that

$k(p) \ge \alpha \max(f_c^*(p), f_c^*(-p))$ for some $\alpha \in (0, 1]$ suffices to achieve linear rates on any differentiable convex $f$ in continuous time. The constant $\alpha$ is included to capture the fact that $k$ may underestimate $f_c^*$ by some constant factor, so long as it is positive. If $\alpha$ does not depend in any fashion on $f$, then the convergence rate of (3) is independent of $f$. In Section 2.4 we also show a partial converse: for some simple problems, taking a $k$ not satisfying those assumptions results in sub-linear convergence for almost every path (except for one unique curve and its mirror).

Remark 2. There is an interesting connection to duality theory for a specific choice of $k$. In a slight abuse of representation, consider rewriting the original problem as

\[ \min_{x \in \mathbb{R}^d} f(x) = \min_{x \in \mathbb{R}^d} \frac{1}{2}\big(f(x) + f(x)\big). \]

The Fenchel dual of this problem is equivalent to the following problem after a small reparameterization of $p$ (see Chapter 31 of [47]),

\[ \max_{p \in \mathbb{R}^d} \frac{1}{2}\big(-f^*(p) - f^*(-p)\big). \]

The Fenchel duality theorem guarantees that for a given pair of primal-dual variables $(x, p) \in \mathbb{R}^d \times \mathbb{R}^d$, the duality gap between the primal objective $f(x)$ and the dual objective $(-f^*(p) - f^*(-p))/2$ is non-negative. Thus,

\[ f(x) - \frac{-f^*(p) - f^*(-p)}{2} = f(x) - f(x_\star) + \frac{f^*(p) + f^*(-p)}{2} + f(x_\star) = f(x) - f(x_\star) + \frac{f_c^*(p) + f_c^*(-p)}{2} \ge 0. \]

Thus, for the choice $k(p) = (f_c^*(p) + f_c^*(-p))/2$, which as we will show implies linear convergence of (3), the Hamiltonian $H(x, p)$ is exactly the duality gap between the primal and dual objectives.

Linear rates in continuous time can be derived by a Lyapunov function $V : \mathbb{R}^d \times \mathbb{R}^d \to [0, \infty)$ that summarizes the total energy of the system, contracts exponentially (or linearly in log-space), and is positive unless $(x_t, p_t) = (x_\star, 0)$. Ultimately we are trying to prove a result of the form $V_t' \le -\lambda V_t$ for some rate $\lambda > 0$. As the energy $H_t$ is decreasing, it suggests using $H_t$ as a Lyapunov function. Unfortunately, this will not suffice, as $H_t$ plateaus instantaneously ($H_t' = 0$) at points on the trajectory where $p_t = 0$ despite $x_t$ possibly being far from $x_\star$. However, when $p_t = 0$, the momentum field reduces to the term $-\nabla f(x_t)$ and the derivative of $\langle x_t - x_\star, p_t \rangle$ in $t$ is instantaneously strictly negative, $-\langle x_t - x_\star, \nabla f(x_t) \rangle < 0$ for convex $f$ (unless we are at $(x_\star, 0)$).
This suggests the family of Lyapunov functions that we study in this paper,

\[ V(x, p) = H(x, p) + \beta \langle x - x_\star, p \rangle, \tag{14} \]

where $\beta \in (0, \gamma)$ (see the next lemma for conditions that guarantee that it is non-negative). As with $H$, $V_t$ is used to indicate $V(x_t, p_t)$ at time $t$ along a solution to (3). Before moving on to the final lemma of the section, we prove two technical lemmas that will give us useful control over $V$ throughout the paper. The first lemma describes how $\beta$ must be constrained for $V$ to be positive and to track $H$ closely, so that it is useful for the analysis of the convergence of $H$ and ultimately $f$.

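Lemma 2.3 (Bounding the ratio of $H$ and $V$). Let $x \in \mathbb{R}^d$, $f : \mathbb{R}^d \to \mathbb{R}$ convex with unique minimum $x_\star$, $k : \mathbb{R}^d \to \mathbb{R}$ strictly convex with minimum $k(0) = 0$, $\alpha \in (0, 1]$ and $\beta \in (0, \alpha]$. If $p \in \mathbb{R}^d$ is such that $k(p) \ge \alpha f_c^*(-p)$, then

\[ \langle x - x_\star, p \rangle \ge -k(p)/\alpha - (f(x) - f(x_\star)) \ge -\frac{H(x, p)}{\alpha}, \tag{15} \]

\[ \frac{\alpha - \beta}{\alpha} H(x, p) \le V(x, p). \tag{16} \]

If $p \in \mathbb{R}^d$ is such that $k(p) \ge \alpha f_c^*(p)$, then

\[ \langle x - x_\star, p \rangle \le k(p)/\alpha + f(x) - f(x_\star) \le \frac{H(x, p)}{\alpha}, \tag{17} \]

\[ V(x, p) \le \frac{\alpha + \beta}{\alpha} H(x, p). \tag{18} \]

Proof. Assuming that $k(p) \ge \alpha f_c^*(-p)$, we have

\[ -k(p)/\alpha - f_c(x - x_\star) \le -f_c^*(-p) - f_c(x - x_\star) \le \langle x - x_\star, p \rangle + f_c(x - x_\star) - f_c(x - x_\star) = \langle x - x_\star, p \rangle, \]

where the second inequality is the Fenchel-Young inequality (5) applied to $f_c$ at $x - x_\star$ and $-p$; hence we have (15), and (16) follows by rearrangement. The proof of (17) and (18) is similar.

Lemma 2.3 constrains $\beta$ in terms of $\alpha$. For a result like $V_t' \le -\lambda V_t$, we will need to control $\beta$ in terms of the magnitude $\gamma$ of the dissipation field. The following lemma provides constraints on $\beta$ and, under those constraints, the optimal $\beta$. The proof can be found in Section A of the Appendix.

Lemma 2.4 (Convergence rates in continuous time for fixed $\alpha$). Given $\gamma \in (0, 1)$, $f : \mathbb{R}^d \to \mathbb{R}$ differentiable and convex with unique minimum $x_\star$, $k : \mathbb{R}^d \to \mathbb{R}$ differentiable and strictly convex with minimum $k(0) = 0$. Let $x_t, p_t \in \mathbb{R}^d$ be the value at time $t$ of a solution to the system (3) such that there exists $\alpha \in (0, 1]$ where $k(p_t) \ge \alpha f_c^*(-p_t)$. Define

\[ \lambda(\alpha, \beta, \gamma) = \min\left( \frac{\alpha\gamma - \alpha\beta - \beta\gamma}{\alpha - \beta}, \; \frac{\beta(1 - \gamma)}{1 - \beta} \right). \tag{19} \]

If $\beta \in (0, \min(\alpha, \gamma)]$, then

\[ V_t' \le -\lambda(\alpha, \beta, \gamma) V_t. \]

Finally,

1. The optimal $\beta \in (0, \min(\alpha, \gamma)]$, $\beta^\star = \arg\max_\beta \lambda(\alpha, \beta, \gamma)$ and $\lambda^\star = \lambda(\alpha, \beta^\star, \gamma)$ are given by

\[ \beta^\star = \frac{1}{1 + \alpha}\left( \alpha + \frac{\gamma}{2} - \sqrt{(1 - \gamma)\alpha^2 + \frac{\gamma^2}{4}} \right), \tag{20} \]

\[ \lambda^\star = \begin{cases} \dfrac{1}{1 - \alpha}\left( (1 - \gamma)\alpha + \dfrac{\gamma}{2} - \sqrt{(1 - \gamma)\alpha^2 + \dfrac{\gamma^2}{4}} \right) & \text{for } 0 < \alpha < 1, \\[2ex] \dfrac{\gamma(1 - \gamma)}{2 - \gamma} & \text{for } \alpha = 1. \end{cases} \tag{21} \]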

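2. If $\beta \in (0, \alpha\gamma/2]$, then

\[ \lambda(\alpha, \beta, \gamma) = \frac{\beta(1 - \gamma)}{1 - \beta}, \tag{22} \]

and

\[ -\left( \gamma - \beta - \frac{\gamma^2(1 - \gamma)}{4} \right) k(p_t) - \beta\gamma \langle x_t - x_\star, p_t \rangle - \beta \langle x_t - x_\star, \nabla f(x_t) \rangle \le -\beta(1 - \gamma)\big( k(p_t) + f(x_t) - f(x_\star) + \beta \langle x_t - x_\star, p_t \rangle \big). \tag{23} \]

These two lemmas are sufficient to prove the linear contraction of $V$ and the contraction $f(x_t) - f(x_\star) \le \frac{\alpha}{\alpha - \beta} H_0 \exp(-\lambda(\alpha, \beta, \gamma)\, t)$ under the assumption of constant $\alpha$ and $\beta$. Still, the constant $\alpha$, which controls our approximation of $f_c^*$, may be quite pessimistic if it must hold globally along $(x_t, p_t)$ as the system converges to its minimum. Instead, in the final lemma that collects the convergence result for this section, we consider the case where $\alpha$ may increase as convergence proceeds. To support an improving $\alpha$, our constant $\beta$ will now have to vary with time and we will be forced to take slightly suboptimal $\beta$ and $\lambda$ given by (22) of Lemma 2.4. Still, the improving $\alpha$ will be important in future sections for ensuring that we are able to achieve position independent step sizes.

We are now ready to present the central result of this section. Under Assumptions A we show linear convergence of (3). In general, the dependence of the rate of linear convergence on $f$ is via the function $\alpha$ and the constant $C_{\alpha,\gamma}$ in our analysis.

Assumptions A.

A.1 $f : \mathbb{R}^d \to \mathbb{R}$ differentiable and convex with unique minimum $x_\star$.

A.2 $k : \mathbb{R}^d \to \mathbb{R}$ differentiable and strictly convex with minimum $k(0) = 0$.

A.3 $\gamma \in (0, 1)$.

A.4 There exists some differentiable non-increasing convex function $\alpha : [0, \infty) \to (0, 1]$ and constant $C_{\alpha,\gamma} \in (0, \gamma]$ such that for every $p \in \mathbb{R}^d$,

\[ k(p) \ge \alpha(k(p)) \max(f_c^*(p), f_c^*(-p)), \tag{24} \]

and that for every $y \in [0, \infty)$,

\[ -C_{\alpha,\gamma}\, \alpha'(y)\, y < \alpha(y). \tag{25} \]

In particular, if $k(p) \ge \alpha \max(f_c^*(p), f_c^*(-p))$ for a constant $\alpha \in (0, 1]$, then the constant function $\alpha(y) = \alpha$ serves as a valid, but pessimistic choice.

Remark 3. Assumption A.4 can be satisfied if a symmetric lower bound on $f$ is known. For example, strong convexity implies

\[ f(x + x_\star) - f(x_\star) \ge \frac{\mu}{2} \|x\|_2^2. \]

This in turn implies $f_c^*(p) \le \|p\|_2^2/(2\mu)$. Because $k(p) = \|p\|_2^2/(2\mu)$ is symmetric, it satisfies A.4, which explains why conditions relating to strong convexity are necessary for linear convergence of Polyak's heavy ball.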
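The rate function (19) and the closed forms (20) and (21) of Lemma 2.4 lend themselves to a quick numerical sanity check. The sketch below encodes them as we read them from the lemma and compares the closed-form optimum with a brute-force grid search over $\beta$; the function names, the grid resolution, and the sample values of $\alpha$ and $\gamma$ are illustrative assumptions.

```python
import math


def rate(alpha, beta, gamma):
    """lambda(alpha, beta, gamma) from (19); needs 0 < beta < alpha and beta < 1."""
    first = (alpha * gamma - alpha * beta - beta * gamma) / (alpha - beta)
    second = beta * (1.0 - gamma) / (1.0 - beta)
    return min(first, second)


def optimal_beta(alpha, gamma):
    """beta_star from (20)."""
    root = math.sqrt((1.0 - gamma) * alpha ** 2 + gamma ** 2 / 4.0)
    return (alpha + gamma / 2.0 - root) / (1.0 + alpha)


def optimal_rate(alpha, gamma):
    """lambda_star from (21), including the separate alpha = 1 case."""
    if alpha == 1.0:
        return gamma * (1.0 - gamma) / (2.0 - gamma)
    root = math.sqrt((1.0 - gamma) * alpha ** 2 + gamma ** 2 / 4.0)
    return ((1.0 - gamma) * alpha + gamma / 2.0 - root) / (1.0 - alpha)


def grid_best(alpha, gamma, n=100_000):
    """Brute-force maximum of (19) over a beta grid inside (0, min(alpha, gamma))."""
    top = min(alpha, gamma)
    return max(rate(alpha, top * j / n, gamma) for j in range(1, n))


closed_half = optimal_rate(0.5, 0.5)
at_beta_star = rate(0.5, optimal_beta(0.5, 0.5), 0.5)
grid_half = grid_best(0.5, 0.5)
closed_one = optimal_rate(1.0, 0.5)
grid_one = grid_best(1.0, 0.5)
```

Since the first argument of the minimum in (19) is decreasing in $\beta$ and the second is increasing, the optimum sits where the two branches cross, which is what the grid search locates.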

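Theorem 2.5 (Convergence bound in continuous time with general $\alpha$). Given $f$, $k$, $\gamma$, $\alpha$, $C_{\alpha,\gamma}$ satisfying Assumptions A. Let $(x_t, p_t)$ be a solution to the system (3) with initial states $(x_0, p_0) = (x, 0)$ where $x \in \mathbb{R}^d$. Let $\alpha_\star = \alpha(3H_0)$, $\lambda = \frac{(1 - \gamma) C_{\alpha,\gamma}}{4}$, and $W : [0, \infty) \to [0, \infty)$ be the solution of

\[ W_t' = -\lambda\, \alpha(2W_t)\, W_t, \]

with $W_0 := H_0 = f(x_0) - f(x_\star)$. Then for every $t \in [0, \infty)$, we have

\[ f(x_t) - f(x_\star) \le 2H_0 \exp\left( -\lambda \int_0^t \alpha(2W_s)\, ds \right) \le 2H_0 \exp(-\lambda \alpha_\star t). \tag{26} \]

Proof. By (24) in assumption A.4, the conditions of Lemma 2.3 hold, and by (15) and (17) we have

\[ |\langle x_t - x_\star, p_t \rangle| \le k(p_t)/\alpha(k(p_t)) + f(x_t) - f(x_\star) \le \frac{H_t}{\alpha(k(p_t))}. \tag{27} \]

Instead of defining the Lyapunov function $V_t$ exactly as in (14) we take a time-dependent $\beta_t$. Specifically, for every $t \ge 0$ let $V_t$ be the unique solution $v$ of the equation

\[ v = H_t + \frac{C_{\alpha,\gamma}\, \alpha(2v)}{2} \langle x_t - x_\star, p_t \rangle \tag{28} \]

in the interval $v \in [H_t/2, 3H_t/2]$. To see why this equation has a unique solution in $v \in [H_t/2, 3H_t/2]$, note that from (27) it follows that $\alpha(2v)\, |\langle x_t - x_\star, p_t \rangle| \le H_t$ for every $v \ge \frac{H_t}{2}$, and hence for any such $v$, we have

\[ \frac{H_t}{2} \le H_t + \frac{C_{\alpha,\gamma}\, \alpha(2v)}{2} \langle x_t - x_\star, p_t \rangle \le \frac{3H_t}{2}. \tag{29} \]

This means that for $v = \frac{H_t}{2}$, the left hand side of (28) is smaller than the right hand side, while for $v = \frac{3H_t}{2}$, it is the other way around. Now using (25) in assumption A.4 and (27), we have for every such $v$

\[ \left| C_{\alpha,\gamma}\, \alpha'(2v) \langle x_t - x_\star, p_t \rangle \right| \le -C_{\alpha,\gamma}\, \alpha'(2v)\, \frac{2v}{\alpha(2v)} < 1. \tag{30} \]

Thus, by differentiation, we can see that (30) implies that

\[ \frac{\partial}{\partial v}\left( v - H_t - \frac{C_{\alpha,\gamma}}{2} \alpha(2v) \langle x_t - x_\star, p_t \rangle \right) > 0, \]

which implies that (28) has a unique solution $V_t$ in $\left[\frac{H_t}{2}, \frac{3H_t}{2}\right]$. Let $\alpha_t = \alpha(2V_t)$ and $\beta_t = \frac{C_{\alpha,\gamma}}{2} \alpha(2V_t)$. By the implicit function theorem, it follows that $V_t$ is differentiable in $t$. Moreover, since

\[ V_t = H_t + \frac{C_{\alpha,\gamma}\, \alpha(2V_t)}{2} \langle x_t - x_\star, p_t \rangle \tag{31} \]

for every $t \ge 0$, by differentiating both sides, we obtain that

\[ V_t' = -(\gamma - \beta_t) \langle \nabla k(p_t), p_t \rangle - \beta_t \gamma \langle x_t - x_\star, p_t \rangle - \beta_t \langle x_t - x_\star, \nabla f(x_t) \rangle + \beta_t' \langle x_t - x_\star, p_t \rangle. \]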

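The first three terms are equivalent to the temporal derivative of $V_t$ with constant $\beta = \beta_t$. Since $\alpha_t \le \alpha(k(p_t))$ and $\beta_t = C_{\alpha,\gamma}\alpha_t/2 \le \alpha_t\gamma/2$, the assumptions of Lemma 2.4 are satisfied locally for $\alpha_t$, $\beta_t$ and we get

\[ V_t' \le -\lambda(\alpha_t, \beta_t, \gamma) V_t + \beta_t' \langle x_t - x_\star, p_t \rangle = -\lambda(\alpha_t, \beta_t, \gamma) V_t + C_{\alpha,\gamma}\, \alpha'(2V_t) \langle x_t - x_\star, p_t \rangle V_t'. \]

Using (22) of Lemma 2.4 for $\alpha_t$, $\beta_t$, we have $\lambda(\alpha_t, \beta_t, \gamma) = \frac{\beta_t(1 - \gamma)}{1 - \beta_t} \ge \beta_t(1 - \gamma)$ and

\[ V_t' \le -\beta_t(1 - \gamma) V_t + C_{\alpha,\gamma}\, \alpha'(2V_t) \langle x_t - x_\star, p_t \rangle V_t'. \]

Using (30) we have $V_t' \le -\frac{\beta_t(1 - \gamma)}{2} V_t$. Notice that $V_0 = H_0$ since we have assumed that $p_0 = 0$, and the claim follows by Grönwall's inequality. The final inequality in (26) follows from the fact that $\alpha(2V_t) \ge \alpha(3H_0) = \alpha_\star$.

Figure 4: Importance of Assumptions A. Solutions $x_t$ and iterates $x_i$ of our first explicit method on $f(x) = x^4/4$ with two different choices of $k$, namely $k(p) = 3p^{4/3}/4$ and $k(p) = p^2/2$ (panels: objective $\log f(x_t)$, $\log f(x_i)$ and solution and iterates $x_t$, $x_i$ against time $t$). Notice that $f_c^*(p) = 3p^{4/3}/4$ and thus $k(p) = p^2/2$ cannot be made to satisfy assumption A.4.

2.4 Partial Lower Bounds

In this section we consider a partial converse of Theorem 2.5, showing in a simple setting that if the assumption $k(p) \ge \alpha \max(f_c^*(p), f_c^*(-p))$ of A.4 is violated, then the ODE (3) contracts sub-linearly. Figure 4 considers the example $f(x) = x^4/4$. If $k(p) = |p|^a/a$, then assumptions A cannot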

be satisfied for small $p$ unless $a \le 4/3$. Figure 4 shows that an inappropriate choice of $k(p) = p^2/2$ leads to sub-linear convergence both in continuous time and for one of the discretizations of Section 3. In contrast, the choice of $k(p) = 3p^{4/3}/4$ results in linear convergence, as expected.

Figure 5: Solutions to the Hamiltonian descent system with $f(x) = x^4/4$ and $k(p) = p^2/2$. The right plots show a numerical approximation of $(x_t^{(\eta)}, p_t^{(\eta)})$ and $(x_t^{(-\eta)}, p_t^{(-\eta)})$ (unique fast paths). The left plots show a numerical approximation of $(x_t^{(\theta)}, p_t^{(\theta)})$ and $(x_t^{(-\theta)}, p_t^{(-\theta)})$ for $\theta = \eta + \delta \in \mathbb{R}$, which represent typical paths. (Panels: momentum $p$ against position $x$, and $\log f(x_t)$ against time $t$.)

Let $b, a > 1$ and $\gamma > 0$. For $d = 1$ dimension, with the choice $f(x) := |x|^b/b$ and $k(p) := |p|^a/a$, (3) takes the following form,

\[ x_t' = |p_t|^{a-1} \operatorname{sign}(p_t), \qquad p_t' = -|x_t|^{b-1} \operatorname{sign}(x_t) - \gamma p_t. \tag{32} \]

Since $f(x)$ takes its minimum at 0, $(x_t, p_t)$ are expected to converge to $(0, 0)$ as $t \to \infty$. There is a trivial solution: $x_t = p_t = 0$ for every $t \in \mathbb{R}$. The following lemma shows an existence and uniqueness result for this equation. The proof is included in Section B of the Appendix.

Lemma 2.6 (Existence and uniqueness of solutions of the ODE). Let $a, b, \gamma \in (0, \infty)$. For every $t_0 \in \mathbb{R}$ and $(x, p) \in \mathbb{R}^2$, there is a unique solution $(x_t, p_t)_{t \in \mathbb{R}}$ of the ODE (32) with $x_{t_0} = x$, $p_{t_0} = p$. Either $x_t = p_t = 0$ for every $t \in \mathbb{R}$, or $(x_t, p_t) \ne (0, 0)$ for every $t \in \mathbb{R}$.
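The contrast between the two kinetic energies can be reproduced by integrating (32) numerically. The sketch below uses a classical fourth-order Runge-Kutta scheme with a small step, which is our choice of integrator for the continuous system and is not one of the paper's discretizations; $\gamma = 0.5$, the step size, the horizon, and the initial point are illustrative assumptions.

```python
import math


def field(x, p, a, b, gamma):
    """Right-hand side of (32) for f(x) = |x|^b / b and k(p) = |p|^a / a."""
    dx = math.copysign(abs(p) ** (a - 1.0), p)
    dp = -math.copysign(abs(x) ** (b - 1.0), x) - gamma * p
    return dx, dp


def rk4(x, p, a, b, gamma, dt, steps):
    """Classical fourth-order Runge-Kutta integration of (32)."""
    for _ in range(steps):
        k1x, k1p = field(x, p, a, b, gamma)
        k2x, k2p = field(x + 0.5 * dt * k1x, p + 0.5 * dt * k1p, a, b, gamma)
        k3x, k3p = field(x + 0.5 * dt * k2x, p + 0.5 * dt * k2p, a, b, gamma)
        k4x, k4p = field(x + dt * k3x, p + dt * k3p, a, b, gamma)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6.0
        p += dt * (k1p + 2 * k2p + 2 * k3p + k4p) / 6.0
    return x, p


b, gamma, dt, steps = 4.0, 0.5, 0.01, 20_000  # integrates up to time 200
x_conj, _ = rk4(1.0, 0.0, 4.0 / 3.0, b, gamma, dt, steps)  # 1/a + 1/b = 1
x_quad, _ = rk4(1.0, 0.0, 2.0, b, gamma, dt, steps)        # 1/a + 1/b < 1
f_conj = abs(x_conj) ** b / b
f_quad = abs(x_quad) ** b / b
```

With the conjugate exponent $a = 4/3$ the objective ends up many orders of magnitude below the value reached with the quadratic kinetic energy $a = 2$ over the same time horizon, in line with the linear versus sub-linear behaviour described in this section.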

Note that if $(x_t, p_t)$ is a solution, and $\Delta \in \mathbb{R}$, then $(x_{t+\Delta}, p_{t+\Delta})$ is also a solution (time translation), and $(-x_t, -p_t)$ is also a solution (central symmetry). Note also that $f^*(p) = f^*(-p) = |p|^{b^*}/b^*$ for $b^* := \left(1 - \frac{1}{b}\right)^{-1}$. Hence if $a \le b^*$, or equivalently, if $\frac{1}{b} + \frac{1}{a} \ge 1$, the conditions of Theorem 2.5 are satisfied for some $\alpha > 0$ (in particular, if $a = b^*$, then $\alpha = 1$ independently of $x_0$, $p_0$). Hence in such cases, the speed of convergence is linear. For $a > b^*$, $\lim_{p \to 0} \frac{k(p)}{f^*(p)} = 0$, so the conditions of Theorem 2.5 are violated.

Now we are ready to state the main result in this section, a proposition characterizing the convergence speeds of $(x_t, p_t)$ to $(0, 0)$ in this situation. The proof is included in Section B of the Appendix.

Proposition 2.7 (Lower bounds on the convergence rate in continuous time). Suppose that $\frac{1}{b} + \frac{1}{a} < 1$. For any $\theta \in \mathbb{R}$, we denote by $(x_t^{(\theta)}, p_t^{(\theta)})$ the unique solution of (32) with $x_0 = \theta$, $p_0 = 0$. Then there exists a constant $\eta \in (0, \infty)$ depending on $a$ and $b$ such that the path $(x_t^{(\eta)}, p_t^{(\eta)})$ and its mirrored version $(x_t^{(-\eta)}, p_t^{(-\eta)})$ satisfy

\[ |x_t^{(\eta)}| = |x_t^{(-\eta)}| \le O(\exp(-\alpha t)) \quad \text{for every } \alpha < \gamma(a - 1) \text{ as } t \to \infty. \]

For any path $(x_t, p_t)$ that is not a time translation of $(x_t^{(\eta)}, p_t^{(\eta)})$ or $(x_t^{(-\eta)}, p_t^{(-\eta)})$, we have

\[ |x_t|^{-1} = O\left( t^{\frac{1}{ba - b - a}} \right) \quad \text{as } t \to \infty, \]

so the speed of convergence is sub-linear and not linearly fast.

Figure 5 illustrates the two paths where the convergence is linearly fast for $a = 2$, $b = 4$. The main idea in the proof of Proposition 2.7 is that we establish the existence of a class of trapping sets, i.e. once the path of the ODE enters one of them, it never escapes. Convergence rates within such sets can be shown to be logarithmic, and it is established that only two paths (which are symmetric with respect to the origin) avoid each one of the trapping sets, and they have linear convergence rate.

3 Optimization Algorithms

In this section we consider three discretizations of the continuous system (3), one implicit and two explicit. For these discretizations we must assume more about the relationship between $f$ and $k$. The implicit method defines the iterates as the solution of a local subproblem. The first and second explicit methods are fully explicit, and we must again make stronger assumptions on $f$ and $k$.
The proofs of all of the results in this section are given in Section C of the Appendix.

3.1 Implicit Method

Consider the following discrete approximation $(x_i, p_i)$ to the continuous system, making the fixed $\epsilon > 0$ finite difference approximation $\frac{x_{i+1} - x_i}{\epsilon} = x_t'$ and $\frac{p_{i+1} - p_i}{\epsilon} = p_t'$, which approximates the field at the forward points,

$$\frac{x_{i+1} - x_i}{\epsilon} = \nabla k(p_{i+1}), \qquad \frac{p_{i+1} - p_i}{\epsilon} = -\gamma p_{i+1} - \nabla f(x_{i+1}). \tag{33}$$
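In one dimension the forward-point system (33) can be solved directly, which gives a concrete feel for one implicit step. The sketch below is an illustration under stated assumptions, not the subproblem formulation used in the paper: it eliminates $x_{i+1}$ from (33) and finds $p_{i+1}$ by bisection on a scalar residual, which is increasing in $p_{i+1}$ when $f$ and $k$ are convex and differentiable. The function name `implicit_step` and the example pair $f(x) = x^4/4$, $k(p) = \tfrac{3}{4}|p|^{4/3}$ are choices made here for illustration.

```python
import math

def implicit_step(x, p, grad_f, grad_k, eps, gamma, tol=1e-13):
    """One step of the forward-point scheme (33) in 1-D, solved by bisection."""
    delta = 1.0 / (1.0 + gamma * eps)

    def g(q):
        # Residual of the momentum equation in (33) after substituting
        # x_{i+1} = x_i + eps * grad_k(q); increasing in q for convex f, k.
        return q - delta * p + eps * delta * grad_f(x + eps * grad_k(q))

    # Grow a bracket [lo, hi] around the root until g changes sign.
    lo, hi = delta * p - 1.0, delta * p + 1.0
    while g(lo) > 0.0:
        lo -= hi - lo
    while g(hi) < 0.0:
        hi += hi - lo
    # Bisect until the bracket is narrower than tol.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:
            break
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    p_new = 0.5 * (lo + hi)
    return x + eps * grad_k(p_new), p_new

def grad_f(x):
    return x ** 3  # f(x) = x^4 / 4

def grad_k(p):
    return math.copysign(abs(p) ** (1.0 / 3.0), p)  # k(p) = (3/4)|p|^(4/3)

def total_energy(x, p):
    return x ** 4 / 4.0 + 0.75 * abs(p) ** (4.0 / 3.0)

if __name__ == "__main__":
    x, p = 1.0, 0.0
    for i in range(200):
        x, p = implicit_step(x, p, grad_f, grad_k, eps=0.05, gamma=0.5)
    print(f"energy after 200 steps: {total_energy(x, p):.3e}")
```

Because both fields are evaluated at the forward point, convexity of $f$ and $k$ implies that the total energy cannot increase from one iterate to the next, which is a convenient sanity check on any implementation.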

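Since $\nabla k^*(\nabla k(p)) = p$, this system of equations corresponds to the stationary condition of the following subproblem iteration, which we introduce as our implicit method.

Implicit Method. Given $f, k : \mathbb{R}^d \to \mathbb{R}$, $\epsilon, \gamma \in (0, \infty)$, $x_0, p_0 \in \mathbb{R}^d$. Let $\delta = (1 + \gamma\epsilon)^{-1}$ and

$$x_{i+1} = \arg\min_{x \in \mathbb{R}^d} \left\{ \epsilon k^*\!\left(\frac{x - x_i}{\epsilon}\right) + \epsilon\delta f(x) - \delta \langle p_i, x \rangle \right\}, \qquad p_{i+1} = \delta p_i - \epsilon\delta \nabla f(x_{i+1}). \tag{34}$$

The following lemma shows that the formulation (34) is well defined. The proof is included in Section C of the Appendix.

Lemma 3.1 (Well-definedness of the implicit scheme). Suppose that $f$ and $k$ satisfy assumptions A.1 and A.2, and $\epsilon, \gamma \in (0, \infty)$. Then (34) has a unique solution for every $x_i, p_i \in \mathbb{R}^d$, and this solution also satisfies (33).

As this discretization involves solving a potentially costly subproblem at each iteration, it requires a relatively light assumption on the compatibility of $f$ and $k$.

Assumptions B.

B.1 There exists $C_{f,k} \in (0, \infty)$ such that for all $x, p \in \mathbb{R}^d$,

$$|\langle \nabla f(x), \nabla k(p) \rangle| \le C_{f,k} H(x, p). \tag{35}$$

Remark 4. Smoothness of $f$ implies $\tfrac{1}{2}\|\nabla f(x)\|_2^2 \le L(f(x) - f(x_\star))$ (see [41]). Thus, if $f$ is smooth and $k(p) = \tfrac{1}{2}\|p\|_2^2$, then assumption B.1 can be satisfied by $C_{f,k} = \max\{1, L\}$, since

$$|\langle \nabla f(x), \nabla k(p) \rangle| \le \tfrac{1}{2}\|\nabla f(x)\|_2^2 + \tfrac{1}{2}\|\nabla k(p)\|_2^2 \le L(f(x) - f(x_\star)) + k(p).$$

The following proposition shows a convergence result for the implicit scheme.

Proposition 3.2 (Convergence bound for the implicit scheme). Given $f$, $k$, $\gamma$, $\alpha$, $C_{\alpha,\gamma}$, and $C_{f,k}$ satisfying assumptions A and B. Suppose that $\epsilon < \frac{1 - \gamma}{2\max(C_{f,k}, 1)}$. Let $\alpha_\star = \alpha(3H_0)$, and let $W_0 = f(x_0) - f(x_\star)$ and for $i \ge 0$,

$$W_{i+1} = W_i \left[1 + \epsilon C_{\alpha,\gamma}(1 - \gamma - 2C_{f,k}\epsilon)\,\alpha(2W_i)/4\right]^{-1}.$$

Then for any $(x_0, p_0)$ with $p_0 = 0$, the iterates of (33) satisfy for every $i \ge 0$,

$$f(x_i) - f(x_\star) \le 2W_i \le 2W_0 \left[1 + \epsilon C_{\alpha,\gamma}(1 - \gamma - 2C_{f,k}\epsilon)\,\alpha_\star/4\right]^{-i}.$$

Remark 5. Proposition 3.2 means that we can fix any step size $0 < \epsilon < \frac{1 - \gamma}{2\max(C_{f,k}, 1)}$ independently of the initial point, and have linear convergence with a contraction rate that is proportional to $\alpha(3H_0)$ initially and possibly increasing as we get closer to the optimum. In Section 4 we introduce kinetic energies $k(p)$ that behave like $\|p\|_2^a$ near $0$ and $\|p\|_2^A$ in the tails. We will show that for functions $f(x)$ that behave like $\|x - x_\star\|_2^b$ near their minima and $\|x - x_\star\|_2^B$ in the tails, the conditions of assumptions B are satisfied as long as $1/a + 1/b = 1$ and $1/A + 1/B \ge 1$. In particular, if we choose $k(p) = \sqrt{\|p\|_2^2 + 1} - 1$ (relativistic kinetic energy), then $a = 2$ and $A = 1$, and assumptions B can be shown to hold for every $f$ that has quadratic behavior near its minimum and no faster than exponential growth in the tails.

3.2 First Explicit Method, with Analysis via the Hessian of f

The following discrete approximation $(x_i, p_i)$ to the continuous system makes a similar finite difference approximation, $\frac{x_{i+1} - x_i}{\epsilon} = x_t'$ and $\frac{p_{i+1} - p_i}{\epsilon} = p_t'$ for $\epsilon > 0$. In contrast to the implicit method, it approximates the field at the point $(x_i, p_{i+1})$, making it fully explicit without any costly subproblem,

$$\frac{x_{i+1} - x_i}{\epsilon} = \nabla k(p_{i+1}), \qquad \frac{p_{i+1} - p_i}{\epsilon} = -\gamma p_{i+1} - \nabla f(x_i).$$

This method can be rewritten as our first explicit method.

First Explicit Method. Given $f, k : \mathbb{R}^d \to \mathbb{R}$, $\epsilon, \gamma \in (0, \infty)$, $x_0, p_0 \in \mathbb{R}^d$. Let $\delta = (1 + \gamma\epsilon)^{-1}$ and

$$p_{i+1} = \delta p_i - \epsilon\delta \nabla f(x_i), \qquad x_{i+1} = x_i + \epsilon \nabla k(p_{i+1}). \tag{36}$$

This discretization exploits the convexity of $k$ by approximating the continuous dynamics at the forward point $p_{i+1}$, but is made explicit by approximating at the backward point $x_i$. Because this method approximates the field at the backward point $x_i$, it requires a kind of smoothness assumption to prevent $\nabla f$ from changing too rapidly between iterates. This assumption is in the form of a condition on the Hessian of $f$, and thus we require twice differentiability of $f$ for the first explicit method. Because the accumulation of gradients of $f$ in the form of $p_i$ is modulated by $\nabla k$, this condition in fact expresses a requirement on the interaction between $\nabla k$ and $\nabla^2 f$; see assumption C.3.

Assumptions C.

C.1 There exists $C_k \in (0, \infty)$ such that for every $p \in \mathbb{R}^d$,

$$\langle \nabla k(p), p \rangle \le C_k k(p). \tag{37}$$

C.2 $f : \mathbb{R}^d \to \mathbb{R}$ convex with a unique minimum at $x_\star$ and twice continuously differentiable for every $x \in \mathbb{R}^d \setminus \{x_\star\}$.

C.3 There exists $D_{f,k} \in (0, \infty)$ such that for every $p \in \mathbb{R}^d$, $x \in \mathbb{R}^d \setminus \{x_\star\}$,

$$\langle \nabla k(p), \nabla^2 f(x) \nabla k(p) \rangle \le D_{f,k}\,\alpha(3H(x, p))\,H(x, p).$$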
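The update (36) of the first explicit method is short enough to state in code. The following is a minimal one-dimensional sketch; the function name `first_explicit_method`, the step size, and the damping value are illustrative assumptions rather than values prescribed by the analysis. It uses $f(x) = x^4/4$ with the conjugate kinetic energy $k(p) = \tfrac{3}{4}|p|^{4/3}$, the matched pair discussed around Figure 4.

```python
import math

def first_explicit_method(x, p, grad_f, grad_k, eps, gamma, steps):
    """Run `steps` iterations of the first explicit method (36) in 1-D."""
    delta = 1.0 / (1.0 + gamma * eps)
    for _ in range(steps):
        # Momentum update: gradient of f at the backward point x_i.
        p = delta * p - eps * delta * grad_f(x)
        # Position update: gradient of k at the forward point p_{i+1}.
        x = x + eps * grad_k(p)
    return x, p

def f(x):
    return x ** 4 / 4.0

def grad_f(x):
    return x ** 3

def grad_k(p):
    # Gradient of k(p) = (3/4)|p|^(4/3), the convex conjugate of f.
    return math.copysign(abs(p) ** (1.0 / 3.0), p)

if __name__ == "__main__":
    x, p = first_explicit_method(1.0, 0.0, grad_f, grad_k,
                                 eps=0.002, gamma=0.5, steps=50000)
    print(f"f(x_0) = {f(1.0):.3e}, f(x_N) = {f(x):.3e}")
```

Note the order of the two updates: the momentum is refreshed first so that the position step can use $p_{i+1}$, which is what makes the scheme explicit while still evaluating $\nabla k$ at the forward point.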

Remark 6. If $f$ is smooth and twice differentiable, then $\langle v, \nabla^2 f(x) v \rangle$ is everywhere bounded by $L$ for $v \in \mathbb{R}^d$ such that $\|v\|_2 = 1$ (see [41]). Thus, using $k(p) = \tfrac{1}{2}\|p\|_2^2$, this allows us to satisfy assumption C.3 with $D_{f,k} = \max\{1, 2L\}$, since

$$\langle \nabla k(p), \nabla^2 f(x) \nabla k(p) \rangle \le L\|\nabla k(p)\|_2^2 = 2Lk(p) \le f(x) - f(x_\star) + 2Lk(p).$$

Assumption C.1 is clearly satisfied in this case by $C_k = 2$.

The following proposition shows a convergence result for this discretization.

Proposition 3.3 (Convergence bound for the first explicit scheme). Given $f$, $k$, $\gamma$, $\alpha$, $C_{\alpha,\gamma}$, $C_{f,k}$, $C_k$, $D_{f,k}$ satisfying assumptions A, B, and C, and that

$$0 < \epsilon < \min\left( \frac{1 - \gamma}{2\max(C_{f,k} + 6D_{f,k}/C_{\alpha,\gamma},\, 1)},\ \frac{C_{\alpha,\gamma}}{10C_{f,k} + 5\gamma C_k} \right).$$

Let $\alpha_\star = \alpha(3H_0)$, $W_0 := f(x_0) - f(x_\star)$, and for $i \ge 0$, let

$$W_{i+1} = W_i \left(1 + \frac{\epsilon C_{\alpha,\gamma}}{4}\left[1 - \gamma - 2\epsilon(C_{f,k} + 6D_{f,k}/C_{\alpha,\gamma})\right]\alpha(2W_i)\right)^{-1}.$$

Then for any $(x_0, p_0)$ with $p_0 = 0$, the iterates (36) satisfy for every $i \ge 0$,

$$f(x_i) - f(x_\star) \le 2W_i \le 2W_0 \left(1 + \frac{\epsilon C_{\alpha,\gamma}}{4}\left[1 - \gamma - 2\epsilon(C_{f,k} + 6D_{f,k}/C_{\alpha,\gamma})\right]\alpha_\star\right)^{-i}.$$

Remark 7. Similar to Remark 5, Proposition 3.3 implies that, under suitable assumptions and position-independent step sizes, the first explicit method can achieve linear convergence with a contraction rate that is proportional to $\alpha(3H_0)$ initially and possibly increasing as we get closer to the optimum. In particular, again as remarked in Remark 5, for $f(x)$ that behave like $\|x - x_\star\|_2^b$ near their minima and $\|x - x_\star\|_2^B$ in the tails, the conditions of assumptions C can be satisfied for kinetic energies that grow like $\|p\|_2^a$ in the body and $\|p\|_2^A$ in the tails as long as $1/a + 1/b = 1$ and $1/A + 1/B \ge 1$. The distinction here is that for the first explicit method we will require $b, B \ge 2$.

3.3 Second Explicit Method, with Analysis via the Hessian of k

Our second explicit method inverts the relationship between $f$ and $k$ from the first. Again, it makes a fixed $\epsilon$ step approximation $\frac{x_{i+1} - x_i}{\epsilon} = x_t'$ and $\frac{p_{i+1} - p_i}{\epsilon} = p_t'$. In contrast to the implicit (33) and first explicit (36) methods, it approximates the field at the point $(x_{i+1}, p_i)$.

Second Explicit Method. Given $f, k : \mathbb{R}^d \to \mathbb{R}$, $\epsilon, \gamma \in (0, \infty)$, $x_0, p_0 \in \mathbb{R}^d$. Let

$$x_{i+1} = x_i + \epsilon \nabla k(p_i), \qquad p_{i+1} = (1 - \epsilon\gamma)p_i - \epsilon \nabla f(x_{i+1}). \tag{39}$$

This discretization exploits the convexity of $f$ by approximating the continuous dynamics at the forward point $x_{i+1}$, but is made explicit by approximating at the backward point $p_i$. As with the other explicit method, it requires a smoothness assumption to prevent $\nabla k$ from changing too rapidly between iterates, which is expressed as a requirement on the interaction between $\nabla f$ and $\nabla^2 k$; see assumption D.5. These assumptions can be satisfied for $k$ that have quadratic or higher power growth and are suitable for $f$ that may have unbounded second derivatives at their minima (for such $f$, Assumptions C cannot hold).
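The second explicit method (39) swaps which variable is advanced first. The following one-dimensional sketch is illustrative only; the function name `second_explicit_method`, the step size, and the damping value are assumptions made here. It uses $f(x) = \tfrac{2}{3}|x|^{3/2}$, whose second derivative is unbounded at the minimum, together with the conjugate kinetic energy $k(p) = |p|^3/3$ (so $b = 3/2$ and $a = b/(b-1) = 3$).

```python
import math

def second_explicit_method(x, p, grad_f, grad_k, eps, gamma, steps):
    """Run `steps` iterations of the second explicit method (39) in 1-D."""
    for _ in range(steps):
        # Position update: gradient of k at the backward point p_i.
        x = x + eps * grad_k(p)
        # Momentum update: gradient of f at the forward point x_{i+1}.
        p = (1.0 - eps * gamma) * p - eps * grad_f(x)
    return x, p

def f(x):
    return abs(x) ** 1.5 / 1.5

def grad_f(x):
    return math.copysign(math.sqrt(abs(x)), x)

def grad_k(p):
    # Gradient of k(p) = |p|^3 / 3, the convex conjugate of f.
    return p * abs(p)

if __name__ == "__main__":
    x, p = second_explicit_method(1.0, 0.0, grad_f, grad_k,
                                  eps=0.002, gamma=0.5, steps=50000)
    print(f"f(x_0) = {f(1.0):.3e}, f(x_N) = {f(x):.3e}")
```

Compared with the first explicit method, the damping appears as the explicit factor $(1 - \epsilon\gamma)$ rather than through $\delta = (1 + \gamma\epsilon)^{-1}$, because the momentum field is evaluated at the backward point $p_i$.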

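Assumptions D.

D.1 $k : \mathbb{R}^d \to \mathbb{R}$ strictly convex with minimum $k(0) = 0$ and twice continuously differentiable for every $p \in \mathbb{R}^d \setminus \{0\}$.

D.2 There exists $C_k \in (0, \infty)$ such that for every $p \in \mathbb{R}^d$,

$$\langle \nabla k(p), p \rangle \le C_k k(p). \tag{40}$$

D.3 There exists $D_k \in (0, \infty)$ such that for every $p \in \mathbb{R}^d \setminus \{0\}$,

$$\langle p, \nabla^2 k(p) p \rangle \le D_k k(p). \tag{41}$$

D.4 There exist $E_k, F_k \in (0, \infty)$ such that for every $p, q \in \mathbb{R}^d$,

$$k(p) - k(q) \le E_k k(q) + F_k \langle \nabla k(p) - \nabla k(q), p - q \rangle. \tag{42}$$

D.5 There exists $D_{f,k} \in (0, \infty)$ such that for every $x \in \mathbb{R}^d$, $p \in \mathbb{R}^d \setminus \{0\}$,

$$\langle \nabla f(x), \nabla^2 k(p) \nabla f(x) \rangle \le D_{f,k}\,\alpha(3H(x, p))\,H(x, p). \tag{43}$$

Remark 8. Smoothness of $f$ implies $\tfrac{1}{2}\|\nabla f(x)\|_2^2 \le L(f(x) - f(x_\star))$ (see [41]). Thus, if $f$ is smooth and $k(p) = \tfrac{1}{2}\|p\|_2^2$, then assumption D.5 can be satisfied by $D_{f,k} = \max\{1, 2L\}$, since $\nabla^2 k(p) = I$ and

$$\langle \nabla f(x), \nabla^2 k(p) \nabla f(x) \rangle = \|\nabla f(x)\|_2^2 \le 2L(f(x) - f(x_\star)) \le 2L(f(x) - f(x_\star) + k(p)).$$

The $k$-specific assumptions D.2 and D.3 can clearly be satisfied with $C_k = D_k = 2$ in this case. We show that D.4 can be satisfied in Section 4.

Proposition 3.4 (Convergence bound for the second explicit scheme). Given $f$, $k$, $\gamma$, $\alpha$, $C_{\alpha,\gamma}$, $C_{f,k}$, $C_k$, $D_k$, $D_{f,k}$, $E_k$, $F_k$ satisfying assumptions A, B, and D, and that

$$0 < \epsilon < \min\left( \frac{1 - \gamma}{2(C_{f,k} + 6D_{f,k}/C_{\alpha,\gamma})},\ \frac{1 - \gamma}{8D_k(1 + E_k)},\ \frac{C_{\alpha,\gamma}}{6(5C_{f,k} + 2\gamma C_k) + 12\gamma C_{\alpha,\gamma}},\ \sqrt{\frac{1}{6\gamma^2 D_k F_k}} \right).$$

Let $\alpha_\star = \alpha(3H_0)$, $W_0 := f(x_0) - f(x_\star)$, and for $i \ge 0$, let

$$W_{i+1} = W_i \left(1 + \frac{\epsilon C_{\alpha,\gamma}}{4}\left[1 - \gamma - 2\epsilon(C_{f,k} + 6D_{f,k}/C_{\alpha,\gamma})\right]\alpha(2W_i)\right)^{-1}.$$

Then for any $(x_0, p_0)$ with $p_0 = 0$, the iterates (39) satisfy for every $i \ge 0$,

$$f(x_i) - f(x_\star) \le 2W_i \le 2W_0 \left(1 + \frac{\epsilon C_{\alpha,\gamma}}{4}\left[1 - \gamma - 2\epsilon(C_{f,k} + 6D_{f,k}/C_{\alpha,\gamma})\right]\alpha_\star\right)^{-i}.$$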

[Figure 6 panels: "Objective", showing $\log f(x_t)$ and $\log f(x_i)$; "Solution & Iterates", showing $x_t$ and $x_i$; for $f(x) = x^4/4$ and $k(p) = \tfrac{7}{8}|p|^{8/7}$.]

Figure 6: Importance of discretization assumptions. Solutions $x_t$ and iterates $x_i$ of our first explicit method on $f(x) = x^4/4$. With an inappropriate choice of kinetic energy, $k(p) = \tfrac{7}{8}|p|^{8/7}$, the continuous solution converges at a linear rate but the iterates do not.

Remark 9. Similar to Remark 5, Proposition 3.4 implies that, under suitable assumptions and for a fixed step size independent of the initial point, the second explicit method can achieve linear convergence with a contraction rate that is proportional to $\alpha(3H_0)$ initially and possibly increasing as we get closer to the optimum. In particular, again as remarked in Remark 5, for $f(x)$ that behave like $\|x - x_\star\|_2^b$ near their minima and $\|x - x_\star\|_2^B$ in the tails, the conditions of assumptions D can be satisfied for kinetic energies that grow like $\|p\|_2^a$ in the body and $\|p\|_2^A$ in the tails as long as $1/a + 1/b = 1$ and $1/A + 1/B \ge 1$. The distinction here is that for the second explicit method we will require $b, B \le 2$.

To conclude the analysis of our methods on convex functions, consider the example $f(x) = x^4/4$ from Figure 4. If we take $k(p) = |p|^a/a$, then assumption A.4 requires that $a \le 4/3$. Assumptions B and C cannot be satisfied as long as $a < 4/3$, which suggests that $k(p) = f^*(p)$ is the only suitable choice in this case. Indeed, in Figure 6, we see that the choice of $k(p) = \tfrac{7}{8}|p|^{8/7}$ results in a system whose continuous dynamics converge at a linear rate and whose discrete dynamics fail to converge. Note that as the continuous system converges, the oscillation frequency increases dramatically, making it difficult for a fixed step size scheme to approximate.

3.4 First Explicit Method on Non-Convex f

We close this section with a brief analysis of the convergence of the first explicit method on non-convex $f$. A traditional requirement of discretizations is some degree of smoothness to prevent the function changing too rapidly between points of approximation.
The notion of Lipschitz smoothness is the standard one, but the use of the kinetic map $\nabla k$ to select iterates allows Hamiltonian descent methods to consider the broader definition of uniform smoothness, as discussed in [60, 2, 61] but specialized here for our purposes. Uniform smoothness is defined by a norm $\|\cdot\|$ and a convex non-decreasing function $\sigma : [0, \infty) \to [0, \infty]$ such that $\sigma(0) = 0$. A function $f : \mathbb{R}^d \to \mathbb{R}$ is $\sigma$-uniformly smooth if for all $x, y \in \mathbb{R}^d$,

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \sigma(\|y - x\|). \tag{44}$$

Lipschitz smoothness corresponds to $\sigma(t) = \tfrac{1}{2}t^2$, and generally speaking there exist non-trivial uniformly smooth functions for $\sigma(t) = \tfrac{1}{b}t^b$ for $1 < b \le 2$; see, e.g., [40, 60, 2, 61].

Assumptions E.

E.1 $f : \mathbb{R}^d \to \mathbb{R}$ differentiable.

E.2 $\gamma \in (0, \infty)$.

E.3 There exist a norm $\|\cdot\|$ on $\mathbb{R}^d$, $b \in (1, \infty)$, $D_k \in (0, \infty)$, $D_f \in (0, \infty)$, and $\sigma : [0, \infty) \to [0, \infty]$ non-decreasing convex such that $\sigma(0) = 0$ and $\sigma(ct) \le c^b \sigma(t)$ for $c, t \in (0, \infty)$; for all $p \in \mathbb{R}^d$,

$$\sigma(\|\nabla k(p)\|) \le D_k k(p); \tag{45}$$

and for all $x, y \in \mathbb{R}^d$,

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + D_f \sigma(\|y - x\|). \tag{46}$$

Lemma 3.5 (Convergence of the first explicit scheme without convexity). Given $\|\cdot\|$, $f$, $k$, $\gamma$, $b$, $D_k$, $D_f$, $\sigma$ satisfying assumptions E and A.2. If $\epsilon \in \left(0, \sqrt[b-1]{\gamma/(D_f D_k)}\,\right]$, then the iterates (36) of the first explicit method satisfy

$$H_{i+1} - H_i \le (\epsilon^b D_f D_k - \epsilon\gamma)\,k(p_{i+1}) \le 0, \tag{47}$$

and $\|\nabla f(x_i)\|_2 \to 0$.

Remark 10. $L$-Lipschitz continuity of the gradients, $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$ for $L > 0$ with Euclidean norm $\|\cdot\|_2$, implies both $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{L}{2}\|y - x\|_2^2$ and $\tfrac{1}{2}\|\nabla f(x)\|_2^2 \le L(f(x) - f(x_\star))$. Thus, if $f, k$ are $L_f, L_k$ smooth, respectively, then the condition for convergence simplifies to $\epsilon \le \gamma/(L_f L_k)$.

4 Kinetic Maps for Functions with Power Behavior

In this section we design a family of kinetic maps $\nabla k$ suitable for a class of functions $f$ that exhibit power growth, which we will describe precisely as a set of assumptions. This class includes strongly convex and smooth functions. However, it is much broader, including functions with possibly non-quadratic power behavior and singular or unbounded Hessians. First, we show that this family of kinetic energies satisfies the $k$-specific assumptions of Section 3. Then we use the generic analysis of Section 3 to provide a specific set of assumptions on $f$ and its match to the choice of $k$. As a consequence, this analysis greatly extends the class of functions for which linear convergence is possible with fixed step size first-order computation. Still, this analysis is not meant to be an exhaustive catalogue of possible kinetic energies for Hamiltonian descent. Instead, it serves as an example of how known properties of $f$ can be used to design $k$.
Note that, with a few exceptions, the proofs of all of our results in this section are deferred to Section D of the Appendix.

[Figure 7: three panels showing $\varphi_a^A(|x|)$ against $x$ for $a = 8/7$, $a = 2$, and $a = 8$, each with curves for different $A$, including $A = 8/7$ and $A = 2$.]

Figure 7: Power kinetic energies in one dimension.

4.1 Power Kinetic Energies

We assume a given norm $\|x\|$ and its dual $\|p\|_* = \sup\{\langle x, p \rangle : \|x\| \le 1\}$ for $x, p \in \mathbb{R}^d$. Define the family of power kinetic energies $k$,

$$k(p) = \varphi_a^A(\|p\|_*) \quad \text{where} \quad \varphi_a^A(t) = \frac{1}{A}\left(t^a + 1\right)^{A/a} - \frac{1}{A} \quad \text{for } t \in [0, \infty) \text{ and } a, A \in [1, \infty). \tag{48}$$

For $a = A$ we recover the standard power functions, $\varphi_a^a(t) = t^a/a$. For distinct $a \ne A$, we have $(\varphi_a^A)'(t) \sim t^{A-1}$ for large $t$ and $(\varphi_a^A)'(t) \sim t^{a-1}$ for small $t$. Thus, $k(p) \sim \|p\|_*^A/A$ as $\|p\|_* \to \infty$ and $k(p) \sim \|p\|_*^a/a$ as $\|p\|_* \to 0$. See Figure 7 for examples from this family in one dimension.

Broadly speaking, this family of kinetic energies must be matched in a conjugate fashion to the body and tail behavior of $f$. Informally, for this choice of $k$ we will require conditions on $f$ that correspond to requiring that it grows like $\|x - x_\star\|^b$ in the body (as $\|x - x_\star\| \to 0$) and $\|x - x_\star\|^B$ in the tails (as $\|x - x_\star\| \to \infty$) for some $b, B \in (1, \infty)$. In particular, our growth conditions in the case of $f$ growing like $\|x\|_2^2 = \langle x, x \rangle$ everywhere will be necessary conditions of strong convexity and smoothness. More generally, $a, A, b, B$ will be well-matched if $1/a + 1/b = 1/A + 1/B = 1$, but other scenarios are possible. Of these, the conjugate relationship between $a$ and $b$ is the most critical; it captures the asymptotic match between $f$ and $k$ as $(x_i, p_i) \to (x_\star, 0)$, and our analysis requires that $1/a + 1/b = 1$. The match between $A$ and $B$ is less critical. In the ideal case, $B$ is known and $A = B/(B-1)$. In this case, the discretizations will converge at a constant fast linear rate. If $B$ is not known, it suffices for $1/A + 1/B \ge 1$. The consequence of underestimating $A < B/(B-1)$ will be reflected in a linear, but non-constant, rate of convergence (via $\alpha$ of Assumption A.4), which depends on the initial $x_0$ and slowly improves towards a fast rate as the system converges and the regime switches.

We present a complete analysis and set of conditions on $f$ for two of the most useful scenarios. In Proposition 4.4 we consider the case that $f$ grows like $\varphi_b^B(\|x - x_\star\|)$ where $b, B > 1$ are exactly known.
In this case convergence proceeds at a fast constant linear rate when matched with $k(p) = \varphi_a^A(\|p\|_*)$ where $a = b/(b-1)$ and $A = B/(B-1)$. In Proposition 4.5 we consider the case that $f$ grows like $\varphi_2^B(\|x - x_\star\|_2)$ where $B \ge 2$ is unknown. Here, the convergence is linear with a non-constant rate when matched with the relativistic kinetic energy $k(p) = \varphi_2^1(\|p\|_2)$. The case covered by the relativistic kinetic energy is particularly valuable, as it covers a large class of globally non-smooth, but strongly convex functions. Table 1 summarizes this, and throughout the remaining subsections we flesh out the details of these claims.
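The scalar function $\varphi_a^A$ from (48) and its derivative are easy to evaluate directly, which makes the body and tail behavior described above concrete. The following is a small sketch; the names `phi` and `dphi` are hypothetical, and the derivative formula is obtained here by differentiating (48) rather than quoted from the paper.

```python
import math

def phi(t, a, A):
    """Power kinetic energy profile from (48): ((t^a + 1)^(A/a) - 1) / A."""
    return ((t ** a + 1.0) ** (A / a) - 1.0) / A

def dphi(t, a, A):
    """Derivative of phi in t: (t^a + 1)^(A/a - 1) * t^(a - 1)."""
    return (t ** a + 1.0) ** (A / a - 1.0) * t ** (a - 1.0)

if __name__ == "__main__":
    a, A = 2.0, 4.0
    for t in (1e-3, 1e3):
        body = t ** a / a   # expected behavior as t -> 0
        tail = t ** A / A   # expected behavior as t -> infinity
        print(f"t = {t:g}: phi/body = {phi(t, a, A) / body:.6f}, "
              f"phi/tail = {phi(t, a, A) / tail:.6f}")
    # With a = 2 and A = 1 the profile is the relativistic kinetic energy.
    print(phi(3.0, 2.0, 1.0), math.sqrt(3.0 ** 2 + 1.0) - 1.0)
```

For small $t$ the ratio against $t^a/a$ should be close to one, and for large $t$ the ratio against $t^A/A$ should be close to one, matching the stated asymptotics of $k$.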

                                $f(x)$ grows like $\varphi_b^B(\|x\|)$      appropriate $k(p) = \varphi_a^A(\|p\|_*)$
method         powers known?    body power $b$    tail power $B$    body power $a$    tail power $A$
implicit       known            $b > 1$           $B > 1$           $a = b/(b-1)$     $A = B/(B-1)$
               unknown          $b = 2$           $B \ge 2$         $a = 2$           $A = 1$
1st explicit   known            $b \ge 2$         $B \ge 2$         $a = b/(b-1)$     $A = B/(B-1)$
               unknown          $b = 2$           $B \ge 2$         $a = 2$           $A = 1$
2nd explicit   known            $1 < b \le 2$     $1 < B \le 2$     $a = b/(b-1)$     $A = B/(B-1)$

Table 1: A summary of the conditions on $f$ and power kinetic $k$ considered in this section that satisfy the assumptions of Section 3. Here "grows like" is an imprecise term meaning that $f$'s growth can be bounded in an appropriate way by $\varphi_b^B(\|x\|)$ ($\varphi_b^B$ is defined in (48)). The full precise assumptions on $f$ are laid out in Propositions 4.4 and 4.5. In particular, $b = B = 2$ corresponds to assumptions similar in spirit to strong convexity and smoothness. Other combinations of $b, B$ and $a, A$ are possible.

For these kinetic energies to be suitable in our analysis, they must at minimum satisfy assumptions A.2, C.1, D.1, D.3, and D.4. Assumptions C.1 and D.3 are clearly satisfied by $k(p) = |p|^a/a$ for $p \in \mathbb{R}$ with constants $C_k = a$ and $D_k = a(a-1)$. In the remainder of this subsection, we provide conditions on the norms and $a, A$ under which assumptions like these hold for $\varphi_a^A$ with multiple power behavior in any finite dimension.

In general, the problematic terms of $\nabla k(p)$ and $\nabla^2 k(p)$ that arise in high dimensions involve the gradient and Hessian of the norm. The gradient of the norm can be dealt with cleanly, but our analysis requires additional control on the Hessian of the norm. To control terms involving $\nabla^2 \|p\|_*$ we define a generalization of the maximum eigenvalue induced by the norm $\|\cdot\|$. Let $\lambda_{\max}^{\|\cdot\|} : \mathbb{R}^{d \times d} \to \mathbb{R}$ be the function defined by

$$\lambda_{\max}^{\|\cdot\|}(M) = \sup\{\langle v, Mv \rangle : v \in \mathbb{R}^d, \|v\| = 1\}. \tag{49}$$

For symmetric $M \in \mathbb{R}^{d \times d}$ and Euclidean $\|\cdot\|$ this is exactly the maximum eigenvalue of $M$. Now we are able to state our lemma analyzing power kinetic energies.

Lemma 4.1 (Verifying assumptions on k). Given a norm $\|p\|_*$ on $p \in \mathbb{R}^d$, $a, A \in [1, \infty)$, and $\varphi_a^A$ in (48). Define the constant $C_{a,A}$:

[Equation (50): definition of the constant $C_{a,A}$ in terms of $a$, $A$, $b$, and $B$.]

Then $k(p) = \varphi_a^A(\|p\|_*)$ satisfies the following.

1. Convexity. If $a > 1$ or $A > 1$, then $k$ is strictly convex with a unique minimum at $0 \in \mathbb{R}^d$.

2. Conjugate. For all $x \in \mathbb{R}^d$, $k^*(x) = (\varphi_a^A)^*(\|x\|)$.

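3. Gradient. If $\|p\|_*$ is differentiable at $p \in \mathbb{R}^d \setminus \{0\}$ and $a > 1$, then $k$ is differentiable for all $p \in \mathbb{R}^d$, and for all $p \in \mathbb{R}^d$,

$$\langle \nabla k(p), p \rangle \le \max\{a, A\}\,k(p), \tag{51}$$

$$(\varphi_a^A)^*(\|\nabla k(p)\|) \le (\max\{a, A\} - 1)\,k(p). \tag{52}$$

Additionally, if $a, A > 1$, define $B = A/(A-1)$, $b = a/(a-1)$, and then

$$\varphi_b^B(\|\nabla k(p)\|) \le C_{a,A}(\max\{a, A\} - 1)\,k(p). \tag{53}$$

Additionally, if $a, A \ge 2$, then for all $p, q \in \mathbb{R}^d$,

$$k(p) \le \langle \nabla k(q), q \rangle + \langle \nabla k(p) - \nabla k(q), p - q \rangle. \tag{54}$$

4. Hessian. If $\|p\|_*$ is twice continuously differentiable at $p \in \mathbb{R}^d \setminus \{0\}$, then $k$ is twice continuously differentiable for all $p \in \mathbb{R}^d \setminus \{0\}$, and for all $p \in \mathbb{R}^d \setminus \{0\}$,

$$\langle p, \nabla^2 k(p) p \rangle \le \max\{a, A\}(\max\{a, A\} - 1)\,k(p). \tag{55}$$

Additionally, if $a, A \ge 2$ and there exists $N \in [0, \infty)$ such that $\|p\|_* \lambda_{\max}^{\|\cdot\|}(\nabla^2 \|p\|_*) \le N$ for $p \in \mathbb{R}^d \setminus \{0\}$, then for all $p \in \mathbb{R}^d \setminus \{0\}$,

$$(\varphi_{a/2}^{A/2})^*\!\left(\frac{\lambda_{\max}^{\|\cdot\|}(\nabla^2 k(p))}{\max\{a, A\} - 1 + N}\right) \le (\max\{a, A\} - 2)\,k(p). \tag{56}$$

Remark 11. (51), (54), and (55) together directly confirm that these $k$ satisfy C.1, D.3, and D.4 with constants $C_k = \max\{a, A\}$, $D_k = \max\{a, A\}(\max\{a, A\} - 1)$, $E_k = \max\{a, A\} - 1$, and $F_k = 1$. The other results (52), (53), and (56) will be used in subsequent lemmas along with assumptions on $f$ to satisfy the remaining assumptions of discretization.

The assumption that $\|p\|_* \lambda_{\max}^{\|\cdot\|}(\nabla^2 \|p\|_*) \le N$ in Lemma 4.1 is satisfied by $b$-norms for $b \in [2, \infty)$, as the following lemma confirms. It implies that if $\|p\|_* = \|p\|_b$ for $b \ge 2$, we can take $N = b - 1$ in (56).

Lemma 4.2 (Bounds on $\lambda_{\max}^{\|\cdot\|}(\nabla^2 \|p\|)$ for $b$-norms). Given $b \in [2, \infty)$, let $\|x\|_b = \left(\sum_{n=1}^d |x^{(n)}|^b\right)^{1/b}$ for $x \in \mathbb{R}^d$. Then for $x \in \mathbb{R}^d \setminus \{0\}$,

$$\|x\|_b\,\lambda_{\max}^{\|\cdot\|_b}(\nabla^2 \|x\|_b) \le b - 1.$$

The remaining assumptions B.1, C.3, and D.5 involve inner products between derivatives of $f$ and $k$. To control these terms we will use the Fenchel-Young inequality. To this end, the conjugates of $\varphi_a^A$ will be a crucial component of our analysis.

Lemma 4.3 (Convex conjugates of $\varphi_a^A$). Given $a, A \in (1, \infty)$, and $\varphi_a^A$ in (48). Define $B = A/(A-1)$, $b = a/(a-1)$. The following hold.

1. Near Conjugate. $\varphi_b^B$ upper bounds the conjugate $(\varphi_a^A)^*$: for all $t \in [0, \infty)$,

$$(\varphi_a^A)^*(t) \le \varphi_b^B(t).$$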
