arXiv: v1 [cs.LG] 5 Nov 2018


Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations

Qianxiao Li (liqix@ihpc.a-star.edu.sg)
Institute of High Performance Computing, Agency for Science, Technology and Research, 1 Fusionopolis Way, Connexis North, Singapore

Cheng Tai (chengtai@pku.edu.cn)
Beijing Institute of Big Data Research and Peking University, Beijing, China, 100080

Weinan E (weinan@math.princeton.edu)
Princeton University, Princeton, NJ 08544, USA
Beijing Institute of Big Data Research and Peking University, Beijing, China

Editor:

Abstract

We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, where the latter are approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as a weak approximation, which leads to a number of precise and useful results on the approximations of stochastic gradient descent (SGD), momentum SGD and the stochastic Nesterov accelerated gradient method in the general setting of stochastic objectives. We also demonstrate through explicit calculations that this continuous-time approach can uncover important analytical insights into the stochastic gradient algorithms under consideration that may not be easy to obtain in a purely discrete-time setting.

Keywords: stochastic gradient algorithms, modified equations, stochastic differential equations, momentum, Nesterov accelerated gradient

1. Introduction

Stochastic gradient algorithms (SGA) are often used to solve optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E} f_\gamma(x), \tag{1.1}$$

where $\{f_r : r \in \Gamma\}$ is a family of functions from $\mathbb{R}^d$ to $\mathbb{R}$ and $\gamma$ is a $\Gamma$-valued random variable, with respect to which the expectation is taken (these notions will be made precise in the following sections). For empirical loss minimization in supervised learning applications, $\gamma$ is usually a uniform random variable taking values in $\Gamma = \{1, 2, \dots, n\}$. In this case, $f$ is the total empirical loss function and $f_r$, $r \in \Gamma$, is the loss function due to the $r$-th training sample. In this paper, we shall consider the general situation of an expectation over arbitrary index sets and distributions.

Solving (1.1) using the standard gradient descent (GD) on $x$ gives the iteration scheme

$$x_{k+1} = x_k - \eta \nabla \mathbb{E} f_\gamma(x_k), \tag{1.2}$$

for $k \geq 0$, where $\eta$ is a small positive step-size known as the learning rate. Note that this requires the evaluation of the gradient of an expectation, which can be costly (in the empirical risk minimization case, this happens when $n$ is large). In its simplest form, the stochastic gradient descent (SGD) algorithm replaces the expectation of the gradient with a sampled gradient, i.e.

$$x_{k+1} = x_k - \eta \nabla f_{\gamma_k}(x_k), \tag{1.3}$$

where each $\gamma_k$ is an independent and identically distributed (i.i.d.) random variable with the same distribution as $\gamma$. Under mild conditions, we then have $\mathbb{E}[\nabla f_{\gamma_k}(x_k) \mid x_k] = \nabla \mathbb{E} f_\gamma(x_k)$. In other words, (1.3) is a sampled version of (1.2).

In the literature, many convergence results are available for SGD and its variants (Shamir and Zhang, 2013; Moulines and Bach, 2011; Needell et al., 2014; Xiao and Zhang, 2014; Shalev-Shwartz and Zhang, 2014; Bach and Moulines, 2013; Défossez and Bach, 2015). However, it is often the case that different analysis techniques must be adopted for different variants of the algorithms, and a systematic approach to studying their precise dynamical properties has generally been lacking. In Li et al. (2015), a general approach was introduced to address this problem, in which discrete-time stochastic gradient algorithms are approximated by continuous-time stochastic differential equations with the noise term depending on a small parameter (the learning rate). This can be viewed as a generalization of the method of modified equations (Hirt, 1968; Noh and Protter, 1960; Daly, 1963; Warming and Hyett, 1974) to the stochastic setting, and allows one to employ tools from stochastic calculus to systematically analyze the dynamics of stochastic gradient algorithms. The stochastic modified equations (SME) approach was further developed in Li et al. (2017), where a weak approximation result for the SGD was proved in a finite-sum-objective setting.
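As an illustration of how (1.3) is a sampled version of (1.2), the following is a minimal sketch for a finite-sum least-squares objective. The loss $f_r(x) = \tfrac12 (a_r \cdot x - y_r)^2$, the synthetic data and all function names are assumptions made for this example only and do not come from the paper.

```python
import numpy as np

def make_problem(n=200, d=3, seed=0):
    # Hypothetical synthetic data: rows a_r of A and targets y_r.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(n, d)), rng.normal(size=n)

def grad_sample(x, A, y, r):
    # Gradient of the assumed sample loss f_r(x) = 0.5 * (a_r . x - y_r)^2.
    return (A[r] @ x - y[r]) * A[r]

def grad_full(x, A, y):
    # Gradient of f(x) = E f_gamma(x) with gamma uniform on {0, ..., n-1}.
    return A.T @ (A @ x - y) / len(y)

def gd_step(x, A, y, eta):
    # One step of (1.2): uses the gradient of the full expectation.
    return x - eta * grad_full(x, A, y)

def sgd_step(x, A, y, eta, rng):
    # One step of (1.3): uses the gradient of a single sampled f_r.
    r = rng.integers(len(y))
    return x - eta * grad_sample(x, A, y, r)

if __name__ == "__main__":
    A, y = make_problem()
    rng = np.random.default_rng(1)
    x_gd = x_sgd = np.ones(A.shape[1])
    for _ in range(500):
        x_gd = gd_step(x_gd, A, y, eta=0.05)
        x_sgd = sgd_step(x_sgd, A, y, eta=0.05, rng=rng)
    print("GD iterate: ", x_gd)
    print("SGD iterate:", x_sgd)
```

Averaging `grad_sample` over all indices reproduces `grad_full`, which is the unbiasedness property stated above; the SGD iterate fluctuates around the GD iterate with a spread controlled by the learning rate.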
The present series of papers builds on the earlier work of Li et al. (2015, 2017) and aims to establish the framework of stochastic modified equations and their applications in greater generality and depth, and to highlight the advantages of this systematic framework for studying stochastic gradient algorithms using continuous-time methods. As the first in the series, this paper will focus on mathematical aspects, namely the main approximation theorems relating stochastic gradient algorithms to stochastic modified equations in the form of weak approximations. These generalize the approximation results in Li et al. (2017) in various aspects. In a subsequent paper in the series, we will discuss the application of this formalism to adaptive stochastic gradient algorithms and related problems.

The organization of this paper is as follows. We first discuss related work in Sec. 2, especially in the context of continuous-time approximations. Next, we motivate the SME approach and set up the precise mathematical framework in Sec. 3. We then prove in Sec. 4 a central result relating discrete stochastic algorithms and continuous stochastic processes, which allows us to derive SMEs for stochastic gradient descent and variants. In Sec. 5, the SME approach is used to analyze the dynamics of stochastic gradient algorithms when applied to optimize a simple yet non-trivial objective. Lastly, we conclude with some discussion of our results in Sec. 6.

The longer proofs of the results used in the paper are organized in the appendix. These are essentially self-contained, but basic knowledge of stochastic calculus and probability theory is assumed. Unfamiliar readers may refer to standard introductory texts, such as Durrett (2010) and Oksendal (2013).

1.1 Notation

In this paper, we adhere wherever possible to the following notation. Dimensional indices are written as subscripts with a bracket to avoid confusion with other sequential indices (e.g. time, iteration number), which do not have brackets. When more than one index is present, we separate them with a comma, e.g. $x_{k,(i)}$ is the $i$-th coordinate of the vector $x_k$, the $k$-th member of a sequence. We adopt the Einstein summation convention, where repeated (spatial) indices are summed, i.e. $x_{(i)} x_{(i)} := \sum_{i=1}^{d} x_{(i)} x_{(i)}$. For a matrix $A$, we denote by $\lambda(A) = \{\lambda_1(A), \lambda_2(A), \dots\}$ the set of eigenvalues of $A$. If $A$ is Hermitian, then the eigenvalues are ordered so that $\lambda_1(A)$ denotes a maximum eigenvalue.

We denote the usual Euclidean norm by $|\cdot|$ and for higher rank tensors, we use the same notation to denote the flattened vector norm (e.g. for matrices it will be the Frobenius norm). The symbol $\wedge$ denotes the minimum operator, i.e. $a \wedge b := \min(a, b)$.

For a probability space (or generally, a measure space) $(\Omega, \mathcal{F}, \mathbb{P})$, the symbol $L^p(\Omega, \mathcal{F}, \mathbb{P})$, $p \in [1, \infty)$, denotes the usual Lebesgue spaces, i.e. $u \in L^p(\Omega, \mathcal{F}, \mathbb{P})$ if

$$\|u\|^p_{L^p(\Omega, \mathcal{F}, \mathbb{P})} := \int_\Omega |u(\omega)|^p \, d\mathbb{P}(\omega) \equiv \mathbb{E}|u|^p < \infty.$$

When the underlying probability space is obvious, we use the shorthand $L^p(\Omega) \equiv L^p(\Omega, \mathcal{F}, \mathbb{P})$. In addition, when $\Omega = \mathbb{R}^d$, we also write the local $L^p$ spaces as $L^p_{\mathrm{loc}}(\mathbb{R}^d)$, which contains $u$ for which $|u|^p$ is integrable on compact subsets of $\mathbb{R}^d$.

Finally, we note that in the proofs of various results, we typically use the letter $C$ (whose value may change across results) to denote a generic positive constant. This is usually independent of the learning rate $\eta$, but if not explicitly stated otherwise, it may depend on e.g. Lipschitz constants, ambient dimensions, etc.

2. Related work

In this section, we discuss several related works on analyzing discrete-time algorithms using continuous-time approaches. The idea of approximating discrete-time stochastic algorithms by continuous equations dates back to the large body of work known as stochastic approximation theory (Kushner and Yin, 2003; Ljung et al., 2012). These typically establish law of large numbers type results where the limiting equation is an ODE, which can then be used to prove powerful convergence results for the stochastic algorithms under consideration. A notion of convergence in distribution, similar to a central limit theorem, was also studied for the purpose of estimating the rate of convergence of the ODE methods (Kushner, 1978; Kushner and Shwartz, 1984; Kushner and Clark, 2012), where connections between leading order perturbations and Ornstein-Uhlenbeck (OU) processes are established. However, these estimates are not used to systematically study the dynamics of stochastic gradient algorithms.

As far as the authors are aware, the first works on using stochastic differential equations to study the precise properties of stochastic gradient algorithms are the independent works of Li et al. (2015) and Mandt et al. (2015). In Li et al. (2015), a systematic framework of SDE approximation of SGD and SGD with momentum is derived and applied to study dynamical properties of the stochastic algorithms as well as adaptive parameter tuning schemes. These go beyond OU process approximations, and this distinction is important since the OU process is not always the appropriate stochastic approximation in general settings (see Sec. 4.2 of this paper). In Mandt et al. (2015), a similar procedure is employed to derive an SDE approximation for the SGD, from which issues such as the choice of learning rates are studied. Although the concrete analysis in Mandt et al. (2015) is on the restricted case of constant diffusion matrices leading to OU processes, the essential ideas on the general leading order approximation are also discussed. It is important to note that the approximation arguments in both Li et al. (2015) and Mandt et al. (2015) are heuristic from a mathematical point of view. In Li et al. (2017), the SME approximation is rigorously proved in the finite-sum-objective case with strong regularity conditions, and further asymptotic analysis and tuning algorithms are studied. The SME approach has subsequently been utilized to study variants of stochastic gradient algorithms, including those in the distributed optimization setting (An et al., 2018). The work of Mandt et al. (2015) is further developed in Mandt et al. (2016, 2017), with applications such as the development of scalable MCMC algorithms.

The present paper builds on the earlier work of Li et al. (2015, 2017), but focuses on extending and solidifying the mathematical aspects. In particular, we present an entirely rigorous and self-contained mathematical formulation of the SME framework that applies to more general algorithms (including momentum SGD and the stochastic Nesterov accelerated gradient method) and more general objectives (expectation over random functions, instead of just a finite-sum).
Moreover, various regularity conditions in Li et al. (2017) have been relaxed. The main approximation procedure is inspired by the seminal works of Milstein (1986, 1975) in numerical analysis of stochastic differential equations, but lower regularity conditions are required in our case due to the presence of the small noise parameter, which allows for better truncation of Itô-Taylor expansions. The mathematical analysis of the SME-type approximation for the SGD was also performed in Feng et al. (2017); Hu et al. (2017) using semi-group approaches, although the smoothness requirements presented there are greater than those established using the current methods. Lastly, the Nesterov accelerated gradient SME we derive in Sec. 4.4 can be viewed as a generalization of the ODE approach in Su et al. (2014) to stochastic gradients, and we show that the presence of noise gives additional features to the dynamics. Finally, we note that continuous-time approximations that establish links between optimization, calculus of variations and symplectic integration have been studied in Wibisono et al. (2016); Betancourt et al. (2018).

3. Stochastic modified equations

We now introduce the stochastic modified equations framework. The starting motivation is the observation that the GD iteration is an (Euler) discretization of the continuous-time ordinary differential equation

$$\frac{dx_t}{dt} = -\nabla f(x_t), \tag{3.1}$$

and studying (3.1) can give us important insights into the dynamics of the discrete-time algorithm for small enough learning rates. The natural question when extending this to SGD is: what is the right continuous-time equation to consider? Below, we begin with some heuristic considerations.

3.1 Heuristic motivation

We rewrite the SGD iteration (1.3) as

$$x_{k+1} = x_k - \eta \nabla f(x_k) + \sqrt{\eta}\, V_k(x_k, \gamma_k), \tag{3.2}$$

where $V_k(x_k, \gamma_k) = \sqrt{\eta}\,(\nabla f(x_k) - \nabla f_{\gamma_k}(x_k))$ is a $d$-dimensional random vector. A straightforward calculation shows that

$$\mathbb{E}[V_k \mid x_k] = 0, \qquad \mathrm{cov}[V_k, V_k \mid x_k] = \eta \Sigma(x_k),$$
$$\Sigma(x_k) := \mathbb{E}\big[(\nabla f_{\gamma_k}(x_k) - \nabla f(x_k))(\nabla f_{\gamma_k}(x_k) - \nabla f(x_k))^T \mid x_k\big], \tag{3.3}$$

i.e. conditional on $x_k$, $V_k(x_k)$ has mean $0$ and covariance $\eta \Sigma(x_k)$. Here, $\Sigma$ is simply the conditional covariance of the stochastic gradient approximation $\nabla f_\gamma$ of $\nabla f$.

Now, consider a time-homogeneous Itô stochastic differential equation (SDE) of the form

$$dX_t = b(X_t)\,dt + \sqrt{\eta}\, \sigma(X_t)\, dW_t, \tag{3.4}$$

where $X_t \in \mathbb{R}^d$ for $t \geq 0$ and $W_t$ is a standard $d$-dimensional Wiener process. The function $b : \mathbb{R}^d \to \mathbb{R}^d$ is known as the drift and $\sigma : \mathbb{R}^d \to \mathbb{R}^{d \times d}$ is the diffusion matrix. The key observation is that if we apply the Euler discretization with step-size $\eta$ to (3.4), approximating $X_{k\eta}$ by $\hat{X}_k$, we obtain the following discrete iteration for the latter:

$$\hat{X}_{k+1} = \hat{X}_k + \eta\, b(\hat{X}_k) + \eta\, \sigma(\hat{X}_k) Z_k, \tag{3.5}$$

where $Z_k := \eta^{-1/2}(W_{(k+1)\eta} - W_{k\eta})$ are $d$-dimensional i.i.d. standard normal random variables. Comparing with (3.2), if we set $b = -\nabla f$, $\sigma(x) = \Sigma(x)^{1/2}$ and identify $t$ with $k\eta$, we then have matching first and second conditional moments. Hence, this motivates the approximating equation

$$dX_t = -\nabla f(X_t)\,dt + (\eta \Sigma(X_t))^{1/2}\, dW_t. \tag{3.6}$$

Note that, as this heuristic argument shows, the presence of the small parameter $\sqrt{\eta}$ on the diffusion term is necessary to model the fact that when the learning rate decreases, the fluctuations of the SGA iterates must also decrease.

The immediate mathematical question is then: in what sense is an SDE like (3.6) an approximation of (1.3)? Let us now establish the precise mathematical framework in which we can answer this question.
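The moment-matching heuristic can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: it takes the one-dimensional objective $f_\gamma(x) = \tfrac12 (x - \gamma)^2$ with $\gamma$ uniform on $\{-1, +1\}$ (so $\nabla f(x) = x$ and $\Sigma(x) = 1$), simulates the SGD iteration (1.3), and compares it with an Euler-Maruyama simulation of (3.6), which here reads $dX_t = -X_t\,dt + \sqrt{\eta}\,dW_t$. The toy objective, the function names and the use of fine sub-steps to approximate the continuous process are all choices made for this example.

```python
import numpy as np

def simulate_sgd(x0, eta, n_steps, n_paths, rng):
    # SGD on the assumed toy loss f_gamma(x) = 0.5 * (x - gamma)^2,
    # gamma uniform on {-1, +1}; grad f_gamma(x) = x - gamma.
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        gamma = rng.choice([-1.0, 1.0], size=n_paths)
        x = x - eta * (x - gamma)
    return x

def simulate_sme(x0, eta, n_steps, n_paths, rng, substeps=20):
    # Euler-Maruyama for dX = -X dt + sqrt(eta * Sigma) dW with Sigma = 1,
    # using `substeps` fine steps per SGD step of length eta.
    dt = eta / substeps
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps * substeps):
        dw = rng.normal(scale=np.sqrt(dt), size=n_paths)
        x = x - x * dt + np.sqrt(eta) * dw
    return x

def weak_error(eta, T=2.0, n_paths=100_000, seed=0):
    # Monte Carlo estimate of |E g(x_N) - E g(X_T)| for the test
    # function g(x) = x^2, with N = round(T / eta).
    rng = np.random.default_rng(seed)
    n_steps = int(round(T / eta))
    xs = simulate_sgd(1.0, eta, n_steps, n_paths, rng)
    Xs = simulate_sme(1.0, eta, n_steps, n_paths, rng)
    return abs(np.mean(xs ** 2) - np.mean(Xs ** 2))

if __name__ == "__main__":
    for eta in (0.2, 0.1, 0.05):
        print(f"eta = {eta:5.2f}   weak error estimate = {weak_error(eta):.5f}")
```

Note that the SGD noise in this example is two-valued rather than Gaussian, yet the second moments of the two processes stay close; the comparison is between expectations of a test function, not between individual sample paths. The printed estimates include Monte Carlo error, so they should be read as indicative only.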

3.2 The mathematical framework

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a sufficiently rich probability space and $(\Gamma, \mathcal{F}_\Gamma)$ be a measure space representing the index space for our random objectives. Let $\gamma : \Omega \to \Gamma$ be a random variable and $(r, x) \mapsto f_r(x)$ a measurable mapping from $\Gamma \times \mathbb{R}^d$ to $\mathbb{R}$. Hence, for each $x$, $f_\gamma(x)$ is a random variable. Throughout this paper, we assume the following facts about $f_\gamma(x)$:

Assumption 3.1 The random variable $f_\gamma(x)$ satisfies
(i) $f_\gamma(x) \in L^1(\Omega)$ for all $x \in \mathbb{R}^d$;
(ii) $f_\gamma(x)$ is continuously differentiable in $x$ almost surely and for each $R > 0$, there exists a random variable $M_{R,\gamma}$ such that $\max_{|x| \leq R} |\nabla f_\gamma(x)| \leq M_{R,\gamma}$ almost surely, with $\mathbb{E}|M_{R,\gamma}| < \infty$;
(iii) $\nabla f_\gamma(x) \in L^2(\Omega)$ for all $x \in \mathbb{R}^d$.

Note that in the empirical risk minimization case where $\Gamma$ is finite, the conditions above are often trivially satisfied. Condition (i) in Assumption 3.1 allows us to define the total objective function we would like to minimize as the expectation

$$f(x) := \mathbb{E} f_\gamma(x) \equiv \int_\Omega f_{\gamma(\omega)}(x)\, d\mathbb{P}(\omega). \tag{3.7}$$

Moreover, Assumption 3.1 (ii) implies via the dominated convergence theorem that $\nabla \mathbb{E} f_\gamma = \mathbb{E} \nabla f_\gamma \equiv \nabla f$. Now, let $\{\gamma_k : k = 0, 1, \dots\}$ be a sequence of i.i.d. $\Gamma$-valued random variables with the same distribution as $\gamma$. Let $x_0 \in \mathbb{R}^d$ be fixed and define the generalized stochastic gradient iteration as the stochastic process

$$x_{k+1} = x_k + \eta\, h(x_k, \gamma_k, \eta) \tag{3.8}$$

for $k \geq 0$, where $h : \mathbb{R}^d \times \Gamma \times \mathbb{R} \to \mathbb{R}^d$ is a measurable function and $\eta > 0$ is the learning rate. In the simple case of SGD, we have $h(x, r, \eta) = -\nabla f_r(x)$, but we shall consider the generalized version above so that modified equations for SGD variants can also be derived from our approximation theorems.

Next, let us define the class of approximating continuous stochastic processes, which we call stochastic modified equations. Consider the time-homogeneous Itô diffusion process $\{X_t : t \geq 0\}$ represented by the following stochastic differential equation (SDE)

$$dX_t = b(X_t, \eta)\,dt + \sqrt{\eta}\, \sigma(X_t, \eta)\, dW_t, \qquad X_0 = x_0, \tag{3.9}$$

where $\{W_t : t \geq 0\}$ is a standard $d$-dimensional Wiener process independent of $\{\gamma_k\}$, $b : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ is the approximating drift vector and $\sigma : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^{d \times d}$ is the approximating diffusion matrix.
In the following, we will need to pick $b, \sigma$ appropriately so that (3.8) is approximated by (3.9), the sense of which we now describe. First, notice that the stochastic process $\{x_k\}$ induces a probability measure on the product space $\mathbb{R}^d \times \mathbb{R}^d \times \cdots$, whereas $\{X_t\}$ induces a probability measure on $C^0([0, \infty), \mathbb{R}^d)$. Hence, we can only compare their values by sampling a discrete number of points from the latter. Second, the process $\{x_k\}$ is adapted to the filtration generated by $\{\gamma_k\}$ (e.g. in the case of SGD, this is the random sampling of functions in $\{f_r\}$), whereas the process $\{X_t\}$ is adapted to an independent, Wiener filtration. Hence, it is not appropriate to compare individual sample paths. Rather, we define below a sense of weak approximations by comparing the distributions of the two processes.

Definition 1 Let $G$ denote the set of continuous functions $\mathbb{R}^d \to \mathbb{R}$ of at most polynomial growth, i.e. $g \in G$ if there exist positive integers $\kappa_1, \kappa_2 > 0$ such that

$$|g(x)| \leq \kappa_1 (1 + |x|^{2\kappa_2}),$$

for all $x \in \mathbb{R}^d$. Moreover, for each integer $\alpha \geq 1$ we denote by $G^\alpha$ the set of $\alpha$-times continuously differentiable functions $\mathbb{R}^d \to \mathbb{R}$ which, together with their partial derivatives up to and including order $\alpha$, belong to $G$. Note that each $G^\alpha$ is a subspace of $C^\alpha$, the usual space of $\alpha$-times continuously differentiable functions. Moreover, if $g$ depends on additional parameters, we say $g \in G^\alpha$ if the constants $\kappa_1, \kappa_2$ are independent of these parameters, i.e. $g \in G^\alpha$ uniformly. Finally, the definition generalizes to vector-valued functions coordinate-wise in the co-domain.

Definition 2 Let $T > 0$, $\eta \in (0, 1 \wedge T)$, and $\alpha \geq 1$ be an integer. Set $N = \lfloor T/\eta \rfloor$. We say that a continuous-time stochastic process $\{X_t : t \in [0, T]\}$ is an order $\alpha$ weak approximation of a discrete stochastic process $\{x_k : k = 0, \dots, N\}$ if for every $g \in G^{\alpha+1}$, there exists a positive constant $C$, independent of $\eta$, such that

$$\max_{k=0,\dots,N} |\mathbb{E} g(x_k) - \mathbb{E} g(X_{k\eta})| \leq C \eta^\alpha. \tag{3.10}$$

Let us discuss briefly the notion of weak approximation as introduced above. These are approximations of the distribution of sample paths, instead of the sample paths themselves. This is enforced by requiring the expectations of the two processes $\{X_t\}$ and $\{x_k\}$ over a sufficiently large class of test functions to be close. In our definition, the test function class $G^{\alpha+1}$ is quite large, and in particular it includes all polynomials. Thus, Eq. (3.10) implies in particular that all moments of the two processes become close at the rate of $\eta^\alpha$, and hence so must their distributions. The notion of weak approximation must be contrasted with that of strong approximations, where one would for example require (in the case of mean-square approximations)

$$\big[\mathbb{E}|x_k - X_{k\eta}|^2\big]^{1/2} \leq C \eta^\alpha.$$
The above forces the actual sample-paths of the two processes to be close, per realization of the random process, which severely limits its application. In fact, one important advantage of weak approximations is that the approximating SDE process $\{X_t\}$ can approximate discrete stochastic processes whose step-wise driving noise is not Gaussian, which is exactly what we need to analyze general stochastic gradient iterations.

4. The approximation theorems

We now present the main approximation theorems. The derivation is based on the following two-step process:

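1. We establish a connection between one-step approximation and approximation on a finite time interval.
2. We construct a one-step approximation that is of order $\alpha + 1$, and so the approximation on a finite interval is of order $\alpha$.

4.1 Relating one-step to N-step approximations

Let us consider generally the question of the relationship between one-step approximations and approximations on a finite interval. Let $T > 0$, $\eta \in (0, 1 \wedge T)$ and $N = \lfloor T/\eta \rfloor$, and recall the general SGA iterations

$$x_{k+1} = x_k + \eta\, h(x_k, \gamma_k, \eta), \qquad x_0 \in \mathbb{R}^d, \quad k = 0, \dots, N, \tag{4.1}$$

and the general candidate family of approximating SDEs

$$dX^{\eta,\epsilon}_t = b(X^{\eta,\epsilon}_t, \eta, \epsilon)\,dt + \sqrt{\eta}\, \sigma(X^{\eta,\epsilon}_t, \eta, \epsilon)\, dW_t, \qquad X^{\eta,\epsilon}_0 = x_0, \quad t \in [0, T], \tag{4.2}$$

where $\epsilon \in (0, 1)$ is a mollification parameter, whose role will become apparent later. To reduce notational clutter and improve readability, unless some limiting procedure is considered, we shall not explicitly write the dependence of $X^{\eta,\epsilon}_t$ on $\eta, \epsilon$ and simply denote by $X_t$ the solution of the above SDE. Let us also denote for convenience $\tilde{X}_k := X_{k\eta}$. Further, let $\{X^{x,s}_t : t \geq s\}$ denote the stochastic process obeying the same equation (4.2), but with the initial condition $X^{x,s}_s = x$. We similarly write $\tilde{X}^{x,l}_k := X^{x,l\eta}_{k\eta}$ and denote by $\{x^{x,l}_k : k \geq l\}$ the stochastic process satisfying (4.1) but with $x_l = x$.

Throughout this section, we assume the following conditions:

Assumption 4.1 The functions $b : \mathbb{R}^d \times (0, 1 \wedge T) \times (0, 1) \to \mathbb{R}^d$ and $\sigma : \mathbb{R}^d \times (0, 1 \wedge T) \times (0, 1) \to \mathbb{R}^{d \times d}$ satisfy:

1. Uniform linear growth condition
$$|b(x, \eta, \epsilon)|^2 + |\sigma(x, \eta, \epsilon)|^2 \leq L^2 (1 + |x|^2)$$
for all $x \in \mathbb{R}^d$, $\eta \in (0, 1 \wedge T)$, $\epsilon \in (0, 1)$.

2. Uniform Lipschitz condition
$$|b(x, \eta, \epsilon) - b(y, \eta, \epsilon)| + |\sigma(x, \eta, \epsilon) - \sigma(y, \eta, \epsilon)| \leq L |x - y|$$
for all $x, y \in \mathbb{R}^d$, $\eta \in (0, 1 \wedge T)$, $\epsilon \in (0, 1)$.

Note that 2 implies 1 if there is at least one $x$ where the supremum of $b, \sigma$ over $\eta, \epsilon$ is finite. In particular, these conditions imply via Thm. 18 that there exists a unique solution to Eq. (4.2). Now, let us denote the one-step changes

$$\Delta(x) := x^{x,0}_1 - x, \qquad \tilde{\Delta}(x) := \tilde{X}^{x,0}_1 - x. \tag{4.3}$$

We prove the following result, which relates one-step approximations with approximations on a finite time interval.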

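Theorem 3 Let $T > 0$, $\eta \in (0, 1 \wedge T)$, $\epsilon \in (0, 1)$ and $N = \lfloor T/\eta \rfloor$. Let $\alpha \geq 1$ be an integer. Suppose further that the following conditions hold:

(i) There exist a function $\rho : (0, 1) \to \mathbb{R}_+$ and $K_1 \in G$ independent of $\eta, \epsilon$ such that
$$\Big| \mathbb{E} \prod_{j=1}^{s} \Delta_{(i_j)}(x) - \mathbb{E} \prod_{j=1}^{s} \tilde{\Delta}_{(i_j)}(x) \Big| \leq K_1(x)\big(\eta \rho(\epsilon) + \eta^{\alpha+1}\big),$$
for $s = 1, 2, \dots, \alpha$ and
$$\mathbb{E} \prod_{j=1}^{\alpha+1} |\Delta_{(i_j)}(x)| \leq K_1(x)\, \eta^{\alpha+1},$$
where $i_j \in \{1, \dots, d\}$.

(ii) For each $m \geq 1$, the $2m$-moment of $x^{x,0}_k$ is uniformly bounded with respect to $k$ and $\eta$, i.e. there exists a $K_2 \in G$, independent of $\eta, k$, such that
$$\mathbb{E} |x^{x,0}_k|^{2m} \leq K_2(x),$$
for all $k = 0, \dots, N \equiv \lfloor T/\eta \rfloor$.

Then, for each $g \in G^{\alpha+1}$, there exists a constant $C > 0$, independent of $\eta, \epsilon$, such that

$$\max_{k=0,\dots,N} |\mathbb{E} g(x_k) - \mathbb{E} g(X_{k\eta})| \leq C(\eta^\alpha + \rho(\epsilon)).$$

The proof of Thm. 3 requires a number of technical results that we defer to the appendix. Below, we demonstrate the main ingredients of the proof and refer to the appendix where the proofs of the auxiliary results are fully presented.

Proof In this proof, since there are many conditionings on the initial condition, to prevent nested superscripts we shall introduce the alternative notation $X_t(x, s) \equiv X^{x,s}_t$, and similarly for $\tilde{X}_k$ and $x_k$. Fix $g \in G^{\alpha+1}$ and $1 \leq k \leq N$. We have

$$\mathbb{E} g(X_{k\eta}) = \mathbb{E} g(\tilde{X}_k) = \mathbb{E} g(\tilde{X}_k(\tilde{X}_1, 1)) - \mathbb{E} g(\tilde{X}_k(x_1, 1)) + \mathbb{E} g(\tilde{X}_k(x_1, 1)).$$

If $k > 1$, by noting that $\tilde{X}_k(x_1, 1) = \tilde{X}_k(\tilde{X}_2(x_1, 1), 2)$, we get

$$\mathbb{E} g(\tilde{X}_k(x_1, 1)) = \mathbb{E} g(\tilde{X}_k(\tilde{X}_2(x_1, 1), 2)) - \mathbb{E} g(\tilde{X}_k(x_2, 2)) + \mathbb{E} g(\tilde{X}_k(x_2, 2)).$$

Continuing this process, we then have

$$\mathbb{E} g(\tilde{X}_k) = \sum_{l=1}^{k-1} \Big[ \mathbb{E} g(\tilde{X}_k(\tilde{X}_l(x_{l-1}, l-1), l)) - \mathbb{E} g(\tilde{X}_k(x_l, l)) \Big] + \mathbb{E} g(\tilde{X}_k(x_{k-1}, k-1)),$$

and hence by subtracting $\mathbb{E} g(x_k) \equiv \mathbb{E} g(x_k(x_{k-1}, k-1))$ we get

$$\mathbb{E} g(\tilde{X}_k) - \mathbb{E} g(x_k) = \sum_{l=1}^{k-1} \Big[ \mathbb{E} g(\tilde{X}_k(\tilde{X}_l(x_{l-1}, l-1), l)) - \mathbb{E} g(\tilde{X}_k(x_l, l)) \Big] + \mathbb{E} g(\tilde{X}_k(x_{k-1}, k-1)) - \mathbb{E} g(x_k(x_{k-1}, k-1)),$$

and so

$$\mathbb{E} g(\tilde{X}_k) - \mathbb{E} g(x_k) = \sum_{l=1}^{k-1} \Big( \mathbb{E}\, \mathbb{E}\big[ g(\tilde{X}_k(\tilde{X}_l(x_{l-1}, l-1), l)) \mid \tilde{X}_l(x_{l-1}, l-1) \big] - \mathbb{E}\, \mathbb{E}\big[ g(\tilde{X}_k(x_l, l)) \mid x_l \big] \Big) + \mathbb{E} g(\tilde{X}_k(x_{k-1}, k-1)) - \mathbb{E} g(x_k(x_{k-1}, k-1)).$$

Now, let $u(x, s) = \mathbb{E} g(X_{k\eta}(x, s))$. Then, we have

$$|\mathbb{E} g(\tilde{X}_k) - \mathbb{E} g(x_k)| \leq \sum_{l=1}^{k-1} \big| \mathbb{E} u(\tilde{X}_l(x_{l-1}, l-1), l\eta) - \mathbb{E} u(x_l(x_{l-1}, l-1), l\eta) \big| + \big| \mathbb{E} g(\tilde{X}_k(x_{k-1}, k-1)) - \mathbb{E} g(x_k(x_{k-1}, k-1)) \big|$$
$$\leq \sum_{l=1}^{k-1} \mathbb{E} \big| \mathbb{E}[u(\tilde{X}_l(x_{l-1}, l-1), l\eta) \mid x_{l-1}] - \mathbb{E}[u(x_l(x_{l-1}, l-1), l\eta) \mid x_{l-1}] \big| + \mathbb{E} \big| \mathbb{E}[g(\tilde{X}_k(x_{k-1}, k-1)) \mid x_{k-1}] - \mathbb{E}[g(x_k(x_{k-1}, k-1)) \mid x_{k-1}] \big|.$$

Using Prop. 25, $u(\cdot, s) \in G^{\alpha+1}$ uniformly in $s$, $t$, $\eta$ and $\epsilon$. Thus, by Assumption (i) and Lem. 27,

$$|\mathbb{E} g(x_k) - \mathbb{E} g(\tilde{X}_k)| \leq \big(\eta \rho(\epsilon) + \eta^{\alpha+1}\big) \Big( \sum_{l=1}^{k-1} \mathbb{E} K_{l-1}(x_{l-1}) + \mathbb{E} K_{k-1}(x_{k-1}) \Big) \leq \big(\eta \rho(\epsilon) + \eta^{\alpha+1}\big) \sum_{l=0}^{N} \kappa_{l,1} \big(1 + \mathbb{E}|x_l|^{2\kappa_{l,2}}\big),$$

where in the last line we used moment estimates from Thm. 19. Finally, using Assumption (ii) and the fact that $N \leq T/\eta$, we have

$$|\mathbb{E} g(x_k) - \mathbb{E} g(X_{k\eta})| = |\mathbb{E} g(x_k) - \mathbb{E} g(\tilde{X}_k)| \leq C(\rho(\epsilon) + \eta^\alpha).$$

4.2 SME for stochastic gradient descent

Thm. 3 allows us to prove the main approximation results for the current paper. In particular, in this section we derive a second-order accurate weak approximation for the simple SGD iterations (1.3), from which a simpler, first-order accurate approximation also follows. As seen in Thm. 3, we need only verify the conditions (i)-(ii) in order to prove the weak approximation results. These conditions mostly involve moment estimates, which we now perform. To simplify presentation, we introduce the following shorthand. Whenever we write

$$\psi(x) = \psi_0(x) + \eta \psi_1(x) + O(r(\eta, \epsilon)),$$

for some remainder term $r(\eta, \epsilon)$, we mean: there exists $K \in G$ independent of $\eta, \epsilon$ such that

$$|\psi(x) - \psi_0(x) - \eta \psi_1(x)| \leq K(x)\, r(\eta, \epsilon).$$

Now, let us set in (4.2)

$$b(x, \eta, \epsilon) = b_0(x, \epsilon) + \eta\, b_1(x, \epsilon), \qquad \sigma(x, \eta, \epsilon) = \sigma_0(x, \epsilon),$$

where $b_0, b_1, \sigma_0$ are functions to be determined. We have the following moment estimates.

Lemma 4 Let $\tilde{\Delta}(x)$ be defined as in (4.3). Suppose further that $b_0, b_1, \sigma_0 \in G^3$. Then we have
(i) $\mathbb{E} \tilde{\Delta}_{(i)}(x) = b_0(x, \epsilon)_{(i)}\, \eta + \big[\tfrac{1}{2} b_0(x, \epsilon)_{(j)} \partial_{(j)} b_0(x, \epsilon)_{(i)} + b_1(x, \epsilon)_{(i)}\big] \eta^2 + O(\eta^3)$,
(ii) $\mathbb{E} \tilde{\Delta}_{(i)}(x) \tilde{\Delta}_{(j)}(x) = \big[b_0(x, \epsilon)_{(i)} b_0(x, \epsilon)_{(j)} + \sigma_0(x, \epsilon)_{(i,k)} \sigma_0(x, \epsilon)_{(j,k)}\big] \eta^2 + O(\eta^3)$,
(iii) $\mathbb{E} \prod_{j=1}^{3} |\tilde{\Delta}_{(i_j)}(x)| = O(\eta^3)$.

Proof To obtain (i)-(iii), we simply apply Lem. 28 with $\psi(z) = \prod_{j=1}^{s} (z_{(i_j)} - x_{(i_j)})$ for $s = 1, 2, 3$ respectively.

Next, we estimate the moments of the SGA iterations below.

Lemma 5 Let $\Delta(x)$ be defined as in (4.3) with the SGD iterations, i.e. $h(x, r, \eta) = -\nabla f_r(x)$. Suppose that for each $x \in \mathbb{R}^d$, $f \in G^1$. Then,
(i) $\mathbb{E} \Delta_{(i)}(x) = -\partial_{(i)} f(x)\, \eta$,
(ii) $\mathbb{E} \Delta_{(i)}(x) \Delta_{(j)}(x) = \partial_{(i)} f(x)\, \partial_{(j)} f(x)\, \eta^2 + \Sigma(x)_{(i,j)}\, \eta^2$,
(iii) $\mathbb{E} \prod_{j=1}^{3} |\Delta_{(i_j)}(x)| = O(\eta^3)$,
where $\Sigma(x) := \mathbb{E} (\nabla f_\gamma(x) - \nabla f(x))(\nabla f_\gamma(x) - \nabla f(x))^T$.

Proof We have $\Delta(x) = -\eta \nabla f_\gamma(x)$. Taking expectations, the results then follow.

We now prove the main approximation theorem for the simple SGD. Before presenting the statement and proof, we shall note a few technical issues that prevent the direct application of Thm. 3 with the moment estimates in Lem. 4 and 5. The latter suggest ignoring $\epsilon$ and setting

$$b_0(x, \epsilon) = -\nabla f(x), \qquad b_1(x, \epsilon) = -\tfrac{1}{4} \nabla |\nabla f(x)|^2, \qquad \sigma_0(x, \epsilon) = \Sigma(x)^{1/2}.$$

Then, we would see from Lem. 4 and 5 that the SGD and the SDE have matching moments up to $O(\eta^3)$. The first issue with this approach is that even if $\Sigma(x)$ is sufficiently smooth (which may follow from the regularity of $\nabla f_\gamma$), the smoothness of $\Sigma(x)^{1/2}$ cannot be guaranteed unless $\Sigma(x)$ is positive-definite, which is often too strong an assumption in practice and excludes interesting cases where $\Sigma(x)$ is a singular diffusion matrix. However, the results in Sec. 4.1 require smoothness. Second, we would like to consider functions $f_\gamma$ that may not have the higher strong derivatives required by the Lemmas, beyond those required to define the modified equation itself. To fix both of these issues, we will use a simple mollifying technique. This is the reason for the inclusion of the $\epsilon$ parameter in the results in Sec. 4.1.

Definition 6 Let us denote by $\nu : \mathbb{R}^d \to \mathbb{R}$, $\nu \in C^\infty_c(\mathbb{R}^d)$, the standard mollifier

$$\nu(x) := \begin{cases} C \exp\Big(-\dfrac{1}{1 - |x|^2}\Big) & |x| < 1, \\ 0 & |x| \geq 1, \end{cases}$$

where $C$ is the normalizing constant chosen so that $\int_{\mathbb{R}^d} \nu(y)\, dy = 1$. Further, define $\nu^\epsilon(x) = \epsilon^{-d} \nu(x/\epsilon)$. Let $\psi \in L^1_{\mathrm{loc}}(\mathbb{R}^d)$ be locally integrable; then we may define its mollification by

$$\psi^\epsilon(x) := (\nu^\epsilon * \psi)(x) = \int_{\mathbb{R}^d} \nu^\epsilon(x - y) \psi(y)\, dy = \int_{B(0, \epsilon)} \nu^\epsilon(y) \psi(x - y)\, dy,$$

where $B(z, \epsilon)$ is the $d$-dimensional ball of radius $\epsilon$ centered at $z$. The mollifications of vector (or matrix) valued functions are defined element-wise.

The mollifier has very useful properties. In particular, we will use the following well-known facts (see e.g. Evans (2010) for proof):
(i) If $\psi \in L^1_{\mathrm{loc}}(\mathbb{R}^d)$, then $\psi^\epsilon \in C^\infty(\mathbb{R}^d)$.
(ii) $\psi^\epsilon(x) \to \psi(x)$ as $\epsilon \to 0$ for almost every $x \in \mathbb{R}^d$ (with respect to the Lebesgue measure).
(iii) If $\psi$ is continuous, then $\psi^\epsilon(x) \to \psi(x)$ as $\epsilon \to 0$ uniformly on compact subsets of $\mathbb{R}^d$.

Next, we make use of the idea of weak derivatives.

Definition 7 Let $\Psi \in L^1_{\mathrm{loc}}(\mathbb{R}^d)$ and $J$ be a multi-index of order $|J|$. Suppose that there exists a $\psi \in L^1_{\mathrm{loc}}(\mathbb{R}^d)$ such that

$$\int_{\mathbb{R}^d} \Psi(x)\, \nabla^J \phi(x)\, dx = (-1)^{|J|} \int_{\mathbb{R}^d} \psi(x) \phi(x)\, dx$$

for all $\phi \in C^\infty_c(\mathbb{R}^d)$. Then, we call $\psi$ the order $J$ weak derivative of $\Psi$ and write $D^J \Psi = \psi$.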
Note that when it exists, the weak derivative is unique almost everywhere and if $\Psi$ is differentiable, $\nabla^J \Psi = D^J \Psi$ almost everywhere (Evans, 2010).
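To make the mollification of Definition 6 concrete, the following sketch mollifies the one-dimensional function $\psi(x) = |x|$, which is Lipschitz but not differentiable at the origin. It is an illustration only: the restriction to $d = 1$, the uniform-grid quadrature and the names `bump` and `mollify` are assumptions of this example, not part of the paper.

```python
import numpy as np

def bump(x):
    # Unnormalized standard mollifier in d = 1:
    # exp(-1 / (1 - x^2)) for |x| < 1 and 0 otherwise.
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    inside = np.abs(x) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - x[inside] ** 2))
    return out

def mollify(psi, x, eps, n_quad=2001):
    # Approximates psi_eps(x) = int_{|y| < eps} nu_eps(y) psi(x - y) dy
    # on a uniform grid; normalizing the discrete weights to sum to 1
    # plays the role of the constant C in the definition.
    y = np.linspace(-eps, eps, n_quad)
    w = bump(y / eps)
    w = w / w.sum()
    return np.array([np.sum(w * psi(xi - y)) for xi in np.atleast_1d(x)])

if __name__ == "__main__":
    x = np.linspace(-1.0, 1.0, 41)
    for eps in (0.5, 0.1, 0.01):
        err = np.max(np.abs(mollify(np.abs, x, eps) - np.abs(x)))
        print(f"eps = {eps:5.2f}   max |psi_eps - psi| on [-1, 1] = {err:.5f}")
```

Because $|x|$ is Lipschitz with constant 1, the mollified function differs from it by at most $\epsilon$, and the two coincide (up to rounding) away from the kink, consistent with the uniform convergence on compact sets in fact (iii).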

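The introduction of weak derivatives motivates the definition of the weak version of the function spaces $G^\alpha$.

Definition 8 For $\alpha \geq 1$, we define the space $G^\alpha_w$ to be the subspace of $L^1_{\mathrm{loc}}(\mathbb{R}^d)$ such that if $g \in G^\alpha_w$, then $g$ has weak derivatives up to order $\alpha$ and for each multi-index $J$ with $|J| \leq \alpha$, there exist positive integers $\kappa_1, \kappa_2$ such that

$$|D^J g(x)| \leq \kappa_1 (1 + |x|^{2\kappa_2}) \quad \text{for a.e. } x \in \mathbb{R}^d.$$

As in Def. 1, if $g$ depends on additional parameters, we say that $g \in G^\alpha_w$ if the above constants do not depend on the additional parameters. Also, vector-valued $g$ are defined as above element-wise in the co-domain. Note that $G^\alpha_w$ is a subspace of the Sobolev space $W^{\alpha,1}_{\mathrm{loc}}$.

Theorem 9 Let $T > 0$, $\eta \in (0, 1 \wedge T)$ and set $N = \lfloor T/\eta \rfloor$. Let $\{x_k : k \geq 0\}$ be the SGD iterations defined in (1.3). Suppose the following conditions are met:

(i) $f \equiv \mathbb{E} f_\gamma$ is twice continuously differentiable, $\nabla |\nabla f|^2$ is Lipschitz, and $f \in G^4_w$.

(ii) $\nabla f_\gamma$ satisfies a Lipschitz condition:
$$|\nabla f_\gamma(x) - \nabla f_\gamma(y)| \leq L_\gamma |x - y| \quad \text{a.s.}$$
for all $x, y \in \mathbb{R}^d$, where $L_\gamma$ is a random variable which is positive a.s. and $\mathbb{E} L^m_\gamma < \infty$ for each $m \geq 1$.

Define $\{X_t : t \in [0, T]\}$ as the stochastic process satisfying the SDE

$$dX_t = -\nabla\big(f(X_t) + \tfrac{1}{4} \eta |\nabla f(X_t)|^2\big)\,dt + \sqrt{\eta}\, \Sigma(X_t)^{1/2}\, dW_t, \qquad X_0 = x_0, \tag{4.4}$$

with $\Sigma(x) = \mathbb{E} (\nabla f_\gamma(x) - \nabla f(x))(\nabla f_\gamma(x) - \nabla f(x))^T$. Then, $\{X_t : t \in [0, T]\}$ is an order-2 weak approximation of the SGD, i.e. for each $g \in G^3$, there exists a constant $C > 0$ independent of $\eta$ such that

$$\max_{k=0,\dots,N} |\mathbb{E} g(x_k) - \mathbb{E} g(X_{k\eta})| \leq C \eta^2.$$

Proof First, we check that Eq. (4.4) admits a unique solution, which amounts to checking the conditions in Thm. 18. Note that the Lipschitz condition (ii) implies $\nabla f$ is Lipschitz with constant $\mathbb{E} L_\gamma$. To see that $\Sigma(x)^{1/2}$ is also Lipschitz, observe that $u(x) := \nabla f_\gamma(x) - \nabla f(x)$ is Lipschitz (in the sense of (ii), with constant at most $L_\gamma + \mathbb{E} L_\gamma$), and

$$|\Sigma(x)^{1/2} - \Sigma(y)^{1/2}| = \Big| \big\| [u(x) u(x)^T]^{1/2} \big\|_{L^2(\Omega)} - \big\| [u(y) u(y)^T]^{1/2} \big\|_{L^2(\Omega)} \Big| \leq \big\| [u(x) u(x)^T]^{1/2} - [u(y) u(y)^T]^{1/2} \big\|_{L^2(\Omega)}.$$

Moreover, observe that for vectors $u \in \mathbb{R}^d$ the mapping $u \mapsto (u u^T)^{1/2} = u u^T / |u|$ is Lipschitz, which implies

$$|\Sigma(x)^{1/2} - \Sigma(y)^{1/2}| \leq L \| u(x) - u(y) \|_{L^2(\Omega)} \leq L |x - y|.$$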

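The Lipschitz conditions on the drift and the diffusion matrix imply uniform linear growth, so by Thm. 18, Eq. (4.4) admits a unique solution. For each $\epsilon \in (0, 1)$, define the mollified functions

$$b_0(x, \epsilon) = -\nu^\epsilon * \nabla f(x), \qquad b_1(x, \epsilon) = -\tfrac{1}{4}\, \nu^\epsilon * \big(\nabla |\nabla f(x)|^2\big), \qquad \sigma_0(x, \epsilon) = \nu^\epsilon * \Sigma(x)^{1/2}.$$

Observe that $b_0 + \eta b_1$ and $\sigma_0$ satisfy a Lipschitz condition in $x$ uniformly in $\eta, \epsilon$. To see this, note that for any Lipschitz function $\psi$ with constant $L$, we have

$$|\nu^\epsilon * \psi(x) - \nu^\epsilon * \psi(y)| \leq \int_{B(0, \epsilon)} \nu^\epsilon(z)\, |\psi(x - z) - \psi(y - z)|\, dz \leq L |x - y|,$$

which proves $b_0 + \eta b_1$ and $\sigma_0$ are uniformly Lipschitz. Similarly, the linear growth condition follows. Hence, we may define a family of stochastic processes $\{X^\epsilon_t : \epsilon \in (0, 1)\}$ satisfying

$$dX^\epsilon_t = \big[b_0(X^\epsilon_t, \epsilon) + \eta\, b_1(X^\epsilon_t, \epsilon)\big]\,dt + \sqrt{\eta}\, \sigma_0(X^\epsilon_t, \epsilon)\, dW_t, \qquad X^\epsilon_0 = x_0,$$

each of which admits a unique solution by Thm. 18. Now, we claim that $b_0(\cdot, \epsilon), b_1(\cdot, \epsilon), \sigma_0(\cdot, \epsilon) \in G^3$ uniformly in $\epsilon$. To see this, simply observe that mollifications are smooth, and moreover, the polynomial growth is satisfied since $\nu^\epsilon * D^J \psi = \nabla^J (\nu^\epsilon * \psi)$ and furthermore, if $\psi \in G$, then we have

$$|\psi^\epsilon(x)| \leq \int_{B(0, \epsilon)} \nu^\epsilon(y)\, |\psi(x - y)|\, dy \leq \kappa_1 \Big( 1 + 2^{2\kappa_2 - 1} |x|^{2\kappa_2} + \frac{2^{2\kappa_2 - 1} C}{\epsilon^d} \int_{B(0, \epsilon)} |y|^{2\kappa_2}\, dy \Big).$$

But $\int_{B(0, \epsilon)} |y|^{2\kappa_2}\, dy \leq \mathrm{Vol}(B(0, \epsilon)) = C \epsilon^d$, where $C$ is independent of $\epsilon$. This shows that $\psi^\epsilon \in G$ uniformly in $\epsilon$. This immediately implies that $b_0(\cdot, \epsilon), b_1(\cdot, \epsilon), \sigma_0(\cdot, \epsilon) \in G^3$. Now, since $b_0(x, \epsilon) \to b_0(x, 0)$ (and similarly for $b_1, \sigma_0$), and the limits are continuous, by Lem. 4, 5, 29, 30 all conditions of Thm. 3 are satisfied, and hence we conclude that for each $g \in G^3$, we have

$$\max_{k=0,\dots,N} |\mathbb{E} g(X^\epsilon_{k\eta}) - \mathbb{E} g(x_k)| \leq C(\eta^2 + \rho(\epsilon)),$$

where $C$ is independent of $\eta$ and $\epsilon$, and $\rho(\epsilon) \to 0$ as $\epsilon \to 0$. Moreover, since $b_0(x, \epsilon) \to b_0(x, 0)$ (and similarly for $b_1, \sigma_0$) uniformly on compact sets, we may apply Thm. 20 to conclude that

$$\sup_{t \in [0, T]} \mathbb{E} |X^\epsilon_t - X_t|^2 \to 0 \quad \text{as } \epsilon \to 0.$$

Thus, we have

$$|\mathbb{E} g(X_{k\eta}) - \mathbb{E} g(x_k)| \leq |\mathbb{E} g(X^\epsilon_{k\eta}) - \mathbb{E} g(x_k)| + |\mathbb{E} g(X^\epsilon_{k\eta}) - \mathbb{E} g(X_{k\eta})|$$
$$\leq C(\eta^2 + \rho(\epsilon)) + \big( \mathbb{E} |X^\epsilon_{k\eta} - X_{k\eta}|^2 \big)^{1/2} \Big( \int_0^1 \mathbb{E} \big| \nabla g(\lambda X^\epsilon_{k\eta} + (1 - \lambda) X_{k\eta}) \big|^2\, d\lambda \Big)^{1/2}.$$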

Using Thm. 19 and the fact that $\nabla g \in G$ (since $g \in G^3$), the last expectation is finite and hence taking the limit $\epsilon \to 0$ yields our result.

By going for a lower order approximation, we of course have the following:

Corollary 10 Assume the same conditions as in Thm. 9, except that we replace (i) with

(i') $f \equiv \mathbb{E} f_\gamma$ is continuously differentiable, and $f \in G^3_w$.

Define $\{X_t : t \in [0, T]\}$ as the stochastic process satisfying the SDE

$$dX_t = -\nabla f(X_t)\,dt + \sqrt{\eta}\, \Sigma(X_t)^{1/2}\, dW_t, \qquad X_0 = x_0, \tag{4.5}$$

with $\Sigma(x) = \mathbb{E} (\nabla f_\gamma(x) - \nabla f(x))(\nabla f_\gamma(x) - \nabla f(x))^T$. Then, $\{X_t : t \in [0, T]\}$ is an order-1 weak approximation of the SGD, i.e. for each $g \in G^2$, there exists a constant $C > 0$ independent of $\eta$ such that

$$\max_{k=0,\dots,N} |\mathbb{E} g(X_{k\eta}) - \mathbb{E} g(x_k)| \leq C \eta.$$

Remark 11 In the above results, the most restrictive condition is probably the Lipschitz condition on $\nabla f_\gamma$. Such Lipschitz conditions are important to ensure that the SMEs admit unique strong solutions and that the SGA has uniformly bounded moments. Note that following similar techniques in SDE analysis (e.g. Kloeden and Platen (2011)), these global conditions may be relaxed to their respective local versions if we assume in addition a uniform global linear growth condition on $\nabla f_\gamma$. Finally, for applications, typical loss functions have inward pointing gradients for all sufficiently large $x$, meaning that the SGD iterates will be uniformly bounded almost surely. Thus, we may simply modify the loss functions for large $x$ (without affecting the SGA iterates) to satisfy the conditions above.

Remark 12 The constant $C$ does not depend on $\eta$, but as evidenced in the proof of the theorem, it generally depends on $g$, $T$, $d$ and the various Lipschitz constants. For the fairly general situation we are considering, we do not derive tight estimates of these dependencies.

4.3 SME for stochastic gradient descent with momentum

Let us discuss the corresponding SME for a popular variant of the SGD called the momentum SGD (MSGD). The momentum SGD augments the usual SGD iterations with a memory term. In the usual form, we have the iterations

$$\hat{v}_{k+1} = \hat{\mu} \hat{v}_k - \hat{\eta} \nabla f_{\gamma_k}(x_k),$$
$$x_{k+1} = x_k + \hat{v}_{k+1},$$

where $\hat{\mu} \in (0, 1)$ (typically close to 1) is called the momentum parameter and $\hat{\eta}$ is the learning rate.
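The momentum iteration above can be sketched in a few lines. The example below also implements the rescaled form that the text introduces in (4.6)-(4.7), namely $\eta = \sqrt{\hat{\eta}}$, $v_k = \hat{v}_k / \sqrt{\hat{\eta}}$ and $\mu = (1 - \hat{\mu}) / \sqrt{\hat{\eta}}$, and checks that both forms generate the same iterates $x_k$. The gradient oracle `grad(x, k)`, the noisy quadratic used in the demonstration and all function names are assumptions of this example.

```python
import numpy as np

def msgd_original(x0, grad, n_steps, eta_hat, mu_hat):
    # Unscaled momentum SGD:
    #   v_hat <- mu_hat * v_hat - eta_hat * grad,   x <- x + v_hat.
    x = np.asarray(x0, dtype=float).copy()
    v_hat = np.zeros_like(x)
    for k in range(n_steps):
        v_hat = mu_hat * v_hat - eta_hat * grad(x, k)
        x = x + v_hat
    return x

def msgd_rescaled(x0, grad, n_steps, eta_hat, mu_hat):
    # Rescaled form with eta = sqrt(eta_hat), mu = (1 - mu_hat) / eta:
    #   v <- v - mu * eta * v - eta * grad,   x <- x + eta * v.
    eta = np.sqrt(eta_hat)
    mu = (1.0 - mu_hat) / eta
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    for k in range(n_steps):
        v = v - mu * eta * v - eta * grad(x, k)
        x = x + eta * v
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(size=(100, 2))      # pre-drawn gradient noise
    grad = lambda x, k: x + noise[k]       # hypothetical noisy gradient of 0.5 * |x|^2
    xa = msgd_original([1.0, -1.0], grad, 100, eta_hat=0.01, mu_hat=0.9)
    xb = msgd_rescaled([1.0, -1.0], grad, 100, eta_hat=0.01, mu_hat=0.9)
    print(xa, xb, np.allclose(xa, xb))
```

The agreement follows from $\mu \eta = 1 - \hat{\mu}$ and $\eta^2 = \hat{\eta}$; the rescaled form is convenient because both updates then carry an explicit factor of $\eta$, as in the general iteration (4.1).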
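Let us consider a rescaled version of the above that is easier to analyze via continuous-time approximations. We redefine

$$\eta := \sqrt{\hat{\eta}}, \qquad v_k := \hat{v}_k / \sqrt{\hat{\eta}}, \qquad \mu := (1 - \hat{\mu}) / \sqrt{\hat{\eta}} \tag{4.6}$$

to obtain

$$v_{k+1} = v_k - \mu \eta v_k - \eta \nabla f_{\gamma_k}(x_k),$$
$$x_{k+1} = x_k + \eta v_{k+1}. \tag{4.7}$$

In view of the rescaling, the range of momentum parameters we consider becomes $\mu \in (0, \hat{\eta}^{-1/2}) = (0, \eta^{-1})$, which we may replace by $(0, \infty)$ for simplicity.

Let us now derive the SME satisfied by the iterations (4.7). Observe that this is again a special case of (4.1) with $x$ now replaced by $(v, x)$ and

$$h(v, x, \gamma, \eta) = \big( -\mu v - \nabla f_\gamma(x),\ v - \eta \mu v - \eta \nabla f_\gamma(x) \big).$$

In view of Thm. 14 and the results in Sec. 4.2, in order to derive the SME we simply match moments up to order 3. As in Sec. 4.2, let us define the one-step difference

$$\Delta(v, x) := \big( v^{v,x,0}_1 - v,\ x^{v,x,0}_1 - x \big). \tag{4.8}$$

The following moment expansions are immediate.

Lemma 13 Let $\Delta(v, x)$ be defined as in (4.8). We have

(i) $\mathbb{E} \Delta_{(i)}(v, x) = \eta \big( -\mu v_{(i)} - \partial_{(i)} f(x),\ v_{(i)} \big) + \eta^2 \big( 0,\ -\mu v_{(i)} - \partial_{(i)} f(x) \big)$,

(ii) $\mathbb{E} \Delta_{(i)}(v, x) \Delta_{(j)}(v, x) = \eta^2 \begin{pmatrix} \mu^2 v_{(i)} v_{(j)} + \mu v_{(i)} \partial_{(j)} f(x) + \mu v_{(j)} \partial_{(i)} f(x) + \Sigma(x)_{(i,j)} + \partial_{(i)} f(x) \partial_{(j)} f(x) & -\mu v_{(i)} v_{(j)} - v_{(j)} \partial_{(i)} f(x) \\ -\mu v_{(i)} v_{(j)} - v_{(i)} \partial_{(j)} f(x) & v_{(i)} v_{(j)} \end{pmatrix} + O(\eta^3)$,

(iii) $\mathbb{E} \prod_{j=1}^{3} |\Delta_{(i_j)}(v, x)| = O(\eta^3)$,

where $\Sigma(x) := \mathbb{E} (\nabla f_\gamma(x) - \nabla f(x))(\nabla f_\gamma(x) - \nabla f(x))^T$.

Proof The proof follows from direct calculation of the moments.

Hence, proceeding exactly as in Sec. 4.2 and using Lem. 4, 13, we see that we may set

$$b_0(v, x) = \big( -\mu v - \nabla f(x),\ v \big),$$
$$b_1(v, x) = -\tfrac{1}{2} \big( \mu [\mu v + \nabla f(x)] - \nabla^2 f(x) v,\ \mu v + \nabla f(x) \big),$$
$$\sigma_0(v, x) = \begin{pmatrix} \Sigma(x)^{1/2} & 0 \\ 0 & 0 \end{pmatrix}$$

in order to match the moments. By similar mollification and limiting arguments as in Thm. 9, we arrive at the following approximation theorem, where we can see that the SME for MSGD takes the form of a Langevin equation.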

Theorem 14 Assume the same conditions as in Thm. 9. Let $\mu > 0$ be fixed and define $\{V_t, X_t : t \in [0, T]\}$ as the stochastic process satisfying the SDE

$$dV_t = -\left[ \left( \mu I + \tfrac{1}{2}\eta[\mu^2 I - \nabla^2 f(X_t)] \right) V_t + \left( 1 + \tfrac{1}{2}\eta\mu \right) \nabla f(X_t) \right] dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t, \qquad V_0 = v_0,$$
$$dX_t = \left[ \left( 1 - \tfrac{1}{2}\eta\mu \right) V_t - \tfrac{1}{2}\eta \nabla f(X_t) \right] dt, \qquad X_0 = x_0, \tag{4.9}$$

with $\Sigma(x)$ as defined in Thm. 9. Then, $\{(V_t, X_t) : t \in [0, T]\}$ is an order-2 weak approximation of the MSGD. Moreover, if we relax the assumptions to those of Cor. 10, we have the order-1 weak approximation

$$dV_t = -\left[ \mu V_t + \nabla f(X_t) \right] dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t, \qquad V_0 = v_0,$$
$$dX_t = V_t\,dt, \qquad X_0 = x_0. \tag{4.10}$$

Note that by inverting the scaling (4.6), the order-1 SME (4.10) is the formal equation derived in Li et al. (2015).

4.4 SME for a momentum variant: Nesterov accelerated gradient

It follows from the calculations above that we can also obtain the SME for the stochastic gradient version of the Nesterov accelerated gradient (NAG) method (Nesterov, 1983), which we refer to as SNAG. In the non-stochastic case, the NAG method has been analyzed using the ODE approach (Su et al., 2014). Therefore, the derivation in this section can be viewed as a stochastic parallel. The NAG method is sometimes used with stochastic gradients, and hence it is useful to analyze its properties in this setting and compare it to MSGD.

The unscaled NAG iterations are

$$\hat v_{k+1} = \hat\mu_k \hat v_k - \hat\eta \nabla f_{\gamma_k}(x_k + \hat\mu_k \hat v_k), \qquad x_{k+1} = x_k + \hat v_{k+1},$$

with $\hat v_0 = 0$. These differ from the momentum iterations in that the gradient is now evaluated at a predicted position $x_k + \hat\mu_k \hat v_k$, instead of the original position $x_k$. Moreover, the momentum parameter $\hat\mu_k$ is now allowed to vary as $k$ increases, and in fact, the usual choice

$$\hat\mu_k = \frac{k-1}{k+2} \tag{4.11}$$

has important links to stability and acceleration in the deterministic case (Nesterov, 1983; Su et al., 2014). In particular, it achieves an $O(1/k^2)$ convergence rate for general convex functions. On the other hand, a constant $\hat\mu_k$ is suggested for strongly convex functions (Nesterov, 2013). In the following, we shall first consider the case of a constant momentum parameter with $\hat\mu_k \equiv \hat\mu$, and then the choice (4.11) subsequently.

Constant momentum.
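Using the same rescaling in (4.6), we have

$$v_{k+1} = v_k - \mu\eta v_k - \eta \nabla f_{\gamma_k}(x_k + \eta(1 - \mu\eta) v_k), \qquad x_{k+1} = x_k + \eta v_{k+1}, \tag{4.12}$$

which is again (4.1) with

$$h(v, x, \gamma, \eta) = \left( -\mu v - \nabla f_\gamma(x + \eta(1 - \mu\eta) v),\; v - \eta\mu v - \eta \nabla f_\gamma(x + \eta(1 - \mu\eta) v) \right).$$

Hence, we have the following moment expansions.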

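Lemma 15 Let $\Delta(v, x) := (v_1^{v,x} - v,\; x_1^{v,x} - x)$. We have

(i) $\mathbb{E}\,\Delta_{(i)}(v, x) = \eta\left( -\mu v_{(i)} - \partial_{(i)} f(x),\; v_{(i)} \right) + \eta^2 \left( -\sum_j \partial_{(i)}\partial_{(j)} f(x) v_{(j)},\; -\mu v_{(i)} - \partial_{(i)} f(x) \right) + O(\eta^3)$,

(ii) $\mathbb{E}\,\Delta_{(i)}(v, x)\Delta_{(j)}(v, x) = \eta^2 \begin{pmatrix} \mu^2 v_{(i)} v_{(j)} + \mu v_{(i)} \partial_{(j)} f(x) + \mu v_{(j)} \partial_{(i)} f(x) + \Sigma(x)_{(i,j)} + \partial_{(i)} f(x)\, \partial_{(j)} f(x) & -\mu v_{(i)} v_{(j)} - v_{(j)} \partial_{(i)} f(x) \\ -\mu v_{(i)} v_{(j)} - v_{(i)} \partial_{(j)} f(x) & v_{(i)} v_{(j)} \end{pmatrix} + O(\eta^3)$,

(iii) $\mathbb{E} \prod_{j=1}^{3} \Delta_{(i_j)}(v, x) = O(\eta^3)$,

where $\Sigma(x) := \mathbb{E}(\nabla f_\gamma(x) - \nabla f(x))(\nabla f_\gamma(x) - \nabla f(x))^T$. Here the $\eta^2$ coefficients are evaluated at $x$; evaluating them at the predicted position $x + \eta(1 - \mu\eta) v$ changes the expansions only at order $\eta^3$.

Proof The proof follows from direct calculation of the moments and Taylor expansions.

Hence, we may match moments by setting

$$b_0(v, x) = \left( -\mu v - \nabla f(x),\; v \right),$$
$$b_1(v, x) = -\tfrac{1}{2} \left( \mu[\mu v + \nabla f(x)] + \nabla^2 f(x) v,\; \mu v + \nabla f(x) \right),$$
$$\sigma_0(v, x) = \begin{pmatrix} \Sigma(x)^{1/2} & 0 \\ 0 & 0 \end{pmatrix},$$

from which we obtain the following approximation theorem for SNAG.

Theorem 16 Assume the same conditions as in Thm. 14. Define $\{V_t, X_t : t \in [0, T]\}$ as the stochastic process satisfying the SDE

$$dV_t = -\left[ \left( \mu I + \tfrac{1}{2}\eta[\mu^2 I + \nabla^2 f(X_t)] \right) V_t + \left( 1 + \tfrac{1}{2}\eta\mu \right) \nabla f(X_t) \right] dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t, \qquad V_0 = v_0,$$
$$dX_t = \left[ \left( 1 - \tfrac{1}{2}\eta\mu \right) V_t - \tfrac{1}{2}\eta \nabla f(X_t) \right] dt, \qquad X_0 = x_0, \tag{4.13}$$

with $\Sigma$ as defined in Thm. 14. Then, $\{(V_t, X_t) : t \in [0, T]\}$ is an order-2 weak approximation of SNAG. Moreover, the same order-1 weak approximation of MSGD in (4.10) holds for the SNAG.

The result above shows that for constant momentum parameters, the modified equations for MSGD and SNAG are equivalent at leading order, but differ when we consider the second order modified equation. Let us now discuss the case where the momentum parameter is allowed to vary.

Varying momentum. Now let us take $\hat\mu_k$ as in (4.11). Then, using the same rescaling arguments, we arrive at

$$v_{k+1} = v_k - \mu_k \eta v_k - \eta \nabla f_{\gamma_k}(x_k + \eta(1 - \mu_k\eta) v_k), \qquad x_{k+1} = x_k + \eta v_{k+1}, \tag{4.14}$$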

with $\mu_k = 3/(2\eta + k\eta)$. Now, in order to apply our theoretical results to deduce the SME, simply notice that we may introduce an auxiliary scalar variable

$$z_{k+1} = z_k + \eta, \qquad z_0 = 0.$$

Then, $\mu_k = 3/(2\eta + z_k)$, and hence all terms are no longer explicitly $k$-dependent. Thus we may proceed formally as in the previous sections to arrive at the order-1 SME for SNAG with varying momentum

$$dV_t = -\left[ \frac{3}{t} V_t + \nabla f(X_t) \right] dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t, \qquad V_0 = 0,$$
$$dX_t = V_t\,dt, \qquad X_0 = x_0. \tag{4.15}$$

This result is formal because the term $3/t$ does not satisfy our global Lipschitz condition, unless we restrict our interval to some $[t_0, T]$ with $t_0 > 0$, in which case the above result becomes rigorous. Alternatively, some limiting arguments have to be used to establish well-posedness of the equation on $[0, T]$ individually. We shall omit these analyses in the current paper, and only consider (4.15) on some interval $[t_0, T]$, where the initial conditions are then replaced by $(v_{t_0}, x_{t_0})$. As a point of comparison, (4.15) reduces to the ODE derived in Su et al. (2014) if $\Sigma(x) \equiv 0$ (i.e. the gradients are non-stochastic).

5. Applications of the SMEs to the analysis of SGA

In this section, we apply the SME framework developed above to analyze the dynamics of the three stochastic gradient algorithm variants discussed above, namely SGD, MSGD and SNAG. We shall focus on simple but non-trivial models where, to a large extent, analytical computations using the SME are tractable, giving us key insights into the algorithms that are otherwise difficult to obtain without appealing to the continuous formalism presented in this paper. We consider primarily the following model:

Model: Let $H \in \mathbb{R}^{d \times d}$ be a symmetric, positive definite matrix. Define the sample objective

$$f_\gamma(x) := \tfrac{1}{2}(x - \gamma)^T H (x - \gamma) - \tfrac{1}{2}\mathrm{Tr}(H), \qquad \gamma \sim \mathcal{N}(0, I), \tag{5.1}$$

which gives the total objective $f(x) \equiv \mathbb{E} f_\gamma(x) = \tfrac{1}{2} x^T H x$.

5.1 SME analysis of SGD

We first derive the SME associated with (5.1). For simplicity, we will only consider the order-1 SME (4.5). A direct computation shows that $\Sigma(x) = H^2$ and so the SME for SGD applied to model (5.1) is

$$dX_t = -H X_t\,dt + \sqrt{\eta}\,H\,dW_t.$$

This is a multi-dimensional Ornstein-Uhlenbeck (OU) process and admits the explicit solution

$$X_t = e^{-tH} \left( x_0 + \sqrt{\eta} \int_0^t e^{sH} H\,dW_s \right).$$
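The explicit OU solution makes the order-1 weak approximation easy to check for a linear test function: the stochastic integral has zero mean, so $\mathbb{E} X_t = e^{-tH} x_0$, while for model (5.1) the SGD mean satisfies $\mathbb{E} x_k = (I - \eta H)^k x_0$ because $\mathbb{E}\gamma_k = 0$. The sketch below compares the two at $t = k\eta$ for decreasing $\eta$. The matrix `H`, the initial point, the horizon `T` and the function names are illustrative assumptions.

```python
import numpy as np

def sgd_mean(H, x0, eta, k):
    # E[x_k] for x_{k+1} = x_k - eta*H(x_k - gamma_k) with E[gamma_k] = 0.
    d = len(x0)
    return np.linalg.matrix_power(np.eye(d) - eta * H, k) @ x0

def sme_mean(H, x0, t):
    # E[X_t] = exp(-tH) x0, computed via the eigendecomposition of symmetric H.
    lam, Q = np.linalg.eigh(H)
    return Q @ (np.exp(-t * lam) * (Q.T @ x0))

H = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical symmetric positive definite matrix
x0 = np.array([1.0, -1.0])
T = 1.0

errors = []
for eta in [0.1, 0.05, 0.025]:
    k = int(round(T / eta))
    err = np.max(np.abs(sgd_mean(H, x0, eta, k) - sme_mean(H, x0, T)))
    errors.append(err)
    print(f"eta={eta:<6} error in the mean = {err:.5f}")
```

Halving $\eta$ should roughly halve the error, which is the first-order behavior stated in Cor. 10 for this particular test function.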

Observe that for each $t$, the distribution of $X_t$ is Gaussian. Using the Itô isometry, we then deduce the dynamics of the objective function

$$\mathbb{E} f(X_t) = \tfrac{1}{2} x_0^T H e^{-2tH} x_0 + \tfrac{1}{2}\eta \int_0^t \mathrm{Tr}\left( H^3 e^{-2(t-s)H} \right) ds = \tfrac{1}{2} x_0^T H e^{-2tH} x_0 + \tfrac{1}{4}\eta \sum_{i=1}^{d} \lambda_i^2(H) \left( 1 - e^{-2t\lambda_i(H)} \right). \tag{5.2}$$

The first term decays linearly with asymptotic rate $2\lambda_d(H)$, and the second term is induced by noise, and its asymptotic value is proportional to the learning rate $\eta$. This is the well-known two-phase behavior of SGD under constant learning rates: an initial descent phase induced by the deterministic gradient flow and an eventual fluctuation phase dominated by the variance of the stochastic gradients. In this sense, the SME makes the same prediction, and in fact we can see that it approximates the SGD iterations well as $\eta$ decreases (Fig. 5.1(a)), according to the rates we derived in Thm. 9 and Cor. 10.

Figure 5.1: SME prediction vs SGD dynamics. (a) SME as a weak approximation of the SGD. We compute the weak error with test function $g$ equal to $f$ (see Thm. 9). As predicted by our analysis, the order-2 SME (4.4) (order-1 SME (4.5)) should give a slope = 2 (1) decrease in error as $\eta$ decreases (note that the x-axis is flipped). The SME solution is computed using an exact formula derived by the application of the Itô isometry and the SGD expectation is averaged over 1e6 runs. We took T = 2. We see that the predictions of Thm. 9 and Cor. 10 hold. (b) Descent rate vs condition number. $H$ is generated with different condition numbers, and the resulting descent rate of SGD is approximately $\propto \kappa(H)^{-1}$, as predicted by the SME.

Moreover, notice that by the identification $t = k\eta$ ($k$ is the SGD iteration number), the SME analysis tells us that the asymptotic linear convergence rate (in $k$, i.e. rate $\sim -\log[\mathbb{E} f(x_k)]/k$) in the descent phase of the SGD is $2\lambda_d(H)\eta$. For numerical stability (even in the non-stochastic case), we usually require $\eta \propto 1/\lambda_1(H)$, thus the maximal descent rate is inversely proportional to the condition number $\kappa(H) = \lambda_1(H)/\lambda_d(H)$. We validate this observation by generating a collection of $H$'s with varying condition numbers and applying

SGD with $\eta \propto 1/\lambda_1(H)$. In Fig. 5.1(b), we plot the initial descent rates versus the condition number of $H$ and we observe that we indeed have rate $\propto \kappa(H)^{-1}$.

Alternate model. Now, we consider a slight variation of the model (5.1). The goal is to show that the dynamics of SGD (and the corresponding SME) is not always Gaussian-like and thus using the OU process to model the SGD is not always valid. Given the same positive definite matrix $H$, we diagonalize it in the form $H = QDQ^T$ where $Q$ is an orthogonal matrix and $D$ is a diagonal matrix of eigenvalues. We then define the sample objective

$$f_\gamma(x) := \tfrac{1}{2}(Q^T x)^T [D + \mathrm{diag}(\gamma)] (Q^T x), \qquad \gamma \sim \mathcal{N}(0, I), \tag{5.3}$$

which gives the same total objective $f(x) \equiv \mathbb{E} f_\gamma(x) = \tfrac{1}{2} x^T H x$. However, we have a different expression for $\Sigma(x)$:

$$\Sigma(x) = Q\,\mathrm{diag}(Q^T x)^2\,Q^T,$$

which gives the SME

$$dX_t = -H X_t\,dt + \sqrt{\eta}\,Q\,|\mathrm{diag}(Q^T X_t)|\,Q^T dW_t \;\overset{\text{in distribution}}{=}\; -H X_t\,dt + \sqrt{\eta}\,Q\,\mathrm{diag}(Q^T X_t)\,Q^T dW_t.$$

We can rewrite the above as

$$dX_t = -H X_t\,dt + \sqrt{\eta} \sum_{l=1}^{d} Q^{(l)} X_t\,dW_t^{(l)},$$

where $Q^{(l)} = Q\,\mathrm{diag}(Q_{(l,\cdot)})\,Q^T$ and $Q_{(l,\cdot)}$ denotes the $l$-th row of $Q$. By observing that every pair of $\{H, Q^{(1)}, \dots, Q^{(d)}\}$ commute, we have the explicit solution

$$X_t = \exp\left( -\tfrac{1}{2}\eta t\,I + \sqrt{\eta} \sum_{l=1}^{d} Q^{(l)} W_t^{(l)} \right) e^{-tH} x_0,$$

which is a multi-dimensional Black-Scholes (Black and Scholes, 1973) type of stochastic process. In particular, the distribution is not Gaussian for any $t > 0$. Nevertheless, we may take expectations to obtain

$$\mathbb{E} f(X_t) = \tfrac{1}{2} e^{\eta t} x_0^T H e^{-2tH} x_0.$$

This immediately implies the following interesting behavior: if $\eta < 2\lambda_d(H)$, then $2H - \eta I$ is positive definite and so $\mathbb{E} f(X_t) \to 0$ exponentially at constant, non-zero $\eta$. Otherwise, depending on the initial condition $x_0$, the objective may not converge to 0. In particular, if $\eta > 2\lambda_d(H)$ (which happens quite often if the condition number of $H$ is large) and $x_0$ is in general position, then we have asymptotic exponential divergence. This is a variance-induced divergence typically observed in Black-Scholes and geometric Brownian motion types of stochastic processes. The term variance-induced is important here since the deterministic part of the evolution equation is mean-reverting and in fact it is identical to the stable OU process studied earlier. In Fig.
5.2(a), (b), we show the correspondence of the SME findings

with the actual dynamics of the SGD iterations. In particular, we see in Fig. 5.2(c) that for small $\eta$, we have exponential convergence of the SGD at constant learning rates, whereas for $\eta > 2\lambda_d(H)$, the SGD iterates start to oscillate wildly and their mean value is dominated by a few large values and diverges approximately at the rate predicted by the SME. Note that this divergence is predicted to occur at a finite $\eta$, and from the theory developed so far we cannot conclude that the SME approximation always holds accurately in this regime (the approximation is only guaranteed for $\eta$ sufficiently small). Nevertheless, we observe at least in this model that the variance-induced divergence of the SGD happens as predicted by the SME.

Figure 5.2: SME prediction vs SGD dynamics for the model variant (5.3). (a) Order of convergence of the SME to the SGD. We use the same setup as in Fig. 5.1(a). Observe that our analysis again predicts the correct rate of weak error decay as $\eta$ decreases. (b) SGD paths vs order-1 SME prediction. Solid lines are SME exact solutions and dotted lines are means of SGD paths over 5 runs, and the percentiles are shaded. We observe convergence of $\mathbb{E} f$ at constant $\eta$, and that the sample mean is dominated by a few large values, as observed by the deviation of the percentiles from the mean. (c) Variance-induced explosion. As predicted by the SME analysis, if $\eta > 2\lambda_d(H)$ (here, $\lambda_d(H) = .1$), variance-induced instability sets in.
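The threshold $\eta = 2\lambda_d(H)$ can be illustrated without sampling. For the model variant (5.3), the SGD iteration decouples in the coordinates $y = Q^T x$ into $y_{k+1,i} = (1 - \eta(\lambda_i + \gamma_{k,i}))\,y_{k,i}$, so that $\mathbb{E} y_{k,i}^2 = [(1 - \eta\lambda_i)^2 + \eta^2]^k y_{0,i}^2$ exactly. The sketch below compares this exact SGD expectation with the SME formula $\tfrac{1}{2} e^{\eta t} x_0^T H e^{-2tH} x_0$ for one learning rate below and one above the threshold. The eigenvalues, initial point, horizon and function names are illustrative assumptions.

```python
import numpy as np

def sgd_expected_loss(lam, y0, eta, k):
    # Exact E f(x_k) for model (5.3): per eigen-coordinate second-moment factor.
    factor = (1.0 - eta * lam) ** 2 + eta ** 2
    return 0.5 * np.sum(lam * factor ** k * y0 ** 2)

def sme_expected_loss(lam, y0, eta, t):
    # SME prediction 0.5 * exp(eta*t) * x0^T H exp(-2tH) x0 in eigen-coordinates.
    return 0.5 * np.sum(lam * np.exp((eta - 2.0 * lam) * t) * y0 ** 2)

lam = np.array([0.1, 1.0])   # hypothetical eigenvalues of H, so 2*lambda_d = 0.2
y0 = np.array([1.0, 1.0])    # initial point in eigen-coordinates (assumed)
T = 200.0
initial_loss = 0.5 * np.sum(lam * y0 ** 2)

losses = {}
for eta in (0.05, 0.25):     # one value below and one above 2*lambda_d
    k = int(round(T / eta))
    losses[eta] = (sgd_expected_loss(lam, y0, eta, k),
                   sme_expected_loss(lam, y0, eta, T))
    print(f"eta={eta}: SGD E f = {losses[eta][0]:.3e}, SME E f = {losses[eta][1]:.3e}")
```

For the smaller learning rate both expectations decay far below the initial loss, while for the larger one both grow, in line with the variance-induced divergence described above. The two growth rates need not coincide at finite $\eta$, since the SME is only guaranteed to be accurate for small $\eta$.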

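5.2 SME analysis of MSGD

Let us now use the SME to analyze MSGD applied to model (5.1). We have shown earlier that $\Sigma(x) = H^2$. Thus, according to Thm. 14, the order-1 SME for MSGD is

$$dV_t = -[\mu V_t + H X_t]\,dt + \sqrt{\eta}\,H\,dW_t, \qquad dX_t = V_t\,dt, \tag{5.4}$$

with $X_0 = x_0$ and $V_0 = 0$. If we set $Y_t := (V_t, X_t) \in \mathbb{R}^{2d}$, let $U_t$ be a $2d$-dimensional Brownian motion with first $d$ coordinates equal to $W_t$, and define the block matrices

$$A := \begin{pmatrix} \mu I & H \\ -I & 0 \end{pmatrix}, \qquad B := \begin{pmatrix} H & 0 \\ 0 & 0 \end{pmatrix}, \tag{5.5}$$

we can then write (5.4) as

$$dY_t = -A Y_t\,dt + \sqrt{\eta}\,B\,dU_t, \qquad Y_0 = (0, x_0),$$

which admits the explicit solution

$$Y_t = e^{-tA} \left( Y_0 + \sqrt{\eta} \int_0^t e^{sA} B\,dU_s \right).$$

By the Itô isometry, we have

$$\mathbb{E} f(X_t) = \tfrac{1}{2} \left[ \left| \mathrm{diag}(0, H)^{1/2} e^{-tA} Y_0 \right|^2 + \eta \int_0^t \left| \mathrm{diag}(0, H)^{1/2} e^{-(t-s)A} B \right|^2 ds \right]. \tag{5.6}$$

One can see immediately that a similar two-phase behavior is present, but the property of the descent phase now hinges on the spectral properties of the matrix $A$ (instead of $H$). Before proceeding, we first observe that the eigenvalues of $A$ can be written as

$$\lambda(A) := \{\Lambda_+, \Lambda_-\}, \qquad \Lambda_{\pm,i} = \tfrac{1}{2} \left( \mu \pm \sqrt{\mu^2 - 4\lambda_i(H)} \right), \qquad i = 1, 2, \dots, d. \tag{5.7}$$

In particular, $\Re\lambda_i(A) > 0$ for all $i$ as long as $\mu > 0$. We also need the following simple result concerning the decay of the norm of $e^{-tA}$ if all eigenvalues of $A$ have positive real parts.

Lemma 17 Let $A$ be a real square matrix such that all eigenvalues have positive real part. Then,

(i) For each $\epsilon > 0$, there exists a constant $C_\epsilon > 0$, independent of $t$ but dependent on $\epsilon$, such that
$$|e^{-tA}| \le C_\epsilon\, e^{-t(\min_i \Re\lambda_i(A) - \epsilon)}.$$

(ii) If in addition $A$ is diagonalizable, then there exists a constant $C > 0$ independent of $t$ such that
$$|e^{-tA}| \le C e^{-t \min_i \Re\lambda_i(A)}.$$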

Proof See Appendix E.

With the above results, we can now characterize the decay of the objective under momentum SGD. From expression (5.7), we see that as long as $\mu^2 \ne 4\lambda_i$ for any $i = 1, \dots, d$, $A$ has $2d$ distinct eigenvalues and is hence diagonalizable. We shall hereafter assume that $\mu$ is in general position such that this is the case. Using Lem. 17 and expression (5.6), we arrive at the estimate

$$\mathbb{E} f(X_t) \le \tfrac{1}{2} C^2 |x_0|^2 \lambda_1(H)\, e^{-2t \min_i \Re\lambda_i(A)} + \frac{\eta C^2 \lambda_1(H)^3}{\min_i \Re\lambda_i(A)} \left( 1 - e^{-2t \min_i \Re\lambda_i(A)} \right). \tag{5.8}$$

This result tells us that the convergence rate of the descent phase is now controlled by the minimum real part of the eigenvalues of $A$, instead of the minimum eigenvalue of $H$. In particular, we achieve the best linear convergence rate by maximizing the smallest real part of the eigenvalues of $A$. This leads to the following optimization problem for the optimal convergence rate:

$$\sup_{\mu \in (0, \infty)} \; \min_{i = 1, \dots, d} \; \min_{s \in \{+1, -1\}} \; \Re\left[ \mu + s\sqrt{\mu^2 - 4\lambda_i(H)} \right].$$

Since $H$ is positive definite, the supremum is attained at $\mu^* = 2\sqrt{\lambda_d(H)}$ with the rate also equal to $2\sqrt{\lambda_d(H)}$. However, note that if we take $\mu = \mu^*$ exactly, one can check that $A$ is no longer diagonalizable and by Lem. 17, the rate is slightly diminished. Thus technically we can take $\mu$ as close to $\mu^*$ as we like (i.e. the rate is as close to $2\sqrt{\lambda_d(H)}$ as we like), but exact equality is not technically deducible from the current results. In Fig. 5.3(c), we demonstrate the optimal choice of $\mu$ and its effect on the convergence rate.

Moreover, observe that as $\mu$ increases, the number of complex eigenvalues starts to decrease, and the magnitude of the imaginary parts of the complex eigenvalues also decreases. This signifies that increasing $\mu$ causes oscillations to decrease in magnitude and frequency. This is again corroborated by numerical experiments (Fig. 5.3(c)).

Another interesting observation is that by the identification $t = \eta k$, the descent rate (in terms of $k$) is $2\sqrt{\lambda_d(H)}\,\eta$. As before, if we choose the maximal stable learning rate we would have $\hat\eta \propto 1/\lambda_1(H)$ ($\hat\eta = \eta^2$ according to the scaling introduced in (4.6)).
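The optimal friction $\mu^* = 2\sqrt{\lambda_d(H)}$ can be checked numerically by scanning $\mu$ and computing twice the smallest real part of the eigenvalues of $A$. The sketch below assumes the block form $A = \begin{pmatrix} \mu I & H \\ -I & 0 \end{pmatrix}$ implied by (5.4); the matrix `H`, the grid of $\mu$ values and the function name are illustrative assumptions.

```python
import numpy as np

def descent_rate(mu, H):
    # Twice the smallest real part of the eigenvalues of A = [[mu*I, H], [-I, 0]].
    d = H.shape[0]
    I = np.eye(d)
    A = np.block([[mu * I, H], [-I, np.zeros((d, d))]])
    return 2.0 * np.min(np.linalg.eigvals(A).real)

H = np.diag([0.25, 1.0, 4.0])            # hypothetical H with lambda_d = 0.25
mus = np.linspace(0.1, 3.0, 291)         # grid of friction parameters, step 0.01
rates = np.array([descent_rate(mu, H) for mu in mus])

best_mu = mus[np.argmax(rates)]
mu_star = 2.0 * np.sqrt(np.min(np.linalg.eigvalsh(H)))
print(f"grid maximizer mu = {best_mu:.2f}, predicted mu* = {mu_star:.2f}")
print(f"best rate = {rates.max():.3f}, predicted rate = {mu_star:.3f}")
```

Below $\mu^*$ all eigenvalues are complex and the rate grows linearly in $\mu$; above it the slowest mode becomes overdamped and the rate falls, so the grid maximizer should sit at the predicted value up to the grid resolution.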
Thus, for the MSGD iterates we have a descent rate $\propto \kappa(H)^{-1/2}$, which is a huge improvement over SGD, whose rate is $\propto \kappa(H)^{-1}$, especially for badly conditioned matrices where $\kappa(H) \gg 1$. In Fig. 5.3(d), we plot the MSGD initial descent rates for varying condition numbers of $H$. Again, we observe that the SME analysis gives the correct characterization of the precise dynamics and recovers the square-root relationship with the condition number.

Finally, let us discuss the effect of adding momentum on the asymptotic fluctuations due to noisy gradients. Note that it is not correct to conclude, using Eq. (5.8), that taking $\mu \approx \mu^*$ also gives the lowest fluctuations. This is because the constant $C$ depends on $\mu$ as well, as is evidenced in the proof of Lem. 17, which shows that $C$ depends on the conditioning of the eigenvector matrix of $A$. To proceed, we do not use the bound (5.8). Instead, we explicitly

diagonalize $A$ and after some computations, we arrive at the exact expression for $\mathbb{E} f(X_t)$:

$$\mathbb{E} f(X_t) = \tfrac{1}{2} \left| \mathrm{diag}(0, H)^{1/2} e^{-tA} Y_0 \right|^2 \tag{5.9}$$
$$\qquad + \tfrac{1}{2}\eta \sum_{i=1}^{d} \frac{\lambda_i^3}{|\mu^2 - 4\lambda_i|} \left[ \frac{1 - e^{-2t\Re\Lambda_{+,i}}}{2\Re\Lambda_{+,i}} + \frac{1 - e^{-2t\Re\Lambda_{-,i}}}{2\Re\Lambda_{-,i}} - 2R(t, \mu, \lambda_i(H)) \right] \tag{5.10}$$

Figure 5.3: SME prediction vs MSGD dynamics. (a) and (b) SME vs MSGD dynamics at $\mu = .1$ for different learning rates $\eta$. As before, the SME prediction gets better as $\eta$ decreases according to the predicted order. Notice also the presence of oscillations, due to the complex eigenvalues of $A$. (c) The optimal descent rate of the MSGD is achieved by the SME prediction $\mu = \mu^*$, which is 0.95 in this case. Notice that exactly as predicted by the SME, increasing $\mu$ decreases the oscillation frequency and magnitude (due to having fewer complex eigenvalues and smaller imaginary parts), as well as the asymptotic fluctuations (due to formula (5.9)). (d) Descent rate vs condition number. $H$ is generated with different condition numbers, and the descent rate of MSGD is $\propto \kappa(H)^{-1/2}$, as predicted by the SME, which for badly conditioned $H$ gives a large improvement.

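where

$$R(t, \mu, \lambda) = \begin{cases} \dfrac{1 - e^{-\mu t}}{\mu}, & \mu \ge 2\sqrt{\lambda}, \\[2ex] \dfrac{\mu + \sqrt{4\lambda - \mu^2}\, e^{-\mu t} \sin\left( \sqrt{4\lambda - \mu^2}\, t \right) - \mu e^{-\mu t} \cos\left( \sqrt{4\lambda - \mu^2}\, t \right)}{4\lambda}, & \mu < 2\sqrt{\lambda}. \end{cases} \tag{5.11}$$

In particular, the asymptotic loss value induced by noise is

$$\lim_{t \to \infty} \mathbb{E} f(X_t) = \tfrac{1}{2}\eta \sum_{i=1}^{d} \frac{\lambda_i(H)^3}{|\mu^2 - 4\lambda_i(H)|} \left[ \frac{1}{2\Re\Lambda_{+,i}} + \frac{1}{2\Re\Lambda_{-,i}} - 2\min\left\{ \frac{\mu}{4\lambda_i(H)}, \frac{1}{\mu} \right\} \right]. \tag{5.12}$$

Observe that this function (in fact, each term in the sum) is monotone-decreasing in $\mu$, and for $\mu \ll 1$ it scales like $\mu^{-1}$, and for $\mu \gg 1$ it scales like $\mu^{-3}$. Thus, increasing the momentum parameter decreases the asymptotic noise in the iterates, i.e. decreases the asymptotic value of $\mathbb{E} f$, which should be 0 in the absence of noise. This again agrees with the actual MSGD dynamics (Fig. 5.3(b)). Consequently, to obtain the optimal tradeoff between descent and noise, we would like a momentum schedule that equals $\mu^*$ in the descent phase and increases to infinity (in the original scaling this corresponds to $\hat\mu \to 0$) as we approach the optimum. Finding this optimal schedule can be cast as an optimal control problem (Li et al., 2017), and a rigorous investigation of these approaches will be considered in subsequent work.

5.3 SME analysis of SNAG

Finally, let us see what we can say, using the SME approach, about the difference between MSGD and SNAG in this stochastic setting. Let us first consider the case of constant momentum. From Thm. 16, we know that the order-1 SMEs are identical, so we must consider higher order SMEs. A straightforward computation yields the following order-2 SMEs for MSGD and SNAG (again we let $Y = (V, X)$):

$$\text{MSGD:} \quad dY_t = -A_1 Y_t\,dt + \sqrt{\eta}\,B\,dU_t, \qquad Y_0 = (0, x_0),$$
$$\text{SNAG:} \quad dY_t = -A_2 Y_t\,dt + \sqrt{\eta}\,B\,dU_t, \qquad Y_0 = (0, x_0),$$

where $A_i = A + \tfrac{1}{2}\eta E_i$ with $A$, $B$ as defined in (5.5) and

$$E_1 := \begin{pmatrix} \mu^2 I - H & \mu H \\ \mu I & H \end{pmatrix}, \qquad E_2 := \begin{pmatrix} \mu^2 I + H & \mu H \\ \mu I & H \end{pmatrix}.$$

From the analysis in Sec. 5.2, the descent rate is dominated by the minimal real part of the eigenvalues of $A_i$, which are respectively

$$\lambda(A_1) = \left\{ \tfrac{1}{4} \left( \mu(\eta\mu + 2) \pm \sqrt{\mu^2(\eta\mu + 2)^2 + 4\eta^2\lambda_i(H)^2 - 8\lambda_i(H)(\eta\mu + 2)} \right),\; i = 1, \dots, d \right\},$$
$$\lambda(A_2) = \left\{ \tfrac{1}{4} \left( \mu(\eta\mu + 2) + 2\eta\lambda_i(H) \pm \sqrt{\eta\mu + 2}\,\sqrt{\mu^2(\eta\mu + 2) + 4\lambda_i(H)(\eta\mu - 2)} \right),\; i = 1, \dots, d \right\}.$$

We observe that for small $\mu$ (i.e.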
$\hat\mu \approx 1$ in the usual MSGD scaling), the terms in the square-roots are negative and hence for the same small $\mu$, the convergence rate of SNAG is $\tfrac{1}{2}\eta\lambda_d(H)$ larger
