Global Convergence of Online Limited Memory BFGS


Journal of Machine Learning Research 16 (2015). Submitted 9/14; Revised 7/15; Published 12/15.

Aryan Mokhtari
Alejandro Ribeiro
Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA

Editor: Léon Bottou

Abstract

Global convergence of an online (stochastic) limited memory version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method for solving optimization problems with stochastic objectives that arise in large scale machine learning is established. Lower and upper bounds on the Hessian eigenvalues of the sample functions are shown to suffice to guarantee that the curvature approximation matrices have bounded determinants and traces, which, in turn, permits establishing convergence to optimal arguments with probability 1. Experimental evaluation on a search engine advertising problem showcases reductions in convergence time relative to stochastic gradient descent algorithms.

Keywords: quasi-Newton methods, large-scale optimization, stochastic optimization

1. Introduction

Many problems in machine learning can be reduced to the minimization of a stochastic objective defined as an expectation over a set of random functions (Bottou and Le Cun (2005); Bottou (2010); Shalev-Shwartz and Srebro (2008); Mokhtari and Ribeiro (2014b)). Specifically, consider an optimization variable w ∈ R^n and a random variable θ ∈ Θ ⊆ R^p that determines the choice of a function f(w, θ) : R^n × R^p → R. Stochastic optimization problems entail determination of the argument w* that minimizes the expected value F(w) := E_θ[f(w, θ)],

    w* := argmin_w E_θ[f(w, θ)] := argmin_w F(w).    (1)

We refer to f(w, θ) as the random or instantaneous functions and to F(w) := E_θ[f(w, θ)] as the average function. A canonical class of problems having this form are support vector machines (SVMs) that reduce binary classification to the determination of a hyperplane that separates points in a given training set; see, e.g., Vapnik (2000); Bottou (2010); Boser et al. (1992). In that case, θ denotes individual training samples, f(w, θ) the loss of choosing the hyperplane defined by w, and F(w) := E_θ[f(w, θ)] the mean loss across all elements of the training set. The optimal argument w* is the optimal linear classifier. Numerical evaluation of objective function gradients ∇_w F(w) = E_θ[∇_w f(w, θ)] is intractable when the cardinality of Θ is large, as is the case, e.g., when SVMs are trained on large sets. This motivates the use of algorithms relying on stochastic gradients that provide gradient estimates based on small data subsamples.

For the purpose of this paper, stochastic optimization algorithms can be divided into three categories: stochastic gradient descent (SGD) and related first order methods, stochastic Newton methods, and stochastic quasi-Newton methods. SGD is the most popular method used to solve stochastic optimization problems (Bottou (2010); Shalev-Shwartz et al. (2011); Zhang (2004)). However, as we consider problems of ever larger dimension, its slow convergence times have limited its practical appeal and fostered the search for alternatives. In this regard, it has to be noted that SGD is slow because it uses gradients as descent directions, which leads to poor curvature approximation in ill-conditioned problems. The gold standard for dealing with ill-conditioned functions in a deterministic setting is Newton's method. However, unbiased stochastic estimates of Newton steps cannot be computed in general. This fact limits the application of stochastic Newton methods to problems with specific structure (Birge et al. (1994); Zargham et al. (2013)).

If SGD is slow to converge and stochastic Newton cannot be used in general, the remaining alternative is to modify deterministic quasi-Newton methods that speed up convergence times relative to gradient descent without using Hessian evaluations (Dennis and Moré (1974); Powell (1976); Byrd et al. (1987); Nocedal and Wright (1999)). This has resulted in the development of the stochastic quasi-Newton methods known as online Broyden-Fletcher-Goldfarb-Shanno (oBFGS) (Schraudolph et al. (2007); Bordes et al. (2009)), regularized stochastic BFGS (RES) (Mokhtari and Ribeiro (2014a)), and online limited memory BFGS (oLBFGS) (Schraudolph et al. (2007)), which occupy the middle ground of broad applicability irrespective of problem structure and conditioning. All three of these algorithms extend BFGS by using stochastic gradients both as descent directions and as constituents of Hessian estimates. The oBFGS algorithm is a direct generalization of BFGS that uses stochastic gradients in lieu of deterministic gradients. RES differs in that it further modifies BFGS to yield an algorithm that retains its convergence advantages while improving theoretical convergence guarantees and numerical behavior. The oLBFGS method uses a modification of BFGS to reduce the computational cost of each iteration.

An important observation here is that in trying to adapt to the changing curvature of the objective, stochastic quasi-Newton methods may end up exacerbating the problem. Indeed, since Hessian estimates are stochastic, it is possible to end up with almost singular Hessian estimates. The corresponding small eigenvalues then result in a catastrophic amplification of the noise which nullifies progress made towards convergence. This is not a minor problem. In oBFGS this possibility precludes convergence analyses and may result in erratic numerical behavior (Mokhtari and Ribeiro (2014a)). As a matter of fact, the main motivation for the introduction of RES is to avoid this catastrophic noise amplification so as to retain smaller convergence times while ensuring that optimal arguments are found with probability 1 (Mokhtari and Ribeiro (2014a)). Generally, stochastic quasi-Newton methods whose Hessian approximations have bounded eigenvalues converge to optimal arguments (Sunehag et al. (2009)). However valuable, the convergence guarantees of RES and the convergence time advantages of oBFGS and RES are tainted by iteration costs of order O(n^2) and O(n^3), respectively, which precludes their use in problems where n is very large. In deterministic settings this problem is addressed by limited memory BFGS (LBFGS) (Liu and Nocedal (1989)), which can be easily generalized to develop the oLBFGS algorithm (Schraudolph et al. (2007)). Numerical tests of oLBFGS are promising but theoretical convergence characterizations are still lacking. The main contribution of this paper is to show that oLBFGS converges with probability 1 to optimal arguments across realizations of the random variables θ. This is the same convergence guarantee provided for RES and is in marked contrast with oBFGS, which fails to converge if not properly regularized. Convergence guarantees for oLBFGS do not require such measures.

We begin the paper with brief discussions of deterministic BFGS (Section 2) and LBFGS (Section 2.1) and the introduction of oLBFGS (Section 2.2). The fundamental idea in BFGS and oLBFGS is to continuously satisfy a secant condition while staying close to previous curvature estimates. They differ in that BFGS uses all past gradients to estimate curvature while oLBFGS uses a fixed moving window of past gradients. The use of this window reduces memory and computational cost (Appendix A). The difference between LBFGS and oLBFGS is the use of stochastic gradients in lieu of their deterministic counterparts. Note that choosing a large mini-batch for computing stochastic gradients reduces the variance of the stochastic approximation and decreases the gap between LBFGS and oLBFGS; however, it also increases the computational cost of oLBFGS. Therefore, picking a suitable mini-batch size is an important step in the implementation of oLBFGS.

Convergence properties of oLBFGS are then analyzed (Section 3). Under the assumption that the sample functions f(w, θ) are strongly convex, we show that the trace and determinant of the Hessian approximations computed by oLBFGS are upper and lower bounded, respectively (Lemma 3). These bounds are then used to limit the range of variation of the ratio between the Hessian approximations' largest and smallest eigenvalues (Lemma 4). In turn, this condition number limit is shown to be sufficient to prove convergence to the optimal argument w* with probability 1 over realizations of the sample functions (Theorem 6). This is an important result because it ensures that oLBFGS does not suffer from the numerical problems that hinder oBFGS. We complement this almost sure convergence result with a characterization of the convergence rate, which is shown to be at least O(1/t) in expectation (Theorem 7). It is fair to emphasize that, different from the deterministic case, the convergence rate of oLBFGS is not better than the convergence rate of SGD. This is not a limitation of our analysis. The difference between stochastic and regular gradients introduces a noise term that dominates convergence once we are close to the optimum, which is where superlinear convergence rates manifest. In fact, the same convergence rate would be observed if exact Hessians were available. The best that can be proven of oLBFGS is that the convergence rate is not worse than that of SGD. Given that the theoretical guarantees only state that the curvature correction does not exacerbate the problem's condition, it is perhaps fairer to describe oLBFGS as an adaptive reconditioning strategy instead of a stochastic quasi-Newton method. The latter description refers to the genesis of the algorithm. The former is a more accurate description of its actual behavior.

To show the advantage of oLBFGS we use it to train a logistic regressor to predict the click-through rate in a search engine advertising problem (Section 4). The logistic regression uses a heterogeneous feature vector with 174,026 binary entries that describe the user, the search, and the advertisement (Section 4.1). Being a large scale problem with heterogeneous data, the condition number of the logistic log-likelihood objective is large and we expect to see significant advantages of oLBFGS relative to SGD. This expectation is fulfilled. The oLBFGS algorithm trains the regressor using less than 1% of the data required by SGD to obtain similar classification accuracy (Section 4.3). We close the paper with concluding remarks (Section 5).

Notation. Lowercase boldface v denotes a vector and uppercase boldface A a matrix. We use ‖v‖ to denote the Euclidean norm of vector v and ‖A‖ to denote the Euclidean norm of matrix A. The trace of A is written as tr(A) and the determinant as det(A). We use I for the identity matrix of appropriate dimension. The notation A ⪯ B implies that the matrix B − A is positive semidefinite. The operator E_x[·] stands in for expectation over random variable x and E[·] for expectation with respect to the distribution of a stochastic process.

2. Algorithm Definition

Recall the definitions of the sample functions f(w, θ) and the average function F(w) := E_θ[f(w, θ)]. We assume the sample functions f(w, θ) are strongly convex for all θ. This implies that the objective function F(w) := E_θ[f(w, θ)], being an average of the strongly convex sample functions, is also strongly convex. We define the gradient s(w) := ∇F(w) of the average function F(w) and assume that it can be computed as

    s(w) := ∇F(w) = E_θ[∇f(w, θ)].    (2)

Since the function F(w) is strongly convex, gradients s(w) are descent directions that can be used to find the optimal argument w* in (1). Introduce then a time index t, a step size ε_t, and a positive definite matrix B_t^{-1} ≻ 0 to define a generic descent algorithm through the iteration

    w_{t+1} = w_t − ε_t B_t^{-1} s(w_t) = w_t − ε_t d_t,    (3)

where we have also defined the descent step d_t = B_t^{-1} s(w_t). When B_t = I is the identity matrix, (3) reduces to gradient descent. When B_t = H(w_t) := ∇²F(w_t) is the Hessian of the objective function, (3) defines Newton's algorithm. In this paper we focus on quasi-Newton methods whereby we attempt to select matrices B_t close to the Hessian H(w_t).
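To fix ideas, the sketch below (an illustration, not from the paper) instantiates the generic iteration (3) for a simple quadratic objective and shows how the choices B_t = I and B_t = ∇²F(w_t) recover gradient descent and Newton's method; the quadratic objective and all parameter values are assumptions made only for this example.

```python
import numpy as np

# Illustrative quadratic objective F(w) = 0.5 * w' A w - b' w (assumed for this example).
A = np.diag([1.0, 100.0])          # ill-conditioned Hessian
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b         # s(w) = grad F(w)

def descent(B_inv, eps, iters=50):
    """Generic iteration (3): w_{t+1} = w_t - eps_t * B_t^{-1} s(w_t)."""
    w = np.zeros(2)
    for t in range(iters):
        w = w - eps * B_inv @ grad(w)
    return w

w_gd     = descent(np.eye(2), eps=0.01)          # B_t = I        -> gradient descent
w_newton = descent(np.linalg.inv(A), eps=1.0)    # B_t = Hessian  -> Newton's method
print(w_gd, w_newton, np.linalg.solve(A, b))     # Newton reaches the minimizer immediately
```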

Various methods are known to select matrices B_t, including those by Broyden, e.g., Broyden et al. (1973); Davidon, Fletcher, and Powell (DFP), e.g., Fletcher (2013); and Broyden, Fletcher, Goldfarb, and Shanno (BFGS), e.g., Byrd et al. (1987); Powell (1976). We work with the matrices B_t used in BFGS since they have been observed to work best in practice (see Byrd et al. (1987)).

In BFGS, the curvature of the function is approximated by a finite difference. Let v_t denote the variable variation at time t and r_t the gradient variation at time t, which are respectively defined as

    v_t := w_{t+1} − w_t,    r_t := s(w_{t+1}) − s(w_t).    (4)

We select the matrix B_{t+1} to be used in the next time step so that it satisfies the secant condition B_{t+1} v_t = r_t. The rationale for this selection is that the Hessian H(w_t) satisfies this condition for w_{t+1} tending to w_t. Notice however that the secant condition B_{t+1} v_t = r_t is not enough to completely specify B_{t+1}. To resolve this indeterminacy, matrices B_{t+1} in BFGS are also required to be as close as possible to the previous Hessian approximation B_t in terms of differential entropy (see Mokhtari and Ribeiro (2014a)). These conditions can be resolved in closed form, leading to the explicit expression

    B_{t+1} = B_t + (r_t r_t^T)/(v_t^T r_t) − (B_t v_t v_t^T B_t)/(v_t^T B_t v_t).    (5)

While the expression in (5) permits updating the Hessian approximations B_{t+1}, implementation of the descent step in (3) requires its inversion. This can be avoided by using the Sherman-Morrison formula in (5) to write

    B_{t+1}^{-1} = Z_t^T B_t^{-1} Z_t + ρ_t v_t v_t^T,    (6)

where we defined the scalar ρ_t and the matrix Z_t as

    ρ_t := 1/(v_t^T r_t),    Z_t := I − ρ_t r_t v_t^T.    (7)

The updates in (5) and (6) require the inner product of the gradient and variable variations to be positive, i.e., v_t^T r_t > 0. This is always true if the objective F(w) is strongly convex and further implies that B_{t+1} stays positive definite if B_t ≻ 0 (Nocedal and Wright (1999)).

Each BFGS iteration has a cost of O(n^2) arithmetic operations. This is less than the O(n^3) of each step in Newton's method but more than the O(n) cost of each gradient descent iteration. In general, the relative convergence rates are such that the total computational cost of BFGS to achieve a target accuracy is smaller than the corresponding cost of gradient descent. Still, alternatives that reduce the computational cost of each iteration are of interest for large scale problems. Likewise, BFGS requires storage and propagation of the O(n^2) elements of B_t^{-1}, whereas gradient descent requires storage of O(n) gradient elements only. This motivates alternatives that have smaller memory footprints. Both of these objectives are accomplished by the limited memory BFGS (LBFGS) algorithm that we describe in the following section.
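As an illustration of the update in (6)-(7), the following sketch (not part of the paper; the quadratic test function is an assumption) applies the inverse BFGS update to exact curvature pairs and verifies that the secant condition, B_{t+1}^{-1} r_t = v_t in inverse form, holds after each update.

```python
import numpy as np

def bfgs_inverse_update(B_inv, v, r):
    """One application of (6)-(7): returns Z' B_inv Z + rho v v'."""
    rho = 1.0 / (v @ r)                    # rho_t = 1 / (v_t' r_t), requires v_t' r_t > 0
    Z = np.eye(len(v)) - rho * np.outer(r, v)
    return Z.T @ B_inv @ Z + rho * np.outer(v, v)

# Quadratic test objective with Hessian A (assumed for the example), so r_t = A v_t.
rng = np.random.default_rng(0)
n = 5
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)                # strongly convex quadratic
B_inv = np.eye(n)
for _ in range(10):
    v = rng.standard_normal(n)             # variable variation
    r = A @ v                              # gradient variation of the quadratic
    B_inv = bfgs_inverse_update(B_inv, v, r)
    assert np.allclose(B_inv @ r, v)       # secant condition in inverse form

# Compare with the true inverse Hessian of the quadratic (rough agreement, not exact).
print(np.linalg.norm(B_inv - np.linalg.inv(A)))
```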

2.1 LBFGS: Limited Memory BFGS

As it follows from (6), the updated Hessian inverse approximation B_t^{-1} depends on B_{t-1}^{-1} and the curvature information pair {v_{t-1}, r_{t-1}}. In turn, to compute B_{t-1}^{-1}, the estimate B_{t-2}^{-1} and the curvature pair {v_{t-2}, r_{t-2}} are used. Proceeding recursively, it follows that B_t^{-1} is a function of the initial approximation B_0^{-1} and all previous curvature information pairs {v_u, r_u}_{u=0}^{t-1}. The idea in LBFGS is to restrict the use of past curvature information to the last τ pairs {v_u, r_u}_{u=t−τ}^{t−1}. Since earlier iterates {v_u, r_u} with u < t − τ are likely to carry little information about the curvature at the current iterate w_t, this restriction is expected to result in a minimal performance penalty.

For a precise definition, pick a positive definite matrix B_{t,0}^{-1} as the initial Hessian inverse approximation at step t. Proceed then to perform τ updates of the form in (6) using the last τ curvature information pairs {v_u, r_u}_{u=t−τ}^{t−1}. Denoting as B_{t,u}^{-1} the curvature approximation after u updates are performed, the refined matrix approximation B_{t,u+1}^{-1} is given by [cf. (6)]

    B_{t,u+1}^{-1} = Z_{t−τ+u}^T B_{t,u}^{-1} Z_{t−τ+u} + ρ_{t−τ+u} v_{t−τ+u} v_{t−τ+u}^T,    (8)

where u = 0, ..., τ−1 and the constants ρ_{t−τ+u} and identity-plus-rank-one matrices Z_{t−τ+u} are as given in (7). The inverse Hessian approximation B_t^{-1} to be used in (3) is the one yielded after completing the τ updates in (8), i.e., B_t^{-1} = B_{t,τ}^{-1}. Observe that when t < τ there are not enough pairs {v_u, r_u} to perform τ updates. In such case we just redefine τ = t and proceed to use the t = τ available pairs {v_u, r_u}_{u=0}^{t−1}.

Implementation of the product B_t^{-1} s(w_t) in (3) for matrices B_t^{-1} = B_{t,τ}^{-1} obtained from the recursion in (8) does not need explicit computation of the matrix B_{t,τ}^{-1}. Although the details are not straightforward, observe that each iteration in (8) is similar to a rank-one update and that as such it is not unreasonable to expect that the product B_t^{-1} s(w_t) = B_{t,τ}^{-1} s(w_t) can be computed using τ recursive inner products. Assuming that this is possible, the implementation of the recursion in (8) does not need computation and storage of prior matrices B_t^{-1}. Rather, it suffices to keep the τ most recent curvature information pairs {v_u, r_u}_{u=t−τ}^{t−1}, thus reducing storage requirements from O(n^2) to O(τn). Furthermore, each of these inner products can be computed at a cost of n operations, yielding a total computational cost of O(τn) per LBFGS iteration. Hence, LBFGS decreases both the memory requirements and the computational cost of each iteration from the O(n^2) required by regular BFGS to O(τn). We present the details of this iteration in the context of the online (stochastic) LBFGS that we introduce in the following section.

2.2 Online (Stochastic) Limited Memory BFGS

To implement (3) and (8) we need to compute gradients s(w_t). This is impractical when the number of functions f(w, θ) is large, as is the case in most stochastic problems of practical interest, and motivates the use of stochastic gradients in lieu of actual gradients. Consider a given set of L realizations θ̃ = [θ_1; ...; θ_L] and define the stochastic gradient of F(w) at w given samples θ̃ as

    ŝ(w, θ̃) := (1/L) Σ_{l=1}^{L} ∇f(w, θ_l).    (9)

In oLBFGS we use stochastic gradients ŝ(w, θ̃) both as descent directions and as curvature estimators. In particular, the descent iteration in (3) is replaced by the descent iteration

    w_{t+1} = w_t − ε_t B̂_t^{-1} ŝ(w_t, θ̃_t) = w_t − ε_t d̂_t,    (10)

where θ̃_t = [θ_{t1}; ...; θ_{tL}] is the set of samples used at step t to compute the stochastic gradient ŝ(w_t, θ̃_t) as per (9), and the matrix B̂_t^{-1} is a function of past stochastic gradients ŝ(w_u, θ̃_u) with u ≤ t instead of a function of past gradients s(w_u) as in (3). As we also did in (3), we have defined the stochastic step d̂_t := B̂_t^{-1} ŝ(w_t, θ̃_t) to simplify upcoming discussions. To properly specify B̂_t^{-1} we define the stochastic gradient variation r̂_t at time t as the difference between the stochastic gradients ŝ(w_{t+1}, θ̃_t) and ŝ(w_t, θ̃_t) associated with subsequent iterates w_{t+1} and w_t and the common set of samples θ̃_t [cf. (4)],

    r̂_t := ŝ(w_{t+1}, θ̃_t) − ŝ(w_t, θ̃_t).    (11)

Observe that ŝ(w_t, θ̃_t) is the stochastic gradient used at time t in (10), but that ŝ(w_{t+1}, θ̃_t) is computed solely for the purpose of determining the stochastic gradient variation. The perhaps more natural definition ŝ(w_{t+1}, θ̃_{t+1}) − ŝ(w_t, θ̃_t) for the stochastic gradient variation, which relies on the stochastic gradient ŝ(w_{t+1}, θ̃_{t+1}) used at time t+1 in (10), is not sufficient to guarantee convergence; see e.g., Mokhtari and Ribeiro (2014a).
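A minimal sketch of (9) and (11) follows, assuming a user-supplied per-sample gradient `grad_f(w, theta)`; note that the same mini-batch θ̃_t is reused at both w_t and w_{t+1} when forming the variation r̂_t, which is the point the paragraph above emphasizes. The least-squares sample loss in the example is an assumption made for illustration.

```python
import numpy as np

def stoch_grad(grad_f, w, batch):
    """Mini-batch stochastic gradient (9): average of per-sample gradients."""
    return sum(grad_f(w, theta) for theta in batch) / len(batch)

def gradient_variation(grad_f, w_t, w_next, batch_t):
    """Stochastic gradient variation (11), evaluated on the SAME samples theta_t."""
    return stoch_grad(grad_f, w_next, batch_t) - stoch_grad(grad_f, w_t, batch_t)

# Example with a least-squares sample loss f(w, (x, y)) = 0.5 (x'w - y)^2 (an assumption).
grad_f = lambda w, s: (s[0] @ w - s[1]) * s[0]
rng = np.random.default_rng(1)
batch = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(10)]
w_t, w_next = np.zeros(3), 0.1 * rng.standard_normal(3)
r_hat = gradient_variation(grad_f, w_t, w_next, batch)
# v_t' r_hat_t: nonnegative for convex sample losses, strictly positive under strong convexity.
print(r_hat @ (w_next - w_t))
```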

To define the oLBFGS algorithm we just need to provide stochastic versions of the definitions in (7) and (8). The scalar constants and identity-plus-rank-one matrices in (7) are redefined to the corresponding stochastic quantities

    ρ̂_{t−τ+u} = 1/(v_{t−τ+u}^T r̂_{t−τ+u})    and    Ẑ_{t−τ+u} = I − ρ̂_{t−τ+u} r̂_{t−τ+u} v_{t−τ+u}^T,    (12)

whereas the LBFGS matrix B_t^{-1} = B_{t,τ}^{-1} in (8) is replaced by the oLBFGS Hessian inverse approximation B̂_t^{-1} = B̂_{t,τ}^{-1}, which we define as the outcome of τ recursive applications of the update

    B̂_{t,u+1}^{-1} = Ẑ_{t−τ+u}^T B̂_{t,u}^{-1} Ẑ_{t−τ+u} + ρ̂_{t−τ+u} v_{t−τ+u} v_{t−τ+u}^T,    (13)

where the initial matrix B̂_{t,0}^{-1} is given and the index is u = 0, ..., τ−1. The oLBFGS algorithm is defined by the stochastic descent iteration in (10) with matrices B̂_t^{-1} = B̂_{t,τ}^{-1} computed by τ recursive applications of (13). Except for the fact that they use stochastic variables, (10) and (13) are identical to (3) and (8). Thus, as is the case in (3), the Hessian inverse approximation in (13) is a function of the initial Hessian inverse approximation B̂_{t,0}^{-1} and the τ most recent curvature information pairs {v_u, r̂_u}_{u=t−τ}^{t−1}. Likewise, when t < τ there are not enough pairs {v_u, r̂_u} to perform τ updates. In such case we just redefine τ = t and proceed to use the t = τ available pairs {v_u, r̂_u}_{u=0}^{t−1}. We also point out that the update in (13) necessitates r̂_u^T v_u > 0 for all time indexes u. This is true as long as the instantaneous functions f(w, θ) are strongly convex with respect to w, as we show in Lemma 2.

The equations in (10) and (13) are used conceptually but not in practical implementations. For the latter we exploit the structure of (13) to rearrange the terms in the computation of the product B̂_t^{-1} ŝ(w_t, θ̃_t). To see how this is done, consider the recursive update for the Hessian inverse approximation in (13) and make u = τ − 1 to write

    B̂_t^{-1} = B̂_{t,τ}^{-1} = Ẑ_{t−1}^T B̂_{t,τ−1}^{-1} Ẑ_{t−1} + ρ̂_{t−1} v_{t−1} v_{t−1}^T.    (14)

Equation (14) shows the relation between the Hessian inverse approximation B̂_t^{-1} and the (τ−1)st updated version B̂_{t,τ−1}^{-1} of the initial Hessian inverse approximation at step t. Set now u = τ − 2 in (13) to express B̂_{t,τ−1}^{-1} in terms of B̂_{t,τ−2}^{-1} and substitute the result in (14) to rewrite B̂_t^{-1} as

    B̂_t^{-1} = Ẑ_{t−1}^T Ẑ_{t−2}^T B̂_{t,τ−2}^{-1} Ẑ_{t−2} Ẑ_{t−1} + ρ̂_{t−2} Ẑ_{t−1}^T v_{t−2} v_{t−2}^T Ẑ_{t−1} + ρ̂_{t−1} v_{t−1} v_{t−1}^T.    (15)

We can proceed recursively by substituting B̂_{t,τ−2}^{-1} for its expression in terms of B̂_{t,τ−3}^{-1}, and in the result substituting B̂_{t,τ−3}^{-1} for its expression in terms of B̂_{t,τ−4}^{-1}, and so on. Observe that a new summand is added in each of these substitutions, from which it follows that repeating this process τ times yields

    B̂_t^{-1} = (Ẑ_{t−1}^T ⋯ Ẑ_{t−τ}^T) B̂_{t,0}^{-1} (Ẑ_{t−τ} ⋯ Ẑ_{t−1})
              + ρ̂_{t−τ} (Ẑ_{t−1}^T ⋯ Ẑ_{t−τ+1}^T) v_{t−τ} v_{t−τ}^T (Ẑ_{t−τ+1} ⋯ Ẑ_{t−1})
              + ⋯ + ρ̂_{t−2} Ẑ_{t−1}^T v_{t−2} v_{t−2}^T Ẑ_{t−1} + ρ̂_{t−1} v_{t−1} v_{t−1}^T.    (16)

The important observation in (16) is that the matrix Ẑ_{t−1} and its transpose Ẑ_{t−1}^T are the first and last product terms of all summands except the last, that the matrix Ẑ_{t−2} and its transpose Ẑ_{t−2}^T are second and penultimate in all terms but the last two, and so on.

Thus, when computing the oLBFGS step d̂_t := B̂_t^{-1} ŝ(w_t, θ̃_t), the operations needed to compute the product with the next to last summand of (16) can be reused to compute the product with the second to last summand, which in turn can be reused in determining the product with the third to last summand, and so on. This observation, compounded with the fact that multiplications with the identity-plus-rank-one matrices Ẑ_{t−u} require O(n) operations, yields an algorithm that can compute the oLBFGS step d̂_t := B̂_t^{-1} ŝ(w_t, θ̃_t) in O(τn) operations. We summarize the specifics of such computation in the following proposition, where we consider the computation of the product B̂_t^{-1} p with a given arbitrary vector p.

Proposition 1 Consider the oLBFGS Hessian inverse approximation B̂_t^{-1} = B̂_{t,τ}^{-1} obtained after τ recursive applications of the update in (13), with the scalar sequence ρ̂_{t−τ+u} and identity-plus-rank-one matrix sequence Ẑ_{t−τ+u} as defined in (12) for given variable and stochastic gradient variation pairs {v_u, r̂_u}_{u=t−τ}^{t−1}. For a given vector p = p_0 define the sequence of vectors p_u through the recursion

    p_{u+1} = p_u − α_u r̂_{t−u−1}    for u = 0, ..., τ−1,    (17)

where we also define the constants α_u := ρ̂_{t−u−1} v_{t−u−1}^T p_u. Further define the sequence of vectors q_u with initial value q_0 = B̂_{t,0}^{-1} p_τ and subsequent elements

    q_{u+1} = q_u + (α_{τ−u−1} − β_u) v_{t−τ+u}    for u = 0, ..., τ−1,    (18)

where we define the constants β_u := ρ̂_{t−τ+u} r̂_{t−τ+u}^T q_u. The product B̂_t^{-1} p equals q_τ, i.e., B̂_t^{-1} p = q_τ.

Proof: See Appendix A.

The reorganization of computations described in Proposition 1 has been done for the deterministic LBFGS method in, e.g., Nocedal and Wright (1999). We have used the same technique here for computing the descent direction of oLBFGS and have shown the result and derivations for completeness. In any event, Proposition 1 asserts that it is possible to reduce the computation of the product B̂_t^{-1} p between the oLBFGS Hessian inverse approximation matrix and an arbitrary vector p to the computation of two vector sequences {p_u}_{u=0}^{τ} and {q_u}_{u=0}^{τ}. The product B̂_t^{-1} p = q_τ is given by the last element of the latter sequence. Since determination of each of the elements of each sequence requires O(n) operations and the total number of elements in each sequence is τ, the total operation cost to compute both sequences is of order O(τn).

In computing B̂_t^{-1} p we also need to add the cost of the product q_0 = B̂_{t,0}^{-1} p_τ that links both sequences. To maintain an overall computation cost of order O(τn), this matrix has to have a sparse or low rank structure. A common choice in LBFGS, which we adopt for oLBFGS, is to make B̂_{t,0}^{-1} = γ̂_t I. The scalar constant γ̂_t is a function of the variable and stochastic gradient variations v_{t−1} and r̂_{t−1}, explicitly given by

    γ̂_t = (v_{t−1}^T r̂_{t−1})/(r̂_{t−1}^T r̂_{t−1}) = (v_{t−1}^T r̂_{t−1})/‖r̂_{t−1}‖²,    (19)

with the value at the first iteration being γ̂_0 = 1. The scaling factor γ̂_t attempts to estimate one of the eigenvalues of the Hessian matrix at step t and has been observed to work well in practice; see e.g., Liu and Nocedal (1989); Nocedal and Wright (1999). Further observe that the cost of computing γ̂_t is of order O(n) and that, since B̂_{t,0}^{-1} is diagonal, the cost of computing the product q_0 = B̂_{t,0}^{-1} p_τ is also of order O(n). We adopt the initialization in (19) in our subsequent analysis and numerical experiments.
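The two-loop structure of Proposition 1 with the scaled-identity initialization (19) can be written compactly as below. This is a sketch of the standard L-BFGS two-loop recursion applied to the stochastic pairs {v_u, r̂_u}, not code from the paper, and the interface (a list of the τ most recent (v, r̂) pairs, oldest first) is an assumption.

```python
import numpy as np

def olbfgs_two_loop(p, pairs):
    """Compute q = B_t^{-1} p per Proposition 1 with initialization (19).

    pairs: list of the tau most recent (v_u, r_hat_u) tuples, oldest first,
           i.e. [(v_{t-tau}, r_{t-tau}), ..., (v_{t-1}, r_{t-1})].
    """
    if not pairs:                        # t = 0: no curvature information, gamma_0 = 1
        return p.copy()
    alphas = []
    q = p.copy()
    for v, r in reversed(pairs):         # first loop, eq. (17): most recent pair first
        rho = 1.0 / (v @ r)
        alpha = rho * (v @ q)
        q = q - alpha * r
        alphas.append(alpha)
    v_last, r_last = pairs[-1]
    gamma = (v_last @ r_last) / (r_last @ r_last)    # eq. (19)
    q = gamma * q                        # q_0 = B_{t,0}^{-1} p_tau with B_{t,0}^{-1} = gamma_t I
    for (v, r), alpha in zip(pairs, reversed(alphas)):   # second loop, eq. (18): oldest first
        rho = 1.0 / (v @ r)
        beta = rho * (r @ q)
        q = q + (alpha - beta) * v
    return q
```

Calling `olbfgs_two_loop(s_hat, pairs)` with the current stochastic gradient returns the step d̂_t of (10) at O(τn) cost, without ever forming B̂_t^{-1} explicitly.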

The computation of the product B̂_t^{-1} p using the result in Proposition 1 is summarized in algorithmic form in Algorithm 1. The function receives as arguments the initial matrix B̂_{t,0}^{-1}, the sequence of variable and stochastic gradient variations {v_u, r̂_u}_{u=t−τ}^{t−1}, and the vector p, and produces the outcome q = q_τ = B̂_t^{-1} p. When called with the stochastic gradient p = ŝ(w_t, θ̃_t), the function outputs the oLBFGS step d̂_t := B̂_t^{-1} ŝ(w_t, θ̃_t) needed to implement the oLBFGS descent step in (10).

Algorithm 1 Computation of the oLBFGS step q = B̂_t^{-1} p when called with p = ŝ(w_t, θ̃_t).
 1: function q = q_τ = oLBFGS Step(B̂_{t,0}^{-1}, p = p_0, {v_u, r̂_u}_{u=t−τ}^{t−1})
 2: for u = 0, 1, ..., τ−1 do   {Loop to compute constants α_u and sequence p_u}
 3:   Compute and store scalar α_u = ρ̂_{t−u−1} v_{t−u−1}^T p_u
 4:   Update sequence vector p_{u+1} = p_u − α_u r̂_{t−u−1}  [cf. (17)]
 5: end for
 6: Multiply p_τ by initial matrix: q_0 = B̂_{t,0}^{-1} p_τ
 7: for u = 0, 1, ..., τ−1 do   {Loop to compute constants β_u and sequence q_u}
 8:   Compute scalar β_u = ρ̂_{t−τ+u} r̂_{t−τ+u}^T q_u
 9:   Update sequence vector q_{u+1} = q_u + (α_{τ−u−1} − β_u) v_{t−τ+u}  [cf. (18)]
10: end for   {return q = q_τ}

The core of Algorithm 1 is given by the loop in steps 2-5 that computes the constants α_u and sequence elements p_u, as well as the loop in steps 7-10 that computes the constants β_u and sequence elements q_u. The two loops are linked by the initialization of the second sequence with the outcome of the first, which is performed in Step 6. To implement the first loop we require τ inner products in Step 3 and τ vector summations in Step 4, which yield a total of 2τn multiplications. Likewise, the second loop requires τ inner products and τ vector summations in steps 8 and 9, respectively, which yields a total cost of also 2τn multiplications. Since the initial Hessian inverse approximation matrix B̂_{t,0}^{-1} is diagonal, the cost of computing B̂_{t,0}^{-1} p_τ in Step 6 is n multiplications. Thus, Algorithm 1 requires a total of (4τ + 1)n multiplications, which affirms the complexity cost of order O(τn) for oLBFGS.

For reference, oLBFGS is also summarized in algorithmic form in Algorithm 2. As with any stochastic descent algorithm, the descent iteration is implemented in three steps: the acquisition of L samples in Step 2, the computation of the stochastic gradient in Step 3, and the implementation of the descent update on the variable w_t in Step 6. Steps 4 and 5 are devoted to the computation of the oLBFGS descent direction d̂_t. In Step 4 we initialize the estimate B̂_{t,0}^{-1} = γ̂_t I as a scaled identity matrix using the expression for γ̂_t in (19) for t > 0. The value of γ̂_t = γ̂_0 for t = 0 is left as an input for the algorithm. We use γ̂_0 = 1 in our numerical tests. In Step 5 we use Algorithm 1 for efficient computation of the descent direction d̂_t = B̂_t^{-1} ŝ(w_t, θ̃_t). Step 7 determines the value of the stochastic gradient ŝ(w_{t+1}, θ̃_t) so that the variable variation v_t and stochastic gradient variation r̂_t become available for the computation of the curvature approximation matrix. In Step 8 the variable variation v_t and stochastic gradient variation r̂_t are computed to be used in the next iteration.

Algorithm 2 oLBFGS
Require: Initial value w_0. Initial Hessian approximation parameter γ̂_0 = 1.
 1: for t = 0, 1, 2, ... do
 2:   Acquire L independent samples θ̃_t = [θ_{t1}, ..., θ_{tL}]
 3:   Compute stochastic gradient: ŝ(w_t, θ̃_t) = (1/L) Σ_{l=1}^{L} ∇_w f(w_t, θ_{tl})  [cf. (9)]
 4:   Initialize Hessian inverse estimate as B̂_{t,0}^{-1} = γ̂_t I with γ̂_t = v_{t−1}^T r̂_{t−1} / (r̂_{t−1}^T r̂_{t−1}) for t > 0  [cf. (19)]
 5:   Compute descent direction with Algorithm 1: d̂_t = oLBFGS Step(B̂_{t,0}^{-1}, ŝ(w_t, θ̃_t), {v_u, r̂_u}_{u=t−τ}^{t−1})
 6:   Descend along direction d̂_t: w_{t+1} = w_t − ε_t d̂_t  [cf. (10)]
 7:   Compute stochastic gradient: ŝ(w_{t+1}, θ̃_t) = (1/L) Σ_{l=1}^{L} ∇_w f(w_{t+1}, θ_{tl})  [cf. (9)]
 8:   Variations: v_t = w_{t+1} − w_t  [variable, cf. (4)];  r̂_t = ŝ(w_{t+1}, θ̃_t) − ŝ(w_t, θ̃_t)  [stoch. gradient, cf. (11)]
 9: end for

We analyze convergence properties of this algorithm in Section 3 and develop an application to search engine advertisement in Section 4.
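Putting (9)-(11), (19), and Algorithm 2 together, a compact driver loop might look as follows. This is a sketch under stated assumptions, not the authors' reference implementation: it reuses the `olbfgs_two_loop` helper from the earlier sketch and assumes a user-supplied per-sample gradient `grad_f(w, theta)`, a sampler `draw_batch(L)`, and a step size schedule `step_size(t)`.

```python
import numpy as np
from collections import deque

def olbfgs(w0, grad_f, draw_batch, step_size, iters, tau=10, L=100):
    """Online L-BFGS driver in the spirit of Algorithm 2 (illustrative sketch).

    Assumes: grad_f(w, theta) is the per-sample gradient, draw_batch(L) returns L samples,
    step_size(t) returns epsilon_t, and olbfgs_two_loop is the two-loop sketch above.
    """
    w = np.asarray(w0, dtype=float)
    pairs = deque(maxlen=tau)                   # last tau curvature pairs (v_u, r_hat_u)
    for t in range(iters):
        batch = draw_batch(L)                                       # Step 2
        s_t = sum(grad_f(w, th) for th in batch) / L                # Step 3, eq. (9)
        d_t = olbfgs_two_loop(s_t, list(pairs))                     # Steps 4-5, Prop. 1 + (19)
        w_next = w - step_size(t) * d_t                             # Step 6, eq. (10)
        s_next = sum(grad_f(w_next, th) for th in batch) / L        # Step 7, same samples
        pairs.append((w_next - w, s_next - s_t))                    # Step 8, eqs. (4) and (11)
        w = w_next
    return w
```

A typical call would use a diminishing schedule such as `step_size = lambda t: eps0 * T0 / (T0 + t)`, which is the form used in Section 4.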

3. Convergence Analysis

For the subsequent analysis it is convenient to define the instantaneous objective function associated with samples θ̃ = [θ_1, ..., θ_L] as

    f̂(w, θ̃) := (1/L) Σ_{l=1}^{L} f(w, θ_l).    (20)

The definition of the instantaneous objective function f̂(w, θ̃), in association with the fact that F(w) := E_θ[f(w, θ)], implies that

    F(w) = E_θ̃[f̂(w, θ̃)].    (21)

Our goal here is to show that as time progresses the sequence of variable iterates w_t approaches the optimal argument w*. In proving this result we make the following assumptions.

Assumption 1 The instantaneous functions f̂(w, θ̃) are twice differentiable and the eigenvalues of the instantaneous Hessian Ĥ(w, θ̃) = ∇²_w f̂(w, θ̃) are bounded between constants 0 < m̂ and M̂ < ∞ for all random variables θ̃,

    m̂ I ⪯ Ĥ(w, θ̃) ⪯ M̂ I.    (22)

Assumption 2 The second moment of the norm of the stochastic gradient is bounded for all w, i.e., there exists a constant S² such that for all variables w it holds

    E_θ̃[ ‖ŝ(w, θ̃)‖² | w ] ≤ S².    (23)

Assumption 3 The step size sequence is selected as nonsummable but square summable, i.e.,

    Σ_{t=0}^{∞} ε_t = ∞    and    Σ_{t=0}^{∞} ε_t² < ∞.    (24)

Assumptions 2 and 3 are customary in stochastic optimization. The restriction imposed by Assumption 2 is intended to limit the random variation of stochastic gradients. If the variance of their norm is unbounded, it is possible to have rare events that derail progress towards convergence. The condition in Assumption 3 balances descent towards optimal arguments, which requires a slowly decreasing step size, with the eventual elimination of random variations, which requires rapidly decreasing step sizes. An effective step size choice for which Assumption 3 holds is to make ε_t = ε_0 T_0/(T_0 + t), for given parameters ε_0 and T_0 that control the initial step size and its speed of decrease, respectively.
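As a quick numerical illustration (not from the paper) that the schedule ε_t = ε_0 T_0/(T_0 + t) meets Assumption 3, the partial sums below grow without bound while the squared partial sums stay bounded; the values ε_0 = 10^{-2} and T_0 = 10^4 are only placeholders.

```python
import numpy as np

eps0, T0 = 1e-2, 1e4                        # placeholder parameters
t = np.arange(1_000_000)
eps = eps0 * T0 / (T0 + t)                  # epsilon_t = eps0 * T0 / (T0 + t)

# Sum of eps_t diverges like log(t) (nonsummable); sum of eps_t^2 converges (square summable).
print(np.cumsum(eps)[[999, 99_999, 999_999]])      # keeps growing
print(np.cumsum(eps**2)[[999, 99_999, 999_999]])   # levels off near eps0^2 * T0
```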

Assumption 1 is stronger than usual and specific to oLBFGS. Observe that, considering the linearity of the expectation operator and the expression in (21), it follows that the Hessian of the average function can be written as ∇²_w F(w) = H(w) = E_θ̃[Ĥ(w, θ̃)]. Combining this observation with the bounds in (22), we conclude that there are constants m ≥ m̂ and M ≤ M̂ such that

    m̂ I ⪯ m I ⪯ H(w) ⪯ M I ⪯ M̂ I.    (25)

The bounds in (25) are customary in convergence proofs of descent methods. For the results here the stronger condition spelled out in Assumption 1 is needed. This assumption is necessary to guarantee that the inner product r̂_t^T v_t > 0 is positive, as we show in the following lemma.

Lemma 2 Consider the stochastic gradient variation r̂_t defined in (11) and the variable variation v_t defined in (4). Let Assumption 1 hold so that we have lower and upper bounds m̂ and M̂ on the eigenvalues of the instantaneous Hessians. Then, for all steps t the inner product of variable and stochastic gradient variations r̂_t^T v_t is bounded below as

    m̂ ‖v_t‖² ≤ r̂_t^T v_t.    (26)

Furthermore, the ratio of the stochastic gradient variation squared norm ‖r̂_t‖² = r̂_t^T r̂_t to the inner product of variable and stochastic gradient variations is bounded as

    m̂ ≤ (r̂_t^T r̂_t)/(r̂_t^T v_t) = ‖r̂_t‖²/(r̂_t^T v_t) ≤ M̂.    (27)

Proof: See Appendix B.

According to Lemma 2, strong convexity of the instantaneous functions f̂(w, θ̃) guarantees positiveness of the inner product v_t^T r̂_t as long as the variable variation is not identically null. In turn, this implies that the constant γ̂_t in (19) is nonnegative and that, as a consequence, the initial Hessian inverse approximation B̂_{t,0}^{-1} is positive definite for all steps t. The positive definiteness of B̂_{t,0}^{-1}, in association with the positiveness of the inner product of variable and stochastic gradient variations v_t^T r̂_t > 0, further guarantees that all the matrices B̂_{t,u+1}^{-1}, including the matrix B̂_t^{-1} = B̂_{t,τ}^{-1} in particular, that follow the update rule in (13) stay positive definite; see Mokhtari and Ribeiro (2014a) for details. This proves that (10) is a proper stochastic descent iteration because the stochastic gradient ŝ(w_t, θ̃_t) is moderated by a positive definite matrix. However, this fact alone is not enough to guarantee convergence because the minimum and maximum eigenvalues of B̂_t^{-1} could become arbitrarily small and arbitrarily large, respectively. To prove convergence we show this is not possible by deriving explicit lower and upper bounds on these eigenvalues.

The analysis is easier if we consider the matrix B̂_t as opposed to B̂_t^{-1}. Consider then the update in (13), and use the Sherman-Morrison formula to rewrite it as an update that relates B̂_{t,u+1} to B̂_{t,u},

    B̂_{t,u+1} = B̂_{t,u} − (B̂_{t,u} v_{t−τ+u} v_{t−τ+u}^T B̂_{t,u})/(v_{t−τ+u}^T B̂_{t,u} v_{t−τ+u}) + (r̂_{t−τ+u} r̂_{t−τ+u}^T)/(v_{t−τ+u}^T r̂_{t−τ+u}),    (28)

for u = 0, ..., τ−1 and B̂_{t,0} = I/γ̂_t as per (19). As in (13), the Hessian approximation at step t is B̂_t = B̂_{t,τ}. In the following lemma we use the update formula in (28) to find bounds on the trace and determinant of the Hessian approximation B̂_t.

Lemma 3 Consider the Hessian approximation B̂_t = B̂_{t,τ} defined by the recursion in (28) with B̂_{t,0} = I/γ̂_t and γ̂_t as given by (19). If Assumption 1 holds true, the trace tr(B̂_{t,τ}) of the Hessian approximation is uniformly upper bounded for all times t ≥ 1,

    tr(B̂_{t,τ}) ≤ (n + τ) M̂.    (29)

Likewise, if Assumption 1 holds true, the determinant det(B̂_{t,τ}) of the Hessian approximation is uniformly lower bounded for all times t ≥ 1,

    det(B̂_{t,τ}) ≥ m̂^{n+τ} / [(n + τ) M̂]^{τ}.    (30)

Proof: See Appendix C.
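The bounds in Lemma 3 are easy to probe numerically. The sketch below (an illustration with an assumed synthetic model, not part of the paper) runs the direct update (28) from B̂_{t,0} = I/γ̂_t on curvature pairs whose instantaneous Hessians lie between m̂I and M̂I, and checks (29) and (30).

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, m_hat, M_hat = 8, 5, 0.5, 4.0

def random_pair():
    """Curvature pair (v, r) with r = H v for some Hessian m_hat*I <= H <= M_hat*I (assumed model)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    H = Q @ np.diag(rng.uniform(m_hat, M_hat, n)) @ Q.T
    v = rng.standard_normal(n)
    return v, H @ v

pairs = [random_pair() for _ in range(tau)]
v_last, r_last = pairs[-1]
gamma = (v_last @ r_last) / (r_last @ r_last)      # eq. (19)
B = np.eye(n) / gamma                              # B_{t,0} = I / gamma_t
for v, r in pairs:                                 # tau applications of (28)
    Bv = B @ v
    B = B - np.outer(Bv, Bv) / (v @ Bv) + np.outer(r, r) / (v @ r)

assert np.trace(B) <= (n + tau) * M_hat                                   # bound (29)
assert np.linalg.det(B) >= m_hat**(n + tau) / ((n + tau) * M_hat)**tau    # bound (30)
print(np.trace(B), np.linalg.det(B))
```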

Lemma 3 states that the trace and determinant of the Hessian approximation matrix B̂_t = B̂_{t,τ} are bounded for all times t ≥ 1. For time t = 0 we can write a similar bound that takes into account the fact that the constant γ̂_t that initializes the recursion in (28) is γ̂_0 = 1. Given that we are interested in an asymptotic convergence analysis, this bound is inconsequential.

The bounds on the trace and determinant of B̂_t are respectively equivalent to bounds on the sum and product of its eigenvalues. Further considering that the matrix B̂_t is positive definite, as it follows from Lemma 2, these bounds can be further transformed into bounds on the smallest and largest eigenvalues of B̂_t. The resulting bounds are formally stated in the following lemma.

Lemma 4 Consider the Hessian approximation B̂_t = B̂_{t,τ} defined by the recursion in (28) with B̂_{t,0} = I/γ̂_t and γ̂_t as given by (19). Define the strictly positive constant 0 < c := m̂^{n+τ}/[(n + τ)M̂]^{n+τ} and the finite constant C := (n + τ)M̂ < ∞. If Assumption 1 holds true, the range of eigenvalues of B̂_t is bounded by c and C for all time steps t ≥ 1, i.e.,

    (m̂^{n+τ}/[(n + τ)M̂]^{n+τ}) I =: c I ⪯ B̂_t ⪯ C I := (n + τ)M̂ I.    (31)

Proof: See Appendix D.

The bounds in Lemma 4 imply that their respective inverses are bounds on the range of the eigenvalues of the Hessian inverse approximation matrix B̂_t^{-1}. Specifically, the minimum eigenvalue of the Hessian inverse approximation B̂_t^{-1} is larger than 1/C and the maximum eigenvalue of B̂_t^{-1} does not exceed 1/c, or, equivalently,

    (1/C) I ⪯ B̂_t^{-1} ⪯ (1/c) I.    (32)

We further emphasize that the bounds in (32), or (31) for that matter, limit the conditioning of B̂_t^{-1} for all realizations of the random samples {θ̃_t}_{t=0}^{∞}, irrespective of the particular random draw. Having matrices B̂_t^{-1} that are strictly positive definite with eigenvalues uniformly upper bounded by 1/c leads to the conclusion that if ŝ(w_t, θ̃_t) is a descent direction, the same holds true of B̂_t^{-1} ŝ(w_t, θ̃_t). The stochastic gradient ŝ(w_t, θ̃_t) is not a descent direction in general, but we know that this is true for its conditional expectation E[ŝ(w_t, θ̃_t) | w_t] = ∇F(w_t). Hence, we conclude that B̂_t^{-1} ŝ(w_t, θ̃_t) is an average descent direction, since E[B̂_t^{-1} ŝ(w_t, θ̃_t) | w_t] = B̂_t^{-1} ∇F(w_t). Stochastic optimization methods whose displacements w_{t+1} − w_t are descent directions on average are expected to approach optimal arguments. We show that this is true of oLBFGS in the following lemma.

Lemma 5 Consider the online limited memory BFGS algorithm as defined by the descent iteration in (10) with matrices B̂_t^{-1} = B̂_{t,τ}^{-1} obtained after τ recursive applications of the update in (13) initialized with B̂_{t,0}^{-1} = γ̂_t I and γ̂_t as given by (19). If Assumptions 1 and 2 hold true, the sequence of average function values F(w_t) satisfies

    E[F(w_{t+1}) | w_t] − F(w*) ≤ F(w_t) − F(w*) − (ε_t/C) ‖∇F(w_t)‖² + M S² ε_t²/(2c²).    (33)

Proof: See Appendix E.

Setting aside the term M S² ε_t²/(2c²) for the sake of argument, (33) defines a supermartingale relationship for the sequence of objective function errors F(w_t) − F(w*). This implies that the sequence ε_t ‖∇F(w_t)‖²/C is almost surely summable, which, given that the step sizes ε_t are nonsummable as per (24), further implies that the limit infimum liminf_{t→∞} ‖∇F(w_t)‖ of the gradient norm ‖∇F(w_t)‖ is almost surely null.

This latter observation is equivalent to having liminf_{t→∞} F(w_t) − F(w*) = 0 with probability 1 over realizations of the random samples {θ̃_t}_{t=0}^{∞}. Therefore, a subsequence of the sequence of objective function errors F(w_t) − F(w*) converges to null almost surely. Moreover, according to the supermartingale convergence theorem, the limit lim_{t→∞} F(w_t) − F(w*) of the nonnegative objective function errors F(w_t) − F(w*) almost surely exists. This observation, in conjunction with the fact that a subsequence of the sequence F(w_t) − F(w*) converges almost surely to null, implies that the whole sequence F(w_t) − F(w*) converges almost surely to zero. Considering the strong convexity assumption, this result implies almost sure convergence of the whole sequence ‖w_t − w*‖² to null. The term M S² ε_t²/(2c²) is a relatively minor nuisance that can be taken care of with a technical argument that we present in the proof of the following theorem.

Theorem 6 Consider the online limited memory BFGS algorithm as defined by the descent iteration in (10) with matrices B̂_t^{-1} = B̂_{t,τ}^{-1} obtained after τ recursive applications of the update in (13) initialized with B̂_{t,0}^{-1} = γ̂_t I and γ̂_t as given by (19). If Assumptions 1-3 hold true, the limit of the squared Euclidean distance to optimality ‖w_t − w*‖² converges to zero almost surely, i.e.,

    Pr[ lim_{t→∞} ‖w_t − w*‖² = 0 ] = 1,    (34)

where the probability is over realizations of the random samples {θ̃_t}_{t=0}^{∞}.

Proof: See Appendix F.

Theorem 6 establishes convergence of the oLBFGS algorithm summarized in Algorithm 2. The lower and upper bounds on the eigenvalues of B̂_t derived in Lemma 4 play a fundamental role in the proofs of the prerequisite Lemma 5 and Theorem 6 proper. Roughly speaking, the lower bound on the eigenvalues of B̂_t results in an upper bound on the eigenvalues of B̂_t^{-1}, which limits the effect of random variations on the stochastic gradient ŝ(w_t, θ̃_t). If this bound does not exist, as is the case, e.g., of regular stochastic BFGS, we may observe catastrophic amplification of random variations of the stochastic gradient. The upper bound on the eigenvalues of B̂_t, which results in a lower bound on the eigenvalues of B̂_t^{-1}, guarantees that the random variations in the curvature estimate do not yield matrices B̂_t^{-1} with arbitrarily small norm. If this bound does not hold, it is possible to end up halting progress before convergence as the stochastic gradient is nullified by multiplication with an arbitrarily small eigenvalue. The result in Theorem 6 is strong because it holds almost surely over realizations of the random samples {θ̃_t}_{t=0}^{∞}, but it is not stronger than the same convergence guarantees that hold for SGD. We complement the convergence result in Theorem 6 with a characterization of the expected convergence rate that we introduce in the following theorem.

Theorem 7 Consider the online limited memory BFGS algorithm as defined by the descent iteration in (10) with matrices B̂_t^{-1} = B̂_{t,τ}^{-1} obtained after τ recursive applications of the update in (13) initialized with B̂_{t,0}^{-1} = γ̂_t I and γ̂_t as given by (19). Let Assumptions 1 and 2 hold, and further assume that the step size sequence is of the form ε_t = ε_0 T_0/(T_0 + t) with the parameters ε_0 and T_0 satisfying the inequality 2mε_0T_0/C > 1. Then, the difference between the expected objective E[F(w_t)] and the optimal objective F(w*) is bounded as

    E[F(w_t)] − F(w*) ≤ C_0/(T_0 + t),    (35)

where the constant C_0 is defined as

    C_0 := max{ ε_0² T_0² C M S² / (2c² (2mε_0T_0 − C)), T_0 (F(w_0) − F(w*)) }.    (36)

Proof: See Appendix G.

Theorem 7 shows that under the specified assumptions the expected error in terms of the objective value after t oLBFGS iterations is of order O(1/t). As is the case of Theorem 6, this result is not better than the convergence rate of conventional SGD. As can be seen in the proof of Theorem 7, the convergence rate is dominated by the noise term introduced by the difference between stochastic and regular gradients. This noise term would be present even if exact Hessians were available, and in that sense the best that can be proven of oLBFGS is that its convergence rate is not worse than that of SGD. Given that Theorems 6 and 7 parallel the theoretical guarantees of SGD, it is perhaps fairer to describe oLBFGS as an adaptive reconditioning strategy instead of a stochastic quasi-Newton method. The latter description refers to the genesis of the algorithm, but the former is a more accurate description of its behavior. Do notice that while the convergence rate does not change, improvements in convergence time are significant, as we illustrate with the numerical experiments that we present in the next section.

4. Search Engine Advertising

We apply oLBFGS to the problem of predicting the click-through rate (CTR) of an advertisement displayed in response to a specific search engine query by a specific visitor. In these problems we are given meta information about an advertisement, the words that appear in the query, as well as some information about the visitor, and are asked to predict the likelihood that this particular ad is clicked by this particular user when performing this particular query. The information specific to the ad includes descriptors of different characteristics such as the words that appear in the title, the name of the advertiser, keywords that identify the product, and the position on the page where the ad is to be displayed. The information specific to the user is also heterogeneous and includes gender, age, and propensity to click on ads. To train a classifier we are given information about past queries along with the corresponding click success of the ads displayed in response to the query. The ad metadata along with the user data and search words define a feature vector that we use to train a logistic regressor that predicts the CTR of future ads. Given the heterogeneity of the components of the feature vector, we expect a logistic cost function with skewed level sets and consequent large benefits from the use of oLBFGS.

4.1 Feature Vectors

For the CTR problem considered here we use the Tencent search engine data set (Sun (2012)). This data set contains the outcomes of 236 million searches along with information about the ad, the query, and the user. The information contained in each sample point is the following:

User profile: If known, age and gender of the visitor performing the query.
Depth: Total number of advertisements displayed in the search results page.
Position: Position of the advertisement in the search page.
Impression: Number of times the ad was displayed to the user who issued the query.
Query: The words that appear in the user's query.
Title: The words that appear in the title of the ad.
Keywords: Selected keywords that specify the type of product.
Ad ID: Unique identifier assigned to each specific advertisement.
Advertiser ID: Unique identifier assigned to each specific advertiser.
Clicks: Number of times the user clicked on the ad.

Table 1: Components of the feature vector for prediction of advertisement click-through rates. For each feature class we report the total number of components in the feature vector as well as the maximum and average number of nonzero components.

Feature type       | Total components | Maximum nonzero       | Mean nonzero (observed)
Age                | 6                | 1 (structure)         | 1.0
Gender             | 3                | 1 (structure)         | 1.0
Impression         | 3                | 1 (structure)         | 1.0
Depth              | 3                | 1 (structure)         | 1.0
Position           | 3                | 1 (structure)         | 1.0
Query              | 20,000           | (observed)            | 3.0
Title              | 20,000           | (observed)            | 8.8
Keyword            | 20,000           | (observed)            | 2.1
Advertiser ID      | 5,184            | 1 (structure)         | 1.0
Advertisement ID   | 108,824          | 1 (structure)         | 1.0
Total              | 174,026          | 148 (observed)        | 20.9

From this information we create a set of feature vectors {x_i}_{i=1}^{N}, with corresponding labels y_i ∈ {−1, 1}. The label associated with feature vector x_i is y_i = 1 if the number of clicks on the ad is more than 0; otherwise the label is y_i = −1. We use a binary encoding for all the features in the vector x_i. For the age of the user we use the six age intervals (0, 12], (12, 18], (18, 24], (24, 30], (30, 40], and (40, ∞) to construct six indicator entries in x_i that take the value 1 if the age of the user is known to be in the corresponding interval. E.g., a 21 year old user has an age that falls in the third interval, which implies that we make [x_i]_3 = 1 and [x_i]_k = 0 for all other k between 1 and 6. If the age of the user is unknown we make [x_i]_k = 0 for all k between 1 and 6. For the gender of the visitors we use the next three components of x_i to indicate male, female, or unknown gender. For a male user we make [x_i]_7 = 1, for a female user [x_i]_8 = 1, and for visitors of unknown gender we make [x_i]_9 = 1. The next three components of x_i are used for the depth feature. If the number of advertisements displayed in the search page is 1 we make [x_i]_{10} = 1, if 2 different ads are shown we make [x_i]_{11} = 1, and for depths of 3 or more we make [x_i]_{12} = 1. To indicate the position of the ad in the search page we also use three components of x_i. We use [x_i]_{13} = 1, [x_i]_{14} = 1, and [x_i]_{15} = 1 to indicate that the ad is displayed in the first, second, and third position, respectively. Likewise, we use [x_i]_{16}, [x_i]_{17}, and [x_i]_{18} to indicate that the impression of the ad is 1, 2, or more than 3.

For the words that appear in the query we have in the order of 10^5 distinct words. To reduce the number of elements necessary for this encoding we create 20,000 bags of words through random hashing, with each bag containing 5 or 6 distinct words. Each of these bags is assigned an index k. For each of the words in the query we find the bag in which this word appears. If the word appears in the kth bag we indicate this occurrence by setting the (k + 18)th component of the feature vector to [x_i]_{k+18} = 1. Observe that since we use 20,000 bags, components 19 through 20,018 of x_i indicate the presence of specific words in the query. Further note that we may have more than one x_i component different from zero because there may be many words in the query, but the total number of nonzero elements is much smaller than 20,000. On average, 3.0 of these elements of the feature vector are nonzero.
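The encoding just described amounts to concatenating one-hot blocks with hashed bag-of-words indicators. The sketch below is an illustration only; the hash function, block offsets, and field names are assumptions, and it covers just the age and query blocks.

```python
import numpy as np

N_BAGS, OFFSET_QUERY = 20_000, 18           # query bags occupy components 19..20,018
AGE_BINS = [12, 18, 24, 30, 40]             # intervals (0,12], (12,18], (18,24], (24,30], (30,40], (40,inf)

def encode(age, query_words):
    """Sparse binary feature vector: returns the set of nonzero indices (0-based)."""
    idx = set()
    if age is not None:                     # age block: first six components
        k = sum(age > b for b in AGE_BINS)  # which interval the age falls in
        idx.add(k)
    for word in query_words:                # hashed bag-of-words block for the query
        bag = hash(word) % N_BAGS           # stand-in for the paper's random hashing into bags
        idx.add(OFFSET_QUERY + bag)
    return sorted(idx)

print(encode(21, ["running", "shoes"]))     # a 21 year old falls in the third age interval
```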

The same bags of words are used to encode the words that appear in the title of the ad and the product keywords. We encode the words that appear in the title of the ad by using the next 20,000 components of vector x_i, i.e., components 20,019 through 40,018. Components 40,019 through 60,018 are used to encode product keywords. As in the case of the words in the search, just a few of these components are nonzero. On average, the number of nonzero components of feature vectors that describe the title features is 8.8. For product keywords the average is 2.1. Since the number of distinct advertisers in the training set is 5,184 we use feature components 60,019 through 65,202 to encode this information. For the kth advertiser ID we set the (k + 60,018)th component of the feature vector to [x_i]_{k+60,018} = 1. Since the number of distinct advertisements is 108,824 we allocate the last 108,824 components of the feature vector to encode the ad ID. Observe that only one out of the 5,184 advertiser ID components and one of the 108,824 advertisement ID components are nonzero. In total, the length of the feature vector is 174,026, where each of the components is either 0 or 1. The vector is very sparse. We observe a maximum of 148 nonzero elements and an average of 20.9 nonzero elements in the training set; see Table 1. This is important because the cost of implementing inner products in the oLBFGS training of the logistic regressor that we introduce in the following section is proportional to the number of nonzero elements in x_i.

4.2 Logistic Regression of Click-Through Rate

We use the training set to estimate the CTR with a logistic regression. For that purpose let x ∈ R^n be a vector containing the features described in Section 4.1, w ∈ R^n a classifier that we want to train, and y ∈ {−1, 1} an indicator variable that takes the value y = 1 when the ad presented to the user is clicked and y = −1 when the ad is not clicked by the user. We hypothesize that the CTR, defined as the probability of observing y = 1, can be written as the logistic function

    CTR(x; w) := P[y = 1 | x; w] = 1/(1 + exp(−x^T w)).    (37)

We read (37) as stating that for a feature vector x the CTR is determined by the inner product x^T w through the given logistic transformation. Consider now the training set S = {(x_i, y_i)}_{i=1}^{N}, which contains N realizations of features x_i and respective click outcomes y_i, and further define the sets S_1 := {(x_i, y_i) ∈ S : y_i = 1} and S_{−1} := {(x_i, y_i) ∈ S : y_i = −1} containing clicked and unclicked advertisements, respectively. With the data given in S we define the optimal classifier w* as a maximum likelihood estimate (MLE) of w given the model in (37) and the training set S. This MLE can be found as the minimizer of the regularized log-likelihood loss

    w* := argmin_w (λ/2)‖w‖² + (1/N) Σ_{i=1}^{N} log(1 + exp(−y_i x_i^T w))
        = argmin_w (λ/2)‖w‖² + (1/N) [ Σ_{x_i ∈ S_1} log(1 + exp(−x_i^T w)) + Σ_{x_i ∈ S_{−1}} log(1 + exp(x_i^T w)) ],    (38)

where we have added the regularization term λ‖w‖²/2 to disincentivize large values in the weight vector w; see e.g., Ng (2004).
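For concreteness, a per-sample loss and gradient matching (38) are sketched below. This is an illustration: combined with the earlier mini-batch and oLBFGS sketches it yields a stochastic gradient of the regularized logistic objective, the regularization weight is a placeholder, and the dense-vector representation is a simplification of the sparse encoding described above.

```python
import numpy as np

lam = 1e-6   # regularization weight lambda (placeholder value)

def logistic_loss(w, sample):
    """Per-sample regularized loss: log(1 + exp(-y x'w)) + (lam/2)||w||^2, cf. (38)."""
    x, y = sample
    return np.logaddexp(0.0, -y * (x @ w)) + 0.5 * lam * (w @ w)

def logistic_grad(w, sample):
    """Gradient of the per-sample loss: -y x / (1 + exp(y x'w)) + lam w."""
    x, y = sample
    return -y * x / (1.0 + np.exp(y * (x @ w))) + lam * w

# Averaging logistic_grad over L samples gives exactly the stochastic gradient (9)
# that the oLBFGS iteration (10) uses for this problem.
```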

The practical use of (37) and (38) is as follows. We use the data collected in the training set S to determine the vector w* in (38). When a user issues a query, we concatenate the user and query specific elements of the feature vector with the ad specific elements of several candidate ads. We then proceed to display the advertisement with, say, the largest CTR. We can interpret the set S as having been acquired offline or online. In the former case we want to use a stochastic optimization algorithm because computing gradients is infeasible; recall that we are considering training samples with a number of elements N in the order of 10^6. The performance metric of interest in this case is the logistic cost as a function of computational time. If elements of S are acquired online, we update w whenever a new vector becomes available so as to adapt to changes in preferences. In this case we want to exploit the information in new samples as much as possible. The correct metric in this case is the logistic cost as a function of the number of feature vectors processed. We use the latter metric for the numerical experiments in the following section.

4.3 Numerical Results

Out of the 236 million searches in the Tencent data set we select 10^6 sample points to use as the training set S and 10^5 sample points to use as a test set T. To select elements of the training and test set we divide the first 1.1 × 10^6 sample points of the complete data set into 10^5 consecutive blocks with 11 elements. The first 10 elements of each block are assigned to the training set and the 11th element to the test set. To solve for the optimal classifier we implement SGD and oLBFGS by selecting feature vectors x_i at random from the training set S. In all of our numerical experiments the regularization parameter λ in (38) is kept at a fixed value. The step sizes for both algorithms are of the form ε_t = ε_0 T_0/(T_0 + t). We set ε_0 = 10^{-2} and T_0 = 10^4 for oLBFGS, and ε_0 = 10 and T_0 = 10^6 for SGD. For SGD the sample size in (9) is set to L = 20, whereas for oLBFGS it is set to L = 100. The values of the parameters ε_0, T_0, and L are chosen to yield the best convergence times in a rough parameter optimization search. Observe the relatively large values of L that are used to compute stochastic gradients. This is necessary due to the extreme sparsity of the feature vectors x_i, which contain an average of only 20.9 nonzero out of 174,026 elements. Even when considering L = 100 vectors, they are close to orthogonal. The size of the memory for oLBFGS is set to τ = 10. With L = 100 feature vectors of average sparsity 20.9 nonzero elements and memory τ = 10, the cost of each oLBFGS iteration remains small.

Figure 1: Negative log-likelihood value for oLBFGS and SGD after processing a given number of feature vectors. The accuracy of oLBFGS is better than that of SGD after processing the same number of feature vectors.

Figure 1 illustrates the convergence path of SGD and oLBFGS on the advertising training set. We depict the value of the log-likelihood objective in (38) evaluated at w = w_t, where w_t is the classifier iterate determined by SGD or oLBFGS. The horizontal axis is scaled by the number of feature vectors L that are used in the evaluation of stochastic gradients. This results in a plot of log-likelihood cost versus the number Lt of feature vectors processed. To read iteration indexes from Figure 1, divide the horizontal axis values by L = 100 for oLBFGS and L = 20 for SGD. The curvature correction of oLBFGS results in significant reductions in convergence time.


GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization A Forward-Backward Spliing Mehod wih Componen-wise Lazy Evaluaion for Online Srucured Convex Opimizaion Yukihiro Togari and Nobuo Yamashia March 28, 2016 Absrac: We consider large-scale opimizaion problems

More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details! MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his

More information

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1.

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1. Roboics I April 11, 017 Exercise 1 he kinemaics of a 3R spaial robo is specified by he Denavi-Harenberg parameers in ab 1 i α i d i a i θ i 1 π/ L 1 0 1 0 0 L 3 0 0 L 3 3 able 1: able of DH parameers of

More information

MATH 128A, SUMMER 2009, FINAL EXAM SOLUTION

MATH 128A, SUMMER 2009, FINAL EXAM SOLUTION MATH 28A, SUMME 2009, FINAL EXAM SOLUTION BENJAMIN JOHNSON () (8 poins) [Lagrange Inerpolaion] (a) (4 poins) Le f be a funcion defined a some real numbers x 0,..., x n. Give a defining equaion for he Lagrange

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t M ah 5 2 7 Fall 2 0 0 9 L ecure 1 0 O c. 7, 2 0 0 9 Hamilon- J acobi Equaion: Explici Formulas In his lecure we ry o apply he mehod of characerisics o he Hamilon-Jacobi equaion: u + H D u, x = 0 in R n

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

ELE 538B: Large-Scale Optimization for Data Science. Quasi-Newton methods. Yuxin Chen Princeton University, Spring 2018

ELE 538B: Large-Scale Optimization for Data Science. Quasi-Newton methods. Yuxin Chen Princeton University, Spring 2018 ELE 538B: Large-Scale Opimizaion for Daa Science Quasi-Newon mehods Yuxin Chen Princeon Universiy, Spring 208 00 op ff(x (x)(k)) f p 2 L µ f 05 k f (xk ) k f (xk ) =) f op ieraions converges in only 5

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter Sae-Space Models Iniializaion, Esimaion and Smoohing of he Kalman Filer Iniializaion of he Kalman Filer The Kalman filer shows how o updae pas predicors and he corresponding predicion error variances when

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Supplemen for Sochasic Convex Opimizaion: Faser Local Growh Implies Faser Global Convergence Yi Xu Qihang Lin ianbao Yang Proof of heorem heorem Suppose Assumpion holds and F (w) obeys he LGC (6) Given

More information

Hamilton- J acobi Equation: Weak S olution We continue the study of the Hamilton-Jacobi equation:

Hamilton- J acobi Equation: Weak S olution We continue the study of the Hamilton-Jacobi equation: M ah 5 7 Fall 9 L ecure O c. 4, 9 ) Hamilon- J acobi Equaion: Weak S oluion We coninue he sudy of he Hamilon-Jacobi equaion: We have shown ha u + H D u) = R n, ) ; u = g R n { = }. ). In general we canno

More information

Aryan Mokhtari, Wei Shi, Qing Ling, and Alejandro Ribeiro. cost function n

Aryan Mokhtari, Wei Shi, Qing Ling, and Alejandro Ribeiro. cost function n IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS, VOL. 2, NO. 4, DECEMBER 2016 507 A Decenralized Second-Order Mehod wih Exac Linear Convergence Rae for Consensus Opimizaion Aryan Mokhari,

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3 and d = c b - b c c d = c b - b c c This process is coninued unil he nh row has been compleed. The complee array of coefficiens is riangular. Noe ha in developing he array an enire row may be divided or

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

Some Ramsey results for the n-cube

Some Ramsey results for the n-cube Some Ramsey resuls for he n-cube Ron Graham Universiy of California, San Diego Jozsef Solymosi Universiy of Briish Columbia, Vancouver, Canada Absrac In his noe we esablish a Ramsey-ype resul for cerain

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

15. Vector Valued Functions

15. Vector Valued Functions 1. Vecor Valued Funcions Up o his poin, we have presened vecors wih consan componens, for example, 1, and,,4. However, we can allow he componens of a vecor o be funcions of a common variable. For example,

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Ordinary Differential Equations

Ordinary Differential Equations Ordinary Differenial Equaions 5. Examples of linear differenial equaions and heir applicaions We consider some examples of sysems of linear differenial equaions wih consan coefficiens y = a y +... + a

More information

R.#W.#Erickson# Department#of#Electrical,#Computer,#and#Energy#Engineering# University#of#Colorado,#Boulder#

R.#W.#Erickson# Department#of#Electrical,#Computer,#and#Energy#Engineering# University#of#Colorado,#Boulder# .#W.#Erickson# Deparmen#of#Elecrical,#Compuer,#and#Energy#Engineering# Universiy#of#Colorado,#Boulder# Chaper 2 Principles of Seady-Sae Converer Analysis 2.1. Inroducion 2.2. Inducor vol-second balance,

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

EKF SLAM vs. FastSLAM A Comparison

EKF SLAM vs. FastSLAM A Comparison vs. A Comparison Michael Calonder, Compuer Vision Lab Swiss Federal Insiue of Technology, Lausanne EPFL) michael.calonder@epfl.ch The wo algorihms are described wih a planar robo applicaion in mind. Generalizaion

More information

Notes on online convex optimization

Notes on online convex optimization Noes on online convex opimizaion Karl Sraos Online convex opimizaion (OCO) is a principled framework for online learning: OnlineConvexOpimizaion Inpu: convex se S, number of seps T For =, 2,..., T : Selec

More information

Cash Flow Valuation Mode Lin Discrete Time

Cash Flow Valuation Mode Lin Discrete Time IOSR Journal of Mahemaics (IOSR-JM) e-issn: 2278-5728,p-ISSN: 2319-765X, 6, Issue 6 (May. - Jun. 2013), PP 35-41 Cash Flow Valuaion Mode Lin Discree Time Olayiwola. M. A. and Oni, N. O. Deparmen of Mahemaics

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

4.1 Other Interpretations of Ridge Regression

4.1 Other Interpretations of Ridge Regression CHAPTER 4 FURTHER RIDGE THEORY 4. Oher Inerpreaions of Ridge Regression In his secion we will presen hree inerpreaions for he use of ridge regression. The firs one is analogous o Hoerl and Kennard reasoning

More information

Outline. lse-logo. Outline. Outline. 1 Wald Test. 2 The Likelihood Ratio Test. 3 Lagrange Multiplier Tests

Outline. lse-logo. Outline. Outline. 1 Wald Test. 2 The Likelihood Ratio Test. 3 Lagrange Multiplier Tests Ouline Ouline Hypohesis Tes wihin he Maximum Likelihood Framework There are hree main frequenis approaches o inference wihin he Maximum Likelihood framework: he Wald es, he Likelihood Raio es and he Lagrange

More information

How to Deal with Structural Breaks in Practical Cointegration Analysis

How to Deal with Structural Breaks in Practical Cointegration Analysis How o Deal wih Srucural Breaks in Pracical Coinegraion Analysis Roselyne Joyeux * School of Economic and Financial Sudies Macquarie Universiy December 00 ABSTRACT In his noe we consider he reamen of srucural

More information

Lecture 9: September 25

Lecture 9: September 25 0-725: Opimizaion Fall 202 Lecure 9: Sepember 25 Lecurer: Geoff Gordon/Ryan Tibshirani Scribes: Xuezhi Wang, Subhodeep Moira, Abhimanu Kumar Noe: LaTeX emplae couresy of UC Berkeley EECS dep. Disclaimer:

More information

Longest Common Prefixes

Longest Common Prefixes Longes Common Prefixes The sandard ordering for srings is he lexicographical order. I is induced by an order over he alphabe. We will use he same symbols (,

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Differential Equations

Differential Equations Mah 21 (Fall 29) Differenial Equaions Soluion #3 1. Find he paricular soluion of he following differenial equaion by variaion of parameer (a) y + y = csc (b) 2 y + y y = ln, > Soluion: (a) The corresponding

More information

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006 2.160 Sysem Idenificaion, Esimaion, and Learning Lecure Noes No. 8 March 6, 2006 4.9 Eended Kalman Filer In many pracical problems, he process dynamics are nonlinear. w Process Dynamics v y u Model (Linearized)

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE

EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE Version April 30, 2004.Submied o CTU Repors. EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE Per Krysl Universiy of California, San Diego La Jolla, California 92093-0085,

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

4.6 One Dimensional Kinematics and Integration

4.6 One Dimensional Kinematics and Integration 4.6 One Dimensional Kinemaics and Inegraion When he acceleraion a( of an objec is a non-consan funcion of ime, we would like o deermine he ime dependence of he posiion funcion x( and he x -componen of

More information

Expert Advice for Amateurs

Expert Advice for Amateurs Exper Advice for Amaeurs Ernes K. Lai Online Appendix - Exisence of Equilibria The analysis in his secion is performed under more general payoff funcions. Wihou aking an explici form, he payoffs of he

More information

Announcements: Warm-up Exercise:

Announcements: Warm-up Exercise: Fri Apr 13 7.1 Sysems of differenial equaions - o model muli-componen sysems via comparmenal analysis hp//en.wikipedia.org/wiki/muli-comparmen_model Announcemens Warm-up Exercise Here's a relaively simple

More information

Chapter 2. Models, Censoring, and Likelihood for Failure-Time Data

Chapter 2. Models, Censoring, and Likelihood for Failure-Time Data Chaper 2 Models, Censoring, and Likelihood for Failure-Time Daa William Q. Meeker and Luis A. Escobar Iowa Sae Universiy and Louisiana Sae Universiy Copyrigh 1998-2008 W. Q. Meeker and L. A. Escobar. Based

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n Module Fick s laws of diffusion Fick s laws of diffusion and hin film soluion Adolf Fick (1855) proposed: d J α d d d J (mole/m s) flu (m /s) diffusion coefficien and (mole/m 3 ) concenraion of ions, aoms

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

BU Macro BU Macro Fall 2008, Lecture 4

BU Macro BU Macro Fall 2008, Lecture 4 Dynamic Programming BU Macro 2008 Lecure 4 1 Ouline 1. Cerainy opimizaion problem used o illusrae: a. Resricions on exogenous variables b. Value funcion c. Policy funcion d. The Bellman equaion and an

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

THE MATRIX-TREE THEOREM

THE MATRIX-TREE THEOREM THE MATRIX-TREE THEOREM 1 The Marix-Tree Theorem. The Marix-Tree Theorem is a formula for he number of spanning rees of a graph in erms of he deerminan of a cerain marix. We begin wih he necessary graph-heoreical

More information

Appendix to Creating Work Breaks From Available Idleness

Appendix to Creating Work Breaks From Available Idleness Appendix o Creaing Work Breaks From Available Idleness Xu Sun and Ward Whi Deparmen of Indusrial Engineering and Operaions Research, Columbia Universiy, New York, NY, 127; {xs2235,ww24}@columbia.edu Sepember

More information