Global Convergence of Online Limited Memory BFGS


Journal of Machine Learning Research 16 (2015). Submitted 9/14; Revised 7/15; Published 12/15.

Aryan Mokhtari
Alejandro Ribeiro
Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA

Editor: Léon Bottou

Abstract

Global convergence of an online (stochastic) limited memory version of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method for solving optimization problems with stochastic objectives that arise in large scale machine learning is established. Lower and upper bounds on the Hessian eigenvalues of the sample functions are shown to suffice to guarantee that the curvature approximation matrices have bounded determinants and traces, which, in turn, permits establishing convergence to optimal arguments with probability 1. Experimental evaluation on a search engine advertising problem showcases reductions in convergence time relative to stochastic gradient descent algorithms.

Keywords: quasi-Newton methods, large-scale optimization, stochastic optimization

1. Introduction

Many problems in machine learning can be reduced to the minimization of a stochastic objective defined as an expectation over a set of random functions (Bottou and Le Cun (2005); Bottou (2010); Shalev-Shwartz and Srebro (2008); Mokhtari and Ribeiro (2014b)). Specifically, consider an optimization variable w ∈ R^n and a random variable θ ∈ Θ ⊆ R^p that determines the choice of a function f(w, θ) : R^n × R^p → R. Stochastic optimization problems entail determination of the argument w* that minimizes the expected value F(w) := E_θ[f(w, θ)],

    w* := argmin_w E_θ[f(w, θ)] := argmin_w F(w).    (1)

We refer to f(w, θ) as the random or instantaneous functions and to F(w) := E_θ[f(w, θ)] as the average function. A canonical class of problems having this form are support vector machines (SVMs) that reduce binary classification to the determination of a hyperplane that separates points in a given training set; see, e.g., Vapnik (2000); Bottou (2010); Boser et al. (1992). In that case, θ denotes individual training samples, f(w, θ) the loss of choosing the hyperplane defined by w, and F(w) := E_θ[f(w, θ)] the mean loss across all elements of the training set. The optimal argument w* is the optimal linear classifier. Numerical evaluation of objective function gradients ∇_w F(w) = E_θ[∇_w f(w, θ)] is intractable when the cardinality of Θ is large, as is the case, e.g., when SVMs are trained on large sets. This motivates the use of algorithms relying on stochastic gradients that provide gradient estimates based on small data subsamples.

For the purpose of this paper, stochastic optimization algorithms can be divided into three categories: stochastic gradient descent (SGD) and related first order methods, stochastic Newton methods, and stochastic quasi-Newton methods. SGD is the most popular method used to solve stochastic optimization problems (Bottou (2010); Shalev-Shwartz et al. (2011); Zhang (2004)). However, as we consider problems of ever larger dimension, its slow convergence times have limited its practical appeal and fostered the search for alternatives. In this regard, it has to be noted that SGD is slow because it uses gradients as descent directions, which leads to poor curvature approximation in ill-conditioned problems. The gold standard for dealing with ill-conditioned functions in a deterministic setting is Newton's method. However, unbiased stochastic estimates of Newton steps cannot be computed in general. This fact limits the application of stochastic Newton methods to problems with specific structure (Birge et al. (1994); Zargham et al. (2013)).

If SGD is slow to converge and stochastic Newton cannot be used in general, the remaining alternative is to modify deterministic quasi-Newton methods that speed up convergence times relative to gradient descent without using Hessian evaluations (Dennis and Moré (1974); Powell (1976); Byrd et al. (1987); Nocedal and Wright (1999)). This has resulted in the development of the stochastic quasi-Newton methods known as online Broyden-Fletcher-Goldfarb-Shanno (oBFGS) (Schraudolph et al. (2007); Bordes et al. (2009)), regularized stochastic BFGS (RES) (Mokhtari and Ribeiro (2014a)), and online limited memory BFGS (oLBFGS) (Schraudolph et al. (2007)), which occupy the middle ground of broad applicability irrespective of problem structure and conditioning. All three of these algorithms extend BFGS by using stochastic gradients both as descent directions and as constituents of Hessian estimates. The oBFGS algorithm is a direct generalization of BFGS that uses stochastic gradients in lieu of deterministic gradients. RES differs in that it further modifies BFGS to yield an algorithm that retains its convergence advantages while improving theoretical convergence guarantees and numerical behavior. The oLBFGS method uses a modification of BFGS to reduce the computational cost of each iteration.

An important observation here is that in trying to adapt to the changing curvature of the objective, stochastic quasi-Newton methods may end up exacerbating the problem. Indeed, since Hessian estimates are stochastic, it is possible to end up with almost singular Hessian estimates. The corresponding small eigenvalues then result in a catastrophic amplification of the noise which nullifies progress made towards convergence. This is not a minor problem. In oBFGS this possibility precludes convergence analyses and may result in erratic numerical behavior (Mokhtari and Ribeiro (2014a)). As a matter of fact, the main motivation for the introduction of RES is to avoid this catastrophic noise amplification so as to retain smaller convergence times while ensuring that optimal arguments are found with probability 1 (Mokhtari and Ribeiro (2014a)). Generally, stochastic quasi-Newton methods whose Hessian approximations have bounded eigenvalues converge to optimal arguments (Sunehag et al. (2009)). However valuable, the convergence guarantees of RES and the convergence time advantages of oBFGS and RES are tainted by iteration costs of order O(n^2) and O(n^3), respectively, which precludes their use in problems where n is very large. In deterministic settings this problem is addressed by limited memory BFGS (LBFGS) (Liu and Nocedal (1989)), which can be easily generalized to develop the oLBFGS algorithm (Schraudolph et al. (2007)). Numerical tests of oLBFGS are promising but theoretical convergence characterizations are still lacking. The main contribution of this paper is to show that oLBFGS converges with probability 1 to optimal arguments across realizations of the random variables θ. This is the same convergence guarantee provided for RES and is in marked contrast with oBFGS, which fails to converge if not properly regularized. Convergence guarantees for oLBFGS do not require such measures.

We begin the paper with brief discussions of deterministic BFGS (Section 2) and LBFGS (Section 2.1) and the introduction of oLBFGS (Section 2.2). The fundamental idea in BFGS and oLBFGS is to continuously satisfy a secant condition while staying close to previous curvature estimates. They differ in that BFGS uses all past gradients to estimate curvature while oLBFGS uses a fixed moving window of past gradients. The use of this window reduces memory and computational cost (Appendix A). The difference between LBFGS and oLBFGS is the use of stochastic gradients in lieu of their deterministic counterparts. Note that choosing a large mini-batch for computing stochastic gradients reduces the variance of the stochastic approximation and decreases the gap between LBFGS and oLBFGS; however, it also increases the computational cost of oLBFGS. Therefore, picking a suitable mini-batch size is an important step in the implementation of oLBFGS.

Convergence properties of oLBFGS are then analyzed (Section 3). Under the assumption that the sample functions f(w, θ) are strongly convex, we show that the trace and determinant of the Hessian approximations computed by oLBFGS are upper and lower bounded, respectively (Lemma 3). These bounds are then used to limit the range of variation of the ratio between the Hessian approximations' largest and smallest eigenvalues (Lemma 4). In turn, this condition number limit is shown to be sufficient to prove convergence to the optimal argument w* with probability 1 over realizations of the sample functions (Theorem 6). This is an important result because it ensures that oLBFGS does not suffer from the numerical problems that hinder oBFGS. We complement this almost sure convergence result with a characterization of the convergence rate, which is shown to be at least O(1/t) in expectation (Theorem 7). It is fair to emphasize that, different from the deterministic case, the convergence rate of oLBFGS is not better than the convergence rate of SGD. This is not a limitation of our analysis. The difference between stochastic and regular gradients introduces a noise term that dominates convergence once we are close to the optimum, which is where superlinear convergence rates manifest. In fact, the same convergence rate would be observed if exact Hessians were available. The best that can be proven of oLBFGS is that the convergence rate is not worse than that of SGD. Given that the theoretical guarantees only state that the curvature correction does not exacerbate the problem's condition, it is perhaps fairer to describe oLBFGS as an adaptive reconditioning strategy instead of a stochastic quasi-Newton method. The latter description refers to the genesis of the algorithm. The former is a more accurate description of its actual behavior.

To show the advantage of oLBFGS we use it to train a logistic regressor to predict the click-through rate in a search engine advertising problem (Section 4). The logistic regression uses a heterogeneous feature vector with 174,026 binary entries that describe the user, the search, and the advertisement (Section 4.1). Being a large scale problem with heterogeneous data, the condition number of the logistic log-likelihood objective is large and we expect to see significant advantages of oLBFGS relative to SGD. This expectation is fulfilled. The oLBFGS algorithm trains the regressor using less than 1% of the data required by SGD to obtain similar classification accuracy (Section 4.3). We close the paper with concluding remarks (Section 5).

Notation. Lowercase boldface v denotes a vector and uppercase boldface A a matrix. We use ‖v‖ to denote the Euclidean norm of vector v and ‖A‖ to denote the Euclidean norm of matrix A. The trace of A is written as tr(A) and the determinant as det(A). We use I for the identity matrix of appropriate dimension. The notation A ⪯ B implies that the matrix B − A is positive semidefinite. The operator E_x[·] stands in for expectation over random variable x and E[·] for expectation with respect to the distribution of a stochastic process.

2. Algorithm Definition

Recall the definitions of the sample functions f(w, θ) and the average function F(w) := E_θ[f(w, θ)]. We assume the sample functions f(w, θ) are strongly convex for all θ. This implies that the objective function F(w) := E_θ[f(w, θ)], being an average of the strongly convex sample functions, is also strongly convex. We define the gradient s(w) := ∇F(w) of the average function F(w) and assume that it can be computed as

    s(w) := ∇F(w) = E_θ[∇f(w, θ)].    (2)

Since the function F(w) is strongly convex, gradients s(w) are descent directions that can be used to find the optimal argument w* in (1). Introduce then a time index t, a step size ε_t, and a positive definite matrix B_t^{-1} ≻ 0 to define a generic descent algorithm through the iteration

    w_{t+1} = w_t − ε_t B_t^{-1} s(w_t) = w_t − ε_t d_t,    (3)

where we have also defined the descent step d_t = B_t^{-1} s(w_t). When B_t = I is the identity matrix, (3) reduces to gradient descent. When B_t = H(w_t) := ∇²F(w_t) is the Hessian of the objective function, (3) defines Newton's algorithm. In this paper we focus on quasi-Newton methods whereby we attempt to select matrices B_t close to the Hessian H(w_t).
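To fix ideas, the sketch below (an illustration, not from the paper) instantiates the generic iteration (3) for a simple quadratic objective and shows how the choices B_t = I and B_t = ∇²F(w_t) recover gradient descent and Newton's method; the quadratic objective and all parameter values are assumptions made only for this example.

```python
import numpy as np

# Illustrative quadratic objective F(w) = 0.5 * w' A w - b' w (assumed for this example).
A = np.diag([1.0, 100.0])          # ill-conditioned Hessian
b = np.array([1.0, 1.0])
grad = lambda w: A @ w - b         # s(w) = grad F(w)

def descent(B_inv, eps, iters=50):
    """Generic iteration (3): w_{t+1} = w_t - eps_t * B_t^{-1} s(w_t)."""
    w = np.zeros(2)
    for t in range(iters):
        w = w - eps * B_inv @ grad(w)
    return w

w_gd     = descent(np.eye(2), eps=0.01)          # B_t = I        -> gradient descent
w_newton = descent(np.linalg.inv(A), eps=1.0)    # B_t = Hessian  -> Newton's method
print(w_gd, w_newton, np.linalg.solve(A, b))     # Newton reaches the minimizer immediately
```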

Various methods are known to select matrices B_t, including those by Broyden, e.g., Broyden et al. (1973); Davidon, Fletcher, and Powell (DFP), e.g., Fletcher (2013); and Broyden, Fletcher, Goldfarb, and Shanno (BFGS), e.g., Byrd et al. (1987); Powell (1976). We work with the matrices B_t used in BFGS since they have been observed to work best in practice (see Byrd et al. (1987)).

In BFGS, the curvature of the function is approximated by a finite difference. Let v_t denote the variable variation at time t and r_t the gradient variation at time t, which are respectively defined as

    v_t := w_{t+1} − w_t,    r_t := s(w_{t+1}) − s(w_t).    (4)

We select the matrix B_{t+1} to be used in the next time step so that it satisfies the secant condition B_{t+1} v_t = r_t. The rationale for this selection is that the Hessian H(w_t) satisfies this condition for w_{t+1} tending to w_t. Notice however that the secant condition B_{t+1} v_t = r_t is not enough to completely specify B_{t+1}. To resolve this indeterminacy, matrices B_{t+1} in BFGS are also required to be as close as possible to the previous Hessian approximation B_t in terms of differential entropy (see Mokhtari and Ribeiro (2014a)). These conditions can be resolved in closed form, leading to the explicit expression

    B_{t+1} = B_t + (r_t r_t^T)/(v_t^T r_t) − (B_t v_t v_t^T B_t)/(v_t^T B_t v_t).    (5)

While the expression in (5) permits updating the Hessian approximations B_{t+1}, implementation of the descent step in (3) requires its inversion. This can be avoided by using the Sherman-Morrison formula in (5) to write

    B_{t+1}^{-1} = Z_t^T B_t^{-1} Z_t + ρ_t v_t v_t^T,    (6)

where we defined the scalar ρ_t and the matrix Z_t as

    ρ_t := 1/(v_t^T r_t),    Z_t := I − ρ_t r_t v_t^T.    (7)

The updates in (5) and (6) require the inner product of the gradient and variable variations to be positive, i.e., v_t^T r_t > 0. This is always true if the objective F(w) is strongly convex and further implies that B_{t+1} stays positive definite if B_t ≻ 0 (Nocedal and Wright (1999)).

Each BFGS iteration has a cost of O(n^2) arithmetic operations. This is less than the O(n^3) of each step in Newton's method but more than the O(n) cost of each gradient descent iteration. In general, the relative convergence rates are such that the total computational cost of BFGS to achieve a target accuracy is smaller than the corresponding cost of gradient descent. Still, alternatives that reduce the computational cost of each iteration are of interest for large scale problems. Likewise, BFGS requires storage and propagation of the O(n^2) elements of B_t^{-1}, whereas gradient descent requires storage of O(n) gradient elements only. This motivates alternatives that have smaller memory footprints. Both of these objectives are accomplished by the limited memory BFGS (LBFGS) algorithm that we describe in the following section.
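As an illustration of the update in (6)-(7), the following sketch (not part of the paper; the quadratic test function is an assumption) applies the inverse BFGS update to exact curvature pairs and verifies that the secant condition, B_{t+1}^{-1} r_t = v_t in inverse form, holds after each update.

```python
import numpy as np

def bfgs_inverse_update(B_inv, v, r):
    """One application of (6)-(7): returns Z' B_inv Z + rho v v'."""
    rho = 1.0 / (v @ r)                    # rho_t = 1 / (v_t' r_t), requires v_t' r_t > 0
    Z = np.eye(len(v)) - rho * np.outer(r, v)
    return Z.T @ B_inv @ Z + rho * np.outer(v, v)

# Quadratic test objective with Hessian A (assumed for the example), so r_t = A v_t.
rng = np.random.default_rng(0)
n = 5
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)                # strongly convex quadratic
B_inv = np.eye(n)
for _ in range(10):
    v = rng.standard_normal(n)             # variable variation
    r = A @ v                              # gradient variation of the quadratic
    B_inv = bfgs_inverse_update(B_inv, v, r)
    assert np.allclose(B_inv @ r, v)       # secant condition in inverse form

# Compare with the true inverse Hessian of the quadratic (rough agreement, not exact).
print(np.linalg.norm(B_inv - np.linalg.inv(A)))
```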

2.1 LBFGS: Limited Memory BFGS

As it follows from (6), the updated Hessian inverse approximation B_t^{-1} depends on B_{t-1}^{-1} and the curvature information pair {v_{t-1}, r_{t-1}}. In turn, to compute B_{t-1}^{-1}, the estimate B_{t-2}^{-1} and the curvature pair {v_{t-2}, r_{t-2}} are used. Proceeding recursively, it follows that B_t^{-1} is a function of the initial approximation B_0^{-1} and all previous curvature information pairs {v_u, r_u}_{u=0}^{t-1}. The idea in LBFGS is to restrict the use of past curvature information to the last τ pairs {v_u, r_u}_{u=t−τ}^{t−1}. Since earlier iterates {v_u, r_u} with u < t − τ are likely to carry little information about the curvature at the current iterate w_t, this restriction is expected to result in a minimal performance penalty.

For a precise definition, pick a positive definite matrix B_{t,0}^{-1} as the initial Hessian inverse approximation at step t. Proceed then to perform τ updates of the form in (6) using the last τ curvature information pairs {v_u, r_u}_{u=t−τ}^{t−1}. Denoting as B_{t,u}^{-1} the curvature approximation after u updates are performed, the refined matrix approximation B_{t,u+1}^{-1} is given by [cf. (6)]

    B_{t,u+1}^{-1} = Z_{t−τ+u}^T B_{t,u}^{-1} Z_{t−τ+u} + ρ_{t−τ+u} v_{t−τ+u} v_{t−τ+u}^T,    (8)

where u = 0, ..., τ−1 and the constants ρ_{t−τ+u} and identity-plus-rank-one matrices Z_{t−τ+u} are as given in (7). The inverse Hessian approximation B_t^{-1} to be used in (3) is the one yielded after completing the τ updates in (8), i.e., B_t^{-1} = B_{t,τ}^{-1}. Observe that when t < τ there are not enough pairs {v_u, r_u} to perform τ updates. In such case we just redefine τ = t and proceed to use the t = τ available pairs {v_u, r_u}_{u=0}^{t−1}.

Implementation of the product B_t^{-1} s(w_t) in (3) for matrices B_t^{-1} = B_{t,τ}^{-1} obtained from the recursion in (8) does not need explicit computation of the matrix B_{t,τ}^{-1}. Although the details are not straightforward, observe that each iteration in (8) is similar to a rank-one update and that as such it is not unreasonable to expect that the product B_t^{-1} s(w_t) = B_{t,τ}^{-1} s(w_t) can be computed using τ recursive inner products. Assuming that this is possible, the implementation of the recursion in (8) does not need computation and storage of prior matrices B_t^{-1}. Rather, it suffices to keep the τ most recent curvature information pairs {v_u, r_u}_{u=t−τ}^{t−1}, thus reducing storage requirements from O(n^2) to O(τn). Furthermore, each of these inner products can be computed at a cost of n operations, yielding a total computational cost of O(τn) per LBFGS iteration. Hence, LBFGS decreases both the memory requirements and the computational cost of each iteration from the O(n^2) required by regular BFGS to O(τn). We present the details of this iteration in the context of the online (stochastic) LBFGS that we introduce in the following section.

2.2 Online (Stochastic) Limited Memory BFGS

To implement (3) and (8) we need to compute gradients s(w_t). This is impractical when the number of functions f(w, θ) is large, as is the case in most stochastic problems of practical interest, and motivates the use of stochastic gradients in lieu of actual gradients. Consider a given set of L realizations θ̃ = [θ_1; ...; θ_L] and define the stochastic gradient of F(w) at w given samples θ̃ as

    ŝ(w, θ̃) := (1/L) Σ_{l=1}^{L} ∇f(w, θ_l).    (9)

In oLBFGS we use stochastic gradients ŝ(w, θ̃) both as descent directions and as curvature estimators. In particular, the descent iteration in (3) is replaced by the descent iteration

    w_{t+1} = w_t − ε_t B̂_t^{-1} ŝ(w_t, θ̃_t) = w_t − ε_t d̂_t,    (10)

where θ̃_t = [θ_{t1}; ...; θ_{tL}] is the set of samples used at step t to compute the stochastic gradient ŝ(w_t, θ̃_t) as per (9), and the matrix B̂_t^{-1} is a function of past stochastic gradients ŝ(w_u, θ̃_u) with u ≤ t instead of a function of past gradients s(w_u) as in (3). As we also did in (3), we have defined the stochastic step d̂_t := B̂_t^{-1} ŝ(w_t, θ̃_t) to simplify upcoming discussions. To properly specify B̂_t^{-1} we define the stochastic gradient variation r̂_t at time t as the difference between the stochastic gradients ŝ(w_{t+1}, θ̃_t) and ŝ(w_t, θ̃_t) associated with subsequent iterates w_{t+1} and w_t and the common set of samples θ̃_t [cf. (4)],

    r̂_t := ŝ(w_{t+1}, θ̃_t) − ŝ(w_t, θ̃_t).    (11)

Observe that ŝ(w_t, θ̃_t) is the stochastic gradient used at time t in (10), but that ŝ(w_{t+1}, θ̃_t) is computed solely for the purpose of determining the stochastic gradient variation. The perhaps more natural definition ŝ(w_{t+1}, θ̃_{t+1}) − ŝ(w_t, θ̃_t) for the stochastic gradient variation, which relies on the stochastic gradient ŝ(w_{t+1}, θ̃_{t+1}) used at time t+1 in (10), is not sufficient to guarantee convergence; see e.g., Mokhtari and Ribeiro (2014a).
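A minimal sketch of (9) and (11) follows, assuming a user-supplied per-sample gradient `grad_f(w, theta)`; note that the same mini-batch θ̃_t is reused at both w_t and w_{t+1} when forming the variation r̂_t, which is the point the paragraph above emphasizes. The least-squares sample loss in the example is an assumption made for illustration.

```python
import numpy as np

def stoch_grad(grad_f, w, batch):
    """Mini-batch stochastic gradient (9): average of per-sample gradients."""
    return sum(grad_f(w, theta) for theta in batch) / len(batch)

def gradient_variation(grad_f, w_t, w_next, batch_t):
    """Stochastic gradient variation (11), evaluated on the SAME samples theta_t."""
    return stoch_grad(grad_f, w_next, batch_t) - stoch_grad(grad_f, w_t, batch_t)

# Example with a least-squares sample loss f(w, (x, y)) = 0.5 (x'w - y)^2 (an assumption).
grad_f = lambda w, s: (s[0] @ w - s[1]) * s[0]
rng = np.random.default_rng(1)
batch = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(10)]
w_t, w_next = np.zeros(3), 0.1 * rng.standard_normal(3)
r_hat = gradient_variation(grad_f, w_t, w_next, batch)
# v_t' r_hat_t: nonnegative for convex sample losses, strictly positive under strong convexity.
print(r_hat @ (w_next - w_t))
```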

To define the oLBFGS algorithm we just need to provide stochastic versions of the definitions in (7) and (8). The scalar constants and identity-plus-rank-one matrices in (7) are redefined to the corresponding stochastic quantities

    ρ̂_{t−τ+u} = 1/(v_{t−τ+u}^T r̂_{t−τ+u})    and    Ẑ_{t−τ+u} = I − ρ̂_{t−τ+u} r̂_{t−τ+u} v_{t−τ+u}^T,    (12)

whereas the LBFGS matrix B_t^{-1} = B_{t,τ}^{-1} in (8) is replaced by the oLBFGS Hessian inverse approximation B̂_t^{-1} = B̂_{t,τ}^{-1}, which we define as the outcome of τ recursive applications of the update

    B̂_{t,u+1}^{-1} = Ẑ_{t−τ+u}^T B̂_{t,u}^{-1} Ẑ_{t−τ+u} + ρ̂_{t−τ+u} v_{t−τ+u} v_{t−τ+u}^T,    (13)

where the initial matrix B̂_{t,0}^{-1} is given and the index is u = 0, ..., τ−1. The oLBFGS algorithm is defined by the stochastic descent iteration in (10) with matrices B̂_t^{-1} = B̂_{t,τ}^{-1} computed by τ recursive applications of (13). Except for the fact that they use stochastic variables, (10) and (13) are identical to (3) and (8). Thus, as is the case in (3), the Hessian inverse approximation in (13) is a function of the initial Hessian inverse approximation B̂_{t,0}^{-1} and the τ most recent curvature information pairs {v_u, r̂_u}_{u=t−τ}^{t−1}. Likewise, when t < τ there are not enough pairs {v_u, r̂_u} to perform τ updates. In such case we just redefine τ = t and proceed to use the t = τ available pairs {v_u, r̂_u}_{u=0}^{t−1}. We also point out that the update in (13) necessitates r̂_u^T v_u > 0 for all time indexes u. This is true as long as the instantaneous functions f(w, θ) are strongly convex with respect to w, as we show in Lemma 2.

The equations in (10) and (13) are used conceptually but not in practical implementations. For the latter we exploit the structure of (13) to rearrange the terms in the computation of the product B̂_t^{-1} ŝ(w_t, θ̃_t). To see how this is done, consider the recursive update for the Hessian inverse approximation in (13) and make u = τ − 1 to write

    B̂_t^{-1} = B̂_{t,τ}^{-1} = Ẑ_{t−1}^T B̂_{t,τ−1}^{-1} Ẑ_{t−1} + ρ̂_{t−1} v_{t−1} v_{t−1}^T.    (14)

Equation (14) shows the relation between the Hessian inverse approximation B̂_t^{-1} and the (τ−1)st updated version B̂_{t,τ−1}^{-1} of the initial Hessian inverse approximation at step t. Set now u = τ − 2 in (13) to express B̂_{t,τ−1}^{-1} in terms of B̂_{t,τ−2}^{-1} and substitute the result in (14) to rewrite B̂_t^{-1} as

    B̂_t^{-1} = Ẑ_{t−1}^T Ẑ_{t−2}^T B̂_{t,τ−2}^{-1} Ẑ_{t−2} Ẑ_{t−1} + ρ̂_{t−2} Ẑ_{t−1}^T v_{t−2} v_{t−2}^T Ẑ_{t−1} + ρ̂_{t−1} v_{t−1} v_{t−1}^T.    (15)

We can proceed recursively by substituting B̂_{t,τ−2}^{-1} for its expression in terms of B̂_{t,τ−3}^{-1}, and in the result substituting B̂_{t,τ−3}^{-1} for its expression in terms of B̂_{t,τ−4}^{-1}, and so on. Observe that a new summand is added in each of these substitutions, from which it follows that repeating this process τ times yields

    B̂_t^{-1} = (Ẑ_{t−1}^T ⋯ Ẑ_{t−τ}^T) B̂_{t,0}^{-1} (Ẑ_{t−τ} ⋯ Ẑ_{t−1})
              + ρ̂_{t−τ} (Ẑ_{t−1}^T ⋯ Ẑ_{t−τ+1}^T) v_{t−τ} v_{t−τ}^T (Ẑ_{t−τ+1} ⋯ Ẑ_{t−1})
              + ⋯ + ρ̂_{t−2} Ẑ_{t−1}^T v_{t−2} v_{t−2}^T Ẑ_{t−1} + ρ̂_{t−1} v_{t−1} v_{t−1}^T.    (16)

The important observation in (16) is that the matrix Ẑ_{t−1} and its transpose Ẑ_{t−1}^T are the first and last product terms of all summands except the last, that the matrix Ẑ_{t−2} and its transpose Ẑ_{t−2}^T are second and penultimate in all terms but the last two, and so on.

Thus, when computing the oLBFGS step d̂_t := B̂_t^{-1} ŝ(w_t, θ̃_t), the operations needed to compute the product with the next to last summand of (16) can be reused to compute the product with the second to last summand, which in turn can be reused in determining the product with the third to last summand, and so on. This observation, compounded with the fact that multiplications with the identity-plus-rank-one matrices Ẑ_{t−u} require O(n) operations, yields an algorithm that can compute the oLBFGS step d̂_t := B̂_t^{-1} ŝ(w_t, θ̃_t) in O(τn) operations. We summarize the specifics of such computation in the following proposition, where we consider the computation of the product B̂_t^{-1} p with a given arbitrary vector p.

Proposition 1 Consider the oLBFGS Hessian inverse approximation B̂_t^{-1} = B̂_{t,τ}^{-1} obtained after τ recursive applications of the update in (13), with the scalar sequence ρ̂_{t−τ+u} and identity-plus-rank-one matrix sequence Ẑ_{t−τ+u} as defined in (12) for given variable and stochastic gradient variation pairs {v_u, r̂_u}_{u=t−τ}^{t−1}. For a given vector p = p_0 define the sequence of vectors p_u through the recursion

    p_{u+1} = p_u − α_u r̂_{t−u−1}    for u = 0, ..., τ−1,    (17)

where we also define the constants α_u := ρ̂_{t−u−1} v_{t−u−1}^T p_u. Further define the sequence of vectors q_u with initial value q_0 = B̂_{t,0}^{-1} p_τ and subsequent elements

    q_{u+1} = q_u + (α_{τ−u−1} − β_u) v_{t−τ+u}    for u = 0, ..., τ−1,    (18)

where we define the constants β_u := ρ̂_{t−τ+u} r̂_{t−τ+u}^T q_u. The product B̂_t^{-1} p equals q_τ, i.e., B̂_t^{-1} p = q_τ.

Proof: See Appendix A.

The reorganization of computations described in Proposition 1 has been done for the deterministic LBFGS method in, e.g., Nocedal and Wright (1999). We have used the same technique here for computing the descent direction of oLBFGS and have shown the result and derivations for completeness. In any event, Proposition 1 asserts that it is possible to reduce the computation of the product B̂_t^{-1} p between the oLBFGS Hessian inverse approximation matrix and an arbitrary vector p to the computation of two vector sequences {p_u}_{u=0}^{τ} and {q_u}_{u=0}^{τ}. The product B̂_t^{-1} p = q_τ is given by the last element of the latter sequence. Since determination of each of the elements of each sequence requires O(n) operations and the total number of elements in each sequence is τ, the total operation cost to compute both sequences is of order O(τn).

In computing B̂_t^{-1} p we also need to add the cost of the product q_0 = B̂_{t,0}^{-1} p_τ that links both sequences. To maintain an overall computation cost of order O(τn), this matrix has to have a sparse or low rank structure. A common choice in LBFGS, which we adopt for oLBFGS, is to make B̂_{t,0}^{-1} = γ̂_t I. The scalar constant γ̂_t is a function of the variable and stochastic gradient variations v_{t−1} and r̂_{t−1}, explicitly given by

    γ̂_t = (v_{t−1}^T r̂_{t−1})/(r̂_{t−1}^T r̂_{t−1}) = (v_{t−1}^T r̂_{t−1})/‖r̂_{t−1}‖²,    (19)

with the value at the first iteration being γ̂_0 = 1. The scaling factor γ̂_t attempts to estimate one of the eigenvalues of the Hessian matrix at step t and has been observed to work well in practice; see e.g., Liu and Nocedal (1989); Nocedal and Wright (1999). Further observe that the cost of computing γ̂_t is of order O(n) and that, since B̂_{t,0}^{-1} is diagonal, the cost of computing the product q_0 = B̂_{t,0}^{-1} p_τ is also of order O(n). We adopt the initialization in (19) in our subsequent analysis and numerical experiments.
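The two-loop structure of Proposition 1 with the scaled-identity initialization (19) can be written compactly as below. This is a sketch of the standard L-BFGS two-loop recursion applied to the stochastic pairs {v_u, r̂_u}, not code from the paper, and the interface (a list of the τ most recent (v, r̂) pairs, oldest first) is an assumption.

```python
import numpy as np

def olbfgs_two_loop(p, pairs):
    """Compute q = B_t^{-1} p per Proposition 1 with initialization (19).

    pairs: list of the tau most recent (v_u, r_hat_u) tuples, oldest first,
           i.e. [(v_{t-tau}, r_{t-tau}), ..., (v_{t-1}, r_{t-1})].
    """
    if not pairs:                        # t = 0: no curvature information, gamma_0 = 1
        return p.copy()
    alphas = []
    q = p.copy()
    for v, r in reversed(pairs):         # first loop, eq. (17): most recent pair first
        rho = 1.0 / (v @ r)
        alpha = rho * (v @ q)
        q = q - alpha * r
        alphas.append(alpha)
    v_last, r_last = pairs[-1]
    gamma = (v_last @ r_last) / (r_last @ r_last)    # eq. (19)
    q = gamma * q                        # q_0 = B_{t,0}^{-1} p_tau with B_{t,0}^{-1} = gamma_t I
    for (v, r), alpha in zip(pairs, reversed(alphas)):   # second loop, eq. (18): oldest first
        rho = 1.0 / (v @ r)
        beta = rho * (r @ q)
        q = q + (alpha - beta) * v
    return q
```

Calling `olbfgs_two_loop(s_hat, pairs)` with the current stochastic gradient returns the step d̂_t of (10) at O(τn) cost, without ever forming B̂_t^{-1} explicitly.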

The computation of the product B̂_t^{-1} p using the result in Proposition 1 is summarized in algorithmic form in Algorithm 1. The function receives as arguments the initial matrix B̂_{t,0}^{-1}, the sequence of variable and stochastic gradient variations {v_u, r̂_u}_{u=t−τ}^{t−1}, and the vector p, and produces the outcome q = q_τ = B̂_t^{-1} p. When called with the stochastic gradient p = ŝ(w_t, θ̃_t), the function outputs the oLBFGS step d̂_t := B̂_t^{-1} ŝ(w_t, θ̃_t) needed to implement the oLBFGS descent step in (10).

Algorithm 1 Computation of the oLBFGS step q = B̂_t^{-1} p when called with p = ŝ(w_t, θ̃_t).
 1: function q = q_τ = oLBFGS Step(B̂_{t,0}^{-1}, p = p_0, {v_u, r̂_u}_{u=t−τ}^{t−1})
 2: for u = 0, 1, ..., τ−1 do   {Loop to compute constants α_u and sequence p_u}
 3:   Compute and store scalar α_u = ρ̂_{t−u−1} v_{t−u−1}^T p_u
 4:   Update sequence vector p_{u+1} = p_u − α_u r̂_{t−u−1}  [cf. (17)]
 5: end for
 6: Multiply p_τ by initial matrix: q_0 = B̂_{t,0}^{-1} p_τ
 7: for u = 0, 1, ..., τ−1 do   {Loop to compute constants β_u and sequence q_u}
 8:   Compute scalar β_u = ρ̂_{t−τ+u} r̂_{t−τ+u}^T q_u
 9:   Update sequence vector q_{u+1} = q_u + (α_{τ−u−1} − β_u) v_{t−τ+u}  [cf. (18)]
10: end for   {return q = q_τ}

The core of Algorithm 1 is given by the loop in steps 2-5 that computes the constants α_u and sequence elements p_u, as well as the loop in steps 7-10 that computes the constants β_u and sequence elements q_u. The two loops are linked by the initialization of the second sequence with the outcome of the first, which is performed in Step 6. To implement the first loop we require τ inner products in Step 3 and τ vector summations in Step 4, which yield a total of 2τn multiplications. Likewise, the second loop requires τ inner products and τ vector summations in steps 8 and 9, respectively, which yields a total cost of also 2τn multiplications. Since the initial Hessian inverse approximation matrix B̂_{t,0}^{-1} is diagonal, the cost of computing B̂_{t,0}^{-1} p_τ in Step 6 is n multiplications. Thus, Algorithm 1 requires a total of (4τ + 1)n multiplications, which affirms the complexity cost of order O(τn) for oLBFGS.

For reference, oLBFGS is also summarized in algorithmic form in Algorithm 2. As with any stochastic descent algorithm, the descent iteration is implemented in three steps: the acquisition of L samples in Step 2, the computation of the stochastic gradient in Step 3, and the implementation of the descent update on the variable w_t in Step 6. Steps 4 and 5 are devoted to the computation of the oLBFGS descent direction d̂_t. In Step 4 we initialize the estimate B̂_{t,0}^{-1} = γ̂_t I as a scaled identity matrix using the expression for γ̂_t in (19) for t > 0. The value of γ̂_t = γ̂_0 for t = 0 is left as an input for the algorithm. We use γ̂_0 = 1 in our numerical tests. In Step 5 we use Algorithm 1 for efficient computation of the descent direction d̂_t = B̂_t^{-1} ŝ(w_t, θ̃_t). Step 7 determines the value of the stochastic gradient ŝ(w_{t+1}, θ̃_t) so that the variable variation v_t and stochastic gradient variation r̂_t become available for the computation of the curvature approximation matrix. In Step 8 the variable variation v_t and stochastic gradient variation r̂_t are computed to be used in the next iteration.

Algorithm 2 oLBFGS
Require: Initial value w_0. Initial Hessian approximation parameter γ̂_0 = 1.
 1: for t = 0, 1, 2, ... do
 2:   Acquire L independent samples θ̃_t = [θ_{t1}, ..., θ_{tL}]
 3:   Compute stochastic gradient: ŝ(w_t, θ̃_t) = (1/L) Σ_{l=1}^{L} ∇_w f(w_t, θ_{tl})  [cf. (9)]
 4:   Initialize Hessian inverse estimate as B̂_{t,0}^{-1} = γ̂_t I with γ̂_t = v_{t−1}^T r̂_{t−1} / (r̂_{t−1}^T r̂_{t−1}) for t > 0  [cf. (19)]
 5:   Compute descent direction with Algorithm 1: d̂_t = oLBFGS Step(B̂_{t,0}^{-1}, ŝ(w_t, θ̃_t), {v_u, r̂_u}_{u=t−τ}^{t−1})
 6:   Descend along direction d̂_t: w_{t+1} = w_t − ε_t d̂_t  [cf. (10)]
 7:   Compute stochastic gradient: ŝ(w_{t+1}, θ̃_t) = (1/L) Σ_{l=1}^{L} ∇_w f(w_{t+1}, θ_{tl})  [cf. (9)]
 8:   Variations: v_t = w_{t+1} − w_t  [variable, cf. (4)];  r̂_t = ŝ(w_{t+1}, θ̃_t) − ŝ(w_t, θ̃_t)  [stoch. gradient, cf. (11)]
 9: end for

We analyze convergence properties of this algorithm in Section 3 and develop an application to search engine advertisement in Section 4.
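Putting (9)-(11), (19), and Algorithm 2 together, a compact driver loop might look as follows. This is a sketch under stated assumptions, not the authors' reference implementation: it reuses the `olbfgs_two_loop` helper from the earlier sketch and assumes a user-supplied per-sample gradient `grad_f(w, theta)`, a sampler `draw_batch(L)`, and a step size schedule `step_size(t)`.

```python
import numpy as np
from collections import deque

def olbfgs(w0, grad_f, draw_batch, step_size, iters, tau=10, L=100):
    """Online L-BFGS driver in the spirit of Algorithm 2 (illustrative sketch).

    Assumes: grad_f(w, theta) is the per-sample gradient, draw_batch(L) returns L samples,
    step_size(t) returns epsilon_t, and olbfgs_two_loop is the two-loop sketch above.
    """
    w = np.asarray(w0, dtype=float)
    pairs = deque(maxlen=tau)                   # last tau curvature pairs (v_u, r_hat_u)
    for t in range(iters):
        batch = draw_batch(L)                                       # Step 2
        s_t = sum(grad_f(w, th) for th in batch) / L                # Step 3, eq. (9)
        d_t = olbfgs_two_loop(s_t, list(pairs))                     # Steps 4-5, Prop. 1 + (19)
        w_next = w - step_size(t) * d_t                             # Step 6, eq. (10)
        s_next = sum(grad_f(w_next, th) for th in batch) / L        # Step 7, same samples
        pairs.append((w_next - w, s_next - s_t))                    # Step 8, eqs. (4) and (11)
        w = w_next
    return w
```

A typical call would use a diminishing schedule such as `step_size = lambda t: eps0 * T0 / (T0 + t)`, which is the form used in Section 4.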

3. Convergence Analysis

For the subsequent analysis it is convenient to define the instantaneous objective function associated with samples θ̃ = [θ_1, ..., θ_L] as

    f̂(w, θ̃) := (1/L) Σ_{l=1}^{L} f(w, θ_l).    (20)

The definition of the instantaneous objective function f̂(w, θ̃), in association with the fact that F(w) := E_θ[f(w, θ)], implies that

    F(w) = E_θ̃[f̂(w, θ̃)].    (21)

Our goal here is to show that as time progresses the sequence of variable iterates w_t approaches the optimal argument w*. In proving this result we make the following assumptions.

Assumption 1 The instantaneous functions f̂(w, θ̃) are twice differentiable and the eigenvalues of the instantaneous Hessian Ĥ(w, θ̃) = ∇²_w f̂(w, θ̃) are bounded between constants 0 < m̂ and M̂ < ∞ for all random variables θ̃,

    m̂ I ⪯ Ĥ(w, θ̃) ⪯ M̂ I.    (22)

Assumption 2 The second moment of the norm of the stochastic gradient is bounded for all w, i.e., there exists a constant S² such that for all variables w it holds

    E_θ̃[ ‖ŝ(w, θ̃)‖² | w ] ≤ S².    (23)

Assumption 3 The step size sequence is selected as nonsummable but square summable, i.e.,

    Σ_{t=0}^{∞} ε_t = ∞    and    Σ_{t=0}^{∞} ε_t² < ∞.    (24)

Assumptions 2 and 3 are customary in stochastic optimization. The restriction imposed by Assumption 2 is intended to limit the random variation of stochastic gradients. If the variance of their norm is unbounded, it is possible to have rare events that derail progress towards convergence. The condition in Assumption 3 balances descent towards optimal arguments, which requires a slowly decreasing step size, with the eventual elimination of random variations, which requires rapidly decreasing step sizes. An effective step size choice for which Assumption 3 holds is to make ε_t = ε_0 T_0/(T_0 + t), for given parameters ε_0 and T_0 that control the initial step size and its speed of decrease, respectively.
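As a quick numerical illustration (not from the paper) that the schedule ε_t = ε_0 T_0/(T_0 + t) meets Assumption 3, the partial sums below grow without bound while the squared partial sums stay bounded; the values ε_0 = 10^{-2} and T_0 = 10^4 are only placeholders.

```python
import numpy as np

eps0, T0 = 1e-2, 1e4                        # placeholder parameters
t = np.arange(1_000_000)
eps = eps0 * T0 / (T0 + t)                  # epsilon_t = eps0 * T0 / (T0 + t)

# Sum of eps_t diverges like log(t) (nonsummable); sum of eps_t^2 converges (square summable).
print(np.cumsum(eps)[[999, 99_999, 999_999]])      # keeps growing
print(np.cumsum(eps**2)[[999, 99_999, 999_999]])   # levels off near eps0^2 * T0
```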

Assumption 1 is stronger than usual and specific to oLBFGS. Observe that, considering the linearity of the expectation operator and the expression in (21), it follows that the Hessian of the average function can be written as ∇²_w F(w) = H(w) = E_θ̃[Ĥ(w, θ̃)]. Combining this observation with the bounds in (22), we conclude that there are constants m ≥ m̂ and M ≤ M̂ such that

    m̂ I ⪯ m I ⪯ H(w) ⪯ M I ⪯ M̂ I.    (25)

The bounds in (25) are customary in convergence proofs of descent methods. For the results here the stronger condition spelled out in Assumption 1 is needed. This assumption is necessary to guarantee that the inner product r̂_t^T v_t > 0 is positive, as we show in the following lemma.

Lemma 2 Consider the stochastic gradient variation r̂_t defined in (11) and the variable variation v_t defined in (4). Let Assumption 1 hold so that we have lower and upper bounds m̂ and M̂ on the eigenvalues of the instantaneous Hessians. Then, for all steps t the inner product of variable and stochastic gradient variations r̂_t^T v_t is bounded below as

    m̂ ‖v_t‖² ≤ r̂_t^T v_t.    (26)

Furthermore, the ratio of the stochastic gradient variation squared norm ‖r̂_t‖² = r̂_t^T r̂_t to the inner product of variable and stochastic gradient variations is bounded as

    m̂ ≤ (r̂_t^T r̂_t)/(r̂_t^T v_t) = ‖r̂_t‖²/(r̂_t^T v_t) ≤ M̂.    (27)

Proof: See Appendix B.

According to Lemma 2, strong convexity of the instantaneous functions f̂(w, θ̃) guarantees positiveness of the inner product v_t^T r̂_t as long as the variable variation is not identically null. In turn, this implies that the constant γ̂_t in (19) is nonnegative and that, as a consequence, the initial Hessian inverse approximation B̂_{t,0}^{-1} is positive definite for all steps t. The positive definiteness of B̂_{t,0}^{-1}, in association with the positiveness of the inner product of variable and stochastic gradient variations v_t^T r̂_t > 0, further guarantees that all the matrices B̂_{t,u+1}^{-1}, including the matrix B̂_t^{-1} = B̂_{t,τ}^{-1} in particular, that follow the update rule in (13) stay positive definite; see Mokhtari and Ribeiro (2014a) for details. This proves that (10) is a proper stochastic descent iteration because the stochastic gradient ŝ(w_t, θ̃_t) is moderated by a positive definite matrix. However, this fact alone is not enough to guarantee convergence because the minimum and maximum eigenvalues of B̂_t^{-1} could become arbitrarily small and arbitrarily large, respectively. To prove convergence we show this is not possible by deriving explicit lower and upper bounds on these eigenvalues.

The analysis is easier if we consider the matrix B̂_t as opposed to B̂_t^{-1}. Consider then the update in (13), and use the Sherman-Morrison formula to rewrite it as an update that relates B̂_{t,u+1} to B̂_{t,u},

    B̂_{t,u+1} = B̂_{t,u} − (B̂_{t,u} v_{t−τ+u} v_{t−τ+u}^T B̂_{t,u})/(v_{t−τ+u}^T B̂_{t,u} v_{t−τ+u}) + (r̂_{t−τ+u} r̂_{t−τ+u}^T)/(v_{t−τ+u}^T r̂_{t−τ+u}),    (28)

for u = 0, ..., τ−1 and B̂_{t,0} = I/γ̂_t as per (19). As in (13), the Hessian approximation at step t is B̂_t = B̂_{t,τ}. In the following lemma we use the update formula in (28) to find bounds on the trace and determinant of the Hessian approximation B̂_t.

Lemma 3 Consider the Hessian approximation B̂_t = B̂_{t,τ} defined by the recursion in (28) with B̂_{t,0} = I/γ̂_t and γ̂_t as given by (19). If Assumption 1 holds true, the trace tr(B̂_{t,τ}) of the Hessian approximation is uniformly upper bounded for all times t ≥ 1,

    tr(B̂_{t,τ}) ≤ (n + τ) M̂.    (29)

Likewise, if Assumption 1 holds true, the determinant det(B̂_{t,τ}) of the Hessian approximation is uniformly lower bounded for all times t ≥ 1,

    det(B̂_{t,τ}) ≥ m̂^{n+τ} / [(n + τ) M̂]^{τ}.    (30)

Proof: See Appendix C.
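The bounds in Lemma 3 are easy to probe numerically. The sketch below (an illustration with an assumed synthetic model, not part of the paper) runs the direct update (28) from B̂_{t,0} = I/γ̂_t on curvature pairs whose instantaneous Hessians lie between m̂I and M̂I, and checks (29) and (30).

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, m_hat, M_hat = 8, 5, 0.5, 4.0

def random_pair():
    """Curvature pair (v, r) with r = H v for some Hessian m_hat*I <= H <= M_hat*I (assumed model)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    H = Q @ np.diag(rng.uniform(m_hat, M_hat, n)) @ Q.T
    v = rng.standard_normal(n)
    return v, H @ v

pairs = [random_pair() for _ in range(tau)]
v_last, r_last = pairs[-1]
gamma = (v_last @ r_last) / (r_last @ r_last)      # eq. (19)
B = np.eye(n) / gamma                              # B_{t,0} = I / gamma_t
for v, r in pairs:                                 # tau applications of (28)
    Bv = B @ v
    B = B - np.outer(Bv, Bv) / (v @ Bv) + np.outer(r, r) / (v @ r)

assert np.trace(B) <= (n + tau) * M_hat                                   # bound (29)
assert np.linalg.det(B) >= m_hat**(n + tau) / ((n + tau) * M_hat)**tau    # bound (30)
print(np.trace(B), np.linalg.det(B))
```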

Lemma 3 states that the trace and determinant of the Hessian approximation matrix B̂_t = B̂_{t,τ} are bounded for all times t ≥ 1. For time t = 0 we can write a similar bound that takes into account the fact that the constant γ̂_t that initializes the recursion in (28) is γ̂_0 = 1. Given that we are interested in an asymptotic convergence analysis, this bound is inconsequential.

The bounds on the trace and determinant of B̂_t are respectively equivalent to bounds on the sum and product of its eigenvalues. Further considering that the matrix B̂_t is positive definite, as it follows from Lemma 2, these bounds can be further transformed into bounds on the smallest and largest eigenvalues of B̂_t. The resulting bounds are formally stated in the following lemma.

Lemma 4 Consider the Hessian approximation B̂_t = B̂_{t,τ} defined by the recursion in (28) with B̂_{t,0} = I/γ̂_t and γ̂_t as given by (19). Define the strictly positive constant 0 < c := m̂^{n+τ}/[(n + τ)M̂]^{n+τ} and the finite constant C := (n + τ)M̂ < ∞. If Assumption 1 holds true, the range of eigenvalues of B̂_t is bounded by c and C for all time steps t ≥ 1, i.e.,

    (m̂^{n+τ}/[(n + τ)M̂]^{n+τ}) I =: c I ⪯ B̂_t ⪯ C I := (n + τ)M̂ I.    (31)

Proof: See Appendix D.

The bounds in Lemma 4 imply that their respective inverses are bounds on the range of the eigenvalues of the Hessian inverse approximation matrix B̂_t^{-1}. Specifically, the minimum eigenvalue of the Hessian inverse approximation B̂_t^{-1} is larger than 1/C and the maximum eigenvalue of B̂_t^{-1} does not exceed 1/c, or, equivalently,

    (1/C) I ⪯ B̂_t^{-1} ⪯ (1/c) I.    (32)

We further emphasize that the bounds in (32), or (31) for that matter, limit the conditioning of B̂_t^{-1} for all realizations of the random samples {θ̃_t}_{t=0}^{∞}, irrespective of the particular random draw. Having matrices B̂_t^{-1} that are strictly positive definite with eigenvalues uniformly upper bounded by 1/c leads to the conclusion that if ŝ(w_t, θ̃_t) is a descent direction, the same holds true of B̂_t^{-1} ŝ(w_t, θ̃_t). The stochastic gradient ŝ(w_t, θ̃_t) is not a descent direction in general, but we know that this is true for its conditional expectation E[ŝ(w_t, θ̃_t) | w_t] = ∇F(w_t). Hence, we conclude that B̂_t^{-1} ŝ(w_t, θ̃_t) is an average descent direction, since E[B̂_t^{-1} ŝ(w_t, θ̃_t) | w_t] = B̂_t^{-1} ∇F(w_t). Stochastic optimization methods whose displacements w_{t+1} − w_t are descent directions on average are expected to approach optimal arguments. We show that this is true of oLBFGS in the following lemma.

Lemma 5 Consider the online limited memory BFGS algorithm as defined by the descent iteration in (10) with matrices B̂_t^{-1} = B̂_{t,τ}^{-1} obtained after τ recursive applications of the update in (13) initialized with B̂_{t,0}^{-1} = γ̂_t I and γ̂_t as given by (19). If Assumptions 1 and 2 hold true, the sequence of average function values F(w_t) satisfies

    E[F(w_{t+1}) | w_t] − F(w*) ≤ F(w_t) − F(w*) − (ε_t/C) ‖∇F(w_t)‖² + M S² ε_t²/(2c²).    (33)

Proof: See Appendix E.

Setting aside the term M S² ε_t²/(2c²) for the sake of argument, (33) defines a supermartingale relationship for the sequence of objective function errors F(w_t) − F(w*). This implies that the sequence ε_t ‖∇F(w_t)‖²/C is almost surely summable, which, given that the step sizes ε_t are nonsummable as per (24), further implies that the limit infimum liminf_{t→∞} ‖∇F(w_t)‖ of the gradient norm ‖∇F(w_t)‖ is almost surely null.

This latter observation is equivalent to having liminf_{t→∞} F(w_t) − F(w*) = 0 with probability 1 over realizations of the random samples {θ̃_t}_{t=0}^{∞}. Therefore, a subsequence of the sequence of objective function errors F(w_t) − F(w*) converges to null almost surely. Moreover, according to the supermartingale convergence theorem, the limit lim_{t→∞} F(w_t) − F(w*) of the nonnegative objective function errors F(w_t) − F(w*) almost surely exists. This observation, in conjunction with the fact that a subsequence of the sequence F(w_t) − F(w*) converges almost surely to null, implies that the whole sequence F(w_t) − F(w*) converges almost surely to zero. Considering the strong convexity assumption, this result implies almost sure convergence of the whole sequence ‖w_t − w*‖² to null. The term M S² ε_t²/(2c²) is a relatively minor nuisance that can be taken care of with a technical argument that we present in the proof of the following theorem.

Theorem 6 Consider the online limited memory BFGS algorithm as defined by the descent iteration in (10) with matrices B̂_t^{-1} = B̂_{t,τ}^{-1} obtained after τ recursive applications of the update in (13) initialized with B̂_{t,0}^{-1} = γ̂_t I and γ̂_t as given by (19). If Assumptions 1-3 hold true, the limit of the squared Euclidean distance to optimality ‖w_t − w*‖² converges to zero almost surely, i.e.,

    Pr[ lim_{t→∞} ‖w_t − w*‖² = 0 ] = 1,    (34)

where the probability is over realizations of the random samples {θ̃_t}_{t=0}^{∞}.

Proof: See Appendix F.

Theorem 6 establishes convergence of the oLBFGS algorithm summarized in Algorithm 2. The lower and upper bounds on the eigenvalues of B̂_t derived in Lemma 4 play a fundamental role in the proofs of the prerequisite Lemma 5 and Theorem 6 proper. Roughly speaking, the lower bound on the eigenvalues of B̂_t results in an upper bound on the eigenvalues of B̂_t^{-1}, which limits the effect of random variations on the stochastic gradient ŝ(w_t, θ̃_t). If this bound does not exist, as is the case, e.g., of regular stochastic BFGS, we may observe catastrophic amplification of random variations of the stochastic gradient. The upper bound on the eigenvalues of B̂_t, which results in a lower bound on the eigenvalues of B̂_t^{-1}, guarantees that the random variations in the curvature estimate do not yield matrices B̂_t^{-1} with arbitrarily small norm. If this bound does not hold, it is possible to end up halting progress before convergence as the stochastic gradient is nullified by multiplication with an arbitrarily small eigenvalue. The result in Theorem 6 is strong because it holds almost surely over realizations of the random samples {θ̃_t}_{t=0}^{∞}, but it is not stronger than the same convergence guarantees that hold for SGD. We complement the convergence result in Theorem 6 with a characterization of the expected convergence rate that we introduce in the following theorem.

Theorem 7 Consider the online limited memory BFGS algorithm as defined by the descent iteration in (10) with matrices B̂_t^{-1} = B̂_{t,τ}^{-1} obtained after τ recursive applications of the update in (13) initialized with B̂_{t,0}^{-1} = γ̂_t I and γ̂_t as given by (19). Let Assumptions 1 and 2 hold, and further assume that the step size sequence is of the form ε_t = ε_0 T_0/(T_0 + t) with the parameters ε_0 and T_0 satisfying the inequality 2mε_0T_0/C > 1. Then, the difference between the expected objective E[F(w_t)] and the optimal objective F(w*) is bounded as

    E[F(w_t)] − F(w*) ≤ C_0/(T_0 + t),    (35)

where the constant C_0 is defined as

    C_0 := max{ ε_0² T_0² C M S² / (2c² (2mε_0T_0 − C)), T_0 (F(w_0) − F(w*)) }.    (36)

Proof: See Appendix G.

Theorem 7 shows that under the specified assumptions the expected error in terms of the objective value after t oLBFGS iterations is of order O(1/t). As is the case of Theorem 6, this result is not better than the convergence rate of conventional SGD. As can be seen in the proof of Theorem 7, the convergence rate is dominated by the noise term introduced by the difference between stochastic and regular gradients. This noise term would be present even if exact Hessians were available, and in that sense the best that can be proven of oLBFGS is that its convergence rate is not worse than that of SGD. Given that Theorems 6 and 7 parallel the theoretical guarantees of SGD, it is perhaps fairer to describe oLBFGS as an adaptive reconditioning strategy instead of a stochastic quasi-Newton method. The latter description refers to the genesis of the algorithm, but the former is a more accurate description of its behavior. Do notice that while the convergence rate does not change, improvements in convergence time are significant, as we illustrate with the numerical experiments that we present in the next section.

4. Search Engine Advertising

We apply oLBFGS to the problem of predicting the click-through rate (CTR) of an advertisement displayed in response to a specific search engine query by a specific visitor. In these problems we are given meta information about an advertisement, the words that appear in the query, as well as some information about the visitor, and are asked to predict the likelihood that this particular ad is clicked by this particular user when performing this particular query. The information specific to the ad includes descriptors of different characteristics such as the words that appear in the title, the name of the advertiser, keywords that identify the product, and the position on the page where the ad is to be displayed. The information specific to the user is also heterogeneous and includes gender, age, and propensity to click on ads. To train a classifier we are given information about past queries along with the corresponding click success of the ads displayed in response to the query. The ad metadata along with the user data and search words define a feature vector that we use to train a logistic regressor that predicts the CTR of future ads. Given the heterogeneity of the components of the feature vector, we expect a logistic cost function with skewed level sets and consequent large benefits from the use of oLBFGS.

4.1 Feature Vectors

For the CTR problem considered here we use the Tencent search engine data set (Sun (2012)). This data set contains the outcomes of 236 million searches along with information about the ad, the query, and the user. The information contained in each sample point is the following:

User profile: If known, age and gender of the visitor performing the query.
Depth: Total number of advertisements displayed in the search results page.
Position: Position of the advertisement in the search page.
Impression: Number of times the ad was displayed to the user who issued the query.
Query: The words that appear in the user's query.
Title: The words that appear in the title of the ad.
Keywords: Selected keywords that specify the type of product.
Ad ID: Unique identifier assigned to each specific advertisement.
Advertiser ID: Unique identifier assigned to each specific advertiser.
Clicks: Number of times the user clicked on the ad.

Table 1: Components of the feature vector for prediction of advertisement click-through rates. For each feature class we report the total number of components in the feature vector as well as the maximum and average number of nonzero components.

Feature type       | Total components | Maximum nonzero       | Mean nonzero (observed)
Age                | 6                | 1 (structure)         | 1.0
Gender             | 3                | 1 (structure)         | 1.0
Impression         | 3                | 1 (structure)         | 1.0
Depth              | 3                | 1 (structure)         | 1.0
Position           | 3                | 1 (structure)         | 1.0
Query              | 20,000           | (observed)            | 3.0
Title              | 20,000           | (observed)            | 8.8
Keyword            | 20,000           | (observed)            | 2.1
Advertiser ID      | 5,184            | 1 (structure)         | 1.0
Advertisement ID   | 108,824          | 1 (structure)         | 1.0
Total              | 174,026          | 148 (observed)        | 20.9

From this information we create a set of feature vectors {x_i}_{i=1}^{N}, with corresponding labels y_i ∈ {−1, 1}. The label associated with feature vector x_i is y_i = 1 if the number of clicks on the ad is more than 0; otherwise the label is y_i = −1. We use a binary encoding for all the features in the vector x_i. For the age of the user we use the six age intervals (0, 12], (12, 18], (18, 24], (24, 30], (30, 40], and (40, ∞) to construct six indicator entries in x_i that take the value 1 if the age of the user is known to be in the corresponding interval. E.g., a 21 year old user has an age that falls in the third interval, which implies that we make [x_i]_3 = 1 and [x_i]_k = 0 for all other k between 1 and 6. If the age of the user is unknown we make [x_i]_k = 0 for all k between 1 and 6. For the gender of the visitors we use the next three components of x_i to indicate male, female, or unknown gender. For a male user we make [x_i]_7 = 1, for a female user [x_i]_8 = 1, and for visitors of unknown gender we make [x_i]_9 = 1. The next three components of x_i are used for the depth feature. If the number of advertisements displayed in the search page is 1 we make [x_i]_{10} = 1, if 2 different ads are shown we make [x_i]_{11} = 1, and for depths of 3 or more we make [x_i]_{12} = 1. To indicate the position of the ad in the search page we also use three components of x_i. We use [x_i]_{13} = 1, [x_i]_{14} = 1, and [x_i]_{15} = 1 to indicate that the ad is displayed in the first, second, and third position, respectively. Likewise, we use [x_i]_{16}, [x_i]_{17}, and [x_i]_{18} to indicate that the impression of the ad is 1, 2, or more than 3.

For the words that appear in the query we have in the order of 10^5 distinct words. To reduce the number of elements necessary for this encoding we create 20,000 bags of words through random hashing, with each bag containing 5 or 6 distinct words. Each of these bags is assigned an index k. For each of the words in the query we find the bag in which this word appears. If the word appears in the kth bag we indicate this occurrence by setting the (k + 18)th component of the feature vector to [x_i]_{k+18} = 1. Observe that since we use 20,000 bags, components 19 through 20,018 of x_i indicate the presence of specific words in the query. Further note that we may have more than one x_i component different from zero because there may be many words in the query, but the total number of nonzero elements is much smaller than 20,000. On average, 3.0 of these elements of the feature vector are nonzero.
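The encoding just described amounts to concatenating one-hot blocks with hashed bag-of-words indicators. The sketch below is an illustration only; the hash function, block offsets, and field names are assumptions, and it covers just the age and query blocks.

```python
import numpy as np

N_BAGS, OFFSET_QUERY = 20_000, 18           # query bags occupy components 19..20,018
AGE_BINS = [12, 18, 24, 30, 40]             # intervals (0,12], (12,18], (18,24], (24,30], (30,40], (40,inf)

def encode(age, query_words):
    """Sparse binary feature vector: returns the set of nonzero indices (0-based)."""
    idx = set()
    if age is not None:                     # age block: first six components
        k = sum(age > b for b in AGE_BINS)  # which interval the age falls in
        idx.add(k)
    for word in query_words:                # hashed bag-of-words block for the query
        bag = hash(word) % N_BAGS           # stand-in for the paper's random hashing into bags
        idx.add(OFFSET_QUERY + bag)
    return sorted(idx)

print(encode(21, ["running", "shoes"]))     # a 21 year old falls in the third age interval
```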

The same bags of words are used to encode the words that appear in the title of the ad and the product keywords. We encode the words that appear in the title of the ad by using the next 20,000 components of vector x_i, i.e., components 20,019 through 40,018. Components 40,019 through 60,018 are used to encode product keywords. As in the case of the words in the search, just a few of these components are nonzero. On average, the number of nonzero components of feature vectors that describe the title features is 8.8. For product keywords the average is 2.1. Since the number of distinct advertisers in the training set is 5,184 we use feature components 60,019 through 65,202 to encode this information. For the kth advertiser ID we set the (k + 60,018)th component of the feature vector to [x_i]_{k+60,018} = 1. Since the number of distinct advertisements is 108,824 we allocate the last 108,824 components of the feature vector to encode the ad ID. Observe that only one out of the 5,184 advertiser ID components and one of the 108,824 advertisement ID components are nonzero. In total, the length of the feature vector is 174,026, where each of the components is either 0 or 1. The vector is very sparse. We observe a maximum of 148 nonzero elements and an average of 20.9 nonzero elements in the training set; see Table 1. This is important because the cost of implementing inner products in the oLBFGS training of the logistic regressor that we introduce in the following section is proportional to the number of nonzero elements in x_i.

4.2 Logistic Regression of Click-Through Rate

We use the training set to estimate the CTR with a logistic regression. For that purpose let x ∈ R^n be a vector containing the features described in Section 4.1, w ∈ R^n a classifier that we want to train, and y ∈ {−1, 1} an indicator variable that takes the value y = 1 when the ad presented to the user is clicked and y = −1 when the ad is not clicked by the user. We hypothesize that the CTR, defined as the probability of observing y = 1, can be written as the logistic function

    CTR(x; w) := P[y = 1 | x; w] = 1/(1 + exp(−x^T w)).    (37)

We read (37) as stating that for a feature vector x the CTR is determined by the inner product x^T w through the given logistic transformation. Consider now the training set S = {(x_i, y_i)}_{i=1}^{N}, which contains N realizations of features x_i and respective click outcomes y_i, and further define the sets S_1 := {(x_i, y_i) ∈ S : y_i = 1} and S_{−1} := {(x_i, y_i) ∈ S : y_i = −1} containing clicked and unclicked advertisements, respectively. With the data given in S we define the optimal classifier w* as a maximum likelihood estimate (MLE) of w given the model in (37) and the training set S. This MLE can be found as the minimizer of the regularized log-likelihood loss

    w* := argmin_w (λ/2)‖w‖² + (1/N) Σ_{i=1}^{N} log(1 + exp(−y_i x_i^T w))
        = argmin_w (λ/2)‖w‖² + (1/N) [ Σ_{x_i ∈ S_1} log(1 + exp(−x_i^T w)) + Σ_{x_i ∈ S_{−1}} log(1 + exp(x_i^T w)) ],    (38)

where we have added the regularization term λ‖w‖²/2 to disincentivize large values in the weight vector w; see e.g., Ng (2004).
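For concreteness, a per-sample loss and gradient matching (38) are sketched below. This is an illustration: combined with the earlier mini-batch and oLBFGS sketches it yields a stochastic gradient of the regularized logistic objective, the regularization weight is a placeholder, and the dense-vector representation is a simplification of the sparse encoding described above.

```python
import numpy as np

lam = 1e-6   # regularization weight lambda (placeholder value)

def logistic_loss(w, sample):
    """Per-sample regularized loss: log(1 + exp(-y x'w)) + (lam/2)||w||^2, cf. (38)."""
    x, y = sample
    return np.logaddexp(0.0, -y * (x @ w)) + 0.5 * lam * (w @ w)

def logistic_grad(w, sample):
    """Gradient of the per-sample loss: -y x / (1 + exp(y x'w)) + lam w."""
    x, y = sample
    return -y * x / (1.0 + np.exp(y * (x @ w))) + lam * w

# Averaging logistic_grad over L samples gives exactly the stochastic gradient (9)
# that the oLBFGS iteration (10) uses for this problem.
```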

The practical use of (37) and (38) is as follows. We use the data collected in the training set S to determine the vector w* in (38). When a user issues a query, we concatenate the user and query specific elements of the feature vector with the ad specific elements of several candidate ads. We then proceed to display the advertisement with, say, the largest CTR. We can interpret the set S as having been acquired offline or online. In the former case we want to use a stochastic optimization algorithm because computing gradients is infeasible; recall that we are considering training samples with a number of elements N in the order of 10^6. The performance metric of interest in this case is the logistic cost as a function of computational time. If elements of S are acquired online, we update w whenever a new vector becomes available so as to adapt to changes in preferences. In this case we want to exploit the information in new samples as much as possible. The correct metric in this case is the logistic cost as a function of the number of feature vectors processed. We use the latter metric for the numerical experiments in the following section.

4.3 Numerical Results

Out of the 236 million searches in the Tencent data set we select 10^6 sample points to use as the training set S and 10^5 sample points to use as a test set T. To select elements of the training and test set we divide the first 1.1 × 10^6 sample points of the complete data set into 10^5 consecutive blocks with 11 elements. The first 10 elements of each block are assigned to the training set and the 11th element to the test set. To solve for the optimal classifier we implement SGD and oLBFGS by selecting feature vectors x_i at random from the training set S. In all of our numerical experiments the regularization parameter λ in (38) is kept at a fixed value. The step sizes for both algorithms are of the form ε_t = ε_0 T_0/(T_0 + t). We set ε_0 = 10^{-2} and T_0 = 10^4 for oLBFGS, and ε_0 = 10 and T_0 = 10^6 for SGD. For SGD the sample size in (9) is set to L = 20, whereas for oLBFGS it is set to L = 100. The values of the parameters ε_0, T_0, and L are chosen to yield the best convergence times in a rough parameter optimization search. Observe the relatively large values of L that are used to compute stochastic gradients. This is necessary due to the extreme sparsity of the feature vectors x_i, which contain an average of only 20.9 nonzero out of 174,026 elements. Even when considering L = 100 vectors, they are close to orthogonal. The size of the memory for oLBFGS is set to τ = 10. With L = 100 feature vectors of average sparsity 20.9 nonzero elements and memory τ = 10, the cost of each oLBFGS iteration remains small.

Figure 1: Negative log-likelihood value for oLBFGS and SGD after processing a given number of feature vectors. The accuracy of oLBFGS is better than that of SGD after processing the same number of feature vectors.

Figure 1 illustrates the convergence path of SGD and oLBFGS on the advertising training set. We depict the value of the log-likelihood objective in (38) evaluated at w = w_t, where w_t is the classifier iterate determined by SGD or oLBFGS. The horizontal axis is scaled by the number of feature vectors L that are used in the evaluation of stochastic gradients. This results in a plot of log-likelihood cost versus the number Lt of feature vectors processed. To read iteration indexes from Figure 1, divide the horizontal axis values by L = 100 for oLBFGS and L = 20 for SGD. The curvature correction of oLBFGS results in significant reductions in convergence time.


GMM - Generalized Method of Moments

GMM - Generalized Method of Moments GMM - Generalized Mehod of Momens Conens GMM esimaion, shor inroducion 2 GMM inuiion: Maching momens 2 3 General overview of GMM esimaion. 3 3. Weighing marix...........................................

More information

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization

A Forward-Backward Splitting Method with Component-wise Lazy Evaluation for Online Structured Convex Optimization A Forward-Backward Spliing Mehod wih Componen-wise Lazy Evaluaion for Online Srucured Convex Opimizaion Yukihiro Togari and Nobuo Yamashia March 28, 2016 Absrac: We consider large-scale opimizaion problems

More information

Final Spring 2007

Final Spring 2007 .615 Final Spring 7 Overview The purpose of he final exam is o calculae he MHD β limi in a high-bea oroidal okamak agains he dangerous n = 1 exernal ballooning-kink mode. Effecively, his corresponds o

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

STATE-SPACE MODELLING. A mass balance across the tank gives:

STATE-SPACE MODELLING. A mass balance across the tank gives: B. Lennox and N.F. Thornhill, 9, Sae Space Modelling, IChemE Process Managemen and Conrol Subjec Group Newsleer STE-SPACE MODELLING Inroducion: Over he pas decade or so here has been an ever increasing

More information

Chapter 2. First Order Scalar Equations

Chapter 2. First Order Scalar Equations Chaper. Firs Order Scalar Equaions We sar our sudy of differenial equaions in he same way he pioneers in his field did. We show paricular echniques o solve paricular ypes of firs order differenial equaions.

More information

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details!

Finish reading Chapter 2 of Spivak, rereading earlier sections as necessary. handout and fill in some missing details! MAT 257, Handou 6: Ocober 7-2, 20. I. Assignmen. Finish reading Chaper 2 of Spiva, rereading earlier secions as necessary. handou and fill in some missing deails! II. Higher derivaives. Also, read his

More information

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1.

Robotics I. April 11, The kinematics of a 3R spatial robot is specified by the Denavit-Hartenberg parameters in Tab. 1. Roboics I April 11, 017 Exercise 1 he kinemaics of a 3R spaial robo is specified by he Denavi-Harenberg parameers in ab 1 i α i d i a i θ i 1 π/ L 1 0 1 0 0 L 3 0 0 L 3 3 able 1: able of DH parameers of

More information

MATH 128A, SUMMER 2009, FINAL EXAM SOLUTION

MATH 128A, SUMMER 2009, FINAL EXAM SOLUTION MATH 28A, SUMME 2009, FINAL EXAM SOLUTION BENJAMIN JOHNSON () (8 poins) [Lagrange Inerpolaion] (a) (4 poins) Le f be a funcion defined a some real numbers x 0,..., x n. Give a defining equaion for he Lagrange

More information

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms

L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)

More information

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018

MATH 5720: Gradient Methods Hung Phan, UMass Lowell October 4, 2018 MATH 5720: Gradien Mehods Hung Phan, UMass Lowell Ocober 4, 208 Descen Direcion Mehods Consider he problem min { f(x) x R n}. The general descen direcions mehod is x k+ = x k + k d k where x k is he curren

More information

Some Basic Information about M-S-D Systems

Some Basic Information about M-S-D Systems Some Basic Informaion abou M-S-D Sysems 1 Inroducion We wan o give some summary of he facs concerning unforced (homogeneous) and forced (non-homogeneous) models for linear oscillaors governed by second-order,

More information

Two Coupled Oscillators / Normal Modes

Two Coupled Oscillators / Normal Modes Lecure 3 Phys 3750 Two Coupled Oscillaors / Normal Modes Overview and Moivaion: Today we ake a small, bu significan, sep owards wave moion. We will no ye observe waves, bu his sep is imporan in is own

More information

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t

Hamilton- J acobi Equation: Explicit Formulas In this lecture we try to apply the method of characteristics to the Hamilton-Jacobi equation: u t M ah 5 2 7 Fall 2 0 0 9 L ecure 1 0 O c. 7, 2 0 0 9 Hamilon- J acobi Equaion: Explici Formulas In his lecure we ry o apply he mehod of characerisics o he Hamilon-Jacobi equaion: u + H D u, x = 0 in R n

More information

Lecture 2 October ε-approximation of 2-player zero-sum games

Lecture 2 October ε-approximation of 2-player zero-sum games Opimizaion II Winer 009/10 Lecurer: Khaled Elbassioni Lecure Ocober 19 1 ε-approximaion of -player zero-sum games In his lecure we give a randomized ficiious play algorihm for obaining an approximae soluion

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

Linear Response Theory: The connection between QFT and experiments

Linear Response Theory: The connection between QFT and experiments Phys540.nb 39 3 Linear Response Theory: The connecion beween QFT and experimens 3.1. Basic conceps and ideas Q: How do we measure he conduciviy of a meal? A: we firs inroduce a weak elecric field E, and

More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

ELE 538B: Large-Scale Optimization for Data Science. Quasi-Newton methods. Yuxin Chen Princeton University, Spring 2018

ELE 538B: Large-Scale Optimization for Data Science. Quasi-Newton methods. Yuxin Chen Princeton University, Spring 2018 ELE 538B: Large-Scale Opimizaion for Daa Science Quasi-Newon mehods Yuxin Chen Princeon Universiy, Spring 208 00 op ff(x (x)(k)) f p 2 L µ f 05 k f (xk ) k f (xk ) =) f op ieraions converges in only 5

More information

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H.

ACE 562 Fall Lecture 5: The Simple Linear Regression Model: Sampling Properties of the Least Squares Estimators. by Professor Scott H. ACE 56 Fall 005 Lecure 5: he Simple Linear Regression Model: Sampling Properies of he Leas Squares Esimaors by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Inference in he Simple

More information

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter

State-Space Models. Initialization, Estimation and Smoothing of the Kalman Filter Sae-Space Models Iniializaion, Esimaion and Smoohing of he Kalman Filer Iniializaion of he Kalman Filer The Kalman filer shows how o updae pas predicors and he corresponding predicion error variances when

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

Online Appendix to Solution Methods for Models with Rare Disasters

Online Appendix to Solution Methods for Models with Rare Disasters Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,

More information

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Supplemen for Sochasic Convex Opimizaion: Faser Local Growh Implies Faser Global Convergence Yi Xu Qihang Lin ianbao Yang Proof of heorem heorem Suppose Assumpion holds and F (w) obeys he LGC (6) Given

More information

Hamilton- J acobi Equation: Weak S olution We continue the study of the Hamilton-Jacobi equation:

Hamilton- J acobi Equation: Weak S olution We continue the study of the Hamilton-Jacobi equation: M ah 5 7 Fall 9 L ecure O c. 4, 9 ) Hamilon- J acobi Equaion: Weak S oluion We coninue he sudy of he Hamilon-Jacobi equaion: We have shown ha u + H D u) = R n, ) ; u = g R n { = }. ). In general we canno

More information

Aryan Mokhtari, Wei Shi, Qing Ling, and Alejandro Ribeiro. cost function n

Aryan Mokhtari, Wei Shi, Qing Ling, and Alejandro Ribeiro. cost function n IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS, VOL. 2, NO. 4, DECEMBER 2016 507 A Decenralized Second-Order Mehod wih Exac Linear Convergence Rae for Consensus Opimizaion Aryan Mokhari,

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3

d 1 = c 1 b 2 - b 1 c 2 d 2 = c 1 b 3 - b 1 c 3 and d = c b - b c c d = c b - b c c This process is coninued unil he nh row has been compleed. The complee array of coefficiens is riangular. Noe ha in developing he array an enire row may be divided or

More information

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis

Speaker Adaptation Techniques For Continuous Speech Using Medium and Small Adaptation Data Sets. Constantinos Boulis Speaker Adapaion Techniques For Coninuous Speech Using Medium and Small Adapaion Daa Ses Consaninos Boulis Ouline of he Presenaion Inroducion o he speaker adapaion problem Maximum Likelihood Sochasic Transformaions

More information

Some Ramsey results for the n-cube

Some Ramsey results for the n-cube Some Ramsey resuls for he n-cube Ron Graham Universiy of California, San Diego Jozsef Solymosi Universiy of Briish Columbia, Vancouver, Canada Absrac In his noe we esablish a Ramsey-ype resul for cerain

More information

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds

Kriging Models Predicting Atrazine Concentrations in Surface Water Draining Agricultural Watersheds 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Kriging Models Predicing Arazine Concenraions in Surface Waer Draining Agriculural Waersheds Paul L. Mosquin, Jeremy Aldworh, Wenlin Chen Supplemenal Maerial Number

More information

15. Vector Valued Functions

15. Vector Valued Functions 1. Vecor Valued Funcions Up o his poin, we have presened vecors wih consan componens, for example, 1, and,,4. However, we can allow he componens of a vecor o be funcions of a common variable. For example,

More information

Let us start with a two dimensional case. We consider a vector ( x,

Let us start with a two dimensional case. We consider a vector ( x, Roaion marices We consider now roaion marices in wo and hree dimensions. We sar wih wo dimensions since wo dimensions are easier han hree o undersand, and one dimension is a lile oo simple. However, our

More information

Ordinary Differential Equations

Ordinary Differential Equations Ordinary Differenial Equaions 5. Examples of linear differenial equaions and heir applicaions We consider some examples of sysems of linear differenial equaions wih consan coefficiens y = a y +... + a

More information

R.#W.#Erickson# Department#of#Electrical,#Computer,#and#Energy#Engineering# University#of#Colorado,#Boulder#

R.#W.#Erickson# Department#of#Electrical,#Computer,#and#Energy#Engineering# University#of#Colorado,#Boulder# .#W.#Erickson# Deparmen#of#Elecrical,#Compuer,#and#Energy#Engineering# Universiy#of#Colorado,#Boulder# Chaper 2 Principles of Seady-Sae Converer Analysis 2.1. Inroducion 2.2. Inducor vol-second balance,

More information

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010

Simulation-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Simulaion-Solving Dynamic Models ABE 5646 Week 2, Spring 2010 Week Descripion Reading Maerial 2 Compuer Simulaion of Dynamic Models Finie Difference, coninuous saes, discree ime Simple Mehods Euler Trapezoid

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

References are appeared in the last slide. Last update: (1393/08/19)

References are appeared in the last slide. Last update: (1393/08/19) SYSEM IDEIFICAIO Ali Karimpour Associae Professor Ferdowsi Universi of Mashhad References are appeared in he las slide. Las updae: 0..204 393/08/9 Lecure 5 lecure 5 Parameer Esimaion Mehods opics o be

More information

EKF SLAM vs. FastSLAM A Comparison

EKF SLAM vs. FastSLAM A Comparison vs. A Comparison Michael Calonder, Compuer Vision Lab Swiss Federal Insiue of Technology, Lausanne EPFL) michael.calonder@epfl.ch The wo algorihms are described wih a planar robo applicaion in mind. Generalizaion

More information

Notes on online convex optimization

Notes on online convex optimization Noes on online convex opimizaion Karl Sraos Online convex opimizaion (OCO) is a principled framework for online learning: OnlineConvexOpimizaion Inpu: convex se S, number of seps T For =, 2,..., T : Selec

More information

Cash Flow Valuation Mode Lin Discrete Time

Cash Flow Valuation Mode Lin Discrete Time IOSR Journal of Mahemaics (IOSR-JM) e-issn: 2278-5728,p-ISSN: 2319-765X, 6, Issue 6 (May. - Jun. 2013), PP 35-41 Cash Flow Valuaion Mode Lin Discree Time Olayiwola. M. A. and Oni, N. O. Deparmen of Mahemaics

More information

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1

SZG Macro 2011 Lecture 3: Dynamic Programming. SZG macro 2011 lecture 3 1 SZG Macro 2011 Lecure 3: Dynamic Programming SZG macro 2011 lecure 3 1 Background Our previous discussion of opimal consumpion over ime and of opimal capial accumulaion sugges sudying he general decision

More information

4.1 Other Interpretations of Ridge Regression

4.1 Other Interpretations of Ridge Regression CHAPTER 4 FURTHER RIDGE THEORY 4. Oher Inerpreaions of Ridge Regression In his secion we will presen hree inerpreaions for he use of ridge regression. The firs one is analogous o Hoerl and Kennard reasoning

More information

Outline. lse-logo. Outline. Outline. 1 Wald Test. 2 The Likelihood Ratio Test. 3 Lagrange Multiplier Tests

Outline. lse-logo. Outline. Outline. 1 Wald Test. 2 The Likelihood Ratio Test. 3 Lagrange Multiplier Tests Ouline Ouline Hypohesis Tes wihin he Maximum Likelihood Framework There are hree main frequenis approaches o inference wihin he Maximum Likelihood framework: he Wald es, he Likelihood Raio es and he Lagrange

More information

How to Deal with Structural Breaks in Practical Cointegration Analysis

How to Deal with Structural Breaks in Practical Cointegration Analysis How o Deal wih Srucural Breaks in Pracical Coinegraion Analysis Roselyne Joyeux * School of Economic and Financial Sudies Macquarie Universiy December 00 ABSTRACT In his noe we consider he reamen of srucural

More information

Lecture 9: September 25

Lecture 9: September 25 0-725: Opimizaion Fall 202 Lecure 9: Sepember 25 Lecurer: Geoff Gordon/Ryan Tibshirani Scribes: Xuezhi Wang, Subhodeep Moira, Abhimanu Kumar Noe: LaTeX emplae couresy of UC Berkeley EECS dep. Disclaimer:

More information

Longest Common Prefixes

Longest Common Prefixes Longes Common Prefixes The sandard ordering for srings is he lexicographical order. I is induced by an order over he alphabe. We will use he same symbols (,

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015

Explaining Total Factor Productivity. Ulrich Kohli University of Geneva December 2015 Explaining Toal Facor Produciviy Ulrich Kohli Universiy of Geneva December 2015 Needed: A Theory of Toal Facor Produciviy Edward C. Presco (1998) 2 1. Inroducion Toal Facor Produciviy (TFP) has become

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED

0.1 MAXIMUM LIKELIHOOD ESTIMATION EXPLAINED 0.1 MAXIMUM LIKELIHOOD ESTIMATIO EXPLAIED Maximum likelihood esimaion is a bes-fi saisical mehod for he esimaion of he values of he parameers of a sysem, based on a se of observaions of a random variable

More information

Differential Equations

Differential Equations Mah 21 (Fall 29) Differenial Equaions Soluion #3 1. Find he paricular soluion of he following differenial equaion by variaion of parameer (a) y + y = csc (b) 2 y + y y = ln, > Soluion: (a) The corresponding

More information

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006

2.160 System Identification, Estimation, and Learning. Lecture Notes No. 8. March 6, 2006 2.160 Sysem Idenificaion, Esimaion, and Learning Lecure Noes No. 8 March 6, 2006 4.9 Eended Kalman Filer In many pracical problems, he process dynamics are nonlinear. w Process Dynamics v y u Model (Linearized)

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE

EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE Version April 30, 2004.Submied o CTU Repors. EXPLICIT TIME INTEGRATORS FOR NONLINEAR DYNAMICS DERIVED FROM THE MIDPOINT RULE Per Krysl Universiy of California, San Diego La Jolla, California 92093-0085,

More information

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin

ACE 562 Fall Lecture 4: Simple Linear Regression Model: Specification and Estimation. by Professor Scott H. Irwin ACE 56 Fall 005 Lecure 4: Simple Linear Regression Model: Specificaion and Esimaion by Professor Sco H. Irwin Required Reading: Griffihs, Hill and Judge. "Simple Regression: Economic and Saisical Model

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

4.6 One Dimensional Kinematics and Integration

4.6 One Dimensional Kinematics and Integration 4.6 One Dimensional Kinemaics and Inegraion When he acceleraion a( of an objec is a non-consan funcion of ime, we would like o deermine he ime dependence of he posiion funcion x( and he x -componen of

More information

Expert Advice for Amateurs

Expert Advice for Amateurs Exper Advice for Amaeurs Ernes K. Lai Online Appendix - Exisence of Equilibria The analysis in his secion is performed under more general payoff funcions. Wihou aking an explici form, he payoffs of he

More information

Announcements: Warm-up Exercise:

Announcements: Warm-up Exercise: Fri Apr 13 7.1 Sysems of differenial equaions - o model muli-componen sysems via comparmenal analysis hp//en.wikipedia.org/wiki/muli-comparmen_model Announcemens Warm-up Exercise Here's a relaively simple

More information

Chapter 2. Models, Censoring, and Likelihood for Failure-Time Data

Chapter 2. Models, Censoring, and Likelihood for Failure-Time Data Chaper 2 Models, Censoring, and Likelihood for Failure-Time Daa William Q. Meeker and Luis A. Escobar Iowa Sae Universiy and Louisiana Sae Universiy Copyrigh 1998-2008 W. Q. Meeker and L. A. Escobar. Based

More information

14 Autoregressive Moving Average Models

14 Autoregressive Moving Average Models 14 Auoregressive Moving Average Models In his chaper an imporan parameric family of saionary ime series is inroduced, he family of he auoregressive moving average, or ARMA, processes. For a large class

More information

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n

Module 2 F c i k c s la l w a s o s f dif di fusi s o i n Module Fick s laws of diffusion Fick s laws of diffusion and hin film soluion Adolf Fick (1855) proposed: d J α d d d J (mole/m s) flu (m /s) diffusion coefficien and (mole/m 3 ) concenraion of ions, aoms

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

BU Macro BU Macro Fall 2008, Lecture 4

BU Macro BU Macro Fall 2008, Lecture 4 Dynamic Programming BU Macro 2008 Lecure 4 1 Ouline 1. Cerainy opimizaion problem used o illusrae: a. Resricions on exogenous variables b. Value funcion c. Policy funcion d. The Bellman equaion and an

More information

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power

Learning a Class from Examples. Training set X. Class C 1. Class C of a family car. Output: Input representation: x 1 : price, x 2 : engine power Alpaydin Chaper, Michell Chaper 7 Alpaydin slides are in urquoise. Ehem Alpaydin, copyrigh: The MIT Press, 010. alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/ ehem/imle All oher slides are based on Michell.

More information

THE MATRIX-TREE THEOREM

THE MATRIX-TREE THEOREM THE MATRIX-TREE THEOREM 1 The Marix-Tree Theorem. The Marix-Tree Theorem is a formula for he number of spanning rees of a graph in erms of he deerminan of a cerain marix. We begin wih he necessary graph-heoreical

More information

Appendix to Creating Work Breaks From Available Idleness

Appendix to Creating Work Breaks From Available Idleness Appendix o Creaing Work Breaks From Available Idleness Xu Sun and Ward Whi Deparmen of Indusrial Engineering and Operaions Research, Columbia Universiy, New York, NY, 127; {xs2235,ww24}@columbia.edu Sepember

More information