Revisiting Projection-Free Optimization for Strongly Convex Constraint Sets


Jarrid Rector-Brooks
University of Michigan, Ann Arbor
2260 Hayward St, Ann Arbor, MI 48104
jrectorb@umich.edu

Jun-Kun Wang
Georgia Institute of Technology
226 Ferst Drive NW, Atlanta, GA
jimwang@gatech.edu *

Barzan Mozafari
University of Michigan, Ann Arbor
2260 Hayward St, Ann Arbor, MI 48104
mozafari@umich.edu

Abstract

We revisit Frank-Wolfe (FW) optimization under strongly convex constraint sets. We provide a faster convergence rate for FW without line search, showing that a previously overlooked variant of FW is indeed faster than the standard variant. With line search, we show that FW can converge to the global optimum, even for smooth functions that are not convex, but are quasi-convex and locally-Lipschitz. We also show that, for the general case of (smooth) non-convex functions, FW with line search converges with high probability to a stationary point at a rate of O(1/t), as long as the constraint set is strongly convex -- one of the fastest convergence rates in non-convex optimization.

Introduction

A popular family of optimization algorithms are the so-called gradient descent algorithms: iterative algorithms comprised of a gradient descent step at each iteration, followed by a projection step when there is a feasibility constraint. The purpose of the projection is to ensure that the update vector remains within the feasible set. In many cases, however, the projection step may have no closed form and thus requires solving another optimization problem itself (e.g., for l_1.5 norm balls or matroid polytopes (Hazan and others 2016; Hazan and Kale 2012)), the closed form may exist but involve an expensive computation (e.g., the SVD of the model matrix for Schatten-1, Schatten-2, and Schatten-∞ norm balls (Hazan and others 2016)), or there may simply be no method available for computing the projection in general (e.g., the convex hull of rotation matrices (Hazan, Kale, and Warmuth 2010), which arises as a constraint set in online learning settings (Hazan, Kale, and Warmuth 2010)). In these scenarios, each iteration of gradient descent may require many inner iterations to compute the projection (Jaggi, Sulovský, and others 2010; Lacoste-Julien and Jaggi 2015; Hazan and Kale 2012). This makes the projection step quite costly, and it can account for much of the execution time of each iteration (e.g., see our technical report (Rector-Brooks, Wang, and Mozafari 2018)).

* This work was performed while a student at the University of Michigan, Ann Arbor.
Copyright 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Frank-Wolfe (FW) optimization. In this paper, we focus on FW approaches, also known as projection-free or conditional gradient algorithms (Frank and Wolfe 1956). Unlike gradient descent, these algorithms avoid the projection step altogether by ensuring that the update vector always lies within the feasible set. At each iteration, FW solves a linear program over the constraint set. Since linear programs have closed-form solutions for most constraint sets, each iteration of FW is, in many cases, more cost effective than conducting a gradient descent step and then projecting it back onto the constraint set (Jaggi 2013; Hazan and Kale 2012; Hazan and others 2016). Another main advantage of FW is the sparsity of its solution. Since the solution of a linear program is always a vertex (i.e., extreme point) of the feasible set (when the set itself is convex), each iteration of FW can add, at most, one new vertex to the solution vector. Thus, at iteration t, the solution is a combination of, at most, t + 1 vertices of the feasible set, thereby guaranteeing the sparsity of the eventual solution (Clarkson 2010; Jaggi 2013; 2011).
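To make the linear-oracle step concrete, here is a minimal sketch (our own illustration, not code from the paper) of the closed-form solutions of the FW linear program for two common norm balls; the function names lmo_l1_ball and lmo_l2_ball are ours. For the l_1 ball the minimizer is a single signed, scaled coordinate vector, which is exactly why each FW iteration adds at most one new vertex to the solution.

    import numpy as np

    def lmo_l1_ball(grad, r=1.0):
        # argmin_{||v||_1 <= r} <v, grad>: a single signed vertex of the l1 ball.
        i = int(np.argmax(np.abs(grad)))   # coordinate with the largest gradient magnitude
        v = np.zeros_like(grad)
        v[i] = -r * np.sign(grad[i])       # move opposite to the gradient on that coordinate
        return v

    def lmo_l2_ball(grad, r=1.0):
        # argmin_{||v||_2 <= r} <v, grad>: the boundary point opposite the gradient.
        return -r * grad / (np.linalg.norm(grad) + 1e-12)

By contrast, projecting an arbitrary point onto, say, an l_1.5 or Schatten norm ball has no comparably cheap closed form, which is the cost the projection-free approach avoids.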
For these reasons, FW optimization has drawn growing interest in recent years, especially in matrix completion, structural SVM, computer vision, sparse PCA, metric learning, and many other settings (Jaggi, Sulovský, and others 2010; Lacoste-Julien et al. 2013; Osokin et al. 2016; Wang et al. 2016; Chari et al. 2015; Harchaoui et al. 2012; Hazan and Kale 2012; Shalev-Shwartz, Gonen, and Shamir 2011). Unfortunately, while faster in each iteration, standard FW requires many more iterations to converge than gradient descent, and is therefore slower overall. This is because FW's convergence rate is typically O(1/t), while that of (accelerated) gradient descent is O(1/t^2), where t is the number of iterations (Jaggi 2013).

We make several contributions (summarized in Table 1):

1. We revisit a non-conventional variant of FW optimization, called Primal Averaging (PA) (Lan 2013), which has been largely neglected in the past, as it was believed to have the same convergence rate as FW without line search while incurring extra computations (i.e., a matrix averaging step) at each iteration. However, we discover that, when the constraint set is strongly convex, this non-conventional variant enjoys a much faster convergence rate with high probability, O(1/t^2) versus O(1/t), which more than compensates for its slightly more expensive iterations. This surprising result has important ramifications in practice, as many classification, regression, multi-task learning, and collaborative filtering tasks rely on norm constraints that are strongly convex, e.g., generalized linear models with l_p norms, squared loss regression with l_p norms, multi-task learning with group matrix norms, and matrix completion with Schatten norms (Kim and Xing 2010; Garber and Hazan 2015; Hazan and others 2016).

2. While previous work on FW optimization has generally focused on convex functions, we show that FW with line search can converge to the global optimum even for smooth functions that are not convex, but are quasi-convex and locally-Lipschitz.

3. We also study the general case of (smooth) non-convex functions, showing that FW with line search can converge to a stationary point at a rate of O(1/t) with high probability, as long as the constraint set is strongly convex. To the best of our knowledge, such a fast convergence rate does not appear elsewhere in the non-convex optimization literature. (Without any assumptions, converging to local optima for continuous non-convex functions is NP-hard (Carmon et al. 2017; Agarwal et al. 2016).)

4. Finally, we conduct extensive experiments on various benchmark datasets, empirically validating our theoretical results and comparing the actual performance of various FW variants in practice.

Table 1: Our contributions compared to the state-of-the-art results for projection-free optimization. Here, t is the number of iterations. For non-convex functions, convergence is defined in terms of a stationary point instead of a global minimum. Note that although our bound is probabilistic for convex loss functions, we use no additional assumptions on the loss function and do not require line search, which can be a costly operation for big data (see Section 2).

Convex loss function:
- This paper: no additional assumption on the loss; strongly convex constraint set; O(1/t^2) with high probability; no line search.
- State of the art (Jaggi 2013): no additional assumption; convex set; O(1/t); no line search.
- (Garber and Hazan 2015): strongly convex loss; strongly convex set; O(1/t^2); requires line search.
- (Lacoste-Julien and Jaggi 2015): strongly convex loss; polytope; O(exp(-t)); requires line search.
- (Levitin and Polyak 1966; Demyanov and Rubinov 1970; Dunn 1979): norm of the gradient lower bounded; strongly convex set; O(exp(-t)); no line search.
- (Beck and Teboulle 2004): f(x) = ||Ax - b||_2^2; convex set; O(exp(-t)); no line search.

Quasi-convex loss function:
- This paper: locally-Lipschitz, norm of the gradient lower bounded; strongly convex set; O(min(1/t^{1/3}, 1/t^{1/2})); requires line search.
- State of the art: does not exist.

Non-convex loss function:
- This paper: no additional assumption; strongly convex set; O(1/t) with high probability; requires line search.
- State of the art (Lacoste-Julien 2016): no additional assumption; convex set; O(1/t^{1/2}); no line search.

2 Related Work

Table 1 compares the state of the art on projection-free optimization to our contributions.

Convex optimization. Garber and Hazan (2015) show that for strongly convex and smooth loss functions, FW with line search achieves a convergence rate of O(1/t^2) over strongly convex sets. In contrast, we do not need the loss function to be strongly convex. Further, they require an exact line search at each iteration to achieve this convergence rate. Line search, however, comes with significant downsides. An exact line search solves the problem min_{γ ∈ [0,1]} f(x + γv) for loss function f, solution vector x ∈ R^n, and descent direction v ∈ R^n. There are several methods for solving this optimization, and choosing the best method is often difficult for practitioners (e.g., bracketing line searches versus interpolation ones).
Moreover, at best, these methods converge to the minimum at a rate of O(1/t^2) (Sun and Yuan 2006). Approximate line searches require fewer iterations. However, in using them, one loses most of the theoretical guarantees provided in previous work, including that of (Garber and Hazan 2015). Nonetheless, both exact and inexact line searches involve at least one evaluation of the loss function or one of its derivatives, which can be quite prohibitive for large datasets (see Section 7.2). This is because the underlying

function for data modeling is typically in the form of a finite sum (e.g., a regression loss) over all the data. In comparison, Primal Averaging, which we study and promote, does not require a line search and works with a predefined step size. Notably, this allows PA to considerably outperform FW with line search (see Section 7.2).

Prior work (Levitin and Polyak 1966; Demyanov and Rubinov 1970; Dunn 1979) shows that standard FW without line search for smooth functions can achieve an exponential convergence rate, by making the strict assumption that the gradient is lower bounded everywhere in the feasible set. In our analysis of PA, however, we do not assume the gradient is lower bounded everywhere, allowing our result to be more widely applicable.

Quasi-convex optimization. Hazan et al. study quasi-convex and locally-Lipschitz loss functions that admit some saddle points (Hazan, Levy, and Shalev-Shwartz 2015). One of the optimization algorithms for this class of functions is the so-called normalized gradient descent, which converges to an ε-neighborhood of the global minimum. The analysis in (Hazan, Levy, and Shalev-Shwartz 2015) is for unconstrained optimization. In this paper, we analyze FW for the same class of functions, but with strongly convex constraint sets. Interestingly, when the constraint set is an l_2 ball, FW becomes equivalent to normalized gradient descent. In this paper, we both 1) show that FW can converge to a neighborhood of a global minimum, and 2) derive a convergence rate. (Dunn 1979) extends the analysis of FW to a class of quasi-convex functions of the form f(w) := g(h(w)), where h is differentiable and monotonically increasing, and g is a smooth function. Such functions are quite rare in machine learning. In contrast, we study a much more general class of quasi-convex functions, including several popular models (e.g., generalized linear models with a sigmoid loss).

Non-convex optimization. While there has been a surge of research on non-convex optimization in recent years (Carmon et al. 2017; Ge et al. 2015; Agarwal et al. 2016; Lee et al. 2016; Lacoste-Julien 2016), nearly all of it has focused on unconstrained optimization. To our knowledge, there are only a few exceptions (Lacoste-Julien 2016; Ghadimi and Lan 2016; Ge et al. 2015; Reddi et al. 2016). (Lacoste-Julien 2016) proves that FW for smooth non-convex functions converges to a stationary point at a rate of O(1/t^{1/2}), which matches the rate of projected gradient descent. (Reddi et al. 2016) extends this and considers a stochastic version of FW for smooth non-convex functions. Furthermore, Theorem 7 of (Yu, Zhang, and Schuurmans 2014) provides a convergence rate for non-convex optimization using FW, which is slower than O(1/t^{1/2}). We show in this paper that, for strongly convex sets, FW converges to a stationary point with high probability much faster: O(1/t).

3 Background

3.1 Preliminaries

Strongly convex constraint sets are quite common in machine learning. For example, when p ∈ (1, 2], l_p balls {u ∈ R^n : ||u||_p ≤ r} and Schatten-p balls {X ∈ R^{m×n} : ||X||_{S_p} ≤ r} are all strongly convex (Garber and Hazan 2015), where ||X||_{S_p} = (Σ_{i=1}^{min(m,n)} σ_i(X)^p)^{1/p} is the Schatten-p norm and σ_i(X) is the i-th largest singular value of X. Group l_{p,q} balls, used in multi-task learning (Garber and Hazan 2015; Kim and Xing 2010), are also strongly convex when p, q ∈ (1, 2]. In this paper, we use the following definitions.

Definition 1 (Strongly convex set). A convex set Ω ⊆ R^d is an α-strongly convex set with respect to a norm ||·|| if, for any u, v ∈ Ω and any θ ∈ [0, 1], the ball induced by ||·|| that is centered at θu + (1 − θ)v with radius θ(1 − θ)(α/2)||u − v||^2 is also included in Ω.

Definition 2 (Quasi-convex functions).
A function f : R^d → R is quasi-convex if, for all u, v ∈ R^d such that f(u) ≤ f(v), it follows that ⟨∇f(v), u − v⟩ ≤ 0, where ⟨·, ·⟩ is the standard inner product.

Definition 3 (Strictly-quasi-convex functions). A function f : R^d → R is strictly-quasi-convex if it is quasi-convex and its gradients only vanish at the global minimum. That is, for all u ∈ R^d, it follows that f(u) > f(u*) implies ∇f(u) ≠ 0, where u* is the global minimum.

Definition 4 (Strictly-locally-quasi-convex functions). Let u, v ∈ R^d and κ, ε > 0. Further, write B_r(x) for the Euclidean norm ball centered at x of radius r, where x ∈ R^d and r ∈ R. We say f : R^d → R is (ε, κ, v)-strictly-locally-quasi-convex in u if at least one of the following applies:
1. f(u) − f(v) ≤ ε.
2. ||∇f(u)|| > 0, and for every y ∈ B_{ε/κ}(v) it holds that ⟨∇f(u), y − u⟩ ≤ 0.

3.2 A Brief Overview of Frank-Wolfe (FW)

The Frank-Wolfe (FW) algorithm (Algorithm 1) attempts to solve the constrained optimization problem min_{x ∈ Ω} f(x) for some convex constraint set Ω (a.k.a. feasible set) and some function f : Ω → R. FW begins with an initial solution w_0 ∈ Ω. Then, at each iteration, it computes a search direction v_t by minimizing the linear approximation of f at w_t: v_t = argmin_{v ∈ Ω} ⟨v, ∇f(w_t)⟩, where ∇f(w_t) is the gradient of f at w_t. Next, FW produces a convex combination of the current iterate w_t and the search direction v_t to find the next iterate w_{t+1} = (1 − γ_t)w_t + γ_t v_t, where γ_t ∈ [0, 1] is the learning rate for the current iteration. There are a number of ways to choose the learning rate γ_t. Chief among these are setting γ_t = 2/(t + 1) (Algorithm 1, option A) or finding γ_t via line search (Algorithm 1, option B).

4 Faster Convergence Rate for Smooth Convex Functions

4.1 Primal Averaging (PA)

PA (Lan 2013) (Algorithm 2) is a variant of FW that operates in a style similar to Nesterov's acceleration method. PA maintains three sequences, (z_t)_{t=1,2,...}, (v_t)_{t=1,2,...}, and (w_t)_{t=1,2,...}. The first is the accelerating sequence (as in Nesterov acceleration), the second is the sequence of search directions, and the third is the sequence of solution vectors. At each iteration, PA updates its sequences by computing two

convex combinations and consulting the linear oracle, such that

z_t = (1 − γ_t)w_{t−1} + γ_t v_{t−1},
v_t = argmin_{v ∈ Ω} ⟨(1/Θ_t) Σ_{i=1}^{t} θ_i ∇f(z_i), v⟩,
w_t = (1 − γ_t)w_{t−1} + γ_t v_t,

where Θ_t = Σ_{i=1}^{t} θ_i and the θ_i are chosen such that γ_t = θ_t/Θ_t. Note that choosing θ_t does not require significant computation, as setting θ_t = t satisfies the requirement γ_t = θ_t/Θ_t for all t. (If θ_t = t, then θ_t/Θ_t = t / Σ_{i=1}^{t} i = t / (t(t + 1)/2) = 2/(t + 1) = γ_t.) Since z_t and w_t are convex combinations of elements of the constraint set Ω, z_t and w_t are themselves in Ω. While the input to the linear oracle is a single gradient vector in standard FW, PA uses an average of the gradients seen in iterations 1, 2, ..., t as the input to the linear oracle.

Algorithm 1 Standard Frank-Wolfe algorithm
1: Input: loss f : Ω → R.
2: Input: linear optimization oracle O(·) for Ω.
3: Initialize: any w_1 ∈ Ω.
4: for t = 1, 2, 3, ... do
5:   v_t ← O(∇f(w_t)) = argmin_{v ∈ Ω} ⟨v, ∇f(w_t)⟩.
6:   Option (A): predefined decaying learning rate {γ_t ∈ [0, 1]}_{t=1,2,...}.
7:   Option (B): γ_t = argmin_{γ ∈ [0,1]} γ⟨v_t − w_t, ∇f(w_t)⟩ + (γ^2 L / 2)||v_t − w_t||^2.
8:   w_{t+1} ← (1 − γ_t)w_t + γ_t v_t.
9: end for

Algorithm 2 Primal Averaging
1: Initialize: any v_0 ∈ Ω ⊆ R^d. Set w_0 = v_0.
2: for t = 1, 2, 3, ... do
3:   γ_t = 2/(t + 1).
4:   z_t = (1 − γ_t)w_{t−1} + γ_t v_{t−1}.
5:   Option (A): p_t = Σ_{i=1}^{t} (θ_i/Θ_t) ∇f(z_i), where Θ_t = Σ_{i=1}^{t} θ_i, θ_t = t, and θ_t/Θ_t = γ_t.
6:   Option (B): p_t = ∇f(z_t).
7:   v_t = argmin_{v ∈ Ω} ⟨v, p_t⟩.
8:   w_t = (1 − γ_t)w_{t−1} + γ_t v_t.
9: end for

In standard FW, the sequence (w_t)_{t=1,2,...} has the following property (Jaggi 2013; Lan 2013; Hazan and others 2016):

f(w_t) − f(w*) ≤ (2L / (t(t + 1))) Σ_{i=1}^{t} ||v_i − w_i||^2,    (1)

where w* is an optimal point and L is the smoothness parameter of f. We observe that the (1/t) Σ_{i=1}^{t} ||v_i − w_i||^2 factor of (1) is the average squared distance between the search direction and solution vector pairs. Denote the diameter D of Ω as D = sup_{u,v ∈ Ω} ||u − v||. Then, since w_i and v_i are both in Ω, we find that (1/t) Σ_{i=1}^{t} ||v_i − w_i||^2 ≤ D^2. That is, the average squared distance between v_i and w_i is upper bounded by the diameter D of Ω. Combining this with (1) yields standard FW's convergence rate:

f(w_t) − f(w*) ≤ (2L / (t(t + 1))) Σ_{i=1}^{t} ||v_i − w_i||^2 ≤ 2LD^2 / (t + 1) = O(1/t).    (2)

PA has a similar guarantee for the sequence (w_t)_{t=1,2,...} (Lan 2013). Namely,

f(w_t) − f(w*) ≤ (2L / (t(t + 1))) Σ_{i=1}^{t} ||v_i − v_{i−1}||^2.    (3)

While the inability to guarantee an arbitrarily small distance between v_i and w_i in Equation 1 causes standard FW to converge as O(1/t), this is not the case for the distance between v_i and v_{i−1} in Equation 3. Should we be able to bound the distance ||v_i − v_{i−1}|| to be arbitrarily small, we can show that PA converges as O(1/t^2) with high probability. We observe that the sequence (v_t)_{t=1,2,...} exhibits this behavior when the constraint set is strongly convex. We have the following theorem. (All omitted proofs can be found in our technical report (Rector-Brooks, Wang, and Mozafari 2018).)

Theorem 1. Assume the convex function f is smooth with parameter L. Further, define the function h as h(w) = f(w) + θξ^T w, where θ ∈ (0, ε/(4D)], ξ ∈ R^d, w ∈ Ω, Ω is an α-strongly convex set, D is the diameter of Ω, and ξ is uniform on the unit sphere. Applying PA to h yields the following convergence rate for f with probability 1 − δ:

f(w_t) − f(w*) = O( dL / (α^2 δ^2 t^2) ).
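To make Algorithm 2 and the perturbation used in Theorem 1 concrete, here is a minimal Python sketch (our own illustration, not code from the paper). It assumes the constraint set is an l_2 ball of radius r, so the linear oracle has the simple closed form below; the function names primal_averaging and lmo_l2_ball and the default arguments are illustrative assumptions.

    import numpy as np

    def lmo_l2_ball(p, r=1.0):
        # argmin_{||v||_2 <= r} <v, p>
        return -r * p / (np.linalg.norm(p) + 1e-12)

    def primal_averaging(grad_f, d, T, r=1.0, theta_pert=0.0, rng=None):
        # Algorithm 2, option (A), over an l_2 ball of radius r.
        # grad_f: callable returning the gradient of f at a point.
        # theta_pert > 0 adds the random linear perturbation h(w) = f(w) + theta * <xi, w>.
        rng = np.random.default_rng() if rng is None else rng
        xi = rng.normal(size=d)
        xi /= np.linalg.norm(xi)                  # xi uniform on the unit sphere
        w = np.zeros(d)                           # w_0 = v_0, both feasible
        v = np.zeros(d)
        grad_sum = np.zeros(d)                    # running sum of theta_i * grad h(z_i)
        Theta = 0.0
        for t in range(1, T + 1):
            gamma = 2.0 / (t + 1)
            z = (1 - gamma) * w + gamma * v       # accelerating sequence z_t
            g = grad_f(z) + theta_pert * xi       # gradient of the perturbed function h
            theta_t = float(t)                    # theta_t = t gives theta_t / Theta_t = gamma_t
            Theta += theta_t
            grad_sum += theta_t * g
            v = lmo_l2_ball(grad_sum / Theta, r)  # linear oracle on the averaged gradient p_t
            w = (1 - gamma) * w + gamma * v       # solution sequence w_t stays in the ball
        return w

    # Example (illustrative): least squares over the unit l_2 ball.
    # A = np.random.randn(200, 20); b = np.random.randn(200)
    # w = primal_averaging(lambda w: A.T @ (A @ w - b), d=20, T=1000, theta_pert=1e-3)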

Theorem 1 states that applying PA to a perturbed function h over an α-strongly convex constraint set allows any smooth, convex function f to converge as O(1/t^2) with probability 1 − δ, albeit with a dependence on δ and d. However, as t grows, the t^2 term in the convergence rate's denominator quickly dominates the rate's δ and d terms. This, combined with PA's non-reliance on line search, allows it to outperform the method proposed in (Garber and Hazan 2015). We note that, although Theorem 1 requires us to run PA on the perturbed function h, f itself still converges as O(1/t^2) with high probability. That is, the iterates w_t produced by running PA on h themselves satisfy f(w_t) − f(w*) = O( dL / (α^2 δ^2 t^2) ) for w* = argmin_{w ∈ Ω} f(w), with probability 1 − δ. We also empirically investigate this result in Section 7.1.

4.2 Stochastic Primal Averaging (SPA)

Here we provide a stochastic version of Primal Averaging. While in the previous section we studied PA with Option (A) of Algorithm 2, we now consider PA with Option (B) of Algorithm 2, providing an analysis of its stochastic version. That is, p_t = ∇̃f(z_t), where ∇̃f represents the aggregated stochastic gradient constructed as ∇̃f(z_t) = (1/|S_t|) Σ_{i ∈ S_t} ∇̂f_i(z_t). Further, ∇̂f_i(·) is the stochastic gradient computed with the i-th item of a dataset of size N, while S_t is the set of indices sampled without replacement from {1, 2, ..., N} at iteration t. We note that |S_t| = min(t^4, N).

Theorem 2. Assume the convex function f is smooth with parameter L. Denote by σ the variance of a stochastic gradient. Suppose p_t = ∇̃f(z_t) and the number of samples used to obtain p_t is n_t = O(t^4). Further, define the function h as h(w) = f(w) + θξ^T w, where θ ∈ (0, ε/(4D)], ξ ∈ R^d, w ∈ Ω, Ω is an α-strongly convex set, D is the diameter of Ω, and ξ is uniform on the unit sphere. Then applying PA to h yields the following convergence rate for f with probability 1 − δ:

E[f(w_t)] − f(w*) = O( dL^2 (D^2 + σ) log t / (α^2 δ^2 t^2) ).

Theorem 2 states that the stochastic version of PA maintains an O(log t / t^2) convergence rate with high probability, using h in a manner similar to Theorem 1. Note that n_t grows as O(t^4) until the algorithm begins to use all the data points to compute the gradient. Thus, for earlier iterations of SPA, the algorithm requires far less computation than its deterministic counterpart. However, the number of samples required in each iteration grows quickly, causing later iterations of SPA to share the same computational cost as deterministic Primal Averaging.

5 Strictly-Locally-Quasi-Convex Functions

In this section we show that FW with line search can converge to within an ε-neighborhood of the global minimum for strictly-locally-quasi-convex functions. Furthermore, if it is assumed that the norm of the gradient is lower bounded, then FW with line search can converge to within an ε-neighborhood of the global minimum in O(max(1/ε^2, 1/ε^3)) iterations.

Theorem 3. Assume that the function f is smooth with parameter L, and that f is (ε, κ, w*)-strictly-locally-quasi-convex, where w* is a global minimum. Then, the standard FW algorithm with line search (Algorithm 1, option (B)) can converge to within an ε-neighborhood of the global minimum when the constraint set is strongly convex. Furthermore, if one assumes that f(w) − f(w*) ≥ ε implies that the norm of the gradient is lower bounded as ||∇f(w)|| ≥ θε for some θ ∈ R, then the algorithm needs t = O(max(2κ/(θε)^2, 8Lκ^2/(θε)^3)) iterations to produce an iterate that is within an ε-neighborhood of the global minimum.

Hazan et al. (Hazan, Levy, and Shalev-Shwartz 2015) provide several examples of strictly-locally-quasi-convex functions. First, if ε ∈ (0, 1] and x = (x_1, x_2) ∈ [−10, 10]^2, then the function g(x) = (1 + e^{−x_1})^{−1} + (1 + e^{−x_2})^{−1} is (ε, 1, x*)-strictly-locally-quasi-convex in x.
Second, if ε ∈ (0, 1) and w ∈ R^d, then the function h(w) = (1/m) Σ_{i=1}^{m} (y_i − φ(⟨w, x_i⟩))^2 is (ε, 2/γ, w*)-strictly-locally-quasi-convex in w. Here, φ(z) = (1 + e^{−z})^{−1} is the sigmoid function, γ > 0 is the margin of a perceptron, and we have m samples {(x_i, y_i)}_{i=1}^{m} ⊆ B_1(0) × {0, 1}, where B_1(0) ⊆ R^d.

6 Smooth Non-Convex Functions

In this section, we show that, with high probability, FW with line search converges as O(1/t) to a stationary point when the loss function is non-convex and the constraint set is strongly convex. To our knowledge, a rate this rapid does not exist in the non-convex optimization literature. To help demonstrate our theoretical guarantee, we introduce a measure called the FW gap. The FW gap of f at a point w_t ∈ Ω is defined as k_t := max_{v ∈ Ω} ⟨v − w_t, −∇f(w_t)⟩. This measure is adopted in (Lacoste-Julien 2016), which is the first work to show that, for smooth non-convex functions, FW has an O(1/t^{1/2}) convergence rate to a stationary point over arbitrary convex sets. The O(1/t^{1/2}) rate matches the rate of projected gradient descent when the loss function is smooth and non-convex. It has been shown (Lacoste-Julien 2016) that a point w_t is a stationary point for the constrained optimization problem if and only if k_t = 0.

Theorem 4. Assume that the non-convex function f is smooth with parameter L and the constraint set Ω is α-strongly convex and has dimensionality d. Further, define the function h as h(w) = f(w) + θξ^T w, where θ ∈ (0, ε/(4D)], ξ ∈ R^d, w ∈ Ω, D is the diameter of Ω, and ξ is uniform on the unit sphere. Let l = f(w_1) − f(w*) and C = (αδ/(8L)) sqrt(π/(2d)). Then applying FW with line search to h yields the following guarantee for the FW gap of f with probability 1 − δ:

min_{1 ≤ s ≤ t} k_s ≤ l / (t · min{1/2, C}) = O(1/t).
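Both Theorem 3 and Theorem 4 rely on the line-search step of Algorithm 1, option (B), which minimizes the quadratic upper bound on f along the FW direction and therefore has a closed form; the FW gap k_t falls out of the same linear-oracle call. The snippet below is our own hedged sketch of one such iteration (assuming a known smoothness constant L and a linear oracle lmo such as the l_2-ball oracle sketched earlier), not code from the paper.

    import numpy as np

    def fw_line_search_step(w, grad, L, lmo, r=1.0):
        # One FW iteration with the option (B) step size; also returns the FW gap k_t.
        v = lmo(grad, r)                    # v_t = argmin_{v in Omega} <v, grad f(w_t)>
        d = v - w
        gap = float(-grad @ d)              # k_t = <v_t - w_t, -grad f(w_t)>; k_t = 0 iff w_t is stationary
        # Minimizing gamma*<d, grad> + (gamma^2 * L / 2)*||d||^2 over gamma in [0, 1]:
        gamma = 0.0 if gap <= 0 else min(1.0, gap / (L * float(d @ d) + 1e-12))
        return w + gamma * d, gap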

Table 2: Various loss functions and constraint sets used in our experiments.
- Convex: quadratic loss; l_p norm constraint; regression.
- Convex: observed quadratic loss; Schatten-p norm constraint; matrix completion.
- Strictly-locally-quasi-convex: squared sigmoid loss; l_p norm constraint; classification.
- Non-convex: bi-weight loss; l_p norm constraint; robust regression.

Figure 1: Convergence rates of FW variants for convex loss without line search and non-convex loss with line search. (a) Matrix completion with convex (observed quadratic) loss, Schatten-2 norm constraint. (b) Classification with quasi-convex (squared sigmoid) loss, l_2 norm constraint. (c) Regression with non-convex (bi-weight) loss, l_2 norm constraint.

We now discuss the result stated in the theorem further. In the non-convex optimization literature, Nesterov and Polyak (2006) show that cubic regularization of Newton's method can find a stationary point in O(ε^{−3/2}) iterations and Hessian evaluations. First-order methods, such as gradient descent, typically require O(ε^{−2}) iterations (Carmon et al. 2017) to converge to a stationary point. Recent progress on first-order methods, however, assumes some mild conditions and shows that an improved rate of O(ε^{−7/4}) is possible (Carmon et al. 2017; Agarwal et al. 2016). Here, we show that when the constraint set is strongly convex, FW with line search needs only O(ε^{−1}) iterations to arrive within an ε-neighborhood of a stationary point. It is important to note that, although the O(ε^{−1}) convergence rate holds probabilistically, it is quite fast compared to the known rates in the non-convex optimization literature.

7 Experiments

We have conducted extensive experiments on different combinations of loss functions, constraint sets, and real-life datasets (Table 2). Here, we only report two main sets of experiments: the empirical validation of our theoretical results in terms of convergence rates (Section 7.1) and the comparison of various optimization algorithms in terms of actual run times (Section 7.2). We refer the interested reader to our technical report for additional experiments (Rector-Brooks, Wang, and Mozafari 2018).

For classification and regression, we used the logistic and quadratic loss functions. For matrix completion, we used the observed quadratic loss (Freund, Grigas, and Mazumder 2017), defined as f(X) = Σ_{(i,j) ∈ P(M)} (X_{i,j} − M_{i,j})^2, where X is the estimated matrix, M is the observed matrix, and P(M) = {(i, j) : M_{i,j} is observed}. As a non-convex, but strictly-locally-quasi-convex loss, we also used the squared sigmoid loss with ϕ(z) = (1 + exp(−z))^{−1} (Hazan, Levy, and Shalev-Shwartz 2015) for classification. For robust regression, we used the bi-weight loss (Belagiannis et al. 2015) as a non-convex (but smooth) loss: ψ(f(x_i), y_i) = (f(x_i) − y_i)^2 / (1 + (f(x_i) − y_i)^2).

For regression, we used the YearPredictionMSD dataset (500K observations, 90 features) (Lichman 2013). For classification, we used the Adult dataset (49K observations, 14 features) (Lichman 2013). For matrix completion, we used the MovieLens dataset (1M movie ratings from 6,040 users on 3,900 movies) (Harper and Konstan 2016).

7.1 Empirical Validation of Convergence Rates

We ran several experiments to empirically validate our convergence results. In particular, we studied the performance of Primal Averaging (PA) and standard FW with line search (FWLS), with both l_2 and Schatten-2 norm balls as our strongly convex constraint sets. Theorem 1 guarantees a convergence rate of O(1/t^2) for PA when the constraint set is strongly convex and the loss function is convex. We experimented with both l_2 (logistic classifier) and Schatten-2 norm (matrix completion) balls, measuring the loss value at each iteration.
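As a concrete reference for the losses in Table 2, the following is a small sketch of the observed quadratic, bi-weight, and squared sigmoid losses as we read them from the text; the code and argument conventions (e.g., passing the observation set P(M) as a boolean mask) are our own illustrative assumptions, not from the paper.

    import numpy as np

    def observed_quadratic_loss(X, M, observed_mask):
        # f(X) = sum over observed (i, j) of (X_ij - M_ij)^2, with P(M) given as a boolean mask.
        diff = (X - M)[observed_mask]
        return float(np.sum(diff ** 2))

    def biweight_loss(pred, y):
        # Smooth, non-convex robust loss psi(z) = z^2 / (1 + z^2) with z = pred - y, summed over samples.
        z2 = (pred - y) ** 2
        return float(np.sum(z2 / (1.0 + z2)))

    def squared_sigmoid_loss(scores, y):
        # Strictly-locally-quasi-convex loss (1/m) * sum_i (y_i - sigmoid(score_i))^2.
        phi = 1.0 / (1.0 + np.exp(-scores))
        return float(np.mean((y - phi) ** 2))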
As shown in Figure 1a, a slope of −2.4 confirms Theorem 1's guarantee, which predicts a slope of −2 or steeper. Theorem 3 shows that FWLS converges to the global minimum at the rate of O(min(1/t^{1/3}, 1/t^{1/2})) when the constraint set is strongly convex and the loss function is strictly-locally-quasi-convex.

Figure 2: PA versus (a) other FW variants, (b) gradient descent, and (c) stochastic gradient descent.

We investigated this result with the squared sigmoid loss and an l_2 norm constraint. Figure 1b exhibits our results, showing a slope of −2.2, a finding better than the worst-case bound given by Theorem 3, i.e., a slope of −0.5 (see our technical report (Rector-Brooks, Wang, and Mozafari 2018) for a detailed discussion).

From Theorem 4, we expect FWLS to converge to a stationary point of a (smooth) non-convex function at a rate of O(1/t) when constrained to a strongly convex set. Using the bi-weight loss and an l_2 norm constraint, we measured the loss value at each iteration. As shown in Figure 1c, the results confirmed our theoretical results, showing an even steeper slope (−1.46 instead of −1, since Theorem 4 only provides a worst-case upper bound).

7.2 Comparison of Different Optimization Algorithms

To compare the actual performance of various optimization algorithms, we measure the run times, instead of the number of iterations to convergence, in order to account for the time spent in each iteration. In Figure 2, dotted vertical lines mark the convergence points of the various algorithms. First, we compared all three variants of FW: PA, standard FW with predefined learning rate (FWPLR), defined in Algorithm 1 with option A, and standard FW with line search (FWLS), defined in Algorithm 1 with option B. All methods were tested on a regression task (quadratic loss) with an l_2 norm ball constraint. As shown in Figure 2a, PA converged 3.7× and 5.6× faster than FWPLR and FWLS, respectively. This considerable speedup has significant ramifications in practice. Traditionally, practitioners have shied away from PA due to its slower iterations, while its convergence rate was believed to be the same as that of the more efficient variants (Lan 2013). However, as proven in Section 4, PA does converge in fewer iterations.

We also compared the run time of PA versus projected gradient descent (a regression task with a quadratic loss). We compared their deterministic versions in Figure 2b, where PA converged significantly faster (7.7×), as expected. For a fair comparison of their stochastic versions, Stochastic Primal Averaging (SPA) and Stochastic Gradient Descent (SGD), we considered two cases: an l_2 constraint (which has an efficient projection) and an l_1.5 constraint (which has a costly projection). As expected, for the efficient projection, SGD converged 4.6× faster than SPA (Figure 2c), and when the projection was costly, SPA converged 25.1× faster (see (Rector-Brooks, Wang, and Mozafari 2018) for detailed plots).

8 Conclusion

In this paper, we revisited an important class of optimization techniques, FW methods, and offered new insight into their convergence properties for strongly convex constraint sets, which are quite common in machine learning. Specifically, we discovered that, for convex functions, a non-conventional variant of FW (i.e., Primal Averaging) converges significantly faster than the commonly used variants of FW with high probability. We also showed that PA's O(1/t^2) convergence rate more than compensates for its slightly more expensive computational cost at each iteration. We further proved that for strictly-locally-quasi-convex functions, FW can converge to within an ε-neighborhood of the global minimum in O(max(1/ε^2, 1/ε^3)) iterations. Even for non-convex functions, we proved that FW's convergence rate is better, with high probability, than the previously known results in the literature.
These new convergence rates have significant ramifications for practitioners, due to the widespread application of strongly convex norm constraints in classification, regression, matrix completion, and collaborative filtering settings. Finally, we conducted extensive experiments on real-world datasets to validate our theoretical results and investigate our improvement over existing methods. In summary, we showed that PA reduces optimization time by 3.7-5.6× compared to standard FW variants, and by 7.7× compared to projected gradient descent. Our plan is to integrate PA into machine learning libraries, including our BlinkML project (Park et al. 2018).

9 Acknowledgments

This work is in part supported by the National Science Foundation (grants and 55369).

References

Agarwal, N.; Allen-Zhu, Z.; Bullins, B.; Hazan, E.; and Ma, T. 2016. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint.
Beck, A., and Teboulle, M. 2004. A conditional gradient method with linear rate of convergence for solving convex linear systems. Mathematical Methods of Operations Research 59(2).
Belagiannis, V.; Rupprecht, C.; Carneiro, G.; and Navab, N. 2015. Robust optimization for deep regression. In Proceedings of the IEEE International Conference on Computer Vision.
Carmon, Y.; Duchi, J.; Hinder, O.; and Sidford, A. 2017. Accelerated methods for non-convex optimization. arXiv preprint.
Chari, V.; Lacoste-Julien, S.; Laptev, I.; and Sivic, J. 2015. On pairwise costs for network flow multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Clarkson, K. L. 2010. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG) 6(4):63.
Demyanov, V. F., and Rubinov, A. M. 1970. Approximate Methods in Optimization Problems. Elsevier Publishing Company.
Dunn, J. C. 1979. Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals. SIAM Journal on Control and Optimization.
Frank, M., and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3(1-2):95-110.
Freund, R. M.; Grigas, P.; and Mazumder, R. 2017. An extended Frank-Wolfe method with in-face directions, and its application to low-rank matrix completion. SIAM Journal on Optimization 27(1).
Garber, D., and Hazan, E. 2015. Faster rates for the Frank-Wolfe method over strongly-convex sets. In International Conference on Machine Learning.
Ge, R.; Huang, F.; Jin, C.; and Yuan, Y. 2015. Escaping from saddle points -- online stochastic gradient for tensor decomposition. CoRR.
Ghadimi, S., and Lan, G. 2016. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming 156(1-2).
Harchaoui, Z.; Douze, M.; Paulin, M.; Dudik, M.; and Malick, J. 2012. Large-scale image classification with trace-norm regularization. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE.
Harper, F. M., and Konstan, J. A. 2016. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4):19.
Hazan, E., and Kale, S. 2012. Projection-free online learning. arXiv preprint.
Hazan, E., et al. 2016. Introduction to online convex optimization. Foundations and Trends in Optimization 2(3-4).
Hazan, E.; Kale, S.; and Warmuth, M. K. 2010. Learning rotations with little regret. In COLT.
Hazan, E.; Levy, K.; and Shalev-Shwartz, S. 2015. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems.
Jaggi, M.; Sulovský, M.; et al. 2010. A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th International Conference on Machine Learning (ICML-10).
Jaggi, M. 2011. Sparse convex optimization methods for machine learning. Technical report, ETH Zürich.
Jaggi, M. 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML (1).
Kim, S., and Xing, E. P. 2010. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML.
Lacoste-Julien, S., and Jaggi, M. 2015. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems.
Lacoste-Julien, S.; Jaggi, M.; Schmidt, M.; and Pletscher, P. 2013. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML.
Lacoste-Julien, S. 2016. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint.
Lan, G. 2013. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint.
Lee, J. D.; Simchowitz, M.; Jordan, M. I.; and Recht, B. 2016. Gradient descent converges to minimizers. In COLT.
Levitin, E. S., and Polyak, B. T. 1966. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics.
Lichman, M. 2013. UCI machine learning repository.
Nesterov, Y., and Polyak, B. T. 2006. Cubic regularization of Newton method and its global performance. Mathematical Programming 108(1).
Osokin, A.; Alayrac, J.-B.; Lukasewitz, I.; Dokania, P.; and Lacoste-Julien, S. 2016. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. In International Conference on Machine Learning.
Park, Y.; Qing, J.; Shen, X.; and Mozafari, B. 2018. BlinkML: Approximate machine learning with probabilistic guarantees. Technical report, http://web.eecs.umich.edu/~mozafari/php/data/uploads/blinkml_report.pdf.
Rector-Brooks, J.; Wang, J.-K.; and Mozafari, B. 2018. Revisiting projection-free optimization for strongly convex constraint sets. Technical report, http://web.eecs.umich.edu/~mozafari/php/data/uploads/fw_report.pdf.
Reddi, S. J.; Sra, S.; Poczos, B.; and Smola, A. 2016. Stochastic Frank-Wolfe methods for nonconvex optimization. In Allerton.
Shalev-Shwartz, S.; Gonen, A.; and Shamir, O. 2011. Large-scale convex minimization with a low-rank constraint. arXiv preprint.
Sun, W., and Yuan, Y.-X. 2006. Optimization Theory and Methods: Nonlinear Programming, volume 1. Springer Science & Business Media.
Wang, Y.-X.; Sadhanala, V.; Dai, W.; Neiswanger, W.; Sra, S.; and Xing, E. 2016. Parallel and distributed block-coordinate Frank-Wolfe algorithms. In International Conference on Machine Learning.
Yu, Y.; Zhang, X.; and Schuurmans, D. 2014. Generalized conditional gradient for structured estimation. arXiv preprint.


More information

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems

On Boundedness of Q-Learning Iterates for Stochastic Shortest Path Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 38, No. 2, May 2013, pp. 209 227 ISSN 0364-765X (prin) ISSN 1526-5471 (online) hp://dx.doi.org/10.1287/moor.1120.0562 2013 INFORMS On Boundedness of Q-Learning Ieraes

More information

Robust estimation based on the first- and third-moment restrictions of the power transformation model

Robust estimation based on the first- and third-moment restrictions of the power transformation model h Inernaional Congress on Modelling and Simulaion, Adelaide, Ausralia, 6 December 3 www.mssanz.org.au/modsim3 Robus esimaion based on he firs- and hird-momen resricions of he power ransformaion Nawaa,

More information

Air Traffic Forecast Empirical Research Based on the MCMC Method

Air Traffic Forecast Empirical Research Based on the MCMC Method Compuer and Informaion Science; Vol. 5, No. 5; 0 ISSN 93-8989 E-ISSN 93-8997 Published by Canadian Cener of Science and Educaion Air Traffic Forecas Empirical Research Based on he MCMC Mehod Jian-bo Wang,

More information

Problem Set 5. Graduate Macro II, Spring 2017 The University of Notre Dame Professor Sims

Problem Set 5. Graduate Macro II, Spring 2017 The University of Notre Dame Professor Sims Problem Se 5 Graduae Macro II, Spring 2017 The Universiy of Nore Dame Professor Sims Insrucions: You may consul wih oher members of he class, bu please make sure o urn in your own work. Where applicable,

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

SUFFICIENT CONDITIONS FOR EXISTENCE SOLUTION OF LINEAR TWO-POINT BOUNDARY PROBLEM IN MINIMIZATION OF QUADRATIC FUNCTIONAL

SUFFICIENT CONDITIONS FOR EXISTENCE SOLUTION OF LINEAR TWO-POINT BOUNDARY PROBLEM IN MINIMIZATION OF QUADRATIC FUNCTIONAL HE PUBLISHING HOUSE PROCEEDINGS OF HE ROMANIAN ACADEMY, Series A, OF HE ROMANIAN ACADEMY Volume, Number 4/200, pp 287 293 SUFFICIEN CONDIIONS FOR EXISENCE SOLUION OF LINEAR WO-POIN BOUNDARY PROBLEM IN

More information

Class Meeting # 10: Introduction to the Wave Equation

Class Meeting # 10: Introduction to the Wave Equation MATH 8.5 COURSE NOTES - CLASS MEETING # 0 8.5 Inroducion o PDEs, Fall 0 Professor: Jared Speck Class Meeing # 0: Inroducion o he Wave Equaion. Wha is he wave equaion? The sandard wave equaion for a funcion

More information

Inventory Control of Perishable Items in a Two-Echelon Supply Chain

Inventory Control of Perishable Items in a Two-Echelon Supply Chain Journal of Indusrial Engineering, Universiy of ehran, Special Issue,, PP. 69-77 69 Invenory Conrol of Perishable Iems in a wo-echelon Supply Chain Fariborz Jolai *, Elmira Gheisariha and Farnaz Nojavan

More information

EKF SLAM vs. FastSLAM A Comparison

EKF SLAM vs. FastSLAM A Comparison vs. A Comparison Michael Calonder, Compuer Vision Lab Swiss Federal Insiue of Technology, Lausanne EPFL) michael.calonder@epfl.ch The wo algorihms are described wih a planar robo applicaion in mind. Generalizaion

More information

Estimation of Poses with Particle Filters

Estimation of Poses with Particle Filters Esimaion of Poses wih Paricle Filers Dr.-Ing. Bernd Ludwig Chair for Arificial Inelligence Deparmen of Compuer Science Friedrich-Alexander-Universiä Erlangen-Nürnberg 12/05/2008 Dr.-Ing. Bernd Ludwig (FAU

More information

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8)

Econ107 Applied Econometrics Topic 7: Multicollinearity (Studenmund, Chapter 8) I. Definiions and Problems A. Perfec Mulicollineariy Econ7 Applied Economerics Topic 7: Mulicollineariy (Sudenmund, Chaper 8) Definiion: Perfec mulicollineariy exiss in a following K-variable regression

More information

CS376 Computer Vision Lecture 6: Optical Flow

CS376 Computer Vision Lecture 6: Optical Flow CS376 Compuer Vision Lecure 6: Opical Flow Qiing Huang Feb. 11 h 2019 Slides Credi: Krisen Grauman and Sebasian Thrun, Michael Black, Marc Pollefeys Opical Flow mage racking 3D compuaion mage sequence

More information

Planning in POMDPs. Dominik Schoenberger Abstract

Planning in POMDPs. Dominik Schoenberger Abstract Planning in POMDPs Dominik Schoenberger d.schoenberger@sud.u-darmsad.de Absrac This documen briefly explains wha a Parially Observable Markov Decision Process is. Furhermore i inroduces he differen approaches

More information

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY

RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic

More information

ON THE BEAT PHENOMENON IN COUPLED SYSTEMS

ON THE BEAT PHENOMENON IN COUPLED SYSTEMS 8 h ASCE Specialy Conference on Probabilisic Mechanics and Srucural Reliabiliy PMC-38 ON THE BEAT PHENOMENON IN COUPLED SYSTEMS S. K. Yalla, Suden Member ASCE and A. Kareem, M. ASCE NaHaz Modeling Laboraory,

More information

Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization

Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization Dual Averaging Mehods for Regularized Sochasic Learning and Online Opimizaion Lin Xiao Microsof Research Microsof Way Redmond, WA 985, USA lin.xiao@microsof.com Revised March 8, Absrac We consider regularized

More information

Information Geometry of Contrastive Divergence

Information Geometry of Contrastive Divergence Informaion Geomery of Conrasive Divergence hoaro Akaho AIT Tsukuba, 305-8568 Japan Kazuya Takabaake AIT Tsukuba, 305-8568 Japan Absrac The conrasive divergence(cd) mehod proposed by Hinon finds an approximae

More information

Tom Heskes and Onno Zoeter. Presented by Mark Buller

Tom Heskes and Onno Zoeter. Presented by Mark Buller Tom Heskes and Onno Zoeer Presened by Mark Buller Dynamic Bayesian Neworks Direced graphical models of sochasic processes Represen hidden and observed variables wih differen dependencies Generalize Hidden

More information

Mean-square Stability Control for Networked Systems with Stochastic Time Delay

Mean-square Stability Control for Networked Systems with Stochastic Time Delay JOURNAL OF SIMULAION VOL. 5 NO. May 7 Mean-square Sabiliy Conrol for Newored Sysems wih Sochasic ime Delay YAO Hejun YUAN Fushun School of Mahemaics and Saisics Anyang Normal Universiy Anyang Henan. 455

More information

An Introduction to Backward Stochastic Differential Equations (BSDEs) PIMS Summer School 2016 in Mathematical Finance.

An Introduction to Backward Stochastic Differential Equations (BSDEs) PIMS Summer School 2016 in Mathematical Finance. 1 An Inroducion o Backward Sochasic Differenial Equaions (BSDEs) PIMS Summer School 2016 in Mahemaical Finance June 25, 2016 Chrisoph Frei cfrei@ualbera.ca This inroducion is based on Touzi [14], Bouchard

More information

Vectorautoregressive Model and Cointegration Analysis. Time Series Analysis Dr. Sevtap Kestel 1

Vectorautoregressive Model and Cointegration Analysis. Time Series Analysis Dr. Sevtap Kestel 1 Vecorauoregressive Model and Coinegraion Analysis Par V Time Series Analysis Dr. Sevap Kesel 1 Vecorauoregression Vecor auoregression (VAR) is an economeric model used o capure he evoluion and he inerdependencies

More information

Model Reduction for Dynamical Systems Lecture 6

Model Reduction for Dynamical Systems Lecture 6 Oo-von-Guericke Universiä Magdeburg Faculy of Mahemaics Summer erm 07 Model Reducion for Dynamical Sysems ecure 6 v eer enner and ihong Feng Max lanck Insiue for Dynamics of Complex echnical Sysems Compuaional

More information

An Introduction to Stochastic Programming: The Recourse Problem

An Introduction to Stochastic Programming: The Recourse Problem An Inroducion o Sochasic Programming: he Recourse Problem George Danzig and Phil Wolfe Ellis Johnson, Roger Wes, Dick Cole, and Me John Birge Where o look in he ex pp. 6-7, Secion.2.: Inroducion o sochasic

More information

arxiv: v4 [stat.ml] 14 Jun 2018

arxiv: v4 [stat.ml] 14 Jun 2018 Projecion-Free Online Opimizaion wih Sochasic Gradien: From Convexiy o Submodulariy Lin Chen Chrisopher Harshaw 3 Hamed Hassani 4 Amin Karbasi arxiv:80.0883v4 [sa.ml 4 Jun 08 Absrac Online opimizaion has

More information

Accelerated Distributed Nesterov Gradient Descent for Convex and Smooth Functions

Accelerated Distributed Nesterov Gradient Descent for Convex and Smooth Functions 07 IEEE 56h Annual Conference on Decision and Conrol (CDC) December -5, 07, Melbourne, Ausralia Acceleraed Disribued Neserov Gradien Descen for Convex and Smooh Funcions Guannan Qu, Na Li Absrac This paper

More information

Maintenance Models. Prof. Robert C. Leachman IEOR 130, Methods of Manufacturing Improvement Spring, 2011

Maintenance Models. Prof. Robert C. Leachman IEOR 130, Methods of Manufacturing Improvement Spring, 2011 Mainenance Models Prof Rober C Leachman IEOR 3, Mehods of Manufacuring Improvemen Spring, Inroducion The mainenance of complex equipmen ofen accouns for a large porion of he coss associaed wih ha equipmen

More information

On-line Adaptive Optimal Timing Control of Switched Systems

On-line Adaptive Optimal Timing Control of Switched Systems On-line Adapive Opimal Timing Conrol of Swiched Sysems X.C. Ding, Y. Wardi and M. Egersed Absrac In his paper we consider he problem of opimizing over he swiching imes for a muli-modal dynamic sysem when

More information

Mean Square Projection Error Gradient-based Variable Forgetting Factor FAPI

Mean Square Projection Error Gradient-based Variable Forgetting Factor FAPI 3rd Inernaional Conference on Advances in Elecrical and Elecronics Engineering (ICAEE'4) Feb. -, 4 Singapore Mean Square Projecion Error Gradien-based Variable Forgeing Facor FAPI Young-Kwang Seo, Jong-Woo

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information