arxiv: v4 [stat.ml] 14 Jun 2018


Projection-Free Online Optimization with Stochastic Gradient: From Convexity to Submodularity

Lin Chen (1, 2), Christopher Harshaw (1, 3), Hamed Hassani (4), Amin Karbasi (1, 2)

Abstract

Online optimization has been a successful framework for solving large-scale problems under computational constraints and partial information. Current methods for online convex optimization require either a projection or exact gradient computation at each step, both of which can be prohibitively expensive for large-scale applications. At the same time, there is a growing trend of non-convex optimization in the machine learning community and a need for online methods. Continuous DR-submodular functions, which exhibit a natural diminishing returns condition, have recently been proposed as a broad class of non-convex functions which may be efficiently optimized. Although online methods have been introduced, they suffer from similar problems. In this work, we propose Meta-Frank-Wolfe, the first online projection-free algorithm that uses stochastic gradient estimates. The algorithm relies on a careful sampling of gradients in each round and achieves the optimal $O(\sqrt{T})$ adversarial regret bounds for convex and continuous submodular optimization. We also propose One-Shot Frank-Wolfe, a simpler algorithm which requires only a single stochastic gradient estimate in each round and achieves an $O(T^{2/3})$ stochastic regret bound for convex and continuous submodular optimization. We apply our methods to develop a novel lifting framework for online discrete submodular maximization and also see that they outperform current state-of-the-art techniques on various experiments.

(1) Yale Institute for Network Science, Yale University, New Haven, CT, USA. (2) Department of Electrical Engineering, Yale University. (3) Department of Computer Science, Yale University. (4) Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA, USA. Correspondence to: Lin Chen <lin.chen@yale.edu>.
1. Introduction

As the amount of collected data becomes massive in both size and complexity, algorithm designers are faced with unprecedented challenges in statistics, machine learning, and control. In the past decade, online optimization has provided a successful computational framework for tackling a wide variety of challenging problems, ranging from non-parametric regression to portfolio management (Calandriello et al., 2017; Agarwal et al., 2006). In online optimization, a large or complex optimization problem is broken down into a sequence of smaller optimization problems, each of which must be solved with limited information. This framework captures many real-world scenarios in which standard optimization theory does not apply. For instance, a machine learning application cannot feasibly process terabytes of data at a single time; rather, subsets of data may be handled in a sequential fashion. Another example is when the true objective function is the expectation of an unknown distribution of functions, and may only be accessible via samples, as is the case for problems in online learning and control theory (Xiao, 2010; Wang & Boyd, 2008).

Online convex optimization, a branch of online optimization that considers sequentially minimizing convex functions, has proved particularly useful for statistical and machine learning applications. Online convex optimization has enjoyed much success in these areas because most offline machine learning techniques utilize the existing theory of convex optimization. As in the offline setting, gradient methods are a popular class of algorithms for online convex optimization due to their simplicity; however, they require projections onto the constraint set, which involve solving a quadratic program in the general case. These projections are infeasible for large-scale applications with complicated constraints such as matrix completion, network routing problems, and maximum matchings.
Online projection-free methods have been proposed and are much more efficient, replacing a projection onto the constraint set with a linear optimization over the constraint set at each iteration (Hazan & Kale, 2012; Garber & Hazan, 2013). However, these projection-free methods require exact gradient computations, which may be prohibitively expensive for even moderately sized data sets and intractable when a closed form does not exist. Thus, there is a pressing need for online convex optimization routines that are projection-free and also robust to stochastic gradient estimates.

While convex programs may be efficiently solved (at least in theory), there is a growing number of non-convex problems arising in machine learning and statistics. Notable examples include nonnegative principal component analysis, low-rank matrix recovery, sigmoid loss functions for binary classification, and the training of deep neural networks, to name a few. Understanding which types of non-convex functions may be efficiently optimized and developing techniques for doing so is a pressing research question for both theory and practice. Recently, continuous DR-submodular functions have been proposed as a broad class of non-convex functions which admit efficient approximate maximization routines, even though exact maximization is NP-hard (Bian et al., 2017). These functions capture many real-life applications, such as optimal experiment design, non-definite quadratic programming, coverage and diversity functions, and continuous relaxation of discrete submodular functions. Recent works (Chen et al., 2018) have proposed methods for online continuous DR-submodular optimization; however, these too require either expensive projections or exact gradient computations.

Our contributions. In this paper, we present a suite of projection-free algorithms for online optimization that use stochastic estimates of the gradient and leverage the averaging technique (Mokhtari et al., 2018a;b) to reduce their variance. This includes:

Meta-Frank-Wolfe, the first projection-free algorithm for adversarial online optimization which requires only stochastic gradient estimates. The algorithm relies on a careful sampling of gradients in each round and achieves optimal $O(\sqrt{T})$ regret and $(1 - 1/e)$-regret bounds for convex and submodular optimization, respectively.
One-Shot Frank-Wolfe, a simpler projection-free algorithm for stochastic online optimization which requires only a single stochastic gradient estimate in each round. This simpler algorithm achieves $O(T^{2/3})$ regret and $(1 - 1/e)$-regret bounds for the convex and submodular case, respectively.

A novel class of algorithms for online discrete submodular optimization which are based on lifting discrete functions to the continuous domain, applying our methods with an extremely efficient sampling technique, and using rounding schemes to produce a discrete solution.

Finally, to demonstrate the effectiveness of our algorithms, we tested their performance on an extensive set of experiments and measured against common baselines.

2. Related Work

The Frank-Wolfe algorithm, also known as conditional gradient descent, was originally proposed for the offline setting in (Frank & Wolfe, 1956). The framework of online convex optimization was introduced by Zinkevich (2003), in which online projected gradient descent was proposed and proved to achieve an $O(\sqrt{T})$ regret bound. However, the projections required for such an algorithm are too expensive for many large-scale online problems. Online conditional gradient descent was the first projection-free online algorithm, originally proposed in (Hazan & Kale, 2012). An improved conditional gradient algorithm was later designed for smooth and strongly convex optimization which achieves the optimal $O(\sqrt{T})$ adversarial regret bound (Garber & Hazan, 2013). However, both of these algorithms can perform arbitrarily poorly if supplied with stochastic gradient estimates. Lafond et al. (2015) proposed an online Frank-Wolfe variant for the anytime stochastic online setting that converges to a stationary point for non-convex expected functions. While convergence is an important property of the anytime methods, arbitrary stationary points do not yield approximation guarantees for general non-convex functions.

Johnson & Zhang (2013) introduced the variance reduction technique for accelerating stochastic gradient descent. It was independently discovered by Mahdavi et al. (2013).
Allen-Zhu & Hazan (2016) applied this technique to non-convex optimization. Hazan & Luo (2016) devised a projection-free stochastic convex optimization algorithm based on this technique. Mokhtari et al. (2018a;b) proposed the first sample-efficient variance reduction technique for projection-free algorithms that does not require increasing batch sizes. Their method achieves the tight $(1 - 1/e)$ approximation guarantee for monotone and continuous DR-submodular functions. Although these variance reduction techniques have enjoyed success in the offline setting, they have yet to be as extensively applied in the online setting that we consider in this paper.

In the discrete domain, Streeter & Golovin (2009) studied the online maximization problem of monotone submodular set functions subject to a knapsack constraint and introduced the meta-action technique. In a celebrated work, Calinescu et al. (2011) proposed an (offline) method for maximizing monotone submodular set functions subject to a matroid constraint by working in the continuous domain via the multilinear extension, then rounding the fractional solution. By combining the meta-action and lifting techniques, Golovin et al. (2014) presented an algorithm whose $(1 - 1/e)$-regret is bounded by $O(\sqrt{T})$. The lifting method therein relies on an expensive sampling procedure that does not scale favorably to large applications.

Bach (2015) demonstrated connections between continuous submodular functions and convex functions in the context of minimization. Building upon the continuous greedy algorithm of (Calinescu et al., 2011), Bian et al. (2017) proposed an algorithm that achieves a $(1 - 1/e)$-approximation guarantee for maximizing monotone continuous DR-submodular functions subject to down-closed convex constraints. Projected gradient methods were investigated in (Hassani et al., 2017) and were shown to attain a $1/2$-approximation ratio for monotone continuous DR-submodular functions. Very recently, Chen et al. (2018) borrowed the idea of meta-actions (Streeter & Golovin, 2009) and proposed several online algorithms for maximizing monotone continuous DR-submodular functions. However, each of these methods either requires an expensive projection step at each iteration or cannot handle stochastic gradient estimates.

3. Preliminaries

In this work, we are interested in optimizing two classes of functions, namely convex and continuous DR-submodular. To begin defining continuous submodular functions, we first recall the definition of a submodular set function. A real-valued set function $f : 2^{\Omega} \to \mathbb{R}_+$ is submodular if

$f(A) + f(B) \ge f(A \cup B) + f(A \cap B)$ for all $A, B \subseteq \Omega$.

The notion of submodularity has been extended to continuous domains (Wolsey, 1982; Vondrák, 2007; Bach, 2015). Consider a function $f : \mathcal{X} \to \mathbb{R}_+$ where the domain is of the form $\mathcal{X} = \prod_{i=1}^{n} \mathcal{X}_i$ and each $\mathcal{X}_i$ is a compact subset of $\mathbb{R}_+$. We say that $f$ is continuous submodular if $f$ is continuous and for all $x, y \in \mathcal{X}$, we have

$f(x) + f(y) \ge f(x \vee y) + f(x \wedge y)$,

where $x \vee y$ and $x \wedge y$ are the component-wise maximum and minimum, respectively. Note that we have defined both discrete and continuous functions to be nonnegative on their respective domains. For efficient maximization, we also require that these functions satisfy a diminishing returns condition (Bian et al., 2017).
We say that $f$ is continuous DR-submodular if $f$ is differentiable and $\nabla f(x) \ge \nabla f(y)$ for all $x \le y$. The main attraction of continuous DR-submodular functions is that they are concave in positive directions; that is, for all $x \le y$,

$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle$

(Calinescu et al., 2011; Bian et al., 2017). A function $f$ is monotone if $f(x) \le f(y)$ for all $x \le y$. A function $f$ is $L$-smooth if $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y$.

We now provide a brief introduction to online optimization, referring the interested reader to the excellent survey of (Hazan et al., 2016). In the online setting, a player seeks to iteratively optimize a sequence of functions $f_1, \dots, f_T$ over $T$ rounds. In each round $t$, a player must first choose a point $x_t$ from the constraint set $\mathcal{K}$. After playing $x_t$, the value of $f_t(x_t)$ is revealed to the player, along with access to the gradient $\nabla f_t$. Although the player does not know the function $f_t$ while choosing $x_t$, they may use information of previously seen functions to guide their choice. The situation where an arbitrary sequence of functions $f_1, \dots, f_T$ is presented is known as the adversarial online setting. In the adversarial setting, the goal of the player is to minimize adversarial regret, which is defined as

$\mathcal{R}_T \triangleq \sum_{t=1}^{T} f_t(x_t) - \inf_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x)$

for minimization problems and analogously defined for maximization problems. Intuitively, a player's regret is low if the accumulated value of their actions over the $T$ rounds is close to that of the single best action in hindsight. Indeed, this is a natural framework for data-intensive applications where the entire data may not fit onto a single disk and thus needs to be processed in $T$ batches. The algorithm designer would like to devise a scheme to process the $T$ batches separately in a way that is competitive with the best single-disk solution.

A slightly different formulation, known as the stochastic online setting, is when the functions are chosen i.i.d. from some unknown distribution $f_t \sim \mathcal{D}$. In this case, the player seeks to minimize stochastic regret, which is defined as

$\mathcal{SR}_T \triangleq \sum_{t=1}^{T} f(x_t) - T \inf_{x \in \mathcal{K}} f(x)$,

where $f(x) = \mathbb{E}_{f_t \sim \mathcal{D}}[f_t(x)]$ denotes the expected function.
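As a small numerical illustration of the adversarial regret definition, the following sketch evaluates it on a toy instance. The quadratic losses, the fixed player, and the finite grid standing in for the infimum over $\mathcal{K} = [0, 1]$ are all assumptions made for this example and do not come from the paper.

```python
import numpy as np

def adversarial_regret(losses, plays, candidates):
    """Cumulative loss of the plays minus that of the best fixed candidate."""
    incurred = sum(f(x) for f, x in zip(losses, plays))
    # The infimum over K is approximated by a minimum over a finite grid.
    best_fixed = min(sum(f(x) for f in losses) for x in candidates)
    return incurred - best_fixed

# Toy instance on K = [0, 1]: f_t(x) = (x - c_t)^2 with alternating targets.
targets = [0.0, 1.0, 0.0, 1.0]
losses = [lambda x, c=c: (x - c) ** 2 for c in targets]
plays = [0.0, 0.0, 0.0, 0.0]        # a player that always plays 0
grid = np.linspace(0.0, 1.0, 101)   # hypothetical discretization of K
regret = adversarial_regret(losses, plays, grid)
print(regret)
```

Here the best fixed action in hindsight is the midpoint of the interval, so a player that never moves from 0 accumulates regret that grows linearly with the number of rounds.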
The stochastic setting is a natural framework for many statistical and machine learning applications, such as empirical risk minimization, where the true objective is unknown but pairs of data points and labels are sampled. While the stochastic setting appears easier than the adversarial setting (in the sense that any strategy for the adversarial setting applies to the stochastic setting and obtains a potentially lower regret), the strategies designed for the stochastic setting may be much simpler and more computationally efficient. For both adversarial and stochastic settings, a strategy that achieves a regret that is sublinear in $T$ is considered good, and $O(\sqrt{T})$ regret bounds are optimal for convex functions in both settings. Although convex programs can be efficiently solved to high accuracy, general non-convex programs cannot be efficiently optimized exactly, thus necessitating another definition of regret.

The $\alpha$-regret is defined as

$\alpha\text{-}\mathcal{R}_T \triangleq \alpha \sup_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x) - \sum_{t=1}^{T} f_t(x_t)$

for adversarial maximization problems, and may be analogously extended to other scenarios. Intuitively, $\alpha$-regret compares a player's actions with the best $\alpha$-approximation to the optimal solution in hindsight. This is appropriate when the objective functions do not admit efficient optimization routines, but do admit constant-factor approximations, as is the case with continuous DR-submodular functions.

Nearly all optimization methods for both offline and online settings use first-order information of the objective function; however, exact gradient computations can be costly, especially when the objective function is only readily expressed as a large sum of individual functions or is itself an expectation over an unknown distribution. In this case, stochastic estimates are usually much more computationally efficient to obtain via sampling or simulation. In this work, we assume that once a function $f$ is revealed, the player gains oracle access to unbiased stochastic estimates of the gradient, rather than the exact gradient. More precisely, the player may query the oracle to obtain a random linear function $\tilde{\nabla} f(x)$ such that $\mathbb{E}[\nabla f(x) - \tilde{\nabla} f(x)] = 0$ for all $x$. This computational model captures commonly used mini-batch methods for estimating gradients, among other examples.

In this work, we make a few main assumptions that allow our algorithms to be analyzed.

Assumption 1. The constraint set $\mathcal{K}$ is convex and compact, with diameter $D = \sup_{x, y \in \mathcal{K}} \|x - y\|$ and radius $R = \sup_{x \in \mathcal{K}} \|x\|$.

Assumption 2. In the adversarial setting, each function $f_t$ is $L$-smooth, and in the stochastic setting, the expected function $f$ is $L$-smooth.

Assumption 3. In the adversarial setting, the gradient oracle is unbiased, $\mathbb{E}[\nabla f_t(x) - \tilde{\nabla} f_t(x)] = 0$, and has a bounded variance, $\mathbb{E}[\|\nabla f_t(x) - \tilde{\nabla} f_t(x)\|^2] \le \sigma^2$, for all points $x$ and functions $f_t$. In the stochastic setting, the gradient oracle is unbiased, $\mathbb{E}[\nabla f(x) - \tilde{\nabla} f_t(x)] = 0$, and has a bounded variance, $\mathbb{E}[\|\nabla f(x) - \tilde{\nabla} f_t(x)\|^2] \le \sigma^2$, for all points $x$ and functions $f_t$.
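A mini-batch gradient is one concrete instance of the unbiased oracle assumed above. The sketch below uses a least-squares loss and uniformly sampled row indices, both chosen purely for illustration (the paper does not prescribe this loss or these function names), and compares the average of many mini-batch estimates against the exact gradient.

```python
import numpy as np

def exact_gradient(A, b, x):
    """Gradient of f(x) = (1 / 2n) * ||A x - b||^2."""
    return A.T @ (A @ x - b) / A.shape[0]

def minibatch_gradient(A, b, x, idx):
    """Gradient of the same loss restricted to the rows listed in idx."""
    A_s, b_s = A[idx], b[idx]
    return A_s.T @ (A_s @ x - b_s) / len(idx)

rng = np.random.default_rng(0)
n, dim, batch = 50, 3, 5
A = rng.normal(size=(n, dim))
b = rng.normal(size=n)
x = rng.normal(size=dim)

# Rows are drawn uniformly with replacement, so each estimate is unbiased;
# the mean of many estimates should therefore approach the exact gradient.
estimates = [minibatch_gradient(A, b, x, rng.integers(0, n, size=batch))
             for _ in range(2000)]
gap = np.linalg.norm(np.mean(estimates, axis=0) - exact_gradient(A, b, x))
print(gap)
```

The variance bound $\sigma^2$ of Assumption 3 corresponds here to the spread of the individual mini-batch estimates around the exact gradient, which shrinks as the batch size grows.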
We remark that in the stochastic setting and under mild regularity conditions, unbiasedness of the gradients $\mathbb{E}[\nabla f_t(x) - \tilde{\nabla} f_t(x)] = 0$ implies unbiasedness $\mathbb{E}[\nabla f(x) - \tilde{\nabla} f_t(x)] = 0$ in Assumption 3 because $f(x) = \mathbb{E}_{f_t \sim \mathcal{D}}[f_t(x)]$. Moreover, upper bounds on the variance terms $\mathbb{E}[\|\nabla f(x) - \nabla f_t(x)\|^2] \le \sigma_a^2$ and $\mathbb{E}[\|\nabla f_t(x) - \tilde{\nabla} f_t(x)\|^2] \le \sigma_b^2$ yield a variance bound of $\mathbb{E}[\|\nabla f(x) - \tilde{\nabla} f_t(x)\|^2] \le \sigma_a^2 + \sigma_b^2$, by the triangle inequality.

4. Main Results

We now present two algorithms for online optimization of convex and continuous DR-submodular functions in the adversarial and stochastic settings. Unlike previous work, these methods are projection-free and require only stochastic estimates of the gradients, rather than exact gradient computations. In both algorithms, the main computational primitive is linear optimization over a compact convex set. In addition, we remark that both algorithms can be converted into an anytime algorithm that does not require knowledge of the horizon $T$ via the doubling trick; see Section 2.3.1 of (Shalev-Shwartz et al., 2012).

4.1. Adversarial Online Setting

Algorithm 1 combines the recent variance reduction technique of (Mokhtari et al., 2018a) with the use of online linear optimization oracles to minimize the regret in each round. An online linear optimization oracle is an instance of an online linear optimization algorithm (minimization or maximization in the convex or DR-submodular setting, respectively) that optimizes linear objectives in a sequential manner. Both the variance reduction in the stochastic gradient estimates and the online linear oracles are crucial in the algorithm, as just one technique is not enough to get sublinear regret bounds in the adversarial setting.

At a high level, our algorithm produces iterates $x_t$ by running $K$ steps of a Frank-Wolfe procedure, using an average of previous gradient estimates and linear online optimization oracles in place of exact optimization of the true gradient. After a point $x_t$ is played in round $t$, our algorithm queries the gradient oracle $\tilde{\nabla} f_t$ at $K$ points. Then, the gradient estimates are averaged with those from previous rounds and fed as objective functions into $K$ linear online optimization oracles.
The $K$ points chosen by the oracles are used as iterates in a full $K$-step Frank-Wolfe subroutine to obtain the next point $x_{t+1}$. A formal description is provided in Algorithm 1.

There are only a few differences in Algorithm 1 for convex and submodular optimization. First, the online oracles should be minimizing in the case of convex optimization and maximizing in the case of submodular optimization. Second, the initial point $x_1$ may be any point in $\mathcal{K}$ for convex problems but should be set to $0$ for submodular problems (even if $\mathcal{K}$ is not down-closed). Finally, the update rule is

$x_t^{(k+1)} \leftarrow (1 - \eta_k) x_t^{(k)} + \eta_k v_t^{(k)}$ for convex problems, and

$x_t^{(k+1)} \leftarrow x_t^{(k)} + \eta_k v_t^{(k)}$ for submodular problems.

We now provide a formal regret bound.
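The high-level description above can be sketched in code for the convex case. This is a minimal illustration under several assumptions that are not taken from the paper: the constraint set is the probability simplex, each online linear oracle is a simple Follow-the-Perturbed-Leader rule with uniform perturbations (the class name `FTPLOracle` and its scale are hypothetical), the step-size schedules are picked for illustration, and the running average of gradient estimates is taken over the $K$ inner steps of each round.

```python
import numpy as np

class FTPLOracle:
    """Hypothetical follow-the-perturbed-leader oracle over the simplex."""
    def __init__(self, dim, scale, rng):
        self.cum = np.zeros(dim)          # sum of linear objectives seen so far
        self.scale, self.rng = scale, rng

    def predict(self):
        noise = self.rng.uniform(0.0, self.scale, size=self.cum.shape)
        v = np.zeros_like(self.cum)
        v[np.argmin(self.cum + noise)] = 1.0   # vertex minimizing perturbed sum
        return v

    def feedback(self, d):
        self.cum += d

def meta_frank_wolfe(grad_oracles, x1, K, rng):
    """Convex-case sketch: K Frank-Wolfe steps per round, one oracle per step."""
    T, dim = len(grad_oracles), x1.size
    oracles = [FTPLOracle(dim, np.sqrt(T), rng) for _ in range(K)]
    plays = []
    for grad in grad_oracles:
        xs = [x1.copy()]                       # xs[k - 1] holds x_t^(k)
        for k in range(1, K + 1):
            v = oracles[k - 1].predict()
            eta = 1.0 / (k + 3)                # assumed step size
            xs.append((1 - eta) * xs[-1] + eta * v)
        plays.append(xs[-1])                   # play x_t = x_t^(K+1)
        d = np.zeros(dim)
        for k in range(1, K + 1):
            rho = 2.0 / (k + 3) ** (2.0 / 3.0) # assumed averaging weight
            d = (1 - rho) * d + rho * grad(xs[k - 1])
            oracles[k - 1].feedback(d)         # linear objective <., d>
    return plays

rng = np.random.default_rng(0)
dim, T, K = 5, 50, 20
centers = rng.dirichlet(np.ones(dim), size=T)
# Noisy gradients of f_t(x) = 0.5 * ||x - c_t||^2, an illustrative loss.
grad_oracles = [lambda x, c=c: (x - c) + 0.1 * rng.normal(size=dim)
                for c in centers]
plays = meta_frank_wolfe(grad_oracles, np.full(dim, 1.0 / dim), K, rng)
```

Because every iterate is a convex combination of points of the simplex, feasibility is maintained without any projection; only vertex selections (linear optimizations) are needed.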

Algorithm 1 Meta-Frank-Wolfe
Input: convex set $\mathcal{K}$, time horizon $T$, linear optimization oracles $\mathcal{E}^{(1)}, \dots, \mathcal{E}^{(K)}$, step sizes $\rho_k \in (0, 1)$ and $\eta_k \in (0, 1)$, and initial point $x_1$
Output: $\{x_t : 1 \le t \le T\}$
  Initialize online linear optimization oracles $\mathcal{E}^{(1)}, \dots, \mathcal{E}^{(K)}$
  Initialize $d_t^{(0)} = 0$ and $x_t^{(1)} = x_1$
  for $t \leftarrow 1, 2, 3, \dots, T$ do
    $v_t^{(k)} \leftarrow$ output of oracle $\mathcal{E}^{(k)}$ in round $t - 1$
    $x_t^{(k+1)} \leftarrow \mathrm{update}(x_t^{(k)}, v_t^{(k)}, \eta_k)$ for $k = 1, \dots, K$
    Play $x_t = x_t^{(K+1)}$, then obtain value $f_t(x_t)$ and unbiased oracle access to $\nabla f_t$
    $d_t^{(k)} \leftarrow (1 - \rho_k) d_t^{(k-1)} + \rho_k \tilde{\nabla} f_t(x_t^{(k)})$ for $k = 1, \dots, K$
    Feedback $\langle v_t^{(k)}, d_t^{(k)} \rangle$ to $\mathcal{E}^{(k)}$ for $k = 1, \dots, K$
  end for

Theorem 1 (Proof in Appendices B and C). Suppose Assumptions 1-3 hold, the online linear optimization oracles have regret at most $\mathcal{R}_T^{\mathcal{E}}$, and the averaging parameters are chosen as $\rho_k = \frac{2}{(k+3)^{2/3}}$. Then for convex functions $f_1, \dots, f_T$ and step sizes $\eta_k = \frac{1}{k+3}$, the adversarial regret of Algorithm 1 is at most

$\frac{4TDQ^{1/2}}{K^{1/3}} + \frac{4T}{K}\left(M + \frac{LD^2}{3}\log(K+1)\right) + \frac{4}{3}\mathcal{R}_T^{\mathcal{E}}$

in expectation, where $M = \max_{1 \le t \le T}[f_t(x_1) - f_t(x^*)]$ and $Q \triangleq \max\{4^{2/3} \max_{1 \le t \le T} \|\nabla f_t(x_1)\|^2, 4\sigma^2 + 3(LD)^2/2\}$. For monotone continuous DR-submodular functions $f_1, \dots, f_T$ and step sizes $\eta_k = \frac{1}{K}$, the adversarial $(1 - 1/e)$-regret of Algorithm 1 is at most

$\frac{3TDQ^{1/2}}{K^{1/3}} + \frac{LD^2 T}{2K} + \mathcal{R}_T^{\mathcal{E}}$

in expectation, where $Q \triangleq \max\{\max_{1 \le t \le T} \|\nabla f_t(x_1)\|^2 4^{2/3}, 4\sigma^2 + 6L^2R^2\}$.

From Theorem 1, we observe that by setting $K = T^{3/2}$ and choosing a projection-free online linear optimization oracle with $\mathcal{R}_T^{\mathcal{E}} = O(\sqrt{T})$, such as Follow the Perturbed Leader (Cohen & Hazan, 2015), both regrets are bounded above by $O(\sqrt{T})$. We remark that the expectation in Theorem 1 is with respect to the stochastic gradient estimates.

4.2. Stochastic Online Setting

In the stochastic online setting, where functions are sampled i.i.d. $f_t \sim \mathcal{D}$, we can develop much simpler algorithms that still achieve sublinear regret. Algorithm 2 works without instantiating any online linear optimization oracles and requires only a single stochastic estimate of the gradient at each round.
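As a rough illustration of this single-gradient scheme in the convex case, the sketch below runs one averaging step and one Frank-Wolfe step per round. The simplex constraint set, the quadratic expected loss, the noise level, and the step-size schedules are assumptions made for the example; in particular, the linear step minimizes over the simplex because the example is a minimization problem.

```python
import numpy as np

def one_shot_frank_wolfe(grad_oracles, x1):
    """Convex-case sketch: one stochastic gradient and one FW step per round."""
    x = x1.copy()
    d = np.zeros_like(x)                    # running average of gradients
    plays = []
    for t, grad in enumerate(grad_oracles, start=1):
        plays.append(x.copy())              # play x_t
        rho = 2.0 / (t + 3) ** (2.0 / 3.0)  # assumed averaging weight
        eta = 1.0 / (t + 3)                 # assumed step size
        d = (1 - rho) * d + rho * grad(x)   # variance-reduced direction
        v = np.zeros_like(x)
        v[np.argmin(d)] = 1.0               # linear minimization over simplex
        x = (1 - eta) * x + eta * v         # convex update rule
    return plays

rng = np.random.default_rng(1)
dim, T = 4, 200
center = np.array([0.1, 0.2, 0.3, 0.4])     # minimizer of the expected loss
# Noisy gradients of f(x) = 0.5 * ||x - center||^2, an illustrative loss.
grad_oracles = [lambda x: (x - center) + 0.1 * rng.normal(size=dim)
                for _ in range(T)]
x1 = np.zeros(dim)
x1[0] = 1.0
os_plays = one_shot_frank_wolfe(grad_oracles, x1)
```

Each round touches the gradient oracle exactly once, which is the main practical difference from the multi-oracle scheme used in the adversarial setting.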
Because the functions in the stochastic setting are not arbitrarily chosen, variance reduction along with one Frank-Wolfe step suffices to achieve a sublinear regret bound.

Algorithm 2 One-Shot Frank-Wolfe
Input: convex set $\mathcal{K}$, time horizon $T$, step sizes $\rho_t \in (0, 1)$ and $\eta_t \in (0, 1)$, and initial point $x_1$
Output: $\{x_t : 1 \le t \le T\}$
  $d_0 \leftarrow 0$
  for $t \leftarrow 1, 2, 3, \dots, T$ do
    Play $x_t$, then obtain value $f_t(x_t)$ and unbiased oracle access to $\nabla f_t$
    $d_t \leftarrow (1 - \rho_t) d_{t-1} + \rho_t \tilde{\nabla} f_t(x_t)$
    $v_t \leftarrow \arg\max_{v \in \mathcal{K}} \langle d_t, v \rangle$
    $x_{t+1} \leftarrow \mathrm{update}(x_t, v_t, \eta_t)$
  end for

The differences in Algorithm 2 for convex and submodular optimization are similar to those in Algorithm 1. Namely, the update rules are the same, and the initial point $x_1$ may be arbitrarily chosen from $\mathcal{K}$ for convex optimization and set to $0$ for submodular optimization.

Theorem 2 (Proof in Appendices D and E). Suppose Assumptions 1-3 hold and the averaging parameters are chosen as $\rho_t = \frac{2}{(t+3)^{2/3}}$. Then for a convex expected function $f$ and step sizes $\eta_t = \frac{1}{t+3}$, the stochastic regret of Algorithm 2 is at most 4M log(T + ) + 6Q / DT / LD log (T + 3) in expectation, where $M = f(x_1) - f(x^*)$ and $Q \triangleq \max\{4^{2/3} \|\nabla f(x_1)\|^2, 4\sigma^2 + 3(LD)^2/2\}$. For expected functions $f$ which are monotone continuous DR-submodular and step sizes $\eta_k = \frac{1}{K}$, the stochastic $(1 - 1/e)$-regret of Algorithm 2 is at most ( /e)M + 3DQ/ (3T /3 + T ) + LD 0. in expectation, where $M = f(x^*) - f(0)$ and $Q \triangleq \max\{\|\nabla f(0)\|^2 4^{2/3}, 4\sigma^2 + 6L^2R^2\}$.

4.3. Lifting Methods for Discrete Online Optimization

One exciting application of our online continuous DR-submodular optimization algorithms is a new approach for online discrete submodular optimization. While previous methods could only handle knapsack constraints (Streeter & Golovin, 2009) or required expensive sampling procedures (Golovin et al., 2014), our continuous methods can be applied to the discrete setting to handle general matroid constraints with computationally cheap sampling procedures.

Suppose $f_1, \dots, f_T$ are nonnegative monotone submodular set functions on a ground set $\Omega$ with matroid constraint $\mathcal{I}$, and $\bar{f}_1, \dots, \bar{f}_T$ are the corresponding multilinear extensions with matroid polytope $\mathcal{K} \subseteq [0, 1]^n$. A discrete procedure that uses our continuous algorithm is as follows: at each round $t$, the online continuous algorithm produces a fractional solution $x_t \in \mathcal{K}$, which is then rounded to a set $X_t \in \mathcal{I}$ and played as the discrete solution. The value $f_t(X_t)$ is revealed and the player is granted access to the discrete function $f_t$. Then, the player supplies the continuous algorithm with a stochastic gradient estimate $\tilde{\nabla} \bar{f}_t$ obtained by a single function evaluation, as

$\frac{\partial \bar{f}_t(x)}{\partial x_i} = \mathbb{E}[f_t(R \cup \{i\}) - f_t(R)], \quad \forall i \in [n], \quad (1)$

where $R$ is a random subset of $[n] \setminus \{i\}$ such that for every $j \in [n] \setminus \{i\}$, the event $j \in R$ happens with an independent probability of $x_j$. Because a lossless rounding scheme is used, the discrete player enjoys a regret that is no worse than that of the continuous solution. Provably lossless rounding schemes include pipage rounding (Ageev & Sviridenko, 2004; Calinescu et al., 2011) and contention resolution (Vondrák et al., 2011).

Most discrete submodular maximization algorithms that go through the multilinear extension require a gradient estimate with high accuracy. In order to do this, they appeal to a concentration bound, which requires $O(n^2)$ evaluations of the discrete function for independently chosen samples. In stark contrast, our algorithms can handle stochastic gradient estimates and thus require only a single function evaluation, finally making continuous methods a reality for large-scale online discrete optimization problems. The framework of the one-sample lifting method is illustrated in Fig. 1.

[Figure 1. Diagram of the one-sample lifting method. Elements: submodular set function, multilinear extension, one sample, stochastic optimization algorithm A, continuous solution, rounding, discrete solution.]

As an example, we present in Algorithm 3 how to use Meta-Frank-Wolfe as an online maximization algorithm for submodular set functions. According to Theorem 1, the $(1 - 1/e)$-regret of Algorithm 3 is bounded by $\frac{3TDQ^{1/2}}{K^{1/3}} + \frac{LD^2 T}{2K} + \mathcal{R}_T^{\mathcal{E}}$, where $\mathcal{R}_T^{\mathcal{E}}$ is the regret of $\mathcal{E}^{(k)}$ up to horizon $T$. If one sets $\mathcal{E}^{(k)}$ to an online linear maximization algorithm with regret bound $O(\sqrt{T})$ and sets $K = T^{3/2}$, the $(1 - 1/e)$-regret is at most $O(\sqrt{T})$.

Algorithm 3 Meta-Frank-Wolfe for online discrete submodular maximization
Input: matroid constraint $\mathcal{I}$, time horizon $T$, linear optimization oracles $\mathcal{E}^{(1)}, \dots, \mathcal{E}^{(K)}$, step sizes $\rho_k \in (0, 1)$ and $\eta_k \in (0, 1)$, and initial point $x_1$
Output: $\{X_t : 1 \le t \le T\}$
  Initialize online linear optimization oracles $\mathcal{E}^{(1)}, \dots, \mathcal{E}^{(K)}$, setting the constraint set to the matroid polytope of $\mathcal{I}$
  Initialize $d_t^{(0)} = 0$ and $x_t^{(1)} = x_1$
  for $t \leftarrow 1, 2, 3, \dots, T$ do
    $v_t^{(k)} \leftarrow$ output of oracle $\mathcal{E}^{(k)}$ in round $t - 1$
    $x_t^{(k+1)} \leftarrow \mathrm{update}(x_t^{(k)}, v_t^{(k)}, \eta_k)$ for $k = 1, \dots, K$
    $x_t \leftarrow x_t^{(K+1)}$
    Play $X_t \leftarrow \mathrm{round}(x_t)$, obtain value $f_t(X_t)$ and observe the function $f_t$
    Sample $\tilde{\nabla} \bar{f}_t(x_t^{(k)})$ for $k = 0, \dots, K$
    $d_t^{(k)} \leftarrow (1 - \rho_k) d_t^{(k-1)} + \rho_k \tilde{\nabla} \bar{f}_t(x_t^{(k)})$ for $k = 1, \dots, K$
    Feedback $\langle v_t^{(k)}, d_t^{(k)} \rangle$ to $\mathcal{E}^{(k)}$ for $k = 1, \dots, K$
  end for

5. Experiment

In this section, we test our online algorithms for monotone continuous DR-submodular and convex optimization on both real-world and synthetic data sets. We find that our algorithms outperform most baselines, including projected gradient descent, when supplied with stochastic gradient estimates. All code was written in the Julia programming language and tested on a Macintosh desktop with an Intel Processor i7 with 6 GB of RAM. No parts of the code were optimized past basic Julia usage. A list of all algorithms to be compared in this section is presented below.

Meta-Frank-Wolfe is Algorithm 1. We compare the variance-reduced Meta-Frank-Wolfe algorithm and the analogue without variance reduction, denoted Meta-FW w/ VR and Meta-FW w/o VR, respectively.

One-Shot Frank-Wolfe is Algorithm 2. We compare the One-Shot online Frank-Wolfe algorithm with and without variance reduction, denoted OS-FW w/ VR and OS-FW w/o VR, respectively.
Regularized online Frank-Wolfe is referred to as the online conditional gradient algorithm in (Hazan et al., 2016). It has a regularizer term when computing the gradient. Thus we term it the regularized online Frank-Wolfe algorithm and denote it as Regularized-OFW.

Online projected gradient ascent (OGA) follows the direction of the projected gradient. Its $1/2$-regret is at most $O(\sqrt{T})$ for online monotone continuous DR-submodular maximization if the step size is set to $\Theta(1/\sqrt{t})$ on the $t$-th iteration (Chen et al., 2018). Note that OGA is not a projection-free algorithm. In the setting of convex minimization, we use online projected gradient descent instead (denoted by OGD).

When we perform experiments on discrete submodular maximization problems using our lifting method, we also compare the above algorithms with the Online Greedy algorithm (Streeter & Golovin, 2009).

5.1. Online DR-Submodular Maximization

In order to test the performance of algorithms for online maximization of monotone continuous DR-submodular functions with stochastic gradient estimates, we conducted three sets of experiments on real-world datasets. We approximate the $(1 - 1/e)$-regret by running an offline Frank-Wolfe maximization to produce a solution that is a $(1 - 1/e)$ approximation to the optimum.

Joke Recommendations (Continuous). The first set of experiments is to optimize a sequence of continuous facility location objectives on the Jester dataset (Goldberg et al., 2001). It contains ratings of 100 jokes from 73,421 users, and the rating range is $[-10, 10]$. We re-scale the rating range into $[0, 20]$ so that all ratings are nonnegative. Let $R_{uj}$ be user $u$'s rating of joke $j$. All users are split into disjoint batches $B_1, B_2, \dots, B_T$, each containing $B$ users. The facility location objective is defined as

$f_t(X) = \sum_{u \in B_t} \max_{j \in X} R_{uj}, \quad \forall X \subseteq [J]$,

where $J = 100$ is the total number of jokes and $[J] = \{1, 2, 3, \dots, J\}$. Its multilinear extension is given by

$\bar{f}_t(x) = \sum_{u \in B_t} \sum_{l=1}^{J} R_{u j_u^l} \, x_{j_u^l} \prod_{m=1}^{l-1} (1 - x_{j_u^m}), \quad \forall x \in [0, 1]^J$,

where $j_u^1, j_u^2, \dots, j_u^J$ is a permutation of $1, 2, \dots, J$ such that $R_{u j_u^1} \ge R_{u j_u^2} \ge \dots \ge R_{u j_u^J}$ (Iyer et al., 2014). In this experiment, the sequence of objective functions to be optimized is $\{\bar{f}_1, \bar{f}_2, \dots, \bar{f}_T\}$. The stochastic gradient is obtained by the sampling method given in Eq. (1) with only one sample for each coordinate of the gradient.
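The one-sample estimator of Eq. (1) can be sketched for a facility location objective as follows. The tiny ratings matrix and the function names are hypothetical and serve only to make the example self-contained; each coordinate uses one freshly drawn random set $R$ and the marginal gain of adding item $i$ to it.

```python
import numpy as np

def facility_location(ratings, subset):
    """f(X) = sum over users of the best rating among items in X (0 if empty)."""
    if len(subset) == 0:
        return 0.0
    return float(ratings[:, sorted(subset)].max(axis=1).sum())

def one_sample_gradient(f, x, rng):
    """One-sample estimate of the multilinear-extension gradient, per Eq. (1)."""
    n = x.size
    grad = np.zeros(n)
    for i in range(n):
        include = rng.random(n) < x        # j joins R independently w.p. x_j
        include[i] = False                 # R is a subset of [n] \ {i}
        R = set(np.flatnonzero(include).tolist())
        grad[i] = f(R | {i}) - f(R)        # marginal gain of item i
    return grad

ratings = np.array([[1.0, 2.0, 0.0],       # hypothetical 2 users x 3 items
                    [3.0, 0.0, 1.0]])
f = lambda S: facility_location(ratings, S)
rng = np.random.default_rng(0)
estimate = one_sample_gradient(f, np.array([0.5, 0.5, 0.5]), rng)
print(estimate)
```

At the two extreme points the estimator is deterministic: with all coordinates at 0 the set $R$ is always empty, and with all coordinates at 1 it always contains every other item, which gives a convenient sanity check on an implementation.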
For the continuous facility location experiment, we set the constraint set to $\{x \in [0, 1]^J : \mathbf{1}^\top x \le 1\}$ and choose $B = 5$. We present the results in Fig. 2(a). Meta-FW w/ VR attains the smallest regret. Its counterpart without variance reduction, Meta-FW w/o VR, is inferior to Meta-FW w/ VR in terms of the regret. OS-FW w/ VR outperforms OS-FW w/o VR, which suggests that the variance reduction technique improves the performance of the algorithms.

Joke Recommendations (Discrete). In the second set of experiments, we consider online maximization of discrete submodular functions. The problem setup is the same as before, but instead of evaluating the regret of the multilinear extensions, we round solutions using pipage rounding and evaluate the regret on the discrete submodular functions. We set the batch size $B$ to 40 and we recommend 0 jokes for users. The results are illustrated in Fig. 2(b). We observe that Meta-FW w/ VR outperforms all other algorithms again. The projected algorithm OGA is second to Meta-FW w/ VR. Online Greedy appears better only than Regularized-OFW. The experimental results show that the continuous algorithms designed under the framework of the lifting method perform better than the discrete algorithms.

Topic Summarization. We consider the problem of selecting news documents in order to maximize the probabilistic coverage of news topics (El-Arini et al., 2009; Yue & Guestrin, 2011). We applied latent Dirichlet allocation to the corpus of Reuters-21578, Distribution 1.0, set the number of topics to 10, and extracted the topic distribution of each news document. We sample $T$ batches of news documents from the corpus and denote them by $B_1, B_2, \dots, B_T$, where each batch contains 50 randomly sampled documents. For each batch $B_i$, we define the probabilistic coverage function as follows:

$f_i(X) = \frac{1}{10} \sum_{j=1}^{10} \left[1 - \prod_{a \in X} (1 - p_a(j))\right], \quad \forall X \subseteq B_i$,

where $p_a(\cdot)$ is the topic distribution of news document $a$. Its multilinear extension is

$\bar{f}_i(x) = \frac{1}{10} \sum_{j=1}^{10} \left[1 - \prod_{a \in B_i} (1 - p_a(j) x_a)\right], \quad \forall x \in [0, 1]^{50}$;

see (Iyer et al., 2014). The sequence of objective functions that the algorithms are expected to maximize is $\bar{f}_1, \bar{f}_2, \dots, \bar{f}_T$.
As in the experiments on joke recommendations, the stochastic gradient is obtained by the sampling method given in Eq. (1) with only one sample for each coordinate of the gradient. The constraint set is $\{x \in [0, 1]^{50} : \mathbf{1}^\top x \le 45\}$. We show the $(1 - 1/e)$-regret of the algorithms in Fig. 2(c). Again, Meta-FW w/ VR exhibits lower regret than any other algorithm. Its non-variance-reduced counterpart Meta-FW w/o VR is second to it. OS-FW w/ VR outperforms OS-FW w/o VR, which confirms the improvement brought by the variance reduction technique.

5.2. Online Convex Minimization

The next two sets of experiments test the performance of the algorithms for online minimization of convex functions with stochastic gradient estimates. For these experiments, the regret is computed by obtaining the offline solutions with a Frank-Wolfe solver.

[Figure 2. Panels: (a) Continuous facility location on Jester dataset; (b) Discrete facility location on Jester dataset; (c) News recommendation in Reuters corpus; (d) Network flow; (e) Matrix completion; (f) Execution time of matrix completion. Panels (a) to (c) plot $(1 - 1/e)$-regret against iteration index, panels (d) and (e) plot regret against iteration index, and panel (f) reports execution time in seconds per algorithm. Figs. 2(a) to 2(c) show the $(1 - 1/e)$-regret of online DR-submodular maximization algorithms. In Figs. 2(a) and 2(b), we show the regret for the continuous and discrete facility location objective functions on the Jester dataset, respectively. We show the results for the online news recommendation problem in the Reuters corpus in Fig. 2(c). The results for online convex minimization are illustrated in Figs. 2(d) to 2(f). In Fig. 2(d), we show the regret of the algorithms applied to the stochastic cost network flow problem. The results for the matrix completion problem are shown in Fig. 2(e) and the computational time is illustrated in Fig. 2(f).]

Stochastic Cost Network Flow. The fourth set of experiments is a minimum stochastic cost flow in a directed network. A directed graph $G = (V, E)$ with source $s \in V$, sink $v \in V$, and edge capacities $c : E \to \mathbb{R}_+$ is known to the player. A flow is a function $x : E \to \mathbb{R}_+$ that satisfies the capacities on each edge, $0 \le x(e) \le c(e)$, and obeys the conservation laws for all vertices $z$,

$\sum_{\{z, r\} \in E} x(r) = \begin{cases} a & z = s \\ -a & z = v \\ 0 & \text{otherwise} \end{cases}$

for some fixed $a \ge 0$.
In each round, a convex cost function on the flow $f_t : \mathbb{R}^E_+ \to \mathbb{R}_+$ is drawn from a distribution, unknown to the player. The goal is to minimize the stochastic regret of the flows chosen. Linear optimizations for this problem may be implemented as combinatorial network flow algorithms. We used the directed Zachary Karate network with 34 nodes and 78 arcs (Zachary, 1977). We set all edge capacities to […] and cost functions are of the form $f_t(x) = \sum_{e \in E} w_e x(e)^2$ where $w_e \sim \mathrm{Unif}[…]$. The results are presented in Fig. 1(d). Meta-FW w/ VR attains the lowest regret among all baselines. Again, the regret of Meta-FW w/o VR is larger than that of the variance-reduced Meta-FW w/ VR. Similarly, OS-FW w/ VR also outperforms OS-FW w/o VR.

Matrix Completion

In the online convex matrix completion problem, one would like to construct a low rank matrix $X \in \mathbb{R}^{m \times n}$ that well-approximates a given matrix $M \in \mathbb{R}^{m \times n}$ on observed entries $OB \subseteq [m] \times [n]$. The convex relaxation is
\[
\min_{\mathrm{Trace}(X) \le k} \sum_{(i,j) \in OB} (X_{i,j} - M_{i,j})^2 .
\]
In the online setting, observed entries of the matrix arrive in $T$ batches, $OB_1, OB_2, \ldots, OB_T$, each of size $B$. In each round, we construct a low-rank matrix to minimize the total regret over the $T$ rounds. Although projection involves a full singular value decomposition, linear optimization here is simply a calculation of the largest singular vectors of $(X - M)_{OB}$; see Chapter 7 of (Hazan et al., 2016). In our experiment, $M$ is a rank […] matrix with $m = n = 50$, and $B = […]$. We illustrate the results in Fig. 1(e) and the computational time is shown in Fig. 1(f). Meta-FW w/ VR is second only to OGD in terms of regret. However, OGD is the slowest algorithm due to the computationally expensive projection operations, and its computational time is five times that of Meta-FW w/ VR. The non-variance-reduced Meta-FW w/o VR is inferior to Meta-FW w/ VR in terms of regret.

Acknowledgments

AK was supported by AFOSR YIP (FA ). CH was supported in part by NSF GRFP (DGE49) and by ONR Award N

References

Agarwal, A., Hazan, E., Kale, S., and Schapire, R. E. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pp. 9–16, 2006.

Ageev, A. A. and Sviridenko, M. I. Pipage rounding: A new method of constructing algorithms with proven performance guarantee. Journal of Combinatorial Optimization, 8(3):307–328, 2004.

Allen-Zhu, Z. and Hazan, E. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, 2016.

Bach, F. Submodular functions: from discrete to continuous domains. arXiv preprint, 2015.

Bian, A., Mirzasoleiman, B., Buhmann, J. M., and Krause, A. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In AISTATS, February 2017.

Calandriello, D., Lazaric, A., and Valko, M. Second-order kernel online convex optimization with adaptive sketching. In International Conference on Machine Learning, ICML '17, 2017.

Calinescu, G., Chekuri, C., Pál, M., and Vondrák, J. Maximizing a monotone submodular function subject to a matroid constraint. SIAM Journal on Computing, 40(6), 2011.

Chen, L., Hassani, H., and Karbasi, A. Online continuous submodular maximization. In AISTATS, to appear, 2018.

Cohen, A. and Hazan, T. Following the perturbed leader for online structured learning. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML '15, 2015.

El-Arini, K., Veda, G., Shahaf, D., and Guestrin, C. Turning down the noise in the blogosphere. In SIGKDD. ACM, 2009.

Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics (NRL), 3(1-2):95–110, 1956.

Garber, D. and Hazan, E. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint, 2013.

Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.

Golovin, D., Krause, A., and Streeter, M. Online submodular maximization under a matroid constraint with application to learning assignments. Technical report, arXiv, 2014.

Hassani, H., Soltanolkotabi, M., and Karbasi, A. Gradient methods for submodular maximization. arXiv preprint, 2017.

Hazan, E. and Kale, S. Projection-free online learning. In ICML, 2012.

Hazan, E. and Luo, H. Variance-reduced and projection-free stochastic optimization. In ICML, pp. 1263–1271, 2016.

Hazan, E. et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

Iyer, R., Jegelka, S., and Bilmes, J. Monotone closure of relaxed constraints in submodular optimization: Connections between minimization and maximization. In Uncertainty in Artificial Intelligence (UAI), Quebec City, Quebec, Canada, July 2014. AUAI.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.

Lafond, J., Wai, H.-T., and Moulines, E. On the online Frank-Wolfe algorithms for convex and non-convex optimizations. arXiv preprint arXiv:1510.01171, 2015.

Mahdavi, M., Zhang, L., and Jin, R. Mixed optimization for smooth functions. In NIPS, 2013.

Mokhtari, A., Hassani, H., and Karbasi, A. Conditional gradient method for stochastic submodular maximization: Closing the gap. In AISTATS, 2018a.

Mokhtari, A., Hassani, H., and Karbasi, A. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv preprint, 2018b.

Shalev-Shwartz, S. et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

Streeter, M. and Golovin, D. An online algorithm for maximizing submodular functions. In NIPS, 2009.

Vondrák, J. Submodularity in combinatorial optimization. PhD thesis, Charles University, 2007.

Vondrák, J., Chekuri, C., and Zenklusen, R. Submodular function maximization via the multilinear relaxation and contention resolution schemes. In STOC. ACM, 2011.

Wang, Y. and Boyd, S. Fast model predictive control using online optimization. IFAC Proceedings Volumes, 41(2), 2008. IFAC World Congress.

Wolsey, L. A. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982.

Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11, December 2010.

Yue, Y. and Guestrin, C. Linear submodular bandits and their application to diversified retrieval. In NIPS, 2011.

Zachary, W. W. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4):452–473, 1977.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

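A. Variance Reduction Theorem

Each of our results relies on a recent variance reduction technique proposed by (Mokhtari et al., 2018a;b). We now present Theorem 3, which appears as a lemma in (Mokhtari et al., 2018a). Although the proof is essentially the same, we present it here so that the paper is self-contained. When we apply Theorem 3 in the analysis of our algorithms, $\{a_t\}$ is a sequence of gradients, $\{\tilde a_t\}$ are stochastic gradient estimates, and $\{d_t\}$ is the sequence of averaged gradient estimates. Moreover, the upper bound on the norm of the difference of gradients $\|a_t - a_{t-1}\|$ comes from the iterate update procedure and smoothness of the objective function.

Theorem 3. Let $\{a_t\}_{t=0}^T$ be a sequence of points in $\mathbb{R}^n$ such that $\|a_t - a_{t-1}\| \le G/(t+s)$ for all $1 \le t \le T$ with fixed constants $G \ge 0$ and $s \ge 3$. Let $\{\tilde a_t\}_{t=1}^T$ be a sequence of random variables such that $\mathbb{E}[\tilde a_t \mid \mathcal{F}_{t-1}] = a_t$ and $\mathbb{E}[\|\tilde a_t - a_t\|^2 \mid \mathcal{F}_{t-1}] \le \sigma^2$ for every $1 \le t \le T$, where $\mathcal{F}_{t-1}$ is the $\sigma$-field generated by $\{\tilde a_i\}_{i=1}^{t-1}$ and $\mathcal{F}_0 = \emptyset$. Let $\{d_t\}_{t=0}^T$ be a sequence of random variables where $d_0$ is fixed and subsequent $d_t$ are obtained by the recurrence
\[
d_t = (1 - \rho_t) d_{t-1} + \rho_t \tilde a_t
\]
with $\rho_t = \frac{2}{(t+s)^{2/3}}$. Then, we have
\[
\mathbb{E}[\|a_t - d_t\|^2] \le \frac{Q}{(t+s+1)^{2/3}},
\]
where $Q \triangleq \max\{\|a_0 - d_0\|^2 (s+1)^{2/3},\; 4\sigma^2 + 3G^2/2\}$.

We remark that we only need $s \ge 2^{3/2} - 1 \approx 1.83$ in the statement of Theorem 3.

Proof. Let $\Delta_t = \|a_t - d_t\|^2$. We have the following identity:
\[
\Delta_t = \|\rho_t (a_t - \tilde a_t) + (1-\rho_t)(a_t - a_{t-1}) + (1-\rho_t)(a_{t-1} - d_{t-1})\|^2 .
\]
Expanding the square and taking the expectation with respect to $\mathcal{F}_{t-1}$ gives
\[
\mathbb{E}[\Delta_t \mid \mathcal{F}_{t-1}] \le \rho_t^2 \sigma^2 + (1-\rho_t)^2 \frac{G^2}{(t+s)^2} + (1-\rho_t)^2 \Delta_{t-1} + 2(1-\rho_t)^2\, \mathbb{E}[\langle a_t - a_{t-1}, a_{t-1} - d_{t-1} \rangle \mid \mathcal{F}_{t-1}] .
\]
Taking the expectation again gives
\[
\mathbb{E}[\Delta_t] \le \rho_t^2 \sigma^2 + (1-\rho_t)^2 \frac{G^2}{(t+s)^2} + (1-\rho_t)^2 \mathbb{E}[\Delta_{t-1}] + 2(1-\rho_t)^2\, \mathbb{E}[\langle a_t - a_{t-1}, a_{t-1} - d_{t-1} \rangle] .
\]
By Young's inequality, we have
\[
2 \langle a_t - a_{t-1}, a_{t-1} - d_{t-1} \rangle \le \beta_t \|a_{t-1} - d_{t-1}\|^2 + \frac{1}{\beta_t} \cdot \frac{G^2}{(t+s)^2} .
\]
Therefore we deduce
\[
\begin{aligned}
\mathbb{E}[\Delta_t] &\le \rho_t^2 \sigma^2 + (1-\rho_t)^2 \frac{G^2}{(t+s)^2} + (1-\rho_t)^2 \mathbb{E}[\Delta_{t-1}] + (1-\rho_t)^2 \left( \beta_t \mathbb{E}[\Delta_{t-1}] + \frac{1}{\beta_t} \cdot \frac{G^2}{(t+s)^2} \right) \\
&= \rho_t^2 \sigma^2 + \frac{G^2}{(t+s)^2} (1-\rho_t)^2 \left(1 + \frac{1}{\beta_t}\right) + \mathbb{E}[\Delta_{t-1}] (1-\rho_t)^2 (1 + \beta_t) .
\end{aligned}
\]
We write $z_t$ for $\mathbb{E}[\Delta_t]$. Notice that $(1-\rho_t)(1 + \rho_t/2) \le 1$ as long as $\rho_t \ge 0$. If we assume $\rho_t \in [0, 1]$, setting $\beta_t = \rho_t/2$ yields
\[
\begin{aligned}
z_t &\le \rho_t^2 \sigma^2 + \frac{G^2}{(t+s)^2} (1-\rho_t)^2 \left(1 + \frac{2}{\rho_t}\right) + z_{t-1} (1-\rho_t)^2 \left(1 + \frac{\rho_t}{2}\right) \\
&\le \rho_t^2 \sigma^2 + \frac{G^2}{(t+s)^2} \left(1 + \frac{2}{\rho_t}\right) + z_{t-1} (1-\rho_t) .
\end{aligned}
\]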

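We set $\rho_t = \frac{2}{(t+s)^{2/3}}$, where $s \ge 2^{3/2} - 1$. Since $(t+s)^2 = (t+s)^{4/3} (t+s)^{2/3} \ge 2 (t+s)^{4/3}$, we have
\[
\begin{aligned}
z_t &\le \left(1 - \frac{2}{(t+s)^{2/3}}\right) z_{t-1} + \frac{4\sigma^2}{(t+s)^{4/3}} + \frac{G^2}{(t+s)^2} + \frac{G^2}{(t+s)^{4/3}} \\
&\le \left(1 - \frac{2}{(t+s)^{2/3}}\right) z_{t-1} + \frac{4\sigma^2}{(t+s)^{4/3}} + \frac{3G^2}{2(t+s)^{4/3}} \\
&= \left(1 - \frac{2}{(t+s)^{2/3}}\right) z_{t-1} + \frac{4\sigma^2 + 3G^2/2}{(t+s)^{4/3}} \\
&\le \left(1 - \frac{2}{(t+s)^{2/3}}\right) z_{t-1} + \frac{Q}{(t+s)^{4/3}} .
\end{aligned}
\]
We claim $z_t \le \frac{Q}{(t+s+1)^{2/3}}$ for $0 \le t \le T$ and show this by induction. It holds for $t = 0$ due to the definition of $Q$. Now we assume that it is true for $t = k - 1$. We have
\[
\begin{aligned}
z_k &\le \left(1 - \frac{2}{(k+s)^{2/3}}\right) z_{k-1} + \frac{Q}{(k+s)^{4/3}} \\
&\le \left(1 - \frac{2}{(k+s)^{2/3}}\right) \frac{Q}{(k+s)^{2/3}} + \frac{Q}{(k+s)^{4/3}} \\
&= Q\, \frac{(k+s)^{2/3} - 1}{(k+s)^{4/3}} .
\end{aligned}
\]
In order to show that $z_k \le \frac{Q}{(k+s+1)^{2/3}}$, it suffices to show that
\[
\left((k+s)^{2/3} - 1\right) (k+s+1)^{2/3} \le (k+s)^{4/3} .
\]
The above inequality holds since $(k+s+1)^{2/3} \le (k+s)^{2/3} + 1$.

B. Proof of Theorem 1: Convex Case

We begin by examining the sequence of iterates $x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(K+1)}$ produced in Algorithm 1 for a fixed $t$. By definition of the update and because $f_t$ is $L$-smooth, we have
\[
\begin{aligned}
f_t(x_t^{(k+1)}) - f_t(x^*) &= f_t\big(x_t^{(k)} + \eta_k (v_t^{(k)} - x_t^{(k)})\big) - f_t(x^*) \\
&\le f_t(x_t^{(k)}) - f_t(x^*) + \eta_k \langle \nabla f_t(x_t^{(k)}), v_t^{(k)} - x_t^{(k)} \rangle + \eta_k^2 \frac{L}{2} \|v_t^{(k)} - x_t^{(k)}\|^2 \\
&\le f_t(x_t^{(k)}) - f_t(x^*) + \eta_k \langle \nabla f_t(x_t^{(k)}), v_t^{(k)} - x_t^{(k)} \rangle + \eta_k^2 \frac{L D^2}{2} .
\end{aligned}
\]
Now, observe that the dual pairing may be decomposed as
\[
\langle \nabla f_t(x_t^{(k)}), v_t^{(k)} - x_t^{(k)} \rangle = \langle \nabla f_t(x_t^{(k)}) - d_t^{(k)}, v_t^{(k)} - x^* \rangle + \langle \nabla f_t(x_t^{(k)}), x^* - x_t^{(k)} \rangle + \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle .
\]
We can bound the first term using Young's inequality to get
\[
\begin{aligned}
\langle \nabla f_t(x_t^{(k)}) - d_t^{(k)}, v_t^{(k)} - x^* \rangle &\le \frac{1}{2\beta_k} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + 2\beta_k \|v_t^{(k)} - x^*\|^2 \\
&\le \frac{1}{2\beta_k} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + 2\beta_k D^2
\end{aligned}
\]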

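for any $\beta_k > 0$, which will be chosen later in the proof. We may also bound the second term in the decomposition of the dual pairing using convexity of $f_t$, i.e.
\[
\langle \nabla f_t(x_t^{(k)}), x^* - x_t^{(k)} \rangle \le f_t(x^*) - f_t(x_t^{(k)}) .
\]
Using these upper bounds, we get that
\[
\langle \nabla f_t(x_t^{(k)}), v_t^{(k)} - x_t^{(k)} \rangle \le \frac{1}{2\beta_k} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + 2\beta_k D^2 + f_t(x^*) - f_t(x_t^{(k)}) + \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle .
\]
Using this upper bound on the dual pairing in the first inequality, we get that
\[
f_t(x_t^{(k+1)}) - f_t(x^*) \le (1 - \eta_k)\big(f_t(x_t^{(k)}) - f_t(x^*)\big) + \eta_k \left[ \frac{1}{2\beta_k} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + 2\beta_k D^2 + \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle \right] + \eta_k^2 \frac{L D^2}{2} .
\]
Now we will apply the variance reduction technique. Note that
\[
\|\nabla f_t(x_t^{(k+1)}) - \nabla f_t(x_t^{(k)})\| \le L \|x_t^{(k+1)} - x_t^{(k)}\| \le L \eta_k \|x_t^{(k)} - v_t^{(k)}\| \le \frac{L D}{k+3},
\]
where we have used that $f_t$ is $L$-smooth, the convex update, and that the step size is $\eta_k = \frac{1}{k+3}$. Now, using Theorem 3 with $G = LD$ and $s = 3$, we have that
\[
\mathbb{E}[\|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2] \le \frac{Q_t}{(k+4)^{2/3}} \le \frac{Q}{(k+4)^{2/3}},
\]
where $Q_t \triangleq \max\{\|\nabla f_t(x_1)\|^2 4^{2/3},\; 4\sigma^2 + 3(LD)^2/2\}$ and $Q \triangleq \max\{4^{2/3} \max_{1 \le t \le T} \|\nabla f_t(x_1)\|^2,\; 4\sigma^2 + 3(LD)^2/2\}$. Thus, taking the expectation of both sides of the optimality gap and setting $\beta_k = \frac{Q^{1/2}}{2D(k+4)^{1/3}}$ yields
\[
\mathbb{E}[f_t(x_t^{(k+1)})] - f_t(x^*) \le (1 - \eta_k)\big(\mathbb{E}[f_t(x_t^{(k)})] - f_t(x^*)\big) + \eta_k \left[ \frac{2 Q^{1/2} D}{(k+4)^{1/3}} + \mathbb{E}\langle d_t^{(k)}, v_t^{(k)} - x^* \rangle \right] + \eta_k^2 \frac{L D^2}{2} .
\]
Now we have obtained an upper bound on the expected optimality gap $\mathbb{E}[f_t(x_t^{(k+1)})] - f_t(x^*)$ in terms of the expected optimality gap $\mathbb{E}[f_t(x_t^{(k)})] - f_t(x^*)$ in the previous iteration. By induction on $k$, we get that the final iterate in the sequence, $x_t \triangleq x_t^{(K+1)}$, satisfies the following expected optimality gap:
\[
\mathbb{E}[f_t(x_t)] - f_t(x^*) \le \prod_{k=1}^K (1 - \eta_k) \big[f_t(x_1) - f_t(x^*)\big] + \sum_{k=1}^K \eta_k \prod_{j=k+1}^K (1 - \eta_j) \left[ \frac{2 Q^{1/2} D}{(k+4)^{1/3}} + \mathbb{E}\langle d_t^{(k)}, v_t^{(k)} - x^* \rangle + \eta_k \frac{L D^2}{2} \right] . \tag{1}
\]
Recall that the Frank-Wolfe step sizes are $\eta_k = \frac{1}{k+3}$.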
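We may obtain upper bounds on products of the form $\prod_{k=r}^K (1 - \eta_k)$ by
\[
\prod_{k=r}^K (1 - \eta_k) = \prod_{k=r}^K \left(1 - \frac{1}{k+3}\right) \le \exp\left(-\sum_{k=r}^K \frac{1}{k+3}\right) \le \exp\left(-\int_{x=r}^{K+1} \frac{dx}{x+3}\right) = \frac{r+3}{K+4} \le \frac{r+3}{K} . \tag{2}
\]
Substituting the step sizes $\eta_k = \frac{1}{k+3}$ into Eq. (1) and using this upper bound yields
\[
\mathbb{E}[f_t(x_t)] - f_t(x^*) \le \frac{4}{K} \big[f_t(x_1) - f_t(x^*)\big] + \sum_{k=1}^K \left(\frac{1}{k+3}\right)\left(\frac{k+4}{K}\right) \left[ \frac{2 Q^{1/2} D}{(k+4)^{1/3}} + \mathbb{E}\langle d_t^{(k)}, v_t^{(k)} - x^* \rangle + \frac{L D^2}{2(k+3)} \right], \tag{3}
\]
which may be further simplified by using $\left(\frac{1}{k+3}\right)\left(\frac{k+4}{K}\right) \le \frac{4}{3K}$ to obtain
\[
\mathbb{E}[f_t(x_t)] - f_t(x^*) \le \frac{4}{K} \big[f_t(x_1) - f_t(x^*)\big] + \frac{4}{3K} \sum_{k=1}^K \left[ \frac{2 Q^{1/2} D}{(k+3)^{1/3}} + \mathbb{E}\langle d_t^{(k)}, v_t^{(k)} - x^* \rangle + \frac{L D^2}{2(k+3)} \right] .
\]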

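As before, we can obtain the following upper bounds using integral methods:
\[
\sum_{k=1}^K \frac{1}{k+3} \le \log\left(\frac{K+3}{3}\right) \le \log(K+1)
\quad \text{and} \quad
\sum_{k=1}^K \frac{1}{(k+3)^{1/3}} \le \frac{3}{2} \left( (K+3)^{2/3} - 3^{2/3} \right) \le \frac{3}{2} K^{2/3} .
\]
Substituting these bounds into Eq. (3) yields
\[
\mathbb{E}[f_t(x_t)] - f_t(x^*) \le \frac{4}{K} \big[f_t(x_1) - f_t(x^*)\big] + \frac{4 Q^{1/2} D}{K^{1/3}} + \frac{4 L D^2 \log(K+1)}{3K} + \frac{4}{3K} \sum_{k=1}^K \mathbb{E}\langle d_t^{(k)}, v_t^{(k)} - x^* \rangle .
\]
Now, we can begin to bound the regret by summing over all $t = 1, \ldots, T$ to obtain
\[
\sum_{t=1}^T \mathbb{E}[f_t(x_t)] - \sum_{t=1}^T f_t(x^*) \le \frac{4}{K} \sum_{t=1}^T \big[f_t(x_1) - f_t(x^*)\big] + \frac{4 T Q^{1/2} D}{K^{1/3}} + \frac{4 T L D^2 \log(K+1)}{3K} + \frac{4}{3K} \sum_{k=1}^K \mathbb{E}\left[ \sum_{t=1}^T \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle \right] .
\]
Recall that for a fixed $k$, the sequence $\{v_t^{(k)}\}_{t=1}^T$ is produced by an online linear minimization oracle with regret $R_T^{\mathcal{E}}$, so that
\[
\sum_{t=1}^T \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle \le \sum_{t=1}^T \langle d_t^{(k)}, v_t^{(k)} \rangle - \min_{x \in \mathcal{K}} \sum_{t=1}^T \langle d_t^{(k)}, x \rangle \le R_T^{\mathcal{E}} .
\]
Substituting this into the upper bound and using $M = \max_{1 \le t \le T} [f_t(x_1) - f_t(x^*)]$ yields
\[
\sum_{t=1}^T \mathbb{E}[f_t(x_t)] - \sum_{t=1}^T f_t(x^*) \le \frac{4 T D Q^{1/2}}{K^{1/3}} + \frac{4T}{K} \left( M + \frac{L D^2}{3} \log(K+1) \right) + \frac{4}{3} R_T^{\mathcal{E}} .
\]
Now, setting $K = T^{3/2}$ and using a linear oracle with $R_T^{\mathcal{E}} = O(\sqrt{T})$ yields
\[
\sum_{t=1}^T \mathbb{E}[f_t(x_t)] - \sum_{t=1}^T f_t(x^*) \le 4 \sqrt{T} D Q^{1/2} + \frac{4}{\sqrt{T}} \left( M + \frac{L D^2}{3} \log(T^{3/2} + 1) \right) + \frac{4}{3} R_T^{\mathcal{E}} = O(\sqrt{T}) .
\]

C. Proof of Theorem 1: DR-Submodular Case

Using the smoothness of $f_t$ and recalling $x_t^{(k+1)} - x_t^{(k)} = \frac{1}{K} v_t^{(k)}$, we have
\[
\begin{aligned}
f_t(x_t^{(k+1)}) &\ge f_t(x_t^{(k)}) + \langle \nabla f_t(x_t^{(k)}), x_t^{(k+1)} - x_t^{(k)} \rangle - \frac{L}{2} \|x_t^{(k+1)} - x_t^{(k)}\|^2 \\
&= f_t(x_t^{(k)}) + \frac{1}{K} \langle \nabla f_t(x_t^{(k)}), v_t^{(k)} \rangle - \frac{L}{2K^2} \|v_t^{(k)}\|^2 \\
&\ge f_t(x_t^{(k)}) + \frac{1}{K} \langle \nabla f_t(x_t^{(k)}), v_t^{(k)} \rangle - \frac{L D^2}{2K^2} .
\end{aligned} \tag{4}
\]
We can re-write the term $\langle \nabla f_t(x_t^{(k)}), v_t^{(k)} \rangle$ as
\[
\begin{aligned}
\langle \nabla f_t(x_t^{(k)}), v_t^{(k)} \rangle &= \langle \nabla f_t(x_t^{(k)}) - d_t^{(k)}, v_t^{(k)} \rangle + \langle d_t^{(k)}, v_t^{(k)} \rangle \\
&= \langle \nabla f_t(x_t^{(k)}) - d_t^{(k)}, v_t^{(k)} \rangle + \langle d_t^{(k)}, x^* \rangle + \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle \\
&= \langle \nabla f_t(x_t^{(k)}) - d_t^{(k)}, v_t^{(k)} - x^* \rangle + \langle \nabla f_t(x_t^{(k)}), x^* \rangle + \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle .
\end{aligned} \tag{5}
\]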

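We claim $\langle \nabla f_t(x_t^{(k)}), x^* \rangle \ge f_t(x^*) - f_t(x_t^{(k)})$. Indeed, using monotonicity of $f_t$ and concavity along non-negative directions, we have
\[
\begin{aligned}
f_t(x^*) - f_t(x_t^{(k)}) &\le f_t(x^* \vee x_t^{(k)}) - f_t(x_t^{(k)}) \\
&\le \langle \nabla f_t(x_t^{(k)}), x^* \vee x_t^{(k)} - x_t^{(k)} \rangle \\
&= \langle \nabla f_t(x_t^{(k)}), (x^* - x_t^{(k)}) \vee 0 \rangle \\
&\le \langle \nabla f_t(x_t^{(k)}), x^* \rangle .
\end{aligned} \tag{6}
\]
Plugging Eq. (6) into Eq. (5), we obtain
\[
\langle \nabla f_t(x_t^{(k)}), v_t^{(k)} \rangle \ge \langle \nabla f_t(x_t^{(k)}) - d_t^{(k)}, v_t^{(k)} - x^* \rangle + \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle + \big(f_t(x^*) - f_t(x_t^{(k)})\big) . \tag{7}
\]
Using Young's inequality, we can show that
\[
\begin{aligned}
\langle \nabla f_t(x_t^{(k)}) - d_t^{(k)}, v_t^{(k)} - x^* \rangle &\ge -\frac{1}{2\beta^{(k)}} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 - \frac{\beta^{(k)}}{2} \|v_t^{(k)} - x^*\|^2 \\
&\ge -\frac{1}{2\beta^{(k)}} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 - \beta^{(k)} D^2 / 2 .
\end{aligned} \tag{8}
\]
Plugging Eqs. (7) and (8) into Eq. (4), we deduce
\[
f_t(x_t^{(k+1)}) \ge f_t(x_t^{(k)}) + \frac{1}{K} \left[ -\frac{1}{2\beta^{(k)}} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 - \beta^{(k)} D^2 / 2 + \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle + \big(f_t(x^*) - f_t(x_t^{(k)})\big) \right] - \frac{L D^2}{2K^2} .
\]
Equivalently, we have
\[
f_t(x^*) - f_t(x_t^{(k+1)}) \le (1 - 1/K) \big[f_t(x^*) - f_t(x_t^{(k)})\big] - \frac{1}{K} \left[ -\frac{1}{2\beta^{(k)}} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 - \beta^{(k)} D^2 / 2 + \langle d_t^{(k)}, v_t^{(k)} - x^* \rangle \right] + \frac{L D^2}{2K^2} . \tag{9}
\]
Applying Eq. (9) recursively for $1 \le k \le K$ immediately yields
\[
f_t(x^*) - f_t(x_t^{(K+1)}) \le (1 - 1/K)^K \big[f_t(x^*) - f_t(x_t^{(1)})\big] + \frac{1}{K} \sum_{k=1}^K \left[ \frac{1}{2\beta^{(k)}} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + \beta^{(k)} D^2 / 2 + \langle d_t^{(k)}, x^* - v_t^{(k)} \rangle \right] + \frac{L D^2}{2K} .
\]
Recall that the point played in round $t$ is $x_t \triangleq x_t^{(K+1)}$, the first iterate in the sequence is $x_t^{(1)} = 0$, and that $(1 - 1/K)^K \le 1/e$ for all $K \ge 1$, so that
\[
f_t(x^*) - f_t(x_t) \le \frac{1}{e} \big[f_t(x^*) - f_t(0)\big] + \frac{1}{K} \sum_{k=1}^K \left[ \frac{1}{2\beta^{(k)}} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + \beta^{(k)} D^2 / 2 + \langle d_t^{(k)}, x^* - v_t^{(k)} \rangle \right] + \frac{L D^2}{2K} .
\]
Since $f_t(0) \ge 0$, we obtain
\[
(1 - 1/e) f_t(x^*) - f_t(x_t) \le \frac{1}{K} \sum_{k=1}^K \left[ \frac{1}{2\beta^{(k)}} \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + \beta^{(k)} D^2 / 2 + \langle d_t^{(k)}, x^* - v_t^{(k)} \rangle \right] + \frac{L D^2}{2K} . \tag{10}
\]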

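If we sum Eq. (10) over $t = 1, 2, 3, \ldots, T$, we obtain
\[
(1 - 1/e) \sum_{t=1}^T f_t(x^*) - \sum_{t=1}^T f_t(x_t) \le \frac{1}{K} \sum_{k=1}^K \left[ \frac{1}{2\beta^{(k)}} \sum_{t=1}^T \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + \beta^{(k)} D^2 T / 2 + \sum_{t=1}^T \langle d_t^{(k)}, x^* - v_t^{(k)} \rangle \right] + \frac{L D^2 T}{2K} .
\]
By the definition of the regret, we have
\[
\sum_{t=1}^T \langle d_t^{(k)}, x^* - v_t^{(k)} \rangle \le R_T^{\mathcal{E}} .
\]
Therefore, we deduce
\[
(1 - 1/e) \sum_{t=1}^T f_t(x^*) - \sum_{t=1}^T f_t(x_t) \le \frac{1}{K} \sum_{k=1}^K \left[ \frac{1}{2\beta^{(k)}} \sum_{t=1}^T \|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2 + \beta^{(k)} D^2 T / 2 \right] + \frac{L D^2 T}{2K} + R_T^{\mathcal{E}} .
\]
Taking the expectation on both sides, we obtain
\[
(1 - 1/e) \sum_{t=1}^T \mathbb{E}[f_t(x^*)] - \sum_{t=1}^T \mathbb{E}[f_t(x_t)] \le \frac{1}{K} \sum_{k=1}^K \left[ \frac{1}{2\beta^{(k)}} \sum_{t=1}^T \mathbb{E}[\|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2] + \beta^{(k)} D^2 T / 2 \right] + \frac{L D^2 T}{2K} + R_T^{\mathcal{E}} . \tag{11}
\]
Notice that, since consecutive iterates differ by $\frac{1}{K} v_t^{(k)}$,
\[
\|\nabla f_t(x_t^{(k)}) - \nabla f_t(x_t^{(k-1)})\| \le L \|v_t^{(k)}\| / K \le LR/K \le 2LR/(k+3) .
\]
By Theorem 3, if we set $\rho_k = \frac{2}{(k+3)^{2/3}}$, we have
\[
\mathbb{E}[\|\nabla f_t(x_t^{(k)}) - d_t^{(k)}\|^2] \le \frac{Q_t}{(k+4)^{2/3}} \le \frac{Q}{(k+4)^{2/3}}, \tag{12}
\]
where $Q_t \triangleq \max\{\|\nabla f_t(0)\|^2 4^{2/3},\; 4\sigma^2 + 6L^2R^2\}$ and $Q \triangleq \max\{\max_{1 \le t \le T} \|\nabla f_t(x_1)\|^2 4^{2/3},\; 4\sigma^2 + 6L^2R^2\}$. Plugging Eq. (12) into Eq. (11) and setting $\beta^{(k)} = Q^{1/2} / (D(k+4)^{1/3})$, we deduce
\[
(1 - 1/e) \sum_{t=1}^T \mathbb{E}[f_t(x^*)] - \sum_{t=1}^T \mathbb{E}[f_t(x_t)] \le \frac{T D Q^{1/2}}{K} \sum_{k=1}^K \frac{1}{(k+4)^{1/3}} + \frac{L D^2 T}{2K} + R_T^{\mathcal{E}} .
\]
Since
\[
\sum_{k=1}^K \frac{1}{(k+4)^{1/3}} \le \int_0^K \frac{dx}{(x+4)^{1/3}} = \frac{3}{2} \left[ (K+4)^{2/3} - 4^{2/3} \right] \le \frac{3}{2} K^{2/3},
\]
we have
\[
(1 - 1/e) \sum_{t=1}^T \mathbb{E}[f_t(x^*)] - \sum_{t=1}^T \mathbb{E}[f_t(x_t)] \le \frac{3 T D Q^{1/2}}{2 K^{1/3}} + \frac{L D^2 T}{2K} + R_T^{\mathcal{E}} .
\]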
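The role of the averaged estimate $d$ in Eq. (12) can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not from the paper: the target gradient $a$ is held fixed (so $G = 0$), the noise is isotropic Gaussian with total variance $\sigma^2$, and $d_0 = 0$. It simulates the recurrence $d_t = (1-\rho_t) d_{t-1} + \rho_t \tilde a_t$ with $\rho_t = 2/(t+s)^{2/3}$ and compares a Monte Carlo estimate of $\mathbb{E}\|a - d_t\|^2$ against the bound $Q/(t+s+1)^{2/3}$ of Theorem 3. The function names and parameter values are hypothetical.

```python
import math
import random


def averaged_estimate_errors(a, sigma, T, s=3.0, trials=500, seed=0):
    """Monte Carlo estimate of E||a - d_t||^2 for t = 1..T.

    Simulates d_t = (1 - rho_t) d_{t-1} + rho_t * a_tilde_t with
    rho_t = 2 / (t + s)^(2/3), a fixed target `a` (assumption: G = 0),
    and d_0 = 0.
    """
    rng = random.Random(seed)
    n = len(a)
    coord_std = sigma / math.sqrt(n)  # so that E||a_tilde - a||^2 = sigma^2
    sums = [0.0] * T
    for _ in range(trials):
        d = [0.0] * n
        for t in range(1, T + 1):
            rho = 2.0 / (t + s) ** (2.0 / 3.0)
            a_tilde = [ai + rng.gauss(0.0, coord_std) for ai in a]
            d = [(1.0 - rho) * di + rho * gi for di, gi in zip(d, a_tilde)]
            sums[t - 1] += sum((ai - di) ** 2 for ai, di in zip(a, d))
    return [total / trials for total in sums]


def theorem3_bound(a, sigma, t, s=3.0, G=0.0):
    """Q / (t + s + 1)^(2/3) with Q = max{||a_0 - d_0||^2 (s+1)^(2/3), 4 sigma^2 + 3 G^2 / 2}."""
    norm_sq = sum(ai * ai for ai in a)  # ||a_0 - d_0||^2 with d_0 = 0
    Q = max(norm_sq * (s + 1.0) ** (2.0 / 3.0), 4.0 * sigma ** 2 + 1.5 * G ** 2)
    return Q / (t + s + 1.0) ** (2.0 / 3.0)


a = [1.0, -2.0, 0.5]
errors = averaged_estimate_errors(a, sigma=1.0, T=100)
bounds = [theorem3_bound(a, 1.0, t) for t in range(1, 101)]
```

A single unaveraged sample $\tilde a_t$ has error $\sigma^2$ in every round, while the averaged estimate's error decays at the $t^{-2/3}$ rate. That decay is what lets the proofs above get by with one stochastic gradient per step.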

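D. Proof of Theorem 2: Convex Case

Let $f(x) = \mathbb{E}_{f_t \sim \mathcal{D}}[f_t(x)]$ denote the expected function. Because $f$ is $L$-smooth and convex, we have
\[
\begin{aligned}
f(x_{t+1}) - f(x^*) &= f(x_t + \eta_t (v_t - x_t)) - f(x^*) \\
&\le f(x_t) - f(x^*) + \eta_t \langle \nabla f(x_t), v_t - x_t \rangle + \eta_t^2 \frac{L}{2} \|v_t - x_t\|^2 \\
&\le f(x_t) - f(x^*) + \eta_t \langle \nabla f(x_t), v_t - x_t \rangle + \eta_t^2 \frac{L D^2}{2} .
\end{aligned}
\]
As before, the dual pairing may be decomposed as
\[
\langle \nabla f(x_t), v_t - x_t \rangle = \langle \nabla f(x_t) - d_t, v_t - x^* \rangle + \langle \nabla f(x_t), x^* - x_t \rangle + \langle d_t, v_t - x^* \rangle .
\]
We can bound the first term using Young's inequality to get
\[
\langle \nabla f(x_t) - d_t, v_t - x^* \rangle \le \frac{1}{2\beta_t} \|\nabla f(x_t) - d_t\|^2 + 2\beta_t \|v_t - x^*\|^2 \le \frac{1}{2\beta_t} \|\nabla f(x_t) - d_t\|^2 + 2\beta_t D^2
\]
for any $\beta_t > 0$, which will be chosen later in the proof. We may also bound the second term in the decomposition of the dual pairing using convexity of $f$, i.e. $\langle \nabla f(x_t), x^* - x_t \rangle \le f(x^*) - f(x_t)$. Finally, the third term is nonpositive by the choice of $v_t$, namely $v_t = \arg\min_{v \in \mathcal{K}} \langle d_t, v \rangle$. Using these inequalities, we now have that
\[
f(x_{t+1}) - f(x^*) \le (1 - \eta_t)\big(f(x_t) - f(x^*)\big) + \eta_t \left( \frac{1}{2\beta_t} \|\nabla f(x_t) - d_t\|^2 + 2\beta_t D^2 \right) + \eta_t^2 \frac{L D^2}{2} .
\]
Taking the expectation over the randomness in the iterates (i.e. the stochastic gradient estimates), we have that
\[
\mathbb{E}[f(x_{t+1})] - f(x^*) \le (1 - \eta_t)\big(\mathbb{E}[f(x_t)] - f(x^*)\big) + \eta_t \left( \frac{1}{2\beta_t} \mathbb{E}[\|\nabla f(x_t) - d_t\|^2] + 2\beta_t D^2 \right) + \eta_t^2 \frac{L D^2}{2} . \tag{13}
\]
Now we will apply the variance reduction technique. Note that
\[
\|\nabla f(x_{t+1}) - \nabla f(x_t)\| \le L \|x_{t+1} - x_t\| \le L \eta_t \|x_t - v_t\| \le L \eta_t D,
\]
where we have used that $f$ is $L$-smooth, the convex update, and the diameter. Now, using Theorem 3 with $G = LD$ and $s = 3$, we have that
\[
\mathbb{E}[\|\nabla f(x_t) - d_t\|^2] \le \frac{Q}{(t+4)^{2/3}},
\]
where $Q \triangleq \max\{4^{2/3} \|\nabla f(x_1)\|^2,\; 4\sigma^2 + 3(LD)^2/2\}$. Using this bound in Eq. (13) and setting $\beta_t = \frac{Q^{1/2}}{2D(t+4)^{1/3}}$ yields
\[
\mathbb{E}[f(x_{t+1})] - f(x^*) \le (1 - \eta_t)\big(\mathbb{E}[f(x_t)] - f(x^*)\big) + \eta_t \frac{2 Q^{1/2} D}{(t+4)^{1/3}} + \eta_t^2 \frac{L D^2}{2} .
\]
By induction, we have
\[
\mathbb{E}[f(x_{t+1})] - f(x^*) \le \prod_{k=1}^t (1 - \eta_k)\, M + \sum_{k=1}^t \eta_k \prod_{j=k+1}^t (1 - \eta_j) \left( \frac{2 Q^{1/2} D}{(k+4)^{1/3}} + \eta_k \frac{L D^2}{2} \right),
\]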
