Distributed Strongly Convex Optimization

Konstantinos I. Tsianos
Department of Electrical and Computer Engineering
McGill University
Montreal, Quebec H3A 0E9

and

Michael G. Rabbat
Department of Electrical and Computer Engineering
McGill University
Montreal, Quebec H3A 0E9
michael.rabbat@mcgill.ca

arXiv:1207.3031v1 [cs.DC]

Abstract—A lot of effort has been invested into characterizing the convergence rates of gradient-based algorithms for non-linear convex optimization. Recently, motivated by large datasets and problems in machine learning, the interest has shifted towards distributed optimization. In this work we present a distributed algorithm for strongly convex constrained optimization. Each node in a network of n computers converges to the optimum of a strongly convex, L-Lipschitz continuous, separable objective at a rate O(log(√n T) / T), where T is the number of iterations. This rate is achieved in the online setting, where the data is revealed one point at a time to the nodes, and in the batch setting, where each node has access to its full local dataset from the start. The same convergence rate is achieved in expectation when the subgradients used at each node are corrupted with additive zero-mean noise.

I. INTRODUCTION

In this work we focus on solving optimization problems of the form

    minimize_{w ∈ W}  F(w) = (1/T) Σ_{t=1}^{T} f^t(w),

where each function f^1(w), f^2(w), ... is convex over a convex set W ⊆ R^d. This formulation applies widely in machine learning scenarios, where f^t(w) measures the loss of model w with respect to data point t, and F(w) is the average loss over T data points. In particular, we are interested in the behavior of online distributed optimization algorithms for this sort of problem as the number of data points T tends to infinity. We describe a distributed algorithm which, for strongly convex functions f^t, converges at a rate O(log T / T). To the best of our knowledge, this is the first distributed algorithm to achieve this convergence rate for constrained optimization without relying on smoothness assumptions on the objective or on non-trivial communication mechanisms between the nodes. The result is true both in the online and the batch optimization setting.

When faced with a non-linear convex optimization problem, gradient-based methods can be applied to find the solution. The behavior of these algorithms is well understood in the single-processor, centralized setting. Under the assumption that the objective is L-Lipschitz continuous, projected gradient descent-type algorithms converge at a rate O(1/√T) [1], [2]. This rate is achieved both in an online setting, where the f^t are revealed to the algorithm sequentially, and in the batch setting, where all f^t are known in advance. If the cost functions are also strongly convex, then gradient algorithms can achieve linear rates, O(1/T), in the batch setting [3] and nearly-linear rates, O(log T / T), in the online setting [4]. Under additional smoothness assumptions, such as Lipschitz continuous gradients, the same rate of convergence can also be achieved by second-order methods in the online setting [5], [6], while accelerated methods can achieve a quadratic rate, O(1/T²), in the batch setting; see [7] and references therein.

The aim of this work is to extend the aforementioned results to the distributed setting, where a network of n processors jointly optimizes a similar objective. Assuming the network is arranged as an expander graph with constant spectral gap, for general convex cost functions that are only L-Lipschitz continuous, the rate at which existing algorithms on a network of n processors will all reach the optimum value is O(log T / √T), i.e., similar to the optimal single-processor algorithms up to a logarithmic factor [8], [9]. This is true both in a batch setting and in an online setting, even when the gradients are corrupted by noise. The technique proposed in [10] makes use of mini-batches to obtain asymptotically optimal rates for online optimization of smooth cost functions that have Lipschitz continuous gradients corrupted by bounded-variance noise, and correspondingly faster rates for smooth strongly convex functions. However, this technique requires that each node exchange messages with every other node at the end of each iteration. Finally, if the objective function is strongly convex and three times differentiable, a distributed version of Nesterov's accelerated method [11] achieves a fast rate for unconstrained problems in the batch setting, but the dependence on n is not characterized.

The algorithm presented in this paper achieves a rate O(log(√n T) / T) for strongly convex functions. Our formulation allows for convex constraints in the problem and assumes the objective function is Lipschitz continuous and strongly convex; no higher-order smoothness assumptions are made. Our algorithm works in both the online and the batch setting, and it scales nearly linearly in the number of iterations for network topologies with fast information diffusion. In addition, at each iteration nodes are only required to exchange messages with a subset of the other nodes in the network (their neighbors).

The rest of the paper is organized as follows. Section II introduces notation and formalizes the problem. Section III describes the proposed algorithm and states our main results. These results are proven in Section IV, and Section V extends the analysis to the case where the gradients are noisy. Section VI presents the results of numerical experiments illustrating the performance of the algorithm, and the paper concludes in Section VII.

II. ONLINE CONVEX OPTIMIZATION

Consider the problem of minimizing a convex function F(w) over a convex set W ⊆ R^d. Of particular interest is the setting where the algorithm sequentially receives noisy samples of the subgradients of F(w). This setting arises in online loss minimization for machine learning when the data arrives as a stream and the subgradient is evaluated using an individual data point at each step [1]. Suppose the t-th data point x(t) ∈ X ⊆ R^d is drawn i.i.d. from an unknown distribution D, and let f^t(w) = f(w, x(t)) denote the loss of this data point with respect to a particular model w. In this setting one would like to find the model w that minimizes the expected loss E_D[f(w, x)], possibly with the constraint that w be restricted to a model space W. Clearly, as T → ∞, the objective F(w) = (1/T) Σ_{t=1}^{T} f^t(w) → E_D[f(w, x)], and so if the data stream is finite this motivates minimizing the empirical loss F(w).

An online convex optimization algorithm observes a data stream x(1), x(2), ..., and sequentially chooses a sequence of models w(1), w(2), ..., after each observation. Upon choosing w(t), the algorithm receives a subgradient g(t) ∈ ∂f^t(w(t)). The goal is for the sequence w(1), w(2), ... to converge to a minimizer w* of F(w). The performance of an online optimization algorithm is measured in terms of the regret:

    R_T = Σ_{t=1}^{T} f^t(w(t)) − min_{w ∈ W} Σ_{t=1}^{T} f^t(w).

The regret measures the gap between the cost accumulated by the online optimization algorithm over T steps and that of a single model chosen to simultaneously minimize the total cost over all T cost terms. If the costs f^t are allowed to be arbitrary convex functions, then it can be shown that the best achievable rate for any online optimization algorithm is R_T = Ω(√T), and this bound is also achievable [1]. The rate can be significantly improved if the cost functions have more favourable properties.

A. Assumptions

Assumption 1: We assume for the rest of the paper that each cost function f^t(w) = f(w, x(t)) is σ-strongly convex for all x(t) ∈ X; i.e., there is a σ > 0 such that for all θ ∈ [0, 1] and all u, w ∈ W,

    f^t(θu + (1 − θ)w) ≤ θ f^t(u) + (1 − θ) f^t(w) − (σ/2) θ(1 − θ) ‖u − w‖².

If each f^t(w) is σ-strongly convex, it follows that F(w) is also σ-strongly convex. Moreover, if F(w) is strongly convex then it is also strictly convex, and so F(w) has a unique minimizer, which we denote by w*.

Assumption 2: We also assume that the subgradients g(t) of each cost function f^t are bounded by a known constant L > 0; i.e., ‖g(t)‖ ≤ L, where ‖·‖ is the ℓ2 (Euclidean) norm.

B. Example: Training a Classifier

For a specific example of this setup, consider the problem of training an SVM classifier using a hinge loss with ℓ2 regularization [4]. In this case, the data stream consists of pairs {x(t), y(t)} such that x(t) ∈ X and y(t) ∈ {−1, +1}. The goal is to minimize the misclassification error as measured by the ℓ2-regularized hinge loss. Formally, we wish to find the w ∈ W ⊆ R^d that solves

    minimize_{w ∈ W}  (σ/2) ‖w‖² + (1/m) Σ_{t=1}^{m} max{ 0, 1 − y(t) ⟨w, x(t)⟩ },

which is σ-strongly convex. (Although the hinge loss itself is not strongly convex, adding a strongly convex regularizer makes the overall cost function strongly convex.)
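To make the example concrete, the following minimal Python sketch evaluates the ℓ2-regularized hinge loss above and one of its subgradients for a single data point; the function names and the NumPy implementation are illustrative assumptions and not taken from the paper.

```python
import numpy as np

def regularized_hinge_loss(w, x, y, sigma):
    """l2-regularized hinge loss for one example (x, y) with y in {-1, +1}."""
    return 0.5 * sigma * np.dot(w, w) + max(0.0, 1.0 - y * np.dot(w, x))

def regularized_hinge_subgradient(w, x, y, sigma):
    """One subgradient of the loss above at w."""
    g = sigma * w
    if 1.0 - y * np.dot(w, x) > 0.0:  # margin violated, so the hinge term contributes
        g = g - y * x
    return g
```

The (σ/2)‖w‖² term is what makes each per-example cost strongly convex, matching Assumption 1, while boundedness of the data and of W keeps the subgradient norm below a constant L as in Assumption 2.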

For these types of problems, using a single-processor stochastic gradient descent algorithm, one can achieve R_T = O(log T) [4] or R_T / T = O(1/T) [12] by using different update schemes.

C. Distributed Online Convex Optimization

In this paper, we are interested in solving online convex optimization problems with a network of computers. The computers are organized in a network G = (V, E) with |V| = n nodes, and messages are only exchanged between nodes connected by an edge in E.

Assumption 3: In this work we assume that G is connected and undirected.

Each node i receives a stream of data x_i(1), x_i(2), ..., similar to the serial case, and the nodes must collaborate to minimize the network-wide objective

    F(w) = (1/(nT)) Σ_{t=1}^{T} Σ_{i=1}^{n} f_i^t(w),

where f_i^t(w) = f(w, x_i(t)) is the cost incurred at processor i at time t. In the distributed setting, the definition of regret is naturally extended to

    R_T = Σ_{t=1}^{T} Σ_{i=1}^{n} f(w_i(t), x_i(t)) − min_{w ∈ W} Σ_{t=1}^{T} Σ_{i=1}^{n} f(w, x_i(t)).

For general convex cost functions, the distributed algorithm proposed in [8] has been proven to have an average regret that decreases at a rate O(1/√T), similar to the serial case, and this result holds even when the algorithm receives noisy, unbiased observations of the true subgradients at each step. In the next section, we present a distributed algorithm that achieves a nearly-linear rate of decrease of the average regret (up to a logarithmic factor) when the cost functions are strongly convex.

III. ALGORITHM

Nodes must collaborate to solve the distributed online convex optimization problem described in the previous section. To that end, the network is endowed with a consensus matrix P which respects the structure of G, in the sense that [P]_{ji} = 0 if (i, j) ∉ E. We assume that P is doubly stochastic, although generalizations to the case where P is row stochastic or column stochastic (but not both) are also possible [13], [14].
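The paper simply assumes such a doubly stochastic P is available. One common way to construct one for a given undirected graph is with Metropolis–Hastings weights, sketched below purely as an illustration; this particular construction is an assumption of the sketch, not a prescription of the paper.

```python
import numpy as np

def metropolis_hastings_weights(adjacency):
    """Return a symmetric (hence doubly stochastic) consensus matrix P with
    [P]_{ij} = 0 whenever (i, j) is not an edge of the undirected graph."""
    n = adjacency.shape[0]
    deg = adjacency.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                P[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        P[i, i] = 1.0 - P[i, :].sum()  # put the leftover mass on the self-loop
    return P
```

Because the weights are symmetric and each row sums to one, P is doubly stochastic, and its second-largest eigenvalue λ2 controls how quickly the network averages information, which is exactly the quantity that appears in the analysis below.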

A detailed description of the proposed algorithm, distributed online gradient descent (DOGD), is given in Algorithm 1. In the algorithm, each node performs a total of T updates. One update involves processing a single data point x_i(t) at each processor. The T updates are performed over K rounds, and n_k updates are performed in round k.

Algorithm 1 DOGD
 1: Initialize: k = 1, a_1 = 2/σ, n_1 = 1, z_i(1) = w_i(1) = 0
 2: while Σ_{s=1}^{k} n_s ≤ T do   (each node i repeats)
 3:   for t = 1 to n_k do
 4:     Send/receive z_i(t) and z_j(t) to/from neighbors j
 5:     Obtain the next subgradient g_i(t) ∈ ∂_w f_i^t(w_i(t))
 6:     z_i(t+1) = Σ_{j=1}^{n} p_{ij} z_j(t) − a_k g_i(t)
 7:     w_i(t+1) = Π_W[z_i(t+1)]
 8:   end for
 9:   w_i(1) ← w_i(n_k + 1)
10:   z_i(1) ← w_i(n_k + 1)
11:   ŵ_i(k) = (1/n_k) Σ_{t=1}^{n_k} w_i(t)
12:   a_{k+1} = a_k / 2
13:   n_{k+1} = 2 n_k
14:   k ← k + 1
15: end while

The main steps within each round (lines 3–8) involve updating an accumulated-gradient variable, z_i(t), by simultaneously incorporating the information received from neighboring nodes and taking a local gradient-descent-like step. The accumulated gradient is projected onto the constraint set to obtain w_i(t), where

    Π_W[z] = argmin_{w ∈ W} ‖w − z‖

denotes the Euclidean projection of z onto W, and this projected value is then merged into a running average ŵ_i(k). The step size parameter a_k remains constant within each round, and the step size is reduced by half at the end of each round. The number of updates per round doubles from one round to the next. Note that the algorithm proposed here differs from the distributed dual averaging algorithm described in [8], where a proximal projection is used rather than the Euclidean projection. Also, in contrast to the distributed subgradient algorithms described in [15], DOGD maintains an accumulated-gradient variable z_i(t) which is updated using {z_j(t)}, as opposed to the primal feasible variables {w_j(t)}. Finally, key to achieving fast convergence is the exponential decrease of the learning rate after performing an exponentially increasing number of gradient steps, together with a proper initialization of the learning rate. The next section provides theoretical guarantees on the performance of DOGD.
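The following is a minimal, synchronous simulation of Algorithm 1 in NumPy, intended only to illustrate the flow of the updates. The constraint set (an ℓ2 ball), the subgradient oracle interface, and the initial step size are assumptions of this sketch rather than details fixed by the paper.

```python
import numpy as np

def project_l2_ball(z, radius=10.0):
    """Euclidean projection onto W = {w : ||w|| <= radius} (example constraint set)."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else (radius / nrm) * z

def dogd(P, subgrad, T, d, sigma):
    """Run a synchronous simulation of DOGD on n nodes for at most T updates per node.
    P: n-by-n doubly stochastic consensus matrix.
    subgrad(i, w): returns a subgradient of node i's current cost at w.
    Returns the per-node running averages from the last completed round."""
    n = P.shape[0]
    z = np.zeros((n, d))
    w = np.zeros((n, d))
    a, n_k, steps = 2.0 / sigma, 1, 0            # initial learning rate and round length
    w_hat = np.zeros((n, d))
    while steps + n_k <= T:
        w_sum = np.zeros((n, d))
        for _ in range(n_k):
            g = np.vstack([subgrad(i, w[i]) for i in range(n)])
            z = P @ z - a * g                     # mix neighbors' z and take a gradient step
            w = np.vstack([project_l2_ball(zi) for zi in z])
            w_sum += w
        w_hat = w_sum / n_k                       # running average for this round
        z = w.copy()                              # start the next round from the feasible point
        steps += n_k
        a, n_k = a / 2.0, 2 * n_k                 # halve the step size, double the round length
    return w_hat
```

In a real deployment each node would hold only its own rows of these arrays, and the product P @ z would be realized by exchanging z_i(t) with neighbors, as in lines 4–6 of Algorithm 1.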

IV. CONVERGENCE ANALYSIS

Our main convergence result, stated below, guarantees that the average regret decreases at a rate which is nearly linear.

Theorem 1: Let Assumptions 1–3 hold and suppose that the consensus matrix P is doubly stochastic with constant second-largest eigenvalue λ2. Let w* be the minimizer of F(w). Then the sequence {ŵ_i(K)} produced by the nodes running DOGD to minimize F(w) obeys

    F(ŵ_i(K)) − F(w*) = O( log(√n T) / T ),

where K = log2(T/n_1 + 1) is the number of rounds executed during a total of T gradient steps per node, and ŵ_i(K) is the running average maintained locally at each node.

Remark 1: We state the result for the case where λ2 is constant. This is the case when G is, e.g., a complete graph or an expander graph [16]. For other graph topologies, where λ2 shrinks with n and consensus does not converge as fast, the dependence of the convergence rate on n is going to be worse, due to a factor 1 − λ2 in the denominator; see the proof of Theorem 1 below for the precise dependence on the spectral gap 1 − λ2.

Remark 2: The theorem characterizes the performance of the online algorithm DOGD, where the data and cost functions f_i^t are processed sequentially at each node in order to minimize an objective of the form

    F(w) = (1/(nT)) Σ_{t=1}^{T} Σ_{i=1}^{n} f_i^t(w).

However, as pointed out in [14], if the entire dataset is available in advance, we can use the same scheme to do batch minimization by effectively setting f_i^t(w) = f_i(w), where f_i(w) is the objective function accounting for the entire dataset available to node i. Thus, the same result holds immediately for a batch version of DOGD.

The remainder of this section is devoted to the proof of Theorem 1. Our analysis follows arguments that can be found in [1], [8], [12] and references therein. We first state and prove some intermediate results.

A. Properties of Strongly Convex Functions

Recall the definition of σ-strong convexity given in Assumption 1. A direct consequence of this definition is that if F(w) is σ-strongly convex, then

    F(w) − F(w*) ≥ (σ/2) ‖w − w*‖².

Strong convexity can be combined with the assumptions above to upper bound the difference F(w) − F(w*) for an arbitrary point w ∈ W.

Lemma 1: Let w* be the minimizer of F(w). For all w ∈ W, we have F(w) − F(w*) ≤ 2L²/σ.

Proof: For any subgradient g of F at w, by convexity we know that F(w) − F(w*) ≤ ⟨g, w − w*⟩. It follows from Assumption 2 that F(w) − F(w*) ≤ L ‖w − w*‖. Furthermore, from Assumption 1 we obtain that (σ/2) ‖w − w*‖² ≤ L ‖w − w*‖, or ‖w − w*‖ ≤ 2L/σ. As a result, F(w) − F(w*) ≤ 2L²/σ.

B. The Lazy Projection Algorithm

The analysis of DOGD below involves showing that the average state, (1/n) Σ_{i=1}^{n} w_i(t), evolves according to the so-called single-processor lazy projection algorithm [1], which we discuss next. The lazy projection algorithm is an online convex optimization scheme for the serial problem discussed at the beginning of Section II. A single processor sequentially chooses a new variable w(t) and receives a subgradient g(t) of f(w(t), x(t)). The algorithm chooses w(t+1) by repeating the steps

    z(t+1) = z(t) − a g(t),
    w(t+1) = Π_W[z(t+1)].

By unwrapping the recursive form of this update, we get

    z(t+1) = −a Σ_{s=1}^{t} g(s) + z(1).

The following is a typical result for subgradient descent-style algorithms, and it is useful towards eventually characterizing how the regret accumulates. Its proof can be found in the appendix of the extended version of [1].

Theorem 2 (Zinkevich [1]): Let w* ∈ W, let a > 0, and set z(1) = w(1). After T rounds of the serial lazy projection algorithm, we have

    Σ_{t=1}^{T} ⟨g(t), w(t) − w*⟩ ≤ ‖w(1) − w*‖² / (2a) + (a L² T)/2.

Theorem 2 immediately yields the same bound for the regret of lazy projection [1].
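A direct transcription of these two steps, with the subgradient oracle and the projection left as arguments, might look as follows; this is a sketch for illustration, not code from the paper.

```python
import numpy as np

def lazy_projection(subgrad, project, T, a, d):
    """Serial lazy projection: keep an unprojected accumulated-gradient state z,
    and play w = projection of z at every step."""
    z = np.zeros(d)
    w = project(z)
    iterates = [w]
    for t in range(T):
        g = subgrad(t, w)      # subgradient of the t-th cost at the current model
        z = z - a * g          # gradient step on the unprojected state
        w = project(z)         # model used at the next step
        iterates.append(w)
    return iterates
```

The point of the next subsection is that the network-wide averages z̄(t) and w̄(t) generated by DOGD follow exactly this recursion with learning rate a_k and gradient ḡ(t) = (1/n) Σ_i g_i(t), so Theorem 2 can be applied to them.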

C. Evolution of Network-Average Quantities in DOGD

We turn our attention to Algorithm 1. A standard approach to studying the convergence of distributed optimization algorithms such as DOGD is to keep track of the discrepancy between every node's state and an average state sequence, defined as

    z̄(t) = (1/n) Σ_{i=1}^{n} z_i(t)   and   w̄(t) = Π_W[z̄(t)].

Observe that z̄(t) evolves in a simple recursive manner:

    z̄(t+1) = (1/n) Σ_{i=1}^{n} z_i(t+1)
            = (1/n) Σ_{i=1}^{n} [ Σ_{j=1}^{n} p_{ij} z_j(t) − a_k g_i(t) ]
            = (1/n) Σ_{j=1}^{n} z_j(t) Σ_{i=1}^{n} p_{ij} − (a_k/n) Σ_{i=1}^{n} g_i(t)
            = z̄(t) − (a_k/n) Σ_{i=1}^{n} g_i(t)
            = −a_k Σ_{s=1}^{t} (1/n) Σ_{i=1}^{n} g_i(s) + (1/n) Σ_{i=1}^{n} z_i(1),

where the fourth equality holds since P is doubly stochastic (each column sums to one). Notice that the states {z̄(t), w̄(t)} therefore evolve according to the lazy projection algorithm with gradients ḡ(t) = (1/n) Σ_{i=1}^{n} g_i(t) and learning rate a_k. In the sequel, we will also use an analytic expression for z_i(t), derived by back-substituting in its recursive update equation. After some algebraic manipulation, we obtain

    z_i(t+1) = −a_k Σ_{s=1}^{t−1} Σ_{j=1}^{n} [P^{t−s}]_{ij} g_j(s) − a_k g_i(t) + Σ_{j=1}^{n} [P^{t}]_{ij} z_j(1),

and since the projection is non-expansive and z_i(1) = 0 for all i at the very first round, the state held by node i at the beginning of round k+1 satisfies

    ‖z_i(1)‖ = ‖w_i(1)‖ = ‖Π_W[z_i(n_k + 1)]‖ ≤ L Σ_{s=1}^{k} a_s n_s.

D. Analysis of One Round of DOGD

Next, we focus on bounding the amount of regret accumulated during the k-th round of DOGD (the inner loop, lines 3–8 of Algorithm 1), during which the learning rate remains fixed at a_k.

Using Assumptions 1 and 2 and the triangle inequality, we have that

    Σ_{t=1}^{n_k} [F(w_i(t)) − F(w*)]
      = Σ_{t=1}^{n_k} [F(w̄(t)) − F(w*)] + Σ_{t=1}^{n_k} [F(w_i(t)) − F(w̄(t))]
      ≤ Σ_{t=1}^{n_k} [F(w̄(t)) − F(w*)] + L Σ_{t=1}^{n_k} ‖w_i(t) − w̄(t)‖.

For the first summand we have

    Σ_{t=1}^{n_k} [F(w̄(t)) − F(w*)]
      = (1/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} [f_i^t(w_i(t)) − f_i^t(w*)] + (1/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} [f_i^t(w̄(t)) − f_i^t(w_i(t))]
      ≤ (1/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} ⟨g_i(t), w_i(t) − w*⟩   (call this term A1)
        + (L/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} ‖w̄(t) − w_i(t)‖.

For term A1,

    A1 = (1/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} ⟨g_i(t), w_i(t) − w*⟩
       = (1/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} ⟨g_i(t), w̄(t) − w*⟩ + (1/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} ⟨g_i(t), w_i(t) − w̄(t)⟩
       ≤ Σ_{t=1}^{n_k} ⟨ (1/n) Σ_{i=1}^{n} g_i(t), w̄(t) − w* ⟩   (call this term A2)
         + (L/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} ‖w_i(t) − w̄(t)‖.

To bound term A2 we invoke Theorem 2 for the average sequences {w̄(t)} and {z̄(t)}. Writing ḡ(t) = (1/n) Σ_{i=1}^{n} g_i(t) and noting that ‖ḡ(t)‖ ≤ L,

    A2 = Σ_{t=1}^{n_k} ⟨ ḡ(t), Π_W[z̄(t)] − w* ⟩ ≤ ‖w̄(1) − w*‖² / (2 a_k) + (a_k n_k L²)/2.

Collecting now all the partial results and bounds, so far we have shown that

    Σ_{t=1}^{n_k} [F(w_i(t)) − F(w*)]
      ≤ ‖w̄(1) − w*‖² / (2 a_k) + (a_k n_k L²)/2
        + (2L/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} ‖w_i(t) − w̄(t)‖ + L Σ_{t=1}^{n_k} ‖w_i(t) − w̄(t)‖,

and since the projection operator is non-expansive, ‖w_i(t) − w̄(t)‖ ≤ ‖z_i(t) − z̄(t)‖, so that

    Σ_{t=1}^{n_k} [F(w_i(t)) − F(w*)]
      ≤ ‖w̄(1) − w*‖² / (2 a_k) + (a_k n_k L²)/2
        + (2L/n) Σ_{t=1}^{n_k} Σ_{i=1}^{n} ‖z_i(t) − z̄(t)‖ + L Σ_{t=1}^{n_k} ‖z_i(t) − z̄(t)‖.

The first two terms are standard for subgradient algorithms using a constant step size. The last two terms depend on the error between each node's iterate z_i(t) and the network-wide average z̄(t), which we bound next.

E. Bounding the Network Error

What remains is to bound the term ‖z_i(t) − z̄(t)‖, which describes an error induced by the network, since the different nodes do not agree on the direction towards the optimum. By recalling that P is doubly stochastic and manipulating the recursive expressions for z_i(t) and z̄(t), using arguments similar to those in [8], [14], we obtain the bound

    ‖z_i(t) − z̄(t)‖ ≤ a_k L Σ_{s=1}^{t−1} ‖ [P^{t−s}]_{i,:} − (1/n) 1ᵀ ‖₁ + 2 a_k L + ‖ [P^{t−1}]_{i,:} − (1/n) 1ᵀ ‖₁ · max_j ‖z_j(1)‖.

The ℓ1 norms can be bounded using Lemma 2, which is stated and proven in the Appendix, and using the bound on ‖z_j(1)‖ from the previous subsection we arrive at

    ‖z_i(t) − z̄(t)‖ ≤ a_k L ( 3 + 2 log(T√n)/(1 − λ2) ) + √n λ2^{t−1} L Σ_{s=1}^{k−1} a_s n_s,

where λ2 is the second largest eigenvalue of P. Using this bound in the regret bound of the previous subsection, along with the fact that F(w) is convex, we conclude that

    F(ŵ_i(k)) − F(w*) = F( (1/n_k) Σ_{t=1}^{n_k} w_i(t) ) − F(w*)
      ≤ (1/n_k) Σ_{t=1}^{n_k} [F(w_i(t)) − F(w*)]
      ≤ ‖w̄(1) − w*‖² / (2 a_k n_k) + a_k L² ( 9 + 6 log(T√n)/(1 − λ2) ) + (3 L² √n / ((1 − λ2) n_k)) Σ_{s=1}^{k−1} a_s n_s,

where w̄(1) = Π_W[ (1/n) Σ_{i=1}^{n} z_i(1) ].

F. Analysis of DOGD over Multiple Rounds

As our last intermediate step, we must control the learning rate and the update of n_k from round to round to ensure fast convergence of the error. From the strong convexity of F we have

    ‖w̄(1) − w*‖² ≤ (2/σ) [ F(w̄(1)) − F(w*) ],

and thus

    F(ŵ_i(k)) − F(w*) ≤ [ F(w̄(1)) − F(w*) ] / (σ a_k n_k) + a_k L² ( 9 + 6 log(T√n)/(1 − λ2) ) + (3 L² √n / ((1 − λ2) n_k)) Σ_{s=1}^{k−1} a_s n_s.

Now, applying the regret bound of Theorem 2 to the average sequence {w̄(t)}, viewed as a single-processor lazy projection algorithm [1], after executing the n_k gradient steps of round k we have

    F(w̄_{k+1}(1)) − F(w*) ≤ ‖w̄_k(1) − w*‖² / (2 a_k n_k) + (a_k L²)/2,

and by repeatedly using strong convexity and Theorem 2 we see that

    F(w̄_{k+1}(1)) − F(w*) ≤ [ F(w̄_k(1)) − F(w*) ] / (σ a_k n_k) + (a_k L²)/2
      ≤ [ F(w̄_1(1)) − F(w*) ] Π_{j=1}^{k} 1/(σ a_j n_j) + (L²/2) Σ_{j=1}^{k} a_j Π_{s=j+1}^{k} 1/(σ a_s n_s).

Now, let us fix positive integers b and c, and suppose we use the following rules to determine the step size and the number of updates performed within each round:

    a_k = a_1 / b^{k−1},    n_k = c^{k−1} n_1.

Combining the per-round bound of the previous subsection with the recursion above, and invoking Lemma 1 to bound F(w̄_1(1)) − F(w*) ≤ 2L²/σ, every term in the resulting upper bound on F(ŵ_i(k)) − F(w*) is proportional either to (σ a_1 n_1)^{−(k−1)} or to a_1 L² (b/c)^{k−1}, up to the factor 9 + 6 log(T√n)/(1 − λ2). To ensure convergence to zero, we need c ≥ b and σ a_1 n_1 > 1, i.e., a_1 > 1/(σ n_1). Given these restrictions, let us make the choices

    a_1 = 2/σ,    n_1 = 1,    c = b = 2.

To simplify the exposition, let us assume that the number of rounds K defined below is an integer. Using the selected values, we have σ a_k n_k = 2 for every round k, so each application of the recursion above halves the accumulated error, while a_k = (2/σ) 2^{1−k}. Substituting into the bounds above and telescoping over the rounds, we obtain

    F(ŵ_i(k)) − F(w*) ≤ 2^{−(k−1)} (L²/σ) [ 2 + 6 ( 1 + log(T√n)/(1 − λ2) ) ].

Finally, we have all we need to complete the analysis of Algorithm 1.

G. Proof of Theorem 1

Suppose we run Algorithm 1 for T total steps at each node. This allows for K rounds, where K is determined by solving

    T = Σ_{k=1}^{K} n_k = Σ_{k=1}^{K} 2^{k−1} n_1 = (2^{K} − 1) n_1   ⟹   K = log2( T/n_1 + 1 ).

Using this value for K, we see that

    F(ŵ_i(K)) − F(w*) ≤ ( 2 n_1 / (T + n_1) ) (L²/σ) [ 2 + 6 ( 1 + log(T√n)/(1 − λ2) ) ]
      = O( log(√n T) / ((1 − λ2) T) )
      = O( log(√n T) / T )

when λ2 is constant and does not scale with n, and this concludes the proof of Theorem 1.

V. EXTENSION TO STOCHASTIC OPTIMIZATION

The proof presented in the previous section can easily be extended to the case where each node receives a random estimate ĝ_i(t) of the gradient, satisfying E[ĝ_i(t)] = g_i(t), instead of receiving g_i(t) directly. We assume that the noisy gradients still have bounded second moments, i.e., E[‖ĝ_i(t)‖²] ≤ L². In this setting, instead of the bound on term A2 derived above, we have

    A2 = Σ_{t=1}^{n_k} ⟨ (1/n) Σ_{i=1}^{n} g_i(t), w̄(t) − w* ⟩
       = Σ_{t=1}^{n_k} ⟨ (1/n) Σ_{i=1}^{n} ĝ_i(t), w̄(t) − w* ⟩ + Σ_{t=1}^{n_k} ⟨ (1/n) Σ_{i=1}^{n} [ g_i(t) − ĝ_i(t) ], w̄(t) − w* ⟩.

However, the proof of Theorem 2 does not depend on the gradients being correct; rather, it holds for the noisy gradients ĝ_i(t) as well. Moreover, we have E[‖ĝ_i(t)‖²] ≤ L², and by Hölder's inequality E[‖ĝ_i(t)‖ ‖ĝ_j(t)‖] ≤ L². Thus,

    E[ ‖ (1/n) Σ_{i=1}^{n} ĝ_i(t) ‖² ] ≤ (1/n²) Σ_{i,j=1}^{n} E[ ‖ĝ_i(t)‖ ‖ĝ_j(t)‖ ] ≤ L².

Thus, invoking Theorem 2, if the new data (and thus the subgradients) are independent of the past, and since E[ĝ_i(t)] = g_i(t), we have

    E[A2] ≤ ‖w̄(1) − w*‖² / (2 a_k) + (a_k n_k L²)/2 + Σ_{t=1}^{n_k} ⟨ (1/n) Σ_{i=1}^{n} E[ g_i(t) − ĝ_i(t) ], w̄(t) − w* ⟩
          = ‖w̄(1) − w*‖² / (2 a_k) + (a_k n_k L²)/2.

Furthermore, the network error bound holds in expectation as well, i.e.,

    E[ ‖w̄(t) − w_i(t)‖ ] ≤ E[ ‖z̄(t) − z_i(t)‖ ] ≤ a_k L ( 3 + 2 log(T√n)/(1 − λ2) ) + √n λ2^{t−1} L Σ_{s=1}^{k−1} a_s n_s.

Collecting all these observations, we have shown that, in expectation,

    E[ F(ŵ_i(k)) − F(w*) ] ≤ ‖w̄(1) − w*‖² / (2 a_k n_k) + a_k L² ( 9 + 6 log(T√n)/(1 − λ2) ) + (3 L² √n / ((1 − λ2) n_k)) Σ_{s=1}^{k−1} a_s n_s,

which, after using the update rules for a_k and n_k, yields exactly the same rate as before. We note, however, that there may still be room for improvement in the distributed stochastic optimization setting, since [12] describes a single-processor algorithm that converges at a rate O(1/T).
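As an illustration of the setting analyzed here, the sketch below wraps an exact subgradient oracle with additive zero-mean noise so that the same DOGD loop can be run unchanged; the Gaussian noise model and its scale are assumptions made for the example, not choices made in the paper.

```python
import numpy as np

def noisy_oracle(subgrad, noise_std, rng=None):
    """Turn an exact subgradient oracle subgrad(i, w) into an unbiased noisy one."""
    rng = rng if rng is not None else np.random.default_rng(0)
    def stochastic_subgrad(i, w):
        g = subgrad(i, w)
        return g + noise_std * rng.standard_normal(g.shape)  # added noise has zero mean
    return stochastic_subgrad
```

Any zero-mean perturbation with suitably bounded second moment fits the assumption E[‖ĝ_i(t)‖²] ≤ L² used in the analysis above.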

VI. SIMULATION

To illustrate the performance of DOGD, we simulate the online training of a classifier by solving the SVM problem of Section II-B using a network of 10 nodes arranged as a random geometric graph. Each node is given T = 600 data points, and the input dimension is d = 100. We set σ = 0.1, generate the data from a standard normal distribution, and classify each point as +1 or −1 depending on its position relative to a randomly drawn hyperplane in R^d. As we see in Figure 1, DOGD minimizes the objective much faster than Distributed Dual Averaging (DDA) [8], which has a convergence rate of O(log T / √T). DDA is simulated using the learning rate that is suggested in [8]. We have observed that boosting this learning rate may yield faster convergence, but still not as fast as DOGD. Figure 1 also shows the performance of a version of Fast Distributed Gradient Descent (FDGD) [11]. As we can see, FDGD fails to converge in an online or stochastic setting and ends up oscillating.
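A rough sketch of how such an experiment can be set up is shown below; the graph construction, the data generation, and the reuse of the dogd and metropolis_hastings_weights sketches from earlier sections are illustrative assumptions, not the authors' actual simulation code.

```python
import numpy as np

def random_geometric_graph(n, radius, rng):
    """Adjacency matrix of a random geometric graph on the unit square.
    Choose radius large enough that the resulting graph is connected."""
    pts = rng.random((n, 2))
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return ((dist < radius) & (dist > 0)).astype(float)

def make_svm_stream(n, T, d, rng):
    """Per-node data streams labeled by a random hyperplane, as in the experiment."""
    true_w = rng.standard_normal(d)
    X = rng.standard_normal((n, T, d))
    Y = np.sign(X @ true_w)
    return X, Y
```

With a doubly stochastic P built from the adjacency matrix and a subgradient oracle based on the regularized hinge loss of Section II-B, the dogd sketch above can then be run and its objective value tracked over the iterations for comparison against other methods.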

[Figure 1: Optimization of a d = 100 dimensional problem of the form of Section II-B with a random network of 10 nodes. The vertical axis shows the network-average objective (1/n) Σ_i F(w_i(t)) and the horizontal axis the number of iterations. Our proposed algorithm DOGD (red) converges faster than DDA (green), as expected from the T instead of √T in the denominator of the convergence rate bound. FDGD (black) is unable to converge in the online problem.]

VII. FUTURE WORK

In this paper we have proposed and analyzed a novel distributed optimization algorithm which we call Distributed Online Gradient Descent (DOGD). Our analysis shows that DOGD converges at a rate O(log T / T) when solving online, stochastic, or batch constrained convex optimization problems if the objective function is strongly convex. This rate is optimal in the number of iterations for the online and batch settings, and slower than a serial algorithm only by a logarithmic factor in the stochastic optimization setting. In its current form, DOGD requires the nodes in the network to exchange gradient information at every iteration. Our preliminary investigation suggests that gradually performing more and more updates between communications can speed up distributed optimization algorithms in the batch setting when one explicitly accounts for the time required to communicate data. Our future work will carry out a similar analysis for online and stochastic optimization algorithms.

APPENDIX

Lemma 2: If P is a doubly stochastic matrix defined over a strongly connected graph G = (V, E) with |V| = n nodes, so that p_{ji} = 0 if (i, j) ∉ E, then for any t ≤ T,

    Σ_{s=1}^{t} ‖ [P^{t−s}]_{i,:} − (1/n) 1ᵀ ‖₁ ≤ 1 + 2 log(T√n)/(1 − λ2),

where λ2 is the second largest eigenvalue of P.

Proof: If the consensus matrix P is doubly stochastic, it is straightforward to show that P^t → (1/n) 1 1ᵀ as t → ∞. Moreover, from standard Perron–Frobenius theory it is easy to show (see, e.g., [17]) that

    ‖ [P^t]_{i,:} − (1/n) 1ᵀ ‖₁ ≤ √n λ2^t,

so in our case ‖ [P^{t−s}]_{i,:} − (1/n) 1ᵀ ‖₁ ≤ √n λ2^{t−s}. Next, demand that the right-hand side of this bound be less than δ, with δ to be determined:

    √n λ2^{t−s} ≤ δ   ⟺   t − s ≥ log(√n/δ) / log(1/λ2).

So with the choice δ = 1/T, we have ‖ [P^{t−s}]_{i,:} − (1/n) 1ᵀ ‖₁ ≤ 1/T whenever t − s ≥ log(T√n)/log(1/λ2) =: t̂. When s is large and t − s < t̂, we use the trivial bound ‖ [P^{t−s}]_{i,:} − (1/n) 1ᵀ ‖₁ ≤ 2. The desired bound is now obtained as follows:

    Σ_{s=1}^{t} ‖ [P^{t−s}]_{i,:} − (1/n) 1ᵀ ‖₁
      = Σ_{s=1}^{t−t̂} ‖ [P^{t−s}]_{i,:} − (1/n) 1ᵀ ‖₁ + Σ_{s=t−t̂+1}^{t} ‖ [P^{t−s}]_{i,:} − (1/n) 1ᵀ ‖₁
      ≤ (t − t̂)/T + 2 t̂.

Since t ≤ T, we know that (t − t̂)/T < 1. Moreover, log(1/λ2) ≥ 1 − λ2. Using these two facts, we arrive at the result. The same bound is true for any individual entry of P^t approaching 1/n.
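The quantity controlled by Lemma 2 is easy to check numerically for a given consensus matrix; the helper below computes both sides of the bound for a concrete P, purely as an illustrative sanity check (it is not part of the paper).

```python
import numpy as np

def network_error_terms(P, T, i=0):
    """Compare the accumulated l1 deviation of row i of the powers of P from uniform
    against the 1 + 2*log(T*sqrt(n))/(1 - lambda2) bound of Lemma 2."""
    n = P.shape[0]
    lam2 = np.sort(np.abs(np.linalg.eigvals(P)))[-2]   # second largest eigenvalue modulus
    lhs = 0.0
    Pt = np.eye(n)
    for _ in range(T):                                  # powers P^0, P^1, ..., P^(T-1)
        lhs += np.abs(Pt[i] - 1.0 / n).sum()
        Pt = Pt @ P
    rhs = 1.0 + 2.0 * np.log(T * np.sqrt(n)) / (1.0 - lam2)
    return lhs, rhs
```

For well-connected graphs such as expanders, λ2 stays bounded away from 1 as n grows, which is exactly the constant-spectral-gap regime assumed in Theorem 1.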

REFERENCES

[1] M. A. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in 20th International Conference on Machine Learning (ICML), 2003.
[2] Y. Nesterov, "Primal-dual subgradient methods for convex problems," Mathematical Programming, Series B, vol. 120, pp. 221–259, 2009.
[3] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT), Y. Lechevallier and G. Saporta, Eds., Paris, France, August 2010, pp. 177–186.
[4] S. Shalev-Shwartz and Y. Singer, "Logarithmic regret algorithms for strongly convex repeated games," technical report, The Hebrew University, 2007.
[5] E. Hazan, A. Kalai, S. Kale, and A. Agarwal, "Logarithmic regret algorithms for online convex optimization," in 19th Conference on Learning Theory (COLT), 2006.
[6] P. L. Bartlett, E. Hazan, and A. Rakhlin, "Adaptive online gradient descent," in Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT Press, 2008.
[7] P. Tseng, "On accelerated proximal gradient methods for convex-concave optimization," SIAM Journal on Optimization, submitted, 2008.
[8] J. Duchi, A. Agarwal, and M. Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
[9] A. Nedic, A. Ozdaglar, and P. A. Parrilo, "Constrained consensus and optimization in multi-agent networks," IEEE Transactions on Automatic Control, vol. 55, no. 4, pp. 922–938, 2010.
[10] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, "Optimal distributed online prediction using mini-batches," Journal of Machine Learning Research, vol. 13, pp. 165–202, 2012.
[11] D. Jakovetic, J. Xavier, and J. M. Moura, "Fast distributed gradient methods," arXiv:1112.2972, 2011.

[12] E. Hazan and S. Kale, "Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization," in 24th Annual Conference on Learning Theory (COLT), 2011.
[13] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Push-sum distributed dual averaging for convex optimization," in 51st IEEE Conference on Decision and Control, 2012.
[14] K. I. Tsianos and M. G. Rabbat, "Distributed dual averaging for convex optimization under communication delays," in American Control Conference (ACC), 2012.
[15] S. S. Ram, A. Nedic, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3, pp. 516–545, 2010.
[16] O. Reingold, S. Vadhan, and A. Wigderson, "Entropy waves, the zig-zag graph product, and new constant-degree expanders," Annals of Mathematics, vol. 155, no. 1, pp. 157–187, 2002.
[17] P. Diaconis and D. Stroock, "Geometric bounds for eigenvalues of Markov chains," The Annals of Applied Probability, vol. 1, no. 1, pp. 36–61, 1991.
