On the Linear Convergence of a Cyclic Incremental Aggregated Gradient Method


Aryan Mokhtari                                          aryanm@seas.upenn.edu
Department of Electrical and Systems Engineering
University of Pennsylvania, Philadelphia, PA 19104, USA

Mert Gürbüzbalaban                                      mgurbuzbalaban@business.rutgers.edu
Department of Management Science and Information Systems
Rutgers University, Piscataway, NJ 08854, USA

Alejandro Ribeiro                                       aribeiro@seas.upenn.edu
Department of Electrical and Systems Engineering
University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract

This paper considers the problem of minimizing the average of a finite set of strongly convex functions. We introduce a cyclic incremental aggregated gradient method that at each iteration computes the gradient of only one function, chosen according to a cyclic scheme, and uses it to update the aggregated average gradient of all the functions. We prove not only that the proposed method converges linearly to the optimal argument, but also that its linear convergence factor justifies the advantage of incremental methods over full batch gradient descent. In particular, we show theoretically and empirically that one pass of the proposed method is more efficient than one iteration of gradient descent. In addition, we propose an accelerated version of the introduced cyclic incremental aggregated gradient method that softens the dependency of the convergence rate on the condition number of the problem.

Keywords: Incremental gradient methods, finite sum minimization, large-scale optimization, linear convergence rate, accelerated methods

1. Introduction

We consider optimization problems in which the objective function can be written as the average of a set of strongly convex functions. In particular, let x ∈ R^p be the optimization variable and f_i : R^p → R the i-th available function. We aim to find the minimizer of the average function f(x) = (1/n) Σ_{i=1}^n f_i(x), i.e.,

    x* = argmin_{x ∈ R^p} f(x) := argmin_{x ∈ R^p} (1/n) Σ_{i=1}^n f_i(x).    (1)

We call the f_i the instantaneous functions and the average function f the global objective function. This class of optimization problems arises in machine learning (Bottou and Le Cun (2005)), estimation, wireless systems, and sensor networks. In this work, we consider the case in which the objective functions f_i are smooth and strongly convex.

The gradient descent (GD) method is one of the first methods used for solving the problem in (1). However, gradient descent is impractical when the number of functions n is extremely large, since it requires the computation of n gradients at each iteration.

Stochastic gradient descent (SGD) and mini-batch gradient descent (MGD), which use one or a subset of the gradients, respectively, to approximate the full gradient, are more popular for large-scale problems (Robbins and Monro (1951); Bottou (2010)). Although these methods reduce the computational complexity of GD, they cannot achieve a linear convergence rate as GD does. The last decade has seen fundamental progress in developing alternatives with faster convergence. A partial list of this consequential literature includes stochastic averaging gradient (Roux et al. (2012); Defazio et al. (2014)), variance reduction methods (Johnson and Zhang (2013); Xiao and Zhang (2014)), dual coordinate methods (Shalev-Shwartz and Zhang (2013, 2016)), hybrid algorithms (Zhang et al. (2013); Konečný and Richtárik (2013)), and majorization-minimization algorithms (Mairal (2015)). All these stochastic algorithms succeed in achieving a linear convergence rate in expectation.

The other class of first-order alternatives to GD are incremental methods (Blatt et al. (2007); Tseng and Yun (2014)). This class of algorithms differs from stochastic methods in the way functions are chosen for the gradient approximation. To be more precise, in stochastic methods a function is chosen uniformly at random from the set of n functions, while in incremental methods the functions are chosen in a cyclic order. Although incremental methods perform as well as stochastic methods in practice, their convergence results are limited relative to stochastic methods. In particular, as in the case of SGD, the cyclic incremental gradient method exhibits sublinear convergence. This limitation motivated the development of the incremental aggregated gradient (IAG) method, which achieves a linear convergence rate (Gürbüzbalaban et al. (2015)). To explain our contribution, we must emphasize that the convergence constant of IAG can be smaller than the convergence constant of GD (Section 3). Thus, even though IAG is designed to improve upon GD, the available analyses still make it impossible to assert that IAG outperforms GD under all circumstances. In fact, the question of whether it is possible at all to design a cyclic method that is guaranteed to always outperform GD remains open.

In this paper, we propose a novel incremental first-order method called the Double Incremental Aggregated Gradient method (DIAG). The DIAG update uses the average of both delayed variables and delayed gradients, as opposed to the classic incremental methods in Blatt et al. (2007); Tseng and Yun (2014); Gürbüzbalaban et al. (2015), which only use the average of delayed gradients. This major difference comes from the fact that DIAG uses an approximation of the global function f at each iteration which is different from the one used by IAG. We show that this critical difference leads to an incremental algorithm whose linear convergence factor improves on the convergence factor of GD under all circumstances. To the best of our knowledge, this is the first incremental method that is guaranteed to improve on the performance of GD.

We start the paper by studying existing methods and their convergence guarantees (Section 2). Then, we present the proposed incremental method (Section 3) and suggest an efficient mechanism for implementing it (Section 3.1). Further, we provide the convergence analysis of the DIAG method (Section 4). We show that if the functions f_i are strongly convex and their gradients ∇f_i are Lipschitz continuous, then the sequence of variables x^k generated by DIAG converges linearly to the optimal argument x* (Proposition 3).
Moreover, we show that the function decrement achieved by the proposed method after each pass over the dataset is strictly smaller than the function decrement of gradient descent after one iteration (Theorem 5 and Theorem 8). We compare the performance of DIAG with its stochastic variant (MISO) and the IAG method (Section 7). Finally, we close the paper with concluding remarks.

2. Related Works and Preliminaries

Since the objective function in (1) is convex, descent methods can be used to find the optimal argument x*. In this paper, we are interested in studying methods that converge to the optimal argument of f(x) at a linear rate. It is customary for the linear convergence analysis of first-order methods to assume that the functions are smooth and strongly convex. We formalize these conditions in the following assumption.

Assumption 1 The functions f_i are differentiable and strongly convex with constant µ > 0, i.e.,

    (∇f_i(x) − ∇f_i(y))^T (x − y) ≥ µ ‖x − y‖^2.    (2)

Moreover, the gradients ∇f_i are Lipschitz continuous with constant L < ∞, i.e.,

    ‖∇f_i(x) − ∇f_i(y)‖ ≤ L ‖x − y‖.    (3)

The strong convexity of the functions f_i with constant µ implies that the global objective function f is also strongly convex with constant µ. Likewise, the Lipschitz continuity of the gradients ∇f_i with constant L yields Lipschitz continuity of the global objective function gradient ∇f with constant L. Note that the conditions in Assumption 1 are mild and hold for most large-scale applications such as linear regression, logistic regression, least squares, and support vector machines.

The optimization problem in (1) can be solved using the gradient descent (GD) method (Nesterov (2004)). The idea of GD is to update the current iterate x^k by descending along the negative direction of the current gradient ∇f(x^k) with a proper stepsize. In other words, the update of GD at step k is defined as

    x^{k+1} = x^k − ε_k ∇f(x^k) = x^k − (ε_k / n) Σ_{i=1}^n ∇f_i(x^k),    (4)

where ε_k is a positive stepsize (learning rate). Convergence analysis of GD shows that the sequence of iterates x^k converges linearly to the optimal argument for constant stepsizes satisfying ε_k = ε < 2/L (Nesterov (2004)). The fastest convergence rate is achieved by the stepsize 2/(µ + L), which leads to the linear convergence factor (κ − 1)/(κ + 1), i.e.,

    ‖x^k − x*‖ ≤ ((κ − 1)/(κ + 1))^k ‖x^0 − x*‖,    (5)

where κ := L/µ is the condition number of the global objective function. Although GD has a fast linear convergence rate, it is not computationally affordable in large-scale applications because of its high computational complexity. To comprehend this limitation, note that each iteration of GD requires n gradient evaluations, which is not affordable in large-scale applications with massive values of n.

Stochastic gradient descent (SGD) arises as a natural solution in large-scale settings. SGD modifies the update of GD by approximating the gradient of the global objective function f by the average of a small number of instantaneous gradients chosen uniformly at random from the set of n gradients. To be more precise, the update of SGD at step k is defined as

    x^{k+1} = x^k − (ε_k / b) Σ_{i ∈ S_b^k} ∇f_i(x^k),    (6)

where S_b^k has cardinality |S_b^k| = b << n and its components are chosen uniformly at random from the set {1, 2, ..., n}. Note that the stochastic gradient (1/b) Σ_{i ∈ S_b^k} ∇f_i(x^k) is an unbiased estimator of the gradient ∇f(x^k) = (1/n) Σ_{i=1}^n ∇f_i(x^k). Thus, the sequence of iterates generated by SGD converges to the optimal argument in expectation. It has been shown that the convergence rate of SGD is sublinear and can be characterized as

    E[‖x^k − x*‖^2] ≤ O(1/k),    (7)

when the sequence of diminishing stepsizes ε_k is of the order O(1/k). Note that the expectation in (7) is taken with respect to the indices of the functions chosen at random up to step k.
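As an aside, the updates in (4) and (6) are straightforward to prototype. The following is a minimal NumPy sketch (not part of the original paper); the toy quadratic instance, the function names gd_step and sgd_step, and the sampled constants are our own illustrative choices.

```python
import numpy as np

def gd_step(x, grads, eps):
    """One GD iteration, cf. (4): average all n instantaneous gradients."""
    return x - eps * np.mean([g(x) for g in grads], axis=0)

def sgd_step(x, grads, eps, b, rng):
    """One SGD iteration, cf. (6): average b gradients sampled uniformly at random."""
    idx = rng.integers(0, len(grads), size=b)
    return x - eps * np.mean([grads[i](x) for i in idx], axis=0)

# Toy instance: f_i(x) = 0.5 * a_i * ||x - c_i||^2, so grad f_i(x) = a_i (x - c_i).
rng = np.random.default_rng(0)
a = rng.uniform(1.0, 10.0, size=50)
c = rng.normal(size=(50, 3))
grads = [lambda x, ai=a[i], ci=c[i]: ai * (x - ci) for i in range(50)]
mu, L = a.min(), a.max()

x = np.zeros(3)
for _ in range(100):
    x = gd_step(x, grads, eps=2.0 / (mu + L))   # stepsize achieving the factor in (5)
```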

One may use a cyclic order instead of the stochastic selection of functions in SGD, which leads to the update of the incremental gradient method (IG) as in Blatt et al. (2007); Tseng and Yun (2014). Similar to the case of SGD, the sequence of iterates generated by the IG method converges to the optimal argument at a sublinear rate of O(1/k) when the stepsize is diminishing. SGD and IG reduce the computational complexity of GD by requiring only one (or a subset of) gradient evaluations per iteration; however, they both suffer from slow (sublinear) convergence rates.

The sublinear convergence rate of SGD has been improved recently. The first successful attempt at achieving a linear convergence rate with the computational complexity of SGD was the stochastic average gradient method (SAG), which updates only one gradient per iteration and uses the average of the most recent versions of all gradients as an approximation of the full gradient (Roux et al. (2012)). In particular, define y_i^k as the copy of the decision variable x corresponding to the last time the gradient of the function f_i was updated. At each iteration, an index i_k is chosen uniformly at random, the gradient of the corresponding function is evaluated at the current iterate, and the stored gradient ∇f_{i_k}(y_{i_k}^k) is replaced by ∇f_{i_k}(x^k). The variable x^k is then updated as

    x^{k+1} = x^k − (ε / n) Σ_{i=1}^n ∇f_i(y_i^k).    (8)

The sequence of iterates generated by SAG converges linearly to x* in expectation with respect to the choices of random indices, i.e.,

    E[‖x^k − x*‖] ≤ (1 − 1/(8κ))^k C_0,    (9)

where C_0 is a constant independent of n and κ. This result justifies the advantage of SAG over GD: the error ‖x^k − x*‖ decays by a factor of (1 − 1/(8κ))^n after a pass over the dataset, which improves on GD, whose error decays by the factor (κ − 1)/(κ + 1) per iteration. Similar advantages have been reported for other recent stochastic methods such as SAGA, SVRG, SDCA, and MISO.

The other alternative for solving the optimization problem in (1) is the incremental aggregated gradient (IAG) method, which is a middle ground between GD and IG. The IAG method requires one gradient evaluation per iteration, as in IG, while it approximates the gradient of the global objective function ∇f(x) by the average of the most recent gradients of all instantaneous functions (Blatt et al. (2007)), and it has a linear convergence rate, as in GD. In the IAG method, the functions are chosen in a cyclic order, and it takes n iterations to complete a pass over all the available functions. To introduce the update of IAG, recall the definition of y_i^k as the copy of the decision variable x corresponding to the last time the gradient of f_i was updated before step k. Then, the update of IAG is given by

    x^{k+1} = x^k − (ε / n) Σ_{i=1}^n ∇f_i(y_i^k).    (10)

Therefore, the update of IAG is identical to the update of SAG in (8); the only difference is the scheme by which the index i_k is chosen.
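A minimal sketch of the aggregated-gradient update in (8) and (10), with a stored table of gradients refreshed one entry per iteration, is given below. This is our own illustration (not the authors' code); the function name aggregated_gradient_run and the cyclic/random switch are assumptions made for the example.

```python
import numpy as np

def aggregated_gradient_run(grads, x0, eps, iters, cyclic=True, seed=0):
    """Run x^{k+1} = x^k - (eps/n) * sum_i grad f_i(y_i^k), cf. (8) and (10).

    A table stores one gradient per function; each iteration refreshes a single
    entry, chosen cyclically (IAG) or uniformly at random (SAG-style).
    """
    n, rng = len(grads), np.random.default_rng(seed)
    x = x0.copy()
    table = np.array([g(x0) for g in grads])   # gradients at the copies y_i^0 = x^0
    grad_sum = table.sum(axis=0)               # running sum avoids O(n) work per step
    for k in range(iters):
        i = k % n if cyclic else rng.integers(n)
        x = x - (eps / n) * grad_sum
        new_grad = grads[i](x)                 # one gradient evaluation per iteration
        grad_sum += new_grad - table[i]
        table[i] = new_grad
    return x
```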

The convergence results in Tseng and Yun (2014) establish global convergence and local linear convergence of IAG in a more general setting where each component function satisfies a local Lipschitzian error condition. More recently, a new convergence analysis of IAG was presented in Gürbüzbalaban et al. (2015), which proves global linear convergence of IAG for strongly convex functions with Lipschitz continuous gradients. In particular, it has been shown that the sequence of iterates x^k generated by IAG satisfies the inequality

    ‖x^k − x*‖ ≤ (1 − c / (n (κ + 1)^2))^k ‖x^0 − x*‖,    (11)

for a constant c > 0 independent of n and κ. Notice that the convergence rate of IAG is linear, and eventually the error of IAG will be smaller than the errors of SGD and IG, which diminish at a sublinear rate of O(1/k).

To compare the performances of GD and IAG, it is fair to compare one iteration of GD with n iterations of IAG. This is reasonable since one iteration of GD requires n gradient evaluations, while IAG uses n gradient evaluations over n iterations. Comparing the decrement factor of GD in (5) with that of IAG after n gradient evaluations in (11) does not guarantee the advantage of IAG relative to GD for all choices of the condition number κ and the number of functions n, since we could face the scenario that

    (κ − 1)/(κ + 1) < (1 − c / (n (κ + 1)^2))^n.    (12)

Note that the bound for GD in (5) is tight, and we can design a sequence that attains equality in (5). However, the bound in (11) is not necessarily tight, which could be the reason that the comparison in (12) does not justify the use of IAG instead of GD. Our goal in this paper is to come up with a first-order incremental method that has a guaranteed upper bound which is better than the one for GD in (5). We propose this algorithm in the following section.

3. Algorithm Definition

Recall that y_i^k is the copy of the decision variable x corresponding to the function f_i. The update of IAG in (10) can be interpreted as the solution of the optimization program

    x^{k+1} = argmin_{x ∈ R^p} { (1/n) Σ_{i=1}^n f_i(x^k) + (1/n) Σ_{i=1}^n ∇f_i(y_i^k)^T (x − x^k) + (1/(2ε)) ‖x − x^k‖^2 }.    (13)

This interpretation shows that in the update of IAG each instantaneous function f_i(x) is approximated by

    f_i(x) ≈ f_i(x^k) + ∇f_i(y_i^k)^T (x − x^k) + (1/(2ε)) ‖x − x^k‖^2.    (14)

Notice that the first two terms f_i(x^k) + ∇f_i(y_i^k)^T (x − x^k) correspond to a first-order approximation of the function f_i in which the gradient is evaluated at the delayed iterate y_i^k. The last term, (1/(2ε)) ‖x − x^k‖^2, is a proximal term added to the first-order approximation. This approximation is different from the classic approximation used in first-order methods, since the gradient is evaluated at a point y_i^k which is different from the iterate x^k used in the remaining terms. This observation suggests that the IAG algorithm performs well when the delayed variables y_i^k are close to the current iterate x^k, which is the case when the stepsize ε is very small or when the iterates are all close to the optimal solution.

We resolve this issue by introducing a different approach for approximating each component function f_i. In particular, we use the approximation

    f_i(x) ≈ f_i(y_i^k) + ∇f_i(y_i^k)^T (x − y_i^k) + (1/(2ε)) ‖x − y_i^k‖^2.    (15)

As we observe, the approximation in (15) is more consistent with classic first-order methods than the one used by IAG in (14), since the first-order approximation and the proximal term in (15) are evaluated at the same point y_i^k.

Indeed, the approximation in (15) implies that the global objective function f(x) can be approximated by

    f(x) ≈ (1/n) Σ_{i=1}^n f_i(y_i^k) + (1/n) Σ_{i=1}^n ∇f_i(y_i^k)^T (x − y_i^k) + (1/(2εn)) Σ_{i=1}^n ‖x − y_i^k‖^2.    (16)

We can approximate the optimal argument of the global objective function f by minimizing its approximation in (16). Thus, the updated iterate x^{k+1} can be computed as the minimizer of the approximated global objective function in (16), i.e.,

    x^{k+1} = argmin_{x ∈ R^p} { (1/n) Σ_{i=1}^n f_i(y_i^k) + (1/n) Σ_{i=1}^n ∇f_i(y_i^k)^T (x − y_i^k) + (1/(2εn)) Σ_{i=1}^n ‖x − y_i^k‖^2 }.    (17)

Considering the convex program in (17), we can derive a closed-form solution for the variable x^{k+1}, which is given by

    x^{k+1} = (1/n) Σ_{i=1}^n y_i^k − (ε/n) Σ_{i=1}^n ∇f_i(y_i^k).    (18)

We call the proposed method with the update in (18) the Double Incremental Aggregated Gradient method (DIAG). This appellation is justified considering that the update of DIAG requires the incremental aggregates of both variables and gradients, and only uses gradient (first-order) information.
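For completeness, the closed form in (18) follows directly from the first-order optimality condition of the quadratic program in (17); a short check (not spelled out in the original text):

```latex
% Setting the gradient of the objective in (17) with respect to x to zero:
\frac{1}{n}\sum_{i=1}^{n} \nabla f_i(y_i^k) \;+\; \frac{1}{\epsilon n}\sum_{i=1}^{n}\bigl(x^{k+1}-y_i^k\bigr) \;=\; 0
\;\;\Longrightarrow\;\;
x^{k+1} \;=\; \frac{1}{n}\sum_{i=1}^{n} y_i^k \;-\; \frac{\epsilon}{n}\sum_{i=1}^{n} \nabla f_i(y_i^k),
% which is exactly the DIAG update in (18).
```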

Notice that since we use a cyclic scheme, the set of variables {y_1^k, y_2^k, ..., y_n^k} is equal to the set {x^k, x^{k−1}, ..., x^{k−n+1}}. Hence, the update of the proposed cyclic incremental aggregated gradient method, for the cycle with order f_1, f_2, ..., f_n, can be written as

    x^{k+1} = (1/n) Σ_{i=1}^n x^{k−n+i} − (ε/n) Σ_{i=1}^n ∇f_{j_i}(x^{k−n+i}),   where j_i = (k + i) mod n.    (19)

The update in (19) shows that we use first-order approximations of the functions f_i around the last n iterates to evaluate the new iterate x^{k+1}. In other words, x^{k+1} is a function of the last n iterates {x^k, x^{k−1}, ..., x^{k−n+1}}. This observation is fundamental in the analysis of the proposed DIAG method, as we will see in Section 4.

Remark 1 One may consider the proposed DIAG method as a cyclic version of the stochastic MISO algorithm in Mairal (2015). This is a valid interpretation; however, the convergence analysis of MISO cannot guarantee that it outperforms GD for all choices of n and κ, while we establish theoretical results in Section 4 which guarantee the advantage of DIAG over GD for any n and κ. Moreover, the proposed DIAG method is designed based on the new interpretation in (15), which leads to a novel proof technique; see Lemma 2. This new analysis is different from the analysis of MISO in Mairal (2015) and provides stronger convergence results.

3.1 Implementation Details

A naive implementation of the update in (18) requires computing sums of n vectors per iteration, which is computationally costly. This unnecessary computational complexity can be avoided by tracking the sums over time. To be more precise, the first sum in (18), which is the sum of the variables, can be updated as

    Σ_{i=1}^n y_i^{k+1} = Σ_{i=1}^n y_i^k + x^{k+1} − y_{i_k}^k,    (20)

where i_k is the index of the function that is chosen at step k. Likewise, the sum of gradients in (18) can be updated as

    Σ_{i=1}^n ∇f_i(y_i^{k+1}) = Σ_{i=1}^n ∇f_i(y_i^k) + ∇f_{i_k}(x^{k+1}) − ∇f_{i_k}(y_{i_k}^k).    (21)

Algorithm 1 Double Incremental Aggregated Gradient method (DIAG)

1: Require: initial variables y_1^0 = ... = y_n^0 = x^0 and gradients ∇f_1(y_1^0), ..., ∇f_n(y_n^0)
2: for k = 0, 1, ... do
3:   Compute the function index i_k = mod(k, n) + 1.
4:   Compute x^{k+1} = (1/n) Σ_{i=1}^n y_i^k − (ε/n) Σ_{i=1}^n ∇f_i(y_i^k).
5:   Update the sum of variables: Σ_{i=1}^n y_i^{k+1} = Σ_{i=1}^n y_i^k + x^{k+1} − y_{i_k}^k.
6:   Compute ∇f_{i_k}(x^{k+1}) and update the sum of gradients: Σ_{i=1}^n ∇f_i(y_i^{k+1}) = Σ_{i=1}^n ∇f_i(y_i^k) + ∇f_{i_k}(x^{k+1}) − ∇f_{i_k}(y_{i_k}^k).
7:   Replace y_{i_k}^k and ∇f_{i_k}(y_{i_k}^k) in the table by x^{k+1} and ∇f_{i_k}(x^{k+1}), respectively. The other components remain unchanged, i.e., y_i^{k+1} = y_i^k and ∇f_i(y_i^{k+1}) = ∇f_i(y_i^k) for i ≠ i_k.
8: end for

The proposed double incremental aggregated gradient (DIAG) method is summarized in Algorithm 1. All copies of the vector x are initialized with the initial iterate, i.e., y_1^0 = ... = y_n^0 = x^0, and their corresponding gradients are stored in memory. At each iteration k, the updated variable x^{k+1} is computed in Step 4 using the update in (18). The sums of variables and gradients are updated in Steps 5 and 6, respectively, following the recursions in (20) and (21). In Step 7, the old variable and gradient of the updated function f_{i_k} are replaced with their new versions, while the other components of the variable and gradient tables remain unchanged. In Step 3, the index i_k is updated in a cyclic manner.
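The following is a compact Python sketch of Algorithm 1, given only for illustration; the function name diag_run and the data layout (one NumPy row per copy y_i) are our own choices, and the stepsize would be set to ε = 2/(µ + L) as in the analysis below.

```python
import numpy as np

def diag_run(grads, x0, eps, num_passes):
    """Minimal sketch of Algorithm 1 (DIAG): running sums of variables and gradients,
    one gradient evaluation per iteration, cyclic index selection."""
    n = len(grads)
    y = np.tile(x0, (n, 1))                      # table of copies y_i
    g = np.array([gr(x0) for gr in grads])       # table of gradients at the copies
    y_sum, g_sum = y.sum(axis=0), g.sum(axis=0)  # running sums used in (18)
    x = x0.copy()
    for k in range(num_passes * n):
        i = k % n                                # Step 3 (0-based index)
        x = y_sum / n - (eps / n) * g_sum        # Step 4, update (18)
        new_g = grads[i](x)                      # Step 6: one new gradient
        y_sum += x - y[i]                        # Step 5, recursion (20)
        g_sum += new_g - g[i]                    # Step 6, recursion (21)
        y[i], g[i] = x, new_g                    # Step 7: overwrite table entries
    return x
```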

4. Convergence Analysis

In this section, we study the convergence properties of the proposed double incremental aggregated gradient method and justify its advantages over the gradient descent method. The following lemma characterizes an upper bound on the optimality error ‖x^{k+1} − x*‖ in terms of the optimality errors of the last n iterations.

Lemma 2 Consider the proposed double incremental aggregated gradient (DIAG) method in (18). If the conditions in Assumption 1 hold, and the stepsize ε is chosen as ε = 2/(µ + L), then the sequence of iterates x^k generated by DIAG satisfies the inequality

    ‖x^{k+1} − x*‖ ≤ ((κ − 1)/(κ + 1)) (1/n) ( ‖x^k − x*‖ + ... + ‖x^{k−n+1} − x*‖ ),    (22)

where κ = L/µ is the condition number of the objective function.

Proof See Appendix A.

The result in Lemma 2 plays a significant role in the analysis of the proposed method: it shows that the error at step k + 1 is smaller than the average of the last n errors. In particular, the ratio (κ − 1)/(κ + 1) is strictly smaller than 1, which shows that the error at each iteration is strictly smaller than the average error of the last n steps. The cyclic scheme is critical for proving the result in (22), since it allows us to replace the sum Σ_i ‖y_i^k − x*‖ by the sum of the errors of the last n steps, ‖x^k − x*‖ + ... + ‖x^{k−n+1} − x*‖. Note that if we pick functions uniformly at random, as in MISO, it is not possible to write the expression in (22), even in expectation. We also cannot write an inequality similar to (22) for IAG, even though it uses a cyclic scheme. This comes from the fact that IAG only uses the average of the gradients, while DIAG uses both the variable and gradient averages. Thus, this special property distinguishes DIAG from IAG and MISO.

In the following proposition, we use the result in Lemma 2 to show that the sequence of errors ‖x^k − x*‖ is convergent.

Proposition 3 Consider the proposed double incremental aggregated gradient (DIAG) method in (18). If the conditions in Assumption 1 hold, and the stepsize ε is chosen as ε = 2/(µ + L), then the sequence of iterates x^k generated by the proposed DIAG method satisfies the inequality

    ‖x^k − x*‖ ≤ ρ^{⌊(k−1)/n⌋+1} ( 1 − (k − 1 − n⌊(k−1)/n⌋)(1 − ρ)/n ) ‖x^0 − x*‖,    (23)

where ρ := (κ − 1)/(κ + 1) and ⌊a⌋ denotes the floor of a.

Proof See Appendix B.

The first outcome of the result in Proposition 3 is the convergence of the sequence ‖x^k − x*‖ to zero as k approaches infinity. The second outcome, formalized in the following corollary, shows that the sequence of errors converges linearly after each pass over the dataset.

Corollary 4 If the conditions in Proposition 3 are satisfied, the error of the proposed DIAG method after m passes over the functions f_i, i.e., after k = mn iterations, is bounded above by

    ‖x^{mn} − x*‖ ≤ ρ^m ( 1 − (n − 1)(1 − ρ)/n ) ‖x^0 − x*‖.    (24)

Proof Set k = mn in (23) and the claim follows.

The result in Corollary 4 shows linear convergence of the subsequence of iterates sampled after each pass over the set of functions. Moreover, the result in Corollary 4 verifies the advantage of the DIAG method over the full gradient descent method. The result in (24) shows that the error of DIAG after m passes over the dataset is bounded above by ρ^m (1 − (1 − ρ)(n − 1)/n) ‖x^0 − x*‖, which is strictly smaller than the upper bound ρ^m ‖x^0 − x*‖ on the error of GD after m iterations. Therefore, the DIAG method outperforms GD for any choice of κ and n > 1. Notice that the upper bound ρ^m ‖x^0 − x*‖ on the error of GD after m iterations is tight, and there exists an optimization problem such that the error of GD satisfies the relation ‖x^m − x*‖ = ρ^m ‖x^0 − x*‖.
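As a quick numerical illustration of this comparison (our own sanity check, with arbitrary sample values of κ and n), the per-pass factor of the DIAG bound in (24) can be tabulated against the per-iteration GD factor ρ:

```python
import numpy as np

# Illustrative check of Corollary 4: the DIAG per-pass factor rho*(1-(n-1)*(1-rho)/n)
# versus the GD per-iteration factor rho = (kappa-1)/(kappa+1), for sample values.
for kappa, n in [(10, 100), (100, 100), (1000, 20)]:
    rho = (kappa - 1) / (kappa + 1)
    diag_factor = rho * (1 - (n - 1) * (1 - rho) / n)
    print(f"kappa={kappa:5d}, n={n:4d}:  GD factor={rho:.4f},  DIAG pass factor={diag_factor:.4f}")
```

In every case the DIAG per-pass factor is strictly below ρ, in line with the corollary.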

Although the result in Corollary 4 implies that the DIAG method is preferable to GD and shows linear convergence of a subsequence of iterates, it is not enough to prove linear convergence of the whole sequence of iterates generated by DIAG. To be more precise, the result in Corollary 4 shows that the subsequence of errors {‖x^{mn} − x*‖}_{m=0}^∞, associated with the variables at the end of each pass over the set of functions, is linearly convergent. However, we aim to show that the whole sequence {‖x^k − x*‖}_{k=0}^∞ is linearly convergent, i.e., that the sequence of DIAG iterates satisfies ‖x^k − x*‖ ≤ a γ^k ‖x^0 − x*‖ for a constant a > 0 and a coefficient 0 ≤ γ < 1. In the following theorem, we show that this condition is satisfied for the DIAG method.

Theorem 5 Consider the proposed double incremental aggregated gradient (DIAG) method in (18). If the conditions in Assumption 1 hold, and the stepsize ε is chosen as ε = 2/(µ + L), then the inequality

    ‖x^k − x*‖ ≤ a γ^k ‖x^0 − x*‖    (25)

holds if the constants a > 0 and 0 ≤ γ < 1 satisfy the conditions

    ρ ( 1 − (k − 1)(1 − ρ)/n ) ≤ a γ^k,   for k = 1, ..., n,    (26)

    γ^{n+1} − (1 + ρ/n) γ^n + ρ/n ≤ 0,    (27)

where the second condition guarantees the induction step for k > n.

Proof See Appendix C.

The result in Theorem 5 provides conditions on the constants a and γ such that the linear convergence inequality ‖x^k − x*‖ ≤ a γ^k ‖x^0 − x*‖ holds. However, it does not guarantee that the set of constants {a, γ} satisfying the required conditions in (26) and (27) is non-empty. In the following proposition we show that such constants exist.

Proposition 6 There exist constants a > 0 and 0 ≤ γ < 1 that satisfy the inequalities in (26) and (27). In other words, the set of feasible solutions for the system of inequalities in (26) and (27) is non-empty.

Proof See Appendix D.

The result in Proposition 6, in conjunction with the result in Theorem 5, guarantees linear convergence of the iterates generated by the DIAG method. Although there are different pairs {a, γ} that satisfy the conditions in (26) and (27) and lead to the linear convergence result in (25), we are interested in the pair {a, γ} with the smallest linear convergence factor γ, i.e., the pair that guarantees the fastest linear convergence. To find the smallest γ we should pick the smallest γ that satisfies the inequality γ^{n+1} − (1 + ρ/n) γ^n + ρ/n ≤ 0, and then choose the smallest constant a that satisfies the conditions in (26) for that γ. To do so, we first study the properties of the function h(γ) := γ^{n+1} − (1 + ρ/n) γ^n + ρ/n in the following lemma.

Lemma 7 Consider the function h(γ) := γ^{n+1} − (1 + ρ/n) γ^n + ρ/n for γ ∈ (0, 1). The function h has only one root γ_0 in the interval (0, 1). Moreover, γ_0 is the smallest choice of γ that satisfies the condition in (27).

Proof The derivative of the function h is given by

    dh/dγ = (n + 1) γ^n − (n + ρ) γ^{n−1}.    (28)

Therefore, the only critical point of the function h in the interval (0, 1) is γ* = (n + ρ)/(n + 1). The point γ* is a local minimum of h, since the second derivative of h is positive at γ*. Notice that the function value h(γ*) < 0 is negative. Moreover, we know that h(0) > 0 and h(1) = 0. This observation shows that the function h has a root γ_0 between 0 and γ*, and that this is the only root of h in the interval (0, 1). Thus, γ_0 is the smallest value of γ in the interval (0, 1) that satisfies the condition in (27).

The result in Lemma 7 shows that the unique root of the function h(γ) := γ^{n+1} − (1 + ρ/n) γ^n + ρ/n in the interval (0, 1) is the smallest γ that satisfies the condition in (27). We use this result to formalize the pair {a, γ} with the smallest choice of γ satisfying the conditions in (26) and (27).

Theorem 8 Consider the proposed double incremental aggregated gradient (DIAG) method in (18). Let the conditions in Assumption 1 hold, and set the stepsize as ε = 2/(µ + L). Then, the sequence of iterates generated by DIAG converges linearly,

    ‖x^k − x*‖ ≤ a_0 γ_0^k ‖x^0 − x*‖,    (29)

where γ_0 is the unique root of the equation

    γ^{n+1} − (1 + ρ/n) γ^n + ρ/n = 0    (30)

in the interval (0, 1), and a_0 is given by

    a_0 = max_{i ∈ {1,...,n}} ρ ( 1 − (i − 1)(1 − ρ)/n ) γ_0^{−i}.    (31)

Proof It follows from the results in Theorem 5 and Lemma 7.
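In practice γ_0 is easy to compute numerically. The sketch below (our own; the function name diag_rate_factor and the tolerance are illustrative assumptions) finds the root of (30) by bisection on (0, γ*), using the sign pattern established in Lemma 7:

```python
import numpy as np

def diag_rate_factor(n, kappa, tol=1e-12):
    """Illustrative computation of gamma_0, the unique root of (30) in (0, 1),
    by bisection on (0, gamma_star) where gamma_star = (n + rho)/(n + 1)."""
    rho = (kappa - 1.0) / (kappa + 1.0)
    h = lambda g: g ** (n + 1) - (1.0 + rho / n) * g ** n + rho / n
    lo, hi = 0.0, (n + rho) / (n + 1)   # h(0) > 0 and h(gamma_star) < 0 (Lemma 7)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# gamma_0^n should lie below the GD factor rho, in line with Corollary 4 and Theorem 10.
n, kappa = 100, 10
gamma0 = diag_rate_factor(n, kappa)
print(gamma0 ** n, (kappa - 1) / (kappa + 1))
```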

The result in Theorem 8 shows R-linear convergence of the DIAG iterates with the smallest guaranteed linear convergence factor γ_0. In the following section, we show that a sequence which upper bounds the errors ‖x^k − x*‖ converges Q-linearly to zero with the linear convergence constant in (30).

5. Worst-case asymptotic rate of DIAG

In this section, our aim is to provide an upper bound on the quantity ‖x^k − x*‖, the distance to optimality after k steps. It follows directly from (22) that the sequence d^k defined by the recursion

    d^{k+1} = (ρ/n) ( d^k + d^{k−1} + ... + d^{k−n+1} ),

where ρ = (κ − 1)/(κ + 1) and d^j := ‖x^j − x*‖ for j = 0, 1, ..., n − 1, provides an upper bound, i.e., we have ‖x^k − x*‖ ≤ d^k for all k ≥ 0. Defining the column vector

    D^k := [ d^k, d^{k−1}, ..., d^{k−n+1} ]^T,

note that this recursion can be rewritten as the matrix iteration D^{k+1} = M_ρ D^k, where

    M_ρ := [ ρ/n  ρ/n  ρ/n  ...  ρ/n
              1    0    0   ...   0
              0    1    0   ...   0
              ...
              0    0   ...   1    0  ].

We observe that M_ρ is a non-negative matrix whose eigenvalues determine the asymptotic growth rate of the sequence D^k, and hence of d^k. It is straightforward to check that the characteristic polynomial of M_ρ is

    T(λ) = λ^n − (ρ/n) λ^{n−1} − (ρ/n) λ^{n−2} − ... − ρ/n = ( λ^{n+1} − (1 + ρ/n) λ^n + ρ/n ) / (λ − 1),

whose roots are the eigenvalues of M_ρ. In the remainder of this section, we infer information about the eigenvalues of M_ρ using Perron-Frobenius (PF) theory. This theory is well developed for positive matrices, where all the entries are strictly positive, but M_ρ has zero entries and is therefore not positive. Nevertheless, PF theory has been successfully extended to certain non-negative matrices called irreducible matrices. A square matrix A is called irreducible if for every i and j there exists an r such that A^r(i, j) > 0. In the next lemma, we prove that the matrix M_ρ is irreducible, which justifies our use of the PF theory developed for irreducible matrices.

Lemma 9 The matrix M_ρ is irreducible for any ρ > 0.

Proof By the definition of irreducibility, we need to show that for every i and j there exists an r such that M_ρ^r(i, j) > 0. We will show that we can choose r = n for all i and j. Let e_1, e_2, ..., e_n be the standard basis of R^n, and consider the initialization

    D^n = [ d^n, d^{n−1}, ..., d^1 ]^T = e_j,   so that   M_ρ^n e_j = M_ρ^n D^n = D^{2n} = [ d^{2n}, d^{2n−1}, ..., d^{n+1} ]^T.

Using the definition of M_ρ, it is easy to check that such an initialization of D^n leads to d^{n+1} = ρ/n > 0, d^{n+2} > 0, ..., d^{2n} > 0. Therefore, for every i and j, we have

    M_ρ^n(i, j) = e_i^T M_ρ^n e_j = d^{2n−i+1} > 0,

which completes the proof.

Theorem 10 Let ρ ∈ (0, 1) and let λ^(ρ) be the spectral radius of M_ρ. Then,
  i) λ^(ρ) is the largest real root of the characteristic polynomial T(λ). Furthermore, it is a simple root.
  ii) We have the limit lim_{k→∞} d^{k+1}/d^k = λ^(ρ).
  iii) We have the bounds ρ ≤ λ^(ρ) < ρ^{1/n}.

Proof
i) Part i) is a direct consequence of the Perron-Frobenius theorem for irreducible non-negative matrices.

ii) By (Horn and Johnson, 1991, Theorem 8.5.1), we also have

    lim_{k→∞} M_ρ^k / (λ^(ρ))^k = u v^T,

where u and v are the right and left eigenvectors of M_ρ corresponding to the eigenvalue λ^(ρ), normalized to satisfy v^T u = 1. Note also that d^k = e_1^T D^k = e_1^T M_ρ^{k−n+1} D^{n−1}. Therefore,

    lim_{k→∞} d^{k+1}/d^k = lim_{k→∞} ( e_1^T M_ρ^{k−n+2} D^{n−1} ) / ( e_1^T M_ρ^{k−n+1} D^{n−1} )    (32)
                          = lim_{k→∞} λ^(ρ) ( e_1^T M_ρ^{k−n+2} D^{n−1} / (λ^(ρ))^{k−n+2} ) / ( e_1^T M_ρ^{k−n+1} D^{n−1} / (λ^(ρ))^{k−n+1} )    (33)
                          = λ^(ρ) ( e_1^T u v^T D^{n−1} ) / ( e_1^T u v^T D^{n−1} ) = λ^(ρ).    (34)

iii) The row-sum bounds for irreducible non-negative matrices in Horn and Johnson (1991) directly imply that λ^(ρ) ≥ ρ, which proves the lower bound on λ^(ρ). To get the upper bound, let 1 = (1, ..., 1)^T be the vector of ones. We will show that

    M_ρ^{2n} 1 < ρ^2 1,    (35)

where < denotes the componentwise inequality for vectors. By the corresponding corollary in Horn and Johnson (1991), this implies (λ^(ρ))^{2n} < ρ^2, which is equivalent to the desired upper bound. It is a straightforward computation to show that if we set D^n = 1, then after a simple induction argument we obtain d^{n+1} = ρ and d^{n+2} < ρ, d^{n+3} < ρ, ..., d^{2n} < ρ, i.e.,

    D^{2n} = [ d^{2n}, d^{2n−1}, ..., d^{n+1} ]^T = M_ρ^n D^n = M_ρ^n 1 = ρ v

for some vector v = (v_1, v_2, ..., v_n)^T, where v_i < 1 if i < n and v_n = 1. Similarly,

    D^{3n} = [ d^{3n}, d^{3n−1}, ..., d^{2n+1} ]^T = M_ρ^{2n} D^n = M_ρ^n ( M_ρ^n 1 ) = ρ M_ρ^n v,

and a straightforward computation shows that M_ρ^n v < ρ 1. Combining this inequality with the previous equation proves (35) and concludes the proof.
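The bounds in part iii) are easy to confirm numerically. The snippet below (our own illustration; the function name worst_case_matrix and the sample values of n and κ are assumptions) builds M_ρ and compares its spectral radius with ρ and ρ^{1/n}:

```python
import numpy as np

def worst_case_matrix(n, rho):
    """Build the n-by-n matrix M_rho from Section 5: first row rho/n, shifted identity below."""
    M = np.zeros((n, n))
    M[0, :] = rho / n
    M[1:, :-1] = np.eye(n - 1)
    return M

# Numerically check Theorem 10(iii): rho <= spectral_radius(M_rho) < rho**(1/n).
n, kappa = 50, 10
rho = (kappa - 1) / (kappa + 1)
lam = np.max(np.abs(np.linalg.eigvals(worst_case_matrix(n, rho))))
print(rho, lam, rho ** (1 / n))   # expected ordering: rho <= lam < rho**(1/n)
```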

6. Acceleration

Lin et al. (2015) describe a generic method, called the Catalyst algorithm, for accelerating a linearly convergent algorithm. This method is directly applicable to Algorithm 1; it only requires tuning a parameter κ_c which adjusts the resulting convergence rate. Lemma 2 shows that Algorithm 1 is linearly convergent, and part iii) of Theorem 10 shows that the resulting rate is smaller (better) than that of the standard gradient descent method. Lin et al. propose to choose κ_c = L − µ to accelerate the gradient descent method; the resulting rate for the accelerated method, when L/µ is large, is

    O( n √(L/µ) log(1/ε) ).    (36)

As Algorithm 1 is faster than the gradient descent method, its accelerated version obtained by choosing the same tuning parameter κ_c = L − µ will have a complexity that is no worse than (36).

Figure 1: Convergence paths of GD, IAG, and DIAG for the quadratic program with n = 200 and κ = 10 (normalized error ‖x^k − x*‖ / ‖x^0 − x*‖ versus number of gradient evaluations).

7. Numerical experiments

In this section, we compare the performance of GD, IAG, and DIAG. First, we apply these methods to the quadratic program

    min_{x ∈ R^p} f(x) := (1/n) Σ_{i=1}^n ( (1/2) x^T A_i x + b_i^T x ),    (37)

where A_i ∈ R^{p×p} is a diagonal matrix and b_i ∈ R^p is a random vector chosen from the box [0, 1]^p. To control the problem condition number, the first p/2 diagonal elements of A_i are chosen uniformly at random from the set {1, 10^{−1}, ..., 10^{−η/2}} and the last p/2 elements from the set {1, 10^1, ..., 10^{η/2}}. This selection results in the sum (1/n) Σ_i A_i having eigenvalues in the range [10^{−η/2}, 10^{η/2}]. In our simulations, we fix the variable dimension as p = 20 and the number of functions as n = 200. Moreover, the stepsizes of GD and DIAG are set to their best theoretical values, ε_GD = 2/(µ + L) and ε_DIAG = 2/(µ + L), respectively. Note that the stepsize suggested in Gürbüzbalaban et al. (2015) for IAG is ε_IAG = 0.32/(nL(L + µ)); however, this choice of stepsize is very slow in practice. Thus, we use the stepsize ε_IAG = 2/(nL), which performs better than the one suggested in Gürbüzbalaban et al. (2015). To have a fair comparison, we compare the algorithms in terms of the total number of gradient evaluations. Note that comparing these methods in terms of the total number of iterations would not be fair, since each iteration of GD requires n gradient evaluations, while IAG and DIAG only require one gradient computation per iteration.
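A sketch of a generator for such synthetic instances is given below, purely as an illustration of the setup just described: the function name make_quadratic_problem is our own, and we sample the diagonal exponents from a continuous range rather than the discrete grid used in the paper, which is a simplifying assumption.

```python
import numpy as np

def make_quadratic_problem(n=200, p=20, eta=1, seed=0):
    """Illustrative generator for the quadratic program (37): diagonal A_i with a
    controlled eigenvalue spread and b_i drawn from the box [0, 1]^p."""
    rng = np.random.default_rng(seed)
    small = 10.0 ** (-rng.uniform(0, eta / 2, size=(n, p // 2)))      # spectrum below 1
    large = 10.0 ** (rng.uniform(0, eta / 2, size=(n, p - p // 2)))   # spectrum above 1
    A_diag = np.hstack([small, large])            # diagonal of each A_i
    b = rng.uniform(0.0, 1.0, size=(n, p))
    avg = A_diag.mean(axis=0)                     # diagonal of (1/n) sum_i A_i
    mu, L = avg.min(), avg.max()                  # strong convexity / smoothness of f
    grads = [lambda x, d=A_diag[i], bi=b[i]: d * x + bi for i in range(n)]
    return grads, mu, L

grads, mu, L = make_quadratic_problem()
print("condition number kappa =", L / mu)
```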

We first consider the case η = 1 and use a realization with condition number κ = 10, i.e., a relatively small condition number. Fig. 1 shows the convergence paths of the normalized error ‖x^k − x*‖ / ‖x^0 − x*‖ for IAG, DIAG, and GD when n = 200 and κ = 10. As we observe, IAG performs better than GD, while the best performance belongs to DIAG.

In the second experiment, we increase the problem condition number by setting η = 2 and using a realization with condition number κ = 117. Fig. 2 illustrates the performance of these methods for n = 200 and κ = 117. We observe that the convergence path of IAG is almost identical to the one for GD. In this experiment, we also observe that DIAG has the best performance among the three methods. Note that the relative performance of IAG and GD changes for problems with different condition numbers. On the other hand, the relative convergence paths of DIAG and GD do not change across settings, and DIAG consistently outperforms GD.

Figure 2: Convergence paths of GD, IAG, and DIAG for the quadratic program with n = 200 and κ = 117 (normalized error ‖x^k − x*‖ / ‖x^0 − x*‖ versus number of gradient evaluations).

We also compare the performance of GD, IAG, and DIAG on a binary classification problem. Consider the logistic regression problem where samples {u_i} and their corresponding labels {l_i} are given. The dimension of the samples is p, i.e., u_i ∈ R^p, and the labels l_i are either 1 or −1. The goal is to find the optimal classifier x ∈ R^p that minimizes the regularized logistic loss, given by

    min_{x ∈ R^p} f(x) := (1/n) Σ_{i=1}^n log(1 + exp(−l_i x^T u_i)) + (λ/2) ‖x‖^2.    (38)

The objective function f in (38) is strongly convex with constant µ = λ, and its gradients are Lipschitz continuous with constant L = λ + ζ/4, where ζ = max_i u_i^T u_i. Note that the functions f_i in this case can be defined as f_i(x) = log(1 + exp(−l_i x^T u_i)) + (λ/2) ‖x‖^2. It is easy to verify that the instantaneous functions f_i are also strongly convex with constant µ = λ, and their gradients are Lipschitz continuous with constant L = λ + ζ/4.

We apply GD, IAG, and DIAG to the logistic regression problem in (38) on the MNIST dataset (LeCun et al. (1998)). We assign label l_i = 1 to the samples corresponding to digit 8 and label l_i = −1 to those associated with digit 0. We obtain a total of n = 11,774 training examples, each of dimension p = 784. The objective function error f(x^k) − f(x*) of the GD, IAG, and DIAG methods versus the number of passes over the dataset is shown in Fig. 3 for the stepsizes ε_GD = 2/(µ + L), ε_IAG = 2/(nL), and ε_DIAG = 2/(µ + L). Moreover, we report the convergence paths of these algorithms for their best choices of stepsize in practice. The results verify the advantage of the proposed DIAG method relative to IAG and GD in both scenarios.
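The per-sample gradients and the constants µ and L quoted above are straightforward to set up in code. The sketch below is our own illustration: the function name logistic_gradients is an assumption, and the randomly generated matrix U stands in for the MNIST samples used in the paper.

```python
import numpy as np

def logistic_gradients(U, labels, lam):
    """Per-sample gradients of the regularized logistic loss in (38):
    f_i(x) = log(1 + exp(-l_i x^T u_i)) + (lam/2)||x||^2, so
    grad f_i(x) = -l_i u_i / (1 + exp(l_i x^T u_i)) + lam x."""
    def grad_i(i):
        u, l = U[i], labels[i]
        return lambda x: -l * u / (1.0 + np.exp(l * np.dot(u, x))) + lam * x
    grads = [grad_i(i) for i in range(U.shape[0])]
    zeta = np.max(np.sum(U * U, axis=1))          # zeta = max_i u_i^T u_i
    mu, L = lam, lam + zeta / 4.0                 # constants quoted in the text
    return grads, mu, L

# Synthetic stand-in for the MNIST digits used in the paper.
rng = np.random.default_rng(0)
U = rng.normal(size=(500, 20))
labels = np.sign(rng.normal(size=500))
grads, mu, L = logistic_gradients(U, labels, lam=1e-2)
```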

Figure 3: Convergence paths of GD, IAG, and DIAG for the binary classification application (objective function error f(x^k) − f(x*) versus number of passes over the dataset, for the theoretical stepsizes ε_GD = 2/(µ + L), ε_IAG = 2/(nL), ε_DIAG = 2/(µ + L), and for the best stepsizes in practice).

Appendix A. Proof of Lemma 2

Consider the update in (18). Subtract the optimal argument x* from both sides of the equality to obtain

    x^{k+1} − x* = (1/n) Σ_{i=1}^n ( y_i^k − x* ) − (ε/n) Σ_{i=1}^n ∇f_i(y_i^k).    (39)

Note that the gradient of the global objective function at the optimal point is null, i.e., (1/n) Σ_{i=1}^n ∇f_i(x*) = 0. This observation, in conjunction with the expression in (39), leads to

    x^{k+1} − x* = (1/n) Σ_{i=1}^n ( y_i^k − x* ) − (ε/n) Σ_{i=1}^n ( ∇f_i(y_i^k) − ∇f_i(x*) )
                 = (1/n) Σ_{i=1}^n [ y_i^k − x* − ε ( ∇f_i(y_i^k) − ∇f_i(x*) ) ].    (40)

Computing the norm of both sides of (40) and using the triangle inequality, we obtain

    ‖x^{k+1} − x*‖ ≤ (1/n) Σ_{i=1}^n ‖ y_i^k − x* − ε ( ∇f_i(y_i^k) − ∇f_i(x*) ) ‖.    (41)

Now we proceed to derive an upper bound for each summand in (41). According to the result in Ryu and Boyd (2016, page 13), if the functions are µ-strongly convex and their gradients are L-Lipschitz continuous, then

    ‖ y_i^k − x* − ε ( ∇f_i(y_i^k) − ∇f_i(x*) ) ‖ ≤ max{ |1 − εµ|, |1 − εL| } ‖ y_i^k − x* ‖.    (42)

By setting the stepsize in (42) as ε = 2/(µ + L), we can write

    ‖ y_i^k − x* − ε ( ∇f_i(y_i^k) − ∇f_i(x*) ) ‖ ≤ ((κ − 1)/(κ + 1)) ‖ y_i^k − x* ‖,    (43)

where κ = L/µ is the condition number of the functions f_i. By replacing the summands on the right hand side of (41) with their upper bounds ((κ − 1)/(κ + 1)) ‖ y_i^k − x* ‖, as shown in (43), the residual ‖x^{k+1} − x*‖ is bounded above as

    ‖x^{k+1} − x*‖ ≤ ((κ − 1)/(κ + 1)) (1/n) Σ_{i=1}^n ‖ y_i^k − x* ‖.    (44)

Note that in the DIAG method we use a cyclic scheme to update the variables. Thus, the set of variables {y_1^k, ..., y_n^k} is identical to the set of the last n iterates before x^{k+1}, which is {x^k, ..., x^{k−n+1}}. We can therefore replace the sum Σ_{i=1}^n ‖y_i^k − x*‖ in (44) by the sum Σ_{i=1}^n ‖x^{k−i+1} − x*‖, which yields the claim in (22).
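As a quick check of the contraction factor used in (43) (a computation implicit in the proof above rather than stated in it):

```latex
% With \epsilon = 2/(\mu+L):
1-\epsilon\mu = \frac{L-\mu}{L+\mu}, \qquad 1-\epsilon L = -\frac{L-\mu}{L+\mu},
\qquad\text{so}\qquad
\max\{|1-\epsilon\mu|,\,|1-\epsilon L|\} = \frac{L-\mu}{L+\mu} = \frac{\kappa-1}{\kappa+1},
% which is the factor appearing in (43).
```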

Appendix B. Proof of Proposition 3

Consider the definition of the constant ρ := (κ − 1)/(κ + 1), where κ = L/µ is the condition number of the objective function. If all the copies y_i are initialized at x^0, the result in Lemma 2 implies that

    ‖x^1 − x*‖ ≤ ρ ‖x^0 − x*‖.    (45)

We can use the same inequality for the second iterate to obtain

    ‖x^2 − x*‖ ≤ (ρ/n) ‖x^1 − x*‖ + (ρ(n − 1)/n) ‖x^0 − x*‖
              ≤ (ρ^2/n) ‖x^0 − x*‖ + (ρ(n − 1)/n) ‖x^0 − x*‖
              = ρ ( 1 − (1 − ρ)/n ) ‖x^0 − x*‖,    (46)

where the second inequality follows by replacing ‖x^1 − x*‖ by its upper bound in (45), and the equality holds by regrouping the terms. We repeat the same process for the third residual ‖x^3 − x*‖ to obtain

    ‖x^3 − x*‖ ≤ (ρ/n) ‖x^2 − x*‖ + (ρ/n) ‖x^1 − x*‖ + (ρ(n − 2)/n) ‖x^0 − x*‖
              ≤ ρ ( 1 − 2(1 − ρ)/n − ρ(1 − ρ)/n^2 ) ‖x^0 − x*‖,    (47)

where in the second inequality we use the bounds in (45) and (46). Since the term −ρ(1 − ρ)/n^2 is negative, we can drop it and show that the residual ‖x^3 − x*‖ is upper bounded by

    ‖x^3 − x*‖ ≤ ρ ( 1 − 2(1 − ρ)/n ) ‖x^0 − x*‖.    (48)

Following the same logic, we can show that for the first n residuals {‖x^k − x*‖}_{k=1}^n the following inequality holds:

    ‖x^k − x*‖ ≤ ρ ( 1 − (k − 1)(1 − ρ)/n ) ‖x^0 − x*‖,   for k = 1, ..., n.    (49)

Now we proceed to prove the claim in (23) by induction. Assume that the following condition holds for some j ≥ 0:

    ‖x^k − x*‖ ≤ ρ^{j+1} ( 1 − (k − jn − 1)(1 − ρ)/n ) ‖x^0 − x*‖,   for k = jn + 1, ..., jn + n.    (50)

Then, our goal is to show that the same inequalities hold for j + 1, i.e.,

    ‖x^k − x*‖ ≤ ρ^{j+2} ( 1 − (k − (j + 1)n − 1)(1 − ρ)/n ) ‖x^0 − x*‖,   for k = (j + 1)n + 1, ..., (j + 1)n + n.    (51)

To do so, we start with the time index k = (j + 1)n + 1. According to the result in Lemma 2, we can bound the residual ‖x^{(j+1)n+1} − x*‖ by

    ‖x^{(j+1)n+1} − x*‖ ≤ (ρ/n) ‖x^{jn+1} − x*‖ + ... + (ρ/n) ‖x^{jn+n} − x*‖.    (52)

Considering the inequalities in (50), the terms ‖x^k − x*‖ for k = jn + 1, ..., jn + n are bounded above by ρ^{j+1} ‖x^0 − x*‖. Replacing the terms in (52) by this upper bound yields

    ‖x^{(j+1)n+1} − x*‖ ≤ ρ^{j+2} ‖x^0 − x*‖.    (53)

We use the result in Lemma 2 for k + 1 = (j + 1)n + 2 this time to obtain

    ‖x^{(j+1)n+2} − x*‖ ≤ (ρ/n) ‖x^{(j+1)n+1} − x*‖ + (ρ/n) ‖x^{jn+2} − x*‖ + ... + (ρ/n) ‖x^{jn+n} − x*‖.    (54)

By replacing the first summand on the right hand side of (54) by its upper bound in (53) and the remaining summands by the upper bound ρ^{j+1} ‖x^0 − x*‖, we can write

    ‖x^{(j+1)n+2} − x*‖ ≤ (ρ^{j+3}/n) ‖x^0 − x*‖ + (ρ^{j+2}(n − 1)/n) ‖x^0 − x*‖ = ρ^{j+2} ( 1 − (1 − ρ)/n ) ‖x^0 − x*‖.    (55)

Following the same steps for the next residual ‖x^{(j+1)n+3} − x*‖ yields

    ‖x^{(j+1)n+3} − x*‖ ≤ (ρ/n) ‖x^{(j+1)n+2} − x*‖ + (ρ/n) ‖x^{(j+1)n+1} − x*‖ + (ρ/n) ‖x^{jn+3} − x*‖ + ... + (ρ/n) ‖x^{jn+n} − x*‖
                       ≤ (2ρ^{j+3}/n) ‖x^0 − x*‖ + (ρ^{j+2}(n − 2)/n) ‖x^0 − x*‖
                       = ρ^{j+2} ( 1 − 2(1 − ρ)/n ) ‖x^0 − x*‖.    (56)

Note that in the second inequality we have replaced ‖x^{(j+1)n+2} − x*‖ by the upper bound ρ^{j+2} ‖x^0 − x*‖, which is looser than the upper bound in (55). By repeating the same scheme we can show that

    ‖x^{(j+1)n+u} − x*‖ ≤ ρ^{j+2} ( 1 − (u − 1)(1 − ρ)/n ) ‖x^0 − x*‖,    (57)

for u = 1, ..., n. Note that this result is identical to the claim in (51). Thus, if the set of inequalities in (50) holds for j, then the corresponding set of inequalities holds for j + 1. Therefore, the proof is complete by induction and the claim in (23) follows.

Appendix C. Proof of Theorem 5

According to the proof in Appendix B, for k = 1, ..., n the following inequality holds:

    ‖x^k − x*‖ ≤ ρ ( 1 − (k − 1)(1 − ρ)/n ) ‖x^0 − x*‖.    (58)

Combining this result with the condition on the constant a in (26), we obtain

    ‖x^k − x*‖ ≤ a γ^k ‖x^0 − x*‖,   for k = 1, ..., n.    (59)

Thus, the inequality in (25) holds for steps k = 1, ..., n.

Now we proceed to show that the claim in (25) also holds for k > n. To do so, we use an induction argument. Assume that we aim to show that the inequality in (25) holds for k = j, while it holds for the previous n iterates k = j − 1, ..., j − n. According to the result in Lemma 2, we can write

    ‖x^j − x*‖ ≤ (ρ/n) ( ‖x^{j−1} − x*‖ + ... + ‖x^{j−n} − x*‖ ).    (60)

Based on the induction hypothesis, the result in (25) holds for steps k = j − 1, ..., j − n. Thus, we can replace the terms on the right hand side of (60) by the upper bounds from (25). This substitution implies

    ‖x^j − x*‖ ≤ (ρa/n) ( γ^{j−1} + ... + γ^{j−n} ) ‖x^0 − x*‖ = ( ρ a γ^{j−n} (1 − γ^n) / (n(1 − γ)) ) ‖x^0 − x*‖.    (61)

Rearranging the terms in (27) shows that ρ(1 − γ^n)/(n(1 − γ)) is bounded above by γ^n. This is true since

    γ^{n+1} − (1 + ρ/n) γ^n + ρ/n ≤ 0   ⟺   ρ(1 − γ^n) − n γ^n (1 − γ) ≤ 0   ⟺   ρ(1 − γ^n)/(n(1 − γ)) ≤ γ^n.    (62)

Therefore, we can replace the term ρ(1 − γ^n)/(n(1 − γ)) in (61) by its upper bound γ^n to obtain

    ‖x^j − x*‖ ≤ a γ^j ‖x^0 − x*‖.    (63)

The result in (63) completes the proof. Thus, by induction the claim in (25) holds for all k ≥ 1 whenever the conditions in (26) and (27) are satisfied.

Appendix D. Proof of Proposition 6

To prove the claim in Proposition 6 we first derive the following lemma.

Lemma 11 For all n ≥ 1 and 0 ≤ φ ≤ 1 we have

    ( 1 − φ/n )^n ≤ ( 1 − φ/(n + 1) )^{n+1}.    (64)

Proof Consider the function h(x) = (1 − φ/x)^x for x > 1. The natural logarithm of h(x) is given by ln h(x) = x ln(1 − φ/x). Computing the derivative of both sides with respect to x we obtain

    (1/h(x)) dh/dx = ln(1 − φ/x) + (φ/x) / (1 − φ/x).    (65)

Multiplying both sides by h(x) and replacing h(x) by the expression (1 − φ/x)^x, we obtain that the derivative of h is given by

    dh/dx = (1 − φ/x)^x [ ln(1 − φ/x) + (φ/x) / (1 − φ/x) ].    (66)

Note that the sum ln(1 − u) + u/(1 − u) is always positive for 0 < u < 1. By setting u := φ/x, we conclude that the term on the right hand side of (66) is positive for x > 1. Therefore, the derivative dh/dx is always positive for x > 1. Thus, the function h(x) is increasing for x > 1 and we can write

    ( 1 − φ/n )^n ≤ ( 1 − φ/(n + 1) )^{n+1},    (67)

for n > 1.

It remains to show that the claim is also valid for n = 1, which is equivalent to the inequality

    1 − φ ≤ ( 1 − φ/2 )^2.    (68)

This inequality is trivial, and, therefore, the claim in (64) holds for all n ≥ 1.

Now we proceed to prove the claim in Proposition 6 using the result in Lemma 11. To prove that the feasible set of the condition in (27) is non-empty, we show that γ = ρ^{1/n} satisfies the inequality in (27). In other words,

    ρ^{(n+1)/n} − (1 + ρ/n) ρ + ρ/n ≤ 0.    (69)

Dividing both sides of (69) by ρ, regrouping the terms, and raising both sides to the power n, we obtain the inequality

    ρ ≤ ( 1 − (1 − ρ)/n )^n,    (70)

which is equivalent to (69). In other words, the inequality in (70) is a necessary and sufficient condition for the condition in (69). Recall the result in Lemma 11. By setting φ = 1 − ρ we obtain that

    ρ = ( 1 − (1 − ρ)/1 )^1 ≤ ( 1 − (1 − ρ)/2 )^2 ≤ ... ≤ ( 1 − (1 − ρ)/n )^n,    (71)

for n ≥ 1. Thus, the inequality in (70) holds and, consequently, the inequality in (69) is valid. Therefore, γ = ρ^{1/n} satisfies the inequality in (27). Then, we can define a as the smallest constant that satisfies (26) for the choice γ = ρ^{1/n}, which is given by

    a = max_{k=1,...,n} ρ ( 1 − (k − 1)(1 − ρ)/n ) ρ^{−k/n}.    (72)

Therefore, γ = ρ^{1/n} and the constant a in (72) satisfy the conditions in (26) and (27), and the claim in Proposition 6 follows.

References

Doron Blatt, Alfred O. Hero, and Hillel Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010. Springer, 2010.

Léon Bottou and Yann Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 21(2), 2005.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. On the convergence rate of incremental aggregated gradient algorithms. arXiv preprint, 2015.

Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods. arXiv preprint, 2013.

Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015.

Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2), 2015.

Yurii Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.

Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.

Ernest K. Ryu and Stephen Boyd. Primer on monotone operator methods. Appl. Comput. Math., 15(1):3–43, 2016.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1), 2013.

Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 2016.

Paul Tseng and Sangwoo Yun. Incrementally updated gradient methods for constrained and regularized optimization. Journal of Optimization Theory and Applications, 160(3), 2014.

Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2014.

Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.


More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Math 312 Lecture Notes One Dimensional Maps

Math 312 Lecture Notes One Dimensional Maps Math 312 Lecture Notes Oe Dimesioal Maps Warre Weckesser Departmet of Mathematics Colgate Uiversity 21-23 February 25 A Example We begi with the simplest model of populatio growth. Suppose, for example,

More information

4. Partial Sums and the Central Limit Theorem

4. Partial Sums and the Central Limit Theorem 1 of 10 7/16/2009 6:05 AM Virtual Laboratories > 6. Radom Samples > 1 2 3 4 5 6 7 4. Partial Sums ad the Cetral Limit Theorem The cetral limit theorem ad the law of large umbers are the two fudametal theorems

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

Lecture 10 October Minimaxity and least favorable prior sequences

Lecture 10 October Minimaxity and least favorable prior sequences STATS 300A: Theory of Statistics Fall 205 Lecture 0 October 22 Lecturer: Lester Mackey Scribe: Brya He, Rahul Makhijai Warig: These otes may cotai factual ad/or typographic errors. 0. Miimaxity ad least

More information

Axioms of Measure Theory

Axioms of Measure Theory MATH 532 Axioms of Measure Theory Dr. Neal, WKU I. The Space Throughout the course, we shall let X deote a geeric o-empty set. I geeral, we shall ot assume that ay algebraic structure exists o X so that

More information

Complex Analysis Spring 2001 Homework I Solution

Complex Analysis Spring 2001 Homework I Solution Complex Aalysis Sprig 2001 Homework I Solutio 1. Coway, Chapter 1, sectio 3, problem 3. Describe the set of poits satisfyig the equatio z a z + a = 2c, where c > 0 ad a R. To begi, we see from the triagle

More information

Are adaptive Mann iterations really adaptive?

Are adaptive Mann iterations really adaptive? MATHEMATICAL COMMUNICATIONS 399 Math. Commu., Vol. 4, No. 2, pp. 399-42 (2009) Are adaptive Ma iteratios really adaptive? Kamil S. Kazimierski, Departmet of Mathematics ad Computer Sciece, Uiversity of

More information

Lecture 15: Learning Theory: Concentration Inequalities

Lecture 15: Learning Theory: Concentration Inequalities STAT 425: Itroductio to Noparametric Statistics Witer 208 Lecture 5: Learig Theory: Cocetratio Iequalities Istructor: Ye-Chi Che 5. Itroductio Recall that i the lecture o classificatio, we have see that

More information

Fastest mixing Markov chain on a path

Fastest mixing Markov chain on a path Fastest mixig Markov chai o a path Stephe Boyd Persi Diacois Ju Su Li Xiao Revised July 2004 Abstract We ider the problem of assigig trasitio probabilities to the edges of a path, so the resultig Markov

More information

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4. 4. BASES I BAACH SPACES 39 4. BASES I BAACH SPACES Sice a Baach space X is a vector space, it must possess a Hamel, or vector space, basis, i.e., a subset {x γ } γ Γ whose fiite liear spa is all of X ad

More information

The log-behavior of n p(n) and n p(n)/n

The log-behavior of n p(n) and n p(n)/n Ramauja J. 44 017, 81-99 The log-behavior of p ad p/ William Y.C. Che 1 ad Ke Y. Zheg 1 Ceter for Applied Mathematics Tiaji Uiversity Tiaji 0007, P. R. Chia Ceter for Combiatorics, LPMC Nakai Uivercity

More information

Math 113 Exam 4 Practice

Math 113 Exam 4 Practice Math Exam 4 Practice Exam 4 will cover.-.. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for

More information

Sieve Estimators: Consistency and Rates of Convergence

Sieve Estimators: Consistency and Rates of Convergence EECS 598: Statistical Learig Theory, Witer 2014 Topic 6 Sieve Estimators: Cosistecy ad Rates of Covergece Lecturer: Clayto Scott Scribe: Julia Katz-Samuels, Brado Oselio, Pi-Yu Che Disclaimer: These otes

More information

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients. Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,

More information

An Alternative Scaling Factor In Broyden s Class Methods for Unconstrained Optimization

An Alternative Scaling Factor In Broyden s Class Methods for Unconstrained Optimization Joural of Mathematics ad Statistics 6 (): 63-67, 00 ISSN 549-3644 00 Sciece Publicatios A Alterative Scalig Factor I Broyde s Class Methods for Ucostraied Optimizatio Muhammad Fauzi bi Embog, Mustafa bi

More information

PAPER : IIT-JAM 2010

PAPER : IIT-JAM 2010 MATHEMATICS-MA (CODE A) Q.-Q.5: Oly oe optio is correct for each questio. Each questio carries (+6) marks for correct aswer ad ( ) marks for icorrect aswer.. Which of the followig coditios does NOT esure

More information

Distribution of Random Samples & Limit theorems

Distribution of Random Samples & Limit theorems STAT/MATH 395 A - PROBABILITY II UW Witer Quarter 2017 Néhémy Lim Distributio of Radom Samples & Limit theorems 1 Distributio of i.i.d. Samples Motivatig example. Assume that the goal of a study is to

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Lecture 19: Convergence

Lecture 19: Convergence Lecture 19: Covergece Asymptotic approach I statistical aalysis or iferece, a key to the success of fidig a good procedure is beig able to fid some momets ad/or distributios of various statistics. I may

More information

Read carefully the instructions on the answer book and make sure that the particulars required are entered on each answer book.

Read carefully the instructions on the answer book and make sure that the particulars required are entered on each answer book. THE UNIVERSITY OF WARWICK FIRST YEAR EXAMINATION: Jauary 2009 Aalysis I Time Allowed:.5 hours Read carefully the istructios o the aswer book ad make sure that the particulars required are etered o each

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam will cover.-.9. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for you

More information

A collocation method for singular integral equations with cosecant kernel via Semi-trigonometric interpolation

A collocation method for singular integral equations with cosecant kernel via Semi-trigonometric interpolation Iteratioal Joural of Mathematics Research. ISSN 0976-5840 Volume 9 Number 1 (017) pp. 45-51 Iteratioal Research Publicatio House http://www.irphouse.com A collocatio method for sigular itegral equatios

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

Strong Convergence Theorems According. to a New Iterative Scheme with Errors for. Mapping Nonself I-Asymptotically. Quasi-Nonexpansive Types

Strong Convergence Theorems According. to a New Iterative Scheme with Errors for. Mapping Nonself I-Asymptotically. Quasi-Nonexpansive Types It. Joural of Math. Aalysis, Vol. 4, 00, o. 5, 37-45 Strog Covergece Theorems Accordig to a New Iterative Scheme with Errors for Mappig Noself I-Asymptotically Quasi-Noexpasive Types Narogrit Puturog Mathematics

More information

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Kinetics of Complex Reactions

Kinetics of Complex Reactions Kietics of Complex Reactios by Flick Colema Departmet of Chemistry Wellesley College Wellesley MA 28 wcolema@wellesley.edu Copyright Flick Colema 996. All rights reserved. You are welcome to use this documet

More information

A constructive analysis of convex-valued demand correspondence for weakly uniformly rotund and monotonic preference

A constructive analysis of convex-valued demand correspondence for weakly uniformly rotund and monotonic preference MPRA Muich Persoal RePEc Archive A costructive aalysis of covex-valued demad correspodece for weakly uiformly rotud ad mootoic preferece Yasuhito Taaka ad Atsuhiro Satoh. May 04 Olie at http://mpra.ub.ui-mueche.de/55889/

More information

Element sampling: Part 2

Element sampling: Part 2 Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam 4 will cover.-., 0. ad 0.. Note that eve though. was tested i exam, questios from that sectios may also be o this exam. For practice problems o., refer to the last review. This

More information

There is no straightforward approach for choosing the warmup period l.

There is no straightforward approach for choosing the warmup period l. B. Maddah INDE 504 Discrete-Evet Simulatio Output Aalysis () Statistical Aalysis for Steady-State Parameters I a otermiatig simulatio, the iterest is i estimatig the log ru steady state measures of performace.

More information

ON THE LEHMER CONSTANT OF FINITE CYCLIC GROUPS

ON THE LEHMER CONSTANT OF FINITE CYCLIC GROUPS ON THE LEHMER CONSTANT OF FINITE CYCLIC GROUPS NORBERT KAIBLINGER Abstract. Results of Lid o Lehmer s problem iclude the value of the Lehmer costat of the fiite cyclic group Z/Z, for 5 ad all odd. By complemetary

More information

Uniform Strict Practical Stability Criteria for Impulsive Functional Differential Equations

Uniform Strict Practical Stability Criteria for Impulsive Functional Differential Equations Global Joural of Sciece Frotier Research Mathematics ad Decisio Scieces Volume 3 Issue Versio 0 Year 03 Type : Double Blid Peer Reviewed Iteratioal Research Joural Publisher: Global Jourals Ic (USA Olie

More information

Polynomial identity testing and global minimum cut

Polynomial identity testing and global minimum cut CHAPTER 6 Polyomial idetity testig ad global miimum cut I this lecture we will cosider two further problems that ca be solved usig probabilistic algorithms. I the first half, we will cosider the problem

More information

Math 61CM - Solutions to homework 3

Math 61CM - Solutions to homework 3 Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

Computation of Error Bounds for P-matrix Linear Complementarity Problems

Computation of Error Bounds for P-matrix Linear Complementarity Problems Mathematical Programmig mauscript No. (will be iserted by the editor) Xiaoju Che Shuhuag Xiag Computatio of Error Bouds for P-matrix Liear Complemetarity Problems Received: date / Accepted: date Abstract

More information

Journal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula

Journal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula Joural of Multivariate Aalysis 102 (2011) 1315 1319 Cotets lists available at ScieceDirect Joural of Multivariate Aalysis joural homepage: www.elsevier.com/locate/jmva Superefficiet estimatio of the margials

More information

Binary classification, Part 1

Binary classification, Part 1 Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

Machine Learning for Data Science (CS 4786)

Machine Learning for Data Science (CS 4786) Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete

More information

A class of spectral bounds for Max k-cut

A class of spectral bounds for Max k-cut A class of spectral bouds for Max k-cut Miguel F. Ajos, José Neto December 07 Abstract Let G be a udirected ad edge-weighted simple graph. I this paper we itroduce a class of bouds for the maximum k-cut

More information

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1.

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1. Eco 325/327 Notes o Sample Mea, Sample Proportio, Cetral Limit Theorem, Chi-square Distributio, Studet s t distributio 1 Sample Mea By Hiro Kasahara We cosider a radom sample from a populatio. Defiitio

More information

Math 341 Lecture #31 6.5: Power Series

Math 341 Lecture #31 6.5: Power Series Math 341 Lecture #31 6.5: Power Series We ow tur our attetio to a particular kid of series of fuctios, amely, power series, f(x = a x = a 0 + a 1 x + a 2 x 2 + where a R for all N. I terms of a series

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

Multi parameter proximal point algorithms

Multi parameter proximal point algorithms Multi parameter proximal poit algorithms Ogaeditse A. Boikayo a,b,, Gheorghe Moroşau a a Departmet of Mathematics ad its Applicatios Cetral Europea Uiversity Nador u. 9, H-1051 Budapest, Hugary b Departmet

More information

CS537. Numerical Analysis and Computing

CS537. Numerical Analysis and Computing CS57 Numerical Aalysis ad Computig Lecture Locatig Roots o Equatios Proessor Ju Zhag Departmet o Computer Sciece Uiversity o Ketucky Leigto KY 456-6 Jauary 9 9 What is the Root May physical system ca be

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

Mathematical Methods for Physics and Engineering

Mathematical Methods for Physics and Engineering Mathematical Methods for Physics ad Egieerig Lecture otes Sergei V. Shabaov Departmet of Mathematics, Uiversity of Florida, Gaiesville, FL 326 USA CHAPTER The theory of covergece. Numerical sequeces..

More information

Chapter 7 Isoperimetric problem

Chapter 7 Isoperimetric problem Chapter 7 Isoperimetric problem Recall that the isoperimetric problem (see the itroductio its coectio with ido s proble) is oe of the most classical problem of a shape optimizatio. It ca be formulated

More information