On Optimal Probabilities in Stochastic Coordinate Descent Methods
Peter Richtárik and Martin Takáč
University of Edinburgh, United Kingdom
October 2013

Abstract

We propose and analyze a new parallel coordinate descent method, NSync, in which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen non-uniformly. We derive convergence rates under a strong convexity assumption, and comment on how to assign probabilities to the sets to optimize the bound. The complexity and practical performance of the method can outperform its uniform variant by an order of magnitude. Surprisingly, the strategy of updating a single randomly selected coordinate per iteration with optimal probabilities may require fewer iterations, both in theory and in practice, than the strategy of updating all coordinates at every iteration.

1 Introduction

In this work we consider the optimization problem

    $\min_{x \in \mathbb{R}^n} \varphi(x)$,    (1)

where $\varphi$ is strongly convex and smooth. We propose a new algorithm, and call it NSync (Nonuniform SYNchronous coordinate descent).

Algorithm 1 (NSync)
Input: initial point $x^0 \in \mathbb{R}^n$, subset probabilities $\{p_S\}$ and stepsize parameters $w_1, \dots, w_n > 0$
for $k = 0, 1, 2, \dots$ do
    Select a random set of coordinates $\hat{S} \subseteq \{1, \dots, n\}$ such that $\mathrm{Prob}(\hat{S} = S) = p_S$
    Update the selected coordinates: $x^{k+1} = x^k - \sum_{i \in \hat{S}} \frac{1}{w_i} \nabla_i \varphi(x^k) e_i$
end for

In NSync, we first assign a probability $p_S \geq 0$ to every subset $S$ of $[n] := \{1, \dots, n\}$, with $\sum_S p_S = 1$, and pick stepsize parameters $w_i > 0$, $i = 1, 2, \dots, n$. At every iteration, a random set $\hat{S}$ is generated, independently from previous iterations, following the law $\mathrm{Prob}(\hat{S} = S) = p_S$, and then the coordinates $i \in \hat{S}$ are updated in parallel by moving in the direction of the negative partial derivative with stepsize $1/w_i$. The updates are synchronized: no processor/thread is allowed to proceed before all updates are applied, generating the new iterate $x^{k+1}$. We specifically study samplings $\hat{S}$ which are non-uniform in the sense that $p_i := \mathrm{Prob}(i \in \hat{S}) = \sum_{S : i \in S} p_S$ is allowed to vary with $i$. By $\nabla_i \varphi(x)$ we mean $\langle \nabla \varphi(x), e_i \rangle$, where $e_i \in \mathbb{R}^n$ is the $i$-th unit coordinate vector.
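To make the update rule concrete, here is a minimal sketch of Algorithm 1 in Python on a toy separable quadratic. The instance, the family of subsets, and the stepsizes are our own illustrative choices, not part of the paper; for this quadratic the natural choice $w_i = L_i$ makes each sampled coordinate jump straight to its minimizer.

```python
import random

# Hypothetical toy instance: phi(x) = 0.5 * sum(a_i * x_i^2), a separable
# strongly convex quadratic, so grad_i phi(x) = a_i * x_i.
a = [4.0, 1.0, 0.25]
n = len(a)

def grad_i(x, i):
    return a[i] * x[i]

def nsync(x0, subsets, probs, w, iters, seed=0):
    """One reading of Algorithm 1: draw a subset S with Prob(S) = probs[S],
    then update the coordinates in S 'in parallel' (sequentially here, which
    is equivalent since the updates touch disjoint coordinates)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        S = rng.choices(subsets, weights=probs)[0]
        g = {i: grad_i(x, i) for i in S}   # read all partials first (synchronous)
        for i in S:
            x[i] -= g[i] / w[i]            # stepsize 1/w_i per coordinate
    return x

# Uniform sampling over singletons, with stepsize parameters w_i = a_i.
subsets = [(0,), (1,), (2,)]
x = nsync([1.0, 1.0, 1.0], subsets, [1 / 3, 1 / 3, 1 / 3], w=a, iters=200)
phi = 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
```

After enough iterations every coordinate has almost surely been sampled at least once, so $\varphi(x^k)$ reaches the minimum value $0$.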
Literature. Serial stochastic coordinate descent methods were proposed and analyzed in [6, 13, 15, 18], and more recently in various settings in [12, 7, 8, 9, 21, 19, 24, 3]. Parallel methods were considered in [2, 16, 14], and more recently in [22, 5, 23, 4, 11, 20, 10, 1]. A memory distributed method scaling to big data problems was recently developed in [17]. A nonuniform coordinate
descent method updating a single coordinate at a time was proposed in [15], and one updating two coordinates at a time in [12]. To the best of our knowledge, NSync is the first nonuniform parallel coordinate descent method.

2 Analysis

Our analysis of NSync is based on two assumptions. The first assumption generalizes the ESO concept introduced in [16], and later used in [22, 23, 5, 4, 17], to nonuniform samplings. The second assumption requires that $\varphi$ be strongly convex.

Notation: For $x, y, u \in \mathbb{R}^n$ we write $\|x\|_u^2 := \sum_i u_i x_i^2$, $\langle x, y \rangle_u := \sum_{i=1}^n u_i x_i y_i$, $x \circ y := (x_1 y_1, \dots, x_n y_n)$ and $u^{-1} := (1/u_1, \dots, 1/u_n)$. For $S \subseteq [n]$ and $h \in \mathbb{R}^n$, let $h_{[S]} := \sum_{i \in S} h_i e_i$.

Assumption 1 (Nonuniform ESO: Expected Separable Overapproximation). Assume $p = (p_1, \dots, p_n)^T > 0$ and that for some positive vector $w \in \mathbb{R}^n$ and all $x, h \in \mathbb{R}^n$,

    $\mathrm{E}[\varphi(x + h_{[\hat{S}]})] \leq \varphi(x) + \langle \nabla \varphi(x), h \rangle_p + \tfrac{1}{2} \|h\|_{p \circ w}^2$.    (2)

Inequalities of type (2), in the uniform case ($p_i = p_j$ for all $i, j$), were studied in [16, 22, 5, 17].

Assumption 2 (Strong convexity). We assume that $\varphi$ is $\gamma$-strongly convex with respect to the norm $\|\cdot\|_v$, where $v = (v_1, \dots, v_n)^T > 0$ and $\gamma > 0$. That is, we require that for all $x, h \in \mathbb{R}^n$,

    $\varphi(x + h) \geq \varphi(x) + \langle \nabla \varphi(x), h \rangle + \tfrac{\gamma}{2} \|h\|_v^2$.    (3)

We can now establish a bound on the number of iterations sufficient for NSync to approximately solve (1) with high probability.

Theorem 3. Let Assumptions 1 and 2 be satisfied. Choose $x^0 \in \mathbb{R}^n$, $0 < \epsilon < \varphi(x^0) - \varphi^*$ and $0 < \rho < 1$, where $\varphi^* := \min_x \varphi(x)$. Let

    $\Lambda := \max_i \frac{w_i}{p_i v_i}$.    (4)

If $\{x^k\}$ are the random iterates generated by NSync, then

    $K \geq \frac{\Lambda}{\gamma} \log\left(\frac{\varphi(x^0) - \varphi^*}{\epsilon \rho}\right) \quad \Rightarrow \quad \mathrm{Prob}(\varphi(x^K) - \varphi^* \leq \epsilon) \geq 1 - \rho$.    (5)

Moreover, we have the lower bound $\Lambda \geq (\sum_i w_i / v_i) / \mathrm{E}[|\hat{S}|]$.

Proof. We first claim that $\varphi$ is $\mu$-strongly convex with respect to the norm $\|\cdot\|_{w \circ p^{-1}}$, i.e.,

    $\varphi(x + h) \geq \varphi(x) + \langle \nabla \varphi(x), h \rangle + \tfrac{\mu}{2} \|h\|_{w \circ p^{-1}}^2$,    (6)

where $\mu := \gamma / \Lambda$. Indeed, this follows by comparing (3) and (6) in the light of (4). Let $x^*$ be such that $\varphi(x^*) = \varphi^*$. Using (6) with $h = x^* - x$,

    $\varphi^* - \varphi(x) \overset{(6)}{\geq} \min_{h \in \mathbb{R}^n} \langle \nabla \varphi(x), h \rangle + \tfrac{\mu}{2} \|h\|_{w \circ p^{-1}}^2 = -\tfrac{1}{2\mu} \|\nabla \varphi(x)\|_{p \circ w^{-1}}^2$.    (7)

Let $h^k := -(\mathrm{Diag}(w))^{-1} \nabla \varphi(x^k)$.
Then $x^{k+1} = x^k + (h^k)_{[\hat{S}]}$, and utilizing Assumption 1, we get

    $\mathrm{E}[\varphi(x^{k+1}) \mid x^k] = \mathrm{E}[\varphi(x^k + (h^k)_{[\hat{S}]})] \overset{(2)}{\leq} \varphi(x^k) + \langle \nabla \varphi(x^k), h^k \rangle_p + \tfrac{1}{2} \|h^k\|_{p \circ w}^2$    (8)
    $= \varphi(x^k) - \tfrac{1}{2} \|\nabla \varphi(x^k)\|_{p \circ w^{-1}}^2 \overset{(7)}{\leq} \varphi(x^k) - \mu (\varphi(x^k) - \varphi^*)$.    (9)

Taking expectations in the last inequality and rearranging the terms, we obtain $\mathrm{E}[\varphi(x^{k+1}) - \varphi^*] \leq (1 - \mu) \mathrm{E}[\varphi(x^k) - \varphi^*] \leq (1 - \mu)^{k+1} (\varphi(x^0) - \varphi^*)$. Using this, the Markov inequality, and the definition of $K$, we finally get

    $\mathrm{Prob}(\varphi(x^K) - \varphi^* > \epsilon) \leq \mathrm{E}[\varphi(x^K) - \varphi^*] / \epsilon \leq (1 - \mu)^K (\varphi(x^0) - \varphi^*) / \epsilon \leq \rho$.

Let us now establish the last claim. First, note that (see [16, Sec 3.2] for more results of this type)

    $\sum_i p_i = \sum_i \sum_{S : i \in S} p_S = \sum_S p_S |S| = \mathrm{E}[|\hat{S}|]$.    (10)

Letting $\Delta := \{p \in \mathbb{R}^n : p \geq 0, \ \sum_i p_i = \mathrm{E}[|\hat{S}|]\}$, we have

    $\Lambda \overset{(4)+(10)}{\geq} \min_{p \in \Delta} \max_i \frac{w_i}{p_i v_i} = \frac{\sum_i w_i / v_i}{\mathrm{E}[|\hat{S}|]}$,

where the last equality follows since the optimal $p_i$ is proportional to $w_i / v_i$.
Theorem 3 is generic in the sense that we do not say when Assumptions 1 and 2 are satisfied, nor how one should go about choosing the stepsizes $w_i$ and probabilities $\{p_S\}$. In the next section we address these issues. On the other hand, this abstract setting allowed us to write a brief complexity proof.

Change of variables. Consider the change of variables $y = \mathrm{Diag}(d) x$, where $d > 0$. Defining $\varphi^d(y) := \varphi(x)$, we get $\nabla \varphi^d(y) = (\mathrm{Diag}(d))^{-1} \nabla \varphi(x)$. It can be seen that (2), (3) can equivalently be written in terms of $\varphi^d$, with $w_i$ replaced by $w_i^d := w_i / d_i^2$ and $v_i$ replaced by $v_i^d := v_i / d_i^2$. By choosing $d_i = \sqrt{v_i}$, we obtain $v_i^d = 1$ for all $i$, recovering standard strong convexity.

3 Nonuniform samplings and ESO

Consider now problem (1) with $\varphi$ of the form

    $\varphi(x) := f(x) + \tfrac{\gamma}{2} \|x\|_v^2$,    (11)

where $v > 0$. Note that Assumption 2 is satisfied. We further make the following two assumptions.

Assumption 4 (Smoothness). $f$ has Lipschitz gradient with respect to the coordinates, with positive constants $L_1, \dots, L_n$. That is, $|\nabla_i f(x + t e_i) - \nabla_i f(x)| \leq L_i |t|$ for all $x \in \mathbb{R}^n$ and $t \in \mathbb{R}$.

Assumption 5 (Partial separability). $f(x) = \sum_{J \in \mathcal{J}} f_J(x)$, where $\mathcal{J}$ is a finite collection of nonempty subsets of $[n]$ and the $f_J$ are differentiable convex functions such that $f_J$ depends on the coordinates $i \in J$ only. Let $\omega := \max_J |J|$. We say that $f$ is separable of degree $\omega$.

Uniform parallel coordinate descent methods for regularized problems with $f$ of the above structure were analyzed in [16].

Example 6. Let $f(x) = \tfrac{1}{2} \|Ax - b\|_2^2$, where $A \in \mathbb{R}^{m \times n}$. Then $L_i = \|A_{:i}\|_2^2$ and $f(x) = \tfrac{1}{2} \sum_{j=1}^m (A_{j:} x - b_j)^2$, whence $\omega$ is the maximum number of nonzeros in a row of $A$.

Nonuniform sampling. Instead of considering the general case of arbitrary probabilities $p_S$ assigned to all subsets of $[n]$, here we consider a special kind of sampling having two advantages: (i) the sets can be generated easily, (ii) it leads to larger stepsizes $1/w_i$ and hence an improved convergence rate. Fix $\tau \in [n]$ and $c$, and let $S_1, \dots, S_c$ be a collection of (possibly overlapping) subsets of $[n]$ such that $|S_j| \geq \tau$ for all $j$ and $\cup_{j=1}^c S_j = [n]$. Moreover, let $q = (q_1, \dots, q_c) > 0$ be a probability vector.
Let $\hat{S}_j$ be a $\tau$-nice sampling from $S_j$; that is, $\hat{S}_j$ picks subsets of $S_j$ having cardinality $\tau$, uniformly at random. We assume these samplings are independent. Now, $\hat{S}$ is generated as follows. We first pick $j \in \{1, \dots, c\}$ with probability $q_j$, and then draw $\hat{S}_j$. Note that we do not need to compute the quantities $p_S$, $S \subseteq [n]$, to execute NSync. In fact, it is much easier to implement the sampling via the two-tier procedure explained above. The sampling $\hat{S}$ is a nonuniform variant of the $\tau$-nice sampling studied in [16], which here arises as a special case for $c = 1$. Note that

    $p_i = \sum_{j=1}^c q_j \frac{\tau}{|S_j|} \delta_{ij} > 0, \quad i \in [n]$,    (12)

where $\delta_{ij} = 1$ if $i \in S_j$, and $\delta_{ij} = 0$ otherwise.

Theorem 7. Let Assumptions 4 and 5 be satisfied, and let $\hat{S}$ be the sampling described above. Then Assumption 1 is satisfied with $p$ given by (12) and any $w = (w_1, \dots, w_n)^T$ for which $w_i \geq w_i^*$, where

    $w_i^* := \frac{L_i + v_i}{p_i} \sum_{j=1}^c q_j \frac{\tau}{|S_j|} \delta_{ij} \left(1 + \frac{(\tau - 1)(\omega_j - 1)}{\max\{1, |S_j| - 1\}}\right), \quad i \in [n]$,    (13)

and $\omega_j := \max_{J \in \mathcal{J}} |J \cap S_j| \leq \omega$.

Proof. Since $f$ is separable of degree $\omega$, so is $\varphi$ (because $\tfrac{1}{2} \|x\|_v^2$ is separable). Now,

    $\mathrm{E}[\varphi(x + h_{[\hat{S}]})] = \mathrm{E}[\mathrm{E}[\varphi(x + h_{[\hat{S}_j]}) \mid j]] = \sum_{j=1}^c q_j \mathrm{E}[\varphi(x + h_{[\hat{S}_j]})]$    (14)
    $\leq \sum_{j=1}^c q_j \left\{ \varphi(x) + \frac{\tau}{|S_j|} \left( \langle \nabla \varphi(x), h_{[S_j]} \rangle + \frac{1}{2} \left(1 + \frac{(\tau - 1)(\omega_j - 1)}{\max\{1, |S_j| - 1\}}\right) \|h_{[S_j]}\|_{L + v}^2 \right) \right\}$,    (15)

where the last inequality follows from the ESO for $\tau$-nice samplings established in [16]. The claim now follows by comparing the above expression with (2).
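The two-tier sampling and the marginal probabilities (12) can be sketched as follows; the sets $S_j$, the vector $q$, and the trial count are illustrative assumptions of ours, and the empirical frequencies are compared against formula (12).

```python
import random

# Two-tier sampling: the sets S_j may overlap and must cover [n]; Shat is
# drawn by first picking j ~ q, then a tau-nice (uniformly random
# cardinality-tau) subset of S_j.
n, tau = 4, 2
S = [(0, 1, 2), (2, 3)]   # S_1, S_2: cover {0,...,3}, |S_j| >= tau
q = [0.6, 0.4]            # probability vector over the groups

def draw(rng):
    j = rng.choices(range(len(S)), weights=q)[0]
    return tuple(sorted(rng.sample(S[j], tau)))

# Marginal probabilities from formula (12): p_i = sum_j q_j * (tau/|S_j|) * delta_ij.
p = [sum(qj * tau / len(Sj) for qj, Sj in zip(q, S) if i in Sj)
     for i in range(n)]

# Empirical check that the drawn marginals match (12).
rng = random.Random(1)
trials = 200_000
counts = [0] * n
for _ in range(trials):
    for i in draw(rng):
        counts[i] += 1
emp = [c / trials for c in counts]
```

Note that $\sum_i p_i = \tau$, consistent with identity (10), since every draw updates exactly $\tau$ coordinates.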
4 Optimal probabilities

Observe that formula (13) can be used to design a sampling (i.e., to choose the sets $S_j$ and the probabilities $q_j$) that minimizes $\Lambda$, which in view of Theorem 3 optimizes the convergence rate of the method.

Serial setting. Consider the serial version of NSync ($\mathrm{Prob}(|\hat{S}| = 1) = 1$). We can model this via $c = n$, with $S_i = \{i\}$ and $p_i = q_i$ for all $i \in [n]$. In this case, using (12) and (13), we get $w_i = w_i^* = L_i + v_i$. Minimizing $\Lambda$ in (4) over the probability vector $p$ gives the optimal probabilities (we refer to this as the optimal serial method) and optimal complexity

    $p_i^* = \frac{(L_i + v_i)/v_i}{\sum_j (L_j + v_j)/v_j}, \quad i \in [n], \qquad \Lambda_{OS} = \sum_i \frac{L_i + v_i}{v_i} = n + \sum_i \frac{L_i}{v_i}$,    (16)

respectively. Note that uniform sampling, $p_i = 1/n$ for all $i$, leads to $\Lambda_{US} := n + n \max_j L_j / v_j$ (we call this the uniform serial method), which can be much larger than $\Lambda_{OS}$. Moreover, under the change of variables $y = \mathrm{Diag}(d) x$, the gradient of $f^d(y) := f(\mathrm{Diag}(d^{-1}) y)$ has coordinate Lipschitz constants $L_i^d = L_i / d_i^2$, while the weights in (11) change to $v_i^d = v_i / d_i^2$. Hence, the condition numbers $L_i / v_i$ cannot be improved via such a change of variables.

The optimal serial method can be faster than the fully parallel method. To model the fully parallel setting (i.e., the variant of NSync updating all coordinates at every iteration), we can set $c = 1$ and $\tau = n$, which yields $\Lambda_{FP} = \omega + \omega \max_j L_j / v_j$. Since $\omega \leq n$, it is clear that $\Lambda_{FP} \leq \Lambda_{US}$. However, for large enough $\omega$ it will be the case that $\Lambda_{OS} \leq \Lambda_{FP}$, implying, surprisingly, that the optimal serial method can be faster than the fully parallel method.

Parallel setting. Fix $\tau$ and the sets $S_j$, $j = 1, 2, \dots, c$, and define $\theta := \max_j \left(1 + \frac{(\tau - 1)(\omega_j - 1)}{\max\{1, |S_j| - 1\}}\right)$. Consider running NSync with stepsizes $w_i = \theta (L_i + v_i)$ (note that $w_i \geq w_i^*$, so we are fine). From (4), (12) and (13) we see that the complexity of NSync is determined by

    $\Lambda = \max_i \frac{w_i}{p_i v_i} = \frac{\theta}{\tau} \max_i \frac{1 + L_i / v_i}{\sum_{j=1}^c q_j \delta_{ij} / |S_j|}$.
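A quick numerical illustration of the serial constants (the problem sizes, $L$, $v$, and $\omega$ below are our own assumptions): with one badly conditioned coordinate, $\Lambda_{OS}$ from (16) can sit well below both $\Lambda_{US}$ and $\Lambda_{FP}$.

```python
# Compare Lambda_OS = n + sum_i L_i/v_i, Lambda_US = n + n*max_i L_i/v_i,
# and Lambda_FP = omega + omega*max_i L_i/v_i on assumed toy data.
n = 100
L = [1.0] * (n - 1) + [100.0]  # one coordinate with a much larger Lipschitz constant
v = [1.0] * n
omega = 50                     # assumed separability degree of f

ratios = [Li / vi for Li, vi in zip(L, v)]
lam_os = n + sum(ratios)               # 100 + 199 = 299
lam_us = n + n * max(ratios)           # 100 + 100*100 = 10100
lam_fp = omega + omega * max(ratios)   # 50 + 50*100 = 5050

# Optimal serial probabilities, formula (16).
weights = [(Li + vi) / vi for Li, vi in zip(L, v)]
total = sum(weights)
p_opt = [wi / total for wi in weights]
```

Here $\Lambda_{OS} < \Lambda_{FP} < \Lambda_{US}$, matching the claim that the optimal serial method can beat the fully parallel one when $\omega$ is large and the ratios $L_i/v_i$ are unbalanced.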
The probability vector $q$ minimizing this quantity can be computed by solving a linear program with $c + 1$ variables $(q_1, \dots, q_c, \alpha)$, $2n$ linear inequality constraints and a single linear equality constraint:

    $\max_{\alpha, q} \left\{ \alpha \ \text{ subject to } \ \alpha \leq (b^i)^T q \text{ for all } i, \ q \geq 0, \ \sum_j q_j = 1 \right\}$,

where the vectors $b^i \in \mathbb{R}^c$, $i \in [n]$, are given by $b_j^i = \frac{v_i \delta_{ij}}{(L_i + v_i) |S_j|}$.

5 Experiments

We now conduct two preliminary small-scale experiments to illustrate the theory; the results are depicted below. All experiments are with problems of the form (1), with $f$ chosen as in Example 6.

[Figure: left plot compares the uniform serial and optimal serial methods, plotting $\varphi(x^k) - \varphi^*$ against the iteration counter $k$; right plot compares the fully parallel and serial nonuniform methods, plotting $\varphi(x^k) - \varphi^*$ against epochs, for three values of $\omega$.]

In the left plot we chose $A \in \mathbb{R}^{2 \times 30}$, $\gamma = 1$, $v_1 = 0.05$, $v_i = 1$ for $i > 1$, and $L_i = 1$ for all $i$. We compare the US method ($p_i = 1/n$, blue) with the OS method ($p_i$ given by (16), red). The dashed lines show 95% confidence intervals (we ran the methods 100 times; the line in the middle is the average behavior). While OS can be faster, it is sensitive to over/under-estimation of the constants $L_i, v_i$. In the right plot we show that a nonuniform serial (NS) method can be faster than the fully parallel (FP) variant (we have chosen $m = 8$, $n = 10$ and three values of $\omega$). On the horizontal axis we display the number of epochs, where one epoch corresponds to updating $n$ coordinates (for FP this is a single iteration, whereas for NS it corresponds to $n$ iterations).
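Returning to the choice of $q$ in Section 4: for small $c$ the max-min linear program can be approximated by a brute-force sweep over the simplex, which is enough to see the effect. The data below ($n$, $\tau$, the sets $S_j$, $L$, $v$) are our own toy assumptions, and we drop the constant factor $\theta/\tau$, which does not depend on $q$.

```python
# Brute-force sketch of minimizing Lambda over q for c = 2, so q = (t, 1 - t).
n, tau = 4, 2
S = [(0, 1, 2), (2, 3)]
L = [1.0, 1.0, 1.0, 10.0]  # the hard coordinate (i = 3) lives only in S_2
v = [1.0] * n

def lam(q):
    """max_i (1 + L_i/v_i) / (sum_j q_j * delta_ij / |S_j|),
    i.e. Lambda up to the q-independent factor theta/tau."""
    worst = 0.0
    for i in range(n):
        denom = sum(qj / len(Sj) for qj, Sj in zip(q, S) if i in Sj)
        worst = max(worst, (1 + L[i] / v[i]) / denom)
    return worst

grid = [k / 1000 for k in range(1, 1000)]
best_t = min(grid, key=lambda t: lam((t, 1 - t)))
best = lam((best_t, 1 - best_t))
uniform = lam((0.5, 0.5))
```

For this instance the sweep pushes mass toward the group containing the badly conditioned coordinate: the optimum balances $6/q_1 = 22/q_2$, giving $q \approx (3/14, 11/14)$ and a constant of about $28$, versus $44$ for the uniform choice $q = (0.5, 0.5)$.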
References

[1] Y. Bian, X. Li, and Y. Liu. Parallel coordinate descent Newton for large-scale L1-regularized minimization. arXiv:1306.4080.
[2] J. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1-regularized loss minimization. In ICML, 2011.
[3] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. Technical report, Georgia Institute of Technology, 2013.
[4] O. Fercoq. Parallel coordinate descent for the AdaBoost problem. In ICMLA, 2013.
[5] O. Fercoq and P. Richtárik. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv preprint, 2013.
[6] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
[7] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, 2013.
[8] Z. Lu and L. Xiao. On the complexity analysis of randomized block-coordinate descent methods. arXiv preprint, 2013.
[9] Z. Lu and L. Xiao. Randomized block coordinate non-monotone gradient methods for a class of nonlinear programming. arXiv preprint, 2013.
[10] I. Mukherjee, Y. Singer, R. Frongillo, and K. Canini. Parallel boosting with momentum. In ECML, 2013.
[11] I. Necoara and D. Clipici. Efficient parallel coordinate descent algorithm for convex optimization problems with separable constraints: application to distributed MPC. Journal of Process Control, 23, 2013.
[12] I. Necoara, Yu. Nesterov, and F. Glineur. Efficiency of randomized coordinate descent methods on optimization problems with linearly coupled constraints. Technical report, 2012.
[13] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[14] P. Richtárik and M. Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. In Operations Research Proceedings. Springer, 2012.
[15] P. Richtárik and M. Takáč.
Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2012.
[16] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv preprint, 2012.
[17] P. Richtárik and M. Takáč. Distributed coordinate descent method for learning with big data. arXiv preprint, 2013.
[18] S. Shalev-Shwartz and A. Tewari. Stochastic methods for L1-regularized loss minimization. JMLR, 12, 2011.
[19] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.
[20] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. arXiv preprint, May 2013.
[21] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. JMLR, 14, 2013.
[22] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In ICML, 2013.
[23] R. Tappenden, P. Richtárik, and B. Büke. Separable approximations and decomposition methods for the augmented Lagrangian. arXiv preprint, 2013.
[24] R. Tappenden, P. Richtárik, and J. Gondzio. Inexact coordinate descent: complexity and preconditioning. arXiv preprint, 2013.
More informationAnother converse of Jensen s inequality
Another converse of Jensen s nequalty Slavko Smc Abstract. We gve the best possble global bounds for a form of dscrete Jensen s nequalty. By some examples ts frutfulness s shown. 1. Introducton Throughout
More information6.854J / J Advanced Algorithms Fall 2008
MIT OpenCourseWare http://ocw.mt.edu 6.854J / 18.415J Advanced Algorthms Fall 2008 For nformaton about ctng these materals or our Terms of Use, vst: http://ocw.mt.edu/terms. 18.415/6.854 Advanced Algorthms
More informationIV. Performance Optimization
IV. Performance Optmzaton A. Steepest descent algorthm defnton how to set up bounds on learnng rate mnmzaton n a lne (varyng learnng rate) momentum learnng examples B. Newton s method defnton Gauss-Newton
More informationMaximizing Overlap of Large Primary Sampling Units in Repeated Sampling: A comparison of Ernst s Method with Ohlsson s Method
Maxmzng Overlap of Large Prmary Samplng Unts n Repeated Samplng: A comparson of Ernst s Method wth Ohlsson s Method Red Rottach and Padrac Murphy 1 U.S. Census Bureau 4600 Slver Hll Road, Washngton DC
More informationRandomness and Computation
Randomness and Computaton or, Randomzed Algorthms Mary Cryan School of Informatcs Unversty of Ednburgh RC 208/9) Lecture 0 slde Balls n Bns m balls, n bns, and balls thrown unformly at random nto bns usually
More informationInexact Newton Methods for Inverse Eigenvalue Problems
Inexact Newton Methods for Inverse Egenvalue Problems Zheng-jan Ba Abstract In ths paper, we survey some of the latest development n usng nexact Newton-lke methods for solvng nverse egenvalue problems.
More informationConvergence rates of proximal gradient methods via the convex conjugate
Convergence rates of proxmal gradent methods va the convex conjugate Davd H Gutman Javer F Peña January 8, 018 Abstract We gve a novel proof of the O(1/ and O(1/ convergence rates of the proxmal gradent
More information18.1 Introduction and Recap
CS787: Advanced Algorthms Scrbe: Pryananda Shenoy and Shjn Kong Lecturer: Shuch Chawla Topc: Streamng Algorthmscontnued) Date: 0/26/2007 We contnue talng about streamng algorthms n ths lecture, ncludng
More informationDesign and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm
Desgn and Optmzaton of Fuzzy Controller for Inverse Pendulum System Usng Genetc Algorthm H. Mehraban A. Ashoor Unversty of Tehran Unversty of Tehran h.mehraban@ece.ut.ac.r a.ashoor@ece.ut.ac.r Abstract:
More informationImportance Sampling for Minibatches
Importance Samplng for Mnbatches Domnk Csba and Peter Rchtárk School of Mathematcs Unversty of Ednburgh Unted Kngdom arxv:602.02283v [cs.lg] 6 Feb 206 February 9, 206 Abstract Mnbatchng s a very well studed
More informationOn a direct solver for linear least squares problems
ISSN 2066-6594 Ann. Acad. Rom. Sc. Ser. Math. Appl. Vol. 8, No. 2/2016 On a drect solver for lnear least squares problems Constantn Popa Abstract The Null Space (NS) algorthm s a drect solver for lnear
More informationCOS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013
COS 511: heoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 15 Scrbe: Jemng Mao Aprl 1, 013 1 Bref revew 1.1 Learnng wth expert advce Last tme, we started to talk about learnng wth expert advce.
More informationOnline Classification: Perceptron and Winnow
E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng
More informationMATH 5707 HOMEWORK 4 SOLUTIONS 2. 2 i 2p i E(X i ) + E(Xi 2 ) ä i=1. i=1
MATH 5707 HOMEWORK 4 SOLUTIONS CİHAN BAHRAN 1. Let v 1,..., v n R m, all lengths v are not larger than 1. Let p 1,..., p n [0, 1] be arbtrary and set w = p 1 v 1 + + p n v n. Then there exst ε 1,..., ε
More informationInexact Variable Metric Stochastic Block-Coordinate Descent for Regularized Optimization
Inexact Varable Metrc Stochastc Block-Coordnate Descent for Regularzed Optmzaton LEE Chng-pe Department of Computer Scences Unversty of Wsconsn-Madson Madson, WI 53706, USA chng-pe@cs.wsc.edu Stephen J.
More informationInner Product. Euclidean Space. Orthonormal Basis. Orthogonal
Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,
More informationFor now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.
Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson
More informationCHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE
CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng
More informationP R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /
Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons
More informationRandom Projection Algorithms for Convex Set Intersection Problems
Random Projecton Algorthms for Convex Set Intersecton Problems A. Nedć Department of Industral and Enterprse Systems Engneerng Unversty of Illnos, Urbana, IL 61801 angela@llnos.edu Abstract The focus of
More informationGeneral viscosity iterative method for a sequence of quasi-nonexpansive mappings
Avalable onlne at www.tjnsa.com J. Nonlnear Sc. Appl. 9 (2016), 5672 5682 Research Artcle General vscosty teratve method for a sequence of quas-nonexpansve mappngs Cuje Zhang, Ynan Wang College of Scence,
More informationA note on almost sure behavior of randomly weighted sums of φ-mixing random variables with φ-mixing weights
ACTA ET COMMENTATIONES UNIVERSITATIS TARTUENSIS DE MATHEMATICA Volume 7, Number 2, December 203 Avalable onlne at http://acutm.math.ut.ee A note on almost sure behavor of randomly weghted sums of φ-mxng
More informationP exp(tx) = 1 + t 2k M 2k. k N
1. Subgaussan tals Defnton. Say that a random varable X has a subgaussan dstrbuton wth scale factor σ< f P exp(tx) exp(σ 2 t 2 /2) for all real t. For example, f X s dstrbuted N(,σ 2 ) then t s subgaussan.
More informationLecture 3. Ax x i a i. i i
18.409 The Behavor of Algorthms n Practce 2/14/2 Lecturer: Dan Spelman Lecture 3 Scrbe: Arvnd Sankar 1 Largest sngular value In order to bound the condton number, we need an upper bound on the largest
More information3.1 ML and Empirical Distribution
67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum
More informationStanford University CS254: Computational Complexity Notes 7 Luca Trevisan January 29, Notes for Lecture 7
Stanford Unversty CS54: Computatonal Complexty Notes 7 Luca Trevsan January 9, 014 Notes for Lecture 7 1 Approxmate Countng wt an N oracle We complete te proof of te followng result: Teorem 1 For every
More information1 The Mistake Bound Model
5-850: Advanced Algorthms CMU, Sprng 07 Lecture #: Onlne Learnng and Multplcatve Weghts February 7, 07 Lecturer: Anupam Gupta Scrbe: Bryan Lee,Albert Gu, Eugene Cho he Mstake Bound Model Suppose there
More informationSolving Nonlinear Differential Equations by a Neural Network Method
Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,
More informationKernel Methods and SVMs Extension
Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general
More informationNumerical Heat and Mass Transfer
Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and
More informationLecture Space-Bounded Derandomization
Notes on Complexty Theory Last updated: October, 2008 Jonathan Katz Lecture Space-Bounded Derandomzaton 1 Space-Bounded Derandomzaton We now dscuss derandomzaton of space-bounded algorthms. Here non-trval
More information