Primal Method for ERM with Flexible Mini-batching Schemes and Non-convex Losses


Dominik Csiba (School of Mathematics, The University of Edinburgh, United Kingdom; cdominik@gmail.com)
Peter Richtárik (School of Mathematics, The University of Edinburgh, United Kingdom; peter.richtarik@ed.ac.uk)

June 17, 2015

The authors acknowledge support from the EPSRC Grant EP/K02325X/1, "Accelerated Coordinate Descent Methods for Big Data Optimization".

Abstract

In this work we develop a new algorithm for regularized empirical risk minimization. Our method extends recent techniques of Shalev-Shwartz [02/2015], which enable a dual-free analysis of SDCA, to arbitrary mini-batching schemes. Moreover, our method is able to better utilize the information in the data defining the ERM problem. For convex loss functions, our complexity results match those of QUARTZ, which is a primal-dual method also allowing for arbitrary mini-batching schemes. The advantage of a dual-free analysis comes from the fact that it guarantees convergence even for non-convex loss functions, as long as the average loss is convex. We illustrate through experiments the utility of being able to design arbitrary mini-batching schemes.

1 Introduction

Empirical risk minimization (ERM) is a very successful and immensely popular paradigm in machine learning, used to train a variety of prediction and classification models. Given examples A_1, ..., A_n ∈ R^{d×m}, loss functions φ_1, ..., φ_n : R^m → R and a regularization parameter λ > 0, the L2-regularized ERM problem is an optimization problem of the form

    min_{w ∈ R^d}  [ P(w) := (1/n) Σ_{i=1}^n φ_i(A_i^⊤ w) + (λ/2) ‖w‖² ].    (1)

Throughout the paper we shall assume that for each i, the loss function φ_i is l_i-smooth with l_i > 0. That is, for all x, y ∈ R^m and all i ∈ [n] := {1, 2, ..., n}, we have

    ‖φ_i'(x) − φ_i'(y)‖ ≤ l_i ‖x − y‖.    (2)

Further, let L_1, ..., L_n > 0 be constants for which the inequality

    ‖φ_i'(A_i^⊤ w) − φ_i'(A_i^⊤ z)‖ ≤ L_i ‖w − z‖    (3)

holds for all w, z ∈ R^d and all i, and let L := max_i L_i. Note that we can always bound L_i ≤ l_i ‖A_i‖. However, L_i can be better (smaller) than l_i ‖A_i‖.
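To make the notation concrete, the following is a small Python/NumPy sketch (ours, not part of the paper) of the objective (1) and of one per-example gradient A_i φ_i'(A_i^⊤ w) in the special case m = 1, using the logistic loss that appears later in the experiments; the function names and the dense row-matrix representation of the examples are illustrative assumptions.

    import numpy as np

    def primal_objective(A, y, w, lam):
        # P(w) = (1/n) * sum_i phi_i(A_i^T w) + (lam/2) * ||w||^2, as in (1) with m = 1,
        # where phi_i(s) = log(1 + exp(-y_i * s)) is the (1/4)-smooth logistic loss.
        margins = A @ w                       # vector of A_i^T w; rows of A are the examples
        return np.mean(np.logaddexp(0.0, -y * margins)) + 0.5 * lam * (w @ w)

    def example_gradient(A, y, w, i):
        # A_i * phi_i'(A_i^T w): the per-example gradient whose computation is one "step"
        # of the methods discussed below; it costs O(nnz(A_i)) for a sparse example.
        s = A[i] @ w
        return A[i] * (-y[i] / (1.0 + np.exp(y[i] * s)))

Any other l_i-smooth loss can be substituted by replacing the two logistic expressions.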

1.1 Background

In the last few years, a lot of research effort was put into designing new efficient algorithms for solving this problem (and some of its modifications). The frenzy of activity was motivated by the realization that SGD [1], not so long ago considered the state-of-the-art method for ERM, was far from being optimal, and that new ideas can lead to algorithms which are far superior to SGD in both theory and practice. The methods that belong to this category include SAG [2], SDCA [3], SVRG [4], S2GD [5], mS2GD [6], SAGA [7], S2CD [8], QUARTZ [9], ASDCA [10], prox-SDCA [11], IPROX-SDCA [12], A-PROX-SDCA [13], AdaSDCA [14] and SDNA [15]. Methods analyzed for arbitrary mini-batching schemes include NSync [16], ALPHA [17] and QUARTZ [9].

In order to find an ε-solution in expectation, state-of-the-art (non-accelerated) methods for solving (1) only need O((n + κ) log(1/ε)) steps, where each step involves the computation of the gradient φ_i'(A_i^⊤ w) for some randomly selected example i. The quantity κ is the condition number. Typically one has κ = max_i l_i ‖A_i‖² / λ for methods picking i uniformly at random, and κ = (1/(nλ)) Σ_i l_i ‖A_i‖² for methods picking i using a carefully designed data-dependent importance sampling. Computation of such a gradient typically involves work which is equivalent to reading the example A_i, that is, O(nnz(A_i)) ≤ O(dm) arithmetic operations.

1.2 Contributions

In this work we develop a new algorithm for the L2-regularized ERM problem (1). Our method extends a technique recently introduced by Shalev-Shwartz [18], which enables a dual-free analysis of SDCA, to arbitrary mini-batching schemes. That is, our method works at each iteration with a random subset of examples, chosen in an i.i.d. fashion from an arbitrary distribution. Such flexible schemes are useful for various reasons, including (i) the development of distributed or robust variants of the method, (ii) the design of importance sampling for improving the complexity rate, (iii) the design of a sampling which is aimed at obtaining efficiencies elsewhere, such as utilizing NUMA (non-uniform memory access) architectures, and (iv) streamlining and speeding up the processing of each mini-batch by means of assigning to each processor an approximately even workload so as to reduce idle time (we do experiments with the latter setup).

In comparison with [18], our method is able to better utilize the information in the data examples A_1, ..., A_n, leading to a better data-dependent bound. For convex loss functions, our complexity results match those of QUARTZ [9] in terms of the rate (the logarithmic factors differ). QUARTZ is a primal-dual method also allowing for arbitrary mini-batching schemes. However, while [9] only characterize the decay of expected risk, we also give bounds for the sequence of iterates. In particular, we show that for convex loss functions, our method enjoys the rate (Theorem 2)

    max_i ( 1/p_i + l_i v_i / (λ p_i n) ) log( (L + λ) E^(0) / (λ ε) ),

where p_i is the probability that coordinate i is updated in an iteration, v_1, ..., v_n > 0 are certain stepsize parameters of the method associated with the sampling and data (see (6)), and E^(0) is a constant depending on the starting point. For instance, in the special case of picking a single example at a time uniformly at random, we have p_i = 1/n and v_i = ‖A_i‖², whereby we obtain one of the O((n + κ) log(1/ε)) rates mentioned above. The other rate can be recovered using importance sampling.
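For illustration, the two condition numbers mentioned above can be computed as follows in the serial (single-example) setting; this is our own sketch, assuming the rows of A are the examples and l[i] is the smoothness constant of φ_i.

    import numpy as np

    def condition_numbers(A, l, lam):
        # q_i = l_i * ||A_i||^2. Uniform sampling: kappa = max_i q_i / lam.
        # Importance sampling:                     kappa = (1/n) * sum_i q_i / lam.
        q = l * np.sum(A * A, axis=1)
        return q.max() / lam, q.mean() / lam

The gap between the two values indicates how much a data-dependent sampling can help on a given dataset.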

The advantage of a dual-free analysis comes from the fact that it guarantees convergence even for non-convex loss functions, as long as the average loss is convex. This is a step toward understanding non-convex models. In particular, we show that for non-convex loss functions, our method enjoys the rate (Theorem 1)

    max_i ( 1/p_i + L_i² v_i / (λ² p_i n) ) log( (L + λ) D^(0) / (λ ε) ),

where D^(0) is a constant depending on the starting point.

Finally, we illustrate through experiments with chunking (a simple load balancing technique) the utility of being able to design arbitrary mini-batching schemes.

2 Algorithm

We shall now describe the method (Algorithm 1).

Algorithm 1  dfSDCA: Dual-Free SDCA with Arbitrary Sampling
    Parameters: sampling Ŝ, stepsize θ
    Initialization: α_1^(0), ..., α_n^(0) ∈ R^m; set w^(0) = (1/(λn)) Σ_{i=1}^n A_i α_i^(0) and p_i = Prob(i ∈ Ŝ)
    for t ≥ 1 do
        Sample a set S_t according to Ŝ
        for i ∈ S_t do
            α_i^(t) = α_i^(t−1) − θ p_i^{−1} ( φ_i'(A_i^⊤ w^(t−1)) + α_i^(t−1) )
        w^(t) = w^(t−1) − Σ_{i ∈ S_t} θ (nλ p_i)^{−1} A_i ( φ_i'(A_i^⊤ w^(t−1)) + α_i^(t−1) )

The method encodes a family of algorithms, depending on the choice of the sampling Ŝ, which encodes a particular mini-batching scheme. Formally, a sampling Ŝ is a set-valued random variable with values being the subsets of [n], i.e., subsets of examples. In this paper, we use the terms mini-batching scheme and sampling interchangeably. A sampling is defined by the collection of probabilities Prob(Ŝ = S) assigned to every subset S ⊆ [n] of the examples.

The method maintains n vectors α_i ∈ R^m and a vector w ∈ R^d. At the beginning of step t, we have α_i^(t−1) for all i and w^(t−1) computed and stored in memory. We then pick a random subset S_t of the examples, according to the mini-batching scheme, and update the variables α_i for i ∈ S_t, based on the computation of the gradients φ_i'(A_i^⊤ w^(t−1)) for i ∈ S_t. This is followed by an update of the vector w, which is performed so as to maintain the relation

    w^(t) = (1/(λn)) Σ_{i=1}^n A_i α_i^(t).    (4)
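For concreteness, the following is a minimal Python/NumPy sketch of Algorithm 1 (ours, not the authors' code) for the case m = 1, with a dense data matrix whose rows are the examples A_i and with the logistic-loss derivative as the default φ_i'; the sampling is supplied as a function returning one random subset S_t per call, and all names are illustrative assumptions.

    import numpy as np

    def dfsdca(A, y, lam, theta, p, sample, T, grad=None):
        # A: (n, d) data matrix, y: labels, lam: regularization parameter lambda,
        # theta: stepsize, p[i] = Prob(i in S-hat), sample(): one random subset S_t,
        # grad(s, y_i): derivative phi_i'(s); defaults to the logistic-loss derivative.
        if grad is None:
            grad = lambda s, yi: -yi / (1.0 + np.exp(yi * s))
        n = A.shape[0]
        alpha = np.zeros(n)                    # alpha_i^(0) = 0 for all i
        w = (A.T @ alpha) / (lam * n)          # maintains w = (1/(lam*n)) sum_i A_i alpha_i, cf. (4)
        for _ in range(T):
            S = sample()
            w_old = w.copy()                   # all gradients in this iteration use w^(t-1)
            for i in S:
                kappa = grad(A[i] @ w_old, y[i]) + alpha[i]   # phi_i'(A_i^T w^(t-1)) + alpha_i^(t-1)
                alpha[i] -= (theta / p[i]) * kappa
                w -= theta / (n * lam * p[i]) * kappa * A[i]
        return w, alpha

For serial uniform sampling one would pass p = np.full(n, 1.0/n) and sample = lambda: [np.random.randint(n)], with θ chosen according to Theorem 1 or Theorem 2 below.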

This relation is maintained for the following reason. If w* is the optimal solution to (1), then

    0 = ∇P(w*) = (1/n) Σ_{i=1}^n A_i φ_i'(A_i^⊤ w*) + λ w*,    (5)

and hence w* = (1/(λn)) Σ_{i=1}^n A_i α_i*, where α_i* := −φ_i'(A_i^⊤ w*). So, if we believe that the variables α_i converge to α_i* = −φ_i'(A_i^⊤ w*), it indeed does make sense to maintain (4). Why should we believe this? This is where the specific update of the dual variables α_i comes from: α_i is set to a convex combination of its previous value and our best estimate so far of −φ_i'(A_i^⊤ w*), namely, −φ_i'(A_i^⊤ w^(t−1)). Indeed, the update can be written as

    α_i^(t) = (1 − θ p_i^{−1}) α_i^(t−1) + θ p_i^{−1} ( −φ_i'(A_i^⊤ w^(t−1)) ).

Why does this make sense? Because we believe that w^(t) converges to w*. Admittedly, this reasoning is somewhat circular. However, a better word to describe this reasoning would be: iterative.

3 Main Results

Let p_i := P(i ∈ Ŝ). We assume the knowledge of parameters v_1, ..., v_n > 0 for which

    E ‖ Σ_{i ∈ Ŝ} A_i h_i ‖² ≤ Σ_{i=1}^n p_i v_i ‖h_i‖²  for all h_1, ..., h_n ∈ R^m.    (6)

Tight and easily computable formulas for such parameters can be found in [19]. For instance, whenever Prob(|Ŝ| ≤ τ) = 1, inequality (6) holds with v_i = τ ‖A_i‖². To simplify the exposition, we will write

    B^(t) := ‖w^(t) − w*‖²,    C_i^(t) := ‖α_i^(t) − α_i*‖²,    i = 1, 2, ..., n.    (7)

3.1 Non-convex loss functions

Our result will be expressed in terms of the decay of the potential

    D^(t) := (λ/2) B^(t) + (λ/(2n)) Σ_{i=1}^n C_i^(t) / L_i²,

where B^(t) and C_i^(t) are defined in (7).

Theorem 1. Assume that the average loss function, (1/n) Σ_{i=1}^n φ_i, is convex. If (3) holds and we let

    θ ≤ min_i  p_i n λ² / (L_i² v_i + n λ²),    (8)

then for t ≥ 0 the potential D^(t) decays exponentially to zero as

    E[ D^(t) ] ≤ e^{−θt} D^(0).    (9)

Moreover, if we set θ equal to the upper bound in (8), then

    T ≥ max_i ( 1/p_i + L_i² v_i / (λ² p_i n) ) log( (L + λ) D^(0) / (λ ε) )    implies    E[ P(w^(T)) − P(w*) ] ≤ ε.
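The bound (8) and the iteration count of Theorem 1 are straightforward to evaluate numerically. The sketch below (ours) uses the simple choice v_i = τ ‖A_i‖², valid whenever |Ŝ| ≤ τ almost surely as noted after (6); the arguments p, L, D0 and Lmax stand for the probabilities p_i, the constants L_i, the initial potential D^(0) and L = max_i L_i, and are illustrative names.

    import numpy as np

    def theta_nonconvex(A, p, L, lam, tau):
        # v_i = tau * ||A_i||^2 and the stepsize bound (8):
        # theta <= min_i p_i * n * lam^2 / (L_i^2 * v_i + n * lam^2).
        n = A.shape[0]
        v = tau * np.sum(A * A, axis=1)
        return np.min(p * n * lam**2 / (L**2 * v + n * lam**2)), v

    def iterations_nonconvex(p, v, L, lam, eps, D0, Lmax):
        # Iteration bound of Theorem 1:
        # T >= max_i (1/p_i + L_i^2 v_i / (lam^2 p_i n)) * log((Lmax + lam) * D0 / (lam * eps)).
        n = len(p)
        rate = np.max(1.0 / p + (L**2 * v) / (lam**2 * p * n))
        return rate * np.log((Lmax + lam) * D0 / (lam * eps))

The analogous quantities for the convex case of Theorem 2 below are obtained by replacing L_i² with l_i and λ² with λ.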

3.2 Convex loss functions

Our result will be expressed in terms of the decay of the potential

    E^(t) := (λ/2) B^(t) + (1/(2n)) Σ_{i=1}^n C_i^(t) / l_i,

where B^(t) and C_i^(t) are defined in (7).

Theorem 2. Assume that all loss functions {φ_i} are convex and satisfy (2). If we run Algorithm 1 with parameter θ satisfying the inequality

    θ ≤ min_i  p_i n λ / (l_i v_i + n λ),    (10)

then for t ≥ 0 the potential E^(t) decays exponentially to zero as

    E[ E^(t) ] ≤ e^{−θt} E^(0).    (11)

Moreover, if we set θ equal to the upper bound in (10), then

    T ≥ max_i ( 1/p_i + l_i v_i / (λ p_i n) ) log( (L + λ) E^(0) / (λ ε) )    implies    E[ P(w^(T)) − P(w*) ] ≤ ε.

The rate, 1/θ, precisely matches that of the QUARTZ algorithm [9]. QUARTZ is the only other method for ERM which has been analyzed for an arbitrary mini-batching scheme. Our algorithm is dual-free, and as we have seen above, allows for an analysis covering the case of non-convex loss functions.

4 Chunking

In this section we illustrate one use of the ability of our method to work with an arbitrary mini-batching scheme. Further examples include the ability to design distributed variants of the method [20], or the use of importance/adaptive sampling to lower the number of iterations [2, 12, 9, 14].

One marked disadvantage of standard mini-batching (choose a subset of examples uniformly at random) used in the context of parallel processing on multicore processors is the fact that in a synchronous implementation there is a loss of efficiency: the computation time of φ_i'(A_i^⊤ w) may differ substantially across examples i. This is caused by the data examples having varying degrees of sparsity. We hence introduce a new sampling which mitigates this issue.

Chunks: Choose sets G_1, ..., G_k ⊆ [n] such that ∪_{i=1}^k G_i = [n] and G_i ∩ G_j = ∅ for i ≠ j, and such that ψ(i) := Σ_{j ∈ G_i} nnz(A_j) is similar for every i, i.e. ψ(1) ≈ ... ≈ ψ(k). Instead of sampling τ coordinates we propose a new sampling, which on each iteration t samples τ sets G_{(1)}^(t), ..., G_{(τ)}^(t) out of G_1, ..., G_k and uses the coordinates ∪_{i=1}^τ G_{(i)}^(t) as the sampled set. We assign each core one of the sets G_{(i)}^(t) for parallel computation. The advantage of this sampling lies in the fact that the load of computing φ_i'(A_i^⊤ w) for all i ∈ G_j is similar for all j ∈ [k]. Hence, using this sampling we minimize the waiting time of the processors.
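As an illustration (ours), the chunk-based sampling just described can be generated as follows; groups is a list of index arrays G_1, ..., G_k forming a partition of {0, ..., n−1}, and under this sampling every coordinate i has p_i = τ/k.

    import numpy as np

    def make_chunk_sampling(groups, tau, seed=None):
        # Pick tau of the k chunks uniformly at random without replacement and
        # return the union of their coordinates as the sampled set S_t.
        rng = np.random.default_rng(seed)
        k = len(groups)
        def sample():
            chosen = rng.choice(k, size=tau, replace=False)
            return np.concatenate([groups[j] for j in chosen])
        return sample

Such a sample() function can be plugged directly into the dfSDCA sketch given earlier.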

How to choose G_1, ..., G_k: We introduce the following algorithm.

Algorithm 2  Naive Chunks
    Parameters: vector u of nnz counts
    Initialization: n = length(u); empty vectors g and s of length n; m = max(u); g[1] = 1, s[1] = u[1], i = 1
    for t = 2 : n do
        if s[i] + u[t] ≤ m then
            g[i] = g[i] + 1,  s[i] = s[i] + u[t]
        else
            i = i + 1,  g[i] = 1,  s[i] = u[t]

The algorithm returns the partition of [n] into G_1, ..., G_k in the sense that the first g[1] coordinates belong to G_1, the next g[2] coordinates belong to G_2, and so on. The main advantage of this approach is that it makes a preprocessing step on the dataset which takes just one pass through the data.

In Figure 1a through Figure 1f we show the impact of Algorithm 2 on the distribution of the waiting time of a single core, which we measure by the differences

    max_{i ∈ S_t} {nnz(A_i)} − (1/τ) Σ_{i ∈ S_t} nnz(A_i)    and    max_{i ∈ [τ]} {nnz(G_{(i)}^(t))} − (1/τ) Σ_{i=1}^τ nnz(G_{(i)}^(t))

for the initial and the preprocessed dataset, respectively. We can observe that the waiting time is smaller using the preprocessing.

5 Experiments

In all our experiments we used logistic regression. We normalized the datasets so that max_i ‖A_i‖ = 1, and fixed λ = 1/n. The datasets used for the experiments are summarized in Table 1.

    Dataset     #samples    #features    sparsity
    w8a         49,749      300          -
    dorothea    800         100,000      -
    protein     17,766      357          -
    rcv1        20,242      47,236       -
    cov         581,012     54           -

    Table 1: Datasets used in the experiments.

Experiment 1. In Figure 2a we compare the performance of Algorithm 1 with uniform serial sampling against state-of-the-art algorithms such as SGD [1], SAG [2] and S2GD [5] in the number of epochs. The real running time of the algorithms was 0.46s for S2GD, 0.79s for SAG, 0.47s for SDCA and 0.58s for SGD. In Figure 2b we show the convergence rate for different regularization parameters λ. In Figure 2c we show convergence rates for different serial samplings: uniform, importance [12] and also 4 different randomly generated serial samplings. These samplings were generated in a controlled manner, such that "random c" has (max_i p_i)/(min_i p_i) < c. All of these samplings exhibit linear convergence, as predicted by the theory.

Experiment 2: New sampling vs. old sampling. In Figure 3a through Figure 3l we compare the performance of a standard parallel sampling against the sampling of blocks G_1, ..., G_k output by Algorithm 2. In each iteration we measure the time by

    max_{i ∈ S_t} {nnz(A_i)}    and    max_{i ∈ [τ]} {nnz(G_{(i)})}

for the standard and the new sampling, respectively. This way we measure only the computations done by the core which is going to finish last in each iteration, and consider the number of multiplications with nonzero entries of the data matrix as a proxy for time.
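Experiment 2 relies on the blocks produced by Algorithm 2. A minimal Python sketch (ours) of that one-pass preprocessing step follows: examples are grouped greedily, in their given order, into chunks whose total nnz stays below the budget m = max_t u[t]; it returns the chunk sizes g and the per-chunk nnz totals s.

    import numpy as np

    def naive_chunks(u):
        # u[t] = nnz(A_t); a single pass through the data, as noted in Section 4.
        u = np.asarray(u)
        m = u.max()
        g, s = [1], [int(u[0])]
        for t in range(1, len(u)):
            if s[-1] + u[t] <= m:        # current chunk still fits under the budget m
                g[-1] += 1
                s[-1] += int(u[t])
            else:                        # otherwise start a new chunk
                g.append(1)
                s.append(int(u[t]))
        return g, s

The chunk sizes can be turned into index groups via np.split(np.arange(len(u)), np.cumsum(g)[:-1]) and then passed to make_chunk_sampling above.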

[Figure 1 shows six panels of histograms: (a) w8a initially, (b) dorothea initially, (c) protein initially, (d) w8a chunked, (e) dorothea chunked, (f) protein chunked. Each panel plots Probability against Max(nnz) − Mean(nnz), with curves for τ = 5, 10, 20, 50.]

Figure 1: Distribution of the difference between the maximum number of nonzeros processed by a single core and the mean of all nonzeros processed by each core. This difference shows us how much time is wasted per core waiting on the slowest core to finish its task, therefore smaller numbers are better. The first row corresponds to the initial distribution while the second row shows the distribution after using Algorithm 2.

[Figure 2 shows three panels plotting objective minus optimum against passes through the data: (a) rcv1, state of the art (SGD, S2GD, SAG, SDCA); (b) rcv1, different values of λ; (c) cov, various samplings (uniform, importance, random 2 to random 5).]

Figure 2: LEFT: Comparison of SDCA with other state-of-the-art methods. MIDDLE: SDCA for various values of λ. RIGHT: SDCA run with various samplings Ŝ.

[Figure 3 shows twelve panels plotting test point minus optimum for the new sampling versus the standard sampling: (a)-(d) w8a with τ = 5, 10, 20, 50; (e)-(h) dorothea with τ = 5, 10, 20, 50; (i)-(l) protein with τ = 5, 10, 20, 50.]

Figure 3: Logistic regression with λ = 1/n. Comparison between the new and standard sampling with fine-tuned stepsizes for different values of τ.

6 Proofs

As a first approximation, our proof is an extension of the proof of Shalev-Shwartz [18] to accommodate an arbitrary sampling [16, 17, 9, 15]. For all i and t we let u_i^(t) = −φ_i'(A_i^⊤ w^(t)) and z_i^(t) = α_i^(t) − u_i^(t). We will use the following lemma.

Lemma 3 (Evolution of C_i^(t) and B^(t)). For a fixed iteration t and all i we have:

    E_Ŝ[ C_i^(t) − C_i^(t−1) ] = −θ ( ‖α_i^(t−1) − α_i*‖² − ‖u_i^(t−1) − α_i*‖² + (1 − θ p_i^{−1}) ‖z_i^(t−1)‖² ),    (12)

    E_Ŝ[ B^(t) − B^(t−1) ] ≤ −(2θ/λ) (w^(t−1) − w*)^⊤ ∇P(w^(t−1)) + (θ²/(n²λ²)) Σ_{i=1}^n (v_i/p_i) ‖z_i^(t−1)‖².    (13)

Proof. It follows that for i ∈ S_t, using the definition (7), we have

    C_i^(t) − C_i^(t−1) = ‖α_i^(t) − α_i*‖² − ‖α_i^(t−1) − α_i*‖²
      = ‖(1 − θ p_i^{−1})(α_i^(t−1) − α_i*) + θ p_i^{−1}(u_i^(t−1) − α_i*)‖² − ‖α_i^(t−1) − α_i*‖²
      = (1 − θ p_i^{−1})‖α_i^(t−1) − α_i*‖² + θ p_i^{−1}‖u_i^(t−1) − α_i*‖² − θ p_i^{−1}(1 − θ p_i^{−1})‖z_i^(t−1)‖² − ‖α_i^(t−1) − α_i*‖²
      = −θ p_i^{−1} ( ‖α_i^(t−1) − α_i*‖² − ‖u_i^(t−1) − α_i*‖² + (1 − θ p_i^{−1})‖z_i^(t−1)‖² ),

and for i ∉ S_t we have C_i^(t) − C_i^(t−1) = 0. Taking the expectation over S_t we get the result.

For the second potential we get

    B^(t) − B^(t−1) = ‖w^(t) − w*‖² − ‖w^(t−1) − w*‖²
      = −(2θ/(nλ)) Σ_{i ∈ S_t} p_i^{−1} (w^(t−1) − w*)^⊤ A_i z_i^(t−1) + (θ²/(n²λ²)) ‖ Σ_{i ∈ S_t} p_i^{−1} A_i z_i^(t−1) ‖².

Taking the expectation over S_t, using inequality (6), and noting that

    (1/n) Σ_{i=1}^n A_i z_i^(t−1) = (1/n) Σ_{i=1}^n A_i φ_i'(A_i^⊤ w^(t−1)) + λ w^(t−1) = ∇P(w^(t−1)),    (14)

we get

    E[ B^(t) − B^(t−1) ] = −(2θ/(nλ)) Σ_{i=1}^n (w^(t−1) − w*)^⊤ A_i z_i^(t−1) + (θ²/(n²λ²)) E ‖ Σ_{i ∈ Ŝ} p_i^{−1} A_i z_i^(t−1) ‖²
      ≤ −(2θ/(nλ)) Σ_{i=1}^n (w^(t−1) − w*)^⊤ A_i z_i^(t−1) + (θ²/(n²λ²)) Σ_{i=1}^n (v_i/p_i) ‖z_i^(t−1)‖²
      = −(2θ/λ) (w^(t−1) − w*)^⊤ ∇P(w^(t−1)) + (θ²/(n²λ²)) Σ_{i=1}^n (v_i/p_i) ‖z_i^(t−1)‖².

6.1 Proof of Theorem 1 (nonconvex case)

Combining (12) and (13), we obtain

    E[ D^(t) − D^(t−1) ] ≤ −(θλ/(2n)) Σ_{i=1}^n (1/L_i²) ( C_i^(t−1) − ‖u_i^(t−1) − α_i*‖² + (1 − θ p_i^{−1}) ‖z_i^(t−1)‖² )
        − θ (w^(t−1) − w*)^⊤ ∇P(w^(t−1)) + (θ²/(2n²λ)) Σ_{i=1}^n (v_i/p_i) ‖z_i^(t−1)‖²
      = −(θλ/(2n)) Σ_{i=1}^n (1/L_i²) ( C_i^(t−1) − ‖u_i^(t−1) − α_i*‖² ) − θ (w^(t−1) − w*)^⊤ ∇P(w^(t−1))
        − (θ/(2n)) Σ_{i=1}^n ( λ(1 − θ p_i^{−1})/L_i² − θ v_i/(nλ p_i) ) ‖z_i^(t−1)‖²
      ≤ −(θλ/(2n)) Σ_{i=1}^n (1/L_i²) ( C_i^(t−1) − ‖u_i^(t−1) − α_i*‖² ) − θ (w^(t−1) − w*)^⊤ ∇P(w^(t−1)),

where the last inequality holds since, by (8), λ(1 − θ p_i^{−1})/L_i² − θ v_i/(nλ p_i) ≥ 0 for all i. Using (3) we have

    ‖u_i^(t−1) − α_i*‖² = ‖φ_i'(A_i^⊤ w^(t−1)) − φ_i'(A_i^⊤ w*)‖² ≤ L_i² ‖w^(t−1) − w*‖².

By strong convexity of P,

    (w^(t−1) − w*)^⊤ ∇P(w^(t−1)) ≥ P(w^(t−1)) − P(w*) + (λ/2)‖w^(t−1) − w*‖²    and    P(w^(t−1)) − P(w*) ≥ (λ/2)‖w^(t−1) − w*‖²,

which together yields

    (w^(t−1) − w*)^⊤ ∇P(w^(t−1)) ≥ λ ‖w^(t−1) − w*‖².

Therefore,

    E[ D^(t) − D^(t−1) ] ≤ −θ [ (λ/(2n)) Σ_{i=1}^n C_i^(t−1)/L_i² + (−λ/2 + λ) B^(t−1) ] = −θ D^(t−1).

It follows that E[D^(t)] ≤ (1 − θ) D^(t−1), and repeating this recursively we end up with E[D^(t)] ≤ (1 − θ)^t D^(0) ≤ e^{−θt} D^(0). This concludes the proof of the first part of Theorem 1. The second part of the proof follows by observing that P is (L + λ)-smooth, which gives P(w) − P(w*) ≤ ((L + λ)/2) ‖w − w*‖².

6.2 Convex case

For the next theorem we need an additional lemma:

Lemma 4. Assume that the φ_i are l_i-smooth and convex. Then, for every w,

    (1/n) Σ_{i=1}^n (1/l_i) ‖φ_i'(A_i^⊤ w) − φ_i'(A_i^⊤ w*)‖² ≤ 2 ( P(w) − P(w*) − (λ/2) ‖w − w*‖² ).    (15)

Proof. Let g_i(x) = φ_i(x) − φ_i(A_i^⊤ w*) − φ_i'(A_i^⊤ w*)^⊤ (x − A_i^⊤ w*). Clearly, g_i is also l_i-smooth. By convexity of φ_i we have g_i(x) ≥ 0 for all x. It follows that g_i satisfies ‖g_i'(x)‖² ≤ 2 l_i g_i(x). Using the definition of g_i, we obtain

    ‖φ_i'(A_i^⊤ w) − φ_i'(A_i^⊤ w*)‖² = ‖g_i'(A_i^⊤ w)‖² ≤ 2 l_i [ φ_i(A_i^⊤ w) − φ_i(A_i^⊤ w*) − φ_i'(A_i^⊤ w*)^⊤ (A_i^⊤ w − A_i^⊤ w*) ].    (16)

Summing these terms up weighted by 1/l_i and using (5) we get

    (1/n) Σ_{i=1}^n (1/l_i) ‖φ_i'(A_i^⊤ w) − φ_i'(A_i^⊤ w*)‖²
      ≤ (2/n) Σ_{i=1}^n [ φ_i(A_i^⊤ w) − φ_i(A_i^⊤ w*) − ( A_i φ_i'(A_i^⊤ w*) )^⊤ (w − w*) ]
      = 2 [ P(w) − (λ/2)‖w‖² − P(w*) + (λ/2)‖w*‖² + λ (w*)^⊤ (w − w*) ]
      = 2 [ P(w) − P(w*) − (λ/2) ‖w − w*‖² ].

6.3 Proof of Theorem 2

Combining (12) and (13), we obtain

    E[ E^(t) − E^(t−1) ] ≤ −(θ/(2n)) Σ_{i=1}^n (1/l_i) ( C_i^(t−1) − ‖u_i^(t−1) − α_i*‖² + (1 − θ p_i^{−1}) ‖z_i^(t−1)‖² )
        − θ (w^(t−1) − w*)^⊤ ∇P(w^(t−1)) + (θ²/(2n²λ)) Σ_{i=1}^n (v_i/p_i) ‖z_i^(t−1)‖²
      = −(θ/(2n)) Σ_{i=1}^n (1/l_i) ( C_i^(t−1) − ‖u_i^(t−1) − α_i*‖² ) − θ (w^(t−1) − w*)^⊤ ∇P(w^(t−1))
        − (θ/n) Σ_{i=1}^n ( (1 − θ p_i^{−1})/(2 l_i) − θ v_i/(2 p_i λ n) ) ‖z_i^(t−1)‖²
      ≤ −(θ/(2n)) Σ_{i=1}^n (1/l_i) ( C_i^(t−1) − ‖u_i^(t−1) − α_i*‖² ) − θ (w^(t−1) − w*)^⊤ ∇P(w^(t−1)),

where the last inequality holds since, by (10), (1 − θ p_i^{−1})/(2 l_i) − θ v_i/(2 p_i λ n) ≥ 0 for all i. Using the convexity of P we have P(w*) − P(w^(t−1)) ≥ −(w^(t−1) − w*)^⊤ ∇P(w^(t−1)), and using Lemma 4 we have

    E[ E^(t) − E^(t−1) ] ≤ −(θ/(2n)) Σ_{i=1}^n (1/l_i) C_i^(t−1) + θ ( P(w^(t−1)) − P(w*) − (λ/2)‖w^(t−1) − w*‖² ) − θ (w^(t−1) − w*)^⊤ ∇P(w^(t−1))
      ≤ −θ [ (1/(2n)) Σ_{i=1}^n C_i^(t−1)/l_i + (λ/2) B^(t−1) ]
      = −θ E^(t−1).

This gives E[E^(t)] ≤ (1 − θ) E^(t−1), which concludes the first part of Theorem 2. The second part follows by observing that P is (L + λ)-smooth, which gives P(w) − P(w*) ≤ ((L + λ)/2) ‖w − w*‖².

References

[1] Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400-407, 1951.

[2] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013.

[3] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567-599, 2013.

[4] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, 2013.

[5] Jakub Konečný and Peter Richtárik. S2GD: Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.

[6] Jakub Konečný, Jie Liu, Peter Richtárik, and Martin Takáč. mS2GD: Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv preprint, 2014.

[7] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014.

[8] Jakub Konečný, Zheng Qu, and Peter Richtárik. Semi-stochastic coordinate descent. arXiv preprint, 2014.

[9] Zheng Qu, Peter Richtárik, and Tong Zhang. Randomized Dual Coordinate Ascent with Arbitrary Sampling. arXiv:1411.5873, 2014.

[10] Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, 2013.

[11] Shai Shalev-Shwartz and Tong Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.

[12] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling. ICML, 2015.

[13] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. To appear in Mathematical Programming, 2014.

[14] Dominik Csiba, Zheng Qu, and Peter Richtárik. Stochastic dual coordinate ascent with adaptive probabilities. ICML, 2015.

[15] Zheng Qu, Peter Richtárik, Martin Takáč, and Olivier Fercoq. Stochastic Dual Newton Ascent for empirical risk minimization. arXiv preprint, 2015.

[16] Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. arXiv preprint, 2013.

[17] Zheng Qu and Peter Richtárik. Coordinate descent methods with arbitrary sampling I: Algorithms and complexity. arXiv preprint, 2014.

[18] Shai Shalev-Shwartz. SDCA without duality. CoRR, abs/1502.06177, 2015.

[19] Zheng Qu and Peter Richtárik. Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation. arXiv preprint, 2014.

[20] Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv preprint, 2013.

[21] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(2):1-38, 2014.
