DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization


Lin Xiao (Microsoft Research AI, Redmond, WA 98052, USA)
Adams Wei Yu (Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA)
Qihang Lin (Tippie College of Business, The University of Iowa, Iowa City, IA 52242, USA)
Weizhu Chen (Microsoft AI and Research, Redmond, WA 98052, USA)

October 13, 2017

Abstract

Machine learning with big data often involves large optimization models. For distributed optimization over a cluster of machines, frequent communication and synchronization of all model parameters (optimization variables) can be very costly. A promising solution is to use parameter servers to store different subsets of the model parameters, and update them asynchronously at different machines using local datasets. In this paper, we focus on distributed optimization of large linear models with convex loss functions, and propose a family of randomized primal-dual block coordinate algorithms that are especially suitable for asynchronous distributed implementation with parameter servers. In particular, we work with the saddle-point formulation of such problems which allows simultaneous data and model partitioning, and exploit its structure by doubly stochastic coordinate optimization with variance reduction (DSCOVR). Compared with other first-order distributed algorithms, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. We discuss the implementation details of the DSCOVR algorithms, and present numerical experiments on an industrial distributed computing system.

Keywords: asynchronous distributed optimization, parameter servers, randomized algorithms, saddle-point problems, primal-dual coordinate algorithms, empirical risk minimization.

1. Introduction

Algorithms and systems for distributed optimization are critical for solving large-scale machine learning problems, especially when the dataset cannot fit into the memory or storage of a single machine. In this paper, we consider distributed optimization problems of the form

    minimize_{w∈R^d}  (1/m) Σ_{i=1}^m f_i(X_i w) + g(w),    (1)

where X_i ∈ R^{N_i×d} is the local data stored at the i-th machine, f_i: R^{N_i} → R is a convex cost function associated with the linear mapping X_i w, and g(w) is a convex regularization function. In addition,

we assume that g is separable, i.e., for some integer n > 0, we can write

    g(w) = Σ_{k=1}^n g_k(w_k),    (2)

where g_k: R^{d_k} → R, and w_k ∈ R^{d_k} for k = 1,...,n are non-overlapping subvectors of w ∈ R^d with Σ_{k=1}^n d_k = d (they form a partition of w). Many popular regularization functions in machine learning are separable, for example, g(w) = (λ/2)‖w‖² or g(w) = λ‖w‖₁ for some λ > 0.

An important special case of (1) is distributed empirical risk minimization (ERM) of linear predictors. Let (x₁, y₁),...,(x_N, y_N) be N training examples, where each x_j ∈ R^d is a feature vector and y_j ∈ R is its label. The ERM problem is formulated as

    minimize_{w∈R^d}  (1/N) Σ_{j=1}^N φ_j(x_j^T w) + g(w),    (3)

where each φ_j: R → R is a loss function measuring the mismatch between the linear prediction x_j^T w and the label y_j. Popular loss functions in machine learning include, e.g., for regression, the squared loss φ_j(t) = (1/2)(t − y_j)², and for classification, the logistic loss φ_j(t) = log(1 + exp(−y_j t)) where y_j ∈ {±1}. In the distributed optimization setting, the N examples are divided into m subsets, each stored on a different machine. For i = 1,...,m, let I_i denote the subset of {1,...,N} stored at machine i and let N_i = |I_i| (they satisfy Σ_{i=1}^m N_i = N). Then the ERM problem (3) can be written in the form of (1) by letting X_i consist of x_j^T with j ∈ I_i as its rows and defining f_i: R^{N_i} → R as

    f_i(u_{I_i}) = (m/N) Σ_{j∈I_i} φ_j(u_j),    (4)

where u_{I_i} ∈ R^{N_i} is a subvector of u ∈ R^N, consisting of u_j with j ∈ I_i.

The nature of distributed algorithms and their convergence properties largely depend on the model of the communication network that connects the m computing machines. A popular setting in the literature is to model the communication network as a graph, where each node can only communicate (in one step) with its neighbors connected by an edge, either synchronously or asynchronously (e.g., Bertsekas and Tsitsiklis, 1989; Nedić and Ozdaglar, 2009). The convergence rates of distributed algorithms in this setting often depend on characteristics of the graph, such as its diameter and the eigenvalues of the graph Laplacian (e.g., Xiao and Boyd, 2006; Duchi et al., 2012; Nedić et al., 2016; Scaman et al., 2017). This is often called the decentralized setting.

Another model for the communication network is centralized, where all the machines participate in synchronous, collective communication, e.g., broadcasting a vector to all m machines, or computing the sum of m vectors, each from a different machine (AllReduce). These collective communication protocols hide the underlying implementation details, which often involve operations on graphs. They are adopted by many popular distributed computing standards and packages, such as MPI (MPI Forum, 2012), MapReduce (Dean and Ghemawat, 2008) and Apache Spark (Zaharia et al., 2016), and are widely used in machine learning practice (e.g., Lin et al., 2014; Meng et al., 2016). In particular, collective communications are very useful for addressing data parallelism, i.e., by allowing different machines to work in parallel to improve the same model w ∈ R^d using their local datasets. A disadvantage of collective communications is their synchronization cost: faster machines or machines with less computing tasks have to become idle while waiting for other machines to finish their tasks in order to participate in a collective communication.
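To make the reduction from (3) to (1) concrete, the following sketch (our Python/NumPy illustration, not part of the paper's implementation; the helper name make_local_problems is ours) splits the examples across m hypothetical machines and builds the local functions f_i of (4):

```python
# Our illustration of the reduction from ERM (3) to form (1): the N examples
# are split evenly across m hypothetical machines, and each local f_i is
# scaled by m/N as in (4).
import numpy as np

def make_local_problems(X, y, m):
    """Split (X, y) row-wise into m local blocks X_i with losses f_i."""
    N = X.shape[0]
    index_sets = np.array_split(np.arange(N), m)   # I_1, ..., I_m
    problems = []
    for I in index_sets:
        X_i, y_i = X[I], y[I]
        # f_i(u) = (m/N) * sum_j log(1 + exp(-y_j u_j)) (logistic loss)
        f_i = lambda u, y_i=y_i: (m / N) * np.sum(np.log1p(np.exp(-y_i * u)))
        problems.append((X_i, f_i))
    return problems

# With these definitions, (1/m) * sum_i f_i(X_i @ w) equals the ERM
# objective (1/N) * sum_j phi_j(x_j^T w), so (3) is an instance of (1).
```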

One effective approach for reducing synchronization cost is to exploit model parallelism (here "model" refers to w ∈ R^d, including all optimization variables). The idea is to allow different machines to work in parallel with different versions of the full model or different parts of a common model, with little or no synchronization. The model partitioning approach can be very effective for solving problems with large models (large dimension d). Dedicated parameter servers can be set up to store and maintain different subsets of the model parameters, such as the w_k's in (2), and be responsible for coordinating their updates at different workers (Li et al., 2014; Xing et al., 2015). This requires flexible point-to-point communication.

In this paper, we develop a family of randomized algorithms that exploit simultaneous data and model parallelism. Correspondingly, we adopt a centralized communication model that supports both synchronous collective communication and asynchronous point-to-point communication. In particular, it allows any pair of machines to send/receive a message in a single step, and multiple point-to-point communications may happen in parallel in an event-driven, asynchronous manner. Such a communication model is well supported by the MPI standard. To evaluate the performance of distributed algorithms in this setting, we consider the following three measures.

- Computation complexity: total amount of computation, measured by the number of passes over all datasets X_i for i = 1,...,m, which can happen in parallel on different machines.
- Communication complexity: the total amount of communication required, measured by the equivalent number of vectors in R^d sent or received across all machines.
- Synchronous communication: measured by the total number of vectors in R^d that require synchronous collective communication involving all m machines. We single it out from the overall communication complexity as a partial measure of the synchronization cost.

In Section 2, we introduce the framework of our randomized algorithms, Doubly Stochastic Coordinate Optimization with Variance Reduction (DSCOVR), and summarize our theoretical results on the three measures achieved by DSCOVR. Compared with other first-order methods for distributed optimization, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. Then we present the details of several DSCOVR variants and their convergence analysis in Sections 3-6. We discuss the implementation of different DSCOVR algorithms in Section 7, and present results of our numerical experiments in Section 8.

2. The DSCOVR Framework and Main Results

First, we derive a saddle-point formulation of the convex optimization problem (1). Let f_i* be the convex conjugate of f_i, i.e., f_i*(α_i) = sup_{u_i∈R^{N_i}} {α_i^T u_i − f_i(u_i)}, and define

    L(w, α) = (1/m) Σ_{i=1}^m α_i^T X_i w − (1/m) Σ_{i=1}^m f_i*(α_i) + g(w),    (5)

where α = [α₁; ...; α_m] ∈ R^N. Since both the f_i's and g are convex, L(w, α) is convex in w and concave in α. We also define a pair of primal and dual functions:

    P(w) = max_{α∈R^N} L(w, α) = (1/m) Σ_{i=1}^m f_i(X_i w) + g(w),    (6)
    D(α) = min_{w∈R^d} L(w, α) = −(1/m) Σ_{i=1}^m f_i*(α_i) − g*(−(1/m) Σ_{i=1}^m X_i^T α_i),    (7)

where P(w) is exactly the objective function in (1) and g* is the convex conjugate of g.
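As a concrete instance of the conjugates appearing in (5)-(7) (a worked example we add for illustration; it is not spelled out in the text), the squared loss has a simple closed-form conjugate:

```latex
% Conjugate of the squared loss \phi_j(t) = \tfrac{1}{2}(t - y_j)^2:
%   \phi_j^*(\beta) = \sup_t \{ \beta t - \tfrac{1}{2}(t - y_j)^2 \}.
% The supremum is attained where \beta - (t - y_j) = 0, i.e., t = \beta + y_j, so
\phi_j^*(\beta) = \beta(\beta + y_j) - \tfrac{1}{2}\beta^2
                = \tfrac{1}{2}\beta^2 + \beta y_j .
```

By the scaling rule (c f)*(α) = c f*(α/c) for c > 0, the conjugate of f_i in (4) then also has a closed form, which is what makes the dual proximal updates below inexpensive for quadratic losses.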

Figure 1: Partition of primal variable w, dual variable α, and the data matrix X.

We assume that L has a saddle point (w★, α★), that is,

    L(w★, α) ≤ L(w★, α★) ≤ L(w, α★),  ∀(w, α) ∈ R^d × R^N.

In this case, we have w★ = arg min_w P(w) and α★ = arg max_α D(α), and P(w★) = D(α★).¹

The DSCOVR framework is based on solving the convex-concave saddle-point problem

    min_{w∈R^d} max_{α∈R^N} L(w, α).    (8)

Since we assume that g has a separable structure as in (2), we rewrite the saddle-point problem as

    min_{w∈R^d} max_{α∈R^N} { (1/m) Σ_{i=1}^m Σ_{k=1}^n α_i^T X_{ik} w_k − (1/m) Σ_{i=1}^m f_i*(α_i) + Σ_{k=1}^n g_k(w_k) },    (9)

where X_{ik} ∈ R^{N_i×d_k} for k = 1,...,n are column partitions of X_i. For convenience, we define the following notations. First, let X = [X₁; ...; X_m] ∈ R^{N×d} be the overall data matrix, obtained by stacking the X_i's vertically. Conforming to the separation of g, we also partition X into block columns X_{:k} ∈ R^{N×d_k} for k = 1,...,n, where each X_{:k} = [X_{1k}; ...; X_{mk}] (stacked vertically). For consistency, we also use X_{i:} to denote X_i from now on. See Figure 1 for an illustration.

We exploit the doubly separable structure in (9) by a doubly stochastic coordinate update algorithm outlined in Algorithm 1. Let p = {p₁,...,p_m} and q = {q₁,...,q_n} be two probability distributions. During each iteration t, we randomly pick an index i ∈ {1,...,m} with probability p_i, and independently pick an index k ∈ {1,...,n} with probability q_k. Then we compute two vectors u_i^{t+1} ∈ R^{N_i} and v_k^{t+1} ∈ R^{d_k} (details to be discussed later), and use them to update the block coordinates α_i and w_k while leaving the other block coordinates unchanged. The update formulas in (10) and (11) use the proximal mappings of the (scaled) functions f_i* and g_k respectively.

1. More technically, we need to assume that each f_i is convex and lower semi-continuous so that f_i** = f_i (see, e.g., Rockafellar, 1970, Section 12). It automatically holds if f_i is convex and differentiable, which we will assume later.

Algorithm 1: DSCOVR framework

input: initial points w^0, α^0, and step sizes σ_i for i = 1,...,m and τ_k for k = 1,...,n.
1: for t = 0, 1, 2, ..., do
2:   pick i ∈ {1,...,m} and k ∈ {1,...,n} randomly with distributions p and q respectively.
3:   compute variance-reduced stochastic gradients u_i^{t+1} and v_k^{t+1}.
4:   update primal and dual block coordinates:

         α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}), keeping α_{i'}^{t+1} = α_{i'}^t for all i' ≠ i,    (10)
         w_k^{t+1} = prox_{τ_k g_k}(w_k^t − τ_k v_k^{t+1}), keeping w_{k'}^{t+1} = w_{k'}^t for all k' ≠ k.    (11)

5: end for

Recall that the proximal mapping for any convex function φ: R^d → R ∪ {+∞} is defined as

    prox_φ(v) = arg min_{u∈R^d} { φ(u) + (1/2)‖u − v‖² }.

There are several different ways to compute the vectors u_i^{t+1} and v_k^{t+1} in Step 3 of Algorithm 1. They should be the partial gradients or stochastic gradients of the bilinear coupling term in L(w, α) with respect to α_i and w_k respectively. Let

    K(w, α) = α^T X w = Σ_{i=1}^m Σ_{k=1}^n α_i^T X_{ik} w_k,

which is the bilinear term in L(w, α) without the factor 1/m. We can use the following partial gradients in Step 3:

    ū_i^{t+1} = ∂K(w^t, α^t)/∂α_i = Σ_{k=1}^n X_{ik} w_k^t,
    v̄_k^{t+1} = (1/m) ∂K(w^t, α^t)/∂w_k = (1/m) Σ_{i=1}^m X_{ik}^T α_i^t.    (12)

We note that the factor 1/m does not appear in the first equation because it multiplies both K(w, α) and f_i*(α_i) in (9) and hence does not appear in updating α_i. Another choice is to use

    u_i^{t+1} = (1/q_k) X_{ik} w_k^t,    v_k^{t+1} = (1/(p_i m)) X_{ik}^T α_i^t,    (13)

which are unbiased stochastic partial gradients, because

    E_k[u_i^{t+1}] = Σ_{k'=1}^n q_{k'} (1/q_{k'}) X_{ik'} w_{k'}^t = Σ_{k'=1}^n X_{ik'} w_{k'}^t = ū_i^{t+1},
    E_i[v_k^{t+1}] = Σ_{i'=1}^m p_{i'} (1/(p_{i'} m)) X_{i'k}^T α_{i'}^t = (1/m) Σ_{i'=1}^m X_{i'k}^T α_{i'}^t = v̄_k^{t+1},

where E_i and E_k denote expectations with respect to the random indices i and k respectively.
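The following serial NumPy simulation (our sketch, not the paper's C++/MPI implementation) instantiates Algorithm 1 with the stochastic partial gradients (13) for ridge regression, where both proximal mappings in (10)-(11) have closed forms; all names are ours:

```python
# A minimal sketch of Algorithm 1 with option (13), assuming squared loss
# phi_j(t) = (t - y_j)^2/2 and g(w) = (lam/2)*||w||^2.
import numpy as np

def dscovr_framework(Xs, ys, lam, sigma, tau, n_blocks, iters, seed=0):
    rng = np.random.default_rng(seed)
    m = len(Xs)
    N = sum(X.shape[0] for X in Xs)
    d = Xs[0].shape[1]
    c = m / N                                        # scaling of f_i in (4)
    w = np.zeros(d)
    alphas = [np.zeros(X.shape[0]) for X in Xs]
    blocks = np.array_split(np.arange(d), n_blocks)  # partition of w
    p, q = 1.0 / m, 1.0 / n_blocks                   # uniform sampling
    for _ in range(iters):
        i = rng.integers(m)                   # data block index
        k = rng.integers(n_blocks)            # model block index
        cols = blocks[k]
        X_ik = Xs[i][:, cols]
        # unbiased stochastic partial gradients, equation (13)
        u = (1.0 / q) * (X_ik @ w[cols])
        v = (1.0 / (p * m)) * (X_ik.T @ alphas[i])
        # dual step (10): f_i*(a) = ||a||^2/(2c) + a^T y_i for the squared
        # loss, so prox_{sigma f_i*} is a closed-form shrinkage
        z = alphas[i] + sigma * u
        alphas[i] = c * (z - sigma * ys[i]) / (c + sigma)
        # primal step (11): prox of tau*(lam/2)*||.||^2 is a rescaling
        w[cols] = (w[cols] - tau * v) / (1.0 + tau * lam)
    return w, alphas
```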

Figure 2: Simultaneous data and model parallelism. At any given time, each machine is busy updating one parameter block and its own dual variable. Whenever some machine is done, it is assigned to work on a random block that is not being updated.

It can be shown that Algorithm 1 converges to a saddle point of L(w, α) with either choice (12) or (13) in Step 3, and with suitable step sizes σ_i and τ_k. It is expected that using the stochastic gradients in (13) leads to a slower convergence rate than applying (12). However, using (13) has the advantage of much less computation during each iteration. Specifically, it employs only one block matrix-vector multiplication for both updates, instead of the n and m block multiplications done in (12).

More importantly, the choice in (13) is suitable for parallel and distributed computing. To see this, let (i(t), k(t)) denote the pair of random indices drawn at iteration t (we omit the superscript (t) to simplify notation whenever there is no confusion from the context). Suppose for a sequence of s consecutive iterations t,...,t+s−1, there is no common index among i(t),...,i(t+s−1), nor among k(t),...,k(t+s−1); then these s iterations can be done in parallel and they produce the same updates as being done sequentially. Suppose there are s processors or machines; then each can carry out one iteration, which includes the updates in (13) as well as (10) and (11). These s iterations are independent of each other, and in fact can be done in any order, because each only involves one primal block w_{k(t)} and one dual block α_{i(t)}, for both input and output (variables on the right and left sides of the assignments respectively). In contrast, the input for the updates in (12) depends on all primal and dual blocks at the previous iteration, so they cannot be done in parallel.

In practice, suppose we have m machines for solving problem (9), and each holds the data matrix X_{i:} in memory and maintains the dual block α_i, for i = 1,...,m. We assume that the number of model partitions n is larger than m, and the n model blocks {w₁,...,w_n} are stored at one or more parameter servers. In the beginning, we can randomly pick m model blocks (sampling without replacement) from {w₁,...,w_n}, and assign each machine to update one of them. If machine i is assigned to update block k, then both α_i and w_k are updated, using only the matrix X_{ik}; moreover, it needs to communicate only the block w_k with the parameter server that is responsible for maintaining it. Whenever one machine finishes its update, a scheduler can randomly pick another parameter block that is not currently updated by other machines, and assign it to the free machine. Therefore all machines can work in parallel, in an asynchronous, event-driven manner. Here an event is the completion of a block update at any machine, as illustrated in Figure 2. We will discuss the implementation details in Section 7.

The idea of using doubly stochastic updates for distributed optimization is not new. It has been studied by Yun et al. (2014) for solving the matrix completion problem, and by Matsushima et al. (2014) for solving the saddle-point formulation of the ERM problem. Despite their nice features for parallelization, these algorithms inherit the O(1/√t) (or O(1/t) with strong convexity) sublinear convergence rate of the classical stochastic gradient method. They translate into high communication and computation cost for distributed optimization. In this paper, we propose new variants of doubly stochastic update algorithms by using variance-reduced stochastic gradients (Step 3 of Algorithm 1). More specifically, we borrow the variance-reduction techniques from SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014) to develop the DSCOVR algorithms, which enjoy fast linear rates of convergence. In the rest of this section, we summarize our theoretical results characterizing the three measures for DSCOVR: computation complexity, communication complexity, and synchronization cost. We compare them with distributed implementations of batch first-order algorithms.

2.1 Summary of Main Results

Throughout this paper, we use ‖·‖ to denote the standard Euclidean norm for vectors. For matrices, ‖·‖ denotes the operator (spectral) norm and ‖·‖_F denotes the Frobenius norm. We make the following assumption regarding the optimization problem (1).

Assumption 1. Each f_i is convex and differentiable, and its gradient is (1/γ_i)-Lipschitz continuous, i.e.,

    ‖∇f_i(u) − ∇f_i(v)‖ ≤ (1/γ_i)‖u − v‖,  ∀u, v ∈ R^{N_i},  i = 1,...,m.    (14)

In addition, the regularization function g is λ-strongly convex, i.e.,

    g(w') ≥ g(w) + ξ^T(w' − w) + (λ/2)‖w' − w‖²,  ∀ξ ∈ ∂g(w),  ∀w', w ∈ R^d.

Under Assumption 1, each f_i* is γ_i-strongly convex (see, e.g., Hiriart-Urruty and Lemaréchal, 2001, Theorem 4.2.2), and L(w, α) defined in (5) has a unique saddle point (w★, α★).

The condition (14) is often referred to as f_i being (1/γ_i)-smooth. To simplify discussion, here we assume γ_i = γ for i = 1,...,m. Under these assumptions, each composite function f_i(X_i w) has a smoothness parameter ‖X_i‖²/γ (an upper bound on the largest eigenvalue of its Hessian). Their average (1/m) Σ_{i=1}^m f_i(X_i w) has a smoothness parameter ‖X‖²/(mγ), which is no larger than the average of the individual smoothness parameters (1/m) Σ_{i=1}^m ‖X_i‖²/γ. We define a condition number for problem (1) as the ratio between this smoothness parameter and the convexity parameter λ of g:

    κ_bat = ‖X‖²/(mλγ) ≤ (1/m) Σ_{i=1}^m ‖X_{i:}‖²/(λγ) ≤ ‖X‖²_max/(λγ),    (15)

where ‖X‖_max = max_i {‖X_{i:}‖}. This condition number is a key factor in characterizing the iteration complexity of batch first-order methods for solving problem (1), i.e., minimizing P(w). Specifically, to find a w such that P(w) − P(w★) ≤ ε, the proximal gradient method requires O(κ_bat log(1/ε)) iterations, and its accelerated variants require O(√κ_bat log(1/ε)) iterations (e.g., Nesterov, 2004; Beck and Teboulle, 2009; Nesterov, 2013). Primal-dual first-order methods for solving the saddle-point problem (8) share the same complexity (Chambolle and Pock, 2011, 2015).

Table 1: Computation and communication complexities of batch first-order methods and DSCOVR (for both SVRG and SAGA variants). We omit the O(·) notation in all entries and an extra log(1 + κ_rand/m) factor for accelerated DSCOVR algorithms.

| Algorithms | Computation complexity (number of passes over data) | Communication complexity (number of vectors in R^d) |
|---|---|---|
| batch first-order methods | κ_bat log(1/ε) | m κ_bat log(1/ε) |
| DSCOVR | (1 + κ_rand/m) log(1/ε) | (m + κ_rand) log(1/ε) |
| accelerated batch first-order methods | √κ_bat log(1/ε) | m √κ_bat log(1/ε) |
| accelerated DSCOVR | (1 + √(κ_rand/m)) log(1/ε) | (m + √(m κ_rand)) log(1/ε) |

A fundamental baseline for evaluating any distributed optimization algorithm is the distributed implementation of batch first-order methods. Let us consider solving problem (1) using the proximal gradient method. During every iteration t, each machine receives a copy of w^t ∈ R^d from a master machine (through Broadcast), and computes the local gradient z_i^t = X_i^T ∇f_i(X_i w^t) ∈ R^d. Then a collective communication is invoked to compute the batch gradient z^t = (1/m) Σ_{i=1}^m z_i^t at the master (Reduce). The master then takes a proximal gradient step, using z^t and the proximal mapping of g, to compute the next iterate w^{t+1}, and broadcasts it to every machine for the next iteration. We can also use the AllReduce operation in MPI to obtain z^t at each machine without a master. In either case, the total number of passes over the data is twice the number of iterations (due to matrix-vector multiplications using both X_i and X_i^T), and the number of vectors in R^d sent/received across all machines is m times the number of iterations (see Table 1). Moreover, all communications are collective and synchronous.
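The following serial sketch (ours, for illustration) simulates this batch baseline, with the Reduce/Broadcast pattern replaced by a plain sum over the local gradients; squared loss and g(w) = (λ/2)‖w‖² are assumed so that ∇f_i and prox_g are explicit, and a fixed step size stands in for a line search:

```python
# Our simulation of the distributed proximal gradient baseline.
import numpy as np

def distributed_pgd(Xs, ys, lam, step, iters):
    m = len(Xs)
    d = Xs[0].shape[1]
    N = sum(X.shape[0] for X in Xs)
    w = np.zeros(d)
    for _ in range(iters):
        # each "machine" i computes z_i = X_i^T grad f_i(X_i w) locally;
        # grad f_i(u) = (m/N) * (u - y_i) for the squared loss in (4)
        zs = [X.T @ ((m / N) * (X @ w - y)) for X, y in zip(Xs, ys)]
        z = sum(zs) / m                        # Reduce: batch gradient
        w = (w - step * z) / (1 + step * lam)  # proximal gradient step
        # broadcasting the new w is implicit in the next loop iteration
    return w
```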

Since DSCOVR is a family of randomized algorithms for solving the saddle-point problem (8), we would like to find (w, α) such that ‖w^t − w★‖² + (1/m)‖α^t − α★‖² ≤ ε holds in expectation and with high probability. We list the communication and computation complexities of DSCOVR in Table 1, comparing them with batch first-order methods. Similar guarantees also hold for reducing the duality gap P(w^t) − D(α^t), where P and D are defined in (6) and (7) respectively.

The key quantity characterizing the complexities of DSCOVR is the condition number κ_rand, which can be defined in several different ways. If we pick the data block i and model block k with uniform distributions, i.e., p_i = 1/m for i = 1,...,m and q_k = 1/n for k = 1,...,n, then

    κ_rand = n ‖X‖²_{m×n} / (λγ),  where ‖X‖_{m×n} = max_{i,k} ‖X_{ik}‖.    (16)

Comparing with the definition of κ_bat in (15), we have κ_bat ≤ κ_rand because

    (1/m)‖X‖² ≤ (1/m) Σ_{i=1}^m ‖X_{i:}‖² ≤ (1/m) Σ_{i=1}^m Σ_{k=1}^n ‖X_{ik}‖² ≤ n ‖X‖²_{m×n}.

With X_{i:} = [X_{i1} ⋯ X_{in}] ∈ R^{N_i×d} and X_{:k} = [X_{1k}; ...; X_{mk}] ∈ R^{N×d_k}, we can also define

    κ'_rand = ‖X‖²_{max,F} / (λγ),  where ‖X‖_{max,F} = max_{i,k} {‖X_{i:}‖_F, ‖X_{:k}‖_F}.    (17)

Table 2: Breakdown of communication complexities into synchronous and asynchronous communications for two different types of DSCOVR algorithms. We omit the O(·) notation and an extra log(1 + κ_rand/m) factor for accelerated DSCOVR algorithms.

| Algorithms | Synchronous communication (number of vectors in R^d) | Asynchronous communication (equiv. number of vectors in R^d) |
|---|---|---|
| DSCOVR-SVRG | m log(1/ε) | κ_rand log(1/ε) |
| DSCOVR-SAGA | m | (m + κ_rand) log(1/ε) |
| accelerated DSCOVR-SVRG | m log(1/ε) | √(m κ_rand) log(1/ε) |
| accelerated DSCOVR-SAGA | m | (m + √(m κ_rand)) log(1/ε) |

In this case, we also have κ_bat ≤ κ'_rand because ‖X‖_max ≤ ‖X‖_{max,F}. Finally, if we pick the pair (i, k) with the non-uniform distribution p_i = ‖X_{i:}‖²_F / ‖X‖²_F and q_k = ‖X_{:k}‖²_F / ‖X‖²_F, then we can define

    κ''_rand = ‖X‖²_F / (mλγ).    (18)

Again we have κ_bat ≤ κ''_rand because ‖X‖ ≤ ‖X‖_F. We may replace κ_rand in Tables 1 and 2 by either κ'_rand or κ''_rand, depending on the probability distributions p and q and the different proof techniques.

From Table 1, we observe similar types of speed-ups in computation complexity as obtained by variance reduction techniques over the batch first-order algorithms for convex optimization (e.g., Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Xiao and Zhang, 2014; Lan and Zhou, 2015; Allen-Zhu, 2017), as well as for convex-concave saddle-point problems (Zhang and Xiao, 2017; Balamurugan and Bach, 2016). Basically, DSCOVR algorithms have potential improvement over batch first-order methods by a factor of m (for non-accelerated algorithms) or √m (for accelerated algorithms), but with a worse condition number. In the worst case, the ratio between κ_rand and κ_bat may be of order m or larger, thus canceling the potential improvements.

More interestingly, DSCOVR also has similar improvements in terms of communication complexity over batch first-order methods. In Table 2, we decompose the communication complexity of DSCOVR into synchronous and asynchronous communication. The decomposition turns out to be different depending on the variance reduction technique employed: SVRG (Johnson and Zhang, 2013) versus SAGA (Defazio et al., 2014). We note that DSCOVR-SAGA essentially requires only asynchronous communication, because the synchronous communication of m vectors is only necessary for initialization with a non-zero starting point.

The comparisons in Tables 1 and 2 give us a good understanding of the complexities of different algorithms. However, these complexities are not accurate measures of their performance in practice. For example, collective communication of m vectors in R^d can often be done in parallel over a spanning tree of the underlying communication network, thus costing only log(m) times (instead of m times) as much as sending a single vector. Also, for point-to-point communication, sending one vector in R^d altogether can be much faster than sending n smaller vectors of total length d separately. A fair comparison in terms of wall-clock time on a real-world distributed computing system requires customized, efficient implementations of the different algorithms. We will shed some light on timing comparisons with numerical experiments in Section 8.

2.2 Related Work

There is an extensive literature on distributed optimization. Many algorithms developed for machine learning adopt the centralized communication setting, due to the wide availability of supporting standards and platforms such as MPI, MapReduce and Spark (as discussed in the introduction). They include parallel implementations of the batch first-order and second-order methods (e.g., Lin et al., 2014; Chen et al., 2014; Lee et al., 2017), ADMM (Boyd et al., 2011), and distributed dual coordinate ascent (Yang, 2013; Jaggi et al., 2014; Ma et al., 2015).

For minimizing the average function (1/m) Σ_{i=1}^m f_i(w), in the centralized setting and with only first-order oracles (i.e., gradients of the f_i's or their conjugates), it has been shown that distributed implementations of accelerated gradient methods achieve the optimal convergence rate and communication complexity (Arjevani and Shamir, 2015; Scaman et al., 2017). The problem we consider has the extra structure of composition with a linear transformation by the local data, which allows us to exploit simultaneous data and model parallelism using randomized algorithms and obtain improved communication and computation complexity.

Most work on asynchronous distributed algorithms exploits model parallelism in order to reduce the synchronization cost, especially in the setting with parameter servers (e.g., Li et al., 2014; Xing et al., 2015; Aytekin et al., 2016). Besides, the delay caused by asynchrony can be incorporated into the step size to gain practical improvement on convergence (e.g., Agarwal and Duchi, 2011; McMahan and Streeter, 2014; Sra et al., 2016), though the theoretical sublinear rates remain. There are also many recent works on asynchronous parallel stochastic gradient and coordinate-descent algorithms for convex optimization (e.g., Recht et al., 2011; Liu et al., 2014; Shi et al., 2015; Reddi et al., 2015; Richtárik and Takáč, 2016; Peng et al., 2016). When the workloads or computing power of different machines or processors are nonuniform, they may significantly increase iteration efficiency (the number of iterations done in unit time), but often at the cost of requiring more iterations than their synchronous counterparts (due to delays and stale updates). So there is a subtle balance between iteration efficiency and iteration complexity (e.g., Hannah and Yin, 2017). Our discussions in Section 2.1 show that DSCOVR is capable of improving both aspects.

For solving bilinear saddle-point problems with a finite-sum structure, Zhang and Xiao (2017) proposed a randomized algorithm that works with dual coordinate update but full primal update. Yu et al. (2015) proposed a doubly stochastic algorithm that works with both primal and dual coordinate updates based on equation (12). Both achieved accelerated linear convergence rates, but neither can be readily applied to distributed computing. In addition, Balamurugan and Bach (2016) proposed stochastic variance-reduction methods (also based on SVRG and SAGA) for solving more general convex-concave saddle-point problems. For the special case with bilinear coupling, they obtained similar computation complexity as DSCOVR. However, their methods require full model updates at each iteration (even though working with only one sub-block of data), and thus are not suitable for distributed computing.

With additional assumptions and structure, such as similarity between the local cost functions at different machines or the use of second-order information, it is possible to obtain better communication complexity for distributed optimization; see, e.g., Shamir et al. (2014); Zhang and Xiao (2015); Reddi et al. (2016). However, these algorithms rely on much more computation at each machine for solving a local sub-problem at each iteration. With additional memory and preprocessing at each machine, Lee et al. (2015) showed that SVRG can be adapted for distributed optimization to obtain low communication complexity.

3. The DSCOVR-SVRG Algorithm

From this section to Section 6, we present several realizations of DSCOVR using different variance reduction techniques and acceleration schemes, and analyze their convergence properties. These algorithms are presented and analyzed as sequential randomized algorithms. We will discuss how to implement them for asynchronous distributed computing in Section 7.

Algorithm 2 is a DSCOVR algorithm that uses the technique of SVRG (Johnson and Zhang, 2013) for variance reduction. The iterations are divided into stages and each stage has an inner loop. Each stage is initialized by a pair of vectors w̄^s ∈ R^d and ᾱ^s ∈ R^N, which come from either the initialization (if s = 0) or the last iterate of the previous stage (if s > 0). At the beginning of each stage, we compute the batch gradients

    ū^s = ∂K(w̄^s, ᾱ^s)/∂α = X w̄^s,    v̄^s = (1/m) ∂K(w̄^s, ᾱ^s)/∂w = (1/m) X^T ᾱ^s.

Algorithm 2: DSCOVR-SVRG

input: initial points w̄^0, ᾱ^0, number of stages S and number of iterations per stage M.
1: for s = 0, 1, 2, ..., S−1 do
2:   ū^s = X w̄^s and v̄^s = (1/m) X^T ᾱ^s
3:   w^0 = w̄^s and α^0 = ᾱ^s
4:   for t = 0, 1, 2, ..., M−1 do
5:     pick i ∈ {1,...,m} and k ∈ {1,...,n} randomly with distributions p and q respectively.
6:     compute variance-reduced stochastic gradients:

           u_i^{t+1} = ū_i^s + (1/q_k)(X_{ik} w_k^t − X_{ik} w̄_k^s),    (19)
           v_k^{t+1} = v̄_k^s + (1/(p_i m)) X_{ik}^T (α_i^t − ᾱ_i^s).    (20)

7:     update primal and dual block coordinates:

           α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}), keeping α_{i'}^{t+1} = α_{i'}^t for all i' ≠ i,
           w_k^{t+1} = prox_{τ_k g_k}(w_k^t − τ_k v_k^{t+1}), keeping w_{k'}^{t+1} = w_{k'}^t for all k' ≠ k.

8:   end for
9:   w̄^{s+1} = w^M and ᾱ^{s+1} = α^M.
10: end for
output: w̄^S and ᾱ^S.

The vectors ū^s and v̄^s share the same partitions as α^t and w^t, respectively. Inside each stage s, the variance-reduced stochastic gradients are computed in (19) and (20).
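Before analyzing these updates, here is our serial sketch of one stage (the paper's system is asynchronous C++/MPI; this is only an illustration), reusing the ridge-regression closed forms of the Algorithm 1 sketch, with c = m/N the scaling from (4):

```python
# One stage of DSCOVR-SVRG (Algorithm 2), serial simulation.
import numpy as np

def svrg_stage(Xs, ys, w_bar, alpha_bar, lam, c, sigma, tau, blocks, M, rng):
    m, n = len(Xs), len(blocks)
    p, q = 1.0 / m, 1.0 / n                      # uniform sampling
    # batch gradients at the stage start (Step 2)
    u_bar = [X @ w_bar for X in Xs]                            # X w_bar
    v_bar = sum(X.T @ a for X, a in zip(Xs, alpha_bar)) / m    # X^T alpha_bar / m
    w = w_bar.copy()
    alphas = [a.copy() for a in alpha_bar]
    for _ in range(M):
        i, k = rng.integers(m), rng.integers(n)
        cols = blocks[k]
        X_ik = Xs[i][:, cols]
        # variance-reduced stochastic gradients, equations (19)-(20)
        u = u_bar[i] + (1.0 / q) * (X_ik @ (w[cols] - w_bar[cols]))
        v = v_bar[cols] + (1.0 / (p * m)) * (X_ik.T @ (alphas[i] - alpha_bar[i]))
        # closed-form proximal updates, as in the Algorithm 1 sketch
        z = alphas[i] + sigma * u
        alphas[i] = c * (z - sigma * ys[i]) / (c + sigma)
        w[cols] = (w[cols] - tau * v) / (1.0 + tau * lam)
    return w, alphas
```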

It is easy to check that they are unbiased. More specifically, taking the expectation of u_i^{t+1} with respect to the random index k gives

    E_k[u_i^{t+1}] = ū_i^s + Σ_{k'=1}^n q_{k'} (1/q_{k'})(X_{ik'} w_{k'}^t − X_{ik'} w̄_{k'}^s) = ū_i^s + X_{i:} w^t − X_{i:} w̄^s = X_{i:} w^t,

and taking the expectation of v_k^{t+1} with respect to the random index i gives

    E_i[v_k^{t+1}] = v̄_k^s + Σ_{i'=1}^m p_{i'} (1/(p_{i'} m)) X_{i'k}^T (α_{i'}^t − ᾱ_{i'}^s) = v̄_k^s + (1/m)(X_{:k})^T(α^t − ᾱ^s) = (1/m)(X_{:k})^T α^t.

In order to measure the distance of any pair of primal and dual variables to the saddle point, we define a weighted squared Euclidean norm on R^{d+N}. Specifically, for any pair (w, α) where w ∈ R^d and α = [α₁; ...; α_m] ∈ R^N with α_i ∈ R^{N_i}, we define

    Ω(w, α) = λ‖w‖² + (1/m) Σ_{i=1}^m γ_i ‖α_i‖².    (21)

If γ_i = γ for all i = 1,...,m, then Ω(w, α) = λ‖w‖² + (γ/m)‖α‖². We have the following theorem concerning the convergence rate of Algorithm 2.

Theorem 1. Suppose Assumption 1 holds, and let (w★, α★) be the unique saddle point of L(w, α). Let Γ be a constant that satisfies

    Γ ≥ max_{i,k} { (1/p_i)(1 + 9‖X_{ik}‖²/(2 q_k λ γ_i)), (1/q_k)(1 + 9n‖X_{ik}‖²/(2 m p_i λ γ_i)) }.    (22)

In Algorithm 2, if we choose the step sizes as

    σ_i = 1/(2 γ_i (p_i Γ − 1)),  i = 1,...,m,    (23)
    τ_k = 1/(2 λ (q_k Γ − 1)),  k = 1,...,n,    (24)

and the number of iterations during each stage satisfies M ≥ Γ log 3, then for any s > 0,

    E[Ω(w̄^s − w★, ᾱ^s − α★)] ≤ (2/3)^s Ω(w̄^0 − w★, ᾱ^0 − α★).    (25)

The proof of Theorem 1 is given in Appendix A. Here we discuss how to choose the parameter Γ to satisfy (22). For simplicity, we assume γ_i = γ for all i = 1,...,m. If we let ‖X‖_{m×n} = max_{i,k}{‖X_{ik}‖} and sample with the uniform distribution across both rows and columns, i.e., p_i = 1/m for i = 1,...,m and q_k = 1/n for k = 1,...,n, then we can set

    Γ = max{m, n}(1 + 9n‖X‖²_{m×n}/(2λγ)) = max{m, n}(1 + (9/2) κ_rand),

where κ_rand = n‖X‖²_{m×n}/(λγ) as defined in (16).

An alternative condition for Γ to satisfy is (shown in Section A.1 in the Appendix)

    Γ ≥ max_{i,k} { (1/p_i)(1 + 9‖X_{:k}‖²_F/(2 q_k m λγ)), (1/q_k)(1 + 9‖X_{i:}‖²_F/(2 p_i m λγ)) }.    (26)

Again using uniform sampling, we can set

    Γ = max{m, n}(1 + 9‖X‖²_{max,F}/(2λγ)) = max{m, n}(1 + (9/2) κ'_rand),

where ‖X‖_{max,F} = max_{i,k}{‖X_{i:}‖_F, ‖X_{:k}‖_F} and κ'_rand = ‖X‖²_{max,F}/(λγ) as defined in (17). Using the condition (26), if we choose the probabilities to be proportional to the squared Frobenius norms of the data partitions, i.e.,

    p_i = ‖X_{i:}‖²_F / ‖X‖²_F,    q_k = ‖X_{:k}‖²_F / ‖X‖²_F,    (27)

then we can choose

    Γ = (1/min_{i,k}{p_i, q_k})(1 + 9‖X‖²_F/(2mλγ)) = (1/min_{i,k}{p_i, q_k})(1 + (9/2) κ''_rand),

where κ''_rand = ‖X‖²_F/(mλγ). Moreover, we can set the step sizes as (see Appendix A.2)

    σ_i = 2mλ/(9‖X‖²_F),    τ_k = 2mγ/(9‖X‖²_F).

For the ERM problem (3), we assume that each loss function φ_j, for j = 1,...,N, is (1/ν)-smooth. According to (4), the smoothness parameter for each f_i is γ_i = γ = (N/m)ν. Let R be the largest Euclidean norm among all rows of X (or we can normalize each row to have the same norm R); then we have ‖X‖²_F ≤ NR² and

    κ''_rand = ‖X‖²_F/(mλγ) ≤ NR²/(mλγ) = R²/(λν).    (28)

The upper bound R²/(λν) is a condition number used for characterizing the iteration complexity of many randomized algorithms for ERM (e.g., Shalev-Shwartz and Zhang, 2013; Le Roux et al., 2012; Johnson and Zhang, 2013; Defazio et al., 2014; Zhang and Xiao, 2017). In this case, using the non-uniform sampling in (27), we can set the step sizes to be

    σ_i = 2λm/(9R²N),    τ_k = 2γm/(9R²N) = 2ν/(9R²).    (29)
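The following small sketch (ours) computes the non-uniform sampling distributions (27) and the ERM step sizes (29) from the data blocks; `blocks` is the list of column-index sets defining the partition of w:

```python
# Sampling probabilities (27) and step sizes (29) for the ERM setting.
import numpy as np

def sampling_and_steps(Xs, blocks, lam, nu):
    m = len(Xs)
    N = sum(X.shape[0] for X in Xs)
    # p_i, q_k proportional to squared Frobenius norms of row/column blocks
    row_sq = np.array([np.linalg.norm(X, 'fro') ** 2 for X in Xs])
    col_sq = np.array([sum(np.linalg.norm(X[:, cols], 'fro') ** 2 for X in Xs)
                       for cols in blocks])
    total = row_sq.sum()                       # = ||X||_F^2
    p, q = row_sq / total, col_sq / total
    # step sizes (29), with R the largest row norm of X
    R2 = max(np.max(np.sum(X * X, axis=1)) for X in Xs)
    sigma = 2 * lam * m / (9 * R2 * N)
    tau = 2 * nu / (9 * R2)
    return p, q, sigma, tau
```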

Next we estimate the overall computation complexity of DSCOVR-SVRG in order to achieve E[Ω(w̄^s − w★, ᾱ^s − α★)] ≤ ε. From (25), the number of stages required is log(Ω₀/ε)/log(3/2), where Ω₀ = Ω(w̄^0 − w★, ᾱ^0 − α★). The number of inner iterations within each stage is M = Γ log 3. At the beginning of each stage, computing the batch gradients ū^s and v̄^s requires going through the whole data set X, whose computational cost is equivalent to m·n inner iterations. Therefore, the overall complexity of Algorithm 2, measured by the total number of inner iterations, is

    O((mn + Γ) log(Ω₀/ε)).

To simplify discussion, we further assume m ≤ n, which is always the case for distributed implementation (see Figure 2 and Section 7). In this case, we can let Γ = n(1 + (9/2)κ_rand). Thus the above iteration complexity becomes

    O(n(m + κ_rand) log(1/ε)).    (30)

Since the iteration complexity in (30) counts the number of blocks X_{ik} being processed, the number of passes over the whole dataset X can be obtained by dividing it by mn, i.e.,

    O((1 + κ_rand/m) log(1/ε)).    (31)

This is the computation complexity of DSCOVR listed in Table 1. We can replace κ_rand by κ'_rand or κ''_rand depending on the different proof techniques and sampling probabilities discussed above. We will address the communication complexity of DSCOVR-SVRG, including its decomposition into synchronous and asynchronous parts, after describing its implementation details in Section 7.

In addition to convergence to the saddle point, our next result shows that the primal-dual optimality gap also enjoys the same convergence rate, under slightly different conditions.

Theorem 2. Suppose Assumption 1 holds, and let P(w) and D(α) be the primal and dual functions defined in (6) and (7), respectively. Let Λ and Γ be two constants that satisfy

    Λ ≥ ‖X_{ik}‖²_F,  i = 1,...,m,  k = 1,...,n,

and

    Γ ≥ max_{i,k} { (1/p_i)(1 + 18Λ/(q_k λ γ_i)), (1/q_k)(1 + 18nΛ/(p_i m λ γ_i)) }.

In Algorithm 2, if we choose the step sizes as

    σ_i = 1/(2 γ_i (p_i Γ − 1)),  i = 1,...,m,    (32)
    τ_k = 1/(2 λ (q_k Γ − 1)),  k = 1,...,n,    (33)

and the number of iterations during each stage satisfies M ≥ Γ log 3, then

    E[P(w̄^s) − D(ᾱ^s)] ≤ (2/3)^s Γ (P(w̄^0) − D(ᾱ^0)).    (34)

The proof of Theorem 2 is given in Appendix B. In terms of iteration complexity or total number of passes to reach E[P(w̄^s) − D(ᾱ^s)] ≤ ε, we need to add an extra factor of log(1 + κ_rand) to (30) or (31), due to the factor Γ on the right-hand side of (34).

4. The DSCOVR-SAGA Algorithm

Algorithm 3 is a DSCOVR algorithm that uses the techniques of SAGA (Defazio et al., 2014) for variance reduction. This is a single-stage algorithm with iterations indexed by t. In order to compute the variance-reduced stochastic gradients u_i^{t+1} and v_k^{t+1} at each iteration, we also need to maintain and update two vectors ū^t ∈ R^N and v̄^t ∈ R^d, and two matrices U^t ∈ R^{N×n} and V^t ∈ R^{m×d}. The vector ū^t shares the same partition as α^t into m blocks, and v̄^t shares the same partition as w^t into n blocks. The matrix U^t is partitioned into m×n blocks, with each block U_{ik}^t ∈ R^{N_i}. The matrix V^t is also partitioned into m×n blocks, with each block V_{ik}^t ∈ R^{1×d_k}.

Algorithm 3: DSCOVR-SAGA

input: initial points w^0, α^0, and number of iterations M.
1: ū^0 = X w^0 and v̄^0 = (1/m) X^T α^0
2: U_{ik}^0 = X_{ik} w_k^0 and V_{ik}^0 = (1/m)(α_i^0)^T X_{ik}, for all i = 1,...,m and k = 1,...,n.
3: for t = 0, 1, 2, ..., M−1 do
4:   pick i ∈ {1,...,m} and k ∈ {1,...,n} randomly with distributions p and q respectively.
5:   compute variance-reduced stochastic gradients:

         u_i^{t+1} = ū_i^t − (1/q_k) U_{ik}^t + (1/q_k) X_{ik} w_k^t,    (35)
         v_k^{t+1} = v̄_k^t − (1/p_i)(V_{ik}^t)^T + (1/(p_i m)) X_{ik}^T α_i^t.    (36)

6:   update primal and dual block coordinates:

         α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}), keeping α_{i'}^{t+1} = α_{i'}^t for all i' ≠ i,
         w_k^{t+1} = prox_{τ_k g_k}(w_k^t − τ_k v_k^{t+1}), keeping w_{k'}^{t+1} = w_{k'}^t for all k' ≠ k.

7:   update averaged stochastic gradients:

         ū_i^{t+1} = ū_i^t − U_{ik}^t + X_{ik} w_k^t, keeping ū_{i'}^{t+1} = ū_{i'}^t for all i' ≠ i,
         v̄_k^{t+1} = v̄_k^t − (V_{ik}^t)^T + (1/m) X_{ik}^T α_i^t, keeping v̄_{k'}^{t+1} = v̄_{k'}^t for all k' ≠ k.

8:   update the table of historical stochastic gradients:

         U_{ik}^{t+1} = X_{ik} w_k^t and V_{ik}^{t+1} = (1/m)(X_{ik}^T α_i^t)^T, keeping all other blocks of U^t and V^t unchanged.

9: end for
output: w^M and α^M.
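Our sketch of one iteration (Steps 4-8) follows; the tables and running sums are assumed initialized as in Steps 1-2 (U[i][k] = X_ik @ w0, V[i][k] = (1/m) X_ik^T alpha0_i, with u_bar and v_bar their row/column sums), and the prox steps reuse the ridge closed forms of the earlier sketches:

```python
# One DSCOVR-SAGA iteration (our serial illustration), c = m/N as in (4).
import numpy as np

def saga_step(Xs, ys, w, alphas, U, V, u_bar, v_bar,
              lam, c, sigma, tau, blocks, rng):
    m, n = len(Xs), len(blocks)
    p, q = 1.0 / m, 1.0 / n                      # uniform sampling
    i, k = rng.integers(m), rng.integers(n)
    cols = blocks[k]
    X_ik = Xs[i][:, cols]
    new_U = X_ik @ w[cols]                       # X_ik w_k^t
    new_V = (1.0 / m) * (X_ik.T @ alphas[i])     # (1/m) X_ik^T alpha_i^t
    # variance-reduced stochastic gradients, equations (35)-(36)
    u = u_bar[i] + (new_U - U[i][k]) / q
    v = v_bar[cols] + (new_V - V[i][k]) / p
    # Step 6: proximal updates (ridge closed forms)
    z = alphas[i] + sigma * u
    alphas[i] = c * (z - sigma * ys[i]) / (c + sigma)
    w[cols] = (w[cols] - tau * v) / (1.0 + tau * lam)
    # Steps 7-8: refresh running sums and the table with the gradient
    # pieces evaluated at the pre-update iterates
    u_bar[i] += new_U - U[i][k]
    v_bar[cols] += new_V - V[i][k]
    U[i][k], V[i][k] = new_U, new_V
```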

According to the updates in Steps 7 and 8 of Algorithm 3, we have

    ū_i^t = Σ_{k=1}^n U_{ik}^t,  i = 1,...,m,    (37)
    v̄_k^t = Σ_{i=1}^m (V_{ik}^t)^T,  k = 1,...,n.    (38)

Based on the above constructions, we can show that u_i^{t+1} is an unbiased stochastic gradient of K(w^t, α^t) = (α^t)^T X w^t with respect to α_i, and v_k^{t+1} is an unbiased stochastic gradient of (1/m)(α^t)^T X w^t with respect to w_k. More specifically, according to (35), we have

    E_k[u_i^{t+1}] = ū_i^t − Σ_{k'=1}^n q_{k'} (1/q_{k'}) U_{ik'}^t + Σ_{k'=1}^n q_{k'} (1/q_{k'}) X_{ik'} w_{k'}^t = ū_i^t − ū_i^t + X_{i:} w^t = ∂/∂α_i [(α^t)^T X w^t],    (39)

where the third equality is due to (37). Similarly, according to (36), we have

    E_i[v_k^{t+1}] = v̄_k^t − Σ_{i'=1}^m p_{i'} (1/p_{i'})(V_{i'k}^t)^T + Σ_{i'=1}^m p_{i'} (1/(p_{i'} m)) X_{i'k}^T α_{i'}^t = v̄_k^t − v̄_k^t + (1/m)(X_{:k})^T α^t = ∂/∂w_k [(1/m)(α^t)^T X w^t],    (40)

where the third equality is due to (38).

Regarding the convergence of DSCOVR-SAGA, we have the following theorem, which is proved in Appendix C.

Theorem 3. Suppose Assumption 1 holds, and let (w★, α★) be the unique saddle point of L(w, α). Let Γ be a constant that satisfies

    Γ ≥ max_{i,k} { (1/p_i)(1 + 9‖X_{ik}‖²/(2 q_k λ γ_i)), (1/q_k)(1 + 9n‖X_{ik}‖²/(2 m p_i λ γ_i)), 1/(p_i q_k) }.    (41)

If we choose the step sizes as

    σ_i = 1/(2 γ_i (p_i Γ − 1)),  i = 1,...,m,    (42)
    τ_k = 1/(2 λ (q_k Γ − 1)),  k = 1,...,n,    (43)

then the iterations of Algorithm 3 satisfy, for t = 1, 2, ...,

    E[Ω(w^t − w★, α^t − α★)] ≤ (1 − 1/(3Γ))^t (4/3) Ω(w^0 − w★, α^0 − α★).    (44)

The condition on Γ in (41) is very similar to the one in (22), except that here we have an additional term 1/(p_i q_k) when taking the maximum over i and k. This results in an extra mn term in estimating Γ under uniform sampling. Assuming m ≤ n (true for distributed implementation), we can let

    Γ = n(1 + (9/2) κ_rand) + mn.

According to (44), in order to achieve E[Ω(w^t − w★, α^t − α★)] ≤ ε, DSCOVR-SAGA needs O(Γ log(1/ε)) iterations. Using the above expression for Γ, the iteration complexity is

    O(n(m + κ_rand) log(1/ε)),    (45)

which is the same as (30) for DSCOVR-SVRG. This also leads to the same computational complexity measured by the number of passes over the whole dataset, which is given in (31). Again we can replace κ_rand by κ'_rand or κ''_rand as discussed in Section 3. We will discuss the communication complexity of DSCOVR-SAGA in Section 7, after describing its implementation details.

5. Accelerated DSCOVR Algorithms

In this section, we develop an accelerated DSCOVR algorithm by following the "catalyst" framework (Lin et al., 2015; Frostig et al., 2015). More specifically, we adopt the same procedure as Balamurugan and Bach (2016) for solving convex-concave saddle-point problems.

Algorithm 4 proceeds in rounds indexed by r = 0, 1, 2, .... Given the initial points w̃^0 ∈ R^d and α̃^0 ∈ R^N, each round r computes two new vectors w̃^{r+1} and α̃^{r+1} using either the DSCOVR-SVRG or the DSCOVR-SAGA algorithm for solving a regularized saddle-point problem, similar to the classical proximal point algorithm (Rockafellar, 1976).

Algorithm 4: Accelerated DSCOVR

input: initial points w̃^0, α̃^0, and parameter δ > 0.
1: for r = 0, 1, 2, ..., do
2:   find an approximate saddle point of (46) using one of the following two options:
     option 1: run Algorithm 2 with S = ⌈log(2(1+δ))/log(3/2)⌉ and M = ⌈Γ_δ log 3⌉ to obtain (w̃^{r+1}, α̃^{r+1}) = DSCOVR-SVRG(w̃^r, α̃^r, S, M).
     option 2: run Algorithm 3 with M = ⌈6 log(8(1+δ)/3) Γ_δ⌉ to obtain (w̃^{r+1}, α̃^{r+1}) = DSCOVR-SAGA(w̃^r, α̃^r, M).
3: end for

Let δ > 0 be a parameter which we will determine later. Consider the following perturbed saddle-point function for round r:

    L_δ^(r)(w, α) = L(w, α) + (δλ/2)‖w − w̃^r‖² − (δ/(2m)) Σ_{i=1}^m γ_i ‖α_i − α̃_i^r‖².    (46)

Under Assumption 1, the function L_δ^(r)(w, α) is (1+δ)λ-strongly convex in w and (1+δ)γ_i/m-strongly concave in α_i. Let Γ_δ be a constant that satisfies

    Γ_δ ≥ max_{i,k} { (1/p_i)(1 + 9‖X_{ik}‖²/(2 q_k λ γ_i (1+δ)²)), (1/q_k)(1 + 9n‖X_{ik}‖²/(2 m p_i λ γ_i (1+δ)²)), 1/(p_i q_k) },

where the right-hand side is obtained from (41) by replacing λ and γ_i with (1+δ)λ and (1+δ)γ_i respectively. The constant Γ_δ is used in Algorithm 4 to determine the number of inner iterations to run within each round, as well as for setting the step sizes. The following theorem is proved in Appendix D.

Theorem 4. Suppose Assumption 1 holds, and let (w★, α★) be the saddle point of L(w, α). With either option in Algorithm 4, if we choose the step sizes (inside Algorithm 2 or Algorithm 3) as

    σ_i = 1/(2(1+δ) γ_i (p_i Γ_δ − 1)),  i = 1,...,m,    (47)
    τ_k = 1/(2(1+δ) λ (q_k Γ_δ − 1)),  k = 1,...,n,    (48)

then for all r ≥ 0,

    E[Ω(w̃^r − w★, α̃^r − α★)] ≤ (1 − 1/(2(1+δ)))^r Ω(w̃^0 − w★, α̃^0 − α★).

According to Theorem 4, in order to have E[Ω(w̃^r − w★, α̃^r − α★)] ≤ ε, we need the number of rounds r to satisfy

    r ≥ 2(1+δ) log(Ω(w̃^0 − w★, α̃^0 − α★)/ε).

Following the discussions in Sections 3 and 4, when using uniform sampling and assuming m ≤ n, we can have

    Γ_δ = n(1 + 9κ_rand/(2(1+δ)²)) + mn.    (49)

Then the total number of block coordinate updates in Algorithm 4 is

    Õ((1+δ) Γ_δ log(1+δ) log(1/ε)),

where the log(1+δ) factor comes from the number of stages S in option 1 and the number of steps M in option 2. We hide the log(1+δ) factor with the Õ notation and plug (49) into the expression above to obtain

    Õ(n((1+δ)(1+m) + κ_rand/(1+δ)) log(1/ε)).

Now we can choose δ depending on the relative size of κ_rand and m:

1. If κ_rand > m, we can minimize the above expression by choosing δ = √(κ_rand/m) − 1, so that the overall iteration complexity becomes Õ(n√(m κ_rand) log(1/ε)).
2. If κ_rand ≤ m, then no acceleration is necessary and we can choose δ = 0 to proceed with a single round. In this case, the iteration complexity is O(mn), as seen from (49).

Therefore, in either case, the total number of block iterations by Algorithm 4 can be written as

    Õ(mn + n√(m κ_rand) log(1/ε)).    (50)

As discussed before, the total number of passes over the whole dataset is obtained by dividing by mn: Õ((1 + √(κ_rand/m)) log(1/ε)). This is the computational complexity of accelerated DSCOVR listed in Table 1.

5.1 Proximal Mapping for Accelerated DSCOVR

When applying Algorithm 2 or 3 to approximate the saddle point of (46), we need to replace the proximal mappings of g_k and f_i* by those of g_k + (δλ/2)‖· − w̃_k^r‖² and f_i* + (δγ_i/2)‖· − α̃_i^r‖², respectively. More precisely, we replace w_k^{t+1} = prox_{τ_k g_k}(w_k^t − τ_k v_k^{t+1}) by

    w_k^{t+1} = arg min_{w∈R^{d_k}} { g_k(w) + (δλ/2)‖w − w̃_k^r‖² + (1/(2τ_k))‖w − (w_k^t − τ_k v_k^{t+1})‖² }
              = prox_{τ'_k g_k}( τ'_k (w_k^t/τ_k − v_k^{t+1} + δλ w̃_k^r) ),  where τ'_k = τ_k/(1 + δλ τ_k),    (51)

and replace α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}) by

    α_i^{t+1} = arg min_{α∈R^{N_i}} { f_i*(α) + (δγ_i/2)‖α − α̃_i^r‖² + (1/(2σ_i))‖α − (α_i^t + σ_i u_i^{t+1})‖² }
              = prox_{σ'_i f_i*}( σ'_i (α_i^t/σ_i + u_i^{t+1} + δγ_i α̃_i^r) ),  where σ'_i = σ_i/(1 + δγ_i σ_i).    (52)

We also examine the number of inner iterations determined by Γ_δ and how to set the step sizes. If we choose δ = √(κ_rand/m) − 1, then Γ_δ in (49) becomes

    Γ_δ = n(1 + 9κ_rand/(2(1+δ)²)) + mn = n(1 + (9/2)m) + mn = (5.5m + 1)n.

Therefore a small constant number of passes is sufficient within each round. Using uniform sampling, the step sizes can be estimated as follows:

    σ_i = 1/(2(1+δ)γ(p_i Γ_δ − 1)) ≈ 1/(11 γ n √(κ_rand/m)),    (53)
    τ_k = 1/(2(1+δ)λ(q_k Γ_δ − 1)) ≈ 1/(11 λ √(m κ_rand)).    (54)

As shown by our numerical experiments in Section 8, the step sizes can be set much larger in practice.
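Because the extra quadratic terms in (51)-(52) only rescale the prox parameter and shift its argument, any existing prox routine can be reused unchanged. The following sketch (ours; prox_g(t, v) and prox_f_conj(t, v) are assumed to compute the prox of t·g_k and t·f_i* at v) makes this explicit:

```python
# Perturbed proximal steps (51)-(52) reduced to ordinary prox calls.
def perturbed_primal_prox(prox_g, w_k, v, w_tilde, tau, delta, lam):
    tau_p = tau / (1.0 + delta * lam * tau)            # tau' in (51)
    return prox_g(tau_p, tau_p * (w_k / tau - v + delta * lam * w_tilde))

def perturbed_dual_prox(prox_f_conj, a_i, u, a_tilde, sigma, delta, gamma):
    sigma_p = sigma / (1.0 + delta * gamma * sigma)    # sigma' in (52)
    return prox_f_conj(sigma_p,
                       sigma_p * (a_i / sigma + u + delta * gamma * a_tilde))
```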

6. Conjugate-Free DSCOVR Algorithms

A major disadvantage of primal-dual algorithms for solving problem (1) is the requirement of computing the proximal mapping of the conjugate function f_i*, which may not admit a closed-form solution or efficient computation. This is especially the case for logistic regression, one of the most popular loss functions used in classification.

Lan and Zhou (2015) developed "conjugate-free" variants of primal-dual algorithms that avoid computing the proximal mapping of the conjugate functions. The main idea is to replace the Euclidean distance in the dual proximal mapping with a Bregman divergence defined over the conjugate function itself. This technique has been used by Wang and Xiao (2017) to solve structured ERM problems with primal-dual first-order methods. Here we use this approach to derive conjugate-free DSCOVR algorithms. In particular, we replace the proximal mapping for the dual update,

    α_i^{t+1} = prox_{σ_i f_i*}(α_i^t + σ_i u_i^{t+1}) = arg min_{α∈R^{N_i}} { f_i*(α) − ⟨α, u_i^{t+1}⟩ + (1/(2σ_i))‖α − α_i^t‖² },

by

    α_i^{t+1} = arg min_{α∈R^{N_i}} { f_i*(α) − ⟨α, u_i^{t+1}⟩ + (1/σ_i) B_i(α, α_i^t) },    (55)

where B_i(α, α_i^t) = f_i*(α) − f_i*(α_i^t) − ⟨∇f_i*(α_i^t), α − α_i^t⟩. The solution to (55) is given by

    α_i^{t+1} = ∇f_i(β_i^{t+1}),

where β_i^{t+1} can be computed recursively by

    β_i^{t+1} = (β_i^t + σ_i u_i^{t+1})/(1 + σ_i),  t ≥ 0,

with initial condition β_i^0 = ∇f_i*(α_i^0) (see Lan and Zhou, 2015, Lemma 1). Therefore, in order to update the dual variables α_i, we do not need to compute the proximal mapping of the conjugate function f_i*; instead, taking the gradient of f_i at some easy-to-compute points is sufficient. This conjugate-free update can be applied in Algorithms 1, 2 and 3.

For the accelerated DSCOVR algorithms, we replace (52) by

    α_i^{t+1} = arg min_{α∈R^{N_i}} { f_i*(α) − ⟨α, u_i^{t+1}⟩ + (1/σ_i) B_i(α, α_i^t) + δγ_i B_i(α, α̃_i^r) }.

The solution to the above minimization problem can also be written as α_i^{t+1} = ∇f_i(β_i^{t+1}), where β_i^{t+1} can be computed recursively as

    β_i^{t+1} = (β_i^t + σ_i u_i^{t+1} + σ_i δγ_i β̃_i^r)/(1 + σ_i + σ_i δγ_i),  t ≥ 0,

with the initialization β_i^0 = ∇f_i*(α_i^0) and β̃_i^r = ∇f_i*(α̃_i^r).

The convergence rates and computational complexities of the conjugate-free DSCOVR algorithms are very similar to the ones given in Sections 3-5. We omit the details here, and refer the readers to Lan and Zhou (2015) and Wang and Xiao (2017) for related results.
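For concreteness, here is our sketch of the conjugate-free dual update (55) for the logistic loss, where f_i(u) = (m/N) Σ_j log(1 + exp(−y_j u_j)); only gradients of f_i are needed, never the prox of its conjugate (c = m/N is the scaling from (4), and the function names are ours):

```python
# Conjugate-free dual step for the logistic loss (illustrative sketch).
import numpy as np

def grad_f_logistic(beta, y, c):
    """Gradient of f_i at beta: -c * y_j / (1 + exp(y_j * beta_j))."""
    return -c * y / (1.0 + np.exp(y * beta))

def conjugate_free_dual_step(beta, u, y, c, sigma):
    beta_new = (beta + sigma * u) / (1.0 + sigma)  # recursion below (55)
    alpha_new = grad_f_logistic(beta_new, y, c)    # alpha^{t+1} = grad f_i(beta^{t+1})
    return beta_new, alpha_new
```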

Figure 3: A distributed system for implementing DSCOVR consists of m workers, h parameter servers, and one scheduler. The arrows labeled with the numbers 1, 2 and 3 represent three collective communications at the beginning of each stage in DSCOVR-SVRG.

7. Asynchronous Distributed Implementation

In this section, we show how to implement the DSCOVR algorithms presented in Sections 3-6 in a distributed computing system. We assume that the system provides both synchronous collective communication and asynchronous point-to-point communication, which are all supported by the MPI standard (MPI Forum, 2012). Throughout this section, we assume m < n (see Figure 2).

7.1 Implementation of DSCOVR-SVRG

In order to implement Algorithm 2, the distributed system needs to have the following components (see Figure 3):

- m workers. Each worker i, for i = 1,...,m, stores the following local data and variables:
  - data matrix X_{i:} ∈ R^{N_i×d};
  - vectors in R^{N_i}: ū_i^s, α_i^t, ᾱ_i^s;
  - vectors in R^d: w̄^s, v̄^s;
  - extra buffers for computation and communication: u_i^{t+1}, v_k^{t+1}, w_k^t and w_k^{t+1}.

- h parameter servers. Each server j stores a subset of the blocks {w_k^t ∈ R^{d_k} : k ∈ S_j}, where S₁,...,S_h form a partition of the set {1,...,n}.
- one scheduler. It maintains a set of block indices S_free ⊆ {1,...,n}. At any given time, S_free contains the indices of parameter blocks that are not currently being updated by any worker.

The reason for having h > 1 servers is not insufficient storage for the parameters, but rather to avoid communication overload between a single server and all m workers (m can be in hundreds).

At the beginning of each stage s, the following three collective communications take place across the system (illustrated in Figure 3 by arrows with circled labels 1, 2 and 3):

(1) The scheduler sends a "sync" message to all servers and workers, and resets S_free = {1,...,n}.
(2) Upon receiving the sync message, the servers aggregate their blocks of parameters together to form w̄^s and send it to all workers (e.g., through the AllReduce operation in MPI).
(3) Upon receiving w̄^s, each worker i computes ū_i^s = X_{i:} w̄^s and (X_{i:})^T ᾱ_i^s, then invokes a collective communication (AllReduce) to compute v̄^s = (1/m) Σ_{i=1}^m (X_{i:})^T ᾱ_i^s.

The number of vectors in R^d sent and received during the above process is 2m, counting the communications to form w̄^s and v̄^s at the m workers (ignoring the short sync messages).

After the collective communications at the beginning of each stage, all workers start working on the inner iterations of Algorithm 2 in parallel, in an asynchronous, event-driven manner. Each worker interacts with the scheduler and the servers in a four-step loop shown in Figure 4. There are always m iterations taking place concurrently (see also Figure 2), each of which may be at a different phase of the four-step loop:

(1) Whenever worker i finishes updating a block k', it sends the pair (i, k') to the scheduler to request another block to update. At the beginning of each stage, k' is not needed.
(2) When the scheduler receives the pair (i, k'), it randomly chooses a block k from the list of free blocks S_free (which are not currently updated by any worker), looks up the server j which stores the parameter block w_k (i.e., k ∈ S_j), and then sends the pair (i, k) to server j. In addition, the scheduler updates the list S_free by adding k' and deleting k.
(3) When server j receives the pair (i, k), it sends the vector w_k to worker i, and waits to receive the updated version from worker i.
(4) After worker i receives w_k, it computes the updates α_i^{t+1} and w_k^{t+1} following Steps 6-7 in Algorithm 2, and then sends w_k^{t+1} back to server j. At last, it assigns the value of k to k' and sends the pair (i, k') to the scheduler, requesting the next block to work on.

The amount of point-to-point communication required during the above process is 2d_k float numbers, for sending and receiving w_k (we ignore the small messages carrying the index pairs). Since the blocks are picked randomly, the average amount of communication per iteration is 2d/n, or equivalently 2/n vectors in R^d. According to Theorem 1, each stage of Algorithm 2 requires Γ log 3 inner iterations; in addition, the discussion above (30) shows that we can take Γ = n(1 + (9/2)κ_rand). Therefore, the average amount of point-to-point communication within each stage is O(κ_rand) vectors in R^d.
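The scheduler's bookkeeping in this four-step loop is simple; the following sketch (ours) captures it, with message transport (MPI point-to-point communication in the paper's implementation) abstracted away:

```python
# A minimal sketch of the scheduler: it hands out random free blocks and
# reclaims finished ones; server_of_block maps a block index k to the
# server j with k in S_j.
import random

class Scheduler:
    def __init__(self, n, server_of_block):
        self.free = set(range(n))               # S_free
        self.server_of_block = server_of_block

    def request(self, worker_id, finished_block=None):
        """Step (2): reclaim finished_block, assign a fresh random block."""
        if finished_block is not None:
            self.free.add(finished_block)
        k = random.choice(tuple(self.free))
        self.free.remove(k)
        # the pair (worker_id, k) would be forwarded to this server:
        return k, self.server_of_block[k]
```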

Figure 4: Communication and computation processes for one inner iteration of DSCOVR-SVRG (Algorithm 2). The blue texts in the parentheses are the additional vectors required by DSCOVR-SAGA (Algorithm 3). There are always m iterations taking place in parallel asynchronously, each evolving around one worker. A server may support multiple (or zero) iterations if more than one (or none) of its stored parameter blocks are being updated.

Now we are ready to quantify the communication complexity of DSCOVR-SVRG to find an ε-optimal solution. Our discussions above show that each stage requires collective communication of 2m vectors in R^d and asynchronous point-to-point communication of (equivalently) κ_rand such vectors. Since there are O(log(1/ε)) stages in total, the total communication complexity is

    O((m + κ_rand) log(1/ε)).

This gives the communication complexity shown in Table 1, as well as its decomposition in Table 2.

7.2 Implementation of DSCOVR-SAGA

We can implement Algorithm 3 using the same distributed system shown in Figure 3, but with some modifications described below. First, the storage at the different components is different:

- m workers. Each worker i, for i = 1,...,m, stores the following data and variables:
  - data matrix X_{i:} ∈ R^{N_i×d};
  - vectors in R^{N_i}: α_i^t, u_i^{t+1}, ū_i^t, and U_{ik}^t for k = 1,...,n;

  - vector in R^d: V_{i:}^t = [V_{i1}^t ⋯ V_{in}^t], which is the i-th row of V^t;
  - buffers for communication and update of w_k^t and v̄_k^t ∈ R^{d_k} (both stored at some server).
- h servers. Each server j stores a subset of the blocks {w_k^t, v̄_k^t ∈ R^{d_k} : k ∈ S_j}.
- one scheduler. It maintains the set of indices S_free ⊆ {1,...,n}, same as in DSCOVR-SVRG.

Unlike DSCOVR-SVRG, there are no stage-wise "sync" messages. All workers and servers work in parallel asynchronously all the time, following the four-step loops illustrated in Figure 4 (including the blue colored texts in the parentheses). Within each iteration, the main difference from DSCOVR-SVRG is that the server and worker need to exchange two vectors of length d_k, namely w_k^t and v̄_k^t, and their updates. This doubles the amount of point-to-point communication, and the average amount of communication per iteration is 4/n vectors of length d. Using the iteration complexity in (45), the total amount of communication required (measured by the number of vectors of length d) is

    O((m + κ_rand) log(1/ε)),

which is the same as for DSCOVR-SVRG. However, its decomposition into synchronous and asynchronous communication is different, as shown in Table 2. If the initial vectors w^0 ≠ 0 or α^0 ≠ 0, then one round of collective communication is required to propagate the initial conditions to all servers and workers, which reflects the O(m) synchronous communication in Table 2.

7.3 Implementation of Accelerated DSCOVR

Implementation of the accelerated DSCOVR algorithm is very similar to the non-accelerated ones. The main differences lie in the two proximal mappings presented in Section 5.1. In particular, the primal update in (51) needs the extra variable w̃_k^r, which should be stored at a parameter server together with w_k^t. We modify the four-step loop shown in Figure 4 as follows:

- Each parameter server j stores the extra block parameters {w̃_k^r : k ∈ S_j}. During step (3), w̃_k^r is sent together with w_k^t (for SVRG) or (w_k^t, v̄_k^t) (for SAGA) to a worker.
- In step (4), no update of w̃_k^r is sent back to the server. Instead, whenever switching rounds, the scheduler informs each server to update its w̃_k^r to the most recent w_k^t.

For the dual proximal mapping in (52), each worker i needs to store an extra vector α̃_i^r, and reset it to the most recent α_i^t when moving to the next round. There is no need for additional synchronization or collective communication when switching rounds in Algorithm 4. The communication complexity (measured by the number of vectors of length d sent or received) can be obtained by dividing the iteration complexity in (50) by n, i.e., O((m + √(m κ_rand)) log(1/ε)), as shown in Table 1.

Finally, in order to implement the conjugate-free DSCOVR algorithms described in Section 6, each worker simply needs to maintain and update an extra vector β_i^t locally.

8. Experiments

In this section, we present numerical experiments on an industrial distributed computing system. This system has hundreds of computers connected by high-speed Ethernet in a data center.

The hardware and software configurations for each machine are listed in Table 3.

Table 3: Configuration of each machine in the distributed computing system.

| CPU | #cores | RAM | network | operating system |
|---|---|---|---|---|
| dual Intel Xeon processors E5-2650 v2, 2.6 GHz | 16 | 128 GB | 10 Gbps Ethernet adapter | Windows Server 2012 |

We implemented all DSCOVR algorithms presented in this paper, including the SVRG and SAGA versions, their accelerated variants, as well as the conjugate-free algorithms. All implementations are written in C++, using MPI for both collective and point-to-point communications (see Figures 3 and 4 respectively). On each worker machine, we also use OpenMP (OpenMP Architecture Review Board, 2011) to exploit the multi-core architecture for parallel computing, including sparse matrix-vector multiplications and vectorized function evaluations.

Implementing the DSCOVR algorithms requires m + h + 1 machines, among which m are workers with local datasets, h are parameter servers, and one is a scheduler (see Figure 3). We focus on solving the ERM problem (3), where the total of N training examples are evenly partitioned and stored at the m workers. We partition the d-dimensional parameters into n subsets of roughly the same size (differing by at most one), where each subset consists of randomly chosen coordinates (without replacement). Then we store the n subsets of parameters on h servers, each getting either ⌊n/h⌋ or ⌈n/h⌉ subsets. As described in Section 7, we make the configurations satisfy n > m > h.

For DSCOVR-SVRG and DSCOVR-SAGA, the step sizes in (29) are very conservative. In the experiments, we replace the coefficient 2/9 by two tuning parameters η_d and η_p for the dual and primal step sizes respectively, i.e.,

    σ_i = η_d λm/(R²N),    τ_k = η_p ν/R².    (56)

For the accelerated DSCOVR algorithms, we use κ_rand = R²/(λν) as shown in (28) for ERM. Then the step sizes in (53) and (54), with γ = (N/m)ν and a generic constant coefficient η, become (a small sketch of these formulas appears at the end of this section)

    σ_i = η_d (m/(nRN)) √(mλ/ν),    τ_k = (η_p/R) √(ν/(mλ)).    (57)

For comparison, we also implemented the following first-order methods for solving problem (1):

- PGD: parallel implementation of the Proximal Gradient Descent method, using synchronous collective communication over m machines. We use the adaptive line search procedure proposed in Nesterov (2013); the exact form used is Algorithm 2 in Lin and Xiao (2015).
- APG: parallel implementation of the Accelerated Proximal Gradient method (Nesterov, 2004, 2013). We use a similar adaptive line search scheme to the one for PGD; the exact form used (with strong convexity) is Algorithm 4 in Lin and Xiao (2015).
- ADMM: the Alternating Direction Method of Multipliers. We use the regularized consensus version in Boyd et al. (2011, Section 7.1.1). For solving the local optimization problems at each node, we use the SDCA method (Shalev-Shwartz and Zhang, 2013).
- CoCoA: the adding version of CoCoA in Ma et al. (2015). Following the suggestion in Ma et al. (2017), we use a randomized coordinate descent algorithm (Nesterov, 2012; Richtárik and Takáč, 2014) for solving the local optimization problems.
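The tuned step sizes (56)-(57) referenced above are straightforward to compute; the following small sketch (ours, for illustration) spells them out, with η_d and η_p the tuning parameters that replace the 2/9 constant:

```python
# Tuned step sizes (56)-(57) for the ERM experiments.
import math

def steps_svrg_saga(eta_d, eta_p, lam, nu, m, N, R):
    sigma = eta_d * lam * m / (R**2 * N)   # dual step size in (56)
    tau = eta_p * nu / R**2                # primal step size in (56)
    return sigma, tau

def steps_accelerated(eta_d, eta_p, lam, nu, m, n, N, R):
    sigma = eta_d * (m / (n * R * N)) * math.sqrt(m * lam / nu)  # (57)
    tau = (eta_p / R) * math.sqrt(nu / (m * lam))                # (57)
    return sigma, tau
```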


More information

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix

Lectures - Week 4 Matrix norms, Conditioning, Vector Spaces, Linear Independence, Spanning sets and Basis, Null space and Range of a Matrix Lectures - Week 4 Matrx norms, Condtonng, Vector Spaces, Lnear Independence, Spannng sets and Bass, Null space and Range of a Matrx Matrx Norms Now we turn to assocatng a number to each matrx. We could

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

A Class of Distributed Optimization Methods with Event-Triggered Communication

A Class of Distributed Optimization Methods with Event-Triggered Communication A Cass of Dstrbuted Optmzaton Methods wth Event-Trggered Communcaton Martn C. Mene Mchae Ubrch Sebastan Abrecht the date of recept and acceptance shoud be nserted ater Abstract We present a cass of methods

More information

Grover s Algorithm + Quantum Zeno Effect + Vaidman

Grover s Algorithm + Quantum Zeno Effect + Vaidman Grover s Algorthm + Quantum Zeno Effect + Vadman CS 294-2 Bomb 10/12/04 Fall 2004 Lecture 11 Grover s algorthm Recall that Grover s algorthm for searchng over a space of sze wors as follows: consder the

More information

Distributed Moving Horizon State Estimation of Nonlinear Systems. Jing Zhang

Distributed Moving Horizon State Estimation of Nonlinear Systems. Jing Zhang Dstrbuted Movng Horzon State Estmaton of Nonnear Systems by Jng Zhang A thess submtted n parta fufment of the requrements for the degree of Master of Scence n Chemca Engneerng Department of Chemca and

More information

Delay tomography for large scale networks

Delay tomography for large scale networks Deay tomography for arge scae networks MENG-FU SHIH ALFRED O. HERO III Communcatons and Sgna Processng Laboratory Eectrca Engneerng and Computer Scence Department Unversty of Mchgan, 30 Bea. Ave., Ann

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Lecture 8: Time & Clocks. CDK: Sections TVS: Sections

Lecture 8: Time & Clocks. CDK: Sections TVS: Sections Lecture 8: Tme & Clocks CDK: Sectons 11.1 11.4 TVS: Sectons 6.1 6.2 Topcs Synchronzaton Logcal tme (Lamport) Vector clocks We assume there are benefts from havng dfferent systems n a network able to agree

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

Singular Value Decomposition: Theory and Applications

Singular Value Decomposition: Theory and Applications Sngular Value Decomposton: Theory and Applcatons Danel Khashab Sprng 2015 Last Update: March 2, 2015 1 Introducton A = UDV where columns of U and V are orthonormal and matrx D s dagonal wth postve real

More information

A new P system with hybrid MDE- k -means algorithm for data. clustering. 1 Introduction

A new P system with hybrid MDE- k -means algorithm for data. clustering. 1 Introduction Wesun, Lasheng Xang, Xyu Lu A new P system wth hybrd MDE- agorthm for data custerng WEISUN, LAISHENG XIANG, XIYU LIU Schoo of Management Scence and Engneerng Shandong Norma Unversty Jnan, Shandong CHINA

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1]

Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1] DYNAMIC SHORTEST PATH SEARCH AND SYNCHRONIZED TASK SWITCHING Jay Wagenpfel, Adran Trachte 2 Outlne Shortest Communcaton Path Searchng Bellmann Ford algorthm Algorthm for dynamc case Modfcatons to our algorthm

More information

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0 MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector

More information

Supervised Learning. Neural Networks and Back-Propagation Learning. Credit Assignment Problem. Feedforward Network. Adaptive System.

Supervised Learning. Neural Networks and Back-Propagation Learning. Credit Assignment Problem. Feedforward Network. Adaptive System. Part 7: Neura Networ & earnng /2/05 Superved earnng Neura Networ and Bac-Propagaton earnng Produce dered output for tranng nput Generaze reaonaby & appropratey to other nput Good exampe: pattern recognton

More information

IV. Performance Optimization

IV. Performance Optimization IV. Performance Optmzaton A. Steepest descent algorthm defnton how to set up bounds on learnng rate mnmzaton n a lne (varyng learnng rate) momentum learnng examples B. Newton s method defnton Gauss-Newton

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

Development of whole CORe Thermal Hydraulic analysis code CORTH Pan JunJie, Tang QiFen, Chai XiaoMing, Lu Wei, Liu Dong

Development of whole CORe Thermal Hydraulic analysis code CORTH Pan JunJie, Tang QiFen, Chai XiaoMing, Lu Wei, Liu Dong Deveopment of whoe CORe Therma Hydrauc anayss code CORTH Pan JunJe, Tang QFen, Cha XaoMng, Lu We, Lu Dong cence and technoogy on reactor system desgn technoogy, Nucear Power Insttute of Chna, Chengdu,

More information

Research Article Green s Theorem for Sign Data

Research Article Green s Theorem for Sign Data Internatonal Scholarly Research Network ISRN Appled Mathematcs Volume 2012, Artcle ID 539359, 10 pages do:10.5402/2012/539359 Research Artcle Green s Theorem for Sgn Data Lous M. Houston The Unversty of

More information

Analysis of Bipartite Graph Codes on the Binary Erasure Channel

Analysis of Bipartite Graph Codes on the Binary Erasure Channel Anayss of Bpartte Graph Codes on the Bnary Erasure Channe Arya Mazumdar Department of ECE Unversty of Maryand, Coege Par ema: arya@umdedu Abstract We derve densty evouton equatons for codes on bpartte

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

Greyworld White Balancing with Low Computation Cost for On- Board Video Capturing

Greyworld White Balancing with Low Computation Cost for On- Board Video Capturing reyword Whte aancng wth Low Computaton Cost for On- oard Vdeo Capturng Peng Wu Yuxn Zoe) Lu Hewett-Packard Laboratores Hewett-Packard Co. Pao Ato CA 94304 USA Abstract Whte baancng s a process commony

More information

Chapter - 2. Distribution System Power Flow Analysis

Chapter - 2. Distribution System Power Flow Analysis Chapter - 2 Dstrbuton System Power Flow Analyss CHAPTER - 2 Radal Dstrbuton System Load Flow 2.1 Introducton Load flow s an mportant tool [66] for analyzng electrcal power system network performance. Load

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

18.1 Introduction and Recap

18.1 Introduction and Recap CS787: Advanced Algorthms Scrbe: Pryananda Shenoy and Shjn Kong Lecturer: Shuch Chawla Topc: Streamng Algorthmscontnued) Date: 0/26/2007 We contnue talng about streamng algorthms n ths lecture, ncludng

More information

n-step cycle inequalities: facets for continuous n-mixing set and strong cuts for multi-module capacitated lot-sizing problem

n-step cycle inequalities: facets for continuous n-mixing set and strong cuts for multi-module capacitated lot-sizing problem n-step cyce nequates: facets for contnuous n-mxng set and strong cuts for mut-modue capactated ot-szng probem Mansh Bansa and Kavash Kanfar Department of Industra and Systems Engneerng, Texas A&M Unversty,

More information

Outline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique

Outline and Reading. Dynamic Programming. Dynamic Programming revealed. Computing Fibonacci. The General Dynamic Programming Technique Outlne and Readng Dynamc Programmng The General Technque ( 5.3.2) -1 Knapsac Problem ( 5.3.3) Matrx Chan-Product ( 5.3.1) Dynamc Programmng verson 1.4 1 Dynamc Programmng verson 1.4 2 Dynamc Programmng

More information

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD

Matrix Approximation via Sampling, Subspace Embedding. 1 Solving Linear Systems Using SVD Matrx Approxmaton va Samplng, Subspace Embeddng Lecturer: Anup Rao Scrbe: Rashth Sharma, Peng Zhang 0/01/016 1 Solvng Lnear Systems Usng SVD Two applcatons of SVD have been covered so far. Today we loo

More information

Cyclic Codes BCH Codes

Cyclic Codes BCH Codes Cycc Codes BCH Codes Gaos Feds GF m A Gaos fed of m eements can be obtaned usng the symbos 0,, á, and the eements beng 0,, á, á, á 3 m,... so that fed F* s cosed under mutpcaton wth m eements. The operator

More information

DISTRIBUTED PROCESSING OVER ADAPTIVE NETWORKS. Cassio G. Lopes and Ali H. Sayed

DISTRIBUTED PROCESSING OVER ADAPTIVE NETWORKS. Cassio G. Lopes and Ali H. Sayed DISTRIBUTED PROCESSIG OVER ADAPTIVE ETWORKS Casso G Lopes and A H Sayed Department of Eectrca Engneerng Unversty of Caforna Los Angees, CA, 995 Ema: {casso, sayed@eeucaedu ABSTRACT Dstrbuted adaptve agorthms

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Expected Value and Variance

Expected Value and Variance MATH 38 Expected Value and Varance Dr. Neal, WKU We now shall dscuss how to fnd the average and standard devaton of a random varable X. Expected Value Defnton. The expected value (or average value, or

More information

Polite Water-filling for Weighted Sum-rate Maximization in MIMO B-MAC Networks under. Multiple Linear Constraints

Polite Water-filling for Weighted Sum-rate Maximization in MIMO B-MAC Networks under. Multiple Linear Constraints 2011 IEEE Internatona Symposum on Informaton Theory Proceedngs Pote Water-fng for Weghted Sum-rate Maxmzaton n MIMO B-MAC Networks under Mutpe near Constrants An u 1, Youjan u 2, Vncent K. N. au 3, Hage

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning

A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning A Delay-tolerant Proxmal-Gradent Algorthm for Dstrbuted Learnng Konstantn Mshchenko Franck Iutzeler Jérôme Malck Massh Amn KAUST Unv. Grenoble Alpes CNRS and Unv. Grenoble Alpes Unv. Grenoble Alpes ICML

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

CHAPTER III Neural Networks as Associative Memory

CHAPTER III Neural Networks as Associative Memory CHAPTER III Neural Networs as Assocatve Memory Introducton One of the prmary functons of the bran s assocatve memory. We assocate the faces wth names, letters wth sounds, or we can recognze the people

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

Lecture 4: November 17, Part 1 Single Buffer Management

Lecture 4: November 17, Part 1 Single Buffer Management Lecturer: Ad Rosén Algorthms for the anagement of Networs Fall 2003-2004 Lecture 4: November 7, 2003 Scrbe: Guy Grebla Part Sngle Buffer anagement In the prevous lecture we taled about the Combned Input

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Time-Varying Systems and Computations Lecture 6

Time-Varying Systems and Computations Lecture 6 Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy

More information

Chapter 6. Rotations and Tensors

Chapter 6. Rotations and Tensors Vector Spaces n Physcs 8/6/5 Chapter 6. Rotatons and ensors here s a speca knd of near transformaton whch s used to transforms coordnates from one set of axes to another set of axes (wth the same orgn).

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

Distributed and Stochastic Machine Learning on Big Data

Distributed and Stochastic Machine Learning on Big Data Dstrbuted and Stochastc Machne Learnng on Bg Data Department of Computer Scence and Engneerng Hong Kong Unversty of Scence and Technology Hong Kong Introducton Synchronous ADMM Asynchronous ADMM Stochastc

More information

A MIN-MAX REGRET ROBUST OPTIMIZATION APPROACH FOR LARGE SCALE FULL FACTORIAL SCENARIO DESIGN OF DATA UNCERTAINTY

A MIN-MAX REGRET ROBUST OPTIMIZATION APPROACH FOR LARGE SCALE FULL FACTORIAL SCENARIO DESIGN OF DATA UNCERTAINTY A MIN-MAX REGRET ROBST OPTIMIZATION APPROACH FOR ARGE SCAE F FACTORIA SCENARIO DESIGN OF DATA NCERTAINTY Travat Assavapokee Department of Industra Engneerng, nversty of Houston, Houston, Texas 7704-4008,

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture # 15 Scribe: Jieming Mao April 1, 2013 COS 511: heoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 15 Scrbe: Jemng Mao Aprl 1, 013 1 Bref revew 1.1 Learnng wth expert advce Last tme, we started to talk about learnng wth expert advce.

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Games of Threats. Elon Kohlberg Abraham Neyman. Working Paper

Games of Threats. Elon Kohlberg Abraham Neyman. Working Paper Games of Threats Elon Kohlberg Abraham Neyman Workng Paper 18-023 Games of Threats Elon Kohlberg Harvard Busness School Abraham Neyman The Hebrew Unversty of Jerusalem Workng Paper 18-023 Copyrght 2017

More information

2.3 Nilpotent endomorphisms

2.3 Nilpotent endomorphisms s a block dagonal matrx, wth A Mat dm U (C) In fact, we can assume that B = B 1 B k, wth B an ordered bass of U, and that A = [f U ] B, where f U : U U s the restrcton of f to U 40 23 Nlpotent endomorphsms

More information

Journal of Multivariate Analysis

Journal of Multivariate Analysis Journa of Mutvarate Anayss 3 (04) 74 96 Contents sts avaabe at ScenceDrect Journa of Mutvarate Anayss journa homepage: www.esever.com/ocate/jmva Hgh-dmensona sparse MANOVA T. Tony Ca a, Yn Xa b, a Department

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

QUARTERLY OF APPLIED MATHEMATICS

QUARTERLY OF APPLIED MATHEMATICS QUARTERLY OF APPLIED MATHEMATICS Voume XLI October 983 Number 3 DIAKOPTICS OR TEARING-A MATHEMATICAL APPROACH* By P. W. AITCHISON Unversty of Mantoba Abstract. The method of dakoptcs or tearng was ntroduced

More information

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION European Journa of Mathematcs and Computer Scence Vo. No. 1, 2017 ON AUTOMATC CONTNUTY OF DERVATONS FOR BANACH ALGEBRAS WTH NVOLUTON Mohamed BELAM & Youssef T DL MATC Laboratory Hassan Unversty MORO CCO

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Lecture 4: Universal Hash Functions/Streaming Cont d

Lecture 4: Universal Hash Functions/Streaming Cont d CSE 5: Desgn and Analyss of Algorthms I Sprng 06 Lecture 4: Unversal Hash Functons/Streamng Cont d Lecturer: Shayan Oves Gharan Aprl 6th Scrbe: Jacob Schreber Dsclamer: These notes have not been subjected

More information

Note 2. Ling fong Li. 1 Klein Gordon Equation Probablity interpretation Solutions to Klein-Gordon Equation... 2

Note 2. Ling fong Li. 1 Klein Gordon Equation Probablity interpretation Solutions to Klein-Gordon Equation... 2 Note 2 Lng fong L Contents Ken Gordon Equaton. Probabty nterpretaton......................................2 Soutons to Ken-Gordon Equaton............................... 2 2 Drac Equaton 3 2. Probabty nterpretaton.....................................

More information

Lecture 4: Constant Time SVD Approximation

Lecture 4: Constant Time SVD Approximation Spectral Algorthms and Representatons eb. 17, Mar. 3 and 8, 005 Lecture 4: Constant Tme SVD Approxmaton Lecturer: Santosh Vempala Scrbe: Jangzhuo Chen Ths topc conssts of three lectures 0/17, 03/03, 03/08),

More information

The Entire Solution Path for Support Vector Machine in Positive and Unlabeled Classification 1

The Entire Solution Path for Support Vector Machine in Positive and Unlabeled Classification 1 Abstract The Entre Souton Path for Support Vector Machne n Postve and Unabeed Cassfcaton 1 Yao Lmn, Tang Je, and L Juanz Department of Computer Scence, Tsnghua Unversty 1-308, FIT, Tsnghua Unversty, Bejng,

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel

On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Proceedngs of th European Symposum on Artfca Neura Networks, pp. 25-222, ESANN 2003, Bruges, Begum, 2003 On the Equaty of Kerne AdaTron and Sequenta Mnma Optmzaton n Cassfcaton and Regresson Tasks and

More information

Module 9. Lecture 6. Duality in Assignment Problems

Module 9. Lecture 6. Duality in Assignment Problems Module 9 1 Lecture 6 Dualty n Assgnment Problems In ths lecture we attempt to answer few other mportant questons posed n earler lecture for (AP) and see how some of them can be explaned through the concept

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

A General Column Generation Algorithm Applied to System Reliability Optimization Problems

A General Column Generation Algorithm Applied to System Reliability Optimization Problems A Genera Coumn Generaton Agorthm Apped to System Reabty Optmzaton Probems Lea Za, Davd W. Cot, Department of Industra and Systems Engneerng, Rutgers Unversty, Pscataway, J 08854, USA Abstract A genera

More information