A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning


1 A Delay-tolerant Proximal-Gradient Algorithm for Distributed Learning
Konstantin Mishchenko (KAUST), Franck Iutzeler (Univ. Grenoble Alpes), Jérôme Malick (CNRS and Univ. Grenoble Alpes), Massih Amini (Univ. Grenoble Alpes)
ICML 2018

2 >>> Distributed Learning (CONTEXT)

Global objective (empirical risk minimization over m examples, with individual losses l_j and a regularizer g):

    min_{x in R^d}  (1/m) Σ_{j=1}^{m} l_j(x) + g(x)

Local data: the examples are split into M data blocks S_1, ..., S_M stored locally, so the problem rewrites as

    min_{x in R^d}  Σ_{i=1}^{M} π_i f_i(x) + g(x)

with local functions f_i(x) = (1/|S_i|) Σ_{j in S_i} l_j(x) and proportions π_i = |S_i| / m.

Problem: large-sum minimization vs. mid-sized distributed optimization.
Optimization: variance-reduced stochastic gradient vs. this presentation.
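For concreteness, here is a minimal NumPy sketch of this setup with logistic losses and an elastic-net regularizer (the objective used in the experiments later); the data layout, function names, and parameters are illustrative, not taken from the talk.

    import numpy as np

    def local_loss_grad(x, Z, y):
        # Gradient of f_i(x) = (1/|S_i|) sum_j log(1 + exp(-y_j z_j^T x)) over one data block (Z, y).
        sigma = 1.0 / (1.0 + np.exp(y * (Z @ x)))      # sigma_j = 1 / (1 + exp(y_j z_j^T x))
        return -(Z.T @ (y * sigma)) / len(y)

    def prox_elastic_net(v, gamma, lam1, lam2):
        # prox_{gamma g}(v) for g(x) = lam1 ||x||_1 + (lam2 / 2) ||x||_2^2:
        # soft-threshold by gamma*lam1, then shrink by 1 / (1 + gamma*lam2).
        u = np.sign(v) * np.maximum(np.abs(v) - gamma * lam1, 0.0)
        return u / (1.0 + gamma * lam2)

    def global_objective(x, blocks, lam1, lam2):
        # sum_i pi_i f_i(x) + g(x), with pi_i = |S_i| / m and blocks = [(Z_1, y_1), ..., (Z_M, y_M)].
        m = sum(len(y) for _, y in blocks)
        risk = sum(np.log1p(np.exp(-y * (Z @ x))).sum() for Z, y in blocks) / m
        return risk + lam1 * np.abs(x).sum() + 0.5 * lam2 * (x @ x)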

3 DISTRIBUTED OPTIMIZATION | ASYNCHRONISM | SCARCE COMMUNICATIONS | CONCLUSION

4 >>> Distributed Proximal Gradient (DISTRIBUTED OPTIMIZATION)

Problem: min_x Σ_{i=1}^{M} π_i f_i(x) + g(x)

Direct extension of the proximal gradient in a master/worker setup over f_1, ..., f_M:
- Map: each worker i updates its local variable x_i^{k+1/2} = x^k - γ ∇f_i(x^k), for all i = 1, ..., M.
- Reduce: the master gathers the local variables, Σ_{i=1}^{M} π_i x_i^{k+1/2}, performs a proximity operation, and broadcasts the result, so that x_1^{k+1} = ... = x_M^{k+1} = prox_{γg}(Σ_{i=1}^{M} π_i x_i^{k+1/2}).

Master:
    Initialize x̄ = x^0
    while not converged do
        when all workers have finished:
            receive x_i from each of them
            x̄ ← Σ_{i=1}^{M} π_i x_i
            broadcast x̄ to all agents
            k ← k + 1
    Interrupt all slaves
    Output x̄

Worker i:
    Initialize x_i = x̄
    while not interrupted by master do
        receive the most recent x̄
        z ← prox_{γg}(x̄)
        x_i ← z - γ ∇f_i(z)
        send x_i to the master

with f_i(x) = (1/|S_i|) Σ_{j in S_i} l_j(x).
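As a sanity check of this synchronous scheme, a short sequential simulation (reusing the helper functions sketched above; the step size, iteration count, and data layout are again illustrative):

    def synchronous_prox_grad(blocks, x0, gamma, lam1, lam2, n_iters=200):
        # Simulate the synchronous distributed proximal gradient on blocks = [(Z_i, y_i)].
        m = sum(len(y) for _, y in blocks)
        pis = [len(y) / m for _, y in blocks]              # pi_i = |S_i| / m
        x_bar = x0.copy()
        for _ in range(n_iters):
            # Map: every worker computes x_i = z - gamma * grad f_i(z), with z = prox_{gamma g}(x_bar)
            locals_ = []
            for Z, y in blocks:
                z = prox_elastic_net(x_bar, gamma, lam1, lam2)
                locals_.append(z - gamma * local_loss_grad(z, Z, y))
            # Reduce: the master averages with weights pi_i and broadcasts the result
            x_bar = sum(pi * x_i for pi, x_i in zip(pis, locals_))
        return prox_elastic_net(x_bar, gamma, lam1, lam2)   # converging variable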

5 >>> Convergence and rate (DISTRIBUTED OPTIMIZATION)

(Same Distributed Proximal Gradient as on the previous slide: the master waits for all workers, averages their x_i, and broadcasts x̄; each worker i computes z = prox_{γg}(x̄) and x_i = z - γ ∇f_i(z), with f_i(x) = (1/|S_i|) Σ_{j in S_i} l_j(x).)

Define time k as the number of master updates; x^k is the value of variable x at time k.

Theorem. Let each f_i be L-smooth and μ-strongly convex. Then, for γ in (0, 2/(μ + L)],

    ||x^k - x*||^2 ≤ (1 - α)^k ||x^0 - x*||^2

where x* is the unique minimizer of min_x Σ_{i=1}^{M} π_i f_i(x) + g(x) and α = 2γμL/(μ + L) in (0, 1].

Proof: it is exactly proximal gradient descent.
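As a worked instance of this rate (the numbers are illustrative, not from the talk): with μ = 1, L = 10 and the largest admissible step size γ = 2/(μ + L) = 2/11,

    \alpha = \frac{2\gamma\mu L}{\mu + L} = \frac{2 \cdot \tfrac{2}{11} \cdot 1 \cdot 10}{11} = \frac{40}{121} \approx 0.33,
    \qquad
    \|x^k - x^\star\|^2 \le \left(\tfrac{81}{121}\right)^{k} \|x^0 - x^\star\|^2,

which matches the classical proximal-gradient contraction \left(\frac{L-\mu}{L+\mu}\right)^2 = \left(\frac{9}{11}\right)^2 = \frac{81}{121} for this step size.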

6 >>> Two Limitations (DISTRIBUTED OPTIMIZATION)

Synchronism: the master waits for all workers at each time. (image: W. Yin)
Local updates may be: fast (or not, depending on |S_i|), late (or not, depending on the machine's state), costly (often).

Communications: sending may be more costly than computing a gradient.
Context: Federated Learning, distributed data. (image: Google AI)

We provide an efficient Distributed Proximal Gradient algorithm:
- asynchronous, delay-tolerant
- scarcer communications, with a computation/communication tradeoff

7 DISTRIBUTED OPTIMIZATION | ASYNCHRONISM | SCARCE COMMUNICATIONS | CONCLUSION

8 >>> Asynchronous Master-Slave Framework (ASYNCHRONISM)

Master update: x̄^k = x̄^{k-1} + Δ^k, where the contribution Δ^k comes from the single worker i = i(k) that finishes at time k; each worker i runs (∇f_i, prox_{γg}) on its local data.

Iteration = receive from a worker + master update + send back; time k = number of iterations.
Delay d_i^k = time since the last exchange with worker i: d_i^k = 0 iff i updates at time k, and d_i^k = d_i^{k-1} + 1 otherwise.
Second delay D_i^k = time since the penultimate exchange with worker i.
(The slide's timelines show, from the viewpoints of i(k) and of another worker j, the times k - d_i^k and k - D_i^k.)

Algorithm = global communication scheme (what is Δ^k?) + local optimization method (what is x_i?).
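To make the two delays concrete, here is a small illustrative bookkeeping helper (names and conventions are mine): given which worker updates at each iteration, it returns, for every k and i, the time since the last and since the penultimate exchange with worker i.

    def delays(update_trace, M):
        # update_trace[k] = index of the worker whose contribution arrives at iteration k.
        last, penultimate = [None] * M, [None] * M
        d, D = [], []
        for k, i in enumerate(update_trace):
            penultimate[i], last[i] = last[i], k
            d.append([k - last[j] if last[j] is not None else None for j in range(M)])
            D.append([k - penultimate[j] if penultimate[j] is not None else None for j in range(M)])
        return d, D

    # Example with 3 workers, worker 2 updating rarely:
    # d, D = delays([0, 1, 0, 1, 2, 0, 1, 0, 2], M=3)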

9 >>> Communication scheme (ASYNCHRONISM)

DAve communication scheme: the master variable x̄^k is a combination of the workers' last contributions x_i^{k-d_i^k}. One update per time = one worker contribution, but all workers are always involved at the master:

    x̄^k = x̄^{k-1} + Δ^k   with   Δ^k = π_i (x_i^k - x_i^{k-D_i^k})   for i = i(k),

i.e.

    x̄^k = Σ_{i=1}^{M} π_i x_i^{k-d_i^k},

each contribution x_i^{k-d_i^k} having been computed by worker i from the master point x̄^{k-D_i^k} received at its previous exchange.

PG proximal gradient optimization method: one step of proximal gradient on the regularizer g and the local loss f_i = (1/|S_i|) Σ_{j in S_i} l_j:

    z ← prox_{γg}(x̄)
    x_i ← z - γ ∇f_i(z)
    Δ ← π_i (x_i - x_i^prev)
    x_i^prev ← x_i
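A small sketch of the master side of this scheme (illustrative; the class and method names are mine): the master never stores the individual x_i, only the running combination, and each incoming adjustment keeps it equal to the weighted sum of the last contributions.

    import numpy as np

    class DAveMaster:
        def __init__(self, x0):
            self.x_bar = x0.copy()            # x_bar^0 = sum_i pi_i x_i^0 when all workers start at x0
        def apply_adjustment(self, delta):
            self.x_bar = self.x_bar + delta   # x_bar^k = x_bar^{k-1} + pi_i (new x_i - previous x_i)
            return self.x_bar                 # sent back to the worker that just contributed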

10 >>> DAve-PG (ASYNCHRONISM)

Master:
    Initialize x̄
    while not converged do
        when a worker finishes:
            receive the adjustment Δ from it
            x̄ ← x̄ + Δ
            send x̄ to that agent in return
            k ← k + 1
    Interrupt all slaves
    Output x = prox_{γg}(x̄)

Worker i:
    Initialize x_i = x_i^prev = x̄
    while not interrupted by master do
        receive the most recent x̄
        z ← prox_{γg}(x̄)
        x_i ← z - γ ∇f_i(z)
        Δ ← π_i (x_i - x_i^prev)
        x_i^prev ← x_i
        send the adjustment Δ to the master

with f_i(x) = (1/|S_i|) Σ_{j in S_i} l_j(x).

In practice: MPI blocking Send and Receive; no computation or storage at the master; x = prox_{γg}(x̄) is the converging variable.
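The asynchronous behaviour can be emulated sequentially by fixing which worker finishes at each iteration; a minimal sketch reusing the helpers above (the schedule, step size, and initialization are illustrative, and a real deployment would use blocking MPI messages instead):

    def dave_pg(blocks, x0, gamma, lam1, lam2, schedule):
        # schedule[k] = worker whose adjustment reaches the master at iteration k.
        m = sum(len(y) for _, y in blocks)
        pis = [len(y) / m for _, y in blocks]
        x_bar = x0.copy()
        received = [x0.copy() for _ in blocks]     # last master point seen by each worker
        x_prev = [x0.copy() for _ in blocks]       # last contribution of each worker
        for i in schedule:
            Z, y = blocks[i]
            z = prox_elastic_net(received[i], gamma, lam1, lam2)
            x_i = z - gamma * local_loss_grad(z, Z, y)
            x_bar = x_bar + pis[i] * (x_i - x_prev[i])   # master applies the adjustment
            x_prev[i] = x_i
            received[i] = x_bar.copy()                   # master sends the new x_bar back to worker i
        return prox_elastic_net(x_bar, gamma, lam1, lam2)

    # e.g. schedule = [0, 1, 0, 1, 2, 0, 1, 0, 1, 2] mimics a worker 2 that is much slower than the others.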

11 >>> Comparison with other combinations (ASYNCHRONISM)

DAve-PG combines iterates:   x^k = prox_{γg}(Σ_{i=1}^{M} π_i x^{k-D_i^k} - γ Σ_{i=1}^{M} π_i ∇f_i(x^{k-D_i^k}))
PIAG combines gradients:     x^k = prox_{γg}(x^{k-1} - γ Σ_{i=1}^{M} π_i ∇f_i(x^{k-D_i^k}))

Combining iterates is more stable than combining gradients.

Example: 2D quadratic functions on 5 workers, but one worker 10x slower than the others. The step size γ of PIAG is 10x smaller due to the delays, while the one of DAve-PG stays the same (to be detailed later). DAve-PG is less chaotic and faster than PIAG.

- A. Aytekin, H. Feyzmahdavian, and M. Johansson, Analysis and implementation of an asynchronous optimization algorithm for the parameter server, arXiv preprint.
- N. Vanli, M. Gurbuzbalaban, and A. Ozdaglar, A stronger convergence result on the proximal incremental aggregated gradient method, arXiv preprint.

12 >>> Analysis (ASYNCHRONISM)

Revisiting the clock: the epoch sequence (k_m) is defined recursively by k_0 = 0 and

    k_{m+1} = min{k : each worker made at least 2 updates on the interval [k_m, k]}
            = min{k : k - D_i^k ≥ k_m for all i = 1, ..., M}.

Epoch time m = number of epochs.

Intuition: k_{m+1} is the first moment when x^k no longer depends directly on information prior to k_m, since

    x^k = Σ_{i=1}^{M} π_i x^{k-D_i^k} - γ Σ_{i=1}^{M} π_i ∇f_i(x^{k-D_i^k}).

Theorem. Let each f_i be L-smooth and μ-strongly convex. Then, for γ in (0, 2/(μ + L)] and all k ≥ k_m,

    ||x^k - x*||^2 ≤ (1 - α)^m ||x^0 - x*||^2

where x* is the unique minimizer of min_x Σ_{i=1}^{M} π_i f_i(x) + g(x) and α = 2γμL/(μ + L).

Exact same result as the synchronous case, but over the epoch time m, not the iteration time k.
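The epoch clock is straightforward to compute from an update trace; the following illustrative helper (names and the inclusive-interval convention are mine) returns the k_m using the characterization "every worker made at least 2 updates since the previous epoch boundary":

    def epoch_boundaries(update_trace, M):
        boundaries = [0]
        counts = [0] * M
        for k, i in enumerate(update_trace):
            counts[i] += 1
            if all(c >= 2 for c in counts):
                boundaries.append(k)      # k_{m+1} = first k with two updates from every worker
                counts = [0] * M
                counts[i] = 1             # the update at k also opens the next interval [k, ...]
        return boundaries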

13 >>> Performances (ASYNCHRONISM)

Logistic regression with elastic net:
    (1/m) Σ_{j=1}^{m} log(1 + exp(-y_j z_j^T x)) + λ_1 ||x||_1 + (λ_2/2) ||x||_2^2

Setup: machines (1 CPU, 1 GB each) in a cluster; 10% of the data on machine one, the rest spread evenly.

[Plots: suboptimality vs. wall-clock time (s) on the RCV1 and URL datasets, comparing DAve-PG, synchronous PG, and PIAG.]

14 DISTRIBUTED OPTIMIZATION | ASYNCHRONISM | SCARCE COMMUNICATIONS | CONCLUSION

15 >>> More local computation (SCARCE COMMUNICATIONS)

To exchange less, a solution is to compute more.

DAve-RPG

Master:
    Initialize x̄
    while not converged do
        when a worker finishes:
            receive the adjustment Δ from it
            x̄ ← x̄ + Δ
            send x̄ to that agent in return
            k ← k + 1
    Interrupt all slaves
    Output x = prox_{γg}(x̄)

Worker i:
    Initialize x_i = x_i^prev = x̄
    while not interrupted by master do
        receive the most recent x̄
        select a number of repetitions p
        initialize Δ = 0
        for q = 1 to p do
            z ← prox_{γg}(x̄ + Δ)
            x_i ← z - γ ∇f_i(z)
            Δ ← Δ + π_i (x_i - x_i^prev)
            x_i^prev ← x_i
        send the adjustment Δ to the master

with f_i(x) = (1/|S_i|) Σ_{j in S_i} l_j(x).

Difference with before: at each local step, the worker performs p proximal gradient steps. This gives a controlled rate improvement governed by the maximal number of repetitions p in the epoch [rate factor involving Σ_{q=1}^{p} (1 - γμ)^{q-1} and min_i π_i], but the epochs become longer.
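A sketch of the corresponding worker-side local loop (reusing the helpers above; names are illustrative):

    import numpy as np

    def rpg_worker_step(x_bar, x_prev, pi_i, Z, y, gamma, lam1, lam2, p):
        # p repeated proximal-gradient passes on the local loss, accumulating the adjustment.
        delta = np.zeros_like(x_bar)
        for _ in range(p):
            z = prox_elastic_net(x_bar + delta, gamma, lam1, lam2)
            x_i = z - gamma * local_loss_grad(z, Z, y)
            delta += pi_i * (x_i - x_prev)
            x_prev = x_i
        return delta, x_prev     # delta is sent to the master; x_prev is kept locally

Only one message per p local steps is exchanged with the master, which is the communication saving this slide is after.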

16 >>> Performance (SCARCE COMMUNICATIONS)

Logistic regression with elastic net:
    (1/m) Σ_{j=1}^{m} log(1 + exp(-y_j z_j^T x)) + λ_1 ||x||_1 + (λ_2/2) ||x||_2^2

Setup: machines (1 CPU, 1 GB each) in a cluster; 10% of the data on machine one, the rest spread evenly.

[Plot: suboptimality vs. wall-clock time (s) for p = 1, 4, 7, 10.]

There is a compromise to find, but p can be changed without restrictions.

17 DISTRIBUTED OPTIMIZATION | ASYNCHRONISM | SCARCE COMMUNICATIONS | CONCLUSION

18 >>> Summary & Perspectives (CONCLUSION)

Distributed Delay-Tolerant Proximal Gradient Algorithm:
- simple to implement
- adaptable to the performance/computation compromise
- general, adaptable epoch analysis

Poster #155.

Future works:
- sparse communications
- using identification to control the communications

Thank you! Franck Iutzeler
