Distributed and Stochastic Machine Learning on Big Data

1 Distributed and Stochastic Machine Learning on Big Data Department of Computer Science and Engineering Hong Kong University of Science and Technology Hong Kong

2 Introduction Synchronous ADMM Asynchronous ADMM Stochastic ADMM Conclusion Big Data humans and machines are creating tons of data every day

3 Introduction Synchronous ADMM Asynchronous ADMM Stochastic ADMM Conclusion Big Data humans and machines are creating tons of data every day some statistics Facebook: more than 12 terabytes of data every day the world created about 1.8 zettabytes of data in 2011 by 2020, 35 zettabytes of data across the globe

4 Hardware Platform single machine

5 Hardware Platform single machine data sets can be too large to be processed / stored on one single computer

6 Hardware Platform single machine data sets can be too large to be processed / stored on one single computer distributed processing: e.g., Google's Sibyl 100B samples and 100B features based on MapReduce

7 Hardware Platform single machine data sets can be too large to be processed / stored on one single computer distributed processing: e.g., Google's Sibyl 100B samples and 100B features based on MapReduce machine learning algorithms are often iterative, not very well suited for MapReduce

8 Distributed Architectures 1 shared memory: variables stored in a shared address space e.g., servers, high-end workstations, multicore processors

9 Distributed Architectures 1 shared memory: variables stored in a shared address space e.g., servers, high-end workstations, multicore processors data can be conveniently accessed; less scalable

10 Distributed Architectures 1 shared memory: variables stored in a shared address space e.g., servers, high-end workstations, multicore processors data can be conveniently accessed; less scalable 2 distributed memory: each node has its own memory, nodes connected by a high-speed communication network

11 Distributed Architectures 1 shared memory: variables stored in a shared address space e.g., servers, high-end workstations, multicore processors data can be conveniently accessed; less scalable 2 distributed memory: each node has its own memory, nodes connected by a high-speed communication network scalable; need to distribute/collect information to/from the nodes

12 Example ML Scenario supervised learning data set D = {(x_1, y_1), ..., (x_n, y_n)} linear model y = w^T x loss for sample i: l_i(w) = (y_i - w^T x_i)^2 minimize training error: min_w Σ_{i=1}^n l_i(w)

13 Example ML Scenario supervised learning data set D = {(x_1, y_1), ..., (x_n, y_n)} linear model y = w^T x loss for sample i: l_i(w) = (y_i - w^T x_i)^2 minimize training error: min_w Σ_{i=1}^n l_i(w) n is big, how to solve this with multiple machines?

14 Example ML Scenario supervised learning data set D = {(x_1, y_1), ..., (x_n, y_n)} linear model y = w^T x loss for sample i: l_i(w) = (y_i - w^T x_i)^2 minimize training error: min_w Σ_{i=1}^n l_i(w) n is big, how to solve this with multiple machines? split the n samples over N machines: D → D_1, D_2, ..., D_N

15 Example ML Scenario supervised learning data set D = {(x_1, y_1), ..., (x_n, y_n)} linear model y = w^T x loss for sample i: l_i(w) = (y_i - w^T x_i)^2 minimize training error: min_w Σ_{i=1}^n l_i(w) n is big, how to solve this with multiple machines? split the n samples over N machines: D → D_1, D_2, ..., D_N Σ_{i=1}^n l_i(w) = Σ_{i∈D_1} l_i(w) [= f_1(w)] + ... + Σ_{i∈D_N} l_i(w) [= f_N(w)] minimize one f_i on each machine (a minimal sketch follows below)
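
As a concrete illustration of the split above, here is a minimal NumPy sketch (the array names, sizes and the number of machines are assumptions for illustration, not from the slides) of partitioning the n samples and evaluating the per-machine losses f_i(w):

import numpy as np

def split_data(X, y, N):
    """Split (X, y) into N roughly equal subsets D_1, ..., D_N, one per machine."""
    parts = np.array_split(np.arange(len(y)), N)
    return [(X[idx], y[idx]) for idx in parts]

def f_i(w, X_i, y_i):
    """Per-machine loss: sum of squared errors (y - w^T x)^2 over the local subset."""
    return np.sum((y_i - X_i @ w) ** 2)

# toy usage with assumed sizes
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)
subsets = split_data(X, y, N=4)
w = np.zeros(20)
total_loss = sum(f_i(w, Xi, yi) for Xi, yi in subsets)   # equals the full training error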

16 Distributed Consensus Optimization min_x Σ_i f_i(x) → min_{x_1,...,x_N} Σ_i f_i(x_i) x_i: node i's local copy of x to be learned

17 Distributed Consensus Optimization min_x Σ_i f_i(x) → min_{x_1,...,x_N,z} Σ_i f_i(x_i) s.t. x_1 = ... = x_N = z x_i: node i's local copy of x to be learned z: consensus variable how to ensure a consensus?

18 Alternating Direction Method of Multipliers (ADMM) min_{x,z} φ(x) + ψ(z) (both convex) s.t. Ax + Bz = c (constraint linking x and z) φ, ψ: convex functions A, B: constant matrices; c: constant vector

19 Alternating Direction Method of Multipliers (ADMM) augmented Lagrangian min_{x,z} φ(x) + ψ(z) s.t. Ax + Bz = c L(x, z, λ) = φ(x) + ψ(z) + λ^T(Ax + Bz - c) + (β/2)||Ax + Bz - c||^2 λ: Lagrangian multipliers; β > 0: penalty parameter

20 Alternating Direction Method of Multipliers (ADMM) augmented Lagrangian min_{x,z} φ(x) + ψ(z) s.t. Ax + Bz = c L(x, z, λ) = φ(x) + ψ(z) + λ^T(Ax + Bz - c) + (β/2)||Ax + Bz - c||^2 λ: Lagrangian multipliers; β > 0: penalty parameter minimize L in an alternating manner x^{t+1} ← arg min_x L(x, z^t, λ^t) z^{t+1} ← arg min_z L(x^{t+1}, z, λ^t) λ^{t+1} ← λ^t + β(Ax^{t+1} + Bz^{t+1} - c)

21 Alternating Direction Method of Multipliers (ADMM) augmented Lagrangian min_{x,z} φ(x) + ψ(z) s.t. Ax + Bz = c L(x, z, λ) = φ(x) + ψ(z) + λ^T(Ax + Bz - c) + (β/2)||Ax + Bz - c||^2 λ: Lagrangian multipliers; β > 0: penalty parameter minimize L in an alternating manner x^{t+1} ← arg min_x L(x, z^t, λ^t) z^{t+1} ← arg min_z L(x^{t+1}, z, λ^t) λ^{t+1} ← λ^t + β(Ax^{t+1} + Bz^{t+1} - c) updates of x and z are often easier
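
To make the alternating updates concrete, a minimal sketch of the generic ADMM loop is given below; the subproblem solvers argmin_x and argmin_z are user-supplied placeholders (assumptions), while the dual step follows the λ-update on the slide:

import numpy as np

def admm(argmin_x, argmin_z, A, B, c, beta, x0, z0, iters=100):
    """Generic ADMM: alternate the x-, z-, and lambda-updates."""
    x, z = x0, z0
    lam = np.zeros_like(c)
    for _ in range(iters):
        x = argmin_x(z, lam)                    # x^{t+1} = argmin_x L(x, z^t, lam^t)
        z = argmin_z(x, lam)                    # z^{t+1} = argmin_z L(x^{t+1}, z, lam^t)
        lam = lam + beta * (A @ x + B @ z - c)  # lam^{t+1} = lam^t + beta (A x + B z - c)
    return x, z, lam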

22 Consensus Optimization using ADMM min_{x_1,...,x_N,z} Σ_i f_i(x_i) s.t. x_1 = ... = x_N = z augmented Lagrangian L = Σ_{i=1}^N [ f_i(x_i) + λ_i^T(x_i - z) + (β/2)||x_i - z||^2 ] minimize L in an alternating manner x_i^{k+1} ← arg min_{x_i} L(x, z^k, λ^k), i = 1, ..., N z^{k+1} ← arg min_z L(x^{k+1}, z, λ^k) λ_i^{k+1} ← λ_i^k + β(x_i^{k+1} - z^{k+1}), i = 1, ..., N

23 Consensus Optimization using ADMM min_{x_1,...,x_N,z} Σ_i f_i(x_i) s.t. x_1 = ... = x_N = z augmented Lagrangian L = Σ_{i=1}^N [ f_i(x_i) + λ_i^T(x_i - z) + (β/2)||x_i - z||^2 ] minimize L in an alternating manner x_i^{k+1} ← arg min_{x_i} L(x, z^k, λ^k), i = 1, ..., N z^{k+1} ← arg min_z L(x^{k+1}, z, λ^k) λ_i^{k+1} ← λ_i^k + β(x_i^{k+1} - z^{k+1}), i = 1, ..., N distributed consensus optimization: each worker i updates x_i, λ_i; master updates z

24 Distributed Implementation worker i: x_i^{k+1} ← arg min_x f_i(x) + (λ_i^k)^T x + (β/2)||x - z^k||^2 (last term: difference with the consensus) update in parallel using the local data subset

25 Distributed Implementation worker i: x_i^{k+1} ← arg min_x f_i(x) + (λ_i^k)^T x + (β/2)||x - z^k||^2 sends x_i^{k+1} to the master

26 Distributed Implementation worker i: x_i^{k+1} ← arg min_x f_i(x) + (λ_i^k)^T x + (β/2)||x - z^k||^2 master: z^{k+1} ← (1/N) Σ_{i=1}^N x_i^{k+1} recompute the consensus

27 Distributed Implementation worker i: x_i^{k+1} ← arg min_x f_i(x) + (λ_i^k)^T x + (β/2)||x - z^k||^2 master: z^{k+1} ← (1/N) Σ_{i=1}^N x_i^{k+1} distributes z^{k+1} back to all the workers

28 Distributed Implementation worker i: x_i^{k+1} ← arg min_x f_i(x) + (λ_i^k)^T x + (β/2)||x - z^k||^2 master: z^{k+1} ← (1/N) Σ_{i=1}^N x_i^{k+1} worker i: λ_i^{k+1} ← λ_i^k + β(x_i^{k+1} - z^{k+1})
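
Putting the four steps together, a minimal single-process simulation of synchronous consensus ADMM is sketched below, assuming squared losses so the worker subproblem has a closed form; the data layout and parameter values are illustrative assumptions:

import numpy as np

def consensus_admm(parts, d, beta=1.0, iters=50):
    """Synchronous consensus ADMM for least squares split over N workers."""
    N = len(parts)
    x = [np.zeros(d) for _ in range(N)]
    lam = [np.zeros(d) for _ in range(N)]
    z = np.zeros(d)
    for _ in range(iters):
        # worker i: x_i <- argmin f_i(x) + lam_i^T x + beta/2 ||x - z||^2 (closed form here)
        for i, (Xi, yi) in enumerate(parts):
            H = 2 * Xi.T @ Xi + beta * np.eye(d)
            x[i] = np.linalg.solve(H, 2 * Xi.T @ yi - lam[i] + beta * z)
        # master: z <- average of the x_i, then broadcast
        z = np.mean(x, axis=0)
        # worker i: dual update
        for i in range(N):
            lam[i] = lam[i] + beta * (x[i] - z)
    return z

# toy usage with assumed data
rng = np.random.default_rng(0)
parts = [(rng.normal(size=(50, 5)), rng.normal(size=50)) for _ in range(4)]
z = consensus_admm(parts, d=5)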

29 Problem updates have to be synchronized master needs to wait for the x_i updates from all workers

30 Problem updates have to be synchronized master needs to wait for the x_i updates from all workers workers have different delays (processing speeds, network delays, etc.) has to wait for the slowest worker

31 Distributed Asynchronous ADMM: Worker x update: x_i^{k+1} ← arg min_x f_i(x) + (λ_i^k)^T x + (β/2)||x - z^k||^2

32 Distributed Asynchronous ADMM: Worker x update: x_i^{k+1} ← arg min_x f_i(x) + (λ_i^k)^T x + (β/2)||x - z^k||^2 → x_i^{k_i+1} ← arg min_x f_i(x) + (λ_i^{k_i})^T x + (β/2)||x - z^{k_i}||^2 master and each worker keep independent clocks

33 Distributed Asynchronous ADMM: Worker x update: x_i^{k+1} ← arg min_x f_i(x) + (λ_i^k)^T x + (β/2)||x - z^k||^2 → x_i^{k_i+1} ← arg min_x f_i(x) + (λ_i^{k_i})^T x + (β/2)||x - z^{k_i}||^2 → x_i^{k_i+1} = arg min_x f_i(x) + (λ_i^{k_i})^T x + (β/2)||x - z_i||^2 z_i: most recent z value received from the master in general, the z_i's are different (workers have different speeds)

34 Distributed Asynchronous ADMM: Worker x update: x_i^{k_i+1} ← arg min_x f_i(x) + (λ_i^{k_i})^T x + (β/2)||x - z_i||^2 λ update: λ_i^{k+1} ← λ_i^k + β(x_i^{k+1} - z^{k+1})

35 Distributed Asynchronous ADMM: Worker x update: x_i^{k_i+1} ← arg min_x f_i(x) + (λ_i^{k_i})^T x + (β/2)||x - z_i||^2 λ update: λ_i^{k+1} ← λ_i^k + β(x_i^{k+1} - z^{k+1}) → λ_i^{k_i+1} ← λ_i^{k_i} + β(x_i^{k_i+1} - z_i)
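
A minimal sketch of one worker's loop in the asynchronous variant is given below; recv_z and send_to_master stand in for whatever messaging layer is used (e.g. MPI) and, together with the inner solver solve_x, are assumptions made for illustration:

def async_worker(solve_x, beta, x, lam, recv_z, send_to_master):
    """One asynchronous worker: use the most recent z received from the master,
    keep its own iteration counter, then push (x_i, lam_i) back to the master."""
    while True:
        z = recv_z()                   # most recent z value from the master
        if z is None:                  # assumed stop signal
            break
        x = solve_x(z, lam)            # x_i <- argmin f_i(x) + lam_i^T x + beta/2 ||x - z_i||^2
        lam = lam + beta * (x - z)     # lam_i <- lam_i + beta (x_i - z_i)
        send_to_master(x, lam)
    return x, lam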

36 Distributed Asynchronous ADMM: Master master needs to wait for all N worker updates z^{k+1} ← (1/N) Σ_{i=1}^N x_i^{k+1}

37 Distributed Asynchronous ADMM: Master master needs to wait for all N worker updates z^{k+1} ← (1/N) Σ_{i=1}^N x_i^{k+1}

38 Distributed Asynchronous ADMM: Master master only needs to wait for a minimum of S worker updates synchronous ADMM: S = N z^{k+1} ← (1/N) Σ_{i=1}^N (x_i^{k+1} + λ_i^k/β)

39 Distributed Asynchronous ADMM: Master master only needs to wait for a minimum of S worker updates synchronous ADMM: S = N z^{k+1} ← (1/N) Σ_{i=1}^N (x_i^{k+1} + λ_i^k/β) S can be much smaller than N (partial barrier) z^{k+1} ← (1/N) Σ_{i=1}^N (x̂_i + λ̂_i/β) (x̂_i, λ̂_i): most recent (x_i, λ_i) received from worker i update is still based on all the {(x̂_i, λ̂_i)}_{i=1}^N (some may not be very fresh)

40 Distributed Asynchronous ADMM: Master master only needs to wait for a minimum of S worker updates synchronous ADMM: S = N z^{k+1} ← (1/N) Σ_{i=1}^N (x_i^{k+1} + λ_i^k/β) S can be much smaller than N (partial barrier) z^{k+1} ← (1/N) Σ_{i=1}^N (x̂_i + λ̂_i/β) (x̂_i, λ̂_i): most recent (x_i, λ_i) received from worker i update is still based on all the {(x̂_i, λ̂_i)}_{i=1}^N (some may not be very fresh) master sends the updated z^{k+1} back to (only) those workers whose (x_i, λ_i) have been processed

41 Slow Workers some (x̂_i, λ̂_i)'s may not be very fresh faster workers → more updates slow workers → fewer updates, can be very outdated

42 Slow Workers some (x̂_i, λ̂_i)'s may not be very fresh faster workers → more updates slow workers → fewer updates, can be very outdated how to ensure sufficient freshness of all updates?

43 Slow Workers some (x̂_i, λ̂_i)'s may not be very fresh faster workers → more updates slow workers → fewer updates, can be very outdated how to ensure sufficient freshness of all updates?

44 Bounded Delay every worker has to be updated at least once every τ iterations (x_i, λ_i) update can be at most τ clock cycles old τ = 1 reduces back to synchronous ADMM
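
A minimal sketch of the master side, combining the partial barrier S with the bounded delay τ, is given below; the messaging helpers recv_updates and send_z, the termination rule, and the exact bookkeeping of the delay counters are assumptions made for illustration:

import numpy as np

def async_master(N, d, beta, S, tau, recv_updates, send_z, max_iter=1000):
    """Asynchronous master: wait for at least S worker updates per iteration, and
    never let any worker's stored (x_i, lam_i) become more than tau iterations old."""
    x_hat = [np.zeros(d) for _ in range(N)]
    lam_hat = [np.zeros(d) for _ in range(N)]
    age = [0] * N                                  # delay counter per worker
    z = np.zeros(d)
    for k in range(max_iter):
        arrived = set()
        # partial barrier: at least S fresh updates, plus any worker about to exceed tau
        while len(arrived) < S or any(age[i] >= tau for i in range(N) if i not in arrived):
            for i, xi, li in recv_updates():       # blocking receive of worker messages
                x_hat[i], lam_hat[i] = xi, li
                arrived.add(i)
        # z-update uses all stored (x_hat_i, lam_hat_i), fresh or not
        z = np.mean([x_hat[i] + lam_hat[i] / beta for i in range(N)], axis=0)
        for i in range(N):
            age[i] = 0 if i in arrived else age[i] + 1
        send_z(z, arrived)                         # send z back only to the processed workers
    return z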

45 Example (S = 2, τ = 3) delay counter: how old the worker update is

46 Example (S = 2, τ = 3) master iteration 1: 2 worker updates arrive

47 Example (S = 2, τ = 3) master iteration 2: 4 worker updates arrive

48 Example (S = 2, τ = 3) master iteration 3: 3 worker updates arrive

49 Example (S = 2, τ = 3) master waits... until update from worker 3 arrives

50 Example (S = 2, τ = 3) master waits... until update from worker 3 arrives partial barrier S speeds up the algorithm bounded delay τ guarantees convergence to the globally optimal solution

51 Convergence min_{x_1,...,x_N,z} Σ_i f_i(x_i) s.t. x_1 = ... = x_N = z Theorem: after T master iterations, E[ Σ_i ( f_i(x̄_i) - f_i(x*) (difference with the optimal objective) + λ*^T (x̄_i - z̄) (difference with the consensus variable) ) ] ≤ (Nτ/(TS)) ( β ||z^0 - z*||^2 + (1/β) ||λ^0 - λ*||^2 ) x̄_i: average x_i for node i during the iterations; z̄: average z

52 Convergence min_{x_1,...,x_N,z} Σ_i f_i(x_i) s.t. x_1 = ... = x_N = z Theorem: after T master iterations, E[ Σ_i ( f_i(x̄_i) - f_i(x*) (difference with the optimal objective) + λ*^T (x̄_i - z̄) (difference with the consensus variable) ) ] ≤ (Nτ/(TS)) ( β ||z^0 - z*||^2 + (1/β) ||λ^0 - λ*||^2 ) x̄_i: average x_i for node i during the iterations; z̄: average z workers and network are fast → in each iteration, updates from all workers can arrive → recover the O(1/T) convergence rate of standard ADMM

53 Experiments: Structured Sparse Models real-world data are often high-dimensional text bioinformatics hyperspectral image

54 Experiments: Structured Sparse Models real-world data are often high-dimensional text bioinformatics hyperspectral image Feature selection via regularized risk minimization minimize loss l(w) + sparsity-inducing regularizer Ω(w)

55 Experiments: Structured Sparse Models real-world data are often high-dimensional text bioinformatics hyperspectral image Feature selection via regularized risk minimization minimize loss l(w) + sparsity-inducing regularizer Ω(w) lasso: Ω(w) = ||w||_1 = Σ_i |w_i|
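
For the lasso regularizer, the proximal step that appears in ADMM's regularizer update has a well-known closed form (elementwise soft-thresholding); a minimal sketch, with the threshold kappa left as an assumed input:

import numpy as np

def soft_threshold(v, kappa):
    """prox of kappa*||.||_1: argmin_w kappa*||w||_1 + 0.5*||w - v||^2 (elementwise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)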

56 Structured Sparsity features often have intrinsic structures → structured sparsity

57 Structured Sparsity features often have intrinsic structures → structured sparsity group lasso e.g., categorical feature: represented by a group of binary features ((0,0,1) for Chinese; (0,1,0) French; (1,0,0) German) graph-guided fused lasso: groups can overlap Ω(x) = Σ_{(i,j)∈E} w_ij |x_i - x_j| (encourages x_i ≈ x_j for (i, j) ∈ E)
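
A minimal sketch of evaluating the graph-guided fused lasso penalty over an explicit edge list; the edge set and the edge weights w_ij used here are assumptions for illustration:

import numpy as np

def ggfl_penalty(x, edges, w):
    """Omega(x) = sum over edges (i, j) of w_ij * |x_i - x_j|."""
    return sum(w[(i, j)] * abs(x[i] - x[j]) for (i, j) in edges)

# toy usage
x = np.array([1.0, 1.1, -0.5])
edges = [(0, 1), (1, 2)]
w = {(0, 1): 1.0, (1, 2): 0.5}
val = ggfl_penalty(x, edges, w)   # = 1.0*|1.0 - 1.1| + 0.5*|1.1 - (-0.5)|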

58 Experiment: Graph-Guided Fused Lasso min_x (1/L) Σ_{i=1}^L l_i(x) (logistic loss) + λ Σ_{(i,j)∈E} w_ij |x_i - x_j| digits 4 and 9 from the MNIST data set 1.6 million 784-dimensional samples, divided uniformly among the 64 workers

59 Cluster Setup 18 computing nodes interconnected with a gigabit Ethernet each node: 4 AMD Opteron 2216 (2.4GHz) processors, 16GB memory one core for the master and each worker process inter-processor communication: MPI

60 Convergence of the Objective with Time async-ADMM faster than sync-ADMM (S = 64 or τ = 1)

61 Reduces Network Waiting Time [bar chart: total time (seconds), split into computational time and network waiting time, for (S, τ) combinations (64,1), (2,8), (2,16), (4,16), (4,32)] different (S, τ) combinations have similar computation time smaller S and/or larger τ allows for a higher degree of asynchrony → less time on network waiting

62 More Workers [bar chart: total time (seconds), computational time and network waiting time of sync-ADMM vs async-ADMM, against the number of workers] async-ADMM is again faster than sync-ADMM more workers → less time to wait for at least S worker updates → significantly less network waiting time

63 Low-Rank Matrix Factorization ADMM can also be efficiently used on nonconvex problems low-rank matrix factorization (e.g., in collaborative filtering) min_{L,R} ||M - LR||_F^2 (difference with the original matrix) + λ_1 ||L||_F^2 + λ_2 ||R||_F^2 (standard regularizer)

64 Low-Rank Matrix Factorization ADMM can also be efficiently used on nonconvex problems low-rank matrix factorization (e.g., in collaborative filtering) min_{L,R} ||M - LR||_F^2 (difference with the original matrix) + λ_1 ||L||_F^2 + λ_2 ||R||_F^2 (standard regularizer) m = 10000, n = , and rank = 100 partition M evenly across columns and then assign to N = 64 workers
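
A minimal sketch of the (nonconvex) factorization objective above; the matrix sizes and regularization weights are illustrative assumptions:

import numpy as np

def mf_objective(M, L, R, lam1, lam2):
    """||M - L R||_F^2 + lam1*||L||_F^2 + lam2*||R||_F^2."""
    return (np.linalg.norm(M - L @ R, 'fro') ** 2
            + lam1 * np.linalg.norm(L, 'fro') ** 2
            + lam2 * np.linalg.norm(R, 'fro') ** 2)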

65 Convergence of Objective with Time [plot: objective value (×10^5) vs. time (seconds) for sync-ADMM and async-ADMM] async-ADMM converges faster than sync-ADMM

66 Reduces Network Waiting Time [bar chart: total time (seconds), computational time and network waiting time, for (S, τ) combinations (64,1) and (2,32)]

67 Problems with Version 1 in cloud computing environments, computing nodes may be dynamically added, removed and may also fail MPI has no mechanism to handle faults

68 Version 2: Parameter Server parameters divided into shards and stored in multiple masters each master stores a replica of its neighbors' keys master manager / worker manager: handle addition, removal, failure of masters and workers

69 Version 2: Parameter Server parameters divided into shards and stored in multiple masters each master stores a replica of its neighbors' keys master manager / worker manager: handle addition, removal, failure of masters and workers fault tolerance

70 Version 2: Parameter Server parameters divided into shards and stored in multiple masters each master stores a replica of its neighbors' keys master manager / worker manager: handle addition, removal, failure of masters and workers fault tolerance spreads master-worker communication → more scalable

71 Preliminary Results Google Cloud cluster with 32 nodes, each with 2 cores RCV1 data set; l_1-regularized logistic regression [bar chart: total time (seconds), computational time and network waiting time, for sync-ADMM and async-ADMM with 1 master and with 4 masters] the asynchronous algorithm spends less time on network waiting more masters → further speedup

72 From BIG Data to big data

73 From BIG Data to big data learning subproblem on each worker min_w (1/n) Σ_{i=1}^n l_i(w) (empirical loss) + Ω(w) (regularizer) l_i(w): sample i's contribution to the loss

74 From BIG Data to big data learning subproblem on each worker min_w (1/n) Σ_{i=1}^n l_i(w) (empirical loss) + Ω(w) (regularizer) l_i(w): sample i's contribution to the loss may have a closed-form solution may have to be solved iteratively (e.g., gradient descent, ADMM, ...)

75 From BIG Data to big data learning subproblem on each worker min_w (1/n) Σ_{i=1}^n l_i(w) (empirical loss) + Ω(w) (regularizer) l_i(w): sample i's contribution to the loss may have a closed-form solution may have to be solved iteratively (e.g., gradient descent, ADMM, ...) original problem is BIG, subproblems can still be big

76 ADMM on Each Worker min_{w,y} (1/n) Σ_{i=1}^n l_i(w) + Ω(y) s.t. Aw + By = c

77 ADMM on Each Worker min_{w,y} (1/n) Σ_{i=1}^n l_i(w) + Ω(y) s.t. Aw + By = c ADMM's w update: w^{t+1} ← arg min_w (1/n) Σ_{i=1}^n l_i(w) + (β/2)||Aw + By^t - c + α^t||^2

78 ADMM on Each Worker min_{w,y} (1/n) Σ_{i=1}^n l_i(w) + Ω(y) s.t. Aw + By = c ADMM's w update: w^{t+1} ← arg min_w (1/n) Σ_{i=1}^n l_i(w) + (β/2)||Aw + By^t - c + α^t||^2 each iteration needs to visit all the samples (batch learning) big data → becomes computationally expensive

79 From Batch to Stochastic popularly used in gradient descent replace the gradient over the whole data set, (1/n) Σ_{i=1}^n ∇l_i(w), by the gradient ∇l_{k(t)}(w) at a single sample k(t) ∈ {1, 2, ..., n}

80 From Batch to Stochastic popularly used in gradient descent replace the gradient over the whole data set, (1/n) Σ_{i=1}^n ∇l_i(w), by the gradient ∇l_{k(t)}(w) at a single sample k(t) ∈ {1, 2, ..., n} or over a small mini-batch of samples: (1/m) Σ_{i∈S} ∇l_i(w) (where |S| = m ≪ n)

81 From Batch to Stochastic popularly used in gradient descent replace the gradient over the whole data set, (1/n) Σ_{i=1}^n ∇l_i(w), by the gradient ∇l_{k(t)}(w) at a single sample k(t) ∈ {1, 2, ..., n} or over a small mini-batch of samples: (1/m) Σ_{i∈S} ∇l_i(w) (where |S| = m ≪ n) per-iteration complexity is much lower (O(1) instead of O(n)) can scale to much larger data sets
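
A minimal sketch of replacing the full gradient by a mini-batch estimate, assuming the squared loss from the earlier example; the batch size m and the random generator are assumptions:

import numpy as np

def minibatch_grad(w, X, y, m, rng):
    """Gradient of the average squared loss over a random mini-batch S of size m."""
    S = rng.choice(len(y), size=m, replace=False)
    Xs, ys = X[S], y[S]
    return (2.0 / m) * Xs.T @ (Xs @ w - ys)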

82 Stochastic ADMM w^{t+1} ← arg min_w (1/n) Σ_{i=1}^n l_i(w) + (β/2)||Aw + By^t - c + α^t||^2

83 Stochastic ADMM w^{t+1} ← arg min_w (1/n) Σ_{i=1}^n l_i(w) + (β/2)||Aw + By^t - c + α^t||^2 → w^{t+1} ← arg min_w l_{k(t)}(w) + (β/2)||Aw + By^t - c + α^t||^2 + R(w) learns from only one sample

84 Stochastic ADMM w^{t+1} ← arg min_w (1/n) Σ_{i=1}^n l_i(w) + (β/2)||Aw + By^t - c + α^t||^2 → w^{t+1} ← arg min_w l_{k(t)}(w) + (β/2)||Aw + By^t - c + α^t||^2 + R(w) → w^{t+1} ← arg min_w ∇l_{k(t)}(w^t)^T (w - w^t) + (β/2)||Aw + By^t - c + α^t||^2 + R(w)

85 Stochastic ADMM w^{t+1} ← arg min_w (1/n) Σ_{i=1}^n l_i(w) + (β/2)||Aw + By^t - c + α^t||^2 → w^{t+1} ← arg min_w l_{k(t)}(w) + (β/2)||Aw + By^t - c + α^t||^2 + R(w) → w^{t+1} ← arg min_w ∇l_{k(t)}(w^t)^T (w - w^t) + (β/2)||Aw + By^t - c + α^t||^2 + R(w) slower convergence (T: number of iterations): batch ADMM: convergence rate O(1/T), iteration cost O(n); stochastic ADMM: convergence rate O(1/√T), iteration cost O(1)
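
A minimal sketch of the linearized w-update of stochastic ADMM, assuming R(w) = ||w - w^t||^2 / (2η) so that the subproblem is a quadratic with a closed-form solution; the step size η and the gradient oracle grad_k are assumptions:

import numpy as np

def stochastic_admm_w_step(w_t, grad_k, A, B, c, y_t, alpha_t, beta, eta):
    """w^{t+1} = argmin_w grad_k^T (w - w_t) + beta/2 ||A w + B y_t - c + alpha_t||^2
                 + ||w - w_t||^2 / (2 eta),  where grad_k is the gradient of l_{k(t)} at w_t."""
    d = len(w_t)
    H = beta * A.T @ A + np.eye(d) / eta
    rhs = w_t / eta - grad_k - beta * A.T @ (B @ y_t - c + alpha_t)
    return np.linalg.solve(H, rhs)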

86 Stochastic ADMM... w^{t+1} ← arg min_w ∇l_{k(t)}(w^t)^T (w - w^t) + (β/2)||Aw + By^t - c + α^t||^2 + R(w) not using the old gradients seems wasteful

87 Stochastic ADMM... w^{t+1} ← arg min_w ∇l_{k(t)}(w^t)^T (w - w^t) + (β/2)||Aw + By^t - c + α^t||^2 + R(w) not using the old gradients seems wasteful

88 Stochastic Average ADMM save gradients ∇l_1(w^0), ..., ∇l_n(w^0) at iteration t: randomly choose a sample k(t) ∈ {1, ..., n} update the gradient for k(t) only: ∇_{k(t)} ← ∇l_{k(t)}(w^t)

89 Stochastic Average ADMM save gradients ∇l_1(w^0), ..., ∇l_n(w^0) at iteration t: randomly choose a sample k(t) ∈ {1, ..., n} update the gradient for k(t) only: ∇_{k(t)} ← ∇l_{k(t)}(w^t) min_w ∇l_{k(t)}(w^t)^T (w - w^t) + (β/2)||Aw + By^t - c + α^t||^2 + R(w)

90 Stochastic Average ADMM save gradients ∇l_1(w^0), ..., ∇l_n(w^0) at iteration t: randomly choose a sample k(t) ∈ {1, ..., n} update the gradient for k(t) only: ∇_{k(t)} ← ∇l_{k(t)}(w^t) min_w ∇l_{k(t)}(w^t)^T (w - w^t) + (β/2)||Aw + By^t - c + α^t||^2 + R(w) → min_w (1/n) Σ_{i=1}^n ∇_i^T (w - w^t) + (β/2)||Aw + By^t - c + α^t||^2 + R(w) (cf. min_w (1/n) Σ_{i=1}^n ∇l_i(w^t)^T (w - w^t) + (β/2)||Aw + By^t - c + α^t||^2 + R(w)) out of the n gradients, only one of them (the one corresponding to sample k(t)) is based on the current w^t; the others are previously-stored gradient values
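
A minimal sketch of the gradient bookkeeping behind stochastic average ADMM: keep one stored gradient per sample, refresh only the chosen sample's slot, and feed the average of all stored gradients into the linearized update. The squared loss and the data layout are assumptions:

import numpy as np

def sa_admm_gradient(w_t, X, y, stored, grad_sum, rng):
    """Pick a random sample k, refresh its stored gradient at w_t, and return the
    average of all n stored gradients; only one slot is recomputed per call."""
    n = len(y)
    k = rng.integers(n)
    g_new = 2.0 * X[k] * (X[k] @ w_t - y[k])   # gradient of l_k at the current w_t
    grad_sum = grad_sum + g_new - stored[k]    # maintain the running sum in O(d) time
    stored[k] = g_new
    return grad_sum / n, stored, grad_sum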

91 Convergence Theorem min_{x,y} Φ(x, y) ≡ φ(x) + ψ(y) s.t. Ax + By = c E[ Φ(x̄_T, ȳ_T) - Φ(x*, y*) (difference with the optimal objective) + γ ||A x̄_T + B ȳ_T - c|| (constraint violation) ] ≤ (1/(2T)) { nL ||x* - x^0||^2 + ||y* - y^0||_H^2 + 2β ( γ^2/β^2 + ||α^0||^2 ) } x̄_T: average of the x^t's; ȳ_T: average of the y^t's

92 Convergence Theorem min_{x,y} Φ(x, y) ≡ φ(x) + ψ(y) s.t. Ax + By = c E[ Φ(x̄_T, ȳ_T) - Φ(x*, y*) (difference with the optimal objective) + γ ||A x̄_T + B ȳ_T - c|| (constraint violation) ] ≤ (1/(2T)) { nL ||x* - x^0||^2 + ||y* - y^0||_H^2 + 2β ( γ^2/β^2 + ||α^0||^2 ) } x̄_T: average of the x^t's; ȳ_T: average of the y^t's batch: convergence rate O(1/T), iteration cost O(n); stochastic: rate O(1/√T), cost O(1); stochastic average: rate O(1/T), cost O(1)

93 Convergence Theorem min_{x,y} Φ(x, y) ≡ φ(x) + ψ(y) s.t. Ax + By = c E[ Φ(x̄_T, ȳ_T) - Φ(x*, y*) (difference with the optimal objective) + γ ||A x̄_T + B ȳ_T - c|| (constraint violation) ] ≤ (1/(2T)) { nL ||x* - x^0||^2 + ||y* - y^0||_H^2 + 2β ( γ^2/β^2 + ||α^0||^2 ) } x̄_T: average of the x^t's; ȳ_T: average of the y^t's batch: convergence rate O(1/T), iteration cost O(n); stochastic: rate O(1/√T), cost O(1); stochastic average: rate O(1/T), cost O(1) caveat: need to store the gradient values → extra O(n) space

94 Experiment: Graph-Guided Fused Lasso forest cover type: 581,012 samples; drug: 12,678 [plots: objective value vs. time] speed: stochastic average > stochastic > batch

95 Experiment: Graph-Guided Fused Lasso... [plots: objective value vs. number of passes over the data] batch: one effective pass = one iteration stochastic: one effective pass = n iterations speed: stochastic average > stochastic > batch

96 Conclusion BIG data sets distributed processing: asynchronous ADMM partial barrier and bounded delay to control asynchrony

97 Conclusion BIG data sets distributed processing: asynchronous ADMM partial barrier and bounded delay to control asynchrony big data sets (processing on each worker) batch → (better) stochastic recycling old gradients

98 Conclusion BIG data sets distributed processing: asynchronous ADMM partial barrier and bounded delay to control asynchrony big data sets (processing on each worker) batch → (better) stochastic recycling old gradients simple; has convergence guarantees; fast (R. Zhang, J.T. Kwok, ICML 2014) (L.W. Zhong, J.T. Kwok, ICML 2014)

99 Shameless Plug I am looking for research students, research assistants / postdocs / programmers on a ML/DM project related to financial data If interested, please email me (jamesk@cse.ust.hk) Thank You
