Distributed Consensus Optimization


Distributed Consensus Optimization
Ming Yan
Michigan State University, CMSE/Mathematics
September 14, 2018

Background and motivation: why we need decentralized optimization?

Decentralized vehicle/aircraft coordination [1]: aircraft formation, flocks of birds.

Average consensus problem:
  \min_{\{x^{(i)}\}} \; \|x^{(1)} - b_1\|_2^2 + \cdots + \|x^{(n)} - b_n\|_2^2
  \text{s.t.} \; x^{(1)} = \cdots = x^{(n)}

[1] W. Ren, R. W. Beard, and E. M. Atkins, Information consensus in multivehicle cooperative control, IEEE Control Systems 27.2 (2007).

Background and motivation: why we need decentralized optimization?

Decentralized state estimation of the smart grid [2,3]: least squares (Gaussian noise) + l1 norm (sparse anomalies)
  \min_{\{x^{(i)}, v^{(i)}\}} \; \sum_{i=1}^n f_i(x^{(i)}, v^{(i)})
  \text{s.t.} \; x^{(h)}[j] = x^{(j)}[h], \; \forall j \in N_h, \; \forall h,
where f_i(x^{(i)}, v^{(i)}) = \|z_i - H_i x^{(i)} - v^{(i)}\|_2^2 + \lambda \|v^{(i)}\|_1, and the model parameter \lambda can be obtained through cross validation.

[2] V. Kekatos and G. B. Giannakis, Distributed robust power system state estimation, IEEE Transactions on Power Systems 28.2 (2013).
[3] A. Monticelli, Electric power system state estimation, Proceedings of the IEEE (2000).

Background and motivation: why we need decentralized optimization?

Decentralized dictionary learning [4,5]: matrix factorization + regularization
  \min_{D, A} \; \frac{1}{2} \sum_{j=1}^{c} \|Y_{:,j} - D A_{:,j}\|_F^2 + \lambda \|A_{:,j}\|_1 + \gamma \|D\|_F^2,
where
  Y \in R^{m \times n}: training data distributed over c agents,
  D \in R^{m \times p}: dictionary that constitutes Y,
  A \in R^{p \times n}: sparse coefficient vectors that encode Y, divided into n parts.

[4] H.-T. Wai, T.-H. Chang, and A. Scaglione, A consensus-based decentralized algorithm for non-convex optimization with application to dictionary learning, ICASSP 2015.
[5] Z. Wang, et al., Decentralized recommender systems, arXiv 2015.

Background and motivation: why we need decentralized optimization?

Decentralized data/signal processing: cost/risk minimization
  \min_x \; f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)
- Communication and computation balance; robustness
- Privacy preservation: exchange f_i? No! Exchange x^k_{(i)}
- Unmanned vehicle coordination, in-vehicle networking
- Smart grid management, power station management
- Decentralized recommender systems, multi-group cooperation
- Decentralized network utility maximization
- Decentralized resource allocation
- ...

Problem statement: what is decentralized optimization?

Decentralized consensus optimization:
  x^* = \arg\min_{x \in C \subseteq R^p} f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)    (1)
- involves multiple agents connected by a network, messaging 1-hop neighbors
- each agent owns a private objective and makes its own local decision
- the agents optimize the overall objective and reach consensus
Compared to a centralized system: robustness, balanced computation, privacy preservation.
Related topics: in-vehicle networking, internet of things, cloud computing, big data.
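
A minimal sketch of (1), assuming hypothetical data and the simplest possible local objectives: scalar decisions x in R and quadratic f_i(x) = 0.5*(x - b_i)^2, so the centralized minimizer is the average of the b_i. The stacked-gradient convention (one gradient per agent's copy) is the one the later snippets reuse.

    import numpy as np

    np.random.seed(0)
    n = 4                          # number of agents
    b = np.random.randn(n)         # private data: agent i only knows b_i

    # local objectives f_i(x) = 0.5 * (x - b_i)^2, so f(x) = (1/n) * sum_i f_i(x)
    def stacked_grad(x):
        """Entry i is the gradient of f_i evaluated at agent i's copy x_i."""
        return x - b

    x_star = b.mean()              # centralized minimizer of (1)
    # optimality of the consensus point: the averaged local gradients vanish
    print(np.isclose(stacked_grad(x_star * np.ones(n)).mean(), 0.0))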

The simplest decentralized consensus problem: averaging

decentralized averaging

One iteration: x^{k+1} = W x^k, where x = [x_1; \ldots; x_n].
W encodes the network; nonzero entries correspond to edges; we assume that W is symmetric (for undirected networks). Two examples:
  W = \begin{bmatrix} 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \end{bmatrix},
  W = \begin{bmatrix} 2/3 & 1/3 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 \\ 0 & 1/3 & 1/3 & 1/3 \\ 0 & 0 & 1/3 & 2/3 \end{bmatrix}.
- \mathbf{1}_{n\times 1} is a fixed point, i.e., W has an eigenvalue 1; each row/column sums to one.
- Any fixed point is a consensus, i.e., x^* = c\,\mathbf{1}_{n\times 1}.
- W has all other eigenvalues in (-1, 1).
- \mathbf{1}^\top x^{k+1} = \mathbf{1}^\top W x^k = \mathbf{1}^\top x^k; the sum is preserved during the iteration.
- The convergence speed depends on the second largest eigenvalue of W in absolute value.
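
A minimal numerical sketch of the averaging iteration, using the second (path-graph) mixing matrix above and assumed toy initial values: the iterates converge to the average of the initial values, at a speed governed by the second largest eigenvalue magnitude of W.

    import numpy as np

    # path-graph mixing matrix from the slide (symmetric, doubly stochastic)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])

    x = np.array([1.0, 2.0, 3.0, 10.0])   # initial values held by the 4 agents
    avg = x.mean()                        # the sum (hence the mean) is preserved

    for k in range(200):
        x = W @ x                         # one round of neighbor communication

    eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))
    print("second largest |eigenvalue|:", eigs[-2])   # controls the convergence speed
    print("max deviation from the average:", np.abs(x - avg).max())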

decentralized averaging as gradient descent

Decentralized averaging: x^{k+1} = W x^k.
Rewrite the iteration as x^{k+1} = W x^k = x^k - (I-W) x^k.
It is equivalent to gradient descent with stepsize 1 for
  \text{minimize}_x \; \frac{1}{2}\|\sqrt{I-W}\, x\|_2^2.
The final solution depends on the initial sum \mathbf{1}^\top x^0.
The Lipschitz constant of (I-W) x is smaller than 2, so we can choose stepsize 1.
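
A short check of this reading, reusing the path-graph W from above (assumed toy data): the gradient of g(x) = 0.5 * x^T (I - W) x is (I - W) x, so one unit-stepsize gradient step reproduces W x, and the Lipschitz constant of the gradient is the largest eigenvalue of I - W, which is below 2.

    import numpy as np

    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    I = np.eye(4)

    x = np.array([1.0, 2.0, 3.0, 10.0])

    grad = (I - W) @ x                 # gradient of g(x) = 0.5 * x^T (I - W) x
    print(np.allclose(x - grad, W @ x))            # unit-stepsize step == averaging step
    print(np.linalg.eigvalsh(I - W).max() < 2)     # Lipschitz constant of the gradient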

decentralized gradient descent

Consider the problem
  \text{minimize}_x \; f(x) = \sum_{i=1}^n f_i(x_i),   s.t.  x_1 = x_2 = \cdots = x_n.
Decentralized gradient descent (DGD) (Nedic-Ozdaglar 09):
  x^{k+1} = W x^k - \lambda \nabla f(x^k).
Rewrite it as x^{k+1} = x^k - \big((I-W) x^k + \lambda \nabla f(x^k)\big); DGD is gradient descent with stepsize one applied to
  \text{minimize}_x \; \frac{1}{2}\|\sqrt{I-W}\, x\|_2^2 + \lambda f(x).
The solution of this penalized problem is generally not a consensus, i.e., W x^* = x^* + \lambda \nabla f(x^*) \neq x^*.
Remedy: a diminishing stepsize, i.e., decrease \lambda during the iteration.
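
A sketch of DGD on the toy quadratic problem f_i(x) = 0.5*(x - b_i)^2 with the path-graph W (assumed data and stepsizes). With a fixed stepsize the iterates settle at a non-consensual point near, but not at, the true minimizer mean(b); shrinking lambda reduces that bias.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    b = np.random.randn(4)
    grad = lambda x: x - b            # stacked gradients of f_i(x_i) = 0.5*(x_i - b_i)^2

    def dgd(lam, iters=5000):
        x = np.zeros(4)
        for _ in range(iters):
            x = W @ x - lam * grad(x)   # DGD update
        return x

    for lam in (0.5, 0.05, 0.005):
        x = dgd(lam)
        print(f"lambda={lam}: consensus error={np.ptp(x):.2e}, "
              f"distance to mean(b)={np.abs(x - b.mean()).max():.2e}")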

constant stepsize?

- alternating direction method of multipliers (ADMM) (Shi et al. 14, Chang-Hong-Wang 15, Hong-Chang 17)
- multi-consensus inner loops (Chen-Ozdaglar 12, Jakovetic-Xavier-Moura 14)
- EXTRA/PG-EXTRA (Shi et al. 15)

decentralized smooth optimization

Problem:
  \text{minimize}_x \; f(x),   s.t.  \sqrt{I-W}\, x = 0.
Lagrangian: f(x) + \langle \sqrt{I-W}\, x, s \rangle, where s is the Lagrange multiplier.
Optimality condition (KKT):
  0 = \nabla f(x^*) + \sqrt{I-W}\, s^*,
  0 = \sqrt{I-W}\, x^*.
It is the same as
  \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^* \\ s^* \end{bmatrix} = 0.

forward-backward

The KKT system:
  \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^* \\ s^* \end{bmatrix} = 0.
Using forward-backward on the KKT system with the metric \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix}:
  \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix} = \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix}.
It reduces to
  \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix} = \begin{bmatrix} \alpha I & 0 \\ -2\sqrt{I-W} & \beta I \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix},
which is equivalent to
  \alpha x^k - \sqrt{I-W}\, s^k - \nabla f(x^k) = \alpha x^{k+1},
  -\sqrt{I-W}\, x^k + \beta s^k = -2\sqrt{I-W}\, x^{k+1} + \beta s^{k+1}.
For simplicity, let t = \sqrt{I-W}\, s, and we have
  \alpha x^k - t^k - \nabla f(x^k) = \alpha x^{k+1},
  -(I-W) x^k + \beta t^k = -2(I-W) x^{k+1} + \beta t^{k+1}.
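
A sketch of this (x, t) form on the toy quadratic problem (assumed data; t^0 = 0 and alpha*beta = 2). Iterating the two relations above drives the agents to consensus at the exact minimizer mean(b), which is what the closed-form recursion derived on the next slide does as well.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    I = np.eye(4)
    b = np.random.randn(4)
    grad = lambda x: x - b                 # f_i(x_i) = 0.5*(x_i - b_i)^2, L = 1

    alpha, beta = 2.0, 1.0                 # alpha*beta = 2, stepsize 1/alpha = 0.5 < 2/L
    x = np.zeros(4)
    t = np.zeros(4)                        # t = sqrt(I - W) s, initialized at zero

    for _ in range(500):
        x_new = x - (t + grad(x)) / alpha                          # first relation
        t = t + (2 * (I - W) @ x_new - (I - W) @ x) / beta         # second relation
        x = x_new

    print("consensus error:", np.ptp(x))
    print("distance to minimizer mean(b):", np.abs(x - b.mean()).max())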

EXact first-order Algorithm (EXTRA)

From the previous slide:
  \alpha x^k - t^k - \nabla f(x^k) = \alpha x^{k+1},
  -(I-W) x^k + \beta t^k = -2(I-W) x^{k+1} + \beta t^{k+1}.
We have
  \alpha x^{k+1} = \alpha x^k - t^k - \nabla f(x^k)
                 = \alpha x^k - \frac{I-W}{\beta}(2x^k - x^{k-1}) - t^{k-1} - \nabla f(x^k)
                 = \alpha x^k - \frac{I-W}{\beta}(2x^k - x^{k-1}) + \alpha x^k + \nabla f(x^{k-1}) - \alpha x^{k-1} - \nabla f(x^k)
                 = \Big(\alpha I - \frac{I-W}{\beta}\Big)(2x^k - x^{k-1}) + \nabla f(x^{k-1}) - \nabla f(x^k).
Let \alpha\beta = 2, and we have EXTRA (Shi et al. 15):
  x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big).
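
A compact sketch of the EXTRA recursion on the same toy quadratic problem (assumed data), using the initialization x^1 = x^0 - (1/alpha) grad f(x^0) stated later in the deck.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    W2 = (np.eye(4) + W) / 2
    b = np.random.randn(4)
    grad = lambda x: x - b                 # L = 1

    alpha = 2.0                            # stepsize 1/alpha = 0.5 < 2/L
    x_prev = np.zeros(4)
    x = x_prev - grad(x_prev) / alpha      # x^1 = x^0 - (1/alpha) grad f(x^0)

    for _ in range(500):
        x, x_prev = W2 @ (2 * x - x_prev) - (grad(x) - grad(x_prev)) / alpha, x

    print("consensus error:", np.ptp(x))
    print("distance to mean(b):", np.abs(x - b.mean()).max())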

convergence conditions for EXTRA: I

EXTRA: x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)
If \nabla f = 0:
  \begin{bmatrix} x^{k+1} \\ x^k \end{bmatrix} = \begin{bmatrix} I+W & -\frac{I+W}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} x^k \\ x^{k-1} \end{bmatrix}.
Let I + W = U\Sigma U^\top. Then
  \begin{bmatrix} I+W & -\frac{I+W}{2} \\ I & 0 \end{bmatrix} = \begin{bmatrix} U & 0 \\ 0 & U \end{bmatrix} \begin{bmatrix} \Sigma & -\frac{\Sigma}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} U^\top & 0 \\ 0 & U^\top \end{bmatrix},
and the iteration becomes
  \begin{bmatrix} U^\top x^{k+1} \\ U^\top x^k \end{bmatrix} = \begin{bmatrix} \Sigma & -\frac{\Sigma}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} U^\top x^k \\ U^\top x^{k-1} \end{bmatrix}.
The condition for W is -2/3 < \lambda(\Sigma) = \lambda(W+I) \le 2, which is -5/3 < \lambda(W) \le 1.
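
A quick numerical check of this spectral condition (an assumed illustrative scan, not part of the original analysis): for each eigenvalue sigma = lambda(I+W), the 2x2 companion block [[sigma, -sigma/2], [1, 0]] has spectral radius at most 1 exactly on (-2/3, 2], with radius 1 only at the consensus eigenvalue sigma = 2.

    import numpy as np

    def block_spectral_radius(sigma):
        """Spectral radius of the 2x2 companion block for one eigenvalue of I + W."""
        M = np.array([[sigma, -sigma / 2],
                      [1.0,    0.0      ]])
        return np.abs(np.linalg.eigvals(M)).max()

    # scan sigma = lambda(I + W); stability (radius <= 1) holds on (-2/3, 2]
    for sigma in (-0.8, -2/3 + 1e-3, 0.0, 1.0, 1.9, 2.0, 2.1):
        print(f"sigma = {sigma:+.3f}  ->  spectral radius = {block_spectral_radius(sigma):.4f}")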

convergence conditions for EXTRA: II

EXTRA: x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)
If \nabla f(x^k) = x^k - b:
  \begin{bmatrix} x^{k+1} \\ x^k \end{bmatrix} = \begin{bmatrix} I+W-\frac{1}{\alpha}I & -\frac{I+W}{2}+\frac{1}{\alpha}I \\ I & 0 \end{bmatrix} \begin{bmatrix} x^k \\ x^{k-1} \end{bmatrix}.
Let I + W = U\Sigma U^\top. Then
  \begin{bmatrix} I+W-\frac{1}{\alpha}I & -\frac{I+W}{2}+\frac{1}{\alpha}I \\ I & 0 \end{bmatrix} = \begin{bmatrix} U & 0 \\ 0 & U \end{bmatrix} \begin{bmatrix} \Sigma-\frac{1}{\alpha}I & -\frac{\Sigma}{2}+\frac{1}{\alpha}I \\ I & 0 \end{bmatrix} \begin{bmatrix} U^\top & 0 \\ 0 & U^\top \end{bmatrix}.
The condition for W is \frac{4}{3\alpha} - \frac{2}{3} < \lambda(\Sigma) = \lambda(W+I) \le 2, which is \frac{4}{3\alpha} - \frac{5}{3} < \lambda(W) \le 1. In addition, we have stepsize 1/\alpha < 2.

conditions for general EXTRA

EXTRA: x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big).
Initialization: x^1 = x^0 - \frac{1}{\alpha}\nabla f(x^0).
Convergence condition (Li-Yan 17): \frac{4}{3\alpha} - \frac{5}{3} < \lambda(W) \le 1, \; \frac{1}{\alpha} < \frac{2}{L}.
Linear convergence condition (Li-Yan 17): f(x) is strongly convex.
(Shi et al. 15): weaker condition on f(x), but more restrictive conditions on both parameters.

A stepsize as large as the centralized one?

decentralized smooth optimization

Problem:
  \text{minimize}_x \; f(x),   s.t.  \sqrt{I-W}\, x = 0.
Lagrangian: f(x) + \langle \sqrt{I-W}\, x, s \rangle, where s is the Lagrange multiplier.
Optimality condition (KKT):
  0 = \nabla f(x^*) + \sqrt{I-W}\, s^*,
  0 = \sqrt{I-W}\, x^*.
It is the same as
  \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^* \\ s^* \end{bmatrix} = 0.

forward-backward

The KKT system:
  \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^* \\ s^* \end{bmatrix} = 0.
Using forward-backward on the KKT system with the metric \begin{bmatrix} \alpha I & 0 \\ 0 & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix}:
  \begin{bmatrix} \alpha I & 0 \\ 0 & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix} = \begin{bmatrix} \alpha I & 0 \\ 0 & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix}.
Combine the right-hand side:
  \begin{bmatrix} \alpha I & 0 \\ 0 & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix} = \begin{bmatrix} \alpha I & \sqrt{I-W} \\ -\sqrt{I-W} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix}.
Apply Gaussian elimination (add \frac{1}{\alpha}\sqrt{I-W} times the first row to the second):
  \begin{bmatrix} \alpha I & 0 \\ \sqrt{I-W} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k) \end{bmatrix} = \begin{bmatrix} \alpha I & \sqrt{I-W} \\ 0 & \beta I \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix}.
It is equivalent to
  \alpha x^k - \nabla f(x^k) - \sqrt{I-W}\, s^{k+1} = \alpha x^{k+1},
  \sqrt{I-W}\, x^k + \beta\Big(I - \frac{1}{\alpha\beta}(I-W)\Big) s^k - \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k) = \beta s^{k+1}.

NIDS (Li-Shi-Yan 17)

From the previous slide:
  \alpha x^k - \nabla f(x^k) - \sqrt{I-W}\, s^{k+1} = \alpha x^{k+1},
  \sqrt{I-W}\, x^k + \beta\Big(I - \frac{1}{\alpha\beta}(I-W)\Big) s^k - \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k) = \beta s^{k+1}.
Let t = \sqrt{I-W}\, s:
  \alpha x^k - \nabla f(x^k) - t^{k+1} = \alpha x^{k+1},
  (I-W) x^k + \beta\Big(I - \frac{1}{\alpha\beta}(I-W)\Big) t^k - \frac{1}{\alpha}(I-W)\nabla f(x^k) = \beta t^{k+1}.
We have
  \alpha x^{k+1} = \alpha x^k - \nabla f(x^k) - t^{k+1}
                 = \alpha x^k - \nabla f(x^k) - \Big(I - \frac{1}{\alpha\beta}(I-W)\Big) t^k - \frac{1}{\beta}(I-W) x^k + \frac{1}{\alpha\beta}(I-W)\nabla f(x^k)
                 = \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\big(\alpha x^k - t^k - \nabla f(x^k)\big)
                 = \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\big(\alpha x^k + \alpha x^k - \alpha x^{k-1} + \nabla f(x^{k-1}) - \nabla f(x^k)\big).
Thus
  x^{k+1} = \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big).
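
A sketch of NIDS with alpha*beta = 2 on the toy quadratic problem (assumed data and stepsize); the only change from the EXTRA sketch above is that the gradient difference is multiplied by (I+W)/2 as well, which allows a larger, network-independent stepsize.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    W2 = (np.eye(4) + W) / 2
    b = np.random.randn(4)
    grad = lambda x: x - b                     # L = 1

    alpha = 0.6                                # stepsize 1/alpha ~ 1.67 < 2/L, network independent
    x_prev = np.zeros(4)
    x = x_prev - grad(x_prev) / alpha          # x^1 = x^0 - (1/alpha) grad f(x^0)

    for _ in range(500):
        x, x_prev = W2 @ (2 * x - x_prev - (grad(x) - grad(x_prev)) / alpha), x

    print("consensus error:", np.ptp(x))
    print("distance to mean(b):", np.abs(x - b.mean()).max())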

convergence conditions for NIDS

NIDS (with \alpha\beta = 2): x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
If \nabla f = 0 (same as EXTRA): the condition for W is -5/3 < \lambda(W) \le 1.
If \nabla f(x^k) = x^k - b:
  \begin{bmatrix} x^{k+1} \\ x^k \end{bmatrix} = \begin{bmatrix} \big(2-\frac{1}{\alpha}\big)\frac{I+W}{2} & -\big(1-\frac{1}{\alpha}\big)\frac{I+W}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} x^k \\ x^{k-1} \end{bmatrix}.
Let I + W = U\Sigma U^\top:
  \begin{bmatrix} U^\top x^{k+1} \\ U^\top x^k \end{bmatrix} = \begin{bmatrix} \big(2-\frac{1}{\alpha}\big)\frac{\Sigma}{2} & -\big(1-\frac{1}{\alpha}\big)\frac{\Sigma}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} U^\top x^k \\ U^\top x^{k-1} \end{bmatrix}.
Therefore, one sufficient condition is -5/3 < \lambda(W) \le 1 and 1/\alpha < 2.

conditions of NIDS for general smooth functions

NIDS (with \alpha\beta = 2): x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
Initialization: x^1 = x^0 - \frac{1}{\alpha}\nabla f(x^0).
Convergence condition (Li-Yan 17): -5/3 < \lambda(W) \le 1, \; \frac{1}{\alpha} < \frac{2}{L}.
Linear convergence condition: f(x) is strongly convex and -1 < \lambda(W) \le 1 (Li-Shi-Yan 17), with linear rate
  O\Big(\max\Big\{1 - \frac{\mu}{L},\; 1 - \frac{1-\lambda_2(W)}{1-\lambda_n(W)}\Big\}\Big).
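
A small sketch that evaluates the two terms of this rate bound for the path-graph W and an assumed condition number L/mu = 10: whichever term is closer to one (function conditioning or network connectivity) dominates the bound.

    import numpy as np

    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])

    eig = np.sort(np.linalg.eigvalsh(W))[::-1]     # eigenvalues in decreasing order
    lam2, lamn = eig[1], eig[-1]

    mu, L = 1.0, 10.0                              # assumed strong convexity / smoothness
    function_term = 1 - mu / L
    network_term = 1 - (1 - lam2) / (1 - lamn)

    print("function term :", function_term)
    print("network term  :", network_term)
    print("linear rate bound:", max(function_term, network_term))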

NIDS vs EXTRA

EXTRA: x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)
NIDS:  x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
- The difference is in the data to be communicated.
- But NIDS has a larger range for the parameters than EXTRA.
- NIDS is faster than EXTRA.

advantages of NIDS

NIDS: x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
- The stepsize is large and does not depend on the network topology: \frac{1}{\alpha} < \frac{2}{L}.
- Individual stepsizes can be included: \frac{1}{\alpha_i} < \frac{2}{L_i}.
- The contributions of the functions and of the network to the linear convergence rate are separated:
    O\Big(\max\Big\{1 - \frac{\mu}{L},\; 1 - \frac{1-\lambda_2(W)}{1-\lambda_n(W)}\Big\}\Big).
  It matches the results for gradient descent and decentralized averaging without acceleration.

D^2: stochastic NIDS (Tang et al. 18)

NIDS: x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
NIDS-stochastic (D^2: Decentralized Training over Decentralized Data):
  x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k; \xi^k) - \nabla f(x^{k-1}; \xi^{k-1})\big)\Big),
where \nabla f(x^k; \xi^k) is a stochastic gradient obtained by sampling \xi^k from the distribution D.
- E_{\xi \sim D}\, \nabla f(x; \xi) = \nabla f(x) for all x.
- E_{\xi \sim D}\, \|\nabla f(x; \xi) - \nabla f(x)\|^2 \le \sigma^2 for all x.
Convergence result: if the stepsize is small enough (on the order of (c + \sqrt{T/n})^{-1}), the convergence rate is
  O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{1}{T}\Big).
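
A sketch of the stochastic variant on the toy quadratic problem, where each local gradient is corrupted by Gaussian noise standing in for the sampling noise of grad f(x; xi) (the noise model, noise level, and stepsize are assumptions). The previous stochastic gradient is stored and reused, as in the update above.

    import numpy as np

    rng = np.random.default_rng(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    W2 = (np.eye(4) + W) / 2
    b = rng.standard_normal(4)
    sigma = 0.1                                    # stochastic-gradient noise level (assumed)

    def stoch_grad(x):
        """Unbiased stochastic gradient of f_i(x_i) = 0.5*(x_i - b_i)^2."""
        return x - b + sigma * rng.standard_normal(4)

    alpha = 2.0                                    # stepsize 1/alpha = 0.5 (assumed)
    x_prev = np.zeros(4)
    g_prev = stoch_grad(x_prev)
    x = x_prev - g_prev / alpha

    for _ in range(2000):
        g = stoch_grad(x)
        x, x_prev, g_prev = W2 @ (2 * x - x_prev - (g - g_prev) / alpha), x, g

    print("distance to mean(b):", np.abs(x - b.mean()).max())   # fluctuates at the noise floor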

numerical experiments

compared algorithms

- NIDS
- EXTRA/PG-EXTRA
- DIGing-ATC (Nedic et al. 16), whose update is
    x^{k+1} = W(x^k - \alpha y^k),
    y^{k+1} = W\big(y^k + \nabla f(x^{k+1}) - \nabla f(x^k)\big)
  (a sketch follows this list)
- accelerated distributed Nesterov gradient descent (Acc-DNGD-SC) (Qu-Li 17)
- dual friendly optimal algorithm (OA) for distributed optimization (Uribe et al. 17)
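
For comparison, a sketch of the DIGing-ATC update quoted above on the same toy quadratic problem; the stepsize and the gradient-tracking initialization y^0 = grad f(x^0) are assumptions here, since the slide does not specify them.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    b = np.random.randn(4)
    grad = lambda x: x - b

    alpha = 0.1                        # stepsize (assumed)
    x = np.zeros(4)
    y = grad(x)                        # gradient-tracking initialization (assumed)

    for _ in range(1000):
        x_new = W @ (x - alpha * y)                    # x-update
        y = W @ (y + grad(x_new) - grad(x))            # tracks the average gradient
        x = x_new

    print("consensus error:", np.ptp(x))
    print("distance to mean(b):", np.abs(x - b.mean()).max())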

strongly convex: same stepsize
[Figure: performance of the compared algorithms versus the number of iterations.]

strongly convex: adaptive stepsize
[Figure: performance of the compared algorithms versus the number of iterations.]

linear convergence rate bottleneck
[Figure: performance of the compared algorithms versus the number of iterations.]

nonsmooth functions
[Figure: NIDS-1/L, NIDS-1.5/L, NIDS-1.9/L, PG-EXTRA-1/L, PG-EXTRA-1.2/L, PG-EXTRA-1.3/L, and PG-EXTRA-1.4/L versus the number of iterations.]

stochastic case: shuffled
[Figure: training loss versus epochs for Decentralized, D^2, and Centralized; (a) TransferLearning, (b) LeNet.]

stochastic case: unshuffled
[Figure: training loss versus epochs for Decentralized, D^2, and Centralized; (a) TransferLearning, (b) LeNet.]

conclusion and open questions

conclusion:
- optimal bounds for EXTRA/PG-EXTRA
- new algorithm NIDS

open questions:
- network construction
- preconditioning
- acceleration?
- directed network?
- dynamical network?
- asynchronous?

Paper 1: Z. Li, W. Shi, and M. Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, arXiv preprint (code available).
Paper 2: Z. Li and M. Yan, A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization, arXiv preprint.
Paper 3: H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, D^2: decentralized training over decentralized data, ICML 2018.

Thank You!


More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

Preconditioning via Diagonal Scaling

Preconditioning via Diagonal Scaling Preconditioning via Diagonal Scaling Reza Takapoui Hamid Javadi June 4, 2014 1 Introduction Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time.

More information

Sparse Gaussian conditional random fields

Sparse Gaussian conditional random fields Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian

More information

Distributed Computation of Quantiles via ADMM

Distributed Computation of Quantiles via ADMM 1 Distributed Computation of Quantiles via ADMM Franck Iutzeler Abstract In this paper, we derive distributed synchronous and asynchronous algorithms for computing uantiles of the agents local values.

More information

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)

More information

SIGNIFICANT increase in amount of private distributed

SIGNIFICANT increase in amount of private distributed 1 Distributed DC Optimal Power Flow for Radial Networks Through Partial Primal Dual Algorithm Vahid Rasouli Disfani, Student Member, IEEE, Lingling Fan, Senior Member, IEEE, Zhixin Miao, Senior Member,

More information

A Distributed Newton Method for Network Utility Maximization

A Distributed Newton Method for Network Utility Maximization A Distributed Newton Method for Networ Utility Maximization Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie Abstract Most existing wor uses dual decomposition and subgradient methods to solve Networ Utility

More information

Complex Laplacians and Applications in Multi-Agent Systems

Complex Laplacians and Applications in Multi-Agent Systems 1 Complex Laplacians and Applications in Multi-Agent Systems Jiu-Gang Dong, and Li Qiu, Fellow, IEEE arxiv:1406.186v [math.oc] 14 Apr 015 Abstract Complex-valued Laplacians have been shown to be powerful

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Continuous-time Distributed Convex Optimization with Set Constraints

Continuous-time Distributed Convex Optimization with Set Constraints Preprints of the 19th World Congress The International Federation of Automatic Control Cape Town, South Africa. August 24-29, 214 Continuous-time Distributed Convex Optimization with Set Constraints Shuai

More information

Smoothing Proximal Gradient Method. General Structured Sparse Regression

Smoothing Proximal Gradient Method. General Structured Sparse Regression for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:

More information