Distributed Consensus Optimization


Distributed Consensus Optimization
Ming Yan
Michigan State University, CMSE/Mathematics
September 14, 2018

Background and motivation: why we need decentralized optimization?

Decentralized vehicle/aircraft coordination [1]: aircraft formation, flocks of birds.

Average consensus problem:
  \min_{\{x^{(i)}\}} \; \|x^{(1)} - b_1\|_2^2 + \cdots + \|x^{(n)} - b_n\|_2^2
  \text{s.t.} \; x^{(1)} = \cdots = x^{(n)}

[1] W. Ren, R. W. Beard, and E. M. Atkins, Information consensus in multivehicle cooperative control, IEEE Control Systems 27.2 (2007).

Background and motivation: why we need decentralized optimization?

Decentralized state estimation of the smart grid [2,3]: least squares (Gaussian noise) + l1 norm (sparse anomalies)
  \min_{\{x^{(i)}, v^{(i)}\}} \; \sum_{i=1}^n f_i(x^{(i)}, v^{(i)})
  \text{s.t.} \; x^{(h)}[j] = x^{(j)}[h], \; \forall j \in N_h, \; \forall h,
where f_i(x^{(i)}, v^{(i)}) = \|z_i - H_i x^{(i)} - v^{(i)}\|_2^2 + \lambda \|v^{(i)}\|_1, and the model parameter \lambda can be obtained through cross validation.

[2] V. Kekatos and G. B. Giannakis, Distributed robust power system state estimation, IEEE Transactions on Power Systems 28.2 (2013).
[3] A. Monticelli, Electric power system state estimation, Proceedings of the IEEE (2000).

Background and motivation: why we need decentralized optimization?

Decentralized dictionary learning [4,5]: matrix factorization + regularization
  \min_{D, A} \; \frac{1}{2} \sum_{j=1}^{c} \|Y_{:,j} - D A_{:,j}\|_F^2 + \lambda \|A_{:,j}\|_1 + \gamma \|D\|_F^2,
where
  Y \in R^{m \times n}: training data distributed over c agents,
  D \in R^{m \times p}: dictionary that constitutes Y,
  A \in R^{p \times n}: sparse coefficient vectors that encode Y, divided into n parts.

[4] H.-T. Wai, T.-H. Chang, and A. Scaglione, A consensus-based decentralized algorithm for non-convex optimization with application to dictionary learning, ICASSP 2015.
[5] Z. Wang, et al., Decentralized recommender systems, arXiv 2015.

Background and motivation: why we need decentralized optimization?

Decentralized data/signal processing: cost/risk minimization
  \min_x \; f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)
- Communication and computation balance; robustness
- Privacy preservation: exchange f_i? No! Exchange x^k_{(i)}
- Unmanned vehicle coordination, in-vehicle networking
- Smart grid management, power station management
- Decentralized recommender systems, multi-group cooperation
- Decentralized network utility maximization
- Decentralized resource allocation
- ...

Problem statement: what is decentralized optimization?

Decentralized consensus optimization:
  x^* = \arg\min_{x \in C \subseteq R^p} f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)    (1)
- involves multiple agents connected by a network, messaging 1-hop neighbors
- each agent owns a private objective and makes its own local decision
- the agents optimize the overall objective and reach consensus
Compared to a centralized system: robustness, balanced computation, privacy preservation.
Related topics: in-vehicle networking, internet of things, cloud computing, big data.
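
A minimal sketch of (1), assuming hypothetical data and the simplest possible local objectives: scalar decisions x in R and quadratic f_i(x) = 0.5*(x - b_i)^2, so the centralized minimizer is the average of the b_i. The stacked-gradient convention (one gradient per agent's copy) is the one the later snippets reuse.

    import numpy as np

    np.random.seed(0)
    n = 4                          # number of agents
    b = np.random.randn(n)         # private data: agent i only knows b_i

    # local objectives f_i(x) = 0.5 * (x - b_i)^2, so f(x) = (1/n) * sum_i f_i(x)
    def stacked_grad(x):
        """Entry i is the gradient of f_i evaluated at agent i's copy x_i."""
        return x - b

    x_star = b.mean()              # centralized minimizer of (1)
    # optimality of the consensus point: the averaged local gradients vanish
    print(np.isclose(stacked_grad(x_star * np.ones(n)).mean(), 0.0))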

The simplest decentralized consensus problem: averaging

decentralized averaging

One iteration: x^{k+1} = W x^k, where x = [x_1; \ldots; x_n].
W encodes the network; nonzero entries correspond to edges; we assume that W is symmetric (for undirected networks). Two examples:
  W = \begin{bmatrix} 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \\ 1/4 & 1/4 & 1/4 & 1/4 \end{bmatrix},
  W = \begin{bmatrix} 2/3 & 1/3 & 0 & 0 \\ 1/3 & 1/3 & 1/3 & 0 \\ 0 & 1/3 & 1/3 & 1/3 \\ 0 & 0 & 1/3 & 2/3 \end{bmatrix}.
- \mathbf{1}_{n\times 1} is a fixed point, i.e., W has an eigenvalue 1; each row/column sums to one.
- Any fixed point is a consensus, i.e., x^* = c\,\mathbf{1}_{n\times 1}.
- W has all other eigenvalues in (-1, 1).
- \mathbf{1}^\top x^{k+1} = \mathbf{1}^\top W x^k = \mathbf{1}^\top x^k; the sum is preserved during the iteration.
- The convergence speed depends on the second largest eigenvalue of W in absolute value.
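
A minimal numerical sketch of the averaging iteration, using the second (path-graph) mixing matrix above and assumed toy initial values: the iterates converge to the average of the initial values, at a speed governed by the second largest eigenvalue magnitude of W.

    import numpy as np

    # path-graph mixing matrix from the slide (symmetric, doubly stochastic)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])

    x = np.array([1.0, 2.0, 3.0, 10.0])   # initial values held by the 4 agents
    avg = x.mean()                        # the sum (hence the mean) is preserved

    for k in range(200):
        x = W @ x                         # one round of neighbor communication

    eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))
    print("second largest |eigenvalue|:", eigs[-2])   # controls the convergence speed
    print("max deviation from the average:", np.abs(x - avg).max())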

decentralized averaging as gradient descent

Decentralized averaging: x^{k+1} = W x^k.
Rewrite the iteration as x^{k+1} = W x^k = x^k - (I-W) x^k.
It is equivalent to gradient descent with stepsize 1 for
  \text{minimize}_x \; \frac{1}{2}\|\sqrt{I-W}\, x\|_2^2.
The final solution depends on the initial sum \mathbf{1}^\top x^0.
The Lipschitz constant of (I-W) x is smaller than 2, so we can choose stepsize 1.
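
A short check of this reading, reusing the path-graph W from above (assumed toy data): the gradient of g(x) = 0.5 * x^T (I - W) x is (I - W) x, so one unit-stepsize gradient step reproduces W x, and the Lipschitz constant of the gradient is the largest eigenvalue of I - W, which is below 2.

    import numpy as np

    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    I = np.eye(4)

    x = np.array([1.0, 2.0, 3.0, 10.0])

    grad = (I - W) @ x                 # gradient of g(x) = 0.5 * x^T (I - W) x
    print(np.allclose(x - grad, W @ x))            # unit-stepsize step == averaging step
    print(np.linalg.eigvalsh(I - W).max() < 2)     # Lipschitz constant of the gradient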

decentralized gradient descent

Consider the problem
  \text{minimize}_x \; f(x) = \sum_{i=1}^n f_i(x_i),   s.t.  x_1 = x_2 = \cdots = x_n.
Decentralized gradient descent (DGD) (Nedic-Ozdaglar 09):
  x^{k+1} = W x^k - \lambda \nabla f(x^k).
Rewrite it as x^{k+1} = x^k - \big((I-W) x^k + \lambda \nabla f(x^k)\big); DGD is gradient descent with stepsize one applied to
  \text{minimize}_x \; \frac{1}{2}\|\sqrt{I-W}\, x\|_2^2 + \lambda f(x).
The solution of this penalized problem is generally not a consensus, i.e., W x^* = x^* + \lambda \nabla f(x^*) \neq x^*.
Remedy: a diminishing stepsize, i.e., decrease \lambda during the iteration.
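
A sketch of DGD on the toy quadratic problem f_i(x) = 0.5*(x - b_i)^2 with the path-graph W (assumed data and stepsizes). With a fixed stepsize the iterates settle at a non-consensual point near, but not at, the true minimizer mean(b); shrinking lambda reduces that bias.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    b = np.random.randn(4)
    grad = lambda x: x - b            # stacked gradients of f_i(x_i) = 0.5*(x_i - b_i)^2

    def dgd(lam, iters=5000):
        x = np.zeros(4)
        for _ in range(iters):
            x = W @ x - lam * grad(x)   # DGD update
        return x

    for lam in (0.5, 0.05, 0.005):
        x = dgd(lam)
        print(f"lambda={lam}: consensus error={np.ptp(x):.2e}, "
              f"distance to mean(b)={np.abs(x - b.mean()).max():.2e}")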

constant stepsize?

- alternating direction method of multipliers (ADMM) (Shi et al. 14, Chang-Hong-Wang 15, Hong-Chang 17)
- multi-consensus inner loops (Chen-Ozdaglar 12, Jakovetic-Xavier-Moura 14)
- EXTRA/PG-EXTRA (Shi et al. 15)

decentralized smooth optimization

Problem:
  \text{minimize}_x \; f(x),   s.t.  \sqrt{I-W}\, x = 0.
Lagrangian: f(x) + \langle \sqrt{I-W}\, x, s \rangle, where s is the Lagrange multiplier.
Optimality condition (KKT):
  0 = \nabla f(x^*) + \sqrt{I-W}\, s^*,
  0 = \sqrt{I-W}\, x^*.
It is the same as
  \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^* \\ s^* \end{bmatrix} = 0.

forward-backward

The KKT system:
  \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^* \\ s^* \end{bmatrix} = 0.
Using forward-backward on the KKT system with the metric \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix}:
  \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix} = \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix}.
It reduces to
  \begin{bmatrix} \alpha I & -\sqrt{I-W} \\ -\sqrt{I-W} & \beta I \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix} = \begin{bmatrix} \alpha I & 0 \\ -2\sqrt{I-W} & \beta I \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix},
which is equivalent to
  \alpha x^k - \sqrt{I-W}\, s^k - \nabla f(x^k) = \alpha x^{k+1},
  -\sqrt{I-W}\, x^k + \beta s^k = -2\sqrt{I-W}\, x^{k+1} + \beta s^{k+1}.
For simplicity, let t = \sqrt{I-W}\, s, and we have
  \alpha x^k - t^k - \nabla f(x^k) = \alpha x^{k+1},
  -(I-W) x^k + \beta t^k = -2(I-W) x^{k+1} + \beta t^{k+1}.
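
A sketch of this (x, t) form on the toy quadratic problem (assumed data; t^0 = 0 and alpha*beta = 2). Iterating the two relations above drives the agents to consensus at the exact minimizer mean(b), which is what the closed-form recursion derived on the next slide does as well.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    I = np.eye(4)
    b = np.random.randn(4)
    grad = lambda x: x - b                 # f_i(x_i) = 0.5*(x_i - b_i)^2, L = 1

    alpha, beta = 2.0, 1.0                 # alpha*beta = 2, stepsize 1/alpha = 0.5 < 2/L
    x = np.zeros(4)
    t = np.zeros(4)                        # t = sqrt(I - W) s, initialized at zero

    for _ in range(500):
        x_new = x - (t + grad(x)) / alpha                          # first relation
        t = t + (2 * (I - W) @ x_new - (I - W) @ x) / beta         # second relation
        x = x_new

    print("consensus error:", np.ptp(x))
    print("distance to minimizer mean(b):", np.abs(x - b.mean()).max())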

EXact first-order Algorithm (EXTRA)

From the previous slide:
  \alpha x^k - t^k - \nabla f(x^k) = \alpha x^{k+1},
  -(I-W) x^k + \beta t^k = -2(I-W) x^{k+1} + \beta t^{k+1}.
We have
  \alpha x^{k+1} = \alpha x^k - t^k - \nabla f(x^k)
                 = \alpha x^k - \frac{I-W}{\beta}(2x^k - x^{k-1}) - t^{k-1} - \nabla f(x^k)
                 = \alpha x^k - \frac{I-W}{\beta}(2x^k - x^{k-1}) + \alpha x^k + \nabla f(x^{k-1}) - \alpha x^{k-1} - \nabla f(x^k)
                 = \Big(\alpha I - \frac{I-W}{\beta}\Big)(2x^k - x^{k-1}) + \nabla f(x^{k-1}) - \nabla f(x^k).
Let \alpha\beta = 2, and we have EXTRA (Shi et al. 15):
  x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big).
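
A compact sketch of the EXTRA recursion on the same toy quadratic problem (assumed data), using the initialization x^1 = x^0 - (1/alpha) grad f(x^0) stated later in the deck.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    W2 = (np.eye(4) + W) / 2
    b = np.random.randn(4)
    grad = lambda x: x - b                 # L = 1

    alpha = 2.0                            # stepsize 1/alpha = 0.5 < 2/L
    x_prev = np.zeros(4)
    x = x_prev - grad(x_prev) / alpha      # x^1 = x^0 - (1/alpha) grad f(x^0)

    for _ in range(500):
        x, x_prev = W2 @ (2 * x - x_prev) - (grad(x) - grad(x_prev)) / alpha, x

    print("consensus error:", np.ptp(x))
    print("distance to mean(b):", np.abs(x - b.mean()).max())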

convergence conditions for EXTRA: I

EXTRA: x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)
If \nabla f = 0:
  \begin{bmatrix} x^{k+1} \\ x^k \end{bmatrix} = \begin{bmatrix} I+W & -\frac{I+W}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} x^k \\ x^{k-1} \end{bmatrix}.
Let I + W = U\Sigma U^\top. Then
  \begin{bmatrix} I+W & -\frac{I+W}{2} \\ I & 0 \end{bmatrix} = \begin{bmatrix} U & 0 \\ 0 & U \end{bmatrix} \begin{bmatrix} \Sigma & -\frac{\Sigma}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} U^\top & 0 \\ 0 & U^\top \end{bmatrix},
and the iteration becomes
  \begin{bmatrix} U^\top x^{k+1} \\ U^\top x^k \end{bmatrix} = \begin{bmatrix} \Sigma & -\frac{\Sigma}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} U^\top x^k \\ U^\top x^{k-1} \end{bmatrix}.
The condition for W is -2/3 < \lambda(\Sigma) = \lambda(W+I) \le 2, which is -5/3 < \lambda(W) \le 1.
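
A quick numerical check of this spectral condition (an assumed illustrative scan, not part of the original analysis): for each eigenvalue sigma = lambda(I+W), the 2x2 companion block [[sigma, -sigma/2], [1, 0]] has spectral radius at most 1 exactly on (-2/3, 2], with radius 1 only at the consensus eigenvalue sigma = 2.

    import numpy as np

    def block_spectral_radius(sigma):
        """Spectral radius of the 2x2 companion block for one eigenvalue of I + W."""
        M = np.array([[sigma, -sigma / 2],
                      [1.0,    0.0      ]])
        return np.abs(np.linalg.eigvals(M)).max()

    # scan sigma = lambda(I + W); stability (radius <= 1) holds on (-2/3, 2]
    for sigma in (-0.8, -2/3 + 1e-3, 0.0, 1.0, 1.9, 2.0, 2.1):
        print(f"sigma = {sigma:+.3f}  ->  spectral radius = {block_spectral_radius(sigma):.4f}")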

convergence conditions for EXTRA: II

EXTRA: x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)
If \nabla f(x^k) = x^k - b:
  \begin{bmatrix} x^{k+1} \\ x^k \end{bmatrix} = \begin{bmatrix} I+W-\frac{1}{\alpha}I & -\frac{I+W}{2}+\frac{1}{\alpha}I \\ I & 0 \end{bmatrix} \begin{bmatrix} x^k \\ x^{k-1} \end{bmatrix}.
Let I + W = U\Sigma U^\top. Then
  \begin{bmatrix} I+W-\frac{1}{\alpha}I & -\frac{I+W}{2}+\frac{1}{\alpha}I \\ I & 0 \end{bmatrix} = \begin{bmatrix} U & 0 \\ 0 & U \end{bmatrix} \begin{bmatrix} \Sigma-\frac{1}{\alpha}I & -\frac{\Sigma}{2}+\frac{1}{\alpha}I \\ I & 0 \end{bmatrix} \begin{bmatrix} U^\top & 0 \\ 0 & U^\top \end{bmatrix}.
The condition for W is \frac{4}{3\alpha} - \frac{2}{3} < \lambda(\Sigma) = \lambda(W+I) \le 2, which is \frac{4}{3\alpha} - \frac{5}{3} < \lambda(W) \le 1. In addition, we have stepsize 1/\alpha < 2.

conditions for general EXTRA

EXTRA: x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big).
Initialization: x^1 = x^0 - \frac{1}{\alpha}\nabla f(x^0).
Convergence condition (Li-Yan 17): \frac{4}{3\alpha} - \frac{5}{3} < \lambda(W) \le 1, \; \frac{1}{\alpha} < \frac{2}{L}.
Linear convergence condition (Li-Yan 17): f(x) is strongly convex.
(Shi et al. 15): weaker condition on f(x), but more restrictive conditions on both parameters.

A stepsize as large as the centralized one?

decentralized smooth optimization

Problem:
  \text{minimize}_x \; f(x),   s.t.  \sqrt{I-W}\, x = 0.
Lagrangian: f(x) + \langle \sqrt{I-W}\, x, s \rangle, where s is the Lagrange multiplier.
Optimality condition (KKT):
  0 = \nabla f(x^*) + \sqrt{I-W}\, s^*,
  0 = \sqrt{I-W}\, x^*.
It is the same as
  \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^* \\ s^* \end{bmatrix} = 0.

forward-backward

The KKT system:
  \begin{bmatrix} \nabla f(x^*) \\ 0 \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^* \\ s^* \end{bmatrix} = 0.
Using forward-backward on the KKT system with the metric \begin{bmatrix} \alpha I & 0 \\ 0 & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix}:
  \begin{bmatrix} \alpha I & 0 \\ 0 & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix} = \begin{bmatrix} \alpha I & 0 \\ 0 & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix} + \begin{bmatrix} 0 & \sqrt{I-W} \\ -\sqrt{I-W} & 0 \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix}.
Combine the right-hand side:
  \begin{bmatrix} \alpha I & 0 \\ 0 & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ 0 \end{bmatrix} = \begin{bmatrix} \alpha I & \sqrt{I-W} \\ -\sqrt{I-W} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix}.
Apply Gaussian elimination (add \frac{1}{\alpha}\sqrt{I-W} times the first row to the second):
  \begin{bmatrix} \alpha I & 0 \\ \sqrt{I-W} & \beta I - \frac{1}{\alpha}(I-W) \end{bmatrix} \begin{bmatrix} x^k \\ s^k \end{bmatrix} - \begin{bmatrix} \nabla f(x^k) \\ \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k) \end{bmatrix} = \begin{bmatrix} \alpha I & \sqrt{I-W} \\ 0 & \beta I \end{bmatrix} \begin{bmatrix} x^{k+1} \\ s^{k+1} \end{bmatrix}.
It is equivalent to
  \alpha x^k - \nabla f(x^k) - \sqrt{I-W}\, s^{k+1} = \alpha x^{k+1},
  \sqrt{I-W}\, x^k + \beta\Big(I - \frac{1}{\alpha\beta}(I-W)\Big) s^k - \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k) = \beta s^{k+1}.

NIDS (Li-Shi-Yan 17)

From the previous slide:
  \alpha x^k - \nabla f(x^k) - \sqrt{I-W}\, s^{k+1} = \alpha x^{k+1},
  \sqrt{I-W}\, x^k + \beta\Big(I - \frac{1}{\alpha\beta}(I-W)\Big) s^k - \frac{1}{\alpha}\sqrt{I-W}\,\nabla f(x^k) = \beta s^{k+1}.
Let t = \sqrt{I-W}\, s:
  \alpha x^k - \nabla f(x^k) - t^{k+1} = \alpha x^{k+1},
  (I-W) x^k + \beta\Big(I - \frac{1}{\alpha\beta}(I-W)\Big) t^k - \frac{1}{\alpha}(I-W)\nabla f(x^k) = \beta t^{k+1}.
We have
  \alpha x^{k+1} = \alpha x^k - \nabla f(x^k) - t^{k+1}
                 = \alpha x^k - \nabla f(x^k) - \Big(I - \frac{1}{\alpha\beta}(I-W)\Big) t^k - \frac{1}{\beta}(I-W) x^k + \frac{1}{\alpha\beta}(I-W)\nabla f(x^k)
                 = \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\big(\alpha x^k - t^k - \nabla f(x^k)\big)
                 = \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\big(\alpha x^k + \alpha x^k - \alpha x^{k-1} + \nabla f(x^{k-1}) - \nabla f(x^k)\big).
Thus
  x^{k+1} = \Big(I - \frac{1}{\alpha\beta}(I-W)\Big)\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big).
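
A sketch of NIDS with alpha*beta = 2 on the toy quadratic problem (assumed data and stepsize); the only change from the EXTRA sketch above is that the gradient difference is multiplied by (I+W)/2 as well, which allows a larger, network-independent stepsize.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    W2 = (np.eye(4) + W) / 2
    b = np.random.randn(4)
    grad = lambda x: x - b                     # L = 1

    alpha = 0.6                                # stepsize 1/alpha ~ 1.67 < 2/L, network independent
    x_prev = np.zeros(4)
    x = x_prev - grad(x_prev) / alpha          # x^1 = x^0 - (1/alpha) grad f(x^0)

    for _ in range(500):
        x, x_prev = W2 @ (2 * x - x_prev - (grad(x) - grad(x_prev)) / alpha), x

    print("consensus error:", np.ptp(x))
    print("distance to mean(b):", np.abs(x - b.mean()).max())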

convergence conditions for NIDS

NIDS (with \alpha\beta = 2): x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
If \nabla f = 0 (same as EXTRA): the condition for W is -5/3 < \lambda(W) \le 1.
If \nabla f(x^k) = x^k - b:
  \begin{bmatrix} x^{k+1} \\ x^k \end{bmatrix} = \begin{bmatrix} \big(2-\frac{1}{\alpha}\big)\frac{I+W}{2} & -\big(1-\frac{1}{\alpha}\big)\frac{I+W}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} x^k \\ x^{k-1} \end{bmatrix}.
Let I + W = U\Sigma U^\top:
  \begin{bmatrix} U^\top x^{k+1} \\ U^\top x^k \end{bmatrix} = \begin{bmatrix} \big(2-\frac{1}{\alpha}\big)\frac{\Sigma}{2} & -\big(1-\frac{1}{\alpha}\big)\frac{\Sigma}{2} \\ I & 0 \end{bmatrix} \begin{bmatrix} U^\top x^k \\ U^\top x^{k-1} \end{bmatrix}.
Therefore, one sufficient condition is -5/3 < \lambda(W) \le 1 and 1/\alpha < 2.

conditions of NIDS for general smooth functions

NIDS (with \alpha\beta = 2): x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
Initialization: x^1 = x^0 - \frac{1}{\alpha}\nabla f(x^0).
Convergence condition (Li-Yan 17): -5/3 < \lambda(W) \le 1, \; \frac{1}{\alpha} < \frac{2}{L}.
Linear convergence condition: f(x) is strongly convex and -1 < \lambda(W) \le 1 (Li-Shi-Yan 17), with linear rate
  O\Big(\max\Big\{1 - \frac{\mu}{L},\; 1 - \frac{1-\lambda_2(W)}{1-\lambda_n(W)}\Big\}\Big).
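
A small sketch that evaluates the two terms of this rate bound for the path-graph W and an assumed condition number L/mu = 10: whichever term is closer to one (function conditioning or network connectivity) dominates the bound.

    import numpy as np

    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])

    eig = np.sort(np.linalg.eigvalsh(W))[::-1]     # eigenvalues in decreasing order
    lam2, lamn = eig[1], eig[-1]

    mu, L = 1.0, 10.0                              # assumed strong convexity / smoothness
    function_term = 1 - mu / L
    network_term = 1 - (1 - lam2) / (1 - lamn)

    print("function term :", function_term)
    print("network term  :", network_term)
    print("linear rate bound:", max(function_term, network_term))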

NIDS vs EXTRA

EXTRA: x^{k+1} = \frac{I+W}{2}(2x^k - x^{k-1}) - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)
NIDS:  x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
- The difference is in the data to be communicated.
- But NIDS has a larger range for the parameters than EXTRA.
- NIDS is faster than EXTRA.

advantages of NIDS

NIDS: x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
- The stepsize is large and does not depend on the network topology: \frac{1}{\alpha} < \frac{2}{L}.
- Individual stepsizes can be included: \frac{1}{\alpha_i} < \frac{2}{L_i}.
- The contributions of the functions and of the network to the linear convergence rate are separated:
    O\Big(\max\Big\{1 - \frac{\mu}{L},\; 1 - \frac{1-\lambda_2(W)}{1-\lambda_n(W)}\Big\}\Big).
  It matches the results for gradient descent and decentralized averaging without acceleration.

D^2: stochastic NIDS (Tang et al. 18)

NIDS: x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k) - \nabla f(x^{k-1})\big)\Big)
NIDS-stochastic (D^2: Decentralized Training over Decentralized Data):
  x^{k+1} = \frac{I+W}{2}\Big(2x^k - x^{k-1} - \frac{1}{\alpha}\big(\nabla f(x^k; \xi^k) - \nabla f(x^{k-1}; \xi^{k-1})\big)\Big),
where \nabla f(x^k; \xi^k) is a stochastic gradient obtained by sampling \xi^k from the distribution D.
- E_{\xi \sim D}\, \nabla f(x; \xi) = \nabla f(x) for all x.
- E_{\xi \sim D}\, \|\nabla f(x; \xi) - \nabla f(x)\|^2 \le \sigma^2 for all x.
Convergence result: if the stepsize is small enough (on the order of (c + \sqrt{T/n})^{-1}), the convergence rate is
  O\Big(\frac{\sigma}{\sqrt{nT}} + \frac{1}{T}\Big).
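
A sketch of the stochastic variant on the toy quadratic problem, where each local gradient is corrupted by Gaussian noise standing in for the sampling noise of grad f(x; xi) (the noise model, noise level, and stepsize are assumptions). The previous stochastic gradient is stored and reused, as in the update above.

    import numpy as np

    rng = np.random.default_rng(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    W2 = (np.eye(4) + W) / 2
    b = rng.standard_normal(4)
    sigma = 0.1                                    # stochastic-gradient noise level (assumed)

    def stoch_grad(x):
        """Unbiased stochastic gradient of f_i(x_i) = 0.5*(x_i - b_i)^2."""
        return x - b + sigma * rng.standard_normal(4)

    alpha = 2.0                                    # stepsize 1/alpha = 0.5 (assumed)
    x_prev = np.zeros(4)
    g_prev = stoch_grad(x_prev)
    x = x_prev - g_prev / alpha

    for _ in range(2000):
        g = stoch_grad(x)
        x, x_prev, g_prev = W2 @ (2 * x - x_prev - (g - g_prev) / alpha), x, g

    print("distance to mean(b):", np.abs(x - b.mean()).max())   # fluctuates at the noise floor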

numerical experiments

compared algorithms

- NIDS
- EXTRA/PG-EXTRA
- DIGing-ATC (Nedic et al. 16), whose update is
    x^{k+1} = W(x^k - \alpha y^k),
    y^{k+1} = W\big(y^k + \nabla f(x^{k+1}) - \nabla f(x^k)\big)
  (a sketch follows this list)
- accelerated distributed Nesterov gradient descent (Acc-DNGD-SC) (Qu-Li 17)
- dual friendly optimal algorithm (OA) for distributed optimization (Uribe et al. 17)
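
For comparison, a sketch of the DIGing-ATC update quoted above on the same toy quadratic problem; the stepsize and the gradient-tracking initialization y^0 = grad f(x^0) are assumptions here, since the slide does not specify them.

    import numpy as np

    np.random.seed(0)
    W = np.array([[2/3, 1/3, 0,   0  ],
                  [1/3, 1/3, 1/3, 0  ],
                  [0,   1/3, 1/3, 1/3],
                  [0,   0,   1/3, 2/3]])
    b = np.random.randn(4)
    grad = lambda x: x - b

    alpha = 0.1                        # stepsize (assumed)
    x = np.zeros(4)
    y = grad(x)                        # gradient-tracking initialization (assumed)

    for _ in range(1000):
        x_new = W @ (x - alpha * y)                    # x-update
        y = W @ (y + grad(x_new) - grad(x))            # tracks the average gradient
        x = x_new

    print("consensus error:", np.ptp(x))
    print("distance to mean(b):", np.abs(x - b.mean()).max())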

strongly convex: same stepsize
[Figure: performance of the compared algorithms versus the number of iterations.]

strongly convex: adaptive stepsize
[Figure: performance of the compared algorithms versus the number of iterations.]

linear convergence rate bottleneck
[Figure: performance of the compared algorithms versus the number of iterations.]

nonsmooth functions
[Figure: NIDS-1/L, NIDS-1.5/L, NIDS-1.9/L, PG-EXTRA-1/L, PG-EXTRA-1.2/L, PG-EXTRA-1.3/L, and PG-EXTRA-1.4/L versus the number of iterations.]

stochastic case: shuffled
[Figure: training loss versus epochs for Decentralized, D^2, and Centralized; (a) TransferLearning, (b) LeNet.]

stochastic case: unshuffled
[Figure: training loss versus epochs for Decentralized, D^2, and Centralized; (a) TransferLearning, (b) LeNet.]

conclusion and open questions

conclusion:
- optimal bounds for EXTRA/PG-EXTRA
- new algorithm NIDS

open questions:
- network construction
- preconditioning
- acceleration?
- directed network?
- dynamical network?
- asynchronous?

Paper 1: Z. Li, W. Shi, and M. Yan, A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates, arXiv preprint (code available).
Paper 2: Z. Li and M. Yan, A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization, arXiv preprint.
Paper 3: H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, D^2: decentralized training over decentralized data, ICML 2018.

Thank You!


More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

Preconditioning via Diagonal Scaling

Preconditioning via Diagonal Scaling Preconditioning via Diagonal Scaling Reza Takapoui Hamid Javadi June 4, 2014 1 Introduction Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time.

More information

Sparse Gaussian conditional random fields

Sparse Gaussian conditional random fields Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian

More information

Distributed Computation of Quantiles via ADMM

Distributed Computation of Quantiles via ADMM 1 Distributed Computation of Quantiles via ADMM Franck Iutzeler Abstract In this paper, we derive distributed synchronous and asynchronous algorithms for computing uantiles of the agents local values.

More information

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)

More information

SIGNIFICANT increase in amount of private distributed

SIGNIFICANT increase in amount of private distributed 1 Distributed DC Optimal Power Flow for Radial Networks Through Partial Primal Dual Algorithm Vahid Rasouli Disfani, Student Member, IEEE, Lingling Fan, Senior Member, IEEE, Zhixin Miao, Senior Member,

More information

A Distributed Newton Method for Network Utility Maximization

A Distributed Newton Method for Network Utility Maximization A Distributed Newton Method for Networ Utility Maximization Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie Abstract Most existing wor uses dual decomposition and subgradient methods to solve Networ Utility

More information

Complex Laplacians and Applications in Multi-Agent Systems

Complex Laplacians and Applications in Multi-Agent Systems 1 Complex Laplacians and Applications in Multi-Agent Systems Jiu-Gang Dong, and Li Qiu, Fellow, IEEE arxiv:1406.186v [math.oc] 14 Apr 015 Abstract Complex-valued Laplacians have been shown to be powerful

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Continuous-time Distributed Convex Optimization with Set Constraints

Continuous-time Distributed Convex Optimization with Set Constraints Preprints of the 19th World Congress The International Federation of Automatic Control Cape Town, South Africa. August 24-29, 214 Continuous-time Distributed Convex Optimization with Set Constraints Shuai

More information

Smoothing Proximal Gradient Method. General Structured Sparse Regression

Smoothing Proximal Gradient Method. General Structured Sparse Regression for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:

More information