ADMM and Fast Gradient Methods for Distributed Optimization
1 ADMM and Fast Gradient Methods for Distributed Optimization
João Xavier, Instituto de Sistemas e Robótica (ISR), Instituto Superior Técnico (IST)
European Control Conference, ECC'13, July 16, 2013
2 Joint work
Fast gradient methods: Dušan Jakovetić (IST-CMU), José M. F. Moura (CMU)
ADMM: João Mota (IST-CMU), Markus Püschel (ETH), Pedro Aguiar (IST)
3 Multi-agent optimization
minimize  f(x) := f_1(x) + f_2(x) + ... + f_n(x)
subject to  x ∈ X
f_i is convex, private to agent i
X ⊆ R^d is closed, convex (hereafter, d = 1)
f* = inf_{x ∈ X} f(x) is attained at x*
network is connected and static
applications: distributed learning, cognitive radio, consensus, ...
4 Distributed subgradient method
Update at each agent i (with constant stepsize):
x_i(t) = P_X ( Σ_{j ∈ N_i} W_ij x_j(t−1) − α ∇f_i(x_i(t−1)) )
N_i is the neighborhood of node i (including i)
W_ij are weights
α > 0 is the stepsize
∇f_i(x) is a subgradient of f_i at x
P_X is the projector onto X
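As a concrete illustration, here is a minimal Python/NumPy sketch of this update; the quadratic losses, Metropolis-style weight matrix, interval constraint set, and stepsize are illustrative assumptions, not values from the talk:

```python
import numpy as np

def distributed_subgradient(subgrads, W, proj, x0, alpha, iters):
    """Distributed projected subgradient method (d = 1: one scalar per agent).

    subgrads[i](x) returns a subgradient of f_i at x, W is a symmetric
    stochastic weight matrix matching the graph (W_ij = 0 if j is not a
    neighbor of i), proj projects onto X, alpha is the constant stepsize.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        mixed = W @ x                      # each agent averages its neighbors' estimates
        grads = np.array([subgrads[i](x[i]) for i in range(len(x))])
        x = proj(mixed - alpha * grads)    # x_i(t) = P_X( sum_j W_ij x_j(t-1) - alpha * subgrad_i(x_i(t-1)) )
    return x

# toy setup (assumed): f_i(x) = (x - theta_i)^2 on X = [-10, 10], 3-node path graph
theta = np.array([1.0, 2.0, 3.0])
subgrads = [lambda x, th=th: 2.0 * (x - th) for th in theta]
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])
proj = lambda v: np.clip(v, -10.0, 10.0)
print(distributed_subgradient(subgrads, W, proj, np.zeros(3), alpha=0.05, iters=1000))
```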
5 Several variations exist and time-varying networks are supported
Small sample of representative work:
J. Tsitsiklis et al., Distributed asynchronous deterministic and stochastic gradient optimization algorithms, IEEE TAC, 31(9), 1986
A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE TAC, 54(1), 2009
B. Johansson et al., A randomized incremental subgradient method for distributed optimization in networked systems, SIAM J. Optimization, 20(3), 2009
A. Nedić et al., Constrained consensus and optimization in multi-agent networks, IEEE TAC, 55(4), 2010
J. Duchi et al., Dual averaging for distributed optimization: convergence and network scaling, IEEE TAC, 57(3), 2012
6 Convergence analysis: under appropriate conditions,
f(x_i(t)) − f* = O( α + 1/(α t) )
For optimized α, O(1/ε²) iterations suffice to reach ε-suboptimality.
Distributed subgradient method in matrix form:
x(t) = P ( W x(t−1) − α ∇F(x(t−1)) )
x(t) := (x_1(t), ..., x_n(t)) is the network state
F(x_1, ..., x_n) := f_1(x_1) + ... + f_n(x_n)
∇F(x_1, ..., x_n) = ( ∇f_1(x_1), ..., ∇f_n(x_n) )
P is the projector onto X^n
7 Interpretation: when W = I − αρL (L = network Laplacian, ρ > 0),
x(t) = P ( x(t−1) − α ∇Ψ_ρ(x(t−1)) )
is the classical subgradient method applied to the penalized objective
Ψ_ρ(x) = F(x) + ρ x^⊤ L x = Σ_{i=1}^n f_i(x_i) + ρ Σ_{i∼j} (x_i − x_j)²
Key idea: apply instead Nesterov's fast gradient method
x(t) = P ( y(t−1) − α ∇Ψ_ρ(y(t−1)) )
y(t) = x(t) + (t−1)/(t+2) (x(t) − x(t−1))
y(t) = (y_1(t), ..., y_n(t)) is an auxiliary variable
8 Distributed Nesterov gradient method (D-NG) with constant stepsize
x(t) = P ( W y(t−1) − α ∇F(y(t−1)) )
y(t) = x(t) + (t−1)/(t+2) (x(t) − x(t−1))
Convergence analysis: if the f_i's are differentiable, the ∇f_i's are Lipschitz continuous, and X is compact,
f(x_i(t)) − f* = O( 1/ρ + 1/t² + ρ/t² )
For optimized ρ, O(1/ε) iterations suffice to reach ε-suboptimality.
D. Jakovetić et al., Distributed Nesterov-like gradient algorithms, IEEE 51st Annual Conference on Decision and Control (CDC), 2012
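A matching Python/NumPy sketch of the D-NG iteration with constant stepsize; as before, the losses, weights, constraint set, and stepsize are assumed toy values:

```python
import numpy as np

def dng_constant(grads, W, proj, x0, alpha, iters):
    """D-NG with constant stepsize: mix, take a gradient step, add Nesterov momentum."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for t in range(1, iters + 1):
        g = np.array([grads[i](y[i]) for i in range(len(y))])
        x = proj(W @ y - alpha * g)                    # x(t) = P( W y(t-1) - alpha * grad F(y(t-1)) )
        y = x + (t - 1.0) / (t + 2.0) * (x - x_prev)   # y(t) = x(t) + (t-1)/(t+2) (x(t) - x(t-1))
        x_prev = x
    return x_prev

# same assumed toy setup as in the subgradient sketch above
theta = np.array([1.0, 2.0, 3.0])
grads = [lambda x, th=th: 2.0 * (x - th) for th in theta]
W = np.array([[2/3, 1/3, 0.0], [1/3, 1/3, 1/3], [0.0, 1/3, 2/3]])
proj = lambda v: np.clip(v, -10.0, 10.0)
print(dng_constant(grads, W, proj, np.zeros(3), alpha=0.05, iters=1000))
```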
9 Proof
Step 1: plug in Nesterov's classical analysis
‖∇f_i(x) − ∇f_i(y)‖ ≤ L ‖x − y‖ implies ‖∇Ψ_ρ(x) − ∇Ψ_ρ(y)‖ ≤ L_ρ ‖x − y‖ for L_ρ = L + 2ρ λ_max(L)
notation: Ψ_ρ* := inf_{x ∈ X^n} Ψ_ρ(x) is attained at x_ρ*
with α := 1/L_ρ, the classical analysis yields
Ψ_ρ(x(t)) − Ψ_ρ* ≤ L_ρ ‖x(0) − x_ρ*‖² / t² ≤ L_ρ B / t²   (1)
for some B ≥ 0 (since X is compact)
10 Step 2: relate f(x_i) to Ψ_ρ(x), x = (x_1, ..., x_n)
f(x_i) = Σ_{j=1}^n f_j(x_i)
       = Σ_{j=1}^n f_j(x_j) + ρ x^⊤ L x   [= Ψ_ρ(x)]
         + Σ_{j=1}^n ( f_j(x_i) − f_j(x_j) ) − ρ x^⊤ L x   [= Δ(x)]   (2)
Step 3: upper bound Δ(x) ≤ C/ρ for some C ≥ 0
use Lipschitz continuity of the f_j's to obtain
Σ_{j=1}^n ( f_j(x_i) − f_j(x_j) ) ≤ G Σ_{j=1}^n |x_i − x_j|   (3)
for some G ≥ 0
11 Since L1 = 0,
x^⊤ L x = (x − x_i 1)^⊤ L (x − x_i 1)   (4)
Combine (3) and (4) to obtain
Δ(x) ≤ G ‖x − x_i 1‖_1 − ρ (x − x_i 1)^⊤ L (x − x_i 1) = G ‖x̃‖_1 − ρ x̃^⊤ L̃ x̃
x̃ is x − x_i 1 with the ith entry removed
L̃ is L with the ith row and ith column removed
it is easy to see that L̃ is positive definite (network is connected)
12 It follows that
Δ(x) ≤ max_y { G ‖y‖_1 − ρ y^⊤ L̃ y }
     = max_y max_{‖z‖_∞ ≤ 1} { G z^⊤ y − ρ y^⊤ L̃ y }
     = max_{‖z‖_∞ ≤ 1} max_y { G z^⊤ y − ρ y^⊤ L̃ y }
     = (1/ρ) (G²/4) max_{‖z‖_∞ ≤ 1} z^⊤ L̃^{−1} z =: C/ρ   (5)
Use Ψ_ρ* ≤ f*, and combine (1), (2) with (5) to conclude
f(x_i(t)) − f* ≤ (L + 2ρ λ_max(L)) B / t² + C/ρ = O( 1/ρ + 1/t² + ρ/t² )
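For the inner maximization in (5) (a concave quadratic in y for each fixed z), setting the gradient to zero at y = (G/(2ρ)) L̃^{−1} z gives
\[
\max_{y}\ \bigl\{ G\, z^\top y - \rho\, y^\top \tilde L y \bigr\}
= \frac{G^2}{4\rho}\, z^\top \tilde L^{-1} z ,
\]
which is how the constant C = (G²/4) max_{‖z‖_∞ ≤ 1} z^⊤ L̃^{−1} z arises.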
13 Numerical example: distributed logistic regression
minimize over (s, r):  Σ_{i=1}^n Σ_{j=1}^5 φ( −b_ij (s^⊤ a_ij + r) )   (the inner sum over j is f_i(s, r))
{(a_ij, b_ij) ∈ R³ × R : j = 1, ..., 5}: training data for agent i
φ(t) = log(1 + e^t)
geometric graph, n = 20 nodes and 86 edges
14 Constant stepsize
e_f(t) := (1/n) Σ_{i=1}^n ( f(x_i(t)) − f* ) / f*
[Plot: e_f versus iteration number k]
Red: D-NG [1], Blue: subgradient [2], Black: dual averaging [3]
[1] D. Jakovetić et al., Distributed Nesterov-like gradient algorithms, IEEE 51st Annual Conference on Decision and Control (CDC), 2012
[2] A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE TAC, 54(1), 2009
[3] J. Duchi et al., Dual averaging for distributed optimization: convergence and network scaling, IEEE TAC, 57(3), 2012
15 Diminishing stepsize
e_f(t) := (1/n) Σ_{i=1}^n ( f(x_i(t)) − f* ) / f*
[Plot: e_f versus iteration number k]
D-NG: α(t) = 1/t
Subgradient, dual averaging: α(t) = 1/√t
16 Distributed Nesterov gradient method (D-NG) with diminishing stepsize
Unconstrained problem:
minimize  f(x) := f_1(x) + f_2(x) + ... + f_n(x),  x ∈ R^d
Diminishing stepsize α(t) = c/(t+1)
x(t) = W y(t−1) − α(t−1) ∇F(y(t−1))   (6)
y(t) = x(t) + (t−1)/(t+2) (x(t) − x(t−1))   (7)
D. Jakovetić et al., Fast cooperative distributed learning, 46th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2012
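For the unconstrained, diminishing-stepsize variant in (6)-(7), the only changes to the sketch above are the stepsize schedule α(t) = c/(t+1) and the absence of the projection; c is an assumed tuning constant:

```python
import numpy as np

def dng_diminishing(grads, W, x0, c, iters):
    """D-NG with diminishing stepsize alpha(t) = c/(t+1), unconstrained case (equations (6)-(7))."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for t in range(1, iters + 1):
        alpha = c / t                                  # alpha(t-1) = c / ((t-1) + 1)
        g = np.array([grads[i](y[i]) for i in range(len(y))])
        x = W @ y - alpha * g                          # (6)
        y = x + (t - 1.0) / (t + 2.0) * (x - x_prev)   # (7)
        x_prev = x
    return x_prev
```

With the toy setup above it can be called as, e.g., dng_diminishing(grads, W, np.zeros(3), c=0.5, iters=2000).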
17 Convergence analysis: if the f_i's are differentiable, the ∇f_i's are bounded and L-Lipschitz continuous, and W is symmetric, stochastic and positive definite,
f(x_i(t)) − f* = O( log t / t )
dependence on the network spectral gap 1 − λ₂(W) is also known
agents may ignore L and λ₂(W)
Sample of related work (different assumptions):
K. Tsianos and M. Rabbat, Distributed strongly convex optimization, 50th Allerton Conference on Communication, Control and Computing, 2012
E. Ghadimi et al., Accelerated gradient methods for networked optimization, arXiv preprint, 2012
18 Sketch of proof
Step 1: look at the network averages
x̄(t) := (1/n) Σ_{i=1}^n x_i(t),  ȳ(t) := (1/n) Σ_{i=1}^n y_i(t)
From (6) and (7),
x̄(t) = ȳ(t−1) − (α(t−1)/n) Σ_{i=1}^n ∇f_i(y_i(t−1))   [α(t−1)/n =: 1/L(t−1);  Σ_i ∇f_i(y_i(t−1)) ≈ ∇f(ȳ(t−1))]
ȳ(t) = x̄(t) + (t−1)/(t+2) (x̄(t) − x̄(t−1))
Interpretation: inexact Nesterov gradient method
19 Using ideas from optimization with inexact oracles [4],
f(x̄(t)) − f* ≤ (n/(c t²)) ‖x̄(0) − x*‖² + (L/t²) Σ_{s=0}^{t−1} ((s+2)²/(s+1)) ‖δ_y(s)‖²   (8)
where δ_y(t) := y(t) − ȳ(t)1 = ( y_1(t) − ȳ(t), ..., y_n(t) − ȳ(t) )
[4] O. Devolder et al., First-order methods of smooth convex optimization with inexact oracle, submitted, Mathematical Programming, 2011
20 Step 2: show ‖δ_y(t)‖ = O(1/t)
Rewrite (6) and (7) as the time-varying linear system
[ δ_x(t) ; δ_x(t−1) ] = A(t) [ δ_x(t−1) ; δ_x(t−2) ] + u(t−1)   (9)
where
A(t) := [ ((2t−1)/(t+1)) W̃   −((t−2)/(t+1)) W̃ ; I   0 ],   u(t) := [ −α(t) (I − J) ∇F(y(t)) ; 0 ]
δ_x(t) := x(t) − x̄(t)1,  J := (1/n) 1 1^⊤,  W̃ := W − J
‖u(t)‖ = O(1/t) due to α(t) = c/(t+1) and the bounded gradient assumption
21 Fact: if x(t) = λ x(t−1) + O(1/t) with |λ| < 1, then x(t) = O(1/t)   (10)
Hand-waving argument: upon approximating
A(t) ≈ [ 2W̃   −W̃ ; I   0 ]
and diagonalizing, system (9) reduces to (10)
Since ‖δ_x(t)‖ = O(1/t),
δ_y(t) = δ_x(t) + (t−1)/(t+2) ( δ_x(t) − δ_x(t−1) ) = O(1/t)
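A quick sanity check of this fact, assuming for simplicity 0 ≤ λ < 1, c ≥ 0 and x(t) = λ x(t−1) + c/t: unrolling the recursion and splitting the sum at s = t/2,
\[
x(t) = \lambda^t x(0) + \sum_{s=1}^{t} \lambda^{t-s}\,\frac{c}{s}
\;\le\; \lambda^t x(0) + \lambda^{t/2}\, c \sum_{s=1}^{\lfloor t/2 \rfloor}\frac{1}{s} + \frac{2c}{t}\sum_{k \ge 0}\lambda^{k}
\;=\; O\!\left(\frac{1}{t}\right),
\]
since the first two terms decay geometrically (up to a log factor) and the last equals 2c/((1−λ) t).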
22 Step 3: relate f(x_i) to f(x̄)
f(x_i) = Σ_{j=1}^n f_j(x_i) = Σ_{j=1}^n f_j(x̄) + Σ_{j=1}^n ( f_j(x_i) − f_j(x̄) ) = f(x̄) + Δ(x)   (11)
By the bounded gradient assumption,
Δ(x) = Σ_{j=1}^n ( f_j(x_i) − f_j(x̄) ) ≤ G n |x_i − x̄| ≤ G n ‖δ_x‖   (12)
Combine (8), (11) and (12) to conclude
f(x_i(t)) − f* = O( log t / t )
23 Numerical example: acoustic source localization in sensor networks
agent i measures y_i = 1/‖x − r_i‖² + noise, where r_i = position of agent i
goal: determine the source position x
convex approach [5]:
minimize_x  Σ_{i=1}^n dist²(x, C_i)
with C_i = { x : ‖x − r_i‖² ≤ 1/y_i }
geometric graph, n = 70 nodes and 99 edges
[5] A. O. Hero and D. Blatt, Sensor network source localization via projection onto convex sets (POCS), IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2005
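To make the convex reformulation concrete, here is a small sketch of one local term dist²(x, C_i) for the ball C_i = {z : ‖z − r_i‖ ≤ 1/√y_i}, together with its gradient 2(x − P_{C_i}(x)); the sensor position and measurement are made-up test values:

```python
import numpy as np

def dist2_ball(x, center, radius):
    """Squared Euclidean distance from x to the ball {z : ||z - center|| <= radius}
    and its gradient 2 * (x - P_ball(x))."""
    d = x - center
    nrm = np.linalg.norm(d)
    if nrm <= radius:                      # x already inside the ball: distance 0, gradient 0
        return 0.0, np.zeros_like(x)
    proj = center + radius * d / nrm       # projection onto the ball's boundary
    return (nrm - radius) ** 2, 2.0 * (x - proj)

# made-up test values: sensor at r_i, measurement y_i -> radius 1/sqrt(y_i)
r_i, y_i = np.array([0.3, 0.8]), 4.0
val, grad = dist2_ball(np.array([1.5, 0.8]), r_i, 1.0 / np.sqrt(y_i))
print(val, grad)                           # 0.49 and [1.4, 0.0] for these values
```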
24 e_f(t) := (1/n) Σ_{i=1}^n ( f(x_i(t)) − f* ) / f*
[Plot: avg. relative error (log₁₀) versus iteration number k (log₁₀)]
Red: D-NG [6], α(t) = 1/(t+1)
Others: distributed subgradient [7] with α(t) = 1/(t+1)^{1/2}, 1/(t+1)^{1/3}, 1/(t+1)^{1/10}, and 1/(t+1)
[6] D. Jakovetić et al., Fast cooperative distributed learning, 46th Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2012
[7] A. Nedić and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE TAC, 54(1), 2009
25 Numerical example: distributed regularized logistic regression
minimize over (s, r):  Σ_{i=1}^n [ φ( −b_i (s^⊤ a_i + r) ) + β ‖s‖² ]   (each bracketed term is f_i(s, r))
(a_i, b_i): training data for agent i
φ(t) = log(1 + e^t)
geometric graph, n = 20 nodes and 67 edges
26 e_f(t) := (1/n) Σ_{i=1}^n ( f(x_i(t)) − f* ) / f*
[Plot: avg. relative error (log₁₀) versus iteration number k (log₁₀)]
Red: D-NG, α(t) = 1/(t+1)
Others: distributed subgradient with α(t) = 1/(t+1)^γ, γ ∈ {1, 1/2, 1/3, ...}
27 Distributed Nesterov gradient method with consensus iterations (D-NC)
x(t) = W^{a(t)} ( y(t−1) − α ∇F(y(t−1)) )
y(t) = W^{b(t)} ( x(t) + (t−1)/(t+2) (x(t) − x(t−1)) )
a(t) = ⌈ log t / (−log λ₂(W)) ⌉  and  b(t) = ⌈ (log 3 + log t) / (−log λ₂(W)) ⌉
λ₂(W) must be known by all agents
D. Jakovetić et al., Fast distributed gradient methods, arXiv preprint, 2011
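A sketch of D-NC in the same Python/NumPy style; the consensus schedules a(t) and b(t) are passed in by the caller as functions returning positive integers (e.g. growing like log t), and the losses and weights are again assumed:

```python
import numpy as np

def dnc(grads, W, x0, alpha, iters, a, b):
    """Sketch of D-NC: each outer iteration t runs a(t) and b(t) consensus rounds
    (repeated multiplication by W) around the gradient and momentum steps."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for t in range(1, iters + 1):
        g = np.array([grads[i](y[i]) for i in range(len(y))])
        x = np.linalg.matrix_power(W, a(t)) @ (y - alpha * g)
        y = np.linalg.matrix_power(W, b(t)) @ (x + (t - 1.0) / (t + 2.0) * (x - x_prev))
        x_prev = x
    return x_prev
```

In a real implementation each multiplication by W is one communication round with the neighbors, which is where the O(t log t) communication count quoted on the next slide comes from when a(t) and b(t) grow logarithmically.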
28 Convergence analysis: if the f_i's are differentiable, the ∇f_i's are bounded and L-Lipschitz continuous, W is symmetric and stochastic, and α = 1/(2L),
f(x_i(t)) − f* = O( 1/t² )
t iterations involve O(t log t) communication rounds
dependence on the network spectral gap is also known
Similar rate guarantees in:
A. Chen and A. Ozdaglar, A fast distributed proximal gradient method, 50th Allerton Conference on Communication, Control and Computing, 2012
29 Distributed optimization via ADMM (D-ADMM)
ADMM = Alternating Direction Method of Multipliers
Recent review: S. Boyd et al., Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 2011
[Figure: star network]
30 ADMM for distributed optimization:
I. Schizas et al., Consensus in ad hoc WSNs with noisy links - part I: distributed estimation of deterministic signals, IEEE TSP, 56(1), 2008
many others
This talk: generic networks
J. Mota et al., D-ADMM: a communication-efficient distributed algorithm for separable optimization, IEEE TSP, 61(10), 2013
J. Mota et al., Distributed optimization with local domains: applications in MPC and network flows, arXiv preprint, 2013
31 Illustrative example
[Figure: 3-node network; node 1 holds f_1(x_1, x_2), node 2 holds f_2(x_2), node 3 holds f_3(x_1, x_2)]
minimize over x_1, x_2:  f_1(x_1, x_2) + f_2(x_2) + f_3(x_1, x_2)
the f_i's are proper, closed, convex functions with values in R ∪ {+∞}
the f_i's may depend on different subsets of the variables
32 minimize over x_1, x_2:  f_1(x_1, x_2) + f_2(x_2) + f_3(x_1, x_2)
Roadmap to obtain a distributed algorithm
Step 1: split each variable across the associated nodes
node 1: x_1^(1), x_2^(1);  node 2: x_2^(2);  node 3: x_1^(3), x_2^(3)
Step 2: introduce consistency constraints
minimize  f_1( x_1^(1), x_2^(1) ) + f_2( x_2^(2) ) + f_3( x_1^(3), x_2^(3) )
subject to  x_1^(1) = x_1^(3),  x_2^(1) = x_2^(3),  x_2^(1) = x_2^(2)
33 Step 3: pass to the augmented Lagrangian dual
maximize over λ_1^(1,3), λ_2^(1,3), λ_2^(1,2):  inf_x L_ρ( x; λ_1^(1,3), λ_2^(1,3), λ_2^(1,2) )
with
L_ρ( x; λ_1^(1,3), λ_2^(1,3), λ_2^(1,2) )
  = f_1( x_1^(1), x_2^(1) ) + f_2( x_2^(2) ) + f_3( x_1^(3), x_2^(3) )
  + ⟨ λ_1^(1,3), x_1^(1) − x_1^(3) ⟩ + (ρ/2) ‖ x_1^(1) − x_1^(3) ‖²
  + ⟨ λ_2^(1,3), x_2^(1) − x_2^(3) ⟩ + (ρ/2) ‖ x_2^(1) − x_2^(3) ‖²
  + ⟨ λ_2^(1,2), x_2^(1) − x_2^(2) ⟩ + (ρ/2) ‖ x_2^(1) − x_2^(2) ‖²
34 Step 4: color the network
[Figure: node 1 receives one color; nodes 2 and 3, which are not adjacent, share the other color]
(the augmented Lagrangian L_ρ is the one on the previous slide)
35 Step 5: apply the extended ADMM
Primal update at node 1:
( x_1^(1), x_2^(1) )(t+1) = argmin_{x_1, x_2}  f_1(x_1, x_2)
  + ⟨ λ_1^(1,3)(t), x_1 ⟩ + (ρ/2) ‖ x_1 − x_1^(3)(t) ‖²
  + ⟨ λ_2^(1,3)(t), x_2 ⟩ + (ρ/2) ‖ x_2 − x_2^(3)(t) ‖²
  + ⟨ λ_2^(1,2)(t), x_2 ⟩ + (ρ/2) ‖ x_2 − x_2^(2)(t) ‖²
36 Primal update at node 2:
x_2^(2)(t+1) = argmin_{x_2}  f_2(x_2) − ⟨ λ_2^(1,2)(t), x_2 ⟩ + (ρ/2) ‖ x_2 − x_2^(1)(t+1) ‖²
Primal update at node 3:
( x_1^(3), x_2^(3) )(t+1) = argmin_{x_1, x_2}  f_3(x_1, x_2)
  − ⟨ λ_1^(1,3)(t), x_1 ⟩ + (ρ/2) ‖ x_1 − x_1^(1)(t+1) ‖²
  − ⟨ λ_2^(1,3)(t), x_2 ⟩ + (ρ/2) ‖ x_2 − x_2^(1)(t+1) ‖²
Key point: nodes 2 and 3 work in parallel (same color)
37 Step 6: dual update at all relevant nodes
λ_1^(1,3)(t+1) = λ_1^(1,3)(t) + ρ ( x_1^(1)(t+1) − x_1^(3)(t+1) )
λ_2^(1,3)(t+1) = λ_2^(1,3)(t) + ρ ( x_2^(1)(t+1) − x_2^(3)(t+1) )
λ_2^(1,2)(t+1) = λ_2^(1,2)(t) + ρ ( x_2^(1)(t+1) − x_2^(2)(t+1) )
Convergence analysis:
for 2 colors (bipartite network): classical ADMM results apply
for 3 or more colors [8]: convergence for strongly convex functions and suitable ρ
[8] D. Han and X. Yuan, A note on the alternating direction method of multipliers, JOTA, 155(1), 2012
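To see the whole color-ordered loop in one place, here is a runnable sketch of Steps 5-6 for this three-node example with made-up quadratic terms f_1(x_1,x_2) = (x_1−1)² + (x_2−2)², f_2(x_2) = (x_2−4)², f_3(x_1,x_2) = (x_1−3)² + x_2², so every argmin has a closed form obtained by setting the gradient to zero:

```python
rho = 1.0
x1_1 = x2_1 = x2_2 = x1_3 = x2_3 = 0.0       # per-node copies x_k^(i)
lam13_1 = lam13_2 = lam12_2 = 0.0            # duals for x1^(1)=x1^(3), x2^(1)=x2^(3), x2^(1)=x2^(2)

for t in range(200):
    # first color: node 1 minimizes its augmented-Lagrangian terms (closed form for quadratics)
    x1_1 = (2*1 - lam13_1 + rho*x1_3) / (2 + rho)
    x2_1 = (2*2 - lam13_2 - lam12_2 + rho*(x2_3 + x2_2)) / (2 + 2*rho)
    # second color: nodes 2 and 3 update in parallel, using node 1's fresh values
    x2_2 = (2*4 + lam12_2 + rho*x2_1) / (2 + rho)
    x1_3 = (2*3 + lam13_1 + rho*x1_1) / (2 + rho)
    x2_3 = (lam13_2 + rho*x2_1) / (2 + rho)
    # Step 6: dual updates on the three consistency constraints
    lam13_1 += rho * (x1_1 - x1_3)
    lam13_2 += rho * (x2_1 - x2_3)
    lam12_2 += rho * (x2_1 - x2_2)

print(x1_1, x2_1)   # approaches x1 = 2, x2 = 2, the minimizer of f1 + f2 + f3
```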
38 Numerical example: consensus
minimize_x  Σ_{i=1}^n (x − θ_i)²   (f_i(x) = (x − θ_i)²)
θ_i = measurement of agent i  (θ_i i.i.d. N(10, 10⁴))
Network  Model (parameters)        # Colors
1        Erdős-Rényi (0.1)          5
2        Watts-Strogatz (4, 0.4)    4
3        Barabási-Albert            3
4        Geometric (0.3)            10
5        Lattice (5 × 10)           2
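For reference, the minimizer of this separable quadratic is just the average of the measurements, which is why the problem is called consensus:
\[
x^\star = \arg\min_x \sum_{i=1}^n (x-\theta_i)^2 = \frac{1}{n}\sum_{i=1}^n \theta_i .
\]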
39 Communication steps
[Plot: communication steps versus network number, comparing D-ADMM [9] with methods A [10], B [11], C [12]]
[9] J. Mota et al., D-ADMM: a communication-efficient distributed algorithm for separable optimization, IEEE TSP, 61(10), 2013
[10] I. Schizas et al., Consensus in ad hoc WSNs with noisy links - part I: distributed estimation of deterministic signals, IEEE TSP, 56(1), 2008
[11] H. Zhu et al., Distributed in-network channel decoding, IEEE TSP, 57(10), 2009
[12] B. Oreshkin et al., Optimization and analysis of distributed averaging with short node memory, IEEE TSP, 58(5), 2010
40 Numerical example: LASSO
minimize_x ‖x‖₁ subject to ‖Ax − b‖ ≤ σ
Column partition: node i holds A_i ∈ R^{200×20}
After regularization and dualization:
minimize_λ  Σ_{i=1}^n f_i(λ),   f_i(λ) := (1/n)( b^⊤λ + σ‖λ‖ ) − inf_{x_i} { ‖x_i‖₁ + λ^⊤ A_i x_i + δ ‖x_i‖² }
41 Communication steps
[Plot: communication steps versus network number, comparing D-ADMM [13] with methods A [14], B [15]]
[13] J. Mota et al., D-ADMM: a communication-efficient distributed algorithm for separable optimization, IEEE TSP, 61(10), 2013
[14] I. Schizas et al., Consensus in ad hoc WSNs with noisy links - part I: distributed estimation of deterministic signals, IEEE TSP, 56(1), 2008
[15] H. Zhu et al., Distributed in-network channel decoding, IEEE TSP, 57(10), 2009
42 What if a variable is not connected?
Example: variable x_2 is not connected
[Figure: node 1 holds f_1(x_1), node 2 holds f_2(x_2), node 3 holds f_3(x_1, x_2); the nodes that use x_2 do not form a connected subgraph]
Propagate the variable across a Steiner tree
J. Mota et al., Distributed optimization with local domains: applications in MPC and network flows, arXiv preprint, 2013
43 Thank you!