Asynchronous Non-Convex Optimization For Separable Problem


Asynchronous Non-Convex Optimization For Separable Problem
Sandeep Kumar and Ketan Rajawat
Dept. of Electrical Engineering, IIT Kanpur, Uttar Pradesh, India
(Async-DADMM, iWML, 2 July 2016)

Distributed Optimization

A general multi-agent cooperative distributed optimization problem:

\min_{x} \; f(x) := \sum_{k=1}^{K} g_k(x) + h(x) \quad \text{s.t.} \quad x \in X \qquad (1)

where x \in \mathbb{R}^{N \times 1}, g_k : \mathbb{R}^N \to \mathbb{R} for k = 1, ..., K, and h : \mathbb{R}^N \to \mathbb{R}. The problem can be solved with a centralized architecture (with a master node) or with a decentralized architecture.

Applications: machine learning, robotics, economics, big data analytics, network optimization, signal processing. Each agent could be a sensor node, a processor, a robot, etc. The heterogeneity of the nodes, together with resource, spatial, and temporal constraints, motivates distributed and asynchronous decision making.
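
To make the setup concrete, here is a minimal Python sketch of one instance of problem (1), assuming quadratic local costs g_k and an l1 regularizer h; these functional forms and the data are illustrative assumptions, not the ones used in the talk.

```python
import numpy as np

# Minimal sketch of problem (1) with K agents sharing one variable x in R^N.
# The local losses g_k and the regularizer h below are illustrative choices.
K, N = 4, 3
rng = np.random.default_rng(0)
A = [rng.standard_normal((5, N)) for _ in range(K)]   # local data held by agent k
b = [rng.standard_normal(5) for _ in range(K)]

def g(k, x):
    """Local (smooth) cost held by agent k."""
    return 0.5 * np.linalg.norm(A[k] @ x - b[k]) ** 2

def h(x):
    """Shared (possibly non-smooth) regularizer."""
    return 0.1 * np.linalg.norm(x, 1)

def f(x):
    """Global objective f(x) = sum_k g_k(x) + h(x)."""
    return sum(g(k, x) for k in range(K)) + h(x)

print(f(np.zeros(N)))
```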

Distributed and Asynchronous Optimization: Literature

When g_k(x) is convex: there is a huge literature for both architectures in (1). Consensus-based, diffusion, gossip-based, incremental (sub)gradient, distributed (sub)gradient, distributed ADMM, dual averaging, proximal dual, block coordinate, mirror descent, alternating minimization [1]-[10].

When g_k(x) is non-convex: provably convergent solutions have been proposed only very recently, and only for the centralized architecture: non-convex ADMM, stochastic subgradient, proximal dual [11]-[14]. These are applicable to many applications, including machine learning.

Limitations of the centralized architecture: it needs a master node, or assumes sharing of a global database/variable among all nodes, and is therefore not suited for communication and sensor network applications.

Partially Separable Form [15]

Partition the variables \{x_n\}_{n=1}^N among the nodes: \{S_k\}_{k=1}^K denote disjoint subsets, \{x_n : n \in S_k\} are local to node k, and g_k(\cdot) at node k depends on the neighborhood \mathcal{N}_k:

P = \min \sum_{k=1}^{K} g_k(\{x_n\}_{n \in S_j,\, j \in \mathcal{N}_k}) + h_k(\{x_n\}_{n \in S_k}) \qquad (2)
\text{s.t.} \quad \{x_n\}_{n \in S_k} \in X_k, \quad k = 1, 2, ..., K.

Figure 1: Factor graph representation for the objective function of (2).

Decentralized Consensus Problem Formulation

Introduce copies x_{kj} of the variable x_j for j \in \mathcal{N}_k. Collect the consensus variables z = \{z_k\}, k = 1, ..., K, and the per-node blocks x_k = \{x_{kj}\}_{j \in \mathcal{N}_k}, z_k = \{z_{kj}\}_{j \in \mathcal{N}_k}, and y_k = \{y_{kj}\}_{j \in \mathcal{N}_k}:

\min_{\{x_k\}, z} \sum_{k=1}^{K} g_k(\{x_{kj}\}_{j \in \mathcal{N}_k}) + h_k(z_k) \qquad (3)
\text{s.t.} \quad x_{kj} = z_j, \quad j \in \mathcal{N}_k \qquad (4)
\qquad\;\; z_k \in X_k, \quad k = 1, ..., K \qquad (5)

Figure: bipartite graph linking the local copies x_{kj} held by g_1, ..., g_4 to the consensus variables z_1, ..., z_4.

The augmented Lagrangian is

L(\{x_k\}, z, \{y_k\}) = \sum_{k=1}^{K} \Big( g_k(x_k) + h_k(z_k) + \sum_{j \in \mathcal{N}_k} \langle y_{kj}, x_{kj} - z_j \rangle + \frac{\rho_k}{2} \sum_{j \in \mathcal{N}_k} \| x_{kj} - z_j \|^2 \Big) \qquad (6)
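
As a sketch of how a node-level implementation might evaluate the augmented Lagrangian (6), the snippet below stores the copies x_{kj}, the duals y_{kj}, and the consensus blocks z_j in dictionaries keyed by node index; the neighborhoods, penalty parameters, and the toy g_k, h_k are assumptions for illustration only.

```python
import numpy as np

# Sketch of the augmented Lagrangian (6) for the consensus reformulation (3)-(5).
# Each node k keeps copies x[k][j] of the consensus variables z[j] for j in N_k.
d = 2                                                # dimension of each block
neighbors = {1: [1, 2], 2: [1, 2, 3], 3: [2, 3]}     # N_k for each node k (illustrative)
rho = {1: 1.0, 2: 1.0, 3: 1.0}                       # per-node penalty rho_k

def augmented_lagrangian(x, z, y, g, h):
    """L = sum_k [ g_k(x_k) + h_k(z_k)
           + sum_{j in N_k} <y_kj, x_kj - z_j> + (rho_k/2)||x_kj - z_j||^2 ]."""
    total = 0.0
    for k, Nk in neighbors.items():
        total += g[k](x[k]) + h[k](z[k])
        for j in Nk:
            r = x[k][j] - z[j]
            total += y[k][j] @ r + 0.5 * rho[k] * np.dot(r, r)
    return total

# Toy usage: quadratic g_k, zero h_k, all variables initialized to zero.
g = {k: (lambda xk: sum(np.dot(v, v) for v in xk.values())) for k in neighbors}
h = {k: (lambda zk: 0.0) for k in neighbors}
x = {k: {j: np.zeros(d) for j in neighbors[k]} for k in neighbors}
y = {k: {j: np.zeros(d) for j in neighbors[k]} for k in neighbors}
z = {j: np.zeros(d) for j in neighbors}
print(augmented_lagrangian(x, z, y, g, h))
```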

A Wireless Sensor Network Example

Fundamentals of ADMM

The alternating direction method of multipliers (ADMM) blends the decomposability of dual ascent with the superior convergence properties of the method of multipliers. It solves problems of the form

\min \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c \qquad (7)

with variables x \in \mathbb{R}^n and z \in \mathbb{R}^m, where A \in \mathbb{R}^{p \times n}, B \in \mathbb{R}^{p \times m}, and c \in \mathbb{R}^p. The optimal value is p^\star = \inf \{ f(x) + g(z) : Ax + Bz = c \}. With the augmented Lagrangian

L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2} \| Ax + Bz - c \|_2^2,

the updates are, for \rho > 0,

z^{t+1} := \arg\min_z L_\rho(x^t, z, y^t) \qquad \text{(z-minimization step)}
x^{t+1} := \arg\min_x L_\rho(x, z^{t+1}, y^t) \qquad \text{(x-minimization step)}
y^{t+1} := y^t + \rho (Ax^{t+1} + Bz^{t+1} - c) \qquad \text{(dual variable update)} \qquad (8)
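
For intuition, here is a self-contained toy run of iteration (8), assuming f(x) = 0.5||x - a||^2 and g(z) = 0.5||z - b||^2 with A = I, B = -I, c = 0, so that every subproblem has a closed form; it follows the same z-then-x-then-y update order shown above.

```python
import numpy as np

# Toy ADMM for (7)-(8): constraint x = z, both subproblems in closed form.
a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
rho = 1.0
x = z = y = np.zeros(2)

for t in range(100):
    # z-minimization: argmin_z g(z) + y^T(x - z) + (rho/2)||x - z||^2
    z = (b + rho * x + y) / (1.0 + rho)
    # x-minimization: argmin_x f(x) + y^T(x - z) + (rho/2)||x - z||^2
    x = (a + rho * z - y) / (1.0 + rho)
    # dual ascent on the constraint residual x - z
    y = y + rho * (x - z)

print(x, z)   # both converge to (a + b) / 2, the minimizer of f + g under x = z
```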

Consensus update

Starting with arbitrary \{x_k^1\} and \{y_k^1\}, the updates for \{z_j^{t+1}\} are

z_j^{t+1} = \arg\min_{z_j \in X_j} h_j(z_j) + \sum_{k \in \mathcal{N}_j} \Big( \langle y_{kj}^t, x_{kj}^t - z_j \rangle + \frac{\rho_k}{2} \| x_{kj}^t - z_j \|^2 \Big)
\qquad\;\; = \mathrm{prox}_j \Big( \frac{\sum_{k \in \mathcal{N}_j} (\rho_k x_{kj}^t + y_{kj}^t)}{\sum_{k \in \mathcal{N}_j} \rho_k} \Big) \qquad (9)

where the proximal point function \mathrm{prox}_j(\cdot) is defined as

\mathrm{prox}_j(x) := \arg\min_{u \in X_j} h_j(u) + \frac{1}{2} \| x - u \|^2. \qquad (10)
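
A small sketch of the consensus update (9)-(10) at a single node j, under the illustrative assumption that h_j is an l1 penalty and X_j = R^d, so that prox_j reduces to soft-thresholding; the neighbor data are made up.

```python
import numpy as np

lam = 0.1   # illustrative l1 weight in h_j(u) = lam * ||u||_1

def prox_j(v):
    """prox_j(v) = argmin_u lam*||u||_1 + 0.5*||v - u||^2  (soft threshold)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def z_update(x_j, y_j, rho):
    """z_j^{t+1} = prox_j( sum_k (rho_k x_kj + y_kj) / sum_k rho_k ),  eq. (9);
    the sums run over the neighbors k in N_j (the keys of the dicts)."""
    num = sum(rho[k] * x_j[k] + y_j[k] for k in x_j)
    den = sum(rho[k] for k in x_j)
    return prox_j(num / den)

# Toy usage with two neighboring nodes holding copies of z_j.
x_j = {1: np.array([0.5, -0.2]), 2: np.array([0.7, 0.1])}
y_j = {1: np.zeros(2), 2: np.zeros(2)}
rho = {1: 1.0, 2: 1.0}
print(z_update(x_j, y_j, rho))
```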

Primal and Dual updates

x_k^{t+1} = \arg\min_{x_k} g_k(x_k) + \sum_{j \in \mathcal{N}_k} \langle y_{kj}^t, x_{kj} - z_j^{t+1} \rangle + \frac{\rho_k}{2} \sum_{j \in \mathcal{N}_k} \| x_{kj} - z_j^{t+1} \|^2 \qquad (11)

By linearizing g_k(x_k) at z_k^{t+1}, an approximate yet accurate update can be obtained as

x_k^{t+1} \approx \arg\min_{x_k} g_k(z_k^{t+1}) + \langle \nabla g_k(z_k^{t+1}), x_k - z_k^{t+1} \rangle + \sum_{j \in \mathcal{N}_k} \langle y_{kj}^t, x_{kj} - z_j^{t+1} \rangle + \frac{\rho_k}{2} \sum_{j \in \mathcal{N}_k} \| x_{kj} - z_j^{t+1} \|^2 \qquad (12)

where the vector [z_k]_j := z_j for all j \in \mathcal{N}_k and zero otherwise. The approximate update of x_{kj}^{t+1} thus becomes

x_{kj}^{t+1} = z_j^{t+1} - \frac{1}{\rho_k} \big( [\nabla g_k(z_k^{t+1})]_j + y_{kj}^t \big) \;\; \text{for } j \in \mathcal{N}_k, \quad \text{and} \quad x_{kj}^{t+1} = 0 \;\; \text{for } j \notin \mathcal{N}_k.

The dual updates are

y_{kj}^{t+1} = y_{kj}^t + \rho_k \big( x_{kj}^{t+1} - z_j^{t+1} \big), \quad j \in \mathcal{N}_k. \qquad (13)
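
The linearized update (12) and dual step (13) reduce to simple closed-form expressions per neighbor block. The sketch below assumes a toy quadratic g_k so that its gradient is explicit, and represents each block x_kj, y_kj, z_j as a dictionary entry; all data are illustrative.

```python
import numpy as np

rho_k = 2.0
neighbors_k = [1, 2]   # N_k, illustrative

def grad_g_k(z_k):
    """Illustrative gradient of g_k(x_k) = 0.5 * sum_j ||x_kj||^2, one block per j."""
    return {j: z_k[j] for j in neighbors_k}

def primal_dual_step(z_new, y_k):
    """x_kj^{t+1} = z_j^{t+1} - (1/rho_k)([grad g_k(z^{t+1})]_j + y_kj^t),   eq. (12)
       y_kj^{t+1} = y_kj^t + rho_k (x_kj^{t+1} - z_j^{t+1}).                 eq. (13)"""
    grad = grad_g_k(z_new)
    x_new = {j: z_new[j] - (grad[j] + y_k[j]) / rho_k for j in neighbors_k}
    y_new = {j: y_k[j] + rho_k * (x_new[j] - z_new[j]) for j in neighbors_k}
    return x_new, y_new

z_new = {1: np.array([0.5, 0.0]), 2: np.array([0.1, -0.3])}
y_k = {1: np.zeros(2), 2: np.zeros(2)}
x_new, y_new = primal_dual_step(z_new, y_k)
print(x_new, y_new)
```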

Asynchronous updates

Skipping of the g_k(\cdot) or \mathrm{prox}_k(\cdot) calculations, and of communications, is allowed for some iterations. Let S^t be the set of nodes that carry out the update at time t; then the update can be written as

z_j^{t+1} = \mathrm{prox}_j \Big( \frac{\sum_{k \in \mathcal{N}_j} (\rho_k x_{kj}^t + y_{kj}^t)}{\sum_{k \in \mathcal{N}_j} \rho_k} \Big) \;\; \text{if } j \in S^t, \qquad z_j^{t+1} = z_j^t \;\; \text{if } j \notin S^t, \qquad (14)

with \mathrm{prox}_j(x) := \arg\min_{u \in X_j} h_j(u) + \frac{1}{2} \| x - u \|^2 as before. The x update uses the latest available gradient \nabla g_k(z_k^{[t+1]}), where t + 1 - T \le [t+1] \le t + 1 for some T < \infty:

x_{kj}^{t+1} = z_j^{t+1} - \frac{1}{\rho_k} \big( [\nabla g_k(z_k^{[t+1]})]_j + y_{kj}^t \big) \;\; \text{for } j \in \mathcal{N}_k, \quad \text{and} \quad x_{kj}^{t+1} = 0 \;\; \text{for } j \notin \mathcal{N}_k. \qquad (15)

y_{kj}^{t+1} = y_{kj}^t + \rho_k \big( x_{kj}^{t+1} - z_j^{t+1} \big), \quad j \in \mathcal{N}_k. \qquad (16)
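
To illustrate the asynchronous rules (14)-(16), the sketch below keeps a short history of iterates so that a stale copy z^{[t+1]} (at most T iterations old) can be used in the gradient, and only nodes in the active set S^t refresh their consensus block; the active set, delay choice, and values are illustrative assumptions.

```python
import numpy as np

T = 3                         # maximum allowed staleness
history = []                  # past iterates z^1, z^2, ... appended each round

def stale_iterate():
    """Oldest iterate still usable as z^{[t+1]} (within T rounds of the newest)."""
    return history[max(0, len(history) - 1 - T)]

def z_async_update(z, S_t, fresh_value):
    """Nodes j in S_t take the freshly computed prox value (14); others keep z_j^t."""
    return {j: (fresh_value[j] if j in S_t else z[j]) for j in z}

# Toy usage: two scalar consensus variables, node 1 active, node 2 idle.
z = {1: np.array([1.0]), 2: np.array([2.0])}
history.append(z)
fresh = {1: np.array([0.8]), 2: np.array([1.5])}
z_next = z_async_update(z, S_t={1}, fresh_value=fresh)
history.append(z_next)
print(z_next)            # node 2 keeps its old value 2.0
print(stale_iterate())   # with only two rounds stored, this is still z^1
```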

Async-ADMM Algorithm (at node k)

Set t = 1; initialize \{x_{kj}^1, y_{kj}^1, z_j^1\} for all j \in \mathcal{N}_k.
For t = 1, 2, ...:
  - (Optional) Send \{\rho_k x_{kj}^t + y_{kj}^t\} to the neighbors j \in \mathcal{N}_k.
  - If \{\rho_j x_{jk}^t + y_{jk}^t\} is received from all j \in \mathcal{N}_k: (optional) update z_k^{t+1} as in (9) and transmit it to each j \in \mathcal{N}_k.
  - If z_j^{t+1} is not received from some j \in \mathcal{N}_k: set z_j^{t+1} = z_j^t.
  - (Optional) Calculate the gradient \nabla g_k(z_k^{t+1}).
  - Update the primal variable x_k^{t+1} as in (15).
  - If \| x_k^{t+1} - x_k^t \| \le \delta: terminate the loop.

Assumption 1. For each node k, the gradient of the component function g_k is Lipschitz continuous, that is, there exists L_k > 0 such that for all x, x' \in \mathrm{dom}\, g_k,

\| \nabla g_k(x) - \nabla g_k(x') \| \le L_k \| x - x' \|. \qquad (17)

Assumption 2. The set X_k is closed, convex, and compact, and the function g_k(x) is bounded from below over X_k.

Assumption 3. For each node k, the step size \rho_k is chosen large enough that \alpha_k > 0 and \beta_k > 0, where

\alpha_k := \rho_k - \frac{7 L_k + 1}{2} - \frac{N_k L_k^2 (T+1)^2}{2 \rho_k} - \frac{N_k L_k T^2}{2}, \qquad \beta_k := \rho_k - \frac{7 L_k}{2} - \frac{\rho_k}{2}. \qquad (18)
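
Assumption 1 is easy to check for smooth local costs. As an illustration (not from the talk), for a quadratic g_k(x) = 0.5||A_k x - b_k||^2 the Lipschitz constant of the gradient is the largest eigenvalue of A_k^T A_k; the snippet below computes it and spot-checks (17) numerically on random pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
A_k = rng.standard_normal((20, 5))                   # illustrative local data
L_k = np.linalg.eigvalsh(A_k.T @ A_k).max()          # Lipschitz constant of grad g_k

# Gradient differences do not depend on b_k, since the -A_k^T b_k term cancels.
grad = lambda x: A_k.T @ (A_k @ x)

# Spot-check ||grad g_k(x) - grad g_k(x')|| <= L_k ||x - x'|| on random pairs.
for _ in range(5):
    x, xp = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(grad(x) - grad(xp)) <= L_k * np.linalg.norm(x - xp) + 1e-9

print("Lipschitz constant L_k ~", round(float(L_k), 3))
```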

Lemma
(a) Starting from any time t = t_0, there exists T < \infty such that

L(\{x_k^{T+t_0}\}, z^{T+t_0}, \{y_k^{T+t_0}\}) \le L(\{x_k^{t_0}\}, z^{t_0}, \{y_k^{t_0}\}) - \sum_{i=t_0}^{T+t_0-1} \sum_{k=1}^{K} \frac{\beta_k}{2} \sum_{j \in \mathcal{N}_k} \| x_{kj}^{i+1} - x_{kj}^{i} \|^2 - \sum_{i=t_0}^{T+t_0} \sum_{k=1}^{K} \alpha_k \sum_{j \in \mathcal{N}_k} \| z_j^{i+1} - z_j^{i} \|^2. \qquad (19)

(b) The augmented Lagrangian values in (6) are bounded from below, i.e., for any time t \ge 1 the Lagrangian satisfies

L(\{x_k^t\}, z^t, \{y_k^t\}) \ge P - \sum_{k=1}^{K} \frac{L_k}{2} \sum_{j \in \mathcal{N}_k} \mathrm{diam}^2(X_j) > -\infty.

Theorem
(a) The iterates generated by the asynchronous algorithm converge in the following sense:

\lim_{t \to \infty} \| z^{t+1} - z^t \| = 0, \qquad (20a)
\lim_{t \to \infty} \| x_{kj}^{t+1} - x_{kj}^t \| = 0, \quad j \in \mathcal{N}_k, \qquad (20b)
\lim_{t \to \infty} \| y_{kj}^{t+1} - y_{kj}^t \| = 0, \quad j \in \mathcal{N}_k. \qquad (20c)

(b) For each k \le K and j \in \mathcal{N}_k, denote the limit points of the sequences \{z^t\}, \{x_{kj}^t\}, and \{y_{kj}^t\} by z^\star, x_{kj}^\star, and y_{kj}^\star, respectively. Then \{\{z^\star\}, \{x_{kj}^\star\}, \{y_{kj}^\star\}\} is a stationary point of (3) and satisfies

\nabla g_k(x_k^\star) + y_k^\star = 0, \quad k = 1, ..., K, \qquad (21a)
\sum_{j \in \mathcal{N}_k} y_{kj}^\star \in \partial h_k(z) \big|_{z = z_k^\star}, \quad k = 1, ..., K, \qquad (21b)
x_{kj}^\star = z_j^\star \in X_j, \quad j \in \mathcal{N}_k, \; k = 1, ..., K. \qquad (21c)

Practical examples of the partially separable form

Distributed cooperative localization over networks:

\hat{X} = \arg\min_{X \in B} \sum_{k=1}^{K} g_k(\{x_j\}_{j \in \mathcal{N}_k}) \qquad (22)

g_k(\{x_j\}_{j \in \mathcal{N}_k}) = \sum_{j \in \mathcal{N}_k} w_{kj} \big( (\delta_{kj} - \| x_k - x_j \|_2)^2 + \epsilon \big) \qquad (23)

Figure 2: Cooperative Localization Example
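
A minimal sketch of the local localization cost at node k, assuming the smoothed squared range-mismatch form written in (23); the weights, range measurements, and positions are made-up values for illustration.

```python
import numpy as np

eps = 1e-3   # smoothing constant epsilon

def g_k(x_k, x_nbrs, delta, w):
    """g_k = sum_j w_kj * ((delta_kj - ||x_k - x_j||)^2 + eps),  as in (23)."""
    return sum(
        w[j] * ((delta[j] - np.linalg.norm(x_k - x_j)) ** 2 + eps)
        for j, x_j in x_nbrs.items()
    )

# Toy usage: node k at the origin with two neighbors and noisy range measurements.
x_k = np.array([0.0, 0.0])
x_nbrs = {1: np.array([1.0, 0.0]), 2: np.array([0.0, 2.0])}
delta = {1: 1.1, 2: 1.9}    # measured ranges delta_kj
w = {1: 1.0, 2: 0.5}        # weights w_kj
print(g_k(x_k, x_nbrs, delta, w))
```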

Thank You

References
[1] Chang, Tsung-Hui, Angelia Nedic, and Anna Scaglione. Distributed constrained optimization by consensus-based primal-dual perturbation method. IEEE Transactions on Automatic Control 59.6 (2014): 1524-1538.
[2] Lobel, Ilan, and Asuman Ozdaglar. Distributed subgradient methods for convex optimization over random networks. IEEE Transactions on Automatic Control 56.6 (2011): 1291-1306.
[3] Nedic, Angelia, Dimitri P. Bertsekas, and Vivek S. Borkar. Distributed asynchronous incremental subgradient methods. (2000).
[4] Duchi, John C., Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57.3 (2012): 592-606.
[5] Chen, Jianshu, and Ali H. Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing 60.8 (2012): 4289-4305.

[6] Zhang, Ruiliang, and James T. Kwok. Asynchronous Distributed ADMM for Consensus Optimization. ICML, 2014.
[7] Richtarik, Peter, and Martin Takac. Distributed coordinate descent method for learning with big data. (2013).
[8] Dekel, Ofer, et al. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research 13.Jan (2012): 165-202.
[9] Boyd, Stephen, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3.1 (2011): 1-122.
[10] Boyd, Stephen, et al. Gossip algorithms: Design, analysis and applications. Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 3. IEEE, 2005.

[11] Hong, Mingyi. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM based approach. arXiv preprint arXiv:1412.6058 (2014).
[12] Davis, Damek. The Asynchronous PALM Algorithm for Nonsmooth Nonconvex Problems. arXiv preprint arXiv:1604.00526 (2016).
[13] Ghadimi, Saeed, and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23.4 (2013): 2341-2368.
[14] Hong, Mingyi. Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications. arXiv preprint arXiv:1604.00543 (2016).
[15] Kumar, Sandeep, Rahul Jain, and Ketan Rajawat. Asynchronous Optimization Over Heterogeneous Networks via Consensus ADMM. arXiv preprint arXiv:1605.00076 (2016).