Asynchronous Non-Convex Optimization For Separable Problem


Asynchronous Non-Convex Optimization For Separable Problem
Sandeep Kumar and Ketan Rajawat
Dept. of Electrical Engineering, IIT Kanpur, Uttar Pradesh, India
(Async-DADMM, iWML, 2 July 2016)

Distributed Optimization

A general multi-agent cooperative distributed optimization problem:

\min_{x} \; f(x) := \sum_{k=1}^{K} g_k(x) + h(x) \quad \text{s.t.} \quad x \in X \qquad (1)

where x \in \mathbb{R}^{N \times 1}, g_k : \mathbb{R}^N \to \mathbb{R} for k = 1, ..., K, and h : \mathbb{R}^N \to \mathbb{R}. The problem can be solved with a centralized architecture (with a master node) or with a decentralized architecture.

Applications: machine learning, robotics, economics, big data analytics, network optimization, signal processing. Each agent could be a sensor node, a processor, a robot, etc. The heterogeneity of the nodes, together with resource, spatial, and temporal constraints, motivates distributed and asynchronous decision making.
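
To make the setup concrete, here is a minimal Python sketch of one instance of problem (1), assuming quadratic local costs g_k and an l1 regularizer h; these functional forms and the data are illustrative assumptions, not the ones used in the talk.

```python
import numpy as np

# Minimal sketch of problem (1) with K agents sharing one variable x in R^N.
# The local losses g_k and the regularizer h below are illustrative choices.
K, N = 4, 3
rng = np.random.default_rng(0)
A = [rng.standard_normal((5, N)) for _ in range(K)]   # local data held by agent k
b = [rng.standard_normal(5) for _ in range(K)]

def g(k, x):
    """Local (smooth) cost held by agent k."""
    return 0.5 * np.linalg.norm(A[k] @ x - b[k]) ** 2

def h(x):
    """Shared (possibly non-smooth) regularizer."""
    return 0.1 * np.linalg.norm(x, 1)

def f(x):
    """Global objective f(x) = sum_k g_k(x) + h(x)."""
    return sum(g(k, x) for k in range(K)) + h(x)

print(f(np.zeros(N)))
```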

Distributed and Asynchronous Optimization: Literature

When g_k(x) is convex: there is a huge literature for both architectures in (1). Consensus-based, diffusion, gossip-based, incremental (sub)gradient, distributed (sub)gradient, distributed ADMM, dual averaging, proximal dual, block coordinate, mirror descent, alternating minimization [1]-[10].

When g_k(x) is non-convex: provably convergent solutions have been proposed only very recently, and only for the centralized architecture: non-convex ADMM, stochastic subgradient, proximal dual [11]-[14]. These are applicable to many applications, including machine learning.

Limitations of the centralized architecture: it needs a master node, or assumes sharing of a global database/variable among all nodes, and is therefore not suited for communication and sensor network applications.

Partially Separable Form [15]

Partition the variables \{x_n\}_{n=1}^N among the nodes: \{S_k\}_{k=1}^K denote disjoint subsets, \{x_n : n \in S_k\} are local to node k, and g_k(\cdot) at node k depends on the neighborhood \mathcal{N}_k:

P = \min \sum_{k=1}^{K} g_k(\{x_n\}_{n \in S_j,\, j \in \mathcal{N}_k}) + h_k(\{x_n\}_{n \in S_k}) \qquad (2)
\text{s.t.} \quad \{x_n\}_{n \in S_k} \in X_k, \quad k = 1, 2, ..., K.

Figure 1: Factor graph representation for the objective function of (2).

Decentralized Consensus Problem Formulation

Introduce copies x_{kj} of the variable x_j for j \in \mathcal{N}_k. Collect the consensus variables z = \{z_k\}, k = 1, ..., K, and the per-node blocks x_k = \{x_{kj}\}_{j \in \mathcal{N}_k}, z_k = \{z_{kj}\}_{j \in \mathcal{N}_k}, and y_k = \{y_{kj}\}_{j \in \mathcal{N}_k}:

\min_{\{x_k\}, z} \sum_{k=1}^{K} g_k(\{x_{kj}\}_{j \in \mathcal{N}_k}) + h_k(z_k) \qquad (3)
\text{s.t.} \quad x_{kj} = z_j, \quad j \in \mathcal{N}_k \qquad (4)
\qquad\;\; z_k \in X_k, \quad k = 1, ..., K \qquad (5)

Figure: bipartite graph linking the local copies x_{kj} held by g_1, ..., g_4 to the consensus variables z_1, ..., z_4.

The augmented Lagrangian is

L(\{x_k\}, z, \{y_k\}) = \sum_{k=1}^{K} \Big( g_k(x_k) + h_k(z_k) + \sum_{j \in \mathcal{N}_k} \langle y_{kj}, x_{kj} - z_j \rangle + \frac{\rho_k}{2} \sum_{j \in \mathcal{N}_k} \| x_{kj} - z_j \|^2 \Big) \qquad (6)
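
As a sketch of how a node-level implementation might evaluate the augmented Lagrangian (6), the snippet below stores the copies x_{kj}, the duals y_{kj}, and the consensus blocks z_j in dictionaries keyed by node index; the neighborhoods, penalty parameters, and the toy g_k, h_k are assumptions for illustration only.

```python
import numpy as np

# Sketch of the augmented Lagrangian (6) for the consensus reformulation (3)-(5).
# Each node k keeps copies x[k][j] of the consensus variables z[j] for j in N_k.
d = 2                                                # dimension of each block
neighbors = {1: [1, 2], 2: [1, 2, 3], 3: [2, 3]}     # N_k for each node k (illustrative)
rho = {1: 1.0, 2: 1.0, 3: 1.0}                       # per-node penalty rho_k

def augmented_lagrangian(x, z, y, g, h):
    """L = sum_k [ g_k(x_k) + h_k(z_k)
           + sum_{j in N_k} <y_kj, x_kj - z_j> + (rho_k/2)||x_kj - z_j||^2 ]."""
    total = 0.0
    for k, Nk in neighbors.items():
        total += g[k](x[k]) + h[k](z[k])
        for j in Nk:
            r = x[k][j] - z[j]
            total += y[k][j] @ r + 0.5 * rho[k] * np.dot(r, r)
    return total

# Toy usage: quadratic g_k, zero h_k, all variables initialized to zero.
g = {k: (lambda xk: sum(np.dot(v, v) for v in xk.values())) for k in neighbors}
h = {k: (lambda zk: 0.0) for k in neighbors}
x = {k: {j: np.zeros(d) for j in neighbors[k]} for k in neighbors}
y = {k: {j: np.zeros(d) for j in neighbors[k]} for k in neighbors}
z = {j: np.zeros(d) for j in neighbors}
print(augmented_lagrangian(x, z, y, g, h))
```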

A Wireless Sensor Network Example

Fundamentals of ADMM

The alternating direction method of multipliers (ADMM) blends the decomposability of dual ascent with the superior convergence properties of the method of multipliers. It solves problems of the form

\min \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c \qquad (7)

with variables x \in \mathbb{R}^n and z \in \mathbb{R}^m, where A \in \mathbb{R}^{p \times n}, B \in \mathbb{R}^{p \times m}, and c \in \mathbb{R}^p. The optimal value is p^\star = \inf \{ f(x) + g(z) : Ax + Bz = c \}. With the augmented Lagrangian

L_\rho(x, z, y) = f(x) + g(z) + y^T (Ax + Bz - c) + \frac{\rho}{2} \| Ax + Bz - c \|_2^2,

the updates are, for \rho > 0,

z^{t+1} := \arg\min_z L_\rho(x^t, z, y^t) \qquad \text{(z-minimization step)}
x^{t+1} := \arg\min_x L_\rho(x, z^{t+1}, y^t) \qquad \text{(x-minimization step)}
y^{t+1} := y^t + \rho (Ax^{t+1} + Bz^{t+1} - c) \qquad \text{(dual variable update)} \qquad (8)
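
For intuition, here is a self-contained toy run of iteration (8), assuming f(x) = 0.5||x - a||^2 and g(z) = 0.5||z - b||^2 with A = I, B = -I, c = 0, so that every subproblem has a closed form; it follows the same z-then-x-then-y update order shown above.

```python
import numpy as np

# Toy ADMM for (7)-(8): constraint x = z, both subproblems in closed form.
a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
rho = 1.0
x = z = y = np.zeros(2)

for t in range(100):
    # z-minimization: argmin_z g(z) + y^T(x - z) + (rho/2)||x - z||^2
    z = (b + rho * x + y) / (1.0 + rho)
    # x-minimization: argmin_x f(x) + y^T(x - z) + (rho/2)||x - z||^2
    x = (a + rho * z - y) / (1.0 + rho)
    # dual ascent on the constraint residual x - z
    y = y + rho * (x - z)

print(x, z)   # both converge to (a + b) / 2, the minimizer of f + g under x = z
```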

Consensus update

Starting with arbitrary \{x_k^1\} and \{y_k^1\}, the updates for \{z_j^{t+1}\} are

z_j^{t+1} = \arg\min_{z_j \in X_j} h_j(z_j) + \sum_{k \in \mathcal{N}_j} \Big( \langle y_{kj}^t, x_{kj}^t - z_j \rangle + \frac{\rho_k}{2} \| x_{kj}^t - z_j \|^2 \Big)
\qquad\;\; = \mathrm{prox}_j \Big( \frac{\sum_{k \in \mathcal{N}_j} (\rho_k x_{kj}^t + y_{kj}^t)}{\sum_{k \in \mathcal{N}_j} \rho_k} \Big) \qquad (9)

where the proximal point function \mathrm{prox}_j(\cdot) is defined as

\mathrm{prox}_j(x) := \arg\min_{u \in X_j} h_j(u) + \frac{1}{2} \| x - u \|^2. \qquad (10)
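
A small sketch of the consensus update (9)-(10) at a single node j, under the illustrative assumption that h_j is an l1 penalty and X_j = R^d, so that prox_j reduces to soft-thresholding; the neighbor data are made up.

```python
import numpy as np

lam = 0.1   # illustrative l1 weight in h_j(u) = lam * ||u||_1

def prox_j(v):
    """prox_j(v) = argmin_u lam*||u||_1 + 0.5*||v - u||^2  (soft threshold)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def z_update(x_j, y_j, rho):
    """z_j^{t+1} = prox_j( sum_k (rho_k x_kj + y_kj) / sum_k rho_k ),  eq. (9);
    the sums run over the neighbors k in N_j (the keys of the dicts)."""
    num = sum(rho[k] * x_j[k] + y_j[k] for k in x_j)
    den = sum(rho[k] for k in x_j)
    return prox_j(num / den)

# Toy usage with two neighboring nodes holding copies of z_j.
x_j = {1: np.array([0.5, -0.2]), 2: np.array([0.7, 0.1])}
y_j = {1: np.zeros(2), 2: np.zeros(2)}
rho = {1: 1.0, 2: 1.0}
print(z_update(x_j, y_j, rho))
```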

Primal and Dual updates

x_k^{t+1} = \arg\min_{x_k} g_k(x_k) + \sum_{j \in \mathcal{N}_k} \langle y_{kj}^t, x_{kj} - z_j^{t+1} \rangle + \frac{\rho_k}{2} \sum_{j \in \mathcal{N}_k} \| x_{kj} - z_j^{t+1} \|^2 \qquad (11)

By linearizing g_k(x_k) at z_k^{t+1}, an approximate yet accurate update can be obtained as

x_k^{t+1} \approx \arg\min_{x_k} g_k(z_k^{t+1}) + \langle \nabla g_k(z_k^{t+1}), x_k - z_k^{t+1} \rangle + \sum_{j \in \mathcal{N}_k} \langle y_{kj}^t, x_{kj} - z_j^{t+1} \rangle + \frac{\rho_k}{2} \sum_{j \in \mathcal{N}_k} \| x_{kj} - z_j^{t+1} \|^2 \qquad (12)

where the vector [z_k]_j := z_j for all j \in \mathcal{N}_k and zero otherwise. The approximate update of x_{kj}^{t+1} thus becomes

x_{kj}^{t+1} = z_j^{t+1} - \frac{1}{\rho_k} \big( [\nabla g_k(z_k^{t+1})]_j + y_{kj}^t \big) \;\; \text{for } j \in \mathcal{N}_k, \quad \text{and} \quad x_{kj}^{t+1} = 0 \;\; \text{for } j \notin \mathcal{N}_k.

The dual updates are

y_{kj}^{t+1} = y_{kj}^t + \rho_k \big( x_{kj}^{t+1} - z_j^{t+1} \big), \quad j \in \mathcal{N}_k. \qquad (13)
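
The linearized update (12) and dual step (13) reduce to simple closed-form expressions per neighbor block. The sketch below assumes a toy quadratic g_k so that its gradient is explicit, and represents each block x_kj, y_kj, z_j as a dictionary entry; all data are illustrative.

```python
import numpy as np

rho_k = 2.0
neighbors_k = [1, 2]   # N_k, illustrative

def grad_g_k(z_k):
    """Illustrative gradient of g_k(x_k) = 0.5 * sum_j ||x_kj||^2, one block per j."""
    return {j: z_k[j] for j in neighbors_k}

def primal_dual_step(z_new, y_k):
    """x_kj^{t+1} = z_j^{t+1} - (1/rho_k)([grad g_k(z^{t+1})]_j + y_kj^t),   eq. (12)
       y_kj^{t+1} = y_kj^t + rho_k (x_kj^{t+1} - z_j^{t+1}).                 eq. (13)"""
    grad = grad_g_k(z_new)
    x_new = {j: z_new[j] - (grad[j] + y_k[j]) / rho_k for j in neighbors_k}
    y_new = {j: y_k[j] + rho_k * (x_new[j] - z_new[j]) for j in neighbors_k}
    return x_new, y_new

z_new = {1: np.array([0.5, 0.0]), 2: np.array([0.1, -0.3])}
y_k = {1: np.zeros(2), 2: np.zeros(2)}
x_new, y_new = primal_dual_step(z_new, y_k)
print(x_new, y_new)
```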

Asynchronous updates

Skipping of the g_k(\cdot) or \mathrm{prox}_k(\cdot) calculations, and of communications, is allowed for some iterations. Let S^t be the set of nodes that carry out the update at time t; then the update can be written as

z_j^{t+1} = \mathrm{prox}_j \Big( \frac{\sum_{k \in \mathcal{N}_j} (\rho_k x_{kj}^t + y_{kj}^t)}{\sum_{k \in \mathcal{N}_j} \rho_k} \Big) \;\; \text{if } j \in S^t, \qquad z_j^{t+1} = z_j^t \;\; \text{if } j \notin S^t, \qquad (14)

with \mathrm{prox}_j(x) := \arg\min_{u \in X_j} h_j(u) + \frac{1}{2} \| x - u \|^2 as before. The x update uses the latest available gradient \nabla g_k(z_k^{[t+1]}), where t + 1 - T \le [t+1] \le t + 1 for some T < \infty:

x_{kj}^{t+1} = z_j^{t+1} - \frac{1}{\rho_k} \big( [\nabla g_k(z_k^{[t+1]})]_j + y_{kj}^t \big) \;\; \text{for } j \in \mathcal{N}_k, \quad \text{and} \quad x_{kj}^{t+1} = 0 \;\; \text{for } j \notin \mathcal{N}_k. \qquad (15)

y_{kj}^{t+1} = y_{kj}^t + \rho_k \big( x_{kj}^{t+1} - z_j^{t+1} \big), \quad j \in \mathcal{N}_k. \qquad (16)
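
To illustrate the asynchronous rules (14)-(16), the sketch below keeps a short history of iterates so that a stale copy z^{[t+1]} (at most T iterations old) can be used in the gradient, and only nodes in the active set S^t refresh their consensus block; the active set, delay choice, and values are illustrative assumptions.

```python
import numpy as np

T = 3                         # maximum allowed staleness
history = []                  # past iterates z^1, z^2, ... appended each round

def stale_iterate():
    """Oldest iterate still usable as z^{[t+1]} (within T rounds of the newest)."""
    return history[max(0, len(history) - 1 - T)]

def z_async_update(z, S_t, fresh_value):
    """Nodes j in S_t take the freshly computed prox value (14); others keep z_j^t."""
    return {j: (fresh_value[j] if j in S_t else z[j]) for j in z}

# Toy usage: two scalar consensus variables, node 1 active, node 2 idle.
z = {1: np.array([1.0]), 2: np.array([2.0])}
history.append(z)
fresh = {1: np.array([0.8]), 2: np.array([1.5])}
z_next = z_async_update(z, S_t={1}, fresh_value=fresh)
history.append(z_next)
print(z_next)            # node 2 keeps its old value 2.0
print(stale_iterate())   # with only two rounds stored, this is still z^1
```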

Async-ADMM Algorithm (at node k)

Set t = 1; initialize \{x_{kj}^1, y_{kj}^1, z_j^1\} for all j \in \mathcal{N}_k.
For t = 1, 2, ...:
  - (Optional) Send \{\rho_k x_{kj}^t + y_{kj}^t\} to the neighbors j \in \mathcal{N}_k.
  - If \{\rho_j x_{jk}^t + y_{jk}^t\} is received from all j \in \mathcal{N}_k: (optional) update z_k^{t+1} as in (9) and transmit it to each j \in \mathcal{N}_k.
  - If z_j^{t+1} is not received from some j \in \mathcal{N}_k: set z_j^{t+1} = z_j^t.
  - (Optional) Calculate the gradient \nabla g_k(z_k^{t+1}).
  - Update the primal variable x_k^{t+1} as in (15).
  - If \| x_k^{t+1} - x_k^t \| \le \delta: terminate the loop.

Assumption 1. For each node k, the gradient of the component function g_k is Lipschitz continuous, that is, there exists L_k > 0 such that for all x, x' \in \mathrm{dom}\, g_k,

\| \nabla g_k(x) - \nabla g_k(x') \| \le L_k \| x - x' \|. \qquad (17)

Assumption 2. The set X_k is closed, convex, and compact, and the function g_k(x) is bounded from below over X_k.

Assumption 3. For each node k, the step size \rho_k is chosen large enough that \alpha_k > 0 and \beta_k > 0, where

\alpha_k := \rho_k - \frac{7 L_k + 1}{2} - \frac{N_k L_k^2 (T+1)^2}{2 \rho_k} - \frac{N_k L_k T^2}{2}, \qquad \beta_k := \rho_k - \frac{7 L_k}{2} - \frac{\rho_k}{2}. \qquad (18)
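
Assumption 1 is easy to check for smooth local costs. As an illustration (not from the talk), for a quadratic g_k(x) = 0.5||A_k x - b_k||^2 the Lipschitz constant of the gradient is the largest eigenvalue of A_k^T A_k; the snippet below computes it and spot-checks (17) numerically on random pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
A_k = rng.standard_normal((20, 5))                   # illustrative local data
L_k = np.linalg.eigvalsh(A_k.T @ A_k).max()          # Lipschitz constant of grad g_k

# Gradient differences do not depend on b_k, since the -A_k^T b_k term cancels.
grad = lambda x: A_k.T @ (A_k @ x)

# Spot-check ||grad g_k(x) - grad g_k(x')|| <= L_k ||x - x'|| on random pairs.
for _ in range(5):
    x, xp = rng.standard_normal(5), rng.standard_normal(5)
    assert np.linalg.norm(grad(x) - grad(xp)) <= L_k * np.linalg.norm(x - xp) + 1e-9

print("Lipschitz constant L_k ~", round(float(L_k), 3))
```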

Lemma
(a) Starting from any time t = t_0, there exists T < \infty such that

L(\{x_k^{T+t_0}\}, z^{T+t_0}, \{y_k^{T+t_0}\}) \le L(\{x_k^{t_0}\}, z^{t_0}, \{y_k^{t_0}\}) - \sum_{i=t_0}^{T+t_0-1} \sum_{k=1}^{K} \frac{\beta_k}{2} \sum_{j \in \mathcal{N}_k} \| x_{kj}^{i+1} - x_{kj}^{i} \|^2 - \sum_{i=t_0}^{T+t_0} \sum_{k=1}^{K} \alpha_k \sum_{j \in \mathcal{N}_k} \| z_j^{i+1} - z_j^{i} \|^2. \qquad (19)

(b) The augmented Lagrangian values in (6) are bounded from below, i.e., for any time t \ge 1 the Lagrangian satisfies

L(\{x_k^t\}, z^t, \{y_k^t\}) \ge P - \sum_{k=1}^{K} \frac{L_k}{2} \sum_{j \in \mathcal{N}_k} \mathrm{diam}^2(X_j) > -\infty.

Theorem
(a) The iterates generated by the asynchronous algorithm converge in the following sense:

\lim_{t \to \infty} \| z^{t+1} - z^t \| = 0, \qquad (20a)
\lim_{t \to \infty} \| x_{kj}^{t+1} - x_{kj}^t \| = 0, \quad j \in \mathcal{N}_k, \qquad (20b)
\lim_{t \to \infty} \| y_{kj}^{t+1} - y_{kj}^t \| = 0, \quad j \in \mathcal{N}_k. \qquad (20c)

(b) For each k \le K and j \in \mathcal{N}_k, denote the limit points of the sequences \{z^t\}, \{x_{kj}^t\}, and \{y_{kj}^t\} by z^\star, x_{kj}^\star, and y_{kj}^\star, respectively. Then \{\{z^\star\}, \{x_{kj}^\star\}, \{y_{kj}^\star\}\} is a stationary point of (3) and satisfies

\nabla g_k(x_k^\star) + y_k^\star = 0, \quad k = 1, ..., K, \qquad (21a)
\sum_{j \in \mathcal{N}_k} y_{kj}^\star \in \partial h_k(z) \big|_{z = z_k^\star}, \quad k = 1, ..., K, \qquad (21b)
x_{kj}^\star = z_j^\star \in X_j, \quad j \in \mathcal{N}_k, \; k = 1, ..., K. \qquad (21c)

Practical examples of the partially separable form

Distributed cooperative localization over networks:

\hat{X} = \arg\min_{X \in B} \sum_{k=1}^{K} g_k(\{x_j\}_{j \in \mathcal{N}_k}) \qquad (22)

g_k(\{x_j\}_{j \in \mathcal{N}_k}) = \sum_{j \in \mathcal{N}_k} w_{kj} \big( (\delta_{kj} - \| x_k - x_j \|_2)^2 + \epsilon \big) \qquad (23)

Figure 2: Cooperative Localization Example
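
A minimal sketch of the local localization cost at node k, assuming the smoothed squared range-mismatch form written in (23); the weights, range measurements, and positions are made-up values for illustration.

```python
import numpy as np

eps = 1e-3   # smoothing constant epsilon

def g_k(x_k, x_nbrs, delta, w):
    """g_k = sum_j w_kj * ((delta_kj - ||x_k - x_j||)^2 + eps),  as in (23)."""
    return sum(
        w[j] * ((delta[j] - np.linalg.norm(x_k - x_j)) ** 2 + eps)
        for j, x_j in x_nbrs.items()
    )

# Toy usage: node k at the origin with two neighbors and noisy range measurements.
x_k = np.array([0.0, 0.0])
x_nbrs = {1: np.array([1.0, 0.0]), 2: np.array([0.0, 2.0])}
delta = {1: 1.1, 2: 1.9}    # measured ranges delta_kj
w = {1: 1.0, 2: 0.5}        # weights w_kj
print(g_k(x_k, x_nbrs, delta, w))
```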

Thank You

References
[1] Chang, Tsung-Hui, Angelia Nedic, and Anna Scaglione. Distributed constrained optimization by consensus-based primal-dual perturbation method. IEEE Transactions on Automatic Control 59.6 (2014): 1524-1538.
[2] Lobel, Ilan, and Asuman Ozdaglar. Distributed subgradient methods for convex optimization over random networks. IEEE Transactions on Automatic Control 56.6 (2011): 1291-1306.
[3] Nedic, Angelia, Dimitri P. Bertsekas, and Vivek S. Borkar. Distributed asynchronous incremental subgradient methods. (2000).
[4] Duchi, John C., Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: convergence analysis and network scaling. IEEE Transactions on Automatic Control 57.3 (2012): 592-606.
[5] Chen, Jianshu, and Ali H. Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing 60.8 (2012): 4289-4305.

[6] Zhang, Ruiliang, and James T. Kwok. Asynchronous Distributed ADMM for Consensus Optimization. ICML, 2014.
[7] Richtarik, Peter, and Martin Takac. Distributed coordinate descent method for learning with big data. (2013).
[8] Dekel, Ofer, et al. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research 13.Jan (2012): 165-202.
[9] Boyd, Stephen, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3.1 (2011): 1-122.
[10] Boyd, Stephen, et al. Gossip algorithms: Design, analysis and applications. Proceedings of the IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies, Vol. 3. IEEE, 2005.

[11] Hong, Mingyi. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM based approach. arXiv preprint arXiv:1412.6058 (2014).
[12] Davis, Damek. The Asynchronous PALM Algorithm for Nonsmooth Nonconvex Problems. arXiv preprint arXiv:1604.00526 (2016).
[13] Ghadimi, Saeed, and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23.4 (2013): 2341-2368.
[14] Hong, Mingyi. Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications. arXiv preprint arXiv:1604.00543 (2016).
[15] Kumar, Sandeep, Rahul Jain, and Ketan Rajawat. Asynchronous Optimization Over Heterogeneous Networks via Consensus ADMM. arXiv preprint arXiv:1605.00076 (2016).