Decentralized Quadratically Approximated Alternating Direction Method of Multipliers

Similar documents
DECENTRALIZED algorithms are used to solve optimization

DLM: Decentralized Linearized Alternating Direction Method of Multipliers

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Decentralized Consensus Optimization with Asynchrony and Delay

ADMM and Fast Gradient Methods for Distributed Optimization

Distributed Smooth and Strongly Convex Optimization with Inexact Dual Methods

Dual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Asynchronous Non-Convex Optimization For Separable Problem

Dual Ascent. Ryan Tibshirani Convex Optimization

Decentralized Dynamic Optimization Through the Alternating Direction Method of Multipliers

ABSTRACT 1. INTRODUCTION

High Order Methods for Empirical Risk Minimization

Distributed Computation of Quantiles via ADMM

High Order Methods for Empirical Risk Minimization

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

On the linear convergence of distributed optimization over directed graphs

Distributed Consensus Optimization

A Customized ADMM for Rank-Constrained Optimization Problems with Approximate Formulations

Decentralized Quasi-Newton Methods

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Efficient Methods for Large-Scale Optimization

Resilient Distributed Optimization Algorithm against Adversary Attacks

Incremental Quasi-Newton methods with local superlinear convergence rate

Math 273a: Optimization Lagrange Duality

On the Linear Convergence of Distributed Optimization over Directed Graphs

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS

Fast Nonnegative Matrix Factorization with Rank-one ADMM

arxiv: v3 [math.oc] 1 Jul 2015

WE consider the problem of estimating a time varying

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

A Distributed Newton Method for Network Utility Maximization, I: Algorithm

Lecture 23: November 19

ADD-OPT: Accelerated Distributed Directed Optimization

Consensus-Based Distributed Optimization with Malicious Nodes

D4L: Decentralized Dynamic Discriminative Dictionary Learning

c 2015 Society for Industrial and Applied Mathematics

Consensus Optimization with Delayed and Stochastic Gradients on Decentralized Networks

Distributed online optimization over jointly connected digraphs

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 62, NO. 23, DECEMBER 1, Aryan Mokhtari and Alejandro Ribeiro, Member, IEEE

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

Introduction to Alternating Direction Method of Multipliers

Distributed Optimization over Random Networks

A Distributed Newton Method for Network Utility Maximization, II: Convergence

Alternating Direction Method of Multipliers. Ryan Tibshirani Convex Optimization

10-725/36-725: Convex Optimization Spring Lecture 21: April 6

arxiv: v2 [math.oc] 7 Apr 2017

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION

A Distributed Newton Method for Network Utility Maximization

Sparse Gaussian conditional random fields

Preconditioning via Diagonal Scaling

Convergence rates for distributed stochastic optimization over random networks

Continuous-time Distributed Convex Optimization with Set Constraints

BLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS

Distributed online optimization over jointly connected digraphs

ARock: an algorithmic framework for asynchronous parallel coordinate updates

WE consider an undirected, connected network of n

1 Kernel methods & optimization

Distributed Convex Optimization

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Alternative Characterization of Ergodicity for Doubly Stochastic Chains

An Optimization-based Approach to Decentralized Assignability

Distributed Optimization over Networks Gossip-Based Algorithms

Divide-and-combine Strategies in Statistical Modeling for Massive Data

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables

arxiv: v1 [math.oc] 23 May 2017

Diffusion based Projection Method for Distributed Source Localization in Wireless Sensor Networks

Convergence Rate for Consensus with Delays

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

Kun Yuan and Qing Ling*

Coordinate Descent and Ascent Methods

arxiv: v4 [math.oc] 25 Aug 2018

Monotonicity and Restart in Fast Gradient Methods

High Order Methods for Empirical Risk Minimization

A Distributed Newton Method for Network Optimization

Network Optimization with Heuristic Rational Agents

The Alternating Direction Method of Multipliers

Contraction Methods for Convex Optimization and monotone variational inequalities No.12

ECS289: Scalable Machine Learning

ALADIN An Algorithm for Distributed Non-Convex Optimization and Control

Lecture 16: October 22

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Convergence Rate of Expectation-Maximization

Online Learning of Optimal Strategies in Unknown Environments

Nonlinear Optimization Methods for Machine Learning

STA141C: Big Data & High Performance Statistical Computing

Sparse Optimization Lecture: Basic Sparse Optimization Models

SIGNIFICANT increase in amount of private distributed

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimization methods

Input Optimization for Multi-Antenna Broadcast Channels with Per-Antenna Power Constraints

arxiv: v1 [math.oc] 29 Sep 2018

Constrained Optimization and Lagrangian Duality

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

Distributed MAP probability estimation of dynamic systems with wireless sensor networks

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Ergodic Stochastic Optimization Algorithms for Wireless Communication and Networking

Machine Learning Support Vector Machines. Prof. Matteo Matteucci

Distributed Maximum Likelihood Estimation with Time-Varying Network Topology

Fast Linear Iterations for Distributed Averaging 1

Transcription:

Decentralized Quadratically Approimated Alternating Direction Method of Multipliers Aryan Mokhtari Wei Shi Qing Ling Alejandro Ribeiro Department of Electrical and Systems Engineering, University of Pennsylvania Department of Automation, University of Science and Technology of China Abstract This paper considers an optimization problem that components of the objective function are available at different nodes of a network and nodes are allowed to only echange information with their neighbors. The decentralized alternating method of multipliers (DADMM) is a well-established iterative method for solving this category of problems; however, implementation of DADMM requires solving an optimization subproblem at each iteration for each node. This procedure is often computationally costly for the nodes. We introduce a decentralized quadratic approimation of ADMM (DQM) that reduces computational compleity of DADMM by minimizing a quadratic approimation of the objective function. Notwithstanding that DQM successively minimizes approimations of the cost, it converges to the optimal arguments at a linear rate which is identical to the convergence rate of DADMM. Further, we show that as time passes the coefficient of linear convergence for DQM approaches the one for DADMM. Numerical results demonstrate the effectiveness of DQM. Inde Terms Multi-agent network, decentralized optimization, alternating direction method of multipliers. I. INTRODUCTION Decentralized algorithms are designed to solve the problem of minimizing a global cost function over a set of nodes. Agents (nodes) only have access to their local cost functions and try to minimize the global cost cooperatively only by echanging information with their neighbors. To be more precise, consider a variable R p and a connected network containing n agents each of which has access to a strongly conve local cost function f i : R p R. The agents cooperate to solve = argmin f i ( ). (1) Problems of this form arise in decentralized control systems [1 [3, wireless communication networks [4, [5, sensor networks [6 [8, and large scale machine learning systems [9 [11. There are different algorithms to solve (1) in a distributed manner [6, [12 [25. Decentralized implementations of the alternating direction method of multipliers (DADMM) are wellknown for solving (1), with a fast linear convergence rate [6, [24 [26. On the other hand, DADMM steps are computationally costly, since each node has to minimize a conve problem at each iteration. This issue is addressed by the Decentralized Linearized ADMM (DLM) algorithm that in lieu of minimizing the eact primal conve problem, minimizes a first-order linearized version of the primal objective function [27, [28. DLM reduces the computational compleity of DADMM, however, convergence rate of DLM is slow when the primal objective function is ill-conditioned, since DLM operates on first-order information. i=1 Supported by NSF CAREER CCF-0952867 and ONR N00014-12-1-0997. Moreover, the coefficient of linear convergence for DLM is strictly smaller than the one for DADMM [29. These drawbacks can be resolved by incorporating second-order information of the primal function. In this paper we propose a decentralized quadratic approimation of ADMM (DQM) that successively minimizes a quadratic approimation of the primal objective function. Therefore, DQM enjoys the low computational cost of DLM, while its convergence is eact and linear same as DADMM. This approimation incorporates the second-order information of the primal objective function which leads to a provably fast linear convergence even in the case that the cost function is ill-conditioned. Further, we show that as time progresses the coefficient of linear convergence for DQM approaches the one for DADMM. We begin the paper with describing the DADMM steps and eplaining the idea of the DLM algorithm (Section II). Then, we introduce the DQM method which is different from DADMM and DLM in minimizing a quadratic approimation of the primal function (Section III). We analyze convergence properties of DQM (Section IV) and evaluate its performance in solving a logistic regression problem (Section V). Proofs of the results in this paper and comprehensive numerical results are provided in [29. II. DADMM: DECENTRALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS Consider a connected network with n nodes and m edges where the set of nodes is V = {1,..., n} and the set of ordered edges E contains pairs of nodes (i, j). Node i can communicate with node j if and only if the pair (i, j) E and we further assume that the network is symmetric so that (i, j) E implies (j, i) E. We define the neighborhood of node i as the set N i = {j (i, j) E} of nodes that can communicate with i. In problem (1) agent i has access to the local objective function f i ( ) and agents cooperate to minimize the global cost n i=1 f i( ). This specification is more naturally formulated by defining variables i representing the local copies of the variable. In DADMM we further introduce the auiliary variables z ij associated with edge (i, j) E and rewrite (1) as min,z i=1 f i ( i ), s.t. i = z ij, j = z ij, for all (i, j) E. (2) The constraints i = z ij and j = z ij enforce consensus among neighbors and, since the network is connected, they also enforce global consensus. Therefore, problems (1) and (2) are equivalent in the sense that for all i and j the optimal arguments of (2) satisfy i = for all i V and z ij =, for all (i, j) E.

To specify the DADMM algorithm we write the constraints of (2) in matri form. Begin by defining the block source matri A s R mp np which contains m n square blocks (A s ) e,i R p p. The block (A s ) e,i is not identically null if and only if the edge e = (i, j) corresponds to (i, j) E in which case (A s ) e,i = I p. Likewise, define the block destination matri A d R mp np containing m n square blocks (A d ) e,i R p p. The square block (A d ) e,i is equal to identity matri I p if and only if e = (j, i) corresponds to (j, i) E, otherwise the block is null. Further define the vector := [ 1,..., n R np as the vector that concatenates the local variables i and the vector z := [z 1,..., z m R mp that concatenates the auiliary variables z e = z ij. If we further define the aggregate function f : R np R as f() := n i=1 f i( i ) we can rewrite (2) as min f(), s. t. A + Bz = 0. (3),z where the matrices A R 2mp np and B R 2mp mp are defined as the stacks A := [A s ; A d and B := [ I mp ; I mp. Introduce now multipliers α e = α ij and β e = β ij respectively associated with the constraints i = z ij and j = z ij [cf. (2). Concatenate the multipliers α e in the vector α := [α 1,..., α m and, likewise, stack the Lagrange multipliers β e in the vector β := [β 1,..., β m. Further stacking the vectors α and β into λ := [α; β R 2mp leads to the Lagrange multiplier λ associated with the constraint A + Bz = 0 in (3). The augmented Lagrangian of (3), and (2), which is the same optimization problem written with different notation, is now defined as L(, z, λ) := f() + λ T (A + Bz) + c 2 A + Bz 2, (4) where c > 0 is an arbitrary positive constant. The idea of the DADMM algorithm is to alternatively minimize the Lagrangian L(, z, λ) with respect to the variable and the auiliary variable z and to follow these minimizations by a multiplier update collinear with the constraint violation. Specifically, consider a time inde k N and define k, z k, and λ k as the primal and dual iterates at step k. The first step of DADMM is Lagrangian minimization with respect to with z k and λ k given, k+1 = argmin f()+λ T k (A + Bz k )+ c 2 A + Bz k 2. (5) The second step is minimization with respect to the auiliary variable z with λ k given but using the updated variable k+1, z k+1 =argmin f( k+1 )+λ T k (A k+1 +Bz)+ c z 2 A k+1+bz 2. (6) With the primal iterates k+1 and z k+1 updated, the third and final step is to move the Lagrange multiplier λ k in the direction of the constraint violation A k+1 + Bz k+1, λ k+1 = λ k + c (A k+1 + Bz k+1 ), (7) where the constant c is the same constant used in (4). The DADMM algorithm follows from the observation that the updates in (5)-(7) can be implemented in a distributed manner, [6, [24, [25. Notice that the update formulas in (6) and (7) have low computational cost. However, the update for the primal variable in (5) requires solution of an optimization problem. To avoid the cost of this minimization, the primal variable k+1 can be updated ineactly. This idea leads to the DLM algorithm that approimates the objective function value f( k+1 ) in (5) through a linearization of the function f in a neighborhood of the current variable k. This approimation is defined as f( k ) + f( k ) T ( k ) + ρ k 2, where ρ is an arbitrary positive constant. Using this approimation, the update formula for the primal variable in DLM replaces (5) by k+1 = argmin f( k ) + f( k ) T ( k ) + ρ k 2 + λ T k (A + Bz k ) + c 2 A + Bz k 2. (8) Since the problem in (8) is quadratic, its solution is elementary. E.g., the first order optimality condition for (8) requires the variable k+1 to satisfy f( k ) + ρ( k+1 k ) + A T λ k + ca T (A k+1 + Bz k ) = 0. (9) The epression in (9) is a linear equation for that an be solved by inversion of the positive definite matri ρi + ca T A. The sequence of variables k generated by DLM converges linearly to the optimal argument [27. This is the same rate of convergence of DADMM, but the linear rate coefficient of DLM is strictly smaller than the linear rate coefficient of DADMM [29. In this paper we propose an alternative approimation that will be shown to achieve linear convergence with a coefficient that is asymptotically equivalent to the DADMM coefficient. This approimation utilizes second order information of the local functions f i () and leads to the DQM algorithm that we introduce in the following section. III. DQM: DECENTRALIZED QUADRATICALLY APPROXIMATED ADMM We introduce a Decentralized Quadratic Approimation of ADMM (DQM) that uses a quadratic approimation of the primal function f() for updating the variable. To be more precise, instead of using f( k+1 ) in the optimization problem (5), DQM eecutes the quadratic approimation f( k )+ f( k ) T ( k )+ (1/2)( k ) T H k ( k ) for computing the new iterate k+1, where H k := 2 f( k ) is the Hessian of the primal function f computed at k which is a block diagonal matri. Therefore, the primal variable is updated as k+1 = argmin f( k ) + f( k ) T ( k ) (10) + 1 2 ( k) T H k ( k ) + λ T k (A + Bz k ) + c 2 A + Bz k 2. The DQM updates for the variables z k and λ k are identical to the DADMM updates in (6) and (7), respectively. Comparison of the updates in (8) and (10) shows that in DLM the quadratic term ρ k+1 k 2 is added to the first-order approimation of the primal objective function, while in DQM the second-order approimation of the primal objective function is used to reach a more accurate approimation for f( k+1 ). First order optimality conditions of updates in (10), (6) and (7) imply that the DQM iterates can be generated by solving the equations f( k )+H k ( k+1 k ) + A T λ k +ca T (A k+1 +Bz k ) = 0, B T λ k + cb T (A k+1 + Bz k+1 ) = 0, λ k+1 λ k c (A k+1 + Bz k+1 ) = 0. (11) Notice that the first equation in (11) can be solved by inverting H k +ca T A. To guarantee that the system of equations in (11) can

be solved in a distributed manner, we assume a specific structure for the initial vectors λ 0 = [α 0 ; β 0, 0, and z 0 as mentioned in Assumption 1. Before introducing this assumption we define the oriented incidence matri as E o := A s A d and the unoriented incidence matri as E u := A s + A d. Assumption 1 The initial Lagrange multipliers α 0 and β 0, and the initial variables 0 and z 0 are chosen such that a) α 0 = β 0, b) E u 0 = 2z 0, c) α 0 lies in the column space of E o. Assumption 1 enforces some initial conditions on the Lagrange multipliers α 0 and β 0, and the initial variables 0 and z 0. These conditions are satisfied by setting α 0 = β 0 = 0 and 0 = z 0 = 0. Considering the initial conditions in Assumption 1 we propose a simpler update rule for generating the iterates k instead of solving equations in (11). Proposition 1 Consider the system of equations for the DQM algorithm in (11) and define the sequence φ k := E T o α k. Further, define the unoriented Laplacian as L u := (1/2)E T u E u, the oriented Laplacian as L o := (1/2)E T o E o, and the Degree matri as D := (L u + L o )/2. If Assumption 1 holds true, the DQM variables k can be generated as k+1 = (2cD + H k ) 1 [(cl u + H k ) k f( k ) φ k, (12) φ k+1 = φ k + cl o k+1. (13) Proposition 1 states that by introducing the new variables φ k, the update formulas for the DQM iterates can be computed using the primal objective function Hessian H k, the degree matri D, and the oriented and unoriented Laplacians L o and L u. This observation guarantees that the updates in (12) and (13) are implementable in a distributed manner, since all of these matrices can be computed using local and neighboring information of the nodes. To be more precise, the matri 2cD + H k is block diagonal and its i-th diagonal block is given by 2cd i I + 2 f i ( i ) which is locally available at node i. Likewise, the inverse matri (2cD + H k ) 1 is block diagonal and locally computable since the i-th diagonal block is (2cd i I + 2 f i ( i )) 1. Computations of the products L u k and L o k+1 can be implemented in a decentralized manner as well, since the Laplacian matrices L u and L o are block neighbor sparse. Note that a matri is block neighbor sparse when its ij-th block is not null if and only if nodes i and j are neighbors or j = i. Therefore, nodes can compute L u k and L o k+1 by echanging information with their neighbors. By defining the components of φ k as φ k := [φ 1,k,..., φ n,k, the update in (12) can be implemented locally as i,k+1 = ( 2cd i I + 2 f i ( i,k ) ) [ 1 cd i i,k + c j,k + 2 f i ( i,k ) i,k f i ( i,k ) φ i,k, (14) where i,k and φ i,k are the iterates of node i at step k. Notice that the definition L u := (1/2)E T u E u = (1/2)(A s + A d ) T (A s +A d ) is used to simplify the i-th component of cl u k as cd i i,k + c j,k. Further, considering the definition L o = (1/2)E T o E o = (1/2)(A s A d ) T (A s A d ), the i-th component of cl o k+1 can be simplified as c ( i,k j,k ). Algorithm 1 DQM method at node i Require: Initial local iterates i,0 and φ i,0. 1: for k = 0, 1, 2,... do 2: Update the local iterate i,k+1 i,k+1 = ( 2cd ii + 2 f i( i,k ) ) 1 [ cd i i,k + c j,k + 2 f i( i,k ) i,k f i( i,k ) φ i,k. 3: Echange iterates i,k+1 with neighbors j N i. 4: Update local dual variable φ k+1 as φ i,k+1 = φ i,k + c ( i,k+1 j,k+1 ). 5: end for Therefore, the update formula in (13) for node i is given by φ i,k+1 = φ i,k + c ( i,k+1 j,k+1 ). (15) The proposed DQM algorithm is summarized in Algorithm 1. At each iteration k, the primal and dual updates in (14) and (15) are implemented in Steps 2 and 4, respectively. Nodes echange their local variables i,k with their neighbors j N i in Step 3, since this information is required for Steps 2 and 4. IV. CONVERGENCE ANALYSIS In this section we show that the sequence of iterates k generated by the DQM method converges linearly to the optimal argument. In addition, we compare the linear convergence coefficients of the DQM and DADMM methods. To provide these results first we make the following assumptions. Assumption 2 The local objective functions f i () are twice differentiable and the eigenvalues of the local Hessians 2 f i () are bounded with positive constants 0 < m M <. I.e., mi 2 f i () MI. (16) Assumption 3 The local Hessians 2 f i () are Lipschitz continuous with parameter L. I.e., for all, ˆ R p, it holds 2 f i () 2 f i (ˆ) L ˆ. (17) The lower bound m for the eigenvalues of the local Hessians 2 f i () implies that the local objective functions f i () are strongly conve with parameter m. The upper bound M for the eigenvalues of the local Hessians 2 f i () is similar to the condition that the local gradients f i () are Lipschitz continuous with parameter M. The Lipschitz continuity of the local Hessians imposed by Assumption 3 is typical of second order methods. Define γ u and Γ u as the minimum and maimum singular values of the unoriented incidence matri E u, respectively. Further, define γ o as the smallest non-zero singular value of the oriented incidence matri E o. These parameters capture connectivity of the network. Denote the unique solution of (2) as (, z ). Notice that the uniqueness is implied by the strong conveity assumption. Further, define α as the unique optimal multiplier that lies in the column space of E o see Lemma 1 of [28 for the uniqueness of such an optimal dual variable α. We define the energy function V : R mp mp R as introduced in [25 for DADMM, V (z, α) := c z z 2 + (1/c) α α 2. (18)

The energy function V (z, α) captures the distances of the auiliary variable z k and the dual variable α k with their optimal arguments z and α, respectively. Therefore, convergence rate of the energy function is a valid tool for comparing performances of DQM and DADMM. To simplify the notation of the energy function V (z, α), define u R 2mp and C R 2mp 2mp as [ z u := α, C := [ ci 0 0 (1/c)I. (19) The energy function V (z, α) is equal to weighted squared norm u u 2 C where u := [z ; α. Our goal is to show that the sequence u k u 2 C converges linearly to null. Theorem 1 Consider the DQM method as introduced in (10)- (15). Define the sequence of non-negative variables ζ k as { } L ζ k := min 2 k+1 k, 2M. (20) Assume that the constant c is chosen such that c > ζk 2/(mγ2 u). Moreover, consider µ, µ > 1 as arbitrary constants and η as a positive constant chosen from the interval (ζ k /m, cγu/ζ 2 k ). If Assumptions 1-3 hold true, then the sequence u k u 2 C generated by DQM satisfies u k+1 u 2 C 1 1 + δ k u k u 2 C, (21) where the sequence of positive scalars δ k is given by { } c ηζ k γu 2 m ζ k /η δ k =min, 4µ µζk 2 c(µ 1)(µ 1) γ 2 u γo 2 + µ µ (µ 1) Γ2 uγo 2 c 4 Γ2 u+ µ c M 2 γo 2. (22) Notice that δ k is a decreasing function of ζ k and observe that ζ k is bounded above by 2M. Therefore, if we substitute ζ k by 2M in (22), the inequality in (21) is still valid. This substitution implies that the sequence u k u 2 C converges linearly to null. As a result of this convergence we obtain that u k approaches the optimal argument u. Therefore, the sequence of primal iterates k converges to the optimal argument. This result is formalized in the following corollary. Corollary 1 Under the assumptions in Theorem 1, the sequence of squared norms k 2 generated by the DQM algorithm converges R-linearly to null, i.e., k 2 4 cγu 2 u k u 2 C. (23) Corollary 1 states that the sequence k converges to the optimal argument. Hence, we obtain that the sequence k+1 k approaches null. This observation implies that the sequence of scalars ζ k converges to 0 as time passes, since ζ k is bounded above by (L/2) k+1 k. By considering lim k ζ k = 0 and making µ arbitrary close to 1, we obtain lim δ k = min k { (µ 1)γo 2 µγ 2, u } m c 4 Γ2 u + µ c M 2 γo 2. (24) Notice that the coefficient δ k in (24) is identical to the coefficient of linear convergence for the DADMM algorithm [25. This observation implies that as time passes the coefficient of linear convergence for DQM approaches the one for DADMM. Relative error k 0 10 0 10 2 10 4 10 6 10 8 DADMM DLM DQM 10 10 0 100 200 300 400 500 600 700 800 900 1000 Number of iterations k Fig. 1: Relative error k / 0 of DADMM, DQM, and DLM versus number of iterations. The convergence path of DQM is almost identical to the one for DADMM, while DQM has less computation compleity. Further, DQM outperforms DLM in convergence speed by orders of magnitude, though at the cost of higher computation compleity. V. NUMERICAL ANALYSIS In this section we compare performances of DLM, DQM and DADMM in solving a logistic regression problem. We assume that each node in the network has access to q training points. Therefore, the total number of training points is nq. Each of the training samples {s il, y il } q l=1 at node i contains a feature vector s il R p with class y il { 1, 1}. It follows from the logistic regression model that the maimum log-likelihood estimate of the classifier given the training samples (s il, y il ) for l = 1,..., q and i = 1,..., n is := argmin R p q i=1 l=1 log [1 + ep( y il s T il ). (25) The optimization problem in (25) can be written in the form of (1) by defining the local objective functions f i as q f i ( ) := log [1 + ep( y il s T il ). (26) l=1 We compare convergence paths of the DLM, DQM and DADMM algorithms for solving the logistic regression problem. We assume the network contains n = 10 nodes and the edges between nodes are generated randomly with probability P c = 0.4. Each agent holds q = 5 samples and the dimension of feature vectors is p = 3. The reference (ground true) logistic classifier is pre-computed with a centralized method. Notice that the parameter c for the three methods is optimized by c ADMM = 0.7, c DLM = 5.5, and c DQM = 0.7. Figure 1 illustrates the relative errors k / 0 of DLM, DQM, and DADMM versus the number of iterations. The convergence path of DQM is almost identical to the convergence path of DADMM and they both converge to the optimal argument faster than DLM. The relative errors k / 0 for DQM and DADMM after k = 300 iterations are below 10 9, while for DLM the relative error after the same number of iterations is 5 10 2. Conversely, achieving error k / 0 = 10 3 for DQM and DADMM requires 91 iterations, while DLM requires 758 iterations. Observe that the convergence paths of DQM and DADMM are almost identical, while the computation compleity of DQM is lower than DADMM.

REFERENCES [1 F. Bullo, J. Cortés, and S. Martinez, Distributed control of robotic networks: a mathematical approach to motion coordination algorithms. Princeton University Press, 2009. [2 Y. Cao, W. Yu, W. Ren, and G. Chen, An overview of recent progress in the study of distributed multi-agent coordination, Industrial Informatics, IEEE Transactions on, vol. 9, no. 1, pp. 427 438, 2013. [3 C. G. Lopes and A. H. Sayed, Diffusion least-mean squares over adaptive networks: Formulation and performance analysis, Signal Processing, IEEE Transactions on, vol. 56, no. 7, pp. 3122 3136, 2008. [4 A. Ribeiro, Ergodic stochastic optimization algorithms for wireless communication and networking, Signal Processing, IEEE Transactions on, vol. 58, no. 12, pp. 6369 6386, 2010. [5, Optimal resource allocation in wireless communication and networking, EURASIP Journal on Wireless Communications and Networking, vol. 2012, no. 1, pp. 1 19, 2012. [6 I. D. Schizas, A. Ribeiro, and G. B. Giannakis, Consensus in ad hoc wsns with noisy links part i: Distributed estimation of deterministic signals, Signal Processing, IEEE Transactions on, vol. 56, no. 1, pp. 350 364, 2008. [7 U. A. Khan, S. Kar, and J. M. Moura, Diland: An algorithm for distributed sensor localization with noisy distance measurements, Signal Processing, IEEE Transactions on, vol. 58, no. 3, pp. 1940 1947, 2010. [8 M. Rabbat and R. Nowak, Distributed optimization in sensor networks, in Proceedings of the 3rd international symposium on Information processing in sensor networks. ACM, 2004, pp. 20 27. [9 R. Bekkerman, M. Bilenko, and J. Langford, Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011. [10 K. I. Tsianos, S. Lawlor, and M. G. Rabbat, Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning, Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pp. 1543 1550, 2012. [11 V. Cevher, S. Becker, and M. Schmidt, Conve optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics, Signal Processing Magazine, IEEE, vol. 31, no. 5, pp. 32 43, 2014. [12 A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, Automatic Control, IEEE Transactions on, vol. 54, no. 1, pp. 48 61, 2009. [13 D. Jakovetic, J. Xavier, and J. M. Moura, Fast distributed gradient methods, Automatic Control, IEEE Transactions on, vol. 59, no. 5, pp. 1131 1146, 2014. [14 K. Yuan, Q. Ling, and W. Yin, On the convergence of decentralized gradient descent, arxiv preprint arxiv:1310.7063, 2013. [15 E. Wei and A. Ozdaglar, Distributed alternating direction method of multipliers, in Decision and Control (CDC), 2012 IEEE 51st Annual Conference on. IEEE, 2012, pp. 5445 5450. [16 H. Terelius, U. Topcu, and R. Murray, Decentralized multi-agent optimization via dual decomposition, in 18th IFAC World Congress, 28 August 2011 through 2 September 2011, Milano, Italy, 2011, pp. 11 245 11 251. [17 W. Shi, Q. Ling, G. Wu, and W. Yin, Etra: An eact first-order algorithm for decentralized consensus optimization, arxiv preprint arxiv:1404.6264, 2014. [18 A. Mokhtari, Q. Ling, and A. Ribeiro, Network newton-part i: Algorithm and convergence, arxiv preprint arxiv:1504.06017, 2015. [19, Network newton-part ii: Convergence rate and implementation, arxiv preprint arxiv:1504.06020, 2015. [20 J. C. Duchi, A. Agarwal, and M. J. Wainwright, Dual averaging for distributed optimization: convergence analysis and network scaling, Automatic Control, IEEE Transactions on, vol. 57, no. 3, pp. 592 606, 2012. [21 K. I. Tsianos, S. Lawlor, and M. G. Rabbat, Push-sum distributed dual averaging for conve optimization. pp. 5453 5458, 2012. [22 Z.-q. Luo and P. Tseng, On the convergence rate of dual ascent methods for linearly constrained conve minimization, Mathematics of Operations Research, vol. 18, no. 4, pp. 846 867, 1993. [23 D. Jakovetic, J. M. Moura, and J. Xavier, Linear convergence rate of a class of distributed augmented lagrangian algorithms, Automatic Control, IEEE Transactions on, vol. 60, no. 4, pp. 922 936, 2015. [24 S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends R in Machine Learning, vol. 3, no. 1, pp. 1 122, 2011. [25 W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, On the linear convergence of the admm in decentralized consensus optimization, Signal Processing, IEEE Transactions on, vol. 62, no. 7, pp. 1750 1761, 2014. [26 F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem, Eplicit convergence rate of a distributed alternating direction method of multipliers, arxiv preprint arxiv:1312.1085, 2013. [27 Q. Ling and A. Ribeiro, Decentralized linearized alternating direction method of multipliers, Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 5447 5451, 2014. [28 Q. Ling, W. Shi, G. Wu, and A. Ribeiro, Dlm: Decentralized linearized alternating direction method of multipliers, Signal Processing, IEEE Transactions on, 2014. [29 A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro, Dqm: Decentralized quadratically approimated alternating direction method of multipliers, arxiv preprint arxiv:1508.02073, 2015.