ADD-OPT: Accelerated Distributed Directed Optimization


ADD-OPT: Accelerated Distributed Directed Optimization

Chenguang Xi, Student Member, IEEE, and Usman A. Khan, Senior Member, IEEE

Abstract: This paper considers distributed optimization problems where agents collectively minimize the sum of locally known convex functions. We focus on the case when the communication between agents is described by a directed graph. The proposed algorithm, ADD-OPT (Accelerated Distributed Directed OPTimization), achieves the best known rate of convergence for this class of problems, O(µ^k) for 0 < µ < 1, given that the objective functions are strongly-convex, where k is the number of iterations. Moreover, ADD-OPT supports a wider and more realistic range of step-sizes compared with DEXTRA, the first algorithm to provide a linear convergence rate over directed graphs. In particular, the greatest lower bound of DEXTRA's step-size is strictly greater than zero, while that of ADD-OPT equals exactly zero. Simulation examples further illustrate the improvements.

Index Terms: Distributed optimization, directed graph, linear convergence, DEXTRA.

I. INTRODUCTION

Distributed consensus optimization problems are modeled as minimizing a global objective function by multiple agents over a network. Formally, we consider a decision variable z ∈ R^p and a strongly-connected network containing n agents, where each agent i has access only to a local objective function f_i : R^p → R. The goal is to have each agent minimize the sum of objectives, Σ_{i=1}^n f_i(z), by letting agents exchange information with their neighbors only. This formulation has gained great interest due to its widespread applications in, e.g., large-scale machine learning, [1, 2], model predictive control, [3], cognitive networks, [4, 5], source localization, [6, 7], resource scheduling, [8], and message routing, [9].
Most of the existing algorithms assume information exchange over undirected networks(1), where the communication between agents is bidirectional, i.e., if agent i can send information to agent j, then agent j can also send information to agent i. However, the distributed optimization problem over directed networks has wider applications than that over undirected graphs. For example, agents may broadcast at different power levels, implying communication capability in one direction but not the other. Moreover, an algorithm should continue to converge when slow communication links, which delay the whole network, are removed; the result is again a directed graph. In general, a directed topology is more flexible than an undirected one.

Related methods over undirected graphs are well established. A number of recent papers propose Distributed Gradient Descent (DGD), [10-13], which achieves O(ln k/√k) for general convex functions and O(ln k/k) when the objective functions are strongly-convex, where k is the number of iterations.

The authors are with the ECE Department at Tufts University, Medford, MA; chenguang.xi@tufts.edu, khan@ece.tufts.edu. This work has been partially supported by an NSF Career Award # CCF. (1) We use the terms network and graph interchangeably.

The convergence rates can be accelerated with an additional Lipschitz-continuous-gradient assumption. With the Lipschitz assumption, the work in [14] shows that DGD with a constant step-size converges at rate O(1/k) to an error neighborhood whose size depends on the step-size, and further converges linearly to an error neighborhood for strongly-convex functions. The distributed Nesterov method, [15], accelerates the convergence rate to O(ln k/k²) for general convex functions. EXTRA, [16], achieves exact convergence with rate O(1/k) for general convex functions and a linear rate for strongly-convex functions. The work in [17] improves EXTRA by relaxing the weighting matrices to be asymmetric.
Besides gradient-based methods, distributed implementations of ADMM, [18-20], also work well for this problem over undirected graphs.

We next survey the literature on directed graphs. Broadly, the following three approaches build on techniques for reaching average consensus(2) over directed graphs and extend them to distributed optimization. Gradient-Push (GP), [25-28], combines DGD, [10], with push-sum consensus, [29, 30]. Directed-Distributed Gradient Descent (D-DGD), [31, 32], extends Cai and Ishii's work, [33], by combining it with DGD. In [34], the authors apply the weight-balancing technique, [35], to DGD. These gradient-based methods, [25-28, 31, 32, 34], restricted by a diminishing step-size, converge relatively slowly, at O(ln k/√k). When the objective functions are strongly-convex, the convergence rate can be accelerated to O(ln k/k), [36]. A recent paper proposed a fast distributed algorithm, termed DEXTRA, [37, 38], to solve the distributed consensus optimization problem over directed graphs. By combining the push-sum protocol, [29, 30], and EXTRA, [16], DEXTRA achieves a linear convergence rate given that the objective functions are strongly-convex. However, one drawback of DEXTRA is its restrictive range of step-sizes. The greatest lower bound of DEXTRA's step-size is strictly greater than zero. In particular, a step-size, α, for which DEXTRA converges to the optimal solution must lie in an interval α ∈ (α̲, ᾱ), where α̲ and ᾱ denote the lower and upper bounds, respectively, and it holds that α̲ > 0. Considering that it is hard to estimate α̲ in a distributed setting, because its expression requires global knowledge, it remains an open problem how to pick a step-size, α ∈ (α̲, ᾱ), that guarantees DEXTRA's convergence. In contrast, if α̲ = 0, agents can pick an arbitrarily small value of α to ensure convergence. In this paper, we propose ADD-OPT to solve distributed optimization over directed graphs.
The algorithm achieves a linear convergence rate when the objective functions are strongly-convex. Compared to DEXTRA, ADD-OPT's step-size, α, lies in α ∈ (0, ᾱ), i.e., α̲ = 0. This guarantees ADD-OPT to be a more reliable algorithm in the distributed setting. We show that, after some derivation, ADD-OPT has a representation similar to DEXTRA's; from this point of view, it can be regarded as an extension of DEXTRA.

(2) See [21-24] for additional information on average-consensus problems.

The remainder of the paper is organized as follows. Section II describes and interprets ADD-OPT, and shows its relation to DEXTRA. We also analyze the performance of both ADD-OPT and DEXTRA when the step-size is zero. Section III presents the appropriate assumptions and states the main convergence results. In Section IV, we present some lemmas as the basis of the proof of ADD-OPT's convergence. The proof of the main results is provided in Section V. We show numerical results in Section VI, and Section VII contains the concluding remarks.

Basic Notations: We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. For a matrix, A = {a_ij}, a_ij denotes the element of A in the ith row and jth column. The matrix I_n represents the n × n identity, and 1_n and 0_n are the n-dimensional vectors of all ones and all zeros. The inner product of two vectors, x and y, is ⟨x, y⟩, or x'y. The Euclidean norm of any vector, x, is denoted by ||x||. For any differentiable function f(x), ∇f(x) denotes the gradient of f at x. The spectral radius of a matrix, A, is represented by ρ(A), and λ(A) denotes any eigenvalue of A.

II. ADD-OPT DEVELOPMENT

In this section, we formulate the optimization problem and describe ADD-OPT. We first derive an informal but intuitive argument showing that ADD-OPT pushes the agents to achieve consensus and reach the optimal solution of Problem P1. After proposing ADD-OPT, we rewrite it in a representation similar to DEXTRA's to show the relation between the two. The analysis helps reveal how to increase the range of step-sizes in DEXTRA by adjusting the weighting matrices. We also analyze the performance of both ADD-OPT and DEXTRA when the step-size is zero. Formal convergence results are left to Section III.
Consider a strongly-connected network of n agents communicating over a directed graph, G = (V, E), where V is the set of agents and E is the collection of ordered pairs, (i, j), i, j ∈ V, such that agent j can send information to agent i. Define N_i^in to be the collection of in-neighbors, i.e., the set of agents that can send information to agent i. Similarly, N_i^out is the set of out-neighbors of agent i. We allow both N_i^in and N_i^out to include the node i itself. Note that in a directed graph, (i, j) ∈ E does not necessarily imply (j, i) ∈ E. Consequently, N_i^in ≠ N_i^out, in general. We focus on solving a convex optimization problem that is distributed over the above multi-agent network. In particular, the network of agents cooperatively solves the following optimization problem:

P1: min f(z) = Σ_{i=1}^n f_i(z),

where each local objective function, f_i : R^p → R, is convex and differentiable, and known only by agent i. Our goal is to develop a distributed iterative algorithm such that each agent converges to the global solution of Problem P1.

A. ADD-OPT Algorithm

To solve Problem P1, we describe the implementation of ADD-OPT as follows. Each agent, j ∈ V, maintains three vector variables, x_{k,j}, z_{k,j}, w_{k,j} ∈ R^p, as well as a scalar variable, y_{k,j} ∈ R, where k is the discrete-time index. At the kth iteration, agent j weights its states, a_ij x_{k,j}, a_ij y_{k,j}, and a_ij w_{k,j}, and sends these to each of its out-neighbors, i ∈ N_j^out, where the weights, a_ij, are such that

a_ij > 0 if i ∈ N_j^out, a_ij = 0 otherwise, with Σ_{i=1}^n a_ij = 1 for all j. (1)

With agent i receiving the information from its in-neighbors, j ∈ N_i^in, it updates x_{k+1,i}, y_{k+1,i}, z_{k+1,i}, and w_{k+1,i} as follows:

x_{k+1,i} = Σ_{j ∈ N_i^in} a_ij x_{k,j} − α w_{k,i}, (2a)
y_{k+1,i} = Σ_{j ∈ N_i^in} a_ij y_{k,j}, (2b)
z_{k+1,i} = x_{k+1,i} / y_{k+1,i}, (2c)
w_{k+1,i} = Σ_{j ∈ N_i^in} a_ij w_{k,j} + ∇f_i(z_{k+1,i}) − ∇f_i(z_{k,i}). (2d)

In the above, ∇f_i(z_{k,i}) is the gradient of the function f_i(z) at z = z_{k,i}, for all k ≥ 0. The step-size, α, is a positive number within a certain interval. We will explicitly show the range of α in Section III.
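The per-agent updates in Eq. (2) can be simulated directly. The sketch below implements them in matrix form for the scalar case p = 1; the 3-agent directed ring and the quadratic objectives f_i(z) = (z − b_i)²/2 are illustrative choices, not from the paper:

```python
import numpy as np

def add_opt(A, grads, z0, alpha, iters):
    """Sketch of the ADD-OPT iteration, Eq. (2), for p = 1.

    A     : n x n column-stochastic weight matrix
    grads : list of n gradient functions, grads[i](z) -> float
    z0    : length-n array of initial estimates
    """
    n = len(z0)
    x = np.asarray(z0, dtype=float)
    y = np.ones(n)                                   # y_{0,i} = 1
    z = x / y
    grad = np.array([g(zi) for g, zi in zip(grads, z)])
    w = grad.copy()                                  # w_{0,i} = grad f_i(z_{0,i})
    for _ in range(iters):
        x = A @ x - alpha * w                        # Eq. (2a)
        y = A @ y                                    # Eq. (2b)
        z = x / y                                    # Eq. (2c)
        new_grad = np.array([g(zi) for g, zi in zip(grads, z)])
        w = A @ w + new_grad - grad                  # Eq. (2d)
        grad = new_grad
    return z

# Column-stochastic weights on a directed ring with self-loops (illustrative).
A = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
b = [1.0, 2.0, 6.0]
grads = [lambda z, bi=bi: z - bi for bi in b]        # grad of (z - b_i)^2 / 2
z_final = add_opt(A, grads, b, alpha=0.05, iters=600)
# the sum of the f_i is minimized at the mean of b, so every z_i should approach 3
```

With a sufficiently small constant step-size, all agents' estimates agree and settle at the minimizer of the sum, consistent with the linear convergence claimed in Section III.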
For any agent i, the algorithm is initiated with an arbitrary vector, x_{0,i}, and with w_{0,i} = ∇f_i(z_{0,i}) and y_{0,i} = 1. We note that the implementation of Eq. (2) requires each agent to at least know its out-degree, so that it can design the weights according to Eq. (1); see [25-28, 31, 32, 34, 37] for similar assumptions.

To simplify the analysis, we assume from now on that all sequences updated by Eq. (2) have only one dimension, i.e., p = 1; thus x_{k,i}, y_{k,i}, w_{k,i}, z_{k,i} ∈ R for all i, k. For x_i, w_i, z_i ∈ R^p being p-dimensional vectors, the proof is the same for every dimension by applying the results to each coordinate. Therefore, assuming p = 1 is without loss of generality.

We next write Eq. (2) in matrix form. Define x_k, y_k, w_k, z_k, ∇f_k ∈ R^n as x_k = [x_{k,1}, ..., x_{k,n}]', y_k = [y_{k,1}, ..., y_{k,n}]', w_k = [w_{k,1}, ..., w_{k,n}]', z_k = [z_{k,1}, ..., z_{k,n}]', and ∇f_k = [∇f_1(z_{k,1}), ..., ∇f_n(z_{k,n})]'. Let A = {a_ij} ∈ R^{n×n} be the collection of weights a_ij. It is clear that A is a column-stochastic matrix. Define a diagonal matrix, Y_k ∈ R^{n×n}, for each k, as

Y_k = diag(y_k). (3)

Given that the graph, G, is strongly-connected and the corresponding weighting matrix, A, is non-negative, it follows that Y_k is invertible for any k. Then, we can write Eq. (2) equivalently in matrix form as follows:

x_{k+1} = A x_k − α w_k, (4a)
y_{k+1} = A y_k, (4b)
z_{k+1} = Y_{k+1}^{-1} x_{k+1}, (4c)

w_{k+1} = A w_k + ∇f_{k+1} − ∇f_k, (4d)

where, similarly, we have the initial condition w_0 = ∇f_0.

B. Interpretation of ADD-OPT

Based on Eq. (4), we now give an intuitive interpretation of the convergence of ADD-OPT to the optimal solution. By combining Eqs. (4a) and (4d), we obtain

x_{k+1} = A x_k − α[A w_{k−1} + ∇f_k − ∇f_{k−1}]
        = A x_k − A(A x_{k−1} − x_k) − α[∇f_k − ∇f_{k−1}]
        = 2A x_k − A² x_{k−1} − α[∇f_k − ∇f_{k−1}], (5)

where the second equality uses α w_{k−1} = A x_{k−1} − x_k from Eq. (4a). Assume that the sequences generated by Eq. (4) converge to limits (not necessarily true), denoted by x_∞, y_∞, w_∞, z_∞, ∇f_∞, respectively. It follows from Eq. (5) that

x_∞ = 2A x_∞ − A² x_∞ − α[∇f_∞ − ∇f_∞], (6)

which implies that (I_n − A)² x_∞ = 0_n, or x_∞ ∈ span{y_∞}, considering that y_∞ = A y_∞. Therefore, we obtain

z_∞ = Y_∞^{-1} x_∞ ∈ span{1_n}, (7)

i.e., consensus is reached. By summing the updates in Eq. (5) over k from 0 to ∞, we obtain

x_∞ = A x_∞ + Σ_{r=1}^∞ (A − I_n) x_r − Σ_{r=1}^∞ (A² − A) x_r − α ∇f_∞.

Noting that x_∞ = A x_∞, it follows that

α ∇f_∞ = Σ_{r=1}^∞ (A − I_n) x_r − Σ_{r=1}^∞ (A² − A) x_r.

Therefore, we obtain

α 1_n' ∇f_∞ = 1_n'(A − I_n) Σ_{r=1}^∞ x_r − 1_n'(A² − A) Σ_{r=1}^∞ x_r = 0,

which is the optimality condition of Problem P1, considering that z_∞ ∈ span{1_n}. To conclude: if we assume the sequences updated by Eq. (4) have limits, x_∞, y_∞, w_∞, z_∞, ∇f_∞, then z_∞ achieves consensus and reaches the optimal solution of Problem P1. We next discuss the relation between ADD-OPT and DEXTRA.

C. ADD-OPT and DEXTRA

A recent paper proposed a fast distributed algorithm, termed DEXTRA, [37, 38], to solve the corresponding distributed optimization problem over directed graphs. It achieves a linear convergence rate given that the objective functions are strongly-convex. At the kth iteration, each agent i keeps and updates three states, x_{k,i}, y_{k,i}, and z_{k,i}. The iteration, in matrix form, is as follows:

x_{k+1} = (I_n + A) x_k − Ã x_{k−1} − α[∇f_k − ∇f_{k−1}], (8a)
y_{k+1} = A y_k, (8b)
z_{k+1} = Y_{k+1}^{-1} x_{k+1}, (8c)

where à is a column-stochastic matrix satisfying à = θ I_n + (1 − θ)A for some θ ∈ (0, 1/2], and all other notation keeps the definitions given earlier in the paper. By comparing Eqs.
(5) and (8a), (4b) and (8b), and (4c) and (8c), it follows that the only difference between ADD-OPT and DEXTRA lies in the weighting matrices used when updating x_k. From DEXTRA to ADD-OPT, we change (I_n + A) in (8a) to 2A in (5), and à to A², respectively. Mathematically, if A = I_n, the two algorithms become the same. ADD-OPT can therefore be regarded as an extended version of DEXTRA, in that, as we will show in Section III, it has a wider range of step-sizes compared to DEXTRA: the greatest lower bound, α̲, of ADD-OPT's step-size is zero, while that of DEXTRA's is positive. This also reveals why in DEXTRA we prefer constructing A to be an extremely diagonally-dominant matrix (see Assumption A2(c) in [37]). The more similar A is to I_n, the closer α̲ approaches zero. However, in DEXTRA α̲ can never reach zero, since A cannot be the identity, I_n, which would mean there is no communication between agents. Therefore, ADD-OPT cannot be regarded as a special case of DEXTRA, since A ≠ I_n. In Section V, we provide a proof of ADD-OPT's linear convergence rate that is entirely different from, and much more compact than, DEXTRA's.

Note that the implementation of Eq. (5) requires agents to communicate with the neighbors of their own neighbors (because of A²), which takes two iterations for each agent. This explains why ADD-OPT needs to keep and update four variables, x_k, y_k, w_k, and z_k, compared with DEXTRA, which only has three: x_k, y_k, and z_k. It can be considered a trade-off between increasing the step-size range and decreasing the number of variables.

D. Interpretation of Algorithms when α = 0

For any gradient-based method, setting α = 0 is mathematically equivalent to ∇f_k = 0_n for all k, which means all local objective functions are constants. The methods are thus simplified to having agents reach consensus only. We now discuss whether DEXTRA can push all agents to consensus if we force the step-size to be zero.
Assume that the sequences generated by Eq. (8) converge to limits (not necessarily true), denoted by x_∞, y_∞, z_∞, and ∇f_∞, respectively. Consider Eq. (8a) with α = 0:

x_{k+1} = (I_n + A) x_k − Ã x_{k−1}. (9)

By summing Eq. (9) over k from 0 to ∞, we obtain

x_∞ = A x_∞ − Σ_{r=1}^∞ (Ã − A) x_r. (10)

According to the previous analysis, Eqs. (6) to (7), we know that reaching consensus, i.e., z_∞ ∈ span{1_n}, is equivalent to satisfying x_∞ ∈ span{y_∞}, which requires

Σ_{r=1}^∞ (Ã − A) x_r = 0_n. (11)

Since x_0 is arbitrary, and (Ã − A) x = 0_n holds only for x satisfying x = Ax, we have that, in general,

Σ_{r=1}^∞ (Ã − A) x_r ≠ 0_n.

This is a contradiction. Therefore, DEXTRA does not push all agents to consensus when α = 0. We now consider the performance of ADD-OPT when α = 0. We again assume that the sequences generated by Eq. (4) converge to limits (not necessarily true), denoted by x_∞, y_∞, w_∞, z_∞, ∇f_∞, respectively. In fact, it is straightforward to observe that Eq. (4) with α = 0 is exactly the push-sum consensus protocol, [29, 30], which pushes agents to reach average consensus over a directed graph. Therefore, ADD-OPT converges to the optimal solution when α = 0. To better compare ADD-OPT with DEXTRA, we analyze ADD-OPT using derivations similar to those for DEXTRA above. By summing the updates in Eq. (5) over k from 0 to ∞, we obtain

x_∞ = A x_∞ + A(I_n − A) x_0 − Σ_{r=1}^∞ (I_n − A)² x_r, (12)

which, considering the condition x_∞ = A x_∞, it is sufficient to satisfy with

Σ_{r=1}^∞ (A − I_n) x_r = −A x_0. (13)

Comparing Eq. (13), from the derivation of ADD-OPT, with Eq. (11), from DEXTRA, we see that ADD-OPT works when α = 0 because of the additional term A x_0: the infinite sum Σ_{r=1}^∞ (A − I_n) x_r accumulates to compensate this initial term. In the next section, we state the convergence result with appropriate assumptions.

III. ASSUMPTIONS, NOTATIONS, AND MAIN RESULT

With appropriate assumptions, our main result states that ADD-OPT converges to the optimal solution of Problem P1 linearly. We state again that from now on we assume that the states of agents have only one dimension, i.e., p = 1, which is without loss of generality. In this paper, we assume that the agent graph, G, is strongly-connected; each local function, f_i(z), is convex and differentiable; and the optimal solution of Problem P1 and the corresponding optimal value exist. Formally, we denote the optimal solution by z* and the optimal value by f*, i.e., f* = f(z*) = min_{z ∈ R} f(z).
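The α = 0 observation above, that Eq. (4) with a zero step-size reduces to push-sum consensus, can be sketched in a few lines; the ring weights below are an illustrative choice:

```python
import numpy as np

def push_sum(A, x0, iters):
    # Eq. (4) with alpha = 0: the gradient terms drop out of the x-update,
    # leaving the push-sum ratio iteration z_k = Y_k^{-1} x_k.
    x = np.asarray(x0, dtype=float)
    y = np.ones(len(x))
    for _ in range(iters):
        x = A @ x          # Eq. (4a) with alpha = 0
        y = A @ y          # Eq. (4b)
    return x / y           # Eq. (4c)

# Column-stochastic weights on a directed ring with self-loops (illustrative).
A = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
z = push_sum(A, [1.0, 2.0, 6.0], 100)
# each entry approaches the average of the initial values, here 3
```

Because A is column-stochastic, 1'x_k is conserved while the ratio x_k/y_k washes out the (generally non-uniform) stationary distribution of A, which is exactly why the directed graph poses no obstacle at α = 0.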
Besides the above, we formally present an assumption on the gradients of the objective functions, which is standard for smooth, strongly-convex functions, [14, 16, 37].

Assumption A1 (Lipschitz continuous gradients and strong convexity). Each private function, f_i, is differentiable and strongly-convex, and its gradient is Lipschitz continuous, i.e., for any i and z_1, z_2 ∈ R,
(a) there exists a positive constant l such that ||∇f_i(z_1) − ∇f_i(z_2)|| ≤ l ||z_1 − z_2||;
(b) there exists a positive constant s such that s ||z_1 − z_2||² ≤ ⟨∇f_i(z_1) − ∇f_i(z_2), z_1 − z_2⟩.

With these assumptions, we are able to present ADD-OPT's convergence result, whose statement is based on the following notation. Based on the earlier notation, x_k, w_k, and ∇f_k, we further define x̄_k, w̄_k, z*_n, g_k, h_k ∈ R^n as

x̄_k = (1/n) 1_n 1_n' x_k, (14)
w̄_k = (1/n) 1_n 1_n' w_k, (15)
z*_n = z* 1_n, (16)
g_k = (1/n) 1_n 1_n' ∇f_k, (17)
h_k = (1/n) 1_n 1_n' ∇f(x̄_k), (18)

where ∇f(x̄_k) = [∇f_1((1/n) 1_n' x_k), ..., ∇f_n((1/n) 1_n' x_k)]'. We denote the constants τ, ɛ, and η by

τ = ||A − I_n||, (19)
ɛ = ||I_n − A_∞||, (20)
η = max(|1 − αl|, |1 − αs|), (21)

where A is the column-stochastic weighting matrix used in Eq. (4), A_∞ = lim_{k→∞} A^k represents its limit, α is the step-size, and l and s are, respectively, the Lipschitz-gradient constant and the strong-convexity constant in Assumption A1. Let Y_∞ be the limit of Y_k in Eq. (3),

Y_∞ = lim_{k→∞} Y_k, (22)

and let y and ỹ be the maxima of ||Y_k|| and ||Y_k^{-1}|| over k, respectively,

y = max_k ||Y_k||, (23)
ỹ = max_k ||Y_k^{-1}||. (24)

Moreover, we define two constants, σ and γ_1, through the following two lemmas, which are related to the convergence of A^k and Y_k. Note that Lemmas 1 and 2 are reformulated with the notation introduced above to simplify the proof.

Lemma 1 (Nedic et al. [25]). Consider Y_k, generated from the column-stochastic matrix, A, and its limit Y_∞. There exist 0 < γ_1 < 1 and 0 < T < ∞ such that, for all k,

||Y_k − Y_∞|| ≤ T γ_1^k. (25)

Lemma 2 (Olshevsky et al. [39]). Consider Y_∞ in Eq. (22), and A, the column-stochastic matrix used in Eq. (4). For any a ∈ R^n, define ā = (1/n) 1_n 1_n' a.
There exists 0 < σ < 1 such that, for all a ∈ R^n,

||A a − Y_∞ ā|| ≤ σ ||a − Y_∞ ā||. (26)

Based on the above notation, we finally define t_k, s_k ∈ R³ and G, H_k ∈ R^{3×3}, for all k, as

t_k = [ ||x_k − Y_∞ x̄_k|| ; ||x̄_k − z*_n|| ; ||w_k − Y_∞ g_k|| ],  s_k = [ ||x_k|| ; 0 ; 0 ],

G = [ σ , 0 , α ;
      αlỹ , η , 0 ;
      ɛlτỹ + αɛl²yỹ² , αɛl²yỹ , σ + αɛlỹ ],

H_k = [ 0 , 0 , 0 ;
        αlỹT γ_1^k , 0 , 0 ;
        (αly + 2)ɛlỹ²T γ_1^k , 0 , 0 ]. (27)

We now state the key relation of this paper.

Theorem 1. Let Assumption A1 hold. The following inequality holds, entrywise, for all k ≥ 1:

t_k ≤ G t_{k−1} + H_{k−1} s_{k−1}. (28)

Proof. See Section V.

Eq. (28) in Theorem 1 is the key relation of this paper. We leave the complete proof to Section V, with the help of several auxiliary relations in Section IV. Note that Eq. (28) provides a linear iterative relation between t_k and t_{k−1} with the matrices G and H_{k−1}. Thus, the convergence of t_k is fully determined by G and H_k. More specifically, to prove a linear convergence rate of t_k to zero, it is sufficient to show that ρ(G) < 1, where ρ(·) denotes the spectral radius, together with the linear decay of H_k, which is straightforward since 0 < γ_1 < 1. In Lemma 3, we first show that, with an appropriate step-size, the spectral radius of G is less than 1. Following Lemma 3, we show the linear convergence of G^k and H_k in Lemma 4.

Lemma 3. Consider the matrix, G_α, defined in Eq. (27) as a function of the step-size, α. It follows that ρ(G_α) < 1 if the step-size α ∈ (0, ᾱ), where

ᾱ = [ sqrt(ɛ²s²(τ + 1 − σ)² + 4ɛy(l + s)s(1 − σ)²) − ɛs(τ + 1 − σ) ] / [ 2ɛlyỹ(l + s) ]. (29)

Proof. It is easy to verify that ᾱ < 1/l; as a result, for α ∈ (0, ᾱ) we have η = 1 − αs. When α = 0, we have

G_0 = [ σ , 0 , 0 ; 0 , 1 , 0 ; ɛlτỹ , 0 , σ ], (30)

whose eigenvalues are σ, σ, and 1. Therefore, ρ(G_0) = 1, where ρ(·) denotes the spectral radius. We now consider how the eigenvalue 1 changes as we slightly increase α from 0. Denote by P_{G_α}(q) = det(qI_n − G_α) the characteristic polynomial of G_α. Setting det(qI_n − G_α) = 0, we get the equation

((q − σ)² − αɛlỹ(q − σ))(q − 1 + αs) − α³ɛl³yỹ² − α(q − 1 + αs)(ɛlτỹ + αɛl²yỹ²) = 0. (31)

Since we have already shown that 1 is an eigenvalue of G_0, Eq. (31) holds at q = 1 and α = 0. Taking the derivative on both sides of Eq. (31) with respect to α, and evaluating at q = 1 and α = 0, we obtain

dq/dα |_{α=0, q=1} = −s < 0.

That is, when α increases slightly from 0, ρ(G_α) will first decrease.
We now calculate all possible values of α for which λ(G_α) = 1. Setting q = 1 in Eq. (31) and solving for the step-size, α, we obtain α_1 = 0, α_2 < 0, and α_3 = ᾱ, with ᾱ defined in Eq. (29). Since there is no other value of α for which λ(G_α) = 1, and eigenvalues are continuous functions of the matrix entries, all eigenvalues of G_α are less than 1 in magnitude for α ∈ (0, ᾱ). Therefore, ρ(G_α) < 1 when α ∈ (0, ᾱ).

Lemma 4. With the step-size α ∈ (0, ᾱ), where ᾱ is defined in Eq. (29), the following statements hold for all k:
(a) there exist 0 < γ_1 < 1 and 0 < Γ_1 < ∞, where γ_1 is defined in Eq. (25), such that ||H_k|| = Γ_1 γ_1^k;
(b) there exist 0 < γ_2 < 1 and 0 < Γ_2 < ∞ such that ||G^k|| ≤ Γ_2 γ_2^k;
(c) with γ = max{γ_1, γ_2} and Γ = Γ_1 Γ_2 / γ, it holds that ||G^{k−r−1}|| ||H_r|| ≤ Γ γ^k for all 0 ≤ r ≤ k − 1.

Proof. (a) This is easy to verify from Eq. (27), with Γ_1 = lỹT sqrt(α² + (αly + 2)² ɛ² ỹ²).
(b) We represent G in the Jordan canonical form, G = PJQ. According to Lemma 3, all diagonal entries of J are smaller than one. Therefore, there exist 0 < Γ_2 < ∞ and 0 < γ_2 < 1 such that

||G^k|| ≤ ||P|| ||Q|| ||J^k|| ≤ Γ_2 γ_2^k. (32)

(c) The claim follows by combining (a) and (b): ||G^{k−r−1}|| ||H_r|| ≤ Γ_2 γ_2^{k−r−1} Γ_1 γ_1^r ≤ Γ_1 Γ_2 γ^{k−1} = Γ γ^k.

We now present the main result of this paper in Theorem 2, which shows the linear convergence rate of ADD-OPT.

Theorem 2. Let Assumption A1 hold. With the step-size α ∈ (0, ᾱ), where ᾱ is defined in Eq. (29), the sequence, {z_k}, generated by ADD-OPT converges exactly to the optimal solution, z*, at a linear rate, i.e., there exist a bounded constant M > 0 and γ < µ < 1, where γ is defined in Lemma 4(c), such that for any k,

||z_k − z*_n|| ≤ M µ^k. (33)

Proof. Writing Eq. (28) recursively results in

t_k ≤ G^k t_0 + Σ_{r=0}^{k−1} G^{k−r−1} H_r s_r. (34)

Taking norms on both sides of Eq. (34) and applying Lemma 4, we obtain

||t_k|| ≤ ||G^k|| ||t_0|| + Σ_{r=0}^{k−1} ||G^{k−r−1}|| ||H_r|| ||s_r|| ≤ Γ_2 γ_2^k ||t_0|| + Γ γ^k Σ_{r=0}^{k−1} ||s_r||, (35)

in which we can bound ||s_r|| as

||s_r|| = ||x_r|| ≤ ||x_r − Y_∞ x̄_r|| + ||Y_∞ x̄_r − Y_∞ z*_n|| + ||Y_∞ z*_n|| ≤ (1 + y) ||t_r|| + y ||z*_n||. (36)

Therefore, we have, for all k,

||t_k|| ≤ ( Γ_2 ||t_0|| + Γ(1 + y) Σ_{r=0}^{k−1} ||t_r|| + Γ k y ||z*_n|| ) γ^k. (37)

Our first step is to show that ||t_k|| is bounded for all k. Indeed, there exists some bounded K > 0 such that, for all k > K,

(Γ_2 + Γ(1 + 2y) k) γ^k ≤ 1. (38)

Define Φ = max_{0 ≤ k ≤ K} (||t_k||, ||z*_n||), which is bounded since K is bounded. Then ||t_k|| ≤ Φ for 0 ≤ k ≤ K. Consider the case k = K + 1. By combining Eqs. (37) and (38), we have

||t_{K+1}|| ≤ Φ (Γ_2 + Γ(1 + 2y)(K + 1)) γ^{K+1} ≤ Φ. (39)

Repeating this procedure shows that ||t_k|| ≤ Φ for all k. The next step is to show that ||t_k|| decays linearly. For any µ satisfying γ < µ < 1, there exists a constant U < ∞ such that k (γ/µ)^k ≤ U for all k. Therefore, by bounding all ||t_r|| and ||z*_n|| by Φ in Eq. (37), we obtain, for all k,

||t_k|| ≤ Φ (Γ_2 + Γ(1 + 2y) k) γ^k ≤ Φ (Γ_2 + Γ(1 + 2y) U) µ^k. (40)

It follows that ||z_k − z*_n|| and ||t_k|| satisfy the relation

||z_k − z*_n|| ≤ ỹ(1 + y) ||t_k|| + ỹ T γ_1^k ||z*_n||, (41)

where we use the relation ||Y_k^{-1} Y_∞ − I_n|| ≤ ||Y_k^{-1}|| ||Y_∞ − Y_k|| ≤ ỹ T γ_1^k obtained from Eq. (25). By combining Eqs. (40) and (41), we obtain

||z_k − z*_n|| ≤ ỹ Φ [ (1 + y)(Γ_2 + Γ(1 + 2y) U) + T ] µ^k.

The desired result follows by letting M = ỹ Φ [ (1 + y)(Γ_2 + Γ(1 + 2y) U) + T ].

Theorem 2 shows the linear convergence rate of ADD-OPT. In Sections IV and V, we prove Theorem 1.

IV. AUXILIARY RELATIONS

We provide several basic relations in this section, which will help the proof of Theorem 1. Lemma 5 derives iterative equations that govern the average sequences x̄_k and w̄_k. Lemma 6 gives inequalities that are direct consequences of Eq. (25). Lemma 7 can be found in standard optimization literature, [40]. It states that if we perform a gradient-descent step with a fixed step-size for a strongly-convex and smooth function, then the distance to the optimizer shrinks by at least a fixed ratio.

Lemma 5. The following equations hold for all k:
(a) w̄_k = g_k;
(b) x̄_{k+1} = x̄_k − α g_k.

Proof. Since A is column-stochastic, satisfying 1_n' A = 1_n', we obtain

w̄_k = (1/n) 1_n 1_n' (A w_{k−1} + ∇f_k − ∇f_{k−1}) = w̄_{k−1} + g_k − g_{k−1}.
Applying this recursively, we have w̄_k = w̄_0 + g_k − g_0. Recall the initial condition w_0 = ∇f_0, which is equivalent to w̄_0 = g_0. Hence, we obtain the result in (a). The proof of (b) follows from the derivation

x̄_{k+1} = (1/n) 1_n 1_n' (A x_k − α w_k) = x̄_k − α w̄_k = x̄_k − α g_k,

where the last equality uses the result of (a). The proof is complete.

Lemma 6. The following inequalities hold for all k ≥ 1:
(a) ||Y_k^{-1} Y_∞ − I_n|| ≤ ỹ T γ_1^k;
(b) ||Y_{k+1}^{-1} − Y_k^{-1}|| ≤ 2 ỹ² T γ_1^k.

Proof. By considering Eq. (25), it follows that

||Y_k^{-1} Y_∞ − I_n|| ≤ ||Y_k^{-1}|| ||Y_∞ − Y_k|| ≤ ỹ T γ_1^k.

The proof of (b) follows from

||Y_{k+1}^{-1} − Y_k^{-1}|| ≤ ||Y_{k+1}^{-1} − Y_∞^{-1}|| + ||Y_∞^{-1} − Y_k^{-1}|| ≤ 2 ỹ² T γ_1^k.

This finishes the proof.

Lemma 7. Let Assumption A1 hold for the objective function, f(z), in P1, and let s and l be the strong-convexity constant and the Lipschitz-gradient constant, respectively. For any z ∈ R, define z+ = z − α ∇f(z), where 0 < α < 2/l. Then

||z+ − z*|| ≤ η ||z − z*||, where η = max(|1 − αl|, |1 − αs|).

V. CONVERGENCE ANALYSIS

The proof of Theorem 1 is provided in this section. We will bound ||x_k − Y_∞ x̄_k||, ||x̄_k − z*_n||, and ||w_k − Y_∞ g_k|| by linear combinations of their values at the previous iteration, i.e., ||x_{k−1} − Y_∞ x̄_{k−1}||, ||x̄_{k−1} − z*_n||, and ||w_{k−1} − Y_∞ g_{k−1}||, as well as ||x_{k−1}||. The coefficients will be shown to be the entries of G and H_{k−1}.

Step 1: Bound ||x_k − Y_∞ x̄_k||. According to Eq. (4a) and Lemma 5(b), we obtain

||x_k − Y_∞ x̄_k|| ≤ ||A x_{k−1} − Y_∞ x̄_{k−1}|| + α ||w_{k−1} − Y_∞ g_{k−1}||. (42)

Noting that ||A x_{k−1} − Y_∞ x̄_{k−1}|| ≤ σ ||x_{k−1} − Y_∞ x̄_{k−1}|| from Eq. (26), we have

||x_k − Y_∞ x̄_k|| ≤ σ ||x_{k−1} − Y_∞ x̄_{k−1}|| + α ||w_{k−1} − Y_∞ g_{k−1}||. (43)

Step 2: Bound ||x̄_k − z*_n||. By Lemma 5(b), we obtain

x̄_k = [x̄_{k−1} − α h_{k−1}] − α [g_{k−1} − h_{k−1}]. (44)

Let x̄+_{k−1} = x̄_{k−1} − α h_{k−1}, which performs a (centralized) gradient-descent step to minimize the objective function of Problem P1. Therefore, according to Lemma 7,

||x̄+_{k−1} − z*_n|| ≤ η ||x̄_{k−1} − z*_n||. (45)

By applying the Lipschitz-continuous-gradient assumption, Assumption A1(a), we obtain

||g_k − h_k|| ≤ ||(1/n) 1_n 1_n'|| l ||z_k − x̄_k|| ≤ l ||z_k − x̄_k||. (46)

Therefore, it follows that

||x̄_k − z*_n|| ≤ ||x̄+_{k−1} − z*_n|| + α ||g_{k−1} − h_{k−1}|| ≤ η ||x̄_{k−1} − z*_n|| + αl ||z_{k−1} − x̄_{k−1}||. (47)

Note that, by Eq. (4c) and Lemma 6(a), it follows that

||z_k − x̄_k|| ≤ ||Y_k^{-1}(x_k − Y_∞ x̄_k)|| + ||(Y_k^{-1} Y_∞ − I_n) x̄_k|| ≤ ỹ ||x_k − Y_∞ x̄_k|| + ỹ T γ_1^k ||x_k||, (48)

where in the second inequality we also use the relation ||x̄_k|| ≤ ||x_k||. By substituting Eq. (48) into Eq. (47), we obtain

||x̄_k − z*_n|| ≤ αlỹ ||x_{k−1} − Y_∞ x̄_{k−1}|| + η ||x̄_{k−1} − z*_n|| + αlỹ T γ_1^{k−1} ||x_{k−1}||. (49)

Step 3: Bound ||w_k − Y_∞ g_k||. According to Eq. (4d), we have

||w_k − Y_∞ g_k|| ≤ ||A w_{k−1} − Y_∞ g_{k−1}|| + ||(∇f_k − ∇f_{k−1}) − Y_∞ (g_k − g_{k−1})||.

With Lemma 5(a) and Eq. (26), we obtain

||A w_{k−1} − Y_∞ g_{k−1}|| = ||A w_{k−1} − Y_∞ w̄_{k−1}|| ≤ σ ||w_{k−1} − Y_∞ g_{k−1}||. (50)

It follows from the definition of g_k that

(∇f_k − ∇f_{k−1}) − Y_∞ (g_k − g_{k−1}) = (I_n − (1/n) Y_∞ 1_n 1_n')(∇f_k − ∇f_{k−1}). (51)

Since (1/n) Y_∞ 1_n 1_n' = A_∞, where A_∞ = lim_{k→∞} A^k, we obtain

||(∇f_k − ∇f_{k−1}) − Y_∞ (g_k − g_{k−1})|| ≤ ɛ l ||z_k − z_{k−1}||,

where in the preceding relation we use the Lipschitz-continuous-gradient assumption, Assumption A1(a). Therefore, we have

||w_k − Y_∞ g_k|| ≤ σ ||w_{k−1} − Y_∞ g_{k−1}|| + ɛ l ||z_k − z_{k−1}||. (52)

We now bound ||z_k − z_{k−1}||. Note that, since 1_n' ∇f(z*_n) = 0,

||h_k|| = ||(1/n) 1_n 1_n' ∇f(x̄_k)|| ≤ l ||x̄_k − z*_n||. (53)

As a result, we have

||Y_k^{-1} w_{k−1}|| ≤ ||Y_k^{-1}(w_{k−1} − Y_∞ g_{k−1})|| + ||Y_k^{-1} Y_∞ (g_{k−1} − h_{k−1})|| + ||Y_k^{-1} Y_∞ h_{k−1}||
≤ ỹ ||w_{k−1} − Y_∞ g_{k−1}|| + yỹl ||z_{k−1} − x̄_{k−1}|| + yỹl ||x̄_{k−1} − z*_n||
≤ ỹ ||w_{k−1} − Y_∞ g_{k−1}|| + yỹl ||x̄_{k−1} − z*_n|| + yỹ²l ||x_{k−1} − Y_∞ x̄_{k−1}|| + yỹ²lT γ_1^{k−1} ||x_{k−1}||, (54)

where the last inequality uses Eq. (48). With the upper bound on ||Y_k^{-1} w_{k−1}|| in the preceding relation, and noting that (A − I_n) Y_∞ x̄_{k−1} = 0_n, we can bound ||z_k − z_{k−1}|| as follows:

||z_k − z_{k−1}|| ≤ ||Y_k^{-1}(x_k − x_{k−1})|| + ||(Y_k^{-1} − Y_{k−1}^{-1}) x_{k−1}||
≤ ||Y_k^{-1}(A − I_n)(x_{k−1} − Y_∞ x̄_{k−1})|| + α ||Y_k^{-1} w_{k−1}|| + ||(Y_k^{-1} − Y_{k−1}^{-1}) x_{k−1}||
≤ (τỹ + αyỹ²l) ||x_{k−1} − Y_∞ x̄_{k−1}|| + αỹ ||w_{k−1} − Y_∞ g_{k−1}|| + αyỹl ||x̄_{k−1} − z*_n|| + (αyl + 2) ỹ² T γ_1^{k−1} ||x_{k−1}||. (55)

By substituting Eq. (55) into Eq. (52), we obtain

||w_k − Y_∞ g_k|| ≤ (ɛlτỹ + αɛl²yỹ²) ||x_{k−1} − Y_∞ x̄_{k−1}|| + αɛl²yỹ ||x̄_{k−1} − z*_n|| + (σ + αɛlỹ) ||w_{k−1} − Y_∞ g_{k−1}|| + (αyl + 2) ɛlỹ² T γ_1^{k−1} ||x_{k−1}||. (56)

Step 4: Combining Eq. (43) in Step 1, Eq. (49) in Step 2, and Eq. (56) in Step 3 completes the proof.

VI.
NUMERICAL EXPERIMENTS

In this section, we compare the performance of algorithms solving the distributed consensus optimization problem over directed graphs, including ADD-OPT, DEXTRA [37], Gradient-Push [25], Directed-Distributed Gradient Descent [31], and Weight-Balancing Distributed Gradient Descent [34]. Our numerical experiments are based on the distributed logistic regression problem over a directed graph:

z* = argmin_{z ∈ R^p} (β/2) ||z||² + Σ_{i=1}^n Σ_{j=1}^{m_i} ln[1 + exp(−b_ij c_ij' z)],

where each agent i has access to m_i training examples, (c_ij, b_ij) ∈ R^p × {−1, +1}, in which c_ij contains the p features of the jth training example of agent i, and b_ij is the corresponding label. This problem can be formulated in the form of P1 with the private objective function f_i being

f_i = (β/(2n)) ||z||² + Σ_{j=1}^{m_i} ln[1 + exp(−b_ij c_ij' z)].
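The private objective f_i above is smooth and, through the regularization term, strongly-convex, so Assumption A1 applies. A sketch of its gradient follows; the helper name `local_logistic_grad` is ours, not the paper's:

```python
import numpy as np

def local_logistic_grad(z, C, b, beta, n):
    """Gradient of the private objective
       f_i(z) = beta/(2n) ||z||^2 + sum_j ln(1 + exp(-b_ij * c_ij' z)).

    C : (m_i, p) matrix whose rows are the features c_ij
    b : (m_i,) labels in {-1, +1}
    """
    margins = b * (C @ z)
    # d/dz ln(1 + exp(-m_j)) = -b_j * c_j / (1 + exp(m_j))
    return (beta / n) * z - C.T @ (b / (1.0 + np.exp(margins)))
```

A quick finite-difference check against the scalar objective confirms the formula; plugging this gradient into the ADD-OPT iteration of Eq. (2) reproduces the experimental setup described below.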

In our setting, we have n = 10, m_i = 10 for all i, and p = 3. The network topology is described in Fig. 1.

Fig. 1: A strongly-connected directed network.

In the implementation of the algorithms, we apply the local degree weighting strategy to all of them, i.e., each agent assigns equal weights to itself and its out-neighbors according to its own out-degree. In our first experiment, we compare the convergence rates of ADD-OPT and the other methods designed for directed graphs. The step-size used in Gradient-Push, Directed-Distributed Gradient Descent, and Weight-Balancing Distributed Gradient Descent is α_k = 1/k. The constant step-size used in DEXTRA and ADD-OPT is α = 1. It can be seen that ADD-OPT and DEXTRA have a fast, linear convergence rate, while the other methods are sublinear. The convergence-rate comparison between the different algorithms is shown in Fig. 2.

Fig. 2: Convergence rates between optimization methods for directed networks.

The second experiment compares ADD-OPT and DEXTRA in terms of their step-size ranges; see Fig. 3.

Fig. 3: Comparison between ADD-OPT and DEXTRA in terms of step-size ranges.

It can be found in Fig. 4 that the practical upper bound of the step-size is much bigger. In the implementation of ADD-OPT, when trying to estimate ᾱ, we set α ≤ 1/(10l), assuming that l is not global knowledge; this is an empirical estimate. Finally, we also show that the relation between convergence speed and step-size does not simply satisfy any linear function; this can also be found in Fig. 4.

Fig. 4: The range of ADD-OPT's step-size.
We stick to the same local degree weighting strategy for both ADD-OPT and DEXTRA. It is shown in Fig. 3 that the greatest lower bound of DEXTRA's step-size is around α̲ = 0.2. Since the value of α̲ requires global knowledge, it is hard for agents to estimate it in a distributed implementation. Once agents pick some value α < 0.2, DEXTRA diverges. In contrast, agents implementing ADD-OPT can pick arbitrarily small values to ensure convergence. According to the setting, we can calculate that τ = 1.25, ɛ = 1.11, y = 1.96, ỹ = 2.2, and l = 1. It is also satisfied that σ < 1. Therefore, we can estimate ᾱ = 0.3.

VII. CONCLUSIONS

In this paper, we focus on solving the distributed consensus optimization problem over directed graphs. The proposed algorithm, termed ADD-OPT, can be viewed as an improvement of our recent work, DEXTRA. ADD-OPT converges at a linear rate, O(µ^k) for 0 < µ < 1, given that the objective functions are strongly-convex, where k is the number of iterations. Compared to DEXTRA, ADD-OPT has a wider and more realistic range of step-sizes for convergence. In particular, the greatest lower bound of DEXTRA's step-size is strictly greater than zero, while that of ADD-OPT equals exactly zero. This guarantees the convergence of ADD-OPT in a distributed implementation as long as agents pick

Therefore, ADD-OPT is more reliable in a distributed setting. We also provide a much more compact proof of the linear convergence rate of ADD-OPT compared with DEXTRA. Simulation examples further illustrate the improvements.

REFERENCES

[1] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, Jan.
[2] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5.
[3] I. Necoara and J. A. K. Suykens, "Application of a smoothing technique to decomposition in convex optimization," IEEE Transactions on Automatic Control, vol. 53, no. 11, Dec.
[4] G. Mateos, J. A. Bazerque, and G. B. Giannakis, "Distributed sparse linear regression," IEEE Transactions on Signal Processing, vol. 58, no. 10, Oct.
[5] J. A. Bazerque and G. B. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Transactions on Signal Processing, vol. 58, no. 3, Mar.
[6] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," in 3rd International Symposium on Information Processing in Sensor Networks, Berkeley, CA, Apr. 2004.
[7] U. A. Khan, S. Kar, and J. M. F. Moura, "DILAND: An algorithm for distributed sensor localization with noisy distance measurements," IEEE Transactions on Signal Processing, vol. 58, no. 3, Mar.
[8] C. Li and L. Li, "A distributed multiple dimensional QoS constrained resource scheduling optimization policy in computational grid," Journal of Computer and System Sciences, vol. 72, no. 4.
[9] G. Neglia, G. Reina, and S. Alouf, "Distributed gradient optimization for epidemic routing: A preliminary evaluation," in 2nd IFIP IEEE Wireless Days, Paris, Dec. 2009.
[10] A. Nedic and A.
Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, Jan.
[11] A. Nedic, A. Ozdaglar, and P. A. Parrilo, "Constrained consensus and optimization in multi-agent networks," IEEE Transactions on Automatic Control, vol. 55, no. 4, Apr.
[12] I. Lobel and A. Ozdaglar, "Distributed subgradient methods for convex optimization over random networks," IEEE Transactions on Automatic Control, vol. 56, no. 6, Jun.
[13] S. S. Ram, A. Nedic, and V. V. Veeravalli, "Distributed stochastic subgradient projection algorithms for convex optimization," Journal of Optimization Theory and Applications, vol. 147, no. 3.
[14] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," arXiv preprint arXiv:1310.7063.
[15] D. Jakovetic, J. Xavier, and J. M. F. Moura, "Fast distributed gradient methods," IEEE Transactions on Automatic Control, vol. 59, no. 5.
[16] W. Shi, Q. Ling, G. Wu, and W. Yin, "EXTRA: An exact first-order algorithm for decentralized consensus optimization," SIAM Journal on Optimization, vol. 25, no. 2, pp. 944-966, 2015.
[17] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," arXiv preprint.
[18] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Puschel, "D-ADMM: A communication-efficient distributed algorithm for separable optimization," IEEE Transactions on Signal Processing, vol. 61, no. 10, May.
[19] E. Wei and A. Ozdaglar, "Distributed alternating direction method of multipliers," in 51st IEEE Annual Conference on Decision and Control, Dec. 2012.
[20] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, Apr.
[21] A. Jadbabaie, J. Lin, and A. S. Morse, "Coordination of groups of mobile autonomous agents using nearest neighbor rules," IEEE Transactions on Automatic Control, vol. 48, no. 6, Jun.
[22] R.
Olfati-Saber and R. M. Murray, "Consensus problems in networks of agents with switching topology and time-delays," IEEE Transactions on Automatic Control, vol. 49, no. 9, Sep.
[23] R. Olfati-Saber and R. M. Murray, "Consensus protocols for networks of dynamic agents," in IEEE American Control Conference, Denver, Colorado, Jun. 2003, vol. 2.
[24] L. Xiao, S. Boyd, and S. J. Kim, "Distributed average consensus with least-mean-square deviation," Journal of Parallel and Distributed Computing, vol. 67, no. 1.
[25] A. Nedic and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Transactions on Automatic Control, vol. PP, no. 99, pp. 1-1.
[26] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Push-sum distributed dual averaging for convex optimization," in 51st IEEE Annual Conference on Decision and Control, Maui, Hawaii, Dec. 2012.
[27] K. I. Tsianos, The Role of the Network in Distributed Optimization Algorithms: Convergence Rates, Scalability, Communication/Computation Tradeoffs and Communication Delays, Ph.D. thesis, Dept. Elect. Comp. Eng., McGill University.
[28] K. I. Tsianos, S. Lawlor, and M. G. Rabbat, "Consensus-based distributed optimization: Practical issues and applications in large-scale machine learning," in 50th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, Oct. 2012.
[29] D. Kempe, A. Dobra, and J. Gehrke, "Gossip-based computation of aggregate information," in 44th Annual IEEE Symposium on Foundations of Computer Science, Oct. 2003.
[30] F. Benezit, V. Blondel, P. Thiran, J. Tsitsiklis, and M. Vetterli, "Weighted gossip: Distributed averaging using non-doubly stochastic matrices," in IEEE International Symposium on Information Theory, Jun. 2010.
[31] C. Xi, Q. Wu, and U. A. Khan, "Distributed gradient descent over directed graphs," arXiv preprint.
[32] C. Xi and U. A. Khan, "Distributed subgradient projection algorithm over directed graphs," arXiv preprint.
[33] K.
Cai and H. Ishii, "Average consensus on general strongly connected digraphs," Automatica, vol. 48, no. 11.
[34] A. Makhdoumi and A. Ozdaglar, "Graph balancing for distributed subgradient methods over directed graphs," to appear in 54th IEEE Annual Conference on Decision and Control.
[35] L. Hooi-Tong, "On a class of directed graphs with an application to traffic-flow problems," Operations Research, vol. 18, no. 1.
[36] A. Nedic and A. Olshevsky, "Distributed optimization of strongly convex functions on directed time-varying graphs," in IEEE Global Conference on Signal and Information Processing, Dec. 2013.
[37] C. Xi and U. A. Khan, "On the linear convergence of distributed optimization over directed graphs," arXiv preprint, 2015.

[38] J. Zeng and W. Yin, "ExtraPush for convex smooth decentralized optimization over directed networks," arXiv preprint.
[39] A. Olshevsky and J. N. Tsitsiklis, "Convergence speed in distributed consensus and averaging," SIAM Journal on Control and Optimization, vol. 48, no. 1.
[40] S. Bubeck, "Convex optimization: Algorithms and complexity," arXiv preprint, 2014.


More information

Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems

Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems 1 Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems Mauro Franceschelli, Andrea Gasparri, Alessandro Giua, and Giovanni Ulivi Abstract In this paper the formation stabilization problem

More information

Distributed Computation of Quantiles via ADMM

Distributed Computation of Quantiles via ADMM 1 Distributed Computation of Quantiles via ADMM Franck Iutzeler Abstract In this paper, we derive distributed synchronous and asynchronous algorithms for computing uantiles of the agents local values.

More information

Distributed Averaging in Sensor Networks Based on Broadcast Gossip Algorithms

Distributed Averaging in Sensor Networks Based on Broadcast Gossip Algorithms 1 Distributed Averaging in Sensor Networks Based on Broadcast Gossip Algorithms Mauro Franceschelli, Alessandro Giua and Carla Seatzu. Abstract In this paper we propose a new decentralized algorithm to

More information

Distributed Control of the Laplacian Spectral Moments of a Network

Distributed Control of the Laplacian Spectral Moments of a Network Distributed Control of the Laplacian Spectral Moments of a Networ Victor M. Preciado, Michael M. Zavlanos, Ali Jadbabaie, and George J. Pappas Abstract It is well-nown that the eigenvalue spectrum of the

More information

Average-Consensus of Multi-Agent Systems with Direct Topology Based on Event-Triggered Control

Average-Consensus of Multi-Agent Systems with Direct Topology Based on Event-Triggered Control Outline Background Preliminaries Consensus Numerical simulations Conclusions Average-Consensus of Multi-Agent Systems with Direct Topology Based on Event-Triggered Control Email: lzhx@nankai.edu.cn, chenzq@nankai.edu.cn

More information

On Quantized Consensus by Means of Gossip Algorithm Part I: Convergence Proof

On Quantized Consensus by Means of Gossip Algorithm Part I: Convergence Proof Submitted, 9 American Control Conference (ACC) http://www.cds.caltech.edu/~murray/papers/8q_lm9a-acc.html On Quantized Consensus by Means of Gossip Algorithm Part I: Convergence Proof Javad Lavaei and

More information

arxiv: v2 [eess.sp] 20 Nov 2017

arxiv: v2 [eess.sp] 20 Nov 2017 Distributed Change Detection Based on Average Consensus Qinghua Liu and Yao Xie November, 2017 arxiv:1710.10378v2 [eess.sp] 20 Nov 2017 Abstract Distributed change-point detection has been a fundamental

More information

An Optimization-based Approach to Decentralized Assignability

An Optimization-based Approach to Decentralized Assignability 2016 American Control Conference (ACC) Boston Marriott Copley Place July 6-8, 2016 Boston, MA, USA An Optimization-based Approach to Decentralized Assignability Alborz Alavian and Michael Rotkowitz Abstract

More information

IN the multiagent systems literature, the consensus problem,

IN the multiagent systems literature, the consensus problem, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 63, NO. 7, JULY 206 663 Periodic Behaviors for Discrete-Time Second-Order Multiagent Systems With Input Saturation Constraints Tao Yang,

More information

Decentralized Estimation of the Algebraic Connectivity for Strongly Connected Networks

Decentralized Estimation of the Algebraic Connectivity for Strongly Connected Networks Decentralized Estimation of the Algebraic Connectivity for Strongly Connected Networs Hasan A Poonawala and Mar W Spong Abstract The second smallest eigenvalue λ 2 (L) of the Laplacian L of a networ G

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Stability Analysis of Stochastically Varying Formations of Dynamic Agents

Stability Analysis of Stochastically Varying Formations of Dynamic Agents Stability Analysis of Stochastically Varying Formations of Dynamic Agents Vijay Gupta, Babak Hassibi and Richard M. Murray Division of Engineering and Applied Science California Institute of Technology

More information

Learning distributions and hypothesis testing via social learning

Learning distributions and hypothesis testing via social learning UMich EECS 2015 1 / 48 Learning distributions and hypothesis testing via social learning Anand D. Department of Electrical and Computer Engineering, The State University of New Jersey September 29, 2015

More information

Hegselmann-Krause Dynamics: An Upper Bound on Termination Time

Hegselmann-Krause Dynamics: An Upper Bound on Termination Time Hegselmann-Krause Dynamics: An Upper Bound on Termination Time B. Touri Coordinated Science Laboratory University of Illinois Urbana, IL 680 touri@illinois.edu A. Nedić Industrial and Enterprise Systems

More information

On Distributed Optimization using Generalized Gossip

On Distributed Optimization using Generalized Gossip For presentation in CDC 015 On Distributed Optimization using Generalized Gossip Zhanhong Jiang Soumik Sarkar Kushal Mukherjee zhjiang@iastate.edu soumiks@iastate.edu MukherK@utrc.utc.com Department of

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Distributed Receding Horizon Control of Cost Coupled Systems

Distributed Receding Horizon Control of Cost Coupled Systems Distributed Receding Horizon Control of Cost Coupled Systems William B. Dunbar Abstract This paper considers the problem of distributed control of dynamically decoupled systems that are subject to decoupled

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Consensus Problems on Small World Graphs: A Structural Study

Consensus Problems on Small World Graphs: A Structural Study Consensus Problems on Small World Graphs: A Structural Study Pedram Hovareshti and John S. Baras 1 Department of Electrical and Computer Engineering and the Institute for Systems Research, University of

More information

1 The independent set problem

1 The independent set problem ORF 523 Lecture 11 Spring 2016, Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Tuesday, March 29, 2016 When in doubt on the accuracy of these notes, please cross chec with the instructor

More information

Communication/Computation Tradeoffs in Consensus-Based Distributed Optimization

Communication/Computation Tradeoffs in Consensus-Based Distributed Optimization Communication/Computation radeoffs in Consensus-Based Distributed Optimization Konstantinos I. sianos, Sean Lawlor, and Michael G. Rabbat Department of Electrical and Computer Engineering McGill University,

More information

Asymmetric Information Diffusion via Gossiping on Static And Dynamic Networks

Asymmetric Information Diffusion via Gossiping on Static And Dynamic Networks Asymmetric Information Diffusion via Gossiping on Static And Dynamic Networks Ercan Yildiz 1, Anna Scaglione 2, Asuman Ozdaglar 3 Cornell University, UC Davis, MIT Abstract In this paper we consider the

More information

Algorithm-Hardware Co-Optimization of Memristor-Based Framework for Solving SOCP and Homogeneous QCQP Problems

Algorithm-Hardware Co-Optimization of Memristor-Based Framework for Solving SOCP and Homogeneous QCQP Problems L.C.Smith College of Engineering and Computer Science Algorithm-Hardware Co-Optimization of Memristor-Based Framework for Solving SOCP and Homogeneous QCQP Problems Ao Ren Sijia Liu Ruizhe Cai Wujie Wen

More information

Measurement partitioning and observational. equivalence in state estimation

Measurement partitioning and observational. equivalence in state estimation Measurement partitioning and observational 1 equivalence in state estimation Mohammadreza Doostmohammadian, Student Member, IEEE, and Usman A. Khan, Senior Member, IEEE arxiv:1412.5111v1 [cs.it] 16 Dec

More information

Continuous-time Distributed Convex Optimization with Set Constraints

Continuous-time Distributed Convex Optimization with Set Constraints Preprints of the 19th World Congress The International Federation of Automatic Control Cape Town, South Africa. August 24-29, 214 Continuous-time Distributed Convex Optimization with Set Constraints Shuai

More information