On the Linear Convergence of Distributed Optimization over Directed Graphs

Size: px

Start display at page:

Download "On the Linear Convergence of Distributed Optimization over Directed Graphs"

Mildred Clark
5 years ago
Views:

1 1 On the Linear Convergence of Distributed Optimization over Directed Graphs Chenguang Xi, and Usman A. Khan arxiv: v1 [math.oc] 7 Oct 015 Abstract This paper develops a fast distributed algorithm, termed DEXTRA, to solve the optimization problem when N agents reach agreement and collaboratively minimize the sum of their local objective functions over the networ, where the communications between agents are described by a directed graph, which is uniformly strongly-connected. Existing algorithms, including Gradient-Push (GP and Directed-Distributed Gradient Descent (D-DGD, solve the same problem restricted to directed graphs with a convergence rate of O(ln /, where is the number of iterations. Our analysis shows that DEXTRA converges at a linear rate O(ɛ for some constant ɛ < 1, with the assumption that the objective functions are strongly convex. The implementation of DEXTRA requires each agent now its local out-degree only. Simulation examples illustrate our findings. Index Terms Distributed optimization; multi-agent networs; directed graphs;. I. INTRODUCTION Distributed computation and optimization have gained great interests due to their widely applications often in, e.g., large scale machine learning, [1, ], model predictive control, [3], cognitive networs, [4, 5], source localization, [6, 7], resource scheduling, [8], and message routing, [9]. All of above applications can be reduced to variations of distributed optimization problems by a networ of nodes when nowledge of the objective function is scattered over the networ. In particular, we consider the problem that minimizing a sum of objectives, n i=1 f i(x, where f i : R p R is a private objective function at the ith agent of the networ. C. Xi and U. A. Khan are with the Department of Electrical and Computer Engineering, Tufts University, 161 College Ave, Medford, MA 0155; chenguang.xi@tufts.edu, han@ece.tufts.edu. This wor has been partially supported by an NSF Career Award # CCF

2 There have been many types of algorithms to solve the problem in a distributed manner. The most common alternatives are Distributed Gradient Descent (DGD, [10, 11], Distributed Dual Averaging (DDA, [1], and the distributed implementations of the Alternating Direction Method of Multipliers (ADMM, [13 15]. DGD and DDA can be reduced to gradient-based methods, where at each iteration a gradient related step is calculated, followed by averaging with neighbors in the networ. The main advantage of these methods is computational simplicity. However, they suffer from a slower convergence rate due to the necessary slowly diminishing stepsize in order to achieve an exact convergence. The convergence rate of DGD and DDA, with a diminishing stepsize, is shown to be O( ln, [10]. When given a constant stepsize, the algorithms accelerate themselves to O( 1, at a cost of inexact converging to a neighborhood of optimal solutions, [11]. To overcome the difficulties, some alternatives include Nesterov-based methods, e.g., the Distributed Nesterov Gradient (DNG, [16], whose convergence rate is O( log with a diminishing stepsize, and Distributed Nesterov gradient with Consensus iterations (DNC, [16], with a faster convergence rate of O( 1 at the cost of times iterations at th iteration. Some methods use second order information such as the Networ Newton (NN method, [17], with a convergence rate that is at least linear. Both DNC and NN need multiple communication steps in one iteration. The distributed implementations of ADMM is based on augmented Lagrangians, where at each iteration the primal and dual variables are solved to minimize a lagrangian related function. This type of methods converges exactly to the optimal point with a faster rate of O( 1 with a constant stepsize and further get linear convergence rate with the assumption of strongly convex of objective functions. However, the disadvantage is the high computation burden. Each agent needs to solve an optimization problem at each iteration. To resolve the issue, Decentralized Linearized ADMM (DLM, [18], and EXTRA, [19], are proposed, which can be considered as a first order approximation of decentralized ADMM. DLM and EXTRA have been proven to converge at a linear rate if the local objective functions are assumed to be strongly convex. All these proposed distributed algorithms, [10 19], assume the multi-agent networ to be an undirected graph. In contrast, Researches successfully solving the problem restricted to directed graphs are limited. We report the papers considering directed graphs here. There are two types of alternatives, which are both gradient-based algorithms, solving the problems restricted to directed graphs. The first type of algorithms, called Gradient-Push (GP method, [0, 1], combines gradient

3 3 descent and push-sum consensus. The push-sum algorithm, [, 3], is first proposed in consensus problems 1 to achieve average-consensus given a column-stochastic matrix. The idea is based on computing the stationary distribution of the Marov chain characterized by the multi-agent networ and canceling the imbalance by dividing with the left-eigenvector. GP follows a similar spirit of push-sum consensus and can be regarded as a perturbed push-sum protocol converging to the optimal solution. Directed-Distributed Graident Descent (D-DGD, [8, 9], is the other gradient-based alternative to solve the problem. The main idea follows Cai s paper, [30], where a new non-doubly-stochastic matrix is constructed to reach average-consensus. In each iteration, D- DGD simultaneously constructs a row-stochastic matrix and a column-stochastic matrix instead of only a doubly-stochastic matrix. The convergence of the new weight matrix, depending on the row-stochastic and column-stochastic matrices, ensures agents to reach both consensus and optimality. Since both GP and D-DGD are gradient-based algorithm, they suffer from a slow convergence rate of O( ln due to the diminishing stepsize. We conclude the existing first order distributed algorithms solving the optimization problem of minimizing the sum of agents local objective functions, and do a comparison in terms of speed, in Table I, including both the circumstance of undirected graphs and directed graphs. As seen in Table I, I means DGD with a constant stepsize is an Inexact method; M is saying that the DNC needs to do Multiple ( times of consensus at th iteration, which is impractical in real situation as goes to infinity, and C represents that DADMM has a much higher Computation burden compared to other first order methods. In this paper, we propose a fast distributed algorithm, termed DEXTRA, to solve the problem over directed graphs, by combining the push-sum protocol and EXTRA. The push-sum protocol has proven to be a good technique for dealing with optimization over digraphs, [0, 1]. EXTRA wors well in optimization problems over undirected graph with a fast convergence rate and a low computation complexity. By applying the push-sum techniqies into EXTRA, we show that DEXTRA converges exactly to the optimal solution with a linear rate, O(ɛ, even when the underlying networ is directed. The fast convergence rate is guaranteed because we apply a constant stepsize in DEXTRA compared with the diminishing stepsize used in GP or D- DGD. Currently, our formulation is restricted to strongly convex functions. We claim that though DEXTRA is a combination of push-sum and EXTRA, the proof is not trival by simply following 1 See, [4 7], for additional information on average consensus problems.

4 4 Algorithms General Convex Strongly Convex DGD (α DDA (α O( ln O( ln O( ln O( ln DGD (α O( 1 I DNG (α O( ln DNC (α O( 1 M DADMM (α O(ɛ C DLM (α O(ɛ EXTRA (α O( 1 O(ɛ GP (α D-DGD (α O( ln O( ln O( ln O( ln DEXTRA (α O(ɛ TABLE I: Convergence rate of first order distributed optimization algorithms regarding undirected and directed graphs. the proof of either algorithm. As the communication lin is more restricted compared with the case in undirected graph, many properties used in the proof of EXTRA no longer wors for DEXTRA, e.g., the proof of EXTRA requiring the weighting matrix to be symmetric which never happens in directed graphs. The remainder of the paper is organized as follows. Section II describes, develops and interprets the DEXTRA algorithm. Section III presents the appropriate assumptions and states the main convergence results. In Section IV, we show the proof of some ey lemmas, which will be used in the subsequent proofs of convergence rate of DEXTRA. The main proof of the convergence rate of DEXTRA is provided in Section V. We show numerical results in Section VI and Section VII contains the concluding remars. Notation: We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. We denote by [x] i the ith component of a vector x. For matrix A, we denote by [A] i the ith row of the matrix, and by [A] ij the (i, jth element of a matrix, A. The matrix, I n,

5 5 represents the n n identity, and 1 n (0 n are the n-dimensional vector of all 1 s (0 s. The inner product of two vectors x and y is x, y. We define the A-matrix norm, x A, of any vector x as x A x, Ax = x, A x = x, A+A x for A + A being positive semi-definite. If a symmetric matrix A is positive semi-definite, we write A 0, while A 0 means A is positive definite. The largest and smallest eigenvalues of a matrix A are denoted as λ max (A and λ min (A. The smallest nonzero eigenvalue of a matrix A is denoted as λ min (A. For any objective function f(x, we denote f(x the gradient of f at x. II. DEXTRA DEVELOPMENT In this section, we formulate the optimization problems and describe the DEXTRA algorithm. We derive an informal but intuitive proof of DEXTRA, showing that the algorithm push agents to achieve consensus and reach the optimal solution. The EXTRA algorithm, [19], is briefly recapitulated in this section. We derive DEXTRA to a similar form as EXTRA such that our algorithm can be viewed as an improvement of EXTRA suited to the case of directed graphs. This reveals the meaning of DEXTRA: Directed EXTRA. Formal convergence results and proofs are left to Section III, IV, and V. Consider a strongly-connected networ of n agents communicating over a directed graph, G = (V, E, where V is the set of agents, and E is the collection of ordered pairs, (i, j, i, j V, such that agent j can send information to agent i. Define N in i to be the collection of in-neighbors, i.e., the set of agents that can send information to agent i. Similarly, N out i is defined as the out-neighborhood of agent i, i.e., the set of agents that can receive information from agent i. We allow both N in i and N out i to include the node i itself. Note that in a directed graph when (i, j E, it is not necessary that (j, i E. Consequently, N in i Ni out, in general. We focus on solving a convex optimization problem that is distributed over the above multi-agent networ. In particular, the networ of agents cooperatively solve the following optimization problem: n P1 : min f(x = f i (x, where each f i : R p R is convex, not necessarily differentiable, representing the local objective function at agent i. i=1

6 6 A. EXTRA for undirected graphs EXTRA is a fast exact first-order algorithm to solve Problem P1 but in the case that the communication networ is described by an undirected graph. At th iteration, each agent i performs the following updates: x +1 i =x i + w ij x j w ij x 1 j α [ f i (x i f i (x 1 i ], (1 j N i j N i where the weight, w ij, forms a weighting matrix, W = {w ij } satisfying that W is symmetric and doubly stochastic, and the collection W = { w ij } satisfies W = θi n + (1 θw with some θ (0, 1 ]. The update in Eq. (1 ensures x i to converge to the optimal solution of the problem for all i with convergence rate O( 1, and converge linearly when the objective function is further strongly convex. To better represent EXTRA and later compare with the DEXTRA algorithm, we write Eq. (1 in a matrix form. Let x, f(x R np be the collection of all agents states and gradients at time, i.e., x and W, W R n n [x 1; ; x n], f(x [ f 1 (x 1;,, f n (x n], be the weighting matrix collecting weights w ij, w ij, respectively. Then Eq. (1 can be represented in a matrix form as: x +1 = [(I n + W I p ] x ( W I p x 1 α [ f(x f(x 1 ], ( where the symbol is the Kronecer product. We now state DEXTRA and derive it in a similar form as EXTRA. B. DEXTRA Algorithm To solve the Problem P1 suited to the case of directed graphs, we claim the DEXTRA algorithm implemented in a distributed manner as follow. Each agent, j V, maintains two vector variables: x j, z j R p, as well as a scalar variable, y j R, where is the discrete-time index. At the th iteration, agent j weights its state estimate, a ij x j, a ij yj,and ã ij x 1 j, and sends it to each out-neighbor, i Nj out, where the weights, a ij, and, ã ij, s are such that: a ij = > 0, i N out j, 0, otw., θ + (1 θa ij, i = j, ã ij = (1 θa ij, i j, n a ij = 1, j, i=1 j,

7 7 where θ (0, 1 in ]. With agent i receiving the information from its in-neighbors, j N i, it calculates the states, z i, by dividing x i over yi, and updates the variables x +1 i, and y +1 i follows: x +1 i y +1 i z i = x i, yi =x i + = j N in i j N in i ( aij x j j N in i as (3a [ (ãij x 1 j α fi (z i f i (z 1 i ], (3b ( aij yj, (3c where f i (z i is the gradient of the function f i (z at z = z i, and f i (z 1 i is the gradient at z 1 i, respectively. The method is initiated with an arbitrary vector, x 0 i, and with y 0 i = 1 for any agent i. The stepsize, α, is a positive number within a certain interval. We will explicitly discuss the range of α in Section III. At the first iteration, i.e., = 0, we have the following iteration instead of Eq. (3, z 0 i = x0 i, yi 0 x 1 i = j N in i y 1 i = j N in i ( aij x 0 j α fi (z 0 i, ( aij yj 0. We note that the implementation of Eq. (3 needs each agent having the nowledge of its outneighbors (such that it can design the weights. In a more restricted setting, e.g., a broadcast application where it may not be possible to now the out-neighbors, we may use a ij = N out j 1 ; thus, the implementation only requires each agent nowing its out-degree, see, e.g., [0, 8] for similar assumptions. To simplify the proof, we write DEXTRA, Eq. (3, in a matrix form. Let, A = {a ij } R n n, Ã = {ã ij } R n n, be the collection of weights, a ij, ã ij, respectively, x, z, f(x R np, be the collection of all agents states and gradients at time, i.e., x [x 1; ; x n], z [z 1; ; z n], f(x [ f 1 (x 1; ; f n (x n], and y R n be the collection of agent states, y i, i.e., y [y 1; ; y n]. Note that at time, value, y 0 : y can be represented by the initial y = Ay 1 = A y 0 = A 1 n. (5

8 8 Define the diagonal matrix, D R n n, for each, such that the ith element of D is y i, i.e., D = diag ( A 1 n. (6 Then we have Eq. (3 in the matrix form equivalently as follows: ( [D z = ] 1 Ip x, x +1 =x + (A I p x (Ã I px 1 α [ f(z f(z 1 ], y +1 =Ay, (7a (7b (7c where both the two weight matrices, A and Ã, are column stochastic matrix and satisfy the relationship: Ã = θi n + (1 θa with some θ (0, 1 ], and when = 0, Eq. (7b changes to x 1 = (A I p x 0 α f(z 0. Consider Eq. (7a, we obtain x = ( D I p z,. (8 Therefore, it follows that DEXTRA can be represented in one single equation: ( D +1 I p z +1 = [ (I n + AD I p ] z (ÃD 1 I p z 1 α [ f(z f(z 1 ]. We refer to our algorithm DEXTRA, since Eq. (9 has a similar form as the EXTRA algorithm, Eq. (, and can be used to solve Problem P1 in the case of directed graph. We state our main result in Section III, showing that as time goes to infinity, the iteration in Eq. (9 push z to achieve consensus and reach the optimal solution. Our proof in this paper will based on the form, Eq. (9, of DEXTRA. (9 C. Interpretation of DEXTRA In this section, we give an intuitive interpretation on DEXTRA s convergence to the optimal point before we show the formal proof which will appear in Section IV, V. Note that the analysis given here is not a formal proof of DEXTRA. However, it gives a snapshot on how DEXTRA wors. Since A is a column stochastic matrix, the sequence of iterates, { y }, generated by Eq. (7c converges to the span of A s right eigenvector corresponding to eigenvalue 1, i.e., y = u π for some number u, where π is the right eigenvector of A corresponding to eigenvalue 1. We assume that the sequence of iterates, { z } and { x }, generated by DEXTRA, Eq. (7 or (9, converges

9 9 to their own limit points, z, x, respectively, (which might not be the truth. According to the updating rule in Eq. (7b, the limit point x satisfies x =x + (A I p x (Ã I px α [ f(z f(z ], (10 which implies that [(A Ã I p]x = 0 np, or x = π v for some vector v R p. Therefore, it follows from Eq. (7a that z = ( [D ] 1 I p (π v {1n v}, (11 where the consensus is reached. The above analysis reveals the idea of DEXTRA to overcome the imbalance of agent states occurred when the graph is directed: both x and y lie in the span of π. By dividing x over y, the imbalance is canceled. By summing up the updates in Eq. (7b over, we obtain that 1 ] x +1 = (A I p x α f(z [(Ã A I p x r. Consider that x = π v and the preceding relation. It follows that the limit point z satisfies ] α f(z = [(Ã A I p x r. (1 Therefore, we obtain that α (1 n I p f(z = r=0 [ 1 n r=0 ] (Ã A I p x r = 0 p, which is the optimality condition of Problem P1. Therefore, given the assumption that the sequence of DEXTRA iterates, { z } and { x }, have limit points, z and x, the limit point, z, achieves consensus and reaches the optimal solution of Problem P1. In next section, we state our main result of this paper and we leave the formal proof to Section IV, V. r=0 III. ASSUMPTIONS AND MAIN RESULTS With appropriate assumptions, our main result states that DEXTRA converges to the optimal solution of Problem P1 linearly. In this paper, we assume that the agent graph, G, is stronglyconnected; each local function, f i : R p R, is convex and differentiable, i V, and the optimal solution of Problem P1 and the corresponding optimal value exist. Formally, we denote the optimal solution by φ R p and optimal value by f, i.e., f(φ = f = min f(x. (13 x Rp

10 10 Besides the above assumptions, we emphasize some other assumptions regarding the objective functions and weighting matrices, which are formally presented as follow. Assumption A1 (Functions and Gradients. For all agent i, each function, f i, is differentiable and satisfies the following assumptions. (a The function, f i, has Lipschitz gradient with the constant L fi, i.e., f i (x f i (y L fi x y, x, y R p. (b The gradient, f i (x, is bounded, i.e., f i (x B fi, x R p. (c The function, f i, is strictly convex with a constant i, i.e., i x y f i (x f i (y, x y, x, y R p. Following Assumption A1, we have for any x, y R np, f(x f(y L f x y, f(x B f, x y f(x f(y, x y, (14a (14b (14c where the constants B f = max i {B fi }, L f = max i {L fi }, and = max i {i }. By combining Eqs. (14b and (14c, we obtain which is x y 1 f(x f(y B f, x B f, (15 by letting y = 0 np. Eq. (15 reveals that the sequences updated by DEXTRA is bounded. Recall the definition of D in Eq. (6, we denote the limits of D by D, i.e., D = lim D = diag (A 1 n = n diag (π, (16 where π is the normalized right eigenvector of A corresponding to eigenvalue 1. The next assumption is related to the weighting matrices, A and Ã, and D. Assumption A (Weighting matrices. The weighting matrices, A and Eq. (7 or (9, satisfy the following. (a A is a column stochastic matrix. Ã, used in DEXTRA, (b Ã is a column stochastic matrix and satisfies Ã = θi n + (1 θa, for some θ (0, 1 ].

11 11 (c (D 1 Ã + Ã (D 1 0. To ensure (D 1 Ã + Ã (D 1 0, the agent i can assign a large value to its own weight a ii in the implementation of DEXTRA, (such that the weighting matrices A and Ã are more lie diagonally-dominant matrices. Then all eigenvalues of (D 1 Ã + Ã (D 1 can be positive. Before we present the main result of this paper, we define some auxiliary sequences. Let z R np be the collection of optimal solution, φ, of Problem P1, i.e., z = 1 n φ, (17 and q R np some vector satisfying [(Ã A I p ] q + α f(z = 0 np. (18 Introduce auxiliary sequence and for each q q = x r, (19 r=0 ( t D I p z =, (D I p z t =, [Ã (D 1] I p G = q [ ( ] (D 1 Ã A I p, (0 We are ready to state the main result as shown in Theorem 1, which maes explicit the rate at which the objective function converges to its optimal value. Theorem 1. Let Assumptions A1, A hold. Denote P = I n A, L = Ã A, N = (D 1 (Ã A, { M = (D 1 Ã, d max = max D } {, d max = max (D 1 }, and d = (D 1. ( Define ρ 1 = 4(1 θ λ max P P (Ã + 8(αL f d max, and ρ = 4λ max Ã + 8(αL f d max, where α is some constant regarded as a ceiling of α. Then with proper step size α [α min, α max ], there exists, δ > 0, Γ <, and γ (0, 1, such that the sequence { t } generated by DEXTRA satisfies t t G (1 + δ t +1 t G Γγ. (1

12 1 The lower bound, α min, of α is and the upper bound, α max, of α is ( λ min α max = α min = M+M 1 + σλmax(nn σ λ min (L L ρ 1 (d max η σλ max (, MM σλmax(nn ρ λ min (L L, L f η (d maxd where, η and σ, are arbitrary constants to ensure α min > 0 and α max > 0, i.e., η < σ < λ min(m+m ( 1 MM λmax + λmax(nn λ min (L ρ L., and d max We will show the proof of Theorem 1 in Section V, with the help of Lemmas 1, 3, and 4 in Section IV. Note that the result of Theorem 1, Eq. (1, does not mean the convergence of DEXTRA to the optimal solution. To proof t t, we further need to show that the matrix G satisfying a G 0, a Rnp. With Assumption A(c, we already have a Ã (D 1 0, a R n. It is left to show that a (D 1 (Ã A 0, a Rn, which will be presented in Section IV. Assume that we have shown that a (D 1 (Ã A 0, a Rn. Since γ (0, 1, then we can say that t t G converges to 0 at the R-linear rate O((1 + δ for some δ (0, δ. Consequently, z z Ã (D 1 converges to 0 at the R-linear rate O((1 + δ for some δ (0, δ. Again with Assumption A(c, it follows that z z with a R-linear rate. In Theorem 1, we have given the specific value of α min and α max. We represent α min = αt min, α d min and α max = αt max for the sae of simplicity. To ensure α α d min < α max, the objective function needs max to have the strong convexity constant satisfying ( α (d max t min αmax d + η ( αmax t It follows from Eq. ( that in order to achieve a linear convergence rate for DEXTRA algorithm, the objective function should not only be strongly convex, but also with the constant above some threshold. It is beyond the scope of this paper on how to design the weighting matrix, A, which influences the eigenvalues in the representation of α min and α max, such that the threshold of can be close to zero. In real situation, numerical experiments illustrate that practically can be really small, i.e., DEXTRA have all agents converge to the optimal solution with an R-linear rate for almost all the strongly convex functions.. Note that the values of α min and α max need the nowledge of networ topology, which is not available in a distributed manner. It is an open question on how to avoid the global nowledge

13 13 of networ topology when designing the value of α. In numerical experiments, however, the range of α is much wider than the theoretical value. IV. BASIC RELATIONS We will present the complete proof of Theorem 1 in Section V, which will rely on three lemmas stated in this section. For the proof from now on, we will assume that the sequences updated by DEXTRA have only one dimension, i.e., z i, x i R, i,. For x i, z i R p being p-dimensional vectors, the proof is same for every dimension by applying the results to each coordinate. Thus, assuming p = 1 is without the loss of generality. The advantage is that with p = 1 we get rid of the Kronecer product in equations, which is easier to follow. Let p = 1, and rewrite DEXTRA, Eq. (9, D +1 z +1 =(I n + AD z ÃD 1 z 1 α [ f(z f(z 1 ]. (3 We begin to introduce three main lemmas in this section. Recall the definition of q, q, in Eqs. (19 and (18, and D, D, in Eqs. (6 and (16. Lemma 1 establishes the relations among D z, q, D z, and q. Lemma 1. Let Assumptions A1, A hold. In DEXTRA, the qradruple sequence {D z, q, D z, q } obeys for any = 0, 1,. (I n + A Ã ( D +1 z +1 D z + Ã ( D +1 z +1 D z = (Ã A (q +1 q α [ f(z f(z ], (4 Proof. We sum DEXTRA, Eq. (3, over time from 0 to, D +1 z +1 = ÃD z α f(z (Ã A D r z r. By substracting r=0 (Ã A D +1 z +1 on both sides of the preceding equation, and rearrange the terms, it follows that ( I n + A Ã D +1 z +1 + Ã ( D +1 z +1 D z = (Ã A q +1 α f ( z. (5 ( Since I n + A Ã π = 0 n, where π is the right eigenvector of A corresponding to eigenvalue 1, we have ( I n + A Ã D z = 0 n. (6

14 14 By combining Eqs. (5 and (6, and noting that obtain the desired result. (Ã A q + α f(z = 0 n, Eq. (18, we Before presenting Lemma 3, we introduce an existing result on the special structure of columnstochastic matrices, which helps efficiently on our proof. Specifically, we have the following property of the weighting matrix, A and Ã, used in DEXTRA. Lemma. (Nedic et al. [0] For any column stochastic matrix A R n n, we have (a The limit lim [ A ] exists and lim [ A ] ij = π i, where π is the right eigenvector of A corresponding to eigenvalue 1. (b For all i {1,, n}, the entries [ A ] and π ij i satisfy [ A ] π ij i < Cλ, j, where we can have C = 4 and λ = (1 1 n n. The convergence properties of the weighting matrix, A, help to build the relation between D +1 z +1 D z and z +1 z. There is also similar relations between D +1 z +1 D z and z +1 z. Lemma 3. Let Assumptions A1, A hold. Denote d max = max { D }, d max = max { (D 1 }. The sequences, D +1 z +1 D z, z +1 z, D +1 z +1 D z, and z +1 z, generated by DEXTRA, satisfies the following relations. (a (b (c Proof. (a z +1 z d max D +1 z +1 D z + 4d maxncb f λ. z +1 z d max D +1 z +1 D z + d maxncb f λ. D +1 z +1 D z dmax z +1 z + ncb f λ. z +1 z ( = D +1 1 ( ( D +1 z +1 z ( D +1 1 D +1 z +1 D z + D z D +1 z

15 15 d max D +1 z +1 D z + d max D D +1 z. d max D +1 z +1 D z 4d + maxncb f λ, where the last inequality follows by applying Lemma and the boundedness of the sequences updated by DEXTRA, Eq. (15. Similarly, we can proof the result of part (b. The proof of part (c follows: D +1 z +1 D z = D +1 z +1 D +1 z + D +1 z D z d max z +1 z + D +1 D z. d max z +1 z + ncb f λ. Lemmas 1 and 3 will be ( used in the proof of Theorem 1 in Section V. We now analysis the property of matrix (D 1 Ã A. As we have stated in Section III, the result of Theorem 1 itself can not show that the sequence, { z }, generated by DEXTRA converges, to the optimal solution, z, i.e., z z. To proof this, we still need to show that a (D 1 0, a (Ã A R n. This property can be easily proved by directly applying the following lemma. Lemma 4. (Chung. [31] The Laplacian, L, of a directed graph, G, is defined by L(G = I n S1/ RS 1/ + S 1/ R S 1/, (7 where R is the transition probability matrix and S = diag(s, where s is the left eigenvector of R corresponding to eigenvalue 1. If G is strongly connected, then the eigenvalues of L(G satisfy 0 = λ 0 < λ 1 < < λ n. Since (D 1 (I n A + (I n A (D 1 = (D 1/ L(G (D 1/, it follows from Lemma 4 that for any a R n, a (D 1 (I n A = a (D 1 (I n A+(I n A (D 1 0. Noting that Ã A = θ(i n A, and I n + A Ã = (1 θ(i n A, and θ (0, 1 ], we also have the following relations. a (D 1 (Ã A 0, a Rn, (8

16 16 a (D 1 (I n+a Ã 0, a Rn, (9 and λ min ((D ( 1 Ã A + (Ã A (D 1 = 0, λ min ((D 1 (I n + A Ã + (I n + A Ã (D 1 = 0, Therefore, Eq. (8, together with the result of Theorem 1, Eq. (1 prove the convergence of DEXTRA to the optimal solution of Problem P1. What is left to prove is Theorem 1, which is presented in the next section. V. PROOF OF THEOREM 1 With all the pieces in place, we are finally ready to prove Theorem 1. According to the strong convexity assumption, A1(c, we have α z +1 z α D z +1 D z, (D 1 [ f(z +1 f(z ], =α D z +1 D +1 z +1, (D 1 [ f(z +1 f(z ] + α D +1 z +1 D z, (D 1 [ f(z +1 f(z ] + D +1 z +1 D z, (D 1 α [ f(z f(z ] (30 :=s 1 + s + s 3, where s 1, s, s 3 denote each of RHS terms in Eq. (30. We discuss each term in sequence. Denote d max = max { D }, d max = max { (D 1 }, and d = (D 1, which are all bounded due to the convergence of the weighting matrix, A. For the first two terms, we have s 1 α D D +1 z +1 (D 1 f(z +1 f(z 8αncB f d λ ; (31 s αη D +1 z +1 D z + αl f η αη D +1 z +1 D z + αl f (d d max (D 1 z +1 z η D +1 z +1 D z + 3α(ncL fb f d d max λ ; (3 ηsf

17 17 Recall Lemma 1, Eq. (4, we can represent s 3 as s 3 = D +1 z +1 D z (D 1 (I n+a Ã + D +1 z +1 D z, (D 1 Ã ( D z D +1 z +1 ( (q + D +1 z +1 D z, (D 1 Ã A q +1 :=s 3a + s 3b + s 3c, (33 where and s 3b = D +1 z +1 D z, Ã (D 1 ( D z D +1 z +1 (34 ( (q s 3c = D +1 z +1, (D 1 Ã A q +1 ( (q = q +1 q, (D 1 Ã A q +1. (35 The first equality in Eq. (35 holds because we have (Ã A (D 1 D z = 0 n. By substituting Eqs. (34 and (35 into (33, and recalling the definition of t, t, G in Eq. (0, we obtain s 3 = D +1 z +1 D z (D 1 (I n+a Ã + t +1 t, G ( t t +1. (36 By applying the basic inequality t +1 t, G ( t t +1 and note that = t t G t +1 t G t +1 t G G ( t +1 t, t t +1, (37 t +1 t = G D +1 z +1 D z Ã (D q +1 q 1 (D 1 (Ã A where the inequality holds due to Eq. (8, and D +1 z +1 D z Ã (D 1, (38 G ( t +1 t, t t +1 Ã = (D 1 ( D +1 z +1 D z, D z D +1 z +1 D +1 z +1 D z, (Ã A (D 1 ( q q +1 σ D +1 z +1 D z + 1 (D 1 ÃÃ (D D +1 z +1 D z 1 σ

18 18 + σ q q +1 (D 1 (Ã A(Ã A (D 1, (39 we are able to bound s 3 with all terms of matrix norms by combining Eqs. (36, (37, (38, and (39, s 3 D +1 z +1 D z (D 1 (I n+a Ã + t t G t +1 t G + D +1 z +1 D z + Ã (D 1 +σ(d 1 ÃÃ (D D +1 z +1 D z 1 σ + σ q q +1 (D 1 (Ã A(Ã A (D 1. (40 It follows from Lemma 3(c that D +1 z +1 D z d max z +1 z 8(nCB f + λ. (41 By combining Eqs. (41 and (30, we obtain α D +1 z +1 D z s d 1 + s + s 3 + 8α(nCB f λ. (4 max d max Finally, by plugging the bounds of s 1, Eq. (31, s, Eq. (3, and s 3, Eq. (40, into Eq. (4, and rearranging the equations, it follows that t t G t +1 t G D +1 z +1 D z ( α Sf d η I n 1 σ In+(D 1 (I n+a Ã max + D +1 z +1 D z M σ MM αl f (d d max σ [ 4α(nCB f η I n S f + 8αncB f d + 16α(ncL fb f d d max ηs f d max q q +1, (43 NN where we denote M = (D 1 Ã, and N = (D 1 (Ã A for simplicity in representation. In order to establish the result of Theorem 1, Eq. (1, i.e., t t G (1+δ t +1 t G Γγ. it remains to show that the RHS of Eq. (43 is no less than δ t +1 t G Γγ. This is equivalent to show that D +1 z +1 D z α ( Sf d η max ] λ I n 1 σ In+(D 1 (I n+a Ã δm + D +1 z +1 D z M σ MM αl f (d d max η I n

19 19 From Lemma 1, we have [ 4α(nCB f d max ] + 8αncB f d + 16α(ncL fb f d d max λ ηsf q q +1 σ NN +δn Γλ. (44 q q +1 (Ã A (Ã A (Ã (q = A q +1 = (I n + A Ã(D+1 z +1 D z + α[ f(z +1 f(z ] + Ã(D+1 z +1 D z + α[ f(z f(z +1 ] ( D 4 +1 z +1 D z + (1 θ P P α L f z +1 z ( D z +1 D z Ã Ã + α L f z +1 z D +1 z +1 D z 4(1 θ P P +8(αL f d max I n + D +1 z +1 D z 4Ã Ã+8(αL f d max I n ( αlf d maxncb f λ, (45 where we denote P = I n A for simplicity in representation. Consider the analysis in Eqs. (8 ( and (9, we get λ 0. It is also true that λ ( NN 0, and λ ((Ã A (Ã A N+N 0. Therefore, we have the relation that 1 ( q q +1 σ λ max (NN N+N + δλ σ NN +δn max q q +1 1 ( q q +1 λ min (Ã A (Ã A (Ã A (Ã A, (46 where λ min ( gives the smallest nonzero eigenvalue. We denote L = (Ã A. By considering Eqs. (44, (45, and (46, and let Γ in Eq. (44 be [ ] 4α(nCB f Γ = + 8αncB f d + 16α(ncL fb f d d max d max + σ λ max ( ( NN N+N + δλ max λ min (L L ηs f ( αlf d 160 maxncb f,

20 0 it follows that in order to valid Eq. (44, we need to have the following relation be valid. D +1 z +1 D z ( α Sf I n 1 σ In+(D 1 (I n+a Ã δm d η max + D +1 z +1 D z M σ MM αl f (d d max σ λ max ( ( NN N+N + δλ max λ min (L L ( D +1 z +1 D z which can be implemented by letting ( ( α Sf η I d n 1 I max σ n δ η I n 4(1 θ P P +8(αL f d max I n + D +1 z +1 D z 4Ã Ã+8(αL f d max I n, M+M ( σ λmax(nn +δλ max λ min(l L N+N (4(1 θ P P + 8(αL f d max I n, M+M I n ( σ λmax(nn +δλ max σ MM αl f (d d max η N+N (4Ã Ã + 8(αL λ min(l L f d max I n. ( Denote ρ 1 = 4(1 θ λ max P P (Ã + 8(αL f d max, and ρ = 4λ max Ã + 8(αL f d max, where α is some constant regarded as a ceiling of α. Therefore, we have ( α Sf η 1 σλmax(nn d max σ λ min(l δ min L ρ 1 ( ( N+N λ max + λmax, λ min(l L ρ 1 λ min ( M+M M+M σλ ( max MM αl f (d d max η λ max ( N+N λ min(l L ρ To ensure δ > 0, we set the parameters, η, σ, to be η <, σ < d max σλ ( max NN ( λ max N+N. λ min(m+m ( 1 MM λmax + λmax(nn λ min (L ρ L and also α (α min, α max, where α min, and α max are the values given in Theorem 1. Therefore, we obtained the desired result., VI. NUMERICAL EXPERIMENTS This section provides numerical experiments to study the convergence rate of DEXTRA for a least squares problem in a directed graph. The local objective functions in the least squares

21 1 problems are strongly convex. We compare the performance of DEXTRA with other algorithms suited to the case of directed graph: GP as defined by [0, 1] and D-DGD as defined by [3]. Our second experiment verifies the existance of α min and α max, such that the proper stepsize α should be α [α min, α max ]. We also consider various networ topologies, to see how they effect the stepsize α. Convergence is studied in terms of the residual r = 1 n z n i φ, i=1 where φ is the optimal solution. The distributed least squares problem is described as follow. Each agent owns a private objective function, s i = R i x + n i, where s i R m i and M i R m i p are measured data, x R p is unnown states, and n i R m i is random unnown noise. The goal is to estimate x. This problem can be formulated as a distributed optimization problem solving min f(x = 1 n R i x s i. n We consider the networ topology as the digraphs shown in Fig. 1. i= G a G b G c Fig. 1: Three examples of strongly-connected but non-balanced digraphs. Our first experiment compares several algorithms with the choice of networ G a, illustrated in Fig. 1. The comparison of DEXTRA, GP, D-DGD and DGD with weighting matrix being row-stochastic is shown in Fig.. In this experiment, we set α = 0.1. The convergence rate of DEXTRA is linear as stated in Section III. G-P and D-DGD apply the same stepsize, α = α. As a result, the convergence rate of both is sub-linear. We also consider the DGD algorithm, but with the weighting matrix being row-stochastic. The reason is that in a directed graph, it is impossible to construct a doubly-stochastic matrix. As expected, DGD with row-stochastic matrix does not converge to the exact optimal solution while other three algorithms are suited to the case of directed graph.

22 As stated in Theorem 1, DEXTRA converges to the optimal solution when the stepsize α is in a proper interval. Fig. 3 illustrate this fact. The networ topology is chosen to be G a in Fig. 1. It is shown that when α = 0.01 or α = 1, DEXTRA does not converge. DEXTRA has a more restricted stepsize compared with EXTRA, see [19], where the value of stepsize can be as low as any value close to zero, i.e., α min = 0, when a symmetric and doubly-stochastic matrix is applied in EXTRA. The relative smaller range of interval is due to the fact that the weighting matrix applied in DEXTRA can not be symmetric Residual DGD with Row Stochastic Matrix GP D DGD DEXTRA Fig. : Comparison of different distributed optimization algorithms in a least squares problem. GP, D-DGD, and DEXTRA are proved to wor when the networ topology is described by digraphs. Moreover, DEXTRA has a much faster convergence rate compared with GP and D- DGD. In third experiments, we explore the relations between the range of stepsize and the networ topology. Fig. 4 shows the interval of proper stepsizes for the convergence of DEXTRA to optimal solution when the networ topology is chosen to be G a, G b, and G c in Fig. 1. The interval of proper stepsizes is related to the networ topology as stated in Theorem 1 in Section III. Compared Fig. 4 (b and (c with (a, a faster information diffusion on the networ can increase the range of stepsize interval. Given a fixed networ topology, e.g., see Fig. 4(a only, a larger stepsize increase the convergence speed of algorithms. When we compare Fig. 4(b with (a, it is interesting to see that α = 0.3 maes DEXTRA iterates about 190 times to converge with the

23 DEXTRA α=0.01 DEXTRA α=0.1 DEXTRA α= Residual Fig. 3: There exists some interval for the stepsize to have DEXTRA converge to the optimal solution. choice of G a while the same stepsize needs DEXTRA iterates 30 times to converge when the topology changes to G b. We claim that a faster information diffusion on the networ fasten the convergence speed. VII. CONCLUSIONS We introduce DEXTRA, a distributed algorithm to solve multi-agent optimization problems over directed graphs. We have shown that DEXTRA succeeds in driving all agents to the same point, which is the optimal solution of the problem, given that the communication graph is strongly-connected and the objective functions are strongly convex. Moreover, the algorithm converges at a linear rate O(ɛ for some constant ɛ < 1. The constant depends on the networ topology G, as well as the Lipschitz constant, strong convexity constant, and gradient norm of objective functions. Numerical experiments are conducted for a least squares problem, where we show that DEXTRA is the fastest distributed algorithm among all algorithms suited to the case of directed graphs, compared with GP, and D-DGD. Our analysis also explicitly gives the range of interval, which needs the nowledge of networ topology, of stepsize to have DEXTRA converge to the optimal solution. Although numerical experiments show that the practical range of interval are much bigger, it is still an open question

24 G a, α [0.03, 0.3] 10 0 Residual G b, α [10 10, 0.3] 10 0 Residual G c, α [10 10, 0.4] 10 0 Residual Fig. 4: There exists some interval for the stepsize to have DEXTRA converge to the optimal solution. The upperbound and lowerbound of the interval depends on the networ topology, Lipschitz constant, strong convexity constant, and gradient norm of objective functions. (a G a ; (b G b ; (c G c. Given a fixed topology, the convergence speed increases as α increases.

25 5 on how to get rid of the nowlege of networ topology, which is a global information in a distributed setting. REFERENCES [1] V. Cevher, S. Becer, and M. Schmidt, Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics, IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 3 43, 014. [] S. Boyd, N. Parih, E. Chu, B. Peleato, and J. Ecstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundation and Trends in Maching Learning, vol. 3, no. 1, pp. 1 1, Jan [3] I. Necoara and J. A. K. Suyens, Application of a smoothing technique to decomposition in convex optimization, IEEE Transactions on Automatic Control, vol. 53, no. 11, pp , Dec [4] G. Mateos, J. A. Bazerque, and G. B. Giannais, Distributed sparse linear regression, IEEE Transactions on Signal Processing, vol. 58, no. 10, pp , Oct [5] J. A. Bazerque and G. B. Giannais, Distributed spectrum sensing for cognitive radio networs by exploiting sparsity, IEEE Transactions on Signal Processing, vol. 58, no. 3, pp , March 010. [6] M. Rabbat and R. Nowa, Distributed optimization in sensor networs, in 3rd International Symposium on Information Processing in Sensor Networs, Bereley, CA, Apr. 004, pp [7] U. A. Khan, S. Kar, and J. M. F. Moura, Diland: An algorithm for distributed sensor localization with noisy distance measurements, IEEE Transactions on Signal Processing, vol. 58, no. 3, pp , Mar [8] C. L and L. Li, A distributed multiple dimensional qos constrained resource scheduling optimization policy in computational grid, Journal of Computer and System Sciences, vol. 7, no. 4, pp , 006. [9] G. Neglia, G. Reina, and S. Alouf, Distributed gradient optimization for epidemic routing: A preliminary evaluation, in nd IFIP in IEEE Wireless Days, Paris, Dec. 009, pp [10] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Transactions on Automatic Control, vol. 54, no. 1, pp , Jan. 009.

26 6 [11] K. Yuan, Q. Ling, and W. Yin, On the convergence of decentralized gradient descent, arxiv preprint arxiv: , 013. [1] J. C. Duchi, A. Agarwal, and M. J. Wainwright, Dual averaging for distributed optimization: Convergence analysis and networ scaling, IEEE Transactions on Automatic Control, vol. 57, no. 3, pp , Mar. 01. [13] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Puschel, D-ADMM: A communication-efficient distributed algorithm for separable optimization, IEEE Transactions on Signal Processing, vol. 61, no. 10, pp , May 013. [14] E. Wei and A. Ozdaglar, Distributed alternating direction method of multipliers, in 51st IEEE Annual Conference on Decision and Control, Dec. 01, pp [15] W. Shi, Q. Ling, K Yuan, G Wu, and W Yin, On the linear convergence of the admm in decentralized consensus optimization, IEEE Transactions on Signal Processing, vol. 6, no. 7, pp , April 014. [16] D. Jaovetic, J. Xavier, and J. M. F. Moura, Fast distributed gradient methods, IEEE Transactions onautomatic Control, vol. 59, no. 5, pp , 014. [17] A. Mohatari, Q. Ling, and A. Ribeiro, Networ newton, aryanm/wii/nn.pdf, 014. [18] Q. Ling and A. Ribeiro, Decentralized linearized alternating direction method of multipliers, in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 014, pp [19] W. Shi, Q. Ling, G. Wu, and W Yin, Extra: An exact first-order algorithm for decentralized consensus optimization, SIAM Journal on Optimization, vol. 5, no., pp , 015. [0] A. Nedic and A. Olshevsy, Distributed optimization over time-varying directed graphs, IEEE Transactions on Automatic Control, vol. PP, no. 99, pp. 1 1, 014. [1] K. I. Tsianos, The role of the Networ in Distributed Optimization Algorithms: Convergence Rates, Scalability, Communication/Computation Tradeoffs and Communication Delays, Ph.D. thesis, Dept. Elect. Comp. Eng. McGill University, 013. [] D. Kempe, A. Dobra, and J. Gehre, Gossip-based computation of aggregate information, in 44th Annual IEEE Symposium on Foundations of Computer Science, Oct. 003, pp [3] F. Benezit, V. Blondel, P. Thiran, J. Tsitsilis, and M. Vetterli, Weighted gossip: Distributed averaging using non-doubly stochastic matrices, in IEEE International Symposium on

27 7 Information Theory, Jun. 010, pp [4] A. Jadbabaie, J. Lim, and A. S. Morse, Coordination of groups of mobile autonomous agents using nearest neighbor rules, IEEE Transactions on Automatic Control, vol. 48, no. 6, pp , Jun [5] R. Olfati-Saber and R. M. Murray, Consensus problems in networs of agents with switching topology and time-delays, IEEE Transactions on Automatic Control, vol. 49, no. 9, pp , Sep [6] R. Olfati-Saber and R. M. Murray, Consensus protocols for networs of dynamic agents, in IEEE American Control Conference, Denver, Colorado, Jun. 003, vol., pp [7] L. Xiao, S. Boyd, and S. J. Kim, Distributed average consensus with least-mean-square deviation, Journal of Parallel and Distributed Computing, vol. 67, no. 1, pp , 007. [8] C. Xi, Q. Wu, and U. A. Khan, Distributed gradient descent over directed graphs, cxi01/paper/015d-dgd.pdf, 015. [9] C. Xi and U. A. Khan, Directed-distributed gradient descent, in 53rd Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, Oct [30] K. Cai and H. Ishii, Average consensus on general strongly connected digraphs, Automatica, vol. 48, no. 11, pp , 01. [31] F. Chung, Laplacians and the cheeger inequality for directed graphs, Annals of Combinatorics, vol. 9, no. 1, pp. 1 19, 005. [3] C. Xi, Q. Wu, and U. A. Khan, Distributed gradient descent over directed graphs, cxi01/paper/015d-dgd.pdf, 015.

On the linear convergence of distributed optimization over directed graphs

1 On the linear convergence of distributed optimization over directed graphs Chenguang Xi, and Usman A. Khan arxiv:1510.0149v4 [math.oc] 7 May 016 Abstract This paper develops a fast distributed algorithm,