On the linear convergence of distributed optimization over directed graphs


Chenguang Xi, and Usman A. Khan

arXiv:1510.02149v4 [math.OC] 7 May 2016

Abstract: This paper develops a fast distributed algorithm, termed DEXTRA, to solve the optimization problem when n agents reach agreement and collaboratively minimize the sum of their local objective functions over the network, where the communication between the agents is described by a directed graph. Existing algorithms solve the problem restricted to directed graphs with convergence rates of O(ln k/√k) for general convex objective functions and O(ln k/k) when the objective functions are strongly-convex, where k is the number of iterations. We show that, with the appropriate step-size, DEXTRA converges at a linear rate O(τ^k) for 0 < τ < 1, given that the objective functions are restricted strongly-convex. The implementation of DEXTRA requires each agent to know its local out-degree. Simulation examples further illustrate our findings.

Index Terms: Distributed optimization; multi-agent networks; directed graphs.

I. INTRODUCTION

Distributed computation and optimization have gained great interest due to their widespread applications in, e.g., large-scale machine learning, [1, 2], model predictive control, [3], cognitive networks, [4, 5], source localization, [6, 7], resource scheduling, [8], and message routing, [9]. All of these applications can be reduced to variations of distributed optimization problems by a network of agents when the knowledge of objective functions is distributed over the network. In particular, we consider the problem of minimizing a sum of objectives, Σ_{i=1}^n f_i(x), where f_i : R^p → R is a private objective function at the ith agent of the network.

C. Xi and U. A. Khan are with the Department of Electrical and Computer Engineering, Tufts University, 161 College Ave, Medford, MA 02155; chenguang.xi@tufts.edu, khan@ece.tufts.edu. This work has been partially supported by an NSF Career Award # CCF.

There are many algorithms to solve the above problem in a distributed manner. A few notable approaches are Distributed Gradient Descent (DGD), [10, 11], Distributed Dual Averaging (DDA), [12], and the distributed implementations of the Alternating Direction Method of Multipliers (ADMM), [13-15]. The algorithms DGD and DDA are essentially gradient-based, where at each iteration a gradient-related step is calculated, followed by averaging over the neighbors in the network. The main advantage of these methods is computational simplicity. However, their convergence rate is slow due to the diminishing step-size, which is required to ensure exact convergence. The convergence rate of DGD and DDA with a diminishing step-size is shown to be O(ln k/√k), [10]; under a constant step-size, the algorithm accelerates to O(1/k) at the cost of inexact convergence to a neighborhood of the optimal solution, [11]. To overcome such difficulties, some alternate approaches include the Nesterov-based methods, e.g., Distributed Nesterov Gradient (DNG) with a convergence rate of O(ln k/k), and Distributed Nesterov gradient with Consensus iterations (DNC), [16]. The algorithm DNC can be interpreted to have an inner loop, where information is exchanged, within every outer loop where the optimization-step is performed. The time complexity of the outer loop is O(1/k²) whereas the inner loop performs a substantial O(ln k) information exchanges within the kth outer loop. Therefore, the equivalent convergence rate of DNC is O(ln k/k²). Both DNG and DNC assume the gradient to be bounded and Lipschitz continuous at the same time. The discussion of convergence rates above applies to general convex functions. When the objective functions are further strongly-convex, DGD and DDA have a faster convergence rate of O(ln k/k), and DGD with a constant step-size converges linearly to a neighborhood of the optimal solution. See Table I for a comparison of related algorithms.

Other related algorithms include the distributed implementation of ADMM, based on the augmented Lagrangian, where at each iteration the primal and dual variables are solved to minimize a Lagrangian-related function, [13-15]. Compared to the gradient-based methods with diminishing step-sizes, this type of method converges exactly to the optimal solution with a faster rate of O(1/k) owing to the constant step-size, and further converges linearly when the objective functions are strongly-convex. However, the disadvantage is a high computation burden because each agent needs to optimize a subproblem at each iteration. To resolve this issue, Decentralized Linearized ADMM (DLM), [17], and EXTRA, [18], are proposed, which can be considered as first-order approximations of decentralized ADMM. DLM and EXTRA converge at a linear rate if the local objective functions are strongly-convex. All these distributed algorithms, [10-19], assume the

multi-agent network to be an undirected graph. In contrast, the literature concerning directed graphs is relatively limited. The challenge lies in the imbalance caused by the asymmetric information exchange in directed graphs. We report the papers considering directed graphs here. Broadly, there are three notable approaches, which are all gradient-based algorithms with diminishing step-sizes. The first is called Gradient-Push (GP), [20-23], which combines gradient-descent and push-sum consensus. The push-sum algorithm, [24, 25], was first proposed in consensus problems to achieve average-consensus¹ in a directed graph, i.e., with a column-stochastic matrix. The idea is based on computing the stationary distribution of the column-stochastic matrix characterized by the underlying multi-agent network and canceling the imbalance by dividing with the right eigenvector of the column-stochastic matrix. Directed-Distributed Gradient Descent (D-DGD), [30, 31], follows the idea of Cai and Ishii's work on average-consensus, [32], where a new non-doubly-stochastic matrix is constructed to reach average-consensus. The underlying weighting matrix contains a row-stochastic matrix as well as a column-stochastic matrix, and provides some nice properties similar to doubly-stochastic matrices. In [33], where we name the method Weight-Balancing-Distributed Gradient Descent (WB-DGD), the authors combine the weight-balancing technique, [34], together with gradient-descent. These gradient-based methods, [20-23, 30, 31, 33], restricted by the diminishing step-size, converge relatively slowly at O(ln k/√k). Under strongly-convex objective functions, the convergence rate of GP can be accelerated to O(ln k/k), [35]. We sum up the existing first-order distributed algorithms and provide a comparison in terms of speed in Table I, including both undirected and directed graphs. In Table I, "I" means that DGD with a constant step-size is an Inexact method, and "C" represents that DADMM has a much higher Computation burden compared to other first-order methods.

In this paper, we propose a fast distributed algorithm, termed DEXTRA, to solve the corresponding distributed optimization problem over directed graphs. We assume that the objective functions are restricted strongly-convex, a relaxed version of strong-convexity, under which we show that DEXTRA converges linearly to the optimal solution of the problem. DEXTRA combines the push-sum protocol and EXTRA. The push-sum protocol has been proven useful in dealing with optimization over digraphs, [20-23], while EXTRA works well in optimization problems over undirected graphs with a fast convergence rate and a low computation complexity.

¹See [26-29] for additional information on average-consensus problems.

             Algorithms       General Convex   Strongly-convex
undirected   DGD (α_k)        O(ln k/√k)       O(ln k/k)
             DDA (α_k)        O(ln k/√k)       O(ln k/k)
             DGD (α) (I)      O(1/k)           O(τ^k)
             DNG (α_k)        O(ln k/k)        --
             DNC (α)          O(ln k/k²)       --
             DADMM (α) (C)    --               O(τ^k)
             DLM (α)          --               O(τ^k)
             EXTRA (α)        O(1/k)           O(τ^k)
directed     GP (α_k)         O(ln k/√k)       O(ln k/k)
             D-DGD (α_k)      O(ln k/√k)       O(ln k/k)
             WB-DGD (α_k)     O(ln k/√k)       O(ln k/k)
             DEXTRA (α)       --               O(τ^k)

TABLE I: Convergence rates of first-order distributed optimization algorithms over undirected and directed graphs.

By integrating the push-sum technique into EXTRA, we show that DEXTRA converges exactly to the optimal solution with a linear rate, O(τ^k), when the underlying network is directed. Note that O(τ^k) is commonly described as linear and it should be interpreted as linear on a log-scale. The fast convergence rate is guaranteed because DEXTRA has a constant step-size compared with the diminishing step-sizes used in GP, D-DGD, or WB-DGD. Currently, our formulation is limited to restricted strongly-convex functions. Finally, we note that an earlier version of DEXTRA, [36], was used in [37] to develop Normalized EXTRA-Push. Normalized EXTRA-Push implements the DEXTRA iterations after computing the right eigenvector of the underlying column-stochastic weighting matrix; this computation requires either the knowledge of the weighting matrix at each agent, or an iterative algorithm that converges asymptotically to the right eigenvector. Clearly, DEXTRA does not assume such knowledge.

The remainder of the paper is organized as follows. Section II describes, develops, and interprets the DEXTRA algorithm. Section III presents the appropriate assumptions and states the main convergence results. In Section IV, we present some lemmas as the basis of the proof

of DEXTRA's convergence. The main proof of the convergence rate of DEXTRA is provided in Section V. We show numerical results in Section VI and Section VII contains the concluding remarks.

Notation: We use lowercase bold letters to denote vectors and uppercase italic letters to denote matrices. We denote by [x]_i the ith component of a vector x. For a matrix A, we denote by [A]_i the ith row of A, and by [A]_ij the (i, j)th element of A. The matrix I_n represents the n × n identity, and 1_n and 0_n are the n-dimensional vectors of all 1's and 0's. The inner product of two vectors, x and y, is ⟨x, y⟩. The Euclidean norm of any vector x is denoted by ‖x‖. We define the A-matrix norm, ‖x‖_A, of any vector x by ‖x‖²_A ≜ ⟨x, Ax⟩ = ⟨x, Aᵀx⟩ = ⟨x, ((A + Aᵀ)/2)x⟩, where A is not necessarily symmetric. Note that the A-matrix norm is non-negative only when A + Aᵀ is Positive Semi-Definite (PSD). If a symmetric matrix A is PSD, we write A ⪰ 0, while A ≻ 0 means A is Positive Definite (PD). The largest and smallest eigenvalues of a matrix A are denoted as λ_max(A) and λ_min(A). The smallest nonzero eigenvalue of a matrix A is denoted as λ̃_min(A). For any f(x), ∇f(x) denotes the gradient of f at x.
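As a quick numeric illustration of the A-matrix norm used throughout the paper, consider the minimal sketch below; the function name and example matrix are our own, not the paper's:

```python
import numpy as np

def a_norm_sq(x, A):
    """Squared A-matrix norm <x, Ax>; equals <x, ((A + A^T)/2) x>."""
    return float(x @ (A @ x))

# Non-negative for every x only when A + A^T is PSD; an indefinite example:
A = np.array([[0.0, -3.0], [1.0, 0.0]])     # A + A^T has eigenvalues -2, 2
print(a_norm_sq(np.array([1.0, 1.0]), A))   # -2.0: the "norm" can be negative
print(a_norm_sq(np.array([1.0, -1.0]), A))  #  2.0
```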

II. DEXTRA DEVELOPMENT

In this section, we formulate the optimization problem and describe DEXTRA. We first derive an informal but intuitive proof showing that DEXTRA pushes the agents to achieve consensus and reach the optimal solution. The EXTRA algorithm, [18], is briefly recapitulated in this section. We derive DEXTRA in a form similar to EXTRA such that our algorithm can be viewed as an improvement of EXTRA suited to the case of directed graphs. This reveals the meaning behind DEXTRA: Directed EXTRA. Formal convergence results and proofs are left to Sections III, IV, and V.

Consider a strongly-connected network of n agents communicating over a directed graph, G = (V, E), where V is the set of agents, and E is the collection of ordered pairs, (i, j), i, j ∈ V, such that agent j can send information to agent i. Define N_i^in to be the collection of in-neighbors, i.e., the set of agents that can send information to agent i. Similarly, N_i^out is the set of out-neighbors of agent i. We allow both N_i^in and N_i^out to include the node i itself. Note that in a directed graph, when (i, j) ∈ E, it is not necessary that (j, i) ∈ E. Consequently, N_i^in ≠ N_i^out, in general. We focus on solving a convex optimization problem that is distributed over the above multi-agent network. In particular, the network of agents cooperatively solves the following optimization problem:

P1 :  min f(x) = Σ_{i=1}^n f_i(x),

where each local objective function, f_i : R^p → R, is convex and differentiable, and known only by agent i. Our goal is to develop a distributed iterative algorithm such that each agent converges to the global solution of Problem P1.

A. EXTRA for undirected graphs

EXTRA is a fast exact first-order algorithm that solves Problem P1 when the communication network is undirected. At the kth iteration, agent i performs the following update:

x_i^{k+1} = x_i^k + Σ_{j∈N_i} w_ij x_j^k − Σ_{j∈N_i} w̃_ij x_j^{k−1} − α[∇f_i(x_i^k) − ∇f_i(x_i^{k−1})], (1)

where the weights, w_ij, form a weighting matrix, W = {w_ij}, that is symmetric and doubly-stochastic. The collection W̃ = {w̃_ij} satisfies W̃ = θI_n + (1 − θ)W, with some θ ∈ (0, 1/2]. The update in Eq. (1) converges to the optimal solution at each agent i with a convergence rate of O(1/k), and converges linearly when the objective functions are strongly-convex. To better represent EXTRA and later compare with DEXTRA, we write Eq. (1) in a matrix form. Let x^k, ∇f(x^k) ∈ R^{np} be the collections of all agent states and gradients at time k, i.e., x^k ≜ [x_1^k; ...; x_n^k], ∇f(x^k) ≜ [∇f_1(x_1^k); ...; ∇f_n(x_n^k)], and let W, W̃ ∈ R^{n×n} be the weighting matrices collecting the weights w_ij, w̃_ij, respectively. Then, Eq. (1) can be represented in a matrix form as:

x^{k+1} = [(I_n + W) ⊗ I_p] x^k − (W̃ ⊗ I_p) x^{k−1} − α[∇f(x^k) − ∇f(x^{k−1})], (2)

where the symbol ⊗ is the Kronecker product. We now state DEXTRA and derive it in a similar form as EXTRA.
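To make the update in Eq. (2) concrete, here is a minimal NumPy sketch of the EXTRA iteration for p = 1. The symmetric, doubly-stochastic W and the stacked-gradient callable grad are illustrative assumptions of ours, not part of the paper's formal setup:

```python
import numpy as np

def extra(W, grad, x0, alpha, iters):
    """EXTRA, Eq. (2), p = 1: x <- (I + W)x - W~ x_prev - a*(grad(x) - grad(x_prev)).

    W    : (n, n) symmetric, doubly-stochastic weighting matrix
    grad : maps the stacked state x (n,) to [f_1'(x_1), ..., f_n'(x_n)]
    """
    n = len(W)
    W_tilde = 0.5 * (np.eye(n) + W)   # W~ = theta*I + (1 - theta)*W, theta = 1/2
    g_prev = grad(x0)
    x_prev = x0.copy()
    x = W @ x0 - alpha * g_prev       # first iteration is a plain gradient step
    for _ in range(iters):
        g = grad(x)
        x, x_prev = ((np.eye(n) + W) @ x - W_tilde @ x_prev
                     - alpha * (g - g_prev)), x
        g_prev = g
    return x
```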

B. DEXTRA Algorithm

To solve Problem P1 in the case of directed graphs, we propose DEXTRA, which can be described as follows. Each agent, j ∈ V, maintains two vector variables, x_j^k, z_j^k ∈ R^p, as well as a scalar variable, y_j^k ∈ R, where k is the discrete-time index. At the kth iteration, agent j weights its states, a_ij x_j^k, a_ij y_j^k, as well as ã_ij x_j^{k−1}, and sends these to each of its out-neighbors, i ∈ N_j^out, where the weights, a_ij and ã_ij, are such that:

a_ij = { > 0, i ∈ N_j^out;  0, otherwise },  with Σ_{i=1}^n a_ij = 1, ∀j, (3)

ã_ij = { θ + (1 − θ)a_ij, i = j;  (1 − θ)a_ij, i ≠ j },  ∀j, (4)

where θ ∈ (0, 1/2]. With agent i receiving the information from its in-neighbors, j ∈ N_i^in, it calculates the state, z_i^k, by dividing x_i^k over y_i^k, and updates x_i^{k+1} and y_i^{k+1} as follows:

z_i^k = x_i^k / y_i^k, (5a)
x_i^{k+1} = x_i^k + Σ_{j∈N_i^in} (a_ij x_j^k) − Σ_{j∈N_i^in} (ã_ij x_j^{k−1}) − α[∇f_i(z_i^k) − ∇f_i(z_i^{k−1})], (5b)
y_i^{k+1} = Σ_{j∈N_i^in} (a_ij y_j^k). (5c)

In the above, ∇f_i(z_i^k) is the gradient of the function f_i(z) at z = z_i^k, and ∇f_i(z_i^{k−1}) is the gradient at z_i^{k−1}, respectively. The method is initiated with an arbitrary vector, x_i^0, and with y_i^0 = 1 for any agent i. The step-size, α, is a positive number within a certain interval. We will explicitly discuss the range of α in Section III. We adopt the convention that x_i^{−1} = 0_p and ∇f_i(z_i^{−1}) = 0_p, for any agent i, such that at the first iteration, i.e., k = 0, we have the following iteration instead of Eq. (5):

z_i^0 = x_i^0 / y_i^0, (6a)
x_i^1 = Σ_{j∈N_i^in} (a_ij x_j^0) − α∇f_i(z_i^0), (6b)
y_i^1 = Σ_{j∈N_i^in} (a_ij y_j^0). (6c)

We note that the implementation of Eq. (5) needs each agent to have the knowledge of its out-neighbors (such that it can design the weights according to Eqs. (3) and (4)). In a more restricted setting, e.g., a broadcast application where it may not be possible to know the out-neighbors,

we may use a_ij = 1/|N_j^out|, [20-23, 30, 31, 33]; thus, the implementation only requires each agent to know its out-degree.

To simplify the proof, we write DEXTRA, Eq. (5), in a matrix form. Let A = {a_ij} ∈ R^{n×n} and Ã = {ã_ij} ∈ R^{n×n} be the collections of the weights a_ij and ã_ij, respectively. It is clear that both A and Ã are column-stochastic matrices. Let x^k, z^k, ∇f(x^k) ∈ R^{np} be the collections of all agent states and gradients at time k, i.e., x^k ≜ [x_1^k; ...; x_n^k], z^k ≜ [z_1^k; ...; z_n^k], ∇f(x^k) ≜ [∇f_1(x_1^k); ...; ∇f_n(x_n^k)], and let y^k ∈ R^n be the collection of agent states y_i^k, i.e., y^k ≜ [y_1^k; ...; y_n^k]. Note that at time k, y^k can be represented by the initial value, y^0:

y^k = Ay^{k−1} = A^k y^0 = A^k 1_n. (7)

Define a diagonal matrix, D^k ∈ R^{n×n}, for each k, such that the ith element of D^k is y_i^k, i.e.,

D^k = diag(y^k) = diag(A^k 1_n). (8)

Given that the graph, G, is strongly-connected and the corresponding weighting matrix, A, is non-negative, it follows that D^k is invertible for any k. Then, we can write Eq. (5) equivalently in the matrix form:

z^k = ([D^k]^{−1} ⊗ I_p) x^k, (9a)
x^{k+1} = x^k + (A ⊗ I_p)x^k − (Ã ⊗ I_p)x^{k−1} − α[∇f(z^k) − ∇f(z^{k−1})], (9b)
y^{k+1} = Ay^k, (9c)

where both of the weighting matrices, A and Ã, are column-stochastic and satisfy the relationship Ã = θI_n + (1 − θ)A with some θ ∈ (0, 1/2]. From Eq. (9a), we obtain for any k,

x^k = (D^k ⊗ I_p) z^k. (10)

Therefore, Eq. (9) can be represented as a single equation:

(D^{k+1} ⊗ I_p) z^{k+1} = [(I_n + A)D^k ⊗ I_p] z^k − (ÃD^{k−1} ⊗ I_p) z^{k−1} − α[∇f(z^k) − ∇f(z^{k−1})]. (11)

We refer to the above algorithm as DEXTRA, since Eq. (11) has a form similar to EXTRA in Eq. (2) and is designed to solve Problem P1 in the case of directed graphs. We state our main result in Section III, showing that as time goes to infinity, the iteration in Eq. (11) pushes z^k to achieve consensus and reach the optimal solution at a linear rate. Our proof in this paper will be based on the form of DEXTRA in Eq. (11).
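For concreteness, the following is a minimal sketch of the DEXTRA iteration, Eqs. (6) and (9), for p = 1. The column-stochastic matrix A and the stacked-gradient callable grad are our illustrative assumptions, not the authors' reference implementation; in practice each row of the matrix-vector products is computed by message passing among in-neighbors:

```python
import numpy as np

def dextra(A, grad, x0, alpha, iters):
    """DEXTRA, Eqs. (6) and (9), for p = 1 (one scalar state per agent).

    A    : (n, n) column-stochastic weighting matrix of the digraph
    grad : maps z (n,) to the stacked local gradients (n,)
    """
    n = len(A)
    A_tilde = 0.5 * (np.eye(n) + A)   # A~ = theta*I + (1 - theta)*A, theta = 1/2
    y = np.ones(n)                    # y^0 = 1_n
    g_prev = grad(x0 / y)             # gradient at z^0, Eq. (6a)
    x_prev = x0.copy()
    x = A @ x0 - alpha * g_prev       # Eq. (6b)
    y = A @ y                         # Eq. (6c)
    for _ in range(iters):
        z = x / y                     # Eq. (9a): the division cancels the imbalance
        g = grad(z)
        x, x_prev = (x + A @ x - A_tilde @ x_prev
                     - alpha * (g - g_prev)), x   # Eq. (9b)
        y = A @ y                     # Eq. (9c)
        g_prev = g
    return x / y                      # the z estimates approach 1_n * u
```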

C. Interpretation of DEXTRA

In this section, we give an intuitive interpretation of DEXTRA's convergence to the optimal solution; the formal proof will appear in Sections IV and V. Since A is column-stochastic, the sequence {y^k} generated by Eq. (9c) satisfies lim_{k→∞} y^k = π, where π is some vector in the span of A's right-eigenvector corresponding to the eigenvalue 1. We also obtain that D^∞ = diag(π). For the sake of argument, let us assume that the sequences {z^k} and {x^k} generated by DEXTRA, Eq. (9) or (11), also converge to their own limits, z^∞ and x^∞, respectively (not necessarily true). According to the updating rule in Eq. (9b), the limit x^∞ satisfies

x^∞ = x^∞ + (A ⊗ I_p)x^∞ − (Ã ⊗ I_p)x^∞ − α[∇f(z^∞) − ∇f(z^∞)], (12)

which implies that [(A − Ã) ⊗ I_p]x^∞ = 0_{np}, or x^∞ = π ⊗ u for some vector, u ∈ R^p. It follows from Eq. (9a) that

z^∞ = ([D^∞]^{−1} ⊗ I_p)(π ⊗ u) = 1_n ⊗ u, (13)

where consensus is reached. The above analysis reveals the idea of DEXTRA, which is to overcome the imbalance of agent states that occurs when the graph is directed: both x^∞ and y^∞ lie in the span of π; by dividing x^k over y^k, the imbalance is canceled. Summing up the updates in Eq. (9b) over k from 0 to ∞, we obtain

x^∞ = (A ⊗ I_p)x^∞ − α∇f(z^∞) − Σ_{r=0}^∞ [(Ã − A) ⊗ I_p] x^r;

note that the first iteration is slightly different, as shown in Eqs. (6). Consider x^∞ = π ⊗ u and the preceding relation. It follows that the limit, z^∞, satisfies

α∇f(z^∞) = −Σ_{r=0}^∞ [(Ã − A) ⊗ I_p] x^r. (14)

Therefore, we obtain that

α(1_nᵀ ⊗ I_p)∇f(z^∞) = −(1_nᵀ ⊗ I_p) Σ_{r=0}^∞ [(Ã − A) ⊗ I_p] x^r = 0_p,

which is the optimality condition of Problem P1 considering that z^∞ = 1_n ⊗ u; the last equality holds because 1_nᵀ(Ã − A) = 0_nᵀ, as both matrices are column-stochastic. Therefore, given the assumption that the sequences of DEXTRA iterates, {z^k} and {x^k}, have limits, z^∞ and x^∞, we have shown that z^∞ achieves consensus and reaches the optimal solution of Problem P1. In the next section, we state the main result of this paper and we defer the formal proof to Sections IV and V.

III. ASSUMPTIONS AND MAIN RESULTS

With appropriate assumptions, our main result states that DEXTRA converges to the optimal solution of Problem P1 linearly. In this paper, we assume that the agent graph, G, is strongly-connected; each local function, f_i : R^p → R, is convex and differentiable; and the optimal solution of Problem P1 and the corresponding optimal value exist. Formally, we denote the optimal solution by u ∈ R^p and the optimal value by f^*, i.e.,

f^* = f(u) = min_{x∈R^p} f(x). (15)

Let z^* ∈ R^{np} be defined as

z^* = 1_n ⊗ u. (16)

Besides the above assumptions, we emphasize some other assumptions regarding the objective functions and weighting matrices, which are formally presented as follows.

Assumption A1 (Functions and Gradients). Each private function, f_i, is convex and differentiable and satisfies the following assumptions.
(a) The function, f_i, has Lipschitz gradient with the constant L_{f_i}, i.e., ‖∇f_i(x) − ∇f_i(y)‖ ≤ L_{f_i}‖x − y‖, ∀x, y ∈ R^p.
(b) The function, f_i, is restricted strongly-convex² with respect to the point u with a positive constant S_{f_i}, i.e., S_{f_i}‖x − u‖² ≤ ⟨∇f_i(x) − ∇f_i(u), x − u⟩, ∀x ∈ R^p, where u is the optimal solution of Problem P1.

Following Assumption A1, we have for any x, y ∈ R^{np},

‖∇f(x) − ∇f(y)‖ ≤ L_f‖x − y‖, (17a)
S_f‖x − z^*‖² ≤ ⟨∇f(x) − ∇f(z^*), x − z^*⟩, (17b)

where the constants L_f = max_i{L_{f_i}}, S_f = min_i{S_{f_i}}, and ∇f(x) ≜ [∇f_1(x_1); ...; ∇f_n(x_n)] for any x ≜ [x_1; ...; x_n]. Recalling the definition of D^k in Eq. (8), we formally denote the limit of D^k by D^∞, i.e.,

D^∞ = lim_{k→∞} D^k = diag(A^∞ 1_n) = diag(π), (18)

where π is some vector in the span of the right-eigenvector of A corresponding to eigenvalue 1.

²There are different definitions of restricted strong-convexity. We use the same one as used in EXTRA, [18].
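For instance, for the quadratic objectives used later in Section VI, f_i(x) = ½‖H_i x − h_i‖², both constants in Assumption A1 are available in closed form; a small check (the variable names and data are our own):

```python
import numpy as np

# f_i(x) = 0.5*||H_i x - h_i||^2 has gradient H_i^T (H_i x - h_i), so
# Assumption A1 holds with L_fi = lambda_max(H_i^T H_i), and the function
# is in fact strongly-convex with S_fi = lambda_min(H_i^T H_i).
rng = np.random.default_rng(0)
H_i = rng.standard_normal((5, 3))
eigs = np.linalg.eigvalsh(H_i.T @ H_i)   # ascending eigenvalues
S_fi, L_fi = eigs[0], eigs[-1]
print(S_fi, L_fi)
```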

The next assumption is related to the weighting matrices, A, Ã, and D^∞.

Assumption A2 (Weighting matrices). The weighting matrices, A and Ã, used in DEXTRA, Eq. (9) or (11), satisfy the following.
(a) A is a column-stochastic matrix.
(b) Ã is a column-stochastic matrix and satisfies Ã = θI_n + (1 − θ)A, for some θ ∈ (0, 1/2].
(c) (D^∞)^{−1}Ã + Ãᵀ(D^∞)^{−1} ≻ 0.

One way to guarantee Assumption A2(c) is to design the weighting matrix, Ã, to be diagonally-dominant. For example, each agent j designs the following weights:

a_ij = { 1 − ζ(|N_j^out| − 1), i = j;  ζ, i ≠ j, i ∈ N_j^out },

where ζ is some small positive constant close to zero. This weighting strategy guarantees Assumption A2(c), as we explain in the following. According to the definition of D^∞ in Eq. (18), all eigenvalues of the matrix (D^∞)^{−1}I_n + I_n(D^∞)^{−1} = 2(D^∞)^{−1} are greater than zero. Since the eigenvalues of a matrix are continuous functions of its elements, [38, 39], there must exist a small constant ζ̄ such that, for all ζ ∈ (0, ζ̄), the weighting matrix Ã designed by the constant weighting strategy with parameter ζ satisfies that all the eigenvalues of the matrix (D^∞)^{−1}Ã + Ãᵀ(D^∞)^{−1} are greater than zero. In Section VI, we show DEXTRA's performance using this strategy.

Since the weighting matrices, A and Ã, are designed to be column-stochastic, they satisfy the following.

Lemma 1. (Nedic et al. [20]) For any column-stochastic matrix A ∈ R^{n×n}, we have
(a) The limit lim_{k→∞}[A^k] exists and lim_{k→∞}[A^k]_{ij} = π_i, where π = {π_i} is some vector in the span of the right-eigenvector of A corresponding to eigenvalue 1.
(b) For all i ∈ {1, ..., n}, the entries [A^k]_{ij} and π_i satisfy |[A^k]_{ij} − π_i| < Cγ^k, ∀j, where we can have C = 4 and γ = (1 − 1/n^n).
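Lemma 1 is easy to verify numerically. The sketch below builds a random column-stochastic matrix with self-loops (our own toy construction, assumed to correspond to a strongly-connected digraph) and prints the geometric decay of max_{i,j} |[A^k]_{ij} − π_i|:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
adj = (rng.random((n, n)) < 0.3) | np.eye(n, dtype=bool)   # keep self-loops
A = adj / adj.sum(axis=0)                    # columns sum to 1
pi = np.linalg.matrix_power(A, 1000).mean(axis=1)  # limit of the columns of A^k
for k in (5, 10, 20, 40):
    err = np.abs(np.linalg.matrix_power(A, k) - pi[:, None]).max()
    print(k, err)                            # decays like C * gamma^k
```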

As a result, we obtain that, for any k,

‖D^k − D^∞‖ ≤ nCγ^k. (19)

Eq. (19) implies that different agents reach consensus at a linear rate with the constant γ. Clearly, the convergence rate of DEXTRA will not exceed this consensus rate (because the convergence of DEXTRA means that both consensus and optimality are achieved). We will show this fact theoretically later in this section.

We now introduce some notation to simplify the representation in the rest of the paper. Define the following matrices,

M = (D^∞)^{−1}Ã, (20)
N = (D^∞)^{−1}(Ã − A), (21)
Q = (D^∞)^{−1}(I_n + A − 2Ã), (22)
P = I_n − A, (23)
L = Ã − A, (24)
R = I_n + A − 2Ã, (25)

and constants,

d = max_k {‖D^k‖}, (26)
d⁻ = max_k {‖(D^k)^{−1}‖}, (27)
d∞ = ‖(D^∞)^{−1}‖. (28)

We also define some auxiliary variables and sequences. Let q^* ∈ R^{np} be some vector satisfying

[L ⊗ I_p] q^* + α∇f(z^*) = 0_{np}, (29)

and let q^k be the accumulation of x^r over time:

q^k = Σ_{r=0}^k x^r. (30)

Based on M, N, D^k, z^k, z^*, q^k, and q^*, we further define

G = [ M ⊗ I_p, 0;  0, N ⊗ I_p ],  t^k = [ (D^k ⊗ I_p) z^k;  q^k ],  t^* = [ (D^∞ ⊗ I_p) z^*;  q^* ]. (31)

It is useful to note that the G-matrix norm, ‖a‖²_G, of any vector, a ∈ R^{2np}, is non-negative, i.e., ‖a‖²_G ≥ 0, ∀a. This is because G + Gᵀ is PSD, as can be shown with the help of the following lemma.

Lemma 2. (Chung [40]) Let L_G denote the Laplacian matrix of a directed graph, G. Let U be a transition probability matrix associated to a Markov chain described on G and let s be the left-eigenvector of U corresponding to eigenvalue 1. Then,

L_G = I_n − (S^{1/2}US^{−1/2} + S^{−1/2}UᵀS^{1/2})/2,

where S = diag(s). Additionally, if G is strongly-connected, then the eigenvalues of L_G satisfy 0 = λ₀ < λ₁ ≤ ⋯ ≤ λ_{n−1}.

Considering the underlying directed graph, G, and letting the weighting matrix A used in DEXTRA play the role of the corresponding transition probability matrix, we obtain that

L_G = [(D^∞)^{1/2}(I_n − Aᵀ)(D^∞)^{−1/2} + (D^∞)^{−1/2}(I_n − A)(D^∞)^{1/2}]/2. (32)

Therefore, the matrix N, defined in Eq. (21), satisfies

N + Nᵀ = 2θ(D^∞)^{−1/2} L_G (D^∞)^{−1/2}, (33)

where θ is the positive constant in Assumption A2(b). Clearly, N + Nᵀ is PSD, being a congruence transformation of the PSD matrix L_G scaled by a non-negative scalar. Additionally, from Assumption A2(c), note that M + Mᵀ is PD, and thus for any a ∈ R^{np} it also follows that ‖a‖²_{M⊗I_p} ≥ 0. Therefore, we conclude that G + Gᵀ is PSD and, for any a ∈ R^{2np},

‖a‖²_G ≥ 0. (34)

We now state the main result of this paper in Theorem 1.

Theorem 1. Define

C₁ = d⁻(d‖I_n + A‖ + d‖Ã‖ + 2αL_f),
C₂ = [λ_max(NNᵀ/2) + λ_max((N + Nᵀ)/2)] / λ̃_min(LᵀL),
C₃ = α(nC)²[C₁²/η + (d∞d⁻L_f)²(η + 1/η) + 2S_f/d²],
C₄ = 8C₂(L_f d⁻)²,

C₅ = λ_max((M + Mᵀ)/2) + 4C₂λ_max(RᵀR),
C₆ = S_f/d² − η − η(d∞d⁻L_f)²,
C₇ = λ_max(MMᵀ)/2 + 4C₂λ_max(ÃᵀÃ),
Δ = C₆² − 4C₄δ(1/δ + C₅δ),

where η is some positive constant satisfying 0 < η < S_f/(d²(1 + (d∞d⁻L_f)²)), and δ < λ_min(M + Mᵀ)/(2C₇) is a positive constant reflecting the convergence rate. Let Assumptions A1 and A2 hold. Then, with a proper step-size α ∈ [α_min, α_max], there exist 0 < Γ < ∞ and 0 < γ < 1 such that the sequence {t^k} defined in Eq. (31) satisfies

‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k. (35)

The constant γ is the same as used in Eq. (19), reflecting the consensus rate. The lower bound, α_min, of α satisfies α_min ≤ α̲, where

α̲ ≜ (C₆ − √Δ)/(2C₄δ), (36)

and the upper bound, α_max, of α satisfies α_max ≥ ᾱ, where

ᾱ ≜ min{ ηλ_min(M + Mᵀ)/(2(d∞d⁻L_f)²), (C₆ + √Δ)/(2C₄δ) }. (37)

Proof. See Section V.

Theorem 1 is the key result of this paper. We will show the complete proof of Theorem 1 in Section V. Note that Theorem 1 relates ‖t^k − t^*‖²_G to ‖t^{k+1} − t^*‖²_G, but we would like to show that z^k converges linearly to the optimal point z^*, which Theorem 1 does not show. To this aim, we provide Theorem 2, which describes a relation between ‖z^k − z^*‖ and ‖z^{k+1} − z^*‖. In Theorem 1, we are given specific bounds on α_min and α_max. In order to ensure that the solution set of step-sizes, α, is not empty, i.e., α_min ≤ α_max, it is sufficient (but not necessary) to satisfy

α̲ = (C₆ − √Δ)/(2C₄δ) ≤ ηλ_min(M + Mᵀ)/(2(d∞d⁻L_f)²) ≤ ᾱ, (38)

which is equivalent to an explicit lower bound on η, Eq. (39), obtained by substituting the definitions of C₄, C₆, and Δ into Eq. (38). Recall from Theorem 1 that

η ≤ S_f/(d²(1 + (d∞d⁻L_f)²)). (40)

We note that it may not always be possible to find solutions for η that satisfy both Eqs. (39) and (40). The theoretical restriction here is due to the fact that the step-size bounds in Theorem 1 are not tight. However, the representations of α̲ and ᾱ imply how to increase the interval of appropriate step-sizes. For example, it may be useful to set the weights to increase λ_min(M + Mᵀ)/(d∞d⁻)² such that ᾱ is increased. We will discuss such strategies in the numerical experiments in Section VI. We also observe that, in reality, the range of appropriate step-sizes is much wider. Note that the values of α̲ and ᾱ need the knowledge of the network topology, which may not be available in a distributed manner. Such bounds are not uncommon in the literature where the step-size is a function of the entire topology or of the global objective functions, see [11, 18]. It is an open question how to avoid the global knowledge of the network topology when designing the interval of α.

Remark 1. The positive constant δ in Eq. (35) reflects the convergence rate of ‖t^k − t^*‖²_G. The larger δ is, the faster ‖t^k − t^*‖²_G converges to zero. As δ < λ_min(M + Mᵀ)/(2C₇), the convergence rate of ‖t^k − t^*‖²_G cannot be arbitrarily large.

Based on Theorem 1, we now show the r-linear convergence rate of DEXTRA to the optimal solution.

Theorem 2. Let Assumptions A1 and A2 hold. With the same step-size, α, used in Theorem 1, the sequence, {z^k}, generated by DEXTRA converges exactly to the optimal solution, z^*, at an r-linear rate, i.e., there exist some bounded constants, T > 0 and max{1/(1+δ), γ} < τ < 1, where δ and γ are constants used in Theorem 1, Eq. (35), such that for any k,

‖(D^k ⊗ I_p) z^k − (D^∞ ⊗ I_p) z^*‖² ≤ Tτ^k.

Proof. We start with Eq. (35) in Theorem 1, where t^k is defined earlier in Eq. (31). Since the G-matrix norm is non-negative, recall Eq. (34), we have ‖t^k − t^*‖²_G ≥ 0 for any k. Define

ψ = max{1/(1+δ), γ}, where δ and γ are the constants in Theorem 1. From Eq. (35), we have for any k,

‖t^k − t^*‖²_G ≤ ‖t^{k−1} − t^*‖²_G/(1 + δ) + Γγ^{k−1}/(1 + δ)
≤ ψ‖t^{k−1} − t^*‖²_G + Γψ^k
≤ ⋯ ≤ ψ^k‖t^0 − t^*‖²_G + kΓψ^k.

For any τ satisfying ψ < τ < 1, there exists a constant Ψ such that k ≤ Ψ(τ/ψ)^k for all k. Therefore, we obtain that

‖t^k − t^*‖²_G ≤ τ^k‖t^0 − t^*‖²_G + (ΨΓ)τ^k ≤ (‖t^0 − t^*‖²_G + ΨΓ)τ^k. (41)

From Eq. (31) and the corresponding discussion, we have

‖t^k − t^*‖²_G = ‖(D^k ⊗ I_p)z^k − (D^∞ ⊗ I_p)z^*‖²_M + ‖q^k − q^*‖²_N.

Since N + Nᵀ is PSD (see Eq. (33)), it follows that

‖(D^k ⊗ I_p)z^k − (D^∞ ⊗ I_p)z^*‖²_{(M+Mᵀ)/2} ≤ ‖t^k − t^*‖²_G.

Noting that M + Mᵀ is PD (see Assumption A2(c)), i.e., all eigenvalues of M + Mᵀ are positive, we obtain that

λ_min((M + Mᵀ)/2)‖(D^k ⊗ I_p)z^k − (D^∞ ⊗ I_p)z^*‖² ≤ ‖(D^k ⊗ I_p)z^k − (D^∞ ⊗ I_p)z^*‖²_{(M+Mᵀ)/2}.

Therefore, we have that

‖(D^k ⊗ I_p)z^k − (D^∞ ⊗ I_p)z^*‖² ≤ ‖t^k − t^*‖²_G/λ_min((M + Mᵀ)/2) ≤ (‖t^0 − t^*‖²_G + ΨΓ)τ^k/λ_min((M + Mᵀ)/2).

By letting

T = (‖t^0 − t^*‖²_G + ΨΓ)/λ_min((M + Mᵀ)/2),

we obtain the desired result.

Theorem 2 shows that the sequence, {z^k}, converges at an r-linear rate to the optimal solution, z^*, where the convergence rate is described by the constant, τ. During the derivation of τ, we have τ satisfying max{1/(1+δ), γ} < τ < 1, so that τ > γ. This implies that the convergence rate (described by the constant τ) is bounded by the consensus rate (described by the constant γ). In Sections IV and V, we present some basic relations and the proof of Theorem 1.

IV. AUXILIARY RELATIONS

We provide several basic relations in this section, which will help in the proof of Theorem 1. For the proof, we will assume that the sequences updated by DEXTRA have only one dimension, i.e., p = 1; thus z_i^k, x_i^k ∈ R, ∀i, k. For x_i^k, z_i^k ∈ R^p being p-dimensional vectors, the proof is the same for every dimension by applying the results to each coordinate. Therefore, assuming p = 1 is without loss of generality. Let p = 1 and rewrite DEXTRA, Eq. (11), as

D^{k+1}z^{k+1} = (I_n + A)D^k z^k − ÃD^{k−1}z^{k−1} − α[∇f(z^k) − ∇f(z^{k−1})]. (42)

We first establish a relation among D^kz^k, q^k, D^∞z^*, and q^* (recall the notation and discussion after Lemma 1).

Lemma 3. Let Assumptions A1 and A2 hold. In DEXTRA, the quadruple sequence {D^kz^k, q^k, D^∞z^*, q^*} obeys, for any k,

R(D^{k+1}z^{k+1} − D^∞z^*) + Ã(D^{k+1}z^{k+1} − D^kz^k) = −L(q^{k+1} − q^*) − α[∇f(z^k) − ∇f(z^*)]; (43)

recall Eqs. (20)-(30) for the notation.

Proof. We sum DEXTRA, Eq. (42), over time from 0 to k, to obtain

D^{k+1}z^{k+1} = ÃD^kz^k − α∇f(z^k) − L Σ_{r=0}^k D^r z^r.

By subtracting LD^{k+1}z^{k+1} on both sides of the preceding equation and rearranging the terms, it follows that

RD^{k+1}z^{k+1} + Ã(D^{k+1}z^{k+1} − D^kz^k) = −Lq^{k+1} − α∇f(z^k). (44)

Note that D^∞z^* lies in the span of π, where π is some vector in the span of the right-eigenvector of A corresponding to eigenvalue 1. Since Rπ = 0_n, we have

RD^∞z^* = 0_n. (45)

By subtracting Eq. (45) from Eq. (44), and noting that Lq^* + α∇f(z^*) = 0_n, Eq. (29), we obtain the desired result.

Recall Eq. (19), which shows the convergence of D^k to D^∞ at a geometric rate. We will use this result to develop a relation between ‖D^{k+1}z^{k+1} − D^∞z^*‖ and ‖z^{k+1} − z^*‖, which is in

the following lemma. Similarly, we can establish a relation between ‖D^{k+1}z^{k+1} − D^kz^k‖ and ‖z^{k+1} − z^k‖.

Lemma 4. Let Assumptions A1 and A2 hold and recall the constants d and d⁻ from Eqs. (26) and (27). If z^k is bounded, i.e., ‖z^k‖ ≤ B < ∞, then
(a) ‖z^{k+1} − z^k‖ ≤ d⁻‖D^{k+1}z^{k+1} − D^kz^k‖ + 2d⁻nCBγ^k;
(b) ‖z^{k+1} − z^*‖ ≤ d⁻‖D^{k+1}z^{k+1} − D^∞z^*‖ + d⁻nCBγ^k;
(c) ‖D^{k+1}z^{k+1} − D^∞z^*‖ ≤ d‖z^{k+1} − z^*‖ + nCBγ^k;
where C and γ are the constants defined in Lemma 1.

Proof. (a)

‖z^{k+1} − z^k‖ = ‖(D^{k+1})^{−1}(D^{k+1}z^{k+1} − D^{k+1}z^k)‖
≤ ‖(D^{k+1})^{−1}‖ (‖D^{k+1}z^{k+1} − D^kz^k‖ + ‖D^kz^k − D^{k+1}z^k‖)
≤ d⁻‖D^{k+1}z^{k+1} − D^kz^k‖ + d⁻‖D^k − D^{k+1}‖‖z^k‖
≤ d⁻‖D^{k+1}z^{k+1} − D^kz^k‖ + 2d⁻nCBγ^k.

Similarly, we can prove (b). Finally, we have

‖D^{k+1}z^{k+1} − D^∞z^*‖ = ‖D^{k+1}z^{k+1} − D^{k+1}z^* + D^{k+1}z^* − D^∞z^*‖
≤ d‖z^{k+1} − z^*‖ + ‖D^{k+1} − D^∞‖‖z^*‖
≤ d‖z^{k+1} − z^*‖ + nCBγ^k.

The proof is complete.

Note that the result of Lemma 4 is based on the prerequisite that the sequence {z^k} generated by DEXTRA at the kth iteration is bounded. We will show this boundedness property (for all k) together with the proof of Theorem 1 in the next section. The following two lemmas discuss the boundedness of z^k for a fixed k.

19 19 where d, M are constants defined in Eq. (7) and (0). Proof. We follow the following derivation, 1 z (d ) D z, (d ) D z D z + (d ) D z, (d ) λ min ( M+M (d ) λ min ( M+M (d ) F λ min ( M+M ) D z D z M + (d ) D z, ) t t G + (d ) D z, ) + (d ) D z, where the third inequality holds due to M + M being PD (see Assumption A(c)), and the fourth inequality holds because N-matrix norm has been shown to be nonnegative (see Eq. (33)). Therefore, it follows that z B for B defined in Eq. (46), which is clearly < as long as F <. Lemma 6. Let Assumptions A1 and A hold and recall the definition of constant C 1 Theorem 1. If z 1 and z are bounded by a same constant B, we have that z +1 is also bounded. More specifically, we have z +1 C 1 B. Proof. According to the iteration of DEXTRA in Eq. (4), we can bound D +1 z +1 as D +1 z +1 (In + A)D z + ÃD 1 z 1 + αlf z + αlf z 1, [ ] d (I n + A) + d à + αl f B, where d is the constant defined in Eq. (6). Accordingly, we have z +1 be bounded as follow, from z +1 d D +1 z +1 = C1 B. (47) V. MAIN RESULTS In this section, we first give two propositions that provide the main framewor of the proof. Based on these propositions, we use induction to prove Theorem 1. Proposition 1 claims that

V. MAIN RESULTS

In this section, we first give two propositions that provide the main framework of the proof. Based on these propositions, we use induction to prove Theorem 1. Proposition 1 claims that, for all k ∈ N₊, if ‖t^{k−1} − t^*‖²_G ≤ F₁ and ‖t^k − t^*‖²_G ≤ F₁ for some bounded constant F₁, then ‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k for some appropriate step-size.

Proposition 1. Let Assumptions A1 and A2 hold, and recall the constants C₁, C₂, C₃, C₄, C₅, C₆, C₇, Δ, δ, and γ from Theorem 1. Assume ‖t^{k−1} − t^*‖²_G ≤ F₁ and ‖t^k − t^*‖²_G ≤ F₁ for the same bounded constant F₁. Let the constant B be a function of F₁ as defined in Eq. (46) by substituting F with F₁, and define Γ as

Γ ≜ C₃B². (48)

With a proper step-size α, Eq. (35) is satisfied at the kth iteration, i.e.,

‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k,

where the range of step-sizes is given in Eqs. (36) and (37) in Theorem 1.

Proof. See Appendix A.

Note that Proposition 1 differs from Theorem 1 in that: (i) it only proves the result, Eq. (35), for a certain k, not for all k ∈ N₊; and (ii) it requires the assumption that ‖t^{k−1} − t^*‖²_G ≤ F₁ and ‖t^k − t^*‖²_G ≤ F₁ for some bounded constant F₁. Next, Proposition 2 shows that, for all k ≥ K, where K is some specific value defined later, if ‖t^k − t^*‖²_G ≤ F₂ and ‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k, then ‖t^{k+1} − t^*‖²_G ≤ F₂.

Proposition 2. Let Assumptions A1 and A2 hold, and recall the constants C₁, C₂, C₃, C₄, C₅, C₆, C₇, Δ, δ, and γ from Theorem 1. Assume that at the kth iteration, ‖t^k − t^*‖²_G ≤ F₂ for some bounded constant F₂, and ‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k. Then

‖t^{k+1} − t^*‖²_G ≤ F₂ (49)

is satisfied for all k ≥ K, where K is defined as

K ≜ ⌈ log_{1/γ} ( 2(d⁻)²C₃ / (δλ_min((M + Mᵀ)/2)) ) ⌉. (50)

Proof. See Appendix B.

A. Proof of Theorem 1

We now formally state the proof of Theorem 1.

Proof. Define F ≜ max_{1≤k≤K} {‖t^k − t^*‖²_G}, where K is the constant defined in Eq. (50). The goal is to show that Eq. (35) in Theorem 1 is valid for all k with the step-size being in the range defined in Eqs. (36) and (37). We first prove the result for k ∈ [1, ..., K]: since ‖t^k − t^*‖²_G ≤ F for k ∈ [1, ..., K], we use the result of Proposition 1 to obtain

‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k,  k ∈ [1, ..., K].

Next, we use induction to show Eq. (35) for all k ≥ K. For F defined above:
(i) Base case: when k = K, we have the initial relations

‖t^{K−1} − t^*‖²_G ≤ F, (51a)
‖t^K − t^*‖²_G ≤ F, (51b)
‖t^K − t^*‖²_G ≥ (1 + δ)‖t^{K+1} − t^*‖²_G − Γγ^K. (51c)

(ii) We now assume that the induction hypothesis is true at the kth iteration, for some k ≥ K, i.e.,

‖t^{k−1} − t^*‖²_G ≤ F, (52a)
‖t^k − t^*‖²_G ≤ F, (52b)
‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k, (52c)

and show that this set of equations also holds for k + 1.
(iii) Given Eqs. (52b) and (52c), we obtain ‖t^{k+1} − t^*‖²_G ≤ F by applying Proposition 2. Therefore, by combining ‖t^{k+1} − t^*‖²_G ≤ F with (52b), we obtain ‖t^{k+1} − t^*‖²_G ≥ (1 + δ)‖t^{k+2} − t^*‖²_G − Γγ^{k+1} by Proposition 1. To conclude, we obtain that

‖t^k − t^*‖²_G ≤ F, (53a)
‖t^{k+1} − t^*‖²_G ≤ F, (53b)
‖t^{k+1} − t^*‖²_G ≥ (1 + δ)‖t^{k+2} − t^*‖²_G − Γγ^{k+1} (53c)

hold for k + 1. By induction, we conclude that this set of equations holds for all k ≥ K, which completes the proof.

VI. NUMERICAL EXPERIMENTS

This section provides numerical experiments to study the convergence rate of DEXTRA for a least squares problem over a directed graph. The local objective functions in the least squares problems are strongly-convex. We compare the performance of DEXTRA with other algorithms suited to the case of directed graphs: GP as defined by [20-23], and D-DGD as defined by [30]. Our second experiment verifies the existence of α_min and α_max, such that a proper step-size α lies between α_min and α_max. We also consider various network topologies and weighting strategies to see how the eigenvalues of the network graphs affect the interval of step-sizes, α. Convergence is studied in terms of the residual

re^k = (1/n) Σ_{i=1}^n ‖z_i^k − u‖,

where u is the optimal solution. The distributed least squares problem is described as follows. Each agent owns a private objective function, h_i = H_i x + n_i, where h_i ∈ R^{m_i} and H_i ∈ R^{m_i×p} are measured data, x ∈ R^p is unknown, and n_i ∈ R^{m_i} is random noise. The goal is to estimate x, which we formulate as a distributed optimization problem solving

min f(x) = (1/n) Σ_{i=1}^n ‖H_i x − h_i‖².

We consider the network topology as the digraph shown in Fig. 1. We first apply the local degree weighting strategy, i.e., assign each agent itself and its out-neighbors equal weights according to the agent's own out-degree:

a_ij = 1/|N_j^out|, (i, j) ∈ E. (54)

According to this strategy, the corresponding network parameters are shown in Fig. 1. We now estimate the interval of appropriate step-sizes. We choose L_f = max_i{λ_max(H_iᵀH_i)} = 0.14, and S_f = min_i{λ_min(H_iᵀH_i)} = 0.1. We set η = 0.04, consistent with the upper bound on η in Theorem 1, and δ = 0.1. Note that η and δ are estimated values. According to the calculation, we have C₁ = 36.6 and C₂ = 5.6. Therefore, we estimate that ᾱ = ηλ_min(M + Mᵀ)/(2(d∞d⁻L_f)²) = 0.6, and a corresponding lower bound α̲. We thus pick α = 0.1 ∈ [α̲, ᾱ] for the following experiments.
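A minimal setup for this experiment follows (data sizes, seeds, and variable names are our own, not the paper's; the p = 1 DEXTRA sketch from Section II applies coordinate-wise, consuming grad and reporting re^k):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 10, 5, 3                      # agents, samples per agent, dimension
H = rng.standard_normal((n, m, p))      # private data H_i
x_true = rng.standard_normal(p)
h = np.einsum('nmp,p->nm', H, x_true) + 0.01 * rng.standard_normal((n, m))

def grad(Z):
    """Stacked gradients of f_i(x) = 0.5*||H_i x - h_i||^2 at the rows of Z."""
    return np.stack([H[i].T @ (H[i] @ Z[i] - h[i]) for i in range(n)])

# Centralized optimum u, used only to report the residual re^k.
u = np.linalg.lstsq(H.reshape(-1, p), h.reshape(-1), rcond=None)[0]

def residual(Z):
    """re^k = (1/n) * sum_i ||z_i^k - u||."""
    return np.mean(np.linalg.norm(Z - u, axis=1))
```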

Fig. 1: Strongly-connected but non-balanced digraphs and network parameters.

Our first experiment compares several algorithms suited to directed graphs, illustrated in Fig. 1. The comparison of DEXTRA, GP, D-DGD, and DGD with a row-stochastic weighting matrix is shown in Fig. 2. In this experiment, we set α = 0.1, which is in the range of our theoretical values calculated above. The convergence rate of DEXTRA is linear, as stated in Section III. GP and D-DGD apply the same (diminishing) step-size. As a result, the convergence rate of both is sublinear. We also consider the DGD algorithm, but with the weighting matrix being row-stochastic. The reason is that, in a directed graph, it is impossible to construct a doubly-stochastic matrix. As expected, DGD with a row-stochastic matrix does not converge to the exact optimal solution, while the other three algorithms are suited to directed graphs.

Fig. 2: Comparison of different distributed optimization algorithms (DGD with a row-stochastic matrix, GP, D-DGD, and DEXTRA) in a least squares problem. GP, D-DGD, and DEXTRA are proved to work when the network topology is described by digraphs. Moreover, DEXTRA has a linear convergence rate compared with GP and D-DGD.

According to the theoretical values of α̲ and ᾱ, we are able to set an admissible step-size α ∈ [α̲, 0.6]. In practice, this interval is much wider. Fig. 3 illustrates this fact. Numerical experiments show that α_min = 0⁺ and α_max = 0.447. Though DEXTRA has a much wider range

of step-sizes compared with the theoretical values, it still has a more restricted step-size compared with EXTRA, see [18], where the value of the step-size can be as low as any value close to zero in any network topology, i.e., α_min = 0, as long as a symmetric and doubly-stochastic matrix is applied in EXTRA. The relatively smaller interval is due to the fact that the weighting matrix applied in DEXTRA cannot be symmetric.

Fig. 3: DEXTRA convergence w.r.t. different step-sizes (DEXTRA converges with α = 0.001, 0.03, 0.1, and 0.447; it diverges with α = 0.448). The practical range of step-sizes is much wider than the theoretical bounds. In this case, α ∈ [α_min = 0, α_max = 0.447], while our theoretical bounds show that α ∈ [α̲, ᾱ = 0.6].

The explicit representations of α̲ and ᾱ given in Theorem 1 imply the way to increase the interval of step-sizes, i.e.,

ᾱ ∝ λ_min(M + Mᵀ)/(d∞d⁻)²,  α̲ ∝ 1/(d∞d⁻)².

To increase ᾱ, we increase λ_min(M + Mᵀ)/(d∞d⁻)²; to decrease α̲, we can decrease 1/(d∞d⁻)². Compared with applying the local degree weighting strategy, Eq. (54), as shown in Fig. 3, we achieve a wider range of step-sizes by applying the constant weighting strategy, which can be expressed as

a_ij = { 1 − 0.01(|N_j^out| − 1), i = j;  0.01, i ≠ j, i ∈ N_j^out },  ∀j.

This constant weighting strategy constructs a diagonally-dominant weighting matrix, which increases λ_min(M + Mᵀ)/(d∞d⁻)².
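The constant weighting strategy above is straightforward to construct; a sketch follows (the adjacency-list representation and function name are our assumptions):

```python
import numpy as np

def constant_weights(out_neighbors, zeta=0.01):
    """Column-stochastic A from the constant strategy: column j places zeta
    on each out-neighbor i != j and 1 - zeta*(|N_j^out| - 1) on j itself,
    where N_j^out includes j. Column sums equal 1 by construction."""
    n = len(out_neighbors)
    A = np.zeros((n, n))
    for j, outs in enumerate(out_neighbors):
        others = [i for i in outs if i != j]
        for i in others:
            A[i, j] = zeta
        A[j, j] = 1.0 - zeta * len(others)
    return A
```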

Fig. 4: DEXTRA convergence with the constant weighting strategy (DEXTRA converges with α = 0.001, 0.1, 0.3, and 0.599; it diverges with α = 0.6). A wider range of step-sizes is obtained due to the increase in λ_min(M + Mᵀ)/(d∞d⁻)².

It may also be observed from Figs. 3 and 4 that the same step-size generates quite different convergence speeds when the weighting strategy changes. Comparing Figs. 3 and 4 with the step-size α = 0.1, DEXTRA with the local degree weighting strategy converges much faster.

VII. CONCLUSIONS

In this paper, we introduce DEXTRA, a distributed algorithm to solve multi-agent optimization problems over directed graphs. We have shown that DEXTRA succeeds in driving all agents to the same point, which is the exact optimal solution of the problem, given that the communication graph is strongly-connected and the objective functions are strongly-convex. Moreover, the algorithm converges at a linear rate O(τ^k) for some constant τ < 1. Numerical experiments on a least squares problem show that DEXTRA is the fastest distributed algorithm among all algorithms applicable to directed graphs.

APPENDIX A
PROOF OF PROPOSITION 1

We first bound z^{k−1}, z^k, and z^{k+1}. According to Lemma 5, we obtain ‖z^{k−1}‖ ≤ B and ‖z^k‖ ≤ B, since ‖t^{k−1} − t^*‖²_G ≤ F₁ and ‖t^k − t^*‖²_G ≤ F₁. By applying Lemma 6, we further obtain ‖z^{k+1}‖ ≤ C₁B. Based on the boundedness of z^{k−1}, z^k, and z^{k+1}, we start to

prove the desired result. By applying the restricted strong-convexity assumption, Eq. (17b), it follows that

2αS_f‖z^{k+1} − z^*‖² ≤ 2α⟨D^∞(z^{k+1} − z^*), (D^∞)^{−1}[∇f(z^{k+1}) − ∇f(z^*)]⟩
= 2α⟨D^∞z^{k+1} − D^{k+1}z^{k+1}, (D^∞)^{−1}[∇f(z^{k+1}) − ∇f(z^*)]⟩
+ 2α⟨D^{k+1}z^{k+1} − D^∞z^*, (D^∞)^{−1}[∇f(z^{k+1}) − ∇f(z^k)]⟩
+ 2⟨D^{k+1}z^{k+1} − D^∞z^*, (D^∞)^{−1}α[∇f(z^k) − ∇f(z^*)]⟩
:= s₁ + s₂ + s₃, (55)

where s₁, s₂, s₃ denote each of the RHS terms. We show the boundedness of s₁, s₂, and s₃ in Appendix C. Next, it follows from Lemma 4(c) that

‖D^{k+1}z^{k+1} − D^∞z^*‖² ≤ 2d²‖z^{k+1} − z^*‖² + 2(nCB)²γ^{2k}.

Multiplying both sides of the preceding relation by αS_f/d² and combining it with Eq. (55), we obtain

(αS_f/d²)‖D^{k+1}z^{k+1} − D^∞z^*‖² ≤ s₁ + s₂ + s₃ + (2αS_f(nCB)²/d²)γ^{2k}. (56)

By plugging the related bounds from Appendix C (s₁ from Eq. (83), s₂ from Eq. (84), and s₃ from Eq. (92)) into Eq. (56), it follows that

‖t^k − t^*‖²_G − ‖t^{k+1} − t^*‖²_G
≥ ‖D^{k+1}z^{k+1} − D^∞z^*‖²_{ α[S_f/d² − η − η(d∞d⁻L_f)²]I_n − (1/δ)I_n + 2Q }
+ ‖D^{k+1}z^{k+1} − D^kz^k‖²_{ M − (δ/2)MMᵀ − (α(d∞d⁻L_f)²/η)I_n }
− α(nC)²[ C₁²/η + (d∞d⁻L_f)²(η + 1/η) + 2S_f/d² ] B²γ^k
− ‖q^* − q^{k+1}‖²_{ (δ/2)NNᵀ }. (57)

In order to derive the relation

‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k, (58)

it is sufficient to show that the RHS of Eq. (57) is no less than δ‖t^{k+1} − t^*‖²_G − Γγ^k. Recalling the definitions of G, t^k, and t^* in Eq. (31), we have

δ‖t^{k+1} − t^*‖²_G − Γγ^k = ‖D^{k+1}z^{k+1} − D^∞z^*‖²_{δM} + ‖q^{k+1} − q^*‖²_{δN} − Γγ^k. (59)

Comparing Eqs. (57) with (59), it is sufficient to prove that

‖D^{k+1}z^{k+1} − D^∞z^*‖²_{ α[S_f/d² − η − η(d∞d⁻L_f)²]I_n − (1/δ)I_n + 2Q − δM }
+ ‖D^{k+1}z^{k+1} − D^kz^k‖²_{ M − (δ/2)MMᵀ − (α(d∞d⁻L_f)²/η)I_n }
+ Γγ^k − α(nC)²[ C₁²/η + (d∞d⁻L_f)²(η + 1/η) + 2S_f/d² ] B²γ^k
≥ ‖q^* − q^{k+1}‖²_{ δ(NNᵀ/2 + N) }. (60)

We next aim to bound ‖q^* − q^{k+1}‖²_{δ(NNᵀ/2+N)} in terms of ‖D^{k+1}z^{k+1} − D^∞z^*‖² and ‖D^{k+1}z^{k+1} − D^kz^k‖², such that it is easier to analyze Eq. (60). From Lemma 3, we have

‖q^* − q^{k+1}‖²_{LᵀL} = ‖L(q^* − q^{k+1})‖²
= ‖[R(D^{k+1}z^{k+1} − D^∞z^*) + α(∇f(z^{k+1}) − ∇f(z^*))] + [Ã(D^{k+1}z^{k+1} − D^kz^k) + α(∇f(z^k) − ∇f(z^{k+1}))]‖²
≤ 4(‖D^{k+1}z^{k+1} − D^∞z^*‖²_{RᵀR} + α²L_f²‖z^{k+1} − z^*‖² + ‖D^{k+1}z^{k+1} − D^kz^k‖²_{ÃᵀÃ} + α²L_f²‖z^{k+1} − z^k‖²)
≤ ‖D^{k+1}z^{k+1} − D^∞z^*‖²_{4RᵀR + 8(αL_f d⁻)²I_n} + ‖D^{k+1}z^{k+1} − D^kz^k‖²_{4ÃᵀÃ + 8(αL_f d⁻)²I_n} + 16(αnCd⁻L_f)²B²γ^{2k}. (61)

Since λ(NNᵀ) ≥ 0, λ((N + Nᵀ)/2) ≥ 0, λ(LᵀL) ≥ 0, and λ_min(NNᵀ) = λ_min((N + Nᵀ)/2) = λ_min(LᵀL) = 0 with the same corresponding eigenvector, we have

‖q^* − q^{k+1}‖²_{δ(NNᵀ/2 + N)} ≤ δC₂‖q^* − q^{k+1}‖²_{LᵀL}, (62)

where C₂ is the constant defined in Theorem 1. By combining Eqs. (61) with (62), it follows that

‖q^* − q^{k+1}‖²_{δ(NNᵀ/2 + N)} ≤ δC₂‖q^* − q^{k+1}‖²_{LᵀL}
≤ ‖D^{k+1}z^{k+1} − D^∞z^*‖²_{δC₂(4RᵀR + 8(αL_f d⁻)²I_n)} + ‖D^{k+1}z^{k+1} − D^kz^k‖²_{δC₂(4ÃᵀÃ + 8(αL_f d⁻)²I_n)} + 16δC₂(αnCd⁻L_f)²B²γ^{2k}. (63)

Consider Eq. (60), together with (63). Let

Γ ≜ C₃B², (64)

where C₃ is the constant defined in Theorem 1, such that all the γ^k terms in Eqs. (60) and (63) can be canceled out. In order to prove Eq. (60), it is sufficient to show that the LHS of Eq. (60) is no less than the RHS of Eq. (63), i.e.,

‖D^{k+1}z^{k+1} − D^∞z^*‖²_{ α[S_f/d² − η − η(d∞d⁻L_f)²]I_n − (1/δ)I_n + 2Q − δM }
+ ‖D^{k+1}z^{k+1} − D^kz^k‖²_{ M − (δ/2)MMᵀ − (α(d∞d⁻L_f)²/η)I_n }
≥ ‖D^{k+1}z^{k+1} − D^∞z^*‖²_{ δC₂(4RᵀR + 8(αL_f d⁻)²I_n) } + ‖D^{k+1}z^{k+1} − D^kz^k‖²_{ δC₂(4ÃᵀÃ + 8(αL_f d⁻)²I_n) }. (65)

To satisfy Eq. (65), it is sufficient to have the following two relations hold simultaneously:

α[S_f/d² − η − η(d∞d⁻L_f)²] − 1/δ − δλ_max((M + Mᵀ)/2) ≥ δC₂[4λ_max(RᵀR) + 8(αL_f d⁻)²], (66a)
λ_min((M + Mᵀ)/2) − (δ/2)λ_max(MMᵀ) − α(d∞d⁻L_f)²/η ≥ δC₂[4λ_max(ÃᵀÃ) + 8(αL_f d⁻)²], (66b)

where in Eq. (66a) we ignore the term with Q due to λ_min(Q + Qᵀ) = 0. Recall the definitions

C₄ = 8C₂(L_f d⁻)², (67)
C₅ = λ_max((M + Mᵀ)/2) + 4C₂λ_max(RᵀR), (68)
C₆ = S_f/d² − η − η(d∞d⁻L_f)², (69)
Δ = C₆² − 4C₄δ(1/δ + C₅δ). (70)

The solution of the step-size, α, satisfying Eq. (66a) is

(C₆ − √Δ)/(2C₄δ) ≤ α ≤ (C₆ + √Δ)/(2C₄δ), (71)

where we set

η ≤ S_f/(d²(1 + (d∞d⁻L_f)²)) (72)

to ensure that the solution set of α contains positive values. In order to have δ > 0 in Eq. (66b), it is sufficient for the step-size, α, to satisfy

α ≤ ηλ_min(M + Mᵀ)/(2(d∞d⁻L_f)²). (73)

By combining Eqs. (71) with (73), we conclude that it is sufficient to set the step-size α ∈ [α̲, ᾱ], where

α̲ ≜ (C₆ − √Δ)/(2C₄δ), (74)
ᾱ ≜ min{ ηλ_min(M + Mᵀ)/(2(d∞d⁻L_f)²), (C₆ + √Δ)/(2C₄δ) }, (75)

to establish the desired result, i.e.,

‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k. (76)

Finally, we bound the constant δ, which reflects how fast ‖t^{k+1} − t^*‖²_G converges. Recall the definition of C₇,

C₇ = λ_max(MMᵀ)/2 + 4C₂λ_max(ÃᵀÃ). (77)

To have the solution set of α in Eq. (66b) contain positive values, we need to set

δ < λ_min(M + Mᵀ)/(2C₇). (78)

APPENDIX B
PROOF OF PROPOSITION 2

Since we have ‖t^k − t^*‖²_G ≤ F₂ and ‖t^k − t^*‖²_G ≥ (1 + δ)‖t^{k+1} − t^*‖²_G − Γγ^k, it follows that

‖t^{k+1} − t^*‖²_G ≤ ‖t^k − t^*‖²_G/(1 + δ) + Γγ^k/(1 + δ) ≤ F₂/(1 + δ) + Γγ^k/(1 + δ). (79)

Given the definition of K in Eq. (50), it follows that for k ≥ K,

Γγ^k = C₃B²γ^k ≤ δF₂, (80)

where the inequality follows from the definitions of Γ and B in Eqs. (48) and (46). Therefore, we obtain that

‖t^{k+1} − t^*‖²_G ≤ F₂/(1 + δ) + δF₂/(1 + δ) = F₂. (81)

APPENDIX C
BOUNDING s₁, s₂, AND s₃

Bounding s₁: By using 2⟨a, b⟩ ≤ η‖a‖² + (1/η)‖b‖² for any appropriate vectors a, b, and a positive η, we obtain that

s₁ ≤ (α/η)‖D^∞ − D^{k+1}‖²‖z^{k+1}‖² + αη(d∞L_f)²‖z^{k+1} − z^*‖². (82)

It follows that ‖D^∞ − D^{k+1}‖ ≤ nCγ^k, as shown in Eq. (19), and ‖z^{k+1}‖ ≤ C₁B, as shown in Eq. (47). The term ‖z^{k+1} − z^*‖ can be bounded by applying Lemma 4(b). Therefore,

s₁ ≤ α(nC)²[C₁²/η + 2η(d∞d⁻L_f)²]B²γ^{2k} + 2αη(d∞d⁻L_f)²‖D^{k+1}z^{k+1} − D^∞z^*‖². (83)

Bounding s₂: Similarly, we use Lemma 4(a) to obtain

s₂ ≤ αη‖D^{k+1}z^{k+1} − D^∞z^*‖² + (α(d∞L_f)²/η)‖z^{k+1} − z^k‖²
≤ αη‖D^{k+1}z^{k+1} − D^∞z^*‖² + (2α(nCd∞d⁻L_f)²B²/η)γ^{2k} + (2α(d∞d⁻L_f)²/η)‖D^{k+1}z^{k+1} − D^kz^k‖². (84)

Bounding s₃: We rearrange Eq. (43) in Lemma 3 as follows:

α[∇f(z^k) − ∇f(z^*)] = −R(D^{k+1}z^{k+1} − D^∞z^*) − Ã(D^{k+1}z^{k+1} − D^kz^k) − L(q^{k+1} − q^*).

By substituting α[∇f(z^k) − ∇f(z^*)] in s₃ with the representation in the preceding relation, we represent s₃ as

s₃ = −2‖D^{k+1}z^{k+1} − D^∞z^*‖²_Q + 2⟨D^{k+1}z^{k+1} − D^∞z^*, M(D^kz^k − D^{k+1}z^{k+1})⟩ + 2⟨D^{k+1}z^{k+1} − D^∞z^*, N(q^* − q^{k+1})⟩ (85)
:= s₃a + s₃b + s₃c, (86)

where s₃b is equivalent to

s₃b = 2⟨D^{k+1}z^{k+1} − D^kz^k, Mᵀ(D^∞z^* − D^{k+1}z^{k+1})⟩,

and s₃c can be simplified as

s₃c = 2⟨D^{k+1}z^{k+1}, N(q^* − q^{k+1})⟩ = 2⟨q^{k+1} − q^k, N(q^* − q^{k+1})⟩.

The first equality in the preceding relation holds due to the fact that ND^∞z^* = 0_n, and the second equality follows from the definition of q^k, see Eq. (30). By substituting the representations of s₃b and s₃c into (86), and recalling the definitions of t^k, t^*, and G in Eq. (31), we simplify the representation of s₃:

s₃ = −2‖D^{k+1}z^{k+1} − D^∞z^*‖²_Q + 2⟨t^{k+1} − t^k, G(t^* − t^{k+1})⟩. (87)

With the basic rule

⟨t^{k+1} − t^k, G(t^* − t^{k+1})⟩ + ⟨G(t^{k+1} − t^k), t^* − t^{k+1}⟩ = ‖t^k − t^*‖²_G − ‖t^{k+1} − t^*‖²_G − ‖t^{k+1} − t^k‖²_G, (88)

we obtain that

s₃ = −2‖D^{k+1}z^{k+1} − D^∞z^*‖²_Q + ‖t^k − t^*‖²_G − ‖t^{k+1} − t^*‖²_G − ‖t^{k+1} − t^k‖²_G + ⟨t^{k+1} − t^k, G(t^* − t^{k+1})⟩ − ⟨G(t^{k+1} − t^k), t^* − t^{k+1}⟩. (89)

We analyze the last two terms in Eq. (89):

‖t^{k+1} − t^k‖²_G ≥ ‖D^{k+1}z^{k+1} − D^kz^k‖²_M, (90)

where the inequality holds because the N-matrix norm is non-negative, and

⟨t^{k+1} − t^k, G(t^* − t^{k+1})⟩ − ⟨G(t^{k+1} − t^k), t^* − t^{k+1}⟩
= ⟨M(D^{k+1}z^{k+1} − D^kz^k), D^∞z^* − D^{k+1}z^{k+1}⟩ − ⟨D^{k+1}z^{k+1} − D^∞z^*, N(q^* − q^{k+1})⟩
≤ (δ/2)‖D^{k+1}z^{k+1} − D^kz^k‖²_{MMᵀ} + (δ/2)‖q^* − q^{k+1}‖²_{NNᵀ} + (1/δ)‖D^{k+1}z^{k+1} − D^∞z^*‖², (91)

for some δ > 0. By substituting Eqs. (90) and (91) into Eq. (89), we obtain that

s₃ ≤ ‖t^k − t^*‖²_G − ‖t^{k+1} − t^*‖²_G + ‖q^* − q^{k+1}‖²_{(δ/2)NNᵀ} + ‖D^{k+1}z^{k+1} − D^∞z^*‖²_{(1/δ)I_n − 2Q} − ‖D^{k+1}z^{k+1} − D^kz^k‖²_{M − (δ/2)MMᵀ}. (92)

REFERENCES

[1] V. Cevher, S. Becker, and M. Schmidt, "Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics," IEEE Signal Processing Magazine, vol. 31, no. 5, pp. 32-43, 2014.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1-122, Jan. 2011.
[3] I. Necoara and J. A. K. Suykens, "Application of a smoothing technique to decomposition in convex optimization," IEEE Transactions on Automatic Control, vol. 53, no. 11, Dec. 2008.
[4] G. Mateos, J. A. Bazerque, and G. B. Giannakis, "Distributed sparse linear regression," IEEE Transactions on Signal Processing, vol. 58, no. 10, Oct. 2010.
[5] J. A. Bazerque and G. B. Giannakis, "Distributed spectrum sensing for cognitive radio networks by exploiting sparsity," IEEE Transactions on Signal Processing, vol. 58, no. 3, March 2010.
[6] M. Rabbat and R. Nowak, "Distributed optimization in sensor networks," in 3rd International Symposium on Information Processing in Sensor Networks, Berkeley, CA, Apr. 2004.
[7] U. A. Khan, S. Kar, and J. M. F. Moura, "DILAND: An algorithm for distributed sensor localization with noisy distance measurements," IEEE Transactions on Signal Processing, vol. 58, no. 3, Mar. 2010.
[8] C. Li and L. Li, "A distributed multiple dimensional QoS constrained resource scheduling optimization policy in computational grid," Journal of Computer and System Sciences, vol. 72, no. 4, 2006.
[9] G. Neglia, G. Reina, and S. Alouf, "Distributed gradient optimization for epidemic routing: A preliminary evaluation," in 2nd IFIP Wireless Days, Paris, Dec. 2009.
[10] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48-61, Jan. 2009.
[11] K. Yuan, Q. Ling, and W. Yin, "On the convergence of decentralized gradient descent," arXiv preprint arXiv:1310.7063, 2013.
[12] J. C. Duchi, A. Agarwal, and M. J. Wainwright, "Dual averaging for distributed optimization: Convergence analysis and network scaling," IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592-606, Mar. 2012.
[13] J. F. C. Mota, J. M. F. Xavier, P. M. Q. Aguiar, and M. Puschel, "D-ADMM: A communication-efficient distributed algorithm for separable optimization," IEEE Transactions on Signal Processing, vol. 61, no. 10, May 2013.
[14] E. Wei and A. Ozdaglar, "Distributed alternating direction method of multipliers," in 51st IEEE Annual Conference on Decision and Control, Dec. 2012.
[15] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, "On the linear convergence of the ADMM in decentralized consensus optimization," IEEE Transactions on Signal Processing, vol. 62, no. 7, April 2014.
[16] D. Jakovetic, J. Xavier, and J. M. F. Moura, "Fast distributed gradient methods," IEEE Transactions on Automatic Control, vol. 59, no. 5, 2014.


More information

Constrained Consensus and Optimization in Multi-Agent Networks

Constrained Consensus and Optimization in Multi-Agent Networks Constrained Consensus Optimization in Multi-Agent Networks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić August 15, 2015 Abstract We consider distributed optimization problems

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić February 7, 2017 Abstract We consider distributed optimization

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 55, NO. 9, SEPTEMBER 2010 1987 Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE Abstract

More information

Convergence Rate for Consensus with Delays

Convergence Rate for Consensus with Delays Convergence Rate for Consensus with Delays Angelia Nedić and Asuman Ozdaglar October 8, 2007 Abstract We study the problem of reaching a consensus in the values of a distributed system of agents with time-varying

More information

A Distributed Newton Method for Network Utility Maximization

A Distributed Newton Method for Network Utility Maximization A Distributed Newton Method for Networ Utility Maximization Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie Abstract Most existing wor uses dual decomposition and subgradient methods to solve Networ Utility

More information

arxiv: v1 [math.oc] 24 Dec 2018

arxiv: v1 [math.oc] 24 Dec 2018 On Increasing Self-Confidence in Non-Bayesian Social Learning over Time-Varying Directed Graphs César A. Uribe and Ali Jadbabaie arxiv:82.989v [math.oc] 24 Dec 28 Abstract We study the convergence of the

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

arxiv: v2 [eess.sp] 20 Nov 2017

arxiv: v2 [eess.sp] 20 Nov 2017 Distributed Change Detection Based on Average Consensus Qinghua Liu and Yao Xie November, 2017 arxiv:1710.10378v2 [eess.sp] 20 Nov 2017 Abstract Distributed change-point detection has been a fundamental

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Distributed Estimation and Detection for Smart Grid

Distributed Estimation and Detection for Smart Grid Distributed Estimation and Detection for Smart Grid Texas A&M University Joint Wor with: S. Kar (CMU), R. Tandon (Princeton), H. V. Poor (Princeton), and J. M. F. Moura (CMU) 1 Distributed Estimation/Detection

More information

A Distributed Newton Method for Network Optimization

A Distributed Newton Method for Network Optimization A Distributed Newton Method for Networ Optimization Ali Jadbabaie, Asuman Ozdaglar, and Michael Zargham Abstract Most existing wor uses dual decomposition and subgradient methods to solve networ optimization

More information

Randomized Gossip Algorithms

Randomized Gossip Algorithms 2508 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 52, NO 6, JUNE 2006 Randomized Gossip Algorithms Stephen Boyd, Fellow, IEEE, Arpita Ghosh, Student Member, IEEE, Balaji Prabhakar, Member, IEEE, and Devavrat

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

Preconditioning via Diagonal Scaling

Preconditioning via Diagonal Scaling Preconditioning via Diagonal Scaling Reza Takapoui Hamid Javadi June 4, 2014 1 Introduction Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time.

More information

Consensus Optimization with Delayed and Stochastic Gradients on Decentralized Networks

Consensus Optimization with Delayed and Stochastic Gradients on Decentralized Networks 06 IEEE International Conference on Big Data (Big Data) Consensus Optimization with Delayed and Stochastic Gradients on Decentralized Networks Benjamin Sirb Xiaojing Ye Department of Mathematics and Statistics

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Lecture 1 and 2: Introduction and Graph theory basics. Spring EE 194, Networked estimation and control (Prof. Khan) January 23, 2012

Lecture 1 and 2: Introduction and Graph theory basics. Spring EE 194, Networked estimation and control (Prof. Khan) January 23, 2012 Lecture 1 and 2: Introduction and Graph theory basics Spring 2012 - EE 194, Networked estimation and control (Prof. Khan) January 23, 2012 Spring 2012: EE-194-02 Networked estimation and control Schedule

More information

Fast Linear Iterations for Distributed Averaging 1

Fast Linear Iterations for Distributed Averaging 1 Fast Linear Iterations for Distributed Averaging 1 Lin Xiao Stephen Boyd Information Systems Laboratory, Stanford University Stanford, CA 943-91 lxiao@stanford.edu, boyd@stanford.edu Abstract We consider

More information

Asymptotics, asynchrony, and asymmetry in distributed consensus

Asymptotics, asynchrony, and asymmetry in distributed consensus DANCES Seminar 1 / Asymptotics, asynchrony, and asymmetry in distributed consensus Anand D. Information Theory and Applications Center University of California, San Diego 9 March 011 Joint work with Alex

More information

Distributed Optimization over Random Networks

Distributed Optimization over Random Networks Distributed Optimization over Random Networks Ilan Lobel and Asu Ozdaglar Allerton Conference September 2008 Operations Research Center and Electrical Engineering & Computer Science Massachusetts Institute

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY Uplink Downlink Duality Via Minimax Duality. Wei Yu, Member, IEEE (1) (2)

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY Uplink Downlink Duality Via Minimax Duality. Wei Yu, Member, IEEE (1) (2) IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 2, FEBRUARY 2006 361 Uplink Downlink Duality Via Minimax Duality Wei Yu, Member, IEEE Abstract The sum capacity of a Gaussian vector broadcast channel

More information

Distributed Coordinated Tracking With Reduced Interaction via a Variable Structure Approach Yongcan Cao, Member, IEEE, and Wei Ren, Member, IEEE

Distributed Coordinated Tracking With Reduced Interaction via a Variable Structure Approach Yongcan Cao, Member, IEEE, and Wei Ren, Member, IEEE IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 57, NO. 1, JANUARY 2012 33 Distributed Coordinated Tracking With Reduced Interaction via a Variable Structure Approach Yongcan Cao, Member, IEEE, and Wei Ren,

More information

Lecture 3: Huge-scale optimization problems

Lecture 3: Huge-scale optimization problems Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) March 9, 2012 Yu. Nesterov () Huge-scale optimization problems 1/32March 9, 2012 1

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

A note on the minimal volume of almost cubic parallelepipeds

A note on the minimal volume of almost cubic parallelepipeds A note on the minimal volume of almost cubic parallelepipeds Daniele Micciancio Abstract We prove that the best way to reduce the volume of the n-dimensional unit cube by a linear transformation that maps

More information

Multi-Robotic Systems

Multi-Robotic Systems CHAPTER 9 Multi-Robotic Systems The topic of multi-robotic systems is quite popular now. It is believed that such systems can have the following benefits: Improved performance ( winning by numbers ) Distributed

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Least Sparsity of p-norm based Optimization Problems with p > 1

Least Sparsity of p-norm based Optimization Problems with p > 1 Least Sparsity of p-norm based Optimization Problems with p > Jinglai Shen and Seyedahmad Mousavi Original version: July, 07; Revision: February, 08 Abstract Motivated by l p -optimization arising from

More information

A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases

A Generalized Uncertainty Principle and Sparse Representation in Pairs of Bases 2558 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 48, NO 9, SEPTEMBER 2002 A Generalized Uncertainty Principle Sparse Representation in Pairs of Bases Michael Elad Alfred M Bruckstein Abstract An elementary

More information

Resilient Distributed Optimization Algorithm against Adversary Attacks

Resilient Distributed Optimization Algorithm against Adversary Attacks 207 3th IEEE International Conference on Control & Automation (ICCA) July 3-6, 207. Ohrid, Macedonia Resilient Distributed Optimization Algorithm against Adversary Attacks Chengcheng Zhao, Jianping He

More information

COMP 558 lecture 18 Nov. 15, 2010

COMP 558 lecture 18 Nov. 15, 2010 Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to

More information

Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications

Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization

More information

Newton s Method for Constrained Norm Minimization and Its Application to Weighted Graph Problems

Newton s Method for Constrained Norm Minimization and Its Application to Weighted Graph Problems Newton s Method for Constrained Norm Minimization and Its Application to Weighted Graph Problems Mahmoud El Chamie 1 Giovanni Neglia 1 Abstract Due to increasing computer processing power, Newton s method

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Digraph Fourier Transform via Spectral Dispersion Minimization

Digraph Fourier Transform via Spectral Dispersion Minimization Digraph Fourier Transform via Spectral Dispersion Minimization Gonzalo Mateos Dept. of Electrical and Computer Engineering University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,

More information

Alternative Characterization of Ergodicity for Doubly Stochastic Chains

Alternative Characterization of Ergodicity for Doubly Stochastic Chains Alternative Characterization of Ergodicity for Doubly Stochastic Chains Behrouz Touri and Angelia Nedić Abstract In this paper we discuss the ergodicity of stochastic and doubly stochastic chains. We define

More information

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS A SIMPLE PARALLEL ALGORITHM WITH AN O(/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS HAO YU AND MICHAEL J. NEELY Abstract. This paper considers convex programs with a general (possibly non-differentiable)

More information

Dual Ascent. Ryan Tibshirani Convex Optimization

Dual Ascent. Ryan Tibshirani Convex Optimization Dual Ascent Ryan Tibshirani Conve Optimization 10-725 Last time: coordinate descent Consider the problem min f() where f() = g() + n i=1 h i( i ), with g conve and differentiable and each h i conve. Coordinate

More information

arxiv: v1 [math.oc] 23 May 2017

arxiv: v1 [math.oc] 23 May 2017 A DERANDOMIZED ALGORITHM FOR RP-ADMM WITH SYMMETRIC GAUSS-SEIDEL METHOD JINCHAO XU, KAILAI XU, AND YINYU YE arxiv:1705.08389v1 [math.oc] 23 May 2017 Abstract. For multi-block alternating direction method

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

Degenerate Perturbation Theory. 1 General framework and strategy

Degenerate Perturbation Theory. 1 General framework and strategy Physics G6037 Professor Christ 12/22/2015 Degenerate Perturbation Theory The treatment of degenerate perturbation theory presented in class is written out here in detail. The appendix presents the underlying

More information

Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems

Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems 1 Decentralized Stabilization of Heterogeneous Linear Multi-Agent Systems Mauro Franceschelli, Andrea Gasparri, Alessandro Giua, and Giovanni Ulivi Abstract In this paper the formation stabilization problem

More information

Continuous-Model Communication Complexity with Application in Distributed Resource Allocation in Wireless Ad hoc Networks

Continuous-Model Communication Complexity with Application in Distributed Resource Allocation in Wireless Ad hoc Networks Continuous-Model Communication Complexity with Application in Distributed Resource Allocation in Wireless Ad hoc Networks Husheng Li 1 and Huaiyu Dai 2 1 Department of Electrical Engineering and Computer

More information

arxiv: v3 [math.oc] 8 Jan 2019

arxiv: v3 [math.oc] 8 Jan 2019 Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Distributed Constrained Recursive Nonlinear Least-Squares Estimation: Algorithms and Asymptotics

Distributed Constrained Recursive Nonlinear Least-Squares Estimation: Algorithms and Asymptotics 1 Distributed Constrained Recursive Nonlinear Least-Squares Estimation: Algorithms and Asymptotics Anit Kumar Sahu, Student Member, IEEE, Soummya Kar, Member, IEEE, José M. F. Moura, Fellow, IEEE and H.

More information

10-725/36-725: Convex Optimization Spring Lecture 21: April 6

10-725/36-725: Convex Optimization Spring Lecture 21: April 6 10-725/36-725: Conve Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 21: April 6 Scribes: Chiqun Zhang, Hanqi Cheng, Waleed Ammar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Research on Consistency Problem of Network Multi-agent Car System with State Predictor

Research on Consistency Problem of Network Multi-agent Car System with State Predictor International Core Journal of Engineering Vol. No.9 06 ISSN: 44-895 Research on Consistency Problem of Network Multi-agent Car System with State Predictor Yue Duan a, Linli Zhou b and Yue Wu c Institute

More information

10725/36725 Optimization Homework 4

10725/36725 Optimization Homework 4 10725/36725 Optimization Homework 4 Due November 27, 2012 at beginning of class Instructions: There are four questions in this assignment. Please submit your homework as (up to) 4 separate sets of pages

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

Minimum Repair Bandwidth for Exact Regeneration in Distributed Storage

Minimum Repair Bandwidth for Exact Regeneration in Distributed Storage 1 Minimum Repair andwidth for Exact Regeneration in Distributed Storage Vivec R Cadambe, Syed A Jafar, Hamed Malei Electrical Engineering and Computer Science University of California Irvine, Irvine, California,

More information

On Optimal Frame Conditioners

On Optimal Frame Conditioners On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,

More information

Learning distributions and hypothesis testing via social learning

Learning distributions and hypothesis testing via social learning UMich EECS 2015 1 / 48 Learning distributions and hypothesis testing via social learning Anand D. Department of Electrical and Computer Engineering, The State University of New Jersey September 29, 2015

More information

6.842 Randomness and Computation March 3, Lecture 8

6.842 Randomness and Computation March 3, Lecture 8 6.84 Randomness and Computation March 3, 04 Lecture 8 Lecturer: Ronitt Rubinfeld Scribe: Daniel Grier Useful Linear Algebra Let v = (v, v,..., v n ) be a non-zero n-dimensional row vector and P an n n

More information

Fast Nonnegative Matrix Factorization with Rank-one ADMM

Fast Nonnegative Matrix Factorization with Rank-one ADMM Fast Nonnegative Matrix Factorization with Rank-one Dongjin Song, David A. Meyer, Martin Renqiang Min, Department of ECE, UCSD, La Jolla, CA, 9093-0409 dosong@ucsd.edu Department of Mathematics, UCSD,

More information

Accelerated Distributed Nesterov Gradient Descent

Accelerated Distributed Nesterov Gradient Descent Accelerated Distributed Nesterov Gradient Descent Guannan Qu, Na Li arxiv:705.0776v3 [math.oc] 6 Aug 08 Abstract This paper considers the distributed optimization problem over a network, where the objective

More information

Subgradient Methods in Network Resource Allocation: Rate Analysis

Subgradient Methods in Network Resource Allocation: Rate Analysis Subgradient Methods in Networ Resource Allocation: Rate Analysis Angelia Nedić Department of Industrial and Enterprise Systems Engineering University of Illinois Urbana-Champaign, IL 61801 Email: angelia@uiuc.edu

More information

Dual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Dual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725 Dual Methods Lecturer: Ryan Tibshirani Conve Optimization 10-725/36-725 1 Last time: proimal Newton method Consider the problem min g() + h() where g, h are conve, g is twice differentiable, and h is simple.

More information

On the convergence of a regularized Jacobi algorithm for convex optimization

On the convergence of a regularized Jacobi algorithm for convex optimization On the convergence of a regularized Jacobi algorithm for convex optimization Goran Banjac, Kostas Margellos, and Paul J. Goulart Abstract In this paper we consider the regularized version of the Jacobi

More information

Subgradient methods for huge-scale optimization problems

Subgradient methods for huge-scale optimization problems Subgradient methods for huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) May 24, 2012 (Edinburgh, Scotland) Yu. Nesterov Subgradient methods for huge-scale problems 1/24 Outline 1 Problems

More information

Measurement partitioning and observational. equivalence in state estimation

Measurement partitioning and observational. equivalence in state estimation Measurement partitioning and observational 1 equivalence in state estimation Mohammadreza Doostmohammadian, Student Member, IEEE, and Usman A. Khan, Senior Member, IEEE arxiv:1412.5111v1 [cs.it] 16 Dec

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Decentralized Quasi-Newton Methods

Decentralized Quasi-Newton Methods 1 Decentralized Quasi-Newton Methods Mark Eisen, Aryan Mokhtari, and Alejandro Ribeiro Abstract We introduce the decentralized Broyden-Fletcher- Goldfarb-Shanno (D-BFGS) method as a variation of the BFGS

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 16, AUGUST 15, Shaochuan Wu and Michael G. Rabbat. A. Related Work

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 16, AUGUST 15, Shaochuan Wu and Michael G. Rabbat. A. Related Work IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 61, NO. 16, AUGUST 15, 2013 3959 Broadcast Gossip Algorithms for Consensus on Strongly Connected Digraphs Shaochuan Wu and Michael G. Rabbat Abstract We study

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

This appendix provides a very basic introduction to linear algebra concepts.

This appendix provides a very basic introduction to linear algebra concepts. APPENDIX Basic Linear Algebra Concepts This appendix provides a very basic introduction to linear algebra concepts. Some of these concepts are intentionally presented here in a somewhat simplified (not

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information