© 2015 Society for Industrial and Applied Mathematics


SIAM J. OPTIM. Vol. 25, No. 2, pp. 944–966. © 2015 Society for Industrial and Applied Mathematics.

EXTRA: AN EXACT FIRST-ORDER ALGORITHM FOR DECENTRALIZED CONSENSUS OPTIMIZATION

WEI SHI, QING LING, GANG WU, AND WOTAO YIN

Abstract. Recently, there has been growing interest in solving consensus optimization problems in a multiagent network. In this paper, we develop a decentralized algorithm for the consensus optimization problem

minimize_{x ∈ R^p} f(x) = (1/n) Σ_{i=1}^n f_i(x),

which is defined over a connected network of n agents, where each function f_i is held privately by agent i and encodes the agent's data and objective. All the agents shall collaboratively find the minimizer while each agent can only communicate with its neighbors. Such a computation scheme avoids a data fusion center or long-distance communication and offers better load balance to the network. This paper proposes a novel decentralized exact first-order algorithm (abbreviated as EXTRA) to solve the consensus optimization problem. "Exact" means that it can converge to the exact solution. EXTRA uses a fixed, large step size, which can be determined independently of the network size or topology. The local variable of every agent i converges uniformly and consensually to an exact minimizer of f. In contrast, the well-known decentralized gradient descent (DGD) method must use diminishing step sizes in order to converge to an exact minimizer. EXTRA and DGD have the same choice of mixing matrices and similar per-iteration complexity. EXTRA, however, uses the gradients of the last two iterates, unlike DGD which uses just that of the last iterate. EXTRA has the best known convergence rates among the existing synchronized first-order decentralized algorithms for minimizing convex Lipschitz differentiable functions. Specifically, if the f_i's are convex and have Lipschitz continuous gradients, EXTRA has an ergodic convergence rate O(1/k) in terms of the first-order optimality residual.
In addition, as long as f is (restricted) strongly convex (not all individual f_i's need to be so), EXTRA converges to an optimal solution at a linear rate O(C^{-k}) for some constant C > 1.

Key words. consensus optimization, decentralized optimization, gradient method, linear convergence

AMS subject classifications. 90C25, 90C30

DOI. 10.1137/14096668X

1. Introduction. This paper focuses on decentralized consensus optimization, a problem defined on a connected network and solved by n agents cooperatively,

(1.1) minimize_{x ∈ R^p} f(x) = (1/n) Σ_{i=1}^n f_i(x),

over a common variable x ∈ R^p, where, for each agent i, f_i : R^p → R is a convex function privately known by the agent. We assume that the f_i's are continuously differentiable and will introduce a novel first-order algorithm to solve (1.1) in a decentralized manner. We stick to the synchronous case in this paper, that is, all the agents carry out their iterations at the same time intervals.

Problems of the form (1.1) that require decentralized computation are found widely in various scientific and engineering areas including sensor network information processing, multiple-agent control and coordination, as well as distributed machine learning. Examples and works include decentralized averaging [7, 5, 34], learning

Received by the editors April 8, 2014; accepted for publication (in revised form) January 2015; published electronically May 7, 2015. This work was supported by Chinese Scholarship Council (CSC) grants, an NSFC grant, MOF/MIIT/MOST grant BB000005, and NSF DMS grants.

Department of Automation, University of Science and Technology of China, Hefei, China (shiwei00@mail.ustc.edu.cn, qingling@mail.ustc.edu.cn, wug@.ustc.edu.cn). Department of Mathematics, UCLA, Los Angeles, CA (wotaoyin@math.ucla.edu).

[9, 6], estimation [6, 8, 9], sparse optimization [9, 35], and low-rank matrix completion [10] problems. Functions f_i can take the form of least squares [7, 5, 34], regularized least squares [9, 8], as well as more general ones [6]. The solution x* can represent, for example, the average temperature of a room [7, 34], frequency-domain occupancy of spectra, states of a smart grid system [10, 6], sparse vectors [9, 35], a matrix factor [10], and so on. In general, decentralized optimization fits the scenarios in which the data are collected and/or stored in a distributed network, a fusion center is either infeasible or not economical, and/or computing is required to be performed in a decentralized and collaborative manner by multiple agents.

1.1. Related methods. Existing first-order decentralized methods for solving (1.1) include the (sub)gradient method [5, 36], the (sub)gradient-push method [3, 4], the fast (sub)gradient method [5, 4], and the dual averaging method [8]. Compared to classical centralized algorithms, decentralized algorithms encounter more restrictive assumptions and typically worse convergence rates. Most of the above algorithms are analyzed under the assumption of bounded (sub)gradients. The work [] assumes a bounded Hessian for strongly convex functions. Recent work [36] relaxes such assumptions for decentralized gradient descent (DGD). When (1.1) has additional constraints that force x into a bounded set, which also leads to bounded (sub)gradients and a bounded Hessian, projected first-order algorithms are applicable [7, 37]. When using a fixed step size, these algorithms do not converge to a solution x* of problem (1.1) but to a point in its neighborhood, regardless of whether the f_i's are differentiable or not [36]. This motivates the use of certain diminishing step sizes in [5, 8, 4] to guarantee convergence to x*.
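A concrete toy instance of problem (1.1) helps fix ideas. The data below are invented for illustration and are not from the paper: three agents, a scalar variable, and private least-squares terms.

```python
import numpy as np

# n = 3 agents, scalar x; agent i privately holds f_i(x) = 0.5*(a_i*x - b_i)^2.
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 1.0, 2.0])

def f(x):
    # global objective f(x) = (1/n) * sum_i f_i(x)
    return 0.5 * np.mean((a * x - b) ** 2)

# Setting f'(x) = (1/n) * sum_i a_i*(a_i*x - b_i) = 0 gives the consensus minimizer.
x_star = (a @ b) / (a @ a)
```

Every agent must end up at x_star even though no single agent can compute it from its own (a_i, b_i) alone; that is exactly the difficulty the decentralized algorithms below address.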
The rates of convergence are generally weaker than their analogues in centralized computation. For the general convex case and under the bounded (sub)gradient (or Lipschitz continuous objective) assumption, [5] shows that diminishing step sizes α_k = 1/√k lead to a convergence rate of O(ln k/√k) in terms of the running best of objective error, and [8] shows that the dual averaging method has a rate of O(ln k/√k) in the ergodic sense in terms of objective error. For the general convex case, under assumptions of fixed step size and Lipschitz continuous, bounded gradient, [4] shows an outer loop convergence rate of O(1/k²) in terms of objective error, utilizing Nesterov's acceleration, provided that the inner loop performs substantial consensus computation. Without a substantial inner loop, the diminishing step sizes α_k = 1/k^{1/3} lead to a reduced rate of O(ln k/k^{1/3}). The (sub)gradient-push method [3] can be implemented in a dynamic digraph and, under the bounded (sub)gradient assumption and diminishing step sizes α_k = O(1/√k), has a rate of O(ln k/√k) in the ergodic sense in terms of objective error. A better rate of O(ln k/k) is proved for the (sub)gradient-push method in [4] under the strong convexity and Lipschitz gradient assumptions, in terms of expected objective error plus squared consensus residual.

Some of the other related algorithms are as follows. For general convex functions and assuming closed and bounded feasible sets, the decentralized asynchronous alternating direction method of multipliers (ADMM) [3] is proved to have a rate of O(1/k) in terms of expected objective error and feasibility violation. The augmented Lagrangian based primal-dual methods have linear convergence under strong convexity and Lipschitz gradient assumptions [4, 30] or under the positive-definite bounded Hessian assumption [3]. Our proposed algorithm is a synchronous gradient-based algorithm that has an ergodic rate of O(1/k) (in terms of optimality condition violation) for general convex

objectives with Lipschitz differentials and has a linear rate once the sum of, rather than the individual, functions f_i is also (restricted) strongly convex.

1.2. Notation. Throughout the paper, we let agent i hold a local copy of the global variable x, which is denoted by x_(i) ∈ R^p; its value at iteration k is denoted by x^k_(i). We introduce an aggregate objective function of the local variables

f(x) ≜ Σ_{i=1}^n f_i(x_(i)), where x ≜ [x^T_(1); x^T_(2); ...; x^T_(n)] ∈ R^{n×p}.

The gradient of f(x) is defined by

∇f(x) ≜ [∇f_1(x_(1))^T; ∇f_2(x_(2))^T; ...; ∇f_n(x_(n))^T] ∈ R^{n×p}.

Each row i of x and ∇f(x) is associated with agent i. We say that x is consensual if all of its rows are identical, i.e., x_(1) = ⋯ = x_(n). The analysis and results of this paper hold for all p ≥ 1. The reader can assume p = 1 for convenience (so x and ∇f become vectors) without missing any major point. Finally, for a given matrix A and a symmetric positive semidefinite matrix G, we define the G-matrix norm ‖A‖_G ≜ √(trace(A^T G A)). The largest singular value of a matrix A is denoted as σ_max(A). The largest and smallest eigenvalues of a symmetric matrix B are denoted as λ_max(B) and λ_min(B), respectively. The smallest nonzero eigenvalue of a symmetric positive semidefinite matrix B ⪰ 0 is denoted as λ̃_min(B), which is strictly positive. For a matrix A ∈ R^{m×n}, null{A} ≜ {x ∈ R^n | Ax = 0} is the null space of A and span{A} ≜ {y ∈ R^m | y = Ax, x ∈ R^n} is the linear span of all the columns of A.

1.3. Summary of contributions. This paper introduces a novel gradient-based decentralized algorithm EXTRA, establishes its convergence conditions and rates, and presents numerical results in comparison to DGD. EXTRA can use a fixed step size independent of the network size and quickly converges to the solution to (1.1).
It has a rate of convergence O(1/k) in terms of the best running violation of the first-order optimality condition when f is Lipschitz differentiable, and it has a linear rate of convergence if f is also (restricted) strongly convex. Numerical simulations verify the theoretical results and demonstrate its competitive performance.

1.4. Paper organization. The rest of this paper is organized as follows. Section 2 develops and interprets EXTRA. Section 3 presents its convergence results. Then, section 4 presents three sets of numerical results. Finally, section 5 concludes this paper.
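The stacked notation of section 1.2 is easy to exercise numerically. A minimal sketch (toy sizes and random data, chosen here only for illustration) checks the consensual property and evaluates a G-matrix norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 3

# A consensual x in R^{n x p}: every row (agent copy) is the same vector.
x = np.tile(rng.standard_normal(p), (n, 1))
assert np.allclose(x, np.ones((n, 1)) * x.mean(axis=0))

# G-matrix norm ||A||_G = sqrt(trace(A^T G A)) for a symmetric PSD G.
B = rng.standard_normal((n, n))
G = B @ B.T                      # symmetric positive semidefinite by construction
A = rng.standard_normal((n, p))
norm_G = np.sqrt(np.trace(A.T @ G @ A))
```

Since G is positive semidefinite, trace(A^T G A) is nonnegative, so the square root is well defined.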

2. Algorithm development. This section derives the proposed algorithm EXTRA. We start by briefly reviewing DGD and discussing the dilemma that DGD converges slowly to an exact solution when it uses a sequence of diminishing step sizes, yet it converges faster using a fixed step size but stalls at an inaccurate solution. We then obtain the update formula of EXTRA by taking the difference of two formulas of the DGD update. Provided that the sequence generated by the new update formula with a fixed step size converges to a point, we argue that the point is consensual and optimal. Finally, we briefly discuss the choice of mixing matrices in EXTRA. Formal convergence results and proofs are left to section 3.

2.1. Review of decentralized gradient descent and its limitation. DGD carries out the following iteration,

(2.1) x^{k+1}_(i) = Σ_{j=1}^n w_ij x^k_(j) − α_k ∇f_i(x^k_(i))

for agent i = 1, ..., n. Recall that x^k_(i) ∈ R^p is the local copy of x held by agent i at iteration k, W = [w_ij] ∈ R^{n×n} is a symmetric mixing matrix satisfying null{I − W} = span{1} and σ_max(W − (1/n)11^T) < 1, and α_k > 0 is the step size for iteration k. If two agents i and j are neither neighbors nor identical, then w_ij = 0. This way, the computation of (2.1) involves only local and neighbor information, and hence the iteration is decentralized. Following our notation, we rewrite (2.1) for all the agents together as

(2.2) x^{k+1} = W x^k − α_k ∇f(x^k).

With a fixed step size α_k ≡ α, DGD has inexact convergence. For each agent i, x^k_(i) converges to a point in the O(α)-neighborhood of a solution to (1.1), and these points for different agents can be different. On the other hand, properly reducing α_k enables exact convergence, namely, that each x^k_(i) converges to the same exact solution. However, reducing α_k causes slower convergence, both in theory and in practice. The work [36] assumes that the ∇f_i's are Lipschitz continuous, and studies DGD with a constant step size α_k ≡ α.
Before the iterates reach the O(α)-neighborhood, the objective value reduces at the rate O(1/k), and this rate improves to linear if the f_i's are also (restricted) strongly convex. In comparison, the work [4] studies DGD with diminishing α_k = 1/k^{1/3} and assumes that the f_i's are Lipschitz continuous and bounded. The objective convergence rate slows down to O(1/k^{1/3}). The work [5] studies DGD with diminishing α_k = 1/√k and assumes that the f_i's are Lipschitz continuous; a slower rate O(ln k/√k) is proved. A simple example of decentralized least squares in section 4.1 gives a rough comparison of these three schemes (and how they compare to the proposed algorithm).

To see the cause of inexact convergence with a fixed step size, let x^∞ be the limit of x^k (assuming the step size is small enough to ensure convergence). Taking the limit over k on both sides of iteration (2.2) gives us x^∞ = W x^∞ − α ∇f(x^∞). When α is fixed and nonzero, assuming the consensus of x^∞ (namely, it has identical rows x^∞_(i)) will mean x^∞ = W x^∞, as a result of W1 = 1, and thus ∇f(x^∞) = 0, which is equivalent to ∇f_i(x^∞_(i)) = 0 for all i, i.e., the same point x^∞_(i) simultaneously minimizes f_i for all agents i. This is impossible in general and is different from our objective to find a point that minimizes (1/n) Σ_{i=1}^n f_i.
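The inexactness of fixed-step DGD in (2.2) is easy to observe numerically. A minimal sketch, assuming a toy 3-agent path network and made-up quadratic data (not from the paper):

```python
import numpy as np

# Agent i holds f_i(x) = 0.5*(a_i*x - b_i)^2 on the path network 1-2-3.
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 1.0, 2.0])
x_star = (a @ b) / (a @ a)                 # exact consensus minimizer

# Metropolis mixing matrix for the path graph (symmetric, doubly stochastic).
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])

grad = lambda x: a * (a * x - b)           # entry i is grad f_i at agent i's copy

alpha = 0.05                               # fixed step size
x = np.zeros(3)
for _ in range(5000):
    x = W @ x - alpha * grad(x)            # DGD iteration (2.2)

# The limit satisfies x = W x - alpha*grad(x); it is neither consensual nor
# equal to x_star: it stalls in an O(alpha) neighborhood of the solution.
err = np.abs(x - x_star).max()
```

Shrinking alpha shrinks err roughly proportionally, which is exactly the dilemma discussed above: accuracy requires diminishing steps, and diminishing steps are slow.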

2.2. Development of EXTRA. The next proposition provides simple conditions for the consensus and optimality for problem (1.1).

Proposition 2.1. Assume null{I − W} = span{1}. If

(2.3) x* = [x*^T_(1); x*^T_(2); ...; x*^T_(n)]

satisfies the conditions
1. x* = W x* (consensus),
2. 1^T ∇f(x*) = 0 (optimality),
then x* = x*_(i), for any i, is a solution to the consensus optimization problem (1.1).

Proof. Since null{I − W} = span{1}, x* is consensual if and only if condition 1 holds, i.e., x* = W x*. Since x* is consensual, we have 1^T ∇f(x*) = Σ_{i=1}^n ∇f_i(x*), so condition 2 means optimality.

Next, we construct the update formula of EXTRA, following which the iterate sequence will converge to a point satisfying the two conditions in Proposition 2.1. Consider the DGD update (2.2) written at iterations k+1 and k as follows:

(2.4) x^{k+2} = W x^{k+1} − α ∇f(x^{k+1}),
(2.5) x^{k+1} = W̃ x^k − α ∇f(x^k),

where the former uses the mixing matrix W and the latter uses W̃ = (I + W)/2. The choice of W̃ will be generalized later. The update formula of EXTRA is simply their difference, subtracting (2.5) from (2.4),

(2.6) x^{k+2} − x^{k+1} = W x^{k+1} − W̃ x^k − α ∇f(x^{k+1}) + α ∇f(x^k).

Given x^k and x^{k+1}, the next iterate x^{k+2} is generated by (2.6). Let us assume that {x^k} converges for now and let x^∞ = lim_{k→∞} x^k. Let us also assume that ∇f is continuous. We first establish condition 1 of Proposition 2.1. Taking k → ∞ in (2.6) gives us

(2.7) x^∞ − x^∞ = (W − W̃) x^∞ − α ∇f(x^∞) + α ∇f(x^∞),

from which it follows that

(2.8) W̃ x^∞ − W x^∞ = (W̃ − W) x^∞ = 0.

Therefore, x^∞ is consensual. Provided that 1^T (W̃ − W) = 0, we show that x^∞ also satisfies condition 2 of Proposition 2.1. To see this, adding the first update x^1 = W x^0 − α ∇f(x^0) to the subsequent updates following the formulas of (x^2 − x^1), (x^3 − x^2), ..., (x^{k+2} − x^{k+1}) given by (2.6) and then applying telescopic cancellation, we obtain

(2.9) x^{k+2} = W̃ x^{k+1} − α ∇f(x^{k+1}) + Σ_{t=0}^{k+1} (W − W̃) x^t

or, equivalently,

(2.10) x^{k+2} = W x^{k+1} − α ∇f(x^{k+1}) + Σ_{t=0}^{k} (W − W̃) x^t.

Taking k → ∞, from x^∞ = lim_{k→∞} x^k and x^∞ = W x^∞ = W̃ x^∞, it follows that

(2.11) α ∇f(x^∞) = Σ_{t=0}^{∞} (W − W̃) x^t.

Left multiplying 1^T on both sides of (2.11), in light of 1^T (W − W̃) = 0, we obtain condition 2 of Proposition 2.1:

(2.12) 1^T ∇f(x^∞) = 0.

To summarize, provided that null{I − W} = span{1}, W̃ = (I + W)/2, 1^T (W̃ − W) = 0, and the continuity of ∇f, if a sequence following EXTRA (2.6) converges to a point x^∞, then by Proposition 2.1, x^∞ is consensual and any of its identical row vectors solves problem (1.1).

2.3. The algorithm EXTRA and its assumptions. We present EXTRA, an exact first-order algorithm for decentralized consensus optimization, in Algorithm 1.

Algorithm 1. EXTRA.
Choose α > 0 and mixing matrices W ∈ R^{n×n} and W̃ ∈ R^{n×n};
Pick any x^0 ∈ R^{n×p};
1. x^1 ← W x^0 − α ∇f(x^0);
2. for k = 0, 1, ... do
   x^{k+2} ← (I + W) x^{k+1} − W̃ x^k − α [∇f(x^{k+1}) − ∇f(x^k)];
end for

Breaking the computation down to the individual agents, step 1 of EXTRA performs the updates

x^1_(i) = Σ_{j=1}^n w_ij x^0_(j) − α ∇f_i(x^0_(i)), i = 1, ..., n,

and step 2 at each iteration k performs the updates

x^{k+2}_(i) = x^{k+1}_(i) + Σ_{j=1}^n w_ij x^{k+1}_(j) − Σ_{j=1}^n w̃_ij x^k_(j) − α [∇f_i(x^{k+1}_(i)) − ∇f_i(x^k_(i))], i = 1, ..., n.

Each agent i computes ∇f_i(x^k_(i)) once for each k and uses it twice, for x^{k+1}_(i) and x^{k+2}_(i). For our recommended choice of W̃ = (W + I)/2, each agent computes Σ_{j=1}^n w_ij x^k_(j) once as well.

Here we formally give the assumptions on the mixing matrices W and W̃ for EXTRA. All of them will be used in the convergence analysis in the next section.

Assumption 1 (mixing matrix). Consider a connected network G = {V, E} consisting of a set of agents V = {1, 2, ..., n} and a set of undirected edges E. The mixing matrices W = [w_ij] ∈ R^{n×n} and W̃ = [w̃_ij] ∈ R^{n×n} satisfy the following.
1. (Decentralized property) If i ≠ j and (i, j) ∉ E, then w_ij = w̃_ij = 0.
2. (Symmetry) W = W^T, W̃ = W̃^T.

3. (Null space property) null{W̃ − W} = span{1}, null{I − W̃} ⊇ span{1}.
4. (Spectral property) W̃ ≻ 0 and (I + W)/2 ⪰ W̃ ⪰ W.

We claim that parts 2–4 of Assumption 1 imply null{I − W} = span{1} and that the eigenvalues of W lie in (−1, 1], which are commonly assumed for DGD. Therefore, the additional assumptions are merely on W̃. In fact, EXTRA can use the same W used in DGD and simply take W̃ = (I + W)/2, which satisfies parts 1–4. It is also worth noting that the recent work push-DGD [3] relaxes the symmetry condition, yet such a relaxation for EXTRA is not trivial and is left as future work.

Proposition 2.2. Parts 2–4 of Assumption 1 imply null{I − W} = span{1} and that the eigenvalues of W lie in (−1, 1].

Proof. From part 4, we have (I + W)/2 ⪰ W̃ ≻ 0 and thus W ≻ −I and λ_min(W) > −1. Also from part 4, we have (I + W)/2 ⪰ W and thus I ⪰ W, which means λ_max(W) ≤ 1. Hence, all the eigenvalues of W (and those of W̃) lie in (−1, 1]. Now, we show null{I − W} = span{1}. Consider a nonzero vector v ∈ null{I − W}, which satisfies (I − W)v = 0 and thus v^T (I − W) v = 0 and v^T v = v^T W v. From (I + W)/2 ⪰ W̃ (part 4), we get v^T v = v^T ((I + W)/2) v ≥ v^T W̃ v, while from W̃ ⪰ W (part 4) we also get v^T W̃ v ≥ v^T W v = v^T v. Therefore, we have v^T W̃ v = v^T v or, equivalently (since I ⪰ W̃), (W̃ − I) v = 0, adding which to (I − W) v = 0 yields (W̃ − W) v = 0. In light of null{W̃ − W} = span{1} (part 3), we must have v ∈ span{1} and thus null{I − W} = span{1}.

Note that it is easy to ensure that W is diagonally dominant, so that the range of eigenvalues in Proposition 2.2 will lie in [0, 1]. Then, the simple choice W̃ = (I + W)/2 will have λ_min(W̃) ≥ 1/2, so that the step size α (given in (3.8) and Theorem 3.3 below) becomes independent of the network topology.

2.4. Mixing matrices. In EXTRA, the mixing matrices W and W̃ diffuse information throughout the network. The role of W is similar to that in DGD [5, 3, 36] and average consensus [33]. It has a few common choices, which can significantly affect performance.

(i) Symmetric doubly stochastic matrix [5, 3, 36]: W = W^T, W1 = 1, and w_ij ≥ 0.
Special cases of such matrices include parts (ii) and (iii) below.

(ii) Laplacian-based constant edge weight matrix [8, 33]: W = I − L/τ, where L is the Laplacian matrix of the graph G and τ > (1/2) λ_max(L) is a scaling parameter. Denote deg(i) as the degree of agent i. When λ_max(L) is not available, τ = max_{i∈V} {deg(i)} + ε for some small ε > 0, say ε = 1, can be used.

(iii) Metropolis constant edge weight matrix [3, 34]:

w_ij = 1/(max{deg(i), deg(j)} + ε) if (i, j) ∈ E; 0 if (i, j) ∉ E and i ≠ j; and 1 − Σ_{j ≠ i} w_ij if i = j,

for some small positive ε > 0.

(iv) Symmetric fastest distributed linear averaging (FDLA) matrix. It is a symmetric W that achieves the fastest information diffusion and can be obtained by a semidefinite program [33].
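The weight recipes above can be sketched directly. A minimal sketch, assuming a toy 4-agent path graph (not a network from the paper), builds the Metropolis and Laplacian-based choices and checks the properties DGD and EXTRA need:

```python
import numpy as np

# Toy connected graph on n = 4 agents: edges of the path 1-2-3-4.
n = 4
edges = [(0, 1), (1, 2), (2, 3)]
deg = np.zeros(n)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

# (iii) Metropolis constant edge weights with eps = 1.
eps = 1.0
W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (max(deg[i], deg[j]) + eps)
W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)   # w_ii = 1 - sum of off-diagonals

# (ii) Laplacian-based weights W_L = I - L/tau with tau = max degree + eps.
L = np.diag(deg)
for i, j in edges:
    L[i, j] = L[j, i] = -1.0
tau = deg.max() + eps
W_L = np.eye(n) - L / tau

# Both are symmetric, doubly stochastic, with eigenvalues in (-1, 1].
for M in (W, W_L):
    assert np.allclose(M, M.T)
    assert np.allclose(M.sum(axis=1), 1.0)
    e = np.linalg.eigvalsh(M)
    assert e.min() > -1.0 and e.max() <= 1.0 + 1e-12
```

With either choice, the recommended W̃ = (I + W)/2 is a one-liner: `W_t = (np.eye(n) + W) / 2`.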

It is worth noting that the optimal choice for average consensus, FDLA, no longer appears optimal in decentralized consensus optimization, which is more general. When W is chosen following any strategy above, W̃ = (I + W)/2 is found to be very efficient.

2.5. EXTRA as corrected DGD. We rewrite (2.10) as

(2.13) x^{k+1} = [W x^k − α ∇f(x^k)] (DGD) + [Σ_{t=0}^{k−1} (W − W̃) x^t] (correction), k = 0, 1, ...

An EXTRA update is, therefore, a DGD update with a cumulative correction term. In subsection 2.1, we argued that the DGD update cannot reach consensus asymptotically unless α_k asymptotically vanishes. Since α ∇f(x^k) with a fixed α > 0 cannot vanish in general, it must be corrected, for otherwise x^{k+1} − W x^k does not vanish, preventing x^k from being asymptotically consensual. Provided that (2.13) converges, the role of the cumulative term Σ_{t=0}^{k−1} (W − W̃) x^t is to neutralize α ∇f(x^k) in (span{1})^⊥, the subspace orthogonal to 1. If a vector v obeys v^T (W − W̃) = 0, then the convergence of (2.13) means the vanishing of v^T ∇f(x^k) in the limit. We need 1^T ∇f(x^∞) = 0 for consensus optimality. The correction term in (2.13) is the simplest that we could find so far. In particular, the summation is necessary since each individual term (W − W̃) x^t is asymptotically vanishing. The terms must work cumulatively.

3. Convergence analysis. To establish the convergence of EXTRA, this paper makes two additional but common assumptions as follows. Unless otherwise stated, the results in this section are given under Assumptions 1–3.

Assumption 2 (convex objective with Lipschitz continuous gradient). The objective functions f_i are proper closed convex and Lipschitz differentiable:

‖∇f_i(x_a) − ∇f_i(x_b)‖ ≤ L_{f_i} ‖x_a − x_b‖ for all x_a, x_b ∈ R^p,

where the L_{f_i} ≥ 0 are constants. Following Assumption 2, the function f(x) = Σ_{i=1}^n f_i(x_(i)) is proper closed convex, and ∇f is Lipschitz continuous,

‖∇f(x_a) − ∇f(x_b)‖_F ≤ L_f ‖x_a − x_b‖_F for all x_a, x_b ∈ R^{n×p},

with constant L_f = max_i {L_{f_i}}.

Assumption 3 (solution existence). Problem (1.1)
has a nonempty set of optimal solutions: X* ≠ ∅.

3.1. Preliminaries. We first state a lemma that gives the first-order optimality conditions of (1.1).

Lemma 3.1 (first-order optimality conditions). Given mixing matrices W and W̃, define U = (W̃ − W)^{1/2} by letting U ≜ V S^{1/2} V^T ∈ R^{n×n}, where V S V^T = W̃ − W is the economical-form singular value decomposition. Then, under Assumptions 1–3, x* is consensual and x*_(1) = x*_(2) = ⋯ = x*_(n) is optimal to problem (1.1) if and only if there exists q* = U p for some p ∈ R^{n×p} such that

(3.1) U q* + α ∇f(x*) = 0,
(3.2) U x* = 0.

Proof. According to Assumption 1 and the definition of U, we have null{U} = null{V^T} = null{W̃ − W} = span{1}. Hence, by condition 1 of Proposition 2.1, x* is consensual if and only if (3.2) holds. Next, following condition 2 of Proposition 2.1, x* is optimal if and only if 1^T ∇f(x*) = 0. Since U is symmetric and U1 = 0, (3.1) gives 1^T ∇f(x*) = 0. Conversely, if 1^T ∇f(x*) = 0, then ∇f(x*) ∈ span{U} follows from span{U} = (span{1})^⊥, and thus −α ∇f(x*) = U q̃ for some q̃. Let q* = Proj_{span{U}} q̃. Then, U q* = U q̃ and (3.1) holds.

Let x* and q* satisfy the optimality conditions (3.1) and (3.2). Introduce the auxiliary sequence q^k = Σ_{t=0}^{k} U x^t and, for each k,

(3.3) z^k = [q^k; x^k], z* = [q*; x*], G = [[I, 0], [0, W̃]].

The next lemma establishes the relations among x^k, q^k, x*, and q*.

Lemma 3.2. In EXTRA, the quadruple sequence {x^k, q^k, x*, q*} obeys

(3.4) (I + W − 2W̃)(x^{k+1} − x*) + W̃ (x^{k+1} − x^k) = −U (q^{k+1} − q*) − α [∇f(x^k) − ∇f(x*)]

for any k = 0, 1, ...

Proof. Similarly to how (2.9) is derived, summing the EXTRA iterations 1 through k+1,

x^1 = W x^0 − α ∇f(x^0),
x^2 = (I + W) x^1 − W̃ x^0 − α ∇f(x^1) + α ∇f(x^0),
...,
x^{k+1} = (I + W) x^k − W̃ x^{k−1} − α ∇f(x^k) + α ∇f(x^{k−1}),

we get

(3.5) x^{k+1} = W x^k − Σ_{t=0}^{k−1} (W̃ − W) x^t − α ∇f(x^k).

Using q^{k+1} = Σ_{t=0}^{k+1} U x^t and the decomposition W̃ − W = U², it follows from (3.5) that

(3.6) (I + W − 2W̃) x^{k+1} + W̃ (x^{k+1} − x^k) = −U q^{k+1} − α ∇f(x^k).

Since (I + W − 2W̃) 1 = 0 and x* is consensual, we have

(3.7) (I + W − 2W̃) x* = 0.

Subtracting (3.7) from (3.6) and adding 0 = U q* + α ∇f(x*) to (3.6), we obtain (3.4).

The convergence analysis is based on the recursion (3.4). Below we will show that x^k converges to a solution x* ∈ X* and that ‖z^k − z^{k+1}‖²_G converges to 0 at a rate of O(1/k) in an ergodic sense. Further assuming (restricted) strong convexity, we obtain the Q-linear convergence of ‖z^k − z*‖²_G to 0, which implies the R-linear convergence of x^k to x*.
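Before the analysis, Algorithm 1 itself is short enough to run. A minimal sketch, reusing the made-up 3-agent least-squares instance from the DGD discussion (the step size is simply chosen small; it is not a value prescribed by the paper):

```python
import numpy as np

# f_i(x) = 0.5*(a_i*x - b_i)^2 on the path network 1-2-3 (illustrative data).
a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 1.0, 2.0])
x_star = (a @ b) / (a @ a)                  # exact consensus minimizer

W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])             # Metropolis weights
W_t = (np.eye(3) + W) / 2                   # recommended W~ = (I + W)/2
grad = lambda x: a * (a * x - b)

alpha = 0.05                                # fixed, conservatively small step size
x_prev = np.zeros(3)                        # x^0
x_curr = W @ x_prev - alpha * grad(x_prev)  # step 1: x^1 = W x^0 - alpha*grad f(x^0)
for _ in range(5000):                       # step 2: the EXTRA recursion
    x_next = (np.eye(3) + W) @ x_curr - W_t @ x_prev \
             - alpha * (grad(x_curr) - grad(x_prev))
    x_prev, x_curr = x_curr, x_next

# Unlike fixed-step DGD, every local copy reaches the exact minimizer.
err = np.abs(x_curr - x_star).max()
```

Note the per-iteration cost matches DGD's up to one extra stored iterate and one extra mixing product, as stated in section 2.3.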

3.2. Convergence and rate. Let us first interpret the step size condition

(3.8) α < λ_min(W̃)/L_f,

which is assumed by Theorem 3.3 below. First of all, let W satisfy Assumption 1. It is easy to ensure λ_min(W) ≥ 0 since otherwise, we can replace W by (I + W)/2. In light of part 4 of Assumption 1, if we let W̃ = (I + W)/2, then we have λ_min(W̃) ≥ 1/2, which simplifies the bound (3.8) to α < 1/(2L_f), which is independent of any network property (size, diameter, etc.). Furthermore, if the L_{f_i} (i = 1, ..., n) are of the same order, the bound 1/(2L_f) has the same order as the bound 2/((1/n) Σ_{i=1}^n L_{f_i}), which is used in the (centralized) gradient descent method. In other words, a fixed and rather large step size is permitted by EXTRA.

Theorem 3.3. Under Assumptions 1–3, if α satisfies 0 < α < λ_min(W̃)/L_f, then

(3.9) ‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G ≥ ζ ‖z^k − z^{k+1}‖²_G, k = 0, 1, ...,

where ζ = 1 − α L_f/λ_min(W̃). Furthermore, z^k converges to an optimal z*.

Proof. Following Assumption 2, ∇f is Lipschitz continuous and thus we have

(3.10) (2α/L_f) ‖∇f(x^k) − ∇f(x*)‖²_F ≤ 2α ⟨x^k − x*, ∇f(x^k) − ∇f(x*)⟩ = 2 ⟨x^{k+1} − x*, α[∇f(x^k) − ∇f(x*)]⟩ + 2α ⟨x^k − x^{k+1}, ∇f(x^k) − ∇f(x*)⟩.

Substituting (3.4) from Lemma 3.2 for α[∇f(x^k) − ∇f(x*)], it follows from (3.10) that

(3.11) (2α/L_f) ‖∇f(x^k) − ∇f(x*)‖²_F ≤ 2 ⟨x^{k+1} − x*, U(q* − q^{k+1})⟩ + 2 ⟨x^{k+1} − x*, W̃(x^k − x^{k+1})⟩ − 2 ‖x^{k+1} − x*‖²_{I+W−2W̃} + 2α ⟨x^k − x^{k+1}, ∇f(x^k) − ∇f(x*)⟩.

For the terms on the right-hand side of (3.11), we have

(3.12) ⟨x^{k+1} − x*, U(q* − q^{k+1})⟩ = ⟨U(x^{k+1} − x*), q* − q^{k+1}⟩ = ⟨U x^{k+1}, q* − q^{k+1}⟩ = ⟨q^{k+1} − q^k, q* − q^{k+1}⟩,
(3.13) ⟨x^{k+1} − x*, W̃(x^k − x^{k+1})⟩ = ⟨x^{k+1} − x*, x^k − x^{k+1}⟩_W̃,

where the second equality in (3.12) follows from U x* = 0, and

(3.14) 2α ⟨x^k − x^{k+1}, ∇f(x^k) − ∇f(x*)⟩ ≤ α L_f ‖x^k − x^{k+1}‖²_F + (α/L_f) ‖∇f(x^k) − ∇f(x*)‖²_F.

Plugging (3.12)–(3.14) into (3.11) and recalling the definitions of z^k, z*, and G, we have

(3.15) (2α/L_f) ‖∇f(x^k) − ∇f(x*)‖²_F ≤ 2 ⟨q^{k+1} − q^k, q* − q^{k+1}⟩ + 2 ⟨x^{k+1} − x*, W̃(x^k − x^{k+1})⟩ − 2 ‖x^{k+1} − x*‖²_{I+W−2W̃} + α L_f ‖x^k − x^{k+1}‖²_F + (α/L_f) ‖∇f(x^k) − ∇f(x*)‖²_F;

that is,

(3.16) 0 ≤ 2 ⟨z^{k+1} − z*, G(z^k − z^{k+1})⟩ − 2 ‖x^{k+1} − x*‖²_{I+W−2W̃} + α L_f ‖x^k − x^{k+1}‖²_F.

Applying the basic equality 2 ⟨z^{k+1} − z*, G(z^k − z^{k+1})⟩ = ‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G − ‖z^k − z^{k+1}‖²_G to (3.16), we have

(3.17) 0 ≤ ‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G − ‖z^k − z^{k+1}‖²_G − 2 ‖x^{k+1} − x*‖²_{I+W−2W̃} + α L_f ‖x^k − x^{k+1}‖²_F.

Define

G̃ = [[I, 0], [0, W̃ − α L_f I]].

By Assumption 1, in particular I + W − 2W̃ ⪰ 0, we have ‖x^{k+1} − x*‖²_{I+W−2W̃} ≥ 0 and thus

‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G ≥ ‖z^k − z^{k+1}‖²_G − α L_f ‖x^k − x^{k+1}‖²_F = ‖z^k − z^{k+1}‖²_G̃.

Since α < λ_min(W̃)/L_f, we have G̃ ≻ 0 and

(3.18) ‖z^k − z^{k+1}‖²_G̃ ≥ ζ ‖z^k − z^{k+1}‖²_G,

which gives (3.9). It follows from (3.9) that, for any optimal z*, ‖z^k − z*‖²_G is bounded and contractive, so ‖z^k − z*‖²_G is converging as ‖z^k − z^{k+1}‖²_G → 0. The convergence of z^k to a solution z* follows from the standard analysis for contraction methods; see, for example, Theorem 3 in [].

To estimate the rate of convergence, we need the following result.

Proposition 3.4. If a sequence {a^k} ⊂ R obeys a^k ≥ 0 and Σ_{t=1}^∞ a^t < ∞, then we have (i) lim_{k→∞} a^k = 0; (ii) (1/k) Σ_{t=1}^k a^t = O(1/k); (iii) min_{t≤k} {a^t} = o(1/k). Part (iii) is due to [6].

Proof. Part (i) is obvious. Let b^k ≜ Σ_{t=1}^k a^t. By the assumptions, b^k is uniformly bounded and obeys

lim_{k→∞} b^k < ∞,

from which part (ii) follows. Since c^k ≜ min_{t≤k} {a^t} is monotonically nonincreasing, we have

c^{2k} = min_{t≤2k} {a^t} ≤ (1/k) Σ_{t=k+1}^{2k} a^t.

This and the fact that lim_{k→∞} Σ_{t=k+1}^{2k} a^t = 0 give us c^{2k} = o(1/k), or part (iii).

Theorem 3.5. In the same setting as Theorem 3.3, the following rates hold:
(1) running-average progress: (1/k) Σ_{t=1}^k ‖z^t − z^{t+1}‖²_G = O(1/k);
(2) running-best progress: min_{t≤k} {‖z^t − z^{t+1}‖²_G} = o(1/k);
(3) running-average optimality residuals: (1/k) Σ_{t=1}^k ‖U q^{t+1} + α ∇f(x^t)‖²_{W̃^{-1}} = O(1/k) and (1/k) Σ_{t=1}^k ‖U x^t‖²_F = O(1/k);
(4) running-best optimality residuals: min_{t≤k} {‖U q^{t+1} + α ∇f(x^t)‖²_{W̃^{-1}}} = o(1/k) and min_{t≤k} {‖U x^t‖²_F} = o(1/k).

Proof. Parts (1) and (2): Since the individual terms ‖z^k − z^{k+1}‖²_G converge to 0, we are able to sum (3.9) in Theorem 3.3 over k = 0 through ∞ and apply the telescopic cancellation, i.e.,

(3.19) Σ_{t=0}^∞ ‖z^t − z^{t+1}‖²_G ≤ (1/ζ) Σ_{t=0}^∞ (‖z^t − z*‖²_G − ‖z^{t+1} − z*‖²_G) ≤ ‖z^0 − z*‖²_G/ζ < ∞.

Then, the results follow from Proposition 3.4 immediately.

Parts (3) and (4): The progress ‖z^k − z^{k+1}‖²_G can be interpreted as the residual to the first-order optimality condition. In light of the first-order optimality conditions (3.1) and (3.2) in Lemma 3.1, the optimality residuals are defined as ‖U q^{k+1} + α ∇f(x^k)‖²_{W̃^{-1}} and ‖U x^k‖²_F. Furthermore,

(1/α) 1^T (U q^{k+1} + α ∇f(x^k)) = ∇f_1(x^k_(1)) + ⋯ + ∇f_n(x^k_(n))

is the violation to the first-order optimality of (1.1), while ‖U x^k‖²_F is the violation of consensus. Below we obtain the convergence rates of the optimality residuals. Using the basic inequality ‖a + b‖²_F ≥ (1/ρ) ‖a‖²_F − (1/(ρ−1)) ‖b‖²_F, which holds for any ρ > 1 and any matrices a and b of the same size, it follows that

(3.20) ‖z^k − z^{k+1}‖²_G = ‖q^k − q^{k+1}‖²_F + ‖x^k − x^{k+1}‖²_W̃
= ‖x^{k+1}‖²_{W̃−W} + ‖(I + W − 2W̃) x^{k+1} + U q^{k+1} + α ∇f(x^k)‖²_{W̃^{-1}}
≥ ‖x^{k+1}‖²_{W̃−W} + (1/ρ) ‖U q^{k+1} + α ∇f(x^k)‖²_{W̃^{-1}} − (1/(ρ−1)) ‖(I + W − 2W̃) x^{k+1}‖²_{W̃^{-1}}.

Since W̃ − W and (I + W − 2W̃) W̃^{-1} (I + W − 2W̃) are symmetric and null{W̃ − W} ⊆ null{(I + W − 2W̃) W̃^{-1} (I + W − 2W̃)}, there exists a bounded υ > 0 such that

‖(I + W − 2W̃) x‖²_{W̃^{-1}} = ‖x‖²_{(I+W−2W̃) W̃^{-1} (I+W−2W̃)} ≤ υ ‖x‖²_{W̃−W}.

It follows from (3.20) that

(3.21) Σ_{t=1}^k ‖z^t − z^{t+1}‖²_G + ‖x^1‖²_{W̃−W} ≥ Σ_{t=1}^k [‖x^{t+1}‖²_{W̃−W} − (υ/(ρ−1)) ‖x^t‖²_{W̃−W} + (1/ρ) ‖U q^{t+1} + α ∇f(x^t)‖²_{W̃^{-1}}] + ‖x^1‖²_{W̃−W} (set ρ > υ + 1)
≥ (1 − υ/(ρ−1)) Σ_{t=1}^k ‖U x^t‖²_F + ‖x^{k+1}‖²_{W̃−W} + (1/ρ) Σ_{t=1}^k ‖U q^{t+1} + α ∇f(x^t)‖²_{W̃^{-1}},

where we used ‖x^t‖²_{W̃−W} = ‖U x^t‖²_F. As part (1) shows that Σ_{t=1}^k ‖z^t − z^{t+1}‖²_G = O(1), we have Σ_{t=1}^k ‖U q^{t+1} + α ∇f(x^t)‖²_{W̃^{-1}} = O(1) and Σ_{t=1}^k ‖U x^t‖²_F = O(1); dividing by k gives part (3). From (3.21) and (3.19), we see that both ‖U q^{t+1} + α ∇f(x^t)‖²_{W̃^{-1}} and ‖U x^t‖²_F are summable. Again, by Proposition 3.4, we have part (4), the o(1/k) rate of the running-best first-order optimality residuals.

It is open whether ‖z^k − z^{k+1}‖²_G is monotonic or not. If one can show its monotonicity, then the convergence rates will hold for the last point in the running sequence.

3.3. Linear convergence under restricted strong convexity. In this subsection we prove that EXTRA with a proper step size reaches linear convergence if the original objective f is restricted strongly convex. A convex function h : R^p → R is strongly convex if there exists μ > 0 such that ⟨∇h(x_a) − ∇h(x_b), x_a − x_b⟩ ≥ μ ‖x_a − x_b‖² for all x_a, x_b ∈ R^p; h is restricted strongly convex with respect to a point x̃ if there exists μ > 0 such that ⟨∇h(x) − ∇h(x̃), x − x̃⟩ ≥ μ ‖x − x̃‖² for all x ∈ R^p. (There are different definitions of restricted strong convexity. Ours is derived from [7].) For convenience in the proof, we introduce the function g(x) ≜ f(x) + (1/(4α)) ‖x‖²_{W̃−W} and claim that f is restricted strongly convex with respect to its solution x* if and only if g is so with respect to x* = 1 (x*)^T.

Proposition 3.6. Under Assumptions 1 and 2, the following two statements are equivalent:
(i) The original objective f(x) = (1/n) Σ_{i=1}^n f_i(x) is restricted strongly convex with respect to x*.

(ii) The penalized function g(x) = f(x) + (1/(4α)) ‖x‖²_{W̃−W} is restricted strongly convex with respect to x*.
In addition, the restricted strong convexity constant of g is no less than that of f. See Appendix A for its proof.

Theorem 3.7. If g(x) ≜ f(x) + (1/(4α)) ‖x‖²_{W̃−W} is restricted strongly convex with respect to x* with constant μ_g > 0, then with a proper step size α < 2 μ_g λ_min(W̃)/L_f², there exists δ > 0 such that the sequence {z^k} generated by EXTRA satisfies

(3.22) ‖z^k − z*‖²_G ≥ (1 + δ) ‖z^{k+1} − z*‖²_G.

That is, ‖z^k − z*‖²_G converges to 0 at the Q-linear rate O((1 + δ)^{-k}). Consequently, ‖x^k − x*‖²_W̃ converges to 0 at the R-linear rate O((1 + δ)^{-k}).

Proof. Toward a lower bound of ‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G. From the definition of g and its restricted strong convexity, we have

(3.23) 2α μ_g ‖x^{k+1} − x*‖²_F ≤ 2α ⟨x^{k+1} − x*, ∇g(x^{k+1}) − ∇g(x*)⟩
= ‖x^{k+1} − x*‖²_{W̃−W} + 2α ⟨x^{k+1} − x*, ∇f(x^{k+1}) − ∇f(x^k)⟩ + 2 ⟨x^{k+1} − x*, α[∇f(x^k) − ∇f(x*)]⟩.

Using Lemma 3.2 for α[∇f(x^k) − ∇f(x*)] in (3.23), we get

(3.24) 2α μ_g ‖x^{k+1} − x*‖²_F ≤ ‖x^{k+1} − x*‖²_{W̃−W} + 2α ⟨x^{k+1} − x*, ∇f(x^{k+1}) − ∇f(x^k)⟩ − 2 ‖x^{k+1} − x*‖²_{I+W−2W̃} + 2 ⟨x^{k+1} − x*, U(q* − q^{k+1})⟩ + 2 ⟨x^{k+1} − x*, W̃(x^k − x^{k+1})⟩
= ‖x^{k+1} − x*‖²_{(W̃−W)−2(I+W−2W̃)} + 2α ⟨x^{k+1} − x*, ∇f(x^{k+1}) − ∇f(x^k)⟩ + 2 ⟨x^{k+1} − x*, U(q* − q^{k+1})⟩ + 2 ⟨x^{k+1} − x*, W̃(x^k − x^{k+1})⟩.

For the last three terms on the right-hand side of (3.24), we have, from Young's inequality,

(3.25) 2α ⟨x^{k+1} − x*, ∇f(x^{k+1}) − ∇f(x^k)⟩ ≤ α η ‖x^{k+1} − x*‖²_F + (α/η) ‖∇f(x^{k+1}) − ∇f(x^k)‖²_F ≤ α η ‖x^{k+1} − x*‖²_F + (α L_f²/η) ‖x^{k+1} − x^k‖²_F,

where η > 0 is a tunable parameter, and

(3.26) ⟨x^{k+1} − x*, U(q* − q^{k+1})⟩ = ⟨q^{k+1} − q^k, q* − q^{k+1}⟩

and

(3.27) ⟨x^{k+1} − x*, W̃(x^k − x^{k+1})⟩ = ⟨x^{k+1} − x*, x^k − x^{k+1}⟩_W̃.

Plugging (3.25)–(3.27) into (3.24) and recalling the definitions of z^k, z*, and G, we obtain

(3.28) 2α μ_g ‖x^{k+1} − x*‖²_F ≤ ‖x^{k+1} − x*‖²_{(W̃−W)−2(I+W−2W̃)} + α η ‖x^{k+1} − x*‖²_F + (α L_f²/η) ‖x^{k+1} − x^k‖²_F + 2 ⟨z^{k+1} − z*, G(z^k − z^{k+1})⟩.

By 2 ⟨z^{k+1} − z*, G(z^k − z^{k+1})⟩ = ‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G − ‖z^k − z^{k+1}‖²_G, (3.28) turns into

(3.29) ‖z^k − z*‖²_G − ‖z^{k+1} − z*‖²_G ≥ ‖x^{k+1} − x*‖²_{α(2μ_g−η)I − (W̃−W) + 2(I+W−2W̃)} + ‖z^k − z^{k+1}‖²_G − (α L_f²/η) ‖x^k − x^{k+1}‖²_F.

A critical inequality. In order to establish (3.22), in light of (3.29), it remains to show

(3.30) ‖x^{k+1} − x*‖²_{α(2μ_g−η)I − (W̃−W) + 2(I+W−2W̃)} + ‖z^k − z^{k+1}‖²_G − (α L_f²/η) ‖x^k − x^{k+1}‖²_F ≥ δ ‖z^{k+1} − z*‖²_G.

With the terms of z^k, z^{k+1}, z* in (3.30) expanded, and from ‖x^{k+1} − x*‖²_{W̃−W} = ‖U(x^{k+1} − x*)‖²_F = ‖U x^{k+1}‖²_F = ‖q^{k+1} − q^k‖²_F, (3.30) is equivalent to

(3.31) ‖x^{k+1} − x*‖²_{α(2μ_g−η)I + 2(I+W−2W̃) − δW̃} + ‖x^k − x^{k+1}‖²_{W̃ − (αL_f²/η)I} ≥ δ ‖q^{k+1} − q*‖²_F,

which is what remains to be shown below. That is, we must find an upper bound for ‖q^{k+1} − q*‖²_F in terms of ‖x^{k+1} − x*‖²_F and ‖x^k − x^{k+1}‖²_F.

Establishing (3.31): Step 1. From Lemma 3.2 we have

(3.32) ‖U(q^{k+1} − q*)‖²_F = ‖(I + W − 2W̃)(x^{k+1} − x*) + W̃(x^{k+1} − x^k) + α[∇f(x^k) − ∇f(x*)]‖²_F
= ‖(I + W − 2W̃)(x^{k+1} − x*) + α[∇f(x^{k+1}) − ∇f(x*)] + W̃(x^{k+1} − x^k) + α[∇f(x^k) − ∇f(x^{k+1})]‖²_F.

From the inequality ‖a + b + c + d‖²_F ≤ θ ((β/(β−1)) ‖a‖²_F + β ‖b‖²_F) + (θ/(θ−1)) ((γ/(γ−1)) ‖c‖²_F + γ ‖d‖²_F), which holds for any θ > 1, β > 1, γ > 1 and any matrices a, b, c, d of the same dimensions, it follows that

(3.33) ‖q^{k+1} − q*‖²_{W̃−W} ≤ θ ((β/(β−1)) ‖x^{k+1} − x*‖²_{(I+W−2W̃)²} + β α² ‖∇f(x^{k+1}) − ∇f(x*)‖²_F) + (θ/(θ−1)) ((γ/(γ−1)) ‖x^k − x^{k+1}‖²_{W̃²} + γ α² ‖∇f(x^k) − ∇f(x^{k+1})‖²_F).

By Lemma 3.1 and the definition of q^k, all the columns of q* and q^{k+1} lie in the column space of W̃ − W. This, together with the Lipschitz continuity of ∇f(x), turns (3.33) into

(3.34) ‖q^{k+1} − q*‖²_F ≤ (θ/λ̃_min(W̃−W)) ((β/(β−1)) λ_max((I+W−2W̃)²) + β α² L_f²) ‖x^{k+1} − x*‖²_F + (θ/((θ−1) λ̃_min(W̃−W))) ((γ/(γ−1)) λ_max(W̃²) + γ α² L_f²) ‖x^k − x^{k+1}‖²_F,

16 EXACT ALGORITHM FOR DECENTRALIZED OPTIMIZATION 959 where λ min ( ) gives the smallest nonzero eigenvalue. To mae a rather tight bound, we choose γ =+ σmax( W ) αl f and β =+ σmax(i+w W ) αl f in (3.34) and obtain q + q F θ(σ max(i + W W )+αl f ) (3.35) λ min ( W x + x F W ) + θ(σ max( W )+αl f ) (θ ) λ min ( W W ) x x + F. Establishing (3.3): Step. In order to establish (3.3), with (3.35), it only remains to show (3.36) (3.37) x + x α(μ g η)i+(i+w W ) δ W + x x + W αl f η ( θ(σmax (I + W δ W )+αl f ) λ min ( W x + x F W ) + θ(σ max( W )+αl f ) ) (θ ) λ min ( W W ) x x + F. To validate (3.36), we need α(μ g η)i +(I + W W ) δ W δθ(σmax(i+w W )+αl f ) λ min( W I, W ) W αl f η I δθ(σmax( W )+αl f ) (θ ) λ min( W W ) I, which holds as long as { α(μ g η) λ min ( δ min W W ) θ(σ max (I + W (3.38) W )+αl f ) + λ max ( W ) λ min ( W W ), (θ )(ηλ min ( W ) αl f ) λ min ( W W ) θη(σ max ( W )+αl f ) To ensure δ>0, the following conditions are what we finally need: ( ) (3.39) η (0, μ g ) and α 0, ηλmin( W ) L f S. Obviously set S is nonempty. Therefore, with a proper step size α S, the sequences z z G are Q-linearly convergent to 0 at the rate O((+δ) ). Since the definition of G-norm implies x x W z z G, x x W is R-linearly convergent to 0atthesamerate. Remar (strong convexity condition for linear convergence). The restricted strong convexity assumption in Theorem 3.7 is imposed on g(x) =f(x)+ 4α x W W, not on f(x). In other words, the linear convergence of EXTRA does not require all f i to be individually (restricted) strongly convex. Remar (acceleration by overshooting W ). For conciseness, we used Assumption for both Theorems 3.5 and 3.7. In fact, for Theorem 3.7, the condition I+W W in part 4 of Assumption can be relaxed, thans to μ g in (3.37). Certain W I+W,suchas W =.5I+W.5, can still give linear convergence. In fact, we observed that such an overshot choice of W can slightly accelerate the convergence of EXTRA. Remar 3(step size optimization). 
We tried deriving an optimal step size and corresponding explicit linear convergence rate by optimizing certain quantities that I }.

17 960 WEI SHI, QING LING, GANG WU, AND WOTAO YIN appear in the proof, but it becomes quite tricy and messy. For the special case W = I+W μg(+λmin(w )), by taing η μ g, we get a satisfactory step size α 4L f. Remar 4(step size for ensuring linear convergence). Interestingly, the critical step size, α< μgλmin( W ) L f = O( μg L f ), in (3.39) for ensuring the linear convergence, and λ the parameter α = min( W W ) μg (+ γ )(μ f L = O( f γ) L f ) in (A.7) for ensuring the restricted strong convexity with O(μ g )=O(μ f ), have the same order. On the other hand, we numerically observed that a step size as large as O( L f ) still leads to linear convergence, and EXTRA becomes faster with this larger step size. It remains an open question to prove linear convergence under this larger step size Decentralized implementation. We shall explain how to perform EX- TRA with only local computation and neighbor communication. EXTRA s formula is formed by f(x), W x and W x, andα. By definition f(x) is local computation. Assumption part ensures that W x and W x can be computed with local and neighbor information. Following our convergence theorems above, determining α requires the bounds on L f and λ min ( W ), as well as that of μ g in the (restricted) strongly convex case. As we have argued at the beginning of subsection 3., it is easy to ensure λ min ( W ),soλ min( W ) can be conservatively set as. In many circumstances, the Lipschitz constants L fi and strong convexity constants, or their bounds, are nown a priori and globally. When they are not nown globally, to obtain L f =max i {L fi },a maximum consensus algorithm is needed. For μ f, except in the case that each f i is (restricted) strongly convex, we can conservatively use min i {μ fi }. When no bound μ g is available in the (restricted) strongly convex case, setting α according to the general convex case (subsection 3.) often still leads to linear convergence. 4. Numerical experiments. 4.. Decentralized least squares. 
Consider a decentralized sensing problem: each agent $i \in \{1,\dots,n\}$ holds its own measurement equation, $y_{(i)} = M_{(i)}x + e_{(i)}$, where $y_{(i)} \in R^{m_i}$ and $M_{(i)} \in R^{m_i \times p}$ are measured data, $x \in R^p$ is an unknown signal, and $e_{(i)} \in R^{m_i}$ is unknown noise. The goal is to estimate $x$. We apply the least squares loss and try to solve

minimize_x  $f(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\|M_{(i)}x - y_{(i)}\|_2^2$.

The network in this experiment is randomly generated with connectivity ratio $r = 0.5$, where $r$ is defined as the number of edges divided by $L(L-1)/2$, the number of all possible edges. We set $n = 10$, $m_i = 1$ for all $i$, and $p = 5$. Data $y_{(i)}$ and $M_{(i)}$, as well as noise $e_{(i)}$, are generated following the standard normal distribution. We normalize the data so that $L_f = 1$. The algorithm starts from $x_{(i)}^0 = 0$ for all $i$, with $\|x^* - x_{(i)}^0\| = 300$.

We use the same matrix $W$, given by strategy (iv) in section 2.4, for both DGD and EXTRA. For EXTRA, we simply use the aforementioned matrix $\tilde W = \frac{I+W}{2}$. We run DGD with a fixed step size $\alpha$, a diminishing one $O(1/k^{1/3})$ [14], a diminishing one $\alpha_0/k^{1/3}$ with hand-optimized $\alpha_0$, a diminishing one $O(1/k^{1/2})$ [25], and a diminishing one $\alpha_0/k^{1/2}$ with hand-optimized $\alpha_0$, where $\alpha$ is the theoretical critical step size given in [36]. We let EXTRA use the same fixed step size $\alpha$.

The numerical results are illustrated in Figure 1. In this experiment, we observe that both DGD with the fixed step size and EXTRA show similar linear convergence in the first 100 iterations. Then DGD with the fixed step size begins to slow down and eventually stalls, while EXTRA continues its progress.

Fig. 1. Plot of residuals $\frac{\|x^k-x^*\|_F}{\|x^0-x^*\|_F}$. Constant $\alpha = 0.576$ is the theoretical critical step size given for DGD in [36]. For DGD with diminishing step sizes $O(1/k^{1/3})$ and $O(1/k^{1/2})$, we have hand-optimized their initial step sizes as $3\alpha$ and $5\alpha$, respectively.

4.2. Decentralized robust least squares. Consider the same decentralized sensing setting and network as in section 4.1. In this experiment, we use the Huber loss, which is known to be robust to outliers and allows us to observe both sublinear and linear convergence. We call the problem decentralized robust least squares:

minimize_x  $f(x) = \frac{1}{n}\sum_{i=1}^n \sum_{j=1}^{m_i} H_\xi(M_{(i)j}x - y_{(i)j})$,

where $M_{(i)j}$ is the $j$th row of matrix $M_{(i)}$ and $y_{(i)j}$ is the $j$th entry of vector $y_{(i)}$. The Huber loss function $H_\xi$ is defined as

$H_\xi(a) = \begin{cases} \frac{1}{2}a^2 & \text{for } |a| \le \xi \ (\ell_2 \text{ zone}), \\ \xi\left(|a| - \frac{1}{2}\xi\right) & \text{otherwise} \ (\ell_1 \text{ zone}). \end{cases}$

We set $\xi = 2$. The optimal solution $x^*$ is artificially set in the $\ell_2$ zone, while $x_{(i)}^0$ is set in the $\ell_1$ zone at all agents $i$. Except for new hand-optimized initial step sizes for DGD's diminishing step sizes, all other algorithmic parameters remain unchanged from the last test. The numerical results are illustrated in Figure 2. EXTRA shows sublinear convergence for the first 1000 iterations and then begins linear convergence, as $x_{(i)}^k$ for most $i$ enter the $\ell_2$ zone.

4.3. Decentralized logistic regression. Consider the decentralized logistic regression problem

minimize_x  $f(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{m_i}\sum_{j=1}^{m_i} \ln\left(1 + \exp\left(-(M_{(i)j}x)\,y_{(i)j}\right)\right)$,

where every agent $i$ holds its training data $(M_{(i)j}, y_{(i)j}) \in R^p \times \{-1, +1\}$, $j = 1,\dots,m_i$, including explanatory/feature variables $M_{(i)j}$ and binary outputs/outcomes $y_{(i)j}$. To simplify the notation, we set the last entry of every $M_{(i)j}$ to 1; thus, the last entry of $x$ yields the offset parameter of the logistic regression model.
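The recursion driving all of these experiments, $x^{k+2} = (I+W)x^{k+1} - \tilde W x^k - \alpha[\nabla f(x^{k+1}) - \nabla f(x^k)]$ with the first step $x^1 = Wx^0 - \alpha\nabla f(x^0)$, is easy to simulate in centralized form; each row of $x$ is one agent's local variable, and in a true deployment the products $Wx$ and $\tilde Wx$ reduce to neighbor communications. The sketch below runs EXTRA with $\tilde W = (I+W)/2$ on a toy quadratic instance $f_i(x) = \frac{1}{2}\|x - y_i\|^2$; the ring network, the $1/3$ Metropolis-style weights, and all problem sizes are our own illustrative assumptions, not the paper's experimental settings:

```python
import numpy as np

def extra(grad, x0, W, alpha, iters):
    """EXTRA with W_tilde = (I + W)/2; x0 holds one agent's variable per row."""
    n = W.shape[0]
    I = np.eye(n)
    Wt = (I + W) / 2
    x_prev, x = x0, W @ x0 - alpha * grad(x0)     # x^1 = W x^0 - alpha * grad f(x^0)
    for _ in range(iters - 1):
        x_next = (I + W) @ x - Wt @ x_prev - alpha * (grad(x) - grad(x_prev))
        x_prev, x = x, x_next
    return x

# Toy instance: f_i(x) = 0.5 * ||x - y_i||^2, whose consensus minimizer is mean(y).
n, p = 5, 2
rng = np.random.default_rng(1)
y = rng.standard_normal((n, p))
grad = lambda x: x - y                            # row i is grad f_i at agent i's variable

# Symmetric doubly stochastic mixing matrix for a ring (1/3 to self and each neighbor).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

x = extra(grad, np.zeros((n, p)), W, alpha=0.2, iters=3000)
# With a fixed step size, every agent's row converges to the exact minimizer mean(y).
```

The fixed step size here is well inside the theoretical range for this strongly convex instance ($L_f = 1$); in contrast, running plain DGD with the same fixed step would stall at an inexact point, which is the behavior compared in Figures 1-3.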

Fig. 2. Plot of residuals $\frac{\|x^k-x^*\|_F}{\|x^0-x^*\|_F}$. Constant $\alpha = 0.576$ is the theoretical critical step size given for DGD in [36]. For DGD with diminishing step sizes $O(1/k^{1/3})$ and $O(1/k^{1/2})$, we have hand-optimized their initial step sizes as $10\alpha$ and $20\alpha$, respectively. The initial large step sizes have helped them (the red and purple curves) realize faster convergence initially.

Fig. 3. Plot of residuals $\frac{\|x^k-x^*\|_F}{\|x^0-x^*\|_F}$. Constant $\alpha$ is the theoretical critical step size given for DGD in [36]. For DGD with diminishing step sizes $O(1/k^{1/3})$ and $O(1/k^{1/2})$, we have hand-optimized their initial step sizes as $10\alpha$ and $20\alpha$, respectively.

We show a decentralized logistic regression problem solved by DGD and EXTRA over a medium-scale network. The settings are as follows. The connected network is randomly generated with $n = 200$ agents and connectivity ratio $r = 0.2$. Each agent holds 10 samples, i.e., $m_i = 10$ for all $i$. The agents shall collaboratively obtain $p = 20$ coefficients via logistic regression. All the 2000 samples are randomly generated, and the reference (ground-truth) logistic classifier $x^*$ is precomputed with a centralized method. As it is easy to implement in practice, we use the Metropolis constant edge weight matrix $W$, which is mentioned as strategy (iii) in section 2.4, with $\epsilon = 1$, and we use $\tilde W = \frac{I+W}{2}$. The numerical results are illustrated in Figure 3. EXTRA outperforms DGD, showing linear and exact convergence to the reference logistic classifier $x^*$.

5. Conclusion. As one of the fundamental methods, gradient descent has been adapted to decentralized optimization, giving rise to simple and elegant iterations. In this paper, we attempted to address a dilemma, or deficiency, of the current DGD method: to obtain an accurate solution, it works slowly, as it must use a small step size or iteratively diminish the step size; a large step size leads to faster convergence, however, to an inaccurate solution. Our solution is an exact first-order algorithm, EXTRA, which uses a fixed large step size and quickly returns an accurate solution. The claim is supported by both theoretical convergence results and preliminary numerical results. On the other hand, EXTRA is far from perfect, and more work is needed to adapt it to asynchronous and dynamic network settings. These are interesting open questions for future work.

Appendix A. Proof of Proposition 3.6.

Proof. (ii) => (i): By the definition of restricted strong convexity, there exists $\mu_g > 0$ so that for any $x$,

(A.1)  $\mu_g\|x - x^*\|_F^2 \le \langle \nabla g(x) - \nabla g(x^*),\ x - x^* \rangle = \langle \nabla f(x) - \nabla f(x^*),\ x - x^* \rangle + \frac{1}{2\alpha}\|x - x^*\|_{\tilde W - W}^2$.

For any $x \in R^p$, set $x = 1x^T$, which is consensual so that $\|x - x^*\|_{\tilde W - W}^2 = 0$, and from the above inequality, we get

(A.2)  $n\mu_g\|x - x^*\|^2 \le \sum_{i=1}^n \langle \nabla f_i(x) - \nabla f_i(x^*),\ x - x^* \rangle$.

Therefore, $f(x)$ is restricted strongly convex with a constant $\mu_f \ge \mu_g$.

(i) => (ii): For any $x \in R^{n \times p}$, decompose $x = u + v$ so that every column of $u$ belongs to $\mathrm{span}\{1\}$ (i.e., $u$ is consensual), while every column of $v$ belongs to the orthogonal complement of $\mathrm{span}\{1\}$. Such an orthogonal decomposition obviously satisfies $\|x\|_F^2 = \|u\|_F^2 + \|v\|_F^2$. Since the solution $x^*$ is consensual and thus $\langle u - x^*, v \rangle = 0$, we also have $\|x - x^*\|_F^2 = \|u - x^*\|_F^2 + \|v\|_F^2$. In addition, being consensual, $u = 1u^T$ for some $u \in R^p$. From the inequalities

$\langle \nabla f(u) - \nabla f(x^*),\ u - x^* \rangle = \sum_{i=1}^n \langle \nabla f_i(u) - \nabla f_i(x^*),\ u - x^* \rangle \ge n\mu_f\|u - x^*\|^2 = \mu_f\|u - x^*\|_F^2$,
$\langle \nabla f(x) - \nabla f(u),\ x - u \rangle \ge 0$,
$\langle \nabla f(u) - \nabla f(x^*),\ x - u \rangle \ge -L_f\|u - x^*\|_F\|v\|_F$,
$\langle \nabla f(x) - \nabla f(u),\ u - x^* \rangle \ge -L_f\|v\|_F\|u - x^*\|_F$,

we get

(A.3)  $\langle \nabla f(x) - \nabla f(x^*),\ x - x^* \rangle = \langle \nabla f(u) - \nabla f(x^*),\ u - x^* \rangle + \langle \nabla f(x) - \nabla f(u),\ x - u \rangle + \langle \nabla f(u) - \nabla f(x^*),\ x - u \rangle + \langle \nabla f(x) - \nabla f(u),\ u - x^* \rangle \ge \mu_f\|u - x^*\|_F^2 - 2L_f\|u - x^*\|_F\|v\|_F$.

In addition, from the fact that $u - x^* \in \mathrm{null}\{\tilde W - W\}$ and $v \in \mathrm{span}\{\tilde W - W\}$, it follows that

(A.4)  $\frac{1}{2\alpha}\|x - x^*\|_{\tilde W - W}^2 = \frac{1}{2\alpha}\|v\|_{\tilde W - W}^2 \ge \frac{\lambda_{\min}(\tilde W - W)}{2\alpha}\|v\|_F^2$,

where $\lambda_{\min}(\cdot)$ gives the smallest nonzero eigenvalue of a positive semidefinite matrix.
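The orthogonal decomposition used in this direction of the proof, splitting $x$ into a consensual part $u$ (each column projected onto $\mathrm{span}\{1\}$) and a remainder $v$, can be checked numerically. The sketch below verifies its three claimed properties; the $6 \times 3$ sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
x = rng.standard_normal((n, p))

ones = np.ones((n, 1))
u = ones @ (ones.T @ x) / n    # consensual part: every row of u is the column mean of x
v = x - u                      # remainder, columnwise orthogonal to span{1}

# v is annihilated columnwise by 1^T, and u is consensual (identical rows).
assert np.allclose(ones.T @ v, 0)
assert np.allclose(u, np.tile(u[0], (n, 1)))
# The decomposition is orthogonal: ||x||_F^2 = ||u||_F^2 + ||v||_F^2.
assert np.isclose(np.linalg.norm(x)**2,
                  np.linalg.norm(u)**2 + np.linalg.norm(v)**2)
```

Because any consensual $x^*$ lies in the same subspace as $u$, the identity $\|x - x^*\|_F^2 = \|u - x^*\|_F^2 + \|v\|_F^2$ used above follows from the same Pythagorean argument.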

Pick any $\gamma > 0$. When $\|v\|_F \le \gamma\|u - x^*\|_F$, it follows that

(A.5)  $\langle \nabla g(x) - \nabla g(x^*),\ x - x^* \rangle = \langle \nabla f(x) - \nabla f(x^*),\ x - x^* \rangle + \frac{1}{2\alpha}\|x - x^*\|_{\tilde W - W}^2 \ge \mu_f\|u - x^*\|_F^2 - 2L_f\|u - x^*\|_F\|v\|_F + \frac{\lambda_{\min}(\tilde W - W)}{2\alpha}\|v\|_F^2 \ge (\mu_f - 2L_f\gamma)\|u - x^*\|_F^2 + \frac{\lambda_{\min}(\tilde W - W)}{2\alpha}\|v\|_F^2 \ge \min\left\{\mu_f - 2L_f\gamma,\ \frac{\lambda_{\min}(\tilde W - W)}{2\alpha}\right\}\|x - x^*\|_F^2$  (by (A.3) and (A.4)).

When $\|v\|_F \ge \gamma\|u - x^*\|_F$, we have $\|x - x^*\|_F^2 \le (1 + \gamma^{-2})\|v\|_F^2$, and it follows that

(A.6)  $\langle \nabla g(x) - \nabla g(x^*),\ x - x^* \rangle = \langle \nabla f(x) - \nabla f(x^*),\ x - x^* \rangle + \frac{1}{2\alpha}\|x - x^*\|_{\tilde W - W}^2 \ge 0 + \frac{\lambda_{\min}(\tilde W - W)}{2\alpha}\|v\|_F^2 \ge \frac{\lambda_{\min}(\tilde W - W)}{2\alpha(1 + \gamma^{-2})}\|x - x^*\|_F^2$  (applying the convexity of $f$ and (A.4)).

Finally, in all conditions,

(A.7)  $\langle \nabla g(x) - \nabla g(x^*),\ x - x^* \rangle \ge \min\left\{\mu_f - 2L_f\gamma,\ \frac{\lambda_{\min}(\tilde W - W)}{2\alpha(1 + \gamma^{-2})}\right\}\|x - x^*\|_F^2 \triangleq \mu_g\|x - x^*\|_F^2$.

By, for example, setting $\gamma = \frac{\mu_f}{4L_f}$, we have $\mu_g > 0$. Hence, function $g$ is restricted strongly convex for any $\alpha > 0$ as long as function $f$ is restricted strongly convex.

In the direction (ii) <= (i), we find $\mu_g < \mu_f$, unlike the more pleasant $\mu_f \ge \mu_g$ in the other direction. However, from (A.7), we have

$\sup_{\gamma,\alpha} \mu_g = \lim_{\gamma \to 0^+,\ \alpha = \frac{\lambda_{\min}(\tilde W - W)}{2(1+\gamma^{-2})(\mu_f - 2L_f\gamma)}} \mu_g = \mu_f$,

which means that $\mu_g$ can be arbitrarily close to $\mu_f$ as $\alpha$ goes to zero. On the other hand, just to have $O(\mu_g) = O(\mu_f)$, we can set $\gamma = O\left(\frac{\mu_f}{L_f}\right)$ and $\alpha = \frac{\lambda_{\min}(\tilde W - W)}{2(1+\gamma^{-2})(\mu_f - 2L_f\gamma)} = O\left(\frac{\mu_f}{L_f^2}\right) = O\left(\frac{\mu_g}{L_f^2}\right)$. This order of $\alpha$ coincides, in terms of order of magnitude, with the critical step size for ensuring the linear convergence.

REFERENCES

[1] J. Bazerque and G. Giannakis, Distributed spectrum sensing for cognitive radio networks by exploiting sparsity, IEEE Trans. Signal Process., 58 (2010).
[2] J. Bazerque, G. Mateos, and G. Giannakis, Group-Lasso on splines for spectrum cartography, IEEE Trans. Signal Process., 59 (2011).

[3] S. Boyd, P. Diaconis, and L. Xiao, Fastest mixing Markov chain on a graph, SIAM Rev., 46 (2004).
[4] T. Chang, M. Hong, and X. Wang, Multi-agent distributed optimization via inexact consensus ADMM, IEEE Trans. Signal Process., 63 (2015).
[5] I. Chen, Fast Distributed First-Order Methods, Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 2012.
[6] D. Davis and W. Yin, Convergence Rate Analysis of Several Splitting Schemes, preprint, arXiv, 2014.
[7] A. Dimakis, S. Kar, J. Moura, M. Rabbat, and A. Scaglione, Gossip algorithms for distributed signal processing, Proc. IEEE, 98 (2010).
[8] J. Duchi, A. Agarwal, and M. Wainwright, Dual averaging for distributed optimization: Convergence analysis and network scaling, IEEE Trans. Automat. Control, 57 (2012).
[9] P. Forero, A. Cano, and G. Giannakis, Consensus-based distributed support vector machines, J. Mach. Learn. Res., 11 (2010).
[10] L. Gan, U. Topcu, and S. Low, Optimal decentralized protocol for electric vehicle charging, IEEE Trans. Power Systems, 28 (2013).
[11] B. He, A new method for a class of linear variational inequalities, Math. Program., 66 (1994).
[12] F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem, Explicit Convergence Rate of a Distributed Alternating Direction Method of Multipliers, preprint, arXiv:1312.1085, 2013.
[13] D. Jakovetic, J. Moura, and J. Xavier, Linear convergence rate of a class of distributed augmented Lagrangian algorithms, IEEE Trans. Automat. Control, 60 (2015).
[14] D. Jakovetic, J. Xavier, and J. Moura, Fast distributed gradient methods, IEEE Trans. Automat. Control, 59 (2014).
[15] B. Johansson, On Distributed Optimization in Networked Systems, Ph.D. thesis, KTH, Stockholm, Sweden, 2008.
[16] V. Kekatos and G. Giannakis, Distributed robust power system state estimation, IEEE Trans. Power Systems, 28 (2013).
[17] M.-J. Lai and W. Yin, Augmented l1 and nuclear-norm models with a globally linearly convergent algorithm, SIAM J. Imaging Sci., 6 (2013).
[18] Q. Ling and Z. Tian, Decentralized sparse signal recovery for compressive sleeping wireless sensor networks, IEEE Trans. Signal Process., 58 (2010).
[19] Q. Ling, Z. Wen, and W. Yin, Decentralized jointly sparse optimization by reweighted lq minimization, IEEE Trans. Signal Process., 61 (2013).
[20] Q. Ling, Y. Xu, W. Yin, and Z. Wen, Decentralized low-rank matrix completion, in Proceedings of the 37th IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, Piscataway, NJ, 2012.
[21] I. Matei and J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies, IEEE J. Sel. Top. Signal Process., 5 (2011).
[22] G. Mateos, J. Bazerque, and G. Giannakis, Distributed sparse linear regression, IEEE Trans. Signal Process., 58 (2010).
[23] A. Nedic and A. Olshevsky, Distributed optimization over time-varying directed graphs, in The 52nd IEEE Annual Conference on Decision and Control, IEEE, Piscataway, NJ, 2013.
[24] A. Nedic and A. Olshevsky, Stochastic Gradient-Push for Strongly Convex Functions on Time-Varying Directed Graphs, preprint, arXiv, 2014.
[25] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Automat. Control, 54 (2009).
[26] J. Predd, S. Kulkarni, and H. Poor, A collaborative training algorithm for distributed learning, IEEE Trans. Inform. Theory, 55 (2009).
[27] S. Ram, A. Nedic, and V. Veeravalli, Distributed stochastic subgradient projection algorithms for convex optimization, J. Optim. Theory Appl., 147 (2010).
[28] A. Sayed, Diffusion adaptation over networks, in Academic Press Library in Signal Processing, Vol. 3, R. Chellappa and S. Theodoridis, eds., 2013.
[29] I. Schizas, A. Ribeiro, and G. Giannakis, Consensus in ad hoc WSNs with noisy links - Part I: Distributed estimation of deterministic signals, IEEE Trans. Signal Process., 56 (2008).
[30] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, On the linear convergence of the ADMM in decentralized consensus optimization, IEEE Trans. Signal Process., 62 (2014).

[31] J. Tsitsiklis, Problems in Decentralized Decision Making and Computation, Ph.D. thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 1984.
[32] E. Wei and A. Ozdaglar, On the O(1/k) Convergence of Asynchronous Distributed Alternating Direction Method of Multipliers, preprint, arXiv, 2013.
[33] L. Xiao and S. Boyd, Fast linear iterations for distributed averaging, Systems Control Lett., 53 (2004).
[34] L. Xiao, S. Boyd, and S. Kim, Distributed average consensus with least-mean-square deviation, J. Parallel Distrib. Comput., 67 (2007).
[35] K. Yuan, Q. Ling, A. Ribeiro, and W. Yin, A linearized Bregman algorithm for decentralized basis pursuit, in Proceedings of the 21st European Signal Processing Conference, IEEE, Piscataway, NJ, 2013.
[36] K. Yuan, Q. Ling, and W. Yin, On the Convergence of Decentralized Gradient Descent, preprint, arXiv:1310.7063, 2013.
[37] M. Zhu and S. Martinez, On distributed convex optimization under inequality and equality constraints, IEEE Trans. Automat. Control, 57 (2012).

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)

Network Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China) Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu

More information

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Decentralized Quadratically Approimated Alternating Direction Method of Multipliers Aryan Mokhtari Wei Shi Qing Ling Alejandro Ribeiro Department of Electrical and Systems Engineering, University of Pennsylvania

More information

Decentralized Consensus Optimization with Asynchrony and Delay

Decentralized Consensus Optimization with Asynchrony and Delay Decentralized Consensus Optimization with Asynchrony and Delay Tianyu Wu, Kun Yuan 2, Qing Ling 3, Wotao Yin, and Ali H. Sayed 2 Department of Mathematics, 2 Department of Electrical Engineering, University

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

DLM: Decentralized Linearized Alternating Direction Method of Multipliers

DLM: Decentralized Linearized Alternating Direction Method of Multipliers 1 DLM: Decentralized Linearized Alternating Direction Method of Multipliers Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro Abstract This paper develops the Decentralized Linearized Alternating Direction

More information

On the linear convergence of distributed optimization over directed graphs

On the linear convergence of distributed optimization over directed graphs 1 On the linear convergence of distributed optimization over directed graphs Chenguang Xi, and Usman A. Khan arxiv:1510.0149v4 [math.oc] 7 May 016 Abstract This paper develops a fast distributed algorithm,

More information

Distributed Consensus Optimization

Distributed Consensus Optimization Distributed Consensus Optimization Ming Yan Michigan State University, CMSE/Mathematics September 14, 2018 Decentralized-1 Backgroundwhy andwe motivation need decentralized optimization? I Decentralized

More information

DECENTRALIZED algorithms are used to solve optimization

DECENTRALIZED algorithms are used to solve optimization 5158 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 64, NO. 19, OCTOBER 1, 016 DQM: Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Aryan Mohtari, Wei Shi, Qing Ling,

More information

arxiv: v3 [math.oc] 1 Jul 2015

arxiv: v3 [math.oc] 1 Jul 2015 On the Convergence of Decentralized Gradient Descent Kun Yuan Qing Ling Wotao Yin arxiv:1310.7063v3 [math.oc] 1 Jul 015 Abstract Consider the consensus problem of minimizing f(x) = n fi(x), where x Rp

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Distributed Smooth and Strongly Convex Optimization with Inexact Dual Methods

Distributed Smooth and Strongly Convex Optimization with Inexact Dual Methods Distributed Smooth and Strongly Convex Optimization with Inexact Dual Methods Mahyar Fazlyab, Santiago Paternain, Alejandro Ribeiro and Victor M. Preciado Abstract In this paper, we consider a class of

More information

On the Linear Convergence of Distributed Optimization over Directed Graphs

On the Linear Convergence of Distributed Optimization over Directed Graphs 1 On the Linear Convergence of Distributed Optimization over Directed Graphs Chenguang Xi, and Usman A. Khan arxiv:1510.0149v1 [math.oc] 7 Oct 015 Abstract This paper develops a fast distributed algorithm,

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

A Distributed Newton Method for Network Utility Maximization, I: Algorithm

A Distributed Newton Method for Network Utility Maximization, I: Algorithm A Distributed Newton Method for Networ Utility Maximization, I: Algorithm Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract Most existing wors use dual decomposition and first-order

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information

Fast Linear Iterations for Distributed Averaging 1

Fast Linear Iterations for Distributed Averaging 1 Fast Linear Iterations for Distributed Averaging 1 Lin Xiao Stephen Boyd Information Systems Laboratory, Stanford University Stanford, CA 943-91 lxiao@stanford.edu, boyd@stanford.edu Abstract We consider

More information

A Distributed Newton Method for Network Utility Maximization

A Distributed Newton Method for Network Utility Maximization A Distributed Newton Method for Networ Utility Maximization Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie Abstract Most existing wor uses dual decomposition and subgradient methods to solve Networ Utility

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Fast Angular Synchronization for Phase Retrieval via Incomplete Information

Fast Angular Synchronization for Phase Retrieval via Incomplete Information Fast Angular Synchronization for Phase Retrieval via Incomplete Information Aditya Viswanathan a and Mark Iwen b a Department of Mathematics, Michigan State University; b Department of Mathematics & Department

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

ADD-OPT: Accelerated Distributed Directed Optimization

ADD-OPT: Accelerated Distributed Directed Optimization 1 ADD-OPT: Accelerated Distributed Directed Optimization Chenguang Xi, Student Member, IEEE, and Usman A. Khan, Senior Member, IEEE Abstract This paper considers distributed optimization problems where

More information

Distributed Optimization over Networks Gossip-Based Algorithms

Distributed Optimization over Networks Gossip-Based Algorithms Distributed Optimization over Networks Gossip-Based Algorithms Angelia Nedić angelia@illinois.edu ISE Department and Coordinated Science Laboratory University of Illinois at Urbana-Champaign Outline Random

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić August 15, 2015 Abstract We consider distributed optimization problems

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić February 7, 2017 Abstract We consider distributed optimization

More information

arxiv: v3 [math.oc] 8 Jan 2019

arxiv: v3 [math.oc] 8 Jan 2019 Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate

More information

A Distributed Newton Method for Network Utility Maximization, II: Convergence

A Distributed Newton Method for Network Utility Maximization, II: Convergence A Distributed Newton Method for Network Utility Maximization, II: Convergence Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract The existing distributed algorithms for Network Utility

More information

Fastest Distributed Consensus Averaging Problem on. Chain of Rhombus Networks

Fastest Distributed Consensus Averaging Problem on. Chain of Rhombus Networks Fastest Distributed Consensus Averaging Problem on Chain of Rhombus Networks Saber Jafarizadeh Department of Electrical Engineering Sharif University of Technology, Azadi Ave, Tehran, Iran Email: jafarizadeh@ee.sharif.edu

More information

Consensus Optimization with Delayed and Stochastic Gradients on Decentralized Networks

Consensus Optimization with Delayed and Stochastic Gradients on Decentralized Networks 06 IEEE International Conference on Big Data (Big Data) Consensus Optimization with Delayed and Stochastic Gradients on Decentralized Networks Benjamin Sirb Xiaojing Ye Department of Mathematics and Statistics

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in

More information

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong. 2014 Workshop

ECE 8201: Low-dimensional Signal Models for High-dimensional Data Analysis

Lecture 7: Matrix completion. Yuejie Chi, The Ohio State University. Reference: Guaranteed Minimum-Rank Solutions of Linear

8.1 Concentration inequality for Gaussian random matrix (cont'd)

MGMT 69: Topics in High-dimensional Data Analysis, Fall 2016. Lecture 8: Spectral clustering and Laplacian matrices. Lecturer: Jiaming Xu. Scribes: Hyun-Ju Oh and Taotao He, October 4, 2016. Outline: Concentration

Sparse Gaussian conditional random fields

Matt Wytock, J. Zico Kolter. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. {mwytock, zkolter}@cs.cmu.edu. Abstract: We propose sparse Gaussian

Distributed Optimization over Random Networks

Ilan Lobel and Asu Ozdaglar. Allerton Conference, September 2008. Operations Research Center and Electrical Engineering & Computer Science, Massachusetts Institute

Big Data Analytics: Optimization and Randomization

Tianbao Yang. Tutorial@ACML 2015, Hong Kong. Department of Computer Science, The University of Iowa, IA, USA. Nov. 20, 2015

Convergence rates for distributed stochastic optimization over random networks

Dusan Jakovetic, Dragana Bajovic, Anit Kumar Sahu and Soummya Kar. Abstract: We establish the O(1/k) convergence rate for distributed

A DIAGONAL-AUGMENTED QUASI-NEWTON METHOD WITH APPLICATION TO FACTORIZATION MACHINES

Aryan Mokhtari and Amir Ingber. Department of Electrical and Systems Engineering, University of Pennsylvania, PA, USA. Big-data

On Optimal Frame Conditioners

Chae A. Clark, Department of Mathematics, University of Maryland, College Park. Email: cclark18@math.umd.edu. Kasso A. Okoudjou, Department of Mathematics, University of Maryland,

Discrete-time Consensus Filters on Directed Switching Graphs

2014 11th IEEE International Conference on Control & Automation (ICCA), June 18-20, 2014, Taichung, Taiwan. Shuai Li and Yi Guo. Abstract: We consider

Master 2 MathBigData, CMAP - Ecole Polytechnique

S. Gaïffas, CMAP - Ecole Polytechnique, November 3, 2014. 1 Supervised learning recap: introduction; loss functions, linearity. 2 Penalization: introduction; ridge; sparsity; lasso. 3 Some

Convex Optimization of Graph Laplacian Eigenvalues

Stephen Boyd. Abstract: We consider the problem of choosing the edge weights of an undirected graph so as to maximize or minimize some function of the

Multi-Robotic Systems

CHAPTER 9: Multi-Robotic Systems. The topic of multi-robotic systems is quite popular now. It is believed that such systems can have the following benefits: improved performance ( winning by numbers ), distributed

Resilient Distributed Optimization Algorithm against Adversary Attacks

2017 13th IEEE International Conference on Control & Automation (ICCA), July 3-6, 2017, Ohrid, Macedonia. Chengcheng Zhao, Jianping He

Math 273a: Optimization Subgradients of convex functions

Made by Damek Davis, edited by Wotao Yin. Department of Mathematics, UCLA, Fall 2015. Online discussions on piazza.com. Subgradients. Assumptions

Distributed Estimation and Detection for Smart Grid

Texas A&M University. Joint work with: S. Kar (CMU), R. Tandon (Princeton), H. V. Poor (Princeton), and J. M. F. Moura (CMU). Distributed Estimation/Detection

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS

HAO YU AND MICHAEL J. NEELY. Abstract: This paper considers convex programs with a general (possibly non-differentiable)

Randomized Gossip Algorithms

IEEE Transactions on Information Theory, vol. 52, no. 6, June 2006. Stephen Boyd, Fellow, IEEE, Arpita Ghosh, Student Member, IEEE, Balaji Prabhakar, Member, IEEE, and Devavrat
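
The pairwise averaging protocol analyzed in work like the above can be simulated in a few lines (an illustrative sketch, not the authors' code; the ring topology and initial values below are made up): at each step a random edge is picked and both endpoints replace their values by the average, so the network average is preserved while the values contract toward consensus.

```python
import random

def gossip_average(values, edges, steps=2000, seed=0):
    """Pairwise randomized gossip: repeatedly pick a random edge and
    replace both endpoint values by their average."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(steps):
        i, j = rng.choice(edges)
        avg = (x[i] + x[j]) / 2.0
        x[i] = x[j] = avg
    return x

# Ring of 5 nodes with all the "mass" initially at node 0.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
x = gossip_average([10.0, 0.0, 0.0, 0.0, 0.0], edges)
# Every entry approaches the global average 2.0; the sum stays 10.
```

The convergence speed depends on the spectral gap of the underlying graph, which is the quantity such papers bound.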

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani, Convex Optimization 10-725/36-725

Recall conjugate functions: given f : R^n -> R, the function f*(y) = max_{x in R^n} y^T x - f(x) is called its conjugate.

Consensus-Based Distributed Optimization with Malicious Nodes

Shreyas Sundaram, Bahman Gharesifard. Abstract: We investigate the vulnerabilities of consensus-based distributed optimization protocols to nodes

Prediction-based adaptive control of a class of discrete-time nonlinear systems with nonlinear growth rate

www.scichina.com info.scichina.com www.springerlink.com. WEI Chen & CHEN ZongJi, School of Automation

Diffusion based Projection Method for Distributed Source Localization in Wireless Sensor Networks

The Third International Workshop on Wireless Sensor, Actuator and Robot Networks. Wei Meng, Wendong Xiao,

10-725/36-725: Convex Optimization Prerequisite Topics

February 3, 2015. This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the

17 Solution of Nonlinear Systems

We now discuss the solution of systems of nonlinear equations. An important ingredient will be the multivariate Taylor theorem. Theorem 17.1: Let D = {x 1, x 2,..., x m

Optimal algorithms for smooth and strongly convex distributed optimization in networks

arXiv:1702.08704v2 [math.OC], 7 Apr 2017. Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, Laurent Massoulié

Shiqian Ma, MAT-258A: Numerical Optimization. Chapter 9: Alternating Direction Method of Multipliers

Separable convex optimization: a special case is min f(x)

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS

1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x in R^n, where f : R^n -> R. We assume

Coordinate Update Algorithm Short Course: Proximal Operators and Algorithms

Instructor: Wotao Yin (UCLA Math), Summer 2016. Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow
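
The best-known proximal operator, soft thresholding for the l1 norm, gives a feel for the objects such a course covers (a standard textbook example, not code from the slides): prox of lam*|x| at v is sign(v) * max(|v| - lam, 0).

```python
def prox_l1(v, lam):
    """Proximal operator of lam*|x|: argmin_x lam*|x| + 0.5*(x - v)^2,
    with the closed-form soft-threshold solution sign(v)*max(|v| - lam, 0)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

# Shrinks large inputs toward zero and zeroes out small ones.
examples = [prox_l1(v, 1.0) for v in (3.0, 0.5, -2.0)]
# examples == [2.0, 0.0, -1.0]
```

This shrink-or-kill behavior is what makes proximal algorithms natural for sparse regularization.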

Distributed Computation of Quantiles via ADMM

Franck Iutzeler. Abstract: In this paper, we derive distributed synchronous and asynchronous algorithms for computing quantiles of the agents local values.

MATH 205C: STATIONARY PHASE LEMMA

For omega, consider an integral of the form I(omega) = int e^{i omega f(x)} u(x) dx, where u in C_c^inf(R^n) is complex valued with support in a compact set K, and f in C^inf(R^n) is real valued. Thus, I(omega)

A Distributed Newton Method for Network Optimization

Ali Jadbabaie, Asuman Ozdaglar, and Michael Zargham. Abstract: Most existing work uses dual decomposition and subgradient methods to solve network optimization

Dual Ascent. Ryan Tibshirani, Convex Optimization 10-725

Last time: coordinate descent. Consider the problem min_x f(x) where f(x) = g(x) + sum_i h_i(x_i), with g convex and differentiable and each h_i convex. Coordinate

New Coherence and RIP Analysis for Weak Orthogonal Matching Pursuit

Mingrui Yang, Member, IEEE, and Frank de Hoog. arXiv:1405.3354v1 [cs.IT], 14 May 2014. Abstract: In this paper we define a new coherence

3.10 Lagrangian relaxation

Consider a generic ILP problem min {c^T x : Ax >= b, Dx >= d, x in Z^n} with integer coefficients. Suppose Dx >= d are the complicating constraints. Often the linear relaxation and the

EUSIPCO 2013

FULLY DISTRIBUTED SIGNAL DETECTION: APPLICATION TO COGNITIVE RADIO. Franck Iutzeler, Philippe Ciblat. Telecom ParisTech, 46 rue Barrault, Paris, France.

Agreement algorithms for synchronization of clocks in nodes of stochastic networks

UDC 519.248: 62 192. L. Manita, A. Manita. National Research University Higher School of Economics, Moscow Institute of

Complex Laplacians and Applications in Multi-Agent Systems

Jiu-Gang Dong and Li Qiu, Fellow, IEEE. arXiv:1406.186v [math.OC], 14 Apr 2015. Abstract: Complex-valued Laplacians have been shown to be powerful

Optimality Conditions for Constrained Optimization

CHAPTER 7: Optimality Conditions for Constrained Optimization. 1. First Order Conditions. In this section we consider first order optimality conditions for the constrained problem P: minimize f_0(x)

Optimal Scaling of a Gradient Method for Distributed Resource Allocation

JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS, Vol. 129, No. 3, pp. 469-488, June 2006. DOI: 10.1007/s10957-006-9080-1

Math 273a: Optimization Overview of First-Order Optimization Algorithms

Wotao Yin, Department of Mathematics, UCLA. Online discussions on piazza.com. Typical flow of numerical optimization

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING

YANGYANG XU. Abstract: Motivated by big data applications, first-order methods have been extremely

You should be able to...

Lecture Outline: Gradient Projection Algorithm; Constant Step Length, Varying Step Length, Diminishing Step Length; Complexity Issues; Gradient Projection With Exploration; Projection; Solving QPs: active set

Optimization for Compressed Sensing

Robert J. Vanderbei, 2014 March 21. Dept. of Industrial & Systems Engineering, University of Florida. http://www.princeton.edu/~rvdb. Lasso Regression: The problem is to solve

Learning with stochastic proximal gradient

Lorenzo Rosasco, DIBRIS, Università di Genova, Via Dodecaneso 35, 16146 Genova, Italy. lrosasco@mit.edu. Silvia Villa, Băng Công Vũ, Laboratory for Computational and

Distributed online optimization over jointly connected digraphs

David Mateos-Núñez, Jorge Cortés. University of California, San Diego. {dmateosn,cortes}@ucsd.edu. Southern California Optimization Day, UC San

7 Appendix: Proof of Theorem

Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

Solution-recovery in l 1 -norm for non-square linear systems: deterministic conditions and open questions

Yin Zhang. Technical Report TR05-06, Department of Computational and Applied Mathematics, Rice University,

Alternating Direction Method of Multipliers. Ryan Tibshirani, Convex Optimization 10-725

Last time: dual ascent. Consider the problem min_x f(x) subject to Ax = b, where f is strictly convex and closed. Denote

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

2014 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 21-24, 2014, REIMS, FRANCE. Hamid Reza Feyzmahdavian, Arda Aytekin,

Decentralized Quasi-Newton Methods

Mark Eisen, Aryan Mokhtari, and Alejandro Ribeiro. Abstract: We introduce the decentralized Broyden-Fletcher-Goldfarb-Shanno (D-BFGS) method as a variation of the BFGS

Distributed Coordinated Tracking With Reduced Interaction via a Variable Structure Approach. Yongcan Cao, Member, IEEE, and Wei Ren, Member, IEEE

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 57, NO. 1, JANUARY 2012. Yongcan Cao, Member, IEEE, and Wei Ren,

Constrained Consensus and Optimization in Multi-Agent Networks

The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

Minimum Repair Bandwidth for Exact Regeneration in Distributed Storage

Viveck R. Cadambe, Syed A. Jafar, Hamed Maleki. Electrical Engineering and Computer Science, University of California Irvine, Irvine, California,

Iteration-complexity of first-order penalty methods for convex programming

Guanghui Lan, Renato D.C. Monteiro. July 24, 2008. Abstract: This paper considers a special but broad class of convex programming (CP)

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Bingsheng He, Li-Zhi Liao, Xiang Wang. Department of Mathematics, Nanjing University, Nanjing, 210093, China

Comparison of Modern Stochastic Optimization Algorithms

George Papamakarios, December 2014. Abstract: Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms

Tom Bylander. Division of Computer Science, The University of Texas at San Antonio, San Antonio, Texas. bylander@cs.utsa.edu

Research Article: Solving the Matrix Nearness Problem in the Maximum Norm by Applying a Projection and Contraction Method

Advances in Operations Research, Volume 01, Article ID 357954, 15 pages. doi:10.1155/01/357954

Coordinate Descent and Ascent Methods

Julie Nutini. Machine Learning Reading Group, November 3rd, 2015. Projected-Gradient Methods. Motivation: rewrite non-smooth problem as smooth constrained problem:

Preconditioning via Diagonal Scaling

Reza Takapoui, Hamid Javadi. June 4, 2014. 1 Introduction. Interior point methods solve small to medium sized problems to high accuracy in a reasonable amount of time.

Proximal and First-Order Methods for Convex Optimization

John C. Duchi, Yoram Singer. January 2013. Abstract: We describe the proximal method for minimization of convex functions. We review classical results,

SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines

Ding Ma, Michael Saunders. Working paper, January 5. Introduction: In machine learning,

Distributed Randomized Algorithms for the PageRank Computation. Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 55, NO. 9, SEPTEMBER 2010. Abstract

Alternative Decompositions for Distributed Maximization of Network Utility: Framework and Applications

Daniel P. Palomar. Hong Kong University of Science and Technology (HKUST). ELEC5470 - Convex Optimization