On Nonconvex Decentralized Gradient Descent

Jinshan Zeng and Wotao Yin

Abstract—Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been proposed for convex consensus optimization. However, our understanding of their behavior for nonconvex consensus optimization is much more limited. When we lose convexity, we cannot hope that our algorithms always return global solutions, though they sometimes still do. Somewhat surprisingly, the decentralized consensus algorithms DGD and Prox-DGD retain most of the other properties that are known in the convex setting. In particular, when diminishing or constant step sizes are used, we can prove convergence to a (or a neighborhood of a) consensus stationary solution, with guaranteed rates of convergence. It is worth noting that Prox-DGD can handle nonconvex nonsmooth functions as long as their proximal operators can be computed. Such functions include SCAD and the ℓ_q quasi-norms, q ∈ [0, 1). Similarly, Prox-DGD can take constraints given by a nonconvex set with an easy projection. To establish these properties, we introduce a completely different line of analysis, as well as modify existing proofs that were used in the convex setting.

Index Terms—Nonconvex decentralized computing, consensus optimization, decentralized gradient descent method, proximal decentralized gradient descent.

I. INTRODUCTION

We consider an undirected, connected network of n agents and the following consensus optimization problem defined on the network:

  minimize_{x ∈ R^p}  f(x) := Σ_{i=1}^n f_i(x),   (1)

where f_i is a differentiable function known only to agent i. We also consider the consensus optimization problem in the following differentiable+proximable form:

  minimize_{x ∈ R^p}  s(x) := Σ_{i=1}^n ( f_i(x) + r_i(x) ),   (2)

where f_i and r_i are differentiable and proximable functions, respectively, known only to agent i. Each function r_i is possibly non-differentiable or nonconvex, or both.
The models (1) and (2) find applications in decentralized averaging, learning, estimation, and control. Some specific examples include: (i) distributed compressed sensing and machine learning problems, where f_i is a data-fidelity term, which is often differentiable, and r_i is a sparsity-promoting regularizer such as the ℓ_q quasi-norm with 0 ≤ q < 1 [], [7]; (ii) optimization problems with per-agent constraints, where f_i is a differentiable objective function of agent i and r_i is the indicator function of the constraint set of agent i, that is, r_i(x) = 0 if x satisfies the constraint and r_i(x) = +∞ otherwise [7], [0].

  J. Zeng is with the College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 3300, China (jsh.zeng@gmail.com). W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095, USA (wotaoyin@ucla.edu).
  We call a function f proximable if its proximal operator prox_{αf}(y) := argmin_x { αf(x) + (1/2)‖x − y‖² } is easy to compute.

When the f_i's are convex, existing algorithms include the subgradient methods [6], [8], [6], [5], [8], [4], [46], [3], and primal-dual methods such as the decentralized alternating direction method of multipliers (D-ADMM) [35], [36], [7] and EXTRA [37], [38]. However, when the f_i's are nonconvex, few algorithms have convergence guarantees. Some existing results include [3], [4], [3], [3], [4], [39], [40], [9], [4], [43], [48]. In spite of the algorithms and their analyses in these works, the convergence of the simple algorithm Decentralized Gradient Descent (DGD) [8] under nonconvex f_i's is still unknown. Furthermore, although DGD is slower than D-ADMM and EXTRA on convex problems, DGD is simpler and thus easier to extend to a variety of settings such as [3], [45], [6], [5], where online processing and delay tolerance are considered. Therefore, we expect our results to motivate future adoptions of nonconvex DGD.
This paper studies the convergence of two algorithms: DGD for solving problem (1) and Prox-DGD for problem (2). In each DGD iteration, every agent locally computes a gradient and then updates its variable by combining the average of its neighbors' variables with its negative gradient step. In each Prox-DGD iteration, every agent locally computes a gradient of f_i and a proximal map of r_i, as well as exchanging information with its neighbors. Both algorithms can use either a fixed step size or a sequence of decreasing step sizes. When the problem is convex and a fixed step size is used, DGD does not converge to a solution of the original problem but to a point in its neighborhood [46]. This motivates the use of decreasing step sizes such as in [8], [6]. Assuming the f_i's are convex and have Lipschitz continuous and bounded gradients, [8] shows that the decreasing step sizes α_k = 1/√k lead to a convergence rate of O(ln k/√k) for the running best objective error. [6] uses nested loops and shows an outer-loop convergence rate of O(1/k²) for objective errors, utilizing Nesterov's acceleration, provided that the inner loop performs substantial consensus computation. Without a substantial inner loop, their single-loop algorithm using the decreasing step sizes α_k = 1/k^{1/3} has a reduced rate of O(ln k/k^{1/3}). The objective of this paper is two-fold: (a) we aim to show that, other than losing global optimality, most existing convergence results of DGD and Prox-DGD that are known in the convex setting remain valid in the nonconvex setting, and (b) to achieve (a), we illustrate how to tailor nonconvex analysis tools for decentralized optimization. In particular, our asymptotic exact and inexact consensus results require new treatments because they are special to decentralized algorithms. The analytic results of this paper can be summarized as
follows. (a) When a fixed step size α is used and properly bounded, the DGD iterates converge to a stationary point of a Lyapunov function. The difference between each local estimate of x and the global average of all local estimates is bounded, and the bound is proportional to α. (b) When decreasing step sizes α_k = O(1/(1+k)^ε) are used, where 0 < ε ≤ 1 and k is the iteration number, the objective sequence converges, and the iterates of DGD are asymptotically consensual (i.e., they become equal to one another), reaching consensus at the rate O(1/(1+k)^ε). Moreover, we show the convergence of DGD to a stationary point of the original problem, and derive the convergence rates of DGD under different ε for objective functions that are convex. (c) The convergence analysis of DGD can be extended to the algorithm Prox-DGD for solving problem (2). However, when the proximable functions r_i are nonconvex, the mixing matrix is required to be positive definite and a smaller step size is also required; otherwise (when the r_i are convex), the mixing matrix may be indefinite.

Detailed comparisons between our results and the existing results on DGD and Prox-DGD are presented in Tables I and II. The global objective error rate in these two tables refers to the rate of {f(x̄^k) − f(x^opt)} or {s(x̄^k) − s(x^opt)}, where x̄^k = (1/n) Σ_{i=1}^n x_i^k is the average of the kth iterate and x^opt is a global solution. Comparisons beyond DGD and Prox-DGD are presented in Section IV and Table III.

New proof techniques are introduced in this paper, particularly in the analysis of the convergence of DGD and Prox-DGD with decreasing step sizes. Specifically, the convergence of the objective sequence and the convergence to a stationary point of the original problem under decreasing step sizes are justified by taking a Lyapunov function and several new lemmas (cf. Lemma 9 and the proof of Theorem 2). Moreover, we estimate the consensus rate by introducing an auxiliary sequence and then showing that both sequences have the same rates (cf. the proof of Proposition 3).
All these proof techniques are new and distinguish our paper from existing works such as [8], [6], [8], [3], [3], [3], [40], [43].

The rest of this paper is organized as follows. Section II describes the problem setup and reviews the algorithms. Section III presents our assumptions and main results. Section IV discusses related works. Section V presents the proofs of our main results. We conclude this paper in Section VI.

Notation: Let I denote the identity matrix of size n × n, and 1 ∈ R^n denote the vector of all 1's. For a matrix X, X^T denotes its transpose, X_{ij} denotes its (i, j)th component, and ‖X‖ := √⟨X, X⟩ = (Σ_{i,j} X_{ij}²)^{1/2} is its Frobenius norm, which simplifies to the Euclidean norm when X is a vector. Given a symmetric, positive semidefinite matrix G ∈ R^{n×n}, we let ‖X‖_G := √⟨X, GX⟩ be the induced semi-norm. Given a function h, dom(h) denotes its domain. An edge (i, j) ∈ E represents a communication link between nodes i and j.

Let x_i ∈ R^p denote the local copy of x at node i. We reformulate the consensus problem (1) into the equivalent problem

  minimize_x 1^T f(x) := Σ_{i=1}^n f_i(x_i),   (3)
  subject to x_i = x_j, ∀(i, j) ∈ E,

where x ∈ R^{n×p} and f(x) ∈ R^n are stacked as

  x := (x_1^T; x_2^T; …; x_n^T),   f(x) := (f_1(x_1); f_2(x_2); …; f_n(x_n)).

In addition, the gradient of f(x) is

  ∇f(x) := (∇f_1(x_1)^T; ∇f_2(x_2)^T; …; ∇f_n(x_n)^T) ∈ R^{n×p}.

The ith rows of the matrices x and ∇f(x), and of the vector f(x), correspond to agent i. The analysis in this paper applies to any integer p ≥ 1. For simplicity, one can let p = 1 and treat x and ∇f(x) as vectors rather than matrices.

The algorithm DGD [8] for (3) is described as follows: pick an arbitrary x^0; for k = 0, 1, …, compute

  x^{k+1} = W x^k − α ∇f(x^k),   (4)

where W is a mixing matrix and α > 0 is a step size parameter.

Similarly, we can reformulate the composite problem (2) as the equivalent form

  minimize_x Σ_{i=1}^n ( f_i(x_i) + r_i(x_i) ),   (5)
  subject to x_i = x_j, ∀(i, j) ∈ E.

Let r(x) := Σ_{i=1}^n r_i(x_i). The algorithm Prox-DGD can be applied to problem (5): take an arbitrary x^0; for k = 0, 1, …, perform

  x^{k+1} = prox_{α_k r}( W x^k − α_k ∇f(x^k) ),   (6)

where the proximal operator is

  prox_{α_k r}(x) := argmin_{u ∈ R^{n×p}} { α_k r(u) + (1/2) ‖u − x‖² }.   (7)
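To make iteration (4) concrete, here is a minimal Python sketch (our own illustration, not part of the paper): it runs DGD on a hypothetical toy instance with quadratic f_i(x) = (1/2)(x − b_i)², whose consensus minimizer is the average of the b_i, using a simple averaging mixing matrix of our choosing that satisfies Assumption 2.

```python
import numpy as np

def dgd(grads, W, x0, alpha, iters):
    """Run the DGD iteration (4): x^{k+1} = W x^k - alpha * grad f(x^k).
    Row i of x is agent i's local copy; grads[i] computes grad f_i."""
    x = x0.astype(float)
    for _ in range(iters):
        g = np.array([grads[i](x[i]) for i in range(len(grads))])
        x = W @ x - alpha * g
    return x

# Toy instance (our choice, not from the paper): f_i(x) = 0.5*(x - b_i)^2,
# so the consensus minimizer of (1) is mean(b).
n = 4
b = np.array([1.0, 2.0, 3.0, 4.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
# Symmetric, doubly stochastic W: mix each agent with the network average.
W = 0.5 * np.eye(n) + 0.5 * np.ones((n, n)) / n
x = dgd(grads, W, x0=np.zeros(n), alpha=0.05, iters=2000)
```

Consistent with the fixed-step-size theory, the average of the local copies lands on mean(b), while the individual copies stay spread around it by an amount proportional to α, as Proposition 1 below predicts.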
II. PROBLEM SETUP AND ALGORITHM REVIEW

Consider a connected undirected network G = {V, E}, where V is a set of n nodes and E is the edge set. Any edge (i, j) ∈ E represents a communication link between nodes i and j.

III. ASSUMPTIONS AND MAIN RESULTS

This section presents all of our main results.
TABLE I
COMPARISONS OF DIFFERENT ALGORITHMS FOR THE CONSENSUS SMOOTH OPTIMIZATION PROBLEM (1).

| | DGD [46] (fixed step size) | DGD, this paper (fixed step size) | D-NG [6] (decreasing step sizes) | DGD, this paper (decreasing step sizes) |
| --- | --- | --- | --- | --- |
| f_i | convex only | nonconvex | convex only | nonconvex |
| ∇f_i | Lipschitz | Lipschitz | Lipschitz, bounded | Lipschitz, bounded |
| step size | 0 < α < (1+λ_n(W))/L_f | 0 < α < (1+λ_n(W))/L_f | O(1/k), with Nesterov acc. | 1/(1+k)^ε, ε ∈ (0, 1] |
| consensus error | O(α) | O(α) | O(1/k) | O(1/(1+k)^ε) |
| min_{j≤k} ‖x^{j+1} − x^j‖² | — | o(1/k) | no rate | o(1/(1+k)^{1+ε}) |
| global objective error | O(1/k) until reaching O(α/(1−ζ)) | convex: O(1/k) until reaching O(α/(1−ζ)); nonconvex: no rate | O(ln k/k) | convex: O(ln k/√k) if ε = 1/2, O(1/ln k) if ε = 1, O(1/k^{min{ε, 1−ε}}) for other ε; nonconvex: no rate |

The objective error rates of DGD and Prox-DGD obtained in this paper, and those of the convex DProx-Grad [8], are ergodic or running best rates.

TABLE II
COMPARISONS OF DIFFERENT ALGORITHMS FOR THE CONSENSUS COMPOSITE OPTIMIZATION PROBLEM (2).

| | AccDProx-Grad [6] (fixed step size) | Prox-DGD, this paper (fixed step size) | DProx-Grad [8] (decreasing step sizes) | Prox-DGD, this paper (decreasing step sizes) |
| --- | --- | --- | --- | --- |
| f_i, r_i | convex only | nonconvex | convex only | nonconvex |
| ∇f_i | Lipschitz, bounded | Lipschitz | Lipschitz, bounded | Lipschitz, bounded |
| ∂r_i | bounded | — | bounded | bounded |
| step size | 0 < α < 1/L_f | 0 < α < (1+λ_n(W))/L_f for convex r_i; 0 < α < λ_n(W)/L_f for nonconvex r_i (λ_n(W) > 0) | O(1/√(k+1)) | 1/(1+k)^ε, ε ∈ (0, 1] |
| consensus error | O(γ^k), 0 < γ < 1 | O(α) | O(1/√k) | O(1/(1+k)^ε) |
| min_{j≤k} ‖x^{j+1} − x^j‖² | no rate | o(1/k) | no rate | o(1/(1+k)^{1+ε}) |
| global objective error | O(D_1/(αk) + D_2 α), D_1, D_2 > 0 | convex: O(D_3/(αk) + D_4 α), D_3, D_4 > 0; nonconvex: no rate | O(ln k/√k) | convex: O(ln k/√k) if ε = 1/2, O(1/ln k) if ε = 1, O(1/k^{min{ε, 1−ε}}) for other ε; nonconvex: no rate |

The objective error rates are ergodic or running best rates.

A. Definitions and assumptions

Definition 1 (Lipschitz differentiability). A function h is called Lipschitz differentiable if h is differentiable and its gradient ∇h is Lipschitz continuous, i.e., ‖∇h(u) − ∇h(v)‖ ≤ L‖u − v‖, ∀u, v ∈ dom(h), where L > 0 is its Lipschitz constant.

Definition 2 (Coercivity). A function h is called coercive if ‖u‖ → +∞ implies h(u) → +∞.
TABLE III
COMPARISONS OF THE SCENARIOS HANDLED BY DIFFERENT NONCONVEX DECENTRALIZED ALGORITHMS.

Each algorithm is compared on: smooth f_i; nonsmooth r_i (cvx or ncvx); step size (fixed or diminish); network (static or dynamic); algorithm type (determin or stochastic); fusion scheme (ATC or CTA); and the type of mixing matrix W:

| algorithm | mixing matrix W |
| --- | --- |
| DGD (this paper) | doubly |
| Perturbed Push-sum [40] | column |
| ZENITH [3] | doubly |
| Prox-DGD (this paper) | doubly |
| NEXT [3] | doubly |
| DeFW [43] | doubly |
| Proj SGD [3] | row |

In this table, the full names of the abbreviations are as follows: cvx (convex), ncvx (nonconvex), diminish (diminishing), determin (deterministic), ATC (adaptive-then-combine), CTA (combine-then-adaptive), doubly (doubly stochastic), column (column stochastic), row (row stochastic). A row, column, or doubly stochastic W means that W1 = 1, W^T 1 = 1, or both hold, respectively.

The next definition is a property that many functions have (see [44, Section .] for examples) and can help obtain whole sequence convergence from subsequence convergence.

Definition 3 (Kurdyka-Łojasiewicz (KŁ) property [], [5], []). A function h : R^p → R ∪ {+∞} has the KŁ property at x̄ ∈ dom(∂h) if there exist η ∈ (0, +∞], a neighborhood U of x̄, and a continuous concave function ϕ : [0, η) → R_+ such that: (i) ϕ(0) = 0 and ϕ is differentiable on (0, η); (ii) for all s ∈ (0, η), ϕ′(s) > 0; (iii) for all x ∈ U ∩ {x : h(x̄) < h(x) < h(x̄) + η}, the KŁ inequality holds:

  ϕ′( h(x) − h(x̄) ) · dist( 0, ∂h(x) ) ≥ 1.   (8)

Proper lower semi-continuous functions that satisfy the KŁ inequality at each point of dom(∂h) are called KŁ functions.

Assumption 1 (Objective). The objective functions f_i : R^p → R, i = 1, …, n, satisfy the following:
1) f_i is Lipschitz differentiable with constant L_i > 0.
2) f_i is proper (i.e., not everywhere infinite) and coercive.
The sum Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable with L_f := max_i L_i. In addition, each f_i is lower bounded, following Part 2 of the above assumption.

Assumption 2 (Mixing matrix). The mixing matrix W = [w_{ij}] ∈ R^{n×n} has the following properties:
1) (Graph) If i ≠ j and (i, j) ∉ E, then w_{ij} = 0; otherwise, w_{ij} > 0.
2) (Symmetry) W = W^T.
3) (Null space property) null{I − W} = span{1}.
4) (Spectral property) −I ≺ W ⪯ I.

By Assumption 2, a solution x^opt to problem (3) satisfies (I − W) x^opt = 0. Due to the symmetry of W, its eigenvalues are real and can be sorted in nonincreasing order. Let λ_i(W) denote the ith largest eigenvalue of W. Then, by Assumption 2,

  λ_1(W) = 1 > λ_2(W) ≥ ⋯ ≥ λ_n(W) > −1.

Let ζ be the second largest magnitude eigenvalue of W. Then

  ζ = max{ |λ_2(W)|, |λ_n(W)| }.   (9)

B. Convergence results of DGD

We consider the convergence of DGD with both a fixed step size and a sequence of decreasing step sizes.

1) Convergence results of DGD with a fixed step size: The convergence result of DGD with a fixed step size (i.e., α_k ≡ α) is established based on the Lyapunov function [46]:

  L_α(x) := 1^T f(x) + (1/(2α)) ‖x‖²_{I−W}.   (10)

It is worth reminding that convexity is not assumed.

Theorem 1 (Global convergence). Let {x^k} be the sequence generated by DGD (4) with the step size 0 < α < (1+λ_n(W))/L_f. Let Assumptions 1 and 2 hold. Then {x^k} has at least one accumulation point x*, and any such point is a stationary point of L_α(x). Furthermore, the running best rates of the sequences {‖x^{k+1} − x^k‖²} and {‖∇L_α(x^k)‖²} are o(1/k). In addition, if L_α satisfies the KŁ property at an accumulation point x*, then {x^k} globally converges to x*.

  Whole sequence convergence from any starting point is referred to as global convergence in the literature. Its limit is not necessarily a global solution.
  Given a nonnegative sequence {a_k}, its running best sequence is b_k = min{a_i : i ≤ k}. We say {a_k} has a running best rate of o(1/k) if b_k = o(1/k). These squared quantities naturally appear in the analysis, so we keep the squares.

Remark 1. Let x* be a stationary point of L_α(x), and thus

  0 = ∇f(x*) + (1/α)(I − W) x*.

Since 1^T (I − W) = 0, this yields 0 = 1^T ∇f(x*), indicating that x* is also a stationary point of the separable function Σ_{i=1}^n f_i(x_i). Since the rows of x* are not necessarily
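Assumption 2 can be checked numerically. The sketch below (ours; the Metropolis rule is one standard construction, not one the paper prescribes) builds a symmetric doubly stochastic W for a small path graph and computes the quantity ζ from (9).

```python
import numpy as np

def metropolis_weights(n, edges):
    """Symmetric, doubly stochastic mixing matrix via the Metropolis rule:
    w_ij = 1/(1 + max(deg_i, deg_j)) on edges; the diagonal fills each row
    so that the row sums equal 1."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

W = metropolis_weights(4, [(0, 1), (1, 2), (2, 3)])   # path graph 0-1-2-3
eig = np.sort(np.linalg.eigvalsh(W))                  # real: W is symmetric
zeta = max(abs(eig[-2]), abs(eig[0]))                 # 2nd largest magnitude, (9)
```

For a connected graph this W has the simple eigenvalue 1 (with eigenvector 1) and all other eigenvalues inside (−1, 1), so ζ < 1, as required by Assumption 2.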
identical, we cannot say x* is a stationary point of problem (3). However, the differences between the rows of x* are bounded, following our next result, adapted from [46]:

Proposition 1 (Consensual bound on x^k). For each iteration k, define x̄^k := (1/n) Σ_{i=1}^n x_i^k. Then, it holds for each node i that

  ‖x_i^k − x̄^k‖ ≤ αD/(1 − ζ),   (11)

where D is a universal bound on ‖∇f(x^k)‖ defined in Lemma 6 below, and ζ is the second largest magnitude eigenvalue of W specified in (9). As k → ∞, (11) yields the consensual bound

  ‖x_i^∞ − x̄^∞‖ ≤ αD/(1 − ζ),   (12)

where x̄^∞ := (1/n) Σ_{i=1}^n x_i^∞.

In Proposition 1, the consensual bound is proportional to the step size α and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of W.

Let us compare the DGD iteration (4) with the iteration of centralized gradient descent for f(x). Averaging the rows of (4) yields the following comparison:

  DGD averaged:  x̄^{k+1} = x̄^k − (α/n) Σ_{i=1}^n ∇f_i(x_i^k).   (13)
  Centralized:   x̄^{k+1} = x̄^k − (α/n) Σ_{i=1}^n ∇f_i(x̄^k).   (14)

Apparently, DGD approximates centralized gradient descent by evaluating each ∇f_i at the local variable x_i^k instead of the global average. We can estimate the error of this approximation as

  ‖(1/n) Σ_{i=1}^n ∇f_i(x_i^k) − (1/n) Σ_{i=1}^n ∇f_i(x̄^k)‖ ≤ (1/n) Σ_{i=1}^n ‖∇f_i(x_i^k) − ∇f_i(x̄^k)‖ ≤ L_f αD/(1 − ζ).

Unlike the convex analysis in [46], it is impossible to bound the difference between the sequences of (13) and (14) without convexity, because the two sequences may converge to different stationary points of L_α.

Remark 2. The KŁ assumption on L_α in Theorem 1 is satisfied if each f_i is a sub-analytic function. Since ‖x‖²_{I−W} is obviously sub-analytic and the sum of two sub-analytic functions remains sub-analytic, L_α is sub-analytic if each f_i is so. See [44, Section .] for more details and examples.

Proposition 2 (KŁ convergence rates). Let the assumptions of Theorem 1 hold. Suppose that L_α satisfies the KŁ inequality at an accumulation point x* with ϕ(s) = c s^{1−θ} for some constant c > 0. Then, the following convergence rates hold:
a) If θ = 0, x^k converges to x* in finitely many iterations.
b) If θ ∈ (0, 1/2], ‖x^k − x*‖ ≤ C_0 τ^k for all k ≥ k_0, for some k_0 > 0, C_0 > 0, τ ∈ [0, 1).
c) If θ ∈ (1/2, 1), ‖x^k − x*‖ ≤ C_0 k^{−(1−θ)/(2θ−1)} for all k ≥ k_0, for certain k_0 > 0, C_0 > 0.

Note that the rates in parts b) and c) of Proposition 2 are of the eventual type.

Using fixed step sizes, our results are limited because the stationary point x* of L_α is not a stationary point of the original problem; we only have a consensual bound on x*. To address this issue, the next subsection uses decreasing step sizes and presents better convergence results.

2) Convergence of DGD with decreasing step sizes: The positive consensual error bound in Proposition 1, which is proportional to the constant step size α, motivates the use of properly decreasing step sizes α_k = O(1/(1+k)^ε), for some 0 < ε ≤ 1, to diminish the consensual bound to 0. As a result, any accumulation point x* becomes a stationary point of the original problem (3). To analyze DGD with decreasing step sizes, we add the following assumption.

Assumption 3 (Bounded gradient). For any k, ∇f(x^k) is uniformly bounded by some constant B > 0, i.e., ‖∇f(x^k)‖ ≤ B.

Note that the bounded gradient assumption is a regular assumption in the convergence analysis of decentralized gradient methods (see, e.g., [3], [4], [3], [3], [4], [39], [40], [9], [43]), even in the convex setting ([6] and also [8]), though it is not required for centralized gradient descent. We take the step size sequence

  α_k = 1/(1+k)^ε, 0 < ε ≤ 1,   (15)

throughout the rest of this section. The numerator can be replaced by any positive constant. By iteratively applying iteration (4), we obtain the following expression:

  x^k = W^k x^0 − Σ_{j=0}^{k−1} α_j W^{k−1−j} ∇f(x^j).   (16)

Proposition 3 (Asymptotic consensus rate). Let Assumptions 2 and 3 hold. Let DGD use (15). Let x̄^k := (1/n) 1 1^T x^k. Then, ‖x^k − x̄^k‖ converges to 0 at the rate of O(1/(1+k)^ε).

According to Proposition 3, the iterates of DGD with decreasing step sizes reach consensus asymptotically, compared with the nonzero bound in the fixed step size case in Proposition 1. Moreover, with a larger ε, faster decaying step sizes generally imply a faster asymptotic consensus rate. Note that (I − W) x̄^k = 0 and thus ‖x^k‖_{I−W} = ‖x^k − x̄^k‖_{I−W}.
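The contrast with the fixed-step case can be seen on a toy quadratic instance (our hypothetical example, with f_i(x) = (1/2)(x − b_i)² and an averaging W of our choosing): under the step sizes (15), the consensus gap keeps shrinking rather than stalling at an O(α) level.

```python
import numpy as np

def dgd_decreasing(grads, W, x0, eps, iters):
    """DGD (4) with the decreasing step sizes (15): alpha_k = 1/(1+k)^eps."""
    x = x0.astype(float)
    gaps = []
    for k in range(iters):
        alpha = 1.0 / (1 + k) ** eps
        g = np.array([grads[i](x[i]) for i in range(len(grads))])
        x = W @ x - alpha * g
        gaps.append(np.max(np.abs(x - x.mean())))   # consensus gap at step k
    return x, gaps

n = 4
b = np.array([1.0, 2.0, 3.0, 4.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
W = 0.5 * np.eye(n) + 0.5 * np.ones((n, n)) / n
x, gaps = dgd_decreasing(grads, W, np.zeros(n), eps=0.5, iters=5000)
```

The recorded gaps decay roughly like O(1/(1+k)^ε), in line with Proposition 3, while the average of the local copies still converges to the minimizer mean(b) because Σ_k α_k = ∞.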
Therefore, the above proposition implies the following result.

Corollary 1. Apply the setting of Proposition 3. Then ‖x^k‖_{I−W} converges to 0 at the rate of O(1/(1+k)^ε).

Corollary 1 shows that the sequence {x^k}, measured in the (I − W) semi-norm, decays to 0 at a sublinear rate. For any global consensual solution x^opt to problem (3), we have ‖x^k − x^opt‖_{I−W} = ‖x^k‖_{I−W}; so, if {x^k} does converge to x^opt, then their distance in the same semi-norm decays at O(1/(1+k)^ε).

Theorem 2 (Convergence). Let Assumptions 1, 2, and 3 hold. Let DGD use the step sizes (15). Then:
a) {L_{α_k}(x^k)} and {1^T f(x^k)} converge to the same limit;
b) lim_{k→∞} 1^T ∇f(x^k) = 0, and any limit point of {x^k} is a stationary point of problem (3);
c) in addition, if there exists an isolated accumulation point, then {x^k} converges.

In the proof of Theorem 2, we will establish

  Σ_{k=0}^∞ α_k^{−1} (1 + λ_n(W)) ‖x^{k+1} − x^k‖² < ∞,

which implies that the running best rate of the sequence {‖x^{k+1} − x^k‖²} is o(1/(1+k)^{1+ε}).

Theorem 2 shows that the objective sequence converges and that any limit point of {x^k} is a stationary point of the original problem. However, there is no result on the convergence rate of the objective sequence to an optimal value, and it is generally difficult to get such a rate without convexity. Although our primary focus is nonconvexity, next we assume convexity and present the objective convergence rate, which has an interesting relation with ε.

For any x ∈ R^{n×p}, let f(x) := Σ_{i=1}^n f_i(x_i). Even if the f_i's are convex, the solution to (3) may be non-unique. Thus, let X* be the set of solutions to (3). Given x̄^k, we pick the solution x^opt = Proj_{X*}(x̄^k) ∈ X*. Also let f^opt = f(x^opt) be the optimal value of (1). Define the ergodic objective:

  f̂_k := ( Σ_{j=0}^k α_j f(x̄^{j+1}) ) / ( Σ_{j=0}^k α_j ),   (17)

where x̄^{k+1} = (1/n) 1^T x^{k+1}. Obviously,

  f̂_k ≥ min_{j=1,…,k+1} f(x̄^j).   (18)

Proposition 4 (Convergence rates under convexity). Let Assumptions 1, 2, and 3 hold. Let DGD use the step sizes (15). If λ_n(W) > 0 and each f_i is convex, then {f̂_k} defined in (17) converges to the optimal objective value f^opt at the following rates:
a) if 0 < ε < 1/2, the rate is O(1/k^ε);
b) if ε = 1/2, the rate is O(ln k/√k);
c) if 1/2 < ε < 1, the rate is O(1/k^{1−ε});
d) if ε = 1, the rate is O(1/ln k).

The convergence rates established in Proposition 4 are almost as good as O(1/√k) when ε = 1/2. As ε goes to either 0 or 1, the rates become slower, so ε = 1/2 may be the optimal choice in terms of the convergence rate. However, by Proposition 3, a larger ε implies a faster consensus rate. Therefore, there is a tradeoff in choosing an appropriate ε in the practical implementation of DGD.

C. Convergence results of Prox-DGD

Similarly, we consider the convergence of Prox-DGD with both a fixed step size and decreasing step sizes.
The iteration (6) can be reformulated as

  x^{k+1} = prox_{α_k r}( x^k − α_k ∇L_{α_k}(x^k) ),   (19)

based on which we define the Lyapunov function

  L̂_α(x) := L_α(x) + r(x),

where we recall L_α(x) = Σ_{i=1}^n f_i(x_i) + (1/(2α)) ‖x‖²_{I−W}. Then (19) is clearly the forward-backward splitting (a.k.a. prox-gradient) iteration for

  minimize_x L̂_α(x).

Specifically, (19) first performs a gradient descent step on the differentiable function L_α(x) and then computes the proximal map of r(x). To analyze Prox-DGD, we revise Assumption 1 as follows.

Assumption 4 (Composite objective). The objective function of (5) satisfies the following:
1) Each f_i is Lipschitz differentiable with constant L_i > 0.
2) Each f_i + r_i is proper, lower semi-continuous, and coercive.
As before, Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable with L_f := max_i L_i.

1) Convergence results of Prox-DGD with a fixed step size: Based on the above assumptions, we can get the global convergence of Prox-DGD as follows.

Theorem 3 (Global convergence of Prox-DGD). Let {x^k} be the sequence generated by Prox-DGD (6), where the step size α satisfies 0 < α < (1+λ_n(W))/L_f when the r_i's are convex, and 0 < α < λ_n(W)/L_f when the r_i's are not necessarily convex (this case requires λ_n(W) > 0). Let Assumptions 2 and 4 hold. Then {x^k} has at least one accumulation point x*, and any accumulation point is a stationary point of L̂_α(x). Furthermore, the running best rates of the sequences {‖x^{k+1} − x^k‖²} and {‖g^{k+1}‖²} (where g^{k+1} is defined in Lemma 8) are both o(1/k). In addition, if L̂_α satisfies the KŁ property at an accumulation point x*, then {x^k} converges to x*.

The rate of convergence of Prox-DGD can also be established by leveraging the KŁ property.

Proposition 5 (Rate of convergence of Prox-DGD). Under the assumptions of Theorem 3, suppose that L̂_α satisfies the KŁ inequality at an accumulation point x* with ϕ(s) = c s^{1−θ} for some constant c > 0. Then the following hold:
a) If θ = 0, x^k converges to x* in finitely many iterations.
b) If θ ∈ (0, 1/2], ‖x^k − x*‖ ≤ C τ^k for all k ≥ k_0, for some k_0 > 0, C > 0, τ ∈ [0, 1).
c) If θ ∈ (1/2, 1), ‖x^k − x*‖ ≤ C k^{−(1−θ)/(2θ−1)} for all k ≥ k_0, for certain k_0 > 0, C > 0.
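As a concrete instance of the forward-backward step (19), take the convex choice r_i = λ|·| (our example; the paper also covers nonconvex r_i such as the ℓ_q quasi-norms). Its proximal map is entrywise soft-thresholding, so one Prox-DGD iteration (6) can be sketched as:

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t*|.| applied entrywise (the l1 case of a proximable r_i)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_dgd_step(x, grads, W, alpha, lam):
    """One Prox-DGD iteration (6) with r_i = lam*|.|:
    x^{k+1} = prox_{alpha*r}(W x^k - alpha * grad f(x^k))."""
    g = np.array([grads[i](x[i]) for i in range(len(grads))])
    return soft_threshold(W @ x - alpha * g, alpha * lam)

# Two agents with hypothetical f_i(x) = 0.5*(x - b_i)^2, b = (1, -1);
# the l1 term pulls the (zero-mean) consensus value toward 0.
b = np.array([1.0, -1.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
W = np.full((2, 2), 0.5)          # symmetric doubly stochastic averaging
x = np.zeros(2)
for _ in range(200):
    x = prox_dgd_step(x, grads, W, alpha=0.1, lam=0.5)
```

For nonconvex r_i (e.g., ℓ_0, whose prox is hard-thresholding), the same step applies, but Theorem 3 then asks for λ_n(W) > 0 and a smaller step size.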
2) Convergence of Prox-DGD with decreasing step sizes: In Prox-DGD, we also use the decreasing step sizes (15). To investigate its convergence, the bounded gradient Assumption 3 is revised as follows.

Assumption 5 (Bounded composite subgradient). For each i, ∇f_i is uniformly bounded by some constant B_i > 0, i.e., ‖∇f_i(x)‖ ≤ B_i for any x ∈ R^p. Moreover, ‖ξ_i‖ ≤ B_{r_i} for any ξ_i ∈ ∂r_i(x) and x ∈ R^p, i = 1, …, n.

Let B := Σ_{i=1}^n (B_i + B_{r_i}). Then ‖∇f(x) + ξ‖, where ξ ∈ ∂r(x), is uniformly bounded by B for any x ∈ R^{n×p}. Note that the same assumption is used to analyze the convergence
of the distributed proximal-gradient method in the convex setting [6], [8], and it is also widely used to analyze the convergence of nonconvex decentralized algorithms as in [3], [4]. In light of Lemma 9 below, the claims in Proposition 3 and Corollary 1 also hold for Prox-DGD.

Proposition 6 (Asymptotic consensus and rate). Let Assumptions 2 and 5 hold. In Prox-DGD, use the step sizes (15). Then it holds that

  ‖x^k − x̄^k‖ ≤ C ( ‖x^0‖ ζ^k + B Σ_{j=0}^{k−1} α_j ζ^{k−1−j} ),

and ‖x^k − x̄^k‖ converges to 0 at the rate of O(1/(1+k)^ε). Moreover, let x* be any global solution of problem (5). Then ‖x^k − x*‖_{I−W} = ‖x^k‖_{I−W} = ‖x^k − x̄^k‖_{I−W} converges to 0 at the rate of O(1/(1+k)^ε).

For any x ∈ R^{n×p}, define s(x) := Σ_{i=1}^n ( f_i(x_i) + r_i(x_i) ). Let X* be the set of solutions of (5), x^opt = Proj_{X*}(x̄^k) ∈ X*, and s^opt = s(x^opt) be the optimal value of (5). Define

  ŝ_k := ( Σ_{j=0}^k α_j s(x̄^{j+1}) ) / ( Σ_{j=0}^k α_j ).   (20)

Theorem 4 (Convergence and rate). Let Assumptions 2, 4, and 5 hold. In Prox-DGD, use the step sizes (15). Then:
a) {L̂_{α_k}(x^k)} and {Σ_{i=1}^n ( f_i(x_i^k) + r_i(x_i^k) )} converge to the same limit;
b) Σ_{k=0}^∞ α_k^{−1} (1 + λ_n(W)) ‖x^{k+1} − x^k‖² < ∞ when the r_i's are convex; or Σ_{k=0}^∞ α_k^{−1} λ_n(W) ‖x^{k+1} − x^k‖² < ∞ when the r_i's are not necessarily convex (this case requires λ_n(W) > 0);
c) if {ξ^k} satisfies ‖ξ^{k+1} − ξ^k‖ ≤ L_r ‖x^{k+1} − x^k‖ for each k > k_0, some constant L_r > 0, and a sufficiently large integer k_0 > 0, then lim_{k→∞} 1^T ( ∇f(x^{k+1}) + ξ^{k+1} ) = 0, where ξ^{k+1} ∈ ∂r(x^{k+1}) is the one determined by the proximal operator (7), and any limit point is a stationary point of problem (5);
d) in addition, if there exists an isolated accumulation point, then {x^k} converges;
e) furthermore, if the f_i's and r_i's are convex and λ_n(W) > 0, then the claims on the rates of {f̂_k} in Proposition 4 hold for the sequence {ŝ_k} defined in (20).

Theorem 4b) implies that the running best rate of {‖x^{k+1} − x^k‖²} is o(1/(1+k)^{1+ε}). The additional condition imposed on {ξ^k} in Theorem 4c) is a type of restricted continuity regularity of the subgradient ∂r with respect to the generated sequence, which may hold for a class of proximal functions, as studied in [47].
If ∂r is locally Lipschitz continuous in a neighborhood of a limit point, then such a condition can generally be satisfied, since {x^k} is asymptotically regular and thus x^k will lie in such a neighborhood of this limit point when k is sufficiently large. Theorem 4e) gives the convergence rates of Prox-DGD in the convex setting.

IV. RELATED WORKS AND DISCUSSIONS

We summarize some recent nonconvex decentralized algorithms in Table III. Most of them apply to either the smooth optimization problem (1) or the composite optimization problem (2) and use diminishing step sizes. Although (1) is a special case of (2) obtained by letting r_i(x) ≡ 0, there are still differences in both algorithm design and theoretical analysis. Therefore, we separate their comparisons.

We first discuss the algorithms for (1). In [40], the authors proved the convergence of perturbed push-sum for nonconvex (1) under some regularity assumptions. They also introduced random perturbations to avoid local minima. The network considered in [40] is time-varying and directed, and specific column stochastic matrices and diminishing step sizes are used. Their algorithm is an extension of the DGD with diminishing step sizes studied in this paper. The convergence results for the deterministic perturbed push-sum algorithm obtained in [40] are similar to those of DGD developed in this paper under similar assumptions (see Theorem 2 above and [40, Theorem 3]). However, in this paper, we obtain the asymptotic consensus and the convergence to a stationary point of DGD via a Lyapunov function and by developing several new results, such as a lemma on the convergence of the so-called weakly summable sequences. The proofs in [40] are mainly based on [30, Theorem .7.3]. In [3], a primal-dual approximate gradient algorithm called ZENITH was developed for (1). The convergence of ZENITH was given in expectation of the constraint violation, under the Lipschitz differentiability assumption and other assumptions.
Table III includes three algorithms for solving the composite problem (2) that are related to ours. All of them only deal with convex r_i, whereas r_i in this paper can also be nonconvex. In [4], the authors proposed NEXT based on the successive convex approximation (SCA) technique. Each NEXT iteration includes two stages: a local SCA stage to update the local variables and a consensus update stage to fuse information between agents. While NEXT has results similar to those of Prox-DGD using diminishing step sizes, Prox-DGD is simpler than NEXT. Another interesting algorithm is the decentralized Frank-Wolfe (DeFW) algorithm proposed in [43] for nonconvex, smooth, constrained decentralized optimization, where a bounded convex constraint set is imposed. There are three steps at each iteration of DeFW: average gradient computation, local variable evaluation by Frank-Wolfe, and information fusion between agents. In [43], the authors established convergence results similar to those of Prox-DGD under diminishing step sizes. A stochastic version of DeFW has also been developed in [9] for high-dimensional convex sparse optimization. The last one is the projected stochastic gradient algorithm (Proj SGD) [3] for constrained, nonconvex, smooth consensus optimization. It has two steps at each iteration: a projected stochastic gradient step to update the local variables and a consensus step to exchange information between local agents.

  The original form of this algorithm, push-sum, was proposed in [7] for the average consensus problem. It was modified and analyzed in [9] for the convex consensus optimization problem over time-varying directed graphs.

The mixing matrix used in this algorithm is
random and row stochastic, but its expectation is column stochastic. Asymptotic consensus and convergence to the set of Karush-Kuhn-Tucker points were proved under diminishing step sizes, a smooth objective function, certain mean and variance restrictions on the stochastic direction, and other assumptions on the mixing matrices and the constraint set.

Based on the above analysis, the convergence results of DGD and Prox-DGD with diminishing step sizes in this paper are comparable with most of the existing ones, which involve more complicated methods. However, we allow nonconvex nonsmooth r_i and are able to obtain estimates of the asymptotic consensus rates. We also establish global convergence using a fixed step size, which is otherwise only found for ZENITH.

V. PROOFS

In this section, we present the proofs of our main theorems and propositions.

A. Proof of Theorem 1

The sketch of the proof is as follows: DGD is interpreted as the gradient descent algorithm applied to the Lyapunov function L_α, following the argument in [46]; then, the properties of sufficient descent, lower boundedness, and bounded gradients are established for the sequence {L_α(x^k)}, giving subsequence convergence of the DGD iterates; finally, whole sequence convergence of the DGD iterates follows from the KŁ property of L_α.

Lemma 1 (Gradient descent interpretation). The sequence {x^k} generated by the DGD iteration (4) is the same sequence generated by applying gradient descent with the fixed step size α to the objective function L_α(x).

A proof of this lemma is given in [46]; it is based on reformulating (4) as the iteration

  x^{k+1} = x^k − α ( ∇f(x^k) + (1/α)(I − W) x^k ) = x^k − α ∇L_α(x^k).   (21)

Although the sequence {x^k} generated by the DGD iteration (4) can be interpreted as a centralized gradient descent sequence for the function L_α(x), it is different from gradient descent applied to the original problem (3).

Lemma 2 (Sufficient descent of {L_α(x^k)}). Let Assumptions 1 and 2 hold, and set the step size 0 < α < (1+λ_n(W))/L_f. It holds that

  L_α(x^{k+1}) ≤ L_α(x^k) − ( (1+λ_n(W))/(2α) − L_f/2 ) ‖x^{k+1} − x^k‖², ∀k ∈ N.   (22)

Proof.
From $x^{k+1} = x^k - \alpha \nabla L_\alpha(x^k)$, it follows that
$$\langle \nabla L_\alpha(x^k), x^{k+1} - x^k \rangle = -\frac{1}{\alpha}\|x^{k+1} - x^k\|^2.$$
Since $\sum_{i=1}^n f_i(x_i)$ is $L_f$-Lipschitz differentiable, $\nabla L_\alpha$ is Lipschitz with the constant $L := L_f + \alpha^{-1}\lambda_{\max}(I - W) = L_f + \alpha^{-1}(1 - \lambda_n(W))$, implying
$$L_\alpha(x^{k+1}) \le L_\alpha(x^k) + \langle \nabla L_\alpha(x^k), x^{k+1} - x^k \rangle + \frac{L}{2}\|x^{k+1} - x^k\|^2.$$
Combining the two displays yields the claim.

Lemma 3 (Boundedness). Under Assumptions 1 and 2, if $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$, then the sequence $\{L_\alpha(x^k)\}$ is lower bounded, and the sequence $\{x^k\}$ is bounded; i.e., there exists a constant $B > 0$ such that $\|x^k\| < B$ for all $k$.

Proof. The lower boundedness of $\{L_\alpha(x^k)\}$ is due to the lower boundedness of each $f_i$, as each is proper and coercive (Assumption 2). By Lemma 2 and the choice of $\alpha$, $L_\alpha(x^k)$ is nonincreasing and upper bounded by $L_\alpha(x^0) < +\infty$. Hence $\mathbf{1}^T \mathbf{f}(x^k) \le L_\alpha(x^k) \le L_\alpha(x^0)$ implies that $x^k$ is bounded, due to the coercivity of $\mathbf{1}^T \mathbf{f}(x)$ (Assumption 2).

From Lemmas 2 and 3, we immediately obtain the following lemma.

Lemma 4 ($\ell_2$-summability and asymptotic regularity). It holds that $\sum_{k=0}^\infty \|x^{k+1} - x^k\|^2 < +\infty$ and that $\|x^{k+1} - x^k\| \to 0$ as $k \to \infty$. (A sequence $\{a^k\}$ is said to be asymptotically regular if $\|a^{k+1} - a^k\| \to 0$ as $k \to \infty$.)

From the gradient descent interpretation, the result below directly follows.

Lemma 5 (Gradient bound). $\|\nabla L_\alpha(x^k)\| = \frac{1}{\alpha}\|x^{k+1} - x^k\|$.

Based on the above lemmas, we obtain the global convergence of DGD.

Proof of Theorem 1. By Lemma 3, the sequence $\{x^k\}$ is bounded, so there exist a convergent subsequence and a limit point, denoted by $\{x^{k_s}\}_{s \in \mathbb{N}} \to x^*$ as $s \to +\infty$. By Lemmas 2 and 3, $L_\alpha(x^k)$ is monotonically nonincreasing and lower bounded, and therefore $\|x^{k+1} - x^k\| \to 0$ as $k \to \infty$. Based on Lemma 5, $\|\nabla L_\alpha(x^k)\| \to 0$ as $k \to \infty$; in particular, $\|\nabla L_\alpha(x^{k_s})\| \to 0$ as $s \to \infty$. Hence, $\nabla L_\alpha(x^*) = 0$. The running best rate of the sequence $\{\|x^{k+1} - x^k\|^2\}$ follows from [10] or [18]. By Lemma 5, the running best rate of the sequence $\{\|\nabla L_\alpha(x^k)\|^2\}$ is $o(1/k)$. Similar to [2, Theorem 2.9], we can claim the global convergence of the sequence $\{x^k\}_{k \in \mathbb{N}}$ under the KŁ assumption on $L_\alpha$.

Next, we derive a bound on the gradient sequence $\{\nabla f(x^k)\}$, which is used in Proposition 1.

Lemma 6. Under Assumption 2, there exists a point $y^*$ satisfying $\nabla f(y^*) = 0$, and the following bound holds:
$$\|\nabla f(x^k)\| \le L_f(B + \|y^*\|), \quad k \in \mathbb{N},$$
where $B$ is the bound on $\|x^k\|$ given in Lemma 3.
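As an aside, the gradient descent interpretation of Lemma 1 is easy to verify numerically. The sketch below is illustrative and not from the paper: the three-agent path graph, the quadratic local objectives $f_i(x) = \frac{a_i}{2}(x - c_i)^2$, and the mixing weights are all assumed for the example.

```python
import numpy as np

# Hypothetical example: n = 3 agents on a path graph, p = 1,
# with quadratic objectives f_i(x) = 0.5 * a_i * (x - c_i)^2.
a = np.array([1.0, 2.0, 0.5])
c = np.array([0.0, 1.0, -1.0])

# A symmetric doubly stochastic mixing matrix for the path 1-2-3.
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])

grad_f = lambda x: a * (x - c)   # stacked gradient of f(x) = sum_i f_i(x_i)
alpha = 0.1

def dgd_step(x):
    # DGD update: x^{k+1} = W x^k - alpha * grad f(x^k)
    return W @ x - alpha * grad_f(x)

def grad_descent_step(x):
    # gradient step on L_alpha(x) = f(x) + (1/(2*alpha)) <x, (I - W) x>
    grad_L = grad_f(x) + (np.eye(3) - W) @ x / alpha
    return x - alpha * grad_L

x0 = np.array([3.0, -2.0, 0.7])
print(np.allclose(dgd_step(x0), grad_descent_step(x0)))  # prints True
```

The two updates coincide identically, since $x - \alpha\,\nabla L_\alpha(x) = x - \alpha\nabla f(x) - (I-W)x = Wx - \alpha\nabla f(x)$.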
Proof. By the lower boundedness assumption (Assumption 2), a minimizer of $\mathbf{1}^T\mathbf{f}(y)$ exists; let $y^*$ be one. Then, by the Lipschitz differentiability of each $f_i$ (Assumption 2), we have $\nabla f(y^*) = 0$. Hence, for any $k$,
$$\|\nabla f(x^k)\| = \|\nabla f(x^k) - \nabla f(y^*)\| \le L_f\|x^k - y^*\| \le L_f(B + \|y^*\|),$$
where the last inequality uses Lemma 3. This proves the lemma.

B. Proof of Proposition 2

Proof. Note that
$$\|\nabla L_\alpha(x^{k+1})\| \le \|\nabla L_\alpha(x^{k+1}) - \nabla L_\alpha(x^k)\| + \|\nabla L_\alpha(x^k)\| \le L\|x^{k+1} - x^k\| + \frac{1}{\alpha}\|x^{k+1} - x^k\| = \Big(L_f + \frac{2 - \lambda_n(W)}{\alpha}\Big)\|x^{k+1} - x^k\|,$$
where the second inequality holds by Lemma 5 and the Lipschitz continuity of $\nabla L_\alpha$ with constant $L = L_f + \alpha^{-1}(1 - \lambda_n(W))$. Thus, $\{x^k\}$ satisfies the so-called relative error condition listed in [2]. Moreover, by Lemmas 2 and 3, $\{x^k\}$ also satisfies the so-called sufficient decrease and continuity conditions listed in [2]. Under these three conditions and the KŁ property of $L_\alpha$ at $x^*$ with desingularizing function $\psi(s) = cs^{1-\theta}$, following the proof of [2, Lemma 2.6], there exists $k_0 > 0$ such that for all $k \ge k_0$, we have
$$\|x^{k+1} - x^k\| \le \frac{1}{2}\|x^k - x^{k-1}\| + \frac{cb}{a}\Big[\big(L_\alpha(x^k) - L_\alpha(x^*)\big)^{1-\theta} - \big(L_\alpha(x^{k+1}) - L_\alpha(x^*)\big)^{1-\theta}\Big],$$
where $a := \frac{1+\lambda_n(W)}{2\alpha} - \frac{L_f}{2}$ and $b := L_f + \frac{2-\lambda_n(W)}{\alpha}$. Then an easy induction yields
$$\sum_{t=k_0}^{k}\|x^{t+1} - x^t\| \le \|x^{k_0} - x^{k_0-1}\| + \frac{2cb}{a}\Big[\big(L_\alpha(x^{k_0}) - L_\alpha(x^*)\big)^{1-\theta} - \big(L_\alpha(x^{k+1}) - L_\alpha(x^*)\big)^{1-\theta}\Big].$$
Following a derivation similar to the proof of [1, Theorem 5], we can estimate the rate of convergence of $\{x^k\}$ in the different cases of $\theta$.

C. Proof of Proposition 3

In order to prove Proposition 3, we also need the following lemmas.

Lemma 7 ([8]). Let $W^k := \underbrace{W \cdots W}_{k}$ be the $k$-th power of $W$ for any $k \in \mathbb{N}$. Under Assumption 1, it holds that
$$\Big\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big\| \le C\zeta^k$$
for some constant $C > 0$, where $\zeta$ is the second largest magnitude eigenvalue of $W$, as specified in (9).

Lemma 8 ([33]). Let $\{\gamma_k\}$ be a scalar sequence. If $\lim_k \gamma_k = \gamma$ and $0 < \beta < 1$, then $\lim_k \sum_{l=0}^{k}\beta^{k-l}\gamma_l = \frac{\gamma}{1-\beta}$.

Proof of Proposition 3. Unrolling the DGD iteration $x^{k+1} = Wx^k - \alpha_k\nabla f(x^k)$ and writing $\bar{x}^k := \frac{1}{n}\mathbf{1}\mathbf{1}^T x^k$, we note that
$$x^k - \bar{x}^k = \Big(W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big)x^0 - \sum_{j=0}^{k-1}\alpha_j\Big(W^{k-1-j} - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big)\nabla f(x^j).$$
Further, by Lemma 7 and Assumption 3, we obtain
$$\|x^k - \bar{x}^k\| \le C\|x^0\|\zeta^k + CB\sum_{j=0}^{k-1}\alpha_j\zeta^{k-1-j}.$$
Furthermore, by Lemma 8 and the step sizes $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$, we get $\lim_k \|x^k - \bar{x}^k\| = 0$. Let $b_k := (1+k)^\epsilon$. To show the rate of $\|x^k - \bar{x}^k\|$, we only need to show that $\limsup_k b_k\|x^k - \bar{x}^k\| \le C_1$ for some $0 < C_1 < \infty$.
Let $j_k := \big[\frac{k}{2} + \log_\zeta b_k\big]$, where $[x]$ denotes the integer part of $x$ for any $x \in \mathbb{R}$. Note that
$$b_k\|x^k - \bar{x}^k\| \le \underbrace{C\|x^0\|b_k\zeta^k}_{T_1} + \underbrace{CB\,b_k\sum_{j=j_k+1}^{k-1}\alpha_j\zeta^{k-1-j}}_{T_2} + \underbrace{CB\,b_k\sum_{j=0}^{j_k}\alpha_j\zeta^{k-1-j}}_{T_3},$$
where the inequality holds because of the consensus bound derived above. In the following, we estimate these three terms, respectively. First, since $b_k = (1+k)^\epsilon$ grows polynomially while $\zeta^k$ decays geometrically, $T_1 \to 0$ as $k \to \infty$. Second, by the definition of $j_k$, for any $j \le j_k$ we have
$$b_k\zeta^{k-1-j} \le b_k\zeta^{\frac{k}{2} - 1 - \log_\zeta b_k} = \zeta^{\frac{k}{2}-1},$$
so that
$$T_3 \le CB\,\zeta^{\frac{k}{2}-1}\sum_{j=0}^{j_k}\alpha_j \to 0 \quad \text{as } k \to \infty,$$
since $\sum_{j=0}^{j_k}\alpha_j$ grows at most polynomially in $k$. Third, for $j > j_k$,
$$b_k\alpha_j = \alpha\Big(\frac{1+k}{1+j}\Big)^\epsilon \le \alpha\Big(\frac{1+k}{1+j_k}\Big)^\epsilon \le 3^\epsilon\alpha$$
for all sufficiently large $k$ (because $j_k \ge \frac{k}{2} - O(\log k)$), and therefore
$$T_2 \le CB\,3^\epsilon\alpha\sum_{j=j_k+1}^{k-1}\zeta^{k-1-j} \le \frac{CB\,3^\epsilon\alpha}{1-\zeta}.$$
Combining the three estimates, there exists a constant $C_1 > 0$ such that $\limsup_k b_k\|x^k - \bar{x}^k\| \le C_1$. We have completed the proof of this proposition.

D. Proof of Theorem 2

To prove Theorem 2, we first note that, similar to the fixed-step case, the DGD iterates under decreasing step sizes can be rewritten as
$$x^{k+1} = x^k - \alpha_k\nabla L_{\alpha_k}(x^k), \quad \text{where } L_{\alpha_k}(x) := \mathbf{1}^T\mathbf{f}(x) + \frac{1}{2\alpha_k}\|x\|^2_{I-W},$$
and we also need the following lemmas.

Lemma 9 ([34]). Let $\{v_t\}$ be a nonnegative scalar sequence such that $v_{t+1} \le (1 + b_t)v_t - u_t + c_t$ for all $t \in \mathbb{N}$, where $b_t \ge 0$, $u_t \ge 0$ and $c_t \ge 0$ with $\sum_{t=0}^\infty b_t < \infty$ and $\sum_{t=0}^\infty c_t < \infty$. Then the sequence $\{v_t\}$ converges to some $v \ge 0$, and $\sum_{t=0}^\infty u_t < \infty$.

Lemma 10. Let $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$. Then it holds that
$$\frac{1}{\alpha_{k+1}} - \frac{1}{\alpha_k} \le \frac{\epsilon}{\alpha}(1+k)^{\epsilon-1}.$$

Proof. We first prove that
$$(1+x)^\epsilon \le 1 + \epsilon x, \quad x \in [0,1].$$
Let $g(x) = (1+x)^\epsilon - 1 - \epsilon x$. Its derivative $g'(x) = \epsilon(1+x)^{\epsilon-1} - \epsilon \le 0$ for $x \in [0,1]$, which implies $g(x) \le g(0) = 0$ for any $x \in [0,1]$; that is, the inequality holds. Note that
$$\frac{1}{\alpha_{k+1}} - \frac{1}{\alpha_k} = \frac{(2+k)^\epsilon - (1+k)^\epsilon}{\alpha} = \frac{(1+k)^\epsilon}{\alpha}\Big[\Big(1 + \frac{1}{1+k}\Big)^\epsilon - 1\Big] \le \frac{\epsilon}{\alpha}(1+k)^{\epsilon-1},$$
where the last inequality applies the bound above with $x = \frac{1}{1+k}$.

The term $\big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|^2_{I-W}$ appears on the right-hand side of the descent inequality developed in the proof of Theorem 2 below. In order to apply Lemma 9 and then show the convergence of $\{L_{\alpha_k}(x^k)\}$, we need the following lemma to guarantee that this term is summable.

Lemma 11. Let Assumptions 1, 2, and 3 hold. In DGD, use the step sizes $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$. Then the sequence $\big\{\big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|^2_{I-W}\big\}$ is summable, i.e., $\sum_{k=0}^\infty\big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|^2_{I-W} < \infty$.

Proof. Since $(I-W)\bar{x}^{k+1} = 0$, we have
$$\|x^{k+1}\|^2_{I-W} = \|x^{k+1} - \bar{x}^{k+1}\|^2_{I-W} \le (1-\lambda_n(W))\|x^{k+1} - \bar{x}^{k+1}\|^2.$$
By Lemma 10,
$$\Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W} \le \frac{\epsilon(1+k)^{\epsilon-1}}{2\alpha}(1-\lambda_n(W))\|x^{k+1} - \bar{x}^{k+1}\|^2.$$
Furthermore, by Proposition 3, $\|x^{k+1} - \bar{x}^{k+1}\|^2 = O(1/(1+k)^{2\epsilon})$, so the sequence above converges to 0 at the rate $O(1/(1+k)^{1+\epsilon})$, which implies that it is summable.

Lemma 12 (Convergence of weakly summable sequences).
Let $\{\beta_k\}$ and $\{\gamma_k\}$ be two nonnegative scalar sequences such that (a) $\gamma_k = \frac{1}{(1+k)^\epsilon}$ for some $\epsilon \in (0,1]$ and all $k \in \mathbb{N}$; (b) $\sum_{k=0}^\infty \gamma_k\beta_k < \infty$; and (c) $|\beta_{k+1} - \beta_k| \lesssim \gamma_k$, where $\lesssim$ means that $|\beta_{k+1} - \beta_k| \le M\gamma_k$ for some constant $M > 0$. Then $\lim_k \beta_k = 0$.

We call a sequence $\{\beta_k\}$ satisfying conditions (a) and (b) of Lemma 12 a weakly summable sequence, since the sequence itself is not necessarily summable but becomes summable after multiplication by the non-summable, diminishing sequence $\{\gamma_k\}$. In general, it is impossible to claim that $\beta_k$ converges to 0. However, if the distance between two successive elements of $\{\beta_k\}$ is of the same order as the multiplier sequence $\gamma_k$, then we can claim the convergence of $\beta_k$. A special case with $\epsilon = 1/2$ has been observed in [9].

Proof. By condition (b), the tail sums satisfy $\sum_{i=k}^{\infty}\gamma_i\beta_i \to 0$ as $k \to \infty$; in particular, $\sum_{i=k}^{k+k'}\gamma_i\beta_i \to 0$ for any choice of $k' = k'(k) \in \mathbb{N}$. In the following, we show $\lim_k \beta_k = 0$ by contradiction. Assume this is not the case, i.e., $\beta_k \not\to 0$ as $k \to \infty$,
then $\limsup_k \beta_k \ge C > 0$. Thus, for every $N > 0$, there exists a $k > N$ such that $\beta_k > C$. Let
$$k' := \Big[\frac{C(1+k)^\epsilon}{4M}\Big],$$
where $[x]$ denotes the integer part of $x$ for any $x \in \mathbb{R}$. By condition (c), i.e., $|\beta_{j+1} - \beta_j| \le M\gamma_j$ for any $j \in \mathbb{N}$, we have for each $i \in \{0, 1, \ldots, k'\}$
$$\beta_{k+i} \ge \beta_k - \sum_{j=k}^{k+i-1}M\gamma_j \ge C - k'\cdot\frac{M}{(1+k)^\epsilon} \ge C - \frac{C}{4} = \frac{3C}{4}.$$
Hence,
$$\sum_{j=k}^{k+k'}\gamma_j\beta_j \ge \frac{3C}{4}\sum_{j=k}^{k+k'}\frac{1}{(1+j)^\epsilon} \ge \begin{cases} \dfrac{3C}{4}\cdot\dfrac{(2+k+k')^{1-\epsilon} - (2+k)^{1-\epsilon}}{1-\epsilon}, & \epsilon \in (0,1),\\[2mm] \dfrac{3C}{4}\big(\ln(2+k+k') - \ln(2+k)\big), & \epsilon = 1. \end{cases}$$
When $\epsilon = 1$, $k' \approx \frac{C(1+k)}{4M}$, so $\ln\frac{2+k+k'}{2+k} \to \ln\big(1 + \frac{C}{4M}\big) > 0$; when $\epsilon \in (0,1)$, the mean value theorem and the specific form of $k'$ show that the right-hand side is likewise bounded below by a positive constant for all large $k$. As a consequence, $\sum_{j=k}^{k+k'}\gamma_j\beta_j$ does not go to 0 as $k \to \infty$, which contradicts the tail convergence at the beginning of the proof. Therefore, $\lim_k \beta_k = 0$.

Proof of Theorem 2. We first develop the inequality
$$L_{\alpha_{k+1}}(x^{k+1}) \le L_{\alpha_k}(x^k) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W} - \Big(\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2 \quad (*)$$
and then claim the convergence of the sequences $\{L_{\alpha_k}(x^k)\}$, $\{\mathbf{1}^T\mathbf{f}(x^k)\}$ and $\{x^k\}$ based on this inequality.

(a) Development of $(*)$: From $x^{k+1} = x^k - \alpha_k\nabla L_{\alpha_k}(x^k)$, it follows that
$$\langle \nabla L_{\alpha_k}(x^k), x^{k+1} - x^k\rangle = -\frac{1}{\alpha_k}\|x^{k+1} - x^k\|^2.$$
Since $\sum_{i=1}^n f_i(x_i)$ is $L_f$-Lipschitz differentiable, $\nabla L_{\alpha_k}$ is Lipschitz with the constant $L_k := L_f + \alpha_k^{-1}\lambda_{\max}(I-W) = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, implying
$$L_{\alpha_k}(x^{k+1}) \le L_{\alpha_k}(x^k) + \langle \nabla L_{\alpha_k}(x^k), x^{k+1} - x^k\rangle + \frac{L_k}{2}\|x^{k+1} - x^k\|^2 = L_{\alpha_k}(x^k) - \Big(\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Moreover,
$$L_{\alpha_{k+1}}(x^{k+1}) = L_{\alpha_k}(x^{k+1}) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W}.$$
Combining the two relations yields $(*)$.

(b) Convergence of the objective sequence: By Lemma 11 and Lemma 9, $(*)$ yields the convergence of $\{L_{\alpha_k}(x^k)\}$ and
$$\sum_{k=0}^\infty\Big(\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2 < \infty,$$
which implies that $\|x^{k+1} - x^k\|$ converges to 0 and $\{x^k\}$ is asymptotically regular. Moreover, notice that
$$\frac{1}{2\alpha_k}\|x^k\|^2_{I-W} = \frac{1}{2\alpha_k}\|x^k - \bar{x}^k\|^2_{I-W} \le \frac{(1-\lambda_n(W))(1+k)^\epsilon}{2\alpha}\|x^k - \bar{x}^k\|^2.$$
By Proposition 3, this term converges to 0 as $k \to \infty$. As a consequence,
$$\lim_{k\to\infty}\mathbf{1}^T\mathbf{f}(x^k) = \lim_{k\to\infty}\Big(L_{\alpha_k}(x^k) - \frac{1}{2\alpha_k}\|x^k\|^2_{I-W}\Big) = \lim_{k\to\infty}L_{\alpha_k}(x^k).$$

(c) Convergence to a stationary point: Let $\overline{\nabla f}(x^k) := \frac{1}{n}\mathbf{1}\mathbf{1}^T\nabla f(x^k)$. By the specific form of $\alpha_k$, we have
$$\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2} \ge \frac{1+\lambda_n(W)}{4\alpha_k}$$
for all $k > k_0$, where $k_0$ is the integer part of $\big(\frac{2\alpha L_f}{1+\lambda_n(W)}\big)^{1/\epsilon}$. Note also that
$$\bar{x}^{k+1} - \bar{x}^k = \frac{1}{n}\mathbf{1}\mathbf{1}^T(x^{k+1} - x^k), \quad \text{so } \|\bar{x}^{k+1} - \bar{x}^k\| \le \|x^{k+1} - x^k\|.$$
Thus, the summability bound in (b) yields
$$\sum_{k=0}^\infty\frac{1}{\alpha_k}\|\bar{x}^{k+1} - \bar{x}^k\|^2 < \infty.$$
By the DGD iteration and the double stochasticity of $W$, we have $\bar{x}^{k+1} - \bar{x}^k = -\alpha_k\,\overline{\nabla f}(x^k)$.
Plugging this identity into the summability bound above yields
$$\sum_{k=0}^\infty \alpha_k\|\overline{\nabla f}(x^k)\|^2 < \infty.$$
Moreover,
$$\Big|\|\overline{\nabla f}(x^{k+1})\|^2 - \|\overline{\nabla f}(x^k)\|^2\Big| \le \big(\|\overline{\nabla f}(x^{k+1})\| + \|\overline{\nabla f}(x^k)\|\big)\Big|\|\overline{\nabla f}(x^{k+1})\| - \|\overline{\nabla f}(x^k)\|\Big| \le 2B\big\|\overline{\nabla f}(x^{k+1}) - \overline{\nabla f}(x^k)\big\| \le 2B\big\|\nabla f(x^{k+1}) - \nabla f(x^k)\big\| \le 2BL_f\|x^{k+1} - x^k\|,$$
where the second inequality holds by the bounded gradient assumption (Assumption 3), the third inequality holds by the
specific form of $\overline{\nabla f}$, and the last inequality holds by the Lipschitz continuity of $\nabla f$. Note that
$$\|x^{k+1} - x^k\| \le \|x^{k+1} - \bar{x}^{k+1}\| + \|\bar{x}^{k+1} - \bar{x}^k\| + \|\bar{x}^k - x^k\| \le \|x^{k+1} - \bar{x}^{k+1}\| + \|\bar{x}^k - x^k\| + \alpha_k\|\overline{\nabla f}(x^k)\| \lesssim \alpha_k,$$
where the first inequality is the triangle inequality together with the identity $\bar{x}^{k+1} - \bar{x}^k = -\alpha_k\overline{\nabla f}(x^k)$, and the last relation holds by Proposition 3 and the bounded gradient assumption. Thus, the two displays above imply
$$\Big|\|\overline{\nabla f}(x^{k+1})\|^2 - \|\overline{\nabla f}(x^k)\|^2\Big| \lesssim \alpha_k.$$
Applying Lemma 12 with $\beta_k := \|\overline{\nabla f}(x^k)\|^2$ and $\gamma_k := \frac{1}{(1+k)^\epsilon}$, which is justified by the specific form of $\alpha_k$ and the two bounds above, it holds that
$$\lim_{k\to\infty}\|\overline{\nabla f}(x^k)\| = 0.$$
As a consequence, $\lim_{k\to\infty}\mathbf{1}^T\nabla f(x^k) = 0$. Furthermore, by the coercivity of each $f_i$ and the convergence of $\{\mathbf{1}^T\mathbf{f}(x^k)\}$, $\{x^k\}$ is bounded. Therefore, there exists a convergent subsequence of $\{x^k\}$. Let $x^*$ be any limit point of $\{x^k\}$. By the display above and the continuity of $\nabla f$, it holds that $\mathbf{1}^T\nabla f(x^*) = 0$. Moreover, by Proposition 3, $x^*$ is consensual. As a consequence, $x^*$ is a stationary point of the original problem. In addition, if $x^*$ is isolated, then by the asymptotic regularity of $\{x^k\}$ (established in (b)), $\{x^k\}$ converges to $x^*$.

E. Proof of Proposition 4

To prove Proposition 4, we need the following lemmas.

Lemma 13 (Accumulated consensus of iterates). Under the conditions of Proposition 3, we have
$$\sum_{k=0}^{K}\alpha_k\|x^{k+1} - \bar{x}^{k+1}\| \le D_1 + D_2\sum_{k=0}^{K}\alpha_k^2,$$
where $D_1 = \frac{C\|x^0\|\zeta^2}{2(1-\zeta^2)}$, $D_2 = \frac{C\|x^0\|}{2} + \frac{CB}{1-\zeta}$, and $B$ is specified in Assumption 3.

Proof. By the consensus bound established in the proof of Proposition 3,
$$\sum_{k=0}^{K}\alpha_k\|x^{k+1} - \bar{x}^{k+1}\| \le C\|x^0\|\sum_{k=0}^{K}\alpha_k\zeta^{k+1} + CB\sum_{k=0}^{K}\alpha_k\sum_{j=0}^{k}\alpha_j\zeta^{k-j}.$$
In the following, we estimate the two terms on the right-hand side, respectively. By Young's inequality $ab \le \frac{a^2+b^2}{2}$,
$$\sum_{k=0}^{K}\alpha_k\zeta^{k+1} \le \frac{1}{2}\sum_{k=0}^{K}\alpha_k^2 + \frac{1}{2}\sum_{k=0}^{K}\zeta^{2(k+1)} \le \frac{1}{2}\sum_{k=0}^{K}\alpha_k^2 + \frac{\zeta^2}{2(1-\zeta^2)},$$
and
$$\sum_{k=0}^{K}\alpha_k\sum_{j=0}^{k}\alpha_j\zeta^{k-j} \le \frac{1}{2}\sum_{k=0}^{K}\sum_{j=0}^{k}\zeta^{k-j}\big(\alpha_k^2 + \alpha_j^2\big) \le \frac{1}{1-\zeta}\sum_{k=0}^{K}\alpha_k^2.$$
Plugging the two estimates into the first display yields the claim.

Besides Lemma 13, we also need the following two lemmas, which have appeared in the literature (cf. [8]).

Lemma 14 ([8]). Let $\gamma_k = \frac{1}{(1+k)^\epsilon}$ for some $0 < \epsilon \le 1$. Then the following hold:
(a) if $0 < \epsilon < 1/2$: $\sum_{k=0}^{K}\gamma_k = \Theta(K^{1-\epsilon})$, $\sum_{k=0}^{K}\gamma_k^2 = \Theta(K^{1-2\epsilon})$, and $\frac{\sum_{k=0}^{K}\gamma_k^2}{\sum_{k=0}^{K}\gamma_k} = O\big(\frac{1}{K^{\epsilon}}\big)$;
(b) if $\epsilon = 1/2$: $\sum_{k=0}^{K}\gamma_k = \Theta(\sqrt{K})$, $\sum_{k=0}^{K}\gamma_k^2 = \Theta(\ln K)$, and the ratio is $O\big(\frac{\ln K}{\sqrt{K}}\big)$;
(c) if $1/2 < \epsilon < 1$: $\sum_{k=0}^{K}\gamma_k = \Theta(K^{1-\epsilon})$, $\sum_{k=0}^{K}\gamma_k^2 = O(1)$, and the ratio is $O\big(\frac{1}{K^{1-\epsilon}}\big)$;
(d) if $\epsilon = 1$: $\sum_{k=0}^{K}\gamma_k = \Theta(\ln K)$, $\sum_{k=0}^{K}\gamma_k^2 = O(1)$, and the ratio is $O\big(\frac{1}{\ln K}\big)$.

Lemma 15 ([8, Proposition 3]). Let $h: \mathbb{R}^d \to \mathbb{R}$ be a convex, continuously differentiable function whose gradient is Lipschitz continuous with constant $L_h$.
Then, for any $x, y, u \in \mathbb{R}^d$,
$$h(u) \ge h(x) + \langle \nabla h(y), u - x\rangle - \frac{L_h}{2}\|x - y\|^2.$$

Proof of Proposition 4. To prove this proposition, we first develop the basic inequality
$$L_{\alpha_k}(x^{k+1}) - L_{\alpha_k}(u) \le \frac{1}{2\alpha_k}\big(\|x^k - u\|^2 - \|x^{k+1} - u\|^2\big) \quad \text{for any } u \in \mathbb{R}^{n\times p}.$$
By Lemma 15 (applied to $L_{\alpha_k}$ with $x \to x^{k+1}$ and $y \to x^k$), we have
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \langle \nabla L_{\alpha_k}(x^k), u - x^{k+1}\rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2,$$
where $L_k = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, and by the gradient descent interpretation, we have $\nabla L_{\alpha_k}(x^k) = \frac{1}{\alpha_k}(x^k - x^{k+1})$. Then the inequality above implies
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2.$$
Note that by the specific form $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$, there exists an integer $k_0 > 0$ such that $L_k\alpha_k \le 1$ for all $k > k_0$. Actually, for the simplicity of the proof, we can take $\alpha < \frac{\lambda_n(W)}{L_f}$ so that $L_k\alpha_k \le 1$ holds from the initial step. Thus, the inequality above implies
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{1}{2\alpha_k}\|x^{k+1} - x^k\|^2.$$
Recall that for any vectors $a$, $b$ and $c$, it holds that $2\langle a - b, c - b\rangle = \|a - b\|^2 + \|c - b\|^2 - \|a - c\|^2$. Applying this identity with $a = x^k$, $b = x^{k+1}$ and $c = u$ gives
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{2\alpha_k}\big(\|u - x^{k+1}\|^2 - \|u - x^k\|^2\big).$$
As a consequence, we get the basic inequality.

Note that the optimal solution $x^{\mathrm{opt}}$ is consensual and thus $\|x^{\mathrm{opt}}\|^2_{I-W} = 0$; therefore, $L_{\alpha_k}(x^{\mathrm{opt}}) = \mathbf{1}^T\mathbf{f}(x^{\mathrm{opt}}) = f^{\mathrm{opt}}$. By the basic inequality with $u = x^{\mathrm{opt}}$, we have
$$\alpha_k\big(L_{\alpha_k}(x^{k+1}) - f^{\mathrm{opt}}\big) \le \frac{1}{2}\big(\|x^k - x^{\mathrm{opt}}\|^2 - \|x^{k+1} - x^{\mathrm{opt}}\|^2\big).$$
Summing the above inequality over $k = 0, 1, \ldots, K$ yields
$$\sum_{k=0}^{K}\alpha_k\big(L_{\alpha_k}(x^{k+1}) - f^{\mathrm{opt}}\big) \le \frac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2.$$
Moreover, noting that $L_{\alpha_k}(\bar{x}^{k+1}) = \mathbf{1}^T\mathbf{f}(\bar{x}^{k+1})$ (consensual points incur no penalty) and using the convexity of $L_{\alpha_k}$,
$$L_{\alpha_k}(x^{k+1}) \ge L_{\alpha_k}(\bar{x}^{k+1}) + \langle \nabla L_{\alpha_k}(\bar{x}^{k+1}), x^{k+1} - \bar{x}^{k+1}\rangle \ge \mathbf{1}^T\mathbf{f}(\bar{x}^{k+1}) - B\|x^{k+1} - \bar{x}^{k+1}\|,$$
where the second inequality holds by the bounded gradient assumption (Assumption 3), since $\nabla L_{\alpha_k}(\bar{x}^{k+1}) = \nabla f(\bar{x}^{k+1})$. Plugging this into the summed inequality yields
$$\sum_{k=0}^{K}\alpha_k\big(\mathbf{1}^T\mathbf{f}(\bar{x}^{k+1}) - f^{\mathrm{opt}}\big) \le \frac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + B\sum_{k=0}^{K}\alpha_k\|x^{k+1} - \bar{x}^{k+1}\|.$$
By the definition of $\hat{f}^K$, the above then implies
$$\hat{f}^K - f^{\mathrm{opt}} \le \frac{\frac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + B\sum_{k=0}^{K}\alpha_k\|x^{k+1} - \bar{x}^{k+1}\|}{\sum_{k=0}^{K}\alpha_k} \le \frac{D_3 + D_4\sum_{k=0}^{K}\alpha_k^2}{\sum_{k=0}^{K}\alpha_k},$$
where $D_3 = \frac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + BD_1$ and $D_4 = BD_2$, with $D_1$ and $D_2$ specified in Lemma 13; the second inequality holds by Lemma 13. Furthermore, by Lemma 14, we get the claims of this proposition.

F. Proofs of Theorem 3 and Proposition 5

In order to prove Theorem 3, we need the following lemmas.

Lemma 16 (Sufficient descent of $\{\hat{L}_\alpha(x^k)\}$). Let Assumptions 1 and 4 hold. Results are given in two cases below.
Case 1: the $r_i$'s are convex. Set $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$. Then for all $k \in \mathbb{N}$,
$$\hat{L}_\alpha(x^{k+1}) \le \hat{L}_\alpha(x^k) - \Big(\frac{1+\lambda_n(W)}{2\alpha} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Case 2: the $r_i$'s are not necessarily convex; in this case, we assume $\lambda_n(W) > 0$. Set $0 < \alpha < \frac{\lambda_n(W)}{L_f}$. Then for all $k \in \mathbb{N}$,
$$\hat{L}_\alpha(x^{k+1}) \le \hat{L}_\alpha(x^k) - \Big(\frac{\lambda_n(W)}{2\alpha} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$

Proof. Recall from Lemma 2 that $\nabla L_\alpha$ is $L$-Lipschitz continuous with $L = L_f + \alpha^{-1}(1-\lambda_n(W))$, and thus
$$\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) = L_\alpha(x^{k+1}) - L_\alpha(x^k) + r(x^{k+1}) - r(x^k) \le \langle \nabla L_\alpha(x^k), x^{k+1} - x^k\rangle + \frac{L}{2}\|x^{k+1} - x^k\|^2 + r(x^{k+1}) - r(x^k).$$
Case 1: From the convexity of $r$ and the optimality condition of the proximal step, it follows that
$$0 = \xi^{k+1} + \frac{1}{\alpha}(x^{k+1} - x^k) + \nabla L_\alpha(x^k), \quad \xi^{k+1} \in \partial r(x^{k+1}).$$
This and the convexity of $r$ further give us
$$r(x^{k+1}) - r(x^k) \le \langle \xi^{k+1}, x^{k+1} - x^k\rangle = -\frac{1}{\alpha}\|x^{k+1} - x^k\|^2 - \langle \nabla L_\alpha(x^k), x^{k+1} - x^k\rangle.$$
Substituting this inequality into the descent estimate above and then expanding $L = L_f + \alpha^{-1}(1-\lambda_n(W))$ yield
$$\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) \le -\Big(\frac{1}{\alpha} - \frac{L}{2}\Big)\|x^{k+1} - x^k\|^2 = -\Big(\frac{1+\lambda_n(W)}{2\alpha} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Sufficient descent requires the last coefficient to be positive; thus $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$.
Case 2: From the proximal step, it follows that the function
$$u \mapsto r(u) + \frac{1}{2\alpha}\big\|u - \big(x^k - \alpha\nabla L_\alpha(x^k)\big)\big\|^2$$
reaches its minimum at $u = x^{k+1}$. Comparing the values of this function at $x^{k+1}$ and $x^k$ yields
$$r(x^{k+1}) - r(x^k) \le \frac{1}{2\alpha}\big\|x^k - \big(x^k - \alpha\nabla L_\alpha(x^k)\big)\big\|^2 - \frac{1}{2\alpha}\big\|x^{k+1} - \big(x^k - \alpha\nabla L_\alpha(x^k)\big)\big\|^2 = -\frac{1}{2\alpha}\|x^{k+1} - x^k\|^2 - \langle \nabla L_\alpha(x^k), x^{k+1} - x^k\rangle.$$
Substituting this inequality into the descent estimate above and expanding $L$ yield
$$\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) \le -\Big(\frac{1}{2\alpha} - \frac{L}{2}\Big)\|x^{k+1} - x^k\|^2 = -\Big(\frac{\lambda_n(W)}{2\alpha} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Hence, sufficient descent requires $0 < \alpha < \frac{\lambda_n(W)}{L_f}$.
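Both cases of Lemma 16 concern the proximal step of Prox-DGD. For concreteness, here is an illustrative sketch (not from the paper; the test vector and threshold are arbitrary) of two proximable choices of $r_i$ mentioned earlier: the convex $\ell_1$ norm, whose proximal operator is soft-thresholding, and the nonconvex $\ell_0$ quasi-norm, whose proximal operator is hard-thresholding.

```python
import numpy as np

def prox_l1(v, t):
    # prox_{t*||.||_1}(v): soft-thresholding (convex case)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l0(v, t):
    # prox_{t*||.||_0}(v): hard-thresholding (nonconvex case);
    # keeps an entry v_i iff v_i^2 / 2 > t (ties are set-valued; we zero them)
    return np.where(v ** 2 > 2.0 * t, v, 0.0)

v = np.array([1.5, -0.3, 0.05, -2.0])
print(prox_l1(v, 0.5))   # approximately [1.0, 0.0, 0.0, -1.5]
print(prox_l0(v, 0.5))   # approximately [1.5, 0.0, 0.0, -2.0]
```

Soft-thresholding shrinks every surviving entry by $t$, while hard-thresholding leaves survivors untouched; this is the qualitative difference that forces the smaller step-size range $\alpha < \lambda_n(W)/L_f$ in the nonconvex case.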
Lemma 17 (Boundedness). Under the conditions of Lemma 16, the sequence $\{\hat{L}_\alpha(x^k)\}$ is lower bounded, and the sequence $\{x^k\}$ is bounded.

Proof. The lower boundedness of $\{\hat{L}_\alpha(x^k)\}$ is due to Assumption 4. By Lemma 16 and under a proper step size, $\hat{L}_\alpha(x^k)$ is nonincreasing and upper bounded by $\hat{L}_\alpha(x^0)$. Hence, $\sum_{i=1}^n\big(f_i(x_i^k) + r_i(x_i^k)\big)$ is upper bounded by $\hat{L}_\alpha(x^0)$. Consequently, $\{x^k\}$ is bounded, due to the coercivity of each $f_i + r_i$ (see Assumption 4).

Lemma 18 (Bounded subgradient). Let $\partial\hat{L}_\alpha(x^{k+1})$ denote the limiting subdifferential of $\hat{L}_\alpha$ at $x^{k+1}$, which is assumed to exist for all $k \in \mathbb{N}$. Then there exists $g^{k+1} \in \partial\hat{L}_\alpha(x^{k+1})$ such that
$$\|g^{k+1}\| \le \Big(L_f + \frac{2-\lambda_n(W)}{\alpha}\Big)\|x^{k+1} - x^k\|.$$

Proof. By the Prox-DGD iteration, the following optimality condition holds:
$$0 \in \frac{1}{\alpha}(x^{k+1} - x^k) + \nabla L_\alpha(x^k) + \partial r(x^{k+1}),$$
where $\partial r(x^{k+1})$ denotes the limiting subdifferential of $r$ at $x^{k+1}$. For the $\xi^{k+1} \in \partial r(x^{k+1})$ realizing this condition, it follows that
$$\nabla L_\alpha(x^{k+1}) + \xi^{k+1} = \frac{1}{\alpha}(x^k - x^{k+1}) + \nabla L_\alpha(x^{k+1}) - \nabla L_\alpha(x^k),$$
which immediately yields
$$\|\nabla L_\alpha(x^{k+1}) + \xi^{k+1}\| \le \frac{1}{\alpha}\|x^{k+1} - x^k\| + L\|x^{k+1} - x^k\| = \Big(L_f + \frac{2-\lambda_n(W)}{\alpha}\Big)\|x^{k+1} - x^k\|.$$
Since $g^{k+1} := \nabla L_\alpha(x^{k+1}) + \xi^{k+1} \in \partial\hat{L}_\alpha(x^{k+1})$, the claim of Lemma 18 holds.

Based on Lemmas 16-18, we can easily prove Theorem 3 and Proposition 5.

Proof of Theorem 3. The proof of this theorem is similar to that of Theorem 1 and thus is omitted.

Proof of Proposition 5. The proof is similar to that of Proposition 2. We note, however, that in the analogue of the key inequality there, $a = \frac{1+\lambda_n(W)}{2\alpha} - \frac{L_f}{2}$ if the $r_i$'s are convex, while $a = \frac{\lambda_n(W)}{2\alpha} - \frac{L_f}{2}$ if the $r_i$'s are not necessarily convex and $\lambda_n(W) > 0$.

G. Proofs of Theorem 4 and Proposition 6

Based on the Prox-DGD iteration, we derive the following recursion for the iterates of Prox-DGD, which is similar to the DGD recursion.

Lemma 19 (Recursion of $\{x^k\}$). For any $k \in \mathbb{N}$,
$$x^k = W^k x^0 - \sum_{j=0}^{k-1}\alpha_j W^{k-1-j}\big(\nabla f(x^j) + \xi^{j+1}\big),$$
where $\xi^{j+1} \in \partial r(x^{j+1})$ is the subgradient determined by the proximal operator, for any $j = 0, \ldots, k-1$.

Proof. By the definition of the proximal operator, the Prox-DGD iteration implies
$$x^{k+1} + \alpha_k\xi^{k+1} = Wx^k - \alpha_k\nabla f(x^k),$$
where $\xi^{k+1} \in \partial r(x^{k+1})$, and thus
$$x^{k+1} = Wx^k - \alpha_k\big(\nabla f(x^k) + \xi^{k+1}\big).$$
From this, we can easily derive the recursion by induction.

Proof of Proposition 6. The proof of this proposition is similar to that of Proposition 3.
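The proofs of Propositions 3 and 6 both rest on the geometric mixing bound of Lemma 7, $\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\| \le C\zeta^k$. A small numerical sketch (the ring graph and Metropolis-type weights are illustrative assumptions, not from the paper): for a symmetric doubly stochastic $W$, the spectral theorem gives $\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\|_2 = \zeta^k$ exactly, i.e., the constant $C$ can be taken as 1.

```python
import numpy as np

n = 5
# Metropolis-type weights for a ring graph: 1/3 to each neighbor, 1/3 to self.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    W[i, i] = 1.0 / 3.0

# zeta: second largest magnitude eigenvalue of W
zeta = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1][1]
avg = np.ones((n, n)) / n   # the averaging matrix (1/n) 1 1^T

for k in [1, 5, 10, 20]:
    err = np.linalg.norm(np.linalg.matrix_power(W, k) - avg, 2)
    print(k, err, zeta ** k)  # err matches zeta**k here (symmetric W, C = 1)
```

This geometric decay of `err` is exactly what drives the consensus-rate estimates: the consensus error is a $\zeta$-discounted sum of past (sub)gradients.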
It only needs to be noted that the subgradient term $\nabla f(x^j) + \xi^{j+1}$ is uniformly bounded by the constant $B$ for any $j$. Thus, we omit the details here.

To prove Theorem 4, we still need the following lemmas.

Lemma 20. Let Assumptions 1 and 4 hold. In Prox-DGD, use the step sizes $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$. Results are given in two cases below.
Case 1: the $r_i$'s are convex. For any $k \in \mathbb{N}$,
$$\hat{L}_{\alpha_{k+1}}(x^{k+1}) \le \hat{L}_{\alpha_k}(x^k) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W} - \Big(\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Case 2: the $r_i$'s are not necessarily convex. For any $k \in \mathbb{N}$,
$$\hat{L}_{\alpha_{k+1}}(x^{k+1}) \le \hat{L}_{\alpha_k}(x^k) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W} - \Big(\frac{\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$

Proof. The proof of this lemma is similar to that of Lemma 16, after noting that
$$\hat{L}_{\alpha_{k+1}}(x^{k+1}) = \hat{L}_{\alpha_k}(x^k) + \big(\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k)\big) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W},$$
where the term $\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k)$ can be estimated as in the proof of Lemma 16.

Lemma 21. Let Assumptions 1, 4, and 5 hold. In Prox-DGD, use the step sizes $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$. If, further, each $f_i$ and $r_i$ is convex, then for any $u \in \mathbb{R}^{n\times p}$, we have
$$\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(u) \le \frac{1}{2\alpha_k}\big(\|x^k - u\|^2 - \|x^{k+1} - u\|^2\big).$$

Proof. By Lemma 15, we have
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \langle \nabla L_{\alpha_k}(x^k), u - x^{k+1}\rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2,$$
where $L_k = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, and by the convexity of $r$, we have
$$r(u) \ge r(x^{k+1}) + \langle \xi^{k+1}, u - x^{k+1}\rangle,$$
where $\xi^{k+1} \in \partial r(x^{k+1})$ is the subgradient determined by the proximal operator. By the Prox-DGD update, it follows that
$$\xi^{k+1} = \frac{1}{\alpha_k}(x^k - x^{k+1}) - \nabla L_{\alpha_k}(x^k).$$
Plugging the expression for $\xi^{k+1}$ into the convexity inequality for $r$, and then summing the two inequalities, yields
$$\hat{L}_{\alpha_k}(u) \ge \hat{L}_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2.$$
Similar to the rest of the proof of the basic inequality in Proposition 4, we can prove this lemma based on the inequality above.

Proof of Theorem 4. Based on Lemma 20 and Lemma 21, we can prove Theorem 4. The proofs of Theorem 4(a)-(c) are similar to that of Theorem 2, while the proof of Theorem 4(d) is very similar to that of Proposition 4; thus, the details are omitted.

VI. CONCLUSION

In this paper, we study the convergence behavior of the algorithm DGD for smooth, possibly nonconvex consensus optimization, considering both fixed and decreasing step sizes. When a fixed step size is used, we show that the iterates of DGD converge to a stationary point of a Lyapunov function, which approximates a stationary point of the original problem. Moreover, we bound the deviation between each local point and the global average; the bound is proportional to the step size and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of the mixing matrix. This motivates us to study DGD with decreasing step sizes. In that setting, we show that the iterates of DGD reach consensus asymptotically at a sublinear rate and converge to a stationary point of the original problem. We also estimate the convergence rates of the objective sequence in the convex setting under different diminishing step-size strategies. Furthermore, we extend these convergence results to Prox-DGD, which is designed to minimize the sum of a differentiable function and a proximable function; both functions can be nonconvex. If the proximable function is convex, a larger fixed step size is allowed. These results are obtained by applying both existing and new proof techniques.

ACKNOWLEDGMENTS

The work of J. Zeng has been supported in part by the NSF grants 66036, and the Doctoral start-up foundation of Jiangxi Normal University. The work of W.
Yin has been supported in part by the NSF grant ECCS and ONR grants N and N.

REFERENCES

[1] H. Attouch and J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Math. Program., 116: 5-16, 2009.
[2] H. Attouch, J. Bolte and B. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Math. Program., Ser. A, 137: 91-129, 2013.
[3] P. Bianchi and J. Jakubowicz, Convergence of a multi-agent projected stochastic gradient algorithm for nonconvex optimization, IEEE Trans. Automatic Control, 58(2): 391-405, 2013.
[4] P. Bianchi, G. Fort and W. Hachem, Performance of a distributed stochastic approximation algorithm, IEEE Trans. Information Theory, 59, 2013.
[5] J. Bolte, A. Daniilidis and A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization, 17(4): 1205-1223, 2007.
[6] A. Chen and A. Ozdaglar, A fast distributed proximal gradient method, in Proc. 50th Allerton Conf. Commun., Control Comput., Monticello, IL, Oct. 2012.
[7] T. Chang, M. Hong and X. Wang, Multi-agent distributed optimization via inexact consensus ADMM, IEEE Trans. Signal Process., 63, 2015.
[8] A. Chen, Fast Distributed First-Order Methods, Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 2012.
[9] Y.T. Chow, T. Wu and W. Yin, Cyclic coordinate update algorithms for fixed-point problems: analysis and applications, UCLA CAM Report 16-78, 2016.
[10] W. Deng, M. Lai, Z. Peng and W. Yin, Parallel multi-block ADMM with o(1/k) convergence, Journal of Scientific Computing, 2016.
[11] E. Hazan, K.Y. Levy and S. Shalev-Shwartz, On graduated optimization for stochastic nonconvex problems, in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016.
[12] M. Hardt, B. Recht and Y.
Singer, Train faster, generalize better: stability of stochastic gradient descent, in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016.
[13] D. Hajinezhad, M. Hong and A. Garcia, ZENITH: a zeroth-order distributed algorithm for multi-agent nonconvex optimization, technical report.
[14] M. Hong, Z. Luo and M. Razaviyayn, Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems, ICASSP 2015.
[15] S. Hosseini, A. Chapman and M. Mesbahi, Online distributed optimization on dynamic networks, IEEE Trans. Automatic Control, 61, 2016.
[16] D. Jakovetic, J. Xavier and J. Moura, Fast distributed gradient methods, IEEE Trans. Automatic Control, 59(5): 1131-1146, 2014.
[17] D. Kempe, A. Dobra and J. Gehrke, Gossip-based computation of aggregate information, in Proc. 44th Annual IEEE Symposium on Foundations of Computer Science, 482-491, IEEE Computer Society, 2003.
[18] K. Knopp, Infinite Sequences and Series, Courier Corporation, 1956.
[19] J. Lafond, H. Wai and E. Moulines, D-FW: communication efficient distributed algorithms for high-dimensional sparse optimization, ICASSP 2016.
[20] S. Lee and A. Nedic, Distributed random projection algorithm for convex optimization, IEEE J. Sel. Topics Signal Process., 7: 221-229, 2013.
[21] Q. Ling and Z. Tian, Decentralized sparse signal recovery for compressive sleeping wireless sensor networks, IEEE Trans. Signal Process., 58(7), 2010.
[22] S. Łojasiewicz, Sur la géométrie semi- et sous-analytique, Ann. Inst. Fourier (Grenoble), 43(5): 1575-1595, 1993.
[23] P.D. Lorenzo and G. Scutari, NEXT: in-network nonconvex optimization, IEEE Trans. Signal and Information Processing over Networks, 2(2): 120-136, 2016.
[24] P.D. Lorenzo and G. Scutari, Distributed nonconvex optimization over time-varying networks, ICASSP 2016.
[25] I. Matei and J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies, IEEE J. Sel. Top. Signal Process., 5: 754-771, 2011.
[26] H. McMahan and M.
Streeter, Delay-tolerant algorithms for asynchronous distributed online learning, in Advances in Neural Information Processing Systems (NIPS), 2014.
[27] G. Mateos, J. Bazerque and G. Giannakis, Distributed sparse linear regression, IEEE Trans. Signal Process., 58(10), 2010.
[28] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Automatic Control, 54(1): 48-61, 2009.
[29] A. Nedic and A. Olshevsky, Distributed optimization over time-varying directed graphs, IEEE Trans. Automatic Control, 60(3): 601-615, 2015.
[30] M. Nevelson and R.Z. Khasminskii, Stochastic Approximation and Recursive Estimation [translated from the Russian by the Israel Program for Scientific Translations; translation edited by B. Silver], American Mathematical Society, 1973.
[31] G. Qu and N. Li, Harnessing smoothness to accelerate distributed optimization, IEEE Transactions on Control of Network Systems, 2017.
[32] M. Raginsky, N. Kiarashi and R. Willett, Decentralized online convex programming with local information, in Proc. 2011 American Control Conference, San Francisco, CA, USA, 2011.
More informationA Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming
A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose
More informationA Unified Approach to Proximal Algorithms using Bregman Distance
A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationDistributed Optimization over Random Networks
Distributed Optimization over Random Networks Ilan Lobel and Asu Ozdaglar Allerton Conference September 2008 Operations Research Center and Electrical Engineering & Computer Science Massachusetts Institute
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More informationarxiv: v3 [math.oc] 8 Jan 2019
Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationDistributed Optimization over Networks Gossip-Based Algorithms
Distributed Optimization over Networks Gossip-Based Algorithms Angelia Nedić angelia@illinois.edu ISE Department and Coordinated Science Laboratory University of Illinois at Urbana-Champaign Outline Random
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J. OPTIM. Vol. 5, No., pp. 944 966 c 05 Society for Industrial and Applied Mathematics EXTRA: AN EXACT FIRST-ORDER ALGORITHM FOR DECENTRALIZED CONSENSUS OPTIMIZATION WEI SHI, QING LING, GANG WU, AND
More informationIterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming
Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More informationON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction
J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties
More informationAn asymptotic ratio characterization of input-to-state stability
1 An asymptotic ratio characterization of input-to-state stability Daniel Liberzon and Hyungbo Shim Abstract For continuous-time nonlinear systems with inputs, we introduce the notion of an asymptotic
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationNew hybrid conjugate gradient methods with the generalized Wolfe line search
Xu and Kong SpringerPlus (016)5:881 DOI 10.1186/s40064-016-5-9 METHODOLOGY New hybrid conjugate gradient methods with the generalized Wolfe line search Open Access Xiao Xu * and Fan yu Kong *Correspondence:
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationOne Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties
One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,
More informationOn proximal-like methods for equilibrium programming
On proximal-lie methods for equilibrium programming Nils Langenberg Department of Mathematics, University of Trier 54286 Trier, Germany, langenberg@uni-trier.de Abstract In [?] Flam and Antipin discussed
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationarxiv: v1 [stat.ml] 12 Nov 2015
Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints Mengdi Wang, Yichen Chen, Jialin Liu, Yuantao Gu arxiv:5.03760v [stat.ml] Nov 05 November 3,
More informationAlternative Characterization of Ergodicity for Doubly Stochastic Chains
Alternative Characterization of Ergodicity for Doubly Stochastic Chains Behrouz Touri and Angelia Nedić Abstract In this paper we discuss the ergodicity of stochastic and doubly stochastic chains. We define
More informationarxiv: v2 [math.oc] 21 Nov 2017
Unifying abstract inexact convergence theorems and block coordinate variable metric ipiano arxiv:1602.07283v2 [math.oc] 21 Nov 2017 Peter Ochs Mathematical Optimization Group Saarland University Germany
More informationDistributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology
More informationProximal-like contraction methods for monotone variational inequalities in a unified framework
Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationProximal methods. S. Villa. October 7, 2014
Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem
More informationSOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1
SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 Masao Fukushima 2 July 17 2010; revised February 4 2011 Abstract We present an SOR-type algorithm and a
More informationMATH 680 Fall November 27, Homework 3
MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationA globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications
A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationSubgradient Methods in Network Resource Allocation: Rate Analysis
Subgradient Methods in Networ Resource Allocation: Rate Analysis Angelia Nedić Department of Industrial and Enterprise Systems Engineering University of Illinois Urbana-Champaign, IL 61801 Email: angelia@uiuc.edu
More informationKaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization
Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth
More informationON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS
ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in
More informationTHE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS
Submitted: 24 September 2007 Revised: 5 June 2008 THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS by Angelia Nedić 2 and Dimitri P. Bertseas 3 Abstract In this paper, we study the influence
More informationRelative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent
Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order
More informationActive sets, steepest descent, and smooth approximation of functions
Active sets, steepest descent, and smooth approximation of functions Dmitriy Drusvyatskiy School of ORIE, Cornell University Joint work with Alex D. Ioffe (Technion), Martin Larsson (EPFL), and Adrian
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationA derivative-free nonmonotone line search and its application to the spectral residual method
IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral
More informationSequential convex programming,: value function and convergence
Sequential convex programming,: value function and convergence Edouard Pauwels joint work with Jérôme Bolte Journées MODE Toulouse March 23 2016 1 / 16 Introduction Local search methods for finite dimensional
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior
More informationI P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION
I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationParallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization
Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationSparse Optimization Lecture: Dual Methods, Part I
Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 20 Subgradients Assumptions
More informationGeneralized Uniformly Optimal Methods for Nonlinear Programming
Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly
More informationA user s guide to Lojasiewicz/KL inequalities
Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f
More informationA Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions
A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions Angelia Nedić and Asuman Ozdaglar April 16, 2006 Abstract In this paper, we study a unifying framework
More informationEfficient Methods for Large-Scale Optimization
Efficient Methods for Large-Scale Optimization Aryan Mokhtari Department of Electrical and Systems Engineering University of Pennsylvania aryanm@seas.upenn.edu Ph.D. Proposal Advisor: Alejandro Ribeiro
More informationBlock Coordinate Descent for Regularized Multi-convex Optimization
Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline
More informationDistributed intelligence in multi agent systems
Distributed intelligence in multi agent systems Usman Khan Department of Electrical and Computer Engineering Tufts University Workshop on Distributed Optimization, Information Processing, and Learning
More informationStatistical Machine Learning II Spring 2017, Learning Theory, Lecture 4
Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.
More informationTight Rates and Equivalence Results of Operator Splitting Schemes
Tight Rates and Equivalence Results of Operator Splitting Schemes Wotao Yin (UCLA Math) Workshop on Optimization for Modern Computing Joint w Damek Davis and Ming Yan UCLA CAM 14-51, 14-58, and 14-59 1
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationLocal strong convexity and local Lipschitz continuity of the gradient of convex functions
Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate
More informationAccelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity
1 Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity Sijia Liu, Member, IEEE, Pin-Yu Chen, Member, IEEE, and Alfred O. Hero, Fellow, IEEE arxiv:1704.05193v2 [stat.ml]
More informationNonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel
IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. X, O. X, X X onparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel Weiguang Wang, Yingbin Liang, Member, IEEE, Eric P. Xing, Senior
More informationA Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions
A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions Angelia Nedić and Asuman Ozdaglar April 15, 2006 Abstract We provide a unifying geometric framework for the
More information