On Nonconvex Decentralized Gradient Descent

Jinshan Zeng and Wotao Yin

Abstract—Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been proposed for convex consensus optimization. However, our understanding of the behaviors of these algorithms for nonconvex consensus optimization is more limited. When we lose convexity, we cannot hope that our algorithms always return global solutions, though they sometimes still do. Somewhat surprisingly, the decentralized consensus algorithms DGD and Prox-DGD retain most of the other properties that are known in the convex setting. In particular, when diminishing (or constant) step sizes are used, we can prove convergence to a consensus stationary solution (or a neighborhood of one) under some regular assumptions. It is worth noting that Prox-DGD can handle nonconvex nonsmooth functions if their proximal operators can be computed. Such functions include SCAD, MCP, and the ℓ_q quasi-norms, q ∈ [0, 1). Similarly, Prox-DGD can handle a constraint to a nonconvex set with an easy projection. To establish these properties, we have to introduce a completely different line of analysis, as well as modify existing proofs that were used in the convex setting.

Index Terms—Nonconvex decentralized computing, consensus optimization, decentralized gradient descent method, proximal decentralized gradient descent.

(J. Zeng is with the College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 3300, China; jsh.zeng@gmail.com. W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095, USA; wotaoyin@ucla.edu.)

I. INTRODUCTION

WE consider an undirected, connected network of n agents and the following consensus optimization problem defined on the network:

  minimize_{x ∈ R^p}  f(x) := Σ_{i=1}^n f_i(x),   (1)

where f_i is a differentiable function known only to agent i. We also consider the consensus optimization problem in the following differentiable+proximable form:

  minimize_{x ∈ R^p}  s(x) := Σ_{i=1}^n ( f_i(x) + r_i(x) ),   (2)

where f_i and r_i are differentiable and proximable functions, respectively, known only to agent i. (We call a function proximable if its proximal operator prox_{αf}(y) := argmin_x { αf(x) + (1/2)||x − y||² } is easy to compute.) Each function r_i is possibly non-differentiable or nonconvex, or both.

The models (1) and (2) find applications in decentralized averaging, learning, estimation, and control. Some typical applications include: (i) distributed compressed sensing problems [4], [30], [39], [45], [49]; (ii) distributed consensus [68], [9], [9], [58], [55], [6]; (iii) distributed and parallel machine learning [55], [5], [], [33], [43]. More specifically, in these applications, each f_i can be: 1) the data-fidelity term (possibly nonconvex) in statistical learning and machine learning [5], [6]; 2) a nonconvex utility function used in applications such as resource allocation [6], [0]; 3) the empirical risk of deep neural networks with nonlinear activation functions [3]. The proximable function r_i can be taken as: 1) a convex penalty such as the nonsmooth ℓ_1-norm or the smooth ℓ_2-norm; 2) the indicator function of a closed convex set, or of a nonconvex set with an easy projection [4], that is, r_i(x) = 0 if x satisfies the constraint and +∞ otherwise; 3) a nonconvex penalty such as the ℓ_q quasi-norm (0 ≤ q < 1) [49], [], the smoothly clipped absolute deviation (SCAD) penalty [6], or the minimax concave penalty (MCP) [67].
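To make the notion of "proximable" concrete, the following minimal Python sketch (ours, not part of the paper) evaluates two of the proximal operators just mentioned: soft-thresholding for the convex ℓ_1-norm and hard-thresholding for the nonconvex ℓ_0 quasi-norm. Both are closed-form and cheap even though the second objective is nonconvex.

```python
import numpy as np

def prox_l1(y, alpha):
    # Soft-thresholding: prox of alpha*||x||_1, the classic convex example.
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def prox_l0(y, alpha):
    # Hard-thresholding: prox of alpha*||x||_0; nonconvex but still easy:
    # keep entries whose magnitude exceeds sqrt(2*alpha), zero the rest.
    return np.where(np.abs(y) > np.sqrt(2.0 * alpha), y, 0.0)

y = np.array([1.5, -0.3, 0.05, -2.0])
print(prox_l1(y, 0.5))   # [ 1.  -0.   0.  -1.5]
print(prox_l0(y, 0.5))   # [ 1.5  0.   0.  -2. ]
```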
When the f_i's are convex, the existing algorithms include the subgradient methods [8], [0], [4], [37], [40], [59], [65], [46], and primal-dual domain methods such as the decentralized alternating direction method of multipliers (DADMM) [5], [5], [9], DLM [3], and EXTRA [53], [54]. When the f_i's are nonconvex, some existing results include [4], [5], [8], [35], [36], [56], [57], [7], [60], [6], [68]. In spite of the algorithms and analyses in these works, the convergence of the simple algorithm Decentralized Gradient Descent (DGD) [40] under nonconvex f_i's is still unknown. Furthermore, although DGD is slower than DADMM, DLM, and EXTRA on convex problems, DGD is simpler and thus easier to extend to a variety of settings such as [47], [64], [38], [3], where online processing and delay tolerance are considered. Therefore, we expect our results to motivate future adoptions of nonconvex DGD.

This paper studies the convergence of two algorithms: DGD for solving problem (1) and Prox-DGD for problem (2). In each DGD iteration, every agent locally computes a gradient and then updates its variable by combining the average of its neighbors' variables with its negative gradient step. In each Prox-DGD iteration, every agent locally computes a gradient of f_i and a proximal map of r_i, as well as exchanging information with its neighbors. Both algorithms can use either a fixed step size or a sequence of decreasing step sizes.

When the problem is convex and a fixed step size is used, DGD does not converge to a solution of the original problem but to a point in its neighborhood [65]. This motivates the use of decreasing step sizes such as in [0], [4]. Assuming the f_i's are convex and have Lipschitz continuous and bounded gradients, [0] shows that the decreasing step sizes α_k = 1/√k lead to a convergence rate O(ln k/√k) of the running best of objective errors. [4] uses nested loops and shows an outer-loop convergence rate O(1/k²) of objective errors, utilizing Nesterov's acceleration, provided that the inner loop performs substantial consensus computation. Without a substantial inner loop, their single-loop algorithm using the decreasing step sizes α_k = 1/k^{1/3} has a reduced rate O(ln k/k).

The objective of this paper is two-fold: (a) we aim to show that, other than losing global optimality, most existing convergence results of DGD and Prox-DGD that are known in the convex setting remain valid in the nonconvex setting, and (b) to achieve (a), we illustrate how to tailor nonconvex analysis tools for decentralized optimization. In particular, our asymptotic exact and inexact consensus results require new treatments because they are special to decentralized algorithms.

The analytic results of this paper can be summarized as follows.
(a) When a fixed step size α is used and properly bounded, the DGD iterates converge to a stationary point of a Lyapunov function. The difference between each local estimate of x and the global average of all local estimates is bounded, and the bound is proportional to α.
(b) When a decreasing step size α_k = O(1/(1+k)^ϵ) is used, where 0 < ϵ ≤ 1 and k is the iteration number, the objective sequence converges, and the iterates of DGD are asymptotically consensual (i.e., become equal to one another), which they achieve at the rate O(1/(1+k)^ϵ). Moreover, we show the convergence of DGD to a stationary point of the original problem, and derive the convergence rates of DGD with different ϵ for objective functions that are convex.
(c) The convergence analysis of DGD can be extended to the algorithm Prox-DGD for solving problem (2). However, when the proximable functions r_i are nonconvex, the mixing matrix is required to be positive definite and a smaller step size is also required. Otherwise, the mixing matrix can be indefinite.

Detailed comparisons between our results and the existing results on DGD and Prox-DGD are presented in Tables I and II. The global objective error rate in these two tables refers to the rate of {f(x̄^k) − f(x^opt)} or {s(x̄^k) − s(x^opt)}, where x̄^k = (1/n) Σ_{i=1}^n x_i^k is the average of the kth iterate and x^opt is a global solution. Comparisons beyond DGD and Prox-DGD are presented in Section IV and Table III.

New proof techniques are introduced in this paper, particularly in the analysis of the convergence of DGD and Prox-DGD with decreasing step sizes. Specifically, the convergence of the objective sequence and convergence to a stationary point of the original problem with decreasing step sizes are justified via a Lyapunov function and several new lemmas (cf. Lemma 9 and the proof of Theorem 2). Moreover, we estimate the consensus rate by introducing an auxiliary sequence and then showing that both sequences have the same rate (cf. the proof of Proposition 3). All these proof techniques are new and distinguish our paper from existing works such as [0], [4], [40], [4], [8], [35], [57], [6]. It should be mentioned that, during the revision of this paper, we found some recent, related but independent work on the convergence of nonconvex decentralized algorithms, including [33], [], [], [9]. We will give detailed comparisons with these works later.

The rest of this paper is organized as follows. Section II describes the problem setup and reviews the algorithms. Section III presents our assumptions and main results. Section IV discusses related works. Section V shows some numerical experiments to verify the developed theoretical results. Section VI presents the proofs of our main results. We conclude this paper in Section VII.

Notation: Let I denote the identity matrix of size n × n, and 1 ∈ R^n denote the vector of all ones.
For a matrix X, X^T denotes its transpose, X_{ij} denotes its (i,j)th component, and ||X|| := sqrt(⟨X, X⟩) = sqrt(Σ_{i,j} X_{ij}²) is its Frobenius norm, which simplifies to the Euclidean norm when X is a vector. Given a symmetric, positive semidefinite matrix G ∈ R^{n×n}, we let ||X||_G := sqrt(⟨X, GX⟩) be the induced semi-norm. Given a function h, dom h denotes its domain.

II. PROBLEM SETUP AND ALGORITHM REVIEW

Consider a connected undirected network G = {V, E}, where V is a set of n nodes and E is the edge set. Any edge (i, j) ∈ E represents a communication link between nodes i and j. Let x_i ∈ R^p denote the local copy of x at node i. We reformulate the consensus problem (1) into the equivalent problem:

  minimize_x  1^T f(x) := Σ_{i=1}^n f_i(x_i),  subject to  x_i = x_j, ∀(i, j) ∈ E,   (3)

where x ∈ R^{n×p} and f(x) ∈ R^n with

  x := [x_1, x_2, ..., x_n]^T,  f(x) := ( f_1(x_1), f_2(x_2), ..., f_n(x_n) )^T.

In addition, the gradient of 1^T f(x) is

  ∇f(x) := [∇f_1(x_1), ∇f_2(x_2), ..., ∇f_n(x_n)]^T ∈ R^{n×p}.   (4)

The ith rows of the matrices x and ∇f(x), and of the vector f(x), correspond to agent i. The analysis in this paper applies to any integer p ≥ 1. For simplicity, one can let p = 1 and treat x and ∇f(x) as vectors rather than matrices.

The algorithm DGD [40] for (3) is described as follows: pick an arbitrary x^0; for k = 0, 1, ..., compute

  x^{k+1} = W x^k − α ∇f(x^k),   (5)

where W is a mixing matrix and α > 0 is a step-size parameter. Similarly, we can reformulate the composite problem (2) in the following equivalent form:

  minimize_x  Σ_{i=1}^n ( f_i(x_i) + r_i(x_i) ),  subject to  x_i = x_j, ∀(i, j) ∈ E.   (6)

Let r(x) := Σ_{i=1}^n r_i(x_i). The algorithm Prox-DGD can be applied to problem (6):

TABLE I: COMPARISONS OF DIFFERENT ALGORITHMS FOR THE CONSENSUS SMOOTH OPTIMIZATION PROBLEM (1)

Fixed step size — DGD [65]: f_i convex only; ∇f_i Lipschitz; step size 0 < α < (1+λ_n(W))/L_f; consensus error O(α); min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error O(1/k) until reaching an O(α/(1−ζ)) error.
Fixed step size — DGD (this paper): f_i nonconvex; ∇f_i Lipschitz; step size 0 < α < (1+λ_n(W))/L_f; consensus error O(α); min_{j≤k} ||x^{j+1} − x^j||²: o(1/k); global objective error — convex: O(1/k) until reaching an O(α/(1−ζ)) error; nonconvex: no rate.
Decreasing step sizes — D-NG [4]: f_i convex only; ∇f_i Lipschitz, bounded; step size O(1/k) with Nesterov acceleration; consensus error O(1/k); min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error O(ln k/k).
Decreasing step sizes — DGD (this paper): f_i nonconvex; ∇f_i Lipschitz, bounded; step size 1/(1+k)^ϵ, ϵ ∈ (0, 1]; consensus error O(1/(1+k)^ϵ); min_{j≤k} ||x^{j+1} − x^j||²: o(1/k^{1+ϵ}); global objective error — convex: O(ln k/√k) if ϵ = 1/2, O(1/ln k) if ϵ = 1, O(1/k^{min{ϵ, 1−ϵ}}) for other ϵ; nonconvex: no rate.

The objective error rates of DGD and Prox-DGD obtained in this paper, and those of the convex DProx-Grad [0], are ergodic or running-best rates.

TABLE II: COMPARISONS OF DIFFERENT ALGORITHMS FOR THE CONSENSUS COMPOSITE OPTIMIZATION PROBLEM (2)

Fixed step size — AccDProx-Grad [8]: f_i, r_i convex only; ∇f_i Lipschitz, bounded; ∂r_i bounded; step size 0 < α < 1/L_f; consensus error O(γ^k), 0 < γ < 1; min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error O(D_1 α + D_2/(α k)), D_1, D_2 > 0.
Fixed step size — DProx-Grad [0]: f_i, r_i convex only; ∇f_i Lipschitz; step size 0 < α < 1/L_f; consensus error O(α); min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error of the same form.
Fixed step size — Prox-DGD (this paper): f_i, r_i nonconvex; ∇f_i Lipschitz; step size 0 < α < (1+λ_n(W))/L_f for convex r_i, or 0 < α < λ_n(W)/L_f for nonconvex r_i (requiring λ_n(W) > 0); consensus error O(α); min_{j≤k} ||x^{j+1} − x^j||²: o(1/k); global objective error — convex: of the form O(D_3 α + D_4/(α k)), D_3, D_4 > 0; nonconvex: no rate.
Decreasing step sizes — DProx-Grad [0]: f_i, r_i convex only; ∇f_i Lipschitz, bounded; ∂r_i bounded; step size O(1/√(k+1)); consensus error O(1/√k); min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error O(ln k/√k).
Decreasing step sizes — Prox-DGD (this paper): f_i, r_i nonconvex; ∇f_i Lipschitz, bounded; ∂r_i bounded; step size 1/(1+k)^ϵ, ϵ ∈ (0, 1]; consensus error O(1/(1+k)^ϵ); min_{j≤k} ||x^{j+1} − x^j||²: o(1/k^{1+ϵ}); global objective error — convex: O(ln k/√k) if ϵ = 1/2, O(1/ln k) if ϵ = 1, O(1/k^{min{ϵ, 1−ϵ}}) for other ϵ; nonconvex: no rate.

Prox-DGD: take an arbitrary x^0; for k = 0, 1, ..., perform

  x^{k+1} = prox_{α r}( W x^k − α ∇f(x^k) ),   (7)

where the proximal operator is

  prox_{α r}(x) := argmin_{u ∈ R^{n×p}} { α r(u) + (1/2) ||u − x||² }.   (8)
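The updates (5) and (7) are straightforward to implement. Below is a minimal Python sketch of both; the names metropolis_mixing, grad_f, and prox_r are our own illustrative choices. We assume grad_f returns the stacked gradient ∇f(x) ∈ R^{n×p} (row i equal to ∇f_i(x_i)) and prox_r applies prox_{α r_i} row-wise. The Metropolis construction is one common way — not the paper's prescription — to build a mixing matrix with the properties assumed in Section III.

```python
import numpy as np

def metropolis_mixing(adj):
    """Build a symmetric, doubly stochastic mixing matrix W from a 0/1
    adjacency matrix via Metropolis weights (one standard construction)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # remaining mass stays on the diagonal
    return W

def dgd_step(x, W, grad_f, alpha):
    # DGD update (5): mix with neighbors, then take a local gradient step.
    return W @ x - alpha * grad_f(x)

def prox_dgd_step(x, W, grad_f, prox_r, alpha):
    # Prox-DGD update (7): a DGD step followed by the proximal map of r.
    return prox_r(W @ x - alpha * grad_f(x), alpha)

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # a 3-node path
W = metropolis_mixing(adj)   # symmetric, rows and columns sum to 1
```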

TABLE III: COMPARISON OF THE SCENARIOS COVERED BY DIFFERENT NONCONVEX DECENTRALIZED ALGORITHMS

The columns of Table III record, for each algorithm: whether f_i is smooth; whether a nonsmooth r_i is handled (cvx or ncvx); the step size (fixed or diminishing); the network (static or dynamic); the algorithm type (determin or stochastic); the fusion scheme (ATC or CTA); and the mixing matrix W. The mixing-matrix entries are: DGD (this paper): doubly; Perturbed Push-sum [57]: column; ZENITH [8]: doubly; Prox-DGD (this paper): doubly; NEXT [35]: doubly; DeFW [6]: doubly; Proj SGD [4]: row.

In this table, the full names of the abbreviations are as follows: cvx (convex), ncvx (nonconvex), diminish (diminishing), determin (deterministic), ATC (adapt-then-combine), CTA (combine-then-adapt), doubly (doubly stochastic), column (column stochastic), row (row stochastic), where the words in parentheses are the full names. A row, column, or doubly stochastic W means that W1 = 1, 1^T W = 1^T, or both hold, respectively.

III. ASSUMPTIONS AND MAIN RESULTS

This section presents all of our main results.

A. Definitions and assumptions

Definition 1 (Lipschitz differentiability). A function h is called Lipschitz differentiable if h is differentiable and its gradient ∇h is Lipschitz continuous, i.e., ||∇h(u) − ∇h(v)|| ≤ L||u − v|| for all u, v ∈ dom h, where L > 0 is its Lipschitz constant.

Definition 2 (Coercivity). A function h is called coercive if ||u|| → +∞ implies h(u) → +∞.

The next definition is a property that many functions have (see [63] for examples) and can help obtain whole-sequence convergence from subsequence convergence.

Definition 3 (Kurdyka-Łojasiewicz (KL) property [34], [7], []). A function h : R^p → R ∪ {+∞} has the KL property at x̄ ∈ dom ∂h if there exist η ∈ (0, +∞], a neighborhood U of x̄, and a continuous concave function φ : [0, η) → R_+ such that: (i) φ(0) = 0 and φ is differentiable on (0, η); (ii) for all s ∈ (0, η), φ'(s) > 0; (iii) for all x ∈ U ∩ {x : h(x̄) < h(x) < h(x̄) + η}, the KL inequality holds:

  φ'( h(x) − h(x̄) ) · dist( 0, ∂h(x) ) ≥ 1.   (9)

Proper lower semi-continuous functions that satisfy the KL inequality at each point of dom ∂h are called KL functions.

Assumption 1 (Objective). The objective functions f_i : R^p → R ∪ {+∞}, i = 1, ..., n, satisfy the following:
1) f_i is Lipschitz differentiable with constant L_i > 0;
2) f_i is proper (i.e., not everywhere infinite) and coercive.
The sum Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable with L_f ≤ max_i L_i (this can be easily verified via the definition of ∇f(x) shown in (4)). In addition, each f_i is lower bounded, following part 2) of the above assumption.

(Whole-sequence convergence from any starting point is referred to as global convergence in the literature. Its limit is not necessarily a global solution.)

Assumption 2 (Mixing matrix). The mixing matrix W = [w_{ij}] ∈ R^{n×n} has the following properties:
1) (Graph) If i ≠ j and (i, j) ∉ E, then w_{ij} = 0; otherwise, w_{ij} > 0;
2) (Symmetry) W = W^T;
3) (Null space property) null{I − W} = span{1};
4) (Spectral property) −I ≺ W ⪯ I.

By Assumption 2, a solution x^opt to problem (3) satisfies (I − W) x^opt = 0. Due to the symmetry of W, its eigenvalues are real and can be sorted in nonincreasing order. Let λ_i(W) denote the ith largest eigenvalue of W. Then, by Assumption 2,

  λ_1(W) = 1 > λ_2(W) ≥ ... ≥ λ_n(W) > −1.

Let ζ be the second largest magnitude eigenvalue of W. Then

  ζ = max{ |λ_2(W)|, |λ_n(W)| }.   (10)

B. Convergence results of DGD

We consider the convergence of DGD with both a fixed step size and a sequence of decreasing step sizes.

1) Convergence results of DGD with a fixed step size: The convergence result of DGD with a fixed step size (i.e., α_k ≡ α) is established based on the Lyapunov function [65]:

  L_α(x) := 1^T f(x) + (1/(2α)) ||x||²_{I−W}.   (11)

It is worth reminding that convexity is not assumed.
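As an illustration (anticipating Lemma 1 in Section VI), one can check numerically that the DGD update (5) is exactly a gradient step on L_α. The sketch below uses stand-in quadratic local objectives and a hypothetical Laplacian-based mixing matrix of our own choosing; the identity holds for any symmetric W.

```python
import numpy as np
rng = np.random.default_rng(0)

n, p, alpha = 5, 3, 0.05
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                    # a random undirected graph
W = np.eye(n) - 0.1 * (np.diag(A.sum(1)) - A)     # Laplacian-based mixing matrix (stand-in)
c = rng.standard_normal((n, p))
grad_f = lambda x: x - c                          # gradients of f_i(x_i) = ||x_i - c_i||^2 / 2

x = rng.standard_normal((n, p))
dgd = W @ x - alpha * grad_f(x)                                 # DGD update (5)
gd = x - alpha * (grad_f(x) + (np.eye(n) - W) @ x / alpha)      # gradient step on L_alpha
print(np.allclose(dgd, gd))                                     # True
```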
Theorem 1 (Global convergence). Let {x^k} be the sequence generated by DGD (5) with step size 0 < α < (1+λ_n(W))/L_f, and let Assumptions 1 and 2 hold. Then {x^k} has at least one accumulation point x*, and any such point is a stationary point of L_α(x). Furthermore, the running best rates of

(Given a nonnegative sequence a_k, its running best sequence is b_k = min{a_i : i ≤ k}. We say a_k has a running best rate of o(1/k) if b_k = o(1/k).)

the sequences {||x^{k+1} − x^k||²}, {||∇L_α(x^k)||²}, and {||(1/n) 1^T ∇f(x^k)||²} are o(1/k). The convergence rate of the sequence {(1/k) Σ_{j=0}^{k−1} ||(1/n) 1^T ∇f(x^j)||²} is O(1/k). In addition, if L_α satisfies the KL property at an accumulation point x*, then {x^k} globally converges to x*.

(These quantities naturally appear in the analysis, so we keep the squares.)

Remark 1. Let x* be a stationary point of L_α(x); thus

  0 = ∇f(x*) + (1/α)(I − W) x*.   (12)

Since 1^T (I − W) = 0, (12) yields 0 = 1^T ∇f(x*), indicating that x* is also a stationary point of the separable function Σ_{i=1}^n f_i(x_i). Since the rows of x* are not necessarily identical, we cannot say that x* is a stationary point of problem (3). However, the differences between the rows of x* are bounded, following our next result, adapted from [65]:

Proposition 1 (Consensual bound on x^k). For each iteration k, define x̄^k := (1/n) Σ_{i=1}^n x_i^k. Then, it holds for each node i that

  ||x_i^k − x̄^k|| ≤ αD/(1 − ζ),   (13)

where D is a universal bound on ||∇f(x^k)|| defined in Lemma 6 below, and ζ is the second largest magnitude eigenvalue of W specified in (10). As k → ∞, (13) yields the consensual bound ||x_i^∞ − x̄^∞|| ≤ αD/(1 − ζ), where x̄^∞ := (1/n) Σ_{i=1}^n x_i^∞.

In Proposition 1, the consensual bound is proportional to the step size α and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of W. Let us compare the DGD iteration (5) with the iteration of centralized gradient descent for f. Averaging the rows of (5) yields the following comparison:

  DGD averaged:  x̄^{k+1} = x̄^k − (α/n) Σ_{i=1}^n ∇f_i(x_i^k),   (14)
  Centralized:   x̄^{k+1} = x̄^k − (α/n) Σ_{i=1}^n ∇f_i(x̄^k).   (15)

Apparently, DGD approximates centralized gradient descent by evaluating each ∇f_i at the local variable x_i^k instead of the global average. We can estimate the error of this approximation as

  || (1/n) Σ_i ∇f_i(x_i^k) − (1/n) Σ_i ∇f_i(x̄^k) || ≤ (1/n) Σ_i ||∇f_i(x_i^k) − ∇f_i(x̄^k)|| ≤ L_f αD/(1 − ζ).

Unlike the convex analysis in [65], it is impossible to bound the difference between the sequences of (14) and (15) without convexity, because the two sequences may converge to different stationary points of L_α.

Remark 2. The KL assumption on L_α in Theorem 1 is satisfied if each f_i is a sub-analytic function. Since ||x||²_{I−W} is obviously sub-analytic and the sum of two sub-analytic functions remains sub-analytic, L_α is sub-analytic if each f_i is. See [63] for more details and examples.

Proposition 2 (KL convergence rates). Let the assumptions of Theorem 1 hold. Suppose that L_α satisfies the KL inequality at an accumulation point x* with φ(s) = c s^{1−θ} for some constant c > 0. Then the following convergence rates hold:
(a) If θ = 0, x^k converges to x* in finitely many iterations.
(b) If θ ∈ (0, 1/2], ||x^k − x*|| ≤ C_0 τ^k for all k ≥ k̂, for some k̂ > 0, C_0 > 0, τ ∈ [0, 1).
(c) If θ ∈ (1/2, 1), ||x^k − x*|| ≤ C_0 k^{−(1−θ)/(2θ−1)} for all k ≥ k̂, for certain k̂ > 0, C_0 > 0.

Note that the rates in parts (b) and (c) of Proposition 2 are of the eventual type. Using fixed step sizes, our results are limited because the stationary point x* of L_α is not a stationary point of the original problem; we only have a consensual bound on x*. To address this issue, the next subsection uses decreasing step sizes and presents better convergence results.

2) Convergence of DGD with decreasing step sizes: The positive consensual error bound in Proposition 1, which is proportional to the constant step size α, motivates the use of properly decreasing step sizes α_k = O(1/(1+k)^ϵ), for some 0 < ϵ ≤ 1, to diminish the consensual bound to 0. As a result, any accumulation point x* becomes a stationary point of the original problem (3). To analyze DGD with decreasing step sizes, we add the following assumption.

Assumption 3 (Bounded gradient). For any k, ||∇f(x^k)|| is uniformly bounded by some constant B > 0, i.e., ||∇f(x^k)|| ≤ B.
Note that the bounded gradient assumption is a regular assumption in the convergence analysis of decentralized gradient methods (see, e.g., [4], [5], [8], [35], [36], [56], [57], [7], [6]), even in the convex setting ([4] and also [0]), though it is not required for centralized gradient descent. We take the step-size sequence

  α_k = 1/(1+k)^ϵ,  0 < ϵ ≤ 1,   (16)

throughout the rest of this section. The numerator can be replaced by any positive constant. By iteratively applying iteration (5), we obtain the expression

  x^k = W^k x^0 − Σ_{j=0}^{k−1} α_j W^{k−1−j} ∇f(x^j).   (17)

Proposition 3 (Asymptotic consensus rate). Let Assumptions 2 and 3 hold. Let DGD use (16). Let x̄^k := (1/n) 1 1^T x^k. Then ||x^k − x̄^k|| converges to 0 at the rate of O(1/(1+k)^ϵ).

According to Proposition 3, the iterates of DGD with decreasing step sizes reach consensus asymptotically, compared to the nonzero bound in the fixed step size case in Proposition 1. Moreover, with a larger ϵ, faster decaying step sizes generally imply a faster asymptotic consensus rate.
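A small simulation sketch (ours; the local objectives are stand-ins with clipped, hence bounded, gradients so that Assumption 3 holds, and the mixing matrix is the three-agent one used in the experiments of Section V) illustrates Proposition 3: with α_k = 1/(1+k)^ϵ, the scaled consensus error (1+k)^ϵ ||x^k − x̄^k|| should remain roughly bounded for each ϵ.

```python
import numpy as np

def consensus_error(W, grad_f, x0, eps, iters=2000):
    """Run DGD with alpha_k = 1/(1+k)^eps and record ||x^k - xbar^k||."""
    x, errs = x0.copy(), []
    for k in range(iters):
        alpha = 1.0 / (1.0 + k) ** eps
        x = W @ x - alpha * grad_f(x)
        errs.append(np.linalg.norm(x - x.mean(axis=0, keepdims=True)))
    return errs

W = 0.5 * np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
c = np.array([[-1.0], [0.0], [2.0]])
grad_f = lambda x: np.clip(x - c, -5.0, 5.0)   # bounded stand-in gradients
x0 = np.zeros((3, 1))
for eps in (0.3, 0.6, 1.0):
    errs = consensus_error(W, grad_f, x0, eps)
    # scaled error stays roughly bounded, consistent with O(1/(1+k)^eps)
    print(eps, errs[-1] * (1.0 + len(errs)) ** eps)
```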

Note that (I − W) x̄^k = 0 and thus ||x^k||_{I−W} = ||x^k − x̄^k||_{I−W}. Therefore, Proposition 3 implies the following result.

Corollary 1. Apply the setting of Proposition 3. Then ||x^k||_{I−W} converges to 0 at the rate of O(1/(1+k)^ϵ).

Corollary 1 shows that the sequence {x^k} in the (I−W) semi-norm decays to 0 at a sublinear rate. For any global consensual solution x^opt to problem (3), we have ||x^k − x^opt||_{I−W} = ||x^k||_{I−W}; so, if {x^k} does converge to x^opt, then their distance in the same semi-norm decays at O(1/(1+k)^ϵ).

Theorem 2 (Convergence). Let Assumptions 1, 2, and 3 hold. Let DGD use step sizes (16). Then
(a) {L_{α_k}(x^k)} and {1^T f(x^k)} converge to the same limit;
(b) lim_{k→∞} ||1^T ∇f(x^k)|| = 0, and any limit point of {x^k} is a stationary point of problem (3);
(c) in addition, if there exists an isolated accumulation point, then {x^k} converges.

In the proof of Theorem 2, we will establish

  Σ_{k=0}^∞ ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||² < ∞,

which implies that the running best rate of the sequence {||x^{k+1} − x^k||²} is o(1/(1+k)^{1+ϵ}).

Theorem 2 shows that the objective sequence converges and that any limit point of {x^k} is a stationary point of the original problem. However, there is no result on the convergence rate of the objective sequence to an optimal value, and it is generally difficult to obtain such a rate without convexity. Although our primary focus is nonconvexity, next we assume convexity and present the objective convergence rate, which has an interesting relation with ϵ.

Even if the f_i's are convex, the solution to (3) may be non-unique. Thus, let X* be the set of solutions to (3). Given x̄^k, we pick the solution x^opt = Proj_{X*}(x̄^k) ∈ X*. Also let f^opt = f(x^opt) be the optimal value of (1). Define the ergodic objective:

  f̃_k := ( Σ_{j=0}^k α_j f(x̄^{j+1}) ) / ( Σ_{j=0}^k α_j ),   (18)

where x̄^{j+1} = (1/n) 1^T x^{j+1}. Obviously,

  f̃_k ≥ min_{j=1,...,k+1} f(x̄^j).   (19)

Proposition 4 (Convergence rates under convexity). Let Assumptions 1, 2, and 3 hold. Let DGD use step sizes (16). If λ_n(W) > 0 and each f_i is convex, then {f̃_k} defined in (18) converges to the optimal objective value f^opt at the following rates:
(a) if 0 < ϵ < 1/2, the rate is O(1/k^ϵ);
(b) if ϵ = 1/2, the rate is O(ln k/√k);
(c) if 1/2 < ϵ < 1, the rate is O(1/k^{1−ϵ});
(d) if ϵ = 1, the rate is O(1/ln k).

The convergence rate established in Proposition 4 is almost as good as O(1/√k) when ϵ = 1/2. As ϵ goes to either 0 or 1, the rates become slower, so ϵ = 1/2 may be the optimal choice in terms of the convergence rate. However, by Proposition 3, a larger ϵ implies a faster consensus rate. Therefore, there is a tradeoff in choosing an appropriate ϵ in practical implementations of DGD.

Remark 3. A related algorithm is the perturbed push-sum algorithm, also called subgradient-push, which was proposed in [5] for the average consensus problem over time-varying networks. Its convergence in the convex setting was developed in [4]. Recently, its convergence to a critical point in the nonconvex setting was established in [57] under some regularity assumptions. Moreover, by utilizing perturbations on the update process and the assumption that no saddle points exist, almost sure convergence of its perturbed variant to a local minimum was also shown in [57].

Remark 4. Another recent algorithm is decentralized stochastic gradient descent (D-PSGD) [33], which supports nonconvex large-sum objectives. An ergodic convergence rate of order O(1/k), with a constant depending on n, was established assuming k is sufficiently large and the step size α is sufficiently small. When applied to the setting of this paper, [33, Theorem 1] implies that the sequence {(1/k) Σ_{j=0}^{k−1} ||(1/n) 1^T ∇f(x^j)||²} converges to zero at the rate O(1/k) if the step size satisfies 0 < α < (1−ζ)²/(6√n L_f), where ζ is defined in (10). From Theorem 1, we can also establish such an O(1/k) ergodic convergence rate of DGD as long as 0 < α < (1+λ_n(W))/L_f. Similar rates of convergence to a stationary point have also been shown for different nonconvex algorithms in [8], [57], [8].
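The ergodic quantity (18) is simply an α-weighted running average of the objective values at the averaged iterates. A minimal sketch of this bookkeeping, with hypothetical objective values, follows.

```python
import numpy as np

def ergodic_objective(f_avg_vals, eps):
    """Weighted running average (18): f_avg_vals[j] holds f(xbar^{j+1}),
    weights are alpha_j = 1/(1+j)^eps; a sketch of the bookkeeping only."""
    alphas = 1.0 / (1.0 + np.arange(len(f_avg_vals))) ** eps
    return np.cumsum(alphas * f_avg_vals) / np.cumsum(alphas)

vals = np.array([4.0, 2.5, 1.8, 1.4, 1.2])   # hypothetical f(xbar^{j+1}) values
print(ergodic_objective(vals, 0.5))          # nonincreasing for decreasing inputs
```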
C. Convergence results of Prox-DGD

Similarly, we consider the convergence of Prox-DGD with both a fixed step size and decreasing step sizes. The iteration (7) can be reformulated as

  x^{k+1} = prox_{α r}( x^k − α ∇L_α(x^k) ),   (20)

based on which we define the Lyapunov function

  L̂_α(x) := L_α(x) + r(x),   (21)

where we recall L_α(x) = Σ_{i=1}^n f_i(x_i) + (1/(2α)) ||x||²_{I−W}. Then (20) is clearly the forward-backward splitting (a.k.a. prox-gradient) iteration for

  minimize_x  L̂_α(x).

Specifically, (20) first performs gradient descent on the differentiable function L_α(x) and then computes the proximal map of r(x). To analyze Prox-DGD, we revise Assumption 1 as follows.

Assumption 4 (Composite objective). The objective function of (6) satisfies the following:
1) each f_i is Lipschitz differentiable with constant L_i > 0;
2) each f_i + r_i is proper, lower semi-continuous, and coercive.
As before, Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable for L_f ≤ max_i L_i.

1) Convergence results of Prox-DGD with a fixed step size: Based on the above assumptions, we can obtain the global convergence of Prox-DGD as follows.

Theorem 3 (Global convergence of Prox-DGD). Let {x^k} be the sequence generated by Prox-DGD (7), where the step size α satisfies 0 < α < (1+λ_n(W))/L_f when the r_i's are convex,

and 0 < α < λ_n(W)/L_f when the r_i's are not necessarily convex (this case requires λ_n(W) > 0). Let Assumptions 2 and 4 hold. Then {x^k} has at least one accumulation point x*, and any accumulation point is a stationary point of L̂_α(x). Furthermore, the running best rates of the sequences {||x^{k+1} − x^k||²}, {||g^{k+1}||²}, and {||(1/n) 1^T ∇f(x^k) + (1/n) 1^T ξ^k||²}, where g^{k+1} is defined in Lemma 8 and ξ^k is defined in Lemma 9, are o(1/k). The convergence rate of the sequence {(1/k) Σ_{j=0}^{k−1} ||(1/n) 1^T (∇f(x^j) + ξ^j)||²} is O(1/k). In addition, if L̂_α satisfies the KL property at an accumulation point x*, then {x^k} converges to x*.

The rate of convergence of Prox-DGD can also be established by leveraging the KL property.

Proposition 5 (Rate of convergence of Prox-DGD). Under the assumptions of Theorem 3, suppose that L̂_α satisfies the KL inequality at an accumulation point x* with φ(s) = c s^{1−θ} for some constant c > 0. Then the following hold:
(a) If θ = 0, x^k converges to x* in finitely many iterations.
(b) If θ ∈ (0, 1/2], ||x^k − x*|| ≤ C τ^k for all k ≥ k̂, for some k̂ > 0, C > 0, τ ∈ [0, 1).
(c) If θ ∈ (1/2, 1), ||x^k − x*|| ≤ C k^{−(1−θ)/(2θ−1)} for all k ≥ k̂, for certain k̂ > 0, C > 0.

2) Convergence of Prox-DGD with decreasing step sizes: In Prox-DGD, we also use the decreasing step sizes (16). To investigate its convergence, the bounded gradient Assumption 3 is revised as follows.

Assumption 5 (Bounded composite subgradient). For each i, ||∇f_i|| is uniformly bounded by some constant B_i > 0, i.e., ||∇f_i(x)|| ≤ B_i for any x ∈ R^p. Moreover, ||ξ_i|| ≤ B_{r_i} for any ξ_i ∈ ∂r_i(x) and x ∈ R^p, i = 1, ..., n. Let B := Σ_{i=1}^n (B_i + B_{r_i}). Then ||∇f(x) + ξ||, where ξ ∈ ∂r(x), is uniformly bounded by B for any x ∈ R^{n×p}.

Note that the same assumption is used to analyze the convergence of the distributed proximal-gradient method in the convex setting [8], [0], and it is also widely used to analyze the convergence of nonconvex decentralized algorithms as in [35], [36]. In light of Lemma 9 below, the claims in Proposition 3 and Corollary 1 also hold for Prox-DGD.

Proposition 6 (Asymptotic consensus and rate). Let Assumptions 2 and 5 hold. In Prox-DGD, use the step sizes (16). Then

  ||x^k − x̄^k|| ≤ C ||x^0|| ζ^k + CB Σ_{j=0}^{k−1} α_j ζ^{k−1−j},

and ||x^k − x̄^k|| converges to 0 at the rate of O(1/(1+k)^ϵ). Moreover, let x* be any global solution of problem (6). Then ||x^k − x*||_{I−W} = ||x^k||_{I−W} = ||x^k − x̄^k||_{I−W} converges to 0 at the rate of O(1/(1+k)^ϵ).

For any x ∈ R^{n×p}, define s(x) := Σ_{i=1}^n ( f_i(x_i) + r_i(x_i) ). Let X* be the set of solutions of (6), x^opt = Proj_{X*}(x̄^k) ∈ X*, and s^opt = s(x^opt) the optimal value of (6). Define

  s̃_k := ( Σ_{j=0}^k α_j s(x̄^{j+1}) ) / ( Σ_{j=0}^k α_j ).   (22)

Theorem 4 (Convergence and rate). Let Assumptions 2, 4, and 5 hold. In Prox-DGD, use the step sizes (16). Then
(a) {L̂_{α_k}(x^k)} and {Σ_{i=1}^n ( f_i(x_i^k) + r_i(x_i^k) )} converge to the same limit;
(b) Σ_{k=0}^∞ ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||² < ∞ when the r_i's are convex; or Σ_{k=0}^∞ ( λ_n(W)/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||² < ∞ when the r_i's are not necessarily convex (this case requires λ_n(W) > 0);
(c) if {ξ^k} satisfies ||ξ^{k+1} − ξ^k|| ≤ L_r ||x^{k+1} − x^k|| for each k > k_0, some constant L_r > 0, and a sufficiently large integer k_0 > 0, then lim_{k→∞} ||1^T ( ∇f(x^{k+1}) + ξ^{k+1} )|| = 0, where ξ^{k+1} ∈ ∂r(x^{k+1}) is the one determined by the proximal operator (8), and any limit point is a stationary point of problem (6);
(d) in addition, if there exists an isolated accumulation point, then {x^k} converges;
(e) furthermore, if the f_i's and r_i's are convex and λ_n(W) > 0, then the claims on the rates of {f̃_k} in Proposition 4 hold for the sequence {s̃_k} defined in (22).

Theorem 4(b) implies that the running best rate of ||x^{k+1} − x^k||² is o(1/(1+k)^{1+ϵ}). The additional condition imposed on {ξ^k} in Theorem 4(c) is a type of restricted continuity of the subgradient ∂r with respect to the generated sequence.
If ∂r is locally Lipschitz continuous in a neighborhood of a limit point, then such a Lipschitz condition on {ξ^k} can generally be satisfied: since {x^k} is asymptotically regular, x^k will lie in such a neighborhood of this limit point when k is sufficiently large. There are many kinds of proximable functions satisfying this assumption, as studied in [66] (see Remark 5 for details). Theorem 4(e) gives the convergence rates of Prox-DGD in the convex setting.

Remark 5. A typical proximable function r satisfying the assumption of Theorem 4(c) is the ℓ_q quasi-norm (0 ≤ q < 1) widely studied in sparse optimization, which takes the form r(x) = Σ_{i=1}^p |x_i|^q (when q = 0, we set 0^0 := 0). From [] and [66], there is a positive lower bound on the absolute values of the nonzero components of the solutions of the ℓ_q-regularized optimization problem. Furthermore, as shown by [66, Property (b)], the sequence generated by Prox-DGD also has a similar lower-bound property. Moreover, by Theorem 4(b), we have ||x^{k+1} − x^k|| → 0 as k → ∞. Together with the lower-bound property, we can easily obtain finite support and sign convergence of {x^k}; that is, the supports and signs of the nonzero components freeze for sufficiently large k. When restricted to this nonzero subspace, the gradient of r_i(u) = |u|^q is Lipschitz continuous for any |u| ≥ τ and some positive constant τ, where τ denotes the lower bound. Besides the ℓ_q quasi-norm, some other typical cases, like SCAD [6] and MCP [67] widely used in statistical learning, satisfy condition (c) of this theorem.

Remark 6. One algorithm tightly related to Prox-DGD is the projected stochastic gradient descent (Proj SGD) method proposed by [4] for solving the constrained multi-agent optimization problem with a convex constraint set. When restricted to the deterministic case as studied in this paper, the convergence results of Proj SGD are very similar to those of Prox-DGD (see Theorem 4(c)-(d) in this paper and [4]). However, there are some differences between [4] and this paper. In short, Proj SGD in [4] uses convex constraints, which correspond to setting r(x) in our paper to indicator functions of those convex sets. Our paper also considers nonconvex functions like the ℓ_q quasi-norm (0 ≤ q < 1), SCAD, and MCP, which are widely used in statistical learning. Another difference is that Proj SGD of [4] uses adapt-then-combine (ATC), while Prox-DGD of this paper uses combine-then-adapt (CTA). By an assumption of [4], Proj SGD uses decreasing step sizes like O(1/k^ϵ) for some ϵ > 1/2. We study the step sizes α_k = O(1/k^ϵ) for any 0 < ϵ ≤ 1 for Prox-DGD, as well as a fixed step size.
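As one concrete instance of a nonconvex set with an easy projection (cf. the constraint examples in Section I and the contrast with the convex constraint sets of Proj SGD), consider the sphere: the prox of its indicator function is the following projection. This is a sketch of our own; the choice at y = 0 is arbitrary, since every point of the sphere is a valid projection there.

```python
import numpy as np

def proj_sphere(y, radius=1.0):
    """Projection onto the nonconvex set {x : ||x|| = radius}, i.e., the
    prox of its indicator function; cheap despite nonconvexity."""
    nrm = np.linalg.norm(y)
    if nrm == 0.0:
        e = np.zeros_like(y)
        e.flat[0] = radius   # arbitrary tie-break at the origin
        return e
    return (radius / nrm) * y

print(proj_sphere(np.array([3.0, 4.0])))  # [0.6 0.8]
```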

IV. RELATED WORKS AND DISCUSSIONS

We summarize some recent nonconvex decentralized algorithms in Table III. Most of them apply to either the smooth optimization problem (1) or the composite optimization problem (2) and use diminishing step sizes. Although (1) is a special case of (2) obtained by letting r_i(x) ≡ 0, there are still differences in both algorithm design and theoretical analysis. Therefore, we present the comparisons separately.

We first discuss the algorithms for (1). In [57], the authors proved the convergence of perturbed push-sum for nonconvex (1) under some regularity assumptions. They also introduced random perturbations to avoid local minima. The network considered in [57] is time-varying and directed, and specific column stochastic matrices and diminishing step sizes are used. The convergence results for the deterministic perturbed push-sum algorithm obtained in [57] are similar to those of DGD developed in this paper under similar assumptions (see Theorem 2 above and [57, Theorem 3]). Detailed comparisons between the two algorithms are given in Remark 3. In [33], sublinear convergence to a stationary point of the D-PSGD algorithm was developed in the nonconvex setting; the DGD studied in this paper can be viewed as a special D-PSGD with zero variance. In [8], a primal-dual approximate gradient algorithm called ZENITH was developed for (1). The convergence of ZENITH was given in terms of the expected constraint violation under the Lipschitz differentiability assumption and other assumptions. The last one is the proximal primal-dual algorithm Prox-PDA recently proposed in []. The O(1/k) rate of convergence to a stationary point was established in []. Later, a perturbed variant of Prox-PDA was proposed in [] for the constrained composite (smooth+nonsmooth) optimization problem with a linear equality constraint.

Table III includes three algorithms for solving the composite problem (2), which are related to ours. All of them only deal with convex r_i, whereas r_i in this paper can also be nonconvex. In [36], the authors proposed NEXT based on the earlier successive convex approximation (SCA) technique. The original form of this algorithm, push-sum, was proposed in [5] for the average consensus problem. It was modified and analyzed in [4] for the convex consensus optimization problem over time-varying directed graphs. The iterates of NEXT include two stages: a local SCA stage to update local variables and a consensus update stage to fuse information between agents. NEXT has convergence results similar to those of Prox-DGD under diminishing step sizes.
Another interesting algorithm is decentralized Frank-Wolfe (DeFW), proposed in [6] for nonconvex, smooth, constrained decentralized optimization, where a bounded convex constraint set is imposed. There are three steps at each iteration of DeFW: average gradient computation, local variable evaluation by Frank-Wolfe, and information fusion between agents. In [6], the authors established convergence results similar to those of Prox-DGD under diminishing step sizes. A stochastic version of DeFW has also been developed in [7] for high-dimensional convex sparse optimization. The last one is the projected stochastic gradient algorithm Proj SGD [4] for constrained, nonconvex, smooth consensus optimization with a convex constraint set. A detailed comparison between Proj SGD and Prox-DGD is given in Remark 6.

Based on the above analysis, the convergence results of DGD and Prox-DGD with decreasing step sizes in this paper are comparable with most of the existing ones. However, we allow nonconvex nonsmooth r_i and are able to obtain estimates of the asymptotic consensus rates. We also establish global convergence using a fixed step size, which is otherwise only found for ZENITH.

V. NUMERICAL EXPERIMENTS

In this section, we describe a set of numerical experiments, mainly to verify our theoretical findings for DGD and Prox-DGD. For comparisons between DGD or Prox-DGD and other existing algorithms, we refer to the literature such as [35], [], [], [9].

A. Convergence of DGD

We verify the performance of DGD using both fixed and diminishing step sizes in an experimental setting identical to that of [57]. The following one-dimensional decentralized optimization problem is considered in [57]:

  minimize_{x ∈ R}  f(x) = f_1(x) + f_2(x) + f_3(x),

where each f_i is a nonconvex piecewise-polynomial function: a cubic or quadratic on a bounded interval, extended linearly outside of it so that its gradient is bounded (the exact coefficients follow the setting of [57]). We plot the function f over the intervals [−5, 5] and [−6, 6] in Fig. 1; it attains its global minimizer x^opt, a nearby local minimizer, and a local maximizer between them.

Fig. 1. Plots of the objective function f: (a) over [−5, 5]; (b) over [−6, 6].
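The sketch below reproduces the skeleton of this experiment. Since the exact piecewise objectives of [57] are not reproduced here, grad_fi uses smooth nonconvex stand-ins of our own (double-well functions with clipped, hence bounded, gradients), and the mixing matrix is the one given below; only the procedure, not the reported numbers, matches the experiment.

```python
import numpy as np

def grad_fi(x, i):
    # Stand-in nonconvex local gradients: double wells shifted per agent,
    # clipped so the gradients are bounded (Assumption 3).
    s = (-2.0, 0.0, 2.0)[i]
    return np.clip(4.0 * (x - s) ** 3 - 4.0 * (x - s), -50.0, 50.0)

W = 0.5 * np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
x = np.zeros(3)                        # all agents initialized at 0
for k in range(5000):
    alpha = 0.5 / (1.0 + k) ** 0.5     # diminishing step size, eps = 1/2
    g = np.array([grad_fi(x[i], i) for i in range(3)])
    x = W @ x - alpha * g
print(x, x.max() - x.min())  # the spread decays at the O(alpha_k) consensus rate
```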

It is easy to compute the Lipschitz constants of ∇f_1, ∇f_2, and ∇f_3 as L_1 = 88, L_2 = 53, and L_3 = 60, respectively. Thus L_f = max_i L_i = 88, which is used in our theoretical analysis. Moreover, ∇f is obviously bounded. We consider a connected network of three agents; in the experiment, the mixing matrix W is taken as

  W = (1/2) [ 1 1 0 ; 1 0 1 ; 0 1 1 ],

which has the eigenvalues 1, 0.5, and −0.5. The mixing matrix W is not positive definite. All the assumptions used in our theory are satisfied. We test the performance of DGD with the theoretical fixed step size and several kinds of decreasing step sizes, starting the iterations from two different initial points: x^0_{(1)} := (0, 0, 0)^T and a second initialization x^0_{(2)} that is a dangerous point, since it is close to the local maximizer and lies between the two local minimizers (one of which is global). The experiment results are reported in Fig. 2. From these figures, DGD successfully converges to the global minimizer and achieves consensus from both initial points, even though the objective function is nonconvex.

Fig. 2. The convergence behaviors of DGD in different cases: panels (a)-(b) use the theoretical fixed step size from the two initializations; panels (c)-(f) use diminishing step sizes of orders O(1/t) and O(1/√t) from the two initializations. In every case the three agents reach consensus at the global minimizer x^opt.

B. Prox-DGD for decentralized L_0 regularization

We apply Prox-DGD to solve the following nonconvex decentralized L_0-regularized problem:

  minimize_{x ∈ R^p}  f(x) = Σ_{i=1}^n f_i(x),

where f_i(x) = (1/2) ||B_i x − b_i||² + λ_i ||x||_0, B_i ∈ R^{m_i × p}, b_i ∈ R^{m_i} for i = 1, ..., n, and ||x||_0 is the ℓ_0 quasi-norm, which counts the nonzero components of x. In this experiment, we take n = 10, p = 256, and m_i = 50 for i = 1, ..., n. In this case, the proximal operator of the ℓ_0 quasi-norm is the well-known hard-thresholding mapping:

  ( prox_{α||·||_0}(x) )_i = { x_i, if |x_i| > √(2α);  0, otherwise. }
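A compact sketch of this experiment follows. It is ours: the data are synthetic, the mixing matrix is a stand-in positive definite doubly stochastic choice (λ_n(W) > 0 is required for nonconvex r_i), and the step size is kept below the Theorem 3 bound λ_n(W)/L_f for typical draws of these data (L_f ≈ 530 here). With a fixed step size, the consensus error settles at a small nonzero value, matching the behavior discussed after Fig. 4 below.

```python
import numpy as np
rng = np.random.default_rng(1)

n, p, m, lam = 10, 256, 50, 0.5
B = rng.standard_normal((n, m, p))
xs = np.zeros(p); xs[:8] = 1.0                       # a sparse ground truth
b = np.einsum('imp,p->im', B, xs)

W = 0.8 * np.eye(n) + np.full((n, n), 0.02)          # doubly stochastic, lambda_n(W) = 0.8
grad = lambda x: np.einsum('imp,im->ip', B,
                           np.einsum('imp,ip->im', B, x) - b)   # rows: B_i^T(B_i x_i - b_i)
hard = lambda y, a: np.where(np.abs(y) > np.sqrt(2.0 * a), y, 0.0)

alpha = 0.001                                        # fixed step, below lambda_n(W)/L_f
x = np.zeros((n, p))
for k in range(2000):
    x = hard(W @ x - alpha * grad(x), alpha * lam)   # Prox-DGD step (7)
print((x[0] != 0).sum())                             # sparsity of agent 1's estimate
print(np.linalg.norm(x - x.mean(0), axis=1).max())   # consensus error (small, nonzero)
```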

We do not consider the model selection problem, but simply test the performance of Prox-DGD on a given deterministic model. We take λ_i = 0.5 for each agent i. The connected network is shown in Fig. 3.

Fig. 3. The network used in the decentralized L_0 regularization problem.

We use several step-size strategies, including three fixed step sizes and two decreasing step sizes, to test Prox-DGD. The initialization in all cases is 0. The experiment results are illustrated in Fig. 4.

Fig. 4. The convergence and consensus rates of Prox-DGD under different kinds of step sizes for decentralized L_0 regularization: (a) convergence rates, fixed step sizes; (b) consensus rates, fixed step sizes; (c) convergence rates, decreasing step sizes; (d) consensus rates, decreasing step sizes. In these experiments, we use the iterate x^{10000} in place of the underlying x* to compute the convergence rates; x^t_ave := (1/n) 1^T x^t denotes the average over the n agents at the tth iteration.

By Fig. 4(a) and (b), when a fixed step size is adopted (sufficiently small to satisfy the theoretical restriction), a larger fixed step size generally implies a faster convergence rate — convergence, however, is to a stationary point of the associated Lyapunov-function minimization problem, not of the original problem — while causing a larger consensus error. These phenomena are reasonable and further verify the results established in Proposition 1. It can also be observed from Fig. 4(a) that Prox-DGD performs at an almost linear rate for all three fixed step sizes. This is expected, since the L_0 regularized objective satisfies the Kurdyka-Łojasiewicz (KL) inequality with φ(s) = c s^{1−θ} for θ ∈ (0, 1/2] (see []). Thus, by Proposition 5(b), Prox-DGD has eventual linear convergence. Once the initial guess (in this case, 0 may be a good initial guess) is good enough, Prox-DGD starts decaying linearly from the early iterations. Similar phenomena can also be observed in Fig. 4(c) and (d) under the decreasing step sizes. Specifically, if α_t = O(1/t^ϵ), a smaller ϵ (larger step sizes) generally implies a faster convergence rate but a slower consensus rate. This indeed agrees with Proposition 6, which states that a larger ϵ generally means a faster consensus rate. Moreover, from Fig. 4(b) and (d), under a fixed step size the consensus error does not vanish but settles at a nonzero value, which can be overcome by using decreasing step sizes. This is the main motivation for using decreasing step sizes.

VI. PROOFS

In this section, we present the proofs of our main theorems and propositions.

A. Proof of Theorem 1

The sketch of the proof is as follows: DGD is interpreted as the gradient descent algorithm applied to the Lyapunov function L_α, following the argument in [65]; then, the properties of sufficient descent, lower boundedness, and bounded gradients are established for the sequence {L_α(x^k)}, giving subsequence convergence of the DGD iterates; finally, whole-sequence convergence of the DGD iterates follows from the KL property of L_α.

Lemma 1 (Gradient descent interpretation). The sequence {x^k} generated by the DGD iteration (5) is the same sequence generated by applying gradient descent with fixed step size α to the objective function L_α(x).

A proof of this lemma is given in [65]; it is based on reformulating (5) as the iteration

  x^{k+1} = x^k − α ( ∇f(x^k) + (1/α)(I − W) x^k ) = x^k − α ∇L_α(x^k).   (23)

Although the sequence {x^k} generated by the DGD iteration (5) can be interpreted as a centralized gradient descent sequence for the function L_α(x), it is different from gradient descent applied to the original problem (3).

Lemma 2 (Sufficient descent of {L_α(x^k)}). Let Assumptions 1 and 2 hold, and set the step size 0 < α < (1+λ_n(W))/L_f. Then

  L_α(x^{k+1}) ≤ L_α(x^k) − ( (1+λ_n(W))/(2α) − L_f/2 ) ||x^{k+1} − x^k||²,  ∀k ∈ N.   (24)
Proof: From x^{k+1} = x^k − α ∇L_α(x^k), it follows that

  ⟨∇L_α(x^k), x^{k+1} − x^k⟩ = −(1/α) ||x^{k+1} − x^k||².   (25)

Since Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable, ∇L_α is Lipschitz continuous with constant L := L_f + (1/α) λ_max(I − W) = L_f + (1 − λ_n(W))/α, implying

  L_α(x^{k+1}) ≤ L_α(x^k) + ⟨∇L_α(x^k), x^{k+1} − x^k⟩ + (L/2) ||x^{k+1} − x^k||².   (26)

Combining (25) and (26) yields (24).

Lemma 3 (Boundedness). Under Assumptions 1 and 2, if 0 < α < (1+λ_n(W))/L_f, then the sequence {L_α(x^k)} is lower bounded, and the sequence {x^k} is bounded, i.e., there exists a constant B̂ > 0 such that ||x^k|| ≤ B̂ for all k.

Proof: The lower boundedness of L_α(x^k) follows from the lower boundedness of each f_i, which is proper and coercive (Assumption 1, part 2). By Lemma 2 and the choice of α, L_α(x^k) is nonincreasing and hence upper bounded by L_α(x^0) < +∞. Hence, 1^T f(x^k) ≤ L_α(x^k) ≤ L_α(x^0) implies that x^k is bounded, due to the coercivity of 1^T f(x) (Assumption 1, part 2).

From Lemmas 2 and 3, we immediately obtain the following lemma.

Lemma 4 (Summability and asymptotic regularity). It holds that Σ_{k=0}^∞ ||x^{k+1} − x^k||² < +∞ and that ||x^{k+1} − x^k|| → 0 as k → ∞.

(A sequence {a^k} is said to be asymptotically regular if ||a^{k+1} − a^k|| → 0 as k → ∞.)
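The sufficient-descent inequality (24) is easy to spot-check numerically. The sketch below — an illustration of ours, not a substitute for the proof — uses quadratic f_i so that L_f is explicit, and asserts the predicted descent at every iteration.

```python
import numpy as np
rng = np.random.default_rng(2)

n, p = 4, 2
Q = np.eye(p) * 3.0                                      # f_i(x) = x^T Q x / 2, so L_f = 3
W = np.full((n, n), 1.0 / n) * 0.4 + 0.6 * np.eye(n)     # symmetric, doubly stochastic
lam_n = np.linalg.eigvalsh(W).min()
L_f = 3.0
alpha = 0.9 * (1 + lam_n) / L_f                          # step size within the Theorem 1 range

def L_alpha(x):
    f = 0.5 * np.einsum('ip,pq,iq->', x, Q, x)           # sum_i f_i(x_i)
    return f + np.sum(x * ((np.eye(n) - W) @ x)) / (2 * alpha)

x = rng.standard_normal((n, p))
for _ in range(20):
    x_new = W @ x - alpha * (x @ Q)                      # DGD step (5)
    drop = L_alpha(x) - L_alpha(x_new)
    bound = ((1 + lam_n) / (2 * alpha) - L_f / 2) * np.sum((x_new - x) ** 2)
    assert drop >= bound - 1e-12                         # inequality (24) holds
    x = x_new
print("sufficient descent verified")
```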

From (23), the result below directly follows.

Lemma 5 (Gradient bound). ||∇L_α(x^k)|| = (1/α) ||x^{k+1} − x^k||.

Based on the above lemmas, we obtain the global convergence of DGD.

Proof of Theorem 1: By Lemma 3, the sequence {x^k} is bounded, so there exist a convergent subsequence and a limit point, denoted by {x^{k_s}}_{s∈N} → x* as s → +∞. By Lemmas 2 and 3, L_α(x^k) is monotonically nonincreasing and lower bounded; therefore L_α(x^k) → L* for some L*, and ||x^{k+1} − x^k|| → 0 as k → ∞. Based on Lemma 5, ∇L_α(x^k) → 0 as k → ∞; in particular, ∇L_α(x^{k_s}) → 0 as s → ∞. Hence, we have ∇L_α(x*) = 0. The running best rate of the sequence {||x^{k+1} − x^k||²} follows from [3] or [6, Theorem 3.3]. By Lemma 5, the running best rate of the sequence {||∇L_α(x^k)||²} is o(1/k). By (11), ∇L_α(x^k) = ∇f(x^k) + (1/α)(I − W) x^k, which implies (1/n) 1^T ∇f(x^k) = (1/n) 1^T ∇L_α(x^k) due to 1^T (I − W) = 0. Thus,

  || (1/n) 1^T ∇f(x^k) || = || (1/n) 1^T ∇L_α(x^k) || ≤ ||∇L_α(x^k)||,

which implies that the running best rate of {||(1/n) 1^T ∇f(x^k)||²} is also o(1/k). By Lemmas 2 and 5, since ||x^{k+1} − x^k||² = α² ||∇L_α(x^k)||²,

  α² ( (1+λ_n(W))/(2α) − L_f/2 ) ||∇L_α(x^k)||² ≤ L_α(x^k) − L_α(x^{k+1}),

which implies

  (1/k) Σ_{j=0}^{k−1} ||∇L_α(x^j)||² ≤ ( L_α(x^0) − L* ) / ( k α² ( (1+λ_n(W))/(2α) − L_f/2 ) ).

Moreover, note that ||(1/n) 1^T ∇f(x^k)|| ≤ ||∇L_α(x^k)||. Thus, the convergence rate of {(1/k) Σ_{j=0}^{k−1} ||(1/n) 1^T ∇f(x^j)||²} is O(1/k). Similar to the argument in [], we can claim the global convergence of the sequence {x^k}_{k∈N} under the KL assumption on L_α.

Next, we derive a bound on the gradient sequence {∇f(x^k)}, which is used in Proposition 1.

Lemma 6. Under Assumption 1, there exists a point y* satisfying ∇f(y*) = 0, and the following bound holds:

  ||∇f(x^k)|| ≤ D := L_f ( B̂ + ||y*|| ),  ∀k ∈ N,   (27)

where B̂ is the bound on ||x^k|| given in Lemma 3.

Proof: By the lower boundedness assumption (Assumption 1, part 2), a minimizer of 1^T f(y) exists; let y* be one. Then, by the Lipschitz differentiability of each f_i (Assumption 1, part 1), we have ∇f(y*) = 0. For any k,

  ||∇f(x^k)|| = ||∇f(x^k) − ∇f(y*)|| ≤ L_f ||x^k − y*|| ≤ L_f ( B̂ + ||y*|| ),

where the last inequality uses Lemma 3. This proves the lemma.

B. Proof of Proposition 2

Proof: Note that

  ||∇L_α(x^{k+1})|| ≤ ||∇L_α(x^k)|| + ||∇L_α(x^{k+1}) − ∇L_α(x^k)||
                   ≤ (1/α) ||x^{k+1} − x^k|| + L ||x^{k+1} − x^k||
                   = ( (2 + α L_f − λ_n(W))/α ) ||x^{k+1} − x^k||,

where the second inequality holds by Lemma 5 and the Lipschitz continuity of ∇L_α with constant L = L_f + (1 − λ_n(W))/α. Thus {x^k} satisfies the so-called relative error condition listed in []. Moreover, by Lemmas 2 and 3, {x^k} also satisfies the so-called sufficient decrease and continuity conditions listed in []. Under these three conditions and the KL property of L_α at x* with φ(s) = c s^{1−θ}, following the proof of [, Lemma 2.6], there exists k_0 > 0 such that for all k ≥ k_0,

  ||x^{k+1} − x^k|| ≤ (1/2) ||x^k − x^{k−1}|| + (cb/a) [ (L_α(x^k) − L_α(x*))^{1−θ} − (L_α(x^{k+1}) − L_α(x*))^{1−θ} ],   (28)

where a := (1+λ_n(W))/(2α) − L_f/2 and b := (2 + α L_f − λ_n(W))/α. Then, an easy induction yields

  Σ_{t=k_0}^k ||x^{t+1} − x^t|| ≤ ||x^{k_0} − x^{k_0−1}|| + (cb/a) [ (L_α(x^{k_0}) − L_α(x*))^{1−θ} − (L_α(x^{k+1}) − L_α(x*))^{1−θ} ].

Following a derivation similar to the proof of [, Theorem 5], we can estimate the rate of convergence of {x^k} in the different cases of θ.

C. Proof of Proposition 3

To prove Proposition 3, we need the following lemmas.

Lemma 7 ([40, Proposition 1]). Let W^k be the kth power of W for any k ∈ N. Under Assumption 2, it holds that

  || W^k − (1/n) 1 1^T || ≤ C ζ^k   (29)

for some constant C > 0, where ζ is the second largest magnitude eigenvalue of W, as specified in (10).

Lemma 8 ([48, Lemma 3.1]). Let {γ_k} be a scalar sequence. If lim_{k→∞} γ_k = γ and 0 < β < 1, then lim_{k→∞} Σ_{l=0}^k β^{k−l} γ_l = γ/(1−β).
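Lemma 7 can be observed numerically: for the three-agent mixing matrix of Section V, the ratio ||W^k − (1/n) 1 1^T|| / ζ^k stays bounded in k. A short sketch of ours:

```python
import numpy as np

W = 0.5 * np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
n = W.shape[0]
avg = np.full((n, n), 1.0 / n)                   # the averaging matrix (1/n) 1 1^T
eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))
zeta = eigs[-2]                                  # second largest magnitude, here 0.5
Wk = np.eye(n)
for k in range(1, 11):
    Wk = Wk @ W
    print(k, np.linalg.norm(Wk - avg) / zeta ** k)   # bounded, consistent with (29)
```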

Proof of Proposition 3: By the recursion (17) and the fact that (1/n) 1 1^T W = (1/n) 1 1^T, note that

  x^k − x̄^k = ( W^k − (1/n) 1 1^T ) x^0 − Σ_{j=0}^{k−1} α_j ( W^{k−1−j} − (1/n) 1 1^T ) ∇f(x^j).   (30)

Further, by Lemma 7 and Assumption 3, we obtain

  ||x^k − x̄^k|| ≤ || W^k − (1/n) 1 1^T || ||x^0|| + Σ_{j=0}^{k−1} α_j || W^{k−1−j} − (1/n) 1 1^T || ||∇f(x^j)||
              ≤ C ||x^0|| ζ^k + CB Σ_{j=0}^{k−1} α_j ζ^{k−1−j}.   (31)

Furthermore, by Lemma 8 and the step sizes (16), we get lim_{k→∞} ||x^k − x̄^k|| = 0. Let b_k := (1+k)^ϵ. To obtain the rate of ||x^k − x̄^k||, we only need to show that limsup_k b_k ||x^k − x̄^k|| ≤ C̄ for some 0 < C̄ < ∞. Let j_k := [k/2], where [x] denotes the integer part of x for any x ∈ R. Note that

  b_k ||x^k − x̄^k|| ≤ C ||x^0|| b_k ζ^k + CB b_k Σ_{j=0}^{j_k} α_j ζ^{k−1−j} + CB b_k Σ_{j=j_k+1}^{k−1} α_j ζ^{k−1−j}
                   =: T_1 + T_2 + T_3,   (32)

where the inequality holds because of (31). We estimate the three terms on the right-hand side of (32). First,

  T_1 = C ||x^0|| b_k ζ^k → 0,   (33)

since b_k ζ^k = (1+k)^ϵ ζ^k vanishes as k → ∞. Second, for any j ≤ j_k, we have k − 1 − j ≥ k/2 − 1, and α_j ≤ 1, so

  T_2 ≤ CB b_k (j_k + 1) ζ^{k/2 − 1} ≤ CB (1+k)^{1+ϵ} ζ^{k/2 − 1} → 0.   (34)

Third, for any j ≥ j_k + 1 ≥ k/2, we have b_k α_j = ( (1+k)/(1+j) )^ϵ ≤ 2^ϵ, so

  T_3 ≤ 2^ϵ CB Σ_{j=j_k+1}^{k−1} ζ^{k−1−j} ≤ 2^ϵ CB / (1 − ζ).   (35)

Combining (32)-(35), we get limsup_k b_k ||x^k − x̄^k|| ≤ 2^ϵ CB/(1 − ζ) =: C̄, which completes the proof of this proposition.

D. Proof of Theorem 2

To prove Theorem 2, we first note that, similar to (23), the DGD iterates under decreasing step sizes can be rewritten as

  x^{k+1} = x^k − α_k ∇L_{α_k}(x^k),   (42)

where L_{α_k}(x) := 1^T f(x) + (1/(2α_k)) ||x||²_{I−W}. We also need the following lemmas.

Lemma 9 ([50]). Let {v_t} be a nonnegative scalar sequence satisfying v_{t+1} ≤ (1 + b_t) v_t − u_t + c_t for all t ∈ N, where b_t ≥ 0, u_t ≥ 0, and c_t ≥ 0 with Σ_{t=0}^∞ b_t < ∞ and Σ_{t=0}^∞ c_t < ∞. Then the sequence {v_t} converges to some v ≥ 0, and Σ_{t=0}^∞ u_t < ∞.

Lemma 10. Let α_k satisfy (16). Then it holds that

  1/α_{k+1} − 1/α_k ≤ ϵ/(1+k)^{1−ϵ}.

Proof: We first prove that

  (1 + x)^ϵ ≤ 1 + ϵx,  ∀x ∈ [0, 1].   (43)

Let g(x) = (1+x)^ϵ − 1 − ϵx. Its derivative g'(x) = ϵ(1+x)^{ϵ−1} − ϵ ≤ 0 for x ≥ 0, which implies g(x) ≤ g(0) = 0 for any x ∈ [0, 1]; that is, inequality (43) holds. Note that

  1/α_{k+1} − 1/α_k = (2+k)^ϵ − (1+k)^ϵ = (1+k)^ϵ [ (1 + 1/(1+k))^ϵ − 1 ] ≤ (1+k)^ϵ · ϵ/(1+k) = ϵ/(1+k)^{1−ϵ},   (44)

where the last inequality holds by (43).

The following lemma shows that {(1/(2α_{k+1}) − 1/(2α_k)) ||x^{k+1}||²_{I−W}} is summable.

Lemma 11. Let Assumptions 1, 2, and 3 hold. In DGD, use the step sizes α_k in (16). Then {(1/(2α_{k+1}) − 1/(2α_k)) ||x^{k+1}||²_{I−W}} is summable, i.e.,

  Σ_{k=0}^∞ ( 1/(2α_{k+1}) − 1/(2α_k) ) ||x^{k+1}||²_{I−W} < ∞.

Proof: Note that

  ||x^{k+1}||²_{I−W} = ||x^{k+1} − x̄^{k+1}||²_{I−W} ≤ (1 − λ_n(W)) ||x^{k+1} − x̄^{k+1}||².   (45)

By Lemma 10,

  ( 1/(2α_{k+1}) − 1/(2α_k) ) ||x^{k+1}||²_{I−W} ≤ ( ϵ/(2(1+k)^{1−ϵ}) ) (1 − λ_n(W)) ||x^{k+1} − x̄^{k+1}||².   (46)

Furthermore, by (46) and Proposition 3, the sequence {(1/(2α_{k+1}) − 1/(2α_k)) ||x^{k+1}||²_{I−W}} converges to 0 at the rate of O(1/(1+k)^{1+ϵ}), which implies that it is summable, i.e., Σ_{k=0}^∞ (1/(2α_{k+1}) − 1/(2α_k)) ||x^{k+1}||²_{I−W} < ∞.

Lemma 12 (Convergence of weakly summable sequences). Let {β_k} and {γ_k} be two nonnegative scalar sequences such that
(a) γ_k = 1/(1+k)^ϵ for some ϵ ∈ (0, 1] and all k ∈ N;
(b) Σ_{k=0}^∞ γ_k β_k < ∞;
(c) |β_{k+1} − β_k| ≲ γ_k, meaning that |β_{k+1} − β_k| ≤ M γ_k for some constant M > 0.
Then lim_{k→∞} β_k = 0.

We call a sequence {β_k} satisfying conditions (a) and (b) of Lemma 12 weakly summable, since {β_k} itself is not necessarily summable but becomes summable after multiplication by another non-summable, diminishing sequence {γ_k}. It is generally impossible to claim that β_k converges to 0. However, if the distance between two successive elements of {β_k} is of the same order as the multiplier sequence γ_k, then we can claim the convergence of β_k. A special case with ϵ = 1/2 has been observed in [].

Proof: By condition (b), we have

  Σ_{i=k}^{k+k̄} γ_i β_i → 0   (47)

as k → ∞, for any k̄ ∈ N. In the following, we show lim_k β_k = 0 by contradiction. Assume this is not the case, i.e., β_k does not converge to 0; then limsup_k β_k ≥ C for some C > 0. Thus, for every N > 0, there exists a k > N such that β_k > C/2. Let

  k̄ := [ C (1+k)^ϵ / (4M) ],

where [x] denotes the integer part of x for any x ∈ R. By condition (c), i.e., |β_{j+1} − β_j| ≤ M γ_j for any j ∈ N, and since γ_j ≤ γ_k for j ≥ k, we obtain

  β_{k+i} ≥ C/2 − k̄ M γ_k ≥ C/4,  ∀ i ∈ {0, 1, ..., k̄}.   (48)

Hence,

  Σ_{j=k}^{k+k̄} γ_j β_j ≥ (C/4) Σ_{j=k}^{k+k̄} 1/(1+j)^ϵ ≥ (C/4) ∫_k^{k+k̄+1} dx/(1+x)^ϵ
    = { (C/4) ( (2+k+k̄)^{1−ϵ} − (1+k)^{1−ϵ} ) / (1−ϵ),  ϵ ∈ (0, 1),
        (C/4) ( ln(2+k+k̄) − ln(1+k) ),  ϵ = 1. }   (49)

When ϵ ∈ (0, 1), since k̄ grows like (C/(4M))(1+k)^ϵ, the right-hand side of (49) is bounded below by a positive constant for all sufficiently large k. When ϵ = 1, note that by the specific form of k̄,

  ln(2+k+k̄) − ln(1+k) = ln( 1 + (1+k̄)/(1+k) ) → ln( 1 + C/(4M) ),

which is a positive constant. As a consequence, Σ_{j=k}^{k+k̄} γ_j β_j does not go to 0 as k → ∞, which contradicts (47). Therefore, lim_k β_k = 0.

Proof of Theorem 2: We first develop the inequality

  L_{α_{k+1}}(x^{k+1}) ≤ L_{α_k}(x^k) + ( 1/(2α_{k+1}) − 1/(2α_k) ) ||x^{k+1}||²_{I−W}
                      − ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||²,   (50)

and then claim the convergence of the sequences {L_{α_k}(x^k)}, {1^T f(x^k)}, and {x^k} based on this inequality.

(a) Development of (50): From x^{k+1} = x^k − α_k ∇L_{α_k}(x^k), it follows that

  ⟨∇L_{α_k}(x^k), x^{k+1} − x^k⟩ = −(1/α_k) ||x^{k+1} − x^k||².   (51)

Since Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable, ∇L_{α_k} is Lipschitz continuous with constant L_k := L_f + (1/α_k) λ_max(I − W) = L_f + (1 − λ_n(W))/α_k, implying

  L_{α_k}(x^{k+1}) ≤ L_{α_k}(x^k) + ⟨∇L_{α_k}(x^k), x^{k+1} − x^k⟩ + (L_k/2) ||x^{k+1} − x^k||²
                  = L_{α_k}(x^k) − ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||².   (52)

Moreover,

  L_{α_{k+1}}(x^{k+1}) = L_{α_k}(x^{k+1}) + ( 1/(2α_{k+1}) − 1/(2α_k) ) ||x^{k+1}||²_{I−W}.   (53)

Combining (52) and (53) yields (50).

(b) Convergence of the objective sequence: By Lemma 11 and Lemma 9, (50) yields the convergence of {L_{α_k}(x^k)} and

  Σ_{k=0}^∞ ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||² < ∞,   (54)

which implies that ||x^{k+1} − x^k||² converges to 0 at the running best rate of o(1/(1+k)^{1+ϵ}) and that {x^k} is asymptotically regular. Moreover, notice that

  (1/(2α_k)) ||x^k||²_{I−W} = (1/(2α_k)) ||x^k − x̄^k||²_{I−W} ≤ ( (1 − λ_n(W))/2 ) (1+k)^ϵ ||x^k − x̄^k||².


More information

On the linear convergence of distributed optimization over directed graphs

On the linear convergence of distributed optimization over directed graphs 1 On the linear convergence of distributed optimization over directed graphs Chenguang Xi, and Usman A. Khan arxiv:1510.0149v4 [math.oc] 7 May 016 Abstract This paper develops a fast distributed algorithm,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow

More information

Convex Analysis and Optimization Chapter 2 Solutions

Convex Analysis and Optimization Chapter 2 Solutions Convex Analysis and Optimization Chapter 2 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

On proximal-like methods for equilibrium programming

On proximal-like methods for equilibrium programming On proximal-lie methods for equilibrium programming Nils Langenberg Department of Mathematics, University of Trier 54286 Trier, Germany, langenberg@uni-trier.de Abstract In [?] Flam and Antipin discussed

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

arxiv: v3 [math.oc] 8 Jan 2019

arxiv: v3 [math.oc] 8 Jan 2019 Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

On Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging

On Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging On Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging arxiv:307.879v [math.oc] 7 Jul 03 Angelia Nedić and Soomin Lee July, 03 Dedicated to Paul Tseng Abstract This paper considers

More information

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information

Distributed Consensus Optimization

Distributed Consensus Optimization Distributed Consensus Optimization Ming Yan Michigan State University, CMSE/Mathematics September 14, 2018 Decentralized-1 Backgroundwhy andwe motivation need decentralized optimization? I Decentralized

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

Constrained Consensus and Optimization in Multi-Agent Networks

Constrained Consensus and Optimization in Multi-Agent Networks Constrained Consensus Optimization in Multi-Agent Networks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher

More information

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties

More information

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Convergence rates for distributed stochastic optimization over random networks

Convergence rates for distributed stochastic optimization over random networks Convergence rates for distributed stochastic optimization over random networs Dusan Jaovetic, Dragana Bajovic, Anit Kumar Sahu and Soummya Kar Abstract We establish the O ) convergence rate for distributed

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

arxiv: v2 [math.oc] 5 May 2018

arxiv: v2 [math.oc] 5 May 2018 The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems Viva Patel a a Department of Statistics, University of Chicago, Illinois, USA arxiv:1709.04718v2 [math.oc]

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

arxiv: v1 [stat.ml] 12 Nov 2015

arxiv: v1 [stat.ml] 12 Nov 2015 Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints Mengdi Wang, Yichen Chen, Jialin Liu, Yuantao Gu arxiv:5.03760v [stat.ml] Nov 05 November 3,

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

A derivative-free nonmonotone line search and its application to the spectral residual method

A derivative-free nonmonotone line search and its application to the spectral residual method IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Bootstrap Percolation on Periodic Trees

Bootstrap Percolation on Periodic Trees Bootstrap Percolation on Periodic Trees Milan Bradonjić Iraj Saniee Abstract We study bootstrap percolation with the threshold parameter θ 2 and the initial probability p on infinite periodic trees that

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Chapter 2. Optimization. Gradients, convexity, and ALS

Chapter 2. Optimization. Gradients, convexity, and ALS Chapter 2 Optimization Gradients, convexity, and ALS Contents Background Gradient descent Stochastic gradient descent Newton s method Alternating least squares KKT conditions 2 Motivation We can solve

More information

Decentralized Consensus Optimization with Asynchrony and Delay

Decentralized Consensus Optimization with Asynchrony and Delay Decentralized Consensus Optimization with Asynchrony and Delay Tianyu Wu, Kun Yuan 2, Qing Ling 3, Wotao Yin, and Ali H. Sayed 2 Department of Mathematics, 2 Department of Electrical Engineering, University

More information

Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity

Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity 1 Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity Sijia Liu, Member, IEEE, Pin-Yu Chen, Member, IEEE, and Alfred O. Hero, Fellow, IEEE arxiv:1704.05193v2 [stat.ml]

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

Primal-dual subgradient methods for convex problems

Primal-dual subgradient methods for convex problems Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Subgradient Methods in Network Resource Allocation: Rate Analysis

Subgradient Methods in Network Resource Allocation: Rate Analysis Subgradient Methods in Networ Resource Allocation: Rate Analysis Angelia Nedić Department of Industrial and Enterprise Systems Engineering University of Illinois Urbana-Champaign, IL 61801 Email: angelia@uiuc.edu

More information

Distributed intelligence in multi agent systems

Distributed intelligence in multi agent systems Distributed intelligence in multi agent systems Usman Khan Department of Electrical and Computer Engineering Tufts University Workshop on Distributed Optimization, Information Processing, and Learning

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Key words. saddle-point dynamics, asymptotic convergence, convex-concave functions, proximal calculus, center manifold theory, nonsmooth dynamics

Key words. saddle-point dynamics, asymptotic convergence, convex-concave functions, proximal calculus, center manifold theory, nonsmooth dynamics SADDLE-POINT DYNAMICS: CONDITIONS FOR ASYMPTOTIC STABILITY OF SADDLE POINTS ASHISH CHERUKURI, BAHMAN GHARESIFARD, AND JORGE CORTÉS Abstract. This paper considers continuously differentiable functions of

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

A Distributed Newton Method for Network Utility Maximization

A Distributed Newton Method for Network Utility Maximization A Distributed Newton Method for Networ Utility Maximization Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie Abstract Most existing wor uses dual decomposition and subgradient methods to solve Networ Utility

More information

Introduction to gradient descent

Introduction to gradient descent 6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our

More information

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Peter Ochs, Jalal Fadili, and Thomas Brox Saarland University, Saarbrücken, Germany Normandie Univ, ENSICAEN, CNRS, GREYC, France

More information

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities Caihua Chen Xiaoling Fu Bingsheng He Xiaoming Yuan January 13, 2015 Abstract. Projection type methods

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS A SIMPLE PARALLEL ALGORITHM WITH AN O(/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS HAO YU AND MICHAEL J. NEELY Abstract. This paper considers convex programs with a general (possibly non-differentiable)

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Optimality Conditions for Nonsmooth Convex Optimization

Optimality Conditions for Nonsmooth Convex Optimization Optimality Conditions for Nonsmooth Convex Optimization Sangkyun Lee Oct 22, 2014 Let us consider a convex function f : R n R, where R is the extended real field, R := R {, + }, which is proper (f never

More information

A New Sequential Optimality Condition for Constrained Nonsmooth Optimization

A New Sequential Optimality Condition for Constrained Nonsmooth Optimization A New Sequential Optimality Condition for Constrained Nonsmooth Optimization Elias Salomão Helou Sandra A. Santos Lucas E. A. Simões November 23, 2018 Abstract We introduce a sequential optimality condition

More information

Optimality Conditions

Optimality Conditions Chapter 2 Optimality Conditions 2.1 Global and Local Minima for Unconstrained Problems When a minimization problem does not have any constraints, the problem is to find the minimum of the objective function.

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Porcupine Neural Networks: (Almost) All Local Optima are Global

Porcupine Neural Networks: (Almost) All Local Optima are Global Porcupine Neural Networs: (Almost) All Local Optima are Global Soheil Feizi, Hamid Javadi, Jesse Zhang and David Tse arxiv:1710.0196v1 [stat.ml] 5 Oct 017 Stanford University Abstract Neural networs have

More information

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE Journal of Applied Analysis Vol. 6, No. 1 (2000), pp. 139 148 A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE A. W. A. TAHA Received

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information