On Nonconvex Decentralized Gradient Descent

Jinshan Zeng and Wotao Yin

Abstract: Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been proposed for convex consensus optimization. However, our understanding of the behavior of these algorithms for nonconvex consensus optimization is more limited. When we lose convexity, we cannot hope that our algorithms always return global solutions, though they sometimes still do. Somewhat surprisingly, the decentralized consensus algorithms DGD and Prox-DGD retain most of the other properties that are known in the convex setting. In particular, when diminishing (or constant) step sizes are used, we can prove convergence to a (or a neighborhood of a) consensus stationary solution and obtain guaranteed rates of convergence. It is worth noting that Prox-DGD can handle nonconvex nonsmooth functions provided that their proximal operators can be computed. Such functions include SCAD and the $\ell_q$ quasi-norms, $q \in [0,1)$. Similarly, Prox-DGD can take constraints to nonconvex sets with easy projections. To establish these properties, we have to introduce a completely different line of analysis, as well as modify existing proofs that were used in the convex setting.

Index Terms: Nonconvex decentralized computing, consensus optimization, decentralized gradient descent method, proximal decentralized gradient descent.

J. Zeng is with the College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 3300, China (jsh.zeng@gmail.com). W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095, USA (wotaoyin@ucla.edu).

I. INTRODUCTION

We consider an undirected, connected network of $n$ agents and the following consensus optimization problem defined on the network:

  $\min_{x \in \mathbb{R}^p} \; f(x) := \sum_{i=1}^{n} f_i(x)$,   (1)

where $f_i$ is a differentiable function known only to agent $i$. We also consider the consensus optimization problem in the following differentiable+proximable form:

  $\min_{x \in \mathbb{R}^p} \; s(x) := \sum_{i=1}^{n} \big( f_i(x) + r_i(x) \big)$,   (2)

where $f_i$ and $r_i$ are differentiable and proximable functions, respectively, known only to agent $i$. (We call a function proximable if its proximal operator $\mathrm{prox}_{\alpha f}(y) := \arg\min_x \{ \alpha f(x) + \frac{1}{2}\|x - y\|^2 \}$ is easy to compute.) Each function $r_i$ is possibly non-differentiable or nonconvex, or both.

The models (1) and (2) find applications in decentralized averaging, learning, estimation, and control. Some specific examples include: (i) distributed compressed sensing and machine learning problems, where $f_i$ is the data-fidelity term, which is often differentiable, and $r_i$ is a sparsity-promoting regularizer such as the $\ell_q$ quasi-norm with $0 \le q < 1$ [21], [27]; (ii) optimization problems with per-agent constraints, where $f_i$ is a differentiable objective function of agent $i$ and $r_i$ is the indicator function of the constraint set of agent $i$, that is, $r_i(x) = 0$ if $x$ satisfies the constraint and $+\infty$ otherwise [7], [20].

When the $f_i$'s are convex, the existing algorithms include the subgradient methods [6], [8], [16], [25], [28], [4], [46], [3], and the primal-dual domain methods such as the decentralized alternating direction method of multipliers (D-ADMM) [35], [36], [7], and EXTRA [37], [38]. However, when the $f_i$'s are nonconvex, few algorithms have convergence guarantees. Some existing results include [3], [4], [13], [23], [24], [39], [40], [9], [4], [43], [48]. In spite of the algorithms and their analysis in these works, the convergence of the simple algorithm Decentralized Gradient Descent (DGD) [28] with nonconvex $f_i$'s is still unknown.
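The notion of a proximable function from the footnote above can be made concrete. The following minimal NumPy sketch (our own illustration, not code from the paper) evaluates two closed-form proximal operators relevant to the examples above: soft thresholding for the convex $\ell_1$ norm, and hard thresholding for the nonconvex $\ell_0$ quasi-norm, the $q=0$ endpoint of the $\ell_q$ family.

```python
import numpy as np

def prox_l1(y, alpha):
    """Soft thresholding: prox of alpha*||x||_1 (convex)."""
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def prox_l0(y, alpha):
    """Hard thresholding: prox of alpha*||x||_0 (nonconvex).
    Per coordinate, keeping y costs alpha; zeroing it costs y^2/2,
    so entries with y^2 <= 2*alpha are set to zero."""
    x = y.copy()
    x[y**2 <= 2.0 * alpha] = 0.0
    return x

y = np.array([1.5, -0.3, 0.9])
print(prox_l1(y, 0.5))   # [ 1.   0.   0.4]
print(prox_l0(y, 0.5))   # [ 1.5  0.   0. ]  since 0.9^2 = 0.81 <= 1.0
```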
Furthermore, although DGD is slower than D-ADMM and EXTRA on convex problems, DGD is simpler and thus easier to extend to a variety of settings such as [32], [45], [26], [15], where online processing and delay tolerance are considered. Therefore, we expect our results to motivate future adoption of nonconvex DGD.

This paper studies the convergence of two algorithms: DGD for solving problem (1) and Prox-DGD for problem (2). In each DGD iteration, every agent locally computes a gradient and then updates its variable by combining the average of its neighbors' variables with a negative gradient step. In each Prox-DGD iteration, every agent locally computes a gradient of $f_i$ and a proximal map of $r_i$, as well as exchanging information with its neighbors. Both algorithms can use either a fixed step size or a sequence of decreasing step sizes.

When the problem is convex and a fixed step size is used, DGD does not converge to a solution of the original problem but to a point in its neighborhood [46]. This motivates the use of decreasing step sizes such as in [8], [16]. Assuming the $f_i$'s are convex and have Lipschitz continuous and bounded gradients, [8] shows that the decreasing step sizes $\alpha_k = 1/\sqrt{k}$ lead to a convergence rate $O(\ln k/\sqrt{k})$ of the running best of the objective errors. [16] uses nested loops and shows an outer-loop convergence rate $O(1/k^2)$ of the objective errors, utilizing Nesterov's acceleration, provided that the inner loop performs substantial consensus computation. Without a substantial inner loop, their single-loop algorithm using decreasing step sizes $\alpha_k = O(1/k)$ has the reduced rate $O(\ln k/k)$.

The objective of this paper is two-fold: (a) we aim to show that, other than losing global optimality, most existing convergence results of DGD and Prox-DGD that are known in the convex setting remain valid in the nonconvex setting, and (b) to achieve (a), we illustrate how to tailor nonconvex analysis tools for decentralized optimization. In particular, our asymptotic exact and inexact consensus results require new treatments because they are special to decentralized algorithms. The analytic results of this paper can be summarized as follows.

(a) When a fixed step size $\alpha$ is used and properly bounded, the DGD iterates converge to a stationary point of a Lyapunov function. The difference between each local estimate of $x$ and the global average of all local estimates is bounded, and the bound is proportional to $\alpha$.

(b) When decreasing step sizes $\alpha_k = O(1/(1+k)^{\epsilon})$ are used, where $0 < \epsilon \le 1$ and $k$ is the iteration number, the objective sequence converges, and the iterates of DGD are asymptotically consensual (i.e., they become equal to one another), which they achieve at the rate $O(1/(1+k)^{\epsilon})$. Moreover, we show the convergence of DGD to a stationary point of the original problem, and derive the convergence rates of DGD with different $\epsilon$ for objective functions that are convex.

(c) The convergence analysis of DGD can be extended to the algorithm Prox-DGD for solving problem (2). However, when the proximable functions $r_i$ are nonconvex, the mixing matrix is required to be positive definite and a smaller step size is also required. Otherwise, the mixing matrix may be indefinite.

Detailed comparisons between our results and the existing results on DGD and Prox-DGD are presented in Tables I and II. The global objective error rate in these two tables refers to the rate of $\{f(\bar{x}^k) - f(x^{\mathrm{opt}})\}$ or $\{s(\bar{x}^k) - s(x^{\mathrm{opt}})\}$, where $\bar{x}^k = \frac{1}{n}\sum_{i=1}^n x_i^k$ is the average of the $k$th iterate and $x^{\mathrm{opt}}$ is a global solution. Comparisons beyond DGD and Prox-DGD are presented in Section IV and Table III.

New proof techniques are introduced in this paper, particularly in the analysis of the convergence of DGD and Prox-DGD with decreasing step sizes. Specifically, the convergence of the objective sequence and the convergence to a stationary point of the original problem with decreasing step sizes are justified via a Lyapunov function and several new lemmas (cf. Lemmas 9-12 and the proof of Theorem 2). Moreover, we estimate the consensus rate by introducing an auxiliary sequence and then showing that both sequences have the same rate (cf. the proof of Proposition 3). All these proof techniques are new and distinguish our paper from existing works such as [8], [16], [28], [3], [13], [23], [40], [43].

The rest of this paper is organized as follows. Section II describes the problem setup and reviews the algorithms. Section III presents our assumptions and main results. Section IV discusses related works. Section V presents the proofs of our main results. We conclude this paper in Section VI.

Notation: Let $I$ denote the identity matrix of size $n \times n$, and let $\mathbf{1} \in \mathbb{R}^n$ denote the vector of all 1's. For a matrix $X$, $X^T$ denotes its transpose, $X_{ij}$ denotes its $(i,j)$th component, and $\|X\| := \sqrt{\langle X, X \rangle} = \sqrt{\sum_{i,j} X_{ij}^2}$ is its Frobenius norm, which reduces to the Euclidean norm when $X$ is a vector. Given a symmetric, positive semidefinite matrix $G \in \mathbb{R}^{n \times n}$, we let $\|X\|_G := \sqrt{\langle X, GX \rangle}$ be the induced semi-norm. Given a function $h$, $\mathrm{dom}(h)$ denotes its domain.

II. PROBLEM SETUP AND ALGORITHM REVIEW

Consider a connected undirected network $\mathcal{G} = \{V, E\}$, where $V$ is a set of $n$ nodes and $E$ is the edge set. Any edge $(i,j) \in E$ represents a communication link between nodes $i$ and $j$. Let $x_i \in \mathbb{R}^p$ denote the local copy of $x$ at node $i$. We reformulate the consensus problem (1) into the equivalent problem:

  $\min_{x} \; \mathbf{1}^T f(x) := \sum_{i=1}^n f_i(x_i)$, subject to $x_i = x_j$, $\forall (i,j) \in E$,   (3)

where $x \in \mathbb{R}^{n \times p}$ and $f(x) \in \mathbb{R}^n$, with

  $x := [x_1^T; x_2^T; \ldots; x_n^T]$,  $f(x) := [f_1(x_1); f_2(x_2); \ldots; f_n(x_n)]$.

In addition, the gradient of $f(x)$ is

  $\nabla f(x) := [\nabla f_1(x_1)^T; \nabla f_2(x_2)^T; \ldots; \nabla f_n(x_n)^T] \in \mathbb{R}^{n \times p}$.

The $i$th rows of the matrices $x$ and $\nabla f(x)$, and of the vector $f(x)$, correspond to agent $i$. The analysis in this paper applies to any integer $p \ge 1$. For simplicity, one can let $p = 1$ and treat $x$ and $\nabla f(x)$ as vectors rather than matrices.

The algorithm DGD [28] for (3) is described as follows: pick an arbitrary $x^0$.
For $k = 0, 1, \ldots$, compute

  $x^{k+1} = W x^k - \alpha \nabla f(x^k)$,   (4)

where $W$ is a mixing matrix and $\alpha > 0$ is a step-size parameter.

Similarly, we can reformulate the composite problem (2) in the following equivalent form:

  $\min_{x} \; \sum_{i=1}^n \big( f_i(x_i) + r_i(x_i) \big)$, subject to $x_i = x_j$, $\forall (i,j) \in E$.   (5)

Let $r(x) := \sum_{i=1}^n r_i(x_i)$. The algorithm Prox-DGD can be applied to the above problem (5):

Prox-DGD: take an arbitrary $x^0$. For $k = 0, 1, \ldots$, perform

  $x^{k+1} = \mathrm{prox}_{\alpha r}\big( W x^k - \alpha \nabla f(x^k) \big)$,   (6)

where the proximal operator is

  $\mathrm{prox}_{\alpha r}(x) := \arg\min_{u \in \mathbb{R}^{n \times p}} \big\{ \alpha r(u) + \tfrac{1}{2} \|u - x\|^2 \big\}$.   (7)

III. ASSUMPTIONS AND MAIN RESULTS

This section presents all of our main results.
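Before stating the assumptions, iterations (4) and (6) can be made concrete with a short simulation. The following is a minimal NumPy sketch (our own illustration; the ring graph, quadratic $f_i$, and $\ell_1$ choice of $r_i$ are assumptions, not from the paper) running both updates with $p = 1$ and a fixed step size.

```python
import numpy as np

np.random.seed(0)
n = 5
# Ring graph with uniform weights: a symmetric, doubly stochastic W.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0

targets = np.random.randn(n)          # agent i holds f_i(x) = (x - t_i)^2 / 2
grad_f = lambda x: x - targets        # stacked gradients, one row per agent

def prox_l1(y, a):                    # prox of a*|x|, i.e. soft thresholding
    return np.sign(y) * np.maximum(np.abs(y) - a, 0.0)

alpha = 0.1
x_dgd = np.zeros(n)
x_pdgd = np.zeros(n)
for k in range(200):
    x_dgd = W @ x_dgd - alpha * grad_f(x_dgd)                     # DGD, (4)
    x_pdgd = prox_l1(W @ x_pdgd - alpha * grad_f(x_pdgd), alpha)  # Prox-DGD, (6)

print(x_dgd)    # nearly consensual, close to mean(targets)
print(x_pdgd)   # nearly consensual, shrunk toward 0 by the l1 term
```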

TABLE I
COMPARISON OF DIFFERENT ALGORITHMS FOR THE CONSENSUS SMOOTH OPTIMIZATION PROBLEM (1).

| | DGD [46] (fixed step size) | DGD, this paper (fixed step size) | D-NG [16] (decreasing step sizes) | DGD, this paper (decreasing step sizes) |
|---|---|---|---|---|
| $f_i$ | convex only | nonconvex | convex only | nonconvex |
| $\nabla f_i$ | Lipschitz | Lipschitz | Lipschitz, bounded | Lipschitz, bounded |
| step size | $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$ | $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$ | $O(1/k)$ with Nesterov acc. | $\frac{1}{(1+k)^{\epsilon}}$, $\epsilon \in (0,1]$ |
| consensus error | $O(\alpha)$ | $O(\alpha)$ | $O(1/k)$ | $O(1/(1+k)^{\epsilon})$ |
| $\min_{j \le k} \|x^{j+1}-x^j\|^2$ | no rate | $o(1/k)$ | no rate | $o(1/k^{1+\epsilon})$ |
| global objective error | $O(1/k)$ until reaching an error $O(\frac{\alpha}{(1-\zeta)^2})$ | convex: $O(1/k)$ until reaching an error $O(\frac{\alpha}{(1-\zeta)^2})$; nonconvex: no rate | $O(\ln k/k)$ | convex: $O(\ln k/\sqrt{k})$ if $\epsilon = 1/2$, $O(1/\ln k)$ if $\epsilon = 1$, $O(1/k^{\min\{\epsilon, 1-\epsilon\}})$ for other $\epsilon$; nonconvex: no rate |

The objective error rates of DGD and Prox-DGD obtained in this paper, and those of the (convex) DProx-Grad [8], are ergodic or running-best rates.

TABLE II
COMPARISON OF DIFFERENT ALGORITHMS FOR THE CONSENSUS COMPOSITE OPTIMIZATION PROBLEM (2).

| | AccDProx-Grad [6], DProx-Grad [8] (fixed step size) | Prox-DGD, this paper (fixed step size) | DProx-Grad [8] (decreasing step sizes) | Prox-DGD, this paper (decreasing step sizes) |
|---|---|---|---|---|
| $f_i$, $r_i$ | convex only | nonconvex | convex only | nonconvex |
| $\nabla f_i$ | Lipschitz, bounded | Lipschitz | Lipschitz, bounded | Lipschitz, bounded |
| $\partial r_i$ | bounded | - | bounded | bounded |
| step size | $0 < \alpha < \frac{1}{L_f}$ | $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$ (convex $r_i$); $0 < \alpha < \frac{\lambda_n(W)}{L_f}$ (nonconvex $r_i$, $\lambda_n(W) > 0$) | $O(1/\sqrt{k})$ | $\frac{1}{(1+k)^{\epsilon}}$, $\epsilon \in (0,1]$ |
| consensus error | $O(\gamma^k)$, $0 < \gamma < 1$ | $O(\alpha)$ | $O(1/\sqrt{k})$ | $O(1/(1+k)^{\epsilon})$ |
| $\min_{j \le k} \|x^{j+1}-x^j\|^2$ | no rate | $o(1/k)$ | no rate | $o(1/k^{1+\epsilon})$ |
| global objective error | $O(\frac{D_1}{\alpha k} + D_2 \alpha)$, $D_1, D_2 > 0$ | convex: of the form $\frac{D_3}{\alpha k} + D_4 \alpha$, $D_3, D_4 > 0$; nonconvex: no rate | $O(\ln k/\sqrt{k})$ | convex: $O(\ln k/\sqrt{k})$ if $\epsilon = 1/2$, $O(1/\ln k)$ if $\epsilon = 1$, $O(1/k^{\min\{\epsilon, 1-\epsilon\}})$ for other $\epsilon$; nonconvex: no rate |

The objective error rates are ergodic or running-best rates.

A. Definitions and assumptions

Definition 1 (Lipschitz differentiability). A function $h$ is called Lipschitz differentiable if $h$ is differentiable and its gradient $\nabla h$ is Lipschitz continuous, i.e., $\|\nabla h(u) - \nabla h(v)\| \le L \|u - v\|$ for all $u, v \in \mathrm{dom}(h)$, where $L > 0$ is its Lipschitz constant.

Definition 2 (Coercivity). A function $h$ is called coercive if $\|u\| \to +\infty$ implies $h(u) \to +\infty$.
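Definitions 1 and 2 can be probed numerically. The sketch below (our own illustration; the test function $h$ is an assumption) estimates an empirical lower bound on the Lipschitz constant $L$ of $\nabla h$ by sampling gradient difference quotients, and checks coercivity by evaluating $h$ along growing inputs.

```python
import numpy as np

h      = lambda x: 0.5 * x**2 + np.sin(x)   # smooth, nonconvex, coercive
grad_h = lambda x: x + np.cos(x)            # h''(x) = 1 - sin(x) in [0, 2], so L = 2

rng = np.random.default_rng(1)
u, v = rng.normal(size=10000), rng.normal(size=10000)
ratios = np.abs(grad_h(u) - grad_h(v)) / np.abs(u - v)
print(ratios.max())                          # close to 2: empirical lower bound on L

print([h(10.0**j) for j in range(1, 5)])     # grows without bound: coercive
```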

TABLE III
COMPARISON OF THE SCENARIOS COVERED BY DIFFERENT NONCONVEX DECENTRALIZED ALGORITHMS, in terms of smooth $f_i$, nonsmooth $r_i$ (convex or nonconvex), step size (fixed or diminishing), network (static or dynamic), mixing matrix $W$, algorithm type (deterministic or stochastic), and fusion scheme (ATC or CTA).

| algorithm | mixing matrix $W$ |
|---|---|
| DGD (this paper) | doubly stochastic |
| Perturbed Push-sum [40] | column stochastic |
| ZENITH [13] | doubly stochastic |
| Prox-DGD (this paper) | doubly stochastic |
| NEXT [23] | doubly stochastic |
| DeFW [43] | doubly stochastic |
| Proj SGD [3] | row stochastic |

The full names of the abbreviations are: cvx (convex), ncvx (nonconvex), diminish (diminishing), determin (deterministic), ATC (adaptive-then-combine), CTA (combine-then-adapt), doubly (doubly stochastic), column (column stochastic), row (row stochastic). A row, column, or doubly stochastic $W$ means that $W\mathbf{1} = \mathbf{1}$, $\mathbf{1}^T W = \mathbf{1}^T$, or both hold, respectively.

The next definition describes a property that many functions have (see [44] for examples) and that can help obtain whole-sequence convergence from subsequence convergence.

Definition 3 (Kurdyka-Łojasiewicz (KŁ) property [1], [5], [22]). A function $h: \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ has the KŁ property at $\bar{x} \in \mathrm{dom}(\partial h)$ if there exist $\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar{x}$, and a continuous concave function $\varphi: [0, \eta) \to \mathbb{R}_+$ such that: (i) $\varphi(0) = 0$ and $\varphi$ is differentiable on $(0, \eta)$; (ii) for all $s \in (0, \eta)$, $\varphi'(s) > 0$; (iii) for all $x \in U \cap \{x : h(\bar{x}) < h(x) < h(\bar{x}) + \eta\}$, the KŁ inequality holds:

  $\varphi'\big( h(x) - h(\bar{x}) \big) \cdot \mathrm{dist}\big( 0, \partial h(x) \big) \ge 1$.   (8)

Proper lower semi-continuous functions that satisfy the KŁ inequality at each point of $\mathrm{dom}(\partial h)$ are called KŁ functions.

Assumption 1 (Objective). The objective functions $f_i: \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$, $i = 1, \ldots, n$, satisfy the following:
1) $f_i$ is Lipschitz differentiable with constant $L_i > 0$;
2) $f_i$ is proper (i.e., not everywhere infinite) and coercive.
The sum $\sum_{i=1}^n f_i(x_i)$ is $L_f$-Lipschitz differentiable with $L_f \le \max_i L_i$. In addition, each $f_i$ is lower bounded, following Part 2 of the above assumption.

Assumption 2 (Mixing matrix). The mixing matrix $W = [w_{ij}] \in \mathbb{R}^{n \times n}$ has the following properties:
1) (Graph) If $i \ne j$ and $(i,j) \notin E$, then $w_{ij} = 0$; otherwise, $w_{ij} > 0$.
2) (Symmetry) $W = W^T$.
3) (Null space property) $\mathrm{null}\{I - W\} = \mathrm{span}\{\mathbf{1}\}$.
4) (Spectral property) $-I \prec W \preceq I$.

By Assumption 2, a solution $x^{\mathrm{opt}}$ to problem (3) satisfies $(I - W) x^{\mathrm{opt}} = 0$. (Whole-sequence convergence from any starting point is referred to as global convergence in the literature. Its limit is not necessarily a global solution.) Due to the symmetry of $W$, its eigenvalues are real and can be sorted in nonincreasing order. Let $\lambda_i(W)$ denote the $i$th largest eigenvalue of $W$. Then, by Assumption 2,

  $\lambda_1(W) = 1 > \lambda_2(W) \ge \cdots \ge \lambda_n(W) > -1$.

Let $\zeta$ be the second largest magnitude eigenvalue of $W$. Then

  $\zeta = \max\{ |\lambda_2(W)|, |\lambda_n(W)| \}$.   (9)
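Assumption 2 is satisfied by standard constructions. The sketch below (our own illustration; the example graph is an assumption) builds Metropolis-Hastings weights for an undirected graph and computes the spectral quantities $\lambda_n(W)$ and $\zeta$ of (9).

```python
import numpy as np

def metropolis_W(edges, n):
    """Metropolis-Hastings weights: symmetric and doubly stochastic;
    for a connected graph this construction satisfies Assumption 2."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1; deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    W += np.diag(1.0 - W.sum(axis=1))   # self-weights fix the row sums to 1
    return W

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]
W = metropolis_W(edges, 5)
lam = np.sort(np.linalg.eigvalsh(W))[::-1]   # 1 = lam_1 > lam_2 >= ... > -1
zeta = max(abs(lam[1]), abs(lam[-1]))        # second largest magnitude, cf. (9)
print(lam, zeta)
```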
B. Convergence results of DGD

We consider the convergence of DGD with both a fixed step size and a sequence of decreasing step sizes.

1) Convergence results of DGD with a fixed step size: The convergence result of DGD with a fixed step size (i.e., $\alpha_k \equiv \alpha$) is established based on the Lyapunov function [46]:

  $L_{\alpha}(x) := \mathbf{1}^T f(x) + \frac{1}{2\alpha} \|x\|_{I-W}^2$.   (10)

It is worth reminding the reader that convexity is not assumed.

Theorem 1 (Global convergence). Let $\{x^k\}$ be the sequence generated by DGD (4) with step size $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$. Let Assumptions 1 and 2 hold. Then $\{x^k\}$ has at least one accumulation point $x^*$, and any such point is a stationary point of $L_{\alpha}(x)$. Furthermore, the running best rates of the sequences $\{\|x^{k+1} - x^k\|^2\}$ and $\{\|\nabla L_{\alpha}(x^k)\|^2\}$ are both $o(1/k)$. In addition, if $L_{\alpha}$ satisfies the KŁ property at an accumulation point $x^*$, then $\{x^k\}$ globally converges to $x^*$.

Remark 1. Let $x^*$ be a stationary point of $L_{\alpha}(x)$; thus

  $0 = \nabla f(x^*) + \frac{1}{\alpha}(I - W) x^*$.   (11)

Since $\mathbf{1}^T (I - W) = 0$, (11) yields $0 = \mathbf{1}^T \nabla f(x^*)$, indicating that $x^*$ is also a stationary point of the separable function $\sum_{i=1}^n f_i(x_i)$. Since the rows of $x^*$ are not necessarily identical, we cannot say that $x^*$ is a stationary point of problem (3). However, the differences between the rows of $x^*$ are bounded, following our next result, adapted from [46].

(Given a nonnegative sequence $a_k$, its running best sequence is $b_k = \min\{a_i : i \le k\}$. We say $a_k$ has a running best rate of $o(1/k)$ if $b_k = o(1/k)$. These squared quantities naturally appear in the analysis, so we keep the squares.)

Proposition 1 (Consensual bound on $x^k$). For each iteration $k$, define $\bar{x}^k := \frac{1}{n}\sum_{i=1}^n x_i^k$. Then it holds for each node $i$ that

  $\|x_i^k - \bar{x}^k\| \le \frac{\alpha D}{1 - \zeta}$,   (12)

where $D$ is a universal bound on $\|\nabla f(x^k)\|$ given in Lemma 6 below, and $\zeta$ is the second largest magnitude eigenvalue of $W$ specified in (9). As $k \to \infty$, (12) yields the consensual bound

  $\|x_i^* - \bar{x}^*\| \le \frac{\alpha D}{1 - \zeta}$,

where $\bar{x}^* := \frac{1}{n}\sum_{i=1}^n x_i^*$.

In Proposition 1, the consensual bound is proportional to the step size $\alpha$ and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of $W$.

Let us compare the DGD iteration (4) with the iteration of centralized gradient descent for $f(x)$. Averaging the rows of (4) yields the following comparison:

  DGD averaged:  $\bar{x}^{k+1} = \bar{x}^k - \frac{\alpha}{n} \sum_{i=1}^n \nabla f_i(x_i^k)$.   (13)
  Centralized:   $\bar{x}^{k+1} = \bar{x}^k - \frac{\alpha}{n} \sum_{i=1}^n \nabla f_i(\bar{x}^k)$.   (14)

Apparently, DGD approximates centralized gradient descent by evaluating each $\nabla f_i$ at the local variable $x_i^k$ instead of the global average $\bar{x}^k$. We can estimate the error of this approximation as

  $\big\| \frac{1}{n}\sum_{i=1}^n \nabla f_i(x_i^k) - \frac{1}{n}\sum_{i=1}^n \nabla f_i(\bar{x}^k) \big\| \le \frac{1}{n}\sum_{i=1}^n \|\nabla f_i(x_i^k) - \nabla f_i(\bar{x}^k)\| \le \frac{L_f \alpha D}{1-\zeta}$.

Unlike the convex analysis in [46], it is impossible to bound the difference between the sequences of (13) and (14) without convexity, because the two sequences may converge to different stationary points of $L_{\alpha}$.

Remark 2. The KŁ assumption on $L_{\alpha}$ in Theorem 1 is satisfied if each $f_i$ is a sub-analytic function. Since $\|x\|_{I-W}^2$ is obviously sub-analytic and the sum of two sub-analytic functions remains sub-analytic, $L_{\alpha}$ is sub-analytic if each $f_i$ is. See [44] for more details and examples.

Proposition 2 (KŁ convergence rates). Let the assumptions of Theorem 1 hold. Suppose that $L_{\alpha}$ satisfies the KŁ inequality at an accumulation point $x^*$ with $\varphi(s) = c s^{1-\theta}$ for some constant $c > 0$. Then the following convergence rates hold:
(a) If $\theta = 0$, $\{x^k\}$ converges to $x^*$ in finitely many iterations.
(b) If $\theta \in (0, 1/2]$, $\|x^k - x^*\| \le C_0 \tau^k$ for all $k \ge k_0$, for some $k_0 > 0$, $C_0 > 0$, and $\tau \in [0, 1)$.
(c) If $\theta \in (1/2, 1)$, $\|x^k - x^*\| \le C_0 k^{-(1-\theta)/(2\theta-1)}$ for all $k \ge k_0$, for certain $k_0 > 0$ and $C_0 > 0$.

Note that the rates in parts (b) and (c) of Proposition 2 are of the eventual type. Using fixed step sizes, our results are limited in that the stationary point $x^*$ of $L_{\alpha}$ is not a stationary point of the original problem; we only have the consensual bound on $x^*$. To address this issue, the next subsection uses decreasing step sizes and presents better convergence results.

2) Convergence of DGD with decreasing step sizes: The positive consensual error bound in Proposition 1, which is proportional to the constant step size $\alpha$, motivates the use of properly decreasing step sizes $\alpha_k = O(1/(1+k)^{\epsilon})$, for some $0 < \epsilon \le 1$, to diminish the consensual bound to 0. As a result, any accumulation point $x^*$ becomes a stationary point of the original problem (3). To analyze DGD with decreasing step sizes, we add the following assumption.

Assumption 3 (Bounded gradient). For any $k$, $\nabla f(x^k)$ is uniformly bounded by some constant $B > 0$, i.e., $\|\nabla f(x^k)\| \le B$.

Note that the bounded gradient assumption is a regular assumption in the convergence analysis of decentralized gradient methods (see, e.g., [3], [4], [13], [23], [24], [39], [40], [9], [43]), even in the convex setting [16] and also [8], though it is not required for centralized gradient descent.

We take the step size sequence

  $\alpha_k = \frac{1}{(1+k)^{\epsilon}}$, $0 < \epsilon \le 1$,   (15)

throughout the rest of this section. The numerator can be replaced by any positive constant. By iteratively applying iteration (4), we obtain the expression

  $x^k = W^k x^0 - \sum_{j=0}^{k-1} \alpha_j W^{k-1-j} \nabla f(x^j)$.   (16)
Proposition 3 (Asymptotic consensus rate). Let Assumptions 2 and 3 hold, and let DGD use the step sizes (15). Define $\bar{x}^k := \frac{1}{n} \mathbf{1}\mathbf{1}^T x^k$. Then $\|x^k - \bar{x}^k\|$ converges to 0 at the rate $O(1/(1+k)^{\epsilon})$.

According to Proposition 3, the iterates of DGD with decreasing step sizes reach consensus asymptotically, in contrast to the nonzero bound of the fixed step size case in Proposition 1. Moreover, a larger $\epsilon$, i.e., faster decaying step sizes, generally implies a faster asymptotic consensus rate. Note that $(I - W)\bar{x}^k = 0$ and thus $\|x^k\|_{I-W} = \|x^k - \bar{x}^k\|_{I-W}$. Therefore, the above proposition implies the following result.

Corollary 1. Apply the setting of Proposition 3. Then $\|x^k\|_{I-W}$ converges to 0 at the rate $O(1/(1+k)^{\epsilon})$.

Corollary 1 shows that the sequence $\{x^k\}$ measured in the $(I-W)$ semi-norm decays to 0 at a sublinear rate. For any global consensual solution $x^{\mathrm{opt}}$ to problem (3), we have $\|x^k - x^{\mathrm{opt}}\|_{I-W} = \|x^k\|_{I-W}$; so, if $\{x^k\}$ does converge to $x^{\mathrm{opt}}$, then their distance in this semi-norm decays at $O(1/k^{\epsilon})$.
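The rate in Proposition 3 is easy to observe empirically. A minimal sketch (our own illustration; the ring graph and quadratic $f_i$ are assumptions) runs DGD with $\alpha_k = 1/(1+k)^{\epsilon}$ and reports the consensus error $\|x^k - \bar{x}^k\|$ for two values of $\epsilon$.

```python
import numpy as np

np.random.seed(0)
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0
targets = np.random.randn(n)
grad_f = lambda x: x - targets        # f_i(x) = (x - t_i)^2 / 2

for eps in (0.3, 1.0):
    x = np.zeros(n)
    for k in range(10000):
        alpha_k = 1.0 / (1 + k) ** eps          # step sizes (15)
        x = W @ x - alpha_k * grad_f(x)
    print(eps, np.linalg.norm(x - x.mean()))    # ||x^k - xbar^k||
    # larger eps => smaller consensus error at the same iteration count
```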

Theorem 2 (Convergence). Let Assumptions 1, 2, and 3 hold. Let DGD use the step sizes (15). Then:
(a) $\{L_{\alpha_k}(x^k)\}$ and $\{\mathbf{1}^T f(x^k)\}$ converge to the same limit;
(b) $\lim_{k\to\infty} \mathbf{1}^T \nabla f(x^k) = 0$, and any limit point of $\{x^k\}$ is a stationary point of problem (3);
(c) in addition, if there exists an isolated accumulation point, then $\{x^k\}$ converges.

In the proof of Theorem 2, we will establish

  $\sum_{k=0}^{\infty} \frac{1 + \lambda_n(W) - \alpha_k L_f}{2\alpha_k} \|x^{k+1} - x^k\|^2 < \infty$,

which implies that the running best rate of the sequence $\{\|x^{k+1} - x^k\|^2\}$ is $o(1/(1+k)^{1+\epsilon})$.

Theorem 2 shows that the objective sequence converges and that any limit point of $\{x^k\}$ is a stationary point of the original problem. However, there is no result on the convergence rate of the objective sequence to an optimal value, and it is generally difficult to obtain such a rate without convexity. Although our primary focus is nonconvexity, we next assume convexity and present the objective convergence rate, which has an interesting relation with $\epsilon$. For any $x \in \mathbb{R}^{n \times p}$, recall $\mathbf{1}^T f(x) = \sum_{i=1}^n f_i(x_i)$. Even if the $f_i$'s are convex, the solution to (3) may be non-unique. Thus, let $X^*$ be the set of solutions to (3). Given $x^0$, we pick the solution $x^{\mathrm{opt}} = \mathrm{Proj}_{X^*}(x^0) \in X^*$. Also let $f^{\mathrm{opt}} = f(x^{\mathrm{opt}})$ be the optimal value of (1). Define the ergodic objective:

  $\bar{f}^k := \frac{\sum_{j=0}^{k} \alpha_j f(\bar{x}^{j+1})}{\sum_{j=0}^{k} \alpha_j}$,   (17)

where $\bar{x}^{j+1} = \frac{1}{n}\mathbf{1}^T x^{j+1}$. Obviously,

  $\bar{f}^k \ge \min_{j=1,\ldots,k+1} f(\bar{x}^j)$.   (18)

Proposition 4 (Convergence rates under convexity). Let Assumptions 1, 2, and 3 hold. Let DGD use the step sizes (15). If $\lambda_n(W) > 0$ and each $f_i$ is convex, then $\{\bar{f}^k\}$ defined in (17) converges to the optimal objective value $f^{\mathrm{opt}}$ at the following rates:
(a) if $0 < \epsilon < 1/2$, the rate is $O(1/k^{\epsilon})$;
(b) if $\epsilon = 1/2$, the rate is $O(\ln k/\sqrt{k})$;
(c) if $1/2 < \epsilon < 1$, the rate is $O(1/k^{1-\epsilon})$;
(d) if $\epsilon = 1$, the rate is $O(1/\ln k)$.

The convergence rates established in Proposition 4 are almost as good as $O(1/\sqrt{k})$ when $\epsilon = 1/2$. As $\epsilon$ goes to either 0 or 1, the rates become slower, so $\epsilon = 1/2$ may be the optimal choice in terms of the convergence rate. However, by Proposition 3, a larger $\epsilon$ implies a faster consensus rate. Therefore, there is a tradeoff in choosing an appropriate $\epsilon$ in practical implementations of DGD.

C. Convergence results of Prox-DGD

Similarly, we consider the convergence of Prox-DGD with both a fixed step size and decreasing step sizes. The iteration (6) can be reformulated as

  $x^{k+1} = \mathrm{prox}_{\alpha r}\big( x^k - \alpha \nabla L_{\alpha}(x^k) \big)$,   (19)

based on which we define the Lyapunov function

  $\hat{L}_{\alpha}(x) := L_{\alpha}(x) + r(x)$,

where we recall $L_{\alpha}(x) = \sum_{i=1}^n f_i(x_i) + \frac{1}{2\alpha}\|x\|_{I-W}^2$. Then (19) is clearly the forward-backward splitting (a.k.a. prox-gradient) iteration for

  $\min_{x} \; \hat{L}_{\alpha}(x)$.

Specifically, (19) first performs gradient descent on the differentiable function $L_{\alpha}(x)$ and then computes the proximal map of $r(x)$. To analyze Prox-DGD, we revise Assumption 1 as follows.

Assumption 4 (Composite objective). The objective function of (5) satisfies the following:
1) Each $f_i$ is Lipschitz differentiable with constant $L_i > 0$.
2) Each $f_i + r_i$ is proper, lower semi-continuous, and coercive.
As before, $\sum_{i=1}^n f_i(x_i)$ is $L_f$-Lipschitz differentiable with $L_f \le \max_i L_i$.

1) Convergence results of Prox-DGD with a fixed step size: Based on the above assumptions, we obtain the global convergence of Prox-DGD as follows.

Theorem 3 (Global convergence of Prox-DGD). Let $\{x^k\}$ be the sequence generated by Prox-DGD (6), where the step size $\alpha$ satisfies $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$ when the $r_i$'s are convex, and $0 < \alpha < \frac{\lambda_n(W)}{L_f}$ when the $r_i$'s are not necessarily convex (this case requires $\lambda_n(W) > 0$). Let Assumptions 2 and 4 hold. Then $\{x^k\}$ has at least one accumulation point $x^*$, and any accumulation point is a stationary point of $\hat{L}_{\alpha}(x)$. Furthermore, the running best rates of the sequences $\{\|x^{k+1} - x^k\|^2\}$ and $\{\|g^{k+1}\|^2\}$ (where $g^{k+1}$ is defined in Lemma 18) are both $o(1/k)$. In addition, if $\hat{L}_{\alpha}$ satisfies the KŁ property at an accumulation point $x^*$, then $\{x^k\}$ converges to $x^*$.

The rate of convergence of Prox-DGD can also be established by leveraging the KŁ property.

Proposition 5 (Rate of convergence of Prox-DGD).
Under the assumptions of Theorem 3, suppose that $\hat{L}_{\alpha}$ satisfies the KŁ inequality at an accumulation point $x^*$ with $\varphi(s) = c s^{1-\theta}$ for some constant $c > 0$. Then the following hold:
(a) If $\theta = 0$, $\{x^k\}$ converges to $x^*$ in finitely many iterations.
(b) If $\theta \in (0, 1/2]$, $\|x^k - x^*\| \le C \tau^k$ for all $k \ge k_0$, for some $k_0 > 0$, $C > 0$, and $\tau \in [0, 1)$.
(c) If $\theta \in (1/2, 1)$, $\|x^k - x^*\| \le C k^{-(1-\theta)/(2\theta-1)}$ for all $k \ge k_0$, for certain $k_0 > 0$ and $C > 0$.

2) Convergence of Prox-DGD with decreasing step sizes: In Prox-DGD, we also use the decreasing step sizes (15). To investigate its convergence, the bounded gradient assumption (Assumption 3) is revised as follows.

Assumption 5 (Bounded composite subgradient). For each $i$, $\nabla f_i$ is uniformly bounded by some constant $B_i > 0$, i.e., $\|\nabla f_i(x)\| \le B_i$ for any $x \in \mathbb{R}^p$. Moreover, $\|\xi_i\| \le B_{r_i}$ for any $\xi_i \in \partial r_i(x)$ and $x \in \mathbb{R}^p$, $i = 1, \ldots, n$.

Let $B := \big( \sum_{i=1}^n (B_i + B_{r_i})^2 \big)^{1/2}$. Then $\|\nabla f(x) + \xi\|$, where $\xi \in \partial r(x)$, is uniformly bounded by $B$ for any $x \in \mathbb{R}^{n \times p}$.

(A nonnegative sequence $a_k$ induces its running best sequence $b_k = \min\{a_i : i \le k\}$; therefore, $a_k$ has running best rate $o(1/k)$ if $b_k = o(1/k)$.)
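The "running best" device in the footnote above is simple to compute. The sketch below (our own illustration) forms $b_k = \min_{i \le k} a_i$ and illustrates the standard fact behind the $o(1/k)$ claims: if a nonnegative sequence is summable, then $k \, b_k \to 0$.

```python
import numpy as np

a = 1.0 / (np.arange(1, 10**6) ** 1.5)      # summable: sum_k a_k < infinity
b = np.minimum.accumulate(a)                # running best b_k = min_{i<=k} a_i
k = np.arange(1, len(a) + 1)
print((k * b)[[10, 10**3, 10**5]])          # k * b_k -> 0, i.e. b_k = o(1/k)
```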

Note that the same assumption is used to analyze the convergence of the distributed proximal-gradient method in the convex setting [6], [8], and it is also widely used in the convergence analysis of nonconvex decentralized algorithms such as [3], [4]. In light of Lemma 19 below, the claims of Proposition 3 and Corollary 1 also hold for Prox-DGD.

Proposition 6 (Asymptotic consensus and rate). Let Assumptions 2 and 5 hold. In Prox-DGD, use the step sizes (15). Then

  $\|x^k - \bar{x}^k\| \le C \|x^0\| \zeta^k + C B \sum_{j=0}^{k-1} \alpha_j \zeta^{k-1-j}$,

and $\|x^k - \bar{x}^k\|$ converges to 0 at the rate $O(1/(1+k)^{\epsilon})$. Moreover, let $x^{\mathrm{opt}}$ be any global solution of problem (5). Then $\|x^k - x^{\mathrm{opt}}\|_{I-W} = \|x^k\|_{I-W} = \|x^k - \bar{x}^k\|_{I-W}$ converges to 0 at the rate $O(1/(1+k)^{\epsilon})$.

For any $x \in \mathbb{R}^{n \times p}$, define $s(x) := \sum_{i=1}^n \big( f_i(x_i) + r_i(x_i) \big)$. Let $X^*$ be the set of solutions of (5), $x^{\mathrm{opt}} = \mathrm{Proj}_{X^*}(x^0) \in X^*$, and $s^{\mathrm{opt}} = s(x^{\mathrm{opt}})$ the optimal value of (5). Define

  $\bar{s}^k := \frac{\sum_{j=0}^{k} \alpha_j s(\bar{x}^{j+1})}{\sum_{j=0}^{k} \alpha_j}$.   (20)

Theorem 4 (Convergence and rate). Let Assumptions 2, 4, and 5 hold. In Prox-DGD, use the step sizes (15). Then:
(a) $\{\hat{L}_{\alpha_k}(x^k)\}$ and $\{\sum_{i=1}^n (f_i(x_i^k) + r_i(x_i^k))\}$ converge to the same limit;
(b) $\sum_{k=0}^{\infty} \frac{1 + \lambda_n(W) - \alpha_k L_f}{2\alpha_k} \|x^{k+1} - x^k\|^2 < \infty$ when the $r_i$'s are convex, or $\sum_{k=0}^{\infty} \frac{\lambda_n(W) - \alpha_k L_f}{2\alpha_k} \|x^{k+1} - x^k\|^2 < \infty$ when the $r_i$'s are not necessarily convex (this case requires $\lambda_n(W) > 0$);
(c) if $\{\xi^k\}$ satisfies $\|\xi^{k+1} - \xi^k\| \le L_r \|x^{k+1} - x^k\|$ for each $k > k_0$, some constant $L_r > 0$, and a sufficiently large integer $k_0 > 0$, then $\lim_{k\to\infty} \mathbf{1}^T \big( \nabla f(x^{k+1}) + \xi^{k+1} \big) = 0$, where $\xi^{k+1} \in \partial r(x^{k+1})$ is the one determined by the proximal operator (7), and any limit point is a stationary point of problem (5);
(d) in addition, if there exists an isolated accumulation point, then $\{x^k\}$ converges;
(e) furthermore, if the $f_i$'s and $r_i$'s are convex and $\lambda_n(W) > 0$, then the claims on the rates of $\{\bar{f}^k\}$ in Proposition 4 hold for the sequence $\{\bar{s}^k\}$ defined in (20).

Theorem 4(b) implies that the running best rate of $\|x^{k+1} - x^k\|^2$ is $o(1/(1+k)^{1+\epsilon})$. The additional condition imposed on $\{\xi^k\}$ in Theorem 4(c) is a type of restricted continuity regularity of the subgradient $\partial r$ with respect to the generated sequence, which may hold for a class of proximable functions, as studied in [47]. If $\partial r$ is locally Lipschitz continuous in a neighborhood of a limit point, then this condition can generally be satisfied, since $\{x^k\}$ is asymptotically regular and thus $x^k$ lies in such a neighborhood of this limit point when $k$ is sufficiently large. Theorem 4(e) gives the convergence rates of Prox-DGD in the convex setting.

IV. RELATED WORKS AND DISCUSSIONS

We summarize some recent nonconvex decentralized algorithms in Table III. Most of them apply to either the smooth optimization problem (1) or the composite optimization problem (2) and use diminishing step sizes. Although (1) is a special case of (2) obtained by letting $r_i(x) = 0$, there are still differences in both algorithm design and theoretical analysis. Therefore, we divide their comparisons.

We first discuss the algorithms for (1). In [40], the authors proved the convergence of perturbed push-sum for nonconvex (1) under some regularity assumptions. They also introduced random perturbations to avoid local minima. The network considered in [40] is time-varying and directed, and specific column-stochastic matrices and diminishing step sizes are used. Their algorithm is an extension of the DGD algorithm with diminishing step sizes studied in this paper. The convergence results for the deterministic perturbed push-sum algorithm obtained in [40] are similar to those of DGD developed in this paper under similar assumptions (see Theorem 2 above and [40, Theorem 3]). However, in this paper, we obtain the asymptotic consensus and the convergence of DGD to a stationary point via a Lyapunov function and several new results, such as Lemma 12 on the convergence of so-called weakly summable sequences. The proofs in [40] are mainly based on [30].
In [13], a primal-dual approximate gradient algorithm called ZENITH was developed for (1). The convergence of ZENITH was given in terms of the expected constraint violation, under the Lipschitz differentiability assumption and other assumptions.

Table III includes three algorithms for solving the composite problem (2) that are related to ours. All of them handle only convex $r_i$, whereas $r_i$ in this paper can also be nonconvex. In [23], the authors proposed NEXT, based on the successive convex approximation (SCA) technique. Each NEXT iteration includes two stages: a local SCA stage to update the local variables, and a consensus stage to fuse information between agents. While NEXT has results similar to those of Prox-DGD with diminishing step sizes, Prox-DGD is simpler than NEXT. Another interesting algorithm is the decentralized Frank-Wolfe (DeFW) method proposed in [43] for nonconvex, smooth, constrained decentralized optimization, where a bounded convex constraint set is imposed. There are three steps in each DeFW iteration: average gradient computation, local variable update by Frank-Wolfe, and information fusion between agents. In [43], the authors established convergence results similar to those of Prox-DGD under diminishing step sizes. A stochastic version of DeFW was also developed in [19] for high-dimensional convex sparse optimization. The last algorithm is the projected stochastic gradient algorithm (Proj SGD) [3] for constrained, nonconvex, smooth consensus optimization. It has two steps in each iteration: a projected stochastic gradient step to update local variables and a consensus step to exchange information between local agents. The mixing matrix used in this algorithm is random and row stochastic, but its expectation is column stochastic.

(The original form of this algorithm, push-sum, was proposed in [17] for the average consensus problem. It was modified and analyzed in [29] for convex consensus optimization over time-varying directed graphs.)

Asymptotic consensus and convergence to the set of Karush-Kuhn-Tucker points were proved under diminishing step sizes, a smooth objective function, certain mean and variance restrictions on the stochastic direction, and other assumptions on the mixing matrices and the constraint set.

Based on the above discussion, the convergence results of DGD and Prox-DGD with diminishing step sizes obtained in this paper are comparable with most of the existing ones, which involve more complicated methods. However, we allow nonconvex nonsmooth $r_i$ and obtain estimates of the asymptotic consensus rates. We also establish global convergence using a fixed step size, which is otherwise found only for ZENITH.

V. PROOFS

In this section, we present the proofs of our main theorems and propositions.

A. Proof of Theorem 1

The sketch of the proof is as follows: DGD is interpreted as the gradient descent algorithm applied to the Lyapunov function $L_{\alpha}$, following the argument in [46]; then, the properties of sufficient descent, lower boundedness, and bounded gradients are established for the sequence $\{L_{\alpha}(x^k)\}$, giving subsequence convergence of the DGD iterates; finally, whole-sequence convergence of the DGD iterates follows from the KŁ property of $L_{\alpha}$.

Lemma 1 (Gradient descent interpretation). The sequence $\{x^k\}$ generated by the DGD iteration (4) is the same as the sequence generated by applying gradient descent with fixed step size $\alpha$ to the objective function $L_{\alpha}(x)$.

A proof of this lemma is given in [46]; it is based on reformulating (4) as the iteration

  $x^{k+1} = x^k - \alpha \big( \nabla f(x^k) + \tfrac{1}{\alpha}(I - W)x^k \big) = x^k - \alpha \nabla L_{\alpha}(x^k)$.   (21)

Although the sequence $\{x^k\}$ generated by the DGD iteration (4) can be interpreted as a centralized gradient descent sequence for the function $L_{\alpha}(x)$, it is different from gradient descent applied to the original problem (3).

Lemma 2 (Sufficient descent of $\{L_{\alpha}(x^k)\}$). Let Assumptions 1 and 2 hold, and set the step size $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$. Then

  $L_{\alpha}(x^{k+1}) \le L_{\alpha}(x^k) - \frac{1 + \lambda_n(W) - \alpha L_f}{2\alpha} \|x^{k+1} - x^k\|^2$, $\forall k \in \mathbb{N}$.   (22)

Proof. From $x^{k+1} = x^k - \alpha \nabla L_{\alpha}(x^k)$, it follows that

  $\langle \nabla L_{\alpha}(x^k), x^{k+1} - x^k \rangle = -\frac{1}{\alpha}\|x^{k+1} - x^k\|^2$.   (23)

Since $\sum_{i=1}^n f_i(x_i)$ is $L_f$-Lipschitz differentiable, $\nabla L_{\alpha}$ is Lipschitz continuous with constant $L_{L_{\alpha}} \le L_f + \frac{1}{\alpha}\lambda_{\max}(I - W) = L_f + \frac{1 - \lambda_n(W)}{\alpha}$, implying

  $L_{\alpha}(x^{k+1}) \le L_{\alpha}(x^k) + \langle \nabla L_{\alpha}(x^k), x^{k+1} - x^k \rangle + \frac{L_{L_{\alpha}}}{2}\|x^{k+1} - x^k\|^2$.   (24)

Combining (23) and (24) yields (22).

Lemma 3 (Boundedness). Under Assumptions 1 and 2, if $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$, then the sequence $\{L_{\alpha}(x^k)\}$ is lower bounded, and the sequence $\{x^k\}$ is bounded, i.e., there exists a constant $B > 0$ such that $\|x^k\| \le B$ for all $k$.

Proof. The lower boundedness of $\{L_{\alpha}(x^k)\}$ is due to the lower boundedness of each $f_i$, which is proper and coercive (Assumption 1, Part 2). By Lemma 2 and the choice of $\alpha$, $L_{\alpha}(x^k)$ is nonincreasing and hence upper bounded by $L_{\alpha}(x^0) < +\infty$. Hence, $\mathbf{1}^T f(x^k) \le L_{\alpha}(x^k) \le L_{\alpha}(x^0)$ implies that $x^k$ is bounded, due to the coercivity of $\mathbf{1}^T f(x)$ (Assumption 1, Part 2).

From Lemmas 2 and 3, we immediately obtain the following lemma.

Lemma 4 (Square summability and asymptotic regularity). It holds that $\sum_{k=0}^{\infty} \|x^{k+1} - x^k\|^2 < +\infty$ and that $\|x^{k+1} - x^k\| \to 0$ as $k \to \infty$.

From (21), the result below directly follows.

Lemma 5 (Gradient bound). $\|\nabla L_{\alpha}(x^k)\| \le \frac{1}{\alpha}\|x^{k+1} - x^k\|$.

Based on the above lemmas, we obtain the global convergence of DGD.

Proof of Theorem 1. By Lemma 3, the sequence $\{x^k\}$ is bounded, so there exist a convergent subsequence and a limit point, denoted by $\{x^{k_s}\}_{s \in \mathbb{N}} \to x^*$ as $s \to +\infty$. By Lemmas 2 and 3, $L_{\alpha}(x^k)$ is monotonically nonincreasing and lower bounded, and therefore $\|x^{k+1} - x^k\| \to 0$ as $k \to \infty$. By Lemma 5, $\nabla L_{\alpha}(x^k) \to 0$ as $k \to \infty$. In particular, $\nabla L_{\alpha}(x^{k_s}) \to 0$ as $s \to \infty$. Hence, we have $\nabla L_{\alpha}(x^*) = 0$. The running best rate of the sequence $\{\|x^{k+1} - x^k\|^2\}$ follows from [10] or [18]. By Lemma 5, the running best rate of the sequence $\{\|\nabla L_{\alpha}(x^k)\|^2\}$ is $o(1/k)$.
Similar to [2, Theorem 2.9], we can claim the global convergence of the sequence $\{x^k\}_{k \in \mathbb{N}}$ under the KŁ assumption on $L_{\alpha}$.

Next, we derive a bound on the gradient sequence $\{\nabla f(x^k)\}$, which is used in Proposition 1.

Lemma 6. Under Assumption 1, there exists a point $y^*$ satisfying $\nabla f(y^*) = 0$, and the following bound holds:

  $\|\nabla f(x^k)\| \le D := L_f (B + \|y^*\|)$, $\forall k \in \mathbb{N}$,   (25)

where $B$ is the bound on $\|x^k\|$ given in Lemma 3.

(A sequence $\{a_k\}$ is said to be asymptotically regular if $a_{k+1} - a_k \to 0$ as $k \to \infty$.)

Proof. By the lower boundedness assumption (Assumption 1, Part 2), a minimizer of $\mathbf{1}^T f(y)$ exists; let $y^*$ be one. Then, by the Lipschitz differentiability of each $f_i$ (Assumption 1, Part 1), we have $\nabla f(y^*) = 0$. Then, for any $k$,

  $\|\nabla f(x^k)\| = \|\nabla f(x^k) - \nabla f(y^*)\| \le L_f \|x^k - y^*\| \le L_f (B + \|y^*\|)$,

where the last inequality uses Lemma 3. This proves the lemma.

B. Proof of Proposition 2

Proof. Note that

  $\|\nabla L_{\alpha}(x^{k+1})\| \le \|\nabla L_{\alpha}(x^k)\| + \|\nabla L_{\alpha}(x^{k+1}) - \nabla L_{\alpha}(x^k)\| \le L_{L_{\alpha}} \|x^{k+1} - x^k\| + \frac{1}{\alpha}\|x^{k+1} - x^k\| = \Big( \frac{2 - \lambda_n(W)}{\alpha} + L_f \Big) \|x^{k+1} - x^k\|$,

where the second inequality holds by Lemma 5 and the Lipschitz continuity of $\nabla L_{\alpha}$ with constant $L_{L_{\alpha}} \le L_f + \frac{1-\lambda_n(W)}{\alpha}$. Thus $\{x^k\}$ satisfies the so-called relative error condition listed in [2]. Moreover, by Lemmas 2 and 3, $\{x^k\}$ also satisfies the so-called sufficient decrease and continuity conditions listed in [2]. Under these three conditions and the KŁ property of $L_{\alpha}$ at $x^*$ with $\varphi(s) = c s^{1-\theta}$, following the proof of [2, Lemma 2.6], there exists $k_0 > 0$ such that for all $k \ge k_0$,

  $\|x^{k+1} - x^k\| \le \frac{1}{2}\|x^k - x^{k-1}\| + \frac{cb}{a}\Big[ \big(L_{\alpha}(x^k) - L_{\alpha}(x^*)\big)^{1-\theta} - \big(L_{\alpha}(x^{k+1}) - L_{\alpha}(x^*)\big)^{1-\theta} \Big]$,   (26)

where $a := \frac{1 + \lambda_n(W) - \alpha L_f}{2\alpha}$ and $b := \frac{2 - \lambda_n(W)}{\alpha} + L_f$. Then an easy induction yields

  $\sum_{t=k_0}^{k} \|x^{t+1} - x^t\| \le \|x^{k_0} - x^{k_0-1}\| + \frac{cb}{a}\Big[ \big(L_{\alpha}(x^{k_0}) - L_{\alpha}(x^*)\big)^{1-\theta} - \big(L_{\alpha}(x^{k+1}) - L_{\alpha}(x^*)\big)^{1-\theta} \Big]$.

Following a derivation similar to the proof of [1, Theorem 5], we can estimate the rate of convergence of $\{x^k\}$ in the different cases of $\theta$.

C. Proof of Proposition 3

In order to prove Proposition 3, we also need the following lemmas.

Lemma 7 ([8, Proposition 1]). Let $W^k := W \cdot W \cdots W$ denote the $k$th power of $W$ for any $k \in \mathbb{N}$. Under Assumption 2, it holds that

  $\big\| W^k - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \big\| \le C \zeta^k$   (27)

for some constant $C > 0$, where $\zeta$ is the second largest magnitude eigenvalue of $W$, as specified in (9).

Lemma 8 ([33, Lemma 3.1]). Let $\{\gamma_k\}$ be a scalar sequence. If $\lim_{k\to\infty} \gamma_k = \gamma$ and $0 < \beta < 1$, then $\lim_{k\to\infty} \sum_{l=0}^{k} \beta^{k-l}\gamma_l = \frac{\gamma}{1-\beta}$.

Proof of Proposition 3. By the recursion (16), note that

  $x^k - \bar{x}^k = \big( W^k - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \big) x^0 - \sum_{j=0}^{k-1} \alpha_j \big( W^{k-1-j} - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \big) \nabla f(x^j)$.   (28)

Further, by Lemma 7 and Assumption 3, we obtain

  $\|x^k - \bar{x}^k\| \le \big\| W^k - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \big\| \|x^0\| + \sum_{j=0}^{k-1} \alpha_j \big\| W^{k-1-j} - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T \big\| \|\nabla f(x^j)\| \le C\|x^0\|\zeta^k + CB \sum_{j=0}^{k-1} \alpha_j \zeta^{k-1-j}$.   (29)

Furthermore, by Lemma 8 (applied with $\gamma_l = \alpha_l \to 0$ and $\beta = \zeta$) and the step sizes (15), we get $\lim_{k\to\infty} \|x^k - \bar{x}^k\| = 0$.

Let $b_k := (1+k)^{\epsilon}$. To obtain the rate of $\|x^k - \bar{x}^k\|$, we only need to show that $\limsup_{k\to\infty} b_k \|x^k - \bar{x}^k\| \le \bar{C}$ for some $0 < \bar{C} < \infty$. Let $j_k := [k/2]$, where $[x]$ denotes the integer part of $x$ for any $x \in \mathbb{R}$. Note that

  $b_k \|x^k - \bar{x}^k\| \le C\|x^0\| b_k \zeta^k + CB\, b_k \sum_{j=0}^{j_k} \alpha_j \zeta^{k-1-j} + CB\, b_k \sum_{j=j_k+1}^{k-1} \alpha_j \zeta^{k-1-j} =: T_1 + T_2 + T_3$,   (30)

where the inequality holds because of (29). In the following, we estimate the three terms on the right-hand side of (30), respectively. First, since $0 < \zeta < 1$, we have $b_k \zeta^k \to 0$, and thus $T_1 \to 0$ as $k \to \infty$.

Second, for any $j \le j_k$ we have $k - 1 - j \ge k/2 - 1$, and $\sum_{j=0}^{j_k} \alpha_j \le 1 + k/2$, so

  $T_2 \le CB\, b_k \zeta^{k/2 - 1} (1 + k/2) = O\big( (1+k)^{1+\epsilon} \zeta^{k/2} \big) \to 0$ as $k \to \infty$.   (31)

Third, for any $j > j_k$ we have $\alpha_j \le \alpha_{j_k} \le (1 + k/2)^{-\epsilon} \le 2^{\epsilon}(1+k)^{-\epsilon} = 2^{\epsilon}/b_k$, and hence

  $T_3 \le CB\, b_k \cdot \frac{2^{\epsilon}}{b_k} \sum_{i=0}^{\infty} \zeta^i = \frac{2^{\epsilon} CB}{1 - \zeta}$.   (32)

Combining the three estimates yields

  $\limsup_{k\to\infty} b_k \|x^k - \bar{x}^k\| \le \frac{2^{\epsilon} CB}{1 - \zeta} =: \bar{C}$,   (33)

which completes the proof of this proposition.

D. Proof of Theorem 2

To prove Theorem 2, we first note that, similar to (21), the DGD iterates under decreasing step sizes can be rewritten as

  $x^{k+1} = x^k - \alpha_k \nabla L_{\alpha_k}(x^k)$,   (40)

where $L_{\alpha_k}(x) = \mathbf{1}^T f(x) + \frac{1}{2\alpha_k}\|x\|_{I-W}^2$. We also need the following lemmas.

Lemma 9 ([34]). Let $\{v_t\}$ be a nonnegative scalar sequence satisfying $v_{t+1} \le (1 + b_t) v_t - u_t + c_t$ for all $t \in \mathbb{N}$, where $b_t \ge 0$, $u_t \ge 0$, and $c_t \ge 0$ with $\sum_{t=0}^{\infty} b_t < \infty$ and $\sum_{t=0}^{\infty} c_t < \infty$. Then the sequence $\{v_t\}$ converges to some $v \ge 0$, and $\sum_{t=0}^{\infty} u_t < \infty$.

Lemma 10. Let $\alpha_k$ satisfy (15). Then it holds that

  $\frac{1}{\alpha_{k+1}} - \frac{1}{\alpha_k} \le \frac{\epsilon}{(1+k)^{1-\epsilon}}$.

Proof. We first prove that

  $(1+x)^{\epsilon} \le 1 + \epsilon x$, $x \in [0, 1]$.   (41)

Let $g(x) = (1+x)^{\epsilon} - 1 - \epsilon x$. Its derivative $g'(x) = \epsilon (1+x)^{\epsilon - 1} - \epsilon \le 0$ for $x \in [0,1]$. This implies $g(x) \le g(0) = 0$ for any $x \in [0,1]$; that is, inequality (41) holds. Then

  $\frac{1}{\alpha_{k+1}} - \frac{1}{\alpha_k} = (2+k)^{\epsilon} - (1+k)^{\epsilon} = (1+k)^{\epsilon}\Big[ \big(1 + \tfrac{1}{1+k}\big)^{\epsilon} - 1 \Big] \le (1+k)^{\epsilon} \cdot \frac{\epsilon}{1+k} = \frac{\epsilon}{(1+k)^{1-\epsilon}}$,   (42)

where the inequality holds by (41).

Note that the term $\big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|_{I-W}^2$ appears on the right-hand side of inequality (48) below. In order to apply Lemma 9 and then show the convergence of $\{L_{\alpha_k}(x^k)\}$, we need the following lemma to guarantee that this term is summable.

Lemma 11. Let Assumptions 1, 2, and 3 hold. In DGD, use the step sizes (15). Then $\big\{ \big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|_{I-W}^2 \big\}$ is summable, i.e., $\sum_{k=0}^{\infty} \big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|_{I-W}^2 < \infty$.

Proof. Note that

  $\|x^{k+1}\|_{I-W}^2 = \|x^{k+1} - \bar{x}^{k+1}\|_{I-W}^2 \le (1 - \lambda_n(W)) \|x^{k+1} - \bar{x}^{k+1}\|^2$.   (43)

By Lemma 10,

  $\Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|_{I-W}^2 \le \frac{\epsilon}{2(1+k)^{1-\epsilon}} (1 - \lambda_n(W)) \|x^{k+1} - \bar{x}^{k+1}\|^2$.   (44)

Furthermore, by (44) and Proposition 3, the sequence $\big\{\big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|_{I-W}^2\big\}$ converges to 0 at the rate $O(1/(1+k)^{1+\epsilon})$, which implies that it is summable.

Lemma 12 (Convergence of weakly summable sequences). Let $\{\beta_k\}$ and $\{\gamma_k\}$ be two nonnegative scalar sequences such that
(a) $\gamma_k = \frac{1}{(1+k)^{\epsilon}}$ for some $\epsilon \in (0, 1]$, $k \in \mathbb{N}$;
(b) $\sum_{k=0}^{\infty} \gamma_k \beta_k < \infty$;
(c) $|\beta_{k+1} - \beta_k| \lesssim \gamma_k$, meaning that $|\beta_{k+1} - \beta_k| \le M \gamma_k$ for some constant $M > 0$.
Then $\lim_{k\to\infty} \beta_k = 0$.

We call a sequence $\{\beta_k\}$ satisfying (a) and (b) of Lemma 12 a weakly summable sequence, since it is not necessarily summable itself but becomes summable after multiplication by another non-summable, diminishing sequence $\{\gamma_k\}$. It is generally impossible to claim that $\beta_k$ converges to 0. However, if the distance between two successive elements of $\{\beta_k\}$ is of the same order as the multiplied sequence $\gamma_k$, then we can claim the convergence of $\beta_k$. A special case with $\epsilon = 1/2$ was observed in [9].

Proof. By condition (b), we have

  $\sum_{i=k}^{k+\Delta} \gamma_i \beta_i \to 0$   (45)

as $k \to \infty$, for any $\Delta \in \mathbb{N}$. In the following, we show $\lim_{k\to\infty} \beta_k = 0$ by contradiction. Assume this is not the case, i.e., $\beta_k$ does not converge to 0 as $k \to \infty$;

then $\limsup_{k\to\infty} \beta_k \ge C$ for some $C > 0$. Thus, for every $N > 0$, there exists a $k > N$ such that $\beta_k > C$. Let

  $\Delta_k := \Big[ \frac{C (1+k)^{\epsilon}}{4M} \Big]$,

where $[x]$ denotes the integer part of $x$ for any $x \in \mathbb{R}$. By condition (c), i.e., $|\beta_{j+1} - \beta_j| \le M\gamma_j$ for any $j \in \mathbb{N}$, we have

  $\beta_{k+i} \ge \beta_k - M \sum_{j=k}^{k+i-1} \gamma_j \ge C - M \Delta_k \gamma_k \ge \frac{3C}{4}$, $i \in \{0, 1, \ldots, \Delta_k\}$.

Hence,

  $\sum_{j=k}^{k+\Delta_k} \gamma_j \beta_j \ge \frac{3C}{4} \sum_{j=k}^{k+\Delta_k} \gamma_j \ge \frac{3C}{4} \int_{k}^{k+\Delta_k+1} \frac{dx}{(1+x)^{\epsilon}} = \begin{cases} \frac{3C}{4} \cdot \frac{(2+k+\Delta_k)^{1-\epsilon} - (1+k)^{1-\epsilon}}{1-\epsilon}, & \epsilon \in (0, 1), \\ \frac{3C}{4} \ln \frac{2+k+\Delta_k}{1+k}, & \epsilon = 1. \end{cases}$

When $\epsilon \in (0, 1)$, since $\Delta_k$ grows like $\frac{C}{4M}(1+k)^{\epsilon}$, the term $(2+k+\Delta_k)^{1-\epsilon} - (1+k)^{1-\epsilon}$ is bounded below by a positive constant for all sufficiently large $k$. When $\epsilon = 1$, noting the specific form of $\Delta_k$, we have $\ln \frac{2+k+\Delta_k}{1+k} \to \ln\big(1 + \frac{C}{4M}\big)$, which is a positive constant. As a consequence, $\sum_{j=k}^{k+\Delta_k} \gamma_j \beta_j$ does not go to 0 as $k \to \infty$, which contradicts (45). Therefore, $\lim_{k\to\infty} \beta_k = 0$.

Proof of Theorem 2. We first develop the following inequality:

  $L_{\alpha_{k+1}}(x^{k+1}) \le L_{\alpha_k}(x^k) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|_{I-W}^2 - \frac{1 + \lambda_n(W) - \alpha_k L_f}{2\alpha_k}\|x^{k+1} - x^k\|^2$,   (48)

and then claim the convergence of the sequences $\{L_{\alpha_k}(x^k)\}$, $\{\mathbf{1}^T f(x^k)\}$, and $\{x^k\}$ based on this inequality.

(a) Development of (48): From $x^{k+1} = x^k - \alpha_k \nabla L_{\alpha_k}(x^k)$, it follows that

  $\langle \nabla L_{\alpha_k}(x^k), x^{k+1} - x^k \rangle = -\frac{1}{\alpha_k}\|x^{k+1} - x^k\|^2$.   (49)

Since $\sum_{i=1}^n f_i(x_i)$ is $L_f$-Lipschitz differentiable, $\nabla L_{\alpha_k}$ is Lipschitz continuous with constant $L_k \le L_f + \frac{1}{\alpha_k}\lambda_{\max}(I - W) = L_f + \frac{1 - \lambda_n(W)}{\alpha_k}$, implying

  $L_{\alpha_k}(x^{k+1}) \le L_{\alpha_k}(x^k) + \langle \nabla L_{\alpha_k}(x^k), x^{k+1} - x^k \rangle + \frac{L_k}{2}\|x^{k+1} - x^k\|^2 = L_{\alpha_k}(x^k) - \frac{1 + \lambda_n(W) - \alpha_k L_f}{2\alpha_k}\|x^{k+1} - x^k\|^2$.   (50)

Moreover,

  $L_{\alpha_{k+1}}(x^{k+1}) = L_{\alpha_k}(x^{k+1}) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|_{I-W}^2$.   (51)

Combining (50) and (51) yields (48).

(b) Convergence of the objective sequence: By Lemma 11 and Lemma 9, (48) yields the convergence of $\{L_{\alpha_k}(x^k)\}$ and

  $\sum_{k=0}^{\infty} \frac{1 + \lambda_n(W) - \alpha_k L_f}{2\alpha_k}\|x^{k+1} - x^k\|^2 < \infty$,   (52)

which implies that the running best rate of $\{\|x^{k+1} - x^k\|^2\}$ is $o(1/(1+k)^{1+\epsilon})$ and that $\{x^k\}$ is asymptotically regular. Moreover, notice that

  $\frac{1}{2\alpha_k}\|x^k\|_{I-W}^2 = \frac{1}{2\alpha_k}\|x^k - \bar{x}^k\|_{I-W}^2 \le \frac{(1 - \lambda_n(W))(1+k)^{\epsilon}}{2}\|x^k - \bar{x}^k\|^2$.

By Proposition 3, the term $\frac{1}{2\alpha_k}\|x^k\|_{I-W}^2$ converges to 0 as $k \to \infty$. As a consequence,

  $\lim_{k\to\infty} \mathbf{1}^T f(x^k) = \lim_{k\to\infty} \Big( L_{\alpha_k}(x^k) - \frac{1}{2\alpha_k}\|x^k\|_{I-W}^2 \Big) = \lim_{k\to\infty} L_{\alpha_k}(x^k)$.

(c) Convergence to a stationary point: Let $\nabla \bar{f}(x^k) := \frac{1}{n}\mathbf{1}^T \nabla f(x^k)$. By the specific form (15) of $\alpha_k$, we have

  $\frac{1 + \lambda_n(W) - \alpha_k L_f}{2\alpha_k} \ge \frac{1 + \lambda_n(W)}{4\alpha_k}$ for all $k \ge k_0$,   (53)

where $k_0 := \big[ (2L_f/(1 + \lambda_n(W)))^{1/\epsilon} \big]$, i.e., the integer part of $(2L_f/(1+\lambda_n(W)))^{1/\epsilon}$. Note that

  $\|\bar{x}^{k+1} - \bar{x}^k\| = \big\| \tfrac{1}{n}\mathbf{1}\mathbf{1}^T (x^{k+1} - x^k) \big\| \le \|x^{k+1} - x^k\|$.   (54)

Thus, (52), (53), and (54) yield

  $\sum_{k=0}^{\infty} \frac{1}{\alpha_k}\|\bar{x}^{k+1} - \bar{x}^k\|^2 < \infty$.   (55)

By the DGD iteration (4), averaging its rows gives

  $\bar{x}^{k+1} - \bar{x}^k = -\alpha_k \nabla \bar{f}(x^k)$.   (56)

Plugging (56) into (55) yields

  $\sum_{k=0}^{\infty} \alpha_k \|\nabla \bar{f}(x^k)\|^2 < \infty$.   (57)

Moreover,

  $\big| \|\nabla \bar{f}(x^{k+1})\|^2 - \|\nabla \bar{f}(x^k)\|^2 \big| = \big( \|\nabla \bar{f}(x^{k+1})\| + \|\nabla \bar{f}(x^k)\| \big) \cdot \big| \|\nabla \bar{f}(x^{k+1})\| - \|\nabla \bar{f}(x^k)\| \big| \le 2B \|\nabla \bar{f}(x^{k+1}) - \nabla \bar{f}(x^k)\| \le 2B L_f \|x^{k+1} - x^k\|$,   (58)

where the first inequality holds by the bounded gradient assumption (Assumption 3), and the second by the specific form of $\nabla \bar{f}$ and the Lipschitz continuity of $\nabla f$. Note that

  $\|x^{k+1} - x^k\| \le \|x^{k+1} - \bar{x}^{k+1}\| + \|\bar{x}^{k+1} - \bar{x}^k\| + \|\bar{x}^k - x^k\| \le \|x^{k+1} - \bar{x}^{k+1}\| + \|\bar{x}^k - x^k\| + \alpha_k \|\nabla \bar{f}(x^k)\| \lesssim \alpha_k$,   (59)

where the first inequality is the triangle inequality together with (56), and the last holds by Proposition 3 and the boundedness of $\nabla f$. Thus, (58) and (59) imply

  $\big| \|\nabla \bar{f}(x^{k+1})\|^2 - \|\nabla \bar{f}(x^k)\|^2 \big| \lesssim \alpha_k$.   (60)

By the specific form (15) of $\alpha_k$, (57), (60), and Lemma 12 (applied to $\beta_k = \|\nabla \bar{f}(x^k)\|^2$ and $\gamma_k = \alpha_k$), it holds that

  $\lim_{k\to\infty} \nabla \bar{f}(x^k) = 0$.   (61)

As a consequence,

  $\lim_{k\to\infty} \mathbf{1}^T \nabla f(x^k) = 0$.   (62)

Furthermore, by the coercivity of each $f_i$ and the convergence of $\{\mathbf{1}^T f(x^k)\}$, $\{x^k\}$ is bounded. Therefore, there exists a convergent subsequence of $\{x^k\}$. Let $x^*$ be any limit point of $\{x^k\}$. By (62) and the continuity of $\nabla f$, it holds that $\mathbf{1}^T \nabla f(x^*) = 0$. Moreover, by Proposition 3, $x^*$ is consensual. As a consequence, $x^*$ is a stationary point of problem (3). In addition, if $x^*$ is isolated, then, by the asymptotic regularity of $\{x^k\}$, $\{x^k\}$ converges to $x^*$.

(A sequence $\{a_k\}$ is said to be asymptotically regular if $a_{k+1} - a_k \to 0$ as $k \to \infty$.)
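Condition (c) is exactly what Lemma 12 adds over plain weak summability, and its role is easy to see numerically. The sketch below (our own illustration; both sequences are synthetic assumptions) compares a spiky sequence that satisfies (a)-(b) but violates (c) and does not vanish, against a sequence satisfying all three conditions.

```python
import numpy as np

K = 2 ** 20
k = np.arange(K)
gamma = 1.0 / (1 + k) ** 0.5                  # condition (a), eps = 1/2

# Spikes at powers of two: satisfies (b) but violates (c); beta1 does not -> 0.
beta1 = np.zeros(K)
beta1[2 ** np.arange(20)] = 1.0

# Slow decay: satisfies (a)-(c) (|beta2_{k+1}-beta2_k| = O(k^-1.6) <= M*gamma_k).
beta2 = 1.0 / (1 + k) ** 0.6

print(np.sum(gamma * beta1), beta1[-2**19:].max())  # finite sum, yet spikes persist
print(np.sum(gamma * beta2), beta2[-1])             # finite sum, and beta2 -> 0
```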

E. Proof of Proposition 4

To prove Proposition 4, we also need the following lemmas.

Lemma 13 (Accumulated consensus of the iterates). Under the conditions of Proposition 3, we have

  $\sum_{k=0}^{K} \alpha_k \|x^{k+1} - \bar{x}^{k+1}\| \le D_1 + D_2 \sum_{k=0}^{K} \alpha_k^2$,   (63)

where $D_1 := \frac{C\|x^0\|\zeta^2}{2(1-\zeta^2)}$, $D_2 := \frac{C\|x^0\|}{2} + \frac{CB}{1-\zeta}$, and $B$ is specified in Assumption 3.

Proof. By (29),

  $\sum_{k=0}^{K} \alpha_k \|x^{k+1} - \bar{x}^{k+1}\| \le C\|x^0\| \sum_{k=0}^{K} \alpha_k \zeta^{k+1} + CB \sum_{k=0}^{K} \sum_{j=0}^{k} \zeta^{k-j} \alpha_k \alpha_j$.   (64)

In the following, we estimate the two terms on the right-hand side of (64), respectively. By Young's inequality, $\alpha_k \zeta^{k+1} \le \frac{1}{2}(\alpha_k^2 + \zeta^{2(k+1)})$, so

  $\sum_{k=0}^{K} \alpha_k \zeta^{k+1} \le \frac{1}{2}\sum_{k=0}^{K} \alpha_k^2 + \frac{1}{2}\sum_{k=0}^{K} \zeta^{2(k+1)} \le \frac{1}{2}\sum_{k=0}^{K} \alpha_k^2 + \frac{\zeta^2}{2(1-\zeta^2)}$,   (65)

and, similarly, using $\alpha_k \alpha_j \le \frac{1}{2}(\alpha_k^2 + \alpha_j^2)$,

  $\sum_{k=0}^{K} \sum_{j=0}^{k} \zeta^{k-j} \alpha_k \alpha_j \le \frac{1}{2}\sum_{k=0}^{K} \alpha_k^2 \sum_{j=0}^{k} \zeta^{k-j} + \frac{1}{2}\sum_{j=0}^{K} \alpha_j^2 \sum_{k=j}^{K} \zeta^{k-j} \le \frac{1}{1-\zeta}\sum_{k=0}^{K} \alpha_k^2$.   (66)

Plugging (65) and (66) into (64) yields (63).

Besides Lemma 13, we also need the following two lemmas, which have appeared in the literature (cf. [8]).

Lemma 14 ([8]). Let $\gamma_k = \frac{1}{k^{\epsilon}}$ for some $0 < \epsilon \le 1$. Then the following hold:
(a) if $0 < \epsilon < 1/2$: $\sum_{j=1}^{k}\gamma_j = O(k^{1-\epsilon})$, $\sum_{j=1}^{k}\gamma_j^2 = O(k^{1-2\epsilon})$, and $\big(\sum_{j=1}^{k}\gamma_j^2\big)/\big(\sum_{j=1}^{k}\gamma_j\big) = O(1/k^{\epsilon})$;
(b) if $\epsilon = 1/2$: $\sum_{j=1}^{k}\gamma_j = O(\sqrt{k})$, $\sum_{j=1}^{k}\gamma_j^2 = O(\ln k)$, and the ratio is $O(\ln k/\sqrt{k})$;
(c) if $1/2 < \epsilon < 1$: $\sum_{j=1}^{k}\gamma_j = O(k^{1-\epsilon})$, $\sum_{j=1}^{k}\gamma_j^2 = O(1)$, and the ratio is $O(1/k^{1-\epsilon})$;
(d) if $\epsilon = 1$: $\sum_{j=1}^{k}\gamma_j = O(\ln k)$, $\sum_{j=1}^{k}\gamma_j^2 = O(1)$, and the ratio is $O(1/\ln k)$.

Lemma 15 ([8, Proposition 3]). Let $h: \mathbb{R}^d \to \mathbb{R}$ be a convex, continuously differentiable function whose gradient is Lipschitz continuous with constant $L_h$. Then for any $x, y, u$,

  $h(u) \ge h(x) + \langle \nabla h(y), u - x \rangle - \frac{L_h}{2}\|x - y\|^2$.

Proof of Proposition 4. To prove this proposition, we first develop the following inequality:

  $L_{\alpha_k}(x^{k+1}) - L_{\alpha_k}(u) \le \frac{1}{2\alpha_k}\big( \|x^k - u\|^2 - \|x^{k+1} - u\|^2 \big)$ for any $u \in \mathbb{R}^{n \times p}$.   (67)

By Lemma 15 (applied to $L_{\alpha_k}$, which is convex in the setting of Proposition 4), we have

  $L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \langle \nabla L_{\alpha_k}(x^k), u - x^{k+1} \rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2$,   (68)

where $L_k = L_f + \frac{1 - \lambda_n(W)}{\alpha_k}$, and, by (40), $\nabla L_{\alpha_k}(x^k) = \frac{1}{\alpha_k}(x^k - x^{k+1})$. Then (68) implies

  $L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1} \rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2$.   (69)

By the specific form $\alpha_k = \frac{1}{(1+k)^{\epsilon}}$, there exists an integer $k_0 > 0$ such that $L_k \alpha_k \le 1$ for all $k \ge k_0$. (In fact, for simplicity of the proof, we can take $\alpha_0 \le \frac{\lambda_n(W)}{L_f}$ so that $L_k \alpha_k \le 1$ holds from the initial step, since $L_k \alpha_k = \alpha_k L_f + 1 - \lambda_n(W) \le 1$ exactly when $\alpha_k \le \frac{\lambda_n(W)}{L_f}$.) Thus, (69) implies

  $L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1} \rangle - \frac{1}{2\alpha_k}\|x^{k+1} - x^k\|^2$.   (70)

Recall that for any vectors $a, b, c$, it holds that $2\langle a - b, c - b \rangle = \|a - b\|^2 + \|c - b\|^2 - \|a - c\|^2$. Therefore,

  $L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{2\alpha_k}\big( \|x^{k+1} - u\|^2 - \|x^k - u\|^2 \big)$.

As a consequence, we obtain the basic inequality (67).

Note that the optimal solution $x^{\mathrm{opt}}$ is consensual, and thus $\|x^{\mathrm{opt}}\|_{I-W} = 0$. Therefore, $L_{\alpha_k}(x^{\mathrm{opt}}) = \mathbf{1}^T f(x^{\mathrm{opt}}) = f^{\mathrm{opt}}$. By (67) with $u = x^{\mathrm{opt}}$,

  $\alpha_k \big( L_{\alpha_k}(x^{k+1}) - f^{\mathrm{opt}} \big) \le \tfrac{1}{2}\big( \|x^k - x^{\mathrm{opt}}\|^2 - \|x^{k+1} - x^{\mathrm{opt}}\|^2 \big)$.

Summing the above inequality over $k = 0, 1, \ldots, K$ yields

  $\sum_{k=0}^{K} \alpha_k \big( L_{\alpha_k}(x^{k+1}) - f^{\mathrm{opt}} \big) \le \tfrac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2$.   (71)

Moreover, noting that $L_{\alpha_k}(x^{k+1}) \ge \mathbf{1}^T f(x^{k+1})$ and using the convexity of each $f_i$ together with the bounded gradient assumption (Assumption 3),

  $f(\bar{x}^{k+1}) \le \mathbf{1}^T f(x^{k+1}) + B\|x^{k+1} - \bar{x}^{k+1}\| \le L_{\alpha_k}(x^{k+1}) + B\|x^{k+1} - \bar{x}^{k+1}\|$.   (72)

Plugging (72) into (71) yields

  $\sum_{k=0}^{K} \alpha_k \big( f(\bar{x}^{k+1}) - f^{\mathrm{opt}} \big) \le \tfrac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + B \sum_{k=0}^{K} \alpha_k \|x^{k+1} - \bar{x}^{k+1}\|$.   (73)

By the definition (17) of $\bar{f}^K$, (73) and Lemma 13 imply

  $\bar{f}^K - f^{\mathrm{opt}} \le \frac{\tfrac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + B \sum_{k=0}^{K} \alpha_k \|x^{k+1} - \bar{x}^{k+1}\|}{\sum_{k=0}^{K} \alpha_k} \le \frac{D_3 + D_4 \sum_{k=0}^{K} \alpha_k^2}{\sum_{k=0}^{K} \alpha_k}$,   (76)

where $D_3 := \tfrac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + B D_1$ and $D_4 := B D_2$, with $D_1$ and $D_2$ specified in Lemma 13. Furthermore, by Lemma 14, we get the claims of this proposition.

F. Proofs of Theorem 3 and Proposition 5

In order to prove Theorem 3, we need the following lemmas.

Lemma 16 (Sufficient descent of $\{\hat{L}_{\alpha}(x^k)\}$). Let Assumptions 2 and 4 hold. Results are given in the two cases below.

C1: the $r_i$'s are convex. Set $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$. Then

  $\hat{L}_{\alpha}(x^{k+1}) \le \hat{L}_{\alpha}(x^k) - \frac{1 + \lambda_n(W) - \alpha L_f}{2\alpha}\|x^{k+1} - x^k\|^2$, $\forall k \in \mathbb{N}$.   (77)

C2: the $r_i$'s are not necessarily convex (in this case, we assume $\lambda_n(W) > 0$). Set $0 < \alpha < \frac{\lambda_n(W)}{L_f}$. Then

  $\hat{L}_{\alpha}(x^{k+1}) \le \hat{L}_{\alpha}(x^k) - \frac{\lambda_n(W) - \alpha L_f}{2\alpha}\|x^{k+1} - x^k\|^2$, $\forall k \in \mathbb{N}$.   (78)

Proof. Recall from Lemma 2 that $\nabla L_{\alpha}$ is Lipschitz continuous with constant $L := L_f + \frac{1 - \lambda_n(W)}{\alpha}$, and thus

  $\hat{L}_{\alpha}(x^{k+1}) - \hat{L}_{\alpha}(x^k) = L_{\alpha}(x^{k+1}) - L_{\alpha}(x^k) + r(x^{k+1}) - r(x^k) \le \langle \nabla L_{\alpha}(x^k), x^{k+1} - x^k \rangle + \frac{L}{2}\|x^{k+1} - x^k\|^2 + r(x^{k+1}) - r(x^k)$.   (79)

C1: From the convexity of $r$, (7), and (19), it follows that

  $0 = \xi^{k+1} + \frac{1}{\alpha}\big( x^{k+1} - x^k + \alpha \nabla L_{\alpha}(x^k) \big)$, $\xi^{k+1} \in \partial r(x^{k+1})$.

This and the convexity of $r$ further give

  $r(x^{k+1}) - r(x^k) \le \langle \xi^{k+1}, x^{k+1} - x^k \rangle = -\frac{1}{\alpha}\|x^{k+1} - x^k\|^2 - \langle \nabla L_{\alpha}(x^k), x^{k+1} - x^k \rangle$.

Substituting this inequality into (79) and then expanding $L = L_f + \frac{1 - \lambda_n(W)}{\alpha}$ yields

  $\hat{L}_{\alpha}(x^{k+1}) - \hat{L}_{\alpha}(x^k) \le \Big( \frac{L}{2} - \frac{1}{\alpha} \Big)\|x^{k+1} - x^k\|^2 = -\frac{1 + \lambda_n(W) - \alpha L_f}{2\alpha}\|x^{k+1} - x^k\|^2$.

Sufficient descent requires the last coefficient to be negative, thus $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$.

C2: From (7) and (19), it follows that the function $u \mapsto r(u) + \frac{1}{2\alpha}\|u - (x^k - \alpha \nabla L_{\alpha}(x^k))\|^2$ attains its minimum at $u = x^{k+1}$. Comparing the values of this function at $x^{k+1}$ and $x^k$ yields

  $r(x^{k+1}) - r(x^k) \le \frac{1}{2\alpha}\|x^k - x^k + \alpha \nabla L_{\alpha}(x^k)\|^2 - \frac{1}{2\alpha}\|x^{k+1} - x^k + \alpha \nabla L_{\alpha}(x^k)\|^2 = -\frac{1}{2\alpha}\|x^{k+1} - x^k\|^2 - \langle \nabla L_{\alpha}(x^k), x^{k+1} - x^k \rangle$.

Substituting this inequality into (79) and expanding $L$ yields

  $\hat{L}_{\alpha}(x^{k+1}) - \hat{L}_{\alpha}(x^k) \le \Big( \frac{L}{2} - \frac{1}{2\alpha} \Big)\|x^{k+1} - x^k\|^2 = -\frac{\lambda_n(W) - \alpha L_f}{2\alpha}\|x^{k+1} - x^k\|^2$.

Hence, sufficient descent requires $0 < \alpha < \frac{\lambda_n(W)}{L_f}$.
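The sufficient-descent case C1 of Lemma 16 can be observed on random data. A minimal sketch (our own illustration; the ring graph, quadratic $f_i$, and $\ell_1$ choice of $r$ are assumptions) runs Prox-DGD with a convex $r$ under the step-size rule of C1 and checks that $\hat{L}_{\alpha}$ decreases monotonically.

```python
import numpy as np

np.random.seed(0)
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1.0 / 3.0
t = np.random.randn(n)
f      = lambda x: 0.5 * np.sum((x - t) ** 2)        # 1^T f(x), with L_f = 1
grad_f = lambda x: x - t
r      = lambda x: np.sum(np.abs(x))                  # convex r: l1 penalty
prox_r = lambda y, a: np.sign(y) * np.maximum(np.abs(y) - a, 0.0)

lam_n = np.linalg.eigvalsh(W).min()
alpha = 0.9 * (1 + lam_n)          # case C1: 0 < alpha < (1 + lam_n(W)) / L_f
I = np.eye(n)
L_hat = lambda x: f(x) + x @ ((I - W) @ x) / (2 * alpha) + r(x)

x = np.random.randn(n)
for k in range(50):
    x_new = prox_r(W @ x - alpha * grad_f(x), alpha)  # Prox-DGD step (6)
    assert L_hat(x_new) <= L_hat(x) + 1e-10           # sufficient descent (77)
    x = x_new
print("monotone decrease verified; final x:", x)
```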

Lemma 17 (Boundedness). Under the conditions of Lemma 16, the sequence $\{\hat{L}_{\alpha}(x^k)\}$ is lower bounded, and the sequence $\{x^k\}$ is bounded.

Proof. The lower boundedness of $\{\hat{L}_{\alpha}(x^k)\}$ is due to Assumption 4 (Part 2). By Lemma 16 and under a proper step size, $\hat{L}_{\alpha}(x^k)$ is nonincreasing and upper bounded by $\hat{L}_{\alpha}(x^0)$. Hence, $\sum_{i=1}^n (f_i(x_i^k) + r_i(x_i^k))$ is upper bounded by $\hat{L}_{\alpha}(x^0)$. Consequently, $\{x^k\}$ is bounded, due to the coercivity of each $f_i + r_i$ (see Assumption 4, Part 2).

Lemma 18 (Bounded subgradient). Let $\partial \hat{L}_{\alpha}(x^{k+1})$ denote the limiting subdifferential of $\hat{L}_{\alpha}$ at $x^{k+1}$, which is assumed to exist for all $k \in \mathbb{N}$. Then there exists $g^{k+1} \in \partial \hat{L}_{\alpha}(x^{k+1})$ such that

  $\|g^{k+1}\| \le \Big( \frac{2 - \lambda_n(W)}{\alpha} + L_f \Big)\|x^{k+1} - x^k\|$.

Proof. By the iteration (19), the following optimality condition holds:

  $0 \in \frac{1}{\alpha}\big( x^{k+1} - x^k + \alpha \nabla L_{\alpha}(x^k) \big) + \partial r(x^{k+1})$,   (80)

where $\partial r(x^{k+1})$ denotes the limiting subdifferential of $r$ at $x^{k+1}$. For the $\xi^{k+1} \in \partial r(x^{k+1})$ given by (80), it follows that

  $\nabla L_{\alpha}(x^{k+1}) + \xi^{k+1} = \frac{1}{\alpha}\big( x^k - x^{k+1} \big) + \nabla L_{\alpha}(x^{k+1}) - \nabla L_{\alpha}(x^k)$,

which immediately yields

  $\|\nabla L_{\alpha}(x^{k+1}) + \xi^{k+1}\| \le \frac{1}{\alpha}\|x^{k+1} - x^k\| + \|\nabla L_{\alpha}(x^{k+1}) - \nabla L_{\alpha}(x^k)\| \le \Big( \frac{2 - \lambda_n(W)}{\alpha} + L_f \Big)\|x^{k+1} - x^k\|$.

Setting $g^{k+1} := \nabla L_{\alpha}(x^{k+1}) + \xi^{k+1} \in \partial \hat{L}_{\alpha}(x^{k+1})$, the claim of Lemma 18 holds.

Based on Lemmas 16-18, we can easily prove Theorem 3 and Proposition 5.

Proof of Theorem 3. The proof of this theorem is similar to that of Theorem 1 and is thus omitted.

Proof of Proposition 5. The proof is similar to that of Proposition 2. We note, however, that in (26), $a = \frac{1 + \lambda_n(W) - \alpha L_f}{2\alpha}$ if the $r_i$'s are convex, while $a = \frac{\lambda_n(W) - \alpha L_f}{2\alpha}$ if the $r_i$'s are not necessarily convex and $\lambda_n(W) > 0$.

G. Proofs of Theorem 4 and Proposition 6

Based on the iteration (6) of Prox-DGD, we derive the following recursion for the iterates of Prox-DGD, which is similar to (16).

Lemma 19 (Recursion of $\{x^k\}$). For any $k \in \mathbb{N}$,

  $x^k = W^k x^0 - \sum_{j=0}^{k-1} \alpha_j W^{k-1-j}\big( \nabla f(x^j) + \xi^{j+1} \big)$,   (81)

where $\xi^{j+1} \in \partial r(x^{j+1})$ is the one determined by the proximal operator (7), for any $j = 0, \ldots, k-1$.

Proof. By the definition of the proximal operator (7), the iteration (6) implies

  $x^{k+1} + \alpha_k \xi^{k+1} = W x^k - \alpha_k \nabla f(x^k)$,   (82)

where $\xi^{k+1} \in \partial r(x^{k+1})$, and thus

  $x^{k+1} = W x^k - \alpha_k \big( \nabla f(x^k) + \xi^{k+1} \big)$.   (83)

From (83), we can easily derive the recursion (81).

Proof of Proposition 6. The proof of this proposition is similar to that of Proposition 3; one only needs to note that the subgradient term $\nabla f(x^j) + \xi^{j+1}$ is uniformly bounded by the constant $B$ for any $j$ (Assumption 5). Thus, we omit it here.

To prove Theorem 4, we also need the following lemmas.

Lemma 20. Let Assumptions 2 and 4 hold. In Prox-DGD, use the step sizes (15). Results are given in the two cases below.

C1: the $r_i$'s are convex. For any $k \in \mathbb{N}$,

  $\hat{L}_{\alpha_{k+1}}(x^{k+1}) \le \hat{L}_{\alpha_k}(x^k) + \Big( \frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k} \Big)\|x^{k+1}\|_{I-W}^2 - \frac{1 + \lambda_n(W) - \alpha_k L_f}{2\alpha_k}\|x^{k+1} - x^k\|^2$.   (84)

C2: the $r_i$'s are not necessarily convex. For any $k \in \mathbb{N}$,

  $\hat{L}_{\alpha_{k+1}}(x^{k+1}) \le \hat{L}_{\alpha_k}(x^k) + \Big( \frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k} \Big)\|x^{k+1}\|_{I-W}^2 - \frac{\lambda_n(W) - \alpha_k L_f}{2\alpha_k}\|x^{k+1} - x^k\|^2$.   (85)

Proof. The proof of this lemma is similar to that of Lemma 16, noting that

  $\hat{L}_{\alpha_{k+1}}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k) = \big( \hat{L}_{\alpha_{k+1}}(x^{k+1}) - \hat{L}_{\alpha_k}(x^{k+1}) \big) + \big( \hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k) \big)$

and

  $\hat{L}_{\alpha_{k+1}}(x^{k+1}) - \hat{L}_{\alpha_k}(x^{k+1}) = \Big( \frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k} \Big)\|x^{k+1}\|_{I-W}^2$,

while the term $\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k)$ can be estimated as in the proof of Lemma 16.

Lemma 21. Let Assumptions 2, 4, and 5 hold. In Prox-DGD, use the step sizes (15). If, further, each $f_i$ and $r_i$ is convex, then for any $u \in \mathbb{R}^{n \times p}$, we have

  $\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(u) \le \frac{1}{2\alpha_k}\big( \|x^k - u\|^2 - \|x^{k+1} - u\|^2 \big)$.

Proof. By Lemma 15, we have

  $L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \langle \nabla L_{\alpha_k}(x^k), u - x^{k+1} \rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2$,   (86)

where $L_k = L_f + \frac{1 - \lambda_n(W)}{\alpha_k}$, and, by the convexity of $r$,

  $r(u) \ge r(x^{k+1}) + \langle \xi^{k+1}, u - x^{k+1} \rangle$,   (87)

where $\xi^{k+1} \in \partial r(x^{k+1})$ is the one determined by the proximal operator (7). By (19) (equivalently (83)), it follows that

  $\xi^{k+1} = \frac{1}{\alpha_k}\big( x^k - x^{k+1} \big) - \nabla L_{\alpha_k}(x^k)$.   (88)

Plugging (88) into (87), and then summing (86) and (87), yields

  $\hat{L}_{\alpha_k}(u) \ge \hat{L}_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1} \rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2$.   (89)

Similar to the rest of the proof of inequality (67), we can prove this lemma based on (89).

Proof of Theorem 4. Based on Lemma 20 and Lemma 21, we can prove Theorem 4. The proof of Theorem 4(a)-(d) is similar to that of Theorem 2, while the proof of Theorem 4(e) is very similar to that of Proposition 4; thus, the proof of this theorem is omitted.

VI. CONCLUSION

In this paper, we study the convergence behavior of the algorithm DGD for smooth, possibly nonconvex consensus optimization. We consider both fixed and decreasing step sizes. When using a fixed step size, we show that the iterates of DGD converge to a stationary point of a Lyapunov function, which approximates one of the original problem. Moreover, we bound the distance between each local point and the global average; the bound is proportional to the step size and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of the mixing matrix. This motivates us to study DGD with decreasing step sizes. When using decreasing step sizes, we show that the iterates of DGD reach consensus asymptotically at a sublinear rate and converge to a stationary point of the original problem. We also estimate the convergence rates of the objective sequence in the convex setting under different diminishing step size strategies. Furthermore, we extend these convergence results to Prox-DGD, designed for minimizing the sum of a differentiable function and a proximable function. Both functions can be nonconvex. If the proximable function is convex, a larger fixed step size is allowed. These results are obtained by applying both existing and new proof techniques.

ACKNOWLEDGMENTS

The work of J. Zeng has been supported in part by the NSF grants 66036 and the doctoral start-up foundation of Jiangxi Normal University. The work of W. Yin has been supported in part by the NSF grant ECCS and ONR grants N and N.

REFERENCES

[1] H. Attouch and J. Bolte, "On the convergence of the proximal algorithm for nonsmooth functions involving analytic features," Math. Program., 116:5-16, 2009.
[2] H. Attouch, J. Bolte and B. Svaiter, "Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods," Math. Program., Ser. A, 137:91-129, 2013.
[3] P. Bianchi and J. Jakubowicz, "Convergence of a multi-agent projected stochastic gradient algorithm for nonconvex optimization," IEEE Trans. Automatic Control, 58(2):391-405, 2013.
[4] P. Bianchi, G. Fort and W. Hachem, "Performance of a distributed stochastic approximation algorithm," IEEE Trans. Information Theory, 59, 2013.
[5] J. Bolte, A. Daniilidis and A. Lewis, "The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems," SIAM Journal on Optimization, 17(4):1205-1223, 2007.
[6] A. Chen and A. Ozdaglar, "A fast distributed proximal gradient method," in Proc. 50th Allerton Conf. Commun., Control Comput., Monticello, IL, Oct. 2012.
[7] T. Chang, M. Hong and X. Wang, "Multi-agent distributed optimization via inexact consensus ADMM," IEEE Trans. Signal Process., 63, 2015.
[8] A. Chen, "Fast Distributed First-Order Methods," Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 2012.
[9] Y.T. Chow, T. Wu and W. Yin, "Cyclic coordinate update algorithms for fixed-point problems: analysis and applications," UCLA CAM Report 16-78, 2016.
[10] W. Deng, M. Lai, Z. Peng and W. Yin, "Parallel multi-block ADMM with o(1/k) convergence," Journal of Scientific Computing, DOI 10.1007/s10915-016-0318-2, 2016.
[11] E. Hazan, K.Y. Levy and S. Shalev-Shwartz, "On graduated optimization for stochastic nonconvex problems," in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016.
[12] M. Hardt, B. Recht and Y. Singer, "Train faster, generalize better: stability of stochastic gradient descent," in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016.
[13] D. Hajinezhad, M. Hong and A. Garcia, "ZENITH: a zeroth-order distributed algorithm for multi-agent nonconvex optimization," technical report.
[14] M. Hong, Z. Luo and M. Razaviyayn, "Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems," ICASSP 2015.
[15] S. Hosseini, A. Chapman and M. Mesbahi, "Online distributed optimization on dynamic networks," IEEE Trans. Automatic Control, 61, 2016.
[16] D. Jakovetic, J. Xavier and J. Moura, "Fast distributed gradient methods," IEEE Trans. Automatic Control, 59(5):1131-1146, 2014.
[17] D. Kempe, A. Dobra and J. Gehrke, "Gossip-based computation of aggregate information," in Foundations of Computer Science, Proceedings of the 44th Annual IEEE Symposium, 482-491, IEEE Computer Society, 2003.
[18] K. Knopp, Infinite Sequences and Series, Courier Corporation, 1956.
[19] J. Lafond, H. Wai and E. Moulines, "D-FW: communication efficient distributed algorithms for high-dimensional sparse optimization," ICASSP 2016.
[20] S. Lee and A. Nedic, "Distributed random projection algorithm for convex optimization," IEEE J. Sel. Topics Signal Process., 7(2):221-229, 2013.
[21] Q. Ling and Z. Tian, "Decentralized sparse signal recovery for compressive sleeping wireless sensor networks," IEEE Trans. Signal Process., 58(7), 2010.
[22] S. Łojasiewicz, "Sur la géométrie semi- et sous-analytique," Ann. Inst. Fourier (Grenoble), 43(5):1575-1595, 1993.
[23] P.D. Lorenzo and G. Scutari, "NEXT: in-network nonconvex optimization," IEEE Trans. Signal and Information Processing over Networks, 2(2):120-136, 2016.
[24] P.D. Lorenzo and G. Scutari, "Distributed nonconvex optimization over time-varying networks," ICASSP 2016.
[25] I. Matei and J. Baras, "Performance evaluation of the consensus-based distributed subgradient method under random communication topologies," IEEE J. Sel. Top. Signal Process., 5:754-771, 2011.
[26] H. McMahan and M. Streeter, "Delay-tolerant algorithms for asynchronous distributed online learning," in Advances in Neural Information Processing Systems (NIPS), 2014.
[27] G. Mateos, J. Bazerque and G. Giannakis, "Distributed sparse linear regression," IEEE Trans. Signal Process., 58(10), 2010.
[28] A. Nedic and A. Ozdaglar, "Distributed subgradient methods for multi-agent optimization," IEEE Trans. Automatic Control, 54(1):48-61, 2009.
[29] A. Nedic and A. Olshevsky, "Distributed optimization over time-varying directed graphs," IEEE Trans. Automatic Control, 60(3):601-615, 2015.
[30] M. Nevelson and R.Z. Khasminskii, Stochastic Approximation and Recursive Estimation [translated from the Russian by Israel Program for Scientific Translations; translation edited by B. Silver], American Mathematical Society, 1973.
[31] G. Qu and N. Li, "Harnessing smoothness to accelerate distributed optimization," IEEE Transactions on Control of Network Systems, 2017.
[32] M. Raginsky, N. Kiarashi and R. Willett, "Decentralized online convex programming with local information," in American Control Conference, San Francisco, CA, USA, 2011.

More information

Convergence of Fixed-Point Iterations

Convergence of Fixed-Point Iterations Convergence of Fixed-Point Iterations Instructor: Wotao Yin (UCLA Math) July 2016 1 / 30 Why study fixed-point iterations? Abstract many existing algorithms in optimization, numerical linear algebra, and

More information

A Distributed Newton Method for Network Utility Maximization, I: Algorithm

A Distributed Newton Method for Network Utility Maximization, I: Algorithm A Distributed Newton Method for Networ Utility Maximization, I: Algorithm Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract Most existing wors use dual decomposition and first-order

More information

Perturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization

Perturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization Noname manuscript No. (will be inserted by the editor Perturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization Davood Hajinezhad and Mingyi Hong Received: date / Accepted: date Abstract

More information

DLM: Decentralized Linearized Alternating Direction Method of Multipliers

DLM: Decentralized Linearized Alternating Direction Method of Multipliers 1 DLM: Decentralized Linearized Alternating Direction Method of Multipliers Qing Ling, Wei Shi, Gang Wu, and Alejandro Ribeiro Abstract This paper develops the Decentralized Linearized Alternating Direction

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Department of Systems Engineering and Engineering Management The Chinese University of Hong Kong 2014 Workshop

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

On the linear convergence of distributed optimization over directed graphs

On the linear convergence of distributed optimization over directed graphs 1 On the linear convergence of distributed optimization over directed graphs Chenguang Xi, and Usman A. Khan arxiv:1510.0149v4 [math.oc] 7 May 016 Abstract This paper develops a fast distributed algorithm,

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

Convergence rates for distributed stochastic optimization over random networks

Convergence rates for distributed stochastic optimization over random networks Convergence rates for distributed stochastic optimization over random networs Dusan Jaovetic, Dragana Bajovic, Anit Kumar Sahu and Soummya Kar Abstract We establish the O ) convergence rate for distributed

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Douglas-Rachford splitting for nonconvex feasibility problems

Douglas-Rachford splitting for nonconvex feasibility problems Douglas-Rachford splitting for nonconvex feasibility problems Guoyin Li Ting Kei Pong Jan 3, 015 Abstract We adapt the Douglas-Rachford DR) splitting method to solve nonconvex feasibility problems by studying

More information

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić February 7, 2017 Abstract We consider distributed optimization

More information

Constrained Consensus and Optimization in Multi-Agent Networks

Constrained Consensus and Optimization in Multi-Agent Networks Constrained Consensus Optimization in Multi-Agent Networks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher

More information

Convex Analysis and Optimization Chapter 2 Solutions

Convex Analysis and Optimization Chapter 2 Solutions Convex Analysis and Optimization Chapter 2 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

Newton-like method with diagonal correction for distributed optimization

Newton-like method with diagonal correction for distributed optimization Newton-lie method with diagonal correction for distributed optimization Dragana Bajović Dušan Jaovetić Nataša Krejić Nataša Krlec Jerinić August 15, 2015 Abstract We consider distributed optimization problems

More information

A Distributed Newton Method for Network Utility Maximization, II: Convergence

A Distributed Newton Method for Network Utility Maximization, II: Convergence A Distributed Newton Method for Network Utility Maximization, II: Convergence Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract The existing distributed algorithms for Network Utility

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Decentralized Quadratically Approimated Alternating Direction Method of Multipliers Aryan Mokhtari Wei Shi Qing Ling Alejandro Ribeiro Department of Electrical and Systems Engineering, University of Pennsylvania

More information

A Distributed Newton Method for Network Utility Maximization

A Distributed Newton Method for Network Utility Maximization A Distributed Newton Method for Networ Utility Maximization Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie Abstract Most existing wor uses dual decomposition and subgradient methods to solve Networ Utility

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Distributed Optimization over Random Networks

Distributed Optimization over Random Networks Distributed Optimization over Random Networks Ilan Lobel and Asu Ozdaglar Allerton Conference September 2008 Operations Research Center and Electrical Engineering & Computer Science Massachusetts Institute

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

arxiv: v3 [math.oc] 8 Jan 2019

arxiv: v3 [math.oc] 8 Jan 2019 Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Distributed Optimization over Networks Gossip-Based Algorithms

Distributed Optimization over Networks Gossip-Based Algorithms Distributed Optimization over Networks Gossip-Based Algorithms Angelia Nedić angelia@illinois.edu ISE Department and Coordinated Science Laboratory University of Illinois at Urbana-Champaign Outline Random

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 5, No., pp. 944 966 c 05 Society for Industrial and Applied Mathematics EXTRA: AN EXACT FIRST-ORDER ALGORITHM FOR DECENTRALIZED CONSENSUS OPTIMIZATION WEI SHI, QING LING, GANG WU, AND

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties

More information

An asymptotic ratio characterization of input-to-state stability

An asymptotic ratio characterization of input-to-state stability 1 An asymptotic ratio characterization of input-to-state stability Daniel Liberzon and Hyungbo Shim Abstract For continuous-time nonlinear systems with inputs, we introduce the notion of an asymptotic

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

New hybrid conjugate gradient methods with the generalized Wolfe line search

New hybrid conjugate gradient methods with the generalized Wolfe line search Xu and Kong SpringerPlus (016)5:881 DOI 10.1186/s40064-016-5-9 METHODOLOGY New hybrid conjugate gradient methods with the generalized Wolfe line search Open Access Xiao Xu * and Fan yu Kong *Correspondence:

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

On proximal-like methods for equilibrium programming

On proximal-like methods for equilibrium programming On proximal-lie methods for equilibrium programming Nils Langenberg Department of Mathematics, University of Trier 54286 Trier, Germany, langenberg@uni-trier.de Abstract In [?] Flam and Antipin discussed

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

arxiv: v1 [stat.ml] 12 Nov 2015

arxiv: v1 [stat.ml] 12 Nov 2015 Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints Mengdi Wang, Yichen Chen, Jialin Liu, Yuantao Gu arxiv:5.03760v [stat.ml] Nov 05 November 3,

More information

Alternative Characterization of Ergodicity for Doubly Stochastic Chains

Alternative Characterization of Ergodicity for Doubly Stochastic Chains Alternative Characterization of Ergodicity for Doubly Stochastic Chains Behrouz Touri and Angelia Nedić Abstract In this paper we discuss the ergodicity of stochastic and doubly stochastic chains. We define

More information

arxiv: v2 [math.oc] 21 Nov 2017

arxiv: v2 [math.oc] 21 Nov 2017 Unifying abstract inexact convergence theorems and block coordinate variable metric ipiano arxiv:1602.07283v2 [math.oc] 21 Nov 2017 Peter Ochs Mathematical Optimization Group Saarland University Germany

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 Masao Fukushima 2 July 17 2010; revised February 4 2011 Abstract We present an SOR-type algorithm and a

More information

MATH 680 Fall November 27, Homework 3

MATH 680 Fall November 27, Homework 3 MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Subgradient Methods in Network Resource Allocation: Rate Analysis

Subgradient Methods in Network Resource Allocation: Rate Analysis Subgradient Methods in Networ Resource Allocation: Rate Analysis Angelia Nedić Department of Industrial and Enterprise Systems Engineering University of Illinois Urbana-Champaign, IL 61801 Email: angelia@uiuc.edu

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in

More information

THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS

THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS Submitted: 24 September 2007 Revised: 5 June 2008 THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS by Angelia Nedić 2 and Dimitri P. Bertseas 3 Abstract In this paper, we study the influence

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Active sets, steepest descent, and smooth approximation of functions

Active sets, steepest descent, and smooth approximation of functions Active sets, steepest descent, and smooth approximation of functions Dmitriy Drusvyatskiy School of ORIE, Cornell University Joint work with Alex D. Ioffe (Technion), Martin Larsson (EPFL), and Adrian

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

A derivative-free nonmonotone line search and its application to the spectral residual method

A derivative-free nonmonotone line search and its application to the spectral residual method IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral

More information

Sequential convex programming,: value function and convergence

Sequential convex programming,: value function and convergence Sequential convex programming,: value function and convergence Edouard Pauwels joint work with Jérôme Bolte Journées MODE Toulouse March 23 2016 1 / 16 Introduction Local search methods for finite dimensional

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725 Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

More information

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Sparse Optimization Lecture: Dual Methods, Part I

Sparse Optimization Lecture: Dual Methods, Part I Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 20 Subgradients Assumptions

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

A user s guide to Lojasiewicz/KL inequalities

A user s guide to Lojasiewicz/KL inequalities Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f

More information

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions Angelia Nedić and Asuman Ozdaglar April 16, 2006 Abstract In this paper, we study a unifying framework

More information

Efficient Methods for Large-Scale Optimization

Efficient Methods for Large-Scale Optimization Efficient Methods for Large-Scale Optimization Aryan Mokhtari Department of Electrical and Systems Engineering University of Pennsylvania aryanm@seas.upenn.edu Ph.D. Proposal Advisor: Alejandro Ribeiro

More information

Block Coordinate Descent for Regularized Multi-convex Optimization

Block Coordinate Descent for Regularized Multi-convex Optimization Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline

More information

Distributed intelligence in multi agent systems

Distributed intelligence in multi agent systems Distributed intelligence in multi agent systems Usman Khan Department of Electrical and Computer Engineering Tufts University Workshop on Distributed Optimization, Information Processing, and Learning

More information

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4 Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.

More information

Tight Rates and Equivalence Results of Operator Splitting Schemes

Tight Rates and Equivalence Results of Operator Splitting Schemes Tight Rates and Equivalence Results of Operator Splitting Schemes Wotao Yin (UCLA Math) Workshop on Optimization for Modern Computing Joint w Damek Davis and Ming Yan UCLA CAM 14-51, 14-58, and 14-59 1

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Local strong convexity and local Lipschitz continuity of the gradient of convex functions

Local strong convexity and local Lipschitz continuity of the gradient of convex functions Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate

More information

Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity

Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity 1 Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity Sijia Liu, Member, IEEE, Pin-Yu Chen, Member, IEEE, and Alfred O. Hero, Fellow, IEEE arxiv:1704.05193v2 [stat.ml]

More information

Nonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel

Nonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. X, O. X, X X onparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel Weiguang Wang, Yingbin Liang, Member, IEEE, Eric P. Xing, Senior

More information

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions Angelia Nedić and Asuman Ozdaglar April 15, 2006 Abstract We provide a unifying geometric framework for the

More information