On Nonconvex Decentralized Gradient Descent

Jinshan Zeng and Wotao Yin

Abstract—Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been proposed for convex consensus optimization. However, our understanding of their behavior for nonconvex consensus optimization is much more limited. When we lose convexity, we cannot hope that our algorithms always return global solutions, though they sometimes still do. Somewhat surprisingly, the decentralized consensus algorithms DGD and Prox-DGD retain most of the other properties that are known in the convex setting. In particular, when diminishing or constant step sizes are used, we can prove convergence to a (or a neighborhood of a) consensus stationary solution, with guaranteed rates of convergence. It is worth noting that Prox-DGD can handle nonconvex nonsmooth functions as long as their proximal operators can be computed. Such functions include SCAD and the ℓ_q quasi-norms, q ∈ [0, 1). Similarly, Prox-DGD can take constraints given by a nonconvex set with an easy projection. To establish these properties, we introduce a completely different line of analysis, as well as modify existing proofs that were used in the convex setting.

Index Terms—Nonconvex decentralized computing, consensus optimization, decentralized gradient descent method, proximal decentralized gradient descent.

I. INTRODUCTION

We consider an undirected, connected network of n agents and the following consensus optimization problem defined on the network:

  minimize_{x ∈ R^p}  f(x) := Σ_{i=1}^n f_i(x),   (1)

where f_i is a differentiable function known only to agent i. We also consider the consensus optimization problem in the following differentiable+proximable form:

  minimize_{x ∈ R^p}  s(x) := Σ_{i=1}^n ( f_i(x) + r_i(x) ),   (2)

where f_i and r_i are differentiable and proximable functions, respectively, known only to agent i. Each function r_i is possibly non-differentiable or nonconvex, or both.
The models (1) and (2) find applications in decentralized averaging, learning, estimation, and control. Some specific examples include: (i) distributed compressed sensing and machine learning problems, where f_i is a data-fidelity term, which is often differentiable, and r_i is a sparsity-promoting regularizer such as the ℓ_q quasi-norm with 0 ≤ q < 1 [], [7]; (ii) optimization problems with per-agent constraints, where f_i is a differentiable objective function of agent i and r_i is the indicator function of the constraint set of agent i, that is, r_i(x) = 0 if x satisfies the constraint and r_i(x) = +∞ otherwise [7], [0].

  J. Zeng is with the College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 3300, China (jsh.zeng@gmail.com). W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095, USA (wotaoyin@ucla.edu).
  We call a function f proximable if its proximal operator prox_{αf}(y) := argmin_x { αf(x) + (1/2)‖x − y‖² } is easy to compute.

When the f_i's are convex, existing algorithms include the subgradient methods [6], [8], [6], [5], [8], [4], [46], [3], and primal-dual methods such as the decentralized alternating direction method of multipliers (D-ADMM) [35], [36], [7] and EXTRA [37], [38]. However, when the f_i's are nonconvex, few algorithms have convergence guarantees. Some existing results include [3], [4], [3], [3], [4], [39], [40], [9], [4], [43], [48]. In spite of the algorithms and their analyses in these works, the convergence of the simple algorithm Decentralized Gradient Descent (DGD) [8] under nonconvex f_i's is still unknown. Furthermore, although DGD is slower than D-ADMM and EXTRA on convex problems, DGD is simpler and thus easier to extend to a variety of settings such as [3], [45], [6], [5], where online processing and delay tolerance are considered. Therefore, we expect our results to motivate future adoptions of nonconvex DGD.
This paper studies the convergence of two algorithms: DGD for solving problem (1) and Prox-DGD for problem (2). In each DGD iteration, every agent locally computes a gradient and then updates its variable by combining the average of its neighbors' variables with its negative gradient step. In each Prox-DGD iteration, every agent locally computes a gradient of f_i and a proximal map of r_i, as well as exchanging information with its neighbors. Both algorithms can use either a fixed step size or a sequence of decreasing step sizes. When the problem is convex and a fixed step size is used, DGD does not converge to a solution of the original problem but to a point in its neighborhood [46]. This motivates the use of decreasing step sizes such as in [8], [6]. Assuming the f_i's are convex and have Lipschitz continuous and bounded gradients, [8] shows that the decreasing step sizes α_k = 1/√k lead to a convergence rate of O(ln k/√k) for the running best objective error. [6] uses nested loops and shows an outer-loop convergence rate of O(1/k²) for objective errors, utilizing Nesterov's acceleration, provided that the inner loop performs substantial consensus computation. Without a substantial inner loop, their single-loop algorithm using the decreasing step sizes α_k = 1/k^{1/3} has a reduced rate of O(ln k/k^{1/3}). The objective of this paper is two-fold: (a) we aim to show that, other than losing global optimality, most existing convergence results of DGD and Prox-DGD that are known in the convex setting remain valid in the nonconvex setting, and (b) to achieve (a), we illustrate how to tailor nonconvex analysis tools for decentralized optimization. In particular, our asymptotic exact and inexact consensus results require new treatments because they are special to decentralized algorithms. The analytic results of this paper can be summarized as
follows. (a) When a fixed step size α is used and properly bounded, the DGD iterates converge to a stationary point of a Lyapunov function. The difference between each local estimate of x and the global average of all local estimates is bounded, and the bound is proportional to α. (b) When decreasing step sizes α_k = O(1/(1+k)^ε) are used, where 0 < ε ≤ 1 and k is the iteration number, the objective sequence converges, and the iterates of DGD are asymptotically consensual (i.e., they become equal to one another), reaching consensus at the rate O(1/(1+k)^ε). Moreover, we show the convergence of DGD to a stationary point of the original problem, and derive the convergence rates of DGD under different ε for objective functions that are convex. (c) The convergence analysis of DGD can be extended to the algorithm Prox-DGD for solving problem (2). However, when the proximable functions r_i are nonconvex, the mixing matrix is required to be positive definite and a smaller step size is also required; otherwise (when the r_i are convex), the mixing matrix may be indefinite.

Detailed comparisons between our results and the existing results on DGD and Prox-DGD are presented in Tables I and II. The global objective error rate in these two tables refers to the rate of {f(x̄^k) − f(x^opt)} or {s(x̄^k) − s(x^opt)}, where x̄^k = (1/n) Σ_{i=1}^n x_i^k is the average of the kth iterate and x^opt is a global solution. Comparisons beyond DGD and Prox-DGD are presented in Section IV and Table III.

New proof techniques are introduced in this paper, particularly in the analysis of the convergence of DGD and Prox-DGD with decreasing step sizes. Specifically, the convergence of the objective sequence and the convergence to a stationary point of the original problem under decreasing step sizes are justified by taking a Lyapunov function and several new lemmas (cf. Lemma 9 and the proof of Theorem 2). Moreover, we estimate the consensus rate by introducing an auxiliary sequence and then showing that both sequences have the same rates (cf. the proof of Proposition 3).
All these proof techniques are new and distinguish our paper from existing works such as [8], [6], [8], [3], [3], [3], [40], [43].

The rest of this paper is organized as follows. Section II describes the problem setup and reviews the algorithms. Section III presents our assumptions and main results. Section IV discusses related works. Section V presents the proofs of our main results. We conclude this paper in Section VI.

Notation: Let I denote the identity matrix of size n × n, and 1 ∈ R^n denote the vector of all 1's. For a matrix X, X^T denotes its transpose, X_{ij} denotes its (i, j)th component, and ‖X‖ := √⟨X, X⟩ = (Σ_{i,j} X_{ij}²)^{1/2} is its Frobenius norm, which simplifies to the Euclidean norm when X is a vector. Given a symmetric, positive semidefinite matrix G ∈ R^{n×n}, we let ‖X‖_G := √⟨X, GX⟩ be the induced semi-norm. Given a function h, dom(h) denotes its domain. An edge (i, j) ∈ E represents a communication link between nodes i and j.

Let x_i ∈ R^p denote the local copy of x at node i. We reformulate the consensus problem (1) into the equivalent problem

  minimize_x 1^T f(x) := Σ_{i=1}^n f_i(x_i),   (3)
  subject to x_i = x_j, ∀(i, j) ∈ E,

where x ∈ R^{n×p} and f(x) ∈ R^n are stacked as

  x := (x_1^T; x_2^T; …; x_n^T),   f(x) := (f_1(x_1); f_2(x_2); …; f_n(x_n)).

In addition, the gradient of f(x) is

  ∇f(x) := (∇f_1(x_1)^T; ∇f_2(x_2)^T; …; ∇f_n(x_n)^T) ∈ R^{n×p}.

The ith rows of the matrices x and ∇f(x), and of the vector f(x), correspond to agent i. The analysis in this paper applies to any integer p ≥ 1. For simplicity, one can let p = 1 and treat x and ∇f(x) as vectors rather than matrices.

The algorithm DGD [8] for (3) is described as follows: pick an arbitrary x^0; for k = 0, 1, …, compute

  x^{k+1} = W x^k − α ∇f(x^k),   (4)

where W is a mixing matrix and α > 0 is a step size parameter.

Similarly, we can reformulate the composite problem (2) as the equivalent form

  minimize_x Σ_{i=1}^n ( f_i(x_i) + r_i(x_i) ),   (5)
  subject to x_i = x_j, ∀(i, j) ∈ E.

Let r(x) := Σ_{i=1}^n r_i(x_i). The algorithm Prox-DGD can be applied to problem (5): take an arbitrary x^0; for k = 0, 1, …, perform

  x^{k+1} = prox_{α_k r}( W x^k − α_k ∇f(x^k) ),   (6)

where the proximal operator is

  prox_{α_k r}(x) := argmin_{u ∈ R^{n×p}} { α_k r(u) + (1/2) ‖u − x‖² }.   (7)
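To make iteration (4) concrete, here is a minimal Python sketch (our own illustration, not part of the paper): it runs DGD on a hypothetical toy instance with quadratic f_i(x) = (1/2)(x − b_i)², whose consensus minimizer is the average of the b_i, using a simple averaging mixing matrix of our choosing that satisfies Assumption 2.

```python
import numpy as np

def dgd(grads, W, x0, alpha, iters):
    """Run the DGD iteration (4): x^{k+1} = W x^k - alpha * grad f(x^k).
    Row i of x is agent i's local copy; grads[i] computes grad f_i."""
    x = x0.astype(float)
    for _ in range(iters):
        g = np.array([grads[i](x[i]) for i in range(len(grads))])
        x = W @ x - alpha * g
    return x

# Toy instance (our choice, not from the paper): f_i(x) = 0.5*(x - b_i)^2,
# so the consensus minimizer of (1) is mean(b).
n = 4
b = np.array([1.0, 2.0, 3.0, 4.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
# Symmetric, doubly stochastic W: mix each agent with the network average.
W = 0.5 * np.eye(n) + 0.5 * np.ones((n, n)) / n
x = dgd(grads, W, x0=np.zeros(n), alpha=0.05, iters=2000)
```

Consistent with the fixed-step-size theory, the average of the local copies lands on mean(b), while the individual copies stay spread around it by an amount proportional to α, as Proposition 1 below predicts.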
II. PROBLEM SETUP AND ALGORITHM REVIEW

Consider a connected undirected network G = {V, E}, where V is a set of n nodes and E is the edge set. Any edge (i, j) ∈ E represents a communication link between nodes i and j.

III. ASSUMPTIONS AND MAIN RESULTS

This section presents all of our main results.
TABLE I
COMPARISONS OF DIFFERENT ALGORITHMS FOR THE CONSENSUS SMOOTH OPTIMIZATION PROBLEM (1).

| | DGD [46] (fixed step size) | DGD, this paper (fixed step size) | D-NG [6] (decreasing step sizes) | DGD, this paper (decreasing step sizes) |
| --- | --- | --- | --- | --- |
| f_i | convex only | nonconvex | convex only | nonconvex |
| ∇f_i | Lipschitz | Lipschitz | Lipschitz, bounded | Lipschitz, bounded |
| step size | 0 < α < (1+λ_n(W))/L_f | 0 < α < (1+λ_n(W))/L_f | O(1/k), with Nesterov acc. | 1/(1+k)^ε, ε ∈ (0, 1] |
| consensus error | O(α) | O(α) | O(1/k) | O(1/(1+k)^ε) |
| min_{j≤k} ‖x^{j+1} − x^j‖² | — | o(1/k) | no rate | o(1/(1+k)^{1+ε}) |
| global objective error | O(1/k) until reaching O(α/(1−ζ)) | convex: O(1/k) until reaching O(α/(1−ζ)); nonconvex: no rate | O(ln k/k) | convex: O(ln k/√k) if ε = 1/2, O(1/ln k) if ε = 1, O(1/k^{min{ε, 1−ε}}) for other ε; nonconvex: no rate |

The objective error rates of DGD and Prox-DGD obtained in this paper, and those of the convex DProx-Grad [8], are ergodic or running best rates.

TABLE II
COMPARISONS OF DIFFERENT ALGORITHMS FOR THE CONSENSUS COMPOSITE OPTIMIZATION PROBLEM (2).

| | AccDProx-Grad [6] (fixed step size) | Prox-DGD, this paper (fixed step size) | DProx-Grad [8] (decreasing step sizes) | Prox-DGD, this paper (decreasing step sizes) |
| --- | --- | --- | --- | --- |
| f_i, r_i | convex only | nonconvex | convex only | nonconvex |
| ∇f_i | Lipschitz, bounded | Lipschitz | Lipschitz, bounded | Lipschitz, bounded |
| ∂r_i | bounded | — | bounded | bounded |
| step size | 0 < α < 1/L_f | 0 < α < (1+λ_n(W))/L_f for convex r_i; 0 < α < λ_n(W)/L_f for nonconvex r_i (λ_n(W) > 0) | O(1/√(k+1)) | 1/(1+k)^ε, ε ∈ (0, 1] |
| consensus error | O(γ^k), 0 < γ < 1 | O(α) | O(1/√k) | O(1/(1+k)^ε) |
| min_{j≤k} ‖x^{j+1} − x^j‖² | no rate | o(1/k) | no rate | o(1/(1+k)^{1+ε}) |
| global objective error | O(D_1/(αk) + D_2 α), D_1, D_2 > 0 | convex: O(D_3/(αk) + D_4 α), D_3, D_4 > 0; nonconvex: no rate | O(ln k/√k) | convex: O(ln k/√k) if ε = 1/2, O(1/ln k) if ε = 1, O(1/k^{min{ε, 1−ε}}) for other ε; nonconvex: no rate |

The objective error rates are ergodic or running best rates.

A. Definitions and assumptions

Definition 1 (Lipschitz differentiability). A function h is called Lipschitz differentiable if h is differentiable and its gradient ∇h is Lipschitz continuous, i.e., ‖∇h(u) − ∇h(v)‖ ≤ L‖u − v‖, ∀u, v ∈ dom(h), where L > 0 is its Lipschitz constant.

Definition 2 (Coercivity). A function h is called coercive if ‖u‖ → +∞ implies h(u) → +∞.
TABLE III
COMPARISONS OF THE SCENARIOS HANDLED BY DIFFERENT NONCONVEX DECENTRALIZED ALGORITHMS.

Each algorithm is compared on: smooth f_i; nonsmooth r_i (cvx or ncvx); step size (fixed or diminish); network (static or dynamic); algorithm type (determin or stochastic); fusion scheme (ATC or CTA); and the type of mixing matrix W:

| algorithm | mixing matrix W |
| --- | --- |
| DGD (this paper) | doubly |
| Perturbed Push-sum [40] | column |
| ZENITH [3] | doubly |
| Prox-DGD (this paper) | doubly |
| NEXT [3] | doubly |
| DeFW [43] | doubly |
| Proj SGD [3] | row |

In this table, the full names of the abbreviations are as follows: cvx (convex), ncvx (nonconvex), diminish (diminishing), determin (deterministic), ATC (adaptive-then-combine), CTA (combine-then-adaptive), doubly (doubly stochastic), column (column stochastic), row (row stochastic). A row, column, or doubly stochastic W means that W1 = 1, W^T 1 = 1, or both hold, respectively.

The next definition is a property that many functions have (see [44, Section .] for examples) and can help obtain whole sequence convergence from subsequence convergence.

Definition 3 (Kurdyka-Łojasiewicz (KŁ) property [], [5], []). A function h : R^p → R ∪ {+∞} has the KŁ property at x̄ ∈ dom(∂h) if there exist η ∈ (0, +∞], a neighborhood U of x̄, and a continuous concave function ϕ : [0, η) → R_+ such that: (i) ϕ(0) = 0 and ϕ is differentiable on (0, η); (ii) for all s ∈ (0, η), ϕ′(s) > 0; (iii) for all x ∈ U ∩ {x : h(x̄) < h(x) < h(x̄) + η}, the KŁ inequality holds:

  ϕ′( h(x) − h(x̄) ) · dist( 0, ∂h(x) ) ≥ 1.   (8)

Proper lower semi-continuous functions that satisfy the KŁ inequality at each point of dom(∂h) are called KŁ functions.

Assumption 1 (Objective). The objective functions f_i : R^p → R, i = 1, …, n, satisfy the following:
1) f_i is Lipschitz differentiable with constant L_i > 0.
2) f_i is proper (i.e., not everywhere infinite) and coercive.
The sum Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable with L_f := max_i L_i. In addition, each f_i is lower bounded, following Part 2 of the above assumption.

Assumption 2 (Mixing matrix). The mixing matrix W = [w_{ij}] ∈ R^{n×n} has the following properties:
1) (Graph) If i ≠ j and (i, j) ∉ E, then w_{ij} = 0; otherwise, w_{ij} > 0.
2) (Symmetry) W = W^T.
3) (Null space property) null{I − W} = span{1}.
4) (Spectral property) −I ≺ W ⪯ I.

By Assumption 2, a solution x^opt to problem (3) satisfies (I − W) x^opt = 0. Due to the symmetry of W, its eigenvalues are real and can be sorted in nonincreasing order. Let λ_i(W) denote the ith largest eigenvalue of W. Then, by Assumption 2,

  λ_1(W) = 1 > λ_2(W) ≥ ⋯ ≥ λ_n(W) > −1.

Let ζ be the second largest magnitude eigenvalue of W. Then

  ζ = max{ |λ_2(W)|, |λ_n(W)| }.   (9)

B. Convergence results of DGD

We consider the convergence of DGD with both a fixed step size and a sequence of decreasing step sizes.

1) Convergence results of DGD with a fixed step size: The convergence result of DGD with a fixed step size (i.e., α_k ≡ α) is established based on the Lyapunov function [46]:

  L_α(x) := 1^T f(x) + (1/(2α)) ‖x‖²_{I−W}.   (10)

It is worth reminding that convexity is not assumed.

Theorem 1 (Global convergence). Let {x^k} be the sequence generated by DGD (4) with the step size 0 < α < (1+λ_n(W))/L_f. Let Assumptions 1 and 2 hold. Then {x^k} has at least one accumulation point x*, and any such point is a stationary point of L_α(x). Furthermore, the running best rates of the sequences {‖x^{k+1} − x^k‖²} and {‖∇L_α(x^k)‖²} are o(1/k). In addition, if L_α satisfies the KŁ property at an accumulation point x*, then {x^k} globally converges to x*.

  Whole sequence convergence from any starting point is referred to as global convergence in the literature. Its limit is not necessarily a global solution.
  Given a nonnegative sequence {a_k}, its running best sequence is b_k = min{a_i : i ≤ k}. We say {a_k} has a running best rate of o(1/k) if b_k = o(1/k). These squared quantities naturally appear in the analysis, so we keep the squares.

Remark 1. Let x* be a stationary point of L_α(x), and thus

  0 = ∇f(x*) + (1/α)(I − W) x*.

Since 1^T (I − W) = 0, this yields 0 = 1^T ∇f(x*), indicating that x* is also a stationary point of the separable function Σ_{i=1}^n f_i(x_i). Since the rows of x* are not necessarily
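Assumption 2 can be checked numerically. The sketch below (ours; the Metropolis rule is one standard construction, not one the paper prescribes) builds a symmetric doubly stochastic W for a small path graph and computes the quantity ζ from (9).

```python
import numpy as np

def metropolis_weights(n, edges):
    """Symmetric, doubly stochastic mixing matrix via the Metropolis rule:
    w_ij = 1/(1 + max(deg_i, deg_j)) on edges; the diagonal fills each row
    so that the row sums equal 1."""
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

W = metropolis_weights(4, [(0, 1), (1, 2), (2, 3)])   # path graph 0-1-2-3
eig = np.sort(np.linalg.eigvalsh(W))                  # real: W is symmetric
zeta = max(abs(eig[-2]), abs(eig[0]))                 # 2nd largest magnitude, (9)
```

For a connected graph this W has the simple eigenvalue 1 (with eigenvector 1) and all other eigenvalues inside (−1, 1), so ζ < 1, as required by Assumption 2.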
identical, we cannot say x* is a stationary point of problem (3). However, the differences between the rows of x* are bounded, following our next result, adapted from [46]:

Proposition 1 (Consensual bound on x^k). For each iteration k, define x̄^k := (1/n) Σ_{i=1}^n x_i^k. Then, it holds for each node i that

  ‖x_i^k − x̄^k‖ ≤ αD/(1 − ζ),   (11)

where D is a universal bound on ‖∇f(x^k)‖ defined in Lemma 6 below, and ζ is the second largest magnitude eigenvalue of W specified in (9). As k → ∞, (11) yields the consensual bound

  ‖x_i^∞ − x̄^∞‖ ≤ αD/(1 − ζ),   (12)

where x̄^∞ := (1/n) Σ_{i=1}^n x_i^∞.

In Proposition 1, the consensual bound is proportional to the step size α and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of W.

Let us compare the DGD iteration (4) with the iteration of centralized gradient descent for f(x). Averaging the rows of (4) yields the following comparison:

  DGD averaged:  x̄^{k+1} = x̄^k − (α/n) Σ_{i=1}^n ∇f_i(x_i^k).   (13)
  Centralized:   x̄^{k+1} = x̄^k − (α/n) Σ_{i=1}^n ∇f_i(x̄^k).   (14)

Apparently, DGD approximates centralized gradient descent by evaluating each ∇f_i at the local variable x_i^k instead of the global average. We can estimate the error of this approximation as

  ‖(1/n) Σ_{i=1}^n ∇f_i(x_i^k) − (1/n) Σ_{i=1}^n ∇f_i(x̄^k)‖ ≤ (1/n) Σ_{i=1}^n ‖∇f_i(x_i^k) − ∇f_i(x̄^k)‖ ≤ L_f αD/(1 − ζ).

Unlike the convex analysis in [46], it is impossible to bound the difference between the sequences of (13) and (14) without convexity, because the two sequences may converge to different stationary points of L_α.

Remark 2. The KŁ assumption on L_α in Theorem 1 is satisfied if each f_i is a sub-analytic function. Since ‖x‖²_{I−W} is obviously sub-analytic and the sum of two sub-analytic functions remains sub-analytic, L_α is sub-analytic if each f_i is so. See [44, Section .] for more details and examples.

Proposition 2 (KŁ convergence rates). Let the assumptions of Theorem 1 hold. Suppose that L_α satisfies the KŁ inequality at an accumulation point x* with ϕ(s) = c s^{1−θ} for some constant c > 0. Then, the following convergence rates hold:
a) If θ = 0, x^k converges to x* in finitely many iterations.
b) If θ ∈ (0, 1/2], ‖x^k − x*‖ ≤ C_0 τ^k for all k ≥ k_0, for some k_0 > 0, C_0 > 0, τ ∈ [0, 1).
c) If θ ∈ (1/2, 1), ‖x^k − x*‖ ≤ C_0 k^{−(1−θ)/(2θ−1)} for all k ≥ k_0, for certain k_0 > 0, C_0 > 0.

Note that the rates in parts b) and c) of Proposition 2 are of the eventual type.

Using fixed step sizes, our results are limited because the stationary point x* of L_α is not a stationary point of the original problem; we only have a consensual bound on x*. To address this issue, the next subsection uses decreasing step sizes and presents better convergence results.

2) Convergence of DGD with decreasing step sizes: The positive consensual error bound in Proposition 1, which is proportional to the constant step size α, motivates the use of properly decreasing step sizes α_k = O(1/(1+k)^ε), for some 0 < ε ≤ 1, to diminish the consensual bound to 0. As a result, any accumulation point x* becomes a stationary point of the original problem (3). To analyze DGD with decreasing step sizes, we add the following assumption.

Assumption 3 (Bounded gradient). For any k, ∇f(x^k) is uniformly bounded by some constant B > 0, i.e., ‖∇f(x^k)‖ ≤ B.

Note that the bounded gradient assumption is a regular assumption in the convergence analysis of decentralized gradient methods (see, e.g., [3], [4], [3], [3], [4], [39], [40], [9], [43]), even in the convex setting ([6] and also [8]), though it is not required for centralized gradient descent. We take the step size sequence

  α_k = 1/(1+k)^ε, 0 < ε ≤ 1,   (15)

throughout the rest of this section. The numerator can be replaced by any positive constant. By iteratively applying iteration (4), we obtain the following expression:

  x^k = W^k x^0 − Σ_{j=0}^{k−1} α_j W^{k−1−j} ∇f(x^j).   (16)

Proposition 3 (Asymptotic consensus rate). Let Assumptions 2 and 3 hold. Let DGD use (15). Let x̄^k := (1/n) 1 1^T x^k. Then, ‖x^k − x̄^k‖ converges to 0 at the rate of O(1/(1+k)^ε).

According to Proposition 3, the iterates of DGD with decreasing step sizes reach consensus asymptotically, compared with the nonzero bound in the fixed step size case in Proposition 1. Moreover, with a larger ε, faster decaying step sizes generally imply a faster asymptotic consensus rate. Note that (I − W) x̄^k = 0 and thus ‖x^k‖_{I−W} = ‖x^k − x̄^k‖_{I−W}.
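The contrast with the fixed-step case can be seen on a toy quadratic instance (our hypothetical example, with f_i(x) = (1/2)(x − b_i)² and an averaging W of our choosing): under the step sizes (15), the consensus gap keeps shrinking rather than stalling at an O(α) level.

```python
import numpy as np

def dgd_decreasing(grads, W, x0, eps, iters):
    """DGD (4) with the decreasing step sizes (15): alpha_k = 1/(1+k)^eps."""
    x = x0.astype(float)
    gaps = []
    for k in range(iters):
        alpha = 1.0 / (1 + k) ** eps
        g = np.array([grads[i](x[i]) for i in range(len(grads))])
        x = W @ x - alpha * g
        gaps.append(np.max(np.abs(x - x.mean())))   # consensus gap at step k
    return x, gaps

n = 4
b = np.array([1.0, 2.0, 3.0, 4.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
W = 0.5 * np.eye(n) + 0.5 * np.ones((n, n)) / n
x, gaps = dgd_decreasing(grads, W, np.zeros(n), eps=0.5, iters=5000)
```

The recorded gaps decay roughly like O(1/(1+k)^ε), in line with Proposition 3, while the average of the local copies still converges to the minimizer mean(b) because Σ_k α_k = ∞.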
Therefore, the above proposition implies the following result.

Corollary 1. Apply the setting of Proposition 3. Then ‖x^k‖_{I−W} converges to 0 at the rate of O(1/(1+k)^ε).

Corollary 1 shows that the sequence {x^k}, measured in the (I − W) semi-norm, decays to 0 at a sublinear rate. For any global consensual solution x^opt to problem (3), we have ‖x^k − x^opt‖_{I−W} = ‖x^k‖_{I−W}; so, if {x^k} does converge to x^opt, then their distance in the same semi-norm decays at O(1/(1+k)^ε).

Theorem 2 (Convergence). Let Assumptions 1, 2, and 3 hold. Let DGD use the step sizes (15). Then:
a) {L_{α_k}(x^k)} and {1^T f(x^k)} converge to the same limit;
b) lim_{k→∞} 1^T ∇f(x^k) = 0, and any limit point of {x^k} is a stationary point of problem (3);
c) in addition, if there exists an isolated accumulation point, then {x^k} converges.

In the proof of Theorem 2, we will establish

  Σ_{k=0}^∞ α_k^{−1} (1 + λ_n(W)) ‖x^{k+1} − x^k‖² < ∞,

which implies that the running best rate of the sequence {‖x^{k+1} − x^k‖²} is o(1/(1+k)^{1+ε}).

Theorem 2 shows that the objective sequence converges and that any limit point of {x^k} is a stationary point of the original problem. However, there is no result on the convergence rate of the objective sequence to an optimal value, and it is generally difficult to get such a rate without convexity. Although our primary focus is nonconvexity, next we assume convexity and present the objective convergence rate, which has an interesting relation with ε.

For any x ∈ R^{n×p}, let f(x) := Σ_{i=1}^n f_i(x_i). Even if the f_i's are convex, the solution to (3) may be non-unique. Thus, let X* be the set of solutions to (3). Given x̄^k, we pick the solution x^opt = Proj_{X*}(x̄^k) ∈ X*. Also let f^opt = f(x^opt) be the optimal value of (1). Define the ergodic objective:

  f̂_k := ( Σ_{j=0}^k α_j f(x̄^{j+1}) ) / ( Σ_{j=0}^k α_j ),   (17)

where x̄^{k+1} = (1/n) 1^T x^{k+1}. Obviously,

  f̂_k ≥ min_{j=1,…,k+1} f(x̄^j).   (18)

Proposition 4 (Convergence rates under convexity). Let Assumptions 1, 2, and 3 hold. Let DGD use the step sizes (15). If λ_n(W) > 0 and each f_i is convex, then {f̂_k} defined in (17) converges to the optimal objective value f^opt at the following rates:
a) if 0 < ε < 1/2, the rate is O(1/k^ε);
b) if ε = 1/2, the rate is O(ln k/√k);
c) if 1/2 < ε < 1, the rate is O(1/k^{1−ε});
d) if ε = 1, the rate is O(1/ln k).

The convergence rates established in Proposition 4 are almost as good as O(1/√k) when ε = 1/2. As ε goes to either 0 or 1, the rates become slower, so ε = 1/2 may be the optimal choice in terms of the convergence rate. However, by Proposition 3, a larger ε implies a faster consensus rate. Therefore, there is a tradeoff in choosing an appropriate ε in the practical implementation of DGD.

C. Convergence results of Prox-DGD

Similarly, we consider the convergence of Prox-DGD with both a fixed step size and decreasing step sizes.
The iteration (6) can be reformulated as

  x^{k+1} = prox_{α_k r}( x^k − α_k ∇L_{α_k}(x^k) ),   (19)

based on which we define the Lyapunov function

  L̂_α(x) := L_α(x) + r(x),

where we recall L_α(x) = Σ_{i=1}^n f_i(x_i) + (1/(2α)) ‖x‖²_{I−W}. Then (19) is clearly the forward-backward splitting (a.k.a. prox-gradient) iteration for

  minimize_x L̂_α(x).

Specifically, (19) first performs a gradient descent step on the differentiable function L_α(x) and then computes the proximal map of r(x). To analyze Prox-DGD, we revise Assumption 1 as follows.

Assumption 4 (Composite objective). The objective function of (5) satisfies the following:
1) Each f_i is Lipschitz differentiable with constant L_i > 0.
2) Each f_i + r_i is proper, lower semi-continuous, and coercive.
As before, Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable with L_f := max_i L_i.

1) Convergence results of Prox-DGD with a fixed step size: Based on the above assumptions, we can get the global convergence of Prox-DGD as follows.

Theorem 3 (Global convergence of Prox-DGD). Let {x^k} be the sequence generated by Prox-DGD (6), where the step size α satisfies 0 < α < (1+λ_n(W))/L_f when the r_i's are convex, and 0 < α < λ_n(W)/L_f when the r_i's are not necessarily convex (this case requires λ_n(W) > 0). Let Assumptions 2 and 4 hold. Then {x^k} has at least one accumulation point x*, and any accumulation point is a stationary point of L̂_α(x). Furthermore, the running best rates of the sequences {‖x^{k+1} − x^k‖²} and {‖g^{k+1}‖²} (where g^{k+1} is defined in Lemma 8) are both o(1/k). In addition, if L̂_α satisfies the KŁ property at an accumulation point x*, then {x^k} converges to x*.

The rate of convergence of Prox-DGD can also be established by leveraging the KŁ property.

Proposition 5 (Rate of convergence of Prox-DGD). Under the assumptions of Theorem 3, suppose that L̂_α satisfies the KŁ inequality at an accumulation point x* with ϕ(s) = c s^{1−θ} for some constant c > 0. Then the following hold:
a) If θ = 0, x^k converges to x* in finitely many iterations.
b) If θ ∈ (0, 1/2], ‖x^k − x*‖ ≤ C τ^k for all k ≥ k_0, for some k_0 > 0, C > 0, τ ∈ [0, 1).
c) If θ ∈ (1/2, 1), ‖x^k − x*‖ ≤ C k^{−(1−θ)/(2θ−1)} for all k ≥ k_0, for certain k_0 > 0, C > 0.
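As a concrete instance of the forward-backward step (19), take the convex choice r_i = λ|·| (our example; the paper also covers nonconvex r_i such as the ℓ_q quasi-norms). Its proximal map is entrywise soft-thresholding, so one Prox-DGD iteration (6) can be sketched as:

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t*|.| applied entrywise (the l1 case of a proximable r_i)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_dgd_step(x, grads, W, alpha, lam):
    """One Prox-DGD iteration (6) with r_i = lam*|.|:
    x^{k+1} = prox_{alpha*r}(W x^k - alpha * grad f(x^k))."""
    g = np.array([grads[i](x[i]) for i in range(len(grads))])
    return soft_threshold(W @ x - alpha * g, alpha * lam)

# Two agents with hypothetical f_i(x) = 0.5*(x - b_i)^2, b = (1, -1);
# the l1 term pulls the (zero-mean) consensus value toward 0.
b = np.array([1.0, -1.0])
grads = [lambda x, bi=bi: x - bi for bi in b]
W = np.full((2, 2), 0.5)          # symmetric doubly stochastic averaging
x = np.zeros(2)
for _ in range(200):
    x = prox_dgd_step(x, grads, W, alpha=0.1, lam=0.5)
```

For nonconvex r_i (e.g., ℓ_0, whose prox is hard-thresholding), the same step applies, but Theorem 3 then asks for λ_n(W) > 0 and a smaller step size.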
2) Convergence of Prox-DGD with decreasing step sizes: In Prox-DGD, we also use the decreasing step sizes (15). To investigate its convergence, the bounded gradient Assumption 3 is revised as follows.

Assumption 5 (Bounded composite subgradient). For each i, ∇f_i is uniformly bounded by some constant B_i > 0, i.e., ‖∇f_i(x)‖ ≤ B_i for any x ∈ R^p. Moreover, ‖ξ_i‖ ≤ B_{r_i} for any ξ_i ∈ ∂r_i(x) and x ∈ R^p, i = 1, …, n.

Let B := Σ_{i=1}^n (B_i + B_{r_i}). Then ‖∇f(x) + ξ‖, where ξ ∈ ∂r(x), is uniformly bounded by B for any x ∈ R^{n×p}. Note that the same assumption is used to analyze the convergence
of the distributed proximal-gradient method in the convex setting [6], [8], and it is also widely used to analyze the convergence of nonconvex decentralized algorithms as in [3], [4]. In light of Lemma 9 below, the claims in Proposition 3 and Corollary 1 also hold for Prox-DGD.

Proposition 6 (Asymptotic consensus and rate). Let Assumptions 2 and 5 hold. In Prox-DGD, use the step sizes (15). Then it holds that

  ‖x^k − x̄^k‖ ≤ C ( ‖x^0‖ ζ^k + B Σ_{j=0}^{k−1} α_j ζ^{k−1−j} ),

and ‖x^k − x̄^k‖ converges to 0 at the rate of O(1/(1+k)^ε). Moreover, let x* be any global solution of problem (5). Then ‖x^k − x*‖_{I−W} = ‖x^k‖_{I−W} = ‖x^k − x̄^k‖_{I−W} converges to 0 at the rate of O(1/(1+k)^ε).

For any x ∈ R^{n×p}, define s(x) := Σ_{i=1}^n ( f_i(x_i) + r_i(x_i) ). Let X* be the set of solutions of (5), x^opt = Proj_{X*}(x̄^k) ∈ X*, and s^opt = s(x^opt) be the optimal value of (5). Define

  ŝ_k := ( Σ_{j=0}^k α_j s(x̄^{j+1}) ) / ( Σ_{j=0}^k α_j ).   (20)

Theorem 4 (Convergence and rate). Let Assumptions 2, 4, and 5 hold. In Prox-DGD, use the step sizes (15). Then:
a) {L̂_{α_k}(x^k)} and {Σ_{i=1}^n ( f_i(x_i^k) + r_i(x_i^k) )} converge to the same limit;
b) Σ_{k=0}^∞ α_k^{−1} (1 + λ_n(W)) ‖x^{k+1} − x^k‖² < ∞ when the r_i's are convex; or Σ_{k=0}^∞ α_k^{−1} λ_n(W) ‖x^{k+1} − x^k‖² < ∞ when the r_i's are not necessarily convex (this case requires λ_n(W) > 0);
c) if {ξ^k} satisfies ‖ξ^{k+1} − ξ^k‖ ≤ L_r ‖x^{k+1} − x^k‖ for each k > k_0, some constant L_r > 0, and a sufficiently large integer k_0 > 0, then lim_{k→∞} 1^T ( ∇f(x^{k+1}) + ξ^{k+1} ) = 0, where ξ^{k+1} ∈ ∂r(x^{k+1}) is the one determined by the proximal operator (7), and any limit point is a stationary point of problem (5);
d) in addition, if there exists an isolated accumulation point, then {x^k} converges;
e) furthermore, if the f_i's and r_i's are convex and λ_n(W) > 0, then the claims on the rates of {f̂_k} in Proposition 4 hold for the sequence {ŝ_k} defined in (20).

Theorem 4b) implies that the running best rate of {‖x^{k+1} − x^k‖²} is o(1/(1+k)^{1+ε}). The additional condition imposed on {ξ^k} in Theorem 4c) is a type of restricted continuity regularity of the subgradient ∂r with respect to the generated sequence, which may hold for a class of proximal functions, as studied in [47].
If ∂r is locally Lipschitz continuous in a neighborhood of a limit point, then such a condition can generally be satisfied, since {x^k} is asymptotically regular and thus x^k will lie in such a neighborhood of this limit point when k is sufficiently large. Theorem 4e) gives the convergence rates of Prox-DGD in the convex setting.

IV. RELATED WORKS AND DISCUSSIONS

We summarize some recent nonconvex decentralized algorithms in Table III. Most of them apply to either the smooth optimization problem (1) or the composite optimization problem (2) and use diminishing step sizes. Although (1) is a special case of (2) obtained by letting r_i(x) ≡ 0, there are still differences in both algorithm design and theoretical analysis. Therefore, we separate their comparisons.

We first discuss the algorithms for (1). In [40], the authors proved the convergence of perturbed push-sum for nonconvex (1) under some regularity assumptions. They also introduced random perturbations to avoid local minima. The network considered in [40] is time-varying and directed, and specific column stochastic matrices and diminishing step sizes are used. Their algorithm is an extension of the DGD with diminishing step sizes studied in this paper. The convergence results for the deterministic perturbed push-sum algorithm obtained in [40] are similar to those of DGD developed in this paper under similar assumptions (see Theorem 2 above and [40, Theorem 3]). However, in this paper, we obtain the asymptotic consensus and the convergence to a stationary point of DGD via a Lyapunov function and by developing several new results, such as a lemma on the convergence of the so-called weakly summable sequences. The proofs in [40] are mainly based on [30, Theorem .7.3]. In [3], a primal-dual approximate gradient algorithm called ZENITH was developed for (1). The convergence of ZENITH was given in expectation of the constraint violation, under the Lipschitz differentiability assumption and other assumptions.
Table III includes three algorithms for solving the composite problem (2) that are related to ours. All of them only deal with convex r_i, whereas r_i in this paper can also be nonconvex. In [4], the authors proposed NEXT based on the successive convex approximation (SCA) technique. Each NEXT iteration includes two stages: a local SCA stage to update the local variables and a consensus update stage to fuse information between agents. While NEXT has results similar to those of Prox-DGD using diminishing step sizes, Prox-DGD is simpler than NEXT. Another interesting algorithm is the decentralized Frank-Wolfe (DeFW) algorithm proposed in [43] for nonconvex, smooth, constrained decentralized optimization, where a bounded convex constraint set is imposed. There are three steps at each iteration of DeFW: average gradient computation, local variable evaluation by Frank-Wolfe, and information fusion between agents. In [43], the authors established convergence results similar to those of Prox-DGD under diminishing step sizes. A stochastic version of DeFW has also been developed in [9] for high-dimensional convex sparse optimization. The last one is the projected stochastic gradient algorithm (Proj SGD) [3] for constrained, nonconvex, smooth consensus optimization. It has two steps at each iteration: a projected stochastic gradient step to update the local variables and a consensus step to exchange information between local agents.

  The original form of this algorithm, push-sum, was proposed in [7] for the average consensus problem. It was modified and analyzed in [9] for the convex consensus optimization problem over time-varying directed graphs.

The mixing matrix used in this algorithm is
random and row stochastic, but its expectation is column stochastic. Asymptotic consensus and convergence to the set of Karush-Kuhn-Tucker points were proved under diminishing step sizes, a smooth objective function, certain mean and variance restrictions on the stochastic direction, and other assumptions on the mixing matrices and the constraint set.

Based on the above analysis, the convergence results of DGD and Prox-DGD with diminishing step sizes in this paper are comparable with most of the existing ones, which involve more complicated methods. However, we allow nonconvex nonsmooth r_i and are able to obtain estimates of the asymptotic consensus rates. We also establish global convergence using a fixed step size, which is otherwise only found for ZENITH.

V. PROOFS

In this section, we present the proofs of our main theorems and propositions.

A. Proof of Theorem 1

The sketch of the proof is as follows: DGD is interpreted as the gradient descent algorithm applied to the Lyapunov function L_α, following the argument in [46]; then, the properties of sufficient descent, lower boundedness, and bounded gradients are established for the sequence {L_α(x^k)}, giving subsequence convergence of the DGD iterates; finally, whole sequence convergence of the DGD iterates follows from the KŁ property of L_α.

Lemma 1 (Gradient descent interpretation). The sequence {x^k} generated by the DGD iteration (4) is the same sequence generated by applying gradient descent with the fixed step size α to the objective function L_α(x).

A proof of this lemma is given in [46]; it is based on reformulating (4) as the iteration

  x^{k+1} = x^k − α ( ∇f(x^k) + (1/α)(I − W) x^k ) = x^k − α ∇L_α(x^k).   (21)

Although the sequence {x^k} generated by the DGD iteration (4) can be interpreted as a centralized gradient descent sequence for the function L_α(x), it is different from gradient descent applied to the original problem (3).

Lemma 2 (Sufficient descent of {L_α(x^k)}). Let Assumptions 1 and 2 hold, and set the step size 0 < α < (1+λ_n(W))/L_f. It holds that

  L_α(x^{k+1}) ≤ L_α(x^k) − ( (1+λ_n(W))/(2α) − L_f/2 ) ‖x^{k+1} − x^k‖², ∀k ∈ N.   (22)

Proof.
From $x^{k+1} = x^k - \alpha \nabla L_\alpha(x^k)$, it follows that
$$\langle \nabla L_\alpha(x^k), x^{k+1} - x^k \rangle = -\frac{1}{\alpha}\|x^{k+1} - x^k\|^2.$$
Since $\sum_{i=1}^n f_i(x_i)$ is $L_f$-Lipschitz differentiable, $\nabla L_\alpha$ is Lipschitz with the constant $L := L_f + \alpha^{-1}\lambda_{\max}(I - W) = L_f + \alpha^{-1}(1 - \lambda_n(W))$, implying
$$L_\alpha(x^{k+1}) \le L_\alpha(x^k) + \langle \nabla L_\alpha(x^k), x^{k+1} - x^k \rangle + \frac{L}{2}\|x^{k+1} - x^k\|^2.$$
Combining the two displays yields the claim.

Lemma 3 (Boundedness). Under Assumptions 1 and 2, if $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$, then the sequence $\{L_\alpha(x^k)\}$ is lower bounded, and the sequence $\{x^k\}$ is bounded; i.e., there exists a constant $B > 0$ such that $\|x^k\| < B$ for all $k$.

Proof. The lower boundedness of $\{L_\alpha(x^k)\}$ is due to the lower boundedness of each $f_i$, as each is proper and coercive (Assumption 2). By Lemma 2 and the choice of $\alpha$, $L_\alpha(x^k)$ is nonincreasing and upper bounded by $L_\alpha(x^0) < +\infty$. Hence $\mathbf{1}^T \mathbf{f}(x^k) \le L_\alpha(x^k) \le L_\alpha(x^0)$ implies that $x^k$ is bounded, due to the coercivity of $\mathbf{1}^T \mathbf{f}(x)$ (Assumption 2).

From Lemmas 2 and 3, we immediately obtain the following lemma.

Lemma 4 ($\ell_2$-summability and asymptotic regularity). It holds that $\sum_{k=0}^\infty \|x^{k+1} - x^k\|^2 < +\infty$ and that $\|x^{k+1} - x^k\| \to 0$ as $k \to \infty$. (A sequence $\{a^k\}$ is said to be asymptotically regular if $\|a^{k+1} - a^k\| \to 0$ as $k \to \infty$.)

From the gradient descent interpretation, the result below directly follows.

Lemma 5 (Gradient bound). $\|\nabla L_\alpha(x^k)\| = \frac{1}{\alpha}\|x^{k+1} - x^k\|$.

Based on the above lemmas, we obtain the global convergence of DGD.

Proof of Theorem 1. By Lemma 3, the sequence $\{x^k\}$ is bounded, so there exist a convergent subsequence and a limit point, denoted by $\{x^{k_s}\}_{s \in \mathbb{N}} \to x^*$ as $s \to +\infty$. By Lemmas 2 and 3, $L_\alpha(x^k)$ is monotonically nonincreasing and lower bounded, and therefore $\|x^{k+1} - x^k\| \to 0$ as $k \to \infty$. Based on Lemma 5, $\|\nabla L_\alpha(x^k)\| \to 0$ as $k \to \infty$; in particular, $\|\nabla L_\alpha(x^{k_s})\| \to 0$ as $s \to \infty$. Hence, $\nabla L_\alpha(x^*) = 0$. The running best rate of the sequence $\{\|x^{k+1} - x^k\|^2\}$ follows from [10] or [18]. By Lemma 5, the running best rate of the sequence $\{\|\nabla L_\alpha(x^k)\|^2\}$ is $o(1/k)$. Similar to [2, Theorem 2.9], we can claim the global convergence of the sequence $\{x^k\}_{k \in \mathbb{N}}$ under the KŁ assumption on $L_\alpha$.

Next, we derive a bound on the gradient sequence $\{\nabla f(x^k)\}$, which is used in Proposition 1.

Lemma 6. Under Assumption 2, there exists a point $y^*$ satisfying $\nabla f(y^*) = 0$, and the following bound holds:
$$\|\nabla f(x^k)\| \le L_f(B + \|y^*\|), \quad k \in \mathbb{N},$$
where $B$ is the bound on $\|x^k\|$ given in Lemma 3.
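As an aside, the gradient descent interpretation of Lemma 1 is easy to verify numerically. The sketch below is illustrative and not from the paper: the three-agent path graph, the quadratic local objectives $f_i(x) = \frac{a_i}{2}(x - c_i)^2$, and the mixing weights are all assumed for the example.

```python
import numpy as np

# Hypothetical example: n = 3 agents on a path graph, p = 1,
# with quadratic objectives f_i(x) = 0.5 * a_i * (x - c_i)^2.
a = np.array([1.0, 2.0, 0.5])
c = np.array([0.0, 1.0, -1.0])

# A symmetric doubly stochastic mixing matrix for the path 1-2-3.
W = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])

grad_f = lambda x: a * (x - c)   # stacked gradient of f(x) = sum_i f_i(x_i)
alpha = 0.1

def dgd_step(x):
    # DGD update: x^{k+1} = W x^k - alpha * grad f(x^k)
    return W @ x - alpha * grad_f(x)

def grad_descent_step(x):
    # gradient step on L_alpha(x) = f(x) + (1/(2*alpha)) <x, (I - W) x>
    grad_L = grad_f(x) + (np.eye(3) - W) @ x / alpha
    return x - alpha * grad_L

x0 = np.array([3.0, -2.0, 0.7])
print(np.allclose(dgd_step(x0), grad_descent_step(x0)))  # prints True
```

The two updates coincide identically, since $x - \alpha\,\nabla L_\alpha(x) = x - \alpha\nabla f(x) - (I-W)x = Wx - \alpha\nabla f(x)$.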
Proof. By the lower boundedness assumption (Assumption 2), a minimizer of $\mathbf{1}^T\mathbf{f}(y)$ exists; let $y^*$ be one. Then, by the Lipschitz differentiability of each $f_i$ (Assumption 2), we have $\nabla f(y^*) = 0$. Hence, for any $k$,
$$\|\nabla f(x^k)\| = \|\nabla f(x^k) - \nabla f(y^*)\| \le L_f\|x^k - y^*\| \le L_f(B + \|y^*\|),$$
where the last inequality uses Lemma 3. This proves the lemma.

B. Proof of Proposition 2

Proof. Note that
$$\|\nabla L_\alpha(x^{k+1})\| \le \|\nabla L_\alpha(x^{k+1}) - \nabla L_\alpha(x^k)\| + \|\nabla L_\alpha(x^k)\| \le L\|x^{k+1} - x^k\| + \frac{1}{\alpha}\|x^{k+1} - x^k\| = \Big(L_f + \frac{2 - \lambda_n(W)}{\alpha}\Big)\|x^{k+1} - x^k\|,$$
where the second inequality holds by Lemma 5 and the Lipschitz continuity of $\nabla L_\alpha$ with constant $L = L_f + \alpha^{-1}(1 - \lambda_n(W))$. Thus, $\{x^k\}$ satisfies the so-called relative error condition listed in [2]. Moreover, by Lemmas 2 and 3, $\{x^k\}$ also satisfies the so-called sufficient decrease and continuity conditions listed in [2]. Under these three conditions and the KŁ property of $L_\alpha$ at $x^*$ with desingularizing function $\psi(s) = cs^{1-\theta}$, following the proof of [2, Lemma 2.6], there exists $k_0 > 0$ such that for all $k \ge k_0$, we have
$$\|x^{k+1} - x^k\| \le \frac{1}{2}\|x^k - x^{k-1}\| + \frac{cb}{a}\Big[\big(L_\alpha(x^k) - L_\alpha(x^*)\big)^{1-\theta} - \big(L_\alpha(x^{k+1}) - L_\alpha(x^*)\big)^{1-\theta}\Big],$$
where $a := \frac{1+\lambda_n(W)}{2\alpha} - \frac{L_f}{2}$ and $b := L_f + \frac{2-\lambda_n(W)}{\alpha}$. Then an easy induction yields
$$\sum_{t=k_0}^{k}\|x^{t+1} - x^t\| \le \|x^{k_0} - x^{k_0-1}\| + \frac{2cb}{a}\Big[\big(L_\alpha(x^{k_0}) - L_\alpha(x^*)\big)^{1-\theta} - \big(L_\alpha(x^{k+1}) - L_\alpha(x^*)\big)^{1-\theta}\Big].$$
Following a derivation similar to the proof of [1, Theorem 5], we can estimate the rate of convergence of $\{x^k\}$ in the different cases of $\theta$.

C. Proof of Proposition 3

In order to prove Proposition 3, we also need the following lemmas.

Lemma 7 ([8]). Let $W^k := \underbrace{W \cdots W}_{k}$ be the $k$-th power of $W$ for any $k \in \mathbb{N}$. Under Assumption 1, it holds that
$$\Big\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big\| \le C\zeta^k$$
for some constant $C > 0$, where $\zeta$ is the second largest magnitude eigenvalue of $W$, as specified in (9).

Lemma 8 ([33]). Let $\{\gamma_k\}$ be a scalar sequence. If $\lim_k \gamma_k = \gamma$ and $0 < \beta < 1$, then $\lim_k \sum_{l=0}^{k}\beta^{k-l}\gamma_l = \frac{\gamma}{1-\beta}$.

Proof of Proposition 3. Unrolling the DGD iteration $x^{k+1} = Wx^k - \alpha_k\nabla f(x^k)$ and writing $\bar{x}^k := \frac{1}{n}\mathbf{1}\mathbf{1}^T x^k$, we note that
$$x^k - \bar{x}^k = \Big(W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big)x^0 - \sum_{j=0}^{k-1}\alpha_j\Big(W^{k-1-j} - \frac{1}{n}\mathbf{1}\mathbf{1}^T\Big)\nabla f(x^j).$$
Further, by Lemma 7 and Assumption 3, we obtain
$$\|x^k - \bar{x}^k\| \le C\|x^0\|\zeta^k + CB\sum_{j=0}^{k-1}\alpha_j\zeta^{k-1-j}.$$
Furthermore, by Lemma 8 and the step sizes $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$, we get $\lim_k \|x^k - \bar{x}^k\| = 0$. Let $b_k := (1+k)^\epsilon$. To show the rate of $\|x^k - \bar{x}^k\|$, we only need to show that $\limsup_k b_k\|x^k - \bar{x}^k\| \le C_1$ for some $0 < C_1 < \infty$.
Let $j_k := \big[\frac{k}{2} + \log_\zeta b_k\big]$, where $[x]$ denotes the integer part of $x$ for any $x \in \mathbb{R}$. Note that
$$b_k\|x^k - \bar{x}^k\| \le \underbrace{C\|x^0\|b_k\zeta^k}_{T_1} + \underbrace{CB\,b_k\sum_{j=j_k+1}^{k-1}\alpha_j\zeta^{k-1-j}}_{T_2} + \underbrace{CB\,b_k\sum_{j=0}^{j_k}\alpha_j\zeta^{k-1-j}}_{T_3},$$
where the inequality holds because of the consensus bound derived above. In the following, we estimate these three terms, respectively. First, since $b_k = (1+k)^\epsilon$ grows polynomially while $\zeta^k$ decays geometrically, $T_1 \to 0$ as $k \to \infty$. Second, by the definition of $j_k$, for any $j \le j_k$ we have
$$b_k\zeta^{k-1-j} \le b_k\zeta^{\frac{k}{2} - 1 - \log_\zeta b_k} = \zeta^{\frac{k}{2}-1},$$
so that
$$T_3 \le CB\,\zeta^{\frac{k}{2}-1}\sum_{j=0}^{j_k}\alpha_j \to 0 \quad \text{as } k \to \infty,$$
since $\sum_{j=0}^{j_k}\alpha_j$ grows at most polynomially in $k$. Third, for $j > j_k$,
$$b_k\alpha_j = \alpha\Big(\frac{1+k}{1+j}\Big)^\epsilon \le \alpha\Big(\frac{1+k}{1+j_k}\Big)^\epsilon \le 3^\epsilon\alpha$$
for all sufficiently large $k$ (because $j_k \ge \frac{k}{2} - O(\log k)$), and therefore
$$T_2 \le CB\,3^\epsilon\alpha\sum_{j=j_k+1}^{k-1}\zeta^{k-1-j} \le \frac{CB\,3^\epsilon\alpha}{1-\zeta}.$$
Combining the three estimates, there exists a constant $C_1 > 0$ such that $\limsup_k b_k\|x^k - \bar{x}^k\| \le C_1$. We have completed the proof of this proposition.

D. Proof of Theorem 2

To prove Theorem 2, we first note that, similar to the fixed-step case, the DGD iterates under decreasing step sizes can be rewritten as
$$x^{k+1} = x^k - \alpha_k\nabla L_{\alpha_k}(x^k), \quad \text{where } L_{\alpha_k}(x) := \mathbf{1}^T\mathbf{f}(x) + \frac{1}{2\alpha_k}\|x\|^2_{I-W},$$
and we also need the following lemmas.

Lemma 9 ([34]). Let $\{v_t\}$ be a nonnegative scalar sequence such that $v_{t+1} \le (1 + b_t)v_t - u_t + c_t$ for all $t \in \mathbb{N}$, where $b_t \ge 0$, $u_t \ge 0$ and $c_t \ge 0$ with $\sum_{t=0}^\infty b_t < \infty$ and $\sum_{t=0}^\infty c_t < \infty$. Then the sequence $\{v_t\}$ converges to some $v \ge 0$, and $\sum_{t=0}^\infty u_t < \infty$.

Lemma 10. Let $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$. Then it holds that
$$\frac{1}{\alpha_{k+1}} - \frac{1}{\alpha_k} \le \frac{\epsilon}{\alpha}(1+k)^{\epsilon-1}.$$

Proof. We first prove that
$$(1+x)^\epsilon \le 1 + \epsilon x, \quad x \in [0,1].$$
Let $g(x) = (1+x)^\epsilon - 1 - \epsilon x$. Its derivative $g'(x) = \epsilon(1+x)^{\epsilon-1} - \epsilon \le 0$ for $x \in [0,1]$, which implies $g(x) \le g(0) = 0$ for any $x \in [0,1]$; that is, the inequality holds. Note that
$$\frac{1}{\alpha_{k+1}} - \frac{1}{\alpha_k} = \frac{(2+k)^\epsilon - (1+k)^\epsilon}{\alpha} = \frac{(1+k)^\epsilon}{\alpha}\Big[\Big(1 + \frac{1}{1+k}\Big)^\epsilon - 1\Big] \le \frac{\epsilon}{\alpha}(1+k)^{\epsilon-1},$$
where the last inequality applies the bound above with $x = \frac{1}{1+k}$.

The term $\big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|^2_{I-W}$ appears on the right-hand side of the descent inequality developed in the proof of Theorem 2 below. In order to apply Lemma 9 and then show the convergence of $\{L_{\alpha_k}(x^k)\}$, we need the following lemma to guarantee that this term is summable.

Lemma 11. Let Assumptions 1, 2, and 3 hold. In DGD, use the step sizes $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$. Then the sequence $\big\{\big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|^2_{I-W}\big\}$ is summable, i.e., $\sum_{k=0}^\infty\big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\big)\|x^{k+1}\|^2_{I-W} < \infty$.

Proof. Since $(I-W)\bar{x}^{k+1} = 0$, we have
$$\|x^{k+1}\|^2_{I-W} = \|x^{k+1} - \bar{x}^{k+1}\|^2_{I-W} \le (1-\lambda_n(W))\|x^{k+1} - \bar{x}^{k+1}\|^2.$$
By Lemma 10,
$$\Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W} \le \frac{\epsilon(1+k)^{\epsilon-1}}{2\alpha}(1-\lambda_n(W))\|x^{k+1} - \bar{x}^{k+1}\|^2.$$
Furthermore, by Proposition 3, $\|x^{k+1} - \bar{x}^{k+1}\|^2 = O(1/(1+k)^{2\epsilon})$, so the sequence above converges to 0 at the rate $O(1/(1+k)^{1+\epsilon})$, which implies that it is summable.

Lemma 12 (Convergence of weakly summable sequences).
Let $\{\beta_k\}$ and $\{\gamma_k\}$ be two nonnegative scalar sequences such that (a) $\gamma_k = \frac{1}{(1+k)^\epsilon}$ for some $\epsilon \in (0,1]$ and all $k \in \mathbb{N}$; (b) $\sum_{k=0}^\infty \gamma_k\beta_k < \infty$; and (c) $|\beta_{k+1} - \beta_k| \lesssim \gamma_k$, where $\lesssim$ means that $|\beta_{k+1} - \beta_k| \le M\gamma_k$ for some constant $M > 0$. Then $\lim_k \beta_k = 0$.

We call a sequence $\{\beta_k\}$ satisfying conditions (a) and (b) of Lemma 12 a weakly summable sequence, since the sequence itself is not necessarily summable but becomes summable after multiplication by the non-summable, diminishing sequence $\{\gamma_k\}$. In general, it is impossible to claim that $\beta_k$ converges to 0. However, if the distance between two successive elements of $\{\beta_k\}$ is of the same order as the multiplier sequence $\gamma_k$, then we can claim the convergence of $\beta_k$. A special case with $\epsilon = 1/2$ has been observed in [9].

Proof. By condition (b), the tail sums satisfy $\sum_{i=k}^{\infty}\gamma_i\beta_i \to 0$ as $k \to \infty$; in particular, $\sum_{i=k}^{k+k'}\gamma_i\beta_i \to 0$ for any choice of $k' = k'(k) \in \mathbb{N}$. In the following, we show $\lim_k \beta_k = 0$ by contradiction. Assume this is not the case, i.e., $\beta_k \not\to 0$ as $k \to \infty$,
then $\limsup_k \beta_k \ge C > 0$. Thus, for every $N > 0$, there exists a $k > N$ such that $\beta_k > C$. Let
$$k' := \Big[\frac{C(1+k)^\epsilon}{4M}\Big],$$
where $[x]$ denotes the integer part of $x$ for any $x \in \mathbb{R}$. By condition (c), i.e., $|\beta_{j+1} - \beta_j| \le M\gamma_j$ for any $j \in \mathbb{N}$, we have for each $i \in \{0, 1, \ldots, k'\}$
$$\beta_{k+i} \ge \beta_k - \sum_{j=k}^{k+i-1}M\gamma_j \ge C - k'\cdot\frac{M}{(1+k)^\epsilon} \ge C - \frac{C}{4} = \frac{3C}{4}.$$
Hence,
$$\sum_{j=k}^{k+k'}\gamma_j\beta_j \ge \frac{3C}{4}\sum_{j=k}^{k+k'}\frac{1}{(1+j)^\epsilon} \ge \begin{cases} \dfrac{3C}{4}\cdot\dfrac{(2+k+k')^{1-\epsilon} - (2+k)^{1-\epsilon}}{1-\epsilon}, & \epsilon \in (0,1),\\[2mm] \dfrac{3C}{4}\big(\ln(2+k+k') - \ln(2+k)\big), & \epsilon = 1. \end{cases}$$
When $\epsilon = 1$, $k' \approx \frac{C(1+k)}{4M}$, so $\ln\frac{2+k+k'}{2+k} \to \ln\big(1 + \frac{C}{4M}\big) > 0$; when $\epsilon \in (0,1)$, the mean value theorem and the specific form of $k'$ show that the right-hand side is likewise bounded below by a positive constant for all large $k$. As a consequence, $\sum_{j=k}^{k+k'}\gamma_j\beta_j$ does not go to 0 as $k \to \infty$, which contradicts the tail convergence at the beginning of the proof. Therefore, $\lim_k \beta_k = 0$.

Proof of Theorem 2. We first develop the inequality
$$L_{\alpha_{k+1}}(x^{k+1}) \le L_{\alpha_k}(x^k) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W} - \Big(\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2 \quad (*)$$
and then claim the convergence of the sequences $\{L_{\alpha_k}(x^k)\}$, $\{\mathbf{1}^T\mathbf{f}(x^k)\}$ and $\{x^k\}$ based on this inequality.

(a) Development of $(*)$: From $x^{k+1} = x^k - \alpha_k\nabla L_{\alpha_k}(x^k)$, it follows that
$$\langle \nabla L_{\alpha_k}(x^k), x^{k+1} - x^k\rangle = -\frac{1}{\alpha_k}\|x^{k+1} - x^k\|^2.$$
Since $\sum_{i=1}^n f_i(x_i)$ is $L_f$-Lipschitz differentiable, $\nabla L_{\alpha_k}$ is Lipschitz with the constant $L_k := L_f + \alpha_k^{-1}\lambda_{\max}(I-W) = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, implying
$$L_{\alpha_k}(x^{k+1}) \le L_{\alpha_k}(x^k) + \langle \nabla L_{\alpha_k}(x^k), x^{k+1} - x^k\rangle + \frac{L_k}{2}\|x^{k+1} - x^k\|^2 = L_{\alpha_k}(x^k) - \Big(\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Moreover,
$$L_{\alpha_{k+1}}(x^{k+1}) = L_{\alpha_k}(x^{k+1}) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W}.$$
Combining the two relations yields $(*)$.

(b) Convergence of the objective sequence: By Lemma 11 and Lemma 9, $(*)$ yields the convergence of $\{L_{\alpha_k}(x^k)\}$ and
$$\sum_{k=0}^\infty\Big(\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2 < \infty,$$
which implies that $\|x^{k+1} - x^k\|$ converges to 0 and $\{x^k\}$ is asymptotically regular. Moreover, notice that
$$\frac{1}{2\alpha_k}\|x^k\|^2_{I-W} = \frac{1}{2\alpha_k}\|x^k - \bar{x}^k\|^2_{I-W} \le \frac{(1-\lambda_n(W))(1+k)^\epsilon}{2\alpha}\|x^k - \bar{x}^k\|^2.$$
By Proposition 3, this term converges to 0 as $k \to \infty$. As a consequence,
$$\lim_{k\to\infty}\mathbf{1}^T\mathbf{f}(x^k) = \lim_{k\to\infty}\Big(L_{\alpha_k}(x^k) - \frac{1}{2\alpha_k}\|x^k\|^2_{I-W}\Big) = \lim_{k\to\infty}L_{\alpha_k}(x^k).$$

(c) Convergence to a stationary point: Let $\overline{\nabla f}(x^k) := \frac{1}{n}\mathbf{1}\mathbf{1}^T\nabla f(x^k)$. By the specific form of $\alpha_k$, we have
$$\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2} \ge \frac{1+\lambda_n(W)}{4\alpha_k}$$
for all $k > k_0$, where $k_0$ is the integer part of $\big(\frac{2\alpha L_f}{1+\lambda_n(W)}\big)^{1/\epsilon}$. Note also that
$$\bar{x}^{k+1} - \bar{x}^k = \frac{1}{n}\mathbf{1}\mathbf{1}^T(x^{k+1} - x^k), \quad \text{so } \|\bar{x}^{k+1} - \bar{x}^k\| \le \|x^{k+1} - x^k\|.$$
Thus, the summability bound in (b) yields
$$\sum_{k=0}^\infty\frac{1}{\alpha_k}\|\bar{x}^{k+1} - \bar{x}^k\|^2 < \infty.$$
By the DGD iteration and the double stochasticity of $W$, we have $\bar{x}^{k+1} - \bar{x}^k = -\alpha_k\,\overline{\nabla f}(x^k)$.
Plugging this identity into the summability bound above yields
$$\sum_{k=0}^\infty \alpha_k\|\overline{\nabla f}(x^k)\|^2 < \infty.$$
Moreover,
$$\Big|\|\overline{\nabla f}(x^{k+1})\|^2 - \|\overline{\nabla f}(x^k)\|^2\Big| \le \big(\|\overline{\nabla f}(x^{k+1})\| + \|\overline{\nabla f}(x^k)\|\big)\Big|\|\overline{\nabla f}(x^{k+1})\| - \|\overline{\nabla f}(x^k)\|\Big| \le 2B\big\|\overline{\nabla f}(x^{k+1}) - \overline{\nabla f}(x^k)\big\| \le 2B\big\|\nabla f(x^{k+1}) - \nabla f(x^k)\big\| \le 2BL_f\|x^{k+1} - x^k\|,$$
where the second inequality holds by the bounded gradient assumption (Assumption 3), the third inequality holds by the
specific form of $\overline{\nabla f}$, and the last inequality holds by the Lipschitz continuity of $\nabla f$. Note that
$$\|x^{k+1} - x^k\| \le \|x^{k+1} - \bar{x}^{k+1}\| + \|\bar{x}^{k+1} - \bar{x}^k\| + \|\bar{x}^k - x^k\| \le \|x^{k+1} - \bar{x}^{k+1}\| + \|\bar{x}^k - x^k\| + \alpha_k\|\overline{\nabla f}(x^k)\| \lesssim \alpha_k,$$
where the first inequality is the triangle inequality together with the identity $\bar{x}^{k+1} - \bar{x}^k = -\alpha_k\overline{\nabla f}(x^k)$, and the last relation holds by Proposition 3 and the bounded gradient assumption. Thus, the two displays above imply
$$\Big|\|\overline{\nabla f}(x^{k+1})\|^2 - \|\overline{\nabla f}(x^k)\|^2\Big| \lesssim \alpha_k.$$
Applying Lemma 12 with $\beta_k := \|\overline{\nabla f}(x^k)\|^2$ and $\gamma_k := \frac{1}{(1+k)^\epsilon}$, which is justified by the specific form of $\alpha_k$ and the two bounds above, it holds that
$$\lim_{k\to\infty}\|\overline{\nabla f}(x^k)\| = 0.$$
As a consequence, $\lim_{k\to\infty}\mathbf{1}^T\nabla f(x^k) = 0$. Furthermore, by the coercivity of each $f_i$ and the convergence of $\{\mathbf{1}^T\mathbf{f}(x^k)\}$, $\{x^k\}$ is bounded. Therefore, there exists a convergent subsequence of $\{x^k\}$. Let $x^*$ be any limit point of $\{x^k\}$. By the display above and the continuity of $\nabla f$, it holds that $\mathbf{1}^T\nabla f(x^*) = 0$. Moreover, by Proposition 3, $x^*$ is consensual. As a consequence, $x^*$ is a stationary point of the original problem. In addition, if $x^*$ is isolated, then by the asymptotic regularity of $\{x^k\}$ (established in (b)), $\{x^k\}$ converges to $x^*$.

E. Proof of Proposition 4

To prove Proposition 4, we need the following lemmas.

Lemma 13 (Accumulated consensus of iterates). Under the conditions of Proposition 3, we have
$$\sum_{k=0}^{K}\alpha_k\|x^{k+1} - \bar{x}^{k+1}\| \le D_1 + D_2\sum_{k=0}^{K}\alpha_k^2,$$
where $D_1 = \frac{C\|x^0\|\zeta^2}{2(1-\zeta^2)}$, $D_2 = \frac{C\|x^0\|}{2} + \frac{CB}{1-\zeta}$, and $B$ is specified in Assumption 3.

Proof. By the consensus bound established in the proof of Proposition 3,
$$\sum_{k=0}^{K}\alpha_k\|x^{k+1} - \bar{x}^{k+1}\| \le C\|x^0\|\sum_{k=0}^{K}\alpha_k\zeta^{k+1} + CB\sum_{k=0}^{K}\alpha_k\sum_{j=0}^{k}\alpha_j\zeta^{k-j}.$$
In the following, we estimate the two terms on the right-hand side, respectively. By Young's inequality $ab \le \frac{a^2+b^2}{2}$,
$$\sum_{k=0}^{K}\alpha_k\zeta^{k+1} \le \frac{1}{2}\sum_{k=0}^{K}\alpha_k^2 + \frac{1}{2}\sum_{k=0}^{K}\zeta^{2(k+1)} \le \frac{1}{2}\sum_{k=0}^{K}\alpha_k^2 + \frac{\zeta^2}{2(1-\zeta^2)},$$
and
$$\sum_{k=0}^{K}\alpha_k\sum_{j=0}^{k}\alpha_j\zeta^{k-j} \le \frac{1}{2}\sum_{k=0}^{K}\sum_{j=0}^{k}\zeta^{k-j}\big(\alpha_k^2 + \alpha_j^2\big) \le \frac{1}{1-\zeta}\sum_{k=0}^{K}\alpha_k^2.$$
Plugging the two estimates into the first display yields the claim.

Besides Lemma 13, we also need the following two lemmas, which have appeared in the literature (cf. [8]).

Lemma 14 ([8]). Let $\gamma_k = \frac{1}{(1+k)^\epsilon}$ for some $0 < \epsilon \le 1$. Then the following hold:
(a) if $0 < \epsilon < 1/2$: $\sum_{k=0}^{K}\gamma_k = \Theta(K^{1-\epsilon})$, $\sum_{k=0}^{K}\gamma_k^2 = \Theta(K^{1-2\epsilon})$, and $\frac{\sum_{k=0}^{K}\gamma_k^2}{\sum_{k=0}^{K}\gamma_k} = O\big(\frac{1}{K^{\epsilon}}\big)$;
(b) if $\epsilon = 1/2$: $\sum_{k=0}^{K}\gamma_k = \Theta(\sqrt{K})$, $\sum_{k=0}^{K}\gamma_k^2 = \Theta(\ln K)$, and the ratio is $O\big(\frac{\ln K}{\sqrt{K}}\big)$;
(c) if $1/2 < \epsilon < 1$: $\sum_{k=0}^{K}\gamma_k = \Theta(K^{1-\epsilon})$, $\sum_{k=0}^{K}\gamma_k^2 = O(1)$, and the ratio is $O\big(\frac{1}{K^{1-\epsilon}}\big)$;
(d) if $\epsilon = 1$: $\sum_{k=0}^{K}\gamma_k = \Theta(\ln K)$, $\sum_{k=0}^{K}\gamma_k^2 = O(1)$, and the ratio is $O\big(\frac{1}{\ln K}\big)$.

Lemma 15 ([8, Proposition 3]). Let $h: \mathbb{R}^d \to \mathbb{R}$ be a convex, continuously differentiable function whose gradient is Lipschitz continuous with constant $L_h$.
Then, for any $x, y, u \in \mathbb{R}^d$,
$$h(u) \ge h(x) + \langle \nabla h(y), u - x\rangle - \frac{L_h}{2}\|x - y\|^2.$$

Proof of Proposition 4. To prove this proposition, we first develop the basic inequality
$$L_{\alpha_k}(x^{k+1}) - L_{\alpha_k}(u) \le \frac{1}{2\alpha_k}\big(\|x^k - u\|^2 - \|x^{k+1} - u\|^2\big) \quad \text{for any } u \in \mathbb{R}^{n\times p}.$$
By Lemma 15 (applied to $L_{\alpha_k}$ with $x \to x^{k+1}$ and $y \to x^k$), we have
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \langle \nabla L_{\alpha_k}(x^k), u - x^{k+1}\rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2,$$
where $L_k = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, and by the gradient descent interpretation, we have $\nabla L_{\alpha_k}(x^k) = \frac{1}{\alpha_k}(x^k - x^{k+1})$. Then the inequality above implies
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2.$$
Note that by the specific form $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$, there exists an integer $k_0 > 0$ such that $L_k\alpha_k \le 1$ for all $k > k_0$. Actually, for the simplicity of the proof, we can take $\alpha < \frac{\lambda_n(W)}{L_f}$ so that $L_k\alpha_k \le 1$ holds from the initial step. Thus, the inequality above implies
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{1}{2\alpha_k}\|x^{k+1} - x^k\|^2.$$
Recall that for any vectors $a$, $b$ and $c$, it holds that $2\langle a - b, c - b\rangle = \|a - b\|^2 + \|c - b\|^2 - \|a - c\|^2$. Applying this identity with $a = x^k$, $b = x^{k+1}$ and $c = u$ gives
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \frac{1}{2\alpha_k}\big(\|u - x^{k+1}\|^2 - \|u - x^k\|^2\big).$$
As a consequence, we get the basic inequality.

Note that the optimal solution $x^{\mathrm{opt}}$ is consensual and thus $\|x^{\mathrm{opt}}\|^2_{I-W} = 0$; therefore, $L_{\alpha_k}(x^{\mathrm{opt}}) = \mathbf{1}^T\mathbf{f}(x^{\mathrm{opt}}) = f^{\mathrm{opt}}$. By the basic inequality with $u = x^{\mathrm{opt}}$, we have
$$\alpha_k\big(L_{\alpha_k}(x^{k+1}) - f^{\mathrm{opt}}\big) \le \frac{1}{2}\big(\|x^k - x^{\mathrm{opt}}\|^2 - \|x^{k+1} - x^{\mathrm{opt}}\|^2\big).$$
Summing the above inequality over $k = 0, 1, \ldots, K$ yields
$$\sum_{k=0}^{K}\alpha_k\big(L_{\alpha_k}(x^{k+1}) - f^{\mathrm{opt}}\big) \le \frac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2.$$
Moreover, noting that $L_{\alpha_k}(\bar{x}^{k+1}) = \mathbf{1}^T\mathbf{f}(\bar{x}^{k+1})$ (consensual points incur no penalty) and using the convexity of $L_{\alpha_k}$,
$$L_{\alpha_k}(x^{k+1}) \ge L_{\alpha_k}(\bar{x}^{k+1}) + \langle \nabla L_{\alpha_k}(\bar{x}^{k+1}), x^{k+1} - \bar{x}^{k+1}\rangle \ge \mathbf{1}^T\mathbf{f}(\bar{x}^{k+1}) - B\|x^{k+1} - \bar{x}^{k+1}\|,$$
where the second inequality holds by the bounded gradient assumption (Assumption 3), since $\nabla L_{\alpha_k}(\bar{x}^{k+1}) = \nabla f(\bar{x}^{k+1})$. Plugging this into the summed inequality yields
$$\sum_{k=0}^{K}\alpha_k\big(\mathbf{1}^T\mathbf{f}(\bar{x}^{k+1}) - f^{\mathrm{opt}}\big) \le \frac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + B\sum_{k=0}^{K}\alpha_k\|x^{k+1} - \bar{x}^{k+1}\|.$$
By the definition of $\hat{f}^K$, the above then implies
$$\hat{f}^K - f^{\mathrm{opt}} \le \frac{\frac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + B\sum_{k=0}^{K}\alpha_k\|x^{k+1} - \bar{x}^{k+1}\|}{\sum_{k=0}^{K}\alpha_k} \le \frac{D_3 + D_4\sum_{k=0}^{K}\alpha_k^2}{\sum_{k=0}^{K}\alpha_k},$$
where $D_3 = \frac{1}{2}\|x^0 - x^{\mathrm{opt}}\|^2 + BD_1$ and $D_4 = BD_2$, with $D_1$ and $D_2$ specified in Lemma 13; the second inequality holds by Lemma 13. Furthermore, by Lemma 14, we get the claims of this proposition.

F. Proofs of Theorem 3 and Proposition 5

In order to prove Theorem 3, we need the following lemmas.

Lemma 16 (Sufficient descent of $\{\hat{L}_\alpha(x^k)\}$). Let Assumptions 1 and 4 hold. Results are given in two cases below.
Case 1: the $r_i$'s are convex. Set $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$. Then for all $k \in \mathbb{N}$,
$$\hat{L}_\alpha(x^{k+1}) \le \hat{L}_\alpha(x^k) - \Big(\frac{1+\lambda_n(W)}{2\alpha} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Case 2: the $r_i$'s are not necessarily convex; in this case, we assume $\lambda_n(W) > 0$. Set $0 < \alpha < \frac{\lambda_n(W)}{L_f}$. Then for all $k \in \mathbb{N}$,
$$\hat{L}_\alpha(x^{k+1}) \le \hat{L}_\alpha(x^k) - \Big(\frac{\lambda_n(W)}{2\alpha} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$

Proof. Recall from Lemma 2 that $\nabla L_\alpha$ is $L$-Lipschitz continuous with $L = L_f + \alpha^{-1}(1-\lambda_n(W))$, and thus
$$\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) = L_\alpha(x^{k+1}) - L_\alpha(x^k) + r(x^{k+1}) - r(x^k) \le \langle \nabla L_\alpha(x^k), x^{k+1} - x^k\rangle + \frac{L}{2}\|x^{k+1} - x^k\|^2 + r(x^{k+1}) - r(x^k).$$
Case 1: From the convexity of $r$ and the optimality condition of the proximal step, it follows that
$$0 = \xi^{k+1} + \frac{1}{\alpha}(x^{k+1} - x^k) + \nabla L_\alpha(x^k), \quad \xi^{k+1} \in \partial r(x^{k+1}).$$
This and the convexity of $r$ further give us
$$r(x^{k+1}) - r(x^k) \le \langle \xi^{k+1}, x^{k+1} - x^k\rangle = -\frac{1}{\alpha}\|x^{k+1} - x^k\|^2 - \langle \nabla L_\alpha(x^k), x^{k+1} - x^k\rangle.$$
Substituting this inequality into the descent estimate above and then expanding $L = L_f + \alpha^{-1}(1-\lambda_n(W))$ yield
$$\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) \le -\Big(\frac{1}{\alpha} - \frac{L}{2}\Big)\|x^{k+1} - x^k\|^2 = -\Big(\frac{1+\lambda_n(W)}{2\alpha} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Sufficient descent requires the last coefficient to be positive; thus $0 < \alpha < \frac{1+\lambda_n(W)}{L_f}$.
Case 2: From the proximal step, it follows that the function
$$u \mapsto r(u) + \frac{1}{2\alpha}\big\|u - \big(x^k - \alpha\nabla L_\alpha(x^k)\big)\big\|^2$$
reaches its minimum at $u = x^{k+1}$. Comparing the values of this function at $x^{k+1}$ and $x^k$ yields
$$r(x^{k+1}) - r(x^k) \le \frac{1}{2\alpha}\big\|x^k - \big(x^k - \alpha\nabla L_\alpha(x^k)\big)\big\|^2 - \frac{1}{2\alpha}\big\|x^{k+1} - \big(x^k - \alpha\nabla L_\alpha(x^k)\big)\big\|^2 = -\frac{1}{2\alpha}\|x^{k+1} - x^k\|^2 - \langle \nabla L_\alpha(x^k), x^{k+1} - x^k\rangle.$$
Substituting this inequality into the descent estimate above and expanding $L$ yield
$$\hat{L}_\alpha(x^{k+1}) - \hat{L}_\alpha(x^k) \le -\Big(\frac{1}{2\alpha} - \frac{L}{2}\Big)\|x^{k+1} - x^k\|^2 = -\Big(\frac{\lambda_n(W)}{2\alpha} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Hence, sufficient descent requires $0 < \alpha < \frac{\lambda_n(W)}{L_f}$.
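Both cases of Lemma 16 concern the proximal step of Prox-DGD. For concreteness, here is an illustrative sketch (not from the paper; the test vector and threshold are arbitrary) of two proximable choices of $r_i$ mentioned earlier: the convex $\ell_1$ norm, whose proximal operator is soft-thresholding, and the nonconvex $\ell_0$ quasi-norm, whose proximal operator is hard-thresholding.

```python
import numpy as np

def prox_l1(v, t):
    # prox_{t*||.||_1}(v): soft-thresholding (convex case)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_l0(v, t):
    # prox_{t*||.||_0}(v): hard-thresholding (nonconvex case);
    # keeps an entry v_i iff v_i^2 / 2 > t (ties are set-valued; we zero them)
    return np.where(v ** 2 > 2.0 * t, v, 0.0)

v = np.array([1.5, -0.3, 0.05, -2.0])
print(prox_l1(v, 0.5))   # approximately [1.0, 0.0, 0.0, -1.5]
print(prox_l0(v, 0.5))   # approximately [1.5, 0.0, 0.0, -2.0]
```

Soft-thresholding shrinks every surviving entry by $t$, while hard-thresholding leaves survivors untouched; this is the qualitative difference that forces the smaller step-size range $\alpha < \lambda_n(W)/L_f$ in the nonconvex case.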
Lemma 17 (Boundedness). Under the conditions of Lemma 16, the sequence $\{\hat{L}_\alpha(x^k)\}$ is lower bounded, and the sequence $\{x^k\}$ is bounded.

Proof. The lower boundedness of $\{\hat{L}_\alpha(x^k)\}$ is due to Assumption 4. By Lemma 16 and under a proper step size, $\hat{L}_\alpha(x^k)$ is nonincreasing and upper bounded by $\hat{L}_\alpha(x^0)$. Hence, $\sum_{i=1}^n\big(f_i(x_i^k) + r_i(x_i^k)\big)$ is upper bounded by $\hat{L}_\alpha(x^0)$. Consequently, $\{x^k\}$ is bounded, due to the coercivity of each $f_i + r_i$ (see Assumption 4).

Lemma 18 (Bounded subgradient). Let $\partial\hat{L}_\alpha(x^{k+1})$ denote the limiting subdifferential of $\hat{L}_\alpha$ at $x^{k+1}$, which is assumed to exist for all $k \in \mathbb{N}$. Then there exists $g^{k+1} \in \partial\hat{L}_\alpha(x^{k+1})$ such that
$$\|g^{k+1}\| \le \Big(L_f + \frac{2-\lambda_n(W)}{\alpha}\Big)\|x^{k+1} - x^k\|.$$

Proof. By the Prox-DGD iteration, the following optimality condition holds:
$$0 \in \frac{1}{\alpha}(x^{k+1} - x^k) + \nabla L_\alpha(x^k) + \partial r(x^{k+1}),$$
where $\partial r(x^{k+1})$ denotes the limiting subdifferential of $r$ at $x^{k+1}$. For the $\xi^{k+1} \in \partial r(x^{k+1})$ realizing this condition, it follows that
$$\nabla L_\alpha(x^{k+1}) + \xi^{k+1} = \frac{1}{\alpha}(x^k - x^{k+1}) + \nabla L_\alpha(x^{k+1}) - \nabla L_\alpha(x^k),$$
which immediately yields
$$\|\nabla L_\alpha(x^{k+1}) + \xi^{k+1}\| \le \frac{1}{\alpha}\|x^{k+1} - x^k\| + L\|x^{k+1} - x^k\| = \Big(L_f + \frac{2-\lambda_n(W)}{\alpha}\Big)\|x^{k+1} - x^k\|.$$
Since $g^{k+1} := \nabla L_\alpha(x^{k+1}) + \xi^{k+1} \in \partial\hat{L}_\alpha(x^{k+1})$, the claim of Lemma 18 holds.

Based on Lemmas 16-18, we can easily prove Theorem 3 and Proposition 5.

Proof of Theorem 3. The proof of this theorem is similar to that of Theorem 1 and thus is omitted.

Proof of Proposition 5. The proof is similar to that of Proposition 2. We note, however, that in the analogue of the key inequality there, $a = \frac{1+\lambda_n(W)}{2\alpha} - \frac{L_f}{2}$ if the $r_i$'s are convex, while $a = \frac{\lambda_n(W)}{2\alpha} - \frac{L_f}{2}$ if the $r_i$'s are not necessarily convex and $\lambda_n(W) > 0$.

G. Proofs of Theorem 4 and Proposition 6

Based on the Prox-DGD iteration, we derive the following recursion for the iterates of Prox-DGD, which is similar to the DGD recursion.

Lemma 19 (Recursion of $\{x^k\}$). For any $k \in \mathbb{N}$,
$$x^k = W^k x^0 - \sum_{j=0}^{k-1}\alpha_j W^{k-1-j}\big(\nabla f(x^j) + \xi^{j+1}\big),$$
where $\xi^{j+1} \in \partial r(x^{j+1})$ is the subgradient determined by the proximal operator, for any $j = 0, \ldots, k-1$.

Proof. By the definition of the proximal operator, the Prox-DGD iteration implies
$$x^{k+1} + \alpha_k\xi^{k+1} = Wx^k - \alpha_k\nabla f(x^k),$$
where $\xi^{k+1} \in \partial r(x^{k+1})$, and thus
$$x^{k+1} = Wx^k - \alpha_k\big(\nabla f(x^k) + \xi^{k+1}\big).$$
From this, we can easily derive the recursion by induction.

Proof of Proposition 6. The proof of this proposition is similar to that of Proposition 3.
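The proofs of Propositions 3 and 6 both rest on the geometric mixing bound of Lemma 7, $\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\| \le C\zeta^k$. A small numerical sketch (the ring graph and Metropolis-type weights are illustrative assumptions, not from the paper): for a symmetric doubly stochastic $W$, the spectral theorem gives $\|W^k - \frac{1}{n}\mathbf{1}\mathbf{1}^T\|_2 = \zeta^k$ exactly, i.e., the constant $C$ can be taken as 1.

```python
import numpy as np

n = 5
# Metropolis-type weights for a ring graph: 1/3 to each neighbor, 1/3 to self.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    W[i, i] = 1.0 / 3.0

# zeta: second largest magnitude eigenvalue of W
zeta = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1][1]
avg = np.ones((n, n)) / n   # the averaging matrix (1/n) 1 1^T

for k in [1, 5, 10, 20]:
    err = np.linalg.norm(np.linalg.matrix_power(W, k) - avg, 2)
    print(k, err, zeta ** k)  # err matches zeta**k here (symmetric W, C = 1)
```

This geometric decay of `err` is exactly what drives the consensus-rate estimates: the consensus error is a $\zeta$-discounted sum of past (sub)gradients.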
It only needs to be noted that the subgradient term $\nabla f(x^j) + \xi^{j+1}$ is uniformly bounded by the constant $B$ for any $j$. Thus, we omit the details here.

To prove Theorem 4, we still need the following lemmas.

Lemma 20. Let Assumptions 1 and 4 hold. In Prox-DGD, use the step sizes $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$. Results are given in two cases below.
Case 1: the $r_i$'s are convex. For any $k \in \mathbb{N}$,
$$\hat{L}_{\alpha_{k+1}}(x^{k+1}) \le \hat{L}_{\alpha_k}(x^k) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W} - \Big(\frac{1+\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$
Case 2: the $r_i$'s are not necessarily convex. For any $k \in \mathbb{N}$,
$$\hat{L}_{\alpha_{k+1}}(x^{k+1}) \le \hat{L}_{\alpha_k}(x^k) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W} - \Big(\frac{\lambda_n(W)}{2\alpha_k} - \frac{L_f}{2}\Big)\|x^{k+1} - x^k\|^2.$$

Proof. The proof of this lemma is similar to that of Lemma 16, after noting that
$$\hat{L}_{\alpha_{k+1}}(x^{k+1}) = \hat{L}_{\alpha_k}(x^k) + \big(\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k)\big) + \Big(\frac{1}{2\alpha_{k+1}} - \frac{1}{2\alpha_k}\Big)\|x^{k+1}\|^2_{I-W},$$
where the term $\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(x^k)$ can be estimated as in the proof of Lemma 16.

Lemma 21. Let Assumptions 1, 4, and 5 hold. In Prox-DGD, use the step sizes $\alpha_k = \frac{\alpha}{(1+k)^\epsilon}$. If, further, each $f_i$ and $r_i$ is convex, then for any $u \in \mathbb{R}^{n\times p}$, we have
$$\hat{L}_{\alpha_k}(x^{k+1}) - \hat{L}_{\alpha_k}(u) \le \frac{1}{2\alpha_k}\big(\|x^k - u\|^2 - \|x^{k+1} - u\|^2\big).$$

Proof. By Lemma 15, we have
$$L_{\alpha_k}(u) \ge L_{\alpha_k}(x^{k+1}) + \langle \nabla L_{\alpha_k}(x^k), u - x^{k+1}\rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2,$$
where $L_k = L_f + \alpha_k^{-1}(1-\lambda_n(W))$, and by the convexity of $r$, we have
$$r(u) \ge r(x^{k+1}) + \langle \xi^{k+1}, u - x^{k+1}\rangle,$$
where $\xi^{k+1} \in \partial r(x^{k+1})$ is the subgradient determined by the proximal operator. By the Prox-DGD update, it follows that
$$\xi^{k+1} = \frac{1}{\alpha_k}(x^k - x^{k+1}) - \nabla L_{\alpha_k}(x^k).$$
Plugging the expression for $\xi^{k+1}$ into the convexity inequality for $r$, and then summing the two inequalities, yields
$$\hat{L}_{\alpha_k}(u) \ge \hat{L}_{\alpha_k}(x^{k+1}) + \frac{1}{\alpha_k}\langle x^k - x^{k+1}, u - x^{k+1}\rangle - \frac{L_k}{2}\|x^{k+1} - x^k\|^2.$$
Similar to the rest of the proof of the basic inequality in Proposition 4, we can prove this lemma based on the inequality above.

Proof of Theorem 4. Based on Lemma 20 and Lemma 21, we can prove Theorem 4. The proofs of Theorem 4(a)-(c) are similar to that of Theorem 2, while the proof of Theorem 4(d) is very similar to that of Proposition 4; thus, the details are omitted.

VI. CONCLUSION

In this paper, we study the convergence behavior of the algorithm DGD for smooth, possibly nonconvex consensus optimization, considering both fixed and decreasing step sizes. When a fixed step size is used, we show that the iterates of DGD converge to a stationary point of a Lyapunov function, which approximates a stationary point of the original problem. Moreover, we bound the deviation between each local point and the global average; the bound is proportional to the step size and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of the mixing matrix. This motivates us to study DGD with decreasing step sizes. In that setting, we show that the iterates of DGD reach consensus asymptotically at a sublinear rate and converge to a stationary point of the original problem. We also estimate the convergence rates of the objective sequence in the convex setting under different diminishing step-size strategies. Furthermore, we extend these convergence results to Prox-DGD, which is designed to minimize the sum of a differentiable function and a proximable function; both functions can be nonconvex. If the proximable function is convex, a larger fixed step size is allowed. These results are obtained by applying both existing and new proof techniques.

ACKNOWLEDGMENTS

The work of J. Zeng has been supported in part by the NSF grants 66036, and the Doctoral start-up foundation of Jiangxi Normal University. The work of W.
Yin has been supported in part by the NSF grant ECCS and ONR grants N and N.

REFERENCES

[1] H. Attouch and J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Math. Program., 116: 5-16, 2009.
[2] H. Attouch, J. Bolte and B. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Math. Program., Ser. A, 137: 91-129, 2013.
[3] P. Bianchi and J. Jakubowicz, Convergence of a multi-agent projected stochastic gradient algorithm for nonconvex optimization, IEEE Trans. Automatic Control, 58(2): 391-405, 2013.
[4] P. Bianchi, G. Fort and W. Hachem, Performance of a distributed stochastic approximation algorithm, IEEE Trans. Information Theory, 59, 2013.
[5] J. Bolte, A. Daniilidis and A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization, 17(4): 1205-1223, 2007.
[6] A. Chen and A. Ozdaglar, A fast distributed proximal gradient method, in Proc. 50th Allerton Conf. Commun., Control Comput., Monticello, IL, Oct. 2012.
[7] T. Chang, M. Hong and X. Wang, Multi-agent distributed optimization via inexact consensus ADMM, IEEE Trans. Signal Process., 63, 2015.
[8] A. Chen, Fast Distributed First-Order Methods, Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 2012.
[9] Y.T. Chow, T. Wu and W. Yin, Cyclic coordinate update algorithms for fixed-point problems: analysis and applications, UCLA CAM Report 16-78, 2016.
[10] W. Deng, M. Lai, Z. Peng and W. Yin, Parallel multi-block ADMM with o(1/k) convergence, Journal of Scientific Computing, 2016.
[11] E. Hazan, K.Y. Levy and S. Shalev-Shwartz, On graduated optimization for stochastic nonconvex problems, in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016.
[12] M. Hardt, B. Recht and Y.
Singer, Train faster, generalize better: stability of stochastic gradient descent, in Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016.
[13] D. Hajinezhad, M. Hong and A. Garcia, ZENITH: a zeroth-order distributed algorithm for multi-agent nonconvex optimization, technical report.
[14] M. Hong, Z. Luo and M. Razaviyayn, Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems, ICASSP 2015.
[15] S. Hosseini, A. Chapman and M. Mesbahi, Online distributed optimization on dynamic networks, IEEE Trans. Automatic Control, 61, 2016.
[16] D. Jakovetic, J. Xavier and J. Moura, Fast distributed gradient methods, IEEE Trans. Automatic Control, 59(5): 1131-1146, 2014.
[17] D. Kempe, A. Dobra and J. Gehrke, Gossip-based computation of aggregate information, in Proc. 44th Annual IEEE Symposium on Foundations of Computer Science, 482-491, IEEE Computer Society, 2003.
[18] K. Knopp, Infinite Sequences and Series, Courier Corporation, 1956.
[19] J. Lafond, H. Wai and E. Moulines, D-FW: communication efficient distributed algorithms for high-dimensional sparse optimization, ICASSP 2016.
[20] S. Lee and A. Nedic, Distributed random projection algorithm for convex optimization, IEEE J. Sel. Topics Signal Process., 7: 221-229, 2013.
[21] Q. Ling and Z. Tian, Decentralized sparse signal recovery for compressive sleeping wireless sensor networks, IEEE Trans. Signal Process., 58(7), 2010.
[22] S. Łojasiewicz, Sur la géométrie semi- et sous-analytique, Ann. Inst. Fourier (Grenoble), 43(5): 1575-1595, 1993.
[23] P.D. Lorenzo and G. Scutari, NEXT: in-network nonconvex optimization, IEEE Trans. Signal and Information Processing over Networks, 2(2): 120-136, 2016.
[24] P.D. Lorenzo and G. Scutari, Distributed nonconvex optimization over time-varying networks, ICASSP 2016.
[25] I. Matei and J. Baras, Performance evaluation of the consensus-based distributed subgradient method under random communication topologies, IEEE J. Sel. Top. Signal Process., 5: 754-771, 2011.
[26] H. McMahan and M.
Streeter, Delay-tolerant algorithms for asynchronous distributed online learning, in Advances in Neural Information Processing Systems (NIPS), 2014.
[27] G. Mateos, J. Bazerque and G. Giannakis, Distributed sparse linear regression, IEEE Trans. Signal Process., 58(10), 2010.
[28] A. Nedic and A. Ozdaglar, Distributed subgradient methods for multi-agent optimization, IEEE Trans. Automatic Control, 54(1): 48-61, 2009.
[29] A. Nedic and A. Olshevsky, Distributed optimization over time-varying directed graphs, IEEE Trans. Automatic Control, 60(3): 601-615, 2015.
[30] M. Nevelson and R.Z. Khasminskii, Stochastic Approximation and Recursive Estimation [translated from the Russian by the Israel Program for Scientific Translations; translation edited by B. Silver], American Mathematical Society, 1973.
[31] G. Qu and N. Li, Harnessing smoothness to accelerate distributed optimization, IEEE Transactions on Control of Network Systems, 2017.
[32] M. Raginsky, N. Kiarashi and R. Willett, Decentralized online convex programming with local information, in Proc. 2011 American Control Conference, San Francisco, CA, USA, 2011.
More informationA Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming
A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose
More informationA Unified Approach to Proximal Algorithms using Bregman Distance
A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationDistributed Optimization over Random Networks
Distributed Optimization over Random Networks Ilan Lobel and Asu Ozdaglar Allerton Conference September 2008 Operations Research Center and Electrical Engineering & Computer Science Massachusetts Institute
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More informationarxiv: v3 [math.oc] 8 Jan 2019
Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationDistributed Optimization over Networks Gossip-Based Algorithms
Distributed Optimization over Networks Gossip-Based Algorithms Angelia Nedić angelia@illinois.edu ISE Department and Coordinated Science Laboratory University of Illinois at Urbana-Champaign Outline Random
More informationc 2015 Society for Industrial and Applied Mathematics
SIAM J. OPTIM. Vol. 5, No., pp. 944 966 c 05 Society for Industrial and Applied Mathematics EXTRA: AN EXACT FIRST-ORDER ALGORITHM FOR DECENTRALIZED CONSENSUS OPTIMIZATION WEI SHI, QING LING, GANG WU, AND
More informationIterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming
Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More informationON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction
J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties
More informationAn asymptotic ratio characterization of input-to-state stability
1 An asymptotic ratio characterization of input-to-state stability Daniel Liberzon and Hyungbo Shim Abstract For continuous-time nonlinear systems with inputs, we introduce the notion of an asymptotic
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationNew hybrid conjugate gradient methods with the generalized Wolfe line search
Xu and Kong SpringerPlus (016)5:881 DOI 10.1186/s40064-016-5-9 METHODOLOGY New hybrid conjugate gradient methods with the generalized Wolfe line search Open Access Xiao Xu * and Fan yu Kong *Correspondence:
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationOne Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties
One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,
More informationOn proximal-like methods for equilibrium programming
On proximal-lie methods for equilibrium programming Nils Langenberg Department of Mathematics, University of Trier 54286 Trier, Germany, langenberg@uni-trier.de Abstract In [?] Flam and Antipin discussed
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationarxiv: v1 [stat.ml] 12 Nov 2015
Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints Mengdi Wang, Yichen Chen, Jialin Liu, Yuantao Gu arxiv:5.03760v [stat.ml] Nov 05 November 3,
More informationAlternative Characterization of Ergodicity for Doubly Stochastic Chains
Alternative Characterization of Ergodicity for Doubly Stochastic Chains Behrouz Touri and Angelia Nedić Abstract In this paper we discuss the ergodicity of stochastic and doubly stochastic chains. We define
More informationarxiv: v2 [math.oc] 21 Nov 2017
Unifying abstract inexact convergence theorems and block coordinate variable metric ipiano arxiv:1602.07283v2 [math.oc] 21 Nov 2017 Peter Ochs Mathematical Optimization Group Saarland University Germany
More informationDistributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology
More informationProximal-like contraction methods for monotone variational inequalities in a unified framework
Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationProximal methods. S. Villa. October 7, 2014
Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem
More informationSOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1
SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 Masao Fukushima 2 July 17 2010; revised February 4 2011 Abstract We present an SOR-type algorithm and a
More informationMATH 680 Fall November 27, Homework 3
MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationA globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications
A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationSubgradient Methods in Network Resource Allocation: Rate Analysis
Subgradient Methods in Networ Resource Allocation: Rate Analysis Angelia Nedić Department of Industrial and Enterprise Systems Engineering University of Illinois Urbana-Champaign, IL 61801 Email: angelia@uiuc.edu
More informationKaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization
Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth
More informationON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS
ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in
More informationTHE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS
Submitted: 24 September 2007 Revised: 5 June 2008 THE EFFECT OF DETERMINISTIC NOISE 1 IN SUBGRADIENT METHODS by Angelia Nedić 2 and Dimitri P. Bertseas 3 Abstract In this paper, we study the influence
More informationRelative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent
Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order
More informationActive sets, steepest descent, and smooth approximation of functions
Active sets, steepest descent, and smooth approximation of functions Dmitriy Drusvyatskiy School of ORIE, Cornell University Joint work with Alex D. Ioffe (Technion), Martin Larsson (EPFL), and Adrian
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationA derivative-free nonmonotone line search and its application to the spectral residual method
IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral
More informationSequential convex programming,: value function and convergence
Sequential convex programming,: value function and convergence Edouard Pauwels joint work with Jérôme Bolte Journées MODE Toulouse March 23 2016 1 / 16 Introduction Local search methods for finite dimensional
More informationIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior
More informationI P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION
I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationParallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization
Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationSparse Optimization Lecture: Dual Methods, Part I
Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 20 Subgradients Assumptions
More informationGeneralized Uniformly Optimal Methods for Nonlinear Programming
Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly
More informationA user s guide to Lojasiewicz/KL inequalities
Other A user s guide to Lojasiewicz/KL inequalities Toulouse School of Economics, Université Toulouse I SLRA, Grenoble, 2015 Motivations behind KL f : R n R smooth ẋ(t) = f (x(t)) or x k+1 = x k λ k f
More informationA Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions
A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions Angelia Nedić and Asuman Ozdaglar April 16, 2006 Abstract In this paper, we study a unifying framework
More informationEfficient Methods for Large-Scale Optimization
Efficient Methods for Large-Scale Optimization Aryan Mokhtari Department of Electrical and Systems Engineering University of Pennsylvania aryanm@seas.upenn.edu Ph.D. Proposal Advisor: Alejandro Ribeiro
More informationBlock Coordinate Descent for Regularized Multi-convex Optimization
Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline
More informationDistributed intelligence in multi agent systems
Distributed intelligence in multi agent systems Usman Khan Department of Electrical and Computer Engineering Tufts University Workshop on Distributed Optimization, Information Processing, and Learning
More informationStatistical Machine Learning II Spring 2017, Learning Theory, Lecture 4
Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.
More informationTight Rates and Equivalence Results of Operator Splitting Schemes
Tight Rates and Equivalence Results of Operator Splitting Schemes Wotao Yin (UCLA Math) Workshop on Optimization for Modern Computing Joint w Damek Davis and Ming Yan UCLA CAM 14-51, 14-58, and 14-59 1
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationLocal strong convexity and local Lipschitz continuity of the gradient of convex functions
Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate
More informationAccelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity
1 Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity Sijia Liu, Member, IEEE, Pin-Yu Chen, Member, IEEE, and Alfred O. Hero, Fellow, IEEE arxiv:1704.05193v2 [stat.ml]
More informationNonparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel
IEEE TRASACTIOS O SIGAL PROCESSIG, VOL. X, O. X, X X onparametric Decentralized Detection and Sparse Sensor Selection via Weighted Kernel Weiguang Wang, Yingbin Liang, Member, IEEE, Eric P. Xing, Senior
More informationA Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions
A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions Angelia Nedić and Asuman Ozdaglar April 15, 2006 Abstract We provide a unifying geometric framework for the
More information