On Nonconvex Decentralized Gradient Descent

Jinshan Zeng and Wotao Yin

Abstract—Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been proposed for convex consensus optimization. However, our understanding of the behaviors of these algorithms for nonconvex consensus optimization is more limited. When we lose convexity, we cannot hope that our algorithms always return global solutions, though they sometimes still do. Somewhat surprisingly, the decentralized consensus algorithms DGD and Prox-DGD retain most of the other properties that are known in the convex setting. In particular, when diminishing (or constant) step sizes are used, we can prove convergence to a consensus stationary solution (or a neighborhood of one) under some regular assumptions. It is worth noting that Prox-DGD can handle nonconvex nonsmooth functions if their proximal operators can be computed. Such functions include SCAD, MCP, and the ℓ_q quasi-norms, q ∈ [0, 1). Similarly, Prox-DGD can handle a constraint to a nonconvex set with an easy projection. To establish these properties, we have to introduce a completely different line of analysis, as well as modify existing proofs that were used in the convex setting.

Index Terms—Nonconvex decentralized computing, consensus optimization, decentralized gradient descent method, proximal decentralized gradient descent.

(J. Zeng is with the College of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 3300, China; jsh.zeng@gmail.com. W. Yin is with the Department of Mathematics, University of California, Los Angeles, CA 90095, USA; wotaoyin@ucla.edu.)

I. INTRODUCTION

WE consider an undirected, connected network of n agents and the following consensus optimization problem defined on the network:

  minimize_{x ∈ R^p}  f(x) := Σ_{i=1}^n f_i(x),   (1)

where f_i is a differentiable function known only to agent i. We also consider the consensus optimization problem in the following differentiable+proximable form:

  minimize_{x ∈ R^p}  s(x) := Σ_{i=1}^n ( f_i(x) + r_i(x) ),   (2)

where f_i and r_i are differentiable and proximable functions, respectively, known only to agent i. (We call a function proximable if its proximal operator prox_{αf}(y) := argmin_x { αf(x) + (1/2)||x − y||² } is easy to compute.) Each function r_i is possibly non-differentiable or nonconvex, or both.

The models (1) and (2) find applications in decentralized averaging, learning, estimation, and control. Some typical applications include: (i) distributed compressed sensing problems [4], [30], [39], [45], [49]; (ii) distributed consensus [68], [9], [9], [58], [55], [6]; (iii) distributed and parallel machine learning [55], [5], [], [33], [43]. More specifically, in these applications, each f_i can be: 1) the data-fidelity term (possibly nonconvex) in statistical learning and machine learning [5], [6]; 2) a nonconvex utility function used in applications such as resource allocation [6], [0]; 3) the empirical risk of deep neural networks with nonlinear activation functions [3]. The proximable function r_i can be taken as: 1) a convex penalty such as the nonsmooth ℓ_1-norm or the smooth ℓ_2-norm; 2) the indicator function of a closed convex set, or of a nonconvex set with an easy projection [4], that is, r_i(x) = 0 if x satisfies the constraint and +∞ otherwise; 3) a nonconvex penalty such as the ℓ_q quasi-norm (0 ≤ q < 1) [49], [], the smoothly clipped absolute deviation (SCAD) penalty [6], or the minimax concave penalty (MCP) [67].
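To make the notion of "proximable" concrete, the following minimal Python sketch (ours, not part of the paper) evaluates two of the proximal operators just mentioned: soft-thresholding for the convex ℓ_1-norm and hard-thresholding for the nonconvex ℓ_0 quasi-norm. Both are closed-form and cheap even though the second objective is nonconvex.

```python
import numpy as np

def prox_l1(y, alpha):
    # Soft-thresholding: prox of alpha*||x||_1, the classic convex example.
    return np.sign(y) * np.maximum(np.abs(y) - alpha, 0.0)

def prox_l0(y, alpha):
    # Hard-thresholding: prox of alpha*||x||_0; nonconvex but still easy:
    # keep entries whose magnitude exceeds sqrt(2*alpha), zero the rest.
    return np.where(np.abs(y) > np.sqrt(2.0 * alpha), y, 0.0)

y = np.array([1.5, -0.3, 0.05, -2.0])
print(prox_l1(y, 0.5))   # [ 1.  -0.   0.  -1.5]
print(prox_l0(y, 0.5))   # [ 1.5  0.   0.  -2. ]
```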
When the f_i's are convex, the existing algorithms include the subgradient methods [8], [0], [4], [37], [40], [59], [65], [46], and primal-dual domain methods such as the decentralized alternating direction method of multipliers (DADMM) [5], [5], [9], DLM [3], and EXTRA [53], [54]. When the f_i's are nonconvex, some existing results include [4], [5], [8], [35], [36], [56], [57], [7], [60], [6], [68]. In spite of the algorithms and analyses in these works, the convergence of the simple algorithm Decentralized Gradient Descent (DGD) [40] under nonconvex f_i's is still unknown. Furthermore, although DGD is slower than DADMM, DLM, and EXTRA on convex problems, DGD is simpler and thus easier to extend to a variety of settings such as [47], [64], [38], [3], where online processing and delay tolerance are considered. Therefore, we expect our results to motivate future adoptions of nonconvex DGD.

This paper studies the convergence of two algorithms: DGD for solving problem (1) and Prox-DGD for problem (2). In each DGD iteration, every agent locally computes a gradient and then updates its variable by combining the average of its neighbors' variables with its negative gradient step. In each Prox-DGD iteration, every agent locally computes a gradient of f_i and a proximal map of r_i, as well as exchanging information with its neighbors. Both algorithms can use either a fixed step size or a sequence of decreasing step sizes.

When the problem is convex and a fixed step size is used, DGD does not converge to a solution of the original problem but to a point in its neighborhood [65]. This motivates the use of decreasing step sizes such as in [0], [4]. Assuming the f_i's are convex and have Lipschitz continuous and bounded gradients, [0] shows that the decreasing step sizes α_k = 1/√k lead to a convergence rate O(ln k/√k) of the running best of objective errors. [4] uses nested loops and shows an outer-loop convergence rate O(1/k²) of objective errors, utilizing Nesterov's acceleration, provided that the inner loop performs substantial consensus computation. Without a substantial inner loop, their single-loop algorithm using the decreasing step sizes α_k = 1/k^{1/3} has a reduced rate O(ln k/k).

The objective of this paper is two-fold: (a) we aim to show that, other than losing global optimality, most existing convergence results of DGD and Prox-DGD that are known in the convex setting remain valid in the nonconvex setting, and (b) to achieve (a), we illustrate how to tailor nonconvex analysis tools for decentralized optimization. In particular, our asymptotic exact and inexact consensus results require new treatments because they are special to decentralized algorithms.

The analytic results of this paper can be summarized as follows.
(a) When a fixed step size α is used and properly bounded, the DGD iterates converge to a stationary point of a Lyapunov function. The difference between each local estimate of x and the global average of all local estimates is bounded, and the bound is proportional to α.
(b) When a decreasing step size α_k = O(1/(1+k)^ϵ) is used, where 0 < ϵ ≤ 1 and k is the iteration number, the objective sequence converges, and the iterates of DGD are asymptotically consensual (i.e., become equal to one another), which they achieve at the rate O(1/(1+k)^ϵ). Moreover, we show the convergence of DGD to a stationary point of the original problem, and derive the convergence rates of DGD with different ϵ for objective functions that are convex.
(c) The convergence analysis of DGD can be extended to the algorithm Prox-DGD for solving problem (2). However, when the proximable functions r_i are nonconvex, the mixing matrix is required to be positive definite and a smaller step size is also required. Otherwise, the mixing matrix can be indefinite.

Detailed comparisons between our results and the existing results on DGD and Prox-DGD are presented in Tables I and II. The global objective error rate in these two tables refers to the rate of {f(x̄^k) − f(x^opt)} or {s(x̄^k) − s(x^opt)}, where x̄^k = (1/n) Σ_{i=1}^n x_i^k is the average of the kth iterate and x^opt is a global solution. Comparisons beyond DGD and Prox-DGD are presented in Section IV and Table III.

New proof techniques are introduced in this paper, particularly in the analysis of the convergence of DGD and Prox-DGD with decreasing step sizes. Specifically, the convergence of the objective sequence and convergence to a stationary point of the original problem with decreasing step sizes are justified via a Lyapunov function and several new lemmas (cf. Lemma 9 and the proof of Theorem 2). Moreover, we estimate the consensus rate by introducing an auxiliary sequence and then showing that both sequences have the same rate (cf. the proof of Proposition 3). All these proof techniques are new and distinguish our paper from existing works such as [0], [4], [40], [4], [8], [35], [57], [6]. It should be mentioned that, during the revision of this paper, we found some recent, related but independent work on the convergence of nonconvex decentralized algorithms, including [33], [], [], [9]. We will give detailed comparisons with these works later.

The rest of this paper is organized as follows. Section II describes the problem setup and reviews the algorithms. Section III presents our assumptions and main results. Section IV discusses related works. Section V shows some numerical experiments to verify the developed theoretical results. Section VI presents the proofs of our main results. We conclude this paper in Section VII.

Notation: Let I denote the identity matrix of size n × n, and 1 ∈ R^n denote the vector of all ones.
For a matrix X, X^T denotes its transpose, X_{ij} denotes its (i,j)th component, and ||X|| := sqrt(⟨X, X⟩) = sqrt(Σ_{i,j} X_{ij}²) is its Frobenius norm, which simplifies to the Euclidean norm when X is a vector. Given a symmetric, positive semidefinite matrix G ∈ R^{n×n}, we let ||X||_G := sqrt(⟨X, GX⟩) be the induced semi-norm. Given a function h, dom h denotes its domain.

II. PROBLEM SETUP AND ALGORITHM REVIEW

Consider a connected undirected network G = {V, E}, where V is a set of n nodes and E is the edge set. Any edge (i, j) ∈ E represents a communication link between nodes i and j. Let x_i ∈ R^p denote the local copy of x at node i. We reformulate the consensus problem (1) into the equivalent problem:

  minimize_x  1^T f(x) := Σ_{i=1}^n f_i(x_i),  subject to  x_i = x_j, ∀(i, j) ∈ E,   (3)

where x ∈ R^{n×p} and f(x) ∈ R^n with

  x := [x_1, x_2, ..., x_n]^T,  f(x) := ( f_1(x_1), f_2(x_2), ..., f_n(x_n) )^T.

In addition, the gradient of 1^T f(x) is

  ∇f(x) := [∇f_1(x_1), ∇f_2(x_2), ..., ∇f_n(x_n)]^T ∈ R^{n×p}.   (4)

The ith rows of the matrices x and ∇f(x), and of the vector f(x), correspond to agent i. The analysis in this paper applies to any integer p ≥ 1. For simplicity, one can let p = 1 and treat x and ∇f(x) as vectors rather than matrices.

The algorithm DGD [40] for (3) is described as follows: pick an arbitrary x^0; for k = 0, 1, ..., compute

  x^{k+1} = W x^k − α ∇f(x^k),   (5)

where W is a mixing matrix and α > 0 is a step-size parameter. Similarly, we can reformulate the composite problem (2) in the following equivalent form:

  minimize_x  Σ_{i=1}^n ( f_i(x_i) + r_i(x_i) ),  subject to  x_i = x_j, ∀(i, j) ∈ E.   (6)

Let r(x) := Σ_{i=1}^n r_i(x_i). The algorithm Prox-DGD can be applied to problem (6):

TABLE I: COMPARISONS OF DIFFERENT ALGORITHMS FOR THE CONSENSUS SMOOTH OPTIMIZATION PROBLEM (1)

Fixed step size — DGD [65]: f_i convex only; ∇f_i Lipschitz; step size 0 < α < (1+λ_n(W))/L_f; consensus error O(α); min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error O(1/k) until reaching an O(α/(1−ζ)) error.
Fixed step size — DGD (this paper): f_i nonconvex; ∇f_i Lipschitz; step size 0 < α < (1+λ_n(W))/L_f; consensus error O(α); min_{j≤k} ||x^{j+1} − x^j||²: o(1/k); global objective error — convex: O(1/k) until reaching an O(α/(1−ζ)) error; nonconvex: no rate.
Decreasing step sizes — D-NG [4]: f_i convex only; ∇f_i Lipschitz, bounded; step size O(1/k) with Nesterov acceleration; consensus error O(1/k); min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error O(ln k/k).
Decreasing step sizes — DGD (this paper): f_i nonconvex; ∇f_i Lipschitz, bounded; step size 1/(1+k)^ϵ, ϵ ∈ (0, 1]; consensus error O(1/(1+k)^ϵ); min_{j≤k} ||x^{j+1} − x^j||²: o(1/k^{1+ϵ}); global objective error — convex: O(ln k/√k) if ϵ = 1/2, O(1/ln k) if ϵ = 1, O(1/k^{min{ϵ, 1−ϵ}}) for other ϵ; nonconvex: no rate.

The objective error rates of DGD and Prox-DGD obtained in this paper, and those of the convex DProx-Grad [0], are ergodic or running-best rates.

TABLE II: COMPARISONS OF DIFFERENT ALGORITHMS FOR THE CONSENSUS COMPOSITE OPTIMIZATION PROBLEM (2)

Fixed step size — AccDProx-Grad [8]: f_i, r_i convex only; ∇f_i Lipschitz, bounded; ∂r_i bounded; step size 0 < α < 1/L_f; consensus error O(γ^k), 0 < γ < 1; min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error O(D_1 α + D_2/(α k)), D_1, D_2 > 0.
Fixed step size — DProx-Grad [0]: f_i, r_i convex only; ∇f_i Lipschitz; step size 0 < α < 1/L_f; consensus error O(α); min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error of the same form.
Fixed step size — Prox-DGD (this paper): f_i, r_i nonconvex; ∇f_i Lipschitz; step size 0 < α < (1+λ_n(W))/L_f for convex r_i, or 0 < α < λ_n(W)/L_f for nonconvex r_i (requiring λ_n(W) > 0); consensus error O(α); min_{j≤k} ||x^{j+1} − x^j||²: o(1/k); global objective error — convex: of the form O(D_3 α + D_4/(α k)), D_3, D_4 > 0; nonconvex: no rate.
Decreasing step sizes — DProx-Grad [0]: f_i, r_i convex only; ∇f_i Lipschitz, bounded; ∂r_i bounded; step size O(1/√(k+1)); consensus error O(1/√k); min_{j≤k} ||x^{j+1} − x^j||²: no rate; global objective error O(ln k/√k).
Decreasing step sizes — Prox-DGD (this paper): f_i, r_i nonconvex; ∇f_i Lipschitz, bounded; ∂r_i bounded; step size 1/(1+k)^ϵ, ϵ ∈ (0, 1]; consensus error O(1/(1+k)^ϵ); min_{j≤k} ||x^{j+1} − x^j||²: o(1/k^{1+ϵ}); global objective error — convex: O(ln k/√k) if ϵ = 1/2, O(1/ln k) if ϵ = 1, O(1/k^{min{ϵ, 1−ϵ}}) for other ϵ; nonconvex: no rate.

Prox-DGD: take an arbitrary x^0; for k = 0, 1, ..., perform

  x^{k+1} = prox_{α r}( W x^k − α ∇f(x^k) ),   (7)

where the proximal operator is

  prox_{α r}(x) := argmin_{u ∈ R^{n×p}} { α r(u) + (1/2) ||u − x||² }.   (8)
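The updates (5) and (7) are straightforward to implement. Below is a minimal Python sketch of both; the names metropolis_mixing, grad_f, and prox_r are our own illustrative choices. We assume grad_f returns the stacked gradient ∇f(x) ∈ R^{n×p} (row i equal to ∇f_i(x_i)) and prox_r applies prox_{α r_i} row-wise. The Metropolis construction is one common way — not the paper's prescription — to build a mixing matrix with the properties assumed in Section III.

```python
import numpy as np

def metropolis_mixing(adj):
    """Build a symmetric, doubly stochastic mixing matrix W from a 0/1
    adjacency matrix via Metropolis weights (one standard construction)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # remaining mass stays on the diagonal
    return W

def dgd_step(x, W, grad_f, alpha):
    # DGD update (5): mix with neighbors, then take a local gradient step.
    return W @ x - alpha * grad_f(x)

def prox_dgd_step(x, W, grad_f, prox_r, alpha):
    # Prox-DGD update (7): a DGD step followed by the proximal map of r.
    return prox_r(W @ x - alpha * grad_f(x), alpha)

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # a 3-node path
W = metropolis_mixing(adj)   # symmetric, rows and columns sum to 1
```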

TABLE III: COMPARISON OF THE SCENARIOS COVERED BY DIFFERENT NONCONVEX DECENTRALIZED ALGORITHMS

The columns of Table III record, for each algorithm: whether f_i is smooth; whether a nonsmooth r_i is handled (cvx or ncvx); the step size (fixed or diminishing); the network (static or dynamic); the algorithm type (determin or stochastic); the fusion scheme (ATC or CTA); and the mixing matrix W. The mixing-matrix entries are: DGD (this paper): doubly; Perturbed Push-sum [57]: column; ZENITH [8]: doubly; Prox-DGD (this paper): doubly; NEXT [35]: doubly; DeFW [6]: doubly; Proj SGD [4]: row.

In this table, the full names of the abbreviations are as follows: cvx (convex), ncvx (nonconvex), diminish (diminishing), determin (deterministic), ATC (adapt-then-combine), CTA (combine-then-adapt), doubly (doubly stochastic), column (column stochastic), row (row stochastic), where the words in parentheses are the full names. A row, column, or doubly stochastic W means that W1 = 1, 1^T W = 1^T, or both hold, respectively.

III. ASSUMPTIONS AND MAIN RESULTS

This section presents all of our main results.

A. Definitions and assumptions

Definition 1 (Lipschitz differentiability). A function h is called Lipschitz differentiable if h is differentiable and its gradient ∇h is Lipschitz continuous, i.e., ||∇h(u) − ∇h(v)|| ≤ L||u − v|| for all u, v ∈ dom h, where L > 0 is its Lipschitz constant.

Definition 2 (Coercivity). A function h is called coercive if ||u|| → +∞ implies h(u) → +∞.

The next definition is a property that many functions have (see [63] for examples) and can help obtain whole-sequence convergence from subsequence convergence.

Definition 3 (Kurdyka-Łojasiewicz (KL) property [34], [7], []). A function h : R^p → R ∪ {+∞} has the KL property at x̄ ∈ dom ∂h if there exist η ∈ (0, +∞], a neighborhood U of x̄, and a continuous concave function φ : [0, η) → R_+ such that: (i) φ(0) = 0 and φ is differentiable on (0, η); (ii) for all s ∈ (0, η), φ'(s) > 0; (iii) for all x ∈ U ∩ {x : h(x̄) < h(x) < h(x̄) + η}, the KL inequality holds:

  φ'( h(x) − h(x̄) ) · dist( 0, ∂h(x) ) ≥ 1.   (9)

Proper lower semi-continuous functions that satisfy the KL inequality at each point of dom ∂h are called KL functions.

Assumption 1 (Objective). The objective functions f_i : R^p → R ∪ {+∞}, i = 1, ..., n, satisfy the following:
1) f_i is Lipschitz differentiable with constant L_i > 0;
2) f_i is proper (i.e., not everywhere infinite) and coercive.
The sum Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable with L_f ≤ max_i L_i (this can be easily verified via the definition of ∇f(x) shown in (4)). In addition, each f_i is lower bounded, following part 2) of the above assumption.

(Whole-sequence convergence from any starting point is referred to as global convergence in the literature. Its limit is not necessarily a global solution.)

Assumption 2 (Mixing matrix). The mixing matrix W = [w_{ij}] ∈ R^{n×n} has the following properties:
1) (Graph) If i ≠ j and (i, j) ∉ E, then w_{ij} = 0; otherwise, w_{ij} > 0;
2) (Symmetry) W = W^T;
3) (Null space property) null{I − W} = span{1};
4) (Spectral property) −I ≺ W ⪯ I.

By Assumption 2, a solution x^opt to problem (3) satisfies (I − W) x^opt = 0. Due to the symmetry of W, its eigenvalues are real and can be sorted in nonincreasing order. Let λ_i(W) denote the ith largest eigenvalue of W. Then, by Assumption 2,

  λ_1(W) = 1 > λ_2(W) ≥ ... ≥ λ_n(W) > −1.

Let ζ be the second largest magnitude eigenvalue of W. Then

  ζ = max{ |λ_2(W)|, |λ_n(W)| }.   (10)

B. Convergence results of DGD

We consider the convergence of DGD with both a fixed step size and a sequence of decreasing step sizes.

1) Convergence results of DGD with a fixed step size: The convergence result of DGD with a fixed step size (i.e., α_k ≡ α) is established based on the Lyapunov function [65]:

  L_α(x) := 1^T f(x) + (1/(2α)) ||x||²_{I−W}.   (11)

It is worth reminding that convexity is not assumed.
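As an illustration (anticipating Lemma 1 in Section VI), one can check numerically that the DGD update (5) is exactly a gradient step on L_α. The sketch below uses stand-in quadratic local objectives and a hypothetical Laplacian-based mixing matrix of our own choosing; the identity holds for any symmetric W.

```python
import numpy as np
rng = np.random.default_rng(0)

n, p, alpha = 5, 3, 0.05
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                    # a random undirected graph
W = np.eye(n) - 0.1 * (np.diag(A.sum(1)) - A)     # Laplacian-based mixing matrix (stand-in)
c = rng.standard_normal((n, p))
grad_f = lambda x: x - c                          # gradients of f_i(x_i) = ||x_i - c_i||^2 / 2

x = rng.standard_normal((n, p))
dgd = W @ x - alpha * grad_f(x)                                 # DGD update (5)
gd = x - alpha * (grad_f(x) + (np.eye(n) - W) @ x / alpha)      # gradient step on L_alpha
print(np.allclose(dgd, gd))                                     # True
```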
Theorem 1 (Global convergence). Let {x^k} be the sequence generated by DGD (5) with step size 0 < α < (1+λ_n(W))/L_f, and let Assumptions 1 and 2 hold. Then {x^k} has at least one accumulation point x*, and any such point is a stationary point of L_α(x). Furthermore, the running best rates of

(Given a nonnegative sequence a_k, its running best sequence is b_k = min{a_i : i ≤ k}. We say a_k has a running best rate of o(1/k) if b_k = o(1/k).)

the sequences {||x^{k+1} − x^k||²}, {||∇L_α(x^k)||²}, and {||(1/n) 1^T ∇f(x^k)||²} are o(1/k). The convergence rate of the sequence {(1/k) Σ_{j=0}^{k−1} ||(1/n) 1^T ∇f(x^j)||²} is O(1/k). In addition, if L_α satisfies the KL property at an accumulation point x*, then {x^k} globally converges to x*.

(These quantities naturally appear in the analysis, so we keep the squares.)

Remark 1. Let x* be a stationary point of L_α(x); thus

  0 = ∇f(x*) + (1/α)(I − W) x*.   (12)

Since 1^T (I − W) = 0, (12) yields 0 = 1^T ∇f(x*), indicating that x* is also a stationary point of the separable function Σ_{i=1}^n f_i(x_i). Since the rows of x* are not necessarily identical, we cannot say that x* is a stationary point of problem (3). However, the differences between the rows of x* are bounded, following our next result, adapted from [65]:

Proposition 1 (Consensual bound on x^k). For each iteration k, define x̄^k := (1/n) Σ_{i=1}^n x_i^k. Then, it holds for each node i that

  ||x_i^k − x̄^k|| ≤ αD/(1 − ζ),   (13)

where D is a universal bound on ||∇f(x^k)|| defined in Lemma 6 below, and ζ is the second largest magnitude eigenvalue of W specified in (10). As k → ∞, (13) yields the consensual bound ||x_i^∞ − x̄^∞|| ≤ αD/(1 − ζ), where x̄^∞ := (1/n) Σ_{i=1}^n x_i^∞.

In Proposition 1, the consensual bound is proportional to the step size α and inversely proportional to the gap between the largest and the second largest magnitude eigenvalues of W. Let us compare the DGD iteration (5) with the iteration of centralized gradient descent for f. Averaging the rows of (5) yields the following comparison:

  DGD averaged:  x̄^{k+1} = x̄^k − (α/n) Σ_{i=1}^n ∇f_i(x_i^k),   (14)
  Centralized:   x̄^{k+1} = x̄^k − (α/n) Σ_{i=1}^n ∇f_i(x̄^k).   (15)

Apparently, DGD approximates centralized gradient descent by evaluating each ∇f_i at the local variable x_i^k instead of the global average. We can estimate the error of this approximation as

  || (1/n) Σ_i ∇f_i(x_i^k) − (1/n) Σ_i ∇f_i(x̄^k) || ≤ (1/n) Σ_i ||∇f_i(x_i^k) − ∇f_i(x̄^k)|| ≤ L_f αD/(1 − ζ).

Unlike the convex analysis in [65], it is impossible to bound the difference between the sequences of (14) and (15) without convexity, because the two sequences may converge to different stationary points of L_α.

Remark 2. The KL assumption on L_α in Theorem 1 is satisfied if each f_i is a sub-analytic function. Since ||x||²_{I−W} is obviously sub-analytic and the sum of two sub-analytic functions remains sub-analytic, L_α is sub-analytic if each f_i is. See [63] for more details and examples.

Proposition 2 (KL convergence rates). Let the assumptions of Theorem 1 hold. Suppose that L_α satisfies the KL inequality at an accumulation point x* with φ(s) = c s^{1−θ} for some constant c > 0. Then the following convergence rates hold:
(a) If θ = 0, x^k converges to x* in finitely many iterations.
(b) If θ ∈ (0, 1/2], ||x^k − x*|| ≤ C_0 τ^k for all k ≥ k̂, for some k̂ > 0, C_0 > 0, τ ∈ [0, 1).
(c) If θ ∈ (1/2, 1), ||x^k − x*|| ≤ C_0 k^{−(1−θ)/(2θ−1)} for all k ≥ k̂, for certain k̂ > 0, C_0 > 0.

Note that the rates in parts (b) and (c) of Proposition 2 are of the eventual type. Using fixed step sizes, our results are limited because the stationary point x* of L_α is not a stationary point of the original problem; we only have a consensual bound on x*. To address this issue, the next subsection uses decreasing step sizes and presents better convergence results.

2) Convergence of DGD with decreasing step sizes: The positive consensual error bound in Proposition 1, which is proportional to the constant step size α, motivates the use of properly decreasing step sizes α_k = O(1/(1+k)^ϵ), for some 0 < ϵ ≤ 1, to diminish the consensual bound to 0. As a result, any accumulation point x* becomes a stationary point of the original problem (3). To analyze DGD with decreasing step sizes, we add the following assumption.

Assumption 3 (Bounded gradient). For any k, ||∇f(x^k)|| is uniformly bounded by some constant B > 0, i.e., ||∇f(x^k)|| ≤ B.
Note that the bounded gradient assumption is a regular assumption in the convergence analysis of decentralized gradient methods (see, e.g., [4], [5], [8], [35], [36], [56], [57], [7], [6]), even in the convex setting ([4] and also [0]), though it is not required for centralized gradient descent. We take the step-size sequence

  α_k = 1/(1+k)^ϵ,  0 < ϵ ≤ 1,   (16)

throughout the rest of this section. The numerator can be replaced by any positive constant. By iteratively applying iteration (5), we obtain the expression

  x^k = W^k x^0 − Σ_{j=0}^{k−1} α_j W^{k−1−j} ∇f(x^j).   (17)

Proposition 3 (Asymptotic consensus rate). Let Assumptions 2 and 3 hold. Let DGD use (16). Let x̄^k := (1/n) 1 1^T x^k. Then ||x^k − x̄^k|| converges to 0 at the rate of O(1/(1+k)^ϵ).

According to Proposition 3, the iterates of DGD with decreasing step sizes reach consensus asymptotically, compared to the nonzero bound in the fixed step size case in Proposition 1. Moreover, with a larger ϵ, faster decaying step sizes generally imply a faster asymptotic consensus rate.
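A small simulation sketch (ours; the local objectives are stand-ins with clipped, hence bounded, gradients so that Assumption 3 holds, and the mixing matrix is the three-agent one used in the experiments of Section V) illustrates Proposition 3: with α_k = 1/(1+k)^ϵ, the scaled consensus error (1+k)^ϵ ||x^k − x̄^k|| should remain roughly bounded for each ϵ.

```python
import numpy as np

def consensus_error(W, grad_f, x0, eps, iters=2000):
    """Run DGD with alpha_k = 1/(1+k)^eps and record ||x^k - xbar^k||."""
    x, errs = x0.copy(), []
    for k in range(iters):
        alpha = 1.0 / (1.0 + k) ** eps
        x = W @ x - alpha * grad_f(x)
        errs.append(np.linalg.norm(x - x.mean(axis=0, keepdims=True)))
    return errs

W = 0.5 * np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
c = np.array([[-1.0], [0.0], [2.0]])
grad_f = lambda x: np.clip(x - c, -5.0, 5.0)   # bounded stand-in gradients
x0 = np.zeros((3, 1))
for eps in (0.3, 0.6, 1.0):
    errs = consensus_error(W, grad_f, x0, eps)
    # scaled error stays roughly bounded, consistent with O(1/(1+k)^eps)
    print(eps, errs[-1] * (1.0 + len(errs)) ** eps)
```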

Note that (I − W) x̄^k = 0 and thus ||x^k||_{I−W} = ||x^k − x̄^k||_{I−W}. Therefore, Proposition 3 implies the following result.

Corollary 1. Apply the setting of Proposition 3. Then ||x^k||_{I−W} converges to 0 at the rate of O(1/(1+k)^ϵ).

Corollary 1 shows that the sequence {x^k} in the (I−W) semi-norm decays to 0 at a sublinear rate. For any global consensual solution x^opt to problem (3), we have ||x^k − x^opt||_{I−W} = ||x^k||_{I−W}; so, if {x^k} does converge to x^opt, then their distance in the same semi-norm decays at O(1/(1+k)^ϵ).

Theorem 2 (Convergence). Let Assumptions 1, 2, and 3 hold. Let DGD use step sizes (16). Then
(a) {L_{α_k}(x^k)} and {1^T f(x^k)} converge to the same limit;
(b) lim_{k→∞} ||1^T ∇f(x^k)|| = 0, and any limit point of {x^k} is a stationary point of problem (3);
(c) in addition, if there exists an isolated accumulation point, then {x^k} converges.

In the proof of Theorem 2, we will establish

  Σ_{k=0}^∞ ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||² < ∞,

which implies that the running best rate of the sequence {||x^{k+1} − x^k||²} is o(1/(1+k)^{1+ϵ}).

Theorem 2 shows that the objective sequence converges and that any limit point of {x^k} is a stationary point of the original problem. However, there is no result on the convergence rate of the objective sequence to an optimal value, and it is generally difficult to obtain such a rate without convexity. Although our primary focus is nonconvexity, next we assume convexity and present the objective convergence rate, which has an interesting relation with ϵ.

Even if the f_i's are convex, the solution to (3) may be non-unique. Thus, let X* be the set of solutions to (3). Given x̄^k, we pick the solution x^opt = Proj_{X*}(x̄^k) ∈ X*. Also let f^opt = f(x^opt) be the optimal value of (1). Define the ergodic objective:

  f̃_k := ( Σ_{j=0}^k α_j f(x̄^{j+1}) ) / ( Σ_{j=0}^k α_j ),   (18)

where x̄^{j+1} = (1/n) 1^T x^{j+1}. Obviously,

  f̃_k ≥ min_{j=1,...,k+1} f(x̄^j).   (19)

Proposition 4 (Convergence rates under convexity). Let Assumptions 1, 2, and 3 hold. Let DGD use step sizes (16). If λ_n(W) > 0 and each f_i is convex, then {f̃_k} defined in (18) converges to the optimal objective value f^opt at the following rates:
(a) if 0 < ϵ < 1/2, the rate is O(1/k^ϵ);
(b) if ϵ = 1/2, the rate is O(ln k/√k);
(c) if 1/2 < ϵ < 1, the rate is O(1/k^{1−ϵ});
(d) if ϵ = 1, the rate is O(1/ln k).

The convergence rate established in Proposition 4 is almost as good as O(1/√k) when ϵ = 1/2. As ϵ goes to either 0 or 1, the rates become slower, so ϵ = 1/2 may be the optimal choice in terms of the convergence rate. However, by Proposition 3, a larger ϵ implies a faster consensus rate. Therefore, there is a tradeoff in choosing an appropriate ϵ in practical implementations of DGD.

Remark 3. A related algorithm is the perturbed push-sum algorithm, also called subgradient-push, which was proposed in [5] for the average consensus problem over time-varying networks. Its convergence in the convex setting was developed in [4]. Recently, its convergence to a critical point in the nonconvex setting was established in [57] under some regularity assumptions. Moreover, by utilizing perturbations on the update process and the assumption that no saddle points exist, almost sure convergence of its perturbed variant to a local minimum was also shown in [57].

Remark 4. Another recent algorithm is decentralized stochastic gradient descent (D-PSGD) [33], which supports nonconvex large-sum objectives. An ergodic convergence rate of order O(1/k), with a constant depending on n, was established assuming k is sufficiently large and the step size α is sufficiently small. When applied to the setting of this paper, [33, Theorem 1] implies that the sequence {(1/k) Σ_{j=0}^{k−1} ||(1/n) 1^T ∇f(x^j)||²} converges to zero at the rate O(1/k) if the step size satisfies 0 < α < (1−ζ)²/(6√n L_f), where ζ is defined in (10). From Theorem 1, we can also establish such an O(1/k) ergodic convergence rate of DGD as long as 0 < α < (1+λ_n(W))/L_f. Similar rates of convergence to a stationary point have also been shown for different nonconvex algorithms in [8], [57], [8].
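The ergodic quantity (18) is simply an α-weighted running average of the objective values at the averaged iterates. A minimal sketch of this bookkeeping, with hypothetical objective values, follows.

```python
import numpy as np

def ergodic_objective(f_avg_vals, eps):
    """Weighted running average (18): f_avg_vals[j] holds f(xbar^{j+1}),
    weights are alpha_j = 1/(1+j)^eps; a sketch of the bookkeeping only."""
    alphas = 1.0 / (1.0 + np.arange(len(f_avg_vals))) ** eps
    return np.cumsum(alphas * f_avg_vals) / np.cumsum(alphas)

vals = np.array([4.0, 2.5, 1.8, 1.4, 1.2])   # hypothetical f(xbar^{j+1}) values
print(ergodic_objective(vals, 0.5))          # nonincreasing for decreasing inputs
```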
C. Convergence results of Prox-DGD

Similarly, we consider the convergence of Prox-DGD with both a fixed step size and decreasing step sizes. The iteration (7) can be reformulated as

  x^{k+1} = prox_{α r}( x^k − α ∇L_α(x^k) ),   (20)

based on which we define the Lyapunov function

  L̂_α(x) := L_α(x) + r(x),   (21)

where we recall L_α(x) = Σ_{i=1}^n f_i(x_i) + (1/(2α)) ||x||²_{I−W}. Then (20) is clearly the forward-backward splitting (a.k.a. prox-gradient) iteration for

  minimize_x  L̂_α(x).

Specifically, (20) first performs gradient descent on the differentiable function L_α(x) and then computes the proximal map of r(x). To analyze Prox-DGD, we revise Assumption 1 as follows.

Assumption 4 (Composite objective). The objective function of (6) satisfies the following:
1) each f_i is Lipschitz differentiable with constant L_i > 0;
2) each f_i + r_i is proper, lower semi-continuous, and coercive.
As before, Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable for L_f ≤ max_i L_i.

1) Convergence results of Prox-DGD with a fixed step size: Based on the above assumptions, we can obtain the global convergence of Prox-DGD as follows.

Theorem 3 (Global convergence of Prox-DGD). Let {x^k} be the sequence generated by Prox-DGD (7), where the step size α satisfies 0 < α < (1+λ_n(W))/L_f when the r_i's are convex,

and 0 < α < λ_n(W)/L_f when the r_i's are not necessarily convex (this case requires λ_n(W) > 0). Let Assumptions 2 and 4 hold. Then {x^k} has at least one accumulation point x*, and any accumulation point is a stationary point of L̂_α(x). Furthermore, the running best rates of the sequences {||x^{k+1} − x^k||²}, {||g^{k+1}||²}, and {||(1/n) 1^T ∇f(x^k) + (1/n) 1^T ξ^k||²}, where g^{k+1} is defined in Lemma 8 and ξ^k is defined in Lemma 9, are o(1/k). The convergence rate of the sequence {(1/k) Σ_{j=0}^{k−1} ||(1/n) 1^T (∇f(x^j) + ξ^j)||²} is O(1/k). In addition, if L̂_α satisfies the KL property at an accumulation point x*, then {x^k} converges to x*.

The rate of convergence of Prox-DGD can also be established by leveraging the KL property.

Proposition 5 (Rate of convergence of Prox-DGD). Under the assumptions of Theorem 3, suppose that L̂_α satisfies the KL inequality at an accumulation point x* with φ(s) = c s^{1−θ} for some constant c > 0. Then the following hold:
(a) If θ = 0, x^k converges to x* in finitely many iterations.
(b) If θ ∈ (0, 1/2], ||x^k − x*|| ≤ C τ^k for all k ≥ k̂, for some k̂ > 0, C > 0, τ ∈ [0, 1).
(c) If θ ∈ (1/2, 1), ||x^k − x*|| ≤ C k^{−(1−θ)/(2θ−1)} for all k ≥ k̂, for certain k̂ > 0, C > 0.

2) Convergence of Prox-DGD with decreasing step sizes: In Prox-DGD, we also use the decreasing step sizes (16). To investigate its convergence, the bounded gradient Assumption 3 is revised as follows.

Assumption 5 (Bounded composite subgradient). For each i, ||∇f_i|| is uniformly bounded by some constant B_i > 0, i.e., ||∇f_i(x)|| ≤ B_i for any x ∈ R^p. Moreover, ||ξ_i|| ≤ B_{r_i} for any ξ_i ∈ ∂r_i(x) and x ∈ R^p, i = 1, ..., n. Let B := Σ_{i=1}^n (B_i + B_{r_i}). Then ||∇f(x) + ξ||, where ξ ∈ ∂r(x), is uniformly bounded by B for any x ∈ R^{n×p}.

Note that the same assumption is used to analyze the convergence of the distributed proximal-gradient method in the convex setting [8], [0], and it is also widely used to analyze the convergence of nonconvex decentralized algorithms as in [35], [36]. In light of Lemma 9 below, the claims in Proposition 3 and Corollary 1 also hold for Prox-DGD.

Proposition 6 (Asymptotic consensus and rate). Let Assumptions 2 and 5 hold. In Prox-DGD, use the step sizes (16). Then

  ||x^k − x̄^k|| ≤ C ||x^0|| ζ^k + CB Σ_{j=0}^{k−1} α_j ζ^{k−1−j},

and ||x^k − x̄^k|| converges to 0 at the rate of O(1/(1+k)^ϵ). Moreover, let x* be any global solution of problem (6). Then ||x^k − x*||_{I−W} = ||x^k||_{I−W} = ||x^k − x̄^k||_{I−W} converges to 0 at the rate of O(1/(1+k)^ϵ).

For any x ∈ R^{n×p}, define s(x) := Σ_{i=1}^n ( f_i(x_i) + r_i(x_i) ). Let X* be the set of solutions of (6), x^opt = Proj_{X*}(x̄^k) ∈ X*, and s^opt = s(x^opt) the optimal value of (6). Define

  s̃_k := ( Σ_{j=0}^k α_j s(x̄^{j+1}) ) / ( Σ_{j=0}^k α_j ).   (22)

Theorem 4 (Convergence and rate). Let Assumptions 2, 4, and 5 hold. In Prox-DGD, use the step sizes (16). Then
(a) {L̂_{α_k}(x^k)} and {Σ_{i=1}^n ( f_i(x_i^k) + r_i(x_i^k) )} converge to the same limit;
(b) Σ_{k=0}^∞ ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||² < ∞ when the r_i's are convex; or Σ_{k=0}^∞ ( λ_n(W)/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||² < ∞ when the r_i's are not necessarily convex (this case requires λ_n(W) > 0);
(c) if {ξ^k} satisfies ||ξ^{k+1} − ξ^k|| ≤ L_r ||x^{k+1} − x^k|| for each k > k_0, some constant L_r > 0, and a sufficiently large integer k_0 > 0, then lim_{k→∞} ||1^T ( ∇f(x^{k+1}) + ξ^{k+1} )|| = 0, where ξ^{k+1} ∈ ∂r(x^{k+1}) is the one determined by the proximal operator (8), and any limit point is a stationary point of problem (6);
(d) in addition, if there exists an isolated accumulation point, then {x^k} converges;
(e) furthermore, if the f_i's and r_i's are convex and λ_n(W) > 0, then the claims on the rates of {f̃_k} in Proposition 4 hold for the sequence {s̃_k} defined in (22).

Theorem 4(b) implies that the running best rate of ||x^{k+1} − x^k||² is o(1/(1+k)^{1+ϵ}). The additional condition imposed on {ξ^k} in Theorem 4(c) is a type of restricted continuity of the subgradient ∂r with respect to the generated sequence.
If ∂r is locally Lipschitz continuous in a neighborhood of a limit point, then such a Lipschitz condition on {ξ^k} can generally be satisfied: since {x^k} is asymptotically regular, x^k will lie in such a neighborhood of this limit point when k is sufficiently large. There are many kinds of proximable functions satisfying this assumption, as studied in [66] (see Remark 5 for details). Theorem 4(e) gives the convergence rates of Prox-DGD in the convex setting.

Remark 5. A typical proximable function r satisfying the assumption of Theorem 4(c) is the ℓ_q quasi-norm (0 ≤ q < 1) widely studied in sparse optimization, which takes the form r(x) = Σ_{i=1}^p |x_i|^q (when q = 0, we set 0^0 := 0). From [] and [66], there is a positive lower bound on the absolute values of the nonzero components of the solutions of the ℓ_q-regularized optimization problem. Furthermore, as shown by [66, Property (b)], the sequence generated by Prox-DGD also has a similar lower-bound property. Moreover, by Theorem 4(b), we have ||x^{k+1} − x^k|| → 0 as k → ∞. Together with the lower-bound property, we can easily obtain finite support and sign convergence of {x^k}; that is, the supports and signs of the nonzero components freeze for sufficiently large k. When restricted to this nonzero subspace, the gradient of r_i(u) = |u|^q is Lipschitz continuous for any |u| ≥ τ and some positive constant τ, where τ denotes the lower bound. Besides the ℓ_q quasi-norm, some other typical cases, like SCAD [6] and MCP [67] widely used in statistical learning, satisfy condition (c) of this theorem.

Remark 6. One algorithm tightly related to Prox-DGD is the projected stochastic gradient descent (Proj SGD) method proposed by [4] for solving the constrained multi-agent optimization problem with a convex constraint set. When restricted to the deterministic case as studied in this paper, the convergence results of Proj SGD are very similar to those of Prox-DGD (see Theorem 4(c)-(d) in this paper and [4]). However, there are some differences between [4] and this paper. In short, Proj SGD in [4] uses convex constraints, which correspond to setting r(x) in our paper to indicator functions of those convex sets. Our paper also considers nonconvex functions like the ℓ_q quasi-norm (0 ≤ q < 1), SCAD, and MCP, which are widely used in statistical learning. Another difference is that Proj SGD of [4] uses adapt-then-combine (ATC), while Prox-DGD of this paper uses combine-then-adapt (CTA). By an assumption of [4], Proj SGD uses decreasing step sizes like O(1/k^ϵ) for some ϵ > 1/2. We study the step sizes α_k = O(1/k^ϵ) for any 0 < ϵ ≤ 1 for Prox-DGD, as well as a fixed step size.
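As one concrete instance of a nonconvex set with an easy projection (cf. the constraint examples in Section I and the contrast with the convex constraint sets of Proj SGD), consider the sphere: the prox of its indicator function is the following projection. This is a sketch of our own; the choice at y = 0 is arbitrary, since every point of the sphere is a valid projection there.

```python
import numpy as np

def proj_sphere(y, radius=1.0):
    """Projection onto the nonconvex set {x : ||x|| = radius}, i.e., the
    prox of its indicator function; cheap despite nonconvexity."""
    nrm = np.linalg.norm(y)
    if nrm == 0.0:
        e = np.zeros_like(y)
        e.flat[0] = radius   # arbitrary tie-break at the origin
        return e
    return (radius / nrm) * y

print(proj_sphere(np.array([3.0, 4.0])))  # [0.6 0.8]
```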

IV. RELATED WORKS AND DISCUSSIONS

We summarize some recent nonconvex decentralized algorithms in Table III. Most of them apply to either the smooth optimization problem (1) or the composite optimization problem (2) and use diminishing step sizes. Although (1) is a special case of (2) obtained by letting r_i(x) ≡ 0, there are still differences in both algorithm design and theoretical analysis. Therefore, we present the comparisons separately.

We first discuss the algorithms for (1). In [57], the authors proved the convergence of perturbed push-sum for nonconvex (1) under some regularity assumptions. They also introduced random perturbations to avoid local minima. The network considered in [57] is time-varying and directed, and specific column stochastic matrices and diminishing step sizes are used. The convergence results for the deterministic perturbed push-sum algorithm obtained in [57] are similar to those of DGD developed in this paper under similar assumptions (see Theorem 2 above and [57, Theorem 3]). Detailed comparisons between the two algorithms are given in Remark 3. In [33], sublinear convergence to a stationary point of the D-PSGD algorithm was developed in the nonconvex setting; the DGD studied in this paper can be viewed as a special D-PSGD with zero variance. In [8], a primal-dual approximate gradient algorithm called ZENITH was developed for (1). The convergence of ZENITH was given in terms of the expected constraint violation under the Lipschitz differentiability assumption and other assumptions. The last one is the proximal primal-dual algorithm Prox-PDA recently proposed in []. The O(1/k) rate of convergence to a stationary point was established in []. Later, a perturbed variant of Prox-PDA was proposed in [] for the constrained composite (smooth+nonsmooth) optimization problem with a linear equality constraint.

Table III includes three algorithms for solving the composite problem (2), which are related to ours. All of them only deal with convex r_i, whereas r_i in this paper can also be nonconvex. In [36], the authors proposed NEXT based on the earlier successive convex approximation (SCA) technique. The original form of this algorithm, push-sum, was proposed in [5] for the average consensus problem. It was modified and analyzed in [4] for the convex consensus optimization problem over time-varying directed graphs. The iterates of NEXT include two stages: a local SCA stage to update local variables and a consensus update stage to fuse information between agents. NEXT has convergence results similar to those of Prox-DGD under diminishing step sizes.
Another interesting algorithm is decentralized Frank-Wolfe (DeFW), proposed in [6] for nonconvex, smooth, constrained decentralized optimization, where a bounded convex constraint set is imposed. There are three steps at each iteration of DeFW: average gradient computation, local variable evaluation by Frank-Wolfe, and information fusion between agents. In [6], the authors established convergence results similar to those of Prox-DGD under diminishing step sizes. A stochastic version of DeFW has also been developed in [7] for high-dimensional convex sparse optimization. The last one is the projected stochastic gradient algorithm Proj SGD [4] for constrained, nonconvex, smooth consensus optimization with a convex constraint set. A detailed comparison between Proj SGD and Prox-DGD is given in Remark 6.

Based on the above analysis, the convergence results of DGD and Prox-DGD with decreasing step sizes in this paper are comparable with most of the existing ones. However, we allow nonconvex nonsmooth r_i and are able to obtain estimates of the asymptotic consensus rates. We also establish global convergence using a fixed step size, which is otherwise only found for ZENITH.

V. NUMERICAL EXPERIMENTS

In this section, we describe a set of numerical experiments, mainly to verify our theoretical findings for DGD and Prox-DGD. For comparisons between DGD or Prox-DGD and other existing algorithms, we refer to the literature such as [35], [], [], [9].

A. Convergence of DGD

We verify the performance of DGD using both fixed and diminishing step sizes in an experimental setting identical to that of [57]. The following one-dimensional decentralized optimization problem is considered in [57]:

  minimize_{x ∈ R}  f(x) = f_1(x) + f_2(x) + f_3(x),

where each f_i is a nonconvex piecewise-polynomial function: a cubic or quadratic on a bounded interval, extended linearly outside of it so that its gradient is bounded (the exact coefficients follow the setting of [57]). We plot the function f over the intervals [−5, 5] and [−6, 6] in Fig. 1; it attains its global minimizer x^opt, a nearby local minimizer, and a local maximizer between them.

Fig. 1. Plots of the objective function f: (a) over [−5, 5]; (b) over [−6, 6].
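The sketch below reproduces the skeleton of this experiment. Since the exact piecewise objectives of [57] are not reproduced here, grad_fi uses smooth nonconvex stand-ins of our own (double-well functions with clipped, hence bounded, gradients), and the mixing matrix is the one given below; only the procedure, not the reported numbers, matches the experiment.

```python
import numpy as np

def grad_fi(x, i):
    # Stand-in nonconvex local gradients: double wells shifted per agent,
    # clipped so the gradients are bounded (Assumption 3).
    s = (-2.0, 0.0, 2.0)[i]
    return np.clip(4.0 * (x - s) ** 3 - 4.0 * (x - s), -50.0, 50.0)

W = 0.5 * np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
x = np.zeros(3)                        # all agents initialized at 0
for k in range(5000):
    alpha = 0.5 / (1.0 + k) ** 0.5     # diminishing step size, eps = 1/2
    g = np.array([grad_fi(x[i], i) for i in range(3)])
    x = W @ x - alpha * g
print(x, x.max() - x.min())  # the spread decays at the O(alpha_k) consensus rate
```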

It is easy to compute the Lipschitz constants of ∇f_1, ∇f_2, and ∇f_3 as L_1 = 88, L_2 = 53, and L_3 = 60, respectively. Thus L_f = max_i L_i = 88, which is used in our theoretical analysis. Moreover, ∇f is obviously bounded. We consider a connected network of three agents; in the experiment, the mixing matrix W is taken as

  W = (1/2) [ 1 1 0 ; 1 0 1 ; 0 1 1 ],

which has the eigenvalues 1, 0.5, and −0.5. The mixing matrix W is not positive definite. All the assumptions used in our theory are satisfied. We test the performance of DGD with the theoretical fixed step size and several kinds of decreasing step sizes, starting the iterations from two different initial points: x^0_{(1)} := (0, 0, 0)^T and a second initialization x^0_{(2)} that is a dangerous point, since it is close to the local maximizer and lies between the two local minimizers (one of which is global). The experiment results are reported in Fig. 2. From these figures, DGD successfully converges to the global minimizer and achieves consensus from both initial points, even though the objective function is nonconvex.

Fig. 2. The convergence behaviors of DGD in different cases: panels (a)-(b) use the theoretical fixed step size from the two initializations; panels (c)-(f) use diminishing step sizes of orders O(1/t) and O(1/√t) from the two initializations. In every case the three agents reach consensus at the global minimizer x^opt.

B. Prox-DGD for decentralized L_0 regularization

We apply Prox-DGD to solve the following nonconvex decentralized L_0-regularized problem:

  minimize_{x ∈ R^p}  f(x) = Σ_{i=1}^n f_i(x),

where f_i(x) = (1/2) ||B_i x − b_i||² + λ_i ||x||_0, B_i ∈ R^{m_i × p}, b_i ∈ R^{m_i} for i = 1, ..., n, and ||x||_0 is the ℓ_0 quasi-norm, which counts the nonzero components of x. In this experiment, we take n = 10, p = 256, and m_i = 50 for i = 1, ..., n. In this case, the proximal operator of the ℓ_0 quasi-norm is the well-known hard-thresholding mapping:

  ( prox_{α||·||_0}(x) )_i = { x_i, if |x_i| > √(2α);  0, otherwise. }
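A compact sketch of this experiment follows. It is ours: the data are synthetic, the mixing matrix is a stand-in positive definite doubly stochastic choice (λ_n(W) > 0 is required for nonconvex r_i), and the step size is kept below the Theorem 3 bound λ_n(W)/L_f for typical draws of these data (L_f ≈ 530 here). With a fixed step size, the consensus error settles at a small nonzero value, matching the behavior discussed after Fig. 4 below.

```python
import numpy as np
rng = np.random.default_rng(1)

n, p, m, lam = 10, 256, 50, 0.5
B = rng.standard_normal((n, m, p))
xs = np.zeros(p); xs[:8] = 1.0                       # a sparse ground truth
b = np.einsum('imp,p->im', B, xs)

W = 0.8 * np.eye(n) + np.full((n, n), 0.02)          # doubly stochastic, lambda_n(W) = 0.8
grad = lambda x: np.einsum('imp,im->ip', B,
                           np.einsum('imp,ip->im', B, x) - b)   # rows: B_i^T(B_i x_i - b_i)
hard = lambda y, a: np.where(np.abs(y) > np.sqrt(2.0 * a), y, 0.0)

alpha = 0.001                                        # fixed step, below lambda_n(W)/L_f
x = np.zeros((n, p))
for k in range(2000):
    x = hard(W @ x - alpha * grad(x), alpha * lam)   # Prox-DGD step (7)
print((x[0] != 0).sum())                             # sparsity of agent 1's estimate
print(np.linalg.norm(x - x.mean(0), axis=1).max())   # consensus error (small, nonzero)
```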

We do not consider the model selection problem, but simply test the performance of Prox-DGD on a given deterministic model. We take λ_i = 0.5 for each agent i. The connected network is shown in Fig. 3.

Fig. 3. The network used in the decentralized L_0 regularization problem.

We use several step-size strategies, including three fixed step sizes and two decreasing step sizes, to test Prox-DGD. The initialization in all cases is 0. The experiment results are illustrated in Fig. 4.

Fig. 4. The convergence and consensus rates of Prox-DGD under different kinds of step sizes for decentralized L_0 regularization: (a) convergence rates, fixed step sizes; (b) consensus rates, fixed step sizes; (c) convergence rates, decreasing step sizes; (d) consensus rates, decreasing step sizes. In these experiments, we use the iterate x^{10000} in place of the underlying x* to compute the convergence rates; x^t_ave := (1/n) 1^T x^t denotes the average over the n agents at the tth iteration.

By Fig. 4(a) and (b), when a fixed step size is adopted (sufficiently small to satisfy the theoretical restriction), a larger fixed step size generally implies a faster convergence rate — convergence, however, is to a stationary point of the associated Lyapunov-function minimization problem, not of the original problem — while causing a larger consensus error. These phenomena are reasonable and further verify the results established in Proposition 1. It can also be observed from Fig. 4(a) that Prox-DGD performs at an almost linear rate for all three fixed step sizes. This is expected, since the L_0 regularized objective satisfies the Kurdyka-Łojasiewicz (KL) inequality with φ(s) = c s^{1−θ} for θ ∈ (0, 1/2] (see []). Thus, by Proposition 5(b), Prox-DGD has eventual linear convergence. Once the initial guess (in this case, 0 may be a good initial guess) is good enough, Prox-DGD starts decaying linearly from the early iterations. Similar phenomena can also be observed in Fig. 4(c) and (d) under the decreasing step sizes. Specifically, if α_t = O(1/t^ϵ), a smaller ϵ (larger step sizes) generally implies a faster convergence rate but a slower consensus rate. This indeed agrees with Proposition 6, which states that a larger ϵ generally means a faster consensus rate. Moreover, from Fig. 4(b) and (d), under a fixed step size the consensus error does not vanish but settles at a nonzero value, which can be overcome by using decreasing step sizes. This is the main motivation for using decreasing step sizes.

VI. PROOFS

In this section, we present the proofs of our main theorems and propositions.

A. Proof of Theorem 1

The sketch of the proof is as follows: DGD is interpreted as the gradient descent algorithm applied to the Lyapunov function L_α, following the argument in [65]; then, the properties of sufficient descent, lower boundedness, and bounded gradients are established for the sequence {L_α(x^k)}, giving subsequence convergence of the DGD iterates; finally, whole-sequence convergence of the DGD iterates follows from the KL property of L_α.

Lemma 1 (Gradient descent interpretation). The sequence {x^k} generated by the DGD iteration (5) is the same sequence generated by applying gradient descent with fixed step size α to the objective function L_α(x).

A proof of this lemma is given in [65]; it is based on reformulating (5) as the iteration

  x^{k+1} = x^k − α ( ∇f(x^k) + (1/α)(I − W) x^k ) = x^k − α ∇L_α(x^k).   (23)

Although the sequence {x^k} generated by the DGD iteration (5) can be interpreted as a centralized gradient descent sequence for the function L_α(x), it is different from gradient descent applied to the original problem (3).

Lemma 2 (Sufficient descent of {L_α(x^k)}). Let Assumptions 1 and 2 hold, and set the step size 0 < α < (1+λ_n(W))/L_f. Then

  L_α(x^{k+1}) ≤ L_α(x^k) − ( (1+λ_n(W))/(2α) − L_f/2 ) ||x^{k+1} − x^k||²,  ∀k ∈ N.   (24)
Proof: From x^{k+1} = x^k − α ∇L_α(x^k), it follows that

  ⟨∇L_α(x^k), x^{k+1} − x^k⟩ = −(1/α) ||x^{k+1} − x^k||².   (25)

Since Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable, ∇L_α is Lipschitz continuous with constant L := L_f + (1/α) λ_max(I − W) = L_f + (1 − λ_n(W))/α, implying

  L_α(x^{k+1}) ≤ L_α(x^k) + ⟨∇L_α(x^k), x^{k+1} − x^k⟩ + (L/2) ||x^{k+1} − x^k||².   (26)

Combining (25) and (26) yields (24).

Lemma 3 (Boundedness). Under Assumptions 1 and 2, if 0 < α < (1+λ_n(W))/L_f, then the sequence {L_α(x^k)} is lower bounded, and the sequence {x^k} is bounded, i.e., there exists a constant B̂ > 0 such that ||x^k|| ≤ B̂ for all k.

Proof: The lower boundedness of L_α(x^k) follows from the lower boundedness of each f_i, which is proper and coercive (Assumption 1, part 2). By Lemma 2 and the choice of α, L_α(x^k) is nonincreasing and hence upper bounded by L_α(x^0) < +∞. Hence, 1^T f(x^k) ≤ L_α(x^k) ≤ L_α(x^0) implies that x^k is bounded, due to the coercivity of 1^T f(x) (Assumption 1, part 2).

From Lemmas 2 and 3, we immediately obtain the following lemma.

Lemma 4 (Summability and asymptotic regularity). It holds that Σ_{k=0}^∞ ||x^{k+1} − x^k||² < +∞ and that ||x^{k+1} − x^k|| → 0 as k → ∞.

(A sequence {a^k} is said to be asymptotically regular if ||a^{k+1} − a^k|| → 0 as k → ∞.)
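The sufficient-descent inequality (24) is easy to spot-check numerically. The sketch below — an illustration of ours, not a substitute for the proof — uses quadratic f_i so that L_f is explicit, and asserts the predicted descent at every iteration.

```python
import numpy as np
rng = np.random.default_rng(2)

n, p = 4, 2
Q = np.eye(p) * 3.0                                      # f_i(x) = x^T Q x / 2, so L_f = 3
W = np.full((n, n), 1.0 / n) * 0.4 + 0.6 * np.eye(n)     # symmetric, doubly stochastic
lam_n = np.linalg.eigvalsh(W).min()
L_f = 3.0
alpha = 0.9 * (1 + lam_n) / L_f                          # step size within the Theorem 1 range

def L_alpha(x):
    f = 0.5 * np.einsum('ip,pq,iq->', x, Q, x)           # sum_i f_i(x_i)
    return f + np.sum(x * ((np.eye(n) - W) @ x)) / (2 * alpha)

x = rng.standard_normal((n, p))
for _ in range(20):
    x_new = W @ x - alpha * (x @ Q)                      # DGD step (5)
    drop = L_alpha(x) - L_alpha(x_new)
    bound = ((1 + lam_n) / (2 * alpha) - L_f / 2) * np.sum((x_new - x) ** 2)
    assert drop >= bound - 1e-12                         # inequality (24) holds
    x = x_new
print("sufficient descent verified")
```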

From (23), the result below directly follows.

Lemma 5 (Gradient bound). ||∇L_α(x^k)|| = (1/α) ||x^{k+1} − x^k||.

Based on the above lemmas, we obtain the global convergence of DGD.

Proof of Theorem 1: By Lemma 3, the sequence {x^k} is bounded, so there exist a convergent subsequence and a limit point, denoted by {x^{k_s}}_{s∈N} → x* as s → +∞. By Lemmas 2 and 3, L_α(x^k) is monotonically nonincreasing and lower bounded; therefore L_α(x^k) → L* for some L*, and ||x^{k+1} − x^k|| → 0 as k → ∞. Based on Lemma 5, ∇L_α(x^k) → 0 as k → ∞; in particular, ∇L_α(x^{k_s}) → 0 as s → ∞. Hence, we have ∇L_α(x*) = 0. The running best rate of the sequence {||x^{k+1} − x^k||²} follows from [3] or [6, Theorem 3.3]. By Lemma 5, the running best rate of the sequence {||∇L_α(x^k)||²} is o(1/k). By (11), ∇L_α(x^k) = ∇f(x^k) + (1/α)(I − W) x^k, which implies (1/n) 1^T ∇f(x^k) = (1/n) 1^T ∇L_α(x^k) due to 1^T (I − W) = 0. Thus,

  || (1/n) 1^T ∇f(x^k) || = || (1/n) 1^T ∇L_α(x^k) || ≤ ||∇L_α(x^k)||,

which implies that the running best rate of {||(1/n) 1^T ∇f(x^k)||²} is also o(1/k). By Lemmas 2 and 5, since ||x^{k+1} − x^k||² = α² ||∇L_α(x^k)||²,

  α² ( (1+λ_n(W))/(2α) − L_f/2 ) ||∇L_α(x^k)||² ≤ L_α(x^k) − L_α(x^{k+1}),

which implies

  (1/k) Σ_{j=0}^{k−1} ||∇L_α(x^j)||² ≤ ( L_α(x^0) − L* ) / ( k α² ( (1+λ_n(W))/(2α) − L_f/2 ) ).

Moreover, note that ||(1/n) 1^T ∇f(x^k)|| ≤ ||∇L_α(x^k)||. Thus, the convergence rate of {(1/k) Σ_{j=0}^{k−1} ||(1/n) 1^T ∇f(x^j)||²} is O(1/k). Similar to the argument in [], we can claim the global convergence of the sequence {x^k}_{k∈N} under the KL assumption on L_α.

Next, we derive a bound on the gradient sequence {∇f(x^k)}, which is used in Proposition 1.

Lemma 6. Under Assumption 1, there exists a point y* satisfying ∇f(y*) = 0, and the following bound holds:

  ||∇f(x^k)|| ≤ D := L_f ( B̂ + ||y*|| ),  ∀k ∈ N,   (27)

where B̂ is the bound on ||x^k|| given in Lemma 3.

Proof: By the lower boundedness assumption (Assumption 1, part 2), a minimizer of 1^T f(y) exists; let y* be one. Then, by the Lipschitz differentiability of each f_i (Assumption 1, part 1), we have ∇f(y*) = 0. For any k,

  ||∇f(x^k)|| = ||∇f(x^k) − ∇f(y*)|| ≤ L_f ||x^k − y*|| ≤ L_f ( B̂ + ||y*|| ),

where the last inequality uses Lemma 3. This proves the lemma.

B. Proof of Proposition 2

Proof: Note that

  ||∇L_α(x^{k+1})|| ≤ ||∇L_α(x^k)|| + ||∇L_α(x^{k+1}) − ∇L_α(x^k)||
                   ≤ (1/α) ||x^{k+1} − x^k|| + L ||x^{k+1} − x^k||
                   = ( (2 + α L_f − λ_n(W))/α ) ||x^{k+1} − x^k||,

where the second inequality holds by Lemma 5 and the Lipschitz continuity of ∇L_α with constant L = L_f + (1 − λ_n(W))/α. Thus {x^k} satisfies the so-called relative error condition listed in []. Moreover, by Lemmas 2 and 3, {x^k} also satisfies the so-called sufficient decrease and continuity conditions listed in []. Under these three conditions and the KL property of L_α at x* with φ(s) = c s^{1−θ}, following the proof of [, Lemma 2.6], there exists k_0 > 0 such that for all k ≥ k_0,

  ||x^{k+1} − x^k|| ≤ (1/2) ||x^k − x^{k−1}|| + (cb/a) [ (L_α(x^k) − L_α(x*))^{1−θ} − (L_α(x^{k+1}) − L_α(x*))^{1−θ} ],   (28)

where a := (1+λ_n(W))/(2α) − L_f/2 and b := (2 + α L_f − λ_n(W))/α. Then, an easy induction yields

  Σ_{t=k_0}^k ||x^{t+1} − x^t|| ≤ ||x^{k_0} − x^{k_0−1}|| + (cb/a) [ (L_α(x^{k_0}) − L_α(x*))^{1−θ} − (L_α(x^{k+1}) − L_α(x*))^{1−θ} ].

Following a derivation similar to the proof of [, Theorem 5], we can estimate the rate of convergence of {x^k} in the different cases of θ.

C. Proof of Proposition 3

To prove Proposition 3, we need the following lemmas.

Lemma 7 ([40, Proposition 1]). Let W^k be the kth power of W for any k ∈ N. Under Assumption 2, it holds that

  || W^k − (1/n) 1 1^T || ≤ C ζ^k   (29)

for some constant C > 0, where ζ is the second largest magnitude eigenvalue of W, as specified in (10).

Lemma 8 ([48, Lemma 3.1]). Let {γ_k} be a scalar sequence. If lim_{k→∞} γ_k = γ and 0 < β < 1, then lim_{k→∞} Σ_{l=0}^k β^{k−l} γ_l = γ/(1−β).
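Lemma 7 can be observed numerically: for the three-agent mixing matrix of Section V, the ratio ||W^k − (1/n) 1 1^T|| / ζ^k stays bounded in k. A short sketch of ours:

```python
import numpy as np

W = 0.5 * np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
n = W.shape[0]
avg = np.full((n, n), 1.0 / n)                   # the averaging matrix (1/n) 1 1^T
eigs = np.sort(np.abs(np.linalg.eigvalsh(W)))
zeta = eigs[-2]                                  # second largest magnitude, here 0.5
Wk = np.eye(n)
for k in range(1, 11):
    Wk = Wk @ W
    print(k, np.linalg.norm(Wk - avg) / zeta ** k)   # bounded, consistent with (29)
```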

Proof of Proposition 3: By the recursion (17) and the fact that (1/n) 1 1^T W = (1/n) 1 1^T, note that

  x^k − x̄^k = ( W^k − (1/n) 1 1^T ) x^0 − Σ_{j=0}^{k−1} α_j ( W^{k−1−j} − (1/n) 1 1^T ) ∇f(x^j).   (30)

Further, by Lemma 7 and Assumption 3, we obtain

  ||x^k − x̄^k|| ≤ || W^k − (1/n) 1 1^T || ||x^0|| + Σ_{j=0}^{k−1} α_j || W^{k−1−j} − (1/n) 1 1^T || ||∇f(x^j)||
              ≤ C ||x^0|| ζ^k + CB Σ_{j=0}^{k−1} α_j ζ^{k−1−j}.   (31)

Furthermore, by Lemma 8 and the step sizes (16), we get lim_{k→∞} ||x^k − x̄^k|| = 0. Let b_k := (1+k)^ϵ. To obtain the rate of ||x^k − x̄^k||, we only need to show that limsup_k b_k ||x^k − x̄^k|| ≤ C̄ for some 0 < C̄ < ∞. Let j_k := [k/2], where [x] denotes the integer part of x for any x ∈ R. Note that

  b_k ||x^k − x̄^k|| ≤ C ||x^0|| b_k ζ^k + CB b_k Σ_{j=0}^{j_k} α_j ζ^{k−1−j} + CB b_k Σ_{j=j_k+1}^{k−1} α_j ζ^{k−1−j}
                   =: T_1 + T_2 + T_3,   (32)

where the inequality holds because of (31). We estimate the three terms on the right-hand side of (32). First,

  T_1 = C ||x^0|| b_k ζ^k → 0,   (33)

since b_k ζ^k = (1+k)^ϵ ζ^k vanishes as k → ∞. Second, for any j ≤ j_k, we have k − 1 − j ≥ k/2 − 1, and α_j ≤ 1, so

  T_2 ≤ CB b_k (j_k + 1) ζ^{k/2 − 1} ≤ CB (1+k)^{1+ϵ} ζ^{k/2 − 1} → 0.   (34)

Third, for any j ≥ j_k + 1 ≥ k/2, we have b_k α_j = ( (1+k)/(1+j) )^ϵ ≤ 2^ϵ, so

  T_3 ≤ 2^ϵ CB Σ_{j=j_k+1}^{k−1} ζ^{k−1−j} ≤ 2^ϵ CB / (1 − ζ).   (35)

Combining (32)-(35), we get limsup_k b_k ||x^k − x̄^k|| ≤ 2^ϵ CB/(1 − ζ) =: C̄, which completes the proof of this proposition.

D. Proof of Theorem 2

To prove Theorem 2, we first note that, similar to (23), the DGD iterates under decreasing step sizes can be rewritten as

  x^{k+1} = x^k − α_k ∇L_{α_k}(x^k),   (42)

where L_{α_k}(x) := 1^T f(x) + (1/(2α_k)) ||x||²_{I−W}. We also need the following lemmas.

Lemma 9 ([50]). Let {v_t} be a nonnegative scalar sequence satisfying v_{t+1} ≤ (1 + b_t) v_t − u_t + c_t for all t ∈ N, where b_t ≥ 0, u_t ≥ 0, and c_t ≥ 0 with Σ_{t=0}^∞ b_t < ∞ and Σ_{t=0}^∞ c_t < ∞. Then the sequence {v_t} converges to some v ≥ 0, and Σ_{t=0}^∞ u_t < ∞.

Lemma 10. Let α_k satisfy (16). Then it holds that

  1/α_{k+1} − 1/α_k ≤ ϵ/(1+k)^{1−ϵ}.

Proof: We first prove that

  (1 + x)^ϵ ≤ 1 + ϵx,  ∀x ∈ [0, 1].   (43)

Let g(x) = (1+x)^ϵ − 1 − ϵx. Its derivative g'(x) = ϵ(1+x)^{ϵ−1} − ϵ ≤ 0 for x ≥ 0, which implies g(x) ≤ g(0) = 0 for any x ∈ [0, 1]; that is, inequality (43) holds. Note that

  1/α_{k+1} − 1/α_k = (2+k)^ϵ − (1+k)^ϵ = (1+k)^ϵ [ (1 + 1/(1+k))^ϵ − 1 ] ≤ (1+k)^ϵ · ϵ/(1+k) = ϵ/(1+k)^{1−ϵ},   (44)

where the last inequality holds by (43).

The following lemma shows that {(1/(2α_{k+1}) − 1/(2α_k)) ||x^{k+1}||²_{I−W}} is summable.

Lemma 11. Let Assumptions 1, 2, and 3 hold. In DGD, use the step sizes α_k in (16). Then {(1/(2α_{k+1}) − 1/(2α_k)) ||x^{k+1}||²_{I−W}} is summable, i.e.,

  Σ_{k=0}^∞ ( 1/(2α_{k+1}) − 1/(2α_k) ) ||x^{k+1}||²_{I−W} < ∞.

Proof: Note that

  ||x^{k+1}||²_{I−W} = ||x^{k+1} − x̄^{k+1}||²_{I−W} ≤ (1 − λ_n(W)) ||x^{k+1} − x̄^{k+1}||².   (45)

By Lemma 10,

  ( 1/(2α_{k+1}) − 1/(2α_k) ) ||x^{k+1}||²_{I−W} ≤ ( ϵ/(2(1+k)^{1−ϵ}) ) (1 − λ_n(W)) ||x^{k+1} − x̄^{k+1}||².   (46)

Furthermore, by (46) and Proposition 3, the sequence {(1/(2α_{k+1}) − 1/(2α_k)) ||x^{k+1}||²_{I−W}} converges to 0 at the rate of O(1/(1+k)^{1+ϵ}), which implies that it is summable, i.e., Σ_{k=0}^∞ (1/(2α_{k+1}) − 1/(2α_k)) ||x^{k+1}||²_{I−W} < ∞.

Lemma 12 (Convergence of weakly summable sequences). Let {β_k} and {γ_k} be two nonnegative scalar sequences such that
(a) γ_k = 1/(1+k)^ϵ for some ϵ ∈ (0, 1] and all k ∈ N;
(b) Σ_{k=0}^∞ γ_k β_k < ∞;
(c) |β_{k+1} − β_k| ≲ γ_k, meaning that |β_{k+1} − β_k| ≤ M γ_k for some constant M > 0.
Then lim_{k→∞} β_k = 0.

We call a sequence {β_k} satisfying conditions (a) and (b) of Lemma 12 weakly summable, since {β_k} itself is not necessarily summable but becomes summable after multiplication by another non-summable, diminishing sequence {γ_k}. It is generally impossible to claim that β_k converges to 0. However, if the distance between two successive elements of {β_k} is of the same order as the multiplier sequence γ_k, then we can claim the convergence of β_k. A special case with ϵ = 1/2 has been observed in [].

Proof: By condition (b), we have

  Σ_{i=k}^{k+k̄} γ_i β_i → 0   (47)

as k → ∞, for any k̄ ∈ N. In the following, we show lim_k β_k = 0 by contradiction. Assume this is not the case, i.e., β_k does not converge to 0; then limsup_k β_k ≥ C for some C > 0. Thus, for every N > 0, there exists a k > N such that β_k > C/2. Let

  k̄ := [ C (1+k)^ϵ / (4M) ],

where [x] denotes the integer part of x for any x ∈ R. By condition (c), i.e., |β_{j+1} − β_j| ≤ M γ_j for any j ∈ N, and since γ_j ≤ γ_k for j ≥ k, we obtain

  β_{k+i} ≥ C/2 − k̄ M γ_k ≥ C/4,  ∀ i ∈ {0, 1, ..., k̄}.   (48)

Hence,

  Σ_{j=k}^{k+k̄} γ_j β_j ≥ (C/4) Σ_{j=k}^{k+k̄} 1/(1+j)^ϵ ≥ (C/4) ∫_k^{k+k̄+1} dx/(1+x)^ϵ
    = { (C/4) ( (2+k+k̄)^{1−ϵ} − (1+k)^{1−ϵ} ) / (1−ϵ),  ϵ ∈ (0, 1),
        (C/4) ( ln(2+k+k̄) − ln(1+k) ),  ϵ = 1. }   (49)

When ϵ ∈ (0, 1), since k̄ grows like (C/(4M))(1+k)^ϵ, the right-hand side of (49) is bounded below by a positive constant for all sufficiently large k. When ϵ = 1, note that by the specific form of k̄,

  ln(2+k+k̄) − ln(1+k) = ln( 1 + (1+k̄)/(1+k) ) → ln( 1 + C/(4M) ),

which is a positive constant. As a consequence, Σ_{j=k}^{k+k̄} γ_j β_j does not go to 0 as k → ∞, which contradicts (47). Therefore, lim_k β_k = 0.

Proof of Theorem 2: We first develop the inequality

  L_{α_{k+1}}(x^{k+1}) ≤ L_{α_k}(x^k) + ( 1/(2α_{k+1}) − 1/(2α_k) ) ||x^{k+1}||²_{I−W}
                      − ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||²,   (50)

and then claim the convergence of the sequences {L_{α_k}(x^k)}, {1^T f(x^k)}, and {x^k} based on this inequality.

(a) Development of (50): From x^{k+1} = x^k − α_k ∇L_{α_k}(x^k), it follows that

  ⟨∇L_{α_k}(x^k), x^{k+1} − x^k⟩ = −(1/α_k) ||x^{k+1} − x^k||².   (51)

Since Σ_{i=1}^n f_i(x_i) is L_f-Lipschitz differentiable, ∇L_{α_k} is Lipschitz continuous with constant L_k := L_f + (1/α_k) λ_max(I − W) = L_f + (1 − λ_n(W))/α_k, implying

  L_{α_k}(x^{k+1}) ≤ L_{α_k}(x^k) + ⟨∇L_{α_k}(x^k), x^{k+1} − x^k⟩ + (L_k/2) ||x^{k+1} − x^k||²
                  = L_{α_k}(x^k) − ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||².   (52)

Moreover,

  L_{α_{k+1}}(x^{k+1}) = L_{α_k}(x^{k+1}) + ( 1/(2α_{k+1}) − 1/(2α_k) ) ||x^{k+1}||²_{I−W}.   (53)

Combining (52) and (53) yields (50).

(b) Convergence of the objective sequence: By Lemma 11 and Lemma 9, (50) yields the convergence of {L_{α_k}(x^k)} and

  Σ_{k=0}^∞ ( (1+λ_n(W))/(2α_k) − L_f/2 ) ||x^{k+1} − x^k||² < ∞,   (54)

which implies that ||x^{k+1} − x^k||² converges to 0 at the running best rate of o(1/(1+k)^{1+ϵ}) and that {x^k} is asymptotically regular. Moreover, notice that

  (1/(2α_k)) ||x^k||²_{I−W} = (1/(2α_k)) ||x^k − x̄^k||²_{I−W} ≤ ( (1 − λ_n(W))/2 ) (1+k)^ϵ ||x^k − x̄^k||².


More information

On the linear convergence of distributed optimization over directed graphs

On the linear convergence of distributed optimization over directed graphs 1 On the linear convergence of distributed optimization over directed graphs Chenguang Xi, and Usman A. Khan arxiv:1510.0149v4 [math.oc] 7 May 016 Abstract This paper develops a fast distributed algorithm,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow

More information

Convex Analysis and Optimization Chapter 2 Solutions

Convex Analysis and Optimization Chapter 2 Solutions Convex Analysis and Optimization Chapter 2 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

On proximal-like methods for equilibrium programming

On proximal-like methods for equilibrium programming On proximal-lie methods for equilibrium programming Nils Langenberg Department of Mathematics, University of Trier 54286 Trier, Germany, langenberg@uni-trier.de Abstract In [?] Flam and Antipin discussed

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

arxiv: v3 [math.oc] 8 Jan 2019

arxiv: v3 [math.oc] 8 Jan 2019 Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

On Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging

On Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging On Stochastic Subgradient Mirror-Descent Algorithm with Weighted Averaging arxiv:307.879v [math.oc] 7 Jul 03 Angelia Nedić and Soomin Lee July, 03 Dedicated to Paul Tseng Abstract This paper considers

More information

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 2005, DR RAPHAEL HAUSER 1. The Quasi-Newton Idea. In this lecture we will discuss

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information

Distributed Consensus Optimization

Distributed Consensus Optimization Distributed Consensus Optimization Ming Yan Michigan State University, CMSE/Mathematics September 14, 2018 Decentralized-1 Backgroundwhy andwe motivation need decentralized optimization? I Decentralized

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

Constrained Consensus and Optimization in Multi-Agent Networks

Constrained Consensus and Optimization in Multi-Agent Networks Constrained Consensus Optimization in Multi-Agent Networks The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher

More information

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties

More information

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Convergence rates for distributed stochastic optimization over random networks

Convergence rates for distributed stochastic optimization over random networks Convergence rates for distributed stochastic optimization over random networs Dusan Jaovetic, Dragana Bajovic, Anit Kumar Sahu and Soummya Kar Abstract We establish the O ) convergence rate for distributed

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

arxiv: v2 [math.oc] 5 May 2018

arxiv: v2 [math.oc] 5 May 2018 The Impact of Local Geometry and Batch Size on Stochastic Gradient Descent for Nonconvex Problems Viva Patel a a Department of Statistics, University of Chicago, Illinois, USA arxiv:1709.04718v2 [math.oc]

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

arxiv: v1 [stat.ml] 12 Nov 2015

arxiv: v1 [stat.ml] 12 Nov 2015 Random Multi-Constraint Projection: Stochastic Gradient Methods for Convex Optimization with Many Constraints Mengdi Wang, Yichen Chen, Jialin Liu, Yuantao Gu arxiv:5.03760v [stat.ml] Nov 05 November 3,

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

A derivative-free nonmonotone line search and its application to the spectral residual method

A derivative-free nonmonotone line search and its application to the spectral residual method IMA Journal of Numerical Analysis (2009) 29, 814 825 doi:10.1093/imanum/drn019 Advance Access publication on November 14, 2008 A derivative-free nonmonotone line search and its application to the spectral

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Bootstrap Percolation on Periodic Trees

Bootstrap Percolation on Periodic Trees Bootstrap Percolation on Periodic Trees Milan Bradonjić Iraj Saniee Abstract We study bootstrap percolation with the threshold parameter θ 2 and the initial probability p on infinite periodic trees that

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

Chapter 2. Optimization. Gradients, convexity, and ALS

Chapter 2. Optimization. Gradients, convexity, and ALS Chapter 2 Optimization Gradients, convexity, and ALS Contents Background Gradient descent Stochastic gradient descent Newton s method Alternating least squares KKT conditions 2 Motivation We can solve

More information

Decentralized Consensus Optimization with Asynchrony and Delay

Decentralized Consensus Optimization with Asynchrony and Delay Decentralized Consensus Optimization with Asynchrony and Delay Tianyu Wu, Kun Yuan 2, Qing Ling 3, Wotao Yin, and Ali H. Sayed 2 Department of Mathematics, 2 Department of Electrical Engineering, University

More information

Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity

Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity 1 Accelerated Distributed Dual Averaging over Evolving Networks of Growing Connectivity Sijia Liu, Member, IEEE, Pin-Yu Chen, Member, IEEE, and Alfred O. Hero, Fellow, IEEE arxiv:1704.05193v2 [stat.ml]

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

Primal-dual subgradient methods for convex problems

Primal-dual subgradient methods for convex problems Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Subgradient Methods in Network Resource Allocation: Rate Analysis

Subgradient Methods in Network Resource Allocation: Rate Analysis Subgradient Methods in Networ Resource Allocation: Rate Analysis Angelia Nedić Department of Industrial and Enterprise Systems Engineering University of Illinois Urbana-Champaign, IL 61801 Email: angelia@uiuc.edu

More information

Distributed intelligence in multi agent systems

Distributed intelligence in multi agent systems Distributed intelligence in multi agent systems Usman Khan Department of Electrical and Computer Engineering Tufts University Workshop on Distributed Optimization, Information Processing, and Learning

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Proximal-like contraction methods for monotone variational inequalities in a unified framework

Proximal-like contraction methods for monotone variational inequalities in a unified framework Proximal-like contraction methods for monotone variational inequalities in a unified framework Bingsheng He 1 Li-Zhi Liao 2 Xiang Wang Department of Mathematics, Nanjing University, Nanjing, 210093, China

More information

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India

How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India How to Escape Saddle Points Efficiently? Praneeth Netrapalli Microsoft Research India Chi Jin UC Berkeley Michael I. Jordan UC Berkeley Rong Ge Duke Univ. Sham M. Kakade U Washington Nonconvex optimization

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Key words. saddle-point dynamics, asymptotic convergence, convex-concave functions, proximal calculus, center manifold theory, nonsmooth dynamics

Key words. saddle-point dynamics, asymptotic convergence, convex-concave functions, proximal calculus, center manifold theory, nonsmooth dynamics SADDLE-POINT DYNAMICS: CONDITIONS FOR ASYMPTOTIC STABILITY OF SADDLE POINTS ASHISH CHERUKURI, BAHMAN GHARESIFARD, AND JORGE CORTÉS Abstract. This paper considers continuously differentiable functions of

More information

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications

A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications A globally and R-linearly convergent hybrid HS and PRP method and its inexact version with applications Weijun Zhou 28 October 20 Abstract A hybrid HS and PRP type conjugate gradient method for smooth

More information

A Distributed Newton Method for Network Utility Maximization

A Distributed Newton Method for Network Utility Maximization A Distributed Newton Method for Networ Utility Maximization Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie Abstract Most existing wor uses dual decomposition and subgradient methods to solve Networ Utility

More information

Introduction to gradient descent

Introduction to gradient descent 6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our

More information

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Non-smooth Non-convex Bregman Minimization: Unification and new Algorithms Peter Ochs, Jalal Fadili, and Thomas Brox Saarland University, Saarbrücken, Germany Normandie Univ, ENSICAEN, CNRS, GREYC, France

More information

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS

ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS ON THE GLOBAL AND LINEAR CONVERGENCE OF THE GENERALIZED ALTERNATING DIRECTION METHOD OF MULTIPLIERS WEI DENG AND WOTAO YIN Abstract. The formulation min x,y f(x) + g(y) subject to Ax + By = b arises in

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities Caihua Chen Xiaoling Fu Bingsheng He Xiaoming Yuan January 13, 2015 Abstract. Projection type methods

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS A SIMPLE PARALLEL ALGORITHM WITH AN O(/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS HAO YU AND MICHAEL J. NEELY Abstract. This paper considers convex programs with a general (possibly non-differentiable)

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Optimality Conditions for Nonsmooth Convex Optimization

Optimality Conditions for Nonsmooth Convex Optimization Optimality Conditions for Nonsmooth Convex Optimization Sangkyun Lee Oct 22, 2014 Let us consider a convex function f : R n R, where R is the extended real field, R := R {, + }, which is proper (f never

More information

A New Sequential Optimality Condition for Constrained Nonsmooth Optimization

A New Sequential Optimality Condition for Constrained Nonsmooth Optimization A New Sequential Optimality Condition for Constrained Nonsmooth Optimization Elias Salomão Helou Sandra A. Santos Lucas E. A. Simões November 23, 2018 Abstract We introduce a sequential optimality condition

More information

Optimality Conditions

Optimality Conditions Chapter 2 Optimality Conditions 2.1 Global and Local Minima for Unconstrained Problems When a minimization problem does not have any constraints, the problem is to find the minimum of the objective function.

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Porcupine Neural Networks: (Almost) All Local Optima are Global

Porcupine Neural Networks: (Almost) All Local Optima are Global Porcupine Neural Networs: (Almost) All Local Optima are Global Soheil Feizi, Hamid Javadi, Jesse Zhang and David Tse arxiv:1710.0196v1 [stat.ml] 5 Oct 017 Stanford University Abstract Neural networs have

More information

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE Journal of Applied Analysis Vol. 6, No. 1 (2000), pp. 139 148 A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE A. W. A. TAHA Received

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information