NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization

Size: px
Start display at page:

Download "NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization"

Transcription

1 ESTT: A onconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization Davood Hajinezhad, Mingyi Hong Tuo Zhao Zhaoran Wang Abstract We study a stochastic and distributed algorithm for nonconvex problems whose objective consists of a sum of nonconvex L i /-smooth functions, plus a nonsmooth regularizer. The proposed onconvex primal-dual SpliTTing ESTT algorithm splits the problem into subproblems, and utilizes an augmented Lagrangian based primal-dual scheme to solve it in a distributed and stochastic manner. With a special non-uniform sampling, a version of ESTT achieves ɛ-stationary solution using O Li / /ɛ gradient evaluations, which can be up to O times better than the proximal gradient descent methods. It also achieves Q-linear convergence rate for nonconvex l penalized quadratic problems with polyhedral constraints. Further, we reveal a fundamental connection between primal-dual based methods and a few primal only methods such as IAG/SAG/SAGA. Introduction Consider the following nonconvex and nonsmooth constrained optimization problem min fz := g i z + g 0 z + pz,. z Z where Z R d ; for each i {0,, }, g i : R d R is a smooth possibly nonconvex function which has L i -Lipschitz continuous gradient; pz : R d R is a lower semi-continuous convex but possibly nonsmooth function. Define gz := g iz for notational simplicity. Problem. is quite general. It arises frequently in applications such as machine learning and signal processing; see a recent survey [7]. In particular, each smooth functions {g i } can represent: a mini-batch of loss functions modeling data fidelity, such as the l loss, the logistic loss, etc; nonconvex activation functions for neural networks, such as the logit or the tanh functions; 3 nonconvex utility functions used in signal processing and resource allocation, see [4]. The smooth function g 0 can represent smooth nonconvex regularizers such as the non-quadratic penalties [], or the smooth part of the SCAD or MCP regularizers which is a concave function [6]. The convex function p can take the following form: nonsmooth convex regularizers such as l and l functions; an indicator function for convex and closed feasible set Z, denoted as ι Z ; 3 convex functions without global Lipschitz continuous gradient, such as pz = z 4 or pz = /z +ι z 0 z. In this work we solve. in a stochastic and distributed manner. We consider the setting in which distributed agents each having the knowledge of one smooth function {g i }, and they are connected to a cluster center which handles g 0 and p. At any given time, a randomly selected agent is activated and performs computation to optimize its local objective. Such distributed computation model has been popular in large-scale machine learning and signal processing [6]. Such model is also closely related to the centralized stochastic finite-sum optimization problem [, 9, 4, 5, Department of Industrial & Manufacturing Systems Engineering and Department of Electrical & Computer Engineering, Iowa State University, Ames, IA, {dhaji,mingyi}@iastate.edu School of Industrial and Systems Engineering, Georgia Institute of Technology tourzhao@gatech.edu Department of Operations Research, Princeton University,zhaoran@princeton.edu 30th Conference on eural Information Processing Systems IPS 06, Barcelona, Spain.

2 , ], in which each time the iterate is updated based on the gradient information of a random component function. One of the key differences between these two problem types is that in the distributed setting there can be disagreement between local copies of the optimization variable z, while in the centralized setting only one copy of z is maintained. Our Contributions. We propose a class of onconvex primal-dual SpliTTing ESTT algorithms for problem.. We split z R d into local copies of x i R d, while enforcing the equality constraints x i = z for all i. That is, we consider the following reformulation of. min lx, z := g i x i + g 0 z + hz, s.t. x i = z, i =,,,. x,z R d where hz := ι Z z + pz, x := [x ; ; x ]. Our algorithm uses the Lagrangian relaxation of the equality constraints, and at each iteration a possibly non-uniformly randomly selected primal variable is optimized, followed by an approximate dual ascent step. ote that such splitting scheme has been popular in the convex setting [6], but not so when the problem becomes nonconvex. The ESTT is one of the first stochastic algorithms for distributed nonconvex nonsmooth optimization, with provable and nontrivial convergence rates. Our main contribution is given below. First, in terms of some primal and dual optimality gaps, ESTT converges sublinearly to a point belongs to stationary solution set of.. Second, ESTT converges Q-linearly for certain nonconvex l penalized quadratic problems. To the best of our knowledge, this is the first time that linear convergence is established for stochastic and distributed optimization of such type of problems. Third, we show that a gradient-based ESTT with non-uniform sampling achieves an ɛ-stationary solution of. using O Li / /ɛ gradient evaluations. Compared with the classical gradient descent, which in the worst case requires O L i/ɛ gradient evaluation to achieve ɛ-stationarity, our obtained rate can be up to O times better in the case where the L i s are not equal. Our work also reveals a fundamental connection between primal-dual based algorithms and the primal only average-gradient based algorithm such as SAGA/SAG/IAG [5, 9, ]. With the key observation that the dual variables in ESTT serve as the memory of the past gradients, one can specialize ESTT to SAGA/SAG/IAG. Therefore, ESTT naturally generalizes these algorithms to the nonconvex nonsmooth setting. It is our hope that by bridging the primal-dual splitting algorithms and primal-only algorithms in both the convex and nonconvex setting, there can be significant further research developments benefiting both algorithm classes. Related Work. Many stochastic algorithms have been designed for. when it is convex. In these algorithms the component functions g i s are randomly sampled and optimized. Popular algorithms include the SAG/SAGA [9, ], the SDCA [3], the SVRG [4], the RPDG [5] and so on. When the problem becomes nonconvex, the well-known incremental based algorithm can be used [3, 4], but these methods generally lack convergence rate guarantees. The SGD based method has been studied in [0], with O/ɛ convergence rate. Recent works [] and [] develop algorithms based on SVRG and SAGA for a special case of. where the entire problem is smooth and unconstrained. To the best of our knowledge there has been no stochastic algorithms with provable, and non-trivial, convergence rate guarantees for solving problem.. On the other hand, distributed stochastic algorithms for solving problem. in the nonconvex setting has been proposed in [3], in which each time a randomly picked subset of agents update their local variables. However there has been no convergence rate analysis for such distributed stochastic scheme. There has been some recent distributed algorithms designed for. [7], but again without global convergence rate guarantee. Preliminaries. The augmented Lagrangian function for problem. is given by: ηi L x, z; λ = gixi + λi, xi z + xi z + g 0z + hz,.3 where λ := {λ i } is the set of dual variables, and η := {η i > 0} are penalty parameters. We make the following assumptions about problem. and the function.3. A-a The function fz is bounded from below over Z intdom f: f := min z Z fz >. pz is a convex lower semi-continuous function; Z is a closed convex set. A-b The g i s and g have Lipschitz continuous gradients, i.e., gy gz L y z, and g i y g i z L i y z, y, z

3 Algorithm ESTT-G Algorithm : for r = to R do : Pick {,,, } with probability p ir and update x, λ Update z: x r+ = arg min x ir V ir x ir, z r, λ r ;.4 λ r+ = λ r + α ir η ir x r+ z r ;.5 λ r+ j = λ r j, x r+ j = z r, j ;.6 z r+ = arg min z Z L{xr+ i }, z; λ r..7 3: end for 4: Output: z m, x m, λ m where m randomly picked from {,,, R}. Clearly L / L i, and the equality can be achieved in the worst case. For simplicity of analysis we will further assume that L 0 L i. A-c Each η i in.3 satisfies η i > L i /; if g 0 is nonconvex, then η i > 3L 0. Assumption A-c implies that L x, z; λ is strongly convex w.r.t. each x i and z, with modulus γ i := η i L i / and γ z = η i L 0, respectively [7, Theorem.]. We then define the prox-gradient pgrad for., which will serve as a measure of stationarity. It can be checked that the pgrad vanishes at the set of stationary solutions of. [0]. Definition.. The proximal gradient of problem. is given by for any γ > 0 f γz := γ z prox γ p+ι Z [z /γ gz + g 0z], with prox γ p+ι Z [u] := argmin pu+ γ u Z z u. The ESTT-G Algorithm Algorithm Description. We present a primal-dual splitting scheme for the reformulated problem.. The algorithm is referred to as the ESTT with Gradient step ESTT-G since each agent only requires to know the gradient of each component function. To proceed, let us define the following function for some constants {α i > 0} : V ix i, z; λ i = giz + αiηi giz, xi z + λi, xi z + xi z. ote that V i is related to L in the following way: it is a quadratic approximation approximated at the point z of Lx, y; λ w.r.t. x i. The parameters α := {α i } give some freedom to the algorithm design, and they are critical in improving convergence rates as well as in establishing connection between ESTT-G with a few primal only stochastic optimization schemes. The algorithm proceeds as follows. Before each iteration begins the cluster center broadcasts z to everyone. At iteration r + a randomly selected agent {,, } is picked, who minimizes V ir w.r.t. its local variable x ir, followed by a dual ascent step for λ ir. The rest of the agents update their local variables by simply setting them to z. The cluster center then minimizes Lx, z; λ with respect to z. See Algorithm for details. We remark that ESTT-G is related to the popular ADMM method for convex optimization [6]. However our particular update schedule randomly picking x i, λ i plus deterministic updating z, combined with the special x-step minimizing an approximation of L evaluated at a different block variable z is not known before. These features are critical in our following rate analysis. Convergence Analysis. To proceed, let us define rj as the last iteration in which the jth block is picked before iteration r +. i.e. rj := max{t t < r +, j = it}. Define yj r := zrj if j, and y r = z r. Define the filtration F r as the σ-field generated by {it} r t=. A few important observations are in order. Combining the x, z updates.4.7, we have x r+ q = z r α qη q λ r q + gqzr, gqzr + λ r q + α qη qx r+ q z r = 0, with q =.8a λ r+ = gir z r, λ r+ j = gjzrj, j, λ r+ i = giyr i, i.8b x r+ j.6 = z r.8b = z r α jη j λ r j + gjzrj, j..8c 3

4 The key here is that the dual variables serve as the memory for the past gradients of g i s. To proceed, we first construct a potential function using an upper bound of Lx, y; λ. ote that gjxr+ j + λ r j, x r+ j z r + ηj xr+ j z r = gjzr, j.9 gir x r+ + λ r, x r+ z r + ηi xr+ z r i gir z r + ηir + L ir / x r+ z r ii = gir z r + ηir + L ir / α ir η ir / gir y r g ir z r.0 where i uses.8b and applies the descent lemma on the function /g i ; in ii we have used.5 and.8b. Since each i is picked with probability p i, we have E ir [Lx r+, z r ; λ r F r ] g iz r p i η i + L i / + α i η i / g i y r i g i z r + g 0 z r + hz r g iz r + 3p i η i α i η i / g iy r i g i z r + g 0 z r + hz r := Q r, where in the last inequality we have used Assumption [A-c]. In the following, we will use E F r[q r ] as the potential function, and show that it decreases at each iteration. Lemma.. Suppose Assumption A holds, and pick α i = p i = βη i, where β := η, and η i 9L i, i =,.. i p i Then the following descent estimate holds true for ESTT-G E[Q r Q r F r ] ηi E z r z r z r 8 η i gizr g iy r i.. Sublinear Convergence. Define the optimality gap as the following: [ E[G r ] := E /β fz r ] = [ β E z r prox /β h [zr β gz r + g 0z r ] ]..3 ote that when h, g 0 0, E[G r ] reduces to E[ gz r ]. We have the following result. Theorem.. Suppose Assumption A holds, and pick for i =,, Li / α i = p i = Li /, η Li i = 3 Li / /, β = 3 Li /..4 Then every limit point generated by ESTT-G is a stationary solution of problem.. Further, E[G m ] 80 E[Q Li/ Q R+ ] ; 3 R [ ] E[G m ] + E 3ηi x m i z m 80 E[Q Li/ Q R+ ]. 3 R ote that Part is useful in the centralized finite-sum minimization setting, as it shows the sublinear convergence of ESTT-G, measured only by the primal optimality gap evaluated at z r. Meanwhile, part is useful in the distributed setting, as it also shows that the expected constraint violation, which measures the consensus among agents, shrinks in the same order. We also comment that the above result suggests that to achieve an ɛ-stationary solution, the ESTT-G requires about O Li / /ɛ number of gradient evaluations for simplicity we have ignored an additive factor for evaluating the gradient of the entire function at the initial step of the algorithm. 4

5 Algorithm ESTT-E Algorithm : for r = to R do : Update z by minimizing the augmented Lagrangian: z r+ = arg min z Lx r, z; λ r : Randomly pick {,, } with probability p ir : x r+ = argmin x ir U ir x ir, z r+ ; λ r ; 3.6 λ r+ = λ r + α ir η ir x r+ z r+ ; 3.7 x r+ j = x r j, λ r+ j = λ r j j : end for 5: Output: z m, x m, λ m where m randomly picked from {,,, R}. It is interesting to observe that our choice of p i is proportional to the square root of the Lipschitz constant of each component function, rather than to L i. Because of such choice of the sampling probability, the derived convergence rate has a mild dependency on and L i s. Compared with the conventional gradient-based methods, our scaling can be up to times better. Detailed discussion and comparison will be given in Section 4. ote that similar sublinear convergence rates can be obtained for the case α i = for all i with different scaling constants. However due to space limitation, we will not present those results here. Linear Convergence. In this section we show that the ESTT-G is capable of linear convergence for a family of nonconvex quadratic problems, which has important applications, for example in high-dimensional statistical learning [6]. To proceed, we will assume the following. B-a Each function g i z is a quadratic function of the form g i z = /z T A i z + b, z, where A i is a symmetric matrix but not necessarily positive semidefinite; B-b The feasible set Z is a closed compact polyhedral set; B-c The nonsmooth function pz = µ z, for some µ 0. Our linear convergence result is based upon certain error bound condition around the stationary solutions set, which has been shown in [8] for smooth quadratic problems and has been extended to including l penalty in [5, Theorem 4]. Due to space limitation the statement of the condition will be given in the supplemental material, along with the proof of the following result. Theorem.. Suppose that Assumptions A, B are satisfied. Then the sequence {E[Q r+ ]} r= converges Q-linearly 4 to some Q = fz, where z is a stationary solution for problem.. That is, there exists a finite r > 0, ρ 0, such that for all r r, E[Q r+ Q ] ρe[q r Q ]. Linear convergence of this type for problems satisfying Assumption B has been shown for deterministic proximal gradient based methods [5, Theorem, 3]. To the best of our knowledge, this is the first result that shows the same linear convergence for a stochastic and distributed algorithm. 3 The ESTT-E Algorithm Algorithm Description. In this section, we present a variant of ESTT-G, which is named ESTT with Exact minimization ESTT-E. Our motivation is the following. First, in ESTT-G every agent should update its local variable at every iteration [cf..4 or.6]. In practice this may not be possible, for example at any given time a few agents can be in the sleeping mode so they cannot perform.6. Second, in the distributed setting it has been generally observed e.g., see [8, Section V] that performing exact minimization whenever possible instead of taking the gradient steps for local problems can significantly speed up the algorithm. The ESTT-E algorithm to be presented in this section is designed to address these issues. To proceed, let us define a new function as follows: Ux, z; λ := U ix i, z; λ i := αiηi gixi + λi, xi z + xi z 4 A sequence {x r } is said to converge Q-linearly to some x if lim sup r x r+ x / x r x ρ, where ρ 0, is some constant; cf [5] and references therein.. 5

6 ote that if α i = for all i, then the Lx, z; λ = Ux, z; λ + pz + hz. The algorithm details are presented in Algorithm. Convergence Analysis. We begin analyzing ESTT-E. The proof technique is quite different from that for ESTT-G, and it is based upon using the expected value of the Augmented Lagrangian function as the potential function; see [,, 3]. For the ease of description we define the following quantities: w := x, z, λ, β :=, c i := L i ηi α γi iη i + αi L i α i, α := {αi}. To measure the optimality of ESTT-E, define the prox-gradient of Lx, z; λ as: [ ] Lw = z prox h [z z Lw hz]; x Lw; ; x Lw R +d. 3.9 We define the optimality gap by adding to Lw the size of the constraint violation [3]: Hw r := Lw r L i + xr i z r. It can be verified that Hw r 0 implies that w r reaches a stationary solution for problem.. We have the following theorem regarding the convergence properties of ESTT-E. Theorem 3.. Suppose Assumption A holds, and that η i, α i are chosen such that c i < 0. Then for some constant f, we have E[Lw r ] E[Lw r+ ] f >, r 0. Further, almost surely every limit point of {w r } is a stationary solution of problem.. Finally, for some function of α denoted as Cα = σ α/σ α, we have the following: E[Hw m ] CαE[Lw Lw R+ ], 3.0 R where σ := maxˆσ α, σ and σ := maxˆσ α, σ, and these constants are given by ˆσ α = max i { 4 L i + η i + α i L i + 3 σ = 4ηi + + η i + L 0 L i + 3, { γi ˆσ α = max p i i L i } αi L i, σ = α iη i α i L 4 i α iη i 4 + L i }, ηi L0. We remark that the above result shows the sublinear convergence of ESTT-E to the set of stationary solutions. ote that γ i = η i L i /, to satisfy c i < 0, a simple derivation yields L i α i + α i + 8α i η i >. α i Further, the above result characterizes the dependency of the rates on various parameters of the L algorithm. For example, to see the effect of α on the convergence rate, let us set p i = i Li, and η i = 3L i /, and assume L 0 = 0, then consider two different choices of α: α i =, i and α i = 4, i. One can easily check that applying these different choices leads to following results: C α = 49 L i/, C α = 8 L i/. The key observation is that increasing α i s reduces the constant in front of the rate. Hence, we expect that in practice larger α i s will yield faster convergence. 4 Connections and Comparisons with Existing Works In this section we compare ESTT-G/E with a few existing algorithms in the literature. First, we present a somewhat surprising observation, that ESTT-G takes the same form as some well-known algorithms for convex finite-sum problems. To formally state such relation, we show in the following result that ESTT-G in fact admits a compact primal-only characterization. 6

7 Table : Comparison of # of gradient evaluations for ESTT-G and GD in the worst case ESTT-G GD # of Gradient Evaluations O Li/ /ɛ O Li/ɛ Case I: L i =, i O/ɛ O/ɛ Case II : O terms with L i = O/ɛ O 3/ /ɛ the rest with L i = Case III : O terms with L i = O/ɛ O the rest with L i = /ɛ Proposition 4.. The ESTT-G can be written into the following compact form: with z r+ = arg min hz + g z 0z + β z ur+ u r+ := z r β g ir z r g ir y r i α r + ir 4.a g iy r i. 4.b Based on this observation, the following comments are in order. Suppose h 0, g 0 0 and α i =, p i = / for all i. Then 4. takes the same form as the SAG presented in []. Further, when the component functions g i s are picked cyclically in a Gauss-Seidel manner, the iteration 4. takes the same form as the IAG algorithm [5]. Suppose h 0 and g 0 0, and α i = p i = / for all i. Then 4. is the same as the SAGA algorithm [9], which is design for optimizing convex nonsmooth finite sum problems. ote that SAG/SAGA/IAG are all designed for convex problems. Through the lens of primal-dual splitting, our work shows that they can be generalized to nonconvex nonsmooth problems as well. Secondly, ESTT-E is related to the proximal version of the nonconvex ADMM [3, Algorithm ]. However, the introduction of α i s is new, which can significantly improve the practical performance but complicates the analysis. Further, there has been no counterpart of the sublinear and linear convergence rate analysis for the stochastic version of [3, Algorithm ]. Thirdly, we note that a recent paper [] has shown that SAGA works for smooth and unconstrained nonconvex problem. Suppose that h 0, g 0 0, L i = L j, i, j and α i = p i = /, the authors show that SAGA achieves ɛ-stationarity using O /3 L i//ɛ gradient evaluations. Compared with GD, which achieves ɛ-stationarity using O L i/ɛ gradient evaluations in the worse case in the sense that L i/ = L, the rate in [] is O /3 times better. However, the algorithm in [] is different from ESTT-G in two aspects: it does not generalize to the nonsmooth constrained problem.; it samples two component functions at each iteration, while ESTT-G only samples once. Further, the analysis and the scaling are derived for the case of uniform L i s, therefore it is not clear how the algorithm and the rates can be adapted for the nonuniform case. On the other hand, our ESTT works for the general nonsmooth constrained setting. The non-uniform sampling used in ESTT-G is well-suited for problems with non-uniform L i s, and our scaling can be up to times better than GD or its proximal version in the worst case. ote that problems with non-uniform L i s for the component functions are common in applications such as sparse optimization and signal processing. For example in LASSO problem the data matrix is often normalized by feature or column-normalized [9], therefore the l norm of each row of the data matrix which corresponds to the Lipschitz constant for each component function can be dramatically different. In Table we list the comparison of the number of gradient evaluations for ESTT-G and GD, in the worst case in the sense that L i/ = L. For simplicity, we omitted an additive constant of O for computing the initial gradients. 5 umerical Results In this section we evaluate the performance of ESTT. Consider the high dimensional regression problem with noisy observation [6], where M observations are generated by y = Xν + ɛ. Here y R M is the observed data sample; X R M P is the covariate matrix; ν R P is the ground truth, and ɛ R M is the noise. Suppose that the covariate matrix is not perfectly known, i.e., we observe A = X + W where W R M P is the noise matrix with known covariance matrix Σ W. Let us define ˆΓ := /MA A Σ W, and ˆγ := /MA y. To estimate the ground truth ν, let 7

8 0 5 Uniform Sampling 0 5 on-uniform Sampling Optimality gap 0-5 SGD 0-0 ESTT-E = 0 ESTT-E = ESTT-G SAGA # Grad/ Optimality gap SGD ESTT-E = ESTT-E = ESTT-G SAGA # Grad/ Figure : Comparison of ESTT-G/E, SAGA, SGD on problem 5.. The x-axis denotes the number of Li / passes of the dataset. Left: Uniform Sampling p i = /; Right: on-uniform Sampling p i =. Li / Table : Optimality gap /β fz r for different algorithms, with 00 passes of the datasets. SGD ESTT-E α = 0 ESTT-G SAGA Uniform on-uni Uniform on-uni Uniform on-uni Uniform on-uni E-6 6.6E-9.3E- 6.E-4.7E E-9 5.9E-9.E-0.9E- 7.7E E-6.7E-6 4.5E-7.4E-7.5E E-4 8.E-5.8E-5 3.E-5 4.E E.-4 7.E-4.E-4.7E-4.5E us consider the following nonconvex optimization problem posed in [6, problem.4] where R > 0 controls sparsity: min z z ˆΓz ˆγz s.t. z R. 5. Due to the existence of noise, ˆΓ is not positive semidefinite hence the problem is not convex. ote that this problem satisfies Assumption A B, then by Theorem. ESTT-G converges Q-linearly. To test the performance of the proposed algorithm, we generate the problem following similar setups as [6]. Let X = X ;, X R M P with i i = M and each X i R i P corresponds to i data points, and it is generated from i.i.d Gaussian. Here epresents the size of each minibatch of samples. Generate the observations y i = X i ν +ɛ i R i, where ν is a K-sparse vector to be estimated, and ɛ i R i is the random noise. Let W = [W ; ; W ], with W i R i P generated with i.i.d Gaussian. Therefore we have z ˆΓz = M z Xi X i Wi W i z. We set M = 00, 000, P = 5000, = 50, K = P,and R = ν. We implement ESTT-G/E, the SGD, and the nonconvex SAGA proposed in [] with stepsize β = 3L max /3 with L max := max i L i. ote that the SAGA proposed in [] only works for the unconstrained problems with uniform L i, therefore when applied to 5. it is not guaranteed to converge. Here we only include it for comparison purposes. In Fig. we compare different algorithms in terms of the gap /β fz r. In the left figure we consider the problem with i = j for all i, j, and we show performance of the proposed algorithms with uniform sampling i.e., the probability of picking ith block is p i = /. On the right one we consider problems in which approximately half of the component functions have twice the size of L i s as the rest, and consider the non-uniform sampling p i = L i // Li /. Clearly in both cases the proposed algorithms perform quite well. Furthermore, it is clear that the ESTT-E performs well with large α := {α i }, which confirms our theoretical rate analysis. Also it is worth mentioning that when the i s are non-uniform, the proposed algorithms [ESTT-G and ESTT-E with α = 0] significantly outperform SAGA and SGD. In Table we further compare different algorithms when changing the number of component functions i.e., the number of minibatches while the rest of the setup is as above. We run each algorithm with 00 passes over the dataset. Similarly as before, our algorithms perform well, while SAGA seems to be sensitive to the uniformity of the size of the mini-batch [note that there is no convergence guarantee for SAGA applied to the nonconvex constrained problem 5.]. 8

9 References [] Z. A.-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. 06. Preprint, available on arxiv, arxiv: [] A. Antoniadis, I. Gijbels, and M. ikolova. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 633:585 65, 009. [3] D. Bertsekas. Incremental gradient, subgradient, and proximal methods f or convex optimization: A survey LIDS Report 848. [4] E. Bjornson and E. Jorswieck. Optimal resource allocation in coordinated multi-cell systems. Foundations and Trends in Communications and Information Theory, 9, 03. [5] D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 8:9 5, 007. [6] S. Boyd,. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:, 0. [7] V. Cevher, S. Becker, and M. Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 35:3 43, Sept 04. [8] T.-H. Chang, M. Hong, and X. Wang. Multi-agent distributed optimization via inexact consensus admm. IEEE Transactions on Signal Processing, 63:48 497, Jan 05. [9] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In The Proceeding of IPS, 04. [0] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvx stochastic programming. SIAM Journal on Optimizatnoi, 34:34 368, 03. [] D. Hajinezhad, T. H. Chang, X. Wang, Q. Shi, and M. Hong. onnegative matrix factorization using admm: Algorithm and convergence analysis. In 06 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP, pages , March 06. [] D. Hajinezhad and M. Hong. onconvex alternating direction method of multipliers for distributed sparse principal component analysis. In the Proceedings of GlobalSIPT, 05. [3] M. Hong, Z.-Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal On Optimization, 6: , 06. [4] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In the Proceedings of the eural Information Processing IPS. 03. [5] G. Lan. An optimal randomized incremental gradient method. 05. Preprint. [6] P.-L. Loh and M. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 403: , 0. [7] P. D. Lorenzo and G. Scutari. ext: In-network nonconvex optimization. 06. Preprint. [8] Z.-Q. Luo and P. Tseng. On the linear convergence of descent methods for convex essentially smooth minimization. SIAM Journal on Control and Optimization, 30:408 45, 99. [9] S.. egahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for highdimensional analysis of m-estimators with decomposable regularizers. Statist. Sci., 74: , 0. [0] M. Razaviyayn, M. Hong, Z.-Q. Luo, and J. S. Pang. Parallel successive convex approximation for nonsmooth nonconvex optimization. In the Proceedings of IPS, 04. [] S. J. Reddi, S. Sra, B. Poczos, and A. Smola. Fast incremental method for nonconvex optimization. 06. Preprint, available on arxiv: arxiv: [] M. Schmidt,. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. 03. Technical report, IRIA. [3] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent methods for regularzied loss minimization. Journal of Machine Learning Rsearch, 4: , 03. [4] S. Sra. Scalable nonconvex inexact proximal splitting. In Advances in eural Information Processing Systems IPS, 0. [5] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 7:387 43, 009. [6] Z. Wang, H. Liu, and T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of Statistics, 46:64 0, 04. [7] S. Zlobec. On the Liu - Floudas convexification of smooth programs. Journal of Global Optimization, 3:40 407,

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Does Alternating Direction Method of Multipliers Converge for Nonconvex Problems?

Does Alternating Direction Method of Multipliers Converge for Nonconvex Problems? Does Alternating Direction Method of Multipliers Converge for Nonconvex Problems? Mingyi Hong IMSE and ECpE Department Iowa State University ICCOPT, Tokyo, August 2016 Mingyi Hong (Iowa State University)

More information

ADMM and Fast Gradient Methods for Distributed Optimization

ADMM and Fast Gradient Methods for Distributed Optimization ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

BLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS

BLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS BLOCK ALTERNATING OPTIMIZATION FOR NON-CONVEX MIN-MAX PROBLEMS: ALGORITHMS AND APPLICATIONS IN SIGNAL PROCESSING AND COMMUNICATIONS Songtao Lu, Ioannis Tsaknakis, and Mingyi Hong Department of Electrical

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

The Proximal Primal-Dual Approach for Nonconvex Linearly Constrained Problems

The Proximal Primal-Dual Approach for Nonconvex Linearly Constrained Problems The Proximal Primal-Dual Approach for Nonconvex Linearly Constrained Problems Presenter: Mingyi Hong Joint work with Davood Hajinezhad University of Minnesota ECE Department DIMACS Workshop on Distributed

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Department of Systems Engineering and Engineering Management The Chinese University of Hong Kong 2014 Workshop

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

A Customized ADMM for Rank-Constrained Optimization Problems with Approximate Formulations

A Customized ADMM for Rank-Constrained Optimization Problems with Approximate Formulations A Customized ADMM for Rank-Constrained Optimization Problems with Approximate Formulations Chuangchuang Sun and Ran Dai Abstract This paper proposes a customized Alternating Direction Method of Multipliers

More information

Dual Ascent. Ryan Tibshirani Convex Optimization

Dual Ascent. Ryan Tibshirani Convex Optimization Dual Ascent Ryan Tibshirani Conve Optimization 10-725 Last time: coordinate descent Consider the problem min f() where f() = g() + n i=1 h i( i ), with g conve and differentiable and each h i conve. Coordinate

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Perturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization

Perturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization Noname manuscript No. (will be inserted by the editor Perturbed Proximal Primal Dual Algorithm for Nonconvex Nonsmooth Optimization Davood Hajinezhad and Mingyi Hong Received: date / Accepted: date Abstract

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

More information

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization Meisam Razaviyayn meisamr@stanford.edu Mingyi Hong mingyi@iastate.edu Zhi-Quan Luo luozq@umn.edu Jong-Shi Pang jongship@usc.edu

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Dual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Dual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725 Dual Methods Lecturer: Ryan Tibshirani Conve Optimization 10-725/36-725 1 Last time: proimal Newton method Consider the problem min g() + h() where g, h are conve, g is twice differentiable, and h is simple.

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)

More information

MATH 680 Fall November 27, Homework 3

MATH 680 Fall November 27, Homework 3 MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

Fast-and-Light Stochastic ADMM

Fast-and-Light Stochastic ADMM Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-6) Fast-and-Light Stochastic ADMM Shuai Zheng James T. Kwok Department of Computer Science and Engineering

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Divide-and-combine Strategies in Statistical Modeling for Massive Data

Divide-and-combine Strategies in Statistical Modeling for Massive Data Divide-and-combine Strategies in Statistical Modeling for Massive Data Liqun Yu Washington University in St. Louis March 30, 2017 Liqun Yu (WUSTL) D&C Statistical Modeling for Massive Data March 30, 2017

More information

Accelerated primal-dual methods for linearly constrained convex problems

Accelerated primal-dual methods for linearly constrained convex problems Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION

LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION Aryan Mokhtari, Alec Koppel, Gesualdo Scutari, and Alejandro Ribeiro Department of Electrical and Systems

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Distributed Optimization via Alternating Direction Method of Multipliers

Distributed Optimization via Alternating Direction Method of Multipliers Distributed Optimization via Alternating Direction Method of Multipliers Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato Stanford University ITMANET, Stanford, January 2011 Outline precursors dual decomposition

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

First-order methods for structured nonsmooth optimization

First-order methods for structured nonsmooth optimization First-order methods for structured nonsmooth optimization Sangwoon Yun Department of Mathematics Education Sungkyunkwan University Oct 19, 2016 Center for Mathematical Analysis & Computation, Yonsei University

More information

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization Linbo Qiao, and Tianyi Lin 3 and Yu-Gang Jiang and Fan Yang 5 and Wei Liu 6 and Xicheng Lu, Abstract We consider

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING YANGYANG XU Abstract. Motivated by big data applications, first-order methods have been extremely

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers

Decentralized Quadratically Approximated Alternating Direction Method of Multipliers Decentralized Quadratically Approimated Alternating Direction Method of Multipliers Aryan Mokhtari Wei Shi Qing Ling Alejandro Ribeiro Department of Electrical and Systems Engineering, University of Pennsylvania

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

arxiv: v2 [cs.na] 3 Apr 2018

arxiv: v2 [cs.na] 3 Apr 2018 Stochastic variance reduced multiplicative update for nonnegative matrix factorization Hiroyui Kasai April 5, 208 arxiv:70.078v2 [cs.a] 3 Apr 208 Abstract onnegative matrix factorization (MF), a dimensionality

More information

Sparse and Regularized Optimization

Sparse and Regularized Optimization Sparse and Regularized Optimization In many applications, we seek not an exact minimizer of the underlying objective, but rather an approximate minimizer that satisfies certain desirable properties: sparsity

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

arxiv: v1 [math.oc] 23 May 2017

arxiv: v1 [math.oc] 23 May 2017 A DERANDOMIZED ALGORITHM FOR RP-ADMM WITH SYMMETRIC GAUSS-SEIDEL METHOD JINCHAO XU, KAILAI XU, AND YINYU YE arxiv:1705.08389v1 [math.oc] 23 May 2017 Abstract. For multi-block alternating direction method

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

HYBRID JACOBIAN AND GAUSS SEIDEL PROXIMAL BLOCK COORDINATE UPDATE METHODS FOR LINEARLY CONSTRAINED CONVEX PROGRAMMING

HYBRID JACOBIAN AND GAUSS SEIDEL PROXIMAL BLOCK COORDINATE UPDATE METHODS FOR LINEARLY CONSTRAINED CONVEX PROGRAMMING SIAM J. OPTIM. Vol. 8, No. 1, pp. 646 670 c 018 Society for Industrial and Applied Mathematics HYBRID JACOBIAN AND GAUSS SEIDEL PROXIMAL BLOCK COORDINATE UPDATE METHODS FOR LINEARLY CONSTRAINED CONVEX

More information

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization

Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization Dimitri P. Bertsekas Laboratory for Information and Decision Systems Massachusetts Institute of Technology February 2014

More information

On the Convergence Rate of Incremental Aggregated Gradient Algorithms

On the Convergence Rate of Incremental Aggregated Gradient Algorithms On the Convergence Rate of Incremental Aggregated Gradient Algorithms M. Gürbüzbalaban, A. Ozdaglar, P. Parrilo June 5, 2015 Abstract Motivated by applications to distributed optimization over networks

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer

More information

An interior-point stochastic approximation method and an L1-regularized delta rule

An interior-point stochastic approximation method and an L1-regularized delta rule Photograph from National Geographic, Sept 2008 An interior-point stochastic approximation method and an L1-regularized delta rule Peter Carbonetto, Mark Schmidt and Nando de Freitas University of British

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Distributed Smooth and Strongly Convex Optimization with Inexact Dual Methods

Distributed Smooth and Strongly Convex Optimization with Inexact Dual Methods Distributed Smooth and Strongly Convex Optimization with Inexact Dual Methods Mahyar Fazlyab, Santiago Paternain, Alejandro Ribeiro and Victor M. Preciado Abstract In this paper, we consider a class of

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

Dealing with Constraints via Random Permutation

Dealing with Constraints via Random Permutation Dealing with Constraints via Random Permutation Ruoyu Sun UIUC Joint work with Zhi-Quan Luo (U of Minnesota and CUHK (SZ)) and Yinyu Ye (Stanford) Simons Institute Workshop on Fast Iterative Methods in

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Estimators based on non-convex programs: Statistical and computational guarantees

Estimators based on non-convex programs: Statistical and computational guarantees Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

2 Regularized Image Reconstruction for Compressive Imaging and Beyond

2 Regularized Image Reconstruction for Compressive Imaging and Beyond EE 367 / CS 448I Computational Imaging and Display Notes: Compressive Imaging and Regularized Image Reconstruction (lecture ) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

10-725/36-725: Convex Optimization Spring Lecture 21: April 6

10-725/36-725: Convex Optimization Spring Lecture 21: April 6 10-725/36-725: Conve Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 21: April 6 Scribes: Chiqun Zhang, Hanqi Cheng, Waleed Ammar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

High Order Methods for Empirical Risk Minimization

High Order Methods for Empirical Risk Minimization High Order Methods for Empirical Risk Minimization Alejandro Ribeiro Department of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu IPAM Workshop of Emerging Wireless

More information

Introduction to Alternating Direction Method of Multipliers

Introduction to Alternating Direction Method of Multipliers Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction

More information