NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization


Davood Hajinezhad, Mingyi Hong (Department of Industrial & Manufacturing Systems Engineering and Department of Electrical & Computer Engineering, Iowa State University, Ames, IA, {dhaji,mingyi}@iastate.edu)
Tuo Zhao (School of Industrial and Systems Engineering, Georgia Institute of Technology, tourzhao@gatech.edu)
Zhaoran Wang (Department of Operations Research, Princeton University, zhaoran@princeton.edu)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

We study a stochastic and distributed algorithm for nonconvex problems whose objective consists of a sum of $N$ nonconvex $L_i/N$-smooth functions, plus a nonsmooth regularizer. The proposed NonconvEx primal-dual SpliTTing (NESTT) algorithm splits the problem into $N$ subproblems, and utilizes an augmented Lagrangian based primal-dual scheme to solve it in a distributed and stochastic manner. With a special non-uniform sampling, a version of NESTT achieves an $\epsilon$-stationary solution using $O\big((\sum_{i=1}^N \sqrt{L_i/N})^2/\epsilon\big)$ gradient evaluations, which can be up to $O(N)$ times better than the (proximal) gradient descent methods. It also achieves a Q-linear convergence rate for nonconvex $\ell_1$ penalized quadratic problems with polyhedral constraints. Further, we reveal a fundamental connection between primal-dual based methods and a few primal only methods such as IAG/SAG/SAGA.

1 Introduction

Consider the following nonconvex and nonsmooth constrained optimization problem

$$\min_{z \in Z} \; f(z) := \frac{1}{N}\sum_{i=1}^{N} g_i(z) + g_0(z) + p(z), \qquad (1.1)$$

where $Z \subseteq \mathbb{R}^d$; for each $i \in \{0, \dots, N\}$, $g_i : \mathbb{R}^d \to \mathbb{R}$ is a smooth, possibly nonconvex function which has $L_i$-Lipschitz continuous gradient; $p(z) : \mathbb{R}^d \to \mathbb{R}$ is a lower semi-continuous convex but possibly nonsmooth function. Define $g(z) := \frac{1}{N}\sum_{i=1}^{N} g_i(z)$ for notational simplicity.

Problem (1.1) is quite general. It arises frequently in applications such as machine learning and signal processing; see a recent survey [7]. In particular, each smooth function in $\{g_i\}_{i=1}^N$ can represent: 1) a mini-batch of loss functions modeling data fidelity, such as the $\ell_2$ loss, the logistic loss, etc.; 2) nonconvex activation functions for neural networks, such as the logit or the tanh functions; 3) nonconvex utility functions used in signal processing and resource allocation, see [4].

The smooth function $g_0$ can represent smooth nonconvex regularizers such as the non-quadratic penalties [2], or the smooth part of the SCAD or MCP regularizers (which is a concave function) [26]. The convex function $p$ can take the following forms: 1) nonsmooth convex regularizers such as $\ell_1$ and $\ell_2$ functions; 2) an indicator function for a convex and closed feasible set $Z$, denoted as $\iota_Z(\cdot)$; 3) convex functions without global Lipschitz continuous gradient, such as $p(z) = \|z\|^4$ or $p(z) = 1/z + \iota_{z \ge 0}(z)$.

In this work we solve (1.1) in a stochastic and distributed manner. We consider the setting in which $N$ distributed agents each have the knowledge of one smooth function from $\{g_i\}_{i=1}^N$, and they are connected to a cluster center which handles $g_0$ and $p$. At any given time, a randomly selected agent is activated and performs computation to optimize its local objective. Such a distributed computation model has been popular in large-scale machine learning and signal processing [6]. This model is also closely related to the (centralized) stochastic finite-sum optimization problem [1, 9, 14, 15,

21, 22], in which each time the iterate is updated based on the gradient information of a random component function. One of the key differences between these two problem types is that in the distributed setting there can be disagreement between local copies of the optimization variable $z$, while in the centralized setting only one copy of $z$ is maintained.

Our Contributions. We propose a class of NonconvEx primal-dual SpliTTing (NESTT) algorithms for problem (1.1). We split $z \in \mathbb{R}^d$ into local copies $x_i \in \mathbb{R}^d$, while enforcing the equality constraints $x_i = z$ for all $i$. That is, we consider the following reformulation of (1.1):

$$\min_{x,\, z \in \mathbb{R}^d} \; \ell(x, z) := \frac{1}{N}\sum_{i=1}^{N} g_i(x_i) + g_0(z) + h(z), \quad \text{s.t. } x_i = z, \; i = 1, \dots, N, \qquad (1.2)$$

where $h(z) := \iota_Z(z) + p(z)$ and $x := [x_1; \cdots; x_N]$. Our algorithm uses the Lagrangian relaxation of the equality constraints, and at each iteration a (possibly non-uniformly) randomly selected primal variable is optimized, followed by an approximate dual ascent step. Note that such a splitting scheme has been popular in the convex setting [6], but not so when the problem becomes nonconvex.

NESTT is one of the first stochastic algorithms for distributed nonconvex nonsmooth optimization with provable and nontrivial convergence rates. Our main contributions are given below. First, in terms of some primal and dual optimality gaps, NESTT converges sublinearly to a point belonging to the stationary solution set of (1.2). Second, NESTT converges Q-linearly for certain nonconvex $\ell_1$ penalized quadratic problems. To the best of our knowledge, this is the first time that linear convergence is established for stochastic and distributed optimization of this type of problem. Third, we show that a gradient-based NESTT with non-uniform sampling achieves an $\epsilon$-stationary solution of (1.1) using $O\big((\sum_{i=1}^N \sqrt{L_i/N})^2/\epsilon\big)$ gradient evaluations. Compared with classical gradient descent, which in the worst case requires $O(\sum_{i=1}^N L_i/\epsilon)$ gradient evaluations to achieve $\epsilon$-stationarity, our rate can be up to $O(N)$ times better in the case where the $L_i$'s are not equal.

Our work also reveals a fundamental connection between primal-dual based algorithms and the primal only average-gradient based algorithms such as SAGA/SAG/IAG [5, 9, 22]. With the key observation that the dual variables in NESTT serve as the "memory" of the past gradients, one can specialize NESTT to SAGA/SAG/IAG. Therefore, NESTT naturally generalizes these algorithms to the nonconvex nonsmooth setting. It is our hope that by bridging the primal-dual splitting algorithms and primal-only algorithms (in both the convex and nonconvex setting), there can be significant further research developments benefiting both algorithm classes.

Related Work. Many stochastic algorithms have been designed for (1.1) when it is convex. In these algorithms the component functions $g_i$ are randomly sampled and optimized. Popular algorithms include SAG/SAGA [9, 22], SDCA [23], SVRG [14], RPDG [15] and so on. When the problem becomes nonconvex, the well-known incremental based algorithms can be used [3, 24], but these methods generally lack convergence rate guarantees. The SGD based method has been studied in [10], with an $O(1/\epsilon^2)$ convergence rate. Recent works [1] and [21] develop algorithms based on SVRG and SAGA for a special case of (1.1) where the entire problem is smooth and unconstrained. To the best of our knowledge there have been no stochastic algorithms with provable, and non-trivial, convergence rate guarantees for solving problem (1.1). On the other hand, distributed stochastic algorithms for solving problem (1.1) in the nonconvex setting have been proposed in [13], in which each time a randomly picked subset of agents update their local variables. However there has been no convergence rate analysis for such a distributed stochastic scheme. There have been some recent distributed algorithms designed for (1.1) [17], but again without a global convergence rate guarantee.

Preliminaries. The augmented Lagrangian function for problem (1.2) is given by:

$$L(x, z; \lambda) = \sum_{i=1}^{N}\left(\frac{1}{N} g_i(x_i) + \langle \lambda_i, x_i - z\rangle + \frac{\eta_i}{2}\|x_i - z\|^2\right) + g_0(z) + h(z), \qquad (1.3)$$

where $\lambda := \{\lambda_i\}_{i=1}^N$ is the set of dual variables, and $\eta := \{\eta_i > 0\}_{i=1}^N$ are penalty parameters. We make the following assumptions about problem (1.1) and the function (1.3).

A-(a) The function $f(z)$ is bounded from below over $Z \cap \operatorname{int}(\operatorname{dom} f)$: $\underline{f} := \min_{z \in Z} f(z) > -\infty$. $p(z)$ is a convex lower semi-continuous function; $Z$ is a closed convex set.

A-(b) The $g_i$'s and $g$ have Lipschitz continuous gradients, i.e., $\|\nabla g(y) - \nabla g(z)\| \le L\|y - z\|$ and $\|\nabla g_i(y) - \nabla g_i(z)\| \le L_i\|y - z\|$, $\forall\, y, z$.
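To make the structure of (1.3) concrete, the following is a minimal sketch that evaluates the augmented Lagrangian for user-supplied callables. The function name `augmented_lagrangian` and the array layout (one row per agent) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def augmented_lagrangian(g_list, g0, h, x, z, lam, eta):
    """Evaluate L(x, z; lambda) as written in (1.3).

    g_list : list of N callables, g_i(v) -> float
    g0, h  : callables of z -> float (h stands for iota_Z + p)
    x, lam : arrays of shape (N, d); row i holds x_i and lambda_i
    z      : array of shape (d,)
    eta    : array of shape (N,) with penalties eta_i > 0
    """
    N = len(g_list)
    total = g0(z) + h(z)
    for i in range(N):
        diff = x[i] - z
        total += g_list[i](x[i]) / N            # (1/N) g_i(x_i)
        total += lam[i] @ diff                  # <lambda_i, x_i - z>
        total += 0.5 * eta[i] * (diff @ diff)   # (eta_i / 2) ||x_i - z||^2
    return total
```

When every local copy agrees with `z`, the coupling terms vanish and the value reduces to the original objective of (1.1), which gives a quick sanity check of any implementation.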

Clearly $L \le \frac{1}{N}\sum_{i=1}^N L_i$, and the equality can be achieved in the worst case. For simplicity of analysis we will further assume that $L_0 \le \frac{1}{N}\sum_{i=1}^N L_i$.

A-(c) Each $\eta_i$ in (1.3) satisfies $\eta_i > L_i/N$; if $g_0$ is nonconvex, then $\sum_{i=1}^N \eta_i > 3L_0$.

Assumption A-(c) implies that $L(x, z; \lambda)$ is strongly convex w.r.t. each $x_i$ and $z$, with modulus $\gamma_i := \eta_i - L_i/N$ and $\gamma_z = \sum_{i=1}^N \eta_i - L_0$, respectively [27, Theorem 2.1]. We then define the prox-gradient (pGRAD) for (1.1), which will serve as a measure of stationarity. It can be checked that the pGRAD vanishes at the set of stationary solutions of (1.1) [20].

Definition 1.1. The proximal gradient of problem (1.1) is given by (for any $\gamma > 0$)

$$\tilde{\nabla} f_\gamma(z) := \gamma\Big(z - \operatorname{prox}^{\gamma}_{p + \iota_Z}\big[z - \tfrac{1}{\gamma}\nabla\big(g(z) + g_0(z)\big)\big]\Big), \quad \text{with } \operatorname{prox}^{\gamma}_{p + \iota_Z}[u] := \arg\min_{v \in Z} \; p(v) + \frac{\gamma}{2}\|v - u\|^2.$$

2 The NESTT-G Algorithm

Algorithm Description. We present a primal-dual splitting scheme for the reformulated problem (1.2). The algorithm is referred to as NESTT with Gradient step (NESTT-G), since each agent only needs to know the gradient of its component function. To proceed, let us define the following function (for some constants $\{\alpha_i > 0\}_{i=1}^N$):

$$V_i(x_i, z; \lambda_i) = \frac{1}{N} g_i(z) + \frac{1}{N}\langle \nabla g_i(z), x_i - z\rangle + \langle \lambda_i, x_i - z\rangle + \frac{\alpha_i \eta_i}{2}\|x_i - z\|^2.$$

Note that $V_i(\cdot)$ is related to $L(\cdot)$ in the following way: it is a quadratic approximation (taken at the point $z$) of $L(x, z; \lambda)$ w.r.t. $x_i$. The parameters $\alpha := \{\alpha_i\}_{i=1}^N$ give some freedom to the algorithm design, and they are critical in improving convergence rates as well as in establishing the connection between NESTT-G and a few primal only stochastic optimization schemes.

The algorithm proceeds as follows. Before each iteration begins the cluster center broadcasts $z$ to everyone. At iteration $r + 1$ a randomly selected agent $i_r \in \{1, 2, \dots, N\}$ is picked, who minimizes $V_{i_r}(\cdot)$ w.r.t. its local variable $x_{i_r}$, followed by a dual ascent step for $\lambda_{i_r}$. The rest of the agents update their local variables by simply setting them to $z$. The cluster center then minimizes $L(x, z; \lambda)$ with respect to $z$. See Algorithm 1 for details.

Algorithm 1 NESTT-G Algorithm
1: for $r = 1$ to $R$ do
2:   Pick $i_r \in \{1, 2, \dots, N\}$ with probability $p_{i_r}$ and update $(x, \lambda)$:
       $x^{r+1}_{i_r} = \arg\min_{x_{i_r}} V_{i_r}(x_{i_r}, z^r, \lambda^r_{i_r})$;   (2.4)
       $\lambda^{r+1}_{i_r} = \lambda^r_{i_r} + \alpha_{i_r}\eta_{i_r}\big(x^{r+1}_{i_r} - z^r\big)$;   (2.5)
       $\lambda^{r+1}_j = \lambda^r_j, \quad x^{r+1}_j = z^r, \quad \forall\, j \ne i_r$;   (2.6)
     Update $z$:
       $z^{r+1} = \arg\min_{z \in Z} L(\{x^{r+1}_i\}, z; \lambda^r)$.   (2.7)
3: end for
4: Output: $(z^m, x^m, \lambda^m)$ where $m$ is randomly picked from $\{1, 2, \dots, R\}$.

We remark that NESTT-G is related to the popular ADMM method for convex optimization [6]. However our particular update schedule (randomly picking $(x_i, \lambda_i)$ plus deterministically updating $z$), combined with the special $x$-step (minimizing an approximation of $L(\cdot)$ evaluated at a different block variable $z$) was not known before. These features are critical in our following rate analysis.

Convergence Analysis. To proceed, let us define $r(j)$ as the last iteration in which the $j$th block is picked before iteration $r + 1$, i.e. $r(j) := \max\{t \mid t < r + 1, \; j = i(t)\}$. Define $y^r_j := z^{r(j)}$ if $j \ne i_r$, and $y^r_{i_r} = z^r$. Define the filtration $\mathcal{F}^r$ as the $\sigma$-field generated by $\{i(t)\}_{t=1}^{r-1}$.

A few important observations are in order. Combining the $(x, z)$ updates (2.4)-(2.7), we have

$$x^{r+1}_q = z^r - \frac{1}{\alpha_q \eta_q}\Big(\lambda^r_q + \frac{1}{N}\nabla g_q(z^r)\Big), \qquad \frac{1}{N}\nabla g_q(z^r) + \lambda^r_q + \alpha_q\eta_q\big(x^{r+1}_q - z^r\big) = 0, \quad \text{with } q = i_r; \qquad (2.8a)$$

$$\lambda^{r+1}_{i_r} = -\frac{1}{N}\nabla g_{i_r}(z^r), \qquad \lambda^{r+1}_j = -\frac{1}{N}\nabla g_j(z^{r(j)}), \; \forall\, j \ne i_r, \quad \Rightarrow \quad \lambda^{r+1}_i = -\frac{1}{N}\nabla g_i(y^r_i), \; \forall\, i; \qquad (2.8b)$$

$$x^{r+1}_j \overset{(2.6)}{=} z^r \overset{(2.8b)}{=} z^r - \frac{1}{\alpha_j \eta_j}\Big(\lambda^r_j + \frac{1}{N}\nabla g_j(z^{r(j)})\Big), \quad \forall\, j \ne i_r. \qquad (2.8c)$$
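The closed-form updates (2.8a)-(2.8b) make Algorithm 1 straightforward to prototype. Below is a minimal sketch under simplifying assumptions that are not in the paper: $g_0 \equiv 0$, $Z = \mathbb{R}^d$ and $p(z) = \mu\|z\|_1$, so the $z$-step (2.7) reduces to soft-thresholding of $\beta\sum_i(\eta_i x^{r+1}_i + \lambda^r_i)$ with $\beta = 1/\sum_i \eta_i$. The dual variables are initialized to $-\frac{1}{N}\nabla g_i(z^0)$, which is also an assumption. The names `nestt_g` and `soft_threshold` are hypothetical.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def nestt_g(grads, z0, eta, alpha, p, mu=0.0, iters=1000, seed=0):
    """Sketch of NESTT-G with g_0 = 0, Z = R^d, p(z) = mu * ||z||_1.

    grads : list of N callables, grads[i](z) -> gradient of g_i at z
    eta, alpha, p : arrays of shape (N,)
    """
    rng = np.random.default_rng(seed)
    N = len(grads)
    beta = 1.0 / eta.sum()
    z = z0.astype(float).copy()
    # Assumed initialization: lambda_i = -(1/N) grad g_i(z^0).
    lam = np.stack([-g(z) / N for g in grads])
    for _ in range(iters):
        i = rng.choice(N, p=p)
        x = np.tile(z, (N, 1))                      # (2.6): x_j = z^r for j != i
        step = alpha[i] * eta[i]
        x[i] = z - (lam[i] + grads[i](z) / N) / step   # (2.8a)
        # z-step (2.7) uses the old multipliers lambda^r.
        u = beta * (eta @ x + lam.sum(axis=0))
        lam[i] = lam[i] + step * (x[i] - z)         # (2.5); equals -(1/N) grad g_i(z^r)
        z = soft_threshold(u, mu * beta)
    return z, lam
```

After the dual step, `lam[i]` stores minus the scaled gradient of $g_i$ at the point where block $i$ was last visited, which is the "memory" interpretation of (2.8b).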

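The key here is that the dual variables serve as the "memory" for the past gradients of the $g_i$'s. To proceed, we first construct a potential function using an upper bound of $L(x, z; \lambda)$. Note that

$$\frac{1}{N} g_j(x^{r+1}_j) + \langle \lambda^r_j, x^{r+1}_j - z^r\rangle + \frac{\eta_j}{2}\|x^{r+1}_j - z^r\|^2 = \frac{1}{N} g_j(z^r), \quad \forall\, j \ne i_r, \qquad (2.9)$$

$$\frac{1}{N} g_{i_r}(x^{r+1}_{i_r}) + \langle \lambda^r_{i_r}, x^{r+1}_{i_r} - z^r\rangle + \frac{\eta_{i_r}}{2}\|x^{r+1}_{i_r} - z^r\|^2 \overset{(i)}{\le} \frac{1}{N} g_{i_r}(z^r) + \frac{\eta_{i_r} + L_{i_r}/N}{2}\|x^{r+1}_{i_r} - z^r\|^2 \overset{(ii)}{=} \frac{1}{N} g_{i_r}(z^r) + \frac{\eta_{i_r} + L_{i_r}/N}{2(\alpha_{i_r}\eta_{i_r})^2}\Big\|\frac{1}{N}\big(\nabla g_{i_r}(y^{r-1}_{i_r}) - \nabla g_{i_r}(z^r)\big)\Big\|^2, \qquad (2.10)$$

where (i) uses (2.8b) and applies the descent lemma on the function $\frac{1}{N} g_i(\cdot)$; in (ii) we have used (2.5) and (2.8b). Since each $i$ is picked with probability $p_i$, we have

$$\mathbb{E}_{i_r}\big[L(x^{r+1}, z^r; \lambda^r) \mid \mathcal{F}^r\big] \le \sum_{i=1}^N \frac{1}{N} g_i(z^r) + \sum_{i=1}^N \frac{p_i(\eta_i + L_i/N)}{2(\alpha_i\eta_i)^2}\Big\|\frac{1}{N}\big(\nabla g_i(y^{r-1}_i) - \nabla g_i(z^r)\big)\Big\|^2 + g_0(z^r) + h(z^r)$$
$$\le \sum_{i=1}^N \frac{1}{N} g_i(z^r) + \sum_{i=1}^N \frac{3 p_i \eta_i}{(\alpha_i\eta_i)^2}\Big\|\frac{1}{N}\big(\nabla g_i(y^{r-1}_i) - \nabla g_i(z^r)\big)\Big\|^2 + g_0(z^r) + h(z^r) := Q^r,$$

where in the last inequality we have used Assumption A-(c). In the following, we will use $\mathbb{E}_{\mathcal{F}^r}[Q^r]$ as the potential function, and show that it decreases at each iteration.

Lemma 2.1. Suppose Assumption A holds, and pick

$$\alpha_i = p_i = \beta\eta_i, \quad \text{where } \beta := \frac{1}{\sum_{i=1}^N \eta_i}, \quad \text{and} \quad \eta_i \ge \frac{9 L_i}{N p_i}, \quad i = 1, \dots, N. \qquad (2.11)$$

Then the following descent estimate holds true for NESTT-G:

$$\mathbb{E}\big[Q^r - Q^{r-1} \mid \mathcal{F}^{r-1}\big] \le -\frac{\sum_{i=1}^N \eta_i}{8}\,\mathbb{E}_{z^r}\|z^r - z^{r-1}\|^2 - \sum_{i=1}^N \frac{1}{2\eta_i}\Big\|\frac{1}{N}\big(\nabla g_i(z^{r-1}) - \nabla g_i(y^{r-2}_i)\big)\Big\|^2. \qquad (2.12)$$

Sublinear Convergence. Define the optimality gap as the following:

$$\mathbb{E}[G^r] := \mathbb{E}\big[\|\tilde{\nabla}_{1/\beta} f(z^r)\|^2\big] = \frac{1}{\beta^2}\,\mathbb{E}\Big[\big\|z^r - \operatorname{prox}^{1/\beta}_h\big[z^r - \beta\nabla\big(g(z^r) + g_0(z^r)\big)\big]\big\|^2\Big]. \qquad (2.13)$$

Note that when $h, g_0 \equiv 0$, $\mathbb{E}[G^r]$ reduces to $\mathbb{E}[\|\nabla g(z^r)\|^2]$. We have the following result.

Theorem 2.1. Suppose Assumption A holds, and pick (for $i = 1, \dots, N$)

$$\alpha_i = p_i = \frac{\sqrt{L_i/N}}{\sum_{j=1}^N \sqrt{L_j/N}}, \qquad \eta_i = 3\Big(\sum_{j=1}^N \sqrt{L_j/N}\Big)\sqrt{L_i/N}, \qquad \beta = \frac{1}{3\big(\sum_{j=1}^N \sqrt{L_j/N}\big)^2}. \qquad (2.14)$$

Then every limit point generated by NESTT-G is a stationary solution of problem (1.2). Further,

1) $\mathbb{E}[G^m] \le \dfrac{80}{3}\Big(\sum_{i=1}^N \sqrt{L_i/N}\Big)^2 \dfrac{\mathbb{E}[Q^1 - Q^{R+1}]}{R}$;

2) $\mathbb{E}[G^m] + \mathbb{E}\Big[\sum_{i=1}^N 3\eta_i^2\|x^m_i - z^{m-1}\|^2\Big] \le \dfrac{80}{3}\Big(\sum_{i=1}^N \sqrt{L_i/N}\Big)^2 \dfrac{\mathbb{E}[Q^1 - Q^{R+1}]}{R}$.

Note that Part (1) is useful in the centralized finite-sum minimization setting, as it shows the sublinear convergence of NESTT-G, measured only by the primal optimality gap evaluated at $z^r$. Meanwhile, Part (2) is useful in the distributed setting, as it also shows that the expected constraint violation, which measures the consensus among agents, shrinks in the same order.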
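For intuition about the gap in (2.13), here is a small sketch that evaluates $G = \|\tilde{\nabla}_{1/\beta} f(z)\|^2$ in the special case $Z = \mathbb{R}^d$ and $p(z) = \mu\|z\|_1$, where the prox is soft-thresholding with threshold $\mu\beta$. This special case and the names `optimality_gap`, `grad_smooth` are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def optimality_gap(grad_smooth, z, beta, mu):
    """Squared prox-gradient norm of (2.13) for h(z) = mu * ||z||_1.

    grad_smooth : callable returning the gradient of g + g_0 at z
    """
    u = z - beta * grad_smooth(z)            # forward (gradient) step
    z_plus = soft_threshold(u, mu * beta)    # backward (prox) step
    return float(np.sum(((z - z_plus) / beta) ** 2))
```

With `mu = 0` the value is the squared gradient norm, matching the remark after (2.13); at a minimizer of a composite convex problem it is zero for any `beta > 0`.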
We also comment that the above result suggests that to achieve an $\epsilon$-stationary solution, NESTT-G requires about $O\big((\sum_{i=1}^N \sqrt{L_i/N})^2/\epsilon\big)$ gradient evaluations (for simplicity we have ignored an additive $N$ factor for evaluating the gradient of the entire function at the initial step of the algorithm).
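The parameter choice (2.14) and the resulting scaling can be computed directly from the Lipschitz constants. The sketch below is illustrative only: the helper names are hypothetical, and the second function simply returns the two quantities $(\sum_i \sqrt{L_i/N})^2$ and $\sum_i L_i$ that appear in the gradient-evaluation counts for NESTT-G and gradient descent.

```python
import numpy as np

def nestt_g_parameters(L):
    """Return (p, alpha, eta, beta) following (2.14) for Lipschitz constants L."""
    L = np.asarray(L, dtype=float)
    N = L.size
    s = np.sqrt(L / N)          # sqrt(L_i / N)
    S = s.sum()                 # sum_j sqrt(L_j / N)
    p = s / S                   # sampling probabilities, proportional to sqrt(L_i)
    alpha = p.copy()            # alpha_i = p_i
    eta = 3.0 * S * s
    beta = 1.0 / (3.0 * S ** 2)
    return p, alpha, eta, beta

def scaling_constants(L):
    """Return ((sum_i sqrt(L_i/N))^2, sum_i L_i): NESTT-G vs. GD scaling."""
    L = np.asarray(L, dtype=float)
    N = L.size
    return float(np.sqrt(L / N).sum() ** 2), float(L.sum())
```

As a usage example, with $N = 100$, one component having $L_i = N^2$ and the rest $L_i = 1$, the two constants are about 396 and 10099, while for uniform $L_i = 1$ both equal $N$.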

It is interesting to observe that our choice of $p_i$ is proportional to the square root of the Lipschitz constant of each component function, rather than to $L_i$. Because of this choice of the sampling probability, the derived convergence rate has a mild dependency on $N$ and the $L_i$'s. Compared with the conventional gradient-based methods, our scaling can be up to $N$ times better. Detailed discussion and comparison will be given in Section 4. Note that similar sublinear convergence rates can be obtained for the case $\alpha_i = 1$ for all $i$ (with different scaling constants). However due to space limitation, we will not present those results here.

Linear Convergence. In this section we show that NESTT-G is capable of linear convergence for a family of nonconvex quadratic problems, which has important applications, for example in high-dimensional statistical learning [16]. To proceed, we will assume the following.

B-(a) Each function $g_i(z)$ is a quadratic function of the form $g_i(z) = \frac{1}{2} z^T A_i z + \langle b, z\rangle$, where $A_i$ is a symmetric matrix but not necessarily positive semidefinite;

B-(b) The feasible set $Z$ is a closed compact polyhedral set;

B-(c) The nonsmooth function $p(z) = \mu\|z\|_1$, for some $\mu \ge 0$.

Our linear convergence result is based upon a certain error bound condition around the stationary solution set, which has been shown in [18] for smooth quadratic problems and has been extended to include the $\ell_1$ penalty in [25, Theorem 4]. Due to space limitation the statement of the condition will be given in the supplemental material, along with the proof of the following result.

Theorem 2.2. Suppose that Assumptions A, B are satisfied. Then the sequence $\{\mathbb{E}[Q^{r+1}]\}_{r=1}^{\infty}$ converges Q-linearly (see footnote 4) to some $Q^* = f(z^*)$, where $z^*$ is a stationary solution for problem (1.1). That is, there exist a finite $\bar{r} > 0$ and $\rho \in (0, 1)$ such that for all $r \ge \bar{r}$, $\mathbb{E}[Q^{r+1} - Q^*] \le \rho\,\mathbb{E}[Q^r - Q^*]$.

Footnote 4: A sequence $\{x^r\}$ is said to converge Q-linearly to some $\bar{x}$ if $\limsup_r \|x^{r+1} - \bar{x}\| / \|x^r - \bar{x}\| \le \rho$, where $\rho \in (0, 1)$ is some constant; cf. [25] and references therein.

Linear convergence of this type for problems satisfying Assumption B has been shown for (deterministic) proximal gradient based methods [25, Theorem 2, 3]. To the best of our knowledge, this is the first result that shows the same linear convergence for a stochastic and distributed algorithm.

3 The NESTT-E Algorithm

Algorithm Description. In this section, we present a variant of NESTT-G, which is named NESTT with Exact minimization (NESTT-E). Our motivation is the following. First, in NESTT-G every agent should update its local variable at every iteration [cf. (2.4) or (2.6)]. In practice this may not be possible; for example, at any given time a few agents can be in sleeping mode so they cannot perform (2.6). Second, in the distributed setting it has been generally observed (e.g., see [8, Section V]) that performing exact minimization (whenever possible) instead of taking gradient steps for local problems can significantly speed up the algorithm. The NESTT-E algorithm presented in this section is designed to address these issues. To proceed, let us define a new function as follows:

$$U(x, z; \lambda) := \sum_{i=1}^N U_i(x_i, z; \lambda_i) := \sum_{i=1}^N \left(\frac{1}{N} g_i(x_i) + \langle \lambda_i, x_i - z\rangle + \frac{\alpha_i\eta_i}{2}\|x_i - z\|^2\right).$$

Algorithm 2 NESTT-E Algorithm
1: for $r = 1$ to $R$ do
2:   Update $z$ by minimizing the augmented Lagrangian:
       $z^{r+1} = \arg\min_z L(x^r, z; \lambda^r)$.
3:   Randomly pick $i_r \in \{1, 2, \dots, N\}$ with probability $p_{i_r}$:
       $x^{r+1}_{i_r} = \arg\min_{x_{i_r}} U_{i_r}(x_{i_r}, z^{r+1}; \lambda^r_{i_r})$;   (3.16)
       $\lambda^{r+1}_{i_r} = \lambda^r_{i_r} + \alpha_{i_r}\eta_{i_r}\big(x^{r+1}_{i_r} - z^{r+1}\big)$;   (3.17)
       $x^{r+1}_j = x^r_j, \quad \lambda^{r+1}_j = \lambda^r_j, \quad \forall\, j \ne i_r$.
4: end for
5: Output: $(z^m, x^m, \lambda^m)$ where $m$ is randomly picked from $\{1, 2, \dots, R\}$.
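For quadratic components the exact $x$-step (3.16) has a closed form, which makes Algorithm 2 easy to prototype. The following is a rough sketch under assumptions not taken from the paper: $g_i(x) = \frac{1}{2}x^T A_i x + \langle b_i, x\rangle$, $g_0 \equiv 0$, $Z = \mathbb{R}^d$, $p(z) = \mu\|z\|_1$, and zero initial multipliers. The stationarity condition of $U_i$ then gives the linear system $(A_i/N + \alpha_i\eta_i I)\,x_i = \alpha_i\eta_i z - \lambda_i - b_i/N$. The name `nestt_e_quadratic` is hypothetical.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def nestt_e_quadratic(A, b, z0, eta, alpha, p, mu=0.0, iters=1000, seed=0):
    """Sketch of NESTT-E for g_i(x) = 0.5 x^T A_i x + <b_i, x>, g_0 = 0.

    A : list of N symmetric (d, d) arrays; b : array of shape (N, d)
    """
    rng = np.random.default_rng(seed)
    N, d = len(A), z0.size
    beta = 1.0 / eta.sum()
    x = np.tile(z0.astype(float), (N, 1))
    lam = np.zeros((N, d))                 # assumed initialization
    eye = np.eye(d)
    z = z0.astype(float).copy()
    for _ in range(iters):
        # z-step: minimize L(x^r, z; lambda^r) over z.
        z = soft_threshold(beta * (eta @ x + lam.sum(axis=0)), mu * beta)
        i = rng.choice(N, p=p)
        c = alpha[i] * eta[i]
        # Exact x-step (3.16): solve (A_i/N + c I) x = c z - lambda_i - b_i/N.
        x[i] = np.linalg.solve(A[i] / N + c * eye, c * z - lam[i] - b[i] / N)
        lam[i] = lam[i] + c * (x[i] - z)   # dual ascent step (3.17)
    return z, x, lam
```

The linear system is positive definite whenever $\alpha_i\eta_i$ exceeds the most negative eigenvalue of $A_i/N$ in magnitude, which holds for instance when $\alpha_i \ge 1$ and $\eta_i > L_i/N$.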

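Note that if $\alpha_i = 1$ for all $i$, then $L(x, z; \lambda) = U(x, z; \lambda) + g_0(z) + h(z)$. The algorithm details are presented in Algorithm 2.

Convergence Analysis. We begin analyzing NESTT-E. The proof technique is quite different from that for NESTT-G, and it is based upon using the expected value of the augmented Lagrangian function as the potential function; see [11, 12, 13]. For ease of description we define the following quantities:

$$w := (x, z, \lambda), \qquad \beta := \frac{1}{\sum_{i=1}^N \eta_i}, \qquad c_i := \frac{L_i^2}{\alpha_i\eta_i N^2} - \frac{\gamma_i}{2} + \frac{1 - \alpha_i}{\alpha_i}\frac{L_i}{N}, \qquad \alpha := \{\alpha_i\}_{i=1}^N.$$

To measure the optimality of NESTT-E, define the prox-gradient of $L(x, z; \lambda)$ as:

$$\tilde{\nabla} L(w) = \Big[\big(z - \operatorname{prox}_h[z - \nabla_z(L(w) - h(z))]\big); \; \nabla_{x_1} L(w); \; \cdots; \; \nabla_{x_N} L(w)\Big] \in \mathbb{R}^{(N+1)d}. \qquad (3.19)$$

We define the optimality gap by adding to $\|\tilde{\nabla} L(w)\|^2$ the size of the constraint violation [13]:

$$H(w^r) := \|\tilde{\nabla} L(w^r)\|^2 + \sum_{i=1}^N \frac{L_i^2}{N^2}\|x^r_i - z^r\|^2.$$

It can be verified that $H(w^r) \to 0$ implies that $w^r$ reaches a stationary solution for problem (1.2). We have the following theorem regarding the convergence properties of NESTT-E.

Theorem 3.1. Suppose Assumption A holds, and that $(\eta_i, \alpha_i)$ are chosen such that $c_i < 0$. Then for some constant $\underline{f}$, we have

$$\mathbb{E}[L(w^r)] \ge \mathbb{E}[L(w^{r+1})] \ge \underline{f} > -\infty, \quad \forall\, r \ge 0.$$

Further, almost surely every limit point of $\{w^r\}$ is a stationary solution of problem (1.2). Finally, for some function of $\alpha$ denoted as $C(\alpha) = \sigma_1(\alpha)/\sigma_2(\alpha)$, we have the following:

$$\mathbb{E}[H(w^m)] \le \frac{C(\alpha)\,\mathbb{E}[L(w^1) - L(w^{R+1})]}{R}, \qquad (3.20)$$

where $\sigma_1 := \max(\hat{\sigma}_1(\alpha), \tilde{\sigma}_1)$ and $\sigma_2 := \max(\hat{\sigma}_2(\alpha), \tilde{\sigma}_2)$, and these constants are given by

$$\hat{\sigma}_1(\alpha) = \max_i \left\{4\left(\frac{L_i^2}{N^2} + \eta_i^2 + \Big(\frac{1}{\alpha_i} - 1\Big)^2\frac{L_i^2}{N^2}\right) + 3\frac{L_i^4}{\alpha_i\eta_i^2 N^4}\right\}, \qquad \tilde{\sigma}_1 = \sum_{i=1}^N 4\eta_i^2 + \Big(2 + \sum_{i=1}^N \eta_i + L_0\Big)^2 + 3\sum_{i=1}^N \frac{L_i^2}{N^2},$$

$$\hat{\sigma}_2(\alpha) = \max_i \left\{p_i\left(\frac{\gamma_i}{2} - \frac{L_i^2}{N^2\alpha_i\eta_i} - \frac{1 - \alpha_i}{\alpha_i}\frac{L_i}{N}\right)\right\}, \qquad \tilde{\sigma}_2 = \frac{\sum_{i=1}^N \eta_i - L_0}{2}.$$

We remark that the above result shows the sublinear convergence of NESTT-E to the set of stationary solutions. Note that $\gamma_i = \eta_i - L_i/N$; to satisfy $c_i < 0$, a simple derivation yields

$$\eta_i > \frac{L_i\Big((2 - \alpha_i) + \sqrt{(\alpha_i - 2)^2 + 8\alpha_i}\Big)}{2N\alpha_i}.$$

Further, the above result characterizes the dependency of the rates on various parameters of the algorithm. For example, to see the effect of $\alpha$ on the convergence rate, let us set $p_i = \frac{L_i}{\sum_{i=1}^N L_i}$ and $\eta_i = 3L_i/N$, and assume $L_0 = 0$; then consider two different choices of $\alpha$: $\hat{\alpha}_i = 1, \forall\, i$ and $\tilde{\alpha}_i = 4, \forall\, i$.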
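The step-size condition above follows from multiplying $c_i < 0$ by $2\alpha_i\eta_i$ and solving the resulting quadratic inequality in $\eta_i$. The sketch below checks it numerically for the two choices of $\alpha$ just mentioned; the function names are illustrative and `ell` denotes $L_i/N$.

```python
import math

def c_value(L_i, N, eta_i, alpha_i):
    """c_i = L_i^2 / (alpha_i eta_i N^2) - gamma_i / 2 + (1 - alpha_i) / alpha_i * L_i / N."""
    ell = L_i / N
    gamma_i = eta_i - ell                      # gamma_i = eta_i - L_i / N
    return ell ** 2 / (alpha_i * eta_i) - gamma_i / 2 + (1 - alpha_i) / alpha_i * ell

def eta_lower_bound(L_i, N, alpha_i):
    """Smallest eta_i (exclusive) for which c_i < 0."""
    ell = L_i / N
    root = math.sqrt((alpha_i - 2) ** 2 + 8 * alpha_i)
    return ell * ((2 - alpha_i) + root) / (2 * alpha_i)

for a in (1.0, 4.0):
    bound = eta_lower_bound(1.0, 1, a)
    # eta_i = 3 L_i / N lies above the bound for both choices of alpha.
    print(a, bound, c_value(1.0, 1, 3.0, a) < 0)
```

For $\alpha_i = 1$ the bound is $2L_i/N$ and for $\alpha_i = 4$ it is $L_i/(2N)$, so in the latter case the requirement $\eta_i > L_i/N$ from Assumption A-(c) is the binding one.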
One can easily check that applying these different choices leads to the following results: for the first choice, $C(\alpha) = 49\sum_{i=1}^N L_i/N$; for the second, $C(\alpha) = 8\sum_{i=1}^N L_i/N$. The key observation is that increasing the $\alpha_i$'s reduces the constant in front of the rate. Hence, we expect that in practice larger $\alpha_i$'s will yield faster convergence.

4 Connections and Comparisons with Existing Works

In this section we compare NESTT-G/E with a few existing algorithms in the literature. First, we present a somewhat surprising observation: NESTT-G takes the same form as some well-known algorithms for convex finite-sum problems. To formally state this relation, we show in the following result that NESTT-G in fact admits a compact primal-only characterization.

Proposition 4.1. NESTT-G can be written in the following compact form:

$$z^{r+1} = \arg\min_z \; h(z) + g_0(z) + \frac{1}{2\beta}\|z - u^{r+1}\|^2, \qquad (4.21a)$$

$$\text{with } u^{r+1} := z^r - \beta\left(\frac{1}{N\alpha_{i_r}}\big(\nabla g_{i_r}(z^r) - \nabla g_{i_r}(y^{r-1}_{i_r})\big) + \frac{1}{N}\sum_{i=1}^N \nabla g_i(y^{r-1}_i)\right). \qquad (4.21b)$$

Based on this observation, the following comments are in order.

1) Suppose $h \equiv 0$, $g_0 \equiv 0$ and $\alpha_i = 1$, $p_i = 1/N$ for all $i$. Then (4.21) takes the same form as the SAG presented in [22]. Further, when the component functions $g_i$ are picked cyclically in a Gauss-Seidel manner, the iteration (4.21) takes the same form as the IAG algorithm [5].

2) Suppose $h \ne 0$ and $g_0 \equiv 0$, and $\alpha_i = p_i = 1/N$ for all $i$. Then (4.21) is the same as the SAGA algorithm [9], which is designed for optimizing convex nonsmooth finite sum problems.

Note that SAG/SAGA/IAG are all designed for convex problems. Through the lens of primal-dual splitting, our work shows that they can be generalized to nonconvex nonsmooth problems as well.

Secondly, NESTT-E is related to the proximal version of the nonconvex ADMM of [13]. However, the introduction of the $\alpha_i$'s is new; it can significantly improve the practical performance but complicates the analysis. Further, there has been no counterpart of the sublinear and linear convergence rate analysis for the stochastic version of the algorithm in [13].

Thirdly, we note that a recent paper [21] has shown that SAGA works for smooth and unconstrained nonconvex problems. Suppose that $h \equiv 0$, $g_0 \equiv 0$, $L_i = L_j, \forall\, i, j$ and $\alpha_i = p_i = 1/N$; the authors show that SAGA achieves $\epsilon$-stationarity using $O\big(N^{2/3}(\sum_{i=1}^N L_i/N)/\epsilon\big)$ gradient evaluations. Compared with GD, which achieves $\epsilon$-stationarity using $O(\sum_{i=1}^N L_i/\epsilon)$ gradient evaluations in the worst case (in the sense that $\sum_{i=1}^N L_i/N = L$), the rate in [21] is $O(N^{1/3})$ times better. However, the algorithm in [21] is different from NESTT-G in two aspects: 1) it does not generalize to the nonsmooth constrained problem (1.1); 2) it samples two component functions at each iteration, while NESTT-G only samples once. Further, the analysis and the scaling are derived for the case of uniform $L_i$'s, therefore it is not clear how the algorithm and the rates can be adapted to the non-uniform case. On the other hand, our NESTT works for the general nonsmooth constrained setting. The non-uniform sampling used in NESTT-G is well-suited for problems with non-uniform $L_i$'s, and our scaling can be up to $N$ times better than GD (or its proximal version) in the worst case. Note that problems with non-uniform $L_i$'s for the component functions are common in applications such as sparse optimization and signal processing. For example, in the LASSO problem the data matrix is often normalized by feature (or "column-normalized" [19]), therefore the $\ell_2$ norm of each row of the data matrix (which corresponds to the Lipschitz constant for each component function) can be dramatically different.

In Table 1 we list the comparison of the number of gradient evaluations for NESTT-G and GD, in the worst case (in the sense that $\sum_{i=1}^N L_i/N = L$). For simplicity, we omitted an additive constant of $O(N)$ for computing the initial gradients.

Table 1: Comparison of # of gradient evaluations for NESTT-G and GD in the worst case

Setting | NESTT-G | GD
# of Gradient Evaluations | $O\big((\sum_{i=1}^N \sqrt{L_i/N})^2/\epsilon\big)$ | $O(\sum_{i=1}^N L_i/\epsilon)$
Case I: $L_i = 1, \forall\, i$ | $O(N/\epsilon)$ | $O(N/\epsilon)$
Case II: $O(\sqrt{N})$ terms with $L_i = N$, the rest with $L_i = 1$ | $O(N/\epsilon)$ | $O(N^{3/2}/\epsilon)$
Case III: $O(1)$ terms with $L_i = N^2$, the rest with $L_i = 1$ | $O(N/\epsilon)$ | $O(N^2/\epsilon)$

5 Numerical Results

In this section we evaluate the performance of NESTT. Consider the high dimensional regression problem with noisy observation [16], where $M$ observations are generated by $y = X\nu + \epsilon$. Here $y \in \mathbb{R}^M$ is the observed data sample; $X \in \mathbb{R}^{M \times P}$ is the covariate matrix; $\nu \in \mathbb{R}^P$ is the ground truth, and $\epsilon \in \mathbb{R}^M$ is the noise. Suppose that the covariate matrix is not perfectly known, i.e., we observe $A = X + W$ where $W \in \mathbb{R}^{M \times P}$ is the noise matrix with known covariance matrix $\Sigma_W$. Let us define $\hat{\Gamma} := \frac{1}{M}(A^T A) - \Sigma_W$, and $\hat{\gamma} := \frac{1}{M}(A^T y)$. To estimate the ground truth $\nu$, let

us consider the following (nonconvex) optimization problem posed in [16] (where $R > 0$ controls sparsity):

$$\min_z \; z^T\hat{\Gamma}z - \hat{\gamma}z \quad \text{s.t. } \|z\|_1 \le R. \qquad (5.22)$$

Due to the existence of noise, $\hat{\Gamma}$ is not positive semidefinite, hence the problem is not convex. Note that this problem satisfies Assumptions A and B, so by Theorem 2.2 NESTT-G converges Q-linearly.

To test the performance of the proposed algorithm, we generate the problem following similar setups as [16]. Let $X = (X^1; \cdots; X^N) \in \mathbb{R}^{M \times P}$ with $\sum_i N_i = M$, where each $X^i \in \mathbb{R}^{N_i \times P}$ corresponds to $N_i$ data points and is generated from i.i.d. Gaussian. Here $N_i$ represents the size of each mini-batch of samples. Generate the observations $y^i = X^i\nu^* + \epsilon^i \in \mathbb{R}^{N_i}$, where $\nu^*$ is a $K$-sparse vector to be estimated, and $\epsilon^i \in \mathbb{R}^{N_i}$ is the random noise. Let $W = [W^1; \cdots; W^N]$, with $W^i \in \mathbb{R}^{N_i \times P}$ generated with i.i.d. Gaussian. Therefore we have $z^T\hat{\Gamma}z = \frac{1}{N}\sum_{i=1}^N \frac{N}{M} z^T\big((X^i)^T X^i - (W^i)^T W^i\big)z$. We set M = 00,000, P = 5000, N = 50, K = P, and $R = \|\nu^*\|_1$. We implement NESTT-G/E, SGD, and the nonconvex SAGA proposed in [21] with stepsize $\beta = \frac{1}{3L_{\max}N^{2/3}}$ (with $L_{\max} := \max_i L_i$). Note that the SAGA proposed in [21] only works for unconstrained problems with uniform $L_i$, therefore when applied to (5.22) it is not guaranteed to converge. Here we only include it for comparison purposes.

In Fig. 1 we compare different algorithms in terms of the gap $\|\tilde{\nabla}_{1/\beta}f(z^r)\|^2$. In the left figure we consider the problem with $N_i = N_j$ for all $i, j$, and we show the performance of the proposed algorithms with uniform sampling (i.e., the probability of picking the $i$th block is $p_i = 1/N$). On the right one we consider problems in which approximately half of the component functions have $L_i$'s twice as large as the rest, and consider the non-uniform sampling ($p_i = \sqrt{L_i/N}/\sum_{i=1}^N \sqrt{L_i/N}$). Clearly in both cases the proposed algorithms perform quite well. Furthermore, it is clear that NESTT-E performs well with large $\alpha := \{\alpha_i\}_{i=1}^N$, which confirms our theoretical rate analysis. Also it is worth mentioning that when the $N_i$'s are non-uniform, the proposed algorithms [NESTT-G and NESTT-E (with α = 0)] significantly outperform SAGA and SGD. In Table 2 we further compare different algorithms when changing the number of component functions (i.e., the number of mini-batches $N$) while the rest of the setup is as above. We run each algorithm with 00 passes over the dataset. Similarly as before, our algorithms perform well, while SAGA seems to be sensitive to the uniformity of the size of the mini-batch [note that there is no convergence guarantee for SAGA applied to the nonconvex constrained problem (5.22)].

[Figure 1: Comparison of NESTT-G/E, SAGA, SGD on problem (5.22). The x-axis (# Grad/N) denotes the number of passes of the dataset; the y-axis is the optimality gap. Left panel: Uniform Sampling, $p_i = 1/N$. Right panel: Non-uniform Sampling, $p_i = \sqrt{L_i/N}/\sum_{i=1}^N \sqrt{L_i/N}$. Curves: SGD, NESTT-E (two settings of α), NESTT-G, SAGA.]

Table 2: Optimality gap $\|\tilde{\nabla}_{1/\beta}f(z^r)\|^2$ for different algorithms, with 00 passes of the datasets.
Columns: SGD (Uniform, Non-uni) | NESTT-E (α = 0) (Uniform, Non-uni) | NESTT-G (Uniform, Non-uni) | SAGA (Uniform, Non-uni)
E-6 6.6E-9.3E- 6.E-4.7E E-9 5.9E-9.E-0.9E- 7.7E E-6.7E-6 4.5E-7.4E-7.5E E-4 8.E-5.8E-5 3.E-5 4.E E.-4 7.E-4.E-4.7E-4.5E
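The data model behind (5.22) can be reproduced at toy scale to see why the quadratic term is indefinite. The sketch below is an illustration under assumed settings, not the paper's experiment: the dimensions, noise levels, and the choice $\Sigma_W = \sigma_w^2 I$ are hypothetical, and $P > M$ is chosen so that $\frac{1}{M}A^TA$ is rank deficient and $\hat{\Gamma}$ is guaranteed to have a negative eigenvalue.

```python
import numpy as np

def make_noisy_regression(M=50, P=80, K=5, sigma_w=0.5, sigma_eps=0.1, seed=0):
    """Generate (Gamma_hat, gamma_hat, nu) for a small noisy-covariate problem."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((M, P))            # true covariates
    nu = np.zeros(P)
    nu[:K] = rng.standard_normal(K)            # K-sparse ground truth
    y = X @ nu + sigma_eps * rng.standard_normal(M)
    W = sigma_w * rng.standard_normal((M, P))  # covariate noise
    A = X + W                                  # observed covariates
    Sigma_W = sigma_w ** 2 * np.eye(P)         # assumed known noise covariance
    Gamma_hat = A.T @ A / M - Sigma_W
    gamma_hat = A.T @ y / M
    return Gamma_hat, gamma_hat, nu

Gamma_hat, gamma_hat, nu = make_noisy_regression()
# A negative smallest eigenvalue confirms the objective is nonconvex.
print(np.linalg.eigvalsh(Gamma_hat).min() < 0)
```

Here the radius of the constraint set would be taken as `np.abs(nu).sum()`, mirroring the choice $R = \|\nu^*\|_1$ in the text.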

References

[1] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. 2016. Preprint, available on arXiv.
[2] A. Antoniadis, I. Gijbels, and M. Nikolova. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 63(3):585–615, 2009.
[3] D. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. LIDS Report 2848.
[4] E. Bjornson and E. Jorswieck. Optimal resource allocation in coordinated multi-cell systems. Foundations and Trends in Communications and Information Theory, 9, 2013.
[5] D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2007.
[6] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[7] V. Cevher, S. Becker, and M. Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 31(5):32–43, Sept. 2014.
[8] T.-H. Chang, M. Hong, and X. Wang. Multi-agent distributed optimization via inexact consensus ADMM. IEEE Transactions on Signal Processing, 63(2):482–497, Jan. 2015.
[9] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In the Proceedings of NIPS, 2014.
[10] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[11] D. Hajinezhad, T.-H. Chang, X. Wang, Q. Shi, and M. Hong. Nonnegative matrix factorization using ADMM: Algorithm and convergence analysis. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016.
[12] D. Hajinezhad and M. Hong. Nonconvex alternating direction method of multipliers for distributed sparse principal component analysis. In the Proceedings of GlobalSIP, 2015.
[13] M. Hong, Z.-Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1), 2016.
[14] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In the Proceedings of the Neural Information Processing (NIPS), 2013.
[15] G. Lan. An optimal randomized incremental gradient method. 2015. Preprint.
[16] P.-L. Loh and M. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3), 2012.
[17] P. D. Lorenzo and G. Scutari. NEXT: In-network nonconvex optimization. 2016. Preprint.
[18] Z.-Q. Luo and P. Tseng. On the linear convergence of descent methods for convex essentially smooth minimization. SIAM Journal on Control and Optimization, 30(2):408–425, 1992.
[19] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci., 27(4), 2012.
[20] M. Razaviyayn, M. Hong, Z.-Q. Luo, and J. S. Pang. Parallel successive convex approximation for nonsmooth nonconvex optimization. In the Proceedings of NIPS, 2014.
[21] S. J. Reddi, S. Sra, B. Poczos, and A. Smola. Fast incremental method for nonconvex optimization. 2016. Preprint, available on arXiv.
[22] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. 2013. Technical report, INRIA.
[23] S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14, 2013.
[24] S. Sra. Scalable nonconvex inexact proximal splitting. In Advances in Neural Information Processing Systems (NIPS), 2012.
[25] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387–423, 2009.
[26] Z. Wang, H. Liu, and T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of Statistics, 42(6):2164–2201, 2014.
[27] S. Zlobec. On the Liu-Floudas convexification of smooth programs. Journal of Global Optimization, 32:401–407.
