Inexact proximal stochastic gradient method for convex composite optimization

Size: px
Start display at page:

Download "Inexact proximal stochastic gradient method for convex composite optimization"

Transcription

1 Comput Optim Appl DOI /s Inexact proximal stochastic gradient method for convex composite optimization Xiao Wang 1 Shuxiong Wang Hongchao Zhang 3 Received: 9 September 016 Springer Science+Business Media, LLC 017 Abstract We study an inexact proximal stochastic gradient IPSG method for convex composite optimization, whose objective function is a summation of an average of a large number of smooth convex functions and a convex, but possibly nonsmooth, function. Variance reduction techniques are incorporated in the method to reduce the stochastic gradient variance. The main feature of this IPSG algorithm is to allow solving the proximal subproblems inexactly while still eeping the global convergence with desirable complexity bounds. Different subproblem stopping criteria are proposed. Global convergence and the component gradient complexity bounds are derived for the both cases when the objective function is strongly convex or just generally convex. Preliminary numerical experiment shows the overall efficiency of the IPSG algorithm. This research is partially supported by the National Natural Science Foundation of China and the National Science Foundation of USA B Xiao Wang wangxiao@ucas.ac.cn Shuxiong Wang wsx@lsec.cc.ac.cn Hongchao Zhang hozhang@math.lsu.edu hozhang 1 School of Mathematical Sciences, University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing , China Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijng , China 3 Department of Mathematics, Louisiana State University, 0 Locett Hall, Baton Rouge, LA , USA

2 X. Wang et al. Keywords Convex composite optimization Empirical ris minimization Stochastic gradient Inexact methods Global convergence Complexity bound Mathematics Subject Classification 47N10 65K10 1 Introduction In this paper, we consider the following composite optimization problem min Pw := f w + hw, 1.1 w R d where f is an average of many smooth component convex functions, that is f w = 1 n n f i w 1. with f i : R d R, i = 1,...,n, being convex not necessarily strongly convex and Lipschitz continuously differentiable, and h : R d R {+ }is a convex, but possibly nonsmooth, function. In addition, we assume the solution set of 1.1 is not empty and achieves the optimal function value P. Problem is also nown as regularized empirical ris minimization [3] and can arise in many applications, such as many inds of classification and regression models in machine learning. In these applications, a ey feature of the problem is that the total number of component functions, denoted by n, is very large or even huge such that it is often prohibitive to do the full function and/or gradient evaluations of the objective function at each iteration, which is often required by an iterative method. Problem 1.1 can be solved by standard proximal gradient methods. Given a starting point w 0 R d, such methods update iterates through { w +1 = arg min f w, w w +hw + 1 } w w w R d, 1.3 η where η > 0 is a proximal parameter. Since the full gradient of f is used in 1.3, the proximal gradient method is also referred as the proximal full gradient Prox-FG method. The Prox-FG methods have been shown superior to the classical subgradient methods for solving 1.1. To achieve an ε-solution w of the problem 1.1, i.e. P w P <ε, the standard iteration complexity bound for the Prox-FG method is Oε 1, while for subgradient method it is Oε. When combining with Nesterov s acceleration technique, the iteration complexity bound of the Prox-FG methods can achieve Oε 0.5. Extensive research has been carried out on proximal gradient methods, such as [,9,4]. However, in many cases it is hard or time-consuming to evaluate the full gradient f. So, proximal stochastic gradient Prox-SG methods become popular for solving 1.1. These methods usually update the iterates by

3 Inexact proximal stochastic gradient method for convex { w +1 = arg min g,w w +hw + 1 } w w w R d, η where g is an approximate gradient at w based on some random variable and its related expectation is equal to f x. For general convex case, based on Nesterov s optimal method, Lan [0] propose an unified optimal stochastic approximation algorithm which can achieve optimal complexity order Oɛ without particulary nowing the problem structure. In addition, the algorithm depends optimally on the Lipschitz-type constants of the problems. Then, Ghadimi and Lan [1,13] have studied the accelerated stochastic approximation methods for strongly convex case, which can achieve optimal or nearly optimal convergence rate with proper stepsizes. The Prox-SG methods can be also referred as randomized incremental proximal methods when g is defined as component gradient and the index is chosen in a certain random order. A good review of incremental proximal methods can be found in [4,5,34,35]. Although much computation could be saved by using the stochastic gradient at each iteration, its large variance may slow down the convergence speed. Hence, the variance reduction techniques aiming to reduce the variance of stochastic gradients have recently attracted much interest. The variance reduction techniques are originally proposed for improving classical stochastic gradient methods for solving with h 0. A stochastic average gradient SAG method, which taes an average of historical stochastic gradients at each iteration, is proposed in [9]. The On log1/ε component gradient complexity bound is achieved by this SAG method if f is strongly convex, Lipschitz continuously differentiable and n 8L/μ, where L is the Lipschitz constant and μ is the strongly convex modulus. When f i w = φ i a T i w, i = 1,...,n, with φ i being L i -smooth and h being μ-strongly convex, a stochastic dual coordinate ascent SDCA method is proposed in [31]to solve This SDCA method is shown to have the On + L max /μ log1/ε complexity, where L max = max i {L i }. Although both SAG and SDCA have lower computational cost at each iteration and achieve good convergence rate, they all have a relatively high Ond memory requirement. A stochastic variance reduced gradient SVRG method was introduced in [17] for solving smooth and strongly convex problem. This SVRG method uses an unbiased stochastic gradient constructed from the full gradient computed at a particular chosen point after every m > 0 iterations. With proper choice of m, the same convergence rate as [9,31] can be achieved, but only the Od memory is required. Another closely related algorithm is SAGA proposed in [8], which uses an unbiased stochastic gradient incorporated with average of past stochastic gradients. Hence, it could be considered as a method between SAG and SVRG. SAGA can be directly extended to solve the composite optimization when the objective function is not strongly convex. But, similar to SAG, SAGA also needs relatively high memory. Due to the small memory requirement and low component gradient complexity, SVRG-type methods have been paid more attention recently. In [36], a Prox-SVRG method is proposed to achieve On + L/μ log1/ε component gradient complexity for solving strongly convex composite optimization. When the objective function is not strongly convex, by applying the method on a slightly perturbed strongly convex problem, the n + L/εOlog1/ε complexity bound can be obtained. On the other hand, in [40]

4 X. Wang et al. a SVRG based Prox-SG method is designed to solve the non-strongly convex composite optimization directly with an improved On log1/ε + L/ε complexity bound. For other Prox-SG type methods, one may refer to [14,19,5]. In theory, all the aforementioned stochastic gradient methods require to solve the proximal subproblems exactly. However, when the function h is complicated and/or does not have special structure, the proximal subproblem often does not have closed form solution. Many optimization problems including topographic dictionary learning [18,3], CUR-lie factorization optimization [,3] and multi-tas learning problems [6,37], normally require the regularization term h reveal some group sparsity structure of variables. For example, it might tae the form of summation of -norm. In this case, it would be expensive or sometimes impossible to solve the subproblems exactly. On the other hand, since the subproblems are built on stochastic inexact gradients, it is also unnecessary to spend great efforts for solving the subproblems exactly, especially in the early iterations of an algorithm. To the best of our nowledge, there is no wor focusing on inexact Prox-SG methods yet. Hence, in this paper we would develop an inexact proximal stochastic gradient IPSG method which would guarantee global convergence and desirable complexity bounds. In particular, we show that although the subproblems are solved inexactly in the IPSG method, compared to the exact SVRG based Prox-SG methods, the same or almost the same component gradient complexity bounds can still be maintained. Our analysis is partly motivated from the deterministic inexact Prox-FG methods. One line of research on inexact Prox-FG methods focuses on using approximate proximal operators for which exact solution could be obtained relatively easily, such as the methods developed in [6,7,11,1]. The inexact Prox-FG methods developed in [15,8,30,33] are more closely related to our wor. Instead of constructing an approximate proximal operator, these methods allow to solve subproblems inexactly up to certain accuracy derived from theoretical analysis. However, most of these Prox-FG methods use deterministic exact full gradients, which as mentioned previously is not suitable for solving our problem Although the method in [30] allows inexact gradients, the analysis developed for Prox-FG method in [30] could not be directly applied to the inexact Prox-SG methods. One ey reason is that the variance of stochastic gradient may not meet the accuracy required in [30] for the inexact gradients. For example, without combining with particular treatment, the variance of stochastic gradient may not even reduce. Hence, in the similar spirit to the SVRG methods, we also apply the variance reduction techniques to control the variance of stochastic gradients. As a result, in this paper, we would propose an inexact Prox-SG method with variance reduction techniques for solving and develop its corresponding convergence theory. The remainder of this paper is organized as follows. In Sect., we present an inexact Prox-SG IPSG method for solving the convex composite optimization problem and discuss some of its basic properties. In Sect. 3, we give the convergence analysis of this IPSG method when f is strongly convex and provide its convergence rate under different accuracy criteria of solving the subproblem. In Sect. 
4, we extend our analysis to the case that f may not be strongly convex and give the corresponding convergence rate when the subproblems are solved inexactly. Some preliminary numerical experiment results are reported in Sect. 5. Finally, some conclusion remars are given in Sect. 6.

5 Inexact proximal stochastic gradient method for convex Notation The gradient of f is denoted as f. Given a vector x R d and p > 0, x p denotes its p-norm, while x = x represents its Euclidean norm. For x, y R d,weuse x, y to denote the Euclidean inner product of x and y. Given amatrixx R m n, X F is its Frobenius norm. For a real number a, a denotes the largest integer number less than or equal to a. Given a random variable X, E ξ [X] denotes the expectation of X with respect to ξ. Inexact Prox-SG method for In this section, we first give the definition of inexact solutions used in this paper for the proximal mapping problem: min qw := hw + 1 w R d η w y..1 Then, we will present our inexact Prox-SG algorithm for solving the convex composite optimization problem and develop some of its important properties needed in the later analysis. Let us first recall the definition of the ε-subdifferential of a convex function: Given a convex function φ : R d R, its ε-subdifferential at z, denoted as ε φz, is defined as { ε φz = y : φw φz y,w z ε for all w R d}. Now, we give our definition of inexact solutions used for the problem.1. Definition.1 Given ε > 0 and ˆε > 0, we call z to be an ε, ˆε-solution of the problem.1, if there exists u R d such that u η ε and y z u η ˆε hz.. By the definition of ε-subdifferential, when u η ε, it is not difficult to verify that z y + u η { 1 ε η w=z w y }. Hence, if z is an ε, ˆε-solution of the problem.1, by. wehave { 1 0 ε η w=z w y } + ˆε hz ε+ˆε qz. So, by the definition of ε-subdifferential again we have qw qz 0,w z ε +ˆε, for any w R d,

6 X. Wang et al. which implies qz min qw + ε +ˆε..3 w Rd Now, let us consider a particular case that h is an indicator function of a closed convex set C R d. For this case, similar observations have been made in [7,33]. Our purpose here is to compare the inexact solutions given by Definition.1 and other characterization of inexact solutions in the literature. When y C, the solution of proximal mapping.1 is trivial. Hence, we assume y / C. Given a tolerance ε>0, one often used condition for inexact solutions of.1istofindz such that 0 ε qz,.4 which is equivalent see. e.g., [3, Section 4.3] to require z satisfying qz min qw + ε..5 w Rd This condition.5 isusedin[30]. Note that when ε +ˆε ε, by.3, the inexact solution z given by Definition.1 will also satisfy.5. When z satisfies.5, we have z C and 1 η w y + ε 1 z y for any w C η z y Proj C y y + ηε, where Proj C is the projection on C. So, geometrically, z belongs to the intersection of C and the ball centered at y with radius Proj C y y + ηε see Fig. 1a. Another type of condition often used for inexact solutions of.1 see, e.g. in [7,33] requires z to satisfy y z ε hz,.6 η which is equivalent to z C and w z, y z ηε, for any w C..7 Note that the condition. in Definition.1 will reduce to.6 by setting ε = 0 and ˆε = ε. Geometrically, z satisfying.6 belongs to the intersection of C and the half-space determined by.7, for which the separating hyperplane is normal to y z and with distance ηε/ y z to z. see Fig. 1b. The third approximation criterion used in literature see, e.g. [7] for inexact solutions of.1 is to require z satisfying d0, qz ε,.8 which is equivalent to existing e R d and e ε such that e qz y z η + e hz

7 Inexact proximal stochastic gradient method for convex Fig. 1 Comparison of several inexact solution criteria. a Corresponds to.5, b corresponds to.6, c corresponds to.8, d corresponds to. y + ηe z,w z 0, z = Proj C y + ηe. for any w C Therefore, geometrically, the approximate solution z satisfying.8 belongs to the projection of the ball centered at y with radius ηε on C see Fig. 1c. Note that the condition. in Definition.1 will reduce to.8 by setting ˆε = 0 and ε = ηε /. Finally, the condition. in Definition.1 is equivalent to z C and w z, x z ηˆε, for any w C,.9 where x = y u. Geometrically, this means z belongs to the intersection of C and the half-space determined by.9, for which the separating hyperplane is normal to x z and with distance ηˆε/ x z to z and x is any point in the ball centered at y with radius η ε see Fig. 1d. Now, we give our inexact Prox-SG algorithm for solving

8 X. Wang et al. Algorithm.1. Inexact Prox-SG method IPSG for Input: Maximum number of iterations N, integer parameters {m } =1 N, {s } =1 N, and tolerance sequence { } { } ε t and ˆε t for = 1,...,N and t = 1,...,m. Initialize w0 1 = w 0. 1: for = 1,,...,N do : Choose proximal parameter η > 0. 3: Calculate ṽ = f w 1. 4: for t = 1,,...,m do 5: Randomly choosing a sample set K {1,,...,n} with size s, such that the probability of each index being piced is s /n. Compute v t 1 = f K wt 1 f K w 1 +ṽ, where f K = 1 s i K f i. 6: Find an ε t, ˆε t -solution w t of the subproblem 7: end for 8: Set w = m 1 m w t. 9: Choose w : end for Output: w N. min q w R d t w := vt 1,w w t 1 + hw + 1 w w η t Notice that Algorithm.1 consists of a two-loop procedure. In the inner loop, a sequence of inexact proximal stochastic gradient steps are carried out. Because of expensive calculation of the full gradient, by randomly choosing a sample set K, an inexpensive stochastic gradient of f is formed and used in calculating these proximal gradient steps. Meanwhile, the variance reduction technique proposed in SVRG [17] is applied to control the variance of the stochastic gradient generated in the inner iterations. In the outer loop, we choose the proximal parameter η > 0 and update w, where the full gradient f w will be calculated. In Step 9, we do not specify how to choose the starting iterate w0 +1 of the inner loop. As will be seen in the later sections, two different strategies on how to choose w0 +1 will be given according to whether the objective function is strongly convex or simply convex. In Step 5, there are different ways of choosing the sample set K such that the probability of each index in {1,...,n} being piced is s /n. The following simple strategy is considered in [39] and used in our numerical experiments. First, partition all the indices {1,...,n} into s disjoint groups K j, j = 1,...,s, with equal cardinality n/s assuming s divides n. Then, K consists of picing one index from each group K j with uniform probability s /n. The major difference between our Algorithm.1 and the existing Prox-SG methods is that the subproblem.10 is solved inexactly in Algorithm.1. Hence, the computational cost of solving each subproblem could be reduced greatly compared with exactly solving of.10 required by other methods. More precisely, we only find an ε t, ˆε t -solution w t of the subproblem.10. By Definition.1, we want to find wt such that there exists an u t satisfying

9 Inexact proximal stochastic gradient method for convex u t η ε t and w t 1 w t u t vt 1 η ˆε h w t t..11 For convenience, in the following, we let ε t = ε t +ˆε t..1 Hence, by.3, we have q t wt min q w R d t w + ε t..13 In the later sections, we would give the global convergence and the complexity analysis on the rate of E[P w N ] Pw converging to zero after N outer iterations of Algorithm.1, where w is an optimal solution of 1.1, i.e., w arg min w R d Pw. The following assumption is required throughout this paper. AS.1 For any i {1,...,n}, f i is Lipschitz continuously differentiable with Lipschitz constant L i > 0, i.e., f i x f i y f i y, x y L i x y, for any x, y R d. In the following we denote L = max L i..14 i {1,...,n} Hence, under the Assumption AS.1, all the functions f i, i = 1,...,n, and f are Lipschitz continuously differentiable with constant L. To facilitate the analysis, we first observe that for any w R d, f K w is an unbiased estimate of f w. Indeed, since for any index i the probability of i K is s /n, by taing expectation of f K w with respect to the random sample set K,we have [ ] E K [ f K w] = 1 E K f i w = 1 n [ s ] s s n f iw i K = 1 n n f i w = f w..15 So, by the definition of vt 1 in Algorithm.1,.15impliesthatv t 1 is an unbiased estimate of f wt 1 [ ], that is, E v t 1 = f w t 1. The variance of the stochastic gradient vt 1 can be bounded by the following lemma, whose proof is a proper modification of the proof of Lemma 3.4 in [36] due to the existence of batch size s.

10 X. Wang et al. Lemma.1 Under assumption AS.1, we have E K [ v t 1 f where L is defined in.14. ] wt 1 4L [ ] P wt 1 Pw + P w 1 Pw, s Proof Given any i {1,...,n}, since f i is Lipschitz continuously differentiable with constant L i, for any w R d we have f i w f i w L i [ f i w f i w f i w, w w ]. By the definition of L and summing up the above inequality over i = 1,...,n, we have 1 n n f i w f i w L[ f w f w f w, w w ]. From the optimality condition of 1.1, there exists a p such that for any w R d f w + p,w w 0, which together with the convexity of h yields that f w, w w p,w w hw hw..16 Then, it follows from.16 thatforanyw R d we have 1 n f i w f i w L[Pw Pw ]..17 n Now, noticing that E K [ f K wt 1 ]= fw t 1, E K[ f K w 1 ]= f w 1 and the fact that for any index i the probability of i K is s /n, wehave [ v ] E K t 1 f wt 1 [ = E K fk wt 1 f K w 1 + f w 1 f = 1 s E K i K ] wt 1 f i wt 1 f i w 1 + f w 1 f wt 1 [ = 1 s E K f i wt 1 f i w 1 + f w 1 f i K ] wt 1

11 Inexact proximal stochastic gradient method for convex [ 1 s E K i K f i wt 1 ] f i w 1 = 1 s n s n f i wt 1 f i w 1. Hence, it follows from.17 that E K [ v t 1 f s n n 4L s [ P ] wt 1 [ fi wt 1 w t 1 ] f i w + f i w 1 f i w ] Pw + P w 1 Pw. Remar.1 The Lemma.1 provides an insight that the batch size s can help to further reduce the variance of the stochastic gradient. By the later analysis, although the overall component gradient complexity can not be reduced by increasing the batch size s, the stochastic gradient could be obtained by parallel computing when s > 1 and hence, the total number of subproblems to be solved can be reduced by a factor of s. Note that since we do not save the component gradients f i w 1, i = 1,...,n, at Step 3 of Algorithm.1, in together s component gradients need to be computed at Step 5 of Algorithm.1. The following lemma provides a bound on the difference between the inexact proximal stochastic gradient step given by.10 and the exact proximal full gradient step. Lemma. Under assumption AS.1, we have wt w t v η t 1 f wt 1 + η εt, where w t denotes the exact solution to.10 with vt 1 replaced by f wt 1. Proof Since wt is an ε t, ˆε t -solution of the subproblem.10, we have from.13 and the definition of εt in.1 that q t wt qt ŵt εt, where ŵ t is the exact minimizer of q t defined in.10. So, by the strong convexity of q t we have 1 w η t ŵt qt wt qt ŵt εt,

12 X. Wang et al. which implies wt ŵt ηεt..18 Note that w t and ŵ t satisfy w t 1 w t η f wt 1 h w t and w t 1 ŵ t vt 1 η h ŵt. Hence, we have w h ŵt h w t t 1 w t h w t h ŵt η w t 1 ŵ t η f Summing up the above two inequalities yields that w t ŵt η η v t 1 f wt 1, ŵt w t and vt 1, w t ŵt. vt 1 w f t 1, w t ŵt wt 1 w t ŵt, which gives w t ŵt v η t 1 f wt 1. Therefore, by.18 wehave w t w t w t ŵ t + wt ŵt v η t 1 f wt 1 + η εt. Remar. From the above Lemma., we can derive the following inequality, which will be frequently used in the later analysis: v t 1 f wt 1,w wt v t 1 f wt 1,w w t + vt 1 wt 1 f w t wt v t 1 f wt 1,w w t v + η t 1 f wt 1 + η εt v t 1 f wt 1 v t 1 f wt 1,w w t v + η t 1 f wt ε t..19

13 Inexact proximal stochastic gradient method for convex The following lemma gives a bound on the function value gap Pw t Pw. Lemma.3 Under assumption AS.1, if the proximal parameter η in Algorithm.1 satisfies η 1/L, we have P wt Pw + 1 w w η + 1 η u t,w w t t 1 w wt + vt 1 f w t 1, w wt +ˆε t,.0 where u t η ε t. Proof Since wt is an ε t, ˆε t -solution of the subproblem.10, we have from.11 that hw + 1 w w t 1 η w h wt + t 1 wt u t vt 1 η,w wt ˆε t + 1 w w t 1 η = h wt 1 u t η,w wt vt 1,w wt ˆε t + 1 w w t η + 1 w η t wt 1, where u t η ε t. By rearranging the terms, we have 1 w w t η 1 w w t 1 wt w t 1 + hw h η + vt 1,w wt + 1 u t η,w wt +ˆε t 1 w w t 1 wt w t 1 + hw h η + f wt 1,w wt η u t,w w t wt w t v t 1 f w t 1, w w t +ˆε t..1 It follows from the convexity of f that f w f wt 1 + f wt 1,w wt 1..

14 X. Wang et al. By the Lipschitz continuity of f with Lipschitz constant L, wehave f wt 1 f wt f wt 1, w t wt 1 L wt wt 1..3 Then, summing. and.3 yields f w f wt + f wt 1,w wt L wt wt 1. Combing the above inequality and.1, we have 1 w w t η 1 w wt 1 1 L w t w t 1 + Pw P η η + v t 1 f wt 1,w wt + 1 u t η,w wt +ˆε t, wt which gives.0 due to η 1/L. 3 Convergence properties for strongly convex case In this section, we investigate the theoretical properties of Algorithm.1 with the additional assumption that f in the objective function 1.1 is μ-strongly convex. Since h in 1.1 is convex, the objective function P is μ-strongly convex, i.e., for any w R d we have Pw Pw μ w w. 3.1 For the strongly convex case in this section, the w +1 0 in Step 8 of Algorithm.1 is chosen as w +1 0 = w. 3. The next theorem gives a recursive relation between E[P w ] Pw and E[P w 1 ] Pw. Theorem 3.1 Under assumption AS.1, if f is μ-strongly convex and the proximal parameter η in Algorithm.1 satisfies { s η < min 1L, 1 } L where s {1,...,n}, the following property holds:

15 Inexact proximal stochastic gradient method for convex where E[P w ] Pw s μη m s 1η L + 1m + 1η L E[P w 1 ] Pw m s 1η L s + m s 1η L A, 3.3 A = m ε t + 3 m m m ε i + ˆε t + 3 m εt 3.4 and the expectation is taen with respect to all the history random variables. Proof Summing up.0 over t = 1,...,m,wehave i=t m P wt Pw 1 w 0 η w wm w + 1 m u t η,w m m wt + ˆε t + v t 1 f wt 1,w wt, 3.5 where u t η ε t. We first bound the term η 1 m u t,w wt.by.13, η < L 1 proof of [30, Proposition 3], we can obtain and the same w t w 1 μη t w 0 w + v η i 1 f wi 1 + η εt t 1 μη i. Then, it follows from μη < 1 due to η < 1 L 1 μ that w t w w 0 w + Hence, we have t v η i 1 f wi 1 + η εt. 1 η m u t,w wt 1 m η u t w wt 1 w m η 0 w η ε t + 1 m η ε t η t

16 v η i 1 f wi 1 + η εi X. Wang et al. = m m i=t m m i=t m m ε t + 1 η ε i ε t m w 1 η 0 w + η v η t 1 w f t w m η 0 w + η v ε i t 1 f w t 1 m + ε t i=t m m + m + 1 m ε i + η η w 0 w + 1 m ε i + i=t ε t + η εt m m vt 1 f w t 1 ε t + 1 w 3 η 0 w + m i=t m ε i ε t m ε i + m + η vt 1 wt 1 f. 3.6 It thus follows from.19, 3.5, 3.6 and the definition of A in 3.4 that m P wt Pw 1 w 0 η w wm w + 1 w η 0 w m + 3η vt 1 w f t 1 m + v t 1 f wt 1,w w t + A i=t 1 w m η 0 w + 3η vt 1 w f t 1 ε t

17 Inexact proximal stochastic gradient method for convex m + v t 1 f wt 1,w w t + A μη [ P w 1 Pw ] + 3η m + m vt 1 w f t 1 v t 1 f wt 1,w w t + A, where the last inequality follows from w0 = w 1 and the μ-strong convexity 3.1 of P. Taing expectation on both sides of the above inequality and noticing that ] E [ v t 1 f wt 1,w w t w t 1 = 0, 3.7 we obtain η m + 3η [ ] E P wt Pw [ E P w 1 ] Pw μ m Then it follows from Lemma.1 that η m EP wt Pw [ v E t 1 wt 1 ] f + η A. [ E P w 1 ] Pw + 1η L m [ ] E P wt 1 Pw μ s + E [ P w 1 ] Pw + η A = μ [ E P w 1 ] Pw + 1η L [ ] E P w0 Pw s m + 1η L s t= [ ] E P wt 1 Pw + 1m η L s E [ P w 1 ] Pw + η A. Noticing that w 0 = w 1 and rearranging terms, we have η 1 1η m L [ ] E P wt Pw s μ + 1m + 1η L s

18 X. Wang et al. E [ P w 1 Pw ] + η A. 3.8 Since w = m 1 m w t, it follows from the convexity of P that m [ ] E P wt [ Pw m E P w ] Pw. 3.9 Then, 3.3 follows from η < s 1L and 3.8. Motivated from Theorem 3.1, if we further set ε t ε t 1 μη t 1, for = 1,...,N and t = 1,...,m, then we have m m ε t ε t m 1 μη t 1 1 μη m εt, and m m i=t ε i m 1 μη = 1 μη [ m m εi i=t i=t [ m m εi i=t [ m t εt ] 1 μη i 1 1 μη t 1 ] 1 μη i 1 ] 1 μη m εt. Then, due to μη < 1 and ˆε t ε t, A defined in 3.4 can be upper bounded by Hence, by 3.3 wehave 1 A m εt 5 m εt μη μη μη. E [ P w ] s Pw μη m s 1η L + 1m + 1η L E [ P w 1 ] Pw m s 1η L + 5s μη m s 1η L which would lead to the following theorem. m εt, 3.10

19 Inexact proximal stochastic gradient method for convex Theorem 3. Under assumption AS.1,if f isμ-strongly convex and the parameters in Algorithm.1 are set as { s η < min 40L, 1 }, L m > 0θ for some θ>1, μη and ε t = εt 1 μη t 1, 3.11 where s {1,...,n}, then we have where E[P w ] Pw γ E[P w 0 ] Pw + 5 m εt i γ i+1, 3.1 γ = 1 + 6θ 7θ < 1. Proof Since η < s /40L, we have that s s 1η L < 10 7 and η L s 1η L < 1 8. Hence, denoting γ = we have from m > 0θ μη γ < s μη m s 1Lη + 1m + 1Lη m s 1Lη, 3.13 and θ>1 that 0 + 1m μη m 8m 7μη m 8 < 1 + 6θ 7θ = γ<1. Notice that the coefficient of the last term in 3.10 is less than 5 γ. Hence, by 3.10 we have E[P w ] Pw γ E[P w 1 ] Pw + 5γ m εt Then, 3.1 follows from induction and γ <γ <1. Remar 3.1 From 3.13, we can see that for fixed η and m, γ will decrease as s increases. Hence, by 3.14, increasing the sample size s when calculating the stochastic gradient in Step 5 of Algorithm.1 will generally improve the convergence speed of E[P w ] Pw to zero. By Theorem 3., to ensure E[P w ] Pw converges to zero, it is sufficient to require m εi t γ i+1 converges to zero as increases to infinity. This gives

20 X. Wang et al. us certain freedom for the choices of εt. In the following, we analyze several different choices of εt and derive its corresponding complexity bounds. Corollary 3.3 Under assumption AS.1, suppose f is μ-strongly convex and the parameters in Algorithm.1 are set as in Then, 1 if the subproblem tolerance ε t satisfies we have E[P w ] Pw γ P w 0 Pw if the subproblem tolerance ε t satisfies we have ε t α +t with α 0, 1, α max{γ,α} 1 α1 max{γ, α} max{ α, γ} ; 3.16 ε t 1 β αt with α 0, 1 and β>0, 3.17 E[P w ] Pw γ P w 0 Pw + where ξ is a scalar satisfying 3 if the subproblem tolerance ε t satisfies we have 5αγ 1 α1 γ ξ β, 3.18 γ ξ = ξ β ; 3.19 εt 1 β 1 t 1+φ with β>0 and φ>0, 3.0 E[P w ] Pw γ P w 0 Pw + 5γ1 + φ 1 ξ β, γ where ξ is a scalar satisfying 3.19; 4 if the subproblem tolerance ε t satisfies ε t 1 + t κ with κ>1, 3.

21 Inexact proximal stochastic gradient method for convex we have E[P w ] Pw γ P w 0 Pw + where ξ is a scalar satisfying γ κ 11 γ ξ1 κ, 3.3 Proof We now analyze the estimate bound for E[P w ] Pw case by case. 1 If the tolerance εt satisfies 3.15, direct calculations show that m ε t α Then γ i+1 α i = γ i+1 α i max{γ, α} +i+1 max{γ,α} 1 max{γ, α} max{γ, α}. So, 3.16 follows from 3.1 directly. If the tolerance εt satisfies 3.17, then m that ε t α E[P w ] Pw γ P w 0 Pw + 1 α α. 1 α β. It follows from 3.1 5α 1 α γ i+1 i β. We now compute γ i+1 i β. Since ξ satisfies γ ξ = ξ β, we have ξ 0,, and γ i+1 i β = ξ ξ γ i+1 i β + i= ξ +1 γ i+1 + ξ +1 β γ i+1 i β i= ξ +1 γ i+1 γ ξ +1 + γ 1 γ 1 γ ξ +1 β γ 1 γ ξ β. 3.4 Therefore, we obtain If the tolerance εt satisfies 3.0, we have m m m εt = 1 β + t 1+φ 1 β + x 1+φ dx 1 + φ 1 β. t= Then, 3.1 follows from 3.1 and 3.4.

22 X. Wang et al. 4 If the tolerance ε t satisfies 3., we have m εt = + t κ m m Then, 3.3 follows from 3.1 and x κ dx = 1 κ κ 1 + m 1 κ κ 1 Remar 3. Notice that for any given λ 0, 1, wehave lim γ 1 λ = 0. λ β So, by the definition of ξ in 3.19, there exists an integer K > 0 such that which yields that ξ λ for all K, ξ β λ β. 1 κ κ Therefore, all the upper bounds in 3.18, 3.1 and 3.3 converge to zero as increases to infinity. Remar 3.3 We are more interested in the complexity bounds of the total number of component gradient evaluations to reduce the function value gap E[P w ] Pw below certain tolerance. This complexity bound is also called batch complexity, which often measures the dominating cost of solving original problem 1.1. From our theoretical analysis, it seems reasonable to choose the parameters m and s such that m s = OL/μ. With this choice of m and s, let us discuss the component gradient complexity bounds implied by Corollary 3.3 under different accuracies of solving the proximal subproblem.10. For the tolerance setting 3.15, to achieve E[P w N ] Pw <εfor some ε>0, by 3.16 the maximum outer iteration number N should satisfy max{ α, γ} N = Oε, which implies that N = Olog1/ε. Hence, by the fact that m s is on the order L/μ, the total number of component gradient evaluations Nn + N =1 m s is in the order of n + L 1 O log. 3.7 μ ε Note that when all the subproblems.10 are solved exactly, our Algorithm.1 will be reduced to the method proposed in [36]. The same complexity bound 3.7 was obtained in [36], but it required to solve all the subproblems exactly. Hence, based on our results, the algorithm could have the same complexity bounds even

23 Inexact proximal stochastic gradient method for convex without solving the subproblems exactly. When the subproblem does not have a closed-form solution or is too expensive to be solved exactly, allowance of solving the subproblems inexactly would be crucial for both efficiency and stableness of the algorithm. For the tolerance setting 3.17, to achieve E[P w N ] Pw <ε,by3.18 and Remar 3., when ε>0 is sufficiently small, it suffices to require γ N = Oε and N β = Oε. This implies the outer iteration number N should be in the order of Oε 1/β when N is large. Thus, for a sufficiently small ε>0, the component gradient complexity bound is n + L 1 O μ ε 1/β. 3.8 For the tolerance setting 3.0, to achieve E[P w N ] Pw <ε,by3.1 and Remar 3., when ε>0issufficiently small, it suffices to require γ N = Oε and 1 + φ 1 N β = Oε. This implies the outer iteration number N should in the order of Oρε 1/β, where ρ = min{1,φ}. Thus, for a sufficient small ε>0, the component gradient complexity bound is n + L O μ 1 ρε 1/β. 3.9 For the tolerance setting 3., similar to above two cases, following from 3.3 and Remar 3., to obtain E[P w N ] Pw <εfor a sufficiently small ε>0, it suffices to require γ N = Oε and 1 κ 1 N 1 κ = Oε. This implies N should be in the order of Oκ 1ε 1/κ 1 and the component gradient complexity bound is n + L O μ 1 κ 1ε 1/κ Note that compared with the latter three cases, the tolerance 3.15 has the most accuracy. Intuitively, it should have the smallest complexity bound, which is verified by 3.7. Meanwhile, we can see that as less subproblem accuracy is required, the corresponding component gradient complexity bound increases. Hence, proper subproblem accuracy setting should depend on the balance of the cost of solving the subproblems and the cost of component gradient evaluations. Remar 3.4 It also deserves to mention the inexact proximal gradient methods proposed in [30]. To obtain a linear convergence rate analogous to 3.16 when the

24 X. Wang et al. objective is strongly convex, the linear decrease rate on the subproblem accuracies similar to 3.15 is also required in [30]. But methods proposed in [30] are deterministic methods, which do not consider the particular summation structure of our objective function in 1.1. Furthermore, by combining with variance reduction techniques, we show the more relaxed subproblem accuracies, such as 3.17, 3.0 and 3., can be applied without loosing the convergence, which have not been discussed in [30]. 4 Convergence properties for general convex case In this section, we discuss the theoretical properties of Algorithm.1 for solving the problem without assuming the strong convexity of f. Different from 3., for the general convex case in this section, the w0 +1 in Step 8 of Algorithm.1 is chosen as w0 +1 = wm. 4.1 And given an initial positive integer m 0, we require the succeeded inner iteration number m, = 1,,..., in Algorithm.1 is increasing and satisfies m = 1 + θ m 1, 4. where θ > 0 is a parameter. First, the following lemma [30, Lemma 1] will be used in our analysis to bound w t w. Lemma 4.1 Assume that the nonnegative sequence {v } satisfies the following recursion for all 1: v S + λ i v i, with {S } an increasing sequence, S 0 u 0 and λ i 0 for all i. Then for all 1, 1/ v 1 1 λ i + S + λ i. Based on Lemma 4.1, we give the following bound on w t w. Lemma 4. Under assumption AS.1, if the proximal parameter η in Algorithm.1 satisfies η 1/L, then w t w t v η i 1 f wi 1 + η ε i + w 0 w + η t t 1/ ˆε i v + η i 1 f wi 1 + η ε i. 4.3

25 Inexact proximal stochastic gradient method for convex Proof Note that.0 together with u t η ε t give wi w w i 1 w + η v i 1 f w wi 1 + η ε i i w + η ˆε i. Summing up the above inequality over i = 1,...,t, wehave w t w w 0 w + t + η ˆε i. t w v η i 1 f wi 1 + η ε i i w By letting v i = w i w, St = w 0 w + t η ˆε i and λ i = η ε i + η vi 1 f w i 1 in Lemma 4.1, we obtain 4.3. Now we give the following theorem which is analogous to Theorem 3.1 for the strongly convex case. Theorem 4.1 Under assumption AS.1, if the proximal parameter η in Algorithm.1 satisfies { s η < min 16Lθ +, 1 }, 4.4 L the following property holds: [ s w ] E[P w ] Pw + η m s 16η L E +1 0 w 16η L [ ] + E P w0 +1 Pw m s 16η L 1 + η 1 + θ R + m B, 4.5 where [ s w R = E[P w 1 ] Pw + η m 1 s 16η L E 0 w ] 16η L [ ] + E P w0 Pw m 1 s 16η L 4.6 and B = 1 m m ε t + 4 η m i=t ε i + 5 m εt. 4.7

26 X. Wang et al. Here, the expectation is taen with respect to all the history random variables. Proof Summing up.0 over t = 1,...,m, due to w +1 0 = w m we have m P wt Pw 1 w 0 η w w0 +1 w + 1 m u t η,w wt m + v t 1 f m wt 1,w wt + ˆε t, 4.8 where u t η ε t. We first give a bound on the term η 1 m u t,w wt. The upper bound of u t in 4.8 and Lemma 4. give where 1 η m E 1 := 1 m η ε t η m E := 1 η η ε t u t,w wt 1 m η η ε t t v η i 1 f wi 1 + η ε i, w t t 0 w + η ˆε i + Now, let us first derive a bound on E 1 : m E 1 = 1 = 3 m i=t m m i=t m m w w t v η ε i t 1 w f t 1 m + i=t m ε i + η ε i E 1 + E, 4.9 1/ v η i 1 f wi 1 + η ε i. m i=t ε i ε t vt 1 w f t 1 m m + ε t + m + η vt 1 w f t 1 m + ε t. Noticing the definition of ε t in.1, we have the following bound on E : E 1 m w η ε t 0 η w + t η ˆε i + t m i=t v η i 1 f wi 1 + η ε i ε i

27 Inexact proximal stochastic gradient method for convex m ε t w m t m t η 0 w + ε t ˆε i + η ε t vi 1 w f m t i 1 + ε t ε i w 0 w + 1 m m m m m v ε t + ˆε t ε i + η ε i t 1 w η f η t 1 i=t i=t m m + ε t ε i i=t w 0 w + 1 m m m m m ε t + ˆε t + ε i + η vt 1 w η f t 1 η i=t + 1 m m m m m ε i + ε t + ε i i=t i=t w 0 w + 1 m m ε t + εt + 5 m m m ε i + η vt 1 w η η f t 1. i=t By inserting the above two bounds on E 1 and E into 4.9, we have 1 η m w u t,w wt 0 w + η η m vt 1 f w t m m ε t + 4 η m m m i=t ε i + εt + ε t Combining.19, 4.8 and 4.10 gives m P wt Pw 1 + η w 1 w η 0 w +1 0 w η m + 4η vt 1 w f t 1 m + v t 1 f wt 1,w w t + B, 4.11 where B is defined in 4.7. Taing expectation on both sides of 4.11 and noticing 3.7, it follows from Lemma.1 that m E[Pwt ] Pw 1 + η η E[ w 0 w ] 1 η E[ w +1 0 w ]

28 X. Wang et al. + 16η L s m E[Pwt 1 ] Pw + 16m η L s EP w 1 Pw + B. Then, by η < s 16L and w+1 0 = wm,wehave 1 16η L m E[Pwt s ] Pw + 1 η E[ w +1 0 w ]+ 16η L s E[Pw +1 0 ] Pw 16m η L s E[P w 1 ] Pw η E[ w0 η w ]+ 16η L E[Pw0 s ] Pw + B. 4.1 Then, it follows from 3.9 and 4.1 that s E[P w ] Pw + η m s 16η L E[ w+1 0 w ] 16η L + m s 16η L E[Pw+1 0 ] Pw 16η L s 16η L E[P w s 1 + η 1] Pw + η m s 16η L E[ w 0 w ] 16η L + m s 16η L E[Pw 0 ] Pw s + m s 16η L B Since η < s 16Lθ +,wehaves 16η L > θ +1 θ + s. Therefore, we have 16η L s 16η L < θ and s s 16η L = θ + <, 4.14 θ + 1 which together with 4.13 and m = 1 + θ m 1 imply 4.5. Motivated from Theorem 4.1, if we further set ε t ε t 1 η t 1, for = 1,...,N and t = 1,...,m, 4.15 then we have m m ε t ε t m 1 η t 1 1 η m εt,

29 Inexact proximal stochastic gradient method for convex and m m i=t ε i m 1 η = 1 η [ m m εi i=t i=t [ m m εi i=t [ m t εt ] 1 η i 1 1 η t 1 ] ] 1 η i 1 1 m εt η. Then, the B defined in 4.7 can be upper bounded by 5 B + 5 m εt η We now give the convergence property of Algorithm.1 for solving without strongly convex assumption. Theorem 4. Under assumption AS.1, if the parameters η and ε t in Algorithm.1 are set as { s η = η<min 16Lθ +, 1 } and ε t = εt L 1 η t 1, for = 1,...,N and t = 1,...,m, 4.17 the inner iteration number m satisfies the condition 4. and the sample size {s } N =1 is a nondecreasing sequence, then we have E[P w ] Pw a η m 0 where a 0 = m i η i 16ηL [P w 0 Pw ]+ m 0 s 1 16ηL Proof First, for all 1, denote ε i t j=1 1 + η 1 + θ j, 4.18 s 1 ηm 0 s 1 16ηL w1 0 w s a = E[P w ] Pw + ηm s 16ηL E[ w+1 0 w ] 16ηL + m s 16ηL E[Pw+1 0 ] Pw.

30 X. Wang et al. Then, it follows from Theorem 4.1,4.17 and {s } =1 N being a nondecreasing sequence that for all η a a 1 + B 1 + θ m So, by induction we get a a 0 j=1 1 + η 1 + θ j + j=i η B i. 1 + θ j m i Then, 4.18 follows from m i = m ij= θ j and the upper bound 4.16onB i. Remar 4.1 Note that to ensure E[P w ] Pw converges to zero as goes to infinity, by Theorem 4., a sufficient condition is to require j=1 1 + η 1 + θ j 0 and 1 + m i η i ε i t j=1 1 + η 1 + θ j 0, as goes to infinity. Motivated from Remar 4.1, we have the following Corollary. Corollary 4.3 Under assumption AS.1, suppose the inner iteration number m satisfies the condition 4. with θ = θ>0and the parameters in Algorithm.1 are set as { s s = s, η = η<min 16Lθ +, 1 } L,θ and ε t = εt 1 η t Then, we have E[P w ] Pw a η m 0 where a 0 is defined in 4.19 and 1 + m i η i ε i t γ, 4.1 γ = 1 + η 1 + θ < Consequently, 1 if the subproblem tolerance εt satisfies 3.15, then E[P w ] Pw a 0 + α 10η m 0 1 α γ ; 4.3

31 Inexact proximal stochastic gradient method for convex if the subproblem tolerance ε t satisfies 3.17, then E[P w ] Pw 3 if the subproblem tolerance ε t satisfies 3.0, then E[P w ] Pw 4 if the subproblem tolerance ε t satisfies 3., then E[P w ] Pw a 0 + α10η m 0 1 α γ ; 4.4 η a φ 1 10η γ ; 4.5 m 0 η a η m 0 κ 1 γ. 4.6 η Proof First, by Theorem 4., 4.1 follows from 4.18 directly. 1 If the subproblem tolerance εt satisfies 3.15, then m i implies that 1 + m i η i εt i = α 1 α α 1 + η i α 1 α Then, 4.3 follows from 4.1. If the subproblem tolerance εt satisfies 3.17, then m i implies that 1 + m i η i εt i α 1 α 1 + η i εi t α 1 α αi, which α 1 + η α α 1 α. εi t α α η1 α. 1 α i β, which Then, 4.4 follows from If the subproblem tolerance ε t satisfies 3.0, then 4.5 follows from 4.1 and the fact that m i εt i 1 + φ 1 i β 1 + φ 1 and 1 + η i 1, η where the first inequality is by If the subproblem tolerance ε t satisfies 3., then 4.6 follows from 4.1 and the fact that m i ε i t 1 κ 1 i 1 κ 1 κ 1, where the first inequality is by 3.6.

32 X. Wang et al. Remar 4. Following from Corollary 4.3, wehavee[p w ] Pw converges to zero as goes to infinity. We now analyze the component gradient complexity of Algorithm.1 implied by Corollary 4.3. In the algorithm, suppose we set m 0 = OL, η = OL 1 and s = O Then, it follows from 4.7 and 4.14 that 16η L s 16η L = O1, s s 16η L = O1. Hence, we have a 0 = O1. In the following, let us denote τ = log1 + θ log1 + θ log1 + η. 4.8 Then, Corollary 4.3 would give the following complexity bounds. Suppose ε t is set as It follows from a 0 = O1 and 4.7 that the coefficient of γ in 4.3 iso1. Hence, to achieve E[P w N ] Pw <εfor a given ε > 0, the outer iteration number N should be in the order of Olog γ ε = Olog1/ε. So, by direct calculation, the total number of component gradient evaluations required by Algorithm.1 is nn + sm i = nn + sm θ N θ θ = O n log 1 + Lε ε τ. 4.9 Suppose ε t is set as It follows from a 0 = O1 and 4.7 that the coefficient of γ in 4.4isOL 1/. Hence, to achieve E[P w N ] Pw <εfor a given ε>0, the outer iteration number N should be in the order of Olog γ ε/l 1/ = OlogL 1/ /ε. Therefore, the total number of component gradient evaluations required by Algorithm.1 is L 1/ L 1/ τ O n log + L. ε ε Suppose ε t is set as 3.0. It follows from a 0 = O1 and 4.7 that the coefficient of γ in 4.5isOL 1/ /ρ where ρ = min{1,φ}. Hence, to achieve E[P w N ] Pw <εfor a given ε>0, the outer iteration number N should be in the order of Olog γ ρε/l 1/ = OlogL 1/ /ρε. Therefore, the total number of component gradient evaluations required by Algorithm.1 is L 1/ L 1/ τ O n log + L. ρε ρε

33 Inexact proximal stochastic gradient method for convex Suppose εt is set as 3.. It follows from a 0 = O1 and 4.7 that the coefficient of γ in 4.6 isoσ, where σ = max{1, L 1/ /κ 1}. Hence, to achieve E[P w N ] Pw <εfor a given ε>0, the outer iteration number N should be in the order of Olog γ ε/σ = Ologσ/ε. Therefore, the total number of component gradient evaluations required by Algorithm.1 is σ σ τ O n log + L. ε ε Similar to the strongly convex case, we can see as the subproblem tolerances become less strict, the component gradient complexity bound would increase. Again, the proper subproblem accuracy setting should depend on the balance of the subproblem difficulty and the cost of component gradient evaluations. Remar 4.3 When all the subproblems.10 are solved exactly, the complexity bound On log1/ε + L/ε is obtained in [40] for solving without strongly convexity assumption. By comparison, the complexity bound 4.9 is worse since it has a power τ > 1. This gap arises when the term 1/η u t,w wt in 4.8 is estimated. When the subproblems are solved exactly, this term vanishes. However, when the subproblems are solved inexactly, this term needs to be estimated as shown in 4.10, which leads to the the reduction ratio 1 + η/1 + θ in 4.5, and the power τ>1finally appear in the complexity result. But, on the other hand, when the Lipschitz constant L is very large, which is usually the case for ill-posed problems, we could have θ>>l 1 and τ would be close to 1. Then, the complexity bound 4.9 would be very close to the complexity bound obtained in [40], which assumes exactly solving of all the subproblems. Observe that by the same analysis for establishing 4.16, we can replace 4.15 by ε t = ε t 1 σ η t 1, 4.30 where σ 0, 1 is a constant. In this case, the convergence complexity order of Algorithm.1 given in Theorem 4. and Corollary 4.3 would still be the same. In addition, based on Theorem 4., we could also have a variable choice of θ as which gives θ = λ 1 + η η 1 + θ = λ + 1, where λ>0 is a constant, Then, by 4.18 wehave E[P w ] Pw a η m 0 1, m i η i ε i t λ.

34 X. Wang et al. Same analysis as in the proof of Corollary 4.3 would show the factor in front of λ/ in the above estimate are bounded for all the four choices of tolerance settings 3.15, 3.17, 3.0 and 3.. Hence, the convergence of E[P w ] Pw to zero, as goes to infinity, is still guaranteed. 5 Experiments In this section, we consider solving the following CUR-lie factorization optimization proposed in [3]: min PX X Rnr nc := 1 W WXW F + λ n c row X i p + λ c X j p, 5.1 n r j=1 where W R n c n r is a given matrix, X i = X i, ; and X j = X :, j are the i-th row and j-th column of the matrix X, respectively. This optimization model 5.1 would generate a solution X with sparse rows and columns with different choices of p > 0. It can be seen that the problem 5.1 would have the formulation as 1.1 by setting f X := 1 W WXW F = 1 f i, j X n r n c n r j=1 n c hx := λ row X i p + λ c X j p, j=1 and where f i, j X = W i, j W i XW j, W i = W i, : and W j = W :, j are the i-th row and j-th column of the matrix W, respectively. In [3], the authors choose p = for which the proximal subproblem.10 has a closed-form solution. The more natural choice of p = isdiscussedin[30]. However, in this case there is no closed-form formula to find the exact solution of.10. Hence, the proximal subproblem.10 is solved in [30] by the bloc coordinate descent BCD algorithm proposed in [16], which is also equivalent to Dystra s algorithm for two monotone operators [1]. In our experiments, we also consider the choice of p = and the proximal subproblem.10 is solved by the same BCD algorithm used in [30]. In the implementation of the BCD method, the proximal subproblem is minimized alternatively with respect to the rows and columns. Since the duality gap can be computed while applying the BCD algorithm, we can compute an approximate solution of the subproblem.10 until its function value duality gap is below any given tolerance ε>0. The same data sets in [3,30] 1 are used in our experiments. We set λ row = 0.01 and λ col = 0.01 in 5.1 asthosein[30], which would yield approximately 5 40% non-zero entries in the solution. The data sets is summarized in the following table: 1 The datasets are available at

35 Inexact proximal stochastic gradient method for convex Data sets 9_Tumors Brain_Tumor1 Leuemia1 SRBCT n r n c In the experiments, the size of the random sample index set in Step 5 of Algorithm.1 is set as s = n c for all. The starting point w0 +1 for the inner loop is selected by 4.1, i.e., w +1 0 = w m. For the inner iteration number, we set m = 1 + θ m 1 with m 0 = 1, and as suggested by 4. and 4.31, we consider the following two choices of θ : 1 θ = λ η 1 with λ 1 = 1.001; θ = λ η 1 with λ = 0.8. Fig. Objective function value gap P w Pw vertical axis against the number of BCD iterations horizontal axis. In this figure, θ = λ η 1andλ 1 = a 9_Tumors, b Brain_Tumor1, c Leuemia1, d SRBCT

36 X. Wang et al. Fig. 3 Objective function value gap P w Pw vertical axis against the number of BCD iterations horizontal axis. In this figure, θ = λ η 1andλ = 0.8. a 9_Tumors, b Brain_Tumor1, c Leuemia1, d SRBCT In both above two cases the proximal parameter η = max{0.7/, 1/ L} where L = 30 is an estimate of the Lipschitz constant of f in 1.1. We have tested four different types of tolerance εt according to 3.15, 3.17, 3.0 and 3. as the following: 1 εt = 1 σ L t 1 α +t with α = 0.9. εt = 1 σ L t ε t = 1 σ L t 1 4 ε t = 1 σ L t 1 α t with α = 0.9 and β = 0.5. β 1 1 with β = 1 and φ = 0.5. β t 1+φ 1 +t κ with κ = 3. In all the above cases, we set σ = 0.1 and L = 1/η. All the subproblems.10are solved by the BCD algorithm until the subproblem function value duality gap is below ε t. Then, by [38, Thm..8.7] and [3, Section 4.3], the inexact solution w t obtained by the BCD algorithm is an ε t, ˆε t -solution of the problem.1 with ε t +ˆε t ε t. Hence, tolerance condition 4.30 on ε t is satisfied. Firstly, to show the benefits of solving the subproblems inexactly, we also tested the Algorithm.1 but with almost exactly solving of the subproblem.10 until

37 Inexact proximal stochastic gradient method for convex Fig. 4 Objective function value gap Pw Pw vertical axis against the number of effective passes horizontal axis. In this figure, θ = λ η 1andλ 1 = ɛ = 3 represents the method proposed in [30]. a 9_Tumors, b Brain_Tumor1, c Leuemia1, d SRBCT its function value duality gap is below Note that the only differences of the comparing algorithms are their tolerances for solving the subproblem. Hence, as in [30], in this case we also use the function value gap P w Pw against the number of proximal iterations BCD iterations as the measure of algorithm performance. In our numerical experiments, the optimal function value Pw is approximated by the minimum objective function value obtained by running all the comparing algorithms until P w i P w i 1 max i=0,1, P w i The numerical results corresponding to different choices of θ in cases 1 and are shown in Figs. and 3, respectively. The results clearly show that the inexact methods outperform the the exact method e.g., the method with fixed tolerance ɛ t = Hence, the result confirms our analysis that it is not necessary to solve the subproblems exactly at each iteration. On the other hand, we observe that the new algorithms with different tolerances often behave similar for this set of testing problems, especially at

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
