Inexact proximal stochastic gradient method for convex composite optimization

Size: px
Start display at page:

Download "Inexact proximal stochastic gradient method for convex composite optimization"

Transcription

1 Comput Optim Appl DOI /s Inexact proximal stochastic gradient method for convex composite optimization Xiao Wang 1 Shuxiong Wang Hongchao Zhang 3 Received: 9 September 016 Springer Science+Business Media, LLC 017 Abstract We study an inexact proximal stochastic gradient IPSG method for convex composite optimization, whose objective function is a summation of an average of a large number of smooth convex functions and a convex, but possibly nonsmooth, function. Variance reduction techniques are incorporated in the method to reduce the stochastic gradient variance. The main feature of this IPSG algorithm is to allow solving the proximal subproblems inexactly while still eeping the global convergence with desirable complexity bounds. Different subproblem stopping criteria are proposed. Global convergence and the component gradient complexity bounds are derived for the both cases when the objective function is strongly convex or just generally convex. Preliminary numerical experiment shows the overall efficiency of the IPSG algorithm. This research is partially supported by the National Natural Science Foundation of China and the National Science Foundation of USA B Xiao Wang wangxiao@ucas.ac.cn Shuxiong Wang wsx@lsec.cc.ac.cn Hongchao Zhang hozhang@math.lsu.edu hozhang 1 School of Mathematical Sciences, University of Chinese Academy of Sciences, No.19A Yuquan Road, Beijing , China Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijng , China 3 Department of Mathematics, Louisiana State University, 0 Locett Hall, Baton Rouge, LA , USA

2 X. Wang et al. Keywords Convex composite optimization Empirical ris minimization Stochastic gradient Inexact methods Global convergence Complexity bound Mathematics Subject Classification 47N10 65K10 1 Introduction In this paper, we consider the following composite optimization problem min Pw := f w + hw, 1.1 w R d where f is an average of many smooth component convex functions, that is f w = 1 n n f i w 1. with f i : R d R, i = 1,...,n, being convex not necessarily strongly convex and Lipschitz continuously differentiable, and h : R d R {+ }is a convex, but possibly nonsmooth, function. In addition, we assume the solution set of 1.1 is not empty and achieves the optimal function value P. Problem is also nown as regularized empirical ris minimization [3] and can arise in many applications, such as many inds of classification and regression models in machine learning. In these applications, a ey feature of the problem is that the total number of component functions, denoted by n, is very large or even huge such that it is often prohibitive to do the full function and/or gradient evaluations of the objective function at each iteration, which is often required by an iterative method. Problem 1.1 can be solved by standard proximal gradient methods. Given a starting point w 0 R d, such methods update iterates through { w +1 = arg min f w, w w +hw + 1 } w w w R d, 1.3 η where η > 0 is a proximal parameter. Since the full gradient of f is used in 1.3, the proximal gradient method is also referred as the proximal full gradient Prox-FG method. The Prox-FG methods have been shown superior to the classical subgradient methods for solving 1.1. To achieve an ε-solution w of the problem 1.1, i.e. P w P <ε, the standard iteration complexity bound for the Prox-FG method is Oε 1, while for subgradient method it is Oε. When combining with Nesterov s acceleration technique, the iteration complexity bound of the Prox-FG methods can achieve Oε 0.5. Extensive research has been carried out on proximal gradient methods, such as [,9,4]. However, in many cases it is hard or time-consuming to evaluate the full gradient f. So, proximal stochastic gradient Prox-SG methods become popular for solving 1.1. These methods usually update the iterates by

3 Inexact proximal stochastic gradient method for convex { w +1 = arg min g,w w +hw + 1 } w w w R d, η where g is an approximate gradient at w based on some random variable and its related expectation is equal to f x. For general convex case, based on Nesterov s optimal method, Lan [0] propose an unified optimal stochastic approximation algorithm which can achieve optimal complexity order Oɛ without particulary nowing the problem structure. In addition, the algorithm depends optimally on the Lipschitz-type constants of the problems. Then, Ghadimi and Lan [1,13] have studied the accelerated stochastic approximation methods for strongly convex case, which can achieve optimal or nearly optimal convergence rate with proper stepsizes. The Prox-SG methods can be also referred as randomized incremental proximal methods when g is defined as component gradient and the index is chosen in a certain random order. A good review of incremental proximal methods can be found in [4,5,34,35]. Although much computation could be saved by using the stochastic gradient at each iteration, its large variance may slow down the convergence speed. Hence, the variance reduction techniques aiming to reduce the variance of stochastic gradients have recently attracted much interest. The variance reduction techniques are originally proposed for improving classical stochastic gradient methods for solving with h 0. A stochastic average gradient SAG method, which taes an average of historical stochastic gradients at each iteration, is proposed in [9]. The On log1/ε component gradient complexity bound is achieved by this SAG method if f is strongly convex, Lipschitz continuously differentiable and n 8L/μ, where L is the Lipschitz constant and μ is the strongly convex modulus. When f i w = φ i a T i w, i = 1,...,n, with φ i being L i -smooth and h being μ-strongly convex, a stochastic dual coordinate ascent SDCA method is proposed in [31]to solve This SDCA method is shown to have the On + L max /μ log1/ε complexity, where L max = max i {L i }. Although both SAG and SDCA have lower computational cost at each iteration and achieve good convergence rate, they all have a relatively high Ond memory requirement. A stochastic variance reduced gradient SVRG method was introduced in [17] for solving smooth and strongly convex problem. This SVRG method uses an unbiased stochastic gradient constructed from the full gradient computed at a particular chosen point after every m > 0 iterations. With proper choice of m, the same convergence rate as [9,31] can be achieved, but only the Od memory is required. Another closely related algorithm is SAGA proposed in [8], which uses an unbiased stochastic gradient incorporated with average of past stochastic gradients. Hence, it could be considered as a method between SAG and SVRG. SAGA can be directly extended to solve the composite optimization when the objective function is not strongly convex. But, similar to SAG, SAGA also needs relatively high memory. Due to the small memory requirement and low component gradient complexity, SVRG-type methods have been paid more attention recently. In [36], a Prox-SVRG method is proposed to achieve On + L/μ log1/ε component gradient complexity for solving strongly convex composite optimization. When the objective function is not strongly convex, by applying the method on a slightly perturbed strongly convex problem, the n + L/εOlog1/ε complexity bound can be obtained. On the other hand, in [40]

4 X. Wang et al. a SVRG based Prox-SG method is designed to solve the non-strongly convex composite optimization directly with an improved On log1/ε + L/ε complexity bound. For other Prox-SG type methods, one may refer to [14,19,5]. In theory, all the aforementioned stochastic gradient methods require to solve the proximal subproblems exactly. However, when the function h is complicated and/or does not have special structure, the proximal subproblem often does not have closed form solution. Many optimization problems including topographic dictionary learning [18,3], CUR-lie factorization optimization [,3] and multi-tas learning problems [6,37], normally require the regularization term h reveal some group sparsity structure of variables. For example, it might tae the form of summation of -norm. In this case, it would be expensive or sometimes impossible to solve the subproblems exactly. On the other hand, since the subproblems are built on stochastic inexact gradients, it is also unnecessary to spend great efforts for solving the subproblems exactly, especially in the early iterations of an algorithm. To the best of our nowledge, there is no wor focusing on inexact Prox-SG methods yet. Hence, in this paper we would develop an inexact proximal stochastic gradient IPSG method which would guarantee global convergence and desirable complexity bounds. In particular, we show that although the subproblems are solved inexactly in the IPSG method, compared to the exact SVRG based Prox-SG methods, the same or almost the same component gradient complexity bounds can still be maintained. Our analysis is partly motivated from the deterministic inexact Prox-FG methods. One line of research on inexact Prox-FG methods focuses on using approximate proximal operators for which exact solution could be obtained relatively easily, such as the methods developed in [6,7,11,1]. The inexact Prox-FG methods developed in [15,8,30,33] are more closely related to our wor. Instead of constructing an approximate proximal operator, these methods allow to solve subproblems inexactly up to certain accuracy derived from theoretical analysis. However, most of these Prox-FG methods use deterministic exact full gradients, which as mentioned previously is not suitable for solving our problem Although the method in [30] allows inexact gradients, the analysis developed for Prox-FG method in [30] could not be directly applied to the inexact Prox-SG methods. One ey reason is that the variance of stochastic gradient may not meet the accuracy required in [30] for the inexact gradients. For example, without combining with particular treatment, the variance of stochastic gradient may not even reduce. Hence, in the similar spirit to the SVRG methods, we also apply the variance reduction techniques to control the variance of stochastic gradients. As a result, in this paper, we would propose an inexact Prox-SG method with variance reduction techniques for solving and develop its corresponding convergence theory. The remainder of this paper is organized as follows. In Sect., we present an inexact Prox-SG IPSG method for solving the convex composite optimization problem and discuss some of its basic properties. In Sect. 3, we give the convergence analysis of this IPSG method when f is strongly convex and provide its convergence rate under different accuracy criteria of solving the subproblem. In Sect. 
4, we extend our analysis to the case that f may not be strongly convex and give the corresponding convergence rate when the subproblems are solved inexactly. Some preliminary numerical experiment results are reported in Sect. 5. Finally, some conclusion remars are given in Sect. 6.

5 Inexact proximal stochastic gradient method for convex Notation The gradient of f is denoted as f. Given a vector x R d and p > 0, x p denotes its p-norm, while x = x represents its Euclidean norm. For x, y R d,weuse x, y to denote the Euclidean inner product of x and y. Given amatrixx R m n, X F is its Frobenius norm. For a real number a, a denotes the largest integer number less than or equal to a. Given a random variable X, E ξ [X] denotes the expectation of X with respect to ξ. Inexact Prox-SG method for In this section, we first give the definition of inexact solutions used in this paper for the proximal mapping problem: min qw := hw + 1 w R d η w y..1 Then, we will present our inexact Prox-SG algorithm for solving the convex composite optimization problem and develop some of its important properties needed in the later analysis. Let us first recall the definition of the ε-subdifferential of a convex function: Given a convex function φ : R d R, its ε-subdifferential at z, denoted as ε φz, is defined as { ε φz = y : φw φz y,w z ε for all w R d}. Now, we give our definition of inexact solutions used for the problem.1. Definition.1 Given ε > 0 and ˆε > 0, we call z to be an ε, ˆε-solution of the problem.1, if there exists u R d such that u η ε and y z u η ˆε hz.. By the definition of ε-subdifferential, when u η ε, it is not difficult to verify that z y + u η { 1 ε η w=z w y }. Hence, if z is an ε, ˆε-solution of the problem.1, by. wehave { 1 0 ε η w=z w y } + ˆε hz ε+ˆε qz. So, by the definition of ε-subdifferential again we have qw qz 0,w z ε +ˆε, for any w R d,

6 X. Wang et al. which implies qz min qw + ε +ˆε..3 w Rd Now, let us consider a particular case that h is an indicator function of a closed convex set C R d. For this case, similar observations have been made in [7,33]. Our purpose here is to compare the inexact solutions given by Definition.1 and other characterization of inexact solutions in the literature. When y C, the solution of proximal mapping.1 is trivial. Hence, we assume y / C. Given a tolerance ε>0, one often used condition for inexact solutions of.1istofindz such that 0 ε qz,.4 which is equivalent see. e.g., [3, Section 4.3] to require z satisfying qz min qw + ε..5 w Rd This condition.5 isusedin[30]. Note that when ε +ˆε ε, by.3, the inexact solution z given by Definition.1 will also satisfy.5. When z satisfies.5, we have z C and 1 η w y + ε 1 z y for any w C η z y Proj C y y + ηε, where Proj C is the projection on C. So, geometrically, z belongs to the intersection of C and the ball centered at y with radius Proj C y y + ηε see Fig. 1a. Another type of condition often used for inexact solutions of.1 see, e.g. in [7,33] requires z to satisfy y z ε hz,.6 η which is equivalent to z C and w z, y z ηε, for any w C..7 Note that the condition. in Definition.1 will reduce to.6 by setting ε = 0 and ˆε = ε. Geometrically, z satisfying.6 belongs to the intersection of C and the half-space determined by.7, for which the separating hyperplane is normal to y z and with distance ηε/ y z to z. see Fig. 1b. The third approximation criterion used in literature see, e.g. [7] for inexact solutions of.1 is to require z satisfying d0, qz ε,.8 which is equivalent to existing e R d and e ε such that e qz y z η + e hz

7 Inexact proximal stochastic gradient method for convex Fig. 1 Comparison of several inexact solution criteria. a Corresponds to.5, b corresponds to.6, c corresponds to.8, d corresponds to. y + ηe z,w z 0, z = Proj C y + ηe. for any w C Therefore, geometrically, the approximate solution z satisfying.8 belongs to the projection of the ball centered at y with radius ηε on C see Fig. 1c. Note that the condition. in Definition.1 will reduce to.8 by setting ˆε = 0 and ε = ηε /. Finally, the condition. in Definition.1 is equivalent to z C and w z, x z ηˆε, for any w C,.9 where x = y u. Geometrically, this means z belongs to the intersection of C and the half-space determined by.9, for which the separating hyperplane is normal to x z and with distance ηˆε/ x z to z and x is any point in the ball centered at y with radius η ε see Fig. 1d. Now, we give our inexact Prox-SG algorithm for solving

8 X. Wang et al. Algorithm.1. Inexact Prox-SG method IPSG for Input: Maximum number of iterations N, integer parameters {m } =1 N, {s } =1 N, and tolerance sequence { } { } ε t and ˆε t for = 1,...,N and t = 1,...,m. Initialize w0 1 = w 0. 1: for = 1,,...,N do : Choose proximal parameter η > 0. 3: Calculate ṽ = f w 1. 4: for t = 1,,...,m do 5: Randomly choosing a sample set K {1,,...,n} with size s, such that the probability of each index being piced is s /n. Compute v t 1 = f K wt 1 f K w 1 +ṽ, where f K = 1 s i K f i. 6: Find an ε t, ˆε t -solution w t of the subproblem 7: end for 8: Set w = m 1 m w t. 9: Choose w : end for Output: w N. min q w R d t w := vt 1,w w t 1 + hw + 1 w w η t Notice that Algorithm.1 consists of a two-loop procedure. In the inner loop, a sequence of inexact proximal stochastic gradient steps are carried out. Because of expensive calculation of the full gradient, by randomly choosing a sample set K, an inexpensive stochastic gradient of f is formed and used in calculating these proximal gradient steps. Meanwhile, the variance reduction technique proposed in SVRG [17] is applied to control the variance of the stochastic gradient generated in the inner iterations. In the outer loop, we choose the proximal parameter η > 0 and update w, where the full gradient f w will be calculated. In Step 9, we do not specify how to choose the starting iterate w0 +1 of the inner loop. As will be seen in the later sections, two different strategies on how to choose w0 +1 will be given according to whether the objective function is strongly convex or simply convex. In Step 5, there are different ways of choosing the sample set K such that the probability of each index in {1,...,n} being piced is s /n. The following simple strategy is considered in [39] and used in our numerical experiments. First, partition all the indices {1,...,n} into s disjoint groups K j, j = 1,...,s, with equal cardinality n/s assuming s divides n. Then, K consists of picing one index from each group K j with uniform probability s /n. The major difference between our Algorithm.1 and the existing Prox-SG methods is that the subproblem.10 is solved inexactly in Algorithm.1. Hence, the computational cost of solving each subproblem could be reduced greatly compared with exactly solving of.10 required by other methods. More precisely, we only find an ε t, ˆε t -solution w t of the subproblem.10. By Definition.1, we want to find wt such that there exists an u t satisfying

9 Inexact proximal stochastic gradient method for convex u t η ε t and w t 1 w t u t vt 1 η ˆε h w t t..11 For convenience, in the following, we let ε t = ε t +ˆε t..1 Hence, by.3, we have q t wt min q w R d t w + ε t..13 In the later sections, we would give the global convergence and the complexity analysis on the rate of E[P w N ] Pw converging to zero after N outer iterations of Algorithm.1, where w is an optimal solution of 1.1, i.e., w arg min w R d Pw. The following assumption is required throughout this paper. AS.1 For any i {1,...,n}, f i is Lipschitz continuously differentiable with Lipschitz constant L i > 0, i.e., f i x f i y f i y, x y L i x y, for any x, y R d. In the following we denote L = max L i..14 i {1,...,n} Hence, under the Assumption AS.1, all the functions f i, i = 1,...,n, and f are Lipschitz continuously differentiable with constant L. To facilitate the analysis, we first observe that for any w R d, f K w is an unbiased estimate of f w. Indeed, since for any index i the probability of i K is s /n, by taing expectation of f K w with respect to the random sample set K,we have [ ] E K [ f K w] = 1 E K f i w = 1 n [ s ] s s n f iw i K = 1 n n f i w = f w..15 So, by the definition of vt 1 in Algorithm.1,.15impliesthatv t 1 is an unbiased estimate of f wt 1 [ ], that is, E v t 1 = f w t 1. The variance of the stochastic gradient vt 1 can be bounded by the following lemma, whose proof is a proper modification of the proof of Lemma 3.4 in [36] due to the existence of batch size s.

10 X. Wang et al. Lemma.1 Under assumption AS.1, we have E K [ v t 1 f where L is defined in.14. ] wt 1 4L [ ] P wt 1 Pw + P w 1 Pw, s Proof Given any i {1,...,n}, since f i is Lipschitz continuously differentiable with constant L i, for any w R d we have f i w f i w L i [ f i w f i w f i w, w w ]. By the definition of L and summing up the above inequality over i = 1,...,n, we have 1 n n f i w f i w L[ f w f w f w, w w ]. From the optimality condition of 1.1, there exists a p such that for any w R d f w + p,w w 0, which together with the convexity of h yields that f w, w w p,w w hw hw..16 Then, it follows from.16 thatforanyw R d we have 1 n f i w f i w L[Pw Pw ]..17 n Now, noticing that E K [ f K wt 1 ]= fw t 1, E K[ f K w 1 ]= f w 1 and the fact that for any index i the probability of i K is s /n, wehave [ v ] E K t 1 f wt 1 [ = E K fk wt 1 f K w 1 + f w 1 f = 1 s E K i K ] wt 1 f i wt 1 f i w 1 + f w 1 f wt 1 [ = 1 s E K f i wt 1 f i w 1 + f w 1 f i K ] wt 1

11 Inexact proximal stochastic gradient method for convex [ 1 s E K i K f i wt 1 ] f i w 1 = 1 s n s n f i wt 1 f i w 1. Hence, it follows from.17 that E K [ v t 1 f s n n 4L s [ P ] wt 1 [ fi wt 1 w t 1 ] f i w + f i w 1 f i w ] Pw + P w 1 Pw. Remar.1 The Lemma.1 provides an insight that the batch size s can help to further reduce the variance of the stochastic gradient. By the later analysis, although the overall component gradient complexity can not be reduced by increasing the batch size s, the stochastic gradient could be obtained by parallel computing when s > 1 and hence, the total number of subproblems to be solved can be reduced by a factor of s. Note that since we do not save the component gradients f i w 1, i = 1,...,n, at Step 3 of Algorithm.1, in together s component gradients need to be computed at Step 5 of Algorithm.1. The following lemma provides a bound on the difference between the inexact proximal stochastic gradient step given by.10 and the exact proximal full gradient step. Lemma. Under assumption AS.1, we have wt w t v η t 1 f wt 1 + η εt, where w t denotes the exact solution to.10 with vt 1 replaced by f wt 1. Proof Since wt is an ε t, ˆε t -solution of the subproblem.10, we have from.13 and the definition of εt in.1 that q t wt qt ŵt εt, where ŵ t is the exact minimizer of q t defined in.10. So, by the strong convexity of q t we have 1 w η t ŵt qt wt qt ŵt εt,

12 X. Wang et al. which implies wt ŵt ηεt..18 Note that w t and ŵ t satisfy w t 1 w t η f wt 1 h w t and w t 1 ŵ t vt 1 η h ŵt. Hence, we have w h ŵt h w t t 1 w t h w t h ŵt η w t 1 ŵ t η f Summing up the above two inequalities yields that w t ŵt η η v t 1 f wt 1, ŵt w t and vt 1, w t ŵt. vt 1 w f t 1, w t ŵt wt 1 w t ŵt, which gives w t ŵt v η t 1 f wt 1. Therefore, by.18 wehave w t w t w t ŵ t + wt ŵt v η t 1 f wt 1 + η εt. Remar. From the above Lemma., we can derive the following inequality, which will be frequently used in the later analysis: v t 1 f wt 1,w wt v t 1 f wt 1,w w t + vt 1 wt 1 f w t wt v t 1 f wt 1,w w t v + η t 1 f wt 1 + η εt v t 1 f wt 1 v t 1 f wt 1,w w t v + η t 1 f wt ε t..19

13 Inexact proximal stochastic gradient method for convex The following lemma gives a bound on the function value gap Pw t Pw. Lemma.3 Under assumption AS.1, if the proximal parameter η in Algorithm.1 satisfies η 1/L, we have P wt Pw + 1 w w η + 1 η u t,w w t t 1 w wt + vt 1 f w t 1, w wt +ˆε t,.0 where u t η ε t. Proof Since wt is an ε t, ˆε t -solution of the subproblem.10, we have from.11 that hw + 1 w w t 1 η w h wt + t 1 wt u t vt 1 η,w wt ˆε t + 1 w w t 1 η = h wt 1 u t η,w wt vt 1,w wt ˆε t + 1 w w t η + 1 w η t wt 1, where u t η ε t. By rearranging the terms, we have 1 w w t η 1 w w t 1 wt w t 1 + hw h η + vt 1,w wt + 1 u t η,w wt +ˆε t 1 w w t 1 wt w t 1 + hw h η + f wt 1,w wt η u t,w w t wt w t v t 1 f w t 1, w w t +ˆε t..1 It follows from the convexity of f that f w f wt 1 + f wt 1,w wt 1..

14 X. Wang et al. By the Lipschitz continuity of f with Lipschitz constant L, wehave f wt 1 f wt f wt 1, w t wt 1 L wt wt 1..3 Then, summing. and.3 yields f w f wt + f wt 1,w wt L wt wt 1. Combing the above inequality and.1, we have 1 w w t η 1 w wt 1 1 L w t w t 1 + Pw P η η + v t 1 f wt 1,w wt + 1 u t η,w wt +ˆε t, wt which gives.0 due to η 1/L. 3 Convergence properties for strongly convex case In this section, we investigate the theoretical properties of Algorithm.1 with the additional assumption that f in the objective function 1.1 is μ-strongly convex. Since h in 1.1 is convex, the objective function P is μ-strongly convex, i.e., for any w R d we have Pw Pw μ w w. 3.1 For the strongly convex case in this section, the w +1 0 in Step 8 of Algorithm.1 is chosen as w +1 0 = w. 3. The next theorem gives a recursive relation between E[P w ] Pw and E[P w 1 ] Pw. Theorem 3.1 Under assumption AS.1, if f is μ-strongly convex and the proximal parameter η in Algorithm.1 satisfies { s η < min 1L, 1 } L where s {1,...,n}, the following property holds:

15 Inexact proximal stochastic gradient method for convex where E[P w ] Pw s μη m s 1η L + 1m + 1η L E[P w 1 ] Pw m s 1η L s + m s 1η L A, 3.3 A = m ε t + 3 m m m ε i + ˆε t + 3 m εt 3.4 and the expectation is taen with respect to all the history random variables. Proof Summing up.0 over t = 1,...,m,wehave i=t m P wt Pw 1 w 0 η w wm w + 1 m u t η,w m m wt + ˆε t + v t 1 f wt 1,w wt, 3.5 where u t η ε t. We first bound the term η 1 m u t,w wt.by.13, η < L 1 proof of [30, Proposition 3], we can obtain and the same w t w 1 μη t w 0 w + v η i 1 f wi 1 + η εt t 1 μη i. Then, it follows from μη < 1 due to η < 1 L 1 μ that w t w w 0 w + Hence, we have t v η i 1 f wi 1 + η εt. 1 η m u t,w wt 1 m η u t w wt 1 w m η 0 w η ε t + 1 m η ε t η t

16 v η i 1 f wi 1 + η εi X. Wang et al. = m m i=t m m i=t m m ε t + 1 η ε i ε t m w 1 η 0 w + η v η t 1 w f t w m η 0 w + η v ε i t 1 f w t 1 m + ε t i=t m m + m + 1 m ε i + η η w 0 w + 1 m ε i + i=t ε t + η εt m m vt 1 f w t 1 ε t + 1 w 3 η 0 w + m i=t m ε i ε t m ε i + m + η vt 1 wt 1 f. 3.6 It thus follows from.19, 3.5, 3.6 and the definition of A in 3.4 that m P wt Pw 1 w 0 η w wm w + 1 w η 0 w m + 3η vt 1 w f t 1 m + v t 1 f wt 1,w w t + A i=t 1 w m η 0 w + 3η vt 1 w f t 1 ε t

17 Inexact proximal stochastic gradient method for convex m + v t 1 f wt 1,w w t + A μη [ P w 1 Pw ] + 3η m + m vt 1 w f t 1 v t 1 f wt 1,w w t + A, where the last inequality follows from w0 = w 1 and the μ-strong convexity 3.1 of P. Taing expectation on both sides of the above inequality and noticing that ] E [ v t 1 f wt 1,w w t w t 1 = 0, 3.7 we obtain η m + 3η [ ] E P wt Pw [ E P w 1 ] Pw μ m Then it follows from Lemma.1 that η m EP wt Pw [ v E t 1 wt 1 ] f + η A. [ E P w 1 ] Pw + 1η L m [ ] E P wt 1 Pw μ s + E [ P w 1 ] Pw + η A = μ [ E P w 1 ] Pw + 1η L [ ] E P w0 Pw s m + 1η L s t= [ ] E P wt 1 Pw + 1m η L s E [ P w 1 ] Pw + η A. Noticing that w 0 = w 1 and rearranging terms, we have η 1 1η m L [ ] E P wt Pw s μ + 1m + 1η L s

18 X. Wang et al. E [ P w 1 Pw ] + η A. 3.8 Since w = m 1 m w t, it follows from the convexity of P that m [ ] E P wt [ Pw m E P w ] Pw. 3.9 Then, 3.3 follows from η < s 1L and 3.8. Motivated from Theorem 3.1, if we further set ε t ε t 1 μη t 1, for = 1,...,N and t = 1,...,m, then we have m m ε t ε t m 1 μη t 1 1 μη m εt, and m m i=t ε i m 1 μη = 1 μη [ m m εi i=t i=t [ m m εi i=t [ m t εt ] 1 μη i 1 1 μη t 1 ] 1 μη i 1 ] 1 μη m εt. Then, due to μη < 1 and ˆε t ε t, A defined in 3.4 can be upper bounded by Hence, by 3.3 wehave 1 A m εt 5 m εt μη μη μη. E [ P w ] s Pw μη m s 1η L + 1m + 1η L E [ P w 1 ] Pw m s 1η L + 5s μη m s 1η L which would lead to the following theorem. m εt, 3.10

19 Inexact proximal stochastic gradient method for convex Theorem 3. Under assumption AS.1,if f isμ-strongly convex and the parameters in Algorithm.1 are set as { s η < min 40L, 1 }, L m > 0θ for some θ>1, μη and ε t = εt 1 μη t 1, 3.11 where s {1,...,n}, then we have where E[P w ] Pw γ E[P w 0 ] Pw + 5 m εt i γ i+1, 3.1 γ = 1 + 6θ 7θ < 1. Proof Since η < s /40L, we have that s s 1η L < 10 7 and η L s 1η L < 1 8. Hence, denoting γ = we have from m > 0θ μη γ < s μη m s 1Lη + 1m + 1Lη m s 1Lη, 3.13 and θ>1 that 0 + 1m μη m 8m 7μη m 8 < 1 + 6θ 7θ = γ<1. Notice that the coefficient of the last term in 3.10 is less than 5 γ. Hence, by 3.10 we have E[P w ] Pw γ E[P w 1 ] Pw + 5γ m εt Then, 3.1 follows from induction and γ <γ <1. Remar 3.1 From 3.13, we can see that for fixed η and m, γ will decrease as s increases. Hence, by 3.14, increasing the sample size s when calculating the stochastic gradient in Step 5 of Algorithm.1 will generally improve the convergence speed of E[P w ] Pw to zero. By Theorem 3., to ensure E[P w ] Pw converges to zero, it is sufficient to require m εi t γ i+1 converges to zero as increases to infinity. This gives

20 X. Wang et al. us certain freedom for the choices of εt. In the following, we analyze several different choices of εt and derive its corresponding complexity bounds. Corollary 3.3 Under assumption AS.1, suppose f is μ-strongly convex and the parameters in Algorithm.1 are set as in Then, 1 if the subproblem tolerance ε t satisfies we have E[P w ] Pw γ P w 0 Pw if the subproblem tolerance ε t satisfies we have ε t α +t with α 0, 1, α max{γ,α} 1 α1 max{γ, α} max{ α, γ} ; 3.16 ε t 1 β αt with α 0, 1 and β>0, 3.17 E[P w ] Pw γ P w 0 Pw + where ξ is a scalar satisfying 3 if the subproblem tolerance ε t satisfies we have 5αγ 1 α1 γ ξ β, 3.18 γ ξ = ξ β ; 3.19 εt 1 β 1 t 1+φ with β>0 and φ>0, 3.0 E[P w ] Pw γ P w 0 Pw + 5γ1 + φ 1 ξ β, γ where ξ is a scalar satisfying 3.19; 4 if the subproblem tolerance ε t satisfies ε t 1 + t κ with κ>1, 3.

21 Inexact proximal stochastic gradient method for convex we have E[P w ] Pw γ P w 0 Pw + where ξ is a scalar satisfying γ κ 11 γ ξ1 κ, 3.3 Proof We now analyze the estimate bound for E[P w ] Pw case by case. 1 If the tolerance εt satisfies 3.15, direct calculations show that m ε t α Then γ i+1 α i = γ i+1 α i max{γ, α} +i+1 max{γ,α} 1 max{γ, α} max{γ, α}. So, 3.16 follows from 3.1 directly. If the tolerance εt satisfies 3.17, then m that ε t α E[P w ] Pw γ P w 0 Pw + 1 α α. 1 α β. It follows from 3.1 5α 1 α γ i+1 i β. We now compute γ i+1 i β. Since ξ satisfies γ ξ = ξ β, we have ξ 0,, and γ i+1 i β = ξ ξ γ i+1 i β + i= ξ +1 γ i+1 + ξ +1 β γ i+1 i β i= ξ +1 γ i+1 γ ξ +1 + γ 1 γ 1 γ ξ +1 β γ 1 γ ξ β. 3.4 Therefore, we obtain If the tolerance εt satisfies 3.0, we have m m m εt = 1 β + t 1+φ 1 β + x 1+φ dx 1 + φ 1 β. t= Then, 3.1 follows from 3.1 and 3.4.

22 X. Wang et al. 4 If the tolerance ε t satisfies 3., we have m εt = + t κ m m Then, 3.3 follows from 3.1 and x κ dx = 1 κ κ 1 + m 1 κ κ 1 Remar 3. Notice that for any given λ 0, 1, wehave lim γ 1 λ = 0. λ β So, by the definition of ξ in 3.19, there exists an integer K > 0 such that which yields that ξ λ for all K, ξ β λ β. 1 κ κ Therefore, all the upper bounds in 3.18, 3.1 and 3.3 converge to zero as increases to infinity. Remar 3.3 We are more interested in the complexity bounds of the total number of component gradient evaluations to reduce the function value gap E[P w ] Pw below certain tolerance. This complexity bound is also called batch complexity, which often measures the dominating cost of solving original problem 1.1. From our theoretical analysis, it seems reasonable to choose the parameters m and s such that m s = OL/μ. With this choice of m and s, let us discuss the component gradient complexity bounds implied by Corollary 3.3 under different accuracies of solving the proximal subproblem.10. For the tolerance setting 3.15, to achieve E[P w N ] Pw <εfor some ε>0, by 3.16 the maximum outer iteration number N should satisfy max{ α, γ} N = Oε, which implies that N = Olog1/ε. Hence, by the fact that m s is on the order L/μ, the total number of component gradient evaluations Nn + N =1 m s is in the order of n + L 1 O log. 3.7 μ ε Note that when all the subproblems.10 are solved exactly, our Algorithm.1 will be reduced to the method proposed in [36]. The same complexity bound 3.7 was obtained in [36], but it required to solve all the subproblems exactly. Hence, based on our results, the algorithm could have the same complexity bounds even

23 Inexact proximal stochastic gradient method for convex without solving the subproblems exactly. When the subproblem does not have a closed-form solution or is too expensive to be solved exactly, allowance of solving the subproblems inexactly would be crucial for both efficiency and stableness of the algorithm. For the tolerance setting 3.17, to achieve E[P w N ] Pw <ε,by3.18 and Remar 3., when ε>0 is sufficiently small, it suffices to require γ N = Oε and N β = Oε. This implies the outer iteration number N should be in the order of Oε 1/β when N is large. Thus, for a sufficiently small ε>0, the component gradient complexity bound is n + L 1 O μ ε 1/β. 3.8 For the tolerance setting 3.0, to achieve E[P w N ] Pw <ε,by3.1 and Remar 3., when ε>0issufficiently small, it suffices to require γ N = Oε and 1 + φ 1 N β = Oε. This implies the outer iteration number N should in the order of Oρε 1/β, where ρ = min{1,φ}. Thus, for a sufficient small ε>0, the component gradient complexity bound is n + L O μ 1 ρε 1/β. 3.9 For the tolerance setting 3., similar to above two cases, following from 3.3 and Remar 3., to obtain E[P w N ] Pw <εfor a sufficiently small ε>0, it suffices to require γ N = Oε and 1 κ 1 N 1 κ = Oε. This implies N should be in the order of Oκ 1ε 1/κ 1 and the component gradient complexity bound is n + L O μ 1 κ 1ε 1/κ Note that compared with the latter three cases, the tolerance 3.15 has the most accuracy. Intuitively, it should have the smallest complexity bound, which is verified by 3.7. Meanwhile, we can see that as less subproblem accuracy is required, the corresponding component gradient complexity bound increases. Hence, proper subproblem accuracy setting should depend on the balance of the cost of solving the subproblems and the cost of component gradient evaluations. Remar 3.4 It also deserves to mention the inexact proximal gradient methods proposed in [30]. To obtain a linear convergence rate analogous to 3.16 when the

24 X. Wang et al. objective is strongly convex, the linear decrease rate on the subproblem accuracies similar to 3.15 is also required in [30]. But methods proposed in [30] are deterministic methods, which do not consider the particular summation structure of our objective function in 1.1. Furthermore, by combining with variance reduction techniques, we show the more relaxed subproblem accuracies, such as 3.17, 3.0 and 3., can be applied without loosing the convergence, which have not been discussed in [30]. 4 Convergence properties for general convex case In this section, we discuss the theoretical properties of Algorithm.1 for solving the problem without assuming the strong convexity of f. Different from 3., for the general convex case in this section, the w0 +1 in Step 8 of Algorithm.1 is chosen as w0 +1 = wm. 4.1 And given an initial positive integer m 0, we require the succeeded inner iteration number m, = 1,,..., in Algorithm.1 is increasing and satisfies m = 1 + θ m 1, 4. where θ > 0 is a parameter. First, the following lemma [30, Lemma 1] will be used in our analysis to bound w t w. Lemma 4.1 Assume that the nonnegative sequence {v } satisfies the following recursion for all 1: v S + λ i v i, with {S } an increasing sequence, S 0 u 0 and λ i 0 for all i. Then for all 1, 1/ v 1 1 λ i + S + λ i. Based on Lemma 4.1, we give the following bound on w t w. Lemma 4. Under assumption AS.1, if the proximal parameter η in Algorithm.1 satisfies η 1/L, then w t w t v η i 1 f wi 1 + η ε i + w 0 w + η t t 1/ ˆε i v + η i 1 f wi 1 + η ε i. 4.3

25 Inexact proximal stochastic gradient method for convex Proof Note that.0 together with u t η ε t give wi w w i 1 w + η v i 1 f w wi 1 + η ε i i w + η ˆε i. Summing up the above inequality over i = 1,...,t, wehave w t w w 0 w + t + η ˆε i. t w v η i 1 f wi 1 + η ε i i w By letting v i = w i w, St = w 0 w + t η ˆε i and λ i = η ε i + η vi 1 f w i 1 in Lemma 4.1, we obtain 4.3. Now we give the following theorem which is analogous to Theorem 3.1 for the strongly convex case. Theorem 4.1 Under assumption AS.1, if the proximal parameter η in Algorithm.1 satisfies { s η < min 16Lθ +, 1 }, 4.4 L the following property holds: [ s w ] E[P w ] Pw + η m s 16η L E +1 0 w 16η L [ ] + E P w0 +1 Pw m s 16η L 1 + η 1 + θ R + m B, 4.5 where [ s w R = E[P w 1 ] Pw + η m 1 s 16η L E 0 w ] 16η L [ ] + E P w0 Pw m 1 s 16η L 4.6 and B = 1 m m ε t + 4 η m i=t ε i + 5 m εt. 4.7

26 X. Wang et al. Here, the expectation is taen with respect to all the history random variables. Proof Summing up.0 over t = 1,...,m, due to w +1 0 = w m we have m P wt Pw 1 w 0 η w w0 +1 w + 1 m u t η,w wt m + v t 1 f m wt 1,w wt + ˆε t, 4.8 where u t η ε t. We first give a bound on the term η 1 m u t,w wt. The upper bound of u t in 4.8 and Lemma 4. give where 1 η m E 1 := 1 m η ε t η m E := 1 η η ε t u t,w wt 1 m η η ε t t v η i 1 f wi 1 + η ε i, w t t 0 w + η ˆε i + Now, let us first derive a bound on E 1 : m E 1 = 1 = 3 m i=t m m i=t m m w w t v η ε i t 1 w f t 1 m + i=t m ε i + η ε i E 1 + E, 4.9 1/ v η i 1 f wi 1 + η ε i. m i=t ε i ε t vt 1 w f t 1 m m + ε t + m + η vt 1 w f t 1 m + ε t. Noticing the definition of ε t in.1, we have the following bound on E : E 1 m w η ε t 0 η w + t η ˆε i + t m i=t v η i 1 f wi 1 + η ε i ε i

27 Inexact proximal stochastic gradient method for convex m ε t w m t m t η 0 w + ε t ˆε i + η ε t vi 1 w f m t i 1 + ε t ε i w 0 w + 1 m m m m m v ε t + ˆε t ε i + η ε i t 1 w η f η t 1 i=t i=t m m + ε t ε i i=t w 0 w + 1 m m m m m ε t + ˆε t + ε i + η vt 1 w η f t 1 η i=t + 1 m m m m m ε i + ε t + ε i i=t i=t w 0 w + 1 m m ε t + εt + 5 m m m ε i + η vt 1 w η η f t 1. i=t By inserting the above two bounds on E 1 and E into 4.9, we have 1 η m w u t,w wt 0 w + η η m vt 1 f w t m m ε t + 4 η m m m i=t ε i + εt + ε t Combining.19, 4.8 and 4.10 gives m P wt Pw 1 + η w 1 w η 0 w +1 0 w η m + 4η vt 1 w f t 1 m + v t 1 f wt 1,w w t + B, 4.11 where B is defined in 4.7. Taing expectation on both sides of 4.11 and noticing 3.7, it follows from Lemma.1 that m E[Pwt ] Pw 1 + η η E[ w 0 w ] 1 η E[ w +1 0 w ]

28 X. Wang et al. + 16η L s m E[Pwt 1 ] Pw + 16m η L s EP w 1 Pw + B. Then, by η < s 16L and w+1 0 = wm,wehave 1 16η L m E[Pwt s ] Pw + 1 η E[ w +1 0 w ]+ 16η L s E[Pw +1 0 ] Pw 16m η L s E[P w 1 ] Pw η E[ w0 η w ]+ 16η L E[Pw0 s ] Pw + B. 4.1 Then, it follows from 3.9 and 4.1 that s E[P w ] Pw + η m s 16η L E[ w+1 0 w ] 16η L + m s 16η L E[Pw+1 0 ] Pw 16η L s 16η L E[P w s 1 + η 1] Pw + η m s 16η L E[ w 0 w ] 16η L + m s 16η L E[Pw 0 ] Pw s + m s 16η L B Since η < s 16Lθ +,wehaves 16η L > θ +1 θ + s. Therefore, we have 16η L s 16η L < θ and s s 16η L = θ + <, 4.14 θ + 1 which together with 4.13 and m = 1 + θ m 1 imply 4.5. Motivated from Theorem 4.1, if we further set ε t ε t 1 η t 1, for = 1,...,N and t = 1,...,m, 4.15 then we have m m ε t ε t m 1 η t 1 1 η m εt,

29 Inexact proximal stochastic gradient method for convex and m m i=t ε i m 1 η = 1 η [ m m εi i=t i=t [ m m εi i=t [ m t εt ] 1 η i 1 1 η t 1 ] ] 1 η i 1 1 m εt η. Then, the B defined in 4.7 can be upper bounded by 5 B + 5 m εt η We now give the convergence property of Algorithm.1 for solving without strongly convex assumption. Theorem 4. Under assumption AS.1, if the parameters η and ε t in Algorithm.1 are set as { s η = η<min 16Lθ +, 1 } and ε t = εt L 1 η t 1, for = 1,...,N and t = 1,...,m, 4.17 the inner iteration number m satisfies the condition 4. and the sample size {s } N =1 is a nondecreasing sequence, then we have E[P w ] Pw a η m 0 where a 0 = m i η i 16ηL [P w 0 Pw ]+ m 0 s 1 16ηL Proof First, for all 1, denote ε i t j=1 1 + η 1 + θ j, 4.18 s 1 ηm 0 s 1 16ηL w1 0 w s a = E[P w ] Pw + ηm s 16ηL E[ w+1 0 w ] 16ηL + m s 16ηL E[Pw+1 0 ] Pw.

30 X. Wang et al. Then, it follows from Theorem 4.1,4.17 and {s } =1 N being a nondecreasing sequence that for all η a a 1 + B 1 + θ m So, by induction we get a a 0 j=1 1 + η 1 + θ j + j=i η B i. 1 + θ j m i Then, 4.18 follows from m i = m ij= θ j and the upper bound 4.16onB i. Remar 4.1 Note that to ensure E[P w ] Pw converges to zero as goes to infinity, by Theorem 4., a sufficient condition is to require j=1 1 + η 1 + θ j 0 and 1 + m i η i ε i t j=1 1 + η 1 + θ j 0, as goes to infinity. Motivated from Remar 4.1, we have the following Corollary. Corollary 4.3 Under assumption AS.1, suppose the inner iteration number m satisfies the condition 4. with θ = θ>0and the parameters in Algorithm.1 are set as { s s = s, η = η<min 16Lθ +, 1 } L,θ and ε t = εt 1 η t Then, we have E[P w ] Pw a η m 0 where a 0 is defined in 4.19 and 1 + m i η i ε i t γ, 4.1 γ = 1 + η 1 + θ < Consequently, 1 if the subproblem tolerance εt satisfies 3.15, then E[P w ] Pw a 0 + α 10η m 0 1 α γ ; 4.3

31 Inexact proximal stochastic gradient method for convex if the subproblem tolerance ε t satisfies 3.17, then E[P w ] Pw 3 if the subproblem tolerance ε t satisfies 3.0, then E[P w ] Pw 4 if the subproblem tolerance ε t satisfies 3., then E[P w ] Pw a 0 + α10η m 0 1 α γ ; 4.4 η a φ 1 10η γ ; 4.5 m 0 η a η m 0 κ 1 γ. 4.6 η Proof First, by Theorem 4., 4.1 follows from 4.18 directly. 1 If the subproblem tolerance εt satisfies 3.15, then m i implies that 1 + m i η i εt i = α 1 α α 1 + η i α 1 α Then, 4.3 follows from 4.1. If the subproblem tolerance εt satisfies 3.17, then m i implies that 1 + m i η i εt i α 1 α 1 + η i εi t α 1 α αi, which α 1 + η α α 1 α. εi t α α η1 α. 1 α i β, which Then, 4.4 follows from If the subproblem tolerance ε t satisfies 3.0, then 4.5 follows from 4.1 and the fact that m i εt i 1 + φ 1 i β 1 + φ 1 and 1 + η i 1, η where the first inequality is by If the subproblem tolerance ε t satisfies 3., then 4.6 follows from 4.1 and the fact that m i ε i t 1 κ 1 i 1 κ 1 κ 1, where the first inequality is by 3.6.

32 X. Wang et al. Remar 4. Following from Corollary 4.3, wehavee[p w ] Pw converges to zero as goes to infinity. We now analyze the component gradient complexity of Algorithm.1 implied by Corollary 4.3. In the algorithm, suppose we set m 0 = OL, η = OL 1 and s = O Then, it follows from 4.7 and 4.14 that 16η L s 16η L = O1, s s 16η L = O1. Hence, we have a 0 = O1. In the following, let us denote τ = log1 + θ log1 + θ log1 + η. 4.8 Then, Corollary 4.3 would give the following complexity bounds. Suppose ε t is set as It follows from a 0 = O1 and 4.7 that the coefficient of γ in 4.3 iso1. Hence, to achieve E[P w N ] Pw <εfor a given ε > 0, the outer iteration number N should be in the order of Olog γ ε = Olog1/ε. So, by direct calculation, the total number of component gradient evaluations required by Algorithm.1 is nn + sm i = nn + sm θ N θ θ = O n log 1 + Lε ε τ. 4.9 Suppose ε t is set as It follows from a 0 = O1 and 4.7 that the coefficient of γ in 4.4isOL 1/. Hence, to achieve E[P w N ] Pw <εfor a given ε>0, the outer iteration number N should be in the order of Olog γ ε/l 1/ = OlogL 1/ /ε. Therefore, the total number of component gradient evaluations required by Algorithm.1 is L 1/ L 1/ τ O n log + L. ε ε Suppose ε t is set as 3.0. It follows from a 0 = O1 and 4.7 that the coefficient of γ in 4.5isOL 1/ /ρ where ρ = min{1,φ}. Hence, to achieve E[P w N ] Pw <εfor a given ε>0, the outer iteration number N should be in the order of Olog γ ρε/l 1/ = OlogL 1/ /ρε. Therefore, the total number of component gradient evaluations required by Algorithm.1 is L 1/ L 1/ τ O n log + L. ρε ρε

33 Inexact proximal stochastic gradient method for convex Suppose εt is set as 3.. It follows from a 0 = O1 and 4.7 that the coefficient of γ in 4.6 isoσ, where σ = max{1, L 1/ /κ 1}. Hence, to achieve E[P w N ] Pw <εfor a given ε>0, the outer iteration number N should be in the order of Olog γ ε/σ = Ologσ/ε. Therefore, the total number of component gradient evaluations required by Algorithm.1 is σ σ τ O n log + L. ε ε Similar to the strongly convex case, we can see as the subproblem tolerances become less strict, the component gradient complexity bound would increase. Again, the proper subproblem accuracy setting should depend on the balance of the subproblem difficulty and the cost of component gradient evaluations. Remar 4.3 When all the subproblems.10 are solved exactly, the complexity bound On log1/ε + L/ε is obtained in [40] for solving without strongly convexity assumption. By comparison, the complexity bound 4.9 is worse since it has a power τ > 1. This gap arises when the term 1/η u t,w wt in 4.8 is estimated. When the subproblems are solved exactly, this term vanishes. However, when the subproblems are solved inexactly, this term needs to be estimated as shown in 4.10, which leads to the the reduction ratio 1 + η/1 + θ in 4.5, and the power τ>1finally appear in the complexity result. But, on the other hand, when the Lipschitz constant L is very large, which is usually the case for ill-posed problems, we could have θ>>l 1 and τ would be close to 1. Then, the complexity bound 4.9 would be very close to the complexity bound obtained in [40], which assumes exactly solving of all the subproblems. Observe that by the same analysis for establishing 4.16, we can replace 4.15 by ε t = ε t 1 σ η t 1, 4.30 where σ 0, 1 is a constant. In this case, the convergence complexity order of Algorithm.1 given in Theorem 4. and Corollary 4.3 would still be the same. In addition, based on Theorem 4., we could also have a variable choice of θ as which gives θ = λ 1 + η η 1 + θ = λ + 1, where λ>0 is a constant, Then, by 4.18 wehave E[P w ] Pw a η m 0 1, m i η i ε i t λ.

34 X. Wang et al. Same analysis as in the proof of Corollary 4.3 would show the factor in front of λ/ in the above estimate are bounded for all the four choices of tolerance settings 3.15, 3.17, 3.0 and 3.. Hence, the convergence of E[P w ] Pw to zero, as goes to infinity, is still guaranteed. 5 Experiments In this section, we consider solving the following CUR-lie factorization optimization proposed in [3]: min PX X Rnr nc := 1 W WXW F + λ n c row X i p + λ c X j p, 5.1 n r j=1 where W R n c n r is a given matrix, X i = X i, ; and X j = X :, j are the i-th row and j-th column of the matrix X, respectively. This optimization model 5.1 would generate a solution X with sparse rows and columns with different choices of p > 0. It can be seen that the problem 5.1 would have the formulation as 1.1 by setting f X := 1 W WXW F = 1 f i, j X n r n c n r j=1 n c hx := λ row X i p + λ c X j p, j=1 and where f i, j X = W i, j W i XW j, W i = W i, : and W j = W :, j are the i-th row and j-th column of the matrix W, respectively. In [3], the authors choose p = for which the proximal subproblem.10 has a closed-form solution. The more natural choice of p = isdiscussedin[30]. However, in this case there is no closed-form formula to find the exact solution of.10. Hence, the proximal subproblem.10 is solved in [30] by the bloc coordinate descent BCD algorithm proposed in [16], which is also equivalent to Dystra s algorithm for two monotone operators [1]. In our experiments, we also consider the choice of p = and the proximal subproblem.10 is solved by the same BCD algorithm used in [30]. In the implementation of the BCD method, the proximal subproblem is minimized alternatively with respect to the rows and columns. Since the duality gap can be computed while applying the BCD algorithm, we can compute an approximate solution of the subproblem.10 until its function value duality gap is below any given tolerance ε>0. The same data sets in [3,30] 1 are used in our experiments. We set λ row = 0.01 and λ col = 0.01 in 5.1 asthosein[30], which would yield approximately 5 40% non-zero entries in the solution. The data sets is summarized in the following table: 1 The datasets are available at

35 Inexact proximal stochastic gradient method for convex Data sets 9_Tumors Brain_Tumor1 Leuemia1 SRBCT n r n c In the experiments, the size of the random sample index set in Step 5 of Algorithm.1 is set as s = n c for all. The starting point w0 +1 for the inner loop is selected by 4.1, i.e., w +1 0 = w m. For the inner iteration number, we set m = 1 + θ m 1 with m 0 = 1, and as suggested by 4. and 4.31, we consider the following two choices of θ : 1 θ = λ η 1 with λ 1 = 1.001; θ = λ η 1 with λ = 0.8. Fig. Objective function value gap P w Pw vertical axis against the number of BCD iterations horizontal axis. In this figure, θ = λ η 1andλ 1 = a 9_Tumors, b Brain_Tumor1, c Leuemia1, d SRBCT

36 X. Wang et al. Fig. 3 Objective function value gap P w Pw vertical axis against the number of BCD iterations horizontal axis. In this figure, θ = λ η 1andλ = 0.8. a 9_Tumors, b Brain_Tumor1, c Leuemia1, d SRBCT In both above two cases the proximal parameter η = max{0.7/, 1/ L} where L = 30 is an estimate of the Lipschitz constant of f in 1.1. We have tested four different types of tolerance εt according to 3.15, 3.17, 3.0 and 3. as the following: 1 εt = 1 σ L t 1 α +t with α = 0.9. εt = 1 σ L t ε t = 1 σ L t 1 4 ε t = 1 σ L t 1 α t with α = 0.9 and β = 0.5. β 1 1 with β = 1 and φ = 0.5. β t 1+φ 1 +t κ with κ = 3. In all the above cases, we set σ = 0.1 and L = 1/η. All the subproblems.10are solved by the BCD algorithm until the subproblem function value duality gap is below ε t. Then, by [38, Thm..8.7] and [3, Section 4.3], the inexact solution w t obtained by the BCD algorithm is an ε t, ˆε t -solution of the problem.1 with ε t +ˆε t ε t. Hence, tolerance condition 4.30 on ε t is satisfied. Firstly, to show the benefits of solving the subproblems inexactly, we also tested the Algorithm.1 but with almost exactly solving of the subproblem.10 until

37 Inexact proximal stochastic gradient method for convex Fig. 4 Objective function value gap Pw Pw vertical axis against the number of effective passes horizontal axis. In this figure, θ = λ η 1andλ 1 = ɛ = 3 represents the method proposed in [30]. a 9_Tumors, b Brain_Tumor1, c Leuemia1, d SRBCT its function value duality gap is below Note that the only differences of the comparing algorithms are their tolerances for solving the subproblem. Hence, as in [30], in this case we also use the function value gap P w Pw against the number of proximal iterations BCD iterations as the measure of algorithm performance. In our numerical experiments, the optimal function value Pw is approximated by the minimum objective function value obtained by running all the comparing algorithms until P w i P w i 1 max i=0,1, P w i The numerical results corresponding to different choices of θ in cases 1 and are shown in Figs. and 3, respectively. The results clearly show that the inexact methods outperform the the exact method e.g., the method with fixed tolerance ɛ t = Hence, the result confirms our analysis that it is not necessary to solve the subproblems exactly at each iteration. On the other hand, we observe that the new algorithms with different tolerances often behave similar for this set of testing problems, especially at

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
