arxiv: v2 [math.oc] 1 Jul 2015

A Family of Subgradient-Based Methods for Convex Optimization Problems in a Unifying Framework

Masaru Ito (ito@is.titech.ac.jp)
Mituhiro Fukuda (mituhiro@is.titech.ac.jp)

Department of Mathematical and Computing Sciences, Tokyo Institute of Technology
2-12-1-W8-41 Oh-okayama, Meguro, Tokyo, Japan

Research Report B-477, Department of Mathematical and Computing Sciences, Tokyo Institute of Technology
February 2014, revised June 2015

Abstract

We propose a new family of subgradient- and gradient-based methods which converge with optimal complexity for convex optimization problems whose feasible region is simple enough. This includes cases where the objective function is non-smooth, is smooth, has composite/saddle structure, or is given by an inexact oracle model. We unify the way of constructing the subproblems that must be solved at each iteration of these methods. This permits us to analyze the convergence of these methods in a unified way, in contrast to previous results, which required different approaches for each method/algorithm. Our contribution relies on two well-known methods in non-smooth convex optimization: the mirror-descent method by Nemirovski-Yudin and the dual-averaging method by Nesterov. Therefore, our family includes them and many other methods as particular cases, and partially becomes a special case of other universal methods. For instance, the proposed family of classical gradient methods and its accelerations generalizes Devolder et al.'s and Nesterov's primal/dual gradient methods and Tseng's accelerated proximal gradient methods. As an additional contribution, the novel extended mirror-descent method removes the compactness assumption on the feasible region and the need to fix the final number of iterations in order to attain optimal complexity.

Keywords: non-smooth/smooth convex optimization, structured convex optimization, subgradient/gradient-based proximal method, mirror-descent method, dual-averaging method, complexity bounds.
Mathematical Subject Classification (2010): 90C25, 68Q25, 49M37

(Footnote to the author list: corresponding author.)

1 Introduction

1.1 Background

The gradient-based method proposed by Nesterov in 1983 for smooth convex optimization problems brought a surprising class of optimal complexity methods with preeminent performance over classical gradient methods for the worst case instances [20]. The minimization of a smooth convex function, whose gradient is Lipschitz continuous with constant $L$, by these optimal complexity methods ensures an $\varepsilon$-solution for the objective value within $O(\sqrt{LR^2/\varepsilon})$ iterations, while the classical gradient methods require $O(LR^2/\varepsilon)$ iterations; here $R$ is the distance between an optimal solution and the initial point. It is important to observe that in all of those methods, the iteration complexity is with respect to the convergence rate of the approximate optimal values and not with respect to the approximate optimal solutions.

Nesterov's optimal complexity method, as well as further improvements and extensions [, 2, 2, 23], applied or extended for solving non-smooth convex problems [5, 15, 22, 23, 26, 28, 29] with composite structure [4, 10, 11, 14, 22, 23, 25] and the inexact oracle model [7, 8], substantially changed the approach to solving large-scale structured convex optimization problems arising in machine learning, compressed sensing, image processing, statistics, etc.

For a general non-smooth convex problem, the complexity analysis of those methods is noticeably different from the smooth case. The optimal complexity in the non-smooth case is $O(M^2R^2/\varepsilon^2)$ iterations for an $\varepsilon$-solution, where $M$ is a Lipschitz constant for the objective function. A well-known optimal method for this case is the Mirror-Descent Method (MDM) proposed by Nemirovski and Yudin [9], which was later related to the subgradient algorithm by Beck and Teboulle [3]. Further variants which do not require fixing the final number of iterations to guarantee the optimal convergence rate were also proposed, but they additionally require the boundedness of the feasible region [7, 8]. The Dual-Averaging Method (DAM) proposed by Nesterov [24] and its modification [27], on the other hand, further allow us to ensure the optimal complexity $O(M^2R^2/\varepsilon^2)$ even if the feasible region is unbounded.
A key idea to obtain this enhancement in the DAM was the introduction of a sequence which we call the scaling parameter $\beta_k$ in this paper. Since the MDM and the DAM are the main motivations of the present article, we will focus our subsequent discussion on results related to them.

The (approximate) gradient-based methods proposed as the primal and dual gradient methods in [8, 25] can be interpreted as particular cases of the MDM and the DAM, respectively, for the corresponding smooth convex problems, as will become clear in the course of the article. However, they only ensure the same complexity $O(LR^2/\varepsilon)$ as the classical gradient methods, and they require different approaches to prove each of their rates of convergence. Many gradient-based methods for smooth and structured convex problems were unified and generalized in some way by Tseng [28, 29]. The three algorithms proposed there preserve the $O(\sqrt{LR^2/\varepsilon})$ iteration complexity, but they also require a separate analysis for each of them. Particular cases of Tseng's optimal methods can be seen as accelerated versions of the MDM and the DAM for smooth and structured problems [8, 25], as mentioned previously.

It is important to note that the difference between the MDM and the DAM, or related algorithms, lies in the construction of the subproblems solved at each iteration. As far as we know, there are no results formalizing a combined treatment of them to prove their convergence, as we propose in the present article.

1.2 Our contributions

All of the above methods require at each iteration the computation of minimizer(s) of one (or two) strongly convex function(s), which we call auxiliary functions, over a (simple) closed convex domain. Our main contribution is the identification of common properties intrinsic to the MDM and the DAM, which we call Property 2 (Section 3), that these auxiliary functions should satisfy in order to secure methods with optimal convergence rates.
This characterization lets us propose two strategies to construct these auxiliary functions sequentially: the extended Mirror-Descent (MD) model (13), which we believe is completely new in the literature, and the Dual-Averaging (DA) model (14).

In fact, we will show that they can be combined in arbitrary order (Proposition 4) for our final purpose.

The above strategy combined with appropriate step-sizes, which we call scaling and weight parameters, satisfying the inequalities $(R_k)$ (18) (or $(\hat{R}_k)$ (21) for the non-smooth problems and $(\hat{R}_k)$ (39) for the structured problems) will permit us to propose novel families of (in fact, infinitely many) methods to solve convex optimization problems. The iteration complexity for the non-smooth problems using either Method 9 (a) or (b) (in Section 4) is $O(M^2R^2/\varepsilon^2)$. For the structured problems, which include smooth, composite structure, saddle structure or inexact oracle [8], the iteration complexities using Methods 17 and 18 (in Section 5) are $O(LR^2/\varepsilon)$ and $O(\sqrt{LR^2/\varepsilon})$, respectively (except for the inexact oracle case).

A clear advantage of our unifying framework over the existing ones is that we can prove all the convergence results and their rates in a universal way, without specializing the proofs to a particular method/algorithm. As far as we know, this is the first time that such a general treatment unifying the MDM and the DAM is proposed. The approach of constructing optimal methods based on our framework somewhat resembles the estimate sequences [, 2, 2], the inequality $(R_k)$ [23], and the inequality (23) [27].

Our approach should be distinguished from the universal (sub)gradient-based methods which can be applied simultaneously to non-smooth or smooth problems such as [10, 11, 14], or to structured problems which can admit inexact oracles, weakly smooth functions, etc. [7, 8, 26]. Such universal approaches are also applicable to our framework (Remark 5); the development of a universal method to minimize weakly smooth and strongly convex functions based on our unifying framework is also discussed in [2]. Also, as pointed out before, Tseng unified many of these methods in three algorithms, but they require a different treatment for each of them.
Since our methods are based on extended MD and/or DA updates, they seem quite restrictive, but as Table 1 shows, many known optimal methods are particular cases of our methods or can be particularized to coincide with them.

As a minor contribution, we generalize the MD method to the extended MD method. Our methods have the advantage of not requiring a fixed number of iterations to determine the step-size (parameter). This drawback was already partially solved in [10, 11, 14], but our method has the additional advantage of not requiring the boundedness of the domain.

The structure of this article is as follows. First, we review some existing methods, in particular the MDM and the DAM for non-smooth objective functions and Tseng's accelerated gradient methods for smooth ones (Section 2). In Section 3, we propose Property 2, which represents a framework of auxiliary functions for the development of our methods, as well as some supporting lemmas. We then propose in Section 4 the general subgradient-based method and prove its convergence rate, in particular for the extended MDM and the DAM, and subsequently treat the structured problems in Section 5.

1.3 Problem setting and notations

In this paper, we consider a finite dimensional real vector space $E$ endowed with a norm $\|\cdot\|$. The dual space of $E$ is denoted by $E^*$, endowed with the dual norm $\|\cdot\|_*$ defined by

\[ \|s\|_* = \max_{\|x\| \le 1} \langle s, x \rangle, \quad s \in E^*, \]

where $\langle s, x \rangle$ denotes the value of $s \in E^*$ at $x \in E$. We consider subgradient-based and gradient-based methods to solve the following convex optimization problem:

\[ \min_{x \in Q} f(x) \tag{1} \]

where $Q$ is a nonempty closed convex, and possibly unbounded, subset of $E$, and $f : E \to \mathbb{R} \cup \{+\infty\}$ is a proper lower semicontinuous convex function with $Q \subseteq \mathrm{dom} f := \{x \in E : f(x) < +\infty\}$. For each $x \in \mathrm{dom} f$, the subdifferential of $f$ at $x$ is denoted by $\partial f(x) := \{g \in E^* : f(y) \ge f(x) + \langle g, y - x \rangle,\ \forall y \in E\}$.

Table 1: Relation between our family of (sub)gradient-based methods and other known methods. The star (*) corresponds to our result. "Complexity" indicates the number of iterations needed to obtain an $\varepsilon$-solution when the oracle of the objective function has no inexactness. [25] is included considering that its Lipschitz constant is known in advance. The third problem class, "applicable for both", indicates that the existing methods can be applied simultaneously to non-smooth, smooth, and structured problems.

problem class | complexity | some known methods | generalized methods
non-smooth | optimal, $O(M^2R^2/\varepsilon^2)$ | mirror-descent [3, 9] | extended mirror-descent: Method 13; *Method 9 (a) with the model (13)
non-smooth | optimal, $O(M^2R^2/\varepsilon^2)$ | dual-averaging [24] | *Method 9 (a) with the model (14)
non-smooth | optimal, $O(M^2R^2/\varepsilon^2)$ | double averaging [27] | *Method 9 (b) with the model (14)
non-smooth | optimal, $O(M^2R^2/\varepsilon^2)$ | sliding averaging [8]; Nedić-Lee's averaging [7] |
structured/smooth | classical, $O(LR^2/\varepsilon)$ | primal gradient [8, 9, 25] | *Method 17 with the model (13)
structured/smooth | classical, $O(LR^2/\varepsilon)$ | dual gradient [8, 25] | *Method 17 with the model (14)
structured/smooth | optimal, $O(\sqrt{LR^2/\varepsilon})$ | optimal estimate sequence method [2, 2]; Nesterov's method [23]; Tseng's modified method, see [28, (35-36)]; interior gradient method [1]; Tseng's method [28, Algorithm 1]; Lan-Luo-Monteiro's method [5]; FISTA [4]; Tseng's first APG [29] |
structured/smooth | optimal, $O(\sqrt{LR^2/\varepsilon})$ | Tseng's second APG [29]; Tseng's method [28, Algorithm 1] | *Method 21; Method 18 with the model (13)
structured/smooth | optimal, $O(\sqrt{LR^2/\varepsilon})$ | Tseng's third APG [29]; Tseng's method [28, Algorithm 3] | *Method 22; Method 18 with the model (14)
applicable for both | | Fast gradient method [8]; Ghadimi-Lan's method [10, 11, 14]; Universal gradient method [26] |
We assume throughout this paper that the problem (1) always has an optimal solution $x^* \in Q$, and that the structure of $Q$ is simple enough, or has some special structure, which permits one to solve a subproblem over it with moderate ease. See [23] for some examples.

We additionally assume that there is a proper lower semicontinuous convex function $d : E \to \mathbb{R} \cup \{+\infty\}$ satisfying the following properties:

- $d(x)$ is a strongly convex function on $Q$ with parameter $\sigma > 0$, i.e.,
\[ d(\tau x + (1-\tau) y) \le \tau d(x) + (1-\tau) d(y) - \tfrac{1}{2} \sigma \tau (1-\tau) \|x - y\|^2, \quad \forall x, y \in Q,\ \forall \tau \in [0,1]. \]
- $d(x)$ is continuously differentiable on $Q$.

We denote by $\xi(z,x)$ the Bregman distance [6] between $z$ and $x$:

\[ \xi(z,x) := d(x) - d(z) - \langle \nabla d(z), x - z \rangle, \quad z, x \in Q. \]

The Bregman distance satisfies $\xi(z,x) \ge \frac{\sigma}{2} \|x - z\|^2$ for any $x, z \in Q$ by the strong convexity of $d(x)$. We also assume that $d(x_0) = \min_{x \in Q} d(x) = 0$ for $x_0 := \arg\min_{x \in Q} d(x) \in Q$, which is used as the initial point of our methods. (We can always assume this requirement for an arbitrary point $x_0 \in Q$ by replacing $d(x)$ by $\xi(x_0, x)$.) Finally, we define $R$ as

\[ R := \sqrt{\tfrac{1}{\sigma} d(x^*)}, \qquad R := \sqrt{\tfrac{1}{\sigma} \xi(x_0, x^*)}, \]

or their upper bounds, which quantify the distance between the optimal solution $x^*$ and the initial point $x_0$ in view of the properties $d(x_0) = 0$ and $d(x) \ge \frac{\sigma}{2} \|x - x_0\|^2$ for every $x \in Q$.

2 Existing optimal methods

In this section, we review some well-known subgradient-based and gradient-based methods. In particular, we focus on the Mirror-Descent Method (MDM), the Dual-Averaging Method (DAM), and the double and triple averaging methods for non-smooth objective functions, and on Nesterov's accelerated gradient and Tseng's Accelerated Proximal Gradient (APG) methods for smooth objective functions (or for non-smooth ones with some special structures). The purpose of this section is to unify the notation of these methods in order to introduce a unifying framework for them in Section 3. To that end, we sometimes changed the names of variables, shifted their indices, and added constants to the objective functions of the optimization subproblems compared to the original articles.

2.1 Optimal methods for the non-smooth case

Let us first assume that $f(x)$ in (1) is non-smooth. The MDM [9], in the form reinterpreted by Beck and Teboulle [3], generates the following iterates from the initial point $x_0 := \arg\min_{x \in Q} d(x) \in Q$:

\[ x_{k+1} := \arg\min_{x \in Q} \{ \lambda_k [ f(x_k) + \langle g_k, x - x_k \rangle ] + \xi(x_k, x) \}, \quad k = 0, 1, 2, \ldots, \tag{2} \]

where $g_k \in \partial f(x_k)$ and $\lambda_k > 0$ is a weight. The parameter $\lambda_k$ is also referred to as a step-size; it is known that the procedure (2) reduces to the classical subgradient method $x_{k+1} := \pi_Q(x_k - \lambda_k g_k)$ when $E$ is a Euclidean space, $\|\cdot\|$ is the norm of $E$ induced by its inner product, $d(x) := \frac{1}{2} \|x - x_0\|^2$, and $\pi_Q$ is the orthogonal projection onto $Q$ (see also Auslender-Teboulle [] and Fukushima-Mine [9] for some related works). The MDM produces the following estimate [3]:

\[ \forall k \ge 0, \quad \min_{0 \le i \le k} f(x_i) - f(x^*) \le \frac{\sum_{i=0}^{k} \lambda_i f(x_i)}{\sum_{i=0}^{k} \lambda_i} - f(x^*) \le \frac{\xi(x_0, x^*) + \frac{1}{2\sigma} \sum_{i=0}^{k} \lambda_i^2 \|g_i\|_*^2}{\sum_{i=0}^{k} \lambda_i}, \tag{3} \]

where the right hand side can be bounded by $M \sqrt{\frac{2}{\sigma} \xi(x_0, x^*)} / \sqrt{k+1}$ if $M := \sup\{ \|g\|_* : g \in \partial f(x),\ x \in Q \}$ is finite and if we choose the constant weights $\lambda_i := \frac{1}{M} \sqrt{2\sigma \xi(x_0, x^*)} / \sqrt{k+1}$, $i = 0, \ldots, k$, for a fixed $k \ge 0$.
If we further know an upper bound $R \ge \sqrt{\frac{1}{\sigma} \xi(x_0, x^*)}$, this result ensures an $\varepsilon$-solution in $O(M^2R^2/\varepsilon^2)$ iterations, which provides the optimal complexity for the non-smooth case [3]. The above choice of weights, however, is impractical since it depends on the final iterate $k$ and an upper bound for $\xi(x_0, x^*)$; a more practical choice $\lambda_i := r/\sqrt{i+1}$ for some $r > 0$ only ensures the upper bound

\[ \frac{\xi(x_0, x^*) + (2\sigma)^{-1} r^2 M^2 (1 + \log(k+1))}{2r(\sqrt{k+2} - 1)} = O(\log k / \sqrt{k}) \]

for the right hand side of (3). Note, however, that when the feasible region $Q$ is compact, the weights $\lambda_i := r/\sqrt{i+1}$ ($r > 0$) ensure an $O(1/\sqrt{k})$ convergence rate for the difference $f(\hat{x}_k) - f(x^*)$ by considering $\hat{x}_k$, a weighted average of $x_0, \ldots, x_k$ [7, 8].

The DAM proposed by Nesterov [24] overcomes the dependence of the weights of the MDM on $k$ and even achieves the rate of convergence $O(1/\sqrt{k})$. This method employs non-decreasing positive scaling parameters $\{\beta_k\}_{k \ge -1}$ ($\beta_{k+1} \ge \beta_k > 0$) in addition to the weights $\{\lambda_k\}_{k \ge 0}$. From the initial point $x_0 := \arg\min_{x \in Q} d(x) \in Q$, the DAM is performed as

\[ x_{k+1} := \arg\min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i [ f(x_i) + \langle g_i, x - x_i \rangle ] + \beta_k d(x) \right\}, \quad k = 0, 1, 2, \ldots \tag{4} \]
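To make the MDM update (2) and the estimate (3) concrete, the following is a minimal sketch in the Euclidean special case, where (2) becomes the projected subgradient step. Everything specific here is an assumption chosen for illustration and not taken from the paper: the objective $f(x) = \|x - c\|_1$, the box $Q = [-1,1]^n$ with $c \in Q$ (so that $x^* = c$ and $f(x^*) = 0$), $d(x) = \frac{1}{2}\|x - x_0\|^2$ with $\sigma = 1$, and the helper names `clip`, `f`, `subgrad` and `mirror_descent`.

```python
import math

def clip(v, lo=-1.0, hi=1.0):
    # Orthogonal projection onto the box Q = [lo, hi]^n (assumed feasible set).
    return [min(hi, max(lo, t)) for t in v]

def f(x, c):
    # Illustrative non-smooth objective f(x) = ||x - c||_1.
    return sum(abs(a - b) for a, b in zip(x, c))

def subgrad(x, c):
    # A subgradient of f at x: coordinate-wise sign, 0 where x_i == c_i.
    return [(a > b) - (a < b) for a, b in zip(x, c)]

def mirror_descent(c, x0, r, iters):
    """Euclidean case of (2): x_{k+1} = proj_Q(x_k - lambda_k g_k),
    with the practical weights lambda_i = r / sqrt(i + 1).
    Returns (min_i f(x_i) - f(x*), right hand side of (3))."""
    x = list(x0)
    best = f(x, c)
    sum_lam = 0.0      # sum of lambda_i
    sum_sq = 0.0       # sum of lambda_i^2 * ||g_i||^2
    for i in range(iters):
        g = subgrad(x, c)
        lam = r / math.sqrt(i + 1)
        best = min(best, f(x, c))
        sum_lam += lam
        sum_sq += lam * lam * sum(t * t for t in g)
        x = clip([a - lam * b for a, b in zip(x, g)])
    xi = 0.5 * sum((a - b) ** 2 for a, b in zip(c, x0))  # xi(x_0, x*) with x* = c
    rhs = (xi + 0.5 * sum_sq) / sum_lam                   # sigma = 1
    return best, rhs

best_gap, rhs = mirror_descent([0.3, -0.7], [0.0, 0.0], 1.0, 500)
print(best_gap, rhs)
```

Under these assumptions the first returned value should never exceed the second, which is exactly what (3) states; the slow decay of the second value reflects the $O(\log k/\sqrt{k})$ behaviour of these weights.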

Nesterov proved that the DAM satisfies the following general estimate (set $D = d(x^*)$ in [24, Theorem 1 and (3.2)]):

\[ \forall k \ge 0, \quad \min_{0 \le i \le k} f(x_i) - f(x^*) \le \frac{\sum_{i=0}^{k} \lambda_i f(x_i)}{\sum_{i=0}^{k} \lambda_i} - f(x^*) \le \frac{\beta_k d(x^*) + \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2}{\sum_{i=0}^{k} \lambda_i}. \tag{5} \]

In order to ensure the rate of convergence $O(1/\sqrt{k})$, we do not even need prior knowledge of an upper bound for $\xi(x_0, x^*)$; for instance, choosing $\lambda_k := 1$ and $\beta_k := \gamma \hat{\beta}_k$, where $\gamma > 0$ and

\[ \hat{\beta}_{-1} := \hat{\beta}_0 := 1, \quad \hat{\beta}_{k+1} := \hat{\beta}_k + \frac{1}{\hat{\beta}_k}, \quad k \ge 0, \tag{6} \]

the right hand side of (5) can be bounded by

\[ \left( \gamma d(x^*) + \frac{M^2}{2\sigma\gamma} \right) \frac{\hat{\beta}_k}{k+1}, \]

which is of order $O(1/\sqrt{k})$ because $\hat{\beta}_k = O(\sqrt{k})$, and achieves the optimal complexity if we choose $\gamma := M / \sqrt{2\sigma d(x^*)}$. A key in the analysis of the DAM in [24] is the use of a dual approach, such as the conjugate function of $\beta d(x)$ for $\beta > 0$. In this paper, we prove the same result with simpler arguments (in Section 4) for the DAM and (an extension of) the MDM without employing it.

Nesterov and Shikhman [27] further proposed the double and triple averaging methods in order to obtain convergence results for the sequence $\{x_k\}$. The double averaging method [27, eq. (28)] iterates, starting from $x_0 := \arg\min_{x \in Q} d(x) \in Q$, as follows:

\[ z_k := \arg\min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i [ f(x_i) + \langle g_i, x - x_i \rangle ] + \beta_k d(x) \right\}, \quad x_{k+1} := (1 - \tau_k) x_k + \tau_k z_k, \quad k = 0, 1, 2, \ldots, \tag{7} \]

where $\tau_k := \lambda_{k+1} / \sum_{i=0}^{k+1} \lambda_i$. This method bounds the difference $f(x_k) - f(x^*)$ by the same value as the right hand side of (5) [27, Theorem 3.1] for all $k \ge 0$. Hence, it achieves optimality. The triple averaging, which is a modification of (7), allows further flexibility in the choices of $\{\lambda_k\}$ and $\{\beta_k\}$ [27, Theorem 3.3].

Observe that for all the above mentioned methods, we do not need to evaluate any function value at any iteration, and $x_{k+1}$ is determined uniquely even if $Q$ is unbounded since $d(x)$ is strongly convex [24, Lemma 6].

2.2 Optimal methods for the smooth case

Let us assume now that the function $f(x)$ in (1) is convex, continuously differentiable, and its gradient is Lipschitz continuous on $Q$ with constant $L > 0$:

\[ \| \nabla f(x) - \nabla f(y) \|_* \le L \|x - y\|, \quad \forall x, y \in Q. \]
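Before going further into the smooth case, the scaling sequence (6) from Section 2.1 can be checked numerically. The sketch below is only an illustration (the function name `beta_hat` and the list layout are assumptions). It verifies two elementary facts that follow directly from the recursion: the telescoping identity $\sum_{i=0}^{k} 1/\hat{\beta}_{i-1} = \hat{\beta}_k$, which is what turns the sum in (5) into a multiple of $\hat{\beta}_k$ when $\lambda_i = 1$, and the lower bound $\hat{\beta}_k^2 \ge 2k+1$, which shows the square-root growth.

```python
def beta_hat(n):
    """Return [beta_hat_{-1}, beta_hat_0, ..., beta_hat_n] from recursion (6).
    Position j of the list holds beta_hat_{j-1}."""
    seq = [1.0, 1.0]                       # beta_hat_{-1} = beta_hat_0 = 1
    for _ in range(n):
        seq.append(seq[-1] + 1.0 / seq[-1])
    return seq

seq = beta_hat(50)
k = 50
# Telescoping identity: sum_{i=0}^{k} 1 / beta_hat_{i-1} equals beta_hat_k.
lhs = sum(1.0 / b for b in seq[: k + 1])
print(lhs, seq[k + 1])
# Growth: beta_hat_k^2 >= 2k + 1, so beta_hat_k grows like sqrt(2k).
print(all(seq[j + 1] ** 2 >= 2 * j + 1 for j in range(k + 1)))
```

The identity holds because $1/\hat{\beta}_j = \hat{\beta}_{j+1} - \hat{\beta}_j$ for $j \ge 0$, and the growth bound follows from $\hat{\beta}_{k+1}^2 = \hat{\beta}_k^2 + 2 + 1/\hat{\beta}_k^2$.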
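Many optimal complexity methods were proposed in the literature under these assumptions (see, e.g., Table 1). In particular, we recall the optimal methods proposed by Nesterov [23] and Tseng [28, 29] for a comparison with our results. Given positive weights $\{\lambda_k\}_{k \ge 0}$, both methods depend on the following computation of optimal solutions $\hat{z}_k$ and/or $z_k$ of auxiliary subproblems:

\[
\begin{array}{ll}
\text{(a)} & \hat{z}_k := \arg\min_{x \in Q} \left\{ \lambda_k [ f(x_k) + \langle \nabla f(x_k), x - x_k \rangle ] + \frac{L}{\sigma} \xi(z_{k-1}, x) \right\}, \\
\text{(b)} & z_k := \arg\min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i [ f(x_i) + \langle \nabla f(x_i), x - x_i \rangle ] + \frac{L}{\sigma} d(x) \right\},
\end{array}
\tag{8}
\]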

where $\{x_k\}_{k \ge 0} \subset Q$ is the sequence generated by those methods. Note that the subproblem (a) is closely related to the one of the MDM (2), and the subproblem (b) corresponds to the one of the DAM (4) with $\beta_k = L/\sigma$. Similarly to the non-smooth case, it is not necessary to evaluate the function values at the $x_k$'s, and the minimizers are uniquely defined.

Nesterov's optimal method (see the modified method in [23, Section 5.3]) with a particular choice of the weights $\lambda_k$ is described as follows.

Nesterov's method: Set $\lambda_k := (k+1)/2$ for $k \ge 0$ and $x_0 := z_{-1} := \arg\min_{x \in Q} d(x)$. Compute $\hat{z}_0$ by (a) and set $\hat{x}_0 := z_0 := \hat{z}_0$. For $k \ge 0$, iterate the following procedure:

    Set $x_{k+1} := (1 - \tau_k) \hat{x}_k + \tau_k z_k$, where $\tau_k := \lambda_{k+1} / \sum_{i=0}^{k+1} \lambda_i$,
    Compute $\hat{z}_{k+1}$ by (a),
    Set $\hat{x}_{k+1} := (1 - \tau_k) \hat{x}_k + \tau_k \hat{z}_{k+1}$,
    Compute $z_{k+1}$ by (b).                                                        (9)

In comparison, Tseng's second and third Accelerated Proximal Gradient (APG) methods [29], which are particular cases of Algorithms 1 and 3 in [28], only require the solution of either subproblem (a) or (b), respectively.

Tseng's second APG method: Set $\lambda_0 := 1$, $\lambda_{k+1} := \frac{1 + \sqrt{1 + 4\lambda_k^2}}{2}$ for $k \ge 0$, and $x_0 := z_{-1} := \arg\min_{x \in Q} d(x)$. Compute $\hat{z}_0$ by (a) and set $\hat{x}_0 := \hat{z}_0$. For $k \ge 0$, iterate the following procedure:

    Set $x_{k+1} := (1 - \tau_k) \hat{x}_k + \tau_k \hat{z}_k$, where $\tau_k := \lambda_{k+1} / \sum_{i=0}^{k+1} \lambda_i$,
    Compute $\hat{z}_{k+1}$ by (a) replacing $z_k$ by $\hat{z}_k$,
    Set $\hat{x}_{k+1} := (1 - \tau_k) \hat{x}_k + \tau_k \hat{z}_{k+1}$.                  (10)

Tseng's third APG method: Set $\lambda_0 := 1$, $\lambda_{k+1} := \frac{1 + \sqrt{1 + 4\lambda_k^2}}{2}$ for $k \ge 0$, and $x_0 := z_{-1} := \arg\min_{x \in Q} d(x)$. Compute $z_0$ by (b) and set $\hat{x}_0 := z_0$. For $k \ge 0$, iterate the following procedure:

    Set $x_{k+1} := (1 - \tau_k) \hat{x}_k + \tau_k z_k$, where $\tau_k := \lambda_{k+1} / \sum_{i=0}^{k+1} \lambda_i$,
    Compute $z_{k+1}$ by (b),
    Set $\hat{x}_{k+1} := (1 - \tau_k) \hat{x}_k + \tau_k z_{k+1}$.                        (11)

Remark 1. To see the equivalence to Tseng's second APG method, notice that $x_0$ is not used at all in [29]. Then, defining $d(x) := D(x, z_0) = \eta(x) - \eta(z_0) - \langle \nabla \eta(z_0), x - z_0 \rangle$ for an arbitrary $z_0 \in Q$, we have $\sigma = 1$ in (a). Finally, making the correspondence $z_k \to z_{k-1}$, $y_k \to x_k$, $x_k \to \hat{x}_{k-1}$, and $\theta_k \to \lambda_k^{-1}$ results in our notation.
For Tseng's third APG method, identical observations are valid, except that we define $d(x) := \eta(x) - \eta(z_0)$ instead.

It can be shown that both Nesterov's and Tseng's methods attain the optimal convergence rate; Nesterov's method (9) and Tseng's third APG method (11) satisfy

\[ \forall k \ge 0, \quad f(\hat{x}_k) - f(x^*) \le \frac{4 L d(x^*)}{\sigma (k+1)(k+2)}, \]

while Tseng's second APG method (10) satisfies

\[ \forall k \ge 0, \quad f(\hat{x}_k) - f(x^*) \le \frac{4 L \xi(x_0, x^*)}{\sigma (k+2)^2}. \]

The convergence analyses of these three methods are performed in distinct ways. What we propose in Section 5 is a universal analysis for them using the unifying framework developed in Section 3.
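As an illustration of scheme (10), here is a small sketch of Tseng's second APG method under simplifying assumptions that are not part of the paper: $Q = \mathbb{R}^n$, the Euclidean norm, and $d(x) = \frac{1}{2}\|x - x_0\|^2$ (so $\sigma = 1$ and $\xi(z, x) = \frac{1}{2}\|x - z\|^2$). In that case subproblem (a) has the closed form $\hat{z}_{k+1} = \hat{z}_k - (\lambda_{k+1}/L) \nabla f(x_{k+1})$. The test function (a diagonal quadratic with data `A`, `B`) and all function names are hypothetical.

```python
import math

def tseng_second_apg(grad, L, x0, iters):
    """Sketch of scheme (10) for Q = R^n and d(x) = 0.5 * ||x - x0||^2.
    Returns the list [x_hat_0, x_hat_1, ..., x_hat_iters]."""
    lam = 1.0                                            # lambda_0
    S = lam                                              # S_0 = lambda_0
    g = grad(x0)
    z_hat = [a - (lam / L) * b for a, b in zip(x0, g)]   # z_hat_0 via (a), with z_{-1} = x0
    x_hat = list(z_hat)                                  # x_hat_0 := z_hat_0
    history = [list(x_hat)]
    for _ in range(iters):
        lam = (1.0 + math.sqrt(1.0 + 4.0 * lam * lam)) / 2.0   # lambda_{k+1}
        S += lam                                               # S_{k+1}
        tau = lam / S                                          # tau_k
        x = [(1.0 - tau) * a + tau * b for a, b in zip(x_hat, z_hat)]   # x_{k+1}
        g = grad(x)
        z_next = [a - (lam / L) * b for a, b in zip(z_hat, g)]          # z_hat_{k+1}
        x_hat = [(1.0 - tau) * a + tau * b for a, b in zip(x_hat, z_next)]
        z_hat = z_next
        history.append(list(x_hat))
    return history

# Hypothetical test problem: f(x) = sum_i (0.5 * A_i * x_i^2 - B_i * x_i), L = max(A).
A = [1.0, 10.0, 100.0]
B = [1.0, 1.0, 1.0]

def quad_value(x):
    return sum(0.5 * a * t * t - b * t for a, b, t in zip(A, B, x))

def quad_grad(x):
    return [a * t - b for a, b, t in zip(A, B, x)]

hist = tseng_second_apg(quad_grad, max(A), [0.0, 0.0, 0.0], 200)
print(quad_value(hist[-1]))
```

For this problem $x^* = (B_i/A_i)_i$, so the gap $f(\hat{x}_k) - f(x^*)$ can be compared against $4L\xi(x_0, x^*)/(k+2)^2$ for every $k$, which is the bound stated above for method (10).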

The above gradient-based methods for smooth problems can be generalized to non-smooth convex problems with special structures while preserving the same iteration complexity. The structures of the composite objective function and the inexact oracle model are remarkably important since they have significant applications in machine learning, compressed sensing, image processing, and statistics (see [4, 8, 29] for some examples). These structures will be detailed in Section 5. Nesterov's method (9) was generalized to the composite structure [25] and the inexact oracle model [8]. Tseng's methods were originally proposed for the composite objective function, unifying some existing methods [, 4, 23], while we have only described the particular ones for the smooth case. It is important to note that the inexact oracle model [8] is also applicable to non-smooth problems, yielding an optimal subgradient method; more precisely, it is applicable to weakly smooth convex problems (see Remark 5).

There are several universal (sub)gradient methods [7, 8, 10, 11, 14, 26] which are optimal for both non-smooth and smooth problems (and further generalized ones). In contrast to such universal methods, we will propose different (not universal) (sub)gradient-based methods for non-smooth and smooth problems which include some of the previously mentioned methods. A key contribution of our approach is that it provides a unified methodology for the analysis of optimal subgradient/gradient-based methods for non-smooth/smooth problems.

3 General conditions for the auxiliary functions: The unifying framework

For all methods we reviewed for non-smooth or smooth objective functions, we need to form one or two auxiliary functions $\psi_k(x)$ and solve the corresponding subproblem(s) $\min_{x \in Q} \psi_k(x)$ at each iteration. In this section, we will propose general conditions which these auxiliary functions should satisfy in order to provide a unifying analysis.
In particular, we will see that these auxiliary functions are derived from the extended MD model (13), the DA model (14), or a mixture of them. Based on these results, we will propose a family of methods in a unifying framework for non-smooth functions in Section 4 and for structured convex problems, which include the smooth functions, in Section 5.

We use the following notations for the description and the analysis of our methods. For a point $y \in Q$, denote by $l_f(y; x) : E \to \mathbb{R} \cup \{+\infty\}$ a proper lower semicontinuous convex function with $f(x) \ge l_f(y; x)$, $\forall x \in E$, i.e., a lower approximation of $f(x)$ at $y \in Q$. The explicit description of the function $l_f(y; x)$ will be given in Sections 4 and 5 and will vary according to the property of $f(x)$. For the function $d(x)$, we denote $l_d(y; x) := d(y) + \langle \nabla d(y), x - y \rangle$. Note that $d(x) \ge l_d(y; x)$ and $\xi(y, x) = d(x) - l_d(y; x)$ for any $x, y \in Q$.

We introduce the following two kinds of parameters for our methods.

- The weight parameter $\{\lambda_k\}_{k \ge 0}$. We assume that $\lambda_k > 0$ for all $k \ge 0$.
- The scaling parameter $\{\beta_k\}_{k \ge -1}$. We assume that $\beta_k \ge \beta_{k-1} > 0$ for all $k \ge 0$.

Note that the sequence of scaling parameters $\{\beta_k\}_{k \ge -1}$ is assumed to be non-decreasing throughout this paper. We define $S_k := \sum_{i=0}^{k} \lambda_i$. Moreover, we use $\{\hat{x}_k\}_{k \ge 0} \subset Q$ and $\{x_k\}_{k \ge 0} \subset Q$ for sequences of approximate solutions and test points (for which we compute the (sub)gradients), respectively (recall that $x_0 := \arg\min_{x \in Q} d(x)$). Finally, we consider auxiliary functions $\psi_k(x)$ whose unique minimizers on $Q$ are denoted by $z_k := \arg\min_{x \in Q} \psi_k(x)$. The function $\psi_k(x)$ is assumed to be defined by $\{\lambda_i\}_{i=0}^{k}$, $\{\beta_i\}_{i=-1}^{k}$, $\{x_i\}_{i=0}^{k}$ and $\{z_i\}_{i=-1}^{k-1}$ for each $k \ge 0$. We also consider $\psi_{-1}(x)$ (and $z_{-1} := \arg\min_{x \in Q} \psi_{-1}(x)$) for convenience.

The following property will be the fundamental one for the construction of auxiliary functions $\{\psi_k(x)\}_{k \ge -1}$ in our unifying framework.

Property 2. Let $\{\lambda_k\}_{k \ge 0}$ be a sequence of weight parameters, $\{\beta_k\}_{k \ge -1}$ be a sequence of scaling parameters, and $\{x_k\}_{k \ge 0}$ be a sequence of test points. Let $\psi_k(x)$ be auxiliary functions which are determined by $\{\lambda_i\}_{i=0}^{k}$, $\{\beta_i\}_{i=-1}^{k}$, $\{x_i\}_{i=0}^{k}$, and $\{z_i\}_{i=-1}^{k-1}$, where $z_i := \arg\min_{x \in Q} \psi_i(x)$, for each $k \ge -1$. Then the following conditions hold:

(i) $\min_{x \in Q} \psi_{-1}(x) = 0$ and $z_{-1} = x_0$.

(ii) The following inequality holds for every $k \ge -1$:
\[ \forall x \in Q, \quad \psi_{k+1}(x) \ge \min_{z \in Q} \psi_k(z) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x). \]

(iii) The following inequality holds for every $k \ge 0$:
\[ \min_{x \in Q} \psi_k(x) \le \min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k l_d(z_k; x) \right\}. \]

For the construction of an auxiliary function, the following lemma [28, Property 2] is useful.

Lemma 3. Let $h : E \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous convex function with $Q \subseteq \mathrm{dom}\, h$ and let $\beta$ be a positive number. Denote $\psi(x) = h(x) + \beta d(x)$. Then the minimization problem $\min_{x \in Q} \psi(x)$ has a unique solution $z^* \in Q$ and it satisfies

\[ \psi(x) \ge \psi(z^*) + \beta \xi(z^*, x), \quad \forall x \in Q. \]

We now propose a family of auxiliary functions which satisfy Property 2. The construction, which we refer to as (12), consists of the following two steps.

(0) Define $\psi_{-1}(x) := \beta_{-1} d(x)$.
(1) For each $k \ge -1$, define $\psi_{k+1}(x)$ by either the extended Mirror-Descent (MD) model (13) or the Dual-Averaging (DA) model (14).                                        (12)

Extended MD model:
\[ \psi_{k+1}(x) := \min_{z \in Q} \psi_k(z) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x). \tag{13} \]

DA model:
\[ \psi_{k+1}(x) := \psi_k(x) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x). \tag{14} \]

In both cases, $\psi_{k+1}(x)$ is proper lower semicontinuous and strongly convex on $Q$. The following result plays a crucial role in the development of our methods.

Proposition 4. Any sequence of auxiliary functions $\{\psi_k(x)\}$ constructed by (12) satisfies Property 2.

Proof. Since $\min_{x \in Q} d(x) = d(x_0) = 0$, $\psi_{-1}(x) = \beta_{-1} d(x)$ satisfies condition (i). If we construct $\psi_{k+1}(x)$ by (13), then it is clear that the condition (ii) holds. Let us consider the case (14).
Notice that for the construction (12), we can show by induction that the functions $h_k(x) := \psi_k(x) - \beta_k d(x)$ are always proper lower semicontinuous and convex. Thus Lemma 3 implies that $\psi_k(x) \ge \min_{z \in Q} \psi_k(z) + \beta_k \xi(z_k, x)$ for every $x \in Q$. Therefore, we obtain

\[
\begin{aligned}
\psi_{k+1}(x) &= \psi_k(x) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x) \\
&\ge \left[ \min_{z \in Q} \psi_k(z) + \beta_k \xi(z_k, x) \right] + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x) \\
&= \min_{z \in Q} \psi_k(z) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x)
\end{aligned}
\]

for all $x \in Q$, where the last equality uses $\xi(z_k, x) = d(x) - l_d(z_k; x)$.

Let us finally prove the condition (iii) by induction. We actually show that it is also valid for $k = -1$. The case $k = -1$ is due to the optimality condition for $z_{-1} = \arg\min_{x \in Q} \psi_{-1}(x) = \arg\min_{x \in Q} \beta_{-1} d(x)$, that is, $\min_{x \in Q} \beta_{-1} d(x) = \min_{x \in Q} \beta_{-1} l_d(z_{-1}; x)$ holds. Suppose that the condition (iii) holds for some $k \ge -1$. Consider the auxiliary function $\psi_{k+p}(x)$ for a positive integer $p$ defined as follows. Define $\psi_{k+1}(x)$ by (13) and define $\psi_{k+i+1}(x)$ by (14) for $i = 1, \ldots, p-1$. Then

\[ \psi_{k+p}(x) = \psi_k(z_k) + \sum_{i=k+1}^{k+p} \lambda_i l_f(x_i; x) + \beta_{k+p} d(x) - \beta_k l_d(z_k; x), \tag{15} \]

and using Lemma 3 we have

\[
\begin{aligned}
\min_{z \in Q} \psi_{k+p}(z) &\le \psi_{k+p}(x) - \beta_{k+p} \xi(z_{k+p}, x) \\
&= \left[ \psi_k(z_k) + \sum_{i=k+1}^{k+p} \lambda_i l_f(x_i; x) + \beta_{k+p} d(x) - \beta_k l_d(z_k; x) \right] - \beta_{k+p} \xi(z_{k+p}, x) \\
&\overset{\text{(iii)}}{\le} \left[ \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k l_d(z_k; x) \right] + \sum_{i=k+1}^{k+p} \lambda_i l_f(x_i; x) + \beta_{k+p} d(x) - \beta_k l_d(z_k; x) - \beta_{k+p} \xi(z_{k+p}, x) \\
&= \sum_{i=0}^{k+p} \lambda_i l_f(x_i; x) + \beta_{k+p} l_d(z_{k+p}; x)
\end{aligned}
\]

for all $x \in Q$. This proves the condition (iii) for $\psi_{k+p}(x)$. It is, therefore, enough to prove the condition (iii) in the case when the auxiliary function $\psi_k(x)$ is defined only by (14) updates. We have $\psi_k(x) = \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k d(x)$ in this case, and again Lemma 3 implies that

\[ \min_{z \in Q} \psi_k(z) \le \psi_k(x) - \beta_k \xi(z_k, x) = \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k l_d(z_k; x) \]

for every $x \in Q$.

Remark 5. Proposition 4 proves that the following auxiliary functions satisfy Property 2 for appropriate choices of the $l_f(x_i; x)$'s.
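Constructing $\{\psi_k(x)\}$ by (12) with only extended MD model updates (13) yields

\[
\begin{aligned}
\psi_k(x) &= \min_{z \in Q} \psi_{k-1}(z) + \lambda_k l_f(x_k; x) + \beta_k d(x) - \beta_{k-1} l_d(z_{k-1}; x), \\
z_k &= \arg\min_{x \in Q} \left\{ \lambda_k l_f(x_k; x) + \beta_k d(x) - \beta_{k-1} l_d(z_{k-1}; x) \right\},
\end{aligned}
\tag{16}
\]

which coincides with the MDM (2) for $\beta_k = 1$ and $x_k = z_{k-1}$, and with the subproblems of Tseng's second APG method (10) for $\beta_k = L/\sigma$.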

Constructing $\{\psi_k(x)\}$ by (12) with only DA model updates (14) yields

\[
\begin{aligned}
\psi_k(x) &= \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k d(x), \\
z_k &= \arg\min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k d(x) \right\},
\end{aligned}
\tag{17}
\]

which coincides with the DAM (4) and with the subproblems of Tseng's third APG method (11) for $\beta_k = L/\sigma$.

Notice that the pure extended MD model updates (16) consider only the previous $l_f(x_k; x)$, while the DA model updates (17) accumulate all the $l_f(x_i; x)$'s. Moreover, Proposition 4 shows that Property 2 is satisfied even if we mix the strategies (16) and (17), which corresponds to selecting some of the previous $l_f(x_i; x)$'s to define the subproblem, as shown in (15). Note that, for a fixed $\psi_k(x)$, the construction (13) of $\psi_{k+1}(x)$ is the minimalist choice which satisfies Property 2; according to (ii), any auxiliary function $\psi_{k+1}(x)$ majorizes the one defined by (13) on the set $Q$.

To conclude this section, we define the following relation based on Nesterov's approach [23] (see also [27]); we propose (sub)gradient-based methods which generate approximate solutions $\{\hat{x}_k\} \subset Q$ satisfying the following relation for every $k \ge 0$:

\[ (R_k) \qquad f(\hat{x}_k) \le \frac{1}{S_k} \min_{x \in Q} \psi_k(x) + C_k, \tag{18} \]

where $C_k$ is defined according to the problem structure. This relation yields the following lemma, which provides a convergence rate for all methods.

Lemma 6. Let $\{\psi_k(x)\}$ be a sequence of auxiliary functions satisfying Property 2 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. If a sequence $\{\hat{x}_k\} \subset Q$ satisfies the relation $(R_k)$ for some $k \ge 0$, then we have

\[ f(\hat{x}_k) - f(x^*) \le \frac{\beta_k l_d(z_k; x^*)}{S_k} + C_k, \]

where $z_k := \arg\min_{x \in Q} \psi_k(x)$.

Proof. Since $\frac{1}{S_k} \sum_{i=0}^{k} \lambda_i l_f(x_i; x) \le f(x)$ for all $x \in Q$, using the condition (iii) of Property 2 yields

\[ \frac{1}{S_k} \min_{x \in Q} \psi_k(x) \le \frac{1}{S_k} \min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k l_d(z_k; x) \right\} \le \min_{x \in Q} \left\{ f(x) + \frac{\beta_k}{S_k} l_d(z_k; x) \right\} \le f(x^*) + \frac{\beta_k}{S_k} l_d(z_k; x^*). \]

Therefore, the relation $(R_k)$ implies

\[ f(\hat{x}_k) \le \frac{1}{S_k} \min_{x \in Q} \psi_k(x) + C_k \le f(x^*) + \frac{\beta_k}{S_k} l_d(z_k; x^*) + C_k. \]

4 A family of subgradient-based methods in the unifying framework

4.1 General subgradient-based methods in the unifying framework

In this section, we propose novel subgradient-based methods for solving problem (1) with a non-smooth objective function. Throughout this section, we assume that subgradients of the objective function $f$, $g(y) \in \partial f(y)$, are computable at any point $y \in Q$, and that a lower approximation $l_f(y; \cdot)$ at the same point is defined by

\[ l_f(y; x) := f(y) + \langle g(y), x - y \rangle, \quad x \in Q. \]

For a test point $x_k \in Q$, we denote $g_k = g(x_k) \in \partial f(x_k)$. In this case, the subproblems $z_k = \arg\min_{x \in Q} \psi_k(x)$ constructed from (12) are of the form

\[ \min_{x \in Q} \{ \langle s, x \rangle + \beta d(x) \} \tag{19} \]

for some $s \in E^*$ and $\beta > 0$. We use the following lemma for our analysis.

Lemma 7. Let $\{x_k\}_{k \ge 0} \subset Q$ and $g_k \in \partial f(x_k)$, $k \ge 0$. Then, for $\lambda \in \mathbb{R}$, $\beta > 0$ and $x, z \in Q$, we have

\[ \langle \lambda g_k, x - z \rangle + \beta \xi(z, x) + \frac{1}{2\sigma\beta} \| \lambda g_k \|_*^2 \ge 0, \quad \forall k \ge 0, \]

and, in particular,

\[ \lambda l_f(x_k; x) + \beta \xi(x_k, x) + \frac{\lambda^2}{2\sigma\beta} \| g_k \|_*^2 \ge \lambda f(x_k), \quad \forall k \ge 0. \]

Proof. Since for every $x \in E$ and $s \in E^*$ the inequality $\frac{1}{2} \|x\|^2 + \frac{1}{2} \|s\|_*^2 \ge \langle s, x \rangle$ holds (and likewise with $s$ replaced by $-s$), we have

\[ \langle \lambda g_k, x - z \rangle + \beta \xi(z, x) + \frac{1}{2\sigma\beta} \| \lambda g_k \|_*^2 \ge \langle \lambda g_k, x - z \rangle + \frac{\sigma\beta}{2} \|x - z\|^2 + \frac{1}{2\sigma\beta} \| \lambda g_k \|_*^2 \ge 0. \]

Substituting $z = x_k$ in this inequality and adding $\lambda f(x_k)$ to both sides, we obtain the second assertion.

Let us consider the relation $(R_k)$ defined in the previous section with

\[ C_k = \frac{1}{2\sigma S_k} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \| g_i \|_*^2. \tag{20} \]

We also use the following alternative relation:

\[ (\hat{R}_k) \qquad \frac{1}{S_k} \sum_{i=0}^{k} \lambda_i f(x_i) \le \frac{1}{S_k} \min_{x \in Q} \psi_k(x) + C_k. \tag{21} \]

Note that the relation $(\hat{R}_k)$ provides an alternative to Lemma 6 which can be proven in the same way: if $\{\psi_k(x)\}$ admits Property 2 and the relation $(\hat{R}_k)$ is satisfied for some $k \ge 0$, then we have

\[ \frac{1}{S_k} \sum_{i=0}^{k} \lambda_i f(x_i) - f(x^*) \le \frac{\beta_k l_d(z_k; x^*)}{S_k} + C_k. \tag{22} \]

Now, let us show the following key result, which will provide efficient subgradient-based methods in a straightforward way.

Theorem 8. Let $\{\psi_k(x)\}$ be a sequence of auxiliary functions satisfying Property 2 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Denote $z_k = \arg\min_{x \in Q} \psi_k(x)$ and define $C_k$ by (20). Then the following assertions hold.

(a) The relations $(R_0)$ and $(\hat{R}_0)$ are satisfied by setting $\hat{x}_0 := x_0$.

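(b) Suppose that the relation $(R_k)$ is satisfied for some integer $k \ge 0$. If the relation $x_{k+1} = z_k$ holds, then the relation $(R_{k+1})$ is satisfied by setting $\hat x_{k+1} := (S_k \hat x_k + \lambda_{k+1} x_{k+1})/S_{k+1}$. Moreover, if the relation $(\hat R_k)$ is satisfied for some $k \ge 0$ and $x_{k+1} = z_k$ holds, then $(\hat R_{k+1})$ is satisfied.

(b′) Suppose that the relation $(R_k)$ is satisfied for some integer $k \ge 0$. If the relation $x_{k+1} = (S_k \hat x_k + \lambda_{k+1} z_k)/S_{k+1}$ holds, then the relation $(R_{k+1})$ is satisfied by setting $\hat x_{k+1} := x_{k+1}$.

Proof. We remark that using condition (ii) of Property 2 we obtain the inequality
\[ \forall k \ge -1, \quad \min_{x \in Q} \psi_{k+1}(x) \ge \min_{x \in Q} \psi_k(x) + \lambda_{k+1} l_f(x_{k+1}; z_{k+1}) + \beta_k \xi(z_k, z_{k+1}) \]
by setting $x = z_{k+1} = \arg\min_{x \in Q} \psi_{k+1}(x)$ (recall that $d(x) \ge 0$ for $x \in Q$ and $\beta_{k+1} \ge \beta_k$).

(a) Letting $k = -1$ in the condition (ii) and using the condition (i) of Property 2, we have
\begin{align*}
\min_{x \in Q} \psi_0(x) + \frac{\lambda_0^2}{2\sigma\beta_{-1}}\|g_0\|_*^2
&\ge \Big[ \min_{x \in Q} \psi_{-1}(x) + \lambda_0 l_f(x_0;z_0) + \beta_{-1}\xi(z_{-1},z_0) \Big] + \frac{\lambda_0^2}{2\sigma\beta_{-1}}\|g_0\|_*^2 \\
&= \lambda_0 l_f(x_0;z_0) + \beta_{-1}\xi(z_{-1},z_0) + \frac{\lambda_0^2}{2\sigma\beta_{-1}}\|g_0\|_*^2 \\
&= \lambda_0 l_f(x_0;z_0) + \beta_{-1}\xi(x_0,z_0) + \frac{\lambda_0^2}{2\sigma\beta_{-1}}\|g_0\|_*^2 \\
&\ge \lambda_0 f(x_0) = S_0 f(\hat x_0),
\end{align*}
where the last inequality is due to Lemma 7.

(b) By the condition (ii) of Property 2 and the assumptions for $x_{k+1}$ and $\hat x_{k+1}$, we obtain that
\begin{align*}
\min_{x \in Q} \psi_{k+1}(x) + \frac{1}{2\sigma}\sum_{i=0}^{k+1}\frac{\lambda_i^2}{\beta_{i-1}}\|g_i\|_*^2
&\ge \Big[ \min_{x \in Q} \psi_k(x) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1}) \Big] + \Big[ \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 + \frac{1}{2\sigma}\sum_{i=0}^{k}\frac{\lambda_i^2}{\beta_{i-1}}\|g_i\|_*^2 \Big] \\
&= \Big[ \min_{x \in Q} \psi_k(x) + \frac{1}{2\sigma}\sum_{i=0}^{k}\frac{\lambda_i^2}{\beta_{i-1}}\|g_i\|_*^2 \Big] + \Big[ \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(x_{k+1},z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 \Big] \\
&\ge \Big[ \min_{x \in Q} \psi_k(x) + \frac{1}{2\sigma}\sum_{i=0}^{k}\frac{\lambda_i^2}{\beta_{i-1}}\|g_i\|_*^2 \Big] + \lambda_{k+1} f(x_{k+1}) \\
&\ge S_k f(\hat x_k) + \lambda_{k+1} f(x_{k+1}) \\
&\ge S_{k+1} f\Big( \frac{S_k \hat x_k + \lambda_{k+1} x_{k+1}}{S_{k+1}} \Big) = S_{k+1} f(\hat x_{k+1}),
\end{align*}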

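where we used Lemma 7, the relation $(R_k)$, and the convexity of $f$ in the last three inequalities, respectively. This implies that the relation $(R_{k+1})$ holds. Moreover, replacing the use of $(R_k)$ by $(\hat R_k)$ in the above inequality, we obtain the relation $(\hat R_{k+1})$, which proves the latter assertion.

(b′) Denote $x'_{k+1} = (S_k \hat x_k + \lambda_{k+1} z_{k+1})/S_{k+1}$. Then the relation $x_{k+1} = (S_k \hat x_k + \lambda_{k+1} z_k)/S_{k+1}$ yields $z_{k+1} - z_k = \frac{S_{k+1}}{\lambda_{k+1}}(x'_{k+1} - x_{k+1})$. Thus the condition (ii) of Property 2 and the relation $(R_k)$ imply that
\begin{align*}
\min_{x \in Q} \psi_{k+1}(x) + \frac{1}{2\sigma}\sum_{i=0}^{k+1}\frac{\lambda_i^2}{\beta_{i-1}}\|g_i\|_*^2
&\ge \min_{x \in Q} \psi_k(x) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 + \frac{1}{2\sigma}\sum_{i=0}^{k}\frac{\lambda_i^2}{\beta_{i-1}}\|g_i\|_*^2 \\
&\ge S_k f(\hat x_k) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 \\
&\ge S_k l_f(x_{k+1};\hat x_k) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 \\
&= S_{k+1} l_f\Big( x_{k+1}; \frac{S_k \hat x_k + \lambda_{k+1} z_{k+1}}{S_{k+1}} \Big) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 \\
&= S_{k+1} l_f(x_{k+1};x'_{k+1}) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 \\
&= S_{k+1} f(x_{k+1}) + S_{k+1}\langle g_{k+1}, x'_{k+1} - x_{k+1}\rangle + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 \\
&= S_{k+1} f(x_{k+1}) + \langle \lambda_{k+1} g_{k+1}, z_{k+1} - z_k\rangle + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k}\|g_{k+1}\|_*^2 \\
&\ge S_{k+1} f(x_{k+1}) = S_{k+1} f(\hat x_{k+1}),
\end{align*}
where the last inequality is due to Lemma 7.

Now we are ready to propose the following two novel subgradient-based methods for the nonsmooth case.

Method 9 (General subgradient-based methods). Choose weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$. Generate sequences $\{(z_{k-1}, x_k, g_k, \hat x_k)\}_{k \ge 0}$ by

(a)
\[ x_k := z_{k-1} := \arg\min_{x \in Q} \psi_{k-1}(x), \quad \hat x_k := \frac{1}{S_k}\sum_{i=0}^{k} \lambda_i x_i, \quad g_k \in \partial f(x_k), \quad \text{for } k \ge 0 \tag{23} \]
or by

(b)
\[ z_{k-1} := \arg\min_{x \in Q} \psi_{k-1}(x), \quad \hat x_k := x_k := \frac{1}{S_k}\sum_{i=0}^{k} \lambda_i z_{i-1}, \quad g_k \in \partial f(x_k), \quad \text{for } k \ge 0 \tag{24} \]
where $\{\psi_k(x)\}_{k \ge -1}$ is defined using the construction (2) or any other construction which admits Property 2. Notice that the sequences $\{z_k\}_{k \ge -1}$ and $\{x_k\}_{k \ge 0}$ are dummy ones for the methods (a) and (b), respectively, but we kept them to preserve the notation.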
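The two procedures of Method 9 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: it takes the Euclidean setting $E = \mathbb{R}^2$, $Q = E$, $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ (so $\sigma = 1$), and a dual-averaging-type auxiliary function $\psi_k(x) = \sum_{i \le k} \lambda_i l_f(x_i;x) + \beta_k d(x)$, whose minimizer is $z_k = x_0 - s_k/\beta_k$ with $s_k = \sum_{i \le k} \lambda_i g_i$. The test function, the choices $\lambda_k = 1$ and $\beta_k = \sqrt{k+2}$, and all function names are hypothetical.

```python
import math

X0 = [0.0, 0.0]        # prox-center x_0 = argmin d (assumption of this sketch)
X_STAR = [1.0, -2.0]   # minimizer of the toy objective, optimal value 0


def f(x):
    # Nonsmooth convex test function.
    return abs(x[0] - X_STAR[0]) + abs(x[1] - X_STAR[1])


def subgrad(x):
    # One subgradient of f at x (0 is a valid component at a kink).
    return [0.0 if xi == ci else math.copysign(1.0, xi - ci)
            for xi, ci in zip(x, X_STAR)]


def lam(k):
    return 1.0                      # weight parameters lambda_k


def beta(k):
    return math.sqrt(k + 2.0)       # nondecreasing, beta_{-1} = 1


def method9(procedure, n_iter):
    """Run procedure 'a' (23) or 'b' (24) for k = 0..n_iter.

    Returns (x_hat_k, z_k, S_k, C_k), with C_k as in (20) for sigma = 1.
    """
    s = [0.0, 0.0]      # s_k = sum_i lambda_i g_i
    acc = [0.0, 0.0]    # weighted sum defining x_hat_k
    S = C = 0.0
    z = X0[:]           # holds z_{k-1} at the start of iteration k
    x_hat = X0[:]
    for k in range(n_iter + 1):
        lk = lam(k)
        S += lk
        if procedure == "a":
            x = z[:]                                    # x_k := z_{k-1}
            acc = [a + lk * xi for a, xi in zip(acc, x)]
            x_hat = [a / S for a in acc]                # average of test points
        else:
            acc = [a + lk * zi for a, zi in zip(acc, z)]
            x_hat = [a / S for a in acc]                # average of the z_{i-1}
            x = x_hat[:]                                # x_k := x_hat_k
        g = subgrad(x)
        s = [si + lk * gi for si, gi in zip(s, g)]
        C += lk * lk * sum(gi * gi for gi in g) / (2.0 * beta(k - 1))
        z = [x0i - si / beta(k) for x0i, si in zip(X0, s)]  # z_k = argmin psi_k
    return x_hat, z, S, C


def gap_and_bound(procedure, n_iter):
    """Objective gap f(x_hat_k) - f(x*) and (beta_k l_d(z_k; x*) + C_k) / S_k."""
    x_hat, z, S, C = method9(procedure, n_iter)
    d_z = 0.5 * sum((zi - x0i) ** 2 for zi, x0i in zip(z, X0))
    # l_d(z; x*) = d(z) + <grad d(z), x* - z>, with grad d(z) = z - x_0
    l_d = d_z + sum((zi - x0i) * (xs - zi) for zi, x0i, xs in zip(z, X0, X_STAR))
    return f(x_hat), (beta(n_iter) * l_d + C) / S
```

Calling `gap_and_bound("a", 200)` or `gap_and_bound("b", 200)` returns the objective gap at $\hat x_k$ together with the quantity $(\beta_k l_d(z_k;x^*) + C_k)/S_k$ that the relation $(R_k)$ bounds it by; in this toy setting the two procedures differ only in whether the subgradient is evaluated at $z_{k-1}$ or at the running average.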

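4.2 Convergence analysis of general subgradient-based methods

Corollary 10. Given the weight parameters $\{\lambda_k\}_{k \ge 0}$, the scaling parameters $\{\beta_k\}_{k \ge -1}$, and any sequence $\{(z_{k-1}, x_k, g_k, \hat x_k)\}_{k \ge 0}$ generated by

(a) the first procedure (23) in Method 9, we have
\[ f(\hat x_k) - f(x^*) \le \frac{1}{S_k}\sum_{i=0}^{k} \lambda_i f(x_i) - f(x^*) \le \frac{1}{S_k}\Big( \beta_k l_d(z_k;x^*) + \frac{1}{2\sigma}\sum_{i=0}^{k}\frac{\lambda_i^2}{\beta_{i-1}}\|g_i\|_*^2 \Big) \tag{25} \]
for all $k \ge 0$; or

(b) the second procedure (24) in Method 9, we have
\[ f(\hat x_k) - f(x^*) \le \frac{1}{S_k}\Big( \beta_k l_d(z_k;x^*) + \frac{1}{2\sigma}\sum_{i=0}^{k}\frac{\lambda_i^2}{\beta_{i-1}}\|g_i\|_*^2 \Big) \tag{26} \]
for all $k \ge 0$.

Proof. The first inequality in (25) follows from the convexity of $f(x)$. Proposition 4 and Theorem 8 show that the sequences generated by the procedures (23) and (24) satisfy the relation $(R_k)$; furthermore, the former construction (23) also satisfies $(\hat R_k)$. Thus, Lemma 6 and the alternative (22) of Lemma 6 for $(\hat R_k)$ prove the assertion.

In [24], Nesterov proposed the use of the auxiliary sequence (6) to ensure an efficient convergence of the DAM (4). This sequence also satisfies the identity
\[ \forall k \ge 0, \quad \hat\beta_k = \sum_{i=-1}^{k-1} \frac{1}{\hat\beta_i} \tag{27} \]
and the inequality
\[ \sqrt{2k+1} \le \hat\beta_k \le \frac{1}{1+\sqrt{3}} + \sqrt{2k+1} \quad (k \ge 0). \tag{28} \]

Corollary 11 (see also [24]). Consider the following two choices for the parameters.

(Simple Averages) Let $\{(z_{k-1}, x_k, g_k, \hat x_k)\}_{k \ge 0}$ be generated by Method 9 with parameters $\lambda_k := 1$ and $\beta_k := \gamma\hat\beta_k$ for some $\gamma > 0$. Then we have
\[ \forall k \ge 0, \quad f(\hat x_k) - f(x^*) \le \Big( \gamma l_d(z_k;x^*) + \frac{M_k^2}{2\sigma\gamma} \Big) \frac{0.5 + \sqrt{2k+1}}{k+1} \tag{29} \]
and
\[ \forall k \ge -1, \quad z_k, x_{k+1}, \hat x_{k+1} \in \Big\{ x \in Q : \|x - x^*\|^2 \le \frac{2 d(x^*)}{\sigma} + \frac{M_k^2}{\sigma^2\gamma^2} \Big\} \tag{30} \]
where $M_{-1} = 0$ and $M_k = \max_{0 \le i \le k} \|g_i\|_*$ for $k \ge 0$.

(Weighted Averages) Let $\{(z_{k-1}, x_k, g_k, \hat x_k)\}_{k \ge 0}$ be generated by Method 9 with parameters $\lambda_k := 1/\|g_k\|_*$ and $\beta_k := \hat\beta_k/(\rho\sqrt{\sigma})$ for some $\rho > 0$. Then we have
\[ \forall k \ge 0, \quad f(\hat x_k) - f(x^*) \le \frac{M_k}{\sqrt{\sigma}} \Big( \frac{l_d(z_k;x^*)}{\rho} + \frac{\rho}{2} \Big) \frac{0.5 + \sqrt{2k+1}}{k+1} \tag{31} \]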

and
\[ \forall k \ge -1, \quad z_k, x_{k+1}, \hat x_{k+1} \in \Big\{ x \in Q : \|x - x^*\|^2 \le \frac{2 d(x^*) + \rho^2}{\sigma} \Big\}. \tag{32} \]
Moreover, for both simple and weighted averages, the left-hand side $f(\hat x_k) - f(x^*)$ above can be replaced by its upper bound $\frac{1}{S_k}\sum_{i=0}^{k}\lambda_i f(x_i) - f(x^*)$ when we use the first procedure (23) in Method 9. In this case, the left-hand side of the inequality can be replaced by $\min\{ f(\hat x_k) - f(x^*), \min_{0 \le i \le k} f(x_i) - f(x^*) \}$.

Proof. Substituting the specified $\lambda_k$ and $\beta_k$ into the estimates in Corollary 10 and using the properties (27) and (28) of $\hat\beta_k$, we obtain (29) and (31), respectively. Denote by $B_k$ the ball on the right-hand side of (30) for $k \ge -1$. Then $B_k \subset B_{k+1}$ for each $k \ge -1$. The inequality (29) implies that $\gamma l_d(z_k;x^*) + (2\sigma\gamma)^{-1} M_k^2 \ge 0$ for all $k \ge 0$. Combining this with the strong convexity inequality $d(x^*) \ge l_d(z_k;x^*) + \frac{\sigma}{2}\|x^* - z_k\|^2$ shows that $z_k \in B_k$ for each $k \ge 0$. We also have $z_{-1} \in B_{-1}$, since $z_{-1} = x_0 = \arg\min_{x \in Q} d(x)$, $d(z_{-1}) = d(x_0) = 0$, and $d(x^*) \ge l_d(z_{-1};x^*) + \frac{\sigma}{2}\|z_{-1} - x^*\|^2 \ge \frac{\sigma}{2}\|z_{-1} - x^*\|^2$. Finally, we conclude that $x_{k+1}, \hat x_{k+1} \in B_k$ for all $k \ge -1$ because they are convex combinations of $\{z_i\}_{i=-1}^{k}$. The proof of (32) is similar.

Remark 12. Notice that in our approach, the bounds in (29) and (31) are slightly smaller than the ones in (3.3) and (3.5) in [24], respectively, since $l_d(z_k;x^*) \le d(x^*) \le D$. However, essentially, Nesterov's original argument also arrives at the same bound when $d(x)$ is continuously differentiable on $Q$ (note that the argument in [24] does not impose the differentiability of $d(x)$). In fact, in [24], Theorems 2 and 3 rely on the estimate (2.5), which is implied by (2.8). Notice in (2.8) that we have
\[ V_{\beta_{k+1}}(-s_{k+1}) = -\min_{x \in Q}\{ \langle s_{k+1}, x - x_0\rangle + \beta_{k+1} d(x) \} = -\min_{x \in Q}\{ \langle s_{k+1}, x - x_0\rangle + \beta_{k+1} l_d(x_{k+1};x) \} \]
by the optimality of $x_{k+1} = \pi_{\beta_{k+1}}(-s_{k+1})$. Then adding $\sum_{i=0}^{k}\lambda_i[f(x_i) + \langle g_i, x_0 - x_i\rangle]$ and using $s_{k+1} = \sum_{i=0}^{k}\lambda_i g_i$ in (2.8) yields
\[ \sum_{i=0}^{k}\lambda_i f(x_i) \le \min_{x \in Q}\Big\{ \sum_{i=0}^{k}\lambda_i [f(x_i) + \langle g_i, x - x_i\rangle] + \beta_{k+1} l_d(x_{k+1};x) \Big\} + \frac{1}{2\sigma}\sum_{i=0}^{k}\frac{\lambda_i^2}{\beta_i}\|g_i\|_*^2 \]
which corresponds to the relation $(\hat R_k)$.$^2$ Thus we obtain the same bound as in our analysis for the DA model.
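The properties of the auxiliary sequence $\hat\beta_k$ behind the rates of Corollary 11 are easy to check numerically. The sketch below is only an illustration: the recursion $\hat\beta_{-1} = \hat\beta_0 = 1$, $\hat\beta_{k+1} = \hat\beta_k + 1/\hat\beta_k$ is an assumption here (it is Nesterov's sequence from [24] written with the index shifted by one, since the definition (6) is not reproduced in this section), and the two-sided estimate is tested in the form known from [24], $\sqrt{2k+1} \le \hat\beta_k \le \frac{1}{1+\sqrt{3}} + \sqrt{2k+1}$. The function names are hypothetical.

```python
import math


def beta_hat(n):
    """Return [beta_hat_{-1}, beta_hat_0, ..., beta_hat_n]; entry k + 1 holds beta_hat_k."""
    seq = [1.0, 1.0]                       # assumed start: beta_hat_{-1} = beta_hat_0 = 1
    for _ in range(n):
        seq.append(seq[-1] + 1.0 / seq[-1])
    return seq


def check_sequence(n, tol=1e-9):
    """Check the identity (27) and the assumed two-sided estimate for k = 0..n."""
    b = beta_hat(n)
    shift = 1.0 / (1.0 + math.sqrt(3.0))
    partial = 0.0                          # sum of 1 / beta_hat_i for i = -1..k-1
    for k in range(n + 1):
        partial += 1.0 / b[k]              # b[k] holds beta_hat_{k-1}
        root = math.sqrt(2.0 * k + 1.0)
        identity_ok = math.isclose(b[k + 1], partial, rel_tol=0.0, abs_tol=tol)
        bounds_ok = root - tol <= b[k + 1] <= shift + root + tol
        if not (identity_ok and bounds_ok):
            return False
    return True


def rate_factor(k):
    """beta_hat_k / (k + 1), the factor that drives the simple-averages estimate."""
    return beta_hat(k)[k + 1] / (k + 1.0)
```

Under these assumptions `rate_factor(k)` decays like $\sqrt{2/k}$, which is where the $O(1/\sqrt{k})$ behaviour of the simple and weighted averages comes from.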
A consequence of Corollary 11 is that if $M := \sup\{ \|g\|_* : g \in \partial f(x),\ x \in Q \}$ is finite, Method 9 generates a sequence $\{\hat x_k\}$ such that $f(\hat x_k) \to f(x^*)$ with a rate $O(1/\sqrt{k})$ in the number $k$ of iterations. In particular, the estimates (29) and (31) achieve the optimal complexity for the nonsmooth case when we choose $\gamma := M/\sqrt{2\sigma d(x^*)}$ and $\rho := \sqrt{2 d(x^*)}$, respectively. Also, Method 9 with the parameters suggested in Corollary 11 produces bounded sequences $\{x_k\}$, $\{\hat x_k\}$, and $\{z_k\}$ (even if $M = +\infty$ for the Weighted Averages case). These features are similar to the DAM. We can obtain the optimal convergence rate if we know an upper bound for $d(x^*)$, but without assuming the compactness of $Q$ or fixing the required number of iterations.

$^2$ Notice that $x_{k+1}$ and $\beta_{k+1}$ in [24] are called $z_k$ and $\beta_k$ here, respectively.

4.3 Particular cases: The extended MD and the DA models

Restricting to the extended MD model (3) in Method 9, the first procedure (23) provides the following extension of the MDM.

Method 13 (Extended Mirror-Descent). Set $x_0 := \arg\min_{x \in Q} d(x)$. Choose weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$. Generate sequences $\{(x_k, g_k, \hat x_k)\}_{k \ge 0}$ by
\[ g_k \in \partial f(x_k), \quad x_{k+1} := \arg\min_{x \in Q}\big\{ \lambda_k [f(x_k) + \langle g_k, x - x_k\rangle] + \beta_k d(x) - \beta_{k-1} l_d(x_k;x) \big\}, \quad \hat x_k := \frac{1}{S_k}\sum_{i=0}^{k}\lambda_i x_i \]
for $k \ge 0$.

The iteration updates described by (2) of the original MDM correspond to Method 13 with $\beta_k := 1$. Corollary 11 shows, in particular, that this extended MDM has a better complexity bound for the objective function compared to the original MDM described in Section 2. It is important to note that, according to Corollary 11, the extended MDM ensures $O(1/\sqrt{k})$-convergence without a precise upper bound for $d(x^*)$ and even if the feasible region $Q$ is unbounded, while the existing averaging techniques [7, 8] of the MDM assume the compactness of $Q$.

On the other hand, restricting to the DA model (4) in Method 9, the first procedure (23) yields Nesterov's DAM (4) described in Section 2. In particular, Corollary 10 and subsequently Corollary 11 provide a small improvement over the original result assuming the differentiability of $d(x)$, as pointed out before. Since our analysis does not introduce the dual space, the arguments are more straightforward than the original ones.

We can also obtain variants of the extended MDM and the DAM from the second procedure (24). An upper bound of $f(\hat x_k) - f(x^*)$ for the sequence $\{\hat x_k\}$ generated by these methods can be derived from Corollaries 10 or 11. An interesting feature of these variants is that their convergence results are stated for the test points $x_k$ ($= \hat x_k$) themselves, rather than for the average of the test points as in the extended MDM or the DAM. In particular, the variant of the DAM, i.e., the second procedure (24) with the DA model (4), corresponds to the double averaging method (7) proposed by Nesterov and Shikhman [27].
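For intuition, the extended mirror-descent update has a closed form in the simplest setting. The sketch below is an illustration under assumptions that are not part of the original text: $E = \mathbb{R}^2$, $Q = E$, and $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ (so $\sigma = 1$), for which the first-order optimality condition of the subproblem gives $x_{k+1} = x_0 + \big(\beta_{k-1}(x_k - x_0) - \lambda_k g_k\big)/\beta_k$. The toy objective and all names are hypothetical.

```python
import math

X0 = [0.0, 0.0]        # x_0 = argmin d (assumption of this sketch)
X_STAR = [1.0, -2.0]   # minimizer of the toy objective, optimal value 0


def f(x):
    # Nonsmooth convex test function.
    return abs(x[0] - X_STAR[0]) + abs(x[1] - X_STAR[1])


def subgrad(x):
    # One subgradient of f at x (0 is a valid component at a kink).
    return [0.0 if xi == ci else math.copysign(1.0, xi - ci)
            for xi, ci in zip(x, X_STAR)]


def extended_md(n_iter, lam, beta):
    """Extended mirror-descent sketch for k = 0..n_iter.

    lam(k) and beta(k) are the weight and scaling parameters (beta defined for
    k >= -1). Returns (x_hat_k, x_{k+1}, S_k, C_k) with
    C_k = 0.5 * sum_i lam_i^2 * ||g_i||^2 / beta_{i-1}.
    """
    x = X0[:]
    acc = [0.0, 0.0]
    S = C = 0.0
    for k in range(n_iter + 1):
        lk = lam(k)
        g = subgrad(x)                                   # g_k at the test point x_k
        S += lk
        acc = [a + lk * xi for a, xi in zip(acc, x)]     # accumulates lam_i * x_i
        C += lk * lk * sum(gi * gi for gi in g) / (2.0 * beta(k - 1))
        # Closed-form minimizer of
        #   lam_k <g_k, x> + beta_k d(x) - beta_{k-1} l_d(x_k; x)
        x = [x0i + (beta(k - 1) * (xi - x0i) - lk * gi) / beta(k)
             for x0i, xi, gi in zip(X0, x, g)]
    x_hat = [a / S for a in acc]
    return x_hat, x, S, C
```

With `beta = lambda k: 1.0` the update reduces to the plain subgradient step $x_{k+1} = x_k - \lambda_k g_k$, in line with the remark that the original MDM is recovered for a constant scaling parameter; a growing choice such as `beta = lambda k: math.sqrt(k + 2.0)` needs neither a bounded $Q$ nor a fixed iteration count.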
5 A family of (inexact) gradient-based methods for structured problems in the unifying framework

The framework discussed in Section 3 can also be applied to develop efficient (inexact) gradient-based methods for structured convex problems. In this section, we assume that the objective function $f(x)$ of the problem (1) has the following structure: for any $y \in Q$, there exists a lower approximation $l_f(y;x)$ of $f(x)$, which is convex in $x$, and satisfies the inequalities
\[ l_f(y;x) \le f(x) \le l_f(y;x) + \frac{L(y)}{2}\|x - y\|^2 + \delta(y), \quad \forall x \in Q, \tag{33} \]
for some $L(y) > 0$ and $\delta(y) \ge 0$. We also assume that for any $y \in Q$, $s \in E^*$, and $\beta > 0$, we can compute the optimal solution of the (sub)problem
\[ \min_{x \in Q}\{ l_f(y;x) + \langle s, x\rangle + \beta d(x) \}. \tag{34} \]
Let us see some examples which admit these assumptions.

Example 14. The first four cases were already considered in the literature.

(i) Smooth case. If the convex objective function $f(x)$ is continuously differentiable on $Q$ and its gradient $\nabla f(x)$ is Lipschitz continuous on $Q$ with a constant $L > 0$, defining $l_f(y;x) := f(y) + \langle \nabla f(y), x - y\rangle$ yields the condition (33) with $L(\cdot) \equiv L$ and $\delta(\cdot) \equiv 0$. Then subproblem (34) is of the form
\[ \min_{x \in Q}\{ f(y) + \langle s + \nabla f(y), x - y\rangle + \beta d(x) \}. \tag{35} \]

(ii) Composite structure. Let the objective function $f(x)$ have the form
\[ f(x) = f_0(x) + \Psi(x) \tag{36} \]
where $f_0(x) : E \to \mathbb{R} \cup \{+\infty\}$ is convex and continuously differentiable on $Q$ with Lipschitz continuous gradient and $\Psi(x) : E \to \mathbb{R} \cup \{+\infty\}$ is a lower semicontinuous convex function with $Q \subset \mathrm{dom}\,\Psi$. Letting $L > 0$ be the Lipschitz constant of $\nabla f_0$ on $Q$, we can define $l_f(y;x) := f_0(y) + \langle \nabla f_0(y), x - y\rangle + \Psi(x)$ so that we have (33) with $L(\cdot) \equiv L$ and $\delta(\cdot) \equiv 0$. The corresponding (sub)problem has the form
\[ \min_{x \in Q}\{ f_0(y) + \langle s + \nabla f_0(y), x - y\rangle + \beta d(x) + \Psi(x) \}. \]
A generalization of classical methods such as the proximal gradient method for this model was proposed by Fukushima and Mine [9] (without assuming convexity of $f_0(x)$). Nesterov's optimal method (9) can also be generalized to this case [25]. Smoothing techniques are also an important approach for this example. Nesterov [23] showed a significant improvement of the convergence rate for a particular class, and Beck and Teboulle [5] proposed a unifying generalization.

(iii) Inexact oracle model. Let us assume that our oracle for $f(x)$ is inexact [8], that is, we can compute $(\bar f(y), \bar g(y)) \in \mathbb{R} \times E^*$ at each $y \in Q$ such that
\[ 0 \le f(x) - \big( \bar f(y) + \langle \bar g(y), x - y\rangle \big) \le \frac{L_y}{2}\|x - y\|^2 + \delta_y, \quad \forall x \in Q \tag{37} \]
is satisfied for some $L_y > 0$ and $\delta_y \ge 0$. Then defining $l_f(y;x) := \bar f(y) + \langle \bar g(y), x - y\rangle$, $L(y) := L_y$, and $\delta(y) := \delta_y$, we have exactly (33). This model was investigated in [8], where the primal, dual, and fast gradient methods were proposed. These methods were also implemented in [26] for a particular class of this model, equipped with an iterative scheme to estimate the Lipschitz constants $L_y$ at each iteration.
The fast gradient methods can be seen as generalizations of Nesterov's optimal method (9) to those cases.

(iv) Saddle structure. Let us consider an objective function with the following structure:
\[ f(x) = \sup_{u \in U} \phi(u,x) \]
where $U$ is a compact convex set of a finite dimensional real vector space $E'$ and $\phi : U \times E \to \mathbb{R} \cup \{+\infty\}$ is a concave-convex function satisfying the following conditions.

• $\phi(\cdot,x)$ is an upper semicontinuous concave function for all $x \in Q$.
• $\phi(u,\cdot)$ is a lower semicontinuous convex function with $Q \subset \mathrm{dom}\,\phi(u,\cdot)$ for all $u \in U$.
• For all $u \in U$, $\phi(u,\cdot)$ is continuously differentiable on $Q$ and its gradient is Lipschitz continuous on $Q$, i.e., there exists a constant $L_u \ge 0$ such that $\|\nabla_x \phi(u,x_1) - \nabla_x \phi(u,x_2)\|_* \le L_u \|x_1 - x_2\|$, $\forall x_1, x_2 \in Q$.

• $L := \max_{u \in U} L_u$ is finite and positive.

Then defining
\[ l_f(y;x) := \max_{u \in U}\{ \phi(u,y) + \langle \nabla_x \phi(u,y), x - y\rangle \}, \tag{38} \]
it satisfies condition (33) with $L(\cdot) \equiv L$, $\delta(\cdot) \equiv 0$, and we will have the following subproblem:
\[ \min_{x \in Q}\Big\{ \max_{u \in U}\{ \phi(u,y) + \langle s + \nabla_x \phi(u,y), x - y\rangle \} + \beta d(x) \Big\}. \]
This case is a generalization of the structured convex problem discussed in [20], namely, $E' = \mathbb{R}^m$ and, for each $u = (u^{(1)}, \ldots, u^{(m)}) \in U$, defining $\phi(u,x) = \sum_{i=1}^{m} u^{(i)} f_i(x)$ for given differentiable convex functions $f_1(x), \ldots, f_m(x)$ on $E$ with Lipschitz continuous gradient. The convexity of $\phi(u,\cdot)$ is ensured by imposing the following assumption as in [20]: If there exists $u \in U$ such that $u^{(i)} < 0$, then $f_i(x)$ is a linear function. Letting $L^{(i)}$ be a Lipschitz constant of $\nabla f_i(x)$ for $i = 1, \ldots, m$, we have $L = \max_{u \in U} L_u = \max_{u \in U} \sum_{i=1}^{m} u^{(i)} L^{(i)}$.

The definition of $l_f(y;x)$ can be simplified when $Q \subset \mathrm{int}(\mathrm{dom}\, f)$ and $\phi(\cdot,x)$ is strictly concave for all $x \in Q$. In this case, denoting $u_x = \arg\max_{u \in U} \phi(u,x)$, we have $\nabla f(x) = \nabla_x \phi(u_x,x)$ and therefore we can define $l_f(y;x) := \phi(u_y,y) + \langle \nabla_x \phi(u_y,y), x - y\rangle$, which satisfies (33) with $L(\cdot) \equiv L$ and $\delta(\cdot) \equiv 0$. Its subproblem is of the form (35). This situation is also discussed in Tseng's methods [28].

(v) Mixed structure. The above examples can be combined with each other; for instance, considering the function $f_0(x)$ in (ii) with inexactness (iii) or with the saddle structure (iv), or considering the function $\phi(u,x)$ in (iv) with inexactness (iii) or with the composite structure (ii), satisfies our requirement (33).

Remark 15. Since our model includes the inexact oracle model (iii), the methods proposed in this section can also be applied to non-smooth and weakly smooth convex problems (see [7, 8]).
Moreover, considering a generalization of (33) by replacing $\delta(y)$ with $\delta(x,y)$, where $\delta(\cdot,y)$ is a nonnegative and lower semicontinuous convex function on $Q$ for every $y \in Q$, we can further include other structured convex problems such as the composite convex problem discussed in [10, 11, 14] (in the deterministic version): The objective function $f(x)$ satisfies the condition
\[ f(y) - f(x) - \langle g(x), y - x\rangle \le \frac{L}{2}\|y - x\|^2 + M\|y - x\|, \quad \forall x, y \in Q, \]
for a subgradient mapping $g(x) \in \partial f(x)$, $L, M \ge 0$, and $\delta(x,y) := M\|x - y\|$. The smooth and non-smooth problems are its special cases with $M = 0$ and $L = 0$, respectively. See [2] for more details.

5.1 General gradient-based methods in the unifying framework

We will propose (inexact) gradient-based methods for structured convex optimization problems which satisfy (33) and admit computable solutions for (34), as highlighted by Example 14. These methods generate approximate solutions $\{\hat x_k\} \subset Q$ satisfying the relation $(R_k)$. We also consider, in this section, the following alternative of this relation $(R_k)$ for some constant $C_k$:
\[ (\hat R_k) \qquad \sum_{i=0}^{k} \lambda_i f(x_{i+1}) \le \min_{x \in Q} \psi_k(x) + C_k. \tag{39} \]

Notice that the relation $(\hat R_k)$ is slightly different from that of the non-smooth case (21). We use the following alternative of Lemma 6 for this relation: if $\{\psi_k(x)\}$ satisfies Property 2 and the relation $(\hat R_k)$ is satisfied for some $k \ge 0$, then we have
\[ \frac{1}{S_k}\sum_{i=0}^{k} \lambda_i f(x_{i+1}) - f(x^*) \le \frac{\beta_k l_d(z_k;x^*) + C_k}{S_k}. \tag{40} \]
The following theorem validates our methods.

Theorem 16. Let $\{\psi_k(x)\}_{k \ge -1}$ be a sequence of auxiliary functions satisfying Property 2 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Denote $z_k = \arg\min_{x \in Q} \psi_k(x)$. Then the following assertions hold.

(a) If $\sigma\beta_{-1}/\lambda_0 \ge L(x_0)$ holds, then the relation $(R_0)$ is satisfied with $\hat x_0 := z_0$ and $C_0 := \lambda_0 \delta(x_0)$.

(b) Suppose that the relation $(R_k)$ is satisfied for some integer $k \ge 0$. If the relations $x_{k+1} = z_k$ and $\sigma\beta_k/\lambda_{k+1} \ge L(x_{k+1})$ hold, then the relation $(R_{k+1})$ is satisfied with
\[ \hat x_{k+1} := \frac{S_k \hat x_k + \lambda_{k+1} z_{k+1}}{S_{k+1}}, \quad C_{k+1} := C_k + \lambda_{k+1}\delta(x_{k+1}). \tag{41} \]
Moreover, if the relation $(\hat R_k)$ is satisfied for some integer $k \ge 0$, and the relations $\sigma\beta_k/\lambda_{k+1} \ge L(x_{k+1})$ and $x_{k+1} = z_k$ hold, then the relation $(\hat R_{k+1})$ is satisfied with $C_{k+1} := C_k + \lambda_{k+1}\delta(x_{k+1})$.

(b′) Suppose that the relation $(R_k)$ is satisfied for some integer $k \ge 0$. If the relations $x_{k+1} = (S_k \hat x_k + \lambda_{k+1} z_k)/S_{k+1}$ and $\sigma\beta_k S_{k+1}/\lambda_{k+1}^2 \ge L(x_{k+1})$ hold, then the relation $(R_{k+1})$ is satisfied with
\[ \hat x_{k+1} := \frac{S_k \hat x_k + \lambda_{k+1} z_{k+1}}{S_{k+1}}, \quad C_{k+1} := C_k + S_{k+1}\delta(x_{k+1}). \]

Proof. Denote $L_k = L(x_k)$ and $\delta_k = \delta(x_k)$.

(a) Condition (ii) with $k = -1$ and condition (i) of Property 2 yield that
\begin{align*}
\min_{x \in Q} \psi_0(x) + \lambda_0\delta_0
&\ge \min_{x \in Q} \psi_{-1}(x) + \lambda_0 l_f(x_0;z_0) + \beta_{-1}\xi(z_{-1},z_0) + \lambda_0\delta_0 \\
&= \lambda_0 \Big( l_f(x_0;z_0) + \frac{\beta_{-1}}{\lambda_0}\xi(x_0,z_0) + \delta_0 \Big) \\
&\ge \lambda_0 \Big( l_f(x_0;z_0) + \frac{\sigma\beta_{-1}}{2\lambda_0}\|z_0 - x_0\|^2 + \delta_0 \Big) \\
&\ge \lambda_0 \Big( l_f(x_0;z_0) + \frac{L_0}{2}\|z_0 - x_0\|^2 + \delta_0 \Big) \\
&\ge \lambda_0 f(z_0) = S_0 f(\hat x_0).
\end{align*}
