arxiv: v2 [math.oc] 1 Jul 2015


A Family of Subgradient-Based Methods for Convex Optimization Problems in a Unifying Framework

Masaru Ito (ito@is.titech.ac.jp), Mituhiro Fukuda (mituhiro@is.titech.ac.jp)

Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, 2-2--W8-4 Oh-okayama, Meguro, Tokyo, Japan

Research Report B-477, Department of Mathematical and Computing Sciences, Tokyo Institute of Technology. February 2014, revised June 2015.

Abstract

We propose a new family of subgradient- and gradient-based methods that converge with optimal complexity for convex optimization problems whose feasible region is simple enough. This includes cases where the objective function is non-smooth, is smooth, has a composite or saddle structure, or is given by an inexact oracle model. We unify the way of constructing the subproblems that must be solved at each iteration of these methods. This allows us to analyze the convergence of these methods in a unified way, whereas previous results required a different approach for each method/algorithm. Our contribution relies on two well-known methods in non-smooth convex optimization: the mirror-descent method by Nemirovski-Yudin and the dual-averaging method by Nesterov. Therefore, our family includes them and many other methods as particular cases, and it partially becomes a special case of other universal methods. For instance, the proposed family of classical gradient methods and its accelerations generalizes Devolder et al.'s and Nesterov's primal/dual gradient methods and Tseng's accelerated proximal gradient methods. As an additional contribution, the novel extended mirror-descent method removes both the compactness assumption on the feasible region and the need to fix the final number of iterations in order to attain optimal complexity.

Keywords: non-smooth/smooth convex optimization, structured convex optimization, subgradient/gradient-based proximal method, mirror-descent method, dual-averaging method, complexity bounds.
Mathematical Subject Classification (2010): 90C25, 68Q25, 49M37

(Title-page footnote: corresponding author.)

1 Introduction

1.1 Background

The gradient-based method proposed by Nesterov in 1983 for smooth convex optimization problems brought a surprising class of optimal complexity methods with preeminent performance over classical gradient methods for the worst case instances [20]. The minimization of a smooth convex function, whose gradient is Lipschitz continuous with constant $L$, by these optimal complexity methods ensures an $\varepsilon$-solution for the objective value within $O(\sqrt{LR^2/\varepsilon})$ iterations, while the classical gradient methods require $O(LR^2/\varepsilon)$ iterations; here $R$ is the distance between an optimal solution and the initial point. It is important to observe that in all of those methods, the iteration complexity refers to the convergence rate of the approximate optimal values and not of the approximate optimal solutions.

Nesterov's optimal complexity method, as well as further improvements and extensions [, 2, 2, 23], applied or extended to non-smooth convex problems [5, 5, 22, 23, 26, 28, 29] with composite structure [4, 0,, 4, 22, 23, 25] and to the inexact oracle model [7, 8], substantially changed the approach to solving large-scale structured convex optimization problems arising in machine learning, compressed sensing, image processing, statistics, etc.

For a general non-smooth convex problem, the complexity analysis of those methods is apparently different from the smooth case. The optimal complexity in the non-smooth case is $O(M^2R^2/\varepsilon^2)$ iterations for an $\varepsilon$-solution, where $M$ is a Lipschitz constant for the objective function. A well-known optimal method for this case is the Mirror-Descent Method (MDM) proposed by Nemirovski and Yudin [9], which was later related to the subgradient algorithm by Beck and Teboulle [3]. Further variants which do not require fixing the final number of iterations to guarantee the optimal convergence rate were also proposed, but they additionally require the boundedness of the feasible region [7, 8]. The Dual-Averaging Method (DAM) proposed by Nesterov [24] and its modification [27], on the other hand, ensure the optimal complexity $O(M^2R^2/\varepsilon^2)$ even if the feasible region is unbounded.
A key idea behind this enhancement in the DAM was the introduction of a sequence which we call the scaling parameters $\beta_k$ in this paper. Since the MDM and the DAM are the main motivations of the present article, we will focus our subsequent discussion on results related to them.

The (approximate) gradient-based methods proposed as the primal and dual gradient methods in [8, 25] can be interpreted as particular cases of the MDM and the DAM for the corresponding smooth convex problems, respectively, as will become clear in the course of the article. However, they only ensure the same complexity $O(LR^2/\varepsilon)$ as the classical gradient methods, and they require different approaches to prove each of their rates of convergence.

Many gradient-based methods for smooth and structured convex problems were unified and generalized in some way by Tseng [28, 29]. The three algorithms proposed there preserve the $O(\sqrt{LR^2/\varepsilon})$ iteration complexity, but they also require a separate analysis for each of them. Particular cases of Tseng's optimal methods can be seen as accelerated versions of the MDM and the DAM for smooth and structured problems [8, 25], as mentioned previously.

It is important to note that the difference between the MDM and the DAM, or related algorithms, lies in the construction of the subproblems solved at each iteration. As far as we know, there are no results formalizing a combined treatment of them to prove their convergence, as we propose in the present article.

1.2 Our contributions

All of the above methods require at each iteration the computation of the minimizer(s) of one (or two) strongly convex function(s), which we call auxiliary functions, over a (simple) closed convex domain. Our main contribution is the identification of common properties intrinsic to the MDM and the DAM, which we call Property 2 (Section 3), that these auxiliary functions should satisfy in order to obtain methods with optimal convergence rates.
This characterization lets us propose two strategies to construct these auxiliary functions sequentially: the extended Mirror-Descent (MD) model (13), which we believe is completely new in the literature, and the Dual-Averaging (DA) model (14).

In fact, we will show that they can be combined in arbitrary order (Proposition 4) for our final purpose. The above strategy, combined with appropriate step-sizes, which we call scaling and weight parameters, satisfying the inequalities $(R_k)$ (18) (or $(\hat R_k)$ (21) for the non-smooth problems and $(\hat R_k)$ (39) for the structured problems), will permit us to propose novel families of (in fact, infinitely many) methods to solve convex optimization problems. The iteration complexity for the non-smooth problems using either Method 9 (a) or (b) (in Section 4) is $O(M^2R^2/\varepsilon^2)$. For the structured problems, which include smooth, composite structure, saddle structure or inexact oracle [8], the iteration complexities of Methods 17 and 18 (in Section 5) are $O(LR^2/\varepsilon)$ and $O(\sqrt{LR^2/\varepsilon})$, respectively (except for the inexact oracle case).

A clear advantage of our unifying framework over the existing ones is that we can prove all the convergence results and their rates in a universal way, without tailoring the proofs to a particular method/algorithm. As far as we know, this is the first time that such a general treatment unifying the MDM and the DAM is proposed.

The approach of constructing optimal methods based on our framework somewhat resembles the estimate sequences [, 2, 2], the inequality $(R_k)$ [23], and the inequality (23) [27].

Our approach should be distinguished from the universal (sub)gradient-based methods which can be applied simultaneously to non-smooth or smooth problems such as [0,, 4], or to structured problems which can admit inexact oracles, weakly smooth functions, etc. [7, 8, 26]. Such universal approaches are also applicable to our framework (Remark 5); a development of a universal method to minimize weakly smooth and strongly convex functions based on our unifying framework is also discussed in [2]. Also, as pointed out before, Tseng unified many of these methods in three algorithms, but they require a different treatment for each of them.
Since our methods are based on extended MD and/or DA updates, they may seem quite restrictive, but as Table 1 shows, many known optimal methods are particular cases of our methods or can be particularized to coincide with them.

As a minor contribution, we generalize the MD method to the extended MD method. Our methods have the advantage of not requiring a fixed number of iterations to determine the step-size (parameter). This drawback was already partially solved in [0,, 4], but our method has the additional advantage of not requiring the boundedness of the domain.

The structure of this article is as follows. First, we review some existing methods, in particular the MDM and the DAM for non-smooth objective functions and Tseng's accelerated gradient methods for smooth ones (Section 2). In Section 3, we propose Property 2, which represents a framework of auxiliary functions for the development of our methods, as well as some supporting lemmas. We then propose in Section 4 the general subgradient-based method and prove its convergence rate, in particular for the extended MDM and the DAM, and subsequently do the same for the structured problems in Section 5.

1.3 Problem setting and notations

In this paper, we consider a finite dimensional real vector space $E$ endowed with a norm $\|\cdot\|$. The dual space of $E$ is denoted by $E^*$ and is endowed with the dual norm $\|\cdot\|_*$ defined by

$$\|s\|_* := \max_{\|x\| \le 1} \langle s, x\rangle, \quad s \in E^*,$$

where $\langle s, x\rangle$ denotes the value of $s \in E^*$ at $x \in E$. We consider subgradient-based and gradient-based methods to solve the following convex optimization problem:

$$\min_{x \in Q} f(x) \tag{1}$$

Table 1: Relation between our family of (sub)gradient-based methods and other known methods. The star (*) corresponds to our result. "Complexity" indicates the number of iterations to obtain an $\varepsilon$-solution when the objective function has no inexactness in its oracle. [25] is included considering that its Lipschitz constant is known in advance. The third problem class, "applicable for both", indicates that the existing methods are applicable simultaneously to non-smooth, smooth, and structured problems.

| problem class | complexity | some known methods | generalized methods |
| non-smooth | optimal, $O(M^2R^2/\varepsilon^2)$ | mirror-descent [3, 9] | *Method 9 (a) with the model (13) (extended mirror-descent: Method 13) |
| | | dual-averaging [24] | *Method 9 (a) with the model (14) |
| | | double averaging [27] | *Method 9 (b) with the model (14) |
| | | sliding averaging [8]; Nedić-Lee's averaging [7] | |
| structured/smooth | classical, $O(LR^2/\varepsilon)$ | primal gradient [8, 9, 25] | *Method 17 with the model (13) |
| | | dual gradient [8, 25] | *Method 17 with the model (14) |
| | optimal, $O(\sqrt{LR^2/\varepsilon})$ | optimal estimate sequence method [2, 2]; Nesterov's method [23]; Tseng's modified method, see [28, (35-36)] | |
| | | interior gradient method []; Tseng's method [28, Algorithm ]; Lan-Luo-Monteiro's method [5]; FISTA [4]; Tseng's first APG [29] | |
| | | Tseng's second APG [29]; Tseng's method [28, Algorithm ] | *Method 21 (Method 18 with the model (13)) |
| | | Tseng's third APG [29]; Tseng's method [28, Algorithm 3] | *Method 22 (Method 18 with the model (14)) |
| applicable for both | optimal | Fast gradient method [8]; Ghadimi-Lan's method [0,, 4]; Universal gradient method [26] | |

where $Q$ is a nonempty, closed, convex, and possibly unbounded subset of $E$, and $f : E \to \mathbb{R} \cup \{+\infty\}$ is a proper lower semicontinuous convex function with $Q \subseteq \operatorname{dom} f := \{x \in E : f(x) < +\infty\}$. For each $x \in \operatorname{dom} f$, the subdifferential of $f$ at $x$ is denoted by

$$\partial f(x) := \{g \in E^* : f(y) \ge f(x) + \langle g, y - x\rangle,\ \forall y \in E\}.$$
We assume throughout this paper that the problem (1) always has an optimal solution $x^* \in Q$, and that $Q$ is simple enough, or has some special structure, so that a subproblem over it can be solved with moderate effort. See [23] for some examples.

We additionally assume that there is a proper lower semicontinuous convex function $d : E \to \mathbb{R} \cup \{+\infty\}$ satisfying the following properties:

- $d(x)$ is a strongly convex function on $Q$ with parameter $\sigma > 0$, i.e.,
  $$d(\tau x + (1-\tau) y) \le \tau d(x) + (1-\tau) d(y) - \tfrac{1}{2}\sigma\tau(1-\tau)\|x - y\|^2, \quad \forall x, y \in Q,\ \forall \tau \in [0, 1].$$
- $d(x)$ is continuously differentiable on $Q$.

We denote by $\xi(z, x)$ the Bregman distance [6] between $z$ and $x$:

$$\xi(z, x) := d(x) - d(z) - \langle \nabla d(z), x - z\rangle, \quad z, x \in Q.$$

The Bregman distance satisfies $\xi(z, x) \ge \frac{\sigma}{2}\|x - z\|^2$ for any $x, z \in Q$ by the strong convexity of $d(x)$. We also assume that $d(x_0) = \min_{x \in Q} d(x) = 0$ for $x_0 := \arg\min_{x \in Q} d(x)$, which is used as the initial point of our methods. (Footnote: we can always assume this requirement for an arbitrary point $x_0 \in Q$ by replacing $d(x)$ by $\xi(x_0, x)$.) Finally, we define $R$ as

$$R := \sqrt{\tfrac{1}{\sigma}\, d(x^*)} \quad \text{or} \quad R := \sqrt{\tfrac{1}{\sigma}\, \xi(x_0, x^*)},$$

or their upper bounds, which quantify the distance between the optimal solution $x^*$ and the initial point $x_0$ in view of the properties $d(x_0) = 0$ and $d(x) \ge \frac{\sigma}{2}\|x - x_0\|^2$ for every $x \in Q$.

2 Existing optimal methods

In this section, we review some well-known subgradient-based and gradient-based methods. In particular, we focus on the Mirror-Descent Method (MDM), the Dual-Averaging Method (DAM), and the double and triple averaging methods for non-smooth objective functions, and on Nesterov's accelerated gradient and Tseng's Accelerated Proximal Gradient (APG) methods for smooth objective functions (or for non-smooth ones with some special structures). The purpose of this section is to unify the notation of these methods in order to introduce a unifying framework for them in Section 3. To that end, we sometimes change the variable names, shift their indices, and add constants to the objective functions of the optimization subproblems compared to the original articles.

2.1 Optimal methods for the non-smooth case

Let us first assume that $f(x)$ in (1) is non-smooth. The MDM [9], in the form reinterpreted by Beck and Teboulle [3], generates the following iterates from the initial point $x_0 := \arg\min_{x \in Q} d(x)$:

$$x_{k+1} := \arg\min_{x \in Q} \{\lambda_k [f(x_k) + \langle g_k, x - x_k\rangle] + \xi(x_k, x)\}, \quad k = 0, 1, 2, \ldots, \tag{2}$$

where $g_k \in \partial f(x_k)$ and $\lambda_k > 0$ is a weight. The parameter $\lambda_k$ is also referred to as a step-size; it is known that the procedure (2) reduces to the classical subgradient method $x_{k+1} := \pi_Q(x_k - \lambda_k g_k)$ when $E$ is a Euclidean space, $\|\cdot\|$ is the norm of $E$ induced by its inner product, $d(x) := \frac{1}{2}\|x - x_0\|^2$, and $\pi_Q$ is the orthogonal projection onto $Q$ (see also Auslender-Teboulle [] and Fukushima-Mine [9] for some related works). The MDM produces the following estimate [3]:

$$\forall k \ge 0, \quad \min_{0 \le i \le k} f(x_i) - f(x^*) \le \frac{\sum_{i=0}^{k} \lambda_i f(x_i)}{\sum_{i=0}^{k} \lambda_i} - f(x^*) \le \frac{\xi(x_0, x^*) + \frac{1}{2\sigma}\sum_{i=0}^{k} \lambda_i^2 \|g_i\|_*^2}{\sum_{i=0}^{k} \lambda_i}, \tag{3}$$

where the right-hand side can be bounded by $M\sqrt{\frac{2}{\sigma}\xi(x_0, x^*)}\,/\sqrt{k+1}$ if $M := \sup\{\|g\|_* : g \in \partial f(x),\ x \in Q\}$ is finite and if we choose the constant weights $\lambda_i := \frac{1}{M}\sqrt{2\sigma\xi(x_0, x^*)}\,/\sqrt{k+1}$, $i = 0, \ldots, k$, for a fixed $k \ge 0$.
If we further know an upper bound $R \ge \sqrt{\frac{1}{\sigma}\xi(x_0, x^*)}$, this result ensures an $\varepsilon$-solution in $O(M^2R^2/\varepsilon^2)$ iterations, which is the optimal complexity for the non-smooth case [3]. The above choice of weights, however, is impractical since it depends on the final iteration count $k$ and on an upper bound for $\xi(x_0, x^*)$; a more practical choice $\lambda_i := r/\sqrt{i+1}$ for some $r > 0$ only ensures the upper bound

$$\frac{\xi(x_0, x^*) + (2\sigma)^{-1} r^2 M^2 (1 + \log(k+1))}{2r(\sqrt{k+2} - 1)} = O(\log k/\sqrt{k})$$

for the right-hand side of (3). Note, however, that when the feasible region $Q$ is compact, the weights $\lambda_i := r/\sqrt{i+1}$ ($r > 0$) ensure an $O(1/\sqrt{k})$ convergence rate for the difference $f(\hat x_k) - f(x^*)$ by considering $\hat x_k$, a weighted average of $x_0, \ldots, x_k$ [7, 8].

The DAM proposed by Nesterov [24] overcomes the dependence of the weights of the MDM on $k$ and still achieves the rate of convergence $O(1/\sqrt{k})$. This method employs non-decreasing positive scaling parameters $\{\beta_k\}_{k \ge -1}$ ($\beta_{k+1} \ge \beta_k > 0$) in addition to the weights $\{\lambda_k\}_{k \ge 0}$. From the initial point $x_0 := \arg\min_{x \in Q} d(x)$, the DAM is performed as

$$x_{k+1} := \arg\min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i [f(x_i) + \langle g_i, x - x_i\rangle] + \beta_k d(x) \right\}, \quad k = 0, 1, 2, \ldots \tag{4}$$
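As an illustration of the mirror-descent update (2) in its Euclidean form (the classical projected subgradient method) with the constant weights discussed above, here is a minimal sketch. Everything concrete in it is an assumption made for the example and not part of the paper: the test objective $f(x) = \sum_i |x_i - c_i|$, the box $Q = [lo, hi]^n$, the function names, and the choice $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ (so $\sigma = 1$). It compares the best objective gap with the bound $M\sqrt{2\xi(x_0, x^*)/\sigma}/\sqrt{k+1}$.

```python
import math


def project_box(x, lo, hi):
    """Euclidean projection onto the box Q = [lo, hi]^n (assumed feasible set)."""
    return [min(max(v, lo), hi) for v in x]


def l1_objective(x, c):
    """f(x) = sum_i |x_i - c_i|, a non-smooth convex test function (assumed)."""
    return sum(abs(a - b) for a, b in zip(x, c))


def l1_subgradient(x, c):
    """One subgradient of f at x: the componentwise sign of x - c."""
    return [float((a > b) - (a < b)) for a, b in zip(x, c)]


def projected_subgradient(c, x0, lo, hi, k):
    """Run x_{i+1} = pi_Q(x_i - lam * g_i) for i = 0..k with constant weights.

    Returns (best objective gap over x_0..x_k, theoretical bound).
    Uses d(x) = 0.5 * ||x - x0||^2, hence sigma = 1.
    """
    n = len(c)
    big_m = math.sqrt(n)              # ||g||_2 <= sqrt(n) for sign vectors
    x_star = project_box(c, lo, hi)   # minimizer of the separable f over the box
    f_star = l1_objective(x_star, c)
    xi0 = 0.5 * sum((a - b) ** 2 for a, b in zip(x_star, x0))  # xi(x0, x*)
    # Constant weight; it needs xi(x0, x*) and k in advance, which is exactly
    # what makes this choice impractical in general.
    lam = math.sqrt(2.0 * xi0) / (big_m * math.sqrt(k + 1))
    x = list(x0)
    best = float("inf")
    for _ in range(k + 1):
        best = min(best, l1_objective(x, c))
        g = l1_subgradient(x, c)
        x = project_box([xi - lam * gi for xi, gi in zip(x, g)], lo, hi)
    bound = big_m * math.sqrt(2.0 * xi0) / math.sqrt(k + 1)
    return best - f_star, bound


gap, bound = projected_subgradient([0.3, -0.7, 2.0], [0.0, 0.0, 0.0], -1.0, 1.0, 199)
print(gap, bound)
```

The sketch uses the exact value of $\xi(x_0, x^*)$, which is available only because the test problem is solvable in closed form.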

Nesterov proved that the DAM satisfies the following general estimate (set $D = d(x^*)$ in [24, Theorem and (3.2)]):

$$\forall k \ge 0, \quad \min_{0 \le i \le k} f(x_i) - f(x^*) \le \frac{\sum_{i=0}^{k} \lambda_i f(x_i)}{\sum_{i=0}^{k} \lambda_i} - f(x^*) \le \frac{\beta_k d(x^*) + \frac{1}{2\sigma}\sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2}{\sum_{i=0}^{k} \lambda_i}. \tag{5}$$

In order to ensure the rate of convergence $O(1/\sqrt{k})$, we do not even need prior knowledge of an upper bound for $\xi(x_0, x^*)$; for instance, choosing $\lambda_k := 1$ and $\beta_k := \gamma \hat\beta_k$, where $\gamma > 0$ and

$$\hat\beta_{-1} := \hat\beta_0 := 1, \quad \hat\beta_{k+1} := \hat\beta_k + \frac{1}{\hat\beta_k}, \quad k \ge 0, \tag{6}$$

the right-hand side of (5) can be bounded by

$$\left(\gamma d(x^*) + \frac{M^2}{2\sigma\gamma}\right) \frac{0.5 + \sqrt{2k+1}}{k+1},$$

which achieves the optimal complexity if we choose $\gamma := M/\sqrt{2\sigma d(x^*)}$. A key ingredient in the analysis of the DAM in [24] is the use of a dual approach, such as the conjugate function of $\beta d(x)$ for $\beta > 0$. In this paper, we prove the same result with simpler arguments (in Section 4) for the DAM and (an extension of) the MDM without employing it.

Nesterov and Shikhman [27] further proposed the double and triple averaging methods in order to obtain convergence results for the sequence $\{x_k\}$. The double averaging method [27, eq. (28)] iterates, starting from $x_0 := \arg\min_{x \in Q} d(x)$, as follows:

$$z_k := \arg\min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i [f(x_i) + \langle g_i, x - x_i\rangle] + \beta_k d(x) \right\}, \quad x_{k+1} := (1 - \tau_k) x_k + \tau_k z_k, \quad k = 0, 1, 2, \ldots \tag{7}$$

where $\tau_k := \lambda_{k+1} / \sum_{i=0}^{k+1} \lambda_i$. This method bounds the difference $f(x_k) - f(x^*)$ by the same value as the right-hand side of (5) [27, Theorem 3.] for all $k \ge 0$. Hence, it achieves optimality. The triple averaging method, which is a modification of (7), allows further flexibility in the choices of $\{\lambda_k\}$ and $\{\beta_k\}$ [27, Theorem 3.3].

Observe that for all the above-mentioned methods, we do not need to evaluate any function value at any iteration, and $x_{k+1}$ is determined uniquely even if $Q$ is unbounded since $d(x)$ is strongly convex [24, Lemma 6].

2.2 Optimal methods for the smooth case

Let us assume now that the function $f(x)$ in (1) is convex and continuously differentiable, and that its gradient is Lipschitz continuous on $Q$ with constant $L > 0$:

$$\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|, \quad \forall x, y \in Q.$$
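The dual-averaging update (4) with the parameter choice (6) can be sketched as follows. This is a minimal illustration under assumptions that are not in the paper: $Q = \mathbb{R}^n$ (unbounded), $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ with $\sigma = 1$, the test objective $f(x) = \sum_i |x_i - c_i|$ (so $x^* = c$, $f(x^*) = 0$, $M = \sqrt{n}$), and all function and variable names. In this setting (4) has the closed form $x_{k+1} = x_0 - \frac{1}{\beta_k}\sum_{i \le k} \lambda_i g_i$. The sketch evaluates the right-hand side of (5) exactly and also the closed-form bound with the factor $(0.5 + \sqrt{2k+1})/(k+1)$.

```python
import math


def dual_averaging(c, x0, gamma, num_iters):
    """Simple dual averaging: lambda_k = 1, beta_k = gamma * beta_hat_k as in (6).

    Returns (best gap over x_0..x_k, right-hand side of (5), closed-form bound)
    for k = num_iters.
    """
    n = len(c)

    def f(x):
        return sum(abs(a - b) for a, b in zip(x, c))

    d_star = 0.5 * sum((a - b) ** 2 for a, b in zip(c, x0))  # d(x*) with x* = c
    big_m_sq = float(n)                                      # M^2 for sign subgradients
    s = [0.0] * n                      # running sum of lambda_i * g_i
    beta_hat_prev, beta_hat = 1.0, 1.0  # beta_hat_{k-1}, beta_hat_k (start at k = 0)
    c_k = 0.0                          # (1/(2 sigma)) * sum_i lambda_i^2 ||g_i||^2 / beta_{i-1}
    x = list(x0)
    best = float("inf")
    rhs = closed_form = float("inf")
    for k in range(num_iters + 1):
        best = min(best, f(x))
        g = [float((a > b) - (a < b)) for a, b in zip(x, c)]
        c_k += sum(gi * gi for gi in g) / (2.0 * gamma * beta_hat_prev)
        s = [si + gi for si, gi in zip(s, g)]
        beta_k = gamma * beta_hat
        rhs = (beta_k * d_star + c_k) / (k + 1)
        closed_form = (gamma * d_star + big_m_sq / (2.0 * gamma)) * (
            0.5 + math.sqrt(2 * k + 1)
        ) / (k + 1)
        # Closed-form minimizer of (4) for Q = R^n and the quadratic d.
        x = [x0i - si / beta_k for x0i, si in zip(x0, s)]
        beta_hat_prev, beta_hat = beta_hat, beta_hat + 1.0 / beta_hat
    return best, rhs, closed_form


best, rhs, closed_form = dual_averaging([1.0, -2.0, 0.5], [0.0, 0.0, 0.0], 1.0, 500)
print(best, rhs, closed_form)
```

Note that neither the final iteration count nor $\xi(x_0, x^*)$ enters the update itself; they are used here only to evaluate the bounds.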
Many optimal complexity methods were proposed in the literature under these assumptions (see, e.g., Table 1). In particular, we recall the optimal methods proposed by Nesterov [23] and Tseng [28, 29] for a comparison with our results. Given positive weights $\{\lambda_k\}_{k \ge 0}$, both methods depend on the following computation of optimal solutions $\hat z_k$ and/or $z_k$ of auxiliary subproblems:

$$\begin{aligned}
\text{(a)}\quad & \hat z_k := \arg\min_{x \in Q} \left\{ \lambda_k [f(x_k) + \langle \nabla f(x_k), x - x_k\rangle] + \tfrac{L}{\sigma}\, \xi(z_{k-1}, x) \right\}, \\
\text{(b)}\quad & z_k := \arg\min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i [f(x_i) + \langle \nabla f(x_i), x - x_i\rangle] + \tfrac{L}{\sigma}\, d(x) \right\},
\end{aligned} \tag{8}$$

where $\{x_k\}_{k \ge 0} \subset Q$ is the sequence generated by those methods. Note that the subproblem (a) is closely related to the one of the MDM (2), and the subproblem (b) corresponds to the one of the DAM (4) with $\beta_k = L/\sigma$. Similarly to the non-smooth case, it is not necessary to evaluate the function values at the $x_k$'s, and the minimizers are uniquely defined.

Nesterov's optimal method (see the modified method in [23, Section 5.3]) with a particular choice of the weights $\lambda_k$ is described as follows.

Nesterov's method: Set $\lambda_k := (k+1)/2$ for $k \ge 0$ and $x_0 := z_{-1} := \arg\min_{x \in Q} d(x)$. Compute $\hat z_0$ by (a) and set $\hat x_0 := z_0 := \hat z_0$. For $k \ge 0$, iterate the following procedure:

- Set $x_{k+1} := (1 - \tau_k)\hat x_k + \tau_k z_k$, where $\tau_k := \lambda_{k+1} / \sum_{i=0}^{k+1} \lambda_i$,
- Compute $\hat z_{k+1}$ by (a),
- Set $\hat x_{k+1} := (1 - \tau_k)\hat x_k + \tau_k \hat z_{k+1}$,
- Compute $z_{k+1}$ by (b). (9)

In comparison, Tseng's second and third Accelerated Proximal Gradient (APG) methods [29], which are particular cases of algorithms and 3 in [28], only require the solution of either subproblem (a) or (b), respectively.

Tseng's second APG method: Set $\lambda_0 := 1$, $\lambda_{k+1} := \frac{1 + \sqrt{1 + 4\lambda_k^2}}{2}$ for $k \ge 0$, and $x_0 := z_{-1} := \arg\min_{x \in Q} d(x)$. Compute $\hat z_0$ by (a) and set $\hat x_0 := \hat z_0$. For $k \ge 0$, iterate the following procedure:

- Set $x_{k+1} := (1 - \tau_k)\hat x_k + \tau_k \hat z_k$, where $\tau_k := \lambda_{k+1} / \sum_{i=0}^{k+1} \lambda_i$,
- Compute $\hat z_{k+1}$ by (a), replacing $z_k$ by $\hat z_k$,
- Set $\hat x_{k+1} := (1 - \tau_k)\hat x_k + \tau_k \hat z_{k+1}$. (10)

Tseng's third APG method: Set $\lambda_0 := 1$, $\lambda_{k+1} := \frac{1 + \sqrt{1 + 4\lambda_k^2}}{2}$ for $k \ge 0$, and $x_0 := z_{-1} := \arg\min_{x \in Q} d(x)$. Compute $z_0$ by (b) and set $\hat x_0 := z_0$. For $k \ge 0$, iterate the following procedure:

- Set $x_{k+1} := (1 - \tau_k)\hat x_k + \tau_k z_k$, where $\tau_k := \lambda_{k+1} / \sum_{i=0}^{k+1} \lambda_i$,
- Compute $z_{k+1}$ by (b),
- Set $\hat x_{k+1} := (1 - \tau_k)\hat x_k + \tau_k z_{k+1}$. (11)

Remark 1. To see the equivalence to Tseng's second APG method, notice that $x_0$ is not used at all in [29]. Then, defining $d(x) := D(x, z_0) = \eta(x) - \eta(z_0) - \langle \nabla\eta(z_0), x - z_0\rangle$ for an arbitrary $z_0 \in Q$, we have $\sigma = 1$ in (a). Finally, the correspondence $z_k \to \hat z_{k-1}$, $y_k \to x_k$, $x_k \to \hat x_{k-1}$, and $\theta_k \to 1/\lambda_k$ translates the method of [29] into our notation.
For Tseng's third APG method, identical observations are valid, except that we define $d(x) := \eta(x) - \eta(z_0)$ instead.

It can be shown that both Nesterov's and Tseng's methods attain the optimal convergence rate; Nesterov's method (9) and Tseng's third APG method (11) satisfy

$$\forall k \ge 0, \quad f(\hat x_k) - f(x^*) \le \frac{4L\, d(x^*)}{\sigma (k+1)(k+2)},$$

while Tseng's second APG method (10) satisfies

$$\forall k \ge 0, \quad f(\hat x_k) - f(x^*) \le \frac{4L\, \xi(x_0, x^*)}{\sigma (k+2)^2}.$$

The convergence analyses of these three methods are performed in distinct ways. What we propose in Section 5 is a universal analysis for them using the unifying framework developed in Section 3.
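To make the iteration (10) concrete, the following minimal sketch runs Tseng's second APG method in an unconstrained Euclidean setting and compares the objective gap with the bound $4L\,\xi(x_0, x^*)/(\sigma(k+2)^2)$ stated above. The setting is an assumption chosen for the example: $Q = \mathbb{R}^n$, $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ (so $\sigma = 1$), and a diagonal quadratic $f(x) = \frac{1}{2}\sum_i a_i x_i^2 - \sum_i b_i x_i$ with $a_i > 0$, for which $L = \max_i a_i$ and $x^*_i = b_i/a_i$. Under these assumptions subproblem (a), with $z_{k-1}$ replaced by $\hat z_{k-1}$, has the closed form $\hat z_k = \hat z_{k-1} - \frac{\lambda_k}{L}\nabla f(x_k)$. All names are hypothetical.

```python
import math


def tseng_second_apg(a, b, x0, num_iters):
    """Tseng's second APG (10) for f(x) = 0.5*sum(a_i x_i^2) - sum(b_i x_i), Q = R^n.

    Returns the list of gaps f(x_hat_k) - f(x*) for k = 0..num_iters.
    """
    lip = max(a)  # Lipschitz constant of the gradient

    def f(x):
        return sum(0.5 * ai * xi * xi - bi * xi for ai, bi, xi in zip(a, b, x))

    def grad(x):
        return [ai * xi - bi for ai, bi, xi in zip(a, b, x)]

    f_star = f([bi / ai for ai, bi in zip(a, b)])

    lam = 1.0        # lambda_0
    s_sum = lam      # S_0 = sum of the weights so far
    # z_hat_0 from (a) at x_0 with z_{-1} = x_0.
    z_hat = [xi - (lam / lip) * gi for xi, gi in zip(x0, grad(x0))]
    x_hat = list(z_hat)
    gaps = [f(x_hat) - f_star]
    for _ in range(num_iters):
        lam_next = (1.0 + math.sqrt(1.0 + 4.0 * lam * lam)) / 2.0
        s_next = s_sum + lam_next
        tau = lam_next / s_next
        # Test point x_{k+1}, then subproblem (a) with z_k replaced by z_hat_k.
        x = [(1.0 - tau) * xh + tau * zh for xh, zh in zip(x_hat, z_hat)]
        z_next = [zh - (lam_next / lip) * gi for zh, gi in zip(z_hat, grad(x))]
        x_hat = [(1.0 - tau) * xh + tau * zn for xh, zn in zip(x_hat, z_next)]
        z_hat, lam, s_sum = z_next, lam_next, s_next
        gaps.append(f(x_hat) - f_star)
    return gaps


a, b, x0 = [1.0, 10.0, 100.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
gaps = tseng_second_apg(a, b, x0, 200)
print(gaps[0], gaps[-1])
```

Only one subproblem per iteration is solved here, in contrast to Nesterov's method (9), which needs both (a) and (b).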

The above gradient-based methods for smooth problems can be generalized to non-smooth convex problems with special structures while preserving the same iteration complexity. The structures of the composite objective function and the inexact oracle model are particularly important since they have significant applications in machine learning, compressed sensing, image processing, and statistics (see [4, 8, 29] for some examples). These structures will be detailed in Section 5. Nesterov's method (9) was generalized to the composite structure [25] and the inexact oracle model [8]. Tseng's methods were originally proposed for the composite objective function, unifying some existing methods [, 4, 23], while we have only described the particular ones for the smooth case.

It is important to note that the inexact oracle model [8] is also applicable to non-smooth problems, yielding an optimal subgradient method; more precisely, it is applicable to weakly smooth convex problems (see Remark 5). There are several universal (sub)gradient methods [7, 8, 0,, 4, 26] which are optimal for both non-smooth and smooth problems (and further generalized ones). In contrast to such universal methods, we will propose different (not universal) (sub)gradient-based methods for non-smooth and smooth problems which include some of the previously mentioned methods. A key contribution of our approach is that it provides a unified methodology for the analysis of optimal subgradient/gradient-based methods for non-smooth/smooth problems.

3 General conditions for the auxiliary functions: The unifying framework

For all the methods we reviewed for non-smooth or smooth objective functions, we need to form one or two auxiliary functions $\psi_k(x)$ and solve the corresponding subproblem(s) $\min_{x \in Q} \psi_k(x)$ at each iteration. In this section, we will propose general conditions which these auxiliary functions should satisfy in order to provide a unifying analysis.
In particular, we will see that these auxiliary functions are derived from the extended MD model (13), the DA model (14), or a mixture of them. Based on these results, we will propose a family of methods in a unifying framework for non-smooth functions in Section 4, and for structured convex problems, which include the smooth functions, in Section 5.

We use the following notations for the description and the analysis of our methods. For a point $y \in Q$, denote by $l_f(y; x) : E \to \mathbb{R} \cup \{+\infty\}$ a proper lower semicontinuous convex function with $f(x) \ge l_f(y; x)$, $\forall x \in E$, i.e., a lower approximation of $f(x)$ at $y \in Q$. The explicit description of the function $l_f(y; x)$ will be given in Sections 4 and 5 and will vary according to the properties of $f(x)$. For the function $d(x)$, we denote

$$l_d(y; x) := d(y) + \langle \nabla d(y), x - y\rangle.$$

Note that $d(x) \ge l_d(y; x)$ and $\xi(y, x) = d(x) - l_d(y; x)$ for any $x, y \in Q$.

We introduce the following two kinds of parameters for our methods.

- The weight parameters $\{\lambda_k\}_{k \ge 0}$. We assume that $\lambda_k > 0$ for all $k \ge 0$.
- The scaling parameters $\{\beta_k\}_{k \ge -1}$. We assume that $\beta_k \ge \beta_{k-1} > 0$ for all $k \ge 0$.

Note that the sequence of scaling parameters $\{\beta_k\}_{k \ge -1}$ is assumed to be non-decreasing throughout this paper. We define $S_k := \sum_{i=0}^{k} \lambda_i$. Moreover, we use $\{\hat x_k\}_{k \ge 0} \subset Q$ and $\{x_k\}_{k \ge 0} \subset Q$ for the sequences of approximate solutions and test points (at which we compute the (sub)gradients), respectively (recall that $x_0 := \arg\min_{x \in Q} d(x)$). Finally, we consider auxiliary functions $\psi_k(x)$ whose unique minimizers on $Q$ are denoted by $z_k := \arg\min_{x \in Q} \psi_k(x)$. The function $\psi_k(x)$ is assumed to be defined by $\{\lambda_i\}_{i=0}^{k}$, $\{\beta_i\}_{i=-1}^{k}$, $\{x_i\}_{i=0}^{k}$ and $\{z_i\}_{i=-1}^{k-1}$ for each $k \ge 0$. We also consider $\psi_{-1}(x)$ (and $z_{-1} := \arg\min_{x \in Q} \psi_{-1}(x)$) for convenience.
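The identities $d(x) \ge l_d(y; x)$ and $\xi(y, x) = d(x) - l_d(y; x)$ can be checked numerically on a standard non-Euclidean example. The sketch below is an illustration under assumptions that are not taken from the paper: $Q$ is the probability simplex, $\|\cdot\|$ is the $\ell_1$-norm, and $d(x) = \sum_i x_i \ln x_i + \ln n$ is the (shifted) entropy, which is strongly convex with $\sigma = 1$ with respect to the $\ell_1$-norm and vanishes at the uniform distribution. Since the entropy is differentiable only where all coordinates are positive, the example is restricted to strictly positive points. In this case the Bregman distance is the Kullback-Leibler divergence, and $\xi(z, x) \ge \frac{\sigma}{2}\|x - z\|_1^2$ is Pinsker's inequality. Function names are hypothetical.

```python
import math


def entropy_prox(x):
    """d(x) = sum_i x_i ln x_i + ln n on the simplex (requires x_i > 0)."""
    return sum(v * math.log(v) for v in x) + math.log(len(x))


def entropy_grad(x):
    """Gradient of d: (ln x_i + 1)_i."""
    return [math.log(v) + 1.0 for v in x]


def linearization(y, x):
    """l_d(y; x) = d(y) + <grad d(y), x - y>."""
    g = entropy_grad(y)
    return entropy_prox(y) + sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))


def bregman(z, x):
    """xi(z, x) = d(x) - l_d(z; x)."""
    return entropy_prox(x) - linearization(z, x)


def kl_divergence(x, z):
    """sum_i x_i ln(x_i / z_i), the closed form of xi(z, x) for the entropy."""
    return sum(xi * math.log(xi / zi) for xi, zi in zip(x, z))


x = [0.2, 0.3, 0.5]
z = [0.6, 0.3, 0.1]
l1_dist = sum(abs(a - b) for a, b in zip(x, z))
print(bregman(z, x), kl_divergence(x, z), 0.5 * l1_dist ** 2)
```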

The following property is the fundamental one for the construction of the auxiliary functions $\{\psi_k(x)\}_{k \ge -1}$ in our unifying framework.

Property 2. Let $\{\lambda_k\}_{k \ge 0}$ be a sequence of weight parameters, $\{\beta_k\}_{k \ge -1}$ be a sequence of scaling parameters, and $\{x_k\}_{k \ge 0}$ be a sequence of test points. Let $\psi_k(x)$ be auxiliary functions which are determined by $\{\lambda_i\}_{i=0}^{k}$, $\{\beta_i\}_{i=-1}^{k}$, $\{x_i\}_{i=0}^{k}$, and $\{z_i\}_{i=-1}^{k-1}$, where $z_i := \arg\min_{x \in Q} \psi_i(x)$, for each $k \ge -1$. Then the following conditions hold:

(i) $\min_{x \in Q} \psi_{-1}(x) = 0$ and $z_{-1} = x_0$.

(ii) The following inequality holds for every $k \ge -1$:
$$\forall x \in Q, \quad \psi_{k+1}(x) \ge \min_{z \in Q} \psi_k(z) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x).$$

(iii) The following inequality holds for every $k \ge 0$:
$$\min_{x \in Q} \psi_k(x) \le \min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k l_d(z_k; x) \right\}.$$

For the construction of an auxiliary function, the following lemma [28, Property 2] is useful.

Lemma 3. Let $h : E \to \mathbb{R} \cup \{+\infty\}$ be a proper lower semicontinuous convex function with $Q \subseteq \operatorname{dom} h$, and let $\beta$ be a positive number. Denote $\psi(x) = h(x) + \beta d(x)$. Then the minimization problem $\min_{x \in Q} \psi(x)$ has a unique solution $z^* \in Q$, and it satisfies
$$\psi(x) \ge \psi(z^*) + \beta \xi(z^*, x), \quad \forall x \in Q.$$

We now propose a family of auxiliary functions which satisfy Property 2. The construction, which we refer to as (12), is as follows.

(0) Define $\psi_{-1}(x) := \beta_{-1} d(x)$.

(1) For each $k \ge -1$, define $\psi_{k+1}(x)$ by either the extended Mirror-Descent (MD) model (13) or the Dual-Averaging (DA) model (14).

Extended MD model:
$$\psi_{k+1}(x) := \min_{z \in Q} \psi_k(z) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x). \tag{13}$$

DA model:
$$\psi_{k+1}(x) := \psi_k(x) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x). \tag{14}$$

In both cases, $\psi_{k+1}(x)$ is proper, lower semicontinuous, and strongly convex on $Q$. The following result plays a crucial role in the development of our methods.

Proposition 4. Any sequence of auxiliary functions $\{\psi_k(x)\}$ constructed by (12) satisfies Property 2.

Proof. Since $\min_{x \in Q} d(x) = d(x_0) = 0$, the function $\psi_{-1}(x) = \beta_{-1} d(x)$ satisfies the condition (i). If we construct $\psi_{k+1}(x)$ by (13), then it is clear that the condition (ii) holds. Let us consider the case (14).
Notice that, for the construction (12), we can show by induction that the functions $h_k(x) := \psi_k(x) - \beta_k d(x)$ are always proper, lower semicontinuous, and convex. Thus Lemma 3 implies that $\psi_k(x) \ge \min_{z \in Q} \psi_k(z) + \beta_k \xi(z_k, x)$ for every $x \in Q$. Therefore, we obtain

$$\begin{aligned}
\psi_{k+1}(x) &= \psi_k(x) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x) \\
&\ge \left[\min_{z \in Q} \psi_k(z) + \beta_k \xi(z_k, x)\right] + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k d(x) \\
&= \min_{z \in Q} \psi_k(z) + \lambda_{k+1} l_f(x_{k+1}; x) + \beta_{k+1} d(x) - \beta_k l_d(z_k; x)
\end{aligned}$$

for all $x \in Q$, where the last equality uses $\xi(z_k, x) = d(x) - l_d(z_k; x)$.

Let us finally prove the condition (iii) by induction. We actually show that it is also valid for $k = -1$. The case $k = -1$ is due to the optimality condition for $z_{-1} = \arg\min_{x \in Q} \psi_{-1}(x) = \arg\min_{x \in Q} \beta_{-1} d(x)$, that is, $\min_{x \in Q} \beta_{-1} d(x) = \min_{x \in Q} \beta_{-1} l_d(z_{-1}; x)$ holds. Suppose that the condition (iii) holds for some $k \ge -1$. Consider the auxiliary function $\psi_{k+p}(x)$ for a positive integer $p$ defined as follows. Define $\psi_{k+1}(x)$ by (13) and define $\psi_{k+i+1}(x)$ by (14) for $i = 1, \ldots, p-1$. Then

$$\psi_{k+p}(x) = \psi_k(z_k) + \sum_{i=k+1}^{k+p} \lambda_i l_f(x_i; x) + \beta_{k+p} d(x) - \beta_k l_d(z_k; x), \tag{15}$$

and using Lemma 3 we have

$$\begin{aligned}
\min_{z \in Q} \psi_{k+p}(z) &\le \psi_{k+p}(x) - \beta_{k+p} \xi(z_{k+p}, x) \\
&= \left[\psi_k(z_k) + \sum_{i=k+1}^{k+p} \lambda_i l_f(x_i; x) + \beta_{k+p} d(x) - \beta_k l_d(z_k; x)\right] - \beta_{k+p} \xi(z_{k+p}, x) \\
&\overset{\text{(iii)}}{\le} \left[\sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k l_d(z_k; x)\right] + \sum_{i=k+1}^{k+p} \lambda_i l_f(x_i; x) + \beta_{k+p} d(x) - \beta_k l_d(z_k; x) - \beta_{k+p} \xi(z_{k+p}, x) \\
&= \sum_{i=0}^{k+p} \lambda_i l_f(x_i; x) + \beta_{k+p} l_d(z_{k+p}; x)
\end{aligned}$$

for all $x \in Q$. This proves the condition (iii) for $\psi_{k+p}(x)$. It is, therefore, enough to prove the condition (iii) in the case when the auxiliary function $\psi_k(x)$ is defined only by (14) updates. We have $\psi_k(x) = \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k d(x)$ in this case, and again Lemma 3 implies that

$$\min_{z \in Q} \psi_k(z) \le \psi_k(x) - \beta_k \xi(z_k, x) = \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k l_d(z_k; x)$$

for every $x \in Q$. □

Remark 5. Proposition 4 proves that the following auxiliary functions satisfy Property 2 for appropriate choices of the $l_f(x_i; x)$'s.
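- Constructing $\{\psi_k(x)\}$ by (12) with only extended MD model updates (13) yields
$$\begin{aligned}
\psi_k(x) &= \min_{z \in Q} \psi_{k-1}(z) + \lambda_k l_f(x_k; x) + \beta_k d(x) - \beta_{k-1} l_d(z_{k-1}; x), \\
z_k &= \arg\min_{x \in Q} \left\{ \lambda_k l_f(x_k; x) + \beta_k d(x) - \beta_{k-1} l_d(z_{k-1}; x) \right\},
\end{aligned} \tag{16}$$
which coincides with the MDM (2) for $\beta_k = 1$ and $x_k = z_{k-1}$, and with the subproblems of Tseng's second APG method (10) for $\beta_k = L/\sigma$.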

- Constructing $\{\psi_k(x)\}$ by (12) with only DA model updates (14) yields
$$\psi_k(x) = \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k d(x), \quad z_k = \arg\min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k d(x) \right\}, \tag{17}$$
which coincides with the DAM (4), and with the subproblems of Tseng's third APG method (11) for $\beta_k = L/\sigma$.

Notice that the pure extended MD model updates (16) consider only the most recent $l_f(x_k; x)$, while the DA model updates (17) accumulate all the $l_f(x_i; x)$'s. Moreover, Proposition 4 shows that Property 2 is satisfied even if we mix the strategies (16) and (17), which corresponds to selecting some of the previous $l_f(x_i; x)$'s to define the subproblem, as shown in (15).

Note that, for a fixed $\psi_k(x)$, the construction (13) of $\psi_{k+1}(x)$ is the minimal choice which satisfies Property 2; according to (ii), any auxiliary function $\psi_{k+1}(x)$ majorizes the one defined by (13) on the set $Q$.

To conclude this section, we define the following relation based on Nesterov's approach [23] (see also [27]); we propose (sub)gradient-based methods which generate approximate solutions $\{\hat x_k\} \subset Q$ satisfying the following relation for every $k \ge 0$:

$$(R_k) \qquad S_k f(\hat x_k) \le \min_{x \in Q} \psi_k(x) + C_k, \tag{18}$$

where $C_k$ is defined according to the problem structure. This relation yields the following lemma, which provides a convergence rate for all methods.

Lemma 6. Let $\{\psi_k(x)\}$ be a sequence of auxiliary functions satisfying Property 2 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. If a sequence $\{\hat x_k\} \subset Q$ satisfies the relation $(R_k)$ for some $k \ge 0$, then we have

$$f(\hat x_k) - f(x^*) \le \frac{\beta_k l_d(z_k; x^*) + C_k}{S_k},$$

where $z_k := \arg\min_{x \in Q} \psi_k(x)$.

Proof. Since $\sum_{i=0}^{k} \lambda_i l_f(x_i; x) \le S_k f(x)$ for all $x \in Q$, using the condition (iii) of Property 2 yields

$$\min_{x \in Q} \psi_k(x) \le \min_{x \in Q} \left\{ \sum_{i=0}^{k} \lambda_i l_f(x_i; x) + \beta_k l_d(z_k; x) \right\} \le \min_{x \in Q} \left\{ S_k f(x) + \beta_k l_d(z_k; x) \right\} \le S_k f(x^*) + \beta_k l_d(z_k; x^*).$$

Therefore, the relation $(R_k)$ implies $S_k f(\hat x_k) \le \min_{x \in Q} \psi_k(x) + C_k \le S_k f(x^*) + \beta_k l_d(z_k; x^*) + C_k$. □

4 A family of subgradient-based methods in the unifying framework

4.1 General subgradient-based methods in the unifying framework

In this section, we propose novel subgradient-based methods for solving problem (1) with a non-smooth objective function. Throughout this section, we assume that subgradients $g(y) \in \partial f(y)$ of the objective function $f$ are computable at any point $y \in Q$, and that a lower approximation $l_f(y; \cdot)$ at the same point is defined by

$$l_f(y; x) := f(y) + \langle g(y), x - y\rangle, \quad x \in Q.$$

For a test point $x_k \in Q$, we denote $g_k = g(x_k) \in \partial f(x_k)$. In this case, the subproblems $z_k = \arg\min_{x \in Q} \psi_k(x)$ constructed from (12) are of the form

$$\min_{x \in Q} \{\langle s, x\rangle + \beta d(x)\} \tag{19}$$

for some $s \in E^*$ and $\beta > 0$. We use the following lemma for our analysis.

Lemma 7. Let $\{x_k\}_{k \ge 0} \subset Q$ and $g_k \in \partial f(x_k)$, $k \ge 0$. Then, for $\lambda \in \mathbb{R}$, $\beta > 0$ and $x, z \in Q$, we have

$$\langle \lambda g_k, x - z\rangle + \beta \xi(z, x) + \frac{1}{2\sigma\beta}\|\lambda g_k\|_*^2 \ge 0, \quad \forall k \ge 0,$$

and, in particular,

$$\lambda l_f(x_k; x) + \beta \xi(x_k, x) + \frac{\lambda^2}{2\sigma\beta}\|g_k\|_*^2 \ge \lambda f(x_k), \quad \forall k \ge 0.$$

Proof. Since the inequality $\frac{1}{2}\|x\|^2 + \frac{1}{2}\|s\|_*^2 \ge -\langle s, x\rangle$ holds for every $x \in E$ and $s \in E^*$, we have

$$\langle \lambda g_k, x - z\rangle + \beta \xi(z, x) + \frac{1}{2\sigma\beta}\|\lambda g_k\|_*^2 \ge \langle \lambda g_k, x - z\rangle + \frac{\sigma\beta}{2}\|x - z\|^2 + \frac{1}{2\sigma\beta}\|\lambda g_k\|_*^2 \ge 0.$$

Substituting $z = x_k$ into this inequality and adding $\lambda f(x_k)$ to both sides, we obtain the second assertion. □

Let us consider the relation $(R_k)$ defined in the previous section with

$$C_k := \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2. \tag{20}$$

We also use the following alternative relation:

$$(\hat R_k) \qquad \sum_{i=0}^{k} \lambda_i f(x_i) \le \min_{x \in Q} \psi_k(x) + C_k. \tag{21}$$

Note that the relation $(\hat R_k)$ provides an alternative to Lemma 6 which can be proven in the same way: if $\{\psi_k(x)\}$ satisfies Property 2 and the relation $(\hat R_k)$ is satisfied for some $k \ge 0$, then we have

$$\frac{1}{S_k}\sum_{i=0}^{k} \lambda_i f(x_i) - f(x^*) \le \frac{\beta_k l_d(z_k; x^*) + C_k}{S_k}. \tag{22}$$

Now, let us show the following key result, which will provide efficient subgradient-based methods in a straightforward way.

Theorem 8. Let $\{\psi_k(x)\}$ be a sequence of auxiliary functions satisfying Property 2 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Denote $z_k = \arg\min_{x \in Q} \psi_k(x)$ and define $C_k$ by (20). Then the following assertions hold.

(a) The relations $(R_0)$ and $(\hat R_0)$ are satisfied by setting $\hat x_0 := x_0$.

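(b) Suppose that the relation $(R_k)$ is satisfied for some integer $k \ge 0$. If the relation $x_{k+1} = z_k$ holds, then the relation $(R_{k+1})$ is satisfied by setting $\hat{x}_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} x_{k+1}}{S_{k+1}}$. Moreover, if the relation $(\hat{R}_k)$ is satisfied for some $k \ge 0$ and $x_{k+1} = z_k$ holds, then $(\hat{R}_{k+1})$ is satisfied.

(b') Suppose that the relation $(R_k)$ is satisfied for some integer $k \ge 0$. If the relation $x_{k+1} = \frac{S_k \hat{x}_k + \lambda_{k+1} z_k}{S_{k+1}}$ holds, then the relation $(R_{k+1})$ is satisfied by setting $\hat{x}_{k+1} := x_{k+1}$.

Proof. We remark that, using condition (ii) of Property 2, we obtain the inequality

$$\forall k \ge -1, \quad \min_{x \in Q} \psi_{k+1}(x) \ge \min_{x \in Q} \psi_k(x) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1})$$

by setting $x = z_{k+1} = \operatorname{argmin}_{x \in Q} \psi_{k+1}(x)$ (recall that $d(x) \ge 0$ for $x \in Q$ and $\beta_{k+1} \ge \beta_k$).

(a) Letting $k = -1$ in condition (ii) and using condition (i) of Property 2, we have

$$\begin{aligned}
\min_{x \in Q} \psi_0(x) + \frac{\lambda_0^2}{2\sigma\beta_{-1}} \|g_0\|_*^2
&\ge \Big[ \min_{x \in Q} \psi_{-1}(x) + \lambda_0 l_f(x_0;z_0) + \beta_{-1} \xi(z_{-1},z_0) \Big] + \frac{\lambda_0^2}{2\sigma\beta_{-1}} \|g_0\|_*^2 \\
&= \lambda_0 l_f(x_0;z_0) + \beta_{-1} \xi(z_{-1},z_0) + \frac{\lambda_0^2}{2\sigma\beta_{-1}} \|g_0\|_*^2 \\
&= \lambda_0 l_f(x_0;z_0) + \beta_{-1} \xi(x_0,z_0) + \frac{\lambda_0^2}{2\sigma\beta_{-1}} \|g_0\|_*^2 \\
&\ge \lambda_0 f(x_0) = S_0 f(\hat{x}_0),
\end{aligned}$$

where the last inequality is due to Lemma 7.

(b) By condition (ii) of Property 2 and the assumptions on $x_{k+1}$ and $\hat{x}_{k+1}$, we obtain

$$\begin{aligned}
&\min_{x \in Q} \psi_{k+1}(x) + \frac{1}{2\sigma} \sum_{i=0}^{k+1} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2 \\
&\quad \ge \Big[ \min_{x \in Q} \psi_k(x) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1}) \Big] + \Big[ \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 + \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2 \Big] \\
&\quad = \Big[ \min_{x \in Q} \psi_k(x) + \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2 \Big] + \Big[ \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(x_{k+1},z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 \Big] \\
&\quad \ge \Big[ \min_{x \in Q} \psi_k(x) + \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2 \Big] + \lambda_{k+1} f(x_{k+1}) \\
&\quad \ge S_k f(\hat{x}_k) + \lambda_{k+1} f(x_{k+1}) \\
&\quad \ge S_{k+1} f\Big( \frac{S_k \hat{x}_k + \lambda_{k+1} x_{k+1}}{S_{k+1}} \Big) = S_{k+1} f(\hat{x}_{k+1}),
\end{aligned}$$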

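where we used Lemma 7, the relation $(R_k)$, and the convexity of $f$ in the last three inequalities, respectively. This implies that the relation $(R_{k+1})$ holds. Moreover, replacing the use of $(R_k)$ by $(\hat{R}_k)$ in the above inequality, we obtain the relation $(\hat{R}_{k+1})$, which proves the latter assertion.

(b') Denote $x'_{k+1} = \frac{S_k \hat{x}_k + \lambda_{k+1} z_{k+1}}{S_{k+1}}$. Then the relation $x_{k+1} = \frac{S_k \hat{x}_k + \lambda_{k+1} z_k}{S_{k+1}}$ yields $z_{k+1} - z_k = \frac{S_{k+1}}{\lambda_{k+1}} (x'_{k+1} - x_{k+1})$. Thus condition (ii) of Property 2 and the relation $(R_k)$ imply that

$$\begin{aligned}
&\min_{x \in Q} \psi_{k+1}(x) + \frac{1}{2\sigma} \sum_{i=0}^{k+1} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2 \\
&\quad \ge \min_{x \in Q} \psi_k(x) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 + \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2 \\
&\quad \ge S_k f(\hat{x}_k) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 \\
&\quad \ge S_k l_f(x_{k+1};\hat{x}_k) + \lambda_{k+1} l_f(x_{k+1};z_{k+1}) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 \\
&\quad = S_{k+1} l_f\Big( x_{k+1}; \frac{S_k \hat{x}_k + \lambda_{k+1} z_{k+1}}{S_{k+1}} \Big) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 \\
&\quad = S_{k+1} l_f(x_{k+1};x'_{k+1}) + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 \\
&\quad = S_{k+1} f(x_{k+1}) + \langle S_{k+1} g_{k+1}, x'_{k+1} - x_{k+1} \rangle + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 \\
&\quad = S_{k+1} f(x_{k+1}) + \langle \lambda_{k+1} g_{k+1}, z_{k+1} - z_k \rangle + \beta_k \xi(z_k,z_{k+1}) + \frac{\lambda_{k+1}^2}{2\sigma\beta_k} \|g_{k+1}\|_*^2 \\
&\quad \ge S_{k+1} f(x_{k+1}) = S_{k+1} f(\hat{x}_{k+1}),
\end{aligned}$$

where the last inequality is due to Lemma 7.

Now we are ready to propose the following two novel subgradient-based methods for the nonsmooth case.

Method 9 (General subgradient-based methods). Choose weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$. Generate sequences $\{(z_{k-1}, x_k, g_k, \hat{x}_k)\}_{k \ge 0}$ by

$$\text{(a)} \quad x_k := z_{k-1} := \operatorname{argmin}_{x \in Q} \psi_{k-1}(x), \quad \hat{x}_k := \frac{1}{S_k} \sum_{i=0}^{k} \lambda_i x_i, \quad g_k \in \partial f(x_k), \quad \text{for } k \ge 0 \tag{23}$$

or by

$$\text{(b)} \quad z_{k-1} := \operatorname{argmin}_{x \in Q} \psi_{k-1}(x), \quad \hat{x}_k := x_k := \frac{1}{S_k} \sum_{i=0}^{k} \lambda_i z_{i-1}, \quad g_k \in \partial f(x_k), \quad \text{for } k \ge 0 \tag{24}$$

where $\{\psi_k(x)\}_{k \ge -1}$ is defined using the construction (12) or any other construction which admits Property 2.

Notice that the sequences $\{z_k\}_{k \ge -1}$ and $\{x_k\}_{k \ge 0}$ are dummy ones for the methods (a) and (b), respectively, but we keep them to preserve the notation.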
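The two procedures of Method 9 differ only in where the subgradient is evaluated, which the following minimal sketch tries to make concrete. Everything specific in it is an illustrative assumption and not part of the paper: the auxiliary function is taken to be of dual-averaging type, so that $z_k$ solves a subproblem of the form (19) with $s = \sum_{i \le k} \lambda_i g_i$; the prox-function is $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ (so $\sigma = 1$); $Q$ is the box $[-1,1]^n$; and $f(x) = \|x - c\|_1$. The helper names `project_box`, `l1_oracle`, and `method9` are hypothetical.

```python
import math

def project_box(v, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box Q = [lo, hi]^n (assumed feasible set)."""
    return [min(hi, max(lo, t)) for t in v]

def l1_oracle(x, c):
    """Return f(x) = ||x - c||_1 and one subgradient (0 is used at kinks)."""
    fx = sum(abs(a - b) for a, b in zip(x, c))
    g = [float((a > b) - (a < b)) for a, b in zip(x, c)]
    return fx, g

def method9(c, x0, lam, beta, iters, procedure="a"):
    """Sketch of procedures (23) ("a") and (24) ("b") with sigma = 1.

    lam(k) is the weight for k >= 0; beta(k) is the nondecreasing scaling
    parameter for k >= -1. Returns x_hat_k and the constant C_k of (20).
    """
    n = len(x0)
    s = [0.0] * n            # s_k = sum_{i<=k} lam(i) * g_i
    S = 0.0                  # S_k = sum_{i<=k} lam(i)
    C = 0.0                  # C_k from (20)
    z_prev = list(x0)        # z_{-1} = x0 = argmin of d over Q
    wsum = [0.0] * n         # sum_{i<=k} lam(i) * z_{i-1}
    x_hat = list(x0)
    for k in range(iters + 1):
        S += lam(k)
        wsum = [w + lam(k) * z for w, z in zip(wsum, z_prev)]
        x_hat = [w / S for w in wsum]
        # (23): test point x_k = z_{k-1};  (24): test point x_k = x_hat_k
        x = z_prev if procedure == "a" else x_hat
        _, g = l1_oracle(x, c)
        C += lam(k) ** 2 * sum(t * t for t in g) / (2.0 * beta(k - 1))
        s = [a + lam(k) * b for a, b in zip(s, g)]
        # z_k = argmin over Q of <s_k, x> + beta(k) * 0.5 * ||x - x0||^2
        z_prev = project_box([a - b / beta(k) for a, b in zip(x0, s)])
    return x_hat, C
```

With, for example, `lam = lambda k: 1.0` and `beta = lambda k: math.sqrt(k + 2.0)`, both procedures return an averaged point together with the constant $C_k$ of (20), so the gap $f(\hat{x}_k) - f(x^*)$ can be compared directly with $(\beta_k d(x^*) + C_k)/S_k$.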

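4.2 Convergence analysis of general subgradient-based methods

Corollary 10. Given the weight parameters $\{\lambda_k\}_{k \ge 0}$, the scaling parameters $\{\beta_k\}_{k \ge -1}$, and any sequence $\{(z_{k-1}, x_k, g_k, \hat{x}_k)\}_{k \ge 0}$ generated by

(a) the first procedure (23) in Method 9, we have

$$f(\hat{x}_k) - f(x^*) \le \frac{1}{S_k} \sum_{i=0}^{k} \lambda_i f(x_i) - f(x^*) \le \frac{1}{S_k} \Big( \beta_k l_d(z_k;x^*) + \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2 \Big) \tag{25}$$

for all $k \ge 0$; or

(b) the second procedure (24) in Method 9, we have

$$f(\hat{x}_k) - f(x^*) \le \frac{1}{S_k} \Big( \beta_k l_d(z_k;x^*) + \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_{i-1}} \|g_i\|_*^2 \Big) \tag{26}$$

for all $k \ge 0$.

Proof. The first inequality in (25) follows from the convexity of $f(x)$. Proposition 4 and Theorem 8 show that the sequences generated by the procedures (23) and (24) satisfy the relation $(R_k)$; furthermore, the former construction (23) also satisfies $(\hat{R}_k)$. Thus, Lemma 6 and its alternative (22) for $(\hat{R}_k)$ prove the assertion.

In [24], Nesterov proposed the use of the auxiliary sequence (6) to ensure an efficient convergence of the DAM (4). This sequence also satisfies the identity

$$\forall k \ge 0, \quad \hat{\beta}_k = \sum_{i=-1}^{k-1} \frac{1}{\hat{\beta}_i} \tag{27}$$

and the inequality

$$\sqrt{2k+1} \le \hat{\beta}_k \le \frac{1}{1+\sqrt{3}} + \sqrt{2k+1} \quad (k \ge 0). \tag{28}$$

Corollary 11 (see also [24]). Consider the following two choices for the parameters.

(Simple Averages) Let $\{(z_{k-1}, x_k, g_k, \hat{x}_k)\}_{k \ge 0}$ be generated by Method 9 with parameters $\lambda_k := 1$ and $\beta_k := \gamma \hat{\beta}_k$ for some $\gamma > 0$. Then we have

$$\forall k \ge 0, \quad f(\hat{x}_k) - f(x^*) \le \Big( \gamma l_d(z_k;x^*) + \frac{M_k^2}{2\sigma\gamma} \Big) \frac{0.5 + \sqrt{2k+1}}{k+1} \tag{29}$$

and

$$\forall k \ge -1, \quad z_k, x_{k+1}, \hat{x}_{k+1} \in \Big\{ x \in Q : \|x - x^*\|^2 \le \frac{2 d(x^*)}{\sigma} + \frac{M_k^2}{\sigma^2 \gamma^2} \Big\}, \tag{30}$$

where $M_{-1} = 0$ and $M_k = \max_{0 \le i \le k} \|g_i\|_*$ for $k \ge 0$.

(Weighted Averages) Let $\{(z_{k-1}, x_k, g_k, \hat{x}_k)\}_{k \ge 0}$ be generated by Method 9 with parameters $\lambda_k := 1/\|g_k\|_*$ and $\beta_k := \hat{\beta}_k / (\rho \sqrt{\sigma})$ for some $\rho > 0$. Then we have

$$\forall k \ge 0, \quad f(\hat{x}_k) - f(x^*) \le \frac{M_k}{\sqrt{\sigma}} \Big( \frac{l_d(z_k;x^*)}{\rho} + \frac{\rho}{2} \Big) \frac{0.5 + \sqrt{2k+1}}{k+1} \tag{31}$$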

and

$$\forall k \ge -1, \quad z_k, x_{k+1}, \hat{x}_{k+1} \in \Big\{ x \in Q : \|x - x^*\|^2 \le \frac{2 d(x^*) + \rho^2}{\sigma} \Big\}. \tag{32}$$

Moreover, for both simple and weighted averages, the quantity $f(\hat{x}_k) - f(x^*)$ above can be replaced by its upper bound $\frac{1}{S_k} \sum_{i=0}^{k} \lambda_i f(x_i) - f(x^*)$ when we use the first procedure (23) in Method 9. In this case, the left-hand side of the inequality can be replaced by $\min\{ f(\hat{x}_k) - f(x^*), \min_{0 \le i \le k} f(x_i) - f(x^*) \}$.

Proof. Substituting the specified $\lambda_k$ and $\beta_k$ into the estimates in Corollary 10 and using the properties (27) and (28) of $\hat{\beta}_k$, we obtain (29) and (31), respectively. Denote by $B_k$ the ball on the right-hand side of (30) for $k \ge -1$. Then $B_k \subset B_{k+1}$ for each $k \ge -1$. The inequality (29) implies that $\gamma l_d(z_k;x^*) + (2\sigma\gamma)^{-1} M_k^2 \ge 0$ for all $k \ge 0$. Combined with the strong convexity inequality $d(x^*) \ge l_d(z_k;x^*) + \frac{\sigma}{2} \|x^* - z_k\|^2$, this shows that $z_k \in B_k$ for each $k \ge 0$. We also have $z_{-1} \in B_{-1}$, since $z_{-1} = x_0 = \operatorname{argmin}_{x \in Q} d(x)$, $d(z_{-1}) = d(x_0) = 0$, and $d(x^*) \ge l_d(z_{-1};x^*) + \frac{\sigma}{2} \|z_{-1} - x^*\|^2 \ge \frac{\sigma}{2} \|z_{-1} - x^*\|^2$. Finally, we conclude that $x_{k+1}, \hat{x}_{k+1} \in B_k$ for all $k \ge -1$ because they are convex combinations of $\{z_i\}_{i=-1}^{k}$. The proof of (32) is similar.

Remark 12. Notice that in our approach, the bounds in (29) and (31) are slightly smaller than the ones in (3.3) and (3.5) in [24], respectively, since $l_d(z_k;x^*) \le d(x^*) \le D$. However, essentially, Nesterov's original argument also arrives at the same bound when $d(x)$ is continuously differentiable on $Q$ (note that the argument in [24] does not impose the differentiability of $d(x)$). In fact, in [24], Theorems 2 and 3 rely on the estimate (2.5), which is implied by (2.8). Notice in (2.8) that we have

$$-V_{\beta_{k+1}}(-s_{k+1}) = \min_{x \in Q} \{ \langle s_{k+1}, x - x_0 \rangle + \beta_{k+1} d(x) \} = \min_{x \in Q} \{ \langle s_{k+1}, x - x_0 \rangle + \beta_{k+1} l_d(x_{k+1};x) \}$$

by the optimality of $x_{k+1} = \pi_{\beta_{k+1}}(-s_{k+1})$. Then, adding $\sum_{i=0}^{k} \lambda_i [f(x_i) + \langle g_i, x_0 - x_i \rangle]$ and using $s_{k+1} = \sum_{i=0}^{k} \lambda_i g_i$ in (2.8) yields

$$\sum_{i=0}^{k} \lambda_i f(x_i) \le \min_{x \in Q} \Big\{ \sum_{i=0}^{k} \lambda_i [f(x_i) + \langle g_i, x - x_i \rangle] + \beta_{k+1} l_d(x_{k+1};x) \Big\} + \frac{1}{2\sigma} \sum_{i=0}^{k} \frac{\lambda_i^2}{\beta_i} \|g_i\|_*^2,$$

which corresponds to the relation $(\hat{R}_k)$ (see footnote 2). Thus we obtain the same bound as our analysis for the DA model.
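The identity (27) and the two-sided bound (28) on $\hat{\beta}_k$, which drive the rates in Corollary 11, can be checked numerically. The sketch below assumes that the auxiliary sequence (6) is Nesterov's recursion written with the index shift used here, namely $\hat{\beta}_{-1} = \hat{\beta}_0 = 1$ and $\hat{\beta}_{k+1} = \hat{\beta}_k + 1/\hat{\beta}_k$; that recursion is not restated in this section, so it should be read as an assumption. The function names are hypothetical.

```python
import math

def beta_hat(n):
    """Return [beta_hat_{-1}, beta_hat_0, ..., beta_hat_n]; list index = k + 1."""
    seq = [1.0, 1.0]                       # assumed start: beta_hat_{-1} = beta_hat_0 = 1
    while len(seq) < n + 2:
        seq.append(seq[-1] + 1.0 / seq[-1])
    return seq

def check_beta_hat(n):
    """Assert identity (27) and bounds (28) for k = 0, ..., n."""
    b = beta_hat(n)
    for k in range(n + 1):
        bk = b[k + 1]
        # identity (27): beta_hat_k = sum_{i=-1}^{k-1} 1 / beta_hat_i
        partial = sum(1.0 / b[i + 1] for i in range(-1, k))
        assert abs(bk - partial) <= 1e-9 * bk
        # bounds (28): sqrt(2k+1) <= beta_hat_k <= 1/(1+sqrt(3)) + sqrt(2k+1)
        root = math.sqrt(2 * k + 1)
        assert root - 1e-12 <= bk <= 1.0 / (1.0 + math.sqrt(3.0)) + root + 1e-12
    return b
```

Since $\hat{\beta}_k$ grows like $\sqrt{2k}$ while $S_k = k + 1$ for simple averages, the ratio $\hat{\beta}_k / S_k$ is what produces the $O(1/\sqrt{k})$ factor in (29).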
A consequence of Corollary 11 is that if $M := \sup\{ \|g\|_* : g \in \partial f(x),\ x \in Q \}$ is finite, Method 9 generates a sequence $\{\hat{x}_k\}$ such that $f(\hat{x}_k) \to f(x^*)$ at a rate of $O(1/\sqrt{k})$ in the number $k$ of iterations. In particular, the estimates (29) and (31) achieve the optimal complexity for the nonsmooth case when we choose $\gamma := M / \sqrt{2\sigma d(x^*)}$ and $\rho := \sqrt{2 d(x^*)}$, respectively. Also, Method 9 with the parameters suggested in Corollary 11 produces bounded sequences $\{x_k\}$, $\{\hat{x}_k\}$, and $\{z_k\}$ (even if $M = +\infty$ in the Weighted Averages case). These features are similar to those of the DAM. We can obtain the optimal convergence rate if we know an upper bound for $d(x^*)$, without assuming the compactness of $Q$ or fixing the required number of iterations in advance.

Footnote 2. Notice that $x_{k+1}$ and $\beta_{k+1}$ in [24] are called $z_k$ and $\beta_k$ here, respectively.

4.3 Particular cases: The extended MD and the DA models

Restricting to the extended MD model (13) in Method 9, the first procedure (23) provides the following extension of the MDM.

Method 13 (Extended Mirror-Descent). Set $x_0 := \operatorname{argmin}_{x \in Q} d(x)$. Choose weight parameters $\{\lambda_k\}_{k \ge 0}$ and scaling parameters $\{\beta_k\}_{k \ge -1}$. Generate sequences $\{(x_k, g_k, \hat{x}_k)\}_{k \ge 0}$ by

$$g_k \in \partial f(x_k), \quad x_{k+1} := \operatorname{argmin}_{x \in Q} \big\{ \lambda_k [f(x_k) + \langle g_k, x - x_k \rangle] + \beta_k d(x) - \beta_{k-1} l_d(x_k;x) \big\}, \quad \hat{x}_k := \frac{1}{S_k} \sum_{i=0}^{k} \lambda_i x_i$$

for $k \ge 0$.

The iteration update (2) of the original MDM corresponds to Method 13 with $\beta_k := 1$. Corollary 11 shows, in particular, that this extended MDM has a better complexity bound for the objective function than the original MDM described in Section 2. It is important to note that, according to Corollary 11, the extended MDM ensures $O(1/\sqrt{k})$ convergence without a precise upper bound for $d(x^*)$ and even if the feasible region $Q$ is unbounded, while the existing averaging techniques [7, 8] of the MDM assume the compactness of $Q$.

On the other hand, restricting to the DA model (14) in Method 9, the first procedure (23) yields Nesterov's DAM (4) described in Section 2. In particular, Corollary 10 and subsequently Corollary 11 provide a small improvement over the original result, assuming the differentiability of $d(x)$, as pointed out before. Since our analysis does not introduce the dual space, the arguments are more straightforward than the original ones.

We can also obtain variants of the extended MDM and the DAM from the second procedure (24). An upper bound of $f(\hat{x}_k) - f(x^*)$ for the sequence $\{\hat{x}_k\}$ generated by these methods can be derived from Corollary 10 or 11. An interesting feature of these variants is that their convergence results concern the test points $x_k$ ($= \hat{x}_k$) themselves, rather than the average of the test points as for the extended MDM or the DAM. In particular, the variant of the DAM, i.e., the second procedure (24) with the DA model (14), corresponds to the double averaging method (7) proposed by Nesterov and Shikhman [27].
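To illustrate the update of Method 13, the following minimal sketch specializes it under assumptions that are not taken from the paper: the Euclidean prox-function $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ (so $\sigma = 1$), simple averages $\lambda_k = 1$ and $\beta_k = \gamma \hat{\beta}_k$, and the recursion $\hat{\beta}_{-1} = \hat{\beta}_0 = 1$, $\hat{\beta}_{k+1} = \hat{\beta}_k + 1/\hat{\beta}_k$ for the auxiliary sequence. In that setting, completing the square shows that the subproblem is solved by the Euclidean projection onto $Q$ of $x_0 + (\beta_{k-1}(x_k - x_0) - \lambda_k g_k)/\beta_k$. The function names and the box/$\ell_1$ test problem are hypothetical.

```python
import math

def extended_mirror_descent(subgrad, project, x0, gamma, iters):
    """Sketch of Method 13 with d(x) = 0.5*||x - x0||^2 and simple averages.

    subgrad(x) returns one subgradient of f at x; project(v) is the Euclidean
    projection onto Q; x0 must lie in Q. Returns x_hat_k for k = iters.
    """
    n = len(x0)
    bh_prev, bh_cur = 1.0, 1.0           # beta_hat_{k-1}, beta_hat_k at k = 0 (assumed start)
    x = list(x0)                         # x_0 = argmin of d over Q
    total = [0.0] * n                    # sum of test points (lambda_k = 1)
    for _ in range(iters + 1):
        total = [t + xj for t, xj in zip(total, x)]
        g = subgrad(x)
        b_prev, b_cur = gamma * bh_prev, gamma * bh_cur
        # minimizer of <g, y> + b_cur * d(y) - b_prev * <grad d(x_k), y> over Q
        v = [x0[j] + (b_prev * (x[j] - x0[j]) - g[j]) / b_cur for j in range(n)]
        x = project(v)
        bh_prev, bh_cur = bh_cur, bh_cur + 1.0 / bh_cur
    return [t / (iters + 1) for t in total]

def project_unit_box(v):
    """Euclidean projection onto Q = [-1, 1]^n (illustrative feasible set)."""
    return [min(1.0, max(-1.0, t)) for t in v]

def l1_value(x, c):
    """f(x) = ||x - c||_1 (illustrative nonsmooth objective)."""
    return sum(abs(a - b) for a, b in zip(x, c))

def l1_subgrad(c):
    """Return a subgradient oracle for f(x) = ||x - c||_1."""
    return lambda x: [float((a > b) - (a < b)) for a, b in zip(x, c)]
```

No bound on the diameter of $Q$ and no final iteration count enter the parameter choice; only $\gamma$ is tuned, in line with the discussion above.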
5 A family of (inexact) gradient-based methods for structured problems in the unifying framework

The framework discussed in Section 3 can also be applied to develop efficient (inexact) gradient-based methods for structured convex problems. In this section, we assume that the objective function $f(x)$ of problem (1) has the following structure: for any $y \in Q$, there exists a lower approximation $l_f(y;x)$ of $f(x)$, which is convex in $x$ and satisfies the inequalities

$$l_f(y;x) \le f(x) \le l_f(y;x) + \frac{L(y)}{2} \|x - y\|^2 + \delta(y), \quad \forall x \in Q, \tag{33}$$

for some $L(y) > 0$ and $\delta(y) \ge 0$. We also assume that for any $y \in Q$, $s \in E^*$, and $\beta > 0$, we can compute the optimal solution of the (sub)problem

$$\min_{x \in Q} \{ l_f(y;x) + \langle s, x \rangle + \beta d(x) \}. \tag{34}$$

Let us see some examples which admit these assumptions.

Example 14. The first four cases were already considered in the literature.

(i) Smooth case. If the convex objective function $f(x)$ is continuously differentiable on $Q$ and its gradient $\nabla f(x)$ is Lipschitz continuous on $Q$ with a constant $L > 0$, defining $l_f(y;x) := f(y) + \langle \nabla f(y), x - y \rangle$ yields the condition (33) with $L(\cdot) \equiv L$ and $\delta(\cdot) \equiv 0$. Then subproblem (34) is of the form

$$\min_{x \in Q} \{ f(y) + \langle s + \nabla f(y), x - y \rangle + \beta d(x) \}. \tag{35}$$

(ii) Composite structure. Let the objective function $f(x)$ have the form

$$f(x) = f_0(x) + \Psi(x) \tag{36}$$

where $f_0(x) : E \to \mathbb{R} \cup \{+\infty\}$ is convex and continuously differentiable on $Q$ with Lipschitz continuous gradient and $\Psi(x) : E \to \mathbb{R} \cup \{+\infty\}$ is a lower semicontinuous convex function with $Q \subset \operatorname{dom} \Psi$. Letting $L > 0$ be the Lipschitz constant of $\nabla f_0$ on $Q$, we can define $l_f(y;x) := f_0(y) + \langle \nabla f_0(y), x - y \rangle + \Psi(x)$ so that we have (33) with $L(\cdot) \equiv L$ and $\delta(\cdot) \equiv 0$. The corresponding (sub)problem has the form

$$\min_{x \in Q} \{ f_0(y) + \langle s + \nabla f_0(y), x - y \rangle + \beta d(x) + \Psi(x) \}.$$

A generalization of classical methods such as the proximal gradient method for this model was proposed by Fukushima and Mine [9] (without assuming convexity of $f_0(x)$). Nesterov's optimal method (9) can also be generalized to this case [25]. Smoothing techniques are also an important approach for this example. Nesterov [23] showed a significant improvement on the convergence rate for a particular class, and Beck and Teboulle [5] proposed a unifying generalization.

(iii) Inexact oracle model. Let us assume that our oracle for $f(x)$ has inexactness [8], that is, we can compute $(\bar{f}(y), \bar{g}(y)) \in \mathbb{R} \times E^*$ at each $y \in Q$ such that

$$0 \le f(x) - \big( \bar{f}(y) + \langle \bar{g}(y), x - y \rangle \big) \le \frac{L_y}{2} \|x - y\|^2 + \delta_y, \quad \forall x \in Q \tag{37}$$

is satisfied for some $L_y > 0$ and $\delta_y \ge 0$. Then, defining $l_f(y;x) := \bar{f}(y) + \langle \bar{g}(y), x - y \rangle$, $L(y) := L_y$, and $\delta(y) := \delta_y$, we have exactly (33). This model was investigated in [8], where the primal, dual, and fast gradient methods were proposed. These methods were also implemented in [26] for a particular class of this model, equipped with an iterative scheme to estimate the Lipschitz constants $L_y$ at each iteration. The fast gradient methods can be seen as generalizations of Nesterov's optimal method (9) to those cases.

(iv) Saddle structure. Let us consider an objective function with the following structure:

$$f(x) = \sup_{u \in U} \phi(u,x)$$

where $U$ is a compact convex set of a finite dimensional real vector space $E'$ and $\phi : U \times E \to \mathbb{R} \cup \{+\infty\}$ is a concave-convex function satisfying the following conditions.

- $\phi(\cdot,x)$ is an upper semicontinuous concave function for all $x \in Q$.
- $\phi(u,\cdot)$ is a lower semicontinuous convex function with $Q \subset \operatorname{dom} \phi(u,\cdot)$ for all $u \in U$.
- For all $u \in U$, $\phi(u,\cdot)$ is continuously differentiable on $Q$ and its gradient is Lipschitz continuous on $Q$, i.e., there exists a constant $L_u \ge 0$ such that $\|\nabla_x \phi(u,x_1) - \nabla_x \phi(u,x_2)\|_* \le L_u \|x_1 - x_2\|$, $\forall x_1, x_2 \in Q$.

- $L := \max_{u \in U} L_u$ is finite and positive.

Then, defining

$$l_f(y;x) := \max_{u \in U} \{ \phi(u,y) + \langle \nabla_x \phi(u,y), x - y \rangle \}, \tag{38}$$

it satisfies condition (33) with $L(\cdot) \equiv L$ and $\delta(\cdot) \equiv 0$, and we have the following subproblem:

$$\min_{x \in Q} \Big\{ \max_{u \in U} \{ \phi(u,y) + \langle s + \nabla_x \phi(u,y), x - y \rangle \} + \beta d(x) \Big\}.$$

This case is a generalization of the structured convex problem discussed in [20], namely, $E' \equiv \mathbb{R}^m$ and, for each $u = (u^{(1)}, \ldots, u^{(m)}) \in U$, $\phi(u,x) = \sum_{i=1}^{m} u^{(i)} f_i(x)$ for given differentiable convex functions $f_1(x), \ldots, f_m(x)$ on $E$ with Lipschitz continuous gradients. The convexity of $\phi(u,\cdot)$ is ensured by imposing the following assumption as in [20]: if there exists $u \in U$ such that $u^{(i)} < 0$, then $f_i(x)$ is a linear function. Letting $L^{(i)}$ be a Lipschitz constant of $\nabla f_i(x)$ for $i = 1, \ldots, m$, we have $L = \max_{u \in U} L_u = \max_{u \in U} \sum_{i=1}^{m} u^{(i)} L^{(i)}$.

The definition of $l_f(y;x)$ can be simplified when $Q \subset \operatorname{int}(\operatorname{dom} f)$ and $\phi(\cdot,x)$ is strictly concave for all $x \in Q$. In this case, denoting $u_x = \operatorname{argmax}_{u \in U} \phi(u,x)$, we have $\nabla f(x) = \nabla_x \phi(u_x,x)$, and therefore we can define $l_f(y;x) := \phi(u_y,y) + \langle \nabla_x \phi(u_y,y), x - y \rangle$, which satisfies (33) with $L(\cdot) \equiv L$ and $\delta(\cdot) \equiv 0$. Its subproblem is of the form (35). This situation is also discussed in Tseng's methods [28].

(v) Mixed structure. The above examples can be combined with each other; for instance, considering the function $f_0(x)$ in (ii) with inexactness (iii) or with the saddle structure (iv), or considering the function $\phi(u,x)$ in (iv) with inexactness (iii) or with the composite structure (ii), satisfies our requirement (33).

Remark 15. Since our model includes the inexact oracle model (iii), the methods proposed in this section can also be applied to non-smooth and weakly smooth convex problems (see [7, 8]). Moreover, considering a generalization of (33) by replacing $\delta(y)$ with $\delta(x,y)$, where $\delta(\cdot,y)$ is a nonnegative and lower semicontinuous convex function on $Q$ for every $y \in Q$, we can further include other structured convex problems such as the composite convex problem discussed in [10, 11, 14] (in the deterministic version): the objective function $f(x)$ satisfies the condition

$$f(y) - f(x) - \langle g(x), y - x \rangle \le \frac{L}{2} \|y - x\|^2 + M \|y - x\|, \quad \forall x, y \in Q,$$

for a subgradient mapping $g(x) \in \partial f(x)$, $L, M \ge 0$, and $\delta(x,y) := M \|x - y\|$. The smooth and non-smooth problems are its special cases with $M = 0$ and $L = 0$, respectively. See [2] for more details.

5.1 General gradient-based methods in the unifying framework

We will propose (inexact) gradient-based methods for structured convex optimization problems which satisfy (33) and admit computable solutions for (34), as highlighted by Example 14. These methods generate approximate solutions $\{\hat{x}_k\} \subset Q$ satisfying the relation $(R_k)$. We also consider, in this section, the following alternative of the relation $(R_k)$ for some constant $C_k$:

$$(\hat{R}_k) \qquad \sum_{i=0}^{k} \lambda_i f(x_{i+1}) \le \min_{x \in Q} \psi_k(x) + C_k. \tag{39}$$

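Notice that the relation $(\hat{R}_k)$ is slightly different from that of the non-smooth case (21). We use the following alternative of Lemma 6 for this relation: if $\{\psi_k(x)\}$ satisfies Property 2 and the relation $(\hat{R}_k)$ is satisfied for some $k \ge 0$, then we have

$$\frac{1}{S_k} \sum_{i=0}^{k} \lambda_i f(x_{i+1}) - f(x^*) \le \frac{\beta_k l_d(z_k;x^*) + C_k}{S_k}. \tag{40}$$

The following theorem validates our methods.

Theorem 16. Let $\{\psi_k(x)\}_{k \ge -1}$ be a sequence of auxiliary functions satisfying Property 2 associated with weight parameters $\{\lambda_k\}_{k \ge 0}$, scaling parameters $\{\beta_k\}_{k \ge -1}$, and test points $\{x_k\}_{k \ge 0}$. Denote $z_k = \operatorname{argmin}_{x \in Q} \psi_k(x)$. Then the following assertions hold.

(a) If $\sigma\beta_{-1}/\lambda_0 \ge L(x_0)$ holds, then the relation $(R_0)$ is satisfied with $\hat{x}_0 := z_0$ and $C_0 := \lambda_0 \delta(x_0)$.

(b) Suppose that the relation $(R_k)$ is satisfied for some integer $k \ge 0$. If the relations $x_{k+1} = z_k$ and $\sigma\beta_k/\lambda_{k+1} \ge L(x_{k+1})$ hold, then the relation $(R_{k+1})$ is satisfied with

$$\hat{x}_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} z_{k+1}}{S_{k+1}}, \quad C_{k+1} := C_k + \lambda_{k+1} \delta(x_{k+1}). \tag{41}$$

Moreover, if the relation $(\hat{R}_k)$ is satisfied for some integer $k \ge 0$, and the relations $\sigma\beta_k/\lambda_{k+1} \ge L(x_{k+1})$ and $x_{k+1} = z_k$ hold, then the relation $(\hat{R}_{k+1})$ is satisfied with $C_{k+1} := C_k + \lambda_{k+1} \delta(x_{k+1})$.

(b') Suppose that the relation $(R_k)$ is satisfied for some integer $k \ge 0$. If the relations $x_{k+1} = \frac{S_k \hat{x}_k + \lambda_{k+1} z_k}{S_{k+1}}$ and $\sigma\beta_k S_{k+1}/\lambda_{k+1}^2 \ge L(x_{k+1})$ hold, then the relation $(R_{k+1})$ is satisfied with $\hat{x}_{k+1} := \frac{S_k \hat{x}_k + \lambda_{k+1} z_{k+1}}{S_{k+1}}$ and $C_{k+1} := C_k + S_{k+1} \delta(x_{k+1})$.

Proof. Denote $L_k = L(x_k)$ and $\delta_k = \delta(x_k)$.

(a) Condition (ii) with $k = -1$ and condition (i) of Property 2 yield

$$\begin{aligned}
\min_{x \in Q} \psi_0(x) + \lambda_0 \delta_0
&\ge \min_{x \in Q} \psi_{-1}(x) + \lambda_0 l_f(x_0;z_0) + \beta_{-1} \xi(z_{-1},z_0) + \lambda_0 \delta_0 \\
&= \lambda_0 \Big( l_f(x_0;z_0) + \frac{\beta_{-1}}{\lambda_0} \xi(x_0,z_0) + \delta_0 \Big) \\
&\ge \lambda_0 \Big( l_f(x_0;z_0) + \frac{\sigma\beta_{-1}}{\lambda_0} \cdot \frac{1}{2} \|z_0 - x_0\|^2 + \delta_0 \Big) \\
&\ge \lambda_0 \Big( l_f(x_0;z_0) + \frac{L_0}{2} \|z_0 - x_0\|^2 + \delta_0 \Big) \\
&\ge \lambda_0 f(z_0) = S_0 f(\hat{x}_0).
\end{aligned}$$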
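As a small illustration of the step-size conditions in Theorem 16 (a) and (b), the sketch below runs the simplest instance we could think of; all choices are assumptions made for illustration, not the paper's method statement. It takes the smooth case of Example 14 (i) with $f(x) = \frac{1}{2} \sum_j a_j (x_j - c_j)^2$ on $Q = \mathbb{R}^n$, $d(x) = \frac{1}{2}\|x - x_0\|_2^2$ ($\sigma = 1$), $\lambda_k = 1$, and constant $\beta_k = L = \max_j a_j$, so that $\sigma\beta_k/\lambda_{k+1} \ge L$ holds with equality and $\delta \equiv 0$. Assuming an auxiliary function of extended-MD type, the subproblem then reduces to the classical gradient step $z_k = x_k - \nabla f(x_k)/L$, and $\hat{x}_k$ is the plain average of $z_0, \ldots, z_k$. The function names are hypothetical.

```python
def quad_value(a, c, x):
    """f(x) = 0.5 * sum_j a[j] * (x[j] - c[j])**2 (illustrative smooth objective)."""
    return 0.5 * sum(aj * (xj - cj) ** 2 for aj, xj, cj in zip(a, x, c))

def averaged_gradient_method(a, c, x0, iters):
    """Gradient steps x_{k+1} := z_k with step 1/L, returning the average of z_0..z_k."""
    L = max(a)                                   # Lipschitz constant of the gradient
    x = list(x0)
    total = [0.0] * len(x0)
    for _ in range(iters + 1):
        grad = [aj * (xj - cj) for aj, xj, cj in zip(a, x, c)]
        z = [xj - gj / L for xj, gj in zip(x, grad)]   # z_k: gradient step from x_k
        total = [t + zj for t, zj in zip(total, z)]    # lambda_k = 1
        x = z                                          # x_{k+1} := z_k
    return [t / (iters + 1) for t in total]            # x_hat_k
```

Under these assumptions the estimate obtained from $(R_k)$ reads $f(\hat{x}_k) - f(x^*) \le L\, d(x^*)/(k+1)$, which coincides with the classical bound for averaged gradient iterates.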


More information

On the convergence properties of the projected gradient method for convex optimization

On the convergence properties of the projected gradient method for convex optimization Computational and Applied Mathematics Vol. 22, N. 1, pp. 37 52, 2003 Copyright 2003 SBMAC On the convergence properties of the projected gradient method for convex optimization A. N. IUSEM* Instituto de

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

Convex and Nonconvex Optimization Techniques for the Constrained Fermat-Torricelli Problem

Convex and Nonconvex Optimization Techniques for the Constrained Fermat-Torricelli Problem Portland State University PDXScholar University Honors Theses University Honors College 2016 Convex and Nonconvex Optimization Techniques for the Constrained Fermat-Torricelli Problem Nathan Lawrence Portland

More information

Stochastic model-based minimization under high-order growth

Stochastic model-based minimization under high-order growth Stochastic model-based minimization under high-order growth Damek Davis Dmitriy Drusvyatskiy Kellie J. MacPhee Abstract Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively

More information

Dual and primal-dual methods

Dual and primal-dual methods ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method

More information

Subdifferential representation of convex functions: refinements and applications

Subdifferential representation of convex functions: refinements and applications Subdifferential representation of convex functions: refinements and applications Joël Benoist & Aris Daniilidis Abstract Every lower semicontinuous convex function can be represented through its subdifferential

More information

Estimate sequence methods: extensions and approximations

Estimate sequence methods: extensions and approximations Estimate sequence methods: extensions and approximations Michel Baes August 11, 009 Abstract The approach of estimate sequence offers an interesting rereading of a number of accelerating schemes proposed

More information

Spectral gradient projection method for solving nonlinear monotone equations

Spectral gradient projection method for solving nonlinear monotone equations Journal of Computational and Applied Mathematics 196 (2006) 478 484 www.elsevier.com/locate/cam Spectral gradient projection method for solving nonlinear monotone equations Li Zhang, Weijun Zhou Department

More information

Generalized Uniformly Optimal Methods for Nonlinear Programming

Generalized Uniformly Optimal Methods for Nonlinear Programming Generalized Uniformly Optimal Methods for Nonlinear Programming Saeed Ghadimi Guanghui Lan Hongchao Zhang Janumary 14, 2017 Abstract In this paper, we present a generic framewor to extend existing uniformly

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

Complexity bounds for primal-dual methods minimizing the model of objective function

Complexity bounds for primal-dual methods minimizing the model of objective function Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction

ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

c 2013 Society for Industrial and Applied Mathematics

c 2013 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 3, No., pp. 109 115 c 013 Society for Industrial and Applied Mathematics AN ACCELERATED HYBRID PROXIMAL EXTRAGRADIENT METHOD FOR CONVEX OPTIMIZATION AND ITS IMPLICATIONS TO SECOND-ORDER

More information

Efficiency of minimizing compositions of convex functions and smooth maps

Efficiency of minimizing compositions of convex functions and smooth maps Efficiency of minimizing compositions of convex functions and smooth maps D. Drusvyatskiy C. Paquette Abstract We consider global efficiency of algorithms for minimizing a sum of a convex function and

More information

Lecture 25: Subgradient Method and Bundle Methods April 24

Lecture 25: Subgradient Method and Bundle Methods April 24 IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything

More information

Proximal methods. S. Villa. October 7, 2014

Proximal methods. S. Villa. October 7, 2014 Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

More information

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

On the acceleration of the double smoothing technique for unconstrained convex optimization problems

On the acceleration of the double smoothing technique for unconstrained convex optimization problems On the acceleration of the double smoothing technique for unconstrained convex optimization problems Radu Ioan Boţ Christopher Hendrich October 10, 01 Abstract. In this article we investigate the possibilities

More information

The Frank-Wolfe Algorithm:

The Frank-Wolfe Algorithm: The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology

More information

Math 273a: Optimization Convex Conjugacy

Math 273a: Optimization Convex Conjugacy Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper

More information

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms JOTA manuscript No. (will be inserted by the editor) Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms Peter Ochs Jalal Fadili Thomas Brox Received: date / Accepted: date Abstract

More information

A. Derivation of regularized ERM duality

A. Derivation of regularized ERM duality A. Derivation of regularized ERM dualit For completeness, in this section we derive the dual 5 to the problem of computing proximal operator for the ERM objective 3. We can rewrite the primal problem as

More information

From error bounds to the complexity of first-order descent methods for convex functions

From error bounds to the complexity of first-order descent methods for convex functions From error bounds to the complexity of first-order descent methods for convex functions Nguyen Trong Phong-TSE Joint work with Jérôme Bolte, Juan Peypouquet, Bruce Suter. Toulouse, 23-25, March, 2016 Journées

More information

arxiv: v3 [math.oc] 17 Dec 2017 Received: date / Accepted: date

arxiv: v3 [math.oc] 17 Dec 2017 Received: date / Accepted: date Noname manuscript No. (will be inserted by the editor) A Simple Convergence Analysis of Bregman Proximal Gradient Algorithm Yi Zhou Yingbin Liang Lixin Shen arxiv:1503.05601v3 [math.oc] 17 Dec 2017 Received:

More information

Sequential Unconstrained Minimization: A Survey

Sequential Unconstrained Minimization: A Survey Sequential Unconstrained Minimization: A Survey Charles L. Byrne February 21, 2013 Abstract The problem is to minimize a function f : X (, ], over a non-empty subset C of X, where X is an arbitrary set.

More information

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions Angelia Nedić and Asuman Ozdaglar April 15, 2006 Abstract We provide a unifying geometric framework for the

More information

A convergence result for an Outer Approximation Scheme

A convergence result for an Outer Approximation Scheme A convergence result for an Outer Approximation Scheme R. S. Burachik Engenharia de Sistemas e Computação, COPPE-UFRJ, CP 68511, Rio de Janeiro, RJ, CEP 21941-972, Brazil regi@cos.ufrj.br J. O. Lopes Departamento

More information

Lagrange Relaxation and Duality

Lagrange Relaxation and Duality Lagrange Relaxation and Duality As we have already known, constrained optimization problems are harder to solve than unconstrained problems. By relaxation we can solve a more difficult problem by a simpler

More information

arxiv: v2 [math.oc] 21 Nov 2017

arxiv: v2 [math.oc] 21 Nov 2017 Proximal Gradient Method with Extrapolation and Line Search for a Class of Nonconvex and Nonsmooth Problems arxiv:1711.06831v [math.oc] 1 Nov 017 Lei Yang Abstract In this paper, we consider a class of

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

arxiv: v2 [math.oc] 21 Nov 2017

arxiv: v2 [math.oc] 21 Nov 2017 Unifying abstract inexact convergence theorems and block coordinate variable metric ipiano arxiv:1602.07283v2 [math.oc] 21 Nov 2017 Peter Ochs Mathematical Optimization Group Saarland University Germany

More information

An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization

An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization Qihang Lin Lin Xiao April 4, 013 Abstract We consider optimization problems with an objective function

More information

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING YANGYANG XU Abstract. Motivated by big data applications, first-order methods have been extremely

More information

Smooth minimization of non-smooth functions

Smooth minimization of non-smooth functions Math. Program., Ser. A 103, 127 152 (2005) Digital Object Identifier (DOI) 10.1007/s10107-004-0552-5 Yu. Nesterov Smooth minimization of non-smooth functions Received: February 4, 2003 / Accepted: July

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Douglas-Rachford splitting for nonconvex feasibility problems

Douglas-Rachford splitting for nonconvex feasibility problems Douglas-Rachford splitting for nonconvex feasibility problems Guoyin Li Ting Kei Pong Jan 3, 015 Abstract We adapt the Douglas-Rachford DR) splitting method to solve nonconvex feasibility problems by studying

More information

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

arxiv: v5 [math.oc] 23 Jan 2018

arxiv: v5 [math.oc] 23 Jan 2018 Another look at the fast iterative shrinkage/thresholding algorithm FISTA Donghwan Kim Jeffrey A. Fessler arxiv:608.086v5 [math.oc] Jan 08 Date of current version: January 4, 08 Abstract This paper provides

More information

Duality (Continued) min f ( x), X R R. Recall, the general primal problem is. The Lagrangian is a function. defined by

Duality (Continued) min f ( x), X R R. Recall, the general primal problem is. The Lagrangian is a function. defined by Duality (Continued) Recall, the general primal problem is min f ( x), xx g( x) 0 n m where X R, f : X R, g : XR ( X). he Lagrangian is a function L: XR R m defined by L( xλ, ) f ( x) λ g( x) Duality (Continued)

More information

A New Look at the Performance Analysis of First-Order Methods

A New Look at the Performance Analysis of First-Order Methods A New Look at the Performance Analysis of First-Order Methods Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with Yoel Drori, Google s R&D Center, Tel Aviv Optimization without

More information

Convex Optimization and Modeling

Convex Optimization and Modeling Convex Optimization and Modeling Duality Theory and Optimality Conditions 5th lecture, 12.05.2010 Jun.-Prof. Matthias Hein Program of today/next lecture Lagrangian and duality: the Lagrangian the dual

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Extended Monotropic Programming and Duality 1

Extended Monotropic Programming and Duality 1 March 2006 (Revised February 2010) Report LIDS - 2692 Extended Monotropic Programming and Duality 1 by Dimitri P. Bertsekas 2 Abstract We consider the problem minimize f i (x i ) subject to x S, where

More information

An inexact subgradient algorithm for Equilibrium Problems

An inexact subgradient algorithm for Equilibrium Problems Volume 30, N. 1, pp. 91 107, 2011 Copyright 2011 SBMAC ISSN 0101-8205 www.scielo.br/cam An inexact subgradient algorithm for Equilibrium Problems PAULO SANTOS 1 and SUSANA SCHEIMBERG 2 1 DM, UFPI, Teresina,

More information

arxiv: v7 [math.oc] 22 Feb 2018

arxiv: v7 [math.oc] 22 Feb 2018 A SMOOTH PRIMAL-DUAL OPTIMIZATION FRAMEWORK FOR NONSMOOTH COMPOSITE CONVEX MINIMIZATION QUOC TRAN-DINH, OLIVIER FERCOQ, AND VOLKAN CEVHER arxiv:1507.06243v7 [math.oc] 22 Feb 2018 Abstract. We propose a

More information

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n

More information

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Noname manuscript No. (will be inserted by the editor) Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Saeed Ghadimi Guanghui Lan Hongchao Zhang the date of

More information

An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization

An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization Qihang Lin Lin Xiao February 4, 014 Abstract We consider optimization problems with an objective function

More information

Convex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE

Convex Analysis Notes. Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE Convex Analysis Notes Lecturer: Adrian Lewis, Cornell ORIE Scribe: Kevin Kircher, Cornell MAE These are notes from ORIE 6328, Convex Analysis, as taught by Prof. Adrian Lewis at Cornell University in the

More information

arxiv: v3 [math.oc] 18 Apr 2012

arxiv: v3 [math.oc] 18 Apr 2012 A class of Fejér convergent algorithms, approximate resolvents and the Hybrid Proximal-Extragradient method B. F. Svaiter arxiv:1204.1353v3 [math.oc] 18 Apr 2012 Abstract A new framework for analyzing

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Math. Program., Ser. B 2013) 140:125 161 DOI 10.1007/s10107-012-0629-5 FULL LENGTH PAPER Gradient methods for minimizing composite functions Yu. Nesterov Received: 10 June 2010 / Accepted: 29 December

More information