Universal Gradient Methods for Convex Optimization Problems


CORE DISCUSSION PAPER 2013/26

Universal Gradient Methods for Convex Optimization Problems

Yu. Nesterov *

April 8, 2013; revised June 2, 2013

Abstract

In this paper, we present new methods for black-box convex minimization. They do not need to know in advance the actual level of smoothness of the objective function. Their only essential input parameter is the required accuracy of the solution. At the same time, for each particular problem class they automatically ensure the best possible rate of convergence. We confirm our theoretical results by encouraging numerical experiments, which demonstrate that the fast rate of convergence, typical for smooth optimization problems, can sometimes be achieved even on nonsmooth problem instances.

Keywords: convex optimization, black-box methods, complexity bounds, optimal methods, weakly smooth functions.

* Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium. The research results presented in this paper have been supported by a grant "Action de recherche concertée" ARC 04/09-315 from the Direction de la recherche scientifique - Communauté française de Belgique. The author also acknowledges support from the Laboratory of Structural Methods of Data Analysis in Predictive Modelling, through an RF government grant.

1 Introduction

Motivation. In Convex Optimization, the majority of numerical schemes are developed for particular problem classes. In the Black-Box framework, the two main classes of convex problems, the smooth problems and the nonsmooth ones, are treated by completely different techniques. This separation looks very natural. Indeed, differentiable functions allow constructing monotone minimization sequences, for which convergence results can be easily obtained. Smooth functions can be locally approximated by first- and second-order models, which are very helpful in developing efficient minimization schemes. The class of nonsmooth convex functions looks much more difficult. For them, there is no hope of getting a good local approximation model. It is very difficult to construct relaxation sequences. Moreover, even if a descent direction is found, there is no guarantee that we can advance along it by a sufficiently long step. Therefore, all methods for nonsmooth convex optimization rely only on separation properties. Cutting planes provide us with information about the half-spaces containing the optimal solution. Using this very restricted knowledge, it is still possible to develop some optimization methods. But their computational abilities are incomparably weaker than those of smooth minimization schemes.

The above observations are confirmed by theoretical results. It is well known that for the class of smooth problems C^{1,1}(R^n), composed of functions with Lipschitz-continuous gradients, the optimal iteration complexity bound for finding an ϵ-solution of the corresponding optimization problem by a first-order method is of the order O(1/ϵ^{1/2}). For nonsmooth problems from the class C^{1,0}(R^n), where we can rely only on Lipschitz continuity of function values, such a bound is established at the level of O(1/ϵ²) (see, e.g., [8]).

Such a big difference in the complexity bounds stimulated interest in intermediate classes of convex problems. One of the possibilities consists in considering functions from the class C^{1,ν}(R^n), ν ∈ [0, 1], which have Hölder-continuous gradients:

    ‖∇f(x) − ∇f(y)‖ ≤ L_ν ‖x − y‖^ν,   x, y ∈ R^n.    (1.1)

General Complexity Theory [7] established for this class the following lower iteration complexity bound:

    O( ( L_ν R^{1+ν} / ϵ )^{2/(1+3ν)} ),    (1.2)

where R is the distance from the starting point to the solution. The first optimal methods for such problems were developed in [6].¹ The main advantage of these schemes is an automatic adjustment to the proper level of the smoothness parameter ν. However, these methods need to know other characteristics of the problem (an estimate of the Lipschitz constant L_ν, an estimate of the distance to the optimum), which are not readily available. Moreover, it was necessary to decide in advance on the total number of steps of the method. This requirement is not very practical. Indeed, in order to make such a decision, we need to know the rate of convergence of the method. However, this is possible only if we know the Hölder parameter. This hidden contradiction probably explains why these theoretically attractive procedures were never seriously tested in computational practice.

¹ An English translation of this paper was included in Section 2.3 of [3].

In the last decade, we have seen a restoration of interest in gradient methods. New problem settings in image processing, data mining, and statistics require computationally cheap minimization procedures, which can quickly deliver approximate solutions of moderate accuracy. This demand was satisfied by new families of problem-oriented methods (e.g. [9], [10], [11]), which increase the rate of convergence of gradient schemes far above the limits of Black-Box Complexity Theory [7]. This can be done, of course, only by an appropriate use of problem structure, violating one of the main assumptions of the Black-Box concept.

However, it appears that Black-Box methods have not yet reached the limits of their performance. The old idea of automatic adjustment to the Hölder parameter was revived in [4], where a new version of the Level Method [5] was adapted to smooth problems, ensuring the best possible complexity bounds for all values of the smoothness parameter. The only drawback of this approach is the high iteration cost of the Level Method.

Minimization of functions with Hölder-continuous gradient was discussed in [2] in the framework of inexact oracle. It was shown that the answer (f(x), ∇f(x)) of an exact oracle for a convex function satisfying the Hölder condition (1.1) can be treated as inexact information for some function from C^{1,1}(R^n):

    0 ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ δ + (L/2) ‖y − x‖²,   x, y ∈ R^n,    (1.3)

where L and δ are some inexactness parameters. It was shown that these parameters can be chosen as appropriate functions of ν. Therefore, functions from C^{1,ν}(R^n) can be minimized by an inexact version of the Fast Gradient Methods for C^{1,1}(R^n). The resulting complexity bounds appear to be optimal (1.2). However, in order to apply this approach, we need to employ a lot of additional information (the value of the parameter ν, the constant L_ν, the distance estimate R, and the total number of steps of the method).

In this paper, we construct new universal methods for minimizing functions with Hölder-continuous gradient. They do not need a priori knowledge of the parameter ν, and they have a cheap cost of one iteration. In order to solve the problem

    min_{x∈Q} f(x)    (1.4)

by universal methods, we suggest using a Damped Relaxation (DR) condition

    f(x⁺) ≤ δ + min_{y∈Q} [ f(x̄) + ⟨∇f(x̄), y − x̄⟩ + (L̂/2) ‖y − x̄‖² ],    (1.5)

where the tolerance parameter δ > 0 depends only on the required accuracy ϵ > 0 of the final approximate solution. Similar conditions were used in [6] and [2], with δ being a function of the smoothness parameter ν and of the total number of iterations. We show that all necessary information on ν and L_ν can be accumulated in the constant L̂, which can be easily adapted by an appropriate line-search strategy.

For different methods, the dependence of δ on ϵ must be different. For the simplest Primal and Dual Gradient Methods, it is enough to take δ = ϵ/2. For the Fast Gradient Method [9], we use condition (1.5) with a much smaller value of δ, allowing us to maintain a damped version of the estimating sequence condition

    A_k ( f(x_k) − ϵ/2 ) ≤ min_{x∈Q} ϕ_k(x)    (1.6)

(see Section 4 for details). All our methods are developed for composite minimization problems [10], whose space of variables is endowed with an arbitrary norm. Hence, we apply the machinery of Bregman distances. Our methods adjust automatically to the actual level of smoothness of the smooth part of the objective function. The only essential input parameter for these schemes is the required accuracy ϵ > 0.

Contents. The paper is organized as follows. In Section 2 we introduce the problem formulation and discuss the main properties of the Bregman mapping as applied to functions with Hölder-continuous gradients. After that, we prove a convergence result for the Universal Primal Gradient Method and derive its complexity bound. We show that this method needs on average at most two calls of the oracle per iteration. Moreover, this method can be equipped with a reliable stopping criterion. In Section 3, we prove similar results for the Universal Dual Gradient Method. This method needs on average four calls of the oracle per iteration. Both these methods are based on the DR-condition (1.5). In Section 4, in order to derive the Universal Fast Gradient Method, we introduce condition (1.6). We show that this scheme is uniformly optimal for minimizing a composite function whose smooth part has Hölder-continuous gradients. This scheme has a reliable stopping criterion. It needs on average four calls of the oracle per iteration. In Section 5, we present preliminary computational results. We consider three families of random test problems. All of them are nonsmooth problems with Lipschitz-continuous objective function. It is shown that quite often the Universal Fast Gradient Method is able to accelerate and demonstrate the rate of convergence typical for smooth minimization schemes. The choice of appropriate norms is always very important.

Notation. In what follows, we work in a finite-dimensional linear vector space E. Its dual space, the space of all linear functions on E, is denoted by E*. For x ∈ E and s ∈ E*, we denote by ⟨s, x⟩ the value of the linear function s at x. For the (primal) space E, we introduce a norm ‖·‖. Then the dual norm is defined in the standard way:

    ‖s‖* := max_{x∈E} { ⟨s, x⟩ : ‖x‖ ≤ 1 }.

Finally, for a convex function f : dom f → R with dom f ⊆ E we denote by ∇f(x) ∈ E* one of its subgradients.

2 Universal Primal Gradient Method

Consider the following minimization problem:

    min_{x∈Q} [ f̄(x) := f(x) + Ψ(x) ],    (2.1)

where Q is a simple closed convex set, and Ψ is a simple closed convex function. The function f is assumed to be subdifferentiable on Q. In order to characterize the variability of its (sub)gradients, we introduce the following values:

    M_ν = M_ν(f) := sup_{x,y∈Q, x≠y} ‖∇f(x) − ∇f(y)‖* / ‖x − y‖^ν,   ν ≥ 0.    (2.2)

This condition can be rewritten in the form

    ln M_ν = sup_{x,y∈Q, x≠y} [ ln ‖∇f(x) − ∇f(y)‖* − ν ln ‖x − y‖ ].

Thus, M_ν is a log-convex function of ν. For certain ν ∈ [0, 1], the constant M_ν can be infinite. However, if M_0 and M_1 are finite, then

    M_ν ≤ M_0^{1−ν} M_1^{ν},   0 ≤ ν ≤ 1.    (2.3)

In any case, if M_ν < ∞, then

    ‖∇f(x) − ∇f(y)‖* ≤ M_ν ‖x − y‖^ν,   x, y ∈ Q.    (2.4)

This inequality ensures that

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (M_ν/(1+ν)) ‖x − y‖^{1+ν},   x, y ∈ Q.    (2.5)

Our main assumption is as follows.

Assumption 1. M̂(f) := inf_{0≤ν≤1} M_ν(f) < +∞.

For solving problem (2.1), we introduce a prox-function d(x). This is a differentiable strongly convex function with convexity parameter equal to one:

    d(y) ≥ d(x) + ⟨∇d(x), y − x⟩ + ½ ‖x − y‖²,   x, y ∈ rint Q.    (2.6)

We assume that d(x) attains its minimum on Q at some point x_0, and d(x_0) = 0. Thus, in view of (2.6),

    d(x) ≥ ½ ‖x − x_0‖²,   x ∈ Q.    (2.7)

This prox-function defines the Bregman distance

    ξ(x, y) := d(y) − d(x) − ⟨∇d(x), y − x⟩.

Clearly, ξ(x, x) = 0, and, in view of (2.6),

    ξ(x, y) ≥ ½ ‖x − y‖²,   x, y ∈ Q.    (2.8)

Now for any x ∈ Q we can define the Bregman mapping

    B_M(x) = arg min_{y∈Q} { ψ_M(x, y) := f(x) + ⟨∇f(x), y − x⟩ + M ξ(x, y) + Ψ(y) }.    (2.9)

We assume that this point is easily computable either in a closed form, or by some cheap computational procedure. The first-order optimality condition for the optimization problem in (2.9) is as follows:

    ⟨∇f(x) + M (∇d(B_M(x)) − ∇d(x)) + ∇Ψ(B_M(x)), y − B_M(x)⟩ ≥ 0,   y ∈ Q.    (2.10)

Denote ψ*_M(x) = ψ_M(x, B_M(x)).
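As a small illustration of these definitions (the helper below is added here and is not from the paper), the Bregman distance can be computed generically from d and ∇d; for the Euclidean prox-function d(x) = ½‖x‖² it reduces to the usual ½‖x − y‖²:

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """Bregman distance xi(x, y) = d(y) - d(x) - <grad d(x), y - x>."""
    return d(y) - d(x) - grad_d(x) @ (y - x)

# For d(x) = 0.5*||x||^2 the Bregman distance is 0.5*||x - y||^2:
d = lambda x: 0.5 * (x @ x)
grad_d = lambda x: x
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
assert np.isclose(bregman(d, grad_d, x, y), 0.5 * np.dot(x - y, x - y))
```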

Lemma 1. Let the function f satisfy condition (2.4). Then for any δ > 0 and

    M ≥ [ 1/δ ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}    (2.11)

we have

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (M/2) ‖y − x‖² + δ/2,   x, y ∈ Q.    (2.12)

Therefore,

    f(B_M(x)) ≤ ψ*_M(x) + δ/2.    (2.13)

Proof: It is well known that for all nonnegative τ and s we have τ^p/p + s^q/q ≥ τ s, where p, q ≥ 1 and 1/p + 1/q = 1. Therefore, taking p = 2/(1+ν), q = 2/(1−ν), and τ = t^{1+ν}, we get

    t^{1+ν} ≤ ((1+ν)/(2s)) t² + ((1−ν)/2) s^{(1+ν)/(1−ν)},   s > 0, t ≥ 0.

Denote δ = ((1−ν)/(1+ν)) M_ν s^{(1+ν)/(1−ν)}. Then s = [ ((1+ν)/(1−ν)) δ / M_ν ]^{(1−ν)/(1+ν)}. Therefore,

    (M_ν/(1+ν)) t^{1+ν} ≤ (M_ν/(2s)) t² + δ/2 = ½ [ (1−ν)/((1+ν) δ) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)} t² + δ/2 ≤ (M/2) t² + δ/2,    (2.14)

where the last inequality follows from (2.11). This inequality, together with (2.5), justifies (2.12). Further, denoting x⁺ = B_M(x), we obtain:

    f(x⁺) ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + (M_ν/(1+ν)) ‖x⁺ − x‖^{1+ν}   (by (2.5))
         ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + (M/2) ‖x⁺ − x‖² + δ/2   (by (2.14))
         ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + M ξ(x, x⁺) + δ/2   (by (2.8)).

Therefore, f̄(x⁺) = f(x⁺) + Ψ(x⁺) ≤ ψ*_M(x) + δ/2. □

Note that the right-hand side of inequality (2.11) is continuous in ν. As ν → 1, it becomes

    M ≥ M_1.    (2.15)
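It may help to write out the two endpoints of condition (2.11) explicitly (an added illustration, not part of the original numbering):

```latex
% Endpoints of (2.11), M >= (1/\delta)^{(1-\nu)/(1+\nu)} M_\nu^{2/(1+\nu)}:
\[
\nu = 0:\quad M \;\ge\; \frac{M_0^2}{\delta},
\qquad\qquad
\nu = 1:\quad M \;\ge\; M_1 .
\]
```

In the nonsmooth case the required "Lipschitz constant" grows as the tolerance δ shrinks, while in the smooth case it is independent of δ.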

Let us look now at the simplest Universal Primal Gradient Method, equipped with a backtracking line-search procedure with restore. Denote by x* the optimal solution to (2.1).

Universal Primal Gradient Method (PGM)    (2.16)

Initialization. Choose L_0 > 0 and accuracy ϵ > 0.

For k ≥ 0 do:

1. Find the smallest i_k ≥ 0 such that for x⁺_k := B_{2^{i_k} L_k}(x_k) we have

       f(x⁺_k) ≤ f(x_k) + ⟨∇f(x_k), x⁺_k − x_k⟩ + 2^{i_k − 1} L_k ‖x⁺_k − x_k‖² + ϵ/2.

2. Set x_{k+1} = B_{2^{i_k} L_k}(x_k) and L_{k+1} = 2^{i_k − 1} L_k.

Denote γ(M, ϵ) := [1/ϵ]^{(1−ν)/(1+ν)} M^{2/(1+ν)}, S_k = Σ_{i=0}^{k} 1/L_{i+1}, and f̄*_k = (1/S_k) Σ_{i=0}^{k} (1/L_{i+1}) f̄(x_{i+1}).

Theorem 1. Let f satisfy condition (2.4). Assume that L_0 ≤ γ(M_ν, ϵ). Then for all k ≥ 0 we have L_{k+1} ≤ γ(M_ν, ϵ). Moreover, for all y ∈ Q,

    S_k f̄*_k ≤ Σ_{i=0}^{k} (1/L_{i+1}) [ f(x_i) + ⟨∇f(x_i), y − x_i⟩ + Ψ(y) + ϵ/2 ] + 2 ξ(x_0, y).    (2.17)

Therefore, f̄*_k − f̄(x*) ≤ ϵ/2 + (2 γ(M_ν, ϵ)/(k+1)) ξ(x_0, x*).

Proof: In view of Lemma 1, the line-search procedure of Step 1 in method (2.16) is well defined, and

    2 L_{k+1} = 2^{i_k} L_k ≤ 2 γ(M_ν, ϵ).    (2.18)

Let us fix an arbitrary point y ∈ Q and denote r_k(y) := ξ(x_k, y). Then, applying the optimality condition (2.10) at x_k with M = 2L_{k+1},

    r_{k+1}(y) = d(y) − d(x_{k+1}) − ⟨∇d(x_{k+1}), y − x_{k+1}⟩
              ≤ d(y) − d(x_{k+1}) − ⟨∇d(x_k), y − x_{k+1}⟩ + (1/(2L_{k+1})) ⟨∇f(x_k) + ∇Ψ(x_{k+1}), y − x_{k+1}⟩.

Note that, in view of (2.6),

    d(y) − d(x_{k+1}) − ⟨∇d(x_k), y − x_{k+1}⟩ = r_k(y) − [ d(x_{k+1}) − d(x_k) − ⟨∇d(x_k), x_{k+1} − x_k⟩ ] ≤ r_k(y) − ½ ‖x_{k+1} − x_k‖².

Thus,

    r_{k+1}(y) ≤ r_k(y) + (1/(2L_{k+1})) ⟨∇f(x_k) + ∇Ψ(x_{k+1}), y − x_{k+1}⟩ − ½ ‖x_{k+1} − x_k‖²
             = r_k(y) + (1/(2L_{k+1})) ⟨∇Ψ(x_{k+1}), y − x_{k+1}⟩ − (1/(2L_{k+1})) ( ⟨∇f(x_k), x_{k+1} − x_k⟩ + L_{k+1} ‖x_{k+1} − x_k‖² ) + (1/(2L_{k+1})) ⟨∇f(x_k), y − x_k⟩
             ≤ r_k(y) + (1/(2L_{k+1})) ( Ψ(y) − Ψ(x_{k+1}) + f(x_k) − f(x_{k+1}) + ϵ/2 + ⟨∇f(x_k), y − x_k⟩ ),

where the last inequality uses the convexity of Ψ and the termination test of Step 1. Thus, we obtain the following inequality:

    (1/(2L_{k+1})) f̄(x_{k+1}) + r_{k+1}(y) ≤ (1/(2L_{k+1})) ( f(x_k) + ⟨∇f(x_k), y − x_k⟩ + Ψ(y) + ϵ/2 ) + r_k(y).

Summing up these inequalities and dropping r_{k+1}(y) ≥ 0, we obtain (2.17). It remains to use inequality (2.18). □

It is important that method (2.16) does not include ν as a parameter. Therefore, in view of Theorem 1, in order to get an ϵ-solution of problem (2.1) we need at most

    inf_{0≤ν≤1} 4 ξ(x_0, x*) ( M_ν / ϵ )^{2/(1+ν)}    (2.19)

iterations. In this estimate, among all classes of functions with Hölder-continuous gradient, we choose the class which best fits our particular objective function. Note that the expression (2.19) is log-quasiconvex in ν. Hence, if M_0 and M_1 are finite, there are good chances that the optimal ν belongs to the interval (0, 1).

Inequality (2.17) gives us a reliable stopping criterion for method (2.16). Indeed, assume we have a bound for the size of the optimal solution:

    ξ(x_0, x*) ≤ D.    (2.20)

Denote l^p_k(y) := (1/S_k) Σ_{i=0}^{k} (1/L_{i+1}) [ f(x_i) + ⟨∇f(x_i), y − x_i⟩ ], and define

    f̂_k = min_{y∈Q} { l^p_k(y) + Ψ(y) : ξ(x_0, y) ≤ D }.

Then

    f̄*_k − f̄(x*) ≤ f̄*_k − f̂_k ≤ ϵ/2 + (2 γ(M_ν, ϵ)/(k+1)) D.    (2.21)

Note that f̂_k can be computed. Thus, inequality (2.21) provides us with an implementable stopping criterion f̄*_k − f̂_k ≤ ϵ.
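To make the scheme concrete, here is a minimal sketch of method (2.16) for the simplest setup Q = Rⁿ, Ψ ≡ 0, d(x) = ½‖x − x₀‖², in which the Bregman mapping B_M(x) is an ordinary gradient step. The function handles and parameter names are illustrative assumptions, not part of the paper:

```python
import numpy as np

def universal_pgm(f, grad_f, x0, eps, L0=1.0, max_iter=1000):
    """Sketch of the Universal Primal Gradient Method (2.16),
    assuming Q = R^n, Psi = 0, and the Euclidean prox-function,
    so that B_M(x) = x - grad_f(x)/M."""
    x, L = x0.copy(), L0
    for k in range(max_iter):
        fx, gx = f(x), grad_f(x)
        i = 0
        while True:
            M = 2.0**i * L
            x_plus = x - gx / M          # Bregman mapping B_M(x)
            # Damped relaxation test of Step 1, with tolerance eps/2:
            if f(x_plus) <= fx + gx @ (x_plus - x) \
                    + 0.5 * M * np.dot(x_plus - x, x_plus - x) + 0.5 * eps:
                break
            i += 1
        x, L = x_plus, 2.0**(i - 1) * L  # L_{k+1} = 2^{i_k - 1} L_k
    return x
```

Note how L is halved whenever the test passes immediately (i_k = 0): this is what allows the estimate of the "Lipschitz constant" to decrease again, so that the method adapts to the actual local level of smoothness.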

Finally, let us estimate N(k), the total number of computations of function values in method (2.16) after k iterations. Note that L_{k+1} = 2^{i_k − 1} L_k. Therefore, i_k = 1 + log₂ (L_{k+1}/L_k). Hence, for any ν ∈ [0, 1], we have

    N(k) = Σ_{j=0}^{k} (i_j + 1) = 2(k + 1) + log₂ L_{k+1} − log₂ L_0
         ≤ 2(k + 1) + ((1−ν)/(1+ν)) log₂ (1/ϵ) + (2/(1+ν)) log₂ M_ν − log₂ L_0,

where the last inequality follows from (2.18). Finally, we come to the following upper bound:

    N(k) ≤ 2(k + 1) − log₂ L_0 + inf_{0≤ν≤1} [ ((1−ν)/(1+ν)) log₂ (1/ϵ) + (2/(1+ν)) log₂ M_ν ].    (2.22)

Thus, on average, up to negligible logarithmic terms, method (2.16) requires two computations of function values per iteration.

The complexity estimate (2.19) is optimal only for ν = 0. In Section 4 we show that much better (and optimal) bounds can be achieved by a fast gradient scheme.

3 Universal Dual Gradient Method

The Dual Gradient Method is based on updating a simple model of the objective function of problem (2.1). Its justification is based on the following simple result.

Lemma 2. Let ϕ : Q → R ∪ {+∞} be a convex function such that for some M ≥ 0 the difference ϕ(x) − M d(x) is subdifferentiable on Q. Denote x̄ = arg min_{x∈Q} ϕ(x). Then

    ϕ(y) ≥ ϕ(x̄) + M ξ(x̄, y),   y ∈ Q.    (3.1)

Proof: Denote F(y) = ϕ(y) − M d(y). Then ⟨∇F(x̄) + M ∇d(x̄), y − x̄⟩ ≥ 0 for all y ∈ Q. Therefore,

    ϕ(y) = F(y) + M d(y) ≥ F(x̄) + ⟨∇F(x̄), y − x̄⟩ + M d(y)
         ≥ F(x̄) − M ⟨∇d(x̄), y − x̄⟩ + M d(y) = ϕ(x̄) + M ξ(x̄, y). □

Universal Dual Gradient Method (DGM)    (3.2)

Initialization. Choose L_0 > 0. Define ϕ_0(x) = ξ(x_0, x).

For k ≥ 0 do:

1. Find the smallest i_k ≥ 0 such that for the point

       x_{k,i_k} = arg min_{x∈Q} { ϕ_k(x) + (1/(2^{i_k} L_k)) [ f(x_k) + ⟨∇f(x_k), x − x_k⟩ + Ψ(x) ] }

   we have

       f̄( B_{2^{i_k} L_k}(x_k) ) ≤ ψ*_{2^{i_k} L_k}(x_k) + ϵ/2.

2. Set x_{k+1} = x_{k,i_k}, L_{k+1} = 2^{i_k − 1} L_k, and

       ϕ_{k+1}(x) = ϕ_k(x) + (1/(2L_{k+1})) [ f(x_k) + ⟨∇f(x_k), x − x_k⟩ + Ψ(x) ].

Assume that L_0 ≤ γ(M_ν, ϵ) and M_ν < ∞. Note that the termination criterion of Step 1 in method (3.2) is satisfied once 2^{i_k} L_k ≥ γ(M_ν, ϵ). Therefore, same as in the proof of Theorem 1, we have

    L_k ≤ γ(M_ν, ϵ),   k ≥ 1.    (3.3)

Denote y_k = B_{2^{i_k} L_k}(x_k), S_k = Σ_{i=0}^{k} 1/L_{i+1}, and ϕ*_k = min_{x∈Q} ϕ_k(x); note that x_{k+1} = arg min_{x∈Q} ϕ_{k+1}(x). Let us prove by induction that the relation

    Σ_{i=0}^{k} (1/(2L_{i+1})) f̄(y_i) ≤ ϕ*_{k+1} + (ϵ/4) S_k    (3.4)

is valid for all k ≥ 0. Indeed, for k = 0 we have

    (1/(2L_1)) f̄(y_0) − (ϵ/4) S_0 = (1/(2L_1)) [ f̄(y_0) − ϵ/2 ] ≤ (1/(2^{i_0} L_0)) ψ*_{2^{i_0} L_0}(x_0)
    = (1/(2^{i_0} L_0)) [ f(x_0) + ⟨∇f(x_0), y_0 − x_0⟩ + Ψ(y_0) ] + ξ(x_0, y_0) = ϕ_1(y_0) = min_{x∈Q} ϕ_1(x).

In view of Lemma 2, for any k ≥ 0 we have

    ϕ_k(x) ≥ ϕ_k(x_k) + ξ(x_k, x),   x ∈ Q.

Assume that (3.4) is true for some k ≥ 0. Then

    min_{x∈Q} ϕ_{k+2}(x) = min_{x∈Q} { ϕ_{k+1}(x) + (1/(2L_{k+2})) [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ] }
    ≥ min_{x∈Q} { ϕ_{k+1}(x_{k+1}) + ξ(x_{k+1}, x) + (1/(2L_{k+2})) [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ] }
    = ϕ*_{k+1} + (1/(2L_{k+2})) ψ*_{2L_{k+2}}(x_{k+1})
    ≥ ϕ*_{k+1} + (1/(2L_{k+2})) [ f̄(y_{k+1}) − ϵ/2 ]
    ≥ Σ_{i=0}^{k+1} (1/(2L_{i+1})) f̄(y_i) − (ϵ/4) S_{k+1},

where the last inequality follows from (3.4). Thus, we have proved that (3.4) is valid for all k ≥ 0.

Now we can prove the main convergence result for the Universal Dual Gradient Method. Denote f̄*_k = (1/S_k) Σ_{i=0}^{k} (1/L_{i+1}) f̄(y_i).

Theorem 2. Let f satisfy condition (2.4) with M_ν < ∞, and let L_0 ≤ γ(M_ν, ϵ). Then all L_k generated by method (3.2) satisfy condition (3.3). Moreover, for all k ≥ 0 we have

    f̄*_k − f̄(x*) ≤ ϵ/2 + (2 γ(M_ν, ϵ)/(k+1)) ξ(x_0, x*).    (3.5)

Proof: Indeed, in view of inequality (3.4), we have

    (S_k/2) f̄*_k = Σ_{i=0}^{k} (1/(2L_{i+1})) f̄(y_i) ≤ min_{x∈Q} ϕ_{k+1}(x) + (ϵ/4) S_k ≤ (S_k/2) f̄(x*) + ξ(x_0, x*) + (ϵ/4) S_k.

It remains to use inequality (3.3). □

Note that the worst-case complexity bound for the number of iterations of method (3.2) coincides with the bound (2.19). However, the number of function evaluations at each iteration of (3.2) is twice that of (2.22).

Same as method (2.16), the Universal Dual Gradient Method can be equipped with a stopping criterion. Denote l^d_k(y) = (1/S_k) Σ_{i=0}^{k} (1/L_{i+1}) [ f(x_i) + ⟨∇f(x_i), y − x_i⟩ ]. Assume that ξ(x_0, x*) ≤ D and the constant D is known. Denote

    f̂_k = min_{y∈Q} { l^d_k(y) + Ψ(y) : ξ(x_0, y) ≤ D }.

Note that

    f̂_k = min_{y∈Q} max_{β≥0} { l^d_k(y) + Ψ(y) + β (ξ(x_0, y) − D) } = max_{β≥0} min_{y∈Q} { l^d_k(y) + Ψ(y) + β (ξ(x_0, y) − D) } ≥ (2/S_k) ϕ*_{k+1} − (2/S_k) D,

where the last inequality corresponds to the choice β = 2/S_k. Since f̄*_k ≤ (2/S_k) ϕ*_{k+1} + ϵ/2 by (3.4), we conclude that the stopping criterion f̄*_k − f̂_k ≤ ϵ ensures f̄*_k − f̄(x*) ≤ ϵ; it is satisfied as soon as S_k ≥ (4/ϵ) D.

4 Universal Fast Gradient Method

Finally, let us apply to problem (2.1) the following method.

Universal Fast Gradient Method (FGM)    (4.1)

Initialization. Choose L_0 > 0. Define ϕ_0(x) = ξ(x_0, x), y_0 = x_0, A_0 = 0.

For k ≥ 0 do:

1. Find v_k = arg min_{x∈Q} ϕ_k(x).

2. Find the smallest i_k ≥ 0 such that the coefficient a_{k+1,i_k} > 0, computed from the equation

       a²_{k+1,i_k} = (A_k + a_{k+1,i_k}) / (2^{i_k} L_k)

   and used in the definitions

       A_{k+1,i_k} = A_k + a_{k+1,i_k},   τ_{k,i_k} = a_{k+1,i_k} / A_{k+1,i_k},
       x_{k+1,i_k} = τ_{k,i_k} v_k + (1 − τ_{k,i_k}) y_k,
       x̂_{k+1,i_k} = arg min_{y∈Q} { ξ(v_k, y) + a_{k+1,i_k} [ ⟨∇f(x_{k+1,i_k}), y⟩ + Ψ(y) ] },
       y_{k+1,i_k} = τ_{k,i_k} x̂_{k+1,i_k} + (1 − τ_{k,i_k}) y_k,

   ensures the following relation:

       f(y_{k+1,i_k}) ≤ f(x_{k+1,i_k}) + ⟨∇f(x_{k+1,i_k}), y_{k+1,i_k} − x_{k+1,i_k}⟩ + 2^{i_k − 1} L_k ‖y_{k+1,i_k} − x_{k+1,i_k}‖² + (ϵ/2) τ_{k,i_k}.

3. Set x_{k+1} = x_{k+1,i_k}, y_{k+1} = y_{k+1,i_k}, a_{k+1} = a_{k+1,i_k}, τ_k = τ_{k,i_k}. Define A_{k+1} = A_k + a_{k+1}, L_{k+1} = 2^{i_k − 1} L_k, and

       ϕ_{k+1}(x) = ϕ_k(x) + a_{k+1} [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ].

Theorem 3. Let f satisfy condition (2.4) with certain M_ν < +∞. Then all iterations of method (4.1) are well defined. Moreover, for all k ≥ 0 we have

    A_k ( f̄(y_k) − ϵ/2 ) ≤ ϕ*_k := min_{x∈Q} ϕ_k(x),    (4.2)

where A_k ≥ [ ϵ^{1−ν} k^{1+3ν} / ( 2^{2+4ν} M_ν² ) ]^{1/(1+ν)}. Consequently, for all k ≥ 1 we have

    f̄(y_k) − f̄(x*) ≤ [ 2^{2+4ν} M_ν² / ( ϵ^{1−ν} k^{1+3ν} ) ]^{1/(1+ν)} ξ(x_0, x*) + ϵ/2.    (4.3)

Proof: Let us prove first that the line-search process of Item 2 in (4.1) is finite. In view of inequality (2.12), we need to show that

    2^{i_k} L_k ≥ [ 1/(ϵ τ_{k,i_k}) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}

for i_k large enough. Indeed, in view of the characteristic equation for a_{k+1,i_k}, we have

    2^{i_k} L_k · τ_{k,i_k}^{(1−ν)/(1+ν)} = ( A_{k+1,i_k} / a²_{k+1,i_k} ) ( a_{k+1,i_k} / A_{k+1,i_k} )^{(1−ν)/(1+ν)} = A_{k+1,i_k}^{2ν/(1+ν)} / a_{k+1,i_k}^{(1+3ν)/(1+ν)}.

It remains to note that a_{k+1,i_k} → 0 as i_k → ∞.

Let us prove relation (4.2). For k = 0 it is evident. Assume that it is valid for certain k ≥ 0. Then for any y ∈ Q we have, in view of (3.1), (4.2), and the convexity of f,

    ϕ_k(y) ≥ ϕ*_k + ξ(v_k, y) ≥ A_k ( f̄(y_k) − ϵ/2 ) + ξ(v_k, y)
          ≥ A_k ( f(x_{k+1}) + ⟨∇f(x_{k+1}), y_k − x_{k+1}⟩ + Ψ(y_k) − ϵ/2 ) + ξ(v_k, y).

Therefore,

    ϕ_{k+1}(y) ≥ ξ(v_k, y) + A_k ( f(x_{k+1}) + ⟨∇f(x_{k+1}), y_k − x_{k+1}⟩ + Ψ(y_k) − ϵ/2 ) + a_{k+1} [ f(x_{k+1}) + ⟨∇f(x_{k+1}), y − x_{k+1}⟩ + Ψ(y) ]
             = ξ(v_k, y) + A_k ( f(x_{k+1}) + Ψ(y_k) − ϵ/2 ) + a_{k+1} [ f(x_{k+1}) + ⟨∇f(x_{k+1}), y − v_k⟩ + Ψ(y) ],

since A_k (y_k − x_{k+1}) + a_{k+1} (y − x_{k+1}) = a_{k+1} (y − v_k) by the definition of x_{k+1}. In view of the definition of the point x̂_{k+1}, we have

    ϕ*_{k+1} ≥ ξ(v_k, x̂_{k+1}) + A_k ( f(x_{k+1}) + Ψ(y_k) − ϵ/2 ) + a_{k+1} [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x̂_{k+1} − v_k⟩ + Ψ(x̂_{k+1}) ]
             ≥ ½ ‖x̂_{k+1} − v_k‖² + A_{k+1} f(x_{k+1}) + A_{k+1} Ψ(y_{k+1}) − (ϵ/2) A_k + a_{k+1} ⟨∇f(x_{k+1}), x̂_{k+1} − v_k⟩,

where we used (2.8) and the convexity of Ψ. Since x̂_{k+1} − v_k = (1/τ_k)(y_{k+1} − x_{k+1}), we obtain

    ϕ*_{k+1} ≥ (1/(2τ_k²)) ‖y_{k+1} − x_{k+1}‖² + A_{k+1} f(x_{k+1}) + A_{k+1} Ψ(y_{k+1}) − (ϵ/2) A_k + A_{k+1} ⟨∇f(x_{k+1}), y_{k+1} − x_{k+1}⟩
             = A_{k+1} ( f(x_{k+1}) + ⟨∇f(x_{k+1}), y_{k+1} − x_{k+1}⟩ + 2^{i_k − 1} L_k ‖y_{k+1} − x_{k+1}‖² + Ψ(y_{k+1}) ) − (ϵ/2) A_k
             ≥ A_{k+1} ( f(y_{k+1}) − (ϵ/2) τ_k + Ψ(y_{k+1}) ) − (ϵ/2) A_k = A_{k+1} ( f̄(y_{k+1}) − ϵ/2 ).

Thus, inequality (4.2) is proved for all k ≥ 0. Since ϕ_k(y) ≤ A_k f̄(y) + ξ(x_0, y) for all y ∈ Q, we obtain

    f̄(y_k) − f̄(x*) ≤ ξ(x_0, x*)/A_k + ϵ/2,   k ≥ 1,

in view of (4.2). It remains to estimate the growth of the coefficients A_k. In view of Lemma 1, the number of internal steps i_k in Item 2 of (4.1) satisfies the inequality

    2^{i_k} L_k ≤ 2 [ 1/(ϵ τ_k) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}.

Therefore,

    a²_{k+1}/A_{k+1} = 1/(2^{i_k} L_k) ≥ ½ [ ϵ τ_k ]^{(1−ν)/(1+ν)} M_ν^{−2/(1+ν)},

which is a²_{k+1} ≥ ½ [ ϵ a_{k+1} ]^{(1−ν)/(1+ν)} M_ν^{−2/(1+ν)} A_{k+1}^{2ν/(1+ν)}. Thus, we come to the following estimate:

    a_{k+1} ≥ ϵ^{(1−ν)/(1+3ν)} A_{k+1}^{2ν/(1+3ν)} / ( 2^{(1+ν)/(1+3ν)} M_ν^{2/(1+3ν)} ).    (4.4)

Denote B_k = A_k^{γ}, where γ = (1+ν)/(1+3ν) ≥ ½. Since A_{k+1} ≥ A_k, we have

    B_{k+1} − B_k = A_{k+1}^{γ} − A_k^{γ} ≥ (A_{k+1} − A_k)/(A_{k+1}^{1−γ} + A_k^{1−γ}) ≥ a_{k+1}/(2 A_{k+1}^{1−γ}) ≥ ϵ^{(1−ν)/(1+3ν)} / ( 2^{(2+4ν)/(1+3ν)} M_ν^{2/(1+3ν)} ),

where the last inequality follows from (4.4). Thus, we have proved that

    A_k ≥ [ k · ϵ^{(1−ν)/(1+3ν)} / ( 2^{(2+4ν)/(1+3ν)} M_ν^{2/(1+3ν)} ) ]^{(1+3ν)/(1+ν)} = [ ϵ^{1−ν} k^{1+3ν} / ( 2^{2+4ν} M_ν² ) ]^{1/(1+ν)}. □

From the rate of convergence (4.3), we get the following upper bound for the number of iterations which are necessary for getting an ϵ-solution of problem (2.1):

    k ≤ inf_{0≤ν≤1} [ 2^{(3+5ν)/2} M_ν / ϵ ]^{2/(1+3ν)} ξ(x_0, x*)^{(1+ν)/(1+3ν)}.    (4.5)

As compared with (2.19), the dependence of this bound on the smoothness parameters is now optimal.

Same as the gradient methods (2.16) and (3.2), the Fast Gradient Method (4.1) can be equipped with an implementable stopping criterion. Assume that ξ(x_0, x*) ≤ D. Denote l^{pd}_k(y) = (1/A_k) Σ_{i=1}^{k} a_i [ f(x_i) + ⟨∇f(x_i), y − x_i⟩ ], and

    f̂_k = min_{y∈Q} { l^{pd}_k(y) + Ψ(y) : ξ(x_0, y) ≤ D }.

Note that f̄(y_k) ≤ ϵ/2 + (1/A_k) ϕ*_k by (4.2). Using the reasoning presented at the end of Section 3, we obtain

    f̂_k = max_{β≥0} min_{y∈Q} { l^{pd}_k(y) + Ψ(y) + β (ξ(x_0, y) − D) } ≥ (1/A_k) ϕ*_k − (1/A_k) D,

where the last inequality corresponds to the choice β = 1/A_k. Thus, we can use the stopping criterion

    f̄(y_k) − f̂_k ≤ ϵ,    (4.6)

which ensures f̄(y_k) − f̄(x*) ≤ ϵ; it is satisfied as soon as

    A_k ≥ (2/ϵ) D.    (4.7)
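For completeness, here is a similar minimal sketch of method (4.1) in the same simplified setting (Q = Rⁿ, Ψ ≡ 0, Euclidean prox-function); again, all names are illustrative assumptions rather than code from the paper:

```python
import numpy as np

def universal_fgm(f, grad_f, x0, eps, L0=1.0, max_iter=1000):
    """Sketch of the Universal Fast Gradient Method (4.1) for Q = R^n,
    Psi = 0, d(x) = 0.5*||x - x0||^2."""
    y, A, L = x0.copy(), 0.0, L0
    s = np.zeros_like(x0)                # accumulated sum a_i * grad f(x_i)
    for k in range(max_iter):
        i = 0
        while True:
            M = 2.0**i * L
            # a > 0 solving the characteristic equation a^2 = (A + a)/M:
            a = (1.0 + np.sqrt(1.0 + 4.0 * M * A)) / (2.0 * M)
            A_new, tau = A + a, a / (A + a)
            v = x0 - s                   # v_k = argmin phi_k (Euclidean prox)
            x = tau * v + (1.0 - tau) * y
            gx = grad_f(x)
            x_hat = v - a * gx           # argmin { 0.5*||u - v||^2 + a*<gx, u> }
            y_new = tau * x_hat + (1.0 - tau) * y
            # Line-search test of Item 2, with tolerance (eps/2)*tau:
            if f(y_new) <= f(x) + gx @ (y_new - x) \
                    + 0.5 * M * np.dot(y_new - x, y_new - x) + 0.5 * eps * tau:
                break
            i += 1
        s += a * gx                      # update the model phi_{k+1}
        y, A, L = y_new, A_new, 2.0**(i - 1) * L
    return y
```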

It remains to estimate from above the total number of calls of the oracle in method (4.1) which is sufficient for getting an ϵ-solution of problem (2.1). Let us assume that this method is equipped with the stopping criterion (4.6). Then, up to the last iteration, we can be sure that

    A_k ≤ (2/ϵ) D,   k ≥ 0.    (4.8)

Denote by N(k) the total number of calls of the oracle after k iterations. At each iteration of this method we call the oracle 2(i_k + 1) times (at the point x_{k+1,i_k} and at the prediction point y_{k+1,i_k}). Therefore, using the same reasoning as at the end of Section 2, we conclude that

    N(k) = 4(k + 1) + 2 log₂ L_{k+1} − 2 log₂ L_0.    (4.9)

Note that

    L_{k+1} = 2^{i_k − 1} L_k = A_{k+1}/(2 a²_{k+1}) ≤ [ 1/(ϵ τ_k) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)} = [ A_{k+1}/(ϵ a_{k+1}) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}.    (4.10)

Therefore, in view of (4.4) we conclude that

    L_{k+1} ≤ [ 2 A_{k+1} / ϵ² ]^{(1−ν)/(1+3ν)} M_ν^{4/(1+3ν)} ≤ [ 4D / ϵ³ ]^{(1−ν)/(1+3ν)} M_ν^{4/(1+3ν)},

where the last inequality follows from (4.8). Substituting this estimate into expression (4.9), we obtain that on average method (4.1) performs at most four calls of the oracle per iteration.

5 Numerical experiments

In our numerical experiments we tried to check the actual level of adaptivity of the above methods to the local topological structure of the objective function. For that, we have chosen three families of nonsmooth minimization problems.

1. Matrix games. In this problem, given an n × m payoff matrix A, we need to find a saddle point of the following problem:

    min_{x∈Δ_n} max_{y∈Δ_m} ⟨x, Ay⟩ = min_{x∈Δ_n} { ψ_p(x) := max_{1≤j≤m} ⟨x, Ae_j⟩ } = max_{y∈Δ_m} { ψ_d(y) := min_{1≤i≤n} ⟨e_i, Ay⟩ },    (5.1)

where e_{(·)} denote the basis vectors in the corresponding spaces, and Δ_{(·)} denotes the standard simplex. This problem can be posed as a minimization problem

    min_{x∈Δ_n, y∈Δ_m} { ψ_{pd}(x, y) = ψ_p(x) − ψ_d(y) }.    (5.2)

The optimal value of this problem is zero. For our experiments, we generated the matrix A randomly, with uniform distribution of its entries in the interval [−1, 1]. For the feasible set of this problem, F = { z = (x, y) : x ∈ Δ_n, y ∈ Δ_m }, a natural prox-function is the entropy:

    η(z) = Σ_{i=1}^{n+m} z^{(i)} ln z^{(i)}.

This function is strongly convex with respect to the ℓ₁-norm with convexity parameter one. Note that the ℓ₁-norm is very good for measuring simplexes. Consequently, we measure the subgradients of the objective function in (5.2) in the dual ℓ∞-norm. In view of our strategy for generating the matrix A, we get a Lipschitz-continuous function ψ_{pd} with constant equal to one. We will refer to the methods based on the entropy function as methods with the Entropy Setup. If a method uses the standard Euclidean norm, we say that it is based on the Euclidean Setup.

In the table below, we give computational results for two universal methods, the Fast Gradient Method (4.1) and the Primal Gradient Method (2.16), both with the Entropy Setup. In our problem instance, n = 896 and m = 128.

[Table (5.3): for each required accuracy (Eps), the number of iterations, the upper estimate of the achieved accuracy, and the current level of the Lipschitz constant, for FGM and PGM with Entropy Setup; the numerical entries are not recoverable here.]

In the first column we indicate the required accuracy. For each method, there are three subcolumns. The first one indicates the number of iterations. The second one shows the upper estimate for the achieved accuracy.² The third one shows the current level of the Lipschitz constant, generated by the method. Note that per one iteration of FGM we need on average four calls of the oracle. PGM needs on average only two calls. It is clear that both methods behave much better than the worst-case complexity bounds.

² Since in this problem the optimal value is known, we use it in the stopping criterion.
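One reason the Entropy Setup is attractive here is that the corresponding Bregman mapping on the simplex is available in closed form as a multiplicative update. The following sketch shows this standard formula (an illustration added here, not code from the paper):

```python
import numpy as np

def entropy_bregman_step(x, g, M):
    """Bregman mapping B_M(x) on the standard simplex for the entropy
    prox-function: minimize <g, y> + M * sum_i y_i*ln(y_i/x_i) over the
    simplex. The solution is a multiplicative weights update."""
    w = x * np.exp(-g / M)
    return w / w.sum()                   # renormalize onto the simplex
```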

For FGM, decreasing the required accuracy by a factor of two results in a doubling of the number of iterations. This dependence is typical for complexity bounds of the type O(1/ϵ), and not for the theoretical bound O(1/ϵ²). For PGM, the average increase is higher (approximately a factor of three). But it is still much better than the theoretical bound.

From table (5.3) we can see that FGM generates a much better model of the objective function. Its accuracy estimate is usually only about twice the actual residual. The estimates of PGM are much weaker. One possible reason for this difference is the much more aggressive behavior of FGM in introducing new gradients into the model. Finally, the third subcolumn shows that the actual level of the Lipschitz constants is much lower than the theoretical prediction.

Let us look now at how efficient FGM is in solving the same problem with the Euclidean Setup. These results are presented in the left part of table (5.4).

[Table (5.4): left part, FGM with Euclidean Setup (the most accurate runs terminated "out of time"); right part, WDA with Entropy Setup; the numerical entries are not recoverable here.]

They confirm that the right choice of prox-function is crucial for the efficient solution of optimization problems. The behavior of FGM with the Euclidean Setup corresponds exactly to the worst-case theoretical bound O(1/ϵ²) for Lipschitz-continuous functions (the number of iterations increases by a factor of four after dividing the accuracy by two).

In the right part of this table we present the results of a standard black-box subgradient scheme as applied to the same problem. This is Weighted Dual Averaging (WDA) [11] with the Entropy Setup. For choosing its parameters correctly, we need to know only an estimate of the diameter of the feasible set. Each iteration of this method needs one call of the oracle. For our problem, WDA works in exact correspondence to its worst-case complexity bound O(1/ϵ²). The second column of this part demonstrates that the lower bound generated by this scheme is almost exact.

2. Continuous Steiner problem. In this problem we are given centers a_i ∈ Rⁿ, i = 1, ..., m. It is necessary to find the optimal location of a service center x, which minimizes the total distance to all other centers. Thus, our problem is as follows:

    min_{x∈Q} { f(x) := Σ_{i=1}^{m} ‖x − a_i‖ },    (5.5)

where Q ⊆ Rⁿ is a closed convex set. All norms in this problem are Euclidean.

Clearly, the level of smoothness of problem (5.5) is much higher than that of (5.2). So, we can expect it to be easier for the universal schemes. Let us look at the results of the experiments for a random problem with n = 256, m = 512, and Q = Rⁿ₊. We choose m > n in order to increase the density of nonsmooth points. The centers were generated randomly in the box 0 ≤ x^{(i)} ≤ 1/n^{1/2}, i = 1, ..., n (which has Euclidean diameter one). All methods have the origin as a starting point. [The initial objective value f_0 and the optimal value found by the schemes are not recoverable here.] The table below has the same structure as (5.3).

[Table (5.6): FGM and PGM with Euclidean Setup (the most accurate PGM run terminated "out of time"); the numerical entries are not recoverable here.]

Note that the rate of convergence of FGM (4.1) is unexpectedly high. Increasing the accuracy by a factor of four results in doubling the number of iterations. From the complexity point of view, this corresponds to the level O(1/ϵ^{1/2}), which is typical for Fast Gradient Methods of smooth minimization. The accuracy predicted by FGM is still very good, and the level of the Lipschitz constants is unexpectedly small.

The results of PGM (2.16) are not so impressive. It doubles the number of iterations after dividing the accuracy by two, which corresponds to the O(1/ϵ) level of complexity. It seems that a weak point of this method is the quality of the termination criterion.
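For problem (5.5) the oracle is particularly cheap: a subgradient of f is the sum of unit vectors pointing from the centers a_i towards x, with the zero vector being a valid choice for any term with x = a_i. A small illustrative helper (with assumed names, not from the paper):

```python
import numpy as np

def steiner_oracle(x, centers):
    """Value and one subgradient of f(x) = sum_i ||x - a_i|| from (5.5).
    centers has shape (m, n)."""
    diffs = x[None, :] - centers           # x - a_i, shape (m, n)
    norms = np.linalg.norm(diffs, axis=1)  # ||x - a_i||
    nz = norms > 0                         # zero is a subgradient of ||.|| at 0
    g = (diffs[nz] / norms[nz, None]).sum(axis=0)
    return norms.sum(), g
```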

3. Universal methods and smoothing technique. Let us compare the practical performance of method (4.1) as applied to the primal version of problem (5.1),

    min_{x∈Δ_n} ψ_p(x),    (5.7)

with its performance as applied to the smoothed version of this function,

    ψ̃_p(x) = µ ln ( Σ_{j=1}^{m} e^{⟨x, Ae_j⟩/µ} ).

The value of the smoothing parameter µ > 0 for this function is chosen in accordance with the theoretical recommendation (4.8) in [9]. For our experiments, we choose n = m = 512 and apply FGM with the Entropy Setup.

[Table (5.8): FGM applied to ψ_p and to the smoothed ψ̃_p (the most accurate runs on ψ_p terminated "out of time"); the numerical entries are not recoverable here.]

These results confirm that smoothing is still a very powerful technique, whose computational efficiency is often much higher than that of black-box methods.
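Numerically, ψ̃_p is best evaluated through a stable log-sum-exp; the sketch below (assumed names; the stable evaluation is an implementation detail added here) illustrates the smoothed oracle:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def smoothed_psi_p(x, A, mu):
    """Smoothed primal objective mu * ln(sum_j exp(<x, A e_j>/mu)) and its
    gradient A @ softmax(A^T x / mu), evaluated in a numerically stable way."""
    z = A.T @ x / mu                       # <x, A e_j>/mu for each column j
    return mu * logsumexp(z), A @ softmax(z)
```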

References

[1] A. Beck, M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202 (2009).

[2] O. Devolder, F. Glineur, Yu. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Accepted by Mathematical Programming.

[3] K.-H. Elster (ed.). Modern Mathematical Methods in Optimization. Akademie Verlag, Berlin, 1993.

[4] G. Lan. Level methods uniformly optimal for composite and structured nonsmooth convex optimization. Submitted to Mathematical Programming, April 2011.

[5] C. Lemaréchal, A. Nemirovskii, Yu. Nesterov. New variants of bundle methods. Mathematical Programming, 69 (1995), 111-147.

[6] A. Nemirovskii, Yu. Nesterov. Optimal methods for smooth convex optimization. Zh. Vychisl. Mat. i Mat. Fiz., 25(3) (1985), in Russian.

[7] A. Nemirovsky, D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, New York (1983).

[8] Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.

[9] Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming (A), 103(1), 127-152 (2005).

[10] Yu. Nesterov. Gradient methods for minimizing composite functions. CORE DP 2007/76. Published in Mathematical Programming, December 2012.

[11] Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 221-259 (August 2009).


More information

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE Journal of Applied Analysis Vol. 6, No. 1 (2000), pp. 139 148 A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE A. W. A. TAHA Received

More information

More First-Order Optimization Algorithms

More First-Order Optimization Algorithms More First-Order Optimization Algorithms Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 3, 8, 3 The SDM

More information

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean Renato D.C. Monteiro B. F. Svaiter March 17, 2009 Abstract In this paper we analyze the iteration-complexity

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems Amir Beck and Shoham Sabach July 6, 2011 Abstract We consider a general class of convex optimization problems

More information

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Boris T. Polyak Andrey A. Tremba V.A. Trapeznikov Institute of Control Sciences RAS, Moscow, Russia Profsoyuznaya, 65, 117997

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

15-850: Advanced Algorithms CMU, Fall 2018 HW #4 (out October 17, 2018) Due: October 28, 2018

15-850: Advanced Algorithms CMU, Fall 2018 HW #4 (out October 17, 2018) Due: October 28, 2018 15-850: Advanced Algorithms CMU, Fall 2018 HW #4 (out October 17, 2018) Due: October 28, 2018 Usual rules. :) Exercises 1. Lots of Flows. Suppose you wanted to find an approximate solution to the following

More information

Primal Solutions and Rate Analysis for Subgradient Methods

Primal Solutions and Rate Analysis for Subgradient Methods Primal Solutions and Rate Analysis for Subgradient Methods Asu Ozdaglar Joint work with Angelia Nedić, UIUC Conference on Information Sciences and Systems (CISS) March, 2008 Department of Electrical Engineering

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

arxiv: v2 [math.oc] 1 Jul 2015

arxiv: v2 [math.oc] 1 Jul 2015 A Family of Subgradient-Based Methods for Convex Optimization Problems in a Unifying Framework Masaru Ito (ito@is.titech.ac.jp) Mituhiro Fukuda (mituhiro@is.titech.ac.jp) arxiv:403.6526v2 [math.oc] Jul

More information

Efficiency of minimizing compositions of convex functions and smooth maps

Efficiency of minimizing compositions of convex functions and smooth maps Efficiency of minimizing compositions of convex functions and smooth maps D. Drusvyatskiy C. Paquette Abstract We consider global efficiency of algorithms for minimizing a sum of a convex function and

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical

More information

PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT

PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT Linear and Nonlinear Analysis Volume 1, Number 1, 2015, 1 PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT KAZUHIRO HISHINUMA AND HIDEAKI IIDUKA Abstract. In this

More information

WEAK CONVERGENCE OF RESOLVENTS OF MAXIMAL MONOTONE OPERATORS AND MOSCO CONVERGENCE

WEAK CONVERGENCE OF RESOLVENTS OF MAXIMAL MONOTONE OPERATORS AND MOSCO CONVERGENCE Fixed Point Theory, Volume 6, No. 1, 2005, 59-69 http://www.math.ubbcluj.ro/ nodeacj/sfptcj.htm WEAK CONVERGENCE OF RESOLVENTS OF MAXIMAL MONOTONE OPERATORS AND MOSCO CONVERGENCE YASUNORI KIMURA Department

More information

Cubic regularization of Newton method and its global performance

Cubic regularization of Newton method and its global performance Math. Program., Ser. A 18, 177 5 (6) Digital Object Identifier (DOI) 1.17/s117-6-76-8 Yurii Nesterov B.T. Polyak Cubic regularization of Newton method and its global performance Received: August 31, 5

More information

A Proximal Method for Identifying Active Manifolds

A Proximal Method for Identifying Active Manifolds A Proximal Method for Identifying Active Manifolds W.L. Hare April 18, 2006 Abstract The minimization of an objective function over a constraint set can often be simplified if the active manifold of the

More information

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L.

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L. McMaster University Advanced Optimization Laboratory Title: A Proximal Method for Identifying Active Manifolds Authors: Warren L. Hare AdvOl-Report No. 2006/07 April 2006, Hamilton, Ontario, Canada A Proximal

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1

An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1 An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1 Nobuo Yamashita 2, Christian Kanzow 3, Tomoyui Morimoto 2, and Masao Fuushima 2 2 Department of Applied

More information