Universal Gradient Methods for Convex Optimization Problems


CORE DISCUSSION PAPER 2013/26

Universal Gradient Methods for Convex Optimization Problems

Yu. Nesterov *

April 8, 2013; revised June 2, 2013

Abstract

In this paper, we present new methods for black-box convex minimization. They do not need to know in advance the actual level of smoothness of the objective function. Their only essential input parameter is the required accuracy of the solution. At the same time, for each particular problem class they automatically ensure the best possible rate of convergence. We confirm our theoretical results by encouraging numerical experiments, which demonstrate that the fast rate of convergence, typical for smooth optimization problems, can sometimes be achieved even on nonsmooth problem instances.

Keywords: convex optimization, black-box methods, complexity bounds, optimal methods, weakly smooth functions.

* Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium. The research results presented in this paper have been supported by a grant "Action de recherche concertée" ARC 04/09-315 from the Direction de la recherche scientifique - Communauté française de Belgique. The author also acknowledges support from the Laboratory of Structural Methods of Data Analysis in Predictive Modelling, through an RF government grant.

1 Introduction

Motivation. In Convex Optimization, the majority of numerical schemes are developed for particular problem classes. In the Black-Box framework, the two main classes of convex problems, the smooth problems and the nonsmooth ones, are treated by completely different techniques. This separation looks very natural. Indeed, differentiable functions allow constructing monotone minimization sequences, for which convergence results can be easily obtained. Smooth functions can be locally approximated by first- and second-order models, which are very helpful in developing efficient minimization schemes. The class of nonsmooth convex functions looks much more difficult. For them, there is no hope of getting a good local approximation model. It is very difficult to construct relaxation sequences. Moreover, even if a descent direction is found, there is no guarantee that we can advance along it by a sufficiently long step. Therefore, all methods for nonsmooth convex optimization rely only on separation properties. Cutting planes provide us with information about the half-spaces containing the optimal solution. Using this very restricted knowledge, it is still possible to develop some optimization methods. But their computational abilities are incomparably weaker than those of smooth minimization schemes.

The above observations are confirmed by theoretical results. It is well known that for the class of smooth problems C^{1,1}(R^n), composed of functions with Lipschitz-continuous gradients, the optimal iteration complexity bound for finding an ϵ-solution of the corresponding optimization problem by a first-order method is of the order O(1/ϵ^{1/2}). For nonsmooth problems from the class C^{1,0}(R^n), where we can rely only on Lipschitz continuity of function values, such a bound is established at the level of O(1/ϵ²) (see, e.g., [8]).

Such a big difference in the complexity bounds stimulated interest in intermediate classes of convex problems. One of the possibilities consists in considering functions from the class C^{1,ν}(R^n), ν ∈ [0, 1], which have Hölder-continuous gradients:

    ‖∇f(x) − ∇f(y)‖ ≤ L_ν ‖x − y‖^ν,   x, y ∈ R^n.    (1.1)

General Complexity Theory [7] established for this class the following lower iteration complexity bound:

    O( ( L_ν R^{1+ν} / ϵ )^{2/(1+3ν)} ),    (1.2)

where R is the distance from the starting point to the solution. The first optimal methods for such problems were developed in [6].¹ The main advantage of these schemes is an automatic adjustment to the proper level of the smoothness parameter ν. However, these methods need to know other characteristics of the problem (an estimate of the Lipschitz constant L_ν, an estimate of the distance to the optimum), which are not readily available. Moreover, it was necessary to decide in advance on the total number of steps of the method. This requirement is not very practical. Indeed, in order to make such a decision, we need to know the rate of convergence of the method. However, this is possible only if we know the Hölder parameter. This hidden contradiction probably explains why these theoretically attractive procedures were never seriously tested in computational practice.

¹ An English translation of this paper was included in Section 2.3 of [3].

In the last decade, we have seen a restoration of interest in gradient methods. New problem settings in image processing, data mining, and statistics require computationally cheap minimization procedures, which can quickly deliver approximate solutions of moderate accuracy. This demand was satisfied by new families of problem-oriented methods (e.g. [9], [10], [11]), which increase the rate of convergence of gradient schemes far above the limits of Black-Box Complexity Theory [7]. This can be done, of course, only by an appropriate use of problem structure, violating one of the main assumptions of the Black-Box concept.

However, it appears that Black-Box methods have not yet reached the limits of their performance. The old idea of automatic adjustment to the Hölder parameter was revived in [4], where a new version of the Level Method [5] was adapted to smooth problems, ensuring the best possible complexity bounds for all values of the smoothness parameter. The only drawback of this approach is the high iteration cost of the Level Method.

Minimization of functions with Hölder-continuous gradient was discussed in [2] in the framework of inexact oracle. It was shown that the answer (f(x), ∇f(x)) of an exact oracle for a convex function satisfying the Hölder condition (1.1) can be treated as inexact information for some function from C^{1,1}(R^n):

    0 ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ δ + (L/2) ‖y − x‖²,   x, y ∈ R^n,    (1.3)

where L and δ are some inexactness parameters. It was shown that these parameters can be chosen as appropriate functions of ν. Therefore, functions from C^{1,ν}(R^n) can be minimized by an inexact version of the Fast Gradient Methods for C^{1,1}(R^n). The resulting complexity bounds appear to be optimal (1.2). However, in order to apply this approach, we need to employ a lot of additional information (the value of the parameter ν, the constant L_ν, the distance estimate R, and the total number of steps of the method).

In this paper, we construct new universal methods for minimizing functions with Hölder-continuous gradient. They do not need a priori knowledge of the parameter ν, and they have a cheap cost of one iteration. In order to solve the problem

    min_{x∈Q} f(x)    (1.4)

by universal methods, we suggest using a Damped Relaxation (DR) condition

    f(x⁺) ≤ δ + min_{y∈Q} [ f(x̄) + ⟨∇f(x̄), y − x̄⟩ + (L̂/2) ‖y − x̄‖² ],    (1.5)

where the tolerance parameter δ > 0 depends only on the required accuracy ϵ > 0 of the final approximate solution. Similar conditions were used in [6] and [2], with δ being a function of the smoothness parameter ν and of the total number of iterations. We show that all necessary information on ν and L_ν can be accumulated in the constant L̂, which can be easily adapted by an appropriate line-search strategy.

For different methods, the dependence of δ on ϵ must be different. For the simplest Primal and Dual Gradient Methods, it is enough to take δ = ϵ/2. For the Fast Gradient Method [9], we use condition (1.5) with a much smaller value of δ, allowing us to maintain a damped version of the estimating sequence condition

    A_k ( f(x_k) − ϵ/2 ) ≤ min_{x∈Q} ϕ_k(x)    (1.6)

(see Section 4 for details). All our methods are developed for composite minimization problems [10], whose space of variables is endowed with an arbitrary norm. Hence, we apply the machinery of Bregman distances. Our methods adjust automatically to the actual level of smoothness of the smooth part of the objective function. The only essential input parameter for these schemes is the required accuracy ϵ > 0.

Contents. The paper is organized as follows. In Section 2 we introduce the problem formulation and discuss the main properties of the Bregman mapping as applied to functions with Hölder-continuous gradients. After that, we prove a convergence result for the Universal Primal Gradient Method and derive its complexity bound. We show that this method needs on average at most two calls of the oracle per iteration. Moreover, this method can be equipped with a reliable stopping criterion. In Section 3, we prove similar results for the Universal Dual Gradient Method. This method needs on average four calls of the oracle per iteration. Both these methods are based on the DR-condition (1.5). In Section 4, in order to derive the Universal Fast Gradient Method, we introduce condition (1.6). We show that this scheme is uniformly optimal for minimizing a composite function whose smooth part has Hölder-continuous gradients. This scheme has a reliable stopping criterion. It needs on average four calls of the oracle per iteration. In Section 5, we present preliminary computational results. We consider three families of random test problems. All of them are nonsmooth problems with Lipschitz-continuous objective function. It is shown that quite often the Universal Fast Gradient Method is able to accelerate and demonstrate the rate of convergence typical for smooth minimization schemes. The choice of appropriate norms is always very important.

Notation. In what follows, we work in a finite-dimensional linear vector space E. Its dual space, the space of all linear functions on E, is denoted by E*. For x ∈ E and s ∈ E*, we denote by ⟨s, x⟩ the value of the linear function s at x. For the (primal) space E, we introduce a norm ‖·‖. Then the dual norm is defined in the standard way:

    ‖s‖* := max_{x∈E} { ⟨s, x⟩ : ‖x‖ ≤ 1 }.

Finally, for a convex function f : dom f → R with dom f ⊆ E we denote by ∇f(x) ∈ E* one of its subgradients.

2 Universal Primal Gradient Method

Consider the following minimization problem:

    min_{x∈Q} [ f̄(x) := f(x) + Ψ(x) ],    (2.1)

where Q is a simple closed convex set, and Ψ is a simple closed convex function. The function f is assumed to be subdifferentiable on Q. In order to characterize the variability of its (sub)gradients, we introduce the following values:

    M_ν = M_ν(f) := sup_{x,y∈Q, x≠y} ‖∇f(x) − ∇f(y)‖* / ‖x − y‖^ν,   ν ≥ 0.    (2.2)

This condition can be rewritten in the form

    ln M_ν = sup_{x,y∈Q, x≠y} [ ln ‖∇f(x) − ∇f(y)‖* − ν ln ‖x − y‖ ].

Thus, M_ν is a log-convex function of ν. For certain ν ∈ [0, 1], the constant M_ν can be infinite. However, if M_0 and M_1 are finite, then

    M_ν ≤ M_0^{1−ν} M_1^{ν},   0 ≤ ν ≤ 1.    (2.3)

In any case, if M_ν < ∞, then

    ‖∇f(x) − ∇f(y)‖* ≤ M_ν ‖x − y‖^ν,   x, y ∈ Q.    (2.4)

This inequality ensures that

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (M_ν/(1+ν)) ‖x − y‖^{1+ν},   x, y ∈ Q.    (2.5)

Our main assumption is as follows.

Assumption 1. M̂(f) := inf_{0≤ν≤1} M_ν(f) < +∞.

For solving problem (2.1), we introduce a prox-function d(x). This is a differentiable strongly convex function with convexity parameter equal to one:

    d(y) ≥ d(x) + ⟨∇d(x), y − x⟩ + ½ ‖x − y‖²,   x, y ∈ rint Q.    (2.6)

We assume that d(x) attains its minimum on Q at some point x_0, and d(x_0) = 0. Thus, in view of (2.6),

    d(x) ≥ ½ ‖x − x_0‖²,   x ∈ Q.    (2.7)

This prox-function defines the Bregman distance

    ξ(x, y) := d(y) − d(x) − ⟨∇d(x), y − x⟩.

Clearly, ξ(x, x) = 0, and, in view of (2.6),

    ξ(x, y) ≥ ½ ‖x − y‖²,   x, y ∈ Q.    (2.8)

Now for any x ∈ Q we can define the Bregman mapping

    B_M(x) = arg min_{y∈Q} { ψ_M(x, y) := f(x) + ⟨∇f(x), y − x⟩ + M ξ(x, y) + Ψ(y) }.    (2.9)

We assume that this point is easily computable either in a closed form, or by some cheap computational procedure. The first-order optimality condition for the optimization problem in (2.9) is as follows:

    ⟨∇f(x) + M (∇d(B_M(x)) − ∇d(x)) + ∇Ψ(B_M(x)), y − B_M(x)⟩ ≥ 0,   y ∈ Q.    (2.10)

Denote ψ*_M(x) = ψ_M(x, B_M(x)).
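As a small illustration of these definitions (the helper below is added here and is not from the paper), the Bregman distance can be computed generically from d and ∇d; for the Euclidean prox-function d(x) = ½‖x‖² it reduces to the usual ½‖x − y‖²:

```python
import numpy as np

def bregman(d, grad_d, x, y):
    """Bregman distance xi(x, y) = d(y) - d(x) - <grad d(x), y - x>."""
    return d(y) - d(x) - grad_d(x) @ (y - x)

# For d(x) = 0.5*||x||^2 the Bregman distance is 0.5*||x - y||^2:
d = lambda x: 0.5 * (x @ x)
grad_d = lambda x: x
x, y = np.array([1.0, 2.0]), np.array([0.0, 1.0])
assert np.isclose(bregman(d, grad_d, x, y), 0.5 * np.dot(x - y, x - y))
```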

Lemma 1. Let the function f satisfy condition (2.4). Then for any δ > 0 and

    M ≥ [ 1/δ ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}    (2.11)

we have

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (M/2) ‖y − x‖² + δ/2,   x, y ∈ Q.    (2.12)

Therefore,

    f(B_M(x)) ≤ ψ*_M(x) + δ/2.    (2.13)

Proof: It is well known that for all nonnegative τ and s we have τ^p/p + s^q/q ≥ τ s, where p, q ≥ 1 and 1/p + 1/q = 1. Therefore, taking p = 2/(1+ν), q = 2/(1−ν), and τ = t^{1+ν}, we get

    t^{1+ν} ≤ ((1+ν)/(2s)) t² + ((1−ν)/2) s^{(1+ν)/(1−ν)},   s > 0, t ≥ 0.

Denote δ = ((1−ν)/(1+ν)) M_ν s^{(1+ν)/(1−ν)}. Then s = [ ((1+ν)/(1−ν)) δ / M_ν ]^{(1−ν)/(1+ν)}. Therefore,

    (M_ν/(1+ν)) t^{1+ν} ≤ (M_ν/(2s)) t² + δ/2 = ½ [ (1−ν)/((1+ν) δ) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)} t² + δ/2 ≤ (M/2) t² + δ/2,    (2.14)

where the last inequality follows from (2.11). This inequality, together with (2.5), justifies (2.12). Further, denoting x⁺ = B_M(x), we obtain:

    f(x⁺) ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + (M_ν/(1+ν)) ‖x⁺ − x‖^{1+ν}   (by (2.5))
         ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + (M/2) ‖x⁺ − x‖² + δ/2   (by (2.14))
         ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + M ξ(x, x⁺) + δ/2   (by (2.8)).

Therefore, f̄(x⁺) = f(x⁺) + Ψ(x⁺) ≤ ψ*_M(x) + δ/2. □

Note that the right-hand side of inequality (2.11) is continuous in ν. As ν → 1, it becomes

    M ≥ M_1.    (2.15)
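It may help to write out the two endpoints of condition (2.11) explicitly (an added illustration, not part of the original numbering):

```latex
% Endpoints of (2.11), M >= (1/\delta)^{(1-\nu)/(1+\nu)} M_\nu^{2/(1+\nu)}:
\[
\nu = 0:\quad M \;\ge\; \frac{M_0^2}{\delta},
\qquad\qquad
\nu = 1:\quad M \;\ge\; M_1 .
\]
```

In the nonsmooth case the required "Lipschitz constant" grows as the tolerance δ shrinks, while in the smooth case it is independent of δ.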

Let us look now at the simplest Universal Primal Gradient Method, equipped with a backtracking line-search procedure with restore. Denote by x* the optimal solution to (2.1).

Universal Primal Gradient Method (PGM)    (2.16)

Initialization. Choose L_0 > 0 and accuracy ϵ > 0.

For k ≥ 0 do:

1. Find the smallest i_k ≥ 0 such that for x⁺_k := B_{2^{i_k} L_k}(x_k) we have

       f(x⁺_k) ≤ f(x_k) + ⟨∇f(x_k), x⁺_k − x_k⟩ + 2^{i_k − 1} L_k ‖x⁺_k − x_k‖² + ϵ/2.

2. Set x_{k+1} = B_{2^{i_k} L_k}(x_k) and L_{k+1} = 2^{i_k − 1} L_k.

Denote γ(M, ϵ) := [1/ϵ]^{(1−ν)/(1+ν)} M^{2/(1+ν)}, S_k = Σ_{i=0}^{k} 1/L_{i+1}, and f̄*_k = (1/S_k) Σ_{i=0}^{k} (1/L_{i+1}) f̄(x_{i+1}).

Theorem 1. Let f satisfy condition (2.4). Assume that L_0 ≤ γ(M_ν, ϵ). Then for all k ≥ 0 we have L_{k+1} ≤ γ(M_ν, ϵ). Moreover, for all y ∈ Q,

    S_k f̄*_k ≤ Σ_{i=0}^{k} (1/L_{i+1}) [ f(x_i) + ⟨∇f(x_i), y − x_i⟩ + Ψ(y) + ϵ/2 ] + 2 ξ(x_0, y).    (2.17)

Therefore, f̄*_k − f̄(x*) ≤ ϵ/2 + (2 γ(M_ν, ϵ)/(k+1)) ξ(x_0, x*).

Proof: In view of Lemma 1, the line-search procedure of Step 1 in method (2.16) is well defined, and

    2 L_{k+1} = 2^{i_k} L_k ≤ 2 γ(M_ν, ϵ).    (2.18)

Let us fix an arbitrary point y ∈ Q and denote r_k(y) := ξ(x_k, y). Then, applying the optimality condition (2.10) at x_k with M = 2L_{k+1},

    r_{k+1}(y) = d(y) − d(x_{k+1}) − ⟨∇d(x_{k+1}), y − x_{k+1}⟩
              ≤ d(y) − d(x_{k+1}) − ⟨∇d(x_k), y − x_{k+1}⟩ + (1/(2L_{k+1})) ⟨∇f(x_k) + ∇Ψ(x_{k+1}), y − x_{k+1}⟩.

Note that, in view of (2.6),

    d(y) − d(x_{k+1}) − ⟨∇d(x_k), y − x_{k+1}⟩ = r_k(y) − [ d(x_{k+1}) − d(x_k) − ⟨∇d(x_k), x_{k+1} − x_k⟩ ] ≤ r_k(y) − ½ ‖x_{k+1} − x_k‖².

Thus,

    r_{k+1}(y) ≤ r_k(y) + (1/(2L_{k+1})) ⟨∇f(x_k) + ∇Ψ(x_{k+1}), y − x_{k+1}⟩ − ½ ‖x_{k+1} − x_k‖²
             = r_k(y) + (1/(2L_{k+1})) ⟨∇Ψ(x_{k+1}), y − x_{k+1}⟩ − (1/(2L_{k+1})) ( ⟨∇f(x_k), x_{k+1} − x_k⟩ + L_{k+1} ‖x_{k+1} − x_k‖² ) + (1/(2L_{k+1})) ⟨∇f(x_k), y − x_k⟩
             ≤ r_k(y) + (1/(2L_{k+1})) ( Ψ(y) − Ψ(x_{k+1}) + f(x_k) − f(x_{k+1}) + ϵ/2 + ⟨∇f(x_k), y − x_k⟩ ),

where the last inequality uses the convexity of Ψ and the termination test of Step 1. Thus, we obtain the following inequality:

    (1/(2L_{k+1})) f̄(x_{k+1}) + r_{k+1}(y) ≤ (1/(2L_{k+1})) ( f(x_k) + ⟨∇f(x_k), y − x_k⟩ + Ψ(y) + ϵ/2 ) + r_k(y).

Summing up these inequalities and dropping r_{k+1}(y) ≥ 0, we obtain (2.17). It remains to use inequality (2.18). □

It is important that method (2.16) does not include ν as a parameter. Therefore, in view of Theorem 1, in order to get an ϵ-solution of problem (2.1) we need at most

    inf_{0≤ν≤1} 4 ξ(x_0, x*) ( M_ν / ϵ )^{2/(1+ν)}    (2.19)

iterations. In this estimate, among all classes of functions with Hölder-continuous gradient, we choose the class which best fits our particular objective function. Note that the expression (2.19) is log-quasiconvex in ν. Hence, if M_0 and M_1 are finite, there are good chances that the optimal ν belongs to the interval (0, 1).

Inequality (2.17) gives us a reliable stopping criterion for method (2.16). Indeed, assume we have a bound for the size of the optimal solution:

    ξ(x_0, x*) ≤ D.    (2.20)

Denote l^p_k(y) := (1/S_k) Σ_{i=0}^{k} (1/L_{i+1}) [ f(x_i) + ⟨∇f(x_i), y − x_i⟩ ], and define

    f̂_k = min_{y∈Q} { l^p_k(y) + Ψ(y) : ξ(x_0, y) ≤ D }.

Then

    f̄*_k − f̄(x*) ≤ f̄*_k − f̂_k ≤ ϵ/2 + (2 γ(M_ν, ϵ)/(k+1)) D.    (2.21)

Note that f̂_k can be computed. Thus, inequality (2.21) provides us with an implementable stopping criterion f̄*_k − f̂_k ≤ ϵ.
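To make the scheme concrete, here is a minimal sketch of method (2.16) for the simplest setup Q = Rⁿ, Ψ ≡ 0, d(x) = ½‖x − x₀‖², in which the Bregman mapping B_M(x) is an ordinary gradient step. The function handles and parameter names are illustrative assumptions, not part of the paper:

```python
import numpy as np

def universal_pgm(f, grad_f, x0, eps, L0=1.0, max_iter=1000):
    """Sketch of the Universal Primal Gradient Method (2.16),
    assuming Q = R^n, Psi = 0, and the Euclidean prox-function,
    so that B_M(x) = x - grad_f(x)/M."""
    x, L = x0.copy(), L0
    for k in range(max_iter):
        fx, gx = f(x), grad_f(x)
        i = 0
        while True:
            M = 2.0**i * L
            x_plus = x - gx / M          # Bregman mapping B_M(x)
            # Damped relaxation test of Step 1, with tolerance eps/2:
            if f(x_plus) <= fx + gx @ (x_plus - x) \
                    + 0.5 * M * np.dot(x_plus - x, x_plus - x) + 0.5 * eps:
                break
            i += 1
        x, L = x_plus, 2.0**(i - 1) * L  # L_{k+1} = 2^{i_k - 1} L_k
    return x
```

Note how L is halved whenever the test passes immediately (i_k = 0): this is what allows the estimate of the "Lipschitz constant" to decrease again, so that the method adapts to the actual local level of smoothness.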

Finally, let us estimate N(k), the total number of computations of function values in method (2.16) after k iterations. Note that L_{k+1} = 2^{i_k − 1} L_k. Therefore, i_k = 1 + log₂ (L_{k+1}/L_k). Hence, for any ν ∈ [0, 1], we have

    N(k) = Σ_{j=0}^{k} (i_j + 1) = 2(k + 1) + log₂ L_{k+1} − log₂ L_0
         ≤ 2(k + 1) + ((1−ν)/(1+ν)) log₂ (1/ϵ) + (2/(1+ν)) log₂ M_ν − log₂ L_0,

where the last inequality follows from (2.18). Finally, we come to the following upper bound:

    N(k) ≤ 2(k + 1) − log₂ L_0 + inf_{0≤ν≤1} [ ((1−ν)/(1+ν)) log₂ (1/ϵ) + (2/(1+ν)) log₂ M_ν ].    (2.22)

Thus, on average, up to negligible logarithmic terms, method (2.16) requires two computations of function values per iteration.

The complexity estimate (2.19) is optimal only for ν = 0. In Section 4 we show that much better (and optimal) bounds can be achieved by a fast gradient scheme.

3 Universal Dual Gradient Method

The Dual Gradient Method is based on updating a simple model of the objective function of problem (2.1). Its justification is based on the following simple result.

Lemma 2. Let ϕ : Q → R ∪ {+∞} be a convex function such that for some M ≥ 0 the difference ϕ(x) − M d(x) is subdifferentiable on Q. Denote x̄ = arg min_{x∈Q} ϕ(x). Then

    ϕ(y) ≥ ϕ(x̄) + M ξ(x̄, y),   y ∈ Q.    (3.1)

Proof: Denote F(y) = ϕ(y) − M d(y). Then ⟨∇F(x̄) + M ∇d(x̄), y − x̄⟩ ≥ 0 for all y ∈ Q. Therefore,

    ϕ(y) = F(y) + M d(y) ≥ F(x̄) + ⟨∇F(x̄), y − x̄⟩ + M d(y)
         ≥ F(x̄) − M ⟨∇d(x̄), y − x̄⟩ + M d(y) = ϕ(x̄) + M ξ(x̄, y). □

Universal Dual Gradient Method (DGM)    (3.2)

Initialization. Choose L_0 > 0. Define ϕ_0(x) = ξ(x_0, x).

For k ≥ 0 do:

1. Find the smallest i_k ≥ 0 such that for the point

       x_{k,i_k} = arg min_{x∈Q} { ϕ_k(x) + (1/(2^{i_k} L_k)) [ f(x_k) + ⟨∇f(x_k), x − x_k⟩ + Ψ(x) ] }

   we have

       f̄( B_{2^{i_k} L_k}(x_k) ) ≤ ψ*_{2^{i_k} L_k}(x_k) + ϵ/2.

2. Set x_{k+1} = x_{k,i_k}, L_{k+1} = 2^{i_k − 1} L_k, and

       ϕ_{k+1}(x) = ϕ_k(x) + (1/(2L_{k+1})) [ f(x_k) + ⟨∇f(x_k), x − x_k⟩ + Ψ(x) ].

Assume that L_0 ≤ γ(M_ν, ϵ) and M_ν < ∞. Note that the termination criterion of Step 1 in method (3.2) is satisfied once 2^{i_k} L_k ≥ γ(M_ν, ϵ). Therefore, same as in the proof of Theorem 1, we have

    L_k ≤ γ(M_ν, ϵ),   k ≥ 1.    (3.3)

Denote y_k = B_{2^{i_k} L_k}(x_k), S_k = Σ_{i=0}^{k} 1/L_{i+1}, and ϕ*_k = min_{x∈Q} ϕ_k(x); note that x_{k+1} = arg min_{x∈Q} ϕ_{k+1}(x). Let us prove by induction that the relation

    Σ_{i=0}^{k} (1/(2L_{i+1})) f̄(y_i) ≤ ϕ*_{k+1} + (ϵ/4) S_k    (3.4)

is valid for all k ≥ 0. Indeed, for k = 0 we have

    (1/(2L_1)) f̄(y_0) − (ϵ/4) S_0 = (1/(2L_1)) [ f̄(y_0) − ϵ/2 ] ≤ (1/(2^{i_0} L_0)) ψ*_{2^{i_0} L_0}(x_0)
    = (1/(2^{i_0} L_0)) [ f(x_0) + ⟨∇f(x_0), y_0 − x_0⟩ + Ψ(y_0) ] + ξ(x_0, y_0) = ϕ_1(y_0) = min_{x∈Q} ϕ_1(x).

In view of Lemma 2, for any k ≥ 0 we have

    ϕ_k(x) ≥ ϕ_k(x_k) + ξ(x_k, x),   x ∈ Q.

Assume that (3.4) is true for some k ≥ 0. Then

    min_{x∈Q} ϕ_{k+2}(x) = min_{x∈Q} { ϕ_{k+1}(x) + (1/(2L_{k+2})) [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ] }
    ≥ min_{x∈Q} { ϕ_{k+1}(x_{k+1}) + ξ(x_{k+1}, x) + (1/(2L_{k+2})) [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ] }
    = ϕ*_{k+1} + (1/(2L_{k+2})) ψ*_{2L_{k+2}}(x_{k+1})
    ≥ ϕ*_{k+1} + (1/(2L_{k+2})) [ f̄(y_{k+1}) − ϵ/2 ]
    ≥ Σ_{i=0}^{k+1} (1/(2L_{i+1})) f̄(y_i) − (ϵ/4) S_{k+1},

where the last inequality follows from (3.4). Thus, we have proved that (3.4) is valid for all k ≥ 0.

Now we can prove the main convergence result for the Universal Dual Gradient Method. Denote f̄*_k = (1/S_k) Σ_{i=0}^{k} (1/L_{i+1}) f̄(y_i).

Theorem 2. Let f satisfy condition (2.4) with M_ν < ∞, and let L_0 ≤ γ(M_ν, ϵ). Then all L_k generated by method (3.2) satisfy condition (3.3). Moreover, for all k ≥ 0 we have

    f̄*_k − f̄(x*) ≤ ϵ/2 + (2 γ(M_ν, ϵ)/(k+1)) ξ(x_0, x*).    (3.5)

Proof: Indeed, in view of inequality (3.4), we have

    (S_k/2) f̄*_k = Σ_{i=0}^{k} (1/(2L_{i+1})) f̄(y_i) ≤ min_{x∈Q} ϕ_{k+1}(x) + (ϵ/4) S_k ≤ (S_k/2) f̄(x*) + ξ(x_0, x*) + (ϵ/4) S_k.

It remains to use inequality (3.3). □

Note that the worst-case complexity bound for the number of iterations of method (3.2) coincides with the bound (2.19). However, the number of function evaluations at each iteration of (3.2) is twice that of (2.22).

Same as method (2.16), the Universal Dual Gradient Method can be equipped with a stopping criterion. Denote l^d_k(y) = (1/S_k) Σ_{i=0}^{k} (1/L_{i+1}) [ f(x_i) + ⟨∇f(x_i), y − x_i⟩ ]. Assume that ξ(x_0, x*) ≤ D and the constant D is known. Denote

    f̂_k = min_{y∈Q} { l^d_k(y) + Ψ(y) : ξ(x_0, y) ≤ D }.

Note that

    f̂_k = min_{y∈Q} max_{β≥0} { l^d_k(y) + Ψ(y) + β (ξ(x_0, y) − D) } = max_{β≥0} min_{y∈Q} { l^d_k(y) + Ψ(y) + β (ξ(x_0, y) − D) } ≥ (2/S_k) ϕ*_{k+1} − (2/S_k) D,

where the last inequality corresponds to the choice β = 2/S_k. Since f̄*_k ≤ (2/S_k) ϕ*_{k+1} + ϵ/2 by (3.4), we conclude that the stopping criterion f̄*_k − f̂_k ≤ ϵ ensures f̄*_k − f̄(x*) ≤ ϵ; it is satisfied as soon as S_k ≥ (4/ϵ) D.

4 Universal Fast Gradient Method

Finally, let us apply to problem (2.1) the following method.

Universal Fast Gradient Method (FGM)    (4.1)

Initialization. Choose L_0 > 0. Define ϕ_0(x) = ξ(x_0, x), y_0 = x_0, A_0 = 0.

For k ≥ 0 do:

1. Find v_k = arg min_{x∈Q} ϕ_k(x).

2. Find the smallest i_k ≥ 0 such that the coefficient a_{k+1,i_k} > 0, computed from the equation

       a²_{k+1,i_k} = (A_k + a_{k+1,i_k}) / (2^{i_k} L_k)

   and used in the definitions

       A_{k+1,i_k} = A_k + a_{k+1,i_k},   τ_{k,i_k} = a_{k+1,i_k} / A_{k+1,i_k},
       x_{k+1,i_k} = τ_{k,i_k} v_k + (1 − τ_{k,i_k}) y_k,
       x̂_{k+1,i_k} = arg min_{y∈Q} { ξ(v_k, y) + a_{k+1,i_k} [ ⟨∇f(x_{k+1,i_k}), y⟩ + Ψ(y) ] },
       y_{k+1,i_k} = τ_{k,i_k} x̂_{k+1,i_k} + (1 − τ_{k,i_k}) y_k,

   ensures the following relation:

       f(y_{k+1,i_k}) ≤ f(x_{k+1,i_k}) + ⟨∇f(x_{k+1,i_k}), y_{k+1,i_k} − x_{k+1,i_k}⟩ + 2^{i_k − 1} L_k ‖y_{k+1,i_k} − x_{k+1,i_k}‖² + (ϵ/2) τ_{k,i_k}.

3. Set x_{k+1} = x_{k+1,i_k}, y_{k+1} = y_{k+1,i_k}, a_{k+1} = a_{k+1,i_k}, τ_k = τ_{k,i_k}. Define A_{k+1} = A_k + a_{k+1}, L_{k+1} = 2^{i_k − 1} L_k, and

       ϕ_{k+1}(x) = ϕ_k(x) + a_{k+1} [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ].

Theorem 3. Let f satisfy condition (2.4) with certain M_ν < +∞. Then all iterations of method (4.1) are well defined. Moreover, for all k ≥ 0 we have

    A_k ( f̄(y_k) − ϵ/2 ) ≤ ϕ*_k := min_{x∈Q} ϕ_k(x),    (4.2)

where A_k ≥ [ ϵ^{1−ν} k^{1+3ν} / ( 2^{2+4ν} M_ν² ) ]^{1/(1+ν)}. Consequently, for all k ≥ 1 we have

    f̄(y_k) − f̄(x*) ≤ [ 2^{2+4ν} M_ν² / ( ϵ^{1−ν} k^{1+3ν} ) ]^{1/(1+ν)} ξ(x_0, x*) + ϵ/2.    (4.3)

Proof: Let us prove first that the line-search process of Item 2 in (4.1) is finite. In view of inequality (2.12), we need to show that

    2^{i_k} L_k ≥ [ 1/(ϵ τ_{k,i_k}) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}

for i_k large enough. Indeed, in view of the characteristic equation for a_{k+1,i_k}, we have

    2^{i_k} L_k · τ_{k,i_k}^{(1−ν)/(1+ν)} = ( A_{k+1,i_k} / a²_{k+1,i_k} ) ( a_{k+1,i_k} / A_{k+1,i_k} )^{(1−ν)/(1+ν)} = A_{k+1,i_k}^{2ν/(1+ν)} / a_{k+1,i_k}^{(1+3ν)/(1+ν)}.

It remains to note that a_{k+1,i_k} → 0 as i_k → ∞.

Let us prove relation (4.2). For k = 0 it is evident. Assume that it is valid for certain k ≥ 0. Then for any y ∈ Q we have, in view of (3.1), (4.2), and the convexity of f,

    ϕ_k(y) ≥ ϕ*_k + ξ(v_k, y) ≥ A_k ( f̄(y_k) − ϵ/2 ) + ξ(v_k, y)
          ≥ A_k ( f(x_{k+1}) + ⟨∇f(x_{k+1}), y_k − x_{k+1}⟩ + Ψ(y_k) − ϵ/2 ) + ξ(v_k, y).

Therefore,

    ϕ_{k+1}(y) ≥ ξ(v_k, y) + A_k ( f(x_{k+1}) + ⟨∇f(x_{k+1}), y_k − x_{k+1}⟩ + Ψ(y_k) − ϵ/2 ) + a_{k+1} [ f(x_{k+1}) + ⟨∇f(x_{k+1}), y − x_{k+1}⟩ + Ψ(y) ]
             = ξ(v_k, y) + A_k ( f(x_{k+1}) + Ψ(y_k) − ϵ/2 ) + a_{k+1} [ f(x_{k+1}) + ⟨∇f(x_{k+1}), y − v_k⟩ + Ψ(y) ],

since A_k (y_k − x_{k+1}) + a_{k+1} (y − x_{k+1}) = a_{k+1} (y − v_k) by the definition of x_{k+1}. In view of the definition of the point x̂_{k+1}, we have

    ϕ*_{k+1} ≥ ξ(v_k, x̂_{k+1}) + A_k ( f(x_{k+1}) + Ψ(y_k) − ϵ/2 ) + a_{k+1} [ f(x_{k+1}) + ⟨∇f(x_{k+1}), x̂_{k+1} − v_k⟩ + Ψ(x̂_{k+1}) ]
             ≥ ½ ‖x̂_{k+1} − v_k‖² + A_{k+1} f(x_{k+1}) + A_{k+1} Ψ(y_{k+1}) − (ϵ/2) A_k + a_{k+1} ⟨∇f(x_{k+1}), x̂_{k+1} − v_k⟩,

where we used (2.8) and the convexity of Ψ. Since x̂_{k+1} − v_k = (1/τ_k)(y_{k+1} − x_{k+1}), we obtain

    ϕ*_{k+1} ≥ (1/(2τ_k²)) ‖y_{k+1} − x_{k+1}‖² + A_{k+1} f(x_{k+1}) + A_{k+1} Ψ(y_{k+1}) − (ϵ/2) A_k + A_{k+1} ⟨∇f(x_{k+1}), y_{k+1} − x_{k+1}⟩
             = A_{k+1} ( f(x_{k+1}) + ⟨∇f(x_{k+1}), y_{k+1} − x_{k+1}⟩ + 2^{i_k − 1} L_k ‖y_{k+1} − x_{k+1}‖² + Ψ(y_{k+1}) ) − (ϵ/2) A_k
             ≥ A_{k+1} ( f(y_{k+1}) − (ϵ/2) τ_k + Ψ(y_{k+1}) ) − (ϵ/2) A_k = A_{k+1} ( f̄(y_{k+1}) − ϵ/2 ).

Thus, inequality (4.2) is proved for all k ≥ 0. Since ϕ_k(y) ≤ A_k f̄(y) + ξ(x_0, y) for all y ∈ Q, we obtain

    f̄(y_k) − f̄(x*) ≤ ξ(x_0, x*)/A_k + ϵ/2,   k ≥ 1,

in view of (4.2). It remains to estimate the growth of the coefficients A_k. In view of Lemma 1, the number of internal steps i_k in Item 2 of (4.1) satisfies the inequality

    2^{i_k} L_k ≤ 2 [ 1/(ϵ τ_k) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}.

Therefore,

    a²_{k+1}/A_{k+1} = 1/(2^{i_k} L_k) ≥ ½ [ ϵ τ_k ]^{(1−ν)/(1+ν)} M_ν^{−2/(1+ν)},

which is a²_{k+1} ≥ ½ [ ϵ a_{k+1} ]^{(1−ν)/(1+ν)} M_ν^{−2/(1+ν)} A_{k+1}^{2ν/(1+ν)}. Thus, we come to the following estimate:

    a_{k+1} ≥ ϵ^{(1−ν)/(1+3ν)} A_{k+1}^{2ν/(1+3ν)} / ( 2^{(1+ν)/(1+3ν)} M_ν^{2/(1+3ν)} ).    (4.4)

Denote B_k = A_k^{γ}, where γ = (1+ν)/(1+3ν) ≥ ½. Since A_{k+1} ≥ A_k, we have

    B_{k+1} − B_k = A_{k+1}^{γ} − A_k^{γ} ≥ (A_{k+1} − A_k)/(A_{k+1}^{1−γ} + A_k^{1−γ}) ≥ a_{k+1}/(2 A_{k+1}^{1−γ}) ≥ ϵ^{(1−ν)/(1+3ν)} / ( 2^{(2+4ν)/(1+3ν)} M_ν^{2/(1+3ν)} ),

where the last inequality follows from (4.4). Thus, we have proved that

    A_k ≥ [ k · ϵ^{(1−ν)/(1+3ν)} / ( 2^{(2+4ν)/(1+3ν)} M_ν^{2/(1+3ν)} ) ]^{(1+3ν)/(1+ν)} = [ ϵ^{1−ν} k^{1+3ν} / ( 2^{2+4ν} M_ν² ) ]^{1/(1+ν)}. □

From the rate of convergence (4.3), we get the following upper bound for the number of iterations which are necessary for getting an ϵ-solution of problem (2.1):

    k ≤ inf_{0≤ν≤1} [ 2^{(3+5ν)/2} M_ν / ϵ ]^{2/(1+3ν)} ξ(x_0, x*)^{(1+ν)/(1+3ν)}.    (4.5)

As compared with (2.19), the dependence of this bound on the smoothness parameters is now optimal.

Same as the gradient methods (2.16) and (3.2), the Fast Gradient Method (4.1) can be equipped with an implementable stopping criterion. Assume that ξ(x_0, x*) ≤ D. Denote l^{pd}_k(y) = (1/A_k) Σ_{i=1}^{k} a_i [ f(x_i) + ⟨∇f(x_i), y − x_i⟩ ], and

    f̂_k = min_{y∈Q} { l^{pd}_k(y) + Ψ(y) : ξ(x_0, y) ≤ D }.

Note that f̄(y_k) ≤ ϵ/2 + (1/A_k) ϕ*_k by (4.2). Using the reasoning presented at the end of Section 3, we obtain

    f̂_k = max_{β≥0} min_{y∈Q} { l^{pd}_k(y) + Ψ(y) + β (ξ(x_0, y) − D) } ≥ (1/A_k) ϕ*_k − (1/A_k) D,

where the last inequality corresponds to the choice β = 1/A_k. Thus, we can use the stopping criterion

    f̄(y_k) − f̂_k ≤ ϵ,    (4.6)

which ensures f̄(y_k) − f̄(x*) ≤ ϵ; it is satisfied as soon as

    A_k ≥ (2/ϵ) D.    (4.7)
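For completeness, here is a similar minimal sketch of method (4.1) in the same simplified setting (Q = Rⁿ, Ψ ≡ 0, Euclidean prox-function); again, all names are illustrative assumptions rather than code from the paper:

```python
import numpy as np

def universal_fgm(f, grad_f, x0, eps, L0=1.0, max_iter=1000):
    """Sketch of the Universal Fast Gradient Method (4.1) for Q = R^n,
    Psi = 0, d(x) = 0.5*||x - x0||^2."""
    y, A, L = x0.copy(), 0.0, L0
    s = np.zeros_like(x0)                # accumulated sum a_i * grad f(x_i)
    for k in range(max_iter):
        i = 0
        while True:
            M = 2.0**i * L
            # a > 0 solving the characteristic equation a^2 = (A + a)/M:
            a = (1.0 + np.sqrt(1.0 + 4.0 * M * A)) / (2.0 * M)
            A_new, tau = A + a, a / (A + a)
            v = x0 - s                   # v_k = argmin phi_k (Euclidean prox)
            x = tau * v + (1.0 - tau) * y
            gx = grad_f(x)
            x_hat = v - a * gx           # argmin { 0.5*||u - v||^2 + a*<gx, u> }
            y_new = tau * x_hat + (1.0 - tau) * y
            # Line-search test of Item 2, with tolerance (eps/2)*tau:
            if f(y_new) <= f(x) + gx @ (y_new - x) \
                    + 0.5 * M * np.dot(y_new - x, y_new - x) + 0.5 * eps * tau:
                break
            i += 1
        s += a * gx                      # update the model phi_{k+1}
        y, A, L = y_new, A_new, 2.0**(i - 1) * L
    return y
```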

It remains to estimate from above the total number of calls of the oracle in method (4.1) which is sufficient for getting an ϵ-solution of problem (2.1). Let us assume that this method is equipped with the stopping criterion (4.6). Then, up to the last iteration, we can be sure that

    A_k ≤ (2/ϵ) D,   k ≥ 0.    (4.8)

Denote by N(k) the total number of calls of the oracle after k iterations. At each iteration of this method we call the oracle 2(i_k + 1) times (at the point x_{k+1,i_k} and at the prediction point y_{k+1,i_k}). Therefore, using the same reasoning as at the end of Section 2, we conclude that

    N(k) = 4(k + 1) + 2 log₂ L_{k+1} − 2 log₂ L_0.    (4.9)

Note that

    L_{k+1} = 2^{i_k − 1} L_k = A_{k+1}/(2 a²_{k+1}) ≤ [ 1/(ϵ τ_k) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)} = [ A_{k+1}/(ϵ a_{k+1}) ]^{(1−ν)/(1+ν)} M_ν^{2/(1+ν)}.    (4.10)

Therefore, in view of (4.4) we conclude that

    L_{k+1} ≤ [ 2 A_{k+1} / ϵ² ]^{(1−ν)/(1+3ν)} M_ν^{4/(1+3ν)} ≤ [ 4D / ϵ³ ]^{(1−ν)/(1+3ν)} M_ν^{4/(1+3ν)},

where the last inequality follows from (4.8). Substituting this estimate into expression (4.9), we obtain that on average method (4.1) performs at most four calls of the oracle per iteration.

5 Numerical experiments

In our numerical experiments we tried to check the actual level of adaptivity of the above methods to the local topological structure of the objective function. For that, we have chosen three families of nonsmooth minimization problems.

1. Matrix games. In this problem, given an n × m payoff matrix A, we need to find a saddle point of the following problem:

    min_{x∈Δ_n} max_{y∈Δ_m} ⟨x, Ay⟩ = min_{x∈Δ_n} { ψ_p(x) := max_{1≤j≤m} ⟨x, Ae_j⟩ } = max_{y∈Δ_m} { ψ_d(y) := min_{1≤i≤n} ⟨e_i, Ay⟩ },    (5.1)

where e_{(·)} denote the basis vectors in the corresponding spaces, and Δ_{(·)} denotes the standard simplex. This problem can be posed as a minimization problem

    min_{x∈Δ_n, y∈Δ_m} { ψ_{pd}(x, y) = ψ_p(x) − ψ_d(y) }.    (5.2)

The optimal value of this problem is zero. For our experiments, we generated the matrix A randomly, with uniform distribution of its entries in the interval [−1, 1]. For the feasible set of this problem, F = { z = (x, y) : x ∈ Δ_n, y ∈ Δ_m }, a natural prox-function is the entropy:

    η(z) = Σ_{i=1}^{n+m} z^{(i)} ln z^{(i)}.

This function is strongly convex with respect to the ℓ₁-norm with convexity parameter one. Note that the ℓ₁-norm is very good for measuring simplexes. Consequently, we measure the subgradients of the objective function in (5.2) in the dual ℓ∞-norm. In view of our strategy for generating the matrix A, we get a Lipschitz-continuous function ψ_{pd} with constant equal to one. We will refer to the methods based on the entropy function as methods with the Entropy Setup. If a method uses the standard Euclidean norm, we say that it is based on the Euclidean Setup.

In the table below, we give computational results for two universal methods, the Fast Gradient Method (4.1) and the Primal Gradient Method (2.16), both with the Entropy Setup. In our problem instance, n = 896 and m = 128.

[Table (5.3): for each required accuracy (Eps), the number of iterations, the upper estimate of the achieved accuracy, and the current level of the Lipschitz constant, for FGM and PGM with Entropy Setup; the numerical entries are not recoverable here.]

In the first column we indicate the required accuracy. For each method, there are three subcolumns. The first one indicates the number of iterations. The second one shows the upper estimate for the achieved accuracy.² The third one shows the current level of the Lipschitz constant, generated by the method. Note that per one iteration of FGM we need on average four calls of the oracle. PGM needs on average only two calls. It is clear that both methods behave much better than the worst-case complexity bounds.

² Since in this problem the optimal value is known, we use it in the stopping criterion.
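One reason the Entropy Setup is attractive here is that the corresponding Bregman mapping on the simplex is available in closed form as a multiplicative update. The following sketch shows this standard formula (an illustration added here, not code from the paper):

```python
import numpy as np

def entropy_bregman_step(x, g, M):
    """Bregman mapping B_M(x) on the standard simplex for the entropy
    prox-function: minimize <g, y> + M * sum_i y_i*ln(y_i/x_i) over the
    simplex. The solution is a multiplicative weights update."""
    w = x * np.exp(-g / M)
    return w / w.sum()                   # renormalize onto the simplex
```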

For FGM, decreasing the required accuracy by a factor of two results in a doubling of the number of iterations. This dependence is typical for complexity bounds of the type O(1/ϵ), and not for the theoretical bound O(1/ϵ²). For PGM, the average increase is higher (approximately a factor of three). But it is still much better than the theoretical bound.

From table (5.3) we can see that FGM generates a much better model of the objective function. Its accuracy estimate is usually only about twice the actual residual. The estimates of PGM are much weaker. One possible reason for this difference is the much more aggressive behavior of FGM in introducing new gradients into the model. Finally, the third subcolumn shows that the actual level of the Lipschitz constants is much lower than the theoretical prediction.

Let us look now at how efficient FGM is in solving the same problem with the Euclidean Setup. These results are presented in the left part of table (5.4).

[Table (5.4): left part, FGM with Euclidean Setup (the most accurate runs terminated "out of time"); right part, WDA with Entropy Setup; the numerical entries are not recoverable here.]

They confirm that the right choice of prox-function is crucial for the efficient solution of optimization problems. The behavior of FGM with the Euclidean Setup corresponds exactly to the worst-case theoretical bound O(1/ϵ²) for Lipschitz-continuous functions (the number of iterations increases by a factor of four after dividing the accuracy by two).

In the right part of this table we present the results of a standard black-box subgradient scheme as applied to the same problem. This is Weighted Dual Averaging (WDA) [11] with the Entropy Setup. For choosing its parameters correctly, we need to know only an estimate of the diameter of the feasible set. Each iteration of this method needs one call of the oracle. For our problem, WDA works in exact correspondence to its worst-case complexity bound O(1/ϵ²). The second column of this part demonstrates that the lower bound generated by this scheme is almost exact.

2. Continuous Steiner problem. In this problem we are given centers a_i ∈ Rⁿ, i = 1, ..., m. It is necessary to find the optimal location of a service center x, which minimizes the total distance to all other centers. Thus, our problem is as follows:

    min_{x∈Q} { f(x) := Σ_{i=1}^{m} ‖x − a_i‖ },    (5.5)

where Q ⊆ Rⁿ is a closed convex set. All norms in this problem are Euclidean.

Clearly, the level of smoothness of problem (5.5) is much higher than that of (5.2). So, we can expect it to be easier for the universal schemes. Let us look at the results of the experiments for a random problem with n = 256, m = 512, and Q = Rⁿ₊. We choose m > n in order to increase the density of nonsmooth points. The centers were generated randomly in the box 0 ≤ x^{(i)} ≤ 1/n^{1/2}, i = 1, ..., n (which has Euclidean diameter one). All methods have the origin as a starting point. [The initial objective value f_0 and the optimal value found by the schemes are not recoverable here.] The table below has the same structure as (5.3).

[Table (5.6): FGM and PGM with Euclidean Setup (the most accurate PGM run terminated "out of time"); the numerical entries are not recoverable here.]

Note that the rate of convergence of FGM (4.1) is unexpectedly high. Increasing the accuracy by a factor of four results in doubling the number of iterations. From the complexity point of view, this corresponds to the level O(1/ϵ^{1/2}), which is typical for Fast Gradient Methods of smooth minimization. The accuracy predicted by FGM is still very good, and the level of the Lipschitz constants is unexpectedly small.

The results of PGM (2.16) are not so impressive. It doubles the number of iterations after dividing the accuracy by two, which corresponds to the O(1/ϵ) level of complexity. It seems that a weak point of this method is the quality of the termination criterion.
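For problem (5.5) the oracle is particularly cheap: a subgradient of f is the sum of unit vectors pointing from the centers a_i towards x, with the zero vector being a valid choice for any term with x = a_i. A small illustrative helper (with assumed names, not from the paper):

```python
import numpy as np

def steiner_oracle(x, centers):
    """Value and one subgradient of f(x) = sum_i ||x - a_i|| from (5.5).
    centers has shape (m, n)."""
    diffs = x[None, :] - centers           # x - a_i, shape (m, n)
    norms = np.linalg.norm(diffs, axis=1)  # ||x - a_i||
    nz = norms > 0                         # zero is a subgradient of ||.|| at 0
    g = (diffs[nz] / norms[nz, None]).sum(axis=0)
    return norms.sum(), g
```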

3. Universal methods and smoothing technique. Let us compare the practical performance of method (4.1) as applied to the primal version of problem (5.1),

    min_{x∈Δ_n} ψ_p(x),    (5.7)

with its performance as applied to the smoothed version of this function,

    ψ̃_p(x) = µ ln ( Σ_{j=1}^{m} e^{⟨x, Ae_j⟩/µ} ).

The value of the smoothing parameter µ > 0 for this function is chosen in accordance with the theoretical recommendation (4.8) in [9]. For our experiments, we choose n = m = 512 and apply FGM with the Entropy Setup.

[Table (5.8): FGM applied to ψ_p and to the smoothed ψ̃_p (the most accurate runs on ψ_p terminated "out of time"); the numerical entries are not recoverable here.]

These results confirm that smoothing is still a very powerful technique, whose computational efficiency is often much higher than that of black-box methods.
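Numerically, ψ̃_p is best evaluated through a stable log-sum-exp; the sketch below (assumed names; the stable evaluation is an implementation detail added here) illustrates the smoothed oracle:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def smoothed_psi_p(x, A, mu):
    """Smoothed primal objective mu * ln(sum_j exp(<x, A e_j>/mu)) and its
    gradient A @ softmax(A^T x / mu), evaluated in a numerically stable way."""
    z = A.T @ x / mu                       # <x, A e_j>/mu for each column j
    return mu * logsumexp(z), A @ softmax(z)
```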

References

[1] A. Beck, M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202 (2009).

[2] O. Devolder, F. Glineur, Yu. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Accepted by Mathematical Programming.

[3] K.-H. Elster (ed.). Modern Mathematical Methods in Optimization. Akademie Verlag, Berlin, 1993.

[4] G. Lan. Level methods uniformly optimal for composite and structured nonsmooth convex optimization. Submitted to Mathematical Programming, April 2011.

[5] C. Lemaréchal, A. Nemirovskii, Yu. Nesterov. New variants of bundle methods. Mathematical Programming, 69 (1995), 111-147.

[6] A. Nemirovskii, Yu. Nesterov. Optimal methods for smooth convex optimization. Zh. Vychisl. Mat. i Mat. Fiz., 25(3) (1985), in Russian.

[7] A. Nemirovsky, D. Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, New York (1983).

[8] Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.

[9] Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming (A), 103(1), 127-152 (2005).

[10] Yu. Nesterov. Gradient methods for minimizing composite functions. CORE DP 2007/76. Published in Mathematical Programming, December 2012.

[11] Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1), 221-259 (August 2009).


More information

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE

A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE Journal of Applied Analysis Vol. 6, No. 1 (2000), pp. 139 148 A CHARACTERIZATION OF STRICT LOCAL MINIMIZERS OF ORDER ONE FOR STATIC MINMAX PROBLEMS IN THE PARAMETRIC CONSTRAINT CASE A. W. A. TAHA Received

More information

More First-Order Optimization Algorithms

More First-Order Optimization Algorithms More First-Order Optimization Algorithms Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 3, 8, 3 The SDM

More information

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean

On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean On the complexity of the hybrid proximal extragradient method for the iterates and the ergodic mean Renato D.C. Monteiro B. F. Svaiter March 17, 2009 Abstract In this paper we analyze the iteration-complexity

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems

A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems A First Order Method for Finding Minimal Norm-Like Solutions of Convex Optimization Problems Amir Beck and Shoham Sabach July 6, 2011 Abstract We consider a general class of convex optimization problems

More information

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Boris T. Polyak Andrey A. Tremba V.A. Trapeznikov Institute of Control Sciences RAS, Moscow, Russia Profsoyuznaya, 65, 117997

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

15-850: Advanced Algorithms CMU, Fall 2018 HW #4 (out October 17, 2018) Due: October 28, 2018

15-850: Advanced Algorithms CMU, Fall 2018 HW #4 (out October 17, 2018) Due: October 28, 2018 15-850: Advanced Algorithms CMU, Fall 2018 HW #4 (out October 17, 2018) Due: October 28, 2018 Usual rules. :) Exercises 1. Lots of Flows. Suppose you wanted to find an approximate solution to the following

More information

Primal Solutions and Rate Analysis for Subgradient Methods

Primal Solutions and Rate Analysis for Subgradient Methods Primal Solutions and Rate Analysis for Subgradient Methods Asu Ozdaglar Joint work with Angelia Nedić, UIUC Conference on Information Sciences and Systems (CISS) March, 2008 Department of Electrical Engineering

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

arxiv: v2 [math.oc] 1 Jul 2015

arxiv: v2 [math.oc] 1 Jul 2015 A Family of Subgradient-Based Methods for Convex Optimization Problems in a Unifying Framework Masaru Ito (ito@is.titech.ac.jp) Mituhiro Fukuda (mituhiro@is.titech.ac.jp) arxiv:403.6526v2 [math.oc] Jul

More information

Efficiency of minimizing compositions of convex functions and smooth maps

Efficiency of minimizing compositions of convex functions and smooth maps Efficiency of minimizing compositions of convex functions and smooth maps D. Drusvyatskiy C. Paquette Abstract We consider global efficiency of algorithms for minimizing a sum of a convex function and

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical

More information

PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT

PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT Linear and Nonlinear Analysis Volume 1, Number 1, 2015, 1 PARALLEL SUBGRADIENT METHOD FOR NONSMOOTH CONVEX OPTIMIZATION WITH A SIMPLE CONSTRAINT KAZUHIRO HISHINUMA AND HIDEAKI IIDUKA Abstract. In this

More information

WEAK CONVERGENCE OF RESOLVENTS OF MAXIMAL MONOTONE OPERATORS AND MOSCO CONVERGENCE

WEAK CONVERGENCE OF RESOLVENTS OF MAXIMAL MONOTONE OPERATORS AND MOSCO CONVERGENCE Fixed Point Theory, Volume 6, No. 1, 2005, 59-69 http://www.math.ubbcluj.ro/ nodeacj/sfptcj.htm WEAK CONVERGENCE OF RESOLVENTS OF MAXIMAL MONOTONE OPERATORS AND MOSCO CONVERGENCE YASUNORI KIMURA Department

More information

Cubic regularization of Newton method and its global performance

Cubic regularization of Newton method and its global performance Math. Program., Ser. A 18, 177 5 (6) Digital Object Identifier (DOI) 1.17/s117-6-76-8 Yurii Nesterov B.T. Polyak Cubic regularization of Newton method and its global performance Received: August 31, 5

More information

A Proximal Method for Identifying Active Manifolds

A Proximal Method for Identifying Active Manifolds A Proximal Method for Identifying Active Manifolds W.L. Hare April 18, 2006 Abstract The minimization of an objective function over a constraint set can often be simplified if the active manifold of the

More information

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L.

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L. McMaster University Advanced Optimization Laboratory Title: A Proximal Method for Identifying Active Manifolds Authors: Warren L. Hare AdvOl-Report No. 2006/07 April 2006, Hamilton, Ontario, Canada A Proximal

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1

An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1 An Infeasible Interior Proximal Method for Convex Programming Problems with Linear Constraints 1 Nobuo Yamashita 2, Christian Kanzow 3, Tomoyui Morimoto 2, and Masao Fuushima 2 2 Department of Applied

More information