Gradient methods for minimizing composite functions
Math. Program., Ser. B (2013) 140
FULL LENGTH PAPER

Gradient methods for minimizing composite functions

Yu. Nesterov

Received: 10 June 2010 / Accepted: 29 December 2010 / Published online: 21 December 2012
© Springer-Verlag Berlin Heidelberg and Mathematical Optimization Society 2012

Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum of two terms: one is smooth and given by a black-box oracle, and another is a simple general convex function with known structure. Despite the absence of good properties of the sum, such problems, both in convex and nonconvex cases, can be solved with efficiency typical for the first part of the objective. For convex problems of the above structure, we consider primal and dual variants of the gradient method with convergence rate O(1/k), and an accelerated multistep version with convergence rate O(1/k^2), where k is the iteration counter. For nonconvex problems with this structure, we prove convergence to a point from which there is no descent direction. In contrast, we show that for general nonsmooth, nonconvex problems, even resolving the question of whether a descent direction exists from a point is NP-hard. For all methods, we suggest some efficient line search procedures and show that the additional computational work necessary for estimating the unknown problem class parameters can only multiply the complexity of each iteration by a small constant factor. We present also the results of preliminary computational experiments, which confirm the superiority of the accelerated scheme.

Dedicated to Claude Lemaréchal on the occasion of his 65th birthday.

The author acknowledges the support from Office of Naval Research grant #N: Efficiently Computable Compressed Sensing.

Yu. Nesterov
Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium
e-mail: nesterov@core.ucl.ac.be
Keywords Local optimization · Convex optimization · Nonsmooth optimization · Complexity theory · Black-box model · Optimal methods · Structural optimization · l1-regularization

Mathematics Subject Classification 90C25 · 90C47 · 68Q25

1 Introduction

Motivation. In recent years, several advances in Convex Optimization have been based on development of different models for optimization problems. Starting from the theory of self-concordant functions [14], it was becoming more and more clear that the proper use of the problem's structure can lead to very efficient optimization methods, which significantly overcome the limitations of black-box Complexity Theory (see Section 4.1 in [9] for a discussion). For more recent examples, we can mention the development of the smoothing technique [10], or the special methods for minimizing convex objective functions up to a certain relative accuracy [12]. In both cases, the proposed optimization schemes strongly exploit the particular structure of the corresponding optimization problem.

In this paper, we develop new optimization methods for approximating global and local minima of composite convex objective functions φ(x). Namely, we assume that

φ(x) = f(x) + Ψ(x),   (1.1)

where f(x) is a differentiable function defined by a black-box oracle, and Ψ(x) is a general closed convex function. However, we assume that the function Ψ is simple. This means that we are able to find a closed-form solution for minimizing the sum of Ψ with some simple auxiliary functions. Let us give several examples.

1. Constrained minimization. Let Q be a closed convex set. Define Ψ as an indicator function of the set Q:

Ψ(x) = 0 if x ∈ Q, and Ψ(x) = +∞ otherwise.

Then, the unconstrained minimization of the composite function (1.1) is equivalent to minimizing the function f over the set Q. We will see that our assumption on simplicity of the function Ψ reduces to the ability of finding a closed-form Euclidean projection of an arbitrary point onto the set Q.

2. Barrier representation of the feasible set.
Assume that the objective function of the convex constrained minimization problem

find f* = min_{x∈Q} f(x)

is given by a black-box oracle, but the feasible set Q is described by a ν-self-concordant barrier F(x) [14]. Define Ψ(x) = (ε/ν) F(x), φ(x) = f(x) + Ψ(x), and
x* = argmin_{x∈Q} f(x). Then, for arbitrary x̂ ∈ int Q, by general properties of self-concordant barriers we get

f(x̂) ≤ f(x*) + ⟨∇φ(x̂), x̂ − x*⟩ + (ε/ν)⟨∇F(x̂), x* − x̂⟩ ≤ f* + ||∇φ(x̂)||* · ||x̂ − x*|| + ε.

Thus, a point x̂ with a small norm of the gradient of the function φ approximates well the solution of the constrained minimization problem. Note that the objective function φ does not belong to any standard class of convex problems formed by functions with bounded derivatives of certain degree.

3. Sparse least squares. In many applications, it is necessary to minimize the following objective:

φ(x) = (1/2)||Ax − b||₂² + ||x||₁ ≝ f(x) + Ψ(x),   (1.2)

where A is a matrix of corresponding dimension and ||·||_k denotes the standard l_k-norm. The presence of the additive l1-term very often increases the sparsity of the optimal solution (see [1,18]). This feature was observed a long time ago (see, for example, [2,6,16,17]). Recently, this technique became popular in signal processing and statistics [7,19].¹

From the formal point of view, the objective φ(x) in (1.2) is a nonsmooth convex function. Hence, the standard black-box gradient schemes need O(1/ε²) iterations for generating its ε-solution. The structural methods based on the smoothing technique [10] need O(1/ε) iterations. However, we will see that the same problem can be solved in O(1/ε^{1/2}) iterations of a special gradient-type scheme.

Contents. In Sect. 2 we study the problem of finding a local minimum of a nonsmooth nonconvex function. First, we show that for general nonsmooth, nonconvex functions, even resolving the question of whether there exists a descent direction from a point is NP-hard. However, for the special form of the objective function (1.1), we can introduce the composite gradient mapping, which makes the above mentioned problem solvable. The objective function of the auxiliary problem is formed as a sum of the objective of the usual gradient mapping [8] and the general nonsmooth convex term.
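The simplicity assumption is very concrete for the sparse least squares example: for Ψ(x) = τ||x||₁, the auxiliary minimization problem used by the methods of this paper has the well-known closed-form soft-thresholding solution. A minimal numpy sketch (the helper name is ours, not from the paper):

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form minimizer of  tau*||x||_1 + (1/2)*||x - z||_2^2,
    i.e. the auxiliary problem that makes Psi(x) = tau*||x||_1 'simple'."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
```

For Ψ equal to the indicator function of a set Q (Example 1), the analogous closed-form operation is the Euclidean projection onto Q.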
For the particular case (1.2), this construction was proposed in [20].² We present different properties of this object, which are important for the complexity analysis of optimization methods.

In Sect. 3 we study the behavior of the simplest gradient scheme based on the composite gradient mapping. We prove that in convex and nonconvex cases we have exactly the same complexity results as in the usual smooth situation (Ψ ≡ 0).

¹ An interested reader can find a good survey of the literature, existing minimization techniques, and new methods in [3] and [5].
² However, this idea has a much longer history. To the best of our knowledge, for the general framework this technique was originally developed in [4].
For example, for nonconvex f, the maximal negative slope of φ along the minimization sequence with monotonically decreasing function values vanishes as O(1/k^{1/2}), where k is the iteration counter. Thus, the limiting points have no descent directions (see Theorem 3). If f is convex and has Lipschitz continuous gradient, then the Gradient Method converges as O(1/k). It is important that our version of the Gradient Method has an adjustable stepsize strategy, which needs on average one additional computation of the function value per iteration.

In Sect. 4, we introduce a machinery of estimate sequences and apply it first for justifying the rate of convergence of the dual variant of the gradient method. Afterwards, we present an accelerated version, which converges as O(1/k²). As compared with the previous variants of accelerated schemes (e.g. [9,10]), our new scheme can efficiently adjust the initial estimate of the unknown Lipschitz constant.

In Sect. 5 we give examples of applications of the accelerated scheme. We show how to minimize functions with known strong convexity parameter (Sect. 5.1), how to find a point with a small residual in the system of the first-order optimality conditions (Sect. 5.2), and how to approximate unknown parameters of strong convexity (Sect. 5.3). In Sect. 6 we present the results of preliminary testing of the proposed optimization methods. An earlier version of the results was published in the research report [11].

Notation. In what follows, E denotes a finite-dimensional real vector space, and E* the dual space, which is formed by all linear functions on E. The value of a function s ∈ E* at x ∈ E is denoted by ⟨s, x⟩. By fixing a positive definite self-adjoint operator B : E → E*, we can define the following Euclidean norms:³

||h|| = ⟨Bh, h⟩^{1/2}, h ∈ E;   ||s||* = ⟨s, B⁻¹s⟩^{1/2}, s ∈ E*.   (1.3)

In the particular case of the coordinate vector space E = Rⁿ, we have E* = E.
Then, usually B is taken as the unit matrix, and ⟨s, x⟩ denotes the standard coordinate-wise inner product.

Further, for a function f(x), x ∈ E, we denote by ∇f(x) its gradient at x:

f(x + h) = f(x) + ⟨∇f(x), h⟩ + o(||h||), h ∈ E.

Clearly ∇f(x) ∈ E*. For a convex function Ψ we denote by ∂Ψ(x) its subdifferential at x. Finally, the directional derivative of the function φ is defined in the usual way:

Dφ(y)[u] = lim_{α↓0} (1/α)[φ(y + αu) − φ(y)].

Finally, we use the notation a ≥(#.#) b for indicating that the inequality a ≥ b is an immediate consequence of the displayed relation (#.#).

³ In this paper, for the sake of simplicity, we restrict ourselves to Euclidean norms only. The extension onto the general case can be done in a standard way using Bregman distances (e.g. [10]).
2 Composite gradient mapping

The problem of finding a descent direction for a nonsmooth nonconvex function (or proving that this is not possible) is one of the most difficult problems in Numerical Analysis. In order to see this, let us fix an arbitrary integer vector c ∈ Z₊ⁿ and denote γ = 1 + Σ_{i=1}^n c^{(i)}. Consider the function

φ(x) = (1 − 1/γ) max_{1≤i≤n} |x^{(i)}| − min_{1≤i≤n} |x^{(i)}| + |⟨c, x⟩|.   (2.1)

Clearly, φ is a piece-wise linear function with φ(0) = 0. Nevertheless, we have the following discouraging result.

Lemma 1 It is NP-hard to decide if there exists x ∈ Rⁿ with φ(x) < 0.

Proof Assume that some vector σ ∈ Rⁿ with coefficients ±1 satisfies the equation ⟨c, σ⟩ = 0. Then φ(σ) = −1/γ < 0. Let us assume now that φ(x) < 0 for some x ∈ Rⁿ. We can always choose x with max_{1≤i≤n} |x^{(i)}| = 1. Denote δ = |⟨c, x⟩|. In view of our assumption, we have

|x^{(i)}| >(2.1) 1 − 1/γ + δ, i = 1, ..., n.

Denoting σ^{(i)} = sign x^{(i)}, i = 1, ..., n, we can rewrite this inequality as σ^{(i)} x^{(i)} > 1 − 1/γ + δ. Therefore, |σ^{(i)} − x^{(i)}| = 1 − σ^{(i)} x^{(i)} < 1/γ − δ, and we conclude that

|⟨c, σ⟩| ≤ |⟨c, x⟩| + |⟨c, σ − x⟩| ≤ δ + (γ − 1) max_{1≤i≤n} |σ^{(i)} − x^{(i)}| = 1 − 1/γ − (γ − 2)δ < 1.

Since c has integer coefficients, this is possible if and only if ⟨c, σ⟩ = 0. It is well known that verification of solvability of the latter equality in Boolean variables is NP-hard (this is the standard Boolean Knapsack Problem). Thus, we have shown that finding a descent direction of the function φ is NP-hard. Considering now the function max{−1, φ(x)}, we can see that finding a (local) minimum of a unimodular nonsmooth nonconvex function is also NP-hard.

Thus, in a sharp contrast to Smooth Minimization, for nonsmooth functions even the local improvement of the objective is difficult. Therefore, in this paper we restrict ourselves to objective functions of a very special structure. Namely, we consider the problem of minimizing

φ(x) ≝ f(x) + Ψ(x)   (2.2)

over a convex set Q, where the function f is differentiable, and the function Ψ is closed and convex on Q.
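Lemma 1 can be checked numerically on a small instance. The sketch below assumes the reconstruction of (2.1) given above; the data c and the balanced sign vector σ are our own toy choices:

```python
import numpy as np

def phi(x, c):
    """Reconstructed (2.1): gamma = 1 + sum(c),
    phi(x) = (1 - 1/gamma)*max|x_i| - min|x_i| + |<c, x>|."""
    gamma = 1.0 + c.sum()
    ax = np.abs(x)
    return (1.0 - 1.0 / gamma) * ax.max() - ax.min() + abs(c @ x)

c = np.array([1.0, 2.0, 3.0])       # integer data of a knapsack instance
sigma = np.array([1.0, 1.0, -1.0])  # +/-1 vector with <c, sigma> = 0
# phi(0) = 0, while phi(sigma) = -1/gamma < 0: a descent direction from 0
# exists exactly when the knapsack equation <c, sigma> = 0 is solvable.
```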
For characterizing a solution to our problem, define the cone of feasible directions and the corresponding dual cone, which is called normal:

F(y) = {u = τ(x − y) : x ∈ Q, τ ≥ 0} ⊆ E,
N(y) = {s : ⟨s, x − y⟩ ≤ 0 ∀x ∈ Q} ⊆ E*,   y ∈ Q.

Then, the first-order necessary optimality conditions at a point of local minimum x* can be written as follows:

φ'(x*) ≝ ∇f(x*) + ξ* ∈ −N(x*),   (2.3)

where ξ* ∈ ∂Ψ(x*). In other words,

⟨φ'(x*), u⟩ ≥ 0 ∀u ∈ F(x*).   (2.4)

Since Ψ is convex, the latter condition is equivalent to the following:

Dφ(x*)[u] ≥ 0 ∀u ∈ F(x*).   (2.5)

Note that in the case of convex f, any of the conditions (2.3)–(2.5) is sufficient for the point x* to be a point of global minimum of the function φ over Q.

The last variant of the first-order optimality conditions is convenient for defining an approximate solution to our problem.

Definition 1 The point x̄ ∈ Q satisfies the first-order optimality conditions of local minimum of the function φ over the set Q with accuracy ε ≥ 0 if

Dφ(x̄)[u] ≥ −ε ∀u ∈ F(x̄), ||u|| = 1.   (2.6)

Note that in the case F(x̄) = E with 0 ∉ ∇f(x̄) + ∂Ψ(x̄), this condition reduces to the following inequality:

−ε ≤ min_{||u||=1} Dφ(x̄)[u] = min_{||u||=1} max_{ξ∈∂Ψ(x̄)} ⟨∇f(x̄) + ξ, u⟩
= min_{||u||≤1} max_{ξ∈∂Ψ(x̄)} ⟨∇f(x̄) + ξ, u⟩ = max_{ξ∈∂Ψ(x̄)} min_{||u||≤1} ⟨∇f(x̄) + ξ, u⟩ = −min_{ξ∈∂Ψ(x̄)} ||∇f(x̄) + ξ||*.

For finding a point x̄ satisfying condition (2.6), we will use the composite gradient mapping. Namely, at any y ∈ Q define

m_L(y; x) = f(y) + ⟨∇f(y), x − y⟩ + (L/2)||x − y||² + Ψ(x),
T_L(y) = argmin_{x∈Q} m_L(y; x),   (2.7)

where L is a positive constant. Recall that in the usual gradient mapping [8] we have Ψ ≡ 0 (our modification is inspired by [20]). Then, we can define a constrained analogue of the gradient direction of a smooth function, the vector

g_L(y) = L · B(y − T_L(y)) ∈ E*,   (2.8)
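As an illustration of (2.7)–(2.8), here is a small numpy sketch for the special case Ψ = indicator of a coordinate box and B = I, where T_L(y) is just a projected gradient step (the helper names are ours):

```python
import numpy as np

def T_L(y, grad_f, L, lo=-1.0, hi=1.0):
    """Composite gradient mapping (2.7) for Psi = indicator of the box
    [lo, hi]^n and B = I: the minimizer of m_L(y; x) is the Euclidean
    projection of the gradient step onto the box."""
    return np.clip(y - grad_f(y) / L, lo, hi)

def g_L(y, grad_f, L, lo=-1.0, hi=1.0):
    """Gradient direction (2.8): g_L(y) = L * B * (y - T_L(y))."""
    return L * (y - T_L(y, grad_f, L, lo, hi))
```

When the box constraint is inactive (effectively Q = E, Ψ ≡ 0), g_L(y) reduces to ∇f(y) for any L > 0, as noted below; when T_L(y) is the constrained minimizer, g_L(y) vanishes.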
where the operator B ≻ 0 defines the norm (1.3). (In case of ambiguity of the objective function, we use the notation g_L(y)[φ].) It is easy to see that for Q ≡ E and Ψ ≡ 0 we get g_L(y) = ∇φ(y) ≡ ∇f(y) for any L > 0. Our assumption on the simplicity of the function Ψ means exactly the feasibility of operation (2.7).

Let us mention the main properties of the composite gradient mapping. Almost all of them follow from the first-order optimality condition for problem (2.7):

⟨∇f(y) + L·B(T_L(y) − y) + ξ_L(y), x − T_L(y)⟩ ≥ 0 ∀x ∈ Q,   (2.9)

where ξ_L(y) ∈ ∂Ψ(T_L(y)). In what follows, we denote

φ'(T_L(y)) = ∇f(T_L(y)) + ξ_L(y) ∈ ∂φ(T_L(y)).   (2.10)

We are going to show that the above subgradient inherits all important properties of the gradient of a smooth convex function. From now on, we assume that the first part of the objective function (2.2) has Lipschitz-continuous gradient:

||∇f(x) − ∇f(y)||* ≤ L_f ||x − y||, x, y ∈ Q.   (2.11)

From (2.11) and the convexity of Q, one can easily derive the following useful inequality (see, for example, [15]):

|f(x) − f(y) − ⟨∇f(y), x − y⟩| ≤ (L_f/2)||x − y||², x, y ∈ Q.   (2.12)

First of all, let us estimate a local variation of the function φ. Denote

S_L(y) = ||∇f(T_L(y)) − ∇f(y)||* / ||T_L(y) − y|| ≤ L_f.

Theorem 1 At any y ∈ Q,

φ(y) − φ(T_L(y)) ≥ ((2L − L_f)/(2L²)) ||g_L(y)||*²,   (2.13)

⟨φ'(T_L(y)), y − T_L(y)⟩ ≥ ((L − L_f)/L²) ||g_L(y)||*².   (2.14)

Moreover, for any x ∈ Q, we have

⟨φ'(T_L(y)), x − T_L(y)⟩ ≥ −(1 + S_L(y)/L) ||g_L(y)||* · ||T_L(y) − x||
≥ −(1 + L_f/L) ||g_L(y)||* · ||T_L(y) − x||.   (2.15)
Proof For the sake of notation, denote T = T_L(y) and ξ = ξ_L(y). Then

φ(T) ≤(2.11) f(y) + ⟨∇f(y), T − y⟩ + (L_f/2)||T − y||² + Ψ(T)
≤((2.9), x=y) f(y) + ⟨L·B(T − y) + ξ, y − T⟩ + (L_f/2)||T − y||² + Ψ(T)
= f(y) + ((L_f − 2L)/2)||T − y||² + Ψ(T) + ⟨ξ, y − T⟩
≤ φ(y) − ((2L − L_f)/2)||T − y||².

Taking into account the definition (2.8), we get (2.13). Further,

⟨∇f(T) + ξ, y − T⟩ = ⟨∇f(y) + ξ, y − T⟩ − ⟨∇f(y) − ∇f(T), y − T⟩
≥((2.9), x=y) ⟨L·B(y − T), y − T⟩ − ⟨∇f(y) − ∇f(T), y − T⟩
≥(2.11) (L − L_f)||T − y||² =(2.8) ((L − L_f)/L²)||g_L(y)||*².

Thus, we get (2.14). Finally,

⟨∇f(T) + ξ, T − x⟩ ≤(2.9) ⟨∇f(T), T − x⟩ + ⟨∇f(y) + L·B(T − y), x − T⟩
= ⟨∇f(T) − ∇f(y), T − x⟩ + ⟨g_L(y), T − x⟩
≤ (1 + S_L(y)/L)||g_L(y)||* · ||T − x||,

and (2.15) follows.

Corollary 1 For any y ∈ Q and any u ∈ F(T_L(y)), ||u|| = 1, we have

Dφ(T_L(y))[u] ≥ −(1 + L_f/L)||g_L(y)||*.   (2.16)

In this respect, it is interesting to investigate the dependence of ||g_L(y)||* on L.

Lemma 2 The norm of the gradient direction, ||g_L(y)||*, is increasing in L, and the norm of the step, ||T_L(y) − y||, is decreasing in L.

Proof Indeed, consider the function

ω(τ) = min_{x∈Q} [ f(y) + ⟨∇f(y), x − y⟩ + (1/(2τ))||x − y||² + Ψ(x) ].

The objective function of this minimization problem is jointly convex in x and τ. Therefore, ω(τ) is convex in τ. Since the minimum of this problem is attained at a single point, ω(τ) is differentiable and
ω′(τ) = −(1/(2τ²))||T_{1/τ}(y) − y||² = −(1/2)||g_{1/τ}(y)||*².

Since ω(·) is convex, ω′(τ) is an increasing function of τ. Hence, ||g_{1/τ}(y)||* is a decreasing function of τ. The second statement follows from the concavity of the function

ω̂(L) = min_{x∈Q} [ f(y) + ⟨∇f(y), x − y⟩ + (L/2)||x − y||² + Ψ(x) ].

Now let us look at the output of the composite gradient mapping from a global perspective.

Theorem 2 For any y ∈ Q we have

m_L(y; T_L(y)) ≤ φ(y) − (1/(2L))||g_L(y)||*²,   (2.17)
m_L(y; T_L(y)) ≤ min_{x∈Q} [ φ(x) + ((L + L_f)/2)||x − y||² ].   (2.18)

If the function f is convex, then

m_L(y; T_L(y)) ≤ min_{x∈Q} [ φ(x) + (L/2)||x − y||² ].   (2.19)

Proof Note that the function m_L(y; x) is strongly convex in x with convexity parameter L. Hence,

φ(y) − m_L(y; T_L(y)) = m_L(y; y) − m_L(y; T_L(y)) ≥ (L/2)||y − T_L(y)||² = (1/(2L))||g_L(y)||*².

Further, if f is convex, then

m_L(y; T_L(y)) = min_{x∈Q} [ f(y) + ⟨∇f(y), x − y⟩ + (L/2)||x − y||² + Ψ(x) ]
≤ min_{x∈Q} [ f(x) + Ψ(x) + (L/2)||x − y||² ] = min_{x∈Q} [ φ(x) + (L/2)||x − y||² ].

For nonconvex f, we can plug into the same reasoning the following consequence of (2.12):

f(y) + ⟨∇f(y), x − y⟩ ≤ f(x) + (L_f/2)||x − y||².
Remark 1 In view of (2.11), for L ≥ L_f we have

φ(T_L(y)) ≤ m_L(y; T_L(y)).   (2.20)

Hence, in this case inequality (2.19) guarantees

φ(T_L(y)) ≤ min_{x∈Q} [ φ(x) + (L/2)||x − y||² ].   (2.21)

Finally, let us prove a useful inequality for strongly convex φ.

Lemma 3 Let the function φ be strongly convex with convexity parameter μ_φ > 0. Then for any y ∈ Q we have

||T_L(y) − x*|| ≤ (1/μ_φ)(1 + S_L(y)/L)||g_L(y)||* ≤ (1/μ_φ)(1 + L_f/L)||g_L(y)||*,   (2.22)

where x* is the unique minimizer of φ on Q.

Proof Indeed, in view of inequality (2.15), we have

(1 + S_L(y)/L)||g_L(y)||* · ||T_L(y) − x*|| ≥ ⟨φ'(T_L(y)), T_L(y) − x*⟩ ≥ μ_φ ||T_L(y) − x*||²,

and (2.22) follows.

Now we are ready to analyze different optimization schemes based on the composite gradient mapping. In the next section, we describe the simplest one.

3 Gradient method

Define first the gradient iteration with the simplest backtracking strategy for the line search parameter (we call its termination condition the full relaxation).
Gradient Iteration G(x, M)

  Set: L := M.
  Repeat: T := T_L(x); if φ(T) > m_L(x; T) then L := L·γ_u.   (3.1)
  Until: φ(T) ≤ m_L(x; T).
  Output: G(x, M).T = T, G(x, M).L = L, G(x, M).S = S_L(x).

If there is an ambiguity in the objective function, we use the notation G_φ(x, M).

For running the gradient scheme, we need to choose an initial optimistic estimate L_0 for the Lipschitz constant L_f:

0 < L_0 ≤ L_f,   (3.2)

and two adjustment parameters γ_u > 1 and γ_d ≥ 1. Let y_0 ∈ Q be our starting point. For k ≥ 0, consider the following iterative process.

Gradient Method GM(y_0, L_0)

  y_{k+1} = G(y_k, L_k).T,  M_k = G(y_k, L_k).L,   (3.3)
  L_{k+1} = max{L_0, M_k/γ_d}.

Thus, y_{k+1} = T_{M_k}(y_k). Since the function f satisfies inequality (2.12), in the loop (3.1), the value L can keep increasing only if L ≤ L_f. Taking into account condition (3.2), we obtain the following bounds:

L_0 ≤ L_k ≤ M_k ≤ γ_u L_f.   (3.4)

Moreover, if γ_d ≥ γ_u, then

L_k ≤ L_f, k ≥ 0.   (3.5)
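A compact Python sketch of iteration (3.1) inside method (3.3). The `prox` argument abstracts the simplicity of Ψ and is our own interface; note that the Ψ-terms appear on both sides of the test φ(T) ≤ m_L(x; T), so they cancel and only f needs to be evaluated:

```python
import numpy as np

def gradient_method(f, grad_f, prox, y0, L0, gamma_u=2.0, gamma_d=2.0,
                    n_iters=100):
    """Sketch of the Gradient Method (3.3) with backtracking (3.1).
    `prox(z, L)` must return argmin_x { Psi(x) + (L/2)*||x - z||^2 },
    so that T_L(y) = prox(y - grad_f(y)/L, L) when B = I."""
    y, L = y0.astype(float).copy(), float(L0)
    for _ in range(n_iters):
        g = grad_f(y)
        while True:  # loop (3.1): grow L until full relaxation holds
            T = prox(y - g / L, L)
            # phi(T) <= m_L(y; T): the Psi-terms cancel on both sides
            if f(T) <= f(y) + g @ (T - y) + 0.5 * L * (T - y) @ (T - y):
                break
            L *= gamma_u
        y, L = T, max(L0, L / gamma_d)
    return y
```

With `prox = lambda z, L: z` (i.e. Ψ ≡ 0) this is the classical gradient method with an adaptive estimate of the Lipschitz constant.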
Note that in (3.1) there is no explicit bound on the number of repetitions of the loop. However, it is easy to see that the total amount of calls of the oracle N_k after k iterations of (3.3) cannot be too big.

Lemma 4 In the method (3.3), for any k ≥ 0 we have

N_k ≤ (1 + ln γ_d / ln γ_u)(k + 1) + (1/ln γ_u) [ln (γ_u L_f / (γ_d L_0))]₊.   (3.6)

Proof Denote by n_i ≥ 1 the number of calls of the oracle at iteration i ≥ 0. Then

L_{i+1} ≥ (1/γ_d) L_i γ_u^{n_i − 1}.

Thus,

n_i ≤ 1 + ln γ_d / ln γ_u + (1/ln γ_u) ln (L_{i+1}/L_i).

Hence, we can estimate

N_k = Σ_{i=0}^k n_i ≤ (1 + ln γ_d / ln γ_u)(k + 1) + (1/ln γ_u) ln (L_{k+1}/L_0).

It remains to note that L_{k+1} ≤(3.4) max{L_0, (γ_u/γ_d) L_f}.

A reasonable choice of the adjustment parameters is as follows:

γ_u = γ_d = 2  ⟹(3.6)  N_k ≤ 2(k + 1) + log₂ (L_f/L_0),  L_k ≤ L_f.   (3.7)

Thus, the performance of the Gradient Method (3.3) is well described by the estimates for the iteration counter. Therefore, in the rest of this section we will focus on estimating the rate of convergence of this method in different situations.

Let us start from the general nonconvex case. Denote

δ_k = min_{0≤i≤k} (1/(2M_i)) ||g_{M_i}(y_i)||*²,  i_k = 1 + argmin_{0≤i≤k} (1/(2M_i)) ||g_{M_i}(y_i)||*².

Theorem 3 Let the function φ be bounded below on Q by some constant φ*. Then

δ_k ≤ (φ(y_0) − φ*)/(k + 1).   (3.8)
Moreover, for any u ∈ F(y_{i_k}) with ||u|| = 1 we have

Dφ(y_{i_k})[u] ≥ −((1 + γ_u) L_f / L_0^{1/2}) · [2(φ(y_0) − φ*)/(k + 1)]^{1/2}.   (3.9)

Proof Indeed, in view of the termination criterion in (3.1), we have

φ(y_i) − φ(y_{i+1}) ≥ φ(y_i) − m_{M_i}(y_i; T_{M_i}(y_i)) ≥(2.17) (1/(2M_i)) ||g_{M_i}(y_i)||*².

Summing up these inequalities for i = 0, ..., k, we obtain (3.8).

Denote j_k = i_k − 1. Since y_{i_k} = T_{M_{j_k}}(y_{j_k}), for any u ∈ F(y_{i_k}) with ||u|| = 1 we have

Dφ(y_{i_k})[u] ≥(2.16) −(1 + L_f/M_{j_k}) ||g_{M_{j_k}}(y_{j_k})||* = −((M_{j_k} + L_f)/M_{j_k}^{1/2}) (2δ_k)^{1/2}
≥(3.8),(3.4) −((1 + γ_u) L_f / L_0^{1/2}) · [2(φ(y_0) − φ*)/(k + 1)]^{1/2}.

Let us describe now the behavior of the Gradient Method (3.3) in the convex case.

Theorem 4 Let the function f be convex on Q. Assume that it attains its minimum on Q at a point x*, and that the level sets of φ are bounded:

||y − x*|| ≤ R ∀y ∈ Q : φ(y) ≤ φ(y_0).   (3.10)

If φ(y_0) − φ(x*) ≥ γ_u L_f R², then φ(y_1) − φ(x*) ≤ (γ_u L_f R²)/2. Otherwise, for any k ≥ 0 we have

φ(y_k) − φ(x*) ≤ 2 γ_u L_f R² / (k + 2).   (3.11)

Moreover, for any u ∈ F(y_{i_k}) with ||u|| = 1 we have

Dφ(y_{i_k})[u] ≥ −(4(1 + γ_u) L_f R / [(k + 2)(k + 4)]^{1/2}) · (γ_u L_f / L_0)^{1/2}.   (3.12)

Proof Since φ(y_{k+1}) ≤ φ(y_k) for all k ≥ 0, the bound ||y_k − x*|| ≤ R is valid for all generated points. Consider y_k(α) = αx* + (1 − α)y_k ∈ Q, α ∈ [0, 1].
Then,

φ(y_{k+1}) ≤ m_{M_k}(y_k; T_{M_k}(y_k)) ≤(2.19) min_{y∈Q} [ φ(y) + (M_k/2)||y − y_k||² ]
≤(3.4) min_{0≤α≤1} [ φ(αx* + (1 − α)y_k) + (γ_u L_f α²/2)||y_k − x*||² ]
≤ min_{0≤α≤1} [ φ(y_k) − α(φ(y_k) − φ(x*)) + (γ_u L_f R²/2) α² ].

If φ(y_0) − φ(x*) ≥ γ_u L_f R², then the optimal solution of the latter optimization problem is α = 1, and we get

φ(y_1) − φ(x*) ≤ (γ_u L_f R²)/2.

Otherwise, the optimal solution is

α = (φ(y_k) − φ(x*)) / (γ_u L_f R²) ≤ (φ(y_0) − φ(x*)) / (γ_u L_f R²) ≤ 1,

and we obtain

φ(y_{k+1}) ≤ φ(y_k) − [φ(y_k) − φ(x*)]² / (2 γ_u L_f R²).   (3.13)

From this inequality, denoting Δ_k = φ(y_k) − φ(x*) and λ_k = 1/Δ_k, we get

λ_{k+1} − λ_k = (Δ_k − Δ_{k+1})/(Δ_k Δ_{k+1}) ≥ Δ_k / (2 γ_u L_f R² Δ_{k+1}) ≥ 1/(2 γ_u L_f R²).

Hence, for k ≥ 0 we have

λ_k ≥ 1/(φ(y_0) − φ(x*)) + k/(2 γ_u L_f R²) ≥ (k + 2)/(2 γ_u L_f R²).

Further, let us fix an integer m, 0 < m < k. Since

φ(y_i) − φ(y_{i+1}) ≥ (1/(2M_i)) ||g_{M_i}(y_i)||*², i = 0, ..., k,

we have

(k − m + 1) δ_k ≤ Σ_{i=m}^k (1/(2M_i)) ||g_{M_i}(y_i)||*² ≤ φ(y_m) − φ(y_{k+1}) ≤ φ(y_m) − φ(x*) ≤(3.11) 2 γ_u L_f R² / (m + 2).
Denote j_k = i_k − 1. Then, for any u ∈ F(y_{i_k}) with ||u|| = 1, we have

Dφ(y_{i_k})[u] ≥(2.16) −(1 + L_f/M_{j_k}) ||g_{M_{j_k}}(y_{j_k})||* = −((M_{j_k} + L_f)/M_{j_k}^{1/2}) (2δ_k)^{1/2}
≥(3.4),(3.11) −2(1 + γ_u) L_f R [γ_u L_f / (L_0 (m + 2)(k + 1 − m))]^{1/2}.

Choosing m = ⌊k/2⌋, we get (m + 2)(k + 1 − m) ≥ (k + 2)(k + 4)/4, and (3.12) follows.

Theorem 5 Let the function φ be strongly convex on Q with convexity parameter μ_φ. If μ_φ/(2γ_u L_f) ≥ 1, then for any k ≥ 0 we have

φ(y_k) − φ(x*) ≤ (γ_u L_f/μ_φ)^k (φ(y_0) − φ(x*)) ≤ (1/2)^k (φ(y_0) − φ(x*)).   (3.14)

Otherwise,

φ(y_k) − φ(x*) ≤ (1 − μ_φ/(4γ_u L_f))^k (φ(y_0) − φ(x*)).   (3.15)

Proof Since φ is strongly convex, for any k ≥ 0 we have

φ(y_k) − φ(x*) ≥ (μ_φ/2) ||y_k − x*||².   (3.16)

Denote y_k(α) = αx* + (1 − α)y_k ∈ Q, α ∈ [0, 1]. Then,

φ(y_{k+1}) ≤(2.19),(3.4) min_{0≤α≤1} [ φ(αx* + (1 − α)y_k) + (γ_u L_f α²/2)||y_k − x*||² ]
≤ min_{0≤α≤1} [ φ(y_k) − α(φ(y_k) − φ(x*)) + (γ_u L_f α²/2)||y_k − x*||² ]
≤(3.16) min_{0≤α≤1} [ φ(y_k) − α(1 − α γ_u L_f/μ_φ)(φ(y_k) − φ(x*)) ].

The minimum of the last expression is achieved for α* = min{1, μ_φ/(2γ_u L_f)}. Hence, if μ_φ/(2γ_u L_f) ≥ 1, then α* = 1 and we get

φ(y_{k+1}) − φ(x*) ≤ (γ_u L_f/μ_φ)(φ(y_k) − φ(x*)) ≤ (1/2)(φ(y_k) − φ(x*)).
If μ_φ/(2γ_u L_f) ≤ 1, then α* = μ_φ/(2γ_u L_f) and

φ(y_{k+1}) − φ(x*) ≤ (1 − μ_φ/(4γ_u L_f))(φ(y_k) − φ(x*)).

Remark 2 (1) In Theorem 5, the condition number L_f/μ_φ can be smaller than one.
(2) For strongly convex φ, the bounds on the directional derivatives can be obtained by combining the inequalities (3.14) and (3.15) with the estimate

φ(y_k) − φ(x*) ≥((2.13), L=L_f) (1/(2L_f)) ||g_{L_f}(y_k)||*²

and inequality (2.16). Thus, inequality (3.14) results in the bound

Dφ(y_{k+1})[u] ≥ −2 (γ_u L_f/μ_φ)^{k/2} [2L_f (φ(y_0) − φ*)]^{1/2},   (3.17)

and inequality (3.15) leads to the bound

Dφ(y_{k+1})[u] ≥ −2 (1 − μ_φ/(4γ_u L_f))^{k/2} [2L_f (φ(y_0) − φ*)]^{1/2},   (3.18)

which are valid for all u ∈ F(y_{k+1}) with ||u|| = 1.
(3) The estimate (3.6) may create an impression that a large value of γ_u can reduce the total number of calls of the oracle. This is not true, since γ_u enters also into the estimates of the rate of convergence of the methods [e.g. (3.11)]. Therefore, reasonable values of this parameter lie in the interval [2, 3].

4 Accelerated scheme

In the previous section, we have seen that, for convex f, the gradient method (3.3) converges as O(1/k). However, it is well known that on convex problems the usual gradient scheme can be accelerated (e.g. Chapter 2 in [9]). Let us show that the same acceleration can be achieved for composite objective functions. Consider the problem

min_{x∈E} [ φ(x) = f(x) + Ψ(x) ],   (4.1)

where the function f is convex and satisfies (2.11), and the function Ψ is closed and strongly convex on E with convexity parameter μ_Ψ ≥ 0. We assume this parameter to be known. The case μ_Ψ = 0 corresponds to convex Ψ. Denote by x* the optimal solution to (4.1).
In problem (4.1), we allow dom Ψ ≠ E. Therefore, the formulation (4.1) covers also constrained problem instances. Note that for (4.1), the first-order optimality condition (2.9) defining the composite gradient mapping can be written in a simpler form:

T_L(y) ∈ dom Ψ,  ∇f(y) + ξ_L(y) = L·B(y − T_L(y)) ≡ g_L(y),   (4.2)

where ξ_L(y) ∈ ∂Ψ(T_L(y)).

For justifying the rate of convergence of different schemes as applied to (4.1), we will use the machinery of estimate functions in its newer variant [13]. Taking into account the special form of the objective in (4.1), we update recursively the following sequences:

- A minimizing sequence {x_k}_{k=0}^∞.
- A sequence of increasing scaling coefficients {A_k}_{k=0}^∞: A_0 = 0, A_k ≝ A_{k−1} + a_k, k ≥ 1.
- A sequence of estimate functions

ψ_k(x) = l_k(x) + A_k Ψ(x) + (1/2)||x − x_0||², k ≥ 0,   (4.3)

where x_0 ∈ dom Ψ is our starting point, and the l_k(x) are linear functions in x ∈ E.

However, as compared with [13], we will add a possibility to update the estimates for the Lipschitz constant L_f, using the initial guess L_0 satisfying (3.2), and two adjustment parameters γ_u > 1 and γ_d ≥ 1. For the above objects, we maintain recursively the following relations:

R¹_k : A_k φ(x_k) ≤ ψ*_k ≝ min_x ψ_k(x);
R²_k : ψ_k(x) ≤ A_k φ(x) + (1/2)||x − x_0||², ∀x ∈ E;  k ≥ 0.   (4.4)

These relations clearly justify the following rate of convergence of the minimizing sequence:

φ(x_k) − φ(x*) ≤ ||x* − x_0||² / (2A_k), k ≥ 1.   (4.5)

Denote v_k = argmin_{x∈E} ψ_k(x). Since the convexity parameter of ψ_k is at least 1, for any x ∈ E we have

A_k φ(x_k) + (1/2)||x − v_k||² ≤(R¹_k) ψ*_k + (1/2)||x − v_k||² ≤ ψ_k(x) ≤(R²_k) A_k φ(x) + (1/2)||x − x_0||².

Hence, taking x = x*, we get two useful consequences of (4.4):

||x* − v_k|| ≤ ||x* − x_0||,  ||v_k − x_0|| ≤ 2||x* − x_0||, k ≥ 0.   (4.6)
Note that the relations (4.4) can be used for justifying the rate of convergence of a dual variant of the gradient method (3.3). Indeed, for v_0 ∈ dom Ψ define ψ_0(x) = (1/2)||x − v_0||², and choose L_0 satisfying condition (3.2).

Dual Gradient Method DG(v_0, L_0), k ≥ 0.

  y_k = G(v_k, L_k).T,  M_k = G(v_k, L_k).L,
  L_{k+1} = max{L_0, M_k/γ_d},  a_{k+1} = 1/M_k,   (4.7)
  ψ_{k+1}(x) = ψ_k(x) + (1/M_k)[f(v_k) + ⟨∇f(v_k), x − v_k⟩ + Ψ(x)].

In this scheme, the operation G is defined by (3.1). Since Ψ is simple, the points v_k are easily computable. Note that the relations R¹_0 and R²_k, k ≥ 0, are trivial. The relations R¹_k can be justified by induction. Define x_0 = y_0, φ*_k = min_{0≤i≤k−1} φ(y_i), and x_k : φ(x_k) = φ*_k for k ≥ 1. Then

ψ*_{k+1} = min_x { ψ_k(x) + (1/M_k)[f(v_k) + ⟨∇f(v_k), x − v_k⟩ + Ψ(x)] }
≥(R¹_k) A_k φ*_k + min_x { (1/2)||x − v_k||² + (1/M_k)[f(v_k) + ⟨∇f(v_k), x − v_k⟩ + Ψ(x)] }
=(2.7) A_k φ*_k + a_{k+1} m_{M_k}(v_k; y_k) ≥(3.1) A_k φ*_k + a_{k+1} φ(y_k) ≥ A_{k+1} φ*_{k+1}.

Thus, the relations R¹_k are valid for all k ≥ 0. Since the values M_k satisfy the bounds (3.4), for method (4.7) we obtain the following rate of convergence:

φ(x_k) − φ(x*) ≤ (γ_u L_f / (2k)) ||x* − v_0||², k ≥ 1.   (4.8)

Note that the constant in the right-hand side of this inequality is four times smaller than the constant in (3.11). However, each iteration of the dual method is two times more expensive as compared with its primal version (3.3).

However, the method (4.7) does not implement the best way of using the machinery of estimate functions. Let us look at an accelerated version of (4.7). As parameters, it has the starting point x_0 ∈ dom Ψ, the lower estimate L_0 > 0 for the Lipschitz constant L_f, and a lower estimate μ ∈ [0, μ_Ψ] for the convexity parameter of the function Ψ.
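A toy sketch of (4.7) for the special case Ψ ≡ 0 and a known Lipschitz constant (so every call of G would return M_k = L, and the line search can be omitted). Then ψ_k is a quadratic and its minimizer v_k has the closed form below; the function names are ours:

```python
import numpy as np

def dual_gradient_method(f, grad_f, v0, L, n_iters=50):
    """Sketch of the Dual Gradient Method (4.7) for Psi = 0 with a known
    Lipschitz constant L, so that M_k = L at every iteration.  Here
    v_k = argmin psi_k = v0 - (1/L) * sum_i grad_f(v_i)."""
    v = v0.astype(float).copy()
    s = np.zeros_like(v)                 # s = (1/L) * sum of gradients
    best_x, best_val = v.copy(), f(v)
    for _ in range(n_iters):
        g = grad_f(v)
        y = v - g / L                    # y_k = T_L(v_k)
        if f(y) < best_val:              # x_k = best point found so far
            best_x, best_val = y.copy(), f(y)
        s += g / L
        v = v0 - s                       # v_{k+1} = argmin psi_{k+1}
    return best_x
```

The primal method (3.3) steps from the last iterate, while this dual variant always steps from the minimizer v_k of the current estimate function, keeping the best point seen as the answer.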
Accelerated Method A(x_0, L_0, μ)

  Initial settings: ψ_0(x) = (1/2)||x − x_0||², A_0 = 0.

  Iteration k ≥ 0:
    Set: L := L_k.
    Repeat:
      Find a from the quadratic equation a²/(A_k + a) = 2(1 + μA_k)/L.   (*)
      Set y = (A_k x_k + a v_k)/(A_k + a), and compute T_L(y).
      If ⟨φ'(T_L(y)), y − T_L(y)⟩ < (1/L)||φ'(T_L(y))||*², then L := L·γ_u.   (4.9)
    Until: ⟨φ'(T_L(y)), y − T_L(y)⟩ ≥ (1/L)||φ'(T_L(y))||*².   (**)
    Define: y_k := y, M_k := L, a_{k+1} := a, L_{k+1} := M_k/γ_d, x_{k+1} := T_{M_k}(y_k),
      ψ_{k+1}(x) := ψ_k(x) + a_{k+1}[f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x)].

As compared with Gradient Iteration (3.1), we use a damped relaxation condition (**) as a stopping criterion of the internal cycle of (4.9).

Lemma 5 Condition (**) in (4.9) is satisfied for any L ≥ L_f.

Proof Denote T = T_L(y). Multiplying the representation

φ'(T) = ∇f(T) + ξ_L(y) =(4.2) L·B(y − T) + ∇f(T) − ∇f(y)   (4.10)

by the vector y − T, we obtain

⟨φ'(T), y − T⟩ = L||y − T||² − ⟨∇f(y) − ∇f(T), y − T⟩.

On the other hand, by (4.10),

||φ'(T)||*² = L²||y − T||² − 2L⟨∇f(y) − ∇f(T), y − T⟩ + ||∇f(y) − ∇f(T)||*².

Hence,

⟨φ'(T), y − T⟩ = (1/L)||φ'(T)||*² + ⟨∇f(y) − ∇f(T), y − T⟩ − (1/L)||∇f(y) − ∇f(T)||*².

Since, for convex f with Lipschitz-continuous gradient, ⟨∇f(y) − ∇f(T), y − T⟩ ≥ (1/L_f)||∇f(y) − ∇f(T)||*² ≥ (1/L)||∇f(y) − ∇f(T)||*² when L ≥ L_f, condition (**) is satisfied.

Thus, we can always guarantee

L_k ≤ M_k ≤ γ_u L_f.   (4.11)

If γ_d ≥ γ_u, then the upper bound (3.5) remains valid.
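A compact sketch of (4.9), again with a known Lipschitz constant so that M_k = L and the internal loop can be omitted, with B = I, and with `prox(z, L) = argmin_x {Ψ(x) + (L/2)||x − z||²}` as our own interface to the simplicity of Ψ:

```python
import numpy as np

def accelerated_method(f, grad_f, prox, x0, L, mu=0.0, n_iters=50):
    """Sketch of the Accelerated Method (4.9) with a known Lipschitz
    constant L (so the internal line search is skipped) and B = I.
    `prox(z, L)` returns argmin_x { Psi(x) + (L/2)||x - z||^2 }."""
    x, v = x0.astype(float).copy(), x0.astype(float).copy()
    A = 0.0
    lin = np.zeros_like(x)    # accumulated gradient part of the linear term of psi_k
    for _ in range(n_iters):
        # a_{k+1} is the positive root of  a^2/(A + a) = 2*(1 + mu*A)/L
        b = 2.0 * (1.0 + mu * A) / L
        a = 0.5 * (b + np.sqrt(b * b + 4.0 * b * A))
        y = (A * x + a * v) / (A + a)
        x = prox(y - grad_f(y) / L, L)   # x_{k+1} = T_L(y_k)
        A += a
        lin += a * grad_f(x)
        # v_{k+1} = argmin psi_{k+1}: a prox of x0 - lin with parameter 1/A
        v = prox(x0 - lin, 1.0 / A)
    return x
```

With Ψ ≡ 0 (`prox = lambda z, L: z`) this reduces to an accelerated gradient scheme with the O(1/k²) rate established below.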
Let us establish a relation between the total number of calls of the oracle N_k after k iterations and the value of the iteration counter.

Lemma 6 In the method (4.9), for any k ≥ 0 we have

N_k ≤ 2(1 + ln γ_d / ln γ_u)(k + 1) + (2/ln γ_u) ln (γ_u L_f / (γ_d L_0)).   (4.12)

Proof Denote by n_i ≥ 2 the number of calls of the oracle at iteration i ≥ 0. At each cycle of the internal loop we call the oracle twice, for computing ∇f(y) and ∇f(T_L(y)). Therefore,

L_{i+1} = (1/γ_d) L_i γ_u^{0.5 n_i − 1}.

Thus,

n_i = 2(1 + ln γ_d / ln γ_u) + (2/ln γ_u) ln (L_{i+1}/L_i).

Hence, we can compute

N_k = Σ_{i=0}^k n_i = 2(1 + ln γ_d / ln γ_u)(k + 1) + (2/ln γ_u) ln (L_{k+1}/L_0).

It remains to note that L_{k+1} ≤ (γ_u/γ_d) L_f.

Thus, each iteration of (4.9) needs approximately two times more calls of the oracle than one iteration of the Gradient Method:

γ_u = γ_d = 2  ⟹  N_k ≤ 4(k + 1) + 2 log₂ (L_f/L_0),  L_k ≤ L_f.   (4.13)

However, we will see that the rate of convergence of (4.9) is much higher.

Let us start from two auxiliary statements.

Lemma 7 Assume μ ≤ μ_Ψ. Then the sequences {x_k}, {A_k} and {ψ_k}, generated by the method A(x_0, L_0, μ), satisfy the relations (4.4) for all k ≥ 0.

Proof Indeed, in view of the initial settings of (4.9), A_0 = 0 and ψ*_0 = 0. Hence, for k = 0, both relations (4.4) are trivial. Assume now that the relations R¹_k, R²_k are valid for some k ≥ 0. In view of R²_k, for any x ∈ E we have

ψ_{k+1}(x) ≤ A_k φ(x) + (1/2)||x − x_0||² + a_{k+1}[f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x)]
≤ (A_k + a_{k+1}) φ(x) + (1/2)||x − x_0||²,

and this is R²_{k+1}. Let us show that the relation R¹_{k+1} is also valid.
Indeed, in view of (4.3), the function ψ_k(x) is strongly convex with convexity parameter 1 + μA_k. Hence, in view of R¹_k, for any x ∈ E we have

ψ_k(x) ≥ ψ*_k + ((1 + μA_k)/2)||x − v_k||² ≥ A_k φ(x_k) + ((1 + μA_k)/2)||x − v_k||².   (4.14)

Therefore,

ψ*_{k+1} = min_{x∈E} { ψ_k(x) + a_{k+1}[f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x)] }
≥(4.14) min_{x∈E} { A_k φ(x_k) + ((1 + μA_k)/2)||x − v_k||² + a_{k+1}[φ(x_{k+1}) + ⟨φ'(x_{k+1}), x − x_{k+1}⟩] }
≥ min_{x∈E} { (A_k + a_{k+1}) φ(x_{k+1}) + ⟨φ'(x_{k+1}), A_k(x_k − x_{k+1}) + a_{k+1}(x − x_{k+1})⟩ + ((1 + μA_k)/2)||x − v_k||² }
=(4.9) min_{x∈E} { A_{k+1} φ(x_{k+1}) + A_{k+1}⟨φ'(x_{k+1}), y_k − x_{k+1}⟩ + a_{k+1}⟨φ'(x_{k+1}), x − v_k⟩ + ((1 + μA_k)/2)||x − v_k||² },

where the last equality uses A_k x_k = A_{k+1} y_k − a_{k+1} v_k. The minimum of the latter problem is attained at x = v_k − (a_{k+1}/(1 + μA_k)) B⁻¹ φ'(x_{k+1}). Thus, we have proved the inequality

ψ*_{k+1} ≥ A_{k+1} φ(x_{k+1}) + A_{k+1}⟨φ'(x_{k+1}), y_k − x_{k+1}⟩ − (a²_{k+1}/(2(1 + μA_k))) ||φ'(x_{k+1})||*².

On the other hand, by the termination criterion (**) in (4.9), we have

⟨φ'(x_{k+1}), y_k − x_{k+1}⟩ ≥ (1/M_k) ||φ'(x_{k+1})||*².

It remains to note that in (4.9) we choose a_{k+1} from the quadratic equation

A_{k+1}/M_k = (A_k + a_{k+1})/M_k = a²_{k+1}/(2(1 + μA_k)).

Thus, R¹_{k+1} is valid.
Thus, in order to use inequality (4.5) for deriving the rate of convergence of the method A(x_0, L_0, μ), we need to estimate the rate of growth of the scaling coefficients {A_k}_{k≥0}.

Lemma 8 For any μ ≥ 0, the scaling coefficients grow as follows:

  A_k ≥ k²/(2 γ_u L_f),  k ≥ 0.  (4.15)

For μ > 0, the rate of growth is linear:

  A_k ≥ (2/(γ_u L_f)) [ 1 + (μ/(2 γ_u L_f))^{1/2} ]^{2(k−1)},  k ≥ 1.  (4.16)

Proof Indeed, in view of the equation in (4.9), we have

  A_{k+1} ≤ A_{k+1}(1 + μA_k) = (M_k/2) a_{k+1}² = (M_k/2)(A_{k+1} − A_k)²
  = (M_k/2)(A_{k+1}^{1/2} − A_k^{1/2})² (A_{k+1}^{1/2} + A_k^{1/2})²
  ≤ 2 A_{k+1} M_k (A_{k+1}^{1/2} − A_k^{1/2})² ≤ 2 A_{k+1} γ_u L_f (A_{k+1}^{1/2} − A_k^{1/2})².

Thus, A_{k+1}^{1/2} − A_k^{1/2} ≥ 1/(2 γ_u L_f)^{1/2}, and for any k ≥ 0 we get A_k^{1/2} ≥ k/(2 γ_u L_f)^{1/2}, which is (4.15). If μ > 0 then, by the same reasoning as above, we obtain

  μ A_k A_{k+1} ≤ A_{k+1}(1 + μA_k) ≤ 2 A_{k+1} γ_u L_f (A_{k+1}^{1/2} − A_k^{1/2})².

Hence, A_{k+1}^{1/2} ≥ A_k^{1/2} [ 1 + (μ/(2 γ_u L_f))^{1/2} ]. Since A_1 = 2/M_0 ≥ 2/(γ_u L_f), we come to (4.16). □

Now we can summarize all our observations.

Theorem 6 Let the gradient of the function f be Lipschitz continuous with constant L_f, and let the parameter L_0 satisfy condition (3.2). Then the rate of convergence of the method A(x_0, L_0, 0) as applied to problem (4.1) can be estimated as follows:

  φ(x_k) − φ(x*) ≤ γ_u L_f ‖x_0 − x*‖² / k²,  k ≥ 1.  (4.17)

If in addition the function Ψ is strongly convex, then the sequence {x_k}_{k≥1} generated by A(x_0, L_0, μ_Ψ) satisfies both (4.17) and the following inequality:

  φ(x_k) − φ(x*) ≤ (γ_u L_f/4) ‖x_0 − x*‖² [ 1 + (μ_Ψ/(2 γ_u L_f))^{1/2} ]^{−2(k−1)},  k ≥ 1.  (4.18)

In the next section we will show how to apply this result in order to achieve some specific goals for different optimization problems.
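The growth estimates of Lemma 8 can be sanity-checked by running the recursion for {A_k} directly. In the illustrative sketch below (mine, not the paper's code), the worst-case value M_k ≡ M = γ_u L_f is held constant, and a_{k+1} is taken as the positive root of the quadratic equation (M/2)a² − (1 + μA_k)a − A_k(1 + μA_k) = 0 from (4.9).

```python
import math

def A_sequence(M, mu, k_max):
    """Scaling coefficients of method (4.9): A_0 = 0, A_{k+1} = A_k + a_{k+1},
    where a_{k+1} is the positive root of A_{k+1}(1 + mu*A_k) = (M/2) a_{k+1}^2."""
    A = [0.0]
    for _ in range(k_max):
        Ak = A[-1]
        c = 1.0 + mu * Ak
        # positive root of (M/2) a^2 - c a - Ak c = 0
        a = (c + math.sqrt(c * c + 2.0 * M * Ak * c)) / M
        A.append(Ak + a)
    return A

M, mu, K = 8.0, 0.5, 60                 # M plays the role of gamma_u * L_f
A = A_sequence(M, mu, K)
for k in range(1, K + 1):
    assert A[k] >= k * k / (2 * M)                                          # (4.15)
    assert A[k] >= (2 / M) * (1 + math.sqrt(mu / (2 * M))) ** (2 * (k - 1)) # (4.16)
```

Both lower bounds hold along the whole trajectory, with (4.16) tight at k = 1 (A_1 = 2/M) and the linear-rate factor taking over for larger k.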
5 Different minimization strategies

5.1 Strongly convex objective with known parameter

Consider the following convex constrained minimization problem:

  min_{x∈Q} f̂(x),  (5.1)

where Q is a closed convex set and f̂ is a strongly convex function with Lipschitz continuous gradient. Assume the convexity parameter μ_f̂ to be known. Denote by σ_Q(x) the indicator function of the set Q:

  σ_Q(x) = 0 if x ∈ Q, and σ_Q(x) = +∞ otherwise.

We can solve problem (5.1) by two different techniques.

1. Reallocating the prox-term in the objective. For μ ∈ (0, μ_f̂], define

  f(x) = f̂(x) − (μ/2)‖x − x_0‖²,  Ψ(x) = σ_Q(x) + (μ/2)‖x − x_0‖².  (5.2)

Note that the function f in (5.2) is convex and its gradient is Lipschitz continuous with L_f = L_f̂ − μ. Moreover, the function Ψ(x) is strongly convex with convexity parameter μ. On the other hand,

  φ(x) = f(x) + Ψ(x) = f̂(x) + σ_Q(x).

Thus, the corresponding unconstrained minimization problem (4.1) coincides with the constrained problem (5.1). Since all conditions of Theorem 6 are satisfied, the method A(x_0, L_0, μ) has the following performance guarantees:

  f̂(x_k) − f̂(x*) ≤ (γ_u (L_f̂ − μ)/4) ‖x_0 − x*‖² [ 1 + (μ/(2 γ_u (L_f̂ − μ)))^{1/2} ]^{−2(k−1)},  k ≥ 1.  (5.3)

This means that an ε-solution of problem (5.1) can be obtained by this technique in

  O( (L_f̂/μ)^{1/2} ln(1/ε) )  (5.4)
iterations. Note that the same problem can also be solved by the Gradient Method (3.3). However, in accordance with (3.15), its performance guarantee is much worse: it needs O( (L_f̂/μ) ln(1/ε) ) iterations.

2. Restart. For problem (5.1), define the following components of the composite objective function in (4.1):

  f(x) = f̂(x),  Ψ(x) = σ_Q(x).  (5.5)

Let us fix an upper bound N ≥ 1 for the number of iterations in A, and consider the following two-level process:

  Choose u_0 ∈ Q. Compute u_{k+1} as the result of N iterations of A(u_k, L_0, 0), k ≥ 0.  (5.6)

In view of definition (5.5), we have

  f̂(u_{k+1}) − f̂(x*) ≤^{(4.17)} γ_u L_f̂ ‖x* − u_k‖² / N² ≤ (2 γ_u L_f̂/(μ_f̂ N²)) [ f̂(u_k) − f̂(x*) ].

Thus, taking N = 2 (γ_u L_f̂/μ_f̂)^{1/2}, we obtain

  f̂(u_{k+1}) − f̂(x*) ≤ ½ [ f̂(u_k) − f̂(x*) ].

Hence, the performance guarantees of this technique are of the same order as (5.4).

5.2 Approximating the first-order optimality conditions

In some applications, we are interested in finding a point with a small residual in the system of first-order optimality conditions. Since, by (2.16),

  Dφ(T_L(x))[u] ≥ −(1 + L_f/L) ‖g_L(x)‖,  u ∈ F(T_L(x)), ‖u‖ = 1,  (5.7)

and since, by (2.13), ‖g_L(x)‖ can be bounded in terms of φ(x) − φ(x*), the upper bounds on this residual can be obtained from the estimates on the rate of convergence of method (4.9) in the form (4.17) or (4.18). However, in this case the first inequality does not give a satisfactory result. Indeed, it can only guarantee that
the right-hand side of inequality (5.7) vanishes as O(1/k). This rate is typical for the Gradient Method (see (3.12)), and from the accelerated version (4.9) we can expect much more. Let us show how to achieve a better result.

Consider the following constrained optimization problem:

  min_{x∈Q} f(x),  (5.8)

where Q is a closed convex set and f is a convex function with Lipschitz continuous gradient. Let us fix a tolerance parameter δ > 0 and a starting point x_0 ∈ Q. Define

  Ψ(x) = σ_Q(x) + (δ/2)‖x − x_0‖².

Consider now the unconstrained minimization problem (4.1) with composite objective function φ(x) = f(x) + Ψ(x). Note that the function Ψ is strongly convex with parameter μ_Ψ = δ. Hence, in view of Theorem 6, the method A(x_0, L_0, δ) converges as follows:

  φ(x_k) − φ(x*) ≤ (γ_u L_f/4) ‖x_0 − x*‖² [ 1 + (δ/(2 γ_u L_f))^{1/2} ]^{−2(k−1)}.  (5.9)

For simplicity, we can choose γ_u = γ_d in order to have L_k ≤ L_f for all k ≥ 0. Let us now compute T_k = G(x_k, L_k).T and M_k = G(x_k, L_k).L. Then

  φ(x_k) − φ(x*) ≥ φ(x_k) − φ(T_k) ≥^{(2.17)} (1/(2M_k)) ‖g_{M_k}(x_k)‖²,  L_0 ≤ M_k ≤ γ_u L_f,

and we obtain the following estimate:

  ‖g_{M_k}(x_k)‖ ≤^{(5.9)} (γ_u L_f/2^{1/2}) ‖x_0 − x*‖ [ 1 + (δ/(2 γ_u L_f))^{1/2} ]^{1−k}.  (5.10)

In our case, the first-order optimality conditions (4.2) for computing T_{M_k}(x_k) can be written as follows:

  ∇f(x_k) + δ B(T_k − x_0) + ξ_k = g_{M_k}(x_k),  (5.11)

where ξ_k ∈ ∂σ_Q(T_k). Note that for any y ∈ Q we have

  0 = σ_Q(y) ≥ σ_Q(T_k) + ⟨ξ_k, y − T_k⟩ = ⟨ξ_k, y − T_k⟩.  (5.12)
Hence, for any direction u ∈ F(T_k) with ‖u‖ = 1 we obtain

  ⟨∇f(T_k), u⟩ ≥^{(2.11)} ⟨∇f(x_k), u⟩ − (L_f/M_k) ‖g_{M_k}(x_k)‖
  =^{(5.11)} ⟨g_{M_k}(x_k) − δ B(T_k − x_0) − ξ_k, u⟩ − (L_f/M_k) ‖g_{M_k}(x_k)‖
  ≥^{(5.12)} −δ ‖T_k − x_0‖ − (1 + L_f/M_k) ‖g_{M_k}(x_k)‖.

Assume now that the size of the set Q does not exceed R, and let δ = ε L_0. Let us choose the number of iterations k from the inequality

  [ 1 + (ε L_0/(2 γ_u L_f))^{1/2} ]^{1−k} ≤ ε.

Then the residual of the first-order optimality conditions satisfies the following inequality:

  ⟨∇f(T_k), u⟩ ≥ −ε R [ L_0 + (γ_u L_f/2^{1/2})(1 + L_f/L_0) ],  u ∈ F(T_k), ‖u‖ = 1.  (5.13)

For that, the required number of iterations k is at most of the order O( ε^{−1/2} ln(1/ε) ).

5.3 Unknown parameter of strongly convex objective

In Sect. 5.1 we discussed two efficient strategies for minimizing a strongly convex function with a known estimate of the convexity parameter μ_f̂. Usually, however, this information is not available. We can easily obtain only an upper estimate for this value, for example by the inequality

  μ_f̂ ≤ S_L(x) ≤ L_f̂,  x ∈ Q.

Let us show that such a bound can also be used for designing an efficient optimization strategy for strongly convex functions.

For problem (5.1), assume that we have some guess μ for the parameter μ_f̂ and a starting point u_0 ∈ Q. Denote φ_0(x) = f̂(x) + σ_Q(x). Let us choose

  x_0 = G_{φ_0}(u_0, L_0).T,  M_0 = G_{φ_0}(u_0, L_0).L,  S_0 = G_{φ_0}(u_0, L_0).S,

and minimize the composite objective (5.2) by the method A(x_0, M_0, μ), using the following stopping criterion:
  Compute v_k = G_{φ_0}(x_k, L_k).T and M_k = G_{φ_0}(x_k, L_k).L.
  Stop the stage if either
  A): ‖g_{M_k}(x_k)[φ_0]‖ ≤ ½ ‖g_{M_0}(u_0)[φ_0]‖, or
  B): A_k ≥ (4 M_k/μ²) (1 + S_0/M_0)².  (5.14)

If the stage was terminated by Condition A), then we call it successful. In this case, we run the next stage, taking v_k as the new starting point and keeping the estimate μ of the convexity parameter unchanged.

Suppose that the stage was terminated by Condition B) (an unsuccessful stage). If μ were a correct lower bound for the convexity parameter μ_f̂, then we would have

  (1/(2M_k)) ‖g_{M_k}(x_k)[φ_0]‖² ≤^{(2.22)} f̂(x_k) − f̂(x*) ≤^{(4.5)} (1/(2A_k)) ‖x_0 − x*‖²
  ≤ (1/(2 A_k μ²)) (1 + S_0/M_0)² ‖g_{M_0}(u_0)[φ_0]‖².

Hence, in view of Condition B), in this case the stage must be terminated by Condition A). Since this did not happen, we conclude that μ > μ_f̂. Therefore, we redefine μ := μ/2 and run the stage again from the old starting point x_0.

We are not going to present all details of the complexity analysis of the above strategy. It can be shown that, for generating an ε-solution of problem (5.1) with strongly convex objective, it needs

  O( κ_f̂^{1/2} ln κ_f̂ ) + O( κ_f̂^{1/2} ln κ_f̂ · ln(κ_f̂/ε) ),  κ_f̂ := L_f̂/μ_f̂,

calls of the oracle. The first term in this bound corresponds to the total number of calls of the oracle at all unsuccessful stages. The factor κ_f̂^{1/2} ln κ_f̂ represents an upper bound on the length of any stage, independently of the variant of its termination.

6 Computational experiments

We tested the algorithms described above on a set of randomly generated Sparse Least Squares problems of the form

  Find φ* = min_{x∈R^n} [ φ(x) := ½‖Ax − b‖²_2 + ‖x‖_1 ],  (6.1)

where A = (a_1, ..., a_n) is a dense m×n matrix with m < n. All problems were generated with known optimal solutions, which can be obtained from the dual representation of the initial primal problem (6.1):
  min_{x∈R^n} [ ½‖Ax − b‖²_2 + ‖x‖_1 ]
  = min_{x∈R^n} max_{u∈R^m} [ ⟨u, b − Ax⟩ − ½‖u‖²_2 + ‖x‖_1 ]
  = max_{u∈R^m} min_{x∈R^n} [ ⟨b, u⟩ − ½‖u‖²_2 − ⟨A^T u, x⟩ + ‖x‖_1 ]
  = max_{u∈R^m} { ⟨b, u⟩ − ½‖u‖²_2 : ‖A^T u‖_∞ ≤ 1 }.  (6.2)

Thus, the problem dual to (6.1) consists in finding the Euclidean projection of the vector b ∈ R^m onto the dual polytope

  D = { y ∈ R^m : ‖A^T y‖_∞ ≤ 1 }.

This interpretation explains the changing sparsity of the optimal solution x*(τ) of the following parametric version of problem (6.1):

  min_{x∈R^n} [ φ_τ(x) := ½‖Ax − b‖²_2 + τ‖x‖_1 ].  (6.3)

Indeed, for τ > 0 we have

  φ_τ(x) = τ² [ ½‖A(x/τ) − b/τ‖²_2 + ‖x/τ‖_1 ].

Hence, in the dual problem we project the vector b/τ onto the polytope D. The nonzero components of x*(τ) correspond to the active facets of D. Thus, for τ big enough we have b/τ ∈ int D, which means x*(τ) = 0. When τ decreases, x*(τ) becomes more and more dense. Finally, if all facets of D are in general position, x*(τ) has exactly m nonzero components as τ → 0.

In our computational experiments, we compare three minimization methods. Two of them maintain recursively relations (4.4). This feature allows us to classify them as primal-dual methods. Indeed, denote

  φ_*(u) = ½‖u‖²_2 − ⟨b, u⟩.

As we have seen in (6.2),

  φ(x) + φ_*(u) ≥ 0,  for all x ∈ R^n, u ∈ D.  (6.4)

Moreover, the lower bound is achieved only at the optimal solutions of the primal and dual problems. For some sequence {z_i}_{i≥1} and a starting point z_0 ∈ dom Ψ, relations (4.4) ensure
  A_k φ(x_k) ≤ min_{x∈R^n} { Σ_{i=1}^k a_i [ f(z_i) + ⟨∇f(z_i), x − z_i⟩ ] + A_k Ψ(x) + ½‖x − z_0‖² }.  (6.5)

In our situation, f(x) = ½‖Ax − b‖²_2, Ψ(x) = ‖x‖_1, and we choose z_0 = 0. Denote u_i = b − Az_i. Then ∇f(z_i) = −A^T u_i, and therefore

  f(z_i) − ⟨∇f(z_i), z_i⟩ = ½‖u_i‖²_2 + ⟨A^T u_i, z_i⟩ = ⟨b, u_i⟩ − ½‖u_i‖²_2 = −φ_*(u_i).

Denoting

  ū_k = (1/A_k) Σ_{i=1}^k a_i u_i,  (6.6)

we obtain

  A_k [ φ(x_k) + φ_*(ū_k) ] ≤ A_k φ(x_k) + Σ_{i=1}^k a_i φ_*(u_i)
  ≤^{(6.5)} min_{x∈R^n} { Σ_{i=1}^k a_i ⟨∇f(z_i), x⟩ + A_k Ψ(x) + ½‖x‖² } ≤ 0,

where the last inequality follows by taking x = 0. In view of (6.4), ū_k cannot be feasible:

  φ_*(ū_k) ≤ −φ(x_k) ≤ −φ* = min_{u∈D} φ_*(u).  (6.7)

Let us measure the level of infeasibility of these points. Note that the minimum of the optimization problem in (6.5) is attained at x = v_k. Hence, the corresponding first-order optimality conditions ensure

  | ⟨a_i, A_k ū_k⟩ − (B v_k)^{(i)} | ≤ A_k,  i = 1, ..., n.

Assume that the matrix B in (1.3) is diagonal:

  B^{(i,j)} = d_i if i = j, and 0 otherwise.

Then | ⟨a_i, ū_k⟩ | ≤ 1 + (d_i/A_k) | v_k^{(i)} |, and

  ρ(ū_k) := [ Σ_{i=1}^n (1/d_i) ( ( |⟨a_i, ū_k⟩| − 1 )_+ )² ]^{1/2} ≤ (1/A_k) ‖v_k‖ ≤^{(4.6)} (2/A_k) ‖x*‖,  (6.8)
where (α)_+ = max{α, 0}. Thus, we can use the function ρ(·) as a dual infeasibility measure. In view of (6.7), it is a reasonable stopping criterion for our primal-dual methods.

For generating the random test problems, we apply the following strategy.

- Choose m* ≤ m, the number of nonzero components of the optimal solution x* of problem (6.1), and a parameter ρ > 0 responsible for the size of x*.
- Generate randomly a matrix B ∈ R^{m×n} with elements uniformly distributed in the interval [−1, 1].
- Generate randomly a vector v* ∈ R^m with elements uniformly distributed in [0, 1]. Define y* = v*/‖v*‖_2.
- Sort the entries of the vector B^T y* in decreasing order of their absolute values. For the sake of notation, assume that this order coincides with the natural one.
- For i = 1, ..., n, define a_i = α_i b_i with α_i > 0 chosen according to the following rule:

  α_i = 1/⟨b_i, y*⟩  for i = 1, ..., m*;
  α_i = 1,  if |⟨b_i, y*⟩| ≤ 0.1 and i > m*;
  α_i = ξ_i/⟨b_i, y*⟩  otherwise,

  where the ξ_i are uniformly distributed in [0, 1].
- For i = 1, ..., n, generate the components of the primal solution:

  x*^{(i)} = ξ_i* sign(⟨a_i, y*⟩) for i ≤ m*, and 0 otherwise,

  where the ξ_i* are uniformly distributed in [0, ρ/√m*].
- Define b = y* + A x*.

Thus, the optimal value of the randomly generated problem (6.1) can be computed as φ* = ½‖y*‖²_2 + ‖x*‖_1. In the first series of tests, we use this value in the termination criterion.

Let us look first at the results of minimizing two typical random problem instances. The first problem is relatively easy. In the table below, the column Gap shows the relative decrease of the initial residual. In the rest of the table, we can see the computational results for three methods:

- the Primal Gradient Method (3.3), abbreviated PG;
- the Dual version of the Gradient Method (4.7), abbreviated DG;
- the Accelerated Gradient Method (4.9), abbreviated AC.

In all methods, we use the following values of the parameters:

  γ_u = γ_d = 2,  x_0 = 0,  L_0 = max_{1≤i≤n} ‖a_i‖²_2 ≤ L_f,  μ = 0.
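The generation strategy above can be sketched in code. The following is an illustrative reimplementation, not the authors' code; the function and variable names are mine, and where the description is ambiguous (e.g. the exact scaling of the ξ_i* draws) the choices below are assumptions that merely preserve the stated properties: y* is dual-feasible, active exactly on the support of x*, and b = y* + A x*.

```python
import numpy as np

def generate_sparse_ls(n, m, m_star, rho, seed=0):
    """Random Sparse Least Squares instance with known optimal solution,
    following the recipe of Sect. 6 (illustrative reconstruction)."""
    rng = np.random.default_rng(seed)
    B = rng.uniform(-1.0, 1.0, size=(m, n))
    v = rng.uniform(0.0, 1.0, size=m)
    y = v / np.linalg.norm(v)                 # y* with ||y*||_2 = 1
    # sort columns by decreasing |<b_i, y*>| so the "natural order" holds
    B = B[:, np.argsort(-np.abs(B.T @ y))]
    by = B.T @ y
    alpha = np.empty(n)
    alpha[:m_star] = 1.0 / by[:m_star]        # active facets: <a_i, y*> = 1
    for i in range(m_star, n):
        if abs(by[i]) <= 0.1:
            alpha[i] = 1.0                    # already strictly inside D
        else:
            alpha[i] = rng.uniform(0.0, 1.0) / by[i]   # forces |<a_i, y*>| < 1
    A = B * alpha
    x_star = np.zeros(n)
    xi = rng.uniform(0.0, rho / np.sqrt(m_star), size=m_star)
    x_star[:m_star] = xi * np.sign(A[:, :m_star].T @ y)
    b = y + A @ x_star                        # makes A x* - b = -y*
    return A, b, x_star, y

A, b, x_star, y = generate_sparse_ls(n=200, m=50, m_star=10, rho=1.0)
assert np.max(np.abs(A.T @ y)) <= 1.0 + 1e-12   # y* is dual-feasible
assert np.allclose(A[:, :10].T @ y, 1.0)        # active exactly on the support
phi_star = 0.5 * np.linalg.norm(y) ** 2 + np.abs(x_star).sum()
```

Since A x* − b = −y*, the subgradient condition 0 ∈ A^T(Ax* − b) + ∂‖x*‖_1 holds at x* by construction, which is what makes the optimal value φ* computable in closed form.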
Problem 1: n = 4,000, m = 1,000, m* = 100, ρ = 1.

[Table: for each target Gap level the columns report, for PG, DG and AC, the iteration count k, the number #Ax of matrix-vector multiplications, and SpeedUp (%); the numerical entries are not recoverable from this extraction.]

Let us explain the remaining columns of this table. For each method, the column k shows the number of iterations necessary for reaching the corresponding reduction of the initial gap in the function value. The column #Ax shows the necessary number of matrix-vector multiplications. Note that for computing the value f(x) we need one multiplication; if, in addition, we need the gradient, we need one more. For example, in accordance with estimate (4.13), each iteration of (4.9) needs four computations of the function/gradient pair. Hence, in this method we can expect eight matrix-vector multiplications per iteration. The Gradient Method (3.3) needs on average two calls of the oracle per iteration; however, one of them is made inside the line-search procedure (3.1) and requires only the function value. Hence, in this case we expect three matrix-vector multiplications per iteration. In the table, we can observe the remarkable accuracy of these predictions.

Finally, the column SpeedUp represents the absolute accuracy of the current approximate solution as a percentage of the worst-case estimate given by the corresponding rate of convergence. Since the exact L_f is unknown, we use L_0 instead.
We can see that all methods usually significantly outperform the theoretically predicted rate of convergence. However, for each of them there are parts of the trajectory where the worst-case predictions are quite accurate. This is even more evident in our second table, which corresponds to a more difficult problem instance.

Problem 2: n = 5,000, m = 500, m* = 100, ρ = 1.

[Table: same columns as for Problem 1 (Gap; k, #Ax, SpeedUp (%) for PG, DG and AC); the numerical entries are not recoverable from this extraction.]

In this table, we can see that the Primal Gradient Method still significantly outperforms its theoretical predictions. This is not too surprising, since it can, for example, automatically accelerate on strongly convex functions (see Theorem 5); all other methods require explicit changes in their schemes for that. However, despite these discrepancies, the main conclusion of our theoretical analysis is confirmed: the accelerated scheme (4.9) significantly outperforms the primal and dual variants of the Gradient Method.

In the second series of tests, we studied the ability of the primal-dual schemes (4.7) and (4.9) to decrease the infeasibility measure ρ(·) (see (6.8)). This problem,
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationA NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang
A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES Fenghui Wang Department of Mathematics, Luoyang Normal University, Luoyang 470, P.R. China E-mail: wfenghui@63.com ABSTRACT.
More informationOn the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method
Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical
More informationWorst Case Complexity of Direct Search
Worst Case Complexity of Direct Search L. N. Vicente October 25, 2012 Abstract In this paper we prove that the broad class of direct-search methods of directional type based on imposing sufficient decrease
More informationON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE. Sangho Kum and Gue Myung Lee. 1. Introduction
J. Korean Math. Soc. 38 (2001), No. 3, pp. 683 695 ON GAP FUNCTIONS OF VARIATIONAL INEQUALITY IN A BANACH SPACE Sangho Kum and Gue Myung Lee Abstract. In this paper we are concerned with theoretical properties
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationImplications of the Constant Rank Constraint Qualification
Mathematical Programming manuscript No. (will be inserted by the editor) Implications of the Constant Rank Constraint Qualification Shu Lu Received: date / Accepted: date Abstract This paper investigates
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More informationOne Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties
One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,
More informationConvex Optimization. Problem set 2. Due Monday April 26th
Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining
More informationarxiv: v4 [math.oc] 1 Apr 2018
Generalizing the optimized gradient method for smooth convex minimization Donghwan Kim Jeffrey A. Fessler arxiv:607.06764v4 [math.oc] Apr 08 Date of current version: April 3, 08 Abstract This paper generalizes
More informationThe Frank-Wolfe Algorithm:
The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology
More informationGradient Sliding for Composite Optimization
Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this
More informationLecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent
10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 5: Gradient Descent Scribes: Loc Do,2,3 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for
More informationAn accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems
An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract
More informationMath 273a: Optimization Convex Conjugacy
Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper
More informationIterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming
Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study
More informationLecture 9 Sequential unconstrained minimization
S. Boyd EE364 Lecture 9 Sequential unconstrained minimization brief history of SUMT & IP methods logarithmic barrier function central path UMT & SUMT complexity analysis feasibility phase generalized inequalities
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in
More informationSelf-Concordant Barrier Functions for Convex Optimization
Appendix F Self-Concordant Barrier Functions for Convex Optimization F.1 Introduction In this Appendix we present a framework for developing polynomial-time algorithms for the solution of convex optimization
More informationAugmented self-concordant barriers and nonlinear optimization problems with finite complexity
Augmented self-concordant barriers and nonlinear optimization problems with finite complexity Yu. Nesterov J.-Ph. Vial October 9, 2000 Abstract In this paper we study special barrier functions for the
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More information10 Numerical methods for constrained problems
10 Numerical methods for constrained problems min s.t. f(x) h(x) = 0 (l), g(x) 0 (m), x X The algorithms can be roughly divided the following way: ˆ primal methods: find descent direction keeping inside
More information3.10 Lagrangian relaxation
3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the
More informationA Second-Order Method for Strongly Convex l 1 -Regularization Problems
Noname manuscript No. (will be inserted by the editor) A Second-Order Method for Strongly Convex l 1 -Regularization Problems Kimon Fountoulakis and Jacek Gondzio Technical Report ERGO-13-11 June, 13 Abstract
More informationGeneralized Power Method for Sparse Principal Component Analysis
Generalized Power Method for Sparse Principal Component Analysis Peter Richtárik CORE/INMA Catholic University of Louvain Belgium VOCAL 2008, Veszprém, Hungary CORE Discussion Paper #2008/70 joint work
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More informationAn accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex concave saddle-point problems
Optimization Methods and Software ISSN: 1055-6788 (Print) 1029-4937 (Online) Journal homepage: http://www.tandfonline.com/loi/goms20 An accelerated non-euclidean hybrid proximal extragradient-type algorithm
More informationarxiv: v1 [math.oc] 5 Dec 2014
FAST BUNDLE-LEVEL TYPE METHODS FOR UNCONSTRAINED AND BALL-CONSTRAINED CONVEX OPTIMIZATION YUNMEI CHEN, GUANGHUI LAN, YUYUAN OUYANG, AND WEI ZHANG arxiv:141.18v1 [math.oc] 5 Dec 014 Abstract. It has been
More informationA Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming
A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose
More informationNonlinear Optimization for Optimal Control
Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More information1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method
L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order
More informationConvex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University
Convex Optimization 9. Unconstrained minimization Prof. Ying Cui Department of Electrical Engineering Shanghai Jiao Tong University 2017 Autumn Semester SJTU Ying Cui 1 / 40 Outline Unconstrained minimization
More informationSOLVING A MINIMIZATION PROBLEM FOR A CLASS OF CONSTRAINED MAXIMUM EIGENVALUE FUNCTION
International Journal of Pure and Applied Mathematics Volume 91 No. 3 2014, 291-303 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v91i3.2
More informationAccelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)
Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 20 Subgradients Assumptions
More information