Gradient methods for minimizing composite functions


Yu. Nesterov

May 2010

Abstract. In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum of two terms: one is smooth and given by a black-box oracle, and another is a simple general convex function with known structure. Despite the absence of good properties of the sum, such problems, both in convex and nonconvex cases, can be solved with an efficiency typical for the first part of the objective. For convex problems of the above structure, we consider primal and dual variants of the gradient method with convergence rate O(1/k), and an accelerated multistep version with convergence rate O(1/k²), where k is the iteration counter. For nonconvex problems with this structure, we prove convergence to a point from which there is no descent direction. In contrast, we show that for general nonsmooth, nonconvex problems, even resolving the question of whether a descent direction exists from a point is NP-hard. For all methods, we suggest some efficient line-search procedures and show that the additional computational work necessary for estimating the unknown problem-class parameters can only multiply the complexity of each iteration by a small constant factor. We present also the results of preliminary computational experiments, which confirm the superiority of the accelerated scheme.

Keywords: local optimization, convex optimization, nonsmooth optimization, complexity theory, black-box model, optimal methods, structural optimization, l1-regularization.

Dedicated to Claude Lemaréchal on the occasion of his 65th birthday.

Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium; nesterov@core.ucl.ac.be. The author acknowledges the support from Office of Naval Research grant # N…: Efficiently Computable Compressed Sensing.

1 Introduction

Motivation. In recent years, several advances in Convex Optimization have been based on the development of different models for optimization problems. Starting from the theory of self-concordant functions [14], it became more and more clear that the proper use of the problem's structure can lead to very efficient optimization methods, which significantly overcome the limitations of the black-box Complexity Theory (see Section 4.1 in [9] for a discussion). For more recent examples, we can mention the development of the smoothing technique [10], or the special methods for minimizing a convex objective function up to a certain relative accuracy [11]. In both cases, the proposed optimization schemes strongly exploit the particular structure of the corresponding optimization problem.

In this paper, we develop new optimization methods for approximating global and local minima of composite convex objective functions φ(x). Namely, we assume that

  φ(x) = f(x) + Ψ(x),  (1.1)

where f(x) is a differentiable function defined by a black-box oracle, and Ψ(x) is a general closed convex function. However, we assume that the function Ψ(x) is simple. This means that we are able to find a closed-form solution for minimizing the sum of Ψ and some simple auxiliary functions. Let us give several examples.

1. Constrained minimization. Let Q be a closed convex set. Define Ψ as the indicator function of the set Q:

  Ψ(x) = 0 if x ∈ Q, and +∞ otherwise.

Then the unconstrained minimization of the composite function (1.1) is equivalent to minimizing the function f over the set Q. We will see that our assumption on the simplicity of the function Ψ reduces to the ability of finding, in closed form, the Euclidean projection of an arbitrary point onto the set Q.

2. Barrier representation of the feasible set. Assume that the objective function of the convex constrained minimization problem

  find f* = min_{x∈Q} f(x)

is given by a black-box oracle, but the feasible set Q is described by a ν-self-concordant barrier F(x) [14]. Define

  Ψ(x) = (ε/ν) F(x), φ(x) = f(x) + Ψ(x), x* = arg min_{x∈Q} f(x).

Then, for an arbitrary x̂ ∈ int Q, by the general properties of self-concordant barriers we get

  f(x̂) ≤ f(x*) + ⟨∇φ(x̂), x̂ − x*⟩ + (ε/ν)⟨∇F(x̂), x* − x̂⟩ ≤ f* + ‖∇φ(x̂)‖* · ‖x̂ − x*‖ + ε.

Thus, a point x̂ with a small norm of the gradient of the function φ approximates well the solution of the constrained minimization problem. Note that the objective function φ does not belong to any standard class of convex problems formed by functions with bounded derivatives of a certain degree.
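For a concrete illustration of the first example above, the following minimal sketch (ours, not from the paper) assumes Q is a coordinate box and B = I; it shows that the "simplicity" of Ψ amounts to one Euclidean projection per auxiliary subproblem:

```python
import numpy as np

# Minimal sketch (assumed setting, not from the paper's text): for the
# indicator function of a box Q = {x : lo <= x <= hi}, the auxiliary problem
#     min_x  <g, x - y> + (L/2)*||x - y||^2 + Psi(x)
# has the closed-form solution clip(y - g/L, lo, hi), i.e. "simplicity" of
# Psi reduces to a Euclidean projection onto Q.
def prox_step_box(y, g, L, lo, hi):
    return np.clip(y - g / L, lo, hi)
```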

3. Sparse least squares. In many applications, it is necessary to minimize the following objective:

  φ(x) = ½‖Ax − b‖₂² + ‖x‖₁ ≝ f(x) + Ψ(x),  (1.2)

where A is a matrix of corresponding dimension and ‖·‖_k denotes the standard l_k-norm. The presence of the additive l₁-term very often increases the sparsity of the optimal solution (see [1, 18]). This feature was observed a long time ago (see, for example, [2, 6, 16, 17]). Recently, this technique has become popular in signal processing and statistics [7, 19]. (An interested reader can find a good survey of the literature, existing minimization techniques, and new methods in [3] and [5].)

From the formal point of view, the objective φ(x) in (1.2) is a nonsmooth convex function. Hence, the standard black-box gradient schemes need O(1/ε²) iterations for generating an ε-solution. The structural methods based on the smoothing technique [10] need O(1/ε) iterations. However, we will see that the same problem can be solved in O(1/ε^{1/2}) iterations of a special gradient-type scheme.

Contents. In Section 2 we study the problem of finding a local minimum of a nonsmooth nonconvex function. First, we show that for general nonsmooth, nonconvex functions, even resolving the question of whether there exists a descent direction from a point is NP-hard. However, for the special form (1.1) of the objective function, we can introduce the composite gradient mapping, which makes the above-mentioned problem solvable. The objective function of the auxiliary problem is formed as a sum of the objective of the usual gradient mapping [8] and the general nonsmooth convex term Ψ. For the particular case (1.2), this construction was proposed in [10]. (This idea has, however, a much longer history; to the best of our knowledge, for the general framework the technique was originally developed in [4].) We present different properties of this object, which are important for the complexity analysis of optimization methods.

In Section 3 we study the behavior of the simplest gradient scheme based on the composite gradient mapping. We prove that in both the convex and the nonconvex cases we have exactly the same complexity results as in the usual smooth situation (Ψ ≡ 0). For example, for nonconvex f, the maximal negative slope of φ along the minimization sequence with monotonically decreasing function values vanishes as O(1/k^{1/2}), where k is the iteration counter. Thus, the limiting points have no descent directions (see Theorem 3). If f is convex and has Lipschitz-continuous gradient, then the Gradient Method converges as O(1/k). It is important that our version of the Gradient Method has an adjustable stepsize strategy, which needs on average only one additional computation of the function value per iteration.

In Section 4, we introduce a machinery of estimate sequences and apply it, first, for justifying the rate of convergence of a dual variant of the gradient method. Afterwards, we present an accelerated version, which converges as O(1/k²). As compared with the previous variants of accelerated schemes (e.g. [9], [10]), our new scheme can efficiently adjust the initial estimate of the unknown Lipschitz constant. In Section 5 we give examples of applications of the accelerated scheme. We show how to minimize functions with a known strong convexity parameter (Section 5.1), how to find a point with a small residual in the system of first-order optimality conditions (Section 5.2), and how to approximate the unknown parameters of strong convexity (Section 5.3).
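As a hedged illustration of why the l₁-term in (1.2) is "simple" in the above sense: the corresponding auxiliary subproblem is solved coordinate-wise by soft-thresholding (a standard fact; the function names below are ours):

```python
import numpy as np

# Sketch for example (1.2): with Psi(x) = ||x||_1, the subproblem
#     min_x <g, x - y> + (L/2)*||x - y||^2 + ||x||_1
# is solved coordinate-wise by soft-thresholding.
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_step_l1(y, g, L):
    # closed-form minimizer for differentiable f and Psi = l1-norm
    return soft_threshold(y - g / L, 1.0 / L)
```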

In Section 6 we present the results of preliminary testing of the proposed optimization methods. An earlier version of these results was published in the research report [12].

Notation. In what follows, E denotes a finite-dimensional real vector space, and E* the dual space, formed by all linear functions on E. The value of a function s ∈ E* at x ∈ E is denoted by ⟨s, x⟩. By fixing a positive definite self-adjoint operator B : E → E*, we can define the following Euclidean norms:

  ‖h‖ = ⟨Bh, h⟩^{1/2}, h ∈ E;  ‖s‖* = ⟨s, B⁻¹s⟩^{1/2}, s ∈ E*.  (1.3)

(In this paper, for the sake of simplicity, we restrict ourselves to Euclidean norms only. The extension to the general case can be done in a standard way, using Bregman distances (e.g. [10]).) In the particular case of the coordinate vector space E = Rⁿ, we have E = E*. Then B is usually taken as the unit matrix, and ⟨s, x⟩ denotes the standard coordinate-wise inner product.

Further, for a function f(x), x ∈ E, we denote by ∇f(x) its gradient at x:

  f(x + h) = f(x) + ⟨∇f(x), h⟩ + o(‖h‖), h ∈ E.

Clearly ∇f(x) ∈ E*. For a convex function Ψ, we denote by ∂Ψ(x) its subdifferential at x. The directional derivative of a function φ is defined in the usual way:

  Dφ(y)[u] = lim_{α↓0} (1/α)[φ(y + αu) − φ(y)].

Finally, we write a ≥^{(#.#)} b to indicate that the inequality a ≥ b is an immediate consequence of the displayed relation (#.#).

2 Composite gradient mapping

The problem of finding a descent direction of a nonsmooth nonconvex function (or proving that no such direction exists) is one of the most difficult problems in Numerical Analysis. In order to see this, let us fix an arbitrary nonzero integer vector c ∈ Z₊ⁿ and denote γ = 2∑_{i=1}^n c^{(i)} ≥ 2. Consider the function

  φ(x) = max_{1≤i≤n} |x^{(i)}| − (1 + 1/γ) min_{1≤i≤n} |x^{(i)}| + |⟨c, x⟩|.  (2.1)

Clearly, φ is a piecewise-linear function with φ(0) = 0. Nevertheless, we have the following discouraging result.

Lemma 1. It is NP-hard to decide whether there exists x ∈ Rⁿ with φ(x) < 0.

Proof: Assume that some vector σ̄ ∈ Rⁿ with coefficients ±1 satisfies the equation ⟨c, σ̄⟩ = 0. Then, by (2.1), φ(σ̄) = −1/γ < 0.

Let us assume now that φ(x̄) < 0 for some x̄ ∈ Rⁿ. Since φ is positively homogeneous, we can always choose x̄ with max_{1≤i≤n} |x̄^{(i)}| = 1. Denote δ = |⟨c, x̄⟩|. In view of our assumption, we have by (2.1)

  |x̄^{(i)}| ≥ min_{1≤j≤n} |x̄^{(j)}| > γ(1 + δ)/(1 + γ), i = 1,…,n.

Denoting σ^{(i)} = sign x̄^{(i)}, i = 1,…,n, we can rewrite this inequality as σ^{(i)}x̄^{(i)} > γ(1 + δ)/(1 + γ). Therefore,

  |σ^{(i)} − x̄^{(i)}| = 1 − σ^{(i)}x̄^{(i)} < (1 − δγ)/(1 + γ),

and, taking into account that δ < 1/γ and γ ≥ 2, we conclude that

  |⟨c, σ⟩| ≤ |⟨c, x̄⟩| + |⟨c, σ − x̄⟩| ≤ δ + (γ/2) max_{1≤i≤n} |σ^{(i)} − x̄^{(i)}| < δ + (γ/2)·(1 − δγ)/(1 + γ) < 1.

Since c has integer coefficients, this is possible if and only if ⟨c, σ⟩ = 0. It is well known that verifying the solvability of the latter equation in Boolean variables is NP-hard (this is the standard Boolean Knapsack Problem). Thus, we have shown that finding a descent direction of the function φ is NP-hard. Considering now the function max{−1, φ(x)}, we can see that finding a (local) minimum of a nonsmooth nonconvex function is also NP-hard.

Thus, in sharp contrast to Smooth Minimization, for nonsmooth functions even a local improvement of the objective is difficult. Therefore, in this paper we restrict ourselves to objective functions of a very special structure. Namely, we consider the problem of minimizing

  φ(x) ≝ f(x) + Ψ(x)  (2.2)

over a convex set Q, where the function f is differentiable and the function Ψ is closed and convex on Q. For characterizing a solution of our problem, define the cone of feasible directions and the corresponding dual cone, which is called normal:

  F(y) = {u = τ(x − y) : x ∈ Q, τ ≥ 0} ⊆ E,  N(y) = {s ∈ E* : ⟨s, x − y⟩ ≥ 0 ∀x ∈ Q}, y ∈ Q.

Then the first-order necessary optimality condition at a point x* of local minimum can be written as follows:

  φ′(x*) ≝ ∇f(x*) + ξ* ∈ N(x*),  (2.3)

where ξ* ∈ ∂Ψ(x*). In other words,

  ⟨φ′(x*), u⟩ ≥ 0 ∀u ∈ F(x*).  (2.4)

Since Ψ is convex, the latter condition is equivalent to the following:

  Dφ(x*)[u] ≥ 0 ∀u ∈ F(x*).  (2.5)

Note that in the case of convex f, any of the conditions (2.3)–(2.5) is sufficient for the point x* to be a point of global minimum of the function φ over Q. The last variant of the first-order optimality conditions is the most convenient for defining an approximate solution of our problem.

Definition 1. The point x̄ ∈ Q satisfies the first-order optimality conditions of a local minimum of the function φ over the set Q with accuracy ε ≥ 0 if

  Dφ(x̄)[u] ≥ −ε ∀u ∈ F(x̄), ‖u‖ = 1.  (2.6)

Note that in the case F(x̄) = E with 0 ∉ ∇f(x̄) + ∂Ψ(x̄), this condition reduces to the following inequality:

  −ε ≤ min_{‖u‖=1} Dφ(x̄)[u] = min_{‖u‖=1} max_{ξ∈∂Ψ(x̄)} ⟨∇f(x̄) + ξ, u⟩ = max_{ξ∈∂Ψ(x̄)} min_{‖u‖=1} ⟨∇f(x̄) + ξ, u⟩ = −min_{ξ∈∂Ψ(x̄)} ‖∇f(x̄) + ξ‖*.

For finding a point x̄ satisfying condition (2.6), we use the composite gradient mapping. Namely, at any y ∈ Q define

  m_L(y; x) = f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖² + Ψ(x),  T_L(y) = arg min_{x∈Q} m_L(y; x),  (2.7)

where L is a positive constant. Recall that in the usual gradient mapping [8] we have Ψ(·) ≡ 0 (our modification is inspired by [10]). Then we can define a constrained analogue of the gradient direction of a smooth function, the vector

  g_L(y) = L·B(y − T_L(y)) ∈ E*,  (2.8)

where the operator B ≻ 0 defines the norm (1.3). (In case of ambiguity of the objective function, we use the notation g_L(y)[φ].) It is easy to see that for Q ≡ E and Ψ ≡ 0 we get g_L(y) = ∇φ(y) ≡ ∇f(y) for any L > 0. Our assumption on the simplicity of the function Ψ means exactly the feasibility of operation (2.7).

Let us mention the main properties of the composite gradient mapping. Almost all of them follow from the first-order optimality condition for problem (2.7):

  ⟨∇f(y) + L·B(T_L(y) − y) + ξ_L(y), x − T_L(y)⟩ ≥ 0 ∀x ∈ Q,  (2.9)

where ξ_L(y) ∈ ∂Ψ(T_L(y)). In what follows, we denote

  φ′(T_L(y)) = ∇f(T_L(y)) + ξ_L(y) ∈ ∂φ(T_L(y)).  (2.10)

We are going to show that this subgradient inherits all the important properties of the gradient of a smooth convex function. From now on, we assume that the first part of the objective function (2.2) has Lipschitz-continuous gradient:

  ‖∇f(x) − ∇f(y)‖* ≤ L_f ‖x − y‖, x, y ∈ Q.  (2.11)
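The following minimal sketch (our notation, with B = I assumed) shows how T_L(y) and g_L(y) of (2.7)–(2.8) are obtained from a prox oracle for Ψ:

```python
import numpy as np

# Sketch of the composite gradient mapping (2.7)-(2.8) with B = I.
# `prox` is assumed to solve min_x <g, x-y> + (L/2)*||x-y||^2 + Psi(x).
def composite_mapping(y, grad_f, prox, L):
    T = prox(y, grad_f(y), L)   # T_L(y), eq. (2.7)
    g = L * (y - T)             # g_L(y), eq. (2.8) with B = I
    return T, g
```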

From (2.11) and the convexity of Q, one can easily derive the following useful inequality (see, for example, [15]):

  |f(x) − f(y) − ⟨∇f(y), x − y⟩| ≤ (L_f/2)‖x − y‖², x, y ∈ Q.  (2.12)

First of all, let us estimate the local variation of the function φ. Denote

  S_L(y) = ‖∇f(T_L(y)) − ∇f(y)‖* / ‖T_L(y) − y‖ ≤ L_f.

Theorem 1. At any y ∈ Q,

  φ(y) − φ(T_L(y)) ≥ ((2L − L_f)/(2L²)) ‖g_L(y)‖*²,  (2.13)
  ⟨φ′(T_L(y)), y − T_L(y)⟩ ≥ ((L − L_f)/L²) ‖g_L(y)‖*².  (2.14)

Moreover, for any x ∈ Q, we have

  ⟨φ′(T_L(y)), x − T_L(y)⟩ ≥ −(1 + (1/L)S_L(y)) ‖g_L(y)‖* ‖T_L(y) − x‖ ≥ −(1 + L_f/L) ‖g_L(y)‖* ‖T_L(y) − x‖.  (2.15)

Proof: For the sake of notation, denote T = T_L(y) and ξ = ξ_L(y). Then, by (2.12) and by (2.9) with x = y,

  φ(T) ≤ f(y) + ⟨∇f(y), T − y⟩ + (L_f/2)‖T − y‖² + Ψ(T) ≤ f(y) + ⟨LB(T − y) + ξ, y − T⟩ + (L_f/2)‖T − y‖² + Ψ(T)
   = f(y) − (L − L_f/2)‖T − y‖² + Ψ(T) + ⟨ξ, y − T⟩ ≤ φ(y) − (L − L_f/2)‖T − y‖².

Taking into account the definition (2.8), we get (2.13). Further,

  ⟨∇f(T) + ξ, y − T⟩ = ⟨∇f(y) + ξ, y − T⟩ − ⟨∇f(y) − ∇f(T), y − T⟩ ≥^{(2.9), x=y} L⟨B(y − T), y − T⟩ − ⟨∇f(y) − ∇f(T), y − T⟩ ≥^{(2.11)} (L − L_f)‖T − y‖² =^{(2.8)} ((L − L_f)/L²) ‖g_L(y)‖*².

Thus, we get (2.14). Finally,

  ⟨∇f(T) + ξ, T − x⟩ ≤^{(2.9)} ⟨∇f(T), T − x⟩ + ⟨∇f(y) + LB(T − y), x − T⟩ = ⟨∇f(T) − ∇f(y), T − x⟩ + ⟨g_L(y), T − x⟩ ≤^{(2.8)} (1 + (1/L)S_L(y)) ‖g_L(y)‖* ‖T − x‖,

and (2.15) follows.

Corollary 1. For any y ∈ Q and any u ∈ F(T_L(y)) with ‖u‖ = 1, we have

  Dφ(T_L(y))[u] ≥ −(1 + L_f/L) ‖g_L(y)‖*.  (2.16)

In this respect, it is interesting to investigate the dependence of ‖g_L(y)‖* on L.

Lemma 2. The norm of the gradient direction, ‖g_L(y)‖*, is increasing in L, and the norm of the step, ‖T_L(y) − y‖, is decreasing in L.

Proof: Consider the function

  ω(τ) = min_{x∈Q} [ f(y) + ⟨∇f(y), x − y⟩ + (1/(2τ))‖x − y‖² + Ψ(x) ].

The objective function of this minimization problem is jointly convex in x and τ. Therefore, ω(τ) is convex in τ. Since the minimum in this problem is attained at a single point, ω(τ) is differentiable and

  ω′(τ) = −(1/(2τ²)) ‖T_{1/τ}(y) − y‖² = −½ ‖g_{1/τ}(y)‖*².

Since ω(·) is convex, ω′(τ) is an increasing function of τ. Hence, ‖g_{1/τ}(y)‖* is a decreasing function of τ, that is, ‖g_L(y)‖* is increasing in L. The second statement follows from the concavity of the function

  ω̂(L) = min_{x∈Q} [ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖² + Ψ(x) ],

whose derivative ω̂′(L) = ½‖T_L(y) − y‖² must therefore be decreasing in L.

Now let us look at the output of the composite gradient mapping from a global perspective.

Theorem 2. For any y ∈ Q, we have

  m_L(y; T_L(y)) ≤ φ(y) − (1/(2L)) ‖g_L(y)‖*²,  (2.17)
  m_L(y; T_L(y)) ≤ min_{x∈Q} [ φ(x) + ((L + L_f)/2)‖x − y‖² ].  (2.18)

If the function f is convex, then

  m_L(y; T_L(y)) ≤ min_{x∈Q} [ φ(x) + (L/2)‖x − y‖² ].  (2.19)

Proof: Note that the function m_L(y; x) is strongly convex in x with convexity parameter L. Hence,

  φ(y) − m_L(y; T_L(y)) = m_L(y; y) − m_L(y; T_L(y)) ≥ (L/2)‖y − T_L(y)‖² = (1/(2L)) ‖g_L(y)‖*²,

which gives (2.17). Further, if f is convex, then

  m_L(y; T_L(y)) = min_{x∈Q} [ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖² + Ψ(x) ] ≤ min_{x∈Q} [ f(x) + Ψ(x) + (L/2)‖x − y‖² ] = min_{x∈Q} [ φ(x) + (L/2)‖x − y‖² ],

which gives (2.19). For nonconvex f, we can plug into the same reasoning the following consequence of (2.12):

  f(y) + ⟨∇f(y), x − y⟩ ≤ f(x) + (L_f/2)‖x − y‖²,

which gives (2.18).

Remark 1. In view of (2.12), for L ≥ L_f we have

  φ(T_L(y)) ≤ m_L(y; T_L(y)).  (2.20)

Hence, in this case inequality (2.19) guarantees

  φ(T_L(y)) ≤ min_{x∈Q} [ φ(x) + (L/2)‖x − y‖² ].  (2.21)

Finally, let us prove a useful inequality for strongly convex φ.

Lemma 3. Let the function φ be strongly convex with convexity parameter μ_φ > 0. Then for any y ∈ Q we have

  ‖T_L(y) − x*‖ ≤ (1/μ_φ)(1 + (1/L)S_L(y)) ‖g_L(y)‖* ≤ (1/μ_φ)(1 + L_f/L) ‖g_L(y)‖*,  (2.22)

where x* is the unique minimizer of φ on Q.

Proof: In view of inequality (2.15), we have

  (1 + (1/L)S_L(y)) ‖g_L(y)‖* ‖T_L(y) − x*‖ ≥ ⟨φ′(T_L(y)), T_L(y) − x*⟩ ≥ μ_φ ‖T_L(y) − x*‖²,

and (2.22) follows.

Now we are ready to analyze different optimization schemes based on the composite gradient mapping. In the next section, we describe the simplest of them.

3 Gradient method

Let us define first the gradient iteration with the simplest backtracking strategy for the line-search parameter (we call its termination condition the full relaxation).

Gradient Iteration G(x, M):

  Set: L := M.
  Repeat: T := T_L(x); if φ(T) > m_L(x; T), then L := L·γ_u.  (3.1)
  Until: φ(T) ≤ m_L(x; T).
  Output: G(x, M).T = T, G(x, M).L = L, G(x, M).S = S_L(x).

If there is an ambiguity in the objective function, we use the notation G_φ(x, M).

For running the gradient scheme, we need to choose an initial optimistic estimate L₀ for the Lipschitz constant L_f:

  0 < L₀ ≤ L_f,  (3.2)

and two adjustment parameters γ_u > 1 and γ_d ≥ 1. Let y₀ ∈ Q be our starting point. For k ≥ 0, consider the following iterative process.

Gradient Method GM(y₀, L₀):

  y_{k+1} = G(y_k, L_k).T, M_k = G(y_k, L_k).L, L_{k+1} = max{L₀, M_k/γ_d}.  (3.3)

Thus, y_{k+1} = T_{M_k}(y_k). Since the function f satisfies inequality (2.12), in the loop (3.1) the value L can keep increasing only while L ≤ L_f. Taking into account condition (3.2), we obtain the following bounds:

  L₀ ≤ L_k ≤ M_k ≤ γ_u L_f.  (3.4)
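A hedged sketch of the scheme (3.1)–(3.3) follows (it assumes a prox oracle for Ψ and B = I; the names are ours, and a robust implementation would add its own stopping tests):

```python
import numpy as np

def m_L(y, fy, gy, T, psi, L):
    # model m_L(y; T) from (2.7)
    return fy + gy @ (T - y) + 0.5 * L * np.dot(T - y, T - y) + psi(T)

# Sketch of the Gradient Method (3.3) with the backtracking iteration (3.1).
# f, grad_f, psi, prox are assumed oracles; prox solves the subproblem (2.7).
def gradient_method(y0, L0, f, grad_f, psi, prox, n_iters, gu=2.0, gd=2.0):
    y, L = y0, L0
    for _ in range(n_iters):
        fy, gy = f(y), grad_f(y)
        while True:                       # loop (3.1): full relaxation
            T = prox(y, gy, L)            # T = T_L(y)
            if f(T) + psi(T) <= m_L(y, fy, gy, T, psi, L):
                break
            L *= gu
        y, L = T, max(L0, L / gd)         # update (3.3)
    return y
```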

Moreover, if γ_d ≥ γ_u, then

  L_k ≤ L_f, k ≥ 1.  (3.5)

Note that in (3.1) there is no explicit bound on the number of repetitions of the loop. However, it is easy to see that the total number of calls of the oracle, N_k, after k iterations of (3.3) cannot be too big.

Lemma 4. In the method (3.3), for any k ≥ 0 we have

  N_k ≤ [1 + ln γ_d/ln γ_u](k + 1) + (1/ln γ_u)( ln(γ_u L_f/(γ_d L₀)) )₊.  (3.6)

Proof: Denote by n_i ≥ 1 the number of calls of the oracle at iteration i ≥ 0. Then

  L_{i+1} ≥ γ_d⁻¹ L_i γ_u^{n_i − 1}.

Thus,

  n_i ≤ 1 + ln γ_d/ln γ_u + (1/ln γ_u) ln(L_{i+1}/L_i).

Hence, we can estimate

  N_k = ∑_{i=0}^k n_i ≤ [1 + ln γ_d/ln γ_u](k + 1) + (1/ln γ_u) ln(L_{k+1}/L₀).

It remains to note that L_{k+1} ≤ max{L₀, (γ_u/γ_d)L_f}.

A reasonable choice of the adjustment parameters is as follows:

  γ_u = γ_d = 2 ⟹ N_k ≤^{(3.6)} 2(k + 1) + log₂(L_f/L₀), L_k ≤^{(3.5)} L_f.  (3.7)

Thus, the performance of the Gradient Method (3.3) is well described by estimates in terms of the iteration counter. Therefore, in the rest of this section we focus on estimating the rate of convergence of this method in different situations.

Let us start with the general nonconvex case. Denote

  δ_k = min_{0≤i≤k} (1/M_i) ‖g_{M_i}(y_i)‖*²,  i_k = 1 + arg min_{0≤i≤k} (1/M_i) ‖g_{M_i}(y_i)‖*².

Theorem 3. Let the function φ be bounded below on Q by some constant φ*. Then

  δ_k ≤ 2(φ(y₀) − φ*)/(k + 1).  (3.8)

Moreover, for any u ∈ F(y_{i_k}) with ‖u‖ = 1 we have

  Dφ(y_{i_k})[u] ≥ −((1 + γ_u)L_f/L₀^{1/2}) (2(φ(y₀) − φ*)/(k + 1))^{1/2}.  (3.9)

Proof: In view of the termination criterion in (3.1), we have

  φ(y_i) − φ(y_{i+1}) ≥ φ(y_i) − m_{M_i}(y_i; T_{M_i}(y_i)) ≥^{(2.17)} (1/(2M_i)) ‖g_{M_i}(y_i)‖*².

Summing up these inequalities for i = 0,…,k, we obtain (3.8). Denote j_k = i_k − 1. Since y_{i_k} = T_{M_{j_k}}(y_{j_k}), for any u ∈ F(y_{i_k}) with ‖u‖ = 1 we have

  Dφ(y_{i_k})[u] ≥^{(2.16)} −(1 + L_f/M_{j_k}) ‖g_{M_{j_k}}(y_{j_k})‖* = −(1 + L_f/M_{j_k})(M_{j_k} δ_k)^{1/2}
   ≥^{(3.8)} −((M_{j_k} + L_f)/M_{j_k}^{1/2}) (2(φ(y₀) − φ*)/(k + 1))^{1/2} ≥^{(3.4)} −((1 + γ_u)L_f/L₀^{1/2}) (2(φ(y₀) − φ*)/(k + 1))^{1/2}.

Let us now describe the behavior of the Gradient Method (3.3) in the convex case.

Theorem 4. Let the function f be convex on Q. Assume that it attains its minimum on Q at a point x*, and that the level sets of φ are bounded:

  ‖y − x*‖ ≤ R ∀y ∈ Q with φ(y) ≤ φ(y₀).  (3.10)

If φ(y₀) − φ(x*) ≥ γ_u L_f R², then φ(y₁) − φ(x*) ≤ (γ_u L_f R²)/2. Otherwise, for any k ≥ 0 we have

  φ(y_k) − φ(x*) ≤ 2γ_u L_f R²/(k + 2).  (3.11)

Moreover, for any u ∈ F(y_{i_k}) with ‖u‖ = 1 we have

  Dφ(y_{i_k})[u] ≥ −4(1 + γ_u) L_f R (γ_u L_f/L₀)^{1/2} / ((k + 2)(k + 4))^{1/2}.  (3.12)

Proof: Since φ(y_{k+1}) ≤ φ(y_k) for all k ≥ 0, the bound ‖y_k − x*‖ ≤ R is valid for all generated points. Consider y_k(α) = αx* + (1 − α)y_k ∈ Q, α ∈ [0, 1]. Then

  φ(y_{k+1}) ≤ m_{M_k}(y_k; T_{M_k}(y_k)) ≤^{(2.19)} min_{y∈Q} [ φ(y) + (M_k/2)‖y − y_k‖² ]
   ≤ min_{0≤α≤1} [ φ(αx* + (1 − α)y_k) + (M_k/2)α²‖y_k − x*‖² ] ≤^{(3.4)} min_{0≤α≤1} [ φ(y_k) − α(φ(y_k) − φ(x*)) + (γ_u L_f R²/2)α² ].

If φ(y₀) − φ(x*) ≥ γ_u L_f R², then the optimal solution of the latter optimization problem is α = 1, and we get φ(y₁) − φ(x*) ≤ (γ_u L_f R²)/2. Otherwise, the optimal solution is

  α = (φ(y_k) − φ(x*))/(γ_u L_f R²) ≤ (φ(y₀) − φ(x*))/(γ_u L_f R²) ≤ 1,

and we obtain

  φ(y_{k+1}) ≤ φ(y_k) − (φ(y_k) − φ(x*))²/(2γ_u L_f R²).  (3.13)

From this inequality, denoting λ_k = 1/(φ(y_k) − φ(x*)), we get

  λ_{k+1} ≥ λ_k + (λ_k/λ_{k+1})·(1/(2γ_u L_f R²)) ≥ λ_k + 1/(2γ_u L_f R²).

Hence, for k ≥ 0 we have

  λ_k ≥ 1/(φ(y₀) − φ(x*)) + k/(2γ_u L_f R²) ≥ (k + 2)/(2γ_u L_f R²),

which gives (3.11). Further, let us fix an integer m, 0 < m < k. Since

  φ(y_i) − φ(y_{i+1}) ≥ (1/(2M_i)) ‖g_{M_i}(y_i)‖*², i = 0,…,k,

we have

  ((k − m + 1)/2) δ_k ≤ ∑_{i=m}^k (1/(2M_i)) ‖g_{M_i}(y_i)‖*² ≤ φ(y_m) − φ(y_{k+1}) ≤ φ(y_m) − φ(x*) ≤^{(3.11)} 2γ_u L_f R²/(m + 2).

Denote j_k = i_k − 1. Then, for any u ∈ F(y_{i_k}) with ‖u‖ = 1, we have

  Dφ(y_{i_k})[u] ≥^{(2.16)} −(1 + L_f/M_{j_k}) ‖g_{M_{j_k}}(y_{j_k})‖* = −(1 + L_f/M_{j_k})(M_{j_k} δ_k)^{1/2}
   ≥ −((M_{j_k} + L_f)/M_{j_k}^{1/2}) (4γ_u L_f R²/((m + 2)(k + 1 − m)))^{1/2} ≥^{(3.4)} −(1 + γ_u) L_f R (4γ_u L_f/(L₀(m + 2)(k + 1 − m)))^{1/2}.

Choosing m = ⌊k/2⌋, we get (m + 2)(k + 1 − m) ≥ ¼(k + 2)(k + 4), and (3.12) follows.

Theorem 5. Let the function φ be strongly convex on Q with convexity parameter μ_φ. If μ_φ ≥ 2γ_u L_f, then for any k ≥ 0 we have

  φ(y_k) − φ(x*) ≤ (γ_u L_f/μ_φ)^k (φ(y₀) − φ(x*)) ≤ 2^{−k} (φ(y₀) − φ(x*)).  (3.14)

Otherwise,

  φ(y_k) − φ(x*) ≤ (1 − μ_φ/(4γ_u L_f))^k (φ(y₀) − φ(x*)).  (3.15)

Proof: Since φ is strongly convex, for any k ≥ 0 we have

  φ(y_k) − φ(x*) ≥ (μ_φ/2) ‖y_k − x*‖².  (3.16)

Denote y_k(α) = αx* + (1 − α)y_k ∈ Q, α ∈ [0, 1]. Then, as in the proof of Theorem 4,

  φ(y_{k+1}) ≤ min_{0≤α≤1} [ φ(αx* + (1 − α)y_k) + (M_k/2)α²‖y_k − x*‖² ]
   ≤^{(3.4)} min_{0≤α≤1} [ φ(y_k) − α(φ(y_k) − φ(x*)) + (γ_u L_f/2)α²‖y_k − x*‖² ]
   ≤^{(3.16)} min_{0≤α≤1} [ φ(y_k) − α(1 − α γ_u L_f/μ_φ)(φ(y_k) − φ(x*)) ].

The minimum of the last expression is achieved for α* = min{1, μ_φ/(2γ_u L_f)}. Hence, if μ_φ ≥ 2γ_u L_f, then α* = 1 and we get

  φ(y_{k+1}) − φ(x*) ≤ (γ_u L_f/μ_φ)(φ(y_k) − φ(x*)) ≤ ½(φ(y_k) − φ(x*)).

If μ_φ ≤ 2γ_u L_f, then α* = μ_φ/(2γ_u L_f) and

  φ(y_{k+1}) − φ(x*) ≤ (1 − μ_φ/(4γ_u L_f))(φ(y_k) − φ(x*)).

Remark 2. 1) In Theorem 5, the condition number L_f/μ_φ can be smaller than one.

2) For strongly convex φ, bounds on the directional derivatives can be obtained by combining the inequalities (3.14) and (3.15) with the estimate

  φ(y_k) − φ(x*) ≥^{(2.13), L=L_f} (1/(2L_f)) ‖g_{L_f}(y_k)‖*²

and inequality (2.16). Thus, inequality (3.14) results in the bound

  Dφ(y_{k+1})[u] ≥ −2 (γ_u L_f/μ_φ)^{k/2} (2L_f(φ(y₀) − φ*))^{1/2},  (3.17)

and inequality (3.15) leads to the bound

  Dφ(y_{k+1})[u] ≥ −2 (1 − μ_φ/(4γ_u L_f))^{k/2} (2L_f(φ(y₀) − φ*))^{1/2},  (3.18)

both valid for all u ∈ F(y_{k+1}) with ‖u‖ = 1.

3) The estimate (3.6) may create an impression that a large value of γ_u can reduce the total number of calls of the oracle. This is not true, since γ_u enters also the estimates of the rate of convergence of the methods (e.g. (3.11)). Therefore, reasonable values of this parameter lie in the interval [2, 3].

4 Accelerated scheme

In the previous section we have seen that, for convex f, the gradient method (3.3) converges as O(1/k). However, it is well known that on convex problems the usual gradient scheme can be accelerated (see, e.g., Chapter 2 in [9]). Let us show that the same acceleration can be achieved for composite objective functions.

Consider the problem

  min_{x∈E} [ φ(x) = f(x) + Ψ(x) ],  (4.1)

where the function f is convex and satisfies (2.11), and the function Ψ is closed and strongly convex on E with convexity parameter μ_Ψ ≥ 0. We assume this parameter to be known; the case μ_Ψ = 0 corresponds to convex Ψ. Denote by x* the optimal solution of (4.1).

In problem (4.1), we allow dom Ψ ≠ E. Therefore, the formulation (4.1) covers also constrained problem instances. Note that for (4.1), the first-order optimality condition (2.9) defining the composite gradient mapping can be written in a simpler form:

  T_L(y) ∈ dom Ψ,  ∇f(y) + ξ_L(y) = L·B(y − T_L(y)) ≡ g_L(y),  (4.2)

where ξ_L(y) ∈ ∂Ψ(T_L(y)).

For justifying the rate of convergence of different schemes as applied to (4.1), we use the machinery of estimate functions in its newer variant [13]. Taking into account the special form of the objective in (4.1), we update recursively the following sequences:

- a minimizing sequence {x_k}_{k=0}^∞;
- a sequence of increasing scaling coefficients {A_k}_{k=0}^∞, with A₀ = 0 and A_k ≝ A_{k−1} + a_k for k ≥ 1;
- a sequence of estimate functions

  ψ_k(x) = ℓ_k(x) + A_k Ψ(x) + ½‖x − x₀‖², k ≥ 0,  (4.3)

where x₀ ∈ dom Ψ is our starting point and the ℓ_k(x) are linear functions in x ∈ E. However, as compared with [13], we add the possibility to update the estimates of the Lipschitz constant L_f, using an initial guess L₀ satisfying (3.2) and two adjustment parameters γ_u > 1 and γ_d ≥ 1.

For the above objects, we maintain recursively the following relations:

  R¹_k: A_k φ(x_k) ≤ ψ*_k ≡ min_{x∈E} ψ_k(x);
  R²_k: ψ_k(x) ≤ A_k φ(x) + ½‖x − x₀‖² ∀x ∈ E;  k ≥ 0.  (4.4)

These relations clearly justify the following rate of convergence of the minimizing sequence:

  φ(x_k) − φ(x*) ≤ ‖x* − x₀‖²/(2A_k), k ≥ 1.  (4.5)

Denote v_k = arg min_{x∈E} ψ_k(x). Since the convexity parameter of ψ_k is at least 1, for any x ∈ E we have

  A_k φ(x_k) + ½‖x − v_k‖² ≤ ψ*_k + ½‖x − v_k‖² ≤ ψ_k(x) ≤ A_k φ(x) + ½‖x − x₀‖²,

where the first and last inequalities follow from R¹_k and R²_k. Hence, taking x = x*, we get two useful consequences of (4.4):

  ‖x* − v_k‖ ≤ ‖x* − x₀‖,  ‖v_k − x₀‖ ≤ 2‖x* − x₀‖, k ≥ 0.  (4.6)

Note that the relations (4.4) can be used for justifying the rate of convergence of a dual variant of the gradient method (3.3). Indeed, for v₀ ∈ dom Ψ define ψ₀(x) = ½‖x − v₀‖², and choose L₀ satisfying condition (3.2).

Dual Gradient Method DG(v₀, L₀). For k ≥ 0:

  y_k = G(v_k, L_k).T, M_k = G(v_k, L_k).L,
  L_{k+1} = max{L₀, M_k/γ_d}, a_{k+1} = 1/M_k,  (4.7)
  ψ_{k+1}(x) = ψ_k(x) + (1/M_k)[ f(v_k) + ⟨∇f(v_k), x − v_k⟩ + Ψ(x) ].

In this scheme, the operation G is defined by (3.1). Since Ψ is simple, the points v_k are easily computable. Note that relation R¹₀ and relations R²_k, k ≥ 0, are trivial. Relations R¹_k can be justified by induction. Define x₀ = y₀, φ*_k = min_{0≤i≤k−1} φ(y_i), and let x_k be such that φ(x_k) = φ*_k for k ≥ 1. Then

  ψ*_{k+1} = min_{x∈E} { ψ_k(x) + (1/M_k)[ f(v_k) + ⟨∇f(v_k), x − v_k⟩ + Ψ(x) ] }
   ≥^{R¹_k} min_{x∈E} { A_k φ*_k + ½‖x − v_k‖² + (1/M_k)[ f(v_k) + ⟨∇f(v_k), x − v_k⟩ + Ψ(x) ] }
   =^{(2.7)} A_k φ*_k + a_{k+1} m_{M_k}(v_k; y_k) ≥^{(3.1)} A_k φ*_k + a_{k+1} φ(y_k) ≥ A_{k+1} φ*_{k+1}.

Thus, the relations R¹_k are valid for all k ≥ 0. Since the values M_k satisfy the bounds (3.4), for method (4.7) we obtain the following rate of convergence:

  φ(x_k) − φ(x*) ≤ (γ_u L_f/(2k)) ‖x* − v₀‖², k ≥ 1.  (4.8)

Note that the constant in the right-hand side of this inequality is four times smaller than the constant in (3.11). However, each iteration of the dual method is two times more expensive than an iteration of the primal version (3.3).
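For the setting of Section 6 (Ψ = ‖·‖₁, B = I, v₀ = z₀ = 0), the following hedged sketch (ours) implements (4.7); the estimate function ψ_k is stored through its aggregated linear term:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Sketch of the Dual Gradient Method (4.7) for Psi(x) = ||x||_1 and B = I
# (the setting of Section 6; names are ours). psi_k is stored via its
# aggregated linear term s = sum_i a_i * grad_f(v_i) and A = sum_i a_i,
# so that v_k = argmin psi_k = soft_threshold(v0 - s, A).
def dual_gradient_method(v0, L0, f, grad_f, n_iters, gu=2.0, gd=2.0):
    v, L = v0.copy(), L0
    s, A = np.zeros_like(v0), 0.0
    x_best, phi_best = v0.copy(), np.inf
    for _ in range(n_iters):
        fv, gv = f(v), grad_f(v)
        while True:                                   # operation G of (3.1)
            y = soft_threshold(v - gv / L, 1.0 / L)   # y_k = T_L(v_k)
            if f(y) <= fv + gv @ (y - v) + 0.5 * L * np.dot(y - v, y - v):
                break                                 # full relaxation holds
            L *= gu
        phi_y = f(y) + np.abs(y).sum()
        if phi_y < phi_best:
            x_best, phi_best = y, phi_y               # x_k: best point so far
        a = 1.0 / L                                   # a_{k+1} = 1 / M_k
        s, A = s + a * gv, A + a                      # update psi_{k+1}
        v = soft_threshold(v0 - s, A)                 # v_{k+1}
        L = max(L0, L / gd)                           # L_{k+1}
    return x_best
```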

However, the method (4.7) does not implement the best way of using the machinery of estimate functions. Let us look at an accelerated version of (4.7). As parameters, it has the starting point x₀ ∈ dom Ψ, a lower estimate L₀ > 0 for the Lipschitz constant L_f, and a lower estimate μ ∈ [0, μ_Ψ] for the convexity parameter of the function Ψ.

Accelerated Method A(x₀, L₀, μ)

Initial settings: ψ₀(x) = ½‖x − x₀‖², A₀ = 0.

Iteration k ≥ 0:

  Set: L := L_k.
  Repeat:
    Find a from the quadratic equation a²/(A_k + a) = 2(1 + μA_k)/L. (*)
    Set y = (A_k x_k + a v_k)/(A_k + a), and compute T_L(y).
    If ⟨φ′(T_L(y)), y − T_L(y)⟩ < (1/L)‖φ′(T_L(y))‖*², then L := L·γ_u.
  Until: ⟨φ′(T_L(y)), y − T_L(y)⟩ ≥ (1/L)‖φ′(T_L(y))‖*². (**)  (4.9)
  Define: y_k := y, M_k := L, a_{k+1} := a, L_{k+1} := M_k/γ_d, x_{k+1} := T_{M_k}(y_k),
  ψ_{k+1}(x) := ψ_k(x) + a_{k+1}[ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ].

As compared with the Gradient Iteration (3.1), we use the damped relaxation condition (**) as a stopping criterion of the internal cycle of (4.9).
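A hedged sketch of A(x₀, L₀, μ) for the convex case μ = 0, with Ψ = ‖·‖₁ and B = I (our assumptions; a production version would also guard the corner cases of the line search):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Sketch of the accelerated method (4.9) for mu = 0, Psi(x) = ||x||_1, B = I.
# As in the dual method, psi_k is stored via its aggregated linear term.
def accelerated_method(x0, L0, f, grad_f, n_iters, gu=2.0, gd=2.0):
    x, v, L = x0.copy(), x0.copy(), L0
    s, A = np.zeros_like(x0), 0.0
    for _ in range(n_iters):
        while True:
            # positive root of (*): L*a^2 = 2*(A + a) when mu = 0
            a = (1.0 + np.sqrt(1.0 + 2.0 * L * A)) / L
            y = (A * x + a * v) / (A + a)
            T = soft_threshold(y - grad_f(y) / L, 1.0 / L)   # T_L(y)
            phi_p = L * (y - T) + grad_f(T) - grad_f(y)      # phi'(T), cf. (4.10)
            if phi_p @ (y - T) >= np.dot(phi_p, phi_p) / L:  # condition (**)
                break
            L *= gu
        x = T                                                # x_{k+1}
        s, A = s + a * grad_f(x), A + a                      # update psi_{k+1}
        v = soft_threshold(x0 - s, A)                        # v_{k+1}
        L = L / gd                                           # L_{k+1} = M_k / gd
    return x
```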

Lemma 5. Condition (**) in (4.9) is satisfied for any L ≥ L_f.

Proof: Denote T = T_L(y). Multiplying the representation

  φ′(T) = ∇f(T) + ξ_L(y) =^{(4.2)} L·B(y − T) + ∇f(T) − ∇f(y)  (4.10)

by the vector y − T, we obtain

  ⟨φ′(T), y − T⟩ = L‖y − T‖² − ⟨∇f(y) − ∇f(T), y − T⟩
   = (1/L)[ ‖φ′(T)‖*² + 2L⟨∇f(y) − ∇f(T), y − T⟩ − ‖∇f(y) − ∇f(T)‖*² ] − ⟨∇f(y) − ∇f(T), y − T⟩
   = (1/L)‖φ′(T)‖*² + ⟨∇f(y) − ∇f(T), y − T⟩ − (1/L)‖∇f(y) − ∇f(T)‖*².

Hence, for L ≥ L_f, condition (**) is satisfied, since for convex f with Lipschitz-continuous gradient

  ⟨∇f(y) − ∇f(T), y − T⟩ ≥ (1/L_f)‖∇f(y) − ∇f(T)‖*² ≥ (1/L)‖∇f(y) − ∇f(T)‖*².

Thus, we can always guarantee

  L_k ≤ M_k ≤ γ_u L_f.  (4.11)

If γ_d ≥ γ_u, then the upper bound (3.5) remains valid. Let us establish a relation between the total number of calls of the oracle N_k after k iterations and the value of the iteration counter.

Lemma 6. In the method (4.9), for any k ≥ 0 we have

  N_k ≤ 2[1 + ln γ_d/ln γ_u](k + 1) + (2/ln γ_u) ln(γ_u L_f/(γ_d L₀)).  (4.12)

Proof: Denote by n_i ≥ 2 the number of calls of the oracle at iteration i ≥ 0. At each cycle of the internal loop we call the oracle twice, for computing ∇f(y) and ∇f(T_L(y)). Therefore,

  L_{i+1} = γ_d⁻¹ L_i γ_u^{n_i/2 − 1}.

Thus,

  n_i = 2[ 1 + ln γ_d/ln γ_u + (1/ln γ_u) ln(L_{i+1}/L_i) ].

Hence,

  N_k = ∑_{i=0}^k n_i = 2[1 + ln γ_d/ln γ_u](k + 1) + (2/ln γ_u) ln(L_{k+1}/L₀).

It remains to note that L_{k+1} ≤ (γ_u/γ_d)L_f.

Thus, each iteration of (4.9) needs approximately two times more calls of the oracle than one iteration of the Gradient Method:

  γ_u = γ_d = 2 ⟹ N_k ≤ 4(k + 1) + 2 log₂(L_f/L₀), L_k ≤ L_f.  (4.13)

However, we will see that the rate of convergence of (4.9) is much higher. Let us start with two auxiliary statements.

Lemma 7. Assume μ_Ψ ≥ μ. Then the sequences {x_k}, {A_k} and {ψ_k} generated by the method A(x₀, L₀, μ) satisfy the relations (4.4) for all k ≥ 0.

Proof: In view of the initial settings of (4.9), A₀ = 0 and ψ*₀ = 0. Hence, for k = 0 both relations (4.4) are trivial. Assume now that the relations R¹_k, R²_k are valid for some k ≥ 0. In view of R²_k, for any x ∈ E we have

  ψ_{k+1}(x) ≤ A_k φ(x) + ½‖x − x₀‖² + a_{k+1}[ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ] ≤ (A_k + a_{k+1}) φ(x) + ½‖x − x₀‖²,

and this is R²_{k+1}. Let us show that relation R¹_{k+1} is also valid. In view of (4.3), the function ψ_k(x) is strongly convex with convexity parameter 1 + μA_k. Hence, in view of R¹_k, for any x ∈ E we have

  ψ_k(x) ≥ ψ*_k + ((1 + μA_k)/2)‖x − v_k‖² ≥ A_k φ(x_k) + ((1 + μA_k)/2)‖x − v_k‖².  (4.14)

Therefore,

  ψ*_{k+1} = min_{x∈E} { ψ_k(x) + a_{k+1}[ f(x_{k+1}) + ⟨∇f(x_{k+1}), x − x_{k+1}⟩ + Ψ(x) ] }
   ≥^{(4.14)} min_{x∈E} { A_k φ(x_k) + ((1 + μA_k)/2)‖x − v_k‖² + a_{k+1} φ(x_{k+1}) + a_{k+1}⟨φ′(x_{k+1}), x − x_{k+1}⟩ }
   ≥ min_{x∈E} { (A_k + a_{k+1}) φ(x_{k+1}) + A_k⟨φ′(x_{k+1}), x_k − x_{k+1}⟩ + a_{k+1}⟨φ′(x_{k+1}), x − x_{k+1}⟩ + ((1 + μA_k)/2)‖x − v_k‖² }
   =^{(4.9)} min_{x∈E} { A_{k+1} φ(x_{k+1}) + ⟨φ′(x_{k+1}), A_{k+1} y_k − a_{k+1} v_k − A_k x_{k+1}⟩ + a_{k+1}⟨φ′(x_{k+1}), x − x_{k+1}⟩ + ((1 + μA_k)/2)‖x − v_k‖² }
   = min_{x∈E} { A_{k+1} φ(x_{k+1}) + A_{k+1}⟨φ′(x_{k+1}), y_k − x_{k+1}⟩ + a_{k+1}⟨φ′(x_{k+1}), x − v_k⟩ + ((1 + μA_k)/2)‖x − v_k‖² }.

The minimum of the latter problem is attained at x = v_k − (a_{k+1}/(1 + μA_k)) B⁻¹φ′(x_{k+1}). Thus, we have proved the inequality

  ψ*_{k+1} ≥ A_{k+1} φ(x_{k+1}) + A_{k+1}⟨φ′(x_{k+1}), y_k − x_{k+1}⟩ − (a²_{k+1}/(2(1 + μA_k))) ‖φ′(x_{k+1})‖*².

On the other hand, by the termination criterion (**) in (4.9), we have

  ⟨φ′(x_{k+1}), y_k − x_{k+1}⟩ ≥ (1/M_k) ‖φ′(x_{k+1})‖*².

It remains to note that in (4.9) we choose a_{k+1} from the quadratic equation

  A_{k+1} ≡ A_k + a_{k+1} = (M_k a²_{k+1})/(2(1 + μA_k)).

Thus, ψ*_{k+1} ≥ A_{k+1} φ(x_{k+1}), and R¹_{k+1} is valid.

Thus, in order to use inequality (4.5) for deriving the rate of convergence of the method A(x₀, L₀, μ), we need to estimate the rate of growth of the scaling coefficients {A_k}.

Lemma 8. For any μ ≥ 0, the scaling coefficients grow as follows:

  A_k ≥ k²/(2γ_u L_f), k ≥ 0.  (4.15)

For μ > 0, the rate of growth is linear:

  A_k ≥ (2/(γ_u L_f)) [1 + (μ/(2γ_u L_f))^{1/2}]^{2(k−1)}, k ≥ 1.  (4.16)

Proof: In view of equation (*) in (4.9), we have

  2A_{k+1}(1 + μA_k) = M_k (A_{k+1} − A_k)² = M_k (A_{k+1}^{1/2} − A_k^{1/2})² (A_{k+1}^{1/2} + A_k^{1/2})² ≤ 4A_{k+1} M_k (A_{k+1}^{1/2} − A_k^{1/2})² ≤^{(4.11)} 4A_{k+1} γ_u L_f (A_{k+1}^{1/2} − A_k^{1/2})².

Thus, for any k ≥ 0 we get A_{k+1}^{1/2} ≥ A_k^{1/2} + (2γ_u L_f)^{−1/2}, and hence A_k^{1/2} ≥ k/(2γ_u L_f)^{1/2}. If μ > 0, then, by the same reasoning as above,

  2μ A_k A_{k+1} < 2A_{k+1}(1 + μA_k) ≤ 4A_{k+1} γ_u L_f (A_{k+1}^{1/2} − A_k^{1/2})².

Hence, A_{k+1}^{1/2} ≥ A_k^{1/2} [1 + (μ/(2γ_u L_f))^{1/2}]. Since A₁ = 2/M₀ ≥ 2/(γ_u L_f), we come to (4.16).

Now we can summarize all our observations.

Theorem 6. Let the gradient of the function f be Lipschitz continuous with constant L_f, and let the parameter L₀ satisfy condition (3.2). Then the rate of convergence of the method A(x₀, L₀, 0) as applied to problem (4.1) can be estimated as follows:

  φ(x_k) − φ(x*) ≤ (γ_u L_f ‖x* − x₀‖²)/k², k ≥ 1.  (4.17)

If, in addition, the function Ψ is strongly convex, then the sequence {x_k}_{k≥1} generated by A(x₀, L₀, μ_Ψ) satisfies both (4.17) and the inequality

  φ(x_k) − φ(x*) ≤ (γ_u L_f/4) ‖x* − x₀‖² [1 + (μ_Ψ/(2γ_u L_f))^{1/2}]^{−2(k−1)}, k ≥ 1.  (4.18)

In the next section we show how to apply this result in order to achieve some specific goals for different optimization problems.

5 Different minimization strategies

5.1 Strongly convex objective with known parameter

Consider the following convex constrained minimization problem:

  min_{x∈Q} f̂(x),  (5.1)

where Q is a closed convex set, and f̂ is a strongly convex function with Lipschitz-continuous gradient. Assume the convexity parameter μ_f̂ to be known. Denote by σ_Q(x) the indicator function of the set Q:

  σ_Q(x) = 0 if x ∈ Q, and +∞ otherwise.

We can solve problem (5.1) by two different techniques.

1. Reallocating the prox-term in the objective. For μ ∈ (0, μ_f̂], define

  f(x) = f̂(x) − (μ/2)‖x − x₀‖²,  Ψ(x) = σ_Q(x) + (μ/2)‖x − x₀‖².  (5.2)

Note that the function f in (5.2) is convex and its gradient is Lipschitz continuous with L_f = L_f̂ − μ. Moreover, the function Ψ(x) is strongly convex with convexity parameter μ. On the other hand,

  φ(x) = f(x) + Ψ(x) = f̂(x) + σ_Q(x).

Thus, the corresponding unconstrained minimization problem (4.1) coincides with the constrained problem (5.1). Since all conditions of Theorem 6 are satisfied, the method A(x₀, L₀, μ) has the following performance guarantees:

  f̂(x_k) − f̂(x*) ≤ (γ_u(L_f̂ − μ)/4) ‖x* − x₀‖² [1 + (μ/(2γ_u(L_f̂ − μ)))^{1/2}]^{−2(k−1)}, k ≥ 1.  (5.3)

This means that an ε-solution of problem (5.1) can be obtained by this technique in

  O( (L_f̂/μ)^{1/2} ln(1/ε) )  (5.4)

iterations. Note that the same problem can be solved also by the Gradient Method (3.3). However, in accordance with (3.15), its performance guarantee is much worse: it needs O( (L_f̂/μ) ln(1/ε) ) iterations.
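A minimal sketch of the prox-term reallocation (5.2), for the illustrative case where Q is a coordinate box (our choice; in this case the prox of the reallocated Ψ stays explicit):

```python
import numpy as np

# Sketch of (5.2) for a strongly convex f_hat on a box Q (assumed setting):
# the mu-strongly-convex part is moved into Psi, whose prox is still explicit.
def make_reallocated_parts(f_hat, grad_f_hat, x0, mu, lo, hi):
    f = lambda x: f_hat(x) - 0.5 * mu * np.dot(x - x0, x - x0)
    grad_f = lambda x: grad_f_hat(x) - mu * (x - x0)

    # prox of Psi(x) = sigma_Q(x) + (mu/2)*||x - x0||^2:
    # argmin_x <g, x-y> + (L/2)*||x-y||^2 + Psi(x) over the box,
    # computable coordinate-wise because the quadratic is separable.
    def prox(y, g, L):
        z = (L * y - g + mu * x0) / (L + mu)
        return np.clip(z, lo, hi)

    return f, grad_f, prox
```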

2. Restart. For problem (5.1), define the following components of the composite objective function in (4.1):

  f(x) = f̂(x),  Ψ(x) = σ_Q(x).  (5.5)

Let us fix an upper bound N ≥ 1 for the number of iterations in A, and consider the following two-level process:

  Choose u₀ ∈ Q. For k ≥ 0, compute u_{k+1} as the result of N iterations of A(u_k, L₀, 0).  (5.6)

In view of definition (5.5), we have

  f̂(u_{k+1}) − f̂(x*) ≤^{(4.17)} (γ_u L_f̂/N²) ‖x* − u_k‖² ≤ (2γ_u L_f̂/(μ_f̂ N²)) ( f̂(u_k) − f̂(x*) ).

Thus, taking N = 2(γ_u L_f̂/μ_f̂)^{1/2}, we obtain

  f̂(u_{k+1}) − f̂(x*) ≤ ½ ( f̂(u_k) − f̂(x*) ).

Hence, the performance guarantees of this technique are of the same order as (5.4).
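The two-level process (5.6) is straightforward to express in code; the sketch below (ours) assumes γ_u = 2 and known bounds L̂ ≥ L_f̂ and μ̂ ≤ μ_f̂:

```python
import numpy as np

# Sketch of the restart strategy (5.6): u_{k+1} is the output of N iterations
# of A(u_k, L0, 0); N is chosen as in Section 5.1 with gamma_u = 2 assumed.
# `accelerated_method` is any implementation of scheme (4.9) with mu = 0.
def restart_scheme(u0, L0, f, grad_f, mu_hat, L_hat, n_stages, accelerated_method):
    N = int(np.ceil(2.0 * np.sqrt(2.0 * L_hat / mu_hat)))  # N = 2*sqrt(gu*L/mu)
    u = u0
    for _ in range(n_stages):   # each stage halves the residual f_hat(u) - f*
        u = accelerated_method(u, L0, f, grad_f, n_iters=N)
    return u
```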

5.2 Approximating the first-order optimality conditions

In some applications, we are interested in finding a point with a small residual in the system of the first-order optimality conditions. Since, by (2.16) and (2.13),

  Dφ(T_L(x))[u] ≥ −(1 + L_f/L) ‖g_L(x)‖* ≥ −(1 + L_f/L) (2L²/(2L − L_f))^{1/2} (φ(x) − φ(x*))^{1/2}, u ∈ F(T_L(x)), ‖u‖ = 1,  (5.7)

the upper bounds on this residual can be obtained from the estimates on the rate of convergence of method (4.9) in the form (4.17) or (4.18). However, in this case the first inequality does not give a satisfactory result: it can guarantee only that the right-hand side of inequality (5.7) vanishes as O(1/k). This rate is typical for the Gradient Method (see (3.12)), and from the accelerated version (4.9) we can expect much more. Let us show how we can achieve a better result.

Consider the following constrained optimization problem:

  min_{x∈Q} f(x),  (5.8)

where Q is a closed convex set, and f is a convex function with Lipschitz-continuous gradient. Let us fix a tolerance parameter δ > 0 and a starting point x₀ ∈ Q. Define

  Ψ(x) = σ_Q(x) + (δ/2)‖x − x₀‖².

Consider now the unconstrained minimization problem (4.1) with the composite objective function φ(x) = f(x) + Ψ(x). Note that the function Ψ is strongly convex with parameter μ_Ψ = δ. Hence, in view of Theorem 6, the method A(x₀, L₀, δ) converges as follows:

  φ(x_k) − φ(x*) ≤ (γ_u L_f/4) ‖x* − x₀‖² [1 + (δ/(2γ_u L_f))^{1/2}]^{−2(k−1)}.  (5.9)

For simplicity, we can choose γ_d = γ_u in order to have L_k ≤ L_f for all k ≥ 0. Let us compute now T_k = G(x_k, L_k).T and M_k = G(x_k, L_k).L. Then

  φ(x_k) − φ(x*) ≥ φ(x_k) − φ(T_k) ≥^{(2.17)} (1/(2M_k)) ‖g_{M_k}(x_k)‖*²,  L₀ ≤ M_k ≤ γ_u L_f,

and by (5.9) we obtain the following estimate:

  ‖g_{M_k}(x_k)‖* ≤ (γ_u L_f/2^{1/2}) ‖x* − x₀‖ [1 + (δ/(2γ_u L_f))^{1/2}]^{−(k−1)}.  (5.10)

In our case, the first-order optimality conditions (4.2) for computing T_{M_k}(x_k) can be written as follows:

  ∇f(x_k) + δB(T_k − x₀) + ξ_k = g_{M_k}(x_k),  (5.11)

where ξ_k ∈ ∂σ_Q(T_k). Note that for any y ∈ Q we have

  0 = σ_Q(y) ≥ σ_Q(T_k) + ⟨ξ_k, y − T_k⟩ = ⟨ξ_k, y − T_k⟩.  (5.12)

Hence, for any direction u ∈ F(T_k) with ‖u‖ = 1 we obtain

  ⟨∇f(T_k), u⟩ ≥^{(2.11)} ⟨∇f(x_k), u⟩ − (L_f/M_k)‖g_{M_k}(x_k)‖* =^{(5.11)} ⟨g_{M_k}(x_k) − δB(T_k − x₀) − ξ_k, u⟩ − (L_f/M_k)‖g_{M_k}(x_k)‖* ≥^{(5.12)} −δ‖T_k − x₀‖ − (1 + L_f/M_k)‖g_{M_k}(x_k)‖*.

Assume now that the size of the set Q does not exceed R, and take δ = εL₀. Let us choose the number of iterations k from the inequality

  [1 + (εL₀/(2γ_u L_f))^{1/2}]^{−(k−1)} ≤ ε.

Then the residual of the first-order optimality conditions satisfies the inequality

  ⟨∇f(T_k), u⟩ ≥ −εR [ L₀ + (1 + L_f/L₀) γ_u L_f/2^{1/2} ], u ∈ F(T_k), ‖u‖ = 1.  (5.13)

For that, the required number of iterations k is at most of the order O(ε^{−1/2} ln(1/ε)).

5.3 Unknown parameter of strongly convex objective

In Section 5.1 we discussed two efficient strategies for minimizing a strongly convex function with a known estimate of the convexity parameter μ_f̂. However, usually this information is not available. We can easily obtain only an upper estimate for this value, for example by the inequality

  μ_f̂ ≤ S_L(x) ≤ L_f̂, x ∈ Q.

Let us show that such a bound can also be used for designing an efficient optimization strategy for strongly convex functions.

For problem (5.1), assume that we have some guess μ for the parameter μ_f̂ and a starting point u₀ ∈ Q. Denote φ₀(x) = f̂(x) + σ_Q(x). Let us choose

  x₀ = G_{φ₀}(u₀, L₀).T,  M₀ = G_{φ₀}(u₀, L₀).L,  S₀ = G_{φ₀}(u₀, L₀).S,

and minimize the composite objective (5.2) by the method A(x₀, M₀, μ), using the following stopping criterion:

  Compute: v_k = G_{φ₀}(x_k, L_k).T, M_k = G_{φ₀}(x_k, L_k).L.
  Stop the stage if either
  (A): ‖g_{M_k}(x_k)[φ₀]‖* ≤ ½ ‖g_{M₀}(u₀)[φ₀]‖*, or
  (B): (M_k/A_k)(1 + S₀/M₀)² ≤ μ²/4.  (5.14)

If the stage was terminated by Condition (A), then we call it successful. In this case, we run the next stage, taking v_k as the new starting point and keeping the estimate μ of the convexity parameter unchanged.

Suppose that the stage was terminated by Condition (B) (that is, an unsuccessful stage). If μ were a correct lower bound for the convexity parameter μ_f̂, then

  (1/(2M_k)) ‖g_{M_k}(x_k)[φ₀]‖*² ≤^{(2.17)} f̂(x_k) − f̂(x*) ≤^{(4.5)} ‖x₀ − x*‖²/(2A_k) ≤^{(2.22)} (1/(2A_k μ²)) (1 + S₀/M₀)² ‖g_{M₀}(u₀)[φ₀]‖*².

Hence, in view of Condition (B), in this case the stage must have been terminated by Condition (A). Since this did not happen, we conclude that μ > μ_f̂. Therefore, we redefine μ := μ/2 and run the stage again from the old starting point x₀.

We are not going to present all details of the complexity analysis of the above strategy. It can be shown that, for generating an ε-solution of problem (5.1) with strongly convex objective, it needs

  O( κ_f̂^{1/2} ln κ_f̂ ) + O( κ_f̂^{1/2} ln κ_f̂ · ln(κ_f̂/ε) ),  κ_f̂ ≝ L_f̂/μ_f̂,

calls of the oracle. The first term in this bound corresponds to the total number of oracle calls at all unsuccessful stages. The factor κ_f̂^{1/2} ln κ_f̂ represents an upper bound on the length of any stage, independently of the variant of its termination.
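The control flow of this guess-and-check strategy can be sketched as follows (ours; run_stage is an assumed helper that runs A(·) with the stopping criterion (5.14) and reports which condition fired):

```python
# Hedged control-flow sketch of the strategy of Section 5.3 (names are ours).
# run_stage(x, mu) is assumed to run A(x, M0, mu) on the reallocated
# objective (5.2) until criterion (5.14) fires, returning (v, success):
# success=True means termination by condition (A).
def minimize_unknown_mu(u0, mu0, run_stage, n_stages):
    mu, x = mu0, u0
    for _ in range(n_stages):
        v, success = run_stage(x, mu)
        if success:          # successful stage: move to v, keep the guess mu
            x = v
        else:                # unsuccessful stage: mu > mu_f was detected,
            mu *= 0.5        # halve the guess and restart from the same point
    return x
```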

6 Computational experiments

We tested the algorithms described above on a set of randomly generated Sparse Least Squares problems of the form

  Find φ* = min_{x∈Rⁿ} [ φ(x) ≝ ½‖Ax − b‖₂² + ‖x‖₁ ],  (6.1)

where A = (a₁,…,a_n) is an m×n dense matrix with m < n. All problems were generated with known optimal solutions, which can be obtained from the dual representation of the initial primal problem (6.1):

  min_{x∈Rⁿ} [ ½‖Ax − b‖₂² + ‖x‖₁ ] = min_{x∈Rⁿ} max_{u∈R^m} [ ⟨u, b − Ax⟩ − ½‖u‖² + ‖x‖₁ ]
   = max_{u∈R^m} [ ⟨b, u⟩ − ½‖u‖² + min_{x∈Rⁿ} ( ‖x‖₁ − ⟨Aᵀu, x⟩ ) ]
   = max_{u∈R^m} { ⟨b, u⟩ − ½‖u‖² : ‖Aᵀu‖_∞ ≤ 1 }.  (6.2)

Thus, the problem dual to (6.1) consists in finding the Euclidean projection of the vector b ∈ R^m onto the dual polytope

  D = {y ∈ R^m : ‖Aᵀy‖_∞ ≤ 1}.

This interpretation explains the changing sparsity of the optimal solution x*(τ) of the following parametric version of problem (6.1):

  min_{x∈Rⁿ} [ φ_τ(x) ≝ ½‖Ax − b‖₂² + τ‖x‖₁ ].  (6.3)

Indeed, for τ > 0 we have

  φ_τ(x) = τ² [ ½‖A(x/τ) − b/τ‖₂² + ‖x/τ‖₁ ].

Hence, in the dual problem we project the vector b/τ onto the polytope D. The nonzero components of x*(τ) correspond to the active facets of D. Thus, for τ big enough we have b/τ ∈ int D, which means x*(τ) = 0. When τ decreases, x*(τ) becomes more and more dense. Finally, if all facets of D are in general position, then x*(τ) has exactly m nonzero components as τ → 0.

In our computational experiments, we compare three minimization methods. Two of them maintain recursively the relations (4.4). This feature allows us to classify them as primal-dual methods. Indeed, denote

  φ*(u) = ½‖u‖² − ⟨b, u⟩.

As we have seen in (6.2),

  φ(x) + φ*(u) ≥ 0 ∀x ∈ Rⁿ, u ∈ D,  (6.4)

and the lower bound is achieved only at the optimal solutions of the primal and dual problems. For some sequence {z_i}_{i≥1} and a starting point z₀ ∈ dom Ψ, the relations (4.4) ensure

  A_k φ(x_k) ≤ min_{x∈Rⁿ} { ∑_{i=1}^k a_i [ f(z_i) + ⟨∇f(z_i), x − z_i⟩ ] + A_k Ψ(x) + ½‖x − z₀‖² }.  (6.5)

In our situation, f(x) = ½‖Ax − b‖₂², Ψ(x) = ‖x‖₁, and we choose z₀ = 0. Denote u_i = b − Az_i. Then ∇f(z_i) = −Aᵀu_i, and therefore

  f(z_i) − ⟨∇f(z_i), z_i⟩ = ½‖u_i‖² + ⟨Aᵀu_i, z_i⟩ = ⟨b, u_i⟩ − ½‖u_i‖² = −φ*(u_i).

Denoting

  ū_k = (1/A_k) ∑_{i=1}^k a_i u_i,  (6.6)

we obtain

  A_k [ φ(x_k) + φ*(ū_k) ] ≤ A_k φ(x_k) + ∑_{i=1}^k a_i φ*(u_i) ≤^{(6.5)} min_{x∈Rⁿ} { ∑_{i=1}^k a_i ⟨∇f(z_i), x⟩ + A_k Ψ(x) + ½‖x‖² } ≤^{x=0} 0.

In view of (6.4), ū_k cannot be feasible:

  φ*(ū_k) ≤ −φ(x_k) ≤ −φ* = min_{u∈D} φ*(u).  (6.7)

Let us measure the level of infeasibility of these points. Note that the minimum of the optimization problem in (6.5) is attained at x = v_k. Hence, the corresponding first-order optimality conditions ensure

  | ⟨a_i, ū_k⟩ − (1/A_k)(Bv_k)^{(i)} | ≤ 1, i = 1,…,n.

Assume that the matrix B in (1.3) is diagonal:

  B^{(i,j)} = d_i if i = j, and 0 otherwise.

Then |⟨a_i, ū_k⟩| ≤ 1 + (d_i/A_k)|v_k^{(i)}|, and

  ρ(ū_k) ≝ [ ∑_{i=1}^n (1/d_i) ( |⟨a_i, ū_k⟩| − 1 )₊² ]^{1/2} ≤ (1/A_k)‖v_k‖ ≤^{(4.6)} (2/A_k)‖x*‖,  (6.8)

where (α)₊ = max{α, 0}. Thus, we can use the function ρ(·) as a dual infeasibility measure. In view of (6.7), it is a reasonable stopping criterion for our primal-dual methods.
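A direct transcription of the measure (6.8), assuming B = diag(d) (names are ours):

```python
import numpy as np

# Dual infeasibility measure (6.8) for B = diag(d):
#     rho(u) = sqrt( sum_i (|<a_i, u>| - 1)_+^2 / d_i ).
def rho(A_mat, u, d):
    viol = np.maximum(np.abs(A_mat.T @ u) - 1.0, 0.0)   # (|<a_i, u>| - 1)_+
    return np.sqrt(np.sum(viol ** 2 / d))
```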

For generating random test problems, we apply the following strategy.

- Choose m* ≤ m, the number of nonzero components of the optimal solution x* of problem (6.1), and a parameter ρ* > 0 responsible for the size of x*.
- Generate randomly a matrix B ∈ R^{m×n} with elements uniformly distributed in the interval [−1, 1].
- Generate randomly a vector v* ∈ R^m with elements uniformly distributed in [0, 1]. Define y* = v*/‖v*‖₂.
- Sort the entries of the vector Bᵀy* in decreasing order of their absolute values. For the sake of notation, assume that this coincides with the natural ordering of indices.
- For i = 1,…,n, define a_i = α_i b_i, with α_i > 0 chosen in accordance with the following rule:

  α_i = 1/⟨b_i, y*⟩ for i = 1,…,m*;
  α_i = 1, if |⟨b_i, y*⟩| ≤ 0.1 and i > m*;
  α_i = ξ_i/⟨b_i, y*⟩ otherwise,

  where the ξ_i are uniformly distributed in [0, 1].
- For i = 1,…,n, generate the components of the primal solution:

  x*^{(i)} = ξ*_i · sign⟨a_i, y*⟩ for i = 1,…,m*, and x*^{(i)} = 0 otherwise,

  where the ξ*_i are uniformly distributed in [0, ρ*/(m*)^{1/2}].
- Define b = y* + Ax*.

Thus, the optimal value of the randomly generated problem (6.1) can be computed as

  φ* = ½‖y*‖² + ‖x*‖₁.

In the first series of tests, we use this value in the termination criterion. Let us look first at the results of minimization of two typical random problem instances.
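A hedged sketch of this generator follows (the 0.1 threshold and the distributions follow our reading of the recipe above; it assumes ⟨b_i, y*⟩ ≠ 0 for i ≤ m*, which holds almost surely):

```python
import numpy as np

# Sketch of the test-problem generator of Section 6 (assumed reading).
def generate_sparse_ls(n, m, m_star, rho_star, rng=np.random.default_rng()):
    B = rng.uniform(-1.0, 1.0, size=(m, n))
    v = rng.uniform(0.0, 1.0, size=m)
    y = v / np.linalg.norm(v)                     # y* = v / ||v||_2
    order = np.argsort(-np.abs(B.T @ y))          # sort |<b_i, y*>| decreasingly
    B = B[:, order]
    c = B.T @ y                                   # <b_i, y*>
    alpha = np.empty(n)
    alpha[:m_star] = 1.0 / c[:m_star]             # active facets: <a_i, y*> = 1
    xi = rng.uniform(0.0, 1.0, size=n)
    small = np.abs(c) <= 0.1
    alpha[m_star:] = np.where(small[m_star:], 1.0, xi[m_star:] / c[m_star:])
    A = B * alpha                                 # a_i = alpha_i * b_i
    x = np.zeros(n)
    x[:m_star] = rng.uniform(0.0, rho_star / np.sqrt(m_star), size=m_star) \
                 * np.sign(A[:, :m_star].T @ y)
    b = y + A @ x                                 # b = y* + A x*
    phi_star = 0.5 * np.dot(y, y) + np.abs(x).sum()
    return A, b, x, phi_star
```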

The first problem is relatively easy.

Problem 1: n = 4000, m = 1000, m* = 100, ρ* = 1.

[Table: for a sequence of relative accuracy levels Gap, the columns report, for each of the three methods PG, DG and AC, the iteration count k, the number of matrix-vector multiplications #Ax, and the value SpeedUp; the numerical entries are not recoverable from this transcription.]

In this table, the column Gap shows the relative decrease of the initial residual. In the rest of the table we see the computational results of three methods:

- the Primal Gradient Method (3.3), abbreviated PG;
- the Dual Gradient Method (4.7), abbreviated DG;
- the Accelerated Gradient Method (4.9), abbreviated AC.

In all methods, we use the following values of the parameters:

  γ_u = γ_d = 2, x₀ = 0, L₀ = max_{1≤i≤n} ‖a_i‖₂² ≤ L_f, μ = 0.

Let us explain the remaining columns of this table. For each method, the column k shows the number of iterations necessary for reaching the corresponding reduction of the initial gap in the function value. The column #Ax shows the necessary number of matrix-vector multiplications. Note that for computing the value f(x) we need one multiplication; if, in addition, we need the gradient, we need one more. For example, in accordance with the estimate (4.13), each iteration of (4.9) needs four computations of the pair function/gradient. Hence, for this method we can expect eight matrix-vector multiplications per iteration. For the Gradient Method (3.3), we need on average two calls of the oracle. However, one of them is done inside the line-search

procedure (3.1) and requires only the function value. Hence, in this case we expect to have three matrix-vector multiplications per iteration. In the table above, we can observe the remarkable accuracy of these predictions.

Finally, the column SpeedUp represents the absolute accuracy of the current approximate solution as a percentage of the worst-case estimate given by the corresponding rate of convergence. Since the exact L_f is unknown, we use L₀ instead. We can see that all methods usually significantly outperform the theoretically predicted rate of convergence. However, for all of them there are parts of the trajectory where the worst-case predictions are quite accurate. This is even more evident in our second table, which corresponds to a more difficult problem instance.

Problem 2: n = 5000, m = 500, m* = 100, ρ* = 1.

[Table: same columns as for Problem 1; the numerical entries are not recoverable from this transcription.]

In this table, we can see that the Primal Gradient Method still significantly outperforms the theoretical predictions. This is not too surprising, since it can, for example, automatically accelerate on strongly convex functions (see Theorem 5); all other methods would require explicit changes in their schemes for this. However, despite all these discrepancies, the main conclusion of our theoretical analysis is confirmed: the accelerated scheme (4.9) significantly outperforms the primal and dual variants of the Gradient Method.

In the second series of tests, we studied the ability of the primal-dual schemes (4.7) and (4.9) to decrease the infeasibility measure ρ(·) (see (6.8)). This problem, at least for the Dual Gradient Method (4.7), appears to be much harder than the primal minimization

problem (6.1). Let us look at the following results.

Problem 3: n = 500, m = 50, m* = 25, ρ* = 1.

[Table: for each level Gap, the columns report k, #Ax, the residual in φ, and SpeedUp for the methods DG and AC; the numerical entries are not recoverable from this transcription.]

In this table we can see the computational cost of decreasing the initial value of ρ by the corresponding factor. Note that both methods require more iterations than for Problem 1, which was solved to a much higher accuracy in the objective function. Moreover, for reaching the required level of ρ, method (4.7) has to decrease the residual in the objective up to machine precision, and the norm of the gradient mapping almost to zero. The accelerated scheme is more balanced: the final residual in φ is of the order 10⁻⁶, and the norm of the gradient mapping is decreased only moderately.

Let us look at a bigger problem.

Problem 4: n = 1000, m = 100, m* = 50, ρ* = 1.

[Table: same columns as for Problem 3; the numerical entries are not recoverable from this transcription.]

As compared with Problem 3, in Problem 4 the sizes are doubled. This makes almost no difference for the accelerated scheme, but for the Dual Gradient Method the computational expenses grow substantially. A further increase of dimension makes the latter scheme impractical. Let us look now at how these methods work on Problem 1 with ρ(·) as the termination criterion.

Problem 1a: n = 4000, m = 1000, m* = 100, ρ* = 1.

[Table: same columns as for Problem 3; for DG the run terminates with a line-search failure; the numerical entries are not recoverable from this transcription.]

The reason for the failure of the Dual Gradient Method is quite interesting. In the end, it generates points with a very small residual in the value of the objective function. Therefore, the termination criterion of the gradient iteration (3.1) cannot work properly due to rounding errors. In the accelerated scheme (4.9) this does not happen, since the decrease of the objective function and of the dual infeasibility measure is much better balanced. In some sense, this situation is natural. We have seen that on the current test problems all methods converge faster towards the end. On the other hand, the rate of convergence of the dual variables ū_k is limited by the rate of growth of the coefficients a_i in the representation (6.6). For the Dual Gradient Method, these coefficients are almost constant. For the accelerated scheme, they grow proportionally to the iteration counter.

We hope that the numerical examples above clearly demonstrate the advantages of the accelerated gradient method (4.9) with the adjustable line-search strategy. It would be interesting to check numerically how this method works in other situations. Of course, the first candidates to try are the different applications of the smoothing technique [10]. However, even for the Sparse Least Squares problem (6.1) there are many potential improvements. Let us discuss one of them.

Note that we treated problem (6.1) with a quite general model (1.1), ignoring the important fact that the function f is quadratic. The characteristic property of quadratic functions is that they have a constant second derivative. Hence, it is natural to select the operator B in the metric (1.3) taking into account the structure of the Hessian of the function f. Let us define

  B = diag(AᵀA) ≡ diag(∇²f(x)).

Then

  ‖e_i‖² = ⟨Be_i, e_i⟩ = ‖a_i‖₂² = ‖Ae_i‖₂² = ⟨∇²f(x)e_i, e_i⟩, i = 1,…,n,

where e_i denotes the i-th coordinate vector of Rⁿ.


An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization Qihang Lin Lin Xiao April 4, 013 Abstract We consider optimization problems with an objective function

More information

10. Unconstrained minimization

10. Unconstrained minimization Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

Equality constrained minimization

Equality constrained minimization Chapter 10 Equality constrained minimization 10.1 Equality constrained minimization problems In this chapter we describe methods for solving a convex optimization problem with equality constraints, minimize

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization

An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization An Adaptive Accelerated Proximal Gradient Method and its Homotopy Continuation for Sparse Optimization Qihang Lin Lin Xiao February 4, 014 Abstract We consider optimization problems with an objective function

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Primal-Dual Interior-Point Methods for Linear Programming based on Newton s Method

Primal-Dual Interior-Point Methods for Linear Programming based on Newton s Method Primal-Dual Interior-Point Methods for Linear Programming based on Newton s Method Robert M. Freund March, 2004 2004 Massachusetts Institute of Technology. The Problem The logarithmic barrier approach

More information

arxiv: v1 [math.oc] 5 Dec 2014

arxiv: v1 [math.oc] 5 Dec 2014 FAST BUNDLE-LEVEL TYPE METHODS FOR UNCONSTRAINED AND BALL-CONSTRAINED CONVEX OPTIMIZATION YUNMEI CHEN, GUANGHUI LAN, YUYUAN OUYANG, AND WEI ZHANG arxiv:141.18v1 [math.oc] 5 Dec 014 Abstract. It has been

More information

Douglas-Rachford splitting for nonconvex feasibility problems

Douglas-Rachford splitting for nonconvex feasibility problems Douglas-Rachford splitting for nonconvex feasibility problems Guoyin Li Ting Kei Pong Jan 3, 015 Abstract We adapt the Douglas-Rachford DR) splitting method to solve nonconvex feasibility problems by studying

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

Barrier Method. Javier Peña Convex Optimization /36-725

Barrier Method. Javier Peña Convex Optimization /36-725 Barrier Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: Newton s method For root-finding F (x) = 0 x + = x F (x) 1 F (x) For optimization x f(x) x + = x 2 f(x) 1 f(x) Assume f strongly

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

3.10 Lagrangian relaxation

3.10 Lagrangian relaxation 3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

arxiv: v4 [math.oc] 1 Apr 2018

arxiv: v4 [math.oc] 1 Apr 2018 Generalizing the optimized gradient method for smooth convex minimization Donghwan Kim Jeffrey A. Fessler arxiv:607.06764v4 [math.oc] Apr 08 Date of current version: April 3, 08 Abstract This paper generalizes

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method

Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method Optimization over Sparse Symmetric Sets via a Nonmonotone Projected Gradient Method Zhaosong Lu November 21, 2015 Abstract We consider the problem of minimizing a Lipschitz dierentiable function over a

More information

Pavel Dvurechensky Alexander Gasnikov Alexander Tiurin. July 26, 2017

Pavel Dvurechensky Alexander Gasnikov Alexander Tiurin. July 26, 2017 Randomized Similar Triangles Method: A Unifying Framework for Accelerated Randomized Optimization Methods Coordinate Descent, Directional Search, Derivative-Free Method) Pavel Dvurechensky Alexander Gasnikov

More information

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

Accelerated primal-dual methods for linearly constrained convex problems

Accelerated primal-dual methods for linearly constrained convex problems Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS

CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS CONVERGENCE PROPERTIES OF COMBINED RELAXATION METHODS Igor V. Konnov Department of Applied Mathematics, Kazan University Kazan 420008, Russia Preprint, March 2002 ISBN 951-42-6687-0 AMS classification:

More information

The Proximal Gradient Method

The Proximal Gradient Method Chapter 10 The Proximal Gradient Method Underlying Space: In this chapter, with the exception of Section 10.9, E is a Euclidean space, meaning a finite dimensional space endowed with an inner product,

More information

Gradient Sliding for Composite Optimization

Gradient Sliding for Composite Optimization Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Lagrange Relaxation and Duality

Lagrange Relaxation and Duality Lagrange Relaxation and Duality As we have already known, constrained optimization problems are harder to solve than unconstrained problems. By relaxation we can solve a more difficult problem by a simpler

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information

Lecture 16: October 22

Lecture 16: October 22 0-725/36-725: Conve Optimization Fall 208 Lecturer: Ryan Tibshirani Lecture 6: October 22 Scribes: Nic Dalmasso, Alan Mishler, Benja LeRoy Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro

More information

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING

ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING ACCELERATED FIRST-ORDER PRIMAL-DUAL PROXIMAL METHODS FOR LINEARLY CONSTRAINED COMPOSITE CONVEX PROGRAMMING YANGYANG XU Abstract. Motivated by big data applications, first-order methods have been extremely

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Convex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University

Convex Optimization. 9. Unconstrained minimization. Prof. Ying Cui. Department of Electrical Engineering Shanghai Jiao Tong University Convex Optimization 9. Unconstrained minimization Prof. Ying Cui Department of Electrical Engineering Shanghai Jiao Tong University 2017 Autumn Semester SJTU Ying Cui 1 / 40 Outline Unconstrained minimization

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Complexity and Simplicity of Optimization Problems

Complexity and Simplicity of Optimization Problems Complexity and Simplicity of Optimization Problems Yurii Nesterov, CORE/INMA (UCL) February 17 - March 23, 2012 (ULg) Yu. Nesterov Algorithmic Challenges in Optimization 1/27 Developments in Computer Sciences

More information

Self-Concordant Barrier Functions for Convex Optimization

Self-Concordant Barrier Functions for Convex Optimization Appendix F Self-Concordant Barrier Functions for Convex Optimization F.1 Introduction In this Appendix we present a framework for developing polynomial-time algorithms for the solution of convex optimization

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

Lecture 23: Conditional Gradient Method

Lecture 23: Conditional Gradient Method 10-725/36-725: Conve Optimization Spring 2015 Lecture 23: Conditional Gradient Method Lecturer: Ryan Tibshirani Scribes: Shichao Yang,Diyi Yang,Zhanpeng Fang Note: LaTeX template courtesy of UC Berkeley

More information

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang

A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES. Fenghui Wang A NEW ITERATIVE METHOD FOR THE SPLIT COMMON FIXED POINT PROBLEM IN HILBERT SPACES Fenghui Wang Department of Mathematics, Luoyang Normal University, Luoyang 470, P.R. China E-mail: wfenghui@63.com ABSTRACT.

More information

Worst Case Complexity of Direct Search

Worst Case Complexity of Direct Search Worst Case Complexity of Direct Search L. N. Vicente May 3, 200 Abstract In this paper we prove that direct search of directional type shares the worst case complexity bound of steepest descent when sufficient

More information

Accelerating Nesterov s Method for Strongly Convex Functions

Accelerating Nesterov s Method for Strongly Convex Functions Accelerating Nesterov s Method for Strongly Convex Functions Hao Chen Xiangrui Meng MATH301, 2011 Outline The Gap 1 The Gap 2 3 Outline The Gap 1 The Gap 2 3 Our talk begins with a tiny gap For any x 0

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Applications of Linear Programming

Applications of Linear Programming Applications of Linear Programming lecturer: András London University of Szeged Institute of Informatics Department of Computational Optimization Lecture 9 Non-linear programming In case of LP, the goal

More information

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Boris T. Polyak Andrey A. Tremba V.A. Trapeznikov Institute of Control Sciences RAS, Moscow, Russia Profsoyuznaya, 65, 117997

More information

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems

An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems An accelerated non-euclidean hybrid proximal extragradient-type algorithm for convex-concave saddle-point problems O. Kolossoski R. D. C. Monteiro September 18, 2015 (Revised: September 28, 2016) Abstract

More information

A Second-Order Method for Strongly Convex l 1 -Regularization Problems

A Second-Order Method for Strongly Convex l 1 -Regularization Problems Noname manuscript No. (will be inserted by the editor) A Second-Order Method for Strongly Convex l 1 -Regularization Problems Kimon Fountoulakis and Jacek Gondzio Technical Report ERGO-13-11 June, 13 Abstract

More information

Duality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Duality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725 Duality in Linear Programs Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: proximal gradient descent Consider the problem x g(x) + h(x) with g, h convex, g differentiable, and

More information

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Optimization Escuela de Ingeniería Informática de Oviedo (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30 Unconstrained optimization Outline 1 Unconstrained optimization 2 Constrained

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

The Frank-Wolfe Algorithm:

The Frank-Wolfe Algorithm: The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Convex Optimization. Lecture 12 - Equality Constrained Optimization. Instructor: Yuanzhang Xiao. Fall University of Hawaii at Manoa

Convex Optimization. Lecture 12 - Equality Constrained Optimization. Instructor: Yuanzhang Xiao. Fall University of Hawaii at Manoa Convex Optimization Lecture 12 - Equality Constrained Optimization Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 19 Today s Lecture 1 Basic Concepts 2 for Equality Constrained

More information

FIXED POINT ITERATIONS

FIXED POINT ITERATIONS FIXED POINT ITERATIONS MARKUS GRASMAIR 1. Fixed Point Iteration for Non-linear Equations Our goal is the solution of an equation (1) F (x) = 0, where F : R n R n is a continuous vector valued mapping in

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information