Complexity bounds for primal-dual methods minimizing the model of objective function


Yu. Nesterov*

July 4, 2016

Abstract

We provide the Frank-Wolfe (Conditional Gradient) method with a convergence analysis allowing to approach a primal-dual solution of a convex optimization problem with composite objective function. Additional properties of the complementary part of the objective (strong convexity) significantly accelerate the scheme. We also justify a new variant of this method, which can be seen as a trust-region scheme applied to the linear model of the objective function. For this variant, we also prove a rate of convergence for the total variation of the linear model of the composite objective over the feasible set. Our analysis works also for a quadratic model, allowing to justify the global rate of convergence of a new second-order method as applied to a composite objective function. To the best of our knowledge, this is the first trust-region scheme supported by a worst-case complexity analysis both for the functional gap and for the variation of the local quadratic model over the feasible set.

Keywords: convex optimization, complexity bounds, linear optimization oracle, conditional gradient method, trust-region method.

* Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 34 voie du Roman Pays, 1348 Louvain-la-Neuve, Belgium; Yurii.Nesterov@uclouvain.be. The research results presented in this paper have been supported by the grant "Action de recherche concertée" ARC 4/9-060 from the Direction de la recherche scientifique - Communauté française de Belgique. Scientific responsibility rests with the author.

1 Introduction

Motivation. In recent years, one can see an increasing interest in the Frank-Wolfe algorithm [3, 2, 11], which is sometimes called the Conditional Gradient Method (CGM) [5, 7, 9, 8]. At each iteration of this scheme, we need to solve an auxiliary problem of minimizing a linear function over a convex feasible set. In some situations, mainly when the feasible set is a simple polytope, the complexity of this subproblem is much lower than that of the standard projection technique (e.g. [12]). The standard complexity results for this method are related to convex objective functions with Lipschitz-continuous gradient. In this situation, CGM converges as $O(\frac{1}{k})$, where $k$ is the number of iterations. Moreover, it appears that this rate of convergence is optimal for methods with a linear optimization oracle [10]. For nonsmooth functions, CGM cannot converge (we give a simple example in Section 2). Therefore, it is interesting to study the dependence of the rate of convergence of CGM on the level of smoothness of the objective function. On the other hand, sometimes nonsmoothness of the objective function results from a complementary regularization term. This situation can be treated in the framework of composite minimization [14]. However, the performance of CGM for this structure of the objective function has not been studied yet.¹ Finally, by its spirit, CGM is a primal-dual method. Indeed, it generates lower bounds on the optimal value of the objective function, which converge to the optimal value [4]. Therefore, it would be natural to extract from this method an approximate solution of the dual problem. These questions served as the main motivations for this paper.

Contents. In Section 2, we introduce our main problem of interest, where the objective function has a composite form. Our main assumption is that the problem of minimizing a linear function augmented by a simple convex complementary term is solvable. We assume that the smooth part of the objective has Hölder-continuous gradients. For proving the efficiency estimate for CGM, we apply the technique of estimating sequences (e.g. [12]) in its extended form [18]. As a result, we get significant freedom in the choice of stepsize coefficients. In Section 3, we consider a new variant of CGM, which can be seen as a trust-region method with a linear model of the objective. For this scheme, the trust region is formed by contracting the feasible set towards the current test point. This method ensures an appropriate rate of convergence for the total variation of the linear model, computed at the sequence of test points. The analytical form of our bounds for the primal-dual gap is similar to the bounds obtained in [4]. However, we estimate the difference between the current value of the objective function and the minimal value of the accumulated linear model. In Section 4, we explain how to extract from our convergence results an upper bound on the duality gap for some feasible primal-dual solution. In our technique, we use an explicit max-representation of the smooth part of the objective function. In Section 5, we show that additional properties of the complementary part of the objective (strong convexity) significantly accelerate the scheme. Finally, in Section 6, we apply our technique to a new second-order trust-region method, where the quadratic approximation of the objective function is minimized over a trust region formed by a contracted feasible set.

¹ Performance of CGM as applied to an objective function regularized by a norm was studied in [6].

To the best of our knowledge, this is the first trust-region scheme [1] supported by a worst-case complexity analysis.

Notation. In what follows, we consider optimization problems over a finite-dimensional linear space $E$ with dual space $E^*$. The value of a linear function $s \in E^*$ at $x \in E$ is denoted by $\langle s, x\rangle$. In $E$, we fix a norm $\|\cdot\|$, which defines the conjugate norm
$$\|s\|_* = \max_x\{\langle s, x\rangle : \|x\| \le 1\}, \quad s \in E^*.$$
For a linear operator $A : E \to E^*$, its conjugate operator $A^* : E \to E^*$ is defined by the identity
$$\langle Ax, y\rangle = \langle A^*y, x\rangle, \quad x \in E,\ y \in E.$$
We call an operator $B : E \to E^*$ self-conjugate if $B^* = B$. It is positive semidefinite if $\langle Bx, x\rangle \ge 0$ for all $x \in E$. For a linear operator $B : E \to E^*$, we define its operator norm in the usual way:
$$\|B\| = \max_x\{\|Bx\|_* : \|x\| \le 1\}.$$
For a differentiable function $f(x)$ with $\mathrm{dom}\, f \subseteq E$, we denote by $\nabla f(x) \in E^*$ its gradient, and by $\nabla^2 f(x) : E \to E^*$ its Hessian. Note that $(\nabla^2 f(x))^* = \nabla^2 f(x)$.

In the sequel, we often need to estimate partial sums of different series. For that, it is convenient to use the following trivial lemma.

Lemma 1 Let the function $\xi(\tau)$, $\tau \in \mathbb{R}$, be decreasing and convex. Then, for any two integers $a$ and $b$ such that $[a, b+1] \subset \mathrm{dom}\,\xi$, we have
$$\int_a^{b+1}\xi(\tau)\,d\tau \;\le\; \sum_{k=a}^b \xi(k) \;\le\; \int_{a-1/2}^{b+1/2}\xi(\tau)\,d\tau. \eqno(1.1)$$
For example, for any $t \ge 0$ and $p \ge -t$, we have
$$\sum_{k=t}^{2t+p}\frac{1}{k+p+1} \;\ge\; \int_t^{2t+p+1}\frac{d\tau}{\tau+p+1} = \ln(\tau+p+1)\Big|_t^{2t+p+1} = \ln\frac{2t+2p+2}{t+p+1} = \ln 2. \eqno(1.2)$$
On the other hand, if $t \ge 1$, then
$$\sum_{k=t}^{2t+1}\frac{1}{(k+2)^2} \;\le\; \int_{t-1/2}^{2t+3/2}\frac{d\tau}{(\tau+2)^2} = -\frac{1}{\tau+2}\Big|_{t-1/2}^{2t+3/2} = \frac{1}{t+3/2} - \frac{1}{2t+7/2} = \frac{4t+8}{(2t+3)(4t+7)} \le \frac{2}{2t+3}. \eqno(1.3)$$

In this paper, we also use some special dual functions. Let $Q \subset E$ be a bounded closed convex set. For a closed convex function $F(\cdot)$ with $\mathrm{dom}\, F \supseteq \mathrm{int}\, Q$, we define its restricted dual function (with respect to a central point $\bar x \in Q$) as follows:
$$F_{\bar x, Q}(s) = \max_{x\in Q}\{\langle s, \bar x - x\rangle + F(\bar x) - F(x)\}, \quad s \in E^*. \eqno(1.4)$$
Clearly, this function is well defined for all $s \in E^*$. Moreover, it is convex and nonnegative on $E^*$.
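Since the integral bounds of Lemma 1 drive all step-size estimates below, a quick numerical sanity check may be useful. The following is a minimal sketch (our own illustration, not from the paper), using the decreasing convex function $\xi(\tau) = 1/(\tau+2)^2$ from estimate (1.3):

```python
# Numerical check of the two-sided bound (1.1) of Lemma 1 for
# xi(tau) = 1/(tau + 2)^2, which is decreasing and convex.

def xi(tau):
    return 1.0 / (tau + 2.0) ** 2

def integral(lo, hi):
    # Exact integral of xi over [lo, hi]; the antiderivative is -1/(tau + 2).
    return 1.0 / (lo + 2.0) - 1.0 / (hi + 2.0)

for (a, b) in [(1, 10), (5, 11), (3, 100)]:
    s = sum(xi(k) for k in range(a, b + 1))          # sum_{k=a}^{b} xi(k)
    lower = integral(a, b + 1)                        # int_a^{b+1} xi
    upper = integral(a - 0.5, b + 0.5)                # int_{a-1/2}^{b+1/2} xi
    assert lower <= s <= upper
    print(f"a={a:3d} b={b:3d}  {lower:.6f} <= {s:.6f} <= {upper:.6f}")
```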

In the theory of Nonsmooth Convex Optimization, dual functions are often used for analyzing the behavior of first-order methods. In these applications, the function $F$ in (1.4) is assumed to be a multiple of a strongly convex prox-function (see, for example, [13]). Such an assumption provides a possibility to control the step sizes in the corresponding optimization schemes. In this paper, we are going to drop the assumption of strong convexity. Hence, we need to develop another mechanism for controlling the sizes of our moves. This is the reason why we introduce in construction (1.4) an additional scaling parameter $\tau \in [0,1]$. We call the function
$$F_{\tau,\bar x,Q}(s) = \max_x\{\langle s, \bar x - y\rangle + F(\bar x) - F(y) : y = (1-\tau)\bar x + \tau x,\ x \in Q\}, \quad s \in E^*, \eqno(1.5)$$
the scaled restricted dual of the function $F$.

Lemma 2 For any $s \in E^*$ and $\tau \in [0,1]$, we have
$$F_{\bar x,Q}(s) \;\ge\; F_{\tau,\bar x,Q}(s) \;\ge\; \tau F_{\bar x,Q}(s). \eqno(1.6)$$
Proof: Since for any $x \in Q$ the point $y = (1-\tau)\bar x + \tau x$ belongs to $Q$, the first inequality is trivial. On the other hand,
$$F_{\tau,\bar x,Q}(s) = \max_{x\in Q}\{\langle s, \tau(\bar x - x)\rangle + F(\bar x) - F(y) : y = (1-\tau)\bar x + \tau x\} \ge \max_{x\in Q}\{\langle s, \tau(\bar x - x)\rangle + F(\bar x) - (1-\tau)F(\bar x) - \tau F(x)\} = \tau F_{\bar x,Q}(s). \;\square$$

2 Conditional gradient method

In this paper, we consider numerical methods for solving the composite minimization problem
$$\min_x\left\{\hat f(x) \stackrel{\mathrm{def}}{=} f(x) + \Psi(x)\right\}, \eqno(2.1)$$
where $\Psi$ is a simple closed convex function with bounded domain $Q \subset E$, and $f$ is a convex function, which is subdifferentiable on $Q$. Denote by $x^*$ one of the optimal solutions of (2.1), and $D \stackrel{\mathrm{def}}{=} \mathrm{diam}(Q)$.

Our assumption on simplicity of the function $\Psi$ means that some auxiliary optimization problems related to $\Psi$ are easily solvable. The complexity of these problems will always be discussed for the corresponding optimization schemes. The most important examples of the function $\Psi$ are as follows.

- $\Psi$ is an indicator function of a closed convex set $Q$:
$$\Psi(x) = \mathrm{Ind}_Q(x) \stackrel{\mathrm{def}}{=} \begin{cases} 0, & x \in Q,\\ +\infty, & \text{otherwise.}\end{cases} \eqno(2.2)$$
- $\Psi$ is a self-concordant barrier for a closed convex set $Q$ (see [16, 12]).
- $\Psi$ is a nonsmooth convex function with simple structure (e.g. $\Psi(x) = \|x\|_1$).
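For intuition, here is a minimal sketch of the auxiliary minimizers of $\langle s, x\rangle + \Psi(x)$ over $Q$ (the problem (2.7) introduced below) for two of these choices of $\Psi$. Both instances are our own illustrations, with the hypothetical choice $Q = [-r, r]^n$, so that the domain is bounded as the paper requires:

```python
import numpy as np

def v_indicator_box(s, r):
    # Psi = Ind_Q for Q = [-r, r]^n: minimize <s, x> over the box.
    # The minimum is attained at a vertex: x_i = -r * sign(s_i)
    # (for s_i = 0 any value works; sign() then returns the center 0).
    return -r * np.sign(s)

def v_l1_box(s, r):
    # Psi(x) = ||x||_1 on the same box: minimize s_i x_i + |x_i| coordinatewise.
    # If |s_i| <= 1 the term is minimized by x_i = 0; otherwise by the vertex.
    x = np.zeros_like(s, dtype=float)
    big = np.abs(s) > 1.0
    x[big] = -r * np.sign(s[big])
    return x
```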

We assume that the function $f$ is represented by a black-box oracle. If it is a first-order oracle, we assume that its gradients satisfy the Hölder condition
$$\|\nabla f(x) - \nabla f(y)\|_* \le G_\nu\|x-y\|^\nu, \quad x, y \in Q. \eqno(2.3)$$
The constant $G_\nu$ is formally defined for any $\nu \in (0,1]$. For some values of $\nu$, it can be $+\infty$. Note that for any $x$ and $y$ in $Q$ we have
$$f(y) \le f(x) + \langle\nabla f(x), y-x\rangle + \frac{G_\nu}{1+\nu}\|y-x\|^{1+\nu}. \eqno(2.4)$$
If it is a second-order oracle, we assume that its Hessians satisfy the Hölder condition
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le H_\nu\|x-y\|^\nu, \quad x, y \in Q. \eqno(2.5)$$
In this case, for any $x$ and $y$ in $Q$ we have
$$f(y) \le f(x) + \langle\nabla f(x), y-x\rangle + \tfrac12\langle\nabla^2 f(x)(y-x), y-x\rangle + \frac{H_\nu\|y-x\|^{2+\nu}}{(1+\nu)(2+\nu)}. \eqno(2.6)$$

For the first variant of the Conditional Gradient Method (CGM), our assumption on simplicity of the function $\Psi$ means exactly the following.

Assumption 1 For any $s \in E^*$, the auxiliary problem
$$\min_{x\in Q}\{\langle s, x\rangle + \Psi(x)\} \eqno(2.7)$$
is easily solvable. Denote by $v_\Psi(s) \in Q$ one of its optimal solutions.

In the case (2.2), this assumption implies that we are able to solve the problem $\min_x\{\langle s, x\rangle : x \in Q\}$. Note that the point $v_\Psi(s)$ is characterized by the following variational principle:
$$\langle s, x - v_\Psi(s)\rangle + \Psi(x) \ge \Psi(v_\Psi(s)), \quad x \in Q. \eqno(2.8)$$

In order to solve problem (2.1), we apply the following method.

Conditional Gradient Method, Type I

1. Choose an arbitrary point $x_0 \in Q$.

2. For $t \ge 0$ iterate:
   a) compute $v_t = v_\Psi(\nabla f(x_t))$;
   b) choose $\tau_t \in (0,1]$ and set $x_{t+1} = (1-\tau_t)x_t + \tau_t v_t$.        (2.9)

It is clear that this method can minimize only functions with continuous gradient.
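As an illustration, here is a minimal sketch of method (2.9) in Python. The callables `grad_f` (gradient of the smooth part) and `lmo` (the auxiliary solution $v_\Psi$ of (2.7)) are assumptions supplied by the user; the step-size rule $\tau_t = 2/(t+2)$ corresponds to the linear weights discussed below:

```python
import numpy as np

def cgm_type1(x0, grad_f, lmo, n_iters):
    # Sketch of CGM, Type I (2.9): move toward the linear-model minimizer.
    x = np.asarray(x0, dtype=float)
    for t in range(n_iters):
        v = lmo(grad_f(x))        # step (a): v_t = v_Psi(grad f(x_t))
        tau = 2.0 / (t + 2.0)     # step (b): tau_t from (2.10) with a_t = t
        x = (1.0 - tau) * x + tau * v
    return x

# Usage example: f(x) = 0.5 ||x - c||^2 over the box Q = [-1, 1]^n
# (Psi = Ind_Q), where c lies inside Q, so the iterates approach c.
c = np.array([0.3, -0.7, 0.1])
x_final = cgm_type1(np.zeros(3),
                    grad_f=lambda x: x - c,
                    lmo=lambda s: -np.sign(s),
                    n_iters=200)
print(x_final)
```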

Example 1 Let $\Psi(x) = \mathrm{Ind}_Q(x)$ with $Q = \{x \in \mathbb{R}^2 : (x^{(1)})^2 + (x^{(2)})^2 \le 1\}$. Define $f(x) = \max\{x^{(1)}, x^{(2)}\}$. Then clearly $x^* = \left(-\frac{1}{\sqrt2}, -\frac{1}{\sqrt2}\right)^T$. Let us choose in (2.9) $x_0 \ne x^*$. For the function $f$, we can apply an oracle which returns at any $x \in Q$ a subgradient $\nabla f(x) \in \{(1,0)^T, (0,1)^T\}$. Then, for any feasible $x$, the point $v_\Psi(\nabla f(x))$ is equal either to $y_1 = (-1,0)^T$ or to $y_2 = (0,-1)^T$. Therefore, all points of the sequence $\{x_t\}_{t\ge0}$ generated by method (2.9) belong to the triangle $\mathrm{Conv}\{x_0, y_1, y_2\}$, which does not contain the optimal point $x^*$.

In order to justify the rate of convergence of method (2.9) for functions with Hölder-continuous gradients, we apply the estimating sequences technique [12] in its relaxed form [18]. For that, it is convenient to introduce in (2.9) new control variables. Consider a sequence of nonnegative weights $\{a_t\}_{t\ge0}$. Define
$$A_t = \sum_{k=0}^t a_k, \qquad \tau_t = \frac{a_{t+1}}{A_{t+1}}, \quad t \ge 0. \eqno(2.10)$$
From now on, we assume that the parameter $\tau_t$ in method (2.9) is chosen in accordance with the rule (2.10). Denote
$$V_0 = \max_x\{\langle\nabla f(x_0), x_0 - x\rangle + \Psi(x_0) - \Psi(x)\}, \qquad B_{\nu,t} = a_0V_0 + \sum_{k=1}^t\frac{a_k^{1+\nu}}{A_k^\nu}\,G_\nu D^{1+\nu}, \quad t \ge 0. \eqno(2.11)$$
It is clear that
$$V_0 \overset{(2.4)}{\le} \max_x\left\{f(x_0) - f(x) + \frac{G_\nu}{1+\nu}\|x - x_0\|^{1+\nu} + \Psi(x_0) - \Psi(x)\right\} \le \hat f(x_0) - \hat f(x^*) + \frac{G_\nu D^{1+\nu}}{1+\nu} \stackrel{\mathrm{def}}{=} \Delta(x_0) + \frac{G_\nu D^{1+\nu}}{1+\nu}. \eqno(2.12)$$

Theorem 1 Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (2.9). Then, for any $\nu \in (0,1]$, any step $t \ge 0$, and any $x \in Q$ we have
$$A_t(f(x_t) + \Psi(x_t)) \le \sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + B_{\nu,t}. \eqno(2.13)$$
Proof:

Indeed, in view of definition (2.11), for $t = 0$ inequality (2.13) is satisfied. Assume that it is valid for some $t \ge 0$. Then
$$\begin{aligned}
\sum_{k=0}^{t+1} a_k[f(x_k) &+ \langle\nabla f(x_k), x-x_k\rangle + \Psi(x)] + B_{\nu,t}\\
&\overset{(2.13)}{\ge} A_t(f(x_t) + \Psi(x_t)) + a_{t+1}[f(x_{t+1}) + \langle\nabla f(x_{t+1}), x - x_{t+1}\rangle + \Psi(x)]\\
&\ge A_{t+1}f(x_{t+1}) + A_t\Psi(x_t) + \langle\nabla f(x_{t+1}), a_{t+1}(x - x_{t+1}) + A_t(x_t - x_{t+1})\rangle + a_{t+1}\Psi(x)\\
&\overset{(2.9\mathrm{b})}{=} A_{t+1}f(x_{t+1}) + A_t\Psi(x_t) + a_{t+1}[\Psi(x) + \langle\nabla f(x_{t+1}), x - v_t\rangle]\\
&\overset{(2.9\mathrm{b})}{\ge} A_{t+1}(f(x_{t+1}) + \Psi(x_{t+1})) + a_{t+1}[\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle].
\end{aligned}$$
It remains to note that
$$\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle \overset{(2.8)}{\ge} \langle\nabla f(x_{t+1}) - \nabla f(x_t), x - v_t\rangle \overset{(2.3)}{\ge} -\tau_t^\nu G_\nu D^{1+\nu}.$$
Thus, for keeping (2.13) valid for the next iteration, it is enough to choose $B_{\nu,t+1} = B_{\nu,t} + \frac{a_{t+1}^{1+\nu}}{A_{t+1}^\nu}G_\nu D^{1+\nu}$. $\square$

Corollary 1 For any $t \ge 0$ with $A_t > 0$, and any $\nu \in (0,1]$, we have
$$\hat f(x_t) - \hat f(x^*) \le \frac{1}{A_t}B_{\nu,t}. \eqno(2.14)$$

Let us now discuss possible variants for choosing the weights $\{a_t\}_{t\ge0}$.

1. Constant weights. Let us choose $a_t \equiv 1$, $t \ge 0$. Then $A_t = t+1$, and for $\nu \in (0,1)$ we have
$$B_{\nu,t} \overset{(2.11)}{=} V_0 + \sum_{k=1}^t\frac{G_\nu D^{1+\nu}}{(1+k)^\nu} \overset{(1.1)}{\le} V_0 + G_\nu D^{1+\nu}\int_{1/2}^{t+1/2}\frac{d\tau}{(1+\tau)^\nu} \overset{(2.12)}{\le} \Delta(x_0) + G_\nu D^{1+\nu}\left[\frac{1}{1+\nu} + \frac{1}{1-\nu}\left(\left(t+\tfrac32\right)^{1-\nu} - \left(\tfrac32\right)^{1-\nu}\right)\right].$$
Thus, for $\nu \in (0,1)$, we have $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For the most important case $\nu = 1$, we have $\lim_{\nu\to1}\frac{1}{1-\nu}\left(\left(t+\tfrac32\right)^{1-\nu} - \left(\tfrac32\right)^{1-\nu}\right) = \ln\left(1 + \tfrac23 t\right)$. Therefore,
$$\hat f(x_t) - \hat f(x^*) \le \frac{1}{t+1}\left(\Delta(x_0) + G_1D^2\left[1 + \ln\left(1 + \tfrac23 t\right)\right]\right). \eqno(2.15)$$
In this situation, in method (2.9) we take $\tau_t \overset{(2.10)}{=} \frac{1}{t+2}$.

2. Linear weights. Let us choose $a_t = t$, $t \ge 0$. Then $A_t = \frac{t(t+1)}{2}$, and for $\nu \in (0,1)$ with $t \ge 1$ we have
$$B_{\nu,t} \overset{(2.11)}{=} 2^\nu\sum_{k=1}^t\frac{k}{(1+k)^\nu}\,G_\nu D^{1+\nu} \le 2^\nu\sum_{k=1}^t k^{1-\nu}G_\nu D^{1+\nu} \le 2^\nu G_\nu D^{1+\nu}\int_{1/2}^{t+1/2}\tau^{1-\nu}d\tau = \frac{2^\nu}{2-\nu}\left[\left(t+\tfrac12\right)^{2-\nu} - \left(\tfrac12\right)^{2-\nu}\right]G_\nu D^{1+\nu}.$$
Thus, for $\nu \in (0,1)$, we again have $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For the case $\nu = 1$, we get the following bound:
$$\hat f(x_t) - \hat f(x^*) \le \frac{4}{t+1}G_1D^2, \quad t \ge 1. \eqno(2.16)$$
As we can see, this rate of convergence is better than (2.15). In this case, in method (2.9) we take $\tau_t \overset{(2.10)}{=} \frac{2}{t+2}$, which is a standard recommendation for CGM.

3. Aggressive weights. Let us choose, for example, $a_t = t^2$, $t \ge 0$. Then $A_t = \frac{t(t+1)(2t+1)}{6}$. Note that for $k \ge 0$ we have $(1+k)^\nu(2k+1)^\nu \ge 2^\nu k^{2\nu}$. Therefore, for $\nu \in (0,1)$ with $t \ge 1$ we obtain
$$B_{\nu,t} \overset{(2.11)}{=} 6^\nu\sum_{k=1}^t\frac{k^{2+\nu}}{((1+k)(2k+1))^\nu}\,G_\nu D^{1+\nu} \le 3^\nu\sum_{k=1}^t k^{2-\nu}G_\nu D^{1+\nu} \le \frac{3^\nu}{3-\nu}\left[\left(t+\tfrac12\right)^{3-\nu} - \left(\tfrac12\right)^{3-\nu}\right]G_\nu D^{1+\nu}.$$
For $\nu \in (0,1)$, we again get $\frac{1}{A_t}B_{\nu,t} \le O(t^{-\nu})$. For $\nu = 1$, we obtain
$$\hat f(x_t) - \hat f(x^*) \le \frac{9}{2t+1}G_1D^2, \quad t \ge 1, \eqno(2.17)$$
which is slightly worse than (2.16). The rule for choosing the coefficients $\tau_t$ in this situation is $\tau_t \overset{(2.10)}{=} \frac{6(t+1)}{(t+2)(2t+3)}$. It can be easily checked that a further increase of the rate of growth of the coefficients $a_t$ makes the rate of convergence of method (2.9) even worse.

Note that the above rules for choosing the coefficients $\{\tau_t\}_{t\ge0}$ in method (2.9) do not depend on the smoothness parameter $\nu \in (0,1]$. In this sense, method (2.9) is a universal method for solving problem (2.1) (see [15]). Moreover, this method does not depend on the choice of the norm in $E$. Hence, its rate of convergence can be established with respect to the best norm describing the geometry of the feasible set.

3 Trust-region variant of Frank-Wolfe Method

Let us consider a variant of method (2.9) which takes into account the composite form of the objective function in problem (2.1). For $\Psi(x) \equiv \mathrm{Ind}_Q(x)$, these two methods coincide.

Otherwise, they generate different minimization sequences.

Conditional Gradient Method, Type II

1. Choose an arbitrary point $x_0 \in Q$.

2. For $t \ge 0$ iterate: choose a coefficient $\tau_t \in (0,1]$ and compute
$$x_{t+1} = \arg\min_y\{\langle\nabla f(x_t), y\rangle + \Psi(y) : y = (1-\tau_t)x_t + \tau_t x,\ x \in Q\}. \eqno(3.1)$$

This method can be seen as a trust-region scheme [1] with a linear model of the objective function. The trust region in method (3.1) is formed by a contraction of the initial feasible set. In Section 6, we will consider a more traditional trust-region method with a quadratic model of the objective.

Note that the point $x_{t+1}$ in method (3.1) is characterized by the following variational principle:
$$x_{t+1} = (1-\tau_t)x_t + \tau_t v_t, \quad v_t \in Q, \qquad \Psi((1-\tau_t)x_t + \tau_t x) + \tau_t\langle\nabla f(x_t), x - x_t\rangle \ge \Psi(x_{t+1}) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle, \quad x \in Q. \eqno(3.2)$$

Let us choose somehow a sequence of nonnegative weights $\{a_t\}_{t\ge0}$, and define in (3.1) the coefficients $\tau_t$ in accordance with (2.10). Define the estimating functional sequence $\{\phi_t(x)\}_{t\ge0}$ as follows:
$$\phi_0(x) = a_0\hat f(x), \qquad \phi_{t+1}(x) = \phi_t(x) + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)], \quad t \ge 0. \eqno(3.3)$$
Clearly, for all $t \ge 0$ we have
$$\phi_t(x) \le A_t\hat f(x), \quad x \in Q. \eqno(3.4)$$
Denote
$$C_{\nu,t} = a_0\Delta(x_0) + \frac{1}{1+\nu}\sum_{k=1}^t\frac{a_k^{1+\nu}}{A_k^\nu}\,G_\nu D^{1+\nu}, \quad t \ge 0. \eqno(3.5)$$
Let us introduce
$$\delta(x) \stackrel{\mathrm{def}}{=} \max_{y\in Q}\{\langle\nabla f(x), x - y\rangle + \Psi(x) - \Psi(y)\} \overset{(1.4)}{=} \Psi_{x,Q}(\nabla f(x)). \eqno(3.6)$$
For problem (2.1), this value measures the level of satisfaction of the first-order optimality conditions at the point $x \in Q$. For any $x \in Q$, we have
$$\delta(x) \ge \hat f(x) - \hat f(x^*) \ge 0. \eqno(3.7)$$
We call $\delta(x)$ the total variation of the linear model of the composite objective function in problem (2.1) over the feasible set.
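For intuition, here is a sketch of one step (3.1) in a simple composite instance of our own choosing (not from the paper): $\Psi(x) = \|x\|_1$ with $Q = [-r, r]^n$. After the substitution $y = (1-\tau_t)x_t + \tau_t x$, the subproblem is separable and piecewise linear in each coordinate, so it can be solved by checking the two endpoints and the kink where the coordinate of $y$ vanishes:

```python
import numpy as np

def cgm_type2_step(x_t, grad, tau, r):
    # One step of CGM, Type II (3.1) for Psi = ||.||_1, Q = [-r, r]^n:
    # minimize tau * <grad, x> + || (1 - tau) x_t + tau x ||_1 over x in Q.
    x_new = np.empty_like(np.asarray(x_t, dtype=float))
    for i, (xi, gi) in enumerate(zip(x_t, grad)):
        candidates = [-r, r]
        kink = -(1.0 - tau) * xi / tau        # makes y_i = 0
        if -r <= kink <= r:
            candidates.append(kink)
        def phi(c):                            # coordinate objective
            return tau * gi * c + abs((1.0 - tau) * xi + tau * c)
        best = min(candidates, key=phi)
        x_new[i] = (1.0 - tau) * xi + tau * best
    return x_new

# Usage example with hypothetical data:
print(cgm_type2_step(np.array([0.9, -0.5]), np.array([1.0, -2.0]),
                     tau=0.5, r=1.0))
```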

Theorem 2 Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (3.1). Then, for any $\nu \in (0,1]$ and any step $t \ge 0$, we have
$$A_t\hat f(x_t) \le \phi_t(x) + C_{\nu,t}, \quad x \in Q. \eqno(3.8)$$
Moreover, for any $t \ge 0$ we have
$$\hat f(x_t) - \hat f(x_{t+1}) \ge \tau_t\delta(x_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu}. \eqno(3.9)$$
Proof: Let us prove inequality (3.8). For $t = 0$, we have $C_{\nu,0} = a_0[\hat f(x_0) - \hat f(x^*)]$. Thus, in this case (3.8) follows from (3.4). Assume now that (3.8) is valid for some $t \ge 0$. In view of definition (2.10), the optimality condition (3.2) can be written in the following form:
$$a_{t+1}\langle\nabla f(x_t), x - x_t\rangle \ge A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle] \quad \text{for all } x \in Q.$$
Therefore,
$$\begin{aligned}
\phi_{t+1}(x) + C_{\nu,t} &= \phi_t(x) + C_{\nu,t} + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)]\\
&\overset{(3.2),(3.8)}{\ge} A_t[f(x_t) + \Psi(x_t)] + a_{t+1}[f(x_t) + \Psi(x)] + A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle]\\
&\ge A_{t+1}[f(x_t) + \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1})]\\
&\overset{(2.4)}{\ge} A_{t+1}\left[\hat f(x_{t+1}) - \frac{G_\nu}{1+\nu}\|x_{t+1} - x_t\|^{1+\nu}\right].
\end{aligned}$$
It remains to note that $\|x_{t+1} - x_t\| = \tau_t\|x_t - v_t\| \overset{(2.10)}{\le} \frac{a_{t+1}}{A_{t+1}}D$. Thus, we can take $C_{\nu,t+1} = C_{\nu,t} + \frac{a_{t+1}^{1+\nu}}{(1+\nu)A_{t+1}^\nu}G_\nu D^{1+\nu}$.

In order to prove inequality (3.9), let us introduce the values
$$\delta_t(\tau) \stackrel{\mathrm{def}}{=} \max_{x\in Q}\{\langle\nabla f(x_t), x_t - y\rangle + \Psi(x_t) - \Psi(y) : y = (1-\tau)x_t + \tau x\} \overset{(1.5)}{=} \Psi_{\tau,x_t,Q}(\nabla f(x_t)), \quad \tau \in [0,1].$$
Clearly,
$$-\delta_t(\tau_t) = \min_{x\in Q}\{\langle\nabla f(x_t), y - x_t\rangle + \Psi(y) - \Psi(x_t) : y = (1-\tau_t)x_t + \tau_t x\} = \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1}) - \Psi(x_t) \overset{(2.4)}{\ge} \hat f(x_{t+1}) - \hat f(x_t) - \frac{G_\nu}{1+\nu}\|x_{t+1} - x_t\|^{1+\nu}.$$

Since $\|x_{t+1} - x_t\| \le \tau_t D$, we conclude that
$$\hat f(x_t) - \hat f(x_{t+1}) \ge \delta_t(\tau_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu} \overset{(1.6)}{\ge} \tau_t\delta(x_t) - \frac{G_\nu D^{1+\nu}}{1+\nu}\tau_t^{1+\nu}. \qquad\square$$
In view of (3.4), inequality (3.8) results in the following rate of convergence:
$$\hat f(x_t) - \hat f(x^*) \le \frac{1}{A_t}C_{\nu,t}, \quad t \ge 0. \eqno(3.10)$$
For the linearly growing weights $a_t = t$, $A_t = \frac{t(t+1)}{2}$, $t \ge 0$, we have already seen that
$$C_{\nu,t} = \frac{1}{1+\nu}B_{\nu,t} \le \frac{2^\nu}{(1+\nu)(2-\nu)}\left[\left(t+\tfrac12\right)^{2-\nu} - \left(\tfrac12\right)^{2-\nu}\right]G_\nu D^{1+\nu}.$$
In the case $\nu = 1$, this results in the following rate of convergence:
$$\hat f(x_t) - \hat f(x^*) \le \frac{2}{t+1}G_1D^2, \quad t \ge 1. \eqno(3.11)$$
Let us justify for this case the rate of convergence of the sequence $\{\delta(x_t)\}_{t\ge1}$. We have $\tau_t \overset{(2.10)}{=} \frac{a_{t+1}}{A_{t+1}} = \frac{2}{t+2}$. On the other hand, for any $T \ge t \ge 1$,
$$\frac{2G_1D^2}{t+1} \overset{(3.11)}{\ge} \hat f(x_t) - \hat f(x^*) \ge \sum_{k=t}^T\left(\hat f(x_k) - \hat f(x_{k+1})\right) \overset{(3.9)}{\ge} \sum_{k=t}^T\left[\tau_k\delta(x_k) - G_1D^2\tau_k^2\right]. \eqno(3.12)$$
Denote $\delta_T^* = \min_{0\le t\le T}\delta(x_t)$. Then, choosing $T = 2t+1$, we get
$$\sum_{k=t}^T\tau_k = \sum_{k=t}^{2t+1}\frac{2}{k+2} \overset{(1.2)}{\ge} 2\ln2, \qquad \sum_{k=t}^T\tau_k^2 = \sum_{k=t}^{2t+1}\frac{4}{(k+2)^2} \overset{(1.3)}{\le} \frac{8}{2t+3} = \frac{8}{T+2}.$$
Therefore,
$$2\ln2\cdot\delta_T^* \le G_1D^2\left[\frac{2}{t+1} + \frac{8}{T+2}\right] = G_1D^2\left[\frac{4}{T+1} + \frac{8}{T+2}\right] \le \frac{12G_1D^2}{T+1}.$$
Thus, in the case $\nu = 1$, for odd $T$, we get the following bound:
$$\delta_T^* \le \frac{6}{\ln2}\cdot\frac{G_1D^2}{T+1}. \eqno(3.13)$$

4 Computing the primal-dual solution

Note that both methods (2.9) and (3.1) admit computable accuracy certificates.² For the first method, denote
$$\ell_t = \frac{1}{A_t}\min_x\left\{\sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] : x \in Q\right\}.$$

² It seems that such a line of arguments was used for the first time in Section 7.5 of [8].

Clearly,
$$\hat f(x_t) - \hat f(x^*) \le \hat f(x_t) - \ell_t \overset{(2.13)}{\le} \frac{1}{A_t}B_{\nu,t}. \eqno(4.1)$$
For the second method, let us choose $a_0 = 0$. Then the estimating functions are linear:
$$\phi_t(x) = \sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)].$$
Therefore, defining $\hat\ell_t = \frac{1}{A_t}\min_x\{\phi_t(x) : x \in Q\}$, we also have
$$\hat f(x_t) - \hat f(x^*) \le \hat f(x_t) - \hat\ell_t \overset{(3.8)}{\le} \frac{1}{A_t}C_{\nu,t}, \quad t \ge 1. \eqno(4.2)$$
Accuracy certificates (4.1) and (4.2) justify that both methods (2.9) and (3.1) are able to recover some information on the optimal dual solution. However, in order to implement this ability, we need to open the Black Box and introduce an explicit model of the function $f(x)$. Let us assume that the function $f$ is representable in the following form:
$$f(x) = \max_u\{\langle Ax, u\rangle - g(u) : u \in Q_d\}, \eqno(4.3)$$
where $A : E \to E_d^*$, $Q_d$ is a closed convex set in a finite-dimensional linear space $E_d$, and the function $g(\cdot)$ is $p$-uniformly convex on $Q_d$:
$$\langle\nabla g(u_1) - \nabla g(u_2), u_1 - u_2\rangle \ge \sigma_g\|u_1 - u_2\|^p, \quad u_1, u_2 \in Q_d,$$
where the convexity degree is $p \ge 2$. It is well known (e.g. [15]) that in this case the gradient of $f$ satisfies the Hölder condition (2.3) with $\nu = \frac{1}{p-1}$ and $G_\nu = \|A\|^{1+\nu}\left(\frac{1}{\sigma_g}\right)^{\frac{1}{p-1}}$. In view of Danskin's theorem, $\nabla f(x) = A^*u(x)$, where $u(x) \in Q_d$ is the unique optimal solution of the optimization problem in the representation (4.3).

Let us write down the dual problem to (2.1):
$$\min_x\{\hat f(x) : x \in Q\} \overset{(4.3)}{=} \min_x\left\{\Psi(x) + \max_u\{\langle Ax, u\rangle - g(u) : u \in Q_d\}\right\} \ge \max_{u\in Q_d}\left\{-g(u) + \min_x\{\langle A^*u, x\rangle + \Psi(x)\}\right\}.$$
Thus, defining $\Phi(u) = \min_x\{\langle A^*u, x\rangle + \Psi(x)\}$, we get the following dual problem:
$$\max_{u\in Q_d}\left\{\bar g(u) \stackrel{\mathrm{def}}{=} -g(u) + \Phi(u)\right\}. \eqno(4.4)$$
In this problem, the objective function is nonsmooth and uniformly strongly concave of degree $p$. Clearly, we have
$$\hat f(x) - \bar g(u) \ge 0, \quad x \in Q,\ u \in Q_d. \eqno(4.5)$$
Let us show that both methods (2.9) and (3.1) are able to approximate the optimal solution of the dual problem (4.4).
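The construction shown on the next page averages the maximizers $u(x_k)$ from (4.3) with the same weights $a_k$ as the primal scheme. Here is a minimal sketch of this bookkeeping; the callables `u_of` (returning $u(x) \in Q_d$), `f_hat`, `g_bar` (primal and dual objectives), and `step` (one iteration of (2.9) or (3.1)) are placeholders standing in for the problem's max-representation oracle:

```python
def run_with_dual_certificate(x0, step, u_of, f_hat, g_bar, n_iters):
    # Maintain u_bar_t = (1/A_t) * sum_{k} a_k u(x_k) incrementally,
    # using linear weights a_t = t (so a_0 = 0, as required for (4.2)).
    x, A, u_bar, gap = x0, 0.0, None, None
    for t in range(1, n_iters + 1):
        a = t
        u = u_of(x)
        u_bar = u if u_bar is None else u_bar + (a / (A + a)) * (u - u_bar)
        A += a
        gap = f_hat(x) - g_bar(u_bar)   # duality gap, bounded as in (4.6)/(4.7)
        x = step(x, t)                  # one primal iteration
    return x, u_bar, gap
```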

Note that for any $\bar x \in Q$ we have
$$f(\bar x) + \langle\nabla f(\bar x), x - \bar x\rangle \overset{(4.3)}{=} \langle A\bar x, u(\bar x)\rangle - g(u(\bar x)) + \langle A^*u(\bar x), x - \bar x\rangle = \langle Ax, u(\bar x)\rangle - g(u(\bar x)).$$
Therefore, denoting for the first method (2.9) $\bar u_t = \frac{1}{A_t}\sum_{k=0}^t a_ku(x_k)$, we obtain
$$\ell_t = \min_{x\in Q}\left\{\Psi(x) + \frac{1}{A_t}\sum_{k=0}^t a_k[\langle Ax, u(x_k)\rangle - g(u(x_k))]\right\} = \Phi(\bar u_t) - \frac{1}{A_t}\sum_{k=0}^t a_kg(u(x_k)) \le \bar g(\bar u_t).$$
Thus, we get
$$0 \overset{(4.5)}{\le} \hat f(x_t) - \bar g(\bar u_t) \le \hat f(x_t) - \ell_t \overset{(4.1)}{\le} \frac{1}{A_t}B_{\nu,t}, \quad t \ge 0. \eqno(4.6)$$
For the second method (3.1), we need to choose $a_0 = 0$ and take $\bar u_t = \frac{1}{A_t}\sum_{k=1}^t a_ku(x_k)$. In this case, by a similar reasoning, we get
$$0 \overset{(4.5)}{\le} \hat f(x_t) - \bar g(\bar u_t) \le \hat f(x_t) - \hat\ell_t \overset{(4.2)}{\le} \frac{1}{A_t}C_{\nu,t}, \quad t \ge 1. \eqno(4.7)$$

5 Strong convexity of function $\Psi$

In this section, we assume that the function $\Psi$ in problem (2.1) is strongly convex (see, for example, Section 2.1.3 in [12]). This means that there exists a positive constant $\sigma_\Psi$ such that
$$\Psi(\tau x + (1-\tau)y) \le \tau\Psi(x) + (1-\tau)\Psi(y) - \frac{\sigma_\Psi}{2}\tau(1-\tau)\|x - y\|^2 \eqno(5.1)$$
for all $x, y \in Q$ and $\tau \in [0,1]$. Let us show that in this case CG-methods converge much faster. We demonstrate this for method (2.9). In view of the strong convexity of $\Psi$, the variational principle (2.8), characterizing the point $v_t$ in method (2.9), can be strengthened:
$$\Psi(x) + \langle\nabla f(x_t), x - v_t\rangle \ge \Psi(v_t) + \frac{\sigma_\Psi}{2}\|x - v_t\|^2, \quad x \in Q. \eqno(5.2)$$
Let $V_0$ be defined as in (2.11). Denote
$$\hat B_{\nu,t} = a_0V_0 + \sum_{k=1}^t\frac{a_k^{1+2\nu}}{2A_k^{2\nu}}\cdot\frac{G_\nu^2D^{2\nu}}{\sigma_\Psi}, \quad t \ge 0. \eqno(5.3)$$

Theorem 3 Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (2.9), and let the function $\Psi$ be strongly convex. Then, for any $\nu \in (0,1]$, any step $t \ge 0$, and any $x \in Q$ we have
$$A_t(f(x_t) + \Psi(x_t)) \le \sum_{k=0}^t a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + \hat B_{\nu,t}. \eqno(5.4)$$

Proof: The beginning of the proof of this statement is very similar to that of Theorem 1. Assuming that (5.4) is valid for some $t \ge 0$, we get the inequality
$$\sum_{k=0}^{t+1}a_k[f(x_k) + \langle\nabla f(x_k), x - x_k\rangle + \Psi(x)] + \hat B_{\nu,t} \ge A_{t+1}(f(x_{t+1}) + \Psi(x_{t+1})) + a_{t+1}[\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle].$$
Further,
$$\Psi(x) - \Psi(v_t) + \langle\nabla f(x_{t+1}), x - v_t\rangle \overset{(5.2)}{\ge} \langle\nabla f(x_{t+1}) - \nabla f(x_t), x - v_t\rangle + \frac{\sigma_\Psi}{2}\|x - v_t\|^2 \ge -\frac{1}{2\sigma_\Psi}\|\nabla f(x_{t+1}) - \nabla f(x_t)\|_*^2 \overset{(2.3)}{\ge} -\frac{1}{2\sigma_\Psi}\left(\frac{a_{t+1}}{A_{t+1}}\right)^{2\nu}G_\nu^2D^{2\nu}.$$
Thus, for keeping (5.4) valid for the next iteration, it is enough to choose $\hat B_{\nu,t+1} = \hat B_{\nu,t} + \frac{a_{t+1}^{1+2\nu}}{2\sigma_\Psi A_{t+1}^{2\nu}}G_\nu^2D^{2\nu}$. $\square$

It can be easily checked that in our situation the linear weights strategy $a_t = t$ is not the best one. Let us choose $a_t = t^2$, $t \ge 0$. Then $A_t = \frac{t(t+1)(2t+1)}{6}$, and we get
$$\hat B_{\nu,t} \overset{(5.3)}{=} \frac{G_\nu^2D^{2\nu}}{2\sigma_\Psi}\,6^{2\nu}\sum_{k=1}^t\frac{k^{2+2\nu}}{((1+k)(2k+1))^{2\nu}} \le \frac{G_\nu^2D^{2\nu}}{\sigma_\Psi}\,3^{2\nu}\sum_{k=1}^t k^{2-2\nu} \le \frac{3^{2\nu}}{3-2\nu}\left[\left(t+\tfrac12\right)^{3-2\nu} - \left(\tfrac12\right)^{3-2\nu}\right]\frac{G_\nu^2D^{2\nu}}{\sigma_\Psi}.$$
Thus, for $\nu \in (0,1)$, we get $\frac{1}{A_t}\hat B_{\nu,t} \le O(t^{-2\nu})$. For $\nu = 1$, we obtain
$$\hat f(x_t) - \hat f(x^*) \le \frac{54}{(t+1)(2t+1)}\cdot\frac{G_1^2D^2}{\sigma_\Psi}, \eqno(5.5)$$
which is much better than (2.16).

6 Second-order trust-region method

Let us assume now that in problem (2.1) the function $f$ is twice continuously differentiable. Then we can apply to this problem the following method.

Contracting Trust-Region Method

1. Choose an arbitrary point $x_0 \in Q$.

2. For $t \ge 0$ iterate: define a coefficient $\tau_t \in (0,1]$ and choose
$$x_{t+1} \in \operatorname*{Arg\,min}_y\left\{\langle\nabla f(x_t), y - x_t\rangle + \tfrac12\langle\nabla^2f(x_t)(y - x_t), y - x_t\rangle + \Psi(y) : y = (1-\tau_t)x_t + \tau_t x,\ x \in Q\right\}. \eqno(6.1)$$

Note that this scheme is well defined even if the Hessian of the function $f$ is only positive semidefinite. Of course, in general, the computational cost of each iteration of this scheme can be big. However, in one important case, when $\Psi(x)$ is an indicator function of a Euclidean ball, the complexity of each iteration of this scheme is dominated by the complexity of matrix inversion. Thus, method (6.1) can be easily applied to problems of the form
$$\min_x\{f(x) : \|x - x_0\| \le r\}, \eqno(6.2)$$
where the norm $\|\cdot\|$ is Euclidean.

Let $H_\nu < +\infty$ for some $\nu \in (0,1]$. In this section, we also assume that
$$\langle\nabla^2f(x)h, h\rangle \le L\|h\|^2, \quad x \in Q,\ h \in E. \eqno(6.3)$$
Let us choose a sequence of nonnegative weights $\{a_t\}_{t\ge0}$, and define in (6.1) the coefficients $\{\tau_t\}_{t\ge0}$ in accordance with (2.10). Define the estimating functional sequence $\{\phi_t(x)\}_{t\ge0}$ by the recurrence (3.3), where the sequence $\{x_t\}_{t\ge0}$ is generated by method (6.1). Finally, denote
$$\hat C_{\nu,t} = a_0\Delta(x_0) + \sum_{k=1}^t\frac{a_k^{2+\nu}}{A_k^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} + \sum_{k=1}^t\frac{a_k^2}{A_k}LD^2. \eqno(6.4)$$
In our convergence results, we also estimate the level of satisfaction of the second-order optimality conditions for problem (2.1) at the current test points. Let us introduce
$$\theta(x) \stackrel{\mathrm{def}}{=} \max_{y\in Q}\left\{\langle\nabla f(x), x - y\rangle - \tfrac12\langle\nabla^2f(x)(y - x), y - x\rangle + \Psi(x) - \Psi(y)\right\}. \eqno(6.5)$$
For any $x \in Q$ we have $\theta(x) \ge 0$. We call $\theta(x)$ the total variation of the quadratic model of the composite objective function in problem (2.1) over the feasible set. Denoting $F_x(y) = \tfrac12\langle\nabla^2f(x)(y - x), y - x\rangle + \Psi(y)$, we get $\theta(x) = (F_x)_{x,Q}(\nabla f(x))$ (see definition (1.4)).

Theorem 4 Let the sequence $\{x_t\}_{t\ge0}$ be generated by method (6.1). Then, for any $\nu \in (0,1]$ and any step $t \ge 0$ we have
$$A_t\hat f(x_t) \le \phi_t(x) + \hat C_{\nu,t}, \quad x \in Q. \eqno(6.6)$$
Moreover, for any $t \ge 0$ we have
$$\hat f(x_t) - \hat f(x_{t+1}) \ge \tau_t\theta(x_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu}. \eqno(6.7)$$
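Before proving the theorem, it may help to see how a single step (6.1) is computed in the ball-constrained case (6.2). The contracted feasible set $\{(1-\tau)x_t + \tau x : \|x - x_0\| \le r\}$ is again a Euclidean ball, with center $c = (1-\tau)x_t + \tau x_0$ and radius $\tau r$, so the step is a standard trust-region subproblem. The following is a minimal sketch (our own, assuming a positive-definite Hessian for the linear solves; function names are hypothetical), solving it by bisection on the multiplier:

```python
import numpy as np

def contracted_tr_step(x_t, g, H, x0, r, tau, tol=1e-10):
    # Minimize <g, y - x_t> + 0.5 <H (y - x_t), y - x_t> over ||y - c|| <= rho,
    # where c and rho describe the contracted ball of step (6.1).
    c = (1.0 - tau) * x_t + tau * x0
    rho = tau * r
    d0 = c - x_t                       # shift between ball center and x_t

    def y_of(lam):
        # Stationarity of the model plus lam/2 * ||y - c||^2:
        # (H + lam I)(y - c) = -(g + H d0).
        h = np.linalg.solve(H + lam * np.eye(len(g)), -(g + H @ d0))
        return c + h

    y = y_of(1e-12)                    # (near-)unconstrained model minimizer
    if np.linalg.norm(y - c) <= rho:
        return y
    lo, hi = 0.0, 1.0
    while np.linalg.norm(y_of(hi) - c) > rho:
        hi *= 2.0                      # grow lam until the step fits the ball
    while hi - lo > tol * (1.0 + hi):  # bisection on the multiplier
        lam = 0.5 * (lo + hi)
        if np.linalg.norm(y_of(lam) - c) > rho:
            lo = lam
        else:
            hi = lam
    return y_of(hi)
```

Each trial value of the multiplier costs one linear solve, which is consistent with the remark above that the iteration cost is dominated by matrix inversion.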

Proof: Let us prove inequality (6.6). For $t = 0$, $\hat C_{\nu,0} = a_0[\hat f(x_0) - \hat f(x^*)]$, and this inequality is valid. Note that the point $x_{t+1}$ is characterized by the following variational principle:
$$x_{t+1} = (1-\tau_t)x_t + \tau_t v_t, \quad v_t \in Q, \qquad \Psi(y) + \langle\nabla f(x_t) + \nabla^2f(x_t)(x_{t+1} - x_t), y - x_{t+1}\rangle \ge \Psi(x_{t+1}), \quad y = (1-\tau_t)x_t + \tau_t x,\ x \in Q.$$
Therefore, in view of definition (2.10), for any $x \in Q$ we have
$$\begin{aligned}
a_{t+1}\langle\nabla f(x_t), x - x_t\rangle &\ge A_{t+1}\langle\nabla f(x_t) + \nabla^2f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + a_{t+1}\langle\nabla^2f(x_t)(x_{t+1} - x_t), x_t - x\rangle\\
&\qquad + A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)]\\
&\overset{(6.3)}{\ge} A_{t+1}\langle\nabla f(x_t) + \nabla^2f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)] - \frac{a_{t+1}^2}{A_{t+1}}LD^2.
\end{aligned}$$
Hence,
$$\begin{aligned}
A_t\hat f(x_t) &+ a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)]\\
&\ge A_t\Psi(x_t) + A_{t+1}[f(x_t) + \langle\nabla f(x_t) + \nabla^2f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle] + a_{t+1}\Psi(x) + A_{t+1}[\Psi(x_{t+1}) - \Psi((1-\tau_t)x_t + \tau_t x)] - \frac{a_{t+1}^2}{A_{t+1}}LD^2\\
&\overset{(2.6)}{\ge} A_{t+1}[f(x_{t+1}) + \Psi(x_{t+1})] - A_{t+1}\frac{H_\nu\|x_{t+1} - x_t\|^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}}LD^2\\
&\ge A_{t+1}\hat f(x_{t+1}) - \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}}LD^2.
\end{aligned}$$
Thus, if (6.6) is valid for some $t \ge 0$, then
$$\phi_{t+1}(x) + \hat C_{\nu,t} \ge A_t\hat f(x_t) + a_{t+1}[f(x_t) + \langle\nabla f(x_t), x - x_t\rangle + \Psi(x)] \ge A_{t+1}\hat f(x_{t+1}) - \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} - \frac{a_{t+1}^2}{A_{t+1}}LD^2.$$
Therefore, we can take $\hat C_{\nu,t+1} = \hat C_{\nu,t} + \frac{a_{t+1}^{2+\nu}}{A_{t+1}^{1+\nu}}\cdot\frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)} + \frac{a_{t+1}^2}{A_{t+1}}LD^2$.

In order to justify inequality (6.7), let us introduce the values
$$\theta_t(\tau) \stackrel{\mathrm{def}}{=} \max_{x\in Q}\left\{\langle\nabla f(x_t), x_t - y\rangle - \tfrac12\langle\nabla^2f(x_t)(y - x_t), y - x_t\rangle + \Psi(x_t) - \Psi(y) : y = (1-\tau)x_t + \tau x\right\} \overset{(1.5)}{=} (F_{x_t})_{\tau,x_t,Q}(\nabla f(x_t)), \quad \tau \in [0,1].$$
Clearly,
$$-\theta_t(\tau_t) = \min_{x\in Q}\left\{\langle\nabla f(x_t), y - x_t\rangle + \tfrac12\langle\nabla^2f(x_t)(y - x_t), y - x_t\rangle + \Psi(y) - \Psi(x_t) : y = (1-\tau_t)x_t + \tau_t x\right\} = \langle\nabla f(x_t), x_{t+1} - x_t\rangle + \tfrac12\langle\nabla^2f(x_t)(x_{t+1} - x_t), x_{t+1} - x_t\rangle + \Psi(x_{t+1}) - \Psi(x_t) \overset{(2.6)}{\ge} \hat f(x_{t+1}) - \hat f(x_t) - \frac{H_\nu}{(1+\nu)(2+\nu)}\|x_{t+1} - x_t\|^{2+\nu}.$$
Since $\|x_{t+1} - x_t\| \le \tau_t D$, we conclude that
$$\hat f(x_t) - \hat f(x_{t+1}) \ge \theta_t(\tau_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu} \overset{(1.6)}{\ge} \tau_t\theta(x_t) - \frac{H_\nu D^{2+\nu}}{(1+\nu)(2+\nu)}\tau_t^{2+\nu}. \qquad\square$$
Thus, inequality (6.6) ensures the following rate of convergence of method (6.1):
$$\hat f(x_t) - \hat f(x^*) \le \frac{1}{A_t}\hat C_{\nu,t}. \eqno(6.8)$$
A particular expression of the right-hand side of this inequality for different values of $\nu \in (0,1]$ can be obtained exactly in the same way as it was done in Section 2. Here, we restrict ourselves to the case $\nu = 1$ and $a_t = t^2$, $t \ge 0$. Then $A_t = \frac{t(t+1)(2t+1)}{6}$, and
$$\sum_{k=1}^t\frac{a_k^3}{A_k^2} = \sum_{k=1}^t\frac{36k^6}{k^2(k+1)^2(2k+1)^2} \le 18t, \qquad \sum_{k=1}^t\frac{a_k^2}{A_k} = \sum_{k=1}^t\frac{6k^4}{k(k+1)(2k+1)} \le 3\sum_{k=1}^t k = \frac32t(t+1).$$
Thus, we get
$$\hat f(x_t) - \hat f(x^*) \le \frac{18H_1D^3}{(t+1)(2t+1)} + \frac{9LD^2}{2t+1}. \eqno(6.9)$$
Note that the rate of convergence (6.9) is worse than the convergence rate of the cubic regularization of the Newton method [17]. However, to the best of our knowledge, inequality (6.9) gives the first global rate of convergence for an optimization scheme belonging to the family of trust-region methods [1]. In view of inequality (6.6), the optimal solution of the dual problem (4.4) can be approximated by method (6.1) with $a_0 = 0$ in the same way as it was suggested in Section 4 for Conditional Gradient Methods.

Let us estimate now the rate of decrease of the values $\theta(x_t)$, $t \ge 0$, in the case $\nu = 1$. Note that $\tau_t \overset{(2.10)}{=} \frac{a_{t+1}}{A_{t+1}} = \frac{6(t+1)}{(t+2)(2t+3)}$. It is easy to see that these coefficients satisfy the inequalities
$$\frac{3}{t+3} \le \tau_t \le \frac{6}{2t+5}, \quad t \ge 0. \eqno(6.10)$$
Therefore, choosing the total number of steps $T = 2t+2$, we have
$$\sum_{k=t}^T\tau_k \overset{(6.10)}{\ge} \sum_{k=t}^{2t+2}\frac{3}{k+3} \overset{(1.2)}{\ge} 3\ln2, \qquad \sum_{k=t}^T\tau_k^3 \overset{(6.10)}{\le} \sum_{k=t}^{2t+2}\frac{27}{(k+5/2)^3} \le 27\int_{t-1/2}^{2t+5/2}\frac{d\tau}{(\tau+5/2)^3} = \frac{27}{2}\left[\frac{1}{(t+2)^2} - \frac{1}{(2t+5)^2}\right] = \frac{27(3T+8)(T+4)}{2(T+2)^2(T+3)^2} \le \frac{81}{2(T+1)(T+2)}. \eqno(6.11)$$
Now we can use the same trick as at the end of Section 3. Denote $\theta_T^* = \min_{0\le t\le T}\theta(x_t)$. Then
$$\frac{36H_1D^3}{T(T-1)} + \frac{9LD^2}{T-1} \overset{(6.9)}{\ge} \hat f(x_t) - \hat f(x^*) \ge \sum_{k=t}^T\left(\hat f(x_k) - \hat f(x_{k+1})\right) \overset{(6.7)}{\ge} \theta_T^*\sum_{k=t}^T\tau_k - \frac{H_1D^3}{6}\sum_{k=t}^T\tau_k^3 \overset{(6.11)}{\ge} 3\ln2\cdot\theta_T^* - \frac{27H_1D^3}{4(T+1)(T+2)}.$$
Since $\frac{27}{4(T+1)(T+2)} \le \frac{7}{T(T-1)}$, for even $T$ we get the following bound:
$$\theta_T^* \le \frac{1}{3\ln2}\left[\frac{43H_1D^3}{T(T-1)} + \frac{9LD^2}{T-1}\right]. \eqno(6.12)$$

References

[1] A.R. Conn, N.I.M. Gould, and Ph.L. Toint. Trust-Region Methods. SIAM, Philadelphia, 2000.

[2] J. Dunn. Convergence rates for conditional gradient sequences generated by implicit step length rules. SIAM Journal on Control and Optimization, 18(5), 1980.

[3] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95-110, 1956.

[4] R.M. Freund and P. Grigas. New analysis and results for the Frank-Wolfe method. Mathematical Programming, published online, 2014.

[5] D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with application to online and stochastic optimization. arXiv preprint, 2013.

[6] Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, published online, 2014.

[7] M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, 2013.

[8] A. Juditsky and A. Nemirovski. Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles. Mathematical Programming, Ser. A, published online.

[9] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, 2013.

[10] G. Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint, 2014.

[11] A. Migdalas. A regularization of the Frank-Wolfe method and unification of certain nonlinear programming methods. Mathematical Programming, 63, 1994.

[12] Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, Boston, 2004.

[13] Yu. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221-259, 2009.

[14] Yu. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125-161, 2013.

[15] Yu. Nesterov. Universal gradient methods for convex optimization problems. Mathematical Programming, published online, 2014.

[16] Yu. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.

[17] Yu. Nesterov and B. Polyak. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

[18] Yu. Nesterov and V. Shikhman. Quasi-monotone subgradient methods for nonsmooth convex minimization. JOTA, published online, 2014.


More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

6. Proximal gradient method

6. Proximal gradient method L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 5, No. 1, pp. 115 19 c 015 Society for Industrial and Applied Mathematics DUALITY BETWEEN SUBGRADIENT AND CONDITIONAL GRADIENT METHODS FRANCIS BACH Abstract. Given a convex optimization

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming

Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone

More information

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen

More information

The Frank-Wolfe Algorithm:

The Frank-Wolfe Algorithm: The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology

More information

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

On Conically Ordered Convex Programs

On Conically Ordered Convex Programs On Conically Ordered Convex Programs Shuzhong Zhang May 003; revised December 004 Abstract In this paper we study a special class of convex optimization problems called conically ordered convex programs

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

Worst case complexity of direct search

Worst case complexity of direct search EURO J Comput Optim (2013) 1:143 153 DOI 10.1007/s13675-012-0003-7 ORIGINAL PAPER Worst case complexity of direct search L. N. Vicente Received: 7 May 2012 / Accepted: 2 November 2012 / Published online:

More information

A Proximal Method for Identifying Active Manifolds

A Proximal Method for Identifying Active Manifolds A Proximal Method for Identifying Active Manifolds W.L. Hare April 18, 2006 Abstract The minimization of an objective function over a constraint set can often be simplified if the active manifold of the

More information

Global convergence of the Heavy-ball method for convex optimization

Global convergence of the Heavy-ball method for convex optimization Noname manuscript No. will be inserted by the editor Global convergence of the Heavy-ball method for convex optimization Euhanna Ghadimi Hamid Reza Feyzmahdavian Mikael Johansson Received: date / Accepted:

More information