1 First Order Methods for Nonsmooth Convex Large-Scale Optimization, I: General Purpose Methods


Anatoli Juditsky
Laboratoire Jean Kuntzmann, Université J. Fourier
B. P., Grenoble Cedex, France

Arkadi Nemirovski
School of Industrial and Systems Engineering, Georgia Institute of Technology
765 Ferst Drive NW, Atlanta, Georgia 30332, USA

We discuss several state-of-the-art computationally cheap (as opposed to the polynomial time Interior Point algorithms) first order methods for minimizing convex objectives over simple large-scale feasible sets. Our emphasis is on the general situation of a nonsmooth convex objective represented by a deterministic/stochastic First Order oracle, and on methods which, under favorable circumstances, exhibit a (nearly) dimension-independent convergence rate.

1.1 Introduction

At present, essentially the entire field of Convex Programming is within the grasp of polynomial time Interior Point methods (IPMs) capable of solving convex programs to high accuracy at a low iteration count. However, the iteration cost of all known polynomial time methods grows nonlinearly with the problem's design dimension $n$ (the number of decision variables), something like $n^3$. As a result, as the design dimension grows, polynomial time methods eventually become impractical; roughly speaking, a single iteration lasts forever.

What "eventually" means here depends on the problem's structure. For instance, typical Linear Programming programs of decision-making origin have extremely sparse constraint matrices, and IPMs are able to solve programs of this type with tens and hundreds of thousands of variables and constraints in reasonable time. In contrast, Linear Programming programs arising in Machine Learning and Signal Processing often have dense constraint matrices; such programs with just a few thousand variables and constraints can become very difficult for an IPM. At the present level of our knowledge, the methods of choice when solving convex programs which, because of their size, are beyond the practical grasp of IPMs are the First Order methods (FOMs) with computationally cheap iterations. In this chapter, we present several state-of-the-art FOMs for large-scale convex optimization, focusing on the most general, nonsmooth unstructured case, where the convex objective $f$ to be minimized can be nonsmooth and is represented by a black box: a routine capable of computing the values and subgradients of $f$.

First Order methods: limits of performance. We start with explaining what can and what cannot be expected from FOMs, restricting ourselves for the time being to convex programs of the form

$$\mathrm{Opt}(f)=\min_{x\in X} f(x), \tag{1.1}$$

where $X$ is a compact convex subset of $\mathbb{R}^n$, and $f$ is known to belong to a given family $\mathcal{F}$ of convex and (at least) Lipschitz continuous functions on $X$. Formally, a FOM is an algorithm $\mathcal{B}$ which knows in advance what $X$ and $\mathcal{F}$ are, but does not know which $f\in\mathcal{F}$ it is minimizing. It is restricted to learn $f$ via subsequent calls to a First Order oracle: a routine which, given on input a point $x\in X$, returns on output the value $f(x)$ and a (sub)gradient $f'(x)$ of $f$ at $x$ (informally speaking, this setting implicitly assumes that $X$ is simple — like a box, a ball, or the standard simplex — while $f$ can be complicated). Specifically, as applied to a particular objective $f\in\mathcal{F}$ and given on input a required accuracy $\epsilon>0$, the method $\mathcal{B}$ generates a finite sequence of search points $x_t\in X$, $t=1,2,\dots$, at which the First Order oracle is called, then terminates and outputs an approximate solution $\bar x\in X$ which should be $\epsilon$-optimal: $f(\bar x)-\mathrm{Opt}(f)\le\epsilon$. In other words, the method itself is a collection of rules for generating subsequent search points, identifying the terminal step, and building the approximate solution. These rules can, in principle, be arbitrary, with the only limitation of being non-anticipating, meaning that the output of a rule is uniquely defined by $X$ and by the first order information on $f$ accumulated before the rule is applied. As a result, for given $\mathcal{B}$ and $X$, $x_1$ is independent of $f$, $x_2$ depends solely on $f(x_1)$, $f'(x_1)$, and so on. Similarly, the decision to terminate after a particular number $t$ of steps, as well as the resulting approximate solution $\bar x$, are uniquely defined by the first order information $f(x_1),f'(x_1),\dots,f(x_t),f'(x_t)$ accumulated in the course of these $t$ steps.

The limits of performance of FOMs are given by Information-Based Complexity Theory, which says what, for given $X$, $\mathcal{F}$, $\epsilon$, is the minimal number of steps of a FOM solving all problems (1.1) with $f\in\mathcal{F}$ within accuracy $\epsilon$. Here are several instructive examples (see Nemirovski and Yudin, 1983):

(a) Let $X\subset\{x\in\mathbb{R}^n:\|x\|_p\le R\}$, where $p\in\{1,2\}$, and let $\mathcal{F}=\mathcal{F}_p$ comprise all convex functions $f$ which are Lipschitz continuous, with a given constant $L$, w.r.t. $\|\cdot\|_p$. When $X=\{x\in\mathbb{R}^n:\|x\|_p\le R\}$, the number $N$ of steps of any FOM capable of solving every problem from the just outlined family within accuracy $\epsilon$ is at least $O(1)\min[n,\,L^2R^2/\epsilon^2]$.¹ When $p=2$, this lower complexity bound remains true when $\mathcal{F}$ is restricted to be the family of all functions of the type $f(x)=\max_{1\le i\le n}[\epsilon_i Lx_i+a_i]$ with $\epsilon_i=\pm1$. Moreover, the bound is nearly achievable: whenever $X\subset\{x\in\mathbb{R}^n:\|x\|_p\le R\}$, there exist quite transparent (and simple to implement when $X$ is simple) FOMs capable of solving all problems (1.1) with $f\in\mathcal{F}_p$ within accuracy $\epsilon$ in $O(1)(\ln(n))^{2/p-1}L^2R^2/\epsilon^2$ steps.

It should be stressed that the outlined nearly dimension-independent performance of FOMs depends heavily on the assumption $p\in\{1,2\}$.² With $p$ set to $\infty$ (i.e., when minimizing convex functions that are Lipschitz continuous, with constant $L$ w.r.t. $\|\cdot\|_\infty$, over the box $X=\{x\in\mathbb{R}^n:\|x\|_\infty\le R\}$), the lower and the upper complexity bounds are $O(1)\,n\ln(LR/\epsilon)$, provided that $LR/\epsilon\ge2$; these bounds depend heavily on the problem's dimension.

(b) Let $X=\{x\in\mathbb{R}^n:\|x\|_2\le R\}$, and let $\mathcal{F}$ comprise all differentiable convex functions with Lipschitz continuous, with constant $L$ w.r.t. $\|\cdot\|_2$, gradient. Then the number $N$ of steps of any FOM capable of solving every problem from the just outlined family within accuracy $\epsilon$ is at least $O(1)\min[n,\sqrt{LR^2/\epsilon}]$. This lower complexity bound remains true when $\mathcal{F}$ is restricted to be the family of convex quadratic forms $\frac12x^TAx+b^Tx$ with positive semidefinite symmetric matrices $A$ of spectral norm (maximal singular value) not exceeding $L$. Here again the lower complexity bound is nearly achievable: whenever $X\subset\{x\in\mathbb{R}^n:\|x\|_2\le R\}$, there exists a FOM, simple to implement when $X$ is simple (although by far not transparent) — the famous Nesterov's optimal algorithm for smooth convex minimization (Nesterov, 1983, 2005) — which solves within accuracy $\epsilon$ all problems (1.1) with $f\in\mathcal{F}$ in $O(1)\sqrt{LR^2/\epsilon}$ steps.

1. From now on, all $O(1)$'s are appropriate positive absolute constants.
2. In fact, this assumption can be relaxed to $1\le p\le2$.

(c) Let $X$ be as in (b), and let $\mathcal{F}$ comprise all functions of the form $f(x)=\|Ax-b\|_2$, where the spectral norm of $A$ (which is not positive semidefinite anymore) does not exceed a given $L$. Let us slightly extend the power of the First Order oracle and assume that at a step of a FOM we observe $b$ (but not $A$) and are allowed to carry out $O(1)$ matrix-vector multiplications involving $A$ and $A^T$. In this case, the number of steps of any method capable of solving all problems in question within accuracy $\epsilon$ is at least $O(1)\min[n,\,LR/\epsilon]$, and there exists a method (specifically, Nesterov's optimal algorithm as applied to the quadratic form $\|Ax-b\|_2^2$) which achieves the desired accuracy in $O(1)LR/\epsilon$ steps.

The outlined results bring us both bad and good news on FOMs as applied to large-scale convex programs. The bad news is that, unless the number of steps of the method exceeds the problem's design dimension $n$ (which is of no interest when $n$ is really large), and without imposing severe additional restrictions on the objectives to be minimized, a FOM can exhibit only a sublinear rate of convergence: denoting by $t$ the number of steps, the rate is $O(1)(\ln(n))^{1/p-1/2}LR/t^{1/2}$ in the case of (a) (better than nothing, but really slow), $O(1)LR^2/t^2$ in the case of (b) (much better, but alas, simple $X$ along with smooth $f$ is a rare commodity), and $O(1)LR/t$ in the case of (c) (in-between (a) and (b)). As a consequence, FOMs are poorly suited for building high-accuracy solutions to large-scale convex problems. The good news is that for problems with favorable geometry (e.g., those in (a)-(c)), good FOMs exhibit a dimension-independent, or nearly so, rate of convergence, which is of paramount importance in large-scale applications. Another piece of good news (not stated explicitly in the above examples) is that when $X$ is simple, typical FOMs have cheap iterations: modulo the computations hidden in the oracle, an iteration costs just $O(\dim X)$ arithmetic operations.

The bottom line is that FOMs are well suited for finding medium-accuracy solutions to large-scale convex problems, at least when the latter possess favorable geometry.

Another conclusion of the presented results is that the limits of performance of FOMs depend heavily on the size $R$ of the feasible domain and on the Lipschitz constant $L$ (of $f$ in the case of (a), of $f'$ in the case of (b)). This is in sharp contrast to IPMs, where the complexity bounds depend logarithmically on the magnitudes of an optimal solution and of the data (the analogies of $R$ and $L$, respectively), which, practically speaking, allows one to handle problems with unbounded domains (one may impose an upper bound, say $10^6$, on the variables) and not to bother much about how the data is scaled.³

3. In IPMs, scaling of the data affects the stability of the methods w.r.t. rounding errors, but this is another story.

Severe dependence of the complexity of FOMs on $L$ and $R$ implies a number of important consequences. In particular:

— Boundedness of $X$ is of paramount importance, at least theoretically. In this respect, unconstrained settings, as in Lasso: $\min_x\{\lambda\|x\|_1+\|Ax-b\|_2^2\}$, are less preferable than their bounded-domain counterparts, like $\min\{\|Ax-b\|_2:\|x\|_1\le R\}$,⁴ in full accordance with common sense: however difficult it is to find a needle in a haystack, a small haystack in this respect is better than a large one!

— For a given problem (1.1), the size $R$ of the feasible domain and the Lipschitz constant $L$ of the objective depend on the norm $\|\cdot\|$ used to quantify these quantities: $R=R_{\|\cdot\|}$, $L=L_{\|\cdot\|}$. When $\|\cdot\|$ varies, the product $L_{\|\cdot\|}R_{\|\cdot\|}$ (and this product is all that matters) changes,⁵ and this phenomenon should be taken into account when choosing a FOM for a particular problem.

What is ahead. The literature on FOMs, which always was huge, has in the latest years been growing in a somewhat explosive fashion, partly due to the rapidly increasing demand for large-scale optimization, and partly for endogenous reasons stemming primarily from the discovery of ways (Nesterov, 2005) to accelerate FOMs by exploiting the problem's structure (for more details on the latter subject, see chapter 2). Even a brief overview of this literature in a single chapter would be completely unrealistic. Our primary selection criteria were (a) to focus on techniques for large-scale nonsmooth convex programs (these are the problems arising in most applications known to us), (b) to restrict ourselves to FOMs possessing state-of-the-art (in some cases, even provably optimal) non-asymptotic efficiency estimates, and (c) the possibility of a self-contained presentation of the methods, given our space limitations. Last but not least, we preferred to focus on situations of which we have first-hand (or nearly so) knowledge. As a result, our presentation of FOMs, while instructive (at least, so we hope), is definitely incomplete. As for the citation policy, we restrict ourselves to references directly related to what we are presenting, with no attempt to give even a nearly exhaustive list of references to the FOM literature. We apologize in advance for potential omissions even in this reduced list.

4. We believe that the desire to end up with unconstrained problems stems from the common belief that unconstrained convex minimization is simpler than constrained minimization. To the best of our understanding, this belief is somewhat misleading: the actual distinction is between optimization over simple and over sophisticated domains, and what is simple depends on the method in question.
5. For example, the ratio $[L_{\|\cdot\|_2}R_{\|\cdot\|_2}]/[L_{\|\cdot\|_1}R_{\|\cdot\|_1}]$ can be as small as $1/\sqrt n$ and as large as $\sqrt n$.

In this chapter, we focus on the simplest general-purpose FOMs: the Mirror Descent (MD) methods aimed at solving nonsmooth convex minimization problems, specifically, general-type problems (1.1) (section 1.2), problems (1.1) with strongly convex objectives (section 1.4), convex problems with functional constraints $\min_{x\in X}\{f_0(x): f_i(x)\le0,\ 1\le i\le m\}$ (section 1.3), and stochastic versions of problems (1.1), where the First Order oracle is replaced with its stochastic counterpart providing unbiased random estimates of the subgradients of the objective rather than the subgradients themselves (section 1.5). Finally, section 1.6 presents extensions of the Mirror Descent scheme from problems of convex minimization to convex-concave saddle point problems.

As was already said, this chapter is devoted to general purpose FOMs, meaning that the methods in question are fully black-box-oriented: they do not assume any a priori knowledge of the structure of the objective (and of the functional constraints, if any) aside from convexity and Lipschitz continuity. By itself, this generality is somewhat redundant: convex programs arising in applications always possess a lot of structure known in advance, and utilizing a priori knowledge of this structure can dramatically accelerate the solution process. Acceleration of FOMs by utilizing the problem's structure is the subject of the subsequent chapter.

1.2 Mirror Descent Algorithm: Minimizing over a Simple Set

1.2.1 Problem of interest

We primarily focus on solving an optimization problem of the form

$$\mathrm{Opt}=\min_{x\in X} f(x), \tag{1.2}$$

where $X\subset E$ is a closed convex set in a finite-dimensional Euclidean space $E$, and $f: X\to\mathbb{R}$ is a Lipschitz continuous convex function represented by a First Order oracle. This oracle is a routine which, given on input a point $x\in X$, returns the value $f(x)$ and a subgradient $f'(x)$ of $f$ at $x$. We always assume that the selection $f'(x)$ is bounded on $X$. We assume also that (1.2) is solvable.

1.2.2 Mirror Descent setup

We set up the MD method with two entities:

— a norm $\|\cdot\|$ on the space $E$ embedding $X$, and the conjugate norm $\|\cdot\|_*$ on $E$:

$$\|\xi\|_*=\max_x\{\langle\xi,x\rangle:\|x\|\le1\};$$

— a distance-generating function (d.-g.f., for short) for $X$ compatible with the norm $\|\cdot\|$, that is, a continuous convex function $\omega(x): X\to\mathbb{R}$ such that

-- $\omega(x)$ admits a selection $\omega'(x)$ of subgradient which is continuous on the set $X^o=\{x\in X:\partial\omega(x)\neq\emptyset\}$;

-- $\omega(\cdot)$ is strongly convex, with modulus 1, w.r.t. $\|\cdot\|$:

$$\forall(x,x'\in X^o):\quad \langle\omega'(x)-\omega'(x'),\,x-x'\rangle\ \ge\ \|x-x'\|^2. \tag{1.3}$$

For $x\in X^o$, $u\in X$, let

$$V_x(u)=\omega(u)-\omega(x)-\langle\omega'(x),u-x\rangle. \tag{1.4}$$

Denote $x_c=\mathop{\mathrm{argmin}}_{u\in X}\omega(u)$ (the existence of a minimizer is given by the continuity and strong convexity of $\omega$ on $X$ and by the closedness of $X$, and its uniqueness by the strong convexity of $\omega$). When $X$ is bounded, we define the $\omega(\cdot)$-diameter

$$\Omega=\max_{u\in X}V_{x_c}(u)\ \le\ \max_X\omega(u)-\min_X\omega(u)$$

of $X$. Given $x\in X^o$, we define the prox-mapping $\mathrm{Prox}_x(\xi): E\to X^o$ as follows:

$$\mathrm{Prox}_x(\xi)=\mathop{\mathrm{argmin}}_{u\in X}\{\langle\xi,u\rangle+V_x(u)\}. \tag{1.5}$$

From now on we make the Simplicity assumption: $X$ and $\omega$ are simple and fit each other; specifically, given $x\in X^o$ and $\xi\in E$, it is easy to compute $\mathrm{Prox}_x(\xi)$.

1.2.3 Basic Mirror Descent algorithm

The MD algorithm associated with the outlined setup, as applied to problem (1.2), is the recurrence

$$\begin{array}{ll}
(a) & x_1=\mathop{\mathrm{argmin}}_{x\in X}\omega(x),\\
(b) & x_{t+1}=\mathrm{Prox}_{x_t}(\gamma_t f'(x_t)),\quad t=1,2,\dots,\\
(c) & x^t=\left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\sum_{\tau=1}^t\gamma_\tau x_\tau,\\
(d) & \widehat x^t=\mathop{\mathrm{argmin}}_{x\in\{x_1,\dots,x_t\}}f(x).
\end{array} \tag{1.6}$$

Here $x_t$ are the subsequent search points, and $x^t$ (or $\widehat x^t$ — the error bounds to follow work for both these choices) are the subsequent approximate solutions generated by the algorithm; note that $x_t\in X^o$ and $x^t,\widehat x^t\in X$ for all $t$.
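To make the recurrence concrete, here is a minimal Python sketch of (1.6) under the Simplicity assumption. The oracle `subgrad`, the stepsize rule `step`, and the prox routine `prox` are placeholder names of ours (not part of the text); the Euclidean-ball projection below is just one instance of a cheap prox-mapping (cf. the Euclidean setup in section 1.7.1).

```python
import numpy as np

def mirror_descent(prox, subgrad, x1, step, T):
    """Sketch of the MD recurrence (1.6): x_{t+1} = Prox_{x_t}(gamma_t f'(x_t)),
    returning the aggregated solution x^T = (sum_t gamma_t x_t) / (sum_t gamma_t)."""
    x = x1
    weighted_sum = np.zeros_like(x1)   # running sum of gamma_t * x_t
    weight = 0.0                       # running sum of gamma_t
    for t in range(1, T + 1):
        g = subgrad(x)                 # First Order oracle call at x_t
        gamma = step(t, g)             # stepsize gamma_t
        weighted_sum += gamma * x
        weight += gamma
        x = prox(x, gamma * g)         # prox-step (1.6b)
    return weighted_sum / weight       # x^T from (1.6c)

def prox_euclidean_ball(x, xi):
    # Euclidean setup: Prox_x(xi) is the metric projection of x - xi onto
    # X = {u : ||u||_2 <= 1}, the unit Euclidean ball.
    u = x - xi
    nrm = np.linalg.norm(u)
    return u if nrm <= 1.0 else u / nrm
```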

The convergence properties of MD stem from the following simple observation.

Proposition 1.1. Suppose that $f$ is Lipschitz continuous on $X$ with $L:=\sup_{x\in X}\|f'(x)\|_*<\infty$, and let $f^t=\max[f(x^t),f(\widehat x^t)]$. Then:

(i) For all $u\in X$, $t\ge1$, one has

$$\sum_{\tau=1}^t\gamma_\tau\langle f'(x_\tau),x_\tau-u\rangle\ \le\ V_{x_1}(u)+\frac12\sum_{\tau=1}^t\gamma_\tau^2\|f'(x_\tau)\|_*^2\ \le\ V_{x_1}(u)+\frac{L^2}{2}\sum_{\tau=1}^t\gamma_\tau^2. \tag{1.7}$$

As a result, for all $t\ge1$,

$$f^t-\mathrm{Opt}\ \le\ \epsilon_t:=\frac{V_{x_1}(x_*)+\frac{L^2}{2}\sum_{\tau=1}^t\gamma_\tau^2}{\sum_{\tau=1}^t\gamma_\tau}, \tag{1.8}$$

where $x_*$ is an optimal solution to (1.2). In particular, in the divergent series case ($\gamma_t\to0$ and $\sum_{\tau=1}^t\gamma_\tau\to+\infty$ as $t\to\infty$), the algorithm converges: $f^t-\mathrm{Opt}\to0$ as $t\to\infty$. Moreover, with the stepsizes

$$\gamma_t=\frac{\gamma}{\|f'(x_t)\|_*\sqrt t},$$

one has, for all $t$,

$$f^t-\mathrm{Opt}\ \le\ O(1)\left[\frac{V_{x_1}(x_*)}{\gamma}+\ln(t+1)\,\gamma\right]Lt^{-1/2}. \tag{1.9}$$

(ii) Let $X$ be bounded, so that the $\omega(\cdot)$-diameter $\Omega$ of $X$ is finite. Then for every number $N$ of steps, the $N$-step MD algorithm with the constant stepsizes

$$\gamma_t=\frac{\sqrt{2\Omega}}{L\sqrt N},\quad 1\le t\le N, \tag{1.10}$$

ensures that

$$f_*^N:=\min_{u\in X}\frac1N\sum_{\tau=1}^N\big[f(x_\tau)+\langle f'(x_\tau),u-x_\tau\rangle\big]\ \le\ \mathrm{Opt},\qquad f^N-\mathrm{Opt}\ \le\ f^N-f_*^N\ \le\ \sqrt{\frac{2\Omega}{N}}\,L. \tag{1.11}$$

In other words, the quality of the approximate solutions ($x^N$ or $\widehat x^N$) can be certified by the easy-to-compute online lower bound $f_*^N$ on $\mathrm{Opt}$, and the certified level of non-optimality of the solutions can only be better than the one given by the worst-case upper bound in the right hand side of (1.11).

Proof. From the definition of the prox-mapping,

$$x_{\tau+1}=\mathop{\mathrm{argmin}}_{z\in X}\big\{\langle\gamma_\tau f'(x_\tau)-\omega'(x_\tau),z\rangle+\omega(z)\big\},$$

whence, by the optimality conditions,

$$\langle\gamma_\tau f'(x_\tau)-\omega'(x_\tau)+\omega'(x_{\tau+1}),\,u-x_{\tau+1}\rangle\ \ge\ 0\qquad\forall u\in X.$$

Rearranging terms, this inequality can be rewritten as

$$\begin{array}{rcl}
\gamma_\tau\langle f'(x_\tau),x_\tau-u\rangle &\le& \big[\omega(u)-\omega(x_\tau)-\langle\omega'(x_\tau),u-x_\tau\rangle\big]-\big[\omega(u)-\omega(x_{\tau+1})-\langle\omega'(x_{\tau+1}),u-x_{\tau+1}\rangle\big]\\[2pt]
&&{}+\gamma_\tau\langle f'(x_\tau),x_\tau-x_{\tau+1}\rangle-\big[\omega(x_{\tau+1})-\omega(x_\tau)-\langle\omega'(x_\tau),x_{\tau+1}-x_\tau\rangle\big]\\[2pt]
&=& V_{x_\tau}(u)-V_{x_{\tau+1}}(u)+\underbrace{\big[\gamma_\tau\langle f'(x_\tau),x_\tau-x_{\tau+1}\rangle-V_{x_\tau}(x_{\tau+1})\big]}_{\delta_\tau}.
\end{array} \tag{1.12}$$

From the strong convexity of $V_{x_\tau}$ it follows that

$$\delta_\tau\ \le\ \gamma_\tau\langle f'(x_\tau),x_\tau-x_{\tau+1}\rangle-\tfrac12\|x_\tau-x_{\tau+1}\|^2\ \le\ \gamma_\tau\|f'(x_\tau)\|_*\|x_\tau-x_{\tau+1}\|-\tfrac12\|x_\tau-x_{\tau+1}\|^2\ \le\ \max_s\big[\gamma_\tau\|f'(x_\tau)\|_*s-\tfrac12s^2\big]=\frac{\gamma_\tau^2}{2}\|f'(x_\tau)\|_*^2,$$

and we get

$$\gamma_\tau\langle f'(x_\tau),x_\tau-u\rangle\ \le\ V_{x_\tau}(u)-V_{x_{\tau+1}}(u)+\gamma_\tau^2\|f'(x_\tau)\|_*^2/2. \tag{1.13}$$

Summing up these inequalities over $\tau=1,\dots,t$ and taking into account that $V_x(u)\ge0$, we arrive at (1.7). With $u=x_*$, taking into account that $\langle f'(x_\tau),x_\tau-x_*\rangle\ge f(x_\tau)-\mathrm{Opt}$ and setting $\tilde f^t=\left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\sum_{\tau=1}^t\gamma_\tau f(x_\tau)$, (1.7) results in

$$\tilde f^t-\mathrm{Opt}\ \le\ \frac{V_{x_1}(x_*)+\frac{L^2}{2}\sum_{\tau=1}^t\gamma_\tau^2}{\sum_{\tau=1}^t\gamma_\tau}.$$

Since, clearly, $f^t=\max[f(x^t),f(\widehat x^t)]\le\tilde f^t$, we have arrived at (1.8). This inequality straightforwardly implies the remaining results of (i).

To prove (ii), note that by the definition of $\Omega$ and due to $x_1=\mathop{\mathrm{argmin}}_X\omega$, (1.7) combines with (1.10) to imply that

$$\tilde f^N-f_*^N=\max_{u\in X}\left[\tilde f^N-\frac1N\sum_{\tau=1}^N\big[f(x_\tau)+\langle f'(x_\tau),u-x_\tau\rangle\big]\right]\ \le\ \sqrt{\frac{2\Omega}{N}}\,L. \tag{1.14}$$

Since $f$ is convex, the function $\frac1N\sum_{\tau=1}^N[f(x_\tau)+\langle f'(x_\tau),u-x_\tau\rangle]$ underestimates $f(u)$ everywhere on $X$, so that $f_*^N\le\mathrm{Opt}$; and, as we have seen, $f^N\le\tilde f^N$, so that (ii) follows from (1.14).
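As a toy illustration of Proposition 1.1(ii), the following hypothetical usage of the sketch above runs MD with the constant stepsizes (1.10) on $f(x)=\|x-a\|_1$ over the unit Euclidean ball, for which $\Omega=1/2$ and $L=\sup_x\|f'(x)\|_2=\sqrt n$ (all data below are made up):

```python
import numpy as np

n, N = 1000, 5000
a = np.ones(n) / np.sqrt(n)                        # hypothetical data
subgrad = lambda x: np.sign(x - a)                 # a subgradient of ||x - a||_1
L, Omega = np.sqrt(n), 0.5
step = lambda t, g: np.sqrt(2 * Omega) / (L * np.sqrt(N))   # stepsizes (1.10)
x_N = mirror_descent(prox_euclidean_ball, subgrad, np.zeros(n), step, N)
```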

1.3 Problems with Functional Constraints

The MD algorithm can be easily extended from the case of problem (1.2) to the case of the problem

$$\mathrm{Opt}=\min_{x\in X}\{f_0(x): f_i(x)\le0,\ 1\le i\le m\}, \tag{1.15}$$

where $f_i$, $0\le i\le m$, are Lipschitz continuous convex functions on $X$ given by a First Order oracle which, given on input $x\in X$, returns the values $f_i(x)$ and subgradients $f_i'(x)$ of the $f_i$ at $x$, with bounded on $X$ selections of the subgradients $f_i'(\cdot)$. Consider the following $N$-step algorithm:

1. Initialization: Set $x_1=\mathop{\mathrm{argmin}}_X\omega$.

2. Step $t$, $1\le t\le N$: Given $x_t\in X$, call the First Order oracle, $x_t$ being the input, and check whether

$$f_i(x_t)\ \le\ \gamma\|f_i'(x_t)\|_*,\quad i=1,\dots,m. \tag{1.16}$$

If this is the case ("productive step"), set $i(t)=0$; otherwise ("non-productive step"), choose $i(t)\in\{1,\dots,m\}$ such that $f_{i(t)}(x_t)>\gamma\|f_{i(t)}'(x_t)\|_*$. Set

$$\gamma_t=\gamma/\|f_{i(t)}'(x_t)\|_*,\qquad x_{t+1}=\mathrm{Prox}_{x_t}\big(\gamma_tf_{i(t)}'(x_t)\big).$$

When $t<N$, loop to step $t+1$.

3. Termination: After the $N$ steps are executed, output, as the approximate solution $\widehat x^N$, the best (with the smallest value of $f_0$) of the points $x_t$ associated with productive steps $t$; if there were no productive steps, claim (1.15) infeasible.
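A minimal Python sketch of this $N$-step scheme follows; the oracle convention (a list of value/subgradient pairs for $f_0,\dots,f_m$) and the use of the Euclidean dual norm are simplifying assumptions of ours, not prescriptions of the text.

```python
import numpy as np

def md_constrained(prox, oracle, x1, gamma, N):
    """Sketch of the N-step MD for (1.15). `oracle(x)` is assumed to return
    [(f_i(x), g_i)] for i = 0..m, with g_i a subgradient of f_i at x; the dual
    norm is taken Euclidean for simplicity. Returns the best productive point,
    or None (= claim (1.15) infeasible)."""
    x, best = x1, None
    for _ in range(N):
        vals = oracle(x)
        # look for a constraint violating the test (1.16)
        bad = next((i for i in range(1, len(vals))
                    if vals[i][0] > gamma * np.linalg.norm(vals[i][1])), None)
        i = 0 if bad is None else bad      # i(t); productive step iff i == 0
        if i == 0 and (best is None or vals[0][0] < best[0]):
            best = (vals[0][0], x.copy())  # best productive point so far
        g = vals[i][1]
        x = prox(x, (gamma / np.linalg.norm(g)) * g)  # gamma_t * f'_{i(t)}(x_t)
    return best
```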

Proposition 1.2. Let $X$ be bounded. Given an integer $N\ge1$, let us set $\gamma=\sqrt{2\Omega/N}$. Then:

(i) If (1.15) is feasible, $\widehat x^N$ is well defined.

(ii) Whenever $\widehat x^N$ is well defined, one has

$$\max\big[f_0(\widehat x^N)-\mathrm{Opt},\ f_1(\widehat x^N),\dots,f_m(\widehat x^N)\big]\ \le\ \gamma L=\sqrt{\frac{2\Omega}{N}}\,L,\qquad L=\max_{0\le i\le m}\sup_{x\in X}\|f_i'(x)\|_*. \tag{1.17}$$

Proof. By construction, when $\widehat x^N$ is well defined, it is some $x_t$ with productive $t$, whence $f_i(\widehat x^N)\le\gamma L$ for $1\le i\le m$ by (1.16). It remains to verify that when (1.15) is feasible, $\widehat x^N$ is well defined and $f_0(\widehat x^N)\le\mathrm{Opt}+\gamma L$. Assume that this is not the case, whence at every productive step $t$, if any, we have $f_0(x_t)-\mathrm{Opt}>\gamma\|f_0'(x_t)\|_*$. Let $x_*$ be an optimal solution to (1.15). Exactly the same reasoning as in the proof of Proposition 1.1 yields the following analogy of (1.7) (with $u=x_*$):

$$\sum_{t=1}^N\gamma_t\langle f_{i(t)}'(x_t),x_t-x_*\rangle\ \le\ \Omega+\frac12\sum_{t=1}^N\gamma_t^2\|f_{i(t)}'(x_t)\|_*^2\ =\ 2\Omega. \tag{1.18}$$

Now, when $t$ is non-productive, we have

$$\gamma_t\langle f_{i(t)}'(x_t),x_t-x_*\rangle\ \ge\ \gamma_tf_{i(t)}(x_t)\ >\ \gamma^2,$$

the concluding inequality being given by the definition of $i(t)$ and $\gamma_t$. When $t$ is productive, we have

$$\gamma_t\langle f_{i(t)}'(x_t),x_t-x_*\rangle\ =\ \gamma_t\langle f_0'(x_t),x_t-x_*\rangle\ \ge\ \gamma_t(f_0(x_t)-\mathrm{Opt})\ >\ \gamma^2,$$

the concluding inequality being given by the definition of $\gamma_t$ and our assumption that $f_0(x_t)-\mathrm{Opt}>\gamma\|f_0'(x_t)\|_*$ at all productive steps $t$. The bottom line is that the left hand side in (1.18) is $>N\gamma^2=2\Omega$, which contradicts (1.18).

1.4 Minimizing Strongly Convex Functions

The MD algorithm can be modified to attain the rate $O(1/t)$ in the case where the objective $f$ in (1.2) is strongly convex. Strong convexity of $f$ with modulus $\kappa>0$ means that

$$\forall(x,x'\in X):\quad \langle f'(x)-f'(x'),x-x'\rangle\ \ge\ \kappa\|x-x'\|^2. \tag{1.19}$$

Further, let $\omega$ be a d.-g.f. for the entire $E$ (not just for $X$, which may be unbounded in this case) compatible with $\|\cdot\|$. W.l.o.g., let $0=\mathop{\mathrm{argmin}}_E\omega$, and let

$$\Omega=\max_{\|u\|\le1}\omega(u)-\omega(0)$$

be the variation of $\omega$ on the unit ball of $\|\cdot\|$. Now let

$$\omega_{R,z}(u)=\omega\!\left(\frac{u-z}{R}\right),\qquad V_x^{R,z}(u)=\omega_{R,z}(u)-\omega_{R,z}(x)-\langle\omega_{R,z}'(x),u-x\rangle.$$

Given $z\in X$ and $R>0$, we define the prox-mapping

$$\mathrm{Prox}_x^{R,z}(\xi)=\mathop{\mathrm{argmin}}_{u\in X}\big[\langle\xi,u\rangle+V_x^{R,z}(u)\big]$$

and the recurrence (cf. (1.6))

$$x_{t+1}=\mathrm{Prox}_{x_t}^{R,z}(\gamma_tf'(x_t)),\ t=1,2,\dots;\qquad x^t(R,z)=\left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\sum_{\tau=1}^t\gamma_\tau x_\tau. \tag{1.20}$$

We start with the following analogue of Proposition 1.1.

Proposition 1.3. Let $f$ be strongly convex on $X$ with modulus $\kappa>0$ and Lipschitz continuous on $X$ with $L:=\sup_{x\in X}\|f'(x)\|_*<\infty$. Given $R>0$ and $t\ge1$, suppose that $\|x_1-x_*\|\le R$, where $x_*$ is the minimizer of $f$ on $X$, and let the stepsizes $\gamma_\tau$ satisfy

$$\gamma_\tau=\frac{\sqrt{2\Omega}}{RL\sqrt t},\quad 1\le\tau\le t. \tag{1.21}$$

Then, after $t$ iterations of (1.20), one has

$$f(x^t(R,x_1))-\mathrm{Opt}\ \le\ \frac1t\sum_{\tau=1}^t\langle f'(x_\tau),x_\tau-x_*\rangle\ \le\ LR\sqrt{\frac{2\Omega}{t}}, \tag{1.22}$$

$$\|x^t(R,x_1)-x_*\|^2\ \le\ \frac1{t\kappa}\sum_{\tau=1}^t\langle f'(x_\tau),x_\tau-x_*\rangle\ \le\ \frac{LR}{\kappa}\sqrt{\frac{2\Omega}{t}}. \tag{1.23}$$

Proof. Observe that the modulus of strong convexity of the function $\omega_{R,x_1}(\cdot)$ w.r.t. the norm $\|\cdot\|_R=\|\cdot\|/R$ is 1, and the norm conjugate to $\|\cdot\|_R$ is $R\|\cdot\|_*$. Following the steps of the proof of Proposition 1.1, with $\|\cdot\|_R$ and $\omega_{R,x_1}(\cdot)$ in the roles of $\|\cdot\|$ and $\omega$, respectively, we come to the following analogue of (1.7):

$$\forall u\in X:\quad \sum_{\tau=1}^t\gamma_\tau\langle f'(x_\tau),x_\tau-u\rangle\ \le\ V_{x_1}^{R,x_1}(u)+\frac12\sum_{\tau=1}^tR^2L^2\gamma_\tau^2\ \le\ \Omega+\frac12\sum_{\tau=1}^tR^2L^2\gamma_\tau^2.$$

Setting $u=x_*$ (so that $V_{x_1}^{R,x_1}(x_*)\le\Omega$ due to $\|x_1-x_*\|\le R$) and substituting the value (1.21) of $\gamma_\tau$, we come to (1.22). Further, from the strong convexity of $f$ it follows that $\langle f'(x_\tau),x_\tau-x_*\rangle\ge\kappa\|x_\tau-x_*\|^2$, which combines with the definition of $x^t(R,x_1)$ to imply the first inequality in (1.23) (recall that $\gamma_\tau$ is independent of $\tau$, so that $x^t(R,x_1)=\frac1t\sum_{\tau=1}^tx_\tau$, and $\|\cdot\|^2$ is convex). The second inequality in (1.23) follows from (1.22).

Proposition 1.3 states that the smaller $R$ is (i.e., the closer the initial guess $x_1$ is to $x_*$), the better the accuracy of the approximate solution $x^t(R,x_1)$ will be, both in terms of $f$ and in terms of the distance to $x_*$. When the upper bound on this distance, as given by (1.23), becomes small, we can restart the MD using $x^t(\cdot)$ as the improved initial point, compute a new approximate solution, and so on. The algorithm below is a simple implementation of this idea.

Suppose that $x_1\in X$ and $R_0\ge\|x_*-x_1\|$ are given. The algorithm is as follows:

1. Initialization: Set $y_0=x_1$.

2. Stage $k=1,2,\dots$: Set $N_k=\mathrm{Ceil}\!\left(2^{k+2}\frac{L^2\Omega}{\kappa^2R_0^2}\right)$, where $\mathrm{Ceil}(t)$ is the smallest integer $\ge t$, and compute $y_k=x^{N_k}(R_{k-1},y_{k-1})$ according to (1.20), with

$$\gamma_t=\gamma_k:=\frac{\sqrt{2\Omega}}{LR_{k-1}\sqrt{N_k}},\quad 1\le t\le N_k.$$

Set $R_k^2=2^{-k}R_0^2$ and pass to stage $k+1$.
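A sketch of this multistage scheme in Python, assuming a routine `md_run` implementing $N$ steps of the recurrence (1.20) with the indicated constant stepsize (the name and interface are ours):

```python
import math

def md_restarts(md_run, y0, R0, L, kappa, Omega, n_stages):
    """Restart scheme of section 1.4: at stage k, run N_k steps of (1.20)
    from y_{k-1} within radius R_{k-1}, then halve the squared radius."""
    y, R = y0, R0
    for k in range(1, n_stages + 1):
        Nk = math.ceil(2 ** (k + 2) * L ** 2 * Omega / (kappa ** 2 * R0 ** 2))
        gamma_k = math.sqrt(2 * Omega) / (L * R * math.sqrt(Nk))
        y = md_run(y, R, Nk, gamma_k)   # y_k = x^{N_k}(R_{k-1}, y_{k-1})
        R = R / math.sqrt(2.0)          # so that R_k^2 = 2^{-k} R_0^2
    return y
```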

For the search points $x_1,\dots,x_{N_k}$ of the $k$-th stage of the method, we define

$$\delta_k=\frac1{N_k}\sum_{\tau=1}^{N_k}\langle f'(x_\tau),x_\tau-x_*\rangle.$$

Let $k_*$ be the smallest integer such that $k_*\ge1$ and $2^{k_*+2}\frac{L^2\Omega}{\kappa^2R_0^2}>k_*$, and let

$$M_k=\sum_{j=1}^kN_j,\quad k=1,2,\dots$$

Note that $M_k$ is the total number of prox-steps carried out at the first $k$ stages.

Proposition 1.4. Setting $y_0=x_1$, the points $y_k$, $k=0,1,\dots$, generated by the above algorithm satisfy the following relations:

$$\|y_k-x_*\|^2\ \le\ R_k^2=2^{-k}R_0^2,\quad k=0,1,\dots, \qquad (I_k)$$

$$f(y_k)-\mathrm{Opt}\ \le\ \delta_k\ \le\ \kappa R_k^2=\kappa2^{-k}R_0^2,\quad k=1,2,\dots \qquad (J_k)$$

As a result:

(i) When $1\le k<k_*$, one has $M_k\le5k$ and

$$f(y_k)-\mathrm{Opt}\ \le\ \kappa2^{-k}R_0^2; \tag{1.24}$$

(ii) When $k\ge k_*$, one has

$$f(y_k)-\mathrm{Opt}\ \le\ \frac{16L^2\Omega}{\kappa M_k}. \tag{1.25}$$

The proposition says that when the approximate solution $y_k$ is far from $x_*$, the method converges linearly; when approaching $x_*$, it slows down, switching to the rate $O(1/t)$.

Proof. Let us prove $(I_k)$, $(J_k)$ by induction in $k$. $(I_0)$ is valid due to $y_0=x_1$ and the origin of $R_0$. Assume that for some $m\ge1$ the relations $(I_k)$, $(J_k)$ are valid for $0\le k\le m-1$, and let us prove that then $(I_m)$, $(J_m)$ are valid as well. Applying Proposition 1.3 with $R=R_{m-1}$, $x_1=y_{m-1}$ (so that $\|x_*-x_1\|\le R$ by $(I_{m-1})$) and $t=N_m$, we get

$$(a):\ f(y_m)-\mathrm{Opt}\ \le\ \delta_m\ \le\ LR_{m-1}\sqrt{\frac{2\Omega}{N_m}},\qquad (b):\ \|y_m-x_*\|^2\ \le\ \frac{LR_{m-1}}{\kappa}\sqrt{\frac{2\Omega}{N_m}}.$$

Since $R_{m-1}^2=2^{1-m}R_0^2$ by $(I_{m-1})$ and $N_m\ge2^{m+2}\frac{L^2\Omega}{\kappa^2R_0^2}$, (b) implies $(I_m)$, and (a) implies $(J_m)$. The induction is complete.

Now let us prove that $M_k\le5k$ for $1\le k<k_*$. Indeed, for such a $k$ and for $1\le j\le k$, we have $N_j=1$ when $2^{j+2}\frac{L^2\Omega}{\kappa^2R_0^2}<1$; let this be the case for $j<j_0$, and let $N_j\le2^{j+3}\frac{L^2\Omega}{\kappa^2R_0^2}$ for $j_0\le j\le k$. It follows that when $j_0>k$, we have $M_k=k$.

When $j_0\le k$, we have

$$\bar M:=\sum_{j=j_0}^kN_j\ \le\ 2^{k+4}\frac{L^2\Omega}{\kappa^2R_0^2}\ \le\ 4k$$

(the concluding inequality is due to $k<k_*$), whence $M_k\le j_0-1+\bar M\le5k$, as claimed. Invoking $(J_k)$, we arrive at (i).

To prove (ii), let $k\ge k_*$, whence $N_k\ge k+1$. We have

$$2^{k+3}\frac{L^2\Omega}{\kappa^2R_0^2}\ \ge\ \sum_{j=1}^k2^{j+2}\frac{L^2\Omega}{\kappa^2R_0^2}\ \ge\ \sum_{j=1}^k(N_j-1)\ =\ M_k-k\ \ge\ M_k/2,$$

where the concluding inequality stems from the fact that $N_k\ge k+1$, and therefore $M_k\ge\sum_{j=1}^{k-1}N_j+N_k\ge(k-1)+(k+1)=2k$. Thus $M_k\le2^{k+4}\frac{L^2\Omega}{\kappa^2R_0^2}$, that is, $2^{-k}\le\frac{16L^2\Omega}{M_k\kappa^2R_0^2}$, and the right hand side of $(J_k)$ is $\le\frac{16L^2\Omega}{M_k\kappa}$.

1.5 Mirror Descent Stochastic Approximation

The MD algorithm can be extended to the case when the objective $f$ in (1.2) is given by a Stochastic oracle: a routine which, at its $t$-th call, the query point being $x_t\in X$, returns a vector $G(x_t,\xi_t)$, where $\xi_1,\xi_2,\dots$ are independent, identically distributed "oracle noises." We assume that for all $x\in X$ it holds that

$$\mathbb{E}\big\{\|G(x,\xi)\|_*^2\big\}\ \le\ L^2<\infty\quad\text{and}\quad \|g(x)-f'(x)\|_*\le\mu,\qquad g(x):=\mathbb{E}\{G(x,\xi)\}. \tag{1.26}$$

Replacing in (1.6) the subgradients $f'(x_t)$ with their stochastic estimates $G(x_t,\xi_t)$, we arrive at the Robust Mirror Descent Stochastic Approximation (RMDSA). The convergence properties of this procedure are presented in the following counterpart of Proposition 1.1.

Proposition 1.5. Let $X$ be bounded. Given an integer $N\ge1$, consider the $N$-step RMDSA with the stepsizes

$$\gamma_t=\sqrt{2\Omega}/[L\sqrt N],\quad 1\le t\le N. \tag{1.27}$$

Then

$$\mathbb{E}\big\{f(x^N)-\mathrm{Opt}\big\}\ \le\ \sqrt{2\Omega}\,L/\sqrt N+2\sqrt{2\Omega}\,\mu. \tag{1.28}$$

Proof. Let $\xi^t=[\xi_1;\dots;\xi_t]$, so that $x_t$ is a deterministic function of $\xi^{t-1}$. Exactly the same reasoning as in the proof of Proposition 1.1 results in the following analogy of (1.7):

$$\sum_{\tau=1}^N\gamma_\tau\langle G(x_\tau,\xi_\tau),x_\tau-x_*\rangle\ \le\ \Omega+\frac12\sum_{\tau=1}^N\gamma_\tau^2\|G(x_\tau,\xi_\tau)\|_*^2. \tag{1.29}$$

Observe that $x_\tau$ is a deterministic function of $\xi^{\tau-1}$, so that

$$\mathbb{E}_{\xi_\tau}\big\{\langle G(x_\tau,\xi_\tau),x_\tau-x_*\rangle\big\}=\langle g(x_\tau),x_\tau-x_*\rangle\ \ge\ \langle f'(x_\tau),x_\tau-x_*\rangle-\mu D,$$

where $D=\max_{x,x'\in X}\|x-x'\|$ is the $\|\cdot\|$-diameter of $X$. Now, taking expectations of both sides of (1.29), we get

$$\mathbb{E}\left\{\sum_{\tau=1}^N\gamma_\tau\langle f'(x_\tau),x_\tau-x_*\rangle\right\}\ \le\ \Omega+\frac{L^2}{2}\sum_{\tau=1}^N\gamma_\tau^2+\mu D\sum_{\tau=1}^N\gamma_\tau.$$

In the same way as in the proof of Proposition 1.1, we conclude that the left hand side of this inequality is $\ge\left[\sum_{\tau=1}^N\gamma_\tau\right]\mathbb{E}\{f(x^N)-\mathrm{Opt}\}$, so that

$$\mathbb{E}\{f(x^N)-\mathrm{Opt}\}\ \le\ \frac{\Omega+\frac{L^2}{2}\sum_{\tau=1}^N\gamma_\tau^2}{\sum_{\tau=1}^N\gamma_\tau}+\mu D. \tag{1.30}$$

Observe that when $x\in X$, we have $\omega(x)-\omega(x_1)-\langle\omega'(x_1),x-x_1\rangle\ge\frac12\|x-x_1\|^2$ by the strong convexity of $\omega$, and $\omega(x)-\omega(x_1)-\langle\omega'(x_1),x-x_1\rangle\le\omega(x)-\omega(x_1)\le\Omega$ (since $x_1=\mathop{\mathrm{argmin}}_X\omega$ and thus $\langle\omega'(x_1),x-x_1\rangle\ge0$). Thus $\|x-x_1\|\le\sqrt{2\Omega}$ for every $x\in X$, whence $D:=\max_{x,x'\in X}\|x-x'\|\le2\sqrt{2\Omega}$. This relation combines with (1.30) and (1.27) to imply (1.28).
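Implementation-wise, RMDSA is the plain MD recurrence with the exact oracle call swapped for its stochastic counterpart; a minimal sketch (names ours):

```python
import math
import numpy as np

def rmdsa(prox, stoch_grad, x1, L, Omega, N):
    """N-step RMDSA: MD (1.6) with f'(x_t) replaced by G(x_t, xi_t) and the
    constant stepsizes (1.27); returns the average x^N of the search points."""
    gamma = math.sqrt(2 * Omega) / (L * math.sqrt(N))
    x = x1
    avg = np.zeros_like(x1)
    for _ in range(N):
        avg += x / N                          # accumulate x^N (equal weights)
        x = prox(x, gamma * stoch_grad(x))    # stoch_grad(x) plays G(x, xi)
    return avg                                # E{f(x^N) - Opt} obeys (1.28)
```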

1.6 Mirror Descent for Convex-Concave Saddle Point Problems

We are about to demonstrate that the MD scheme can be naturally extended from problems of convex minimization to convex-concave saddle point problems.

1.6.1 Preliminaries

Convex-concave saddle point problem. A convex-concave saddle point (c.-c.s.p.) problem reads

$$\mathrm{SadVal}=\inf_{x\in X}\sup_{y\in Y}\phi(x,y), \tag{1.31}$$

where $X\subset E_x$, $Y\subset E_y$ are nonempty closed convex sets in the respective Euclidean spaces $E_x$, $E_y$. The cost function $\phi(x,y)$ is a continuous function on $Z=X\times Y\subset E=E_x\times E_y$ which is convex in the variable $x\in X$ and concave in the variable $y\in Y$; the quantity $\mathrm{SadVal}$ is called the saddle point value of $\phi$ on $Z$. By definition, (precise) solutions to (1.31) are the saddle points of $\phi$ on $Z$, that is, points $(x_*,y_*)\in Z$ such that $\phi(x,y_*)\ge\phi(x_*,y_*)\ge\phi(x_*,y)$ for all $(x,y)\in Z$. The data of problem (1.31) give rise to a primal-dual pair of convex optimization problems

$$\mathrm{Opt}(P)=\min_{x\in X}\overline\phi(x),\qquad \overline\phi(x)=\sup_{y\in Y}\phi(x,y),\qquad\qquad (P)$$

$$\mathrm{Opt}(D)=\max_{y\in Y}\underline\phi(y),\qquad \underline\phi(y)=\inf_{x\in X}\phi(x,y).\qquad\qquad (D)$$

$\phi$ possesses saddle points on $Z$ if and only if the problems $(P)$, $(D)$ are solvable with equal optimal values. Whenever saddle points exist, they are exactly the pairs $(x_*,y_*)$ comprised of optimal solutions $x_*$, $y_*$ to the respective problems $(P)$, $(D)$, and for every such pair $(x_*,y_*)$ we have

$$\phi(x_*,y_*)=\overline\phi(x_*)=\mathrm{Opt}(P)=\mathrm{SadVal}:=\inf_{x\in X}\sup_{y\in Y}\phi(x,y)=\sup_{y\in Y}\inf_{x\in X}\phi(x,y)=\mathrm{Opt}(D)=\underline\phi(y_*).$$

From now on we assume that (1.31) is solvable.

Remark 1.6. With our basic assumptions on $\phi$ (continuity and convexity-concavity on $X\times Y$) and on $X$, $Y$ (nonemptiness, convexity, and closedness), (1.31) definitely is solvable if either $X$ and $Y$ are bounded, or both $X$ and all level sets $\{y\in Y:\underline\phi(y)\ge a\}$, $a\in\mathbb{R}$, of $\underline\phi$ are bounded; these are the only situations we are about to consider in this chapter and in chapter 2.

Saddle point accuracy measure. A natural way to quantify the accuracy of a candidate solution $z=(x,y)\in Z$ to the c.-c.s.p. problem (1.31) is given by the gap

$$\epsilon_{\mathrm{sad}}(z)=\sup_{\eta\in Y}\phi(x,\eta)-\inf_{\xi\in X}\phi(\xi,y)=\overline\phi(x)-\underline\phi(y)=\big[\overline\phi(x)-\mathrm{Opt}(P)\big]+\big[\mathrm{Opt}(D)-\underline\phi(y)\big], \tag{1.32}$$

where the concluding equality is given by the fact that, by our standing assumption, $\phi$ has a saddle point, and thus $\mathrm{Opt}(P)=\mathrm{Opt}(D)$. We see that $\epsilon_{\mathrm{sad}}(x,y)$ is just the sum of the non-optimalities, in terms of the respective objectives, of $x$ as an approximate solution to $(P)$ and of $y$ as an approximate solution to $(D)$.

Monotone operator associated with (1.31). Let $\partial_x\phi(x,y)$ be the set of all subgradients w.r.t. $x$ of (the convex function) $\phi(\cdot,y)$, taken at a point $x\in X$, and let $\partial_y[-\phi(x,y)]$ be the set of all subgradients w.r.t. $y$ of (the convex function) $-\phi(x,\cdot)$, taken at a point $y\in Y$. We can associate with $\phi$ the point-to-set operator

$$\Phi(x,y)=\big\{\Phi_x(x,y)=\partial_x\phi(x,y)\big\}\times\big\{\Phi_y(x,y)=\partial_y[-\phi(x,y)]\big\}.$$

The domain $\mathrm{Dom}\,\Phi:=\{(x,y):\Phi(x,y)\neq\emptyset\}$ of this operator comprises all pairs $(x,y)\in Z$ for which the corresponding subdifferentials are nonempty; it definitely contains the relative interior $\mathrm{rint}\,Z=\mathrm{rint}\,X\times\mathrm{rint}\,Y$ of $Z$, and the values of $\Phi$ on its domain are direct products of nonempty closed convex sets in $E_x$ and $E_y$. It is well known (and easily seen) that $\Phi$ is monotone:

$$\forall(z,z'\in\mathrm{Dom}\,\Phi,\ F\in\Phi(z),\ F'\in\Phi(z')):\quad \langle F-F',z-z'\rangle\ \ge\ 0,$$

and the saddle points of $\phi$ are exactly the points $z_*$ such that $0\in\Phi(z_*)$. An equivalent characterization of saddle points, more convenient in our context, is as follows: $z_*$ is a saddle point of $\phi$ if and only if, for some (and then for every) selection $F(\cdot)$ of $\Phi$ (i.e., a vector field $F(z):\mathrm{rint}\,Z\to E$ such that $F(z)\in\Phi(z)$ for every $z\in\mathrm{rint}\,Z$), one has

$$\langle F(z),z-z_*\rangle\ \ge\ 0\qquad\forall z\in\mathrm{rint}\,Z. \tag{1.33}$$

1.6.2 Saddle Point Mirror Descent

Here we assume that $Z$ is bounded and $\phi$ is Lipschitz continuous on $Z$ (whence, in particular, the domain of the associated monotone operator $\Phi$ is the entire $Z$). The setup of the MD algorithm involves a norm $\|\cdot\|$ on the embedding space $E=E_x\times E_y$ of $Z$ and a d.-g.f. $\omega(\cdot)$ for $Z$ compatible with this norm. For $z\in Z^o$, $u\in Z$, let (cf. the definition (1.4))

$$V_z(u)=\omega(u)-\omega(z)-\langle\omega'(z),u-z\rangle,$$

and let $z_c=\mathop{\mathrm{argmin}}_{u\in Z}\omega(u)$. We assume that, given $z\in Z^o$ and $\xi\in E$, it is easy to compute the prox-mapping

$$\mathrm{Prox}_z(\xi)=\mathop{\mathrm{argmin}}_{u\in Z}\big[\langle\xi,u\rangle+V_z(u)\big]\ \Big(=\mathop{\mathrm{argmin}}_{u\in Z}\big[\langle\xi-\omega'(z),u\rangle+\omega(u)\big]\Big).$$

We denote by $\Omega=\max_{u\in Z}V_{z_c}(u)\le\max_Z\omega(\cdot)-\min_Z\omega(\cdot)$ the $\omega(\cdot)$-diameter of $Z$ (cf. section 1.2.2). Let a First Order oracle for $\phi$ be available, so that for every $z=(x,y)\in Z$ we can compute a vector

$$F(z)\in\Phi(z=(x,y)):=\big\{\partial_x\phi(x,y)\big\}\times\big\{\partial_y[-\phi(x,y)]\big\}.$$

The saddle point MD algorithm is given by the recurrence

$$(a):\ z_1=z_c;\qquad (b):\ z_{\tau+1}=\mathrm{Prox}_{z_\tau}(\gamma_\tau F(z_\tau));\qquad (c):\ z^\tau=\left[\sum_{s=1}^\tau\gamma_s\right]^{-1}\sum_{s=1}^\tau\gamma_sz_s, \tag{1.34}$$

where $\gamma_\tau>0$ are the stepsizes. Note that $z_\tau\in Z^o$, whence $z^t\in Z$.
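As a concrete instance, consider the bilinear matrix game $\phi(x,y)=\langle y,Ax\rangle$ with $X$, $Y$ flat simplices, $F(x,y)=(A^Ty,\,-Ax)$, and the entropy d.-g.f. used on each simplex (cf. (1.38)-(1.39) and section 1.7.3). The sketch below is ours; in particular, the crude bounds $L=\max_{i,j}|A_{ij}|$ and $\Omega\approx\ln(n)+\ln(m)$, as well as the constant stepsizes of the form appearing in Proposition 1.7 below, are simplifying assumptions.

```python
import numpy as np

def entropy_prox(x, xi):
    # prox-mapping for the entropy d.-g.f. on the flat simplex, cf. (1.39):
    # Prox_x(xi) is proportional to (x_i * exp(-xi_i))_i, renormalized
    w = x * np.exp(-(xi - xi.min()))   # the shift only affects normalization
    return w / w.sum()

def matrix_game_md(A, N):
    """Saddle point MD (1.34) for phi(x, y) = <y, Ax> on simplices."""
    m, n = A.shape
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # z_1: entropy minimizer
    L = np.abs(A).max()                               # crude bound on ||F||_*
    gamma = np.sqrt(2 * (np.log(n) + np.log(m))) / (L * np.sqrt(N))
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(N):
        x_avg += x / N
        y_avg += y / N
        Fx, Fy = A.T @ y, -A @ x       # a selection of Phi at z = (x, y)
        x = entropy_prox(x, gamma * Fx)
        y = entropy_prox(y, gamma * Fy)
    return x_avg, y_avg                # approximate saddle point z^N
```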

The convergence properties of the algorithm are given by the following proposition.

Proposition 1.7. Suppose that $F(\cdot)$ is bounded on $Z$, and let $L$ be such that $\|F(z)\|_*\le L$ for all $z\in Z$.

(i) For every $t\ge1$ it holds that

$$\epsilon_{\mathrm{sad}}(z^t)\ \le\ \left[\sum_{\tau=1}^t\gamma_\tau\right]^{-1}\left[\Omega+\frac{L^2}{2}\sum_{\tau=1}^t\gamma_\tau^2\right]. \tag{1.35}$$

(ii) As a consequence, the $N$-step MD algorithm with the constant stepsizes $\gamma_\tau=\gamma/(L\sqrt N)$, $\tau=1,\dots,N$, satisfies

$$\epsilon_{\mathrm{sad}}(z^N)\ \le\ \frac{L}{\sqrt N}\left[\frac{\Omega}{\gamma}+\frac{\gamma}{2}\right].$$

In particular, the $N$-step MD algorithm with the constant stepsizes $\gamma_\tau=L^{-1}\sqrt{2\Omega/N}$, $\tau=1,\dots,N$, satisfies

$$\epsilon_{\mathrm{sad}}(z^N)\ \le\ L\sqrt{\frac{2\Omega}{N}}.$$

Proof. By the definition $z_{\tau+1}=\mathrm{Prox}_{z_\tau}(\gamma_\tau F(z_\tau))$, we get

$$\forall u\in Z:\quad \gamma_\tau\langle F(z_\tau),z_\tau-u\rangle\ \le\ V_{z_\tau}(u)-V_{z_{\tau+1}}(u)+\gamma_\tau^2\|F(z_\tau)\|_*^2/2$$

(it suffices to repeat the derivation of (1.13) in the proof of Proposition 1.1, with $f'(x_\tau)$, $x_\tau$, and $x_{\tau+1}$ replaced, respectively, by $F(z_\tau)$, $z_\tau$, and $z_{\tau+1}$). Summing up over $\tau=1,\dots,t$, we get, for all $u\in Z$:

$$\sum_{\tau=1}^t\gamma_\tau\langle F(z_\tau),z_\tau-u\rangle\ \le\ V_{z_1}(u)+\sum_{\tau=1}^t\gamma_\tau^2\|F(z_\tau)\|_*^2/2\ \le\ \Omega+\frac{L^2}{2}\sum_{\tau=1}^t\gamma_\tau^2. \tag{1.36}$$

Let $z_\tau=(x_\tau,y_\tau)$, $z^t=(x^t,y^t)$, $u=(x,y)$, and $\lambda_\tau=\left[\sum_{s=1}^t\gamma_s\right]^{-1}\gamma_\tau$. Writing $F(z_\tau)=(F_x(z_\tau),F_y(z_\tau))$ with $F_x(z_\tau)\in\partial_x\phi(x_\tau,y_\tau)$ and $F_y(z_\tau)\in\partial_y[-\phi(x_\tau,y_\tau)]$, note that $\sum_{s=1}^t\lambda_s=1$ and

$$\sum_{\tau=1}^t\lambda_\tau\langle F(z_\tau),z_\tau-u\rangle=\sum_{\tau=1}^t\lambda_\tau\big[\langle F_x(z_\tau),x_\tau-x\rangle+\langle F_y(z_\tau),y-y_\tau\rangle\big],$$

where

$$\begin{array}{rcl}
\sum_{\tau=1}^t\lambda_\tau\big[\langle F_x(z_\tau),x_\tau-x\rangle+\langle F_y(z_\tau),y-y_\tau\rangle\big]
&\overset{(a)}{\ge}&\sum_{\tau=1}^t\lambda_\tau\big[[\phi(x_\tau,y_\tau)-\phi(x,y_\tau)]+[\phi(x_\tau,y)-\phi(x_\tau,y_\tau)]\big]\\[2pt]
&=&\sum_{\tau=1}^t\lambda_\tau\big[\phi(x_\tau,y)-\phi(x,y_\tau)\big]\\[2pt]
&\overset{(b)}{\ge}&\phi\Big(\sum_{\tau=1}^t\lambda_\tau x_\tau,\,y\Big)-\phi\Big(x,\,\sum_{\tau=1}^t\lambda_\tau y_\tau\Big)=\phi(x^t,y)-\phi(x,y^t)
\end{array} \tag{1.37}$$

(the inequalities in (a) and (b) are due to the convexity-concavity of $\phi$).

We come to

$$\phi(x^t,y)-\phi(x,y^t)\ \le\ \frac{\Omega+\frac{L^2}{2}\sum_{\tau=1}^t\gamma_\tau^2}{\sum_{\tau=1}^t\gamma_\tau}\qquad\forall(x,y)\in Z,$$

so that, taking the supremum in $(x,y)\in Z$, we arrive at (1.35).

1.7 Setting up a Mirror Descent method

An advantage of the Mirror Descent scheme is that its degrees of freedom (the norm $\|\cdot\|$ and the d.-g.f. $\omega(\cdot)$) allow one to adjust, to some extent, the method to the geometry of the problem under consideration. This is the issue we focus on in this section. For the sake of definiteness, we restrict ourselves to the minimization problem (1.2); the saddle point case (1.31) is completely similar, with $Z$ in the role of $X$.

1.7.1 Building blocks

The basic MD setups are as follows:

1. Euclidean setup: $\|\cdot\|=\|\cdot\|_2$, $\omega(x)=\frac12x^Tx$.

2. $\ell_1$ setup: For this setup, $E=\mathbb{R}^n$, $n>1$, and $\|\cdot\|=\|\cdot\|_1$. As for $\omega(\cdot)$, there could be several choices, depending on what $X$ is:

(a) when $X$ is unbounded, seemingly the only good choice is $\omega(x)=C\ln(n)\|x\|_{p(n)}^2$ with $p(n)=1+\frac1{\ln(n)}$, where the absolute constant $C$ is chosen in a way which ensures (1.3) (one can take $C=e$);

(b) when $X$ is bounded, assuming w.l.o.g. that $X\subset B_{n,1}:=\{x\in\mathbb{R}^n:\|x\|_1\le1\}$, one can set $\omega(x)=C\ln(n)\sum_{i=1}^n|x_i|^{p(n)}$, with the same $p(n)$ as above and $C=2e$;

(c) when $X$ is a part of the simplex $S_n^+=\{x\in\mathbb{R}^n_+:\sum_{i=1}^nx_i\le1\}$ (or the flat simplex $S_n=\{x\in\mathbb{R}^n_+:\sum_{i=1}^nx_i=1\}$) intersecting $\mathrm{int}\,\mathbb{R}^n_+$, a good choice of $\omega(x)$ is the entropy

$$\omega(x)=\mathrm{Ent}(x):=\sum_{i=1}^nx_i\ln(x_i). \tag{1.38}$$

3. Matrix setup: This is the matrix analogue of the $\ell_1$ setup. Here the embedding space $E$ of $X$ is the space $S^\nu$ of block-diagonal symmetric matrices with fixed block-diagonal structure $\nu=[\nu_1;\dots;\nu_k]$ ($k$ diagonal blocks of row sizes $\nu_1,\dots,\nu_k$). $S^\nu$ is equipped with the Frobenius inner product $\langle X,Y\rangle=\mathrm{Tr}(XY)$ and the trace norm $|X|_1=\|\lambda(X)\|_1$, where $\lambda(X)$ is the vector of eigenvalues (taken with their multiplicities in nonascending order) of a symmetric matrix $X$. The d.-g.f.'s are the matrix analogies of those for the $\ell_1$ setup. Specifically,

(a) when $X$ is unbounded, we set $\omega(X)=C\ln(|\nu|)\|\lambda(X)\|_{p(|\nu|)}^2$, where $|\nu|=\sum_{l=1}^k\nu_l$ is the total row size of the matrices from $S^\nu$, and $C$ is an appropriate absolute constant which ensures (1.3) (one can take $C=2e$);

(b) when $X$ is bounded, assuming w.l.o.g. that $X\subset B^{\nu,1}=\{X\in S^\nu:|X|_1\le1\}$, we can take $\omega(X)=4e\ln(|\nu|)\sum_{i=1}^{|\nu|}|\lambda_i(X)|^{p(|\nu|)}$;

(c) when $X$ is a part of the spectahedron $\Sigma_\nu^+=\{X\in S^\nu:X\succeq0,\ \mathrm{Tr}(X)\le1\}$ (or the flat spectahedron $\Sigma_\nu=\{X\in S^\nu:X\succeq0,\ \mathrm{Tr}(X)=1\}$) intersecting the interior $\{X\succ0\}$ of the positive semidefinite cone $S^\nu_+=\{X\in S^\nu:X\succeq0\}$, one can take as $\omega(X)$ the matrix entropy

$$\omega(X)=2\mathrm{Ent}(\lambda(X))=2\sum_{i=1}^{|\nu|}\lambda_i(X)\ln(\lambda_i(X)).$$

Note that the $\ell_1$ setup can be viewed as a particular case of the matrix one, corresponding to the case when the block-diagonal matrices in question are diagonal, and we identify a diagonal matrix with the vector of its diagonal entries.

With the outlined setups, the Simplicity assumption holds provided that $X$ is simple enough. Specifically:

— Within the Euclidean setup, $\mathrm{Prox}_x(\xi)$ is the metric projection of the vector $x-\xi$ onto $X$, that is, the point of $X$ which is the closest to $x-\xi$ in the $\ell_2$-norm. Examples of sets $X\subset\mathbb{R}^n$ for which metric projection is easy include, among others, $\|\cdot\|_p$-balls and intersections of $\|\cdot\|_p$-balls centered at the origin with the nonnegative orthant $\mathbb{R}^n_+$.

— Within the $\ell_1$ setup, computing the prox-mapping is reasonably easy: in the case of 2a, when $X$ is the entire $\mathbb{R}^n$ or $\mathbb{R}^n_+$; in the case of 2b, when $X$ is the entire $B_{n,1}$ or the intersection of $B_{n,1}$ with $\mathbb{R}^n_+$; in the case of 2c, when $X$ is the entire $S_n^+$ or $S_n$. With the indicated sets $X$, in the cases of 2a-2b computing the prox-mapping requires solving auxiliary one- or two-dimensional convex problems (which can be done within machine accuracy by, e.g., the Ellipsoid algorithm in $O(n)$ operations; cf. Nemirovski and Yudin (1983), Chapter II). In the case of 2c, the prox-mappings are given by the explicit formulas

$$X=S_n^+:\quad \mathrm{Prox}_x(\xi)=\begin{cases}\big[x_1e^{-\xi_1-1};\dots;x_ne^{-\xi_n-1}\big], & \sum_ix_ie^{-\xi_i-1}\le1,\\[2pt] \big[\sum_ix_ie^{-\xi_i}\big]^{-1}\big[x_1e^{-\xi_1};\dots;x_ne^{-\xi_n}\big], & \text{otherwise};\end{cases}$$

$$X=S_n:\quad \mathrm{Prox}_x(\xi)=\Big[\sum\nolimits_ix_ie^{-\xi_i}\Big]^{-1}\big[x_1e^{-\xi_1};\dots;x_ne^{-\xi_n}\big]. \tag{1.39}$$
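A direct Python transcription of (1.39); the function name and the numerical shift used in the normalized branch are ours:

```python
import numpy as np

def prox_entropy_simplex(x, xi, flat=True):
    """Prox-mapping (1.39) for the entropy d.-g.f. (1.38); `flat` selects the
    flat simplex S_n (else the full simplex S_n^+)."""
    w = x * np.exp(-(xi - xi.min()))   # proportional to x_i * exp(-xi_i)
    if flat:
        return w / w.sum()             # X = S_n
    u = x * np.exp(-xi - 1.0)          # first branch of (1.39) for X = S_n^+
    return u if u.sum() <= 1.0 else w / w.sum()
```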

— Within the Matrix setup, computing the prox-mapping is relatively easy: in the case of 3a, when $X$ is the entire $S^\nu$ or the positive semidefinite cone $S^\nu_+=\{X\in S^\nu:X\succeq0\}$; in the case of 3b, when $X$ is the entire $B^{\nu,1}$ or the intersection of $B^{\nu,1}$ with $S^\nu_+$; in the case of 3c, when $X$ is the entire spectahedron $\Sigma_\nu^+$ or $\Sigma_\nu$. Indeed, in the outlined cases, computing $W=\mathrm{Prox}_X(\Xi)$ reduces to computing the eigenvalue decomposition of the matrix $X$ (which allows one to get $\omega'(X)$), followed by the eigenvalue decomposition of the matrix $H=\Xi-\omega'(X)$: $H=U\,\mathrm{Diag}\{h\}U^T$ (here $\mathrm{Diag}\{h\}$ stands for the diagonal matrix with diagonal $h$). It is easily seen that in the cases in question

$$W=U\,\mathrm{Diag}\{w\}U^T,\qquad w=\mathop{\mathrm{argmin}}_{z:\,\mathrm{Diag}\{z\}\in X}\big\{\langle h,z\rangle+\omega(\mathrm{Diag}\{z\})\big\},$$

and the latter problem is exactly the one arising in the $\ell_1$ setup.

1.7.2 Illustration: Euclidean setup vs. $\ell_1$ setup

To illustrate the ability of the MD scheme to adjust, to some extent, the method to the problem's geometry, consider problem (1.2) with $X$ the unit $\|\cdot\|_p$-ball in $\mathbb{R}^n$, where $p=1$ or $p=2$, and let us compare the respective performances of the Euclidean and the $\ell_1$ setups (to make optimization over the unit Euclidean ball $B_{n,2}$ available for the $\ell_1$ setup, we pass from $\min_{\|x\|_2\le1}f(x)$ to the equivalent problem $\min_{\|u\|_2\le n^{-1/2}}f(n^{1/2}u)$ and use the setup from item 2b of section 1.7.1). The ratio of the corresponding efficiency estimates (the right hand sides in (1.11)) is, within an absolute constant factor,

$$\Theta:=\frac{\mathrm{EffEst(Eucl)}}{\mathrm{EffEst}(\ell_1)}=\underbrace{\frac1{n^{1-1/p}\sqrt{\ln(n)}}}_{A}\cdot\underbrace{\frac{\sup_{x\in X}\|f'(x)\|_2}{\sup_{x\in X}\|f'(x)\|_\infty}}_{B};$$

note that $\Theta\ll1$ means that MD with the Euclidean setup significantly outperforms MD with the $\ell_1$ setup, while $\Theta\gg1$ witnesses exactly the opposite. Now, the factor $A$ is $\le1$ and thus is always in favor of the Euclidean setup, and it is as small as $1/\sqrt{n\ln(n)}$ when $X$ is the Euclidean ball ($p=2$).

The factor $B$, on the other hand, is in favor of the $\ell_1$ setup: it is $\ge1$ and $\le\sqrt n$, and it may well be of order $\sqrt n$ (look what happens when all entries of $f'(x)$ are of the same order of magnitude). Which of the two factors overweighs depends on $f$; a reasonable choice, however, can be made independently of the fine structure of $f$. Specifically, when $X$ is the Euclidean ball, the factor $A=1/\sqrt{n\ln(n)}$ is so small that the product $AB$ definitely is $\le1$, that is, the situation definitely is in favor of the Euclidean setup. In contrast, when $X$ is the $\ell_1$ ball ($p=1$), $A$ is nearly constant, just $O(1/\sqrt{\ln(n)})$; since $B$ can be as large as $\sqrt n$, the situation is definitely in favor of the $\ell_1$ setup: it can be outperformed by the Euclidean setup only marginally (by a factor $\le\sqrt{\ln(n)}$), while it has reasonable chances to outperform its adversary quite significantly, by a factor as large as $O(\sqrt{n/\ln(n)})$. Thus, there are all reasons to select the Euclidean setup when $p=2$ and the $\ell_1$ setup when $p=1$.⁶

1.7.3 Favorable Geometry case

Consider the case when the domain $X$ of (1.2) is bounded and, moreover, is a subset of the direct product $X^+$ of standard blocks:

$$X^+=X_1\times\dots\times X_K\subset E_1\times\dots\times E_K, \tag{1.40}$$

where, for every $l=1,\dots,K$, the pair $(X_l,E_l\supset X_l)$ is

— either a ball block, that is, $E_l=\mathbb{R}^{n_l}$ and $X_l$ is either the unit Euclidean ball $B_{n_l,2}=\{x\in\mathbb{R}^{n_l}:\|x\|_2\le1\}$ in $E_l$, or the intersection of this ball with $\mathbb{R}^{n_l}_+$;

— or a spectahedron block, that is, $E_l=S^{\nu^l}$ is the space of block-diagonal symmetric matrices with block-diagonal structure $\nu^l$, and $X_l$ is either the unit trace-norm ball $\{X\in S^{\nu^l}:|X|_1\le1\}$, or the intersection of this ball with $S^{\nu^l}_+$, or the spectahedron $\Sigma_{\nu^l}^+=\{X\in S^{\nu^l}_+:\mathrm{Tr}(X)\le1\}$, or the flat spectahedron $\Sigma_{\nu^l}=\{X\in S^{\nu^l}_+:\mathrm{Tr}(X)=1\}$.

Note that, according to our convention of identifying vectors with the diagonals of diagonal matrices, we allow for some of the $X_l$ to be unit $\ell_1$ balls, or their nonnegative parts, or simplices; these are nothing but spectahedron blocks with purely diagonal structure $\nu^l$.

6. In fact, with this recommendation we get theoretically unimprovable, in terms of Information-Based Complexity Theory, methods for large-scale nonsmooth convex optimization on Euclidean and $\ell_1$ balls (for details, see Nemirovski and Yudin, 1983; Ben-Tal et al., 2001); numerical experiments reported in (Ben-Tal et al., 2001; Nemirovski et al., 2009) seem to fully support the advantages of the $\ell_1$ setup when minimizing over large-scale simplices.

We equip the embedding spaces $E_l$ of the blocks with the natural inner products (the standard inner products when $E_l=\mathbb{R}^{n_l}$, the Frobenius inner product when $E_l=S^{\nu^l}$) and norms $\|\cdot\|_{(l)}$ (the standard Euclidean norm when $E_l=\mathbb{R}^{n_l}$, the trace-norm when $E_l=S^{\nu^l}$), and we equip the standard blocks $X_l$ with the d.-g.f.'s

$$\omega_l(x^l)=\begin{cases}\frac12[x^l]^Tx^l, & X_l\ \text{is a ball block},\\[2pt] 4e\ln(|\nu^l|)\sum_i|\lambda_i(x^l)|^{p(|\nu^l|)}, & X_l\ \text{is the unit}\ |\cdot|_1\ \text{ball}\ B^{\nu^l,1}\ \text{in}\ E_l=S^{\nu^l}\text{, or}\ B^{\nu^l,1}\cap S^{\nu^l}_+,\\[2pt] 2\,\mathrm{Ent}(\lambda(x^l)), & X_l\ \text{is a spectahedron}\ (\Sigma_{\nu^l}^+\ \text{or}\ \Sigma_{\nu^l})\ \text{in}\ E_l=S^{\nu^l},\end{cases} \tag{1.41}$$

cf. section 1.7.1. Finally, the embedding space $E=E_1\times\dots\times E_K$ of $X^+$ (and thus of $X\subset X^+$) is equipped with the direct product type Euclidean structure induced by the inner products on $E_1,\dots,E_K$ and with the norm

$$\|(x^1,\dots,x^K)\|=\sqrt{\sum_{l=1}^K\alpha_l\|x^l\|_{(l)}^2}, \tag{1.42}$$

where the $\alpha_l>0$ are parameters of the construction. $X^+$ is equipped with the d.-g.f.

$$\omega(x^1,\dots,x^K)=\sum_{l=1}^K\alpha_l\omega_l(x^l), \tag{1.43}$$

which, as is easily seen, is compatible with the norm $\|\cdot\|$. Assuming from now on that $X$ intersects the relative interior $\mathrm{rint}\,X^+$, the restriction of $\omega(\cdot)$ onto $X$ is a d.-g.f. for $X$ compatible with the norm $\|\cdot\|$ on the space $E$ embedding $X$, and we can solve (1.2) by the MD algorithm associated with $\|\cdot\|$ and $\omega(\cdot)$.

Let us optimize the efficiency estimate of this algorithm over the parameters $\alpha_l$ of our construction. For the sake of definiteness, consider the case where $f$ is represented by a deterministic First Order oracle (the tuning of the MD setup in the case of a Stochastic oracle is completely similar). To this end, assume that we have at our disposal upper bounds $L_l<\infty$, $1\le l\le K$, on the quantities $\|f'_{x^l}(x^1,\dots,x^K)\|_{(l),*}$, $x=(x^1,\dots,x^K)\in X$, where $f'_{x^l}(x)$ is the projection of $f'(x)$ onto $E_l$, and $\|\cdot\|_{(l),*}$ is the norm on $E_l$ conjugate to $\|\cdot\|_{(l)}$ (that is, $\|\cdot\|_{(l),*}$ is the standard Euclidean norm $\|\cdot\|_2$ on $E_l$ when $E_l=\mathbb{R}^{n_l}$, and the standard matrix norm (maximal singular value) when $E_l=S^{\nu^l}$). The norm conjugate to the norm $\|\cdot\|$ on $E$ is

$$\|(\xi^1,\dots,\xi^K)\|_*=\sqrt{\sum_{l=1}^K\alpha_l^{-1}\|\xi^l\|_{(l),*}^2}\quad\Longrightarrow\quad \forall x\in X:\ \|f'(x)\|_*\le\mathcal{L}:=\sqrt{\sum_{l=1}^K\alpha_l^{-1}L_l^2}, \tag{1.44}$$

and the quantity we need to minimize in order to get the most efficient MD method within our framework is $\sqrt\Omega\,\mathcal{L}$ (see, e.g., (1.11)). We clearly have

$$\Omega\ \le\ \Omega[X^+]\ \le\ \sum_{l=1}^K\alpha_l\Omega_l[X_l],$$

where $\Omega_l[X_l]$ is the variation (maximum minus minimum) of $\omega_l$ on $X_l$. These variations are upper-bounded by the quantities

$$\Omega_l=\begin{cases}\frac12 & \text{for ball blocks}\ X_l,\\ 4e\ln(|\nu^l|) & \text{for spectahedron blocks}\ X_l.\end{cases} \tag{1.45}$$

Assuming that we have $K_b$ ball blocks $X_1,\dots,X_{K_b}$ and $K_s$ spectahedron blocks $X_{K_b+1},\dots,X_{K=K_b+K_s}$, we get

$$\sqrt\Omega\,\mathcal{L}\ \le\ \sqrt{\Omega[X^+]}\,\mathcal{L}\ \le\ \sqrt{\frac12\sum_{l=1}^{K_b}\alpha_l+4e\sum_{l=K_b+1}^{K_b+K_s}\alpha_l\ln(|\nu^l|)}\cdot\sqrt{\sum_{l=1}^K\alpha_l^{-1}L_l^2}.$$

Optimizing the right hand side bound in $\alpha_1,\dots,\alpha_K$, we get

$$\alpha_l=\frac{L_l}{\sqrt{\Omega_l}\sum_{i=1}^KL_i\sqrt{\Omega_i}},\qquad \Omega[X^+]\le1,\qquad \mathcal{L}\le L:=\sum_{l=1}^KL_l\sqrt{\Omega_l}. \tag{1.46}$$

The efficiency estimate (1.11) associated with our optimized setup reads

$$f^N-\mathrm{Opt}\ \le\ O(1)LN^{-1/2}\ \le\ O(1)\Big[\max_{1\le l\le K}L_l\Big]\Big[K_b+\sum_{l=K_b+1}^{K_b+K_s}\sqrt{\ln(|\nu^l|)}\Big]N^{-1/2}. \tag{1.47}$$

We see that if we treat $\max_{1\le l\le K}L_l$, $K_b$, and $K_s$ as given constants, the rate of convergence of the MD algorithm is $O(1/\sqrt N)$, $N$ being the number of steps, with the factor hidden in $O(\cdot)$ completely independent of the dimensions of the ball blocks and nearly independent of the sizes of the spectahedron blocks. In other words, when the total number $K$ of standard blocks in $X^+$ is $O(1)$, the MD algorithm exhibits a nearly dimension-independent $O(N^{-1/2})$ rate of convergence, which definitely is good news when solving large-scale problems.

Needless to say, the rate of convergence is not the only entity of interest; what also matters is the arithmetic cost of an iteration. The latter, modulo the computational effort for obtaining the first order information on $f$, is dominated by the computational complexity of the prox-mapping. This complexity, let us denote it $\mathcal{C}$, depends on what exactly $X$ is. As explained in section 1.7.1, in the case of $X=X^+$, $\mathcal{C}$ is $O\big(\sum_{l=1}^{K_b}\dim X_l\big)$ plus the complexity of the eigenvalue decomposition of a matrix from $S^{\nu^1}\times\dots\times S^{\nu^{K_s}}$. In particular, when all spectahedron blocks are $\ell_1$ balls and simplices, $\mathcal{C}$ is just linear in the dimension of $X^+$.
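The optimal weights (1.46) are cheap to compute; here is a small sketch with a made-up two-block example:

```python
import math

def block_weights(L_list, Omega_list):
    """Weights (1.46): alpha_l = L_l / (sqrt(Omega_l) * sum_i L_i*sqrt(Omega_i));
    also returns L = sum_l L_l * sqrt(Omega_l) entering the estimate (1.47)."""
    S = sum(L * math.sqrt(W) for L, W in zip(L_list, Omega_list))
    alphas = [L / (math.sqrt(W) * S) for L, W in zip(L_list, Omega_list)]
    return alphas, S

# Hypothetical example: one ball block (Omega = 1/2) and one spectahedron
# block of total row size 100 (Omega = 4e ln(100)), with L_1 = L_2 = 1.
alphas, L = block_weights([1.0, 1.0], [0.5, 4 * math.e * math.log(100.0)])
```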

Further, when $X$ is cut off $X^+$ by $O(1)$ linear inequalities, $\mathcal{C}$ is essentially the same as when $X=X^+$. Indeed, here computing the prox-mapping for $X$ reduces to solving the problem

$$\min_z\big\{\langle a,z\rangle+\omega(z):\ z\in X^+,\ Az\le b\big\},\qquad \dim b=k=O(1),$$

or, which is the same by duality, to solving the problem

$$\max_{\lambda\in\mathbb{R}^k_+}f_*(\lambda),\qquad f_*(\lambda)=-b^T\lambda+\min_{z\in X^+}\big[\langle a+A^T\lambda,z\rangle+\omega(z)\big].$$

We are in the situation of $O(1)$ $\lambda$-variables, and thus the latter problem can be solved to machine precision in $O(1)$ steps of a simple first order algorithm, like the Ellipsoid method. The first order information on $f_*$ required by this method costs a single prox-mapping for $X^+$, so that computing the prox-mapping for $X$ is, for all practical purposes, just by an absolute constant factor more costly than computing this mapping for $X^+$.

When $X$ is a sophisticated subset of $X^+$, computing the prox-mapping for $X$ may become more involved, and the outlined setup could become difficult to implement. One potential remedy is to rewrite problem (1.2) in the form of (1.15), with $X$ extended to $X^+$, with $f$ in the role of $f_0$, and with the constraints which cut $X$ off $X^+$ in the role of the functional constraints $f_1(x)\le0,\dots,f_m(x)\le0$ of (1.15).

1.8 Notes and Remarks

1. The very first Mirror Descent method, the Subgradient Descent, originates from Shor (1967) and Polyak (1967); SD is nothing but the MD algorithm with the Euclidean setup:

$$x_{t+1}=\mathop{\mathrm{argmin}}_{u\in X}\big\|(x_t-\gamma_tf'(x_t))-u\big\|_2.$$

Non-Euclidean extensions (i.e., the general MD scheme) originate from (Nemirovskii, 1979; Nemirovski and Yudin, 1983); the contemporary form of this scheme used in our presentation is due to Beck and Teboulle (2003). An ingenious version of the method, which also allows one to recover dual solutions, is proposed by Nesterov (2009). The construction presented in section 1.3 originates from (Nemirovski and Yudin, 1983); for a recent version, see (Beck et al., 2009).

2. The practical performance of FOMs of the type we have considered can be improved significantly by passing to their bundle versions, which explicitly utilize both the latest and the past first order information (in MD, only the latest first order information is used explicitly, while the past information is loosely summarized in the current iterate). The Euclidean bundle methods originate from Lemaréchal (1978) and are the subject of numerous papers.


More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

What can be expressed via Conic Quadratic and Semidefinite Programming?

What can be expressed via Conic Quadratic and Semidefinite Programming? What can be expressed via Conic Quadratic and Semidefinite Programming? A. Nemirovski Faculty of Industrial Engineering and Management Technion Israel Institute of Technology Abstract Tremendous recent

More information

INFORMATION-BASED COMPLEXITY CONVEX PROGRAMMING

INFORMATION-BASED COMPLEXITY CONVEX PROGRAMMING 1 TECHNION - THE ISRAEL INSTITUTE OF TECHNOLOGY FACULTY OF INDUSTRIAL ENGINEERING & MANAGEMENT INFORMATION-BASED COMPLEXITY OF CONVEX PROGRAMMING A. Nemirovski Fall Semester 1994/95 2 Information-Based

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

(Linear Programming program, Linear, Theorem on Alternative, Linear Programming duality)

(Linear Programming program, Linear, Theorem on Alternative, Linear Programming duality) Lecture 2 Theory of Linear Programming (Linear Programming program, Linear, Theorem on Alternative, Linear Programming duality) 2.1 Linear Programming: basic notions A Linear Programming (LP) program is

More information

Notation and Prerequisites

Notation and Prerequisites Appendix A Notation and Prerequisites A.1 Notation Z, R, C stand for the sets of all integers, reals, and complex numbers, respectively. C m n, R m n stand for the spaces of complex, respectively, real

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem

Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Michael Patriksson 0-0 The Relaxation Theorem 1 Problem: find f := infimum f(x), x subject to x S, (1a) (1b) where f : R n R

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Barrier Method. Javier Peña Convex Optimization /36-725

Barrier Method. Javier Peña Convex Optimization /36-725 Barrier Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: Newton s method For root-finding F (x) = 0 x + = x F (x) 1 F (x) For optimization x f(x) x + = x 2 f(x) 1 f(x) Assume f strongly

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles

Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles Math. Program., Ser. A DOI 10.1007/s10107-015-0876-3 FULL LENGTH PAPER Solving variational inequalities with monotone operators on domains given by Linear Minimization Oracles Anatoli Juditsky Arkadi Nemirovski

More information

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17 EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 17 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory May 29, 2012 Andre Tkacenko

More information

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization

Some Properties of the Augmented Lagrangian in Cone Constrained Optimization MATHEMATICS OF OPERATIONS RESEARCH Vol. 29, No. 3, August 2004, pp. 479 491 issn 0364-765X eissn 1526-5471 04 2903 0479 informs doi 10.1287/moor.1040.0103 2004 INFORMS Some Properties of the Augmented

More information

Stochastic model-based minimization under high-order growth

Stochastic model-based minimization under high-order growth Stochastic model-based minimization under high-order growth Damek Davis Dmitriy Drusvyatskiy Kellie J. MacPhee Abstract Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively

More information

Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization

Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of

More information

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space.

Chapter 1. Preliminaries. The purpose of this chapter is to provide some basic background information. Linear Space. Hilbert Space. Chapter 1 Preliminaries The purpose of this chapter is to provide some basic background information. Linear Space Hilbert Space Basic Principles 1 2 Preliminaries Linear Space The notion of linear space

More information

Convex Optimization Under Inexact First-order Information. Guanghui Lan

Convex Optimization Under Inexact First-order Information. Guanghui Lan Convex Optimization Under Inexact First-order Information A Thesis Presented to The Academic Faculty by Guanghui Lan In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy H. Milton

More information

Primal-dual subgradient methods for convex problems

Primal-dual subgradient methods for convex problems Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization

Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization Zaid Harchaoui Anatoli Juditsky Arkadi Nemirovski May 25, 2014 Abstract Motivated by some applications in signal processing

More information

Lecture 7: Semidefinite programming

Lecture 7: Semidefinite programming CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 7: Semidefinite programming This lecture is on semidefinite programming, which is a powerful technique from both an analytic and computational

More information

A New Look at the Performance Analysis of First-Order Methods

A New Look at the Performance Analysis of First-Order Methods A New Look at the Performance Analysis of First-Order Methods Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with Yoel Drori, Google s R&D Center, Tel Aviv Optimization without

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley A. d Aspremont, INFORMS, Denver,

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Complexity bounds for primal-dual methods minimizing the model of objective function

Complexity bounds for primal-dual methods minimizing the model of objective function Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing

More information

10. Unconstrained minimization

10. Unconstrained minimization Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation

More information

CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008.

CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008. 1 ECONOMICS 594: LECTURE NOTES CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS W. Erwin Diewert January 31, 2008. 1. Introduction Many economic problems have the following structure: (i) a linear function

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Convex Optimization Theory. Athena Scientific, Supplementary Chapter 6 on Convex Optimization Algorithms

Convex Optimization Theory. Athena Scientific, Supplementary Chapter 6 on Convex Optimization Algorithms Convex Optimization Theory Athena Scientific, 2009 by Dimitri P. Bertsekas Massachusetts Institute of Technology Supplementary Chapter 6 on Convex Optimization Algorithms This chapter aims to supplement

More information

15. Conic optimization

15. Conic optimization L. Vandenberghe EE236C (Spring 216) 15. Conic optimization conic linear program examples modeling duality 15-1 Generalized (conic) inequalities Conic inequality: a constraint x K where K is a convex cone

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014

Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Duke University, Department of Electrical and Computer Engineering Optimization for Scientists and Engineers c Alex Bronstein, 2014 Linear Algebra A Brief Reminder Purpose. The purpose of this document

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Lecture 9 Sequential unconstrained minimization

Lecture 9 Sequential unconstrained minimization S. Boyd EE364 Lecture 9 Sequential unconstrained minimization brief history of SUMT & IP methods logarithmic barrier function central path UMT & SUMT complexity analysis feasibility phase generalized inequalities

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

Convex Optimization and l 1 -minimization

Convex Optimization and l 1 -minimization Convex Optimization and l 1 -minimization Sangwoon Yun Computational Sciences Korea Institute for Advanced Study December 11, 2009 2009 NIMS Thematic Winter School Outline I. Convex Optimization II. l

More information

Lecture 1. 1 Conic programming. MA 796S: Convex Optimization and Interior Point Methods October 8, Consider the conic program. min.

Lecture 1. 1 Conic programming. MA 796S: Convex Optimization and Interior Point Methods October 8, Consider the conic program. min. MA 796S: Convex Optimization and Interior Point Methods October 8, 2007 Lecture 1 Lecturer: Kartik Sivaramakrishnan Scribe: Kartik Sivaramakrishnan 1 Conic programming Consider the conic program min s.t.

More information

We describe the generalization of Hazan s algorithm for symmetric programming

We describe the generalization of Hazan s algorithm for symmetric programming ON HAZAN S ALGORITHM FOR SYMMETRIC PROGRAMMING PROBLEMS L. FAYBUSOVICH Abstract. problems We describe the generalization of Hazan s algorithm for symmetric programming Key words. Symmetric programming,

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

GEORGIA INSTITUTE OF TECHNOLOGY H. MILTON STEWART SCHOOL OF INDUSTRIAL AND SYSTEMS ENGINEERING LECTURE NOTES OPTIMIZATION III

GEORGIA INSTITUTE OF TECHNOLOGY H. MILTON STEWART SCHOOL OF INDUSTRIAL AND SYSTEMS ENGINEERING LECTURE NOTES OPTIMIZATION III GEORGIA INSTITUTE OF TECHNOLOGY H. MILTON STEWART SCHOOL OF INDUSTRIAL AND SYSTEMS ENGINEERING LECTURE NOTES OPTIMIZATION III CONVEX ANALYSIS NONLINEAR PROGRAMMING THEORY NONLINEAR PROGRAMMING ALGORITHMS

More information

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions

A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions Angelia Nedić and Asuman Ozdaglar April 16, 2006 Abstract In this paper, we study a unifying framework

More information

A Distributed Newton Method for Network Utility Maximization, II: Convergence

A Distributed Newton Method for Network Utility Maximization, II: Convergence A Distributed Newton Method for Network Utility Maximization, II: Convergence Ermin Wei, Asuman Ozdaglar, and Ali Jadbabaie October 31, 2012 Abstract The existing distributed algorithms for Network Utility

More information

A Quantum Interior Point Method for LPs and SDPs

A Quantum Interior Point Method for LPs and SDPs A Quantum Interior Point Method for LPs and SDPs Iordanis Kerenidis 1 Anupam Prakash 1 1 CNRS, IRIF, Université Paris Diderot, Paris, France. September 26, 2018 Semi Definite Programs A Semidefinite Program

More information

On self-concordant barriers for generalized power cones

On self-concordant barriers for generalized power cones On self-concordant barriers for generalized power cones Scott Roy Lin Xiao January 30, 2018 Abstract In the study of interior-point methods for nonsymmetric conic optimization and their applications, Nesterov

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Lecture 3: Huge-scale optimization problems

Lecture 3: Huge-scale optimization problems Liege University: Francqui Chair 2011-2012 Lecture 3: Huge-scale optimization problems Yurii Nesterov, CORE/INMA (UCL) March 9, 2012 Yu. Nesterov () Huge-scale optimization problems 1/32March 9, 2012 1

More information

Sparse Approximation via Penalty Decomposition Methods

Sparse Approximation via Penalty Decomposition Methods Sparse Approximation via Penalty Decomposition Methods Zhaosong Lu Yong Zhang February 19, 2012 Abstract In this paper we consider sparse approximation problems, that is, general l 0 minimization problems

More information

Lecture 5. Theorems of Alternatives and Self-Dual Embedding

Lecture 5. Theorems of Alternatives and Self-Dual Embedding IE 8534 1 Lecture 5. Theorems of Alternatives and Self-Dual Embedding IE 8534 2 A system of linear equations may not have a solution. It is well known that either Ax = c has a solution, or A T y = 0, c

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

Lecture 24 November 27

Lecture 24 November 27 EE 381V: Large Scale Optimization Fall 01 Lecture 4 November 7 Lecturer: Caramanis & Sanghavi Scribe: Jahshan Bhatti and Ken Pesyna 4.1 Mirror Descent Earlier, we motivated mirror descent as a way to improve

More information

QUADRATIC MAJORIZATION 1. INTRODUCTION

QUADRATIC MAJORIZATION 1. INTRODUCTION QUADRATIC MAJORIZATION JAN DE LEEUW 1. INTRODUCTION Majorization methods are used extensively to solve complicated multivariate optimizaton problems. We refer to de Leeuw [1994]; Heiser [1995]; Lange et

More information

Duality Theory of Constrained Optimization

Duality Theory of Constrained Optimization Duality Theory of Constrained Optimization Robert M. Freund April, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 The Practical Importance of Duality Duality is pervasive

More information

LEARNING IN CONCAVE GAMES

LEARNING IN CONCAVE GAMES LEARNING IN CONCAVE GAMES P. Mertikopoulos French National Center for Scientific Research (CNRS) Laboratoire d Informatique de Grenoble GSBE ETBC seminar Maastricht, October 22, 2015 Motivation and Preliminaries

More information

Topological properties of Z p and Q p and Euclidean models

Topological properties of Z p and Q p and Euclidean models Topological properties of Z p and Q p and Euclidean models Samuel Trautwein, Esther Röder, Giorgio Barozzi November 3, 20 Topology of Q p vs Topology of R Both R and Q p are normed fields and complete

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions

A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions A Geometric Framework for Nonconvex Optimization Duality using Augmented Lagrangian Functions Angelia Nedić and Asuman Ozdaglar April 15, 2006 Abstract We provide a unifying geometric framework for the

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Lecture Note 5: Semidefinite Programming for Stability Analysis

Lecture Note 5: Semidefinite Programming for Stability Analysis ECE7850: Hybrid Systems:Theory and Applications Lecture Note 5: Semidefinite Programming for Stability Analysis Wei Zhang Assistant Professor Department of Electrical and Computer Engineering Ohio State

More information

LMI MODELLING 4. CONVEX LMI MODELLING. Didier HENRION. LAAS-CNRS Toulouse, FR Czech Tech Univ Prague, CZ. Universidad de Valladolid, SP March 2009

LMI MODELLING 4. CONVEX LMI MODELLING. Didier HENRION. LAAS-CNRS Toulouse, FR Czech Tech Univ Prague, CZ. Universidad de Valladolid, SP March 2009 LMI MODELLING 4. CONVEX LMI MODELLING Didier HENRION LAAS-CNRS Toulouse, FR Czech Tech Univ Prague, CZ Universidad de Valladolid, SP March 2009 Minors A minor of a matrix F is the determinant of a submatrix

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro

More information

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version

Convex Optimization Theory. Chapter 5 Exercises and Solutions: Extended Version Convex Optimization Theory Chapter 5 Exercises and Solutions: Extended Version Dimitri P. Bertsekas Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information