Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent


Relative Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Haihao Lu*

August 2018

Abstract

The usual approach to developing and analyzing first-order methods for non-smooth (stochastic or deterministic) convex optimization assumes that the objective function is uniformly Lipschitz continuous with parameter $M_f$. However, in many settings the non-differentiable convex function $f$ is not uniformly Lipschitz continuous: for example, (i) the classical support vector machine (SVM) problem, (ii) the problem of minimizing the maximum of convex quadratic functions, and even (iii) the univariate setting with $f(x) := \max\{0, x\} + x^2$. Herein we develop a notion of "relative continuity" that is determined relative to a user-specified reference function $h$ (that should be computationally tractable for algorithms), and we show that many non-differentiable convex functions are relatively continuous with respect to a correspondingly fairly-simple reference function $h$. We also similarly develop a notion of "relative stochastic continuity" for the stochastic setting. We analyze two standard algorithms, the (deterministic) mirror descent algorithm and the stochastic mirror descent algorithm, for solving optimization problems in these new settings, providing the first computational guarantees for instances where the objective function is not uniformly Lipschitz continuous. This paper is a companion paper for non-differentiable convex optimization to the recent paper by Lu, Freund, and Nesterov [8], which developed analogous results for differentiable convex optimization.

1 Introduction

The usual approach to developing and analyzing first-order methods for non-differentiable convex optimization (which we review shortly) assumes that the objective function is uniformly Lipschitz continuous, in both deterministic and stochastic settings. However, in many settings the objective function $f$ is not uniformly Lipschitz continuous. For example, consider the Support Vector Machine problem (SVM) for binary classification in machine learning, whose optimization formulation is:

$$\text{SVM:} \qquad \min_x \; f(x) := \frac{1}{n}\sum_{i=1}^n \max\{0,\, 1 - y_i x^T w_i\} + \frac{\lambda}{2}\|x\|_2^2,$$

*MIT Department of Mathematics and MIT Operations Research Center, 77 Massachusetts Avenue, Cambridge, MA 02139 (haihao@mit.edu). The author's research is supported by AFOSR Grant No. FA and the MIT-Belgium Université Catholique de Louvain Fund.

where $w_i$ is the input feature vector of sample $i$ and $y_i \in \{-1, 1\}$ is the label of sample $i$. Notice that $f$ is not differentiable due to the presence of the hinge loss terms in the summation, and $f$ is also not Lipschitz continuous due to the presence of the $\ell_2$-norm-squared regularization term; thus we cannot directly utilize typical subgradient or gradient schemes and their associated computational guarantees for SVM. Another example is the problem of computing a point $x \in \mathbb{R}^m$ in the intersection of $n$ ellipsoids, which can be tackled via the optimization problem

$$\text{IEP:} \qquad f^* = \min_x \; f(x) := \max_{1 \le i \le n} \left\{ \tfrac{1}{2} x^T A_i x + b_i^T x + c_i \right\},$$

where the $i$-th ellipsoid is $Q_i = \{x \in \mathbb{R}^m : \tfrac{1}{2} x^T A_i x + b_i^T x + c_i \le 0\}$ and $A_i \in \mathbb{R}^{m \times m}$ is a symmetric positive semi-definite matrix, $i = 1, \ldots, n$. Observe that the objective function $f$ is both non-differentiable and non-Lipschitz, and so it falls outside of the scope of standard classes of optimization problems for which first-order methods are guaranteed to work. Nevertheless, using the machinery developed in this paper, we will show in Section 5.3 how to solve both of these problems using deterministic or stochastic Mirror Descent.

In this paper we develop a general theory and algorithmic constructs that overcome the drawbacks in the usual analyses of first-order methods that are grounded on restrictive notions of uniform Lipschitz continuity. Here we develop a notion of relative continuity with respect to a given convex reference function $h$, a notion which does not require the specification of any particular norm; indeed $h$ need not be strongly (or even strictly) convex. Armed with relative continuity, we demonstrate the capability to solve a more general class of non-differentiable convex optimization problems (without uniform Lipschitz continuity) in both deterministic and stochastic settings.

This paper is a companion for non-differentiable convex optimization to our predecessor paper [8] for differentiable convex optimization. In [8], with a very similar philosophy, we developed the notions of relative smoothness and relative strong convexity with respect to a given convex reference function. In that paper we demonstrated the capability to solve a more general class of differentiable convex optimization problems (without uniformly Lipschitz continuous gradients), and we also demonstrated linear convergence results for a Primal Gradient Scheme when the objective function $f$ is both relatively smooth and relatively strongly convex. There are some concurrent works on smooth optimization sharing a similar spirit to [8]. Bauschke, Bolte, and Teboulle [1] present a similar definition of relative smoothness as in [8] and analyze the convergence of the Mirror Descent Algorithm, although their algorithm and convergence complexity depend on a symmetry measure of the Bregman distance. Zhou et al. [17] discuss a unified proof of Mirror Descent and the Proximal Point Algorithm under a similar assumption of relative smoothness. Nguyen [15] develops similar ideas for analyzing Mirror Descent in a Banach space. A more detailed discussion comparing these related works is presented in [8]. More recently, Hanzely and Richtarik [6] develop stochastic algorithms for the relatively smooth optimization setting.

In Section 2 we review the traditional set-up for Mirror Descent in both the deterministic and stochastic settings. In Section 3 we introduce our notion of relative continuity in both the deterministic and stochastic settings, together with some relevant properties.
In Section 4 we prove computational guarantees associated with the Mirror Descent and Stochastic Mirror Descent algorithms under relative continuity.

In Section 5 we show constructively how our ideas apply to a large class of non-differentiable and non-Lipschitz convex optimization problems that are not otherwise solvable by traditional first-order methods. Also in Section 5 we analyze computational guarantees associated with Mirror Descent and Stochastic Mirror Descent for the Intersection of Ellipsoids Problem (IEP) and also the Support Vector Machine (SVM) problem.

Notation. $\|\cdot\|$ denotes a given norm on $\mathbb{R}^n$ and $\|\cdot\|_*$ denotes the usual dual norm on the dual space. $\|x\|_2 := \sqrt{x^T x}$ denotes the Euclidean (inner product) norm, where $x^T$ means the transpose of the vector $x$, and $B_2(c, r) := \{x \in \mathbb{R}^n : \|x - c\|_2 \le r\}$. $\|A\|_2$ denotes the $\ell_2$ (spectral) norm of a matrix $A$. The inner product $\langle \cdot, \cdot \rangle$ specifically denotes the dot inner product in the underlying vector space. For a conditional random variable $\tilde{s}(x)$ given $x$, $\mathbb{E}[\tilde{s}(x) \mid x]$ denotes the conditional expectation of $\tilde{s}(x)$ given $x$.

2 Traditional Mirror Descent

The optimization problem of interest is:

$$P: \qquad \min_x \; f(x) \quad \text{s.t.} \quad x \in Q, \tag{1}$$

where $Q \subseteq \mathbb{R}^n$ is a closed convex set and $f : Q \to \mathbb{R}$ is a convex function that is not necessarily differentiable. There are very many deterministic and stochastic first-order methods for tackling (1), see for example [4], [11], [9] and the references therein. Virtually all such methods are designed to solve (1) when the objective function $f$ satisfies a uniform Lipschitz continuity condition on $Q$, which in the deterministic setting is (essentially) equivalent to the condition that there exists a constant $M_f < \infty$ for which:

$$\|g(x)\|_* \le M_f \quad \text{for all } x \in Q \text{ and } g(x) \in \partial f(x), \tag{2}$$

where $\partial f(x)$ is the subdifferential of $f$ at $x$ (i.e., the collection of subgradients of $f$ at $x$), $\|\cdot\|$ is a given norm on $\mathbb{R}^n$, and $\|\cdot\|_*$ denotes the usual dual norm. Here we use $g(x)$ to denote an assignment of a subgradient (or an oracle call thereof) at $x$, and so $g(x)$ is not a function nor is it a point-to-set map. Another useful functional notion is strong convexity: $f$ is (uniformly) $\mu_f$-strongly convex for some $\mu_f > 0$ if

$$f(y) \ge f(x) + \langle g(x), y - x \rangle + \frac{\mu_f}{2}\|y - x\|^2 \quad \text{for all } x, y \in Q \text{ and } g(x) \in \partial f(x). \tag{3}$$

Nedić and Lee [9] obtain improved convergence guarantees for the stochastic mirror descent algorithm under strong convexity, for example.

2.1 Deterministic Setting

Let us now recall the Mirror Descent Algorithm (see [10] and [2]), which is also referred to as the prox subgradient method when interpreted in the space of primal variables. Mirror Descent employs a

differentiable convex prox function $h$ to define a Bregman distance:

$$D_h(y, x) := h(y) - h(x) - \langle \nabla h(x), y - x \rangle \quad \text{for all } x, y \in Q. \tag{4}$$

The Bregman distance is used in the computation of the Mirror Descent update:

$$x^{i+1} \leftarrow \arg\min_{x \in Q} \left\{ f(x^i) + \langle g(x^i), x - x^i \rangle + \frac{1}{t_i} D_h(x, x^i) \right\},$$

where $\{t_i\}$ is the sequence of step-sizes for the scheme. A formal statement of the Mirror Descent Algorithm is presented in Algorithm 1. The traditional set-up requires that $h$ is $1$-strongly convex with respect to the given norm, and in this set-up one can prove that after $k$ iterations it holds for any $x \in Q$ that:

$$\min_{0 \le i \le k} f(x^i) - f(x) \le \frac{\frac{M_f^2}{2}\sum_{i=0}^k t_i^2 + D_h(x, x^0)}{\sum_{i=0}^k t_i}, \tag{5}$$

which leads to an $O(1/\sqrt{k})$ sublinear rate of convergence using an appropriately chosen step-size sequence $\{t_i\}$, see [2].

Algorithm 1: Deterministic Mirror Descent Algorithm with Bregman distance $D_h(\cdot, \cdot)$

Initialize. Initialize with $x^0 \in Q$. Let $h$ be a given differentiable convex function.

At iteration $i$: Perform Updates. Compute $g(x^i) \in \partial f(x^i)$, determine step-size $t_i$, and compute update:

$$x^{i+1} \leftarrow \arg\min_{x \in Q} \left\{ f(x^i) + \langle g(x^i), x - x^i \rangle + \frac{1}{t_i} D_h(x, x^i) \right\}. \tag{6}$$

Notice in (6) by construction that the update requires the capability to solve instances of a linearized subproblem (which we denote by "LS") of the general form:

$$\text{LS:} \qquad x^{\text{new}} \leftarrow \arg\min_{x \in Q} \left\{ \langle c, x \rangle + h(x) \right\}, \tag{7}$$

for suitable iteration-specific values of $c$. Indeed, (6) is an instance of (7) with $c = t_i g(x^i) - \nabla h(x^i)$ at iteration $i$. It is especially important to note that the Mirror Descent update (6) is somewhat meaningless absent the capability to efficiently solve (7), a fact which we return to later. In the usual design and implementation of Mirror Descent for solving (1), one attempts to specify the norm $\|\cdot\|$ and the $1$-strongly convex prox function $h$ in consideration of properties of the feasible domain $Q$ while also ensuring that the LS subproblem (7) is efficiently solvable.

Notice that the Mirror Descent Algorithm (Algorithm 1) itself does not require the traditional set-up that $h$ be $1$-strongly convex for some particular norm; rather this requirement is part of the traditional analysis. As we will see, we can instead analyze Mirror Descent by considering the intrinsic ways that $f$ and $D_h(\cdot, \cdot)$ are related functionally, in a manner that is constructive in terms of actual algorithm design and implementation. Furthermore, this is in the same spirit as was done in the predecessor paper [8].
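To make the role of the LS subproblem concrete, the following is a minimal schematic sketch of Algorithm 1, in which `subgrad`, `grad_h`, and `solve_ls` are hypothetical user-supplied oracles (a subgradient oracle for $f$, the gradient of $h$, and a solver for (7), respectively); Section 5 shows how to construct such oracles for concrete problem classes.

```python
# A minimal sketch of Algorithm 1 (deterministic Mirror Descent).  The callables
# `subgrad` (returns some g(x) in the subdifferential of f at x), `grad_h`
# (gradient of the reference function h), and `solve_ls` (solves the LS
# subproblem (7) for a given vector c) are hypothetical problem-specific oracles.
def mirror_descent(x0, subgrad, grad_h, solve_ls, step_sizes):
    x, iterates = x0, [x0]
    for t in step_sizes:
        c = t * subgrad(x) - grad_h(x)  # update (6) rewritten as LS (7)
        x = solve_ls(c)                 # x_new = argmin_{x in Q} <c, x> + h(x)
        iterates.append(x)
    return iterates                     # averaging/best-iterate can be applied downstream
```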

2.2 Stochastic Setting

For some convex functions, computing an exact subgradient at $x \in Q$ may be expensive or even intractable, but sampling a random stochastic estimate of a subgradient at $x$, which we denote by $\tilde{g}(x)$, may be easy. We say that $\tilde{g}(x)$ is an unbiased stochastic subgradient if $\mathbb{E}[\tilde{g}(x) \mid x] \in \partial f(x)$.

The usefulness of a stochastic subgradient methodology is easily seen in the context of machine and statistical learning problems. A prototypical learning problem is to compute an approximate solution of the following empirical loss minimization problem:

$$\min_x \; f(x) := \frac{1}{n}\sum_{j=1}^n f_j(x) \quad \text{s.t.} \quad x \in Q,$$

where $f_j$ is a non-differentiable convex loss function associated with sample $j$, for $j = 1, \ldots, n$ data samples. When $n \gg 0$, the standard subgradient method needs to evaluate $n$ subgradients in order to compute a subgradient of $f$, which can be prohibitively expensive. A typical alternative is to compute a stochastic subgradient. Letting $x^i$ denote the $i$-th iterate, at iteration $i$ a single sample index $j$ is drawn uniformly and independently from $\{1, \ldots, n\}$, and then a subgradient $g \in \partial f_j(x^i)$ is computed that is used to define $\tilde{g}(x^i) := g$. This stochastic subgradient is then used in place of a subgradient at iteration $i$. Notice that by construction $\tilde{g}(x^i)$ is a conditional random variable given $x^i$, and $\tilde{g}(x^i)$ is an unbiased stochastic subgradient, namely $\mathbb{E}[\tilde{g}(x^i) \mid x^i] \in \partial f(x^i)$.

A stochastic version of Mirror Descent is presented in Algorithm 2. The structure of Stochastic Mirror Descent is identical to that of Mirror Descent, the only difference being that the stochastic estimate of a subgradient $\tilde{g}(x^i)$ replaces the exact subgradient $g(x^i)$ in Algorithm 1.

Algorithm 2: Stochastic Mirror Descent Algorithm with Bregman distance $D_h(\cdot, \cdot)$

Initialize. Initialize with $x^0 \in Q$. Let $h$ be a given differentiable convex function.

At iteration $i$: Perform Updates. Compute a stochastic subgradient $\tilde{g}(x^i)$, determine step-size $t_i$, and compute update:

$$x^{i+1} \leftarrow \arg\min_{x \in Q} \left\{ f(x^i) + \langle \tilde{g}(x^i), x - x^i \rangle + \frac{1}{t_i} D_h(x, x^i) \right\}. \tag{8}$$

A standard condition that is required in the traditional convergence analysis for Stochastic Mirror Descent (as well as other stochastic first-order methods) is that there exists $G_f > 0$ for which:

$$\mathbb{E}\left[\|\tilde{g}(x)\|_*^2 \mid x\right] \le G_f^2 \quad \text{for any } x \in Q. \tag{9}$$

For notational convenience, we say that $f$ is $G_f$-stochastically continuous if (9) holds.
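As an illustration, here is a minimal sketch of the single-sample stochastic subgradient oracle just described for the empirical loss minimization problem; `subgrad_fj` is a hypothetical user-supplied oracle returning some element of $\partial f_j(x)$.

```python
import numpy as np

# A sketch of the sampled oracle described above: draw one index j uniformly
# from {0, ..., n-1} and return a subgradient of the single term f_j.  By
# construction E[g~(x) | x] is the average over j, which lies in the
# subdifferential of f = (1/n) sum_j f_j at x, so the oracle is unbiased.
def stochastic_subgradient(x, n, subgrad_fj, rng=np.random.default_rng()):
    j = int(rng.integers(n))   # j ~ Uniform{0, ..., n-1}
    return subgrad_fj(j, x)    # some element of the subdifferential of f_j at x
```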

In [9], Nedić and Lee developed convergence results for Stochastic Mirror Descent (Algorithm 2). Under the conditions that (i) $f$ is $G_f$-stochastically continuous, (ii) $h$ is a differentiable and $\mu_h$-strongly convex function on $Q$, and (iii) $Q$ is a closed bounded set, Nedić and Lee [9] show the following convergence result using step-sizes $t_i = \frac{\sqrt{\mu_h D_{\max}}}{G_f \sqrt{i+1}}$:

$$\mathbb{E}\left[f(\bar{x}^k)\right] - f^* \le \frac{3 G_f}{2}\sqrt{\frac{D_{\max}}{\mu_h (k+1)}}, \tag{10}$$

where $\bar{x}^k := \frac{\sum_{i=0}^k t_i x^i}{\sum_{i=0}^k t_i}$ and $D_{\max} := \max_{x, y \in Q} D_h(x, y)$. Furthermore, if also (a) $f$ is $\mu_f$-strongly convex, and (b) $h$ is $L_h$-smooth, Nedić and Lee [9] show that with step-sizes $t_i = \frac{2 L_h}{\mu_f (i+2)}$ it holds that:

$$\mathbb{E}\left[f(\check{x}^k)\right] - f^* \le \frac{2 G_f^2 L_h}{\mu_f (k+2)\,\mu_h}, \tag{11}$$

where $\check{x}^k := \frac{2}{(k+1)(k+2)}\sum_{i=0}^k (i+1)\, x^i$.

3 Relative Continuity

In this section we introduce our definition of relative continuity of a function $f$: actually two different definitions, one for the deterministic and another for the stochastic setting. The starting point is a reference function $h$, which is a given differentiable convex function on $Q$ that is used to construct the usual Bregman distance $D_h(\cdot, \cdot)$ (4), and that is used as part of the Mirror Descent update (6). However, we point out for emphasis that unlike the traditional set-up there are no assumptions on $h$ (such as strong or strict convexity).

3.1 Deterministic Setting

Consider the objective function $f$ of (1). We define relative continuity of $f$ relative to the reference function $h$ using the Bregman distance $D_h(\cdot, \cdot)$ of $h$ as follows.

Definition 3.1. $f$ is $M$-relatively continuous with respect to the reference function $h$ on $Q$ if for any $x, y \in Q$, $x \ne y$, and $g(x) \in \partial f(x)$, it holds that

$$\|g(x)\|_* \le \frac{M\sqrt{2 D_h(y, x)}}{\|y - x\|}. \tag{12}$$

(In the particular case when $h(x) = \frac{1}{2}\|x\|_2^2$, the Bregman distance is $D_h(y, x) = \frac{1}{2}\|y - x\|_2^2$, and the relative continuity condition (12) becomes $\|g(x)\|_2 \le M$, which corresponds to the standard definition of Lipschitz continuity (2) for the $\ell_2$-norm.) We can rewrite (12) as

$$\|g(x)\|_*^2 \le M^2\,\frac{2 D_h(y, x)}{\|y - x\|^2}, \tag{13}$$

which states that the square of the (dual) norm of any subgradient is bounded by $M^2$ times the ratio of the Bregman distance $D_h(y, x)$ to $\frac{1}{2}\|y - x\|^2$.
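To make Definition 3.1 tangible, the following numerical sanity check (a sketch; the particular choice of $h$ anticipates the construction of Section 5.1) verifies inequality (12) with $M = 1$ for the univariate example $f(x) = \max\{0, x\} + x^2$ from the Introduction, which is non-smooth and non-Lipschitz.

```python
import numpy as np

# Numerical check of Definition 3.1 for f(x) = max{0, x} + x^2.  Here
# |g(x)| <= 2|x| + 1, so |g(x)|^2 <= 4x^2 + 4|x| + 1, and (anticipating
# Section 5.1) the reference function h(x) = x^4 + (4/3)|x|^3 + x^2/2
# renders f 1-relatively continuous.

def g(x):                     # a subgradient of f at x
    return 2.0 * x + (1.0 if x > 0 else 0.0)

def h(x):
    return x**4 + (4.0 / 3.0) * abs(x)**3 + 0.5 * x**2

def h_prime(x):
    return 4.0 * x**3 + 4.0 * abs(x) * x + x

def D_h(y, x):                # Bregman distance (4) of h
    return h(y) - h(x) - h_prime(x) * (y - x)

rng = np.random.default_rng(0)
for _ in range(10_000):
    x, y = 100.0 * rng.standard_normal(2)
    # relative continuity (12) with M = 1: |g(x)| <= sqrt(2 D_h(y,x)) / |y - x|
    assert abs(g(x)) <= np.sqrt(2.0 * D_h(y, x)) / abs(y - x) + 1e-6
print("relative-continuity inequality (12) holds on all sampled pairs")
```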

The following proposition presents the key property of an $M$-relatively continuous function that is used in the proofs of results to follow.

Proposition 3.1. (Key property of $M$-relative continuity) If $f$ is $M$-relatively continuous with respect to the reference function $h$, then for any $t > 0$ it holds for all $x, y \in Q$ and $g(x) \in \partial f(x)$ that:

$$\frac{1}{t} D_h(y, x) + \langle g(x), y - x \rangle + \frac{t M^2}{2} \ge 0. \tag{14}$$

Proof: If $f$ is $M$-relatively continuous with respect to $h$, then for any $t > 0$ it follows that

$$\langle g(x), y - x \rangle \ge -\|g(x)\|_*\,\|y - x\| \ge -M\sqrt{2 D_h(y, x)} \ge -\frac{t M^2}{2} - \frac{D_h(y, x)}{t},$$

where the last inequality is an application of the arithmetic-geometric mean inequality. The proof follows by rearranging terms. $\square$

The key property (14) is what is used in the proofs of results herein, so we could define $M$-relative continuity using (14) instead of (12). Furthermore, (14) is independent of any norm structure, and so is attractive for its generality. However, we use the definition (12) because it leads to easy verification of $M$-relative continuity in practical instances, as is shown in Section 5.

The following proposition presents some scaling and additivity properties of relative continuity.

Proposition 3.2. (Additivity of Relative Continuity)

1. If $f$ is $M$-relatively continuous with respect to $h$, then for any $\alpha > 0$, $f$ is $\frac{M}{\sqrt{\alpha}}$-relatively continuous with respect to $\alpha h$.

2. If $f$ is $M$-relatively continuous with respect to $h$, then for any $\alpha > 0$, $\alpha f$ is $M$-relatively continuous with respect to $\alpha^2 h$.

3. If $f_j$ is $M$-relatively continuous with respect to $h_j$ for $j = 1, \ldots, n$, then $\sum_{j=1}^n f_j$ is $\sqrt{n}\,M$-relatively continuous with respect to $\sum_{j=1}^n h_j$.

4. If $f_j$ is $M_j$-relatively continuous with respect to $h_j$ for $j = 1, \ldots, n$, then for $\alpha_j > 0$ and $M > 0$ it holds that $\sum_{j=1}^n \alpha_j f_j$ is $\sqrt{n}\,M$-relatively continuous with respect to $\sum_{j=1}^n \alpha_j^2 \beta_j^2 h_j$, with $\beta_j := \frac{M_j}{M}$.

Proof:

1. Let $x, y \in Q$, $x \ne y$, and $g(x) \in \partial f(x)$. It holds that

$$\|g(x)\|_* \le \frac{M\sqrt{2 D_h(y, x)}}{\|y - x\|} = \frac{M}{\sqrt{\alpha}}\cdot\frac{\sqrt{2 D_{\alpha h}(y, x)}}{\|y - x\|},$$

which establishes the result.

2. Notice that $g(x)$ is a subgradient of $f$ at $x$ if and only if $\alpha g(x)$ is a subgradient of $\alpha f$ at $x$, whereby

$$\|\alpha g(x)\|_* = \alpha\|g(x)\|_* \le \frac{\alpha M\sqrt{2 D_h(y, x)}}{\|y - x\|} = \frac{M\sqrt{2 D_{\alpha^2 h}(y, x)}}{\|y - x\|},$$

which establishes the result.

3. Any subgradient of $\sum_{j=1}^n f_j$ at $x$ can be written as $\sum_{j=1}^n g_j(x)$ where $g_j(x) \in \partial f_j(x)$ for $j = 1, \ldots, n$ (see [3]). From the triangle inequality and the relative continuity of

$f_j$ we have:

$$\left\|\sum_{j=1}^n g_j(x)\right\|_* \le \sum_{j=1}^n \|g_j(x)\|_* \le \frac{M\sum_{j=1}^n \sqrt{2 D_{h_j}(y, x)}}{\|y - x\|} \le \frac{\sqrt{n}\,M\sqrt{2\sum_{j=1}^n D_{h_j}(y, x)}}{\|y - x\|} = \frac{\sqrt{n}\,M\sqrt{2 D_{h_1 + \cdots + h_n}(y, x)}}{\|y - x\|},$$

where the third inequality is an application of the $\ell_1/\ell_2$-norm inequality applied to the $n$-tuple $\left(\sqrt{2 D_{h_1}(y, x)}, \ldots, \sqrt{2 D_{h_n}(y, x)}\right)$.

4. It follows from part (2.) that $\alpha_j f_j$ is $M_j$-relatively continuous with respect to $\alpha_j^2 h_j$. Thus $\alpha_j f_j$ is also $M$-relatively continuous with respect to $\alpha_j^2 \beta_j^2 h_j$ from part (1.), whereby the proof is finished by utilizing part (3.). $\square$

We also make use of the notion of relative strong convexity, which was introduced in [8] and is used here in some of the convergence guarantee analyses.

Definition 3.2. $f$ is $\mu$-strongly convex relative to $h$ on $Q$ if there is a scalar $\mu \ge 0$ such that for any $x, y \in Q$ and any $g(x) \in \partial f(x)$ it holds that

$$f(y) \ge f(x) + \langle g(x), y - x \rangle + \mu D_h(y, x). \tag{15}$$

In [8] it was shown that the notion of relative strong convexity embodied in Definition 3.2 is the natural way to define strong convexity in the context of mirror descent and similar algorithms, and it leads to linear convergence of mirror descent in the smooth setting. In the non-smooth setting, we will show in Theorem 4.2 that relative strong convexity improves the convergence of mirror descent from $O(1/\sqrt{k})$ to $O(1/k)$.

3.2 Stochastic Setting

For $x \in Q$, let $\tilde{g}(x)$ denote a random stochastic estimate of a subgradient of $f$ at $x$. Extending the definition of relative continuity from the deterministic setting, we define stochastic relative continuity as follows.

Definition 3.3. $f$ is $G$-stochastically relatively continuous with respect to the reference function $h$ on $Q$ for some $G > 0$ if $f$ together with the oracle to compute a stochastic subgradient satisfies:

1. Unbiasedness property: $\mathbb{E}[\tilde{g}(x) \mid x] \in \partial f(x)$, and

2. Boundedness property: $\mathbb{E}\left[\|\tilde{g}(x)\|_*^2 \mid x\right] \le G^2\,\frac{2 D_h(y, x)}{\|y - x\|^2}$ for all $x, y \in Q$ and $x \ne y$.

(In the particular case when $h(x) = \frac{1}{2}\|x\|_2^2$, the Bregman distance is $D_h(y, x) = \frac{1}{2}\|y - x\|_2^2$, whereby the boundedness property of stochastic relative continuity becomes $\mathbb{E}[\|\tilde{g}(x)\|_2^2 \mid x] \le G^2$ for all $x \in Q$, which corresponds to the standard condition (9) for the $\ell_2$-norm.)

For $x \in Q$, define

$$\tilde{M}(x) := \|\tilde{g}(x)\|_* \cdot \max_{y \in Q,\, y \ne x} \frac{\|y - x\|}{\sqrt{2 D_h(y, x)}}. \tag{16}$$

Notice for a given $x$ that $\max_{y \in Q,\, y \ne x} \frac{\|y - x\|}{\sqrt{2 D_h(y, x)}}$ is a deterministic quantity, and therefore $\tilde{M}(x)$ is a conditional random variable (given $x$) that is defined on the same probability space as $\tilde{g}(x)$. Clearly, if $f$ is $G$-stochastically relatively continuous, we have by the boundedness property that for any $x \in Q$

$$\mathbb{E}\left[\tilde{M}(x)^2 \mid x\right] = \mathbb{E}\left[\|\tilde{g}(x)\|_*^2 \mid x\right] \cdot \max_{y \in Q,\, y \ne x} \frac{\|y - x\|^2}{2 D_h(y, x)} \le G^2. \tag{17}$$

Exactly as in the deterministic setting, we have:

Proposition 3.3. If $f$ is $G$-stochastically relatively continuous with respect to the reference function $h$ on $Q$, then for any $t > 0$ it holds for all $x, y \in Q$ and any stochastic subgradient estimate $\tilde{g}(x)$ that:

$$\frac{1}{t} D_h(y, x) + \langle \tilde{g}(x), y - x \rangle + \frac{t\,\tilde{M}(x)^2}{2} \ge 0.$$

Proof: For any $t > 0$, we have

$$\langle \tilde{g}(x), y - x \rangle \ge -\|\tilde{g}(x)\|_*\,\|y - x\| \ge -\tilde{M}(x)\sqrt{2 D_h(y, x)} \ge -\frac{t\,\tilde{M}(x)^2}{2} - \frac{D_h(y, x)}{t},$$

and the proof is finished by rearranging terms. $\square$

4 Computational Analysis for Stochastic Mirror Descent and (Deterministic) Mirror Descent

In this section we present computational guarantees for Stochastic Mirror Descent (Algorithm 2) for minimizing a convex function $f$ that is $G$-stochastically relatively continuous with respect to a given reference function $h$. We also present computational guarantees for (deterministic) Mirror Descent (Algorithm 1) when $f$ is $M$-relatively continuous with respect to a reference function $h$, which follows as a special case of the stochastic setting.

We begin by recalling the standard Three-Point Property for optimization using Bregman distances:

Lemma 4.1. (Three-Point Property (Tseng [14])) Let $\phi(x)$ be a convex function, and let $D_h(\cdot, \cdot)$ be the Bregman distance for $h$. For a given vector $z$, let

$$z^+ := \arg\min_{x \in Q}\left\{\phi(x) + D_h(x, z)\right\}.$$

Then

$$\phi(x) + D_h(x, z) \ge \phi(z^+) + D_h(z^+, z) + D_h(x, z^+) \quad \text{for all } x \in Q.$$

Let us denote the (primitive) random variable at the $i$-th iteration of the Stochastic Mirror Descent Algorithm (Algorithm 2) by $\gamma_i$, i.e., $\gamma_i$ is the random variable that determines the (stochastic) subgradient $\tilde{g}(x^i)$ at iterate $x^i$ in the Stochastic Mirror Descent Algorithm. Then $x^{i+1}$ is computed according to the update of the Stochastic Mirror Descent Algorithm, whereby $x^{i+1}$ is a random variable which depends on all previous values $\gamma_0, \ldots, \gamma_i$, and we denote this string of random variables by $\xi_i := \{\gamma_0, \ldots, \gamma_i\}$.

The following theorem states convergence guarantees for the Stochastic Mirror Descent Algorithm in terms of expectation.

Theorem 4.1. (Convergence Bound for Stochastic Mirror Descent Algorithm) Consider the Stochastic Mirror Descent Algorithm (Algorithm 2) with given step-size sequence $\{t_i\}$. If $f$ is $G$-stochastically relatively continuous with respect to $h$ for some $G > 0$, then the following inequality holds for all $k \ge 0$ and $x \in Q$:

$$\mathbb{E}_{\xi_k}\left[f(\bar{x}^k)\right] - f(x) \le \frac{\frac{G^2}{2}\sum_{i=0}^k t_i^2 + D_h(x, x^0)}{\sum_{i=0}^k t_i}, \tag{18}$$

where $\bar{x}^k := \frac{\sum_{i=0}^k t_i x^i}{\sum_{i=0}^k t_i}$.

Proof: First notice that

$$\begin{aligned}
f(x^i) + \langle g(x^i), x - x^i \rangle
&= f(x^i) + \langle \mathbb{E}_{\gamma_i}[\tilde{g}(x^i) \mid x^i],\, x - x^i \rangle \\
&= f(x^i) + \mathbb{E}_{\gamma_i}\left[\langle \tilde{g}(x^i), x - x^i \rangle \mid x^i\right] \\
&\ge f(x^i) + \mathbb{E}_{\gamma_i}\left[\langle \tilde{g}(x^i), x^{i+1} - x^i \rangle + \tfrac{1}{t_i} D_h(x^{i+1}, x^i) + \tfrac{1}{t_i} D_h(x, x^{i+1}) - \tfrac{1}{t_i} D_h(x, x^i) \,\middle|\, x^i\right] \\
&\ge f(x^i) + \mathbb{E}_{\gamma_i}\left[-\tfrac{t_i}{2}\tilde{M}(x^i)^2 + \tfrac{1}{t_i} D_h(x, x^{i+1}) - \tfrac{1}{t_i} D_h(x, x^i) \,\middle|\, x^i\right] \\
&\ge f(x^i) - \tfrac{G^2}{2} t_i + \tfrac{1}{t_i}\,\mathbb{E}_{\gamma_i}\left[D_h(x, x^{i+1}) \mid x^i\right] - \tfrac{1}{t_i} D_h(x, x^i),
\end{aligned} \tag{19}$$

where the first equality uses the unbiasedness of $\tilde{g}(x)$, the second equality is because of linearity, the first inequality is from the Three-Point Property with $\phi(x) = t_i\langle \tilde{g}(x^i), x - x^i \rangle$, the second inequality uses Proposition 3.3, and the last inequality uses (17). Since also $f(x) \ge f(x^i) + \langle g(x^i), x - x^i \rangle$ from the definition of a subgradient (where $g(x^i) := \mathbb{E}_{\gamma_i}[\tilde{g}(x^i) \mid x^i] \in \partial f(x^i)$), we have from (19):

$$f(x) \ge f(x^i) - \frac{G^2}{2} t_i + \frac{1}{t_i}\,\mathbb{E}_{\gamma_i}\left[D_h(x, x^{i+1}) \mid x^i\right] - \frac{1}{t_i} D_h(x, x^i).$$

Taking expectation with respect to $\xi_i$ on both sides of the above inequality yields:

$$f(x) \ge \mathbb{E}_{\xi_i}[f(x^i)] - \frac{G^2}{2} t_i + \frac{1}{t_i}\,\mathbb{E}_{\xi_i}[D_h(x, x^{i+1})] - \frac{1}{t_i}\,\mathbb{E}_{\xi_i}[D_h(x, x^i)]. \tag{20}$$

Now rearrange and multiply through by $t_i$ to yield:

$$t_i\,\mathbb{E}_{\xi_i}[f(x^i) - f(x)] \le \frac{G^2}{2} t_i^2 + \mathbb{E}_{\xi_i}[D_h(x, x^i)] - \mathbb{E}_{\xi_i}[D_h(x, x^{i+1})].$$

Summing the above inequality over $i$ and noting that $D_h(x, x^{k+1}) \ge 0$, we arrive at:

$$\frac{G^2}{2}\sum_{i=0}^k t_i^2 + D_h(x, x^0) \ge \sum_{i=0}^k t_i\,\mathbb{E}_{\xi_i}[f(x^i) - f(x)] = \mathbb{E}_{\xi_k}\left[\sum_{i=0}^k t_i\left(f(x^i) - f(x)\right)\right] \ge \left(\sum_{i=0}^k t_i\right)\mathbb{E}_{\xi_k}\left[f(\bar{x}^k) - f(x)\right],$$

where the last inequality uses the convexity of $f$. Dividing by $\sum_{i=0}^k t_i$ completes the proof. $\square$

Remark 4.1. As a direct consequence of (18), we obtain the following result, which is similar to the deterministic setting (5):

$$\mathbb{E}_{\xi_k}\left[\min_{0 \le i \le k} f(x^i)\right] - f(x) \le \frac{\frac{G^2}{2}\sum_{i=0}^k t_i^2 + D_h(x, x^0)}{\sum_{i=0}^k t_i}. \tag{21}$$

Theorem 4.1 implies the following high-probability result using a simple Markov bound.

Corollary 4.1. Let $x^*$ be an optimal solution of (1). Under the conditions of Theorem 4.1, for any $\delta > 0$ it holds that:

$$\mathbb{P}\left[f(\bar{x}^k) - f^* \ge \delta\right] \le \frac{\frac{G^2}{2}\sum_{i=0}^k t_i^2 + D_h(x^*, x^0)}{\delta\sum_{i=0}^k t_i}.$$

Proof: Using the Markov inequality, we have:

$$\mathbb{P}\left[f(\bar{x}^k) - f^* \ge \delta\right] \le \frac{\mathbb{E}\left[f(\bar{x}^k) - f^*\right]}{\delta} \le \frac{\frac{G^2}{2}\sum_{i=0}^k t_i^2 + D_h(x^*, x^0)}{\delta\sum_{i=0}^k t_i}. \qquad \square$$

Similar to the traditional analysis of stochastic mirror descent, the Stochastic Mirror Descent Algorithm (Algorithm 2) leads to an $O(\frac{1}{\varepsilon^2})$ convergence guarantee (in expectation) by using an appropriate step-size sequence $\{t_i\}$, as the next corollary shows.

Corollary 4.2. Under the conditions of Theorem 4.1, for a given $\varepsilon > 0$ suppose that the step-sizes are set to $t_i := \frac{\varepsilon}{G^2}$ for all $i$. Then within

$$k := \left\lceil \frac{2 G^2 D_h(x^*, x^0)}{\varepsilon^2} \right\rceil$$

iterations of the Stochastic Mirror Descent Algorithm it holds that:

$$\mathbb{E}\left[f(\bar{x}^k)\right] - f^* \le \varepsilon,$$

where $x^*$ is any optimal solution of (1).

Proof: Substituting the values of $t_i$ in (18) yields the result directly. $\square$
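In code, the schedule of Corollary 4.2 is a one-liner; here `D` is any available upper bound on $D_h(x^*, x^0)$ (an estimate supplied by the user, since $x^*$ is unknown):

```python
import math

# Direct transcription of Corollary 4.2: constant steps t_i = eps / G^2 and
# k = ceil(2 G^2 D / eps^2) iterations give E[f(xbar_k)] - f* <= eps, where
# D >= D_h(x*, x0) is a user-supplied upper bound on the initial Bregman distance.
def smd_schedule(G, D, eps):
    t = eps / G**2
    k = math.ceil(2.0 * G**2 * D / eps**2)
    return t, k
```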

Remark 4.2. Similar to the standard stochastic gradient descent scheme, the step-size rule in Corollary 4.2 leads to the optimal rate of convergence provided by the bound in Theorem 4.1.

Remark 4.3. Let us now compare these results to the related results of Nedić and Lee [9]. In order to attain an $\varepsilon$-optimality gap, [9] proved a bound of $\frac{9 G_f^2 D_{\max}}{4\mu_h\varepsilon^2}$ iterations, which follows by rearranging (10). In addition to not requiring Lipschitz continuity of $f$, our bound does not require that $h$ be strongly convex. We also do not require boundedness of the feasible region; and in most settings $D_h(x^*, x^0) \ll D_{\max}$ even when $D_{\max} < +\infty$. Furthermore, even in the setting of (10), it holds that:

$$G^2 = \mathbb{E}\left[\|\tilde{g}(x)\|_*^2 \mid x\right] \cdot \max_{y \in Q,\, y \ne x} \frac{\|y - x\|^2}{2 D_h(y, x)} \le \frac{\mathbb{E}\left[\|\tilde{g}(x)\|_*^2 \mid x\right]}{\mu_h} \le \frac{G_f^2}{\mu_h},$$

where the first inequality utilizes the strong convexity (in the standard sense) of $h$, and the second inequality is due to the assumption that $f$ is $G_f$-stochastically continuous (in the standard sense). Thus we see that the bound in Corollary 4.2 improves on the bound in [9].

In the case when $f$ is also $\mu$-strongly convex relative to $h$ (see Definition 3.2), we obtain an $O(\frac{1}{k})$ convergence guarantee in expectation, which is also similar to the traditional case of stochastic gradient descent. This is shown in the next result.

Theorem 4.2. (Convergence Bound for Stochastic Mirror Descent Algorithm under Strong Convexity relative to $h$) Consider the Stochastic Mirror Descent Algorithm (Algorithm 2). If $f$ is $G$-stochastically relatively continuous with respect to $h$ for some $G > 0$ and $f$ is $\mu$-strongly convex relative to $h$ for some $\mu > 0$, and if the step-sizes are chosen as $t_i = \frac{2}{\mu(i+1)}$, then the following inequality holds for all $k \ge 1$:

$$\mathbb{E}_{\xi_k}\left[f(\hat{x}^k)\right] - f^* \le \frac{2 G^2}{\mu(k+1)},$$

where $\hat{x}^k := \frac{2}{k(k+1)}\sum_{i=0}^k i\, x^i$.

Proof: For any $x \in Q$ it follows from the definition of $\mu$-strong convexity relative to $h$ (15) that

$$f(x) \ge f(x^i) + \langle g(x^i), x - x^i \rangle + \mu D_h(x, x^i).$$

Combining the above inequality with (19) yields

$$f(x) \ge f(x^i) - \frac{G^2}{2} t_i + \frac{1}{t_i}\,\mathbb{E}_{\gamma_i}\left[D_h(x, x^{i+1}) \mid x^i\right] + \left(\mu - \frac{1}{t_i}\right) D_h(x, x^i).$$

Substituting $t_i = \frac{2}{\mu(i+1)}$ and multiplying by $i$ in the above inequality yields:

$$\begin{aligned}
i\left(f(x^i) - f(x)\right) &\le \frac{G^2 i}{\mu(i+1)} + \frac{\mu}{2}\left(i(i-1) D_h(x, x^i) - i(i+1)\,\mathbb{E}_{\gamma_i}\left[D_h(x, x^{i+1}) \mid x^i\right]\right) \\
&\le \frac{G^2}{\mu} + \frac{\mu}{2}\left(i(i-1) D_h(x, x^i) - i(i+1)\,\mathbb{E}_{\gamma_i}\left[D_h(x, x^{i+1}) \mid x^i\right]\right).
\end{aligned}$$

Taking expectation over $\xi_i$ and summing the above inequality over $i$ (the parenthesized terms telescope) then yields

$$\left(\sum_{i=0}^k i\right)\mathbb{E}_{\xi_k}\left[f(\hat{x}^k) - f(x)\right] \le \sum_{i=0}^k i\left(\mathbb{E}_{\xi_i}[f(x^i)] - f(x)\right) \le \frac{k G^2}{\mu} - \frac{\mu}{2}\, k(k+1)\,\mathbb{E}_{\xi_k}\left[D_h(x, x^{k+1})\right] \le \frac{k G^2}{\mu},$$

where the first inequality uses the convexity of $f$ and the observation that $\mathbb{E}_{\xi_i}[f(x^i)] = \mathbb{E}_{\xi_k}[f(x^i)]$ for $i \le k$. Taking $x = x^*$ where $x^*$ is an optimal solution of (1), the proof is completed by noticing that $\sum_{i=0}^k i = \frac{k(k+1)}{2}$. $\square$

Remark 4.4. It may not be easy to find cases when the objective function is both $G$-stochastically relatively continuous and $\mu$-strongly convex relative to the same reference function $h$. However, as long as these properties are satisfied along the path of iterates or around the minimum, one can achieve the faster convergence of Theorem 4.2.

Remark 4.5. Let us also compare the computational guarantee of Theorem 4.2 to the results in Nedić and Lee [9]. In order to attain an $\varepsilon$-optimality gap, [9] proved the bound (11). First notice that we do not require that $f$ be uniformly Lipschitz continuous, that $h$ be strongly convex in the traditional sense, or that $h$ be uniformly smooth. However, even if these requirements hold, it follows from Remark 4.3 that $G^2 \le \frac{G_f^2}{\mu_h}$, and it also holds that:

$$D_f(x, y) \ge \frac{\mu_f}{2}\|x - y\|^2 \ge \frac{\mu_f}{L_h} D_h(x, y),$$

where the first inequality utilizes that $f$ is $\mu_f$-strongly convex and the second inequality utilizes that $h$ is $L_h$-smooth in the standard sense. Thus $f$ is at least $\mu = \frac{\mu_f}{L_h}$-strongly convex relative to $h$ (this follows by applying the corresponding result in [8]). Therefore, even under the stronger requirements of [9], Theorem 4.2 improves on the corresponding result in [9].

We end this section with a discussion of the deterministic setting, namely the (Deterministic) Mirror Descent Algorithm (Algorithm 1). Suppose that there is no stochasticity in the computation of subgradients. We can cast this as an instance of the Stochastic Mirror Descent Algorithm (Algorithm 2) wherein $\tilde{g}(x) = g(x) \in \partial f(x)$ for all $x \in Q$. In this case stochastic relative continuity (Definition 3.3) is equivalent to relative continuity (Definition 3.1) with the same constant. Thus deterministic Mirror Descent is a special case of Stochastic Mirror Descent, and we have the following computational guarantees as special cases of the stochastic case.

Theorem 4.3. (Convergence Bound for Deterministic Mirror Descent Algorithm) Consider the (Deterministic) Mirror Descent Algorithm (Algorithm 1). If $f$ is $M$-relatively continuous with respect to $h$ for some $M > 0$, then for all $k \ge 0$ and $x \in Q$ the following inequality holds:

$$f(\bar{x}^k) - f(x) \le \frac{\frac{M^2}{2}\sum_{i=0}^k t_i^2 + D_h(x, x^0)}{\sum_{i=0}^k t_i},$$

where $\bar{x}^k := \frac{\sum_{i=0}^k t_i x^i}{\sum_{i=0}^k t_i}$.

Corollary 4.3. Under the conditions of Theorem 4.3, for a given $\varepsilon > 0$ suppose that the step-sizes are set to $t_i := \frac{\varepsilon}{M^2}$ for all $i$. Then within

$$k := \left\lceil \frac{2 M^2 D_h(x^*, x^0)}{\varepsilon^2} \right\rceil$$

iterations of Deterministic Mirror Descent it holds that:

$$f(\bar{x}^k) - f^* \le \varepsilon,$$

where $\bar{x}^k := \frac{1}{k+1}\sum_{i=0}^k x^i$, and $x^*$ is any optimal solution of (1).

Theorem 4.4. (Convergence Bound for Deterministic Mirror Descent with Relative Strong Convexity) Consider the Deterministic Mirror Descent Algorithm (Algorithm 1). If $f$ is $M$-relatively continuous with respect to $h$ for some $M > 0$ and $f$ is $\mu$-strongly convex relative to $h$ for some $\mu > 0$, and if the step-sizes are chosen as $t_i = \frac{2}{\mu(i+1)}$, then the following inequality holds for all $k \ge 1$:

$$f(\hat{x}^k) - f^* \le \frac{2 M^2}{\mu(k+1)},$$

where $\hat{x}^k := \frac{2}{k(k+1)}\sum_{i=0}^k i\, x^i$.

5 Specifying a Reference Function h with Relative Continuity for Mirror Descent

Let us discuss using either deterministic or stochastic Mirror Descent (Algorithm 1 or Algorithm 2) for solving the optimization problem (1) with objective function $f$ that is $M$-relatively continuous or $G$-stochastically relatively continuous (respectively) with respect to the reference function $h$. In order to efficiently execute the update step in Algorithm 1 and/or Algorithm 2 we need $h$ to be such that the linearization subproblem LS (7) is efficiently solvable for any given $c$. Therefore, in order to execute Mirror Descent for solving (1) using Algorithm 1 or Algorithm 2, we need to specify a differentiable convex reference function $h$ that has the following two properties:

(i) $f$ is $M$-relatively continuous (or $G$-stochastically relatively continuous) with respect to $h$ on $Q$, for an $M$ (or $G$) that is easy to determine, and

(ii) the linearization subproblem LS (7) has a solution, and the solution is efficiently computable.

We now discuss quite broadly how to construct such a reference function $h$ with these two properties when $\|g(x)\|_2^2$ is bounded by a polynomial in $\|x\|_2$.

5.1 Deterministic Setting

Suppose that

$$\|g(x)\|_2^2 \le p_r(\|x\|_2) \quad \text{for all } x \in Q \text{ and all } g(x) \in \partial f(x),$$

where $p_r(\alpha) = \sum_{i=0}^r a_i \alpha^i$ is an $r$-degree polynomial of $\alpha$ whose coefficients $\{a_i\}$ are nonnegative. Let

$$h(x) := \sum_{i=0}^r \frac{a_i}{i+2}\|x\|_2^{i+2}.$$

Then the following proposition states that $f$ is $1$-relatively continuous with respect to $h$. This implies that no matter how fast the subgradient of $f$ grows polynomially as a function of $\|x\|_2$, $f$ is relatively continuous with respect to the simple reference function $h$, even though $f$ does not exhibit uniform Lipschitz continuity.

Proposition 5.1. $f$ is $1$-relatively continuous relative to $h(x) = \sum_{i=0}^r \frac{a_i}{i+2}\|x\|_2^{i+2}$.

Proof: Let $h_i(x) = \frac{1}{i+2}\|x\|_2^{i+2}$; then $h(x) = \sum_{i=0}^r a_i h_i(x)$, and by the definition of the Bregman distance we have

$$D_{h_i}(y, x) = \frac{1}{i+2}\|y\|_2^{i+2} - \frac{1}{i+2}\|x\|_2^{i+2} - \|x\|_2^i\langle x, y - x \rangle = \frac{1}{i+2}\left(\|y\|_2^{i+2} + (i+1)\|x\|_2^{i+2} - (i+2)\|x\|_2^i\langle x, y \rangle\right).$$

Notice that

$$\begin{aligned}
\|y\|_2^{i+2} + (i+1)\|x\|_2^{i+2} - (i+2)\|x\|_2^i\langle x, y \rangle
&= \left(\|y\|_2^{i+2} + \tfrac{i}{2}\|x\|_2^{i+2} - \tfrac{i+2}{2}\|x\|_2^i\|y\|_2^2\right) + \tfrac{i+2}{2}\left(\|x\|_2^{i+2} - 2\|x\|_2^i\langle x, y \rangle + \|x\|_2^i\|y\|_2^2\right) \\
&\ge 0 + \tfrac{i+2}{2}\,\|x\|_2^i\left(\|x\|_2^2 - 2\langle x, y \rangle + \|y\|_2^2\right) = \tfrac{i+2}{2}\,\|x\|_2^i\,\|y - x\|_2^2,
\end{aligned}$$

where the inequality above is an application of the arithmetic-geometric mean inequality $a^\lambda b^{1-\lambda} \le \lambda a + (1-\lambda) b$ with $a = \|x\|_2^{i+2}$, $b = \|y\|_2^{i+2}$, and $\lambda = \frac{i}{i+2}$. Thus we have

$$D_h(y, x) = \sum_{i=0}^r a_i D_{h_i}(y, x) \ge \frac{1}{2}\|y - x\|_2^2\left(\sum_{i=0}^r a_i\|x\|_2^i\right). \tag{22}$$

Therefore

$$\|g(x)\|_2^2 \le p_r(\|x\|_2) = \sum_{i=0}^r a_i\|x\|_2^i \le \frac{2 D_h(y, x)}{\|y - x\|_2^2},$$

which shows that $f$ is $1$-relatively continuous with respect to $h$. $\square$

Solving the linearization subproblem (7). Let us see how we can solve the linearization subproblem (7) for this class of optimization problems. In the case $Q = \mathbb{R}^n$, the linearization subproblem (7) can be written as

$$\text{LS:} \qquad \min_{x \in \mathbb{R}^n}\; \langle c, x \rangle + \sum_{i=0}^r \frac{a_i}{i+2}\|x\|_2^{i+2}, \tag{23}$$

and the first-order optimality condition is simply:

$$c + \left(\sum_{i=0}^r a_i\|x\|_2^i\right) x = 0, \tag{24}$$

whereby $x = -\theta c$ for some scalar $\theta \ge 0$, and it remains to simply determine the value of the nonnegative scalar $\theta$. In the case when $c = 0$, $x = 0$ satisfies (24), so let us examine the case when $c \ne 0$, in which case from (24) $\theta$ must satisfy:

$$\sum_{i=0}^r a_i\|c\|_2^i\,\theta^{i+1} = 1,$$

whose left-hand side is monotone increasing in $\theta \ge 0$, which implies that $\theta$ is the unique nonnegative root of a monotone univariate polynomial equation. For $r \in \{0, 1, 2, 3\}$ this root can be computed in closed form. Otherwise the root can be computed efficiently (up to machine precision) using any suitable root-finding method.

Remark 5.1. We can incorporate a simple set constraint $x \in Q$ in problem (23) provided that we can easily compute the Euclidean projection onto $Q$. In this case, the linearization subproblem (7) can be converted to a 1-dimensional convex optimization problem; see Appendix A of [8] for details.

5.2 Stochastic Setting

In the stochastic setting, the stochastic subgradient $\tilde{g}(x)$ is a conditional random variable for a given $x$. Suppose that

$$\mathbb{E}\left[\|\tilde{g}(x)\|_2^2 \mid x\right] \le p_r(\|x\|_2) \quad \text{for all } x \in Q,$$

where $p_r(\alpha) = \sum_{i=0}^r a_i\alpha^i$ is an $r$-degree polynomial whose coefficients $\{a_i\}$ are nonnegative. Let

$$h(x) := \sum_{i=0}^r \frac{a_i}{i+2}\|x\|_2^{i+2},$$

and similar to the deterministic case we have:

Proposition 5.2. $f$ is $1$-stochastically relatively continuous relative to $h(x) = \sum_{i=0}^r \frac{a_i}{i+2}\|x\|_2^{i+2}$.

Proof: For any $x, y \in Q$ with $x \ne y$ we have:

$$\mathbb{E}\left[\|\tilde{g}(x)\|_2^2 \mid x\right] \le p_r(\|x\|_2) = \sum_{i=0}^r a_i\|x\|_2^i \le \frac{2 D_h(y, x)}{\|y - x\|_2^2}, \tag{25}$$

where the last inequality follows from (22), and thus $f$ is $1$-stochastically relatively continuous relative to $h$. $\square$

Solving the linearization subproblem (7). The linearization subproblem is identical in structure to that in the deterministic case, and so can be solved as discussed at the end of Section 5.1.
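The discussion above translates directly into a small solver. The following sketch (our own illustrative implementation, not prescribed by the paper) solves LS (23) for $Q = \mathbb{R}^n$ by finding the nonnegative root via numpy's polynomial root-finder, and assembles the Mirror Descent update (6):

```python
import numpy as np

def grad_h(x, a):
    """Gradient of h(x) = sum_i a_i/(i+2) ||x||^{i+2}: grad h(x) = (sum_i a_i ||x||^i) x."""
    nx = np.linalg.norm(x)
    return sum(ai * nx**i for i, ai in enumerate(a)) * x

def solve_ls(c, a):
    """Solve LS (23) with Q = R^n for coefficients a = [a_0, ..., a_r].

    By (24) the solution is x = -theta * c, where theta >= 0 solves
    sum_i a_i ||c||^i theta^{i+1} = 1, whose left side is increasing in theta."""
    nc = np.linalg.norm(c)
    if nc == 0.0:
        return np.zeros_like(c)
    deg = len(a)                            # the root-finding equation has degree r+1
    coeffs = np.zeros(deg + 1)              # np.roots wants highest degree first
    for i, ai in enumerate(a):
        coeffs[deg - (i + 1)] = ai * nc**i  # coefficient of theta^{i+1}
    coeffs[deg] = -1.0                      # constant term
    roots = np.roots(coeffs)
    theta = min(r.real for r in roots
                if abs(r.imag) <= 1e-8 * max(1.0, abs(r.real)) and r.real > 0.0)
    return -theta * c

def mirror_descent_step(x, g, t, a):
    """One update (6) of Algorithm 1 with the polynomial reference function above."""
    return solve_ls(t * g - grad_h(x, a), a)
```

For $r = 2$ (the case arising for SVM and IEP below), the equation is the cubic $a_2\|c\|_2^2\theta^3 + a_1\|c\|_2\theta^2 + a_0\theta = 1$, which can also be solved in closed form.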

5.3 Relative Continuity for instances of SVM and IEP

Here we examine in detail the two motivating examples stated in the Introduction, namely the Support Vector Machine (SVM) problem and the Intersection of Ellipsoids Problem (IEP). We first prove the following lemma, which presents upper bounds on the Bregman distances $D_h(\cdot, \cdot)$ for $h(x) = \frac{1}{3}\|x\|_2^3$ and $h(x) = \frac{1}{4}\|x\|_2^4$.

Lemma 5.1.

1. Let $h(x) := \frac{1}{3}\|x\|_2^3$. Then $D_h(y, x) \le \frac{1}{3}\|y - x\|_2^2\left(\|y\|_2 + 2\|x\|_2\right)$.

2. Let $h(x) := \frac{1}{4}\|x\|_2^4$. Then $D_h(y, x) \le \frac{1}{4}\|y - x\|_2^2\left((\|y\|_2 + \|x\|_2)^2 + 2\|x\|_2^2\right)$.

Proof:

1. By the definition of the Bregman distance,

$$D_h(y, x) = \frac{1}{3}\left(\|y\|_2^3 + 2\|x\|_2^3 - 3\|x\|_2\langle x, y \rangle\right) = \frac{1}{3}\left(\|y - x\|_2^2\left(\|y\|_2 + 2\|x\|_2\right) - \left(2\|y\|_2 + \|x\|_2\right)\left(\|x\|_2\|y\|_2 - \langle x, y \rangle\right)\right) \le \frac{1}{3}\|y - x\|_2^2\left(\|y\|_2 + 2\|x\|_2\right),$$

where the first equality follows from simplifying and combining terms, the second equality can be verified by expanding and matching terms, and the inequality follows from the Cauchy-Schwarz inequality $\langle x, y \rangle \le \|x\|_2\|y\|_2$.

2. Similarly,

$$D_h(y, x) = \frac{1}{4}\left(\|y\|_2^4 + 3\|x\|_2^4 - 4\|x\|_2^2\langle x, y \rangle\right) = \frac{1}{4}\left(\left(\|y\|_2^2 - \|x\|_2^2\right)^2 + 2\|x\|_2^2\|y - x\|_2^2\right) \le \frac{1}{4}\|y - x\|_2^2\left((\|y\|_2 + \|x\|_2)^2 + 2\|x\|_2^2\right),$$

where the first equality follows from simplifying and combining terms, the second equality can be verified by expanding, and the inequality uses $\left|\|y\|_2 - \|x\|_2\right| \le \|y - x\|_2$. $\square$

Support Vector Machine (SVM). The Support Vector Machine is an important supervised learning model for binary classification in machine learning. The SVM optimization problem for binary classification is:

$$\text{SVM:} \qquad \min_x \; f(x) := \frac{1}{n}\sum_{i=1}^n \max\{0,\, 1 - y_i x^T w_i\} + \frac{\lambda}{2}\|x\|_2^2, \tag{26}$$

where $w_i$ is the input feature vector of sample $i$ and $y_i \in \{-1, 1\}$ is the label of sample $i$. Notice that $f$ is not differentiable due to the presence of the hinge loss terms in the summation, and $f$ is also not Lipschitz continuous due to the presence of the $\ell_2$-norm-squared regularization term; thus we cannot directly utilize typical subgradient or gradient schemes and their associated computational guarantees in the analysis of (26).

Researchers have developed various approaches to overcome this limitation. For example, [5] introduced a splitting subgradient-type method, where the basic idea is to split the loss function and the regularization terms. [16] introduced a quasi-Newton method that does not need to contend with the unbounded subgradients. Another approach is to a priori constrain $x$ to lie in an $\ell_2$-ball of radius $R$, for $R$ sufficiently large so that the ball contains the optimal solution, and to project onto this ball at each iteration; in this approach $f$ is Lipschitz continuous on the amended feasible region, see [13]. Indeed, one can show using quadratic

optimization optimality conditions that it suffices to set $R = \min\left\{\frac{1}{n\lambda}\sum_{i=1}^n\|w_i\|_2,\ \sqrt{2/\lambda}\right\}$ (see the Appendix), wherein the modulus of Lipschitz continuity on the amended feasible region is at most $M = \frac{1}{n}\sum_{i=1}^n\|w_i\|_2 + \min\left\{\sqrt{2\lambda},\ \frac{1}{n}\sum_{i=1}^n\|w_i\|_2\right\}$. Furthermore, in [7] the authors show that if the initial point lies within a suitably chosen large ball, then Stochastic Subgradient Descent with a small step-size ensures in expectation that all iterates lie in the large ball, which then ensures that the norms of all subgradients are bounded in expectation.

Let us see how we can directly use the constructs of relative continuity to tackle the SVM problem with a suitably designed version of Stochastic Mirror Descent without any projection step onto a ball. We can rewrite the objective function of (26) as

$$f(x) = \frac{1}{n}\sum_{j=1}^n f_j(x), \qquad \text{where } f_j(x) = \max\{0,\, 1 - y_j x^T w_j\} + \frac{\lambda}{2}\|x\|_2^2.$$

We consider computing a stochastic estimate of the subgradient of $f$ by using a single sample index $j$ drawn randomly from $\{1, \ldots, n\}$, namely $\tilde{g}(x) \in \partial f_j(x)$ where $j$ is drawn uniformly at random from $\{1, \ldots, n\}$. Then $\|\tilde{g}(x)\|_2 \le \lambda\|x\|_2 + \|w_j\|_2$, whereby

$$\mathbb{E}\left[\|\tilde{g}(x)\|_2^2 \mid x\right] \le \lambda^2\|x\|_2^2 + 2\lambda\left(\frac{1}{n}\sum_{i=1}^n\|w_i\|_2\right)\|x\|_2 + \frac{1}{n}\sum_{i=1}^n\|w_i\|_2^2,$$

and notice that the right-hand side is a polynomial in $\|x\|_2$ of degree $r = 2$. If we choose the reference function $h$ as

$$h(x) := \frac{\lambda^2}{4}\|x\|_2^4 + \frac{2\lambda}{3n}\left(\sum_{i=1}^n\|w_i\|_2\right)\|x\|_2^3 + \frac{1}{2n}\left(\sum_{i=1}^n\|w_i\|_2^2\right)\|x\|_2^2, \tag{27}$$

it follows from Proposition 5.2 that $f$ is $1$-stochastically relatively continuous relative to $h(x)$.

Proposition 5.3. (Computational Guarantees for Stochastic Mirror Descent for the SVM problem (26)) Consider applying the Stochastic Mirror Descent algorithm (Algorithm 2) to the Support Vector Machine problem (26) using the reference function (27). For an absolute optimality tolerance value $\varepsilon > 0$, and using the constant step-sizes $t_i := \varepsilon$, let the algorithm be run for

$$k := \left\lceil \frac{\|x^* - x^0\|_2^2\left(3\lambda^2\left((\|x^*\|_2 + \|x^0\|_2)^2 + 2\|x^0\|_2^2\right) + \frac{8\lambda}{n}\left(\sum_{i=1}^n\|w_i\|_2\right)\left(\|x^*\|_2 + 2\|x^0\|_2\right) + \frac{6}{n}\sum_{i=1}^n\|w_i\|_2^2\right)}{6\varepsilon^2} \right\rceil$$

iterations, where $x^*$ is the optimal solution of (26). Then it holds that

$$\mathbb{E}\left[f(\bar{x}^k)\right] - f^* \le \varepsilon,$$

where $\bar{x}^k := \frac{1}{k+1}\sum_{i=0}^k x^i$.

Proof: We showed above (using Proposition 5.2) that $f$ is $1$-stochastically relatively continuous relative to $h$ defined in (27). Furthermore, applying Lemma 5.1 it follows that

$$D_h(x^*, x^0) \le \frac{\lambda^2}{4}\|x^* - x^0\|_2^2\left((\|x^*\|_2 + \|x^0\|_2)^2 + 2\|x^0\|_2^2\right) + \frac{2\lambda}{3n}\left(\sum_{i=1}^n\|w_i\|_2\right)\|x^* - x^0\|_2^2\left(\|x^*\|_2 + 2\|x^0\|_2\right) + \frac{1}{2n}\left(\sum_{i=1}^n\|w_i\|_2^2\right)\|x^* - x^0\|_2^2.$$

The proof is finished by substituting these values into the computational guarantee of Corollary 4.2 (with $G = 1$). $\square$
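Putting the pieces together, here is a minimal self-contained sketch of Algorithm 2 for SVM with the reference function (27). The data names `W` (an $n \times d$ matrix whose rows are the $w_i$) and `y` (labels in $\{-1, +1\}$), as well as all identifiers, are our own illustrative choices; the constant step-size follows Proposition 5.3 (step = $\varepsilon$).

```python
import numpy as np

def svm_smd(W, y, lam, step, n_iters, seed=0):
    """A sketch of Stochastic Mirror Descent for SVM (26) with reference function (27)."""
    rng = np.random.default_rng(seed)
    n, d = W.shape
    norms = np.linalg.norm(W, axis=1)
    # growth coefficients with E[||g~(x)||^2 | x] <= a2||x||^2 + a1||x|| + a0,
    # so that h(x) = a2/4 ||x||^4 + a1/3 ||x||^3 + a0/2 ||x||^2 is exactly (27)
    a0, a1, a2 = float((norms**2).mean()), 2.0 * lam * norms.mean(), lam**2
    x, x_sum = np.zeros(d), np.zeros(d)
    for _ in range(n_iters):
        j = int(rng.integers(n))          # one uniformly drawn sample index
        g = lam * x                       # gradient of (lam/2)||x||^2
        if y[j] * (W[j] @ x) < 1.0:       # hinge term of f_j is active
            g = g - y[j] * W[j]           # g~ is now an element of the subdifferential of f_j
        # update (6) as LS (7): c = t g~ - grad h(x), then x+ = -theta c with
        # a2 ||c||^2 theta^3 + a1 ||c|| theta^2 + a0 theta = 1
        c = step * g - (a2 * (x @ x) + a1 * np.linalg.norm(x) + a0) * x
        nc = np.linalg.norm(c)
        if nc > 0.0:
            roots = np.roots([a2 * nc**2, a1 * nc, a0, -1.0])
            theta = min(r.real for r in roots if abs(r.imag) < 1e-9 and r.real > 0.0)
            x = -theta * c
        x_sum += x
    return x_sum / n_iters                # the averaged iterate, as in Proposition 5.3
```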

Intersection of Ellipsoids Problem (IEP). Consider the problem of computing a point $x \in \mathbb{R}^m$ in the intersection of $n$ ellipsoids (this problem was suggested by Nesterov [12]), namely:

$$x \in Q := Q_1 \cap Q_2 \cap \cdots \cap Q_n, \tag{28}$$

where $Q_i = \{x \in \mathbb{R}^m : \frac{1}{2}x^T A_i x + b_i^T x + c_i \le 0\}$ and $A_i \in \mathbb{R}^{m \times m}$ is a given symmetric positive semi-definite matrix, $i = 1, \ldots, n$. This problem can be cast as a second-order cone optimization problem, and hence can be tackled using interior-point methods. However, interior-point methods are typically only effective when the dimensions $m$ and/or $n$ are of moderate size. On the other hand, another way to tackle the problem is to use a first-order method to solve the unconstrained problem

$$\text{IEP:} \qquad f^* = \min_x \; f(x) := \max_{1 \le i \le n}\left\{\tfrac{1}{2}x^T A_i x + b_i^T x + c_i\right\}, \tag{29}$$

and notice that $f(x) \le 0 \iff x \in Q$, and $Q \ne \emptyset \implies f^* \le 0$. However, the objective function $f$ in (29) is both non-differentiable and non-Lipschitz, and so it falls outside of the scope of optimization problems for which traditional first-order methods are applicable.

Let us see how we can use the machinery of relative continuity to tackle this problem. Let $\sigma := \max_{1 \le i \le n}\|A_i\|_2$, let $\rho := 2\max_{1 \le i \le n}\|A_i b_i\|_2$, and let $\gamma := \max_{1 \le i \le n}\|b_i\|_2$. Notice that for any $x$ and $i = 1, \ldots, n$, we have $g_i(x) := A_i x + b_i = \nabla f_i(x)$, where $f_i(x)$ is the $i$-th term in the objective function of (29). Since $g(x) \in \partial f(x)$ if and only if $g(x)$ is a convex combination of the active gradients $\nabla f_i(x)$ (see Danskin's Theorem in [3]), it follows for any $g(x) \in \partial f(x)$ that

$$\|g(x)\|_2^2 \le \max_{1 \le i \le n}\|A_i x + b_i\|_2^2 \le \max_{1 \le i \le n}\left\{\|A_i\|_2^2\|x\|_2^2 + 2 b_i^T A_i x + \|b_i\|_2^2\right\} \le \sigma^2\|x\|_2^2 + \rho\|x\|_2 + \gamma^2.$$

Therefore we have $\|g(x)\|_2^2 \le p_2(\|x\|_2)$, where $p_2(\alpha) = \sigma^2\alpha^2 + \rho\alpha + \gamma^2$ is a quadratic function of $\alpha$, i.e., a polynomial in $\alpha$ of degree $r = 2$. It follows from Proposition 5.1 that $f$ is $1$-relatively continuous relative to the reference function

$$h(x) := \frac{\sigma^2}{4}\|x\|_2^4 + \frac{\rho}{3}\|x\|_2^3 + \frac{\gamma^2}{2}\|x\|_2^2. \tag{30}$$

Proposition 5.4. (Computational Guarantees for Deterministic Mirror Descent for the IEP problem (29)) Consider applying the Deterministic Mirror Descent algorithm (Algorithm 1) to the Intersection of Ellipsoids Problem (29) using the reference function (30), where $\sigma := \max_{1 \le i \le n}\|A_i\|_2$, $\rho := 2\max_{1 \le i \le n}\|A_i b_i\|_2$, and $\gamma := \max_{1 \le i \le n}\|b_i\|_2$. For an absolute optimality tolerance value $\varepsilon > 0$, and using the constant step-sizes $t_i := \varepsilon$, let the algorithm be run for

$$k := \left\lceil \frac{\|x^* - x^0\|_2^2\left(3\sigma^2\left((\|x^*\|_2 + \|x^0\|_2)^2 + 2\|x^0\|_2^2\right) + 4\rho\left(\|x^*\|_2 + 2\|x^0\|_2\right) + 6\gamma^2\right)}{6\varepsilon^2} \right\rceil$$

iterations, where $x^*$ is any optimal solution of (29). Then it holds that

$$f(\bar{x}^k) - f^* \le \varepsilon,$$

where $\bar{x}^k := \frac{1}{k+1}\sum_{i=0}^k x^i$.

Proof: We showed above (using Proposition 5.1) that $f$ is $1$-relatively continuous relative to $h(x) = \frac{\sigma^2}{4}\|x\|_2^4 + \frac{\rho}{3}\|x\|_2^3 + \frac{\gamma^2}{2}\|x\|_2^2$. Furthermore, applying Lemma 5.1 it follows that

$$D_h(x^*, x^0) \le \frac{\sigma^2}{4}\|x^* - x^0\|_2^2\left((\|x^*\|_2 + \|x^0\|_2)^2 + 2\|x^0\|_2^2\right) + \frac{\rho}{3}\|x^* - x^0\|_2^2\left(\|x^*\|_2 + 2\|x^0\|_2\right) + \frac{\gamma^2}{2}\|x^* - x^0\|_2^2.$$

The proof is finished by substituting these values into the computational guarantee of Corollary 4.3 (with $M = 1$). $\square$
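For completeness, a sketch of the IEP pieces: the subgradient oracle (via Danskin's theorem, taking the gradient of one maximizing piece) and the growth coefficients $a_0 = \gamma^2$, $a_1 = \rho$, $a_2 = \sigma^2$ that define (30). Here `As`, `bs`, `cs` are hypothetical containers for the ellipsoid data; combined with the LS solver sketched in Section 5.1, this yields a complete Mirror Descent implementation for IEP.

```python
import numpy as np

def iep_subgradient(x, As, bs, cs):
    """Return some g(x) in the subdifferential of f in (29), together with f(x)."""
    vals = [0.5 * (x @ A @ x) + b @ x + c for A, b, c in zip(As, bs, cs)]
    i = int(np.argmax(vals))             # the gradient of any maximizing piece works
    return As[i] @ x + bs[i], vals[i]

def iep_growth_coeffs(As, bs):
    """Coefficients a = [gamma^2, rho, sigma^2] with ||g(x)||^2 <= p_2(||x||) as above."""
    sigma = max(np.linalg.norm(A, 2) for A in As)                    # max_i ||A_i||_2
    rho = 2.0 * max(np.linalg.norm(A @ b) for A, b in zip(As, bs))   # 2 max_i ||A_i b_i||_2
    gamma = max(np.linalg.norm(b) for b in bs)                       # max_i ||b_i||_2
    return [gamma**2, rho, sigma**2]
```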

Acknowledgements

The author would like to express his gratitude to Robert M. Freund for thoughtful discussions that helped motivate this work, for commenting on earlier drafts of this paper, and for advising on the presentation and positioning of this paper. The author also wishes to thank Yurii Nesterov for encouraging the author's work on this topic, and for pointing out the application of IEP.

Appendix: Finite Radius Bound for SVM

Here we derive an upper bound on the norm of an optimal solution of the SVM problem (26).

Proposition 5.5. The optimal solution $x^*$ of the SVM problem (26) lies in the ball $B_2(0, R)$ for $R = \min\left\{\frac{1}{n\lambda}\sum_{i=1}^n\|w_i\|_2,\ \sqrt{2/\lambda}\right\}$.

Proof: For convenience define $A_i := y_i w_i^T$ for $i = 1, \ldots, n$, and let $A$ be the matrix whose $i$-th row is $A_i$. Then we can re-write the SVM problem as the following constrained optimization problem:

$$\min_{s, x}\; \frac{1}{n} e^T s + \frac{\lambda}{2}\|x\|_2^2 \quad \text{s.t.} \quad s + Ax \ge e, \quad s \ge 0.$$

Let $\pi$ and $\beta$ be the multipliers on the two sets of inequality constraints above. Then the KKT conditions imply, among other things, that the optimal solution $x^*$ must satisfy:

$$\pi + \beta = \frac{1}{n} e, \qquad \lambda x^* = A^T\pi,$$

where $\pi \ge 0$ and $\beta \ge 0$. Define $\bar{\pi} = n\pi$. Then $0 \le \bar{\pi} \le e$ and

$$\lambda\|x^*\|_2 = \|A^T\pi\|_2 = \frac{1}{n}\|A^T\bar{\pi}\|_2 \le \frac{1}{n}\sum_{i=1}^n\|A_i\|_2 = \frac{1}{n}\sum_{i=1}^n\|w_i\|_2,$$

which proves the first term in the definition of $R$. Also, we have $\frac{\lambda}{2}\|x^*\|_2^2 \le f(x^*) \le f(0) = 1$, thus $\|x^*\|_2 \le \sqrt{2/\lambda}$. Therefore $\|x^*\|_2 \le \min\left\{\frac{1}{n\lambda}\sum_{i=1}^n\|w_i\|_2,\ \sqrt{2/\lambda}\right\}$, which finishes the proof. $\square$
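The bound of Proposition 5.5 is straightforward to transcribe; `W` is again a hypothetical $n \times d$ matrix whose rows are the feature vectors $w_i$:

```python
import math
import numpy as np

# Radius bound from Proposition 5.5: the SVM optimum lies in B_2(0, R) with
# R = min{(1/(n*lam)) * sum_i ||w_i||_2, sqrt(2/lam)}.
def svm_radius(W, lam):
    n = W.shape[0]
    return min(np.linalg.norm(W, axis=1).sum() / (n * lam), math.sqrt(2.0 / lam))
```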

References

[1] H.H. Bauschke, J. Bolte, and M. Teboulle, A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications, Mathematics of Operations Research 42 (2016), no. 2, 330-348.

[2] A. Beck and M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters 31 (2003), no. 3, 167-175.

[3] D. Bertsekas, Nonlinear programming, Athena Scientific, Belmont, MA, 1999.

[4] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine Learning 8 (2015), no. 3-4, 231-357.

[5] J. Duchi and Y. Singer, Efficient online and batch learning using forward backward splitting, Journal of Machine Learning Research 10 (2009), 2899-2934.

[6] F. Hanzely and P. Richtarik, Randomized methods for minimizing relatively smooth functions, Tech. report, 2017.

[7] S. Lacoste-Julien, M. Schmidt, and F. Bach, A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method, arXiv preprint arXiv:1212.2002 (2012).

[8] H. Lu, R. M. Freund, and Y. Nesterov, Relatively smooth convex optimization by first-order methods, and applications, SIAM Journal on Optimization 28 (2018), no. 1, 333-354.

[9] A. Nedić and S. Lee, On stochastic subgradient mirror-descent algorithm with weighted averaging, SIAM Journal on Optimization 24 (2014), no. 1, 84-107.

[10] A. S. Nemirovsky and D. B. Yudin, Problem complexity and method efficiency in optimization, Wiley, New York, 1983.

[11] Y. Nesterov, Introductory lectures on convex optimization: a basic course, Kluwer Academic Publishers, Boston, 2003.

[12] Y. Nesterov, private communication, 2016.

[13] S. Shalev-Shwartz, Y. Singer, and N. Srebro, Pegasos: Primal estimated sub-gradient solver for SVM, Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 807-814.

[14] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, Tech. report, May 2008.

[15] Q. Van Nguyen, Forward-backward splitting with Bregman distances, Vietnam Journal of Mathematics 45 (2017), no. 3.

[16] J. Yu, S.V.N. Vishwanathan, S. Günter, and N.N. Schraudolph, A quasi-Newton approach to nonsmooth convex optimization problems in machine learning, Journal of Machine Learning Research 11 (2010), 1145-1200.

[17] Y. Zhou, Y. Liang, and L. Shen, A unified approach to proximal algorithms using Bregman distance, Tech. report, 2016.


More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow

More information

The Frank-Wolfe Algorithm:

The Frank-Wolfe Algorithm: The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

5. Subgradient method

5. Subgradient method L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize

More information

Primal-dual subgradient methods for convex problems

Primal-dual subgradient methods for convex problems Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different

More information

Lecture 2: Convex Sets and Functions

Lecture 2: Convex Sets and Functions Lecture 2: Convex Sets and Functions Hyang-Won Lee Dept. of Internet & Multimedia Eng. Konkuk University Lecture 2 Network Optimization, Fall 2015 1 / 22 Optimization Problems Optimization problems are

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Nonsymmetric potential-reduction methods for general cones

Nonsymmetric potential-reduction methods for general cones CORE DISCUSSION PAPER 2006/34 Nonsymmetric potential-reduction methods for general cones Yu. Nesterov March 28, 2006 Abstract In this paper we propose two new nonsymmetric primal-dual potential-reduction

More information

Stochastic model-based minimization under high-order growth

Stochastic model-based minimization under high-order growth Stochastic model-based minimization under high-order growth Damek Davis Dmitriy Drusvyatskiy Kellie J. MacPhee Abstract Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively

More information

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS

A SIMPLE PARALLEL ALGORITHM WITH AN O(1/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS A SIMPLE PARALLEL ALGORITHM WITH AN O(/T ) CONVERGENCE RATE FOR GENERAL CONVEX PROGRAMS HAO YU AND MICHAEL J. NEELY Abstract. This paper considers convex programs with a general (possibly non-differentiable)

More information

Stochastic Gradient Descent with Only One Projection

Stochastic Gradient Descent with Only One Projection Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine

More information

Primal/Dual Decomposition Methods

Primal/Dual Decomposition Methods Primal/Dual Decomposition Methods Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization Fall 2018-19, HKUST, Hong Kong Outline of Lecture Subgradients

More information

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725 Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

More information

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder

DISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder 011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Primal-dual Subgradient Method for Convex Problems with Functional Constraints

Primal-dual Subgradient Method for Convex Problems with Functional Constraints Primal-dual Subgradient Method for Convex Problems with Functional Constraints Yurii Nesterov, CORE/INMA (UCL) Workshop on embedded optimization EMBOPT2014 September 9, 2014 (Lucca) Yu. Nesterov Primal-dual

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

LEARNING IN CONCAVE GAMES

LEARNING IN CONCAVE GAMES LEARNING IN CONCAVE GAMES P. Mertikopoulos French National Center for Scientific Research (CNRS) Laboratoire d Informatique de Grenoble GSBE ETBC seminar Maastricht, October 22, 2015 Motivation and Preliminaries

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Solving Dual Problems

Solving Dual Problems Lecture 20 Solving Dual Problems We consider a constrained problem where, in addition to the constraint set X, there are also inequality and linear equality constraints. Specifically the minimization problem

More information

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4 Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

The Relation Between Pseudonormality and Quasiregularity in Constrained Optimization 1

The Relation Between Pseudonormality and Quasiregularity in Constrained Optimization 1 October 2003 The Relation Between Pseudonormality and Quasiregularity in Constrained Optimization 1 by Asuman E. Ozdaglar and Dimitri P. Bertsekas 2 Abstract We consider optimization problems with equality,

More information

Lecture 2: Subgradient Methods

Lecture 2: Subgradient Methods Lecture 2: Subgradient Methods John C. Duchi Stanford University Park City Mathematics Institute 206 Abstract In this lecture, we discuss first order methods for the minimization of convex functions. We

More information

Quasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization

Quasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization Quasi-Newton Methods Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization 10-725 Last time: primal-dual interior-point methods Given the problem min x f(x) subject to h(x)

More information

arxiv: v3 [math.oc] 17 Dec 2017 Received: date / Accepted: date

arxiv: v3 [math.oc] 17 Dec 2017 Received: date / Accepted: date Noname manuscript No. (will be inserted by the editor) A Simple Convergence Analysis of Bregman Proximal Gradient Algorithm Yi Zhou Yingbin Liang Lixin Shen arxiv:1503.05601v3 [math.oc] 17 Dec 2017 Received:

More information

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Variational Inequalities. Anna Nagurney Isenberg School of Management University of Massachusetts Amherst, MA 01003

Variational Inequalities. Anna Nagurney Isenberg School of Management University of Massachusetts Amherst, MA 01003 Variational Inequalities Anna Nagurney Isenberg School of Management University of Massachusetts Amherst, MA 01003 c 2002 Background Equilibrium is a central concept in numerous disciplines including economics,

More information

Lecture 3. Optimization Problems and Iterative Algorithms

Lecture 3. Optimization Problems and Iterative Algorithms Lecture 3 Optimization Problems and Iterative Algorithms January 13, 2016 This material was jointly developed with Angelia Nedić at UIUC for IE 598ns Outline Special Functions: Linear, Quadratic, Convex

More information

BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption

BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption BISTA: a Bregmanian proximal gradient method without the global Lipschitz continuity assumption Daniel Reem (joint work with Simeon Reich and Alvaro De Pierro) Department of Mathematics, The Technion,

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

A projection algorithm for strictly monotone linear complementarity problems.

A projection algorithm for strictly monotone linear complementarity problems. A projection algorithm for strictly monotone linear complementarity problems. Erik Zawadzki Department of Computer Science epz@cs.cmu.edu Geoffrey J. Gordon Machine Learning Department ggordon@cs.cmu.edu

More information

Characterizing Robust Solution Sets of Convex Programs under Data Uncertainty

Characterizing Robust Solution Sets of Convex Programs under Data Uncertainty Characterizing Robust Solution Sets of Convex Programs under Data Uncertainty V. Jeyakumar, G. M. Lee and G. Li Communicated by Sándor Zoltán Németh Abstract This paper deals with convex optimization problems

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This

More information

The Steepest Descent Algorithm for Unconstrained Optimization

The Steepest Descent Algorithm for Unconstrained Optimization The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem

More information

Computational Learning Theory - Hilary Term : Learning Real-valued Functions

Computational Learning Theory - Hilary Term : Learning Real-valued Functions Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Optimisation in Higher Dimensions

Optimisation in Higher Dimensions CHAPTER 6 Optimisation in Higher Dimensions Beyond optimisation in 1D, we will study two directions. First, the equivalent in nth dimension, x R n such that f(x ) f(x) for all x R n. Second, constrained

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints

A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints Hao Yu and Michael J. Neely Department of Electrical Engineering

More information

Quiz Discussion. IE417: Nonlinear Programming: Lecture 12. Motivation. Why do we care? Jeff Linderoth. 16th March 2006

Quiz Discussion. IE417: Nonlinear Programming: Lecture 12. Motivation. Why do we care? Jeff Linderoth. 16th March 2006 Quiz Discussion IE417: Nonlinear Programming: Lecture 12 Jeff Linderoth Department of Industrial and Systems Engineering Lehigh University 16th March 2006 Motivation Why do we care? We are interested in

More information

Efficient Methods for Stochastic Composite Optimization

Efficient Methods for Stochastic Composite Optimization Efficient Methods for Stochastic Composite Optimization Guanghui Lan School of Industrial and Systems Engineering Georgia Institute of Technology, Atlanta, GA 3033-005 Email: glan@isye.gatech.edu June

More information

CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008.

CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS. W. Erwin Diewert January 31, 2008. 1 ECONOMICS 594: LECTURE NOTES CHAPTER 2: CONVEX SETS AND CONCAVE FUNCTIONS W. Erwin Diewert January 31, 2008. 1. Introduction Many economic problems have the following structure: (i) a linear function

More information

Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method. Ryan Tibshirani Convex Optimization Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum

More information