A Greedy Framework for First-Order Optimization

Size: px
Start display at page:

Download "A Greedy Framework for First-Order Optimization"

Transcription

1 A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA Jonathan Huggins Department of EECS Massachusetts Institute of Technology Cambridge, MA 039 Introduction. Recent work has shown many connections between conditional gradient and other first-order optimization methods, such as herding [3] and subgradient descent []. By considering a type of proximal conditional method, which we call boosted mirror descent (BMD), we are able to unify all of these algorithms into a single framework, which can be interpreted as taking successive arg-mins of a sequence of surrogate functions. Using a standard online learning analysis based on Bregman divergences, we are able to demonstrate an O(/T ) convergence rate for all algorithms in this class. Setup. More concretely, pose that we are given a function L : U Θ R defined by L(u, ) = h(u) + u, R() () and wish to find the saddle point L def = min max L(u, ). () u We should think of u as the primal variables and as the dual variables; we will assume throughout that h and R are both convex. We will also abuse notation and define L(u) def = max L(u, ); we can equivalently write L(u) as L(u) = h(u) + R (u), (3) where R is the Fenchel conjugate of R. Note that L(u) is a convex function. Moreover, since R R is a one-to-one mapping, for any convex function L and any decomposition of L into convex functions h and R, we get a corresponding two-argument function L(u, ). Primal algorithm. Given the function L(u, ), we define the following optimization procedure, which will generate a sequence of points (u, ), (u, ),... converging to a saddle point of L. First, take a sequence of weights α, α,..., and for notational convenience define s= û t = α su s s= α s s= and ˆt = α s s s= α. s Then the primal boosted mirror descent (PBMD) algorithm is the following iterative procedure:. u arg min u h(u). t arg max Θ, u t R() = R (u t ) 3. u t+ arg min u h(u) + ˆ t, u = h ( ˆ t ) As long as h is strongly convex, for the proper choice of α t we obtain the bound (see Corollary ): L(û T, ) L + O(/T ). (4) Θ As an example, pose that we are given a γ-strongly convex function f: that is, f(x) = γ x + f 0 (x), where f 0 is convex. Then we let h(x) = γ x, R (x) = f 0 (x), and obtain the updates: Both authors contributed equally to this work.

2 . u = 0. t = f 0 (u t ) 3. u t+ = γ ˆ t = s= αs f0(us) γ s= αs We therefore obtain a variant on subgradient descent where u t+ is a weighted average of the first t subgradients (times a step size /γ). Note that these are the subgradients of f 0, which are related to the subgradients of f by f 0 (x) = f(x) γx. Dual algorithm. We can also consider the dual form of our mirror descent algorithm (dual boosted mirror descent, or DBMD):. arg min R(). u t arg min u h(u) + t, u = h ( t ) 3. t+ arg max Θ, û t R() = R (û t ) Convergence now hinges upon strong convexity of R rather than h, and we obtain the same O(/T ) convergence rate (see Corollary 4): L(û T, ) L + O(/T ). (5) Θ { 0 : u K An important special case is h(u) =, where K is some compact set. Also let : u K R = f, where f is an arbitrary strongly convex function. Then we are minimizing f over the compact set K, and we obtain the following updates, which are equivalent to (standard) conditional gradient or Frank-Wolfe:. = f(0). u t arg min u K t, u 3. t+ = f(û t ) Our notation is slightly different from previous presentations in that we use linear weights (α t ) instead of geometric weights (often denoted ρ t, as in []). However, under the mapping α t = ρ t / t s= ( ρ s), we obtain an algorithm equivalent to the usual formulation. Optimality. The solutions generated by conditional gradient have an attractive sparsity property: the solution after the t-th iteration is a convex combination of t extreme points of the optimized space. In [4] it is shown that conditional gradient is asymptotically optimal amongst algorithms with sparse solutions. This suggests that the O(/T ) convergence rate obtained in this paper cannot be improved without further assumptions. For further discussion of the properties of conditional gradient, see Jaggi [4]. Primal vs. dual algorithms. It is worth taking a moment to contrast the primal and dual algorithms. Note that both algorithms have a primal convergence guarantee (i.e. both guarantee that û T is close to optimal). The sense in which the latter algorithm is a dual algorithm is that it hinges on strong convexity of the dual, as opposed to strong convexity of the primal. If we care about dual convergence instead of primal convergence, the only change we need to make is to take ˆ T at the end of the algorithm instead of û T. (This can be seeing by defining the loss function L (, u) def = L(u, ), which inverts the role of u and but yields the same updates as above.) Discussion. Our framework is intriguing in that it involves a purely greedy minimization of surrogate loss functions (alternating between the primal and dual), and yet is powerful enough to capture both (a variant of) subgradient descent and conditional gradient descent, as well as a host of other first-order methods. An example of a more complex first-order method captured by our framework is the low-rank SDP solver introduced by Arora, Hazan, and Kale []. Briefly, the AHK algorithm seeks to minimize

3 m j= (Tr(AT j X) b j) subject to the constraints X 0 and Tr(X) ρ. We can then define { 0 : X 0 and Tr(X) ρ h(x) = (6) : else and R (X) = m j= (Tr(AT j X) b j ). (7) Note that this decomposition is actually a special case of the conditional gradient decomposition above, and so we obtain the updates m X t+ argmin X 0,T r(x) ρ [Tr( ˆX ] j t ) b j Tr( j X), (8) j= whose solution turns out to be ρvv T, where v is the top singular vector of the matrix m j= [Tr( ˆX ] j t ) b j A j. This example serves both to illustrate the flexibility of our framework and to highlight the interesting fact that the Arora-Hazan-Kale SDP algorithm is a special case of conditional gradient (to get the original formulation in [], we need to replace the function x with x + log x +, where x + = max(x, 0)). It is worth noting that the analysis in [] appears to be tighter in their special case than the analysis we give here. We plan to investigate this phenomenon in future work. Related work. A generalized Frank-Wolfe algorithm has also been proposed in [5], which considers first-order convex surrogate functions more generally. Algorithm 3 of [5] turns out to be equivalent to a line-search formulation of the algorithm we propose in this paper. However, the perspective is quite different. While both algorithms share the intuition of trying to construct a proximal Frank-Wolfe algorithm, the general framework of [5] studies convex upper bounds, rather than the convex lower bounds of this paper (although for the proximal gradient surrogates that they suggest using, their Frank-Wolfe algorithm does implicitly use lower bounds). Another difference is that their analysis focuses on the Lipschitz properties of the objective function, whereas we use the more general notion of strong convexity with respect to an arbitrary norm. We believe that this geometric perspective is useful for tailoring the proximal function to specific applications, as demonstrated in our discussion of q-herding below. Finally, the saddle point perspective on Frank-Wolfe is not present in [5], although they do discuss the general idea of saddle-point methods in relation to convex surrogates. q-herding. In addition to unifying several existing methods, our framework allows us to extend herding to an algorithm we call q-herding. Herding is an algorithm for constructing pseudosamples that match a specified collection of moments from a distribution; it was originally introduced by Welling [6] and was shown to be a special case of conditional gradient by Bach et al. [3]. Let M X be the probability simplex over X. Herding can be cast as trying to minimize E µ [φ(x)] φ over µ M X, for a given φ : X R d and φ R d. As shown in [3], the herding updates are equivalent to DBMD with h(µ) 0 and R() = T φ +. The implicit assumption in the herding algorithm is that φ(x) is bounded. We are able to construct a more general algorithm that only requires φ(x) p to be bounded for some p. This q-herding algorithm works by taking R() = T φ + q q q, where p + q =. In this case our convergence results imply that E µ [φ(x)] φ p p decays at a rate of O(/T ) (proof to appear in an extended version of this paper). Convergence results. We end by stating our formal convergence results. Proofs are given in the Appendix. For the primal algorithm (PBMD) we have: Theorem. Suppose that h is strongly convex with respect to a norm and let r =. Then, for any u, L(û, ) L(u, ) + r αt+a t A. (9) t+ This is actually a variant of their algorithm, which we present for ease of exposition. t= 3

4 Corollary. Under the hypotheses of Theorem, for α t = we have and for α t = t we have L(û, ) Similarly, for the dual algorithm (DBMD) we have: L(u, ) + r (log(t ) + ). (0) T L(û, ) L(u, ) + 8r T. () Theorem 3. Suppose that R is strongly convex with respect to a norm and let r = u u. Then, for any u, L(û, ) L(u, ) + r Corollary 4. Under the hypotheses of Theorem 3, for α t = we have and for α t = t we have t= L(û, ) L(u, ) + r (log(t ) + ) T αt+a t A. () t+ (3) L(û, ) L(u, ) + 8r T (4) Thus, a step size of α t = t yields the claimed O(/T ) convergence rate. Acknowledgements. Thanks to Cameron Freer, Simon Lacoste-Julien, Martin Jaggi, Percy Liang, Arun Chaganty, and Sida Wang for helpful discussions. Thanks also to the anonymous reviewer for suggesting improvements to the paper. JS was ported by the Hertz Foundation. JHH was ported by the U.S. Government under FA9550--C-008 and awarded by the Department of Defense, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 3 CFR 68a. References [] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In FOCS, pages , 005. [] Francis Bach. Duality between subgradient and conditional gradient methods. arxiv.org, November 0. [3] Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the Equivalence between Herding and Conditional Gradient Algorithms. In ICML. INRIA Paris - Rocquencourt, LIENS, March 0. [4] Martin Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML, pages , 03. [5] Julien Mairal. Optimization with First-Order Surrogate Functions. In ICML, 03. [6] Max Welling. Herding dynamical weights to learn. In ICML,

5 Appendix: Convergence Proofs We now prove the convergence results given above. Throughout this section, assume that α,..., α T is a sequence of real numbers and that A t = s= α s. We further require that A t > 0 for all t. Also recall that the Bregman divergence is defined by D f (x x ) def = f(x ) f(x ), x x f(x ). Our proofs hinge on the following key lemma: Lemma 5. Let z,..., z T be vectors and let f(x) be a strictly convex function. Define ẑ t to be t A t s= α sz s. Let x,..., x T be chosen via x t+ = arg min x f(x) + ẑ t, x. Then for any x we have {α t (f(x t ) + z t, x t )} f(x ) + ẑ t, x + A t D f (x t x t+ ). t= Proof. First note that, if x 0 = arg min f(x) + z, x, then f(x 0 ) = z. Now note that so we have α t z t = A t ẑ t A t ẑ t (5) = A t f(x t+ ) + A t f(x t ), (6) {α t (f(x t ) + z t, x t )} (7) t= = = {α t f(x t ) + A t f(x t ) A t f(x t+ ), x t } (8) t= {α t f(x t ) A t f(x t+ ), x t x t+ } (9) t= f(x T + ), x T + (0) = {A t f(x t ) A t f(x t+ ), x t x t+ A t f(x t+ )} () t= + (f(x T + ) f(x T + ), x T + ) = {A t D f (x t x t+ )} () t= + (f(x T + ) f(x T + ), x T + ) = {A t D f (x t x t+ )} + (f(x T + ) + ẑ T, x T + ) (3) t= {A t D f (x t x t+ )} + (f(x ) + ẑ T, x ). (4) t= Dividing both sides by completes the proof. We also note that D f (x t x t+ ) = D f (ẑ t+ z t ), where f (z) = x z, x f(x). This form of the bound will often be more useful to us. Another useful property of Bregman divergences is the following lemma which relates strong convexity of f to strong smoothness of f : Lemma 6. Suppose that D f (x x) x x for some norm and for all x, x. (In this case we say that f is strongly convex with respect to.) Then D f (y y) y y for all y, y. t= 5

6 We require a final porting proposition before proving Theorem. Proposition 7 (Convergence of PBMD). Consider the updates t arg max, u t R() and u t+ arg min u h(u) + ˆ s, u. Then we have L(û T, ) L(u, ) + A t D h (u t u t+ ). (5) Proof. Note that L(u t, t ) = max L(u t, ) by construction. Also note that, if we invoke Lemma 5 with f = h and z t = t then we get the inequality t= α t L(u t, t ) t= t= α t L(u, t ) + Combining these together, we get the string of inequalities ( ) L(û T, ) = L α t u t, as was to be shown. t= α t L(u t, ) t= α t L(u t, t ) t= t= α t L(u, t ) + L(u, ) + Proof of Theorem. By Lemma 6, we have D h (u t u t+ ) = D h (ˆ t+ ˆ t ) A t D h (u t u t+ ). (6) t= A t D h (u t u t+ ) t= A t D h (u t u t+ ), t= ˆ t+ ˆ t = s t α s s s t+ s t α α s s s s t+ α s = α t+ α s s α t+ t+ A t A t+ A t+ s t α t+ α s s + α t+ t+ A t A t+ A t+ r αt+ A. t+ s t It follows that t= A t D h (u t u t+ ) r t= αt+a t A. t+ 6

7 Proof of Corollary. If we let α t =, then A t = t and α t+ At = t A (t+) t+ t. We therefore get r t= α t+a t A t+ r T t r (log(t ) + ). (7) T + t= If we let α t = t, then A t = t(t+) and α t+ At = (t+) t(t+) A (t+) t+ (t+) = t(t+) (t+). We therefore get which completes the proof. r t= α t+a t A t+ 4r T (T + ) t= = 8r T +, (8) The proof of Theorem 3 requires an analagous porting proposition to that of Theorem. Proposition 8 (Convergence of DBMD). Consider the updates u t arg min u h(u) + t, u and t+ arg max, û t R(). Then we have L(û, ) L(u, ) + A t D R ( t t+ ). (9) Proof. If we invoke Lemma 5 with f = R and z t = u t, then for all we get the inequality Re-arranging yields t= t= α t L(u t, t ) α t L(u t, ) + Now, we have the following string of inequalities: ( ) L(û, ) = L α t u t, = t= α t L(u t, ) t= t= t= t= t= t= α t L(u t, ) (30) t= A t D R ( t t+ ). t= α t L(u t, t ) (3) t= A t D R ( t t+ ). (3) t= α t L(u t, t ) + A t D R ( t t+ ) t= α t inf u L(u, t) + α t L(u, t ) + A t D R ( t t+ ) t= A t D R ( t t+ ) t= α t L(u, ) + 7 A t D R ( t t+ ) t=

8 as was to be shown. = L(u, ) + A t D R ( t t+ ), t= Proofs of Theorem 3 and Corollary 4. The proofs are identical to those for PBMD. 8

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Convex Optimization Conjugate, Subdifferential, Proximation

Convex Optimization Conjugate, Subdifferential, Proximation 1 Lecture Notes, HCI, 3.11.211 Chapter 6 Convex Optimization Conjugate, Subdifferential, Proximation Bastian Goldlücke Computer Vision Group Technical University of Munich 2 Bastian Goldlücke Overview

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

Lecture 16: FTRL and Online Mirror Descent

Lecture 16: FTRL and Online Mirror Descent Lecture 6: FTRL and Online Mirror Descent Akshay Krishnamurthy akshay@cs.umass.edu November, 07 Recap Last time we saw two online learning algorithms. First we saw the Weighted Majority algorithm, which

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Lecture 23: November 19

Lecture 23: November 19 10-725/36-725: Conve Optimization Fall 2018 Lecturer: Ryan Tibshirani Lecture 23: November 19 Scribes: Charvi Rastogi, George Stoica, Shuo Li Charvi Rastogi: 23.1-23.4.2, George Stoica: 23.4.3-23.8, Shuo

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm

Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm Jacob Steinhardt Percy Liang Stanford University {jsteinhardt,pliang}@cs.stanford.edu Jun 11, 2013 J. Steinhardt & P. Liang (Stanford)

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Lecture 23: Conditional Gradient Method

Lecture 23: Conditional Gradient Method 10-725/36-725: Conve Optimization Spring 2015 Lecture 23: Conditional Gradient Method Lecturer: Ryan Tibshirani Scribes: Shichao Yang,Diyi Yang,Zhanpeng Fang Note: LaTeX template courtesy of UC Berkeley

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

1 Sparsity and l 1 relaxation

1 Sparsity and l 1 relaxation 6.883 Learning with Combinatorial Structure Note for Lecture 2 Author: Chiyuan Zhang Sparsity and l relaxation Last time we talked about sparsity and characterized when an l relaxation could recover the

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.

More information

Limited Memory Kelley s Method Converges for Composite Convex and Submodular Objectives

Limited Memory Kelley s Method Converges for Composite Convex and Submodular Objectives Limited Memory Kelley s Method Converges for Composite Convex and Submodular Objectives Madeleine Udell Operations Research and Information Engineering Cornell University Based on joint work with Song

More information

Lecture 6 : Projected Gradient Descent

Lecture 6 : Projected Gradient Descent Lecture 6 : Projected Gradient Descent EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Consider the following update. x l+1 = Π C (x l α f(x l )) Theorem Say f : R d R is (m, M)-strongly

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

Proximal Methods for Optimization with Spasity-inducing Norms

Proximal Methods for Optimization with Spasity-inducing Norms Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology

More information

Generalized Conditional Gradient and Its Applications

Generalized Conditional Gradient and Its Applications Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Lecture 19: Follow The Regulerized Leader

Lecture 19: Follow The Regulerized Leader COS-511: Learning heory Spring 2017 Lecturer: Roi Livni Lecture 19: Follow he Regulerized Leader Disclaimer: hese notes have not been subjected to the usual scrutiny reserved for formal publications. hey

More information

Convex Repeated Games and Fenchel Duality

Convex Repeated Games and Fenchel Duality Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming

Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Mathematical Programming manuscript No. (will be inserted by the editor) Primal-dual first-order methods with O(1/ǫ) iteration-complexity for cone programming Guanghui Lan Zhaosong Lu Renato D. C. Monteiro

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Stochastic Gradient Descent with Only One Projection

Stochastic Gradient Descent with Only One Projection Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

c 2015 Society for Industrial and Applied Mathematics

c 2015 Society for Industrial and Applied Mathematics SIAM J. OPTIM. Vol. 5, No. 1, pp. 115 19 c 015 Society for Industrial and Applied Mathematics DUALITY BETWEEN SUBGRADIENT AND CONDITIONAL GRADIENT METHODS FRANCIS BACH Abstract. Given a convex optimization

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

18.657: Mathematics of Machine Learning

18.657: Mathematics of Machine Learning 8.657: Mathematics of Machine Learning Lecturer: Philippe Rigollet Lecture 3 Scribe: Mina Karzand Oct., 05 Previously, we analyzed the convergence of the projected gradient descent algorithm. We proved

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Convex Repeated Games and Fenchel Duality

Convex Repeated Games and Fenchel Duality Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction

A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction A New Look at First Order Methods Lifting the Lipschitz Gradient Continuity Restriction Marc Teboulle School of Mathematical Sciences Tel Aviv University Joint work with H. Bauschke and J. Bolte Optimization

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

arxiv: v2 [cs.lg] 10 Oct 2018

arxiv: v2 [cs.lg] 10 Oct 2018 Journal of Machine Learning Research 9 (208) -49 Submitted 0/6; Published 7/8 CoCoA: A General Framework for Communication-Efficient Distributed Optimization arxiv:6.0289v2 [cs.lg] 0 Oct 208 Virginia Smith

More information

Fidelity and trace norm

Fidelity and trace norm Fidelity and trace norm In this lecture we will see two simple examples of semidefinite programs for quantities that are interesting in the theory of quantum information: the first is the trace norm of

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Lecture 24 November 27

Lecture 24 November 27 EE 381V: Large Scale Optimization Fall 01 Lecture 4 November 7 Lecturer: Caramanis & Sanghavi Scribe: Jahshan Bhatti and Ken Pesyna 4.1 Mirror Descent Earlier, we motivated mirror descent as a way to improve

More information

The Frank-Wolfe Algorithm:

The Frank-Wolfe Algorithm: The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology

More information

The FTRL Algorithm with Strongly Convex Regularizers

The FTRL Algorithm with Strongly Convex Regularizers CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Lecture: Local Spectral Methods (3 of 4) 20 An optimization perspective on local spectral methods

Lecture: Local Spectral Methods (3 of 4) 20 An optimization perspective on local spectral methods Stat260/CS294: Spectral Graph Methods Lecture 20-04/07/205 Lecture: Local Spectral Methods (3 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide

More information

Complexity bounds for primal-dual methods minimizing the model of objective function

Complexity bounds for primal-dual methods minimizing the model of objective function Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

On Optimal Frame Conditioners

On Optimal Frame Conditioners On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,

More information

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015 5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and

More information

Dual and primal-dual methods

Dual and primal-dual methods ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method

1. Gradient method. gradient method, first-order methods. quadratic bounds on convex functions. analysis of gradient method L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information

Regret bounded by gradual variation for online convex optimization

Regret bounded by gradual variation for online convex optimization Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date

More information

Learning with Submodular Functions: A Convex Optimization Perspective

Learning with Submodular Functions: A Convex Optimization Perspective Foundations and Trends R in Machine Learning Vol. 6, No. 2-3 (2013) 145 373 c 2013 F. Bach DOI: 10.1561/2200000039 Learning with Submodular Functions: A Convex Optimization Perspective Francis Bach INRIA

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1

EE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1 EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex

More information

Dual methods for the minimization of the total variation

Dual methods for the minimization of the total variation 1 / 30 Dual methods for the minimization of the total variation Rémy Abergel supervisor Lionel Moisan MAP5 - CNRS UMR 8145 Different Learning Seminar, LTCI Thursday 21st April 2016 2 / 30 Plan 1 Introduction

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

Lecture 7: Semidefinite programming

Lecture 7: Semidefinite programming CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 7: Semidefinite programming This lecture is on semidefinite programming, which is a powerful technique from both an analytic and computational

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 9. Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 9 Alternating Direction Method of Multipliers Shiqian Ma, MAT-258A: Numerical Optimization 2 Separable convex optimization a special case is min f(x)

More information

A Simple Algorithm for Nuclear Norm Regularized Problems

A Simple Algorithm for Nuclear Norm Regularized Problems A Simple Algorithm for Nuclear Norm Regularized Problems ICML 00 Martin Jaggi, Marek Sulovský ETH Zurich Matrix Factorizations for recommender systems Y = Customer Movie UV T = u () The Netflix challenge:

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

The Moment Method; Convex Duality; and Large/Medium/Small Deviations

The Moment Method; Convex Duality; and Large/Medium/Small Deviations Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds Proceedings of Machine Learning Research 76: 40, 207 Algorithmic Learning Theory 207 A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds Pooria

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

Composite Objective Mirror Descent

Composite Objective Mirror Descent Composite Objective Mirror Descent John C. Duchi 1,3 Shai Shalev-Shwartz 2 Yoram Singer 3 Ambuj Tewari 4 1 University of California, Berkeley 2 Hebrew University of Jerusalem, Israel 3 Google Research

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Low-rank matrix recovery via convex relaxations Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

1 Quantum states and von Neumann entropy

1 Quantum states and von Neumann entropy Lecture 9: Quantum entropy maximization CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: February 15, 2016 1 Quantum states and von Neumann entropy Recall that S sym n n

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

On the von Neumann and Frank-Wolfe Algorithms with Away Steps

On the von Neumann and Frank-Wolfe Algorithms with Away Steps On the von Neumann and Frank-Wolfe Algorithms with Away Steps Javier Peña Daniel Rodríguez Negar Soheili July 16, 015 Abstract The von Neumann algorithm is a simple coordinate-descent algorithm to determine

More information

Lecture 5 : Projections

Lecture 5 : Projections Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Online Optimization with Gradual Variations

Online Optimization with Gradual Variations JMLR: Workshop and Conference Proceedings vol (0) 0 Online Optimization with Gradual Variations Chao-Kai Chiang, Tianbao Yang 3 Chia-Jung Lee Mehrdad Mahdavi 3 Chi-Jen Lu Rong Jin 3 Shenghuo Zhu 4 Institute

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

arxiv: v3 [math.oc] 17 Dec 2017 Received: date / Accepted: date

arxiv: v3 [math.oc] 17 Dec 2017 Received: date / Accepted: date Noname manuscript No. (will be inserted by the editor) A Simple Convergence Analysis of Bregman Proximal Gradient Algorithm Yi Zhou Yingbin Liang Lixin Shen arxiv:1503.05601v3 [math.oc] 17 Dec 2017 Received:

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Department of Systems Engineering and Engineering Management The Chinese University of Hong Kong 2014 Workshop

More information

Theory of Convex Optimization for Machine Learning

Theory of Convex Optimization for Machine Learning Theory of Convex Optimization for Machine Learning Sébastien Bubeck 1 1 Department of Operations Research and Financial Engineering, Princeton University, Princeton 08544, USA, sbubeck@princeton.edu Abstract

More information

Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem

Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Lecture 3: Lagrangian duality and algorithms for the Lagrangian dual problem Michael Patriksson 0-0 The Relaxation Theorem 1 Problem: find f := infimum f(x), x subject to x S, (1a) (1b) where f : R n R

More information

A. Derivation of regularized ERM duality

A. Derivation of regularized ERM duality A. Derivation of regularized ERM dualit For completeness, in this section we derive the dual 5 to the problem of computing proximal operator for the ERM objective 3. We can rewrite the primal problem as

More information