A Greedy Framework for First-Order Optimization

Jacob Steinhardt, Department of Computer Science, Stanford University, Stanford, CA 94305. jsteinhardt@cs.stanford.edu
Jonathan Huggins, Department of EECS, Massachusetts Institute of Technology, Cambridge, MA 02139. jhuggins@mit.edu
(Both authors contributed equally to this work.)

Introduction. Recent work has shown many connections between conditional gradient and other first-order optimization methods, such as herding [3] and subgradient descent [2]. By considering a type of proximal conditional gradient method, which we call boosted mirror descent (BMD), we are able to unify all of these algorithms into a single framework, which can be interpreted as taking successive arg-mins of a sequence of surrogate functions. Using a standard online learning analysis based on Bregman divergences, we are able to demonstrate an O(1/T) convergence rate for all algorithms in this class.

Setup. More concretely, suppose that we are given a function L : U × Θ → R defined by

L(u, θ) = h(u) + ⟨u, θ⟩ − R(θ)    (1)

and wish to find the saddle point

L* := min_{u ∈ U} max_{θ ∈ Θ} L(u, θ).    (2)

We should think of u as the primal variables and θ as the dual variables; we will assume throughout that h and R are both convex. We will also abuse notation and define L(u) := max_{θ ∈ Θ} L(u, θ); we can equivalently write L(u) as

L(u) = h(u) + R*(u),    (3)

where R* is the Fenchel conjugate of R. Note that L(u) is a convex function. Moreover, since R ↦ R* is a one-to-one mapping, for any convex function L and any decomposition of L into convex functions h and R*, we get a corresponding two-argument function L(u, θ).

Primal algorithm. Given the function L(u, θ), we define the following optimization procedure, which will generate a sequence of points (u_1, θ_1), (u_2, θ_2), ... converging to a saddle point of L. First, take a sequence of weights α_1, α_2, ..., and for notational convenience define

û_t = (Σ_{s=1}^t α_s u_s) / (Σ_{s=1}^t α_s)   and   θ̂_t = (Σ_{s=1}^t α_s θ_s) / (Σ_{s=1}^t α_s).

Then the primal boosted mirror descent (PBMD) algorithm is the following iterative procedure:

1. u_1 ∈ arg min_u h(u)
2. θ_t ∈ arg max_{θ ∈ Θ} ⟨θ, u_t⟩ − R(θ) = ∇R*(u_t)
3. u_{t+1} ∈ arg min_u h(u) + ⟨θ̂_t, u⟩ = ∇h*(−θ̂_t)

As long as h is strongly convex, for the proper choice of α_t we obtain the bound (see Corollary 2):

max_{θ ∈ Θ} L(û_T, θ) ≤ L* + O(1/T).    (4)
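
To make the three steps above concrete, here is a minimal Python sketch of the PBMD loop (our own illustration, not from the paper). The oracle names `grad_R_star` and `grad_h_star` are placeholders for the maps u ↦ ∇R*(u) and z ↦ arg min_u h(u) + ⟨z, u⟩ = ∇h*(−z) that implement steps 2 and 3; the loop otherwise just maintains the weighted averages û_t and θ̂_t.

```python
import numpy as np

def pbmd(grad_R_star, grad_h_star, u1, alphas):
    """Primal boosted mirror descent (sketch, illustrative only).

    grad_R_star(u) -- returns theta in argmax over Theta of <theta, u> - R(theta), i.e. the map ∇R*(u)
    grad_h_star(z) -- returns argmin over u of h(u) + <z, u>, i.e. the map ∇h*(-z)
    u1             -- initial point, a minimizer of h
    alphas         -- positive weights alpha_1, ..., alpha_T
    """
    u = np.asarray(u1, dtype=float)
    A = 0.0                       # running sum of weights A_t
    u_hat = np.zeros_like(u)      # weighted average of primal iterates
    theta_hat = None              # weighted average of dual iterates
    for alpha in alphas:
        theta = grad_R_star(u)                                   # step 2
        A += alpha
        u_hat = u_hat + (alpha / A) * (u - u_hat)
        theta_hat = theta if theta_hat is None else theta_hat + (alpha / A) * (theta - theta_hat)
        u = grad_h_star(theta_hat)                               # step 3
    return u_hat, theta_hat
```

In the γ-strongly convex example that follows, these oracles are simply grad_R_star(u) = ∇f_0(u) and grad_h_star(z) = −z/γ.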

As an example, suppose that we are given a γ-strongly convex function f: that is, f(x) = (γ/2)‖x‖₂² + f_0(x), where f_0 is convex. Then we let h(x) = (γ/2)‖x‖₂² and R*(x) = f_0(x), and obtain the updates:

1. u_1 = 0
2. θ_t = ∇f_0(u_t)
3. u_{t+1} = −(1/γ) θ̂_t = −(Σ_{s=1}^t α_s ∇f_0(u_s)) / (γ Σ_{s=1}^t α_s)

We therefore obtain a variant of subgradient descent in which u_{t+1} is a weighted average of the first t subgradients (times a step size 1/γ). Note that these are the subgradients of f_0, which are related to the subgradients of f by ∇f_0(x) = ∇f(x) − γx.

Dual algorithm. We can also consider the dual form of our mirror descent algorithm (dual boosted mirror descent, or DBMD):

1. θ_1 ∈ arg min_θ R(θ)
2. u_t ∈ arg min_u h(u) + ⟨θ_t, u⟩ = ∇h*(−θ_t)
3. θ_{t+1} ∈ arg max_{θ ∈ Θ} ⟨θ, û_t⟩ − R(θ) = ∇R*(û_t)

Convergence now hinges upon strong convexity of R rather than h, and we obtain the same O(1/T) convergence rate (see Corollary 4):

max_{θ ∈ Θ} L(û_T, θ) ≤ L* + O(1/T).    (5)

An important special case is h(u) = 0 for u ∈ K and h(u) = ∞ for u ∉ K, where K is some compact set. Also let R = f*, where f is an arbitrary strongly convex function. Then we are minimizing f over the compact set K, and we obtain the following updates, which are equivalent to (standard) conditional gradient or Frank-Wolfe:

1. θ_1 = ∇f(0)
2. u_t ∈ arg min_{u ∈ K} ⟨θ_t, u⟩
3. θ_{t+1} = ∇f(û_t)

Our notation is slightly different from previous presentations in that we use linear weights (α_t) instead of geometric weights (often denoted ρ_t, as in [2]). However, under the mapping α_t = ρ_t / Π_{s=1}^t (1 − ρ_s), we obtain an algorithm equivalent to the usual formulation.
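
For concreteness, here is a minimal Python sketch (our illustration, not from the paper) of the Frank-Wolfe special case above, written with the weighted average in recursive form; with the linear weights α_t = t this recursion uses the step size ρ_t = α_t/A_t = 2/(t+1). The names `grad_f` (gradient of f, playing the role of ∇R*) and `lmo` (a linear minimization oracle over K) are our placeholders.

```python
import numpy as np

def frank_wolfe_dbmd(grad_f, lmo, dim, T):
    """Conditional gradient (Frank-Wolfe) written as DBMD (sketch, illustrative only).

    grad_f(x)  -- gradient of the smooth objective f (plays the role of ∇R*)
    lmo(theta) -- linear minimization oracle: argmin over u in K of <theta, u>
    """
    u_hat = np.zeros(dim)        # only used to form theta_1 = grad_f(0), per step 1
    theta = grad_f(u_hat)        # step 1
    A = 0.0
    for t in range(1, T + 1):
        u = lmo(theta)                        # step 2: an extreme point of K
        alpha = float(t)                      # linear weights alpha_t = t
        A += alpha
        u_hat += (alpha / A) * (u - u_hat)    # recursive average, rho_t = 2/(t+1)
        theta = grad_f(u_hat)                 # step 3
    return u_hat
```

For example, over the probability simplex the oracle is simply `lmo = lambda theta: np.eye(dim)[np.argmin(theta)]`, which makes û_T a convex combination of at most T vertices — the sparsity property discussed next.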

Optimality. The solutions generated by conditional gradient have an attractive sparsity property: the solution after the t-th iteration is a convex combination of t extreme points of the optimized space. In [4] it is shown that conditional gradient is asymptotically optimal amongst algorithms with sparse solutions. This suggests that the O(1/T) convergence rate obtained in this paper cannot be improved without further assumptions. For further discussion of the properties of conditional gradient, see Jaggi [4].

Primal vs. dual algorithms. It is worth taking a moment to contrast the primal and dual algorithms. Note that both algorithms have a primal convergence guarantee (i.e., both guarantee that û_T is close to optimal). The sense in which the latter algorithm is a dual algorithm is that it hinges on strong convexity of the dual, as opposed to strong convexity of the primal. If we care about dual convergence instead of primal convergence, the only change we need to make is to take θ̂_T at the end of the algorithm instead of û_T. (This can be seen by defining the loss function L'(θ, u) := −L(u, θ), which inverts the roles of u and θ but yields the same updates as above.)

Discussion. Our framework is intriguing in that it involves a purely greedy minimization of surrogate loss functions (alternating between the primal and dual), and yet is powerful enough to capture both (a variant of) subgradient descent and conditional gradient descent, as well as a host of other first-order methods. An example of a more complex first-order method captured by our framework is the low-rank SDP solver introduced by Arora, Hazan, and Kale [1]. Briefly, the AHK algorithm (we present a variant of their algorithm for ease of exposition) seeks to minimize Σ_{j=1}^m (Tr(A_jᵀ X) − b_j)² subject to the constraints X ⪰ 0 and Tr(X) ≤ ρ. We can then define

h(X) = 0 if X ⪰ 0 and Tr(X) ≤ ρ, and h(X) = ∞ otherwise,    (6)

and

R*(X) = Σ_{j=1}^m (Tr(A_jᵀ X) − b_j)².    (7)

Note that this decomposition is actually a special case of the conditional gradient decomposition above, and so we obtain the updates

X_{t+1} ∈ arg min_{X ⪰ 0, Tr(X) ≤ ρ} Σ_{j=1}^m [Tr(A_jᵀ X̂_t) − b_j] Tr(A_jᵀ X),    (8)

whose solution turns out to be ρvvᵀ, where v is the top singular vector of the matrix −Σ_{j=1}^m [Tr(A_jᵀ X̂_t) − b_j] A_j. This example serves both to illustrate the flexibility of our framework and to highlight the interesting fact that the Arora-Hazan-Kale SDP algorithm is a special case of conditional gradient (to get the original formulation in [1], we need to replace the function x² with x₊ log x₊, where x₊ = max(x, 0)). It is worth noting that the analysis in [1] appears to be tighter in their special case than the analysis we give here. We plan to investigate this phenomenon in future work.
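
To make update (8) concrete, the following numpy sketch (ours; the argument names are hypothetical) performs one such step. It forms G_t = Σ_{j=1}^m [Tr(A_jᵀ X̂_t) − b_j] A_j, the gradient of R* at X̂_t up to a factor of 2, and returns ρvvᵀ, where v is the unit eigenvector for the smallest eigenvalue of G_t (equivalently, the top eigenvector of −G_t), returning 0 when no eigenvalue is negative.

```python
import numpy as np

def ahk_step(A_list, b, X_hat, rho):
    """One conditional-gradient step for the AHK-style SDP objective (sketch, illustrative only).

    A_list -- list of symmetric (n, n) arrays A_j
    b      -- length-m sequence of targets b_j
    X_hat  -- current averaged iterate, symmetric (n, n)
    rho    -- trace bound
    Returns the rank-one minimizer of the linearized objective (8).
    """
    residuals = [float(np.sum(A * X_hat)) - bj for A, bj in zip(A_list, b)]  # Tr(A_j^T X_hat) - b_j
    G = sum(r * A for r, A in zip(residuals, A_list))

    # Minimize <G, X> over {X PSD, Tr(X) <= rho}: put all trace mass on the most
    # negative eigendirection of G, or return 0 if G has no negative eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(G)
    if eigvals[0] >= 0:
        return np.zeros_like(X_hat)
    v = eigvecs[:, 0]
    return rho * np.outer(v, v)
```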

Related work. A generalized Frank-Wolfe algorithm has also been proposed in [5], which considers first-order convex surrogate functions more generally. Algorithm 3 of [5] turns out to be equivalent to a line-search formulation of the algorithm we propose in this paper. However, the perspective is quite different. While both algorithms share the intuition of trying to construct a proximal Frank-Wolfe algorithm, the general framework of [5] studies convex upper bounds, rather than the convex lower bounds of this paper (although for the proximal gradient surrogates that they suggest using, their Frank-Wolfe algorithm does implicitly use lower bounds). Another difference is that their analysis focuses on the Lipschitz properties of the objective function, whereas we use the more general notion of strong convexity with respect to an arbitrary norm. We believe that this geometric perspective is useful for tailoring the proximal function to specific applications, as demonstrated in our discussion of q-herding below. Finally, the saddle-point perspective on Frank-Wolfe is not present in [5], although they do discuss the general idea of saddle-point methods in relation to convex surrogates.

q-herding. In addition to unifying several existing methods, our framework allows us to extend herding to an algorithm we call q-herding. Herding is an algorithm for constructing pseudosamples that match a specified collection of moments from a distribution; it was originally introduced by Welling [6] and was shown to be a special case of conditional gradient by Bach et al. [3]. Let M_X be the probability simplex over X. Herding can be cast as trying to minimize ‖E_µ[φ(x)] − φ̄‖₂² over µ ∈ M_X, for a given φ : X → R^d and φ̄ ∈ R^d. As shown in [3], the herding updates are equivalent to DBMD with h(µ) ≡ 0 and R(θ) = θᵀφ̄ + ½‖θ‖₂².

The implicit assumption in the herding algorithm is that ‖φ(x)‖₂ is bounded. We are able to construct a more general algorithm that only requires ‖φ(x)‖_p to be bounded for some p. This q-herding algorithm works by taking R(θ) = θᵀφ̄ + (1/q)‖θ‖_q^q, where 1/p + 1/q = 1. In this case our convergence results imply that ‖E_µ[φ(x)] − φ̄‖_p^p decays at a rate of O(1/T) (proof to appear in an extended version of this paper).

Convergence results. We end by stating our formal convergence results. Proofs are given in the Appendix. For the primal algorithm (PBMD) we have:

Theorem 1. Suppose that h is strongly convex with respect to a norm ‖·‖ and let r = sup_{θ, θ' ∈ Θ} ‖θ − θ'‖_*, where ‖·‖_* denotes the dual norm. Then, for any u and any θ ∈ Θ,

L(û_T, θ) ≤ L(u, θ̂_T) + (r²/(2A_T)) Σ_{t=1}^T α_{t+1}² A_t / A_{t+1}².    (9)

Corollary 2. Under the hypotheses of Theorem 1, for α_t = 1 we have

L(û_T, θ) ≤ L(u, θ̂_T) + r² (log(T) + 1) / T,    (10)

and for α_t = t we have

L(û_T, θ) ≤ L(u, θ̂_T) + 8r² / T.    (11)

Similarly, for the dual algorithm (DBMD) we have:

Theorem 3. Suppose that R is strongly convex with respect to a norm ‖·‖ and let r = sup_{u, u' ∈ U} ‖u − u'‖_*. Then, for any u and any θ ∈ Θ,

L(û_T, θ) ≤ L(u, θ̂_T) + (r²/(2A_T)) Σ_{t=1}^T α_{t+1}² A_t / A_{t+1}².    (12)

Corollary 4. Under the hypotheses of Theorem 3, for α_t = 1 we have

L(û_T, θ) ≤ L(u, θ̂_T) + r² (log(T) + 1) / T,    (13)

and for α_t = t we have

L(û_T, θ) ≤ L(u, θ̂_T) + 8r² / T.    (14)

Thus, a step size of α_t = t yields the claimed O(1/T) convergence rate.
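
The linear weights α_t = t used here correspond, under the α ↔ ρ mapping mentioned after the Frank-Wolfe updates, to the familiar step size ρ_t = α_t / A_t = 2/(t+1). A quick numerical check of this equivalence (our own illustration, not part of the paper):

```python
import numpy as np

# Sanity check (illustrative only): linear weights alpha_t = t versus the recursion with rho_t = 2/(t+1).
rng = np.random.default_rng(0)
us = rng.standard_normal((10, 3))                 # arbitrary iterates u_1, ..., u_10

# Weighted average with linear weights alpha_t = t.
alphas = np.arange(1, 11, dtype=float)
u_hat_weighted = (alphas[:, None] * us).sum(axis=0) / alphas.sum()

# Recursive averaging with rho_t = alpha_t / A_t = 2 / (t + 1).
u_hat_recursive = np.zeros(3)
for t, u in enumerate(us, start=1):
    rho = 2.0 / (t + 1)
    u_hat_recursive = (1 - rho) * u_hat_recursive + rho * u

assert np.allclose(u_hat_weighted, u_hat_recursive)
```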

Acknowledgements. Thanks to Cameron Freer, Simon Lacoste-Julien, Martin Jaggi, Percy Liang, Arun Chaganty, and Sida Wang for helpful discussions. Thanks also to the anonymous reviewer for suggesting improvements to the paper. JS was supported by the Hertz Foundation. JHH was supported by the U.S. Government under FA9550-11-C-0028, awarded by the Department of Defense, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

References

[1] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In FOCS, pages 339-348, 2005.
[2] Francis Bach. Duality between subgradient and conditional gradient methods. arXiv, November 2012.
[3] Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. In ICML, 2012.
[4] Martin Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML, pages 427-435, 2013.
[5] Julien Mairal. Optimization with first-order surrogate functions. In ICML, 2013.
[6] Max Welling. Herding dynamical weights to learn. In ICML, 2009.

Appendix: Convergence Proofs

We now prove the convergence results given above. Throughout this section, assume that α_1, ..., α_T is a sequence of real numbers and that A_t = Σ_{s=1}^t α_s. We further require that A_t > 0 for all t. Also recall that the Bregman divergence is defined by

D_f(x' ‖ x) := f(x') − f(x) − ⟨∇f(x), x' − x⟩.

Our proofs hinge on the following key lemma:

Lemma 5. Let z_1, ..., z_T be vectors and let f(x) be a strictly convex function. Define ẑ_t := (1/A_t) Σ_{s=1}^t α_s z_s. Let x_1, ..., x_{T+1} be chosen via x_{t+1} = arg min_x f(x) + ⟨ẑ_t, x⟩ (with the conventions A_0 = 0 and ẑ_0 = 0, so that x_1 = arg min_x f(x)). Then for any x* we have

(1/A_T) Σ_{t=1}^T α_t (f(x_t) + ⟨z_t, x_t⟩) ≤ f(x*) + ⟨ẑ_T, x*⟩ + (1/A_T) Σ_{t=1}^T A_t D_f(x_t ‖ x_{t+1}).

Proof. First note that, if x_0 = arg min_x f(x) + ⟨z, x⟩, then ∇f(x_0) = −z. Now note that

α_t z_t = A_t ẑ_t − A_{t−1} ẑ_{t−1}    (15)
        = −A_t ∇f(x_{t+1}) + A_{t−1} ∇f(x_t),    (16)

so we have

Σ_{t=1}^T α_t (f(x_t) + ⟨z_t, x_t⟩)    (17)
  = Σ_{t=1}^T [α_t f(x_t) + ⟨A_{t−1} ∇f(x_t) − A_t ∇f(x_{t+1}), x_t⟩]    (18)
  = Σ_{t=1}^T [α_t f(x_t) − A_t ⟨∇f(x_{t+1}), x_t − x_{t+1}⟩]    (19)
    − A_T ⟨∇f(x_{T+1}), x_{T+1}⟩    (20)
  = Σ_{t=1}^T [A_t f(x_t) − A_t ⟨∇f(x_{t+1}), x_t − x_{t+1}⟩ − A_t f(x_{t+1})]    (21)
    + A_T (f(x_{T+1}) − ⟨∇f(x_{T+1}), x_{T+1}⟩)
  = Σ_{t=1}^T A_t D_f(x_t ‖ x_{t+1}) + A_T (f(x_{T+1}) − ⟨∇f(x_{T+1}), x_{T+1}⟩)    (22)
  = Σ_{t=1}^T A_t D_f(x_t ‖ x_{t+1}) + A_T (f(x_{T+1}) + ⟨ẑ_T, x_{T+1}⟩)    (23)
  ≤ Σ_{t=1}^T A_t D_f(x_t ‖ x_{t+1}) + A_T (f(x*) + ⟨ẑ_T, x*⟩).    (24)

Dividing both sides by A_T completes the proof.

We also note that D_f(x_t ‖ x_{t+1}) = D_{f*}(ẑ_{t+1} ‖ ẑ_t), where f*(z) = sup_x ⟨z, x⟩ − f(x). This form of the bound will often be more useful to us. Another useful property of Bregman divergences is the following lemma, which relates strong convexity of f to strong smoothness of f*:

Lemma 6. Suppose that D_f(x' ‖ x) ≥ ½‖x − x'‖² for some norm ‖·‖ and for all x, x'. (In this case we say that f is strongly convex with respect to ‖·‖.) Then D_{f*}(y' ‖ y) ≤ ½‖y − y'‖_*² for all y, y'.
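
As a quick numerical illustration of Lemma 6 (ours, not part of the paper), consider the quadratic case: for f(x) = ½xᵀQx with Q ⪰ I, the conjugate is f*(y) = ½yᵀQ⁻¹y, and both Bregman divergences are explicit quadratic forms, so the strong-convexity/smoothness duality can be checked directly.

```python
import numpy as np

# Numerical check of Lemma 6 in the quadratic case (illustration only).
rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)            # Q >= I, so f(x) = 0.5 x^T Q x is 1-strongly convex in l2
Q_inv = np.linalg.inv(Q)           # conjugate: f*(y) = 0.5 y^T Q^{-1} y

x, x2 = rng.standard_normal(n), rng.standard_normal(n)
y, y2 = rng.standard_normal(n), rng.standard_normal(n)

D_f = 0.5 * (x2 - x) @ Q @ (x2 - x)            # Bregman divergence of f (quadratic case)
D_fstar = 0.5 * (y2 - y) @ Q_inv @ (y2 - y)    # Bregman divergence of f*

assert D_f >= 0.5 * np.dot(x2 - x, x2 - x)      # strong convexity of f
assert D_fstar <= 0.5 * np.dot(y2 - y, y2 - y)  # smoothness of f*, as in Lemma 6
```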

We require a final supporting proposition before proving Theorem 1.

Proposition 7 (Convergence of PBMD). Consider the updates θ_t ∈ arg max_θ ⟨θ, u_t⟩ − R(θ) and u_{t+1} ∈ arg min_u h(u) + ⟨θ̂_t, u⟩. Then, for any u and any θ ∈ Θ, we have

L(û_T, θ) ≤ L(u, θ̂_T) + (1/A_T) Σ_{t=1}^T A_t D_h(u_t ‖ u_{t+1}).    (25)

Proof. Note that L(u_t, θ_t) = max_θ L(u_t, θ) by construction. Also note that, if we invoke Lemma 5 with f = h and z_t = θ_t, then we get the inequality

(1/A_T) Σ_{t=1}^T α_t L(u_t, θ_t) ≤ (1/A_T) Σ_{t=1}^T α_t L(u, θ_t) + (1/A_T) Σ_{t=1}^T A_t D_h(u_t ‖ u_{t+1}).    (26)

Combining these together, we get the string of inequalities

L(û_T, θ) = L((1/A_T) Σ_{t=1}^T α_t u_t, θ)
          ≤ (1/A_T) Σ_{t=1}^T α_t L(u_t, θ)    (by convexity of L(·, θ))
          ≤ (1/A_T) Σ_{t=1}^T α_t L(u_t, θ_t)
          ≤ (1/A_T) Σ_{t=1}^T α_t L(u, θ_t) + (1/A_T) Σ_{t=1}^T A_t D_h(u_t ‖ u_{t+1})
          ≤ L(u, θ̂_T) + (1/A_T) Σ_{t=1}^T A_t D_h(u_t ‖ u_{t+1})    (by concavity of L(u, ·)),

as was to be shown.

Proof of Theorem 1. By Lemma 6, we have

D_h(u_t ‖ u_{t+1}) = D_{h*}(θ̂_{t+1} ‖ θ̂_t) ≤ ½ ‖θ̂_{t+1} − θ̂_t‖_*².

Moreover,

θ̂_{t+1} − θ̂_t = (1/A_{t+1}) Σ_{s=1}^{t+1} α_s θ_s − (1/A_t) Σ_{s=1}^t α_s θ_s
             = (α_{t+1}/A_{t+1}) θ_{t+1} − (α_{t+1}/(A_t A_{t+1})) Σ_{s=1}^t α_s θ_s
             = (α_{t+1}/A_{t+1}) (θ_{t+1} − θ̂_t),

so that ‖θ̂_{t+1} − θ̂_t‖_* ≤ r α_{t+1} / A_{t+1}. It follows that

(1/A_T) Σ_{t=1}^T A_t D_h(u_t ‖ u_{t+1}) ≤ (r²/(2A_T)) Σ_{t=1}^T α_{t+1}² A_t / A_{t+1}²,

which together with Proposition 7 proves the theorem.

Proof of Corollary 2. If we let α_t = 1, then A_t = t and α_{t+1}² A_t / A_{t+1}² = t/(t+1)² ≤ 1/t. We therefore get

(r²/(2A_T)) Σ_{t=1}^T α_{t+1}² A_t / A_{t+1}² ≤ (r²/(2T)) Σ_{t=1}^T 1/t ≤ r² (log(T) + 1) / T.    (27)

If we let α_t = t, then A_t = t(t+1)/2 and α_{t+1}² A_t / A_{t+1}² = 2t(t+1)/(t+2)² ≤ 2. We therefore get

(r²/(2A_T)) Σ_{t=1}^T α_{t+1}² A_t / A_{t+1}² ≤ 2r² T / (T(T+1)) ≤ 8r² / T,    (28)

which completes the proof.

The proof of Theorem 3 requires an analogous supporting proposition to that of Theorem 1.

Proposition 8 (Convergence of DBMD). Consider the updates u_t ∈ arg min_u h(u) + ⟨θ_t, u⟩ and θ_{t+1} ∈ arg max_θ ⟨θ, û_t⟩ − R(θ). Then, for any u and any θ ∈ Θ, we have

L(û_T, θ) ≤ L(u, θ̂_T) + (1/A_T) Σ_{t=1}^T A_t D_R(θ_t ‖ θ_{t+1}).    (29)

Proof. If we invoke Lemma 5 with f = R and z_t = −u_t, then for all θ we get the inequality

(1/A_T) Σ_{t=1}^T α_t (R(θ_t) − ⟨u_t, θ_t⟩) ≤ R(θ) − ⟨û_T, θ⟩ + (1/A_T) Σ_{t=1}^T A_t D_R(θ_t ‖ θ_{t+1}).    (30)

Re-arranging yields

(1/A_T) Σ_{t=1}^T α_t L(u_t, θ) ≤ (1/A_T) Σ_{t=1}^T α_t L(u_t, θ_t) + (1/A_T) Σ_{t=1}^T A_t D_R(θ_t ‖ θ_{t+1}).    (31)

Now, we have the following string of inequalities:

L(û_T, θ) = L((1/A_T) Σ_{t=1}^T α_t u_t, θ)
          ≤ (1/A_T) Σ_{t=1}^T α_t L(u_t, θ)    (by convexity of L(·, θ))
          ≤ (1/A_T) Σ_{t=1}^T α_t L(u_t, θ_t) + (1/A_T) Σ_{t=1}^T A_t D_R(θ_t ‖ θ_{t+1})
          = (1/A_T) Σ_{t=1}^T α_t inf_u L(u, θ_t) + (1/A_T) Σ_{t=1}^T A_t D_R(θ_t ‖ θ_{t+1})
          ≤ (1/A_T) Σ_{t=1}^T α_t L(u, θ_t) + (1/A_T) Σ_{t=1}^T A_t D_R(θ_t ‖ θ_{t+1})
          ≤ L(u, θ̂_T) + (1/A_T) Σ_{t=1}^T A_t D_R(θ_t ‖ θ_{t+1})    (by concavity of L(u, ·)),

as was to be shown.

Proofs of Theorem 3 and Corollary 4. The proofs are identical to those for PBMD.