Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization /

2 Outline Today: Conditional gradient method Convergence analysis Properties and variants 2

3 So far... Unconstrained optimization Gradient descent Conjugate gradient method Accelerated gradient methods Newton and Quasi-newton methods Trust region methods Proximal gradient descent 3

4 So far... Unconstrained optimization Gradient descent Conjugate gradient method Accelerated gradient methods Newton and Quasi-newton methods Trust region methods Proximal gradient descent Constrained optimization Projected gradient descent Conditional gradient (Frank-Wolfe) method - today... 3

5 Projected gradient descent Consider the constrained problem min x f(x) subject to x C where f is convex and smooth, and C is convex. Recall projected gradient descent: choose an initial x (0), and for k = 1, 2, 3,... x (k) = P C ( x (k 1) t k f(x (k 1)) where P C is the projection operator onto the set C 4

6 Projected gradient descent Consider the constrained problem min x f(x) subject to x C where f is convex and smooth, and C is convex. Recall projected gradient descent: choose an initial x (0), and for k = 1, 2, 3,... x (k) = P C ( x (k 1) t k f(x (k 1)) where P C is the projection operator onto the set C This was a special case of proximal gradient descent. 4

7 Gradient, proximal and projected gradient descent were motivated by a local quadratic expansion of f: f(y) f(x) + f(x) T (y x) + 1 2t (y x)t (y x) leading to x (k) = P C ( argmin y f(x (k 1) ) T (y x (k 1) ) + 1 2t y x(k 1) 2 2 ) 5

8 Gradient, proximal and projected gradient descent were motivated by a local quadratic expansion of f: f(y) f(x) + f(x) T (y x) + 1 2t (y x)t (y x) leading to x (k) = P C ( argmin y f(x (k 1) ) T (y x (k 1) ) + 1 2t y x(k 1) 2 2 ) Newton method improved the quadratic expansion using Hessian of f (can do projected Newton too): f(y) f(x) + f(x) T (y x) (y x)t 2 f(x)(y x) 5

9 Gradient, proximal and projected gradient descent were motivated by a local quadratic expansion of f: f(y) f(x) + f(x) T (y x) + 1 2t (y x)t (y x) leading to x (k) = P C ( argmin y f(x (k 1) ) T (y x (k 1) ) + 1 2t y x(k 1) 2 2 ) Newton method improved the quadratic expansion using Hessian of f (can do projected Newton too): f(y) f(x) + f(x) T (y x) (y x)t 2 f(x)(y x) What about a simpler linear expansion of f (when does it make sense)? f(y) f(x) + f(x) T (y x) 5

10 Conditional gradient (Frank-Wolfe) method Using a simpler linear expansion of f: Choose an initial x (0) C and for k = 1, 2, 3,... s (k 1) argmin s C f(x (k 1) ) T s x (k) = (1 γ k )x (k 1) + γ k s (k 1) Note that there is no projection; update is solved directly over the constraint set C 6

11 Conditional gradient (Frank-Wolfe) method Using a simpler linear expansion of f: Choose an initial x (0) C and for k = 1, 2, 3,... s (k 1) argmin s C f(x (k 1) ) T s x (k) = (1 γ k )x (k 1) + γ k s (k 1) Note that there is no projection; update is solved directly over the constraint set C The default choice for step sizes is γ k = 2/(k + 1), k = 1, 2, 3,.... No dependence on Lipschitz constant, condition number, or backtracking line search parameters. 6

12 Conditional gradient (Frank-Wolfe) method Using a simpler linear expansion of f: Choose an initial x (0) C and for k = 1, 2, 3,... s (k 1) argmin s C f(x (k 1) ) T s x (k) = (1 γ k )x (k 1) + γ k s (k 1) Note that there is no projection; update is solved directly over the constraint set C The default choice for step sizes is γ k = 2/(k + 1), k = 1, 2, 3,.... No dependence on Lipschitz constant, condition number, or backtracking line search parameters. For any choice 0 γ k 1, we see that x (k) C by convexity. (why?) 6

13 Conditional gradient (Frank-Wolfe) method Using a simpler linear expansion of f: Choose an initial x (0) C and for k = 1, 2, 3,... s (k 1) argmin s C f(x (k 1) ) T s x (k) = (1 γ k )x (k 1) + γ k s (k 1) Note that there is no projection; update is solved directly over the constraint set C 7

14 Conditional gradient (Frank-Wolfe) method Using a simpler linear expansion of f: Choose an initial x (0) C and for k = 1, 2, 3,... s (k 1) argmin s C f(x (k 1) ) T s x (k) = (1 γ k )x (k 1) + γ k s (k 1) Note that there is no projection; update is solved directly over the constraint set C Can also think of the update as x (k) = x (k 1) + γ k (s (k 1) x (k 1) ) i.e., we are moving less and less in the direction of the linearization minimizer as the algorithm proceeds 7

15 f f(x) g(x) s x (From Jaggi 2011) D 8

16 Norm constraints What happens when C = {x : x t} for a norm? Then s argmin s t = t f(x (k 1) ) T s ( argmax s 1 = t f(x (k 1) ) ) f(x (k 1) ) T s where is the corresponding dual norm. Norms: f(x) = x p. Let q be such that 1/p + 1/q = 1, then x p = max z q 1 zt x And f(x) = argmax z q 1 z T x 9

17 10 Norm constraints What happens when C = {x : x t} for a norm? Then s argmin s t = t f(x (k 1) ) T s ( argmax s 1 = t f(x (k 1) ) ) f(x (k 1) ) T s where is the corresponding dual norm. In other words, if we know how to compute subgradients of the dual norm, then we can easily perform Frank-Wolfe steps A key to Frank-Wolfe: this can often be simpler or cheaper than projection onto C = {x : x t}. Also often simpler or cheaper than the prox operator for

18 11 For the l 1 -regularized problem Example: l 1 regularization min x f(x) subject to x 1 t we have s (k 1) t f(x (k 1) ). Frank-Wolfe update is thus i k 1 argmax i f(x (k 1) ) i=1,...p x (k) = (1 γ k )x (k 1) γ k t sign ( ik 1 f(x (k 1) ) ) e ik 1 This is a kind of coordinate descent. (More on coordinate descent later.) Note: this is a lot simpler than projection onto the l 1 ball, though both require O(n) operations

19 12 For the l p -regularized problem Example: l p regularization min x f(x) subject to x p t for 1 p, we have s (k 1) t f(x (k 1) ) q, where p, q are dual, i.e., 1/p + 1/q = 1. Claim: can choose s (k 1) i = α sign ( f i (x (k 1) ) ) f i (x (k 1) ) q/p, i = 1,... n where α is a constant such that s (k 1) q = t (check this), and then Frank-Wolfe updates are as usual Note: this is a lot simpler than projection onto the l p ball, for general p. Aside from special cases (p = 1, 2, ), these projections cannot be directly computed (must be treated as an optimization)

20 13 Example: trace norm regularization For the trace-regularized problem min X f(x) subject to X tr t we have S (k 1) t f(x (k 1) ) op. Claim: can choose S (k 1) = t uv T where u, v are leading left, right singular vectors of f(x (k 1) ) (check this), and then Frank-Wolfe updates are as usual Note: this is a lot simpler and more efficient than projection onto the trace norm ball, which requires a singular value decomposition.

21 14 Constrained and Lagrange forms Recall that solution of the constrained problem min x f(x) subject to x t are equivalent to those of the Lagrange problem min x f(x) + λ x as we let the tuning parameters t and λ vary over [0, ]. More on this later. We should also compare the Frank-Wolfe updates under to the proximal operator of

22 15 l 1 norm: Frank-Wolfe update scans for maximum of gradient; proximal operator soft-thresholds the gradient step; both use O(n) flops l p norm: Frank-Wolfe update raises each entry of gradient to power and sums, in O(n) flops; proximal operator not generally directly computable Trace norm: Frank-Wolfe update computes top left and right singular vectors of gradient; proximal operator soft-thresholds the gradient step, requiring a singular value decomposition Many other regularizers yield efficient Frank-Wolfe updates, e.g., special polyhedra or cone constraints, sum-of-norms (group-based) regularization, atomic norms. See Jaggi (2011)

23 Comparing projected and conditional gradient for constrained lasso problem, with n = 100, p = 500: f fstar 1e 01 1e+00 1e+01 1e+02 1e+03 Projected gradient Conditional gradient We will see that Frank-Wolfe methods match convergence rates of known first-order methods; but in practice they can be slower to converge to high accuracy (note: fixed step sizes here, line search would probably improve convergence) k 16

24 17 Sub-optimality gap Frank-Wolfe iterations admit a very natural suboptimality gap: max s C f(x(k 1) ) T (x (k 1) s) This is an upper bound on f(x (k 1) ) f Proof: by the first-order condition for convexity f(s) f(x (k 1) ) + f(x (k 1) ) T (s x (k 1) ) Minimizing both sides over all s C yields f f(x (k 1) ) + min s C f(x(k 1) ) T (s x (k 1) ) Rearranged, this gives the sub-optimality gap above

25 18 Note that max s C f(x(k 1) ) T (x (k 1) s) = f(x (k 1) ) T (x (k 1) s (k 1) ) so this quantity comes directly from the Frank-Wolfe update.

26 19 Convergence analysis Following Jaggi (2011), define the curvature constant of f over C: M = max x,s,y C y=(1 γ)x+γs 2 ( ) γ 2 f(y) f(x) f(x) T (y x) (Above we restrict γ [0, 1].) Note that M = 0 when f is linear. The quantity f(y) f(x) f(x) T (y x) is called the Bregman divergence defined by f Theorem: Conditional gradient method using fixed step sizes γ k = 2/(k + 1), k = 1, 2, 3,... satisfies f(x (k) ) f 2M k + 2 Number of iterations needed to have f(x (k) ) f ɛ is O(1/ɛ)

27 20 This matches the known rate for projected gradient descent when f is Lipschitz, but how do the assumptions compare?. In fact, if f is Lipschitz with constant L then M diam 2 (C) L, where diam(c) = max x,s C x s 2 To see this, recall that f Lipschitz with constant L means f(y) f(x) f(x) T (y x) L 2 y x 2 2 Maximizing over all y = (1 γ)x + γs, and multiplying by 2/γ 2, M max x,s,y C y=(1 γ)x+γs 2 γ 2 L 2 y x 2 2 = max x,s C L x s 2 2 and the bound follows. Essentially, assuming a bounded curvature is no stronger than what we assumed for proximal gradient

28 Basic inequality The key inequality used to prove the Frank-Wolfe convergence rate is: f(x (k) ) f(x (k 1) ) γ k g(x (k 1) ) + γ2 k 2 M Here g(x) = max s C f(x) T (x s) is the sub-optimality gap discussed earlier. The rate follows from this inequality, using induction Proof: write x + = x (k), x = x (k 1), s = s (k 1), γ = γ k. Then f(x + ) = f ( x + γ(s x) ) f(x) + γ f(x) T (s x) + γ2 2 M = f(x) γg(x) + γ2 2 M Second line used definition of M, and third line the definition of g 21

29 22 Affine invariance Important property of Frank-Wolfe: its updates are affine invariant. Given nonsingular A : R n R n, define x = Ax, h(x ) = f(ax ). Then Frank-Wolfe on h(x ) proceeds as s = argmin h(x ) T z z A 1 C (x ) + = (1 γ)x + γs Multiplying by A reveals precisely the same Frank-Wolfe update as would be performed on f(x). Even convergence analysis is affine invariant. Note that the curvature constant M of h is M = max x,s,y A 1 C y =(1 γ)x +γs 2 ( ) γ 2 h(y ) h(x ) h(x ) T (y x ) matching that of f, because h(x ) T (y x ) = f(x) T (y x)

30 23 Inexact updates Jaggi (2011) also analyzes inexact Frank-Wolfe updates. That is, suppose we choose s (k 1) so that f(x (k 1) ) T s (k 1) min s C f(x(k 1) ) T s + Mγ k 2 where δ 0 is our inaccuracy parameter. Then we basically attain the same rate Theorem: Conditional gradient method using fixed step sizes γ k = 2/(k +1), k = 1, 2, 3,..., and inaccuracy parameter δ 0, satisfies f(x (k) ) f 2M (1 + δ) k + 2 Note: the optimization error at step k is Mγ k 2 δ. Since γ k 0, we require the errors to vanish δ

31 24 Some variants Some variants of the conditional gradient method: Line search: instead of fixing γ k = 2/(k + 1), k = 1, 2, 3,..., use exact line search for the step sizes γ k = argmin γ [0,1] f ( x (k 1) + γ(s (k 1) x (k 1) ) ) at each k = 1, 2, 3,.... Or, we could use backtracking Fully corrective: directly update according to x (k) = argmin y f(y) subject to y conv{x (0), s (0),... s (k 1) } Can make much better progress, but is also quite a bit harder

32 25 References K. Clarkson (2010), Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm J. Giesen and M. Jaggi and S. Laue, S. (2012), Approximating parametrized convex optimization problems M. Jaggi (2011), Sparse convex optimization methods for machine learning M. Jaggi (2011), Revisiting Frank-Wolfe: projection-free sparse convex optimization M. Frank and P. Wolfe (1956), An algorithm for quadratic programming

### Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

### Lecture 23: November 19

10-725/36-725: Conve Optimization Fall 2018 Lecturer: Ryan Tibshirani Lecture 23: November 19 Scribes: Charvi Rastogi, George Stoica, Shuo Li Charvi Rastogi: 23.1-23.4.2, George Stoica: 23.4.3-23.8, Shuo

### Lecture 23: Conditional Gradient Method

10-725/36-725: Conve Optimization Spring 2015 Lecture 23: Conditional Gradient Method Lecturer: Ryan Tibshirani Scribes: Shichao Yang,Diyi Yang,Zhanpeng Fang Note: LaTeX template courtesy of UC Berkeley

### Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

### Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

### Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

### Proximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725

Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:

### arxiv: v1 [math.oc] 1 Jul 2016

Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

### Subgradient Method. Ryan Tibshirani Convex Optimization

Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial

### Lecture 14: October 17

1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

### Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

### Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

### Lecture 5: September 12

10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 12 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Barun Patra and Tyler Vuong Note: LaTeX template courtesy of UC Berkeley EECS

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

### Lecture 9: September 28

0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

### Proximal Newton Method. Ryan Tibshirani Convex Optimization /36-725

Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h

### Optimization for Machine Learning

Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

### Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

### Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

### Gradient Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization /36-725

Gradient Descent Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh Convex Optimization 10-725/36-725 Based on slides from Vandenberghe, Tibshirani Gradient Descent Consider unconstrained, smooth convex

### 10. Unconstrained minimization

Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation

### Optimization methods

Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

### Lecture 8: February 9

0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we

### Convex Optimization Lecture 16

Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

### Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

### Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method

Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex

L. Vandenberghe EE236C (Spring 2016) 1. Gradient method gradient method, first-order methods quadratic bounds on convex functions analysis of gradient method 1-1 Approximate course outline First-order

L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping

### On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

### Lecture 17: October 27

0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

### Lecture 14: Newton s Method

10-725/36-725: Conve Optimization Fall 2016 Lecturer: Javier Pena Lecture 14: Newton s ethod Scribes: Varun Joshi, Xuan Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

### ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

### Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

### Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

### Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

### Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve

### Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

### Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

### Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

### Dual methods and ADMM. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Dual methods and ADMM Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Given f : R n R, the function is called its conjugate Recall conjugate functions f (y) = max x R n yt x f(x)

### Constrained Optimization and Lagrangian Duality

CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

### Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

### Duality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Duality in Linear Programs Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: proximal gradient descent Consider the problem x g(x) + h(x) with g, h convex, g differentiable, and

### Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization

Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal

### Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

### Unconstrained minimization

CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

### Optimization methods

Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

### An Optimal Affine Invariant Smooth Minimization Algorithm.

An Optimal Affine Invariant Smooth Minimization Algorithm. Alexandre d Aspremont, CNRS & École Polytechnique. Joint work with Martin Jaggi. Support from ERC SIPA. A. d Aspremont IWSL, Moscow, June 2013,

### Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

### 10-725/36-725: Convex Optimization Prerequisite Topics

10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the

### min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

### Coordinate Update Algorithm Short Course Proximal Operators and Algorithms

Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow

Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method

### Lecture 7: September 17

10-725: Optimization Fall 2013 Lecture 7: September 17 Lecturer: Ryan Tibshirani Scribes: Serim Park,Yiming Gu 7.1 Recap. The drawbacks of Gradient Methods are: (1) requires f is differentiable; (2) relatively

### Optimization Tutorial 1. Basic Gradient Descent

E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

### March 8, 2010 MATH 408 FINAL EXAM SAMPLE

March 8, 200 MATH 408 FINAL EXAM SAMPLE EXAM OUTLINE The final exam for this course takes place in the regular course classroom (MEB 238) on Monday, March 2, 8:30-0:20 am. You may bring two-sided 8 page

### Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

### This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization

This can be 2 lectures! still need: Examples: non-convex problems applications for matrix factorization x = prox_f(x)+prox_{f^*}(x) use to get prox of norms! PROXIMAL METHODS WHY PROXIMAL METHODS Smooth

### Homework 4. Convex Optimization /36-725

Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

### 14. Nonlinear equations

L. Vandenberghe ECE133A (Winter 2018) 14. Nonlinear equations Newton method for nonlinear equations damped Newton method for unconstrained minimization Newton method for nonlinear least squares 14-1 Set

### Descent methods. min x. f(x)

Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

### NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

### SIAM Conference on Imaging Science, Bologna, Italy, Adaptive FISTA. Peter Ochs Saarland University

SIAM Conference on Imaging Science, Bologna, Italy, 2018 Adaptive FISTA Peter Ochs Saarland University 07.06.2018 joint work with Thomas Pock, TU Graz, Austria c 2018 Peter Ochs Adaptive FISTA 1 / 16 Some

### Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

### Lecture 5 : Projections

Lecture 5 : Projections EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan Up until now, we have seen convergence rates of unconstrained gradient descent. Now, we consider a constrained minimization

### Limited Memory Kelley s Method Converges for Composite Convex and Submodular Objectives

Limited Memory Kelley s Method Converges for Composite Convex and Submodular Objectives Madeleine Udell Operations Research and Information Engineering Cornell University Based on joint work with Song

### Lecture 6: September 12

10-725: Optimization Fall 2013 Lecture 6: September 12 Lecturer: Ryan Tibshirani Scribes: Micol Marchetti-Bowick Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not

### Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

### Nonlinear Optimization: What s important?

Nonlinear Optimization: What s important? Julian Hall 10th May 2012 Convexity: convex problems A local minimizer is a global minimizer A solution of f (x) = 0 (stationary point) is a minimizer A global

### Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent

10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 5: Gradient Descent Scribes: Loc Do,2,3 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for

L. Vandenberghe EE236C (Spring 2016) 5. Subgradient method subgradient method convergence analysis optimal step size when f is known alternating projections optimality 5-1 Subgradient method to minimize

### Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

### Proximal methods. S. Villa. October 7, 2014

Proximal methods S. Villa October 7, 2014 1 Review of the basics Often machine learning problems require the solution of minimization problems. For instance, the ERM algorithm requires to solve a problem

### ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

### 1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by:

Newton s Method Suppose we want to solve: (P:) min f (x) At x = x, f (x) can be approximated by: n x R. f (x) h(x) := f ( x)+ f ( x) T (x x)+ (x x) t H ( x)(x x), 2 which is the quadratic Taylor expansion

### On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

### Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

### Numerisches Rechnen. (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang. Institut für Geometrie und Praktische Mathematik RWTH Aachen

Numerisches Rechnen (für Informatiker) M. Grepl P. Esser & G. Welper & L. Zhang Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2011/12 IGPM, RWTH Aachen Numerisches Rechnen

### Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

### Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2

Methods for Unconstrained Optimization Numerical Optimization Lectures 1-2 Coralia Cartis, University of Oxford INFOMM CDT: Modelling, Analysis and Computation of Continuous Real-World Problems Methods

### Chapter 2. Optimization. Gradients, convexity, and ALS

Chapter 2 Optimization Gradients, convexity, and ALS Contents Background Gradient descent Stochastic gradient descent Newton s method Alternating least squares KKT conditions 2 Motivation We can solve

### Proximal Methods for Optimization with Spasity-inducing Norms

Proximal Methods for Optimization with Spasity-inducing Norms Group Learning Presentation Xiaowei Zhou Department of Electronic and Computer Engineering The Hong Kong University of Science and Technology

### Lecture: Smoothing.

Lecture: Smoothing http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Smoothing 2/26 introduction smoothing via conjugate

### STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

### Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f

### A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

### CPSC 540: Machine Learning

CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

### Lasso: Algorithms and Extensions

ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

### 8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient

### ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

GRADIENT METHODS GRADIENT = STEEPEST DESCENT Convex Function Iso-contours gradient 0.5 0.4 4 2 0 8 0.3 0.2 0. 0 0. negative gradient 6 0.2 4 0.3 2.5 0.5 0 0.5 0.5 0 0.5 0.4 0.5.5 0.5 0 0.5 GRADIENT DESCENT

### The Frank-Wolfe Algorithm:

The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology

### Higher-Order Methods

Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

### Lecture 5: September 15

10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 15 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Di Jin, Mengdi Wang, Bin Deng Note: LaTeX template courtesy of UC Berkeley EECS

### Lecture 1: September 25

0-725: Optimization Fall 202 Lecture : September 25 Lecturer: Geoff Gordon/Ryan Tibshirani Scribes: Subhodeep Moitra Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

### Lecture 23: November 21

10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: