Accelerated gradient methods

1 ELE 538B: Large-Scale Optimization for Data Science. Accelerated gradient methods. Yuxin Chen, Princeton University, Spring 2018

2 Outline
- Heavy-ball methods
- Nesterov's accelerated gradient methods
- Accelerated proximal gradient methods (FISTA)
- Convergence analysis
- Lower bounds

3 (Proximal) gradient methods

Iteration complexities of (proximal) gradient methods:
- strongly convex and smooth problems: O(κ log(1/ε))
- convex and smooth problems: O(1/ε)

Can one still hope to further accelerate convergence?

4 Issues and possible solutions

Issues:
- GD focuses on improving the cost in each individual iteration, which might be too short-sighted
- GD might sometimes zigzag / experience abrupt changes

Solutions:
- exploit information from history / past iterates
- add a buffer (like momentum) to yield a smoother trajectory

5 Heavy-ball methods (Polyak '64)

6 Heavy-ball method

minimize_{x ∈ R^n} f(x)

Heavy-ball update (B. Polyak):

x^{t+1} = x^t − η_t ∇f(x^t) + θ_t (x^t − x^{t−1})    (momentum term: θ_t (x^t − x^{t−1}))

- add inertia to the ball (i.e. include a momentum term) to mitigate zigzagging
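
As a quick sanity check of this update, here is a minimal NumPy sketch of the heavy-ball iteration; the quadratic test function, step size and momentum below are hypothetical illustrative choices, not values prescribed by the slides.

```python
import numpy as np

# hypothetical test problem: f(x) = 0.5 x'Ax - b'x (strongly convex quadratic)
A = np.diag([1.0, 10.0])           # Hessian eigenvalues: mu = 1, L = 10
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b         # gradient of f

eta, theta = 0.1, 0.5              # illustrative stepsize / momentum (not tuned)
x_prev = x = np.zeros(2)
for t in range(200):
    # x^{t+1} = x^t - eta * grad f(x^t) + theta * (x^t - x^{t-1})
    x, x_prev = x - eta * grad(x) + theta * (x - x_prev), x
print(x, np.linalg.solve(A, b))    # heavy-ball iterate vs. exact minimizer
```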

7 Heavy-ball method

(figure: iterate trajectories of gradient descent vs. the heavy-ball method converging to x⋆; the heavy-ball trajectory zigzags less)

8 State-space models

One can understand heavy-ball methods through the dynamics of the following dynamical system:

[ x^{t+1}; x^t ] = [ (1+θ_t)I, −θ_t I; I, 0 ] [ x^t; x^{t−1} ] − [ η_t ∇f(x^t); 0 ]

or equivalently, in terms of the state [ x^t − x⋆; x^{t−1} − x⋆ ],

[ x^{t+1} − x⋆; x^t − x⋆ ] = [ (1+θ_t)I, −θ_t I; I, 0 ] [ x^t − x⋆; x^{t−1} − x⋆ ] − [ η_t ∇f(x^t); 0 ]
                           = [ (1+θ_t)I − η_t ∫_0^1 ∇²f(x_τ) dτ, −θ_t I; I, 0 ] [ x^t − x⋆; x^{t−1} − x⋆ ]

with x_τ := x^t + τ(x⋆ − x^t), where the last line follows from the fundamental theorem of calculus (the matrix on the right is the system matrix).

9 System matrix

[ x^{t+1} − x⋆; x^t − x⋆ ] = H_t [ x^t − x⋆; x^{t−1} − x⋆ ],    where H_t := [ (1+θ_t)I − η_t ∫_0^1 ∇²f(x_τ) dτ, −θ_t I; I, 0 ]    (7.1)

- implication: convergence of heavy-ball methods depends on the spectrum of the system matrix H_t
- key idea: find an appropriate stepsize η_t and momentum coefficient θ_t to control the spectrum of H_t

10 Convergence for strongly convex problems

Theorem 7.1 (Convergence of heavy-ball methods for strongly convex functions). Suppose f is µ-strongly convex and L-smooth. Set κ = L/µ,

η_t ≡ 4 / (√L + √µ)²,    θ_t ≡ max{ (1 − √(η_t L))², (1 − √(η_t µ))² }

Then

‖ [ x^{t+1} − x⋆; x^t − x⋆ ] ‖ ≤ ( (√κ − 1)/(√κ + 1) )^t ‖ [ x^1 − x⋆; x^0 − x⋆ ] ‖

- iteration complexity: O(√κ log(1/ε))
- significant improvement over GD: O(√κ log(1/ε)) vs. O(κ log(1/ε))
- relies on knowledge of both L and µ
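
The role of the system matrix can be checked numerically. The sketch below assumes a quadratic f, so that ∫_0^1 ∇²f(x_τ) dτ is a constant matrix Λ; it builds H with the stepsize and momentum of Theorem 7.1 and compares its spectral radius with (√κ − 1)/(√κ + 1). All problem data here are hypothetical.

```python
import numpy as np

mu, L = 1.0, 100.0                      # strong convexity / smoothness constants
kappa = L / mu
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
theta = max((1 - np.sqrt(eta * L)) ** 2, (1 - np.sqrt(eta * mu)) ** 2)

Lam = np.diag(np.linspace(mu, L, 5))    # spectrum of the (constant) Hessian
I, Z = np.eye(5), np.zeros((5, 5))
H = np.block([[(1 + theta) * I - eta * Lam, -theta * I],
              [I, Z]])                  # system matrix H_t for this quadratic

rho = max(abs(np.linalg.eigvals(H)))    # spectral radius of H
print(rho, (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))   # both around 0.818
```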

11 Proof of Theorem 7.1

In view of (7.1), it suffices to control the spectrum of H_t. Let λ_i denote the i-th eigenvalue of ∫_0^1 ∇²f(x_τ) dτ and set Λ := diag(λ_1, …, λ_n). Then

‖H_t‖ = ‖ [ (1+θ_t)I − η_t Λ, −θ_t I; I, 0 ] ‖ ≤ max_{1≤i≤n} ‖ [ 1 + θ_t − η_t λ_i, −θ_t; 1, 0 ] ‖

To finish the proof, it suffices to show

max_i ‖ [ 1 + θ_t − η_t λ_i, −θ_t; 1, 0 ] ‖ ≤ (√κ − 1)/(√κ + 1)    (7.2)

12 Proof of Theorem 7.1 (cont.)

To show (7.2), note that the two eigenvalues of [ 1 + θ_t − η_t λ_i, −θ_t; 1, 0 ] are the roots of

z² − (1 + θ_t − η_t λ_i) z + θ_t = 0

If (1 + θ_t − η_t λ_i)² ≤ 4θ_t, then both roots have the same magnitude √θ_t (they form a complex-conjugate pair, or coincide when the discriminant vanishes). In addition, one can easily check that (1 + θ_t − η_t λ_i)² ≤ 4θ_t is satisfied if

θ_t ∈ [ (1 − √(η_t λ_i))², (1 + √(η_t λ_i))² ]    (7.3)

which holds if one picks θ_t = max{ (1 − √(η_t L))², (1 − √(η_t µ))² }

13 Proof of Theorem 7.1 (cont.)

With this choice of θ_t, we have ‖H_t‖ ≤ √θ_t. Finally, setting η_t = 4/(√L + √µ)² ensures 1 − √(η_t L) = −(1 − √(η_t µ)), which yields

θ_t = max{ (1 − 2√L/(√L + √µ))², (1 − 2√µ/(√L + √µ))² } = ( (√κ − 1)/(√κ + 1) )²

This in turn establishes ‖H_t‖ ≤ (√κ − 1)/(√κ + 1).

14 Nesterov's accelerated gradient methods

15 Convex case

minimize_{x ∈ R^n} f(x)

For strongly convex f, including momentum terms allows one to improve the iteration complexity from O(κ log(1/ε)) to O(√κ log(1/ε)).

Can we obtain a similar improvement for the convex case (without strong convexity) as well?

16 Nesterov's idea (Nesterov '83)

x^{t+1} = y^t − η_t ∇f(y^t)
y^{t+1} = x^{t+1} + (t/(t+3)) (x^{t+1} − x^t)

- alternates between gradient updates and proper extrapolation
- each iteration takes nearly the same cost as GD
- not a descent method (i.e. we may not have f(x^{t+1}) ≤ f(x^t))
- one of the most beautiful and mysterious results in optimization...
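
A minimal sketch of this update on a hypothetical smooth convex least-squares objective, using the step size η = 1/L discussed on the next slide; the data below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
grad = lambda x: A.T @ (A @ x - b)          # gradient of f(x) = 0.5 ||Ax - b||^2
L = np.linalg.norm(A, 2) ** 2               # smoothness constant of f

x = y = np.zeros(20)
for t in range(500):
    x_new = y - (1.0 / L) * grad(y)         # gradient step from y^t
    y = x_new + t / (t + 3.0) * (x_new - x) # extrapolation with coefficient t/(t+3)
    x = x_new
print(0.5 * np.linalg.norm(A @ x - b) ** 2) # final objective value
```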

17 Convergence of Nesterov's accelerated gradient method

Suppose f is convex and L-smooth. If η_t ≡ η = 1/L, then

f(x^t) − f^opt ≤ 2L ‖x^0 − x⋆‖² / (t + 1)²

- iteration complexity: O(1/√ε)
- much faster than gradient methods
- we'll provide the proof for the (more general) proximal version later

18 Interpretation using differential equations

Nesterov's momentum coefficient t/(t+3) ≈ 1 − 3/t is particularly mysterious

19 Interpretation using differential equations (cont.)

(figure: a damped oscillator analogy, with spring force F = −∇f(X), a time-varying damping force, and trajectory X)

To develop insights into why Nesterov's method works so well, it is helpful to look at its continuous limit (η_t → 0), which is given by a second-order ordinary differential equation (ODE):

Ẍ(τ) + (3/τ) Ẋ(τ) + ∇f(X(τ)) = 0

where (3/τ)Ẋ(τ) acts as a time-varying damping term and ∇f(X(τ)) plays the role of the potential / spring force.    (Su, Boyd, Candes '14)

20 Heuristic derivation of the ODE

To begin with, Nesterov's update rule is equivalent to

(x^{t+1} − x^t)/√η = ((t−1)/(t+2)) (x^t − x^{t−1})/√η − √η ∇f(y^t)    (7.4)

Let t = τ/√η. Set X(τ) ≈ x^{τ/√η} = x^t and X(τ + √η) ≈ x^{t+1}. Then Taylor expansion gives

(x^{t+1} − x^t)/√η ≈ Ẋ(τ) + (1/2) Ẍ(τ) √η
(x^t − x^{t−1})/√η ≈ Ẋ(τ) − (1/2) Ẍ(τ) √η

which combined with (7.4) yields

Ẋ(τ) + (1/2) Ẍ(τ) √η ≈ (1 − 3√η/τ) ( Ẋ(τ) − (1/2) Ẍ(τ) √η ) − √η ∇f(X(τ))

⟹ Ẍ(τ) + (3/τ) Ẋ(τ) + ∇f(X(τ)) ≈ 0

21 Convergence rate of the ODE

Ẍ + (3/τ) Ẋ + ∇f(X) = 0    (7.5)

It is not hard to show that this ODE obeys

f(X(τ)) − f^opt ≤ O(1/τ²)    (7.6)

which somehow explains Nesterov's O(1/t²) convergence
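
The decay (7.6) can also be observed by integrating the ODE numerically. The sketch below uses forward Euler on a hypothetical quadratic f(x) = ½‖x‖² (so f^opt = 0), starting at some τ_0 > 0 to avoid the singular damping coefficient 3/τ at τ = 0; the step size and horizon are arbitrary illustrative choices.

```python
import numpy as np

grad = lambda x: x               # f(x) = 0.5 * ||x||^2, so f_opt = 0
x = np.array([5.0, -3.0])        # X(tau0)
v = np.zeros(2)                  # X'(tau0)
tau, dt = 0.1, 1e-3

for _ in range(300000):
    # X'' = -(3 / tau) X' - grad f(X), integrated by forward Euler
    a = -(3.0 / tau) * v - grad(x)
    x, v, tau = x + dt * v, v + dt * a, tau + dt

# f(X(tau)) alongside the 1/tau^2 reference scale from (7.6)
print(tau, 0.5 * x @ x, 1.0 / tau ** 2)
```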

22 Proof of (7.6)

Define the Lyapunov function (energy function)

E(τ) = τ² ( f(X) − f^opt ) + 2 ‖ X + (τ/2) Ẋ − X⋆ ‖²

This obeys

Ė = 2τ ( f(X) − f^opt ) + τ² ⟨∇f(X), Ẋ⟩ + 4 ⟨ X + (τ/2) Ẋ − X⋆, (3/2) Ẋ + (τ/2) Ẍ ⟩
 (i)= 2τ ( f(X) − f^opt ) − 2τ ⟨ X − X⋆, ∇f(X) ⟩
   ≤ 0    (by convexity)

where (i) follows by replacing (τ/2) Ẍ + (3/2) Ẋ with −(τ/2) ∇f(X) (i.e. using the ODE).

This means E is non-increasing in τ, and hence (by the definition of E)

f(X(τ)) − f^opt ≤ E(τ)/τ² ≤ E(0)/τ² = O(1/τ²)

23 Magic number 3

Ẍ + (3/τ) Ẋ + ∇f(X) = 0

- 3 is the smallest constant that guarantees O(1/τ²) convergence; it can be replaced by any other α ≥ 3
- in some sense, 3 minimizes the pre-constant in the convergence bound O(1/τ²) (see Su, Boyd, Candes '14)

24 Numerical example

Example taken from UCLA EE236C: minimize_x log ∑_{i=1}^m exp(a_i^⊤ x + b_i) on randomly generated problem instances.

(figure: relative error (f(x^(k)) − f^opt)/f^opt versus iteration count k for the gradient method and FISTA, both with fixed step size 1/L; FISTA converges markedly faster)

25 Extension to composite models

minimize_{x ∈ R^n}  F(x) := f(x) + h(x)

- f: convex and smooth
- h: convex (may not be differentiable)
- let F^opt := min_x F(x) be the optimal cost

26 FISTA (Beck & Teboulle '09)

Fast iterative shrinkage-thresholding algorithm:

x^{t+1} = prox_{η_t h} ( y^t − η_t ∇f(y^t) )
y^{t+1} = x^{t+1} + ((θ_t − 1)/θ_{t+1}) (x^{t+1} − x^t)

where y^0 = x^0, θ_0 = 1, and

θ_{t+1} = (1 + √(1 + 4θ_t²)) / 2    (momentum coefficient)

- momentum coefficient of this form originally proposed by Nesterov '83
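
A minimal FISTA sketch for a lasso-type instance F(x) = ½‖Ax − b‖² + λ‖x‖₁, where the prox of η λ‖·‖₁ is soft-thresholding; the data, dimensions and λ below are hypothetical, and the momentum follows the θ_t recursion above.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 100))
b = rng.standard_normal(40)
lam = 0.1
L = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of grad f
grad = lambda x: A.T @ (A @ x - b)               # gradient of the smooth part f
soft = lambda z, c: np.sign(z) * np.maximum(np.abs(z) - c, 0.0)  # prox of c*||.||_1

x = y = np.zeros(100)
theta = 1.0
for _ in range(300):
    x_new = soft(y - grad(y) / L, lam / L)       # proximal gradient step from y^t
    theta_new = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2
    y = x_new + (theta - 1) / theta_new * (x_new - x)  # momentum step
    x, theta = x_new, theta_new
print(0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.abs(x).sum())  # final cost
```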

27 Momentum coefficient

θ_{t+1} = (1 + √(1 + 4θ_t²)) / 2    with θ_0 = 1

- coefficient (θ_t − 1)/θ_{t+1} = 1 − 3/t + o(1/t) (homework)
- asymptotically equivalent to t/(t+3)

28 Momentum coefficient (cont.)

θ_{t+1} = (1 + √(1 + 4θ_t²)) / 2    with θ_0 = 1

Fact 7.2. For all t ≥ 1, one has θ_t ≥ (t + 2)/2 (homework)
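
Both properties of the momentum coefficient are easy to check numerically; a short sketch (the horizon 1000 and the test index t = 100 are arbitrary):

```python
import numpy as np

theta = [1.0]                                    # theta_0 = 1
for _ in range(1000):
    theta.append((1 + np.sqrt(1 + 4 * theta[-1] ** 2)) / 2)

t = 100
print((theta[t] - 1) / theta[t + 1], t / (t + 3))            # nearly equal for large t
print(all(theta[t] >= (t + 2) / 2 for t in range(1, 1001)))  # Fact 7.2: prints True
```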

29 Convergence analysis

30 Convergence for convex problems

Theorem 7.3 (Convergence of accelerated proximal gradient methods for convex problems). Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

F(x^t) − F^opt ≤ 2L ‖x^0 − x⋆‖² / (t + 1)²

- improved iteration complexity (i.e. O(1/√ε)) over the proximal gradient method (i.e. O(1/ε))
- fast if prox can be efficiently implemented

31 Recap: fundamental inequality for the proximal method

Recall the following fundamental inequality shown in the last lecture:

Lemma 7.4. Let y⁺ = prox_{(1/L) h} ( y − (1/L) ∇f(y) ). Then

F(y⁺) − F(x) ≤ (L/2) ‖x − y‖² − (L/2) ‖x − y⁺‖²

32 Proof of Theorem 7.3: key idea

- build a discrete-time version of the Lyapunov function
- magic happens: the Lyapunov function is non-increasing when Nesterov's momentum coefficient is adopted

33 Proof of Theorem 7.3

Key lemma: monotonicity of a certain Lyapunov function.

Lemma 7.5. Let u^t = θ_{t−1} x^t − ( x⋆ + (θ_{t−1} − 1) x^{t−1} ) = θ_{t−1}(x^t − x⋆) − (θ_{t−1} − 1)(x^{t−1} − x⋆). Then

‖u^{t+1}‖² + (2/L) θ_t² ( F(x^{t+1}) − F^opt ) ≤ ‖u^t‖² + (2/L) θ_{t−1}² ( F(x^t) − F^opt )

- quite similar to the (rescaled) continuous-time Lyapunov function ‖X + (τ/2)Ẋ − X⋆‖² + (τ²/2)( f(X) − f^opt ) discussed before (think of θ_t ≈ t/2)

34 Proof of Theorem 7.3 (cont.)

With Lemma 7.5 in place, one has

(2/L) θ_{t−1}² ( F(x^t) − F^opt ) ≤ ‖u^1‖² + (2/L) θ_0² ( F(x^1) − F^opt ) = ‖x^1 − x⋆‖² + (2/L) ( F(x^1) − F^opt )

where the last step uses u^1 = x^1 − x⋆ and θ_0 = 1. To bound the RHS of this inequality, we use Lemma 7.4 (with x = x⋆) and y^0 = x^0 to get

(2/L) ( F(x^1) − F^opt ) ≤ ‖y^0 − x⋆‖² − ‖x^1 − x⋆‖² = ‖x^0 − x⋆‖² − ‖x^1 − x⋆‖²
⟹ ‖x^1 − x⋆‖² + (2/L) ( F(x^1) − F^opt ) ≤ ‖x^0 − x⋆‖²

As a result,

(2/L) θ_{t−1}² ( F(x^t) − F^opt ) ≤ ‖x^1 − x⋆‖² + (2/L) ( F(x^1) − F^opt ) ≤ ‖x^0 − x⋆‖²

⟹ F(x^t) − F^opt ≤ L ‖x^0 − x⋆‖² / (2 θ_{t−1}²) ≤ 2L ‖x^0 − x⋆‖² / (t + 1)²    (by Fact 7.2)

35 Proof of Lemma 7.5

Take x = θ_t^{-1} x⋆ + (1 − θ_t^{-1}) x^t and y = y^t in Lemma 7.4 to get

(2/L) [ F(x^{t+1}) − F( θ_t^{-1} x⋆ + (1 − θ_t^{-1}) x^t ) ]    (7.7)
≤ ‖ θ_t^{-1} x⋆ + (1 − θ_t^{-1}) x^t − y^t ‖² − ‖ θ_t^{-1} x⋆ + (1 − θ_t^{-1}) x^t − x^{t+1} ‖²
= (1/θ_t²) [ ‖ x⋆ + (θ_t − 1) x^t − θ_t y^t ‖² − ‖ x⋆ + (θ_t − 1) x^t − θ_t x^{t+1} ‖² ]    (the second argument equals −u^{t+1})
(i)= (1/θ_t²) ( ‖u^t‖² − ‖u^{t+1}‖² )    (7.8)

where (i) follows from the definition of u^t and the relation y^t = x^t + ((θ_{t−1} − 1)/θ_t) (x^t − x^{t−1})

36 Proof of Lemma 7.5 (cont.)

We will also lower bound (7.7). By convexity of F,

F( θ_t^{-1} x⋆ + (1 − θ_t^{-1}) x^t ) ≤ θ_t^{-1} F(x⋆) + (1 − θ_t^{-1}) F(x^t) = θ_t^{-1} F^opt + (1 − θ_t^{-1}) F(x^t)

⟹ F(x^{t+1}) − F( θ_t^{-1} x⋆ + (1 − θ_t^{-1}) x^t ) ≥ ( F(x^{t+1}) − F^opt ) − (1 − θ_t^{-1}) ( F(x^t) − F^opt )

Combining this with (7.8) and the identity θ_t² − θ_t = θ_{t−1}² yields

‖u^t‖² − ‖u^{t+1}‖² ≥ (2/L) [ θ_t² ( F(x^{t+1}) − F^opt ) − (θ_t² − θ_t) ( F(x^t) − F^opt ) ]
                   = (2/L) [ θ_t² ( F(x^{t+1}) − F^opt ) − θ_{t−1}² ( F(x^t) − F^opt ) ]

thus finishing the proof
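
The monotonicity asserted by Lemma 7.5 can also be verified numerically along FISTA iterates. The sketch below uses a small hypothetical least-squares problem with h ≡ 0 (so the prox is the identity and F^opt can be computed exactly) and tracks ‖u^t‖² + (2/L) θ_{t−1}² (F(x^t) − F^opt).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)
L = np.linalg.norm(A, 2) ** 2
F = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2     # h = 0, so F = f
xstar = np.linalg.lstsq(A, b, rcond=None)[0]
Fopt = F(xstar)

x = np.zeros(10); y = x.copy(); theta = 1.0            # x^0, y^0, theta_0
vals = []
for _ in range(100):
    x_new = y - A.T @ (A @ y - b) / L                  # x^t (prox = identity)
    theta_new = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2  # theta_t
    y = x_new + (theta - 1) / theta_new * (x_new - x)  # y^t
    u = theta * x_new - (xstar + (theta - 1) * x)      # u^t from Lemma 7.5
    vals.append(u @ u + (2 / L) * theta ** 2 * (F(x_new) - Fopt))
    x, theta = x_new, theta_new
print(all(v2 <= v1 + 1e-9 for v1, v2 in zip(vals, vals[1:])))  # True: non-increasing
```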

37 Convergence for the strongly convex case

x^{t+1} = prox_{η_t h} ( y^t − η_t ∇f(y^t) )
y^{t+1} = x^{t+1} + ((√κ − 1)/(√κ + 1)) (x^{t+1} − x^t)

Theorem 7.6 (Convergence of accelerated proximal gradient methods for the strongly convex case). Suppose f is µ-strongly convex and L-smooth. If η_t ≡ 1/L, then

F(x^t) − F^opt ≤ ( 1 − 1/√κ )^t ( F(x^0) − F^opt + (µ/2) ‖x^0 − x⋆‖² )
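
A sketch of this strongly convex variant on a hypothetical ridge-type objective (h ≡ 0), where the explicit ℓ2 term makes µ = 1 a valid strong convexity constant; the data and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((60, 30))
b = rng.standard_normal(60)
mu = 1.0
grad = lambda x: A.T @ (A @ x - b) + mu * x        # f(x) = 0.5||Ax-b||^2 + 0.5*mu*||x||^2
L = np.linalg.norm(A, 2) ** 2 + mu                 # smoothness constant
kappa = L / mu
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1) # constant momentum coefficient

x = y = np.zeros(30)
for _ in range(200):
    x_new = y - grad(y) / L                        # proximal step (prox = identity here)
    y = x_new + beta * (x_new - x)                 # momentum step
    x = x_new
print(np.linalg.norm(grad(x)))                     # small gradient norm: near the minimizer
```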

38 A practical issue

- Fast convergence requires knowledge of κ = L/µ; in practice, estimating µ is typically very challenging
- A common observation: ripples / bumps in the traces of cost values

39 Rippling behavior

Numerical example: run the accelerated method with

y^{t+1} = x^{t+1} + ((1 − √q̂)/(1 + √q̂)) (x^{t+1} − x^t),    where q := 1/κ = µ/L and q̂ is an estimate of q

(figure, from O'Donoghue, Candes '12: convergence traces for several estimates q̂ of q, ranging from underestimates to overestimates; mis-set momentum either slows convergence or produces periodic ripples)

- when q̂ > q: we underestimate the momentum ⟹ slower convergence
- when q̂ < q: we overestimate the momentum ⟹ overshooting / rippling behavior (the extreme case q̂ = 0 corresponds to the usual high-momentum choice)
- the period of the ripples is often proportional to √(L/µ)

40 Adaptive restart (O'Donoghue, Candes '12)

When a certain criterion is met, restart the FISTA iteration with

x^0 ← x^t,    y^0 ← x^t,    θ_0 ← 1

- take the current iterate as the new starting point
- erase all memory of previous iterates and reset the momentum back to zero

41 Numerical comparisons of adaptive restart schemes

(figure, from O'Donoghue, Candes '12: comparison of fixed and adaptive restart intervals; the traces include gradient descent, no restart, several fixed restart intervals, restart with the known value q = µ/L, the adaptive restart function scheme, and the adaptive restart gradient scheme)

- function scheme: restart when f(x^t) > f(x^{t−1})
- gradient scheme: restart when ⟨∇f(y^{t−1}), x^t − x^{t−1}⟩ > 0, i.e. when the momentum leads us in a bad direction
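
Both restart criteria are one-line tests on top of FISTA. The sketch below applies the gradient scheme to a hypothetical ill-conditioned quadratic (the function scheme would simply test f(x_new) > f(x) instead); everything about the test problem is made up.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((50, 50))
Q = M.T @ M + 0.1 * np.eye(50)                     # ill-conditioned PD quadratic
b = rng.standard_normal(50)
grad = lambda x: Q @ x - b
f = lambda x: 0.5 * x @ Q @ x - b @ x
L = np.linalg.norm(Q, 2)

x = y = np.zeros(50)
theta = 1.0
for _ in range(2000):
    g = grad(y)
    x_new = y - g / L
    if g @ (x_new - x) > 0:                        # gradient restart criterion
        theta, y, x = 1.0, x_new, x_new            # reset momentum, restart from x_new
        continue
    theta_new = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2
    y = x_new + (theta - 1) / theta_new * (x_new - x)
    x, theta = x_new, theta_new
print(f(x) - f(np.linalg.solve(Q, b)))             # optimality gap, close to 0
```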

42 Illustration

(figure: iterate trajectories in the (x_1, x_2) plane when minimizing a positive definite quadratic in two dimensions, for q̂ = q, q̂ = 0, and adaptive restart)

- with overestimated momentum (e.g. q̂ = 0), one sees a spiralling trajectory
- adaptive restart helps mitigate this issue

43 Lower bounds

44 Optimality of Nesterov's method

Interestingly, no first-order method can improve upon Nesterov's result in general.

More precisely, there exists a convex and L-smooth function f s.t.

f(x^t) − f^opt ≥ 3L ‖x^0 − x⋆‖² / ( 32 (t + 1)² )

as long as

x^k ∈ x^0 + span{ ∇f(x^0), …, ∇f(x^{k−1}) }    for all 1 ≤ k ≤ t

(this inclusion is the definition of first-order methods)    (Nemirovski, Yudin '83)

45 Example

minimize_{x ∈ R^{2n+1}}  f(x) = (L/4) ( (1/2) x^⊤ A x − e_1^⊤ x )

where A ∈ R^{(2n+1)×(2n+1)} is the tridiagonal matrix with 2 on the diagonal and −1 on the sub- and super-diagonals

- f is convex and L-smooth
- the optimizer x⋆ is given by x⋆_i = 1 − i/(2n+2) (1 ≤ i ≤ 2n+1), obeying f^opt = (L/8)( −1 + 1/(2n+2) ) and ‖x⋆‖² ≤ (2n+2)/3

46 Example (cont.)

∇f(x) = (L/4) ( A x − e_1 )

⟹ span{ ∇f(x^0), …, ∇f(x^{k−1}) } ⊆ span{ e_1, …, e_k } =: K_k    if x^0 = 0

- every iteration of a first-order method expands the search space by at most one dimension
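
The claim about the expanding search space can be checked directly. The sketch below builds the tridiagonal A for a small n, runs plain gradient descent from x^0 = 0 as a representative first-order method, and verifies that after k steps the iterate is supported on the first k coordinates only; the values of n and L are arbitrary.

```python
import numpy as np

n = 5
d = 2 * n + 1                                   # dimension 2n + 1
L = 1.0
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)   # tridiag(-1, 2, -1)
e1 = np.zeros(d); e1[0] = 1.0
grad = lambda x: (L / 4) * (A @ x - e1)         # gradient of f(x) = (L/4)(x'Ax/2 - e1'x)

x = np.zeros(d)
for k in range(1, n + 1):
    x = x - (1.0 / L) * grad(x)                 # one first-order step: stays in K_k
    assert np.allclose(x[k:], 0.0)              # x^k supported on the first k coordinates
print(x)
```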

47 Example (cont.)

If we start with x^0 = 0, then x^n ∈ K_n, and hence

f(x^n) ≥ inf_{x ∈ K_n} f(x) = (L/8) ( −1 + 1/(n+1) )

(restricted to K_n, the problem is the same construction in dimension n, which gives the value of the infimum). As a result,

( f(x^n) − f^opt ) / ‖x^0 − x⋆‖² ≥ (L/8) ( 1/(n+1) − 1/(2n+2) ) / ( (2n+2)/3 ) = 3L / ( 32 (n+1)² )

48 Summary: accelerated proximal gradient

                                      stepsize rule    convergence rate     iteration complexity
  convex & smooth problems            η_t = 1/L        O(1/t²)              O(1/√ε)
  strongly convex & smooth problems   η_t = 1/L        O((1 − 1/√κ)^t)      O(√κ log(1/ε))

49 References

[1] Some methods of speeding up the convergence of iteration methods, B. Polyak, USSR Computational Mathematics and Mathematical Physics, 1964.
[2] A method of solving a convex programming problem with convergence rate O(1/k²), Y. Nesterov, Soviet Mathematics Doklady, 1983.
[3] A fast iterative shrinkage-thresholding algorithm for linear inverse problems, A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences, 2009.
[4] First-order methods in optimization, A. Beck, Vol. 25, SIAM, 2017.
[5] Mathematical optimization, MATH 301 lecture notes, E. Candes, Stanford.
[6] Gradient methods for minimizing composite functions, Y. Nesterov, Technical Report, 2007.

50 References (cont.)

[7] Large-scale numerical optimization, MS&E 318 lecture notes, M. Saunders, Stanford.
[8] Proximal algorithms, N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[9] Convex optimization: algorithms and complexity, S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[10] Optimization methods for large-scale systems, EE236C lecture notes, L. Vandenberghe, UCLA.
[11] Problem complexity and method efficiency in optimization, A. Nemirovski and D. Yudin, Wiley, 1983.
[12] Introductory lectures on convex optimization: a basic course, Y. Nesterov, 2004.
[13] A differential equation for modeling Nesterov's accelerated gradient method, W. Su, S. Boyd, and E. Candes, NIPS, 2014.

51 References (cont.)

[14] On accelerated proximal gradient methods for convex-concave optimization, P. Tseng, 2008.
[15] Adaptive restart for accelerated gradient schemes, B. O'Donoghue and E. Candes, Foundations of Computational Mathematics, 2012.
