Proximal gradient methods


1 ELE 538B: Large-Scale Optimization for Data Science Proximal gradient methods Yuxin Chen Princeton University, Spring 2018

2 Outline Proximal gradient descent for composite functions Proximal mapping / operator Convergence analysis

3 Proximal gradient descent for composite functions

4 Composite models

minimize_x  F(x) := f(x) + h(x)   subject to x ∈ R^n

f: convex and smooth
h: convex (may not be differentiable)

let F^opt := min_x F(x) be the optimal cost

Proximal gradient methods 6-4

5 Examples

ℓ1-regularized minimization:  minimize_x f(x) + λ‖x‖₁, with h(x) = λ‖x‖₁ the ℓ1-norm term; ℓ1 regularization is used to promote sparsity

nuclear-norm-regularized minimization:  minimize_X f(X) + λ‖X‖_*, with h(X) = λ‖X‖_* the nuclear-norm term; nuclear-norm regularization is used to promote low-rank structure

6 A proximal view of gradient descent

To motivate proximal gradient methods, we first revisit gradient descent

x^{t+1} = x^t − η_t ∇f(x^t)

which can be rewritten as

x^{t+1} = argmin_x { f(x^t) + ⟨∇f(x^t), x − x^t⟩ [first-order approximation] + (1/(2η_t))‖x − x^t‖² [proximal term] }

7 A proximal view of gradient descent

x^{t+1} = argmin_x { f(x^t) + ⟨∇f(x^t), x − x^t⟩ [first-order approximation] + (1/(2η_t))‖x − x^t‖² [proximal term] }

By the optimality condition, x^{t+1} is the point where f(x^t) + ⟨∇f(x^t), x − x^t⟩ and −(1/(2η_t))‖x − x^t‖² + c have the same slope

8 How about projected gradient descent?

x^{t+1} = P_C(x^t − η_t ∇f(x^t))
        = argmin_x { f(x^t) + ⟨∇f(x^t), x − x^t⟩ + (1/(2η_t))‖x − x^t‖² + 1_C(x) }
        = argmin_x { (1/2)‖x − (x^t − η_t ∇f(x^t))‖² + η_t 1_C(x) }   (6.1)

where 1_C(x) = { 0, if x ∈ C;  ∞, else }

9 Proximal operator

Define the proximal operator

prox_h(x) := argmin_z { (1/2)‖z − x‖² + h(z) }

for any convex function h. This allows one to express the projected GD update (6.1) as

x^{t+1} = prox_{η_t 1_C}(x^t − η_t ∇f(x^t))   (6.2)
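As a numerical sanity check (not part of the original slides): for the indicator of a box, the prox is the Euclidean projection, which is entrywise clipping. The sketch below compares the closed form against a brute-force 1-D minimization.

```python
import numpy as np

def prox_box(x, lo, hi):
    # prox of the indicator of the box [lo, hi]^n is Euclidean projection,
    # which for a box is just entrywise clipping
    return np.clip(x, lo, hi)

# brute-force check: argmin_z 1/2 (z - x)^2 over z in C = [-1, 1]
x = 1.7
grid = np.linspace(-1.0, 1.0, 100001)
z_star = grid[np.argmin(0.5 * (grid - x) ** 2)]
assert abs(z_star - prox_box(np.array([x]), -1.0, 1.0)[0]) < 1e-3
```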

10 Proximal gradient methods

One can generalize (6.2) to accommodate more general h

Algorithm 6.1 Proximal gradient algorithm
1: for t = 0, 1, ... do
2:   x^{t+1} = prox_{η_t h}(x^t − η_t ∇f(x^t))

alternates between gradient updates on f and proximal minimization on h; useful if prox_h is inexpensive
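Algorithm 6.1 translates almost line for line into code. A minimal generic sketch (not from the slides; the function names and the prox calling convention are illustrative choices):

```python
import numpy as np

def proximal_gradient(grad_f, prox_h, x0, eta, num_iters=500):
    """Algorithm 6.1: x^{t+1} = prox_{eta*h}(x^t - eta * grad_f(x^t)).

    prox_h(v, s) should return prox_{s*h}(v); here eta is a constant stepsize."""
    x = x0.copy()
    for _ in range(num_iters):
        x = prox_h(x - eta * grad_f(x), eta)
    return x

# sanity check with h = 0 (prox is the identity): the method reduces to plain
# gradient descent on f(x) = 1/2 ||x - c||^2, whose minimizer is c
c = np.array([1.0, -2.0])
x_hat = proximal_gradient(lambda x: x - c, lambda v, s: v, np.zeros(2), eta=0.5)
assert np.allclose(x_hat, c, atol=1e-6)
```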

11 Proximal mapping / operator

12 Why consider proximal operators?

prox_h(x) := argmin_z { (1/2)‖z − x‖² + h(z) }

well-defined under very general conditions (including nonsmooth convex functions); can be evaluated efficiently for many widely used functions (in particular, regularizers); this abstraction is conceptually and mathematically simple, and covers many well-known optimization algorithms

13 Example: indicator functions

If h = 1_C is the indicator function of a closed convex set C,

h(x) = { 0, if x ∈ C;  ∞, else }

then

prox_h(x) = argmin_{z ∈ C} (1/2)‖z − x‖² = P_C(x)   (Euclidean projection)

14 Example: ℓ1 norm

If h(x) = ‖x‖₁, then

(prox_{λh}(x))_i = ψ_st(x_i; λ)   (soft-thresholding)

where ψ_st(x; λ) = { x − λ, if x > λ;  x + λ, if x < −λ;  0, else } is applied in an entrywise manner
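The soft-thresholding formula can be verified against a brute-force minimization of the prox objective on a fine 1-D grid. A sketch (not part of the slides):

```python
import numpy as np

def soft_threshold(x, lam):
    # psi_st applied entrywise: shrink toward 0 by lam, zero out |x| <= lam
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# brute-force check of prox_{lam*|.|} in 1-D: minimize 1/2 (z-x)^2 + lam*|z|
lam = 0.8
for x in (-2.0, -0.5, 0.3, 1.5):
    grid = np.linspace(-4, 4, 800001)
    z_star = grid[np.argmin(0.5 * (grid - x) ** 2 + lam * np.abs(grid))]
    assert abs(z_star - soft_threshold(np.array([x]), lam)[0]) < 1e-4
```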

15 Basic rules

If f(x) = a g(x) + b with a > 0, then prox_f(x) = prox_{ag}(x)

affine addition: if f(x) = g(x) + ⟨a, x⟩ + b, then prox_f(x) = prox_g(x − a)

16 Basic rules

quadratic addition: if f(x) = g(x) + (ρ/2)‖x − a‖², then

prox_f(x) = prox_{(1/(1+ρ))g}( (1/(1+ρ)) x + (ρ/(1+ρ)) a )

scaling and translation: if f(x) = g(ax + b) with a ≠ 0, then

prox_f(x) = (1/a) ( prox_{a²g}(ax + b) − b )   (homework)

17 Proof for quadratic addition

prox_f(x) = argmin_z { (1/2)‖z − x‖² + g(z) + (ρ/2)‖z − a‖² }
          = argmin_z { ((1+ρ)/2)‖z‖² − ⟨z, x + ρa⟩ + g(z) }
          = argmin_z { (1/2)‖z‖² − (1/(1+ρ)) ⟨z, x + ρa⟩ + (1/(1+ρ)) g(z) }
          = argmin_z { (1/2)‖ z − ( (1/(1+ρ)) x + (ρ/(1+ρ)) a ) ‖² + (1/(1+ρ)) g(z) }
          = prox_{(1/(1+ρ))g}( (1/(1+ρ)) x + (ρ/(1+ρ)) a )
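The quadratic-addition rule can be spot-checked numerically. Here g = λ|·| (so prox_g is soft-thresholding), and the rule's closed form is compared against a brute-force grid minimization; a sketch with illustrative constants, not part of the slides:

```python
import numpy as np

def soft_threshold(x, lam):
    # scalar soft-thresholding, prox of lam*|.|
    return np.sign(x) * max(abs(x) - lam, 0.0)

lam, rho, a, x = 0.7, 2.0, 0.4, 1.9
# left side: brute-force prox of f(z) = lam*|z| + (rho/2)(z - a)^2
grid = np.linspace(-4, 4, 800001)
vals = 0.5 * (grid - x) ** 2 + lam * np.abs(grid) + 0.5 * rho * (grid - a) ** 2
lhs = grid[np.argmin(vals)]
# right side: the quadratic-addition rule with g = lam*|.|
rhs = soft_threshold((x + rho * a) / (1 + rho), lam / (1 + rho))
assert abs(lhs - rhs) < 1e-4
```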

18 Basic rules

orthogonal mapping: if f(x) = g(Qx) with Q orthogonal (QQ^⊤ = Q^⊤Q = I), then prox_f(x) = Q^⊤ prox_g(Qx)   (homework)

orthogonal affine mapping: if f(x) = g(Qx + b) with QQ^⊤ = α⁻¹ I (does not require Q^⊤Q = α⁻¹ I), then

prox_f(x) = (I − αQ^⊤Q) x + αQ^⊤ ( prox_{α⁻¹ g}(Qx + b) − b )

for general Q, it is not easy to derive prox_f from prox_g

19 Basic rules

norm composition: if f(x) = g(‖x‖₂) with domain(g) = [0, ∞), then

prox_f(x) = prox_g(‖x‖₂) · x/‖x‖₂,   x ≠ 0

20 Proof for norm composition

Observe that

min_z { f(z) + (1/2)‖z − x‖² }
  = min_z { g(‖z‖₂) + (1/2)‖z‖₂² − ⟨z, x⟩ + (1/2)‖x‖₂² }
  = min_{α ≥ 0} min_{z: ‖z‖₂ = α} { g(α) + (1/2)α² − ⟨z, x⟩ + (1/2)‖x‖₂² }
  = min_{α ≥ 0} { g(α) + (1/2)α² − α‖x‖₂ + (1/2)‖x‖₂² }   (Cauchy–Schwarz)
  = min_{α ≥ 0} { g(α) + (1/2)(α − ‖x‖₂)² }

From the above calculation, we know the optimal point is α* = prox_g(‖x‖₂) and z* = α* x/‖x‖₂ = prox_g(‖x‖₂) x/‖x‖₂, thus concluding the proof
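Following the proof's reduction to a 1-D problem along the ray x/‖x‖₂, the norm-composition rule can be checked numerically with g(α) = λα, which makes f(x) = λ‖x‖₂ and gives the group soft-thresholding formula. A sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5
x = rng.standard_normal(3)

# closed form implied by the rule with g(alpha) = lam*alpha on [0, inf):
# prox_g(t) = max(t - lam, 0), hence prox_{lam*||.||_2}(x) = max(||x||-lam, 0) x/||x||
nrm = np.linalg.norm(x)
z_rule = max(nrm - lam, 0.0) * x / nrm

# per the proof, the minimizer lies on the ray {alpha * x/||x|| : alpha >= 0};
# brute-force the 1-D problem min_{alpha >= 0} g(alpha) + (alpha - ||x||)^2 / 2
alphas = np.linspace(0.0, 2 * nrm, 400001)
a_star = alphas[np.argmin(lam * alphas + 0.5 * (alphas - nrm) ** 2)]
assert np.allclose(z_rule, a_star * x / nrm, atol=1e-4)
```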

21 Nonexpansiveness of proximal operators

Recall that when h(x) = 1_C(x), prox_h(x) is the Euclidean projection P_C onto C, which is nonexpansive for convex C:

‖P_C(x₁) − P_C(x₂)‖ ≤ ‖x₁ − x₂‖

22 Nonexpansiveness of proximal operators

Nonexpansiveness is a property for general prox_h(·):

Fact 6.1 (Nonexpansiveness)

‖prox_h(x₁) − prox_h(x₂)‖ ≤ ‖x₁ − x₂‖

In some sense, the proximal operator behaves like projection

23 Proof of Fact 6.1

Let z₁ = prox_h(x₁) and z₂ = prox_h(x₂). The subgradient characterizations of z₁ and z₂ read

x₁ − z₁ ∈ ∂h(z₁)   and   x₂ − z₂ ∈ ∂h(z₂)

The nonexpansiveness claim ‖z₁ − z₂‖ ≤ ‖x₁ − x₂‖ would follow (together with Cauchy–Schwarz) if

(x₁ − x₂)^⊤(z₁ − z₂) ≥ ‖z₁ − z₂‖²   (firm nonexpansiveness)
⟺ (x₁ − z₁ − x₂ + z₂)^⊤(z₁ − z₂) ≥ 0

To see this, add these inequalities:

h(z₂) ≥ h(z₁) + ⟨x₁ − z₁, z₂ − z₁⟩
h(z₁) ≥ h(z₂) + ⟨x₂ − z₂, z₁ − z₂⟩

24 Resolvent of the subdifferential operator

One can interpret prox via the resolvent of the subdifferential operator

Fact 6.2

z = prox_f(x)  ⟺  z = (I + ∂f)⁻¹(x)

where (I + ∂f)⁻¹ is the resolvent of the operator ∂f and I is the identity mapping

25 Justification of Fact 6.2

z = argmin_u { f(u) + (1/2)‖u − x‖² }
⟺ 0 ∈ ∂f(z) + z − x   (optimality condition)
⟺ x ∈ (I + ∂f)(z)
⟺ z = (I + ∂f)⁻¹(x)

26 Moreau decomposition

Fact 6.3

Suppose f is closed and convex, and f*(x) := sup_z { ⟨x, z⟩ − f(z) } is the convex conjugate of f. Then

x = prox_f(x) + prox_{f*}(x)

key relationship between proximal mapping and duality; a generalization of orthogonal decomposition

27 Moreau decomposition for convex cones

When K is a closed convex cone, (1_K)*(x) = 1_{K°}(x) (exercise), with

K° := { x : ⟨x, z⟩ ≤ 0 for all z ∈ K }

the polar cone of K. This gives

x = P_K(x) + P_{K°}(x)

a special case: if K is a subspace, then K° = K^⊥, and hence x = P_K(x) + P_{K^⊥}(x)
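The cone decomposition can be checked numerically for the nonnegative orthant, whose polar cone is the nonpositive orthant; the two projections are the entrywise positive and negative parts. A sketch (not part of the slides):

```python
import numpy as np

# For K = R^n_+ the polar cone is K_polar = R^n_-, so the Moreau cone
# decomposition x = P_K(x) + P_{K_polar}(x) splits x into its signed parts.
x = np.array([1.5, -0.3, 0.0, -2.2, 0.7])
p_k = np.maximum(x, 0.0)       # projection onto the nonnegative orthant
p_polar = np.minimum(x, 0.0)   # projection onto the nonpositive orthant
assert np.allclose(x, p_k + p_polar)
assert np.dot(p_k, p_polar) == 0.0   # the two pieces are orthogonal
```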

28 Proof of Fact 6.3

Let u = prox_f(x); then from the optimality condition we know that x − u ∈ ∂f(u). This together with the conjugate subgradient theorem (homework) yields

u ∈ ∂f*(x − u)

In view of the optimality condition, this means x − u = prox_{f*}(x), and hence

x = u + (x − u) = prox_f(x) + prox_{f*}(x)

29 Example: prox of support function

For any closed and convex set C, the support function S_C is defined as S_C(x) = sup_{z ∈ C} ⟨x, z⟩. Then

prox_{S_C}(x) = x − P_C(x)   (6.3)

Proof: First of all, it is easy to verify that (exercise)

S_C*(x) = 1_C(x)

Then the Moreau decomposition gives

prox_{S_C}(x) = x − prox_{S_C*}(x) = x − prox_{1_C}(x) = x − P_C(x)

30 Example: ℓ∞ norm

prox_{‖·‖∞}(x) = x − P_{B₁}(x)

where B₁ := { z : ‖z‖₁ ≤ 1 } is the unit ℓ1 ball

Remark: projection onto the ℓ1 ball can be computed efficiently

Proof: Since ‖x‖∞ = sup_{z: ‖z‖₁ ≤ 1} ⟨x, z⟩ = S_{B₁}(x), we can invoke (6.3) to arrive at prox_{‖·‖∞}(x) = prox_{S_{B₁}}(x) = x − P_{B₁}(x)

31 Example: max function

Let g(x) = max{x₁, …, x_n}; then

prox_g(x) = x − P_Δ(x)

where Δ := { z ∈ R^n₊ : 1^⊤z = 1 } is the probability simplex

Remark: projection onto Δ can be computed efficiently

Proof: Since g(x) = max{x₁, …, x_n} = S_Δ(x) (the support function of Δ), we can invoke (6.3) to reach prox_g(x) = x − P_Δ(x)
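A concrete way to exercise this identity: the simplex projection itself has a well-known sort-based O(n log n) algorithm, which is one way to make P_Δ "efficient". The sketch below (not from the slides; the algorithm is the standard sort-and-threshold method) implements it and the resulting prox of the max function:

```python
import numpy as np

def project_simplex(x):
    # Euclidean projection onto the probability simplex (sort-based method):
    # sort descending, find the threshold tau, then clip x - tau at zero
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(x) + 1)
    rho = idx[u - (css - 1.0) / idx > 0][-1]
    tau = (css[rho - 1] - 1.0) / rho
    return np.maximum(x - tau, 0.0)

def prox_max(x):
    # prox of g(x) = max_i x_i via the support-function identity (6.3)
    return x - project_simplex(x)

x = np.array([0.8, 0.1, -0.5])
z = project_simplex(x)
assert np.all(z >= 0) and abs(z.sum() - 1.0) < 1e-12
# a point already on the simplex projects to itself
assert np.allclose(project_simplex(np.array([0.2, 0.3, 0.5])), [0.2, 0.3, 0.5])
assert np.allclose(prox_max(x) + project_simplex(x), x)
```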

32 Extended Moreau decomposition

A useful extension (homework):

Fact 6.4

Suppose f is closed and convex, and λ > 0. Then

x = prox_{λf}(x) + λ prox_{λ⁻¹f*}(x/λ)
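A quick numerical check of the extended decomposition (not part of the slides), using the self-conjugate function f(x) = ½‖x‖², for which f* = f and prox_{sf}(v) = v/(1+s):

```python
import numpy as np

# f(x) = 1/2 ||x||^2 is self-conjugate: f* = f, and prox_{s*f}(v) = v / (1 + s)
rng = np.random.default_rng(1)
x = rng.standard_normal(4)
for lam in (0.5, 1.0, 3.0):
    term1 = x / (1.0 + lam)                        # prox_{lam*f}(x)
    term2 = lam * ((x / lam) / (1.0 + 1.0 / lam))  # lam * prox_{(1/lam)*f*}(x/lam)
    assert np.allclose(term1 + term2, x)
```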

33 Convergence analysis

34 Cost monotonicity

The objective value is non-increasing in t:

Lemma 6.5

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

F(x^{t+1}) ≤ F(x^t)

different from subgradient methods (for which the objective value might be non-monotonic in t); the constant stepsize rule is recommended when f is convex and smooth

35 Proof of cost monotonicity

Main pillar: a fundamental inequality

Lemma 6.6

Let y⁺ = prox_{(1/L)h}( y − (1/L)∇f(y) ). Then

F(y⁺) ≤ F(x) + (L/2)‖x − y‖² − (L/2)‖x − y⁺‖² − g(x, y)

where g(x, y) := f(x) − f(y) − ⟨∇f(y), x − y⟩ ≥ 0 by convexity

Take x = y = x^t (and hence y⁺ = x^{t+1}) to complete the proof

36 Monotonicity in estimation error

Proximal gradient iterates are not only monotonic w.r.t. the cost, but also monotonic in estimation error

Lemma 6.7

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

‖x^{t+1} − x*‖ ≤ ‖x^t − x*‖

Proof: from Lemma 6.6, taking x = x*, y = x^t (and hence y⁺ = x^{t+1}) yields

F(x^{t+1}) − F(x*) [≥ 0] + g(x*, x^t) [≥ 0] ≤ (L/2)‖x* − x^t‖² − (L/2)‖x* − x^{t+1}‖²

which immediately concludes the proof

37 Proof of Lemma 6.6

Define

φ(z) := f(y) + ⟨∇f(y), z − y⟩ + (L/2)‖z − y‖² + h(z)

It is easily seen that y⁺ = argmin_z φ(z). Two important properties:

1. Since φ(z) is L-strongly convex, one has

φ(x) ≥ φ(y⁺) + (L/2)‖x − y⁺‖²

Remark: we are propagating smoothness of f to strong convexity of another function φ

2. From smoothness,

φ(y⁺) = f(y) + ⟨∇f(y), y⁺ − y⟩ + (L/2)‖y⁺ − y‖² [an upper bound on f(y⁺)] + h(y⁺)
      ≥ f(y⁺) + h(y⁺) = F(y⁺)

38 Proof of Lemma 6.6 (cont.)

Taken collectively, these yield φ(x) ≥ F(y⁺) + (L/2)‖x − y⁺‖², which together with the definition of φ(x) gives

f(y) + ⟨∇f(y), x − y⟩ + h(x) [= f(x) + h(x) − g(x, y) = F(x) − g(x, y)] + (L/2)‖x − y‖² ≥ F(y⁺) + (L/2)‖x − y⁺‖²

which finishes the proof

39 Convergence for convex problems

Theorem 6.8 (Convergence of proximal gradient methods for convex problems)

Suppose f is convex and L-smooth. If η_t ≡ 1/L, then

F(x^t) − F^opt ≤ L‖x⁰ − x*‖² / (2t)

achieves better iteration complexity (i.e. O(1/ε)) than the subgradient method (i.e. O(1/ε²)); fast if prox can be efficiently implemented

40 Proof of Theorem 6.8

With Lemma 6.6 in mind, set x = x*, y = x^t to obtain

F(x^{t+1}) − F(x*) ≤ (L/2)‖x^t − x*‖² − (L/2)‖x^{t+1} − x*‖² − g(x*, x^t) [≥ 0 by convexity]
                  ≤ (L/2)‖x^t − x*‖² − (L/2)‖x^{t+1} − x*‖²

Apply this recursively and add up all the inequalities to get

Σ_{k=0}^{t−1} ( F(x^{k+1}) − F(x*) ) ≤ (L/2)‖x⁰ − x*‖² − (L/2)‖x^t − x*‖² ≤ (L/2)‖x⁰ − x*‖²

This combined with the monotonicity of F(x^t) (cf. Lemma 6.5) yields

F(x^t) − F(x*) ≤ L‖x⁰ − x*‖² / (2t)

41 Convergence for strongly convex problems

Theorem 6.9 (Convergence of proximal gradient methods for strongly convex problems)

Suppose f is µ-strongly convex and L-smooth. If η_t ≡ 1/L, then

‖x^t − x*‖² ≤ (1 − µ/L)^t ‖x⁰ − x*‖²

linear convergence: attains ε accuracy within O(log(1/ε)) iterations

42 Proof of Theorem 6.9

Taking x = x*, y = x^t (and hence y⁺ = x^{t+1}) in Lemma 6.6 gives

F(x^{t+1}) − F(x*) ≤ (L/2)‖x* − x^t‖² − (L/2)‖x* − x^{t+1}‖² − g(x*, x^t) [≥ (µ/2)‖x* − x^t‖² by strong convexity]
                  ≤ ((L − µ)/2)‖x^t − x*‖² − (L/2)‖x^{t+1} − x*‖²

This taken collectively with F(x^{t+1}) − F(x*) ≥ 0 yields

‖x^{t+1} − x*‖² ≤ (1 − µ/L)‖x^t − x*‖²

Applying this recursively concludes the proof

43 Numerical example: LASSO (taken from UCLA EE236C)

minimize_x f(x) = (1/2)‖Ax − b‖² + ‖x‖₁

with i.i.d. Gaussian A, η_t = 1/L, L = λ_max(A^⊤A)
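A self-contained run in the spirit of this example can be written in a few lines; the dimensions and regularization weight below are illustrative choices, not the ones from the slide. The run also checks cost monotonicity (Lemma 6.5) along the way:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)
lam = 0.5
L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of grad f
eta = 1.0 / L

def F(x):
    # composite objective: smooth least squares plus l1 regularizer
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

x = np.zeros(10)
objs = [F(x)]
for _ in range(300):
    grad = A.T @ (A @ x - b)
    v = x - eta * grad
    x = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)  # prox step
    objs.append(F(x))

# cost monotonicity (Lemma 6.5): the objective never increases
assert all(objs[t + 1] <= objs[t] + 1e-12 for t in range(len(objs) - 1))
```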

44 Numerical example: LASSO (cont.)

[Figure: relative error (f(x^(k)) − f^opt)/f^opt vs. iteration count k on a semilog scale, for the ℓ1-norm regularized least-squares problem minimize (1/2)‖Ax − b‖² + ‖x‖₁ with randomly generated A; step size t_k = 1/L with L = λ_max(A^⊤A)]

45 Backtracking line search

Recall that for the unconstrained case, backtracking line search is based on the sufficient decrease criterion

f( x^t − η∇f(x^t) ) ≤ f(x^t) − (η/2)‖∇f(x^t)‖²

46 Backtracking line search

Recall that for the unconstrained case, backtracking line search is based on the sufficient decrease criterion

f( x^t − η∇f(x^t) ) ≤ f(x^t) − (η/2)‖∇f(x^t)‖²

As a result, this is equivalent to updating η_t = 1/L_t until

f( x^t − (1/L_t)∇f(x^t) ) ≤ f(x^t) − (1/L_t)⟨∇f(x^t), ∇f(x^t)⟩ + (1/(2L_t))‖∇f(x^t)‖²
                          = f(x^t) − ⟨∇f(x^t), x^t − x^{t+1}⟩ + (L_t/2)‖x^t − x^{t+1}‖²

47 Backtracking line search

Let T_L(x) := prox_{(1/L)h}( x − (1/L)∇f(x) ):

Algorithm 6.2 Backtracking line search for proximal gradient methods
1: Initialize η = 1, 0 < α ≤ 1/2, 0 < β < 1
2: while f( T_{L_t}(x^t) ) > f(x^t) − ⟨∇f(x^t), x^t − T_{L_t}(x^t)⟩ + (L_t/2)‖T_{L_t}(x^t) − x^t‖² do
3:   L_t ← β⁻¹ L_t (i.e., increase L_t by a factor 1/β)

here, L_t corresponds to 1/η_t, and T_{L_t}(x^t) generalizes x^{t+1}
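Algorithm 6.2 can be sketched in code. The version below specializes to h = λ‖·‖₁ (so T_L is a soft-thresholded gradient step); the function names and the tiny test instance are illustrative, not from the slides:

```python
import numpy as np

def soft_threshold(v, s):
    return np.sign(v) * np.maximum(np.abs(v) - s, 0.0)

def prox_grad_backtracking(grad_f, f, lam, x0, beta=0.5, num_iters=100):
    """Proximal gradient with the backtracking rule of Algorithm 6.2,
    specialized to h = lam*||.||_1 (so T_L is a soft-thresholded step)."""
    x, Lt = x0.copy(), 1.0
    for _ in range(num_iters):
        g = grad_f(x)
        while True:
            z = soft_threshold(x - g / Lt, lam / Lt)      # T_{L_t}(x)
            d = z - x
            # sufficient decrease test from line 2 of Algorithm 6.2
            if f(z) <= f(x) + g @ d + 0.5 * Lt * (d @ d):
                break
            Lt /= beta                                     # L_t <- L_t / beta
        x = z
    return x

# tiny LASSO instance
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 0.5, 1.0])
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
x_hat = prox_grad_backtracking(grad_f, f, lam=0.1, x0=np.zeros(2))
# optimality check: -grad_f(x*) must lie in lam * subdifferential of ||.||_1,
# so every entry of the gradient is bounded by lam in absolute value
assert np.all(np.abs(grad_f(x_hat)) <= 0.1 + 1e-6)
```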

48 Summary: proximal gradient methods

problem                                | stepsize rule | convergence rate  | iteration complexity
convex & smooth (w.r.t. f)             | η_t = 1/L     | O(1/t)            | O(1/ε)
strongly convex & smooth (w.r.t. f)    | η_t = 1/L     | O((1 − 1/κ)^t)    | O(κ log(1/ε))

49 Reference

[1] "Proximal algorithms," N. Parikh and S. Boyd, Foundations and Trends in Optimization, 2013.
[2] "First-order methods in optimization," A. Beck, Vol. 25, SIAM, 2017.
[3] "Convex optimization and algorithms," D. Bertsekas, 2015.
[4] "Convex optimization: algorithms and complexity," S. Bubeck, Foundations and Trends in Machine Learning, 2015.
[5] "Mathematical optimization," MATH301 lecture notes, E. Candes, Stanford.
[6] "Optimization methods for large-scale systems," EE236C lecture notes, L. Vandenberghe, UCLA.


More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

ORIE 6326: Convex Optimization. Quasi-Newton Methods

ORIE 6326: Convex Optimization. Quasi-Newton Methods ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted

More information

Lecture 23: November 19

Lecture 23: November 19 10-725/36-725: Conve Optimization Fall 2018 Lecturer: Ryan Tibshirani Lecture 23: November 19 Scribes: Charvi Rastogi, George Stoica, Shuo Li Charvi Rastogi: 23.1-23.4.2, George Stoica: 23.4.3-23.8, Shuo

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

9. Dual decomposition and dual algorithms

9. Dual decomposition and dual algorithms EE 546, Univ of Washington, Spring 2016 9. Dual decomposition and dual algorithms dual gradient ascent example: network rate control dual decomposition and the proximal gradient method examples with simple

More information

Lecture 7: September 17

Lecture 7: September 17 10-725: Optimization Fall 2013 Lecture 7: September 17 Lecturer: Ryan Tibshirani Scribes: Serim Park,Yiming Gu 7.1 Recap. The drawbacks of Gradient Methods are: (1) requires f is differentiable; (2) relatively

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Smoothing Proximal Gradient Method. General Structured Sparse Regression

Smoothing Proximal Gradient Method. General Structured Sparse Regression for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6)

EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) EE 367 / CS 448I Computational Imaging and Display Notes: Image Deconvolution (lecture 6) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement to the material discussed in

More information

Nesterov s Optimal Gradient Methods

Nesterov s Optimal Gradient Methods Yurii Nesterov http://www.core.ucl.ac.be/~nesterov Nesterov s Optimal Gradient Methods Xinhua Zhang Australian National University NICTA 1 Outline The problem from machine learning perspective Preliminaries

More information

Exponentiated Gradient Descent

Exponentiated Gradient Descent CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,

More information

Math 273a: Optimization Convex Conjugacy

Math 273a: Optimization Convex Conjugacy Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

MATH 680 Fall November 27, Homework 3

MATH 680 Fall November 27, Homework 3 MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and

More information

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables

Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Department of Systems Engineering and Engineering Management The Chinese University of Hong Kong 2014 Workshop

More information

10. Unconstrained minimization

10. Unconstrained minimization Convex Optimization Boyd & Vandenberghe 10. Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions implementation

More information

Gradient Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization /36-725

Gradient Descent. Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh. Convex Optimization /36-725 Gradient Descent Lecturer: Pradeep Ravikumar Co-instructor: Aarti Singh Convex Optimization 10-725/36-725 Based on slides from Vandenberghe, Tibshirani Gradient Descent Consider unconstrained, smooth convex

More information

Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰北京大学

Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰北京大学 Linearized Alternating Direction Method: Two Blocks and Multiple Blocks Zhouchen Lin 林宙辰北京大学 Dec. 3, 014 Outline Alternating Direction Method (ADM) Linearized Alternating Direction Method (LADM) Two Blocks

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

Dual Decomposition.

Dual Decomposition. 1/34 Dual Decomposition http://bicmr.pku.edu.cn/~wenzw/opt-2017-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/34 1 Conjugate function 2 introduction:

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Stochastic Optimization: First order method

Stochastic Optimization: First order method Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO

More information

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent 10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 5: Gradient Descent Scribes: Loc Do,2,3 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for

More information

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018

Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018 Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected

More information

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization

A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization A Multilevel Proximal Algorithm for Large Scale Composite Convex Optimization Panos Parpas Department of Computing Imperial College London www.doc.ic.ac.uk/ pp500 p.parpas@imperial.ac.uk jointly with D.V.

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Unconstrained minimization

Unconstrained minimization CSCI5254: Convex Optimization & Its Applications Unconstrained minimization terminology and assumptions gradient descent method steepest descent method Newton s method self-concordant functions 1 Unconstrained

More information

Lecture 1: Background on Convex Analysis

Lecture 1: Background on Convex Analysis Lecture 1: Background on Convex Analysis John Duchi PCMI 2016 Outline I Convex sets 1.1 Definitions and examples 2.2 Basic properties 3.3 Projections onto convex sets 4.4 Separating and supporting hyperplanes

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization

Convex Optimization. Ofer Meshi. Lecture 6: Lower Bounds Constrained Optimization Convex Optimization Ofer Meshi Lecture 6: Lower Bounds Constrained Optimization Lower Bounds Some upper bounds: #iter μ 2 M #iter 2 M #iter L L μ 2 Oracle/ops GD κ log 1/ε M x # ε L # x # L # ε # με f

More information

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016

Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall Nov 2 Dec 2016 Design and Analysis of Algorithms Lecture Notes on Convex Optimization CS 6820, Fall 206 2 Nov 2 Dec 206 Let D be a convex subset of R n. A function f : D R is convex if it satisfies f(tx + ( t)y) tf(x)

More information

Lecture 6: September 17

Lecture 6: September 17 10-725/36-725: Convex Optimization Fall 2015 Lecturer: Ryan Tibshirani Lecture 6: September 17 Scribes: Scribes: Wenjun Wang, Satwik Kottur, Zhiding Yu Note: LaTeX template courtesy of UC Berkeley EECS

More information

Douglas-Rachford Splitting: Complexity Estimates and Accelerated Variants

Douglas-Rachford Splitting: Complexity Estimates and Accelerated Variants 53rd IEEE Conference on Decision and Control December 5-7, 204. Los Angeles, California, USA Douglas-Rachford Splitting: Complexity Estimates and Accelerated Variants Panagiotis Patrinos and Lorenzo Stella

More information

ELE 538B: Large-Scale Optimization for Data Science. Quasi-Newton methods. Yuxin Chen Princeton University, Spring 2018

ELE 538B: Large-Scale Optimization for Data Science. Quasi-Newton methods. Yuxin Chen Princeton University, Spring 2018 ELE 538B: Large-Scale Opimizaion for Daa Science Quasi-Newon mehods Yuxin Chen Princeon Universiy, Spring 208 00 op ff(x (x)(k)) f p 2 L µ f 05 k f (xk ) k f (xk ) =) f op ieraions converges in only 5

More information

4th Preparation Sheet - Solutions

4th Preparation Sheet - Solutions Prof. Dr. Rainer Dahlhaus Probability Theory Summer term 017 4th Preparation Sheet - Solutions Remark: Throughout the exercise sheet we use the two equivalent definitions of separability of a metric space

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

A projection algorithm for strictly monotone linear complementarity problems.

A projection algorithm for strictly monotone linear complementarity problems. A projection algorithm for strictly monotone linear complementarity problems. Erik Zawadzki Department of Computer Science epz@cs.cmu.edu Geoffrey J. Gordon Machine Learning Department ggordon@cs.cmu.edu

More information

Functional Analysis Exercise Class

Functional Analysis Exercise Class Functional Analysis Exercise Class Week 9 November 13 November Deadline to hand in the homeworks: your exercise class on week 16 November 20 November Exercises (1) Show that if T B(X, Y ) and S B(Y, Z)

More information

Optimization for Learning and Big Data

Optimization for Learning and Big Data Optimization for Learning and Big Data Donald Goldfarb Department of IEOR Columbia University Department of Mathematics Distinguished Lecture Series May 17-19, 2016. Lecture 1. First-Order Methods for

More information

WHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ????

WHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ???? DUALITY WHY DUALITY? No constraints f(x) Non-differentiable f(x) Gradient descent Newton s method Quasi-newton Conjugate gradients etc???? Constrained problems? f(x) subject to g(x) apple 0???? h(x) =0

More information

Lecture 5: September 12

Lecture 5: September 12 10-725/36-725: Convex Optimization Fall 2015 Lecture 5: September 12 Lecturer: Lecturer: Ryan Tibshirani Scribes: Scribes: Barun Patra and Tyler Vuong Note: LaTeX template courtesy of UC Berkeley EECS

More information

Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem

Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Charles Byrne (Charles Byrne@uml.edu) http://faculty.uml.edu/cbyrne/cbyrne.html Department of Mathematical Sciences

More information

2 Regularized Image Reconstruction for Compressive Imaging and Beyond

2 Regularized Image Reconstruction for Compressive Imaging and Beyond EE 367 / CS 448I Computational Imaging and Display Notes: Compressive Imaging and Regularized Image Reconstruction (lecture ) Gordon Wetzstein gordon.wetzstein@stanford.edu This document serves as a supplement

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Unconstrained minimization: assumptions

Unconstrained minimization: assumptions Unconstrained minimization I terminology and assumptions I gradient descent method I steepest descent method I Newton s method I self-concordant functions I implementation IOE 611: Nonlinear Programming,

More information

Lecture 2: Convex Sets and Functions

Lecture 2: Convex Sets and Functions Lecture 2: Convex Sets and Functions Hyang-Won Lee Dept. of Internet & Multimedia Eng. Konkuk University Lecture 2 Network Optimization, Fall 2015 1 / 22 Optimization Problems Optimization problems are

More information