Lecture 25: Subgradient Method and Bundle Methods April 24


IE 521: Convex Optimization, Spring 2017, UIUC
Instructor: Niao He        Scribe: Shuanglong Wang

Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA (swang157@illinois.edu) if you find any typos or mistakes.

In this lecture, we cover the following topics:
- Subgradient Method
- Bundle Methods
  - Kelley cutting plane method
  - Level method

Reference: Nesterov 2004, Chapters 3.2.3, 3.3.

25.1 Subgradient Method

Recall that the subgradient method works as follows:

    x_{t+1} = \Pi_X(x_t - \gamma_t g_t), \quad t = 1, 2, \ldots

where g_t \in \partial f(x_t), \gamma_t > 0, and \Pi_X(x) = \arg\min_{y \in X} \|y - x\|^2 is the projection operator. Note that the projection onto X is easy to compute when X is simple, e.g. a ball, box, simplex, or polyhedron.

Lemma 25.1 (Projection) For all x \in \mathbb{R}^n and z \in X,

    \|x - z\|^2 \geq \|x - \Pi_X(x)\|^2 + \|z - \Pi_X(x)\|^2.

Proof: When x \in X, the inequality immediately holds true. Let x \notin X. By definition,

    \Pi_X(x) = \arg\min_{z \in X} \|z - x\|^2.

By the optimality condition, this implies

    (\Pi_X(x) - x)^\top (z - \Pi_X(x)) \geq 0, \quad \forall z \in X.

Hence,

    \|x - z\|^2 = \|x - \Pi_X(x)\|^2 + \|\Pi_X(x) - z\|^2 + 2(x - \Pi_X(x))^\top(\Pi_X(x) - z)
               \geq \|x - \Pi_X(x)\|^2 + \|\Pi_X(x) - z\|^2.
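As a concrete illustration, here is a minimal sketch of the projected subgradient method in one dimension, where X is an interval so the projection is just clipping. The instance f(x) = |x - 3| over X = [-1, 1] is our own toy example, not from the lecture; its minimizer is x* = 1 with f* = 2.

```python
import math

def project_interval(y, lo, hi):
    # Pi_X for X = [lo, hi]: the Euclidean projection is simple clipping.
    return min(max(y, lo), hi)

def subgradient_method(f, subgrad, x0, stepsize, lo, hi, T):
    # x_{t+1} = Pi_X(x_t - gamma_t * g_t), tracking the best value seen.
    x, best = x0, f(x0)
    for t in range(1, T + 1):
        g = subgrad(x)                    # any g_t in the subdifferential
        x = project_interval(x - stepsize(t) * g, lo, hi)
        best = min(best, f(x))
    return x, best

# Toy instance: f(x) = |x - 3| over X = [-1, 1]; minimizer x* = 1, f* = 2.
f = lambda x: abs(x - 3.0)
subgrad = lambda x: 1.0 if x >= 3.0 else -1.0
x, best = subgradient_method(f, subgrad, 0.0, lambda t: 1.0 / math.sqrt(t), -1.0, 1.0, 200)
print(x, best)  # 1.0 2.0: the projection pins the iterate at the boundary x* = 1
```

Note that the first step already lands on the constrained minimizer here; in general the method only oscillates toward it, which is why the convergence analysis below works with averaged or best-so-far iterates.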

Lemma 25.2 (Key relation) For the subgradient method, we have

    \|x_{t+1} - x^*\|^2 \leq \|x_t - x^*\|^2 - 2\gamma_t (f(x_t) - f^*) + \gamma_t^2 \|g_t\|^2.    (25.1)

Proof:

    \|x_{t+1} - x^*\|^2 = \|\Pi_X(x_t - \gamma_t g_t) - x^*\|^2
                       \leq \|x_t - \gamma_t g_t - x^*\|^2
                       = \|x_t - x^*\|^2 - 2\gamma_t g_t^\top (x_t - x^*) + \gamma_t^2 \|g_t\|^2.

Due to convexity of f, we have f^* \geq f(x_t) + g_t^\top (x^* - x_t), i.e.

    g_t^\top (x_t - x^*) \geq f(x_t) - f^*.

Combining these two inequalities leads to the desired result.

Remark: Note that when f^* is known, we can choose the optimal \gamma_t by minimizing the right-hand side of (25.1):

    \gamma_t^* = \frac{f(x_t) - f^*}{\|g_t\|^2},

which is Polyak's stepsize.

In fact, knowing f^* is not a problem sometimes. For instance, when the goal is to solve the convex feasibility problem

    find x \in X  s.t.  g_i(x) \leq 0, \quad i = 1, \ldots, m,

we can formulate this as

    \min_{x \in X} \sum_{i=1}^m \max(g_i(x), 0)    or    \min_{x \in X} \max_{1 \leq i \leq m} \max(g_i(x), 0).

The optimal value f^* is known to be 0 in this case.

If f^* is not known, one can replace f^* by its online estimate.

Theorem 25.3 (Convergence) Suppose f(x) is convex and Lipschitz continuous on X: |f(x) - f(y)| \leq M_f \|x - y\|, \forall x, y \in X, where M_f < +\infty. Then the subgradient method satisfies

    f(\hat{x}_T) - f^* \leq \frac{\|x_1 - x^*\|^2 + \sum_{t=1}^T \gamma_t^2 M_f^2}{2 \sum_{t=1}^T \gamma_t},

where \hat{x}_T = (\sum_{t=1}^T \gamma_t)^{-1} \sum_{t=1}^T \gamma_t x_t.

Proof: The Lipschitz continuity implies that \|g_t\| \leq M_f, \forall t. Summing up the key relation (25.1) from t = 1 to t = T, we obtain

    2 \sum_{t=1}^T \gamma_t (f(x_t) - f^*) \leq \|x_1 - x^*\|^2 - \|x_{T+1} - x^*\|^2 + \sum_{t=1}^T \gamma_t^2 M_f^2.

By convexity of f: (\sum_{t=1}^T \gamma_t) f(\hat{x}_T) \leq \sum_{t=1}^T \gamma_t f(x_t). This further leads to

    \left( \sum_{t=1}^T \gamma_t \right) [f(\hat{x}_T) - f^*] \leq \frac{1}{2} \|x_1 - x^*\|^2 + \frac{1}{2} M_f^2 \sum_{t=1}^T \gamma_t^2

and concludes the proof.
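A quick numerical sanity check of the theorem, on an instance of our own choosing: f(x) = |x| over X = [-1, 1], so f* = 0, M_f = 1, and D_X = max ||x - y|| = 2. With the constant stepsize gamma = D_X / (M_f sqrt(T)) that optimizes the bound, the averaged iterate must satisfy f(x_hat_T) - f* <= D_X M_f / sqrt(T).

```python
import math

# Toy instance: f(x) = |x| on X = [-1, 1], f* = 0, M_f = 1, D_X = 2.
T = 100
gamma = 2.0 / math.sqrt(T)               # D_X / (M_f * sqrt(T))
x = 1.0                                  # starting point x_1
num, den = 0.0, 0.0                      # accumulators for the weighted average
for t in range(T):
    num += gamma * x
    den += gamma
    g = 1.0 if x >= 0 else -1.0          # a subgradient of |x|
    x = min(max(x - gamma * g, -1.0), 1.0)
x_hat = num / den                        # averaged iterate \hat{x}_T
bound = 2.0 * 1.0 / math.sqrt(T)         # D_X * M_f / sqrt(T) = 0.2
print(abs(x_hat) <= bound)  # True
```

Here the iterates slide down to 0 and then oscillate between 0 and -gamma, so the average lands well inside the guaranteed bound.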

Convergence under various stepsizes

Assume D_X = \max_{x, y \in X} \|x - y\| is the diameter of the set X. It is interesting to see how the bounds in the above theorem imply convergence, and even convergence rates, under different choices of stepsizes. By abuse of notation, we denote both \min_{1 \leq t \leq T} f(x_t) - f^* and f(\hat{x}_T) - f^* as \epsilon_T.

1. Constant stepsize: with \gamma_t \equiv \gamma,

    \epsilon_T \leq \frac{D_X^2 + \gamma^2 M_f^2 T}{2 \gamma T} = \frac{D_X^2}{2 \gamma T} + \frac{M_f^2 \gamma}{2} \longrightarrow \frac{M_f^2 \gamma}{2}.

It is worth noticing that the error upper bound does not diminish to zero as T grows to infinity, which shows one of the drawbacks of using an arbitrary constant stepsize. In addition, to optimize the upper bound, we can select the optimal stepsize

    \gamma = \frac{D_X}{M_f \sqrt{T}} \implies \epsilon_T \leq \frac{D_X M_f}{\sqrt{T}}.

It is shown that under this optimal choice \epsilon_T \leq O(D_X M_f / \sqrt{T}). However, this exhibits another drawback of the constant stepsize: in practice, T is not known a priori for evaluating the optimal \gamma.

2. Scaled stepsize: with \gamma_t = \gamma / \|g(x_t)\|,

    \epsilon_T \leq \frac{D_X^2 + \gamma^2 T}{2 \gamma \sum_{t=1}^T 1/\|g(x_t)\|} \leq \frac{M_f}{2} \left( \frac{D_X^2}{\gamma T} + \gamma \right).

Similarly, we can select the optimal \gamma by minimizing the right-hand side, i.e. \gamma = D_X / \sqrt{T}:

    \gamma_t = \frac{D_X}{\sqrt{T} \|g(x_t)\|} \implies \epsilon_T \leq \frac{D_X M_f}{\sqrt{T}}.

The same convergence rate is achieved, while the same drawback of not knowing T a priori still exists in choosing \gamma_t.

3. Non-summable but diminishing stepsize: \sum_{t \geq 1} \gamma_t = \infty and \gamma_t \to 0. For any 1 \leq T_1 \leq T,

    \epsilon_T \leq \frac{D_X^2 + M_f^2 \sum_{t=1}^{T} \gamma_t^2}{2 \sum_{t=1}^{T} \gamma_t}
              \leq \frac{D_X^2 + M_f^2 \sum_{t=1}^{T_1} \gamma_t^2}{2 \sum_{t=1}^{T} \gamma_t} + \frac{M_f^2 \sum_{t=T_1+1}^{T} \gamma_t^2}{2 \sum_{t=T_1+1}^{T} \gamma_t}.

Fix a large T_1 and let T \to \infty: the first term on the right-hand side \to 0 since \gamma_t is non-summable, and the second term is at most \frac{M_f^2}{2} \max_{t > T_1} \gamma_t, which is small since \gamma_t \to 0. Consequently, \epsilon_T \to 0.

An example choice of stepsize is \gamma_t = O(1/t^q) with q \in (0, 1]. In particular, if we choose \gamma_t = \frac{D_X}{M_f \sqrt{t}}, then

    \epsilon_T \leq O\left( \frac{D_X M_f \ln T}{\sqrt{T}} \right).

In fact, if we average the iterates from t = T/2 instead of t = 1, we have

    \min_{T/2 \leq t \leq T} f(x_t) - f^* \leq O\left( \frac{M_f D_X}{\sqrt{T}} \right).

4. Non-summable but square-summable stepsize: \sum_{t \geq 1} \gamma_t = \infty and \sum_{t \geq 1} \gamma_t^2 < \infty. It is obvious that

    \epsilon_T \leq \frac{D_X^2 + M_f^2 \sum_{t=1}^{T} \gamma_t^2}{2 \sum_{t=1}^{T} \gamma_t} \to 0.

A typical choice \gamma_t = 1/t^{(1+q)/2} with q > 0 results in a rate of O(1/T^{(1-q)/2}), approaching O(1/\sqrt{T}) as q \to 0.

5. Polyak stepsize: \gamma_t = \frac{f(x_t) - f^*}{\|g(x_t)\|^2}. This stepsize yields

    \|x_{t+1} - x^*\|^2 \leq \|x_t - x^*\|^2 - \frac{(f(x_t) - f^*)^2}{\|g(x_t)\|^2},    (25.2)

which guarantees that \|x_t - x^*\| decreases at each step. Applying (25.2) recursively, we obtain

    \sum_{t=1}^{T} (f(x_t) - f^*)^2 \leq D_X^2 M_f^2 < \infty.

Therefore, \epsilon_T \to 0 as T \to \infty, and \epsilon_T \leq O(1/\sqrt{T}).

Corollary 25.4 When T is known, setting \gamma_t \equiv \frac{D_X}{M_f \sqrt{T}}, we have

    f(\hat{x}_T) - f^* \leq \frac{D_X M_f}{\sqrt{T}}.

Remark: The subgradient method converges sublinearly. For an accuracy \epsilon > 0, it needs O\left( \frac{D_X^2 M_f^2}{\epsilon^2} \right) iterations.

25.2 Bundle Methods

When running the subgradient method, we obtain a bundle of affine underestimates of f(x):

    f(x_t) + g_t^\top (x - x_t), \quad t = 1, 2, \ldots
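These cuts can be assembled into a piecewise-linear lower model of f by taking their pointwise maximum. A minimal sketch on a toy instance of our own (f(x) = x^2, where the subgradient is just the gradient 2x):

```python
def build_model(points, f, subgrad):
    # f_t(x) = max_i { f(x_i) + g_i * (x - x_i) }, one affine cut per point x_i.
    cuts = [(f(xi), subgrad(xi), xi) for xi in points]
    return lambda x: max(fi + gi * (x - xi) for fi, gi, xi in cuts)

# Toy instance: f(x) = x^2 with subgradient 2x, cuts taken at x = -1 and x = 0.5.
f = lambda x: x * x
subgrad = lambda x: 2.0 * x
f_t = build_model([-1.0, 0.5], f, subgrad)

print(f_t(-1.0))           # 1.0: the model interpolates f at the cut points
print(f_t(0.3) <= f(0.3))  # True: the model underestimates f everywhere
```

Each new cut can only raise the maximum, which is exactly the monotonicity of the models noted below.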

Definition 25.5 The piecewise-linear function

    f_t(x) = \max_{1 \leq i \leq t} \left\{ f(x_i) + g_i^\top (x - x_i) \right\},

where g_i \in \partial f(x_i), is called the t-th model of the convex function f.

Note that:
1. f_t(x) \leq f(x), \forall x \in X
2. f_t(x_i) = f(x_i), 1 \leq i \leq t
3. f_1(x) \leq f_2(x) \leq \ldots \leq f_t(x) \leq \ldots \leq f(x)

25.2.1 Kelley Method (Kelley, 1960)

The Kelley method works as follows:

    x_{t+1} = \arg\min_{x \in X} f_t(x).

Obviously, the above algorithm converges so long as X is compact. The auxiliary problem is not so disturbing (it reduces to an LP when X is a polyhedron). However, the issue is that x_{t+1} is not unique, and the Kelley method can be very unstable. Indeed, the worst-case complexity of the Kelley method is at least O\left( (1/\epsilon)^{(n-1)/2} \right).

Remedy: To prevent the instability issue, a possible remedy is to update x_{t+1} by

    x_{t+1} = \arg\min_{x \in X} \left\{ f_t(x) + \frac{\alpha_t}{2} \|x - x_t\|^2 \right\}

with properly selected \alpha_t > 0.

25.2.2 Level Method (Lemarechal, Nemirovski, Nesterov, 1995)

Denote

    \underline{f}_t = \min_{x \in X} f_t(x)        (minimal value of the model)
    \bar{f}_t = \min_{1 \leq i \leq t} f(x_i)      (record value of the model)

Then we have

    \underline{f}_1 \leq \underline{f}_2 \leq \ldots \leq f^* \leq \ldots \leq \bar{f}_2 \leq \bar{f}_1.

Denote the level set

    L_t = \left\{ x \in X : f_t(x) \leq l_t := (1 - \alpha) \underline{f}_t + \alpha \bar{f}_t \right\}

for some \alpha \in (0, 1). Note that L_t is nonempty, convex, and closed, and does not contain the search points \{x_1, \ldots, x_t\}.

The Level method works as follows:

    x_{t+1} = \Pi_{L_t}(x_t) = \arg\min_x \left\{ \|x - x_t\|^2 : f_t(x) \leq l_t, \ x \in X \right\}.

Note that when \alpha = 0, the method reduces to the Kelley method; when \alpha = 1, there will be no progress. The auxiliary problem reduces to a quadratic program when X is a polyhedron.

Theorem 25.6 When

    t > \frac{1}{(1-\alpha)^2 \alpha (2-\alpha)} \left( \frac{M_f D_X}{\epsilon} \right)^2,

we have \bar{f}_t - \underline{f}_t \leq \epsilon, where M_f is the Lipschitz constant of f and D_X is the diameter of the set X.

Remark: The level method achieves the same complexity as the subgradient method (which indeed is unimprovable), but can perform much better than the subgradient method in practice.
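To make the quantities \underline{f}_t and \bar{f}_t concrete, here is a toy one-dimensional run of Kelley's method that tracks both. In 1-D, the piecewise-linear model attains its minimum at an endpoint of X or at an intersection of two cuts, so the argmin can be found by enumeration rather than an LP. The instance f(x) = x^2 on [-2, 2] is our own example, not from the lecture.

```python
def kelley_1d(f, subgrad, lo, hi, x0, iters):
    # Kelley's method x_{t+1} = argmin_{x in [lo,hi]} f_t(x), where
    # f_t(x) = max_i { g_i * x + b_i } with b_i = f(x_i) - g_i * x_i.
    cuts = []                                # list of (slope g_i, intercept b_i)
    x, record = x0, float("inf")
    lower = -float("inf")
    for _ in range(iters):
        fx, g = f(x), subgrad(x)
        record = min(record, fx)             # record value \bar{f}_t
        cuts.append((g, fx - g * x))
        model = lambda y: max(a * y + b for a, b in cuts)
        cand = [lo, hi]                      # candidate minimizers of the model
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                (a1, b1), (a2, b2) = cuts[i], cuts[j]
                if a1 != a2:
                    y = (b2 - b1) / (a1 - a2)   # intersection of cuts i and j
                    if lo <= y <= hi:
                        cand.append(y)
        x = min(cand, key=model)             # next iterate
        lower = model(x)                     # model minimum \underline{f}_t
    return lower, record

# Toy instance: f(x) = x^2 on [-2, 2], started at x_1 = 2.
lower, record = kelley_1d(lambda x: x * x, lambda x: 2.0 * x, -2.0, 2.0, 2.0, 3)
print(lower, record)  # the gap \bar{f}_t - \underline{f}_t closes to 0 here
```

On this well-behaved instance the gap closes after three cuts; the instability discussed above shows up in higher dimensions, where the level method's projection step keeps consecutive iterates close.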