Chapter 8 Gradient Methods

Similar documents
Chapter 10 Conjugate Direction Methods

Chapter 5 Elements of Calculus

Unconstrained optimization

Chapter 3 Transformations

Lecture Notes: Geometric Considerations in Unconstrained Optimization

ECE580 Exam 1 October 4, Please do not write on the back of the exam pages. Extra paper is available from the instructor.

8 Numerical methods for unconstrained problems

1 Newton s Method. Suppose we want to solve: x R. At x = x, f (x) can be approximated by:

The Steepest Descent Algorithm for Unconstrained Optimization

ECE 680 Modern Automatic Control. Gradient and Newton s Methods A Review

Projection Theorem 1

Optimization methods

ECE580 Fall 2015 Solution to Midterm Exam 1 October 23, Please leave fractions as fractions, but simplify them, etc.

y 2 . = x 1y 1 + x 2 y x + + x n y n 2 7 = 1(2) + 3(7) 5(4) = 3. x x = x x x2 n.

Interior-Point Methods for Linear Optimization

Gradient Descent. Dr. Xiaowei Huang

Iterative Methods for Solving A x = b

EECS 275 Matrix Computation

AM 205: lecture 18. Last time: optimization methods Today: conditions for optimality

Nonlinear Programming

SECTION: CONTINUOUS OPTIMISATION LECTURE 4: QUASI-NEWTON METHODS

Applied Linear Algebra in Geoscience Using MATLAB

Unconstrained minimization of smooth functions

Notes on Some Methods for Solving Linear Systems

Performance Surfaces and Optimum Points

Conjugate gradient algorithm for training neural networks

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

The Conjugate Gradient Method

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

1 Kernel methods & optimization

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms.

5 Handling Constraints

Lecture 10: October 27, 2016

Optimization Tutorial 1. Basic Gradient Descent

the method of steepest descent

Lecture 20: 6.1 Inner Products

y 1 y 2 . = x 1y 1 + x 2 y x + + x n y n y n 2 7 = 1(2) + 3(7) 5(4) = 4.

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

1 Directional Derivatives and Differentiability

Lecture 23: 6.1 Inner Products

Convex Functions and Optimization

Line Search Methods for Unconstrained Optimisation

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

1. Method 1: bisection. The bisection methods starts from two points a 0 and b 0 such that

Convex Optimization. Problem set 2. Due Monday April 26th

Numerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems

ECE580 Partial Solution to Problem Set 3

, b = 0. (2) 1 2 The eigenvectors of A corresponding to the eigenvalues λ 1 = 1, λ 2 = 3 are

12 The Heat equation in one spatial dimension: Simple explicit method and Stability analysis

Numerical Optimization Prof. Shirish K. Shevade Department of Computer Science and Automation Indian Institute of Science, Bangalore

A Distributed Newton Method for Network Utility Maximization, II: Convergence

Nonlinear Optimization for Optimal Control

Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

Unconstrained Multivariate Optimization

Math 411 Preliminaries

Numerical Optimization

INNER PRODUCT SPACE. Definition 1

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space.

Constrained optimization: direct methods (cont.)

Deep Learning. Authors: I. Goodfellow, Y. Bengio, A. Courville. Chapter 4: Numerical Computation. Lecture slides edited by C. Yim. C.

CE 191: Civil and Environmental Engineering Systems Analysis. LEC 05 : Optimality Conditions

Linear Algebra. Session 12

Linear Algebra Massoud Malek

Optimization and Root Finding. Kurt Hornik

Math 164: Optimization Barzilai-Borwein Method

Numerical optimization

Upon successful completion of MATH 220, the student will be able to:

AN ELEMENTARY PROOF OF THE SPECTRAL RADIUS FORMULA FOR MATRICES

The speed of Shor s R-algorithm

Chapter 7 Iterative Techniques in Matrix Algebra

Lec10p1, ORF363/COS323

Static unconstrained optimization

Algebra II. Paulius Drungilas and Jonas Jankauskas

Notes on Constrained Optimization

3. Mathematical Properties of MDOF Systems

LECTURE 1: SOURCES OF ERRORS MATHEMATICAL TOOLS A PRIORI ERROR ESTIMATES. Sergey Korotov,

This manuscript is for review purposes only.

Optimization Methods

Principal Component Analysis

Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)

DO NOT OPEN THIS QUESTION BOOKLET UNTIL YOU ARE TOLD TO DO SO

Gradient Methods Using Momentum and Memory

Meaning of the Hessian of a function in a critical point

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

SPECTRAL PROPERTIES OF THE LAPLACIAN ON BOUNDED DOMAINS

Applied Linear Algebra

Chapter 7. Extremal Problems. 7.1 Extrema and Local Extrema

2. Review of Linear Algebra

Multipoint secant and interpolation methods with nonmonotone line search for solving systems of nonlinear equations

A Multiplicative Operation on Matrices with Entries in an Arbitrary Abelian Group

Optimization Problems with Constraints - introduction to theory, numerical Methods and applications

NORMS ON SPACE OF MATRICES

Linear Algebra: Characteristic Value Problem

Lec 2: Mathematical Economics

Lecture 8 Optimization

AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

A trust region algorithm with a worst-case iteration complexity of O(ɛ 3/2 ) for nonconvex optimization

Unit 2, Section 3: Linear Combinations, Spanning, and Linear Independence Linear Combinations, Spanning, and Linear Independence

Transcription:

Chapter 8 Gradient Methods An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1

Introduction Recall that a level set of a function is the set of points satisfying for some constant. Thus, a point is on the level set corresponding to level if In the case of functions of two real variables, 2

Introduction The gradient of at, denoted by, is orthogonal to the tangent vector to an arbitrary smooth curve passing through on the level set The direction of maximum rate of increase of a real-valued differentiable function at a point is orthogonal to the level set of the function through that point. The gradient acts in such a direction that for a given small displacement, the function increases more in the direction of the gradient than in any other direction. 3

Introduction Cauchy-Schwarz inequality Recall that,, is the rate of increase of in the direction at the point. By the Cauchy-Schwarz inequality, because. But if, then Thus, the direction in which points is the direction of maximum rate of increase of at. The direction in which points is the direction of maximum rate of decrease of at. Hence, the direction of negative gradient is a good direction to search if we want to find a function minimizer. 4

Introduction Let be a starting point, and consider the point Then, by Taylor s theorem, we obtain If, then for sufficiently small, we have This means that the point is an improvement over the point if we are searching for a minimizer. 5

Introduction Given a point, to find the next point, we move by an amount, where is a positive scalar called the step size. We refer to this as a gradient descent algorithm (or gradient algorithm). The gradient varies as the search proceeds, tending to zero as we approach the minimizer. We can take very small steps and reevaluate the gradient at every step, or take large steps each time. The former results in a laborious method of reaching the minimizer, whereas the latter may result in a more zigzag path the minimizer. 6

The Method of Steepest Descent Steepest descent is a gradient algorithm where the step size is chosen to achieve the maximum amount of decrease of the objective function at each individual step. At each step, starting from the point, we conduct a line search in the direction until a minimizer,, is found. 7

Proposition 8.1 Proposition 8.1: If is a steepest descent sequence for a given function, then for each the vector is orthogonal to the vector Proof: From the iterative formula of the method of steepest descent it follows that To complete the proof it is enough to show Observe that is a nonnegative scalar that minimizes. Hence, using the FONC and the chain rule gives us 8

Proposition 8.2 Proposition 8.2: If is a steepest descent sequence for a given function and if, then Proof: Recall that where is the minimizer of over all. Thus, for, we have By the chain rule, because by assumption. Thus, and this implies that there is an such that for all Hence, 9

Descent Property Descent property: if If for some, we have, then the point satisfies the FONC. In this case,. We can use the above as the basis for a stopping criterion for the algorithm. The condition, however, is not directly suitable as a practical stopping criterion, because the numerical computation of the gradient will rarely be identically equal to zero. A practical criterion is to check if the norm is less than a prespecified threshold. Alternatively, we may compute, and if the difference is less than some threshold, then we stop. 10

Descent Property Another alternative is to compute the norm, and we stop if the norm is less than a prespecified threshold. We may check relative values of the quantities above The two relative stopping criteria are preferable because they are scale-independent. Scaling the objective function does not change the satisfaction of the criterion. To avoid dividing by very small numbers, modify as 11

Example Use the steepest descent method to find the minimizer of The initial point is We find that Hence, To compute, we need Using the secant method from Section 7.4, we obtain 12

Example Plot versus We compute To find, we first determine Next, we find Using the secant method again, we obtain 13

Example Thus, To find, we first determine The value Note that the minimizer of is and hence it appears that we have arrived at the minimizer in only three iterations. 14

Steepest Descent for Quadratic Function A quadratic function of the form where is a symmetric positive define matrix, and. The unique minimizer of can be found by setting the gradient of to zero, where because and 15

Steepest Descent for Quadratic Function The Hessian of is. To simplify the notation we write. Then, for the steepest descent algorithm for the quadratic function can be represented as where In the quadratic case, we can find an explicit formula for. Assume that, for if, then and the algorithm stops. 16

Steepest Descent for Quadratic Function Because is the minimizer of, we apply the FONC to to obtain Therefore, if But, Hence, In summary, the method of steepest descent for the quadratic stakes the form 17

Example Let. Then, starting from an arbitrary initial point, we arrive at the solution at only one step. However, if, then the method of steepest descent shuffles ineffectively back and forth when searching for the minimizer in a narrow valley. This example illustrates a major drawback in the steepest descent method. 18

Convergence In a descent method, as each new point is generated by the algorithm, the corresponding value of the objective function decreases in value. An iterative algorithm is globally convergent if for any arbitrary starting point the algorithm is guaranteed to generate a sequence of pints converging to a point that satisfies the FONC for a minimizer. If not, it may still generate a sequence that converges to a point satisfying the FONC, provided that the initial point is sufficiently close to the point. Locally convergent How fast the algorithm converges to a solution point: rate of convergence 19

Convergence The convergence analysis is more convenient if instead of working with we deal with where. The solution point is obtained by solving ; that is, The function differs from only by a constant 20

Convergence Lemma 8.1: The iterative algorithm with satisfies where if, then, and if, then 21

Convergence Theorem 8.1: Let be the sequence resulting from a gradient algorithm. Let be as defined in Lemma 8.1, and suppose that for all. Then, converges to for any initial condition if and only if Proof: From Lemma 8.1 we have, from which we obtain Assume that for all, for otherwise the result holds trivially. 22

Convergence Note that if and only if. We see that this occurs if and only if, which, in turn, holds if and only if Note that by Lemma 8.1, and is well defined [ is taken to be ]. Therefore, it remains to show that if and only if We first show that implies that. For this, first observe that for any, we have. Therefore,, and hence. Thus, if, then clearly 23

Convergence Finally, we show that implies that By contraposition. Suppose that. Then, it must be that. Observe that for and sufficiently close to 1, we have. Therefore, for sufficiently large,, which implies that. Hence, implies that. This completes the proof. The assumption in Theorem 8.1 that for all is significant. Furthermore, the result of the theorem does not hold in general if we do not have this assumption. 24

Example A counter example to show in Theorem 8.1 is necessary. For each, choose in such a way that and (we can always do this if, for example, ). From Lemma 8.1 we have Therefore,. Because, we also have that. Hence,, which implies that (for all ). On the other hand, it is clear that for all. Hence, the result of the theorem does not hold if for some. 25

Convergence Rayleigh s inequality. For any, we have We also have 26

Convergence Lemma 8.2: Let be an real symmetric positive definite matrix. Then, for any, we have Proof: Appling Rayleigh s inequality and using the properties of symmetric positive definite matrices listed previously, we get 27

Convergence Theorem 8.2: In the steepest descent algorithm, we have for any Proof: If for some, then and the result holds. So assume that for all. Recall that for the steepest descent algorithm, Substituting this expression for in the formula for yields Note that in this case for all. Furthermore, by Lemma 8.2, we have. Therefore, we have, and hence by Theorem 8.1, we conclude that 28

Convergence Consider now a gradient method with fixed step size; that is, for all. The resulting algorithm is of the form We refer to the algorithm above as a fixed-step-size gradient algorithm. The algorithm is of practical interest because of its simplicity. The algorithm does not require a line search at each step to determine. Clearly, the convergence of the algorithm depends on the choice of. 29

Convergence Theorem 8.3: For the fixed-step-size gradient algorithm, for any if and only if Proof: : By Rayleigh s inequality we have and Therefore, substituting the above in the formula for, we have Therefore, for all, and. Hence, by Theorem 8.1, we conclude that 30

Convergence Proof: : We use contraposition. Suppose that either or. Let be chosen such that is an eigenvector of corresponding to the eigenvalue. Because we obtain Taking norms on both sides, we get Because or, Hence, cannot converge to 0, and thus the sequence does not converge to 31

Example Let the function be given by We wish to find the minimizer of using a fixed-step-size gradient algorithm where is a fixed step size. Solution: To apply Theorem 8.3, we first symmetrize the matrix in the quadratic term of f to get The eigenvalues of the matrix are 6 and 12. Hence, by Theorem 8.3, the algorithm converges to the minimizer for all if and only if lies in the range 32

Convergence Rate Theorem 8.4: In the method of steepest descent applied to the quadratic function, at every step we have Proof: In the proof of Theorem 8.2, we showed that. Therefore, and the result follows. 33

Convergence Rate Let called the condition number of. Then, it follows from Theorem 8.4 that The term plays an important role in the convergence of to 0 (and hence of to ). We refer to as the convergence ratio. The smaller the value of, the smaller will be relative to, and hence the faster converges to 0. 34

Convergence Rate The convergence ratio decreases as decreases. If then, corresponding to the circular contours of (Figure 8.6). In this case the algorithm converges in a single step to the minimizer. As increases, the speed of convergence of (and hence ) decreases. The increase in reflects that fact that the contours of are more eccentric. 35

Convergence Rate Definition 8.1: Given a sequence that converges to, that is,, we say the order of convergence is, where, if If for all then we say that the order of convergence is Note that in the definition above, 0/0 should be understood to be 0. 36

Convergence Rate The order of convergence of a sequence is a measure of its rate of convergence; the higher the order, the faster the rate of convergence. The order of convergence is sometimes also called the rate of convergence. If and we say that the convergence is sublinear. If and, we say that the convergence is linear. If, we say that the convergence is superlinear. If, we say that the convergence is quadratic. 37

Example Suppose that and thus. Then, If, the sequence converges to 0, whereas if, it grows to. If, the sequence converges to 1. Hence, the order of convergence is 1. Suppose that, where, and thus. Then, If, the sequence converges to 0, whereas if, it grows to. If, the sequence converges to. Hence, the order of convergence is 1. 38

Example Suppose that, where and, and thus Then, If, the sequence converges to 0, whereas if, it grows to. If, the sequence converges to 1. Hence, the order of convergence is. Suppose that for all, and thus. Then, for all. Hence, the order of convergence is. 39

Convergence Rate The order of convergence can be interpreted using the notion of the order symbol. Recall that ( big-oh of ) if there exists a constant such that for sufficiently small. The order of convergence is at least if 40

Convergence Rate Theorem 8.5: Let be a sequence that converges to. If then the order of convergence (if it exists) is at least. Proof: Let be the order of convergence of. Suppose that Then, these exists such that for sufficiently large, Hence, 41

Convergence Rate Taking limits yields Because by definition is the order of convergence Combining the two inequalities above, we get Therefore, because, we conclude that that is, the order of convergence is at least. 42

Example Similarly, we can show that if then the order of convergence (if it exists) strictly exceeds. Suppose that we are given a scalar sequence that converges with order of convergence and satisfies The limit of must be 2, because it is clear from the equation that. Also, we see that. Hence, we conclude that 43

Example Consider the problem of finding a minimizer of the function given by. Suppose that we use the algorithm with step size and initial condition We first show that the algorithm converges to a local minimizer of. We have. The fixed-step-size gradient algorithm is therefore given by With, we can derive the expression Hence, the algorithm converges to 0, a strict local minimizer of. Note that Therefore, the order of convergence is 2. 44

Convergence Rate The steepest descent algorithm has an order of convergence of 1 in the worst case. Lemma 8.3: In the steepest descent algorithm, if for all then if and only if is an eigenvector of. Proof: Suppose that for all. Recall that for the steepest descent algorithm, Sufficiency is easy to show by verification. To show necessity, suppose that. Then,, which implies that. Therefore,. 45

Convergence Rate Premultiplying by and subtracting from both sides yields which can be rewritten as Hence, is an eigenvector of. By the lemma, if is not an eigenvector of, then (recall that cannot exceed 1) 46

Theorem 8.6 Theorem 8.6: Let be a convergent sequence of iterates of the steepest descent algorithm applied to a function. Then, the order of convergence of is 1 in the worst case; that is, there exist a function and an initial condition such that the order of convergence of is equal to 1. Proof: Let be a quadratic function with Hessian. Assume that the maximum and minimum eigenvalues of satisfy. To show that the order of convergence is 1, it suffices to show that there exists such that for some. 47

Theorem 8.6 By Rayleigh s inequality Similarly, Combining the inequalities above with Lemma 8.1, we obtain Therefore, it suffices to choose such that for some 48

Theorem 8.6 Recall that for the steepest descent algorithm, assuming that for all, First consider the case where. Suppose that is chosen such that is not an eigenvector of. Then, is also not an eigenvector of. By Proposition 8.1, is not an eigenvector of for any [because any two eigenvectors corresponding to and are mutually orthogonal]. Also, lies in one of two mutually orthogonal directions. Therefore, by Lemma 8.3, for each, the value of of two numbers, both of which are strictly less than 1. This proves the case. 49

Theorem 8.6 For the general case, let and be mutually orthogonal eigenvectors corresponding to and. Choose such that lies in the span of and but is not equal to either. Note that also lies in the span of and, but is not equal to either. By manipulating as before, we can write. Any eigenvector of is also an eigenvector of. Therefore, lies in the span of and for all ; that is, the sequence is confined within the two-dimensional subspace spanned by and. We can now proceed as in the case. 50