MATH 4211/6211 Optimization Quasi-Newton Method

Xiaojing Ye
Department of Mathematics & Statistics, Georgia State University

Quasi-Newton Method

Motivation: approximate the inverse Hessian $(\nabla^2 f(x^{(k)}))^{-1}$ in Newton's method by some matrix $H_k$:
$$x^{(k+1)} = x^{(k)} - \alpha_k H_k g^{(k)}$$
That is, the search direction is set to $d^{(k)} = -H_k g^{(k)}$. Based on $H_k$, $x^{(k)}$, and $g^{(k)}$, the quasi-Newton method generates the next $H_{k+1}$, and so on.
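To make the update concrete, here is a minimal Python/NumPy sketch of one quasi-Newton step; the function name quasi_newton_step and its arguments are illustrative placeholders, not part of the lecture.

import numpy as np

def quasi_newton_step(x, g, H, alpha):
    # Search direction built from the inverse-Hessian approximation H_k
    d = -H @ g
    # Move along d^{(k)} with step size alpha_k
    return x + alpha * d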

Proposition. If $f \in C^1$, $g^{(k)} \neq 0$, and $H_k \succ 0$, then $d^{(k)} = -H_k g^{(k)}$ is a descent direction.

Proof. Let $x^{(k+1)} = x^{(k)} - \alpha H_k g^{(k)}$ for some $\alpha > 0$. Then by Taylor's expansion,
$$f(x^{(k+1)}) = f(x^{(k)}) - \alpha\, g^{(k)T} H_k g^{(k)} + o(\|H_k g^{(k)}\|\,\alpha) < f(x^{(k)})$$
for $\alpha$ sufficiently small, since $g^{(k)T} H_k g^{(k)} > 0$.

Recall that for quadratic functions with $Q \succ 0$, the Hessian is $Q$ for all $k$, and
$$g^{(k+1)} - g^{(k)} = Q(x^{(k+1)} - x^{(k)}).$$
For notational simplicity, we denote
$$\Delta x^{(k)} = x^{(k+1)} - x^{(k)} \quad \text{and} \quad \Delta g^{(k)} = g^{(k+1)} - g^{(k)}.$$
Then we can write the identity above as
$$\Delta g^{(k)} = Q\, \Delta x^{(k)}, \quad \text{or equivalently} \quad Q^{-1} \Delta g^{(k)} = \Delta x^{(k)}.$$
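As a quick numerical sanity check of this identity, here is a sketch in NumPy; the quadratic data Q, b and the test points are arbitrary choices for illustration.

import numpy as np

np.random.seed(0)
n = 4
A = np.random.randn(n, n)
Q = A @ A.T + n * np.eye(n)        # a symmetric positive definite Q
b = np.random.randn(n)
grad = lambda x: Q @ x - b          # gradient of f(x) = 0.5 x^T Q x - b^T x

x0, x1 = np.random.randn(n), np.random.randn(n)
dx = x1 - x0                        # Delta x
dg = grad(x1) - grad(x0)            # Delta g
print(np.allclose(dg, Q @ dx))      # True: Delta g = Q Delta x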

In quasi-newton method, H k is in the place of Q 1 : Newton : Quasi-Newton : x (k+1) = x (k) α k Q 1 g (k) x (k+1) = x (k) α k H k g (k) Therefore we would like to have a sequence of H k with same property of Q 1 : H k+1 g (i) = x (i), 0 i k for all k = 0, 1, 2,.... Xiaojing Ye, Math & Stat, Georgia State University 4

If this is true, then at iteration $n$ we have
$$H_n \Delta g^{(0)} = \Delta x^{(0)}, \quad H_n \Delta g^{(1)} = \Delta x^{(1)}, \quad \ldots, \quad H_n \Delta g^{(n-1)} = \Delta x^{(n-1)},$$
or
$$H_n [\Delta g^{(0)}, \ldots, \Delta g^{(n-1)}] = [\Delta x^{(0)}, \ldots, \Delta x^{(n-1)}].$$
On the other hand,
$$Q^{-1} [\Delta g^{(0)}, \ldots, \Delta g^{(n-1)}] = [\Delta x^{(0)}, \ldots, \Delta x^{(n-1)}].$$
If $[\Delta g^{(0)}, \ldots, \Delta g^{(n-1)}]$ is invertible, then $H_n = Q^{-1}$. Then at iteration $n+1$ we have
$$x^{(n+1)} = x^{(n)} - \alpha_n H_n g^{(n)} = x^*$$
since this is the same as Newton's update. Hence for quadratic functions, the quasi-Newton method converges in at most $n$ steps.

Quasi-Newton method:
$$d^{(k)} = -H_k g^{(k)}$$
$$\alpha_k = \arg\min_{\alpha \ge 0} f(x^{(k)} + \alpha d^{(k)})$$
$$x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$$
where $H_0, H_1, \ldots$ are symmetric. Moreover, for quadratic functions of the form $f(x) = \frac{1}{2} x^T Q x - b^T x$, the matrices $H_0, H_1, \ldots$ are required to satisfy
$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k.$$
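The scheme above can be written as one short generic loop; any of the three update rules introduced below (rank one, DFP, BFGS) can be passed in as update_H. This is a sketch for the quadratic case with exact line search; all names are illustrative.

import numpy as np

def quasi_newton_quadratic(Q, b, x0, H0, update_H, max_iter=50, tol=1e-10):
    # Generic quasi-Newton loop for f(x) = 0.5 x^T Q x - b^T x
    x, H = x0.astype(float).copy(), H0.copy()
    g = Q @ x - b
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g
        alpha = -(g @ d) / (d @ Q @ d)         # exact line search for a quadratic
        x_new = x + alpha * d
        g_new = Q @ x_new - b
        H = update_H(H, x_new - x, g_new - g)  # enforce H_{k+1} Delta g = Delta x
        x, g = x_new, g_new
    return x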

Theorem. Consider a quasi-newton algorithm applied to a quadratic function with symmetric Q, such that for all k = 0, 1,..., n 1, there are H k+1 g (i) = x (i), 0 i k and H k are all symmetric. If α i 0 for 0 i k, then d (0),..., d (n) are Q-conjugate. Xiaojing Ye, Math & Stat, Georgia State University 7

Proof. We prove by induction. The case $k = 0$ is trivial. Assume the claim holds for some $k < n-1$. Using $\Delta x^{(i)} = \alpha_i d^{(i)}$ and $Q \Delta x^{(i)} = \Delta g^{(i)}$, we have for $i \le k$ that
$$d^{(k+1)T} Q d^{(i)} = -(H_{k+1} g^{(k+1)})^T Q d^{(i)} = -\frac{g^{(k+1)T} H_{k+1} Q \Delta x^{(i)}}{\alpha_i} = -\frac{g^{(k+1)T} H_{k+1} \Delta g^{(i)}}{\alpha_i} = -\frac{g^{(k+1)T} \Delta x^{(i)}}{\alpha_i} = -g^{(k+1)T} d^{(i)}.$$
Since $d^{(0)}, \ldots, d^{(k)}$ are $Q$-conjugate, we know $g^{(k+1)T} d^{(i)} = 0$ for all $i \le k$. Hence $d^{(0)}, \ldots, d^{(k)}, d^{(k+1)}$ are $Q$-conjugate, and by induction the claim holds.

The theorem above also shows that the quasi-Newton method is a conjugate direction method, and hence converges in at most $n$ steps for quadratic objective functions. In practice, there are various ways to generate $H_k$ such that
$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k.$$
We now study three algorithms that produce such $H_k$: the rank one correction, DFP, and BFGS.

Rank one correction formula

Suppose we would like to update $H_k$ to $H_{k+1}$ by adding a rank one matrix $a_k z^{(k)} z^{(k)T}$ for some $a_k \in \mathbb{R}$ and $z^{(k)} \in \mathbb{R}^n$:
$$H_{k+1} = H_k + a_k z^{(k)} z^{(k)T}.$$
Now let us derive what this $a_k z^{(k)} z^{(k)T}$ should be. Since we need $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i \le k$, we at least need $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. That is,
$$\Delta x^{(k)} = H_{k+1} \Delta g^{(k)} = (H_k + a_k z^{(k)} z^{(k)T}) \Delta g^{(k)} = H_k \Delta g^{(k)} + a_k (z^{(k)T} \Delta g^{(k)}) z^{(k)}.$$

Therefore
$$z^{(k)} = \frac{\Delta x^{(k)} - H_k \Delta g^{(k)}}{a_k (z^{(k)T} \Delta g^{(k)})}$$
and hence
$$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{a_k (z^{(k)T} \Delta g^{(k)})^2}.$$
On the other hand, multiplying both sides of $\Delta x^{(k)} - H_k \Delta g^{(k)} = a_k (z^{(k)T} \Delta g^{(k)}) z^{(k)}$ by $\Delta g^{(k)T}$, we obtain
$$\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)}) = a_k (z^{(k)T} \Delta g^{(k)})^2.$$
Hence
$$H_{k+1} = H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)})}.$$
This is the rank one correction formula.
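Written as a Python function, the update reads as follows (a sketch; sr1_update is an illustrative name):

import numpy as np

def sr1_update(H, dx, dg):
    # Rank one correction: H + v v^T / (dg^T v) with v = dx - H dg
    v = dx - H @ dg
    denom = dg @ v                  # can be close to zero; see the issues below
    return H + np.outer(v, v) / denom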

We obtained the formula by requiring $H_{k+1} \Delta g^{(k)} = \Delta x^{(k)}$. However, we also need $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$. This turns out to hold automatically:

Theorem. For the rank one algorithm applied to quadratic functions with symmetric Hessian $Q$, we have
$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$
for $k = 0, 1, \ldots, n-1$.

Proof. (By induction; assume $H_k \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$.) We only need to show $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $i < k$:
$$H_{k+1} \Delta g^{(i)} = \left(H_k + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T}{\Delta g^{(k)T}(\Delta x^{(k)} - H_k \Delta g^{(k)})}\right) \Delta g^{(i)} = \Delta x^{(i)} + \frac{(\Delta x^{(k)} - H_k \Delta g^{(k)})(\Delta x^{(k)} - H_k \Delta g^{(k)})^T \Delta g^{(i)}}{\Delta g^{(k)T}(\Delta x^{(k)} - H_k \Delta g^{(k)})}.$$
Note that
$$(H_k \Delta g^{(k)})^T \Delta g^{(i)} = \Delta g^{(k)T} H_k \Delta g^{(i)} = \Delta g^{(k)T} \Delta x^{(i)} = \Delta x^{(k)T} Q \Delta x^{(i)} = \Delta x^{(k)T} \Delta g^{(i)},$$
so $(\Delta x^{(k)} - H_k \Delta g^{(k)})^T \Delta g^{(i)} = 0$. Hence the second term on the right is zero, and we obtain
$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}.$$
This completes the proof.

Issues with the rank one correction formula:
- $H_{k+1}$ may not be positive definite even if $H_k$ is; hence $-H_{k+1} g^{(k+1)}$ may not be a descent direction.
- The denominator in the rank one correction is $\Delta g^{(k)T} (\Delta x^{(k)} - H_k \Delta g^{(k)})$, which can be close to 0 and makes the computation unstable.

We now study the DFP algorithm, which improves the rank one correction formula by ensuring positive definiteness of $H_k$.

DFP algorithm [Davidon 1959, Fletcher and Powell 1963]:
$$H_{k+1} = H_k + \frac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{(H_k \Delta g^{(k)})(H_k \Delta g^{(k)})^T}{\Delta g^{(k)T} H_k \Delta g^{(k)}}$$
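In NumPy the DFP update reads as follows (a sketch; dfp_update is an illustrative name):

import numpy as np

def dfp_update(H, dx, dg):
    # DFP: H + dx dx^T / (dx^T dg) - (H dg)(H dg)^T / (dg^T H dg)
    Hdg = H @ dg
    return H + np.outer(dx, dx) / (dx @ dg) - np.outer(Hdg, Hdg) / (dg @ Hdg)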

We first show that DFP is a quasi-Newton method.

Theorem. The DFP algorithm applied to quadratic functions satisfies
$$H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}, \quad 0 \le i \le k$$
for all $k$.

Proof. We prove this by induction. It is trivial for $k = 0$. Assume the claim is true for $k$, i.e., $H_k \Delta g^{(i)} = \Delta x^{(i)}$ for all $i \le k-1$. First, $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ holds for $i = k$ by direct computation. For $i < k$, we have
$$H_{k+1} \Delta g^{(i)} = H_k \Delta g^{(i)} + \frac{\Delta x^{(k)} (\Delta x^{(k)T} \Delta g^{(i)})}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{(H_k \Delta g^{(k)})(H_k \Delta g^{(k)})^T \Delta g^{(i)}}{\Delta g^{(k)T} H_k \Delta g^{(k)}}.$$
Note that by assumption $d^{(0)}, \ldots, d^{(k)}$ are $Q$-conjugate, and hence
$$\Delta x^{(k)T} \Delta g^{(i)} = \Delta x^{(k)T} Q \Delta x^{(i)} = \alpha_k \alpha_i\, d^{(k)T} Q d^{(i)} = 0;$$
similarly, $\Delta g^{(k)T} H_k \Delta g^{(i)} = \Delta g^{(k)T} \Delta x^{(i)} = 0$. Therefore $H_{k+1} \Delta g^{(i)} = H_k \Delta g^{(i)} = \Delta x^{(i)}$. This completes the proof.

Next we show that $H_{k+1}$ inherits positive definiteness from $H_k$ in the DFP algorithm.

Theorem. Suppose $g^{(k)} \neq 0$. Then $H_k \succ 0$ implies $H_{k+1} \succ 0$ in DFP.

Proof. For any $x \in \mathbb{R}^n$, we have
$$x^T H_{k+1} x = x^T H_k x + \frac{(x^T \Delta x^{(k)})^2}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{(x^T H_k \Delta g^{(k)})^2}{\Delta g^{(k)T} H_k \Delta g^{(k)}}.$$
For notational simplicity, we denote $a = H_k^{1/2} x$ and $b = H_k^{1/2} \Delta g^{(k)}$, where $H_k = H_k^{1/2} H_k^{1/2}$ (we know $H_k^{1/2}$ exists since $H_k$ is SPD).

Proof (cont.). Now we have
$$x^T H_{k+1} x = \frac{\|a\|^2 \|b\|^2 - (a^T b)^2}{\|b\|^2} + \frac{(x^T \Delta x^{(k)})^2}{\Delta x^{(k)T} \Delta g^{(k)}}.$$
Note also that $\Delta x^{(k)} = \alpha_k d^{(k)} = -\alpha_k H_k g^{(k)}$, and therefore
$$\Delta x^{(k)T} \Delta g^{(k)} = \Delta x^{(k)T} (g^{(k+1)} - g^{(k)}) = -\Delta x^{(k)T} g^{(k)} = \alpha_k\, g^{(k)T} H_k g^{(k)} > 0,$$
where we used $d^{(k)T} g^{(k+1)} = 0$ (from the exact line search) in the second equality. Hence $x^T H_{k+1} x \ge 0$, since both terms on the right-hand side are nonnegative (the first by the Cauchy-Schwarz inequality).

Proof (cont.). It remains to show that the two terms cannot both be 0 when $x \neq 0$. Suppose the first term is 0; then $a = \beta b$ for some scalar $\beta \neq 0$, that is, $H_k^{1/2} x = \beta H_k^{1/2} \Delta g^{(k)}$, or $x = \beta \Delta g^{(k)}$. In this case,
$$(x^T \Delta x^{(k)})^2 = (\beta\, \Delta g^{(k)T} \Delta x^{(k)})^2 = (\beta \alpha_k\, g^{(k)T} H_k g^{(k)})^2 > 0,$$
and hence the second term is positive. This completes the proof.

BFGS algorithm (named after Broyden, Fletcher, Goldfarb, and Shanno)

Instead of directly finding $H_k$ such that $H_{k+1} \Delta g^{(i)} = \Delta x^{(i)}$ for $0 \le i \le k$, BFGS first finds $B_k$ such that
$$B_{k+1} \Delta x^{(i)} = \Delta g^{(i)}, \quad 0 \le i \le k.$$
Then replacing $H_k$ by $B_k$ and swapping $\Delta x^{(k)}$ and $\Delta g^{(k)}$ in DFP yields
$$B_{k+1} = B_k + \frac{\Delta g^{(k)} \Delta g^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{(B_k \Delta x^{(k)})(B_k \Delta x^{(k)})^T}{\Delta x^{(k)T} B_k \Delta x^{(k)}}.$$
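The corresponding update of the Hessian approximation B_k reads as follows in NumPy (a sketch; bfgs_B_update is an illustrative name):

import numpy as np

def bfgs_B_update(B, dx, dg):
    # DFP formula with H <-> B and dx <-> dg swapped
    Bdx = B @ dx
    return B + np.outer(dg, dg) / (dx @ dg) - np.outer(Bdx, Bdx) / (dx @ Bdx)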

Then the actual $H_k$ is $H_k = B_k^{-1}$, and hence
$$H_{k+1} = \left(B_k + \frac{\Delta g^{(k)} \Delta g^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{(B_k \Delta x^{(k)})(B_k \Delta x^{(k)})^T}{\Delta x^{(k)T} B_k \Delta x^{(k)}}\right)^{-1} = H_k + \left(1 + \frac{\Delta g^{(k)T} H_k \Delta g^{(k)}}{\Delta g^{(k)T} \Delta x^{(k)}}\right) \frac{\Delta x^{(k)} \Delta x^{(k)T}}{\Delta x^{(k)T} \Delta g^{(k)}} - \frac{H_k \Delta g^{(k)} \Delta x^{(k)T} + (H_k \Delta g^{(k)} \Delta x^{(k)T})^T}{\Delta g^{(k)T} \Delta x^{(k)}}.$$
This is the update rule of $H_k$ in the BFGS algorithm.
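In NumPy the BFGS update of the inverse-Hessian approximation reads as follows (a sketch; bfgs_update is an illustrative name):

import numpy as np

def bfgs_update(H, dx, dg):
    # BFGS update of the inverse-Hessian approximation H
    dgdx = dg @ dx
    Hdg = H @ dg
    M = np.outer(Hdg, dx)           # H dg dx^T
    return H + (1.0 + (dg @ Hdg) / dgdx) * np.outer(dx, dx) / dgdx - (M + M.T) / dgdx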

The inverse was obtained by applying the following result:

Lemma (Sherman-Morrison formula). Let $A$ be a nonsingular matrix, and let $u$ and $v$ be column vectors such that $1 + v^T A^{-1} u \neq 0$. Then $A + uv^T$ is nonsingular, and
$$(A + uv^T)^{-1} = A^{-1} - \frac{(A^{-1} u)(v^T A^{-1})}{1 + v^T A^{-1} u}.$$
Proof. Direct computation.
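A quick numerical check of the lemma (a sketch with arbitrary test data):

import numpy as np

np.random.seed(1)
n = 5
A = np.random.randn(n, n) + n * np.eye(n)      # nonsingular for this test data
u, v = np.random.randn(n), np.random.randn(n)
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
print(np.allclose(lhs, rhs))                   # True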

BFGS algorithm:
1. Set $k = 0$; select $x^{(0)}$ and an SPD matrix $H_0$, and compute $g^{(0)} = \nabla f(x^{(0)})$.
2. Repeat:
   $d^{(k)} = -H_k g^{(k)}$
   $\alpha_k = \arg\min_{\alpha \ge 0} f(x^{(k)} + \alpha d^{(k)})$
   $x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$
   $g^{(k+1)} = \nabla f(x^{(k+1)})$
   $H_{k+1} = H_k + \cdots$ (compute the BFGS update of $H_k$)
   $k \leftarrow k + 1$
   until $g^{(k)} = 0$.
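Putting the pieces together, here is a compact sketch of BFGS with exact line search on a small quadratic test problem; the problem data, names, and tolerance are illustrative, and the stopping test uses a small tolerance in place of the exact condition $g^{(k)} = 0$.

import numpy as np

def bfgs_quadratic(Q, b, x0, tol=1e-10, max_iter=100):
    # BFGS with exact line search on f(x) = 0.5 x^T Q x - b^T x
    n = len(b)
    x = x0.astype(float).copy()
    H = np.eye(n)                        # SPD initial H_0
    g = Q @ x - b
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:      # stop when the gradient (numerically) vanishes
            break
        d = -H @ g
        alpha = -(g @ d) / (d @ Q @ d)   # exact line search for a quadratic
        x_new = x + alpha * d
        g_new = Q @ x_new - b
        dx, dg = x_new - x, g_new - g
        # BFGS update of the inverse-Hessian approximation
        dgdx = dg @ dx
        Hdg = H @ dg
        M = np.outer(Hdg, dx)
        H = H + (1.0 + (dg @ Hdg) / dgdx) * np.outer(dx, dx) / dgdx - (M + M.T) / dgdx
        x, g = x_new, g_new
    return x

# For quadratics the method should reach the minimizer Q^{-1} b in at most n steps.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(bfgs_quadratic(Q, b, np.zeros(2)), np.linalg.solve(Q, b))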