MATH 4211/6211 Optimization Quasi-Newton Method

MATH 4211/6211 Optimization Quasi-Newton Method Xiaojing Ye Department of Mathematics & Statistics Georgia State University Xiaojing Ye, Math & Stat, Georgia State University 0

Quasi-Newton Method Motivation: Approximate the inverse Hessian ( 2 f(x (k) )) 1 in the Newton s method by some H k : x (k+1) = x (k) α k H k g (k) That is, the search direction is set to d (k) = H k g (k). Based on H k, x (k), g (k), quasi-newton generates the next H k+1, and so on. Xiaojing Ye, Math & Stat, Georgia State University 1

Proposition. If f C 1, g (k) 0, and H k 0, then d (k) = H k g (k) is a descent direction. Proof. Let x (k+1) = x (k) αh k g (k) for some α, then by Taylor s expansion f(x (k+1) ) = f(x (k) ) αg (k)t H k g (k) + o( H k g (k) α) < f(x (k) ) for α sufficiently small. Xiaojing Ye, Math & Stat, Georgia State University 2

Recall that for quadratic functions with Q 0, the Hessian is H (k) = Q for all k, and For notation simplicity, we denote g (k+1) g (k) = Q(x (k+1) x (k) ) x (k) = x (k+1) x (k) and g (k) = g (k+1) g (k) Then we can write the identity above as g (k) = Q x (k) or equivalently Q 1 g (k) = x (k) Xiaojing Ye, Math & Stat, Georgia State University 3

In quasi-newton method, H k is in the place of Q 1 : Newton : Quasi-Newton : x (k+1) = x (k) α k Q 1 g (k) x (k+1) = x (k) α k H k g (k) Therefore we would like to have a sequence of H k with same property of Q 1 : H k+1 g (i) = x (i), 0 i k for all k = 0, 1, 2,.... Xiaojing Ye, Math & Stat, Georgia State University 4

If this is true, then at iteration n, there are H n g (0) = x (0) H n g (1) = x (1). H n g (n 1) = x (n 1) or H n [ g (0),..., g (n 1) ] = [ x (0),..., x (n 1) ]. On the other hand, Q 1 [ g (0),..., g (n 1) ] = [ x (0),..., x (n 1) ]. If [ g (0),..., g (n 1) ] is invertible, then we have H n = Q 1. Then at the iteration n + 1, there is x (n+1) = x (n) α n H n g (n) = x since this is the same as the Newton s update. Hence for quadratic functions, quasi-newton method would converge in at most n steps. Xiaojing Ye, Math & Stat, Georgia State University 5

Quasi-Newton method d (k) = H k g (k) where H 0, H 1,... are symmetric. α k = arg min f(x (k) + α k d (k) ) α 0 x (k+1) = x (k) + αd (k) Moreover, for quadratic functions of form f(x) = 1 2 xt Qx b T x, the matrices H 0, H 1,... are required to satisfy H k+1 g (i) = x (i), 0 i k Xiaojing Ye, Math & Stat, Georgia State University 6

Theorem. Consider a quasi-newton algorithm applied to a quadratic function with symmetric Q, such that for all k = 0, 1,..., n 1, there are H k+1 g (i) = x (i), 0 i k and H k are all symmetric. If α i 0 for 0 i k, then d (0),..., d (n) are Q-conjugate. Xiaojing Ye, Math & Stat, Georgia State University 7

Proof. We prove by induction. It is trivial to show for k = 0. Assume the claim holds for some k < n 1. We have for i k that d (k+1)t Qd (i) = (H k+1 g (k+1) ) T Qd (i) = g (k+1)t Q x (i) H k+1 α i = g (k+1)t g (i) H k+1 = g (k+1)t x(i) α i = g (k+1)t d (i) Since d (0),..., d (k) are Q-conjugate, we know g (k+1)t d (i) = 0 for all i k. Hence d (0),..., d (k), d (k+1) are Q-conjugate. By induction the claim holds. α i Xiaojing Ye, Math & Stat, Georgia State University 8

The theorem above also shows that quasi-newton method is a conjugate direction method, and hence converges in n steps for quadratic objective functions. In practice, there are various ways to generate H k such that H k+1 g (i) = x (i), 0 i k Now we learn three algorithms that produce such H k. Xiaojing Ye, Math & Stat, Georgia State University 9

Rank one correction formula Suppose we would like to update H k to H k+1 by adding a rank one matrix a k z (k) z (k)t for some a k R and z (k) R n : H k+1 = H k + a k z (k) z (k)t Now let us derive what this a k z (k) z (k)t should be. Since we need H k+1 g (i) = x (i) for i k, we at least need H k+1 g (k) = x (k). That is x (k) = H k+1 g (k) = (H k + a k z (k) z (k)t ) g (k) = H k g (k) + a k (z (k)t g (k) )z (k) Xiaojing Ye, Math & Stat, Georgia State University 10

Therefore z (k) = x(k) H k g (k) a k (z (k)t g (k) ) and hence H k+1 = H k + ( x(k) H k g (k) )( x (k) H k g (k) ) T a k (z (k)t g (k) ) 2 On the other hand, multiplying g (k)t on both sides of x (k) H k g (k) = a k (z (k)t g (k) )z (k), we obtain g (k)t ( x (k) H k g (k) ) = a k (z (k)t g (k) ) 2. Hence H k+1 = H k + ( x(k) H k g (k) )( x (k) H k g (k) ) T g (k)t ( x (k) H k g (k) ) This is the rank one correction formula. Xiaojing Ye, Math & Stat, Georgia State University 11

We obtained the formula by requiring H k+1 g (k) = x (k). However, we also need H k+1 g (i) = x (i) for i < k. This turns out to be true automatically: Theorem. For the rank one algorithm applied to quadratic functions with Hessian symmetric Q, there are H k+1 g (i) = x (i), 0 i k for k = 0, 1,..., n 1. Xiaojing Ye, Math & Stat, Georgia State University 12

Proof. We here only need to show H k+1 g (i) = x (i) for i < k: H k+1 g (i) = Note that (H k + ( x(k) H k g (k) )( x (k) H k g (k) ) T g (k)t ( x (k) H k g (k) ) ) g (i) = x (i) + ( x(k) H k g (k) )( x (k) H k g (k) ) T g (i) g (k)t ( x (k) H k g (k) ) (H k g (k) ) T g (i) = g (k)t H k g (i) = g (k)t x (i) = x (k)t Q x (i) = x (k)t g (i) Hence the second term on the right is zero, and we obtain This completes the proof. H k g (i) = x (i) Xiaojing Ye, Math & Stat, Georgia State University 13

Issues with rank one correction formula: H k+1 may not be positive definite even if H k is. Hence H k g (k) may not be a descent direction; the denominator in the rank one correction is g (k)t ( x (k) H k g (k) ), which can be close to 0 and makes computation unstable. Xiaojing Ye, Math & Stat, Georgia State University 14

We now study the DFP algorithm which improves the rank one correction formula by ensuring positive definiteness of H k. DFP algoirthm [Davidson 1959, Fletcher and Powell 1963] H k+1 = H k + x(k) x (k)t x (k)t g (H k g (k) )(H k g (k) g (k)t H k g (k) (k) )T Xiaojing Ye, Math & Stat, Georgia State University 15

We first show that DFP is a quasi-newton method. Theorem. The DFP algorithm applied to quadratic functions satisfies H k+1 g (i) = x (i), 0 i k for all k. Xiaojing Ye, Math & Stat, Georgia State University 16

Proof. We prove this by induction. It is trivial for k = 0. Assume the claim is true for k, i.e., H k g (i) = x (i) for all i k 1. Now we first have H k+1 g (i) = x (i) for i = k by direct computation. For i < k, there is H k+1 g (i) = H k g (i) + x(k) ( x (k)t g (i) ) x (k)t g (k) (H k g (k) )(H k g (k) ) T g (i) g (k)t H k g (k) Note that due to assumption d (0),..., d (k) are Q-conjugate, and hence x (k)t g (i) = x (k)t Q x (i) = α k α i d (k)t Qd (i) = 0 similarly g (k)t H k g (i) = g (k)t x (i) = 0. This completes the proof. Xiaojing Ye, Math & Stat, Georgia State University 17

Next we show that H k+1 inherits positive definiteness of H k in DFP algorithm. Theorem. Suppose g (k) 0, then H k 0 implies H k+1 0 in DFP. Proof. For any x R n, there is (k) )2 x T H k+1 x = x T H k x + (xt x (k) ) 2 x (k)t g (xt H k g (k) g (k)t H k g (k) For notation simplicity, we denote a = H 1/2 k x and b = H 1/2 k where H k = H 1/2 k H 1/2 k (we know H 1/2 k g (k) exists since H k is SPD). Xiaojing Ye, Math & Stat, Georgia State University 18

Proof (cont). Now we have x T H k+1 x = a 2 b 2 (a T b) 2 b 2 + (xt x (k) ) 2 x (k)t g (k) Note also that x (k) = α k d (k) = α k H k g (k), therefore x (k)t g (k) = x (k)t (g (k+1) g (k) ) = x (k)t g (k) = α k g (k)t H k g (k) where we used d (k)t g (k+1) = 0 due to Q-conjugacy of d (k) in the second equality. Hence x T H k+1 x 0 since both terms on the right side are nonnegative. Xiaojing Ye, Math & Stat, Georgia State University 19

Proof (cont). Now we need to show that the two terms cannot be 0 simultaneously. Suppose the first term is 0, then a = βb for some scalar β. That is H 1/2 k x = βh 1/2 k g (k), or x = βh k g (k). In this case, there is (x T x (k) ) T = (β g (k)t x (k) ) 2 = (βα k g (k)t H k g (k) ) 2 > 0 and hence the second term is positive. This completes the proof. Xiaojing Ye, Math & Stat, Georgia State University 20

BFGS algorithm (named after Broyden, Fletcher, Goldfarb, Shannon) Instead of directly finding H k such that H k+1 g (i) = x (i) for 0 i k, the BFGS first find B k such that B k+1 x (i) = g (i), 0 i k Then replacing H k by B k and swapping x (k) and g (k) in DFP yield B k+1 = B k + g(k) g (k)t x (k)t g (B k x (k) )(B k x (k) x (k)t B k x (k) (k) )T Xiaojing Ye, Math & Stat, Georgia State University 21

Then the actual H k = B k 1 and hence H k+1 = (B k + g(k) g (k)t x (k)t g (B k x (k) )(B k x (k) x (k)t B k x (k) = H k + (1 + g(k)t H k g (k) ) x (k) x (k)t g (k)t x (k) x (k)t g (k) (k) ) 1 )T H k g (k) x (k)t + (H k g (k) x (k)t ) T g (k)t x (k) This is the update rule of H k in BFGS algorithm Xiaojing Ye, Math & Stat, Georgia State University 22

The inverse was obtained by applying the following result: Lemma. [Sherman-Morrison formula] Let A be a nonsingular matrix, and u and v are column vectors such that 1 + v T A 1 u 0, then A + uv T is nonsingular, and Proof. Direct computation. (A + uv T ) 1 = A 1 (A 1 u)(v T A 1 ) 1 + v T A 1 u Xiaojing Ye, Math & Stat, Georgia State University 23

BFGS algorithm: 1. Set k = 0; select x (0) and SPD H 0, and compute g (0) = f(x (0) ). 2. Repeat: d (k) = H (k) g (k) α k = arg min f(x (k) + αd (k) ) α 0 x (k+1) = x (k) + α k d (k) g (k+1) = f(x (k+1) ) Until g (k) = 0. H k+1 = H k + (Compute the BFGS update of H k ) k k + 1 Xiaojing Ye, Math & Stat, Georgia State University 24