Methods that avoid calculating the Hessian. Nonlinear Optimization; Steepest Descent, Quasi-Newton. Steepest Descent



Nonlinear Optimization: Steepest Descent and Quasi-Newton
Niclas Börlin, Department of Computing Science, Umeå University
December 6, 2007

Methods that avoid calculating the Hessian

A disadvantage with the Newton method is that the Hessian has to be
- derived and implemented, which may introduce blunders,
- calculated, which requires $O(n^2)$ operations,
- stored, which requires storage for $n^2/2$ elements,
- inverted, which requires $O(n^3)$ operations.

Hessian approximation

However, provided we use a global strategy, we may replace the Hessian in the Newton equation
$$\nabla^2 f(x_k)\, p = -\nabla f(x_k)$$
with any positive definite matrix $B_k$, i.e.
$$B_k\, p = -\nabla f(x_k),$$
and still get global convergence.

Some methods calculate Hessian approximations based on first- or second-order derivatives, e.g.
- Gauss-Newton: $B_k = J(x_k)^T J(x_k)$,
- Trust-region: $B_k = \nabla^2 f(x_k) + \lambda I$,
- Levenberg-Marquardt: $B_k = J(x_k)^T J(x_k) + \lambda I$.

Other methods calculate their Hessian approximations without any second-derivative information. Two such methods are Steepest Descent and Quasi-Newton.

Steepest Descent

The steepest-descent method uses the simplest Hessian approximation
$$B_k = I \quad\Rightarrow\quad p_k = -\nabla f(x_k).$$
This approximation produces a cheap calculation of the search direction.
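The steepest-descent iteration can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the quadratic test function, the Armijo backtracking line search standing in for the "global strategy", and all tolerances are assumptions chosen for the example.

```python
# Sketch: steepest descent (B_k = I) with Armijo backtracking.
# The test function f and every parameter value are assumptions for illustration.

def f(x):
    # Convex quadratic with condition number 10, minimum at the origin
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad(x):
    return [x[0], 10.0 * x[1]]

def steepest_descent(x, tol=1e-8, max_iter=10000):
    for k in range(max_iter):
        g = grad(x)
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 ** 0.5 <= tol:
            return x, k
        p = [-gi for gi in g]  # B_k = I gives p_k = -grad f(x_k)
        alpha = 1.0
        # Backtrack until the sufficient-decrease (Armijo) condition holds
        while (f([xi + alpha * pi for xi, pi in zip(x, p)])
               > f(x) - 1e-4 * alpha * gnorm2) and alpha > 1e-12:
            alpha *= 0.5
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x, max_iter

x_min, iterations = steepest_descent([1.0, 1.0])
```

No linear system is solved anywhere in the loop, which is what makes each iteration cheap; the price is the number of iterations needed.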

Convergence of Steepest Descent

However, the convergence rate is linear with a convergence constant $C$ bounded from above by
$$C_{\sup} = \left( \frac{\kappa(Q) - 1}{\kappa(Q) + 1} \right)^2,$$
where $Q = \nabla^2 f(x^*)$ is the true Hessian and $\kappa(Q)$ is the condition number of $Q$. The convergence rate will be poor even for moderate condition numbers.

[Table: values of $C$ for a range of $\kappa(Q)$.]

The steepest-descent method should only be used for problems that are known to be well-conditioned.

[Figure: Gauss-Newton and the steepest-descent method for the antelope problem ($\kappa(Q) \approx 500$).]

Properties of Quasi-Newton methods

Quasi-Newton methods use a sequence of symmetric positive definite matrices that approximate the Hessian (or the inverse Hessian). For every iteration, the next (inverse) Hessian approximation is calculated by updating the current approximation. Each update is constructed to include the curvature information computed in the last step, i.e. the next (inverse) Hessian should behave as the true (inverse) Hessian over the last step. This condition is called the secant condition.

The one-dimensional secant method uses the approximation
$$f''(x_k) \approx \frac{f'(x_k) - f'(x_{k-1})}{x_k - x_{k-1}}, \qquad f''(x_k)(x_k - x_{k-1}) \approx f'(x_k) - f'(x_{k-1}).$$
In multiple dimensions this corresponds to
$$\nabla^2 f(x_k)(x_k - x_{k-1}) \approx \nabla f(x_k) - \nabla f(x_{k-1}).$$
From this we obtain the secant equation
$$B_k (x_k - x_{k-1}) = \nabla f(x_k) - \nabla f(x_{k-1}).$$
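To get a feeling for how quickly the bound on $C$ approaches 1, the formula can be evaluated directly. The following is a small sketch; the function names and the chosen condition numbers are hypothetical, and the iteration estimate simply assumes that the error measure shrinks by exactly the factor $C_{\sup}$ in every iteration.

```python
import math

def convergence_bound(kappa):
    # Upper bound ((kappa - 1) / (kappa + 1))^2 on the linear convergence constant
    return ((kappa - 1.0) / (kappa + 1.0)) ** 2

def iterations_for_reduction(kappa, factor=1e-6):
    # Iterations needed to shrink the error by `factor` if every iteration
    # reduces it by exactly the bound (assumes kappa > 1, so that 0 < C < 1)
    c = convergence_bound(kappa)
    return math.ceil(math.log(factor) / math.log(c))

for kappa in (10.0, 100.0, 500.0):
    print(kappa, convergence_bound(kappa), iterations_for_reduction(kappa))
```

Even for a condition number of 10 the worst-case bound already implies dozens of iterations for six digits of reduction, which illustrates why the method is only recommended for well-conditioned problems.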

Properties of Quasi-Newton methods

With the substitutions
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),$$
we obtain
$$B_{k+1} s_k = y_k.$$
If $H_k = B_k^{-1}$, the secant equation becomes
$$H_{k+1} y_k = s_k.$$

The Hessian approximation $B_k$ has $n^2/2$ independent elements, but the secant equation has only $n$ equations. Thus, several updating schemes are possible.

An example of a Quasi-Newton update formula is
$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}.$$
This formula illustrates several key features of Quasi-Newton approximations.

The new approximation $B_{k+1}$ is found by updating the old approximation $B_k$. As a starting approximation $B_0 = I$ may be used, but if a better approximation is available at a small cost, it should be used.

The secant condition will be satisfied independently of how $B_k$ is chosen:
$$\begin{aligned}
B_{k+1} s_k &= B_k s_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^T}{(y_k - B_k s_k)^T s_k}\, s_k \\
&= B_k s_k + \frac{(y_k - B_k s_k)\left((y_k - B_k s_k)^T s_k\right)}{(y_k - B_k s_k)^T s_k} \\
&= B_k s_k + (y_k - B_k s_k) = y_k.
\end{aligned}$$

The new approximation $B_{k+1}$ can be obtained using $O(n^2)$ operations since the update only involves vector products.

The number of operations needed to calculate the search direction depends on the choice of update formula. If a $B_k$ update formula is used, solving $B_k p_k = -\nabla f(x_k)$ for the search direction needs $O(n^3)$ operations. Update formulas for the Cholesky factors $L L^T = B_k$ exist that reduce the number of operations to $O(n^2)$. Another $O(n^2)$ solution is to use an update formula for $H_k$. The search direction may then be calculated from
$$p_k = H_k(-\nabla f_k).$$
Thus, the total time complexity of one iteration of a Quasi-Newton method can be reduced to $O(n^2)$, compared to $O(n^3)$ for e.g. Newton and Trust-region methods.
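The example update above is commonly known as the symmetric rank-one (SR1) formula. A small pure-Python sketch can confirm numerically that the updated matrix satisfies the secant equation. The helper names, the example vectors, and the safeguard that skips the update when the denominator is close to zero are assumptions added for the illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def sr1_update(B, s, y):
    # r = y - B s; the update is B + r r^T / (r^T s)
    r = [yi - bi for yi, bi in zip(y, matvec(B, s))]
    denom = dot(r, s)
    # Assumed safeguard: skip the update when the denominator is tiny
    if abs(denom) <= 1e-8 * (dot(r, r) * dot(s, s)) ** 0.5:
        return [row[:] for row in B]
    n = len(s)
    return [[B[i][j] + r[i] * r[j] / denom for j in range(n)] for i in range(n)]

B0 = [[1.0, 0.0], [0.0, 1.0]]   # starting approximation B_0 = I
s = [1.0, 0.0]                  # hypothetical step
y = [2.0, 1.0]                  # hypothetical gradient difference
B1 = sr1_update(B0, s, y)
secant_residual = [a - b for a, b in zip(matvec(B1, s), y)]
```

The update touches every matrix entry once and uses only one matrix-vector product, which is consistent with the $O(n^2)$ cost stated above.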

Common Quasi-Newton formulas

The most widely used Quasi-Newton formula is known as BFGS (Broyden-Fletcher-Goldfarb-Shanno):
$$B_{k+1} = B_k - \frac{(B_k s_k)(B_k s_k)^T}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}.$$
BFGS preserves symmetry and positive definiteness if $y_k^T s_k > 0$, which may be satisfied with a line search using the Wolfe conditions. The corresponding update formula for $H_k$ is
$$H_{k+1} = (I - \rho_k s_k y_k^T)\, H_k\, (I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \quad \text{where } \rho_k = 1/(y_k^T s_k).$$

One of the first Quasi-Newton formulas was DFP (Davidon-Fletcher-Powell):
$$B_{k+1} = (I - \rho_k y_k s_k^T)\, B_k\, (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T, \quad \text{where } \rho_k = 1/(y_k^T s_k),$$
and
$$H_{k+1} = H_k - \frac{(H_k y_k)(H_k y_k)^T}{y_k^T H_k y_k} + \frac{s_k s_k^T}{y_k^T s_k}.$$

[Figure: Gauss-Newton and BFGS on the antelope problem, with the successive approximations $B_0, B_1, \ldots$ listed alongside.]

Properties of Quasi-Newton methods

The BFGS method has super-linear convergence, i.e. faster than linear but slower than quadratic. The deficit compared to quadratically convergent methods usually shows only in the last few iterations. Thus, for many practical applications, BFGS converges as fast as a Newton method, i.e. it requires approximately the same number of iterations. Since the BFGS search direction can be calculated in $n^2$ time vs. $n^3$ time for the Newton method, the required execution time is usually substantially less than for Newton.
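A compact sketch of a BFGS iteration that updates the inverse approximation $H_k$ directly is given below. It is an illustration under stated assumptions, not the lecture's code: the quadratic test function is made up, a plain Armijo backtracking search replaces a full Wolfe line search, and the update is simply skipped whenever $y_k^T s_k$ is not safely positive.

```python
def f(x):
    # Assumed test problem: convex quadratic with minimum at the origin
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)

def grad(x):
    return [x[0], 10.0 * x[1]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def bfgs_inverse_update(H, s, y):
    # Expanded form of (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    # for symmetric H; costs O(n^2) operations
    rho = 1.0 / dot(y, s)
    Hy = matvec(H, y)
    c = rho * rho * dot(y, Hy) + rho
    n = len(s)
    return [[H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + c * s[i] * s[j]
             for j in range(n)] for i in range(n)]

def bfgs(x, tol=1e-8, max_iter=200):
    n = len(x)
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for k in range(max_iter):
        if dot(g, g) ** 0.5 <= tol:
            return x, k
        p = [-v for v in matvec(H, g)]      # p_k = H_k (-grad f_k)
        slope = dot(g, p)
        alpha = 1.0
        # Armijo backtracking (assumption: stands in for a Wolfe line search)
        while (f([xi + alpha * pi for xi, pi in zip(x, p)])
               > f(x) + 1e-4 * alpha * slope) and alpha > 1e-12:
            alpha *= 0.5
        x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(y, s) > 1e-12:               # keep H positive definite
            H = bfgs_inverse_update(H, s, y)
        x, g = x_new, g_new
    return x, max_iter

x_min, iterations = bfgs([1.0, 1.0])
```

Working with $H_k$ means the search direction is a single matrix-vector product, so no linear system has to be solved in any iteration.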

Termination criteria

Mathematically, a minimization algorithm should terminate with a solution $x_k$ satisfying $\nabla f(x_k) = 0$ and $\nabla^2 f(x_k)$ positive semi-definite. Due to e.g. the use of finite arithmetic, no algorithm is guaranteed to satisfy this condition in finite time. Instead we will rely on termination rules based on thresholds on e.g. $\|\nabla f(x_k)\|$.

Consider the absolute termination criterion
$$\|\nabla f(x_k)\| \le \epsilon, \quad \text{for some } \epsilon > 0,$$
i.e. we stop when the norm of the gradient falls below a threshold. This condition is scale dependent, since a change of unit in $f$ would rescale $\nabla f$ and affect the strength of the condition. E.g. a change of unit from km to m would correspond to a rescaling of $f$.

A small modification of the absolute termination criterion leads to the relative termination criterion
$$\|\nabla f(x_k)\| \le \epsilon\, |f(x_k)|,$$
which is not scale dependent. However, if $f(x^*) \approx 0$, the relative test will be difficult or even impossible to satisfy due to round-off errors.

A possible combination is
$$\|\nabla f(x_k)\| \le \epsilon\, (1 + |f(x_k)|).$$
This test will behave like an absolute test if $f(x_k) \approx 0$, and otherwise like a relative test.

For least squares problems
$$\min_x f(x) = \frac{1}{2}\, r(x)^T r(x),$$
the test
$$\|Jp\| = \|J (J^T J)^{-1} J^T r\| \le \epsilon\, (1 + \|r\|)$$
may be used instead of the gradient test. Since $Jp$ and $r$ belong to the same vector space, the test may be interpreted geometrically. The quotient
$$\frac{\|Jp\|}{\|r\|} = \cos \alpha$$
describes the angle $\alpha$ between the residual $r$ and the tangent plane at $r$.
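The three gradient-based tests are easy to compare side by side. The sketch below uses hypothetical function names and made-up numbers; it is only meant to show where the absolute and relative tests disagree and how the combined test behaves in both regimes.

```python
def absolute_test(grad_norm, f_value, eps=1e-8):
    # Stop when ||grad f|| <= eps (f_value is unused, kept for a uniform signature)
    return grad_norm <= eps

def relative_test(grad_norm, f_value, eps=1e-8):
    # Stop when ||grad f|| <= eps * |f|
    return grad_norm <= eps * abs(f_value)

def combined_test(grad_norm, f_value, eps=1e-8):
    # Stop when ||grad f|| <= eps * (1 + |f|)
    return grad_norm <= eps * (1.0 + abs(f_value))

# Case 1: f is essentially zero at the solution; the relative test cannot pass
near_zero = (absolute_test(1e-9, 0.0), relative_test(1e-9, 0.0), combined_test(1e-9, 0.0))

# Case 2: f is large; the absolute test is too strict for this scaling
large_f = (absolute_test(1e-3, 1e6), relative_test(1e-3, 1e6), combined_test(1e-3, 1e6))
```

In the first case the combined test agrees with the absolute test, in the second with the relative test, which matches the behavior described above.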

Close to the solution the residual will approach orthogonality with the tangent plane, i.e.
$$\alpha \to \pi/2 \quad \text{and} \quad \frac{\|Jp\|}{\|r\|} \to 0.$$
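The geometric interpretation can be checked on a tiny example. The sketch below is restricted, as an assumption to keep it short, to Jacobians with exactly two linearly independent columns, so that the normal equations can be solved by hand; the function name and the sample data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cos_alpha(J, r):
    # Projects r onto the column space of J (two columns assumed) and
    # returns ||J (J^T J)^{-1} J^T r|| / ||r||
    a = [row[0] for row in J]
    b = [row[1] for row in J]
    aa, ab, bb = dot(a, a), dot(a, b), dot(b, b)
    ar, br = dot(a, r), dot(b, r)
    det = aa * bb - ab * ab              # determinant of J^T J
    c0 = (bb * ar - ab * br) / det       # solve the 2x2 normal equations
    c1 = (aa * br - ab * ar) / det
    proj = [c0 * ai + c1 * bi for ai, bi in zip(a, b)]
    return dot(proj, proj) ** 0.5 / dot(r, r) ** 0.5

J = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
orthogonal = cos_alpha(J, [1.0, -2.0, 1.0])   # residual orthogonal to both columns
in_plane = cos_alpha(J, [1.0, 1.0, 1.0])      # residual equal to the first column
```

A residual orthogonal to the columns of $J$ gives a quotient of 0, the situation approached close to the solution, while a residual lying in the column space gives a quotient of 1.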

5 Quasi-Newton Methods

Unconstrained Convex Optimization. If the Hessian is unavailable... Notation: $H$ = Hessian matrix. $B$ is the approximation of $H$. $C$ is the approximation of $H^{-1}$. Problem: Solve min