2. Quasi-Newton methods


L. Vandenberghe, EE236C (Spring 2016)

2. Quasi-Newton methods

- variable metric methods
- quasi-Newton methods
- BFGS update
- limited-memory quasi-Newton methods

Newton method for unconstrained minimization

    minimize  f(x)

f convex, twice continuously differentiable

Newton method:

    x^+ = x - t \nabla^2 f(x)^{-1} \nabla f(x)

advantages: fast convergence, affine invariance

disadvantages: requires second derivatives; solving the Newton linear system can be too expensive for large-scale applications
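For reference, a minimal NumPy sketch of one Newton step (grad and hess are assumed user-supplied callables returning the gradient vector and Hessian matrix; the names are illustrative, not from the lecture). The O(n^3) linear solve in this one line is exactly what quasi-Newton methods are designed to avoid:

```python
import numpy as np

def newton_step(x, t, grad, hess):
    # Solve (Hessian) dx = -gradient rather than forming the inverse.
    dx = np.linalg.solve(hess(x), -grad(x))
    return x + t * dx
```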

Variable metric methods

    x^+ = x - t H^{-1} \nabla f(x)

H \succ 0 is an approximation of the Hessian at x, chosen to:

- avoid calculation of second derivatives
- simplify computation of the search direction

Variable metric interpretation (EE236B, lecture 10, page 11):

    \Delta x = -H^{-1} \nabla f(x)

is the steepest descent direction at x for the quadratic norm

    \|z\|_H = (z^T H z)^{1/2}

Quasi-Newton methods

given starting point x^{(0)} \in dom f, H_0 \succ 0

repeat for k = 1, 2, ...

1. compute quasi-Newton direction \Delta x = -H_{k-1}^{-1} \nabla f(x^{(k-1)})
2. determine step size t (e.g., by backtracking line search)
3. compute x^{(k)} = x^{(k-1)} + t \Delta x
4. compute H_k

- different methods use different rules for updating H_k in step 4
- can also propagate H_k^{-1} to simplify calculation of \Delta x
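A minimal NumPy sketch of this loop, assuming the update rule for H_k^{-1} in step 4 is supplied as a callable (the BFGS choice appears on the next slide); all names here are illustrative:

```python
import numpy as np

def quasi_newton(f, grad, x0, update, alpha=0.25, beta=0.5, tol=1e-8, maxiter=500):
    x = x0.copy()
    Hinv = np.eye(len(x0))                    # initial inverse approximation H_0^{-1}
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = -Hinv @ g                        # 1. quasi-Newton direction
        t = 1.0                               # 2. backtracking line search
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta
        x_new = x + t * dx                    # 3. take the step
        s, y = x_new - x, grad(x_new) - g
        Hinv = update(Hinv, s, y)             # 4. update the approximation
        x = x_new
    return x
```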

Broyden-Fletcher-Goldfarb-Shanno (BFGS) update

BFGS update:

    H_k = H_{k-1} + \frac{y y^T}{y^T s} - \frac{H_{k-1} s s^T H_{k-1}}{s^T H_{k-1} s}

where

    s = x^{(k)} - x^{(k-1)}, \qquad y = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})

Inverse update:

    H_k^{-1} = \left(I - \frac{s y^T}{y^T s}\right) H_{k-1}^{-1} \left(I - \frac{y s^T}{y^T s}\right) + \frac{s s^T}{y^T s}

- note that y^T s > 0 for strictly convex f; see page 1-9
- cost of update or inverse update is O(n^2) operations
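Both formulas translate directly into NumPy; a sketch (only O(n^2) outer products and matrix-vector products appear — no linear system is solved):

```python
import numpy as np

def bfgs_update(H, s, y):
    # BFGS update of the Hessian approximation H_{k-1} -> H_k.
    Hs = H @ s
    return H + np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)

def bfgs_inverse_update(Hinv, s, y):
    # Matching update of the inverse approximation H_{k-1}^{-1} -> H_k^{-1}.
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ Hinv @ V + rho * np.outer(s, s)
```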

Positive definiteness

if y^T s > 0, the BFGS update preserves positive definiteness of H_k

Proof: from the inverse update formula,

    v^T H_k^{-1} v = \left(v - \frac{s^T v}{s^T y} y\right)^T H_{k-1}^{-1} \left(v - \frac{s^T v}{s^T y} y\right) + \frac{(s^T v)^2}{y^T s}

- if H_{k-1} \succ 0, both terms are nonnegative for all v
- the second term is zero only if s^T v = 0; then the first term is zero only if v = 0

this ensures that \Delta x = -H_k^{-1} \nabla f(x^{(k)}) is a descent direction
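A quick numerical spot check of this fact, reusing the bfgs_update sketch above with random data satisfying the curvature condition y^T s > 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)        # a positive definite H_{k-1}
s = rng.standard_normal(n)
y = rng.standard_normal(n)
if y @ s < 0:
    y = -y                         # enforce y^T s > 0

H_new = bfgs_update(H, s, y)
print(np.linalg.eigvalsh(H_new).min() > 0)   # True: H_k stays positive definite
```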

Secant condition

the BFGS update satisfies the secant condition H_k s = y, i.e.,

    H_k (x^{(k)} - x^{(k-1)}) = \nabla f(x^{(k)}) - \nabla f(x^{(k-1)})

Interpretation: define the second-order approximation at x^{(k)}

    f_{quad}(z) = f(x^{(k)}) + \nabla f(x^{(k)})^T (z - x^{(k)}) + \frac{1}{2} (z - x^{(k)})^T H_k (z - x^{(k)})

the secant condition implies that the gradient of f_{quad} agrees with the gradient of f at x^{(k-1)}:

    \nabla f_{quad}(x^{(k-1)}) = \nabla f(x^{(k)}) + H_k (x^{(k-1)} - x^{(k)}) = \nabla f(x^{(k-1)})
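The secant condition is equally easy to verify numerically, reusing H, s, y, and H_new from the spot check on the previous slide:

```python
print(np.linalg.norm(H_new @ s - y))   # ~1e-14: H_k s = y holds to rounding error
```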

Secant method

for f : R \to R, BFGS with unit step size gives the secant method

    x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{H_k}, \qquad H_k = \frac{f'(x^{(k)}) - f'(x^{(k-1)})}{x^{(k)} - x^{(k-1)}}

[figure: f'(z) and its secant approximation f'_{quad}(z), with iterates x^{(k-1)}, x^{(k)}, x^{(k+1)}]
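A sketch of the resulting one-dimensional iteration (fprime is the user's f'; the method finds a zero of f'):

```python
def secant(fprime, x0, x1, tol=1e-10, maxiter=100):
    g0, g1 = fprime(x0), fprime(x1)
    for _ in range(maxiter):
        H = (g1 - g0) / (x1 - x0)      # scalar Hessian approximation H_k
        x0, x1 = x1, x1 - g1 / H       # BFGS step with unit step size
        g0, g1 = g1, fprime(x1)
        if abs(g1) < tol:
            break
    return x1

# example: minimize f(x) = x^4 by solving f'(x) = 4x^3 = 0
print(secant(lambda x: 4 * x**3, 1.0, 0.9))
```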

Convergence

Global result: if f is strongly convex, BFGS with backtracking line search (EE236B, lecture 10-6) converges from any x^{(0)} and any H_0 \succ 0

Local convergence: if f is strongly convex and \nabla^2 f(x) is Lipschitz continuous, local convergence is superlinear: for sufficiently large k,

    \|x^{(k+1)} - x^\star\|_2 \le c_k \|x^{(k)} - x^\star\|_2, \qquad \text{where } c_k \to 0

(cf. the quadratic local convergence of Newton's method)

Example

    minimize  c^T x - \sum_{i=1}^m \log(b_i - a_i^T x)

n = 100, m = 500

[figure: f(x^{(k)}) - f^\star versus iteration k, for Newton and for BFGS]

- cost per Newton iteration: O(n^3) plus computing \nabla^2 f(x)
- cost per BFGS iteration: O(n^2)
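A self-contained version of this experiment can be assembled from the earlier sketches (random problem data, not the instance behind the original figure; returning +inf outside dom f makes the backtracking line search reject infeasible steps):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 500
A = rng.standard_normal((m, n))
b = rng.uniform(1.0, 2.0, m)       # x = 0 is strictly feasible
c = rng.standard_normal(n)

def f(x):
    slack = b - A @ x
    return np.inf if np.any(slack <= 0) else c @ x - np.log(slack).sum()

def grad(x):
    return c + A.T @ (1.0 / (b - A @ x))

x_bfgs = quasi_newton(f, grad, np.zeros(n), update=bfgs_inverse_update)
print(f(x_bfgs))
```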

Square root BFGS update

to improve numerical stability, propagate H_k in factored form H_k = L_k L_k^T

if H_{k-1} = L_{k-1} L_{k-1}^T, then H_k = L_k L_k^T with

    L_k = L_{k-1} \left(I + \frac{(\alpha \tilde{y} - \tilde{s}) \tilde{s}^T}{\tilde{s}^T \tilde{s}}\right)

where

    \tilde{y} = L_{k-1}^{-1} y, \qquad \tilde{s} = L_{k-1}^T s, \qquad \alpha = \left(\frac{\tilde{s}^T \tilde{s}}{y^T s}\right)^{1/2}

if L_{k-1} is triangular, the cost of reducing L_k to triangular form is O(n^2)
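A sketch of the factored update; it returns L_k before re-triangularization (restoring triangular form in O(n^2), e.g. with Givens rotations, is not shown):

```python
import numpy as np

def sqrt_bfgs_update(L, s, y):
    # Given H_{k-1} = L L^T, return L_k with H_k = L_k L_k^T.
    y_t = np.linalg.solve(L, y)    # ytilde = L^{-1} y
    s_t = L.T @ s                  # stilde = L^T s
    alpha = np.sqrt((s_t @ s_t) / (y @ s))
    n = len(s)
    return L @ (np.eye(n) + np.outer(alpha * y_t - s_t, s_t) / (s_t @ s_t))
```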

Optimality of BFGS update

X = H_k solves the convex optimization problem

    minimize    tr(H_{k-1}^{-1} X) - \log\det(H_{k-1}^{-1} X) - n
    subject to  Xs = y

- the cost function is nonnegative, and equal to zero only if X = H_{k-1}
- it is also known as the relative entropy between the densities N(0, X) and N(0, H_{k-1})
- the optimality result follows from the KKT conditions: X = H_k satisfies

    X^{-1} = H_{k-1}^{-1} - \frac{1}{2}(s \nu^T + \nu s^T), \qquad Xs = y, \qquad X \succ 0

  with

    \nu = \frac{1}{s^T y} \left( 2 H_{k-1}^{-1} y - \left(1 + \frac{y^T H_{k-1}^{-1} y}{y^T s}\right) s \right)

Davidon-Fletcher-Powell (DFP) update

switch H_{k-1} and X in the objective on the previous page

    minimize    tr(H_{k-1} X^{-1}) - \log\det(H_{k-1} X^{-1}) - n
    subject to  Xs = y

- minimizes the relative entropy between N(0, H_{k-1}) and N(0, X)
- the problem is convex in X^{-1} (with the constraint written as s = X^{-1} y)
- the solution is the "dual" of the BFGS formula:

    H_k = \left(I - \frac{y s^T}{y^T s}\right) H_{k-1} \left(I - \frac{s y^T}{y^T s}\right) + \frac{y y^T}{y^T s}

  (known as the DFP update)
- predates the BFGS update, but is less often used
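In code, the DFP formula mirrors the inverse BFGS update with the roles of s and y interchanged; a sketch:

```python
import numpy as np

def dfp_update(H, s, y):
    # DFP update of the Hessian approximation H_{k-1} -> H_k.
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V.T @ H @ V + rho * np.outer(y, y)
```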

Limited-memory quasi-Newton methods

main disadvantage of quasi-Newton methods is the need to store H_k or H_k^{-1}

Limited-memory BFGS (L-BFGS): do not store H_k^{-1} explicitly

- instead, store the m (e.g., m = 30) most recent values of

    s_j = x^{(j)} - x^{(j-1)}, \qquad y_j = \nabla f(x^{(j)}) - \nabla f(x^{(j-1)})

- evaluate \Delta x = -H_k^{-1} \nabla f(x^{(k)}) recursively, using

    H_j^{-1} = \left(I - \frac{s_j y_j^T}{y_j^T s_j}\right) H_{j-1}^{-1} \left(I - \frac{y_j s_j^T}{y_j^T s_j}\right) + \frac{s_j s_j^T}{y_j^T s_j}

  for j = k, k-1, ..., k-m+1, assuming, for example, H_{k-m}^{-1} = I

- cost per iteration is O(nm); storage is O(nm)
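The standard O(nm) way to evaluate this recursion is the two-loop recursion described in chapter 7 of Nocedal and Wright (see the references below); a sketch, with s_list and y_list ordered from oldest to newest pair:

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    # Returns the search direction -H_k^{-1} g using only the stored
    # (s_j, y_j) pairs, with H_{k-m}^{-1} = I as the starting matrix.
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    r = q                                                  # r = H_{k-m}^{-1} q
    for s, y, a in zip(s_list, y_list, reversed(alphas)):  # oldest to newest
        b = (y @ r) / (y @ s)
        r = r + (a - b) * s
    return -r
```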

References

- J. Nocedal and S. J. Wright, Numerical Optimization (2006), chapters 6 and 7
- J. E. Dennis and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations (1983)
