Lecture 7: Unconstrained nonlinear programming
Weinan E (Department of Mathematics, Princeton University, weinan@princeton.edu)
Tiejun Li (School of Mathematical Sciences, Peking University, tieli@pku.edu.cn)
Outline Application examples
Energy minimization: virtual drug design
Virtual drug design seeks the best position of a ligand (a small molecule) interacting with a large target protein molecule. It is equivalent to an energy minimization problem.
Energy minimization: protein folding
Protein folding is the problem of finding the minimal-energy state of a protein molecule from its sequence. It is an outstanding open problem for global optimization in molecular mechanics.
Energy minimization: mathematical formulation
Molecular force field:
V_total = Σ_i (k_{r_i}/2)(r_i − r_{i0})^2 + Σ_i (k_{θ_i}/2)(θ_i − θ_{i0})^2 + Σ_i (V_{n_i}/2)(1 + cos(nφ_i − γ_i)) + Σ_{ij} 4ε_{ij}[(σ_{ij}/r_{ij})^{12} − (σ_{ij}/r_{ij})^6] + Σ_{ij} q_i q_j/(ε r_{ij})
Energy minimization problem with respect to the configuration of all the atoms:
min V_total(x_1, ..., x_N)
Nonlinear least squares
Suppose that we have a series of experimental data (t_i, y_i), i = 1, ..., m. We wish to find parameters x ∈ R^n such that the residuals
r_i(x) = y_i − f(t_i, x), i = 1, ..., m
are minimized. Mathematically, define the error function
φ(x) = (1/2) r(x)^T r(x), where r = (r_1, ..., r_m),
and solve min_x φ(x). Because the model function f is nonlinear in x, this is called a nonlinear least squares problem.
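As a small illustration (a hypothetical fitting model and made-up data, not part of the lecture), the residual vector and the error function φ could be set up as follows, here for f(t, x) = x_1 e^{x_2 t}:

```python
import numpy as np

# Hypothetical model f(t, x) = x1 * exp(x2 * t) and made-up data, for illustration only
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 0.7, 0.3, 0.1])

def model(t, x):
    return x[0] * np.exp(x[1] * t)

def residual(x):
    # r_i(x) = y_i - f(t_i, x)
    return y - model(t, x)

def phi(x):
    # phi(x) = 1/2 r(x)^T r(x)
    r = residual(x)
    return 0.5 * r @ r

print(phi(np.array([2.0, -1.0])))   # error for one particular parameter guess
```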
Optimal control problem
Classical optimal control problem:
min ∫_0^T f(x, u) dt
such that the constraint
dx/dt = g(x, u), x(0) = x_0, x(T) = x_T
is satisfied. Here u(t) is the control function and x(t) is the output. It is a nonlinear optimization problem in function space.
Optimal control problem
Example: isoperimetric problem.
max_u ∫_0^1 x_1(t) dt
subject to
dx_1/dt = u, dx_2/dt = √(1 + u^2),
x_1(0) = x_1(1) = 0, x_2(0) = 0, x_2(1) = π/3.
Outline Application examples
Iterations
Iterative methods. Objective: construct a sequence {x^k}_{k≥1} such that x^k converges to a fixed vector x*, and x* is the solution of the problem.
General iteration idea: if we want to solve the equation g(x) = 0, and the equation x = f(x) has the same solution, then construct the iteration x^{k+1} = f(x^k). If x^k → x*, then x* = f(x*), and thus the root of g(x) is obtained.
Convergence order
Suppose an iterative sequence satisfies lim_{n→∞} x_n = x* and |x_n − x*| ≤ ε_n, where ε_n is called the error bound. If lim_{n→∞} ε_{n+1}/ε_n = C, then
1. if 0 < C < 1, {x_n} is called linearly convergent, e.g. q, q^2, q^3, ..., q^n, ... (|q| < 1);
2. if C = 1, {x_n} is called sublinearly convergent, e.g. 1, 1/2, 1/3, ..., 1/n, ...;
3. if C = 0, {x_n} is called superlinearly convergent, e.g. 1, 1/2!, 1/3!, ..., 1/n!, ...
Convergence order
If lim_{n→∞} ε_{n+1}/ε_n^p = C with C > 0 and p > 1, then {x_n} is said to converge with order p, e.g. q, q^p, q^{p^2}, ..., q^{p^n}, ...
Numerical examples for different convergence orders are sketched below.
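A minimal numerical sketch (not the lecture's own demo) of how the error, and hence the number of significant digits, evolves for linear versus second-order convergence:

```python
import numpy as np

q = 0.1                                        # initial error
linear = [q * 0.5**n for n in range(6)]        # eps_{n+1} = C * eps_n with C = 0.5
quadratic = [q]
for n in range(5):                             # eps_{n+1} = eps_n^2 (p = 2, C = 1)
    quadratic.append(quadratic[-1] ** 2)

for name, errs in [("linear", linear), ("quadratic", quadratic)]:
    digits = [-np.log10(e) for e in errs]      # significant digits ~ -log10(error)
    print(name, ["%.1f" % d for d in digits])
```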
Remark on p-th order convergence
If p = 1 (linear convergence), the number of significant digits increases linearly, e.g. 2, 3, 4, 5, ...
If p > 1, the number of significant digits increases exponentially, like O(p^n). For p = 2 the number of significant digits grows as 2, 4, 8, 16, ..., so a very accurate result is obtained after only 4-5 iterations.
Golden section method
Suppose we have a triplet (a, x_k, c) with f(x_k) < f(a) and f(x_k) < f(c); we want to choose a point x_{k+1} in (a, c) to perform a section. Suppose x_{k+1} lies in (a, x_k).
If f(x_{k+1}) > f(x_k), the new search interval is (x_{k+1}, c);
if f(x_{k+1}) < f(x_k), the new search interval is (a, x_k).
Golden section method
Define
w = (x_k − a)/(c − a), 1 − w = (c − x_k)/(c − a), and z = (x_k − x_{k+1})/(c − a).
If we want to minimize the worst-case interval length (over the two cases), we must make w = z + (1 − w) (with w > 1/2).
Note that w was itself obtained by applying the same strategy at the previous stage. This scale similarity implies z/w = 1 − w, hence
w = (√5 − 1)/2 ≈ 0.618.
This is called the golden section method.
Golden section method
The golden section method finds a local minimum of a function f. It is a linearly convergent method, with contraction coefficient C = 0.618.
Golden section method for the example
min φ(x) = 0.5 − x e^{−x^2}
with a = 0, c = 2.
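A minimal golden section search sketch for this example, assuming the objective is φ(x) = 0.5 − x e^{−x^2} on [0, 2] as reconstructed above:

```python
import numpy as np

def phi(x):
    return 0.5 - x * np.exp(-x**2)

def golden_section(f, a, c, tol=1e-6):
    w = (np.sqrt(5) - 1) / 2          # 0.618...
    x1 = c - w * (c - a)              # two interior points placed symmetrically
    x2 = a + w * (c - a)
    f1, f2 = f(x1), f(x2)
    while c - a > tol:
        if f1 < f2:                   # minimum lies in (a, x2)
            c, x2, f2 = x2, x1, f1
            x1 = c - w * (c - a)
            f1 = f(x1)
        else:                         # minimum lies in (x1, c)
            a, x1, f1 = x1, x2, f2
            x2 = a + w * (c - a)
            f2 = f(x2)
    return 0.5 * (a + c)

print(golden_section(phi, 0.0, 2.0))  # approx 0.707 = 1/sqrt(2)
```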
One-dimensional Newton's method
Suppose we want to minimize φ(x): min_x φ(x).
Taylor expansion at the current iteration point x_0:
φ(x) = φ(x_0) + φ'(x_0)(x − x_0) + (1/2) φ''(x_0)(x − x_0)^2 + ···
Local quadratic approximation:
φ(x) ≈ g(x) = φ(x_0) + φ'(x_0)(x − x_0) + (1/2) φ''(x_0)(x − x_0)^2
Minimizing g(x) via g'(x) = 0 gives x_1 = x_0 − φ'(x_0)/φ''(x_0).
Newton's method: x_{k+1} = x_k − φ'(x_k)/φ''(x_k).
One-dimensional Newton's method
Graphical explanation: at each step the function is replaced by its local parabolic approximation at x_k, and x_{k+1} is taken as the minimizer of that parabola.
Example: min φ(x) = 0.5 − x e^{−x^2} with x_0 = 0.5.
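A sketch of the one-dimensional Newton iteration for this example; the derivatives below are worked out by hand for the reconstructed φ and should be read as an illustration rather than the lecture's code:

```python
import numpy as np

# phi(x) = 0.5 - x*exp(-x^2); derivatives computed by hand
def dphi(x):
    return (2 * x**2 - 1) * np.exp(-x**2)

def d2phi(x):
    return 2 * x * (3 - 2 * x**2) * np.exp(-x**2)

x = 0.5                          # initial point from the slide
for k in range(20):
    step = dphi(x) / d2phi(x)
    x -= step                    # x_{k+1} = x_k - phi'(x_k)/phi''(x_k)
    if abs(step) < 1e-12:
        break
print(x)                         # approx 1/sqrt(2) = 0.7071...
```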
One-dimensional Newton's method
Theorem. If φ''(x*) ≠ 0, then Newton's method converges with second order provided x_0 is sufficiently close to x*.
Drawbacks of Newton's method:
1. one needs to compute the second-order derivative, which is a huge cost (especially in the high-dimensional case);
2. the initial point x_0 must be very close to x*.
High-dimensional Newton's method
Suppose we want to minimize f(x), x ∈ R^n: min_x f(x).
Taylor expansion at the current iteration point x_0:
f(x) = f(x_0) + ∇f(x_0)^T (x − x_0) + (1/2)(x − x_0)^T ∇^2 f(x_0)(x − x_0) + ···
Local quadratic approximation:
f(x) ≈ g(x) = f(x_0) + ∇f(x_0)^T (x − x_0) + (1/2)(x − x_0)^T H_f(x_0)(x − x_0),
where H_f is the Hessian matrix defined by (H_f)_{ij} = ∂^2 f/∂x_i ∂x_j.
Minimizing g(x) via ∇g(x) = 0 gives x_1 = x_0 − H_f(x_0)^{−1} ∇f(x_0).
Newton's method: x_{k+1} = x_k − H_f(x_k)^{−1} ∇f(x_k).
High-dimensional Newton's method
Example: min f(x_1, x_2) = 100(x_2 − x_1^2)^2 + (x_1 − 1)^2, initial state x_0 = (−1.2, 1).
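A minimal sketch of Newton's method on this Rosenbrock example, with the gradient and Hessian written out explicitly; this is a pure Newton iteration without any line search, which happens to work from this starting point:

```python
import numpy as np

def grad(x):
    x1, x2 = x
    return np.array([-400 * x1 * (x2 - x1**2) + 2 * (x1 - 1),
                      200 * (x2 - x1**2)])

def hess(x):
    x1, x2 = x
    return np.array([[1200 * x1**2 - 400 * x2 + 2, -400 * x1],
                     [-400 * x1,                    200.0]])

x = np.array([-1.2, 1.0])                   # initial state from the slide
for k in range(50):
    s = np.linalg.solve(hess(x), -grad(x))  # Newton step: H_f(x_k) s = -grad f(x_k)
    x = x + s
    if np.linalg.norm(s) < 1e-10:
        break
print(k, x)                                 # converges to the minimizer (1, 1)
```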
Steepest descent method
Basic idea: find a sequence of descent directions p_k and corresponding step sizes α_k such that the iterates x_{k+1} = x_k + α_k p_k satisfy f(x_{k+1}) ≤ f(x_k).
The negative gradient −∇f is the steepest descent direction, so choose p_k := −∇f(x_k) and choose α_k from min_α f(x_k + α p_k).
Inexact line search
Finding α from min_α f(x_k + α p_k) amounts to a one-dimensional minimization. In practice it is enough to find an approximate α by an inexact line search, which only requires a descent criterion of the type
f(x_k) − f(x_{k+1}) ≥ ε > 0
to be satisfied.
Inexact line search
An example of an inexact line search strategy, the half increment (or decrement) method:
[a_0, b_0] = [0, +∞), α_0 = 1;
[a_1, b_1] = [0, 1], α_1 = 1/2;
[a_2, b_2] = [0, 1/2], α_2 = 1/4;
[a_3, b_3] = [1/4, 1/2], α_3 = 3/8; ...
Steepest descent method
Steepest descent method for the example
min f(x_1, x_2) = 100(x_2 − x_1^2)^2 + (x_1 − 1)^2
with initial state x_0 = (−1.2, 1).
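A sketch of steepest descent on this example, using a simple step-halving inexact line search with a sufficient-decrease test, in the spirit of the halving strategy above; expect very slow convergence here:

```python
import numpy as np

def f(x):
    return 100 * (x[1] - x[0]**2)**2 + (x[0] - 1)**2

def grad(x):
    return np.array([-400 * x[0] * (x[1] - x[0]**2) + 2 * (x[0] - 1),
                      200 * (x[1] - x[0]**2)])

def halving_search(x, p):
    # halve the step until a sufficient decrease is obtained
    alpha = 1.0
    slope = grad(x) @ p
    while f(x + alpha * p) > f(x) + 1e-4 * alpha * slope:
        alpha *= 0.5
        if alpha < 1e-16:
            break
    return alpha

x = np.array([-1.2, 1.0])
for k in range(20000):
    p = -grad(x)                        # steepest descent direction
    if np.linalg.norm(p) < 1e-6:
        break
    x = x + halving_search(x, p) * p
print(k, x, f(x))                       # slow linear convergence toward (1, 1)
```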
Damped Newton's method
If the initial point of Newton's method is not near the minimum, one strategy is to apply the damped Newton's method: choose the descent direction as the Newton direction
p_k := −H_f^{−1}(x_k) ∇f(x_k)
and perform an inexact line search for min_α f(x_k + α p_k).
Conjugate gradient method
Recall the conjugate gradient method for the quadratic function φ(x) = (1/2) x^T A x − b^T x.
1. Initial step: x_0, p_0 = r_0 = b − A x_0.
2. Given x_k, r_k, p_k, the CGM step is:
2.1 search for the optimal α_k along p_k: α_k = (r_k^T p_k)/(p_k^T A p_k);
2.2 update x_k and the gradient direction r_k: x_{k+1} = x_k + α_k p_k, r_{k+1} = b − A x_{k+1};
2.3 form the new search direction p_{k+1} from the previous computations: β_k = −(r_{k+1}^T A p_k)/(p_k^T A p_k), p_{k+1} = r_{k+1} + β_k p_k.
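A compact sketch of the CG iteration above for a small SPD system; the matrix A and vector b are illustrative choices, not from the lecture:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # small SPD matrix (illustrative)
b = np.array([1.0, 2.0])

x = np.zeros(2)
r = b - A @ x                            # residual, equal to -grad(phi)
p = r.copy()
for k in range(len(b)):                  # at most n steps for an n x n SPD system
    Ap = A @ p
    alpha = (r @ p) / (p @ Ap)           # optimal step along p
    x = x + alpha * p
    r_new = b - A @ x
    beta = -(r_new @ Ap) / (p @ Ap)      # makes p_{k+1} A-conjugate to p_k
    p = r_new + beta * p
    r = r_new
print(x, np.linalg.solve(A, b))          # CG result vs direct solve
```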
Conjugate gradient method
Local quadratic approximation of a general nonlinear optimization problem:
f(x) ≈ f(x_0) + ∇f(x_0)^T (x − x_0) + (1/2)(x − x_0)^T H_f(x_0)(x − x_0),
where H_f(x_0) is the Hessian of f at x_0. Apply the conjugate gradient method to this quadratic approximation successively.
The computation of β_k needs the formation of the Hessian matrix H_f(x_0), which is a formidable task! However, in the quadratic case there is the equivalent transformation
β_k = −(r_{k+1}^T A p_k)/(p_k^T A p_k) = ‖∇φ(x_{k+1})‖^2 / ‖∇φ(x_k)‖^2.
This formula does NOT need the computation of the Hessian matrix.
Conjugate gradient method for nonlinear optimization
Formally generalize CGM to nonlinear optimization:
1. Given an initial point x_0 and ε > 0;
2. compute g_0 = ∇f(x_0), set p_0 = −g_0, k = 0;
3. compute λ_k from min_λ f(x_k + λ p_k) and set
x_{k+1} = x_k + λ_k p_k, g_{k+1} = ∇f(x_{k+1});
4. if ‖g_{k+1}‖ ≤ ε, the iteration is over. Otherwise compute
μ_{k+1} = ‖g_{k+1}‖^2 / ‖g_k‖^2, p_{k+1} = −g_{k+1} + μ_{k+1} p_k,
set k = k + 1, and iterate until convergence.
Conjugate gradient method
In realistic computations, because there are only n conjugate directions for an n-dimensional problem, the method is often restarted from the current point after every n iterations.
Conjugate gradient method
CGM for the example
min f(x_1, x_2) = 100(x_2 − x_1^2)^2 + (x_1 − 1)^2
with initial state x_0 = (−1.2, 1).
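A sketch of the Fletcher-Reeves iteration on this example, with a crude backtracking line search in place of the exact minimization over λ and a restart every n steps as just described; an illustration only, not the lecture's implementation:

```python
import numpy as np

def f(x):
    return 100 * (x[1] - x[0]**2)**2 + (x[0] - 1)**2

def grad(x):
    return np.array([-400 * x[0] * (x[1] - x[0]**2) + 2 * (x[0] - 1),
                      200 * (x[1] - x[0]**2)])

def line_search(x, p):
    # crude backtracking in place of the exact minimization over lambda
    lam = 1.0
    slope = grad(x) @ p
    while f(x + lam * p) > f(x) + 1e-4 * lam * slope:
        lam *= 0.5
        if lam < 1e-16:
            break
    return lam

n = 2                                   # problem dimension
x = np.array([-1.2, 1.0])
g = grad(x)
p = -g
for k in range(1, 5001):
    x = x + line_search(x, p) * p
    g_new = grad(x)
    if np.linalg.norm(g_new) < 1e-6:
        break
    mu = (g_new @ g_new) / (g @ g)      # Fletcher-Reeves coefficient
    p = -g_new + mu * p
    if k % n == 0 or g_new @ p >= 0:    # restart every n steps, or if p is not
        p = -g_new                      # a descent direction
    g = g_new
print(k, x)
```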
Variable metric method
A general form of the iteration: x_{k+1} = x_k − λ_k H_k ∇f(x_k).
1. If H_k = I, it is the steepest descent method;
2. if H_k = [∇^2 f(x_k)]^{−1}, it is the damped Newton's method.
In order to keep the fast convergence of Newton's method, we hope to approximate [∇^2 f(x_k)]^{−1} by a matrix H_k with reduced computational effort, updated as H_{k+1} = H_k + C_k, where C_k is a correction matrix that is easily computed.
Variable metric method
First consider a quadratic function f(x) = a + b^T x + (1/2) x^T G x; we have ∇f(x) = b + G x.
Define g(x) = ∇f(x) and g_k = g(x_k); then g_{k+1} − g_k = G(x_{k+1} − x_k).
Define Δx_k = x_{k+1} − x_k and Δg_k = g_{k+1} − g_k; then G Δx_k = Δg_k.
Variable metric method
For a general nonlinear function,
f(x) ≈ f(x_{k+1}) + ∇f(x_{k+1})^T (x − x_{k+1}) + (1/2)(x − x_{k+1})^T H_f(x_{k+1})(x − x_{k+1}).
By a similar procedure as above we obtain [H_f(x_{k+1})]^{−1} Δg_k = Δx_k.
As H_{k+1} is an approximation of [H_f(x_{k+1})]^{−1}, it must satisfy H_{k+1} Δg_k = Δx_k.
DFP method
Davidon-Fletcher-Powell method: choose C_k as a rank-2 correction matrix
C_k = α_k u u^T + β_k v v^T,
where α_k, β_k, u, v are undetermined. From H_{k+1} = H_k + C_k and H_{k+1} Δg_k = Δx_k we have
α_k u (u^T Δg_k) + β_k v (v^T Δg_k) = Δx_k − H_k Δg_k.
Take u = H_k Δg_k, v = Δx_k and α_k = −1/(u^T Δg_k), β_k = 1/(v^T Δg_k).
We obtain the famous DFP method:
H_{k+1} = H_k − (H_k Δg_k Δg_k^T H_k)/(Δg_k^T H_k Δg_k) + (Δx_k Δx_k^T)/(Δx_k^T Δg_k).
Remarks on the DFP method
If f(x) is quadratic and H_0 = I, the method converges in n steps theoretically;
if f(x) is strictly convex, the DFP method converges globally;
if H_k is SPD and g_k ≠ 0, then H_{k+1} is SPD as well.
DFP method
DFP method for the example
min f(x_1, x_2) = 100(x_2 − x_1^2)^2 + (x_1 − 1)^2
with initial state x_0 = (−1.2, 1).
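A sketch of a DFP iteration for this example, with the H update exactly as in the formula above, the same crude inexact line search used earlier, and a standard safeguard that skips the update when Δx_k^T Δg_k ≤ 0 (an addition not on the slides):

```python
import numpy as np

def f(x):
    return 100 * (x[1] - x[0]**2)**2 + (x[0] - 1)**2

def grad(x):
    return np.array([-400 * x[0] * (x[1] - x[0]**2) + 2 * (x[0] - 1),
                      200 * (x[1] - x[0]**2)])

def line_search(x, p):
    # step-halving search with a sufficient-decrease test
    lam = 1.0
    slope = grad(x) @ p
    while f(x + lam * p) > f(x) + 1e-4 * lam * slope:
        lam *= 0.5
        if lam < 1e-16:
            break
    return lam

x = np.array([-1.2, 1.0])
H = np.eye(2)                       # initial approximation of the inverse Hessian
g = grad(x)
for k in range(1000):
    p = -H @ g                      # quasi-Newton direction
    x_new = x + line_search(x, p) * p
    g_new = grad(x_new)
    if np.linalg.norm(g_new) < 1e-8:
        x = x_new
        break
    dx, dg = x_new - x, g_new - g
    Hdg = H @ dg
    if dx @ dg > 1e-12:             # safeguard: only update when the curvature is positive
        # DFP rank-2 update: H - (H dg dg^T H)/(dg^T H dg) + (dx dx^T)/(dx^T dg)
        H = H - np.outer(Hdg, Hdg) / (dg @ Hdg) + np.outer(dx, dx) / (dx @ dg)
    x, g = x_new, g_new
print(k, x)
```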
BFGS method
The most popular variable metric method is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) method:
H_{k+1} = H_k − (H_k Δg_k Δg_k^T H_k)/(Δg_k^T H_k Δg_k) + (Δx_k Δx_k^T)/(Δx_k^T Δg_k) + (Δg_k^T H_k Δg_k) v_k v_k^T,
where
v_k = Δx_k/(Δx_k^T Δg_k) − H_k Δg_k/(Δg_k^T H_k Δg_k).
BFGS is more stable than the DFP method;
BFGS is also a rank-2 correction method for H_k.
Nonlinear least squares
Mathematically, nonlinear least squares is to minimize φ(x) = (1/2) r(x)^T r(x).
We have
∇φ(x) = J^T(x) r(x), H_φ(x) = J^T(x) J(x) + Σ_{i=1}^m r_i(x) H_{r_i}(x),
where J(x) is the Jacobian matrix of r(x).
The direct Newton's method for the increment s_k in nonlinear least squares is
H_φ(x_k) s_k = −∇φ(x_k).
Gauss-Newton method
If we make the assumption that the residuals r_i(x) are very small, we may drop the term Σ_{i=1}^m r_i(x) H_{r_i}(x) in Newton's method, and we obtain the Gauss-Newton method:
(J^T(x_k) J(x_k)) s_k = −∇φ(x_k).
The Gauss-Newton method is equivalent to solving a sequence of linear least squares problems that approximate the nonlinear least squares problem.
Levenberg-Marquardt method
If the Jacobian J(x) is ill-conditioned, one may use the Levenberg-Marquardt method:
(J^T(x_k) J(x_k) + μ_k I) s_k = −∇φ(x_k),
where μ_k is a nonnegative parameter chosen by some strategy.
The L-M method may be viewed as a regularization of the Gauss-Newton method.
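A minimal Levenberg-Marquardt sketch for the hypothetical exponential-fit residuals introduced earlier; it uses a fixed small μ rather than an adaptive strategy, and setting μ = 0 would give the Gauss-Newton step:

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 0.7, 0.3, 0.1])          # same made-up data as the earlier sketch

def residual(x):                            # r_i(x) = y_i - x1*exp(x2*t_i)
    return y - x[0] * np.exp(x[1] * t)

def jacobian(x):                            # J_ij = d r_i / d x_j
    e = np.exp(x[1] * t)
    return np.column_stack([-e, -x[0] * t * e])

x = np.array([1.0, -1.0])                   # initial guess
mu = 1e-3                                   # fixed damping parameter (illustrative)
for k in range(50):
    r, J = residual(x), jacobian(x)
    g = J.T @ r                             # gradient of phi
    s = np.linalg.solve(J.T @ J + mu * np.eye(2), -g)   # L-M step
    x = x + s
    if np.linalg.norm(s) < 1e-10:
        break
print(x, 0.5 * residual(x) @ residual(x))   # fitted parameters and final phi
```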
Homework assignment
Implement Newton's method and the BFGS method for the example
min f(x_1, x_2) = 100(x_2 − x_1^2)^2 + (x_1 − 1)^2
with initial state x_0 = (−1.2, 1).