Numerical Optimization
Unit 2: Multivariable optimization problems
Che-Rung Lee
February 28, 2011
Partial derivative of a two-variable function

Given a two-variable function $f(x_1, x_2)$, the partial derivatives of $f$ with respect to $x_1$ and $x_2$ are
$$\frac{\partial f}{\partial x_1} = \lim_{h \to 0} \frac{f(x_1 + h, x_2) - f(x_1, x_2)}{h}, \qquad
\frac{\partial f}{\partial x_2} = \lim_{h \to 0} \frac{f(x_1, x_2 + h) - f(x_1, x_2)}{h}.$$

The meaning of the partial derivative: let $F(x_1) = f(x_1, v)$ and $G(x_2) = f(u, x_2)$. Then
$$\frac{\partial f}{\partial x_1}(x_1, v) = F'(x_1), \qquad \frac{\partial f}{\partial x_2}(u, x_2) = G'(x_2).$$
Gradient

Definition: The gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$ is a vector in $\mathbb{R}^n$ defined as
$$g = \nabla f(x) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix},
\quad \text{where} \quad x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}.$$
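The limit definition above also gives a quick numerical sanity check for hand-computed gradients. The sketch below is not from the slides; it approximates each partial derivative with a forward difference, and the example function is an arbitrary illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x with forward differences,
    following the limit definition of the partial derivative."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x)) / h
    return g

# Example: f(x1, x2) = x1^2 + 3*x1*x2, so grad f = (2*x1 + 3*x2, 3*x1).
f = lambda x: x[0]**2 + 3*x[0]*x[1]
print(numerical_gradient(f, [1.0, 2.0]))   # ~ [8.0, 3.0]
```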
Directional derivative

Definition: The directional derivative of a function $f: \mathbb{R}^n \to \mathbb{R}$ in the direction $p$ is defined as
$$D(f(x), p) = \lim_{h \to 0} \frac{f(x + h p) - f(x)}{h}.$$

Remark: If $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable in a neighborhood of $x$, then for any vector $p$,
$$D(f(x), p) = \nabla f(x)^T p.$$
The descent directions

A direction $p$ is called a descent direction of $f$ at $x_0$ if $D(f(x_0), p) < 0$. If $f$ is smooth enough, $p$ is a descent direction if $\nabla f(x_0)^T p < 0$.

Which direction $p$ makes $f(x_0 + p)$ decrease the most?

By the mean value theorem, for some $\alpha \in (0, 1)$,
$$f(x_0 + p) = f(x_0) + \nabla f(x_0 + \alpha p)^T p.$$

The direction $p = -\nabla f(x_0)$ is called the steepest descent direction of $f$ at $x_0$. With this choice,
$$f(x_0 + p) = f(x_0) + \nabla f(x_0 + \alpha p)^T p \approx f(x_0) - \nabla f(x_0)^T \nabla f(x_0) \le f(x_0).$$
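One step the slide leaves implicit is why the negative gradient is the best choice among all directions of a fixed length. A short derivation sketch (not in the original slide) using the Cauchy-Schwarz inequality:

```latex
% Among all directions p of a given length, the negative gradient gives the
% most negative directional derivative (sketch, not from the original slides).
\[
D(f(x_0), p) = \nabla f(x_0)^T p \;\ge\; -\|\nabla f(x_0)\|\,\|p\|
\qquad \text{(Cauchy--Schwarz)},
\]
with equality exactly when $p$ is a negative multiple of $\nabla f(x_0)$.
Hence $p = -\nabla f(x_0)$ minimizes the directional derivative over all
directions of a given length, which is why it is the steepest descent direction.
```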
The steepest descent algorithm

The steepest descent algorithm
  For k = 1, 2, ... until convergence
    Compute $p_k = -\nabla f(x_k)$.
    Find $\alpha_k \in (0, 1)$ s.t. $F(\alpha_k) = f(x_k + \alpha_k p_k)$ is minimized.
    $x_{k+1} = x_k + \alpha_k p_k$.

You can use any single-variable optimization technique to compute $\alpha_k$.
If $F(\alpha) = f(x_k + \alpha p_k)$ is a quadratic function, $\alpha_k$ has a closed-form formula (derived on the next slides).
If $F(\alpha) = f(x_k + \alpha p_k)$ is more than a quadratic function, we may approximate it by a quadratic model and use the formula to solve for $\alpha_k$. Higher-order polynomial approximations will be mentioned in the line search algorithm.
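A minimal Python sketch of the iteration, assuming the objective and its gradient are supplied as functions. A simple backtracking rule stands in for the quadratic-model step length derived on the following slides, and the test function is illustrative only.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-8, max_iter=1000):
    """Minimize f by steepest descent with a simple backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -g                      # steepest descent direction
        alpha = 1.0
        # Backtrack until the step gives sufficient decrease (Armijo condition)
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
    return x

# Example: minimize f(x) = (x1 - 1)^2 + 10*(x2 + 2)^2, minimizer at (1, -2).
f = lambda x: (x[0] - 1)**2 + 10 * (x[1] + 2)**2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(steepest_descent(f, grad, [0.0, 0.0]))   # ~ [1, -2]
```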
Quadratic model

If $f$ is a quadratic function, we can write it as
$$f(x, y) = a x^2 + b x y + c y^2 + d x + e y + f(0, 0).$$

If $f$ is smooth, the derivatives of $f$ are
$$\frac{\partial f}{\partial x} = 2 a x + b y + d, \qquad \frac{\partial f}{\partial y} = 2 c y + b x + e,$$
$$\frac{\partial^2 f}{\partial x^2} = 2a, \qquad \frac{\partial^2 f}{\partial y^2} = 2c, \qquad
\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x} = b.$$

Let $\mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix}$. Then $f(\mathbf{x})$ can be expressed as
$$f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T \begin{pmatrix} 2a & b \\ b & 2c \end{pmatrix} \mathbf{x}
+ \mathbf{x}^T \begin{pmatrix} d \\ e \end{pmatrix} + f(\mathbf{0}).$$
Gradient and Hessian

The gradient of $f$, as defined before, is
$$g(\mathbf{x}) = \nabla f(\mathbf{x}) = \begin{pmatrix} \partial f / \partial x \\ \partial f / \partial y \end{pmatrix}
= \begin{pmatrix} 2a & b \\ b & 2c \end{pmatrix} \mathbf{x} + \begin{pmatrix} d \\ e \end{pmatrix}.$$

The second derivative, a matrix called the Hessian, is
$$\nabla^2 f(\mathbf{x}) = H(\mathbf{x})
= \begin{pmatrix} \partial^2 f / \partial x^2 & \partial^2 f / \partial x \partial y \\
\partial^2 f / \partial y \partial x & \partial^2 f / \partial y^2 \end{pmatrix}
= \begin{pmatrix} 2a & b \\ b & 2c \end{pmatrix}.$$

Therefore
$$f(\mathbf{x}) = \frac{1}{2} \mathbf{x}^T H(\mathbf{0}) \mathbf{x} + g(\mathbf{0})^T \mathbf{x} + f(\mathbf{0}), \qquad
\nabla f(\mathbf{x}) = H \mathbf{x} + g, \qquad \nabla^2 f = H.$$

In the following lectures, we assume $H$ is symmetric. Thus $H = H^T$.
Optimal $\alpha_k$ for the quadratic model

We denote $H_k = H(x_k)$, $g_k = g(x_k)$, and $f_k = f(x_k)$. Also, $H = H(0)$, $g = g(0)$, and $f = f(0)$.
$$\begin{aligned}
F(\alpha) = f(x_k + \alpha p_k)
&= \frac{1}{2}(x_k + \alpha p_k)^T H (x_k + \alpha p_k) + g^T (x_k + \alpha p_k) + f(0) \\
&= \frac{1}{2} x_k^T H x_k + g^T x_k + f(0) + \alpha (H x_k + g)^T p_k + \frac{\alpha^2}{2} p_k^T H p_k \\
&= f_k + \alpha\, g_k^T p_k + \frac{\alpha^2}{2} p_k^T H p_k.
\end{aligned}$$
$$F'(\alpha) = g_k^T p_k + \alpha\, p_k^T H p_k.$$
The optimal $\alpha_k$ satisfies $F'(\alpha_k) = 0$, which gives
$$\alpha_k = -\frac{g_k^T p_k}{p_k^T H p_k}.$$
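A small numerical illustration of this step-length formula on a quadratic; the matrix H and vector g below are made-up values, not from the slides.

```python
import numpy as np

# Exact step length for a quadratic f(x) = 0.5*x'Hx + g'x + c along direction p_k,
# using alpha_k = -(g_k' p_k) / (p_k' H p_k) from the slide (illustrative values).
H = np.array([[2.0, 0.0], [0.0, 10.0]])
g = np.array([-2.0, -40.0])            # gradient of f at the origin
xk = np.array([0.0, 0.0])

gk = H @ xk + g                        # gradient at x_k
pk = -gk                               # steepest descent direction
alpha_k = -(gk @ pk) / (pk @ H @ pk)   # exact minimizer of F(alpha)
x_next = xk + alpha_k * pk
print(alpha_k, x_next)
```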
Optimality conditions

Theorem (necessary and sufficient conditions of optimality): Let $f: \mathbb{R}^n \to \mathbb{R}$ be continuously differentiable in $D$.
- If $x^* \in D$ is a local minimizer, then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive semidefinite.
- If $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ is positive definite, then $x^*$ is a local minimizer.

Definition: A matrix $H$ is called positive definite if for any nonzero vector $v \in \mathbb{R}^n$, $v^T H v > 0$. $H$ is called positive semidefinite if $v^T H v \ge 0$ for all $v \in \mathbb{R}^n$. $H$ is negative definite or negative semidefinite if $-H$ is positive definite or positive semidefinite, respectively. $H$ is indefinite if it is neither positive semidefinite nor negative semidefinite.
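For a quadratic model these conditions are easy to check numerically. The snippet below is an illustrative sketch with made-up H and g, using eigenvalues to test positive definiteness.

```python
import numpy as np

# Check the second-order conditions at a candidate point of a quadratic
# f(x) = 0.5*x'Hx + g'x (illustrative H and g, not from the slides).
H = np.array([[2.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -1.0])

x_star = np.linalg.solve(H, -g)          # stationary point: H x + g = 0
grad = H @ x_star + g
eigvals = np.linalg.eigvalsh(H)          # eigenvalues of the symmetric Hessian

print(np.allclose(grad, 0))              # first-order condition holds
print(np.all(eigvals > 0))               # positive definite -> local minimizer
```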
Convergence of the steepest descent method

Theorem (convergence of the steepest descent method): If the steepest descent method converges to a local minimizer $x^*$ where $\nabla^2 f(x^*)$ is positive definite, and $e_{\max}$ and $e_{\min}$ are the largest and the smallest eigenvalues of $\nabla^2 f(x^*)$, then
$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} \le \frac{e_{\max} - e_{\min}}{e_{\max} + e_{\min}}.$$

Definition: For a scalar $\lambda$ and a unit vector $v$, $(\lambda, v)$ is an eigenpair of a matrix $H$ if $H v = \lambda v$. The scalar $\lambda$ is called an eigenvalue of $H$, and $v$ is called an eigenvector.
Newton's method

We used the quadratic model to find the step length $\alpha_k$. Can we also use the quadratic model to find the search direction $p_k$? Yes, we can.

Recall the quadratic model (now $p$ is the variable):
$$f(x_k + p) \approx \frac{1}{2} p^T H_k p + p^T g_k + f_k.$$
Compute the gradient with respect to $p$:
$$\nabla_p f(x_k + p) = H_k p + g_k.$$
The solution of $\nabla_p f(x_k + p) = 0$ is $p_k = -H_k^{-1} g_k$. Newton's method uses $p_k$ as the search direction.

Newton's method
  1. Given an initial guess $x_0$.
  2. For k = 0, 1, 2, ... until convergence: $x_{k+1} = x_k - H_k^{-1} g_k$.
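A minimal sketch of the iteration in Python, assuming the gradient and Hessian are available as functions; the test function is an illustrative choice, not from the slides.

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: at each step solve H_k p = -g_k for the search direction."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess(x), -g)   # avoid forming the inverse explicitly
        x = x + p
    return x

# Example: f(x) = exp(x1) - x1 + x2^2 has its minimizer at the origin.
grad = lambda x: np.array([np.exp(x[0]) - 1.0, 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])
print(newton_method(grad, hess, [1.0, 1.0]))   # ~ [0, 0]
```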
Descent direction

The direction $p_k = -H_k^{-1} g_k$ is called Newton's direction. Is $p_k$ a descent direction? (Recall the definition of descent directions.)

We only need to check whether $g_k^T p_k < 0$. Since $g_k^T p_k = -g_k^T H_k^{-1} g_k$, $p_k$ is a descent direction if $H_k^{-1}$ is positive definite.

For a symmetric matrix $H$, the following conditions are equivalent:
- $H$ is positive definite.
- $H^{-1}$ is positive definite.
- All the eigenvalues of $H$ are positive.
Some properties of eigenvalues/eigenvectors

A symmetric matrix $H$ of order $n$ has $n$ real eigenvalues and $n$ real, linearly independent (orthogonal) eigenvectors:
$$H v_1 = \lambda_1 v_1, \quad H v_2 = \lambda_2 v_2, \quad \ldots, \quad H v_n = \lambda_n v_n.$$
Let $V = [v_1\; v_2\; \ldots\; v_n]$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$. Then $H V = V \Lambda$.

If $\lambda_1, \lambda_2, \ldots, \lambda_n$ are nonzero, then since $H = V \Lambda V^{-1}$,
$$H^{-1} = V \Lambda^{-1} V^{-1}, \qquad
\Lambda^{-1} = \mathrm{diag}\!\left(\frac{1}{\lambda_1}, \frac{1}{\lambda_2}, \ldots, \frac{1}{\lambda_n}\right).$$
The eigenvalues of $H^{-1}$ are $\frac{1}{\lambda_1}, \frac{1}{\lambda_2}, \ldots, \frac{1}{\lambda_n}$.
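These identities can be verified numerically; the snippet below uses an arbitrary symmetric matrix (not from the slides) and NumPy's symmetric eigensolver.

```python
import numpy as np

# Verify H = V @ diag(lambda) @ V.T and that the eigenvalues of H^{-1}
# are the reciprocals 1/lambda (illustrative symmetric positive definite matrix).
H = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, V = np.linalg.eigh(H)                 # symmetric eigendecomposition, V orthogonal

print(np.allclose(H, V @ np.diag(lam) @ V.T))          # H = V Lambda V^T
print(np.allclose(np.linalg.eigvalsh(np.linalg.inv(H)),
                  np.sort(1.0 / lam)))                 # eigenvalues of H^{-1} are 1/lambda
```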
How to solve $H p = g$?

For a symmetric positive definite matrix $H$, $H p = g$ can be solved by the Cholesky decomposition, which is similar to the LU decomposition but costs only about half as much computation.

Let
$$H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix},
\quad \text{where } h_{12} = h_{21},\; h_{13} = h_{31},\; h_{23} = h_{32}.$$
The Cholesky decomposition factors $H = L L^T$, where $L$ is a lower triangular matrix,
$$L = \begin{pmatrix} l_{11} & & \\ l_{21} & l_{22} & \\ l_{31} & l_{32} & l_{33} \end{pmatrix}.$$

Using the Cholesky decomposition, $H p = g$ can be solved by
1. Compute $H = L L^T$.
2. $p = (L^T)^{-1} L^{-1} g$.

In Matlab, use p = H \ g. Don't use inv(H).
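In Python, the same two-triangular-solve approach might look like the following sketch (SciPy routines; the matrix and right-hand side are illustrative, not from the slides).

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

# Solve H p = g with a Cholesky factorization: H = L L^T, then two triangular solves.
H = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
g = np.array([1.0, 2.0, 3.0])

L = cholesky(H, lower=True)                  # H = L L^T
y = solve_triangular(L, g, lower=True)       # forward solve:  L y = g
p = solve_triangular(L.T, y, lower=False)    # backward solve: L^T p = y

print(np.allclose(H @ p, g))                 # True: p solves H p = g
```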
The Cholesky decomposition

For i = 1, 2, ..., n
  $l_{ii} = \sqrt{h_{ii}}$
  For j = i+1, i+2, ..., n
    $l_{ji} = h_{ji} / l_{ii}$
    For k = i+1, i+2, ..., j
      $h_{jk} = h_{jk} - l_{ji} l_{ki}$

For the 3-by-3 example,
$$\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}
= L L^T
= \begin{pmatrix}
l_{11}^2 & l_{11} l_{21} & l_{11} l_{31} \\
l_{11} l_{21} & l_{21}^2 + l_{22}^2 & l_{21} l_{31} + l_{22} l_{32} \\
l_{11} l_{31} & l_{21} l_{31} + l_{22} l_{32} & l_{31}^2 + l_{32}^2 + l_{33}^2
\end{pmatrix},$$
and the algorithm computes
$$l_{11} = \sqrt{h_{11}}, \quad l_{21} = h_{21}/l_{11}, \quad l_{31} = h_{31}/l_{11},$$
$$h^{(2)}_{22} = h_{22} - l_{21} l_{21}, \quad h^{(2)}_{32} = h_{32} - l_{21} l_{31}, \quad h^{(2)}_{33} = h_{33} - l_{31} l_{31},$$
$$l_{22} = \sqrt{h^{(2)}_{22}}, \quad l_{32} = h^{(2)}_{32}/l_{22}, \quad l_{33} = \sqrt{h^{(2)}_{33} - l_{32} l_{32}}.$$
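A plain Python transcription of this loop (a sketch without pivoting or error checks; it assumes H is symmetric positive definite).

```python
import numpy as np

def cholesky_lower(H):
    """Cholesky factorization H = L L^T following the loop on the slide."""
    H = np.array(H, dtype=float)          # work on a copy; entries get updated in place
    n = H.shape[0]
    L = np.zeros_like(H)
    for i in range(n):
        L[i, i] = np.sqrt(H[i, i])
        for j in range(i + 1, n):
            L[j, i] = H[j, i] / L[i, i]
            for k in range(i + 1, j + 1):
                H[j, k] -= L[j, i] * L[k, i]
    return L

H = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
L = cholesky_lower(H)
print(np.allclose(L @ L.T, H))            # True
```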
Convergence of Newton's method

Theorem: Suppose $f$ is twice differentiable, $\nabla^2 f$ is continuous in a neighborhood of $x^*$, and $\nabla^2 f(x^*)$ is positive definite. If $x_0$ is sufficiently close to $x^*$, the sequence $\{x_k\}$ converges to $x^*$ quadratically.

Three problems of Newton's method:
1. $H$ may not be positive definite: modified Newton's method + line search.
2. $H$ is expensive to compute: quasi-Newton methods.
3. $H^{-1}$ is expensive to compute: conjugate gradient.