Lecture Notes: Geometric Considerations in Unconstrained Optimization


James T. Allison
February 15, 2006

The primary objectives of this lecture on unconstrained optimization are to:

- Establish connections between optimality conditions and problem geometry
- Provide several motivations for the gradient method and Newton's method
- Illustrate these concepts with numerical examples

The derivation of optimality conditions using abstract means is important, but the intuition gained through a geometric understanding of optimality conditions can also be very useful. This additional insight can contribute to more effective implementation of optimization theory. A brief derivation of first- and second-order conditions is provided, followed by a discussion of function approximation models and a geometric explanation of optimality conditions. Finally, the impact of problem conditioning and scaling on optimization algorithms is discussed.

1 Optimality Conditions

For a point x₀ to be a minimum, perturbations about this point (Δx = x − x₀) must result only in objective function increases:

    Δf = f(x) − f(x₀) ≥ 0    (1)

Finite-term Taylor series expansions of a function are accurate near the point of expansion. Combining a first-order expansion with equation 1, we can derive a necessary condition for optimality¹:

    Δf = ∇f(x*)ᵀΔx + o(‖Δx‖)
    Δf ≈ ∇f(x*)ᵀΔx ≥ 0 for all Δx   ⟹   ∇f(x*) = 0    (2)

A point x* that meets this condition is a stationary point, but it is unknown whether this point is a minimum, a maximum, or a saddle point. Evaluation of this first-order necessary condition involves the solution of a system of nonlinear equations (equation 2). A second-order expansion about a known stationary point provides curvature information via a quadratic approximation of the function, and enables the determination of whether the stationary point is in fact a minimum.
If we apply equation 1 to a second-order expansion about a stationary point, noting that the linear term is zero in this case, we arrive at the following condition:

    ΔxᵀHΔx > 0 for all Δx ≠ 0    (3)

H is the Hessian of the objective function (also written ∇²f(x)). The satisfaction of this condition together with the stationarity condition of equation 2 comprises a second-order sufficiency condition, i.e., if both conditions are met the point in question is known to be a minimum.

¹Note that in this document vectors are considered to be column vectors, and gradients are also considered to be column vectors. The transpose of a vector x is denoted xᵀ.

Copyright © 2006 by James T. Allison

Evaluating this condition for all possible perturbations would be very difficult. However, it is known from linear algebra theory that equation 3 is satisfied if and only if the objective function Hessian matrix is positive definite, often denoted H ≻ 0. A matrix is positive definite if and only if all of its eigenvalues are positive, and eigenvalues are easily evaluated numerically. The relationship between positive definiteness, positive eigenvalues, and function geometry will be clarified in these lecture notes.

2 Function Models

The Taylor series expansions used in deriving the above optimality conditions can be viewed as function approximation models. Both linear and quadratic models were used, and a geometric understanding of these models can add insight to optimality conditions and optimization algorithms.

Linear Function Models

A linear function model characterizes the slope of a function in the neighborhood of a point. In R a linear model is a line tangent to a function, and in Rⁿ space² it is a hyperplane tangent to the function. If the tangent plane is not horizontal, then directions of descent exist, as does an improved objective function value. Therefore, an optimal point must have a horizontal tangent plane. The gradient of the objective function is zero exactly when the tangent plane, defined by a linear Taylor series expansion, is horizontal. This verifies equation 2. This geometric description also motivates the gradient method for unconstrained optimization.

Gradient Method Algorithm:

1. Build a linear model of the function at the current point, and if descent directions exist, move in the direction of steepest descent (−∇f) until the objective function stops improving.
2. Update the linear model and repeat until ∇f = 0.
The iterative formula for the gradient method, where k is the iteration number and α is the step size, is:

    x_{k+1} = x_k − α∇f(x_k)    (4)

The gradient method converts a multidimensional minimization problem into a sequence of one-dimensional line searches. During each of these line searches we are looking at a slice of the objective function surface. This is illustrated in the following example.

Example 1: Consider the quadratic function:

    f(x) = 7x₁² + 2.4x₁x₂ + x₂²    (5)

The contours of the function level sets are shown in the first plot of Figure 1. We can see by inspection of the objective function that the minimum is at x* = [0 0]ᵀ. If we start at x₀ = [10 5]ᵀ and perform the line search min_α f(x₀ − α∇f(x₀)), the objective function in the search direction appears as shown in the second plot of Figure 1. The search direction is illustrated in the first plot, and the second plot is a slice of the objective function surface in this search direction.

Figure 1: Contour and line search plots for the quadratic function of Example 1 [left: level-set contours with the search direction from x₀; right: f(x₀ − α∇f(x₀)) plotted against α]

Quadratic Function Models

A quadratic model can capture curvature information about a function in the neighborhood of a point. In R a quadratic model is a parabola, and in Rⁿ it is a paraboloid. Constructing a quadratic model of a function facilitates the approximation of the function's stationary point, since the quadratic model has its own stationary point. Linear models (hyperplanes) do not have stationary points. The closer to quadratic a function's shape is, the better this approximation will be, and it will of course be exact for quadratic objective functions. Iterative approximation of a function's stationary point forms the basis of Newton's method for unconstrained optimization. Sequential quadratic modeling is the first of three motivations for Newton's method for optimization that will be discussed in these lecture notes.

²R is the set of all real numbers, and Rⁿ is the set of all real-valued vectors of length n.

Newton's Method:

1. Build a quadratic model of the function at the current point, and use the stationary point of this model as the approximation of the objective function's stationary point.
2. Check for convergence, and iterate if not converged.

The iterative formula for Newton's method, where H⁻¹ is the inverse of the objective function's Hessian, is:

    x_{k+1} = x_k − H⁻¹∇f(x_k)    (6)

Newton's method exhibits very fast (quadratic) local convergence compared to the slower linear convergence of the gradient method. Newton's method, however, can be unstable. It may converge to a maximum instead of a minimum, since it does not have a descent property: Newton's method seeks a stationary point, but has no ability to distinguish between a maximum and a minimum. In contrast, the gradient method will always decrease the objective function at each iteration because it always moves in a descent direction. The gradient method will find a stationary point that is either a minimum, or a saddle point that is an improvement over the starting point. In other words, the gradient method is effective at moving in a descent direction, even far from a stationary point, while Newton's method is effective at converging quickly to a stationary point when one is near. Quasi-Newton methods combine the good global convergence of the gradient method with the rapid local convergence of Newton's method.
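The gradient iteration of equation 4 can be sketched in a few lines of Python, applied to the quadratic of Example 1. This sketch uses a fixed step size α rather than the exact line search described above, and the tolerance and iteration limit are illustrative assumptions:

```python
def grad_f(x):
    # Gradient of f(x) = 7*x1^2 + 2.4*x1*x2 + x2^2 (Example 1)
    x1, x2 = x
    return (14 * x1 + 2.4 * x2, 2.4 * x1 + 2 * x2)

def gradient_method(x, alpha=0.05, tol=1e-8, max_iter=10000):
    """Fixed-step gradient iteration: x_{k+1} = x_k - alpha * grad f(x_k)."""
    for _ in range(max_iter):
        g = grad_f(x)
        if max(abs(gi) for gi in g) < tol:   # stop when the gradient vanishes
            break
        x = (x[0] - alpha * g[0], x[1] - alpha * g[1])
    return x

x_min = gradient_method((10.0, 5.0))
print(x_min)   # both components close to 0, the true minimizer
```

For a fixed step to converge on a quadratic, α must be smaller than 2/λ_max of the Hessian (about 0.138 here); the slow linear convergence mentioned above is visible in the hundreds of iterations this takes.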
Such methods begin with gradient method iterations, and dynamically transform into Newton's method iterations.

You may be familiar with another form of Newton's method, used for finding the roots of a function. The one-dimensional root-finding formula is:

    x_{k+1} = x_k − f(x_k)/f′(x_k)
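The one-dimensional root-finding iteration can be sketched as follows; the test function and tolerances are illustrative choices:

```python
def newton_root(f, fprime, x, tol=1e-12, max_iter=50):
    """1-D Newton iteration for f(x) = 0: x_{k+1} = x_k - f(x_k)/f'(x_k)."""
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)
    return x

# Find the positive root of f(x) = x^2 - 2, i.e. sqrt(2):
r = newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
print(r)   # ~1.41421356
```

The quadratic convergence noted above shows up as a roughly doubling number of correct digits per iteration: starting from 1.0, the iterates are 1.5, 1.41667, 1.4142157, ...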

An extension of Newton's method to multiple dimensions takes the form:

    x_{k+1} = x_k − J⁻¹f(x_k)    (7)

This multidimensional version seeks to solve the system of equations f(x) = 0. Note that here f(x) is a vector-valued function. The matrix J is the Jacobian of f(x): a matrix in which each row is the transpose of the gradient of the corresponding component of f(x).

This next concept establishes the connection between Newton's method for root finding (i.e., for solving systems of nonlinear equations) and Newton's method for unconstrained optimization. Recall that if we are seeking a stationary point of an objective function f(x), we need to solve the system of equations ∇f(x) = 0. If we use Newton's method for solving nonlinear systems of equations to solve ∇f(x) = 0, we replace f(x) in equation 7 with the vector-valued function ∇f(x), and replace J⁻¹ with the inverse of the Jacobian of ∇f(x). Observe that the Jacobian of ∇f(x) is in fact the Hessian of f(x), so its inverse is H⁻¹. Hence, by applying Newton's method for solving systems of equations to the problem of finding a stationary point, we have derived Newton's method for unconstrained optimization as defined in equation 6. This is the second of three motivations for Newton's method discussed in this document.

3 Quadratic Forms and Geometry

Quadratic models can take either of two general shapes: a paraboloid (convex or concave) or a hyperboloid (a saddle). Example 1 exhibited a function with a convex parabolic shape; a hyperboloid will be illustrated shortly. First, a brief review of quadratic forms is given. A function is a quadratic form if it is a linear combination of x_i x_j terms. It can be written in matrix form as f(x) = xᵀAx, where A is a symmetric matrix that defines the quadratic function. The conversion of the function in equation 8 to this form will be illustrated.

    f(x) = 2x₁² + x₁x₂ + x₂² + x₂x₃ + x₃²    (8)

First, the coefficient of each squared term (i.e., i = j) is placed on the diagonal at location (i, i):

    [ 2        ]
    [    1     ]
    [       1  ]

Then, since each cross (or "interaction") term is split across two off-diagonal entries, each cross-term coefficient is divided by two and placed in entries (i, j) and (j, i):

    [ 2    0.5       ]
    [ 0.5  1    0.5  ]
    [      0.5  1    ]

Finally, any quadratic terms that do not appear in the original function are assigned a value of zero in the matrix:

    [ 2    0.5  0   ]
    [ 0.5  1    0.5 ]
    [ 0    0.5  1   ]

The function in equation 8, rewritten in matrix form, is:

    f(x) = [x₁ x₂ x₃] [ 2    0.5  0   ] [x₁]
                      [ 0.5  1    0.5 ] [x₂]  = xᵀAx    (9)
                      [ 0    0.5  1   ] [x₃]

The correctness of this representation can be verified by performing the vector and matrix multiplications in equation 9 and observing that the result simplifies to equation 8. Certain properties of the quadratic form indicate what type of shape the function has. Three general shapes are possible.
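The construction above is easy to spot-check numerically. A small sketch, evaluating the quadratic form of equation 9 at an arbitrary test point and comparing it with equation 8 directly:

```python
def quad_form(A, x):
    """Evaluate the quadratic form x^T A x for a small dense matrix."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

# Symmetric matrix for f(x) = 2x1^2 + x1x2 + x2^2 + x2x3 + x3^2 (equation 8):
# squared-term coefficients on the diagonal, cross-term coefficients halved.
A = ((2.0, 0.5, 0.0),
     (0.5, 1.0, 0.5),
     (0.0, 0.5, 1.0))

x = (1.0, 2.0, 3.0)   # arbitrary test point
direct = 2 * x[0]**2 + x[0] * x[1] + x[1]**2 + x[1] * x[2] + x[2]**2
print(quad_form(A, x), direct)   # both 23.0
```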

Figure 2: Illustration of convex, concave, and hyperbolic quadratic functions [surface and contour plots of the three cases]

    If xᵀAx > 0 for all x ≠ 0, A is positive definite: convex quadratic function
    If xᵀAx < 0 for all x ≠ 0, A is negative definite: concave quadratic function
    If xᵀAx is positive for some x and negative for others, A is indefinite: hyperbolic quadratic function

Figure 2 illustrates each of these three cases using both surface and contour plots of convex, concave, and hyperbolic quadratic functions. The quadratic functions corresponding to the left, center, and right plots, respectively, are f₁(x) = xᵀA₁x, f₂(x) = xᵀA₂x, and f₃(x) = xᵀA₃x, where:

    A₁ = [ 7    1.2 ]    A₂ = [ −7    −1.2 ]    A₃ = [ 5    2.6 ]
         [ 1.2  1   ]         [ −1.2  −1   ]         [ 2.6  −2  ]

It can also be demonstrated that if a matrix has all positive eigenvalues, i.e., λᵢ > 0 for all i, the matrix is positive definite. Similarly, if λᵢ < 0 for all i, the corresponding matrix is negative definite, and if the eigenvalues take both positive and negative values, the matrix is indefinite. Eigenvalues provide a convenient way of evaluating the properties of a quadratic function, but what exactly is the connection between eigenvalues and function geometry? We will demonstrate an intuitive interpretation of eigenvalues and illustrate it with a numerical example.
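For 2 × 2 matrices the eigenvalue test is easy to carry out from the characteristic polynomial. A sketch classifying the three matrices above (the signs in A₂ and A₃ follow the concave and indefinite cases as described; treat them as assumed values):

```python
import math

def eig2(A):
    """Eigenvalues of a symmetric 2x2 matrix from its characteristic polynomial."""
    a, b, c = A[0][0], A[0][1], A[1][1]
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    return (tr - disc) / 2, (tr + disc) / 2

def classify(A):
    lo, hi = eig2(A)
    if lo > 0:
        return "positive definite (convex)"
    if hi < 0:
        return "negative definite (concave)"
    return "indefinite (saddle)"

print(classify(((7.0, 1.2), (1.2, 1.0))))      # positive definite (convex)
print(classify(((-7.0, -1.2), (-1.2, -1.0))))  # negative definite (concave)
print(classify(((5.0, 2.6), (2.6, -2.0))))     # indefinite (saddle)
```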

Eigenvalues, Eigenvectors, and Geometry

An eigenvalue λ and corresponding eigenvector v of a matrix A satisfy the relation:

    Av = λv    (10)

Eigenvectors are vectors that result in a scalar multiple of themselves when pre-multiplied by the associated matrix. We can gain geometric intuition for eigenvalues and eigenvectors by shifting and rotating the coordinate system in which we view a quadratic function. This is a lengthy process, but the end result provides significant geometric insight.

We start with a general quadratic function (including constant and linear terms), and translate the coordinate axes to be centered at the function's stationary point by defining new coordinates z = x − x*. Note that this function's gradient is b + 2Ax, so the stationary point is x* = −½A⁻¹b. The vectors x and b have length n, and the matrix A has dimension n × n.

    f(x) = f₀ + xᵀb + xᵀAx
    f(z) = f₀ + (z + x*)ᵀb + (z + x*)ᵀA(z + x*)
    f(z) = (f₀ + x*ᵀb + x*ᵀAx*) + zᵀAz + zᵀ(b + 2Ax*)
    f(z) = f* + zᵀAz

For convenience, f* is defined as the function value at x*, and the last term was dropped in the final equation because of stationarity (i.e., b + 2Ax* = 0).

The coordinate system can be rotated by transforming (multiplying) the coordinate variables by a matrix. Consider the matrix V = [v₁ v₂ ... v_n], whose columns are the normalized eigenvectors of A. Using V to rotate the coordinates will align the coordinates with the eigenvectors of A. This rotation is effected through the multiplication p = Vᵀz, where p are the new coordinates. Since the normalized eigenvectors form an orthonormal basis, the matrix V is orthogonal, and the following identities hold:

    Vᵀ = V⁻¹,    VᵀV = VVᵀ = I

I is the identity matrix, the n × n matrix with ones on the diagonal and zeros elsewhere. We can use these identities and the definition of p to write z and zᵀ in terms of the rotated coordinates p:

    z = Iz = VVᵀz = Vp
    zᵀ = (Vp)ᵀ = pᵀVᵀ

Substituting these expressions for z and zᵀ into the last equation for f(z), we arrive at a new form of the original quadratic function in terms of the translated and rotated coordinates p:

    f(p) = f* + pᵀVᵀAVp

This expression can be simplified further by defining the matrix Λ = VᵀAV, which turns out to be a diagonal matrix whose entries are the eigenvalues of A. The function can be rewritten as:

    f(p) = f* + pᵀΛp    (11)

Since all off-diagonal terms are zero, the function can be written as a simple summation (equation 12). This final result enables a geometric interpretation of eigenvalues.

    f(p) = f* + Σᵢ₌₁ⁿ λᵢpᵢ²    (12)

This form provides an excellent geometric interpretation for eigenvalues and eigenvectors. If we move along an eigenvector direction (i.e., vary pᵢ), the function will decrease if λᵢ < 0, and increase if λᵢ > 0. This interpretation is congruent with the geometry associated with positive definite, negative definite, and indefinite matrices. If an eigenvalue is large, the rate of change of the function in the associated direction is large. The eigenvalues and eigenvectors of the functions in Figure 2 are shown below, and the eigenvector directions are plotted in Figure 3. Eigenvectors point along the axes of the level-set contour ellipses. Note that the eigenvectors associated with the larger-magnitude eigenvalues point along the minor axes of the level-set ellipses, since the function is steepest in the direction of the minor axes.
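The decoupled form of equation 12 can be spot-checked numerically for A₁ from Figure 2, using its rounded eigenvalues and normalized eigenvectors. At the original point x = v₁ + v₂, corresponding to rotated coordinates p = (1, 1), the function value should equal λ₁ + λ₂:

```python
A = ((7.0, 1.2), (1.2, 1.0))       # A1 from Figure 2
lam1, lam2 = 0.769, 7.231          # its eigenvalues (rounded)
v1 = (-0.189, 0.982)               # normalized eigenvector for lam1 (rounded)
v2 = (0.982, 0.189)                # normalized eigenvector for lam2 (rounded)

def quad_form(A, x):
    return sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

# In rotated coordinates, f(p) = lam1*p1^2 + lam2*p2^2; at p = (1, 1) this is
# lam1 + lam2, and the corresponding original point is x = v1 + v2.
x = (v1[0] + v2[0], v1[1] + v2[1])
print(quad_form(A, x), lam1 + lam2)   # both close to 8.0
```

The small mismatch between the two printed values comes only from the rounding of the eigenpairs.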

Figure 3: Eigenvector directions of quadratic functions from Figure 2

Function 1: v₁ = [−.189 .982]ᵀ, λ₁ = .769;  v₂ = [.982 .189]ᵀ, λ₂ = 7.23
Function 2: v₁ = [.982 .189]ᵀ, λ₁ = −7.23;  v₂ = [−.189 .982]ᵀ, λ₂ = −.769
Function 3: v₁ = [.949 .314]ᵀ, λ₁ = 5.86;  v₂ = [−.314 .949]ᵀ, λ₂ = −2.86

4 Problem Condition and Scaling

An objective function is more difficult to minimize when it is highly elliptical. A quantitative measure of this is the condition number C of a function (equation 13), defined as the ratio of the maximum to the minimum eigenvalue of the function's Hessian:

    C = λ_max / λ_min    (13)

A perfectly conditioned function has a condition number of 1, while ill-conditioned problems have very large condition numbers. Recall that large eigenvalues correspond to very steep function responses. Thus, an ill-conditioned function changes rapidly in some directions and very little in others. Also note that these directions of disparate sensitivity are not necessarily aligned with the coordinate axes.

The gradient method has particular difficulty with poorly conditioned problems. The influence of steep directions can drown out the influence of relatively flat directions. For example, if the algorithm is evaluating a point in a long, narrow valley, the gradient method could be numerically fooled into concluding that the gradient is zero, even if the point is far from the minimum. The directional derivative in the steep direction may in fact be zero if the point is at the low point of the valley, but since the derivative in the nearly flat direction is so small, machine precision limitations may lead to the incorrect identification of a zero gradient. When the gradient method is

stuck in such a valley, not much progress can be made with each step because of the relatively small gradient. Whether algorithm convergence is based on a zero gradient or on a sufficiently small step size, the gradient method may terminate before finding the solution because of poor scaling. In addition, it can be shown that each search direction of the gradient method (with exact line search) is orthogonal to the previous search direction. This results in a zig-zag route to the solution that requires many iterations.

What can be done to address this difficulty when using the gradient method on ill-conditioned problems? A common approach is to scale the variables such that the objective function is approximately equally sensitive to all variables. A simple scaling approach is to multiply each variable by a scalar such that its nominal value (or starting-point value) is equal to one. These scale factors can be collected into a scaling vector s, so that the scaled variables are computed elementwise: yᵢ = sᵢxᵢ. Here y is the vector of scaled variables. Similarly, in constrained optimization it is important to scale the objective and constraint function values so that they have similar magnitudes.

Scaling each variable individually works well when the eigenvectors are nearly aligned with the coordinate axes. Recall that this may not be the case: a function may have a steep direction that points somewhere between the coordinate axes. Such a function requires more sophisticated scaling to achieve reasonable conditioning. The interaction between variables must be considered, and a scaling matrix may be used to accomplish this, since the off-diagonal terms of a matrix can account for variable interaction. A useful class of scaling matrices are those that are symmetric and positive definite.
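As a small illustration of equation 13, the condition number of the Example 1 Hessian can be computed directly (a sketch; the helper assumes a symmetric 2 × 2 matrix):

```python
import math

def eig2(A):
    """Eigenvalues of a symmetric 2x2 matrix from its characteristic polynomial."""
    a, b, c = A[0][0], A[0][1], A[1][1]
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    return (tr - disc) / 2, (tr + disc) / 2

H = ((14.0, 2.4), (2.4, 2.0))   # Hessian of the Example 1 quadratic
lam_min, lam_max = eig2(H)
C = lam_max / lam_min
print(C)   # ~9.4: the elongated contours in Figure 1 reflect this
```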
For convenience we define S⁻¹ as a scaling matrix, and write:

    x = Sy    (14)

If we define the objective function in the new variable space as h(y) = f(Sy), then the gradient method iteration in the new space becomes:

    y_{k+1} = y_k − α∇h(y_k)    (15)

Although we could proceed using this formula, obtain a solution in terms of y, and convert the solution back to the original variable space using equation 14, it is instructive to recast equation 15 in terms of x. If we premultiply this equation by S, define S² = D_k, and use the chain rule to obtain the relation ∇h(y) = S∇f(x), we find after algebraic manipulation that:

    x_{k+1} = x_k − αD_k∇f(x_k)    (16)

This result is in fact a scaled gradient method iteration. The gradient scaling matrix for iteration k, D_k, will ensure descent if it is symmetric and positive definite. It turns out that the best scaling results are obtained if we set D_k to the inverse of the function's Hessian evaluated at x_k, i.e., D_k = H⁻¹(x_k). Observe that when this is the case, and if we set the step size α = 1, the scaled steepest descent algorithm becomes Newton's method for optimization, as defined in equation 6. This is the third and final motivation for Newton's method for unconstrained optimization that will be discussed in these lecture notes.

If the Hessian of the objective function is positive definite at a point, then Newton's method will produce descent for that iteration, since the scaled gradient method is guaranteed descent when D_k ≻ 0. The Hessian, however, may not be positive definite. Geometrically, when Newton's method operates in a region where the objective function is convex (i.e., H(x_k) ≻ 0) and that region includes a minimum, it will iteratively descend to that minimum. Conversely, if the region is concave (i.e., H(x_k) ≺ 0) with an associated maximum, Newton's method will ascend to the maximum.

Ideal scaling removes any ellipticity from the function, transforming the elliptical level sets of its contour plot into circular level sets.
Scaling a quadratic function with the inverse of its Hessian results in a perfectly conditioned function with circular level sets. Applying the gradient method to such a function locates the minimum in one step, since −∇f(x_k) points directly at the minimum. Since using the inverse of the Hessian to scale a function for the gradient method is the same as using Newton's method, this scenario is equivalent to applying Newton's method to the minimization of a quadratic function. Recall that Newton's method finds the minimum of a quadratic function in one step, since the quadratic approximation model is exact. Whether we view this situation as Newton's method applied to a quadratic function, or as the gradient method with ideal scaling, the result is the same: the solution is identified in one step.
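This equivalence can be sketched on the Example 1 quadratic: one scaled gradient step of equation 16 with D = H⁻¹ and α = 1, i.e., one Newton step, lands exactly on the minimizer (the starting point is an arbitrary choice):

```python
# One scaled gradient step x+ = x - alpha*D*grad(x) with D = H^{-1}, alpha = 1,
# on f(x) = 7*x1^2 + 2.4*x1*x2 + x2^2 from Example 1.
H = ((14.0, 2.4), (2.4, 2.0))
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
D = (( H[1][1] / det, -H[0][1] / det),        # explicit 2x2 inverse of H
     (-H[1][0] / det,  H[0][0] / det))

def grad(x):
    return (14 * x[0] + 2.4 * x[1], 2.4 * x[0] + 2 * x[1])

x = (10.0, 5.0)
g = grad(x)
x_new = (x[0] - (D[0][0] * g[0] + D[0][1] * g[1]),
         x[1] - (D[1][0] * g[0] + D[1][1] * g[1]))
print(x_new)   # (0.0, 0.0) up to rounding: the minimum in a single step
```

Compare this with the hundreds of fixed-step gradient iterations the same quadratic requires without scaling.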

5 Summary

A connection was established between optimality conditions and a geometric understanding of functions. Three approaches were used to motivate the use of Newton's method:

1. Sequential second-order function approximations
2. Newton's method for root finding applied to solving ∇f(x) = 0
3. Use of the objective function's Hessian to provide ideal scaling