EAD 115 Numerical Solution of Engineering and Scientific Problems David M. Rocke Department of Applied Science
Multidimensional Unconstrained Optimization Suppose we have a function f() of more than one variable, f(x1, x2, ..., xn). We want to find the values of x1, x2, ..., xn that give f() the largest (or smallest) possible value. Graphical solution is not possible in general, but a graphical picture (hilltops and contour maps) helps understanding.
Methods of solution Direct or non-gradient methods do not require derivatives: grid search, random search, one variable at a time, line searches and Powell's method, simplex optimization.
Gradient methods use first and possibly second derivatives. The gradient is the vector of first partial derivatives; the Hessian is the matrix of second partial derivatives. Methods: steepest ascent/descent, conjugate gradient, Newton's method, quasi-Newton methods.
Grid and Random Search Given a function and limits on each variable, generate a set of random points in the domain, and choose the one with the largest function value. Alternatively, divide the interval on each variable into small segments and check the function at all possible combinations.
f(x1, x2) = x2 - x1 - 2x1^2 - 2x1x2 - x2^2
-2 ≤ x1 ≤ 2,  1 ≤ x2 ≤ 3
f(-1, 1.5) = 1.25 (the maximum)
Direct Search with 10,000 Points

Method   x1       x2      f        E
Random   -0.985   1.486   1.2498   0.0199
Random   -0.989   1.493   1.2499   0.0131
Random   -1.003   1.490   1.2498   0.0107
Random   -0.992   1.486   1.2499   0.0157
Random   -1.002   1.498   1.2500   0.0027
Random   -0.998   1.499   1.2500   0.0024
Random   -1.015   1.520   1.2497   0.0255
Random   -0.999   1.493   1.2500   0.0070
Grid     -0.990   1.505   1.2497   0.0113
Features of Random and Grid Search: slow and inefficient; requires knowledge of the domain; works even for discontinuous functions; poor in high dimension. Grid search can be used iteratively, with progressively narrowing domains.
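Both searches can be sketched in a few lines of Python for the example function f(x1, x2) = x2 - x1 - 2x1^2 - 2x1x2 - x2^2 on -2 ≤ x1 ≤ 2, 1 ≤ x2 ≤ 3; the function names random_search and grid_search and the seed are my own choices for illustration.

```python
import random

def f(x1, x2):
    # Example function from the slides; maximum is f(-1, 1.5) = 1.25
    return x2 - x1 - 2*x1**2 - 2*x1*x2 - x2**2

def random_search(n=10000, seed=0):
    # Generate n random points in the domain; keep the best one
    rng = random.Random(seed)
    best = None
    for _ in range(n):
        x1 = rng.uniform(-2.0, 2.0)   # -2 <= x1 <= 2
        x2 = rng.uniform(1.0, 3.0)    #  1 <= x2 <= 3
        val = f(x1, x2)
        if best is None or val > best[0]:
            best = (val, x1, x2)
    return best

def grid_search(m=100):
    # Divide each interval into m segments and check every combination
    best = None
    for i in range(m + 1):
        for j in range(m + 1):
            x1 = -2.0 + 4.0 * i / m
            x2 = 1.0 + 2.0 * j / m
            val = f(x1, x2)
            if best is None or val > best[0]:
                best = (val, x1, x2)
    return best

print(random_search())   # close to (1.25, -1, 1.5)
print(grid_search())     # the 100x100 grid contains (-1, 1.5) exactly
```

Note the cost: both versions use about 10,000 function evaluations to get roughly two correct digits in each coordinate.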
Line searches Given a starting point and a direction, search for the maximum, or for a good next point, in that direction. This is equivalent to one-dimensional optimization, so we can use Newton's method or another method from the previous chapter. Different methods use different directions.
x = (x1, x2, ..., xn)
v = (v1, v2, ..., vn)
f(x) = f(x1, x2, ..., xn)
g(λ) = f(x + λv)
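The reduction g(λ) = f(x + λv) can be sketched in Python; the golden-section routine below is one simple stand-in for the one-variable methods of the previous chapter, and the function name, bracket, and tolerance are my own illustrative choices.

```python
import math

def line_search_max(f, x, v, lo=0.0, hi=1.0, tol=1e-8):
    """Maximize g(lam) = f(x + lam*v) on [lo, hi] by golden-section search."""
    g = lambda lam: f([xi + lam * vi for xi, vi in zip(x, v)])
    phi = (math.sqrt(5.0) - 1.0) / 2.0          # golden-ratio conjugate
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if g(c) > g(d):          # maximum lies in [a, d]
            b, d = d, c
            c = b - phi * (b - a)
        else:                    # maximum lies in [c, b]
            a, c = c, d
            d = a + phi * (b - a)
    lam = (a + b) / 2.0
    return lam, [xi + lam * vi for xi, vi in zip(x, v)]

# Example from the slides: search f = 2*x1*x2 + 2*x1 - x1^2 - 2*x2^2
# from (-1, 1) along the direction (6, -6); the maximum is at lam = 0.2
f = lambda x: 2*x[0]*x[1] + 2*x[0] - x[0]**2 - 2*x[1]**2
lam, xnew = line_search_max(f, [-1.0, 1.0], [6.0, -6.0])
print(lam, xnew)   # about 0.2 and (0.2, -0.2)
```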
One-Variable-at-a-Time Search Given a function f() of n variables, search in the direction in which only variable 1 changes. Then search from that point in the direction in which only variable 2 changes, etc. This is slow and inefficient in general. It can be sped up by searching in a pattern direction after every n single-variable changes.
Powell's Method If f() is quadratic, and if two points are found by line searches in the same direction from two different starting points, then the line joining the two ending points (a conjugate direction) heads toward the optimum. Since many functions we encounter are approximately quadratic near the optimum, this can be effective.
Start with a point x0 and two random directions h1 and h2. Search in the direction of h1 from x0 to find a new point x1. Search in the direction of h2 from x1 to find a new point x2. Let h3 be the direction joining x0 to x2. Search in the direction of h3 from x2 to find a new point x3. Search in the direction of h2 from x3 to find a new point x4. Search in the direction of h3 from x4 to find a new point x5.
Points x3 and x5 have been found by searching in the direction of h3 from two starting points x2 and x4. Call the direction joining x3 and x5 h4. Search in the direction of h4 from x5 to find a new point x6. The new point x6 will be exactly the optimum if f() is quadratic. The iterations can then be repeated. Errors are estimated by the change in x or in f().
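Powell's direction-set method is implemented in scipy; a short sketch, maximizing the quadratic f(x1, x2) = 2x1x2 + 2x1 - x1^2 - 2x2^2 used later in these slides by minimizing its negative (the starting point is arbitrary).

```python
from scipy.optimize import minimize

# Maximize the quadratic from the steepest-ascent example by minimizing
# its negative; the true maximum is at (x1, x2) = (2, 1) with f = 2
f = lambda x: 2*x[0]*x[1] + 2*x[0] - x[0]**2 - 2*x[1]**2
res = minimize(lambda x: -f(x), x0=[-1.0, 1.0], method="Powell")
print(res.x, -res.fun)   # close to [2, 1] and 2
```

As the slides suggest, the method does very well here because the objective is exactly quadratic.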
Nelder-Mead Simplex Algorithm A direct search method that uses simplices, which are triangles in dimension 2, pyramids (tetrahedra) in dimension 3, etc. At each iteration a new point is added, usually by reflecting the worst vertex through the face of the simplex whose vertices have the largest function values.
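A single reflection step of this idea can be sketched as follows; this is a simplified step for minimization, the helper name reflect_worst and the test function are my own, and the full algorithm also includes expansion, contraction, and shrink steps.

```python
import numpy as np

def reflect_worst(simplex, f):
    """One reflection step for minimization: reflect the worst vertex
    through the centroid of the opposite face, keeping it if it improves."""
    vals = [f(v) for v in simplex]
    worst = int(np.argmax(vals))
    face = [v for i, v in enumerate(simplex) if i != worst]
    centroid = np.mean(face, axis=0)
    reflected = centroid + (centroid - simplex[worst])
    if f(reflected) < vals[worst]:
        simplex[worst] = reflected
    return simplex

# Hypothetical example: one step on f(x, y) = x^2 + y^2 from a unit triangle
f = lambda v: v[0]**2 + v[1]**2
simplex = [np.array([1.0, 1.0]), np.array([2.0, 1.0]), np.array([1.0, 2.0])]
simplex = reflect_worst(simplex, f)
print([list(v) for v in simplex])   # the worst vertex (2, 1) moved to (0, 2)
```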
Gradient Methods The gradient of f() at a point x is the vector of partial derivatives of the function f() at x. For smooth functions, the gradient is zero at an optimum, but may also be zero at a non-optimum. The gradient points uphill. The gradient is orthogonal to the contour lines of the function at the point.
Directional Derivatives Given a point x in R^n, a unit direction v, and a function f() of n variables, we can define a new function g() of one variable by g(λ) = f(x + λv). The derivative g′(λ) is the directional derivative of f() at x in the direction of v. This is greatest when v is in the gradient direction.
x = (x1, x2, ..., xn)
v = (v1, v2, ..., vn), with v^T v = Σ vi^2 = 1
f(x) = f(x1, x2, ..., xn)
∇f = (∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xn)
g(λ) = f(x + λv)
g′(0) = (∇f)^T v = (∂f/∂x1) v1 + (∂f/∂x2) v2 + ... + (∂f/∂xn) vn
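The identity g′(0) = (∇f)^T v can be checked numerically; a Python sketch using the example f(x1, x2) = x1 x2^2 at (2, 2) from the next slide, with v taken as the unit vector in the gradient direction.

```python
import math

def f(x1, x2):
    return x1 * x2**2              # example from the slides

def grad_f(x1, x2):
    return (x2**2, 2 * x1 * x2)    # analytic gradient

x = (2.0, 2.0)
gx = grad_f(*x)                    # (4, 8)
norm = math.hypot(*gx)
v = (gx[0] / norm, gx[1] / norm)   # unit vector in the gradient direction

# Central-difference estimate of g'(0), where g(lam) = f(x + lam*v)
h = 1e-6
g = lambda lam: f(x[0] + lam * v[0], x[1] + lam * v[1])
num = (g(h) - g(-h)) / (2 * h)
exact = gx[0] * v[0] + gx[1] * v[1]    # (grad f)^T v; equals |grad f| here
print(num, exact)                      # both about 8.944 = sqrt(80)
```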
Steepest Ascent The gradient direction is the direction of steepest ascent, but not necessarily the direction leading directly to the summit We can search along the direction of steepest ascent until a maximum is reached Then we can search again from a new steepest ascent direction
f(x1, x2) = x1 x2^2 at (2, 2):   f(2, 2) = 8
f1(x1, x2) = ∂f/∂x1 = x2^2       f1(2, 2) = 4
f2(x1, x2) = ∂f/∂x2 = 2 x1 x2    f2(2, 2) = 8
∇f(2, 2) = (4, 8)
(2 + 4λ, 2 + 8λ) is the gradient line
g(λ) = f(2 + 4λ, 2 + 8λ) = (2 + 4λ)(2 + 8λ)^2
The Hessian The Hessian of a function f() is the matrix of second partial derivatives The gradient is always 0 at a maximum (for smooth functions) The gradient is also 0 at a minimum The gradient is also 0 at a saddle point, which is neither a maximum nor a minimum A saddle point is a max in at least one direction and a min in at least one direction
Max, Min, and Saddle Point For one-variable functions, the second derivative is negative at a maximum and positive at a minimum For functions of more than one variable, a zero of the gradient is a max if the second directional derivative is negative for every direction and is a min if the second directional derivative is positive for every direction
Positive Definiteness A matrix H is positive definite if x^T H x > 0 for every nonzero vector x. Equivalently, every eigenvalue of H is positive. (λ is an eigenvalue of H with eigenvector x if Hx = λx.) -H is positive definite if and only if every eigenvalue of H is negative.
Max, Min, and Saddle Point If the gradient ∇f of a function f is zero at a point x and the Hessian H is positive definite at that point, then x is a local min. If ∇f is zero at a point x and -H is positive definite at that point, then x is a local max. If ∇f is zero at a point x and neither H nor -H is positive definite there (and H is nonsingular), then x is a saddle point. The determinant of H alone helps only in dimension 1 or 2.
Finite-Difference Approximations If analytical derivatives cannot be evaluated, one can use finite-difference approximations. Centered-difference approximations are in general more accurate, though they require extra function evaluations. The increment is often macheps^(1/2), about 1e-8 in double precision. This can be problematic for large problems.
Complexity of Finite-Difference Derivatives In an n-variable problem, the function value is one function evaluation (FE). A finite-difference gradient is n FEs if forward or backward and 2n FEs if centered. A finite-difference Hessian is O(n^2) FEs. With a thousand-variable problem, this can be huge.
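The evaluation-count trade-off is easy to see in code; a minimal sketch, with fd_gradient as an assumed helper name: the forward version reuses one base evaluation (n + 1 FEs in total), while the centered version needs 2n.

```python
def fd_gradient(f, x, h=1e-6, centered=False):
    """Finite-difference gradient: n extra FEs forward, 2n centered."""
    n = len(x)
    g = [0.0] * n
    if centered:
        for i in range(n):
            xp = list(x); xp[i] += h
            xm = list(x); xm[i] -= h
            g[i] = (f(xp) - f(xm)) / (2 * h)
    else:
        f0 = f(x)                  # one FE shared by all n components
        for i in range(n):
            xp = list(x); xp[i] += h
            g[i] = (f(xp) - f0) / h
    return g

f = lambda x: 2*x[0]*x[1] + 2*x[0] - x[0]**2 - 2*x[1]**2
g = fd_gradient(f, [-1.0, 1.0], centered=True)
print(g)   # close to the analytic gradient (6, -6)
```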
Steepest Ascent/Descent This is the simplest of the gradient-based methods. From the current guess, compute the gradient. Search along the gradient direction until a local max of this one-dimensional function is reached. Repeat until convergence.
f(x1, x2) = 2 x1 x2 + 2 x1 - x1^2 - 2 x2^2
f1(x1, x2) = ∂f/∂x1 = 2 x2 + 2 - 2 x1
f2(x1, x2) = ∂f/∂x2 = 2 x1 - 4 x2
True optimum:
0 = 2 x2 + 2 - 2 x1
0 = 2 x1 - 4 x2
(x1, x2) = (2, 1)
H = [ -2   2 ]
    [  2  -4 ]
Eigenvalues If H is a matrix, we can find the eigenvalues in a number of ways We will examine numerical methods for this later, but there is an algebraic method for small matrices We illustrate this for the Hessian in this example
Hx = λx
(H - λI) x = 0
det(H - λI) = det [ -2-λ    2   ] = 0
                  [   2   -4-λ ]
(-2 - λ)(-4 - λ) - 4 = 0
λ^2 + 6λ + 4 = 0
λ = (-6 ± √(36 - 16)) / 2 = (-6 ± √20) / 2 = -3 ± √5
Both eigenvalues are negative, so the solution is a maximum.
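The algebra can be checked numerically; a sketch using numpy on the Hessian above.

```python
import numpy as np

H = np.array([[-2.0,  2.0],
              [ 2.0, -4.0]])       # Hessian from the example

eigvals = np.linalg.eigvalsh(H)    # H is symmetric, so eigvalsh applies
print(eigvals)                     # [-3 - sqrt(5), -3 + sqrt(5)]
print(all(eigvals < 0))            # True: -H is positive definite, a maximum
```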
f(x1, x2) = 2 x1 x2 + 2 x1 - x1^2 - 2 x2^2
f1(x1, x2) = 2 x2 + 2 - 2 x1
f2(x1, x2) = 2 x1 - 4 x2
Start at (-1, 1):
f(-1, 1) = -7
f1(-1, 1) = 2 x2 + 2 - 2 x1 = 6
f2(-1, 1) = 2 x1 - 4 x2 = -6
g(λ) = f(-1 + 6λ, 1 - 6λ)
g(λ) = f(-1 + 6λ, 1 - 6λ)
     = 2(-1 + 6λ)(1 - 6λ) + 2(-1 + 6λ) - (-1 + 6λ)^2 - 2(1 - 6λ)^2
     = -180 λ^2 + 72 λ - 7
g′(λ) = -360 λ + 72 = 0  →  λ = 0.2
x = (-1 + 6(0.2), 1 - 6(0.2)) = (0.2, -0.2)
f(2, 1) = 2 (the true maximum)
f(-1, 1) = -7
f(0.2, -0.2) = 0.2
f1(0.2, -0.2) = 1.2
f2(0.2, -0.2) = 1.2
g(λ) = f(0.2 + 1.2λ, -0.2 + 1.2λ) = -1.44 λ^2 + 2.88 λ + 0.2
g′(λ) = -2.88 λ + 2.88 = 0  →  λ = 1
x = (0.2 + 1.2, -0.2 + 1.2) = (1.4, 1)
f(1.4, 1) = 1.64
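The whole iteration can be sketched in Python. For a quadratic with Hessian H, the exact maximizer of g(λ) = f(x + λd) along d = ∇f is λ = -(dᵀd) / (dᵀHd); this closed form (an analytic shortcut, not part of the slides' general recipe) reproduces λ = 0.2 on the first step.

```python
import numpy as np

H = np.array([[-2.0, 2.0], [2.0, -4.0]])   # constant Hessian of the quadratic

def f(x):
    return 2*x[0]*x[1] + 2*x[0] - x[0]**2 - 2*x[1]**2

def grad(x):
    return np.array([2*x[1] + 2 - 2*x[0], 2*x[0] - 4*x[1]])

x = np.array([-1.0, 1.0])
for _ in range(200):
    d = grad(x)
    if np.linalg.norm(d) < 1e-12:          # converged
        break
    lam = -(d @ d) / (d @ H @ d)           # exact maximizer of g(lam)
    x = x + lam * d                        # first step: lam = 0.2, x = (0.2, -0.2)
print(x, f(x))   # converges to (2, 1) with f = 2
```

The zig-zag path of the iterates toward (2, 1) is the characteristic slow, linear convergence of steepest ascent.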
Practical Steepest Ascent In real examples, the maximum in the gradient direction cannot be calculated analytically Problem reduces to one dimensional optimization as a line search One can also use more primitive line searches that are fast but do not try to find the absolute optimum
Newton's Method Steepest ascent can be quite slow. Newton's method is faster, though it requires evaluation of the Hessian. The function is modeled by a quadratic at a point using first and second derivatives. The quadratic is solved exactly, and the solution is used as the next iterate.
A second-order multivariate Taylor series expansion at the current iterate is
f(x) ≈ f(xi) + ∇f(xi)^T (x - xi) + 0.5 (x - xi)^T Hi (x - xi)
At the optimum the gradient is 0, so
∇f(x) ≈ ∇f(xi) + Hi (x - xi) = 0
If Hi is invertible, then
x_{i+1} = xi - Hi^{-1} ∇f(xi)
In practice, solve the linear system
Hi x_{i+1} = Hi xi - ∇f(xi)
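A sketch of the Newton step on the steepest-ascent example; because that function is exactly quadratic, a single step lands on the optimum.

```python
import numpy as np

H = np.array([[-2.0, 2.0], [2.0, -4.0]])   # Hessian of the example quadratic

def grad(x):
    return np.array([2*x[1] + 2 - 2*x[0], 2*x[0] - 4*x[1]])

x = np.array([-1.0, 1.0])
# Solve H * (x_new - x) = -grad(x) instead of forming H^{-1} explicitly
x_new = x + np.linalg.solve(H, -grad(x))
print(x_new)   # (2, 1): one Newton step reaches the optimum of a quadratic
```

Compare with the steepest-ascent iterations above, which approach (2, 1) only gradually.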
Variations on Newton's Method Quasi-Newton methods use approximate Hessians that are built up as the iterations progress. There are several methods of doing this; the best is probably BFGS (Broyden-Fletcher-Goldfarb-Shanno). These methods do not require analytical or numerical Hessians to be calculated at each step.
The Marquardt algorithm uses a compromise between steepest ascent/descent and the Newton solution. The steepest descent direction is equivalent to using H = I in the Newton step (for minimization). Thus, if we use a Hessian of H + αI and gradually reduce α from a large value, we get mostly steepest descent at first, followed by more and more of the Newton direction.
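A sketch of the damping idea on the running example, written for minimization of F = -f so that the damped matrix HF + αI stays positive definite; the α schedule here is an arbitrary illustration, not a tuned algorithm.

```python
import numpy as np

# Work with F = -f, so the Hessian HF = -H is positive definite (minimization)
HF = np.array([[2.0, -2.0], [-2.0, 4.0]])

def gF(x):   # gradient of F = -f for the example quadratic
    return -np.array([2*x[1] + 2 - 2*x[0], 2*x[0] - 4*x[1]])

x = np.array([-1.0, 1.0])
for alpha in [100.0, 10.0, 1.0, 0.1, 0.0]:     # illustrative damping schedule
    # Large alpha: step is nearly (scaled) steepest descent; alpha = 0: pure Newton
    step = np.linalg.solve(HF + alpha * np.eye(2), -gF(x))
    x = x + step
print(x)   # the final alpha = 0 step lands on the optimum (2, 1)
```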
The trust region approach is an alternative to line searches. Rather than searching along the gradient, or moving directly to the Newton solution (which may diverge), the trust region approach finds the maximum/minimum of the quadratic model subject to a constraint on the step size. This also results in a mixture of the gradient and Newton directions until the full Newton step falls within the trust region.
Using Matlab to Find Optima fminbnd finds the minimum of a one-variable function. fminsearch finds the minimum of a multivariable function. fminunc (in the Optimization Toolbox) also finds minima of unconstrained functions.
fminbnd Finds the minimum of a function of one variable on a closed interval. Assumes the function is continuous. Uses a combination of golden-section search and quadratic interpolation. Exhibits slow convergence when the minimum is near the boundary; fmincon (in the Optimization Toolbox) is better in that case.
fminsearch Direct search method for functions of several variables Does not assume differentiability Can handle discontinuities Uses Nelder-Mead simplex algorithm Tends to be reliable but slow
fminunc Finds minima in unconstrained problems of moderate to large dimension. For medium-scale optimization it uses a quasi-Newton method with BFGS updates and a mixed quadratic-cubic line search. For large-scale optimization it uses a subspace trust-region method based on the interior-reflective Newton method with preconditioned conjugate gradients.
function f = fx(x)
f = -(2*sin(x) - x^2/10);

>> [x, fval] = fminbnd('fx', 0, 4)
x =
    1.4275
fval =
   -1.7757
function f = fxy(x)
f = -(2*x(1)*x(2) + 2*x(1) - x(1)^2 - 2*x(2)^2);

>> [x, fval] = fminsearch('fxy', [-1, 1])
x =
    1.9999    1.0000
fval =
   -2.0000

>> [x, fval] = fminunc('fxy', [-1, 1])
x =
    2.0000    1.0000
fval =
   -2.0000
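For readers without Matlab, roughly equivalent calls exist in Python's scipy (an approximate correspondence, not an exact one): minimize_scalar with method="bounded" plays the role of fminbnd, and Nelder-Mead and BFGS stand in for fminsearch and fminunc.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

fx = lambda x: -(2 * np.sin(x) - x**2 / 10)
res1 = minimize_scalar(fx, bounds=(0, 4), method="bounded")  # like fminbnd
print(res1.x, res1.fun)      # about 1.4275 and -1.7757

fxy = lambda x: -(2*x[0]*x[1] + 2*x[0] - x[0]**2 - 2*x[1]**2)
res2 = minimize(fxy, [-1, 1], method="Nelder-Mead")          # like fminsearch
res3 = minimize(fxy, [-1, 1], method="BFGS")                 # like fminunc
print(res2.x, res3.x)        # both near (2, 1), with minimum value -2
```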