Nonlinear Optimization
Pavel Kordík
Department of Computer Systems, Faculty of Information Technology, Czech Technical University in Prague
Jiří Kašpar, Pavel Tvrdík, 2011
Unconstrained nonlinear optimization, advanced quasi-Newton methods
Computer System Architectures, MI-POA, Lecture 12, 09/2011
European Social Fund Prague & EU: Investing in your future
Outline
Zero-order methods: random search, Powell's method
First-order methods: steepest descent, conjugate gradient
Second-order methods
Powell's method
Minimizes a function of multiple variables without computing the function's gradient. Proceeding from a starting point, choose some vector direction n and minimize f(p) along the line defined by n using one-dimensional methods. The critical part is choosing the next direction n.
Powell's method: algorithm
1. Start at x_0. Define a set of n search directions S_q, q = 1, ..., n, initially the coordinate unit vectors. Set x = x_0, y = x, q = 0.
2. Set q = q + 1. Find α* minimizing F(x_{q-1} + α S_q) and set x_q = x_{q-1} + α* S_q. If q < n, repeat this step.
3. Form the conjugate direction S_{n+1} = x_n - y. Find α* minimizing F(x_n + α S_{n+1}) and set x_{n+1} = x_n + α* S_{n+1}. One iteration thus costs n + 1 one-dimensional searches.
4. If converged, exit. Otherwise update the search directions by discarding the oldest (S_q = S_{q+1}, q = 1, ..., n), set y = x_{n+1}, and return to step 2 with q = 0.
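The loop above can be sketched numerically as follows. This is a minimal illustration, not a production implementation: the quadratic test function, the bracket [-10, 10], and the golden-section line search are all illustrative assumptions not taken from the slides.

```python
import numpy as np

def line_min(f, x, d, a=-10.0, b=10.0, tol=1e-8):
    """Golden-section search for the alpha minimizing f(x + alpha*d)."""
    g = (np.sqrt(5.0) - 1.0) / 2.0
    c1 = b - g * (b - a)
    c2 = a + g * (b - a)
    while b - a > tol:
        if f(x + c1 * d) < f(x + c2 * d):
            b, c2 = c2, c1
            c1 = b - g * (b - a)
        else:
            a, c1 = c1, c2
            c2 = a + g * (b - a)
    return 0.5 * (a + b)

def powell(f, x0, iters=20):
    x = np.asarray(x0, dtype=float)
    n = len(x)
    S = [np.eye(n)[q] for q in range(n)]        # start: coordinate unit vectors
    for _ in range(iters):
        y = x.copy()
        for q in range(n):                       # n one-dimensional searches
            x = x + line_min(f, x, S[q]) * S[q]
        d = x - y                                # pattern (conjugate) direction
        if np.linalg.norm(d) < 1e-8:             # converged
            break
        d = d / np.linalg.norm(d)
        x = x + line_min(f, x, d) * d            # the (n+1)-th search, along d
        S = S[1:] + [d]                          # discard the oldest direction
    return x

# illustrative quadratic with minimum at (1, 2)
f = lambda v: (v[0] - 1.0)**2 + 10.0 * (v[0] + v[1] - 3.0)**2
xmin = powell(f, [0.0, 0.0])
```

For a quadratic, the pattern directions become mutually conjugate, so the minimum is located after a few iterations even though no gradient is ever evaluated.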
[Figure: construction of a new search direction from the starting point P_0 through intermediate line minima P_1, P_2, the extrapolated point P_E, and direction u_2.]
Polynomial interpolation
Bracket the minimum. Fit a quadratic or cubic polynomial that interpolates f(x) at some points in the interval. Jump to the (easily obtained) minimum of the polynomial. Throw away the worst point and repeat the process.
Polynomial interpolation
Quadratic interpolation using 3 points, 2 iterations.
Other ways to interpolate: 2 points and one gradient; cubic interpolation.
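One step of the three-point quadratic fit can be sketched as follows (the test function and bracketing points are illustrative assumptions; for an exactly quadratic f the parabola's vertex is the true minimum, so a single step suffices):

```python
def quad_interp_step(f, a, b, c):
    """Minimizer of the parabola through (a, f(a)), (b, f(b)), (c, f(c)).

    With a bracket a < b < c where f(b) < f(a) and f(b) < f(c), the fitted
    parabola opens upward and its vertex lies inside (a, c).
    """
    fa, fb, fc = f(a), f(b), f(c)
    num = (b - a)**2 * (fb - fc) - (b - c)**2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

# for a quadratic function, one step recovers the minimizer x = 2 exactly
g = lambda x: (x - 2.0)**2 + 1.0
x_new = quad_interp_step(g, 0.0, 1.0, 3.0)
```

In the full method one would evaluate f(x_new), discard the worst of the four points, and repeat until the bracket is sufficiently small.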
Examples of quadratic functions
Case 1: both eigenvalues positive, H positive definite: a unique minimum.
Examples of quadratic functions
Case 2: eigenvalues of different sign, H indefinite: a saddle point.
Examples of quadratic functions
Case 3: one eigenvalue is zero, H positive semidefinite: a parabolic cylinder.
Optimization for quadratic functions
Consider f(x) = 1/2 x^T H x - b^T x and assume that H is positive definite. Then there is a unique minimum at x* = H^-1 b. If the dimension N is large, it is not feasible to perform this inversion directly.
Steepest descent
The basic principle is to minimize the N-dimensional function by a series of 1D line minimizations: x_{k+1} = x_k + α_k p_k. The steepest descent method chooses p_k to be parallel to the negative gradient: p_k = -∇f(x_k). The step size α_k is chosen to minimize f(x_k + α_k p_k). For the quadratic form f(x) = 1/2 x^T H x - b^T x there is a closed-form solution: α_k = (g^T g)/(g^T H g), where g = ∇f(x_k).
Steepest descent
The gradient is everywhere perpendicular to the contour lines. After each line minimization the new gradient is orthogonal to the previous step direction (true of any exact line minimization). Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
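A minimal steepest-descent sketch on an assumed 2D quadratic f(x) = 1/2 x^T H x - b^T x, using the closed-form step length; the matrix, right-hand side, and starting point are illustrative choices:

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])      # positive definite Hessian
b = np.array([1.0, -1.0])
x = np.array([5.0, 5.0])                     # illustrative starting point

for k in range(500):
    g = H @ x - b                            # gradient of 0.5 x^T H x - b^T x
    if np.linalg.norm(g) < 1e-10:
        break                                # converged
    alpha = (g @ g) / (g @ H @ g)            # exact line-search step length
    x = x - alpha * g                        # move along the negative gradient

x_star = np.linalg.solve(H, b)               # true minimizer H^-1 b
```

Each iteration reduces the error by roughly a constant factor governed by the condition number of H, which is why badly conditioned (long, narrow) valleys make the method crawl.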
Conjugate gradient
Each p_k is chosen to be conjugate to all previous search directions with respect to the Hessian H: p_i^T H p_j = 0 for i ≠ j. The resulting search directions are mutually linearly independent. Remarkably, p_k can be chosen using only knowledge of p_{k-1}, ∇f(x_{k-1}), and ∇f(x_k): p_k = -∇f(x_k) + β_k p_{k-1}, with, e.g., the Fletcher-Reeves choice β_k = ∇f(x_k)^T ∇f(x_k) / ∇f(x_{k-1})^T ∇f(x_{k-1}).
Conjugate gradient
An N-dimensional quadratic form can be minimized in at most N conjugate descent steps.
[Figure: 3 different starting points; the minimum of a 2D quadratic is reached in exactly 2 steps.]
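The N-step property can be checked numerically with a small linear-CG sketch (Fletcher-Reeves coefficient; the particular quadratic is an illustrative assumption):

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite, N = 2
b = np.array([1.0, -1.0])

x = np.zeros(2)
g = H @ x - b                  # gradient of 0.5 x^T H x - b^T x
p = -g                         # first direction: steepest descent
for k in range(2):             # at most N conjugate steps
    alpha = -(g @ p) / (p @ H @ p)       # exact minimization along p
    x = x + alpha * p
    g_new = H @ x - b
    beta = (g_new @ g_new) / (g @ g)     # Fletcher-Reeves coefficient
    p = -g_new + beta * p                # next direction, H-conjugate to p
    g = g_new
```

After exactly N = 2 steps, x agrees with the exact solution H^-1 b up to rounding error.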
Optimization for general functions
Apply the methods developed for quadratic functions, using a quadratic Taylor series expansion as a local model.
Rosenbrock's function
f(x, y) = 100 (y - x^2)^2 + (1 - x)^2, with minimum at [1, 1].
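A quick numeric check of the stated minimum, with the function and its analytic gradient written out:

```python
def rosen(x, y):
    """Rosenbrock's banana function."""
    return 100.0 * (y - x**2)**2 + (1.0 - x)**2

def rosen_grad(x, y):
    """Analytic gradient (partial derivatives of rosen)."""
    return (-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
            200.0 * (y - x**2))
```

At (1, 1) the function value is 0 and the gradient vanishes, confirming the minimum; the curved, narrow valley along y = x^2 is what makes this a standard stress test for the methods that follow.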
Steepest descent
The 1D line minimization must be performed using one of the earlier methods (usually cubic polynomial interpolation). The zig-zag behaviour is clear in the zoomed view; the algorithm crawls down the valley.
Conjugate gradient
Again, an explicit line minimization must be used at every step. The algorithm converges in 98 iterations, far superior to steepest descent.
Newton method
Expand f(x) by its Taylor series about the point x_k: f(x_k + δ) ≈ f(x_k) + g_k^T δ + 1/2 δ^T H_k δ, where g_k = ∇f(x_k) is the gradient vector and H_k = ∇²f(x_k) is the symmetric Hessian matrix.
Hessian matrix of f(x)
If f(x) is a C² function of n variables, its Hessian H(x) is the n × n matrix of second partial derivatives, H(x) = [∂²f/∂x_i ∂x_j], i, j = 1, ..., n. Since cross-partials are equal for a C² function, H(x) is a symmetric matrix.
Fin500J Topic 4
Conditions for a minimum or a maximum value of a function of several variables (cont.)
Let f(x) be a C² function in R^n and suppose that x* is a critical point of f(x), i.e., ∇f(x*) = 0.
1. If the Hessian H(x*) is a positive definite matrix, then x* is a local minimum of f(x).
2. If the Hessian H(x*) is a negative definite matrix, then x* is a local maximum of f(x).
3. If the Hessian H(x*) is an indefinite matrix, then x* is neither a local maximum nor a local minimum of f(x).
Pavel Kordík (ČVUT FIT), Nonlinear Optimization, MI-NON, 2011
Example
f(x, y) = x^3 - y^3 + 9xy. Find the local maxima and minima of f(x, y).
First, compute the first-order partial derivatives (i.e., the gradient of f(x, y)) and set them to zero:
∇f(x, y) = (3x^2 + 9y, 9x - 3y^2) = (0, 0).
The critical points are (0, 0) and (3, -3).
Example (cont.)
We now compute the Hessian of f(x, y):
H(x, y) = [[6x, 9], [9, -6y]].
The first-order leading principal minor is 6x and the second-order leading principal minor is det H = -36xy - 81.
At (0, 0), these two minors are 0 and -81. Since the second-order leading principal minor is negative, the Hessian is indefinite and (0, 0) is a saddle point of f(x, y), i.e., neither a max nor a min.
At (3, -3), these two minors are 18 and 243, so the Hessian is positive definite and (3, -3) is a local min of f(x, y). Is (3, -3) a global min?
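For the example f(x, y) = x^3 - y^3 + 9xy, the classification can be double-checked by evaluating the eigenvalues of the Hessian at each critical point (numpy is used for the eigenvalue computation):

```python
import numpy as np

def grad(x, y):                  # gradient of f(x, y) = x^3 - y^3 + 9xy
    return np.array([3.0 * x**2 + 9.0 * y, 9.0 * x - 3.0 * y**2])

def hess(x, y):                  # Hessian of the same f
    return np.array([[6.0 * x, 9.0], [9.0, -6.0 * y]])

eig_saddle = np.linalg.eigvalsh(hess(0.0, 0.0))    # eigenvalues -9, 9
eig_min = np.linalg.eigvalsh(hess(3.0, -3.0))      # eigenvalues 9, 27
```

The gradient vanishes at both points; the mixed-sign eigenvalues at (0, 0) confirm the saddle, and the two positive eigenvalues at (3, -3) confirm the local minimum. The minimum is only local: f(0, y) = -y^3 is unbounded below, so f has no global minimum.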
Newton method
For a minimum we require that ∇f(x_k + δ) = 0, and so H_k δ = -g_k, with solution δ = -H_k^-1 g_k. This gives the iterative update x_{k+1} = x_k - H_k^-1 g_k.
If f(x) is quadratic, the solution is found in one step. The method has quadratic convergence (as in the 1D case). If H_k is positive definite, the Newton step is guaranteed to be a downhill direction. Rather than jumping straight to the predicted minimum, it is better to perform a line minimization along the Newton direction, which helps ensure global convergence. If H = I, this reduces to steepest descent.
Newton method: example
The algorithm converges in only 18 iterations, compared to 98 for conjugate gradients. However, the method requires computing the Hessian matrix at each iteration, which is not always feasible.
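A Newton sketch on Rosenbrock's function, with the analytic gradient and Hessian written out and a crude step-halving safeguard in place of a full line minimization. The start point (-1.2, 1) is the customary test start for this function, an assumption not stated on the slide:

```python
import numpy as np

def f(v):
    x, y = v
    return 100.0 * (y - x**2)**2 + (1.0 - x)**2

def grad(v):
    x, y = v
    return np.array([-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
                     200.0 * (y - x**2)])

def hess(v):
    x, y = v
    return np.array([[1200.0 * x**2 - 400.0 * y + 2.0, -400.0 * x],
                     [-400.0 * x, 200.0]])

v = np.array([-1.2, 1.0])                      # customary test start
for k in range(100):
    g = grad(v)
    if np.linalg.norm(g) < 1e-10:
        break                                   # converged
    d = np.linalg.solve(hess(v), -g)            # Newton direction: H d = -g
    t = 1.0
    while f(v + t * d) > f(v) and t > 1e-12:    # halve until f decreases
        t *= 0.5
    v = v + t * d
```

Note that the Hessian must be formed and factorized at every iteration; the quasi-Newton methods on the next slides avoid exactly this cost.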
Quasi-Newton methods
If the problem size is large and the Hessian matrix is dense, it may be infeasible or inconvenient to compute it directly. Quasi-Newton methods avoid this problem by keeping a rolling estimate of H(x), updated at each iteration using new gradient information. Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and to Davidon, Fletcher and Powell (DFP). The idea is based on the fact that for quadratic functions g_{k+1} - g_k = H (x_{k+1} - x_k) holds, so by accumulating the g_k's and x_k's we can estimate H.
Quasi-Newton BFGS method
Set H_0 = I. Update according to
H_{k+1} = H_k + (γ_k γ_k^T)/(γ_k^T δ_k) - (H_k δ_k δ_k^T H_k)/(δ_k^T H_k δ_k),
where δ_k = x_{k+1} - x_k and γ_k = g_{k+1} - g_k.
The matrix inverse can also be updated directly in this way. The directions δ_k form a conjugate set. H_{k+1} is positive definite if H_k is positive definite and the curvature condition γ_k^T δ_k > 0 holds, which a suitable line search guarantees. The estimate H_k is used to form a local quadratic approximation as before.
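An inverse-Hessian BFGS sketch with a simple Armijo backtracking line search. The Rosenbrock test function, the start point, and all tolerances are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def rosen(v):
    x, y = v
    return 100.0 * (y - x**2)**2 + (1.0 - x)**2

def rosen_grad(v):
    x, y = v
    return np.array([-400.0 * x * (y - x**2) - 2.0 * (1.0 - x),
                     200.0 * (y - x**2)])

def bfgs(f, g, x0, iters=200):
    n = len(x0)
    B = np.eye(n)                      # inverse-Hessian estimate, B_0 = I
    x = np.asarray(x0, dtype=float)
    gx = g(x)
    for _ in range(iters):
        if np.linalg.norm(gx) < 1e-8:
            break                      # converged
        p = -B @ gx                    # quasi-Newton search direction
        t = 1.0                        # Armijo backtracking line search
        while f(x + t * p) > f(x) + 1e-4 * t * (gx @ p):
            t *= 0.5
        s = t * p                      # step        delta_k
        x_new = x + s
        y = g(x_new) - gx              # gradient change  gamma_k
        sy = s @ y
        if sy > 1e-12:                 # curvature condition keeps B pos. definite
            rho = 1.0 / sy
            I = np.eye(n)
            B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, gx = x_new, g(x_new)
    return x

xmin = bfgs(rosen, rosen_grad, [-1.2, 1.0])
```

Only gradients are ever evaluated; the curvature information is accumulated from the (δ_k, γ_k) pairs, which is precisely the rolling-estimate idea described above.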
BFGS example
The method converges in 34 iterations, compared to 18 for the full Newton method.
Non-linear least squares
It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals: f(x) = Σ_i r_i(x)^2, i = 1, ..., M. If each residual depends non-linearly on the parameters x, then the minimization of f(x) is a non-linear least squares problem.
Non-linear least squares
The M × N Jacobian of the vector of residuals r is defined as J(x) = [∂r_i/∂x_j], i = 1, ..., M, j = 1, ..., N. Consider ∂f/∂x_j = 2 Σ_i r_i ∂r_i/∂x_j. Hence ∇f(x) = 2 J^T r.
Non-linear least squares
For the Hessian we have ∇²f(x) = 2 J^T J + 2 Σ_i r_i ∇²r_i; the Gauss-Newton approximation keeps only the first term, ∇²f(x) ≈ 2 J^T J. Note that the second-order term in the Hessian is multiplied by the residuals r_i. In most problems the residuals are typically small, and at the minimum they are typically distributed with mean 0. For these reasons, the second-order term is often ignored; hence, explicit computation of the full Hessian can again be avoided.
Gauss-Newton example
The minimization of the Rosenbrock function f(x, y) = 100 (y - x^2)^2 + (1 - x)^2 can be written as a least-squares problem with residual vector r(x, y) = (10 (y - x^2), 1 - x).
Gauss-Newton example
Minimization with the Gauss-Newton approximation and line search takes only 11 iterations.
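A Gauss-Newton sketch using the Rosenbrock residuals r = (10 (y - x^2), 1 - x). For simplicity this version takes pure Gauss-Newton steps with no line search, so its iteration count differs from the slide's figure; the start point (-1.2, 1) is an illustrative assumption:

```python
import numpy as np

def r(v):                        # residuals: f = r1^2 + r2^2 is Rosenbrock
    x, y = v
    return np.array([10.0 * (y - x**2), 1.0 - x])

def J(v):                        # 2 x 2 Jacobian of the residual vector
    x, y = v
    return np.array([[-20.0 * x, 10.0],
                     [-1.0, 0.0]])

v = np.array([-1.2, 1.0])
for k in range(50):
    res = r(v)
    Jv = J(v)
    g = 2.0 * Jv.T @ res                        # gradient  2 J^T r
    if np.linalg.norm(g) < 1e-10:
        break                                   # converged
    d = np.linalg.solve(Jv.T @ Jv, -Jv.T @ res) # Gauss-Newton step
    v = v + d
```

Only first derivatives of the residuals are needed: J^T J stands in for the Hessian, so the second-order residual terms never have to be computed.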
Comparison
[Figure: iterate paths on Rosenbrock's function for the Newton, CG, Quasi-Newton, and Gauss-Newton methods.]