Practical Optimization: Basic Multidimensional Gradient Methods


1 Practical Optimization: Basic Multidimensional Gradient Methods László Kozma Helsinki University of Technology S Postgraduate Seminar on Signal Processing

2 Contents
Recap
- Basic principles
- Properties of algorithms
- One-dimensional optimization
Multidimensional optimization
- Overview
- Steepest descent
- Newton method
- Gauss-Newton method
Homework

3 Recap: Basic Principles
- Tools: gradient, Hessian, Taylor series
- Extrema of functions: weak/strong, local/global
- Necessary and sufficient conditions
- Stationary points: minimum/maximum/saddle; classify them by characterizing the Hessian
- Convex/concave functions

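4 Recap: Properties of Algorithms
- Point-to-point mappings
- Iterative: $x_k \to x_{k+1}$
- Descent: $f(x_{k+1}) < f(x_k)$
- Convergence of an algorithm
  - Convergent
  - Convergent to a solution point
- Rate of convergence: $0 \le \beta < \infty$, where $\beta = \lim_{k \to \infty} \frac{\|x_{k+1} - \hat{x}\|}{\|x_k - \hat{x}\|^p}$ and $p$ is the order of convergence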

5 Recap: One-Dimensional Optimization
Basic problem: minimize $F = f(x)$ where $x_L \le x \le x_U$, knowing that $f(x)$ has a single minimum in this range.
- Search methods: repeatedly reduce the bracket
  - Dichotomous search
  - Fibonacci search
  - Golden-section search
- Approximation methods: approximate the function with a low-order polynomial
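As an illustration of the bracket-reduction idea, here is a minimal golden-section search sketch. The function name `golden_section_search`, the tolerance, and the test function are assumptions made for this example; they do not come from the slides.

```python
import math

def golden_section_search(f, x_lower, x_upper, tol=1e-6):
    """Shrink the bracket [x_lower, x_upper] around the single minimum of f."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # reciprocal of the golden ratio, about 0.618
    a, b = x_lower, x_upper
    # Two interior points placed symmetrically inside the bracket.
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # The minimum lies in [a, d]; the old c is reused as the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The minimum lies in [c, b]; the old d is reused as the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)

# Hypothetical test function with its single minimum at x = 2.
x_min = golden_section_search(lambda x: (x - 2.0) ** 2 + 1.0, 0.0, 5.0)
```

Each pass keeps one interior point and evaluates `f` only once, which is why the interior points are placed at golden-ratio fractions of the bracket.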

6 Multidimensional Optimization: Overview
- Constrained optimization: usually reduced to unconstrained
- Unconstrained optimization
  - Search methods
    - Perform only function evaluations
    - Explore the parameter space in an organized manner
    - Very inefficient, used only when gradient information is not available
  - Gradient methods
    - First-order (use $g$)
    - Second-order (use $g$ and $H$)

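7 Steepest-Descent Method
Minimize $F = f(x)$ for $x \in E^n$.
We have from the Taylor series:
$F + \Delta F = f(x + \delta) \approx f(x) + g^T \delta + \frac{1}{2} \delta^T H \delta$
$\Delta F \approx g^T \delta$
with $g = [g_1\ g_2\ \ldots\ g_n]^T$ and $\delta = [\delta_1\ \delta_2\ \ldots\ \delta_n]^T$, so that
$\Delta F \approx \sum_{i=1}^{n} g_i \delta_i = \|g\| \, \|\delta\| \cos\theta$
where $\theta$ is the angle between $g$ and $\delta$.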

8 Steepest-descent

9 Steepest-Descent Method
- Assuming $f$ is continuous around $x$
- Steepest-descent direction: $d = -g$
- The change $\delta$ in $x$ is given by $\delta = \alpha d$
- If $\alpha$ is small, the step will decrease the value of $f$
- To obtain the maximum reduction, solve the one-dimensional problem: minimize over $\alpha$ the function $F = f(x + \alpha d)$
- Usually this search does not give the minimizer of the original $f$
- Therefore we need to perform it iteratively
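The iteration described on this slide can be sketched as follows. This is a minimal sketch under stated assumptions: the line search is a plain golden-section search on the interval `[0, alpha_max]`, and `alpha_max`, the tolerances, and the quadratic test function are hypothetical choices for illustration only.

```python
import math

def golden_section(phi, a, b, tol=1e-10):
    """Minimise a unimodal one-dimensional function phi on [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    while b - a > tol:
        # Re-evaluates both interior points each pass: simple, not evaluation-optimal.
        if phi(c) < phi(d):
            b = d
        else:
            a = c
        c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    return 0.5 * (a + b)

def steepest_descent(f, grad, x0, alpha_max=1.0, tol=1e-6, max_iter=500):
    """Repeat: take d = -g, then line-search alpha along d (alpha_max is an assumed bracket)."""
    x = list(x0)
    iterations = 0
    while iterations < max_iter:
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break  # gradient is (numerically) zero: stationary point reached
        d = [-gi for gi in g]
        alpha = golden_section(
            lambda a: f([xi + a * di for xi, di in zip(x, d)]), 0.0, alpha_max)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        iterations += 1
    return x, iterations

# Hypothetical test problem with minimiser (1, -2).
def f(x):
    return (x[0] - 1.0) ** 2 + 4.0 * (x[1] + 2.0) ** 2

def grad(x):
    return [2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)]

x_star, iterations = steepest_descent(f, grad, [0.0, 0.0])
```

A single line search does not land on the minimiser because the contours of this test function are elliptical, so several iterations are needed.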

10 Steepest-Descent method

11 Orthogonality of directions

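12 Finding α
- Line search (see Chapter 4)
- Analytical solution:
  $f(x_k + \delta_k) \approx f(x_k) + \delta_k^T g_k + \frac{1}{2} \delta_k^T H_k \delta_k$
  $\delta_k = -\alpha g_k$ (steepest-descent direction)
  $\frac{d f(x_k - \alpha g_k)}{d\alpha} = 0 \;\Rightarrow\; \alpha = \alpha_k \approx \frac{g_k^T g_k}{g_k^T H_k g_k}$
- The approximation is accurate if $\delta_k$ is small or $f$ is quadratic.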
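A small numerical check of the analytical step length may help. The sketch below assumes a hypothetical quadratic test function, for which the formula is exact; the helper names `dot`, `mat_vec`, and `analytic_step` are made up for this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(A, v):
    return [dot(row, v) for row in A]

def analytic_step(g, H):
    """alpha = (g^T g) / (g^T H g), the minimiser of the quadratic model along -g."""
    return dot(g, g) / dot(g, mat_vec(H, g))

# Hypothetical quadratic f(x) = (x1 - 1)^2 + 4 (x2 + 2)^2 with constant Hessian.
H = [[2.0, 0.0], [0.0, 8.0]]

def grad(x):
    return [2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)]

x_k = [0.0, 0.0]
g_k = grad(x_k)                      # [-2, 16]
alpha_k = analytic_step(g_k, H)      # 260 / 2056
x_next = [xi - alpha_k * gi for xi, gi in zip(x_k, g_k)]
g_next = grad(x_next)
```

Because the step is the exact line minimiser here, `g_next` is orthogonal to `g_k`, which is the orthogonality of successive directions named on slide 11.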

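13 Finding α
$\alpha = \alpha_k \approx \frac{g_k^T g_k}{g_k^T H_k g_k}$
If the Hessian is not available, approximate the denominator:
- Take a trial step $\hat{\alpha}$ (for example, the value from the previous iteration) and evaluate $\hat{f} = f(x_k - \hat{\alpha} g_k)$
- From the Taylor series: $\hat{f} \approx f_k - \hat{\alpha} g_k^T g_k + \frac{1}{2} \hat{\alpha}^2 g_k^T H_k g_k$
- Hence: $g_k^T H_k g_k \approx \frac{2(\hat{f} - f_k + \hat{\alpha} g_k^T g_k)}{\hat{\alpha}^2}$
- Plug it into $\alpha_k$: $\alpha_k \approx \frac{g_k^T g_k \, \hat{\alpha}^2}{2(\hat{f} - f_k + \hat{\alpha} g_k^T g_k)}$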
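The Hessian-free estimate can be sketched in a few lines. The function name `estimate_step`, the trial value `alpha_hat = 0.1`, and the quadratic test function are assumptions for illustration; on a quadratic the estimate coincides with the analytical step for any nonzero trial value.

```python
def estimate_step(f, x, g, alpha_hat):
    """Estimate alpha_k from one extra function evaluation at x - alpha_hat * g."""
    gg = sum(gi * gi for gi in g)
    f_k = f(x)
    f_hat = f([xi - alpha_hat * gi for xi, gi in zip(x, g)])
    # g^T H g is replaced by 2 (f_hat - f_k + alpha_hat g^T g) / alpha_hat^2.
    return gg * alpha_hat ** 2 / (2.0 * (f_hat - f_k + alpha_hat * gg))

# Hypothetical quadratic test function and its gradient at x_k = (0, 0).
def f(x):
    return (x[0] - 1.0) ** 2 + 4.0 * (x[1] + 2.0) ** 2

x_k = [0.0, 0.0]
g_k = [-2.0, 16.0]
alpha_est = estimate_step(f, x_k, g_k, alpha_hat=0.1)
```

For a non-quadratic function the result depends on `alpha_hat`, so the estimate is only as good as the local quadratic model.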

14 Convergence of Steepest-descent
Provided that:
- $f(x) \in C^2$ has a local minimiser $x^*$
- the Hessian is positive definite at $x^*$
- $x_k$ is sufficiently close to $x^*$
then
$\frac{f(x_{k+1}) - f(x^*)}{f(x_k) - f(x^*)} \le \left(\frac{1 - r}{1 + r}\right)^2$, where $r = \frac{\text{smallest eigenvalue of } H_k}{\text{largest eigenvalue of } H_k}$
- Linear convergence (rate depends on $H_k$)
- Convergence is fast if the eigenvalues are nearly equal ($r$ close to 1, contours nearly circular)
- Consequence: scaling of variables can help
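The bound can be checked numerically on a simple case. The sketch below assumes a hypothetical diagonal quadratic $f(x) = \frac{1}{2} \sum_i h_i x_i^2$ (so $x^* = 0$ and $f(x^*) = 0$) and uses the exact step length; the function name and starting point are illustrative choices.

```python
def exact_step_descent_ratios(h_diag, x0, iterations=10):
    """Steepest descent with exact steps on f(x) = 0.5 * sum(h_i * x_i^2).

    Returns the per-iteration ratios f(x_{k+1}) / f(x_k); f(x*) = 0 here.
    """
    def f(x):
        return 0.5 * sum(h * xi * xi for h, xi in zip(h_diag, x))

    x = list(x0)
    ratios = []
    for _ in range(iterations):
        g = [h * xi for h, xi in zip(h_diag, x)]
        # Exact line-search step for a quadratic: (g^T g) / (g^T H g).
        alpha = sum(gi * gi for gi in g) / sum(h * gi * gi for h, gi in zip(h_diag, g))
        x_next = [xi - alpha * gi for xi, gi in zip(x, g)]
        ratios.append(f(x_next) / f(x))
        x = x_next
    return ratios

h_diag = [1.0, 10.0]                      # Hessian eigenvalues (assumed example)
r = min(h_diag) / max(h_diag)
bound = ((1.0 - r) / (1.0 + r)) ** 2
ratios = exact_step_descent_ratios(h_diag, [1.0, 1.0])
```

Every entry of `ratios` stays below `bound`; rescaling the second variable so that both eigenvalues match would make `r` equal to 1 and the bound equal to 0.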

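15 Newton Method
Quadratic approximation using the Taylor series:
$f(x + \delta) \approx f(x) + \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \delta_i + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial^2 f}{\partial x_i \partial x_j} \delta_i \delta_j$
$f(x + \delta) \approx f(x) + g^T \delta + \frac{1}{2} \delta^T H \delta$
Differentiate with respect to $\delta_k$ ($k = 1, 2, \ldots, n$) and set to 0. We obtain
$g = -H\delta$
The optimum change is $\delta = -H^{-1} g$.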

16 Newton Method
$\delta = -H^{-1} g$ is the Newton direction. A solution exists if:
- The Hessian is nonsingular
  - Follows from the 2nd-order sufficiency conditions at $x^*$ (if a minimum exists and we are close to it)
  - Otherwise $H$ can be forced to become positive definite (which implies nonsingular)
- The Taylor approximation is valid
  - If this holds exactly (quadratic $f$), the minimum is reached in one step
  - Otherwise an iterative approach is needed (similarly to steepest descent)
If $H$ is not positive definite, the update may not yield a reduction.
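A minimal two-variable sketch of the Newton iteration follows. It takes full steps with no line search and no Hessian modification, solves the 2x2 system by Cramer's rule, and uses two hypothetical test functions (one quadratic, one not); all names are assumptions for this example.

```python
import math

def solve_2x2(H, g):
    """Solve H d = g for a 2x2 matrix H by Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if det == 0.0:
        raise ValueError("Hessian is singular; the Newton direction is undefined")
    return [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration x <- x - H^{-1} g for two variables."""
    x = list(x0)
    steps = 0
    while steps < max_iter:
        g = grad(x)
        if math.hypot(g[0], g[1]) < tol:
            break
        d = solve_2x2(hess(x), g)
        x = [x[0] - d[0], x[1] - d[1]]   # delta = -H^{-1} g
        steps += 1
    return x, steps

# Quadratic example: f = (x1 - 1)^2 + 4 (x2 + 2)^2 + x1 x2.
x_quad, steps_quad = newton(
    lambda x: [2.0 * (x[0] - 1.0) + x[1], 8.0 * (x[1] + 2.0) + x[0]],
    lambda x: [[2.0, 1.0], [1.0, 8.0]],
    [0.0, 0.0])

# Non-quadratic example: f = exp(x1) - x1 + x2^2, minimiser (0, 0).
x_exp, steps_exp = newton(
    lambda x: [math.exp(x[0]) - 1.0, 2.0 * x[1]],
    lambda x: [[math.exp(x[0]), 0.0], [0.0, 2.0]],
    [1.0, 1.0])
```

The quadratic case stops after a single step, while the non-quadratic case needs a handful of iterations whose error shrinks rapidly once the iterate is close to the minimiser.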

17 Newton method

18 Newton Method: Convergence
- Initially slow, becomes fast close to the solution
- Complementary to steepest descent
- Order of convergence: 2
- Main drawback: requires computing $H^{-1}$

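19 Modification of the Hessian
How to make $H_k$ positive definite?
(1) Goldfeld, Quandt, Trotter's method
$\hat{H}_k = \frac{H_k + \beta I_n}{1 + \beta}$
- If $H_k$ is positive definite, $\beta$ is set to a small value, so $\hat{H}_k \approx H_k$.
- If $H_k$ is not positive definite, $\beta$ is set to a large value, so $\hat{H}_k \approx I_n$ and the Newton method reduces to steepest descent.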
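The effect of $\beta$ can be seen on a small indefinite matrix. In this sketch the function names, the 2x2 positive-definiteness test (leading principal minors), and the values of `beta` are assumptions chosen for illustration; the slides do not say how $\beta$ should be selected.

```python
def goldfeld_modify(H, beta):
    """Return (H + beta * I) / (1 + beta) for a square matrix H."""
    n = len(H)
    return [[(H[i][j] + (beta if i == j else 0.0)) / (1.0 + beta)
             for j in range(n)] for i in range(n)]

def is_positive_definite_2x2(H):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return H[0][0] > 0.0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0.0

H = [[1.0, 2.0], [2.0, 1.0]]           # eigenvalues 3 and -1: indefinite

H_small = goldfeld_modify(H, 1e-3)     # close to H, still indefinite
H_mid = goldfeld_modify(H, 2.0)        # eigenvalues 5/3 and 1/3: positive definite
H_large = goldfeld_modify(H, 1e6)      # close to the identity matrix
```

The eigenvalues of the modified matrix are $(\lambda_i + \beta)/(1 + \beta)$, so in this example any $\beta > 1$ is enough to make it positive definite.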

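20 Modification of the Hessian
(2) Zwart's method
$\hat{H}_k = U^T H_k U + \varepsilon$
where:
- $U^T U = I_n$
- $\varepsilon$ is a diagonal $n \times n$ matrix with elements $\varepsilon_i$
- $U^T H_k U$ is diagonal with elements $\lambda_i$ (the eigenvalues of $H_k$)
Then $\hat{H}_k$ is diagonal with elements $\lambda_i + \varepsilon_i$. We can set:
- $\varepsilon_i = 0$ if $\lambda_i > 0$
- $\varepsilon_i = \delta - \lambda_i$ if $\lambda_i \le 0$, so that the modified element equals $\delta$
This way we ignore components due to negative eigenvalues, while preserving convergence properties.
$U^T H_k U$ is formed by solving $\det(H_k - \lambda I_n) = 0$, which is time-consuming.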

21 Modification of the Hessian
(3) Matthews, Davies method
- Practical algorithm based on Gaussian elimination
- Deduce $D = L H_k L^T$ ($D$ diagonal, $L$ lower triangular)
- $H_k$ is positive definite iff $D$ is positive definite (see earlier)
- If $D$ is not positive definite, replace each nonpositive element with a positive element, to obtain $\hat{D}$
- Then $\hat{H}_k = L^{-1} \hat{D} (L^T)^{-1}$
- The Newton direction: $d_k = -\hat{H}_k^{-1} g_k = -L^T \hat{D}^{-1} L g_k$
- The exact algorithm is somewhat involved
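The idea can be illustrated with a simplified sketch; this is not the exact Matthews-Davies algorithm. It assumes an unpivoted factorization $H = M D M^T$ with $M$ unit lower triangular (so $M = L^{-1}$ in the slide's notation), assumes no pivot is exactly zero, and replaces each nonpositive diagonal element by its absolute value, which is one arbitrary choice of "positive element". All function names are hypothetical.

```python
def ldl_factor(H):
    """Return unit lower-triangular M and diagonal d with H = M diag(d) M^T (no pivoting)."""
    n = len(H)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = H[j][j] - sum(M[j][k] ** 2 * d[k] for k in range(j))
        if d[j] == 0.0:
            raise ZeroDivisionError("zero pivot: this simplified sketch cannot continue")
        for i in range(j + 1, n):
            M[i][j] = (H[i][j] - sum(M[i][k] * M[j][k] * d[k] for k in range(j))) / d[j]
    return M, d

def modified_newton_direction(H, g, floor=1e-6):
    """Solve (M D_hat M^T) x = -g, where D_hat has nonpositive entries of D made positive."""
    M, d = ldl_factor(H)
    d_hat = [di if di > 0.0 else max(abs(di), floor) for di in d]
    n = len(g)
    y = [0.0] * n
    for i in range(n):                      # forward substitution: M y = -g
        y[i] = -g[i] - sum(M[i][k] * y[k] for k in range(i))
    z = [y[i] / d_hat[i] for i in range(n)]  # diagonal scaling by D_hat^{-1}
    x = [0.0] * n
    for i in reversed(range(n)):            # back substitution: M^T x = z
        x[i] = z[i] - sum(M[k][i] * x[k] for k in range(i + 1, n))
    return x

H = [[1.0, 2.0], [2.0, 1.0]]   # indefinite: D = diag(1, -3)
g = [1.0, -1.0]
direction = modified_newton_direction(H, g)
slope = sum(gi * di for gi, di in zip(g, direction))
```

For this `H` and `g` the unmodified Newton direction $-H^{-1} g = [1, -1]^T$ points uphill ($g^T d = 2$), whereas the modified direction has a negative `slope` and is therefore a descent direction.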

22 Computation of the Hessian
- Second derivatives might be impossible to compute analytically
- They can be approximated with numerical formulas
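One common numerical formula is the central-difference approximation of the second derivatives, sketched below. The slides do not say which formula is meant, so this choice, the step size `h`, and the test function are assumptions.

```python
def numerical_hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = len(x)

    def shifted(i, sign_i, j, sign_j):
        # Evaluate f at x + sign_i * h * e_i + sign_j * h * e_j.
        y = list(x)
        y[i] += sign_i * h
        y[j] += sign_j * h
        return f(y)

    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            H[i][j] = (shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
                       - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4.0 * h * h)
            H[j][i] = H[i][j]   # the Hessian is symmetric
    return H

# Hypothetical test function f = x1^2 x2 + x2^3; exact Hessian at (1, 2) is [[4, 2], [2, 12]].
H_num = numerical_hessian(lambda x: x[0] ** 2 * x[1] + x[1] ** 3, [1.0, 2.0])
```

The step size trades truncation error against round-off: a smaller `h` reduces the former but amplifies the latter because of the division by $4h^2$.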

23 Gauss-Newton Method
In many problems we want to optimize several functions at the same time:
$f = [f_1(x)\ f_2(x)\ \ldots\ f_m(x)]^T$
where $f_p(x)$ for $p = 1, 2, \ldots, m$ are independent functions of $x$.
We form a new function:
$F = \sum_{p=1}^{m} f_p(x)^2 = f^T f$
- Minimizing $F$ in the traditional way is minimizing the $f_p(x)$ in the least-squares sense
- Useful trick the other way around: if a function is a sum of squares, we can "split" it
- We use Newton's method with some fancy notation

24 Gauss-Newton Method
$F = \sum_{p=1}^{m} f_p(x)^2 = f^T f$
$\frac{\partial F}{\partial x_i} = \sum_{p=1}^{m} 2 f_p(x) \frac{\partial f_p}{\partial x_i}$

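25 Gauss-Newton Method
Or in matrix form, with $J$ the $m \times n$ Jacobian of $f$ whose entries are $J_{pi} = \partial f_p / \partial x_i$:
$\left[\frac{\partial F}{\partial x_1}\ \cdots\ \frac{\partial F}{\partial x_n}\right]^T = 2 J^T [f_1(x)\ \cdots\ f_m(x)]^T$
which says in fact:
$g_F = 2 J^T f$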

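26 Gauss-Newton Method
Similarly for the second derivatives:
$\frac{\partial^2 F}{\partial x_i \partial x_j} = 2 \sum_{p=1}^{m} \frac{\partial f_p}{\partial x_i} \frac{\partial f_p}{\partial x_j} + 2 \sum_{p=1}^{m} f_p(x) \frac{\partial^2 f_p}{\partial x_i \partial x_j}$
Neglecting the second derivatives of the $f_p$:
$\frac{\partial^2 F}{\partial x_i \partial x_j} \approx 2 \sum_{p=1}^{m} \frac{\partial f_p}{\partial x_i} \frac{\partial f_p}{\partial x_j}$
We obtain:
$H_F \approx 2 J^T J$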

27 Gauss-Newton Method
Having obtained $g_F$ and $H_F$ we have:
$x_{k+1} = x_k - \alpha_k (J^T J)^{-1} (J^T f)$   (1)
Notes:
- If the $f_p(x_k)$ are close to linear (near $x^*$), the approximation of the Hessian is accurate
- If the $f_p(x_k)$ are linear, the Hessian is exact and we reach the solution in one step
- If $H_F$ is singular, the same solutions apply as earlier
- The algorithm proceeds similarly as before
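Update (1) can be sketched on a small curve-fitting problem. Everything problem-specific here is an assumption for illustration: the model $a\,e^{bt}$, the three noise-free data points generated with $a = 2$, $b = -1$, the fixed step $\alpha_k = 1$, and the 2x2 solve by Cramer's rule.

```python
import math

T = [0.0, 1.0, 2.0]
Y = [2.0 * math.exp(-t) for t in T]      # synthetic data from a = 2, b = -1

def residuals(x):
    """f_p(x) = a * exp(b * t_p) - y_p for the parameters x = (a, b)."""
    a, b = x
    return [a * math.exp(b * t) - y for t, y in zip(T, Y)]

def jacobian(x):
    """Rows [df_p/da, df_p/db]."""
    a, b = x
    return [[math.exp(b * t), a * t * math.exp(b * t)] for t in T]

def gauss_newton(x0, alpha=1.0, tol=1e-12, max_iter=50):
    x = list(x0)
    for _ in range(max_iter):
        f, J = residuals(x), jacobian(x)
        # A = J^T J (2x2) and b = J^T f (2-vector).
        A = [[sum(row[i] * row[j] for row in J) for j in range(2)] for i in range(2)]
        b = [sum(row[i] * fp for row, fp in zip(J, f)) for i in range(2)]
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        if det == 0.0:
            raise ValueError("J^T J is singular; a Hessian modification would be needed")
        step = [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
                (A[0][0] * b[1] - A[1][0] * b[0]) / det]
        x = [x[0] - alpha * step[0], x[1] - alpha * step[1]]
        if math.hypot(step[0], step[1]) < tol:
            break
    return x

a_fit, b_fit = gauss_newton([1.5, -0.5])
```

Only first derivatives of the residuals are used; the factor 2 in $g_F$ and $H_F$ cancels, which is why it does not appear in the code.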

28 Homework
1. What is the role of the Hessian in the convergence rate of the steepest-descent method?
2. What are some advantages/disadvantages of the Newton method compared to the steepest-descent method?
3. Minimize f(x) = x x x 1 + 4x 2 using the steepest-descent method with initial point $x_0 = [0\ 0]^T$. (Hint: find a generic term for the iteration points.) Show that the algorithm converges to the global minimum.
4. Sketch the optimization steps for f(x) = ln(1 x 1 x 2 ) ln(x 1 ) ln(x 2 ) using basic steepest descent and Newton's method. [Optional: run the optimization, compare convergence rate, accuracy, effect of initial point.]
5. Sketch the optimization steps for f(x) = (x 1 +10x 2 ) 2 +5(x 3 x 4 ) 2 +(x 2 2x 3 ) (x 1 x 4 ) 4

29 Reference
Andreas Antoniou and Wu-Sheng Lu, Practical Optimization: Algorithms and Engineering Applications, Springer, 2007
