Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09


1 Numerical Optimization

2 Workhorse in
- Computer Vision
- Variational Methods
- Shape Analysis
- Machine Learning
- Markov Random Fields
- Geometry
Common denominator: optimization problems

3 Overview of Methods
- Is the function smooth, non-smooth, or semismooth?
- Are derivatives available?
- Is the problem convex or non-convex? "The great watershed in optimization isn't between linearity and nonlinearity, but convexity and non-convexity" (R. T. Rockafellar, 1993)
- Is the problem linear or nonlinear?
- Is the problem constrained or unconstrained?

4 Optimization problems
Generic unconstrained minimization problem: $\min_{x \in X} f(x)$, where
- the vector space $X$ is the search space;
- $f : X \to \mathbb{R}$ is a cost (or objective) function;
- a solution $x^* = \arg\min_{x \in X} f(x)$ is the minimizer of $f$;
- the value $f(x^*)$ is the minimum.
Finding the global minimum of a general function is as hard as finding a needle in a haystack.

5 Local vs. global minimum
[Figure labels: local minimum, global minimum.]

6 Local vs. global in real life
[Figure: Broad Peak (K3), 12th highest mountain on Earth. False summit: 8,030 m. Main summit: 8,047 m.]

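7 Convex vs. non-convex functions
A function $f$ defined on a convex set $C$ is called convex if $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$ for any $x, y \in C$ and $\lambda \in [0, 1]$.
For a convex function, every local minimum is a global minimum.
[Figure labels: convex, non-convex.]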

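8 One-dimensional optimality conditions
Point $x^*$ is a local minimizer of a $C^2$-function $f$ if $f'(x^*) = 0$ and $f''(x^*) > 0$.
Approximate the function around $x^*$ as a parabola using the Taylor expansion $f(x^* + dx) \approx f(x^*) + f'(x^*)\,dx + \frac{1}{2} f''(x^*)\,dx^2$:
- $f'(x^*) = 0$ guarantees that the parabola has its extremum at $x^*$;
- $f''(x^*) > 0$ guarantees that the parabola is convex, so the extremum is a minimum.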

9 Gradient
In the multidimensional case, linearization of the function according to Taylor, $f(x + dx) = f(x) + \langle \nabla f(x), dx \rangle + O(\|dx\|^2)$, gives a multidimensional analogue of the derivative.
The function $\nabla f : X \to X$, denoted $\nabla f(x)$, is called the gradient of $f$.
In the one-dimensional case, it reduces to the standard definition of the derivative.

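10 Gradient
In Euclidean space ($X = \mathbb{R}^n$), $\nabla f$ can be represented in the standard basis $e_i = (0, \dots, 1, \dots, 0)^T$ (with the 1 in the $i$-th place) in the following way: $\frac{\partial f}{\partial x_i} = \langle \nabla f(x), e_i \rangle$, which gives $\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^T$.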

11 Hessian
Linearization of the gradient, $\nabla f(x + dx) = \nabla f(x) + \nabla^2 f(x)\,dx + O(\|dx\|^2)$, gives a multidimensional analogue of the second-order derivative.
The function $\nabla^2 f$, denoted $\nabla^2 f(x)$, is called the Hessian of $f$, after Ludwig Otto Hesse (1811-1874).
In the standard basis, the Hessian is a symmetric matrix of mixed second-order derivatives, $\left( \nabla^2 f \right)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.

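12 Multi-dimensional optimality conditions
Point $x^*$ is a local minimizer of a $C^2$-function $f$ if $\nabla f(x^*) = 0$ and $d^T \nabla^2 f(x^*)\, d > 0$ for all $d \neq 0$, i.e., the Hessian is a positive definite matrix (denoted $\nabla^2 f(x^*) \succ 0$).
Approximate the function around $x^*$ as a parabola using the Taylor expansion $f(x^* + d) \approx f(x^*) + \nabla f(x^*)^T d + \frac{1}{2} d^T \nabla^2 f(x^*)\, d$:
- $\nabla f(x^*) = 0$ guarantees that the parabola has its extremum at $x^*$;
- $\nabla^2 f(x^*) \succ 0$ guarantees that the parabola is convex, so the extremum is a minimum.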

13 Optimization algorithms
[Figure labels: descent direction, step size.]

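14 Generic optimization algorithm
1. Start with some $x^{(0)}$ and set $k = 0$.
2. Repeat until convergence (stopping criterion):
   - Determine a descent direction $d^{(k)}$.
   - Choose a step size $\alpha^{(k)}$ such that $f(x^{(k)} + \alpha^{(k)} d^{(k)}) < f(x^{(k)})$.
   - Update the iterate: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$.
   - Increment the iteration counter: $k \leftarrow k + 1$.
3. Solution: $x^* \approx x^{(k)}$.
The three design choices are the descent direction, the step size, and the stopping criterion.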
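The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `descend`, `negative_gradient`, and `fixed_step`, the fixed step size, and the test function are all hypothetical and not part of the slides.

```python
import math

def descend(f, grad, x0, direction, step_size, tol=1e-8, max_iter=10_000):
    """Generic descent loop with pluggable direction and step-size rules."""
    x = list(x0)
    for k in range(max_iter):
        g = grad(x)
        # Stopping criterion: small gradient norm.
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return x, k
        d = direction(x, g)            # descent direction
        alpha = step_size(f, x, g, d)  # step size
        x = [xi + alpha * di for xi, di in zip(x, d)]  # update iterate
    return x, max_iter

# Hypothetical test problem: f(x) = (x0 - 1)^2 + (x1 + 2)^2, minimum at (1, -2).
f = lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)]
negative_gradient = lambda x, g: [-gi for gi in g]
fixed_step = lambda f, x, g, d: 0.25  # assumed constant step, fine for this f

x_min, iterations = descend(f, grad, [0.0, 0.0], negative_gradient, fixed_step)
```

Passing the direction and step-size rules as functions keeps the loop itself unchanged when switching between steepest descent, Newton, or other variants.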

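15 Stopping criteria
Near a local minimum, $\nabla f(x) \approx 0$ (or equivalently $\|\nabla f(x)\| \approx 0$).
- Stop when the gradient norm becomes small: $\|\nabla f(x^{(k)})\| \le \epsilon$.
- Stop when the step size becomes small: $\|x^{(k+1)} - x^{(k)}\| \le \epsilon$.
- Stop when the relative objective change becomes small: $\frac{|f(x^{(k+1)}) - f(x^{(k)})|}{|f(x^{(k)})|} \le \epsilon$.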

16 Line search
The optimal step size can be found by solving a one-dimensional optimization problem, $\alpha^{(k)} = \arg\min_{\alpha \ge 0} f(x^{(k)} + \alpha d^{(k)})$.
One-dimensional optimization algorithms for finding the optimal step size are generically called exact line search.

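17 Armijo rule
The function sufficiently decreases if $f(x + \alpha d) \le f(x) + \sigma \alpha \nabla f(x)^T d$ for some fixed $\sigma \in (0, 1)$.
Armijo rule (Larry Armijo, 1966): start with $\alpha = \alpha_0$ and decrease it by multiplying by some $\beta \in (0, 1)$ until the function sufficiently decreases.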
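A minimal sketch of the Armijo rule as backtracking, assuming the sufficient-decrease test $f(x + \alpha d) \le f(x) + \sigma \alpha \nabla f(x)^T d$. The parameter names `alpha0`, `beta`, `sigma`, their default values, and the test function are illustrative assumptions.

```python
def armijo_step(f, x, g, d, alpha0=1.0, beta=0.5, sigma=1e-4, max_shrinks=60):
    """Shrink alpha by the factor beta until f decreases sufficiently."""
    fx = f(x)
    # Directional derivative along d; negative for a descent direction.
    slope = sum(gi * di for gi, di in zip(g, d))
    alpha = alpha0
    for _ in range(max_shrinks):
        x_new = [xi + alpha * di for xi, di in zip(x, d)]
        if f(x_new) <= fx + sigma * alpha * slope:
            return alpha
        alpha *= beta
    return alpha

# Hypothetical test problem: f(x) = x0^4 + x1^2, searched along the negative gradient.
f = lambda x: x[0] ** 4 + x[1] ** 2
x = [2.0, 1.0]
g = [4.0 * x[0] ** 3, 2.0 * x[1]]
d = [-gi for gi in g]
alpha = armijo_step(f, x, g, d)
x_new = [xi + alpha * di for xi, di in zip(x, d)]
```

The full step overshoots badly on this function, so the rule halves the step a few times before the decrease test passes.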

18 Descent direction
How to descend in the fastest way? Go in the direction in which the contour (height) lines are densest.
[Figure: Devils Tower and its topographic map.]

19 Steepest descent
Directional derivative: $f'_d(x) = \nabla f(x)^T d$ measures how much $f$ changes in the direction $d$ (negative for a descent direction).
Find a unit-length direction minimizing the directional derivative: $d = \arg\min_{\|d\| = 1} \nabla f(x)^T d$.

20 Steepest descent
- $L_2$ norm: normalized steepest descent, $d = -\nabla f(x) / \|\nabla f(x)\|_2$.
- $L_1$ norm: coordinate descent (along the coordinate axis in which the descent is maximal).

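21 Steepest descent algorithm
1. Start with some $x^{(0)}$ and set $k = 0$.
2. Repeat until convergence:
   - Compute the steepest descent direction $d^{(k)} = -\nabla f(x^{(k)})$.
   - Choose the step size $\alpha^{(k)}$ using line search.
   - Update the iterate: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$.
   - Increment the iteration counter: $k \leftarrow k + 1$.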

22 Example: simple quadratic function (Matlab example)

23 Condition number
- The condition number is the ratio of the maximal and minimal eigenvalues of the Hessian, $\kappa = \lambda_{\max} / \lambda_{\min}$.
- A problem with a large condition number is called ill-conditioned.
- The convergence rate of steepest descent is slow for ill-conditioned problems.
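To illustrate how the condition number affects steepest descent, the sketch below minimizes a hypothetical quadratic $f(x) = \frac{1}{2}(x_1^2 + \kappa x_2^2)$ with exact line search and counts iterations for several values of $\kappa$. The function name, the starting point $(\kappa, 1)$, and the tolerance are assumptions chosen for the illustration.

```python
import math

def steepest_descent_quadratic(diag, x0, tol=1e-8, max_iter=100_000):
    """Minimize f(x) = 0.5 * sum(diag[i] * x[i]**2) by steepest descent.

    For a quadratic, the exact line-search step is alpha = g.g / g.(A g).
    """
    x = list(x0)
    for k in range(max_iter):
        g = [a * xi for a, xi in zip(diag, x)]  # gradient A x
        gg = sum(gi * gi for gi in g)
        if math.sqrt(gg) <= tol:
            return x, k
        alpha = gg / sum(a * gi * gi for a, gi in zip(diag, g))
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, max_iter

# Hessian diag(1, kappa) has condition number kappa.
iterations = {}
for kappa in (1.0, 10.0, 100.0):
    _, iterations[kappa] = steepest_descent_quadratic([1.0, kappa], [kappa, 1.0])
```

With $\kappa = 1$ a single step reaches the minimum, while the iteration count grows with $\kappa$ as the iterates zigzag across the narrow valley.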

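24 Conjugate Gradient (CG method)
Generates a sequence $x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$, where the search directions have the property $d^{(i)T} H\, d^{(j)} = 0$ for $i \neq j$, with $H$ the Hessian (this property is known as conjugacy).
The new search direction is given by $d^{(k+1)} = -\nabla f(x^{(k+1)}) + \beta^{(k)} d^{(k)}$.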

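25 Conjugate Gradient (CG method)
Different choices of $\beta^{(k)}$, writing $g^{(k)} = \nabla f(x^{(k)})$:
- Fletcher-Reeves: $\beta^{(k)} = \frac{g^{(k+1)T} g^{(k+1)}}{g^{(k)T} g^{(k)}}$
- Polak-Ribière: $\beta^{(k)} = \frac{g^{(k+1)T} (g^{(k+1)} - g^{(k)})}{g^{(k)T} g^{(k)}}$
To ensure that $d$ is a descent direction with Polak-Ribière...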

26 Conjugate Gradient (CG method)
Properties of the CG method:
- In the linear case: at most n steps for n variables.
- Each iteration is very fast to compute.
- Needs little memory, so it suits large problems.
- Global convergence is assured by line search.
Example
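For the linear case, where minimizing $\frac{1}{2} x^T A x - b^T x$ is equivalent to solving $A x = b$ with $A$ symmetric positive definite, the method can be sketched as follows. The function name and the 3x3 test system are hypothetical; the update for `beta` is the Fletcher-Reeves formula.

```python
def conjugate_gradient(A, b, tol=1e-10):
    """Linear CG for A x = b, A symmetric positive definite (lists of lists)."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = list(b)   # residual b - A x, i.e. the negative gradient at x = 0
    d = list(r)   # first search direction: steepest descent
    rr = dot(r, r)
    steps = 0
    while steps < n and rr ** 0.5 > tol:
        Ad = matvec(d)
        alpha = rr / dot(d, Ad)          # exact line search along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rr_new = dot(r, r)
        beta = rr_new / rr               # Fletcher-Reeves
        d = [ri + beta * di for ri, di in zip(r, d)]  # new conjugate direction
        rr = rr_new
        steps += 1
    return x, steps

# Hypothetical test system.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, steps = conjugate_gradient(A, b)
```

Only matrix-vector products and a handful of vectors are needed, which is why the memory footprint stays small.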

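27 Preconditioning
Perform steepest descent in a rescaled coordinate system; this is called preconditioning. With the change of variables $x = P y$:
- Function: $\hat f(y) = f(P y)$
- Gradient: $\nabla \hat f(y) = P^T \nabla f(P y)$
The preconditioner $P$ should be chosen to improve the condition number of the Hessian in the proximity of the solution.
In the system of coordinates $y$, the Hessian at the solution is $\nabla^2 \hat f(y^*) = P^T \nabla^2 f(x^*)\, P$.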

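28 Newton method as optimal preconditioner
- Best theoretically possible preconditioner: $P = \left( \nabla^2 f(x^*) \right)^{-1/2}$, giving the descent direction $d = -\left( \nabla^2 f(x^*) \right)^{-1} \nabla f(x)$.
- Ideal condition number: $\kappa = 1$.
- Problem: the solution $x^*$ is unknown in advance.
- Newton direction: use the Hessian as a preconditioner at each iteration, $d = -\left( \nabla^2 f(x) \right)^{-1} \nabla f(x)$.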

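29 Another derivation of the Newton method
Approximate the function as a quadratic function using the second-order Taylor expansion $f(x + d) \approx f(x) + \nabla f(x)^T d + \frac{1}{2} d^T \nabla^2 f(x)\, d$ (a quadratic function in $d$).
Setting the gradient of this quadratic with respect to $d$ to zero gives the Newton direction $d = -\left( \nabla^2 f(x) \right)^{-1} \nabla f(x)$.
Close to the solution the function looks like a quadratic function, so the Newton method converges fast.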

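30 Newton method
1. Start with some $x^{(0)}$ and set $k = 0$.
2. Repeat until convergence:
   - Compute the Newton direction $d^{(k)} = -\left( \nabla^2 f(x^{(k)}) \right)^{-1} \nabla f(x^{(k)})$.
   - Choose the step size $\alpha^{(k)}$ using line search.
   - Update the iterate: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$.
   - Increment the iteration counter: $k \leftarrow k + 1$.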
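A minimal two-variable sketch of the Newton iteration. To keep it short it takes the full step ($\alpha = 1$) instead of a line search, which is an assumption that happens to work for the hypothetical convex test function $f(x, y) = e^{x+y} + x^2 + y^2$; the function names are illustrative.

```python
import math

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method in two variables with full steps (alpha = 1)."""
    x, y = x0
    for k in range(max_iter):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) <= tol:
            return (x, y), k
        (a, b), (c, d) = hess(x, y)
        det = a * d - b * c
        # Solve H * step = -g for the 2x2 system by Cramer's rule.
        sx = (-gx * d + gy * b) / det
        sy = (-gy * a + gx * c) / det
        x, y = x + sx, y + sy
    return (x, y), max_iter

# Hypothetical test function: f(x, y) = exp(x + y) + x^2 + y^2 (strictly convex).
def grad(x, y):
    e = math.exp(x + y)
    return e + 2.0 * x, e + 2.0 * y

def hess(x, y):
    e = math.exp(x + y)
    return (e + 2.0, e), (e, e + 2.0)

(x_min, y_min), iterations = newton_minimize(grad, hess, (1.0, 1.0))
```

For larger problems the 2x2 solve would be replaced by a general linear solver, since each iteration amounts to solving one linear system.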

31 Properties of Newton's method
- Fast convergence (at least locally): quadratic convergence.
- Not always very robust.
- Each iteration amounts to solving a linear system.
- Overall performance depends on how fast the system can be solved.

32 Example
Revisit the simple quadratic example: Newton's method needs one iteration to find the solution!

33 Frozen Hessian
- Observation: close to the optimum, the Hessian does not change significantly.
- Reduce the number of Hessian inversions by keeping the Hessian from previous iterations and updating it only once every few iterations.
- Such a method is called Newton with frozen Hessian.

34 Cholesky factorization
Decompose the Hessian as $H = L L^T$, where $L$ is a lower triangular matrix (after André-Louis Cholesky, 1875-1918).
Solve the Newton system $L L^T d = -\nabla f$ in two steps:
- Forward substitution: solve $L y = -\nabla f$ for $y$.
- Backward substitution: solve $L^T d = y$ for $d$.
Complexity: better than straightforward matrix inversion (the factorization itself costs roughly $n^3/3$ floating-point operations).
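The two-step solve can be sketched in plain Python as follows. The function names and the 2x2 test system are hypothetical, and the code assumes $H$ is symmetric positive definite (otherwise the square root fails).

```python
import math

def cholesky(H):
    """Return lower-triangular L with H = L L^T (H symmetric positive definite)."""
    n = len(H)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(H[i][i] - s)
            else:
                L[i][j] = (H[i][j] - s) / L[j][j]
    return L

def solve_cholesky(H, rhs):
    """Solve H d = rhs via forward and backward substitution."""
    n = len(rhs)
    L = cholesky(H)
    y = [0.0] * n
    for i in range(n):                 # forward substitution: L y = rhs
        y[i] = (rhs[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    d = [0.0] * n
    for i in reversed(range(n)):       # backward substitution: L^T d = y
        d[i] = (y[i] - sum(L[k][i] * d[k] for k in range(i + 1, n))) / L[i][i]
    return d

# Hypothetical Newton system H d = -g.
H = [[4.0, 2.0], [2.0, 3.0]]
g = [2.0, 1.0]
d = solve_cholesky(H, [-gi for gi in g])
```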

35 Truncated Newton
- Solve the Newton system approximately (it is a linear system).
- A few iterations of conjugate gradients, or of another algorithm for solving linear systems, can be used.
- Such a method is called truncated or inexact Newton.

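36 Levenberg-Marquardt method
Combine gradient descent with the Newton method ($d_\tau$ denotes the gradient and $H_\tau$ the Hessian at $x_\tau$):
- Steepest descent: $x_{\tau+1} = x_\tau - \lambda d_\tau$
- Newton method: $x_{\tau+1} = x_\tau - H_\tau^{-1} d_\tau$
- Levenberg: $x_{\tau+1} = x_\tau - (H_\tau + \beta I)^{-1} d_\tau$
  - $\beta = 0$: $x_{\tau+1} = x_\tau - H_\tau^{-1} d_\tau$ (Newton step)
  - $\beta \gg 0$: $x_{\tau+1} \approx x_\tau - \frac{1}{\beta} d_\tau$ (gradient step)
- Marquardt: take $H$ into account also for large $\beta$: $x_{\tau+1} = x_\tau - (H_\tau + \beta\,\mathrm{diag}[H_\tau])^{-1} d_\tau$. This gives a better solution for the error valley problem, because the gradient is bent towards the quadratic (Newton) direction.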

37 Levenberg-Marquardt algorithm
Update rule: $x_{\tau+1} = x_\tau - (H_\tau + \beta\,\mathrm{diag}(H_\tau))^{-1} d_\tau$
1. Initialize $\beta$ with a small value.
2. Calculate $E(x_\tau)$ and $E(x_{\tau+1})$.
3. If $E(x_{\tau+1}) \ge E(x_\tau)$: increase $\beta$ (×10) and undo the last step.
4. If $E(x_{\tau+1}) < E(x_\tau)$: decrease $\beta$ (÷10).
If the error gets bigger, the quadratic approximation is not good, so more weight is given to the gradient direction.
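The accept/reject logic above can be sketched for two variables as below. The initial `beta`, the tolerance, and the test function $E(x, y) = x^4 + y^2 + (x + y - 1)^2$ (minimum at $(0.5, 0.25)$) are assumptions made for the illustration, not values from the slides.

```python
import math

def levenberg_marquardt(E, grad, hess, x0, beta=1e-3, tol=1e-6, max_iter=200):
    """Damped Newton steps with Marquardt's diagonal scaling (2 variables)."""
    x, y = x0
    for _ in range(max_iter):
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) <= tol:
            break
        (a, b), (c, d) = hess(x, y)
        a, d = a * (1.0 + beta), d * (1.0 + beta)  # H + beta * diag(H)
        det = a * d - b * c
        sx = (gx * d - gy * b) / det               # solve (H + beta diag H) s = g
        sy = (gy * a - gx * c) / det
        if E(x - sx, y - sy) < E(x, y):
            x, y = x - sx, y - sy                  # accept: trust the quadratic model more
            beta /= 10.0
        else:
            beta *= 10.0                           # reject (undo): lean towards the gradient
    return x, y

# Hypothetical error function with minimum at (0.5, 0.25).
E = lambda x, y: x ** 4 + y ** 2 + (x + y - 1.0) ** 2
grad = lambda x, y: (4.0 * x ** 3 + 2.0 * (x + y - 1.0), 2.0 * y + 2.0 * (x + y - 1.0))
hess = lambda x, y: ((12.0 * x ** 2 + 2.0, 2.0), (2.0, 4.0))

x_min, y_min = levenberg_marquardt(E, grad, hess, (3.0, -2.0))
```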

38 Quasi-Newton methods
- The idea is to approximate the Hessian by some positive definite matrix $B$ and solve $B d = -\nabla f(x)$ for the search direction $d$.
- One of the most efficient methods is that of Broyden, Fletcher, Goldfarb and Shanno (BFGS).
- This method updates the inverse of $B$ iteratively, so no linear system has to be solved.
- For large-scale problems: limited-memory BFGS (Nocedal).
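A compact sketch of BFGS with the standard inverse update $B^{-1} \leftarrow (I - \rho s y^T) B^{-1} (I - \rho y s^T) + \rho s s^T$, $\rho = 1 / (y^T s)$, written in expanded form. The backtracking line search, the tolerances, and the test function are assumptions for this illustration; a production implementation would normally use a line search that enforces the Wolfe conditions.

```python
import math

def bfgs(f, grad, x0, tol=1e-6, max_iter=200):
    """BFGS with backtracking line search, maintaining the inverse of B."""
    n = len(x0)
    x = list(x0)
    Binv = [[float(i == j) for j in range(n)] for i in range(n)]  # start from identity
    g = grad(x)
    for k in range(max_iter):
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return x, k
        d = [-sum(Binv[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(gi * di for gi, di in zip(g, d))
        alpha, fx = 1.0, f(x)
        while (f([xi + alpha * di for xi, di in zip(x, d)]) > fx + 1e-4 * alpha * slope
               and alpha > 1e-12):
            alpha *= 0.5
        x_new = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [p - q for p, q in zip(x_new, x)]
        y = [p - q for p, q in zip(g_new, g)]
        ys = sum(yi * si for yi, si in zip(y, s))
        if ys > 1e-12:  # skip the update if the curvature condition fails
            rho = 1.0 / ys
            By = [sum(Binv[i][j] * y[j] for j in range(n)) for i in range(n)]
            yBy = sum(yi * byi for yi, byi in zip(y, By))
            for i in range(n):
                for j in range(n):
                    Binv[i][j] += ((1.0 + rho * yBy) * rho * s[i] * s[j]
                                   - rho * (By[i] * s[j] + s[i] * By[j]))
        x, g = x_new, g_new
    return x, max_iter

# Hypothetical test function with minimum at (0.5, 0.25).
f = lambda x: x[0] ** 4 + x[1] ** 2 + (x[0] + x[1] - 1.0) ** 2
grad = lambda x: [4.0 * x[0] ** 3 + 2.0 * (x[0] + x[1] - 1.0),
                  2.0 * x[1] + 2.0 * (x[0] + x[1] - 1.0)]
x_min, iterations = bfgs(f, grad, [3.0, -2.0])
```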

39 Non-convex optimization
- Using convex optimization methods with non-convex functions does not guarantee global convergence!
- There is no theoretically guaranteed global optimization method, just heuristics:
  - good initialization
  - multiresolution
[Figure labels: local minimum, global minimum.]

40 Iterative majorization
Construct a majorizing function $g(x, x_0)$ satisfying:
- $g(x_0, x_0) = f(x_0)$;
- the majorizing inequality $f(x) \le g(x, x_0)$ for all $x$;
- $g(x, x_0)$ is convex or easier to optimize w.r.t. $x$.

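41 Iterative majorization
1. Start with some $x^{(0)}$ and set $k = 0$.
2. Repeat until convergence:
   - Find $x^{(k+1)}$ such that $g(x^{(k+1)}, x^{(k)}) \le g(x^{(k)}, x^{(k)})$.
   - Update the iterate and increment the iteration counter: $k \leftarrow k + 1$.
3. Solution: $x^* \approx x^{(k)}$.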
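A one-dimensional sketch of the scheme, using a hypothetical objective $f(x) = \log\cosh(x - 1)$. Because $f''(x) \le 1$, the quadratic $g(x, x_0) = f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2}(x - x_0)^2$ is an assumed majorizer of $f$ that touches it at $x_0$, and its minimizer has the closed form $x_0 - f'(x_0)$.

```python
import math

def f(x):
    # Hypothetical objective with minimum at x = 1.
    return math.log(math.cosh(x - 1.0))

def majorizer_argmin(x0):
    # Minimizer of g(x, x0) = f(x0) + f'(x0) (x - x0) + 0.5 (x - x0)^2,
    # where f'(x0) = tanh(x0 - 1).
    return x0 - math.tanh(x0 - 1.0)

x = 4.0
values = [f(x)]
for _ in range(100):
    x_next = majorizer_argmin(x)
    if abs(x_next - x) <= 1e-12:  # stop when the step becomes small
        break
    x = x_next
    values.append(f(x))
```

Since each new iterate minimizes a function that lies above $f$ and touches it at the current iterate, the recorded objective values never increase.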

42 Constrained optimization
[Figure: warning sign reading "MINEFIELD CLOSED ZONE".]

43 Constrained optimization problems
Generic constrained minimization problem: $\min_{x \in X} f(x)$ subject to $g_i(x) \le 0$, $i = 1, \dots, m$, and $h_j(x) = 0$, $j = 1, \dots, l$, where
- $g_i$ are inequality constraints;
- $h_j$ are equality constraints.
The subset of the search space in which the constraints hold is called the feasible set.
A point belonging to the feasible set is called a feasible solution.
A minimizer of the unconstrained problem may be infeasible!

44 An example
[Figure labels: equality constraint, inequality constraint, feasible set.]
An inequality constraint $g_i(x) \le 0$ is active at point $x$ if $g_i(x) = 0$, inactive otherwise.
A point is regular if the gradients of the equality constraints and of the active inequality constraints are linearly independent.

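45 Lagrange multipliers
Main idea for solving constrained problems: arrange the objective $f$, the inequality constraints $g_i(x) \le 0$, and the equality constraints $h_j(x) = 0$ into a single function $L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x)$ and minimize it as an unconstrained problem.
$L$ is called the Lagrangian, and $\lambda_i$ and $\mu_j$ are called Lagrange multipliers.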

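46 KKT conditions
For a problem with inequality constraints $g_i(x) \le 0$ and equality constraints $h_j(x) = 0$: if $x^*$ is a regular point and a local minimum, there exist Lagrange multipliers $\lambda^*$ and $\mu^*$ such that
- $\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \mu_j^* \nabla h_j(x^*) = 0$;
- $g_i(x^*) \le 0$ for all $i$ and $h_j(x^*) = 0$ for all $j$;
- $\lambda_i^* \ge 0$, with $\lambda_i^*$ possibly positive for active constraints and zero for inactive constraints.
These are known as the Karush-Kuhn-Tucker conditions. Necessary but not sufficient!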

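47 KKT conditions
Sufficient conditions: if the objective $f$ is convex, the inequality constraints $g_i$ are convex and the equality constraints $h_j$ are affine, and
- $\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \mu_j^* \nabla h_j(x^*) = 0$;
- $g_i(x^*) \le 0$ for all $i$ and $h_j(x^*) = 0$ for all $j$;
- $\lambda_i^* \ge 0$, with $\lambda_i^*$ possibly positive for active constraints and zero for inactive constraints,
then $x^*$ is the solution of the constrained problem (global constrained minimizer).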

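48 Geometric interpretation
Consider a simpler problem: $\min_x f(x)$ subject to a single equality constraint $h(x) = 0$.
The gradients of the objective and of the constraint must line up at the solution: $\nabla f(x^*) = -\mu \nabla h(x^*)$ for some multiplier $\mu$.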
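As a small worked example (the problem is chosen here for illustration and does not come from the slides), minimize $f(x, y) = x^2 + y^2$ subject to $h(x, y) = x + y - 1 = 0$:

```latex
\begin{aligned}
L(x, y, \mu) &= x^2 + y^2 + \mu\,(x + y - 1) \\
\partial L / \partial x &= 2x + \mu = 0, \qquad
\partial L / \partial y = 2y + \mu = 0, \qquad
\partial L / \partial \mu = x + y - 1 = 0 \\
\Rightarrow\quad x^* &= y^* = \tfrac{1}{2}, \qquad \mu = -1 \\
\nabla f(x^*, y^*) &= (1, 1)^T = -\mu\, \nabla h(x^*, y^*), \qquad \nabla h = (1, 1)^T
\end{aligned}
```

At the solution the two gradients are parallel, which is exactly the "line up" condition.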

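49 Penalty methods
Define a penalty aggregate $F_\rho(x) = f(x) + \sum_i \varphi_\rho(g_i(x)) + \sum_j \psi_\rho(h_j(x))$, where $\varphi_\rho$ and $\psi_\rho$ are parametric penalty functions for the inequality constraints $g_i(x) \le 0$ and the equality constraints $h_j(x) = 0$.
For larger values of the parameter $\rho$, the penalty on the constraint violation is stronger.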

50 Penalty methods
[Figure labels: inequality penalty, equality penalty.]

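51 Penalty methods
1. Start with some $x^{(0)}$ and an initial value of the penalty parameter $\rho^{(0)}$; set $k = 0$.
2. Repeat until convergence:
   - Find $x^{(k+1)}$ by solving the unconstrained optimization problem of minimizing the penalty aggregate $F_{\rho^{(k)}}(x)$, initialized with $x^{(k)}$.
   - Set $\rho^{(k+1)} > \rho^{(k)}$ (strengthen the penalty).
   - Update $k \leftarrow k + 1$.
3. Solution: $x^* \approx x^{(k)}$.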
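A minimal sketch of the penalty loop for a hypothetical problem: minimize $x^2 + y^2$ subject to $x + y - 1 = 0$, using an assumed quadratic equality penalty $\rho\, h(x)^2$. Because the penalized objective is quadratic, the inner unconstrained problem is solved exactly by one Newton step; the factor 10 for increasing $\rho$ and the tolerance are also assumptions.

```python
def minimize_penalized(x, y, rho):
    """Minimize x^2 + y^2 + rho * (x + y - 1)^2 with one Newton step.

    The step is exact because the penalized objective is quadratic.
    """
    h = x + y - 1.0
    gx = 2.0 * x + 2.0 * rho * h
    gy = 2.0 * y + 2.0 * rho * h
    a, b = 2.0 + 2.0 * rho, 2.0 * rho   # Hessian [[a, b], [b, a]]
    det = (a - b) * (a + b)
    return x - (a * gx - b * gy) / det, y - (a * gy - b * gx) / det

x, y, rho = 0.0, 0.0, 1.0
outer_iterations = 0
while abs(x + y - 1.0) > 1e-6:              # stop when the constraint nearly holds
    x, y = minimize_penalized(x, y, rho)    # warm start from the previous solution
    rho *= 10.0                             # strengthen the penalty
    outer_iterations += 1
```

The iterates approach the constrained minimizer $(0.5, 0.5)$ from the infeasible side, and the constraint violation shrinks as $\rho$ grows.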

52 Literature
- Bronstein, Bronstein, Kimmel, Numerical Geometry of Non-Rigid Shapes (slides are taken from tosca.cs.technion.ac.il/book)
- Nocedal, Wright, Numerical Optimization
