Optimization
Unconstrained optimization: one-dimensional and multi-dimensional
- Newton's method: basic Newton, Gauss-Newton, quasi-Newton
- Descent methods: gradient descent, conjugate gradient
Constrained optimization
- Newton with equality constraints
- Active-set method
- Simplex method
- Interior-point method
Unconstrained optimization
Define an objective function over a domain, f : R^n → R, with optimization variables x^T = (x_1, x_2, ..., x_n):
  minimize f(x_1, x_2, ..., x_n),  i.e.  minimize f(x) for x ∈ R^n
Constraints
Equality constraints:   a_i(x) = 0 for x ∈ R^n, where i = 1, ..., p
Inequality constraints: c_j(x) ≥ 0 for x ∈ R^n, where j = 1, ..., q
Constrained optimization
  minimize f(x), for x ∈ R^n
  subject to a_i(x) = 0, where i = 1, ..., p
             c_j(x) ≥ 0, where j = 1, ..., q
Solution: x* satisfies the constraints a_i and c_j while minimizing the objective function f(x).
Formulate an optimization
The general optimization problem is very difficult to solve, but certain problem classes can be solved efficiently and reliably:
- Convex problems can be solved to a global solution efficiently and reliably
- Nonconvex problems do not guarantee global solutions
Example: pattern matching
A pattern can be described by a set of points, P = {p_1, p_2, ..., p_n}. The same object viewed from a different distance or a different angle corresponds to a different P. Two patterns P and P′ are similar if
  p′_i = [cos θ  −sin θ] p_i + [r_1]
         [sin θ   cos θ]       [r_2]
Example: pattern matching
Let Q = {q_1, q_2, ..., q_n} be the target pattern; find the most similar pattern among P_1, P_2, ..., P_n.
Inverse kinematics
Given a set of 3D marker positions, find the pose described by joint angles.
Optimal motion trajectories
Quiz
Start at 0 and arrive at d with velocity = 0. Maximal force allowed: F.
What control minimizes time? What control minimizes energy?
Unconstrained optimization
- Newton method
- Gauss-Newton method
- Gradient descent method
- Conjugate gradient method
Newton method
Find the roots of a nonlinear function: C(x) = 0. We can linearize the function about x:
  C(x̄) = C(x) + C'(x)(x̄ − x) = 0,  where C'(x) = ∂C/∂x
Then we can estimate the root as
  x̄ = x − C(x) / C'(x)
Root estimation
[Figure: successive Newton iterates x^(0), x^(1), x^(2) approaching the root of C(x), using the linearization C(x^(1)) = C(x^(0)) + C'(x^(0)) (x^(1) − x^(0)).]
Root estimation
Pros: quadratic convergence.
Cons: sensitive to the initial guess (example?); the slope C'(x) can't be zero at the solution (why?).
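The Newton update x̄ = x − C(x)/C'(x) can be sketched in a few lines of Python; the function names and the x² − 2 example are illustrative choices, not part of the slides.

```python
def newton_root(C, dC, x0, tol=1e-10, max_iter=50):
    """Find a root of C(x) = 0 by iterating x <- x - C(x)/C'(x)."""
    x = x0
    for _ in range(max_iter):
        step = C(x) / dC(x)  # breaks down if the slope C'(x) is (near) zero
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of C(x) = x^2 - 2, i.e. sqrt(2)
root = newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0)
```

Starting from x0 = −1.0 instead converges to −√2, illustrating the sensitivity to the initial guess.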
Minimization
Find x* such that the nonlinear function F(x*) is a minimum. (What is the simplest function that has minima?) Expand F about x^(k):
  F(x^(k) + δ) = F(x^(k)) + F'(x^(k)) δ + ½ F''(x^(k)) δ²
Finding the minima of F(x) amounts to finding the roots of F'(x):
  ∂F(x^(k) + δ)/∂δ = 0  ⟹  δ = −F'(x^(k)) / F''(x^(k))
Conditions
What are the conditions for minima to exist?
Necessary conditions (a local minimum exists at x*):   F'(x*) = 0 and F''(x*) ≥ 0
Sufficient conditions (an isolated minimum exists at x*):  F'(x*) = 0 and F''(x*) > 0
Minimization
[Figure: F(x) and its derivative F'(x) near a minimum x*, where F'(x*) = 0 and F''(x*) > 0.]
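Minimizing F amounts to running the Newton iteration on F'(x) = 0, i.e. repeatedly applying δ = −F'(x)/F''(x). A minimal one-dimensional sketch (function names and the example are illustrative):

```python
def newton_minimize_1d(dF, d2F, x0, tol=1e-10, max_iter=50):
    """Minimize F by applying Newton's method to F'(x) = 0.
    Converges to a minimum only where F''(x) > 0 (sufficient condition)."""
    x = x0
    for _ in range(max_iter):
        delta = -dF(x) / d2F(x)
        x += delta
        if abs(delta) < tol:
            break
    return x

# Example: F(x) = (x - 3)^2 + 1 has F'(x) = 2(x - 3), F''(x) = 2
xstar = newton_minimize_1d(lambda x: 2*(x - 3), lambda x: 2.0, x0=0.0)
```

For a quadratic F, one Newton step lands exactly on the minimizer.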
Multidimensional optimization
- Search methods only need function evaluations
- First-order gradient-based methods depend on the gradient g
- Second-order gradient-based methods depend on both the gradient g and the Hessian H
Multiple variables
  F(x^(k) + p) = F(x^(k)) + g^T(x^(k)) p + ½ p^T H(x^(k)) p
Gradient vector:
  g(x) = ∇_x F = [∂F/∂x_1, ..., ∂F/∂x_n]^T
Hessian matrix:
  H(x) = ∇²_xx F, with entries [H(x)]_ij = ∂²F/(∂x_i ∂x_j)
Multiple variables
Setting the gradient of the quadratic model to zero,
  0 = g(x^(k)) + H(x^(k)) p
  p = −H(x^(k))^(−1) g(x^(k))
  x^(k+1) = x^(k) + p
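In code, the Newton step p = −H^(−1) g is best computed by solving the linear system H p = −g rather than forming the inverse. A NumPy sketch (the quadratic test problem and names are illustrative):

```python
import numpy as np

def newton_step(g, H, x):
    """One Newton iteration: solve H(x) p = -g(x), then step."""
    p = np.linalg.solve(H(x), -g(x))
    return x + p

# Example: for the quadratic F(x) = 0.5 x^T A x - b^T x,
# g(x) = A x - b and H(x) = A, so one step lands on the minimizer.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```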
Multiple variables
Necessary conditions: g(x*) = 0 and p^T H p ≥ 0 for all p (H is positive semi-definite)
Sufficient conditions: g(x*) = 0 and p^T H p > 0 for all p ≠ 0 (H is positive definite)
Gauss-Newton method
What if the objective function is in the form of a vector of functions?
  f = [f_1(x) f_2(x) ... f_m(x)]^T
The real-valued objective can then be formed as
  F = Σ_{p=1}^{m} f_p(x)² = f^T f
Jacobian
Each f_p(x) depends on x_i for i = 1, 2, ..., n, so a gradient matrix (the Jacobian J, with entries J_pi = ∂f_p/∂x_i) can be formed. The Jacobian need not be a square matrix.
Gradient and Hessian
Gradient of the objective function:
  ∂F/∂x_i = 2 Σ_{p=1}^{m} f_p(x) ∂f_p/∂x_i   ⟹   g_F = 2 J^T f
Hessian of the objective function:
  ∂²F/(∂x_i ∂x_j) = 2 Σ_{p=1}^{m} (∂f_p/∂x_i)(∂f_p/∂x_j) + 2 Σ_{p=1}^{m} f_p(x) ∂²f_p/(∂x_i ∂x_j)
Dropping the second-order term gives
  H_F ≈ 2 J^T J
Gauss-Newton algorithm
1. In the k-th iteration, compute f_p(x_k) and J_k to obtain new g_k and H_k
2. Compute p_k = −(2 J^T J)^(−1) (2 J^T f) = −(J^T J)^(−1) (J^T f)
3. Find α_k that minimizes F(x_k + α_k p_k)
4. Set x_{k+1} = x_k + α_k p_k
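A minimal NumPy sketch of these steps, with the simplification α_k = 1 in place of the line search (harmless here because the example is a linear least-squares fit; all names and data are illustrative):

```python
import numpy as np

def gauss_newton(f, J, x0, iters=20):
    """Gauss-Newton: minimize F(x) = f(x)^T f(x) with H ~= 2 J^T J.
    The step solves (J^T J) p = -J^T f; step size fixed at alpha = 1."""
    x = x0
    for _ in range(iters):
        r = f(x)
        Jx = J(x)
        p = np.linalg.solve(Jx.T @ Jx, -Jx.T @ r)
        x = x + p
    return x

# Example: fit (a, b) minimizing sum_p (a*t_p + b - y_p)^2
# for points lying exactly on the line y = 2t + 1
t = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * t + 1.0
f = lambda x: x[0] * t + x[1] - y                      # residual vector f(x)
J = lambda x: np.stack([t, np.ones_like(t)], axis=1)   # m x n Jacobian
x = gauss_newton(f, J, np.zeros(2))
```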
First-order gradient methods
- Greatest gradient descent
- Conjugate gradient
Solving large linear systems
  A x = b
- A: a known, square, symmetric, positive semi-definite matrix
- b: a known vector
- x: an unknown vector
If A is dense, solve with factorization and back-substitution. If A is sparse, solve with iterative methods (descent methods).
Quadratic form
  F(x) = ½ x^T A x − b^T x + c
The gradient of F(x) is
  F'(x) = ½ A^T x + ½ A x − b
If A is symmetric, F'(x) = A x − b, so F'(x) = 0 gives A x = b: the critical point of F is also the solution to A x = b. If A is not symmetric, what is the linear system solved by finding the critical points of F?
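The claim F'(x) = A x − b for symmetric A is easy to check numerically against a finite-difference gradient (a small sanity check; all names are illustrative):

```python
import numpy as np

def F(x, A, b, c=0.0):
    """Quadratic form F(x) = 0.5 x^T A x - b^T x + c."""
    return 0.5 * x @ A @ x - b @ x + c

def grad_F(x, A, b):
    """Analytic gradient for symmetric A: F'(x) = A x - b."""
    return A @ x - b

# Compare against a central finite-difference gradient at a random point
rng = np.random.default_rng(0)
A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
b = np.array([1.0, 2.0])
x = rng.standard_normal(2)
eps = 1e-6
fd = np.array([(F(x + eps * e, A, b) - F(x - eps * e, A, b)) / (2 * eps)
               for e in np.eye(2)])
```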
Greatest gradient descent
Start at an arbitrary point x_(0) and slide down to the bottom of the paraboloid. Take a series of steps x_(1), x_(2), ... until we are satisfied that we are close enough to the solution x*. Take each step along the direction in which F descends most quickly:
  −F'(x_(k)) = b − A x_(k)
Greatest gradient descent
Important definitions:
  error:    e_(k) = x_(k) − x*
  residual: r_(k) = b − A x_(k) = −F'(x_(k)) = −A e_(k)
Think of the residual as the direction of greatest descent.
Line search
  x_(1) = x_(0) + α r_(0)
But how big a step should we take? A line search is a procedure that chooses α to minimize F along a line.
[Figure (after Shewchuk, Fig. 6): the method of steepest descent. (a) Take a step in the direction of steepest descent of F. (b) The line search finds the bottommost point on the intersection of the two surfaces. (d) The gradient at the new point is orthogonal to the previous step.]
Optimal step size
Setting the directional derivative to zero,
  (d/dα) F(x_(1)) = F'(x_(1))^T (d/dα) x_(1) = F'(x_(1))^T r_(0) = 0
So F'(x_(1)) ⊥ r_(0), i.e. r^T_(0) r_(1) = 0.
Optimal step size
Exercise: derive α from r^T_(k) r_(k+1) = 0.
Hint: replace the terms involving (k+1) with those involving (k) using x_(k+1) = x_(k) + α r_(k).
Ans: α = (r^T_(k) r_(k)) / (r^T_(k) A r_(k))
Recurrence of the residual
1. r_(k) = b − A x_(k)
2. α = (r^T_(k) r_(k)) / (r^T_(k) A r_(k))
3. x_(k+1) = x_(k) + α r_(k)
The algorithm requires two matrix-vector multiplications per iteration. One multiplication can be eliminated by replacing step 1 with
  r_(k+1) = r_(k) − α A r_(k)
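The three steps, with the cheaper residual recurrence, look like this in NumPy (a sketch; the small test system is illustrative):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Greatest (steepest) gradient descent for A x = b, A symmetric PD.
    alpha = (r^T r)/(r^T A r); the residual recurrence
    r_{k+1} = r_k - alpha * A r_k needs one mat-vec per iteration."""
    x = x0.astype(float)
    r = b - A @ x
    for _ in range(max_iter):
        Ar = A @ r                      # the single mat-vec per iteration
        alpha = (r @ r) / (r @ Ar)
        x = x + alpha * r
        r = r - alpha * Ar              # recurrence replaces r = b - A x
        if np.linalg.norm(r) < tol:
            break
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x = steepest_descent(A, b, np.zeros(2))
```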
Quiz
In our IK problem, we use the greatest gradient descent method to find an optimal pose, but we can't compute α using the formula described on the previous slides. Why?
Poor convergence
What is the problem with greatest gradient descent? Wouldn't it be nice if we could avoid traversing the same direction more than once?
Conjugate directions
Pick a set of directions d_(0), d_(1), ..., d_(n−1) and take exactly one step along each direction; the solution is then found within n steps. Two problems:
1. How do we determine these directions?
2. How do we determine the step size along each direction?
A-orthogonality
If we take the optimal step size along each direction,
  (d/dα) F(x_(k+1)) = F'(x_(k+1))^T (d/dα) x_(k+1) = 0
  ⟹ r^T_(k+1) d_(k) = 0  ⟹  d^T_(k) A e_(k+1) = 0
Two different vectors v and u are A-orthogonal, or conjugate, if v^T A u = 0.
A-orthogonality
[Figure: left, pairs of vectors that are A-orthogonal; right, pairs of vectors that are orthogonal.]
Optimal step size
e_(k+1) must be A-orthogonal to d_(k). Using this condition, can you derive α_(k)?
Algorithm
Suppose we can come up with a set of A-orthogonal directions {d_(k)}; this algorithm will converge in n steps:
1. Take direction d_(k)
2. α_(k) = (d^T_(k) r_(k)) / (d^T_(k) A d_(k))
3. x_(k+1) = x_(k) + α_(k) d_(k)
Why does it work?
We need to prove that x* can be found in n steps if we take step size α_(j) along d_(j) at each step. Since the d's are linearly independent (because they are A-orthogonal), the initial error can be expanded as
  e_(0) = Σ_{i=0}^{n−1} δ_i d_(i)
Multiplying by d^T_(j) A and using A-orthogonality,
  d^T_(j) A e_(0) = Σ_{i=0}^{n−1} δ_i d^T_(j) A d_(i) = δ_j d^T_(j) A d_(j)
  δ_j = (d^T_(j) A e_(0)) / (d^T_(j) A d_(j))
      = (d^T_(j) A (e_(0) − Σ_{k=0}^{j−1} δ_k d_(k))) / (d^T_(j) A d_(j))
      = (d^T_(j) A e_(j)) / (d^T_(j) A d_(j)) = −α_(j)
So each step cancels exactly one component of the error, and e_(n) = 0.
Quiz
Given that the d's are A-orthogonal, prove that the d's are linearly independent.
Search directions
We know how to determine the optimal step size along each direction (second problem solved). We still need to figure out what the search directions are. What do we know about d_(0), d_(1), ..., d_(n−1)?
- They are A-orthogonal to each other: d^T_(i) A d_(j) = 0
- d_(i) is A-orthogonal to e_(i+1)
Gram-Schmidt Conjugation
Suppose we have a set of linearly independent vectors u_0, u_1, ..., u_{n−1}. The search directions can be represented as
  d_(k) = u_k + Σ_{i=0}^{k−1} β_ki d_(i),  with d_(0) = u_0
Use the same trick to get rid of the summation: for k > j,
  d^T_(k) A d_(j) = u^T_(k) A d_(j) + β_kj d^T_(j) A d_(j) = 0
  ⟹ β_kj = −(u^T_(k) A d_(j)) / (d^T_(j) A d_(j))
What are the drawbacks of Gram-Schmidt conjugation?
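A direct NumPy sketch of Gram-Schmidt conjugation (names illustrative). Note that it must keep every previous d_(i) around, which is exactly the drawback the slide asks about:

```python
import numpy as np

def gram_schmidt_conjugate(U, A):
    """A-orthogonalize linearly independent vectors u_0..u_{n-1}:
    d_k = u_k + sum_{i<k} beta_ki d_i,
    beta_ki = -(u_k^T A d_i)/(d_i^T A d_i).
    Storing all previous d's costs O(n^2) space and O(n^3) work."""
    D = []
    for u in U:
        d = u.copy()
        for di in D:
            beta = -(u @ A @ di) / (di @ A @ di)
            d = d + beta * di
        D.append(d)
    return D

A = np.array([[3.0, 1.0], [1.0, 2.0]])
d0, d1 = gram_schmidt_conjugate([np.array([1.0, 0.0]),
                                 np.array([0.0, 1.0])], A)
```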
Conjugate gradients
If we pick the set of u's intelligently, we might be able to save both time and space. It turns out that the residuals (r's) are an excellent choice for the u's:
- the residuals are orthogonal to each other
- each residual is orthogonal to the previous search directions
Proof: Orthogonality
Prove that r_(k) is orthogonal to all the previous search directions d_(0), d_(1), ..., d_(k−1). Expanding the remaining error,
  e_(k) = Σ_{j=k}^{n−1} δ_j d_(j)
  d^T_(i) A e_(k) = Σ_{j=k}^{n−1} δ_j d^T_(i) A d_(j) = 0   if i < k
Since r_(k) = −A e_(k), this gives
  d^T_(i) r_(k) = 0   if i < k    (identity 1)
From here, we can prove r^T_(i) r_(j) = 0 for i ≠ j, and
  d^T_(k) r_(k) = r^T_(k) r_(k)    (identity 2)
Conjugate gradients
  d_(k) = r_(k) + Σ_{i=0}^{k−1} β_ki d_(i)
  d^T_(k) A d_(j) = r^T_(k) A d_(j) + Σ_{i=0}^{k−1} β_ki d^T_(i) A d_(j),   j < k
  0 = r^T_(k) A d_(j) + β_kj d^T_(j) A d_(j)   (by A-orthogonality of the d vectors)
  ⟹ β_kj = −(r^T_(k) A d_(j)) / (d^T_(j) A d_(j))
Each d_(k) requires O(n³) operations. However...
Conjugate gradients
r_(k) is A-orthogonal to all the previous search directions except for d_(k−1):
  β_kj = −(r^T_(k) A d_(j)) / (d^T_(j) A d_(j)) = 0   if j < k−1
  β_kj = (r^T_(k) r_(k)) / (r^T_(k−1) r_(k−1))        if j = k−1
Proof: r^T_(k) A d_(j) = 0 when j < k−1.
Proof: A-orthogonality
Prove that r_(k) is A-orthogonal to all the previous search directions except for d_(k−1):
  r_(j+1) = −A e_(j+1) = −A (e_(j) + α_(j) d_(j)) = r_(j) − α_(j) A d_(j)
  r^T_(k) r_(j+1) = r^T_(k) r_(j) − α_(j) r^T_(k) A d_(j)     (use identity 1)
  r^T_(k) A d_(j) = r^T_(k) r_(k) / α_(k)      if j = k
                  = −r^T_(k) r_(k) / α_(k−1)   if j = k−1
                  = 0                           otherwise
Conjugate gradients
Simplify β_k:
  β_k = −(r^T_(k) A d_(k−1)) / (d^T_(k−1) A d_(k−1))
      = (r^T_(k) r_(k)) / (α_(k−1) d^T_(k−1) A d_(k−1))
      = (r^T_(k) r_(k)) / (d^T_(k−1) r_(k−1))
      = (r^T_(k) r_(k)) / (r^T_(k−1) r_(k−1))    (use identity 2)
Conjugate gradients
Putting it all together:
  d_(0) = r_(0) = b − A x_(0)
  α_(k) = (r^T_(k) r_(k)) / (d^T_(k) A d_(k))
  x_(k+1) = x_(k) + α_(k) d_(k)
  r_(k+1) = r_(k) − α_(k) A d_(k)
  β_(k+1) = (r^T_(k+1) r_(k+1)) / (r^T_(k) r_(k))
  d_(k+1) = r_(k+1) + β_(k+1) d_(k)
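The update rules above translate directly into a NumPy sketch (names illustrative); for an n × n symmetric positive-definite system, CG converges in at most n steps in exact arithmetic:

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Conjugate gradient for A x = b, A symmetric positive definite."""
    x = x0.astype(float)
    r = b - A @ x          # d_0 = r_0 = b - A x_0
    d = r.copy()
    if max_iter is None:
        max_iter = len(b)
    for _ in range(max_iter):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)          # optimal step along d
        x = x + alpha * d
        r_new = r - alpha * Ad              # residual recurrence
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)    # Fletcher-Reeves-style beta
        d = r_new + beta * d                # next conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b, np.zeros(2))
```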
References
- J. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain
- A. Antoniou and W.-S. Lu, Practical Optimization
- R. Fletcher, Practical Methods of Optimization
- J. Betts, Practical Methods for Optimal Control Using Nonlinear Programming