Optimization

Mathematical optimization: determine the best solutions to mathematically defined problems under given constraints; determine optimality criteria; determine the convergence of the solution. The advent of computers has had a great impact on the development of optimization methods.

When do we need optimization?
- Optimal robotic control: starting from 0, arrive at d with velocity 0, trading off time against energy; find the motion trajectory that minimizes T = time + energy.
- Inverse kinematics.
- Optimal motion trajectories.
Inverse kinematics: from a set of 3D markers, recover a pose.

Optimization taxonomy

Unconstrained: Newton-like methods, Descent methods, Nonlinear equations
Constrained: Linear, Quadratic, Nonlinear
Discontinuous: Integer, Stochastic, Network
Newton's methods

Topics: root estimation; minimization (one variable, multiple variables); the Quasi-Newton method.

Root estimation

Find the roots of a nonlinear function: C(x) = 0.

We can linearize the function around x as

    C(x̄) ≈ C(x) + C'(x)(x̄ − x) = 0,   where C'(x) = dC/dx.

Then we can estimate the root as

    x̄ = x − C(x) / C'(x)

Newton's convergence theorem

Consider C(x) = 0 and assume x* is such a root. If C'(x*) is nonzero and C''(x) is continuous on an interval containing x*, then:
- Local convergence: if x^(0) is suitably close to x*, Newton's method converges to x*.
- Quadratic convergence: the algorithm converges quadratically, that is,

      |x^(k+1) − x*| ≤ c |x^(k) − x*|^2
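The root-estimation update above can be sketched in a few lines of code; the function names here are illustrative, not from the notes.

```python
def newton_root(C, C_prime, x0, tol=1e-12, max_iter=50):
    """Find a root of C(x) = 0 via the Newton update x <- x - C(x)/C'(x)."""
    x = x0
    for _ in range(max_iter):
        step = C(x) / C_prime(x)
        x -= step
        if abs(step) < tol:      # stop once the update is negligible
            break
    return x

# Example: the root of C(x) = x^2 - 2 is sqrt(2).
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5)
```

Starting from x^(0) = 1.5, the iterates converge to sqrt(2) in a handful of steps, illustrating the quadratic convergence of the theorem (and its sensitivity to the initial guess: start too far away and the iteration can diverge).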
Root estimation: pros and cons

Pros: quadratic convergence.
Cons: sensitive to the initial guess.

Minimization (one variable)

Find x* such that the nonlinear function F(x*) is a minimum. What is the simplest model that has a minimum? A linear model has a nonzero slope everywhere, so its slope cannot be zero at a solution; the simplest model with a minimum is a quadratic. Expand F to second order:

    F(x^(k) + δ) ≈ F(x^(k)) + F'(x^(k)) δ + (1/2) F''(x^(k)) δ^2

Finding the minima of F(x) is therefore equivalent to finding the roots of F'(x):

    d/dδ F(x^(k) + δ) = 0   ⟹   δ = −F'(x) / F''(x)

Conditions for minima

What are the conditions for a minimum to exist?
- Necessary conditions for a local minimum at x*:  F'(x*) = 0 and F''(x*) ≥ 0.
- Sufficient conditions for an isolated (strict) local minimum at x*:  F'(x*) = 0 and F''(x*) > 0.
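The one-variable minimization step δ = −F'(x)/F''(x) can be sketched as follows (names are illustrative):

```python
def newton_minimize_1d(Fp, Fpp, x0, tol=1e-10, max_iter=50):
    """Minimize F by Newton steps delta = -F'(x)/F''(x),
    i.e. Newton root-finding applied to F'(x) = 0."""
    x = x0
    for _ in range(max_iter):
        delta = -Fp(x) / Fpp(x)
        x += delta
        if abs(delta) < tol:
            break
    return x

# Example: F(x) = (x - 3)^2 + 1 has its minimum at x = 3.
xmin = newton_minimize_1d(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, 0.0)
```

Because the example F is itself quadratic, the model is exact and a single Newton step lands on the minimizer.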
Example: stationary points

Consider F(x) = x^3 and F(x) = x^4. Which function has a strict isolated minimum at x = 0? Many methods only locate a point x* such that F'(x*) = 0. Such an x* is a stationary point, and it can be one of three types: minimum, maximum, or saddle.

Multiple variables

Expand F to second order around x^(k):

    F(x^(k) + p) ≈ F(x^(k)) + g^T(x^(k)) p + (1/2) p^T H(x^(k)) p

where g(x) = ∇F is the gradient vector, with entries ∂F/∂x_i, and H(x) = ∇²F is the Hessian matrix, with entries ∂²F/∂x_i ∂x_j. Setting the gradient of the model to zero gives the Newton step:

    0 = g(x^(k)) + H(x^(k)) p
    p = −H(x^(k))^(−1) g(x^(k))
    x^(k+1) = x^(k) + p
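The multi-variable Newton step can be sketched as below. Rather than forming H^(−1) explicitly, the step solves the linear system H p = −g, which is cheaper and numerically safer; the function names are illustrative.

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """x_{k+1} = x_k + p, where H(x_k) p = -g(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        p = np.linalg.solve(hess(x), -grad(x))   # solve H p = -g
        x = x + p
        if np.linalg.norm(p) < tol:
            break
    return x

# Example: F(x) = x0^2 + 2 x1^2 is quadratic, so one Newton step suffices.
g = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
H = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
xstar = newton_minimize(g, H, [5.0, -3.0])
```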
Multiple variables: conditions

- Necessary conditions: g(x*) = 0 and p^T H(x*) p ≥ 0 for all p (H is positive semi-definite).
- Sufficient conditions: g(x*) = 0 and p^T H(x*) p > 0 for all p ≠ 0 (H is positive definite).

Positive definite matrix

The function F at some arbitrary point near x* can be approximated by

    F(x^(k+1)) = F(x*) + g^T(x*) p + (1/2) p^T H p = F(x*) + (1/2) p^T H p    (by g(x*) = 0)

If x* is the minimizer of F, then (1/2) p^T H p > 0 for any p ≠ 0.

Finite difference Newton method

The main drawback of Newton's method is that the user must supply the formulas to compute the Hessian matrix. Finite difference methods instead estimate H^(k) by computing differences in gradient vectors: evaluate the gradient with an increment h_i in each coordinate direction e_i, so that each column i of H^(k) is

    (g(x^(k) + h_i e_i) − g(x^(k))) / h_i

How many gradient evaluations are required to update the Hessian? One extra per column, so n per iteration. Remaining problems:
- The estimated Hessian may no longer be symmetric or positive definite; rectify the symmetry of H^(k) by H^(k) ← (H^(k) + H^(k)T) / 2.
- A linear system must still be solved to apply the inverse of the Hessian.

All these problems can be addressed by Quasi-Newton methods.
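The finite-difference Hessian estimate, including the symmetrization step, can be sketched as follows (names and the example gradient are illustrative):

```python
import numpy as np

def fd_hessian(grad, x, h=1e-5):
    """Estimate the Hessian column-by-column from gradient differences,
    then rectify symmetry with (H + H^T)/2."""
    x = np.asarray(x, dtype=float)
    n = x.size
    g0 = grad(x)
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        H[:, i] = (grad(x + e) - g0) / h   # one extra gradient per column
    return 0.5 * (H + H.T)                 # symmetrize

# Gradient of F(x) = x0^2 + x0*x1 + 3*x1^2; its true Hessian is [[2,1],[1,6]].
g = lambda x: np.array([2.0 * x[0] + x[1], x[0] + 6.0 * x[1]])
H = fd_hessian(g, np.array([1.0, 2.0]))
```

Note the cost the notes ask about: n gradient evaluations beyond g(x^(k)), one per coordinate direction.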
Quasi-Newton method

Quasi-Newton methods construct a new estimate of the Hessian using information from previous iterates. In each iteration, approximate the inverse Hessian by a symmetric positive definite matrix Ĥ^(k):

    1. p = −Ĥ^(k) g^(k)
    2. x^(k+1) = x^(k) + p
    3. update Ĥ^(k), giving Ĥ^(k+1)

The initial matrix Ĥ^(0) can be any symmetric positive definite matrix, for example Ĥ^(0) = I. By repeated updates, the Quasi-Newton method turns this arbitrary initial matrix into a close approximation of the inverse Hessian. In each iteration, Ĥ^(k+1) is computed by augmenting Ĥ^(k) with second-derivative information gained on the k-th iteration, expressed through the Quasi-Newton condition:

    Ĥ^(k+1) γ^(k) = p^(k),   where γ^(k) = g^(k+1) − g^(k)

A rank-one update that satisfies this condition:

    Ĥ^(k+1) = Ĥ^(k) + E^(k) = Ĥ^(k) + a u u^T
    Ĥ^(k) γ^(k) + a u u^T γ^(k) = p^(k)
    u = p^(k) − Ĥ^(k) γ^(k),   with a u^T γ^(k) = 1

which gives

    Ĥ^(k+1) = Ĥ + (p − Ĥγ)(p − Ĥγ)^T / ((p − Ĥγ)^T γ)
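The rank-one (symmetric-rank-1) update above can be sketched as follows. This is a sketch under simplifying assumptions: no line search, and the update is skipped when its denominator is tiny (SR1 updates are not guaranteed to stay positive definite in general); all names are illustrative.

```python
import numpy as np

def sr1_quasi_newton(grad, x0, max_iter=50, tol=1e-10):
    """Rank-one quasi-Newton sketch: Hhat approximates the *inverse*
    Hessian and is updated so that Hhat_{k+1} @ gamma_k = p_k."""
    x = np.asarray(x0, dtype=float)
    Hhat = np.eye(x.size)              # any SPD initial matrix
    g = grad(x)
    for _ in range(max_iter):
        p = -Hhat @ g
        x_new = x + p
        g_new = grad(x_new)
        gamma = g_new - g
        v = p - Hhat @ gamma           # u = p - Hhat gamma
        denom = v @ gamma
        if abs(denom) > 1e-12:         # skip update when denominator is tiny
            Hhat = Hhat + np.outer(v, v) / denom
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:
            break
    return x

# Gradient of F(x) = x0^2 + 5 x1^2, whose minimizer is the origin.
g = lambda x: np.array([2.0 * x[0], 10.0 * x[1]])
xstar = sr1_quasi_newton(g, [4.0, -2.0])
```

On this quadratic, a couple of updates already turn the identity into a close approximation of the true inverse Hessian diag(1/2, 1/10), after which the steps are essentially Newton steps.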
Descent methods

Solving a large linear system Ax = b, where A is a known, square, symmetric, positive-definite matrix, b is a known vector, and x is the unknown vector. Topics: greatest gradient descent; conjugate directions; conjugate gradient.

- If A is dense, solve with factorization and backsubstitution.
- If A is sparse, solve with iterative methods (Conjugate Gradient).

The quadratic form

    F(x) = (1/2) x^T A x − b^T x + c

has the property that the minimizer of F is also the solution to Ax = b:

    ∇F(x) = 0 = Ax − b

Greatest gradient descent

Start at an arbitrary point x^(0) and slide down to the bottom of the paraboloid: take a series of steps x^(1), x^(2), ... until we are satisfied that we are close enough to the solution x*. At each step, move along the direction in which F decreases most quickly:

    −∇F(x^(k)) = b − A x^(k)
Greatest gradient descent: line search

Important definitions:
- error:    e^(k) = x^(k) − x*
- residual: r^(k) = b − A x^(k) = −∇F(x^(k)) = −A e^(k)

Think of the residual as the direction of greatest descent. The first step is

    x^(1) = x^(0) + α r^(0)

But how big a step should we take? A line search is a procedure that chooses α to minimize F along a line.

[Figure: the method of Steepest Descent. (a) Starting at x^(0), take a step in the direction of steepest descent of F. (b) Find the point on the intersection of these two surfaces that minimizes F. (c) This parabola is the intersection of surfaces; the bottommost point is our target. (d) The gradient at the bottommost point is orthogonal to the gradient of the previous step.]

Optimal step size

    d/dα F(x^(1)) = ∇F(x^(1))^T (d/dα) x^(1) = ∇F(x^(1))^T r^(0) = 0

so ∇F(x^(1)) is orthogonal to r^(0), i.e. r^(0)T r^(1) = 0.
Optimal step size

    x^(k+1) = x^(k) + α r^(k),   with r^(k)T r^(k+1) = 0

Exercise: derive α. Answer:

    α = (r^(k)T r^(k)) / (r^(k)T A r^(k))

The algorithm:

    1. r^(k) = b − A x^(k)
    2. α = (r^(k)T r^(k)) / (r^(k)T A r^(k))
    3. x^(k+1) = x^(k) + α r^(k)

This requires two matrix-vector multiplications per iteration. One multiplication can be eliminated by computing the next residual with the recurrence r^(k+1) = r^(k) − α A r^(k) instead of step 1, reusing the product A r^(k) already needed in step 2.

Poor convergence

What is the problem with greatest descent? It often steps along the same directions over and over, zig-zagging toward the solution. Wouldn't it be nice if we could avoid traversing the same direction twice?

Conjugate directions

Pick a set of orthogonal directions d^(0), d^(1), ..., d^(n−1) and take exactly one step along each direction; the solution is then found within n steps. Two problems:
1. How do we determine these directions?
2. How do we determine the step size along each direction?
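The three-step algorithm above, with the residual recurrence replacing the explicit residual computation, can be sketched as follows (names and the small example system are illustrative):

```python
import numpy as np

def steepest_descent(A, b, x0, max_iter=200, tol=1e-12):
    """Greatest (steepest) gradient descent for Ax = b with the optimal
    step alpha = (r^T r)/(r^T A r) and the recurrence
    r_{k+1} = r_k - alpha * A r_k (one matrix-vector product per iteration)."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x
    for _ in range(max_iter):
        Ar = A @ r                      # the only matvec this iteration
        alpha = (r @ r) / (r @ Ar)
        x = x + alpha * r
        r = r - alpha * Ar              # recurrence instead of b - A x
        if np.linalg.norm(r) < tol:
            break
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # symmetric positive definite
b = np.array([5.0, 5.0])                # exact solution is x = (1, 2)
x = steepest_descent(A, b, np.zeros(2))
```

In exact arithmetic the recurrence and b − Ax agree; in floating point the recurrence slowly accumulates error, so long runs occasionally recompute the true residual.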
Conjugate directions: step size

Let's deal with the second problem first:

    x^(k+1) = x^(k) + α^(k) d^(k)

To compute α^(k), we would need to know e^(k) — but if we knew e^(k), the problem would already be solved! Use the fact that e^(k+1) should be orthogonal to d^(k), so that we need never step along d^(k) again:

    d^(k)T e^(k+1) = 0
    d^(k)T (e^(k) + α^(k) d^(k)) = 0
    α^(k) = −(d^(k)T e^(k)) / (d^(k)T d^(k))

What seems to be the problem? This still requires the unknown e^(k).

A-orthogonality

Instead of making the search directions orthogonal, we find a set of directions that are A-orthogonal to each other. Two vectors d^(i) and d^(j) are A-orthogonal, or conjugate, if

    d^(i)T A d^(j) = 0

If we take the optimal step size along each direction, then

    d/dα F(x^(k+1)) = 0   ⟹   ∇F(x^(k+1))^T (d/dα) x^(k+1) = 0   ⟹   r^(k+1)T d^(k) = 0   ⟹   d^(k)T A e^(k+1) = 0

so e^(k+1) must be A-orthogonal to d^(k). (Compare: orthogonal vectors satisfy d^(i)T d^(j) = 0; A-orthogonal vectors satisfy d^(i)T A d^(j) = 0.)
Optimal step size

Since e^(k+1) must be A-orthogonal to d^(k), we can derive α^(k). Suppose we can come up with a set of A-orthogonal directions {d^(k)}; then

    d^(k)T A e^(k+1) = 0   ⟹   α^(k) = (d^(k)T r^(k)) / (d^(k)T A d^(k))

The algorithm:

    1. compute d^(k)
    2. α^(k) = (d^(k)T r^(k)) / (d^(k)T A d^(k))
    3. x^(k+1) = x^(k) + α^(k) d^(k)

Why does it work?

We need to prove that x* is found in n steps if we take a step of size α^(k) along d^(k) at each step. Express the initial error in the basis of search directions:

    e^(0) = Σ_{i=0}^{n−1} δ_i d^(i)

Premultiplying by d^(j)T A kills every term except the j-th:

    d^(j)T A e^(0) = Σ_{i} δ_i d^(j)T A d^(i) = δ_j d^(j)T A d^(j)

    δ_j = (d^(j)T A e^(0)) / (d^(j)T A d^(j))
        = (d^(j)T A (e^(0) + Σ_{k=0}^{j−1} α^(k) d^(k))) / (d^(j)T A d^(j))   (the added terms vanish by A-orthogonality)
        = (d^(j)T A e^(j)) / (d^(j)T A d^(j))
        = −α^(j)

So each step exactly cancels one component of the error, and e^(n) = 0.

Search directions

We now know how to determine the optimal step size along each direction (second problem solved). We still need to figure out what the search directions are. What do we know about d^(0), d^(1), ..., d^(n−1)?
- They are A-orthogonal to each other: d^(i)T A d^(j) = 0.
- d^(i) is A-orthogonal to e^(i+1).
Gram-Schmidt conjugation

Suppose we have a set of n linearly independent vectors u_0, u_1, ..., u_{n−1}. To construct d^(i), take u_i and subtract out any components that are not A-orthogonal to the previous d vectors:

    d^(0) = u_0
    d^(k) = u_k + Σ_{i=0}^{k−1} β_ki d^(i)

Premultiplying by A d^(j) for k > j, all but one term of the sum vanishes by A-orthogonality:

    d^(k)T A d^(j) = u_k^T A d^(j) + β_kj d^(j)T A d^(j) = 0
    β_kj = −(u_k^T A d^(j)) / (d^(j)T A d^(j))

What are the drawbacks of Gram-Schmidt conjugation? All previous search directions must be kept in memory to construct each new one, and each new direction costs O(n^2) operations.

Conjugate gradients

If we pick the set of u's intelligently, we might be able to save both time and space. It turns out that the residuals (r's) are an excellent choice for the u's: the residual is orthogonal to the previous search directions (just as it works for Greatest Descent). Take r^(k) and subtract out any components that are not A-orthogonal to the previous d vectors:

    d^(k) = r^(k) + Σ_{i=0}^{k−1} β_ki d^(i)
    d^(k)T A d^(j) = r^(k)T A d^(j) + Σ_{i} β_ki d^(i)T A d^(j),   j < k
    0 = r^(k)T A d^(j) + β_kj d^(j)T A d^(j)    (by A-orthogonality of the d vectors)
    β_kj = −(r^(k)T A d^(j)) / (d^(j)T A d^(j))

Each d^(k) still requires O(n^2) operations! However...
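Gram-Schmidt conjugation can be sketched directly from the formulas above; the function name and the example inputs are illustrative.

```python
import numpy as np

def conjugate_directions(A, U):
    """A-orthogonalize the columns of U: d_k = u_k + sum_j beta_kj d_j,
    with beta_kj = -(u_k^T A d_j) / (d_j^T A d_j)."""
    n = U.shape[1]
    D = np.empty_like(U, dtype=float)
    for k in range(n):
        d = U[:, k].astype(float)
        for j in range(k):                    # subtract non-A-orthogonal parts
            Adj = A @ D[:, j]
            beta = -(U[:, k] @ Adj) / (D[:, j] @ Adj)
            d = d + beta * D[:, j]
        D[:, k] = d
    return D

A = np.array([[4.0, 1.0], [1.0, 3.0]])
D = conjugate_directions(A, np.eye(2))   # conjugate the coordinate axes
```

The double loop makes the O(n^2)-per-direction cost (and the need to store every previous direction) explicit, which is exactly the drawback the conjugate gradient method removes.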
Conjugate gradient

In fact, r^(k) is A-orthogonal to all the previous search directions except d^(k−1):

    β_kj = −(r^(k)T A d^(j)) / (d^(j)T A d^(j)) = 0   if j < k − 1
    β_k,k−1 = (r^(k)T r^(k)) / (r^(k−1)T r^(k−1))

Proof that r^(k)T A d^(j) = 0 when j < k − 1: from the residual recurrence

    r^(k+1) = −A e^(k+1) = −A (e^(k) + α^(k) d^(k)) = r^(k) − α^(k) A d^(k)

we get r^(j)T r^(k+1) = r^(j)T r^(k) − α^(k) r^(j)T A d^(k), and since the residuals are mutually orthogonal,

    r^(j)T A d^(k) =  (r^(j)T r^(j)) / α^(j)       if j = k
                     −(r^(j)T r^(j)) / α^(j−1)     if j = k + 1
                      0                            otherwise

Putting it all together, the conjugate gradient method is:

    d^(0) = r^(0) = b − A x^(0)
    α^(k) = (r^(k)T r^(k)) / (d^(k)T A d^(k))
    x^(k+1) = x^(k) + α^(k) d^(k)
    r^(k+1) = r^(k) − α^(k) A d^(k)
    β^(k+1) = (r^(k+1)T r^(k+1)) / (r^(k)T r^(k))
    d^(k+1) = r^(k+1) + β^(k+1) d^(k)
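The six update equations above translate almost line-for-line into code; the function name and example system are illustrative.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-12):
    """Conjugate gradient for a symmetric positive-definite A: in exact
    arithmetic it terminates in at most n steps for an n x n system."""
    n = b.size
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                       # d^(0) = r^(0) = b - A x^(0)
    d = r.copy()
    rs = r @ r
    for _ in range(n):                  # at most n steps
        Ad = A @ d
        alpha = rs / (d @ Ad)           # alpha = r^T r / d^T A d
        x = x + alpha * d
        r = r - alpha * Ad              # r^(k+1) = r^(k) - alpha A d^(k)
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d       # beta = r_{k+1}^T r_{k+1} / r_k^T r_k
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)            # 2x2 system: converges in 2 steps
```

Note that only one matrix-vector product A d is needed per iteration; everything else is vector arithmetic, which is what makes CG attractive for large sparse systems.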