Optimization 2
CS5240 Theoretical Foundations in Multimedia
Leow Wee Kheng
Department of Computer Science, School of Computing
National University of Singapore
Leow Wee Kheng (NUS) Optimization 2 1 / 38
Last Time...
Gradient descent updates variables by
    x_{k+1} = x_k − α ∇f(x_k).   (1)
Gradient descent can be slow. To increase the convergence rate, find a good step length α using line search.
Line Search
Gradient descent minimizes f(x) by updating x_k along −∇f(x_k):
    x_{k+1} = x_k − α ∇f(x_k).   (2)
In general, other descent directions p_k are possible if ∇f(x_k)ᵀ p_k < 0. Then, update x_k according to
    x_{k+1} = x_k + α p_k.   (3)
Ideally, we want the α > 0 that minimizes f(x_k + α p_k), but computing it exactly is too expensive. In practice, find an α > 0 that reduces f(x) sufficiently (not by too small an amount). The simple condition f(x_k + α p_k) < f(x_k) may not be sufficient.
Line Search
A popular sufficient-decrease condition is the Armijo condition:
    f(x_k + α p_k) ≤ f(x_k) + c₁ α ∇f(x_k)ᵀ p_k,   (4)
where c₁ > 0 is usually quite small, say c₁ = 10⁻⁴.
The Armijo condition may accept very small α. So, add a curvature condition to rule out steps that are too small:
    ∇f(x_k + α p_k)ᵀ p_k ≥ c₂ ∇f(x_k)ᵀ p_k,   (5)
with 0 < c₁ < c₂ < 1, typically c₂ = 0.9. These two conditions together are called the Wolfe conditions.
Known result: Gradient descent will find a local minimum if α satisfies the Wolfe conditions.
Line Search
Another way is to also bound α from below, giving the Goldstein conditions:
    f(x_k) + (1 − c) α ∇f(x_k)ᵀ p_k ≤ f(x_k + α p_k) ≤ f(x_k) + c α ∇f(x_k)ᵀ p_k,   (6)
with 0 < c < 1/2.
Another simple way is to start with a large α and reduce it iteratively:
Backtracking Line Search
1. Choose large α > 0 and 0 < ρ < 1.
2. Repeat until the Armijo condition is met:
   2.1 α ← ρα.
In this case, the Armijo condition alone is enough.
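The backtracking loop above fits in a few lines of Python. This is a minimal sketch; the quadratic test function and parameter values below are illustrative assumptions, not from the slides.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha by factor rho until the Armijo condition holds:
    f(x + alpha*p) <= f(x) + c1 * alpha * grad_f(x)^T p."""
    fx = f(x)
    slope = grad_f(x) @ p      # directional derivative; negative for a descent direction
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

# Illustrative use: f(x) = x^T x, descending along the negative gradient.
f = lambda x: x @ x
grad = lambda x: 2.0 * x
x0 = np.array([1.0, 1.0])
alpha = backtracking_line_search(f, grad, x0, -grad(x0))
print(alpha, f(x0 + alpha * (-grad(x0))))
```

Because the loop only ever shrinks α, any accepted step satisfies the Armijo condition, which is why no lower bound on α is needed here.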
Gauss-Newton Method
[Portraits: Carl Friedrich Gauss, Sir Isaac Newton]
Gauss-Newton Method
The Gauss-Newton method estimates the Hessian H using 1st-order derivatives. Express E(a) as the sum of squares of r_i(a):
    E(a) = Σ_{i=1}^n r_i(a)²,   (7)
where r_i is the error or residual of each data point x_i:
    r_i(a) = v_i − f(x_i; a),   (8)
and
    r = [r_1 ⋯ r_n]ᵀ.   (9)
Then,
    ∂E/∂a_j = 2 Σ_{i=1}^n r_i ∂r_i/∂a_j,   (10)
    ∂²E/(∂a_j ∂a_l) = 2 Σ_{i=1}^n [ (∂r_i/∂a_j)(∂r_i/∂a_l) + r_i ∂²r_i/(∂a_j ∂a_l) ]
                    ≈ 2 Σ_{i=1}^n (∂r_i/∂a_j)(∂r_i/∂a_l).   (11)
Gauss-Newton Method
The derivative ∂r_i/∂a_j is the (i,j)-th element of the Jacobian J_r of r:
    J_r = [ ∂r_1/∂a_1  ⋯  ∂r_1/∂a_m
             ⋮               ⋮
            ∂r_n/∂a_1  ⋯  ∂r_n/∂a_m ].   (12)
The gradient and Hessian are related to the Jacobian by
    ∇E = 2 J_rᵀ r,   (13)
    H ≈ 2 J_rᵀ J_r.   (14)
So, the update equation becomes
    a_{k+1} = a_k − (J_rᵀ J_r)⁻¹ J_rᵀ r.   (15)
Gauss-Newton Method
We can also use the Jacobian J_f of f(x_i):
    J_f = [ ∂f(x_1)/∂a_1  ⋯  ∂f(x_1)/∂a_m
             ⋮                     ⋮
            ∂f(x_n)/∂a_1  ⋯  ∂f(x_n)/∂a_m ].   (16)
Then, J_f = −J_r, and the update equation becomes
    a_{k+1} = a_k + (J_fᵀ J_f)⁻¹ J_fᵀ r.   (17)
Gauss-Newton Method
Compare to linear fitting:
    linear fitting:    a = (Dᵀ D)⁻¹ Dᵀ v
    Gauss-Newton:     Δa = (J_fᵀ J_f)⁻¹ J_fᵀ r
(J_fᵀ J_f)⁻¹ J_fᵀ is analogous to the pseudo-inverse.
Compare to gradient descent:
    gradient descent: Δa = −α ∇E
    Gauss-Newton:     Δa = −(1/2) (J_rᵀ J_r)⁻¹ ∇E
Gauss-Newton updates a along the negative gradient at a variable rate. The Gauss-Newton method may converge slowly or diverge if the initial guess a_0 is far from the minimum, or if the matrix JᵀJ is ill-conditioned.
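The update a_{k+1} = a_k + (J_fᵀ J_f)⁻¹ J_fᵀ r can be run on a small curve-fitting problem. A minimal sketch, assuming a hypothetical exponential model f(x; a) = a₀ e^{a₁ x} and noise-free synthetic data:

```python
import numpy as np

def gauss_newton(model, jacobian, a0, x, v, iters=30):
    """Gauss-Newton for least squares:
    a <- a + (Jf^T Jf)^-1 Jf^T r, with residuals r_i = v_i - f(x_i; a)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(iters):
        r = v - model(x, a)
        J = jacobian(x, a)                 # n x m Jacobian of f w.r.t. a
        a = a + np.linalg.solve(J.T @ J, J.T @ r)
    return a

# Hypothetical model f(x; a) = a0 * exp(a1 * x) and its Jacobian.
model = lambda x, a: a[0] * np.exp(a[1] * x)
jac = lambda x, a: np.column_stack([np.exp(a[1] * x),
                                    a[0] * x * np.exp(a[1] * x)])
x = np.linspace(0.0, 1.0, 50)
v = model(x, [2.0, -1.5])                  # noise-free data, true a = (2, -1.5)
a_fit = gauss_newton(model, jac, [1.0, -1.0], x, v)
print(a_fit)                               # should approach [2, -1.5]
```

Note the sketch has no safeguards: with a poor initial guess or an ill-conditioned JᵀJ, `np.linalg.solve` can fail or the iteration can diverge, which is exactly the weakness noted above.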
Condition Number
The condition number of a (real) matrix D is given by the ratio of its extreme singular values:
    κ(D) = σ_max(D) / σ_min(D).   (18)
If D is normal, i.e., DᵀD = DDᵀ, then its condition number is given by the ratio of its extreme eigenvalue magnitudes:
    κ(D) = |λ_max(D)| / |λ_min(D)|.   (19)
A large condition number means D stretches some directions much more than others.
Condition Number
For the linear equation Da = v, the condition number indicates how the solution a changes with v.
Small condition number:
  Small change in v gives small change in a: stable.
  Small error in v gives small error in a.
  D is well-conditioned.
Large condition number:
  Small change in v gives large change in a: unstable.
  Small error in v gives large error in a.
  D is ill-conditioned.
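A small numerical illustration (the matrices and perturbation below are hypothetical): a nearly singular D has a huge condition number, and a tiny change in v then produces a large change in the solution a.

```python
import numpy as np

def kappa(D):
    """Condition number as the ratio of extreme singular values."""
    s = np.linalg.svd(D, compute_uv=False)
    return s[0] / s[-1]

D_good = np.array([[1.0, 0.0], [0.0, 1.0]])      # identity: kappa = 1
D_bad  = np.array([[1.0, 1.0], [1.0, 1.0001]])   # nearly linearly dependent rows

v = np.array([1.0, 1.0])
dv = np.array([0.0, 1e-4])                       # small perturbation of v

# How far does the solution of Da = v move when v moves by dv?
shift_good = np.linalg.norm(np.linalg.solve(D_good, v + dv) - np.linalg.solve(D_good, v))
shift_bad  = np.linalg.norm(np.linalg.solve(D_bad,  v + dv) - np.linalg.solve(D_bad,  v))
print(kappa(D_good), shift_good)   # small kappa, tiny shift
print(kappa(D_bad),  shift_bad)    # huge kappa, shift ~10^4 times larger
```

For D_bad the solution jumps from (1, 0) to (0, 1) under a perturbation of size 10⁻⁴ in v, which is what "ill-conditioned" means in practice.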
Practical Issues
The update equation, with J = J_f,
    a_{k+1} = a_k + (Jᵀ J)⁻¹ Jᵀ r,   (20)
requires computation of a matrix inverse, which can be expensive. Singular value decomposition (SVD) can help. The SVD of J is
    J = U Σ Vᵀ.   (21)
Substituting it into the update equation yields (Homework)
    a_{k+1} = a_k + V Σ⁻¹ Uᵀ r.   (22)
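The identity behind Eq. (22) is easy to check numerically. A sketch, assuming a random full-column-rank J and residual r for illustration: the SVD step V Σ⁻¹ Uᵀ r equals the normal-equation step (Jᵀ J)⁻¹ Jᵀ r without ever forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((6, 3))                   # n = 6 points, m = 3 parameters
r = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(J, full_matrices=False)  # thin SVD: J = U diag(s) V^T
step_svd = Vt.T @ ((U.T @ r) / s)                 # V Sigma^-1 U^T r
step_normal = np.linalg.solve(J.T @ J, J.T @ r)   # (J^T J)^-1 J^T r
print(np.allclose(step_svd, step_normal))
```

The SVD route is also numerically safer: it avoids squaring the condition number, which forming Jᵀ J does.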
Practical Issues
Convergence criteria: same as gradient descent, e.g.,
    |E_{k+1} − E_k| / |E_k| < ε,  e.g., ε = 0.0001,
    |a_{j,k+1} − a_{j,k}| / |a_{j,k}| < ε,  for each component j.
If divergence occurs, can introduce a step length α:
    a_{k+1} = a_k + α (Jᵀ J)⁻¹ Jᵀ r,   (23)
where 0 < α < 1.
Is there an alternative way to control convergence? Yes: the Levenberg-Marquardt method.
Levenberg-Marquardt Method
The Gauss-Newton method updates a according to
    (Jᵀ J) Δa = Jᵀ r.   (24)
Levenberg modifies the update equation to
    (Jᵀ J + λI) Δa = Jᵀ r,   (25)
where λ is adjusted at each iteration:
  If the reduction of error E is large, can use a small λ: more similar to Gauss-Newton.
  If the reduction of error E is small, can use a large λ: more similar to gradient descent.
Levenberg-Marquardt Method
If λ is very large, (Jᵀ J + λI) ≈ λI, i.e., the update is just gradient descent. Marquardt scales each component of the gradient:
    (Jᵀ J + λ diag(Jᵀ J)) Δa = Jᵀ r,   (26)
where diag(Jᵀ J) is the diagonal matrix of the diagonal elements of Jᵀ J. The (i,i)-th component of diag(Jᵀ J) is
    Σ_{l=1}^n ( ∂f(x_l)/∂a_i )².   (27)
If the (i,i)-th component of the gradient is small, a_i is updated more: faster convergence along the i-th direction. Then, λ doesn't have to be too large.
Quasi-Newton
Quasi-Newton also minimizes f(x) by iteratively updating x:
    x_{k+1} = x_k + α p_k,   (28)
where the descent direction p_k is
    p_k = −H_k⁻¹ ∇f(x_k).   (29)
There are many methods of estimating H⁻¹. The most effective is the method by Broyden, Fletcher, Goldfarb and Shanno (BFGS).
BFGS Method
Let Δx_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k). Then, estimate H_k or H_k⁻¹ by
    H_{k+1} = H_k − (H_k Δx_k Δx_kᵀ H_k) / (Δx_kᵀ H_k Δx_k) + (y_k y_kᵀ) / (y_kᵀ Δx_k),   (30)
    H_{k+1}⁻¹ = ( I − (Δx_k y_kᵀ) / (y_kᵀ Δx_k) ) H_k⁻¹ ( I − (y_k Δx_kᵀ) / (y_kᵀ Δx_k) ) + (Δx_k Δx_kᵀ) / (y_kᵀ Δx_k).   (31)
(For the derivation, please refer to [1].)
BFGS Algorithm
1. Set α and initialize x_0 and H_0⁻¹.
2. For k = 0, 1, ... until convergence:
   2.1 Compute search direction p_k = −H_k⁻¹ ∇f(x_k).
   2.2 Perform line search to find α that satisfies the Wolfe conditions.
   2.3 Compute Δx_k = α p_k, and set x_{k+1} = x_k + Δx_k.
   2.4 Compute y_k = ∇f(x_{k+1}) − ∇f(x_k).
   2.5 Compute H_{k+1}⁻¹ by Eq. 31.
There is no standard way to initialize H. Can try H = I.
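The steps above can be sketched in Python. Two simplifying assumptions: a backtracking Armijo search stands in for a full Wolfe line search, and the inverse-Hessian update is skipped when the curvature y_kᵀ Δx_k is not positive (which a Wolfe search would guarantee). The quadratic test function is illustrative.

```python
import numpy as np

def bfgs(f, grad_f, x0, iters=100, rho=0.5, c1=1e-4):
    """BFGS sketch: H0^-1 = I, Armijo backtracking line search,
    inverse-Hessian update of Eq. (31)."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    Hinv, I = np.eye(n), np.eye(n)
    for _ in range(iters):
        g = grad_f(x)
        if np.linalg.norm(g) < 1e-10:          # converged
            break
        p = -Hinv @ g                          # step 2.1: search direction
        alpha = 1.0                            # step 2.2: backtracking (Armijo only)
        while f(x + alpha * p) > f(x) + c1 * alpha * (g @ p):
            alpha *= rho
        dx = alpha * p                         # step 2.3
        y = grad_f(x + dx) - g                 # step 2.4
        x = x + dx
        ys = y @ dx
        if ys > 1e-12:                         # step 2.5: update only if y^T dx > 0
            Hinv = ((I - np.outer(dx, y) / ys) @ Hinv @
                    (I - np.outer(y, dx) / ys) + np.outer(dx, dx) / ys)
    return x

# Illustrative convex quadratic with minimum at (1, -2).
f = lambda x: (x[0] - 1.0)**2 + 10.0 * (x[1] + 2.0)**2
g = lambda x: np.array([2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)])
x_min = bfgs(f, g, [0.0, 0.0])
print(x_min)
```

Note that only gradients are needed: the curvature information in H⁻¹ is accumulated entirely from the (Δx_k, y_k) pairs.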
Approximation of Gradient
All previous methods require the gradient ∇f (or ∇E). In many complex problems, it is very difficult to derive ∇f. Several methods overcome this problem:
  Use symbolic differentiation to derive the math expression of ∇f. Provided in Matlab, Maple, etc.
  Approximate the gradient numerically.
  Optimize without the gradient.
Forward Difference
From the Taylor series expansion,
    f(x + Δx) = f(x) + ∇f(x)ᵀ Δx + ⋯   (32)
Let e_j denote the unit vector of the j-th dimension and set Δx = ε e_j. Then, the j-th component of ∇f(x) is
    ∂f(x)/∂x_j ≈ ( f(x + ε e_j) − f(x) ) / ε.   (33)
Note that x + ε e_j updates only the j-th component:
    x + ε e_j = [x_1 ⋯ x_{j−1} x_j x_{j+1} ⋯ x_m]ᵀ + [0 ⋯ 0 ε 0 ⋯ 0]ᵀ.   (34)
Then,
    ∇f(x) ≈ [∂f(x)/∂x_1 ⋯ ∂f(x)/∂x_m]ᵀ.   (35)
Central Difference
The forward difference method has a forward bias: x + ε e_j. Error from this bias can accumulate quickly. To be less biased, use the central difference:
    f(x + ε e_j) ≈ f(x) + ε ∂f(x)/∂x_j,
    f(x − ε e_j) ≈ f(x) − ε ∂f(x)/∂x_j.
So,
    ∂f(x)/∂x_j ≈ ( f(x + ε e_j) − f(x − ε e_j) ) / (2ε).   (36)
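Eq. (36) applied coordinate by coordinate gives a drop-in gradient approximation. A minimal sketch; the test function and ε value are illustrative assumptions.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate each gradient component by the central difference
    (f(x + eps*e_j) - f(x - eps*e_j)) / (2*eps)."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = eps                   # perturb only the j-th component
        g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]     # true gradient: [2*x0 + 3*x1, 3*x0]
g_approx = numerical_gradient(f, [1.0, 2.0])
print(g_approx)                               # close to [8, 3]
```

The cost is 2m function evaluations per gradient, versus m + 1 for forward differences; the payoff is an O(ε²) truncation error instead of O(ε).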
Optimization without Gradient
A simple idea is to descend along each coordinate direction iteratively. Let e_1, ..., e_m denote the unit vectors along coordinates 1, ..., m.
Coordinate Descent
1. Choose initial x_0.
2. Repeat until convergence:
   2.1 For each j = 1, ..., m:
       2.1.1 Find x_{k+1} = x_k + α e_j that minimizes f.
Optimization without Gradient
Alternative for Step 2.1.1: Find x_{k+1} along e_j using line search to reduce f sufficiently.
Strength: Simplicity.
Weakness: Can be slow. Can iterate around the minimum, never approaching it.
Powell's Method
Overcomes the problem of coordinate descent.
Basic Idea: Choose directions that do not interfere with or undo previous effort.
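A coordinate descent sketch. The grid-based 1-D minimizer and the quadratic objective below are illustrative assumptions; any 1-D line search can fill Step 2.1.1.

```python
import numpy as np

def coordinate_descent(f, x0, iters=100, step=1e-2):
    """Minimize f along each coordinate axis in turn, using a crude
    grid search over step multiples as the 1-D minimizer."""
    x = np.asarray(x0, dtype=float)
    alphas = step * np.arange(-100, 101)      # candidate moves along each axis
    for _ in range(iters):
        for j in range(len(x)):
            e = np.zeros_like(x)
            e[j] = 1.0
            vals = [f(x + a * e) for a in alphas]
            x = x + alphas[int(np.argmin(vals))] * e
    return x

# Illustrative coupled quadratic; its minimum is at (1.6, -2.4).
f = lambda x: (x[0] - 1.0)**2 + (x[1] + 2.0)**2 + 0.5 * x[0] * x[1]
x_cd = coordinate_descent(f, [0.0, 0.0])
print(x_cd)
```

Note that only function values are used, no gradient. The cross term 0.5·x₀x₁ is what makes the axes interfere: each axis update shifts the optimum of the other, which is the slowness Powell's method is designed to avoid.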
Alternating Optimization
Let S ⊆ ℝ^m denote the set of x that satisfy the constraints. Then, the constrained optimization problem can be written as:
    min_{x∈S} f(x).   (37)
Alternating optimization (AO) is similar to coordinate descent except that the variables x_j, j = 1, ..., m, are split into groups. Partition x ∈ S into p non-overlapping parts y_l such that
    x = (y_1, ..., y_l, ..., y_p).   (38)
Example: x = (x_1, x_2, x_4 | x_3, x_7 | x_5, x_6, x_8, x_9), with y_1 = (x_1, x_2, x_4), y_2 = (x_3, x_7), y_3 = (x_5, x_6, x_8, x_9).
Denote S_l ⊆ S as the set that contains y_l.
Alternating Optimization
1. Choose initial x^(0) = (y_1^(0), ..., y_p^(0)).
2. For k = 0, 1, 2, ..., until convergence:
   2.1 For l = 1, ..., p:
       2.1.1 Find y_l^(k+1) such that
             y_l^(k+1) = arg min_{y_l ∈ S_l} f(y_1^(k+1), ..., y_{l−1}^(k+1), y_l, y_{l+1}^(k), ..., y_p^(k)).
That is, find y_l^(k+1) while fixing the other variables.
(Alternative) Find y_l^(k+1) that decreases f(x) sufficiently:
    f(y_1^(k+1), ..., y_{l−1}^(k+1), y_l^(k), y_{l+1}^(k), ..., y_p^(k))
  − f(y_1^(k+1), ..., y_{l−1}^(k+1), y_l^(k+1), y_{l+1}^(k), ..., y_p^(k)) > c > 0.
Alternating Optimization
Example:
    initially      (y_1^(0), y_2^(0), y_3^(0), ..., y_l^(0), ..., y_p^(0))
    k = 0, l = 1   (y_1^(1), y_2^(0), y_3^(0), ..., y_l^(0), ..., y_p^(0))
    k = 0, l = 2   (y_1^(1), y_2^(1), y_3^(0), ..., y_l^(0), ..., y_p^(0))
    ⋮
    k = 0, l = p   (y_1^(1), y_2^(1), y_3^(1), ..., y_l^(1), ..., y_p^(1))
    k = 1, l = 1   (y_1^(2), y_2^(1), y_3^(1), ..., y_l^(1), ..., y_p^(1))
Obviously, we want to partition the variables so that
  each sub-problem of finding y_l is relatively easy to solve, and
  solving y_l does not undo previous partial solutions.
Alternating Optimization
Example: k-means clustering.
k-means clustering tries to group data points x into k clusters.
[Figure: three clusters with prototypes c_1, c_2, c_3]
Let c_j denote the prototype of cluster C_j. Then, k-means clustering finds C_j and c_j, j = 1, ..., k, that minimize the within-cluster difference:
    D = Σ_{j=1}^k Σ_{x∈C_j} ‖x − c_j‖².   (39)
Alternating Optimization
Suppose the C_j are already known. Then, the c_j that minimizes D is given by
    ∂D/∂c_j = −2 Σ_{x∈C_j} (x − c_j) = 0.   (40)
That is, c_j is the centroid of C_j:
    c_j = (1/|C_j|) Σ_{x∈C_j} x.   (41)
How to find C_j? Use alternating optimization.
k-means clustering
1. Initialize c_j, j = 1, ..., k.
2. Repeat until convergence:
   2.1 For each x, assign it to the nearest cluster C_j (fix c_j, update C_j):
       ‖x − c_j‖ ≤ ‖x − c_l‖, for all l ≠ j.
   2.2 Update the cluster centroids (fix C_j, update c_j):
       c_j = (1/|C_j|) Σ_{x∈C_j} x.
Step 2.2 is (locally) optimal. Step 2.1 does not undo step 2.2. Why?
So, k-means clustering can converge to a local minimum.
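The two alternating steps fit in a short sketch. The synthetic two-blob data and initial centroids below are illustrative assumptions.

```python
import numpy as np

def kmeans(X, c0, iters=50):
    """Alternating optimization for k-means: step 2.1 fixes the centroids c
    and assigns each point to its nearest cluster; step 2.2 fixes the
    assignment and recomputes each centroid as its cluster mean."""
    c = np.asarray(c0, dtype=float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        labels = np.argmin(dist, axis=1)               # step 2.1
        for j in range(len(c)):
            if np.any(labels == j):                    # leave empty clusters unchanged
                c[j] = X[labels == j].mean(axis=0)     # step 2.2
    return c, labels

# Two well-separated synthetic blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
c, labels = kmeans(X, [[1.0, 1.0], [4.0, 4.0]])
print(np.round(c, 1))
```

Each step can only decrease D (or leave it unchanged), so the iteration converges, though only to a local minimum that depends on the initial centroids.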
Convergence Rates
Suppose the sequence {x_k} converges to x*. Then, the sequence converges to x* linearly if ∃ µ ∈ (0,1) such that
    lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = µ.   (42)
The sequence converges sublinearly if µ = 1, and superlinearly if µ = 0. If the sequence converges sublinearly and
    lim_{k→∞} ‖x_{k+2} − x_{k+1}‖ / ‖x_{k+1} − x_k‖ = 1,   (43)
then the sequence converges logarithmically.
Convergence Rates
We can distinguish between different rates of superlinear convergence. The sequence converges with order q, for q ≥ 1, if ∃ µ > 0 such that
    lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖^q = µ > 0.   (44)
If q = 1, the sequence converges q-linearly.
If q = 2, the sequence converges quadratically or q-quadratically.
If q = 3, the sequence converges cubically or q-cubically, etc.
Known Convergence Rates
  Gradient descent converges linearly, often slowly.
  Gauss-Newton converges quadratically, if it converges.
  Levenberg-Marquardt converges quadratically.
  Quasi-Newton converges superlinearly.
  Alternating optimization converges q-linearly if f is convex around a local minimizer x*.
Summary
Method                      For                        Use
gradient descent            optimization               ∇f
Gauss-Newton                nonlinear least squares    ∇E, estimate of H
Levenberg-Marquardt         nonlinear least squares    ∇E, estimate of H
quasi-Newton                optimization               ∇f, estimate of H
Newton                      optimization               ∇f, H
coordinate descent          optimization               f
alternating optimization    optimization               f
Summary
SciPy supports many optimization methods, including
  linear least squares,
  the Levenberg-Marquardt method,
  the quasi-Newton BFGS method,
  line search that satisfies the Wolfe conditions,
  etc.
For more details and other methods, refer to [1].
Probing Questions
  Can you apply an algorithm specialized for nonlinear least squares, e.g., Gauss-Newton, to generic optimization?
  Gauss-Newton may converge slowly or diverge if the matrix JᵀJ is ill-conditioned. How can you make the matrix well-conditioned?
  How can your alternating optimization program detect interference of a previous update by the current update?
  If your alternating optimization program detects frequent interference between updates, what can you do about it?
  So far, we have looked at minimizing a scalar function f(x). How to minimize a vector function f(x) or multiple scalar functions f_l(x) simultaneously?
  How to ensure that your optimization program will always converge regardless of the test condition?
Homework
1. Describe the essence of Gauss-Newton, Levenberg-Marquardt, quasi-Newton, coordinate descent and alternating optimization, each in one sentence.
2. With the SVD of J given by J = UΣVᵀ, show that (JᵀJ)⁻¹Jᵀ = VΣ⁻¹Uᵀ. Hint: For invertible matrices A and B, (AB)⁻¹ = B⁻¹A⁻¹.
3. Q3(c) of AY2015/16 Final Evaluation.
4. Q5 of AY2015/16 Final Evaluation.
References
1. J. Nocedal and S. J. Wright, Numerical Optimization, Springer-Verlag, 1999.
2. W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes: The Art of Scientific Computing, 3rd Ed., Cambridge University Press, 2007. www.nr.com
3. J. C. Bezdek and R. J. Hathaway, Some notes on alternating optimization, in Proc. Int. Conf. on Fuzzy Systems, 2002.