Optimization 2
CS5240 Theoretical Foundations in Multimedia
Leow Wee Kheng
Department of Computer Science, School of Computing
National University of Singapore

Last Time...

Gradient descent updates variables by

    x_{k+1} = x_k - α ∇f(x_k).    (1)

Gradient descent can be slow. To increase convergence rate, find good step length α using line search.

Line Search

Gradient descent minimizes f(x) by updating x_k along -∇f(x_k):

    x_{k+1} = x_k - α ∇f(x_k).    (2)

In general, other descent directions p_k are possible if ∇f(x_k)^T p_k < 0. Then, update x_k according to

    x_{k+1} = x_k + α p_k.    (3)

Ideally, want to find α > 0 that minimizes f(x_k + α p_k). Too expensive.
In practice, find α > 0 that reduces f(x) sufficiently (not too small).
The simple condition f(x_k + α p_k) < f(x_k) may not be sufficient.

Line Search

A popular sufficient line search condition is the Armijo condition:

    f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f(x_k)^T p_k,    (4)

where c_1 > 0, usually quite small, say, c_1 = 10^{-4}.
The Armijo condition may accept very small α. So, add another (curvature) condition to rule out steps that are too small:

    ∇f(x_k + α_k p_k)^T p_k ≥ c_2 ∇f(x_k)^T p_k,    (5)

with 0 < c_1 < c_2 < 1, typically c_2 = 0.9.
These two conditions together are called the Wolfe conditions.
Known result: Gradient descent will find a local minimum if α satisfies the Wolfe conditions.

Line Search

Another way is to bound α from below, giving the Goldstein conditions:

    f(x_k) + (1 - c) α ∇f(x_k)^T p_k ≤ f(x_k + α p_k) ≤ f(x_k) + c α ∇f(x_k)^T p_k.    (6)

Another simple way is to start with large α and reduce it iteratively:

Backtracking Line Search
1. Choose large α > 0, 0 < ρ < 1.
2. Repeat until Armijo condition is met:
   2.1 α ← ρ α.

In this case, the Armijo condition alone is enough.
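To make the procedure concrete, here is a minimal Python sketch of backtracking line search under the Armijo condition. The helper name, the starting value α = 1, and the constants ρ = 0.5 and c_1 = 10^{-4} are illustrative assumptions, not values prescribed by the slides.

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha until the Armijo condition (Eq. 4) holds."""
    fx = f(x)
    slope = grad_f(x) @ p            # directional derivative; must be < 0 for a descent direction
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= rho
    return alpha

# Example: one gradient-descent step on f(x) = x1^2 + 10 x2^2
f = lambda z: z[0]**2 + 10 * z[1]**2
grad_f = lambda z: np.array([2 * z[0], 20 * z[1]])
x = np.array([1.0, 1.0])
p = -grad_f(x)                        # steepest-descent direction
alpha = backtracking_line_search(f, grad_f, x, p)
x_next = x + alpha * p
```

The same helper can supply the step length for gradient descent or for the quasi-Newton direction introduced later.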

Gauss-Newton Method

[Portraits: Carl Friedrich Gauss, Sir Isaac Newton]

Gauss-Newton Method

Gauss-Newton method estimates the Hessian H by 1st-order derivatives.
Express E(a) as the sum of squares of r_i(a),

    E(a) = Σ_{i=1}^{n} r_i(a)^2,    (7)

where r_i is the error or residual of each data point x_i:

    r_i(a) = v_i - f(x_i; a),    (8)

and

    r = [r_1 ⋯ r_n]^T.    (9)

Then,

    ∂E/∂a_j = 2 Σ_{i=1}^{n} r_i ∂r_i/∂a_j,    (10)

    ∂^2 E/∂a_j ∂a_l = 2 Σ_{i=1}^{n} [ (∂r_i/∂a_j)(∂r_i/∂a_l) + r_i ∂^2 r_i/∂a_j ∂a_l ] ≈ 2 Σ_{i=1}^{n} (∂r_i/∂a_j)(∂r_i/∂a_l).    (11)

Gauss-Newton Method

The derivative ∂r_i/∂a_j is the (i,j)-th element of the Jacobian J_r of r:

    J_r = [ ∂r_1/∂a_1  ⋯  ∂r_1/∂a_m
               ⋮       ⋱      ⋮
            ∂r_n/∂a_1  ⋯  ∂r_n/∂a_m ].    (12)

The gradient and Hessian are related to the Jacobian by

    ∇E = 2 J_r^T r,    (13)
    H ≈ 2 J_r^T J_r.    (14)

So, the update equation becomes

    a_{k+1} = a_k - (J_r^T J_r)^{-1} J_r^T r.    (15)

Gauss-Newton Method

We can also use the Jacobian J_f of f(x_i):

    J_f = [ ∂f(x_1)/∂a_1  ⋯  ∂f(x_1)/∂a_m
                 ⋮        ⋱       ⋮
            ∂f(x_n)/∂a_1  ⋯  ∂f(x_n)/∂a_m ].    (16)

Then, J_f = -J_r, and the update equation becomes

    a_{k+1} = a_k + (J_f^T J_f)^{-1} J_f^T r.    (17)

Gauss-Newton Method

Compare to linear fitting:

    linear fitting    a = (D^T D)^{-1} D^T v
    Gauss-Newton      Δa = (J_f^T J_f)^{-1} J_f^T r

(J_f^T J_f)^{-1} J_f^T is analogous to the pseudo-inverse.

Compare to gradient descent:

    gradient descent  Δa = -α ∇E
    Gauss-Newton      Δa = -(1/2) (J_r^T J_r)^{-1} ∇E

Gauss-Newton updates a along the negative gradient at a variable rate.
Gauss-Newton method may converge slowly or diverge if the initial guess a(0) is far from the minimum, or the matrix J^T J is ill-conditioned.
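As an illustration of Eqs. 8, 16 and 17, here is a minimal Gauss-Newton sketch in Python for fitting an assumed model f(x; a) = a_0 exp(a_1 x) to data (x_i, v_i). The model, the synthetic data and the stopping tolerance are illustrative choices, not part of the slides.

```python
import numpy as np

def gauss_newton(x, v, a, n_iters=50, eps=1e-4):
    for _ in range(n_iters):
        fx = a[0] * np.exp(a[1] * x)               # model predictions f(x_i; a)
        r = v - fx                                  # residuals r_i = v_i - f(x_i; a)   (Eq. 8)
        # Jacobian J_f with (i, j) element  ∂f(x_i)/∂a_j   (Eq. 16)
        J = np.column_stack([np.exp(a[1] * x), a[0] * x * np.exp(a[1] * x)])
        da = np.linalg.solve(J.T @ J, J.T @ r)      # Eq. 17, solved without an explicit inverse
        a = a + da
        if np.linalg.norm(da) < eps * np.linalg.norm(a):
            break
    return a

x = np.linspace(0, 1, 20)
v = 2.0 * np.exp(1.5 * x) + 0.01 * np.random.randn(20)
a_hat = gauss_newton(x, v, a=np.array([1.0, 1.0]))
```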

Condition Number

The condition number of a (real) matrix D is given by the ratio of its singular values:

    κ(D) = σ_max(D) / σ_min(D).    (18)

If D is normal, i.e., D^T D = D D^T, then its condition number is given by the ratio of the magnitudes of its eigenvalues:

    κ(D) = |λ|_max(D) / |λ|_min(D).    (19)

A large condition number means the elements of D are less balanced.
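A quick check of Eq. 18 with NumPy; the nearly rank-deficient example matrix is an illustrative assumption.

```python
import numpy as np

# Condition number as the ratio of extreme singular values (Eq. 18).
D = np.array([[1.0, 1.0],
              [1.0, 1.0001]])        # nearly rank-deficient, hence ill-conditioned
s = np.linalg.svd(D, compute_uv=False)
kappa = s.max() / s.min()
print(kappa)                          # same value as np.linalg.cond(D)
```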

Condition Number

For the linear equation Da = v, the condition number indicates the change of the solution a vs. the change of v.

Small condition number:
  Small change in v gives small change in a: stable.
  Small error in v gives small error in a.
  D is well-conditioned.

Large condition number:
  Small change in v gives large change in a: unstable.
  Small error in v gives large error in a.
  D is ill-conditioned.

Practical Issues

The update equation, with J = J_f,

    a_{k+1} = a_k + (J^T J)^{-1} J^T r,    (20)

requires computation of a matrix inverse, which can be expensive.
Can use singular value decomposition (SVD) to help. The SVD of J is

    J = U Σ V^T.    (21)

Substituting it into the update equation yields (Homework)

    a_{k+1} = a_k + V Σ^{-1} U^T r.    (22)
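A minimal sketch of Eq. 22 in Python, assuming J and r have already been computed at the current iterate (as in the Gauss-Newton sketch above).

```python
import numpy as np

def gn_step_svd(J, r):
    """Gauss-Newton step V Σ^{-1} U^T r computed via the thin SVD of J (Eq. 22)."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    return Vt.T @ ((U.T @ r) / s)

# a_next = a + gn_step_svd(J, r)
```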

Practical Issues

Convergence criteria: same as gradient descent:

    |E_{k+1} - E_k| / E_k < ε, e.g., ε = 0.0001,
    |a_{j,k+1} - a_{j,k}| / |a_{j,k}| < ε, for each component j.

If divergence occurs, can introduce a step length α:

    a_{k+1} = a_k + α (J^T J)^{-1} J^T r,    (23)

where 0 < α < 1.

Is there an alternative way to control convergence? Yes: the Levenberg-Marquardt method.

Levenberg-Marquardt Method

Gauss-Newton method updates a according to

    (J^T J) Δa = J^T r.    (24)

Levenberg modifies the update equation to

    (J^T J + λI) Δa = J^T r.    (25)

λ is adjusted at each iteration:
  If the reduction of error E is large, can use small λ: more similar to Gauss-Newton.
  If the reduction of error E is small, can use large λ: more similar to gradient descent.

Levenberg-Marquardt Method

If λ is very large, (J^T J + λI) ≈ λI, i.e., just doing gradient descent.
Marquardt scales each component of the gradient:

    (J^T J + λ diag(J^T J)) Δa = J^T r,    (26)

where diag(J^T J) is a diagonal matrix of the diagonal elements of J^T J. The (i,i)-th component of diag(J^T J) is

    Σ_{l=1}^{n} ( ∂f(x_l)/∂a_i )^2.    (27)

If the (i,i)-th component of diag(J^T J) is small, a_i is updated more: faster convergence along the i-th direction. Then, λ doesn't have to be too large.
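A minimal Levenberg-Marquardt sketch following Eq. 25. Here residual_and_jacobian is an assumed user-supplied function returning r(a) and J_f(a), and the λ update factors 10 and 0.1 are common illustrative choices rather than values from the slides.

```python
import numpy as np

def levenberg_marquardt(residual_and_jacobian, a, lam=1e-3, n_iters=100):
    r, J = residual_and_jacobian(a)
    E = r @ r
    for _ in range(n_iters):
        da = np.linalg.solve(J.T @ J + lam * np.eye(len(a)), J.T @ r)   # Eq. 25
        r_new, J_new = residual_and_jacobian(a + da)
        E_new = r_new @ r_new
        if E_new < E:                  # error reduced: accept step, behave more like Gauss-Newton
            a, r, J, E = a + da, r_new, J_new, E_new
            lam *= 0.1
        else:                          # error increased: reject step, behave more like gradient descent
            lam *= 10
    return a
```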

Quasi-Newton

Quasi-Newton also minimizes f(x) by iteratively updating x:

    x_{k+1} = x_k + α p_k,    (28)

where the descent direction p_k is

    p_k = -H_k^{-1} ∇f(x_k).    (29)

There are many methods of estimating H^{-1}. The most effective is the method by Broyden, Fletcher, Goldfarb and Shanno (BFGS).

BFGS Method

Let Δx_k = x_{k+1} - x_k and y_k = ∇f(x_{k+1}) - ∇f(x_k). Then, estimate H_k or H_k^{-1} by

    H_{k+1} = H_k - (H_k Δx_k Δx_k^T H_k) / (Δx_k^T H_k Δx_k) + (y_k y_k^T) / (y_k^T Δx_k),    (30)

    H_{k+1}^{-1} = ( I - (Δx_k y_k^T) / (y_k^T Δx_k) ) H_k^{-1} ( I - (y_k Δx_k^T) / (y_k^T Δx_k) ) + (Δx_k Δx_k^T) / (y_k^T Δx_k).    (31)

(For the derivation, please refer to [1].)

BFGS Algorithm
1. Set α and initialize x_0 and H_0^{-1}.
2. For k = 0, 1, ... until convergence:
   2.1 Compute search direction p_k = -H_k^{-1} ∇f(x_k).
   2.2 Perform line search to find α that satisfies the Wolfe conditions.
   2.3 Compute Δx_k = α p_k, and set x_{k+1} = x_k + Δx_k.
   2.4 Compute y_k = ∇f(x_{k+1}) - ∇f(x_k).
   2.5 Compute H_{k+1}^{-1} by Eq. 31.

No standard way to initialize H. Can try H = I.
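A minimal sketch of the inverse-Hessian update in Step 2.5; the variable names dx and y stand for Δx_k and y_k and are illustrative.

```python
import numpy as np

def bfgs_inverse_update(H_inv, dx, y):
    """BFGS inverse-Hessian update (Eq. 31): dx = x_{k+1} - x_k, y = ∇f(x_{k+1}) - ∇f(x_k)."""
    rho = 1.0 / (y @ dx)              # requires y^T dx > 0, which the Wolfe conditions guarantee
    I = np.eye(len(dx))
    V = I - rho * np.outer(dx, y)
    return V @ H_inv @ V.T + rho * np.outer(dx, dx)
```

In practice, scipy.optimize.minimize with method='BFGS' packages this update together with a Wolfe line search.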

Approximation of Gradient

All previous methods require the gradient ∇f (or ∇E). In many complex problems, it is very difficult to derive ∇f. Several methods to overcome this problem:
  Use symbolic differentiation to derive the math expression of ∇f. Provided in Matlab, Maple, etc.
  Approximate the gradient numerically.
  Optimize without gradient.

Forward Difference

From the Taylor series expansion,

    f(x + Δx) = f(x) + ∇f(x)^T Δx + ⋯    (32)

Denote the j-th component of Δx as ε e_j, where e_j is the unit vector of the j-th dimension. Then, the j-th component of ∇f(x) is

    ∂f(x)/∂x_j ≈ ( f(x + ε e_j) - f(x) ) / ε.    (33)

Note that x + ε e_j updates only the j-th component:

    x + ε e_j = [ x_1 ⋯ x_{j-1} x_j x_{j+1} ⋯ x_m ]^T + [ 0 ⋯ 0 ε 0 ⋯ 0 ]^T.    (34)

Then,

    ∇f(x) ≈ [ ∂f(x)/∂x_1 ⋯ ∂f(x)/∂x_m ]^T.    (35)

Central Difference

Forward difference method has a forward bias: x + ε e_j. Error from the forward bias can accumulate quickly. To be less biased, use the central difference:

    f(x + ε e_j) ≈ f(x) + ε ∂f(x)/∂x_j,
    f(x - ε e_j) ≈ f(x) - ε ∂f(x)/∂x_j.

So,

    ∂f(x)/∂x_j ≈ ( f(x + ε e_j) - f(x - ε e_j) ) / (2ε).    (36)
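A minimal central-difference gradient sketch (Eq. 36); the step size ε = 10^{-6} and the test function are illustrative choices.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Approximate ∇f(x) component-wise by central differences (Eq. 36)."""
    g = np.zeros_like(x, dtype=float)
    for j in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

f = lambda z: z[0]**2 + 3 * z[0] * z[1]
print(numerical_gradient(f, np.array([1.0, 2.0])))   # close to [2*1 + 3*2, 3*1] = [8, 3]
```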

Optimization without Gradient

A simple idea is to descend along each coordinate direction iteratively. Let e_1, ..., e_m denote the unit vectors along coordinates 1, ..., m.

Coordinate Descent
1. Choose initial x_0.
2. Repeat until convergence:
   2.1 For each j = 1, ..., m:
       2.1.1 Find x_{k+1} along e_j that minimizes f(x_k).

Optimization without Gradient

Alternative for Step 2.1.1: Find x_{k+1} along e_j using line search to reduce f(x_k) sufficiently.

Strength: Simplicity.
Weakness: Can be slow. Can iterate around the minimum, never approaching it.

Powell's Method
Overcomes the problem of coordinate descent.
Basic Idea: Choose directions that do not interfere with or undo previous effort.
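A minimal coordinate descent sketch. The exact 1-D minimization along each coordinate uses scipy.optimize.minimize_scalar, and the quadratic test function and sweep count are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, n_sweeps=50):
    x = np.asarray(x0, dtype=float)
    m = len(x)
    for _ in range(n_sweeps):
        for j in range(m):
            e = np.zeros(m)
            e[j] = 1.0
            # exact 1-D minimization along coordinate direction e_j
            t = minimize_scalar(lambda t: f(x + t * e)).x
            x = x + t * e
    return x

f = lambda z: (z[0] - 1)**2 + 10 * (z[1] + 2)**2 + z[0] * z[1]
print(coordinate_descent(f, [0.0, 0.0]))
```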

Alternating Optimization

Let S ⊆ R^m denote the set of x that satisfy the constraints. Then, the constrained optimization problem can be written as:

    min_{x ∈ S} f(x).    (37)

Alternating optimization (AO) is similar to coordinate descent except that the variables x_j, j = 1, ..., m, are split into groups. Partition x ∈ S into p non-overlapping parts y_l such that

    x = (y_1, ..., y_l, ..., y_p).    (38)

Example: x = (x_1, x_2, x_4, x_3, x_7, x_5, x_6, x_8, x_9) with y_1 = (x_1, x_2, x_4), y_2 = (x_3, x_7), y_3 = (x_5, x_6, x_8, x_9).

Denote S_l ⊆ S as the set that contains y_l.

Alternating Optimization
1. Choose initial x^(0) = (y_1^(0), ..., y_p^(0)).
2. For k = 0, 1, 2, ..., until convergence:
   2.1 For l = 1, ..., p:
       2.1.1 Find y_l^(k+1) such that

             y_l^(k+1) = arg min_{y_l ∈ S_l} f(y_1^(k+1), ..., y_{l-1}^(k+1), y_l, y_{l+1}^(k), ..., y_p^(k)).

             That is, find y_l^(k+1) while fixing the other variables.

(Alternative) Find y_l^(k+1) that decreases f(x) sufficiently:

    f(y_1^(k+1), ..., y_{l-1}^(k+1), y_l^(k), y_{l+1}^(k), ..., y_p^(k)) - f(y_1^(k+1), ..., y_{l-1}^(k+1), y_l^(k+1), y_{l+1}^(k), ..., y_p^(k)) > c > 0.

Alternating Optimization

Example:

    initially       (y_1^(0), y_2^(0), y_3^(0), ..., y_l^(0), ..., y_p^(0))
    k = 0, l = 1    (y_1^(1), y_2^(0), y_3^(0), ..., y_l^(0), ..., y_p^(0))
    k = 0, l = 2    (y_1^(1), y_2^(1), y_3^(0), ..., y_l^(0), ..., y_p^(0))
    ...
    k = 0, l = p    (y_1^(1), y_2^(1), y_3^(1), ..., y_l^(1), ..., y_p^(1))
    k = 1, l = 1    (y_1^(2), y_2^(1), y_3^(1), ..., y_l^(1), ..., y_p^(1))
    ...

Obviously, we want to partition the variables so that
  Each sub-problem of finding y_l is relatively easy to solve.
  Solving y_l does not undo previous partial solutions.

Alternating Optimization

Example: k-means clustering
k-means clustering tries to group data points x into k clusters.

[Figure: three clusters with prototypes c_1, c_2, c_3]

Let c_j denote the prototype of cluster C_j. Then, k-means clustering finds C_j and c_j, j = 1, ..., k, that minimize the within-cluster difference:

    D = Σ_{j=1}^{k} Σ_{x ∈ C_j} ‖x - c_j‖^2.    (39)

Alternating Optimization

Suppose C_j are already known. Then, the c_j that minimizes D is given by

    ∂D/∂c_j = -2 Σ_{x ∈ C_j} (x - c_j) = 0.    (40)

That is, c_j is the centroid of C_j:

    c_j = (1/|C_j|) Σ_{x ∈ C_j} x.    (41)

How to find C_j? Use alternating optimization.

Alternating Optimization

k-means clustering
1. Initialize c_j, j = 1, ..., k.
2. Repeat until convergence:
   2.1 For each x, assign it to the nearest cluster C_j (fix c_j, update C_j):

           ‖x - c_j‖ ≤ ‖x - c_l‖, for l ≠ j.

   2.2 Update cluster centroids (fix C_j, update c_j):

           c_j = (1/|C_j|) Σ_{x ∈ C_j} x.

Step 2.2 is (locally) optimal. Step 2.1 does not undo Step 2.2. Why?
So, k-means clustering can converge to a local minimum.
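A minimal k-means sketch in Python that mirrors the two alternating steps above; the synthetic 2-D data, k = 2 and the random initialization are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=k, replace=False)]       # initialize centroids
    for _ in range(n_iters):
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        labels = d.argmin(axis=1)                           # step 2.1: fix c_j, update C_j
        c_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else c[j]
                          for j in range(k)])               # step 2.2: fix C_j, update c_j
        if np.allclose(c_new, c):
            break
        c = c_new
    return c, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```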

Convergence Rates

Suppose the sequence {x_k} converges to x*. Then, the sequence converges to x* linearly if there exists µ ∈ (0, 1) such that

    lim_{k→∞} ‖x_{k+1} - x*‖ / ‖x_k - x*‖ = µ.    (42)

The sequence converges sublinearly if µ = 1.
The sequence converges superlinearly if µ = 0.
If the sequence converges sublinearly and

    lim_{k→∞} ‖x_{k+2} - x_{k+1}‖ / ‖x_{k+1} - x_k‖ = 1,    (43)

then the sequence converges logarithmically.

Convergence Rates

We can distinguish between different rates of superlinear convergence. The sequence converges with order q, for q > 1, if there exists µ > 0 such that

    lim_{k→∞} ‖x_{k+1} - x*‖ / ‖x_k - x*‖^q = µ > 0.    (44)

If q = 1, the sequence converges q-linearly.
If q = 2, the sequence converges quadratically or q-quadratically.
If q = 3, the sequence converges cubically or q-cubically.
etc.
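As a quick numerical illustration of Eq. 44, the order q can be estimated from successive errors e_k = ‖x_k - x*‖ via q ≈ log(e_{k+1}/e_k) / log(e_k/e_{k-1}). The Newton iteration for √2 below is an illustrative example, not part of the slides.

```python
import numpy as np

x_star = np.sqrt(2.0)
x, errors = 3.0, []
for _ in range(4):
    x = 0.5 * (x + 2.0 / x)           # Newton's method for x^2 = 2
    errors.append(abs(x - x_star))
e = errors
q = np.log(e[2] / e[1]) / np.log(e[1] / e[0])
print(q)                               # close to 2: quadratic convergence
```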

Known Convergence Rates

  Gradient descent converges linearly, often slowly.
  Gauss-Newton converges quadratically, if it converges.
  Levenberg-Marquardt converges quadratically.
  Quasi-Newton converges superlinearly.
  Alternating optimization converges q-linearly if f is convex around a local minimizer x*.

Summary

    Method                     For                       Use
    gradient descent           optimization              ∇f
    Gauss-Newton               nonlinear least squares   ∇E, estimate of H
    Levenberg-Marquardt        nonlinear least squares   ∇E, estimate of H
    quasi-Newton               optimization              ∇f, estimate of H
    Newton                     optimization              ∇f, H
    coordinate descent         optimization              f
    alternating optimization   optimization              f

Summary

SciPy supports many optimization methods, including
  linear least squares
  Levenberg-Marquardt method
  quasi-Newton BFGS method
  line search that satisfies the Wolfe conditions
  etc.
For more details and other methods, refer to [1].
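For example, the following sketch calls two of these SciPy routines; the exponential model, the synthetic data and the Rosenbrock test function are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares, minimize

# Levenberg-Marquardt for nonlinear least squares: pass the residual vector r(a).
x = np.linspace(0, 1, 20)
v = 2.0 * np.exp(1.5 * x)
res = least_squares(lambda a: v - a[0] * np.exp(a[1] * x), x0=[1.0, 1.0], method='lm')

# Quasi-Newton BFGS for general minimization: pass f (and optionally its gradient).
rosen = lambda z: (1 - z[0])**2 + 100 * (z[1] - z[0]**2)**2
out = minimize(rosen, x0=[-1.0, 2.0], method='BFGS')
print(res.x, out.x)
```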

Probing Questions

  Can you apply an algorithm specialized for nonlinear least squares, e.g., Gauss-Newton, to generic optimization?
  Gauss-Newton may converge slowly or diverge if the matrix J^T J is ill-conditioned. How can you make the matrix well-conditioned?
  How can your alternating optimization program detect interference of a previous update by the current update?
  If your alternating optimization program detects frequent interference between updates, what can you do about it?
  So far, we have looked at minimizing a scalar function f(x). How do you minimize a vector function f(x) or multiple scalar functions f_l(x) simultaneously?
  How do you ensure that your optimization program will always converge regardless of the test condition?

Homework
1. Describe the essence of Gauss-Newton, Levenberg-Marquardt, quasi-Newton, coordinate descent and alternating optimization, each in one sentence.
2. With the SVD of J given by J = U Σ V^T, show that (J^T J)^{-1} J^T = V Σ^{-1} U^T. Hint: For invertible matrices A and B, (AB)^{-1} = B^{-1} A^{-1}.
3. Q3(c) of AY2015/16 Final Evaluation.
4. Q5 of AY2015/16 Final Evaluation.

References
1. J. Nocedal and S. J. Wright, Numerical Optimization, Springer-Verlag, 1999.
2. W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical Recipes: The Art of Scientific Computing, 3rd Ed., Cambridge University Press, 2007. www.nr.com
3. J. C. Bezdek and R. J. Hathaway, Some notes on alternating optimization, in Proc. Int. Conf. on Fuzzy Systems, 2002.