Math 273a: Optimization - Newton's methods

Math 273a: Optimization - Newton's methods. Instructor: Wotao Yin, Department of Mathematics, UCLA, Fall 2015. Some material taken from Chong-Zak, 4th Ed.

Main features of Newton's method
- Uses both first derivatives (gradients) and second derivatives (the Hessian)
- Based on local quadratic approximations to the objective function
- Requires a positive definite Hessian to work
- Converges very quickly near the solution (under conditions)
- Requires a lot of work at each iteration: forming the Hessian, then inverting or factorizing the (approximate) Hessian

Basic idea

Given the current point x^(k):
- construct a quadratic function (known as the quadratic approximation) to the objective function that matches the value and both the first and second derivatives at x^(k)
- minimize the quadratic function instead of the original objective function
- set the minimizer as x^(k+1)

Note: a new quadratic approximation will be constructed at x^(k+1).

Special case: if the objective is quadratic, the approximation is exact and the method returns a solution in one step.

Quadratic approximation

Assumption: f ∈ C², i.e., f is twice continuously differentiable.

Apply Taylor's expansion, keep the first three terms, and drop terms of order 3 and higher:
  f(x) ≈ q(x) := f(x^(k)) + g^(k)T (x - x^(k)) + (1/2)(x - x^(k))^T F(x^(k))(x - x^(k)),
where
  g^(k) := ∇f(x^(k)) is the gradient at x^(k),
  F(x^(k)) := ∇²f(x^(k)) is the Hessian at x^(k).

Generating the next point

Minimize q(x) by applying the first-order necessary condition:
  0 = ∇q(x) = g^(k) + F(x^(k))(x - x^(k)).
If F(x^(k)) ≻ 0 (positive definite), then q achieves its unique minimizer at
  x^(k+1) := x^(k) - F(x^(k))^{-1} g^(k).
We have 0 = ∇q(x^(k+1)). This can be viewed as an iteration for solving g(x) = 0 using its Jacobian F(x).
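To make the update concrete, here is a minimal sketch of the pure Newton iteration in Python/NumPy (not from the slides; `grad` and `hess` are placeholder callables you supply, and no safeguards such as line search or Hessian modification are included):

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration x^(k+1) = x^(k) - F(x^(k))^{-1} g^(k).

    grad(x) returns the gradient g(x); hess(x) returns the Hessian F(x).
    This sketch assumes F(x^(k)) stays positive definite along the way.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # first-order stationarity reached
            break
        F = hess(x)
        d = np.linalg.solve(F, -g)       # Newton direction: solve F d = -g
        x = x + d                        # full step (step size 1)
    return x
```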

Example

f(x_1, x_2) = -log(1 - x_1 - x_2) - log(x_1) - log(x_2)

[Figure: level sets of f over the triangle {x_1 > 0, x_2 > 0, x_1 + x_2 < 1}, axes x_1 and x_2 from 0 to 1]

Example

Start Newton's method from [1/10; 6/10].

[Figure: level sets, the true minimizer, and the Newton iterates in the (x_1, x_2) plane]

Example

f(x_1, x_2) and its quadratic approximation q(x_1, x_2) at [1/10; 6/10] share the same value, gradient, and Hessian at [1/10; 6/10]. The new point minimizes q(x_1, x_2).

[Figure: level sets of the quadratic approximation q in the (y_1, y_2) plane]

Example

Start Newton's method from [1/10; 1/10].

[Figure: level sets, the true minimizer, and the Newton iterates in the (x_1, x_2) plane]

Example

Start Newton's method from [1/10; 1/100].

[Figure: level sets, the true minimizer, and the Newton iterates in the (x_1, x_2) plane]
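As an illustration (my own sketch, not part of the slides), the example above can be reproduced numerically: the gradient and Hessian of f(x_1, x_2) = -log(1 - x_1 - x_2) - log(x_1) - log(x_2) are coded by hand and pure Newton steps are taken from one of the starting points; the iterates should approach the true minimizer (1/3, 1/3).

```python
import numpy as np

def grad(x):
    x1, x2 = x
    s = 1.0 - x1 - x2                     # argument of the barrier term
    return np.array([1.0/s - 1.0/x1, 1.0/s - 1.0/x2])

def hess(x):
    x1, x2 = x
    s2 = (1.0 - x1 - x2)**2
    return np.array([[1.0/s2 + 1.0/x1**2, 1.0/s2],
                     [1.0/s2,             1.0/s2 + 1.0/x2**2]])

x = np.array([0.1, 0.6])                  # one of the starting points above
for k in range(10):
    x = x - np.linalg.solve(hess(x), grad(x))   # pure Newton step
    print(k + 1, x)
# The iterates should approach the minimizer (1/3, 1/3).
```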

Be aware of the drawbacks

Unlike a move along -∇f(x^(k)), where a sufficiently small step size guarantees objective decrease, Newton's method jumps to a potentially distant point. This makes it vulnerable:
- recall that in 1D, f'' < 0 can cause divergence;
- even if F(x^(k)) ≻ 0 (positive definite), objective descent is not guaranteed.
Nonetheless, Newton's method has superior performance when starting near the solution.

Analysis: quadratic function minimization

The objective function: f(x) = (1/2) x^T Q x - b^T x.
Assumption: Q is symmetric and invertible. Then g(x) = Qx - b and F(x) = Q.
First-order optimality condition: g(x*) = Qx* - b = 0, so x* = Q^{-1} b.
Given any initial point x^(0), Newton's method gives
  x^(1) = x^(0) - F(x^(0))^{-1} g^(0) = x^(0) - Q^{-1}(Qx^(0) - b) = Q^{-1} b = x*.
The solution is obtained in one step.
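A quick numerical check of this one-step property (a sketch with a randomly generated positive definite Q, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)              # symmetric positive definite Q
b = rng.standard_normal(5)

grad = lambda x: Q @ x - b               # g(x) = Qx - b
x0 = rng.standard_normal(5)              # arbitrary starting point
x1 = x0 - np.linalg.solve(Q, grad(x0))   # one Newton step, since F(x) = Q

x_star = np.linalg.solve(Q, b)           # exact minimizer Q^{-1} b
print(np.allclose(x1, x_star))           # True: solved in one step
```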

Analysis: how fast is Newton's method? (Assumption: Lipschitz continuous Hessian near x*.)

Let e^(k) := x^(k) - x*.

Theorem. Suppose f ∈ C², F is Lipschitz continuous near x*, and ∇f(x*) = 0. If x^(k) is sufficiently close to x* and F(x*) ≻ 0, then there exists C > 0 such that
  ‖e^(j+1)‖ ≤ C ‖e^(j)‖²,  j = k, k+1, ....

Sketch of proof. Since F is Lipschitz around x* and F(x*) ≻ 0, we have F(x) - cI ⪰ 0 for some c > 0 for all x in a small neighborhood of x*. Thus ‖F(x)^{-1}‖ ≤ c^{-1} for all x in the neighborhood. Write e^(j+1) = e^(j) + d^(j), where d^(j) = -F(x^(j))^{-1} g^(j). Taylor expansion near x^(j):
  0 = g(x*) = g(x^(j)) - F(x^(j)) e^(j) + O(‖e^(j)‖²).
Thus e^(j+1) = e^(j) + d^(j) = F(x^(j))^{-1} O(‖e^(j)‖²) = O(‖e^(j)‖²). Finally, argue that x^(j+1), x^(j+2), ... stay in the neighborhood.
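To see the quadratic rate numerically, here is a small 1D sketch (my own example, not from the slides): for f(x) = x - log(1 + x), with minimizer x* = 0 and f''(0) = 1 > 0, the Newton update simplifies to x_{k+1} = -x_k², so the error is exactly squared at every step, matching the theorem with C = 1.

```python
# f(x) = x - log(1 + x): f'(x) = x/(1+x), f''(x) = 1/(1+x)^2, minimizer x* = 0.
fprime = lambda x: x / (1.0 + x)
fsecond = lambda x: 1.0 / (1.0 + x)**2

x = 0.5                               # start reasonably close to x* = 0
for k in range(5):
    x = x - fprime(x) / fsecond(x)    # Newton step; algebraically equals -x**2
    print(k + 1, abs(x))              # errors: 0.25, 0.0625, 3.9e-3, 1.5e-5, 2.3e-10
```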

Asymptotic rates of convergence

Suppose the sequence {x_k} converges to x̄. Perform the ratio test
  lim_{k→∞} ‖x_{k+1} - x̄‖ / ‖x_k - x̄‖ = μ.
- if μ = 1, then {x_k} converges sublinearly;
- if μ ∈ (0, 1), then {x_k} converges linearly;
- if μ = 0, then {x_k} converges superlinearly.

To distinguish superlinear rates of convergence, we check
  lim_{k→∞} ‖x_{k+1} - x̄‖ / ‖x_k - x̄‖^q = μ > 0.
- if q = 2, it is quadratic convergence;
- if q = 3, it is cubic convergence;
- q can be non-integer, e.g., 1.618 for the secant method.

Example

a_k = 1/2^k,  b_k = 1/4^(k/2),  c_k = 1/2^(2^k),  d_k = 1/(k + 1)

[Figure: semilogy plots of these sequences (from Wikipedia)]

Another example. Let C = 1000.

  k    1/k      C e^{-k}   C e^{-k^1.618}   C e^{-k^2}
  1    1.0e0    3.7e2      3.7e2            3.7e2
  3    3.3e-1   5.0e1      2.7e0            1.2e-1
  5    2.0e-1   6.7e0      1.3e-3           1.4e-8
  7    1.4e-1   9.1e-1     7.6e-8           5.2e-19
  9    1.1e-1   1.2e-1     6.4e-13          6.6e-33

Comments:
- the constant C is not important in superlinear convergence; even with a big C, higher-order convergence quickly catches up
- the constant C is more important in lower-order convergence
- superlinear convergence is shockingly fast!

Analysis: descent direction

Theorem. If the Hessian F(x^(k)) ≻ 0 (positive definite) and g^(k) = ∇f(x^(k)) ≠ 0, then the search direction
  d^(k) = -F(x^(k))^{-1} g^(k)
is a descent direction; that is, there exists ᾱ > 0 such that
  f(x^(k) + α d^(k)) < f(x^(k)),  for all α ∈ (0, ᾱ).

Proof. Let φ(α) := f(x^(k) + α d^(k)). Then φ'(α) = ∇f(x^(k) + α d^(k))^T d^(k). Since F(x^(k)) ≻ 0 and g^(k) ≠ 0, we have F(x^(k))^{-1} ≻ 0 and
  φ'(0) = ∇f(x^(k))^T d^(k) = -g^(k)T F(x^(k))^{-1} g^(k) < 0.
Finally, apply the first-order Taylor expansion of φ(α) at α = 0 to get the result.
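A tiny numerical sanity check of this theorem (a sketch with a random positive definite Hessian and a random gradient, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
F = A @ A.T + np.eye(4)            # positive definite Hessian F(x^(k))
g = rng.standard_normal(4)         # nonzero gradient g^(k)

d = -np.linalg.solve(F, g)         # Newton direction d^(k) = -F^{-1} g
print(g @ d)                       # negative: phi'(0) = g^T d < 0, so d is a descent direction
```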

Two more issues with Newton's method
- Hessian evaluation: when the dimension n is large, obtaining F(x^(k)) can be computationally expensive. We will study quasi-Newton methods to alleviate this difficulty (in a future lecture).
- Indefinite Hessian: when the Hessian is not positive definite, the direction is not necessarily a descent direction. There are simple modifications.

Modified Newton's method

Strategy: use F(x^(k)) if F(x^(k)) ≻ 0 and λ_min(F(x^(k))) > ε; otherwise, use F̂(x^(k)) = F(x^(k)) + E so that F̂(x^(k)) ≻ 0 and λ_min(F̂(x^(k))) > ε.

Method 1 (Greenstadt): replace any tiny or negative eigenvalues by
  δ = max{ε_machine, ε_machine ‖H‖_∞},  where ‖H‖_∞ = max_{i=1,...,n} Σ_{j=1}^n |h_ij|.
This is computationally expensive.

Method 2 (Levenberg-Marquardt): was proposed for least squares but works here. Replace F̂ ← F + γI. This shifts every eigenvalue of F up by γ. (See the sketch below.)
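A minimal sketch of Method 2 (my own illustration; the helper name is hypothetical and not from the slides): grow γ until F + γI passes a Cholesky test, then compute the modified Newton direction. Note that this sketch only enforces positive definiteness, not the stricter λ_min > ε threshold in the strategy above.

```python
import numpy as np

def lm_modified_newton_step(F, g, gamma0=1e-3):
    """Return d solving (F + gamma*I) d = -g, with gamma grown until
    F + gamma*I is positive definite (checked via Cholesky)."""
    n = F.shape[0]
    gamma = 0.0
    while True:
        try:
            L = np.linalg.cholesky(F + gamma * np.eye(n))  # raises if not PD
            break
        except np.linalg.LinAlgError:
            gamma = gamma0 if gamma == 0.0 else 10.0 * gamma
    # Two triangular backsolves: L L^T d = -g.
    y = np.linalg.solve(L, -g)
    return np.linalg.solve(L.T, y)
```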

Modified Newton's method

Method 3 (advanced topic: modified Cholesky / Gill-Murray): any symmetric matrix A ≻ 0 can be factored as A = L̄L̄^T or A = LDL^T, where L and L̄ are lower triangular, D is positive diagonal, and L has ones on its main diagonal.

Properties of the Cholesky factorization:
- Very useful for solving linear systems of equations: it reduces a system to two backsolves.
- If A is not positive definite (indefinite but still symmetric), D has zero or negative element(s) on its diagonal.

Modified Newton's method
- The factorization is stable if A is positive definite: small errors in A will not cause large errors in L or D. Example:
    [1 a; a 1] = [1 0; a 1] [1 0; 0 1-a²] [1 a; 0 1].
- If A ← A + vv^T, the factorization can be updated in product form (avoiding a factorization from scratch, which is more expensive).
- If A is sparse, Cholesky with pivoting keeps L sparse, with only moderate fill-in.
- The cost is n³/6 + O(n²) operations, roughly half of Gaussian elimination.

Modified Newton's method

Forsgren, Gill, Murray: perform a pivoted Cholesky factorization. That is, permute the matrix at each step to pull the largest remaining diagonal element into the pivot position. The effect: postpone the modification and keep it as small as possible. When no acceptable element remains in
  [A_11 A_12; A_21 A_22] = [L_11 0; L_21 I] [D_1 0; 0 D_2] [L_11^T L_21^T; 0 I],
replace D_2 (not necessarily diagonal!) by a positive definite matrix and complete the factorization. There is no extra work if the Cholesky factorization is computed in outer-product form. The Cholesky factorization also tells whether the current point is a minimizer or a saddle point.

Modified Newton's method for saddle points

Suppose we are at a saddle point x̄. Then ḡ = ∇f(x̄) = 0, and the 2nd-order approximation is
  f(x̄ + d) ≈ q(d) := f(x̄) + ḡ^T d + (1/2) d^T F(x̄) d,  where ḡ^T d = 0.
How do we descend?

Greenstadt: pick d = Σ_{i: λ_i < 0} α_i u_i, where α_i > 0 and (λ_i, u_i) are eigen-pairs of F(x̄); d is a positive linear combination of the negative curvature directions. Then d^T F(x̄) d < 0 and q(d) < f(x̄). (See the sketch below.)

Cholesky method: recall that D_2 corresponds to the negative curvatures. If the entry d_ij of D_2 has the largest absolute value among all entries of D_2, pick
  L^T d = e_i - sign(d_ij) e_j.
Then d^T F(x̄) d = d^T L D L^T d < 0.
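Here is a short sketch of the Greenstadt escape direction (my own illustration, not from the slides; it uses an eigen-decomposition directly rather than the Cholesky-based variant):

```python
import numpy as np

def negative_curvature_direction(F_bar, alpha=1.0):
    """At a saddle point (gradient = 0), build d = sum_{i: lam_i < 0} alpha * u_i,
    a positive combination of the negative-curvature eigenvectors of the Hessian."""
    lam, U = np.linalg.eigh(F_bar)         # eigen-pairs of the symmetric F(x_bar)
    neg = lam < 0
    if not np.any(neg):
        return None                        # no negative curvature: not a saddle point
    d = (alpha * U[:, neg]).sum(axis=1)
    assert d @ F_bar @ d < 0               # d^T F d < 0, hence q(d) < f(x_bar)
    return d

# Example: F_bar = diag(2, -1) at a saddle point; d lies along +-e_2,
# the negative-curvature direction.
print(negative_curvature_direction(np.diag([2.0, -1.0])))
```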

Overview of the Gauss-Newton method
- A modification of Newton's method; solves nonlinear least squares; very popular
- Pros: second derivatives are no longer computed
- Cons: does not apply to general problems
- Can be improved by line search, Levenberg-Marquardt, etc.

Nonlinear least squares

Given functions r_i : R^n → R, i = 1, ..., m, the goal is to find x so that r_i(x) = 0 or r_i(x) ≈ 0 for all i. Consider the nonlinear least-squares problem
  minimize_x (1/2) Σ_{i=1}^m (r_i(x))².

Define r = [r_1, ..., r_m]^T. Then we have
  minimize_x f(x) = (1/2) r(x)^T r(x).
The gradient ∇f(x) has components
  (∇f(x))_j = ∂f/∂x_j (x) = Σ_{i=1}^m r_i(x) ∂r_i/∂x_j (x).
Define the Jacobian of r,
  J(x) = [∂r_1/∂x_1(x) ... ∂r_1/∂x_n(x); ... ; ∂r_m/∂x_1(x) ... ∂r_m/∂x_n(x)].
Then we have
  ∇f(x) = J(x)^T r(x).

The Hessian F(x) is a symmetric matrix. Its (k, j)-th component is
  ∂²f/∂x_k∂x_j (x) = ∂/∂x_k ( Σ_{i=1}^m r_i(x) ∂r_i/∂x_j(x) )
                   = Σ_{i=1}^m ( ∂r_i/∂x_k(x) ∂r_i/∂x_j(x) + r_i(x) ∂²r_i/∂x_k∂x_j(x) ).
Let S(x) be the matrix whose (k, j)-th component is
  Σ_{i=1}^m r_i(x) ∂²r_i/∂x_k∂x_j(x).
Then we have
  F(x) = J(x)^T J(x) + S(x).
Therefore, Newton's method has the iteration
  x^(k+1) = x^(k) - (J(x)^T J(x) + S(x))^{-1} J(x)^T r(x),
where (J(x)^T J(x) + S(x))^{-1} is F(x)^{-1} and J(x)^T r(x) is ∇f(x).

The Gauss-Newton method

When the matrix S(x) is ignored (in some applications, to save computation), we arrive at the Gauss-Newton method
  x^(k+1) = x^(k) - (J(x)^T J(x))^{-1} J(x)^T r(x),
where (J(x)^T J(x))^{-1} plays the role of (F(x) - S(x))^{-1} and J(x)^T r(x) is ∇f(x).
A potential problem is that J(x)^T J(x) may fail to be positive definite, and f(x^(k+1)) > f(x^(k)) is possible. Fixes: line search, Levenberg-Marquardt, and Cholesky/Gill-Murray.
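A generic Gauss-Newton loop, as a sketch (not from the slides; the residual function `r` and Jacobian `jac` are placeholders you provide, and no line search or Levenberg-Marquardt safeguard is included):

```python
import numpy as np

def gauss_newton(r, jac, x0, tol=1e-8, max_iter=100):
    """Gauss-Newton: x^(k+1) = x^(k) - (J^T J)^{-1} J^T r, ignoring S(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J = jac(x)                        # m-by-n Jacobian of r at x
        res = r(x)                        # m-vector of residuals
        # Solve the normal equations (J^T J) d = -J^T r via a least-squares
        # solve of J d = -res, which avoids forming J^T J explicitly and
        # gives the same step when J has full column rank.
        d, *_ = np.linalg.lstsq(J, -res, rcond=None)
        x = x + d
        if np.linalg.norm(d) < tol:
            break
    return x
```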

Example: nonlinear data fitting

Given a sinusoid y = A sin(ωt + φ), determine the parameters A, ω, and φ so that the sinusoid best fits the observed points (t_i, y_i), i = 1, ..., 21.

Let x := [A, ω, φ]^T and r_i(x) := y_i - A sin(ωt_i + φ). Problem:
  minimize Σ_{i=1}^{21} (y_i - A sin(ωt_i + φ))²,  where the i-th residual is r_i(x).
Derive J(x) ∈ R^{21×3} and apply the Gauss-Newton iteration
  x^(k+1) = x^(k) - (J(x)^T J(x))^{-1} J(x)^T r(x).
Results: A = 2.01, ω = 0.992, φ = 0.541.
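A sketch of this data-fitting example under assumed synthetic data (the observed points from the slides are not available, so here they are generated from A = 2, ω = 1, φ = 0.5 plus a little noise; the recovered parameters should come out close to these values, analogous to the slide's reported A = 2.01, ω = 0.992, φ = 0.541):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 6.0, 21)                  # 21 assumed sample times
y = 2.0 * np.sin(1.0 * t + 0.5) + 0.05 * rng.standard_normal(21)

def r(x):                                      # residuals r_i(x) = y_i - A sin(w t_i + phi)
    A, w, phi = x
    return y - A * np.sin(w * t + phi)

def jac(x):                                    # J(x) in R^{21 x 3}: rows are d r_i / d(A, w, phi)
    A, w, phi = x
    s, c = np.sin(w * t + phi), np.cos(w * t + phi)
    return np.column_stack([-s, -A * t * c, -A * c])

x = np.array([1.8, 0.95, 0.4])                 # initial guess near the truth
for _ in range(20):                            # Gauss-Newton iterations
    J, res = jac(x), r(x)
    d, *_ = np.linalg.lstsq(J, -res, rcond=None)
    x = x + d
print(x)                                       # should be approximately [2, 1, 0.5]
```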

Conclusions

Although Newton's method has several issues, such as:
- the direction can be ascending if F(x^(k)) is not positive definite,
- it may not ensure descent in general,
- it must start close to the solution,

it also has the following strong properties:
- one-step solution for a quadratic objective with an invertible Q,
- second-order convergence rate near the solution if F is Lipschitz,
- a number of modifications that address the issues.