Multivariate Newton Minimization


Optimization of biosurfactant synthesis

Rhamnolipid. Rhamnolipids are naturally occurring glycolipids produced commercially by the Pseudomonas aeruginosa species of bacteria. Applications: they promote the uptake and biodegradation of poorly soluble substrates, they serve as immune modulators and virulence factors, they act as antimicrobials, they take part in surface motility, and they are used in biofilm development.

Rhamnolipid kinetics

A 2^2 factorial experimental design was used, with x_1 the glycerol concentration and x_2 the ratio of sugarcane bagasse to sunflower seeds.

Fitted model: y = 46.25 - 2.35 x_1 + 6.18 x_2 - 15.8 x_1^2 - 14.92 x_2^2 - 9.74 x_1 x_2

Surfaces (hypersurfaces) can have a much more complex topology.

Optimisation is the process of finding the maximum (minimum) value of a given function in a specific region (constraints). It comes in two flavours: unconstrained and constrained.

Finding a root of a nonlinear equation - the Newton-Raphson method. Draw the tangent to f at x_1; it crosses the x axis at x_2, so its slope satisfies
f'(x_1) = (f(x_1) - 0) / (x_1 - x_2),
which gives x_1 - x_2 = f(x_1)/f'(x_1) and therefore x_2 = x_1 - f(x_1)/f'(x_1).

Repeating the construction at x_2: f'(x_2) = (f(x_2) - 0)/(x_2 - x_3), so x_3 = x_2 - f(x_2)/f'(x_2).

General expression: f'(x_i) = (f(x_i) - 0)/(x_i - x_{i+1}), hence
x_{i+1} = x_i - f(x_i)/f'(x_i).

When do we stop? When f(x_i) = 0 or when the change is small, measured by the relative change
err = |(x_{i+1} - x_i)/x_{i+1}| * 100%.

Example: f(x) = x^2 - 1, so f'(x) = 2x. Stopping criterion: relative change below 30%. Start at x_1 = 4, where f(4) = 15 and the tangent slope is f'(4) = 2*4 = 8, and apply x_{i+1} = x_i - f(x_i)/f'(x_i).

i = 1: x_2 = x_1 - f(x_1)/f'(x_1) = 4 - (4^2 - 1)/(2*4) = 4 - 15/8 = 2.125. Relative change |x_2 - x_1|/|x_2| * 100% ≈ 88%, above 30%, so continue.

i = 2: f'(x_2) = 2*2.125 = 4.25, and x_3 = 2.125 - (2.125^2 - 1)/4.25 ≈ 1.30. Relative change ≈ 64%, still above 30%, so continue.

i = 3: f'(x_3) = 2*1.30 = 2.60, and x_4 = 1.30 - (1.30^2 - 1)/2.60 ≈ 1.03, where f(1.03) ≈ 0.07. Relative change ≈ 26% < 30%, so we finish: the iterates approach the root x = 1.
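A minimal sketch of the iteration above (assuming Python, no external libraries; the function, starting point and 30% relative-change test are taken from the example, the helper name newton_root is just illustrative):

```python
# Newton-Raphson root finding for f(x) = x^2 - 1, as in the worked example.
def newton_root(f, df, x, rel_tol_percent=30.0, max_iter=50):
    for _ in range(max_iter):
        x_new = x - f(x) / df(x)                        # tangent-line step
        rel_change = abs((x_new - x) / x_new) * 100.0   # relative change in %
        x = x_new
        if rel_change < rel_tol_percent:
            break
    return x

f  = lambda x: x**2 - 1.0
df = lambda x: 2.0 * x

print(newton_root(f, df, x=4.0))   # iterates 4 -> 2.125 -> ~1.30 -> ~1.03
```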

What about minima (maxima)? We have a procedure for finding a zero of a function f(x): x_{i+1} = x_i - f(x_i)/f'(x_i). A function f has a minimum (maximum) where f'(x) = 0, so we simply look for a zero of the function g(x) = f'(x), which gives x_{i+1} = x_i - g(x_i)/g'(x_i) = x_i - f'(x_i)/f''(x_i).

What about multidimensional problems? To explore the topology of a multidimensional surface we again use the Taylor expansion. The Taylor expansion describes the surroundings of a point (f(x + Δx)) using ONLY LOCAL information about the function: f(x), its value; f'(x), the rate of change of f at x; f''(x), its curvature at x; f'''(x), the rate of change of the curvature at x; and so on. What is important is that all the derivatives are computed only at the point x.

Accuracy of numerical derivatives - truncation error. Truncation error results from using only a finite part of the Taylor expansion:
f(x + Δx) = f(x) + (df/dx) Δx + (1/2)(d^2f/dx^2) Δx^2 + ...
so df/dx = [f(x + Δx) - f(x)]/Δx - (1/2)(d^2f/dx^2) Δx - ..., and the forward-difference approximation is df/dx ≈ [f(x + Δx) - f(x)]/Δx.

The truncation error therefore behaves as ε_T = (1/2)|d^2f/dx^2| Δx ~ Δx.

Accuracy of numerical derivatives - round-off error. Round-off error results from the finite representation of numbers (a limited number of significant figures). If each function value carries an error of order ε, then
df/dx ≈ [f(x + Δx) + ε - f(x) - ε]/Δx, so the difference quotient carries an additional error of up to 2ε/Δx.

The round-off error behaves as ε_R = 2ε/Δx ~ 1/Δx, and the total error is ε_total = 2ε/Δx + (1/2)|d^2f/dx^2| Δx.

Example of truncation and round-off errors: f(x) = x^3 + sqrt(x) at the point x = 3. The true derivative value at x = 3 is 27.2886751. With Δx = 0.01:
forward difference: df/dx ≈ [f(3.01) - f(3)]/0.01 ≈ 27.379, so ε_T ≈ 0.090;
central difference: df/dx ≈ [f(3.01) - f(2.99)]/0.02 = 27.28878, so ε_T ≈ 0.000105.

If the function values are additionally rounded to five significant figures, f(3.01) = 29.0058362 ≈ 29.006 and f(3) = 28.7320508 ≈ 28.732, the forward difference becomes (29.006 - 28.732)/0.01 = 27.4 and the total error is ε_total ≈ 0.1113.

Total error as a function of the step size (with the same rounded representation):
Δx = 1.0      ε_total = 9.98
Δx = 0.1      ε_total = 0.911
Δx = 0.01     ε_total = 0.1113
Δx = 0.001    ε_total = 0.2887
Δx = 0.0001   ε_total = 2.7113
Δx = 0.00001  ε_total = 27.728
The error first decreases with Δx (truncation dominates) and then grows again (round-off dominates).
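This behaviour can be reproduced with a short sketch (assuming Python; the function f(x) = x^3 + sqrt(x) and the point x = 3 come from the example above). Because the script works in full double precision rather than with the five-significant-figure values used in the table, the round-off growth only shows up at much smaller Δx:

```python
import math

# Forward-difference derivative of f(x) = x^3 + sqrt(x) at x = 3 for several
# step sizes: the error first shrinks with dx (truncation), then grows (round-off).
f = lambda x: x**3 + math.sqrt(x)
x0 = 3.0
exact = 3 * x0**2 + 0.5 / math.sqrt(x0)   # f'(x) = 3x^2 + 1/(2*sqrt(x))

for dx in [1.0, 0.1, 0.01, 1e-4, 1e-8, 1e-12]:
    approx = (f(x0 + dx) - f(x0)) / dx
    print(f"dx = {dx:8.0e}   error = {abs(approx - exact):.3e}")
```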

Taylor's expansion in 2D:
f(x_{i+1}, y_{i+1}) = f(x_i, y_i) + (∂f/∂x) Δx + (∂f/∂y) Δy,
which in matrix form reads f(x_{i+1}, y_{i+1}) = f(x_i, y_i) + [∂f/∂x, ∂f/∂y] [Δx, Δy]^T.

Taylor's expansion in 2D of two functions:
f_1(x_{i+1}, y_{i+1}) = f_1(x_i, y_i) + (∂f_1/∂x) Δx + (∂f_1/∂y) Δy
f_2(x_{i+1}, y_{i+1}) = f_2(x_i, y_i) + (∂f_2/∂x) Δx + (∂f_2/∂y) Δy
Stacking the two rows of partial derivatives into a matrix gives the compact vector form
f(x_{i+1}) = f(x_i) + J_i Δx,
where J_i is the Jacobian matrix evaluated at x_i.

Multivariate Taylor's expansion. For a multivariate vector function (e.g. the gravitational force F(r) = -G M m r / r^3):
f(x + Δx) = f(x) + J(x) Δx + (1/2) Δx^T H Δx + ...
For a multivariate scalar function (e.g. an energy or a cost):
f(x + Δx) = f(x) + ∇^T f(x) Δx + (1/2) Δx^T H Δx + ...
Here J(x) is the Jacobian matrix with entries ∂f_i/∂x_j (one row per component of f, one column per variable), ∇f(x) is the gradient vector with entries ∂f/∂x_i, and H(x) is the Hessian matrix with entries ∂^2 f/∂x_i ∂x_j.

Multivariate Taylor's expansion - example. Take f: R^2 -> R^1, e.g. f(x, y) = x^2 + y^2, with x = (x, y)^T.
Gradient: ∇f(x) = [∂f/∂x, ∂f/∂y]^T = [2x, 2y]^T.
Hessian: H(x) = [[∂^2f/∂x^2, ∂^2f/∂x∂y], [∂^2f/∂x∂y, ∂^2f/∂y^2]] = [[2, 0], [0, 2]].
The expansion
f(x + Δx) = f(x) + [∂f/∂x, ∂f/∂y] [Δx, Δy]^T + (1/2) [Δx, Δy] H [Δx, Δy]^T + ...
therefore becomes
f(x + Δx) = f(x) + 2xΔx + 2yΔy + Δx^2 + Δy^2,
and expanding around x = (0, 0) gives f(Δx) = f(0, 0) + Δx^2 + Δy^2 = Δx^2 + Δy^2: for this quadratic function the second-order expansion is exact.
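A quick numerical check of the statement above (a sketch assuming Python with NumPy): for f(x, y) = x^2 + y^2 the gradient/Hessian expansion reproduces f(x + Δx) exactly.

```python
import numpy as np

# f(x, y) = x^2 + y^2: the second-order Taylor expansion is exact.
def f(v):
    return v[0]**2 + v[1]**2

x = np.array([1.0, -2.0])                      # expansion point
dx = np.array([0.3, 0.7])                      # displacement

grad = np.array([2 * x[0], 2 * x[1]])          # [2x, 2y]
H = np.array([[2.0, 0.0], [0.0, 2.0]])         # constant Hessian

taylor = f(x) + grad @ dx + 0.5 * dx @ H @ dx
print(f(x + dx), taylor)                       # identical values
```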

Multivariate Taylor's expansion - the gradient part. In
f(x + Δx) = f(x) + ∇^T f(x) Δx + (1/2) Δx^T H Δx + ...
the gradient term is the scalar product
∇^T f(x) Δx = (∂f/∂x_1) Δx_1 + (∂f/∂x_2) Δx_2 + ... + (∂f/∂x_n) Δx_n = Σ_{i=1..n} (∂f/∂x_i) Δx_i,
i.e. the general pattern y^T x = Σ_{i=1..n} y_i x_i. Geometrically, f(x) + ∇^T f(x) Δx is the plane (a x_1 + b x_2 + ...) tangential to the function at the point x. The gradient gives information about the rate of change of f in each direction (x, y).

Gradient: ∇^T f(x) = [∂f/∂x_1, ..., ∂f/∂x_n]. (Contour-plot illustrations: at each point the gradient ∇f(x) points in the direction of steepest increase of f, and -∇f(x) in the direction of steepest decrease.)

The gradient is a vector perpendicular to the function isolines. Indeed, along an isoline f(x(t)) = c, so
0 = dc/dt = df(x(t))/dt = (∂f/∂x_1) dx_1/dt + (∂f/∂x_2) dx_2/dt + ... + (∂f/∂x_n) dx_n/dt = ∇^T f(x) dx/dt,
i.e. the gradient is orthogonal to the tangent vector dx/dt of the isoline.

Multivariate Taylor's expansion - the Hessian part. In
f(x + Δx) = f(x) + ∇^T f(x) Δx + (1/2) Δx^T H Δx + ...
the Hessian H(x) is the n x n matrix of second derivatives ∂^2 f/∂x_i ∂x_j. Writing H with entries H_ij and the displacement as x = (x_1, ..., x_n)^T, the product H x is the vector whose i-th component is
(H x)_i = H_i1 x_1 + H_i2 x_2 + ... + H_in x_n = Σ_{j=1..n} H_ij x_j,
so the quadratic form is
x^T H x = Σ_{i=1..n} x_i (H x)_i = Σ_{i=1..n} Σ_{j=1..n} H_ij x_i x_j.
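The double-sum form of the quadratic term can be checked directly (a sketch assuming Python with NumPy; the matrix and vector values are arbitrary illustrations, not from the slides):

```python
import numpy as np

# x^T H x computed as a matrix product and as the explicit double sum
# sum_i sum_j H_ij x_i x_j derived above.
H = np.array([[2.0, 1.0], [1.0, 3.0]])
x = np.array([0.5, -1.5])

matrix_form = x @ H @ x
double_sum = sum(H[i, j] * x[i] * x[j]
                 for i in range(len(x)) for j in range(len(x)))
print(matrix_form, double_sum)    # same number
```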

The Hessian contains all the information about the shape of the function around a minimum: at a stationary point ∇f(x) = 0, so
f(x + Δx) = f(x) + ∇^T f(x) Δx + (1/2) Δx^T H Δx + ... reduces to f(x + Δx) ≈ f(x) + (1/2) Δx^T H Δx.

Recall the quadratic function f(x) = a x^2: for a > 0 (f''(x) = 2a > 0) it has a minimum, for a = 0 it is flat, and for a < 0 (f''(x) = 2a < 0) it has a maximum; here f''(x) plays the role of a one-dimensional Hessian. In two dimensions, for f(x, y) = a x^2 + b y^2 the Hessian
H = [[∂^2f/∂x^2, ∂^2f/∂x∂y], [∂^2f/∂x∂y, ∂^2f/∂y^2]] = [[2a, 0], [0, 2b]]
defines the shape of the skeleton (quadratic) approximation of the resulting surface.

How to detect a minimum (unconstrained)? 1. Necessary condition for an unconstrained optimum at the point x*: ∇f(x*) = 0 and f(x) is differentiable at x*. 2. Sufficient condition for an unconstrained optimum at the point x*: ∇f(x*) = 0, f(x) is (twice) differentiable at x*, and ∇^2 f(x*) is positive definite.

Positive and negative definite matrices. If for every non-zero vector x a symmetric matrix H satisfies x^T H x > 0, then H is positive definite; if x^T H x < 0, then H is negative definite. This is a very cumbersome definition - one would need to check every vector x.

In practice, a symmetric matrix H is positive definite if all of its eigenvalues are positive, or equivalently if the determinants of all of its leading principal minors are positive. A symmetric matrix H is negative definite if all of its eigenvalues are negative, or equivalently if, after reversing the sign of every element, the determinants of all leading principal minors are positive.

Example: H = [[2, 4], [4, 2]]. The leading principal minors are det[2] = 2 > 0 and det H = 2*2 - 4*4 = 4 - 16 = -12 < 0, so H is not positive definite.
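Both tests are easy to automate; a sketch assuming Python with NumPy, applied to the example matrix above:

```python
import numpy as np

# Check positive definiteness of the example matrix two ways:
# all eigenvalues positive, and all leading principal minors positive.
H = np.array([[2.0, 4.0], [4.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)                       # [-2., 6.]
minors = [np.linalg.det(H[:k, :k]) for k in range(1, 3)]  # [2., -12.]

print(eigenvalues, minors)
print("positive definite?", np.all(eigenvalues > 0))      # False
```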

Nature of stationary points. Hessian H positive definite: quadratic form y^T H y > 0, all eigenvalues λ_i > 0 (writing H = M^T M, y^T H y = (My)^T (My) = |My|^2 > 0); local nature: minimum. Hessian H negative definite: y^T H y < 0, all λ_i < 0; local nature: maximum. Hessian H indefinite: y^T H y changes sign, eigenvalues of both signs; local nature: saddle point. Hessian H positive semi-definite: y^T H y ≥ 0, λ_i ≥ 0 with at least one zero eigenvalue, H singular; local nature: valley. Hessian H negative semi-definite: y^T H y ≤ 0, λ_i ≤ 0 with at least one zero eigenvalue, H singular; local nature: ridge.

Stationary point nature summary:
y^T H y > 0,   λ_i > 0     - positive definite       - minimum
y^T H y ≥ 0,   λ_i ≥ 0     - positive semi-definite  - valley
y^T H y <> 0,  λ_i mixed   - indefinite              - saddle point
y^T H y ≤ 0,   λ_i ≤ 0     - negative semi-definite  - ridge
y^T H y < 0,   λ_i < 0     - negative definite       - maximum

Newton method for optimization in 1D. Recall the Newton method for finding a root of a function f (solving f(x) = 0): x_{i+1} = x_i - f(x_i)/f'(x_i). A function f has a minimum (maximum) where f'(x) = 0, so let g(x) = f'(x) and apply the same iteration to g: x_{i+1} = x_i - g(x_i)/g'(x_i) = x_i - f'(x_i)/f''(x_i).

Newton method in 1D from a different perspective - Taylor expansion. The Newton method is in fact again an application of the Taylor expansion. It answers the question: how far should I jump to reach the minimum, namely a point where f'(x_{i+1}) = 0? That question can be answered by expanding the function at the initial point x_i:
f(x_i + Δx) = f(x_i) + f'(x_i) Δx + (1/2) f''(x_i) Δx^2 + ...
Now we want to move by Δx so that we reach the minimum, namely d f(x_i + Δx)/dΔx = 0. So
0 = d/dΔx [ f(x_i) + f'(x_i) Δx + (1/2) f''(x_i) Δx^2 + ... ] = f'(x_i) + f''(x_i) Δx + ...
We keep (for computational simplicity) only the first term containing Δx; recall from the previous equation that it comes from the quadratic term. Then 0 = f'(x_i) + f''(x_i) Δx, and because Δx = x_{i+1} - x_i,
0 = f'(x_i) + f''(x_i)(x_{i+1} - x_i), hence x_{i+1} = x_i - f'(x_i)/f''(x_i).

In the Newton method we travel along a parabola. Why? Because the parabola is the lowest-degree polynomial that has a minimum: at each step the function is replaced by the quadratic model f(x_i) + f'(x_i) Δx + (1/2) f''(x_i) Δx^2, and we jump to the minimum of that model, x_{i+1} = x_i + Δx.

For a second-degree polynomial it is a one-step method. For f(x) = x^2 - 1 we have f'(x) = 2x and f''(x) = 2, so
x_{i+1} = x_i - f'(x_i)/f''(x_i) = x_i - 2x_i/2 = x_i - x_i = 0:
whatever the starting point, the minimum x = 0 is reached in a single step.
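A sketch (assuming Python; the helper name newton_min_1d is illustrative) of the 1D Newton minimisation step x_{i+1} = x_i - f'(x_i)/f''(x_i); for the quadratic f(x) = x^2 - 1 it lands on the minimum x = 0 in one step from any starting point:

```python
# One-dimensional Newton minimisation using first and second derivatives.
def newton_min_1d(df, d2f, x, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# f(x) = x^2 - 1: f'(x) = 2x, f''(x) = 2  ->  x_1 = x_0 - 2*x_0/2 = 0, one step.
print(newton_min_1d(lambda x: 2 * x, lambda x: 2.0, x=4.0))
```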

Multivariate Newton method for optimization. Again: how far should we jump to reach the minimum, namely a point where ∇f(x_{i+1}) = 0? Expand around x_i:
f(x_{i+1}) = f(x_i) + ∇^T f(x_i)(x_{i+1} - x_i) + (1/2)(x_{i+1} - x_i)^T H (x_{i+1} - x_i).
Using the matrix-calculus rules ∇_x(b) = 0, ∇_x(b^T x) = b and ∇_x(x^T A x) = (A^T + A) x, the gradient of the right-hand side is
∇f(x_{i+1}) = ∇f(x_i) + (1/2)(H^T + H)(x_{i+1} - x_i) = ∇f(x_i) + H (x_{i+1} - x_i), since H^T = H.

We look for x where ∇f(x) = 0, so
0 = ∇f(x_{i+1}) = ∇f(x_i) + H (x_{i+1} - x_i),
hence H (x_{i+1} - x_i) = -∇f(x_i) and
x_{i+1} = x_i - H^{-1}(x_i) ∇f(x_i).
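A sketch of the multivariate update x_{i+1} = x_i - H^{-1} ∇f(x_i), assuming Python with NumPy; note that in practice the linear system H Δx = -∇f is solved instead of forming the inverse explicitly.

```python
import numpy as np

# Multivariate Newton iteration: solve H * dx = -grad instead of inverting H.
def newton_minimize(grad, hess, x, tol=1e-10, max_iter=50):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(hess(x), -g)
        x = x + dx
    return x

# Example: f(x, y) = x^2 + y^2, minimum at (0, 0) reached in one step.
grad = lambda v: np.array([2 * v[0], 2 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 2.0]])
print(newton_minimize(grad, hess, np.array([2.0, 2.0])))
```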

Multivariate optimisation - Newton method. Pros: converges fast (especially for quadratic functions); uses both the information about the slope (gradient) and the curvature (Hessian). Cons: high computational cost (Hessian and its inverse); the Hessian may be singular; computational errors.

Steepest descent. Simple idea: let us move in the direction of steepest descent, namely along -∇f, so the direction at the starting point x_0 is r_0 = -∇f(x_0). We move along a straight line (vector) to the next point: x_1 = x_0 - α ∇f(x_0) = x_0 + α r_0. The question is how far should we go, i.e. how large should α be? We go until we reach the minimum of f along the r_0 direction, i.e. the minimum of f(x_1(α)) as a function of α.

Directional derivative. We go until we reach the minimum along the x_1 direction, so we need the derivative with respect to α along that line:
df(x(α))/dα = (∂f/∂x_1) dx_1/dα + (∂f/∂x_2) dx_2/dα + ... + (∂f/∂x_n) dx_n/dα = Σ_{i=1..n} (∂f/∂x_i) dx_i/dα = ∇^T f dx/dα.

Steepest descent - choosing the step. With x_1 = x_0 + α r_0 and r_0 = -∇f(x_0),
df(x_1)/dα = ∇^T f(x_1) dx_1/dα = ∇^T f(x_1) d/dα (x_0 + α r_0) = ∇^T f(x_1) r_0.
Setting df(x_1)/dα = 0 gives ∇^T f(x_1) r_0 = 0, i.e. r_1^T r_0 = 0: we go until the new gradient (and hence the new direction r_1 = -∇f(x_1)) is perpendicular to the previous direction, and then we start again from x_1.

How to find α computationally? Use the condition r_1^T r_0 = 0 together with the quadratic expansion
f(x_1) = f(x_0) + ∇^T f(x_0)(x_1 - x_0) + (1/2)(x_1 - x_0)^T H (x_1 - x_0),
whose gradient is ∇f(x_1) = ∇f(x_0) + H (x_1 - x_0) = ∇f(x_0) + α H r_0. The orthogonality condition then reads
(∇f(x_0) + α H r_0)^T r_0 = 0, i.e. ∇^T f(x_0) r_0 + α r_0^T H r_0 = 0,
and since ∇f(x_0) = -r_0,
α = (r_0^T r_0) / (r_0^T H r_0).

Steepest descent routine, e.g. for f(x) = x^2 + y^2:
1. Choose an initial point, say x_0 = (2, 2)^T, and an accuracy ε, say 10^-6.
2. Compute the gradient at this point: ∇f(x_0) = (2x, 2y)^T = (4, 4)^T, so r_0 = -∇f(x_0) = (-4, -4)^T.
3. Compute the optimal α along r_0: the Hessian at x_0 is H = [[2, 0], [0, 2]]; r_0^T r_0 = 16 + 16 = 32; r_0^T H r_0 = 64; α = r_0^T r_0 / (r_0^T H r_0) = 32/64 = 1/2.
4. Compute the next point: x_1 = x_0 + α r_0 = (2, 2)^T + (1/2)(-4, -4)^T = (0, 0)^T.
5. Compute |∇f(x_1)| = 0. If |∇f(x_1)| ≤ ε, finish; else go back to step 2.
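The routine above written out as a sketch (assuming Python with NumPy) for a quadratic f with constant Hessian H, using the exact step α = r^T r / (r^T H r):

```python
import numpy as np

# Steepest descent with the exact line search alpha = r^T r / (r^T H r),
# valid for a quadratic f whose Hessian H is constant (here f = x^2 + y^2).
def steepest_descent(grad, H, x, eps=1e-6, max_iter=1000):
    for _ in range(max_iter):
        r = -grad(x)                    # descent direction
        if np.linalg.norm(r) < eps:
            break
        alpha = (r @ r) / (r @ H @ r)   # optimal step along r
        x = x + alpha * r
    return x

grad = lambda v: np.array([2 * v[0], 2 * v[1]])
H = np.array([[2.0, 0.0], [0.0, 2.0]])
print(steepest_descent(grad, H, np.array([2.0, 2.0])))   # [0. 0.] in one step
```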

Steepest descent method. Pros: always goes downhill; always converges; simple implementation. Cons: slow on eccentric (elongated) functions.

Steepest descent on eccentric functions - a convergence theorem. Define the error function at the current point x as E(x) = (1/2)(x - x*)^T H (x - x*). Then at every step k
E(x_{k+1}) ≤ ((A - a)/(A + a))^2 E(x_k),
where A is the largest and a the smallest eigenvalue of H.

Steepest descent - eccentric function example. For the function f(x) = x^2 + y^2, H = [[2, 0], [0, 2]], so A = a = 2 and E(x_{k+1}) ≤ 0 * E(x_k) = 0: the method converges in one step (a direct method). For the function f(x) = 50x^2 + y^2, H = [[100, 0], [0, 2]], so A = 100, a = 2 and E(x_{k+1}) ≤ (98/102)^2 E(x_k): slow convergence.
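The effect of eccentricity can be seen by counting iterations of the same exact-line-search routine on the two functions above (a sketch assuming Python with NumPy; the exact counts depend on the starting point and tolerance):

```python
import numpy as np

# Count steepest-descent iterations (exact line search) on a round and an
# eccentric quadratic f(x) = 0.5 * x^T H x, starting from the same point.
def iterations(H, x, eps=1e-6, max_iter=10000):
    for k in range(max_iter):
        r = -H @ x                      # -grad of 0.5 * x^T H x
        if np.linalg.norm(r) < eps:
            return k
        x = x + ((r @ r) / (r @ H @ r)) * r
    return max_iter

x0 = np.array([1.0, 1.0])
print(iterations(np.diag([2.0, 2.0]), x0))      # f = x^2 + y^2: one step
print(iterations(np.diag([100.0, 2.0]), x0))    # f = 50x^2 + y^2: many steps
```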

Solution? Combined methods. Recall that a function around a minimum is quadratic:
f(x + Δx) = f(x) + (df/dx) Δx + (1/2)(d^2f/dx^2) Δx^2 + ...
but if f has a minimum at x then df/dx = 0, so f(x + Δx) ≈ f(x) + a Δx^2. So around a minimum the Newton method should work really well. Combined methods (so-called quasi-Newton methods) start like steepest descent and transform into the Newton method once they reach the near-minimum region.

Improvements. Computing the Hessian is very costly, not to mention its inverse; widely used alternatives are BFGS (Broyden-Fletcher-Goldfarb-Shanno), conjugate gradients, and DFP (Davidon-Fletcher-Powell).
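In practice one rarely codes BFGS by hand; a sketch using SciPy's built-in implementation (assuming Python with NumPy and SciPy are available), which builds an approximation of the inverse Hessian from successive gradients instead of computing the Hessian explicitly:

```python
import numpy as np
from scipy.optimize import minimize

# BFGS approximates the inverse Hessian from gradient differences,
# avoiding the explicit Hessian required by Newton's method.
f    = lambda v: 50 * v[0]**2 + v[1]**2
grad = lambda v: np.array([100 * v[0], 2 * v[1]])

result = minimize(f, x0=np.array([1.0, 1.0]), jac=grad, method="BFGS")
print(result.x, result.nit)    # minimiser near (0, 0) and the iteration count
```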