Nonlinear Programming Models p. 1


Nonlinear Programming Models (Fabio Schoen)

Introduction: NLP problems, local and global optima

  min f(x),  x ∈ S ⊆ R^n

Standard form:

  min f(x)
  h_i(x) = 0,  i = 1,…,m
  g_j(x) ≤ 0,  j = 1,…,k

Here S = {x ∈ R^n : h_i(x) = 0 ∀i, g_j(x) ≤ 0 ∀j}.

A global minimum (or global optimum) is any x* ∈ S such that

  f(x*) ≤ f(x)  ∀x ∈ S.

A point x̄ ∈ S is a local optimum if there exists ε > 0 such that

  f(x̄) ≤ f(x)  ∀x ∈ S ∩ B(x̄, ε),

where B(x̄, ε) = {x ∈ R^n : ‖x − x̄‖ ≤ ε} is a ball in R^n. Any global optimum is also a local optimum, but the converse is generally false.

Convex Functions

A set S ⊆ R^n is convex if x, y ∈ S implies λx + (1−λ)y ∈ S for every λ ∈ [0,1]. Let Ω ⊆ R^n be a non-empty convex set. A function f : Ω → R is convex iff for all x, y ∈ Ω and λ ∈ [0,1]

  f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y).

Properties of convex functions

Every convex function is continuous in the interior of Ω; it may be discontinuous, but only on the boundary. If f is continuously differentiable, then f is convex iff for all x, y ∈ Ω

  f(y) ≥ f(x) + (y − x)^T ∇f(x).
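The defining inequality can be spot-checked numerically. The sketch below samples random pairs of points and a random λ and tests f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y); a single violation certifies non-convexity, while the absence of violations is only evidence (the function names and sample ranges are illustrative choices, not part of the slides):

```python
import random

def is_convex_on_samples(f, dim, trials=2000, seed=0, tol=1e-9):
    """Spot-check f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y) at random points."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        y = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
        lam = rng.random()
        z = [lam * a + (1.0 - lam) * b for a, b in zip(x, y)]
        if f(z) > lam * f(x) + (1.0 - lam) * f(y) + tol:
            return False   # found a certificate of non-convexity
    return True            # no violation found (evidence, not a proof)

# The squared Euclidean norm is convex; its negation is concave.
convex_ok = is_convex_on_samples(lambda v: sum(t * t for t in v), 2)
concave_ok = is_convex_on_samples(lambda v: -sum(t * t for t in v), 2)
```

For the concave example a violating triple is found almost immediately, since the inequality fails strictly at every midpoint of distinct samples.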

If f is twice continuously differentiable, then f is convex iff its Hessian matrix

  ∇²f(x) := [ ∂²f/∂x_i∂x_j ]

is positive semi-definite: ∇²f(x) ⪰ 0, i.e. v^T ∇²f(x) v ≥ 0 ∀v ∈ R^n or, equivalently, all eigenvalues of ∇²f(x) are non-negative.

Example: an affine function is convex (and concave). For a quadratic function (Q: symmetric matrix)

  f(x) = ½ x^T Q x + b^T x + c

we have ∇f(x) = Qx + b and ∇²f(x) = Q, so f is convex iff Q ⪰ 0.

Convex Optimization Problems

min f(x), x ∈ S is a convex optimization problem iff S is a convex set and f is convex on S. For a problem in standard form

  min f(x)
  h_i(x) = 0,  i = 1,…,m
  g_j(x) ≤ 0,  j = 1,…,k

if f is convex, the h_i(x) are affine functions and the g_j(x) are convex functions, then the problem is convex.

Maximization. With a slight abuse of notation, a problem max f(x), x ∈ S is called convex iff S is a convex set and f is a concave function (not to be confused with minimization of a concave function, or maximization of a convex function, which are NOT convex optimization problems).
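For a quadratic, the Hessian is the constant matrix Q, so the convexity test reduces to an eigenvalue computation. A minimal sketch (the example matrices are arbitrary):

```python
import numpy as np

def quadratic_is_convex(Q, tol=1e-10):
    """f(x) = 0.5 x^T Q x + b^T x + c is convex iff Q (symmetrized) is PSD."""
    Qs = 0.5 * (Q + Q.T)   # only the symmetric part contributes to x^T Q x
    return bool(np.linalg.eigvalsh(Qs).min() >= -tol)

psd = quadratic_is_convex(np.array([[2.0, 0.0], [0.0, 1.0]]))    # eigenvalues 2, 1
indef = quadratic_is_convex(np.array([[1.0, 0.0], [0.0, -1.0]])) # one negative eigenvalue
```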

Convex and non-convex optimization

Convex optimization is easy; non-convex optimization is usually very hard. Fundamental property of convex optimization problems: every local optimum is also a global optimum (a proof is given later). Minimizing a positive semidefinite quadratic function over a polyhedron is easy (polynomially solvable); if even a single eigenvalue of the Hessian is negative, the problem becomes NP-hard.

Convex functions: examples

Many (of course not all…) functions are convex:
- affine functions a^T x + b
- quadratic functions ½ x^T Q x + b^T x + c with Q = Q^T, Q ⪰ 0
- any norm is a convex function
- x log x (however, log x is concave)
- f is convex if and only if, for every x_0, d ∈ R^n, its restriction to any line φ(α) = f(x_0 + αd) is a convex function
- a linear non-negative combination of convex functions is convex
- g(x,y) convex in x for all y ⇒ ∫ g(x,y) dy is convex

More examples:
- max_i {a_i^T x + b_i} is convex
- f, g convex ⇒ max{f(x), g(x)} is convex
- f_a convex for every a ∈ A (a possibly uncountable set) ⇒ sup_{a∈A} f_a(x) is convex
- f convex ⇒ f(Ax + b) convex
- for any set S ⊆ R^n, f(x) = sup_{s∈S} ‖x − s‖ is convex
- Trace(A^T X) = Σ_{i,j} A_ij X_ij is convex (it is linear!)
- −log det X is convex over the set of matrices X ∈ R^{n×n}, X ≻ 0
- λ_max(X) (the largest eigenvalue of a matrix X) is convex

Data Approximation

Table of contents: norm approximation; maximum likelihood; robust estimation.

Problem: norm approximation

  min_x ‖Ax − b‖

where A, b are parameters. Usually the system is over-determined, i.e. b ∉ Range(A). For example, this happens when A ∈ R^{m×n} with m > n and A has full rank. r := Ax − b is the residual.

Examples

- ‖r‖ = √(r^T r): least squares (or "regression")
- ‖r‖ = √(r^T P r) with P ≻ 0: weighted least squares
- ‖r‖ = max_i |r_i|: minimax, or ℓ_∞, or Tchebichev approximation
- ‖r‖ = Σ_i |r_i|: absolute or ℓ_1 approximation

Possible (convex) additional constraints:
- maximum deviation from an initial estimate: ‖x − x_est‖ ≤ ε
- simple bounds: l_i ≤ x_i ≤ u_i
- ordering: x_1 ≤ x_2 ≤ … ≤ x_n

[Figure: histogram of the residuals of an ℓ_1-norm approximation for a sample matrix A.]
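For the ℓ_2 case the problem is ordinary least squares, and the optimality condition is that the residual is orthogonal to the range of A (i.e. A^T r = 0). A small sketch on synthetic data (the particular sizes and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))                  # over-determined: m = 20 > n = 3
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true + 0.1 * rng.normal(size=20)    # b is (almost surely) not in Range(A)

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizes ||A x - b||_2
r = A @ x_ls - b                              # residual at the optimum
grad_norm = np.linalg.norm(A.T @ r)           # normal equations: A^T r = 0
```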

[Figure: comparison of the residual histograms of the ℓ_1-norm and ℓ_2-norm approximations.]

Variants

  min_x Σ_i h(y_i − a_i^T x)

where h is a convex penalty function, e.g.:

- linear-quadratic: h(z) = z² if |z| ≤ 1, 2|z| − 1 if |z| > 1
- dead zone: h(z) = 0 if |z| ≤ 1, |z| − 1 if |z| > 1
- logarithmic barrier: h(z) = −log(1 − z²) if |z| < 1, +∞ otherwise

[Figure: comparison of the penalties |z| (ℓ_1), z² (ℓ_2), linear-quadratic, dead zone and logarithmic barrier.]

Maximum likelihood

Maximum likelihood estimate (MLE). Given a sample X_1, X_2, …, X_k and a parametric family of probability density functions L(·; θ), the maximum likelihood estimate of θ given the sample is

  θ̂ = arg max_θ L(X_1, …, X_k; θ)

Example: linear measurements with additive i.i.d. (independent, identically distributed) noise:

  X_i = a_i^T θ + ε_i    (1)

where the ε_i are i.i.d. random variables with density p(·):

  L(X_1, …, X_k; θ) = Π_{i=1}^k p(X_i − a_i^T θ)

Taking the logarithm (which does not change the optimum points):

  θ̂ = arg max_θ Σ_i log p(X_i − a_i^T θ)

If p is log-concave this problem is convex. Examples:

- ε ~ N(0, σ²), i.e. p(z) = (2πσ²)^{-1/2} exp(−z²/(2σ²)): the MLE is the ℓ_2 estimate, θ̂ = arg min_θ ‖Aθ − X‖_2
- p(z) = (1/(2a)) exp(−|z|/a) (Laplace): the ℓ_1 estimate, θ̂ = arg min_θ ‖Aθ − X‖_1
- p(z) = (1/a) exp(−z/a) 1{z ≥ 0} (negative exponential): the estimate can be found by solving the LP problem min 1^T(X − Aθ) s.t. Aθ ≤ X
- p uniform on [−a, a]: the MLE is any θ such that ‖Aθ − X‖_∞ ≤ a

Ellipsoids

An ellipsoid is a subset of R^n of the form

  E = {x ∈ R^n : (x − x_0)^T P^{-1} (x − x_0) ≤ 1}

where x_0 ∈ R^n is the center of the ellipsoid and P is a symmetric positive-definite matrix. Alternative representations:

  E = {x ∈ R^n : ‖Ax − b‖_2 ≤ 1}, where A ≻ 0, or
  E = {x ∈ R^n : x = x_0 + Au, ‖u‖_2 ≤ 1}

where A is square and non-singular (an affine transformation of the unit ball).

Robust Least Squares

Least squares: x̂ = arg min_x Σ_i (a_i^T x − b_i)². Hypothesis: the a_i are not known exactly, but it is known that

  a_i ∈ E_i = {ā_i + P_i u : ‖u‖ ≤ 1},  P_i = P_i^T ⪰ 0.

Definition: worst-case residuals

  Σ_i max_{a_i ∈ E_i} (a_i^T x − b_i)²

A robust estimate of x is the solution of

  x̂_r = arg min_x Σ_i max_{a_i ∈ E_i} (a_i^T x − b_i)²

It holds that, for ‖y‖ ≤ 1, |α + β^T y| ≤ |α| + ‖β‖; choosing y = β/‖β‖ if α ≥ 0 and y = −β/‖β‖ if α < 0, then ‖y‖ = 1 and |α + β^T y| = |α| + ‖β‖. Then

  max_{a_i ∈ E_i} |a_i^T x − b_i| = max_{‖u‖≤1} |ā_i^T x − b_i + u^T P_i x| = |ā_i^T x − b_i| + ‖P_i x‖

Thus the robust least squares problem reduces to

  min_x ( Σ_i ( |ā_i^T x − b_i| + ‖P_i x‖ )² )^{1/2}

(a convex optimization problem). Transformation:

  min_{x,t} ‖t‖_2
  |ā_i^T x − b_i| + ‖P_i x‖ ≤ t_i  ∀i

i.e.

  min_{x,t} ‖t‖_2
  ā_i^T x − b_i + ‖P_i x‖ ≤ t_i  ∀i
  −ā_i^T x + b_i + ‖P_i x‖ ≤ t_i  ∀i

(a second-order cone problem). A norm cone is the convex set C = {(x,t) ∈ R^{n+1} : ‖x‖ ≤ t}.

Geometrical Problems

- projections and distances
- polyhedral intersection
- extremal volume ellipsoids
- classification problems

Projection on a set

Given a set C, the projection of x on C is defined as

  P_C(x) = arg min_{z ∈ C} ‖x − z‖

Projection on a convex set

If C = {x : Ax = b, f_i(x) ≤ 0, i = 1,…,m} where the f_i are convex, then C is a convex set and the problem

  P_C(x) = arg min_z ‖x − z‖
  Az = b
  f_i(z) ≤ 0,  i = 1,…,m

is convex.
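For simple convex sets the projection has a closed form; a box is the easiest case, where projecting just clips each coordinate. A minimal sketch, with a sampling check that no feasible point is closer (the specific box and point are arbitrary):

```python
import random

def project_box(x, l, u):
    """Projection onto {z : l <= z <= u}: clip coordinate-wise."""
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, l, u)]

def d2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

x = [2.0, -3.0, 0.5]
p = project_box(x, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])

# No randomly sampled feasible point should be strictly closer to x than p.
rng = random.Random(1)
closer = all(d2(x, p) <= d2(x, [rng.random() for _ in range(3)]) + 1e-12
             for _ in range(1000))
```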

Distance between convex sets

  dist(C⁽¹⁾, C⁽²⁾) = min { ‖x − y‖ : x ∈ C⁽¹⁾, y ∈ C⁽²⁾ }

If C⁽ʲ⁾ = {x : A⁽ʲ⁾x = b⁽ʲ⁾, f_i⁽ʲ⁾(x) ≤ 0}, then the minimum distance can be found through a convex model:

  min ‖x⁽¹⁾ − x⁽²⁾‖
  A⁽¹⁾x⁽¹⁾ = b⁽¹⁾
  A⁽²⁾x⁽²⁾ = b⁽²⁾
  f_i⁽¹⁾(x⁽¹⁾) ≤ 0
  f_i⁽²⁾(x⁽²⁾) ≤ 0

Polyhedral intersection

Case 1: polyhedra described by means of linear inequalities: P_1 = {x : Ax ≤ b}, P_2 = {x : Cx ≤ d}.

P_1 ∩ P_2 ≠ ∅? It is a linear feasibility problem: Ax ≤ b, Cx ≤ d.

P_1 ⊆ P_2? Just check

  sup { c_k^T x : Ax ≤ b } ≤ d_k  ∀k

(solution of a finite number of LPs).
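The distance between two convex sets can also be computed by alternating projections (project onto one set, then the other, and repeat); for disjoint closed convex sets this converges to a pair of closest points. A sketch using two boxes, where each projection is a clip and the true gap is known (this iterative scheme is an illustration, not the slides' LP/QP formulation):

```python
import math

def clip(x, l, u):
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, l, u)]

def box_distance(l1, u1, l2, u2, iters=100):
    """Alternating projections x <- P1(P2(x)) between two boxes."""
    x = [(a + b) / 2.0 for a, b in zip(l1, u1)]   # start inside box 1
    y = x
    for _ in range(iters):
        y = clip(x, l2, u2)   # project current point onto box 2
        x = clip(y, l1, u1)   # project back onto box 1
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Unit square [0,1]^2 vs the square [3,4] x [0,1]: the gap along x is 2.
d = box_distance([0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [4.0, 1.0])
```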

Polyhedral intersection (2)

Case 2: polyhedra (polytopes) described through vertices: P_1 = conv{v_1,…,v_k}, P_2 = conv{w_1,…,w_h}.

P_1 ∩ P_2 ≠ ∅? Need to find λ_1,…,λ_k, μ_1,…,μ_h ≥ 0 such that

  Σ_i λ_i = Σ_j μ_j = 1,  Σ_i λ_i v_i = Σ_j μ_j w_j

P_1 ⊆ P_2? For every i = 1,…,k check whether there exist μ_j ≥ 0 with

  Σ_j μ_j = 1,  Σ_j μ_j w_j = v_i

Minimal ellipsoid containing k points

Given v_1,…,v_k ∈ R^n, find an ellipsoid E = {x : ‖Ax − b‖ ≤ 1} with minimal volume containing the k given points. With A = A^T ≻ 0, the volume of E is proportional to det A^{-1}, so this is a convex optimization problem in the unknowns A, b:

  min −log det A
  A = A^T ≻ 0
  ‖Av_i − b‖ ≤ 1,  i = 1,…,k

[Figure: minimal-volume ellipsoid enclosing a set of points.]

Maximum-volume ellipsoid contained in a polyhedron

Given P = {x : Ax ≤ b}, find an ellipsoid E = {By + d : ‖y‖ ≤ 1} contained in P with maximum volume.

Maximum-volume ellipsoid contained in a polyhedron (continued)

  E ⊆ P  ⇔  a_i^T (By + d) ≤ b_i  ∀y : ‖y‖ ≤ 1
         ⇔  sup_{‖y‖≤1} { a_i^T B y } + a_i^T d ≤ b_i  ∀i
         ⇔  ‖B a_i‖ + a_i^T d ≤ b_i  ∀i

  max_{B,d} log det B
  B = B^T ≻ 0
  ‖B a_i‖ + a_i^T d ≤ b_i,  i = 1,…

Difficult variants

These problems are hard:
- find a maximal volume ellipsoid contained in a polyhedron given by its vertices;
- find a minimal volume ellipsoid containing a polyhedron described as a system of linear inequalities.

It is already a difficult problem to decide whether a given ellipsoid E contains a polyhedron P = {x : Ax ≤ b}. This problem remains difficult even when the ellipsoid is a sphere: it is equivalent to norm maximization over a polyhedron, an NP-hard concave optimization problem.

[Figure: maximum-volume ellipsoid inscribed in a polyhedron given by its vertices.]

Linear classification (separation)

Given two point sets X_1,…,X_k and Y_1,…,Y_h, find a hyperplane a^T x = t such that

  a^T X_i ≥ t,  i = 1,…,k
  a^T Y_j ≤ t,  j = 1,…,h

(an LP feasibility problem).

Robust separation

Find a maximal separation:

  max_{a : ‖a‖ ≤ 1} ( min_i a^T X_i − max_j a^T Y_j )

equivalent to the convex problem:

  max t_1 − t_2
  a^T X_i ≥ t_1,  i = 1,…,k
  a^T Y_j ≤ t_2,  j = 1,…,h
  ‖a‖ ≤ 1
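For linearly separable data a separating hyperplane can also be found with a simple perceptron iteration; this stands in for the LP feasibility formulation above (it is a different algorithm, shown here only as a self-contained sketch, and the two toy point sets are hypothetical):

```python
def perceptron(pos, neg, epochs=100):
    """Find (w, b) with w.z + b > 0 on pos and < 0 on neg (separable 2-D data)."""
    w, b = [0.0, 0.0], 0.0
    data = [(p, 1.0) for p in pos] + [(q, -1.0) for q in neg]
    for _ in range(epochs):
        errors = 0
        for (z1, z2), s in data:
            if s * (w[0] * z1 + w[1] * z2 + b) <= 0.0:   # misclassified
                w[0] += s * z1; w[1] += s * z2; b += s   # perceptron update
                errors += 1
        if errors == 0:
            break
    return w, b

X = [(2.0, 2.0), (3.0, 1.0), (2.5, 3.0)]      # hypothetical class 1
Y = [(-1.0, 0.0), (0.0, -2.0), (-2.0, -1.0)]  # hypothetical class 2
w, b = perceptron(X, Y)
```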

Optimality Conditions (Fabio Schoen)

Optimality Conditions: descent directions

Let S ⊆ R^n be a convex set and consider the problem min_{x∈S} f(x), where f : S → R. Let x_1, x_2 ∈ S and d = x_2 − x_1: d is a feasible direction. If there exists ε̄ > 0 such that

  f(x_1 + εd) < f(x_1)  ∀ε ∈ (0, ε̄),

d is called a descent direction at x_1. Elementary necessary optimality condition: if x* is a local optimum, no descent direction may exist at x*.

Optimality Conditions for Convex Sets

If x* ∈ S is a local optimum for f(·) and there exists a neighborhood U(x*) such that f ∈ C¹(U(x*)), then

  d^T ∇f(x*) ≥ 0  for every feasible direction d.

Proof. Taylor expansion:

  f(x* + εd) = f(x*) + ε d^T ∇f(x*) + o(ε)

d cannot be a descent direction, so, if ε is sufficiently small, f(x* + εd) ≥ f(x*). Thus

  ε d^T ∇f(x*) + o(ε) ≥ 0

and, dividing by ε,

  d^T ∇f(x*) + o(ε)/ε ≥ 0.

Letting ε → 0 the proof is complete.

Optimality Conditions: tangent cone

General case:

  min f(x)
  g_i(x) ≤ 0,  i = 1,…,m
  x ∈ X  (X: an open set)

Let S = {x ∈ X : g_i(x) ≤ 0, i = 1,…,m}. The tangent cone to S at x̄ is

  T(x̄) = { d ∈ R^n : d = lim_{x_k → x̄} (x_k − x̄)/‖x_k − x̄‖, x_k ∈ S }

Some examples

- S = R^n ⇒ T(x) = R^n
- S = {x : Ax = b} ⇒ T(x) = {d : Ad = 0}
- S = {x : Ax ≤ b}; let I be the set of active constraints at x̄: a_i^T x̄ = b_i for i ∈ I, a_i^T x̄ < b_i for i ∉ I.

Let d = lim_k (x_k − x̄)/‖x_k − x̄‖. For i ∈ I:

  a_i^T d = a_i^T lim_k (x_k − x̄)/‖x_k − x̄‖
          = lim_k a_i^T (x_k − x̄)/‖x_k − x̄‖
          = lim_k (a_i^T x_k − b_i)/‖x_k − x̄‖
          ≤ 0

Thus if d ∈ T(x̄), then a_i^T d ≤ 0 for i ∈ I.

Vice versa, let x_k = x̄ + α_k d. If a_i^T d ≤ 0 for i ∈ I, then

  a_i^T x_k = a_i^T (x̄ + α_k d) = b_i + α_k a_i^T d ≤ b_i,  i ∈ I

and, for i ∉ I,

  a_i^T x_k = a_i^T x̄ + α_k a_i^T d < b_i  if α_k is small enough.

Thus T(x̄) = {d : a_i^T d ≤ 0 ∀i ∈ I}.

Example

Let S = {(x,y) ∈ R² : x² − y = 0} (a parabola). Tangent cone at (0,0)? Let {(x_k, y_k)} → (0,0), i.e. x_k → 0, y_k = x_k². Then

  ‖(x_k, y_k) − (0,0)‖ = √(x_k² + x_k⁴) = |x_k| √(1 + x_k²)

and

  lim_{x_k→0⁺} x_k/(|x_k|√(1+x_k²)) = 1,  lim_{x_k→0⁻} x_k/(|x_k|√(1+x_k²)) = −1,
  lim_{x_k→0} y_k/(|x_k|√(1+x_k²)) = 0,

thus T(0,0) = {(1,0), (−1,0)} (the two horizontal unit directions).

Descent direction

d ∈ R^n is a feasible direction at x̄ ∈ S if there exists ᾱ > 0 such that x̄ + αd ∈ S ∀α ∈ [0, ᾱ). d feasible ⇒ d ∈ T(x̄), but in general the converse is false. If f(x̄ + αd) ≤ f(x̄) ∀α ∈ (0, ᾱ), d is a descent direction.

First order necessary optimality condition

Let x̄ ∈ S ⊆ R^n be a local optimum for min_{x∈S} f(x), and let f ∈ C¹(U(x̄)). Then

  d^T ∇f(x̄) ≥ 0  ∀d ∈ T(x̄)

Proof. Let d = lim_k (x_k − x̄)/‖x_k − x̄‖. Taylor expansion:

  f(x_k) = f(x̄) + ∇^T f(x̄)(x_k − x̄) + o(‖x_k − x̄‖)
         = f(x̄) + ∇^T f(x̄)(x_k − x̄) + ‖x_k − x̄‖ o(1).

x̄ local optimum ⇒ there exists U(x̄) such that f(x) ≥ f(x̄) ∀x ∈ U ∩ S.

If k is large enough, x_k ∈ U(x̄): f(x_k) − f(x̄) ≥ 0, thus

  ∇^T f(x̄)(x_k − x̄) + ‖x_k − x̄‖ o(1) ≥ 0.

Dividing by ‖x_k − x̄‖:

  ∇^T f(x̄)(x_k − x̄)/‖x_k − x̄‖ + o(1) ≥ 0

and in the limit ∇^T f(x̄) d ≥ 0.

Examples: unconstrained problems

Every d ∈ R^n belongs to the tangent cone at a local optimum:

  ∇^T f(x̄) d ≥ 0  ∀d ∈ R^n.

Choosing d = e_i and d = −e_i we get ∇f(x̄) = 0. NB: the same is true if x̄ is a local minimum in the relative interior of the feasible region.

Linear equality constraints

  min f(x)
  Ax = b

Tangent cone: {d : Ad = 0}. Necessary conditions: ∇^T f(x̄) d ≥ 0 ∀d : Ad = 0. Equivalent statement:

  min_d ∇^T f(x̄) d = 0
  Ad = 0

(a linear program). From LP duality:

  max 0^T λ = 0
  A^T λ = −∇f(x̄)

Thus at a local minimum point there exist Lagrange multipliers λ : A^T λ = −∇f(x̄).

Linear inequalities

  min f(x)
  Ax ≤ b

Tangent cone at a local minimum x̄: {d ∈ R^n : a_i^T d ≤ 0, i ∈ I(x̄)}. Let A_I be the rows of A associated to the active constraints at x̄. Then

  min_d ∇^T f(x̄) d = 0
  A_I d ≤ 0

From LP duality:

  max 0^T λ = 0
  A_I^T λ = −∇f(x̄)
  λ ≥ 0

Thus, at a local optimum, the gradient is a non-positive linear combination of the coefficient rows of the active constraints.

Farkas Lemma

Let A be a matrix in R^{m×n} and b ∈ R^m. One and only one of the following two sets is non-empty:

  {x : Ax = b, x ≥ 0}   or   {y : A^T y ≤ 0, b^T y > 0}

Geometrical interpretation: either b belongs to the cone {z : ∃x ≥ 0, z = Ax} generated by the columns of A, or there exists a vector y making an obtuse angle with all columns of A ({y : A^T y ≤ 0}) and an acute angle with b.

Proof. 1) If there exists x ≥ 0 with Ax = b, then b^T y = x^T A^T y; thus A^T y ≤ 0 implies b^T y ≤ 0, so the second set is empty.

2) Premise (separating hyperplane theorem): let C and D be two convex non-empty sets with C ∩ D = ∅. Then there exist a ≠ 0 and β:

  a^T x ≤ β  ∀x ∈ C,   a^T x ≥ β  ∀x ∈ D.

If C is a point and D is a closed convex set, the separation is strict, i.e. a^T C < β and a^T x > β ∀x ∈ D.

Farkas Lemma (proof of part 2): let {x : Ax = b, x ≥ 0} = ∅, and let S = {y ∈ R^m : ∃x ≥ 0, Ax = y}. S is closed and convex, and b ∉ S. From the separating hyperplane theorem there exist α ∈ R^m, α ≠ 0, and β ∈ R:

  α^T y ≤ β  ∀y ∈ S,   α^T b > β.

0 ∈ S ⇒ β ≥ 0 ⇒ α^T b > 0; α^T Ax ≤ β for all x ≥ 0, which is possible iff α^T A ≤ 0. Letting y = α we obtain a solution of A^T y ≤ 0, b^T y > 0.

First order feasible variations cone

  G(x̄) = {d ∈ R^n : ∇^T g_i(x̄) d ≤ 0, i ∈ I}

It holds that G(x̄) ⊇ T(x̄). In fact, let {x_k} ⊆ S be feasible with x_k → x̄ and

  d = lim_k (x_k − x̄)/‖x_k − x̄‖ ∈ T(x̄).

Let α_k = ‖x_k − x̄‖, with α_k ↓ 0, so that x_k = x̄ + α_k d + o(α_k). For every active constraint i ∈ I, g_i(x̄) = 0 and

  g_i(x̄ + α_k d) = g_i(x̄) + α_k ∇^T g_i(x̄) d + o(α_k) = α_k ∇^T g_i(x̄) d + o(α_k) ≤ 0

so that

  g_i(x̄ + α_k d)/α_k = ∇^T g_i(x̄) d + o(α_k)/α_k ≤ 0.

Letting α_k ↓ 0 the result ∇^T g_i(x̄) d ≤ 0 is obtained.

Example: the inclusion G(x̄) ⊇ T(x̄) can be strict, e.g. with the constraints x³ + y ≤ 0, −y ≤ 0 at the origin.

KKT necessary conditions (Karush-Kuhn-Tucker)

Let x̄ ∈ X ⊆ R^n be a local optimum for

  min f(x)
  g_i(x) ≤ 0,  i = 1,…,m
  x ∈ X

and let I be the set of indices of the active constraints at x̄. If:

1. f(x), g_i(x) ∈ C¹ in a neighborhood of x̄ for i ∈ I;
2. the constraint qualification condition T(x̄) = G(x̄) holds at x̄;

then there exist Lagrange multipliers λ_i ≥ 0, i ∈ I:

  ∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) = 0.

Proof. x̄ local optimum ⇒ d ∈ T(x̄) implies d^T ∇f(x̄) ≥ 0. But d ∈ T(x̄) ⇔ d^T ∇g_i(x̄) ≤ 0 ∀i ∈ I. Thus it is impossible that

  ∇^T f(x̄) d < 0,  ∇^T g_i(x̄) d ≤ 0  ∀i ∈ I.

From Farkas' Lemma there exists a solution of

  Σ_{i∈I} λ_i ∇^T g_i(x̄) = −∇^T f(x̄),  λ_i ≥ 0, i ∈ I.

Constraint qualifications: examples

- polyhedra: X = R^n and the g_i(x) are affine functions: Ax ≤ b
- linear independence: X open set, g_i(x), i ∉ I continuous at x̄, and {∇g_i(x̄)}, i ∈ I, linearly independent
- Slater condition: X open set, g_i(x), i ∈ I, convex differentiable functions at x̄, g_i(x), i ∉ I, continuous at x̄, and there exists x̂ ∈ X strictly feasible: g_i(x̂) < 0 ∀i ∈ I.
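A tiny worked instance makes the multiplier concrete. For min (x_1−1)² + (x_2−1)² subject to x_1 + x_2 ≤ 1, the unconstrained minimizer (1,1) is infeasible, the optimum is its projection x* = (0.5, 0.5) onto the halfspace, the constraint is active, and λ = 1 makes the stationarity equation hold (the problem instance is an illustrative choice):

```python
import numpy as np

# min (x1-1)^2 + (x2-1)^2   s.t.   g(x) = x1 + x2 - 1 <= 0
x_star = np.array([0.5, 0.5])        # projection of (1,1) onto the halfspace
grad_f = 2.0 * (x_star - 1.0)        # gradient of f at x*: (-1, -1)
grad_g = np.array([1.0, 1.0])        # gradient of g
lam = 1.0                            # KKT multiplier
stationarity = grad_f + lam * grad_g # should be the zero vector
g_val = x_star.sum() - 1.0           # constraint is active: g(x*) = 0
```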

Convex problems

An optimization problem min_{x∈S} f(x) is a convex problem if S is a convex set, i.e.

  x, y ∈ S ⇒ λx + (1−λ)y ∈ S  ∀λ ∈ [0,1],

and f is a convex function on S, i.e.

  f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y)  ∀λ ∈ [0,1], x, y ∈ S.

Standard convex problem

  min f(x)
  g_i(x) ≤ 0,  i = 1,…,m
  h_j(x) = 0,  j = 1,…,k

If f is convex, the g_i are convex and the h_j are affine (i.e. of the form α^T x + β), then the problem is convex.

Convex problems: every local optimum is a global one

Proof: let x̄ be a local optimum for min_S f(x) and x* a global optimum. S convex ⇒ λx* + (1−λ)x̄ ∈ S. Thus, if λ is close enough to 0 (so that λx* + (1−λ)x̄ lies in the neighborhood where x̄ is minimal),

  f(x̄) ≤ f(λx* + (1−λ)x̄) ≤ λf(x*) + (1−λ)f(x̄)

from which f(x̄) ≤ f(x*), and x̄ is also a global optimum.

Sufficiency of 1st order conditions

For a convex differentiable problem: if d^T ∇f(x̄) ≥ 0 ∀d ∈ T(x̄), then x̄ is a (global) optimum.

Proof:

  f(y) ≥ f(x̄) + (y − x̄)^T ∇f(x̄)  ∀y ∈ S.

But y − x̄ ∈ T(x̄), so f(y) ≥ f(x̄) + d^T ∇f(x̄) ≥ f(x̄) ∀y ∈ S; thus x̄ is a global minimum.

Convexity of the set of global optima

(For convex problems.) The set of global minima of a convex problem is a convex set. In fact, let x̄ and ȳ be global minima for the convex problem min_{x∈S} f(x). Then, choosing λ ∈ [0,1], we have λx̄ + (1−λ)ȳ ∈ S, as S is convex. Moreover

  f(λx̄ + (1−λ)ȳ) ≤ λf(x̄) + (1−λ)f(ȳ) = λf* + (1−λ)f* = f*

where f* is the global minimum value. Thus equality holds and the proof is complete.

KKT for equality constraints

Let x̄ be a local optimum for

  min f(x)
  g_i(x) ≤ 0,  i = 1,…,m
  h_j(x) = 0,  j = 1,…,k
  x ∈ X ⊆ R^n

Let I be the set of active inequalities at x̄. If f(x), g_i(x), i ∈ I, and the h_j(x) are C¹ and constraint qualifications hold at x̄, then there exist λ_i ≥ 0, i ∈ I, and μ_j ∈ R, j = 1,…,k:

  ∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1}^k μ_j ∇h_j(x̄) = 0

Complementarity

KKT equivalent formulation:

  ∇f(x̄) + Σ_{i=1}^m λ_i ∇g_i(x̄) + Σ_{j=1}^k μ_j ∇h_j(x̄) = 0
  λ_i g_i(x̄) = 0,  i = 1,…,m

The condition λ_i g_i(x̄) = 0 is called the complementarity condition.

Second order necessary conditions

If f, g_i, h_j are C² at x̄ and the gradients of the active constraints at x̄ are linearly independent, then there exist multipliers λ_i ≥ 0, i ∈ I, and μ_j, j = 1,…,k, such that

  ∇f(x̄) + Σ_{i∈I} λ_i ∇g_i(x̄) + Σ_{j=1}^k μ_j ∇h_j(x̄) = 0

and

  d^T ∇²L(x̄) d ≥ 0  for every direction d : d^T ∇g_i(x̄) = 0, i ∈ I, d^T ∇h_j(x̄) = 0,

where

  ∇²L(x) := ∇²f(x) + Σ_{i∈I} λ_i ∇²g_i(x) + Σ_{j=1}^k μ_j ∇²h_j(x)

Sufficient conditions

Let f, g_i, h_j be twice continuously differentiable, and let x*, λ*, μ* satisfy:

  ∇f(x*) + Σ_{i∈I} λ*_i ∇g_i(x*) + Σ_{j=1}^k μ*_j ∇h_j(x*) = 0
  λ*_i g_i(x*) = 0
  λ*_i ≥ 0
  d^T ∇²L(x*) d > 0  ∀d ≠ 0 : d^T ∇h_j(x*) = 0, d^T ∇g_i(x*) = 0, i ∈ I;

then x* is a local minimum.

Lagrange Duality

Problem:

  f* = min f(x)
  g_i(x) ≤ 0
  x ∈ X

Definition (Lagrange function):

  L(x; λ) = f(x) + Σ_i λ_i g_i(x),  λ ≥ 0, x ∈ X

Relaxation

Given an optimization problem min_{x∈S} f(x), a relaxation is a problem min_{x∈Q} g(x) where

  S ⊆ Q,   g(x) ≤ f(x)  ∀x ∈ S.

Weak duality: the optimal value of a relaxation is a lower bound on the optimal value of the problem.

Proof that Lagrange minimization is a relaxation: the feasible set of the Lagrange problem is X (it contains the original one), and if g(x) ≤ 0 and λ ≥ 0, then

  L(x, λ) = f(x) + λ^T g(x) ≤ f(x).

Dual Lagrange function

With respect to the constraints g(x) ≤ 0:

  θ(λ) = inf_{x∈X} L(x, λ) = inf_{x∈X} ( f(x) + λ^T g(x) )

For every choice of λ ≥ 0, θ(λ) is a lower bound on the value of every feasible solution and in particular on the global minimum value of the problem.

Example (circle packing)

  max r  (written as min −r)
  4r² − (x_i − x_j)² − (y_i − y_j)² ≤ 0,  1 ≤ i < j ≤ N
  0 ≤ x_i ≤ 1,  0 ≤ y_i ≤ 1,  i = 1,…,N

Solution

When N = 2, relaxing the first constraint:

  θ(λ) = min { −r + λ( 4r² − (x_1 − x_2)² − (y_1 − y_2)² ) : 0 ≤ x_i, y_i ≤ 1 }

Minimizing with respect to x, y gives |x_1 − x_2| = |y_1 − y_2| = 1, from which

  θ(λ) = min_r ( −r + 4λr² ) − 2λ

  −1 + 8λr = 0  ⇒  r* = 1/(8λ)

  θ(λ) = −1/(16λ) − 2λ

This is a lower bound on the optimum value. Best possible lower bound:

  θ* = max_λ θ(λ)  ⇒  λ* = 1/(4√2),  θ* = −√2/2
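The N = 2 dual bound above can be verified numerically: minimize −r + 4λr² on a fine grid (the separation term contributes −2λ, with the centers pushed to opposite corners) and compare with the closed form −1/(16λ) − 2λ at λ* = 1/(4√2). A sketch (the grid range and resolution are arbitrary choices):

```python
import math
import numpy as np

def theta_closed(lam):
    """theta(lam) = -1/(16 lam) - 2 lam, obtained from r* = 1/(8 lam)."""
    return -1.0 / (16.0 * lam) - 2.0 * lam

def theta_numeric(lam):
    """Grid-minimize -r + 4 lam r^2 over r, then add the -2 lam center term."""
    rs = np.linspace(0.0, 1.5, 150001)
    return float(np.min(-rs + 4.0 * lam * rs ** 2)) - 2.0 * lam

lam_star = 1.0 / (4.0 * math.sqrt(2.0))
```

At λ*, both evaluations agree with θ* = −√2/2, which equals the objective −r of the feasible solution with r = √2/2.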

Lagrange Dual

Choosing (x_1, y_1) = (0,0) and (x_2, y_2) = (1,1), a feasible solution with r = √2/2 is obtained. The Lagrange dual gives a lower bound equal to −√2/2: the same as the objective function (−r) at a feasible solution ⇒ optimal solution! (An exception, not the rule!)

The problem

  θ* = max_{λ ≥ 0} θ(λ)

might:

1. be unbounded
2. have a finite sup but no max
3. have a unique maximum attained in correspondence with a single solution x
4. have many different maxima, each connected with a different solution x

Equality constraints

  f* = min f(x)
  g_i(x) ≤ 0,  i = 1,…,m
  h_j(x) = 0,  j = 1,…,k
  x ∈ X

Lagrange function: L(x; λ, μ) = f(x) + λ^T g(x) + μ^T h(x), where λ ≥ 0 but μ is free.

Linear Programming

  min c^T x
  Ax ≥ b

Dual Lagrange function:

  θ(λ) = min_x c^T x + λ^T (b − Ax) = λ^T b + min_x (c^T − λ^T A) x

but:

  min_x (c^T − λ^T A) x = 0 if c^T − λ^T A = 0, and −∞ otherwise.

Lagrange dual function:

  θ(λ) = λ^T b if A^T λ = c, and −∞ otherwise.

Lagrange dual:

  max λ^T b
  A^T λ = c
  λ ≥ 0

(the familiar LP dual).

Quadratic Programming (QP)

  min ½ x^T Q x + c^T x
  Ax = b

(Q: symmetric). Lagrange dual function:

  θ(λ) = min_x ½ x^T Q x + c^T x + λ^T (Ax − b)
       = −λ^T b + min_x ½ x^T Q x + (c^T + λ^T A) x

QP Case 1: Q has at least one negative eigenvalue

  min_x ½ x^T Q x + (c^T + λ^T A) x = −∞

In fact there exists d : d^T Q d < 0. Choosing x = αd with α > 0,

  ½ x^T Q x + (c^T + λ^T A) x = ½ α² d^T Q d + α (c^T + λ^T A) d

and for large values of α this can be made as small as desired.

QP Case 2: Q positive definite

Minimum point of the inner minimization:

  Q x̄ + (c + A^T λ) = 0,  i.e.  x̄ = −Q^{-1}(c + A^T λ)

Lagrange dual function value:

  θ(λ) = −λ^T b + ½ x̄^T Q x̄ + (c^T + λ^T A) x̄
       = −λ^T b + ½ (c + A^T λ)^T Q^{-1} Q Q^{-1} (c + A^T λ) − (c^T + λ^T A) Q^{-1} (c + A^T λ)
       = −λ^T b − ½ (c + A^T λ)^T Q^{-1} (c + A^T λ)

Lagrange dual (seen as a min problem):

  min_λ λ^T b + ½ (c + A^T λ)^T Q^{-1} (c + A^T λ)

Optimality conditions:

  b + A Q^{-1} (c + A^T λ) = 0

But recalling that x̄ = −Q^{-1}(c + A^T λ):

  b − A x̄ = 0  ⇒  feasibility of x̄.

So if we find the optimal multipliers λ (a linear system), we get the optimal solution x̄ (thanks to feasibility and weak duality)!

Properties of the Lagrange dual

For any problem

  f* = min f(x),  g_i(x) ≤ 0, i = 1,…,m,  x ∈ X

where X is non-empty and compact, if f and the g_i are continuous, then the Lagrange dual function is concave.

Proof. From the Weierstrass theorem, θ(λ) = min_{x∈X} f(x) + λ^T g(x) exists and is finite. For η ∈ [0,1]:

  θ(ηa + (1−η)b) = min_{x∈X} ( f(x) + (ηa + (1−η)b)^T g(x) )
                 = min_{x∈X} ( η(f(x) + a^T g(x)) + (1−η)(f(x) + b^T g(x)) )
                 ≥ η min_{x∈X} (f(x) + a^T g(x)) + (1−η) min_{x∈X} (f(x) + b^T g(x))
                 = η θ(a) + (1−η) θ(b).
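For an equality-constrained QP with positive definite Q, the dual optimality condition is a linear system in λ, after which the primal solution is recovered in closed form. A minimal sketch (the particular Q, c, A, b are an arbitrary feasible instance):

```python
import numpy as np

# min 0.5 x'Qx + c'x  s.t.  Ax = b, with Q positive definite
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Dual optimality b + A Q^{-1}(c + A'lam) = 0  =>  (A Q^{-1} A') lam = -(b + A Q^{-1} c)
Qinv_c = np.linalg.solve(Q, c)
M = A @ np.linalg.solve(Q, A.T)
lam = np.linalg.solve(M, -(b + A @ Qinv_c))

x_bar = -np.linalg.solve(Q, c + A.T @ lam)   # minimizer of the Lagrangian
```

Feasibility of x̄ plus weak duality then certifies optimality, exactly as argued on the slide.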

Solution of the Lagrange dual

  max_{λ≥0} θ(λ) = max_{λ≥0} min_{x∈X} ( f(x) + λ^T g(x) )

is equivalent to

  max z
  z ≤ f(x) + λ^T g(x)  ∀x ∈ X
  λ ≥ 0

After having computed f and g at x_1, x_2, …, x_k, a restricted dual can be defined:

  max z
  z ≤ f(x_j) + λ^T g(x_j),  j = 1,…,k
  λ ≥ 0

Let λ̄ be the optimal solution of the restricted dual. Is it an optimal dual solution, i.e. is it true that z̄ ≤ f(x) + λ̄^T g(x) for all x ∈ X? Check: look for x̄, an optimal solution of

  min_{x∈X} f(x) + λ̄^T g(x).

If f(x̄) + λ̄^T g(x̄) ≥ z̄, then we have found the optimal solution of the dual; otherwise the pair (x̄, f(x̄)) is added to the restricted dual and a new solution is computed.

Geometric programming

Unconstrained geometric program:

  min_{x>0} Σ_{k=1}^m c_k Π_{j=1}^n x_j^{α_kj},  α_kj ∈ R, c_k > 0

(non-convex). Variable substitution: x_j = exp(y_j), y_j ∈ R. Transformed problem:

  min_y Σ_{k=1}^m c_k e^{α_k^T y} = min_y Σ_{k=1}^m e^{α_k^T y + β_k},  β_k = log c_k

still non-convex, but its logarithm is convex.

Duality example: solving the dual

Dual of

  min_x log Σ_{k=1}^m exp(α_k^T x + β_k)

With no constraints, the dual Lagrange function is identical to f(x)! Strong duality holds, but is useless. Simple transformation:

  min log Σ_{k=1}^m exp y_k
  y_k = α_k^T x + β_k

Dual function:

  L(λ) = min_{x,y} log Σ_{k=1}^m exp y_k + λ^T (Ax + β − y)

Minimization in x is unconstrained: min_x λ^T A x is unbounded if λ^T A ≠ 0; if λ^T A = 0 then

  L(λ) = min_y log Σ_{k=1}^m exp y_k + λ^T (β − y)

First order (unconstrained) optimality conditions w.r.t. y_i:

  exp y_i / Σ_k exp y_k − λ_i = 0

Lagrange multipliers exist provided that Σ_i λ_i = 1 and λ_i > 0 ∀i. Substituting λ_j = exp y_j / Σ_k exp y_k (so that log λ_j = y_j − log Σ_k exp y_k):

  log Σ_j exp y_j − Σ_j λ_j y_j = Σ_k λ_k ( log Σ_j exp y_j − y_k )
                                = −Σ_k λ_k log λ_k
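The key fact used here, that the log of the transformed posynomial (a log-sum-exp of affine functions) is convex, can be spot-checked numerically on random midpoints. A sketch (the exponents α_k and constants β_k are hypothetical sample data):

```python
import math, random

alphas = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # hypothetical exponent rows alpha_k
betas = [0.0, 0.5, -1.0]                         # beta_k = log c_k

def f(y):
    """log sum_k exp(alpha_k' y + beta_k): the transformed GP objective."""
    return math.log(sum(math.exp(a1 * y[0] + a2 * y[1] + bk)
                        for (a1, a2), bk in zip(alphas, betas)))

rng = random.Random(0)
convex = True
for _ in range(2000):
    y1 = [rng.uniform(-2, 2), rng.uniform(-2, 2)]
    y2 = [rng.uniform(-2, 2), rng.uniform(-2, 2)]
    t = rng.random()
    mid = [t * a + (1 - t) * b for a, b in zip(y1, y2)]
    if f(mid) > t * f(y1) + (1 - t) * f(y2) + 1e-9:
        convex = False   # a violation would disprove convexity
        break
```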

Lagrange Dual

The Lagrange dual becomes:

  max β^T λ − Σ_k λ_k log λ_k
  Σ_k λ_k = 1
  A^T λ = 0
  λ ≥ 0

Special cases: linear constraints

  min f(x)
  Ax ≤ b

Lagrange function: L(x, λ) = f(x) + λ^T (Ax − b). Constraint qualifications always hold (polyhedron). If x* is a local optimum, there exists λ* ≥ 0:

  Ax* ≤ b
  ∇f(x*) = −A^T λ*
  (λ*)^T (Ax* − b) = 0

Non-negativity constraints

  min f(x)
  x ≥ 0

Lagrange function: L(x, λ) = f(x) − λ^T x. KKT conditions:

  ∇f(x*) = λ*
  x* ≥ 0,  λ* ≥ 0
  (λ*)^T x* = 0

from which λ*_j = ∂f(x*)/∂x_j, j = 1,…,n, and

  ∂f(x*)/∂x_j = 0  ∀j : x*_j > 0
  ∂f(x*)/∂x_j ≥ 0  otherwise

Box constraints

  min f(x)
  l ≤ x ≤ u  (with l_i < u_i ∀i)

Lagrange function: L(x, λ, μ) = f(x) + λ^T (l − x) + μ^T (x − u). KKT conditions:

  ∇f(x*) = λ* − μ*
  (l − x*)^T λ* = 0
  (x* − u)^T μ* = 0
  (λ*, μ*) ≥ 0

Given x*, let J_l = {j : x*_j = l_j}, J_u = {j : x*_j = u_j}, J_0 = {j : l_j < x*_j < u_j}. Then, from complementarity,

  ∂f(x*)/∂x_j = λ*_j,   j ∈ J_l
  ∂f(x*)/∂x_j = −μ*_j,  j ∈ J_u
  ∂f(x*)/∂x_j = 0,      j ∈ J_0

Box constraints (cont.)

Thus

  ∂f(x*)/∂x_j ≥ 0,  j ∈ J_l
  ∂f(x*)/∂x_j ≤ 0,  j ∈ J_u
  ∂f(x*)/∂x_j = 0,  j ∈ J_0

with feasibility l ≤ x* ≤ u.

Optimization over the simplex

  min f(x)
  1^T x = 1
  x ≥ 0

Lagrange function: L(x, λ, μ) = f(x) − λ^T x + μ(1 − 1^T x). KKT:

  ∂f(x*)/∂x_j − λ*_j = μ*  (all equal)
  1^T x* = 1
  (x*, λ*) ≥ 0
  (λ*)^T x* = 0

Thus, from complementarity, if x*_j > 0 then λ*_j = 0 and ∂f(x*)/∂x_j = μ*; otherwise ∂f(x*)/∂x_j ≥ μ*. Hence, for every j with x*_j > 0,

  ∂f(x*)/∂x_j ≤ ∂f(x*)/∂x_k  ∀k

Application: minimum variance portfolio

Given n assets with random returns R_1,…,R_n, how do we invest 1 euro so that the resulting portfolio has minimum variance? If x_j denotes the fraction of the investment in asset j, the variance of the portfolio P(x) is

  Var = E( P(x) − E(P(x)) )²
      = E( Σ_{j=1}^n (R_j − E(R_j)) x_j )²
      = Σ_{i,j} E[ (R_i − E(R_i))(R_j − E(R_j)) ] x_i x_j
      = x^T Q x

where Q is the variance-covariance matrix of the n assets.
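When the minimum-variance weights turn out to be all positive, the non-negativity bounds are inactive and the solution has the closed form x = Q⁻¹1 / (1ᵀQ⁻¹1); the simplex KKT condition then says the marginal risks (Qx)_j are all equal. A sketch with a hypothetical covariance matrix (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical covariance matrix of 3 assets (symmetric positive definite).
Q = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.09, 0.02],
              [0.00, 0.02, 0.16]])

# Interior solution of min 0.5 x'Qx s.t. 1'x = 1: x = Q^{-1} 1 / (1' Q^{-1} 1).
w = np.linalg.solve(Q, np.ones(3))
x = w / w.sum()

marginal_risk = Q @ x   # should have all components equal at the optimum
```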

Minimum variance portfolio

Problem (objective multiplied by ½ for simpler computations):

  min ½ x^T Q x
  1^T x = 1
  x ≥ 0

Optimal portfolio KKT: for all j such that x*_j > 0,

  Σ_i Q_ji x*_i ≤ Σ_i Q_ki x*_i  ∀k

The vector Qx may be thought of as the vector of marginal contributions to the total risk (which is a weighted sum of the elements of Qx). Thus in the optimal portfolio, all assets held at a positive level give an equal (and minimal) contribution to the total risk.

Algorithms for unconstrained local optimization (Fabio Schoen)

Optimization Algorithms

The most common form of optimization algorithm is the line search-based method: given a starting point x_0, a sequence is generated as

  x_{k+1} = x_k + α_k d_k

where d_k ∈ R^n is the search direction and α_k > 0 the step. Usually d_k is chosen first, and the step is then obtained, often from a one-dimensional optimization.

Trust-region algorithms

A model m(x) and a confidence (trust) region U(x_k) containing x_k are defined. The new iterate is chosen as the solution of the constrained optimization problem

  min m(x),  x ∈ U(x_k)

The model and the confidence region are possibly updated at each iteration.

Speed measures

Let x* be a local optimum. The error at x_k might be measured, e.g., as

  e(x_k) = ‖x_k − x*‖   or   e(x_k) = f(x_k) − f(x*).

Given {x_k} → x*, if there exist q > 0 and β ∈ (0,1) such that, for k large enough,

  e(x_k) ≤ q β^k

then {x_k} is linearly convergent (converges with order 1); β is the convergence rate. A sufficient condition for linear convergence:

  lim sup_k e(x_{k+1})/e(x_k) ≤ β

Super-linear convergence

If for every β ∈ (0,1) there exists q such that e(x_k) ≤ q β^k, then convergence is super-linear. Sufficient condition:

  lim sup_k e(x_{k+1})/e(x_k) = 0

Higher order convergence

If, given p > 1, there exist q > 0 and β ∈ (0,1) such that

  e(x_k) ≤ q β^(p^k)

then {x_k} is said to converge with order at least p. If p = 2: quadratic convergence. Sufficient condition:

  lim sup_k e(x_{k+1}) / e(x_k)^p < ∞

Examples

- 1/k converges to 0 with order one (linear convergence)
- 1/k² converges to 0 with order one

Further examples:

- 2^{−k} converges to 0 with order one
- k^{−k} converges to 0 with order one; convergence is super-linear
- 2^{−2^k} converges to 0 with order 2: quadratic convergence

Descent directions and the gradient

Let f ∈ C¹(R^n) and x_k ∈ R^n with ∇f(x_k) ≠ 0. Let d ∈ R^n. If

  d^T ∇f(x_k) < 0

then d is a descent direction. Taylor expansion:

  f(x_k + αd) − f(x_k) = α d^T ∇f(x_k) + o(α)
  ( f(x_k + αd) − f(x_k) ) / α = d^T ∇f(x_k) + o(1)

Thus, if α is small enough, f(x_k + αd) − f(x_k) < 0. NB: d might be a descent direction even if d^T ∇f(x_k) = 0.
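The convergence-order examples can be checked by looking at the ratios e(k+1)/e(k) (or e(k+1)/e(k)² in the quadratic case); the super-linear ratio is computed in log space to avoid floating-point underflow for large k (a small sketch following the example sequences):

```python
import math

e_lin = lambda k: 1.0 / k               # ratio e(k+1)/e(k) = k/(k+1) -> 1
e_geo = lambda k: 2.0 ** (-k)           # ratio exactly 1/2: linear convergence
e_quad = lambda k: 2.0 ** (-(2 ** k))   # e(k+1) = e(k)^2: quadratic convergence

def super_ratio(k):
    """Ratio e(k+1)/e(k) for e(k) = k^{-k}, evaluated in log space."""
    return math.exp(k * math.log(k) - (k + 1) * math.log(k + 1))
```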

Convergence of line search methods

If a sequence x_{k+1} = x_k + α_k d_k is generated in such a way that:
- L_0 = {x : f(x) ≤ f(x_0)} is compact
- d_k ≠ 0 whenever ∇f(x_k) ≠ 0
- f(x_{k+1}) < f(x_k) if ∇f(x_k) ≠ 0, for all k
- lim_k d_k^T ∇f(x_k) / ‖d_k‖ = 0
- if d_k ≠ 0, then |d_k^T ∇f(x_k)| / ‖d_k‖ ≥ σ(‖∇f(x_k)‖), where σ is such that lim_k σ(t_k) = 0 ⇒ lim_k t_k = 0 (σ is called a forcing function)

then either there exists a finite index k̄ such that ∇f(x_k̄) = 0, or otherwise:
- x_k ∈ L_0 and all of its limit points are in L_0
- {f(x_k)} admits a limit
- lim_k ‖∇f(x_k)‖ = 0
- for every limit point x̄ of {x_k} we have ∇f(x̄) = 0.

Comments on the assumptions:
- f(x_{k+1}) < f(x_k): most optimization methods choose d_k as a descent direction. If d_k is a descent direction, choosing α_k sufficiently small ensures the validity of the assumption.
- lim_k d_k^T ∇f(x_k)/‖d_k‖ = 0: for a normalized direction d_k, the scalar product d_k^T ∇f(x_k) is the directional derivative of f along d_k; it is required that this goes to zero. This can be achieved through precise line searches (choosing the step so that f is minimized along d_k).
- |d_k^T ∇f(x_k)|/‖d_k‖ ≥ σ(‖∇f(x_k)‖): letting, e.g., σ(t) = ct with c > 0, if d_k is such that d_k^T ∇f(x_k) < 0 then the condition becomes
|d_k^T ∇f(x_k)| / (‖d_k‖ ‖∇f(x_k)‖) ≥ c

Gradient Algorithms

Recalling that
cos θ_k = −d_k^T ∇f(x_k) / (‖d_k‖ ‖∇f(x_k)‖)
the condition becomes cos θ_k ≥ c: the angle θ_k between d_k and −∇f(x_k) is bounded away from orthogonality.

General scheme:
x_{k+1} = x_k − α_k D_k ∇f(x_k)
with D_k ≻ 0 and α_k > 0. If ∇f(x_k) ≠ 0 then d_k = −D_k ∇f(x_k) is a descent direction. In fact
d_k^T ∇f(x_k) = −∇^T f(x_k) D_k ∇f(x_k) < 0

Steepest descent (or gradient method): D_k := I, i.e. x_{k+1} = x_k − α_k ∇f(x_k). If ∇f(x_k) ≠ 0 then d_k = −∇f(x_k) is a descent direction. Moreover, it is the steepest one (w.r.t. the euclidean norm):
min_{d ∈ R^n, ‖d‖ ≤ 1} ∇^T f(x_k) d

The steepest direction solves
min_{d ∈ R^n} ∇^T f(x_k) d  s.t.  d^T d ≤ 1
KKT conditions: in the interior, ∇f(x_k) = 0; if the constraint is active,
∇f(x_k) + 2λ d = 0,  d^T d = 1,  λ ≥ 0
⇒ d = −∇f(x_k) / ‖∇f(x_k)‖.

Newton's method

D_k := (∇²f(x_k))^{−1}. Motivation: Taylor expansion of f:
f(x) ≈ f(x_k) + ∇^T f(x_k)(x − x_k) + ½ (x − x_k)^T ∇²f(x_k)(x − x_k)
Minimizing the approximation:
∇f(x_k) + ∇²f(x_k)(x − x_k) = 0
If the Hessian is non-singular,
x = x_k − (∇²f(x_k))^{−1} ∇f(x_k)

Step choice

Given d_k, how to choose α_k in x_{k+1} = x_k + α_k d_k? Optimal choice (one-dimensional optimization):
α_k = arg min_{α ≥ 0} f(x_k + α d_k)
An analytical expression for the optimal step is available only in a few cases, e.g. if f(x) = ½ x^T Q x + c^T x with Q ≻ 0. Then
f(x_k + α d_k) = ½ (x_k + α d_k)^T Q (x_k + α d_k) + c^T (x_k + α d_k) = ½ α² d_k^T Q d_k + α (Q x_k + c)^T d_k + β
where β does not depend on α. Minimizing w.r.t. α:
α d_k^T Q d_k + (Q x_k + c)^T d_k = 0
so
α_k = α* = −(Q x_k + c)^T d_k / (d_k^T Q d_k) = −d_k^T ∇f(x_k) / (d_k^T ∇²f(x_k) d_k)
E.g., in steepest descent:
α* = ‖∇f(x_k)‖² / (∇^T f(x_k) ∇²f(x_k) ∇f(x_k))
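The closed-form step for the quadratic case can be sanity-checked numerically; Q and c below are arbitrary illustrative data.

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
c = np.array([1.0, -2.0])

def f(x):
    return 0.5 * x @ Q @ x + c @ x

x_k = np.array([2.0, 1.0])
g = Q @ x_k + c                  # gradient of f at x_k
d = -g                           # steepest-descent direction
alpha = -(g @ d) / (d @ Q @ d)   # alpha* = -d^T grad f / (d^T Q d)

# alpha* minimizes f(x_k + alpha d): nearby steps cannot do better
best = f(x_k + alpha * d)
```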

Approximate step size

Rules for choosing a step size (from the sufficient conditions for convergence):
- f(x_{k+1}) < f(x_k)
- lim_k d_k^T ∇f(x_k) / ‖d_k‖ = 0
Often it is also required that x_{k+1} − x_k → 0 and
d_k^T ∇f(x_k + α_k d_k) ≥ 0
In general it is important to ensure a sufficient reduction of f (avoid too large steps) and a sufficiently large step ‖x_{k+1} − x_k‖ (avoid too small steps).

Armijo's rule

Input: δ ∈ (0,1), γ ∈ (0,1/2), Δ_k > 0
α := Δ_k ;
while f(x_k + α d_k) > f(x_k) + γ α d_k^T ∇f(x_k) do
    α := δα ;
end
return α

Typical values: δ ∈ [0.1, 0.5], γ ∈ [10⁻⁴, 10⁻³]. On exit the returned step satisfies
f(x_k + α d_k) ≤ f(x_k) + γ α d_k^T ∇f(x_k)
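Armijo's rule translates almost line by line into code; the quadratic test problem at the bottom is only for demonstration.

```python
import numpy as np

def armijo(f, grad, x, d, delta=0.5, gamma=1e-4, alpha0=1.0, max_halvings=60):
    """Backtracking: shrink alpha by delta until the sufficient-decrease
    condition f(x + alpha d) <= f(x) + gamma * alpha * d^T grad f(x) holds."""
    fx, slope = f(x), d @ grad(x)
    assert slope < 0, "d must be a descent direction"
    alpha = alpha0
    for _ in range(max_halvings):
        if f(x + alpha * d) <= fx + gamma * alpha * slope:
            return alpha
        alpha *= delta
    return alpha

f = lambda x: x @ x
grad = lambda x: 2 * x
x = np.array([1.0, 2.0])
alpha = armijo(f, grad, x, -grad(x))
accepted = f(x + alpha * (-grad(x))) <= f(x) + 1e-4 * alpha * (-grad(x)) @ grad(x)
```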

Line search in practice

How to choose the initial step size Δ_k? Let φ(α) = f(x_k + α d_k). A possibility is to choose Δ_k = α*, the minimizer of a quadratic approximation to φ(·):
q(α) = c_0 + c_1 α + ½ c_2 α²
q(0) = c_0 := f(x_k),  q′(0) = c_1 := d_k^T ∇f(x_k)
Then α* = −c_1 / c_2.

(Figure: acceptable steps — the lines f(x_k) + γ α d_k^T ∇f(x_k) and f(x_k) + α d_k^T ∇f(x_k) delimit the acceptable region.)

Third condition? If an estimate f̂ of the minimum of f(x_k + α d_k) is available, choose c_2 so that min q(α) = f̂:
min q(α) = q(−c_1/c_2) = c_0 − c_1²/(2 c_2) := f̂  ⇒  c_2 = c_1²/(2(c_0 − f̂))
so
α* = −c_1 / c_2 = 2 (f̂ − c_0) / c_1
Thus it is reasonable to start with
Δ_k = 2 (f̂ − f(x_k)) / (d_k^T ∇f(x_k))
A reasonable estimate might be obtained from the previous iteration:
Δ_k = 2 (f(x_k) − f(x_{k−1})) / (d_k^T ∇f(x_k))
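The initial-step formula Δ_k = 2(f̂ − f(x_k))/(d_k^T ∇f(x_k)) can be checked on a function whose minimum value f̂ is known exactly; here q(α) coincides with φ(α), so the predicted step lands exactly on the minimizer. The test function is an illustrative choice.

```python
import numpy as np

f = lambda x: 0.5 * x @ x     # minimum value f_hat = 0, attained at the origin
grad = lambda x: x

x_k = np.array([2.0, 0.0])
d_k = -grad(x_k)
f_hat = 0.0                   # exact estimate of min_alpha f(x_k + alpha d_k)

delta_k = 2 * (f_hat - f(x_k)) / (d_k @ grad(x_k))
# for this quadratic the interpolation is exact: x_k + delta_k d_k is the minimizer
x_new = x_k + delta_k * d_k
```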

Convergence of steepest descent

x_{k+1} = x_k − α_k ∇f(x_k). If a sufficiently accurate step size is used, the conditions of the theorem on global convergence are satisfied and the steepest descent algorithm globally converges to a stationary point. "Sufficiently accurate" means exact line search or, e.g., Armijo's rule.

Local analysis of steepest descent

Behaviour of the algorithm when minimizing f(x) = ½ x^T Q x, where Q ≻ 0; the (local and global) optimum is x* = 0. Steepest descent method:
x_{k+1} = x_k − α_k ∇f(x_k) = x_k − α_k Q x_k = (I − α_k Q) x_k
Error (in x) at step k+1:
‖x_{k+1} − 0‖ = ‖(I − α_k Q) x_k‖ = √(x_k^T (I − α_k Q)² x_k)

Analysis: let A be symmetric with eigenvalues λ_1 ≤ … ≤ λ_n. Then
λ_1 ‖v‖² ≤ v^T A v ≤ λ_n ‖v‖²  ∀v ∈ R^n
so
x_k^T (I − α_k Q)² x_k ≤ λ̄ x_k^T x_k
where λ̄ is the largest eigenvalue of (I − α_k Q)². Moreover:
- λ is an eigenvalue of A iff αλ is an eigenvalue of αA;
- λ is an eigenvalue of A iff 1 + λ is an eigenvalue of I + A.
Thus the eigenvalues of I − α_k Q are 1 − α_k λ_i, where the λ_i are the eigenvalues of Q. The largest eigenvalue of (I − α_k Q)² is
max{(1 − α_k λ_1)², (1 − α_k λ_n)²}
and thus
‖x_{k+1}‖ ≤ √(max{(1 − α_k λ_1)², (1 − α_k λ_n)²}) ‖x_k‖ = max{|1 − α_k λ_1|, |1 − α_k λ_n|} ‖x_k‖

Eliminating the dependency on α_k: minimize
max{|1 − αλ_1|, |1 − αλ_n|}
over α ≥ 0. (Figure: the two curves |1 − αλ_1| and |1 − αλ_n|.) Since α ≥ 0 and λ_1 ≤ λ_n, the minimum point is where
1 − αλ_1 = −(1 − αλ_n),  i.e.  α* = 2 / (λ_1 + λ_n)

In the best possible case:
‖x_{k+1}‖ / ‖x_k‖ ≤ |1 − α* λ_1| = (λ_n − λ_1)/(λ_n + λ_1) = (ρ − 1)/(ρ + 1)
where ρ = λ_n/λ_1 is the condition number of Q:
- ρ ≫ 1 (ill-conditioned problem): very slow convergence
- ρ ≈ 1: very fast convergence

Zig-zagging: min ½(x² + M y²), where M > 0. Optimum: x = 0, y = 0. Starting point: (M, 1). Iterates:
[x_{k+1}; y_{k+1}] = [x_k; y_k] − α_k [x_k; M y_k]
With the optimal step size:
x_k = M ((M−1)/(M+1))^k,  y_k = (−(M−1)/(M+1))^k
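The (ρ−1)/(ρ+1) contraction and the zig-zag iterates can be reproduced numerically with exact line search on f(x, y) = ½(x² + My²), using the starting point (M, 1) from the slides:

```python
import numpy as np

M = 10.0
Q = np.diag([1.0, M])

def exact_sd_step(z):
    g = Q @ z                        # gradient of 0.5 z^T Q z
    alpha = (g @ g) / (g @ Q @ g)    # exact line search on a quadratic
    return z - alpha * g

z = np.array([M, 1.0])               # starting point (M, 1)
ratios = []
for _ in range(5):
    z_new = exact_sd_step(z)
    ratios.append(abs(z_new[0] / z[0]))  # per-step contraction of the x-coordinate
    z = z_new

predicted = (M - 1) / (M + 1)        # = (rho - 1)/(rho + 1), rho = M
```

Every iteration contracts both coordinates by exactly (M−1)/(M+1) while the sign of y alternates, which is the zig-zag pattern.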

Zig-zagging

Convergence is rapid if M ≈ 1; it is very slow, with zig-zagging, if M ≫ 1 or M ≪ 1. Slow convergence and zig-zagging are general phenomena (especially when the starting point is near the longest axis of the ellipsoidal level sets).

Analysis of Newton's method

Newton-Raphson method: x_{k+1} = x_k − (∇²f(x_k))^{−1} ∇f(x_k). Let x* be a local optimum. Taylor expansion of ∇f:
∇f(x*) = 0 = ∇f(x_k) + ∇²f(x_k)(x* − x_k) + o(‖x* − x_k‖)
If ∇²f(x_k) is non-singular and (∇²f(x_k))^{−1} is bounded:
0 = (∇²f(x_k))^{−1} ∇f(x_k) + (x* − x_k) + o(‖x* − x_k‖) = x* − x_{k+1} + o(‖x* − x_k‖)
Thus
‖x* − x_{k+1}‖ = o(‖x* − x_k‖),  i.e.  ‖x* − x_{k+1}‖ / ‖x* − x_k‖ → 0:
convergence is at least superlinear.
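Quadratic convergence is easy to observe numerically. Newton's iteration on the gradient of the illustrative function f(x) = x² + eˣ (whose second derivative is everywhere positive, so the difficulties listed later do not arise):

```python
import math

# f(x) = x^2 + exp(x):  f'(x) = 2x + e^x,  f''(x) = 2 + e^x > 0
fp  = lambda x: 2 * x + math.exp(x)
fpp = lambda x: 2 + math.exp(x)

x = 1.0
iterates = [x]
for _ in range(8):
    x = x - fp(x) / fpp(x)    # Newton step on the stationarity equation f'(x) = 0
    iterates.append(x)

x_star = iterates[-1]         # converged to machine precision
errors = [abs(z - x_star) for z in iterates[:-1]]
# quadratic convergence: e_{k+1} <= C e_k^2 once the iterate is close to x*
```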

Local Convergence of Newton's Method

Let f ∈ C²(U(x*, δ*)), where U(x*, δ*) is the ball with radius δ* and center x*, and let ∇²f(x*) be non-singular. Then:
1. ∃ δ > 0 such that if x_0 ∈ U(x*, δ), then {x_k} is well defined and converges to x* at least superlinearly.
2. If ∃ δ > 0, L > 0, M > 0 such that
‖∇²f(x) − ∇²f(y)‖ ≤ L ‖x − y‖  and  ‖(∇²f(x))^{−1}‖ ≤ M
then, if x_0 ∈ U(x*, δ), Newton's method converges with order at least 2 and
‖x_{k+1} − x*‖ ≤ (LM/2) ‖x_k − x*‖²

Difficulties: many things might go wrong:
- at some iteration ∇²f(x_k) might be singular — for example, if x_k belongs to a flat region where f(x) = constant;
- even if non-singular, inverting ∇²f(x_k) — or, in any case, solving a linear system with coefficient matrix ∇²f(x_k) — is numerically unstable and computationally demanding;
- there is no guarantee that ∇²f(x_k) ≻ 0: the Newton direction might not be a descent direction;
- Newton's method just tries to solve the system ∇f(x) = 0 and thus might very well be attracted towards a maximum;
- the method lacks global convergence: it converges only if started near a local optimum.

Newton-type methods:
- line search variant: x_{k+1} = x_k − α_k (∇²f(x_k))^{−1} ∇f(x_k);
- modified Newton method: replace ∇²f(x_k) by (∇²f(x_k) + D_k), where D_k is chosen so that ∇²f(x_k) + D_k is positive definite.

Quasi-Newton methods

Consider solving the nonlinear system ∇f(x) = 0. Taylor expansion of the gradient:
∇f(x_k) ≈ ∇f(x_{k+1}) + ∇²f(x_{k+1})(x_k − x_{k+1})
Let B_{k+1} be an approximation of the Hessian at x_{k+1}, and let
s_k := x_{k+1} − x_k,  y_k := ∇f(x_{k+1}) − ∇f(x_k)
Quasi-Newton equation: B_{k+1} s_k = y_k.

If B_k was the previous approximate Hessian, we ask that:
1. the variation between B_k and B_{k+1} is small;
2. nothing changes along directions which are normal to the step s_k:
B_k z = B_{k+1} z  ∀z : z^T s_k = 0
Choosing n − 1 vectors z orthogonal to s_k, together with the quasi-Newton equation this gives n² linearly independent equations in n² unknowns, hence a unique solution.

Broyden updating

It can be shown that the unique solution is given by:
B_{k+1} = B_k + (y_k − B_k s_k) s_k^T / (s_k^T s_k)

Theorem: let B_k ∈ R^{n×n} and s_k ≠ 0. The unique solution of
min_{B̂ : B̂ s_k = y_k} ‖B_k − B̂‖_F
is Broyden's update B_{k+1}; here ‖X‖_F = √Tr(X^T X) denotes the Frobenius norm.

Proof:
‖B_{k+1} − B_k‖ = ‖(y_k − B_k s_k) s_k^T‖ / (s_k^T s_k) = ‖(B̂ s_k − B_k s_k) s_k^T‖ / (s_k^T s_k)
 = ‖(B̂ − B_k) s_k s_k^T‖ / (s_k^T s_k) ≤ ‖B̂ − B_k‖ ‖s_k s_k^T‖ / (s_k^T s_k) = ‖B̂ − B_k‖
since ‖s_k s_k^T‖_F = s_k^T s_k. Uniqueness is a consequence of the strict convexity of the norm and the convexity of the feasible region.
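Broyden's update can be verified directly against the two requirements above — the quasi-Newton (secant) equation and invariance on directions orthogonal to s_k; the random data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B_k = rng.standard_normal((n, n))   # previous approximate Hessian
s_k = rng.standard_normal(n)        # step  x_{k+1} - x_k
y_k = rng.standard_normal(n)        # gradient difference

# Broyden's rank-one update
B_next = B_k + np.outer(y_k - B_k @ s_k, s_k) / (s_k @ s_k)

secant_ok = np.allclose(B_next @ s_k, y_k)   # quasi-Newton equation holds
# directions orthogonal to s_k are left unchanged
z = rng.standard_normal(n)
z -= (z @ s_k) / (s_k @ s_k) * s_k           # project out the s_k component
unchanged = np.allclose(B_next @ z, B_k @ z)
```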

Quasi-Newton and optimization

The optimization setting is special:
1. the Hessian matrix in optimization problems is symmetric;
2. in gradient methods, when we let x_{k+1} = x_k − (B_{k+1})^{−1} ∇f(x_k), it is desirable that B_{k+1} be positive definite.

Broyden's update
B_{k+1} = B_k + (y_k − B_k s_k) s_k^T / (s_k^T s_k)
is generally not symmetric even if B_k is.

Symmetry. Remedy: let C_1 = B_k + (y_k − B_k s_k) s_k^T / (s_k^T s_k) and symmetrize: C_2 = ½(C_1 + C_1^T). However, C_2 does not satisfy the quasi-Newton equation. The Broyden update of C_2,
C_3 = C_2 + (y_k − C_2 s_k) s_k^T / (s_k^T s_k),
is in turn not symmetric, and so on.

PBS update. In the limit this process yields
B_{k+1} = B_k + [(y_k − B_k s_k) s_k^T + s_k (y_k − B_k s_k)^T] / (s_k^T s_k) − (s_k^T (y_k − B_k s_k)) s_k s_k^T / (s_k^T s_k)²
(the PBS, Powell-Broyden-Symmetric, update). Imposing also hereditary positive definiteness, the DFP (Davidon-Fletcher-Powell) update is obtained:
B_{k+1} = B_k + [(y_k − B_k s_k) y_k^T + y_k (y_k − B_k s_k)^T] / (y_k^T s_k) − (s_k^T (y_k − B_k s_k)) y_k y_k^T / (y_k^T s_k)²
 = (I − y_k s_k^T / (y_k^T s_k)) B_k (I − s_k y_k^T / (y_k^T s_k)) + y_k y_k^T / (y_k^T s_k)

BFGS. The same ideas, but applied to the approximate inverse Hessian. Inverse quasi-Newton equation: s_k = H_{k+1} y_k. This leads to the most common quasi-Newton update, BFGS (Broyden-Fletcher-Goldfarb-Shanno):
H_{k+1} = (I − s_k y_k^T / (y_k^T s_k)) H_k (I − y_k s_k^T / (y_k^T s_k)) + s_k s_k^T / (y_k^T s_k)
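The BFGS inverse update can be checked for its two key properties: it satisfies the inverse quasi-Newton equation H_{k+1} y_k = s_k, and it preserves symmetric positive definiteness whenever the curvature condition y_k^T s_k > 0 holds. The random data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
H_k = np.eye(n)                 # current (SPD) inverse-Hessian approximation
s_k = rng.standard_normal(n)
y_k = rng.standard_normal(n)
if y_k @ s_k < 0:               # enforce the curvature condition y^T s > 0
    y_k = -y_k

rho = 1.0 / (y_k @ s_k)
V = np.eye(n) - rho * np.outer(s_k, y_k)
H_next = V @ H_k @ V.T + rho * np.outer(s_k, s_k)   # BFGS update of H_k

inverse_secant_ok = np.allclose(H_next @ y_k, s_k)
eigvals = np.linalg.eigvalsh(H_next)                # H_next is symmetric
positive_definite = bool(np.all(eigvals > 0))
```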

BFGS method

x_{k+1} = x_k − α_k H_k ∇f(x_k)
s_k = x_{k+1} − x_k,  y_k = ∇f(x_{k+1}) − ∇f(x_k)
H_{k+1} = (I − s_k y_k^T / (y_k^T s_k)) H_k (I − y_k s_k^T / (y_k^T s_k)) + s_k s_k^T / (y_k^T s_k)

Trust Region methods

A possible defect of the standard Newton method: the approximation becomes less and less precise as we move away from the current point — a long step means a bad approximation. Idea: constrained minimization of the quadratic approximation:
x_{k+1} = arg min m_k(x)  s.t.  ‖x_{k+1} − x_k‖ ≤ Δ_k
where
m_k(x) = f(x_k) + ∇^T f(x_k)(x − x_k) + ½ (x − x_k)^T ∇²f(x_k)(x − x_k)
and Δ_k > 0 is a parameter. A first advantage (over pure Newton): the step is always well defined (thanks to Weierstrass's theorem).

Outline of Trust Region methods

Let m_k(·) be a local model function, e.g., in Newton trust-region methods,
m_k(s) = f(x_k) + s^T ∇f(x_k) + ½ s^T ∇²f(x_k) s
or, in a quasi-Newton trust-region method,
m_k(s) = f(x_k) + s^T ∇f(x_k) + ½ s^T B_k s
How to choose and update the trust-region radius Δ_k? Given a step s_k, let
ρ_k = (f(x_k) − f(x_k + s_k)) / (m_k(0) − m_k(s_k))
the ratio between the actual reduction and the predicted reduction.

Model updating

The predicted reduction is always non-negative; if ρ_k is small (surely if it is negative), the model and the function strongly disagree: the step must be rejected and the trust region reduced. If ρ_k ≈ 1 it is safe to expand the trust region; intermediate values of ρ_k lead us to keep the region unchanged.

Algorithm
Data: Δ̂ > 0, Δ_0 ∈ (0, Δ̂), η ∈ [0, 1/4)
for k = 0, 1, ... do
    find the step s_k minimizing the model in the trust region, and compute ρ_k ;
    if ρ_k < 1/4 then Δ_{k+1} = Δ_k / 4 ;
    else if ρ_k > 3/4 and ‖s_k‖ = Δ_k then Δ_{k+1} = min{2Δ_k, Δ̂} ;
    else Δ_{k+1} = Δ_k ;
    if ρ_k > η then x_{k+1} = x_k + s_k ; else x_{k+1} = x_k ;
end

Solving the model: how to find
min_{‖s‖ ≤ Δ} ∇f(x_k)^T s + ½ s^T B_k s
If B_k ⪰ 0, the KKT conditions are necessary and sufficient; rewriting the constraint as s^T s ≤ Δ²:
∇f(x_k) + B_k s + 2λ s = 0,  λ(s^T s − Δ²) = 0,  λ ≥ 0
Thus either s is in the interior of the ball with radius Δ, in which case λ = 0 and we have the (quasi-)Newton step
p = −B_k^{−1} ∇f(x_k),
or ‖s‖ = Δ and, if λ > 0,
2λ s = −∇f(x_k) − B_k s = −∇m_k(s):
s is parallel to the negative gradient of the model and normal to its contour lines.
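The radius-update and acceptance logic of the algorithm above, in isolation; the thresholds 1/4, 3/4 and the factors are the ones on the slide, while η = 0.1 is just one admissible value.

```python
def trust_region_update(rho, delta, step_norm, delta_max, eta=0.1):
    """Shrink the radius on poor model agreement, expand it on very good
    agreement when the step hits the boundary; accept only if rho > eta."""
    if rho < 0.25:
        delta_new = delta / 4
    elif rho > 0.75 and abs(step_norm - delta) < 1e-12:
        delta_new = min(2 * delta, delta_max)
    else:
        delta_new = delta
    accept = rho > eta
    return delta_new, accept

# poor model agreement: radius quartered, step rejected
d1, a1 = trust_region_update(rho=-0.5, delta=1.0, step_norm=0.8, delta_max=4.0)
# excellent agreement with the step on the boundary: radius doubled, step accepted
d2, a2 = trust_region_update(rho=0.9, delta=1.0, step_norm=1.0, delta_max=4.0)
```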

The Cauchy Point

A strategy to approximately solve the trust-region subproblem: find the Cauchy point, the minimizer of m_k along the direction −∇f(x_k) within the trust region. First find the direction:
p_k^s = arg min_{‖p‖ ≤ Δ_k} f_k + ∇f(x_k)^T p
Then, along this direction, find a minimizer:
τ_k = arg min_{τ ≥ 0, ‖τ p_k^s‖ ≤ Δ_k} m_k(τ p_k^s)

Finding the Cauchy point: p_k^s is easy — there is an analytic solution:
p_k^s = −Δ_k ∇f(x_k) / ‖∇f(x_k)‖
For the step size τ_k: if ∇f(x_k)^T B_k ∇f(x_k) ≤ 0 (a negative curvature direction), take the largest possible step, τ_k = 1. Otherwise the model along the line is strictly convex, so
τ_k = min{1, ‖∇f(x_k)‖³ / (Δ_k ∇f(x_k)^T B_k ∇f(x_k))}
The Cauchy point is x_k + τ_k p_k^s.

Choosing the Cauchy point gives global but extremely slow convergence (similar to steepest descent). Usually an improved point is searched for, starting from the Cauchy one.
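The Cauchy-point computation translates line by line; B below plays the role of ∇²f(x_k) or B_k, and the matrices chosen are illustrative.

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Minimizer of the quadratic model along -g within a ball of radius delta."""
    g_norm = np.linalg.norm(g)
    p_s = -delta * g / g_norm        # boundary step along -g
    curvature = g @ B @ g
    if curvature <= 0:               # negative curvature: go all the way to the boundary
        tau = 1.0
    else:                            # strictly convex along the line
        tau = min(1.0, g_norm**3 / (delta * curvature))
    return tau * p_s

g = np.array([1.0, 0.0])
# convex model: the unconstrained minimizer along -g (at g.g/(g B g) = 0.5) is inside
p_c = cauchy_point(g, np.diag([2.0, 1.0]), delta=5.0)
# negative curvature model: the Cauchy point sits on the boundary
p_nc = cauchy_point(g, np.diag([-1.0, 1.0]), delta=5.0)
```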

Derivative Free Optimization — Pattern Search

For smooth optimization, but without knowledge of derivatives. Elementary idea: if x ∈ R² is not a local minimum of f, then at least one of the directions e_1, e_2, −e_1, −e_2 (moving towards E, N, W, S) forms an acute angle with −∇f(x) and is thus a descent direction. Direct search: explore all the directions in search of one which gives a descent.

Coordinate search. Let D = {±e_i} be the set of coordinate directions and their opposites.
Data: k = 0, Δ_0 an initial step length, x_0 a starting point
while Δ_k is large enough do
    if f(x_k + Δ_k d) < f(x_k) for some d ∈ D then
        x_{k+1} = x_k + Δ_k d (step accepted) ;
    else
        Δ_{k+1} = 0.5 Δ_k ;
    end
    k = k + 1 ;
end

Pattern search. It is not necessary to explore 2n directions: it is sufficient that the set of directions forms a positive span, i.e. every v ∈ R^n should be expressible as a non-negative linear combination of the vectors in the set. Formally, G is a generating set iff
∀v ≠ 0 ∈ R^n ∃g ∈ G : v^T g > 0
A good generating set should be characterized by a sufficiently high cosine measure:
κ(G) := min_{v ≠ 0} max_{d ∈ G} v^T d / (‖v‖ ‖d‖)
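Coordinate search as described above, with D = {±e_i} and step halving on failure; the quadratic test function is an illustrative choice.

```python
import numpy as np

def coordinate_search(f, x0, delta=1.0, min_delta=1e-8, max_iter=10000):
    """Try the 2n coordinate directions; halve the step when all of them fail."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    directions = [s * e for e in np.eye(n) for s in (+1.0, -1.0)]
    for _ in range(max_iter):
        if delta < min_delta:        # step length no longer "large enough"
            break
        for d in directions:
            if f(x + delta * d) < f(x):
                x = x + delta * d    # step accepted
                break
        else:
            delta *= 0.5             # all 2n directions failed
    return x

f = lambda x: (x[0] - 1) ** 2 + 3 * (x[1] + 2) ** 2
x_min = coordinate_search(f, [0.0, 0.0])   # minimizer of f is (1, -2)
```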


More information

Quasi-Newton methods for minimization

Quasi-Newton methods for minimization Quasi-Newton methods for minimization Lectures for PHD course on Numerical optimization Enrico Bertolazzi DIMS Universitá di Trento November 21 December 14, 2011 Quasi-Newton methods for minimization 1

More information

5. Duality. Lagrangian

5. Duality. Lagrangian 5. Duality Convex Optimization Boyd & Vandenberghe Lagrange dual problem weak and strong duality geometric interpretation optimality conditions perturbation and sensitivity analysis examples generalized

More information

Geometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as

Geometric problems. Chapter Projection on a set. The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as Chapter 8 Geometric problems 8.1 Projection on a set The distance of a point x 0 R n to a closed set C R n, in the norm, is defined as dist(x 0,C) = inf{ x 0 x x C}. The infimum here is always achieved.

More information

Optimality, Duality, Complementarity for Constrained Optimization

Optimality, Duality, Complementarity for Constrained Optimization Optimality, Duality, Complementarity for Constrained Optimization Stephen Wright University of Wisconsin-Madison May 2014 Wright (UW-Madison) Optimality, Duality, Complementarity May 2014 1 / 41 Linear

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

8. Geometric problems

8. Geometric problems 8. Geometric problems Convex Optimization Boyd & Vandenberghe extremal volume ellipsoids centering classification placement and facility location 8 Minimum volume ellipsoid around a set Löwner-John ellipsoid

More information

Optimization Problems with Constraints - introduction to theory, numerical Methods and applications

Optimization Problems with Constraints - introduction to theory, numerical Methods and applications Optimization Problems with Constraints - introduction to theory, numerical Methods and applications Dr. Abebe Geletu Ilmenau University of Technology Department of Simulation and Optimal Processes (SOP)

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

Convex Optimization and Modeling

Convex Optimization and Modeling Convex Optimization and Modeling Convex Optimization Fourth lecture, 05.05.2010 Jun.-Prof. Matthias Hein Reminder from last time Convex functions: first-order condition: f(y) f(x) + f x,y x, second-order

More information

Lecture V. Numerical Optimization

Lecture V. Numerical Optimization Lecture V Numerical Optimization Gianluca Violante New York University Quantitative Macroeconomics G. Violante, Numerical Optimization p. 1 /19 Isomorphism I We describe minimization problems: to maximize

More information

Introduction to Nonlinear Stochastic Programming

Introduction to Nonlinear Stochastic Programming School of Mathematics T H E U N I V E R S I T Y O H F R G E D I N B U Introduction to Nonlinear Stochastic Programming Jacek Gondzio Email: J.Gondzio@ed.ac.uk URL: http://www.maths.ed.ac.uk/~gondzio SPS

More information

Lecture 14: October 17

Lecture 14: October 17 1-725/36-725: Convex Optimization Fall 218 Lecture 14: October 17 Lecturer: Lecturer: Ryan Tibshirani Scribes: Pengsheng Guo, Xian Zhou Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Assignment 1: From the Definition of Convexity to Helley Theorem

Assignment 1: From the Definition of Convexity to Helley Theorem Assignment 1: From the Definition of Convexity to Helley Theorem Exercise 1 Mark in the following list the sets which are convex: 1. {x R 2 : x 1 + i 2 x 2 1, i = 1,..., 10} 2. {x R 2 : x 2 1 + 2ix 1x

More information

1 Numerical optimization

1 Numerical optimization Contents 1 Numerical optimization 5 1.1 Optimization of single-variable functions............ 5 1.1.1 Golden Section Search................... 6 1.1. Fibonacci Search...................... 8 1. Algorithms

More information

The Steepest Descent Algorithm for Unconstrained Optimization

The Steepest Descent Algorithm for Unconstrained Optimization The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem

More information

Duality Theory of Constrained Optimization

Duality Theory of Constrained Optimization Duality Theory of Constrained Optimization Robert M. Freund April, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 The Practical Importance of Duality Duality is pervasive

More information

The general programming problem is the nonlinear programming problem where a given function is maximized subject to a set of inequality constraints.

The general programming problem is the nonlinear programming problem where a given function is maximized subject to a set of inequality constraints. 1 Optimization Mathematical programming refers to the basic mathematical problem of finding a maximum to a function, f, subject to some constraints. 1 In other words, the objective is to find a point,

More information

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL)

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL) Part 4: Active-set methods for linearly constrained optimization Nick Gould RAL fx subject to Ax b Part C course on continuoue optimization LINEARLY CONSTRAINED MINIMIZATION fx subject to Ax { } b where

More information

Lecture 15 Newton Method and Self-Concordance. October 23, 2008

Lecture 15 Newton Method and Self-Concordance. October 23, 2008 Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications

More information

Lecture: Duality.

Lecture: Duality. Lecture: Duality http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Introduction 2/35 Lagrange dual problem weak and strong

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

Conic Linear Programming. Yinyu Ye

Conic Linear Programming. Yinyu Ye Conic Linear Programming Yinyu Ye December 2004, revised January 2015 i ii Preface This monograph is developed for MS&E 314, Conic Linear Programming, which I am teaching at Stanford. Information, lecture

More information

4TE3/6TE3. Algorithms for. Continuous Optimization

4TE3/6TE3. Algorithms for. Continuous Optimization 4TE3/6TE3 Algorithms for Continuous Optimization (Duality in Nonlinear Optimization ) Tamás TERLAKY Computing and Software McMaster University Hamilton, January 2004 terlaky@mcmaster.ca Tel: 27780 Optimality

More information

Lecture: Duality of LP, SOCP and SDP

Lecture: Duality of LP, SOCP and SDP 1/33 Lecture: Duality of LP, SOCP and SDP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:

More information

Convex optimization problems. Optimization problem in standard form

Convex optimization problems. Optimization problem in standard form Convex optimization problems optimization problem in standard form convex optimization problems linear optimization quadratic optimization geometric programming quasiconvex optimization generalized inequality

More information

14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness.

14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness. CS/ECE/ISyE 524 Introduction to Optimization Spring 2016 17 14. Duality ˆ Upper and lower bounds ˆ General duality ˆ Constraint qualifications ˆ Counterexample ˆ Complementary slackness ˆ Examples ˆ Sensitivity

More information

Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS

Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS Appendix PRELIMINARIES 1. THEOREMS OF ALTERNATIVES FOR SYSTEMS OF LINEAR CONSTRAINTS Here we consider systems of linear constraints, consisting of equations or inequalities or both. A feasible solution

More information

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0

1. Nonlinear Equations. This lecture note excerpted parts from Michael Heath and Max Gunzburger. f(x) = 0 Numerical Analysis 1 1. Nonlinear Equations This lecture note excerpted parts from Michael Heath and Max Gunzburger. Given function f, we seek value x for which where f : D R n R n is nonlinear. f(x) =

More information

UC Berkeley Department of Electrical Engineering and Computer Science. EECS 227A Nonlinear and Convex Optimization. Solutions 5 Fall 2009

UC Berkeley Department of Electrical Engineering and Computer Science. EECS 227A Nonlinear and Convex Optimization. Solutions 5 Fall 2009 UC Berkeley Department of Electrical Engineering and Computer Science EECS 227A Nonlinear and Convex Optimization Solutions 5 Fall 2009 Reading: Boyd and Vandenberghe, Chapter 5 Solution 5.1 Note that

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey Scientific Computing: An Introductory Survey Chapter 6 Optimization Prof. Michael T. Heath Department of Computer Science University of Illinois at Urbana-Champaign Copyright c 2002. Reproduction permitted

More information

Lecture 5. Theorems of Alternatives and Self-Dual Embedding

Lecture 5. Theorems of Alternatives and Self-Dual Embedding IE 8534 1 Lecture 5. Theorems of Alternatives and Self-Dual Embedding IE 8534 2 A system of linear equations may not have a solution. It is well known that either Ax = c has a solution, or A T y = 0, c

More information

Interior Point Methods for Mathematical Programming

Interior Point Methods for Mathematical Programming Interior Point Methods for Mathematical Programming Clóvis C. Gonzaga Federal University of Santa Catarina, Florianópolis, Brazil EURO - 2013 Roma Our heroes Cauchy Newton Lagrange Early results Unconstrained

More information

Chapter 2. Optimization. Gradients, convexity, and ALS

Chapter 2. Optimization. Gradients, convexity, and ALS Chapter 2 Optimization Gradients, convexity, and ALS Contents Background Gradient descent Stochastic gradient descent Newton s method Alternating least squares KKT conditions 2 Motivation We can solve

More information

Optimization. A first course on mathematics for economists

Optimization. A first course on mathematics for economists Optimization. A first course on mathematics for economists Xavier Martinez-Giralt Universitat Autònoma de Barcelona xavier.martinez.giralt@uab.eu II.3 Static optimization - Non-Linear programming OPT p.1/45

More information

MATHEMATICS FOR COMPUTER VISION WEEK 8 OPTIMISATION PART 2. Dr Fabio Cuzzolin MSc in Computer Vision Oxford Brookes University Year

MATHEMATICS FOR COMPUTER VISION WEEK 8 OPTIMISATION PART 2. Dr Fabio Cuzzolin MSc in Computer Vision Oxford Brookes University Year MATHEMATICS FOR COMPUTER VISION WEEK 8 OPTIMISATION PART 2 1 Dr Fabio Cuzzolin MSc in Computer Vision Oxford Brookes University Year 2013-14 OUTLINE OF WEEK 8 topics: quadratic optimisation, least squares,

More information

MVE165/MMG631 Linear and integer optimization with applications Lecture 13 Overview of nonlinear programming. Ann-Brith Strömberg

MVE165/MMG631 Linear and integer optimization with applications Lecture 13 Overview of nonlinear programming. Ann-Brith Strömberg MVE165/MMG631 Overview of nonlinear programming Ann-Brith Strömberg 2015 05 21 Areas of applications, examples (Ch. 9.1) Structural optimization Design of aircraft, ships, bridges, etc Decide on the material

More information

IOE 511/Math 652: Continuous Optimization Methods, Section 1

IOE 511/Math 652: Continuous Optimization Methods, Section 1 IOE 511/Math 652: Continuous Optimization Methods, Section 1 Marina A. Epelman Fall 2007 These notes can be freely reproduced for any non-commercial purpose; please acknowledge the author if you do so.

More information

Chapter 2: Preliminaries and elements of convex analysis

Chapter 2: Preliminaries and elements of convex analysis Chapter 2: Preliminaries and elements of convex analysis Edoardo Amaldi DEIB Politecnico di Milano edoardo.amaldi@polimi.it Website: http://home.deib.polimi.it/amaldi/opt-14-15.shtml Academic year 2014-15

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

10 Numerical methods for constrained problems

10 Numerical methods for constrained problems 10 Numerical methods for constrained problems min s.t. f(x) h(x) = 0 (l), g(x) 0 (m), x X The algorithms can be roughly divided the following way: ˆ primal methods: find descent direction keeping inside

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Linear and non-linear programming

Linear and non-linear programming Linear and non-linear programming Benjamin Recht March 11, 2005 The Gameplan Constrained Optimization Convexity Duality Applications/Taxonomy 1 Constrained Optimization minimize f(x) subject to g j (x)

More information

I.3. LMI DUALITY. Didier HENRION EECI Graduate School on Control Supélec - Spring 2010

I.3. LMI DUALITY. Didier HENRION EECI Graduate School on Control Supélec - Spring 2010 I.3. LMI DUALITY Didier HENRION henrion@laas.fr EECI Graduate School on Control Supélec - Spring 2010 Primal and dual For primal problem p = inf x g 0 (x) s.t. g i (x) 0 define Lagrangian L(x, z) = g 0

More information

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method

Shiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

1 Numerical optimization

1 Numerical optimization Contents Numerical optimization 5. Optimization of single-variable functions.............................. 5.. Golden Section Search..................................... 6.. Fibonacci Search........................................

More information

Lecture: Convex Optimization Problems

Lecture: Convex Optimization Problems 1/36 Lecture: Convex Optimization Problems http://bicmr.pku.edu.cn/~wenzw/opt-2015-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Introduction 2/36 optimization

More information

Constrained Optimization Theory

Constrained Optimization Theory Constrained Optimization Theory Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Constrained Optimization Theory IMA, August

More information

Lecture 13: Constrained optimization

Lecture 13: Constrained optimization 2010-12-03 Basic ideas A nonlinearly constrained problem must somehow be converted relaxed into a problem which we can solve (a linear/quadratic or unconstrained problem) We solve a sequence of such problems

More information

Appendix A Taylor Approximations and Definite Matrices

Appendix A Taylor Approximations and Definite Matrices Appendix A Taylor Approximations and Definite Matrices Taylor approximations provide an easy way to approximate a function as a polynomial, using the derivatives of the function. We know, from elementary

More information

A New Trust Region Algorithm Using Radial Basis Function Models

A New Trust Region Algorithm Using Radial Basis Function Models A New Trust Region Algorithm Using Radial Basis Function Models Seppo Pulkkinen University of Turku Department of Mathematics July 14, 2010 Outline 1 Introduction 2 Background Taylor series approximations

More information