Notes on Numerical Optimization


Notes on Numerical Optimization
University of Chicago, 2014
Vivak Patel
October 18, 2014

Contents

Part I Fundamentals of Optimization
1 Overview of Numerical Optimization
  1.1 Problem and Classification
  1.2 Theoretical Properties of Solutions
  1.3 Algorithm Overview
  1.4 Fundamental Definitions and Results
2 Line Search Methods
  2.1 Step Length
  2.2 Step Length Selection Algorithms
  2.3 Global Convergence and Zoutendijk
3 Trust Region Methods
  3.1 Trust Region Subproblem and Fidelity
  3.2 Fidelity Algorithms
  3.3 Approximate Solutions to Subproblem (Cauchy Point, Dogleg Method, Global Convergence of Cauchy Point Methods)
  3.4 Iterative Solutions to Subproblem (Exact Solution to Subproblem, Newton's Root Finding Iterative Solution, Trust Region Subproblem Algorithms)
4 Conjugate Gradients

Part II Model Hessian Selection
5 Newton's Method
6 Newton's Method with Hessian Modification
7 Quasi-Newton Methods
  7.1 Rank-2 Update: DFP & BFGS
  7.2 Rank-1 Update: SR1

Part III Specialized Applications of Optimization
8 Inexact Newton's Method
9 Limited Memory BFGS

10 Least Squares Regression Methods
  10.1 Linear Least Squares Regression
  10.2 Line Search: Gauss-Newton
  10.3 Trust Region: Levenberg-Marquardt

Part IV Constrained Optimization
11 Theory of Constrained Optimization

List of Algorithms

1 Backtracking Algorithm
2 Interpolation Algorithm
3 Trust Region Management Algorithm
4 Solution Acceptance Algorithm
5 Overview of Conjugate Gradient Algorithm
6 Conjugate Gradients Algorithm for Convex Quadratic
7 Fletcher-Reeves CG Algorithm
8 Line Search with Modified Hessian
9 Overview of Inexact CG Algorithm
10 L-BFGS Algorithm

Part I Fundamentals of Optimization

1 Overview of Numerical Optimization

1.1 Problem and Classification

1. Problem:
   $$\arg\min_{z \in \mathbb{R}^n} f(z) \ : \ \begin{cases} c_i(z) = 0 & i \in E \\ c_i(z) \geq 0 & i \in I \end{cases}$$
   (a) $f : \mathbb{R}^n \to \mathbb{R}$ is known as the objective function
   (b) $E$ indexes the equality constraints
   (c) $I$ indexes the inequality constraints

2. Classifications
   (a) Unconstrained vs. Constrained. If $E \cup I = \emptyset$ then it is an unconstrained problem.
   (b) Linear Programming vs. Nonlinear Programming. When $f(z)$ and the $c_i(z)$ are all linear functions of $z$, this is a linear programming problem. Otherwise, it is a nonlinear programming problem.
   (c) Note: This is not the same as having linear or nonlinear equations.

1.2 Theoretical Properties of Solutions

1. Global Minimizer. Weak Local Minimizer (Local Minimizer). Strict Local Minimizer. Isolated Local Minimizer.

   Definition 1.1. Let $f : \mathbb{R}^n \to \mathbb{R}$.
   (a) $x^*$ is a global minimizer of $f$ if $f(x^*) \leq f(x)$ for all $x \in \mathbb{R}^n$.
   (b) $x^*$ is a local minimizer of $f$ if for some neighborhood $N(x^*)$ of $x^*$, $f(x^*) \leq f(x)$ for all $x \in N(x^*)$.
   (c) $x^*$ is a strict local minimizer of $f$ if for some neighborhood $N(x^*)$, $f(x^*) < f(x)$ for all $x \in N(x^*)$, $x \neq x^*$.
   (d) $x^*$ is an isolated local minimizer of $f$ if for some neighborhood $N(x^*)$, there are no other local minimizers of $f$ in $N(x^*)$.

   Note 1.1. Every isolated local minimizer is a strict local minimizer, but it is not true that a strict local minimizer is always an isolated local minimizer.

2. Taylor's Theorem

   Theorem 1.1. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable. Let $p \in \mathbb{R}^n$. Then for some $t \in (0, 1)$:
   $$f(x + p) = f(x) + \nabla f(x + tp)^T p \quad (1)$$

If $f$ is twice continuously differentiable, then:
$$\nabla f(x + p) = \nabla f(x) + \int_0^1 \nabla^2 f(x + tp)\, p \, dt \quad (2)$$
And for some $t \in (0, 1)$:
$$f(x + p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x + tp)\, p \quad (3)$$

3. Necessary Conditions

   (a) First Order Necessary Conditions
   Lemma 1.1. Let $f$ be continuously differentiable. If $x^*$ is a local minimizer of $f$ then $\nabla f(x^*) = 0$.
   Proof. Suppose $\nabla f(x^*) \neq 0$. Let $p = -\nabla f(x^*)$. Then $p^T \nabla f(x^*) < 0$, and by continuity there is some $T > 0$ such that for any $t \in [0, T)$, $p^T \nabla f(x^* + tp) < 0$. We now apply Equation (1) of Taylor's Theorem to get a contradiction. Let $\bar{t} \in (0, T)$ and $q = \bar{t}p$. Then there exists $t \in (0, \bar{t})$ such that:
   $$f(x^* + q) = f(x^*) + \bar{t}\,\nabla f(x^* + t p)^T p < f(x^*)$$

   (b) Second Order Necessary Conditions
   Lemma 1.2. Suppose $f$ is twice continuously differentiable. If $x^*$ is a local minimizer of $f$ then $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succeq 0$ (i.e. positive semi-definite).
   Proof. The first conclusion follows from the first order necessary condition. We prove the second by contradiction and Taylor's Theorem. Suppose $\nabla^2 f(x^*) \not\succeq 0$. Then there is a $p$ such that:
   $$p^T \nabla^2 f(x^*)\, p < 0$$
   By continuity, there is a $T > 0$ such that $p^T \nabla^2 f(x^* + tp)\, p < 0$ for $t \in [0, T]$. Fix $\bar{t} \in (0, T]$ and let $q = \bar{t}p$. Then, by Equation (3) of Taylor's Theorem and since $\nabla f(x^*) = 0$, there exists $t \in (0, \bar{t})$ such that:
   $$f(x^* + q) = f(x^*) + \tfrac{1}{2}\bar{t}^2\, p^T \nabla^2 f(x^* + t\bar{t}p)\, p < f(x^*)$$

4. Second Order Sufficient Condition
   Lemma 1.3. Let $x^* \in \mathbb{R}^n$. Suppose $\nabla^2 f$ is continuous, $\nabla^2 f(x^*) \succ 0$, and $\nabla f(x^*) = 0$. Then $x^*$ is a strict local minimizer.

Proof. By continuity, there is a ball $B$ of radius $T$ about $x^*$ in which $\nabla^2 f(x) \succ 0$. Let $\|p\| < T$. Then, there exists a $t \in (0, 1)$ such that:
$$f(x^* + p) = f(x^*) + \tfrac{1}{2} p^T \nabla^2 f(x^* + tp)\, p > f(x^*)$$

5. Convexity and Local Minimizers
   Lemma 1.4. When $f$ is convex, any local minimizer $x^*$ is a global minimizer. If, in addition, $f$ is differentiable, then any stationary point is a global minimizer.
   Proof. Suppose $x^*$ is not a global minimizer. Let $z$ be the global minimizer, so $f(x^*) > f(z)$. By convexity, for any $\lambda \in (0, 1]$:
   $$f(x^*) > \lambda f(z) + (1 - \lambda) f(x^*) \geq f(\lambda z + (1 - \lambda) x^*)$$
   So as $\lambda \to 0$, we see that in any neighborhood of $x^*$ there is a point $w$ such that $f(w) < f(x^*)$ — a contradiction. For the second part, suppose $\nabla f(x^*) = 0$ and $z$ is as above. Then:
   $$\nabla f(x^*)^T (z - x^*) = \lim_{\lambda \downarrow 0} \frac{f(x^* + \lambda(z - x^*)) - f(x^*)}{\lambda} \leq \lim_{\lambda \downarrow 0} \frac{\lambda f(z) + (1 - \lambda) f(x^*) - f(x^*)}{\lambda} = f(z) - f(x^*) < 0$$
   which contradicts $\nabla f(x^*) = 0$.

1.3 Algorithm Overview

1. In general, algorithms begin with a seed point $x_0$ and locally search for decreases in the objective function, producing iterates $x_k$, until stopping conditions are met.

2. Algorithms typically generate a local model for $f$ near a point $x_k$:
   $$f(x_k + p) \approx m_k(p) = f_k + p^T \nabla f_k + \tfrac{1}{2} p^T B_k p$$

3. Different choices of $B_k$ lead to different methods with different properties:
   (a) $B_k = 0$ leads to steepest descent methods
   (b) Letting $B_k$ be the closest positive definite approximation to $\nabla^2 f_k$ leads to Newton's methods
   (c) Iterative approximations to the Hessian given by $B_k$ lead to quasi-Newton methods
   (d) Conjugate gradient methods update $p_k$ without explicitly computing $B_k$

1.4 Fundamental Definitions and Results

1. Q-convergence
   Definition 1.2. Let $x_k \to x^*$ in $\mathbb{R}^n$.
   (a) $x_k$ converges Q-linearly if there is a $q \in (0, 1)$ such that for sufficiently large $k$:
   $$\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} \leq q$$
   (b) $x_k$ converges Q-superlinearly if:
   $$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0$$
   (c) $x_k$ converges Q-quadratically if there is a $q_2 > 0$ such that for sufficiently large $k$:
   $$\frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|^2} \leq q_2$$

2. R-convergence
   Definition 1.3. Let $x_k \to x^*$ in $\mathbb{R}^n$.
   (a) $x_k$ converges R-linearly if there is a sequence $v_k \subset \mathbb{R}$ converging Q-linearly to 0 such that $\|x_k - x^*\| \leq v_k$.
   (b) $x_k$ converges R-superlinearly if there is a sequence $v_k \subset \mathbb{R}$ converging Q-superlinearly to 0 such that $\|x_k - x^*\| \leq v_k$.
   (c) $x_k$ converges R-quadratically if there is a sequence $v_k \subset \mathbb{R}$ converging Q-quadratically to 0 such that $\|x_k - x^*\| \leq v_k$.

3. Sherman-Morrison-Woodbury
   Lemma 1.5. If $A$ and $\tilde{A}$ are non-singular and related by $\tilde{A} = A + ab^T$, then:
   $$\tilde{A}^{-1} = A^{-1} - \frac{A^{-1}ab^T A^{-1}}{1 + b^T A^{-1}a}$$
   Proof. We can check this by simply plugging in the formula. However, to derive it:
   $$\tilde{A}^{-1} = \left(A + ab^T\right)^{-1} = \left(A\left[I + A^{-1}ab^T\right]\right)^{-1} = \left[I + A^{-1}ab^T\right]^{-1}A^{-1}$$
   $$= \left[I - A^{-1}ab^T + A^{-1}ab^T A^{-1}ab^T - \cdots\right]A^{-1} = \left[I - A^{-1}a\left(1 - b^T A^{-1}a + \cdots\right)b^T\right]A^{-1} = A^{-1} - \frac{A^{-1}ab^T A^{-1}}{1 + b^T A^{-1}a}$$
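Lemma 1.5 is easy to sanity-check numerically. A minimal sketch (the matrix `A` and vectors `a`, `b` below are arbitrary illustrative data, and `smw_inverse` is a hypothetical helper name):

```python
import numpy as np

def smw_inverse(A_inv, a, b):
    """Inverse of A + a b^T via the Sherman-Morrison formula (Lemma 1.5)."""
    denom = 1.0 + b @ A_inv @ a                      # scalar 1 + b^T A^{-1} a
    return A_inv - np.outer(A_inv @ a, b @ A_inv) / denom

rng = np.random.default_rng(0)
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3))    # well-conditioned base matrix
a, b = rng.standard_normal(3), rng.standard_normal(3)

lhs = smw_inverse(np.linalg.inv(A), a, b)            # rank-1 update of the inverse
rhs = np.linalg.inv(A + np.outer(a, b))              # direct inversion for comparison
assert np.allclose(lhs, rhs)
```

The point of the formula in optimization is that updating an already-known inverse costs $O(n^2)$ instead of the $O(n^3)$ of a fresh inversion, which is what makes quasi-Newton inverse updates cheap.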

2 Line Search Methods

2.1 Step Length

1. Descent Direction
   Definition 2.1. $p \in \mathbb{R}^n$ is a descent direction at $x_k$ if $p^T \nabla f_k < 0$.

2. Problem: Find $\alpha_k \in \arg\min_\alpha \phi(\alpha)$ where $\phi(\alpha) = f(x_k + \alpha p_k)$. However, it is too costly to find the exact minimizer of $\phi$. Instead, we find an $\alpha$ which acceptably reduces $\phi$.

3. Wolfe Conditions
   (a) Armijo Condition. Curvature Condition. Wolfe Conditions. Strong Wolfe Conditions.
   Definition 2.2. Fix $x$ and a descent direction $p$. Let $\phi(\alpha) = f(x + \alpha p)$, so $\phi'(\alpha) = \nabla f(x + \alpha p)^T p$. Fix $0 < c_1 < c_2 < 1$.
   i. $\alpha$ satisfies the Armijo Condition if $\phi(\alpha) \leq \phi(0) + \alpha c_1 \phi'(0)$. Note $l(\alpha) := \phi(0) + \alpha c_1 \phi'(0)$.
   ii. $\alpha$ satisfies the Curvature Condition if $\phi'(\alpha) \geq c_2 \phi'(0)$.
   iii. $\alpha$ satisfies the Strong Curvature Condition if $|\phi'(\alpha)| \leq c_2 |\phi'(0)|$.
   iv. The Wolfe Conditions are the Armijo Condition together with the Curvature Condition.
   v. The Strong Wolfe Conditions are the Armijo Condition together with the Strong Curvature Condition.
   (b) The Armijo Condition guarantees a decrease at the next iterate by ensuring $\phi(\alpha) < \phi(0)$.
   (c) Curvature Condition
   i. If $\phi'(\alpha) < c_2 \phi'(0)$, then $\phi$ is still decreasing steeply at $\alpha$, and so we can improve the reduction in $f$ by taking a larger $\alpha$.
   ii. If $\phi'(\alpha) \geq c_2 \phi'(0)$, then either we are closer to a minimum where $\phi' = 0$, or $\phi' > 0$, which means we have passed the minimum.
   iii. The strong condition guarantees that the choice of $\alpha$ is closer to a point where $\phi' = 0$.
   (d) Existence of Steps satisfying the Wolfe and Strong Wolfe Conditions
   Lemma 2.1. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable. Let $p$ be a descent direction at $x$, and assume that $f$ is bounded from below along the ray $\{x + \alpha p : \alpha > 0\}$. If $0 < c_1 < c_2 < 1$, there exist intervals of step lengths satisfying the Wolfe and the Strong Wolfe Conditions.
   Proof.
   i. Satisfying the Armijo Condition: Let $l(\alpha) = f(x) + c_1 \alpha p^T \nabla f(x)$. Since $l(\alpha)$ is decreasing without bound and $\phi(\alpha)$ is bounded from below, eventually $\phi(\alpha) = l(\alpha)$. Let $\alpha_A > 0$ be the first time this intersection occurs. Then $\phi(\alpha) < l(\alpha)$ for $\alpha \in (0, \alpha_A)$.

   ii. Curvature Condition: By the mean value theorem, there exists $\beta \in (0, \alpha_A)$ such that (since $l(\alpha_A) = \phi(\alpha_A)$):
   $$\alpha_A \phi'(\beta) = \phi(\alpha_A) - \phi(0) = c_1 \alpha_A \phi'(0) > c_2 \alpha_A \phi'(0)$$
   And since $\phi'$ is continuous, there is an interval containing $\beta$ on which this inequality holds.
   iii. Strong Curvature Condition: Since $\phi'(\beta) = c_1 \phi'(0) < 0$, it follows that:
   $$|\phi'(\beta)| < c_2 |\phi'(0)|$$

4. Goldstein Conditions
   (a) Goldstein Conditions
   Definition 2.3. Take $c \in (0, 1/2)$. The Goldstein Conditions are satisfied by $\alpha$ if:
   $$\phi(0) + (1 - c)\alpha \phi'(0) \leq \phi(\alpha) \leq \phi(0) + c\alpha \phi'(0)$$
   (b) Existence of a Step satisfying the Goldstein Conditions
   Lemma 2.2. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable. Let $p$ be a descent direction at $x$, and assume that $f$ is bounded from below along the ray $\{x + \alpha p : \alpha > 0\}$. Let $c \in (0, 1/2)$. Then there exists an interval satisfying the Goldstein Conditions.
   Proof. Let $l(\alpha) = \phi(0) + (1 - c)\alpha\phi'(0)$ and $u(\alpha) = \phi(0) + c\alpha\phi'(0)$. For $\alpha > 0$, $l(\alpha) < u(\alpha)$ since $c \in (0, 1/2)$. Since $\phi$ is bounded from below, let $\alpha_u$ be the smallest point of intersection between $u(\alpha)$ and $\phi(\alpha)$ for $\alpha > 0$, and let $\alpha_l$ be the largest point of intersection, less than $\alpha_u$, of $l(\alpha)$ and $\phi(\alpha)$. Then, for $\alpha \in (\alpha_l, \alpha_u)$, $l(\alpha) < \phi(\alpha) < u(\alpha)$.

5. Backtracking
   (a) Backtracking
   Definition 2.4. Let $\rho \in (0, 1)$ and $\bar{\alpha} > 0$. Backtracking checks the sequence $\bar{\alpha}, \rho\bar{\alpha}, \rho^2\bar{\alpha}, \ldots$ until an $\alpha$ is found satisfying the Armijo condition.
   (b) Existence of a Step satisfying Backtracking
   Lemma 2.3. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable. Let $p$ be a descent direction at $x$. Then there is a step produced by the backtracking algorithm which satisfies the Armijo condition.
   Proof. By continuity, there is an $\epsilon > 0$ such that for $0 < \alpha < \epsilon$, $\phi(\alpha) < \phi(0) + c\alpha\phi'(0)$. Since $\rho^n\bar{\alpha} \to 0$, there is an $n$ such that $\rho^n\bar{\alpha} < \epsilon$.
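The step-acceptance conditions above are plain inequalities on $\phi$ and $\phi'$, so they are easy to test in code. A minimal sketch of checkers for the Wolfe and Goldstein Conditions (the quadratic test function and the function names `wolfe`/`goldstein` are illustrative assumptions, not part of the notes):

```python
def wolfe(phi, dphi, alpha, c1=1e-4, c2=0.9, strong=False):
    """Armijo + (strong) curvature condition of Definition 2.2."""
    armijo = phi(alpha) <= phi(0) + c1 * alpha * dphi(0)
    if strong:
        curvature = abs(dphi(alpha)) <= c2 * abs(dphi(0))
    else:
        curvature = dphi(alpha) >= c2 * dphi(0)
    return armijo and curvature

def goldstein(phi, dphi, alpha, c=0.25):
    """Goldstein sandwich of Definition 2.3."""
    lo = phi(0) + (1 - c) * alpha * dphi(0)
    hi = phi(0) + c * alpha * dphi(0)
    return lo <= phi(alpha) <= hi

# phi(a) = f(x + a p) for f(x) = x^2 with x = 1, p = -1 (a descent direction)
phi  = lambda a: (1 - a) ** 2
dphi = lambda a: -2 * (1 - a)

assert wolfe(phi, dphi, 1.0)           # the exact minimizer satisfies Wolfe
assert goldstein(phi, dphi, 0.5)       # moderate step passes the sandwich
assert not goldstein(phi, dphi, 1.9)   # too-long step fails the upper bound
```

Note how the Goldstein lower bound rejects steps the Armijo condition alone would accept; this is its substitute for the curvature condition.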

2.2 Step Length Selection Algorithms

1. Algorithms which implement the Wolfe or Goldstein Conditions exist and are ideal, but are not discussed herein.

2. Backtracking Algorithm

   Algorithm 1: Backtracking Algorithm
   input: reduction rate $\rho \in (0, 1)$, initial estimate $\bar{\alpha} > 0$, $c \in (0, 1)$, $\phi(\alpha)$
   $\alpha \leftarrow \bar{\alpha}$
   while $\phi(\alpha) > \phi(0) + \alpha c \phi'(0)$ do
       $\alpha \leftarrow \rho\alpha$
   end
   return $\alpha$

3. Interpolation with Expensive First Derivatives
   (a) Suppose the interpolation technique produces a sequence of guesses $\alpha_0, \alpha_1, \ldots$
   (b) Interpolation produces $\alpha_k$ by modeling $\phi$ with a polynomial $m$ (quadratic or cubic) and then minimizing $m$ to find a new estimate for $\alpha$. The parameters of the polynomial are given by requiring:
   i. $m(0) = \phi(0)$
   ii. $m'(0) = \phi'(0)$
   iii. $m(\alpha_{k-1}) = \phi(\alpha_{k-1})$
   iv. $m(\alpha_{k-2}) = \phi(\alpha_{k-2})$
   (c) When $k = 1$, we model using a quadratic polynomial (condition iv is dropped).

4. Interpolation with Inexpensive First Derivatives
   (a) Suppose interpolation produces a sequence of guesses $\alpha_0, \alpha_1, \ldots$
   (b) In this case, $m$ is always a cubic polynomial, and $\alpha_k$ is computed by requiring:
   i. $m(\alpha_{k-1}) = \phi(\alpha_{k-1})$
   ii. $m(\alpha_{k-2}) = \phi(\alpha_{k-2})$
   iii. $m'(\alpha_{k-1}) = \phi'(\alpha_{k-1})$
   iv. $m'(\alpha_{k-2}) = \phi'(\alpha_{k-2})$

5. Interpolation Algorithm

   Algorithm 2: Interpolation Algorithm
   input: feasible search region $[\bar{a}, \bar{b}]$, initial estimate $\alpha_0 > 0$, $\phi$
   $\alpha \leftarrow \alpha_0$, $\beta \leftarrow \alpha_0$
   while $\phi(\beta) > \phi(0) + c\beta\phi'(0)$ do
       Approximate $m$
       if inexpensive derivatives then
           $\alpha \leftarrow \beta$
       end
       Explicitly compute the minimizer of $m$ in the feasible region; store as $\beta$
   end
   return $\beta$
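Algorithm 1 translates almost line-for-line into code. A minimal sketch (the quadratic objective and the helper name `backtracking` are illustrative assumptions):

```python
import numpy as np

def backtracking(f, grad, x, p, alpha=1.0, rho=0.5, c=1e-4):
    """Algorithm 1: shrink alpha geometrically until the Armijo condition holds."""
    fx, slope = f(x), grad(x) @ p            # phi(0) and phi'(0)
    assert slope < 0, "p must be a descent direction"
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho
    return alpha

f    = lambda x: 0.5 * x @ x                 # f(x) = ||x||^2 / 2
grad = lambda x: x

x = np.array([2.0, -1.0])
p = -grad(x)                                 # steepest descent direction
a = backtracking(f, grad, x, p)
assert f(x + a * p) < f(x)                   # Armijo guarantees a decrease
```

Termination is exactly Lemma 2.3: the loop must exit because $\rho^n \bar{\alpha} \to 0$ enters the interval where the Armijo inequality holds.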

2.3 Global Convergence and Zoutendijk

1. Globally Convergent. Zoutendijk Condition.
   Definition 2.5. Suppose we have an algorithm $\Omega$ which produces iterates $(x_k)$, and denote $\nabla f(x_k) = \nabla f_k$.
   (a) $\Omega$ is globally convergent if $\|\nabla f_k\| \to 0$. (That is, the $x_k$ converge to a stationary point.)
   (b) Suppose $\Omega$ produces search directions $p_k$ such that $\|p_k\| = 1$, and let $\theta_k$ be the angle between $-\nabla f_k$ and $p_k$. $\Omega$ satisfies the Zoutendijk Condition if
   $$\sum_{k=1}^{\infty} \cos^2(\theta_k)\, \|\nabla f_k\|^2 < \infty$$

2. Zoutendijk Condition & Angle Bound imply global convergence
   Lemma 2.4. Suppose $\Omega$ produces a sequence $(x_k, p_k, \nabla f_k, \theta_k)$ such that there is a $\delta > 0$ with $\cos(\theta_k) \geq \delta$ for all $k \geq 1$. If $\Omega$ satisfies the Zoutendijk condition, then $\Omega$ is globally convergent.
   Proof. By the Zoutendijk condition:
   $$\delta^2 \sum_{k=1}^{\infty} \|\nabla f_k\|^2 \leq \sum_{k=1}^{\infty} \cos^2(\theta_k)\|\nabla f_k\|^2 < \infty$$
   Therefore, $\|\nabla f_k\|^2 \to 0$.

3. Example of an Angle Bound
   Example 2.1. Suppose $p_k = -B_k^{-1}\nabla f_k$, where the $B_k$ are positive definite and $\|B_k\|\,\|B_k^{-1}\| < M < \infty$ for all $k$. Working in the 2-norm, let $B = X\Lambda X^T$ be the EVD of $B_k$ (dropping subscripts, with eigenvalues $\lambda_1 \leq \cdots \leq \lambda_n$) and $z = X^T \nabla f$:
   $$\cos(\theta_k) = \frac{\nabla f^T B^{-1}\nabla f}{\|\nabla f\|\,\|B^{-1}\nabla f\|} = \frac{z^T\Lambda^{-1}z}{\|z\|\,\|\Lambda^{-1}z\|} = \frac{\sum_{i=1}^n z_i^2/\lambda_i}{\sqrt{\sum_{i=1}^n z_i^2}\,\sqrt{\sum_{i=1}^n z_i^2/\lambda_i^2}} \geq \frac{\|z\|^2/\lambda_n}{\|z\|^2/\lambda_1} = \frac{\lambda_1}{\lambda_n} \geq \frac{1}{M}$$
   Hence, if $\Omega$ satisfies the Zoutendijk Condition, we see that $\lim_k \|\nabla f_k\| = 0$.
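Example 2.1's bound $\cos(\theta_k) \geq 1/M$ can be verified numerically. A sketch (the SPD matrix `B` and gradient vector `g` are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.linalg.qr(rng.standard_normal((4, 4)))[0]     # random orthogonal matrix
B = Q @ np.diag([1.0, 2.0, 5.0, 10.0]) @ Q.T         # SPD, condition number 10
g = rng.standard_normal(4)                            # stands in for grad f_k

p = -np.linalg.solve(B, g)                            # p_k = -B_k^{-1} grad f_k
cos_theta = -(g @ p) / (np.linalg.norm(g) * np.linalg.norm(p))
M = np.linalg.cond(B)                                 # ||B|| * ||B^{-1}|| in 2-norm
assert cos_theta >= 1.0 / M - 1e-12                   # the angle bound of Example 2.1
```

The bound degrades as the condition number grows, which is why ill-conditioned model Hessians slow steepest-descent-like behavior even when convergence is guaranteed.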

4. Wolfe Condition Line Search satisfies the Zoutendijk Condition
   Theorem 2.1. Suppose we have an objective $f$ satisfying:
   (a) $f$ is bounded from below in $\mathbb{R}^n$
   (b) given an initial $x_0$, there is an open set $N$ containing $L = \{x : f(x) \leq f(x_0)\}$
   (c) $f$ is continuously differentiable on $N$
   (d) $\nabla f$ is Lipschitz continuous on $N$
   Suppose we have an algorithm $\Omega$ producing $(x_k, p_k, \nabla f_k, \theta_k, \alpha_k)$ such that:
   (a) $p_k$ is a descent direction (with $\|p_k\| = 1$)
   (b) $\alpha_k$ satisfies the Wolfe Conditions
   Then $\Omega$ satisfies the Zoutendijk Condition.
   Proof. The general strategy is to lower bound $\alpha_k$ uniformly by $-C\nabla f_k^T p_k$, where $C > 0$. Then, using the Armijo condition:
   $$f_{k+1} \leq f_k + c_1\alpha_k\nabla f_k^T p_k \leq f_k - c_1 C\left(\nabla f_k^T p_k\right)^2 = f_k - c_1 C\cos^2(\theta_k)\|\nabla f_k\|^2$$
   so that
   $$f_{k+1} \leq f_0 - c_1 C\sum_{j=1}^{k}\cos^2(\theta_j)\|\nabla f_j\|^2$$
   And we can conclude that, since $f$ is bounded from below by some $l$:
   $$c_1 C\sum_{j=1}^{\infty}\cos^2(\theta_j)\|\nabla f_j\|^2 \leq f_0 - l < \infty$$
   The result follows. So now we set out to show that $\alpha_k \geq -C\nabla f_k^T p_k$, leveraging the Curvature Condition.
   (a) From the curvature condition, we have that:
   $$(\nabla f_{k+1} - \nabla f_k)^T p_k \geq (c_2 - 1)\nabla f_k^T p_k$$
   (b) From Lipschitz continuity, with constant $L$:
   $$(\nabla f_{k+1} - \nabla f_k)^T p_k \leq \|\nabla f_{k+1} - \nabla f_k\|\,\|p_k\| \leq L\alpha_k\|p_k\|^2 = L\alpha_k$$
   Together these imply $\alpha_k \geq \frac{1 - c_2}{L}\left(-\nabla f_k^T p_k\right)$.

5. Goldstein Condition Line Search satisfies the Zoutendijk Condition
   Theorem 2.2. Suppose $f$ has the same properties as it does in Theorem 2.1, and suppose $\Omega$ produces $(x_k, p_k, \nabla f_k, \theta_k, \alpha_k)$ such that:
   (a) $p_k$ is a descent direction with $\|p_k\| = 1$
   (b) $\alpha_k$ satisfies the Goldstein Conditions
   Then $\Omega$ satisfies the Zoutendijk Condition.

Proof. We use Taylor's Theorem and then Lipschitz continuity to find the lower bound on $\alpha_k$. By Taylor's Theorem, for some $t_k \in (0, 1)$:
$$(1 - c)\alpha_k\nabla f_k^T p_k \leq f_{k+1} - f_k = \nabla f(x_k + t_k\alpha_k p_k)^T \alpha_k p_k$$
Therefore,
$$-c\,\alpha_k\nabla f_k^T p_k \leq \alpha_k\left(\nabla f(x_k + t_k\alpha_k p_k) - \nabla f_k\right)^T p_k \leq L t_k\alpha_k^2\|p_k\|^2 \leq L\alpha_k^2$$
The result follows.

6. Backtracking Line Search satisfies the Zoutendijk Condition
   Theorem 2.3. Suppose $f$ has the same properties as it does in Theorem 2.1, and suppose $\Omega$ produces $(x_k, p_k, \nabla f_k, \theta_k, \alpha_k)$ such that:
   (a) $p_k$ is a descent direction with $\|p_k\| = 1$
   (b) $\alpha_k$ is selected by backtracking with $\bar{\alpha} = 1$
   Then $\Omega$ satisfies the Zoutendijk condition.
   Proof. Suppose $\Omega$ uses $\rho \in (0, 1)$. There are two cases to consider: $\alpha_k = 1$ and $\alpha_k < 1$. We denote the first subsequence by $(j)$ and the second by $(l)$. For $\alpha_{(l)}$ we know at least that $\alpha_{(l)}/\rho$ was rejected, that is:
   $$f(x_{(l)} + (\alpha_{(l)}/\rho)p_{(l)}) > f_{(l)} + c(\alpha_{(l)}/\rho)\nabla f_{(l)}^T p_{(l)}$$
   Using the same strategy as in the Goldstein case, there is a $t$ such that:
   $$\nabla f(x_{(l)} + t(\alpha_{(l)}/\rho)p_{(l)})^T(\alpha_{(l)}/\rho)p_{(l)} > c(\alpha_{(l)}/\rho)\nabla f_{(l)}^T p_{(l)}$$
   Therefore, by Lipschitz continuity:
   $$-(1 - c)\nabla f_{(l)}^T p_{(l)} \leq Lt\alpha_{(l)}/\rho \leq L\alpha_{(l)}/\rho$$
   so $\alpha_{(l)}$ is bounded below by a multiple of $-\nabla f_{(l)}^T p_{(l)}$, and the argument of Theorem 2.1 applies along this subsequence. We now consider $\alpha_{(j)} = 1$, since the Armijo condition gives:
   $$f_{(j)+1} \leq f_{(j)} + c\nabla f_{(j)}^T p_{(j)} = f_{(j)} - c\cos(\theta_{(j)})\|\nabla f_{(j)}\|$$
   Therefore, summing along this subsequence and using that $f$ is bounded from below:
   $$c\sum_j \cos(\theta_{(j)})\|\nabla f_{(j)}\| < \infty$$
   Then there is a $J$ such that for $j \geq J$:
   $$\cos^2(\theta_{(j)})\|\nabla f_{(j)}\|^2 \leq \cos(\theta_{(j)})\|\nabla f_{(j)}\| < 1$$
   Therefore, the Zoutendijk condition holds.

3 Trust Region Methods

3.1 Trust Region Subproblem and Fidelity

1. The strategy of the trust region method is as follows: at any point $x$ we have a model $m_x(p)$ of $f(x + p)$ which we trust to be a good approximation in a region $\|p\| \leq \Delta$. Minimizing this, we have a new iterate at which to continue applying the approach.

2. Trust Region Subproblem
   Definition 3.1. Let $x \in \mathbb{R}^n$ and let $f : \mathbb{R}^n \to \mathbb{R}$ be an objective function with model $m_x(p)$ in a region $\|p\| \leq \Delta$. The Trust Region Subproblem is to find $\arg\min\{m_x(p) : \|p\| \leq \Delta\}$.
   Note 3.1. Usually $m_x$ is taken to be $m_x(p) = f(x) + \nabla f_x^T p + \frac{1}{2}p^T B p$, where $B$ is the model Hessian and is not necessarily positive definite.

3. Fidelity
   Definition 3.2. The fidelity of $m_x$ in a region $\|p\| \leq \Delta$ at a point $x + p$ is
   $$\rho(m_x, f, p, \Delta) = \frac{f(x + p) - f(x)}{m_x(p) - m_x(0)}$$

4. Fidelity and Trust Region. Let $0 < c_1 < c_2 < 1$.
   (a) If $\rho \geq c_2$, especially if $p$ is intended for descent, the model is a good approximation to $f$ and we can expand the trust region.
   (b) If $\rho \leq c_1$, especially if $p$ is intended for descent, the model is a poor approximation to $f$ and we should shrink the trust region.
   (c) Otherwise, the model is a sufficient approximation to $f$ and the trust region should not be adjusted.

5. Fidelity and Iterates. Suppose $p$ causes a descent in $m_x$, and let $\eta \in (0, 1/4]$.
   (a) If $\rho < \eta$ then we do not have a sufficient decrease in the function, and the process should be repeated from $x$.
   (b) If $\rho \geq \eta$ then we have a sufficient decrease in $f$, and we should continue the process from $x + p$.

6. Notation
   (a) Let $x_0, x_1, \ldots$ be the iterates produced by successive solutions to the trust region subproblem for some objective $f$.
   (b) $m_{x_k}(p) =: m_k(p)$
   (c) $\Delta$ for a particular $m_k$ is denoted $\Delta_k$
   (d) Accepted solutions to the subproblem at $x_k$ are $p_k$, so that $x_{k+1} = x_k + p_k$.

3.2 Fidelity Algorithms

1. Fidelity and Trust Region

   Algorithm 3: Trust Region Management Algorithm
   input: thresholds $0 < c_2 < c_1 < 1$, trust region $\Delta$, maximum radius $\Delta_{\max}$, fidelity $\rho_x$
   if $\rho_x > c_1$ then
       $\Delta \leftarrow \min(2\Delta, \Delta_{\max})$
   else if $\rho_x < c_2$ then
       $\Delta \leftarrow \Delta/2$
   else
       leave $\Delta$ unchanged
   end
   return $\Delta$

   Note 3.2. Usually $c_1 = 3/4$ and $c_2 = 1/4$ (consistent with the ordering $c_2 < c_1$ above).

2. Fidelity and Solution Acceptance

   Algorithm 4: Solution Acceptance Algorithm
   input: threshold $\eta \in (0, 1/4]$, $\rho$, $x$, $p$
   if $\rho_x(p) \geq \eta$ then
       $x \leftarrow x + p$
   else
       keep $x$
   end
   return $x$

3.3 Approximate Solutions to the Subproblem

3.3.1 Cauchy Point

1. Cauchy Point
   Definition 3.3. Let $m_x$ be a model of $f$ within $\|p\| \leq \Delta$. Let $p^S$ be the direction of steepest descent of $m_x$ at $p = 0$ with $\|p^S\| = 1$, and let
   $$\tau = \arg\min\{m_x(\tau p^S) : 0 \leq \tau \leq \Delta\}$$
   Then $p^C = \tau p^S$ is the Cauchy Point of $m_x$.
   Note 3.3. The Cauchy point is the line search minimizer of $m_x$ along the direction of steepest descent, subject to $\|p\| \leq \Delta$.

2. Computing the Cauchy Point
   Lemma 3.1. Let $m_x$ be a quadratic model of $f$ within $\|p\| \leq \Delta$,
   $$m_x(p) = f(x) + g^T p + \tfrac{1}{2}p^T B p$$
   Then its Cauchy Point is:
   $$p^C = \begin{cases} -\min\left(\Delta, \dfrac{\|g\|^3}{g^T B g}\right)\dfrac{g}{\|g\|} & g^T B g > 0 \\[2mm] -\Delta\,\dfrac{g}{\|g\|} & g^T B g \leq 0 \end{cases}$$

Proof. First, we compute the direction of steepest descent: $p^S = -g/\|g\|$. Then we have that
$$m_x(tp^S) = f(x) - t\|g\| + \frac{t^2}{2\|g\|^2}g^T B g$$
If the quadratic term is non-positive, then $t = \Delta$ is the minimizer, and so one option is:
$$p^C = -\Delta\frac{g}{\|g\|}$$
If the quadratic term is positive, then:
$$\frac{d}{dt}m_x(tp^S) = -\|g\| + t\,\frac{g^T B g}{\|g\|^2}$$
Setting this to 0:
$$t = \frac{\|g\|^3}{g^T B g}$$
This value could potentially be larger than $\Delta$, so to safeguard against this, in this case:
$$\tau = \min\left(\Delta, \frac{\|g\|^3}{g^T B g}\right)$$

3. Using the Cauchy point alone does not produce the best rate of convergence (it is essentially steepest descent restricted to the trust region).

3.3.2 Dogleg Method

1. Requirement: $B \succ 0$.

2. Intuitively, the dogleg method moves toward the minimizer of $m_x$, which is $p^{\min} = -B^{-1}g$, until it reaches this point or hits the boundary of the trust region. It does this by first moving to the Cauchy point, and then along a linear path toward $p^{\min}$.

3. Dogleg Path
   Definition 3.4. Let $m_x$ be the quadratic model with $B \succ 0$ of $f$ with radius $\Delta$. Let $p^C$ be the Cauchy Point and $p^{\min}$ the unconstrained minimizer of $m_x$. The Dogleg Path is $p^{DL}(t)$ where:
   $$p^{DL}(t) = \begin{cases} t\,p^C & t \in [0, 1] \\ p^C + (t - 1)(p^{\min} - p^C) & t \in [1, 2] \end{cases}$$

4. Dogleg Method
   Definition 3.5. The Dogleg Method chooses $x_{k+1} = x_k + p^{DL}(\tau)$, where $\tau$ solves $\|p^{DL}(\tau)\| = \Delta$ (taking $\tau = 2$ if the whole path lies inside the region).
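Lemma 3.1's closed form is short to implement. A minimal sketch (the model data `g`, `B`, and `delta` are illustrative assumptions):

```python
import numpy as np

def cauchy_point(g, B, delta):
    """Cauchy point of m(p) = f + g^T p + p^T B p / 2 over ||p|| <= delta (Lemma 3.1)."""
    gBg = g @ B @ g
    gnorm = np.linalg.norm(g)
    if gBg <= 0:
        tau = 1.0                                     # model unbounded along -g: go to boundary
    else:
        tau = min(1.0, gnorm ** 3 / (delta * gBg))    # interior minimizer, capped at boundary
    return -tau * (delta / gnorm) * g

g = np.array([1.0, 2.0])
B = np.diag([1.0, 3.0])
p = cauchy_point(g, B, delta=0.1)

m = lambda p: g @ p + 0.5 * p @ B @ p                 # model decrease relative to m(0) = 0
assert np.linalg.norm(p) <= 0.1 + 1e-12               # feasible
assert m(p) < 0                                       # strict model decrease
```

Here `tau` is expressed as a fraction of the boundary step `-delta * g / ||g||`, which is an equivalent rewriting of the two cases in the lemma.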

5. Properties of the Dogleg Path
   Lemma 3.2. Suppose $p^{DL}(t)$ is the Dogleg path of $m_x$ with $\Delta = \infty$ and $B \succ 0$. Then:
   (a) $m_x(p^{DL}(t))$ is decreasing in $t$
   (b) $\|p^{DL}(t)\|$ is increasing in $t$
   Proof. When $0 < t < 1$ the case is uninteresting, since for these values we know $m_x(p^{DL}(t))$ is decreasing and $\|p^{DL}(t)\|$ is increasing. So we consider $1 < t < 2$ and reparameterize using $z = t - 1$. We have that:
   $$\frac{d}{dz}m_x(p^{DL}(1 + z)) = (p^{\min} - p^C)^T(g + Bp^C) + z(p^{\min} - p^C)^T B(p^{\min} - p^C) = -(1 - z)(p^{\min} - p^C)^T B(p^{\min} - p^C)$$
   Letting $c = (p^{\min} - p^C)^T B(p^{\min} - p^C) > 0$, since $B \succ 0$, we have that the derivative is strictly negative for $z < 1$. To show that the norm is increasing, first we show that the angle between $p^C$ and $p^{\min}$ is in $(-\pi/2, \pi/2)$, then we show that $p^{\min}$ is the longer of the two.
   (a) For the angle, with $g \neq 0$:
   $$\langle p^{\min}, p^C\rangle = \frac{g^T B^{-1}g\,\|g\|^2}{g^T B g} > 0$$
   (b) Using Cauchy-Schwarz twice, $\left(g^T g\right)^2 = \left(g^T B^{1/2}B^{-1/2}g\right)^2 \leq \left(g^T B g\right)\left(g^T B^{-1}g\right)$ and $\|B^{-1}g\|\,\|g\| \geq g^T B^{-1}g$, so:
   $$\|p^{\min}\| = \|B^{-1}g\| \geq \frac{g^T B^{-1}g}{\|g\|} \geq \frac{\|g\|^3}{g^T B g} = \|p^C\|$$

3.3.3 Global Convergence of Cauchy Point Methods

1. Reduction obtained by the Cauchy Point
   Lemma 3.3. The Cauchy point $p_k^C$ satisfies:
   $$m_k(p_k^C) - m_k(0) \leq -\frac{1}{2}\|g_k\|\min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right)$$
   Proof. We have that:
   $$m_k(p_k^C) - m_k(0) = g_k^T p_k^C + \tfrac{1}{2}(p_k^C)^T B_k\, p_k^C$$

This brings in cases. When $g^T B g \leq 0$ (dropping subscripts), $\tau = \Delta$ and:
$$m(p^C) - m(0) = -\Delta\|g\| + \frac{\Delta^2}{2\|g\|^2}g^T B g \leq -\Delta\|g\| \leq -\frac{1}{2}\|g\|\min\left(\Delta, \frac{\|g\|}{\|B\|}\right)$$
When $g^T B g > 0$ and $\Delta\, g^T B g \leq \|g\|^3$, we are again at the boundary ($\tau = \Delta$):
$$m(p^C) - m(0) = -\Delta\|g\| + \frac{\Delta^2}{2\|g\|^2}g^T B g \leq -\Delta\|g\| + \frac{\Delta}{2}\|g\| = -\frac{\Delta}{2}\|g\| \leq -\frac{1}{2}\|g\|\min\left(\Delta, \frac{\|g\|}{\|B\|}\right)$$
When $\Delta\, g^T B g > \|g\|^3$ (the interior case):
$$m(p^C) - m(0) = -\frac{\|g\|^4}{g^T B g} + \frac{\|g\|^4}{2\,(g^T B g)^2}\,g^T B g = -\frac{\|g\|^4}{2\,g^T B g} \leq -\frac{1}{2}\frac{\|g\|^2}{\|B\|} \leq -\frac{1}{2}\|g\|\min\left(\Delta, \frac{\|g\|}{\|B\|}\right)$$

2. Reduction achieved by the Dogleg Method
   Corollary 3.1. The dogleg step $p_k^{DL}$ satisfies:
   $$m_k(p_k^{DL}) - m_k(0) \leq -\frac{1}{2}\|g_k\|\min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right)$$
   Proof. Since $m_k(p_k^{DL}) \leq m_k(p_k^C)$, the result follows.

3. Reduction achieved by any method based on the Cauchy Point
   Corollary 3.2. Suppose a method uses steps $p_k$ where $m_k(p_k) - m_k(0) \leq c\left(m_k(p_k^C) - m_k(0)\right)$ for some $c \in (0, 1)$. Then $p_k$ satisfies:
   $$m_k(p_k) - m_k(0) \leq -\frac{c}{2}\|g_k\|\min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right)$$
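The dogleg step of Definitions 3.4-3.5 (requiring $B \succ 0$) can be sketched as follows: take the full step if it fits, stop on the first leg if the Cauchy point is already outside, and otherwise solve a scalar quadratic for the boundary crossing on the second leg. The data below are illustrative assumptions:

```python
import numpy as np

def dogleg(g, B, delta):
    """Dogleg step for m(p) = g^T p + p^T B p / 2 with B > 0 and ||p|| <= delta."""
    p_min = -np.linalg.solve(B, g)                # unconstrained minimizer
    if np.linalg.norm(p_min) <= delta:
        return p_min                              # whole path inside the region
    gBg = g @ B @ g
    p_c = -(g @ g / gBg) * g                      # unconstrained Cauchy point (B > 0)
    if np.linalg.norm(p_c) >= delta:
        return -(delta / np.linalg.norm(g)) * g   # boundary hit on the first leg
    d = p_min - p_c                               # second leg: ||p_c + z d|| = delta, z in [0, 1]
    a, b, c = d @ d, 2 * p_c @ d, p_c @ p_c - delta ** 2
    z = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return p_c + z * d

g = np.array([1.0, 2.0])
B = np.diag([1.0, 3.0])
assert np.allclose(dogleg(g, B, 10.0), -np.linalg.solve(B, g))  # interior case
assert abs(np.linalg.norm(dogleg(g, B, 0.5)) - 0.5) < 1e-12     # first-leg boundary
assert abs(np.linalg.norm(dogleg(g, B, 1.0)) - 1.0) < 1e-12     # second-leg boundary
```

Lemma 3.2 is what justifies the boundary-crossing logic: since $\|p^{DL}(t)\|$ is increasing along the path, the equation $\|p_c + z\,d\| = \Delta$ has exactly one root $z \in [0, 1]$ when the first leg lies inside the region.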

4. Convergence when $\eta = 0$
   Theorem 3.1. Let $x_0 \in \mathbb{R}^n$ and let $f$ be an objective function. Let $S = \{x : f(x) \leq f(x_0)\}$ and let $S(R_0) = \{x : \|x - y\| < R_0,\ y \in S\}$ be a neighborhood of radius $R_0$ about $S$. Suppose the following conditions:
   (a) $f$ is bounded from below on $S$
   (b) $f$ is Lipschitz continuously differentiable on $S(R_0)$ for some $R_0 > 0$ and some constant $L$
   (c) there is a $c$ such that for all $k$:
   $$m_k(p_k) - m_k(0) \leq -c\|g_k\|\min\left(\Delta_k, \frac{\|g_k\|}{\|B_k\|}\right)$$
   (d) there is a $\gamma \geq 1$ such that for all $k$: $\|p_k\| \leq \gamma\Delta_k$
   (e) $\eta = 0$ (fidelity threshold)
   (f) $\|B_k\| \leq \beta$ for all $k$
   Then: $\liminf_k \|g_k\| = 0$.

5. Convergence when $\eta \in (0, 1/4)$
   Theorem 3.2. If $\eta \in (0, 1/4)$, with all other conditions from Theorem 3.1 holding, then $\lim_k \|g_k\| = 0$.

3.4 Iterative Solutions to the Subproblem

3.4.1 Exact Solution to the Subproblem

1. Conditions for Minimizing Quadratics
   Lemma 3.4. Let $m$ be the quadratic function $m(p) = g^T p + \frac{1}{2}p^T B p$, where $B$ is a symmetric matrix.
   (a) $m$ attains a minimum if and only if $B \succeq 0$ and $-g \in \operatorname{Im}(B)$. If $B \succeq 0$, then every $p$ satisfying $Bp = -g$ is a global minimizer of $m$.
   (b) $m$ has a unique minimizer if and only if $B \succ 0$.
   Proof. If $p^*$ is a minimizer of $m$, we have from the second order conditions that $0 = \nabla m(p^*) = Bp^* + g$ and $B = \nabla^2 m(p^*) \succeq 0$. Now suppose that $B \succeq 0$ and $-g \in \operatorname{Im}(B)$; then there is a $p$ such that $Bp = -g$. Let $w \in \mathbb{R}^n$. Then:
   $$m(p + w) = g^T p + g^T w + \tfrac{1}{2}\left(w^T B w + 2p^T B w + p^T B p\right) = m(p) + g^T w - g^T w + \tfrac{1}{2}w^T B w \geq m(p)$$

where the inequality follows since $B \succeq 0$. This proves the first part. Now suppose $m$ has a unique minimizer $p$, and suppose there is a $w \neq 0$ such that $w^T B w = 0$ (so $Bw = 0$, since $B \succeq 0$). Since $p$ is a minimizer, we have that $Bp = -g$, and so from the computation above $m(p + w) = m(p)$; thus $p + w$ and $p$ are both minimizers of $m$, which is a contradiction. For the other direction, since $B \succ 0$, letting $p = -B^{-1}g$, by the Second Order Sufficient Conditions $m$ has a unique minimizer.

2. Characterization of the Exact Solution
   Theorem 3.3. The vector $p^*$ is a solution to the trust region problem
   $$\min_{p \in \mathbb{R}^n}\ f(x) + g^T p + \tfrac{1}{2}p^T B p\ :\ \|p\| \leq \Delta$$
   if and only if $p^*$ is feasible and there is a $\lambda \geq 0$ such that:
   (a) $(B + \lambda I)p^* = -g$
   (b) $\lambda(\Delta - \|p^*\|) = 0$
   (c) $B + \lambda I \succeq 0$
   Proof. First we prove the ($\Leftarrow$) direction. So suppose $\lambda \geq 0$ satisfies the properties. Consider:
   $$\hat{m}(p) = f(x) + g^T p + \tfrac{1}{2}p^T(B + \lambda I)p = m(p) + \tfrac{\lambda}{2}\|p\|^2$$
   By (a), (c), and the previous lemma, $p^*$ minimizes $\hat{m}$, so $\hat{m}(p) \geq \hat{m}(p^*)$, which implies:
   $$m(p) \geq m(p^*) + \tfrac{\lambda}{2}\left(\|p^*\|^2 - \|p\|^2\right)$$
   So if $\lambda = 0$ then $m(p) \geq m(p^*)$ for all $p$ and $p^*$ is a minimizer; or, if $\lambda > 0$, then $\|p^*\| = \Delta$ by (b), and so for a feasible $p$, $\|p\| \leq \|p^*\|$, which implies $m(p) \geq m(p^*)$.
   Now suppose $p^*$ is the constrained minimizer of $m$. If $\|p^*\| < \Delta$, then $p^*$ is an unconstrained minimizer of $m$, so take $\lambda = 0$: then $\nabla m(p^*) = Bp^* + g = 0$ and $\nabla^2 m(p^*) = B \succeq 0$ by the previous lemma. Now suppose $\|p^*\| = \Delta$. Then the second condition is automatically satisfied. We use the Lagrangian to determine the solution:
   $$\mathcal{L}(\lambda, p) = g^T p + \tfrac{1}{2}p^T B p + \tfrac{\lambda}{2}\left(p^T p - \Delta^2\right)$$
   Differentiating with respect to $p$ and setting this equal to 0, we see that $(B + \lambda I)p^* + g = 0$. Now, to show that $B + \lambda I$ is positive semi-definite, note that for any $p$ with $\|p\| = \Delta$, since $p^*$ is the minimizer of $m$ on the boundary:
   $$m(p) + \tfrac{\lambda}{2}\|p\|^2 \geq m(p^*) + \tfrac{\lambda}{2}\|p^*\|^2$$
   Rearranging (using $(B + \lambda I)p^* = -g$), we have that:
   $$(p - p^*)^T(B + \lambda I)(p - p^*) \geq 0$$
   And since the set of all normalized $\pm(p - p^*)$ with $\|p\| = \Delta$ is dense on the unit sphere, $B + \lambda I$ is positive semi-definite. The last thing to show is that $\lambda \geq 0$. If $\lambda < 0$, then the inequality $m(p) \geq m(p^*) + \frac{\lambda}{2}(\|p^*\|^2 - \|p\|^2)$ makes $p^*$ an unconstrained global minimizer of $m$, so by the previous lemma $Bp^* = -g$; combined with $(B + \lambda I)p^* = -g$ this forces $\lambda p^* = 0$, contradicting $\lambda \neq 0$ and $\|p^*\| = \Delta > 0$.

3.4.2 Newton's Root Finding Iterative Solution

1. Theorem 3.3 guarantees that a solution exists, and we need only find a $\lambda$ which satisfies its conditions. Two cases exist from this theorem:
   Case 1: $\lambda = 0$. If $\lambda = 0$ then $B \succeq 0$ and we can compute a solution $p$ of $Bp = -g$ using a QR decomposition.
   Case 2: $\lambda > 0$. First, letting $Q\Lambda Q^T$ be the EVD of $B$ with smallest eigenvalue $\lambda_1$, define:
   $$p(\lambda) = -Q(\Lambda + \lambda I)^{-1}Q^T g$$
   Hence:
   $$\|p(\lambda)\|^2 = g^T Q(\Lambda + \lambda I)^{-2}Q^T g = \sum_{j=1}^n \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^2}$$
   From Theorem 3.3, we want to find $\lambda > 0$ for which:
   $$\Delta^2 = \sum_{j=1}^n \frac{(q_j^T g)^2}{(\lambda_j + \lambda)^2}$$

2. The second case has two sub-cases which must be considered. Let $Q_1$ be the columns of $Q$ which correspond to the eigenspace of $\lambda_1$.
   Case 2a: If $Q_1^T g \neq 0$, then we implement Newton's root finding algorithm on:
   $$f(\lambda) = \frac{1}{\Delta} - \frac{1}{\|p(\lambda)\|}$$
   since as $\lambda \downarrow -\lambda_1$, $\|p(\lambda)\| \to \infty$, and so a solution exists.
   Case 2b: If $Q_1^T g = 0$, then applying Newton's root finding algorithm naively is not useful, since $\|p(\lambda)\| \not\to \infty$ as $\lambda \downarrow -\lambda_1$. We note that when $\lambda = -\lambda_1$, $B + \lambda I \succeq 0$, and so we can find a $\tau$ such that:
   $$\Delta^2 = \sum_{j : \lambda_j \neq \lambda_1}\frac{(q_j^T g)^2}{(\lambda_j - \lambda_1)^2} + \tau^2$$
   Computation: there is a $z \neq 0$ such that $(B - \lambda_1 I)z = 0$ and $\|z\| = 1$. Therefore, $q_j^T z = 0$ for all $j$ with $\lambda_j \neq \lambda_1$. Setting
   $$p(-\lambda_1, \tau) = -\sum_{j : \lambda_j \neq \lambda_1}\frac{q_j^T g}{\lambda_j - \lambda_1}\,q_j + \tau z$$
   we need only find $\tau$.

3.4.3 Trust Region Subproblem Algorithms
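Case 2a is typically solved by Newton's iteration on $f(\lambda) = 1/\Delta - 1/\|p(\lambda)\|$, which is nearly linear in $\lambda$; in practice $p(\lambda)$ is computed by a Cholesky solve rather than a full EVD. A sketch under those assumptions (the update formula uses the standard Cholesky-based derivative, and `B`, `g`, `delta` are illustrative data):

```python
import numpy as np

def tr_lambda(B, g, delta, iters=100):
    """Newton's iteration for lambda with (B + lambda I) p = -g and ||p|| = delta."""
    n = B.shape[0]
    lam_lo = max(0.0, -np.linalg.eigvalsh(B)[0])      # need B + lam I > 0
    lam = lam_lo + 1.0
    for _ in range(iters):
        L = np.linalg.cholesky(B + lam * np.eye(n))
        p = np.linalg.solve(L.T, np.linalg.solve(L, -g))
        q = np.linalg.solve(L, p)                     # gives ||p||^2 derivative info
        pn = np.linalg.norm(p)
        lam += (pn / np.linalg.norm(q)) ** 2 * (pn - delta) / delta
        lam = max(lam, lam_lo + 1e-10)                # safeguard: stay in the valid interval
    return lam

B = np.diag([-2.0, 1.0])                              # indefinite model Hessian
g = np.array([1.0, 1.0])
delta = 0.5
lam = tr_lambda(B, g, delta)
p = np.linalg.solve(B + lam * np.eye(2), -g)
assert abs(np.linalg.norm(p) - delta) < 1e-6          # ||p*|| = delta (condition (b))
assert np.all(np.linalg.eigvalsh(B + lam * np.eye(2)) >= 0)  # condition (c)
```

Newton's method on $1/\Delta - 1/\|p(\lambda)\|$, rather than on $\Delta^2 - \|p(\lambda)\|^2$ directly, is preferred precisely because the former is close to linear, so a handful of iterations suffices.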

4 Conjugate Gradients

1. Conjugate Gradients Overview

   Algorithm 5: Overview of Conjugate Gradient Algorithm
   input: starting point $x_0$, tolerance $\epsilon$, objective $f$
   $x \leftarrow x_0$
   Compute gradient: $r \leftarrow \nabla f(x)$
   Compute conjugated descent direction: $p \leftarrow -r$
   while $\|r\| > \epsilon$ do
       Compute optimal step length: $\alpha$
       Compute step: $x \leftarrow x + \alpha p$
       Compute gradient: $r \leftarrow \nabla f(x)$
       Compute conjugated step direction: $p$
   end
   return minimizer $x$

2. Conjugate Gradients for minimizing a convex ($A \succ 0$) quadratic function $f(x) = \frac{1}{2}x^T A x - b^T x$

   Algorithm 6: Conjugate Gradients Algorithm for Convex Quadratic
   input: starting point $x_0$, tolerance $\epsilon = 0$, objective $f$
   $x \leftarrow x_0$
   Compute gradient: $r \leftarrow \nabla f(x) = Ax - b$
   Compute conjugated descent direction: $p \leftarrow -r$
   while $\|r\| > \epsilon$ do
       Compute optimal step length: $\alpha = \dfrac{r^T r}{p^T A p}$
       Compute step: $x \leftarrow x + \alpha p$
       Store previous: $r_{\text{old}} \leftarrow r$
       Compute gradient: $r \leftarrow r + \alpha A p$
       Compute conjugated step direction: $p \leftarrow -r + \dfrac{\|r\|^2}{\|r_{\text{old}}\|^2}\,p$
   end
   return minimizer $x$

3. Fletcher-Reeves for Nonlinear Functions

   Algorithm 7: Fletcher-Reeves CG Algorithm
   input: starting point $x_0$, tolerance $\epsilon$, objective $f$
   $x \leftarrow x_0$
   Compute gradient: $r \leftarrow \nabla f(x)$
   Compute conjugated descent direction: $p \leftarrow -r$
   while $\|r\| > \epsilon$ do
       Compute optimal step length: $\alpha$ (Strong Wolfe line search)
       Compute step: $x \leftarrow x + \alpha p$
       Store previous: $r_{\text{old}} \leftarrow r$
       Compute gradient: $r \leftarrow \nabla f(x)$
       Compute conjugated step direction: $p \leftarrow -r + \dfrac{\|r\|^2}{\|r_{\text{old}}\|^2}\,p$
   end
   return minimizer $x$
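Algorithm 6 in code; a minimal sketch (the SPD system `A`, `b` is an illustrative assumption). Note that the step is taken along the conjugate direction $p$, not the gradient $r$:

```python
import numpy as np

def cg_quadratic(A, b, x0, eps=1e-10):
    """Algorithm 6: minimize f(x) = x^T A x / 2 - b^T x for symmetric positive definite A."""
    x = x0.copy()
    r = A @ x - b                                     # gradient of f
    p = -r                                            # initial conjugate direction
    while np.linalg.norm(r) > eps:
        alpha = (r @ r) / (p @ A @ p)                 # exact line minimizer along p
        x = x + alpha * p
        r_old = r
        r = r + alpha * (A @ p)                       # gradient update without re-evaluation
        p = -r + ((r @ r) / (r_old @ r_old)) * p      # Fletcher-Reeves beta
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg_quadratic(A, b, np.zeros(2))
assert np.allclose(A @ x, b)                          # minimizer solves Ax = b
```

For an $n \times n$ SPD system, exact arithmetic terminates in at most $n$ iterations; the example above converges in two.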

Part II Model Hessian Selection

5 Newton's Method

1. Requires $\nabla^2 f \succ 0$ in some region for the method to work.

2. Line Search
   (a) The search direction: $p_k^N = -\nabla^2 f_k^{-1}\nabla f_k$
   (b) Rate of Convergence
   Theorem 5.1. Suppose $f$ is an objective with minimizer $x^*$ satisfying the Second Order Conditions. Moreover:
   i. suppose $f$ is twice continuously differentiable in a neighborhood of $x^*$;
   ii. specifically, suppose $\nabla^2 f$ is Lipschitz continuous in a neighborhood of $x^*$;
   iii. suppose $p_k = p_k^N$ and $x_{k+1} = x_k + p_k^N$;
   iv. suppose $x_k \to x^*$.
   Then for $x_0$ sufficiently close to $x^*$:
   i. $x_k \to x^*$ quadratically;
   ii. $\|\nabla f_k\| \to 0$ quadratically.
   Proof. Our goal is to get $x_{k+1} - x^*$ in terms of $\nabla^2 f$ and use Lipschitz continuity to provide an upper bound.
   i. We have that (using $\nabla f(x^*) = 0$ and Equation (2) of Taylor's Theorem):
   $$x_{k+1} - x^* = x_k + p_k - x^* = x_k - x^* - \nabla^2 f_k^{-1}\nabla f_k = \nabla^2 f_k^{-1}\left[\nabla^2 f_k(x_k - x^*) - (\nabla f_k - \nabla f(x^*))\right]$$
   $$= \nabla^2 f_k^{-1}\left[\nabla^2 f_k(x_k - x^*) - \int_0^1\nabla^2 f(x^* + t(x_k - x^*))(x_k - x^*)\,dt\right] = \nabla^2 f_k^{-1}\int_0^1\left(\nabla^2 f_k - \nabla^2 f(x^* + t(x_k - x^*))\right)(x_k - x^*)\,dt$$

   ii. Using Lipschitz continuity (with constant $L$):
   $$\|x_{k+1} - x^*\| \leq \|\nabla^2 f_k^{-1}\|\int_0^1\left\|\nabla^2 f_k - \nabla^2 f(x^* + t(x_k - x^*))\right\|\,\|x_k - x^*\|\,dt \leq \|\nabla^2 f_k^{-1}\|\int_0^1 (1 - t)L\|x_k - x^*\|^2\,dt = \|\nabla^2 f_k^{-1}\|\frac{L}{2}\|x_k - x^*\|^2$$
   We now need to show that $\|\nabla^2 f_k^{-1}\|$ is bounded from above. Let $\eta(2r)$ be a neighborhood of radius $2r$ about $x^*$ in which $\nabla^2 f \succ 0$. In this neighborhood $\nabla^2 f(x)^{-1}$ exists and $g(x) = \|\nabla^2 f(x)^{-1}\|$ is continuous. Then, on the compact closed ball $\bar{\eta}(r)$, $g(x)$ has a finite maximum. Therefore, we have quadratic convergence of $x_k \to x^*$.
   iii. We now consider the first derivative. Since $\nabla^2 f_k\,p_k = -\nabla f_k$:
   $$\|\nabla f_{k+1}\| = \left\|\nabla f_{k+1} - \nabla f_k - \nabla^2 f_k\,p_k\right\| = \left\|\int_0^1\left(\nabla^2 f(x_k + tp_k) - \nabla^2 f_k\right)p_k\,dt\right\| \leq \int_0^1 Lt\|p_k\|^2\,dt \leq \frac{L}{2}\,g(x_k)^2\|\nabla f_k\|^2$$
   So if $x_0 \in \eta(r)$, the result follows.

3. Trust Region
   (a) Asymptotically Similar
   Definition 5.1. A sequence $p_k$ is asymptotically similar to a sequence $q_k$ if $\|p_k - q_k\| = o(\|q_k\|)$.
   (b) The Subproblem:
   $$\min_{p \in \mathbb{R}^n} f(x_k) + \nabla f_k^T p + \tfrac{1}{2}p^T\nabla^2 f_k\,p\ :\ \|p\| \leq \Delta_k$$
   (c) Rate of Convergence
   Theorem 5.2. Suppose $f$ is an objective with minimizer $x^*$ satisfying the Second Order Conditions. Moreover:
   i. suppose $f$ is twice continuously differentiable in a neighborhood of $x^*$;

   ii. specifically, suppose $\nabla^2 f$ is Lipschitz continuous in a neighborhood of $x^*$;
   iii. suppose for some $c \in (0, 1)$ the $p_k$ satisfy
   $$m_k(p_k) - m_k(0) \leq -c\|\nabla f_k\|\min\left(\Delta_k, \frac{\|\nabla f_k\|}{\|\nabla^2 f_k\|}\right)$$
   and are asymptotically similar to $p_k^N$ whenever $\|p_k^N\| \leq \Delta_k/2$;
   iv. suppose $x_{k+1} = x_k + p_k$ and $x_k \to x^*$.
   Then:
   i. $\Delta_k$ becomes inactive for sufficiently large $k$;
   ii. $x_k \to x^*$ superlinearly.
   Proof. We need to show that the $\Delta_k$ are bounded from below. The second part follows from this quickly, since $x_k \to x^*$ implies $\|p_k^N\| \to 0$ and eventually $\|p_k^N\| \leq \Delta_k/2$. So, applying the asymptotic similarity:
   $$\|x_k + p_k - x^*\| \leq \|x_k + p_k^N - x^*\| + \|p_k^N - p_k\| \leq o(\|x_k - x^*\|^2) + o(\|p_k^N\|) \leq o(\|x_k - x^*\|^2) + o(\|x_k - x^*\|) = o(\|x_k - x^*\|)$$
   where the penultimate inequality follows from:
   $$\|p_k^N\| = \|\nabla^2 f_k^{-1}\nabla f_k\| \leq \|x_k - x^*\|\,\|\nabla^2 f_k^{-1}\|\sup_{x \in \eta(r)}\|\nabla^2 f(x)\|$$
   To get that the $\Delta_k$ are bounded from below, we frame things in terms of $\rho_k$. The numerator of $\rho_k - 1$ satisfies:
   $$\left|f(x_k + p_k) - f(x_k) - \left(m_k(p_k) - m_k(0)\right)\right| = \left|\tfrac{1}{2}p_k^T\int_0^1\left(\nabla^2 f(x_k + tp_k) - \nabla^2 f_k\right)p_k\,dt\right| \leq \frac{L}{4}\|p_k\|^3$$
   And the denominator of $\rho_k - 1$ satisfies:
   $$m_k(0) - m_k(p_k) \geq c\|\nabla f_k\|\min\left(\Delta_k, \frac{\|\nabla f_k\|}{\|\nabla^2 f_k\|}\right) \geq c\|\nabla f_k\|\min\left(\|p_k\|, \frac{\|\nabla f_k\|}{\|\nabla^2 f_k\|}\right)$$
   So we need only lower bound $\|\nabla f_k\|$ by $\|p_k\|$ to translate this into a lower bound for $\Delta_k$. If $\|p_k^N\| > \Delta_k/2$, then (in some neighborhood of $x^*$):
   $$\|p_k\| \leq \Delta_k < 2\|p_k^N\| \leq 2\|\nabla^2 f_k^{-1}\|\,\|\nabla f_k\| \leq C\|\nabla f_k\|$$

   Otherwise, if $\|p_k^N\| \leq \Delta_k/2$, then:
   $$\|p_k\| \leq \|p_k^N\| + o(\|p_k^N\|) \leq 2\|p_k^N\| \leq C\|\nabla f_k\|$$
   Therefore:
   $$m_k(0) - m_k(p_k) \geq c\,\frac{\|p_k\|}{C}\min\left(\|p_k\|, \frac{\|p_k\|}{C\|\nabla^2 f_k\|}\right)$$
   Since the second term's denominator involves (up to constants) the condition number of the matrix, which must be greater than or equal to 1, this is bounded below by a constant multiple of $\|p_k\|^2$, and hence:
   $$|\rho_k - 1| \leq \tilde{C}\|p_k\|$$
   for some constant $\tilde{C}$. Hence, if $\Delta_k < \frac{3}{4\tilde{C}}$, then $\|p_k\| < \frac{3}{4\tilde{C}}$, so $\rho_k > 3/4$ and $\Delta_k$ would be doubled. It follows that $\Delta_k \geq \hat{\Delta}$ for some fixed $\hat{\Delta} > 0$. So, for sufficiently large $k$, $\|p_k^N\| \to 0$ while $\Delta_k$ stays bounded away from 0, and so $\rho_k \to 1$ and $\Delta_k$ becomes inactive.
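The quadratic convergence of Theorem 5.1 is easy to observe on a toy problem. A sketch (the scalar objective $f(x) = e^x - 2x$, with minimizer $x^* = \ln 2$ and $f''(x^*) = 2 > 0$, is an illustrative assumption):

```python
import math

# Pure Newton iteration x <- x - f'(x)/f''(x) for f(x) = exp(x) - 2x.
def newton_step(x):
    g = math.exp(x) - 2.0          # f'(x)
    h = math.exp(x)                # f''(x) > 0 everywhere, so SOC holds at x*
    return x - g / h

x, errs = 0.0, []
for _ in range(6):
    x = newton_step(x)
    errs.append(abs(x - math.log(2.0)))

# Quadratic convergence: the error is roughly squared each iteration.
assert errs[-1] < 1e-12
assert all(e2 < e1 for e1, e2 in zip(errs, errs[1:]) if e1 > 1e-14)
```

The doubling of correct digits per step is the practical signature of Q-quadratic convergence from Definition 1.2(c).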

6 Newton's Method with Hessian Modification

1. If $\nabla^2 f_k \not\succ 0$, we can add a matrix $E_k$ so that $\nabla^2 f_k + E_k =: B_k \succ 0$. As long as this model produces a descent direction, and $\|B_k\|\,\|B_k^{-1}\|$ is bounded uniformly, Zoutendijk guarantees convergence.

2. Newton's Method with Hessian Modification

   Algorithm 8: Line Search with Modified Hessian
   input: starting point $x_0$, tolerance $\epsilon$
   $k \leftarrow 0$
   while $\|\nabla f_k\| > \epsilon$ do
       Find $E_k$ so that $\nabla^2 f_k + E_k =: B_k \succ 0$
       Solve $B_k p_k = -\nabla f_k$
       Compute $\alpha_k$ (Wolfe, Goldstein, Backtracking, or Interpolation)
       $x_{k+1} \leftarrow x_k + \alpha_k p_k$
       $k \leftarrow k + 1$
   end

3. Minimum Frobenius Norm
   (a) Problem: find $E$ which satisfies $\min \|E\|_F^2 : \lambda_{\min}(A + E) \geq \delta$.
   (b) Solution: if $A = Q\Lambda Q^T$ is the EVD of $A$, then $E = Q\,\mathrm{diag}(\tau_i)\,Q^T$ where:
   $$\tau_i = \begin{cases} 0 & \lambda_i \geq \delta \\ \delta - \lambda_i & \lambda_i < \delta \end{cases}$$

4. Minimum 2-Norm
   (a) Problem: find $E$ which satisfies $\min \|E\|_2^2 : \lambda_{\min}(A + E) \geq \delta$.
   (b) Solution: if $\lambda_{\min}(A) = \lambda_n$, then $E = \tau I$ where:
   $$\tau = \begin{cases} 0 & \lambda_n \geq \delta \\ \delta - \lambda_n & \lambda_n < \delta \end{cases}$$

5. Modified Cholesky
   (a) Compute the $LDL^T$ factorization, but add constants to the diagonal to ensure positive definiteness of the factorized matrix.
   (b) May require pivoting to ensure numerical stability.

6. Modified Block Cholesky
   (a) Compute the $LBL^T$ factorization, where $B$ is block diagonal with $1 \times 1$ and $2 \times 2$ blocks (the number of $1 \times 1$ blocks is the number of positive eigenvalues, and the number of $2 \times 2$ blocks is the number of negative eigenvalues).
   (b) Compute the EVD of $B$ and modify it to ensure appropriate eigenvalues.
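The minimum Frobenius-norm and minimum 2-norm modifications above have the stated closed forms, so both are a few lines of code. A sketch (the indefinite matrix `A` and threshold `delta` are illustrative assumptions):

```python
import numpy as np

def modify_frobenius(A, delta):
    """Minimum-Frobenius-norm E with lambda_min(A + E) >= delta: clip eigenvalues up to delta."""
    lam, Q = np.linalg.eigh(A)
    tau = np.where(lam >= delta, 0.0, delta - lam)
    return Q @ np.diag(tau) @ Q.T

def modify_two_norm(A, delta):
    """Minimum-2-norm E = tau * I with lambda_min(A + E) >= delta: shift all eigenvalues."""
    lam_min = np.linalg.eigvalsh(A)[0]
    tau = 0.0 if lam_min >= delta else delta - lam_min
    return tau * np.eye(A.shape[0])

A = np.array([[1.0, 2.0], [2.0, -3.0]])          # indefinite: one negative eigenvalue
for E in (modify_frobenius(A, 1e-3), modify_two_norm(A, 1e-3)):
    assert np.linalg.eigvalsh(A + E)[0] >= 1e-3 - 1e-10   # B = A + E is positive definite
```

The Frobenius version perturbs only the offending eigenspace; the 2-norm version shifts every eigenvalue, which is cheaper to apply but changes the well-conditioned directions too.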

7 Quasi-Newton's Method

1. Fundamental Idea/Computation. Let f be our objective and suppose it is twice continuously differentiable with Lipschitz continuous Hessian. From Taylor's Theorem:

    ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + tp) p dt
              = ∇f(x) + ∇²f(x) p + ∫₀¹ ( ∇²f(x + tp) − ∇²f(x) ) p dt
              = ∇f(x) + ∇²f(x) p + o(‖p‖)

Therefore, letting x = x_k and p = x_{k+1} − x_k:

    ∇²f_k (x_{k+1} − x_k) ≈ g_{k+1} − g_k

Quasi-Newton methods choose an approximation B_{k+1} to ∇²f_k which satisfies:

    B_{k+1} (x_{k+1} − x_k) = g_{k+1} − g_k

2. Secant Equation

Definition 7.1. Let x_i be a sequence of iterates when minimizing an objective function f. Let s_k = x_{k+1} − x_k and y_k = g_{k+1} − g_k, where g_i = ∇f_i. Then the secant equation is

    B_{k+1} s_k = y_k

where B_{k+1} is some matrix.

3. Curvature Condition

Definition 7.2. The curvature condition is s_kᵀ y_k > 0.

Note 7.1. If B_{k+1} satisfies the secant equation and s_k, y_k satisfy the curvature condition, then s_kᵀ B_{k+1} s_k > 0. So the quadratic model along the step direction is convex.

4. Line Search Rate of Convergence

Theorem 7.1. Suppose f is an objective with minimizer x* satisfying the Second Order Conditions. Moreover:

(a) Suppose f is twice continuously differentiable in a neighborhood of x*
(b) Suppose B_k is computed using a quasi-Newton method and B_k p_k = −g_k
(c) Suppose x_{k+1} = x_k + p_k
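The secant equation and curvature condition can be checked directly on a convex quadratic, where y_k = ∇²f · s_k holds exactly. A small sketch (the matrix and iterates are arbitrary choices for illustration):

```python
# f(x) = 1/2 x^T A x - b^T x, with A symmetric positive definite.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]

def grad(x):
    """g(x) = A x - b for the quadratic above."""
    return [A[0][0]*x[0] + A[0][1]*x[1] - b[0],
            A[1][0]*x[0] + A[1][1]*x[1] - b[1]]

x0, x1 = [0.0, 0.0], [0.5, -0.25]
s = [x1[i] - x0[i] for i in range(2)]                  # s_k = x_{k+1} - x_k
y = [grad(x1)[i] - grad(x0)[i] for i in range(2)]      # y_k = g_{k+1} - g_k

# For a quadratic, y_k = A s_k exactly, so B_{k+1} = A satisfies the secant
# equation; positive definiteness of A gives the curvature condition.
As = [A[0][0]*s[0] + A[0][1]*s[1], A[1][0]*s[0] + A[1][1]*s[1]]
curvature = s[0]*y[0] + s[1]*y[1]                      # s_k^T y_k > 0
```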

(d) Suppose x_k → x*

When x_0 is sufficiently close to x*, x_k → x* superlinearly if and only if:

    lim_{k→∞} ‖( B_k − ∇²f(x*) ) p_k‖ / ‖p_k‖ = 0

Proof. We show that the condition is equivalent to: ‖p_k − p_k^N‖ = o(‖p_k‖). First, if the condition holds:

    p_k − p_k^N = ∇²f_k⁻¹ ( ∇²f_k p_k + ∇f_k )
                = ∇²f_k⁻¹ ( ∇²f_k − B_k ) p_k
                = ∇²f_k⁻¹ [ ( ∇²f_k − ∇²f(x*) ) p_k + ( ∇²f(x*) − B_k ) p_k ]

Taking norms and using that x_0 is sufficiently close to x*, and continuity:

    ‖p_k − p_k^N‖ = o(‖p_k‖)

For the other direction, the same identity gives:

    ‖( ∇²f(x*) − B_k ) p_k‖ ≤ ‖∇²f_k‖ ‖p_k − p_k^N‖ + ‖( ∇²f_k − ∇²f(x*) ) p_k‖ = o(‖p_k‖)

Now we just apply the triangle inequality and properties of the Newton step:

    ‖x_k + p_k − x*‖ ≤ ‖x_k + p_k^N − x*‖ + ‖p_k − p_k^N‖ = o(‖x_k − x*‖²) + o(‖p_k‖)

Note that o(‖p_k‖) = o(‖x_k − x*‖), which gives superlinear convergence.

7.1 Rank-2 Update: DFP & BFGS

1. Davidon-Fletcher-Powell Update

(a) Let:

    G_k = ∫₀¹ ∇²f(x_k + t α_k p_k) dt

So Taylor's Theorem implies: y_k = G_k s_k

(b) Let:

    ‖A‖_W = ‖G_k^{−1/2} A G_k^{−1/2}‖_F

(c) Problem. Given B_k symmetric positive definite, s_k and y_k, find

    arg min_B ‖B − B_k‖_W : B = Bᵀ, B s_k = y_k

(d) Solution:

    B_{k+1} = ( I − (y_k s_kᵀ)/(y_kᵀ s_k) ) B_k ( I − (s_k y_kᵀ)/(y_kᵀ s_k) ) + (y_k y_kᵀ)/(y_kᵀ s_k)

(e) Inverse Hessian:

    H_{k+1} = H_k − (H_k y_k y_kᵀ H_k)/(y_kᵀ H_k y_k) + (s_k s_kᵀ)/(y_kᵀ s_k)

2. Broyden-Fletcher-Goldfarb-Shanno Update

(a) Problem: Given H_k (inverse of B_k) symmetric positive definite, s_k and y_k, find

    arg min_H ‖H − H_k‖_W : H = Hᵀ, H y_k = s_k

(b) Solution:

    H_{k+1} = ( I − (s_k y_kᵀ)/(y_kᵀ s_k) ) H_k ( I − (y_k s_kᵀ)/(y_kᵀ s_k) ) + (s_k s_kᵀ)/(y_kᵀ s_k)

(c) Hessian:

    B_{k+1} = B_k − (B_k s_k s_kᵀ B_k)/(s_kᵀ B_k s_k) + (y_k y_kᵀ)/(y_kᵀ s_k)

3. Preservation of Positive Definiteness

Lemma 7.1. If B_k ≻ 0 and B_{k+1} is updated from the BFGS method, then B_{k+1} ≻ 0.

Proof. We can equivalently show that H_{k+1} ≻ 0. Let z ∈ Rⁿ \ {0} and let ρ_k = 1/(y_kᵀ s_k). Then:

    zᵀ H_{k+1} z = wᵀ H_k w + ρ_k (s_kᵀ z)²

where w = z − ρ_k (s_kᵀ z) y_k. If s_kᵀ z = 0, then w = z and zᵀ H_{k+1} z = zᵀ H_k z > 0. If s_kᵀ z ≠ 0, then ρ_k (s_kᵀ z)² > 0 (regardless of whether w = 0), and so H_{k+1} ≻ 0.

7.2 Rank-1 Update: SR1

1. Problem: find σ and v such that:

    B_{k+1} = B_k + σ v vᵀ : B_{k+1} s_k = y_k

2. Solution:

    B_{k+1} = B_k + ( (y_k − B_k s_k)(y_k − B_k s_k)ᵀ ) / ( s_kᵀ (y_k − B_k s_k) )

Proof. Multiplying by s_k:

    y_k − B_k s_k = σ (vᵀ s_k) v

Hence, v is a multiple of y_k − B_k s_k.
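A minimal sketch of the BFGS inverse update from item 2(b), in pure Python with nested lists standing in for matrices (the function name is an invention here). On a quadratic, a single update already forces the secant equation for the inverse, H_{k+1} y_k = s_k:

```python
def bfgs_inverse_update(H, s, y):
    """One BFGS update of the inverse Hessian approximation:
    H_{k+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T, rho = 1/(y^T s)."""
    n = len(s)
    rho = 1.0 / sum(y[i] * s[i] for i in range(n))
    # V = I - rho s y^T; note that I - rho y s^T = V^T.
    V = [[(1.0 if i == j else 0.0) - rho * s[i] * y[j] for j in range(n)] for i in range(n)]
    VH = [[sum(V[i][k] * H[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    VHVt = [[sum(VH[i][k] * V[j][k] for k in range(n)) for j in range(n)] for i in range(n)]
    return [[VHVt[i][j] + rho * s[i] * s[j] for j in range(n)] for i in range(n)]

# On a quadratic with Hessian A = [[2, 0], [0, 4]], y_k = A s_k exactly.
H0 = [[1.0, 0.0], [0.0, 1.0]]
s, y = [1.0, 1.0], [2.0, 4.0]
H1 = bfgs_inverse_update(H0, s, y)
# The secant equation for the inverse, H_{k+1} y_k = s_k, now holds:
Hy = [sum(H1[i][j] * y[j] for j in range(2)) for i in range(2)]
```

The update is symmetric by construction, and by Lemma 7.1 it preserves positive definiteness whenever the curvature condition y_kᵀ s_k > 0 holds.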

Part III Specialized Applications of Optimization

8 Inexact Newton's Method

1. Requirements for the search direction:

(a) p_k is inexpensive to compute
(b) p_k approximates the Newton step closely, as measured by the residual:

    z_k = ∇²f_k p_k + ∇f_k

(c) The error is controlled by the current gradient:

    ‖z_k‖ ≤ η_k ‖∇f_k‖

2. Typically, inexact methods keep B_k = ∇²f_k and improve on computing p_k

3. CG-Based Algorithmic Overview

Algorithm 9: Overview of Inexact CG Algorithm
  input: x_0 a starting point, f objective, restrictions
  x ← x_0
  Define Tolerance: ɛ
  Compute Gradient: r ← ∇f(x)
  Compute Conjugated Descent Direction: p ← −r
  while ‖r‖ > ɛ do
    Check Constrained Polynomial:
    if pᵀBp ≤ 0 then
      Compute α given restrictions
      Compute Minimizer: x ← x + αp
      Break Loop
    else
      Compute Optimal Step Length: α
      if Restrictions on x + αp then
        Compute τ given restrictions
        Compute Minimizer: x ← x + ταp
        Break Loop
      else
        Compute Step: x ← x + αp
      end
      Compute Gradient: r ← ∇f(x)
      Compute Conjugated Step Direction: p
    end
  end
  return Minimizer: x

4. As with the usual CG, we can implement methods for computing the gradient and the conjugate step direction efficiently
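Without the trust region restrictions, the inner loop of Algorithm 9 is essentially CG applied to ∇²f_k p = −∇f_k, stopped once the residual satisfies ‖z_k‖ ≤ η_k ‖∇f_k‖. A rough pure-Python sketch (the function name is invented, and the negative-curvature handling is simplified to a steepest-descent fallback rather than the notes' restricted step):

```python
def cg_newton_direction(B, g, eta):
    """Truncated CG for B p = -g. Stops when ||B p + g|| <= eta * ||g||;
    falls back to steepest descent if negative curvature appears first."""
    n = len(g)
    dot = lambda u, v: sum(u[i] * v[i] for i in range(n))
    mv = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    p = [0.0] * n
    r = g[:]                    # residual of B p + g at p = 0
    d = [-ri for ri in r]       # first direction: steepest descent
    gnorm = dot(g, g) ** 0.5
    for _ in range(n):
        Bd = mv(B, d)
        curv = dot(d, Bd)
        if curv <= 0:                       # negative curvature detected
            return d if p == [0.0] * n else p
        alpha = dot(r, r) / curv
        p = [p[i] + alpha * d[i] for i in range(n)]
        r_new = [r[i] + alpha * Bd[i] for i in range(n)]
        if dot(r_new, r_new) ** 0.5 <= eta * gnorm:
            return p                        # inexactness test passed
        beta = dot(r_new, r_new) / dot(r, r)
        r = r_new
        d = [-r[i] + beta * d[i] for i in range(n)]
    return p

B = [[4.0, 1.0], [1.0, 3.0]]    # positive definite model Hessian
g = [1.0, 2.0]                  # current gradient
p = cg_newton_direction(B, g, eta=1e-10)   # approximately solves B p = -g
```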

5. Line Search Strategy: the restriction on α is simply that it satisfies the strong Wolfe Conditions

6. Trust-Region Strategy:

(a) The restriction on α is that ‖x + αp‖ = Δ, because the restricted polynomial in the direction p is decreasing, so we should go all the way to the boundary
(b) The restriction on x is that ‖x + αp‖ > Δ: x + αp is no longer in the trust region, so we need to find τ that returns the step to the boundary

7. An alternative is to use Lanczos' Method, which generalizes CG methods.

9 Limited Memory BFGS

1. In normal BFGS:

    H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ

2. Substituting in for H_k until H_0 reveals that only the pairs (s_i, y_i) must be stored to compute H_{k+1}.

3. Limited-memory BFGS capitalizes on this by storing a fixed number of pairs (s_i, y_i) and re-updating H_k from a (typically diagonal) matrix H_k^0

Algorithm 10: L-BFGS Algorithm
  input: x_0 a starting point, f objective, m storage size, ɛ tolerance
  x ← x_0
  k ← 0
  while ‖∇f(x)‖ > ɛ do
    Choose H_k^0
    Update H_k from {(s_{k−1}, y_{k−1}), ..., (s_{k−m}, y_{k−m})}
    Compute p_k^u ← −H_k ∇f(x)
    Implement Line Search or Trust Region (dogleg) to find p_k
    x ← x + p_k
    k ← k + 1
  end
  return Minimizer: x
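Algorithm 10's step "Update H_k from the stored pairs" is usually implemented implicitly by the two-loop recursion, which applies H_k to a vector without ever forming a matrix. A sketch under the conventions that H_k^0 = γI and the newest pair is stored last (the function name is an invention here):

```python
def lbfgs_apply(grad, pairs, gamma=1.0):
    """Two-loop recursion: returns H_k * grad built from the m stored
    (s, y) pairs, with H_k^0 = gamma * I and the newest pair last."""
    n = len(grad)
    dot = lambda u, v: sum(u[i] * v[i] for i in range(n))
    q = grad[:]
    stack = []
    for s, y in reversed(pairs):              # newest pair first
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        stack.append((a, rho, s, y))
        q = [q[i] - a * y[i] for i in range(n)]
    r = [gamma * qi for qi in q]              # apply H_k^0
    for a, rho, s, y in reversed(stack):      # second loop: oldest first
        b = rho * dot(y, r)
        r = [r[i] + (a - b) * s[i] for i in range(n)]
    return r                                   # = H_k * grad

# One stored pair on a quadratic with Hessian A = [[2, 0], [0, 4]]: y = A s.
pairs = [([1.0, 1.0], [2.0, 4.0])]
Hy = lbfgs_apply([2.0, 4.0], pairs)   # applying H_k to y_k recovers s_k
```

The search direction in Algorithm 10 is then p_k^u = −lbfgs_apply(∇f(x), pairs); the cost is O(mn) per iteration instead of the O(n²) of full BFGS.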

10 Least Squares Regression Methods

1. In least squares problems, the objective function has a specialized form, for example:

    f(x) = ½ Σ_{j=1}^m r_j(x)²

(a) x ∈ Rⁿ, and the entries of x are usually the parameters of a particular model of interest
(b) The r_j are usually the residuals of the model φ with parameter x, evaluated at point t_j with output y_j. That is:

    r_j(x) = φ(x, t_j) − y_j

2. Formulation

(a) Let r(x) = [ r_1(x) r_2(x) ⋯ r_m(x) ]ᵀ. Then:

    f(x) = ½ ‖r(x)‖²

(b) The first derivative:

    ∇f(x) = Σ_{j=1}^m r_j(x) ∇r_j(x) = J(x)ᵀ r(x)

where J(x) is the Jacobian of r, whose j-th row is ∇r_j(x)ᵀ.

(c) The second derivative:

    ∇²f(x) = J(x)ᵀ J(x) + Σ_{j=1}^m r_j(x) ∇²r_j(x)

3. The main idea behind least squares methods is to approximate ∇²f(x) by J(x)ᵀ J(x); thus, only first derivative information is required.

10.1 Linear Least Squares Regression

1. Suppose φ(x, t_j) = t_jᵀ x. Then:

    r(x) = Tx − y

where each row of T is t_jᵀ.

2. Objective Function:

    f(x) = ½ ‖Tx − y‖²

Note. The objective function can be minimized directly using matrix factorizations if m is not too large.

3. First Derivative:

    ∇f(x) = Tᵀ T x − Tᵀ y

Note. The normal equations may be better to solve if m is large, because by multiplying by the transpose, we need only solve an n × n system of equations.
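In the linear case, setting the gradient in item 3 to zero gives the normal equations TᵀT x = Tᵀy, which can be solved directly. A tiny sketch fitting a two-parameter line (the data are invented for illustration, and a 2×2 Cramer's-rule solve stands in for a proper factorization):

```python
# Fit phi(x, t) = x0 + x1 * t by solving the normal equations T^T T x = T^T y.
t = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]            # exactly y = 1 + 2 t

# T has rows [1, t_j]; form the 2x2 matrix T^T T and the vector T^T y.
m = len(t)
TtT = [[m, sum(t)], [sum(t), sum(ti * ti for ti in t)]]
Tty = [sum(y), sum(t[j] * y[j] for j in range(m))]

# Solve the 2x2 system by Cramer's rule.
det = TtT[0][0] * TtT[1][1] - TtT[0][1] * TtT[1][0]
x0 = (Tty[0] * TtT[1][1] - TtT[0][1] * Tty[1]) / det
x1 = (TtT[0][0] * Tty[1] - Tty[0] * TtT[1][0]) / det
```

Since the data lie exactly on a line, the solution recovers the intercept x0 = 1 and slope x1 = 2.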

10.2 Line Search: Gauss-Newton

1. The Gauss-Newton Algorithm is a line search method for nonlinear least squares regression, with search direction p_k solving:

    J_kᵀ J_k p_k = −J_kᵀ r_k

2. Notice that these are the normal equations of:

    min_p ½ ‖J_k p + r_k‖²

3. Using this search direction, we proceed with the line search as usual.

10.3 Trust Region: Levenberg-Marquardt

1. The local model is:

    m_k(p) = f(x_k) + ( J(x_k)ᵀ r(x_k) )ᵀ p + ½ pᵀ J(x_k)ᵀ J(x_k) p,    ‖p‖ ≤ Δ_k

2. Since f(x) = ‖r(x)‖²/2, minimizing m_k(p) is equivalent to:

    min_p ½ ‖J(x_k) p + r(x_k)‖²,    ‖p‖ ≤ Δ_k
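With a single parameter, the Gauss-Newton system J_kᵀ J_k p_k = −J_kᵀ r_k collapses to a scalar division, which makes the method easy to sketch (the unit step length and the model φ(x, t) = e^{xt} are illustrative choices, not from the notes):

```python
import math

# Gauss-Newton with unit steps for the one-parameter model phi(x, t) = exp(x*t).
t = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [math.exp(0.5 * ti) for ti in t]   # synthetic data, true parameter 0.5

x = 0.0                                 # starting guess
for _ in range(20):
    r = [math.exp(x * ti) - yi for ti, yi in zip(t, y)]   # residuals r_j(x)
    J = [ti * math.exp(x * ti) for ti in t]               # dr_j/dx (Jacobian column)
    # Scalar normal equations: (sum J_j^2) p = -(sum J_j r_j)
    p = -sum(Jj * rj for Jj, rj in zip(J, r)) / sum(Jj * Jj for Jj in J)
    x += p
```

Because the residuals vanish at the solution, J(x)ᵀJ(x) is the full Hessian there (item 2(c) of the previous section), and the iteration converges rapidly to the true parameter.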

Part IV Constrained Optimization

10.4 Theory of Constrained Optimization

1. In constrained optimization, the problem is:

    min f(x) : c_i(x) = 0 ∀i ∈ E, c_j(x) ≥ 0 ∀j ∈ I

(a) f is the usual objective function
(b) E is the index set for equality constraints
(c) I is the index set for inequality constraints

2. Feasible Set. Active Set.

Definition. The feasible set, Ω, is the set of all points which satisfy the constraints:

    Ω = { x : c_i(x) = 0 ∀i ∈ E, c_j(x) ≥ 0 ∀j ∈ I }

The active set at a point x ∈ Ω, A(x), is defined by:

    A(x) = E ∪ { j ∈ I : c_j(x) = 0 }

3. Feasible Sequence. Tangent. Tangent Cone.

Definition. Let x ∈ Ω. A sequence z_k → x is a feasible sequence if for sufficiently large K, k ≥ K implies z_k ∈ Ω. A vector d is a tangent to x if there is a feasible sequence z_k → x and a sequence t_k ↓ 0 such that:

    lim_{k→∞} (z_k − x) / t_k = d

The set of all tangents is the Tangent Cone at x, denoted T_Ω(x).

Note. T_Ω(x) contains all step directions which, for a sufficiently small step size, will remain in the feasible region Ω.

4. Linearized Feasible Directions.

Definition. Let x ∈ Ω and let A(x) be its active set. The set of linearized feasible directions, F(x), is defined as:

    F(x) = { d : dᵀ∇c_i(x) = 0 ∀i ∈ E, dᵀ∇c_j(x) ≥ 0 ∀j ∈ A(x) ∩ I }

Note. Let x ∈ Ω and suppose we linearize the constraints near x. The set F(x) contains all search directions d for which the linearized approximations to c_i(x + d) satisfy the constraints. For example, if c_i(x) = 0 for i ∈ E, then we need

    0 = c_i(x + d) ≈ c_i(x) + dᵀ∇c_i(x) = dᵀ∇c_i(x)

for d to be a good search direction.

5. LICQ Constraint Qualification

Definition. Given a point x and active set A(x), the linear independence constraint qualification (LICQ) holds if the set of active constraint gradients, { ∇c_i(x) : i ∈ A(x) }, is linearly independent.

Note. Most algorithms using line search or trust region require that F(x) = T_Ω(x); otherwise there is not a good way to find another iterate.

6. First Order Necessary Conditions

Theorem. Let f, c_i be continuously differentiable. Suppose x* is a local solution to the optimization problem and LICQ holds at x*. Then there is a Lagrange multiplier vector λ*, with components λ*_i for i ∈ E ∪ I, such that the Karush-Kuhn-Tucker (KKT) conditions are satisfied at (x*, λ*):

(a) ∇_x L(x*, λ*) = 0
(b) For all i ∈ E, c_i(x*) = 0
(c) For all j ∈ I, c_j(x*) ≥ 0
(d) For all j ∈ I, λ*_j ≥ 0
(e) For all j ∈ I, λ*_j c_j(x*) = 0
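The KKT conditions can be checked numerically at a candidate pair (x*, λ*). A sketch for the toy problem min x₁ + x₂ subject to 2 − x₁² − x₂² ≥ 0 (this example and its multiplier are assumptions for illustration; the solution is x* = (−1, −1) with λ* = 1/2):

```python
# Check conditions (a), (c), (d), (e) at the candidate point; there are no
# equality constraints here, so (b) is vacuous.
x = (-1.0, -1.0)
lam = 0.5                                    # candidate Lagrange multiplier

c = 2.0 - x[0]**2 - x[1]**2                  # constraint value (active: c = 0)
grad_f = (1.0, 1.0)
grad_c = (-2.0 * x[0], -2.0 * x[1])          # = (2, 2) at x*

# (a) stationarity of the Lagrangian L(x, lam) = f(x) - lam * c(x)
grad_L = (grad_f[0] - lam * grad_c[0], grad_f[1] - lam * grad_c[1])

# (c) primal feasibility, (d) dual feasibility, (e) complementarity
kkt_ok = (abs(grad_L[0]) < 1e-12 and abs(grad_L[1]) < 1e-12
          and c >= -1e-12 and lam >= 0 and abs(lam * c) < 1e-12)
```

Here the constraint is active, its gradient (2, 2) is nonzero, so LICQ holds trivially with a single constraint, and all the conditions are satisfied.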


More information

A line-search algorithm inspired by the adaptive cubic regularization framework with a worst-case complexity O(ɛ 3/2 )

A line-search algorithm inspired by the adaptive cubic regularization framework with a worst-case complexity O(ɛ 3/2 ) A line-search algorithm inspired by the adaptive cubic regularization framewor with a worst-case complexity Oɛ 3/ E. Bergou Y. Diouane S. Gratton December 4, 017 Abstract Adaptive regularized framewor

More information

2.3 Linear Programming

2.3 Linear Programming 2.3 Linear Programming Linear Programming (LP) is the term used to define a wide range of optimization problems in which the objective function is linear in the unknown variables and the constraints are

More information

MS&E 318 (CME 338) Large-Scale Numerical Optimization

MS&E 318 (CME 338) Large-Scale Numerical Optimization Stanford University, Management Science & Engineering (and ICME) MS&E 318 (CME 338) Large-Scale Numerical Optimization 1 Origins Instructor: Michael Saunders Spring 2015 Notes 9: Augmented Lagrangian Methods

More information

Lecture 7 Unconstrained nonlinear programming

Lecture 7 Unconstrained nonlinear programming Lecture 7 Unconstrained nonlinear programming Weinan E 1,2 and Tiejun Li 2 1 Department of Mathematics, Princeton University, weinan@princeton.edu 2 School of Mathematical Sciences, Peking University,

More information

Optimization Methods

Optimization Methods Optimization Methods Decision making Examples: determining which ingredients and in what quantities to add to a mixture being made so that it will meet specifications on its composition allocating available

More information

Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method

Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method Handling Nonpositive Curvature in a Limited Memory Steepest Descent Method Fran E. Curtis and Wei Guo Department of Industrial and Systems Engineering, Lehigh University, USA COR@L Technical Report 14T-011-R1

More information

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell

On fast trust region methods for quadratic models with linear constraints. M.J.D. Powell DAMTP 2014/NA02 On fast trust region methods for quadratic models with linear constraints M.J.D. Powell Abstract: Quadratic models Q k (x), x R n, of the objective function F (x), x R n, are used by many

More information

Computational Optimization. Augmented Lagrangian NW 17.3

Computational Optimization. Augmented Lagrangian NW 17.3 Computational Optimization Augmented Lagrangian NW 17.3 Upcoming Schedule No class April 18 Friday, April 25, in class presentations. Projects due unless you present April 25 (free extension until Monday

More information

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL)

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL) Part 4: Active-set methods for linearly constrained optimization Nick Gould RAL fx subject to Ax b Part C course on continuoue optimization LINEARLY CONSTRAINED MINIMIZATION fx subject to Ax { } b where

More information

AN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING

AN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING AN AUGMENTED LAGRANGIAN AFFINE SCALING METHOD FOR NONLINEAR PROGRAMMING XIAO WANG AND HONGCHAO ZHANG Abstract. In this paper, we propose an Augmented Lagrangian Affine Scaling (ALAS) algorithm for general

More information

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes Optimization Charles J. Geyer School of Statistics University of Minnesota Stat 8054 Lecture Notes 1 One-Dimensional Optimization Look at a graph. Grid search. 2 One-Dimensional Zero Finding Zero finding

More information

On Lagrange multipliers of trust-region subproblems

On Lagrange multipliers of trust-region subproblems On Lagrange multipliers of trust-region subproblems Ladislav Lukšan, Ctirad Matonoha, Jan Vlček Institute of Computer Science AS CR, Prague Programy a algoritmy numerické matematiky 14 1.- 6. června 2008

More information

CONSTRAINED NONLINEAR PROGRAMMING

CONSTRAINED NONLINEAR PROGRAMMING 149 CONSTRAINED NONLINEAR PROGRAMMING We now turn to methods for general constrained nonlinear programming. These may be broadly classified into two categories: 1. TRANSFORMATION METHODS: In this approach

More information

Arc Search Algorithms

Arc Search Algorithms Arc Search Algorithms Nick Henderson and Walter Murray Stanford University Institute for Computational and Mathematical Engineering November 10, 2011 Unconstrained Optimization minimize x D F (x) where

More information

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems

UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1 1 Introduction

More information

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

Sub-Sampled Newton Methods I: Globally Convergent Algorithms Sub-Sampled Newton Methods I: Globally Convergent Algorithms arxiv:1601.04737v3 [math.oc] 26 Feb 2016 Farbod Roosta-Khorasani February 29, 2016 Abstract Michael W. Mahoney Large scale optimization problems

More information

Quasi-Newton Methods

Quasi-Newton Methods Newton s Method Pros and Cons Quasi-Newton Methods MA 348 Kurt Bryan Newton s method has some very nice properties: It s extremely fast, at least once it gets near the minimum, and with the simple modifications

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Second Order Optimization Algorithms I

Second Order Optimization Algorithms I Second Order Optimization Algorithms I Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye Chapters 7, 8, 9 and 10 1 The

More information

Survey of NLP Algorithms. L. T. Biegler Chemical Engineering Department Carnegie Mellon University Pittsburgh, PA

Survey of NLP Algorithms. L. T. Biegler Chemical Engineering Department Carnegie Mellon University Pittsburgh, PA Survey of NLP Algorithms L. T. Biegler Chemical Engineering Department Carnegie Mellon University Pittsburgh, PA NLP Algorithms - Outline Problem and Goals KKT Conditions and Variable Classification Handling

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Unconstrained Optimization

Unconstrained Optimization 1 / 36 Unconstrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University February 2, 2015 2 / 36 3 / 36 4 / 36 5 / 36 1. preliminaries 1.1 local approximation

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Preconditioned conjugate gradient algorithms with column scaling

Preconditioned conjugate gradient algorithms with column scaling Proceedings of the 47th IEEE Conference on Decision and Control Cancun, Mexico, Dec. 9-11, 28 Preconditioned conjugate gradient algorithms with column scaling R. Pytla Institute of Automatic Control and

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

1. Introduction and motivation. We propose an algorithm for solving unconstrained optimization problems of the form (1) min

1. Introduction and motivation. We propose an algorithm for solving unconstrained optimization problems of the form (1) min LLNL IM Release number: LLNL-JRNL-745068 A STRUCTURED QUASI-NEWTON ALGORITHM FOR OPTIMIZING WITH INCOMPLETE HESSIAN INFORMATION COSMIN G. PETRA, NAIYUAN CHIANG, AND MIHAI ANITESCU Abstract. We present

More information

Constrained Optimization Theory

Constrained Optimization Theory Constrained Optimization Theory Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. IMA, August 2016 Stephen Wright (UW-Madison) Constrained Optimization Theory IMA, August

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information