Convex Analysis with Applications
UBC Math 604 Lecture Notes by Philip D. Loewen

In trust region methods, we minimize a quadratic model function M = M(p) over the set of all p ∈ R^n satisfying a constraint

    g(p) := (1/2)(‖p‖^2 - Δ^2) ≤ 0.

(Here Δ > 0 is given.) In cases where M is convex, there is a nice theory for this problem; the theory has much more general applicability too.

A. Convex Sets and Functions

Definition. A subset S of a real vector space X is called convex when, for every pair of distinct points x_0, x_1 in S, one has

    x_t := (1 - t)x_0 + t x_1 ∈ S   for all t ∈ (0, 1).

We call S strictly convex if in fact one has x_t ∈ int S above.

Definition. Let X be a real vector space containing a convex subset S. A function f: S → R is called convex when, for every pair of distinct points x_0, x_1 in S, one has

    f((1 - t)x_0 + t x_1) ≤ (1 - t)f(x_0) + t f(x_1)   for all t ∈ (0, 1).

If this statement holds with strict inequality, then f is called strictly convex on S. Geometrically, a convex function is one whose chords lie on or above its graph; a strictly convex function is one whose chords lie strictly above the graph except at their endpoints.

Example. Given a row vector g of length n, and an n × n matrix H = H^T, take X = R^n and S = X. Then f(x) := gx + (1/2)x^T H x is convex iff H ≥ 0, and strictly convex iff H > 0.

Proof. Pick x_0, x_1 ∈ R^n distinct. Let v = x_1 - x_0, and consider, for t ∈ (0, 1),

    x_t = (1 - t)x_0 + t x_1 = x_0 + t(x_1 - x_0) = x_0 + tv.

We have

    [(1 - t)f(x_0) + t f(x_1)] - f(x_t)
      = (1 - t)[gx_0 + (1/2)x_0^T H x_0] + t[gx_1 + (1/2)x_1^T H x_1] - [gx_t + (1/2)x_t^T H x_t]
      = (1/2)(1 - t)x_0^T H x_0 + (1/2)t x_1^T H x_1 - (1/2)[(1 - t)^2 x_0^T H x_0 + 2t(1 - t)x_0^T H x_1 + t^2 x_1^T H x_1]
      = (1/2)[(1 - t) - (1 - t)^2] x_0^T H x_0 + (1/2)[t - t^2] x_1^T H x_1 - t(1 - t)x_0^T H x_1
      = (1/2)t(1 - t)(x_1 - x_0)^T H (x_1 - x_0) = (1/2)t(1 - t)v^T H v.

Now v ≠ 0 by assumption, so if H > 0 then the right side is positive whenever t ∈ (0, 1). If we know only H ≥ 0, then at least the right side is nonnegative in this interval. ////
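The identity derived in this proof is easy to sanity-check numerically. Here is a minimal sketch (mine, not part of the notes) in Python/NumPy; the matrix H, row vector g, and points x_0, x_1 are arbitrary choices. It confirms that the chord gap [(1 - t)f(x_0) + t f(x_1)] - f(x_t) equals (1/2)t(1 - t)v^T H v.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
H = A + A.T                      # symmetric, not necessarily positive semidefinite
g = rng.standard_normal(n)       # plays the role of the row vector g

def f(x):
    # f(x) = g x + (1/2) x^T H x
    return g @ x + 0.5 * x @ H @ x

x0, x1 = rng.standard_normal(n), rng.standard_normal(n)
v = x1 - x0
for t in (0.25, 0.5, 0.9):
    xt = (1 - t) * x0 + t * x1
    chord_gap = (1 - t) * f(x0) + t * f(x1) - f(xt)
    identity = 0.5 * t * (1 - t) * v @ H @ v
    assert np.isclose(chord_gap, identity)
print("chord gap equals (1/2) t (1-t) v^T H v at every sampled t")
```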

Notice that the definition of convexity makes no reference to differentiability, and indeed many convex functions fail to be differentiable. A simple example of this is f_1(x) = |x| on X = R; note that f_1(x) = ∫_0^x sgn(t) dt. A more complicated example is f_2(x) := ∫_0^x ⌈t⌉ dt. The graph of f_2 is shown below: it has corners at every integer value of x.

[Figure: graph of f_2(x) := ∫_0^x ⌈t⌉ dt, with corners at every integer value of x.]

Operations Preserving Convexity. Here are several ways to combine functions that respect the property of convexity. In each of them, X is a real vector space with convex subset D. (Confirming these statements is an easy exercise: please try it!)

(a) Positive Linear Combinations: If the functions f_1, f_2, ..., f_n are convex on D, and c_1, c_2, ..., c_n are positive real numbers, then the sum function f(x) := c_0 + c_1 f_1(x) + c_2 f_2(x) + ... + c_n f_n(x) is well-defined and convex on the set D. Moreover, if any one of the functions f_k happens to be strictly convex, then the sum function f will be strictly convex too.

(b) Restriction to Convex Subsets: If f is convex on D and S is a convex subset of D, then the restriction of f to S is convex on S. (An important particular case arises when D = X is the whole space and S is some affine subspace of X: a function convex on X is automatically convex on S.)

(c) Pointwise Maxima: If f_1, f_2, ..., f_n are convex on D, then so is their maximum function f(x) := max{f_1(x), f_2(x), ..., f_n(x)}. (A similar result holds for the maximum of an infinite number of convex functions, under suitable hypotheses.)
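As a small illustration of item (c), and not part of the original notes, the sketch below checks the defining chord inequality numerically for the pointwise maximum of two convex functions of one variable, |x| and x^2; the sample points are arbitrary.

```python
import numpy as np

f1 = abs                      # convex, not differentiable at 0
f2 = lambda x: x ** 2         # convex and smooth
fmax = lambda x: max(f1(x), f2(x))

rng = np.random.default_rng(1)
for _ in range(1000):
    x0, x1 = rng.uniform(-3, 3, size=2)
    t = rng.uniform(0, 1)
    xt = (1 - t) * x0 + t * x1
    # chord inequality: fmax(xt) <= (1-t) fmax(x0) + t fmax(x1)
    assert fmax(xt) <= (1 - t) * fmax(x0) + t * fmax(x1) + 1e-12
print("pointwise max of two convex functions satisfies the chord inequality")
```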

B. Convexity and its Characterizations

The First-Derivative Test.

Theorem (Subgradient Inequality). Let X be a real vector space containing a convex subset S. Let the function f: X → R be Gâteaux differentiable at every point of S.
(a) f is convex on S if and only if ∇f(x)(y - x) ≤ f(y) - f(x) for any distinct x and y in S.
(b) f is strictly convex on S if and only if ∇f(x)(y - x) < f(y) - f(x) for any distinct x and y in S.

Proof. (a), (b) (⇐) Let x_0 and x_1 be distinct points of S, and let t ∈ (0, 1) be given. Define x_t := (1 - t)x_0 + t x_1. In statement (b), the assumption is equivalent to f(x) < f(y) + ∇f(x)(x - y) for any x ≠ y: choosing x = x_t here, and then applying the inequality with y = x_0 and y = x_1 in turn, we get

    f(x_t) < f(x_0) + ∇f(x_t)(x_t - x_0),   (∗)
    f(x_t) < f(x_1) + ∇f(x_t)(x_t - x_1).   (∗∗)

Multiplying (∗) by (1 - t) and (∗∗) by t and adding the resulting inequalities produces

    f(x_t) < t f(x_1) + (1 - t)f(x_0) + ∇f(x_t)(t(x_t - x_1) + (1 - t)(x_t - x_0)).   (†)

A simple calculation reveals that the operand of ∇f(x_t) appearing here is the zero vector, so this reduces to the inequality defining the strict convexity of f on S. In statement (a), the assumed inequality is non-strict, but exactly the same steps lead to a non-strict version of conclusion (†). This confirms the ordinary convexity of f on S.

(a) (⇒) Suppose f is convex on S. Let two distinct points x and y in S be given. Then for any t ∈ (0, 1), the definition of convexity gives

    f(x + t(y - x)) = f((1 - t)x + ty) ≤ (1 - t)f(x) + t f(y) = f(x) + t(f(y) - f(x)).

Rearranging this gives

    [f(x + t(y - x)) - f(x)] / t ≤ f(y) - f(x).

In the limit as t → 0+, the left side converges to the directional derivative f'[x; y - x], which equals ∇f(x)(y - x) by the definition of the Gâteaux derivative. Thus we have

    ∇f(x)(y - x) ≤ f(y) - f(x).   (‡)

This argument applies to any distinct points x and y in S, so the result follows.

(b) (⇒) If f is known to be strictly convex on S, then we certainly obtain (‡). It remains only to show that equality cannot hold. Indeed, if x_0 and x_1 are distinct points of S where one has f(x_1) = f(x_0) + ∇f(x_0)(x_1 - x_0), the definition of convexity implies that for all t in (0, 1),

    f(x_t) ≤ (1 - t)f(x_0) + t f(x_1)
          = (1 - t)f(x_0) + t[f(x_0) + ∇f(x_0)(x_1 - x_0)]
          = f(x_0) + t ∇f(x_0)(x_1 - x_0) = f(x_0) + ∇f(x_0)(x_t - x_0) ≤ f(x_t).

(The last inequality is a consequence of (‡).) This chain of inequalities reveals that

    f(x_t) = (1 - t)f(x_0) + t f(x_1)   for all t ∈ (0, 1),

which cannot happen for a strictly convex function f. ////

The inequality at the heart of the previous Theorem, namely

    ∇f(x)(y - x) ≤ f(y) - f(x)   for all y ∈ X,

is called the subgradient inequality. Geometrically, it says that the tangent hyperplane to the graph of f at the point (x, f(x)) lies on or below the graph of f at every point y in X. (Write f(y) ≥ f(x) + ∇f(x)(y - x) to see this.) The name also helps one remember the direction of the inequality: the gradient ∇f(x) is always on the small side. It is worth noting that the subgradient inequality written above follows from the convexity of f using the differentiability property only at the point x; see the proof of (a) above.

The Second-Derivative Test. We will never need second derivatives in general vector spaces, so we focus on the finite-dimensional case (X = R^n) in the next result. Here we keep the familiar notation ∇f(x) for the gradient, and write ∇²f(x) for the usual Hessian matrix of mixed partial derivatives of f at the point x:

    [∇²f(x)]_{ij} = ∂²f / ∂x_i ∂x_j.

We assume f ∈ C²(R^n), so the order of mixed differentiation does not matter and the Hessian matrix is symmetric. The shorthand ∇²f(x) > 0 means that this matrix is positive definite, i.e.,

    v^T ∇²f(x) v > 0   for all v ∈ R^n, v ≠ 0.   (B.1)

(A similar understanding governs the notation ∇²f(x) ≥ 0.)
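As a quick numerical aside on the subgradient inequality above (my own sketch, not from the notes), the snippet below checks f(y) ≥ f(x) + ∇f(x)(y - x) at random point pairs for the smooth convex function f(x) = ‖x‖² on R³.

```python
import numpy as np

def f(x):
    return float(x @ x)          # f(x) = ||x||^2, convex and smooth

def grad_f(x):
    return 2.0 * x               # gradient of ||x||^2

rng = np.random.default_rng(2)
for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    # subgradient inequality: the tangent plane at x lies on or below the graph
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-12
print("subgradient inequality holds at all sampled pairs")
```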

Theorem. Let f: R^n → R be of class C².
(a) f is convex on R^n if and only if ∇²f(x) ≥ 0 for all x in R^n;
(b) If ∇²f(x) > 0 for all x in R^n, then f is strictly convex on R^n.

Proof. (b) Pick any two distinct points x and y in R^n. Taylor's theorem says that there is some point ξ on the line segment joining x to y for which

    f(y) = f(x) + ∇f(x)(y - x) + (1/2)(y - x)^T ∇²f(ξ)(y - x) > f(x) + ∇f(x)(y - x).

The inequality here holds because ∇²f(ξ) > 0 by assumption. Since x and y in X are arbitrary, this establishes the hypothesis of the previous Theorem (part (b)): the strict convexity of f follows.

(a) (⇐) Just rewrite the proof of (b) with ≥ in place of >.

(⇒) Let us prove first that the operator ∇f is monotone, i.e., that

    (∇f(y) - ∇f(x))(y - x) ≥ 0   for all x, y ∈ R^n.   (∗)

To see this, just write down the subgradient inequality twice and add:

    f(y) - f(x) ≥ ∇f(x)(y - x),
    f(x) - f(y) ≥ ∇f(y)(x - y),
    0 ≥ (∇f(x) - ∇f(y))(y - x).

Rearranging the sum produces (∗).

Now fix any x and v in R^n, and put y = x + tv into (∗), assuming t > 0. It follows that

    (∇f(x + tv) - ∇f(x))(v) ≥ 0   for all t > 0.

Take g(t) := ∇f(x + tv)(v): then the previous inequality implies, upon division by t > 0, that

    [g(t) - g(0)] / t ≥ 0   for all t > 0.

Now f is C², so g is C¹: hence the limit of the left-hand side as t → 0+ exists and equals g'(0). We deduce that g'(0) ≥ 0, i.e., v^T ∇²f(x) v ≥ 0. Since v ∈ R^n was arbitrary, this shows that ∇²f(x) ≥ 0, as required. ////

The second-derivative test, together with the simple convexity-preserving operations mentioned above, is an easy way to confirm (or refute) the convexity of a given function. For example, when X = R^n, the quadratic function f(x) := x^T H x + gx + c, built around a symmetric n × n matrix H, a row vector g, and a constant c, will be convex if and only if H ≥ 0, and strictly convex if and only if H > 0.
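In finite dimensions the second-derivative test for a quadratic reduces to an eigenvalue check on H. Here is a minimal sketch (mine; the sample matrices are arbitrary) classifying f(x) = x^T H x + gx + c by the spectrum of H.

```python
import numpy as np

def classify_quadratic(H, tol=1e-10):
    """Classify f(x) = x^T H x + g x + c by the eigenvalues of the symmetric matrix H."""
    eig = np.linalg.eigvalsh(H)          # eigenvalues of H, in ascending order
    if eig[0] > tol:
        return "strictly convex (H > 0)"
    if eig[0] >= -tol:
        return "convex (H >= 0)"
    return "not convex"

print(classify_quadratic(np.array([[2.0, 0.0], [0.0, 1.0]])))   # strictly convex
print(classify_quadratic(np.array([[1.0, 1.0], [1.0, 1.0]])))   # convex but not strictly
print(classify_quadratic(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```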

Exercises. (1) Consider the 2 × 2 matrix H = [a b; b d]. Prove that

    H > 0  ⟺  [a > 0, d > 0, and ad - b² > 0].   (B.2)

Then show that (B.2) remains true if all strict inequalities (">") are replaced by non-strict ones ("≥").

(2) Consider the 2n × 2n matrix H = [A B; B^T D], in which A = A^T, B, and D = D^T are n × n matrices. Show that

    H > 0  ⟺  [A > 0, D > 0, and A - BD⁻¹B^T > 0].

Here the matrix inequalities are interpreted as in (B.1). Hints: (i) D > 0 implies the existence of D⁻¹; (ii) D = D^T implies (D⁻¹)^T = (D^T)⁻¹; and (iii) if D is invertible then

    H = [I  BD⁻¹; 0  I] [A - BD⁻¹B^T  0; 0  D] [I  0; D⁻¹B^T  I].

C. Convexity as a Sufficient Condition

General Theory

Proposition. Let S be an affine subset of a real vector space X. Suppose that f: S → R is convex on S. If x̄ is a point of S where f is Gâteaux differentiable, then the following statements are equivalent:
(a) x̄ minimizes f over S, i.e., f(x̄) ≤ f(x) for all x in S;
(b) x̄ is a critical point for f relative to S, i.e., ∇f(x̄)(h) = 0 for all h in the subspace V = S - x̄.

Proof. (a ⇒ b) Proved much earlier, without using convexity. (b ⇒ a) Note that for any point x of S, the difference h = x - x̄ lies in the subspace V of the statement. Hence the subgradient inequality, together with condition (b), gives

    f(x) ≥ f(x̄) + ∇f(x̄)(x - x̄) = f(x̄) + 0   for all x ∈ S. ////

Corollary. If f: S → R is strictly convex on some affine subset S of a real vector space X, and x̄ is a critical point for f relative to S, then x̄ minimizes f over S uniquely, i.e., f(x̄) < f(x) for all x ∈ S, x ≠ x̄.

D. Separation Theorem

Separation. Let U, V be convex subsets of R^n. If the intersection U ∩ V contains at most one point, then there exists α ≠ 0 in R^n such that

    α^T u ≤ α^T v   for all u ∈ U, v ∈ V.

More generally, the existence of α ≠ 0 is assured whenever (int U) ∩ (int V) = ∅.
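Exercise (2) is easy to spot-check numerically. The sketch below is mine (with an arbitrarily generated example, not taken from the notes): it builds a block matrix with D > 0 and compares positive definiteness of H against the Schur-complement criterion.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
B = rng.standard_normal((n, n))          # off-diagonal block
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)              # symmetric positive definite
D = rng.standard_normal((n, n))
D = D @ D.T + n * np.eye(n)              # symmetric positive definite

H = np.block([[A, B], [B.T, D]])
schur = A - B @ np.linalg.inv(D) @ B.T   # Schur complement A - B D^{-1} B^T

def is_pd(X, tol=1e-10):
    return np.linalg.eigvalsh(X).min() > tol

# Exercise (2): H > 0  iff  A > 0, D > 0, and A - B D^{-1} B^T > 0
assert is_pd(H) == (is_pd(A) and is_pd(D) and is_pd(schur))
print("H > 0:", is_pd(H), "  A - B D^{-1} B^T > 0:", is_pd(schur))
```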

E. Inequality-Constrained Convex Programming

Problem Statement. Differentiable convex functions F, G: R^n → R are given. We want to

    min {F(x) : G(x) ≤ 0}.

Sufficient Conditions. Given a point x̄ ∈ R^n with G(x̄) ≤ 0, the following conditions guarantee that x̄ solves the problem above: there exists λ ≥ 0 such that

    (i) λG(x̄) = 0,
    (ii) the function x ↦ F(x) + λG(x) has gradient 0 at x̄.   (∗)

Note that the function described in (ii) is convex, so having gradient 0 at x̄ is the same as having a global minimum at x̄.

Proof. Since λ ≥ 0 and both F and G are convex, F + λG is convex, so every critical point of F + λG is a global minimizer:

    F(x̄) + λG(x̄) ≤ F(x) + λG(x)   for all x ∈ R^n.

Now λG(x̄) = 0 by (i), so for any x ∈ R^n satisfying the constraint G(x) ≤ 0,

    F(x̄) = F(x̄) + λG(x̄) ≤ F(x) + λG(x) ≤ F(x).

This completes the proof. ////

Necessary Conditions. Suppose x̄ ∈ R^n solves the problem above. Then one of these two statements holds:

    (i) G(x̄) = 0 and ∇G(x̄) = 0. [ABNORMAL: the constraint is degenerate.]
    (ii) Conclusion (∗) above holds for x̄. [NORMAL]

Proof. If G(x̄) < 0, then G(x) < 0 for all x near x̄, so x̄ is an unconstrained local minimum for F. Hence ∇F(x̄) = 0. Case (ii) holds, because λ = 0 works in (∗).

So assume that G(x̄) = 0. [Note that conclusion (i) in (∗) is then automatic.] Define

    U = {(F(x̄) - β, G(x̄) - γ) : β > 0, γ ≥ 0},
    V = {(F(x) + r, G(x) + s) : x ∈ R^n, r, s ≥ 0}.

These are convex sets in R². [Practice: if v_0, v_1 ∈ V, say v_i = (F(x_i) + r_i, G(x_i) + s_i), show v_t = (1 - t)v_0 + t v_1 ∈ V using the convexity inequalities for F and G.] Note that U ∩ V = ∅: if (p, q) ∈ U ∩ V, then there must be some specific x ∈ R^n and β > 0, γ ≥ 0, r ≥ 0, s ≥ 0 such that

    p = F(x̄) - β = F(x) + r, i.e., F(x) = F(x̄) - β - r < F(x̄),
    q = G(x̄) - γ = G(x) + s, i.e., G(x) = G(x̄) - γ - s ≤ 0.

The existence of such an x contradicts the optimality of x̄. So pick α = (α_0, α_1) ≠ (0, 0) such that α^T u ≤ α^T v for all u ∈ U, v ∈ V. Expand α^T(v - u) ≥ 0:

    0 ≤ α_0[F(x) + r - F(x̄) + β] + α_1[G(x) + s - G(x̄) + γ]   for all x ∈ R^n, β > 0, γ, r, s ≥ 0.   (∗∗)

Pick x = x̄, r = 0 = s:

    0 ≤ α_0 β + α_1 γ   for all β > 0, γ ≥ 0.

This forces α_0 ≥ 0 and α_1 ≥ 0.

If α_0 = 0, then α_1 > 0 (since the vector α ≠ 0). Take s = 0 = γ in (∗∗) to get

    0 ≤ G(x) - G(x̄)   for all x ∈ R^n,

so x̄ minimizes G over R^n and hence ∇G(x̄) = 0. This is the abnormal case.

Assume α_0 > 0. Define λ = α_1 / α_0 ≥ 0. Taking γ = 0 = r = s in (∗∗) and rearranging gives

    F(x̄) + λG(x̄) ≤ F(x) + λG(x) + β   for all x ∈ R^n, β > 0.

This implies the same inequality even when β = 0, and shows that x ↦ F(x) + λG(x) has a global minimum at x̄. Consequently its gradient there is 0: that is condition (ii) in (∗). ////

To illustrate this proof, here is a sketch of the set V for the case of F(x, y) = (x - 2)² + (y - 2)² and G(x, y) = x² + y² - 6. The pairs (F(x, y), G(x, y)) are shaded in gray; points included only because of the additional (r, s) in the definition are shown in black.

[Figure: the set V in the (F, G)-plane, with f(x, y) = (x - 2)² + (y - 2)² on the horizontal axis and g(x, y) = x² + y² - 6 on the vertical axis.]

Sensitivity. You can see from the picture that, in the presence of sufficient smoothness, the multiplier λ has the following interpretation: λ = -V'(0), where

    V(z) = min {F(x) : G(x) = z}.
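To make the sufficient conditions concrete for the example pictured above, here is a small sketch (mine, not from the notes) verifying conditions (i) and (ii) at the point x̄ = (√3, √3) with multiplier λ = 2/√3 - 1; both values come from elementary algebra for this particular F and G (by symmetry the minimizer lies on the line y = x, and the constraint is active because the unconstrained minimizer (2, 2) is infeasible).

```python
import numpy as np

# Example from the notes: minimize F(x,y) = (x-2)^2 + (y-2)^2 subject to G(x,y) = x^2 + y^2 - 6 <= 0.
xbar = np.array([np.sqrt(3.0), np.sqrt(3.0)])   # candidate minimizer on the constraint boundary
lam = 2.0 / np.sqrt(3.0) - 1.0                  # candidate multiplier, from 2(x-2) + 2*lam*x = 0

def G(p):
    return p @ p - 6.0

grad_F = 2.0 * (xbar - 2.0)          # gradient of F at xbar
grad_G = 2.0 * xbar                  # gradient of G at xbar

assert lam >= 0
assert abs(lam * G(xbar)) < 1e-12                    # (i)  complementary slackness
assert np.allclose(grad_F + lam * grad_G, 0.0)       # (ii) gradient of F + lam*G vanishes at xbar
print("lambda =", lam)
```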

Here V is the value function associated with our problem. Another proof: we have

    V(G(x)) ≤ F(x) for all x,   while   V(G(x̄)) = F(x̄),

so x ↦ F(x) - V(G(x)) has a minimum at x̄, and its gradient there vanishes:

    0 = ∇F(x̄) - V'(G(x̄))∇G(x̄) = ∇F(x̄) - V'(0)∇G(x̄),   since G(x̄) = 0.

Comparing this with ∇(F + λG)(x̄) = 0 identifies λ = -V'(0).

The Nonconvex Case. Similar results hold when F, G are nonconvex, but the proofs are much harder. (All results are expressed in terms of critical points rather than global minima.) Detailed discussion postponed.

F. Application to the Trust-Region Subproblem

Recall our problem, with Δ > 0 given in advance:

    minimize    M(p) := (1/2)p^T H p + gp + c
    subject to  ‖p‖ ≤ Δ.

This has the form above, with a quadratic function M and a constraint G(p) ≤ 0, where G(p) = (1/2)(‖p‖² - Δ²). Note

    ∇M(p) = g + p^T H,   ∇G(p) = p^T.

Theorem. A vector p̄ ∈ B[0; Δ] solves the problem above if and only if there exists λ ≥ 0 satisfying all of

    (i)   λ(‖p̄‖ - Δ) = 0,
    (ii)  (H + λI)p̄ = -g^T,
    (iii) H + λI ≥ 0.

Proof. (⇐) Suppose some p̄ obeys (i)-(iii). Then (iii) implies that the quadratic function M + λG is convex, and (i)-(ii) show that p̄ is a critical point for this function with λG(p̄) = 0. The sufficient conditions above imply that p̄ gives the constrained minimum.

(⇒) Now suppose p̄ ∈ B[0; Δ] gives the minimum. If ‖p̄‖ < Δ, then conditions (i)-(iii) hold for λ = 0. So assume ‖p̄‖ = Δ. Apply the Necessary Conditions above. The abnormal case is impossible: it requires both G(p̄) = 0 (so ‖p̄‖ = Δ > 0) and ∇G(p̄) = 0 (so p̄ = 0), and these are incompatible. Hence there must be some λ ≥ 0 such that λG(p̄) = 0 and

    0 = ∇(M + λG)(p̄) = g + p̄^T H + λp̄^T,   i.e.,   (H + λI)p̄ = -g^T.

This proves (i)-(ii). To get (iii), use (ii) to help compare the minimizer p̄ with general

elements p obeying ‖p‖ = Δ:

    0 ≤ M(p) - M(p̄)
      = (1/2)p^T H p - (1/2)p̄^T H p̄ + g(p - p̄)
      = (1/2)p^T H p - (1/2)p̄^T H p̄ - p̄^T(H + λI)(p - p̄)
      = (1/2)(p^T(H + λI)p - λ‖p‖²) - (1/2)(p̄^T(H + λI)p̄ - λ‖p̄‖²) - p̄^T(H + λI)(p - p̄)
      = (1/2)[p^T(H + λI)p - 2p̄^T(H + λI)p + p̄^T(H + λI)p̄] + (λ/2)[‖p̄‖² - ‖p‖²]
      = (1/2)(p - p̄)^T(H + λI)(p - p̄) + 0.

The last equation holds because ‖p‖ = Δ = ‖p̄‖. Since the resulting inequality holds for all p with ‖p‖ = Δ, we deduce that H + λI ≥ 0. ////

So to calculate the exact solution p̄ for the subproblem, first try λ = 0: if H ≥ 0 already, and the solution p of Hp = -g^T obeys ‖p‖ ≤ Δ, then that is the answer. Otherwise, we define

    p(λ) = -(H + λI)⁻¹ g^T,

noting that H + λI > 0 for all λ > 0 sufficiently large, and then reduce λ until ‖p(λ)‖ = Δ. This is a 1D root-finding problem in the variable λ.

Exact Search for λ. This works even when H is not positive definite. The derivation uses the orthogonal diagonalization of H:

    H = QΛQ^T, where Λ = diag(λ_1, λ_2, ..., λ_n) and λ_1 ≤ λ_2 ≤ ... ≤ λ_n.

Each column of Q is a unit vector, and they are all orthogonal: QQ^T = I = Q^T Q. Hence

    H + λI = QΛQ^T + λQQ^T = Q(Λ + λI)Q^T,
    p(λ) = -(H + λI)⁻¹ g^T = -Q(Λ + λI)⁻¹ Q^T g^T = -Σ_{j=1}^n [gq_j / (λ_j + λ)] q_j.

Calculating p(λ)^T p(λ) either as a double sum or as the expansion of

    ‖p(λ)‖² = gQ(Λ + λI)⁻¹ Q^T Q(Λ + λI)⁻¹ Q^T g^T = gQ diag((λ_j + λ)⁻²) Q^T g^T

leads to

    ‖p(λ)‖² = Σ_{j=1}^n (gq_j)² / (λ + λ_j)².

The exact solution to the subproblem corresponds to the unique λ ≥ -λ_1 such that ‖p(λ)‖² = Δ². How do we find this λ? Suppose all eigenvalues are distinct, and gq_1 > 0. Then the right side converges to 0 as λ → ∞ and to +∞ as λ → (-λ_1)+. (Sketch.) The goal is simply to pick λ ∈ [-λ_1, +∞) to arrange ‖p(λ)‖ = Δ, recalling

    p(λ) = -Σ_{j=1}^n [gq_j / (λ + λ_j)] q_j.
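The eigenvalue formula for ‖p(λ)‖ translates directly into code. Below is a minimal sketch (mine, not from the notes) with an arbitrary indefinite H, arbitrary g, and Δ = 1; it uses a simple bisection in place of the Newton scheme described next, finds the λ > -λ_1 with ‖p(λ)‖ = Δ, and recovers p(λ).

```python
import numpy as np

# Arbitrary data for illustration: the -3 on the diagonal guarantees H is indefinite.
H = np.array([[ 2.0,  1.0, 0.0],
              [ 1.0, -3.0, 1.0],
              [ 0.0,  1.0, 1.0]])
g = np.array([1.0, 0.5, -1.0])
Delta = 1.0

eigvals, Q = np.linalg.eigh(H)      # H = Q diag(eigvals) Q^T, eigenvalues ascending
gq = g @ Q                          # the scalars g q_j

def p_norm(lam):
    # ||p(lambda)||^2 = sum_j (g q_j)^2 / (lambda + lambda_j)^2
    return np.sqrt(np.sum(gq ** 2 / (lam + eigvals) ** 2))

# ||p(lambda)|| decreases from +infinity to 0 on (-lambda_1, +infinity): bracket, then bisect.
offset = 1.0
while p_norm(-eigvals[0] + offset) > Delta:
    offset *= 2.0
lo, hi = -eigvals[0] + 1e-12, -eigvals[0] + offset
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if p_norm(mid) > Delta else (lo, mid)

lam = 0.5 * (lo + hi)
p = np.linalg.solve(H + lam * np.eye(3), -g)   # p(lambda) = -(H + lambda I)^{-1} g^T
print("lambda =", lam, " ||p(lambda)|| =", np.linalg.norm(p), " Delta =", Delta)
```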

Now λ ↦ ‖p(λ)‖ is very steep near -λ_1, so Newton's method may work badly on this problem. A nice trick is to turn everything upside down: apply Newton's method to find a zero of

    F(λ) = 1/‖p(λ)‖ - 1/Δ,
    λ^(l+1) = λ^(l) - F(λ^(l)) / F'(λ^(l)).

Using our earlier expression for ‖p(λ)‖², we get

    F'(λ) = -(1/‖p(λ)‖²) d/dλ ‖p(λ)‖
          = -(1/‖p(λ)‖²) (1/(2‖p(λ)‖)) d/dλ ‖p(λ)‖²
          = -(1/(2‖p(λ)‖³)) Σ_{j=1}^n [-2(gq_j)² / (λ + λ_j)³]
          = (1/‖p(λ)‖³) Σ_{j=1}^n (gq_j)² / (λ + λ_j)³.

Final result (Nocedal and Wright, page 81): Start with λ^(0) large enough that H + λ^(0)I > 0. Repeat for l = 0, 1, 2, ...:

    Write λ = λ^(l) for simplicity.
    Find the upper triangular R such that H + λI = R^T R. This is the Cholesky factorization (Matlab chol).
    Solve for p using R^T R p = -g^T.
    If ‖p‖ is sufficiently near Δ, use this vector as an approximate solution to the subproblem. Otherwise, continue:
    Solve for z using R^T z = p.
    Use these vectors to build
        λ^(l+1) = λ + (‖p‖/‖z‖)² (‖p‖ - Δ)/Δ.
    Go around again.

Matlab help:

CHOL Cholesky factorization.
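For readers who want to see the iteration concretely, here is a minimal sketch (mine, not the authors' code) of the Cholesky-based update in Python/NumPy, reusing the same arbitrary H, g, and Δ as the earlier sketch. NumPy's cholesky returns a lower-triangular L with L L^T = H + λI, so R = L^T; the eigenvalue-based starting guess and the crude safeguard on λ are my additions for this sketch.

```python
import numpy as np

# Same arbitrary example data as the earlier sketch (not from the notes).
H = np.array([[ 2.0,  1.0, 0.0],
              [ 1.0, -3.0, 1.0],
              [ 0.0,  1.0, 1.0]])
g = np.array([1.0, 0.5, -1.0])
Delta = 1.0

lam_floor = -np.linalg.eigvalsh(H)[0]   # H + lam*I > 0 requires lam > -lambda_1
lam = max(0.0, lam_floor) + 1.0         # start with lam large enough that H + lam*I > 0

for _ in range(50):
    L = np.linalg.cholesky(H + lam * np.eye(3))       # H + lam*I = L L^T = R^T R, with R = L^T
    p = np.linalg.solve(L.T, np.linalg.solve(L, -g))  # solve R^T R p = -g^T
    if abs(np.linalg.norm(p) - Delta) <= 1e-10 * Delta:
        break                                         # ||p|| is sufficiently near Delta: done
    z = np.linalg.solve(L, p)                         # solve R^T z = p
    step = (np.linalg.norm(p) / np.linalg.norm(z)) ** 2 * (np.linalg.norm(p) - Delta) / Delta
    lam = max(lam + step, lam_floor + 1e-8)           # crude safeguard to keep H + lam*I > 0

print("lambda =", lam, " ||p|| =", np.linalg.norm(p), " Delta =", Delta)
```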

CHOL(X) uses only the diagonal and upper triangle of X. The lower triangular part is assumed to be the (complex conjugate) transpose of the upper. If X is positive definite, then R = CHOL(X) produces an upper triangular R so that R'*R = X. If X is not positive definite, an error message is printed.

Home practice: Reconcile the Newton's-method ideas above with the iteration scheme above.

The Hard Case. If the gradient g is orthogonal to the entire eigenspace of H corresponding to the smallest eigenvalue λ_1, then the behaviour of ‖p(λ)‖ is not as described above. The correct value of λ to use in this case is λ = -λ_1, and the corresponding p comes from adding an eigenspace component whose length is chosen appropriately to arrange ‖p‖ = Δ. Check the book for these details. (Or, for a rough first cut, use the Cauchy point and move rapidly to the next step, hoping the Hard Case is rather rare.)