Convex Analysis with Applications
UBC Math 604 Lecture Notes by Philip D. Loewen

In trust region methods, we minimize a quadratic model function $M = M(p)$ over the set of all $p \in \mathbb{R}^n$ satisfying a constraint

$$(*)\qquad g(p) \stackrel{\text{def}}{=} \tfrac12\left(\|p\|^2 - \Delta^2\right) \le 0.$$

(Here $\Delta > 0$ is given.) In cases where $M$ is convex, there is a nice theory for this problem; the theory has much more general applicability too.

A. Convex Sets and Functions

Definition. A subset $S$ of a real vector space $X$ is called convex when, for every pair of distinct points $x_0$, $x_1$ in $S$, one has
$$x_t := (1-t)x_0 + tx_1 \in S \qquad \forall t \in (0,1).$$
We call $S$ strictly convex if in fact one has $x_t \in \operatorname{int} S$ above.

Definition. Let $X$ be a real vector space containing a convex subset $S$. A function $f\colon S \to \mathbb{R}$ is called convex when, for every pair of distinct points $x_0$, $x_1$ in $S$, one has
$$f((1-t)x_0 + tx_1) \le (1-t)f(x_0) + tf(x_1) \qquad \forall t \in (0,1).$$
If this statement holds with strict inequality, then $f$ is called strictly convex on $S$.

Geometrically, a convex function is one whose chords lie on or above its graph; a strictly convex function is one whose chords lie strictly above the graph except at their endpoints.

Example. Given a row vector $g$ of length $n$, and an $n \times n$ matrix $H = H^T$, take $X = \mathbb{R}^n$ and $S = X$. Then $f(x) \stackrel{\text{def}}{=} gx + \tfrac12 x^T H x$ is convex iff $H \ge 0$, and strictly convex iff $H > 0$.

Proof. Pick $x_0, x_1 \in \mathbb{R}^n$ distinct. Let $v = x_1 - x_0$, and consider, for $t \in (0,1)$,
$$x_t = (1-t)x_0 + tx_1 = x_0 + t(x_1 - x_0) = x_0 + tv.$$
We have
$$\begin{aligned}
[(1-t)f(x_0) + tf(x_1)] - f(x_t)
&= (1-t)\left[gx_0 + \tfrac12 x_0^T H x_0\right] + t\left[gx_1 + \tfrac12 x_1^T H x_1\right] - \left[gx_t + \tfrac12 x_t^T H x_t\right] \\
&= \tfrac12(1-t)x_0^T H x_0 + \tfrac12 t\, x_1^T H x_1 - \tfrac12\left[(1-t)^2 x_0^T H x_0 + 2t(1-t)x_0^T H x_1 + t^2 x_1^T H x_1\right] \\
&= \tfrac12\left[(1-t) - (1-t)^2\right]x_0^T H x_0 + \tfrac12\left[t - t^2\right]x_1^T H x_1 - t(1-t)x_0^T H x_1 \\
&= \tfrac12 t(1-t)(x_1 - x_0)^T H (x_1 - x_0) = \tfrac12 t(1-t)v^T H v.
\end{aligned}$$
Now $v \ne 0$ by assumption, so if $H > 0$ then the right side is positive whenever $t \in (0,1)$. If we know only $H \ge 0$, then at least the right side is nonnegative in this interval. ////

File convex, version of 20 February 01, page 1.
Typeset at 09:13 February 20, 2001.
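The identity at the end of the proof above is easy to spot-check numerically. Here is a minimal sketch in Python with NumPy (the matrix $H$, the vector $g$, and the test points are arbitrary illustrative choices, not data from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
H = A.T @ A                       # H = H^T >= 0 (almost surely > 0)
g = rng.standard_normal(n)

def f(x):
    # f(x) = g x + (1/2) x^T H x, as in the Example
    return g @ x + 0.5 * x @ H @ x

x0, x1 = rng.standard_normal(n), rng.standard_normal(n)
v = x1 - x0
for t in (0.25, 0.5, 0.9):
    xt = (1 - t) * x0 + t * x1
    lhs = (1 - t) * f(x0) + t * f(x1) - f(xt)
    rhs = 0.5 * t * (1 - t) * (v @ H @ v)
    assert abs(lhs - rhs) < 1e-9  # the identity from the proof
    assert lhs >= 0               # convexity: the chord lies above the graph
```

Replacing `H = A.T @ A` with any indefinite symmetric matrix makes the first assertion still pass (the identity is algebraic) while the second one fails for suitable chords, matching the "iff" in the Example.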
Notice that the definition of convexity makes no reference to differentiability, and indeed many convex functions fail to be differentiable. A simple example of this is $f_1(x) = |x|$ on $X = \mathbb{R}$; note that $f_1(x) = \int_0^x \operatorname{sgn}(t)\,dt$. A more complicated example is $f_2(x) := \int_0^x \lceil t \rceil\,dt$. The graph of $f_2$ is piecewise linear, with corners at every integer value of $x$.

[Figure: graph of $f_2(x) := \int_0^x \lceil t \rceil\,dt$.]

Operations Preserving Convexity. Here are several ways to combine functions that respect the property of convexity. In each of them, $X$ is a real vector space with convex subset $D$. (Confirming these statements is an easy exercise: please try it!)

(a) Positive Linear Combinations: If the functions $f_1, f_2, \ldots, f_n$ are convex on $D$, and $c_1, c_2, \ldots, c_n$ are positive real numbers, then the sum function
$$f(x) := c_0 + c_1 f_1(x) + c_2 f_2(x) + \cdots + c_n f_n(x)$$
is well-defined and convex on the set $D$. Moreover, if any one of the functions $f_k$ happens to be strictly convex, then the sum function $f$ will be strictly convex too.

(b) Restriction to Convex Subsets: If $f$ is convex on $D$ and $S$ is a convex subset of $D$, then the restriction of $f$ to $S$ is convex on $S$. (An important particular case arises when $D = X$ is the whole space and $S$ is some affine subspace of $X$: a function convex on $X$ is automatically convex on $S$.)

(c) Pointwise Maxima: If $f_1, f_2, \ldots, f_n$ are convex on $D$, then so is their maximum function $f(x) := \max\{f_1(x), f_2(x), \ldots, f_n(x)\}$. (A similar result holds for the maximum of an infinite number of convex functions, under suitable hypotheses.)
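Operation (c) can be seen in action numerically. The sketch below builds the pointwise maximum of two convex functions on $\mathbb{R}$ (one of them nondifferentiable) and spot-checks the defining chord inequality on a grid; the specific functions are arbitrary illustrative choices:

```python
import numpy as np

# Two convex functions on R and their pointwise maximum (operation (c)).
f1 = lambda x: abs(x)            # convex, not differentiable at 0
f2 = lambda x: (x - 1.0) ** 2    # convex and smooth
fmax = lambda x: max(f1(x), f2(x))

# Spot-check the defining inequality of convexity along many chords.
xs = np.linspace(-3.0, 3.0, 13)
for x0 in xs:
    for x1 in xs:
        for t in (0.25, 0.5, 0.75):
            xt = (1 - t) * x0 + t * x1
            assert fmax(xt) <= (1 - t) * fmax(x0) + t * fmax(x1) + 1e-12
```

Note that `fmax` inherits corners both from `f1` and from the crossing points of the two graphs, so the maximum of smooth convex functions need not be smooth, exactly as the discussion above warns.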
B. Convexity and its Characterizations

The First-Derivative Test.

Theorem (Subgradient Inequality). Let $X$ be a real vector space containing a convex subset $S$. Let the function $f\colon X \to \mathbb{R}$ be Gâteaux differentiable at every point of $S$.
(a) $f$ is convex on $S$ if and only if $f'(x)(y-x) \le f(y) - f(x)$ for any distinct $x$ and $y$ in $S$.
(b) $f$ is strictly convex on $S$ if and only if $f'(x)(y-x) < f(y) - f(x)$ for any distinct $x$ and $y$ in $S$.

Proof. (a),(b)($\Leftarrow$) Let $x_0$ and $x_1$ be distinct points of $S$, and let $t \in (0,1)$ be given. Define $x_t := (1-t)x_0 + tx_1$. In statement (b), the assumption is equivalent to $f(x) < f(y) + f'(x)(x-y)$ for any $x \ne y$: choosing $x = x_t$ here, and then applying the inequality with $y = x_0$ and $y = x_1$ in turn, we get
$$f(x_t) < f(x_0) + f'(x_t)(x_t - x_0), \qquad (*)$$
$$f(x_t) < f(x_1) + f'(x_t)(x_t - x_1). \qquad (**)$$
Multiplying $(*)$ by $(1-t)$ and $(**)$ by $t$ and adding the resulting inequalities produces
$$f(x_t) < tf(x_1) + (1-t)f(x_0) + f'(x_t)\big(t(x_t - x_1) + (1-t)(x_t - x_0)\big). \qquad ({*}{*}{*})$$
A simple calculation reveals that the operand of $f'(x_t)$ appearing here is the zero vector, so this reduces to the inequality defining the strict convexity of $f$ on $S$. In statement (a), the assumed inequality is non-strict, but exactly the same steps lead to a non-strict version of conclusion $({*}{*}{*})$. This confirms the ordinary convexity of $f$ on $S$.

(a)($\Rightarrow$) Suppose $f$ is convex on $S$. Let two distinct points $x$ and $y$ in $S$ be given. Then for any $t \in (0,1)$, the definition of convexity gives
$$f(x + t(y-x)) = f((1-t)x + ty) \le (1-t)f(x) + tf(y) = f(x) + t\left(f(y) - f(x)\right).$$
Rearranging this gives
$$\frac{f(x + t(y-x)) - f(x)}{t} \le f(y) - f(x).$$
In the limit as $t \to 0^+$, the left side converges to the directional derivative $f'(x; y-x)$, which equals $f'(x)(y-x)$ by the definition of the Gâteaux derivative. Thus we have
$$f'(x)(y-x) \le f(y) - f(x). \qquad (\dagger)$$
This argument applies to any distinct points $x$ and $y$ in $S$, so the result follows.

(b)($\Rightarrow$) If $f$ is known to be strictly convex on $S$, then we certainly obtain the non-strict inequality just proved. It remains only to show that equality cannot hold. Indeed, if $x_0$ and $x_1$ are distinct points of $S$ where one has $f(x_1) = f(x_0) + f'(x_0)(x_1 - x_0)$, the definition of convexity implies that for all $t$ in $(0,1)$,
$$\begin{aligned}
f(x_t) &\le (1-t)f(x_0) + tf(x_1) \\
&= (1-t)f(x_0) + t\left[f(x_0) + f'(x_0)(x_1 - x_0)\right] \\
&= f(x_0) + t f'(x_0)(x_1 - x_0) = f(x_0) + f'(x_0)(x_t - x_0) \le f(x_t).
\end{aligned}$$
(The last inequality is a consequence of the subgradient inequality.) This chain of inequalities reveals that
$$f(x_t) = (1-t)f(x_0) + tf(x_1) \qquad \forall t \in (0,1),$$
which cannot happen for a strictly convex function $f$. ////

The inequality at the heart of the previous Theorem, namely
$$f'(x)(y-x) \le f(y) - f(x) \qquad \forall y \in X,$$
is called the subgradient inequality. Geometrically, it says that the tangent hyperplane to the graph of $f$ at the point $(x, f(x))$ lies on or below the graph of $f$ at every point $y$ in $X$. (Write $f(y) \ge f(x) + f'(x)(y-x)$ to see this.) The name also helps one remember the direction of the inequality: the gradient at $x$ is always on the small side. It is worth noting that the subgradient inequality written above follows from the convexity of $f$ using the differentiability property only at the point $x$; see the proof of (a) above.

The Second-Derivative Test. We will never need second derivatives in general vector spaces, so we focus on the finite-dimensional case ($X = \mathbb{R}^n$) in the next result. Here we use the familiar notation $\nabla f(x)$ instead of $f'(x)$, and write $\nabla^2 f(x)$ for the usual Hessian matrix of mixed partial derivatives of $f$ at the point $x$:
$$\left[\nabla^2 f(x)\right]_{ij} = \partial^2 f / \partial x_i\, \partial x_j.$$
We assume $f \in C^2(\mathbb{R}^n)$, so the order of mixed differentiation does not matter and the Hessian matrix is symmetric. The shorthand $\nabla^2 f(x) > 0$ means that this matrix is positive definite, i.e.,
$$v^T \nabla^2 f(x)\, v > 0 \qquad \forall v \in \mathbb{R}^n,\ v \ne 0. \tag{B.1}$$
(A similar understanding governs the notation $\nabla^2 f(x) \ge 0$.)
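The subgradient inequality is also straightforward to test numerically. The sketch below uses the smooth convex function $f(x) = \log\sum_i e^{x_i}$ on $\mathbb{R}^3$; this function and the random test points are illustrative choices, not from the notes:

```python
import numpy as np

# Smooth convex f(x) = log(sum_i exp(x_i)); its gradient is the softmax vector.
def f(x):
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    e = np.exp(x)
    return e / e.sum()

# Subgradient inequality: grad f(x) . (y - x) <= f(y) - f(x) for all x, y.
rng = np.random.default_rng(1)
for _ in range(100):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    assert grad_f(x) @ (y - x) <= f(y) - f(x) + 1e-12
```

Geometrically, each assertion checks that the tangent hyperplane at $(x, f(x))$ sits on or below the graph at $y$, exactly as described above.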
Theorem. Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$.
(a) $f$ is convex on $\mathbb{R}^n$ if and only if $\nabla^2 f(x) \ge 0$ for all $x$ in $\mathbb{R}^n$;
(b) If $\nabla^2 f(x) > 0$ for all $x$ in $\mathbb{R}^n$, then $f$ is strictly convex on $\mathbb{R}^n$.

Proof. (b) Pick any two distinct points $x$ and $y$ in $\mathbb{R}^n$. Taylor's theorem says that there is some point $\xi$ on the line segment joining $x$ to $y$ for which
$$f(y) = f(x) + \nabla f(x)(y-x) + \tfrac12 (y-x)^T \nabla^2 f(\xi)(y-x) > f(x) + \nabla f(x)(y-x).$$
The inequality here holds because $\nabla^2 f(\xi) > 0$ by assumption. Since $x$ and $y$ in $X$ are arbitrary, this establishes the hypothesis of the previous Theorem (part (b)): the strict convexity of $f$ follows.

(a)($\Leftarrow$) Just rewrite the proof of (b) with $\ge$ in place of $>$.

($\Rightarrow$) Let us prove first that the operator $\nabla f$ is monotone, i.e., that
$$(\nabla f(y) - \nabla f(x))(y - x) \ge 0 \qquad \forall x, y \in \mathbb{R}^n. \qquad (*)$$
To see this, just write down the subgradient inequality twice and add:
$$f(y) - f(x) \ge \nabla f(x)(y-x),$$
$$f(x) - f(y) \ge \nabla f(y)(x-y),$$
$$0 \ge (\nabla f(x) - \nabla f(y))(y-x).$$
Rearranging the sum produces $(*)$.

Now fix any $x$ and $v$ in $\mathbb{R}^n$, and put $y = x + tv$ into $(*)$, assuming $t > 0$. It follows that
$$(\nabla f(x+tv) - \nabla f(x))(v) \ge 0 \qquad \forall t > 0.$$
Take $g(t) := \nabla f(x+tv)(v)$: then the previous inequality implies, upon division by $t > 0$, that
$$\frac{g(t) - g(0)}{t} \ge 0 \qquad \forall t > 0.$$
Now $f$ is $C^2$, so $g$ is $C^1$: hence the limit of the left-hand side as $t \to 0^+$ exists and equals $g'(0)$. We deduce that $g'(0) \ge 0$, i.e., $v^T \nabla^2 f(x)\, v \ge 0$. Since $v \in \mathbb{R}^n$ was arbitrary, this shows that $\nabla^2 f(x) \ge 0$, as required. ////

The second-derivative test, together with the simple convexity-preserving operations mentioned above, is an easy way to confirm (or refute) the convexity of a given function. For example, when $X = \mathbb{R}^n$, the quadratic function $f(x) := x^T H x + gx + c$, built around a symmetric $n \times n$ matrix $H$, a row vector $g$, and a constant $c$, will be convex if and only if $H \ge 0$, and strictly convex if and only if $H > 0$.
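The monotonicity property $(*)$ in the proof above can be observed directly for a concrete convex function. The sketch below uses $f(x) = \|x\|^4$, whose gradient is $4\|x\|^2 x$ and whose Hessian $8xx^T + 4\|x\|^2 I$ is positive semidefinite; the function and sample points are illustrative choices:

```python
import numpy as np

# f(x) = ||x||^4 is C^2 and convex; its gradient should be a monotone operator:
# (grad f(y) - grad f(x)) . (y - x) >= 0 for all x, y.
def grad_f(x):
    return 4.0 * (x @ x) * x

rng = np.random.default_rng(2)
for _ in range(200):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    assert (grad_f(y) - grad_f(x)) @ (y - x) >= -1e-12
```

For a nonconvex function such as $f(x) = -\|x\|^2$ the analogous assertion fails for every pair $x \ne y$, which is one quick numerical way to refute convexity.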
Exercises. (1) Consider the $2 \times 2$ matrix $H = \begin{bmatrix} a & b \\ b & d \end{bmatrix}$. Prove that
$$H > 0 \iff \left[\ a > 0,\ d > 0,\ \text{and } ad - b^2 > 0\ \right]. \tag{B.2}$$
Then show that (B.2) remains true if all strict inequalities ($>$) are replaced by nonstrict ones ($\ge$).

(2) Consider the $2n \times 2n$ matrix $H = \begin{bmatrix} A & B \\ B^T & D \end{bmatrix}$, in which $A = A^T$, $B$, and $D = D^T$ are $n \times n$ matrices. Show that
$$H > 0 \iff \left[\ A > 0,\ D > 0,\ \text{and } A - BD^{-1}B^T > 0\ \right].$$
Here the matrix inequalities are interpreted as in (B.1). Hints: (i) $D > 0$ implies the existence of $D^{-1}$; (ii) $D = D^T$ implies $(D^{-1})^T = (D^T)^{-1}$; and (iii) if $D$ is invertible then
$$H = \begin{bmatrix} I & BD^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} A - BD^{-1}B^T & 0 \\ 0 & D \end{bmatrix} \begin{bmatrix} I & 0 \\ D^{-1}B^T & I \end{bmatrix}.$$

C. Convexity as a Sufficient Condition

General Theory

Proposition. Let $S$ be an affine subset of a real vector space $X$. Suppose that $f\colon S \to \mathbb{R}$ is convex on $S$. If $\bar x$ is a point of $S$ where $f$ is Gâteaux differentiable, then the following statements are equivalent:
(a) $\bar x$ minimizes $f$ over $S$, i.e., $f(\bar x) \le f(x)$ for all $x$ in $S$;
(b) $\bar x$ is a critical point for $f$ relative to $S$, i.e., $f'(\bar x)(h) = 0$ for all $h$ in the subspace $V = S - \bar x$.

Proof. (a $\Rightarrow$ b) Proved much earlier, without using convexity. (b $\Rightarrow$ a) Note that for any point $x$ of $S$, the difference $h = x - \bar x$ lies in the subspace $V$ of the statement. Hence the subgradient inequality, together with condition (b), gives
$$f(x) \ge f(\bar x) + f'(\bar x)(x - \bar x) = f(\bar x) + 0 \qquad \forall x \in S. \quad ////$$

Corollary. If $f\colon S \to \mathbb{R}$ is strictly convex on some affine subset $S$ of a real vector space $X$, and $\bar x$ is a critical point for $f$ relative to $S$, then $\bar x$ minimizes $f$ over $S$ uniquely, i.e., $f(\bar x) < f(x)$ for all $x \in S$, $x \ne \bar x$.

D. Separation Theorem

Separation. Let $U$, $V$ be convex subsets of $\mathbb{R}^n$. If the intersection $U \cap V$ contains at most one point, then there exists $\alpha \ne 0$ in $\mathbb{R}^n$ such that
$$\alpha^T u \le \alpha^T v \qquad \forall u \in U,\ v \in V.$$
More generally, the existence of $\alpha \ne 0$ is assured whenever $(\operatorname{int} U) \cap (\operatorname{int} V) = \emptyset$.
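Both exercises lend themselves to numerical spot-checks. The sketch below compares the determinant test of (B.2), in its nonstrict form, against eigenvalues for random $2\times 2$ matrices, and then verifies the block factorization of hint (iii); all data are randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# (1) Nonstrict version of (B.2): H >= 0 iff a >= 0, d >= 0, and ad - b^2 >= 0.
for _ in range(500):
    a, b, d = rng.uniform(-2, 2, 3)
    if abs(a * d - b * b) < 1e-9:
        continue  # skip near-degenerate samples to avoid rounding ambiguity
    det_test = a >= 0 and d >= 0 and a * d - b * b >= 0
    eig_psd = bool(np.all(np.linalg.eigvalsh(np.array([[a, b], [b, d]])) >= -1e-12))
    assert det_test == eig_psd

# (2) For a positive definite block matrix, the Schur complement A - B D^{-1} B^T
# is positive definite, and the congruence in hint (iii) reproduces H.
n = 3
M = rng.standard_normal((2 * n, 2 * n))
Hbig = M.T @ M + 0.1 * np.eye(2 * n)     # symmetric positive definite
A, B, D = Hbig[:n, :n], Hbig[:n, n:], Hbig[n:, n:]
S = A - B @ np.linalg.inv(D) @ B.T
assert np.all(np.linalg.eigvalsh(S) > 0)
T = np.block([[np.eye(n), B @ np.linalg.inv(D)], [np.zeros((n, n)), np.eye(n)]])
Mid = np.block([[S, np.zeros((n, n))], [np.zeros((n, n)), D]])
U = np.block([[np.eye(n), np.zeros((n, n))], [np.linalg.inv(D) @ B.T, np.eye(n)]])
assert np.allclose(T @ Mid @ U, Hbig)
```

The factorization check is exactly the congruence of hint (iii): the outer factors have determinant one, so positivity of $H$ stands or falls with positivity of the middle block-diagonal matrix.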
E. Inequality-Constrained Convex Programming

Problem Statement. Differentiable convex functions $F, G\colon \mathbb{R}^n \to \mathbb{R}$ are given. We want to
$$\min\ \{F(x) : G(x) \le 0\}.$$

Sufficient Conditions. Given $\bar x \in \mathbb{R}^n$, the following conditions guarantee that $\bar x$ solves the problem above:
$$\text{There exists } \lambda \ge 0 \text{ such that (i) } \lambda G(\bar x) = 0, \text{ and (ii) the function } x \mapsto F(x) + \lambda G(x) \text{ has gradient } 0 \text{ at } \bar x. \qquad (\star)$$
Note that the function described above is convex, so having gradient $0$ at $\bar x$ is the same as having a global min at $\bar x$.

Proof. Note that since $\lambda \ge 0$ and both $F$, $G$ are convex, every critical point for $F + \lambda G$ is a global minimizer:
$$F(\bar x) + \lambda G(\bar x) \le F(x) + \lambda G(x) \qquad \forall x \in \mathbb{R}^n.$$
Now $G(\bar x) \le 0$ by assumption, and for any $x \in \mathbb{R}^n$ satisfying the constraint $G(x) \le 0$, condition (i) gives
$$F(\bar x) = F(\bar x) + \lambda G(\bar x) \le F(x) + \lambda G(x) \le F(x).$$
This completes the proof. ////

Necessary Conditions. Suppose $\bar x \in \mathbb{R}^n$ solves the problem above. Then one of these two statements holds:
(i) $G(\bar x) = 0$ and $\nabla G(\bar x) = 0$. [ABNORMAL: the constraint is degenerate.]
(ii) Conclusion $(\star)$ above holds for $\bar x$. [NORMAL]

Proof. If $G(\bar x) < 0$, then $G(x) < 0$ for all $x$ near $\bar x$, so $\bar x$ is an unconstrained local minimum for $F$. Hence $\nabla F(\bar x) = 0$. Case (ii) holds, because $\lambda = 0$ works in $(\star)$.

So assume that $G(\bar x) = 0$. [Note that conclusion (i) in $(\star)$ is then automatic.] Define
$$U = \{(F(\bar x) - \beta,\ G(\bar x) - \gamma) : \beta > 0,\ \gamma \ge 0\},$$
$$V = \{(F(x) + r,\ G(x) + s) : x \in \mathbb{R}^n,\ r, s \ge 0\}.$$
These are convex sets in $\mathbb{R}^2$. [Practice: If $v_0, v_1 \in V$, say $v_i = (F(x_i) + r_i,\ G(x_i) + s_i)$, show $v_t = (1-t)v_0 + tv_1 \in V$ using the convexity inequality on $F$, $G$.] Note that $U \cap V = \emptyset$: if $(p, q) \in U \cap V$, then there must be some specific $x \in \mathbb{R}^n$ and $\beta > 0$, $\gamma \ge 0$, $r \ge 0$, $s \ge 0$ such that
$$p = F(\bar x) - \beta = F(x) + r, \quad \text{i.e.,}\quad F(x) = F(\bar x) - \beta - r < F(\bar x),$$
$$q = G(\bar x) - \gamma = G(x) + s, \quad \text{i.e.,}\quad G(x) = G(\bar x) - \gamma - s \le 0.$$
The existence of such an $x$ contradicts the optimality of $\bar x$. So pick $\alpha = (\alpha_0, \alpha_1) \ne (0,0)$ such that $\alpha^T u \le \alpha^T v$ for all $u \in U$, $v \in V$. Expand $\alpha^T(v - u) \ge 0$:
$$0 \le \alpha_0\left[F(x) + r - F(\bar x) + \beta\right] + \alpha_1\left[G(x) + s - G(\bar x) + \gamma\right] \qquad \forall x \in \mathbb{R}^n,\ \beta > 0,\ \gamma, r, s \ge 0. \qquad (\diamond)$$
Pick $x = \bar x$, $r = 0 = s$:
$$0 \le \alpha_0 \beta + \alpha_1 \gamma \qquad \forall \beta > 0,\ \gamma \ge 0.$$
This forces $\alpha_0 \ge 0$ and $\alpha_1 \ge 0$.

If $\alpha_0 = 0$, then $\alpha_1 > 0$ (since the vector $\alpha \ne 0$). Take $s = 0 = \gamma$ in $(\diamond)$ to get
$$0 \le G(x) - G(\bar x) \qquad \forall x \in \mathbb{R}^n.$$
Thus $\bar x$ minimizes $G$ over all of $\mathbb{R}^n$, so $\nabla G(\bar x) = 0$; together with $G(\bar x) = 0$, this is the abnormal case.

Assume $\alpha_0 > 0$. Define $\lambda = \alpha_1/\alpha_0 \ge 0$. Taking $\gamma = 0 = r = s$ in $(\diamond)$ and rearranging gives
$$F(\bar x) + \lambda G(\bar x) \le F(x) + \lambda G(x) + \beta \qquad \forall \beta > 0.$$
This implies the same inequality even when $\beta = 0$, and shows that $x \mapsto F(x) + \lambda G(x)$ has a global minimum at $\bar x$. Consequently its gradient is $0$: that's condition (ii) of the Sufficient Conditions above. ////

To illustrate this proof, here is a sketch of the set $V$ for the case of $F(x,y) = (x-2)^2 + (y-2)^2$ and $G(x,y) = x^2 + y^2 - 6$. The pairs $(F(x), G(x))$ are shaded in gray; points included only because of the additional $(r, s)$ in the definition are shown in black.

[Figure: the set $V$, with horizontal axis $f(x,y) = (x-2)^2 + (y-2)^2$ and vertical axis $g(x,y) = x^2 + y^2 - 6$.]

Sensitivity. You can see from the picture that, in the presence of sufficient smoothness, the multiplier $\lambda$ has the following interpretation: $\lambda = -V'(0)$, where
$$V(z) = \min\ \{F(x) : G(x) \le z\}.$$
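The example in the sketch can be worked end to end. By symmetry the constrained minimizer for radius-squared $6 + z$ lies on the diagonal, which gives a closed form for $V(z)$; the multiplier value below comes from solving conditions (i)-(ii) by hand, and the whole computation is an illustrative check rather than part of the notes:

```python
import numpy as np

# F(x,y) = (x-2)^2 + (y-2)^2 and G(x,y) = x^2 + y^2 - 6, as in the sketch above.
F = lambda p: (p[0] - 2.0) ** 2 + (p[1] - 2.0) ** 2
G = lambda p: p[0] ** 2 + p[1] ** 2 - 6.0
grad_F = lambda p: 2.0 * (p - 2.0)
grad_G = lambda p: 2.0 * p

# The unconstrained minimum (2,2) is infeasible (G(2,2) = 2 > 0), so the
# constraint is active: by symmetry xbar = (t, t) with 2 t^2 = 6, t = sqrt(3).
# Solving grad F + lam grad G = 0 at xbar gives lam = 2/sqrt(3) - 1.
t = np.sqrt(3.0)
xbar = np.array([t, t])
lam = 2.0 / np.sqrt(3.0) - 1.0
assert lam >= 0 and abs(lam * G(xbar)) < 1e-12               # (i)
assert np.allclose(grad_F(xbar) + lam * grad_G(xbar), 0.0)   # (ii)

# Sensitivity: V(z) = min{F : G <= z} = 2 (sqrt((6+z)/2) - 2)^2 near z = 0,
# and a central difference confirms lam = -V'(0).
V = lambda z: 2.0 * (np.sqrt((6.0 + z) / 2.0) - 2.0) ** 2
h = 1e-6
assert abs(lam + (V(h) - V(-h)) / (2.0 * h)) < 1e-6
```

The last assertion is the sensitivity interpretation in action: enlarging the feasible region ($z > 0$) lowers the optimal value at the rate $\lambda$.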
Here $V$ is the value function associated with our problem. Another proof:
$$V(G(x)) \le F(x)\ \ \forall x, \qquad V(G(\bar x)) = F(\bar x)
\ \implies\ F - V \circ G \text{ has a min at } \bar x
\ \implies\ \nabla F(\bar x) - V'(0)\nabla G(\bar x) = 0,$$
using $G(\bar x) = 0$. Comparing this with $\nabla F(\bar x) + \lambda \nabla G(\bar x) = 0$ gives $\lambda = -V'(0)$.

The Nonconvex Case. Similar results hold when $F$, $G$ are nonconvex, but the proofs are much harder. (All results are expressed in terms of critical points rather than global minima.) Detailed discussion postponed.

F. Application to the Trust-Region Subproblem

Recall our problem, with $\Delta > 0$ given in advance:
$$\text{minimize } M(p) \stackrel{\text{def}}{=} \tfrac12 p^T H p + gp + c \quad \text{subject to } \|p\| \le \Delta.$$
This has the form above, with a quadratic function $M$ and a constraint $G(p) \le 0$, where $G(p) = \tfrac12\left(\|p\|^2 - \Delta^2\right)$. Note
$$\nabla M(p) = g + p^T H, \qquad \nabla G(p) = p^T.$$

Theorem. A vector $\bar p \in B[0; \Delta]$ solves the problem above if and only if there exists $\lambda \ge 0$ satisfying all of
(i) $\lambda\left(\|\bar p\| - \Delta\right) = 0$,
(ii) $(H + \lambda I)\bar p = -g^T$,
(iii) $H + \lambda I \ge 0$.

Proof. ($\Leftarrow$) Suppose some $\bar p$ obeys (i)-(iii). Then (iii) implies that the quadratic function $M + \lambda G$ is convex, and (i)-(ii) show that $\bar p$ is a critical point for this function with $\lambda G(\bar p) = 0$. The sufficient conditions above imply that $\bar p$ gives the constrained minimum.

($\Rightarrow$) Now suppose $\bar p \in B[0; \Delta]$ gives the minimum. If $\|\bar p\| < \Delta$, then conditions (i)-(iii) hold for $\lambda = 0$. So assume $\|\bar p\| = \Delta$. Apply the Necessary Conditions above. The abnormal case is impossible: it requires both $G(\bar p) = 0$ (so $\|\bar p\| = \Delta > 0$) and $\nabla G(\bar p) = 0$ (so $\bar p = 0$), and these are incompatible. Hence there must be some $\lambda \ge 0$ such that $\lambda G(\bar p) = 0$ and
$$0 = \nabla(M + \lambda G)(\bar p) = g + \bar p^T H + \lambda \bar p^T, \quad \text{i.e.,}\quad (H + \lambda I)\bar p = -g^T.$$
This proves (i)-(ii). To get (iii), use (ii) to help compare the minimizer $\bar p$ with general
elements $p$ obeying $\|p\| = \Delta$:
$$\begin{aligned}
0 \le M(p) - M(\bar p) &= \tfrac12 p^T H p - \tfrac12 \bar p^T H \bar p + g(p - \bar p) \\
&= \tfrac12 p^T H p - \tfrac12 \bar p^T H \bar p - \bar p^T (H + \lambda I)(p - \bar p) \\
&= \tfrac12\left(p^T (H + \lambda I) p - \lambda \|p\|^2\right) - \tfrac12\left(\bar p^T (H + \lambda I) \bar p - \lambda \|\bar p\|^2\right) - \bar p^T (H + \lambda I)(p - \bar p) \\
&= \tfrac12\left[p^T (H + \lambda I) p - 2\bar p^T (H + \lambda I) p + \bar p^T (H + \lambda I) \bar p\right] + \tfrac{\lambda}{2}\left[\|\bar p\|^2 - \|p\|^2\right] \\
&= \tfrac12 (p - \bar p)^T (H + \lambda I)(p - \bar p) + 0.
\end{aligned}$$
The last equation holds because $\|p\| = \Delta = \|\bar p\|$. Since the resulting inequality holds for all $p$ with $\|p\| = \Delta$, we deduce that $H + \lambda I \ge 0$. ////

So to calculate the exact solution $\bar p$ for the subproblem, first try $\lambda = 0$: if $H \ge 0$ already, and the solution $p$ of $Hp = -g^T$ obeys $\|p\| \le \Delta$, then that's the answer. Otherwise, we define
$$p(\lambda) = -(H + \lambda I)^{-1} g^T,$$
noting that $H + \lambda I > 0$ for all $\lambda > 0$ sufficiently large, and then reduce $\lambda$ until $\|p(\lambda)\| = \Delta$. This is a 1D root-finding problem in the variable $\lambda$.

Exact Search for $\lambda$. This works even when $H$ is not positive definite. The derivation uses the orthogonal diagonalization of $H$:
$$H = Q \Lambda Q^T, \quad \text{where } \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \text{ and } \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n.$$
Each column of $Q$ is a unit vector, and they are all orthogonal: $QQ^T = I = Q^T Q$. Hence
$$H + \lambda I = Q \Lambda Q^T + \lambda QQ^T = Q(\Lambda + \lambda I)Q^T,$$
$$p(\lambda) = -(H + \lambda I)^{-1} g^T = -Q(\Lambda + \lambda I)^{-1} Q^T g^T = -\sum_{j=1}^{n} \frac{g q_j}{\lambda_j + \lambda}\, q_j.$$
Calculating $p(\lambda)^T p(\lambda)$, either as a double sum or as the expansion of
$$\|p(\lambda)\|^2 = gQ(\Lambda + \lambda I)^{-1} Q^T Q (\Lambda + \lambda I)^{-1} Q^T g^T = gQ \operatorname{diag}\left((\lambda_j + \lambda)^{-2}\right) Q^T g^T,$$
leads to
$$\|p(\lambda)\|^2 = \sum_{j=1}^{n} \frac{(g q_j)^2}{(\lambda + \lambda_j)^2}.$$
The exact solution to the subproblem corresponds to the unique $\lambda \ge -\lambda_1$ such that $\|p(\lambda)\|^2 = \Delta^2$. How do we find this $\lambda$? Suppose all the eigenvalues are distinct, and $g q_1 \ne 0$. Then the right side converges to $0$ as $\lambda \to \infty$ and to $+\infty$ as $\lambda \to -\lambda_1^+$. (Sketch.) The goal is simply to pick $\lambda \in (-\lambda_1, +\infty)$ to arrange $\|p(\lambda)\| = \Delta$, recalling
$$p(\lambda) = -\sum_{j=1}^{n} \frac{g q_j}{\lambda + \lambda_j}\, q_j.$$
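The eigenvector expansion of $\|p(\lambda)\|^2$ and the root-finding recipe can both be tried out numerically. The sketch below uses a small random symmetric $H$ (illustrative data only) and, in place of the sketch of the monotone right-hand side, plain bisection on the interval $(-\lambda_1, \infty)$:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
H = 0.5 * (A + A.T)                       # symmetric, typically indefinite
g = rng.standard_normal(4)                # row vector g
Delta = 1.0
eigvals, Q = np.linalg.eigh(H)            # H = Q diag(lam_j) Q^T, ascending order

def norm_p_sq(lam):
    # ||p(lam)||^2 = sum_j (g q_j)^2 / (lam + lam_j)^2
    return float(np.sum((g @ Q) ** 2 / (lam + eigvals) ** 2))

# Check the expansion against a direct linear solve at one safe value of lam.
lam0 = abs(eigvals).max() + 1.0           # guarantees H + lam0 I > 0
p = -np.linalg.solve(H + lam0 * np.eye(4), g)
assert abs(p @ p - norm_p_sq(lam0)) < 1e-9

# Bisection for the lam > -lam_1 with ||p(lam)|| = Delta: the right-hand side
# decreases from +infinity (at -lam_1) to 0 (as lam -> infinity).
lo, hi = -eigvals[0] + 1e-10, -eigvals[0] + 1e6
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if norm_p_sq(mid) > Delta ** 2 else (lo, mid)
lam = 0.5 * (lo + hi)
pbar = -np.linalg.solve(H + lam * np.eye(4), g)
assert abs(np.linalg.norm(pbar) - Delta) < 1e-6   # (i): constraint active
assert np.all(eigvals + lam >= 0)                 # (iii): H + lam I >= 0
```

Bisection is slow but bulletproof here; the next page explains why naive Newton on $\|p(\lambda)\| - \Delta$ is risky and how to reformulate it.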
Now $\lambda \mapsto \|p(\lambda)\|$ is very steep near $-\lambda_1$, so Newton's method may work badly on this problem. A nice trick is to turn everything upside down. Newton's method for finding a zero of $F(\lambda) = \dfrac{1}{\|p(\lambda)\|} - \dfrac{1}{\Delta}$:
$$\lambda^{(l+1)} = \lambda^{(l)} - \frac{F(\lambda^{(l)})}{F'(\lambda^{(l)})}.$$
Using our earlier expression for $\|p(\lambda)\|^2$, we get
$$\begin{aligned}
F'(\lambda) &= -\frac{1}{\|p(\lambda)\|^2}\,\frac{d}{d\lambda}\|p(\lambda)\|
= -\frac{1}{\|p(\lambda)\|^2}\cdot\frac{1}{2\|p(\lambda)\|}\,\frac{d}{d\lambda}\|p(\lambda)\|^2 \\
&= -\frac{1}{2\|p(\lambda)\|^3}\sum_{j=1}^{n}\left[-\frac{2(g q_j)^2}{(\lambda + \lambda_j)^3}\right]
= \frac{1}{\|p(\lambda)\|^3}\sum_{j=1}^{n}\frac{(g q_j)^2}{(\lambda + \lambda_j)^3}.
\end{aligned}$$

Final result (Nocedal and Wright, page 81): Start with $\lambda^{(0)}$ large enough that $H + \lambda^{(0)} I > 0$. Repeat for $l = 0, 1, 2, \ldots$: Write $\lambda = \lambda^{(l)}$ for simplicity. Find an upper triangular $R$ such that $H + \lambda I = R^T R$. This is the Cholesky factorization (Matlab chol). Solve for $p$ using $R^T R p = -g^T$. If $\|p\|$ is sufficiently near $\Delta$, use this vector as an approximate solution to the subproblem. Otherwise, continue: solve for $z$ using $R^T z = p$. Use these vectors to build
$$\lambda^{(l+1)} = \lambda + \left(\frac{\|p\|}{\|z\|}\right)^2 \left(\frac{\|p\| - \Delta}{\Delta}\right).$$
Go around again.

Matlab help:

CHOL Cholesky factorization.
CHOL(X) uses only the diagonal and upper triangle of X. The lower triangular is assumed to be the (complex conjugate) transpose of the upper. If X is positive definite, then R = CHOL(X) produces an upper triangular R so that R'*R = X. If X is not positive definite, an error message is printed.

Home practice: Reconcile the Newton's method ideas above with the iteration scheme above.

The Hard Case. If the gradient $g$ is orthogonal to the entire eigenspace corresponding to $\lambda = \lambda_1$, then the behaviour of $\|p(\lambda)\|$ is not as described above. The correct value of $\lambda$ to use in this case is $\lambda = -\lambda_1$, and the corresponding $p$ comes from adding an eigenspace component whose length is chosen appropriately to arrange $\|p\| = \Delta$. Check the book for these details. (Or, for a rough first cut, use the Cauchy point and move rapidly to the next step, hoping the Hard Case is rather rare.)
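The iteration quoted above from Nocedal and Wright translates directly into code. Below is a minimal sketch without the safeguards a production implementation would need; note that NumPy's `numpy.linalg.cholesky` returns the lower triangular factor, so it is transposed to match the $R^T R$ convention, and the example data ($H$, $g$, $\Delta$, the starting value $\lambda^{(0)}$) are arbitrary choices that avoid the Hard Case:

```python
import numpy as np

def trust_region_lambda(H, g, Delta, lam0, tol=1e-10, iters=100):
    """Newton iteration on 1/||p(lam)|| - 1/Delta via Cholesky factors."""
    lam = lam0
    for _ in range(iters):
        # H + lam I = R^T R with R upper triangular (cholesky gives the lower factor)
        R = np.linalg.cholesky(H + lam * np.eye(len(g))).T
        p = np.linalg.solve(R, np.linalg.solve(R.T, -g))   # R^T R p = -g^T
        pn = np.linalg.norm(p)
        if abs(pn - Delta) < tol * Delta:
            break
        z = np.linalg.solve(R.T, p)                        # R^T z = p
        lam += (pn / np.linalg.norm(z)) ** 2 * (pn - Delta) / Delta
    return lam, p

# Indefinite example: lam must exceed -lam_1 = 2 for H + lam I > 0.
H = np.diag([-2.0, 1.0])
g = np.array([1.0, 1.0])
lam, p = trust_region_lambda(H, g, Delta=1.0, lam0=10.0)
assert abs(np.linalg.norm(p) - 1.0) < 1e-8                 # ||p|| = Delta
assert np.all(np.linalg.eigvalsh(H + lam * np.eye(2)) >= 0)  # H + lam I >= 0
```

On this example the iteration converges in a handful of steps, which reflects the near-linearity of $\lambda \mapsto 1/\|p(\lambda)\|$ that motivated the upside-down reformulation; in the Hard Case the factorization step would fail at $\lambda = -\lambda_1$ and the eigenspace correction described above is needed instead.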