UNDERGROUND LECTURE NOTES 1: Optimality Conditions for Constrained Optimization Problems Robert M. Freund February 2016 c 2016 Massachusetts Institute of Technology. All rights reserved. 1
1 Introduction We consider constrained optimization problems of the form: (P) min x f(x) s.t. g(x) 0 h(x) = 0 x X, where X is an open set and g(x)) = (g 1 (x),..., g m (x)) : R n R m, h(x) = (h 1 (x),..., h l (x)) : R n R l. Let F denote the feasible region of (P), namely, F := {x X : g(x) 0, h(x) = 0}. Then the problem (P) can be written as min f(x). x F Recall that x is a local minimum of (P) if there exists ε > 0 such that f( x) f(y) for all y F B( x, ε). Also recall that local and global minima and maxima, strict and non-strict, are defined analogously. We will often use the following shorthand notation: g(x) = g 1 (x) T. g m (x) T and h(x) = h 1 (x) T. h l (x) T. Here g(x) R m n and h(x) R l n are each Jacobian matrices, the i th row of which is the transpose of the corresponding gradient. 2
2 Necessary Optimality Conditions 2.1 Geometric Necessary Conditions A set C R n is a cone if for every x C, αx C for any α 0. A set C is a convex cone if C is a cone and C is a convex set. Suppose x F. We have the following definitions: F 0 := {d : f( x) T d < 0} is the cone of improving directions of f(x) at x (not including the origin 0 in the cone). I := {i : g i ( x) = 0} is the set of indices of the binding inequality constraints at x. G 0 := {d : g i ( x) T d < 0 for all i I} is the cone of inward pointing directions for the binding constraints at x (not including the origin 0 in the cone). H 0 := {d : h i ( x) T d = 0 for all i = 1,..., l} is the set of tangent directions for the equality constraints at x. Theorem 2.1 (Geometric First-order Necessary Conditions) If x is a local minimum of (P) and either: (i) h(x) is a linear function, i.e., h(x) = Ax b for A R l n, or (ii) h i ( x), i = 1,..., l, are linearly independent, then F 0 G 0 H 0 =. Proof: Assume h(x) is the linear function h(x) = Ax b, whereby h i ( x) = A i for i = 1,..., l and H 0 = {d : Ad = 0}. Suppose d F 0 G 0 H 0. Suppose that λ > 0 and is sufficiently small. Then g i ( x + λd) < g i ( x) = 0 for i I because d G 0. Also g i ( x + λd) < 0 for i / I from continuity and the fact that g i ( x) < 0. Furthermore h( x + λd) = (A x b) + λad = 0. 3
Therefore x + λd F for all λ > 0 and sufficiently small. On the other hand, for all sufficiently small λ > 0 it holds that f( x + λd) < f( x). This contradicts the assumption that x is a local minimum of (P). The proof when h(x) is nonlinear is rather involved, as it relies on the Implicit Function Theorem. We present the proof in Section 6 at the end of this lecture note. Note that Theorem 2.1 essentially states that if a point x is (locally) optimal, there is no direction d which is both a feasible direction (i.e., such that g( x + λd) 0 and h( x + λd) 0) for small λ > 0) and is an improving direction (i.e., such that f( x + λd) < f( x) for small λ > 0), which makes sense intuitively. 2.2 Separation of Convex Sets We will shortly attempt to restate the geometric necessary local optimality conditions: F 0 G 0 H 0 = into a constructive and computable algebraic statement about the gradients of the objective function and the constraint functions. The vehicle that will make this happen involves the separation theory of convex sets. We start with some definitions: If p 0 is a vector in R n and α is a scalar, H := {x R n : p T x = α} is a hyperplane, and H + = {x R n : p T x α}, H = {x R n : p T x α} are halfspaces. Let S and T be two non-empty sets in R n. A hyperplane H = {x : p T x = α} is said to separate S and T if p T x α for all x S and p T x α for all x T, i.e., if S H + and T H. If, in addition, S T H, then H is said to properly separate S and T. H is said to strictly separate S and T if p T x > α for all x S and p T x < α for all x T. H is said to strongly separate S and T if for some ε > 0 it holds that p T x α + ε for all x S and p T x α ε for all x T. 4
Proposition 2.1 Let S be a nonempty closed convex set in R n, and y S. Then there exists a unique point x S of minimum distance to y. Furthermore, x is the minimizing point if and only if (y x) T (x x) 0 for all x S. (1) Proof: Let f(x) = x y 2. Let ˆx be an arbitrary point in S, and let Ŝ = S {x : x y ˆx y }. Then Ŝ is a compact set and Ŝ S. Furthermore, since f(x) is a continuous function it follows from the Weierstrass Theorem that f( ) attains its minimum over the set Ŝ at some point x Ŝ. As a consequence of the construction of Ŝ it also follows that x minimizes f( ) over the set S. Note that x y. Furthermore, because f( ) is a strictly convex function, x is unique. We now establish that x is the point in S that minimizes the distance to y if and only if (1) holds. Let us first assume that (1) holds; then note that for any x S, x y 2 = (x x) (y x) 2 = x x 2 + y x 2 2(x x) T (y x) x y 2. Therefore f(x) f( x), and so x is the point in S that minimizes the distance to y. Conversely, let us assume that x is the point in S that minimizes the distance to y. For any x S, λx + (1 λ) x S for any λ [0, 1]. Therefore λx + (1 λ) x y x y. Thus: x y 2 λx + (1 λ) x y 2 which when rearranged yields: = λ(x x) + ( x y) 2 = λ 2 x x 2 + 2λ(x x) T ( x y) + x y 2, λ 2 x x 2 2λ(y x) T (x x). Restricting attention to when λ > 0, we divide by λ to yield: λ x x 2 2(y x) T (x x). 5
Now let λ 0, whereby we obtain (y x) T (x x) 0. Theorem 2.2 Let S be a nonempty closed convex set in R n, and suppose that y S. Then there exists p 0 and α such that H = {x : p T x = α} strongly separates S and {y}. Proof: Let x be the point in S that minimizes the distance from y to the set S. Note that x y. Let p = y x, α = 1 2 (pt y + p T x), and ε = 1 2 (pt y p T x) = 1 2 y x 2 > 0. From Proposition 2.1 it follows that for any x S, (x x) T (y x) 0, and so p T x = (y x) T x (y x) T x = p T x = α ε. Therefore p T x α ε for all x S. On the other hand, p T y = α + ε, which completes the proof. We now use Theorem 2.2 to prove two useful results about the solvability of linear inequalities: Theorem 2.3 (Farkas Lemma) Given an m n matrix A and an n- vector c, exactly one of the following two systems has a solution: or (i) Ax 0, c T x > 0, (ii) A T y = c, y 0. Proof: First note that both systems (i) and (ii) cannot have a solution, since then we would have 0 < c T x = y T Ax 0. Suppose the system (ii) has no solution. Let S = {x : x = A T y for some y 0}. Then c S. S is easily seen to be a convex set. Also, S is a closed set. 1 Therefore there exists p 0 and α such that c T p > α and p T x α for all x S. It then follows by construction of S that p T (A T y) = (Ap) T y α for all y 0. 1 For an exact proof of this fact, see Appendix B.3 of Nonlinear Programming by Dimitri Bertsekas, Athena Scientific, 1999. 6
If (Ap) i > 0 for some i, one could set y i sufficiently large so that (Ap) T y > α, a contradiction. Thus Ap 0. Taking y = 0, we also have that α 0, and so c T p > 0, whereby p is a solution of (i). Lemma 2.1 [Key Lemma] Given matrices Ā, B, and H of appropriate dimensions, exactly one of the following two systems has a solution: (i) Āx < 0, Bx 0, Hx = 0, or (ii) ĀT u + B T w + H T v = 0, u 0, w 0, e T u = 1. Proof: It is easy to show that both (i) and (ii) cannot have a solution. Suppose (i) does not have a solution. Then the system: Āx +eθ 0, θ > 0 Bx 0 Hx 0 Hx 0 has no solution. This system can be re-written in the form: Ā e B 0 H 0 H 0 ( x θ ) ( x 0, (0,..., 0, 1) θ ) > 0. From Farkas Lemma, there exists a vector (u; w; v 1 ; v 2 ) 0 such that: T Ā e u 0 B 0 H 0 w v 1 =. 0. H 0 v 2 1 This can be rewritten as: 7
Ā T u + B T w + H T (v 1 v 2 ) = 0, e T u = 1. Setting v = v 1 v 2 completes the proof. We conclude this subsection with some extensions/consequences of Theorem 2.2. The first extension is to the strong separation of two convex sets when one of the sets is closed and bounded, as follows: Theorem 2.4 Suppose S 1 and S 2 are disjoint nonempty closed convex sets and S 1 is bounded. Then S 1 and S 2 can be strongly separated by a hyperplane. Proof: Let T = {x R n : x = y z, where y S 1, z S 2 }. Then it is easy to show that T is a convex set. We also claim that T is a closed set. To see this, let {x i } T, and suppose x = lim i x i. Then x i = y i z i for {y i } S 1 and {z i } S 2. By the Weierstrass Theorem, some subsequence of {y i } converges to a point ȳ S 1. Then z i = y i x i ȳ x (over this subsequence), so that z = ȳ x is a limit point of {z i }. Since S 2 is also closed, z S 2, and therefore x = ȳ z T, proving that T is a closed set. By hypothesis S 1 S 2 =, whereby 0 T. Since T is convex and closed, there exists a hyperplane H = {x : p T x = ᾱ} such that p T x > ᾱ for x T and p T 0 < ᾱ (and hence ᾱ > 0). Let y S 1 and z S 2. Then x = y z T, and so p T (y z) = p T x > ᾱ > 0 for any y S 1 and z S 2. Let α 1 = inf{p T y : y S 1 } and α 2 = sup{p T z : z S 2 }, and note that 0 < ᾱ α 1 α 2. Define α = 1 2 (α 1 + α 2 ) and ε = 1 2ᾱ > 0. Then for all y S 1 and z S 2 we have and p T y α 1 = 1 2 (α 1 + α 2 ) + 1 2 (α 1 α 2 ) α + 1 2ᾱ = α + ε p T z α 2 = 1 2 (α 1 + α 2 ) 1 2 (α 1 α 2 ) α 1 2ᾱ = α ε. Therefore S 1 and S 2 are strongly separated by the hyperplane H := {x R n : p T x = α}. 8
Corollary 2.1 If S is a closed convex set in R n, then S is the intersection of all halfspaces that contain S. Theorem 2.5 Let S R n, and let C be the intersection of all halfspaces containing S. Then C is the smallest closed convex set containing S. 2.3 Algebraic Necessary Conditions In this subsection we restate the geometric necessary local optimality conditions of Theorem 2.1: F 0 G 0 H 0 = into a constructive and computable algebraic statement about the gradient of the objective function and gradients of the constraint functions. Theorem 2.6 (Fritz John Necessary Conditions) Let x be a feasible solution of (P). If x is a local minimum of (P), then there exists (u 0, u, v) such that m l u 0 f( x) + u i g i ( x) + v i h i ( x) = 0, u 0, u 0, (u 0, u, v) 0, u i g i ( x) = 0, i = 1,..., m. (Note that the first equation can be rewritten as u 0 f( x) + g( x) T u + h( x) T v = 0.) Proof: If the vectors h i ( x) are linearly dependent, then there exists v 0 such that l v i h i ( x) = 0. Setting (u 0, u) = 0 establishes the result. Suppose now that the vectors h i ( x) are linearly independent. Then we can apply Theorem 2.1 and assert that F 0 G 0 H 0 =. Assume for simplicity that I = {1,..., p}. Let 9
Ā = f( x) T g 1 ( x) T. g p ( x) T h 1 ( x) T, H =.. h l ( x) T Then there is no d that satisfies Ād < 0, Hd = 0. From the Key Lemma (Lemma 2.1), there exists (u 0, u 1,..., u p ) and (v 1,..., v l ) such that u 0 f( x) + p u i g i ( x) + l v i h i ( x) = 0, with u 0 +u 1 + +u p = 1 and (u 0, u 1,..., u p ) 0. Define u p+1,..., u m = 0. Then (u 0, u) 0, (u 0, u) 0, and for any i, either g i ( x) = 0, or u i = 0. Furthermore, u 0 f( x) + p u i g i ( x) + l v i h i ( x) = 0. When u 0 > 0, the Fritz John conditions state that the gradient of the objective function and the gradients of the active constraints must line up in an intuitively obviously way having to do with the local geometry of feasible directions at the point x. The condition u i g i ( x) = 0, i = 1,..., m, can be equivalently stated as either g i ( x) = 0 or u i = 0, i = 1,..., m, and is called the complementary slackness condition or simply the complementarity condition. Theorem 2.7 (Karush-Kuhn-Tucker (KKT) Necessary Conditions) Let x be a feasible solution of (P) and let I = {i : g i ( x) = 0}. Further, suppose that h i ( x) for i = 1,..., l and g i ( x) for i I are linearly independent. If x is a local minimum, there exists (u, v) such that f( x) + g( x) T u + h( x) T v = 0, u 0, u i g i ( x) = 0, i = 1,..., m. 10
Proof: x must satisfy the Fritz John conditions. If u 0 > 0, we can redefine u u/u 0 and v v/u 0, whereby the redefined values (u, v) satisfy the conditions of the theorem. If u 0 = 0, then u i g i ( x) + i I l v i h i ( x) = 0, and so the above gradients are linearly dependent. assumptions of the theorem. This contradicts the Example 1 Consider the problem: min 6(x 1 10) 2 +4(x 2 12.5) 2 s.t. x 2 1 +(x 2 5) 2 50 x 2 1 +3x 2 2 200 (x 1 6) 2 +x 2 2 37 In this problem, we have: f(x) = 6(x 1 10) 2 + 4(x 2 12.5) 2 g 1 (x) = x 2 1 + (x 2 5) 2 50 g 2 (x) = x 2 1 + 3x2 2 200 g 3 (x) = (x 1 6) 2 + x 2 2 37 We also have: 11
f(x) = g 1 (x) = g 2 (x) = g 3 (x) = 12(x 1 10) 8(x 2 12.5) 2x 1 2(x 2 5) 2x 1 6x 2 2(x 1 6) 2x 2 Let us determine whether or not the point x = ( x 1, x 2 ) = (7, 6) is a candidate to be an optimal solution to this problem. We first check for feasibility: g 1 ( x) = 0 0 g 2 ( x) = 43 < 0 g 3 ( x) = 0 0 To check for optimality, we compute all gradients at x: 12
f(x) = g 1 (x) = g 2 (x) = g 3 (x) = 36 52 14 2 14 36 2 12 We next check to see if the gradients line up, by trying to solve for u 1 0, u 2 = 0, u 3 0 in the following system: 36 14 14 2 0 + u 1 + u 2 + u 3 = 52 2 36 12 0 Notice that ū = (ū 1, ū 2, ū 3 ) = (2, 0, 4) solves this system, and that ū 0 and ū 2 = 0. Therefore x is a candidate to be an optimal solution of this problem. Example 2 Consider the problem (P): (P) : max x x T Qx s.t. x 1, where Q is symmetric. This is equivalent to: min x x T Qx s.t. x T x 1. 13
The Fritz John conditions are: 2u 0 Qx + 2ux = 0 x T x 1 (u 0, u) 0 (u 0, u) 0 u(1 x T x) = 0. Suppose that u 0 = 0. Then u 0. However, the gradient condition implies that x = 0, and then from complementarity we have u = 0, which is a contradiction. Therefore we can assume that u 0 > 0 and so instead work directly with the KKT conditions: 2Qx + 2ux = 0 x T x 1 u 0 u(1 x T x) = 0. First note that if x and u satisfy the KKT conditions, then x T Qx = ux T x = u, whereby we see that u is the objective function value of the problem (P) at x. One solution to the KKT system is x = 0, u = 0, with objective function value u = x T Qx = 0. Are there any better solutions to the KKT system? If x 0 is a solution of the KKT system together with some value u, then x is an eigenvector of Q with nonnegative eigenvalue u, so the objective value of this solution is u. Therefore the solution of (P) with the largest objective function value is x = 0 if the largest eigenvalue of Q is nonpositive. If the largest eigenvalue of Q is positive, then the optimal objective value of (P) is the largest eigenvalue, and the optimal solution is any eigenvector x corresponding to this eigenvalue, normalized so that x = 1. Example 3 Consider the problem: 14
min (x 1 12) 2 +(x 2 + 6) 2 s.t. x 2 1 + 3x 1 +x 2 2 4.5x 2 6.5 (x 1 9) 2 +x 2 2 64 8x 1 +4x 2 = 20 In this problem, we have: f(x) = (x 1 12) 2 + (x 2 + 6) 2 g 1 (x) = x 2 1 + 3x 1 + x 2 2 4.5x 2 6.5 g 2 (x) = (x 1 9) 2 + x 2 2 64 h 1 (x) = 8x 1 + 4x 2 20 Let us determine whether or not the point x = ( x 1, x 2 ) = (2, 1) is a candidate to be an optimal solution to this problem. We first check for feasibility: g 1 ( x) = 0 0 g 2 ( x) = 14 < 0 h 1 ( x) = 0 To check for optimality, we compute all gradients at x: 15
f( x) = g 1 ( x) = g 2 ( x) = h 1 ( x) = 20 14 7 2.5 14 2 8 4 We next check to see if the gradients line up, by trying to solve for u 1 0, u 2 = 0, v 1 in the following system: 20 7 14 8 0 + u 1 + u 2 + v 1 = 14 2.5 2 4 0 Notice that (ū, v) = (ū 1, ū 2, v 1 ) = (4, 0, 1) solves this system and that ū 0 and ū 2 = 0. Therefore x is a candidate to be an optimal solution of this problem. 2.4 Other Constraint Qualifications for KKT Necessary Conditions Recall that the statement of the KKT necessary conditions established herein in Theorem 2.7 has the form if x is a local minimum of (P) and (some requirement for the constraints) then the KKT conditions must hold at x. This additional requirement for the constraints that enables us to proceed with the proof of the KKT conditions is called a constraint qualification. 16
In Theorem 2.7 we established the following constraint qualification: Linear Independence Constraint Qualification: The gradients g i ( x), i I, h i ( x), i = 1,..., l are linearly independent. We will now establish two alternative (and useful) constraint qualifications. We start with a definition: Definition 2.1 A point x is called a Slater point if x satisfies g(x) < 0 and h(x) = 0, that is, x is feasible and satisfies all inequalities strictly. Theorem 2.8 (Slater condition) Suppose that g i ( ), i = 1,..., m are convex, h i ( ), i = 1,..., l are linear, and h i (x), i = 1,..., l are linearly independent for every x, and (P) has a Slater point. Then the KKT conditions are necessary to characterize a local optimal solution. Proof: Let x be a local minimum. The Fritz-John conditions are necessary for this problem, whereby there must exist (u 0, u, v) 0 such that (u 0, u) 0 and u 0 f( x) + m u i g i ( x) + l v i h i ( x) = 0, u i g i ( x) = 0, i = 1,..., m. If u 0 > 0, dividing through by u 0 demonstrates the KKT conditions. Now suppose u 0 = 0. Let x 0 be a Slater point, and define d := x 0 x. Then for each i I, 0 = g i ( x) > g i (x 0 ), and by the convexity of g i ( ) we have g i ( x) T d < 0. Also, since h i (x) is linear, h i ( x) T d = 0, i = 1,..., l. Suppose that u i > 0 for some i I. It then would be true that 0 = 0 T d = ( u 0 f( x) + m u i g i ( x) + T l v i h i ( x)) d < 0, which yields a contradiction. Therefore u i = 0 for all i I. But this then implies that v 0 and l v i h i ( x) = 0, which violates the linear independence assumption. This is a contradiction, and so u 0 > 0. 17
Theorem 2.9 (Linear constraints) If all constraints are linear, the KKT conditions are necessary to characterize a local optimal solution. Proof: Our problem is (P) s.t. min x f(x) Ax b Mx = g. Suppose x is a local optimum. Without loss of generality, we can partition the constraints Ax b into groups A I x b I and A Ī x bī such that A I x = b I and A x < bī. Then at x, the set S := {d : A Ī Id 0, Md = 0 } is precisely the set of feasible directions. Thus, in particular, for every d as above, f( x) T d 0, for otherwise d would be a feasible descent direction at x, violating its local optimality. Therefore, the linear system f( x) T d < 0 A I d 0 Md = 0 has no solution. Applying the Key Lemma (Lemma 2.1) using Ā := f( x)t, B := A I, and H := M, there exists (u 0, w, v) satisfying u 0 = 1, w 0, and u 0 f( x) + A T I w + M T v = 0, which are precisely the KKT conditions. 3 Sufficient Conditions for Optimality for Convex Optimization Problems We start with a definition. If f( ) : X R, then the α level set of f( ) is defined as: for each α R. S α := {x X : f(x) α} 18
Proposition 3.1 If f( ) is convex on X, then S α is a convex set for every α R. Proof: Suppose that f( ) is convex. For any given value of α, suppose that x, y S α. Let z = λx + (1 λ)y for some λ [0, 1]. Then f(z) λf(x) + (1 λ)f(y) λα + (1 λ)α = α, so z S α, which shows that S α is a convex set. The program (P) min x f(x) s.t. g(x) 0 h(x) = 0 x X is called a convex problem if f(x), g i (x), i = 1,..., m, are convex functions, h i (x), i = 1..., l are linear functions, and X is an open convex set. Proposition 3.2 If the optimization problem (P) is a convex problem, then the feasible region of (P) is a convex set. Proof: Because each function g i ( ) is convex, the level sets C i := {x X : g i (x) 0}, i = 1,..., m are convex sets. Also, because each function h i ( ) is linear, the sets D i = {x X : h i (x) = 0}, i = 1,..., l are convex sets. Thus, since the intersection of convex sets is also a convex set, the feasible region F = {x X : g i (x) 0, i = 1,..., m, h i (x) = 0, i = 1,..., l} 19
is a convex set. Theorem 3.1 (KKT Sufficient Conditions for Convex Problems) Suppose that (P) is a convex problem. Let x be a feasible solution of (P), and suppose x together with multipliers (u, v) satisfies f( x) + m u i g i ( x) + u 0, l v i h i ( x) = 0, u i g i ( x) = 0, i = 1,..., m. Then x is a global optimal solution of (P). Proof: We know from Proposition 3.2 that the feasible region of (P) is a convex set. Let I = {i g i ( x) = 0} denote the index of active constraints at x. Let x F be any point different from x. Thus for i I we have g i (x) 0 and g i ( x) = 0, whereby: 0 g i (x) g i ( x) + g i ( x) T (x x) = g i ( x) T (x x) whereby g i ( x) T (x x) 0 for all i I. Also, since h i ( ) is a linear function, we have h i ( x + λ(x x)) = 0 for all λ [0, 1], whereby h i ( x) T (x x) = 0 for each i = 1,..., l. Thus, from the KKT conditions, [ f( x) T (x x) = i I u i g i ( x) T l v i h i ( x)] (x x) 0, and by the gradient inequality, f(x) f( x) + f( x) T (x x) f( x). As this is true for all x F, it follows that x is a global optimal solution of (P). 20
Example 4 Continuing Example 1, note that f( ), g 1 ( ), g 2 ( ), and g 3 ( ) are all convex functions. Therefore the problem is a convex problem, and the KKT conditions are sufficient for optimality. Since the point x = (7, 6) was shown to satisfy the KKT conditions, it follows that x = (7, 6) is a global minimum. Example 5 Continuing Example 3, note that f( ), g 1 ( ), g 2 ( ) are all convex functions and that h 1 ( ) is a linear function. Therefore the problem is a convex problem, and the KKT conditions sufficient for optimality. Since the point x = (2, 1) was shown to satisfy the KKT conditions, it follows that x = (2, 1) is a global minimum. 4 Generalizations of Convexity and Weaker Sufficient Conditions for Optimality In Section 3 we showed the sufficiency of the KKT conditions for optimality under the condition that all relevant functions are convex. Herein we explore ways that the convexity conditions can be relaxed while still maintaining the sufficiency of the KKT conditions. It turns out that the KKT conditions will be sufficient for optimality so long as: (i) f( ) is a pseudoconvex function (to be defined shortly), (ii) g 1 ( ),..., g m ( ) are quasiconvex functions (also to be defined shortly), and (iii) h 1 ( ),..., h l ( ) are linear functions. Suppose X is a convex set in R n. The differentiable function f( ) : X R is a pseudoconvex function if for every x, y X the following holds: f(x) T (y x) 0 f(y) f(x). It is straightforward to show that a differentiable convex function is pseudoconvex, which we will prove at the end of this subsection. 21
Incidentally, we define a differentiable function f( ) : X R to be pseudoconcave if for every x, y X the following holds: f(x) T (y x) 0 f(y) f(x). Suppose X is a convex set in R n. The function f( ) : X R is a quasiconvex function if for all x, y X and for all λ [0, 1], f(λx + (1 λ)y) max{f(x), f(y)}. f( ) is quasiconcave if for all x, y X and for all λ [0, 1], f(λx + (1 λ)y) min{f(x), f(y)}. It is also straightforward to show that a convex function is quasiconvex, which will also be proved at the end of this subsection. Proposition 4.1 A function f( ) is quasiconvex on X if and only if S α is a convex set for every α R. Proof: Suppose that f(x) is quasiconvex. For any given value of α, suppose that x, y S α. Let z = λx + (1 λ)y for some λ [0, 1]. Then f(z) max{f(x), f(y)} α, so z S α, which shows that S α is a convex set. Conversely, suppose S α is a convex set for every α. Let x and y be given, and let α = max{f(x), f(y)}, and hence x, y S α. Then for any λ [0, 1], f(λx + (1 λ)y) α = max{f(x), f(y)}, and so f(x) is a quasiconvex function. Theorem 4.1 (KKT Sufficient Conditions for Generalized Convex Problems) Suppose that f(x) is a pseudoconvex function, g i (x), i = 1,..., m, are quasiconvex functions, and h i (x), i = 1,..., l, are linear functions. Let x be a feasible solution of (P), and suppose x together with multipliers (u, v) satisfies m l f( x) + u i g i ( x) + v i h i ( x) = 0, 22
u 0, u i g i ( x) = 0, i = 1,..., m. Then x is a global optimal solution of (P). Proof: Because each g i ( ) is quasiconvex, the level sets C i := {x X : g i (x) 0}, i = 1,..., m are convex sets. Also, because each h i ( ) is linear, the sets D i = {x X : h i (x) = 0}, i = 1,..., l are convex sets. Thus, since the intersection of convex sets is also a convex set, the feasible region is a convex set. F = {x X : g(x) 0, h(x) = 0} Let I = {i g i ( x) = 0} denote the index of active constraints at x. Let x F be any point different from x. Then λx + (1 λ) x is feasible for all λ [0, 1]. Thus for i I we have g i ( x + λ(x x)) = g i (λx + (1 λ) x) 0 = g i ( x) for any λ [0, 1], and since the value of g i ( ) does not strictly increase by moving in the direction x x, we must have g i ( x) T (x x) 0, for all i I. Similarly, h i ( x + λ(x x)) = 0, and so h i ( x) T (x x) = 0 for all i = 1,..., l. Thus, from the KKT conditions, f( x) T (x x) = ( g( x) T u + h( x) T v ) T (x x) 0, and by pseudoconvexity, f(x) f( x) for any feasible x. 23
The following summarizes how these different generalizations of convexity are related. Proposition 4.2 It holds that: (i) A differentiable convex function is pseudoconvex. (ii) A convex function is quasiconvex. (iii) A pseudoconvex function is quasiconvex. Proof: To prove the first claim, note that if f( ) is convex and differentiable, then f(y) f(x) + f(x) T (y x). Hence, if f(x) T (y x) 0, then f(y) f(x), and so f(x) is pseudoconvex. To prove the second claim, observe from Proposition 3.1 that the level sets of a convex function f( ) are convex sets, whereby from Proposition 4.1 it follows that f( ) is quasiconvex. To show the third claim, suppose f(x) is pseudoconvex. Let x, y and λ [0, 1] be given, and let z = λx + (1 λ)y. If λ = 0 or λ = 1, then f(z) max{f(x), f(y)} trivially; therefore, assume 0 < λ < 1. Let d = y x. If f(z) T d 0, then f(z) T (y z) = f(z) T (λ(y x)) = λ f(z) T d 0, so f(z) f(y) max{f(x), f(y)}. On the other hand, if f(z) T d 0, then f(z) T (x z) = f(z) T ( (1 λ)(y x)) = (1 λ) f(z) T d 0, so f(z) f(x) max{f(x), f(y)}. Thus f(x) is quasiconvex. 24
Example 6 Let D = R n be the domain of the convex function f( ) and let R ++ denote the positive half-line, that is, R ++ := {α R α > 0}. For every x R n define the new function h(x, t) : R n R ++ R as follows: ( x ) h(x, t) := f. t Then it is a straightforward exercise to show that h(x, t) is a quasiconvex function on its domain D := R n R ++. 5 Second-Order Optimality Conditions To describe the second order conditions for optimality, we will define the following function, known as the Lagrangian function, or simply the Lagrangian: L(x, u, v) = f(x) + m u i g i (x) + l v i h i (x) = f(x) + u T g(x) + v T h(x). Using the Lagrangian, we can, for example, re-write the gradient conditions of the KKT necessary conditions as follows: x L( x, u, v) = 0, u T g( x) = 0, u 0, g( x) 0, (2) since x L(x, u, v) = f(x) + g(x) T u + h(x) T v. Also, note that 2 xxl(x, u, v) = 2 f(x)+ m u i 2 g i (x)+ l v i 2 h i (x). Here we use the standard notation: 2 q(x) denotes the Hessian of the function q(x), and 2 xxl(x, u, v) denotes the submatrix of the Hessian of L(x, u, v) corresponding to the second partial derivatives with respect to the x variables only. Theorem 5.1 (KKT second order necessary conditions) Suppose x is a local minimum of (P), and g i ( x), i I and h i ( x), i = 1,..., l are linearly independent. Then x must satisfy the KKT conditions. Furthermore, every d that satisfies: g i ( x) T d 0, i I, h i ( x) T d = 0, i = 1..., l 25
must also satisfy d T xx L( x, u, v)d 0. Theorem 5.2 (KKT second order sufficient conditions) Suppose the point x F together with multipliers (u, v) satisfies the KKT conditions. Let I + = {i I : u i > 0} and I 0 = {i I : u i = 0}. Additionally, suppose that every d 0 that satisfies also satisfies g i ( x) T d = 0, i I +, g i ( x) T d 0, i I 0, h i ( x) T d = 0, i = 1..., l d T 2 xxl( x, u, v)d > 0. Then x is a strict local minimum of (P). 6 A Proof of Theorem 2.1 The proof of Theorem 2.1 relies on the Implicit Function Theorem. To motivate the Implicit Function Theorem, consider a system of linear functions: h(x) := Ax b and suppose that we are interested in solving h(x) = Ax b = 0. 26
Let us assume that A R l n has full row rank (i.e., its rows are linearly independent). Then we can partition the columns of A and elements of x as follows: A = [B N], x = (y; z), so that B R l l is non-singular, and h(x) = By + Nz b. Let s(z) = B 1 b B 1 Nz. Then for any z, h(s(z), z) = Bs(z)+Nz b = 0, i.e., x = (s(z), z) solves h(x) = 0. This idea of invertability of a system of equations is generalized (although only locally) by the following version of the Implicit Function Theorem, where we will preserve the notation used above: Theorem 6.1 (Implicit Function Theorem) Let h(x) : R n R l and x = (ȳ 1,..., ȳ l, z 1,..., z n l ) = (ȳ, z) satisfy: 1. h( x) = 0 2. h(x) is continuously differentiable in a neighborhood of x 3. The l l Jacobian matrix h 1 ( x) h 1 ( x) y l y 1..... h l ( x) y 1 h l ( x) y l is nonsingular. Then there exists ε > 0 along with functions s(z) = (s 1 (z),..., s l (z)) such that for all z B( z, ε), h(s(z), z) = 0. Moreover, s k (z) is continuously differentiable, k = 1,..., l. Furthermore, for all i = 1,..., l and j = 1,..., n l we have: l k=1 h i (y, z) y k s k(z) z j + h i(y, z) z j = 0. 27
Proof of Theorem 2.1: Let A = h( x) R l n. Then A has full row rank, and its columns (along with corresponding elements of x) can be re-arranged so that A = [B N] and x = (ȳ; z), where B is non-singular. Let z lie in a small neighborhood of z. Then, from the Implicit Function Theorem, there exists s(z) such that h(s(z), z) = 0. Suppose that d F 0 G 0 H 0, and let us write d = (q; p). Then 0 = Ad = Bq + Np, whereby q = B 1 Np. Let z(θ) = z + θp, y(θ) = s(z(θ)) = s( z + θp), and x(θ) = (y(θ), z(θ)). We will derive a contradiction by showing that d is an improving feasible direction, i.e., for small θ > 0, x(θ) is feasible and f(x(θ)) < f( x). To show feasibility of x(θ), note that for θ > 0 sufficiently small, it follows from the Implicit Function Theorem that: Furthermore, for i = 1,..., l we have: h(x(θ)) = h(s(z(θ)), z(θ)) = 0. 0 = h i(x(θ)) θ = l k=1 h i (s(z(θ)), z(θ)) y k s k(z(θ)) θ n l + k=1 h i (s(z(θ)), z(θ)) z k(θ) z k θ. Let r k = s k(z(θ)) θ, and recall that z k(θ) θ = p k. At θ = 0, the above equation system can then be re-written as 0 = Br + Np, or r = B 1 Np = q. Therefore, x k(θ) θ = d k for k = 1,..., n. For i I, g i (x(θ)) = g i ( x) + θ g i(x(θ)) θ = θ n k=1 g i (x(θ)) x k + θ α i (θ) θ=0 x k(θ) θ = θ g i ( x) T d + θ α i (θ), θ=0 where α i (θ) 0 as θ 0. Hence g i (x(θ)) < 0 for all i = 1,..., m for θ > 0 28
sufficiently small, and therefore, x(θ) is feasible for any θ > 0 sufficiently small. On the other hand, f(x(θ)) = f( x) + θ f( x) T d + θ α(θ) < f( x) for θ > 0 sufficiently small, where α(θ) 0 as θ 0. But this contradicts the local optimality of x. Therefore no such d can exist, and the theorem is proved. 29
7 Summary Road Map of Constrained Optimization Optimality Conditions Here is a road map of the structure of the constrained optimization optimality conditions we have developed herein. Necessary Conditions: If x solves P, then it holds that.... 1. Geometric Conditions (Theorem 2.1): F 0 G 0 H 0 = 2. Analytic Conditions: (i) Fritz John Necessary Conditions (Theorem 2.6) u 0 f( x) + m u i g i ( x) + (ii) KKT Necessary Conditions f( x) + m u i g i ( x) + l v i h i ( x) = 0 l v i h i ( x) = 0 The KKT conditions require a constraint qualification such as: (a) linear independence of active gradients (Theorem 2.7), or (b) Slater condition (Theorem 2.8), or (c) all constraints are linear (Theorem 2.9), or (d) some other constraint qualification. Sufficient Conditions: If P satisfies... and x satisfies..., then x is an optimal solution. 1. If convexity conditions hold, then the KKT conditions are sufficient for optimality (Theorem 3.1) 2. If generalized convexity conditions hold, then the KKT conditions are sufficient for optimality (Theorem 4.1) 30