Existence of minimizers

We have just talked a lot about how to find the minimizer of an unconstrained convex optimization problem. We have not talked too much, at least not in concrete mathematical terms, about the conditions under which functionals achieve their minimum over a domain.

Here is a fundamental result in analysis, the Weierstrass extreme value theorem: if f(x) is a continuous functional on a compact set K ⊂ R^N, then it attains its minimum value at least once. That is,

    minimize_{x ∈ K} f(x)

has a minimizer on K; there exists an x* ∈ K such that f(x*) ≤ f(x) for all x ∈ K. For a proof of this, see just about any introductory text on analysis. The same is also true for f achieving its maximum value on K.

In the unconstrained setting, we are interested in

    minimize_{x ∈ R^N} f(x),

where f is convex. Simple examples illustrate that the minimum does not necessarily have to be achieved for any x; that is, there may be no x* such that f(x*) ≤ f(x) for all x ∈ R^N. For example, f(x) = exp(−x) does not have a minimizer on the real line, even though it is as convex and as smooth as can be.

There is, however, a class of functions for which we can guarantee a global minimizer in the unconstrained setting. If the sublevel sets of f,

    S(f, β) = {x ∈ R^N : f(x) ≤ β},
are compact (closed and bounded), then there will be at least one global minimizer. This should be easy to see: just choose β such that S(f, β) is non-empty; then

    minimize_{x ∈ S(f,β)} f(x)

has a minimizer (by the extreme value theorem), and this also clearly corresponds to a minimizer of f over R^N. If f is continuous (which all convex functions with dom f = R^N are), then having compact sublevel sets is the same as being coercive: for every sequence {x_k} ⊂ R^N with ‖x_k‖₂ → ∞, we have f(x_k) → ∞ as well. (I will let you prove that at home.)

Until now, we have taken it for granted that local minimizers of convex functions are also global minimizers. We will nail this down right now.

Let f(x) be a convex function on R^N, and suppose x* is a local minimizer of f in that there exists an ε > 0 such that f(x*) ≤ f(x) for all ‖x − x*‖₂ ≤ ε. Then x* is also a global minimizer: f(x*) ≤ f(x) for all x ∈ R^N.

To prove this, suppose that there were an x̂ ≠ x* such that f(x̂) < f(x*). Then by the convexity of f,

    f(x* + θ(x̂ − x*)) ≤ (1 − θ)f(x*) + θf(x̂) < f(x*), for all 0 < θ ≤ 1.

But choosing a small enough value of θ puts x* + θ(x̂ − x*) in the neighborhood where x* is supposed to be a local minimizer. Specifically,
if we take θ < ε/‖x̂ − x*‖₂, then the inequality above directly contradicts the assertion that x* is a local minimizer. Thus no such x̂ can exist.

Our final result in this section gives a sufficient (but definitely not necessary) condition for the minimizer to be unique.

Let f be strictly convex on R^N. If f has a global minimizer, then it is unique.

This is again easy to argue by contradiction. Let x* be a global minimizer, and suppose that there existed an x̂ ≠ x* with f(x̂) = f(x*). But then there would be many x which achieve smaller values, as for all 0 < θ < 1,

    f(θx* + (1 − θ)x̂) < θf(x*) + (1 − θ)f(x̂) = f(x*).

As this would contradict the assertion that x* is a global minimizer, no such x̂ can exist.

We close this section by noting that the entire discussion above would stay the same if we replaced

    minimize_{x ∈ R^N} f(x)    with    minimize_{x ∈ U} f(x)

for any open set U ⊂ R^N.

Optimality conditions: unconstrained case

How do we know when we have a minimizer of a convex function on our hands? What is our certificate of optimality? For the time being, we will assume that f is differentiable, that ∇f(x) exists at
every point we are considering. There is a comparable set of results for non-smooth f that we will discuss in the last segment of the course.

In our discussion of algorithms for unconstrained optimization over the past two weeks, we have often mentioned the following, but have never actually discussed exactly why it is true.

Let f be a convex differentiable function on R^N. Then x* solves

    minimize_{x ∈ R^N} f(x)

if and only if ∇f(x*) = 0.

The proof of this relies on the critical fact that we can decrease f by moving in a direction which makes an obtuse angle with the gradient.

Let f be a function on R^N that is differentiable at x, and let d ∈ R^N be a vector obeying ⟨d, ∇f(x)⟩ < 0. Then for small enough t > 0, f(x + td) < f(x). We call such a d a descent direction from x. Similarly, if ⟨d, ∇f(x)⟩ > 0, then for small enough t > 0, f(x + td) > f(x), and we call such a d an ascent direction from x.

This fundamental fact is a direct consequence of Taylor's theorem: for any u ∈ R^N,

    f(x + u) = f(x) + ⟨u, ∇f(x)⟩ + h(u)‖u‖₂,
where h(u) : R^N → R is some function satisfying h(u) → 0 as u → 0. Taking u = td, we have

    f(x + td) = f(x) + t(⟨d, ∇f(x)⟩ + h(td)‖d‖₂).

For t > 0 small enough, we can make |h(td)|‖d‖₂ < |⟨d, ∇f(x)⟩|, and so the term inside the parentheses above is negative if ⟨d, ∇f(x)⟩ is negative, and it is positive if ⟨d, ∇f(x)⟩ is positive.

At a particular point x*, the only way we can have ⟨d, ∇f(x*)⟩ ≥ 0 for all choices of d is if ∇f(x*) = 0. So clearly

    x* is a minimizer ⇒ ∇f(x*) = 0.

On the other hand, if f is convex, then

    f(x* + td) ≥ f(x*) + t⟨d, ∇f(x*)⟩, for all t ∈ R and choices of d ∈ R^N.

This now makes it clear that

    ∇f(x*) = 0 ⇒ x* is a minimizer.

Again, for everything we have said in this section, you can use any open domain U in place of R^N.

Optimality conditions: constrained case

In this section, we consider the general constrained problem

    minimize_{x ∈ C} f(x),

where C is a closed, convex set and f is again a convex function. We have the following fundamental result.
Let f be a differentiable convex function, and C be a closed convex set. Then x* is a minimizer of

    minimize_{x ∈ C} f(x)

if and only if x* ∈ C and

    ⟨y − x*, ∇f(x*)⟩ ≥ 0 for all y ∈ C.

This result is geometrically intuitive; it is saying that no vector from x* to another point y in C can make an obtuse angle with ∇f(x*). That is, there cannot be any descent directions from x* that lead to another point in C. Here is a picture:

[Figure: level lines of f(x), with the gradient ∇f(x*) at a solution x* on the boundary of the convex set C.]
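Before proving this, here is a quick numeric sanity check in Python. The toy problem below (minimizing a quadratic over the box [0, 1]², a made-up instance for illustration) has its solution on the boundary of C, and the inequality ⟨y − x*, ∇f(x*)⟩ ≥ 0 can be verified over a grid of feasible points:

```python
# Constrained problem: minimize f(x) = (x1 - 2)^2 + (x2 - 0.5)^2 over C = [0, 1]^2.
# The minimizer is the projection of (2, 0.5) onto the box, namely x* = (1, 0.5).
x_star = (1.0, 0.5)
grad = (2 * (x_star[0] - 2), 2 * (x_star[1] - 0.5))   # grad f(x*) = (-2.0, 0.0)

# Check the optimality condition <y - x*, grad f(x*)> >= 0 on a grid over C.
ok = all(
    (y1 - x_star[0]) * grad[0] + (y2 - x_star[1]) * grad[1] >= 0
    for y1 in [i / 10 for i in range(11)]
    for y2 in [j / 10 for j in range(11)]
)
print(ok)   # True
```

Note that ∇f(x*) ≠ 0 here: the gradient condition from the unconstrained case is replaced by the geometric condition above.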
To prove this, we first argue that ⟨y − x*, ∇f(x*)⟩ ≥ 0 for all y ∈ C implies that x* is optimal. Since f is convex, for any y ∈ C,

    f(y) ≥ f(x*) + ⟨y − x*, ∇f(x*)⟩,

and so

    f(y) − f(x*) ≥ ⟨y − x*, ∇f(x*)⟩ ≥ 0.

Since this holds for every y ∈ C, x* is a minimizer.

Now suppose that x* is a minimizer. If there were a y ∈ C such that ⟨y − x*, ∇f(x*)⟩ < 0, then d = y − x* would be a descent direction, and there would exist a 0 < t < 1 such that

    f(x* + t(y − x*)) < f(x*).

Since C is convex and x*, y ∈ C, we know x* + t(y − x*) ∈ C. But this contradicts the assertion that x* is a minimizer, so no such y exists.

Examples

The abstract geometrical result in the previous section will eventually lead us to the Karush-Kuhn-Tucker (KKT) conditions. But we will build up to this by looking at what it tells us in several important (and prevalent) cases. We assume throughout this section that f is convex, differentiable, and defined on all of R^N.
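The simplest instance is C = R^N, where the condition reduces to the certificate ∇f(x*) = 0 from the previous section. A minimal sketch (the quadratic and the test points are made up for illustration) checks both the certificate and the descent-direction fact behind it:

```python
# f(x) = x1^2 + 3*x2^2 is convex with gradient (2*x1, 6*x2) and minimizer (0, 0).
def f(x):
    return x[0] ** 2 + 3 * x[1] ** 2

def grad_f(x):
    return (2 * x[0], 6 * x[1])

x = (1.0, 1.0)                  # a non-optimal point: grad f(x) = (2.0, 6.0)
d = (-1.0, 0.0)                 # <d, grad f(x)> = -2.0 < 0: a descent direction
t = 1e-3                        # small step along d
x_new = (x[0] + t * d[0], x[1] + t * d[1])
decreased = f(x_new) < f(x)

print(grad_f((0.0, 0.0)))       # (0.0, 0.0): the optimality certificate at x*
print(decreased)                # True: a small step along d decreases f
```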
Linear constraints

Consider a convex optimization problem with linear¹ constraints,

    minimize_x f(x) subject to Ax = b,

where A is M × N and b ∈ R^M. At a solution x*, we have

    ⟨y − x*, ∇f(x*)⟩ ≥ 0, for all y such that Ay = b.

Since Ax* = b as well, this is equivalent to

    ⟨h, ∇f(x*)⟩ ≥ 0, for all h ∈ Null(A).

Since h ∈ Null(A) ⇒ −h ∈ Null(A), we must have

    ⟨h, ∇f(x*)⟩ = 0, for all h ∈ Null(A),

i.e. the gradient is orthogonal to the null space of A. This means that it is in the row space, ∇f(x*) ∈ Range(A^T), and so there is a ν ∈ R^M such that

    ∇f(x*) + A^T ν = 0.

¹ We really should be saying affine constraints, but "linear constraints" is typical nomenclature for this type of problem.
Summary: x* is a solution to

    minimize_x f(x) subject to Ax = b

if and only if

1. Ax* = b, and
2. there exists ν ∈ R^M such that ∇f(x*) + A^T ν = 0.

[Figure: level lines of f(x); at the solution x*, the gradient ∇f(x*) = −A^T ν is orthogonal to the feasible set x* + Null(A).]
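To make the summary concrete, here is a small hypothetical instance worked in Python: minimizing ‖x‖₂² subject to a single affine constraint, where the solution x* = A^T (A A^T)^{-1} b and the multiplier ν can be computed by hand:

```python
# Minimize f(x) = x1^2 + x2^2 subject to the single constraint x1 + 2*x2 = 5.
# Optimality: A x* = b, and grad f(x*) + A^T nu = 0 for some scalar nu.
a = (1.0, 2.0)                               # the constraint row A (M = 1 here)
b = 5.0
scale = b / (a[0] ** 2 + a[1] ** 2)          # (A A^T)^{-1} b for one constraint
x_star = (a[0] * scale, a[1] * scale)        # x* = A^T (A A^T)^{-1} b = (1.0, 2.0)
grad = (2 * x_star[0], 2 * x_star[1])        # grad f(x*) = (2.0, 4.0)
nu = -grad[0] / a[0]                         # nu = -2.0

print(a[0] * x_star[0] + a[1] * x_star[1])           # 5.0: x* is feasible
print(grad[0] + nu * a[0], grad[1] + nu * a[1])      # 0.0 0.0: stationarity holds
```

The gradient at the solution is a multiple of the constraint row, exactly as the orthogonality argument above predicts.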
Non-negativity constraints

Now consider the convex program

    minimize_x f(x) subject to x ≥ 0.

At a solution x*, we will have

    ⟨y − x*, ∇f(x*)⟩ ≥ 0, for all y ∈ R^N_+. (1)

Since both 0 ∈ R^N_+ and 2x* ∈ R^N_+, taking y = 0 and y = 2x* in (1) forces

    ⟨x*, ∇f(x*)⟩ = 0, (2)

and so

    ⟨y, ∇f(x*)⟩ ≥ 0, for all y ∈ R^N_+,

meaning that the gradient has only non-negative entries as well,

    ∇f(x*) ≥ 0. (3)

The conditions (2) and (3) are sufficient as well, as together they imply (1). Condition (3) is the same as saying there exists a λ ≥ 0 such that ∇f(x*) − λ = 0. We can also see that (2) and (3), along with the fact that x* ∈ R^N_+, mean that ∇f(x*) and x* can only be non-zero at different indices:

    [∇f(x*)]_n > 0 ⇒ x*_n = 0,    x*_n > 0 ⇒ [∇f(x*)]_n = 0.
Summary: x* is a solution to

    minimize_x f(x) subject to x ≥ 0

if and only if

1. x* ≥ 0, and there exists a λ ∈ R^N such that
2. λ ≥ 0, and
3. λ_n x*_n = 0 for all n = 1, ..., N, and
4. ∇f(x*) − λ = 0.

[Figure: the non-negative orthant R^N_+, with ∇f(x*) = λ at a solution x* on its boundary.]
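As a sanity check of conditions 1–4, consider a made-up separable quadratic whose unconstrained minimum has a negative coordinate, so the non-negativity constraint is active in that coordinate:

```python
# Minimize f(x) = (x1 - 1)^2 + (x2 + 2)^2 subject to x >= 0.
# The unconstrained minimum (1, -2) violates x2 >= 0, so the constraint is
# active in the second coordinate and the solution is x* = (1, 0).
x_star = (1.0, 0.0)

# Condition 4 defines the multiplier: lambda = grad f(x*) = (0.0, 4.0).
lam = (2 * (x_star[0] - 1), 2 * (x_star[1] + 2))

print(all(l >= 0 for l in lam))                 # True: lambda >= 0 (condition 2)
print(lam[0] * x_star[0], lam[1] * x_star[1])   # 0.0 0.0: complementary slackness (condition 3)
```

The multiplier is zero exactly where x* is strictly positive, matching the "non-zero at different indices" observation above.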
A single convex inequality constraint

Now consider the convex program

    minimize_x f(x) subject to g(x) ≤ 0,

where g is also a differentiable convex function. We will argue that in this case, the optimality conditions for x*,

    g(x*) ≤ 0, and ⟨y − x*, ∇f(x*)⟩ ≥ 0 for all y with g(y) ≤ 0,

are equivalent to one of these two conditions holding:

1. g(x*) < 0 and ∇f(x*) = 0, or
2. g(x*) = 0 and the gradients of g and f are negatively aligned: ∇g(x*) = −λ∇f(x*) for some λ > 0.

Establishing this relies on the following geometric fact²:

Let u, v be nonzero vectors in R^N. If no d exists such that

    ⟨d, u⟩ < 0 and ⟨d, v⟩ < 0 simultaneously, (4)

then u and v are negatively aligned,

    u = −λv, for some λ > 0. (5)

The converse also holds, as if (5) is true, there is no way (4) can be true.

² This is a special case of the famous Gordan Theorem.
The argument for this is simple. The sets {x : ⟨x, u⟩ < 0} and {x : ⟨x, v⟩ < 0} are open half spaces, and these half spaces are disjoint if and only if (5) holds.

[Figure: the vectors u and v with the hyperplanes {x : ⟨x, u⟩ = 0} and {x : ⟨x, v⟩ = 0} bounding the two half spaces.]

Suppose now that there is an x* such that g(x*) = 0 and a λ > 0 so that ∇g(x*) = −λ∇f(x*). Let x be any other feasible point, g(x) ≤ 0. Then, by the convexity of g,

    g(x* + θ(x − x*)) ≤ 0, for all 0 ≤ θ ≤ 1.

Since the above is true for all θ in this range, we know that x − x* cannot be an ascent direction for g from x*. Thus

    ⟨x − x*, ∇g(x*)⟩ ≤ 0.

Since ∇g(x*) = −λ∇f(x*), we now know

    ⟨x − x*, ∇f(x*)⟩ ≥ 0.

Then by the convexity of f,

    f(x) ≥ f(x*) + ⟨x − x*, ∇f(x*)⟩ ≥ f(x*),
and so x* is a minimizer.

Now suppose that x*, with g(x*) = 0, is a minimizer. We know that ∇g(x*) and ∇f(x*) must be negatively aligned, as otherwise our geometric fact dictates that there is a d that is a descent direction for both g and f, meaning there is a 0 < t < 1 such that

    f(x* + td) < f(x*), and g(x* + td) < g(x*) = 0.

This would mean that there is a feasible point at which f is smaller than it is at x*, directly contradicting the assertion that x* is a minimizer. Thus no such d can exist.

We can collect all of this into the following summary: x* is a solution to

    minimize_x f(x) subject to g(x) ≤ 0

if and only if

1. g(x*) ≤ 0, and there exists a λ ∈ R such that
2. λ ≥ 0, and
3. λ g(x*) = 0, and
4. ∇f(x*) + λ∇g(x*) = 0.
[Figure: level lines of f(x) and the feasible set {x : g(x) ≤ 0}; at the solution x* on its boundary, ∇f(x*) = −λ∇g(x*).]
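Here is the checklist run on a small hypothetical instance in Python: projecting a point onto the unit ball, i.e. f(x) = ‖x − c‖₂² with g(x) = ‖x‖₂² − 1, where the constraint is active and the multiplier λ can be found by hand:

```python
# Minimize ||x - c||^2 subject to g(x) = ||x||^2 - 1 <= 0, with c = (2, 0).
# c lies outside the unit ball, so the constraint is active and the solution
# is the projection of c onto the ball: x* = (1, 0).
c = (2.0, 0.0)
x_star = (1.0, 0.0)

g_val = x_star[0] ** 2 + x_star[1] ** 2 - 1                 # g(x*) = 0 (active)
grad_f = (2 * (x_star[0] - c[0]), 2 * (x_star[1] - c[1]))   # grad f(x*) = (-2.0, 0.0)
grad_g = (2 * x_star[0], 2 * x_star[1])                     # grad g(x*) = ( 2.0, 0.0)
lam = 1.0                                                   # solves grad_f + lam*grad_g = 0

print(g_val, lam * g_val)                                   # 0.0 0.0: conditions 1 and 3
print(grad_f[0] + lam * grad_g[0],
      grad_f[1] + lam * grad_g[1])                          # 0.0 0.0: condition 4
```

The gradients ∇f(x*) and ∇g(x*) point in exactly opposite directions here, which is the "negatively aligned" case 2 from the argument above.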