Convexity, Duality, and Lagrange Multipliers


LECTURE NOTES

Convexity, Duality, and Lagrange Multipliers

Dimitri P. Bertsekas
with assistance from Angelia Geary-Nedic and Asuman Koksal

Massachusetts Institute of Technology
Spring 2001

These notes were developed for the needs of the class at M.I.T. (Spring 2001). They are copyright-protected, but they may be reproduced freely for noncommercial purposes.


Contents

1. Convex Analysis and Optimization
   1.1. Linear Algebra and Analysis
        Vectors and Matrices
        Topological Properties
        Square Matrices
        Derivatives
   1.2. Convex Sets and Functions
        Basic Properties
        Convex and Affine Hulls
        Closure, Relative Interior, and Continuity
        Recession Cones
   1.3. Convexity and Optimization
        Local and Global Minima
        The Projection Theorem
        Directions of Recession and Existence of Optimal Solutions
        Existence of Saddle Points
   1.4. Hyperplanes
   1.5. Conical Approximations and Constrained Optimization
   1.6. Polyhedral Convexity
        Polyhedral Cones
        Polyhedral Sets
        Extreme Points
        Extreme Points and Linear Programming
   1.7. Subgradients
        Directional Derivatives
        Subgradients and Subdifferentials
        ε-Subgradients
        Subgradients of Extended Real-Valued Functions
        Directional Derivative of the Max Function
   1.8. Optimality Conditions
   1.9. Notes and Sources

2. Lagrange Multipliers
   2.1. Introduction to Lagrange Multipliers
   2.2. Enhanced Fritz John Optimality Conditions
   2.3. Informative Lagrange Multipliers
   2.4. Pseudonormality and Constraint Qualifications
   2.5. Exact Penalty Functions
   2.6. Using the Extended Representation
   2.7. Extensions to the Nondifferentiable Case
   2.8. Notes and Sources

3. Lagrangian Duality
   3.1. Geometric Multipliers
   3.2. Duality Theory
   3.3. Linear and Quadratic Programming Duality
   3.4. Strong Duality Theorems
        Convex Cost – Linear Constraints
        Convex Cost – Convex Constraints
   3.5. Notes and Sources

4. Conjugate Duality and Applications
   4.1. Conjugate Functions
   4.2. The Fenchel Duality Theorem
   4.3. The Primal Function and Sensitivity Analysis
   4.4. Exact Penalty Functions
   4.5. Notes and Sources

5. Dual Computational Methods
   5.1. Dual Subgradients
   5.2. Subgradient Methods
        Analysis of Subgradient Methods
        Subgradient Methods with Randomization
   5.3. Cutting Plane Methods
   5.4. Ascent Methods
   5.5. Notes and Sources

Preface

These lecture notes were developed for the needs of a graduate course at the Electrical Engineering and Computer Science Department at M.I.T. They focus selectively on a number of fundamental analytical and computational topics in (deterministic) optimization that span a broad range from continuous to discrete optimization, but are connected through the recurring theme of convexity, Lagrange multipliers, and duality. These topics include Lagrange multiplier theory, Lagrangian and conjugate duality, and nondifferentiable optimization. The notes contain substantial portions that are adapted from my textbook Nonlinear Programming: 2nd Edition, Athena Scientific, 1999. However, the notes are more advanced, more mathematical, and more research-oriented.

As part of the course I have also decided to develop in detail those aspects of the theory of convex sets and functions that are essential for an in-depth coverage of Lagrange multipliers and duality. I have long thought that convexity, aside from being an eminently useful subject in engineering and operations research, is also an excellent vehicle for assimilating some of the basic concepts of analysis within an intuitive geometrical setting. Unfortunately, the subject's coverage in mathematics and engineering curricula is scant and incidental. I believe that at least part of the reason is that while there are a number of excellent books on convexity, as well as a true classic (Rockafellar's 1970 book), none of them is well suited for teaching nonmathematicians, who form the largest part of the potential audience.

I have therefore tried in these notes to make convex analysis accessible by limiting somewhat its scope and by emphasizing its geometrical character, while at the same time maintaining mathematical rigor. The coverage of the theory is significantly extended in the exercises, whose detailed solutions are posted on the internet. I have included as many insightful illustrations as I could come up with, and I have tried to use geometric visualization as a principal tool for maintaining the students' interest in mathematical proofs. To highlight a contrast in style, Rockafellar's marvelous book contains no figures at all!

An important part of my approach has been to maintain a close link between the theoretical treatment of convexity concepts and their application to optimization. For example, in Chapter 1, soon after the development of some of the basic facts about convexity, I discuss some of their applications to optimization and saddle point theory; soon after the discussion of hyperplanes and cones, I discuss conical approximations and necessary conditions for optimality; soon after the discussion of polyhedral convexity, I discuss its application in linear programming; and soon after the discussion of subgradients, I discuss their use in optimality conditions. I follow this style consistently in the remaining chapters, although, having developed in Chapter 1 most of the needed convexity theory, the discussion in the subsequent chapters is more heavily weighted towards optimization.

In addition to their educational purpose, these notes aim to develop two topics that I have recently researched with two of my students, and to integrate them into the overall landscape of convexity, duality, and optimization. These topics are:

(a) A new approach to Lagrange multiplier theory, based on a set of enhanced Fritz John conditions and the notion of constraint pseudonormality. This work, joint with my Ph.D. student Asuman Koksal, aims to generalize, unify, and streamline the theory of constraint qualifications. It allows for an abstract set constraint (in addition to equalities and inequalities), it highlights the essential structure of constraints using the new notion of pseudonormality, and it develops the connection between Lagrange multipliers and exact penalty functions.

(b) A new approach to the computational solution of (nondifferentiable) dual problems via incremental subgradient methods. These methods, developed jointly with my Ph.D. student Angelia Geary-Nedic, include some interesting randomized variants which, according to both analysis and experiment, perform substantially better than the standard subgradient methods for large-scale problems that typically arise in the context of duality.

The lecture notes may be freely reproduced and distributed for noncommercial purposes. They represent work in progress, and your feedback and suggestions for improvements in content and style will be most welcome.

Dimitri P. Bertsekas
dimitrib@mit.edu
Spring 2001

1

Convex Analysis and Optimization

Date: June 10, 2001

Contents

1.1. Linear Algebra and Analysis
     Vectors and Matrices
     Topological Properties
     Square Matrices
     Derivatives
1.2. Convex Sets and Functions
     Basic Properties
     Convex and Affine Hulls
     Closure, Relative Interior, and Continuity
     Recession Cones
1.3. Convexity and Optimization
     Local and Global Minima
     The Projection Theorem
     Directions of Recession and Existence of Optimal Solutions
     Existence of Saddle Points
1.4. Hyperplanes
1.5. Conical Approximations and Constrained Optimization
1.6. Polyhedral Convexity
     Polyhedral Cones
     Polyhedral Sets
     Extreme Points
     Extreme Points and Linear Programming
1.7. Subgradients
     Directional Derivatives
     Subgradients and Subdifferentials
     ε-Subgradients
     Subgradients of Extended Real-Valued Functions
     Directional Derivative of the Max Function
1.8. Optimality Conditions
1.9. Notes and Sources

In this chapter we provide the mathematical background for this book. In Section 1.1, we list some basic definitions, notational conventions, and results from linear algebra and analysis. We assume that the reader is familiar with this material, so no proofs are given. In the remainder of the chapter, we focus on convex analysis with an emphasis on optimization-related topics. We assume no prior knowledge of the subject, and we provide proofs and a fairly detailed development.

For related and additional material, we recommend the books by Hoffman and Kunze [HoK71], Lancaster and Tismenetsky [LaT85], and Strang [Str76] (linear algebra), the books by Ash [Ash72], Ortega and Rheinboldt [OrR70], and Rudin [Rud76] (analysis), and the books by Rockafellar [Roc70], Ekeland and Temam [EkT76], Rockafellar [Roc84], Hiriart-Urruty and Lemarechal [HiL93], Rockafellar and Wets [RoW98], Bonnans and Shapiro [BoS00], and Borwein and Lewis [BoL00] (convex analysis). The book by Rockafellar [Roc70], widely viewed as the classic convex analysis text, contains a deeper and more extensive development of convexity than the one given here, although it does not cross over into nonconvex optimization. The book by Rockafellar and Wets [RoW98] is a deep and detailed treatment of variational analysis, a broad spectrum of topics that integrate classical analysis, convexity, and optimization of both convex and nonconvex (possibly nonsmooth) functions. These two books represent important milestones in the development of optimization theory, and contain a wealth of material, a good deal of which is original. However, they are written for the advanced reader, in a style that many nonmathematicians find challenging.

As we embark on the study of convexity, it is worth listing some of the properties of convex sets and functions that make them so special in optimization.

(a) Convex functions have no local minima that are not global. Thus the difficulties associated with multiple disconnected local minima, whose global optimality is hard to verify in practice, are avoided.

(b) Convex sets are connected and have feasible directions at any point (assuming they consist of more than one point). By this we mean that given any point x in a convex set X, it is possible to move from x along some directions and stay within X for at least a nontrivial interval. In fact a stronger property holds: given any two distinct points x and x̄ in X, the direction x̄ − x is a feasible direction at x, and all feasible directions can be characterized this way. For optimization purposes, this is important because it allows a calculus-based comparison of the cost of x with the cost of its close neighbors, and forms the basis for some important algorithms. Furthermore, much of the difficulty commonly associated with discrete constraint sets (arising for example in combinatorial optimization) is not encountered under convexity.

10 4 Convex Analysis and Optimization Chap. 1 (c) Convex sets have a nonempty relative interior. In other words, when viewed within the smallest affine set containing it, a convex set has a nonempty interior. Thus convex sets avoid the analytical and computational optimization difficulties associated with thin and curved constraint surfaces. (d) A nonconvex function can be convexified while maintaining the optimality of its global minima, by forming the convex hull of the epigraph of the function. (e) The existence of a global minimum of a convex function over a convex set is conveniently characterized in terms of directions of recession (see Section 1.3). (f) Polyhedral convex sets (those specified by linear equality and inequality constraints) are characterized in terms of a finite set of extreme points and extreme directions. This is the basis for finitely terminating methods for linear programming, including the celebrated simplex method (see Section 1.6). (g) Convex functions are continuous and have nice differentiability properties. In particular, a real-valued convex function is directionally differentiable at any point. Furthermore, while a convex function need not be differentiable, it possesses subgradients, which are nice and geometrically intuitive substitutes for a gradient (see Section 1.7). Just like gradients, subgradients figure prominently in optimality conditions and computational algorithms. (h) Convex functions are central in duality theory. Indeed, the dual problem of a given optimization problem (discussed in Chapters 3 and 4) consists of minimization of a convex function over a convex set, even if the original problem is not convex. (i) Closed convex cones are self-dual with respect to orthogonality. In words, the set of vectors orthogonal to the set C (the set of vectors that form a nonpositive inner product with all vectors in a closed and convex cone C) is equal to C. This simple and geometrically intuitive property (discussed in Section 1.5) underlies important aspects of Lagrange multiplier theory. (j) Convex, lower semicontinuous functions are self-dual with respect to conjugacy. It will be seen in Chapter 4 that a certain geometrically motivated conjugacy operation on a given convex, lower semicontinuous function generates a convex, lower semicontinuous function, and when applied for a second time regenerates the original function. The conjugacy operation is central in duality theory, and has a nice interpretation that can be used to visualize and understand some of the most profound aspects of optimization.

Our approach in this chapter is to maintain a close link between the theoretical treatment of convexity concepts and their application to optimization. For example, soon after the development of some of the basic facts about convexity in Section 1.2, we discuss some of their applications to optimization in Section 1.3; and soon after the discussion of hyperplanes and cones in Sections 1.4 and 1.5, we discuss conditions for optimality. We follow this style consistently in the remaining chapters, although, having developed in Chapter 1 most of the convexity theory that we will need, the discussion in the subsequent chapters is more heavily weighted towards optimization.

1.1 LINEAR ALGEBRA AND ANALYSIS

Notation

If X is a set and x is an element of X, we write x ∈ X. A set can be specified in the form X = {x | x satisfies P}, as the set of all elements satisfying property P. The union of two sets X_1 and X_2 is denoted by X_1 ∪ X_2 and their intersection by X_1 ∩ X_2. The symbols ∃ and ∀ have the meanings "there exists" and "for all," respectively. The empty set is denoted by Ø.

The set of real numbers (also referred to as scalars) is denoted by R. The set R augmented with +∞ and −∞ is called the set of extended real numbers. We denote by [a, b] the set of (possibly extended) real numbers x satisfying a ≤ x ≤ b. A rounded, instead of square, bracket denotes strict inequality in the definition. Thus (a, b], [a, b), and (a, b) denote the set of all x satisfying a < x ≤ b, a ≤ x < b, and a < x < b, respectively. When working with extended real numbers, we use the natural extensions of the rules of arithmetic: x · 0 = 0 for every extended real number x, x · ∞ = ∞ if x > 0, x · ∞ = −∞ if x < 0, and x + ∞ = ∞ and x − ∞ = −∞ for every scalar x. The expression ∞ − ∞ is meaningless and is never allowed to occur.

If f is a function, we use the notation f : X → Y to indicate the fact that f is defined on a set X (its domain) and takes values in a set Y (its range). If f : X → Y is a function, and U and V are subsets of X and Y, respectively, the set {f(x) | x ∈ U} is called the image or forward image of U, and the set {x ∈ X | f(x) ∈ V} is called the inverse image of V.

1.1.1 Vectors and Matrices

We denote by R^n the set of n-dimensional real vectors. For any x ∈ R^n, we use x_i to indicate its ith coordinate, also called its ith component.

12 6 Convex Analysis and Optimization Chap. 1 Vectors in R n will be viewed as column vectors, unless the contrary is explicitly stated. For any x R n, x denotes the transpose of x, which is an n-dimensional row vector. The inner product of two vectors x, y R n is defined by x y = n i=1 x iy i. Any two vectors x, y R n satisfying x y = 0 are called orthogonal. If x is a vector in R n, the notations x > 0 and x 0 indicate that all coordinates of x are positive and nonnegative, respectively. For any two vectors x and y, the notation x > y means that x y > 0. The notations x y, x < y, etc., are to be interpreted accordingly. If X is a set and λ is a scalar we denote by λx the set {λx x X}. If X 1 and X 2 are two subsets of R n, we denote by X 1 + X 2 the vector sum {x 1 + x 2 x 1 X 1, x 2 X 2 }. We use a similar notation for the sum of any finite number of subsets. In the case where one of the subsets consists of a single vector x, we simplify this notation as follows: x + X = {x + x x X}. Given sets X i R n i, i = 1,..., m, the Cartesian product of the X i, denoted by X 1 X m, is the subset { (x1,..., x m ) x i X i, i = 1,..., m } of R n 1+ +n m. Subspaces and Linear Independence A subset S of R n is called a subspace if ax + by S for every x, y X and every a, b R. An affine set in R n is a translated subspace, i.e., a set of the form x + S = {x + x x S}, where x is a vector in R n and S is a subspace of R n. The span of a finite collection {x 1,..., x m } of elements of R n (also called the subspace generated by the collection) is the subspace consisting of all vectors y of the form y = m k=1 a kx k, where each a k is a scalar. The vectors x 1,..., x m R n are called linearly independent if there exists no set of scalars a 1,..., a m such that m k=1 a kx k = 0, unless a k = 0 for each k. An equivalent definition is that x 1 0, and for every k > 1, the vector x k does not belong to the span of x 1,..., x k 1. If S is a subspace of R n containing at least one nonzero vector, a basis for S is a collection of vectors that are linearly independent and whose span is equal to S. Every basis of a given subspace has the same number of vectors. This number is called the dimension of S. By convention, the subspace {0} is said to have dimension zero. The dimension of an affine set

13 Sec. 1.1 Linear Algebra and Analysis 7 x + S is the dimension of the corresponding subspace S. Every subspace of nonzero dimension has an orthogonal basis, i.e., a basis consisting of mutually orthogonal vectors. Given any set X, the set of vectors that are orthogonal to all elements of X is a subspace denoted by X : X = {y y x = 0, x X}. If S is a subspace, S is called the orthogonal complement of S. It can be shown that (S ) = S (see the Polar Cone Theorem in Section 1.5). Furthermore, any vector x can be uniquely decomposed as the sum of a vector from S and a vector from S (see the Projection Theorem in Section 1.3.2). Matrices For any matrix A, we use A ij, [A] ij, or a ij to denote its ijth element. The transpose of A, denoted by A, is defined by [A ] ij = a ji. For any two matrices A and B of compatible dimensions, the transpose of the product matrix AB satisfies (AB) = B A. If X is a subset of R n and A is an m n matrix, then the image of X under A is denoted by AX (or A X if this enhances notational clarity): AX = {Ax x X}. If X is subspace, then AX is also a subspace. Let A be a square matrix. We say that A is symmetric if A = A. We say that A is diagonal if [A] ij = 0 whenever i j. We use I to denote the identity matrix. The determinant of A is denoted by det(a). Let A be an m n matrix. The range space of A, denoted by R(A), is the set of all vectors y R m such that y = Ax for some x R n. The null space of A, denoted by N(A), is the set of all vectors x R n such that Ax = 0. It is seen that the range space and the null space of A are subspaces. The rank of A is the dimension of the range space of A. The rank of A is equal to the maximal number of linearly independent columns of A, and is also equal to the maximal number of linearly independent rows of A. The matrix A and its transpose A have the same rank. We say that A has full rank, if its rank is equal to min{m, n}. This is true if and only if either all the rows of A are linearly independent, or all the columns of A are linearly independent. The range of an m n matrix A and the orthogonal complement of the nullspace of its transpose are equal, i.e., R(A) = N(A ). Another way to state this result is that given vectors a 1,..., a n R m (the columns of A) and a vector x R m, we have x y = 0 for all y such that

a_i′y = 0 for all i if and only if x = λ_1 a_1 + · · · + λ_n a_n for some scalars λ_1, . . . , λ_n. This is a special case of Farkas' lemma, an important result for constrained optimization, which will be discussed later in Section 1.6. A useful application of this result is that if S_1 and S_2 are two subspaces of R^n, then

S_1^⊥ + S_2^⊥ = (S_1 ∩ S_2)^⊥.

This follows by introducing matrices B_1 and B_2 such that S_1 = {x | B_1 x = 0} = N(B_1) and S_2 = {x | B_2 x = 0} = N(B_2), and writing

S_1^⊥ + S_2^⊥ = R([B_1′ B_2′]) = N([B_1′ B_2′]′)^⊥ = (N(B_1) ∩ N(B_2))^⊥ = (S_1 ∩ S_2)^⊥.

A function f : R^n → R is said to be affine if it has the form f(x) = a′x + b for some a ∈ R^n and b ∈ R. Similarly, a function f : R^n → R^m is said to be affine if it has the form f(x) = Ax + b for some m × n matrix A and some b ∈ R^m. If b = 0, f is said to be a linear function or linear transformation.

1.1.2 Topological Properties

Definition 1.1.1: A norm ‖ · ‖ on R^n is a function that assigns a scalar ‖x‖ to every x ∈ R^n and that has the following properties:

(a) ‖x‖ ≥ 0 for all x ∈ R^n.
(b) ‖αx‖ = |α| ‖x‖ for every scalar α and every x ∈ R^n.
(c) ‖x‖ = 0 if and only if x = 0.
(d) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ R^n (this is referred to as the triangle inequality).

The Euclidean norm of a vector x = (x_1, . . . , x_n) is defined by

‖x‖ = (x′x)^{1/2} = ( Σ_{i=1}^n |x_i|^2 )^{1/2}.

The space R^n, equipped with this norm, is called a Euclidean space. We will use the Euclidean norm almost exclusively in this book. In particular, in the absence of a clear indication to the contrary, ‖ · ‖ will denote the Euclidean norm. Two important results for the Euclidean norm are:

Proposition 1.1.1: (Pythagorean Theorem) For any two vectors x and y that are orthogonal, we have

‖x + y‖^2 = ‖x‖^2 + ‖y‖^2.

Proposition 1.1.2: (Schwartz inequality) For any two vectors x and y, we have

|x′y| ≤ ‖x‖ ‖y‖,

with equality holding if and only if x = αy for some scalar α.

Two other important norms are the maximum norm ‖ · ‖_∞ (also called sup-norm or l_∞-norm), defined by

‖x‖_∞ = max_i |x_i|,

and the l_1-norm ‖ · ‖_1, defined by

‖x‖_1 = Σ_{i=1}^n |x_i|.

Sequences

We use both subscripts and superscripts in sequence notation. Generally, we prefer subscripts, but we use superscripts whenever we need to reserve the subscript notation for indexing coordinates or components of vectors and functions. The meaning of the subscripts and superscripts should be clear from the context in which they are used.

A sequence {x_k | k = 1, 2, . . .} (or {x_k} for short) of scalars is said to converge if there exists a scalar x such that for every ε > 0 we have |x_k − x| < ε for every k greater than some integer K (depending on ε). We call the scalar x the limit of {x_k}, and we also say that {x_k} converges to x; symbolically, x_k → x or lim_{k→∞} x_k = x. If for every scalar b there exists some K (depending on b) such that x_k ≥ b for all k ≥ K, we write x_k → ∞ and lim_{k→∞} x_k = ∞. Similarly, if for every scalar b there exists some K such that x_k ≤ b for all k ≥ K, we write x_k → −∞ and lim_{k→∞} x_k = −∞.
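As a concrete illustration of the preceding definitions, the following Python fragment (an informal numerical sketch using numpy, with ad hoc names) checks the Pythagorean theorem, the Schwartz inequality, and the triangle inequality for the Euclidean, maximum, and l_1 norms on randomly generated vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
y = rng.standard_normal(5)

# Schwartz inequality: |x'y| <= ||x|| * ||y|| (Euclidean norm).
assert abs(x @ y) <= np.linalg.norm(x) * np.linalg.norm(y) + 1e-12

# Pythagorean theorem: for orthogonal u, v, ||u + v||^2 = ||u||^2 + ||v||^2.
u = x
v = y - (x @ y) / (x @ x) * x       # remove the component of y along x, so u'v = 0
assert abs(u @ v) < 1e-10
assert np.isclose(np.linalg.norm(u + v)**2,
                  np.linalg.norm(u)**2 + np.linalg.norm(v)**2)

# Triangle inequality for the Euclidean, maximum (sup), and l1 norms.
for ord_ in (2, np.inf, 1):
    lhs = np.linalg.norm(x + y, ord_)
    rhs = np.linalg.norm(x, ord_) + np.linalg.norm(y, ord_)
    assert lhs <= rhs + 1e-12
```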

16 10 Convex Analysis and Optimization Chap. 1 A sequence {x k } is called a Cauchy sequence if for every ɛ > 0, there exists some K (depending on ɛ) such that x k x m < ɛ for all k K and m K. A sequence {x k } is said to be bounded above (respectively, below) if there exists some scalar b such that x k b (respectively, x k b) for all k. It is said to be bounded if it is bounded above and bounded below. The sequence {x k } is said to be monotonically nonincreasing (respectively, nondecreasing) if x k+1 x k (respectively, x k+1 x k ) for all k. If {x k } converges to x and is nonincreasing (nondecreasing), we also use the notation x k x (x k x, respectively). Proposition 1.1.3: Every bounded and monotonically nonincreasing or nondecreasing scalar sequence converges. Note that a monotonically nondecreasing sequence {x k } is either bounded, in which case it converges to some scalar x by the above proposition, or else it is unbounded, in which case x k. Similarly, a monotonically nonincreasing sequence {x k } is either bounded and converges, or it is unbounded, in which case x k. The supremum of a nonempty set X of scalars, denoted by sup X, is defined as the smallest scalar x such that x y for all y X. If no such scalar exists, we say that the supremum of X is. Similarly, the infimum of X, denoted by inf X, is defined as the largest scalar x such that x y for all y X, and is equal to if no such scalar exists. For the empty set, we use the convention sup(ø) =, inf(ø) =. (This is somewhat paradoxical, since we have that the sup of a set is less than its inf, but works well for our analysis.) If sup X is equal to a scalar x that belongs to the set X, we say that x is the maximum point of X and we often write x = sup X = max X. Similarly, if inf X is equal to a scalar x that belongs to the set X, we often write x = inf X = min X. Thus, when we write max X (or min X) in place of sup X (or inf X, respectively) we do so just for emphasis: we indicate that it is either evident, or it is known through earlier analysis, or it is about to be shown that the maximum (or minimum, respectively) of the set X is attained at one of its points. Given a scalar sequence {x k }, the supremum of the sequence, denoted by sup k x k, is defined as sup{x k k = 1, 2,...}. The infimum of a sequence

17 Sec. 1.1 Linear Algebra and Analysis 11 is similarly defined. Given a sequence {x k }, let y m = sup{x k k m}, z m = inf{x k k m}. The sequences {y m } and {z m } are nonincreasing and nondecreasing, respectively, and therefore have a limit whenever {x k } is bounded above or is bounded below, respectively (Prop ). The limit of y m is denoted by lim sup k x k, and is referred to as the limit superior of {x k }. The limit of z m is denoted by lim inf k x k, and is referred to as the limit inferior of {x k }. If {x k } is unbounded above, we write lim sup k x k =, and if it is unbounded below, we write lim inf k x k =. Proposition 1.1.4: Let {x k } and {y k } be scalar sequences. (a) There holds inf k x k lim inf k x k lim sup k x k sup x k. k (b) {x k } converges if and only if lim inf k x k = lim sup k x k and, in that case, both of these quantities are equal to the limit of x k. (c) If x k y k for all k, then lim inf x k lim inf y k, k k lim sup k x k lim sup y k. k (d) We have lim inf x k + lim inf y k lim inf (x k + y k ), k k k lim sup k x k + lim sup k y k lim sup(x k + y k ). k A sequence {x k } of vectors in R n is said to converge to some x R n if the ith coordinate of x k converges to the ith coordinate of x for every i. We use the notations x k x and lim k x k = x to indicate convergence for vector sequences as well. The sequence {x k } is called bounded (respectively, a Cauchy sequence) if each of its corresponding coordinate sequences is bounded (respectively, a Cauchy sequence). It can be seen that {x k } is bounded if and only if there exists a scalar c such that x k c for all k.
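The limit superior and limit inferior can also be explored numerically. In the following informal numpy sketch, suprema and infima over a long finite tail serve only as proxies for the limits of the sequences y_m and z_m defined above; it computes them for an oscillating sequence and checks the first inequality of Prop. 1.1.4(d) on a sample.

```python
import numpy as np

# x_k = (-1)^k + 1/k oscillates: lim inf = -1, lim sup = +1.
k = np.arange(1, 10001)
x = (-1.0)**k + 1.0 / k
y = np.sin(k)                       # another bounded sequence

def tail_sup_inf(seq, m):
    """sup and inf of the tail {seq_k : k >= m} (0-based index m)."""
    return seq[m:].max(), seq[m:].min()

m = 9000                            # far tail as a finite-sample proxy
print(tail_sup_inf(x, m))           # approximately (1.0001, -0.9999)

# Prop. 1.1.4(d): lim inf x_k + lim inf y_k <= lim inf (x_k + y_k),
# checked here with tail infima as proxies (the inequality holds exactly
# for infima over any common index set).
lhs = x[m:].min() + y[m:].min()
rhs = (x + y)[m:].min()
assert lhs <= rhs + 1e-12
```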

18 12 Convex Analysis and Optimization Chap. 1 Definition 1.1.2: We say that a vector x R n is a limit point of a sequence {x k } in R n if there exists a subsequence of {x k } that converges to x. Proposition 1.1.5: Let {x k } be a sequence in R n. (a) If {x k } is bounded, it has at least one limit point. (b) {x k } converges if and only if it is bounded and it has a unique limit point. (c) {x k } converges if and only if it is a Cauchy sequence. o( ) Notation If p is a positive integer and h : R n R m, then we write h(x) = o ( x p) if h(x k ) lim k x k = 0, p for all sequences {x k }, with x k 0 for all k, that converge to 0. Closed and Open Sets We say that x is a closure point or limit point of a set X R n if there exists a sequence {x k }, consisting of elements of X, that converges to x. The closure of X, denoted cl(x), is the set of all limit points of X. Definition 1.1.3: A set X R n is called closed if it is equal to its closure. It is called open if its complement (the set {x x / X}) is closed. It is called bounded if there exists a scalar c such that the magnitude of any coordinate of any element of X is less than c. It is called compact if it is closed and bounded.
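The following informal numpy sketch illustrates two of the notions just introduced: a bounded sequence with two limit points, and a function h satisfying h(x) = o(‖x‖), for which the displayed ratio tends to 0 along a sequence converging to 0.

```python
import numpy as np

# A bounded sequence with two limit points: x_k = (-1)^k * (1 + 1/k)
# has the limit points -1 and +1 (cf. Definition 1.1.2 and Prop. 1.1.5(a)).
k = np.arange(1, 2001)
x = (-1.0)**k * (1 + 1.0 / k)
print(x[-2], x[-1])                 # tail terms close to -1 and +1

# o(.) notation: h(x) = x^2 (componentwise) satisfies h(x) = o(||x||),
# i.e., ||h(x_k)|| / ||x_k|| -> 0 along any sequence x_k -> 0 with x_k != 0.
def h(v):
    return v**2

for t in [1.0, 0.1, 0.01, 0.001]:
    v = t * np.ones(3)
    print(t, np.linalg.norm(h(v)) / np.linalg.norm(v))   # ratio shrinks like t
```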

19 Sec. 1.1 Linear Algebra and Analysis 13 Definition 1.1.4: A neighborhood of a vector x is an open set containing x. We say that x is an interior point of a set X R n if there exists a neighborhood of x that is contained in X. A vector x cl(x) which is not an interior point of X is said to be a boundary point of X. Let be a given norm in R n. For any ɛ > 0 and x R n, consider the sets { x x x < ɛ } {, x x x ɛ }. The first set is open and is called an open sphere centered at x, while the second set is closed and is called a closed sphere centered at x. Sometimes the terms open ball and closed ball are used, respectively. Proposition 1.1.6: (a) The union of finitely many closed sets is closed. (b) The intersection of closed sets is closed. (c) The union of open sets is open. (d) The intersection of finitely many open sets is open. (e) A set is open if and only if all of its elements are interior points. (f) Every subspace of R n is closed. (g) A set X is compact if and only if every sequence of elements of X has a subsequence that converges to an element of X. (h) If {X k } is a sequence of nonempty and compact sets such that X k X k+1 for all k, then the intersection k=0 X k is nonempty and compact. The topological properties of subsets of R n, such as being open, closed, or compact, do not depend on the norm being used. This is a consequence of the following proposition, referred to as the norm equivalence property in R n, which shows that if a sequence converges with respect to one norm, it converges with respect to all other norms. Proposition 1.1.7: For any two norms and on R n, there exists a scalar c such that x c x for all x R n. Using the preceding proposition, we obtain the following.

20 14 Convex Analysis and Optimization Chap. 1 Proposition 1.1.8: If a subset of R n is open (respectively, closed, bounded, or compact) with respect to some norm, it is open (respectively, closed, bounded, or compact) with respect to all other norms. Sequences of Sets Let {X k } be a sequence of nonempty subsets of R n. The outer limit of {X k }, denoted lim sup k X k, is the set of all x R n such that every neighborhood of x has a nonempty intersection with infinitely many of the sets X k, k = 1,2,.... Equivalently, lim sup k X k is the set of all limits of subsequences {x k } K such that x k X k, for all k K. The inner limit of {X k }, denoted lim inf k X k, is the set of all x R n such that every neighborhood of x has a nonempty intersection with all except finitely many of the sets X k, k = 1, 2,.... Equivalently, lim inf k X k is the set of all limits of sequences {x k } such that x k X k, for all k = 1, 2,.... The sequence {X k } is said to converge to a set X if X = liminf k X k = lim sup X k, k in which case X is said to be the limit of {X k }. The inner and outer limits are closed (possibly empty) sets. If each set X k consists of a single point x k, lim sup k X k is the set of limit points of {x k }, while lim inf k X k is just the limit of {x k } if {x k } converges, and otherwise it is empty. Continuity Let X be a subset of R n and let f : X R m be some function. Let x be a closure point of X. If there exists a vector y R m such that the sequence { f(x k ) } converges to y for every sequence {x k } X such that lim k x k = x, we write lim z x f(z) = y. If X is a subset of R and x is a closure point of X, we denote by lim z x f(z) [respectively, lim z x f(z)] the limit of f(x k ), where {x k } is any sequence of elements of X converging to x and satisfying x k x (respectively, x k x), assuming that at least one such sequence {x k } exists, and the corresponding limit of f(x k ) exists and is independent of the choice of {x k }.

Definition 1.1.5: Let X be a subset of R^n.

(a) A function f : X → R^m is called continuous at a point x ∈ X if lim_{z→x} f(z) = f(x).

(b) A function f : X → R^m is called right-continuous (respectively, left-continuous) at a point x ∈ X if lim_{z↓x} f(z) = f(x) [respectively, lim_{z↑x} f(z) = f(x)].

(c) A real-valued function f : X → R is called upper semicontinuous (respectively, lower semicontinuous) at a vector x ∈ X if f(x) ≥ lim sup_{k→∞} f(x_k) [respectively, f(x) ≤ lim inf_{k→∞} f(x_k)] for every sequence {x_k} of elements of X converging to x.

If f : X → R^m is continuous at every point of a subset of its domain X, we say that f is continuous over that subset. If f : X → R^m is continuous at every point of its domain X, we say that f is continuous. We use similar terminology for right-continuous, left-continuous, upper semicontinuous, and lower semicontinuous functions.

Proposition 1.1.9:

(a) The composition of two continuous functions is continuous.

(b) Any vector norm on R^n is a continuous function.

(c) Let f : R^n → R^m be continuous, and let Y ⊂ R^m be open (respectively, closed). Then the inverse image of Y, {x ∈ R^n | f(x) ∈ Y}, is open (respectively, closed).

(d) Let f : R^n → R^m be continuous, and let X ⊂ R^n be compact. Then the forward image of X, {f(x) | x ∈ X}, is compact.

Matrix Norms

A norm ‖ · ‖ on the set of n × n matrices is a real-valued function that has the same properties as vector norms do when the matrix is viewed as an element of R^{n^2}. The norm of an n × n matrix A is denoted by ‖A‖.

We are mainly interested in induced norms, which are constructed as follows. Given any vector norm ‖ · ‖, the corresponding induced matrix

norm, also denoted by ‖ · ‖, is defined by

‖A‖ = sup_{x ∈ R^n, ‖x‖=1} ‖Ax‖.

It is easily verified that for any vector norm, the above equation defines a bona fide matrix norm having all the required properties. Note that by the Schwartz inequality (Prop. 1.1.2), we have

‖A‖ = sup_{‖x‖=1} ‖Ax‖ = sup_{‖y‖=‖x‖=1} y′Ax.

By reversing the roles of x and y in the above relation and by using the equality y′Ax = x′A′y, it follows that ‖A‖ = ‖A′‖.

1.1.3 Square Matrices

Definition 1.1.6: A square matrix A is called singular if its determinant is zero. Otherwise it is called nonsingular or invertible.

Proposition 1.1.10:

(a) Let A be an n × n matrix. The following are equivalent:
    (i) The matrix A is nonsingular.
    (ii) The matrix A′ is nonsingular.
    (iii) For every nonzero x ∈ R^n, we have Ax ≠ 0.
    (iv) For every y ∈ R^n, there is a unique x ∈ R^n such that Ax = y.
    (v) There is an n × n matrix B such that AB = I = BA.
    (vi) The columns of A are linearly independent.
    (vii) The rows of A are linearly independent.

(b) Assuming that A is nonsingular, the matrix B of statement (v) (called the inverse of A and denoted by A^{-1}) is unique.

(c) For any two square invertible matrices A and B of the same dimensions, we have (AB)^{-1} = B^{-1}A^{-1}.
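The induced matrix norm can be explored numerically. In the following informal numpy sketch, the induced Euclidean norm of a random matrix is estimated by sampling unit vectors (the sampling estimate is only a lower approximation of the supremum), and the identities ‖A‖ = ‖A′‖ and (AB)^{-1} = B^{-1}A^{-1} of Prop. 1.1.10(c) are checked.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Induced (Euclidean) matrix norm: ||A|| = sup_{||x||=1} ||Ax||,
# estimated by sampling unit vectors and compared with numpy's 2-norm.
X = rng.standard_normal((4, 20000))
X /= np.linalg.norm(X, axis=0)                 # columns are unit vectors
estimate = np.linalg.norm(A @ X, axis=0).max()
print(estimate, np.linalg.norm(A, 2))          # estimate <= exact, and close to it

# ||A|| = ||A'|| for the induced Euclidean norm.
assert np.isclose(np.linalg.norm(A, 2), np.linalg.norm(A.T, 2))

# (AB)^{-1} = B^{-1} A^{-1} for nonsingular A, B.
assert np.allclose(np.linalg.inv(A @ B),
                   np.linalg.inv(B) @ np.linalg.inv(A))
```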

23 Sec. 1.1 Linear Algebra and Analysis 17 Definition 1.1.7: The characteristic polynomial φ of an n n matrix A is defined by φ(λ) = det(λi A), where I is the identity matrix of the same size as A. The n (possibly repeated or complex) roots of φ are called the eigenvalues of A. A nonzero vector x (with possibly complex coordinates) such that Ax = λx, where λ is an eigenvalue of A, is called an eigenvector of A associated with λ. Proposition : Let A be a square matrix. (a) A complex number λ is an eigenvalue of A if and only if there exists a nonzero eigenvector associated with λ. (b) A is singular if and only if it has an eigenvalue that is equal to zero. Note that the only use of complex numbers in this book is in relation to eigenvalues and eigenvectors. All other matrices or vectors are implicitly assumed to have real components. Proposition : Let A be an n n matrix. (a) If T is a nonsingular matrix and B = T AT 1, then the eigenvalues of A and B coincide. (b) For any scalar c, the eigenvalues of ci + A are equal to c + λ 1,..., c + λ n, where λ 1,..., λ n are the eigenvalues of A. (c) The eigenvalues of A k are equal to λ k 1,..., λk n, where λ 1,..., λ n are the eigenvalues of A. (d) If A is nonsingular, then the eigenvalues of A 1 are the reciprocals of the eigenvalues of A. (e) The eigenvalues of A and A coincide. Symmetric and Positive Definite Matrices Symmetric matrices have several special properties, particularly regarding their eigenvalues and eigenvectors. In what follows in this section,

24 18 Convex Analysis and Optimization Chap. 1 denotes the Euclidean norm. Proposition : Let A be a symmetric n n matrix. Then: (a) The eigenvalues of A are real. (b) The matrix A has a set of n mutually orthogonal, real, and nonzero eigenvectors x 1,..., x n. (c) Suppose that the eigenvectors in part (b) have been normalized so that x i = 1 for each i. Then A = n λ i x i x i, i=1 where λ i is the eigenvalue corresponding to x i. Proposition : Let A be a symmetric n n matrix, and let λ 1 λ n be its (real) eigenvalues. Then: (a) A = max { λ 1, λ n }, where is the matrix norm induced by the Euclidean norm. (b) λ 1 y 2 y Ay λ n y 2 for all y R n. (c) If A is nonsingular, then A 1 = 1 min { λ 1, λ n }. Proposition : Let A be a square matrix, and let be the matrix norm induced by the Euclidean norm. Then: (a) If A is symmetric, then A k = A k for any positive integer k. (b) A 2 = A A = AA.
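As an informal numerical check of the preceding propositions, the following numpy sketch forms a random symmetric matrix, rebuilds it from its eigenvalues and orthonormal eigenvectors as A = Σ_i λ_i x_i x_i′, and verifies that ‖A‖ equals the largest eigenvalue magnitude and that λ_min ‖y‖^2 ≤ y′Ay ≤ λ_max ‖y‖^2.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                      # a symmetric matrix

lam, X = np.linalg.eigh(A)             # real eigenvalues (ascending), orthonormal eigenvectors
# Spectral decomposition: A = sum_i lam_i * x_i x_i'
A_rebuilt = sum(lam[i] * np.outer(X[:, i], X[:, i]) for i in range(5))
assert np.allclose(A, A_rebuilt)

# ||A|| = max{|lam_min|, |lam_max|} for the norm induced by the Euclidean norm.
assert np.isclose(np.linalg.norm(A, 2), max(abs(lam[0]), abs(lam[-1])))

# lam_min * ||y||^2 <= y'Ay <= lam_max * ||y||^2 for all y.
y = rng.standard_normal(5)
q = y @ A @ y
assert lam[0] * (y @ y) - 1e-10 <= q <= lam[-1] * (y @ y) + 1e-10
```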

25 Sec. 1.1 Linear Algebra and Analysis 19 Definition 1.1.8: A symmetric n n matrix A is called positive definite if x Ax > 0 for all x R n, x 0. It is called positive semidefinite if x Ax 0 for all x R n. Throughout this book, the notion of positive definiteness applies exclusively to symmetric matrices. Thus whenever we say that a matrix is positive (semi)definite, we implicitly assume that the matrix is symmetric. Proposition : (a) The sum of two positive semidefinite matrices is positive semidefinite. If one of the two matrices is positive definite, the sum is positive definite. (b) If A is a positive semidefinite n n matrix and T is an m n matrix, then the matrix T AT is positive semidefinite. If A is positive definite and T is invertible, then T AT is positive definite. Proposition : (a) For any m n matrix A, the matrix A A is symmetric and positive semidefinite. A A is positive definite if and only if A has rank n. In particular, if m = n, A A is positive definite if and only if A is nonsingular. (b) A square symmetric matrix is positive semidefinite (respectively, positive definite) if and only if all of its eigenvalues are nonnegative (respectively, positive). (c) The inverse of a symmetric positive definite matrix is symmetric and positive definite.
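The following informal numpy sketch illustrates these facts: for a random 6 × 4 matrix A of full column rank, A′A is symmetric positive definite, which is confirmed both through its eigenvalues and through a Cholesky factorization, and the quadratic form x′(A′A)x is positive on sampled nonzero vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))       # 6 x 4 with (almost surely) rank 4

# A'A is symmetric positive definite when A has full column rank:
# its eigenvalues are positive and a Cholesky factorization exists.
G = A.T @ A
assert np.allclose(G, G.T)
assert np.linalg.eigvalsh(G).min() > 0
L = np.linalg.cholesky(G)             # would raise LinAlgError if G were not pd
assert np.allclose(L @ L.T, G)

# The quadratic form x'Gx is positive for (randomly sampled) nonzero x.
for _ in range(1000):
    x = rng.standard_normal(4)
    assert x @ G @ x > 0
```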

Proposition 1.1.18: Let A be a symmetric positive semidefinite n × n matrix of rank m. There exists an n × m matrix C of rank m such that A = CC′. Furthermore, for any such matrix C:

(a) A and C′ have the same null space: N(A) = N(C′).
(b) A and C have the same range space: R(A) = R(C).

Proposition 1.1.19: Let A be a square symmetric positive semidefinite matrix.

(a) There exists a symmetric matrix Q with the property Q^2 = A. Such a matrix is called a symmetric square root of A and is denoted by A^{1/2}.
(b) There is a unique symmetric square root if and only if A is positive definite.
(c) A symmetric square root A^{1/2} is invertible if and only if A is invertible. Its inverse is denoted by A^{-1/2}.
(d) There holds A^{-1/2} A^{-1/2} = A^{-1}.
(e) There holds A A^{1/2} = A^{1/2} A.

1.1.4 Derivatives

Let f : R^n → R be some function, fix some x ∈ R^n, and consider the expression

lim_{α→0} [f(x + α e_i) − f(x)] / α,

where e_i is the ith unit vector (all components are 0 except for the ith component, which is 1). If the above limit exists, it is called the ith partial derivative of f at the point x and it is denoted by (∂f/∂x_i)(x) or ∂f(x)/∂x_i (x_i in this section will denote the ith coordinate of the vector x). Assuming all of these partial derivatives exist, the gradient of f at x is defined as the column vector

∇f(x) = ( ∂f(x)/∂x_1, . . . , ∂f(x)/∂x_n )′.
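The partial derivatives and the gradient can be approximated by finite differences along the unit vectors e_i. The following informal Python sketch (central differences with a fixed stepsize; the function f is an arbitrary smooth example) compares such an approximation with an analytically computed gradient.

```python
import numpy as np

def f(x):
    # a smooth function of two variables
    return x[0]**2 + 3.0 * x[0] * x[1] + np.exp(x[1])

def grad_f(x):
    # analytic gradient
    return np.array([2.0 * x[0] + 3.0 * x[1],
                     3.0 * x[0] + np.exp(x[1])])

def numerical_gradient(f, x, h=1e-6):
    """Approximate (df/dx_i)(x) by central differences along each unit vector e_i."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (f(x + h * e) - f(x - h * e)) / (2.0 * h)
    return g

x = np.array([0.7, -0.3])
print(grad_f(x))
print(numerical_gradient(f, x))       # agrees to about 6 significant digits
assert np.allclose(grad_f(x), numerical_gradient(f, x), atol=1e-5)
```

Finite differencing of this kind is useful only as a sanity check; the stepsize h trades off truncation error against roundoff error.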

For any y ∈ R^n, we define the one-sided directional derivative of f in the direction y to be

f′(x; y) = lim_{α↓0} [f(x + αy) − f(x)] / α,

provided that the limit exists. We note from the definitions that

f′(x; e_i) = −f′(x; −e_i) = (∂f/∂x_i)(x).

If the directional derivative of f at a vector x exists in all directions y and f′(x; y) is a linear function of y, we say that f is differentiable at x. This type of differentiability is also called Gateaux differentiability. It is seen that f is differentiable at x if and only if the gradient ∇f(x) exists and satisfies ∇f(x)′y = f′(x; y) for every y ∈ R^n. The function f is called differentiable over a given subset U of R^n if it is differentiable at every x ∈ U. The function f is called differentiable (without qualification) if it is differentiable at all x ∈ R^n.

If f is differentiable over an open set U and the gradient ∇f(x) is continuous at all x ∈ U, f is said to be continuously differentiable over U. Such a function has the property

lim_{y→0} [f(x + y) − f(x) − ∇f(x)′y] / ‖y‖ = 0,   ∀ x ∈ U,   (1.1)

where ‖ · ‖ is an arbitrary vector norm. If f is continuously differentiable over R^n, then f is also called a smooth function.

The preceding equation can also be used as an alternative definition of differentiability. In particular, f is called Frechet differentiable at x if there exists a vector g satisfying Eq. (1.1) with ∇f(x) replaced by g. If such a vector g exists, it can be seen that all the partial derivatives (∂f/∂x_i)(x) exist and that g = ∇f(x). Frechet differentiability implies (Gateaux) differentiability but not conversely (see for example [OrR70] for a detailed discussion). In this book, when dealing with a differentiable function f, we will always assume that f is continuously differentiable (smooth) over a given open set [∇f(x) is a continuous function of x over that set], in which case f is both Gateaux and Frechet differentiable, and the distinctions made above are of no consequence.

The definitions of differentiability of f at a point x only involve the values of f in a neighborhood of x. Thus, these definitions can be used for functions f that are not defined on all of R^n, but are defined instead in a neighborhood of the point at which the derivative is computed. In particular, for functions f : X → R that are defined over a strict subset X of R^n, we use the above definition of differentiability of f at a vector x provided x is an interior point of the domain X. Similarly, we use the above definition of differentiability or continuous differentiability of f over

28 22 Convex Analysis and Optimization Chap. 1 a subset U, provided U is an open subset of the domain X. Thus any mention of differentiability of a function f over a subset implicitly assumes that this subset is open. If f : R n R m is a vector-valued function, it is called differentiable (or smooth) if each component f i of f is differentiable (or smooth, respectively). The gradient matrix of f, denoted f(x), is the n m matrix whose ith column is the gradient f i (x) of f i. Thus, f(x) = [ ] f 1 (x) f m (x). The transpose of f is called the Jacobian of f and is a matrix whose ijth entry is equal to the partial derivative f i / x j. Now suppose that each one of the partial derivatives of a function f : R n R is a smooth function of x. We use the notation ( 2 f/ x i x j )(x) to indicate the ith partial derivative of f/ x j at a point x R n. The Hessian of f is the matrix whose ijth entry is equal to ( 2 f/ x i x j )(x), and is denoted by 2 f(x). We have ( 2 f/ x i x j )(x) = ( 2 f/ x j x i )(x) for every x, which implies that 2 f(x) is symmetric. If f : R m+n R is a function of (x, y), where x = (x 1,..., x m ) R m and y = (y 1,..., y n ) R n, we write x f(x, y) = f(x,y) x 1. f(x,y) x m, y f(x, y) = f(x,y) y 1... f(x,y) y n We denote by 2 xxf(x, y), 2 xyf(x, y), and 2 yyf(x, y) the matrices with components [ 2 xx f(x, y) ] ij = 2 f(x, y) x i x j, [ 2 xy f(x, y) ] ij = 2 f(x, y) x i y j, [ 2 yy f(x, y) ] ij = 2 f(x, y) y i y j. If f : R m+n R r, f = (f 1, f 2,..., f r ), we write x f(x, y) = [ x f 1 (x, y) x f r (x, y) ], y f(x, y) = [ y f 1 (x, y) y f r (x, y) ]. Let f : R k R m and g : R m R n be smooth functions, and let h be their composition, i.e., h(x) = g ( f(x) ).

Then, the chain rule for differentiation states that

∇h(x) = ∇f(x) ∇g(f(x)),   ∀ x ∈ R^k.

Some examples of useful relations that follow from the chain rule are:

∇(f(Ax)) = A′∇f(Ax),   ∇^2(f(Ax)) = A′∇^2 f(Ax) A,

where A is a matrix,

∇_x (f(h(x), y)) = ∇h(x) ∇_h f(h(x), y),

∇_x (f(h(x), g(x))) = ∇h(x) ∇_h f(h(x), g(x)) + ∇g(x) ∇_g f(h(x), g(x)).

We now state some theorems relating to differentiable functions that will be useful for our purposes.

Proposition 1.1.20: (Mean Value Theorem) Let f : R^n → R be continuously differentiable over an open sphere S, and let x be a vector in S. Then for all y such that x + y ∈ S, there exists an α ∈ [0, 1] such that

f(x + y) = f(x) + ∇f(x + αy)′y.

Proposition 1.1.21: (Second Order Expansions) Let f : R^n → R be twice continuously differentiable over an open sphere S, and let x be a vector in S. Then for all y such that x + y ∈ S:

(a) There holds

f(x + y) = f(x) + y′∇f(x) + ( ∫_0^1 ∫_0^t y′∇^2 f(x + τy) dτ dt ) y.

(b) There exists an α ∈ [0, 1] such that

f(x + y) = f(x) + y′∇f(x) + (1/2) y′∇^2 f(x + αy) y.

(c) There holds

f(x + y) = f(x) + y′∇f(x) + (1/2) y′∇^2 f(x) y + o(‖y‖^2).
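The expansion in part (c) of Prop. 1.1.21 states that the second order Taylor remainder is o(‖y‖^2). The following informal numpy sketch illustrates this for a particular smooth function: the printed ratios r(td)/‖td‖^2 shrink roughly linearly with t.

```python
import numpy as np

def f(x):
    return np.exp(x[0]) + x[0] * x[1]**2

def grad(x):
    return np.array([np.exp(x[0]) + x[1]**2, 2.0 * x[0] * x[1]])

def hess(x):
    return np.array([[np.exp(x[0]), 2.0 * x[1]],
                     [2.0 * x[1],   2.0 * x[0]]])

x = np.array([0.3, -0.5])
d = np.array([1.0, 2.0])

# Second order expansion, part (c): the remainder
#   r(y) = f(x+y) - f(x) - y'grad f(x) - (1/2) y'hess f(x) y
# is o(||y||^2), so r(t d)/||t d||^2 -> 0 as t -> 0.
for t in [1.0, 0.1, 0.01, 0.001]:
    y = t * d
    r = f(x + y) - f(x) - y @ grad(x) - 0.5 * (y @ hess(x) @ y)
    print(t, r / np.dot(y, y))
```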

30 24 Convex Analysis and Optimization Chap. 1 Proposition : (Implicit Function Theorem) Consider a function f : R n+m R m of x R n and y R m such that: (1) f(x, y) = 0. (2) f is continuous, and has a continuous and nonsingular gradient matrix y f(x, y) in an open set containing (x, y). Then there exist open sets S x R n and S y R m containing x and y, respectively, and a continuous function φ : S x S y such that y = φ(x) and f ( x, φ(x) ) = 0 for all x S x. The function φ is unique in the sense that if x S x, y S y, and f(x, y) = 0, then y = φ(x). Furthermore, if for some integer p > 0, f is p times continuously differentiable the same is true for φ, and we have φ(x) = x f ( x, φ(x) )( y f ( x, φ(x) )) 1, x Sx. As a final word of caution to the reader, let us mention that one can easily get confused with gradient notation and its use in various formulas, such as for example the order of multiplication of various gradients in the chain rule and the Implicit Function Theorem. Perhaps the safest guideline to minimize errors is to remember our conventions: (a) A vector is viewed as a column vector (an n 1 matrix). (b) The gradient f of a scalar function f : R n R is also viewed as a column vector. (c) The gradient matrix f of a vector function f : R n R m with components f 1,..., f m is the n m matrix whose columns are the (column) vectors f 1,..., f m. With these rules in mind one can use dimension matching as an effective guide to writing correct formulas quickly. 1.2 CONVEX SETS AND FUNCTIONS Basic Properties The notion of a convex set is defined below and is illustrated in Fig

Figure 1.2.1. Illustration of the definition of a convex set (left panels: convex sets; right panels: nonconvex sets). For convexity, linear interpolation αx + (1 − α)y, 0 < α < 1, between any two points of the set must yield points that lie within the set.

Definition 1.2.1: Let C be a subset of R^n. We say that C is convex if

αx + (1 − α)y ∈ C,   ∀ x, y ∈ C, ∀ α ∈ [0, 1].   (1.2)

Note that the empty set is by convention considered to be convex. Generally, when referring to a convex set, it will usually be apparent from the context whether this set can be empty, but we will often be specific in order to minimize ambiguities.

The following proposition lists some operations that preserve convexity of a set.

32 26 Convex Analysis and Optimization Chap. 1 Proposition 1.2.1: (a) The intersection i I C i of any collection {C i i I} of convex sets is convex. (b) The vector sum C 1 + C 2 of two convex sets C 1 and C 2 is convex. (c) The set x + λc is convex for any convex set C, vector x, and scalar λ. Furthermore, if C is a convex set and λ 1, λ 2 are positive scalars, we have (λ 1 + λ 2 )C = λ 1 C + λ 2 C. (d) The closure and the interior of a convex set are convex. (e) The image and the inverse image of a convex set under an affine function are convex. Proof: The proof is straightforward using the definition of convexity, cf. Eq. (1.2). For example, to prove part (a), we take two points x and y from i I C i, and we use the convexity of C i to argue that the line segment connecting x and y belongs to all the sets C i, and hence, to their intersection. The proofs of parts (b)-(e) are similar and are left as exercises for the reader. Q.E.D. A set C is said to be a cone if for all x C and λ > 0, we have λx C. A cone need not be convex and need not contain the origin (although the origin always lies in its closure). Several of the results of the above proposition have analogs for cones (see the exercises). Convex Functions The notion of a convex function is defined below and is illustrated in Fig

Definition 1.2.2: Let C be a convex subset of R^n. A function f : C → R is called convex if

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),   ∀ x, y ∈ C, ∀ α ∈ [0, 1].   (1.3)

The function f is called concave if −f is convex. The function f is called strictly convex if the above inequality is strict for all x, y ∈ C with x ≠ y, and all α ∈ (0, 1). For a function f : X → R, we also say that f is convex over the convex set C if the domain X of f contains C and Eq. (1.3) holds, i.e., when the domain of f is restricted to C, f becomes convex.

Figure 1.2.2. Illustration of the definition of a function that is convex over a convex set C. The linear interpolation αf(x) + (1 − α)f(y) overestimates the function value f(αx + (1 − α)y) for all α ∈ [0, 1].

If C is a convex set and f : C → R is a convex function, the level sets {x ∈ C | f(x) ≤ γ} and {x ∈ C | f(x) < γ} are convex for all scalars γ. To see this, note that if x, y ∈ C are such that f(x) ≤ γ and f(y) ≤ γ, then for any α ∈ [0, 1], we have αx + (1 − α)y ∈ C, by the convexity of C, and f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) ≤ γ, by the convexity of f. However, the converse is not true; for example, the function f(x) = √|x| has convex level sets but is not convex.

Unless otherwise indicated, we implicitly assume that a convex function is real-valued and is defined over the entire Euclidean space (rather than over just a convex subset). We occasionally deal with extended real-valued convex functions that can take the value of +∞ or can take the value −∞

34 28 Convex Analysis and Optimization Chap. 1 (but never with functions that can take both values and ). A function f mapping a convex set C R n into (, ], is also called convex if the condition f ( αx + (1 α)y ) αf(x) + (1 α)f(y), x, y C, α [0, 1] holds. It can again be seen that if f is convex, the level sets {x C f(x) γ} and {x C f(x) < γ} are convex for all scalars γ. The effective domain of such a convex function f is the convex set dom(f) = { x C f(x) < }. By replacing the domain of an extended real-valued convex function with its effective domain, we can convert it to a real-valued function. In this way, we can use results stated in terms of real-valued functions, and we can also avoid calculations with. Thus, the entire subject of convex functions can be developed without resorting to extended real-valued functions. The reverse is also true, namely that extended real-valued functions can be adopted as the norm; for example, the classical treatment of Rockafellar [Roc70] uses this approach. We will adopt a flexible approach, generally preferring to avoid extended real-valued functions, unless there are strong notational or other advantages for doing so. An extended real-valued function f : X (, ] is called lower semicontinuous at a vector x X if f(x) lim inf k f(x k ) for every sequence {x k } converging to x. This definition is consistent with the corresponding definition for real-valued functions [cf. Def (c)]. If f is lower semicontinuous at every x in a subset U X, we say that f is lower semicontinuous over U. The epigraph of a function f : X (, ], where X R n, is the subset of R n+1 given by epi(f) = { (x, w) x X, w R, f(x) w } ; (see Fig ). Note that if we restrict f to its effective domain { x X f(x) < }, so that it becomes real-valued, the epigraph remains unaffected. Epigraphs are useful for our purposes because of the following proposition, which shows that questions about convexity and lower semicontinuity of functions can be reduced to corresponding questions of convexity and closure of their epigraphs.
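Before turning to the epigraph characterization, the defining inequality (1.3) can be tested numerically by random sampling. The following informal Python sketch does this; note that such a test can only detect a violation (and thus certify nonconvexity), never prove convexity. It flags the function √|x| mentioned above, which has convex level sets but is not convex.

```python
import numpy as np

def is_convex_on_samples(f, dim, trials=20000, seed=0, box=5.0):
    """Randomized check of f(a x + (1-a) y) <= a f(x) + (1-a) f(y);
       returns False as soon as a violation is found, so a True result
       only means that no violation was detected."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x = rng.uniform(-box, box, dim)
        y = rng.uniform(-box, box, dim)
        a = rng.uniform(0.0, 1.0)
        if f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + 1e-10:
            return False
    return True

convex_f = lambda x: np.linalg.norm(x) ** 2          # convex
nonconvex_f = lambda x: np.sqrt(np.abs(x).sum())     # convex level sets, not convex

print(is_convex_on_samples(convex_f, 2))      # True: no violation found
print(is_convex_on_samples(nonconvex_f, 1))   # False: a violating triple is found
```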

35 Sec. 1.2 Convex Sets and Functions 29 f(x) Epigraph f(x) Epigraph Convex function x Nonconvex function x Figure Illustration of the epigraph of a convex function and a nonconvex function f : X (, ]. Proposition 1.2.2: Let f : X (, ] be a function. Then: (a) epi(f) is convex if and only if the set X is convex and f is convex over X. (b) Assuming X = R n, the following are equivalent: (i) epi(f) is closed. (ii) f is lower semicontinuous over R n. (iii) The level sets {x f(x) γ} are closed for all scalars γ. Proof: (a) Assume that X is convex and f is convex over X. If (x 1, w 1 ) and (x 2, w 2 ) belong to epi(f) and α [0, 1], we have f(x 1 ) w 1, f(x 2 ) w 2, and by multiplying these inequalities with α and (1 α), respectively, by adding, and by using the convexity of f, we obtain f ( αx 1 + (1 α)x 2 ) αf(x1 ) + (1 α)f(x 2 ) αw 1 + (1 α)w 2. Hence the vector ( αx 1 + (1 α)x 2, αw 1 + (1 α)w 2 ), which is equal to α(x 1, w 1 ) + (1 α)(x 2, w 2 ), belongs to epi(f), showing the convexity of epi(f). Conversely, assume that epi(f) is convex, and let x 1, x 2 X and α [0, 1]. The pairs ( x 1, f(x 1 ) ) and ( x 2, f(x 2 ) ) belong to epi(f), so by convexity, we have ( αx1 + (1 α)x 2, αf(x 1 ) + (1 α)f(x 2 ) ) epi(f).

Therefore, by the definition of epi(f), it follows that αx_1 + (1 − α)x_2 ∈ X, so X is convex, while

f(αx_1 + (1 − α)x_2) ≤ αf(x_1) + (1 − α)f(x_2),

so f is convex over X.

(b) We first show that (i) and (ii) are equivalent. Assume that f is lower semicontinuous over R^n, and let (x, w) be the limit of a sequence {(x_k, w_k)} ⊂ epi(f). We have f(x_k) ≤ w_k, and by taking the limit as k → ∞ and by using the lower semicontinuity of f at x, we obtain f(x) ≤ lim inf_{k→∞} f(x_k) ≤ w. Hence (x, w) ∈ epi(f) and epi(f) is closed.

Conversely, assume that epi(f) is closed, choose any x ∈ R^n, let {x_k} be a sequence converging to x, and let w = lim inf_{k→∞} f(x_k). We will show that f(x) ≤ w. Indeed, if w = ∞, we have f(x) ≤ w. If w < ∞, for each positive integer n, let w_n = w + 1/n, and let k(n) be an integer such that k(n) ≥ n and f(x_{k(n)}) ≤ w_n. The sequence {(x_{k(n)}, w_n)} belongs to epi(f) and converges to (x, w), so by the closure of epi(f), we must have f(x) ≤ w. Thus, f is lower semicontinuous at x.

We next show that (i) implies (iii), and that (iii) implies (ii). Assume that epi(f) is closed and let {x_k} be a sequence that converges to some x and belongs to the level set {z | f(z) ≤ γ}, where γ is a scalar. Then (x_k, γ) ∈ epi(f) for all k, and by closure of epi(f), we have (x, γ) ∈ epi(f). Hence x belongs to the level set {x | f(x) ≤ γ}, implying that this set is closed. Therefore (i) implies (iii).

Finally, assume that the level sets {x | f(x) ≤ γ} are closed, fix an x, and let {x_k} be a sequence converging to x. If lim inf_{k→∞} f(x_k) < ∞, then for each γ with lim inf_{k→∞} f(x_k) < γ, we have f(x_k) ≤ γ for infinitely many k. From the closure of the level sets {x | f(x) ≤ γ}, it follows that x belongs to all the level sets with lim inf_{k→∞} f(x_k) < γ, implying that f(x) ≤ lim inf_{k→∞} f(x_k), and that f is lower semicontinuous at x. Therefore, (iii) implies (ii). Q.E.D.

If the epigraph of a function f : X → (−∞, ∞] is a closed set, we say that f is a closed function. Thus, if we extend the domain of f to R^n and consider the function f̃ given by

f̃(x) = f(x) if x ∈ X, and f̃(x) = ∞ if x ∉ X,

we see that according to the preceding proposition, f̃ is closed if and only if f̃ is lower semicontinuous over R^n.

Common examples of convex functions are affine functions and norms; this is straightforward to verify, using the definition of convexity. For example, for any x, y ∈ R^n and any α ∈ [0, 1], we have by using the triangle inequality,

‖αx + (1 − α)y‖ ≤ ‖αx‖ + ‖(1 − α)y‖ = α‖x‖ + (1 − α)‖y‖,

37 Sec. 1.2 Convex Sets and Functions 31 so the norm function is convex. The following proposition provides some means for recognizing convex functions, and gives some algebraic operations that preserve convexity of a function. Proposition 1.2.3: (a) Let f 1,..., f m : R n (, ] be given functions, let λ 1,..., λ m be positive scalars, and consider the function g : R n (, ] given by g(x) = λ 1 f 1 (x) + + λ m f m (x). If f 1,..., f m are convex, then g is also convex, while if f 1,..., f m are closed, then g is also closed. (b) Let f : R n (, ] be a given function, let A be an m n matrix, and consider the function g : R n (, ] given by g(x) = f(ax). If f is convex, then g is also convex, while if f is closed, then g is also closed. (c) Let f i : R n (, ] be given functions for i I, where I is an index set, and consider the function g : R n (, ] given by g(x) = sup f i (x). i I If f i, i I, are convex, then g is also convex, while if f i, i I, are closed, then g is also closed. Proof: (a) Let f 1,..., f m be convex. We use the definition of convexity to write for any x, y R n and α [0,1], f ( αx + (1 α)y ) m ( ) = λ i f i αx + (1 α)y i=1 m ( λ i αfi (x) + (1 α)f i (y) ) i=1 m m = α λ i f i (x) + (1 α) λ i f i (y) i=1 = αf(x) + (1 α)f(y). Hence f is convex. Let f 1,..., f m be closed. Then the f i are lower semicontinuous at every x R n [cf. Prop (b)], so for every sequence {x k } converging to i=1

38 32 Convex Analysis and Optimization Chap. 1 x, we have f i (x) lim inf k f i (x k ) for all i. Hence g(x) m i=1 λ i lim inf f i(x k ) lim inf k k m i=1 λ i f i (x k ) = lim inf g(x k). k where we have used Prop (d) (the sum of the limit inferiors of sequences is less or equal to the limit inferior of the sum sequence). Therefore, g is lower semicontinuous at all x R n, so by Prop (b), it is closed. (b) This is straightforward, along the lines of the proof of part (a). (c) A pair (x, w) belongs to the epigraph epi(g) = { (x, w) g(x) w } if and only if f i (x) w for all i I, or (x, w) i I epi(f i ). Therefore, epi(g) = i I epi(f i ). If the f i are convex, the epigraphs epi(f i ) are convex, so epi(g) is convex, and g is convex. If the f i are closed, then the epigraphs epi(f i ) are closed, so epi(g) is closed, and g is closed. Q.E.D. Characterizations of Differentiable Convex Functions For differentiable functions, there is an alternative characterization of convexity, given in the following proposition and illustrated in Fig Proposition 1.2.4: Let C R n be a convex set and let f : R n R be differentiable over R n. (a) f is convex over C if and only if f(z) f(x) + (z x) f(x), x, z C. (1.4) (b) f is strictly convex over C if and only if the above inequality is strict whenever x z. Proof: We prove (a) and (b) simultaneously. Assume that the inequality (1.4) holds. Choose any x, y C and α [0,1], and let z = αx + (1 α)y. Using the inequality (1.4) twice, we obtain f(x) f(z) + (x z) f(z),

39 Sec. 1.2 Convex Sets and Functions 33 f(z) f(x) + (z - x)' f(x) x z Figure Characterization of convexity in terms of first derivatives. The condition f(z) f(x) + (z x) f(x) states that a linear approximation, based on the first order Taylor series expansion, underestimates a convex function. f(y) f(z) + (y z) f(z). We multiply the first inequality by α, the second by (1 α), and add them to obtain αf(x) + (1 α)f(y) f(z) + ( αx + (1 α)y z ) f(z) = f(z), which proves that f is convex. If the inequality (1.4) is strict as stated in part (b), then if we take x y and α (0, 1) above, the three preceding inequalities become strict, thus showing the strict convexity of f. Conversely, assume that f is convex, let x and z be any vectors in C with x z, and for α (0, 1), consider the function g(α) = f( x + α(z x) ) f(x), α (0, 1]. α We will show that g(α) is monotonically decreasing with α, and is strictly monotonically decreasing if f is strictly convex. This will imply that (z x) f(x) = lim α 0 g(a) g(1) = f(z) f(x), with strict inequality if g is strictly monotonically decreasing, thereby showing that the desired inequality (1.4) holds (and holds strictly if f is strictly convex). Indeed, consider any α 1, α 2, with 0 < α 1 < α 2 < 1, and let α = α 1 α 2, z = x + α 2 (z x). (1.5) We have f ( x + α(z x) ) αf(z) + (1 α)f(x),

40 34 Convex Analysis and Optimization Chap. 1 or f ( x + α(z x) ) f(x) α f(z) f(x), (1.6) and the above inequalities are strict if f is strictly convex. Substituting the definitions (1.5) in Eq. (1.6), we obtain after a straightforward calculation f ( x + α 1 (z x) ) f(x) α 1 f ( x + α2 (z x) ) f(x), α 2 or g(α 1 ) g(α 2 ), with strict inequality if f is strictly convex. Hence g is monotonically decreasing with α, and strictly so if f is strictly convex. Q.E.D. Note a simple consequence of Prop (a): if f : R n R is a convex function and f(x ) = 0, then x minimizes f over R n. This is a classical sufficient condition for unconstrained optimality, originally formulated (in one dimension) by Fermat in For twice differentiable convex functions, there is another characterization of convexity as shown by the following proposition. Proposition 1.2.5: Let C R n be a convex set and let f : R n R be twice continuously differentiable over R n. (a) If 2 f(x) is positive semidefinite for all x C, then f is convex over C. (b) If 2 f(x) is positive definite for all x C, then f is strictly convex over C. (c) If C = R n and f is convex, then 2 f(x) is positive semidefinite for all x. Proof: (a) By Prop (b), for all x, y C we have f(y) = f(x) + (y x) f(x) (y x) 2 f ( x + α(y x) ) (y x) for some α [0, 1]. Therefore, using the positive semidefiniteness of 2 f, we obtain f(y) f(x) + (y x) f(x), x, y C. From Prop (a), we conclude that f is convex. (b) Similar to the proof of part (a), we have f(y) > f(x) + (y x) f(x) for all x, y C with x y, and the result follows from Prop (b).

41 Sec. 1.2 Convex Sets and Functions 35 (c) Suppose that f : R n R is convex and suppose, to obtain a contradiction, that there exist some x R n and some z R n such that z 2 f(x)z < 0. Using the continuity of 2 f, we see that we can choose the norm of z to be small enough so that z 2 f(x+αz)z < 0 for every α [0, 1]. Then, using again Prop (b), we obtain f(x + z) < f(x) + z f(x), which, in view of Prop (a), contradicts the convexity of f. Q.E.D. If f is convex over a strict subset C R n, it is not necessarily true that 2 f(x) is positive semidefinite at any point of C [take for example n = 2, C = {(x 1,0) x 1 R}, and f(x) = x 2 1 x2 2 ]. A strengthened version of Prop is given in the exercises. It can be shown that the conclusion of Prop (c) also holds if C is assumed to have nonempty interior instead of being equal to R n. The following proposition considers a strengthened form of strict convexity characterized by the following equation: ( ) (x f(x) f(y) y) α x y 2, x, y R n, (1.7) where α is some positive number. Convex functions with this property are called strongly convex with coefficient α. Proposition 1.2.6: (Strong Convexity) Let f : R n R be smooth. If f is strongly convex with coefficient α, then f is strictly convex. Furthermore, if f is twice continuously differentiable, then strong convexity of f with coefficient α is equivalent to the positive semidefiniteness of 2 f(x) αi for every x R n, where I is the identity matrix. Proof: Fix some x, y R n such that x y, and define the function h : R R by h(t) = f ( x + t(y x) ). Consider some t, t R such that t < t. Using the chain rule and Eq. (1.7), we have ( dh dt (t ) dh ) dt (t) (t t) ( = f ( x + t (y x) ) f ( x + t(y x) )) (y x)(t t) α(t t) 2 x y 2 > 0. Thus, dh/dt is strictly increasing and for any t (0, 1), we have h(t) h(0) t = 1 t t 0 dh 1 (τ) dτ < dτ 1 t 1 t dh dτ h(1) h(t) (τ) dτ =. 1 t Equivalently, th(1) + (1 t)h(0) > h(t). The definition of h yields tf(y) + (1 t)f(x) > f ( ty + (1 t)x ). Since this inequality has been proved for arbitrary t (0, 1) and x y, we conclude that f is strictly convex.

42 36 Convex Analysis and Optimization Chap. 1 Suppose now that f is twice continuously differentiable and Eq. (1.7) holds. Let c be a scalar. We use Prop (b) twice to obtain f(x + cy) = f(x) + cy f(x) + c2 2 y 2 f(x + tcy)y, and f(x) = f(x + cy) cy f(x + cy) + c2 2 y 2 f(x + scy)y, for some t and s belonging to [0,1]. Adding these two equations and using Eq. (1.7), we obtain c 2 2 y ( 2 f(x+scy)+ 2 f(x+tcy) ) y = ( f(x+cy) f(x) ) (cy) αc2 y 2. We divide both sides by c 2 and then take the limit as c 0 to conclude that y 2 f(x)y α y 2. Since this inequality is valid for every y R n, it follows that 2 f(x) αi is positive semidefinite. For the converse, assume that 2 f(x) αi is positive semidefinite for all x R n. Consider the function g : R R defined by g(t) = f ( tx + (1 t)y ) (x y). Using the Mean Value Theorem (Prop ), we have ( f(x) f(y) ) (x y) = g(1) g(0) = (dg/dt)(t) for some t [0, 1]. The result follows because dg dt (t) = (x y) 2 f ( tx + (1 t)y ) (x y) α x y 2, where the last inequality is a consequence of the positive semidefiniteness of 2 f ( tx + (1 t)y ) αi. Q.E.D. As an example, consider the quadratic function f(x) = x Qx, where Q is a symmetric matrix. By Prop , the function f is convex if and only if Q is positive semidefinite. Furthermore, by Prop , f is strongly convex with coefficient α if and only if 2 f(x) αi = 2Q αi is positive semidefinite for some α > 0. Thus f is strongly convex with some positive coefficient (as well as strictly convex) if and only if Q is positive definite.
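This eigenvalue test is easy to carry out numerically. The following minimal sketch (the helper name, the tolerance, and the matrices Q below are our own illustrative choices, not part of the text) classifies f(x) = x′Qx as convex, strictly convex, or strongly convex with a given coefficient α by inspecting the eigenvalues of the Hessian 2Q.

```python
import numpy as np

def classify_quadratic(Q, alpha=None):
    """Classify f(x) = x'Qx using the eigenvalues of its Hessian 2Q.

    Returns a dict indicating convexity, strict convexity, and
    (if alpha is given) strong convexity with coefficient alpha,
    i.e., whether 2Q - alpha*I is positive semidefinite.
    """
    Q = np.asarray(Q, dtype=float)
    H = 2.0 * Q                              # Hessian of x'Qx
    eig = np.linalg.eigvalsh((H + H.T) / 2)  # symmetrize for numerical safety
    out = {
        "convex": bool(eig.min() >= -1e-12),         # 2Q PSD  <=>  Q PSD
        "strictly_convex": bool(eig.min() > 1e-12),  # 2Q PD   <=>  Q PD
    }
    if alpha is not None:
        out["strongly_convex_with_alpha"] = bool(eig.min() >= alpha - 1e-12)
    return out

if __name__ == "__main__":
    Q_pd = np.array([[2.0, 0.5], [0.5, 1.0]])    # positive definite example
    Q_psd = np.array([[1.0, 1.0], [1.0, 1.0]])   # positive semidefinite, singular
    print(classify_quadratic(Q_pd, alpha=1.0))
    print(classify_quadratic(Q_psd, alpha=1.0))
```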

43 ( Sec. 1.2 Convex Sets and Functions Convex and Affine Hulls Let X be a subset of R n. A convex combination of elements of X is a vector of the form m i=1 α ix i, where m is a positive integer, x 1,..., x m belong to X, and α 1,..., α m are scalars such that α i 0, i = 1,..., m, m α i = 1. i=1 Note that if X is convex, then the convex combination m i=1 α ix i belongs to X (this is easily shown by induction; see the exercises), and for any function f : R n R that is convex over X, we have f ) m α i x i i=1 m α i f(x i ). (1.8) This follows by using repeatedly the definition of convexity. The preceding relation is a special case of Jensen s inequality and can be used to prove a number of interesting inequalities in applied mathematics and probability theory. The convex hull of a set X, denoted conv(x), is the intersection of all convex sets containing X, and is a convex set by Prop (a). It is straightforward to verify that the set of all convex combinations of elements of X is convex, and is equal to the convex hull conv(x) (see the exercises). In particular, if X consists of a finite number of vectors x 1,..., x m, its convex hull is conv ( {x 1,..., x m } ) { m } m = α i x i αi 0, i = 1,..., m, α i = 1. i=1 We recall that an affine set M is a set of the form x + S, where S is a subspace, called the subspace parallel to M. If X is a subset of R n, the affine hull of X, denoted aff(x), is the intersection of all affine sets containing X. Note that aff(x) is itself an affine set and that it contains conv(x). It can be seen that the affine hull of X, the affine hull of the convex hull conv(x), and the affine hull of the closure cl(x) coincide (see the exercises). For a convex set C, the dimension of C is defined to be the dimension of aff(c). Given a subset X R n, a nonnegative combination of elements of X is a vector of the form m i=1 α ix i, where m is a positive integer, x 1,..., x m belong to X, and α 1,..., α m are nonnegative scalars. If the scalars α i are all positive, the combination m i=1 α ix i is said to be positive. The cone generated by X, denoted by cone(x), is the set of nonnegative combinations of elements of X. It is easily seen that cone(x) is a convex cone, although i=1 i=1

44 38 Convex Analysis and Optimization Chap. 1 it need not be closed [cone(x) can be shown to be closed in special cases, such as when X is a finite set this is one of the central results of polyhedral convexity and will be shown in Section 1.6]. The following is a fundamental characterization of convex hulls. Proposition 1.2.7: (Caratheodory s Theorem) Let X be a subset of R n. (a) Every x in conv(x) can be represented as a convex combination of vectors x 1,..., x m X such that x 2 x 1,..., x m x 1 are linearly independent, where m is a positive integer with m n + 1. (b) Every x in cone(x) can be represented as a positive combination of vectors x 1,..., x m X that are linearly independent, where m is a positive integer with m n. Proof: (a) Let x be a vector in the convex hull of X, and let m be the smallest integer such that x has the form m i=1 α ix i, where m i=1 α i = 1, α i > 0, and x i X for all i = 1,..., m. The m 1 vectors x 2 x 1,..., x m x 1 belong to the subspace parallel to aff(x). Assume, to arrive at a contradiction, that these vectors are linearly dependent. Then, there must exist scalars λ 2,..., λ m at least one of which is positive, such that m λ i (x i x 1 ) = 0. i=2 Letting µ i = λ i for i = 2,..., m and µ 1 = m i=2 λ i, we see that m µ i x i = 0, i=1 m µ i = 0, i=1 while at least one of the scalars µ 2,..., µ m is positive. Define α i = α i γµ i, i = 1,..., m, where γ > 0 is the largest γ such that α i γµ i 0 for all i. Then, since m i=1 µ ix i = 0, we see that x is also represented as m i=1 α ix i. Furthermore, in view of the choice of γ and the fact m i=1 µ i = 0, the coefficients α i are nonnegative, sum to one, and at least one of them is zero. Thus, x can be represented as a convex combination of fewer than m vectors of X, contradicting our earlier assumption. It follows that the vectors x 2 x 1,..., x m x 1 must be linearly independent, so that their number must be at most n. Hence m n + 1.

45 Sec. 1.2 Convex Sets and Functions 39 (b) Let x be a nonzero vector in the cone(x), and let m be the smallest integer such that x has the form m i=1 α ix i, where α i > 0 and x i X for all i = 1,..., m. If the vectors x i were linearly dependent, there would exist scalars λ 1,..., λ m, with m i=1 λ ix i = 0 and at least one of the λ i is positive. Consider the linear combination m i=1 (α i γλ i )x i, where γ is the largest γ such that α i γλ i 0 for all i. This combination provides a representation of x as a positive combination of fewer than m vectors of X a contradiction. Since any linearly independent set of vectors contains at most n elements, we must have m n. Q.E.D. It is not generally true that the convex hull of a closed set is closed [take for instance the convex hull of the set consisting of the origin and the subset {(x 1, x 2 ) x 1 x 2 = 1, x 1 0, x 2 0} of R 2 ]. We have, however, the following. Proposition 1.2.8: The convex hull of a compact set is compact. Proof: Let X be a compact subset of R n.{ By Caratheodory s Theorem, n+1 } a sequence in conv(x) can be expressed as, where for all k and i, α k i 0, xk i X, and n+1 i=1 αk i i=1 αk i xk i = 1. Since the sequence { (α k 1,..., α k n+1, xk 1,..., xk n+1 )} belongs to a compact set, it has a limit point { (α 1,..., α n+1, x 1,..., x n+1 ) } such that n+1 i=1 α i = 1, and for all i, α i 0, and x i X. Thus, the vector n+1 { i=1 α ix i, which belongs to conv(x), is a limit point of the sequence n+1 }, showing that conv(x) is compact. Q.E.D. i=1 αk i xk i Closure, Relative Interior, and Continuity We now consider some generic topological properties of convex sets and functions. Let C be a nonempty convex subset of R n. The closure cl(c) of C is also a nonempty convex set (Prop ). While the interior of C may be empty, it turns out that convexity implies the existence of interior points relative to the affine hull of C. This is an important property, which we now formalize. We say that x is a relative interior point of C, if x C and there exists a neighborhood N of x such that N aff(c) C, i.e., x is an interior point of C relative to aff(c). The relative interior of C, denoted ri(c), is the set of all relative interior points of C. For example, if C is a line segment connecting two distinct points in the plane, then ri(c) consists of all points of C except for the end points.
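Before turning to relative interiors in more detail, here is a numerical illustration of Caratheodory's Theorem (Prop. 1.2.7). The sketch below (the data and helper name are arbitrary choices) recovers convex-combination weights for a point of conv(X) by solving a feasibility linear program; when the LP is solved by a simplex-type method, the optimal weights form a basic solution, so at most n + 1 of them are nonzero, in agreement with the theorem.

```python
import numpy as np
from scipy.optimize import linprog

def convex_combination_weights(points, x):
    """Find weights a >= 0 with sum(a) = 1 and sum_i a_i * points[i] = x.

    points: (m, n) array of vectors x_1, ..., x_m in R^n.
    x:      target point, assumed to lie in conv(points).
    A simplex-type solver returns a basic solution, so at most n + 1
    of the weights are nonzero (Caratheodory's Theorem).
    """
    points = np.asarray(points, dtype=float)
    m, n = points.shape
    # Equality constraints:  points^T a = x  and  1^T a = 1.
    A_eq = np.vstack([points.T, np.ones((1, m))])
    b_eq = np.append(np.asarray(x, dtype=float), 1.0)
    res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m, method="highs-ds")
    if not res.success:
        raise ValueError("x does not appear to lie in conv(points)")
    return res.x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 3))          # 10 points in R^3
    w_true = rng.dirichlet(np.ones(10))   # a random convex combination
    x = w_true @ X
    a = convex_combination_weights(X, x)
    print("reconstruction error:", np.linalg.norm(a @ X - x))
    print("nonzero weights:", np.count_nonzero(a > 1e-9), "(at most n+1 = 4 expected)")
```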

46 40 Convex Analysis and Optimization Chap. 1 The following proposition gives some basic facts about relative interior points. Proposition 1.2.9: Let C be a nonempty convex set. (a) (Line Segment Principle) If x ri(c) and x cl(c), then all points on the line segment connecting x and x, except possibly x, belong to ri(c). (b) (Nonemptiness of Relative Interior) ri(c) is a nonempty and convex set, and has the same affine hull as C. In fact, if m is the dimension of aff(c) and m > 0, there exist vectors x 0, x 1,..., x m ri(c) such that x 1 x 0,..., x m x 0 span the subspace parallel to aff(c). (c) x ri(c) if and only if every line segment in C having x as one endpoint can be prolonged beyond x without leaving C [i.e., for every x C, there exists a γ > 1 such that x+(γ 1)(x x) C]. Proof: (a) In the case where x C, see Fig In the case where x / C, to show that for any α (0,1] we have x α = αx+(1 α)x ri(c), consider a sequence {x k } C that converges to x, and let x k,α = αx+(1 α)x k. Then as in Fig , we see that {z z x k,α < αɛ} aff(c) C for all k. Since for large enough k, we have {z z x α < αɛ/2} {z z x k,α < αɛ}, it follows that {z z x α < αɛ/2} aff(c) C, which shows that x α ri(c). (b) Convexity of ri(c) follows from the line segment principle of part (a). By using a translation argument if necessary, we assume without loss of generality that 0 C. Then, the affine hull of C is a subspace of dimension m. If m = 0, then C and aff(c) consist of a single point, which is a unique relative interior point. If m > 0, we can find m linearly independent vectors z 1,..., z m in C that span aff(c); otherwise there would exist r < m linearly independent vectors in C whose span contains C, contradicting the fact that the dimension of aff(c) is m. Thus z 1,..., z m form a basis for aff(c). Consider the set X = { x x = m α i z i, i=1 } m α i < 1, α i > 0, i = 1,..., m i=1 (see Fig ). This set is open relative to aff(c); that is, for every x X, there exists an open set N such that x N and N aff(c) X. [To see

47 Sec. 1.2 Convex Sets and Functions 41 x α = αx + (1 - α)x x x ε αε S α S C Figure Proof of the line segment principle for the case where x C. Since x ri(c), there exists a sphere S = {z z x < ɛ} such that S aff(c) C. For all α (0,1], let x α = αx + (1 α)x and let S α = {z z x α < αɛ}. It can be seen that each point of S α aff(c) is a convex combination of x and some point of S aff(c). Therefore, S α aff(c) C, implying that x α ri(c). this, note that X is the inverse image of the open set in R m { } m (α 1,..., α m ) α i < 1, α i > 0, i = 1,..., m i=1 under the linear transformation from aff(c) to R m that maps m i=1 α iz i into (α 1,..., α m ); openness of the above set follows by continuity of linear transformation.] Therefore all points of X are relative interior points of C, and ri(c) is nonempty. Since by construction, aff(x) = aff(c) and X ri(c), it follows that ri(c) and C have the same affine hull. To show the last assertion of part (b), consider vectors x 0 = α m z i, x i = x 0 + αz i, i = 1,..., m, i=1 where α is a positive scalar such that α(m+1) < 1. The vectors x 0,..., x m are in the set X and in the relative interior of C, since X ri(c). Furthermore, because x i x 0 = αz i for all i and vectors z 1,..., z m span aff(c), the vectors x 1 x 0,..., x m x 0 also span aff(c). (c) If x ri(c) the condition given clearly holds. Conversely, let x satisfy the given condition. We will show that x ri(c). By part (b), there exists a vector x ri(c). We may assume that x x, since otherwise we are done. By the given condition, since x is in C, there is a γ > 1 such that y = x + (γ 1)(x x) C. Then we have x = (1 α)x + αy, where α = 1/γ (0, 1), so by part (a), we obtain x ri(c). Q.E.D.
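For a polytope C = conv{v_0, ..., v_k}, the facts above can be made concrete: the dimension of aff(C) is the rank of the matrix of differences v_i − v_0, and any convex combination with all weights strictly positive (for instance, the barycenter) satisfies the prolongation condition of part (c) and hence lies in ri(C). A small illustrative sketch follows (the data and helper names are our own choices).

```python
import numpy as np

def affine_dimension(vertices):
    """Dimension of aff(conv{v_0, ..., v_k}) = rank of [v_1 - v_0, ..., v_k - v_0]."""
    V = np.asarray(vertices, dtype=float)
    if len(V) <= 1:
        return 0
    return int(np.linalg.matrix_rank(V[1:] - V[0]))

def barycenter(vertices):
    """A relative interior point of conv{v_0, ..., v_k}: the equal-weight combination.

    A convex combination with all weights strictly positive satisfies the
    prolongation condition of Prop. 1.2.9(c), hence lies in the relative interior.
    """
    V = np.asarray(vertices, dtype=float)
    return V.mean(axis=0)

if __name__ == "__main__":
    # A "flat" triangle living in the plane x3 = 0 inside R^3:
    # its interior is empty, but its relative interior is not.
    triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    print("dim aff(C) =", affine_dimension(triangle))   # 2, not 3
    print("a point of ri(C):", barycenter(triangle))    # (1/3, 1/3, 0)
```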

48 42 Convex Analysis and Optimization Chap. 1 X x 2 0 C x 1 Figure Construction of the relatively open set X in the proof of nonemptiness of the relative interior of a convex set C that contains the origin. We choose m linearly independent vectors z 1,..., z m C, where m is the dimension of aff(c), and let { m X = α i z i i=1 i=1 m α i < 1, α i > 0, i = 1,..., m }. In view of Prop (b), C and ri(c) all have the same dimension. It can also be shown that C and cl(c) have the same dimension (see the exercises). The next proposition gives several properties of closures and relative interiors of convex sets. Proposition : Let C be a nonempty convex set. (a) cl(c) = cl ( ri(c) ). (b) ri(c) = ri ( cl(c) ). (c) Let C be another nonempty convex set. Then the following three conditions are equivalent: (i) C and C have the same relative interior. (ii) C and C have the same closure. (iii) ri(c) C cl(c). (d) A ri(c) = ri(a C) for all m n matrices A. (e) If C is bounded, then A cl(c) = cl(a C) for all m n matrices A. Proof: (a) Since ri(c) C, we have cl ( ri(c) ) cl(c). Conversely, let x cl(c). We will show that x cl ( ri(c) ). Let x be any point in ri(c)

49 Sec. 1.2 Convex Sets and Functions 43 [there exists such a point by Prop (b)], and assume that that x x (otherwise we are done). By the line segment principle [Prop (a)], we have αx + (1 α)x ri(c) for all α (0,1]. Thus, x is the limit of the sequence { (1/k)x + (1 1/k)x k 1 } that lies in ri(c), so x cl ( ri(c) ). (b) Since C cl(c), we must have ri(c) ri ( cl(c) ). To prove the reverse inclusion, let z ri ( cl(c) ). We will show that z ri(c). By Prop (b), there exists an x ri(c). We may assume that x z (otherwise we are done). We choose γ > 1, with γ sufficiently close to 1 so that the vector y = z + (γ 1)(z x) belongs to ri ( cl(c) ) [cf. Prop (c)], and hence also to cl(c). Then we have z = (1 α)x + αy where α = 1/γ (0,1), so by the line segment principle [Prop (a)], we obtain z ri(c). (c) If ri(c) = ri(c), part (a) implies that cl(c) = cl(c). Similarly, if cl(c) = cl(c), part (b) implies that ri(c) = ri(c). Furthermore, if these conditions hold the relation ri(c) C cl(c) implies condition (iii). Finally, assume that condition (iii) holds. Then by taking closures, we have cl ( ri(c) ) cl(c) cl(c), and by using part (a), we obtain cl(c) cl(c) cl(c). Hence C and C have the same closure. (d) For any set X, we have A cl(x) cl(a X), since if a sequence {x k } X converges to some x cl(x) then the sequence {Ax k } A X converges to Ax, implying that Ax cl(a X). We use this fact and part (a) to write A ri(c) A C A cl(c) = A cl ( ri(c) ) cl ( A ri(c) ). Thus A C lies between the set A ri(c) and the closure of that set, implying that the relative interiors of the sets A C and A ri(c) are equal [part (c)]. Hence ri(a C) A ri(c). We will show the reverse inclusion by taking any z A ri(c) and showing that z ri(a C). Let x be any vector in A C, and let z ri(c) and x C be such that Az = z and Ax = x. By part Prop (c), there exists γ > 1 such that the vector y = z+(γ 1)(z x) belongs to C. Thus we have Ay A C and Ay = z + (γ 1)(z x), so by Prop (c) it follows that z ri(a C). (e) By the argument given in part (d), we have A cl(c) cl(a C). To show the converse, choose any x cl(a C). Then, there exists a sequence {x k } C such that Ax k x. Since C is bounded, {x k } has a subsequence that converges to some x cl(c), and we must have Ax = x. It follows that x A cl(c). Q.E.D. Note that if C is closed but unbounded, the set A C need not be closed [cf. part (e) of the above proposition]. For example take the closed set C = { (x 1, x 2 ) x 1 x 2 1, x 1 0, x 2 0 } and let A have the effect of projecting the typical vector x on the horizontal axis, i.e., A(x 1, x 2 ) = (x 1, 0). Then A C is the (nonclosed) halfline { (x 1, x 2 ) x 1 > 0, x 2 = 0 }.
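The example just given is easy to reproduce numerically. The following sketch (illustrative only) generates points of the closed convex set C = {(x1, x2) | x1 x2 ≥ 1, x1 ≥ 0, x2 ≥ 0} whose images under the projection A(x1, x2) = (x1, 0) approach the origin, while the origin itself has no preimage in C; hence A·C is not closed.

```python
import numpy as np

# C = {(x1, x2) : x1 * x2 >= 1, x1 >= 0, x2 >= 0} is closed and convex.
# A projects onto the horizontal axis: A(x1, x2) = (x1, 0).
A = np.array([[1.0, 0.0],
              [0.0, 0.0]])

# Points on the boundary of C with x1 -> 0: (1/k, k) for k = 1, 2, ...
ks = np.arange(1, 8)
boundary_points = np.column_stack([1.0 / ks, ks.astype(float)])

images = boundary_points @ A.T
print("images A*x_k (first coordinate shrinks toward 0):")
print(images)

# The limit (0, 0) of these images is NOT in A*C: any x in C with A*x = (0, 0)
# would need x1 = 0, which is incompatible with x1 * x2 >= 1.
print("is the limit point (0, 0) attained by some x in C?  No: x1 = 0 forces x1*x2 = 0 < 1.")
```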

50 44 Convex Analysis and Optimization Chap. 1 Proposition : Let C 1 and C 2 be nonempty convex sets: (a) Assume that the sets ri(c 1 ) and ri(c 2 ) have a nonempty intersection. Then cl(c 1 C 2 ) = cl(c 1 ) cl(c 2 ), ri(c 1 C 2 ) = ri(c 1 ) ri(c 2 ). (b) ri(c 1 + C 2 ) = ri(c 1 ) + ri(c 2 ). Proof: (a) Let y cl(c 1 ) cl(c 2 ). If x ri(c 1 ) ri(c 2 ), by the line segment principle [Prop (a)], the vector αx + (1 α)y belongs to ri(c 1 ) ri(c 2 ) for all α (0,1]. Hence y is the limit of a sequence α k x+(1 α k )y ri(c 1 ) ri(c 2 ) with α k 0, implying that y cl ( ri(c 1 ) ri(c 2 ) ). Hence, we have cl(c 1 ) cl(c 2 ) cl ( ri(c 1 ) ri(c 2 ) ) cl(c 1 C 2 ). Also C 1 C 2 is contained in cl(c 1 ) cl(c 2 ), which is a closed set, so we have cl(c 1 C 2 ) cl(c 1 ) cl(c 2 ). Thus, equality holds throughout in the preceding two relations, so that cl(c 1 C 2 ) = cl(c 1 ) cl(c 2 ). Furthermore, the sets ri(c 1 ) ri(c 2 ) and C 1 C 2 have the same closure. Therefore, by Prop (c), they have the same relative interior, implying that ri(c 1 C 2 ) ri(c 1 ) ri(c 2 ). To show the converse, take any x ri(c 1 ) ri(c 2 ) and any y C 1 C 2. By Prop (c), the line segment connecting x and y can be prolonged beyond x by a small amount without leaving C 1 C 2. By the same proposition, it follows that x ri(c 1 C 2 ). (b) Consider the linear transformation A : R 2n R n given by A(x 1, x 2 ) = x 1 + x 2 for all x 1, x 2 R n. The relative interior of the Cartesian product C 1 C 2 (viewed as a subset of R 2n ) is easily seen to be ri(c 1 ) ri(c 2 ). Since A(C 1 C 2 ) = C 1 + C 2, the result follows from Prop (d). Q.E.D. The requirement that ri(c 1 ) ri(c 2 ) Ø is essential in part (a) of the above proposition. As an example, consider the subsets of the real line C 1 = {x x > 0} and C 2 = {x x < 0}. Then we have cl(c 1 C 2 ) = Ø {0} = cl(c 1 ) cl(c 2 ). Also, consider C 1 = {x x 0} and C 2 = {x x 0}. Then we have ri(c 1 C 2 ) = {0} Ø = ri(c 1 ) ri(c 2 ).

51 Sec. 1.2 Convex Sets and Functions 45 Continuity of Convex Functions We close this section with a basic result on the continuity properties of convex functions. Proposition : If f : R n R is convex, then it is continuous. More generally, if C R n is convex and f : C R is convex, then f is continuous over the relative interior of C. Proof: Restricting attention to the affine hull of C and using a transformation argument if necessary, we assume without loss of generality, that the origin is an interior point of C and that the unit cube X = {x x 1} is contained in C. It will suffice to show that f is continuous at 0, i.e, that for any sequence {x k } R n that converges to 0, we have f(x k ) f(0). Let e i, i = 1,..., 2 n, be the corners of X, i.e., each e i is a vector whose entries are either 1 or 1. It is not difficult to see that any x X can be expressed in the form x = 2 n i=1 α ie i, where each α i is a nonnegative scalar and 2 n i=1 α i = 1. Let A = max i f(e i ). From Jensen s inequality [Eq. (1.8)], it follows that f(x) A for every x X. For the purpose of proving continuity at zero, we can assume that x k X and x k 0 for all k. Consider the sequences {y k } and {z k } given by y k = x k, z k = x k ; x k x k (cf. Fig ). Using the definition of a convex function for the line segment that connects y k, x k, and 0, we have f(x k ) ( 1 x k ) f(0) + xk f(y k ). We have x k 0 while f(y k ) A for all k, so by taking limit as k, we obtain lim sup f(x k ) f(0). k Using the definition of a convex function for the line segment that connects x k, 0, and z k, we have f(0) x k x k + 1 f(z 1 k) + x k + 1 f(x k) and letting k, we obtain f(0) lim inf f(x k). k Thus, lim k f(x k ) = f(0) and f is continuous at zero. Q.E.D. A straightforward consequence of the continuity of a real-valued function f that is convex over R n is that its epigraph as well as the level sets { x f(x) γ } are closed and convex (cf. Prop ).
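The key estimate in the proof above, f(x) ≤ A = max_i f(e_i) for x in the unit cube, is an instance of Jensen's inequality (1.8). The following sketch (the convex function f and the dimension are arbitrary illustrative choices) expresses a point of the cube in R^2 as a convex combination of its corners and checks the inequality numerically.

```python
import numpy as np
from itertools import product

def f(x):
    """An arbitrary convex function on R^2 (a quadratic plus a norm-like term)."""
    x = np.asarray(x, dtype=float)
    return x @ x + abs(x[0] - x[1])

corners = np.array(list(product([-1.0, 1.0], repeat=2)))   # the 4 corners of [-1, 1]^2

def cube_weights(x):
    """Convex-combination weights of x in [-1, 1]^2 over the corners
    (bilinear interpolation weights: nonnegative and summing to one)."""
    w = [np.prod([(1 + ci * xi) / 2 for ci, xi in zip(c, x)]) for c in corners]
    return np.array(w)

rng = np.random.default_rng(1)
for _ in range(5):
    x = rng.uniform(-1, 1, size=2)
    w = cube_weights(x)
    assert np.allclose(w @ corners, x) and np.isclose(w.sum(), 1.0)
    lhs = f(x)
    rhs = w @ np.array([f(c) for c in corners])    # Jensen upper bound
    bound = max(f(c) for c in corners)             # the constant A in the proof
    print(f"f(x) = {lhs:.4f}  <=  sum_i a_i f(e_i) = {rhs:.4f}  <=  max_i f(e_i) = {bound:.4f}")
```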

Figure: Construction for proving continuity of a convex function (cf. Prop. 1.2.12); the plot shows the corners e1, ..., e4 of the unit cube and the points xk, yk, zk used in the proof.

1.2.4 Recession Cones

Some of the preceding results [cf. Props. 1.2.8 and 1.2.10(e)] have illustrated how boundedness affects the topological properties of sets obtained through various operations on convex sets. In this section we take a closer look at this issue.

Given a convex set C, we say that a vector y is a direction of recession of C if x + αy ∈ C for all x ∈ C and α ≥ 0. In words, y is a direction of recession of C if, starting at any x in C and going indefinitely along y, we never cross the boundary of C to points outside C. The set of all directions of recession is a cone containing the origin. It is called the recession cone of C and it is denoted by R_C (see the figure below). This definition implies that the recession cone of the intersection of any collection of sets C_i, i ∈ I, is equal to the corresponding intersection of the recession cones:

R_{∩_{i∈I} C_i} = ∩_{i∈I} R_{C_i}.

The following proposition gives some additional properties of recession cones.

Proposition 1.2.13: (Recession Cone Theorem) Let C be a nonempty closed convex set.

(a) The recession cone R_C is a closed convex cone.

(b) A vector y belongs to R_C if and only if there exists a vector x ∈ C such that x + αy ∈ C for all α ≥ 0.

(c) R_C contains a nonzero direction if and only if C is unbounded.

53 Sec. 1.2 Convex Sets and Functions 47 x + αy x y 0 Recession Cone R C Convex Set C Figure Illustration of the recession cone R C of a convex set C. A direction of recession y has the property that x + αy C for all x C and α 0. Proof: (a) If y 1, y 2 belong to R C and λ 1, λ 2 are positive scalars such that λ 1 + λ 2 = 1, we have for any x C and α 0 x + α(λ 1 y 1 + λ 2 y 2 ) = λ 1 (x + αy 1 ) + λ 2 (x + αy 2 ) C, where the last inclusion holds because x+ αy 1 and x + αy 2 belong to C by the definition of R C. Hence λ 1 y 1 +λ 2 y 2 R C, implying that R C is convex. Let y be in the closure of R C, and let {y k } R C be a sequence converging to y. For any x C and α 0 we have x + αy k C for all k, and because C is closed, we have x + αy C. This implies that y R C and that R C is closed. (b) If y R C, every vector x C has the required property by the definition of R C. Conversely, let y be such that there exists a vector x C with x + αy C for all α 0. We fix x C and α > 0, and we show that x + αy C. We may assume that y 0 (otherwise we are done) and without loss of generality, we may assume that y = 1. Let z k = x + kαy, k = 1,2,.... If x = z k for some k, then x + αy = x + α(k + 1)y which belongs to C and we are done. We thus assume that x z k for all k, and we define (see the construction of Fig ). We have y k = z k x z k x y k = z k x, k = 1, 2,... z k x z k x z k x + Because z k is an unbounded sequence, z k x z k x 1, x x z k x = z k x z k x y + x x z k x 0, x x z k x.

54 48 Convex Analysis and Optimization Chap. 1 x + αy z k-2 z k-1 z k Convex Set C x x x + y k x + αy x + y Unit Ball Figure Construction for the proof of Prop (b). so by combining the preceding relations, we have y k y. Thus x + αy is the limit of {x + αy k }. The vector x + αy k lies between x and z k in the line segment connecting x and z k for all k such that z k x α, so by convexity of C, we have x + αy k C for all sufficiently large k. Using the closure of C, it follows that x + αy must belong to C. (c) Assuming that C is unbounded, we will show that R C contains a nonzero direction (the reverse is clear). Choose any x C and any unbounded sequence {z k } C. Consider the sequence {y k }, where y k = z k x z k x, and let y be a limit point of {y k } (compare with the construction of Fig ). For any fixed α 0, the vector x+ αy k lies between x and z k in the line segment connecting x and z k for all k such that z k x α. Hence by convexity of C, we have x + αy k C for all sufficiently large k. Since x + αy is a limit point of {x + αy k }, and C is closed, we have x + αy C. Hence the nonzero vector y is a direction of recession. Q.E.D. Note that part (c) of the above proposition yields a characterization of compact and convex sets, namely that a closed convex set is bounded if and only if R C = {0}. A useful generalization is that for a compact set W R m and an m n matrix A, the set V = {x C Ax W } is compact if and only if R C N(A) = {0}. To see this, note that the recession cone of the set V = {x R n Ax W }

55 Sec. 1.2 Convex Sets and Functions 49 is N(A) [clearly N(A) R V ; if x / N(A) but x R V we must have αax W for all α > 0, which contradicts the boundedness of W]. Hence, the recession cone of V is R C N(A), so by Prop (c), V is compact if and only if R C N(A) = {0}. One possible use of recession cones is to obtain conditions guaranteeing the closure of linear transformations and vector sums of convex sets in the absence of boundedness, as in the following two propositions (some refinements are given in the exercises). Proposition : Let C be a nonempty closed convex subset of R n and let A be an m n matrix with nullspace denoted by N(A). Assume that R C N(A) = {0}. Then AC is closed. Proof: For any y cl(ac), the set C ɛ = { x C y Ax ɛ } is nonempty for all ɛ > 0. Furthermore, by the discussion following the proof of Prop , the assumption R C N(A) = {0} implies that C ɛ is compact. It follows that the set ɛ>0 C ɛ is nonempty and any x ɛ>0 C ɛ satisfies Ax = y, so y AC. Q.E.D. Proposition : Let C 1,..., C m be nonempty closed convex subsets of R n such that the equality y y m = 0 for some vectors y i R Ci implies that y i = 0 for all i = 1,..., m. Then the vector sum C C m is a closed set. Proof: Let C be the Cartesian product C 1 C m viewed as a subset of R mn and let A be the linear transformation that maps a vector (x 1,..., x m ) R mn into x x m. We have (see the exercises) and R C = R C1 + + R Cm N(A) = { (y 1,..., y m ) y y m = 0, y i R n}, so under the given condition, we obtain R C N(A) = {0}. Since AC = C C m, the result follows by applying Prop Q.E.D. When specialized to just two sets C 1 and C 2, the above proposition says that if there is no nonzero direction of recession of C 1 that is the

56 50 Convex Analysis and Optimization Chap. 1 opposite of a direction of recession of C 2, then C 1 + C 2 is closed. This is true in particular if R C1 = {0} which is equivalent to C 1 being compact [cf. Prop (c)]. We thus obtain the following proposition. Proposition : Let C 1 and C 2 be closed, convex sets. If C 1 is bounded, then C 1 + C 2 is a closed and convex set. If both C 1 and C 2 are bounded, then C 1 + C 2 is a convex and compact set. Proof: Closedness of C 1 + C 2 follows from the preceding discussion. If both C 1 and C 2 are bounded, then C 1 + C 2 is also bounded and hence also compact. Q.E.D. Note that if C 1 and C 2 are both closed and unbounded, the vector sum C 1 + C 2 need not be closed. For example consider the closed sets of R 2 given by C 1 = { (x 1, x 2 ) x 1 x 2 1, x 1 0, x 2 0 } and C 2 = {(x 1, x 2 ) x 1 = 0 }. Then C 1 + C 2 is the open halfspace {(x 1, x 2 ) x 1 > 0}. E X E R C I S E S (a) Show that a set is convex if and only if it contains all the convex combinations of its elements. (b) Show that the convex hull of a set coincides with the set of all the convex combinations of its elements Let C be a nonempty set in R n, and let λ 1 and λ 2 be positive scalars. Show by example that the sets (λ 1 + λ 2 )C and λ 1 C + λ 2 C may differ when C is not convex [cf. Prop ] (Properties of Cones) (a) For any collection {C i i I} of cones, the intersection i I C i is a cone. (b) The vector sum C 1 + C 2 of two cones C 1 and C 2 is a cone.

57 Sec. 1.2 Convex Sets and Functions 51 (c) The closure of a cone is a cone. (d) The image and the inverse image of a cone under a linear transformation is a cone. (e) For any collection of vectors {a i i I}, the set is a closed convex cone. C = {x a ix 0, i I} (Convex Cones) Let C 1 and C 2 be convex cones containing the origin. (a) Show that C 1 + C 2 = conv(c 1 C 2 ). (b) Consider the set C given by ( ) C = (1 α)c1 αc 2. α [0,1] Show that C = C 1 C Given sets X i R n i, i = 1,..., m, let X = X 1 X m be their Cartesian product. (a) Show that the convex hull (closure, affine hull) of X is equal to the Cartesian product of the convex hulls (closures, affine hulls, respectively) of the X i s. (b) Assuming X 1,..., X m are convex, show that the relative interior (recession cone) of X is equal to the Cartesian product of the relative interiors (recession cones) of the X i s Let {C i i I} be an arbitrary collection of convex sets in R n, and let C be the convex hull of the union of the collection. Show that ( ) C = α i C i, where the union is taken over all convex combinations such that only finitely many coefficients α i are nonzero. i I

58 52 Convex Analysis and Optimization Chap Let X be a nonempty set. (a) Show that X, conv(x), and cl(x) have the same dimension. (b) Show that cone(x) = cone ( conv(x) ). (c) Show that the dimension of conv(x) is at most as large as the dimension of cone(x). Give an example where the dimension of conv(x) is smaller than the dimension of cone(x). (d) Assuming that the origin belongs to conv(x), show that conv(x) and cone(x) have the same dimension Let g be a convex, monotonically nondecreasing function of a single variable [i.e., g(y) g(y) for y < y], and let f be a convex function defined on a convex set C R n. Show that the function h defined by is convex over C. h(x) = g ( f(x) ) (Convex Functions) Show that the following functions are convex: (a) f 1 : X R is given by f 1 (x 1,..., x n ) = (x 1 x 2 x n ) 1 n where X = {(x 1,..., x n ) x 1,..., x n 0}. (b) f 2 (x) = x p with p 1. (c) f 3 (x) = 1 g(x) with g a concave function over Rn such that g(x) > 0 for all x. (d) f 4 (x) = αf(x) + β with f a convex function over R n, and α and β scalars such that α 0. (e) f5(x) = max{f(x), 0} with f a convex function over R n. (f) f 6 (x) = Ax b with A an m n matrix and b a vector in R m. (g) f 7 (x) = x Ax + b x + β with A an n n positive semidefinite symmetric matrix, b a vector in R n, and β a scalar. (h) f 8 (x) = e βx Ax with A an n n positive semidefinite symmetric matrix and β a positive scalar.
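A randomized sanity check is often a useful companion to convexity proofs such as those requested above. The sketch below (illustrative only; the data and helper name are ours) tests the defining inequality f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y) at random points for two of the functions listed, f2 and f6.

```python
import numpy as np

def check_convexity(f, dim, trials=1000, seed=0):
    """Randomized test of f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y).

    Returns the largest observed violation; a value <= ~0 is consistent
    with convexity (this is only a sanity check, not a proof).
    """
    rng = np.random.default_rng(seed)
    worst = -np.inf
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        a = rng.uniform()
        lhs = f(a * x + (1 - a) * y)
        rhs = a * f(x) + (1 - a) * f(y)
        worst = max(worst, lhs - rhs)
    return worst

if __name__ == "__main__":
    p = 3.0
    f2 = lambda x: np.linalg.norm(x) ** p               # ||x||^p with p >= 1
    A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
    b = np.array([1.0, 0.0, -2.0])
    f6 = lambda x: np.linalg.norm(A @ x - b)             # ||Ax - b||
    print("worst violation for f2:", check_convexity(f2, dim=2))
    print("worst violation for f6:", check_convexity(f6, dim=2))
```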

Use the Line Segment Principle and the method of proof of Prop. 1.2.5(c) to show that if C is a convex set with nonempty interior, and f : R^n → R is convex and twice continuously differentiable over C, then ∇²f(x) is positive semidefinite for all x ∈ C.

Let C ⊂ R^n be a convex set and let f : R^n → R be twice continuously differentiable over C. Let S be the subspace that is parallel to the affine hull of C. Show that f is convex over C if and only if y′∇²f(x)y ≥ 0 for all x ∈ C and y ∈ S.

Let f : R^n → R be a differentiable function. Show that f is convex over a convex set C if and only if

(∇f(x) − ∇f(y))′(x − y) ≥ 0,  for all x, y ∈ C.

Hint: The condition above says that the function f, restricted to the line segment connecting x and y, has monotonically nondecreasing gradient; see also the proof of Prop. 1.2.6.

(Ascent/Descent Behavior of a Convex Function) Let f : R → R be a convex function of a single variable.

(a) (Monotropic Property) Use the definition of convexity to show that f is "turning upwards" in the sense that if x1, x2, x3 are three scalars such that x1 < x2 < x3, then

(f(x2) − f(x1)) / (x2 − x1) ≤ (f(x3) − f(x2)) / (x3 − x2).

(b) Use part (a) to show that there are four possibilities as x increases to ∞: (1) f(x) decreases monotonically to −∞, (2) f(x) decreases monotonically to a finite value, (3) f(x) reaches some value and stays at that value, (4) f(x) increases monotonically to ∞ when x ≥ x̄ for some x̄ ∈ R.

(Arithmetic-Geometric Mean Inequality) Show that if α1, ..., αn are positive scalars with Σ_{i=1}^n αi = 1, then for every set of positive scalars x1, ..., xn, we have

x1^{α1} x2^{α2} ··· xn^{αn} ≤ α1 x1 + α2 x2 + ··· + αn xn,

with equality if and only if x1 = x2 = ··· = xn. Hint: Show that − ln x is a strictly convex function on (0, ∞).
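A quick numerical check of the arithmetic-geometric mean inequality (not a substitute for the convexity argument suggested in the hint) can be sketched as follows, with randomly generated weights and positive scalars.

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(5):
    n = int(rng.integers(2, 6))
    a = rng.dirichlet(np.ones(n))        # positive weights summing to one
    x = rng.uniform(0.1, 10.0, size=n)   # positive scalars
    geometric = np.prod(x ** a)          # x_1^{a_1} * ... * x_n^{a_n}
    arithmetic = a @ x                   # a_1 x_1 + ... + a_n x_n
    print(f"n={n}:  {geometric:.4f} <= {arithmetic:.4f}")

# Equality holds only when all the x_i coincide:
x = np.full(4, 3.0)
a = np.full(4, 0.25)
print("equal x_i:", np.prod(x ** a), "==", a @ x)
```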

60 ( 54 Convex Analysis and Optimization Chap Use the result of Exercise to verify Young s inequality xy xp p + yq q, where p > 0, q > 0, 1/p + 1/q = 1, x 0, and y 0. Then, use Young s inequality to verify Holder s inequality ) 1/p ( ) 1/q n n n x i y i x i p y i q. i=1 i=1 i= Let f : R n+m R be a convex function. Consider the function h : R n R given by h(x) = inf f(x, u), u U where U is nonempty and convex subset of R m. Assuming that h(x) > for all x R n, show that h is convex. Hint: There cannot exist α [0,1], x 1, x 2, u 1 U, u 2 U such that h ( αx 1 + (1 α)x 2 ) > αf(x1, u 1 ) + (1 α)f(x 2, u 2 ) (a) Let C be a convex set in R n+1 and let Show that f is convex over R n. f(x) = inf{w (x, w) C}. (b) Let f 1,..., f m be convex functions over R n and let { m } m f(x) = inf f i (x i ) x i = x. i=1 Assuming that f(x) > for all x, show that f is convex over R n. (c) Let h : R m R be a convex function and let i=1 f(x) = inf By=x h(y), where B is an n m matrix. Assuming that f(x) > for all x, show that f is convex over the range space of B. (d) In parts (b) and (c), show by example that if the assumption f(x) > for all x is violated, then the set { x f(x) > } need not be convex.

61 Sec. 1.2 Convex Sets and Functions Let {f i i I} be an arbitrary collection of convex functions on R n. Define the convex hull of these functions f : R n R as the pointwise infimum of the collection, i.e., f(x) = inf { w (x, w) conv ( i I epi(f i ) )}. Show that f(x) is given by { f(x) = inf α i f i (x i ) i I } α i x i = x, where the infimum is taken over all representations of x as a convex combination of elements x i, such that only finitely many coefficients α i are nonzero. i I (Convexification of Nonconvex Functions) Let X be a nonempty subset of R n, let f : X R be a function that is bounded below over X. Define the function F : conv(x) R by F(x) = inf { w (x, w) conv ( epi(f) )}. (a) Show that F is convex over conv(x) and it is given by { m } m m F(x) = inf α i f(x i ) α i x i = x, x i X, α i = 1, α i 0, m 1 i=1 i=1 i=1 (b) Show that inf F(x) = inf f(x). x conv(x) x X (c) Show that the set of global minima of F over conv(x) includes all global minima of f over X (Extension of Caratheodory s Theorem) Let X 1 and X 2 be subsets of R n, and let X = conv(x 1 ) + cone(x 2 ). Show that every vector x in X can be represented in the form x = k α i x i + i=1 m i=k+1 α i x i, where m is a positive integer with m n+1, the vectors x 1,..., x k belong to X 1, the vectors x k+1,..., x m belong to X 2, and the scalars α 1,..., α m are nonnegative with α 1 + +α k = 1. Furthermore, the vectors x 2 x 1,..., x k x 1, x k+1,..., x m are linearly independent.

62 56 Convex Analysis and Optimization Chap Let x 0,..., x m be vectors in R n such that x 1 x 0,..., x m x 0 are linearly independent. The convex hull of x 0,..., x m is called an m-dimensional simplex, and x 0,..., x m are called the vertices of the simplex. (a) Show that the dimension of a convex set C is the maximum of the dimensions of the simplices included in C. (b) Use part (a) to show that a nonempty convex set has a nonempty relative interior Let X be a bounded subset of R n. Show that cl ( conv(x) ) = conv ( cl(x) ). In particular, if X is closed and bounded, then conv(x) is closed and bounded (cf. Prop ) Let C 1 and C 2 be two nonempty convex sets such that C 1 C 2. (a) Give an example showing that ri(c 1 ) need not be a subset of ri(c 2 ). (b) Assuming that the sets ri(c 1 ) and ri(c 2 ) have nonempty intersection, show that ri(c 1 ) ri(c 2 ). (c) Assuming that the sets C 1 and ri(c 2 ) have nonempty intersection, show that the set ri(c 1 ) ri(c 2 ) is nonempty Let C be a nonempty convex set. (a) Show the following refinement of the Line Segment Principle [Prop (c)]: x ri(c) if and only if for every x aff(c), there exists γ > 1 such that x + (γ 1)(x x) C. (b) Assuming that the origin lies in ri(c), show that cone(c) coincides with aff(c). (c) Show the following extension of part (b) to a nonconvex set: If X is a nonempty set such that the origin lies in the relative interior of conv(x), then cone(x) coincides with aff(x).

63 Sec. 1.2 Convex Sets and Functions Let C be a compact set. (a) Assuming that C is convex set not containing the origin on its boundary, show that cone(c) is closed. (b) Give examples showing that the assertion of part (a) fails if C is unbounded or C contains the origin on its boundary. (c) The convexity assumption in part (a) can be relaxed as follows: assuming that conv(c) does not contain the origin on its boundary, show that cone(c) is closed. Hint: Use part (a) and Exercise 1.2.7(b) (a) Let C be a convex cone. Show that ri(c) is also a convex cone. (b) Let C = cone ( {x 1,..., x m } ). Show that { m } ri(c) = α i x i αi > 0, i = 1,..., m. i= Let A be an m n matrix and let C be a nonempty convex set in R n. Assuming that the inverse image A 1 C is nonempty, show that ri(a 1 C) = A 1 ri(c), cl(a 1 C) = A 1 cl(c). [Compare these relations with those of Prop (d) and (e), respectively.] (Lipschitz Property of Convex Functions) Let f : R n R be a convex function and X be a bounded set in R n. Then f has the Lipschitz property over X, i.e., there exists a positive scalar c such that f(x) f(y) c x y, x, y X Let C be a closed convex set and let M be an affine set such that the intersection C M is nonempty and bounded. Show that for every affine set M that is parallel to M, the intersection C M is bounded when nonempty.

64 58 Convex Analysis and Optimization Chap (Recession Cones of Nonclosed Sets) Let C be a nonempty convex set. (a) Show by counterexample that part (b) of the Recession Cone Theorem need not hold when C is not closed. (b) Show that R C R cl(c), cl(r C ) R cl(c). Give an example where the inclusion cl(r C ) R cl(c) is strict, and another example where cl(r C ) = R cl(c). Also, give an example showing that R ri(c) need not be a subset of R C. (c) Let C be a closed convex set such that C C. Show that R C R C. Give an example showing that the inclusion can fail if C is not closed (Recession Cones of Relative Interiors) Let C be a nonempty convex set. (a) Show that a vector y belongs to R ri(c) if and only if there exists a vector x ri(c) such that x + αy C for every α 0. (b) Show that R ri(c) = R cl(c). (c) Let C be a relatively open convex set such that C C. Show that R C R C. Give an example showing that the inclusion can fail if C is not relatively open. [Compare with Exercise (b).] Hint: In part (a), follow the proof of Prop (b). In parts (b) and (c), use the result of part (a) Let C be a nonempty convex set in R n and let A be an m n matrix. (a) Show the following refinement of Prop : if R cl(c) N(A) = {0}, then cl(a C) = A cl(c), A R cl(c) = R A cl(c). (b) Give an example showing that A R cl(c) and R A cl(c) can differ when R cl(c) N(A) {0} (Lineality Space and Recession Cone) Let C be a nonempty convex set in R n. Define the lineality space of C, denoted by L, to be a subspace of vectors y such that simultaneously y R C and y R C. (a) Show that for every subspace S L C = (C S ) + S.

65 Sec. 1.3 Convexity and Optimization 59 (b) Show the following refinement of Prop and Exercise : if A is an m n matrix and R cl(c) N(A) is a subspace of L, then cl(a C) = A cl(c), R A cl(c) = A R cl(c) This exercise is a refinement of Prop (a) Let C 1,..., C m be nonempty closed convex sets in R n such that the equality y 1 + +y m = 0 with y i R Ci implies that each y i belongs to the lineality space of C i. Then the vector sum C C m is a closed set and R C1 + +Cm = R C1 + + R C m. (b) Show the following extension of part (a) to nonclosed sets: Let C 1,..., C m be nonempty convex sets in R n such that the equality y 1 + +y m = 0 with y i R cl(ci ) implies that each y i belongs to the lineality space of cl(c i ). Then we have cl(c C m ) = cl(c 1 ) + + cl(c m ), R cl(c1 + +Cm) = R cl(c1 ) + + R cl(c m). 1.3 CONVEXITY AND OPTIMIZATION In this section we discuss applications of convexity to some basic optimization issues, such as the existence and uniqueness of global minima. Several other applications, relating to optimality conditions and polyhedral convexity, will be discussed in subsequent sections Local and Global Minima Let X be a nonempty subset of R n and let f : R n (, ] be a function. We say that a vector x X is a minimum of f over X if f(x ) = inf x X f(x). We also call x a minimizing point or minimizer or global minimum over X. Alternatively, we say that f attains a minimum over X at x, and we indicate this by writing x arg min x X f(x).

We use similar terminology for maxima, i.e., a vector x* ∈ X such that f(x*) = sup_{x∈X} f(x) is said to be a maximum of f over X, and we indicate this by writing x* ∈ arg max_{x∈X} f(x). If the domain of f is the set X (instead of R^n), we also call x* a (global) minimum or (global) maximum of f (without the qualifier "over X").

A basic question in minimization problems is whether an optimal solution exists. This question can often be resolved with the aid of the classical theorem of Weierstrass, which states that a continuous function attains a minimum over a compact set. We will provide a more general version of this theorem, and to this end, we introduce some terminology. We say that a function f : R^n → (−∞, ∞] is coercive if

lim_{k→∞} f(x_k) = ∞

for every sequence {x_k} such that ‖x_k‖ → ∞ for some norm ‖·‖. Note that as a consequence of the definition, the level sets { x | f(x) ≤ γ } of a coercive function f are bounded whenever they are nonempty.

Proposition 1.3.1: (Weierstrass' Theorem) Let X be a nonempty subset of R^n and let f : R^n → (−∞, ∞] be a closed function. Assume that one of the following three conditions holds:

(1) X is compact.

(2) X is closed and f is coercive.

(3) There exists a scalar γ such that the set { x ∈ X | f(x) ≤ γ } is nonempty and compact.

Then, f attains a minimum over X.

Proof: If f(x) = ∞ for all x ∈ X, then every x ∈ X attains the minimum of f over X. Thus, with no loss of generality, we assume that inf_{x∈X} f(x) < ∞. Assume condition (1). Let {x_k} ⊂ X be a sequence such that

lim_{k→∞} f(x_k) = inf_{x∈X} f(x).

Since X is bounded, this sequence has at least one limit point x* [Prop. (a)]. Since f is closed, f is lower semicontinuous at x* [cf. Prop. (b)], so that f(x*) ≤ lim_{k→∞} f(x_k) = inf_{x∈X} f(x). Since X is closed, x* belongs to X, so we must have f(x*) = inf_{x∈X} f(x).

67 Sec. 1.3 Convexity and Optimization 61 Assume condition (2). Consider a sequence {x k } as in the proof under condition (1). Since f is coercive, {x k } must be bounded and the proof proceeds similar to the proof under condition (1). Assume condition (3). If the given γ is equal to inf x X f(x), the set of minima of f over X is { x X f(x) γ }, and since by assumption this set is nonempty, we are done. If inf x X f(x) < γ, consider a sequence {x k } as in the proof under condition (1). Then, for all k sufficiently large, x k must belong to the set { x X f(x) γ }. Since this set is compact, {x k } must be bounded and the proof proceeds similar to the proof under condition (1). Q.E.D. Note that with appropriate adjustments, the above proposition applies to the existence of maxima of f over X. In particular, if f is upper semicontinuous at all points of X and X is compact, then f attains a maximum over X. We say that a vector x X is a local minimum of f over X if there exists some ɛ > 0 such that f(x ) f(x) for every x X satisfying x x ɛ, where is some vector norm. If the domain of f is the set X (instead of R n ), we also call x a local minimum (or local maximum) of f (without the qualifier over X ). Local and global maxima are defined similarly. An important implication of convexity of f and X is that all local minima are also global, as shown in the following proposition and in Fig Proposition 1.3.2: If X R n is a convex set and f : X R is a convex function, then a local minimum of f is also a global minimum. If in addition f is strictly convex, then there exists at most one global minimum of f. Proof: See Fig for a proof that a local minimum of f is also global. Let f be strictly convex, and to obtain a contradiction, assume that two distinct global minima x and y exist. Then the average (x + y)/2 must belong to X, since X is convex. Furthermore, the value of f must be smaller at the average than at x and y by the strict convexity of f. Since x and y are global minima, we obtain a contradiction. Q.E.D The Projection Theorem In this section we develop a basic result of analysis and optimization.

Figure: Proof of why local minima of convex functions are also global [the plot shows f(x), the chord value αf(x*) + (1 − α)f(x), and f(αx* + (1 − α)x) along the segment from x* to x]. Suppose that f is convex, and assume, to arrive at a contradiction, that x* is a local minimum that is not global. Then there must exist an x ∈ X such that f(x) < f(x*). By convexity, for all α ∈ (0, 1),

f(αx* + (1 − α)x) ≤ αf(x*) + (1 − α)f(x) < f(x*).

Thus, f has strictly lower value than f(x*) at every point on the line segment connecting x* with x, except x*. This contradicts the local minimality of x*.
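A practical consequence of Prop. 1.3.2 (and of the figure above) is that a local descent method applied to a convex problem cannot get trapped at a non-global local minimum. As a hedged illustration (the function, the starting points, and the choice of solver below are arbitrary), a standard local optimizer started from widely separated points returns the same minimum value.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    """A smooth convex function on R^2: a positive definite quadratic plus log-sum-exp."""
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    return 0.5 * x @ Q @ x + np.logaddexp(x[0], x[1])

starts = [np.array([10.0, -7.0]), np.array([-30.0, 25.0]), np.array([0.0, 0.0])]
for x0 in starts:
    res = minimize(f, x0, method="BFGS")
    print(f"start {x0}: minimum value {res.fun:.6f} at x* = {np.round(res.x, 4)}")
# All runs agree up to solver tolerance: a local minimum of a convex f is global.
```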

69 Sec. 1.3 Convexity and Optimization 63 Proposition 1.3.3: (Projection Theorem) Let C be a closed convex set and let be the Euclidean norm. (a) For every x R n, there exists a unique vector z C that minimizes z x over all z C. This vector is called the projection of x on C, and is denoted by P C (x), i.e., P C (x) = arg min z x. z C (b) For every x R n, a vector z C is equal to P C (x) if and only if (y z) (x z) 0, y C. (c) The function f : R n C defined by f(x) = P C (x) is continuous and nonexpansive, i.e., PC (x) P C (y) x y, x, y Rn. (d) The distance function is convex. d(x, C) = min z C z x, x Rn, Proof: (a) Fix x and let w be some element of C. Minimizing x z over all z C is equivalent to minimizing the same function over all z C such that x z x w, which is a compact set. Furthermore, the function g defined by g(z) = z x 2 is continuous. Existence of a minimizing vector follows by Weierstrass Theorem (Prop ). To prove uniqueness, notice that the square of the Euclidean norm is a strictly convex function of its argument [Prop (d)]. Therefore, g is strictly convex and it follows that its minimum is attained at a unique point (Prop ). (b) For all y and z in C we have y x 2 = y z 2 + z x 2 2(y z) (x z) z x 2 2(y z) (x z). Therefore, if z is such that (y z) (x z) 0 for all y C, we have y x 2 z x 2 for all y C, implying that z = P C (x). Conversely, let z = P C (x), consider any y C, and for α > 0, define

70 64 Convex Analysis and Optimization Chap. 1 y α = αy + (1 α)z. We have x y α 2 = (1 α)(x z) + α(x y) 2 = (1 α) 2 x z 2 + α 2 x y 2 + 2(1 α)α(x z) (x y). Viewing x y α 2 as a function of α, we have { x yα α 2} α=0 = 2 x z 2 + 2(x z) (x y) = 2(y z) (x z). Therefore, if (y z) (x z) > 0 for some y C, then { x yα α 2} α=0 < 0 and for positive but small enough α, we obtain x y α < x z. This contradicts the fact z = P C (x) and shows that (y z) (x z) 0 for all y C. (c) Let x and y be elements of R n. From part (b), we have ( w P C (x) ) ( x P C (x) ) 0 for all w C. Since P C (y) C, we obtain ( PC (y) P C (x) ) ( x PC (x) ) 0. Similarly, ( PC (x) P C (y) ) ( y PC (y) ) 0. Adding these two inequalities, we obtain ( PC (y) P C (x) ) ( x PC (x) y + P C (y) ) 0. By rearranging and by using the Schwartz inequality, we have PC (y) P C (x) 2 ( P C (y) P C (x) ) (y x) PC (y) P C (x) y x, showing that P C ( ) is nonexpansive and a fortiori continuous. (d) Assume, to arrive at a contradiction, that there exist x 1, x 2 R n and an α [0,1] such that d ( αx 1 + (1 α)x 2, C ) > αd(x 1, C) + (1 α)d(x 2, C). Then there must exist z 1, z 2 C such that d ( αx 1 + (1 α)x 2, C ) > α z 1 x 1 + (1 α) z 2 x 2, which implies that αz1 + (1 α)z 2 αx 1 (1 α)x 2 > α x1 z 1 + (1 α) x 2 z 2. This contradicts the triangle inequality in the definition of norm. Q.E.D. Figure illustrates the necessary and sufficient condition of part (b) of the Projection Theorem.
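For simple closed convex sets the projection P_C(x) is available in closed form, and the characterization in part (b) can be checked directly. The sketch below (illustrative; the box C, the test points, and the helper name are our own choices) projects onto a box, which amounts to coordinatewise clipping, samples the inequality of part (b), and checks the nonexpansiveness of part (c).

```python
import numpy as np

lo = np.array([-1.0, 0.0])
hi = np.array([ 2.0, 3.0])

def project_box(x):
    """Euclidean projection onto the box C = {z : lo <= z <= hi} (coordinatewise clipping)."""
    return np.clip(x, lo, hi)

rng = np.random.default_rng(3)
x = np.array([4.0, -2.0])          # a point outside the box
z = project_box(x)

# Part (b): (y - z)'(x - z) <= 0 for every y in C (checked on random samples of C).
samples = rng.uniform(lo, hi, size=(1000, 2))
inner = (samples - z) @ (x - z)
print("projection:", z, "  max over sampled y of (y - z)'(x - z):", inner.max())

# Part (c): the projection is nonexpansive, ||P(x) - P(y)|| <= ||x - y||.
y = np.array([-5.0, 7.0])
print(np.linalg.norm(project_box(x) - project_box(y)), "<=", np.linalg.norm(x - y))
```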

Figure: Illustration of the condition satisfied by the projection P_C(x). For each vector y ∈ C, the vectors x − P_C(x) and y − P_C(x) form an angle greater than or equal to π/2 or, equivalently,

(y − P_C(x))′(x − P_C(x)) ≤ 0,  for all y ∈ C.

1.3.3 Directions of Recession and Existence of Optimal Solutions

The recession cone, discussed in Section 1.2.4, is also useful for characterizing directions along which convex functions asymptotically increase or decrease. A key idea here is that a function that is convex over R^n can be described in terms of its epigraph, which is a closed and convex set. The recession cone of the epigraph can be used to obtain the directions along which the function slopes downward. This is the idea underlying the following proposition.

Proposition 1.3.4: Let f : R^n → R be a convex function and consider the level sets L_a = { x | f(x) ≤ a }.

(a) All the level sets L_a that are nonempty have the same recession cone, given by

R_{L_a} = { y | (y, 0) ∈ R_{epi(f)} },

where R_{epi(f)} is the recession cone of the epigraph of f.

(b) If one nonempty level set L_a is compact, then all nonempty level sets are compact.

Proof: From the formula for the epigraph

epi(f) = { (x, w) | f(x) ≤ w },

it can be seen that for all a for which L_a is nonempty, we have

{ (x, a) | x ∈ L_a } = epi(f) ∩ { (x, a) | x ∈ R^n }.

The recession cone of the set on the left-hand side above is { (y, 0) | y ∈ R_{L_a} }. The recession cone of the set on the right-hand side is equal to the intersection of the recession cone of epi(f) and the recession cone of

72 66 Convex Analysis and Optimization Chap. 1 { (x, a) x R n }, which is equal to { (y, 0) y R n}, the horizontal subspace that passes through the origin. Thus we have { (y, 0) y RLa } = { (y,0) (y,0) Repi(f) }, from which it follows that R La = { y (y,0) R epi(f) }. This proves part (a). Part (b) follows by applying Prop (c) to the recession cone of epi(f). Q.E.D. For a convex function f : R n R, the (common) recession cone of the nonempty level sets of f is referred to as the recession cone of f, and is denoted by R f. Thus R f = { y f(x) f(x + αy), x R n, α 0 }. Each y R f is called a direction of recession of f. If we start at any x R n and move indefinitely along a direction of recession y, we must stay within each level set that contains x, or equivalently we must encounter exclusively points z with f(z) f(x). In words, a direction of recession of f is a direction of uninterrupted nonascent for f. Conversely, if we start at some x R n and while moving along a direction y, we encounter a point z with f(z) > f(x), then y cannot be a direction of recession. It is easily seen via a convexity argument that once we cross the boundary of a level set of f we never cross it back again, and with a little thought (see Fig and the exercises), it follows that a direction that is not a direction of recession of f is a direction of eventual uninterrupted ascent of f. In view of these observations, it is not surprising that directions of recession play a prominent role in characterizing the existence of solutions of convex optimization problems, as shown in the following proposition. Proposition 1.3.5: Let f : R n R be a convex function, X be a nonempty closed convex subset of R n, and X be the set of minimizing points of f over X. Then X is nonempty and compact if and only if X and f have no common nonzero direction of recession. Proof: Let f = inf x X f(x), and note that X = X { x f(x) f }. If X is nonempty and compact, it has no nonzero direction of recession [Prop (c)]. Therefore, there is no nonzero vector in the intersection

Figure. Ascent/descent behavior of a convex function starting at some x ∈ R^n and moving along a direction y [panels (a)-(f) plot f(x + αy) as a function of α]. If y is a direction of recession of f, there are two possibilities: either f decreases monotonically to a finite value or to −∞ [figures (a) and (b), respectively], or f reaches a value that is less than or equal to f(x) and stays at that value [figures (c) and (d)]. If y is not a direction of recession of f, then eventually f increases monotonically to ∞ [figures (e) and (f)]; i.e., for some ᾱ ≥ 0 and all α1, α2 ≥ ᾱ with α1 < α2 we have f(x + α1 y) < f(x + α2 y).

of the recession cones of X and {x | f(x) ≤ f*}. This is equivalent to X and f having no common nonzero direction of recession.

Conversely, let a be a scalar such that the set

X_a = X ∩ {x | f(x) ≤ a}

is nonempty and has no nonzero direction of recession. Then X_a is closed [since X is closed and {x | f(x) ≤ a} is closed by the continuity of f], and by Prop. (c), X_a is compact. Since minimization of f over X and over X_a yields the same set of minima, X*, by Weierstrass' Theorem (Prop. ) X* is nonempty, and since X* ⊂ X_a, we see that X* is bounded. Since X* is closed [because X is closed and {x | f(x) ≤ f*} is closed by the continuity of f], it is compact. Q.E.D.

If the closed convex set X and the convex function f of the above proposition have a common direction of recession, then either X* is empty [take, for example, X = (−∞, 0] and f(x) = e^x] or else X* is nonempty and unbounded [take, for example, X = (−∞, 0] and f(x) = max{0, x}]. Another interesting question is what happens when X and f have a common direction of recession, call it y, but f is bounded below over X:

f* = inf_{x∈X} f(x) > −∞.

Then for any x ∈ X, we have x + αy ∈ X (since y is a direction of recession of X), and f(x + αy) is monotonically nonincreasing to a finite value as α → ∞ (since y is a direction of recession of f and f* > −∞). Generally, the minimum of f over X need not be attained. However, it turns out that the minimum is attained in an important special case: when f is quadratic and X is polyhedral (i.e., X is specified by linear equality and inequality constraints).

To understand the main idea, consider the problem

minimize f(x) = c'x + (1/2) x'Qx
subject to Ax = 0,      (1.9)

where Q is a positive semidefinite symmetric n × n matrix, c ∈ R^n is a given vector, and A is an m × n matrix. Let N(Q) and N(A) denote the nullspaces of Q and A, respectively. There are two possibilities:

(a) For some x ∈ N(A) ∩ N(Q), we have c'x ≠ 0. Then, since f(αx) = αc'x for all α ∈ R, it follows that f becomes unbounded from below either along x or along −x.

(b) For all x ∈ N(A) ∩ N(Q), we have c'x = 0. In this case, we have f(x) = 0 for all x ∈ N(A) ∩ N(Q). For x ∈ N(A) such that x ∉ N(Q), since N(Q) and R(Q), the range of Q, are orthogonal subspaces, x can be uniquely decomposed as x_R + x_N, where x_N ∈ N(Q) and x_R ∈ R(Q), and we have

f(x) = c'x + (1/2) x_R'Qx_R,

where x_R is the (nonzero) component of x along R(Q). Hence

f(αx) = αc'x + (1/2) α² x_R'Qx_R,   for all α > 0, with x_R'Qx_R > 0.

It follows that f is bounded below along all feasible directions x ∈ N(A).
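To make the two cases concrete, here is a minimal numerical sketch (the specific Q, c, and A below are illustrative choices, not taken from the text): for a direction d spanning N(A) ∩ N(Q), case (a) arises when c'd ≠ 0, so that f(αd) = αc'd is unbounded below along d or −d, while case (b) arises when c'd = 0 and f vanishes along d.

```python
import numpy as np

# Illustrative data for problem (1.9): minimize f(x) = c'x + (1/2) x'Qx subject to Ax = 0.
Q = np.diag([1.0, 0.0])          # positive semidefinite; N(Q) = span{(0, 1)}
A = np.array([[1.0, 0.0]])       # N(A) = span{(0, 1)}, so N(A) ∩ N(Q) = span{(0, 1)}

def f(x, c):
    return c @ x + 0.5 * x @ Q @ x

d = np.array([0.0, 1.0])         # a direction in N(A) ∩ N(Q)

# Case (a): c'd != 0, so f(alpha*d) = alpha * c'd is unbounded below along d or -d.
c_a = np.array([0.0, 1.0])
print([f(alpha * d, c_a) for alpha in (-1e2, -1e4, -1e6)])   # decreases without bound

# Case (b): c'd = 0 for every d in N(A) ∩ N(Q); f is bounded below along N(A).
c_b = np.array([1.0, 0.0])
print([f(alpha * d, c_b) for alpha in (-1e6, 0.0, 1e6)])     # stays at 0 along d
```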

We thus conclude that for f to be bounded from below along all directions in N(A) it is necessary and sufficient that c'x = 0 for all x ∈ N(A) ∩ N(Q). However, boundedness from below of a convex cost function f along all directions of recession of a constraint set does not guarantee existence of an optimal solution, or even boundedness from below over the constraint set (see the exercises). On the other hand, since the constraint set N(A) is a subspace, it is possible to use a transformation x = Bz, where the columns of the matrix B are basis vectors for N(A), and view the problem as an unconstrained minimization over z of the cost function h(z) = f(Bz), which is a positive semidefinite quadratic. We can then argue that boundedness from below of this function along all directions z is necessary and sufficient for existence of an optimal solution. This argument indicates that problem (1.9) has an optimal solution if and only if c'x = 0 for all x ∈ N(A) ∩ N(Q). By using a translation argument, this result can also be extended to the case where the constraint set is a general affine set of the form {x | Ax = b} rather than the subspace {x | Ax = 0}.

In part (a) of the following proposition we state the result just described (equality constraints only). While we can prove the result by formalizing the argument outlined above, we will use instead a more elementary variant of this argument, whereby the constraints are eliminated via a penalty function; this will give us the opportunity to introduce a line of proof that we will frequently employ in other contexts as well. In part (b) of the proposition, we allow linear inequality constraints, and we show that a convex quadratic program has an optimal solution if and only if its optimal value is bounded below. Note that the cost function may be linear, so the proposition applies to linear programs as well.
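The condition just derived can also be checked numerically. The sketch below (our own illustration, assuming NumPy; the helper names are not from the text) computes an orthonormal basis of N(A) ∩ N(Q) as the nullspace of the stacked matrix [A; Q] and tests whether c is orthogonal to it.

```python
import numpy as np

def nullspace(M, tol=1e-10):
    # Orthonormal basis of the nullspace of M, computed via the SVD.
    _, s, Vt = np.linalg.svd(M)
    rank = int(np.sum(s > tol))
    return Vt[rank:].T                    # columns span N(M)

def equality_qp_has_minimum(Q, c, A, tol=1e-10):
    """Test whether c'y = 0 for all y in N(A) ∩ N(Q), i.e., whether
       minimize c'x + (1/2) x'Qx subject to Ax = b attains its minimum."""
    B = nullspace(np.vstack([A, Q]))      # N(A) ∩ N(Q) = nullspace of [A; Q]
    return bool(np.all(np.abs(B.T @ c) <= tol))

# Same illustrative data as in the sketch above.
Q = np.diag([1.0, 0.0])
A = np.array([[1.0, 0.0]])
print(equality_qp_has_minimum(Q, np.array([0.0, 1.0]), A))   # False: case (a), no minimum
print(equality_qp_has_minimum(Q, np.array([1.0, 0.0]), A))   # True: case (b), minimum attained
```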

Proposition 1.3.6: (Existence of Solutions of Quadratic Programs) Let f : R^n → R be a quadratic function of the form

f(x) = c'x + (1/2) x'Qx,

where Q is a positive semidefinite symmetric n × n matrix and c ∈ R^n is a given vector. Let also A be an m × n matrix and b ∈ R^m be a vector. Denote by N(A) and N(Q) the nullspaces of A and Q, respectively.

(a) Let X = {x | Ax = b} and assume that X is nonempty. The following are equivalent:

(i) f attains a minimum over X.

(ii) f* = inf_{x∈X} f(x) > −∞.

(iii) c'y = 0 for all y ∈ N(A) ∩ N(Q).

(b) Let X = {x | Ax ≤ b} and assume that X is nonempty. The following are equivalent:

(i) f attains a minimum over X.

(ii) f* = inf_{x∈X} f(x) > −∞.

(iii) c'y ≥ 0 for all y ∈ N(Q) such that Ay ≤ 0.

Proof: (a) (i) clearly implies (ii). We next show that (ii) implies (iii). For all x ∈ X, y ∈ N(A) ∩ N(Q), and α ∈ R, we have x + αy ∈ X and

f(x + αy) = c'(x + αy) + (1/2)(x + αy)'Q(x + αy) = f(x) + αc'y.

If c'y ≠ 0, then either lim_{α→∞} f(x + αy) = −∞ or lim_{α→−∞} f(x + αy) = −∞, and we must have f* = −∞. Hence (ii) implies that c'y = 0 for all y ∈ N(A) ∩ N(Q).

We finally show that (iii) implies (i) by first using a translation argument and then using a penalty function argument. Choose any x̄ ∈ X, so that X = x̄ + N(A). Then minimizing f over X is equivalent to minimizing f(x̄ + y) over y ∈ N(A), or

minimize h(y)
subject to Ay = 0,

where

h(y) = f(x̄ + y) = f(x̄) + ∇f(x̄)'y + (1/2) y'Qy.

For any integer k > 0, let

h_k(y) = h(y) + (k/2) ‖Ay‖² = f(x̄) + ∇f(x̄)'y + (1/2) y'Qy + (k/2) ‖Ay‖².      (1.10)

Note that for all k we have

h_k(y) ≤ h_{k+1}(y),   for all y ∈ R^n,

and

inf_{y∈R^n} h_k(y) ≤ inf_{y∈R^n} h_{k+1}(y) ≤ inf_{Ay=0} h(y) ≤ h(0) = f(x̄).      (1.11)

Denote

S = (N(A) ∩ N(Q))^⊥,

and write any y ∈ R^n as y = z + w, where z ∈ S and w ∈ S^⊥ = N(A) ∩ N(Q). Then, by using the assumption c'w = 0 [implying that ∇f(x̄)'w = (c + Qx̄)'w = 0], we see from Eq. (1.10) that

h_k(y) = h_k(z + w) = h_k(z),      (1.12)

i.e., h_k is determined in terms of its restriction to the subspace S. It can be seen from Eq. (1.10) that the function h_k has no nonzero direction of recession in common with S, so h_k attains a minimum over S, call it y^k, and in view of Eq. (1.12), y^k also attains the minimum of h_k(y) over R^n. From Eq. (1.11), we have

h_k(y^k) ≤ h_{k+1}(y^{k+1}) ≤ inf_{Ay=0} h(y) ≤ f(x̄),      (1.13)

and we will use this relation to show that {y^k} is bounded and each of its limit points minimizes h(y) subject to Ay = 0. Indeed, from Eq. (1.13), the sequence {h_k(y^k)} is bounded, so if {y^k} were unbounded, then assuming without loss of generality that y^k ≠ 0 and ‖y^k‖ → ∞, we would have h_k(y^k)/‖y^k‖ → 0, or

lim_{k→∞} [ (1/‖y^k‖) f(x̄) + ∇f(x̄)'ŷ^k + ‖y^k‖ ( (1/2) ŷ^k'Qŷ^k + (k/2) ‖Aŷ^k‖² ) ] = 0,

where ŷ^k = y^k/‖y^k‖. For this to be true, all limit points ŷ of the bounded sequence {ŷ^k} must be such that ŷ'Qŷ = 0 and Aŷ = 0, which is impossible since ‖ŷ‖ = 1 and ŷ ∈ S. Thus {y^k} is bounded, and for any one of its limit points, call it ȳ, we have ȳ ∈ S and

lim sup_{k→∞} h_k(y^k) = f(x̄) + ∇f(x̄)'ȳ + (1/2) ȳ'Qȳ + lim sup_{k→∞} (k/2) ‖Ay^k‖² ≤ inf_{Ay=0} h(y).

It follows that Aȳ = 0 and that ȳ minimizes h(y) over Ay = 0. This implies that the vector x* = x̄ + ȳ minimizes f(x) subject to Ax = b.

(b) Clearly (i) implies (ii), and similar to the proof of part (a), (ii) implies that c'y ≥ 0 for all y ∈ N(Q) with Ay ≤ 0. Finally, we show that (iii) implies (i) by using the corresponding result of part (a). For any x ∈ X, let J(x) denote the index set of active constraints at x, i.e., J(x) = {j | a_j'x = b_j}, where the a_j' are the rows of A. For any sequence {x^k} ⊂ X with f(x^k) → f*, we can extract a subsequence such that J(x^k) is constant and equal to some J. Accordingly, we select a sequence {x^k} ⊂ X such that f(x^k) → f*, and the index set J(x^k) is equal for all k to a set J that is maximal over all such sequences [for any other sequence {x̃^k} ⊂ X with f(x̃^k) → f* and such that J(x̃^k) = J̃ for all k, we cannot have J ⊂ J̃ unless J̃ = J]. Consider the problem

minimize f(x)
subject to a_j'x = b_j,  j ∈ J.      (1.14)

Assume, to come to a contradiction, that this problem does not have a solution. Then, by part (a), we have c'ȳ < 0 for some ȳ ∈ N(Ā) ∩ N(Q), where Ā is the matrix having as rows the a_j', j ∈ J (if part (a) yields c'ȳ > 0, replace ȳ by −ȳ). Consider the line {x^k + γȳ | γ > 0}. Since ȳ ∈ N(Q), we have

f(x^k + γȳ) = f(x^k) + γc'ȳ,

so that

f(x^k + γȳ) < f(x^k),   for all γ > 0.

Furthermore, since ȳ ∈ N(Ā), we have

a_j'(x^k + γȳ) = b_j,   for all j ∈ J and γ > 0.

We must also have a_j'ȳ > 0 for at least one j ∉ J [otherwise (iii) would be violated], so the line {x^k + γȳ | γ > 0} crosses the boundary of X for some γ̄_k > 0. The sequence {x̄^k}, where x̄^k = x^k + γ̄_k ȳ, satisfies {x̄^k} ⊂ X, f(x̄^k) → f* [since f* ≤ f(x̄^k) < f(x^k) and f(x^k) → f*], and the active index set J(x̄^k) strictly contains J for all k. This contradicts the maximality of J, and shows that problem (1.14) has an optimal solution, call it x*. Since x^k is a feasible solution of problem (1.14), we have

f(x*) ≤ f(x^k),   for all k,

so that f(x*) ≤ f*.

We will now show that x* minimizes f over X, by showing that x* ∈ X, thereby completing the proof. Assume, to arrive at a contradiction, that x* ∉ X. Let x̂^k be the point in the interval connecting x^k and x* that belongs to X and is closest to x*. We have that J(x̂^k) strictly contains J for all k. Since f(x*) ≤ f(x^k) and f is convex over the interval [x^k, x*], it follows that

f(x̂^k) ≤ max{f(x^k), f(x*)} = f(x^k).

Thus f(x̂^k) → f*, which contradicts the maximality of J. Q.E.D.

1.3.4 Existence of Saddle Points

Suppose that we are given a function φ : X × Z → R, where X ⊂ R^n and Z ⊂ R^m, and we wish to either

minimize sup_{z∈Z} φ(x, z)
subject to x ∈ X

or

maximize inf_{x∈X} φ(x, z)
subject to z ∈ Z.

These problems are encountered in at least three major optimization contexts:

(1) Worst-case design, whereby we view z as a parameter and we wish to minimize over x a cost function, assuming the worst possible value of z. A special case of this is the discrete minimax problem, where we want to minimize over x ∈ X

max{f_1(x), ..., f_m(x)},

where the f_i are some given functions. Here, Z is the finite set {1, ..., m}. Within this context, it is important to provide characterizations of the max function

max_{z∈Z} φ(x, z),

particularly its directional derivative. We do this in Section 1.7, where we discuss the differentiability properties of convex functions.

(2) Exact penalty functions, which can be used for example to convert constrained optimization problems of the form

minimize f(x)
subject to x ∈ X,  g_j(x) ≤ 0,  j = 1, ..., r      (1.15)

to (less constrained) minimax problems of the form

minimize f(x) + c max{0, g_1(x), ..., g_r(x)}
subject to x ∈ X,

where c is a large positive penalty parameter. This conversion is useful for both analytical and computational purposes, and will be discussed in Chapters 2 and 4.

(3) Duality theory, where, using problem (1.15) as an example, we introduce the so-called Lagrangian function

L(x, µ) = f(x) + Σ_{j=1}^r µ_j g_j(x)

involving the vector µ = (µ_1, ..., µ_r) ∈ R^r, and the dual problem

maximize inf_{x∈X} L(x, µ)
subject to µ ≥ 0.      (1.16)

The original (primal) problem (1.15) can also be written as

minimize sup_{µ≥0} L(x, µ)
subject to x ∈ X

[if x violates any of the constraints g_j(x) ≤ 0, we have sup_{µ≥0} L(x, µ) = ∞, and if it does not, we have sup_{µ≥0} L(x, µ) = f(x)]. Thus the primal and the dual problems (1.15) and (1.16) can be viewed in terms of a minimax problem.

We will now derive conditions guaranteeing that

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} sup_{z∈Z} φ(x, z),      (1.17)

and that the inf and sup above are attained. This is a major issue in duality theory because it connects the primal and the dual problems [cf. Eqs. (1.15) and (1.16)] through their optimal values and optimal solutions. In particular, when we discuss duality in Chapter 3, we will see that a major question is whether there is no duality gap, i.e., whether the optimal primal and dual values are equal. This is so if and only if

sup_{µ≥0} inf_{x∈X} L(x, µ) = inf_{x∈X} sup_{µ≥0} L(x, µ).      (1.18)

We will prove in this section one major result, the Saddle Point Theorem, which guarantees the equality (1.17), assuming convexity/concavity

assumptions on φ and (essentially) compactness assumptions on X and Z. Unfortunately, the theorem to be shown later in this section is only partially adequate for the development of duality theory, because compactness of Z and, to some extent, compactness of X turn out to be restrictive assumptions [for example, Z corresponds to the set {µ | µ ≥ 0} in Eq. (1.18), which is not compact]. We will derive additional theorems of the minimax type in Chapter 3, when we discuss duality and we make a closer connection with the theory of Lagrange multipliers.

A first observation regarding the potential validity of the minimax equality (1.17) is that we always have the inequality

sup_{z∈Z} inf_{x∈X} φ(x, z) ≤ inf_{x∈X} sup_{z∈Z} φ(x, z)      (1.19)

[for every z ∈ Z, write

inf_{x∈X} φ(x, z) ≤ inf_{x∈X} sup_{z∈Z} φ(x, z)

and take the supremum over z ∈ Z of the left-hand side]. However, special conditions are required to guarantee equality.

Suppose that x* is an optimal solution of the problem

minimize sup_{z∈Z} φ(x, z)
subject to x ∈ X      (1.20)

and z* is an optimal solution of the problem

maximize inf_{x∈X} φ(x, z)
subject to z ∈ Z.      (1.21)

The Saddle Point Theorem is also central in game theory, as we now briefly explain. In the simplest type of zero sum game, there are two players: the first may choose one out of n moves and the second may choose one out of m moves. If moves i and j are selected by the first and the second player, respectively, the first player gives a specified amount a_ij to the second. The objective of the first player is to minimize the amount given to the other player, and the objective of the second player is to maximize this amount. The players use mixed strategies, whereby the first player selects a probability distribution x = (x_1, ..., x_n) over his n possible moves and the second player selects a probability distribution z = (z_1, ..., z_m) over his m possible moves. Since the probability of selecting i and j is x_i z_j, the expected amount to be paid by the first player to the second is Σ_{i,j} a_ij x_i z_j, or x'Az, where A is the n × m matrix with elements a_ij. If each player adopts a worst case viewpoint, whereby he optimizes his choice against the worst possible selection by the other player, the first player must minimize max_z x'Az and the second player must maximize min_x x'Az. The main result, a special case of the existence result we will prove shortly, is that these two optimal values are equal, implying that there is an amount that can be meaningfully viewed as the value of the game for its participants.

Then we have

sup_{z∈Z} inf_{x∈X} φ(x, z) = inf_{x∈X} φ(x, z*) ≤ φ(x*, z*) ≤ sup_{z∈Z} φ(x*, z) = inf_{x∈X} sup_{z∈Z} φ(x, z).      (1.22)

If the minimax equality [cf. Eq. (1.17)] holds, then equality holds throughout above, so that

sup_{z∈Z} φ(x*, z) = φ(x*, z*) = inf_{x∈X} φ(x, z*),      (1.23)

or equivalently

φ(x*, z) ≤ φ(x*, z*) ≤ φ(x, z*),   for all x ∈ X and z ∈ Z.      (1.24)

A pair of vectors x* ∈ X and z* ∈ Z satisfying the two above (equivalent) relations is called a saddle point of φ (cf. the figure below). The preceding argument showed that if the minimax equality (1.17) holds, any vectors x* and z* that are optimal solutions of problems (1.20) and (1.21), respectively, form a saddle point. Conversely, if (x*, z*) is a saddle point, then the definition [cf. Eq. (1.23)] implies that

inf_{x∈X} sup_{z∈Z} φ(x, z) ≤ sup_{z∈Z} inf_{x∈X} φ(x, z).

This, together with the minimax inequality (1.19), guarantees that the minimax equality (1.17) holds, and, from Eq. (1.22), x* and z* are optimal solutions of problems (1.20) and (1.21), respectively. We summarize the above discussion in the following proposition.

Proposition 1.3.7: A pair (x*, z*) is a saddle point of φ if and only if the minimax equality (1.17) holds, and x* and z* are optimal solutions of problems (1.20) and (1.21), respectively.

Note a simple consequence of the above proposition: the set of saddle points, when nonempty, is the Cartesian product X* × Z*, where X* and Z* are the sets of optimal solutions of problems (1.20) and (1.21), respectively. In other words, x* and z* can be independently chosen within the sets X* and Z*, respectively, to form a saddle point. Note also that if the minimax equality (1.17) does not hold, there is no saddle point, even if the sets X* and Z* are nonempty.

One can visualize saddle points in terms of the sets of minimizing points over X for fixed z ∈ Z and maximizing points over Z for fixed x ∈ X:

X̂(z) = {x̂ | x̂ minimizes φ(x, z) over X},

Figure. Illustration of a saddle point of a function φ(x, z) over x ∈ X and z ∈ Z [the function plotted here is φ(x, z) = (1/2)(x² + 2xz − z²)]; the plot shows the curve of maxima φ(x, ẑ(x)), the curve of minima φ(x̂(z), z), and the saddle point (x*, z*). Let

x̂(z) = arg min_{x∈X} φ(x, z),   ẑ(x) = arg max_{z∈Z} φ(x, z)

be the curves of minimizing and maximizing points, and consider the corresponding curves φ(x̂(z), z) and φ(x, ẑ(x)) [shown in the figure for the case where x̂(z) and ẑ(x) are unique; otherwise x̂(z) and ẑ(x) should be viewed as set-valued mappings]. By definition, a pair (x*, z*) is a saddle point if and only if

max_{z∈Z} φ(x*, z) = φ(x*, z*) = min_{x∈X} φ(x, z*),

or equivalently, if (x*, z*) lies on both curves [x* = x̂(z*) and z* = ẑ(x*)]. At such a pair, we also have

max_{z∈Z} φ(x̂(z), z) = max_{z∈Z} min_{x∈X} φ(x, z) = φ(x*, z*) = min_{x∈X} max_{z∈Z} φ(x, z) = min_{x∈X} φ(x, ẑ(x)),

so that

φ(x̂(z), z) ≤ φ(x*, z*) ≤ φ(x, ẑ(x)),   for all x ∈ X and z ∈ Z

(see Prop. 1.3.7). Visually, the curve of maxima φ(x, ẑ(x)) must lie above the curve of minima φ(x̂(z), z) (completely, i.e., for all x ∈ X and z ∈ Z).
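To tie the figure to the definitions, the following small sketch (our own illustration, assuming NumPy) checks the saddle point relation (1.24) and the minimax equality (1.17) on a finite grid for the function of the figure, φ(x, z) = (1/2)(x² + 2xz − z²), whose saddle point over X = Z = R is (x*, z*) = (0, 0).

```python
import numpy as np

def phi(x, z):
    # The function plotted in the figure: convex in x, concave in z.
    return 0.5 * (x**2 + 2 * x * z - z**2)

x_star, z_star = 0.0, 0.0
grid = np.linspace(-3.0, 3.0, 121)

# Check phi(x*, z) <= phi(x*, z*) <= phi(x, z*) for all grid points, cf. Eq. (1.24).
left  = all(phi(x_star, z) <= phi(x_star, z_star) + 1e-12 for z in grid)
right = all(phi(x_star, z_star) <= phi(x, z_star) + 1e-12 for x in grid)
print(left and right)                                     # -> True

# The two minimax values of Eq. (1.17) also coincide (approximately, on the grid).
print(max(min(phi(x, z) for x in grid) for z in grid))    # sup_z inf_x phi, about 0
print(min(max(phi(x, z) for z in grid) for x in grid))    # inf_x sup_z phi, about 0
```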
