Contents

Introduction  v

Chapter 1. Real Vector Spaces  1
  1.1. Linear and Affine Spaces  1
  1.2. Maps and Matrices  4
  1.3. Inner Products and Norms  7
  1.4. Continuous and Differentiable Functions  11

Chapter 2. Linear Equations and Linear Inequalities  23
  2.1. Gaussian Elimination  23
  2.2. Orthogonal Projection and Least Square Approximation  34
  2.3. Integer Solutions of Linear Equations  42
  2.4. Linear Inequalities  47

Chapter 3. Polyhedra  57
  3.1. Polyhedral Cones and Polytopes  57
  3.2. Cone Duality  61
  3.3. Polar Duality of Convex Sets  64
  3.4. Faces  66
  3.5. Vertices and Polytopes  70
  3.6. Rational Polyhedra  73

Chapter 4. Linear Programs and the Simplex Method  75
  4.1. Linear Programs  75
  4.2. The Simplex Method  84
  4.3. Sensitivity Analysis  95
  4.4. The Primal-Dual Simplex Method  97

Chapter 5. Lagrangian Duality  101
  5.1. Lagrangian Relaxation  101
  5.2. Lagrangian Duality  104
  5.3. Cone Duality  109
  5.4. Optimality Conditions  111

Chapter 6. An Interior Point Algorithm for Linear Programs  115
  6.1. A Primal-Dual Relaxation  115
  6.2. An Interior Point Method  116
  6.3. Constructing Good Basic Solutions  121
  6.4. Initialization and Rational Linear Programs  123
  6.5. The Barrier Point of View  126

Chapter 7. Network Flows  129
  7.1. Graphs  129
  7.2. Shortest Paths  136
  7.3. Maximum Flow  145
  7.4. Min Cost Flows  150

Chapter 8. Complexity  159
  8.1. Problems and Input Sizes  159
  8.2. Algorithms and Running Times  165
  8.3. Complexity of Problems  171
  8.4. Arithmetic Complexity  179

Chapter 9. Integer Programming  181
  9.1. Formulating an Integer Program  181
  9.2. Cutting Planes I  184
  9.3. Cutting Planes II  190
  9.4. Branch and Bound  193
  9.5. Lagrangian Relaxation  195
  9.6. Dualizing the Binary Constraints  202

Chapter 10. Convex Sets and Convex Functions  205
  10.1. Convex Sets  205
  10.2. Convex Functions  212
  10.3. Convex Minimization Problems  221
  10.4. Quadratic Programming  225
  10.5. The Ellipsoid Method  232
  10.6. Applications of the Ellipsoid Method  238

Chapter 11. Unconstrained Optimization  251
  11.1. Optimality Conditions  252
  11.2. Descent Methods, Steepest Descent  255
  11.3. Conjugate Directions  258
  11.4. Line Search  263
  11.5. Newton's Method  266
  11.6. The Gauss-Newton Method  271
  11.7. Quasi-Newton Methods  272
  11.8. Minimization of Nondifferentiable Functions  277

Chapter 12. Constrained Nonlinear Optimization  283
  12.1. First Order Optimality Conditions  284
  12.2. Primal Methods  295
  12.3. Second Order Optimality Conditions and Sensitivity  303
  12.4. Penalty and Barrier Methods  308
  12.5. Kuhn-Tucker Methods  314
  12.6. Semi-infinite and Semidefinite Programs  321

List of frequently used Symbols  335
Bibliography  337
Index  341

Introduction

The goal of Mathematical Programming is the design of mathematical solution methods for optimization problems. These methods should be algorithmic in the sense that they can be converted into computer programs without much effort. The main concern, however, is not the eventual concrete implementation of an algorithm but its necessary prerequisite: the exhibition of a solution strategy that hopefully makes the ensuing algorithm efficient in practice. Mathematical programming thus offers an approach to the theory of mathematical optimization that is very much motivated by the question of whether certain parameters (solutions of an optimization problem or eigenvalues of a matrix) not only exist in an abstract way but can actually be computed well enough to satisfy practical needs.

Mathematical optimization traditionally decomposes into three seemingly rather disjoint areas: discrete (or combinatorial) optimization, linear optimization and nonlinear optimization. Yet a closer look reveals a different picture. Efficiently solvable discrete optimization problems are typically those that can be cast into the framework of linear optimization. And, as a rule of thumb, nonlinear problems are solved by repeated linear (or quadratic) approximation.

The dominant role of linearity in optimization is not surprising. It has long been known that much of the structural analysis of mathematical optimization can be carried out in the language of vector spaces (see, for example, Luenberger's elegant classic treatment [55]). Moreover, it appears to be an empirical fact that not only can computations in linear algebra be performed efficiently in practice but that, indeed, efficient numerical computation is tantamount to reducing the computational task as far as possible to linear algebra.

The present book aims to introduce the reader to the fundamental algorithmic techniques in mathematical programming, with a strong emphasis on the central position of linear algebra both in the structural analysis and in the computational procedures. Although an optimization problem often admits a geometric picture involving sets of points in Euclidean space, which may guide the intuition in the structural analysis, we stress the role of the presentation of a problem in terms of explicit functions that encode the set of admissible solutions and the quantity to be optimized. The presentation is crucial for the design of a solution method and for its efficiency.

The book attempts to be as self-contained as possible. Only basic knowledge about (real) vector spaces and differentiable functions is assumed at the outset. Chapter 1 reviews this material, providing proofs of facts that may not (yet) be so familiar to the reader. We really begin in Chapter 2, which introduces the fundamental techniques of numerical linear algebra we will rely on later. Chapter 3 provides the corresponding geometric point of view. Then linear programs are treated. Having linear programming techniques at our disposal, we investigate discrete optimization problems and discuss theories for analyzing their complexity with respect to their solvability by efficient algorithms. Nonlinear programs proper are presented in the last three chapters. Convex minimization problems occupy a position between linear and nonlinear structures: while the feasible sets of linear programs are finite intersections of half-spaces, convex problems may be formulated with respect to infinite intersections of half-spaces. Convex optimization problems mark the border of efficient solvability. For example, quadratic optimization problems turn out to be efficiently solvable if and only if they are convex.

The book contains many items marked "Ex." These items are intended to provide both examples and exercises, to which details of proofs and additional observations are also deferred. They are meant to be an integral part of the presentation of the material. We cordially invite the interested reader to test his or her understanding of the text by working them out in detail.

CHAPTER 1

Real Vector Spaces

This chapter is a brief review of the basic notions and facts from linear algebra and analysis that we will use as tools in mathematical programming. The reader is assumed to be already familiar with most of the material in this chapter. The proofs we sketch here (as well as the exercises) are mainly meant as reminders.

1.1. Linear and Affine Spaces

We discuss mathematical programming problems to a very large extent within the model of vector spaces V (or W, etc.) over the field R of real numbers. The fundamental operations that define the structure of a vector space are: adding two vectors and multiplying a vector by a scalar (here: a real number λ ∈ R). So we can carry out algebraic manipulations taking advantage of the following properties:

If v, w are elements of the vector space V, then the sum z = v + w is an element of V.
If v ∈ V is a vector in V and λ ∈ R a scalar, then z = λv is a vector in V.
The order in which vectors are added is irrelevant: v + w = w + v.

These properties reflect the elementary operations to which all(!) computations in linear algebra reduce. We note that the vector 0 = 0·v ∈ V is uniquely determined (and independent of the particular choice of v ∈ V).

There are two especially important examples of vector spaces in mathematical programming:

(1) V = R^n, the vector space of all n-tuples x = (x_1, ..., x_n)^T of real numbers x_j ∈ R. Addition and scalar multiplication of vectors in R^n is carried out componentwise.
(2) V = R^{m×n}, the vector space of all (m×n)-matrices A = [a_ij] of real numbers a_ij, where i = 1, ..., m and j = 1, ..., n. Addition and scalar multiplication is again componentwise.

REMARK. Having the full field R of real numbers available as field of scalars is important mainly when we deal with limits (as they are implicit in the notions of continuous or differentiable functions), or when we want to take square roots. In most other cases, it would suffice to restrict the scalars to the subfield Q ⊆ R of rational numbers. Indeed, the numerical parameters one usually encounters and deals with in computational practice are rational numbers. For convenience, we will usually assume R as the underlying field

of scalars. While theoretical aspects of vector spaces are often fruitfully studied admitting scalars from the field C of complex numbers, this generality is less natural in mathematical programming, where pairs of scalars often must be compared with respect to the ordering x ≤ y of the real numbers (which cannot be extended to the complex numbers in an algebraically reasonable fashion).

We usually think of a vector x ∈ R^n as a column vector, i.e., as an (n×1)-matrix

x = (x_1, ..., x_n)^T,

which we often abbreviate as x = (x_j). The transpose of x is the corresponding row vector x^T = (x_1, ..., x_n).

A matrix A ∈ R^{m×n} can be thought of either as an ordered set of n column vectors A_j = (a_1j, ..., a_mj)^T ∈ R^m or as an ordered set of m row vectors A_i = (a_i1, ..., a_in) of length n. Alternatively, a matrix A ∈ R^{m×n} can be viewed as a vector with m·n components a_ij, i.e., as an element of R^{mn}.

Of particular importance is the identity matrix I = [e_ij] ∈ R^{n×n}, defined by

e_ij = 1 if i = j,  and  e_ij = 0 if i ≠ j.

The n column vectors e_1, ..., e_n of I are the so-called unit vectors of R^n.

A linear subspace of the vector space V is a (non-empty) subset W ⊆ V that is a vector space in its own right (relative to the addition and scalar multiplication in V). We then say that W is closed with respect to the operations of adding vectors and multiplying them by scalars. Because the intersection of linear subspaces is again a linear subspace, we observe:

For every subset S of the vector space V, there exists a unique smallest linear subspace span S, called the span (or linear hull) of S, that contains S.

The vector z ∈ V is a linear combination of the vectors v_1, ..., v_k if there are scalars λ_1, ..., λ_k ∈ R such that z = λ_1 v_1 + ... + λ_k v_k. Note that every linear combination results from a finite sequence of elementary operations:

z_1 = λ_1 v_1,  z_2 = z_1 + λ_2 v_2,  ...,  z_k = z_{k-1} + λ_k v_k.

EX. 1.1. Show: span S consists of all (finite) linear combinations of vectors in S.

If v_1, ..., v_n are the column vectors of the matrix A ∈ R^{m×n}, and x = (x_1, ..., x_n)^T is a vector of coefficients x_j, we describe the resulting linear combination in matrix notation:

Ax = x_1 v_1 + ... + x_n v_n ∈ R^m.
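The identity Ax = x_1 v_1 + ... + x_n v_n can be checked directly on a small example. The following sketch (in Python with NumPy, not part of the original text; the concrete matrix and coefficients are chosen arbitrarily for illustration) forms the linear combination of the columns by hand and compares it with the matrix-vector product:

    import numpy as np

    # Columns of A are the vectors v_1, v_2, v_3 in R^2; x holds the coefficients.
    A = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0]])
    x = np.array([3.0, -1.0, 2.0])

    # Ax is exactly the linear combination x_1*v_1 + x_2*v_2 + x_3*v_3.
    combo = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
    assert np.allclose(A @ x, combo)

    # The column space col A = span{v_1, v_2, v_3} has dimension rank A.
    print(np.linalg.matrix_rank(A))   # prints 2, so col A = R^2 here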

In this case,

span{v_1, ..., v_n} = {Ax ∈ R^m | x ∈ R^n} = col A

is called the column space of A. Similarly, the linear hull of the row vectors of A is the row space row A. Interchanging the rows and columns of A (i.e., passing from A = [a_ij] to its transpose A^T = [a^T_ij], where a^T_ij = a_ji), we have

row A = col A^T.

A non-empty subset S ⊆ V of vectors is called (linearly) independent if no vector s ∈ S can be expressed as a linear combination of vectors in S \ {s}.

EX. 1.2. Show: {v_1, ..., v_k} is independent if and only if λ_1 v_1 + ... + λ_k v_k = 0 implies λ_i = 0 for all i = 1, ..., k.

A minimal subset B ⊆ V such that V = span B is a basis of V. Note that a nonempty basis is necessarily an independent set. The dimension of V is the number of elements in a basis B:

dim V = |B|.

From linear algebra, we know that any two finite bases of a vector space have the same cardinality. This fact implies that dim V is well-defined and equals the maximal size of an independent subset of V. (There are many interesting infinite-dimensional vector spaces. For the applications in mathematical programming, however, we will always be concerned with finite-dimensional vector spaces.)

A finite-dimensional vector space V with basis B = {v_1, ..., v_n} can be identified with R^n in the following way. Because B is linearly independent, each vector v ∈ V has a unique representation

v = x_1 v_1 + ... + x_n v_n,  where x = (x_1, ..., x_n)^T ∈ R^n.

So we can represent the element v ∈ V by the vector x ∈ R^n of its coordinates with respect to B. It is not difficult to verify that this representation is compatible with the operations of adding vectors and multiplying vectors by scalars λ ∈ R.

REMARK. The identification of V with R^n depends, of course, on the basis B we choose for the representation. Different bases lead to different representations. In the case V = R^n, the unit vectors e_1, ..., e_n ∈ R^n form the so-called standard basis. With respect to the latter, a vector x = (x_1, ..., x_n)^T ∈ R^n has the representation x = x_1 e_1 + ... + x_n e_n.

An affine subspace (also called a linear variety) of the vector space V is a subset L ⊆ V such that either L = ∅ or there exists a vector p ∈ V and a linear subspace W ⊆ V with the property

L = p + W = {p + w | w ∈ W}.
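Computing the coordinate vector of v with respect to a basis B = {v_1, ..., v_n} of R^n amounts to solving the linear system whose coefficient matrix has the basis vectors as columns. A minimal sketch (NumPy; the basis and the vector below are made up for illustration):

    import numpy as np

    # Basis vectors of R^3 as the columns of B (they are linearly independent).
    B = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0]])
    v = np.array([2.0, 3.0, 1.0])

    # Coordinates x of v w.r.t. this basis: solve x_1 v_1 + x_2 v_2 + x_3 v_3 = v.
    x = np.linalg.solve(B, v)
    assert np.allclose(B @ x, v)
    print(x)   # the (unique) coordinate vector of v relative to the basis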

It is easy to verify that the affine subspace L = p + W equals L' = p' + W' if and only if p' ∈ L and W = W'. We define the affine dimension of L as

dim L = dim W if L ≠ ∅,  and  dim L = -1 if L = ∅.

An affine subspace of dimension 0 is called a point (and is identified with the element it contains). A line is an affine subspace of dimension 1. A plane is an affine subspace of dimension 2. An affine subspace H of V of dimension dim H = dim V - 1 is called a hyperplane of V.

EX. 1.3. Show: (a) The linear subspaces are exactly the affine subspaces L with 0 ∈ L. (b) The intersection of affine subspaces is an affine subspace.

A linear combination z = λ_1 v_1 + ... + λ_k v_k is said to be an affine combination of v_1, ..., v_k if the scalars λ_i ∈ R satisfy the condition Σ_{i=1}^k λ_i = 1.

EX. 1.4. Show that the subset L ⊆ V is an affine subspace if and only if L contains all (finite) affine combinations of elements of L.

For an arbitrary subset S ⊆ V, we denote by aff S the set of all finite affine combinations of elements in S and call it the affine hull (or affine span) of S. By Ex. 1.4, aff S is the smallest affine subspace that contains S.

1.2. Maps and Matrices

Assume that V and W are vector spaces with bases {v_1, ..., v_n} and {w_1, ..., w_m} respectively. Every v ∈ V can be uniquely expressed as a linear combination v = x_1 v_1 + ... + x_n v_n, with coefficients x_j ∈ R, and every w ∈ W as a linear combination w = y_1 w_1 + ... + y_m w_m, with coefficients y_i ∈ R.

The map f : V → W is said to be linear if f is compatible with the vector space operations. This means that we have for all u, v ∈ V and λ ∈ R:

f(u + v) = f(u) + f(v)  and  f(λu) = λ f(u).

It follows that the linear map f is completely determined by the images of the basis vectors v_j and hence by the coefficients a_ij that describe these images in terms of the basis vectors w_i:

f(v_j) = a_1j w_1 + a_2j w_2 + ... + a_mj w_m.

So the map f is encoded by the (m×n)-matrix A = [a_ij]: If x is the vector of coefficients of the vector v ∈ V, then y = Ax is the vector of coefficients of the image f(v) ∈ W. In that sense, a linear map corresponds to a matrix (that also depends on our choice of bases for V and W!). Conversely, we can construct a

1.2. MAPS AND MATRICES 5 linear map f from a given (m n)-matrix A as follows. For every basis vector v j, we define f v j = a 1 j w 1 + a 2 j w 2 + + a mj w m Because an arbitrary vector v V can be uniquely expressed as a linear combination v = x 1 v 1 + + x n v n, we obtain the well-defined extension f v = x 1 f v 1 + + x n f v n The composition of two (or more) linear maps corresponds to the product of their associated matrices. If f : V W and g : W Z are linear maps represented by A R m n resp. B R k m (with respect to fixed bases in V W and Z), then also g f : V Z where g f u = g f u is a linear map. The matrix C = c ij R k n describing g f is the product C = BA, i.e., the elements c ij are the inner products (cf. Section 1.3.1) of the rows of B with the columns of A: m c ij = B i A j = b il a lj REMARK. The product matrix C = BA admits two dual points of view: m m C j = BA j = a lj B l and C i = B i A = b il A l l=1 That is: The jth column C j is the linear combination of the columns of B according to the coefficients of the jth column of A. Dually, the ith row C i is the linear combination of the rows of A according to the coefficients of the ith row of B. Hence l=1 l=1 row BA row A and col BA col B The Kernel. If f : V W is a linear map and L V is a linear space, then f L W is a linear subspace of W. Similarly, if L W is a linear space of W, then f 1 L = {v V f v L} is a subspace of V. In particular, the kernel and the image of f, defined by ker f = {v V f v = 0} V im f = f V W are subspaces of V resp. W. Their dimensions are related via dim ker f + dim im f = dim V (To see this, take a basis of ker f, extend it to a basis of V and verify that the extension is mapped to a basis of im f.) Note that we may usually (or without loss of generality henceforth abbreviated: l o g ) assume im f = W, i.e., f is surjective (otherwise we simply replace W by W = im f in our reasoning).
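The correspondence between composition of linear maps and matrix multiplication, and the dimension formula dim ker f + dim im f = dim V, can both be illustrated numerically. A small sketch (NumPy; the random matrices merely stand in for arbitrary linear maps and are not taken from the text):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))   # represents f : R^3 -> R^4
    B = rng.standard_normal((2, 4))   # represents g : R^4 -> R^2
    x = rng.standard_normal(3)

    # Composition of linear maps corresponds to the matrix product: g(f(x)) = (BA)x.
    assert np.allclose(B @ (A @ x), (B @ A) @ x)

    # Dimension formula: dim ker f + dim im f = dim R^3, with dim im f = rank A.
    rank = np.linalg.matrix_rank(A)
    print(rank, A.shape[1] - rank)    # rank and kernel dimension add up to 3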

6 1. REAL VECTOR SPACES EX. 1.5. Use the dimension formula to show that a linear map f : R n R n is surjective (i.e., f R n = R n ) if and only if it is injective (i.e., f v 1 = f v 2 implies v 1 = v 2 ). If f : R n R m is represented by A R m n with respect to the standard bases in R n and R n, then f e j = a 1 j e 1 + + a mj e m = A j Hence the vector x = x 1 e 1 + + x n e n R n is mapped to f x = x 1 f e 1 + + x n f e n = x 1 A 1 + + x n A n = Ax and we have im f = col A. The kernel of f x = Ax is also referred to as the kernel of A, i.e., ker A = {x R n Ax = 0} Now suppose that f x = Ax is surjective, i.e., col A = R m. Then A must contain m columns, say A 1 A m that form a basis of R m. Let g : R m R m be the (unique) linear map with the property g A j = e j j = 1 m. (Note that the first e j denotes here the jth unit vector in R n, while the second e j is the jth unit vector in R m.) So g f e j = e j holds for j = 1 m. Therefore, if the matrix B R m m represents g, i.e., if g y = By, we have BA e j = e j for j = 1 m and conclude that BA R m n must be of the form BA = [I N], where I R m m is the identity matrix and N R m n m. EX. 1.6. Verify that BA = [I N] is the matrix that represents f with respect to the standard basis in R n and the basis {A 1 A m } in R m. In the special case m = n, the composition g f = id yields the identity map and BA = I R n n. The matrix B R n n is then the so-called inverse of A R n n, denoted by A 1. So A 1 A = I. Moreover, f g A j = f e j = A j implies that also f g = id and, therefore, AA 1 = I holds. In the general case m n, we observe that g : R m R m (defined by g A j = e j ) has an inverse g 1 (defined by g 1 e j = A j ). In particular, B 1 R m m exists and we have Ax = 0 BAx = 0 and BA x = 0 Ax = B 1 BA x = 0 i.e., ker BA = ker A. REMARK (ELEMENTARY ROW OPERATIONS). By an elementary row operation on the matrix A R m n we understand one of the operations: Addition of one row to another row; Multiplication of one row by a non-zero scalar; Interchange of two rows;

while keeping the other rows fixed. So elementary row operations consist of elementary operations on the row vectors of A. Hence an elementary row operation can be described by multiplying A from the left with a matrix B ∈ R^{m×m} (see Ex. 1.7). Because an elementary row operation can be reversed by a similar operation, B is invertible. Moreover,

row A = row B^{-1}(BA) ⊆ row BA ⊆ row A

shows row BA = row A. In other words: the row space of A is invariant under elementary row operations.

EX. 1.7. Design the matrix B ∈ R^{5×5} with the property that BA arises from A by subtracting 17 times the first row of A from the fourth row.

Affine Maps. Restricting ourselves directly to V = R^n and W = R^m and their standard bases, we define an affine map f : R^n → R^m as a map of the form

f(x) = Ax - b  with A ∈ R^{m×n}, b ∈ R^m.

Clearly, the image im f of an affine map is an affine subspace of R^m and, conversely, the inverse image f^{-1}(L) = {x ∈ R^n | f(x) ∈ L} of an affine subspace L ⊆ R^m is an affine subspace of R^n. In particular,

f^{-1}({0}) = {x ∈ R^n | f(x) = 0}

is an affine subspace of R^n. In case f^{-1}({0}) is non-empty, we may choose an arbitrary element p ∈ f^{-1}({0}) and obtain the representation

f^{-1}({0}) = p + ker A = {p + x | x ∈ ker A}.

If n = m and f(x) = Ax - b is invertible (i.e., A^{-1} exists), f is called an affine transformation.

1.3. Inner Products and Norms

1.3.1. Inner Products. An inner product on R^n is a map ⟨·, ·⟩ : R^n × R^n → R such that for all vectors x, y, z ∈ R^n and scalars λ ∈ R,

(1.1) ⟨x, y⟩ = ⟨y, x⟩
(1.2) ⟨λx, y⟩ = λ⟨x, y⟩
(1.3) ⟨x + y, z⟩ = ⟨x, z⟩ + ⟨y, z⟩
(1.4) ⟨x, x⟩ > 0 if x ≠ 0.

By (1.1)-(1.3), an inner product is a symmetric bilinear form and, by (1.4), positive definite. We will usually be concerned with the so-called standard inner product, which, for x = (x_1, ..., x_n)^T and y = (y_1, ..., y_n)^T, is a special case of the matrix product:

⟨x, y⟩ = x^T y = Σ_{j=1}^n x_j y_j.

EX. 1.8. Show that ⟨x, y⟩ = x^T y defines an inner product on R^n.

It is also useful to consider the standard inner product in the vector space R^{m×n} of (m×n)-matrices. If we think of two matrices A = [a_ij] and B = [b_ij] in R^{m×n} as two vectors of length mn in R^{mn}, their inner product should be the sum of the componentwise products. We will denote this inner product by

⟨A, B⟩ = Σ_{i=1}^m Σ_{j=1}^n a_ij b_ij.

REMARK. Defining the trace of a matrix C = [c_ij] ∈ R^{n×n} as the sum of all diagonal elements,

tr C = Σ_i c_ii,

one obtains ⟨A, B⟩ = tr(A^T B) (see Ex. 1.9).

EX. 1.9. Show: ⟨A, B⟩ = tr(A^T B), ⟨A, B⟩ = ⟨B, A⟩, ⟨A, B⟩ = ⟨I, A^T B⟩. Give an example of matrices A, B ∈ R^{n×n} such that AB ≠ BA.

From the fundamental properties of the inner product, we can derive the inequality of Cauchy-Schwarz:

LEMMA 1.1 (Cauchy-Schwarz). Let ⟨·, ·⟩ be an inner product on R^n. Then all vectors x, y ∈ R^n \ {0} satisfy the inequality

⟨x, y⟩^2 ≤ ⟨x, x⟩ ⟨y, y⟩.

Equality holds if and only if x is a scalar multiple of y.

Proof. We may assume ⟨x, x⟩ = ⟨y, y⟩ = 1 w.l.o.g. (otherwise we could scale, say, x with a nonzero scalar λ so that ⟨λx, λx⟩ = 1, which would just multiply both sides of the inequality by λ^2). We then find

0 ≤ ⟨x - y, x - y⟩ = ⟨x, x⟩ - 2⟨x, y⟩ + ⟨y, y⟩ = 2 - 2⟨x, y⟩,
0 ≤ ⟨x + y, x + y⟩ = ⟨x, x⟩ + 2⟨x, y⟩ + ⟨y, y⟩ = 2 + 2⟨x, y⟩.

So |⟨x, y⟩| ≤ 1 = ⟨x, x⟩⟨y, y⟩. By property (1.4) of the inner product, equality can only hold if x - y = 0 or x + y = 0, i.e., if x = y or x = -y. The claim follows.

Inner Products and Positive Definite Matrices. Let ⟨·, ·⟩ : R^n × R^n → R be an inner product on R^n, and fix a basis B = {v_1, ..., v_n} of R^n. Relative to B, we can describe the inner product via the Gram matrix of inner products

G = G(v_1, ..., v_n) = [g_ij] = [⟨v_i, v_j⟩].

The symmetry of the inner product means that G is a symmetric matrix, i.e., G = G^T (or, equivalently, g_ij = g_ji for all i, j). If x = (x_1, ..., x_n)^T and y = (y_1, ..., y_n)^T are the n-tuples of scalars of the linear combinations u = x_1 v_1 + ... + x_n v_n and w = y_1 v_1 + ... + y_n v_n,

then the bilinearity of the inner product yields

(1.5) ⟨u, w⟩ = x^T G y = Σ_{i=1}^n Σ_{j=1}^n g_ij x_i y_j.

Moreover, if x ≠ 0, then u ≠ 0 and x^T G x = ⟨u, u⟩ > 0.

Conversely, let us call the symmetric matrix Q = [q_ij] ∈ R^{n×n} positive definite if x^T Q x > 0 holds for all x ≠ 0. Then Q gives rise to an inner product ⟨x, y⟩ = x^T Q y (with Q as its Gram matrix with respect to the standard basis in R^n).

REMARK. Our discussion exhibits inner products of finite-dimensional vector spaces and positive definite matrices as manifestations of the same mathematical phenomenon: The positive definite matrices are exactly the Gram matrices of inner products. In particular, the standard inner product on R^n has the (n×n)-identity matrix I as its Gram matrix relative to the standard basis {e_1, ..., e_n}.

Inner Products and Orthogonal Matrices. Given an inner product ⟨·, ·⟩ : R^n × R^n → R, we say that the vectors x, y ∈ R^n are orthogonal if ⟨x, y⟩ = 0. (See also Ex. 1.13.) A system of vectors {v_1, ..., v_k} in R^n is called orthonormal provided

⟨v_i, v_j⟩ = 1 if i = j,  and  ⟨v_i, v_j⟩ = 0 if i ≠ j.

Consider now R^n with respect to the standard inner product ⟨x, y⟩ = x^T y. The matrix A ∈ R^{n×n} is called orthogonal if A^T A = I (i.e., if A^{-1} = A^T). This property means that the column vectors A_1, ..., A_n of A satisfy

A_i^T A_j = (Ae_i)^T (Ae_j) = e_i^T e_j = 1 if i = j, and 0 if i ≠ j,

i.e., the columns A_1, ..., A_n (and then also the rows of A) form an orthonormal basis of R^n.

EX. 1.10. Show that A ∈ R^{n×n} is orthogonal if and only if x^T y = (Ax)^T (Ay) holds for all x, y ∈ R^n. Prove that the vectors of an orthonormal system are linearly independent.

1.3.2. Norms. We can speak reasonably about the length of a vector in R^n only relative to a given norm on R^n, that is, a map ‖·‖ : R^n → R such that for all vectors x, y ∈ R^n and scalars λ ∈ R,

(1.6) ‖x‖ > 0 for x ≠ 0;
(1.7) ‖λx‖ = |λ| ‖x‖;
(1.8) ‖x + y‖ ≤ ‖x‖ + ‖y‖.

Inequality (1.8) is the triangle inequality.
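The correspondence between positive definite matrices and inner products is easy to experiment with. The following sketch (NumPy; the particular matrix Q is just an example) checks the inner product axioms for ⟨x, y⟩ = x^T Q y and recovers Q as the Gram matrix of the standard basis:

    import numpy as np

    Q = np.array([[2.0, 1.0],
                  [1.0, 3.0]])          # symmetric and positive definite

    def ip(x, y):
        return x @ Q @ y                # <x, y> = x^T Q y

    x = np.array([1.0, -2.0])
    y = np.array([0.5, 4.0])

    assert np.isclose(ip(x, y), ip(y, x))              # symmetry (1.1)
    assert np.isclose(ip(3.0 * x, y), 3.0 * ip(x, y))  # homogeneity (1.2)
    assert ip(x, x) > 0                                # positive definiteness (1.4)

    # Q is the Gram matrix of this inner product w.r.t. the standard basis.
    e = np.eye(2)
    G = np.array([[ip(e[i], e[j]) for j in range(2)] for i in range(2)])
    assert np.allclose(G, Q)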

Every inner product ⟨·, ·⟩ on R^n gives rise to a norm (see Ex. 1.11) via

‖x‖ = √⟨x, x⟩.

The norm arising from the standard inner product on R^n is the Euclidean norm

‖x‖ = √(x^T x) = √(x_1^2 + ... + x_n^2).

REMARK. R^n also admits norms that do not arise from inner products (see Ex. 1.14). In case of ambiguity, the Euclidean norm is denoted by ‖x‖_2.

EX. 1.11. Let ⟨·, ·⟩ be an inner product on R^n. (a) Use the inequality of Cauchy-Schwarz to show that ‖x‖ = √⟨x, x⟩ defines a norm. (b) Show that the norm defined in (a) satisfies the so-called parallelogram equality:

‖x + y‖^2 + ‖x - y‖^2 = 2‖x‖^2 + 2‖y‖^2  for all x, y ∈ R^n.

(c) Let ‖·‖ be a norm that satisfies the parallelogram equality. Show that

⟨x, y⟩ = (1/4)(‖x + y‖^2 - ‖x - y‖^2)

defines an inner product. Hint for (c): Verify the claim first for vectors with components in Z and Q. Deduce then the statement for R from the continuity of the norm (cf. Section 1.4.2 below).

Extending the Euclidean norm of vectors to matrices A = [a_ij] ∈ R^{m×n}, we obtain the Frobenius norm:

(1.9) ‖A‖_F = √⟨A, A⟩ = √(Σ_{i,j} a_ij^2).

EX. 1.12. Show for every A ∈ R^{m×n} and x ∈ R^n: ‖Ax‖_2 ≤ ‖A‖_F ‖x‖_2.

EX. 1.13 ("Theorem of Pythagoras"). Let ⟨·, ·⟩ be an inner product on R^n with associated norm ‖x‖ = √⟨x, x⟩. Say that the vector a is perpendicular to the vector b if the distance ‖a - b‖ from a to b is the same as the distance ‖a - (-b)‖ from a to -b. Show:

(1.10) ‖a - b‖ = ‖a + b‖ ⟺ ⟨a, b⟩ = 0 ⟺ ‖a‖^2 + ‖b‖^2 = ‖a - b‖^2.

EX. 1.14. Define for the vector x = (x_1, ..., x_n)^T ∈ R^n,

‖x‖_1 = |x_1| + ... + |x_n|  ("sum norm")
‖x‖_∞ = max_{1≤i≤n} |x_i|  ("maximum norm").

Show that both ‖·‖_1 and ‖·‖_∞ are norms but do not arise from inner products. (Hint: use Ex. 1.11(b).)
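A quick numerical check of Ex. 1.12 and of the parallelogram equality, which holds for the Euclidean norm but generically fails for the sum norm (the idea behind Ex. 1.14). A sketch in NumPy with arbitrary test data:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 4))
    x = rng.standard_normal(4)
    y = rng.standard_normal(4)

    # Ex. 1.12: ||Ax||_2 <= ||A||_F ||x||_2
    assert np.linalg.norm(A @ x) <= np.linalg.norm(A, 'fro') * np.linalg.norm(x)

    # Parallelogram equality for the Euclidean norm ...
    lhs = np.linalg.norm(x + y)**2 + np.linalg.norm(x - y)**2
    rhs = 2 * np.linalg.norm(x)**2 + 2 * np.linalg.norm(y)**2
    assert np.isclose(lhs, rhs)

    # ... but typically not for the sum norm, so ||.||_1 comes from no inner product.
    lhs1 = np.linalg.norm(x + y, 1)**2 + np.linalg.norm(x - y, 1)**2
    rhs1 = 2 * np.linalg.norm(x, 1)**2 + 2 * np.linalg.norm(y, 1)**2
    print(abs(lhs1 - rhs1))   # generically nonzero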

1.4. CONTINUOUS AND DIFFERENTIABLE FUNCTIONS 11 1.4. Continuous and Differentiable Functions 1.4.1. Topology of R n. A norm on the vector space R n allows us to measure the distance between vectors and, therefore, to specify neighborhoods etc. We will investigate these concepts with respect to the Euclidean norm and denote it simply = 2 (unless explicitly specified otherwise). For two vectors x y R n we define their (Euclidean) distance as x y. In the case n = 1, of course, we also use the familiar notation of the absolute value x y. (For a general in-depth treatment of the analysis in R n and further details we refer, e.g. to Rudin [69]). The set of real numbers R has (by definition) the following so-called completeness property : A non-decreasing infinite sequence of real numbers r 1 r 2 has a (unique) limit r = lim k r k R if and only if the exists a bound M R such that r k M holds for all k. As a consequence of this property (cf. Ex.1.15), every subset S R has a unique infimum, which is defined as the largest lower bound for all s S and denoted by inf S. (If S is not bounded from below, then inf S = and if S =, then inf S = +.) The supremum sup S is defined similarly. EX. 1.15. Let S R be non-empty and s 1 R be such that s 1 s for all s S. Define a sequence s 1 s 2 of lower bounds for S as follows. Given s k, we set s k+1 = { s k + 1 k if s k + 1 k s for all s S s k otherwise Show: s = lim k s k = inf S. (Hint: Use k=1 1 k =.) More generally, we say that a sequence s 1 s 2 of points s k R n converges if the exists some s R n such that lim s s k = 0 k which is denoted by s = lim k s k or just s k s. A subset S R n is bounded if there exists some M R such that s M holds for all s S. The completeness property of R then implies the following (cf. Ex. 1.16): S R n is bounded if and only if every infinite sequence s 1 s 2 of points s k S admits a convergent subsequence s k1 s k2. We refer to the limit s = lim i s ki of a convergent subsequence s ki as an accumulation point of S (or the sequence s k ). EX. 1.16. Show: S R n is bounded if and only if every infinite sequence s k in S admits a convergent subsequence s ki. (Hint: Assume that S is contained in an n-dimensional cube Q 0 = [ M M] n. Subdivide Q 0 into 2 n smaller cubes. One of these, say Q 1, contains infinitely many s k. Let s k1 be the first s k that is contained in Q 1 and proceed by subdividing Q 1.)

An open ball (of radius r > 0 and centered at x_0 ∈ R^n) is a subset of the form

U_r(x_0) = {x ∈ R^n | ‖x - x_0‖ < r}.

A set U ⊆ R^n is open if U contains with any x_0 also some open ball U_r(x_0). (In other words: An open set is precisely the union of all the open balls it contains.) A subset C ⊆ R^n is closed if R^n \ C is open.

(Arbitrary) unions of open sets are open and, correspondingly, intersections of closed sets are closed. In particular, every S ⊆ R^n has a unique smallest closed set containing S, namely the intersection cl S of all closed sets containing S. cl S is the so-called closure of S. Similarly, every S ⊆ R^n admits a unique maximal open set int S contained in S, namely the union of all open balls contained in S, the interior of S. The boundary of S ⊆ R^n is defined as ∂S = cl S \ int S. Equivalently (cf. Ex. 1.17), ∂S consists of all points x ∈ R^n that are accumulation points of both S and R^n \ S.

EX. 1.17. Show: If x ∈ cl S \ int S, then every U_{1/k}(x), k = 1, 2, ..., intersects both S and R^n \ S. Let s_k ∈ U_{1/k}(x) ∩ S and s'_k ∈ U_{1/k}(x) \ S and observe s_k → x and s'_k → x.

EX. 1.18. Show: (i) Open intervals (a, b) are open subsets of R. (ii) An open line segment (a, b) = {(1 - λ)a + λb | 0 < λ < 1} (a, b ∈ R^n) is not open in R^n for n ≥ 2.

REMARK (RELATIVE TOPOLOGY). The topological concepts of R^n carry over to subsets of R^n. For our purposes, it suffices to consider affine subspaces L ⊆ R^n. If dim L = k, it is occasionally more natural to think of L as an isomorphic copy of R^k and define the relevant notions accordingly ("relative to L"). A relatively open ball is a set of the form U_r(x_0) ∩ L, x_0 ∈ L, and the relative interior of a subset S ⊆ L is the union of all the relatively open balls S contains.

A subset S ⊆ R^n is compact if it is bounded and closed. So S is compact if and only if every infinite sequence (s_k) in S has an accumulation point s ∈ S. (The existence of s is equivalent to the boundedness of S, and s ∈ S is equivalent to the closedness of S.) This observation is sometimes referred to as the Theorem of Bolzano-Weierstrass.

EX. 1.19. Let S ⊆ R be compact. Show that inf S and sup S belong to S.

1.4.2. Continuous Functions. Let S be a given subset of R^n. Then the function f : S → R^m is said to be continuous at the point x_0 ∈ S if for every ε > 0 there exists some δ > 0 so that ‖f(x_0) - f(x)‖_2 < ε holds whenever x ∈ S and ‖x_0 - x‖_2 < δ or, equivalently,

f(U_δ(x_0) ∩ S) ⊆ U_ε(f(x_0)).

We denote this property by

f(x_0) = lim_{x→x_0} f(x).

We say that f is continuous on S if f is continuous at every x_0 ∈ S. It is easily verified that sums and products of real-valued continuous functions are again continuous. Furthermore, compositions of continuous functions are continuous.

LEMMA 1.2. Let ‖·‖ : R^n → R be an arbitrary norm on R^n. Then f(x) = ‖x‖ is a continuous function.

Proof. We first consider x_0 = 0 and let M = n·max_j ‖e_j‖ (where e_j is the jth unit vector). Then

‖x‖ = ‖Σ_j x_j e_j‖ ≤ Σ_j |x_j| ‖e_j‖ ≤ n (max_j ‖e_j‖)(max_j |x_j|) ≤ M ‖x‖_2.

So ‖x‖_2 → 0 implies f(x) → 0 = f(0), i.e., f is continuous at x_0 = 0. The continuity of f in general now follows from the observation that x → x_0 is equivalent with x - x_0 → 0.

If C ⊆ R^n is compact and f : C → R^m is continuous, then the image f(C) is compact in R^m. For a proof of this fact, it suffices to show that every infinite sequence f(z_1), f(z_2), ... of images has an accumulation point f(z) ∈ f(C). Now C is compact. So the sequence z_1, z_2, ... has a subsequence z_{k_1}, z_{k_2}, ... and a point z ∈ C such that z_{k_i} → z. The continuity of f guarantees f(z_{k_i}) → f(z), which establishes f(z) as an accumulation point in f(C).

As a consequence, we obtain the following existence result, which is used repeatedly in the analysis of optimization problems.

THEOREM 1.1 (Weierstrass). Let C ⊆ R^n be a nonempty compact set and f : C → R be continuous. Then f attains its maximum and minimum value on C, i.e., there exist x_0, x_1 ∈ C with

f(x_0) ≤ f(x) ≤ f(x_1)  for all x ∈ C.

Proof. If f : C → R is continuous, then F = f(C) ⊆ R is compact. Hence inf F ∈ R and sup F ∈ R exist and belong to F. The Theorem follows.

REMARK. The unit sphere S := {x ∈ R^n | ‖x‖_2 = 1} is a compact set. If ‖·‖ is any norm on R^n, then in view of Lemma 1.2 both

m := min{‖x‖ | x ∈ S} > 0  and  M := max{‖x‖ | x ∈ S}

exist, showing that ‖·‖ is equivalent to the Euclidean norm ‖·‖_2 in the sense that

m ‖x‖_2 ≤ ‖x‖ ≤ M ‖x‖_2  for all x ∈ R^n.

As a consequence, the definition of continuity (as well as the notion of differentiability below) does not depend on the particular norm we use.

14 1. REAL VECTOR SPACES 1.4.3. Differentiable Functions. From a computational point of view, linear and affine functions are the easiest to deal with. Therefore, we are interested in the question when a not necessarily linear function can be, at least locally, approximated by a linear function. Again, we focus right directly on the vector spaces R n and R m, equipped with the standard inner product and the Euclidean norm. In each case, we choose as reference the standard basis of unit vectors e j = 0 1 0 T. Let U be an open subset in R n. The function f : U R m is said to be differentiable at the point x 0 U if there exists a matrix A R m n and a function ( : U R m such that lim h 0 ( h = 0 and f x 0 + h = f x 0 + Ah + h ( h for all x 0 + h U A shorter way of expressing these conditions is offered by the notation f x 0 + h = f x 0 + Ah + o h (Recall that o h k generally denotes a term of the form h k) h, where ) h is a function satisfying lim h 0 ) h = 0.) The definition says that the differentiable function f can be approximated near x 0 via the affine function f h = f x 0 + Ah The matrix A is called the derivative of f at x 0 and is generally denoted by f x 0 (and by f x 0 in case n = 1). We call f differentiable on U if f is differentiable at every x 0 U. The derivative f x 0 turns out to be nothing but the Jacobian matrix associated with f (see p. 17). EX. 1.20. Show that f : U R m is necessarily continuous at x 0 U provided f is differentiable at x 0. Derivatives of Functions in One Variable. The analysis of functions of several variables can often be reduced to the one-dimensional case. Therefore, we briefly review some basic and important facts for functions f in one real variable x (with f denoting the derivative). Let f be defined on the open interval a b R and differentiable at the point x 0 a b. The following is a key observation for many optimization problems. LEMMA 1.3. If f x 0 0, then there exists some 0 such that f x 0 h f x 0 f x 0 + h whenever 0 h Proof. Because f x 0 exists, we can write f x 0 + h = f x 0 + hf x 0 + h ( h with lim h 0 ( h = 0. Hence, if f x 0 0, there exists some 0 such that ( h f x 0 whenever h *

which implies the Lemma. 1.4. CONTINUOUS AND DIFFERENTIABLE FUNCTIONS 15 An immediate consequence of Lemma 1.3 is the extremum principle: If f : a b R is differentiable at x 0 a b and if either f x 0 = max x a+ b f x or f x 0 = min x a+ b f x, then. f x 0 = 0 REMARK. The extremum principle exhibits the critical equation f x = 0 as a necessary (and very useful) optimality condition for f x : When one searches for a maximizer or a minimizer of f x, one may restrict one s attention to the so-called critical points x 0 that satisfy the critical equation. THEOREM 1.2. (Mean Value Theorem) Let f : [a b] R be continuous on the closed interval [a b] and differentiable on the open interval a b. Then there exists some, a b such that f b f a = b a f -, Proof. Define g x = b a f x [ f b f a ]x and observe g a = g b. If g is constant on [a b], then every, a b has the claimed property 0 = g -, = b a f -, f b + f a Just like f, also g is differentiable on a b and continuous on [a b]. Hence, if g is not constant on the compact set [a b] R, there exists some, a b such that g -, = max g x a+ b x or g -, = min g x a+ b x In either case, the extremum principle yields g -, = 0 and the Theorem follows as before. EX. 1.21. Show: 1 + x e x for x R. As an application of the mean Value Theorem we obtain the second order Taylor formula. LEMMA 1.4. Let U = t 0 t 0 R and assume p : U R is twice differentiable. Then, for any given t U there exists 0 /.0 1, such that Proof. Assume l o g that t p t = p 0 + tp 0 + 1 2 t2 p -. t 0 and consider, for 1 R (to be defined below) the twice differentiable function g : [0 t] R defined by g 32 := p 42 [p 0 + 2 p 0 + 1 2 2 2 1 ]

16 1. REAL VECTOR SPACES Then g 0 = 0 g 0 = 0 and 1 can be chosen so that g t = 0. We are left to show that this choice yields 1 = p -. t for some 0.5 1. Since g 0 = g t = 0, the Mean Value Theorem yields some t 1 0 t with g t 1 = 0. Applying the Mean Value Theorem to g with g 0 = g t 1 = 0, we find some t 2 0 t 1 such that g t 2 = 0. Hence and the claim follows with. t = t 2. 0 = g t 2 = p t 2 1 Directional Derivatives and the Gradient. A function f : U R m assigns to each vector x U a vector f x = f 1 x f m x T R m It follows from the definition that f is differentiable at x 0 if and only if each of the real-valued component functions f i : U R is differentiable at x 0. So we may restrict our attention to the case m = 1. Consider the differentiable real-valued function f : U R at the point x 0 U and fix a direction d R n. Then f induces locally a function p d t = f x 0 + td of the real parameter t. Moreover, the representation f x 0 + td = f x 0 + t f x 0 d + td ( td immediately shows that the derivative p d 0 exists. In fact, letting ( t = d ( td, we have and hence p d 0 = f x 0 d. p d t = p d 0 + t f x 0 d + t ( t Choosing the direction d as a unit vector e j, we recognize what the derivative matrix f x 0 = a 1 a n R 1 n actually is and how it can be computed. Recall first that the partial derivative f67 x j of f at x 0 is defined to be the derivative p e j 0 of the function p e j t = f x 0 + te j, which implies a j = f x 0 e j = f x 0 x j Hence, the derivative of f : U R is given by f x 0 = and is called the gradient of f at x 0. [ f x0 x 1 f x 0 x n EX. 1.22. Let f : R 2 R be given by f x y = x y. Show f x y = y x. ]

1.4. CONTINUOUS AND DIFFERENTIABLE FUNCTIONS 17 In general, we note for p d t = f x 0 + td the formula for the directional derivative of f at x 0 with respect to d: (1.11) p d 0 = f x 0 d = n j=1 f x 0 x j d j Formula (1.11) allows us to determine the direction with respect to which f offers the largest marginal change at x 0. With p d t defined as before, we want to solve the optimization problem max p d R n d 0 subject to d = 1 We assume f x 0 0 T. Applying the Cauchy-Schwarz inequality to (1.11), we deduce p d 0 f x 0 d = f x 0 with equality if and only if d T = f x 0 for some R. Hence we find that the gradient f x 0 yields the direction d into which f exhibits the largest marginal change at x 0. Depending on the sign of, we obtain the direction of largest increase or largest decrease. Moreover, we have the general extremum principle: If x 0 U is a local minimizer or maximizer of f in the sense that for some 0 0 f x 0 = max f x U89 x 0 x or f x 0 = min f x U89 x 0 x the one-dimensional extremum principle says that 0 = p d 0 = f x 0 d must hold for all directions d, which implies that x 0 must be a critical point, i.e., satisfy the critical equation (1.12) f x 0 = 0 T For general f : U R m, the same reasoning as in the case m = 1 shows that the derivative f x 0 has as rows exactly the gradients of the component functions f i of f. Hence the derivative f x 0 of f at x 0 is the Jacobian matrix ( ) fi x f x 0 = 0 R m n x j EX. 1.23. Show that an affine function f x = Ax + b has the derivative f x = A at every x. In particular, a linear function g x = c T x has gradient g x = c T. Let Q be a symmetric matrix and consider the quadratic function q x = x T Qx. Show that q x = 2x T Q holds for all x. The existence of the Jacobian f x (i.e., the existence of the partial derivatives f i x 67 x j ) alone does not necessarily guarantee the differentiability of f at x. A sufficient and in practice quite useful condition is the continuity of the (generally non-linear) map x f x.
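Ex. 1.23 (the gradient of a quadratic function) lends itself to a quick numerical sanity check against finite differences. A sketch (NumPy; Q and x are arbitrary test data, and the symmetric difference quotient only approximates the derivative):

    import numpy as np

    Q = np.array([[2.0, 1.0],
                  [1.0, 3.0]])          # symmetric
    x = np.array([1.0, -2.0])

    def q(v):
        return v @ Q @ v                # q(v) = v^T Q v

    grad = 2.0 * x @ Q                  # claimed gradient: q'(x) = 2 x^T Q

    # central finite differences along the unit directions e_j
    h = 1e-6
    e = np.eye(2)
    fd = np.array([(q(x + h * e[j]) - q(x - h * e[j])) / (2 * h) for j in range(2)])

    print(grad, fd)                     # the two vectors agree up to roundoff
    assert np.allclose(grad, fd, atol=1e-5)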

18 1. REAL VECTOR SPACES LEMMA 1.5. Let U R n be open and f : U R m have continuous partial derivative functions x f i x 67 x j for all i = 1 m and j = 1 n (i e, the function x f x exists and is continuous). Then f is differentiable on U. Proof. As noted above, it suffices to assume m = 1. Given x 0 U and the vector d = d 1 d n T R n, we let x k = x 0 + d 1 e 1 + + d k e k for k = 1 n For d sufficiently small, we have x k U for all k = 1 n and f x 0 + d f x 0 = n k=1 [ f x k f x k 1 ] Applying the Mean Value Theorem to the functions, f x k 1 +, e k, we obtain numbers, k 0 d k such that whence we deduce f x k f x k 1 = d k f x k 1 +, ke k x k d ( d = f x 0 + d f x 0 f x 0 d n [ f xk 1 +, ke = d k k f x 0 x k x k k=1 Because d 1 d k 1 and lim d 0 x k 1 +, ke k = x 0, continuity of the partial derivatives finally implies lim d 0 ( d lim d 0 n k=1 f x k 1 +, ke k x k f x 0 x k ] = 0 If f : U R m has continuous partial derivatives, we call f a C 1 -function. In general, C k denotes the class of functions with continuous partial derivatives up to order k (cf. Section1.4.4 below). The Chain Rule. Let U R n and S R m be open sets and assume that the function f : U S is differentiable at x 0 U. If g : S R k is differentiable at y 0 = f x 0, the composite function h : U R k, given by h x = g f x, can be linearly approximated at x 0 by the composition of the respective derivatives. More precisely, h = g f is differentiable at x 0 and the Jacobian of h at x 0 equals the matrix product of the Jacobian matrices of f at x 0 and g at y 0 : (1.13) h x 0 = g y 0 f x 0 Formula (1.13) is known as the chain rule for differentiable functions. To verify it, it suffices to assume k = 1. Writing A = f x 0 and B = g y 0 for ease of

1.4. CONTINUOUS AND DIFFERENTIABLE FUNCTIONS 19 notation, the differentiability of f and g provides us with functions ( and : such that ( d 0 as d 0 and :; d 0 as d 0 and f x 0 + d = f x 0 + Ad + d ( d g y 0 + d = g y 0 + B d + d :; d With d = Ad + d ( d, we then obtain for all d 0, h x 0 + d = g f x 0 + d = g f x 0 + B d + d :* d = g f x 0 + BAd + B d ( d + d :; d = h x 0 + BAd + d ( B ( d + d d :; d By the choice of ( and the continuity of the linear maps x Ax, and y By, we have both B ( d 0 and :; d 0 as d 0. In order to establish the differentiability of h = g f and the chain rule, it now suffices to show that the quotient d 6 d is bounded as d 0. From Ex. 1.12, however, we know Ad A F d. In view of the triangle inequality (1.8), we therefore conclude d 6 d A F + ( d which is bounded as d 0. The Product Rule. The chain rule is an often very useful tool for the computation of derivatives. Let, for example, f 1 f 2 : a b R be differentiable on the open interval a b R and consider h t = f 1 t f 2 t. With F t = f 1 t f 2 t T and H x y = xy, we have h t = H F t. So ( ) F f t = 1 t f 2 and H x t y = y x yield i.e. h t = H f 1 t f 2 t < F t = f 2 t f 1 t h t = f 2 t f 1 t + f 1 t f 2 t ) ( f 1 t f 2 t EX. 1.24. Let f 1 f 2 : a b R be differentiable and assume f 2 t 0 for all t a b. Derive for h t = f 1 t f 2 t the quotient rule: h t = f 2 t f 1 t f 1 t f 2 t f 2 2 t )
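The chain rule (1.13) can also be checked numerically on a concrete composition. A sketch (NumPy; the functions f, g and the evaluation point are made up for illustration, and forward differences only approximate the Jacobian matrices):

    import numpy as np

    def f(x):                           # f : R^2 -> R^2
        return np.array([x[0] * x[1], x[0] + x[1] ** 2])

    def g(y):                           # g : R^2 -> R
        return np.sin(y[0]) + y[1] ** 2

    def jacobian(fun, x, h=1e-6):       # forward differences, column by column
        fx = np.atleast_1d(fun(x))
        J = np.zeros((fx.size, x.size))
        for j in range(x.size):
            xp = x.copy()
            xp[j] += h
            J[:, j] = (np.atleast_1d(fun(xp)) - fx) / h
        return J

    x0 = np.array([0.3, -1.2])
    y0 = f(x0)

    lhs = jacobian(lambda x: g(f(x)), x0)    # Jacobian of h = g o f at x0
    rhs = jacobian(g, y0) @ jacobian(f, x0)  # g'(y0) f'(x0)
    assert np.allclose(lhs, rhs, atol=1e-4)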

= = = 20 1. REAL VECTOR SPACES 1.4.4. Second Derivatives and Taylor s Formula. The differentiable function f : U R gives rise to the function f : U R n via the assignment x [ f x ] T. Let us assume that also f x is differentiable. Then the partial derivatives of the partial derivatives of f exist and define the second derivative matrix ( 2 f x 2 f x = x i x j called the Hessian matrix of f at x U. REMARK. If all second partial derivatives are continuous on U, i.e., f is a C 2 -function, one can show for all x U and all i j, = 2 f x x i = x j = = 2 f x which means that the Hessian 2 f x is symmetric. EX. 1.25. Let f : U R n be differentiable. Show that the function p u t = f x 0 + tu is twice differentiable at t 0, if x 0 + t 0 u U, and satisfies n n = p u t 0 2 f x 0 + t 0 u = u i u j i=1 j=1 Consider the case where all second partial derivatives of f exist and are continuous. Then Lemma 1.5 tells us that f is differentiable. By Ex. 1.25, we know that p u t = f x 0 + tu is twice differentiable. In the subsequent discussion, we consider vectors u of unit length u = 1. Lemma 1.4 guarantees the existence of some 0 /. u 1, such that provided t representation for f. Given > x j = x i = ) x i p u t = p u 0 + p u 0 t + t2 2 p u -. ut x j 0 is so small that U t x 0 U. We want to derive an analogous 0, the assumed continuity of the Hessian matrix 2 f x allows us to choose t 0 so small that for every x U with x 0 x t, 2 f x 0 2 f x x i x j x i x j Recalling p u t = u T 2 f x 0 + tu u and observing u i u j u 2 = 1 for every two components u i and u j of u, we obtain p u 0 p u?. ut n 2 which is valid for all d = tu whenever the norm d = t is small enough (independent of the unit direction u!). With p u 0 = f x 0 u, we thus arrive at Taylor s formula for real-valued functions in several variables:

f(x_0 + d) = f(x_0) + ∇f(x_0) d + (1/2) d^T ∇²f(x_0) d + o(‖d‖²),

or, with some θ ∈ (0, 1):

f(x_0 + d) = f(x_0) + ∇f(x_0) d + (1/2) d^T ∇²f(x_0 + θd) d.
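Taylor's formula can be illustrated by observing how fast the remainder shrinks. The sketch below (NumPy; the smooth test function and the direction are arbitrary) compares f(x_0 + td) with the second order model and prints the remainder for decreasing t; it decays faster than t², consistent with the o(‖d‖²) term:

    import numpy as np

    def f(x):
        return np.exp(x[0]) * np.sin(x[1])

    def grad(x):
        return np.array([np.exp(x[0]) * np.sin(x[1]), np.exp(x[0]) * np.cos(x[1])])

    def hess(x):
        return np.array([[np.exp(x[0]) * np.sin(x[1]),  np.exp(x[0]) * np.cos(x[1])],
                         [np.exp(x[0]) * np.cos(x[1]), -np.exp(x[0]) * np.sin(x[1])]])

    x0 = np.array([0.2, 0.7])
    d = np.array([1.0, -0.5])

    for t in [1e-1, 1e-2, 1e-3]:
        step = t * d
        model = f(x0) + grad(x0) @ step + 0.5 * step @ hess(x0) @ step
        print(t, abs(f(x0 + step) - model))   # remainder shrinks faster than t^2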

CHAPTER 2

Linear Equations and Linear Inequalities

While Chapter 1 reviews general structural aspects of real vector spaces, we now discuss fundamental computational techniques for linear systems. For convenience of the discussion, we generally assume that the coefficients of the linear systems are real numbers. It is important to note, however, that in practical computations we mostly deal with rational numbers as input parameters. We therefore point out right at the outset that all the algorithms of this chapter (Gaussian elimination, orthogonal projection, Fourier-Motzkin elimination) will compute rational output quantities if the input parameters are rational numbers, as the reader can easily check.

2.1. Gaussian Elimination

Let A = [a_ij] ∈ R^{m×n} be an (m×n)-matrix and b = (b_i) ∈ R^m a vector. Can we represent b as a linear combination of the column vectors of A? And if yes, how? To answer this question, we must find a vector x = (x_j) ∈ R^n such that Ax = b holds, i.e., such that

(2.1)
a_11 x_1 + a_12 x_2 + ... + a_1n x_n = b_1
a_21 x_1 + a_22 x_2 + ... + a_2n x_n = b_2
...
a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n = b_m

We refer to (2.1) as a system of linear equations in the variables x_1, ..., x_n. A vector x = (x_j) ∈ R^n satisfying (2.1) is called feasible or a solution for (2.1). The system is infeasible if no solution exists.

From a structural point of view, our problem is the following. We are given the linear function f : R^n → R^m via f(x) = Ax, and a vector b ∈ R^m. We are to determine a vector x in the solution space of Ax = b, i.e., in the affine subspace (cf. Section 1.2)

S = f^{-1}({b}) = {x ∈ R^n | Ax = b} ⊆ R^n.

In the computational approach to the problem, we try to transform the system (2.1) of linear equalities via elementary vector space operations that leave the solution space S unaltered, until the system has attained an equivalent form from which a solution can be easily inferred.

If all coefficients occurring with the variable x_1 are zero, the column is irrelevant for any linear combination and we are reduced to solving the subsystem involving only the variables x_2, ..., x_n. Gaussian elimination wants to achieve a similar reduction even when some (or all) coefficients occurring with x_1 are non-zero. Assume, for example, that a_11 ≠ 0. Then

(2.2) x_1 = (1/a_11)(b_1 - a_12 x_2 - ... - a_1n x_n).

We can substitute this expression for x_1 in all other equations and obtain a new system A'x' = b' that involves only the variables x_2, ..., x_n. In this sense, the variable x_1 has been eliminated.

The systems Ax = b and A'x' = b' of linear equations are very closely related. Each solution x = (x_1, x_2, ..., x_n)^T for Ax = b yields a solution x' = (x_2, ..., x_n)^T for A'x' = b' (we just omit the variable x_1 in x). Conversely, each solution x' for A'x' = b' can be extended to a solution x = (x_1, x_2, ..., x_n)^T for Ax = b by computing the value of x_1 via backward substitution according to the formula (2.2) from x' = (x_2, ..., x_n)^T.

REMARK. From a geometrical point of view, passing from the solution space S of Ax = b to the solution space S' of A'x' = b' amounts to projecting the vectors x = (x_1, x_2, ..., x_n) ∈ S to x' = (x_2, ..., x_n) ∈ S'.

We next eliminate another variable, say x_2, from the system A'x' = b' in the same way, etc., until all variables have been eliminated. Going all the way back, we can compute a solution for the original system Ax = b via repeated backward substitution.

What does it mean in terms of algebraic operations to eliminate x_1 in the system Ax = b? It turns out that there is no need to actually remove x_1 from the system. The elimination process comes down to a suitable sequence of pivoting operations that successively transform our original system into a completely equivalent system which, however, has the virtue of being easily solvable. Given a pair (i, j) of row and column indices such that a_ij ≠ 0, let us call the following operation a Gaussian (i, j)-pivot (with pivot row i and pivot column j) on the rows (equations) of the system Ax = b:

(GP) For all rows k ≠ i: Add -a_kj a_ij^{-1} (row i) to row k.

EX. 2.1. Assume that Ãx = b̃ arises from Ax = b via a Gaussian (i, j)-pivot. Show that both systems have the same solution space S.

EX. 2.2. Show that the system A'x' = b' in the Gaussian elimination step with respect to x_1 and a_11 ≠ 0 is exactly the subsystem we obtain when we first apply a (1, 1)-pivot to Ax = b and then remove column 1 and row 1 from the system.
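The Gaussian (i, j)-pivot (GP) is easy to implement on the augmented matrix [A | b]. The following sketch (Python with NumPy, not taken from the book) performs the pivot exactly as described, eliminates the variables one after another and reads off a solution; for simplicity it assumes that the diagonal pivot entries encountered are nonzero, so no row or column interchanges are needed:

    import numpy as np

    def gaussian_pivot(M, i, j):
        # Apply the Gaussian (i, j)-pivot (GP) to the augmented matrix M = [A | b].
        assert M[i, j] != 0
        for k in range(M.shape[0]):
            if k != i:
                M[k, :] -= (M[k, j] / M[i, j]) * M[i, :]   # add -a_kj/a_ij * (row i)

    A = np.array([[2.0, 1.0, 1.0],
                  [4.0, 3.0, 3.0],
                  [8.0, 7.0, 9.0]])
    b = np.array([5.0, 11.0, 27.0])
    M = np.hstack([A, b.reshape(-1, 1)])

    for t in range(3):                  # pivot on the diagonal entries (t, t)
        gaussian_pivot(M, t, t)

    x = M[:, 3] / np.diag(M)            # only a diagonal system a_tt x_t = b_t remains
    assert np.allclose(A @ x, b)
    print(x)                            # solution of Ax = b, here (2, -1, 2)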