DS-GA 1002 Lecture notes 0    Fall 2016

Linear Algebra

These notes provide a review of basic concepts in linear algebra.

1 Vector spaces

You are no doubt familiar with vectors in $\mathbb{R}^2$ or $\mathbb{R}^3$, e.g.

    $x = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \qquad y = \begin{bmatrix} 1.1 \\ 0 \\ 5 \end{bmatrix}.$    (1)

From the point of view of algebra, vectors are much more general objects. They are elements of sets called vector spaces that satisfy the following definition.

Definition 1.1 (Vector space). A vector space consists of a set V and two operations $+$ and $\cdot$ satisfying the following conditions.

1. For any pair of elements $x, y \in V$ the vector sum $x + y$ belongs to V.
2. For any $x \in V$ and any scalar $\alpha \in \mathbb{R}$ the scalar multiple $\alpha \cdot x \in V$.
3. There exists a zero vector or origin $0$ such that $x + 0 = x$ for any $x \in V$.
4. For any $x \in V$ there exists an additive inverse $y$ such that $x + y = 0$, usually denoted by $-x$.
5. The vector sum is commutative and associative, i.e. for all $x, y, z \in V$

       $x + y = y + x, \qquad (x + y) + z = x + (y + z).$    (2)

6. Scalar multiplication is associative: for any $\alpha, \beta \in \mathbb{R}$ and $x \in V$

       $\alpha \cdot (\beta \cdot x) = (\alpha \beta) \cdot x.$    (3)

7. Scalar and vector sums are both distributive, i.e. for all $\alpha, \beta \in \mathbb{R}$ and $x, y \in V$

       $(\alpha + \beta) \cdot x = \alpha \cdot x + \beta \cdot x, \qquad \alpha \cdot (x + y) = \alpha \cdot x + \alpha \cdot y.$    (4)

A subspace of a vector space V is a subset of V that is also itself a vector space. From now on, for ease of notation, we will omit the symbol for the scalar product, writing $\alpha \cdot x$ as $\alpha x$.

Remark 1.2 (More general definition). We can define vector spaces over an arbitrary field instead of $\mathbb{R}$, such as the complex numbers $\mathbb{C}$. We refer to any linear algebra text for more details.

We can easily check that $\mathbb{R}^n$ is a valid vector space together with the usual vector addition and vector-scalar product. In this case the zero vector is the all-zero vector $[0 \; 0 \; \cdots \; 0]^T$. When thinking about vector spaces it is a good idea to have $\mathbb{R}^2$ or $\mathbb{R}^3$ in mind to gain intuition, but it is also important to bear in mind that we can define vector spaces over many other objects, such as infinite sequences, polynomials, functions and even random variables.

The definition of vector space guarantees that any linear combination of vectors in a vector space V, obtained by adding the vectors after multiplying them by scalar coefficients, belongs to V. Given a set of vectors, a natural question to ask is whether any of them can be expressed as a linear combination of the others, i.e. whether they are linearly dependent or independent.

Definition 1.3 (Linear dependence/independence). A set of $m$ vectors $x_1, x_2, \ldots, x_m$ is linearly dependent if there exist $m$ scalar coefficients $\alpha_1, \alpha_2, \ldots, \alpha_m$ which are not all equal to zero and such that

    $\sum_{i=1}^m \alpha_i x_i = 0.$    (5)

Otherwise, the vectors are linearly independent.

Equivalently, at least one vector in a linearly dependent set can be expressed as a linear combination of the rest, whereas this is not the case for linearly independent sets. Let us check the equivalence: equation (5) holds with $\alpha_j \neq 0$ for some $j$ if and only if

    $x_j = -\frac{1}{\alpha_j} \sum_{i \in \{1, \ldots, m\} \setminus \{j\}} \alpha_i x_i.$    (6)

We define the span of a set of vectors $\{x_1, \ldots, x_m\}$ as the set of all possible linear combinations of the vectors:

    $\operatorname{span}(x_1, \ldots, x_m) := \left\{ y \;\middle|\; y = \sum_{i=1}^m \alpha_i x_i \text{ for some } \alpha_1, \alpha_2, \ldots, \alpha_m \in \mathbb{R} \right\}.$    (7)

This turns out to be a vector space.
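The linear dependence of a concrete set of vectors can be checked numerically. The following NumPy sketch (not part of the original notes; the vectors are arbitrary illustrative choices) stacks the vectors as columns of a matrix and compares its rank with the number of vectors.

import numpy as np

# Three vectors in R^3; x3 is a linear combination of x1 and x2, so the set is dependent.
x1 = np.array([1.0, 0.0, 2.0])
x2 = np.array([0.0, 1.0, -1.0])
x3 = 2.0 * x1 - 3.0 * x2

A = np.column_stack([x1, x2, x3])
print(np.linalg.matrix_rank(A))          # 2 < 3: x1, x2, x3 are linearly dependent
print(np.linalg.matrix_rank(A[:, :2]))   # 2 = 2: x1 and x2 are linearly independent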

Lemma 1.4. The span of any set of vectors $x_1, \ldots, x_m$ belonging to a vector space V is a subspace of V.

Proof. The span is a subset of V due to Conditions 1 and 2 in Definition 1.1. We now show that it is a vector space. Conditions 5, 6 and 7 in Definition 1.1 hold because V is a vector space. We check Conditions 1, 2, 3 and 4 by proving that for two arbitrary elements of the span,

    $y_1 = \sum_{i=1}^m \alpha_i x_i, \qquad y_2 = \sum_{i=1}^m \beta_i x_i, \qquad \alpha_1, \ldots, \alpha_m, \beta_1, \ldots, \beta_m \in \mathbb{R},$    (8)

and any $\gamma_1, \gamma_2 \in \mathbb{R}$, the vector $\gamma_1 y_1 + \gamma_2 y_2$ also belongs to the span. This holds because

    $\gamma_1 y_1 + \gamma_2 y_2 = \sum_{i=1}^m (\gamma_1 \alpha_i + \gamma_2 \beta_i) x_i,$    (9)

so $\gamma_1 y_1 + \gamma_2 y_2$ is in $\operatorname{span}(x_1, \ldots, x_m)$. Now, to prove Condition 1 we set $\gamma_1 = \gamma_2 = 1$, for Condition 2 we set $\gamma_2 = 0$, for Condition 3 we set $\gamma_1 = \gamma_2 = 0$, and for Condition 4 we set $\gamma_1 = -1$, $\gamma_2 = 0$.

When working with a vector space, it is useful to consider the set of vectors with the smallest cardinality that spans the space. This is called a basis of the vector space.

Definition 1.5 (Basis). A basis of a vector space V is a set of linearly independent vectors $\{x_1, \ldots, x_m\}$ such that

    $V = \operatorname{span}(x_1, \ldots, x_m).$    (10)

An important property of bases is that they all have the same cardinality.

Theorem 1.6. If a vector space V has a basis with finite cardinality, then every basis of V contains the same number of vectors.

This theorem, which is proven in Section A of the appendix, allows us to define the dimension of a vector space.

Definition 1.7 (Dimension). The dimension $\dim(V)$ of a vector space V is the cardinality of any of its bases, or equivalently the smallest number of linearly independent vectors that span V.

This definition coincides with the usual geometric notion of dimension in $\mathbb{R}^2$ and $\mathbb{R}^3$: a line has dimension 1, whereas a plane has dimension 2 (as long as they contain the origin). Note that there exist infinite-dimensional vector spaces, such as the continuous real-valued functions defined on $[0, 1]$, or an iid sequence of random variables $X_1, X_2, \ldots$

The vector space that we use to model a certain problem is usually called the ambient space, and its dimension the ambient dimension. In the case of $\mathbb{R}^n$ the ambient dimension is $n$.

Lemma 1.8 (Dimension of $\mathbb{R}^n$). The dimension of $\mathbb{R}^n$ is $n$.

Proof. Consider the set of vectors $e_1, \ldots, e_n \in \mathbb{R}^n$ defined by

    $e_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \qquad e_2 = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \qquad \ldots, \qquad e_n = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}.$    (11)

One can easily check that this set is a basis. It is in fact the standard basis of $\mathbb{R}^n$.

2 Inner product and norm

Up to now, the only operations that we know how to apply to the vectors in a vector space are addition and multiplication by a scalar. We now introduce a third operation, the inner product between two vectors.

Definition 2.1 (Inner product). An inner product on a vector space V is an operation $\langle \cdot, \cdot \rangle$ that maps pairs of vectors to $\mathbb{R}$ and satisfies the following conditions.

• It is symmetric: for any $x, y \in V$

    $\langle x, y \rangle = \langle y, x \rangle.$    (12)

• It is linear, i.e. for any $\alpha \in \mathbb{R}$ and any $x, y, z \in V$

    $\langle \alpha x, y \rangle = \alpha \langle x, y \rangle, \qquad \langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle.$    (13)

• It is positive semidefinite: for any $x \in V$, $\langle x, x \rangle \geq 0$, and $\langle x, x \rangle = 0$ implies $x = 0$.

A vector space endowed with an inner product is called an inner-product space.

Example 2.2 (Dot product). We define the dot product between two vectors $x, y \in \mathbb{R}^n$ as

    $x \cdot y := \sum_{i=1}^n x[i] \, y[i],$    (14)

where $x[i]$ is the $i$th entry of $x$. It is easy to check that the dot product is a valid inner product. $\mathbb{R}^n$ endowed with the dot product is usually called a Euclidean space of dimension $n$.
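As a quick numerical illustration of Example 2.2 (a sketch with arbitrary vectors, not part of the original notes), the dot product and the symmetry and linearity properties of Definition 2.1 can be checked directly:

import numpy as np

x = np.array([1.1, 0.0, 3.0])
y = np.array([2.0, -1.0, 0.5])
alpha = 2.5

print(x @ y)                                          # sum_i x[i] y[i], equation (14)
print(np.isclose(x @ y, y @ x))                       # symmetry, equation (12)
print(np.isclose((alpha * x) @ y, alpha * (x @ y)))   # linearity in the first argument, equation (13)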

The norm of a vector is a generalization of the concept of length.

Definition 2.3 (Norm). Let V be a vector space. A norm is a function $\|\cdot\|$ from V to $\mathbb{R}$ that satisfies the following conditions.

• It is homogeneous: for all $\alpha \in \mathbb{R}$ and $x \in V$

    $\|\alpha x\| = |\alpha| \, \|x\|.$    (15)

• It satisfies the triangle inequality

    $\|x + y\| \leq \|x\| + \|y\|.$    (16)

  In particular, it is nonnegative (set $y = -x$).

• $\|x\| = 0$ implies that $x$ is the zero vector $0$.

A vector space equipped with a norm is called a normed space. Distances in a normed space can be measured using the norm of the difference between vectors.

Definition 2.4 (Distance). The distance between two vectors in a normed space with norm $\|\cdot\|$ is

    $d(x, y) := \|x - y\|.$    (17)

Inner-product spaces are normed spaces because we can define a valid norm using the inner product. The norm induced by an inner product is obtained by taking the square root of the inner product of the vector with itself,

    $\|x\|_{\langle \cdot, \cdot \rangle} := \sqrt{\langle x, x \rangle}.$    (18)

The norm induced by an inner product is clearly homogeneous by linearity and symmetry of the inner product, and $\|x\|_{\langle \cdot, \cdot \rangle} = 0$ implies $x = 0$ because the inner product is positive semidefinite. We only need to establish that the triangle inequality holds to ensure that the induced norm is a valid norm. This follows from a classic inequality in linear algebra, which is proved in Section B of the appendix.

Theorem 2.5 (Cauchy-Schwarz inequality). For any two vectors $x$ and $y$ in an inner-product space

    $|\langle x, y \rangle| \leq \|x\|_{\langle \cdot, \cdot \rangle} \, \|y\|_{\langle \cdot, \cdot \rangle}.$    (19)

Assume $\|x\|_{\langle \cdot, \cdot \rangle} \neq 0$. Then

    $\langle x, y \rangle = -\|x\|_{\langle \cdot, \cdot \rangle} \, \|y\|_{\langle \cdot, \cdot \rangle} \iff y = -\frac{\|y\|_{\langle \cdot, \cdot \rangle}}{\|x\|_{\langle \cdot, \cdot \rangle}} \, x,$    (20)

    $\langle x, y \rangle = \|x\|_{\langle \cdot, \cdot \rangle} \, \|y\|_{\langle \cdot, \cdot \rangle} \iff y = \frac{\|y\|_{\langle \cdot, \cdot \rangle}}{\|x\|_{\langle \cdot, \cdot \rangle}} \, x.$    (21)
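The Cauchy-Schwarz inequality can be checked numerically for the dot product and its induced (Euclidean) norm. This is an illustrative sketch with arbitrary vectors, not part of the original notes:

import numpy as np

x = np.array([1.0, -2.0, 0.5])
y = np.array([3.0, 1.0, -1.0])

lhs = abs(x @ y)                               # |<x, y>|
rhs = np.linalg.norm(x) * np.linalg.norm(y)    # ||x|| ||y|| for the induced norm
print(lhs <= rhs)                              # True, equation (19)

# Equality as in (21) when y is a positive multiple of x:
z = 3.0 * x
print(np.isclose(x @ z, np.linalg.norm(x) * np.linalg.norm(z)))   # True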

Corollary 2.6. The norm induced by an inner product satisfies the triangle inequality.

Proof.

    $\|x + y\|^2_{\langle \cdot, \cdot \rangle} = \|x\|^2_{\langle \cdot, \cdot \rangle} + \|y\|^2_{\langle \cdot, \cdot \rangle} + 2 \langle x, y \rangle$    (22)
    $\leq \|x\|^2_{\langle \cdot, \cdot \rangle} + \|y\|^2_{\langle \cdot, \cdot \rangle} + 2 \|x\|_{\langle \cdot, \cdot \rangle} \|y\|_{\langle \cdot, \cdot \rangle}$  by the Cauchy-Schwarz inequality    (23)
    $= \left( \|x\|_{\langle \cdot, \cdot \rangle} + \|y\|_{\langle \cdot, \cdot \rangle} \right)^2.$    (24)

Example 2.7 (Euclidean norm). The Euclidean or $\ell_2$ norm is the norm induced by the dot product in $\mathbb{R}^n$,

    $\|x\|_2 := \sqrt{x \cdot x} = \sqrt{\sum_{i=1}^n x[i]^2}.$    (25)

In the case of $\mathbb{R}^2$ or $\mathbb{R}^3$ it is what we usually think of as the length of the vector.

3 Orthogonality

An important concept in linear algebra is orthogonality.

Definition 3.1 (Orthogonality). Two vectors $x$ and $y$ are orthogonal if

    $\langle x, y \rangle = 0.$    (26)

A vector $x$ is orthogonal to a set $S$ if

    $\langle x, s \rangle = 0 \quad \text{for all } s \in S.$    (27)

Two sets $S_1$, $S_2$ are orthogonal if for any $x \in S_1$, $y \in S_2$

    $\langle x, y \rangle = 0.$    (28)

The orthogonal complement of a subspace $S$ is

    $S^\perp := \{ x \mid \langle x, y \rangle = 0 \text{ for all } y \in S \}.$    (29)

Distances between orthogonal vectors measured in terms of the norm induced by the inner product are easy to compute.

Theorem 3.2 (Pythagorean theorem). If $x$ and $y$ are orthogonal vectors,

    $\|x + y\|^2_{\langle \cdot, \cdot \rangle} = \|x\|^2_{\langle \cdot, \cdot \rangle} + \|y\|^2_{\langle \cdot, \cdot \rangle}.$    (30)

Proof. By linearity of the inner product,

    $\|x + y\|^2_{\langle \cdot, \cdot \rangle} = \|x\|^2_{\langle \cdot, \cdot \rangle} + \|y\|^2_{\langle \cdot, \cdot \rangle} + 2 \langle x, y \rangle$    (31)
    $= \|x\|^2_{\langle \cdot, \cdot \rangle} + \|y\|^2_{\langle \cdot, \cdot \rangle}.$    (32)

If we want to show that a vector is orthogonal to a certain subspace, it is enough to show that it is orthogonal to every vector in a basis of the subspace.

Lemma 3.3. Let $x$ be a vector and $S$ a subspace of dimension $n$. If, for some basis $b_1, b_2, \ldots, b_n$ of $S$,

    $\langle x, b_i \rangle = 0, \quad 1 \leq i \leq n,$    (33)

then $x$ is orthogonal to $S$.

Proof. Any vector $v \in S$ can be represented as $v = \sum_{i=1}^n \alpha_i b_i$ for some $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$. By (33),

    $\langle x, v \rangle = \left\langle x, \sum_{i=1}^n \alpha_i b_i \right\rangle = \sum_{i=1}^n \alpha_i \langle x, b_i \rangle = 0.$    (34)

We now introduce orthonormal bases.

Definition 3.4 (Orthonormal basis). A basis of mutually orthogonal vectors with norm equal to one is called an orthonormal basis.
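A small numerical check of orthogonality, the Pythagorean theorem and Definition 3.4 (a sketch with hand-picked vectors, not part of the original notes):

import numpy as np

x = np.array([1.0, 1.0]) / np.sqrt(2)    # unit norm
y = np.array([1.0, -1.0]) / np.sqrt(2)   # unit norm and orthogonal to x

print(np.isclose(x @ y, 0.0))            # orthogonality, equation (26)
lhs = np.linalg.norm(x + y) ** 2
rhs = np.linalg.norm(x) ** 2 + np.linalg.norm(y) ** 2
print(np.isclose(lhs, rhs))              # Pythagorean theorem, equation (30)
# Since x and y are orthogonal and have unit norm, {x, y} is an orthonormal basis of R^2.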

It is very easy to find the coefficients of a vector in an orthonormal basis: we just need to compute the inner products with the basis vectors.

Lemma 3.5 (Coefficients in an orthonormal basis). If $\{u_1, \ldots, u_n\}$ is an orthonormal basis of a vector space V, then for any vector $x \in V$

    $x = \sum_{i=1}^n \langle u_i, x \rangle \, u_i.$    (35)

Proof. Since $\{u_1, \ldots, u_n\}$ is a basis,

    $x = \sum_{i=1}^n \alpha_i u_i \quad \text{for some } \alpha_1, \alpha_2, \ldots, \alpha_n \in \mathbb{R}.$    (36)

Immediately,

    $\langle u_i, x \rangle = \left\langle u_i, \sum_{j=1}^n \alpha_j u_j \right\rangle = \sum_{j=1}^n \alpha_j \langle u_i, u_j \rangle = \alpha_i,$    (37)

because $\langle u_i, u_i \rangle = 1$ and $\langle u_i, u_j \rangle = 0$ for $i \neq j$.

For any subspace of $\mathbb{R}^n$ we can obtain an orthonormal basis by applying the Gram-Schmidt method to a set of linearly independent vectors spanning the subspace.

Algorithm 3.6 (Gram-Schmidt).
Input: A set of linearly independent vectors $\{x_1, \ldots, x_m\} \subset \mathbb{R}^n$.
Output: An orthonormal basis $\{u_1, \ldots, u_m\}$ for $\operatorname{span}(x_1, \ldots, x_m)$.
Initialization: Set $u_1 := x_1 / \|x_1\|_2$.
For $i = 2, \ldots, m$:

1. Compute

    $v_i := x_i - \sum_{j=1}^{i-1} \langle u_j, x_i \rangle \, u_j.$    (38)

2. Set $u_i := v_i / \|v_i\|_2$.

This implies in particular that we can always assume that a subspace has an orthonormal basis.

Theorem 3.7. Every finite-dimensional vector space has an orthonormal basis.

Proof. To see that the Gram-Schmidt method produces an orthonormal basis for the span of the input vectors, one can check by induction that $\operatorname{span}(x_1, \ldots, x_i) = \operatorname{span}(u_1, \ldots, u_i)$ and that $u_1, \ldots, u_i$ is a set of orthonormal vectors, for $1 \leq i \leq m$.
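Below is a short NumPy implementation of Algorithm 3.6 (a sketch assuming the dot product as the inner product and linearly independent, numerically well-conditioned input vectors; the function name gram_schmidt is just a label chosen here):

import numpy as np

def gram_schmidt(X):
    # Orthonormalize the columns of X (assumed linearly independent), following Algorithm 3.6.
    m = X.shape[1]
    U = np.zeros_like(X, dtype=float)
    U[:, 0] = X[:, 0] / np.linalg.norm(X[:, 0])
    for i in range(1, m):
        # Subtract the components along the previously computed orthonormal vectors, equation (38).
        v = X[:, i] - U[:, :i] @ (U[:, :i].T @ X[:, i])
        U[:, i] = v / np.linalg.norm(v)
    return U

X = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])                 # two linearly independent vectors in R^3 as columns
U = gram_schmidt(X)
print(np.allclose(U.T @ U, np.eye(2)))     # the output columns are orthonormal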

4 Projections

The projection of a vector $x$ onto a subspace $S$ is the vector in $S$ that is closest to $x$. In order to define this rigorously, we start by introducing the concept of direct sum. If two subspaces are disjoint, i.e. their only common point is the origin, then a vector that can be written as a sum of a vector from each subspace is said to belong to their direct sum.

Definition 4.1 (Direct sum). Let V be a vector space. For any subspaces $S_1, S_2 \subseteq V$ such that

    $S_1 \cap S_2 = \{0\},$    (39)

the direct sum is defined as

    $S_1 \oplus S_2 := \{ x \mid x = s_1 + s_2, \; s_1 \in S_1, \, s_2 \in S_2 \}.$    (40)

The representation of a vector in the direct sum of two subspaces is unique.

Lemma 4.2. Any vector $x \in S_1 \oplus S_2$ has a unique representation

    $x = s_1 + s_2, \qquad s_1 \in S_1, \, s_2 \in S_2.$    (41)

Proof. If $x \in S_1 \oplus S_2$, then by definition there exist $s_1 \in S_1$, $s_2 \in S_2$ such that $x = s_1 + s_2$. Assume $x = s_1' + s_2'$ with $s_1' \in S_1$, $s_2' \in S_2$; then $s_1 - s_1' = s_2' - s_2$. This implies that $s_1 - s_1'$ and $s_2' - s_2$ are in $S_1$ and also in $S_2$. However, $S_1 \cap S_2 = \{0\}$, so we conclude $s_1 = s_1'$ and $s_2 = s_2'$.

We can now define the projection of a vector $x$ onto a subspace $S$ by separating the vector into a component that belongs to $S$ and another that belongs to its orthogonal complement.

Definition 4.3 (Orthogonal projection). Let V be a vector space. The orthogonal projection of a vector $x \in V$ onto a subspace $S \subseteq V$ is a vector, denoted by $P_S\, x$, that belongs to $S$ and satisfies $x - P_S\, x \in S^\perp$.

Theorem 4.4 (Properties of orthogonal projections). Let V be a vector space. Every vector $x \in V$ has a unique orthogonal projection $P_S\, x$ onto any subspace $S \subseteq V$ of finite dimension. In particular, $x$ can be expressed as

    $x = P_S\, x + P_{S^\perp}\, x.$    (42)

For any vector $s \in S$,

    $\langle x, s \rangle = \langle P_S\, x, s \rangle.$    (43)

For any orthonormal basis $b_1, \ldots, b_m$ of $S$,

    $P_S\, x = \sum_{i=1}^m \langle x, b_i \rangle \, b_i.$    (44)

Proof. Since V has finite dimension, so does $S$, which consequently has an orthonormal basis with finite cardinality $b_1, \ldots, b_m$. Consider the vector

    $p := \sum_{i=1}^m \langle x, b_i \rangle \, b_i.$    (45)

It turns out that $x - p$ is orthogonal to every vector in the basis. For $1 \leq j \leq m$,

    $\langle x - p, b_j \rangle = \left\langle x - \sum_{i=1}^m \langle x, b_i \rangle b_i, \; b_j \right\rangle$    (46)
    $= \langle x, b_j \rangle - \sum_{i=1}^m \langle x, b_i \rangle \langle b_i, b_j \rangle$    (47)
    $= \langle x, b_j \rangle - \langle x, b_j \rangle = 0,$    (48)

so $x - p \in S^\perp$ and $p$ is an orthogonal projection of $x$ onto $S$. Since $S \cap S^\perp = \{0\}$ (see the footnote below), there cannot be another pair of vectors $x_1 \in S$, $x_2 \in S^\perp$ with $x = x_1 + x_2$, so the orthogonal projection is unique.

Notice that $o := x - p$ is a vector in $S^\perp$ such that $x - o = p$ is in $S$, and therefore in $(S^\perp)^\perp$. This implies that $o$ is the orthogonal projection of $x$ onto $S^\perp$, which establishes (42). Equation (43) follows immediately from (42) and the orthogonality of any vector $s \in S$ and $P_{S^\perp}\, x$. Equation (44) follows from (43) together with Lemma 3.5.

Computing the norm of the projection of a vector onto a subspace is easy if we have access to an orthonormal basis (as long as the norm is induced by the inner product).

Lemma 4.5 (Norm of the projection). The norm of the projection of an arbitrary vector $x \in V$ onto a subspace $S \subseteq V$ of dimension $d$ can be written as

    $\|P_S\, x\|_{\langle \cdot, \cdot \rangle} = \sqrt{ \sum_{i=1}^d \langle b_i, x \rangle^2 }$    (49)

for any orthonormal basis $b_1, \ldots, b_d$ of $S$.

Footnote 1: For any vector $v$ that belongs to both $S$ and $S^\perp$, $\langle v, v \rangle = \|v\|^2_{\langle \cdot, \cdot \rangle} = 0$, which implies $v = 0$.

Proof. By (44),

    $\|P_S\, x\|^2_{\langle \cdot, \cdot \rangle} = \langle P_S\, x, P_S\, x \rangle$    (50)
    $= \left\langle \sum_{i=1}^d \langle b_i, x \rangle b_i, \; \sum_{j=1}^d \langle b_j, x \rangle b_j \right\rangle$    (51)
    $= \sum_{i=1}^d \sum_{j=1}^d \langle b_i, x \rangle \langle b_j, x \rangle \langle b_i, b_j \rangle$    (52)
    $= \sum_{i=1}^d \langle b_i, x \rangle^2.$    (53)

Finally, we prove that the projection of a vector $x$ onto a subspace $S$ is indeed the vector in $S$ that is closest to $x$ in the distance induced by the inner-product norm.

Example 4.6 (Projection onto a one-dimensional subspace). To compute the projection of a vector $x$ onto a one-dimensional subspace spanned by a vector $v$, we use the fact that $\{ v / \|v\|_{\langle \cdot, \cdot \rangle} \}$ is a basis for $\operatorname{span}(v)$ (it is a set containing a unit vector that spans the subspace) and apply (44) to obtain

    $P_{\operatorname{span}(v)}\, x = \frac{\langle v, x \rangle}{\|v\|^2_{\langle \cdot, \cdot \rangle}} \, v.$    (54)

Theorem 4.7 (The orthogonal projection is closest). The orthogonal projection $P_S\, x$ of a vector $x$ onto a subspace $S$ belonging to the same inner-product space is the vector in $S$ closest to $x$ in terms of the norm induced by the inner product. More formally, $P_S\, x$ is the solution to the optimization problem

    minimize over $u$:  $\|x - u\|_{\langle \cdot, \cdot \rangle}$    (55)
    subject to:  $u \in S$.    (56)

Proof. Take any point $s \in S$ such that $s \neq P_S\, x$. Then

    $\|x - s\|^2_{\langle \cdot, \cdot \rangle} = \|P_{S^\perp}\, x + P_S\, x - s\|^2_{\langle \cdot, \cdot \rangle}$    (57)
    $= \|P_{S^\perp}\, x\|^2_{\langle \cdot, \cdot \rangle} + \|P_S\, x - s\|^2_{\langle \cdot, \cdot \rangle}$    (58)
    $> \|P_{S^\perp}\, x\|^2_{\langle \cdot, \cdot \rangle} = \|x - P_S\, x\|^2_{\langle \cdot, \cdot \rangle}$  because $s \neq P_S\, x$,    (59)

where (58) follows from the Pythagorean theorem, since $P_{S^\perp}\, x$ belongs to $S^\perp$ and $P_S\, x - s$ belongs to $S$.
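Example 4.6 and equation (44) can be illustrated numerically with the dot product as the inner product (a sketch with arbitrary vectors, not part of the original notes):

import numpy as np

x = np.array([3.0, 1.0, 2.0])
v = np.array([1.0, 1.0, 0.0])

# Projection onto the one-dimensional subspace spanned by v, equation (54).
p = (v @ x) / (v @ v) * v
print(p)                              # [2. 2. 0.]
print(np.isclose((x - p) @ v, 0.0))   # the residual x - p is orthogonal to span(v)

# Equation (44) with an orthonormal basis of the subspace spanned by the first two coordinates.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])            # orthonormal columns
P_S_x = B @ (B.T @ x)
print(P_S_x)                          # [3. 1. 0.]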

5 Matrices

A matrix is a rectangular array of numbers. We denote the vector space of $m \times n$ matrices by $\mathbb{R}^{m \times n}$. We denote the $i$th row of a matrix $A$ by $A_{i:}$, the $j$th column by $A_{:j}$, and the $(i, j)$ entry by $A_{ij}$. The transpose of a matrix is obtained by switching its rows and columns.

Definition 5.1 (Transpose). The transpose $A^T$ of a matrix $A \in \mathbb{R}^{m \times n}$ is the matrix in $\mathbb{R}^{n \times m}$ with entries

    $\left( A^T \right)_{ij} = A_{ji}.$    (60)

A symmetric matrix is a matrix that is equal to its transpose.

Matrices map vectors to other vectors through a linear operation called the matrix-vector product.

Definition 5.2 (Matrix-vector product). The product of a matrix $A \in \mathbb{R}^{m \times n}$ and a vector $x \in \mathbb{R}^n$ is the vector $Ax \in \mathbb{R}^m$ such that

    $(Ax)[i] = \sum_{j=1}^n A_{ij} \, x[j]$    (61)
    $= \langle A_{i:}, x \rangle,$    (62)

i.e. the $i$th entry of $Ax$ is the dot product between the $i$th row of $A$ and $x$. Equivalently,

    $Ax = \sum_{j=1}^n A_{:j} \, x[j],$    (63)

i.e. $Ax$ is a linear combination of the columns of $A$ weighted by the entries of $x$.

One can easily check that the transpose of the product of two matrices $A$ and $B$ is equal to the transposes multiplied in reverse order,

    $(AB)^T = B^T A^T.$    (64)

We can now express the dot product between two vectors $x$ and $y$ as

    $\langle x, y \rangle = x^T y = y^T x.$    (65)
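The row and column views of the matrix-vector product in Definition 5.2, and property (64), can be checked numerically (a sketch with arbitrary matrices, not part of the original notes):

import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
x = np.array([1.0, 2.0, 3.0])

print(A @ x)                                         # [5. 7.]
print(np.array([A[i, :] @ x for i in range(2)]))     # row view, equations (61)-(62)
print(sum(x[j] * A[:, j] for j in range(3)))         # column view, equation (63)

B = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])
print(np.allclose((A @ B).T, B.T @ A.T))             # (AB)^T = B^T A^T, equation (64)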

The identity matrix is a matrix that maps any vector to itself.

Definition 5.3 (Identity matrix). The identity matrix in $\mathbb{R}^{n \times n}$ is

    $I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}.$    (66)

Clearly, $Ix = x$ for any $x \in \mathbb{R}^n$.

Definition 5.4 (Matrix multiplication). The product of two matrices $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ is the matrix $AB \in \mathbb{R}^{m \times p}$ such that

    $(AB)_{ij} = \sum_{k=1}^n A_{ik} B_{kj} = \langle A_{i:}, B_{:j} \rangle,$    (67)

i.e. the $(i, j)$ entry of $AB$ is the dot product between the $i$th row of $A$ and the $j$th column of $B$. Equivalently, the $j$th column of $AB$ is the result of multiplying $A$ and the $j$th column of $B$, and the $i$th row of $AB$ is the result of multiplying the $i$th row of $A$ and $B$:

    $(AB)_{:j} = A \, B_{:j}, \qquad (AB)_{i:} = A_{i:} \, B.$    (68)

Square matrices may have an inverse. If they do, the inverse is a matrix that reverses the effect of the matrix on any vector.

Definition 5.5 (Matrix inverse). The inverse of a square matrix $A \in \mathbb{R}^{n \times n}$ is a matrix $A^{-1} \in \mathbb{R}^{n \times n}$ such that

    $A A^{-1} = A^{-1} A = I.$    (69)

Lemma 5.6. If it exists, the inverse of a matrix is unique.

Proof. Let us assume there is another matrix $M$ such that $AM = I$. Then

    $M = A^{-1} A M$  by (69)    (70)
    $= A^{-1}.$    (71)
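A quick numerical illustration of Definitions 5.3-5.5 (a sketch; the matrix below is an arbitrary invertible example, not part of the original notes):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
A_inv = np.linalg.inv(A)

print(np.allclose(A @ A_inv, np.eye(2)))   # A A^{-1} = I, equation (69)
print(np.allclose(A_inv @ A, np.eye(2)))   # A^{-1} A = I
x = np.array([3.0, -1.0])
print(np.allclose(A_inv @ (A @ x), x))     # the inverse reverses the effect of A on any vector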

An important class of matrices is the class of orthogonal matrices.

Definition 5.7 (Orthogonal matrix). An orthogonal matrix is a square matrix whose inverse is equal to its transpose,

    $U^T U = U U^T = I.$    (72)

By definition, the columns $U_{:1}, U_{:2}, \ldots, U_{:n}$ of any orthogonal matrix have unit norm and are orthogonal to each other, so they form an orthonormal basis (it is somewhat confusing that orthogonal matrices are not called orthonormal matrices instead). We can interpret applying $U^T$ to a vector $x$ as computing the coefficients of its representation in the basis formed by the columns of $U$. Applying $U$ to $U^T x$ recovers $x$ by scaling each basis vector with the corresponding coefficient:

    $x = U U^T x = \sum_{i=1}^n \langle U_{:i}, x \rangle \, U_{:i}.$    (73)

Applying an orthogonal matrix to a vector does not affect its norm: it just rotates (or reflects) the vector.

Lemma 5.8 (Orthogonal matrices preserve the norm). For any orthogonal matrix $U \in \mathbb{R}^{n \times n}$ and any vector $x \in \mathbb{R}^n$,

    $\|Ux\|_2 = \|x\|_2.$    (74)

Proof. By the definition of an orthogonal matrix,

    $\|Ux\|_2^2 = x^T U^T U x$    (75)
    $= x^T x$    (76)
    $= \|x\|_2^2.$    (77)
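Lemma 5.8 and equation (73) can be checked numerically; the sketch below builds an orthogonal matrix as a 2D rotation (an arbitrary choice for illustration, not part of the original notes):

import numpy as np

theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation matrix, hence orthogonal

print(np.allclose(U.T @ U, np.eye(2)))            # equation (72)
x = np.array([2.0, -3.0])
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))   # norm preservation, equation (74)
print(np.allclose(U @ (U.T @ x), x))              # equation (73): recover x from its coefficients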

6 Eigendecomposition

An eigenvector $v \neq 0$ of a square matrix $A$ satisfies

    $A v = \lambda v$    (78)

for a scalar $\lambda$, which is the corresponding eigenvalue. Even if $A$ is real, its eigenvectors and eigenvalues can be complex.

Lemma 6.1 (Eigendecomposition). If a square matrix $A \in \mathbb{R}^{n \times n}$ has $n$ linearly independent eigenvectors $v_1, \ldots, v_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$, it can be expressed in terms of a matrix $Q$, whose columns are the eigenvectors, and a diagonal matrix $\Lambda$ containing the eigenvalues:

    $A = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}^{-1}$    (79)
    $= Q \Lambda Q^{-1}.$    (80)

Proof.

    $A Q = \begin{bmatrix} A v_1 & A v_2 & \cdots & A v_n \end{bmatrix}$    (81)
    $= \begin{bmatrix} \lambda_1 v_1 & \lambda_2 v_2 & \cdots & \lambda_n v_n \end{bmatrix}$    (82)
    $= Q \Lambda.$    (83)

As we will establish later on, if the columns of a square matrix are linearly independent then the matrix has an inverse, so multiplying both sides by $Q^{-1}$ on the right completes the proof.

Lemma 6.2. Not all matrices have an eigendecomposition.

Proof. Consider for example the matrix

    $\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}.$    (84)

Assume the matrix has a nonzero eigenvalue $\lambda$ corresponding to an eigenvector with entries $v_1$ and $v_2$. Then

    $\begin{bmatrix} v_2 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} \lambda v_1 \\ \lambda v_2 \end{bmatrix},$    (85)

which implies $v_2 = 0$ and hence $v_1 = 0$, since we have assumed $\lambda \neq 0$. Therefore the matrix does not have nonzero eigenvalues associated to nonzero eigenvectors. Its only eigenvalue is 0, and the corresponding eigenvectors must satisfy $v_2 = 0$, so they span only a one-dimensional subspace; the matrix therefore cannot have two linearly independent eigenvectors and has no eigendecomposition.
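Lemma 6.1 can be reproduced with np.linalg.eig, which returns the eigenvalues and a matrix whose columns are eigenvectors (a sketch with an arbitrary diagonalizable matrix, not part of the original notes):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, Q = np.linalg.eig(A)
Lambda = np.diag(eigvals)

print(np.allclose(A, Q @ Lambda @ np.linalg.inv(Q)))    # A = Q Lambda Q^{-1}, equation (80)
print(np.allclose(A @ Q[:, 0], eigvals[0] * Q[:, 0]))   # A v = lambda v, equation (78)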

An interesting use of the eigendecomposition is computing successive matrix products very fast. Assume that we want to compute

    $A A \cdots A x = A^k x,$    (86)

i.e. we want to apply $A$ to $x$ $k$ times. $A^k$ cannot be computed by taking the power of its entries (try out a simple example to convince yourself). However, if $A$ has an eigendecomposition,

    $A^k = Q \Lambda Q^{-1} \, Q \Lambda Q^{-1} \cdots Q \Lambda Q^{-1}$    (87)
    $= Q \Lambda^k Q^{-1}$    (88)
    $= Q \begin{bmatrix} \lambda_1^k & 0 & \cdots & 0 \\ 0 & \lambda_2^k & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n^k \end{bmatrix} Q^{-1},$    (89)

using the fact that for diagonal matrices applying the matrix repeatedly is equivalent to taking the power of the diagonal entries. This allows us to compute the $k$ matrix products using just 3 matrix products and taking the $k$th power of $n$ numbers.

From high-school or undergraduate algebra you probably remember how to compute eigenvectors using determinants. In practice, this is usually not a viable option due to stability issues. A popular technique to compute eigenvectors is based on the following insight. Let $A \in \mathbb{R}^{n \times n}$ be a matrix with eigendecomposition $Q \Lambda Q^{-1}$ and let $x$ be an arbitrary vector in $\mathbb{R}^n$. Since the columns of $Q$ are linearly independent, they form a basis for $\mathbb{R}^n$, so we can represent $x$ as

    $x = \sum_{i=1}^n \alpha_i Q_{:i}, \qquad \alpha_i \in \mathbb{R}, \; 1 \leq i \leq n.$    (90)

Now let us apply $A$ to $x$ $k$ times,

    $A^k x = \sum_{i=1}^n \alpha_i A^k Q_{:i}$    (91)
    $= \sum_{i=1}^n \alpha_i \lambda_i^k Q_{:i}.$    (92)

If we assume that the eigenvalues are ordered according to their magnitudes, that the magnitude of the first one is strictly larger than the rest, $|\lambda_1| > |\lambda_2| \geq \cdots$, and that $\alpha_1 \neq 0$ (which happens with high probability if we draw a random $x$), then as $k$ grows larger the term $\alpha_1 \lambda_1^k Q_{:1}$ dominates. The term will blow up or tend to zero unless we normalize every time before applying $A$. Adding the normalization step to this procedure results in the power method or power iteration, an algorithm of great importance in numerical linear algebra.

Algorithm 6.3 (Power method).
Input: A matrix $A$.
Output: An estimate of the eigenvector of $A$ corresponding to the eigenvalue with the largest magnitude.
Initialization: Set $x_1 := x / \|x\|_2$, where $x$ contains random entries.
For $i = 2, \ldots, k$, compute

    $x_i := \frac{A x_{i-1}}{\|A x_{i-1}\|_2}.$    (93)
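A direct implementation of Algorithm 6.3 (a sketch; the matrix is the one from equation (94) below, and the number of iterations k and the random seed are arbitrary choices made here):

import numpy as np

def power_method(A, k=50, seed=0):
    # Estimate the eigenvector associated with the largest-magnitude eigenvalue (Algorithm 6.3).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    x = x / np.linalg.norm(x)             # initialization
    for _ in range(k):
        x = A @ x
        x = x / np.linalg.norm(x)         # equation (93)
    return x

A = np.array([[0.930, 0.388],
              [0.237, 0.286]])            # the matrix in equation (94)
v = power_method(A)
print(v @ A @ v)                          # Rayleigh quotient; close to the dominant eigenvalue (about 1.05)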

[Figure 1: Illustration of the first three iterations of the power method for a matrix with eigenvectors $v_1$ and $v_2$, whose corresponding eigenvalues are $\lambda_1 = 1.05$ and $\lambda_2 = 0.1661$.]

Figure 1 illustrates the power method on a simple example, where the matrix, which was drawn at random, is equal to

    $A = \begin{bmatrix} 0.930 & 0.388 \\ 0.237 & 0.286 \end{bmatrix}.$    (94)

The convergence to the eigenvector corresponding to the eigenvalue with the largest magnitude is very fast.

7 Eigendecomposition of symmetric matrices

The following lemma shows that eigenvectors of a symmetric matrix corresponding to different nonzero eigenvalues are necessarily orthogonal.

Lemma 7.1. If $A \in \mathbb{R}^{n \times n}$ is symmetric and $u_i$, $u_j$ are eigenvectors of $A$ corresponding to different nonzero eigenvalues $\lambda_i \neq \lambda_j$, $\lambda_i, \lambda_j \neq 0$, then

    $u_i^T u_j = 0.$    (95)

Proof. Since $A = A^T$,

    $u_i^T u_j = \frac{1}{\lambda_i} (A u_i)^T u_j$    (96)
    $= \frac{1}{\lambda_i} u_i^T A^T u_j$    (97)
    $= \frac{1}{\lambda_i} u_i^T A u_j$    (98)
    $= \frac{\lambda_j}{\lambda_i} u_i^T u_j.$    (99)

Since $\lambda_j / \lambda_i \neq 1$, this is only possible if $u_i^T u_j = 0$.

It turns out that every $n \times n$ symmetric matrix has $n$ linearly independent eigenvectors. The proof of this is beyond the scope of these notes. An important consequence is that every symmetric matrix has an eigendecomposition of the form

    $A = U D U^T,$    (100)

where $U = \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix}$ is an orthogonal matrix.

The eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of a symmetric matrix can be positive, negative or zero. They determine the value of the quadratic form

    $q(x) := x^T A x = \sum_{i=1}^n \lambda_i \left( x^T u_i \right)^2.$    (101)

If we order the eigenvalues so that $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, then the first eigenvalue is the maximum value attained by the quadratic form when its input has unit $\ell_2$ norm, the second eigenvalue is the maximum value attained when we restrict the argument to be normalized and orthogonal to the first eigenvector, and so on.

Theorem 7.2. For any symmetric matrix $A \in \mathbb{R}^{n \times n}$ with normalized eigenvectors $u_1, u_2, \ldots, u_n$ and corresponding eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$,

    $\lambda_1 = \max_{\|u\|_2 = 1} u^T A u,$    (102)
    $u_1 = \arg\max_{\|u\|_2 = 1} u^T A u,$    (103)
    $\lambda_k = \max_{\|u\|_2 = 1, \; u \perp u_1, \ldots, u_{k-1}} u^T A u,$    (104)
    $u_k = \arg\max_{\|u\|_2 = 1, \; u \perp u_1, \ldots, u_{k-1}} u^T A u.$    (105)

The theorem is proved in Section C of the appendix.
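Theorem 7.2 can be explored numerically with np.linalg.eigh, which returns the eigenvalues of a symmetric matrix in ascending order together with orthonormal eigenvectors (a sketch with an arbitrary symmetric matrix, not part of the original notes):

import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 1.0]])           # symmetric

eigvals, U = np.linalg.eigh(A)            # ascending eigenvalues, orthonormal eigenvectors as columns
print(np.allclose(A, U @ np.diag(eigvals) @ U.T))    # A = U D U^T, equation (100)

lam_max = eigvals[-1]
u_max = U[:, -1]
print(np.isclose(u_max @ A @ u_max, lam_max))        # the top eigenvector attains the maximum in (102)

rng = np.random.default_rng(0)
u = rng.standard_normal(3)
u = u / np.linalg.norm(u)
print(u @ A @ u <= lam_max + 1e-12)                  # random unit vectors never exceed lambda_1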

A Proof of Theorem 1.6

We prove the claim by contradiction. Assume that we have two bases $\{x_1, \ldots, x_m\}$ and $\{y_1, \ldots, y_n\}$ such that $m < n$ (or such that the second set has infinite cardinality). The proof follows from applying the following lemma $m$ times (setting $r = 0, 1, \ldots, m-1$) to show that $\{y_1, \ldots, y_m\}$ spans V, and hence that $\{y_1, \ldots, y_n\}$ must be linearly dependent, which contradicts the assumption that it is a basis.

Lemma A.1. Under the assumptions of the theorem, if $\{y_1, y_2, \ldots, y_r, x_{r+1}, \ldots, x_m\}$ spans V, then $\{y_1, \ldots, y_{r+1}, x_{r+2}, \ldots, x_m\}$ also spans V (possibly after rearranging the indices $r+1, \ldots, m$), for $r = 0, 1, \ldots, m-1$.

Proof. Since $\{y_1, y_2, \ldots, y_r, x_{r+1}, \ldots, x_m\}$ spans V,

    $y_{r+1} = \sum_{i=1}^r \beta_i y_i + \sum_{i=r+1}^m \gamma_i x_i, \qquad \beta_1, \ldots, \beta_r, \gamma_{r+1}, \ldots, \gamma_m \in \mathbb{R},$    (106)

where at least one of the $\gamma_i$ is nonzero, as $\{y_1, \ldots, y_n\}$ is linearly independent by assumption. Without loss of generality (here is where we might need to rearrange the indices) we assume that $\gamma_{r+1} \neq 0$, so that

    $x_{r+1} = \frac{1}{\gamma_{r+1}} \left( y_{r+1} - \sum_{i=1}^r \beta_i y_i - \sum_{i=r+2}^m \gamma_i x_i \right).$    (107)

This implies that any vector in the span of $\{y_1, y_2, \ldots, y_r, x_{r+1}, \ldots, x_m\}$, i.e. in V, can be represented as a linear combination of vectors in $\{y_1, \ldots, y_{r+1}, x_{r+2}, \ldots, x_m\}$, which completes the proof.

B Proof of Theorem 2.5

If $\|x\|_{\langle \cdot, \cdot \rangle} = 0$, then $x = 0$ because the inner product is positive semidefinite, which implies $\langle x, y \rangle = 0$ and consequently that (19) holds with equality. The same is true if $\|y\|_{\langle \cdot, \cdot \rangle} = 0$.

Now assume that $\|x\|_{\langle \cdot, \cdot \rangle} \neq 0$ and $\|y\|_{\langle \cdot, \cdot \rangle} \neq 0$. By semidefiniteness of the inner product,

    $0 \leq \left\| \, \|y\|_{\langle \cdot, \cdot \rangle} \, x + \|x\|_{\langle \cdot, \cdot \rangle} \, y \, \right\|^2_{\langle \cdot, \cdot \rangle} = 2 \, \|x\|^2_{\langle \cdot, \cdot \rangle} \|y\|^2_{\langle \cdot, \cdot \rangle} + 2 \, \|x\|_{\langle \cdot, \cdot \rangle} \|y\|_{\langle \cdot, \cdot \rangle} \langle x, y \rangle,$    (108)

    $0 \leq \left\| \, \|y\|_{\langle \cdot, \cdot \rangle} \, x - \|x\|_{\langle \cdot, \cdot \rangle} \, y \, \right\|^2_{\langle \cdot, \cdot \rangle} = 2 \, \|x\|^2_{\langle \cdot, \cdot \rangle} \|y\|^2_{\langle \cdot, \cdot \rangle} - 2 \, \|x\|_{\langle \cdot, \cdot \rangle} \|y\|_{\langle \cdot, \cdot \rangle} \langle x, y \rangle.$    (109)

These inequalities establish (19).

Let us prove (20) by proving both implications.

($\Rightarrow$) Assume $\langle x, y \rangle = -\|x\|_{\langle \cdot, \cdot \rangle} \|y\|_{\langle \cdot, \cdot \rangle}$. Then the right-hand side of (108) equals zero, so $\|y\|_{\langle \cdot, \cdot \rangle} \, x = -\|x\|_{\langle \cdot, \cdot \rangle} \, y$ because the inner product is positive semidefinite.

($\Leftarrow$) Assume $\|y\|_{\langle \cdot, \cdot \rangle} \, x = -\|x\|_{\langle \cdot, \cdot \rangle} \, y$. Then one can easily check that (108) equals zero, which implies $\langle x, y \rangle = -\|x\|_{\langle \cdot, \cdot \rangle} \|y\|_{\langle \cdot, \cdot \rangle}$.

The proof of (21) is identical (using (109) instead of (108)).

C Proof of Theorem 7.2

The eigenvectors $u_1, \ldots, u_n$ are an orthonormal basis (they are mutually orthogonal and we assume that they have been normalized), so we can represent any unit-norm vector $h_k$ that is orthogonal to $u_1, \ldots, u_{k-1}$ as

    $h_k = \sum_{i=k}^n \alpha_i u_i,$    (110)

where

    $\|h_k\|_2^2 = \sum_{i=k}^n \alpha_i^2 = 1,$    (111)

by Lemma 4.5. Note that $h_1$ is just an arbitrary unit-norm vector.

Now we show that the value of the quadratic form, when the normalized input is restricted to be orthogonal to $u_1, \ldots, u_{k-1}$, cannot be larger than $\lambda_k$:

    $h_k^T A h_k = \sum_{i=1}^n \lambda_i \left( \sum_{j=k}^n \alpha_j \, u_i^T u_j \right)^2$  by (101) and (110)    (112)
    $= \sum_{i=k}^n \lambda_i \alpha_i^2$  because $u_1, \ldots, u_n$ is an orthonormal basis    (113)
    $\leq \lambda_k \sum_{i=k}^n \alpha_i^2$  because $\lambda_k \geq \lambda_{k+1} \geq \cdots \geq \lambda_n$    (114)
    $= \lambda_k$  by (111).    (115)

This establishes (102) and (104). To prove (103) and (105), we just need to show that $u_k$ achieves the maximum:

    $u_k^T A u_k = \sum_{i=1}^n \lambda_i \left( u_i^T u_k \right)^2$    (116)
    $= \lambda_k.$    (117)