Stat 206: Linear algebra
James Johndrow (adapted from Iain Johnstone's notes)
2016-11-02

Vectors

We have already been working with vectors, but let's review a few more concepts. The inner product of two vectors $x, y$ is

$$x'y = \sum_{j=1}^p x_j y_j,$$

which we will sometimes express as $\langle x, y \rangle$, and the angle $\theta$ formed by two vectors can be expressed in terms of inner products:

$$\cos(\theta) = \frac{x'y}{\sqrt{x'x}\,\sqrt{y'y}}.$$

Here's how to take the inner product in R.

```r
set.seed(17)
p <- 5
x <- matrix(rnorm(p),p,1) # p-vector with iid normal(0,1) entries
y <- matrix(rnorm(p),p,1)
xy <- t(x)%*%y # inner product of x and y
# compute angle
thet <- acos(xy/(sqrt(t(x)%*%x)*sqrt(t(y)%*%y)))
```

So for this example the inner product of $x$ and $y$ is about -0.23 and the angle between the vectors is about $\theta = 1.65$ radians. The usual interpretation in $\mathbb{R}^2$ carries over to $\mathbb{R}^p$, i.e. if $\theta = 0$ or $\pi$ then $x \parallel y$ and if $\theta = \pi/2$ then $x \perp y$. [1]

[1] The notation $x \parallel y$ means $x$ and $y$ are parallel, and $x \perp y$ means $x$ and $y$ are perpendicular.

The projection of a vector $x$ onto a vector $y$ is given by [2]

$$\mathrm{proj}(x, y) = \frac{yy'}{y'y}\,x = P_y x = \frac{(x'yy')'}{y'y} = \frac{x'y}{y'y}\,y,$$

where the last equality holds because $x'y$ is a scalar.

[2] The book uses the third expression in this display, but I find the first more intuitive. In particular, if you've done linear models, you'll recognize $(yy')/(y'y)$ as a special case of $X(X'X)^{-1}X'$, the perpendicular projection operator onto the column space of $X$, when $X = y$ is a vector instead of a matrix.

Every vector can be expressed as a linear combination of its projection onto $y$ and its projection onto the orthogonal complement of $y$, which is defined by

$$\left(I - \frac{yy'}{y'y}\right)x = (I - P_y)x.$$

It's clear that
$$x = \frac{yy'}{y'y}\,x + \left(I - \frac{yy'}{y'y}\right)x,$$

so that every $x$ can be written as a linear combination of its projections parallel to and perpendicular to $y$. Let's compute an example in R.

```r
x <- matrix(rnorm(p),p,1)
y <- .5*x + .5*matrix(rnorm(p),p,1)
Py <- (y%*%t(y))/c(t(y)%*%y) # projection matrix onto y
proj <- Py%*%x               # part of x parallel to y
orth <- (diag(p)-Py)%*%x     # part of x perpendicular to y
yproj <- t(y)%*%proj
yorth <- t(y)%*%orth
```

You'll note that $\langle P_y x, y \rangle \approx 3.93$ but that $\langle (I - P_y)x, y \rangle \approx 1.6653345 \times 10^{-16}$, so $P_y x$ and $(I - P_y)x$ really are the portions of $x$ parallel to and perpendicular to $y$.

Two vectors are said to be linearly dependent if one can be written as a scalar multiple of the other, i.e. $x$ and $y$ satisfy $x = cy$ for some $c \in \mathbb{R}$. If two vectors $x$ and $y$ are linearly dependent, then the projection of $x$ onto $y$ is just $x$, and the projection of $x$ onto the orthogonal complement of $y$ is the zero vector. [3] Thus, we can think of linearly dependent vectors as parallel. A collection of $n$ vectors $x^{(1)}, \ldots, x^{(n)}$ is linearly dependent if there exist an $i$ and constants $c_1, \ldots, c_n$ with $c_i \neq 0$ such that

$$c_i x^{(i)} = c_1 x^{(1)} + \ldots + c_{i-1} x^{(i-1)} + c_{i+1} x^{(i+1)} + \ldots + c_n x^{(n)},$$

that is, if at least one of them can be written as a linear combination of the others. A collection is said to be linearly independent if it is not linearly dependent, i.e. if none of its vectors can be written as a linear combination of the others.

Notice I am being pretty careful about using $\approx$ signs or saying "is approximately". Everything you do on the computer is in some sense approximate, since the computer only assigns limited memory to storing any number in decimal expansion. As a result, it cannot tell the difference between $0$ and, say, $10^{-10^6}$, or between $\infty$ and $10^{10^6}$. That's why I didn't round $\langle (I - P_y)x, y \rangle$, so you can see this in action. That number is actually zero, but the computer has finite precision, so some error is introduced when doing the calculations.
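The two points above (parallel vectors and finite precision) can be illustrated together. The following is a minimal sketch, not part of the original notes: the seed and the choice `y = -2x` are arbitrary assumptions made only for illustration. It projects `x` onto a vector that is an exact scalar multiple of `x`, and compares results with a tolerance via `all.equal` rather than with `==`.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
p <- 5
x <- matrix(rnorm(p), p, 1)
y <- -2 * x                          # y is a scalar multiple of x (linearly dependent)

Py   <- (y %*% t(y)) / c(t(y) %*% y) # projection matrix onto y
proj <- Py %*% x                     # should reproduce x up to rounding error
orth <- (diag(p) - Py) %*% x         # should be the zero vector up to rounding error

max(abs(proj - x))                   # tiny, but not necessarily exactly 0
max(abs(orth))                       # tiny, but not necessarily exactly 0

# Exact comparison is fragile in floating point; compare with a tolerance instead
naive_equal <- (0.1 + 0.2 == 0.3)    # FALSE in double precision
tol_equal   <- isTRUE(all.equal(0.1 + 0.2, 0.3))
same_vector <- isTRUE(all.equal(proj, x))
```

In practice this means checks such as "is this vector zero?" are better phrased as "is its largest entry below some small tolerance?".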
[3] This is why in the example above where I computed the projection of $x$ onto $y$ in R, I generated $y$ as a weighted average of $x$ and some random stuff; otherwise the projection would be close to zero. This hints at relationships between linear dependence and correlation, which we will get to soon.

Matrices

A matrix $X$ is a rectangular array of numbers. If a matrix has $n$ rows and $p$ columns, we say the matrix is $n \times p$. A matrix is square if $n = p$, and the diagonal of a square matrix consists of the elements $X_{ii}$ with the same row and column index. The transpose, $X'$, of an $n \times p$ matrix $X$ is the $p \times n$ matrix whose rows are formed by the columns of $X$. That is, the first row of $X'$ is the first column of $X$, the second row of $X'$ is the second column of $X$, and so on. Two matrices of the same dimension can be added by simply adding corresponding entries. If $X$ and $A$ are both $n \times p$ matrices, then the matrix $X + A$ has entries $(X + A)_{ij} = X_{ij} + A_{ij}$. Here's how to do some of these things in R.
```r
n <- 10
p <- 5
X <- matrix(rnorm(n*p),n,p)
Xt <- t(X) # transpose of X
A <- matrix(rnorm(n*p),n,p)
XA <- X+A # add X and A
D <- matrix(rnorm(n*n),n,n)
Ddiag <- diag(D) # get the diagonal of D
D2 <- diag(rnorm(n)) # an n by n diagonal matrix
```

The notion of rank is an important one.

Definition 1 (rank of a matrix). Let $A$ be a real-valued $n \times p$ matrix. The row rank $\mathrm{rank}_r(A)$ of $A$ is the maximal number of linearly independent rows of the matrix. The column rank $\mathrm{rank}_c(A)$ is the maximal number of linearly independent columns. The row rank and column rank are always equal (and therefore one just refers to the rank of a matrix). A matrix $A$ is full rank if $\mathrm{rank}(A) = \min(n, p)$.

Matrix multiplication is in some ways analogous to multiplication of real numbers, but does not obey all of the same rules. First, only matrices of conformable dimension may be multiplied. An $n \times p$ matrix $X$ and a $p \times m$ matrix $A$ may be multiplied in the order $XA$ because the column dimension of $X$ matches the row dimension of $A$. The result is an $n \times m$ matrix with entries

$$(XA)_{ij} = \sum_{k=1}^p X_{ik} A_{kj} = \langle X_{[i,]}, A_{[,j]} \rangle,$$

so that the $i, j$ element of the product is formed by taking the inner product of the vectors $X_{[i,]}$ and $A_{[,j]}$ formed by the $i$th row of $X$ and the $j$th column of $A$. N.B.: $p$-vectors are $p \times 1$ matrices, and their transposes are $1 \times p$ matrices. So if $X$ is $n \times p$ and $y$ is $p \times 1$, then the product $Xy$ is an $n \times 1$ matrix (an $n$-vector).

Matrix multiplication does not commute, that is, in general, $XA \neq AX$. For rectangular matrices, often only one direction makes sense, since we can multiply an $n \times p$ matrix $X$ by a $p \times m$ matrix $A$ but (unless $m = n$) not a $p \times m$ matrix $A$ by an $n \times p$ matrix $X$, since the column dimension of $A$ does not match the row dimension of $X$. Of course, if $A$ and $X$ are both square $p \times p$ matrices, then we can multiply in either direction. Even then, it is still not the case in general that $AX = XA$. Two matrices are said to commute if and only if $AX = XA$. When matrices commute, it simplifies a lot of calculations, but usually the matrices we will be working with will not commute.
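The rules for matrix multiplication above can be tried out directly. This is a small sketch under assumed dimensions (`n = 4`, `p = 3`, `m = 2`) and hand-picked 2-by-2 matrices `B` and `C`, none of which come from the notes. It checks that an entry of the product is an inner product of a row and a column, that a non-conformable product raises an error, and that two square matrices need not commute.

```r
set.seed(2)                               # arbitrary seed
n <- 4; p <- 3; m <- 2                    # illustrative dimensions
X <- matrix(rnorm(n * p), n, p)
A <- matrix(rnorm(p * m), p, m)

XA <- X %*% A                             # conformable: (n x p) times (p x m) is n x m

# Entry (i, j) of XA is the inner product of row i of X and column j of A
i <- 2; j <- 1
entry <- sum(X[i, ] * A[, j])
entry_matches <- isTRUE(all.equal(XA[i, j], entry))

# A %*% X is not conformable here: ncol(A) = 2 but nrow(X) = 4
res <- try(A %*% X, silent = TRUE)
not_conformable <- inherits(res, "try-error")

# Square matrices can be multiplied in both orders, but the results differ in general
B <- matrix(c(1, 2, 3, 4), 2, 2)          # filled column by column
C <- matrix(c(0, 1, 1, 0), 2, 2)          # swaps columns (B %*% C) or rows (C %*% B)
BC <- B %*% C
CB <- C %*% B
```

Printing `BC` and `CB` shows the same four numbers arranged differently, which is a concrete instance of $AX \neq XA$.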
The $p$-dimensional identity matrix $I_p$ is a $p \times p$ matrix with all of its diagonal entries equal to one and all of its off-diagonal entries equal to zero, i.e.

$$I_p = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.$$

We will often drop the subscript $p$ when the dimension of $I$ is clear. The matrix $I$ is a multiplicative identity. So if $X$ is $n \times p$, $XI = X$, and $IX' = X'$ for the $p \times p$ identity $I$. This holds in general for any matrix/vector for which the dimensions allow multiplication to happen.

Two other definitional/notational things. A (square) matrix $A$ is said to be symmetric if $A = A'$, and $A$ is said to be orthogonal if $AA' = I$.

Multiplicative inverses also exist, though again there are important differences with the one-dimensional case. First, the inverse is only defined for square matrices. A square matrix $A$ has an inverse if and only if there exists a square matrix $B$ for which $AB = BA = I$, from which we may deduce that matrices and their inverses commute, and that the inverse of an orthogonal matrix is its transpose. In this case we write $B = A^{-1}$, so the usual notation for the inverse of $A$ will be $A^{-1}$. Not all square matrices have an inverse, but when they do, the inverse is unique. A square matrix $A$ has an inverse if and only if its columns are linearly independent. Another useful property is the following.

Remark 1 (inverse of product). Suppose $A$ and $B$ are both invertible $p \times p$ matrices. Then $(AB)^{-1} = B^{-1}A^{-1}$.

Proof. Suppose $C = (AB)^{-1}$. Since

$$ABB^{-1}A^{-1} = AIA^{-1} = AA^{-1} = I,$$
$$B^{-1}A^{-1}AB = B^{-1}IB = B^{-1}B = I,$$

and inverses are unique, it follows that $C = B^{-1}A^{-1}$.
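Remark 1 is easy to check numerically. The sketch below is illustrative only: the matrices are built as `crossprod(M) + diag(p)`, an assumption made purely so that they are guaranteed invertible and well-conditioned, and the seed is arbitrary. It compares $(AB)^{-1}$ with $B^{-1}A^{-1}$ and also with the wrong order $A^{-1}B^{-1}$.

```r
set.seed(3)                                   # arbitrary seed
p <- 4

# Assumption for illustration: crossprod(M) + I is symmetric positive definite,
# hence invertible, so solve() is numerically safe here.
A <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)
B <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)

lhs   <- solve(A %*% B)                       # (AB)^{-1}
rhs   <- solve(B) %*% solve(A)                # B^{-1} A^{-1}, as in Remark 1
wrong <- solve(A) %*% solve(B)                # A^{-1} B^{-1} = (BA)^{-1}, a different matrix in general

max(abs(lhs - rhs))                           # rounding error only
max(abs(lhs - wrong))                         # not small unless A and B commute
```

The order reversal matters precisely because matrix multiplication does not commute.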
We usually compute matrix inverses in R, since the number of operations necessary to invert a matrix is large. Here, I generate a random matrix $X$, form $S^{(n)} = n^{-1}X'X$, and compute its inverse.

```r
X <- matrix(rnorm(n*p),n,p)
Sn <- n^(-1)*t(X)%*%X
Sninv <- solve(Sn) # inverse of Sn
tst <- Sn%*%Sninv
maxdiff <- max(abs(tst-diag(p)))
```

We can check this worked by checking how different $S^{(n)}(S^{(n)})^{-1}$ is from $I_p$. In this case, the maximum entrywise difference (in absolute value) between $S^{(n)}(S^{(n)})^{-1}$ and $I_p$ is $3.3306691 \times 10^{-16}$.

Inverting a $p \times p$ matrix is generally an $O(p^3)$ operation, so methods that require computing the inverse will scale poorly in $p$. There are various strategies for improving scalability, mainly by using methods or approximations that don't require explicitly forming the inverse of a general $p \times p$ matrix. For example, it is easy to invert a diagonal matrix: the inverse is simply a diagonal matrix with entries given by the reciprocals of the entries of the original matrix. [4]

[4] Try this for a 3-by-3 example.

Perhaps the most useful matrix results in multivariate statistics have to do with eigenvalues and eigenvectors. The eigenvalues of a $p \times p$ square matrix $A$ are the values of $\lambda$ that solve the equation

$$Ax = \lambda x,$$

where $x \in \mathbb{R}^p$ is a non-zero $p$-vector and $\lambda$ is a scalar. The vectors $x$ for which there exists $\lambda$ satisfying the eigenvalue equation are the eigenvectors. Clearly, if $x$ is an eigenvector with (real) eigenvalue $\lambda$, then for any non-zero real number $c$, $cx$ is also an eigenvector with the same eigenvalue $\lambda$. Since eigenvectors are only determined up to a multiplicative constant, it is typical to normalize eigenvectors to have length 1, i.e. so that $x'x = 1$, and denote these normalized eigenvectors by $e$. Every square, symmetric $p \times p$ matrix has $p$ pairs of eigenvalues and eigenvectors $(e^{(1)}, \lambda_1), \ldots, (e^{(p)}, \lambda_p)$. The eigenvectors can be chosen to be mutually orthogonal (so $(e^{(j)})'e^{(k)} = 0$ for every pair $j \neq k$).
The normalized eigenvectors are unique (up to sign) unless two or more of the eigenvalues are equal.

In statistics, we are often concerned with positive definite matrices.

Definition 2 (positive definite matrix). A square $p \times p$ matrix $A$ is positive definite if and only if the quadratic form

$$x'Ax > 0$$

for every non-zero $p$-vector $x$, where a non-zero vector is any vector for which at least one entry is not zero. $A$ is positive semi-definite if and only if

$$x'Ax \geq 0$$

for every non-zero $p$-vector $x$.

One reason to care about positive definite matrices is the following, given without proof. [5]

Remark 2 (sample covariance). Suppose $x^{(1)}, \ldots, x^{(n)}$ are a random sample with common mean $\mu$ and positive-definite covariance $\Sigma$. Then if $n > p$, the sample covariance $S_n$ is positive definite.

Another important fact is that a symmetric positive semi-definite matrix is invertible if and only if it is positive definite. [6] If a matrix $A$ is symmetric and positive definite, there exists a decomposition of $A$ into a matrix $U$ with columns consisting of the eigenvectors and a diagonal matrix $\Lambda$ with diagonal entries given by the eigenvalues.

[5] Technical note: the phrase "almost surely" should be added to the end of this remark. This is entirely irrelevant, but for the reader who has some familiarity with measure theory I wanted to be complete.

[6] Positive semi-definite matrices have pseudoinverses (if interested, there is a decent page on Wikipedia).

Theorem 1 (spectral decomposition). Suppose $A$ is symmetric and positive semi-definite. Then

$$A = U \Lambda U',$$

where $U$ is a $p \times p$ orthogonal matrix whose $j$th column is the eigenvector $e^{(j)}$, and $\Lambda_{jj} = \lambda_j$, the $j$th eigenvalue of $A$. Since $U$ is orthogonal, we can also write this as $A = U \Lambda U^{-1}$.

Spectral decompositions are very useful. For example, using Remark 1 and orthogonality of $U$, we have that if $A = U \Lambda U'$ is the spectral decomposition of a positive definite matrix $A$, then

$$A^{-1} = U \Lambda^{-1} U'.$$

Since $\Lambda$ is a diagonal matrix, its inverse is just the diagonal matrix with entries $(\Lambda^{-1})_{jj} = \lambda_j^{-1}$, which is easy to compute. Moreover, this implies that the eigenvalues of the inverse $A^{-1}$ are the reciprocals of the eigenvalues of $A$, so that the largest eigenvalue of $A^{-1}$ is the reciprocal of the smallest eigenvalue of $A$, and the smallest eigenvalue of $A^{-1}$ is the reciprocal of the largest eigenvalue of $A$.
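The spectral decomposition and the resulting formula for the inverse can be verified with base R's `eigen`. This is a sketch under illustrative assumptions: the matrix is a scaled cross-product $X'X/n$ of a random $50 \times 4$ matrix (so it is symmetric and, almost surely, positive definite), and the seed and dimensions are arbitrary.

```r
set.seed(4)                                  # arbitrary seed
n <- 50; p <- 4                              # illustrative dimensions with n > p
X <- matrix(rnorm(n * p), n, p)
A <- crossprod(X) / n                        # X'X / n: symmetric, positive definite (a.s.)

eig <- eigen(A, symmetric = TRUE)
U   <- eig$vectors                           # columns are normalized eigenvectors
lam <- eig$values                            # eigenvalues, in decreasing order

A_rebuilt <- U %*% diag(lam) %*% t(U)        # U Lambda U'
A_inv     <- U %*% diag(1 / lam) %*% t(U)    # U Lambda^{-1} U'

max(abs(t(U) %*% U - diag(p)))               # U is orthogonal
max(abs(A_rebuilt - A))                      # decomposition reproduces A
max(abs(A_inv - solve(A)))                   # agrees with the direct inverse

# Eigenvalues of the inverse are the reciprocals of the eigenvalues of A
lam_inv <- eigen(solve(A), symmetric = TRUE)$values
max(abs(lam_inv - sort(1 / lam, decreasing = TRUE)))
```

Note that `eigen` returns eigenvectors only up to sign, so `U` may differ by column signs across platforms while `U %*% diag(lam) %*% t(U)` stays the same.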
Positive definite matrices also have square roots. A square root $A^{1/2}$ of a symmetric matrix $A$ is any matrix $B$ for which $BB = A$; for positive-definite matrices, there is a unique square root $A^{1/2}$ that is also positive definite (although there are multiple square roots). One way to obtain the square root of a positive definite matrix $A$ is from its spectral decomposition:
Remark 3. Suppose $A = U \Lambda U'$ is the spectral decomposition of $A$. Then $U \Lambda^{1/2} U'$ is a square root of $A$, where $\Lambda^{1/2}$ is the diagonal matrix with entries $(\Lambda^{1/2})_{jj} = \lambda_j^{1/2}$.

Proof.

$$U \Lambda^{1/2} U' U \Lambda^{1/2} U' = U \Lambda^{1/2} I \Lambda^{1/2} U' = U \Lambda^{1/2} \Lambda^{1/2} U' = U \Lambda U' = A.$$

From this it is clear that if $A$ has spectral decomposition $U \Lambda U'$, then $A^{1/2}$ has spectral decomposition $A^{1/2} = U \Lambda^{1/2} U'$.

Eigenvalues also give us useful bounds on the size of quadratic forms:

Remark 4 (quadratic form eigenvalue inequalities). Let $\lambda_1, \ldots, \lambda_p$ be the eigenvalues of a symmetric matrix $A$ in decreasing order. Then for every $x \in \mathbb{R}^p$,

$$\lambda_p \|x\|_2^2 \leq x'Ax \leq \lambda_1 \|x\|_2^2.$$

Since $\|A^{1/2}x\|_2^2 = x'Ax$, the eigenvalues of $A$ give us bounds on the amount by which $A^{1/2}$ elongates any vector.

Two important matrix quantities are the trace and determinant. Determinants and traces appear in the densities of multivariate distributions. The trace of a square matrix $A$ is the sum of its diagonal elements:

$$\mathrm{tr}(A) = \sum_{j=1}^p A_{jj}.$$

Some key properties of the trace are

$$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$$
$$\mathrm{tr}(cA) = c\,\mathrm{tr}(A)$$
$$\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA) \quad \text{(the cyclic property)}$$
$$\mathrm{tr}(A) = \sum_j \lambda_j,$$

so the trace is the sum of the eigenvalues.

The determinant of a square $p \times p$ matrix is defined recursively by

$$|A| = \sum_{j=1}^p a_{1j}\, |A_{[-1,-j]}|\, (-1)^{1+j},$$

where $A_{[-1,-j]}$ is the submatrix obtained by deleting the first row and $j$th column of $A$, and $|A| = a_{11}$ if $A$ is scalar. There is a connection with eigenvalues via the characteristic polynomial

$$g_A(t) = |tI - A|,$$

a polynomial in $t$ whose roots are the eigenvalues of $A$. Although not the formal definition, for our purposes it is often useful to think of the determinant of a square matrix $A$ as the product of its eigenvalues:

$$|A| = \prod_j \lambda_j.$$

A couple of useful properties of the determinant are

$$|A^{-1}| = |A|^{-1}$$
$$\log |A| = \sum_{j=1}^p \log(\lambda_j).$$

There are matrix decompositions in addition to the spectral decomposition that often prove useful in applied statistics. Suppose $X$ is an $n \times p$ real matrix. Let $m = \min(n, p)$. The singular value decomposition of $X$ is given by

$$X = UDV,$$

where $U$ is an $n \times m$ matrix with orthonormal columns ($U'U = I$), $D$ is an $m \times m$ diagonal matrix, and $V$ is an $m \times p$ matrix with orthonormal rows ($VV' = I$). The diagonal elements of $D$ are referred to as the singular values of $X$. The singular value decomposition and spectral decomposition are related, since

$$(UDV)'UDV = V'DU'UDV = V'DIDV = V'D^2V,$$

where $D^2$ is the product $DD$. When $n \geq p$, $V$ is a $p \times p$ orthogonal matrix, so it follows that $D^2 = \Lambda$ in the spectral decomposition of $X'X$ when $X'X$ is positive-definite (which requires $n \geq p$, and holds almost surely for a random $X$ like the one below). You can see this for yourself in R. [7]

```r
library(ggplot2)
X <- matrix(rnorm(n*p),n,p)
XX <- t(X)%*%X
s <- svd(X)
u <- eigen(XX) # spectral decomposition of X'X
df <- data.frame(lam=u$values,d2=s$d^2)
ggplot(df,aes(x=lam,y=d2)) + geom_point()
```

[7] What happens when $n < p$? In fact, a lot of this still makes sense, except that there are at most $n$ non-zero eigenvalues, which are the squares of the diagonal elements of $D$. We can still write a spectral decomposition of $X'X$, too, but in this case $X'X$ is positive semi-definite, not positive-definite.
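The $n < p$ case raised in footnote 7 can be explored with a small example. The sketch below assumes illustrative dimensions `n = 3`, `p = 5` and an arbitrary seed. One detail worth noting is that R's `svd` returns a matrix `v` satisfying `X = u %*% diag(d) %*% t(v)`, so the $V$ in the notation $X = UDV$ used above corresponds to `t(s$v)`.

```r
set.seed(5)                                   # arbitrary seed
n <- 3; p <- 5                                # illustrative dimensions with n < p
X <- matrix(rnorm(n * p), n, p)

s <- svd(X)                                   # s$u: n x m, s$d: length m, s$v: p x m, with m = min(n, p)
X_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)     # U D V in the notation above, with V = t(s$v)

lam <- eigen(crossprod(X), symmetric = TRUE)$values   # the p eigenvalues of X'X

max(abs(X_rebuilt - X))                       # decomposition reproduces X
max(abs(lam[1:n] - s$d^2))                    # leading n eigenvalues equal squared singular values
max(abs(lam[(n + 1):p]))                      # remaining p - n eigenvalues are numerically zero
```

The trailing eigenvalues come out as tiny positive or negative numbers rather than exact zeros, which is the finite-precision effect discussed earlier.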
Figure 1: eigenvalues of $X'X$ (lam, horizontal axis) plotted against the squared singular values (d2, vertical axis).

Note that the number of non-zero singular values is equal to the rank of $X$.

Another useful decomposition is the Cholesky decomposition.

Theorem 2. Suppose $\Sigma$ is a symmetric and positive-definite matrix. Then there exists a unique invertible, lower triangular matrix $L$ with positive diagonal entries, referred to as the Cholesky factor, such that $\Sigma = LL'$.

Vector and matrix calculus can be very useful in multivariate statistics, and we'll briefly review some key results here. Suppose $x$ is a vector and $A$ a matrix. Then

$$\frac{\partial}{\partial x} Ax = A'$$
$$\frac{\partial}{\partial x} x'A = A$$
$$\frac{\partial}{\partial x} x'x = 2x$$
$$\frac{\partial}{\partial x} x'Ax = Ax + A'x.$$

The Jacobian matrix associated with a transformation $f : \mathbb{R}^p \to \mathbb{R}^q$ is the $q \times p$ matrix with entries

$$J_f(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_p} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_q}{\partial x_1} & \frac{\partial f_q}{\partial x_2} & \cdots & \frac{\partial f_q}{\partial x_p} \end{pmatrix},$$

where $f(x) = (f_1(x), \ldots, f_q(x))'$ is the (vector) output of the function $f$. The Jacobian is the matrix form of the total derivative of the function $f$, familiar from multivariate calculus. We can express the chain rule for vector functions as

$$J_{f \circ g}(x) = J_f(g(x))\,J_g(x),$$

that is, the total derivative of the composition of functions $(f \circ g)(x) = f(g(x))$ is the product of the total derivative of $f$ evaluated at $g(x)$ and the total derivative of $g$. Here's a simple example:

$$\frac{\partial}{\partial x} \log(x'Ax) = \frac{2Ax}{x'Ax},$$

assuming $A$ is symmetric and positive definite.
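The chain rule example can be sanity-checked against a finite-difference approximation. This is only a sketch: the matrix `A` is an assumed symmetric positive definite matrix built as `crossprod(M) + diag(p)`, the point `x` is random, and the step size `h` is an illustrative choice rather than a tuned value.

```r
set.seed(6)                                   # arbitrary seed
p <- 4
M <- matrix(rnorm(p * p), p, p)
A <- crossprod(M) + diag(p)                   # assumed symmetric positive definite matrix
x <- rnorm(p)

f <- function(x) log(c(t(x) %*% A %*% x))     # f(x) = log(x'Ax)

# Gradient from the formula 2Ax / (x'Ax)
grad_formula <- c(2 * A %*% x) / c(t(x) %*% A %*% x)

# Central finite differences, one coordinate at a time
h <- 1e-6
grad_numeric <- sapply(seq_len(p), function(j) {
  e <- rep(0, p)
  e[j] <- h
  (f(x + e) - f(x - e)) / (2 * h)
})

max(abs(grad_formula - grad_numeric))         # small, limited by finite-difference error
```

Agreement to several decimal places is what one would expect; exact agreement is not, since finite differences are themselves approximate.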
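Finally, a couple of results on derivatives of trace and determinant. Suppose $A$ is a $p \times p$ symmetric and positive-definite matrix, and $B$ is a $p \times p$ matrix. Then

$$\frac{\partial\, \mathrm{tr}(AB)}{\partial A} = B', \qquad \frac{\partial |A|}{\partial A} = |A|\,A^{-1}, \qquad \frac{\partial\, \mathrm{tr}(A^{-1}B)}{\partial A} = -A^{-1}B'A^{-1},$$

where the last expression reduces to $-B'A^{-2}$ when $A$ and $B$ commute.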
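These matrix derivatives can be checked entry by entry with finite differences. In the sketch below the helper `num_grad`, the seed, the dimension, and the step size are all hypothetical choices made for illustration. Each entry of $A$ is perturbed independently (ignoring the symmetry constraint), and the check for $\mathrm{tr}(A^{-1}B)$ uses the general form $-A^{-1}B'A^{-1}$ for symmetric $A$.

```r
set.seed(7)                                   # arbitrary seed
p <- 3
M <- matrix(rnorm(p * p), p, p)
A <- crossprod(M) + diag(p)                   # assumed symmetric positive definite matrix
B <- matrix(rnorm(p * p), p, p)

tr <- function(M) sum(diag(M))                # trace of a square matrix

# Hypothetical helper: central finite-difference derivative of a scalar function f
# with respect to each entry of A, perturbing one entry at a time
num_grad <- function(f, A, h = 1e-6) {
  G <- matrix(0, nrow(A), ncol(A))
  for (i in seq_len(nrow(A))) {
    for (j in seq_len(ncol(A))) {
      E <- matrix(0, nrow(A), ncol(A))
      E[i, j] <- h
      G[i, j] <- (f(A + E) - f(A - E)) / (2 * h)
    }
  }
  G
}

g_tr  <- num_grad(function(A) tr(A %*% B), A)
g_det <- num_grad(function(A) det(A), A)
g_inv <- num_grad(function(A) tr(solve(A) %*% B), A)

Ainv <- solve(A)
max(abs(g_tr  - t(B)))                        # d tr(AB) / dA = B'
max(abs(g_det - det(A) * Ainv))               # d |A| / dA = |A| A^{-1} for symmetric A
max(abs(g_inv + Ainv %*% t(B) %*% Ainv))      # d tr(A^{-1}B) / dA = -A^{-1} B' A^{-1}
```

All three differences should be near zero, up to the accuracy of the finite-difference approximation.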