B553 Lecture 5: Matrix Algebra Review

Kris Hauser
January 19, 2012

We have seen in prior lectures how vectors represent points in $\mathbb{R}^n$ and gradients of functions. Matrices represent linear transformations of vector quantities. This lecture presents standard matrix notation, conventions, and basic identities that will be used throughout this course. From this point on we also drop the boldface notation for vectors, and it will remain this way for the rest of the class.

1 Matrices

A matrix $A$ represents a linear transformation from an $n$-dimensional vector space to an $m$-dimensional one. It is given by an $m \times n$ array of real numbers. Matrices are usually denoted by uppercase letters (e.g., $A$, $B$, $C$), with the entry in the $i$th row and $j$th column written with the subscript $i,j$, or, when it is unambiguous, $ij$ (e.g., $A_{1,2}$ or $A_{12}$):

$$A = \begin{pmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{pmatrix} \qquad (1)$$
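These conventions map directly onto the array types provided by numerical libraries (several are listed in Section 4.1). A minimal NumPy sketch, purely for illustration, of an $m \times n$ matrix and its entry indexing; note that NumPy indexes from 0 rather than 1:

```python
import numpy as np

# A 2x3 matrix (m=2 rows, n=3 columns) stored as a NumPy array.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(A.shape)   # (2, 3), i.e. (m, n)
# NumPy indexes from 0, so the entry written A_{1,2} in these notes
# (row 1, column 2) is A[0, 1] in code.
print(A[0, 1])   # 2.0
```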

1.1 Matrix-Vector Product

An $m \times n$ matrix $A$ transforms vectors $x = (x_1, \ldots, x_n)$ into $m$-dimensional vectors $y = (y_1, \ldots, y_m) = Ax$ as follows:

$$y_1 = \sum_{j=1}^n A_{1j} x_j, \quad \ldots, \quad y_m = \sum_{j=1}^n A_{mj} x_j$$

Or, more concisely, $y_i = \sum_{j=1}^n A_{ij} x_j$ for $i = 1, \ldots, m$. (Note that matrix-vector multiplication is not symmetric, so $xA$ is an invalid operation.)

Linearity of matrix-vector multiplication. We can see that matrix-vector multiplication is linear, that is, $A(ax + by) = aAx + bAy$ for all scalars $a, b$ and vectors $x, y$. It is also linear in terms of component-wise addition and scalar multiplication of matrices, as long as the matrices are of the same size. More precisely, if $A$ and $B$ are both $m \times n$ matrices, then $(aA + bB)x = aAx + bBx$ for all $a$, $b$, and $x$.

Identity matrix. One special matrix that occurs frequently is the $n \times n$ identity matrix $I_n$, which has 0's in all off-diagonal positions $I_{ij}$ with $i \neq j$, and 1's in all diagonal positions $I_{ii}$. It is significant because $I_n x = x$ for all $x \in \mathbb{R}^n$.

1.2 Matrix Product

When two linear transformations are performed one after the other, the result is also a linear transformation. Suppose $A$ is $m \times n$, $B$ is $n \times p$, and $x$ is a $p$-dimensional vector, and consider the result of $A(Bx)$ (that is, first multiplying by $B$ and then multiplying the result by $A$). We see that

$$Bx = \Big( \sum_{j=1}^p B_{1j} x_j, \; \ldots, \; \sum_{j=1}^p B_{nj} x_j \Big) \qquad (2, 3)$$

and, for an $n$-dimensional vector $y$,

$$Ay = \Big( \sum_{k=1}^n A_{1k} y_k, \; \ldots, \; \sum_{k=1}^n A_{mk} y_k \Big) \qquad (4)$$
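The definitions above are easy to sanity-check numerically. Here is a minimal NumPy sketch; the particular random matrices and the seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # an m x n matrix with m=3, n=4
x = rng.standard_normal(4)
z = rng.standard_normal(4)
a, b = 2.0, -0.5

# y_i = sum_j A_ij x_j, computed by the library as A @ x.
y = A @ x
y_manual = np.array([np.dot(A[i, :], x) for i in range(A.shape[0])])
assert np.allclose(y, y_manual)

# Linearity: A(ax + bz) = a(Ax) + b(Az).
assert np.allclose(A @ (a * x + b * z), a * (A @ x) + b * (A @ z))

# Identity matrix: I_n x = x.
assert np.allclose(np.eye(4) @ x, x)
```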

So

$$A(Bx) = \Big( \sum_{k=1}^n A_{1k} \Big( \sum_{j=1}^p B_{kj} x_j \Big), \; \ldots, \; \sum_{k=1}^n A_{mk} \Big( \sum_{j=1}^p B_{kj} x_j \Big) \Big). \qquad (5)$$

Rearranging the summations, we see that

$$A(Bx) = \Big( \sum_{j=1}^p \Big( \sum_{k=1}^n A_{1k} B_{kj} \Big) x_j, \; \ldots, \; \sum_{j=1}^p \Big( \sum_{k=1}^n A_{mk} B_{kj} \Big) x_j \Big). \qquad (6)$$

In other words, we would have $A(Bx) = Cx$ if we were to form a matrix $C$ such that

$$C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}. \qquad (7)$$

This is exactly the definition of the matrix product, and we say $C = AB$. The entry $C_{ij}$ can also be obtained by taking the dot product of the $i$th row of $A$ and the $j$th column of $B$.

Matrix product is associative but not symmetric. By the above derivation we can drop the parentheses: $A(Bx) = (AB)x$. So matrix-vector and matrix-matrix multiplication are associative. Note again, however, that matrix-matrix multiplication is not symmetric, that is, $AB \neq BA$ in general.

Column and row vectors. Note that if we write an $n$-dimensional vector $x$ stacked into an $n \times 1$ matrix $x$ (denoted in lowercase), we can turn the matrix-vector product $y = Ax$ into the matrix product $y = Ax$. Here, if $A$ is an $m \times n$ matrix, then $y$ is an $m \times 1$ matrix:

$$\begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} = \begin{pmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \qquad (8)$$

Hence, there is a one-to-one correspondence between vectors and matrices with one column. These matrices are called column vectors and will be our default notation for vectors throughout the rest of the course. We will occasionally also deal with row vectors, which are matrices with a single row.
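A short NumPy check of the matrix-product formula and the column-vector correspondence; the example matrices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))   # m x n
B = rng.standard_normal((4, 2))   # n x p
x = rng.standard_normal(2)        # p-dimensional

# C = AB with C_ij = sum_k A_ik B_kj (row i of A dotted with column j of B).
C = A @ B
C_manual = np.array([[np.dot(A[i, :], B[:, j]) for j in range(B.shape[1])]
                     for i in range(A.shape[0])])
assert np.allclose(C, C_manual)

# Associativity: A(Bx) == (AB)x.  (BA is not even defined here: shapes don't match.)
assert np.allclose(A @ (B @ x), C @ x)

# Column-vector view: reshaping x to a p x 1 matrix gives an m x 1 result.
x_col = x.reshape(-1, 1)          # 2 x 1 column vector
y_col = (A @ B) @ x_col           # 3 x 1 column vector
assert y_col.shape == (3, 1)
```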

1.3 Transpose

The transpose $A^T$ of a matrix $A$ simply switches $A$'s rows and columns:

$$(A^T)_{ij} = A_{ji}. \qquad (9)$$

If $A$ is $m \times n$, then $A^T$ is $n \times m$.

Symmetric matrix. If $A = A^T$, then $A$ is symmetric.

1.4 Matrix Inverse

An inverse $A^{-1}$ of an $n \times n$ square matrix $A$ is a matrix that satisfies the equation

$$AA^{-1} = A^{-1}A = I_n \qquad (10)$$

where $I_n$ is the identity matrix. Not all square matrices have an inverse, in which case we say $A$ is not invertible (or singular). Invertible matrices are significant because the unique solution $x$ to the system of linear equations $Ax = b$ is simply $A^{-1}b$. This holds for any $b$. If the matrix is not invertible, then such an equation may or may not have a solution.

Orthogonal matrix. An orthogonal matrix is a square matrix that satisfies $AA^T = I_n$. In other words, its transpose is its inverse.

1.5 Matrix identities

Identities involving the transpose:

- $(cA)^T = cA^T$ for any real value $c$.
- $(A + B)^T = A^T + B^T$.
- $(AB)^T = B^T A^T$.
- All $1 \times 1$ matrices are symmetric, the identity matrix is symmetric, and all uniform scalings of a symmetric matrix are symmetric.
- $A + A^T$ is symmetric.
- The dot product $x \cdot y$ is equal to $x^T y$, with $x$ and $y$ denoting the column vector representations of $x$ and $y$, respectively.
- $x^T A y = y^T A^T x$, with $x$ and $y$ column vectors.
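These identities, and the role of the inverse, can be verified numerically. A minimal sketch, assuming the randomly generated $A$ is invertible (true with probability 1) and using a QR factorization only as a convenient source of an orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
y = rng.standard_normal(3)

# Transpose identities.
assert np.allclose((A @ B).T, B.T @ A.T)
assert np.allclose((A + B).T, A.T + B.T)
assert np.allclose(np.dot(x, y), x.T @ y)   # dot product as x^T y

# Inverse: the unique solution of Ax = b is A^{-1} b (when A is invertible).
b = rng.standard_normal(3)
x_sol = np.linalg.inv(A) @ b                # in practice prefer np.linalg.solve(A, b)
assert np.allclose(A @ x_sol, b)

# Orthogonal matrix: Q from a QR factorization satisfies Q Q^T = I.
Q, _ = np.linalg.qr(A)
assert np.allclose(Q @ Q.T, np.eye(3))
```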

Identities involving the inverse:

- $I_n^{-1} = I_n$.
- $(cA)^{-1} = \frac{1}{c} A^{-1}$ for any real value $c \neq 0$.
- $(AB)^{-1} = B^{-1} A^{-1}$ if both $B$ and $A$ are invertible.
- If $A$ and $B$ are invertible, then $(ABA^{-1})^{-1} = AB^{-1}A^{-1}$.

1.6 Common mistakes

Matrix expressions are similar to standard expressions involving real numbers in that addition and subtraction behave the same way, multiplication is nearly equivalent, and inverses give an analogue of division. But this similarity leads to common pitfalls when manipulating matrix equations. Here are some common mistakes that you should look out for:

1. Swapping the arguments of a matrix product.
2. Propagating transposes or inverses into a matrix product without swapping the order of arguments.
3. Assuming that a matrix is invertible (or worse, assuming a non-square matrix is invertible).
4. Performing operations on matrices of incompatible size.

2 Rank, Null space, and Definiteness

If $A$ is not invertible (for instance, it may not be square), then the system of linear equations $Ax = b$ may not have a solution $x$. Or it may have an infinite number of solutions. Or it may have solutions for some $b$'s and not others. We would like to characterize, based on properties of $A$, when such equations can be solved.

2.1 Matrix rank

Consider the columns of $A$ as a list of vectors $a_1, \ldots, a_n$. Recall that if $b \in \mathrm{Span}(a_1, \ldots, a_n)$, then $b$ is a linear combination of $a_1, \ldots, a_n$. If this holds, then it is sufficient to set each component $x_i$ to the respective coefficient on $a_i$ in order to solve $Ax = b$.

On the other hand, if $b \notin \mathrm{Span}(a_1, \ldots, a_n)$, then there is no solution. So the set of vectors $b$ such that $Ax = b$ has a solution is precisely $\mathrm{Span}(a_1, \ldots, a_n)$.

Rank. The rank of an $m \times n$ matrix $A$ is the size of the largest subset of $\{a_1, \ldots, a_n\}$ that is linearly independent. In other words, if $A$ has rank $k$, then $\mathrm{Span}(a_1, \ldots, a_n)$ is a $k$-dimensional subspace of $\mathbb{R}^m$. If $k = n$, then $A$ is said to have full column rank, and such problems have at most one solution. If $k = m$, then $A$ is said to have full row rank, and such problems have at least one solution. If $k = m = n$, then $A$ is invertible.

Overdetermined system. Now suppose that the rank of $A$ is $k < m$. Then there are some possible values of $b$ that are not attainable by linear combinations of $a_1, \ldots, a_n$. Such systems are known as overdetermined because there are more constraints than can be fulfilled by adjusting the values of $x$. Overdetermined systems are usually not solved exactly, but are more often solved in a least-squares sense $\min_x \|Ax - b\|^2$.

Underdetermined system. If the rank of $A$ is $k < n$, then there are an infinite number of solutions $x$ to the equation $Ax = Ax_0$. To see this, let some column of $A$ be linearly dependent on the remaining columns. Suppose this column is $a_1$ without loss of generality. Then $a_1 - \sum_{i=2}^n c_i a_i = 0$ for some coefficients $c_i$. So any multiple of the vector $v = (1, -c_2, \ldots, -c_n)$ can be added to $x_0$ without affecting the value of $A(x_0 + cv)$. Such systems are known as underdetermined because they may be solved by multiple values of $x$.

A system can be both underdetermined and overdetermined if $k < m$ and $k < n$. This means there are some values of $b$ for which there is no solution, but for those that do have a solution, there are an infinite number of solutions.

2.2 Null space

For underdetermined systems with $k < n$, we ask: in how many directions $d$ can we move such that $Ad = 0$? The space of such directions is known as the null space. These directions are significant because if we move a point $x$ in any such direction, we leave the value $A(x + d) = Ax + Ad = Ax$ unchanged. It turns out that this space can be spanned by $n - k$ linearly independent directions, and is therefore a space of dimension $n - k$. (Null spaces will feature prominently in constrained optimization problems.)
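A small NumPy illustration of rank and of over/underdetermined behavior; the matrix below is a made-up example whose third row is the sum of the first two, so it has rank 2:

```python
import numpy as np

# A 3x4 matrix whose third row is the sum of the first two: rank 2 < m, n.
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 1.0, 3.0, 4.0]])
print(np.linalg.matrix_rank(A))          # 2

# b in the span of the columns -> solvable (and, since k < n, infinitely many solutions).
b = A @ np.array([1.0, 2.0, 0.0, 0.0])
x0, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(A @ x0, b)

# b outside the span -> no exact solution; lstsq returns the minimizer of ||Ax - b||^2.
b_bad = b + np.array([0.0, 0.0, 1.0])
x_ls, *_ = np.linalg.lstsq(A, b_bad, rcond=None)
print(np.linalg.norm(A @ x_ls - b_bad))  # nonzero residual
```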

2.3 Positive/Negative Definiteness

A symmetric square matrix $A$ is positive semi-definite if $x^T A x \geq 0$ for all vectors $x$. It is strictly positive definite if equality holds only for $x = 0$. It can be shown that positive definite matrices are invertible, and the inverse of a positive definite matrix is positive definite as well. Although it is not clear at the moment what this condition means, it will become important in later lectures. Many matrices that we encounter will be shown to be positive definite! For example, the matrix $A^T A$ for a matrix $A$ of full column rank is positive definite. Also, at a local minimum of a scalar field, the Hessian matrix is positive semi-definite (and a positive definite Hessian certifies a strict local minimum).

Likewise, a matrix for which $x^T A x \leq 0$ for all $x$ is called negative semi-definite, and is called strictly negative definite if equality holds only at $x = 0$. If none of these conditions holds, the matrix is called indefinite.

3 Matrix Factorizations

Several matrix factorizations have proven useful in numerical analysis, computer science, and engineering. It is a good idea to familiarize yourself with these factorizations so that you can apply them.

3.1 Eigenvalues and Eigenvectors

If there exist a number $\lambda$ and a nonzero vector $x$ such that $Ax = \lambda x$, then $\lambda$ and $x$ are known as an eigenvalue and eigenvector of $A$, respectively. Briefly, here are some facts about eigenvalues:

1. Every $n \times n$ matrix has at least one and at most $n$ distinct eigenvalues (possibly complex).
2. Symmetric matrices have real eigenvalues.
3. Positive definite matrices have a full set of real, positive eigenvalues.
4. Positive semi-definite matrices have real, nonnegative eigenvalues.
5. Nonsymmetric matrices may have complex eigenvalues and eigenvectors.
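Definiteness can be probed numerically through the eigenvalue facts above. A minimal sketch using the $A^T A$ construction mentioned in Section 2.3; `eigvalsh` is NumPy's eigenvalue routine for symmetric matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 3))        # full column rank with probability 1

# A = B^T B is symmetric positive definite when B has full column rank.
A = B.T @ B
print(np.linalg.eigvalsh(A))           # all eigenvalues strictly positive

# x^T A x = ||Bx||^2 >= 0 for any x, confirming semi-definiteness directly.
x = rng.standard_normal(3)
assert x @ A @ x >= 0.0

# An indefinite example: eigenvalues of mixed sign.
C = np.array([[1.0, 0.0],
              [0.0, -2.0]])
print(np.linalg.eigvalsh(C))           # [-2., 1.]
```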

Eigendecomposition. A symmetric matrix $A$ can be decomposed into the form $Q \Lambda Q^T$, where $\Lambda$ is a diagonal matrix and $Q$ is an orthogonal matrix. $\Lambda$ is related to $Q$ in that the $i$th diagonal entry of $\Lambda$ is an eigenvalue that corresponds to the $i$th column of $Q$, which is its eigenvector. The significance of this decomposition is that multiplication by a symmetric matrix can be represented as a rotation, then an axis-aligned scaling, then an inverse rotation. It also gives a convenient form for the inverse, and a way to test whether an inverse exists: if every element of the diagonal of $\Lambda$ is nonzero, then $A^{-1} = Q \Lambda^{-1} Q^T$. $\Lambda^{-1}$ is easy to compute because it simply requires taking the reciprocal of each element on the diagonal.

3.2 Decompositions into Triangular Forms

LU decomposition. It can be shown, using the Gaussian elimination procedure, that any matrix $A$ can be decomposed into $A = PLU$, where $P$ is a permutation matrix, $L$ is a lower triangular matrix, and $U$ is an upper triangular matrix. This decomposition is significant because permutation matrices are easily invertible, and triangular matrices are easily invertible if their diagonals are nonzero. (The solution to any invertible triangular system $Lx = b$ can be found quickly through a back-substitution procedure.) So, if $L$ and $U$ are invertible, then $A$ is invertible as well! This method is very frequently employed to solve an invertible system of equations.

Cholesky decomposition. The special case of the LU decomposition of a symmetric positive-definite matrix is known as the Cholesky decomposition. It can be seen that for the product to be symmetric, $U = L^T$, and hence $A = LL^T$. For symmetric indefinite matrices, there is a related decomposition into $LDL^T$, where $D$ is a diagonal matrix. Cholesky decompositions can be computed in slightly fewer steps than general LU decompositions.

3.3 Singular Value Decomposition

The singular value decomposition (SVD) is one of the most useful tools in scientific computing. It gives a factorization similar to the eigendecomposition, but can be applied to non-square matrices. It also gives convenient ways to find a matrix's rank and null space, and to compute pseudoinverses. It is the most common method used to perform principal component analysis (PCA) in statistics and machine learning, and in generalizing Newton's method to higher dimensions.
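Before turning to the details of the SVD, here is a quick numerical check of the factorizations just described. SciPy is assumed to be available for the pivoted LU, since plain NumPy does not expose one; the matrices are arbitrary examples:

```python
import numpy as np
from scipy.linalg import lu   # SciPy assumed available for the pivoted LU

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = B.T @ B                   # symmetric positive definite (with probability 1)

# Eigendecomposition A = Q Lambda Q^T, and the inverse Q Lambda^{-1} Q^T.
w, Q = np.linalg.eigh(A)
assert np.allclose(Q @ np.diag(w) @ Q.T, A)
assert np.allclose(Q @ np.diag(1.0 / w) @ Q.T, np.linalg.inv(A))

# LU decomposition with pivoting: B = P L U.
P, L, U = lu(B)
assert np.allclose(P @ L @ U, B)

# Cholesky decomposition of a symmetric positive definite matrix: A = L L^T.
Lc = np.linalg.cholesky(A)
assert np.allclose(Lc @ Lc.T, A)
```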

It can also be used to perform robust least-squares fitting in underdetermined systems.

The SVD of an $m \times n$ matrix $A$ takes the form

$$A = U \Sigma V^T \qquad (11)$$

where $U$ is an $m \times m$ orthogonal matrix, $V$ is an $n \times n$ orthogonal matrix, and $\Sigma$ is an $m \times n$ matrix with nonzero entries only on the diagonal.

Computing the rank. The rank of $A$ is equal to the number of nonzero elements on the diagonal of $\Sigma$.

Computing the null space. If $\Sigma_{ii} = 0$ for some $i$, then the $i$th column of $V$ (equivalently, the $i$th row of $V^T$) is in the null space of $A$. The set of all such columns of $V$ is an orthogonal basis of the null space. If these vectors are assembled into an $n \times (n - k)$ matrix $N$, then all solutions to the equation $Ax = b$ can be obtained by finding a single solution $x_0$ and letting $x = x_0 + Ny$ for any choice of $y \in \mathbb{R}^{n-k}$.

Computing the pseudoinverse. A pseudoinverse is a generalization of the inverse of a matrix that is used when an inverse does not exist. It can also be used when a matrix is not square. The pseudoinverse $A^+$ is defined as an $n \times m$ matrix that has the following properties:

1. $AA^+A = A$
2. $A^+AA^+ = A^+$
3. $(AA^+)^T = AA^+$
4. $(A^+A)^T = A^+A$

This matrix can be computed using the SVD. Note that the pseudoinverse $\Sigma^+$ of $\Sigma$ can be computed by taking the reciprocal of all nonzero diagonal entries of $\Sigma$, leaving the zero entries, and transposing the result (so that $\Sigma^+$ is $n \times m$). Then the pseudoinverse of $A$ is $A^+ = V \Sigma^+ U^T$ (convince yourself that this satisfies the properties of the pseudoinverse). Note that if $A$ is invertible, then $A^+ = A^{-1}$.
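A sketch of these three computations in NumPy, reusing the rank-2 example matrix from Section 2.1; the $10^{-10}$ threshold for deciding which singular values count as zero is an arbitrary numerical choice:

```python
import numpy as np

# The rank-2 example matrix from Section 2.1 (third row = row 1 + row 2).
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 1.0, 3.0, 4.0]])

U, s, Vt = np.linalg.svd(A)       # A = U Sigma V^T; s holds the diagonal of Sigma
k = np.sum(s > 1e-10)             # numerical rank: count of "nonzero" singular values
print(k)                          # 2

# Null-space basis: the columns of V (rows of V^T) beyond the rank.
N = Vt[k:, :].T                   # n x (n - k)
assert np.allclose(A @ N, 0.0)

# Pseudoinverse A^+ = V Sigma^+ U^T, compared against the library routine.
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))
Sigma_plus[:k, :k] = np.diag(1.0 / s[:k])
A_plus = Vt.T @ Sigma_plus @ U.T
assert np.allclose(A_plus, np.linalg.pinv(A))
```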

Robust least squares. The SVD can be used to find all least-squares solutions to a system of linear equations, whether the system is full rank, underdetermined, overdetermined, or both! It can be shown that $x_0 = A^+ b$ is a least-squares solution to $\min_x \|Ax - b\|^2$. To see this, take the gradient of this quadratic function at $x_0$:

$$2A^T(Ax_0 - b) = 2A^T(AA^+ b - b) = 2(A^T A A^+ - A^T)b. \qquad (12)$$

Now look at the transpose of the matrix above, apply the transpose rule, apply the third property of the pseudoinverse, and then apply the first property of the pseudoinverse:

$$(A^T A A^+ - A^T)^T = (AA^+)^T A - A = AA^+ A - A = A - A = 0. \qquad (13)$$

Hence, the gradient at $x_0$ is zero. Since we can also compute the null-space matrix $N$, we see that all vectors of the form $x = A^+ b + Ny$, with $y$ arbitrary, are least-squares solutions as well.

4 Software considerations

4.1 Software libraries

Software libraries for basic matrix operations are available in most languages. Examples include LAPACK, GSL, JAMA for Java, and NumPy for Python. Matlab is a special-purpose language devised explicitly to make matrix calculations convenient. Most packages provide the Cholesky decomposition, LU decomposition, QR decomposition, and SVD. They typically also provide eigenvalue/eigenvector computations for symmetric positive definite matrices, and sometimes for nonsymmetric matrices as well.

4.2 Computational Complexity

For square matrices, matrix-vector multiplication is $O(n^2)$, while the naive approach to matrix multiplication is $O(n^3)$. There are algorithms that achieve a slightly lower exponent, but these are typically not competitive in practice because of large hidden constants. Matrix inversion is as complex as matrix multiplication, and is typically performed using the $O(n^3)$ LU decomposition, or the Cholesky decomposition if the matrix is symmetric positive definite (also $O(n^3)$ but with a smaller constant factor). Eigendecompositions and SVDs are also $O(n^3)$, but with yet larger constant factors.
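Tying the robust least-squares discussion to the library routines just mentioned, a minimal sketch; the rank-deficient matrix and right-hand side are the same made-up example used earlier, rebuilt here so the snippet stands alone:

```python
import numpy as np

# Rank-deficient system: no exact solution for this b, and infinitely many
# least-squares minimizers because A also has a nontrivial null space.
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 3.0],
              [1.0, 1.0, 3.0, 4.0]])
b = np.array([1.0, 2.0, 4.0])

x0 = np.linalg.pinv(A) @ b                      # x_0 = A^+ b
x1, *_ = np.linalg.lstsq(A, b, rcond=None)      # library least-squares routine
assert np.allclose(x0, x1)                      # both give the minimum-norm solution

# Any x = x_0 + N y (N spanning the null space) achieves the same residual.
_, s, Vt = np.linalg.svd(A)
k = np.sum(s > 1e-10)
N = Vt[k:, :].T
y = np.array([0.7, -1.3])
x2 = x0 + N @ y
assert np.allclose(np.linalg.norm(A @ x0 - b), np.linalg.norm(A @ x2 - b))
```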

4.3 Sparse Matrices

Sparse matrices, matrices in which most entries are zero, arise in many applications including physical simulation and problems on graphs. Sparse matrices can be stored in less than $O(n^2)$ space, and many operations (addition, multiplication) can be performed in time proportional to the number of nonzero entries rather than the size of the matrix. A sparse system of equations $Ax = b$ can often be solved efficiently using the conjugate gradient method. See J. Shewchuk (1994), "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain," for a good (and entertainingly written) reference on this method. (A minimal SciPy sketch of this approach appears after the exercises below.)

5 Exercises

1.
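The sketch promised in Section 4.3, assuming SciPy is available; the 1-D Laplacian below is just a convenient example of a sparse, symmetric positive definite matrix:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# The 1-D Laplacian (tridiagonal) stored in a sparse format rather than as a
# dense n x n array: only O(n) nonzero entries.
n = 200
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Conjugate gradient solve of Ax = b; only matrix-vector products with A are needed.
x, info = cg(A, b)
assert info == 0                          # 0 indicates convergence
print(np.linalg.norm(A @ x - b))          # small residual
```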