Stat 206: Linear algebra

James Johndrow (adapted from Iain Johnstone's notes), 2016-11-02

Vectors

We have already been working with vectors, but let's review a few more concepts. The inner product of two vectors x, y is $x'y = \sum_{j=1}^p x_j y_j$, which we will sometimes express as $\langle x, y \rangle$, and the angle $\theta$ formed by two vectors can be expressed in terms of inner products

$$\cos(\theta) = \frac{x'y}{\sqrt{x'x}\,\sqrt{y'y}}.$$

Here's how to take the inner product in R.

set.seed(17)
p <- 5
x <- matrix(rnorm(p),p,1) # p-vector with iid normal(0,1) entries
y <- matrix(rnorm(p),p,1)
xy <- t(x)%*%y
# compute angle
thet <- acos(xy/(sqrt(t(x)%*%x)*sqrt(t(y)%*%y)))

So for this example the inner product of x and y is about -0.23 and the angle between the vectors is about $\theta = 1.65$ radians. The usual interpretation in $\mathbb{R}^2$ carries over to $\mathbb{R}^p$, i.e. if $\theta = 0$ or $\pi$ then $x \parallel y$, and if $\theta = \pi/2$ then $x \perp y$.[1]

[1] The notation $x \parallel y$ means x and y are parallel, and $x \perp y$ means x and y are perpendicular.

The projection of a vector x onto a vector y is given by[2]

$$\mathrm{proj}(x, y) = \frac{yy'}{y'y}\, x = P_y x = \frac{x'y}{y'y}\, y,$$

where the last equality holds because $x'y$ is a scalar.

[2] The book uses the third expression, but I find the first more intuitive. In particular, if you've done linear models, you'll recognize $(yy')/(y'y)$ as a special case of $X(X'X)^{-1}X'$, the perpendicular projection operator onto the column space of X, when X = y is a vector instead of a matrix.
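As a quick sanity check, we can verify numerically that the first and third expressions for the projection agree (a minimal sketch reusing the x and y generated above; proj1 and proj2 are just illustrative names):

proj1 <- (y %*% t(y)) %*% x / c(t(y) %*% y) # (yy'/(y'y)) x
proj2 <- c(t(x) %*% y) / c(t(y) %*% y) * y  # (x'y/(y'y)) y
max(abs(proj1 - proj2))                     # ~0, up to rounding error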

Every vector can be expressed as a linear combination of its projection onto y and its projection onto the orthogonal complement of y, which is defined by

$$\left(I - \frac{yy'}{y'y}\right) x = (I - P_y)\, x.$$

It's clear that

$$x = \frac{yy'}{y'y}\, x + \left(I - \frac{yy'}{y'y}\right) x,$$

so that every x can be written as a linear combination of its projection parallel to and perpendicular to y. Let's compute an example in R.

x <- matrix(rnorm(p),p,1)
y <- .5*x + .5*matrix(rnorm(p),p,1)
Py <- (y%*%t(y))/c(t(y)%*%y)
proj <- Py%*%x
orth <- (diag(p)-Py)%*%x
yproj <- t(y)%*%proj
yorth <- t(y)%*%orth

You'll note that $\langle P_y x, y \rangle \approx 3.93$ but that $\langle (I - P_y)x, y \rangle \approx 1.6653345 \times 10^{-16}$, so $P_y x$ and $(I - P_y)x$ really are the portions of x parallel to and perpendicular to y.

Notice I am being pretty careful about using $\approx$ signs or saying "is approximately." Everything you do on the computer is in some sense approximate, since the computer only assigns limited memory to storing any number in decimal expansion. As a result, it cannot tell the difference between 0 and, say, $10^{-106}$, or between 8 and $8 + 10^{-106}$. That's why I didn't round $\langle (I - P_y)x, y \rangle$, so you can see this in action. That number is actually zero, but the computer has finite precision, so some error is introduced when doing the calculations.

Two vectors are said to be linearly dependent if one can be written as a scalar multiple of the other, i.e. x and y satisfy $x = cy$ for some $c \in \mathbb{R}$. If two vectors x and y are linearly dependent, then the projection of x onto y is just x, and the projection of x onto the orthogonal complement of y is the zero vector.[3] Thus, we can think of linearly dependent vectors as parallel. A collection of n vectors $x^{(1)}, \ldots, x^{(n)}$ is linearly dependent if there exist an index i and constants $c_1, \ldots, c_n$ with $c_i \neq 0$ such that

$$c_i x^{(i)} = c_1 x^{(1)} + \ldots + c_{i-1} x^{(i-1)} + c_{i+1} x^{(i+1)} + \ldots + c_n x^{(n)},$$

that is, if at least one of them can be written as a linear combination of the others. A collection is said to be linearly independent if none of the vectors in the collection can be written as a linear combination of the others.

[3] This is why, in the example above where I computed the projection of x onto y in R, I generated y as a weighted average of x and some random stuff; otherwise the projection would be close to zero. This hints at relationships between linear dependence and correlation, which we will get to soon.

Matrices

A matrix X is a rectangular array of numbers. If a matrix has n rows and p columns, we say the matrix is $n \times p$. A matrix is square if n = p, and the diagonal of a square matrix consists of the elements $X_{ii}$ with the same row and column index. The transpose, X', of an $n \times p$ matrix X is the $p \times n$ matrix whose rows are formed by the columns of X. That is, the first row of X' is the first column of X, the second row of X' is the second column of X, and so on. Two matrices of the same dimension can be added by simply adding corresponding entries. If X and A are both $n \times p$ matrices, then the matrix X + A has entries $(X + A)_{ij} = X_{ij} + A_{ij}$.
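As a tiny illustration of these definitions (a sketch with freshly generated, illustrative objects X0 and A0):

X0 <- matrix(rnorm(12), 4, 3)                   # a 4 x 3 example matrix
A0 <- matrix(rnorm(12), 4, 3)
all.equal(t(t(X0)), X0)                         # TRUE: transposing twice returns the original matrix
all.equal(t(X0 + A0), t(X0) + t(A0))            # TRUE: the transpose of a sum is the sum of the transposes
all.equal((X0 + A0)[2, 3], X0[2, 3] + A0[2, 3]) # TRUE: addition really is entrywise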

Here's how to do some of these things in R.

n <- 10
p <- 5
X <- matrix(rnorm(n*p),n,p)
Xt <- t(X) # transpose of X
A <- matrix(rnorm(n*p),n,p)
XA <- X+A # add X and A
D <- matrix(rnorm(n*n),n,n)
Ddiag <- diag(D) # get the diagonal of D
D2 <- diag(rnorm(n)) # an n by n diagonal matrix

The notion of rank is an important one.

Definition 1 (rank of a matrix). Let A be a real-valued matrix. The row rank $\mathrm{rank}_r(A)$ of A is the number of linearly independent rows of the matrix. The column rank $\mathrm{rank}_c(A)$ is the number of linearly independent columns.

The row rank and column rank are always equal (and therefore one just refers to the rank of a matrix). A matrix A is full rank if $\mathrm{rank}(A) = \min(n, p)$.

Matrix multiplication is in some ways analogous to multiplication of real numbers, but does not obey all of the same rules. First, only matrices of conformable dimension may be multiplied. An $n \times p$ matrix X and a $p \times m$ matrix A may be multiplied in the order XA because the column dimension of X matches the row dimension of A. The result is an $n \times m$ matrix with entries

$$(XA)_{ij} = \sum_{k=1}^p X_{ik} A_{kj} = \langle X_{[i,]}, A_{[,j]} \rangle,$$

so that the i, j element of the product is formed by taking the inner product of the vectors $X_{[i,]}$ and $A_{[,j]}$ formed by the ith row of X and the jth column of A. N.B.: p-vectors are $p \times 1$ matrices, and their transposes are $1 \times p$ matrices. So if X is $n \times p$ and y is $p \times 1$, then the product Xy is an $n \times 1$ matrix (an n-vector).

Matrix multiplication does not commute; that is, in general $XA \neq AX$. For rectangular matrices, often only one direction even makes sense: we can multiply an $n \times p$ matrix X by a $p \times m$ matrix A in the order XA, but not in the order AX, since the column dimension of A does not match the row dimension of X. Of course, if A and X are both square $p \times p$ matrices, then we can multiply in either direction. Even then, it is still not the case in general that AX = XA. Two matrices are said to commute if and only if AX = XA. When matrices commute, it simplifies a lot of calculations, but usually the matrices we will be working with will not commute.
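Here is a small sketch of these facts in R (reusing X from the chunk above; B, C1, and C2 are illustrative objects, and qr() is one standard way to get a numerical rank):

B <- matrix(rnorm(5*3), 5, 3)             # a 5 x 3 matrix
XB <- X %*% B                             # (10 x 5) times (5 x 3) gives a 10 x 3 matrix
all.equal(XB[2, 3], sum(X[2, ] * B[, 3])) # TRUE: entry (2,3) is the inner product of row 2 of X and column 3 of B
qr(X)$rank                                # numerical rank of X; 5 here, so X is full rank (almost surely for Gaussian entries)
C1 <- matrix(rnorm(9), 3, 3)
C2 <- matrix(rnorm(9), 3, 3)
max(abs(C1 %*% C2 - C2 %*% C1))           # typically far from zero: square matrices need not commute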

The p-dimensional identity matrix $I_p$ is a $p \times p$ matrix with all of its diagonal entries equal to one and all of its off-diagonal entries equal to zero, i.e.

$$I_p = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}.$$

We will often drop the subscript p when the dimension of I is clear. The matrix I is a multiplicative identity. So if X is $n \times p$, then XI = X, and IX' = X' for the $p \times p$ identity I. This holds in general for any matrix/vector for which the dimensions allow the multiplication to happen.

Two other definitional/notational things. A (square) matrix A is said to be symmetric if A = A', and A is said to be orthogonal if AA' = I.

Multiplicative inverses also exist, though again there are important differences with the one-dimensional case. First, the inverse is only defined for square matrices. If A is a square matrix, then it has an inverse if and only if there exists a square matrix B for which

$$AB = BA = I,$$

from which we may deduce that matrices and their inverses commute, and that the inverse of an orthogonal matrix is its transpose. In this case we write $B = A^{-1}$, so the usual notation for the inverse of A will be $A^{-1}$. Not all square matrices have an inverse, but when they do, the inverse is unique. A square matrix A has an inverse if and only if its columns are linearly independent. Another useful property is the following.

Remark 1 (inverse of product). Suppose A and B are both invertible $p \times p$ matrices. Then $(AB)^{-1} = B^{-1} A^{-1}$.

Proof. Suppose $C = (AB)^{-1}$. Since

$$AB B^{-1} A^{-1} = A I A^{-1} = A A^{-1} = I \quad \text{and} \quad B^{-1} A^{-1} AB = B^{-1} I B = B^{-1} B = I,$$

and inverses are unique, it follows that $C = B^{-1} A^{-1}$.
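A numerical sketch of Remark 1 and of the orthogonal-matrix facts (A1, B1, and Q are illustrative objects; Q is obtained from a QR decomposition simply because that is a convenient source of an orthogonal matrix):

A1 <- matrix(rnorm(16), 4, 4)                        # random square matrices; invertible with probability 1
B1 <- matrix(rnorm(16), 4, 4)
max(abs(solve(A1 %*% B1) - solve(B1) %*% solve(A1))) # ~0: (AB)^{-1} = B^{-1} A^{-1}
Q <- qr.Q(qr(A1))                                    # an orthogonal 4 x 4 matrix
max(abs(Q %*% t(Q) - diag(4)))                       # ~0: QQ' = I
max(abs(solve(Q) - t(Q)))                            # ~0: the inverse of an orthogonal matrix is its transpose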

We usually compute matrix inverses in R, since the number of operations necessary to compute an inverse is large. Here, I generate a random matrix and compute its inverse.

X <- matrix(rnorm(n*p),n,p)
Sn <- n^(-1)*t(X)%*%X
Sninv <- solve(Sn)
tst <- Sn%*%Sninv
maxdiff <- max(abs(tst-diag(p)))

We can check this worked by checking how different $S^{(n)} (S^{(n)})^{-1}$ is from $I_p$. In this case, the maximum entrywise difference (in absolute value) between $S^{(n)} (S^{(n)})^{-1}$ and $I_p$ is $3.3306691 \times 10^{-16}$.

Matrix inversion of a $p \times p$ matrix is generally an $O(p^3)$ operation, so methods that require computing the inverse will scale poorly in p. There are various strategies for improving scalability, mainly by using methods or approximations that don't require explicitly forming the inverse of a general $p \times p$ matrix. For example, it is easy to invert a diagonal matrix: the inverse is simply a diagonal matrix with entries given by the reciprocals of the entries of the original matrix.[4]

[4] Try this for a 3-by-3 example.

Perhaps the most useful matrix results in multivariate statistics have to do with eigenvalues and eigenvectors. The eigenvalues of a $p \times p$ square matrix A are the solutions $\lambda$ to the equation

$$Ax = \lambda x,$$

where $x \in \mathbb{R}^p$ is a p-vector and $\lambda$ is a scalar. The vectors x for which there exists a $\lambda$ satisfying the eigenvalue equation are the eigenvectors. Clearly, if $\lambda$ is a (real) solution to the eigenvalue equation with eigenvector x, then for any nonzero real number c, $\lambda$ is also a solution with eigenvector cx. Since the eigenvectors are only unique up to a multiplicative constant, it is typical to normalize eigenvectors to have length 1, i.e. so that $x'x = 1$, and denote these normalized eigenvectors by e. Every square, symmetric $p \times p$ matrix has p pairs of eigenvalues and eigenvectors $(e^{(1)}, \lambda_1), \ldots, (e^{(p)}, \lambda_p)$. The eigenvectors can be chosen to be mutually orthogonal (so $(e^{(j)})' e^{(k)} = 0$ for every pair $j \neq k$). The eigenvectors are unique unless two or more of the eigenvalues are equal.
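A quick sketch of both facts (d, A2, and e below are illustrative objects; eigen() returns unit-length eigenvectors):

d <- abs(rnorm(4)) + 1                    # positive diagonal entries
max(abs(solve(diag(d)) - diag(1/d)))      # ~0: the inverse of a diagonal matrix is entrywise reciprocals
A2 <- crossprod(matrix(rnorm(16), 4, 4))  # a symmetric matrix (A2 = M'M for a random M)
e <- eigen(A2)
max(abs(A2 %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1])) # ~0: A e = lambda e for the first pair
sum(e$vectors[, 1]^2)                     # 1: the eigenvectors are normalized to length 1
c(t(e$vectors[, 1]) %*% e$vectors[, 2])   # ~0: eigenvectors of a symmetric matrix are mutually orthogonal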

In statistics, we are often concerned with positive definite matrices.

Definition 2 (positive definite matrix). A square $p \times p$ matrix A is positive definite if and only if the quadratic form

$$x'Ax > 0$$

for every non-zero p-vector x, where a non-zero vector is any vector for which at least one entry is not zero. A is positive semi-definite if and only if

$$x'Ax \geq 0$$

for every non-zero p-vector x.

One reason to care about positive definite matrices is the following, given without proof.[5]

Remark 2 (sample covariance). Suppose $x^{(1)}, \ldots, x^{(n)}$ are a random sample with common mean $\mu$ and positive-definite covariance $\Sigma$. Then if $n > p$, $S_n$ is positive definite.

[5] Technical note: the phrase "almost surely" should be added to the end of this remark. This is entirely irrelevant, but for the reader who has some familiarity with measure theory I wanted to be complete.

Another important fact is that a real, symmetric positive semi-definite matrix is invertible if and only if it is positive definite.[6] If a matrix A is symmetric and positive definite, there exists a decomposition of A into a matrix U with columns consisting of the eigenvectors and a diagonal matrix $\Lambda$ with diagonal entries given by the eigenvalues.

[6] Positive semi-definite matrices have pseudoinverses (if interested, there is a decent page on Wikipedia).

Theorem 1 (spectral decomposition). Suppose A is symmetric and positive semi-definite. Then

$$A = U \Lambda U',$$

where U is a $p \times p$ orthogonal matrix whose jth column is the eigenvector $e^{(j)}$, and $\Lambda_{jj} = \lambda_j$, the jth eigenvalue of A. Since U is orthogonal, we can also write this as $A = U \Lambda U^{-1}$.

Spectral decompositions are very useful. For example, using Remark 1 and the orthogonality of U, we have that if $A = U \Lambda U'$ is the spectral decomposition of a positive definite matrix A, then $A^{-1} = U \Lambda^{-1} U'$. Since $\Lambda$ is a diagonal matrix, its inverse is just the diagonal matrix with entries $(\Lambda^{-1})_{jj} = \lambda_j^{-1}$, which is easy to compute. Moreover, this implies that the eigenvalues of the inverse $A^{-1}$ are the reciprocals of the eigenvalues of A, so that the largest eigenvalue of $A^{-1}$ is the reciprocal of the smallest eigenvalue of A, and the smallest eigenvalue of $A^{-1}$ is the reciprocal of the largest eigenvalue of A.
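Here is a small numerical sketch of Theorem 1 and the inverse-via-spectral-decomposition fact, reusing Sn from the earlier chunk (which is symmetric and, since n > p, positive definite); es, U, Lam, and Sninv2 are illustrative names:

es <- eigen(Sn)
U <- es$vectors
Lam <- diag(es$values)
max(abs(Sn - U %*% Lam %*% t(U)))           # ~0: Sn = U Lambda U'
Sninv2 <- U %*% diag(1/es$values) %*% t(U)  # inverse built from the spectral decomposition
max(abs(Sninv2 - solve(Sn)))                # ~0: agrees with the direct inverse
all.equal(sort(eigen(Sninv2)$values), sort(1/es$values)) # TRUE: eigenvalues of the inverse are reciprocals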

Positive definite matrices also have square roots. A square root $A^{1/2}$ of a symmetric matrix A is any matrix B for which BB = A; for positive-definite matrices there is a unique square root $A^{1/2}$ that is also positive definite (although there are multiple square roots). One way to obtain the square root of a positive definite matrix A is from its spectral decomposition:

Remark 3. Suppose $A = U \Lambda U'$ is the spectral decomposition of A. Then $U \Lambda^{1/2} U'$ is a square root of A, where $\Lambda^{1/2}$ is the diagonal matrix with entries $(\Lambda^{1/2})_{jj} = \lambda_j^{1/2}$.

Proof.

$$U \Lambda^{1/2} U' U \Lambda^{1/2} U' = U \Lambda^{1/2} I \Lambda^{1/2} U' = U \Lambda^{1/2} \Lambda^{1/2} U' = U \Lambda U' = A.$$

From this it is clear that if A has spectral decomposition $U \Lambda U'$, then $A^{1/2}$ has spectral decomposition $A^{1/2} = U \Lambda^{1/2} U'$.

Eigenvalues also give us useful bounds on the size of quadratic forms.

Remark 4 (quadratic form eigenvalue inequalities). Let $\lambda_1, \ldots, \lambda_p$ be the eigenvalues of A in decreasing order. Then for every $x \in \mathbb{R}^p$,

$$\lambda_p \|x\|_2^2 \leq x'Ax \leq \lambda_1 \|x\|_2^2.$$

Since $\|A^{1/2} x\|_2^2 = x'Ax$, the eigenvalues of A give us bounds on the amount by which A elongates any vector.

Two important matrix quantities are the trace and determinant. Determinants and traces appear in the densities of multivariate distributions. The trace of a square matrix A is the sum of its diagonal elements:

$$\mathrm{tr}(A) = \sum_{j=1}^p A_{jj}.$$

Some key properties of the trace are

$$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$$
$$\mathrm{tr}(cA) = c\, \mathrm{tr}(A)$$
$$\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA) \quad \text{(the cyclic property)}$$
$$\mathrm{tr}(A) = \sum_j \lambda_j,$$

so the trace is the sum of the eigenvalues.
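A brief numerical sketch of Remark 3, Remark 4, and the trace facts, again reusing Sn (Sroot, z, and qf are illustrative names):

es <- eigen(Sn)
Sroot <- es$vectors %*% diag(sqrt(es$values)) %*% t(es$vectors) # square root via the spectral decomposition
max(abs(Sroot %*% Sroot - Sn))              # ~0: Sroot really is a square root of Sn
all.equal(sum(diag(Sn)), sum(es$values))    # TRUE: the trace equals the sum of the eigenvalues
z <- rnorm(p)                               # a generic non-zero vector
qf <- c(t(z) %*% Sn %*% z)
min(es$values)*sum(z^2) <= qf && qf <= max(es$values)*sum(z^2) # TRUE: the Remark 4 bounds hold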

The determinant of a square $p \times p$ matrix is defined recursively by

$$|A| = \sum_{j=1}^p a_{1j} |A_{[-1,-j]}| (-1)^{1+j},$$

where $A_{[-1,-j]}$ is the submatrix obtained by deleting the first row and jth column of A, and $|A| = a_{11}$ if A is scalar. There is a connection with eigenvalues via the characteristic polynomial

$$g_A(t) = |tI - A|,$$

a polynomial in t whose roots are equal to the eigenvalues of A. Although not the formal definition, for our purposes it is often useful to think of the determinant of a square matrix A as the product of its eigenvalues:

$$|A| = \prod_j \lambda_j.$$

A couple of useful properties of the determinant are

$$|A^{-1}| = |A|^{-1}, \qquad \log |A| = \sum_{j=1}^p \log(\lambda_j).$$

There are matrix decompositions in addition to the spectral decomposition that often prove useful in applied statistics. Suppose X is an $n \times p$ real matrix, and let $m = \min(n, p)$. The singular value decomposition of X is given by

$$X = UDV,$$

where U is an $n \times m$ orthogonal matrix, D is an $m \times m$ diagonal matrix, and V is an $m \times p$ orthogonal matrix. The diagonal elements of D are referred to as the singular values of X. The singular value decomposition and spectral decomposition are related, since

$$X'X = (UDV)'UDV = V'DU'UDV = V'DIDV = V'D^2V,$$

where $D^2$ is the product DD. Since V is orthogonal, it follows that $D^2 = \Lambda$ in the spectral decomposition of $X'X$ when $X'X$ is positive-definite (equivalently, when $n > p$). You can see this for yourself in R.[7]

library(ggplot2) # for the plot below
X <- matrix(rnorm(n*p),n,p)
XX <- t(X)%*%X
s <- svd(X)
u <- eigen(t(X)%*%X)
df <- data.frame(lam=u$values,d2=s$d^2)
ggplot(df,aes(x=lam,y=d2)) + geom_point()

(Figure 1: eigenvalues of X'X plotted against the squared singular values.)

[7] What happens when $n < p$? In fact, a lot of this still makes sense, except that there are only n distinct eigenvalues, which are the squares of the diagonal elements of D. We can still write a spectral decomposition of X'X, too, but in this case X'X is positive semi-definite, not positive-definite.
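In addition to the plot, the same relationships can be checked numerically (a sketch reusing s, u, and X from the chunk above; A3 and lam are illustrative names):

all.equal(sort(s$d^2), sort(u$values))  # TRUE: squared singular values of X equal the eigenvalues of X'X
A3 <- crossprod(X)/n                    # a symmetric positive-definite matrix
lam <- eigen(A3)$values
all.equal(det(A3), prod(lam))           # TRUE: the determinant is the product of the eigenvalues
all.equal(log(det(A3)), sum(log(lam)))  # TRUE: log-determinant is the sum of the log-eigenvalues
all.equal(det(solve(A3)), 1/det(A3))    # TRUE: |A^{-1}| = |A|^{-1}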

Note that the number of non-zero singular values is equal to the rank of X. Another useful decomposition is the Cholesky decomposition.

Theorem 2. Suppose $\Sigma$ is a symmetric and positive-definite matrix. Then there exists a unique invertible, lower triangular matrix L, referred to as the Cholesky decomposition, such that $\Sigma = LL'$.

Vector and matrix calculus can be very useful in multivariate statistics, and we'll briefly review some key results here. Suppose x is a vector and A a matrix. Then

$$\frac{\partial}{\partial x} Ax = A', \qquad \frac{\partial}{\partial x} x'A = A, \qquad \frac{\partial}{\partial x} x'x = 2x, \qquad \frac{\partial}{\partial x} x'Ax = Ax + A'x.$$

The Jacobian matrix associated with a transformation $f : \mathbb{R}^p \to \mathbb{R}^q$ is the $q \times p$ matrix with entries

$$J_f(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_p} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_p} \\ \vdots & \vdots & & \vdots \\ \frac{\partial f_q}{\partial x_1} & \frac{\partial f_q}{\partial x_2} & \cdots & \frac{\partial f_q}{\partial x_p} \end{pmatrix},$$

where $f(x) = (f_1(x), \ldots, f_q(x))'$ is the (vector) output of the function f. The Jacobian is the matrix form of the total derivative of the function f, familiar from multivariate calculus. We can express the chain rule for vector functions as

$$J_{f \circ g}(x) = J_f(g(x)) J_g(x),$$

that is, the total derivative of the composition of functions $(f \circ g)(x) = f(g(x))$ is the product of the total derivative of f evaluated at g(x) and the total derivative of g. Here's a simple example:

$$\frac{\partial}{\partial x} \log(x'Ax) = \frac{2Ax}{x'Ax},$$

assuming A is symmetric and positive definite.
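A small sketch of Theorem 2 and of the gradient example, using chol() and a crude finite-difference check (Sig, L, x0, and numgrad are illustrative names; note that R's chol() returns the upper-triangular factor, so the lower-triangular L is its transpose):

Sig <- crossprod(matrix(rnorm(n*p), n, p))/n  # a symmetric positive-definite matrix
L <- t(chol(Sig))                             # chol() gives R with Sig = R'R, so L = R' is lower triangular
max(abs(L %*% t(L) - Sig))                    # ~0: Sig = LL'
x0 <- rnorm(p)                                # point at which to check the gradient
f <- function(x) log(c(t(x) %*% Sig %*% x))   # f(x) = log(x' Sig x)
eps <- 1e-6
numgrad <- sapply(1:p, function(j) {          # one-sided finite-difference approximation to the gradient
  xj <- x0; xj[j] <- xj[j] + eps
  (f(xj) - f(x0))/eps
})
max(abs(numgrad - c(2*Sig%*%x0/c(t(x0)%*%Sig%*%x0)))) # small (of order eps): matches 2Ax/(x'Ax)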

Finally, a couple of results on derivatives of the trace and determinant. Suppose A is a $p \times p$ symmetric and positive-definite matrix, and B is a $p \times p$ matrix. Then

$$\frac{\partial\, \mathrm{tr}(AB)}{\partial A} = B', \qquad \frac{\partial |A|}{\partial A} = |A|\, A^{-1}, \qquad \frac{\partial\, \mathrm{tr}(A^{-1}B)}{\partial A} = -A^{-1} B' A^{-1}.$$
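These identities can be checked numerically with finite differences, perturbing one entry of A at a time (a sketch; A4, B4, and fd are illustrative names, and the third check relies on the symmetry of A4):

A4 <- crossprod(matrix(rnorm(n*p), n, p))/n  # a symmetric positive-definite p x p matrix
B4 <- matrix(rnorm(p*p), p, p)
eps <- 1e-6
fd <- function(f) {                          # entrywise finite-difference derivative of f at A4
  g <- matrix(0, p, p)
  for (i in 1:p) for (j in 1:p) {
    Aij <- A4; Aij[i, j] <- Aij[i, j] + eps
    g[i, j] <- (f(Aij) - f(A4))/eps
  }
  g
}
max(abs(fd(function(A) sum(diag(A %*% B4))) - t(B4)))       # small: d tr(AB)/dA = B'
max(abs(fd(det) - det(A4)*solve(A4)))                       # small: d|A|/dA = |A| A^{-1}
max(abs(fd(function(A) sum(diag(solve(A) %*% B4))) + solve(A4) %*% t(B4) %*% solve(A4))) # small: d tr(A^{-1}B)/dA = -A^{-1} B' A^{-1}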