Basic Concepts in Matrix Algebra

Basic Concepts in Matrix Algebra

A column array of p elements is called a vector of dimension p and is written as

    x_{p \times 1} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}

The transpose of the column vector x_{p \times 1} is the row vector

    x' = [x_1 \; x_2 \; \cdots \; x_p]

A vector can be represented in p-space as a directed line with components along the p axes.

Basic Matrix Concepts (cont'd)

Two vectors can be added if they have the same dimension. Addition is carried out elementwise:

    x + y = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_p + y_p \end{bmatrix}

A vector can be contracted or expanded if multiplied by a constant c. Multiplication is also elementwise:

    cx = c \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix} = \begin{bmatrix} cx_1 \\ cx_2 \\ \vdots \\ cx_p \end{bmatrix}

Examples

    x = \begin{bmatrix} 2 \\ 1 \\ -4 \end{bmatrix} \quad \text{and} \quad x' = [2 \; 1 \; -4]

    6x = 6 \begin{bmatrix} 2 \\ 1 \\ -4 \end{bmatrix} = \begin{bmatrix} 6 \cdot 2 \\ 6 \cdot 1 \\ 6 \cdot (-4) \end{bmatrix} = \begin{bmatrix} 12 \\ 6 \\ -24 \end{bmatrix}

    x + y = \begin{bmatrix} 2 \\ 1 \\ -4 \end{bmatrix} + \begin{bmatrix} 5 \\ -2 \\ 0 \end{bmatrix} = \begin{bmatrix} 2 + 5 \\ 1 - 2 \\ -4 + 0 \end{bmatrix} = \begin{bmatrix} 7 \\ -1 \\ -4 \end{bmatrix}
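
These elementwise operations are easy to check numerically. Below is a small NumPy sketch, added for illustration (not part of the original slides), reproducing the vector arithmetic above.

```python
import numpy as np

x = np.array([2, 1, -4])
y = np.array([5, -2, 0])

print(6 * x)   # scalar multiplication: [ 12   6 -24]
print(x + y)   # elementwise addition:  [ 7 -1 -4]
```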

Basic Matrix Concepts (cont'd)

Multiplication by c > 0 does not change the direction of x. The direction is reversed if c < 0.

Basic Matrix Concepts (cont'd)

The length of a vector x is its Euclidean distance from the origin:

    L_x = \sqrt{\sum_{j=1}^{p} x_j^2}

Multiplication of a vector x by a constant c changes the length:

    L_{cx} = \sqrt{\sum_{j=1}^{p} c^2 x_j^2} = |c| \sqrt{\sum_{j=1}^{p} x_j^2} = |c| L_x

If c = L_x^{-1}, then cx is a vector of unit length.

Examples

The length of x = [2 \; 1 \; -4 \; -2]' is

    L_x = \sqrt{(2)^2 + (1)^2 + (-4)^2 + (-2)^2} = \sqrt{25} = 5

Then

    z = \frac{1}{5} \begin{bmatrix} 2 \\ 1 \\ -4 \\ -2 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.2 \\ -0.8 \\ -0.4 \end{bmatrix}

is a vector of unit length.
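
As a quick numerical check (a NumPy sketch added here, not in the original slides), the length and the unit-length rescaling can be computed directly:

```python
import numpy as np

x = np.array([2, 1, -4, -2])
L_x = np.linalg.norm(x)             # Euclidean length: 5.0
z = x / L_x                         # unit-length vector: [ 0.4  0.2 -0.8 -0.4]
print(L_x, z, np.linalg.norm(z))    # the norm of z is 1.0
```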

Angle Between Vectors

Consider two vectors x and y in two dimensions. If \theta_1 is the angle between x and the horizontal axis and \theta_2 > \theta_1 is the angle between y and the horizontal axis, then

    \cos(\theta_1) = x_1 / L_x,   \sin(\theta_1) = x_2 / L_x,
    \cos(\theta_2) = y_1 / L_y,   \sin(\theta_2) = y_2 / L_y.

If \theta is the angle between x and y, then

    \cos(\theta) = \cos(\theta_2 - \theta_1) = \cos(\theta_2)\cos(\theta_1) + \sin(\theta_2)\sin(\theta_1).

Then

    \cos(\theta) = \frac{x_1 y_1 + x_2 y_2}{L_x L_y}.

Angle Between Vectors (cont'd)

[Figure: illustration of the angle \theta between two vectors x and y.]

Inner Product

The inner product between two vectors x and y is

    x'y = \sum_{j=1}^{p} x_j y_j.

Then L_x = \sqrt{x'x}, L_y = \sqrt{y'y}, and

    \cos(\theta) = \frac{x'y}{\sqrt{x'x}\,\sqrt{y'y}}.

Since \cos(\theta) = 0 when x'y = 0, and \cos(\theta) = 0 for \theta = 90^\circ or \theta = 270^\circ, the vectors are perpendicular (orthogonal) when x'y = 0.
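
A short NumPy illustration of the inner product and the angle formula (added as a sketch; the two vectors are arbitrary, not from the original slides):

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

cos_theta = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
theta_deg = np.degrees(np.arccos(cos_theta))
print(cos_theta, theta_deg)   # 0.7071..., 45.0 degrees
```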

Linear Dependence

Two vectors x and y are linearly dependent if there exist two constants c_1 and c_2, not both zero, such that

    c_1 x + c_2 y = 0.

If two vectors are linearly dependent, then one can be written as a linear combination of the other. From above (with c_1 \neq 0):

    x = -(c_2 / c_1) y.

More generally, k vectors x_1, x_2, ..., x_k are linearly dependent if there exist constants c_1, c_2, ..., c_k, not all zero, such that

    \sum_{j=1}^{k} c_j x_j = 0.

Vectors of the same dimension that are not linearly dependent are said to be linearly independent.

Linear Independence: Example

Let

    x_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}.

Then c_1 x_1 + c_2 x_2 + c_3 x_3 = 0 requires

    c_1 + c_2 + c_3 = 0
    2c_1 + 0 \cdot c_2 - 2c_3 = 0
    c_1 - c_2 + c_3 = 0.

The unique solution is c_1 = c_2 = c_3 = 0, so the vectors are linearly independent.
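
One way to check this numerically (a NumPy sketch, not part of the original slides) is to stack the vectors as columns and compute the matrix rank; full column rank means the vectors are linearly independent.

```python
import numpy as np

# columns are x1, x2, x3 from the example above
X = np.column_stack([[1, 2, 1], [1, 0, -1], [1, -2, 1]])
print(np.linalg.matrix_rank(X))   # 3, so the three vectors are linearly independent
```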

Projections

The projection of x on y is defined by

    \text{Projection of } x \text{ on } y = \frac{x'y}{y'y}\, y = \frac{x'y}{L_y} \frac{1}{L_y}\, y.

The length of the projection is

    \text{Length of projection} = \frac{|x'y|}{L_y} = L_x \frac{|x'y|}{L_x L_y} = L_x |\cos(\theta)|,

where \theta is the angle between x and y.
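
A brief NumPy sketch of the projection formula (added for illustration; the vectors used here are arbitrary):

```python
import numpy as np

x = np.array([2.0, 1.0, -4.0])
y = np.array([5.0, -2.0, 0.0])

proj = (x @ y) / (y @ y) * y                 # projection of x on y
length = abs(x @ y) / np.linalg.norm(y)      # length of the projection
print(proj, length, np.linalg.norm(proj))    # the last two values agree
```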

Matrices

A matrix A is an array of elements a_{ij} with n rows and p columns:

    A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{np} \end{bmatrix}

The transpose A' has p rows and n columns. The j-th row of A' is the j-th column of A:

    A' = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{n1} \\ a_{12} & a_{22} & \cdots & a_{n2} \\ \vdots & \vdots & & \vdots \\ a_{1p} & a_{2p} & \cdots & a_{np} \end{bmatrix}

Matrix Algebra

Multiplication of A by a constant c is carried out element by element:

    cA = \begin{bmatrix} ca_{11} & ca_{12} & \cdots & ca_{1p} \\ ca_{21} & ca_{22} & \cdots & ca_{2p} \\ \vdots & \vdots & & \vdots \\ ca_{n1} & ca_{n2} & \cdots & ca_{np} \end{bmatrix}

Matrix Addition

Two matrices A_{n \times p} = \{a_{ij}\} and B_{n \times p} = \{b_{ij}\} of the same dimensions can be added element by element. The resulting matrix is C_{n \times p} = \{c_{ij}\} = \{a_{ij} + b_{ij}\}:

    C = A + B = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{np} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1p} \\ b_{21} & b_{22} & \cdots & b_{2p} \\ \vdots & \vdots & & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{np} \end{bmatrix} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1p} + b_{1p} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2p} + b_{2p} \\ \vdots & \vdots & & \vdots \\ a_{n1} + b_{n1} & a_{n2} + b_{n2} & \cdots & a_{np} + b_{np} \end{bmatrix}

Examples

    \begin{bmatrix} 2 & 1 & 4 \\ 5 & 7 & 0 \end{bmatrix}' = \begin{bmatrix} 2 & 5 \\ 1 & 7 \\ 4 & 0 \end{bmatrix}

    6 \begin{bmatrix} 2 & 1 & 4 \\ 5 & 7 & 0 \end{bmatrix} = \begin{bmatrix} 12 & 6 & 24 \\ 30 & 42 & 0 \end{bmatrix}

    \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix} + \begin{bmatrix} 2 & -1 \\ 5 & 7 \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 5 & 10 \end{bmatrix}

Matrix Multiplication

Multiplication of two matrices A_{n \times p} and B_{m \times q} can be carried out only if the matrices are compatible for multiplication:

    A_{n \times p} B_{m \times q}: compatible if p = m.
    B_{m \times q} A_{n \times p}: compatible if q = n.

The element in the i-th row and the j-th column of AB is the inner product of the i-th row of A with the j-th column of B.

Multiplication Examples

    \begin{bmatrix} 2 & 0 & 1 \\ 5 & 1 & 3 \end{bmatrix} \begin{bmatrix} 1 & 4 \\ -1 & 3 \\ 0 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 10 \\ 4 & 29 \end{bmatrix}

    \begin{bmatrix} 2 & 1 \\ 5 & 3 \end{bmatrix} \begin{bmatrix} 1 & 4 \\ -1 & 3 \end{bmatrix} = \begin{bmatrix} 1 & 11 \\ 2 & 29 \end{bmatrix}

    \begin{bmatrix} 1 & 4 \\ -1 & 3 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 5 & 3 \end{bmatrix} = \begin{bmatrix} 22 & 13 \\ 13 & 8 \end{bmatrix}

Note that matrix multiplication is not commutative: the last two products involve the same two matrices in opposite order and give different results.
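
A NumPy check of the last two products (an added sketch, not part of the original transcription):

```python
import numpy as np

A = np.array([[2, 1], [5, 3]])
B = np.array([[1, 4], [-1, 3]])
print(A @ B)   # [[ 1 11] [ 2 29]]
print(B @ A)   # [[22 13] [13  8]]  -- different from A @ B
```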

Identity Matrix

An identity matrix, denoted by I, is a square matrix with 1's along the main diagonal and 0's everywhere else. For example,

    I_{2 \times 2} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad I_{3 \times 3} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

If A is a square matrix, then AI = IA = A. More generally, I_{n \times n} A_{n \times p} = A_{n \times p}, but A_{n \times p} I_{n \times n} is not defined for p \neq n.

Symmetric Matrices

A square matrix is symmetric if A = A'. If a square matrix A has elements \{a_{ij}\}, then A is symmetric if a_{ij} = a_{ji}.

Examples:

    \begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}, \quad \begin{bmatrix} 5 & 1 & 3 \\ 1 & 12 & 5 \\ 3 & 5 & 9 \end{bmatrix}

Inverse Matrix

Consider two square matrices A_{k \times k} and B_{k \times k}. If

    AB = BA = I,

then B is the inverse of A, denoted A^{-1}. The inverse of A exists only if the columns of A are linearly independent. If A = \text{diag}\{a_{jj}\} is diagonal, then A^{-1} = \text{diag}\{1/a_{jj}\}.

Inverse Matrix

For a 2 \times 2 matrix A, the inverse is

    A^{-1} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}^{-1} = \frac{1}{\det(A)} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix},

where \det(A) = a_{11} a_{22} - a_{12} a_{21} denotes the determinant of A.
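
A quick NumPy confirmation of the 2x2 inverse formula (an added sketch; the matrix chosen here is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0], [2.0, 4.0]])
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
A_inv_formula = np.array([[A[1, 1], -A[0, 1]],
                          [-A[1, 0], A[0, 0]]]) / det
print(np.allclose(A_inv_formula, np.linalg.inv(A)))   # True
```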

Orthogonal Matrices

A square matrix Q is orthogonal if

    QQ' = Q'Q = I,

or, equivalently, Q' = Q^{-1}. If Q is orthogonal, its rows and columns have unit length (q_j'q_j = 1) and are mutually perpendicular (q_j'q_k = 0 for any j \neq k).

Eigenvalues and Eigenvectors

A square matrix A has an eigenvalue \lambda with corresponding eigenvector z \neq 0 if

    Az = \lambda z.

The eigenvalues of A are the solutions to |A - \lambda I| = 0. A normalized eigenvector (of unit length) is denoted by e.

A k \times k symmetric matrix A has k pairs of eigenvalues and eigenvectors

    \lambda_1, e_1 \quad \lambda_2, e_2 \quad \ldots \quad \lambda_k, e_k,

where e_i'e_i = 1, e_i'e_j = 0 for i \neq j, and the eigenvectors are unique up to a change in sign unless two or more eigenvalues are equal.

Spectral Decomposition

Eigenvalues and eigenvectors will play an important role in this course. For example, principal components are based on the eigenvalues and eigenvectors of sample covariance matrices.

The spectral decomposition of a k \times k symmetric matrix A is

    A = \lambda_1 e_1 e_1' + \lambda_2 e_2 e_2' + \ldots + \lambda_k e_k e_k'
      = [e_1 \; e_2 \; \cdots \; e_k] \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_k \end{bmatrix} \begin{bmatrix} e_1' \\ e_2' \\ \vdots \\ e_k' \end{bmatrix}
      = P \Lambda P'
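
The decomposition can be computed with NumPy's symmetric eigensolver; this is an illustrative sketch (the matrix used is arbitrary), not part of the original slides.

```python
import numpy as np

A = np.array([[4.0, 2.0], [2.0, 3.0]])    # symmetric matrix
lam, P = np.linalg.eigh(A)                # eigenvalues (ascending) and orthonormal eigenvectors
A_rebuilt = P @ np.diag(lam) @ P.T        # P Lambda P'
print(np.allclose(A, A_rebuilt))          # True: the spectral decomposition recovers A
```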

Determinant and Trace

The trace of a k \times k matrix A is the sum of its diagonal elements, i.e., \text{trace}(A) = \sum_{i=1}^{k} a_{ii}.

For a square, symmetric matrix A, the trace equals the sum of the eigenvalues, i.e.,

    \text{trace}(A) = \sum_{i=1}^{k} a_{ii} = \sum_{i=1}^{k} \lambda_i,

and the determinant equals the product of the eigenvalues, i.e.,

    |A| = \prod_{i=1}^{k} \lambda_i.

Rank of a Matrix

The rank of a square matrix A is

- the number of linearly independent rows,
- the number of linearly independent columns,
- the number of non-zero eigenvalues.

The inverse of a k \times k matrix A exists if and only if rank(A) = k, i.e., there are no zero eigenvalues.

Positive Definite Matrix

For a k \times k symmetric matrix A and a vector x = [x_1, x_2, ..., x_k]', the quantity x'Ax is called a quadratic form. Note that

    x'Ax = \sum_{i=1}^{k} \sum_{j=1}^{k} a_{ij} x_i x_j.

If x'Ax \geq 0 for any vector x, both A and the quadratic form are said to be non-negative definite.

If x'Ax > 0 for any vector x \neq 0, both A and the quadratic form are said to be positive definite.

Example 2.11

Show that the matrix of the quadratic form 3x_1^2 + 2x_2^2 - 2\sqrt{2}\, x_1 x_2 is positive definite.

For

    A = \begin{bmatrix} 3 & -\sqrt{2} \\ -\sqrt{2} & 2 \end{bmatrix},

the eigenvalues are \lambda_1 = 4, \lambda_2 = 1. Then A = 4 e_1 e_1' + e_2 e_2'. Write

    x'Ax = 4 x'e_1 e_1'x + x'e_2 e_2'x = 4y_1^2 + y_2^2 \geq 0,

which is zero only for y_1 = y_2 = 0.

Example 2.11 (cont'd)

y_1 and y_2 cannot both be zero because

    \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} e_1' \\ e_2' \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = P'_{2 \times 2}\, x_{2 \times 1},

with P orthonormal, so that (P')^{-1} = P. Then x = Py, and since x \neq 0 it follows that y \neq 0.

Using the spectral decomposition, we can show that:

- A is positive definite if all of its eigenvalues are positive.
- A is non-negative definite if all of its eigenvalues are \geq 0.
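
A NumPy sketch (added, not in the original slides) verifying that the Example 2.11 matrix is positive definite by inspecting its eigenvalues:

```python
import numpy as np

A = np.array([[3.0, -np.sqrt(2)], [-np.sqrt(2), 2.0]])
eigvals = np.linalg.eigvalsh(A)
print(eigvals)                 # approximately [1. 4.]
print(np.all(eigvals > 0))     # True: A is positive definite
```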

Distance and Quadratic Forms

For x = [x_1, x_2, ..., x_p]' and a p \times p positive definite matrix A,

    d^2 = x'Ax > 0 when x \neq 0.

Thus, a positive definite quadratic form can be interpreted as a squared distance of x from the origin, and vice versa. The squared distance from x to a fixed point \mu is given by the quadratic form

    (x - \mu)'A(x - \mu).

Distance and Quadratic Forms (cont'd)

We can interpret distance in terms of the eigenvalues and eigenvectors of A as well. Any point x at constant distance c from the origin satisfies

    x'Ax = x'\left(\sum_{j=1}^{p} \lambda_j e_j e_j'\right)x = \sum_{j=1}^{p} \lambda_j (x'e_j)^2 = c^2,

the expression for an ellipsoid in p dimensions.

Note that the point x = c\lambda_1^{-1/2} e_1 is at a distance c (in the direction of e_1) from the origin because it satisfies x'Ax = c^2. The same is true for the points x = c\lambda_j^{-1/2} e_j, j = 1, ..., p. Thus, all points at distance c lie on an ellipsoid with axes in the directions of the eigenvectors and with lengths proportional to \lambda_j^{-1/2}.

Distance and Quadratic Forms (cont'd)

[Figure illustrating the constant-distance ellipsoid described above.]

Square-Root Matrices

The spectral decomposition of a positive definite matrix A yields

    A = \sum_{j=1}^{p} \lambda_j e_j e_j' = P \Lambda P',

with \Lambda_{p \times p} = \text{diag}\{\lambda_j\}, all \lambda_j > 0, and P_{p \times p} = [e_1 \; e_2 \; \ldots \; e_p] an orthonormal matrix of eigenvectors. Then

    A^{-1} = P \Lambda^{-1} P' = \sum_{j=1}^{p} \frac{1}{\lambda_j} e_j e_j'.

With \Lambda^{1/2} = \text{diag}\{\lambda_j^{1/2}\}, a square-root matrix is

    A^{1/2} = P \Lambda^{1/2} P' = \sum_{j=1}^{p} \sqrt{\lambda_j}\, e_j e_j'.

Square-Root Matrices

The square root of a positive definite matrix A has the following properties:

1. Symmetry: (A^{1/2})' = A^{1/2}
2. A^{1/2} A^{1/2} = A
3. A^{-1/2} = \sum_{j=1}^{p} \lambda_j^{-1/2} e_j e_j' = P \Lambda^{-1/2} P'
4. A^{1/2} A^{-1/2} = A^{-1/2} A^{1/2} = I
5. A^{-1/2} A^{-1/2} = A^{-1}

Note that there are other ways of defining the square root of a positive definite matrix: in the Cholesky decomposition A = LL', with L a matrix of lower triangular form, L is also called a square root of A.
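
The spectral square root and its properties can be checked numerically. This NumPy sketch (added here; the matrix is arbitrary) builds A^{1/2} and A^{-1/2} from the eigendecomposition:

```python
import numpy as np

A = np.array([[4.0, 2.0], [2.0, 3.0]])              # positive definite
lam, P = np.linalg.eigh(A)
A_half = P @ np.diag(np.sqrt(lam)) @ P.T            # A^{1/2} = P Lambda^{1/2} P'
A_neg_half = P @ np.diag(1.0 / np.sqrt(lam)) @ P.T  # A^{-1/2}

print(np.allclose(A_half @ A_half, A))                        # property 2
print(np.allclose(A_neg_half @ A_neg_half, np.linalg.inv(A))) # property 5
```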

Random Vectors and Matrices

A random matrix (vector) is a matrix (vector) whose elements are random variables. If X_{n \times p} is a random matrix, the expected value of X is the n \times p matrix

    E(X) = \begin{bmatrix} E(X_{11}) & E(X_{12}) & \cdots & E(X_{1p}) \\ E(X_{21}) & E(X_{22}) & \cdots & E(X_{2p}) \\ \vdots & \vdots & & \vdots \\ E(X_{n1}) & E(X_{n2}) & \cdots & E(X_{np}) \end{bmatrix},

where

    E(X_{ij}) = \int x_{ij} f_{ij}(x_{ij})\, dx_{ij},

with f_{ij}(x_{ij}) the density function of the continuous random variable X_{ij}. If X_{ij} is a discrete random variable, we compute its expectation as a sum rather than an integral.

Linear Combinations

The usual rules for expectations apply. If X and Y are two random matrices and A and B are two constant matrices of the appropriate dimensions, then

    E(X + Y) = E(X) + E(Y)
    E(AX) = A E(X)
    E(AXB) = A E(X) B
    E(AX + BY) = A E(X) + B E(Y)

Further, if c is a scalar-valued constant, then E(cX) = c E(X).

Mean Vectors and Covariance Matrices

Suppose that X is a p \times 1 (continuous) random vector drawn from some p-dimensional distribution. Each element of X, say X_j, has its own marginal distribution with marginal mean \mu_j and variance \sigma_{jj} defined in the usual way:

    \mu_j = \int x_j f_j(x_j)\, dx_j,
    \sigma_{jj} = \int (x_j - \mu_j)^2 f_j(x_j)\, dx_j.

Mean Vectors and Covariance Matrices (cont'd)

To examine the association between a pair of random variables we need to consider their joint distribution. A measure of the linear association between pairs of variables is given by the covariance

    \sigma_{jk} = E[(X_j - \mu_j)(X_k - \mu_k)]
                = \int \int (x_j - \mu_j)(x_k - \mu_k) f_{jk}(x_j, x_k)\, dx_j\, dx_k.

Mean Vectors and Covariance Matrices (cont'd)

If the joint density function f_{jk}(x_j, x_k) can be written as the product of the two marginal densities, i.e.,

    f_{jk}(x_j, x_k) = f_j(x_j) f_k(x_k),

then X_j and X_k are independent.

More generally, the p-dimensional random vector X has mutually independent elements if the p-dimensional joint density function can be written as the product of the p univariate marginal densities.

If two random variables X_j and X_k are independent, then their covariance is equal to 0. [The converse is not always true.]

Mean Vectors and Covariance Matrices (cont'd)

We use \mu to denote the p \times 1 vector of marginal population means and use \Sigma to denote the p \times p population variance-covariance matrix:

    \Sigma = E[(X - \mu)(X - \mu)'].

If we carry out the multiplication (an outer product), then \Sigma is equal to

    E \begin{bmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) & \cdots & (X_1 - \mu_1)(X_p - \mu_p) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 & \cdots & (X_2 - \mu_2)(X_p - \mu_p) \\ \vdots & \vdots & & \vdots \\ (X_p - \mu_p)(X_1 - \mu_1) & (X_p - \mu_p)(X_2 - \mu_2) & \cdots & (X_p - \mu_p)^2 \end{bmatrix}.

Mean Vectors and Covariance Matrices (cont'd)

By taking expectations element-wise we find that

    \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix}.

Since \sigma_{jk} = \sigma_{kj} for all j \neq k, \Sigma is symmetric. \Sigma is also non-negative definite.

Correlation Matrix

The population correlation matrix is the p \times p matrix with off-diagonal elements equal to \rho_{jk} and diagonal elements equal to 1:

    \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{bmatrix}.

Since \rho_{jk} = \rho_{kj}, the correlation matrix is symmetric. The correlation matrix is also non-negative definite.

Correlation Matrix (cont'd)

The p \times p population standard deviation matrix V^{1/2} is a diagonal matrix with \sqrt{\sigma_{jj}} along the diagonal and zeros in all off-diagonal positions. Then

    \Sigma = V^{1/2} \rho V^{1/2},

and the population correlation matrix is

    \rho = (V^{1/2})^{-1} \Sigma (V^{1/2})^{-1}.

Given \Sigma, we can easily obtain the correlation matrix.
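
A NumPy sketch (added for illustration, with an arbitrary covariance matrix) of the relationship between Sigma, V^{1/2}, and the correlation matrix:

```python
import numpy as np

Sigma = np.array([[4.0, 1.0, 2.0],
                  [1.0, 9.0, -3.0],
                  [2.0, -3.0, 25.0]])

V_half = np.diag(np.sqrt(np.diag(Sigma)))          # standard deviation matrix
V_half_inv = np.linalg.inv(V_half)
rho = V_half_inv @ Sigma @ V_half_inv              # correlation matrix
print(rho)
print(np.allclose(Sigma, V_half @ rho @ V_half))   # True: Sigma = V^{1/2} rho V^{1/2}
```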

Partitioning Random Vectors

If we partition the random p \times 1 vector X into two components X_1, X_2 of dimensions q \times 1 and (p - q) \times 1 respectively, then the mean vector and the variance-covariance matrix need to be partitioned accordingly.

Partitioned mean vector:

    E(X) = E \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}.

Partitioned variance-covariance matrix:

    \Sigma = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) \\ Cov(X_2, X_1) & Var(X_2) \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},

where \Sigma_{11} is q \times q, \Sigma_{12} is q \times (p - q), \Sigma_{21} = \Sigma_{12}', and \Sigma_{22} is (p - q) \times (p - q).

Partitioning Covariance Matrices (cont'd)

\Sigma_{11} and \Sigma_{22} are the variance-covariance matrices of the sub-vectors X_1 and X_2, respectively. The off-diagonal elements in those two matrices reflect linear associations among elements within each sub-vector.

There are no variances in \Sigma_{12}, only covariances. These covariances reflect linear associations between elements in the two different sub-vectors.

Linear Combinations of Random Variables

Let X be a p \times 1 random vector with mean \mu and variance-covariance matrix \Sigma, and let c be a p \times 1 vector of constants. Then the linear combination c'X has mean and variance

    E(c'X) = c'\mu \quad \text{and} \quad Var(c'X) = c'\Sigma c.

In general, the mean and variance of a q \times 1 vector of linear combinations Z = C_{q \times p} X_{p \times 1} are

    \mu_Z = C \mu_X \quad \text{and} \quad \Sigma_Z = C \Sigma_X C'.
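
A NumPy sketch (added here, with arbitrary choices of mu, Sigma, and C) of the mean and covariance of linear combinations Z = CX:

```python
import numpy as np

mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[4.0, 1.0, 2.0],
                  [1.0, 9.0, -3.0],
                  [2.0, -3.0, 25.0]])
C = np.array([[1.0, -1.0, 0.0],
              [0.5, 0.5, 1.0]])

mu_Z = C @ mu              # mean vector of Z = C X
Sigma_Z = C @ Sigma @ C.T  # variance-covariance matrix of Z
print(mu_Z)
print(Sigma_Z)
```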

Cauchy-Schwarz Inequality

We will need some of the results below to derive maximization results later in the course.

Cauchy-Schwarz inequality: Let b and d be any two p \times 1 vectors. Then

    (b'd)^2 \leq (b'b)(d'd),

with equality only if b = cd for some scalar constant c.

Proof: The inequality holds trivially (with equality) for b = 0 or d = 0. For other cases, consider b - cd for any constant c \neq 0. If b - cd \neq 0, we have

    0 < (b - cd)'(b - cd) = b'b - 2c(b'd) + c^2 d'd,

since b - cd must have positive length.

Cauchy-Schwarz Inequality (cont'd)

We can add and subtract (b'd)^2/(d'd) to obtain

    0 < b'b - 2c(b'd) + c^2 d'd - \frac{(b'd)^2}{d'd} + \frac{(b'd)^2}{d'd}
      = b'b - \frac{(b'd)^2}{d'd} + (d'd)\left(c - \frac{b'd}{d'd}\right)^2.

Since c can be anything, we can choose c = b'd / d'd. Then

    0 < b'b - \frac{(b'd)^2}{d'd}, \quad \text{i.e.,} \quad (b'd)^2 < (b'b)(d'd)

for b \neq cd (otherwise, we have equality).

Extended Cauchy-Schwarz Inequality

If b and d are any two p \times 1 vectors and B is a p \times p positive definite matrix, then

    (b'd)^2 \leq (b'Bb)(d'B^{-1}d),

with equality if and only if b = cB^{-1}d (or, equivalently, d = cBb) for some constant c.

Proof: Consider B^{1/2} = \sum_{i=1}^{p} \sqrt{\lambda_i}\, e_i e_i' and B^{-1/2} = \sum_{i=1}^{p} \frac{1}{\sqrt{\lambda_i}} e_i e_i'. Then we can write

    b'd = b'Id = b'B^{1/2}B^{-1/2}d = (B^{1/2}b)'(B^{-1/2}d) = b^{*\prime} d^*,

with b^* = B^{1/2}b and d^* = B^{-1/2}d. To complete the proof, apply the Cauchy-Schwarz inequality to the vectors b^* and d^*:

    (b'd)^2 = (b^{*\prime} d^*)^2 \leq (b^{*\prime} b^*)(d^{*\prime} d^*) = (b'Bb)(d'B^{-1}d).

Optimization

Let B be positive definite and let d be any p \times 1 vector. Then

    \max_{x \neq 0} \frac{(x'd)^2}{x'Bx} = d'B^{-1}d,

attained when x = cB^{-1}d for any constant c \neq 0.

Proof: By the extended Cauchy-Schwarz inequality, (x'd)^2 \leq (x'Bx)(d'B^{-1}d). Since x \neq 0 and B is positive definite, x'Bx > 0, and we can divide both sides by x'Bx to get the upper bound

    \frac{(x'd)^2}{x'Bx} \leq d'B^{-1}d.

Differentiating the left side with respect to x, or simply substituting x = cB^{-1}d, shows that the maximum is attained at x = cB^{-1}d.

Maximization of a Quadratic Form on a Unit Sphere

Let B be positive definite with eigenvalues \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0 and associated normalized eigenvectors e_1, e_2, \ldots, e_p. Then

    \max_{x \neq 0} \frac{x'Bx}{x'x} = \lambda_1, attained when x = e_1,

    \min_{x \neq 0} \frac{x'Bx}{x'x} = \lambda_p, attained when x = e_p.

Furthermore, for k = 1, 2, \ldots, p - 1,

    \max_{x \perp e_1, e_2, \ldots, e_k} \frac{x'Bx}{x'x} = \lambda_{k+1},

attained when x = e_{k+1}.

See the proof at the end of Chapter 2 in the textbook (pages 80-81).
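
A NumPy sketch (added here, using an arbitrary symmetric positive definite matrix) checking the bounds on the Rayleigh quotient x'Bx / x'x with random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
B = np.array([[4.0, 2.0], [2.0, 3.0]])     # positive definite
lam, E = np.linalg.eigh(B)                 # eigenvalues in ascending order

# Rayleigh quotients of many random vectors stay between the smallest and largest eigenvalue
X = rng.normal(size=(1000, 2))
rq = np.einsum('ij,jk,ik->i', X, B, X) / np.einsum('ij,ij->i', X, X)
print(rq.max(), '<=', lam[-1])   # maximum Rayleigh quotient approaches lambda_1
print(rq.min(), '>=', lam[0])    # minimum Rayleigh quotient approaches lambda_p
```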