STAT 100C: Linear models
Arash A. Amini
April 27, 2018
Linear Algebra Review

Read 3.1 and 3.2 from the text.

1. Fundamental subspaces (rank-nullity, etc.): $[\mathrm{Im}(X)]^\perp = \ker(X^T) \subseteq \mathbb{R}^n$. Here $\mathrm{Im}(X) = C(X)$ = image = column space = range, and $\ker(X) = N(X)$ = kernel = null space.
2. Orthogonal decomposition of a space w.r.t. a subspace $V$: $\mathbb{R}^n = V \oplus V^\perp$.
3. Spectral decomposition of a symmetric matrix $A$: $A = U \Lambda U^T$, with $U$ orthogonal and $\Lambda$ diagonal.
Linear independence

A set of vectors $\{x_1, \dots, x_n\}$ is linearly dependent if a nontrivial linear combination of them is zero: that is, there exist $c_1, \dots, c_n$, not all of which are zero, such that
$$\sum_{i=1}^n c_i x_i = 0.$$
Otherwise the set is linearly independent.

Example 1. Which of the two is a linearly independent set?
$$\left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \\ 0 \end{pmatrix} \right\}, \qquad \left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix} \right\}$$
Exercise: Show that if a set of nonzero vectors is pairwise orthogonal, then it is linearly independent.
Exercise: Show that if a set of vectors contains the zero vector, then it is linearly dependent.
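A quick numerical way to check linear (in)dependence is to compare the rank of the matrix whose columns are the vectors with the number of vectors. A minimal numpy sketch, using the two candidate sets from Example 1:

```python
import numpy as np

# A set of vectors is linearly independent iff the matrix with those
# vectors as its columns has rank equal to the number of vectors.
dependent = np.array([[1., 2.],
                      [1., 2.],
                      [0., 0.]])    # columns (1,1,0) and (2,2,0)
independent = np.array([[1., 1.],
                        [1., -1.],
                        [0., 0.]])  # columns (1,1,0) and (1,-1,0)

print(np.linalg.matrix_rank(dependent))    # 1 < 2, so dependent
print(np.linalg.matrix_rank(independent))  # 2, so independent
```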
Span

The span of a set of vectors is the set of all their linear combinations:
$$\mathrm{span}\{x_1, \dots, x_n\} = \left\{ \sum_{i=1}^n c_i x_i \;\middle|\; c_1, \dots, c_n \in \mathbb{R} \right\}$$

Example 2.
$$\mathrm{span}\left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \\ 0 \end{pmatrix} \right\} = \left\{ t_1 \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} + t_2 \begin{pmatrix} 2 \\ 2 \\ 0 \end{pmatrix} \;\middle|\; t_1, t_2 \in \mathbb{R} \right\} = \left\{ (t_1 + 2t_2) \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \;\middle|\; t_1, t_2 \in \mathbb{R} \right\} = \left\{ t \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \;\middle|\; t \in \mathbb{R} \right\}$$
Example 3.
$$\mathrm{span}\left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix} \right\} = \left\{ t_1 \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} + t_2 \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix} \;\middle|\; t_1, t_2 \in \mathbb{R} \right\} = \left\{ \begin{pmatrix} t_1 + t_2 \\ t_1 - t_2 \\ 0 \end{pmatrix} \;\middle|\; t_1, t_2 \in \mathbb{R} \right\}$$
$$= \left\{ \begin{pmatrix} \alpha \\ \beta \\ 0 \end{pmatrix} \;\middle|\; \alpha, \beta \in \mathbb{R} \right\} = \left\{ \alpha \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + \beta \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \;\middle|\; \alpha, \beta \in \mathbb{R} \right\} = \mathrm{span}\left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \right\}$$
Basis and dimension

A set of linearly independent vectors that span a subspace $V$ is called a basis of the subspace. A subspace $V$ can have different bases:
$$V := \left\{ \begin{pmatrix} \alpha \\ \beta \\ 0 \end{pmatrix} \;\middle|\; \alpha, \beta \in \mathbb{R} \right\} = \mathrm{span}\left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix} \right\} = \mathrm{span}\left\{ \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \right\}$$
All the bases for a given subspace $V$ have the same number of elements. This common number is the dimension of $V$, denoted $\dim(V)$. In the example, $\dim(V) = 2$. Dimension formalizes the notion of degrees of freedom.
Image or column space

The column space of a matrix $X$ is the span of the columns of $X$. Let $\{x_1, \dots, x_p\} \subset \mathbb{R}^n$ ($n$-dimensional vectors). Form a matrix with these columns:
$$X = (x_1 \; x_2 \; \cdots \; x_p) \in \mathbb{R}^{n \times p}$$
Recall that for $\beta = (\beta_1, \dots, \beta_p) \in \mathbb{R}^p$ we have
$$X\beta = (x_1 \; x_2 \; \cdots \; x_p) \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} = \sum_{j=1}^p \beta_j x_j$$
so that
$$\mathrm{col}(X) = \mathrm{span}\{x_1, \dots, x_p\} = \left\{ \sum_{j=1}^p \beta_j x_j \;\middle|\; \beta_1, \dots, \beta_p \in \mathbb{R} \right\} = \{X\beta : \beta \in \mathbb{R}^p\}$$
Image or column space

The column space is also called the range or image:
$$\mathrm{col}(X) = \mathrm{ran}(X) = \mathrm{Im}(X) = \{X\beta : \beta \in \mathbb{R}^p\}$$
$\mathrm{Im}(X)$ is a linear subspace of $\mathbb{R}^n$. The dimension of the image is called the rank of the matrix:
$$\mathrm{rank}(X) := \dim(\mathrm{Im}(X))$$
$\mathrm{rank}(X)$ is the number of linearly independent columns of $X$.
Kernel or null space

The kernel of $X \in \mathbb{R}^{n \times p}$ is the set of all vectors that are mapped to zero by $X$:
$$\ker(X) = \{\beta \in \mathbb{R}^p \mid X\beta = 0\}$$
Note that $\ker(X) \subseteq \mathbb{R}^p$.
Example 4. Consider
$$X = \begin{pmatrix} 1 & 2 \\ 1 & 2 \\ 0 & 0 \end{pmatrix}$$
What is $\mathrm{Im}(X)$?
$$\mathrm{Im}(X) = \mathrm{span}\left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ 2 \\ 0 \end{pmatrix} \right\} = \mathrm{span}\left\{ \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} \right\},$$
so $\mathrm{rank}(X) = \dim(\mathrm{Im}(X)) = 1$. What is $\ker(X)$?
$$\ker(X) = \left\{ \beta = (\beta_1, \beta_2) \;\middle|\; \beta_1 \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix} + \beta_2 \begin{pmatrix} 2 \\ 2 \\ 0 \end{pmatrix} = 0 \right\} = \{\beta = (\beta_1, \beta_2) \mid \beta_1 + 2\beta_2 = 0\}$$
Hence $\mathrm{null}(X) = \dim(\ker(X)) = 1$, and $\mathrm{rank}(X) + \mathrm{null}(X) = 2$.
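Example 4 can be verified numerically. A small numpy sketch computing the rank, the nullity via rank-nullity, and checking that a vector satisfying $\beta_1 + 2\beta_2 = 0$ lies in the kernel:

```python
import numpy as np

# The matrix from Example 4: columns (1,1,0) and (2,2,0).
X = np.array([[1., 2.],
              [1., 2.],
              [0., 0.]])

rank = np.linalg.matrix_rank(X)   # dim Im(X) = 1
nullity = X.shape[1] - rank       # rank-nullity: rank + nullity = 2
print(rank, nullity)              # 1 1

# A kernel vector: any (b1, b2) with b1 + 2*b2 = 0, e.g. (-2, 1).
b = np.array([-2., 1.])
print(np.allclose(X @ b, 0))      # True
```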
Inner product, orthogonality

For two vectors $x, y \in \mathbb{R}^n$, the (Euclidean) inner product is
$$\langle x, y \rangle := \sum_{i=1}^n x_i y_i = (x_1, \dots, x_n) \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = x^T y$$
$x$ is orthogonal to $y$, denoted $x \perp y$, if $\langle x, y \rangle = 0$. Let $V \subseteq \mathbb{R}^n$ be a (linear) subspace. We say $x$ is orthogonal to $V$, denoted $x \perp V$, whenever $x \perp v$ for all $v \in V$.
Orthogonal complement

The set of all vectors $x$ that are orthogonal to $V$ is called the orthogonal complement of $V$:
$$V^\perp = \{x \in \mathbb{R}^n : x \perp V\} = \{x \in \mathbb{R}^n : \langle x, y \rangle = 0 \text{ for all } y \in V\}$$
Example 5. Consider
$$X = \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ 0 & 0 \end{pmatrix}$$
Let $V = \mathrm{Im}(X)$. What is $V^\perp$? Let $x_1, x_2$ be the two columns of $X$. Then $z \in V^\perp$ iff $z \perp x_1$ and $z \perp x_2$. This is equivalent to $z_1 = z_2 = 0$, while $z_3$ can be anything:
$$V^\perp = \left\{ \begin{pmatrix} 0 \\ 0 \\ \alpha \end{pmatrix} : \alpha \in \mathbb{R} \right\} = \mathrm{span}\left\{ \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} \right\}.$$
Norm and distance

The (Euclidean) norm of a vector $x = (x_1, \dots, x_n) \in \mathbb{R}^n$, also called the $\ell_2$ norm, is
$$\|x\| = \sqrt{\langle x, x \rangle} = \sqrt{x^T x} = \sqrt{\sum_{i=1}^n x_i^2}.$$
For two vectors $x, y \in \mathbb{R}^n$, their (Euclidean) distance is
$$\|x - y\| = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}.$$
Exercise: Show that $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x, y \rangle$.
Projection

Let $V \subseteq \mathbb{R}^n$ be a subspace and $x \in \mathbb{R}^n$ a vector. Then there is a unique closest element in $V$ to $x$:
$$\hat{x} = \mathrm{argmin}_{z \in V} \|x - z\|.$$
$\hat{x}$ is called the projection of $x$ onto $V$. Consider the error $e = x - \hat{x}$. The projection $\hat{x}$ is the only vector in $V$ such that $e \perp V$. In other words, $e \in V^\perp$. Thus,

Proposition 1. Every vector $x \in \mathbb{R}^n$ can be uniquely represented as $x = \hat{x} + e$, where $\hat{x} \in V$ and $e \in V^\perp$; mnemonically, we write $\mathbb{R}^n = V \oplus V^\perp$.

Thus, we have an orthogonal decomposition of the space w.r.t. a subspace $V$ and its orthogonal complement $V^\perp$.
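When $V = \mathrm{Im}(X)$ for a matrix $X$ with independent columns, the projection can be computed with a least-squares solve. A minimal sketch (the matrix and vector below are illustrative, not from the slides), verifying the orthogonality principle $e \perp V$:

```python
import numpy as np

# Projection of x onto V = Im(X): solve min_z ||x - X z|| and set xhat = X zhat.
X = np.array([[1., 0.],
              [1., 1.],
              [0., 1.]])        # hypothetical basis of V as columns
x = np.array([1., 2., 3.])

zhat, *_ = np.linalg.lstsq(X, x, rcond=None)
xhat = X @ zhat                 # projection of x onto Im(X)
e = x - xhat                    # error

# Orthogonality principle: e is orthogonal to every column of X, hence to V.
print(np.allclose(X.T @ e, 0))  # True
```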
Proof (Optional). ($e \perp V$ implies $\hat{x}$ is the projection.) Assume $\hat{x} \in V$ is such that $e := x - \hat{x} \perp V$. For any $z \in V$, we have (substituting $x = \hat{x} + e$)
$$\|x - z\|^2 = \|\hat{x} - z\|^2 + 2\langle e, \hat{x} - z \rangle + \|e\|^2. \quad (1)$$
Since $\hat{x} - z \in V$ (why?), the cross-term vanishes. Thus $\|x - z\|^2 = \|\hat{x} - z\|^2 + \|e\|^2$ for all $z \in V$, showing that $\hat{x}$ is the projection (why?).

(Conversely, $\hat{x}$ being the projection implies $e \perp V$.) Since $\hat{x}$ minimizes $z \mapsto \|x - z\|^2$, from (1),
$$\|\hat{x} - z\|^2 + 2\langle e, \hat{x} - z \rangle = \|x - z\|^2 - \|e\|^2 \ge 0, \quad \forall z \in V.$$
By the change of variable $u = \hat{x} - z$,
$$\|u\|^2 + 2\langle e, u \rangle \ge 0, \quad \forall u \in V.$$
Changing $u$ to $tu$ and letting $t \to 0$ gives $\langle e, u \rangle \ge 0$ for all $u \in V$; applying this to $-u \in V$ as well gives $\langle e, u \rangle = 0$ for all $u \in V$.
Other facts

If $V \subseteq \mathbb{R}^n$ is a linear subspace:
(a) $(V^\perp)^\perp = V$.
(b) $\dim(\mathbb{R}^n) = \dim(V) + \dim(V^\perp)$.
(b') $\dim(V^\perp) = n - \dim(V)$.
(c) If $X \in \mathbb{R}^{m \times n}$, then $[\mathrm{Im}(X)]^\perp = \ker(X^T)$.
(d) $\mathrm{rank}(X) + \mathrm{nullity}(X^T) = m$.

Notes: (b) follows from $\mathbb{R}^n = V \oplus V^\perp$. We will see the proof of (c) later. (d) follows by taking dimensions of both sides of (c) and using (b), since both subspaces live in $\mathbb{R}^m$. Recall that $\mathrm{nullity}(A) = \dim(\ker(A))$, i.e., the dimension of the null space.
Spectral decomposition of symmetric matrices

Let $A \in \mathbb{R}^{n \times n}$ be a symmetric matrix: $A = A^T$. Then we have the eigenvalue decomposition (EVD) of the matrix:
$$A = U \Lambda U^T$$
where $U$ is an orthogonal matrix, $UU^T = U^T U = I$ (i.e., $U^{-1} = U^T$), and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$, where $\{\lambda_1, \dots, \lambda_n\}$ are the eigenvalues of $A$. The columns of $U$, denoted $\{u_1, \dots, u_n\}$, are eigenvectors of $A$:
$$A u_i = \lambda_i u_i$$
$\{u_1, \dots, u_n\}$ is an orthonormal basis for $\mathbb{R}^n$:
$$\langle u_i, u_j \rangle = \begin{cases} 0 & i \ne j \\ 1 & i = j \end{cases}$$
There is a corresponding decomposition for a general (rectangular) matrix, called the singular value decomposition (SVD).
Example:
$$A_1 = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt 2} & \frac{1}{\sqrt 2} \\ \frac{1}{\sqrt 2} & -\frac{1}{\sqrt 2} \end{pmatrix} \begin{pmatrix} 3 & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt 2} & \frac{1}{\sqrt 2} \\ \frac{1}{\sqrt 2} & -\frac{1}{\sqrt 2} \end{pmatrix}^T$$
Example:
$$A_2 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt 2} & \frac{1}{\sqrt 2} \\ \frac{1}{\sqrt 2} & -\frac{1}{\sqrt 2} \end{pmatrix} \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt 2} & \frac{1}{\sqrt 2} \\ \frac{1}{\sqrt 2} & -\frac{1}{\sqrt 2} \end{pmatrix}^T$$
What is $\mathrm{rank}(A_1)$ and $\mathrm{rank}(A_2)$?
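The decompositions of $A_1$ and $A_2$ above can be checked numerically with `numpy.linalg.eigh`, which is the EVD routine for symmetric matrices:

```python
import numpy as np

# EVD of the two symmetric matrices from the example.
A1 = np.array([[1., 2.], [2., 1.]])
A2 = np.array([[1., 1.], [1., 1.]])

lam1, U1 = np.linalg.eigh(A1)   # eigenvalues in ascending order: -1 and 3
lam2, U2 = np.linalg.eigh(A2)   # eigenvalues 0 and 2

# Reconstruct A = U Lambda U^T and check that U is orthogonal.
print(np.allclose(U1 @ np.diag(lam1) @ U1.T, A1))  # True
print(np.allclose(U1.T @ U1, np.eye(2)))           # True

# rank = number of nonzero eigenvalues: rank(A1) = 2, rank(A2) = 1.
print(np.linalg.matrix_rank(A1), np.linalg.matrix_rank(A2))
```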
Positive semi-definite (PSD) matrices

A symmetric matrix $A \in \mathbb{R}^{n \times n}$ is PSD if
$$\langle x, Ax \rangle = x^T A x \ge 0, \quad \text{for all } x \in \mathbb{R}^n.$$
It is positive definite (PD) if $x^T A x > 0$ for all $x \ne 0$. Let $\lambda_1(A), \lambda_2(A), \dots, \lambda_n(A)$ be the eigenvalues of $A$.
- $A$ is PSD if and only if $\lambda_i(A) \ge 0$ for all $i = 1, \dots, n$.
- $A$ is PD if and only if $\lambda_i(A) > 0$ for all $i = 1, \dots, n$.
Every PSD matrix $A$ has a symmetric square root $A^{1/2}$, defined as the unique symmetric PSD matrix such that
$$A^{1/2} A^{1/2} = A.$$
If $A = U \Lambda U^T$ is the EVD of $A$, then it is easy to show that $A^{1/2} = U \Lambda^{1/2} U^T$.
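The recipe $A^{1/2} = U \Lambda^{1/2} U^T$ translates directly into code. A sketch with an illustrative PSD matrix (eigenvalues 1 and 3), verifying both defining properties of the square root:

```python
import numpy as np

# Symmetric square root of a PSD matrix via its EVD: A^{1/2} = U sqrt(Lambda) U^T.
A = np.array([[2., 1.],
              [1., 2.]])        # hypothetical PSD matrix, eigenvalues 1 and 3

lam, U = np.linalg.eigh(A)
A_half = U @ np.diag(np.sqrt(lam)) @ U.T

print(np.allclose(A_half @ A_half, A))  # True: (A^{1/2})^2 = A
print(np.allclose(A_half, A_half.T))    # True: the square root is symmetric
```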
Expectation

Assume that $y = (y_1, \dots, y_n)$ is a random vector. We define
$$E(y) := (E(y_1), \dots, E(y_n)),$$
or compactly $[E(y)]_i = E(y_i)$. $E(y)$ is a nonrandom vector in $\mathbb{R}^n$. An important consequence of the linearity of expectation:

Lemma 1. If $A \in \mathbb{R}^{m \times n}$ is nonrandom and $y \in \mathbb{R}^n$ is random, then $E(Ay) = A\,E(y)$.

Proof:
$$[E(Ay)]_i = E[(Ay)_i] = E\Big[\sum_{j=1}^n A_{ij} y_j\Big] = \sum_{j=1}^n A_{ij} E(y_j) = [A\,E(y)]_i$$
Similarly, we define the expectations of random matrices elementwise: if $A \in \mathbb{R}^{m \times n}$ is a random matrix with entries $a_{ij}$, then $EA$ is the nonrandom matrix in $\mathbb{R}^{m \times n}$ with entries $[E(A)]_{ij} := E(a_{ij})$.

Example 6. Let $X_1, X_2 \sim N(0, 1)$ with $\mathrm{cov}(X_1, X_2) = \rho$. Consider
$$A = \begin{pmatrix} X_1^2 & X_1 X_2 \\ X_1 X_2 & X_2^2 \end{pmatrix}.$$
Then,
$$E(A) = \begin{pmatrix} E(X_1^2) & E(X_1 X_2) \\ E(X_1 X_2) & E(X_2^2) \end{pmatrix} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$
Extension of Lemma?? The following is a very useful extension of Lemma??: Lemma 2 Consider matrices A, B, C such that A is random, B and C are nonrandom. Assume that the dimensions allow us to form matrix product BAC. Then, E(BAC) = B E(A) C (2) Important: We keep the order of matrix multiplication in (??). (Matrix multiplication is noncommutative.) 32 / 1
Example 7. Let $x = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$ and $A$ as before,
$$A = \begin{pmatrix} X_1^2 & X_1 X_2 \\ X_1 X_2 & X_2^2 \end{pmatrix}.$$
Then,
$$E(x^T A x) = x^T E(A)\,x = (\alpha \;\; \beta) \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \alpha^2 + \beta^2 + 2\rho\alpha\beta$$
Covariance matrix

Consider a random vector $y = (y_1, \dots, y_n) \in \mathbb{R}^n$.

Definition 1. The covariance matrix of $y$, denoted $\mathrm{cov}(y)$, is the $n \times n$ matrix with entries
$$[\mathrm{cov}(y)]_{ij} = \mathrm{cov}(y_i, y_j) = E[(y_i - E y_i)(y_j - E y_j)]$$
Note: $E y_i = E(y_i)$; we drop parentheses for simplicity. $\mathrm{cov}(y_i, y_j)$ is the usual covariance between $y_i$ and $y_j$. The diagonal entries of the covariance matrix are
$$[\mathrm{cov}(y)]_{ii} = \mathrm{cov}(y_i, y_i) = \mathrm{var}(y_i)$$
Recall the alternative formula: $\mathrm{cov}(y_i, y_j) = E(y_i y_j) - E(y_i)E(y_j)$.
Some properties of the covariance matrix $\Sigma := \mathrm{cov}(y)$:
- $\Sigma$ is symmetric ($\Sigma = \Sigma^T$).
- If $(y_1, \dots, y_n)$ are pairwise uncorrelated, then $\Sigma$ is diagonal.
- Letting $\mu = E(y) \in \mathbb{R}^n$, we have $\Sigma = E[(y - \mu)(y - \mu)^T]$.
- We also have $\Sigma = E(yy^T) - \mu\mu^T$.
- Let $\tilde{y} := y - E(y)$ be the centered version of $y$. Then $\mathrm{cov}(y) = E(\tilde{y}\tilde{y}^T)$.
- Let $\alpha \in \mathbb{R}$ and $b \in \mathbb{R}^n$; then $\mathrm{cov}(\alpha y + b) = \alpha^2 \mathrm{cov}(y)$.
Lemma 3. Let $A \in \mathbb{R}^{m \times n}$, and let $y \in \mathbb{R}^n$ be a random vector. Then
$$\mathrm{cov}(Ay) = A\,\mathrm{cov}(y)\,A^T$$
Proof: Let $u := Ay$ and $\tilde{u} := u - Eu$. Also, let $\tilde{y} := y - Ey$. Since $E(u) = A\,E(y)$, we have $\tilde{u} = A\tilde{y}$, hence
$$\mathrm{cov}(u) = E(\tilde{u}\tilde{u}^T) = E(A\tilde{y}\tilde{y}^T A^T) = A\,E(\tilde{y}\tilde{y}^T)\,A^T$$
where the last equality is by Lemma 2. (Recall that $(AB)^T = B^T A^T$.)
Example 8. Let $X_1, X_2 \sim N(0, 1)$ with $\mathrm{cov}(X_1, X_2) = \rho$, and let $X = (X_1, X_2) \in \mathbb{R}^2$. (Think of it as a column vector.) What is the covariance matrix of $X$?
$$\mathrm{cov}(X) = \mathrm{cov}\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1, X_2) \\ \mathrm{cov}(X_2, X_1) & \mathrm{var}(X_2) \end{pmatrix} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$
Let $u := \alpha X_1 + \beta X_2$ for $\alpha, \beta \in \mathbb{R}$. Then
$$u = (\alpha \;\; \beta) \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$$
By Lemma 3,
$$\mathrm{cov}(u) = (\alpha \;\; \beta)\,\mathrm{cov}(X) \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \alpha^2 + \beta^2 + 2\rho\alpha\beta$$
Note that $\mathrm{cov}(u) = \mathrm{var}(u)$ is a scalar in this case.
Example 9. Let $X_1, X_2 \sim N(0, 1)$ with $\mathrm{cov}(X_1, X_2) = \rho$ as in the previous example. Let $Z_1 = X_1$, $Z_2 = X_2$, and $Z_3 = X_1 - X_2$, and
$$Z = \begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \end{pmatrix}, \qquad E(Z) = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad \mathrm{cov}(Z) = ?$$
Approach 1 (recall that covariance is bilinear):
$$\mathrm{cov}(Z_1, Z_3) = \mathrm{cov}(X_1, X_1 - X_2) = \mathrm{cov}(X_1, X_1) - \mathrm{cov}(X_1, X_2) = \mathrm{var}(X_1) - \mathrm{cov}(X_1, X_2) = 1 - \rho.$$
Similarly, $\mathrm{cov}(Z_2, Z_3) = \mathrm{cov}(X_2, X_1 - X_2) = \rho - 1$. Using the previous example, $\mathrm{var}(Z_3) = 1 + 1 - 2\rho = 2(1 - \rho)$. Continued on the next slide...
Conclude that
$$\mathrm{cov}(Z) = \begin{pmatrix} 1 & \rho & 1 - \rho \\ \rho & 1 & \rho - 1 \\ 1 - \rho & \rho - 1 & 2(1 - \rho) \end{pmatrix}$$
Approach 2 (matrix approach):
$$Z = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} =: AX$$
Then $E(Z) = A\,E(X) = 0$ and
$$\mathrm{cov}(Z) = A\,\mathrm{cov}(X)\,A^T = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & -1 \end{pmatrix}$$
which gives the desired result.
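The matrix approach of Example 9 is easy to check numerically: form $A \Sigma A^T$ and compare against the entrywise answer from Approach 1. A sketch with an illustrative value of $\rho$:

```python
import numpy as np

# Example 9 via the matrix approach: Z = A X with cov(X) = [[1, rho], [rho, 1]].
rho = 0.5                                   # illustrative value
Sigma = np.array([[1., rho], [rho, 1.]])
A = np.array([[1., 0.],
              [0., 1.],
              [1., -1.]])

cov_Z = A @ Sigma @ A.T                     # Lemma 3: cov(AX) = A cov(X) A^T
expected = np.array([[1.,      rho,     1 - rho],
                     [rho,     1.,      rho - 1],
                     [1 - rho, rho - 1, 2 * (1 - rho)]])
print(np.allclose(cov_Z, expected))         # True
```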
Proposition 2. A covariance matrix is always positive semi-definite (PSD).

Proof: Fix $a \in \mathbb{R}^n$ and let $y \in \mathbb{R}^n$ be random. Then,
$$\mathrm{var}(a^T y) = \mathrm{cov}(a^T y) = a^T \mathrm{cov}(y)\,a$$
Since $\mathrm{var}(a^T y) \ge 0$, we conclude $a^T \mathrm{cov}(y)\,a \ge 0$ for all $a \in \mathbb{R}^n$.

Pathological case: If for some $a \ne 0$ we have $a^T \mathrm{cov}(y)\,a = 0$, then $\mathrm{var}(a^T y) = 0$, hence $a^T y$ is constant with probability 1; that is, the distribution of $y$ lies on a lower-dimensional subspace. Note that in this case $\mathrm{cov}(y)$ is a singular matrix.
Decorrelation or whitening

For any random vector $y$, it is possible to find a linear transform $A$ such that $z := Ay$ has identity covariance matrix: $\mathrm{cov}(z) = I_n = \mathrm{diag}(1, 1, \dots, 1)$; that is, the components of $z = (z_1, \dots, z_n)$ are uncorrelated and have unit variance. How to get $A$? Let $\Sigma = \mathrm{cov}(y)$ and take $A := \Sigma^{-1/2}$. Hence $z = \Sigma^{-1/2} y$, and
$$\mathrm{cov}(z) = \Sigma^{-1/2}\,\mathrm{cov}(y)\,\Sigma^{-1/2} = \Sigma^{-1/2} \Sigma \Sigma^{-1/2} = \underbrace{\Sigma^{-1/2}\Sigma^{1/2}}_{I_n}\,\underbrace{\Sigma^{1/2}\Sigma^{-1/2}}_{I_n} = I_n.$$
Exercise: Show that we can also take $A = \Lambda^{-1/2} U^T$, where $\Sigma = U \Lambda U^T$ is the EVD of $\Sigma$.
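A minimal whitening sketch: build $\Sigma^{-1/2}$ from the EVD and verify $\Sigma^{-1/2}\Sigma\Sigma^{-1/2} = I$. The correlation value is illustrative:

```python
import numpy as np

# Whitening: z = Sigma^{-1/2} y has identity covariance.
rho = 0.8                                    # illustrative correlation
Sigma = np.array([[1., rho],
                  [rho, 1.]])

lam, U = np.linalg.eigh(Sigma)
Sigma_inv_half = U @ np.diag(lam ** -0.5) @ U.T   # Sigma^{-1/2}

# cov(z) = Sigma^{-1/2} Sigma Sigma^{-1/2} = I
print(np.allclose(Sigma_inv_half @ Sigma @ Sigma_inv_half, np.eye(2)))  # True
```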
Multivariate normal distribution

Definition 2. A random vector $y = (y_1, \dots, y_n)$ has a multivariate normal (MVN) distribution with mean vector $\mu = (\mu_1, \dots, \mu_n)$ and covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ if it has the density
$$f(y) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right], \quad y \in \mathbb{R}^n.$$
We write $y \sim N(\mu, \Sigma)$. We implicitly assume that $\Sigma$ is invertible. The MVN has numerous interesting properties. We write $y \sim N_n(\mu, \Sigma)$ to emphasize the dimension (i.e., $y \in \mathbb{R}^n$).
Important properties of MVN

1. Any affine transformation of $y \sim N(\mu, \Sigma)$ again has an MVN distribution.

Lemma 4. Assume that $y \sim N_n(\mu, \Sigma)$ and let $u := Ay + b$, where $A \in \mathbb{R}^{p \times n}$ and $b \in \mathbb{R}^p$ are nonrandom. Then $u \sim N(\tilde{\mu}, \tilde{\Sigma})$, where
$$\tilde{\mu} = A\mu + b, \qquad \tilde{\Sigma} = A \Sigma A^T.$$
Special case: taking $u = a^T y$ for nonrandom $a \in \mathbb{R}^n$, we obtain $a^T y \sim N(a^T \mu, a^T \Sigma a)$.
2. Marginal distributions of $y \sim N(\mu, \Sigma)$ are again MVN. This is a consequence of Lemma 4: Suppose we partition $y \in \mathbb{R}^n$ into $y_1 \in \mathbb{R}^p$ and $y_2 \in \mathbb{R}^{n-p}$. Then
$$\underbrace{\begin{pmatrix} I_{p \times p} & 0_{p \times (n-p)} \end{pmatrix}}_{A} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = I y_1 + 0\, y_2 = y_1.$$
Thus $y_1 = Ay$, hence $y_1 \sim N(A\mu, A\Sigma A^T)$:
$$A\mu = \begin{pmatrix} I & 0 \end{pmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = \mu_1, \qquad A\Sigma A^T = \begin{pmatrix} I & 0 \end{pmatrix} \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \begin{pmatrix} I \\ 0 \end{pmatrix} = \Sigma_{11}.$$
So $y_1 \sim N_p(\mu_1, \Sigma_{11})$. Similarly, $y_2 \sim N_{n-p}(\mu_2, \Sigma_{22})$.
3. Conditional distributions of $y \sim N(\mu, \Sigma)$ are again MVN.

Lemma 5. Assume that $y \sim N(\mu, \Sigma)$ and partition $y$ into two pieces $y_1$ and $y_2$. Then
$$y_1 \mid y_2 \sim N\big(\mu_{1|2}(y_2),\, \Sigma_{1|2}\big)$$
where
$$\mu_{1|2}(y_2) = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(y_2 - \mu_2), \qquad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{12}^T.$$
Note that if $\Sigma_{12} = 0$, then $y_1 \mid y_2 \sim N(\mu_1, \Sigma_{11})$; that is, $y_1$ and $y_2$ are independent.
4. Uncorrelatedness is equivalent to independence: If
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} \sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$$
then $y_1$ and $y_2$ are independent if and only if $\Sigma_{12} = 0$.

Proof: Follows from Lemma 5, as mentioned. Independence always implies uncorrelatedness, but not necessarily vice versa. For the MVN, however, the reverse implication holds as well.
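The conditional-distribution formulas are straightforward to compute. A minimal sketch in the simplest bivariate case, with illustrative numbers (means 0, unit variances, cross-covariance 0.6):

```python
import numpy as np

# Conditional MVN (as in Lemma 5): compute mu_{1|2}(y2) and Sigma_{1|2}
# for a bivariate example. All numbers are illustrative.
mu1, mu2 = np.array([0.]), np.array([0.])
S11 = np.array([[1.]])
S12 = np.array([[0.6]])
S22 = np.array([[1.]])

y2 = np.array([2.])                               # observed value of y2
S22_inv = np.linalg.inv(S22)

mu_cond = mu1 + S12 @ S22_inv @ (y2 - mu2)        # conditional mean: 0.6 * 2 = 1.2
S_cond = S11 - S12 @ S22_inv @ S12.T              # conditional cov: 1 - 0.36 = 0.64

print(mu_cond, S_cond)
```

Note how observing $y_2$ shifts the mean of $y_1$ toward it and shrinks its variance, and that $\Sigma_{12} = 0$ would leave both unchanged.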
Example 10. Let $y \sim N(0, \Sigma)$. Since $\Sigma$ is PSD, it has a square root $\Sigma^{1/2}$. Let $z = \Sigma^{-1/2} y$. What is the distribution of $z$? We claim that $z \sim N(0, I_n)$:
$$E(z) = \Sigma^{-1/2} E(y) = 0, \qquad \mathrm{cov}(z) = \Sigma^{-1/2}\,\mathrm{cov}(y)\,\Sigma^{-1/2} = \Sigma^{-1/2} \Sigma \Sigma^{-1/2} = I_n.$$
Recall that $z = \Sigma^{-1/2} y$ is the whitened version of $y$. We can reduce many problems about $y$ to problems about $z$. The advantage is
$$z \sim N(0, I_n) \iff z_1, \dots, z_n \overset{\text{iid}}{\sim} N(0, 1),$$
i.e., $z$ has independent identically distributed (iid) $N(0, 1)$ coordinates.
Linear model

A multiple linear regression (MLR) model, or simply a linear model, is:

Population version:
$$y = \underbrace{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}_{\mu} + \varepsilon$$
- $y$ is the response, or dependent variable (observed).
- $x_j$, $j = 1, \dots, p$ are the independent, explanatory, or regressor variables, also called predictors or covariates.
- $\beta_j$ are fixed unknown parameters (unobserved).
- $x_j$ are fixed, i.e., deterministic, for now (observed). Alternatively, we work conditioned on $\{x_j\}$.
- $\varepsilon$ is random: the noise, random error, or unexplained variation. It is the only source of randomness in the model.
Assumption: $E[\varepsilon] = 0$, so that $E[y] = \mu + E[\varepsilon] = \mu$. The goal is to estimate the $\beta_j$.
Examples

Give an example: Explaining 100C grades. Gas consumption data. Oral contraceptives.
Sampled version

Our data or observation is a collection of $n$ i.i.d. samples from this model: we observe $\{y_i, x_{i1}, \dots, x_{ip}\}$, $i = 1, \dots, n$, and
$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i$$
Matrix-vector form: Let
$$x_j = (x_{1j}, \dots, x_{nj}) \in \mathbb{R}^n, \quad x_0 := \mathbf{1} := (1, \dots, 1) \in \mathbb{R}^n, \quad \varepsilon = (\varepsilon_1, \dots, \varepsilon_n), \quad y = (y_1, \dots, y_n) \in \mathbb{R}^n$$
Then
$$y = \beta_0 \mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon = \sum_{j=0}^p \beta_j x_j + \varepsilon$$
where everything is now a vector except the $\beta_j$.
Let $\beta = (\beta_0, \beta_1, \dots, \beta_p) \in \mathbb{R}^{p+1}$, and let $X = (x_0 \; x_1 \; \cdots \; x_p) \in \mathbb{R}^{n \times (p+1)}$. Since $X\beta = \sum_{j=0}^p \beta_j x_j$, we have
$$y = \underbrace{X\beta}_{\mu} + \varepsilon.$$
This is called the multiple linear regression (MLR) model.

Assumptions on the noise:
(E1) $E[\varepsilon] = 0$, $\mathrm{cov}(\varepsilon) = \sigma^2 I_n$, and $\{\varepsilon_i\}$ independent.
(E2) Distributional assumption: $\varepsilon \sim N(0, \sigma^2 I_n)$.
(E1) implies that $E[y] = \mu = X\beta$ and $\mathrm{cov}(y) = \sigma^2 I_n$ (why?). (E2) implies $y \sim N(X\beta, \sigma^2 I_n)$.

Assumption on the design matrix:
(X1) $X \in \mathbb{R}^{n \times (p+1)}$ is fixed (nonrandom) and has full column rank; that is, $p + 1 \le n$ and $\mathrm{rank}(X) = p + 1$.
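The model under (E2) is easy to simulate, which is useful for checking later results numerically. A sketch with illustrative dimensions and parameter values:

```python
import numpy as np

# Simulate y = X beta + eps under (E2). Dimensions and values are illustrative.
rng = np.random.default_rng(0)
n, p = 50, 2
sigma = 0.5

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column is 1
beta = np.array([1.0, 2.0, -1.0])                           # (beta0, beta1, beta2)
eps = sigma * rng.normal(size=n)                            # eps ~ N(0, sigma^2 I_n)
y = X @ beta + eps

print(X.shape, y.shape)           # (50, 3) (50,)
print(np.linalg.matrix_rank(X))   # 3, so assumption (X1) holds for this draw
```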
Maximum likelihood estimation (MLE)

We are interested in estimating both $\beta$ and $\sigma^2$ (the noise variance). The MLE requires a distributional assumption; here we assume (E2). The PDF of $y \sim N(\mu, \sigma^2 I)$ is
$$f(y) = \frac{1}{(2\pi)^{n/2} |\sigma^2 I_n|^{1/2}} \exp\left(-\frac{1}{2}(y - \mu)^T (\sigma^2 I_n)^{-1} (y - \mu)\right) = \frac{1}{(2\pi)^{n/2} (\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\|y - \mu\|^2\right)$$
Viewed as a function of $\beta$ and $\sigma^2$, this is the likelihood
$$L(\beta, \sigma^2 \mid y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\|y - X\beta\|^2\right)$$
Let us estimate $\beta$ first: maximizing the likelihood is equivalent to minimizing
$$S(\beta) := \|y - X\beta\|^2 = \sum_{i=1}^n [y_i - (X\beta)_i]^2$$
The problem $\min_\beta S(\beta)$ is called the least-squares (LS) problem.
Use the chain rule to compute the gradient of $S(\beta)$ and set it to zero:
$$\frac{\partial S(\beta)}{\partial \beta_k} = -2 \sum_{i=1}^n \frac{\partial (X\beta)_i}{\partial \beta_k}\,[y_i - (X\beta)_i], \qquad \frac{\partial (X\beta)_i}{\partial \beta_k} = X_{ik}$$
Hence
$$\frac{\partial S(\beta)}{\partial \beta_k} = -2[X^T(y - X\beta)]_k, \quad \text{or} \quad \nabla S(\beta) = -2X^T(y - X\beta).$$
Setting this to zero gives the normal equations:
$$\nabla S(\hat{\beta}) = 0 \iff (X^T X)\hat{\beta} = X^T y$$
If $p + 1 \le n$ and $X \in \mathbb{R}^{n \times (p+1)}$ is full-rank, i.e., $\mathrm{rank}(X) = p + 1$, then $X^T X$ is invertible and we get
$$\hat{\beta} = (X^T X)^{-1} X^T y.$$
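Solving the normal equations numerically is a one-liner, and can be cross-checked against a generic least-squares solver. A sketch on simulated data (parameter values are illustrative):

```python
import numpy as np

# LS estimate via the normal equations (X^T X) beta_hat = X^T y,
# compared against numpy's least-squares solver.
rng = np.random.default_rng(1)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + 0.1 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))          # True
```

In practice one solves the linear system (or uses `lstsq`, which is based on a factorization) rather than explicitly forming $(X^T X)^{-1}$, for numerical stability.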
Remark 1. We have shown that the maximum likelihood estimate of $\beta$ under the Gaussian assumption is the same as the least-squares (LS) estimate. The LS estimate makes sense in general, even when we make no distributional assumption.
Geometric interpretation of LS (Section 4.2.1)

The least-squares problem is equivalent to
$$\min_{\beta \in \mathbb{R}^{p+1}} \|y - X\beta\|^2 \iff \min_{\mu \in \mathrm{Im}(X)} \|y - \mu\|^2$$
since, we recall, $\mathrm{Im}(X) = \{X\beta : \beta \in \mathbb{R}^{p+1}\}$. That is, we are trying to find the projection $\hat{\mu}$ of $y$ onto $\mathrm{Im}(X)$:
$$\hat{\mu} = \mathrm{argmin}_{\mu \in \mathrm{Im}(X)} \|y - \mu\|^2$$
By the orthogonality principle, the residual $e = y - \hat{\mu} \perp \mathrm{Im}(X)$, i.e.,
$$y - \hat{\mu} \in [\mathrm{Im}(X)]^\perp = \ker(X^T) \iff X^T(y - \hat{\mu}) = 0$$
$\hat{\mu} \in \mathrm{Im}(X)$ means there is at least one $\hat{\beta} \in \mathbb{R}^{p+1}$ such that $\hat{\mu} = X\hat{\beta}$, and then
$$X^T(y - X\hat{\beta}) = 0,$$
which is the same as the normal equations for $\hat{\beta}$.
[Figure: $y$ and the plane $\mathrm{Im}(X)$, showing $\mu$, the projection $\hat{\mu}$, the noise $\varepsilon$, and the residual $e$.]

Remark 2. The projection $\hat{\mu}$ is unique in general, but $\hat{\beta}$ need not be. If $X$ is full (column) rank, then $\hat{\beta}$ is unique.
Consequences of the geometric interpretation

- The residual vector: $e = y - \hat{\mu} = y - X\hat{\beta} \in \mathbb{R}^n$.
- The vector of fitted values: $\hat{\mu} = X\hat{\beta}$. Sometimes $\hat{\mu}$ is also referred to as $\hat{y}$.

The following hold (recall that $x_j$ is the $j$th column of $X$):
- $e \perp \mathrm{Im}(X)$. Since $\mathbf{1}, x_j \in \mathrm{Im}(X)$, we have
$$e \perp \mathbf{1} \iff \sum_{i=1}^n e_i = 0, \qquad e \perp x_j \iff \sum_{i=1}^n e_i x_{ij} = 0, \quad j = 1, \dots, p.$$
Residuals are orthogonal to all the covariate vectors.
- Since $\hat{\mu} \in \mathrm{Im}(X)$, we have $e \perp \hat{\mu}$, i.e., $\sum_i e_i \hat{\mu}_i = 0$.
- $S(\hat{\beta}) = \|e\|^2 = \min_\beta S(\beta)$.
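These orthogonality consequences can all be verified on simulated data. A sketch (simulated values are illustrative):

```python
import numpy as np

# Check: residuals are orthogonal to the all-ones column, to every covariate
# column, and to the fitted values.
rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1., 2.]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
mu_hat = X @ beta_hat
e = y - mu_hat

print(np.isclose(e.sum(), 0))      # True: e orthogonal to 1
print(np.allclose(X.T @ e, 0))     # True: e orthogonal to every column x_j
print(np.isclose(e @ mu_hat, 0))   # True: e orthogonal to mu_hat
```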
Estimation of $\sigma^2$

Back to the likelihood; substituting $\hat{\beta}$ for $\beta$,
$$L(\hat{\beta}, \sigma^2 \mid y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} S(\hat{\beta})\right)$$
which we want to maximize over $\sigma^2$. This is equivalent to maximizing the log-likelihood
$$\log L(\hat{\beta}, \sigma^2 \mid y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{S(\hat{\beta})}{2\sigma^2}$$
over $\sigma^2$, or maximizing
$$\ell(\hat{\beta}, v \mid y) := \log L(\hat{\beta}, v \mid y) = \text{const.} - \frac{1}{2}\left[n \log v + v^{-1} S(\hat{\beta})\right]$$
over $v$ (change of variable $v = \sigma^2$). The problem reduces to
$$\hat{v} := \mathrm{argmax}_{v > 0}\; \ell(\hat{\beta}, v \mid y) = \mathrm{argmin}_{v > 0}\left[n \log v + v^{-1} S(\hat{\beta})\right].$$
Setting the derivative to zero gives $\hat{\sigma}^2 = \hat{v} = S(\hat{\beta})/n$. (Check that this is indeed the maximizer.)
For reasons that will become clear later, we often use the following modified estimator:
$$s^2 = \frac{S(\hat{\beta})}{n - (p + 1)}.$$
Compare with the MLE for $\sigma^2$:
$$\hat{\sigma}^2 = \frac{S(\hat{\beta})}{n}.$$
We will see that $s^2$ is unbiased for $\sigma^2$, while $\hat{\sigma}^2$ is not. Note that $S(\hat{\beta})$ is the sum of squares of residuals:
$$S(\hat{\beta}) = \sum_{i=1}^n \big(y_i - (X\hat{\beta})_i\big)^2 = \sum_{i=1}^n e_i^2 = \|e\|^2.$$
The hat matrix

Recall that $\hat{\beta} = (X^T X)^{-1} X^T y$. We have
$$\hat{\mu} = X\hat{\beta} = \underbrace{X(X^T X)^{-1} X^T}_{H}\, y = Hy.$$
$H$ is called the hat matrix. It is an example of an orthogonal projection matrix. These matrices have a lot of properties.
Properties of projection matrices

Lemma 6. For any vector $y \in \mathbb{R}^n$, $Hy \in \mathbb{R}^n$ is the orthogonal projection of $y$ onto $\mathrm{Im}(X)$.

- $H$ is symmetric: $H^T = H$. (Exercise: use $(BCD)^T = D^T C^T B^T$.)
- $H$ is idempotent: $H^2 = H$. (Exercise.) This is what we expect from a projection matrix: if we project $Hy$ onto $\mathrm{Im}(X)$, we should get back the same thing, $H(Hy) = Hy$ for all $y$.
- A matrix is an orthogonal projection matrix if and only if it is symmetric and idempotent.
- $I - H$ is also a projection matrix (symmetric and idempotent): for every $y \in \mathbb{R}^n$, $(I - H)y$ is the projection of $y$ onto $[\mathrm{Im}(X)]^\perp$. Note that $(I - H)y = e$.

To summarize, we can decompose every $y \in \mathbb{R}^n$ as
$$y = \hat{\mu} + e = \underbrace{Hy}_{\in\,\mathrm{Im}(X)} + \underbrace{(I - H)y}_{\in\,[\mathrm{Im}(X)]^\perp}.$$
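The listed properties are easy to confirm numerically. A sketch with a small random design matrix (dimensions are illustrative):

```python
import numpy as np

# Check the hat matrix properties: symmetric, idempotent, and H X = X
# (projecting the columns of X onto Im(X) leaves them unchanged).
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(10), rng.normal(size=(10, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))                    # True: symmetric
print(np.allclose(H @ H, H))                  # True: idempotent
print(np.allclose(H @ X, X))                  # True: H fixes Im(X)
print(np.allclose((np.eye(10) - H) @ X, 0))   # True: I - H annihilates Im(X)
```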
Sidenote: Gram matrix

Write $X = (x_0 \; x_1 \; x_2 \; \cdots \; x_p) \in \mathbb{R}^{n \times (p+1)}$, where $x_j$ is the $j$th column of $X$. Verify the following useful result:
$$(X^T X)_{ij} = x_i^T x_j = \langle x_i, x_j \rangle, \quad \text{for } i, j = 0, 1, \dots, p.$$
Thus, the entries of $X^T X$ are the pairwise inner products of the columns of $X$. For example, $(X^T X)_{ij} = 0$ means $x_i$ and $x_j$ are orthogonal. $X^T X$ is called the Gram matrix of $\{x_0, x_1, \dots, x_p\}$.

Exercise: Show that $X^T X$ is always PSD.
Example 11 (Simple linear regression). This is the case $p = 1$, and the model is
$$y = \beta_0 \mathbf{1} + \beta_1 x_1 + \varepsilon.$$
We have $X = [\mathbf{1} \; x_1] \in \mathbb{R}^{n \times 2}$. Assumption (X1) is satisfied if $n \ge 2$ and $x_1$ is not a constant multiple of $\mathbf{1}$. We have
$$\hat{\beta} = \left(\tfrac{1}{n} X^T X\right)^{-1} \tfrac{1}{n} X^T y = \begin{pmatrix} 1 & \bar{x} \\ \bar{x} & \overline{x^2} \end{pmatrix}^{-1} \begin{pmatrix} \bar{y} \\ \overline{xy} \end{pmatrix} = \frac{1}{\overline{x^2} - (\bar{x})^2} \begin{pmatrix} \overline{x^2} & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix} \begin{pmatrix} \bar{y} \\ \overline{xy} \end{pmatrix}.$$
From this it follows that
$$\hat{\beta}_1 = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - (\bar{x})^2} = \frac{\rho_{xy}}{\rho_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where...
Example (Simple linear regression, cont'd). From this it follows that
$$\hat{\beta}_1 = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - (\bar{x})^2} = \frac{\rho_{xy}}{\rho_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},$$
where
$$\rho_{xy} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \overline{xy} - \bar{x}\bar{y}, \qquad \rho_{xx} = \frac{1}{n}\sum_i (x_i - \bar{x})^2 = \overline{x^2} - (\bar{x})^2.$$
The formula for $\hat{\beta}_0$ is easier to see if one writes down the normal equations as $\left(\tfrac{1}{n} X^T X\right)\hat{\beta} = \tfrac{1}{n} X^T y$ and solves for $\hat{\beta}_0$ in terms of $\hat{\beta}_1$. Note also that
$$\mathrm{var}(\hat{\beta}_1) = \frac{\sigma^2}{n}\left[\left(\tfrac{1}{n} X^T X\right)^{-1}\right]_{22} = \frac{\sigma^2}{n \rho_{xx}}.$$
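The closed-form simple-regression formulas agree with the general matrix solution. A sketch on simulated data (true coefficients chosen for illustration):

```python
import numpy as np

# Simple linear regression: beta1_hat = rho_xy / rho_xx, beta0_hat = ybar - beta1_hat * xbar,
# checked against the general normal-equations solution.
rng = np.random.default_rng(4)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.3 * rng.normal(size=n)

rho_xy = np.mean(x * y) - x.mean() * y.mean()   # sample covariance
rho_xx = np.mean(x ** 2) - x.mean() ** 2        # sample variance of x
beta1 = rho_xy / rho_xx
beta0 = y.mean() - beta1 * x.mean()

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # general LS solution
print(np.allclose([beta0, beta1], beta_hat))    # True
```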
Sampling distribution

Both $\hat{\beta}$ and $\hat{\sigma}^2$ are random quantities (due to the randomness in $\varepsilon$). The distribution of an estimate, called its sampling distribution, allows us to quantify the uncertainty in the estimate. Basic properties like the mean, the variance, and the covariances can be determined under our basic assumption (E1). To determine the full sampling distribution, we need to assume some distribution for the noise vector $\varepsilon$, e.g. (E2). Recall that $E[y] = \mu = X\beta$ and $\mathrm{cov}(y) = \sigma^2 I$.
Properties of $\hat{\beta}$

Let us write $A = (X^T X)^{-1} X^T \in \mathbb{R}^{(p+1) \times n}$, so that $\hat{\beta} = Ay$. Note that $AX = I_{p+1}$.

Proposition 3. Under the linear model $y = X\beta + \varepsilon$,
1. $\hat{\beta}$ is an unbiased estimate of $\beta$, i.e., $E[\hat{\beta}] = \beta$,
2. with covariance matrix $\mathrm{cov}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$.

Exercise: Prove this. For the covariance, note $AA^T = (X^T X)^{-1}$.
Properties of $a^T \hat{\beta}$

Consider $a^T \hat{\beta}$ for some nonrandom $a \in \mathbb{R}^{p+1}$. This is interesting since, e.g., with $a = (0, 1, -1, 0, \dots, 0)$ we have $a^T \hat{\beta} = \hat{\beta}_1 - \hat{\beta}_2$. In general,
$$E[a^T \hat{\beta}] = a^T E[\hat{\beta}] = a^T \beta, \qquad \mathrm{var}(a^T \hat{\beta}) = a^T \mathrm{cov}(\hat{\beta})\,a = \sigma^2 a^T (X^T X)^{-1} a.$$
The first equation shows that $a^T \hat{\beta}$ is an unbiased estimate of $a^T \beta$.
Fitted values

The vector of fitted values is $\hat{\mu} = X\hat{\beta} = Hy$. $\hat{\mu}$ is an unbiased estimate of $\mu$:
$$E[\hat{\mu}] = X E[\hat{\beta}] = X\beta = \mu.$$
Approach 2: $\mu = X\beta$, so that $\mu \in \mathrm{Im}(X)$, hence $H\mu = \mu$ (verify directly!):
$$E[\hat{\mu}] = H E[y] = H\mu = \mu.$$
The covariance matrix is
$$\mathrm{cov}(\hat{\mu}) = \mathrm{cov}(Hy) = H(\sigma^2 I_n)H^T = \sigma^2 HH = \sigma^2 H.$$
Residuals

The residual is $e = y - X\hat{\beta} = y - \hat{\mu} = (I - H)y \in \mathbb{R}^n$. We have
$$E[e] = E[y - \hat{\mu}] = E[y] - E[\hat{\mu}] = \mu - \mu = 0,$$
$$\mathrm{cov}(e) = (I - H)(\sigma^2 I_n)(I - H)^T = \sigma^2 (I - H)^2 = \sigma^2 (I - H).$$
Joint behavior of $(\hat{\beta}, e) \in \mathbb{R}^{(p+1+n) \times 1}$

Recall $A = (X^T X)^{-1} X^T$. Stack $\hat{\beta}$ on top of $e$. We have $\hat{\beta} = Ay$ and $e = (I - H)y$. Since $H = X(X^T X)^{-1} X^T = XA$,
$$\begin{pmatrix} \hat{\beta} \\ e \end{pmatrix} = \underbrace{\begin{pmatrix} A \\ I - H \end{pmatrix}}_{P}\, y, \qquad E\begin{pmatrix} \hat{\beta} \\ e \end{pmatrix} = \begin{pmatrix} E[\hat{\beta}] \\ E[e] \end{pmatrix} = \begin{pmatrix} \beta \\ 0 \end{pmatrix}. \quad (3)$$
The covariance matrix is (recall that $I - H$ is symmetric and idempotent):
$$\mathrm{cov}\begin{pmatrix} \hat{\beta} \\ e \end{pmatrix} = P\,\mathrm{cov}(y)\,P^T = \sigma^2 P P^T = \sigma^2 \begin{pmatrix} A \\ I - H \end{pmatrix} \begin{pmatrix} A^T & (I - H)^T \end{pmatrix} = \sigma^2 \begin{pmatrix} A A^T & A(I - H) \\ (I - H)A^T & I - H \end{pmatrix} = \begin{pmatrix} \sigma^2 (X^T X)^{-1} & 0 \\ 0 & \sigma^2 (I - H) \end{pmatrix}$$
Note that the first diagonal block matches the covariance of $\hat{\beta}$, as expected.
Sidenote

Why is $A(I - H) = 0$?
- Algebraic calculation: $AH = (X^T X)^{-1} X^T [XA] = A$, hence $A(I - H) = A - AH = 0$.
- Geometric interpretation: $A^T = X(X^T X)^{-1}$, hence $\mathrm{Im}(A^T) \subseteq \mathrm{Im}(X)$ (check!). This means that $H$ leaves $A^T$ intact: $HA^T = A^T$.
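The identity $A(I - H) = 0$, which zeroes the off-diagonal blocks of the joint covariance, can be checked directly on a random design matrix (dimensions are illustrative):

```python
import numpy as np

# Numerical check that A (I - H) = 0, the identity that makes
# beta_hat and the residual e uncorrelated.
rng = np.random.default_rng(5)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

A = np.linalg.inv(X.T @ X) @ X.T   # beta_hat = A y
H = X @ A                          # hat matrix
print(np.allclose(A @ (np.eye(n) - H), 0))   # True
```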
Important consequences

Proposition 4. Under the linear regression model $y = X\beta + \varepsilon$ and assumption (E1),
$$\mathrm{cov}\begin{pmatrix} \hat{\beta} \\ e \end{pmatrix} = \begin{pmatrix} \sigma^2 (X^T X)^{-1} & 0 \\ 0 & \sigma^2 (I - H) \end{pmatrix} \quad (4)$$
where $\hat{\beta}$ is the LS estimate of the regression coefficient and $e$ is the residual.
- Under (E1), $\hat{\beta}$ and $e$ are uncorrelated.
- Under (E2), we have:
  - $(\hat{\beta}, e)$ has an MVN distribution (why?) with mean vector $(\beta, 0)$ and covariance matrix (4).
  - $\hat{\beta}$ and $e$ are independent (why?).
  - $S(\hat{\beta}) = \|e\|^2$ is independent of $\hat{\beta}$ (why?). Similarly, $s^2$ is independent of $\hat{\beta}$.
  - $e$ and $\hat{\mu}$ are independent. ($\hat{\mu} = X\hat{\beta}$ is a function of $\hat{\beta}$.)