STAT 100C: Linear models
Arash A. Amini
April 27, 2018

Linear Algebra Review
Read 3.1 and 3.2 from the text.
1. Fundamental subspaces (rank-nullity, etc.): [Im(X)]^⊥ = ker(X^T) ⊆ R^n. Im(X) = C(X) = image = column space = range; ker(X) = N(X) = kernel = null space.
2. Orthogonal decomposition of a space w.r.t. a subspace V: R^n = V ⊕ V^⊥.
3. Spectral decomposition of a symmetric matrix A: A = U Λ U^T, with U orthogonal and Λ diagonal.

Linear independence
A set of vectors {x_1, ..., x_n} is linearly dependent if a nontrivial linear combination of them is zero: that is, there exist c_1, ..., c_n, not all of which are zero, such that
∑_{i=1}^n c_i x_i = 0.
Otherwise the set is linearly independent.
Example 1
Which of the two is a linearly independent set?
{ (1, 1, 0)^T, (2, 2, 0)^T }   or   { (2, 2, 0)^T, (1, 1, 0)^T, (1, 1, 0)^T, (1, 1, 1)^T }

Exercise: Show that if a set of nonzero vectors is pairwise orthogonal, then it is linearly independent.
Exercise: Show that if a set of vectors contains the zero vector, then it is linearly dependent.

Span
The span of a set of vectors is the set of all their linear combinations:
span{x_1, ..., x_n} = { ∑_{i=1}^n c_i x_i : c_1, ..., c_n ∈ R }
Example 2
span{ (1, 1, 0)^T, (2, 2, 0)^T } = { t_1 (1, 1, 0)^T + t_2 (2, 2, 0)^T : t_1, t_2 ∈ R }
  = { (t_1 + 2 t_2) (1, 1, 0)^T : t_1, t_2 ∈ R }
  = { t (1, 1, 0)^T : t ∈ R }

Example 3
span{ (1, 1, 0)^T, (1, −1, 0)^T } = { t_1 (1, 1, 0)^T + t_2 (1, −1, 0)^T : t_1, t_2 ∈ R }
  = { (t_1 + t_2, t_1 − t_2, 0)^T : t_1, t_2 ∈ R }
  = { (α, β, 0)^T : α, β ∈ R }
  = { α (1, 0, 0)^T + β (0, 1, 0)^T : α, β ∈ R }
  = span{ (1, 0, 0)^T, (0, 1, 0)^T }

Basis and dimension
A set of linearly independent vectors that span a subspace V is called a basis of the subspace. A subspace V can have different bases:
V := { (α, β, 0)^T : α, β ∈ R } = span{ (1, 1, 0)^T, (1, −1, 0)^T } = span{ (1, 0, 0)^T, (0, 1, 0)^T }
All the bases of a given subspace V have the same number of elements. This common number is the dimension of V, denoted dim(V). In the example, dim(V) = 2. Dimension formalizes the notion of degrees of freedom.

Image or column space
The column space of a matrix X is the span of the columns of X. Let x_1, ..., x_p ∈ R^n (n-dimensional vectors). Form a matrix with these columns:
X = ( x_1 x_2 ··· x_p ) ∈ R^{n×p}
Recall that for β = (β_1, ..., β_p) ∈ R^p:
X β = ( x_1 x_2 ··· x_p ) (β_1, β_2, ..., β_p)^T = ∑_{j=1}^p β_j x_j
We have
col(X) = span{x_1, ..., x_p} = { ∑_{j=1}^p β_j x_j : β_1, ..., β_p ∈ R } = { X β : β ∈ R^p }

Image or column space
The column space is also called the range or image:
col(X) = ran(X) = Im(X) = { X β : β ∈ R^p }
Im(X) is a linear subspace of R^n. The dimension of the image is called the rank of the matrix:
rank(X) := dim(Im(X))
rank(X) is the number of linearly independent columns of X.

Kernel or null space
The kernel of X ∈ R^{n×p} is the set of all vectors that are mapped to zero by X:
ker(X) = { β ∈ R^p : X β = 0 }
Note that ker(X) ⊆ R^p.

Example 4
Consider
X = [[1, 2], [1, 2], [0, 0]] ∈ R^{3×2}.
What is Im(X)?
Im(X) = span{ (1, 1, 0)^T, (2, 2, 0)^T } = span{ (1, 1, 0)^T }.
rank(X) = dim(Im(X)) = 1.
What is ker(X)?
ker(X) = { β = (β_1, β_2) : β_1 (1, 1, 0)^T + β_2 (2, 2, 0)^T = 0 } = { β = (β_1, β_2) : β_1 + 2 β_2 = 0 }
null(X) = dim(ker(X)) = 1.
rank(X) + null(X) = 2.
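A quick numerical check of Example 4 (a NumPy sketch, not part of the original slides): the rank comes from numpy.linalg.matrix_rank, and a basis for ker(X) is read off from the right singular vectors whose singular values are (numerically) zero.

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [0.0, 0.0]])

    r = np.linalg.matrix_rank(X)          # rank(X) = dim(Im(X)) = 1

    # Right singular vectors with ~zero singular value span ker(X).
    _, s, Vt = np.linalg.svd(X)
    kernel_basis = Vt[s < 1e-10]          # one row, proportional to (2, -1)/sqrt(5)

    print(r)                              # 1
    print(r + kernel_basis.shape[0])      # rank + nullity = 2 = number of columns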

Inner product, orthogonality
For two vectors x, y ∈ R^n, the (Euclidean) inner product is
⟨x, y⟩ := ∑_{i=1}^n x_i y_i = x^T y.
x is orthogonal to y, denoted x ⊥ y, if ⟨x, y⟩ = 0.
Let V ⊆ R^n be a (linear) subspace. We say x is orthogonal to V, denoted x ⊥ V, whenever x ⊥ v for all v ∈ V.

Orthogonal complement
The set of all vectors x that are orthogonal to V is called the orthogonal complement of V:
V^⊥ = { x ∈ R^n : x ⊥ V } = { x ∈ R^n : ⟨x, y⟩ = 0 for all y ∈ V }

Example 5
Consider
X = [[1, 1], [1, −1], [0, 0]] ∈ R^{3×2}.
Let V = Im(X). What is V^⊥? Let x_1, x_2 be the two columns of X. Then z ∈ V^⊥ iff z ⊥ x_1 and z ⊥ x_2, which is equivalent to z_1 = z_2 = 0 while z_3 can be anything:
V^⊥ = { (0, 0, α)^T : α ∈ R } = span{ (0, 0, 1)^T }.

Norm and distance
The (Euclidean) norm of a vector x ∈ R^n, also called the ℓ2 norm, is
‖x‖ = √⟨x, x⟩ = √(x^T x) = √( ∑_{i=1}^n x_i² ),
where x = (x_1, ..., x_n). For two vectors x, y ∈ R^n, their (Euclidean) distance is
‖x − y‖ = √( ∑_{i=1}^n (x_i − y_i)² ).
Exercise: Show that ‖x − y‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩.

Projection
Let V ⊆ R^n be a subspace and x ∈ R^n a vector. Then there is a unique closest element of V to x:
x̂ = argmin_{z ∈ V} ‖x − z‖.
x̂ is called the projection of x onto V. Consider the error e = x − x̂. The projection x̂ is the only vector in V such that e ⊥ V, in other words, e ∈ V^⊥. Thus,
Proposition 1
Every vector x ∈ R^n can be uniquely represented as x = x̂ + e, where x̂ ∈ V and e ∈ V^⊥.
Mnemonically, we write R^n = V ⊕ V^⊥. Thus, we have an orthogonal decomposition of the space w.r.t. a subspace V and its orthogonal complement V^⊥.
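As an illustration (a NumPy sketch, not from the slides), the projection of x onto V = Im(X) can be computed by solving a least-squares problem; the error e = x − x̂ then satisfies X^T e = 0, i.e., e ⊥ V. The matrix X and vector x below are arbitrary examples.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))           # V = Im(X), a 2-dimensional subspace of R^5
    x = rng.normal(size=5)

    # x_hat = argmin_{z in Im(X)} ||x - z||: solve min_b ||x - X b|| and set x_hat = X b
    b, *_ = np.linalg.lstsq(X, x, rcond=None)
    x_hat = X @ b
    e = x - x_hat

    print(np.allclose(X.T @ e, 0))        # True: the error is orthogonal to V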

Proof (Optional)
e ∈ V^⊥ implies x̂ is the projection: Assume x̂ ∈ V is such that e := x − x̂ ∈ V^⊥. For any z ∈ V, we have (substituting x = x̂ + e)
‖x − z‖² = ‖x̂ − z‖² + 2⟨e, x̂ − z⟩ + ‖e‖².   (1)
Since x̂ − z ∈ V (why?), the cross term vanishes. Thus ‖x − z‖² = ‖x̂ − z‖² + ‖e‖² for all z ∈ V, showing that x̂ is the projection (why?).
Conversely, x̂ being the projection implies e ∈ V^⊥: Since x̂ minimizes z ↦ ‖x − z‖², from (1),
‖x̂ − z‖² + 2⟨e, x̂ − z⟩ = ‖x − z‖² − ‖e‖² ≥ 0, for all z ∈ V.
By the change of variable u = x̂ − z,
‖u‖² + 2⟨e, u⟩ ≥ 0, for all u ∈ V.
Changing u to tu and letting t → 0 gives ⟨e, u⟩ ≥ 0 for all u ∈ V. Applying this to −u as well (V is a subspace) gives ⟨e, u⟩ = 0 for all u ∈ V, i.e., e ∈ V^⊥.

Other facts
If V ⊆ R^n is a linear subspace:
(a) (V^⊥)^⊥ = V.
(b) dim(R^n) = dim(V) + dim(V^⊥).
(b') dim(V^⊥) = n − dim(V).
(c) If X ∈ R^{m×n}, then [Im(X)]^⊥ = ker(X^T).
(d) rank(X) + nullity(X^T) = m.
Notes: (b) follows from R^n = V ⊕ V^⊥. We will see the proof of (c) later. (d) follows by taking dimensions of both sides of (c) and using (b). Recall that nullity(A) = dim(ker(A)), i.e., the dimension of the null space.

Spectral decomposition of symmetric matrices
Let A ∈ R^{n×n} be a symmetric matrix: A = A^T. Then we have the eigenvalue decomposition (EVD) of the matrix:
A = U Λ U^T
where
- U is an orthogonal matrix: U U^T = U^T U = I (i.e., U^{-1} = U^T),
- Λ = diag(λ_1, ..., λ_n), where {λ_1, ..., λ_n} are the eigenvalues of A.
The columns of U, denoted {u_1, ..., u_n}, are eigenvectors of A: A u_i = λ_i u_i.
{u_1, ..., u_n} is an orthonormal basis for R^n: ⟨u_i, u_j⟩ = 0 for i ≠ j and ⟨u_i, u_j⟩ = 1 for i = j.
There is a corresponding decomposition for a general (rectangular) matrix, called the singular value decomposition (SVD).

Example:
A_1 = [[1, 2], [2, 1]] = U Λ U^T with U = (1/√2) [[1, 1], [1, −1]] and Λ = diag(3, −1).
Example:
A_2 = [[1, 1], [1, 1]] = U Λ U^T with U = (1/√2) [[1, 1], [1, −1]] and Λ = diag(2, 0).
What is rank(A_1) and rank(A_2)?
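These two examples can be checked numerically (a NumPy sketch, not from the slides; numpy.linalg.eigh returns eigenvalues in ascending order, and eigenvectors are determined only up to sign).

    import numpy as np

    A1 = np.array([[1.0, 2.0], [2.0, 1.0]])
    A2 = np.array([[1.0, 1.0], [1.0, 1.0]])

    for A in (A1, A2):
        lam, U = np.linalg.eigh(A)                       # A = U diag(lam) U^T
        print(lam)                                       # A1: [-1, 3];  A2: [0, 2]
        print(np.allclose(U @ np.diag(lam) @ U.T, A))    # True
        print(np.linalg.matrix_rank(A))                  # A1: 2,  A2: 1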

Positive semi-definite (PSD) matrices
A symmetric matrix A ∈ R^{n×n} is PSD if
⟨x, Ax⟩ = x^T A x ≥ 0, for all x ∈ R^n.
It is positive definite (PD) if x^T A x > 0 for all x ≠ 0.
Let λ_1(A), λ_2(A), ..., λ_n(A) be the eigenvalues of A.
- A is PSD if and only if λ_i(A) ≥ 0 for all i = 1, ..., n.
- A is PD if and only if λ_i(A) > 0 for all i = 1, ..., n.
Every PSD matrix A has a symmetric square root A^{1/2}, defined as the unique symmetric PSD matrix such that A^{1/2} A^{1/2} = A. If A = U Λ U^T is the EVD of A, then it is easy to show that A^{1/2} = U Λ^{1/2} U^T.
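A minimal sketch of the symmetric square root via the EVD described above (NumPy assumed; the matrix Sigma is an arbitrary PSD example, not from the slides).

    import numpy as np

    def sqrtm_psd(A):
        """Symmetric PSD square root of a symmetric PSD matrix via its EVD."""
        lam, U = np.linalg.eigh(A)
        lam = np.clip(lam, 0.0, None)        # guard against tiny negative round-off
        return U @ np.diag(np.sqrt(lam)) @ U.T

    Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])   # a PSD (in fact PD) example
    S = sqrtm_psd(Sigma)
    print(np.allclose(S @ S, Sigma))             # True: A^{1/2} A^{1/2} = A
    print(np.allclose(S, S.T))                   # True: the square root is symmetric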

Expectation
Assume that y = (y_1, ..., y_n) is a random vector. We define
E(y) := ( E(y_1), ..., E(y_n) ), or compactly [E(y)]_i = E(y_i).
E(y) is a nonrandom vector in R^n.
Important consequence of the linearity of expectation:
Lemma 1
If A ∈ R^{m×n} is nonrandom and y ∈ R^n is random, then E(Ay) = A E(y).
Proof:
[E(Ay)]_i = E[(Ay)_i] = E[ ∑_{j=1}^n A_ij y_j ] = ∑_{j=1}^n A_ij E(y_j) = [A E(y)]_i.

Similarly, we define the expectation of a random matrix elementwise: if A ∈ R^{m×n} is a random matrix with entries a_ij, then E(A) is the nonrandom matrix in R^{m×n} with entries
[E(A)]_ij := E(a_ij).
Example 6
Let X_1, X_2 ~ N(0, 1) with cov(X_1, X_2) = ρ. Consider
A = [[X_1², X_1 X_2], [X_1 X_2, X_2²]].
Then
E(A) = [[E(X_1²), E(X_1 X_2)], [E(X_1 X_2), E(X_2²)]] = [[1, ρ], [ρ, 1]].

Extension of Lemma 1
The following is a very useful extension of Lemma 1:
Lemma 2
Consider matrices A, B, C such that A is random and B and C are nonrandom. Assume that the dimensions allow us to form the matrix product BAC. Then
E(BAC) = B E(A) C.   (2)
Important: we keep the order of matrix multiplication in (2). (Matrix multiplication is noncommutative.)

Example 7
Let x = (α, β)^T and A as before,
A = [[X_1², X_1 X_2], [X_1 X_2, X_2²]].
Then
E(x^T A x) = x^T E(A) x = (α β) [[1, ρ], [ρ, 1]] (α, β)^T = α² + β² + 2ραβ.
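A Monte Carlo sanity check of Example 7 (a sketch, not part of the slides; the values of ρ, α, β and the seed are arbitrary): draw many samples of (X_1, X_2) with correlation ρ and compare the empirical mean of x^T A x with α² + β² + 2ραβ.

    import numpy as np

    rng = np.random.default_rng(1)
    rho, alpha, beta = 0.3, 1.0, 2.0
    n_sim = 200_000

    # (X1, X2) ~ N(0, [[1, rho], [rho, 1]])
    XX = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n_sim)
    X1, X2 = XX[:, 0], XX[:, 1]

    # x^T A x with A = [[X1^2, X1 X2], [X1 X2, X2^2]] simplifies to (alpha*X1 + beta*X2)^2
    vals = (alpha * X1 + beta * X2) ** 2

    print(vals.mean())                                    # approximately 6.2
    print(alpha**2 + beta**2 + 2 * rho * alpha * beta)    # exact value: 6.2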

Covariance matrix
Consider a random vector y = (y_1, ..., y_n) ∈ R^n.
Definition 1
The covariance matrix of y, denoted cov(y), is the n×n matrix with entries
[cov(y)]_ij = cov(y_i, y_j) = E[ (y_i − E y_i)(y_j − E y_j) ].
Note: E y_i = E(y_i); we drop the parentheses for simplicity. cov(y_i, y_j) is the usual covariance between y_i and y_j. The diagonal entries of the covariance matrix are
[cov(y)]_ii = cov(y_i, y_i) = var(y_i).
Recall the alternative formula: cov(y_i, y_j) = E(y_i y_j) − E(y_i) E(y_j).

Some properties of the covariance matrix Σ := cov(y):
- Σ is symmetric (Σ = Σ^T).
- If (y_1, ..., y_n) are pairwise uncorrelated, then Σ is diagonal.
- Letting µ = E(y) ∈ R^n, we have Σ = E[(y − µ)(y − µ)^T]. We also have Σ = E(y y^T) − µ µ^T.
- Let ỹ := y − E(y) be the centered version of y. Then cov(y) = E(ỹ ỹ^T).
- Let α ∈ R and b ∈ R^n; then cov(αy + b) = α² cov(y).

Lemma 3
Let A ∈ R^{m×n} and let y ∈ R^n be a random vector. Then
cov(Ay) = A cov(y) A^T.
Proof: Let u := Ay and ũ := u − E u. Also, let ỹ := y − E y. Since E(u) = A E(y), we have ũ = A ỹ, hence
cov(u) = E(ũ ũ^T) = E(A ỹ ỹ^T A^T) = A E(ỹ ỹ^T) A^T,
where the last equality is by Lemma 2. (Recall that (AB)^T = B^T A^T.)

Example 8
X_1, X_2 ~ N(0, 1) with cov(X_1, X_2) = ρ. Let X = (X_1, X_2) ∈ R². (Think of it as a column vector.) What is the covariance matrix of X?
cov(X) = [[var(X_1), cov(X_1, X_2)], [cov(X_2, X_1), var(X_2)]] = [[1, ρ], [ρ, 1]].
Let u := α X_1 + β X_2 for α, β ∈ R. Then u = (α β) (X_1, X_2)^T, and by Lemma 3,
cov(u) = (α β) cov(X) (α, β)^T = α² + β² + 2ραβ.
Note that cov(u) = var(u) is a scalar in this case.

Example 9
X_1, X_2 ~ N(0, 1) with cov(X_1, X_2) = ρ, as in the previous example. Let Z_1 = X_1, Z_2 = X_2 and Z_3 = X_1 − X_2. Then
Z = (Z_1, Z_2, Z_3)^T,   E(Z) = (0, 0, 0)^T,   cov(Z) = ?
Approach 1 (recall that covariance is bilinear):
cov(Z_1, Z_3) = cov(X_1, X_1 − X_2) = cov(X_1, X_1) − cov(X_1, X_2) = var(X_1) − cov(X_1, X_2) = 1 − ρ.
cov(Z_2, Z_3) = cov(X_2, X_1 − X_2) = ρ − 1.
Using the previous example: var(Z_3) = 1 + 1 − 2ρ = 2(1 − ρ).
Continued on the next slide...

Conclude that
cov(Z) = [[1, ρ, 1 − ρ], [ρ, 1, ρ − 1], [1 − ρ, ρ − 1, 2(1 − ρ)]].
Approach 2 (matrix approach): Z = A X with
A = [[1, 0], [0, 1], [1, −1]].
Then E(Z) = A E(X) = 0 and
cov(Z) = A cov(X) A^T = [[1, 0], [0, 1], [1, −1]] [[1, ρ], [ρ, 1]] [[1, 0, 1], [0, 1, −1]],
which gives the desired result.
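The matrix approach of Example 9 is easy to verify numerically (a NumPy sketch with an arbitrary value of ρ, not from the slides).

    import numpy as np

    rho = 0.4
    Sigma_X = np.array([[1.0, rho], [rho, 1.0]])          # cov(X) from Example 8
    A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]])   # Z = A X, with Z3 = X1 - X2

    Sigma_Z = A @ Sigma_X @ A.T                           # cov(Z) = A cov(X) A^T (Lemma 3)

    expected = np.array([[1, rho, 1 - rho],
                         [rho, 1, rho - 1],
                         [1 - rho, rho - 1, 2 * (1 - rho)]])
    print(np.allclose(Sigma_Z, expected))                 # True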

Proposition 2
A covariance matrix is always positive semi-definite (PSD).
Proof: Fix a ∈ R^n and let y ∈ R^n be random. Then
var(a^T y) = cov(a^T y) = a^T cov(y) a.
Since var(a^T y) ≥ 0, we conclude a^T cov(y) a ≥ 0 for all a ∈ R^n.
Pathological case: If for some a ≠ 0 we have a^T cov(y) a = 0, then var(a^T y) = 0, hence a^T y = constant with probability 1; that is, the distribution of y lies on a lower-dimensional subspace. Note that in this case cov(y) is a singular matrix.

Decorrelation or whitening
For any random vector y (with invertible covariance matrix), it is possible to find a linear transform A such that z := Ay has identity covariance matrix: cov(z) = I_n = diag(1, 1, ..., 1); that is, the components of z = (z_1, ..., z_n) are uncorrelated and have unit variance.
How to get A? Let Σ = cov(y) and take A := Σ^{-1/2}. Hence z = Σ^{-1/2} y, and
cov(z) = Σ^{-1/2} cov(y) Σ^{-1/2} = Σ^{-1/2} Σ Σ^{-1/2} = (Σ^{-1/2} Σ^{1/2}) (Σ^{1/2} Σ^{-1/2}) = I_n.
Exercise: Show that we can also take A = Λ^{-1/2} U^T, where Σ = U Λ U^T is the EVD of Σ.
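A short whitening sketch (NumPy assumed; the covariance matrix and seed are arbitrary examples, not from the slides): form Σ^{-1/2} from the EVD and check that the transformed sample has approximately identity covariance.

    import numpy as np

    rng = np.random.default_rng(2)
    Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])     # cov(y), assumed invertible

    # Sigma^{-1/2} = U diag(lam^{-1/2}) U^T from the EVD Sigma = U diag(lam) U^T
    lam, U = np.linalg.eigh(Sigma)
    Sigma_inv_half = U @ np.diag(lam ** -0.5) @ U.T

    y = rng.multivariate_normal([0.0, 0.0], Sigma, size=100_000)
    z = y @ Sigma_inv_half.T                       # z_i = Sigma^{-1/2} y_i for each row

    print(np.cov(z, rowvar=False).round(2))        # approximately the 2x2 identity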

Multivariate normal distribution
Definition 2
A random vector y = (y_1, ..., y_n) has a multivariate normal distribution (MVN) with mean vector µ = (µ_1, ..., µ_n) and covariance matrix Σ ∈ R^{n×n} if it has the density
f(y) = (2π)^{-n/2} |Σ|^{-1/2} exp( −(1/2) (y − µ)^T Σ^{-1} (y − µ) ),   y ∈ R^n.
We write y ~ N(µ, Σ).
We implicitly assume that Σ is invertible. The MVN has numerous interesting properties. We write y ~ N_n(µ, Σ) to emphasize the dimension (i.e., y ∈ R^n).

Important properties of the MVN
1. Any affine transformation of y ~ N(µ, Σ) again has an MVN distribution.
Lemma 4
Assume that y ~ N_n(µ, Σ) and let u := Ay + b, where A ∈ R^{p×n} and b ∈ R^p are nonrandom. Then u ~ N(µ̃, Σ̃), where
µ̃ = Aµ + b,   Σ̃ = A Σ A^T.
Special case: taking u = a^T y for nonrandom a ∈ R^n, we obtain a^T y ~ N(a^T µ, a^T Σ a).

2. Marginal distributions of y ~ N(µ, Σ) are again MVN. This is a consequence of Lemma 4. Suppose we partition y ∈ R^n into y_1 ∈ R^p and y_2 ∈ R^{n−p}:
( I_{p×p}  0_{p×(n−p)} ) (y_1, y_2)^T = I y_1 + 0 y_2 = y_1,
so y_1 = A y with A = ( I 0 ), hence y_1 ~ N(Aµ, A Σ A^T), where
Aµ = ( I 0 ) (µ_1, µ_2)^T = µ_1,
A Σ A^T = ( I 0 ) [[Σ_11, Σ_12], [Σ_21, Σ_22]] ( I 0 )^T = Σ_11.
So y_1 ~ N_p(µ_1, Σ_11). Similarly, y_2 ~ N_{n−p}(µ_2, Σ_22).

3. Conditional distributions of y ~ N(µ, Σ) are again MVN.
Lemma 5
Assume that y ~ N(µ, Σ) and partition y into two pieces y_1 and y_2. Then
y_1 | y_2 ~ N( µ_{1|2}(y_2), Σ_{1|2} ),
where
µ_{1|2}(y_2) = µ_1 + Σ_12 Σ_22^{-1} (y_2 − µ_2),
Σ_{1|2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_12^T.
Note that if Σ_12 = 0, then y_1 | y_2 ~ N(µ_1, Σ_11); that is, y_1 and y_2 are independent.
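A small numeric illustration of Lemma 5 (a sketch, not from the slides; the mean, covariance, and observed value are arbitrary): for a 3-dimensional MVN partitioned into y_1 = (first two coordinates) and y_2 = (third coordinate), the conditional mean and covariance follow directly from the block formulas.

    import numpy as np

    mu = np.array([1.0, 2.0, 3.0])
    Sigma = np.array([[2.0, 0.5, 0.3],
                      [0.5, 1.0, 0.2],
                      [0.3, 0.2, 1.5]])

    # Partition: y1 = (y_1, y_2), y2 = y_3
    mu1, mu2 = mu[:2], mu[2:]
    S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
    S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

    y2_obs = np.array([2.5])                      # an observed value of y_3

    mu_cond = mu1 + S12 @ np.linalg.solve(S22, y2_obs - mu2)
    Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)

    print(mu_cond)       # conditional mean mu_{1|2}(y2)
    print(Sigma_cond)    # conditional covariance Sigma_{1|2}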

4. Uncorrelatedness is equivalent to independence: If
(y_1, y_2)^T ~ N( (0, 0)^T, [[Σ_11, Σ_12], [Σ_21, Σ_22]] ),
then y_1 and y_2 are independent if and only if Σ_12 = 0.
Proof: follows from Lemma 5, as mentioned.
Independence always implies uncorrelatedness, but not necessarily vice versa. For the MVN, however, the reverse implication holds as well.

Example 10
Let y ~ N(0, Σ). Since Σ is PSD, it has a square root Σ^{1/2}. Let z = Σ^{-1/2} y. What is the distribution of z? We claim that z ~ N(0, I_n):
E(z) = Σ^{-1/2} E(y) = 0, and cov(z) = Σ^{-1/2} cov(y) Σ^{-1/2} = Σ^{-1/2} Σ Σ^{-1/2} = I_n.
Recall that z = Σ^{-1/2} y is the whitened version of y. We can reduce many problems about y to problems about z. The advantage is
z ~ N(0, I_n)  ⟺  z_1, ..., z_n iid ~ N(0, 1),
i.e., z has independent, identically distributed (iid) N(0, 1) coordinates.

Linear model
A multiple linear regression (MLR) model, or simply a linear model, is:
Population version: y = β_0 + β_1 x_1 + ··· + β_p x_p + ε, where µ := β_0 + β_1 x_1 + ··· + β_p x_p.
- y is the response, or dependent variable (observed).
- x_j, j = 1, ..., p are the independent, explanatory, or regressor variables, also called predictors or covariates (observed).
- β_j are fixed unknown parameters (unobserved).
- x_j are fixed, i.e., deterministic for now. Alternatively, we work conditioned on {x_j}.
- ε is random: the noise, random error, or unexplained variation. It is the only source of randomness in the model.
Assumption: E[ε] = 0, so that E[y] = µ + E[ε] = µ.
The goal is to estimate the β_j.

Examples
Give an example:
- Explaining 100C grades.
- Gas consumption data.
- Oral contraceptives.

Sampled version
Our data or observation is a collection of n i.i.d. samples from this model: we observe {y_i, x_i1, ..., x_ip}, i = 1, ..., n, with
y_i = β_0 + β_1 x_i1 + ··· + β_p x_ip + ε_i = β_0 + ∑_{j=1}^p β_j x_ij + ε_i.
Matrix-vector form: Let
x_j = (x_1j, ..., x_nj) ∈ R^n,
x_0 := 1 := (1, ..., 1) ∈ R^n,
ε = (ε_1, ..., ε_n),
y = (y_1, ..., y_n) ∈ R^n.
Then
y = β_0 1 + β_1 x_1 + ··· + β_p x_p + ε = ∑_{j=0}^p β_j x_j + ε,
where everything is now a vector except the β_j.

Let β = (β_0, β_1, ..., β_p) ∈ R^{p+1}, and let X = (x_0 x_1 ··· x_p) ∈ R^{n×(p+1)}. Since X β = ∑_{j=0}^p β_j x_j, we have
y = X β + ε, with µ = X β.
This is called the multiple linear regression (MLR) model.
Assumptions on the noise:
(E1) E[ε] = 0, cov(ε) = σ² I_n, and the {ε_i} are independent.
(E2) Distributional assumption: ε ~ N(0, σ² I_n).
(E1) implies that E[y] = µ = X β and cov(y) = σ² I_n (why?). (E2) implies y ~ N(X β, σ² I_n).
Assumption on the design matrix:
(X1) X ∈ R^{n×(p+1)} is fixed (nonrandom) and has full column rank, that is, p + 1 ≤ n and rank(X) = p + 1.

Maximum likelihood estimation (MLE)
We are interested in estimating both β and σ² (the noise variance). The MLE requires a distributional assumption; here we assume (E2).
The PDF of y ~ N(µ, σ² I) is
f(y) = (2π)^{-n/2} |σ² I_n|^{-1/2} exp( −(1/2) (y − µ)^T (σ² I_n)^{-1} (y − µ) )
     = (2π)^{-n/2} (σ²)^{-n/2} exp( −‖y − µ‖² / (2σ²) ).
Viewed as a function of β and σ², this is the likelihood
L(β, σ² | y) = (2πσ²)^{-n/2} exp( −‖y − X β‖² / (2σ²) ).
Let us estimate β first: maximizing the likelihood is equivalent to minimizing
S(β) := ‖y − X β‖² = ∑_{i=1}^n [y_i − (X β)_i]².
The problem min_β S(β) is called the least-squares (LS) problem.

Use the chain rule to compute the gradient of S(β) and set it to zero:
∂S(β)/∂β_k = −2 ∑_{i=1}^n [∂(X β)_i / ∂β_k] [y_i − (X β)_i],   ∂(X β)_i / ∂β_k = X_ik.
Hence ∂S(β)/∂β_k = −2 [X^T (y − X β)]_k, or ∇S(β) = −2 X^T (y − X β).
Setting this to zero gives the normal equations:
∇S(β̂) = 0  ⟺  (X^T X) β̂ = X^T y.
If p + 1 ≤ n and X ∈ R^{n×(p+1)} is full rank, i.e., rank(X) = p + 1, then X^T X is invertible and we get
β̂ = (X^T X)^{-1} X^T y.
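A sketch of the least-squares computation in NumPy (simulated data with arbitrary coefficients, not from the course): solving the normal equations (X^T X) β̂ = X^T y agrees with the numerically preferred np.linalg.lstsq.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 50, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # includes intercept column
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=0.3, size=n)

    # Normal equations: (X^T X) beta_hat = X^T y
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    # Same estimate via least squares (more stable numerically)
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(beta_hat, beta_lstsq))                     # True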

Remark 1
We have shown that the maximum likelihood estimate of β under the Gaussian assumption is the same as the least-squares (LS) estimate. The LS estimate makes sense in general, even if we make no distributional assumption.

Geometric interpretation of LS (Section 4.2.1)
The least-squares problem is equivalent to
min_{β ∈ R^{p+1}} ‖y − X β‖²  ⟺  min_{µ ∈ Im(X)} ‖y − µ‖²,
since, we recall, Im(X) = {X β : β ∈ R^{p+1}}. That is, we are trying to find the projection µ̂ of y onto Im(X):
µ̂ = argmin_{µ ∈ Im(X)} ‖y − µ‖².
By the orthogonality principle, the residual e = y − µ̂ is orthogonal to Im(X), i.e.,
y − µ̂ ∈ [Im(X)]^⊥ = ker(X^T)  ⟺  X^T (y − µ̂) = 0.
µ̂ ∈ Im(X) means there is at least one β̂ ∈ R^{p+1} such that µ̂ = X β̂, and then X^T (y − X β̂) = 0, which is the same normal equations for β̂.

[Figure: the response y, the true mean µ = X β in the plane Im(X) with noise ε = y − µ, and the projection µ̂ of y onto Im(X) with residual e = y − µ̂.]
Remark 2
The projection µ̂ is unique in general, but β̂ need not be. If X has full (column) rank, then β̂ is unique.

Consequences of the geometric interpretation
- The residual vector: e = y − µ̂ = y − X β̂ ∈ R^n.
- The vector of fitted values: µ̂ = X β̂. Sometimes µ̂ is also referred to as ŷ.
The following hold (recall that x_j is the jth column of X):
- e ⊥ Im(X). Since 1, x_j ∈ Im(X), we have e ⊥ 1 and e ⊥ x_j for j = 1, ..., p, i.e.,
  ∑_{i=1}^n e_i = 0 and ∑_{i=1}^n e_i x_ij = 0 for j = 1, ..., p.
  Residuals are orthogonal to all the covariate vectors.
- Since µ̂ ∈ Im(X), we have e ⊥ µ̂, i.e., ∑_i e_i µ̂_i = 0.
- S(β̂) = ‖e‖² = min_β S(β).

Estimation of σ²
Back to the likelihood: substituting β̂ for β,
L(β̂, σ² | y) = (2πσ²)^{-n/2} exp( −S(β̂) / (2σ²) ),
which we want to maximize over σ². This is equivalent to maximizing the log-likelihood
log L(β̂, σ² | y) = −(n/2) log(2πσ²) − S(β̂)/(2σ²)
over σ², or maximizing
ℓ(β̂, v | y) := log L(β̂, v | y) = const. − (1/2) [ n log v + v^{-1} S(β̂) ]
over v (change of variable v = σ²). The problem reduces to
v̂ := argmax_{v>0} ℓ(β̂, v | y) = argmin_{v>0} [ n log v + v^{-1} S(β̂) ].
Setting the derivative to zero gives σ̂² = v̂ = S(β̂)/n. (Check that this is indeed the maximizer.)

For reasons that will become clear later, we often use the following modified estimator:
s² = S(β̂) / (n − (p + 1)).
Compare with the MLE for σ²:
σ̂² = S(β̂) / n.
We will see that s² is unbiased for σ², while σ̂² is not.
Note that S(β̂) is the sum of squared residuals:
S(β̂) = ∑_{i=1}^n (y_i − (X β̂)_i)² = ∑_{i=1}^n e_i² = ‖e‖².

The hat matrix
Recall that β̂ = (X^T X)^{-1} X^T y. We have
µ̂ = X β̂ = X (X^T X)^{-1} X^T y = H y, where H := X (X^T X)^{-1} X^T.
H is called the hat matrix. It is an example of an orthogonal projection matrix. These matrices have a lot of properties.

Properties of projection matrices
These matrices have a lot of properties:
Lemma 6
For any vector y ∈ R^n, Hy ∈ R^n is the orthogonal projection of y onto Im(X).
- H is symmetric: H^T = H. (Exercise: use (BCD)^T = D^T C^T B^T.)
- H is idempotent: H² = H. (Exercise.) This is what we expect from a projection matrix: if we project Hy onto Im(X), we should get back the same thing, H(Hy) = Hy for all y.
- A matrix is an orthogonal projection matrix if and only if it is symmetric and idempotent.
- I − H is also a projection matrix (symmetric and idempotent): for every y ∈ R^n, (I − H)y is the projection of y onto [Im(X)]^⊥. Note that (I − H)y = e.
To summarize, we can decompose every y ∈ R^n as
y = µ̂ + e = Hy + (I − H)y, with Hy ∈ Im(X) and (I − H)y ∈ [Im(X)]^⊥.
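The properties in Lemma 6 are easy to check numerically (a NumPy sketch; the design matrix and response here are simulated, not from the course).

    import numpy as np

    rng = np.random.default_rng(4)
    n = 20
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # full column rank
    y = rng.normal(size=n)

    H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix H = X (X^T X)^{-1} X^T

    print(np.allclose(H, H.T))                 # symmetric
    print(np.allclose(H @ H, H))               # idempotent
    mu_hat, e = H @ y, (np.eye(n) - H) @ y
    print(np.allclose(X.T @ e, 0))             # residual orthogonal to Im(X)
    print(np.allclose(mu_hat + e, y))          # y = H y + (I - H) y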

Sidenote: Gram matrix
Write X = (x_0 x_1 x_2 ··· x_p) ∈ R^{n×(p+1)}; here x_j is the jth column of X. Verify the following useful result:
(X^T X)_ij = x_i^T x_j = ⟨x_i, x_j⟩, for i, j = 0, 1, ..., p.
Thus, the entries of X^T X are the pairwise inner products of the columns of X. For example, (X^T X)_ij = 0 means x_i and x_j are orthogonal. X^T X is called the Gram matrix of {x_0, x_1, ..., x_p}.
Exercise: Show that X^T X is always PSD.

Example 11 (Simple linear regression)
This is the case p = 1, and the model is y = β_0 1 + β_1 x_1 + ε. We have X = (1 x_1) ∈ R^{n×2}. Assumption (X1) is satisfied if n ≥ 2 and x_1 is not a constant multiple of 1. We have
β̂ = ( (1/n) X^T X )^{-1} (1/n) X^T y
  = [[1, x̄], [x̄, (1/n)∑_i x_i²]]^{-1} ( ȳ, (1/n)∑_i x_i y_i )^T
  = 1/((1/n)∑_i x_i² − x̄²) [[(1/n)∑_i x_i², −x̄], [−x̄, 1]] ( ȳ, (1/n)∑_i x_i y_i )^T,
from which it follows that
β̂_1 = ( (1/n)∑_i x_i y_i − x̄ ȳ ) / ( (1/n)∑_i x_i² − x̄² ) = ρ_xy / ρ_xx,   β̂_0 = ȳ − β̂_1 x̄,
where ...

Example (Simple linear regression, cont'd)
From this it follows that
β̂_1 = ( (1/n)∑_i x_i y_i − x̄ ȳ ) / ( (1/n)∑_i x_i² − x̄² ) = ρ_xy / ρ_xx,   β̂_0 = ȳ − β̂_1 x̄,
where
ρ_xy := (1/n) ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) = (1/n)∑_i x_i y_i − x̄ ȳ,
ρ_xx := (1/n) ∑_i (x_i − x̄)² = (1/n)∑_i x_i² − x̄².
The formula for β̂_0 is easier to see if one writes the normal equations as ( (1/n) X^T X ) β̂ = (1/n) X^T y and solves for β̂_0 in terms of β̂_1.
Note also that
var(β̂_1) = (σ²/n) [ ( (1/n) X^T X )^{-1} ]_22 = σ² / (n ρ_xx).
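The closed-form simple-regression formulas above can be checked against a generic least-squares solve (a NumPy sketch with simulated data and arbitrary true coefficients, not from the slides).

    import numpy as np

    rng = np.random.default_rng(5)
    n = 100
    x = rng.normal(size=n)
    y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

    # Closed-form estimates
    rho_xy = np.mean(x * y) - x.mean() * y.mean()
    rho_xx = np.mean(x ** 2) - x.mean() ** 2
    beta1_hat = rho_xy / rho_xx
    beta0_hat = y.mean() - beta1_hat * x.mean()

    # Generic least squares with X = [1  x]
    X = np.column_stack([np.ones(n), x])
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose([beta0_hat, beta1_hat], beta_lstsq))   # True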

Sampling distribution
Both β̂ and σ̂² are random quantities (due to the randomness in ε). The distribution of an estimate, called its sampling distribution, allows us to quantify the uncertainty in the estimate.
Basic properties like the mean, the variance, and covariances can be determined under our basic assumption (E1). To determine the full sampling distribution, we need to assume some distribution for the noise vector ε, e.g. (E2).
Recall that E[y] = µ = X β and cov(y) = σ² I.

Properties of β̂
Let us write A = (X^T X)^{-1} X^T ∈ R^{(p+1)×n}, so that β̂ = A y. Note that A X = I_{p+1}.
Proposition 3
Under the linear model y = X β + ε,
1. β̂ is an unbiased estimate of β, i.e., E[β̂] = β,
2. with covariance matrix cov(β̂) = σ² (X^T X)^{-1}.
Exercise: Prove this. For the covariance, note that A A^T = (X^T X)^{-1}.
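A Monte Carlo sketch of Proposition 3 (not from the slides; design, coefficients, noise level, and seed are arbitrary): with a fixed design X, repeated draws of ε give an empirical mean of β̂ close to β and an empirical covariance close to σ² (X^T X)^{-1}.

    import numpy as np

    rng = np.random.default_rng(6)
    n, sigma = 30, 0.5
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # fixed design
    beta = np.array([1.0, -2.0, 0.5])
    A = np.linalg.solve(X.T @ X, X.T)                            # beta_hat = A y

    n_sim = 20_000
    eps = sigma * rng.normal(size=(n_sim, n))
    Y = eps + X @ beta                    # each row is one simulated response vector
    betas = Y @ A.T                       # each row is the corresponding beta_hat

    print(betas.mean(axis=0))                       # approx beta (unbiasedness)
    print(np.cov(betas, rowvar=False))              # approx sigma^2 (X^T X)^{-1}
    print(sigma**2 * np.linalg.inv(X.T @ X))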

Properties of a^T β̂: Consider a^T β̂ for some nonrandom a ∈ R^{p+1}. This is interesting since, e.g., with a = (0, 1, −1, 0, ..., 0) we have a^T β̂ = β̂_1 − β̂_2.
In general,
E[a^T β̂] = a^T E[β̂] = a^T β,   var(a^T β̂) = a^T cov(β̂) a = σ² a^T (X^T X)^{-1} a.
The first equation shows that a^T β̂ is an unbiased estimate of a^T β.

Fitted values
The vector of fitted values is µ̂ = X β̂ = H y.
µ̂ is an unbiased estimate of µ: E[µ̂] = X E[β̂] = X β = µ.
Approach 2: µ = X β, so that µ ∈ Im(X), hence Hµ = µ (verify directly!):
E[µ̂] = H E[y] = H µ = µ.
The covariance matrix is
cov(µ̂) = cov(H y) = H (σ² I_n) H^T = σ² H H = σ² H.

Residuals
The residual is e = y − X β̂ = y − µ̂ = (I − H) y ∈ R^n. We have
E[e] = E[y − µ̂] = E[y] − E[µ̂] = µ − µ = 0,
cov(e) = (I − H)(σ² I_n)(I − H)^T = σ² (I − H)² = σ² (I − H).

Joint behavior of (β̂, e) ∈ R^{(p+1)+n}
Recall A = (X^T X)^{-1} X^T. Stack β̂ on top of e: we have β̂ = A y and e = (I − H) y. Since H = X (X^T X)^{-1} X^T = X A,
[β̂; e] = [A; I − H] y =: P y,
where P is formed by stacking A on top of I − H, and
E[β̂; e] = [E[β̂]; E[e]] = [β; 0].   (3)

The covariance matrix is (recall that I − H is symmetric and idempotent):
cov([β̂; e]) = P cov(y) P^T = σ² P P^T
 = σ² [A; I − H] [A^T  (I − H)^T]
 = σ² [[A A^T, A (I − H)], [(I − H) A^T, I − H]]
 = [[σ² (X^T X)^{-1}, 0], [0, σ² (I − H)]].
Note that the first diagonal block matches the covariance of β̂, as expected.

Sidenote
Why is A (I − H) = 0?
Algebraic calculation: A H = (X^T X)^{-1} X^T [X A] = A, hence A (I − H) = A − A H = 0.
Geometric interpretation: A^T = X (X^T X)^{-1}, hence Im(A^T) ⊆ Im(X) (check!). This means that H leaves A^T intact: H A^T = A^T.

Important consequences
Proposition 4
Under the linear regression model y = X β + ε and assumption (E1),
cov([β̂; e]) = [[σ² (X^T X)^{-1}, 0], [0, σ² (I − H)]],   (4)
where β̂ is the LS estimate of the regression coefficients and e is the residual.
- Under (E1), β̂ and e are uncorrelated.
- Under (E2), we have:
  - (β̂, e) has an MVN distribution (why?) with mean vector (β, 0) and covariance matrix (4).
  - β̂ and e are independent (why?).
  - S(β̂) = ‖e‖² is independent of β̂ (why?). Similarly, s² is independent of β̂.
  - e and µ̂ are independent (µ̂ = X β̂ is a function of β̂).
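A simulation sketch of Proposition 4 (not from the slides; the design, coefficients, and seed are arbitrary): under (E2) the empirical cross-covariance between β̂ and the residual vector e is approximately zero.

    import numpy as np

    rng = np.random.default_rng(7)
    n, sigma = 15, 1.0
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])
    beta = np.array([0.5, 2.0])
    A = np.linalg.solve(X.T @ X, X.T)          # beta_hat = A y
    H = X @ A                                  # hat matrix

    n_sim = 50_000
    Y = X @ beta + sigma * rng.normal(size=(n_sim, n))
    B = Y @ A.T                                # rows: beta_hat for each simulated y
    E = Y @ (np.eye(n) - H).T                  # rows: residual vector e for each y

    # Cross-covariance between beta_hat (2 coords) and e (n coords): approx 0
    cross = np.cov(np.hstack([B, E]), rowvar=False)[:2, 2:]
    print(np.abs(cross).max())                 # small (Monte Carlo error only)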