Lecture 13: Simple Linear Regression in Matrix Format


To move beyond simple regression we need to use matrix algebra. We'll start by re-expressing simple linear regression in matrix form. Linear algebra is a pre-requisite for this class; I strongly urge you to go back to your textbook and notes for review.

1 Expectations and Variances with Vectors and Matrices

If we have $p$ random variables, $Z_1, Z_2, \ldots, Z_p$, we can put them into a random vector $Z = [Z_1\ Z_2\ \cdots\ Z_p]^T$. This random vector can be thought of as a $p \times 1$ matrix of random variables. The expected value of $Z$ is defined to be the vector
$$\mu = \mathbb{E}[Z] = \begin{bmatrix} \mathbb{E}[Z_1] \\ \mathbb{E}[Z_2] \\ \vdots \\ \mathbb{E}[Z_p] \end{bmatrix} \tag{1}$$
If $a$ and $b$ are non-random scalars, then
$$\mathbb{E}[aZ + bW] = a\mathbb{E}[Z] + b\mathbb{E}[W] \tag{2}$$
If $a$ is a non-random vector, then $\mathbb{E}[a^T Z] = a^T \mathbb{E}[Z]$. If $A$ is a non-random matrix, then
$$\mathbb{E}[AZ] = A\mathbb{E}[Z] \tag{3}$$
Every coordinate of a random vector has some covariance with every other coordinate. The variance-covariance matrix of $Z$ is the $p \times p$ matrix which stores these values. In other words,
$$\mathrm{Var}[Z] = \begin{bmatrix} \mathrm{Var}[Z_1] & \mathrm{Cov}[Z_1,Z_2] & \cdots & \mathrm{Cov}[Z_1,Z_p] \\ \mathrm{Cov}[Z_2,Z_1] & \mathrm{Var}[Z_2] & \cdots & \mathrm{Cov}[Z_2,Z_p] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}[Z_p,Z_1] & \mathrm{Cov}[Z_p,Z_2] & \cdots & \mathrm{Var}[Z_p] \end{bmatrix} \tag{4}$$
This inherits properties of ordinary variances and covariances. Just as $\mathrm{Var}[Z] = \mathbb{E}[Z^2] - (\mathbb{E}[Z])^2$ for a scalar, we have
$$\mathrm{Var}[Z] = \mathbb{E}[ZZ^T] - \mathbb{E}[Z](\mathbb{E}[Z])^T \tag{5}$$
For a non-random vector $a$ and a non-random scalar $b$,
$$\mathrm{Var}[a + bZ] = b^2\,\mathrm{Var}[Z] \tag{6}$$
For a non-random matrix $C$,
$$\mathrm{Var}[CZ] = C\,\mathrm{Var}[Z]\,C^T \tag{7}$$
(Check that the dimensions all conform here: if $C$ is $q \times p$, $\mathrm{Var}[CZ]$ should be $q \times q$, and so is the right-hand side.)
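To make rule (7) concrete, here is a small NumPy sketch (the particular choices of $\mu$, $\Sigma$, and $C$ are arbitrary, purely for illustration) that compares a Monte Carlo estimate of $\mathrm{Var}[CZ]$ against $C\,\mathrm{Var}[Z]\,C^T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices of mu, Sigma (the true Var[Z]), and a 2x3 matrix C.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
C = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])

Z = rng.multivariate_normal(mu, Sigma, size=200_000)  # each row is one draw of Z
CZ = Z @ C.T                                          # corresponding draws of CZ

print(np.cov(CZ, rowvar=False))  # Monte Carlo estimate of Var[CZ]
print(C @ Sigma @ C.T)           # C Var[Z] C^T, the right-hand side of (7)
```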

A random vector $Z$ has a multivariate Normal distribution with mean $\mu$ and variance $\Sigma$ if its density is
$$f(z) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(z-\mu)^T \Sigma^{-1} (z-\mu)\right)$$
We write this as $Z \sim N(\mu, \Sigma)$ or $Z \sim MVN(\mu, \Sigma)$.

If $A$ is a square matrix, then the trace of $A$, denoted by $\mathrm{tr}(A)$, is defined to be the sum of the diagonal elements. In other words,
$$\mathrm{tr}(A) = \sum_j A_{jj}$$
Recall that the trace satisfies these properties:
$$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B), \quad \mathrm{tr}(cA) = c\,\mathrm{tr}(A), \quad \mathrm{tr}(A^T) = \mathrm{tr}(A)$$
and we have the cyclic property
$$\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$$
If $C$ is non-random, then $Z^T C Z$ is called a quadratic form. We have that
$$\mathbb{E}[Z^T C Z] = \mathbb{E}[Z]^T C\,\mathbb{E}[Z] + \mathrm{tr}(C\,\mathrm{Var}[Z]) \tag{8}$$
To see this, notice that
$$Z^T C Z = \mathrm{tr}(Z^T C Z) \tag{9}$$
because it's a $1 \times 1$ matrix. But the trace of a matrix product doesn't change when we cyclically permute the matrices, so
$$Z^T C Z = \mathrm{tr}(C Z Z^T) \tag{10}$$
Therefore
$$\mathbb{E}[Z^T C Z] = \mathbb{E}[\mathrm{tr}(C Z Z^T)] \tag{11}$$
$$= \mathrm{tr}(\mathbb{E}[C Z Z^T]) \tag{12}$$
$$= \mathrm{tr}(C\,\mathbb{E}[Z Z^T]) \tag{13}$$
$$= \mathrm{tr}\left(C\left(\mathrm{Var}[Z] + \mathbb{E}[Z]\mathbb{E}[Z]^T\right)\right) \tag{14}$$
$$= \mathrm{tr}(C\,\mathrm{Var}[Z]) + \mathrm{tr}(C\,\mathbb{E}[Z]\mathbb{E}[Z]^T) \tag{15}$$
$$= \mathrm{tr}(C\,\mathrm{Var}[Z]) + \mathrm{tr}(\mathbb{E}[Z]^T C\,\mathbb{E}[Z]) \tag{16}$$
$$= \mathrm{tr}(C\,\mathrm{Var}[Z]) + \mathbb{E}[Z]^T C\,\mathbb{E}[Z] \tag{17}$$
using the fact that $\mathrm{tr}$ is a linear operation so it commutes with taking expectations; the decomposition of $\mathrm{Var}[Z]$; the cyclic permutation trick again; and finally dropping $\mathrm{tr}$ from a scalar.

Unfortunately, there is generally no simple formula for the variance of a quadratic form, unless the random vector is Gaussian. If $Z \sim N(\mu, \Sigma)$ then
$$\mathrm{Var}(Z^T C Z) = 2\,\mathrm{tr}(C\Sigma C\Sigma) + 4\mu^T C\Sigma C\mu$$
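Here is a similar Monte Carlo sketch for the quadratic-form results. The settings are again illustrative, and $C$ is taken to be symmetric so that the Gaussian variance formula applies exactly as written:

```python
import numpy as np

rng = np.random.default_rng(1)

p = 3
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
A = rng.normal(size=(p, p))
C = (A + A.T) / 2   # symmetric, so the Gaussian variance formula applies as written

Z = rng.multivariate_normal(mu, Sigma, size=500_000)
quad = np.einsum('ni,ij,nj->n', Z, C, Z)   # one value of Z^T C Z per draw

# Equation (8): E[Z^T C Z] = E[Z]^T C E[Z] + tr(C Var[Z])
print(quad.mean(), mu @ C @ mu + np.trace(C @ Sigma))

# Gaussian case: Var(Z^T C Z) = 2 tr(C Sigma C Sigma) + 4 mu^T C Sigma C mu
print(quad.var(), 2 * np.trace(C @ Sigma @ C @ Sigma) + 4 * mu @ C @ Sigma @ C @ mu)
```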

2 Least Squares in Matrix Form

Our data consists of $n$ paired observations of the predictor variable $X$ and the response variable $Y$, i.e., $(X_1, Y_1), \ldots, (X_n, Y_n)$. We wish to fit the model
$$Y = \beta_0 + \beta_1 X + \epsilon \tag{18}$$
where $\mathbb{E}[\epsilon \mid X = x] = 0$, $\mathrm{Var}[\epsilon \mid X = x] = \sigma^2$, and $\epsilon$ is uncorrelated across measurements.

2.1 The Basic Matrices

$$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad X = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \tag{19}$$
Note that $X$, which is called the design matrix, is an $n \times 2$ matrix, where the first column is always 1, and the second column contains the actual observations of $X$. Now
$$X\beta = \begin{bmatrix} \beta_0 + \beta_1 X_1 \\ \beta_0 + \beta_1 X_2 \\ \vdots \\ \beta_0 + \beta_1 X_n \end{bmatrix} \tag{20}$$
So we can write the set of equations $Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, $i = 1, \ldots, n$, in the simpler form $Y = X\beta + \epsilon$.

2.2 Mean Squared Error

Let
$$e = e(\beta) = Y - X\beta \tag{21}$$
The training error (or observed mean squared error) is
$$MSE(\beta) = \frac{1}{n}\sum_{i=1}^n e_i^2(\beta) = \frac{1}{n}e^Te \tag{22}$$
Let us expand this a little for further use. We have:
$$MSE(\beta) = \frac{1}{n}e^Te \tag{23}$$
$$= \frac{1}{n}(Y - X\beta)^T(Y - X\beta) \tag{24}$$
$$= \frac{1}{n}(Y^T - \beta^TX^T)(Y - X\beta) \tag{25}$$
$$= \frac{1}{n}\left(Y^TY - Y^TX\beta - \beta^TX^TY + \beta^TX^TX\beta\right) \tag{26}$$
$$= \frac{1}{n}\left(Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta\right) \tag{27}$$
where we used the fact that $\beta^TX^TY = (Y^TX\beta)^T = Y^TX\beta$, since a $1 \times 1$ matrix equals its own transpose.
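As a quick illustration of the matrix notation, the following sketch builds the design matrix for a small simulated data set (the sample size and coefficients are arbitrary) and checks that $\frac{1}{n}e^Te$ agrees with the usual average of squared errors:

```python
import numpy as np

rng = np.random.default_rng(2)

# A small simulated data set; the sample size and coefficients are illustrative.
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

# Design matrix (19): a column of 1s for the intercept, then the observed x's.
X = np.column_stack([np.ones(n), x])   # shape (n, 2)

def mse(beta):
    """MSE(beta) = (1/n) e^T e with e = y - X beta, as in equation (22)."""
    e = y - X @ beta
    return (e @ e) / n

beta_try = np.array([1.0, 1.0])
# The matrix form agrees with the coordinate-wise average of squared errors.
print(mse(beta_try), np.mean((y - (beta_try[0] + beta_try[1] * x)) ** 2))
```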

2.3 Minimizing the MSE

First, we find the gradient of the MSE with respect to $\beta$:
$$\nabla MSE(\beta) = \frac{1}{n}\left(\nabla(Y^TY) - 2\nabla(\beta^TX^TY) + \nabla(\beta^TX^TX\beta)\right) \tag{28}$$
$$= \frac{1}{n}\left(0 - 2X^TY + 2X^TX\beta\right) \tag{29}$$
$$= \frac{2}{n}\left(X^TX\beta - X^TY\right) \tag{30}$$
We now set this to zero at the optimum, $\hat\beta$:
$$X^TX\hat\beta - X^TY = 0 \tag{31}$$
This equation, for the two-dimensional vector $\hat\beta$, corresponds to our pair of normal or estimating equations for $\hat\beta_0$ and $\hat\beta_1$. Thus, it, too, is called an estimating equation. Solving, we get
$$\hat\beta = (X^TX)^{-1}X^TY \tag{32}$$
That is, we've got one matrix equation which gives us both coefficient estimates.

If this is right, the equation we've got above should in fact reproduce the least-squares estimates we've already derived, which are of course
$$\hat\beta_1 = \frac{c_{XY}}{s_X^2}, \quad \hat\beta_0 = \bar{Y} - \hat\beta_1\bar{X} \tag{33}$$
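Numerically, the matrix formula (32) and the scalar formulas (33) should give identical answers; a quick sketch on simulated data (the data-generating settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=n)
X = np.column_stack([np.ones(n), x])

# Solve the estimating equation (31); solve() is numerically preferable to
# explicitly forming (X^T X)^{-1}.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar formulas (33), with c_XY and s_X^2 defined using the 1/n convention.
c_xy = np.mean(x * y) - x.mean() * y.mean()
s2_x = np.mean(x ** 2) - x.mean() ** 2
b1 = c_xy / s2_x
b0 = y.mean() - b1 * x.mean()

print(beta_hat)    # (beta0_hat, beta1_hat) from the matrix formula (32)
print(b0, b1)      # should agree up to floating-point error
```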

Let's check that algebraically. As a first step, let's introduce normalizing factors of $1/n$ into both the matrix products:
$$\hat\beta = (n^{-1}X^TX)^{-1}(n^{-1}X^TY) \tag{34}$$
Now
$$\frac{1}{n}X^TY = \frac{1}{n}\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix}\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \frac{1}{n}\sum_i Y_i \\ \frac{1}{n}\sum_i X_iY_i \end{bmatrix} \tag{35}$$
$$= \begin{bmatrix} \bar{Y} \\ \overline{XY} \end{bmatrix} \tag{36}$$
Similarly,
$$\frac{1}{n}X^TX = \begin{bmatrix} 1 & \frac{1}{n}\sum_i X_i \\ \frac{1}{n}\sum_i X_i & \frac{1}{n}\sum_i X_i^2 \end{bmatrix} = \begin{bmatrix} 1 & \bar{X} \\ \bar{X} & \overline{X^2} \end{bmatrix} \tag{37}$$
Hence,
$$\left(\frac{1}{n}X^TX\right)^{-1} = \frac{1}{\overline{X^2} - \bar{X}^2}\begin{bmatrix} \overline{X^2} & -\bar{X} \\ -\bar{X} & 1 \end{bmatrix} = \frac{1}{s_X^2}\begin{bmatrix} \overline{X^2} & -\bar{X} \\ -\bar{X} & 1 \end{bmatrix}$$
Therefore,
$$(X^TX)^{-1}X^TY = \frac{1}{s_X^2}\begin{bmatrix} \overline{X^2} & -\bar{X} \\ -\bar{X} & 1 \end{bmatrix}\begin{bmatrix} \bar{Y} \\ \overline{XY} \end{bmatrix} \tag{38}$$
$$= \frac{1}{s_X^2}\begin{bmatrix} \overline{X^2}\,\bar{Y} - \bar{X}\,\overline{XY} \\ \overline{XY} - \bar{X}\bar{Y} \end{bmatrix} \tag{39}$$
$$= \frac{1}{s_X^2}\begin{bmatrix} (s_X^2 + \bar{X}^2)\bar{Y} - \bar{X}(c_{XY} + \bar{X}\bar{Y}) \\ c_{XY} \end{bmatrix} \tag{40}$$
$$= \begin{bmatrix} \bar{Y} - \dfrac{c_{XY}}{s_X^2}\bar{X} \\[6pt] \dfrac{c_{XY}}{s_X^2} \end{bmatrix} \tag{41}$$
$$= \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix} \tag{42}$$

3 Fitted Values and Residuals

Remember that when the coefficient vector is $\beta$, the point predictions (fitted values) for each data point are $X\beta$. Thus the vector of fitted values is $\hat{Y} \equiv \hat{m} = X\hat\beta$. Using our equation for $\hat\beta$, we then have
$$\hat{Y} = X\hat\beta = X(X^TX)^{-1}X^TY = HY$$
where
$$H \equiv X(X^TX)^{-1}X^T \tag{43}$$
is called the hat matrix or the influence matrix. Let's look at some of the properties of the hat matrix.

1. Influence. Check that $\partial\hat{Y}_i/\partial Y_j = H_{ij}$. Thus, $H_{ij}$ is the rate at which the $i$th fitted value changes as we vary the $j$th observation, the "influence" that observation has on that fitted value.
2. Symmetry. It's easy to see that $H^T = H$.
3. Idempotency. Check that $H^2 = H$, so the matrix is idempotent.

Geometry. A symmetric, idempotent matrix is a projection matrix. This means that $H$ projects $Y$ into a lower-dimensional subspace. Specifically, $Y$ is a point in $\mathbb{R}^n$, but $\hat{Y} = HY$ is a linear combination of two vectors, namely, the two columns of $X$. In other words: $H$ projects $Y$ onto the column space of $X$. The column space of $X$ is the set of vectors that can be written as linear combinations of the columns of $X$.
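A short sketch checking these properties of $H$ on a small simulated design (the data are arbitrary; the influence property is checked by nudging one observation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
y = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.solve(X.T @ X, X.T)   # H = X (X^T X)^{-1} X^T, equation (43)

print(np.allclose(H, H.T))              # symmetry
print(np.allclose(H @ H, H))            # idempotency
print(np.linalg.matrix_rank(H))         # 2: H projects onto the column space of X

# Influence: d(Y_hat_i)/d(Y_j) = H_ij.  Nudge Y_j and watch the i-th fitted value.
i, j, delta = 0, 5, 1e-6
y_nudged = y.copy()
y_nudged[j] += delta
print(((H @ y_nudged)[i] - (H @ y)[i]) / delta, H[i, j])
```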

3.1 Residuals

The vector of residuals, $e$, is
$$e \equiv Y - \hat{Y} = Y - HY = (I - H)Y \tag{44}$$
Here are some properties of $I - H$:

1. Influence. $\partial e_i/\partial y_j = (I - H)_{ij}$.
2. Symmetry. $(I - H)^T = I - H$.
3. Idempotency. $(I - H)^2 = (I - H)(I - H) = I - H - H + H^2$. But, since $H$ is idempotent, $H^2 = H$, and thus $(I - H)^2 = I - H$.

Thus,
$$MSE(\hat\beta) = \frac{1}{n}Y^T(I - H)^T(I - H)Y = \frac{1}{n}Y^T(I - H)Y \tag{45}$$

3.2 Expectations and Covariances

Remember that $Y = X\beta + \epsilon$, where $\epsilon$ is an $n \times 1$ matrix of random variables, with mean vector 0 and variance-covariance matrix $\sigma^2 I$. What can we deduce from this? First, the expectation of the fitted values:
$$\mathbb{E}[\hat{Y}] = \mathbb{E}[HY] = H\mathbb{E}[Y] = HX\beta + H\mathbb{E}[\epsilon] \tag{46}$$
$$= X(X^TX)^{-1}X^TX\beta + 0 = X\beta \tag{47}$$
Next, the variance-covariance of the fitted values:
$$\mathrm{Var}[\hat{Y}] = \mathrm{Var}[HY] = \mathrm{Var}[H(X\beta + \epsilon)] \tag{48}$$
$$= \mathrm{Var}[H\epsilon] = H\,\mathrm{Var}[\epsilon]\,H^T = \sigma^2 HIH = \sigma^2 H \tag{49}$$
using the symmetry and idempotency of $H$.

Similarly, the expected residual vector is zero:
$$\mathbb{E}[e] = (I - H)(X\beta + \mathbb{E}[\epsilon]) = X\beta - X\beta = 0 \tag{50}$$
The variance-covariance matrix of the residuals:
$$\mathrm{Var}[e] = \mathrm{Var}[(I - H)(X\beta + \epsilon)] \tag{51}$$
$$= \mathrm{Var}[(I - H)\epsilon] \tag{52}$$
$$= (I - H)\,\mathrm{Var}[\epsilon]\,(I - H)^T \tag{53}$$
$$= \sigma^2(I - H)(I - H)^T \tag{54}$$
$$= \sigma^2(I - H) \tag{55}$$
Thus, the variance of each residual is not quite $\sigma^2$, nor are the residuals exactly uncorrelated.

Finally, the expected MSE is
$$\mathbb{E}\left[\frac{1}{n}e^Te\right] = \frac{1}{n}\mathbb{E}[\epsilon^T(I - H)\epsilon] \tag{56}$$
We know that this must be $(n-2)\sigma^2/n$.
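A sketch confirming this numerically: for a simple-regression design, $\mathrm{tr}(I - H) = n - 2$, which together with equation (8) is what produces the $(n-2)\sigma^2/n$, and the average MSE over many simulated data sets comes out close to that value (the simulation settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, sigma = 40, 2.0
beta = np.array([3.0, 2.0])
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.solve(X.T @ X, X.T)
I = np.eye(n)

# tr(I - H) = n - 2 here, which (via equation (8)) gives E[MSE] = (n-2) sigma^2 / n.
print(np.trace(I - H), n - 2)

# Monte Carlo check of E[MSE] over repeated data sets from the same design.
mses = []
for _ in range(5_000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    e = (I - H) @ y                 # residuals, equation (44)
    mses.append(e @ e / n)
print(np.mean(mses), (n - 2) * sigma ** 2 / n)
```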

4 Sampling Distribution of Estimators

Let's now assume that $\epsilon_i \sim N(0, \sigma^2)$, and are independent of each other and of $X$. The vector of all $n$ noise terms, $\epsilon$, is an $n \times 1$ matrix. Its distribution is a multivariate Gaussian or multivariate Normal with mean vector 0 and variance-covariance matrix $\sigma^2 I$. We write this as $\epsilon \sim N(0, \sigma^2 I)$.

We may use this to get the sampling distribution of the estimator $\hat\beta$:
$$\hat\beta = (X^TX)^{-1}X^TY \tag{57}$$
$$= (X^TX)^{-1}X^T(X\beta + \epsilon) \tag{58}$$
$$= (X^TX)^{-1}X^TX\beta + (X^TX)^{-1}X^T\epsilon \tag{59}$$
$$= \beta + (X^TX)^{-1}X^T\epsilon \tag{60}$$
Since $\epsilon$ is Gaussian and is being multiplied by a non-random matrix, $(X^TX)^{-1}X^T\epsilon$ is also Gaussian. Its mean vector is
$$\mathbb{E}[(X^TX)^{-1}X^T\epsilon] = (X^TX)^{-1}X^T\mathbb{E}[\epsilon] = 0 \tag{61}$$
while its variance matrix is
$$\mathrm{Var}[(X^TX)^{-1}X^T\epsilon] = (X^TX)^{-1}X^T\,\mathrm{Var}[\epsilon]\left((X^TX)^{-1}X^T\right)^T \tag{62}$$
$$= (X^TX)^{-1}X^T\sigma^2IX(X^TX)^{-1} \tag{63}$$
$$= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1} \tag{64}$$
$$= \sigma^2(X^TX)^{-1} \tag{65}$$
Since $\mathrm{Var}[\hat\beta] = \mathrm{Var}[(X^TX)^{-1}X^T\epsilon]$, we conclude that
$$\hat\beta \sim N(\beta, \sigma^2(X^TX)^{-1}) \tag{66}$$
Re-writing slightly,
$$\hat\beta \sim N\left(\beta, \frac{\sigma^2}{n}(n^{-1}X^TX)^{-1}\right) \tag{67}$$
will make it easier to prove to yourself that, according to this, $\hat\beta_0$ and $\hat\beta_1$ are both unbiased, that $\mathrm{Var}[\hat\beta_1] = \frac{\sigma^2}{n s_X^2}$, and that $\mathrm{Var}[\hat\beta_0] = \frac{\sigma^2}{n}\left(1 + \frac{\bar{X}^2}{s_X^2}\right)$. This will also give us $\mathrm{Cov}[\hat\beta_0, \hat\beta_1]$, which otherwise would be tedious to calculate.

I will leave you to show, in a similar way, that the fitted values $HY$ are multivariate Gaussian, as are the residuals $e$, and to find both their mean vectors and their variance matrices.
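Here is a Monte Carlo sketch of this sampling distribution, holding the design fixed and re-drawing the noise many times (the particular $\beta$, $\sigma$, and design are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 40, 2.0
beta = np.array([3.0, 2.0])
x = rng.uniform(0, 10, size=n)      # the design is held fixed across replicates
X = np.column_stack([np.ones(n), x])

# Re-draw the noise many times and re-estimate beta each time.
reps = 10_000
betas = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    betas[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(betas.mean(axis=0))                    # approximately beta (unbiasedness)
print(np.cov(betas, rowvar=False))           # approximately sigma^2 (X^T X)^{-1}
print(sigma ** 2 * np.linalg.inv(X.T @ X))   # the theoretical covariance from (66)
```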

5 Derivatives with Respect to Vectors

This is a brief review of basic vector calculus. Consider some scalar function of a vector, say $f(X)$, where $X$ is represented as a $p \times 1$ matrix. (Here $X$ is just being used as a place-holder or generic variable; it's not necessarily the design matrix of a regression.) We would like to think about the derivatives of $f$ with respect to $X$. We can write $f(X) = f(x_1, \ldots, x_p)$, where $X = (x_1, \ldots, x_p)^T$.

The gradient of $f$ is the vector of partial derivatives:
$$\nabla f \equiv \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \dfrac{\partial f}{\partial x_2} \\ \vdots \\ \dfrac{\partial f}{\partial x_p} \end{bmatrix} \tag{68}$$
The first-order Taylor series of $f$ around $X_0$ is
$$f(X) \approx f(X_0) + \sum_i (X - X_0)_i \left.\frac{\partial f}{\partial x_i}\right|_{X_0} \tag{69}$$
$$= f(X_0) + (X - X_0)^T \nabla f(X_0) \tag{70}$$
Here are some properties of the gradient:

1. Linearity.
$$\nabla\left(af(X) + bg(X)\right) = a\nabla f(X) + b\nabla g(X) \tag{71}$$
Proof: Directly from the linearity of partial derivatives.
2. Linear forms. If $f(X) = X^Ta$, with $a$ not a function of $X$, then
$$\nabla(X^Ta) = a \tag{72}$$
Proof: $f(X) = \sum_i x_ia_i$, so $\partial f/\partial x_i = a_i$. Notice that $a$ was already a $p \times 1$ matrix, so we don't have to transpose anything to get the derivative.
3. Linear forms the other way. If $f(X) = bX$, with $b$ not a function of $X$, then
$$\nabla(bX) = b^T \tag{73}$$
Proof: Once again, $\partial f/\partial x_i = b_i$, but now remember that $b$ was a $1 \times p$ matrix, and $\nabla f$ is $p \times 1$, so we need to transpose.

4. Quadratic forms. Let $C$ be a $p \times p$ matrix which is not a function of $X$, and consider the quadratic form $X^TCX$. (You can check that this is scalar.) The gradient is
$$\nabla(X^TCX) = (C + C^T)X \tag{74}$$
Proof: First, write out the matrix multiplications as explicit sums:
$$X^TCX = \sum_j \sum_k x_j c_{jk} x_k \tag{75}$$
Now take the derivative with respect to $x_i$:
$$\frac{\partial f}{\partial x_i} = \sum_j \sum_k \frac{\partial\, x_j c_{jk} x_k}{\partial x_i} \tag{76}$$
If $j = k = i$, the term in the inner sum is $2c_{ii}x_i$. If $j = i$ but $k \neq i$, the term in the inner sum is $c_{ik}x_k$. If $j \neq i$ but $k = i$, we get $x_jc_{ji}$. Finally, if $j \neq i$ and $k \neq i$, we get zero. The $j = i$ terms add up to $(CX)_i$. The $k = i$ terms add up to $(C^TX)_i$. (This splits the $2c_{ii}x_i$ evenly between them.) Thus,
$$\frac{\partial f}{\partial x_i} = \left((C + C^T)X\right)_i \tag{77}$$
and
$$\nabla f = (C + C^T)X \tag{78}$$
(Check that this has the right dimensions.)
5. Symmetric quadratic forms. If $C = C^T$, then
$$\nabla X^TCX = 2CX \tag{79}$$

5.1 Second Derivatives

The $p \times p$ matrix of second partial derivatives is called the Hessian. I won't step through its properties, except to note that they, too, follow from the basic rules for partial derivatives.

5.2 Maxima and Minima

We need all the partial derivatives to be equal to zero at a minimum or maximum. This means that the gradient must be zero there. At a minimum, the Hessian must be positive-definite (so that moves away from the minimum always increase the function); at a maximum, the Hessian must be negative-definite (so moves away always decrease the function). If the Hessian is neither positive nor negative definite, the point is neither a minimum nor a maximum, but a "saddle" (since moving in some directions increases the function but moving in others decreases it, as though one were at the center of a horse's saddle).
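A finite-difference check of the quadratic-form rule (74) is easy to write, and the same sketch ties Section 5.2 back to Section 2.3: differentiating the gradient in (30) once more gives the Hessian of the MSE, $\frac{2}{n}X^TX$ (this last step is not worked out in the notes above, but it follows directly from (30)); its eigenvalues are positive whenever the columns of the design matrix are linearly independent, so $\hat\beta$ is indeed a minimum. The numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)

# Finite-difference check of rule 4: the gradient of X^T C X is (C + C^T) X.
p = 4
C = rng.normal(size=(p, p))     # a generic, non-symmetric p x p matrix
x0 = rng.normal(size=p)
h = 1e-6
num_grad = np.array([((x0 + h*e) @ C @ (x0 + h*e) - (x0 - h*e) @ C @ (x0 - h*e)) / (2*h)
                     for e in np.eye(p)])
print(num_grad)
print((C + C.T) @ x0)           # equation (74); (79) is the special case C = C^T

# Hessian of the MSE from Section 2.3: differentiating (30) once more gives (2/n) X^T X.
n = 50
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
hessian = (2.0 / n) * X.T @ X
print(np.linalg.eigvalsh(hessian))   # both eigenvalues > 0: positive definite, a minimum
```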