Lecture 3: Simple Linear Regression in Matrix Format

To move beyond simple regression we need to use matrix algebra. We'll start by re-expressing simple linear regression in matrix form. Linear algebra is a pre-requisite for this class; I strongly urge you to go back to your textbook and notes for review.

1 Expectations and Variances with Vectors and Matrices

If we have p random variables, Z_1, Z_2, ..., Z_p, we can put them into a random vector Z = [Z_1 Z_2 ... Z_p]^T. This random vector can be thought of as a p×1 matrix of random variables. The expected value of Z is defined to be the vector

µ = E[Z] = [ E[Z_1]  E[Z_2]  ...  E[Z_p] ]^T    (1)

If a and b are non-random scalars, then

E[aZ + bW] = a E[Z] + b E[W]    (2)

If a is a non-random vector, then

E[a^T Z] = a^T E[Z]

If A is a non-random matrix, then

E[AZ] = A E[Z]    (3)

Every coordinate of a random vector has some covariance with every other coordinate. The variance-covariance matrix of Z is the p×p matrix which stores these values. In other words,

Var[Z] = [ Var[Z_1]       Cov[Z_1, Z_2]  ...  Cov[Z_1, Z_p]
           Cov[Z_2, Z_1]  Var[Z_2]       ...  Cov[Z_2, Z_p]
           ...            ...            ...  ...
           Cov[Z_p, Z_1]  Cov[Z_p, Z_2]  ...  Var[Z_p]      ]    (4)

This inherits properties of ordinary variances and covariances. Just as Var[Z] = E[Z²] − (E[Z])² for a scalar Z, we have

Var[Z] = E[ZZ^T] − E[Z] (E[Z])^T    (5)

For a non-random vector a and a non-random scalar b,

Var[a + bZ] = b² Var[Z]    (6)

For a non-random matrix C,

Var[CZ] = C Var[Z] C^T    (7)

(Check that the dimensions all conform here: if C is q×p, Var[CZ] should be q×q, and so is the right-hand side.)
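These rules are easy to check numerically. Below is a small sketch of my own (not part of the notes; the particular µ, Σ, and C are made up) verifying E[CZ] = C E[Z] and Var[CZ] = C Var[Z] C^T by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example: p = 3 correlated coordinates, q = 2 linear combinations.
mu = np.array([1.0, -1.0, 0.5])            # E[Z]
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])        # Var[Z]
C = np.array([[1.0, 2.0, -1.0],
              [0.5, 0.0,  1.0]])           # non-random 2 x 3 matrix

# Draw many samples of Z and transform each one to CZ.
Z = rng.multivariate_normal(mu, Sigma, size=500_000)
CZ = Z @ C.T

# E[CZ] = C E[Z]  and  Var[CZ] = C Var[Z] C^T, up to Monte Carlo error.
assert np.allclose(CZ.mean(axis=0), C @ mu, atol=0.05)
assert np.allclose(np.cov(CZ, rowvar=False), C @ Sigma @ C.T, atol=0.2)
```

The tolerances are loose because the sample mean and sample covariance only converge at rate 1/√n.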
A random vector Z has a multivariate Normal distribution with mean µ and variance Σ if its density is

f(z) = (2π)^{-p/2} |Σ|^{-1/2} exp( −(1/2) (z − µ)^T Σ^{-1} (z − µ) )

We write this as Z ~ N(µ, Σ) or Z ~ MVN(µ, Σ).

If A is a square matrix, then the trace of A, denoted by tr(A), is defined to be the sum of the diagonal elements. In other words,

tr[A] = Σ_j A_jj

Recall that the trace satisfies these properties:

tr(A + B) = tr(A) + tr(B),   tr(cA) = c tr(A),   tr(A^T) = tr(A)

and we have the cyclic property

tr(ABC) = tr(BCA) = tr(CAB)

If C is non-random, then Z^T C Z is called a quadratic form. We have that

E[Z^T C Z] = E[Z]^T C E[Z] + tr[C Var[Z]]    (8)

To see this, notice that

Z^T C Z = tr[Z^T C Z]    (9)

because it's a 1×1 matrix. But the trace of a matrix product doesn't change when we cyclically permute the matrices, so

Z^T C Z = tr[C Z Z^T]    (10)

Therefore

E[Z^T C Z] = E[ tr[C Z Z^T] ]    (11)
           = tr[ E[C Z Z^T] ]    (12)
           = tr[ C E[Z Z^T] ]    (13)
           = tr[ C (Var[Z] + E[Z] E[Z]^T) ]    (14)
           = tr[ C Var[Z] ] + tr[ C E[Z] E[Z]^T ]    (15)
           = tr[ C Var[Z] ] + tr[ E[Z]^T C E[Z] ]    (16)
           = tr[ C Var[Z] ] + E[Z]^T C E[Z]    (17)

using the fact that tr is a linear operation, so it commutes with taking expectations; the decomposition of Var[Z]; the cyclic permutation trick again; and finally dropping tr from a scalar.

Unfortunately, there is generally no simple formula for the variance of a quadratic form, unless the random vector is Gaussian. If Z ~ N(µ, Σ), then

Var(Z^T C Z) = 2 tr(CΣCΣ) + 4 µ^T CΣCµ

2 Least Squares in Matrix Form

Our data consists of n paired observations of the predictor variable X and the response variable Y, i.e., (X_1, Y_1), ..., (X_n, Y_n). We wish to fit the model

Y = β_0 + β_1 X + ɛ    (18)

where E[ɛ|X = x] = 0, Var[ɛ|X = x] = σ², and ɛ is uncorrelated across measurements.
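Equation (8) and the Gaussian variance formula can likewise be checked by Monte Carlo. This is my own sketch with made-up µ, Σ, and C (I take C symmetric, the natural case for a quadratic form):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up Gaussian vector Z ~ N(mu, Sigma) and a symmetric, non-random C.
mu = np.array([1.0, 2.0, -1.0])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
C = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.5, 0.2],
              [0.0, 0.2, 2.0]])

Z = rng.multivariate_normal(mu, Sigma, size=500_000)
quad = np.einsum('ni,ij,nj->n', Z, C, Z)   # Z^T C Z, one value per sample

# Equation (8): E[Z^T C Z] = E[Z]^T C E[Z] + tr(C Var[Z]).
mean_theory = mu @ C @ mu + np.trace(C @ Sigma)
assert np.isclose(quad.mean(), mean_theory, rtol=0.02)

# Gaussian case: Var(Z^T C Z) = 2 tr(C Sigma C Sigma) + 4 mu^T C Sigma C mu.
var_theory = 2 * np.trace(C @ Sigma @ C @ Sigma) + 4 * mu @ C @ Sigma @ C @ mu
assert np.isclose(quad.var(), var_theory, rtol=0.05)
```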
2.1 The Basic Matrices

Y = [ Y_1 ]      β = [ β_0 ]      X = [ 1  X_1 ]      ɛ = [ ɛ_1 ]
    [ Y_2 ]          [ β_1 ]          [ 1  X_2 ]          [ ɛ_2 ]
    [ ... ]                           [ ...    ]          [ ... ]
    [ Y_n ]                           [ 1  X_n ]          [ ɛ_n ]    (19)

Note that X, which is called the design matrix, is an n×2 matrix, where the first column is always 1, and the second column contains the actual observations of X. Now

Xβ = [ β_0 + β_1 X_1
       β_0 + β_1 X_2
       ...
       β_0 + β_1 X_n ]    (20)

So we can write the set of equations Y_i = β_0 + β_1 X_i + ɛ_i, i = 1, ..., n, in the simpler form Y = Xβ + ɛ.

2.2 Mean Squared Error

Let

e ≡ e(β) = Y − Xβ    (21)

The training error (or observed mean squared error) is

MSE(β) = (1/n) Σ_{i=1}^n e_i²(β) = (1/n) e^T e    (22)

Let us expand this a little for further use. We have:

MSE(β) = (1/n) e^T e    (23)
       = (1/n) (Y − Xβ)^T (Y − Xβ)    (24)
       = (1/n) (Y^T − β^T X^T)(Y − Xβ)    (25)
       = (1/n) (Y^T Y − Y^T Xβ − β^T X^T Y + β^T X^T Xβ)    (26)
       = (1/n) (Y^T Y − 2 β^T X^T Y + β^T X^T Xβ)    (27)

where we used the fact that β^T X^T Y = (Y^T Xβ)^T = Y^T Xβ, since a 1×1 matrix equals its own transpose.
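To make the notation concrete, here is a small numpy sketch (my own; the data are simulated and the parameter values are made up) building the design matrix and checking that the matrix form of the MSE in equation (22) matches the elementwise sum:

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data from Y = 3 + 2X + noise.
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=n)

# Design matrix: first column all 1s, second column the observations of X.
X = np.column_stack([np.ones(n), x])      # n x 2
beta = np.array([3.0, 2.0])               # a candidate coefficient vector

# e = Y - X beta and MSE = e^T e / n (equations 21-22).
e = y - X @ beta
mse_matrix = (e @ e) / n
mse_sum = sum((y[i] - (beta[0] + beta[1] * x[i])) ** 2 for i in range(n)) / n
assert np.isclose(mse_matrix, mse_sum)
```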
2.3 Minimizing the MSE

First, we find the gradient of the MSE with respect to β:

∇MSE(β) = (1/n) ( ∇(Y^T Y) − 2 ∇(β^T X^T Y) + ∇(β^T X^T Xβ) )    (28)
        = (1/n) ( 0 − 2 X^T Y + 2 X^T Xβ )    (29)
        = (2/n) ( X^T Xβ − X^T Y )    (30)

We now set this to zero at the optimum, β̂:

X^T X β̂ − X^T Y = 0    (31)

This equation, for the two-dimensional vector β̂, corresponds to our pair of normal or estimating equations for β̂_0 and β̂_1. Thus, it, too, is called an estimating equation. Solving, we get

β̂ = (X^T X)^{-1} X^T Y    (32)

That is, we've got one matrix equation which gives us both coefficient estimates.

If this is right, the equation we've got above should in fact reproduce the least-squares estimates we've already derived, which are of course

β̂_1 = c_XY / s_X²,   β̂_0 = Ȳ − β̂_1 X̄    (33)

where c_XY is the sample covariance of X and Y and s_X² is the sample variance of X. Let's see if that's right. As a first step, let's introduce normalizing factors of 1/n into both the matrix products:

β̂ = (n^{-1} X^T X)^{-1} (n^{-1} X^T Y)    (34)

Now

(1/n) X^T Y = (1/n) [ 1    1    ...  1
                      X_1  X_2  ...  X_n ] [ Y_1  Y_2  ...  Y_n ]^T    (35)
            = (1/n) [ Σ_i Y_i
                      Σ_i X_i Y_i ]
            = [ Ȳ
                \overline{XY} ]    (36)

where \overline{XY} denotes the sample mean of the products X_i Y_i. Similarly,

(1/n) X^T X = (1/n) [ n        Σ_i X_i
                      Σ_i X_i  Σ_i X_i² ]
            = [ 1   X̄
                X̄   \overline{X²} ]    (37)

Hence,

( (1/n) X^T X )^{-1} = (1/(\overline{X²} − X̄²)) [ \overline{X²}  −X̄
                                                  −X̄            1  ]
                     = (1/s_X²) [ \overline{X²}  −X̄
                                  −X̄            1  ]
Therefore,

(X^T X)^{-1} X^T Y = (1/s_X²) [ \overline{X²}  −X̄   [ Ȳ
                                −X̄            1  ]    \overline{XY} ]    (38)
                   = (1/s_X²) [ \overline{X²} Ȳ − X̄ \overline{XY}
                                \overline{XY} − X̄ Ȳ              ]    (39)
                   = (1/s_X²) [ (s_X² + X̄²) Ȳ − X̄ (c_XY + X̄ Ȳ)
                                c_XY                             ]    (40)
                   = [ (s_X² Ȳ + X̄² Ȳ − X̄ c_XY − X̄² Ȳ) / s_X²
                       c_XY / s_X²                              ]    (41)
                   = [ Ȳ − (c_XY / s_X²) X̄
                       c_XY / s_X²        ] = [ β̂_0
                                               β̂_1 ]    (42)

using \overline{X²} = s_X² + X̄² and \overline{XY} = c_XY + X̄ Ȳ.

3 Fitted Values and Residuals

Remember that when the coefficient vector is β, the point predictions (fitted values) for each data point are Xβ. Thus the vector of fitted values is

Ŷ ≡ m̂(x) ≡ m̂ = Xβ̂

Using our equation for β̂, we then have

Ŷ = Xβ̂ = X(X^T X)^{-1} X^T Y = HY

where

H ≡ X(X^T X)^{-1} X^T    (43)

is called the hat matrix or the influence matrix.

Let's look at some of the properties of the hat matrix.

1. Influence. Check that ∂Ŷ_i/∂Y_j = H_ij. Thus, H_ij is the rate at which the i-th fitted value changes as we vary the j-th observation, the "influence" that observation has on that fitted value.

2. Symmetry. It's easy to see that H^T = H.

3. Idempotency. Check that H² = H, so the matrix is idempotent.

Geometry. A symmetric, idempotent matrix is a projection matrix. This means that H projects Y into a lower-dimensional subspace. Specifically, Y is a point in R^n but Ŷ = HY is a linear combination of two vectors, namely, the two columns of X. In other words: H projects Y onto the column space of X. The column space of X is the set of vectors that can be written as linear combinations of the columns of X.
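Both the matrix solution for β̂ and the listed properties of H can be confirmed numerically. A sketch of my own with simulated data (note that in practice one solves the linear system X^T X β̂ = X^T Y rather than forming the inverse explicitly, which is better numerically):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up data from Y = 1 + 0.5 X + noise.
n = 200
x = rng.normal(5.0, 2.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])

# Solve the estimating equation X^T X beta-hat = X^T Y (equation 31).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The scalar formulas (equation 33) give the same answer.
c_xy = np.mean(x * y) - x.mean() * y.mean()   # sample covariance of X and Y
s2_x = np.mean(x ** 2) - x.mean() ** 2        # sample variance of X
beta1_hat = c_xy / s2_x
beta0_hat = y.mean() - beta1_hat * x.mean()
assert np.allclose(beta_hat, [beta0_hat, beta1_hat])

# Hat matrix H = X (X^T X)^{-1} X^T (equation 43) and its properties.
H = X @ np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(H, H.T)                 # symmetry
assert np.allclose(H @ H, H)               # idempotency
assert np.allclose(H @ y, X @ beta_hat)    # HY is the vector of fitted values
assert np.isclose(np.trace(H), 2.0)        # tr(H) = 2, the number of columns of X
```

The last line is a bonus fact: the trace of a projection matrix equals the dimension of the space it projects onto, here 2.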
3.1 Residuals

The vector of residuals, e, is

e ≡ Y − Ŷ = Y − HY = (I − H)Y    (44)

Here are some properties of I − H:

1. Influence. ∂e_i/∂y_j = (I − H)_ij.

2. Symmetry. (I − H)^T = I − H.

3. Idempotency. (I − H)² = (I − H)(I − H) = I − H − H + H². But, since H is idempotent, H² = H, and thus (I − H)² = I − H.

Thus,

MSE(β̂) = (1/n) Y^T (I − H)^T (I − H) Y = (1/n) Y^T (I − H) Y    (45)

3.2 Expectations and Covariances

Remember that Y = Xβ + ɛ, where ɛ is an n×1 matrix of random variables, with mean vector 0 and variance-covariance matrix σ²I. What can we deduce from this?

First, the expectation of the fitted values:

E[Ŷ] = E[HY] = H E[Y] = HXβ + H E[ɛ]    (46)
     = X(X^T X)^{-1} X^T Xβ + 0 = Xβ    (47)

Next, the variance-covariance of the fitted values:

Var[Ŷ] = Var[HY] = Var[H(Xβ + ɛ)]    (48)
       = Var[Hɛ] = H Var[ɛ] H^T = σ² HIH = σ² H    (49)

using the symmetry and idempotency of H.

Similarly, the expected residual vector is zero:

E[e] = (I − H)(Xβ + E[ɛ]) = Xβ − Xβ = 0    (50)

The variance-covariance matrix of the residuals:

Var[e] = Var[(I − H)(Xβ + ɛ)]    (51)
       = Var[(I − H)ɛ]    (52)
       = (I − H) Var[ɛ] (I − H)^T    (53)
       = σ² (I − H)(I − H)^T    (54)
       = σ² (I − H)    (55)

Thus, the variance of each residual is not quite σ², nor are the residuals exactly uncorrelated.

Finally, the expected MSE is

E[ (1/n) e^T e ] = (1/n) E[ ɛ^T (I − H) ɛ ]    (56)

We know that this must be (n − 2)σ²/n.
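Both Var[e] = σ²(I − H) and the expected MSE of (n − 2)σ²/n show up clearly in simulation. A sketch of my own, holding a made-up design fixed and redrawing the noise vector many times:

```python
import numpy as np

rng = np.random.default_rng(5)

# Fixed made-up design; redraw the noise vector many times.
n, sigma, reps = 20, 2.0, 100_000
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)
beta = np.array([1.0, -2.0])

eps = rng.normal(0.0, sigma, size=(reps, n))
Y = X @ beta + eps            # reps x n: each row is one simulated data set
E = Y @ (I - H)               # each row is e = (I - H) Y, since I - H is symmetric

# Var[e] = sigma^2 (I - H), equation (55), up to Monte Carlo error.
assert np.allclose(np.cov(E, rowvar=False), sigma**2 * (I - H), atol=0.1)

# E[MSE] = (n - 2) sigma^2 / n.
mse = (E ** 2).mean(axis=1)
assert np.isclose(mse.mean(), (n - 2) * sigma**2 / n, rtol=0.01)
```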
4 Sampling Distribution of Estimators

Let's now assume that ɛ_i ~ N(0, σ²), and are independent of each other and of X. The vector of all n noise terms, ɛ, is an n×1 matrix. Its distribution is a multivariate Gaussian or multivariate Normal, with mean vector 0 and variance-covariance matrix σ²I. We write this as ɛ ~ N(0, σ²I).

We may use this to get the sampling distribution of the estimator β̂:

β̂ = (X^T X)^{-1} X^T Y    (57)
  = (X^T X)^{-1} X^T (Xβ + ɛ)    (58)
  = (X^T X)^{-1} X^T Xβ + (X^T X)^{-1} X^T ɛ    (59)
  = β + (X^T X)^{-1} X^T ɛ    (60)

Since ɛ is Gaussian and is being multiplied by a non-random matrix, (X^T X)^{-1} X^T ɛ is also Gaussian. Its mean vector is

E[ (X^T X)^{-1} X^T ɛ ] = (X^T X)^{-1} X^T E[ɛ] = 0    (61)

while its variance matrix is

Var[ (X^T X)^{-1} X^T ɛ ] = (X^T X)^{-1} X^T Var[ɛ] ( (X^T X)^{-1} X^T )^T    (62)
                          = (X^T X)^{-1} X^T σ²I X (X^T X)^{-1}    (63)
                          = σ² (X^T X)^{-1} X^T X (X^T X)^{-1}    (64)
                          = σ² (X^T X)^{-1}    (65)

Since Var[β̂] = Var[ (X^T X)^{-1} X^T ɛ ], we conclude that

β̂ ~ N( β, σ² (X^T X)^{-1} )    (66)

Re-writing slightly,

β̂ ~ N( β, (σ²/n) (n^{-1} X^T X)^{-1} )    (67)

will make it easier to prove to yourself that, according to this, β̂_0 and β̂_1 are both unbiased, that Var[β̂_1] = σ²/(n s_X²), and that Var[β̂_0] = (σ²/n)(1 + X̄²/s_X²). This will also give us Cov[β̂_0, β̂_1], which otherwise would be tedious to calculate.

I will leave you to show, in a similar way, that the fitted values HY are multivariate Gaussian, as are the residuals e, and to find both their mean vectors and their variance matrices.

5 Derivatives with Respect to Vectors

This is a brief review of basic vector calculus. Consider some scalar function of a vector, say f(X), where X is represented as a p×1 matrix. (Here X is just being used as a place-holder or generic variable; it's not necessarily the design matrix of a regression.) We would like to think about the derivatives of f with respect to X. We can write f(X) = f(x_1, ..., x_p), where X = (x_1, ..., x_p)^T.
The gradient of f is the vector of partial derivatives:

∇f ≡ [ ∂f/∂x_1  ∂f/∂x_2  ...  ∂f/∂x_p ]^T    (68)

The first-order Taylor series of f around X_0 is

f(X) ≈ f(X_0) + Σ_i (X − X_0)_i ∂f/∂x_i |_{X_0}    (69)
     = f(X_0) + (X − X_0)^T ∇f(X_0)    (70)

Here are some properties of the gradient:

1. Linearity.

∇(a f(X) + b g(X)) = a ∇f(X) + b ∇g(X)    (71)

Proof: Directly from the linearity of partial derivatives.

2. Linear forms. If f(X) = X^T a, with a not a function of X, then

∇(X^T a) = a    (72)

Proof: f(X) = Σ_i x_i a_i, so ∂f/∂x_i = a_i. Notice that a was already a p×1 matrix, so we don't have to transpose anything to get the derivative.

3. Linear forms the other way. If f(X) = bX, with b not a function of X, then

∇(bX) = b^T    (73)

Proof: Once again, ∂f/∂x_i = b_i, but now remember that b was a 1×p matrix, and ∇f is p×1, so we need to transpose.

4. Quadratic forms. Let C be a p×p matrix which is not a function of X, and consider the quadratic form X^T C X. (You can check that this is scalar.) The gradient is

∇(X^T C X) = (C + C^T) X    (74)

Proof: First, write out the matrix multiplications as explicit sums:

X^T C X = Σ_j Σ_k x_j c_jk x_k    (75)

Now take the derivative with respect to x_i:

∂f/∂x_i = Σ_j Σ_k ∂(x_j c_jk x_k)/∂x_i    (76)
If j = k = i, the term in the inner sum is 2 c_ii x_i. If j = i but k ≠ i, the term in the inner sum is c_ik x_k. If k = i but j ≠ i, we get x_j c_ji. Finally, if j ≠ i and k ≠ i, we get zero. The j = i terms add up to (CX)_i. The k = i terms add up to (C^T X)_i. (This splits the 2 c_ii x_i evenly between them.) Thus,

∂f/∂x_i = ( (C + C^T) X )_i    (77)

and

∇f = (C + C^T) X    (78)

(Check that this has the right dimensions.)

5. Symmetric quadratic forms. If C = C^T, then

∇(X^T C X) = 2CX    (79)

5.1 Second Derivatives

The p×p matrix of second partial derivatives is called the Hessian. I won't step through its properties, except to note that they, too, follow from the basic rules for partial derivatives.

5.2 Maxima and Minima

We need all the partial derivatives to be equal to zero at a minimum or maximum. This means that the gradient must be zero there. At a minimum, the Hessian must be positive-definite (so that moves away from the minimum always increase the function); at a maximum, the Hessian must be negative-definite (so moves away always decrease the function). If the Hessian is neither positive nor negative definite, the point is neither a minimum nor a maximum, but a saddle (since moving in some directions increases the function but moving in others decreases it, as though one were at the center of a horse's saddle).
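The gradient rule for quadratic forms can be spot-checked with finite differences. A small sketch of my own, with an arbitrary made-up non-symmetric C, confirming equation (74):

```python
import numpy as np

rng = np.random.default_rng(7)

# Arbitrary made-up p x p matrix C (not symmetric) and a point x0.
p = 4
C = rng.normal(size=(p, p))
x0 = rng.normal(size=p)

def f(v):
    return v @ C @ v             # the quadratic form X^T C X

grad_analytic = (C + C.T) @ x0   # equation (74)

# Central finite differences, one coordinate at a time.
h = 1e-6
grad_fd = np.array([
    (f(x0 + h * np.eye(p)[i]) - f(x0 - h * np.eye(p)[i])) / (2 * h)
    for i in range(p)
])
assert np.allclose(grad_analytic, grad_fd, atol=1e-5)
```

Note that if we had used CX or 2CX as the gradient for this non-symmetric C, the check would fail; only (C + C^T)X matches the finite differences.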