Contents

1 Linear model
2 GLS for multivariate regression
3 Covariance estimation for the GLM
4 Testing the GLH

A reference for some of this material can be found somewhere.

1 Linear model

Recall that the linear model is
\[
y = X\beta + \Sigma^{1/2} z, \qquad E[z] = 0, \quad Var[z] = I,
\]
where $X \in \mathbb{R}^{n \times p}$, $\beta \in \mathbb{R}^p$ and $\Sigma \in \mathcal{S}^n_+$. The term "linear" here typically refers to the mean parameters: the expectation of any element $y_i$ of $y$ is linear in the parameters. If the parameter space for $\Sigma$ is unrestricted, then the covariance model is linear as well. If the covariance is restricted,
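The model is easy to simulate. A minimal sketch (the dimensions, $\beta$, and $\Sigma$ below are made up for illustration; any square root of $\Sigma$ works, here a Cholesky factor):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Made-up design, coefficients, and positive-definite covariance
X = rng.standard_normal((n, p))
beta = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)

# Draw y = X beta + Sigma^{1/2} z with E[z] = 0, Var[z] = I
L = np.linalg.cholesky(Sigma)   # L L' = Sigma, a square root of Sigma
z = rng.standard_normal(n)
y = X @ beta + L @ z
```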
say $\Sigma = \Sigma(\theta)$ for $\theta \in \Theta$, then typically the covariance model is nonlinear in its parameters (e.g. an autoregressive covariance). A surprising number of other models can be viewed as submodels of the linear model:

1. Row and column effects models: Let $y_{i,j} = \mu + a_i + b_j + \sigma_i \psi_j z_{i,j}$. Then
\[
Y = \mu 1 1^\top + a 1^\top + 1 b^\top + \Sigma^{1/2} Z \Psi^{1/2}
\]
\[
y = 1\mu + (1 \otimes I) a + (I \otimes 1) b + (\Psi \otimes \Sigma)^{1/2} z
= \begin{bmatrix} 1 & (1 \otimes I) & (I \otimes 1) \end{bmatrix}
\begin{bmatrix} \mu \\ a \\ b \end{bmatrix} + (\Psi \otimes \Sigma)^{1/2} z.
\]

2. Random effects models:

3. Independent multivariate sample: Let $y_1, \ldots, y_n \sim$ i.i.d. $N_p(\mu, \Sigma)$. Then
\[
Y = 1\mu^\top + Z\Sigma^{1/2}, \qquad
y = (I \otimes 1)\mu + (\Sigma^{1/2} \otimes I) z.
\]

4. General linear model: Let $y_i \sim N(B^\top x_i, \Sigma)$, independently for $i = 1, \ldots, n$. Then
\[
Y = XB + Z\Sigma^{1/2}, \qquad
y = (I \otimes X) b + (\Sigma^{1/2} \otimes I) z.
\]

5. Correlated rows: Suppose $Cov[Y_{[i,]}] \propto \Sigma$ and $Cov[Y_{[,j]}] \propto \Psi$. Then
\[
Y = XB + \Psi^{1/2} Z \Sigma^{1/2}, \qquad
y = (I \otimes X) b + (\Sigma^{1/2} \otimes \Psi^{1/2}) z.
\]
6. Row and column covariates: Suppose $E[y_{i,j}] = x_i^\top B w_j$, for example,
\[
Y = XBW^\top + \Psi^{1/2} Z \Sigma^{1/2}, \qquad
y = (W \otimes X) b + (\Sigma^{1/2} \otimes \Psi^{1/2}) z.
\]

So we begin by studying the deceptively simple general form of the linear model.

OLS: For a given value of $\beta$, the RSS is given by
\[
RSS(\beta) = (y - X\beta)^\top (y - X\beta).
\]
Differentiation or completing the square shows that the minimizer in $\beta$ is
\[
\hat\beta = (X^\top X)^{-1} X^\top y.
\]
It is straightforward to show that the OLS estimator is unbiased, $E[\hat\beta] = \beta$, and that the variance of this estimator is
\[
Var[\hat\beta] = (X^\top X)^{-1} (X^\top \Sigma X)(X^\top X)^{-1}.
\]
If $\Sigma = \sigma^2 I$ then the OLS estimator is optimal in terms of variance:

Theorem 1 (Gauss-Markov). Suppose $Var[y] = \sigma^2 I$ and let $\check\beta = Ay$ be an unbiased estimator for $\beta$. Then
\[
Var[\check\beta] = Var[\hat\beta] + E[(\check\beta - \hat\beta)(\check\beta - \hat\beta)^\top].
\]
If $\Sigma \neq \sigma^2 I$ then the OLS estimator does not have minimum variance.
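A quick numerical sketch of the OLS estimator and its sandwich variance (the design and $\Sigma$ below are made up; the least-squares solver is used only as a cross-check on the normal-equations formula):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.standard_normal((n, p))
Sigma = np.diag(rng.uniform(0.5, 2.0, n))   # illustrative heteroscedastic covariance
beta = np.array([1.0, 0.0, -1.0])
y = X @ beta + np.linalg.cholesky(Sigma) @ rng.standard_normal(n)

# OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Sandwich variance (X'X)^{-1} (X' Sigma X) (X'X)^{-1}
var_ols = XtX_inv @ (X.T @ Sigma @ X) @ XtX_inv

# Cross-check against a least-squares solver
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```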
Exercise 1. Obtain the form of $\hat\beta$, $\hat y = X\hat\beta$, and $Var[\hat\beta]$ in terms of components of the SVD of $X$. Comment on the following intuitive statement: If $y \approx X\beta$ then $X^{-1} y \approx \beta$.

GLS: If we knew $\Sigma$ we could obtain a decorrelated vector $\tilde y = \Sigma^{-1/2} y$. The model for $\tilde y$ is
\[
\tilde y = \Sigma^{-1/2} y = \Sigma^{-1/2} X \beta + \Sigma^{-1/2}\Sigma^{1/2} z = \tilde X \beta + z,
\]
where $\tilde X = \Sigma^{-1/2} X$. The OLS estimator based on the decorrelated data and corresponding regressors is
\[
\hat\beta_{OLS} = (\tilde X^\top \tilde X)^{-1} \tilde X^\top \tilde y.
\]
By the Gauss-Markov theorem, this estimator is optimal among all linear unbiased estimators, that is, unbiased estimators of the form $A\tilde y$. Obviously, this includes unbiased estimators of the form $Ay$, since $\tilde y$ is an invertible linear transformation of $y$. What is this estimator in terms of the original data and covariates?
\[
\hat\beta_{GLS} = \hat\beta_{OLS} = (\tilde X^\top \tilde X)^{-1} \tilde X^\top \tilde y = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y.
\]
The variance of this estimator is easily shown to be
\[
Var[\hat\beta_{GLS}] = (X^\top \Sigma^{-1} X)^{-1}.
\]
This is the smallest variance attainable among unbiased linear estimators: the GLS estimate is the BLUE, the best linear unbiased estimator.
Furthermore, suppose we assume normal errors, $z \sim N_n(0, I)$. For known $\Sigma$, the MLE of $\beta$ is the minimizer of
\[
-2 \log p(y \mid X, \beta) = (y - X\beta)^\top \Sigma^{-1} (y - X\beta).
\]
Taking derivatives, the MLE is the GLS estimator.

Big picture: If $y \sim N_n(X\beta, \Sigma)$ and $\Sigma$ is known, then $\hat\beta_{GLS} = (X^\top \Sigma^{-1} X)^{-1} X^\top \Sigma^{-1} y$ is

- the GLS estimator;
- the ML estimator;
- the OLS estimator based on the decorrelated data $\Sigma^{-1/2} y$;
- the BLUE based on $y$.

Now the problem is that $\Sigma$ is rarely known. In some cases, particularly if $\Sigma = \Sigma(\theta)$ for some low-dimensional parameter $\theta$, we can jointly estimate $\Sigma$ and $\beta$. This is known as feasible GLS.

2 GLS for multivariate regression

Perhaps the most commonly used model is multivariate regression, a.k.a. the general linear model:
\[
Y = XB + Z\Sigma^{1/2}, \qquad
y = (I \otimes X) b + (\Sigma^{1/2} \otimes I) z.
\]
GLS: Using the vector form for the model, the GLS estimator of $b = vec(B)$ is
\[
\hat b_{GLS} = \left( (I \otimes X)^\top (\Sigma \otimes I)^{-1} (I \otimes X) \right)^{-1} (I \otimes X)^\top (\Sigma^{-1} \otimes I) y
= (\Sigma^{-1} \otimes X^\top X)^{-1} (\Sigma^{-1} \otimes X^\top) y
= (I \otimes (X^\top X)^{-1} X^\top) y.
\]
Now "unvectorize":
\[
\hat b_{GLS} = (I \otimes (X^\top X)^{-1} X^\top) y
\quad\Longleftrightarrow\quad
\hat B_{GLS} = (X^\top X)^{-1} X^\top Y.
\]
Notice that $\Sigma$ does not appear in the formula for the GLS estimator. Think about why this is:

The multivariate regression model is just a regression model for each variable:
\[
y_j = X b_j + e_j.
\]
So the $j$th column $b_j$ of $B \in \mathbb{R}^{q \times p}$ is the vector of regression coefficients for the $j$th variable. In this model, there is no correlation among the elements of $y_j$/rows of $Y$. We already knew that in this case the BLUE for $b_j$ based on $y_j$ is the OLS estimator:
\[
\hat\beta_j = (X^\top X)^{-1} X^\top y_j.
\]
Now, is this BLUE among estimators based on all of $Y$? The question comes down to whether or not observing other columns $Y_{[,j']}$ correlated with $y_j$ helps with estimating $b_j$. The answer given by the above calculation is no.
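The claim that $\Sigma$ drops out can be verified directly: the Kronecker-form GLS estimator and columnwise OLS give the same $\hat B$. A sketch with made-up data and small dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, p = 20, 3, 2
X = rng.standard_normal((n, q))
Y = rng.standard_normal((n, p))
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)

# Vector form: y = vec(Y), design (I kron X), error covariance (Sigma kron I)
y = Y.flatten(order="F")
Xt = np.kron(np.eye(p), X)
Om_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))
b_gls = np.linalg.solve(Xt.T @ Om_inv @ Xt, Xt.T @ Om_inv @ y)
B_gls = b_gls.reshape((q, p), order="F")   # unvectorize

# Columnwise OLS: same answer, and Sigma never enters
B_ols = np.linalg.solve(X.T @ X, X.T @ Y)
```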
Binding the OLS estimators columnwise gives the multivariate OLS/GLS estimator:
\[
\hat B = [\hat\beta_1 \ \cdots \ \hat\beta_p].
\]
To summarize, the estimator $\hat B_{GLS} = (X^\top X)^{-1} X^\top Y$ is

- the OLS estimator;
- the BLUE;
- the MLE under normal errors;
- the column-bound univariate OLS estimators.

What is the covariance of $\hat B$? How do we even write this out, since $\hat B$ is a matrix? One approach is to vectorize. Recall the vector form for the GLM:
\[
Y = XB + Z\Sigma^{1/2}, \qquad
y = (I \otimes X) b + (\Sigma^{1/2} \otimes I) z = \tilde X b + \tilde\Sigma^{1/2} z.
\]
Now we already calculated the variance of a GLS estimator:
\[
Var[\hat B] \equiv Var[\hat b] = (\tilde X^\top \tilde\Sigma^{-1} \tilde X)^{-1}
= \left( (I \otimes X^\top)(\Sigma^{-1} \otimes I)(I \otimes X) \right)^{-1}
= (\Sigma^{-1} \otimes X^\top X)^{-1}
= \Sigma \otimes (X^\top X)^{-1}.
\]
This means that
\[
Cov[\hat B_{[,j]}, \hat B_{[,j']}] = \sigma_{j,j'} (X^\top X)^{-1}, \qquad
Cov[\hat B_{[k,]}, \hat B_{[k',]}] = s_{k,k'} \Sigma,
\]
where $s_{k,k'} = [(X^\top X)^{-1}]_{k,k'}$. In particular, $Cov[\hat b_{k,j}, \hat b_{k',j'}] = \sigma_{j,j'} s_{k,k'}$. So in a sense, $\Sigma$ is the column covariance of $\hat B$, and $(X^\top X)^{-1}$ is the row covariance.
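The covariance identity $Var[\hat b] = \Sigma \otimes (X^\top X)^{-1}$ is likewise easy to check numerically (made-up $X$ and $\Sigma$), including the column-block interpretation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, q, p = 15, 2, 3
X = rng.standard_normal((n, q))
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)

# ((I kron X)' (Sigma^{-1} kron I) (I kron X))^{-1}, computed by brute force
Xt = np.kron(np.eye(p), X)
Om_inv = np.kron(np.linalg.inv(Sigma), np.eye(n))
V = np.linalg.inv(Xt.T @ Om_inv @ Xt)

# ... equals Sigma kron (X'X)^{-1}
V_kron = np.kron(Sigma, np.linalg.inv(X.T @ X))
```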
3 Covariance estimation for the GLM

Since the OLS/GLS estimator of $B$ is $(X^\top X)^{-1} X^\top Y$, the fitted values for $Y$ are
\[
\hat Y = X\hat B = X(X^\top X)^{-1} X^\top Y = P_X Y.
\]
So the fitted values of the matrix $Y$ are just the column-wise projections of $Y$ onto the space spanned by the columns of $X$. Obvious but important is that $\hat B$ itself is a function of $\hat Y$:
\[
\hat B = (X^\top X)^{-1} X^\top Y
= (X^\top X)^{-1}(X^\top X)(X^\top X)^{-1} X^\top Y
= (X^\top X)^{-1} X^\top \left( X(X^\top X)^{-1} X^\top \right) Y
= (X^\top X)^{-1} X^\top P_X Y
= (X^\top X)^{-1} X^\top \hat Y.
\]
The residuals are the projection onto the null space of $X^\top$:
\[
\hat E = Y - \hat Y = (I - P_X) Y = (I - X(X^\top X)^{-1} X^\top) Y = P_0 Y.
\]
This gives the decomposition $Y = \hat Y + \hat E$.
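A short sketch (made-up data) verifying the decomposition $Y = \hat Y + \hat E$, the orthogonality of the two projections, and that $\hat B$ is recoverable from $\hat Y$ alone:

```python
import numpy as np

rng = np.random.default_rng(5)
n, q, p = 12, 3, 2
X = rng.standard_normal((n, q))
Y = rng.standard_normal((n, p))

P_X = X @ np.linalg.solve(X.T @ X, X.T)   # projection onto col(X)
P_0 = np.eye(n) - P_X                      # projection onto its orthogonal complement

Y_hat = P_X @ Y                            # fitted values
E_hat = P_0 @ Y                            # residuals

# B_hat from Y and from Y_hat agree
B_from_Y = np.linalg.solve(X.T @ X, X.T @ Y)
B_from_Yhat = np.linalg.solve(X.T @ X, X.T @ Y_hat)
```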
Is $\hat Y$ correlated with $\hat E$? Let's find out. Let $\hat y$ and $\hat e$ be the vectorizations of $\hat Y$ and $\hat E$, respectively. The covariance between these two matrices is the covariance between their vectorizations:
\[
Cov[\hat y, \hat e] = E[(\hat y - E[\hat y])\hat e^\top].
\]
Now what are these items?
\[
\hat Y - E[\hat Y] = P_X Y - E[P_X Y] = P_X (Y - E[Y]) = P_X E,
\qquad \hat y - E[\hat y] = (I \otimes P_X) e,
\]
\[
\hat E = P_0 Y = P_0 (XB + E) = P_0 E, \qquad \hat e = (I \otimes P_0) e.
\]
Therefore, since $Var[e] = \Sigma \otimes I$,
\[
Cov[\hat y, \hat e] = E[(\hat y - E[\hat y])\hat e^\top]
= (I \otimes P_X)\, E[e e^\top]\, (I \otimes P_0)
= (I \otimes P_X)(\Sigma \otimes I)(I \otimes P_0)
= \Sigma \otimes P_X P_0 = 0.
\]
So as with ordinary regression, the residuals are uncorrelated with the fitted values, and hence also uncorrelated with $\hat B$. This is useful, as it allows us to obtain an estimate of $\Sigma$ that is independent of our estimate of $B$ if the errors are normal.
OK, let's consider estimating $\Sigma$ from $\hat E$. Now $\hat E = P_0 E$, and under the model $E = Z\Sigma^{1/2}$. Therefore $\hat E = P_0 Z \Sigma^{1/2}$, and
\[
S = \hat E^\top \hat E = \Sigma^{1/2} Z^\top P_0^\top P_0 Z \Sigma^{1/2},
\]
\[
E[S] = \Sigma^{1/2} E[Z^\top P_0 Z] \Sigma^{1/2} = \Sigma^{1/2} (tr(P_0) I) \Sigma^{1/2} = (n - q)\Sigma.
\]
Therefore, $\hat\Sigma = \hat E^\top \hat E/(n - q)$ is an unbiased estimator of $\Sigma$. However, it is not the MLE:

Exercise 2. Show that the MLE of $(B, \Sigma)$ is $(\hat B, S/n)$.

If the errors are Gaussian then we can say more. Suppose
\[
Y = XB + Z\Sigma^{1/2}, \qquad Z \sim N_{n \times p}(0, I \otimes I),
\]
or equivalently, $Y \sim N_{n \times p}(XB, \Sigma \otimes I)$. Then $\hat E = P_0 Z \Sigma^{1/2}$. Now recall from a few months ago the following:

- $P_0$ is a rank-$(n - q)$ idempotent projection matrix;
- $P_0 = GG^\top$ for some $G \in \mathcal{V}_{n-q,n} \subset \mathbb{R}^{n \times (n-q)}$;
- if $Z \sim N(0, I \otimes I)$ then $G^\top Z \sim N_{(n-q) \times p}(0, I \otimes I)$.
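A Monte Carlo sanity check (a sketch; the replication count and $\Sigma$ are arbitrary) that $tr(P_0) = n - q$ and that $\hat E^\top \hat E / (n - q)$ is unbiased for $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, q, p, reps = 10, 2, 2, 4000
X = rng.standard_normal((n, q))
P_0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
Sig_half = np.linalg.cholesky(Sigma)

# tr(P_0) = n - q, the dimension of the residual space
df = np.trace(P_0)

# Average the unbiased estimator over many simulated error matrices
S_sum = np.zeros((p, p))
for _ in range(reps):
    E = rng.standard_normal((n, p)) @ Sig_half.T   # rows have covariance Sigma
    E_hat = P_0 @ E
    S_sum += E_hat.T @ E_hat
Sigma_hat_mean = S_sum / reps / (n - q)
```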
So we have
\[
S = \hat E^\top \hat E
= \Sigma^{1/2} Z^\top P_0^\top P_0 Z \Sigma^{1/2}
= \Sigma^{1/2} Z^\top P_0 Z \Sigma^{1/2}
= \Sigma^{1/2} Z^\top G G^\top Z \Sigma^{1/2}
= W^\top W,
\]
where $W = G^\top Z \Sigma^{1/2} \sim N_{(n-q) \times p}(0, \Sigma \otimes I)$. Therefore
\[
S = \hat E^\top \hat E = W^\top W \sim \text{Wishart}_{n-q}(\Sigma).
\]
We've seen a simpler version of this before: Suppose $Y \sim N_{n \times p}(1\mu^\top, \Sigma \otimes I)$. Although it doesn't look like it, this is a multivariate linear regression model where the regression model for each column $j$ is
\[
y_j = 1\mu_j + e_j.
\]
That is, there is $q = 1$ predictor and it is constant across rows:

- $X = 1$ and $B = \mu^\top$;
- $P_X = 1(1^\top 1)^{-1} 1^\top = 11^\top/n$;
- $P_0 = I - 11^\top/n = C$;
- $S = \hat E^\top \hat E = Y^\top C Y \sim \text{Wishart}_{n-1}(\Sigma)$.

4 Testing the GLH

We now derive some classical testing and confidence procedures for the regression matrix $B$ of the Gaussian GLM: $Y \sim N(XB, \Sigma \otimes I)$. Some results we've obtained so far:
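The factorization $P_0 = GG^\top$ and the identity $S = W^\top W$ can be checked by building $G$ from the unit eigenvectors of $P_0$ (a sketch with made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(7)
n, q, p = 10, 3, 2
X = rng.standard_normal((n, q))
P_0 = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

# Orthonormal basis of the residual space: eigenvectors of P_0 with eigenvalue 1
w, V = np.linalg.eigh(P_0)
G = V[:, w > 0.5]            # n x (n-q), with G G' = P_0

Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
Z = rng.standard_normal((n, p))
E = Z @ np.linalg.cholesky(Sigma).T

E_hat = P_0 @ E
S = E_hat.T @ E_hat          # residual cross-product matrix
W = G.T @ E                  # (n-q) x p matrix with i.i.d. N_p(0, Sigma) rows
```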
- $\hat B = (X^\top X)^{-1} X^\top Y \sim N_{q \times p}(B, \Sigma \otimes (X^\top X)^{-1})$;
- $S = \hat E^\top \hat E \sim \text{Wishart}_{n-q}(\Sigma)$;
- $\hat B$ and $\hat E$ are uncorrelated and so independent.

Consider first inference for $B_{[j,]}^\top = b_j$, the (transposed) $j$th row of the matrix $B$. This $p \times 1$ vector represents the effects of the $j$th predictor $X_{[,j]}$ on $Y$:
\[
Y = XB + E = \sum_{k=1}^{q} X_{[,k]} b_k^\top + E.
\]
In the context of making inference about $b_j$, our results above tell us that

- $\hat b_j \sim N_p(b_j, h_{jj}\Sigma)$, where $h_{jj} = [(X^\top X)^{-1}]_{jj}$;
- $\hat b_j$ is independent of $S$.

Our objective is to make a confidence region or test a hypothesis about $b_j$ based on $\hat b_j$. Recall that a particularly simple case of this is when $q = 1$ and $X = 1$. This is just the multivariate normal model $Y \sim N_{n \times p}(1\mu^\top, \Sigma \otimes I)$. Our results, both old and new, imply that

- $\bar y \sim N(\mu, \Sigma/n)$;
- $\bar y$ is independent of $S$.

The point is that a statistical method developed to infer $\mu$ from $(\bar y, S)$ can be used to infer $b_j$ from $(\hat b_j, S)$. So we proceed with the former since the notation is simpler.
Suppose $\Sigma$ were known and we wanted to test $H : E[y] = \mu$. We then might construct a test statistic as follows:
\[
\bar y \sim N(\mu, \Sigma/n)
\;\Rightarrow\; \sqrt{n}(\bar y - \mu) \sim N(0, \Sigma)
\;\Rightarrow\; z(\mu) = \sqrt{n}\,\Sigma^{-1/2}(\bar y - \mu) \sim N(0, I)
\]
\[
\Rightarrow\; z^2(\mu) = n(\bar y - \mu)^\top \Sigma^{-1} (\bar y - \mu) \sim \chi^2_p.
\]
So in this case the null distribution is $\chi^2_p$, and of course does not depend on the parameters. In this case that $\Sigma$ is known, this statistic is a pivot statistic, that is, it has a distribution that doesn't depend on any unknown parameters. We use such statistics to test and infer values of $\mu$.

Since $\Sigma$ is unknown, we might try replacing it with the unbiased estimator $\hat\Sigma$:
\[
t^2(\mu) = n(\bar y - \mu)^\top \hat\Sigma^{-1} (\bar y - \mu).
\]
Can we use this for testing and confidence region construction? To do so requires the following:

- show that $t^2$ is pivotal, that is, its distribution doesn't depend on unknown parameters;
- derive the distribution of $t^2$.
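A Monte Carlo sketch of the known-$\Sigma$ statistic: across repeated samples, $z^2(\mu)$ should behave like a $\chi^2_p$ draw, so its average should be near $p$ (the dimensions, $\mu$, and $\Sigma$ below are made up):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, reps = 25, 3, 4000
mu = np.array([1.0, -1.0, 0.0])
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)
Sig_half = np.linalg.cholesky(Sigma)
Sigma_inv = np.linalg.inv(Sigma)

# z^2(mu) = n (ybar - mu)' Sigma^{-1} (ybar - mu) for each simulated sample
z2 = np.empty(reps)
for r in range(reps):
    Y = mu + rng.standard_normal((n, p)) @ Sig_half.T
    ybar = Y.mean(axis=0)
    z2[r] = n * (ybar - mu) @ Sigma_inv @ (ybar - mu)
```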