Multivariate Linear Models


Stanley Sawyer, Washington University, November 7

1. Introduction. Suppose that we have $n$ observations, each of which has $d$ components. For example, we may have $d$ measurements on each of $n$ different individual humans or squirrels, or $d$ test scores for $n$ students on $d$ different days. The observations take the form of an $n \times d$ matrix

\[
Y = \begin{pmatrix}
Y_{11} & Y_{12} & \cdots & Y_{1d} \\
Y_{21} & Y_{22} & \cdots & Y_{2d} \\
\vdots & \vdots &        & \vdots \\
Y_{n1} & Y_{n2} & \cdots & Y_{nd}
\end{pmatrix}
\]

The $i$th row corresponds to the $i$th observation. We assume that we also observe $r$ covariates whose structure is the same for all components. These might be income level, sex, altitude, etc., as in an ANOVA design. In matrix notation, the basic multivariate regression or MANOVA model is

\[
Y = X\beta + e \tag{1.1}
\]

Here $X$ is an $n \times r$ matrix (as in the univariate case) that relates the observations to the covariates, and

\[
\beta = \begin{pmatrix}
\beta_{11} & \beta_{12} & \cdots & \beta_{1d} \\
\beta_{21} & \beta_{22} & \cdots & \beta_{2d} \\
\vdots     & \vdots     &        & \vdots \\
\beta_{r1} & \beta_{r2} & \cdots & \beta_{rd}
\end{pmatrix}
\]

is an $r \times d$ matrix of parameter values. The $j$th column of $\beta$ relates to the $j$th component of the observations $Y_i$.

The $n \times d$ error matrix $e$ in (1.1) is assumed to be jointly Gaussian. The rows of $e$ are independent, corresponding to independent vector-valued observations $Y_i \in R^d$. The entries within a row of $e$, corresponding to the components of an observation $Y_i$, have a fixed covariance matrix $\Sigma$. Thus the errors $e$ satisfy $E(e_{ia}) = 0$ and

\[
\operatorname{Cov}(e_{ia}, e_{jb}) = 0, \quad i \ne j, \qquad
\operatorname{Cov}(e_{ia}, e_{ib}) = \Sigma_{ab}, \quad 1 \le a, b \le d \tag{1.2}
\]

In general, if $A = \{a_{ij}\}$ is an $r \times s$ matrix and $B$ is $m \times n$, the tensor product or Kronecker product of the matrices $A$ and $B$ is the matrix $C = A \otimes B$ that can be written in partitioned form

\[
C = \begin{pmatrix}
a_{11}B & a_{12}B & \cdots & a_{1s}B \\
a_{21}B & a_{22}B & \cdots & a_{2s}B \\
\vdots  & \vdots  &        & \vdots \\
a_{r1}B & a_{r2}B & \cdots & a_{rs}B
\end{pmatrix}
\]

Note that $C$ is then $rm \times sn$. In this notation, the covariance matrix of the error terms can be written

\[
\operatorname{Cov}(e) = I_n \otimes \Sigma \tag{1.3}
\]

This assumes that we identify the pair $(i, a)$ in $e_{ia}$ with the single integer $I = (i-1)m + a$, where $m = d$ is the number of rows of $B = \Sigma$. The matrix $\operatorname{Cov}(e)$ can then be written in block diagonal form with the matrix $\Sigma$ down the diagonal.

2. The MLE of the matrix $\beta$. Since the errors $e$ are jointly normal, the likelihood function of the observations $Y$ in (1.1) can be written as

\[
L(Y, \beta) = (2\pi)^{-nd/2} \det(\Sigma)^{-n/2} \exp(-S/2)
\]

where

\[
S = \sum_{i=1}^{n} \sum_{a=1}^{d} \sum_{b=1}^{d}
\bigl(Y_{ia} - (X\beta)_{ia}\bigr) \, (\Sigma^{-1})_{ab} \, \bigl(Y_{ib} - (X\beta)_{ib}\bigr)
\]

Only $S$ depends on $\beta$, so finding the matrix MLE $\hat\beta$ is equivalent to minimizing the triple sum $S$. Setting $(\partial/\partial\beta_{kc}) S = 0$ implies

\[
\sum_{i=1}^{n} \sum_{b=1}^{d} X_{ik} \, (\Sigma^{-1})_{cb} \bigl(Y_{ib} - (X\beta)_{ib}\bigr)
= \sum_{b=1}^{d} (\Sigma^{-1})_{cb} \bigl((X'Y)_{kb} - (X'X\beta)_{kb}\bigr) = 0
\]

Since $\Sigma^{-1}$ is symmetric, this is $(X'X\beta - X'Y)\,\Sigma^{-1} = 0$ in matrix form. Postmultiplying by $\Sigma$ leads to $X'X\beta = X'Y$. If the $r \times r$ matrix $X'X$ is invertible, the matrix-valued MLE is then

\[
\hat\beta = (X'X)^{-1} X'Y \tag{2.1}
\]

This is the same formula as in the univariate case ($d = 1$), but now $\beta$ and $\hat\beta$ are $r \times d$ matrices and $Y$ is $n \times d$. That is, the $d$ columns of the multivariate MLE $\hat\beta$ are found by applying the same $r \times n$ matrix $(X'X)^{-1}X'$ as in the univariate case to each of the $d$ columns of $Y$.
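As a numerical illustration of (2.1), the following is a minimal sketch in Python with NumPy. The sizes, the random seed, and the simulated data (with identity error covariance) are assumptions chosen for the example and do not come from the notes. It computes the matrix MLE from the normal equations and checks that it agrees with d separate univariate least-squares fits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, d = 50, 3, 2                      # illustrative sizes (assumed)

X = rng.normal(size=(n, r))             # n x r design matrix
beta = rng.normal(size=(r, d))          # r x d "true" parameter matrix
Y = X @ beta + rng.normal(size=(n, d))  # simulated errors with Sigma = I_d

# Matrix MLE (2.1): solve the normal equations X'X beta_hat = X'Y
# for all d columns of Y at once.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The same estimate obtained one column of Y at a time,
# i.e. as d separate univariate regressions on the same X.
by_column = np.column_stack(
    [np.linalg.lstsq(X, Y[:, a], rcond=None)[0] for a in range(d)]
)

print(np.allclose(beta_hat, by_column))
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse of X'X explicitly, which is usually preferable numerically to a literal transcription of (2.1).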

For $Q = (X'X)^{-1}X'$, (2.1) and $Y = X\beta + e$ imply

\[
\hat\beta_{ia} = \beta_{ia} + \sum_{u=1}^{n} Q_{iu} e_{ua}, \qquad Q = (X'X)^{-1}X'
\]

Thus by (1.2)

\[
\operatorname{Cov}(\hat\beta_{ia}, \hat\beta_{jb})
= \sum_{u=1}^{n} \sum_{v=1}^{n} Q_{iu} Q_{jv} \operatorname{Cov}(e_{ua}, e_{vb})
= \sum_{u=1}^{n} Q_{iu} Q_{ju} \Sigma_{ab}
= \bigl((X'X)^{-1}\bigr)_{ij} \Sigma_{ab} \tag{2.2}
\]

since $QQ' = (X'X)^{-1}X'X(X'X)^{-1} = (X'X)^{-1}$. Thus, with the same conventions on indices in Kronecker products as before,

\[
\operatorname{Cov}(\hat\beta) = (X'X)^{-1} \otimes \Sigma
\]

3. Hypothesis Testing. For the regression model $Y = X\beta + e$, the natural hypothesis

\[
H_0 : \beta_{ia} = 0, \quad 1 \le a \le d \tag{3.1a}
\]

says that the $i$th covariate does not influence any of the components of $Y$. A generalization of (3.1a) is

\[
H_0 : L\beta = 0 \tag{3.1b}
\]

where $L$ is a $q \times r$ matrix, so that $L\beta$ is $q \times d$. Thus (3.1b) is shorthand for $q$ different relations of the form (3.1a). An example would be that a particular MANOVA effect with $q$ degrees of freedom is equal to zero.

The analog of the univariate SS term in a multivariate MANOVA is the $d \times d$ matrix

\[
H_L = (L\hat\beta)' \bigl(L(X'X)^{-1}L'\bigr)^{-1} L\hat\beta \tag{3.2a}
\]

The analog of the residual sum of squares SSE is the $d \times d$ matrix

\[
E = (Y - X\hat\beta)'(Y - X\hat\beta) = Y'Y - \hat\beta' X'X \hat\beta \tag{3.2b}
\]

The sums in (3.2) are sometimes called SS&CP matrices to emphasize that they are matrices instead of numbers. (SS&CP stands for Sums of Squares and Cross Products.)

The usual one-dimensional ANOVA test for (3.1) is based on $H_L/E$. Indeed, in the univariate case, if $H_0 : L\beta = 0$ with $\operatorname{rank}(L) = q$, then $F = (1/q)H_L/S$ for $S = E/(n-r)$ has an F distribution with $(q, n-r)$ degrees of freedom. The multivariate analog has to be more complicated, since $H_L$ and $E$ are $d \times d$ matrices and the three matrices

\[
E^{-1}H_L, \qquad H_L E^{-1}, \qquad E^{-1/2} H_L E^{-1/2} \tag{3.3}
\]

are generally different. However, the eigenvalues of the three matrices (3.3) are the same. This is because all three matrices in (3.3) have the same characteristic polynomial

\[
f(\lambda) = \det(E^{-1}H_L - \lambda I) = \det(H_L - \lambda E) / \det(E)
\]

Several common tests of the hypothesis (3.1) are based on the eigenvalues of the matrices (3.3). Since the third matrix in (3.3) is symmetric and nonnegative definite, the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$ are real and nonnegative. The most common tests, and the functions of the $\lambda_i$ that the tests are based upon, are

1. Wilks' Lambda: $\Lambda = \prod_{i=1}^{d} \dfrac{1}{1 + \lambda_i}$
2. Pillai's Trace: $S_1 = \sum_{i=1}^{d} \dfrac{\lambda_i}{1 + \lambda_i}$
3. Hotelling-Lawley Trace: $S_2 = \sum_{i=1}^{d} \lambda_i$
4. Roy's Greatest Root: $S_3 = \lambda_1$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$.

The last test is named after the Indian statistician S. N. Roy, so that Roy is not a first name. The likelihood ratio test for the general hypothesis (3.1) is essentially Wilks' Lambda.

In general, the four tests give different P-values, and are designed to test $H_0 : L\beta = 0$ against four slightly different alternatives. However, there is an important special case in which the three matrices (3.3) have a single nonzero eigenvalue and the four tests give identical results. First, we need a digression into a multivariate version of the Student t distribution that is called the Hotelling $T^2$ distribution.

4. The Hotelling $T^2$ Statistic. Suppose that we have two samples

\[
Y_{11}, Y_{12}, \ldots, Y_{1n_1}, \qquad
Y_{21}, Y_{22}, \ldots, Y_{2n_2}, \qquad Y_{ij} \in R^d
\]

The vector-valued observations $Y_{ij}$ are independent $N(\mu_i, \Sigma)$. That is, the two samples have the same covariance matrix $\Sigma$ but possibly different (vector-valued) means $\mu_1$ and $\mu_2$.
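Returning briefly to the four test statistics listed in Section 3: each is a simple function of the eigenvalues of the matrices (3.3). The following is a minimal sketch in plain Python. The function name `manova_statistics` and the dictionary keys are hypothetical names introduced for this illustration, and the eigenvalues are assumed to have been computed already.

```python
import math


def manova_statistics(eigenvalues):
    """Four classical MANOVA statistics from the eigenvalues of E^{-1} H_L.

    `eigenvalues` is any non-empty iterable of nonnegative numbers.
    (Hypothetical helper written for this illustration.)
    """
    lam = sorted(eigenvalues, reverse=True)  # lam[0] >= lam[1] >= ... >= 0
    if not lam or lam[-1] < 0:
        raise ValueError("need a non-empty list of nonnegative eigenvalues")
    return {
        "wilks_lambda": math.prod(1.0 / (1.0 + x) for x in lam),
        "pillai_trace": sum(x / (1.0 + x) for x in lam),
        "hotelling_lawley_trace": sum(lam),
        "roy_greatest_root": lam[0],
    }


# Special case with a single nonzero eigenvalue lam1 = 1: the four values
# are 1/(1+lam1), lam1/(1+lam1), lam1 and lam1 respectively.
stats = manova_statistics([1.0, 0.0])
print(stats)
```

When only one eigenvalue is nonzero, every statistic is a monotone function of that eigenvalue, which is why the four tests then lead to the same conclusion.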

The standard test of $H_0 : \mu_1 = \mu_2$ in the univariate case ($d = 1$) is based on the statistic

\[
t = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \; \bigl(\bar Y_1 - \bar Y_2\bigr) \Big/ \sqrt{s^2}
\qquad \text{where} \qquad
s^2 = \frac{1}{n_1 + n_2 - 2} \sum_{i=1}^{2} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)^2
\]

Given $H_0$, the statistic $t$ has a Student's t distribution with $n_1 + n_2 - 2$ degrees of freedom and can be used as a test of $H_0$. Hotelling's generalization of $t$ is

\[
T^2 = \frac{n_1 n_2}{n_1 + n_2} \, (\bar Y_1 - \bar Y_2)' S^{-1} (\bar Y_1 - \bar Y_2)
\qquad \text{where} \qquad
S = \frac{1}{n_1 + n_2 - 2} \sum_{i=1}^{2} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_i)(Y_{ij} - \bar Y_i)' \tag{4.1}
\]

The statistic $T^2$ is called the Hotelling $T^2$-statistic. Most books on multivariate statistics show that (i) the likelihood ratio test of $H_0 : \mu_1 = \mu_2$ is based on $T^2$ and (ii) given $H_0$, the distribution of $T^2$ is given by

\[
T^2 \sim \frac{d(n_1 + n_2 - 2)}{n_1 + n_2 - d - 1} \, F_{d,\, n_1 + n_2 - d - 1}
\]

That is, the distribution of $T^2$ is that constant times an F distribution with those numbers of degrees of freedom.

5. Hotelling and Wishart Distributions. In general, a random variable $X$ is said to have a Hotelling's $T^2$ distribution with parameters $(d, m)$ (abbreviated $T^2(d, m)$) if $X$ has the same distribution as $Y'S^{-1}Y$, where $S = (1/m) \sum_{a=1}^{m} Z_a Z_a'$ and $Y, Z_1, \ldots, Z_m$ are $m + 1$ independent random vectors with the same distribution $N(0, \Sigma)$. The form of $Y'S^{-1}Y$ indicates that the distribution of $X$ does not depend on $\Sigma$. It can then be shown that $T^2$ in (4.1) has a $T^2(d, n_1 + n_2 - 2)$ distribution and that

\[
X \sim T^2(d, m) \sim \frac{dm}{m - d + 1} \, F_{d,\, m-d+1} \tag{5.1}
\]

A random variable $W$ is said to have a Wishart distribution with parameters $\Sigma$ and $m$ (abbreviated $W(\Sigma, m)$) if $W$ has the same distribution as $\sum_{a=1}^{m} Z_a Z_a'$, where $Z_1, \ldots, Z_m$ are independent normally-distributed random vectors with distribution $N(0, \Sigma)$. This is a multivariate generalization of the chi-square distribution, but also depends on a matrix $\Sigma$. Since $N_a = \Sigma^{-1/2} Z_a$ are independent $N(0, I_d)$, we have that $W(\Sigma, m) \sim \Sigma^{1/2} W(I_d, m) \Sigma^{1/2}$.

Since $S = W/m$ in the definition above, $X$ has a Hotelling $T^2(d, m)$ distribution if and only if $X$ is distributed as $m\,Y'W^{-1}Y$, where $Y$ is $N(0, \Sigma)$, $W$ has a Wishart distribution $W(\Sigma, m)$, and $Y$ and $W$ are independent.

6. Hypotheses of Rank One. Suppose in the hypothesis

\[
H_0 : L\beta = 0 \tag{6.1}
\]

for a $q \times r$ matrix $L$ that $\operatorname{rank}(L) = q = 1$. (See equations (3.1a) and (3.1b) in Section 3.) Then $L = h'$ is a row vector and

\[
H_L = H_h = (h'\hat\beta)' \bigl(h'(X'X)^{-1}h\bigr)^{-1} h'\hat\beta = c_h \, x_h x_h' \tag{6.2}
\]

where $x_h = (h'\hat\beta)' = \hat\beta' h$ and $c_h = 1 / \bigl(h'(X'X)^{-1}h\bigr)$. It follows from (2.2) that

\[
\operatorname{Cov}\bigl((x_h)_a, (x_h)_b\bigr)
= \sum_{i=1}^{r} \sum_{j=1}^{r} h_i h_j \operatorname{Cov}(\hat\beta_{ia}, \hat\beta_{jb})
= \sum_{i=1}^{r} \sum_{j=1}^{r} h_i h_j \bigl((X'X)^{-1}\bigr)_{ij} \Sigma_{ab}
= \Sigma_{ab} / c_h
\]

Thus, given $H_0 : h'\beta = 0$, $Z_0 = x_h \sqrt{c_h}$ is $N(0, \Sigma)$ and $H_h = Z_0 Z_0'$. Similarly

\[
A = E^{-1}H_L = E^{-1}H_h = \bigl(c_h E^{-1} x_h\bigr) x_h' = u_h x_h' \tag{6.3}
\]

where $u_h = c_h E^{-1} x_h$ and $E = (Y - X\hat\beta)'(Y - X\hat\beta)$ as in (3.2b). In particular, the three matrices (3.3) all have rank one.

In general, if $A = c\,uv'$ is a rank-one matrix where $c$ is a nonzero constant, then $Az = \lambda z$ implies $c\,u(v'z) = \lambda z$. If $\lambda = 0$, then $v'z = 0$. If $\lambda \ne 0$, then $v'z \ne 0$ and $z$ is proportional to $u$. Then $Au = \lambda u$ implies $c\,u(v'u) = \lambda u$. This implies that if $A = c\,uv'$ is rank one, then the sole nonzero eigenvalue is $\lambda = c\,v'u$.
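The rank-one eigenvalue argument can be checked numerically. The sketch below uses NumPy with a small hand-picked example; the particular vectors u and v and the constant c are assumptions chosen only so that the arithmetic is easy to follow (here v'u = 4, so c v'u = 10).

```python
import numpy as np

c = 2.5
u = np.array([1.0, 2.0, 0.0, -1.0])
v = np.array([3.0, 1.0, 4.0, 1.0])

A = c * np.outer(u, v)   # rank-one matrix A = c u v'
lam = c * (v @ u)        # claimed sole nonzero eigenvalue, c v'u

# u is an eigenvector of A for the eigenvalue lam.
print(np.allclose(A @ u, lam * u))

# All remaining eigenvalues vanish up to rounding error.
eigvals = np.linalg.eigvals(A)
order = np.argsort(np.abs(eigvals))
print(eigvals[order[-1]].real, np.abs(eigvals[order[:-1]]).max())
```

Because A is not symmetric, `np.linalg.eigvals` is used rather than a symmetric eigensolver, and the zero eigenvalues are only zero up to floating-point error.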

For $A$ in (6.3) or for any of the matrices (3.3), the only nonzero eigenvalue is

\[
\lambda_1(H, E) = c_h \, y_h' y_h = c_h \, (E^{-1/2} x_h)'(E^{-1/2} x_h) = c_h \, x_h' E^{-1} x_h
= \frac{h'\hat\beta E^{-1} \hat\beta' h}{h'(X'X)^{-1}h} \tag{6.4}
\]

where $y_h = E^{-1/2} x_h$. Recall that $x_h \sqrt{c_h} = Z_0$ has a $N(0, \Sigma)$ distribution if $h'\beta = 0$. As in the univariate case, it can be shown that $\hat\beta$ is independent of the residuals $Y - X\hat\beta$ and that the residual matrix $E = (Y - X\hat\beta)'(Y - X\hat\beta)$ has a Wishart distribution $W(\Sigma, n-r)$. It follows from (6.4) that, if $\operatorname{rank}(H) = 1$, then $(n-r)\lambda_1(H, E)$ has a Hotelling $T^2(d, n-r)$ distribution, which by (5.1) has an F distribution within a normalizing constant. Indeed, by (5.1), if $h'\beta = 0$,

\[
\lambda_1(H, E) \sim \frac{T^2(d, n-r)}{n-r} \sim \frac{d}{n-r-d+1} \, F_{d,\, n-r-d+1} \tag{6.5}
\]

It follows from the same argument that the eigenvector of $A = E^{-1}H_L$ for the eigenvalue $\lambda_1$ is $u_h = c_h E^{-1} x_h = c_h E^{-1} \hat\beta' h$.

Example 1. For $H_0 : \beta_{ia} = 0$ ($1 \le a \le d$), the nonzero eigenvalue is

\[
\lambda_1(H, E) = \hat\beta_i \, E^{-1} \hat\beta_i' \Big/ \bigl((X'X)^{-1}\bigr)_{ii}
\]

where $\hat\beta_i$ denotes the $i$th row of $\hat\beta$. If $E$ is replaced by $S = E/(n-r)$ and $d = 1$, this reduces to the square of the usual t statistic. If $d > 1$, it has a Hotelling $T^2(d, n-r)$ distribution.

Example 2. Suppose that we have two samples $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$ and $Y_{21}, Y_{22}, \ldots, Y_{2n_2}$ where $Y_{ij} \in R^d$ and have the distribution $N(\mu_i, \Sigma)$. This is equivalent to $Y = X\beta + e$ where $Y$ is $n \times d$ for $n = n_1 + n_2$, $\beta = (\mu_1 \ \mu_2)'$ is $2 \times d$, and $X$ is the $n \times 2$ matrix with $X_{ij} = 1$ for $j = 1$, $1 \le i \le n_1$ or $j = 2$, $n_1 < i \le n$, with otherwise $X_{ij} = 0$. In particular

\[
X'X = \begin{pmatrix} n_1 & 0 \\ 0 & n_2 \end{pmatrix}, \qquad
\hat\beta = (X'X)^{-1}X'Y = \begin{pmatrix} \bar Y_1' \\ \bar Y_2' \end{pmatrix} \tag{6.6}
\]

Thus the normalized residual matrix $S = E/(n-2)$ in (3.2b) is the same as the pooled two-sample covariance matrix $S$ in (4.1). Since $\beta = (\mu_1 \ \mu_2)'$, the hypothesis $H_0 : \mu_1 = \mu_2$ is the same as $H_0 : h'\beta = 0$ for $h = (1, -1)'$. By (6.6), $c_h^{-1} = h'(X'X)^{-1}h = (1/n_1) + (1/n_2)$ and

\[
\lambda_1(H, E) = \frac{h'\hat\beta E^{-1} \hat\beta' h}{h'(X'X)^{-1}h}
= \frac{1}{n_1 + n_2 - 2} \, \frac{n_1 n_2}{n_1 + n_2} \, (\bar Y_1 - \bar Y_2)' S^{-1} (\bar Y_1 - \bar Y_2)
\]

Thus

\[
(n_1 + n_2 - 2)\,\lambda_1(H, E) = \frac{n_1 n_2}{n_1 + n_2} \, (\bar Y_1 - \bar Y_2)' S^{-1} (\bar Y_1 - \bar Y_2)
\sim T^2(d, n_1 + n_2 - 2)
\]

Hence any test based on $\lambda_1(H, E)$ is equivalent to the Hotelling $T^2$-test of Section 4.

Recall that the eigenvector for $\lambda_1(H, E)$ for $A = E^{-1}H$ is $u_h = c_h E^{-1} \hat\beta' h = \bigl(n_1 n_2 / (n_1 + n_2)\bigr) E^{-1} (\bar Y_1 - \bar Y_2)$. Within a constant, this is the same vector as appears in the discriminant function

\[
\ell(x) = \ell_0 + x' S^{-1} (\bar Y_1 - \bar Y_2), \qquad S = E/(n_1 + n_2 - 2)
\]

in linear discriminant analysis (LDA; q.v.).
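The equivalence in Example 2 can be verified numerically. The following is a minimal sketch using NumPy on simulated data; the sample sizes, the seed, and the mean shift of 0.5 are assumptions made for illustration only. It computes the Hotelling statistic directly from (4.1) and again through the regression formulation as (n1 + n2 - 2) times the nonzero eigenvalue of the matrix E^{-1}H.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, d = 12, 15, 3                    # illustrative sizes (assumed)
n = n1 + n2
Y1 = rng.normal(size=(n1, d))            # sample 1
Y2 = rng.normal(size=(n2, d)) + 0.5      # sample 2, shifted mean

# Direct two-sample statistic, equation (4.1).
diff = Y1.mean(axis=0) - Y2.mean(axis=0)
R1 = Y1 - Y1.mean(axis=0)
R2 = Y2 - Y2.mean(axis=0)
S = (R1.T @ R1 + R2.T @ R2) / (n - 2)    # pooled covariance matrix
T2 = (n1 * n2 / n) * (diff @ np.linalg.solve(S, diff))

# Regression formulation: Y = X beta + e with group-indicator columns.
Y = np.vstack([Y1, Y2])
X = np.zeros((n, 2))
X[:n1, 0] = 1.0
X[n1:, 1] = 1.0
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y             # rows are the two sample means
resid = Y - X @ beta_hat
E = resid.T @ resid                      # residual SS&CP matrix (3.2b)

h = np.array([1.0, -1.0])                # contrast for mu_1 = mu_2
x_h = beta_hat.T @ h
c_h = 1.0 / (h @ XtX_inv @ h)
H = c_h * np.outer(x_h, x_h)             # hypothesis SS&CP matrix (6.2)

# The largest eigenvalue of E^{-1} H is the single nonzero one.
lam1 = np.linalg.eigvals(np.linalg.solve(E, H)).real.max()

# F statistic implied by the T^2 distribution in Section 4.
F = T2 * (n - d - 1) / (d * (n - 2))

print(np.isclose((n - 2) * lam1, T2))
```

The value `F` can be referred to an F distribution with (d, n1 + n2 - d - 1) degrees of freedom; computing the corresponding P-value would require a statistics library and is left out of this sketch.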
