Lecture 3. Inference about multivariate normal distribution

Size: px

Start display at page:

Download "Lecture 3. Inference about multivariate normal distribution"

Valerie Clarke
5 years ago
Views:

1 Lecture 3. Inference about multivariate normal distribution 3.1 Point and Interval Estimation Let X 1,..., X n be i.i.d. N p (µ, Σ). We are interested in evaluation of the maximum likelihood estimates of µ and Σ. Recall that the joint density of X 1 is f(x) = 2πΣ 1 2 exp [ 1 ] 2 (x µ) Σ 1 (x µ), for x R p. The negative log likelihood function, given observations x n 1 := {x 1,..., x n }, up to an additive constant independent of µ and Σ, is then l (µ, Σ x n 1) = n 2 log Σ + 1 n (x i µ) Σ 1 (x i µ) 2 = n 2 log Σ + n 2 ( x µ) Σ 1 ( x µ) n (x i x) Σ 1 (x i x). On this end, denote the centered data matrix by X = [ x 1,..., x n ] p n, where x i = x i x. Let S 0 = 1 n X X = 1 n x i x n i. Proposition 1. The MLEs of µ and Σ, that jointly minimize l (µ, Σ x n 1), are ˆµ MLE = x, Σ MLE = S 0. Note that S 0 is a biased estimator of Σ. The sample variance covariance matrix S = n S n 1 0 is unbiased. For interval estimation of µ, we largely follow Section 7.1 of Härdle and Simar (2012). First note that since µ R p, we need to generalize the notion of intervals (primarily defined for R 1 ) to a higher dimensional space. A simple extension is a direct product of marginal intervals: for example, for intervals a < x < b and c < y < d, we obtain a rectangular region {(x, y) R 2 : a < x < b, c < y < d}. A confidence region A is a subset of R p and it depends on (random) observations X 1,..., X n. A = A(X n 1) is a confidence region of size 1 α (0, 1) for parameter µ R p if P (µ A(X n 1)) 1 α where P is with respect to the distribution of X n 1. (Elliptical confidence region) Corollary 7 in lecture 2 provides a pivot which paves a way to construct a confidence region for µ.

2 Since n p ( X µ) S 1 p 0 ( X µ) F p,n p and ( P ( X µ) S 1 0 ( X µ) < p ) n p F 1 α;p,n p = 1 α, we have that A := { ν R p : ( X ν) S 1 0 ( X ν) < p } n p F 1 α;p,n p is a confidence region of size 1 α for parameter µ. Note: F 1 α;p,n p is defined as the constant such that P (F < F 1 α;p,n p ) = 1 α where F is F p,n p distributed. The resulting confidence region for a given sample is an ellipse. (Simultaneous confidence intervals) Simultaneous confidence intervals for all linear combinations of elements of µ, i.e., a µ for arbitrary a R p, provide confidence of size 1 α for any interval in this framework, which covers a µ. This includes the marginal means µ 1,..., µ p (when choosing a wisely). We are interested in evaluating lower and upper bounds L(a) and U(a) satisfying P (L(a) < a µ < U(a), for all a R p ) 1 α Note: the lower and upper limits of the interval depend on the data (hence is random) and is also specified by a. First consider a single confidence interval by fixing a particular vector a. To evaluate a confidence interval for a µ, write new random variables Y i = a X i N 1 (a µ, a Σa) (i = 1,..., n), whose squared t-statistic is t 2 (a) := n (a µ a X) 2 a Sa fixed a, F 1,n 1. Thus, for any P (t 2 (a) F 1 α,1,n 1 ) = 1 α. (1) Next, consider many projection vectors a 1, a 2,..., a M (M is finite only for convenience). The simultaneous confidence intervals of the type similar to (1) are then ( M ) P {t 2 (a i ) h(α)} 1 α, for some h( ). We collect some facts below. 1. max a t 2 (a) h(α) implies t 2 (a i ) h(α) for all i. 2. max a t 2 (a) = n(µ X) S 1 (µ X), HS Theorem 2.5 (note the rank) 2

3 3. Corollary 7 in lecture 2. These suggest that we have h(α) = (n 1)p F n p 1 α;p,n p and ( M ) ( P {t 2 (a i ) h(α)} P max a Proposition 2. Simultaneously for all a R p, the interval contains a µ with probability 1 α. a X ± h(α)a Sa/n ) t2 (a) h(α) = 1 α. Example 1. From the Golub gene expression data, with dimension d = 7129, take the first and 1674th variables (genes), to focus on the bivariate case (p = 2). There are two populations: 11 observations from AML, 27 from ALL. Figure 1 illustrates the elliptical confidence region of size 95% and 99%. Figure 2 compares the elliptical confidence region with the simultaneous confidence intervals for a 1 = (1, 0) and a 2 = (0, 1). 2.5 x 104 Golub data, ALL (red) and AML(blue) gene # gene #1 Figure 1: Elliptical confidence regions of size 95% and 99%. 3

4 2.5 x 104 Golub data, ALL (red) and AML(blue) gene # gene #1 Figure 2: Simultaneous confidence intervals of size 95% 3.2 Hypotheses testing Consider testing a null hypothesis H 0 : θ Ω 0 against an alternative hypothesis H 1 : θ Ω 1. The principle of likelihood ratio test is as follows: Let L 0 be the maximized likelihood under H 0 : θ Ω 0, and L 1 be the maximized likelihood under θ Ω 0 Ω 1. The likelihood ratio statistic, or sometimes called Wilks statistic, is then W = 2 log( L 0 L 1 ) 0 The null hypothesis is rejected if the observed value of W is large. In some cases the exact distribution of W under H 0 can be evaluated. In other cases, Wilks theorem states that for large n (sample size), W = χ L 2 ν, where ν is the number of free parameters in H 1 but not in H 0. If the degrees of freedom in Ω 0 is q and the degrees of freedom in Ω 0 Ω 1 is r, then ν = r q. Consider testing hypotheses on µ and Σ of multivariate normal distribution, based on n-sample X 1,..., X n. case I: H 0 : µ = µ 0, H 1 : µ µ 0, Σ is known. In this case, we know the exact distribution of the likelihood ratio statistic W = n( x µ 0 ) Σ 1 ( x µ 0 ) χ 2 p, 4

5 under H 0. [Show on blackboard.] It is worthwhile to first write 2 log(l(x; µ, Σ)) = 2l(x; µ, Σ) = n log 2πΣ + n tr{σ 1 S} + n( x µ) T Σ 1 ( x µ) case II: H 0 : µ = µ 0, H 1 : µ µ 0, Σ is unknown. The MLEs under H 1 are ˆµ = x and Σ = S 0. The restricted MLE of Σ under H 0 is Σ (0) = 1 n n (x i µ 0 )(x i µ 0 ) = S 0 + δδ, where δ = n( x µ 0 ). The likelihood ratio statistic is then W = n log S 0 + δδ n log S 0. It turns out that W is a monotone increasing function of δ S 1 δ = n( x µ 0 )S 1 ( x µ 0 ), which is the Hotelling s T 2 (n 1) statistic. case III: H 0 : Σ = Σ 0, H 1 : Σ Σ 0, µ is unknown. We have the likelihood ratio statistic W = n log Σ 1 0 S 0 np + ntrace(σ 1 0 S 0 ). This is the case where the exact distribution of W is difficult to evaluate. For large n, use Wilks theorem to approximate the distribution of W by χ 2 ν with the degrees of freedom ν = p(p + 1)/2. Next, consider testing the equality of two mean vectors. Let X 11,..., X 1n1 be i.i.d. N p (µ 1, Σ) and X 21,..., X 2n2 be i.i.d. N p (µ 2, Σ). case IV: H 0 : µ 1 = µ 2, H 1 : µ 1 µ 2, Σ is unknown. Since X 1 X 2 N p (µ 1 µ 2, n 1 + n 2 Σ), n 1 n 2 (n 1 + n 2 2)S P W p (n 1 + n 2 2, Σ), where (n 1 + n 2 2)S P := (n 1 1)S 1 + (n 2 1)S 2, we have Hotelling s T 2 statistic for two-sample problem and by Theorem 5 in lecture 2 T 2 (n 1 + n 2 2) = n 1n 2 n 1 + n 2 ( X 1 X 2 ) S 1 P ( X 1 X 2 ), n 1 + n 2 p 1 (n 1 + n 2 2)p T 2 (n 1 + n 2 2) F p,n1 +n 2 p 1. Similar to case II above, the likelihood ratio statistic is a monotone function of T 2 (n 1 + n 2 2). 5

6 3.3 Hypothesis testing when p > n In the high dimensional situation where the dimension p is larger than sample size (p > n 1 or p > n 1 +n 2 2), the sample covariance S is not invertable, thus the Hotelling s T 2 statistic, which is essential in the testing procedures above, cannot be computed. We survey important proposals for testing hypotheses on means in High-Dimension, Low-Sample Size (HDLSS) data. A basic idea in generalizing a test procedure for the p > n case is to base the test on a computable test statistic which is also an estimator for µ µ 0 or µ 1 µ 2. In case II (one sample), Dempster (1960) proposed to replace S 1 in Hotelling s statistic by (trace(s)i p ) 1. He showed that under H 0 : µ = 0, T D = n X X trace(s) F r,(n 1)r, approximately, for r = (trace(σ))2, a measure of sphericity of Σ. An estimator ˆr of r is used in testing. trace(σ 2 ) Bai and Saranadasa (1996) proposed to simply replace S 1 in Hotelling s statistic by I p, yielding T B = n X X. However X X is not an unbiased estimator of µ µ since E( X X) = 1 n trace(σ) + µ µ. They showed that the standardized statistic M B = n X X trace(s) ŝd(n X X trace(s)) = 2(n 1)n (n 2)(n+1) has asymptotic N(0, 1) distribution for p, n. n X X trace(s) ( trace(s2 ) 1 (trace(s))2) n Srivastava and Du (2008) proposed to replace S in Hotelling s statistic by D S = diag(s). Then T S = n X D 1 X S n 1 n(n 1) p can be used to estimate D 1 2 n 3 n 3 Σ µ 2, which is zero under H 0 : µ = 0. Srivastava and Du s test statistic is then M S = T S ŝd(t S ) = n X D 1 S X n 1 n 3 p 2trace(R 2 ) p2 n 1, which has asymptotic N(0, 1) distribution for p, n. Here R = D S SD 2 S correlation matrix. is the sample Chen and Qin (2010) improves the two-sample test for mean vectors from that of Bai and Saranadasa (1996). In testing H 1 : µ 1 = µ 2, Bai and Saranadasa (1996) proposed to use T B = X X 1 2 n 1+n 2 n 1 n 2 trace(s P ). The substraction of trace(s P ) is to make sure that E(T B ) = µ 1 µ 2 2. Chen and Qin (2010) proposed to not use trace(s P ), by considering n1 i j T C = X n2 1iX 1j i j + X n1 n2 2iX 2j j=1 2 X 1iX 2j. n 1 (n 1 1) n 2 (n 2 1) n 1 n 2 Since E(T C ) = µ 1 µ 2 2, Chen and Qin proposed to test based on T C. There are many other ideas, including: 6

7 1. The test statistic is essentially the maximum of p normalized marginal mean differences (Cai et al., 2013); 2. Use a generalized inverse of S, denoted by S or S, to replace S 1 ; 3. Estimate Σ in a way that is invertible; 4. Reduce the dimension p of the random vector X by Z = h(x) R d, for d < n, then apply the traditional theory of hypothesis testing. Next lecture is on linear dimension reduction principal component analysis. References Bai, Z. and Saranadasa, H. (1996), Effect of high dimension: by an example of a two sample problem, Statist. Sinica, 6, Cai, T. T., Liu, W., and Xia, Y. (2013), Two-Sample Test of High Dimensional Means under Dependency, To appear in Journal of the Royal Statistical Society: Series B. Chen, S. X. and Qin, Y.-L. (2010), A two-sample test for high-dimensional data with applications to gene-set testing, The Annals of Statistics, 38, Dempster, A. P. (1960), A significance test for the separation of two highly multivariate small samples, Biometrics, 16, Srivastava, M. S. and Du, M. (2008), A test for the mean vector with fewer observations than the dimension, Journal of Multivariate Analysis, 99,

8 In evaluating the MLE of Σ for MVN, one can use the following famous result. Note: Y Y = Y (Y 1 ) T Y trace(ay 1 B) = (Y 1 BAY 1 ) T Σ {log Σ + trace(σ 1 S 0 )} = Σ 1 Σ (Σ 1 ) (Σ 1 S 0 Σ 1 ) = (Σ 1 ) (Σ 1 S 0 Σ 1 ) The first order condition leads to ˆΣ MLE = S 0 8

Multivariate Statistical Analysis

Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 9 for Applied Multivariate Analysis Outline Addressing ourliers 1 Addressing ourliers 2 Outliers in Multivariate samples (1) For