Stat 206: Estimation and testing for a mean vector, Part II
James Johndrow
2016-12-03

Comparing components of the mean vector

In the last part, we talked about testing the hypothesis H_0 : \mu_1 = \mu_2, where \mu_1 and \mu_2 are two population mean vectors. This is a natural testing setting for multivariate statistics, but it often isn't the focus of scientific interest. The more common case in applied multivariate analysis is that we have observations on a p-vector and want to simultaneously test all of the marginal hypotheses H_{0j} : \mu_{1j} = \mu_{2j}, i.e. that each component of \mu_1 equals the corresponding component of \mu_2. To see why this differs, keep in mind that we could reject some of the H_{0j} and fail to reject others, whereas before we were performing a single test for equality of the whole vector.

Each H_{0j} is a univariate hypothesis test, and when both groups are jointly normal (or, in this case, even just marginally normal) we would do a t test for each H_{0j}. This is, in fact, perfectly valid for each test considered on its own. The problem is that we are doing p tests simultaneously, and in many of the applications of interest in modern multivariate statistics p might be large, even bigger than n. The classical tests we just covered won't work when p > n (why? for one thing, the sample covariance matrix isn't even full rank when p > n, so we cannot compute its inverse, and that inverse appears in the Hotelling test statistic, so we're stuck), but the methods we'll talk about now can be applied in that setting as well. Moreover, they aren't restricted to it: the problem of multiple testing is relevant whenever p > 1.

If we test H_{0j} at level \alpha, then by definition the probability of a Type I error is \alpha. Recall that a p-value is the probability, under the null, of observing a test statistic at least as extreme as the one observed. Accordingly, we can recast level-\alpha testing as: reject H_0 if \tilde{p} < \alpha. (The tilde on the p-value \tilde{p} is there to avoid confusion with p, the dimension of the vector and the number of hypotheses.)

Since we are testing p such hypotheses, the probability of making at least one Type I error, referred to as the familywise error rate (FWER), is larger than \alpha. If we instead perform each test at level \alpha/p then, by Boole's inequality (some authors call this Bonferroni's inequality, or the union bound),

P\left[ \bigcup_{j=1}^{p} \left\{ \tilde{p}_j < \frac{\alpha}{p} \right\} \,\middle|\, H_0 \right] \le \sum_{j=1}^{p} P\left[ \tilde{p}_j < \frac{\alpha}{p} \,\middle|\, H_0 \right] = p \cdot \frac{\alpha}{p} = \alpha,
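The FWER inflation, and the Bonferroni fix, are easy to see numerically. Here is a minimal simulation sketch (the helper `fwer` and all settings are illustrative, not from the notes): with all p nulls true, p-values are Uniform(0,1), and we record how often at least one test rejects.

```r
# Illustrative simulation (names and settings are my own, not from the notes):
# familywise error rate with and without the Bonferroni correction,
# when all p null hypotheses are true.
set.seed(1)
p <- 100        # number of hypotheses
alpha <- 0.05
n.sim <- 2000   # number of simulated "experiments"

# each experiment draws p independent Uniform(0,1) p-values (all nulls true)
# and checks whether at least one test rejects at the given per-test level
fwer <- function(level) mean(replicate(n.sim, any(runif(p) < level)))

fwer.raw  <- fwer(alpha)      # close to 1 - (1 - alpha)^p, essentially 1
fwer.bonf <- fwer(alpha / p)  # close to the nominal alpha = 0.05
```

Testing each hypothesis at level \alpha gives almost certain false rejection; testing at \alpha/p brings the FWER back near \alpha, as Boole's inequality guarantees.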
so the familywise error rate is controlled at level \alpha.

It's not necessarily obvious that controlling the familywise error rate is the right thing to do, but it is clear that just testing independently at level \alpha doesn't make much sense when p is more than a few. To see why, suppose that the hypothesis tests are all independent (a dubious assumption, but useful for exposition) and that every one of the null hypotheses H_{0j} is actually true. Then if we test each at level \alpha, the number of Type I errors (false positives) V is distributed as

V \sim \text{Binomial}(p, \alpha),

so in particular P[V \ge p\alpha] \approx 1/2, assuming p\alpha is an integer. So if \alpha = 0.05 and p = 1000, about half the time we will make 50 or more mistakes (Type I errors). This is an obvious problem in the era of modern science, where often p is in the hundreds or thousands and n is similar to p or smaller.

On the other hand, controlling the FWER is pretty conservative. It might not be so bad to have a few Type I errors; in particular, it might be ok so long as Type I errors are a relatively small proportion of the total number of rejections of the H_{0j}. That is, we want to choose a level \alpha at which to perform each test such that V/(V+S) is small, where

V = \sum_j 1\{\tilde{p}_j < \alpha,\ H_{0j}\}, \qquad S = \sum_j 1\{\tilde{p}_j < \alpha,\ H_{1j}\},

the number of rejections for which the null was true and the number of rejections for which the alternative was true, respectively. In this notation, the FWER is equal to P[V \ge 1]. The quantity

Q_e = E[Q] = E\left[ \frac{V}{V+S} \right]

is called the false discovery rate (FDR); Q_e is the expectation of the unobserved random variable Q. Let p_0 be the number of null hypotheses that are true. The following two basic facts are important:

1. If all the null hypotheses are true, then the FDR is equivalent to the FWER.
2. When p_0 < p, the FDR is less than or equal to the FWER.

Therefore, any procedure that controls FWER also controls FDR.
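The binomial claim is easy to verify directly in base R, using the numbers from the text:

```r
# Quick check of the claim above: under independence with all nulls true,
# the number of false positives is V ~ Binomial(p, alpha).
p <- 1000
alpha <- 0.05
ev <- p * alpha                      # expected number of Type I errors: 50
prob <- pbinom(ev - 1, size = p, prob = alpha,
               lower.tail = FALSE)   # P[V >= 50], roughly one half
```

So at \alpha = 0.05 with p = 1000 independent true nulls, 50 or more false positives occur about half the time, exactly as claimed.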
However, a procedure that controls only FDR is less stringent, so there is the potential for higher power. In particular, when p - p_0 is large, S tends to be large, resulting in a larger difference between E[Q] and P[V \ge 1]. Thus, controlling FDR at level \alpha is less conservative than controlling FWER at level \alpha: it allows us to make multiple Type I errors in expectation, so long as they don't account for too high a proportion of the total number of discoveries (hypothesis tests for which we reject the null). Notice that FDR requires us to compute probabilities under both the null and the alternative, so we'll now be concerned with the power functions of the component tests.

A motivating example

Let's motivate the discussion to follow with a real problem. The following dataset, analyzed at length in Efron's Large Scale Inference (2014), has gene expression measurements for two samples: one of size n_1 from prostate cancer tissue, and one of size n_2 from healthy prostate tissue. The expression levels of p = 6033 genes were measured. The scientific question of interest is whether the genes are differentially expressed in cancer and healthy tissue. The corresponding null hypothesis for gene j is H_{0j} : \mu_{1j} = \mu_{2j}, where \mu_1 is the true mean for the control group and \mu_2 the true mean for the disease-state group. Let \bar{x}_1 be the sample mean for the control group and \bar{x}_2 the mean for the cancer group. Then, assuming that the variances are all equal between the control and disease-state groups, the two-sample t statistic for each gene marginally is

t_j = \frac{\bar{x}_{1j} - \bar{x}_{2j}}{s_j},

where s_j is the estimate of the standard error under equal variances,

s_j^2 = \frac{s_{1j} + s_{2j}}{n_1 + n_2 - 2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right), \qquad s_{1j} = \sum_{i=1}^{n_1} (x_{ij} - \bar{x}_{1j})^2, \quad s_{2j} = \sum_{i=n_1+1}^{n_1+n_2} (x_{ij} - \bar{x}_{2j})^2.

It will be convenient to transform from t-values to z-values, i.e.

z_j = \Phi^{-1}\big( 2\, T_{n_1+n_2-2}(|t_j|) \big),
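To make the t-to-z transform concrete before touching the real data, here is a small sketch on simulated null data. Everything here is illustrative (the sample sizes, the helper name, and the data are my own, not the prostate dataset): under the null, the two-sided p-values are uniform, so the transformed z-values are approximately standard normal.

```r
# Sketch of the t-to-z transform on simulated null data (all names and
# sizes here are illustrative; this is not the prostate dataset).
set.seed(2)
n1 <- 50; n2 <- 52; p <- 200
X1 <- matrix(rnorm(n1 * p), n1, p)   # "control" sample, no real signal
X2 <- matrix(rnorm(n2 * p), n2, p)   # "cancer" sample, no real signal

# two-sided p-value of the equal-variance two-sample t test for one coordinate
t.test.p <- function(a, b) t.test(a, b, var.equal = TRUE)$p.value

pvals <- sapply(seq_len(p), function(j) t.test.p(X1[, j], X2[, j]))
z <- qnorm(pvals)   # z-values; approximately N(0, 1) when every null is true
```

Note that mapping the two-sided p-value through \Phi^{-1} sends small p-values to the far left tail, which is why real signal shows up as extreme negative z-values in the histogram below.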
where T_{n_1+n_2-2} is the tail probability of the t distribution with n_1 + n_2 - 2 degrees of freedom, the distribution of the test statistic under the null. This allows us to discuss methods in generality, emphasizing that the basic principles apply regardless of the distribution of the test statistic used. Let's load the data and compute z_j for each coordinate.

library(ggplot2)

# helper: p-value of the equal-variance two-sample t test
t.test.p <- function(a, b) t.test(a, b, var.equal = TRUE)$p.value

load('../../datasets-efron/prostatedata.rdata')
ytmp <- as.numeric(colnames(prostatedata))
y <- ''; y[ytmp == 1] <- 'cancer'; y[ytmp == 2] <- 'control'
X <- as.matrix(t(prostatedata))
colnames(X) <- paste('gene', seq(ncol(X)), sep = '')
rownames(X) <- paste('subject', seq(nrow(X)), sep = '')
X <- data.frame(X)
Xs <- split(X, as.factor(y))
pvals <- mapply(t.test.p, Xs[[1]], Xs[[2]])
z.scores <- qnorm(pvals)
z.scores <- data.frame(z.scores)
names(z.scores) <- 'z'

ggplot(z.scores, aes(x = z)) +
  geom_histogram(aes(y = ..density..), bins = 50) +
  geom_vline(xintercept = c(qnorm(.05/6033), qnorm(1 - .05/6033))) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), col = 'red')

Figure 1: z scores for the prostate data, with the Bonferroni thresholds at level 0.05 shown as vertical lines and the theoretical distribution of the z score under the null overlaid in red.

As you can see, if we applied the Bonferroni bound to control FWER, we would not have many discoveries (rejections of the null). In fact, there are only 2 out of the 6033 genes.

bonf.p <- p.adjust(pvals, method = 'bonferroni')
sum(bonf.p < .05)

## [1] 2

FDR control: the method of Benjamini and Hochberg

Perhaps the earliest, and simplest, method for control of FDR was proposed by Benjamini and Hochberg (1995). We give the procedure here; for a proof that it controls FDR at the specified level, see the original paper. Let p_{(1)} \le p_{(2)} \le \dots \le p_{(p)} be the ordered p-values. Let k be the largest j for which

p_{(j)} \le \frac{j}{p} \alpha,   (1)
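The step-up rule in (1) is short enough to implement directly. The following sketch (the function name `bh.reject` is mine) can be checked against R's built-in `p.adjust(method = "BH")` on a small example:

```r
# Direct implementation of the BH step-up rule (function name is mine),
# checked against R's built-in p.adjust(method = "BH").
bh.reject <- function(pv, alpha = 0.05) {
  p <- length(pv)
  o <- order(pv)                             # positions of the sorted p-values
  below <- which(sort(pv) <= seq_len(p) / p * alpha)
  k <- if (length(below)) max(below) else 0  # largest j with p_(j) <= (j/p) alpha
  rej <- rep(FALSE, p)
  if (k > 0) rej[o[seq_len(k)]] <- TRUE      # reject the k smallest p-values
  rej
}

pv <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)
rej <- bh.reject(pv)   # agrees with p.adjust(pv, method = "BH") <= 0.05
```

Note the step-up character of the rule: a p-value can fail its own threshold and still be rejected, as long as some larger p-value passes its threshold.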
and reject the null hypotheses corresponding to p_{(1)}, ..., p_{(k)}. If the test statistics are either independent or satisfy a positive-dependence condition (see Benjamini and Yekutieli (2001)), this procedure controls FDR at level \alpha. Let's use the Benjamini-Hochberg procedure on the prostate cancer data:

bh.p <- p.adjust(pvals, method = 'BH')
sum(bh.p < .05)

## [1] 21

As expected, we have considerably more discoveries (rejections of the null): 21 instead of 2 at the 0.05 level.

Benjamini and Yekutieli (2001) propose a second procedure that controls FDR at level \alpha for any dependence structure of the test statistics. Let k be the largest j for which

p_{(j)} \le \frac{j}{p \sum_{j'=1}^{p} 1/j'} \alpha,

and reject the null hypotheses corresponding to p_{(1)}, ..., p_{(k)}. Then FDR is controlled at level \alpha. Let's apply this alternative FDR-controlling procedure to the prostate data:

by.p <- p.adjust(pvals, method = 'BY')
sum(by.p < .05)

## [1] 2

In this particular case, we've given up all of the additional discoveries we made using the Benjamini-Hochberg procedure. Benjamini and Yekutieli (2001) argue that in most applied problems, the positive-dependence condition for the original procedure is likely to hold. The specific condition needed is the following. Call a set D increasing if x \in D and y \ge x (componentwise) imply y \in D.

Definition 1. A random vector X is PRDS on a subset J_0 of its coordinates if, for every increasing set D and each j \in J_0, P[X \in D | X_j = x] is nondecreasing in x.

The condition for the procedure in (1) to control FDR at level \alpha is then:

Theorem 1 (Benjamini and Yekutieli (2001)). Let X be the random vector of test statistics and J_0 = {j : H_{0j} is true}. If X is PRDS on J_0, then the Benjamini-Hochberg procedure in (1) controls the FDR at level less than or equal to (p_0/p)\alpha.
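The Benjamini-Yekutieli rule is the same step-up procedure with \alpha deflated by the constant c(p) = \sum_{j=1}^p 1/j, which is what buys validity under arbitrary dependence. A sketch (the function name `by.reject` is mine), again checkable against `p.adjust`:

```r
# Sketch of the Benjamini-Yekutieli step-up rule (function name is mine):
# BH with alpha deflated by c(p) = sum_{j=1}^p 1/j.
by.reject <- function(pv, alpha = 0.05) {
  p <- length(pv)
  cp <- sum(1 / seq_len(p))                  # c(p) = 1 + 1/2 + ... + 1/p
  o <- order(pv)
  below <- which(sort(pv) <= seq_len(p) / (p * cp) * alpha)
  k <- if (length(below)) max(below) else 0  # largest j passing the threshold
  rej <- rep(FALSE, p)
  if (k > 0) rej[o[seq_len(k)]] <- TRUE      # reject the k smallest p-values
  rej
}

pv <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)
rej <- by.reject(pv)   # more conservative than BH on the same p-values
```

Since c(p) grows like log p, the deflation is mild for small p but substantial for thousands of tests, which is consistent with the BY procedure recovering only 2 discoveries on the prostate data.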
Benjamini and Yekutieli (2001) provide numerous examples of common applications in which the condition in Theorem 1 holds. Because the condition is considered to be relatively weak, the procedure in (1) has become the default FDR-controlling procedure. However, keep in mind that before using the Benjamini-Hochberg procedure you should at least check whether the specific application for which you are performing multiple hypothesis testing has previously been considered with respect to the PRDS condition in the literature, or evaluate the plausibility of the condition for your problem (which usually requires specifying a likelihood).

Other methods for controlling FDR

Numerous other methods for controlling FDR exist, though most require significant background in other areas of statistics to understand thoroughly. Among these are empirical Bayes and fully Bayes procedures that offer control of the local FDR, the probability that each individual test statistic corresponds to a false discovery. If interested, I recommend Efron's book Large Scale Inference (2014) for a comprehensive treatment of empirical Bayes approaches.

References