
Stat 206: Estimation and testing for a mean vector, Part II

James Johndrow
2016-12-03

Comparing components of the mean vector

In the last part, we talked about testing the hypothesis H_0 : μ_1 = μ_2, where μ_1 and μ_2 are two population mean vectors. This is a natural testing setting for multivariate statistics, but it often isn't the focus of scientific interest. The more common case in applied multivariate analysis is that we have observations on a p-vector, and we want to simultaneously test all of the marginal hypotheses H_0j : μ_1j = μ_2j, i.e. that each component of μ_1 is equal to the corresponding component of μ_2. To see why this differs, keep in mind that we could reject some of the H_0j and fail to reject others, whereas before we were performing a single test for equality of the whole vector.

Each H_0j is a univariate simple hypothesis test, and when both groups are jointly normal (or, in this case, even marginally normal), we would do a t test for each H_0j. This is, in fact, perfectly valid for each test considered separately. The problem is that we are doing p tests simultaneously, and in many of the applications of interest in modern multivariate statistics, p might be large, even bigger than n. The classical tests we just covered won't work when p > n,^1 but the methods we'll talk about now can be applied in that setting as well. Moreover, they aren't restricted to that setting; in fact, the problem of multiple testing is relevant whenever p > 1.

^1 Why? For one thing, the sample covariance matrix isn't even full rank when p > n, which means we cannot compute its inverse. But that inverse appears in the Hotelling test statistic, so we're stuck.

If we test H_0j at level α, then by definition the probability of a Type I error is α. Recall that a p-value is the probability under the null of observing a test statistic at least as extreme as the one observed. Accordingly, we can recast level-α testing as: reject H_0j if p̃_j < α.^2

^2 I have put a tilde on p̃ here so as not to cause confusion with p, the dimension of the vector/number of hypotheses.
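As a quick numerical sanity check on this recasting (a simulation sketch of my own, not from the notes): under the null the p-value is uniform on (0, 1), so rejecting when p̃ < α commits a Type I error with probability α. For a one-sample z test with known variance:

```r
# Simulation sketch: for a one-sample z test of H0: mu = 0 with known sd = 1,
# rejecting when the p-value falls below alpha gives Type I error rate alpha.
set.seed(1)
alpha <- 0.05
nsim <- 1e4
n <- 30
type1 <- replicate(nsim, {
  x <- rnorm(n, mean = 0, sd = 1)      # data generated under the null
  z <- sqrt(n) * mean(x)               # z statistic
  ptilde <- 2 * pnorm(-abs(z))         # two-sided p-value
  ptilde < alpha                       # reject iff p-value < alpha
})
mean(type1)  # close to alpha = 0.05
```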
Since we are testing p such hypotheses, the probability of making at least one Type I error, referred to as the familywise error rate (FWER), is larger than α. If we instead perform each test at level α/p then, by Boole's inequality,^3

\[
P\Big[ \bigcup_{j=1}^{p} \big\{ \tilde p_j < \tfrac{\alpha}{p} \big\} \,\Big|\, H_0 \Big]
\le \sum_{j=1}^{p} P\Big[ \tilde p_j < \tfrac{\alpha}{p} \,\Big|\, H_0 \Big]
= p \cdot \frac{\alpha}{p} = \alpha,
\]

^3 Some authors refer to this as Bonferroni's inequality or the union bound.

so the familywise error rate is controlled at level α.

It's not necessarily obvious that controlling the familywise error rate is the right thing to do, but it's also clear that just testing independently at level α doesn't make much sense when p is more than a few. To see why, suppose that the hypothesis tests are all independent,^4 and that every one of the null hypotheses H_0j is actually true. Then if we test each at level α, the number of Type I errors (false positives) V is distributed as

\[ V \sim \text{Binomial}(p, \alpha), \]

so in particular, P[V > pα] ≈ 0.5 when pα is an integer. So if α = 0.05 and p = 1000, we will make more than 50 mistakes (Type I errors) about half of the time. This is an obvious problem in the era of modern science, where often p is in the hundreds or thousands and n is similar to p or smaller.

^4 A dubious assumption, but useful for exposition.

On the other hand, controlling the FWER is pretty conservative. It might not be so bad to have a few Type I errors; in particular, it might be OK so long as Type I errors are a relatively small proportion of the total number of rejections of the H_0j. In particular, we want to choose a level α at which to perform each test such that V/(V + S) is small, where

\[
V = \sum_j 1\{\tilde p_j < \alpha,\ H_{0j}\}, \qquad
S = \sum_j 1\{\tilde p_j < \alpha,\ H_{1j}\},
\]

are the number of rejections for which the null was true and the number of rejections for which the alternative was true, respectively. In this notation, the FWER is equal to P[V ≥ 1]. The quantity

\[
Q_e = E[Q] = E\left[ \frac{V}{V + S} \right]
\]

is called the false discovery rate (FDR); Q_e is the expectation of the unobserved random variable Q. Let p_0 be the number of null hypotheses that are true. The following two basic facts are important:

1. If all the null hypotheses are true, then the FDR is equivalent to the FWER.
2. When p_0 < p, the FDR is less than or equal to the FWER.

Therefore, any procedure that controls FWER also controls FDR.
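The Binomial(p, α) calculation above is easy to verify numerically; this is a quick sketch (mine, not from the notes) using the values α = 0.05 and p = 1000 from the text.

```r
# With p = 1000 true nulls each tested independently at level alpha = 0.05,
# the number of Type I errors is V ~ Binomial(1000, 0.05).
p <- 1000
alpha <- 0.05
p * alpha                        # E[V]: 50 false positives on average
1 - pbinom(p * alpha, p, alpha)  # P[V > 50]: a bit under one half
```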
However, a procedure that controls only FDR is less stringent, so there is the potential for higher power. In particular, when p − p_0 is large,

S tends to be large, resulting in a larger difference between E[Q] and P[V ≥ 1]. Thus, controlling FDR at level α is less conservative than controlling FWER at level α, since it allows us to make multiple Type I errors in expectation, so long as they don't account for too high a proportion of the total number of discoveries (hypothesis tests for which we reject the null). Notice that FDR requires us to compute probabilities under both the null and the alternative, so we'll now be concerned with the power functions of the component tests.

A motivating example

Let's motivate the discussion to follow with a real problem.^5 The following dataset has gene expression measurements for two samples: one of size n_1 from prostate cancer tissue, and one of size n_2 from healthy prostate tissue. The expression levels of p = 6033 genes were measured. The scientific question of interest is whether the genes are differentially expressed in cancer and healthy tissue. The corresponding null hypothesis for gene j is H_0j : μ_1j = μ_2j, where μ_1 is the true mean for the control group and μ_2 the true mean for the disease-state group.

^5 This dataset is analyzed at length in Efron's Large Scale Inference (2014).

Let x̄_1 be the sample mean for the control group and x̄_2 the mean for the cancer group. Then, assuming that the variances are all equal between the control and disease-state groups, the two-sample t statistic for each gene marginally is

\[
t_j = \frac{\bar x_{1j} - \bar x_{2j}}{s_j},
\]

where s_j is the estimate of the standard error under equal variances,

\[
s_j^2 = \frac{s_{1j} + s_{2j}}{n_1 + n_2 - 2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right),
\]

for

\[
s_{1j} = \sum_{i=1}^{n_1} (x_{ij} - \bar x_{1j})^2, \qquad
s_{2j} = \sum_{i=n_1+1}^{n_1+n_2} (x_{ij} - \bar x_{2j})^2.
\]

It will be convenient to transform the t-values to z-values, i.e.

\[
z_j = \Phi^{-1}\big( 2\, T_{n_1+n_2-2}(|t_j|) \big),
\]

stat 206: estimation and testing for a mean vector, part ii 4 where T (n1 +n 2 2) is the tail probability of the t distribution with n 1 + n 2 2 degrees of freedom, the distribution of the test statistic under the null. This allows us to discuss methods in generality, emphasizing that the basic principles apply regardless of the distribution of test statistic used. Let s load the data and compute z j for each coordinate. load('../../datasets-efron/prostatedata.rdata') ytmp <- as.numeric(colnames(prostatedata)) y <- '';y[ytmp==1] <- 'cancer';y[ytmp==2] <- 'control' X <- as.matrix(t(prostatedata)) colnames(x) <- paste('gene',seq(ncol(x)),sep='') rownames(x) <- paste('subject',seq(nrow(x)),sep='') X <- data.frame(x) Xs <- split(x,as.factor(y)) pvals <- mapply(t.test.p,xs[[1]],xs[[2]]) z.scores <- qnorm(pvals) z.scores <- data.frame(z.scores) names(z.scores) <- 'z' ggplot(z.scores,aes(x=z)) + geom_histogram(aes(y =..density..),bins=50) + geom_vline(xintercept=c(qnorm(.05/6033),qnorm(1-.05/6033))) + 0.2 stat_function(fun = dnorm, args = list(mean = 0, sd = 1), col='red') 0.1 As you can see, if we applied the Bonferroni bounds to control FWER, we would not have many discoveries (rejections of the null). In fact, there are only 2 out of the 6033 genes. bonf.p <- p.adjust(pvals,method='bonferroni') sum(bonf.p<.05) ## [1] 2 FDR control: the method of Benjamini and Hochberg Perhaps the earliest, and simplest, method for control of FDR was proposed by Benjamini and Hochberg (1995). We give the procedure here. For a proof that the procedure controls FDR at the specified level, see the original paper. Let p 1 ď p 2 ď... p p be the ordered p-values. Let k be the largest j for which density 0.4 0.3 0.0 5.0 2.5 0.0 2.5 z Figure 1: z scores for the prostate data, with the Bonferroni threshold at level 0.05 shown as vertical line and the theoretical distribution of the z score under the null overlaid in red p j ď j α, (1) p

and reject all H_(j), j = 1, …, k. If the test statistics are either independent or satisfy a positive-dependence condition (see Benjamini and Yekutieli (2001)), this procedure controls FDR at level α. Let's use the Benjamini-Hochberg procedure on the prostate cancer data:

bh.p <- p.adjust(pvals, method = 'BH')
sum(bh.p < .05)
## [1] 21

As expected, we have considerably more discoveries (rejections of the null): 21 instead of 2 at the 0.05 level. Benjamini and Yekutieli (2001) propose a second procedure that controls FDR at level α for any dependence structure of the test statistics. Let k be the largest j for which

\[
\tilde p_{(j)} \le \frac{j}{p \sum_{i=1}^{p} \frac{1}{i}}\, \alpha,
\]

and reject all H_(j), j = 1, …, k. Then FDR is controlled at level α. Let's apply this alternative FDR-controlling procedure to the prostate data:

by.p <- p.adjust(pvals, method = 'BY')
sum(by.p < .05)
## [1] 2

In this particular case, we've given up all of the additional discoveries we made using the Benjamini-Hochberg procedure. Benjamini and Yekutieli (2001) argue that in most applied problems, the positive-dependence condition for the original procedure is likely to hold. The specific condition needed is the following. Call a set D increasing if x ∈ D and y ≥ x (componentwise) imply y ∈ D.

Definition 1. A random vector X is PRDS (positive regression dependent on a subset) on an index set J_0 if, for any increasing set D and each j ∈ J_0, P[X ∈ D | X_j = x] is nondecreasing in x.

The condition for the procedure in (1) to control FDR at level α is then:

Theorem 1 (Benjamini and Yekutieli (2001)). Let X be the random vector of test statistics and J_0 = {j : H_0j is true}. If X is PRDS on J_0, then the Benjamini-Hochberg procedure in (1) controls the FDR at level less than or equal to (p_0/p)α.
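To make the two step-up rules concrete, here is a small hand-rolled implementation (my own sketch; reject.stepup is an invented name, not from the notes), checked against R's built-in p.adjust on simulated p-values:

```r
# Hand-rolled BH and BY step-up rules, checked against p.adjust().
# reject.stepup() returns the indices of hypotheses rejected at FDR level alpha.
reject.stepup <- function(pvals, alpha, dependence = FALSE) {
  p <- length(pvals)
  const <- if (dependence) sum(1 / seq_len(p)) else 1  # BY harmonic penalty
  ord <- order(pvals)
  thresh <- seq_len(p) * alpha / (p * const)   # j * alpha / (p * c(p))
  k <- max(c(0, which(pvals[ord] <= thresh)))  # largest j passing the bound
  if (k == 0) integer(0) else sort(ord[seq_len(k)])
}

set.seed(1)
p0 <- 900; p1 <- 100
pvals <- c(runif(p0),                   # true nulls: uniform p-values
           pnorm(-abs(rnorm(p1, 3))))  # non-nulls: shifted z-scores
bh <- reject.stepup(pvals, 0.05)
by <- reject.stepup(pvals, 0.05, dependence = TRUE)
# agreement with R's built-in adjustments
identical(bh, which(p.adjust(pvals, 'BH') <= 0.05))
identical(by, which(p.adjust(pvals, 'BY') <= 0.05))
```

The BY rejections are always a subset of the BH rejections, since the BY threshold is deflated by the harmonic sum Σ_{i=1}^p 1/i ≈ log p.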

Benjamini and Yekutieli (2001) provide numerous examples of common applications in which the condition in Theorem 1 holds. Because the condition is considered to be relatively weak, the procedure in (1) has become the default FDR-controlling procedure. However, keep in mind that when using the Benjamini-Hochberg procedure, you should at least check whether the specific application in which you are performing multiple hypothesis testing has previously been studied with respect to the PRDS condition in the literature, or evaluate the plausibility of the condition for your problem (which usually requires specifying a likelihood).

Other methods for controlling FDR

Numerous other methods for controlling FDR exist, though most require significant background in other areas of statistics to understand thoroughly. Among these are empirical Bayes and fully Bayes procedures that offer control of the local FDR, the probability that each individual test statistic corresponds to a false discovery. If interested, I recommend Efron's book Large Scale Inference (2014) for a comprehensive treatment of empirical Bayes approaches.

References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57(1), 289-300.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29(4), 1165-1188.

Efron, B. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press.