Biostatistics Advanced Methods in Biostatistics IV


Biostatistics 140.754 Advanced Methods in Biostatistics IV
Jeffrey Leek, Assistant Professor
Department of Biostatistics
jleek@jhsph.edu
Lecture 11

Tip + Paper Tip: Two today: (1) Graduate school is the best time of your academic life (with the possible exception of a postdoc). (2) It is worth having both steady, results-producing projects and at least one bigger gamble. Gambling gets harder and harder as you move up the academic ladder, which is why many big discoveries come early in people's careers. Paper of the Day: Nocturnality in Dinosaurs Inferred from Scleral Ring and Orbit Morphology http://www.sciencemag.org/content/332/6030/705.abstract

Today's Outline A new era? Multiple testing. Empirical Bayes. Bias-variance tradeoffs. For the last one, many slides borrowed/modified from Jonathan Taylor and Art Owen.

My academic heritage

A Little Wisdom From Great-Grandpa Efron "The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising? The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple (Is treatment A better than treatment B?), but the new methods were suited to the kinds of small data sets individual scientists might collect. The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind." -Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction

The New Paradigm in Biology

Other Examples Here is a mind-blowing paper: The Anatomy of a Large-Scale Hypertextual Web Search Engine http://ilpubs.stanford.edu:8090/361/1/1998-8.pdf Later they did this:

A Simple Motivating Problem Suppose you have observed an outcome $Y$ and covariates $X = [X_1, \ldots, X_m]$, where $Y = (y_1, \ldots, y_n)^T$ and $X_i = (x_{i1}, \ldots, x_{in})^T$. The goal is to find out which of the $X_i$ are associated with $Y$. If $m = 1, 2, 3$ and $n = 10, 20, 30$, you know what to do. What if $m = 1 \times 10^4$ and $n = 8$ (like in the Endotoxin study)? This is often called the "large p, small n" problem. We will talk about a few approaches to this problem: multiple testing, empirical Bayes, and regularization/shrinkage.

High-Dimensional Testing One approach to our simple motivating problem is to fit the $m$ regressions $Y = \beta_{0i} + \beta_{1i} X_i + \epsilon_i$ and calculate a p-value for each, $\{p_1, \ldots, p_m\}$. When we tested a single hypothesis, we fixed an $\alpha$-level or calculated a p-value. However, these error measures don't make sense when testing a large number of hypotheses. Example: if we test 10,000 hypotheses and fix the $\alpha$-level at 0.05, then even when every null is true we expect $10{,}000 \times 0.05 = 500$ false positives. Calling all p-values $\le 0.05$ significant will likewise lead to about 500 false positives.
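The 500-false-positives arithmetic is easy to check by simulation; a minimal sketch (not from the slides), using the fact that p-values are Uniform(0, 1) under the null:

```python
import numpy as np

rng = np.random.default_rng(0)

m, alpha = 10_000, 0.05

# Under the global null, every p-value is Uniform(0, 1)
p = rng.uniform(size=m)

# Calling every p-value <= 0.05 "significant" yields roughly m * alpha rejections,
# all of which are false positives here
false_positives = int(np.sum(p <= alpha))
print(false_positives)  # close to 10,000 * 0.05 = 500
```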

XKCD's take

The Family Wise Error Rate & Bonferroni Correction Instead of defining a per-test error rate, we can define an error rate over all of the tests, e.g.: Family wise error rate (FWER) $= P(\#\text{ of false positives} \ge 1)$. The most common (and first) method for controlling the FWER is the Bonferroni correction: if the rejection region for a single test is $S_\alpha = \{p : p \le \alpha\}$, then when $m$ tests are performed the rejection region is $S_\alpha^{bon} = \{p_i : p_i \le \alpha/m\}$.

The Bonferroni Correction Controls the FWER Suppose there are $m$ tests and the data for the first $m_0$ tests follow the null distribution. Then:
$$P(\#\text{ of false positives} \ge 1) = P\left(\bigcup_{i=1}^{m_0} \{p_i \le \alpha/m\}\right) \le \sum_{i=1}^{m_0} P(p_i \le \alpha/m) = \frac{m_0}{m}\,\alpha \le \alpha$$
Note that the smaller the proportion of true nulls $m_0/m$, the more conservative the cutoff is.

Bonferroni Adjusted P-values
$$p_i^{bon} = \inf\{\alpha : p_i \in S_\alpha^{bon}\} = \inf\{\alpha : p_i \le \alpha/m\} = \min\{m p_i, 1\}$$
The adjusted p-value is no longer uniform under the null, but it is attractive because of the interpretation that rejecting whenever $p_i^{bon} \le \alpha$ implies FWER $\le \alpha$.
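The adjusted p-value formula $\min\{m p_i, 1\}$ is a one-liner; a minimal sketch (not from the slides, and the function name is my own):

```python
import numpy as np

def bonferroni_adjust(p):
    """Bonferroni-adjusted p-values: p_i^bon = min(m * p_i, 1)."""
    p = np.asarray(p, dtype=float)
    return np.minimum(len(p) * p, 1.0)

p = np.array([0.001, 0.01, 0.04, 0.2])
p_adj = bonferroni_adjust(p)   # m * p, capped at 1
print(p_adj)
# Rejecting whenever p_adj <= alpha controls the FWER at level alpha
```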

Independent test statistics For independent test statistics we can be smarter:
$$\Pr(\text{any null } p_i \le \alpha/m) = 1 - \Pr(\text{all null } p_i > \alpha/m) = 1 - \prod_{i=1}^{m_0}\left(1 - \alpha/m\right) = 1 - (1 - \alpha/m)^{m_0} \approx 1 - (1 - \alpha/m)^m$$
The last approximation holds when $m \approx m_0$. We could use this to get a slightly less conservative threshold if we believe the tests are independent (they never are). But it's not worth it, because $1 - (1 - \alpha/m)^m \approx 1 - e^{-\alpha} \approx 1 - (1 - \alpha) = \alpha$.
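The claim that the independence correction "is not worth it" can be checked numerically; a small sketch (not from the slides) comparing the Bonferroni threshold $\alpha/m$ with the exact threshold under independence:

```python
import numpy as np

alpha, m = 0.05, 10_000

# Bonferroni per-test threshold
bonf = alpha / m
# Exact per-test threshold under independence (Sidak): 1 - (1 - alpha)^(1/m)
sidak = 1 - (1 - alpha) ** (1 / m)
print(bonf, sidak)  # nearly identical thresholds

# FWER actually attained by Bonferroni when all m independent tests are null:
# 1 - (1 - alpha/m)^m, approximately 1 - e^(-alpha), which is just below alpha
fwer = 1 - (1 - bonf) ** m
print(fwer)
```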

Bonferroni and dependence In the extreme case, all tests have almost the same $p_j$: if one is small, they're all small. So $\Pr(\text{any null } p_i \le \alpha/m) = \Pr(p_1 \le \alpha/m) = \alpha/m \le \alpha$, which is valid, but using the threshold $p_i \le \alpha$ would have been better. For positively dependent test statistics, increasing correlation gives more conservative results on average. But we can get catastrophic errors: suppose the $p_i$ are all identical for the null cases, and by chance $p_i \le \alpha/m$. How many errors?

A common application of Bonferroni "A genome-wide association study identifies three loci associated with susceptibility to uterine fibroids" For each of $1 \times 10^7$ SNPs with data $X_i$, fit the model: $\text{logit}(\Pr(Y_j = 1 \mid X_{ij})) = \beta_{0i} + \beta_{1i} X_{ij}$ http://www.nature.com/ng/journal/v43/n5/full/ng.805.html

Why use Bonferroni? Only a small number of the covariates should show significant association. We really don't want false positives.

A little known fact Bonferroni correction at level $k/m$ (i.e., rejecting when $p_i \le k/m$) gives $E[\#\text{false positives}] \le k$, regardless of any dependence between tests. Some people argue this is how we should interpret Bonferroni (see e.g. Gordon 2007). This way of interpreting high-throughput data is recommended; $k = 1$ is easy to think about (and explain). This is sometimes called the genomewise error rate-k, GWER$_k$ (see e.g. Chen and Storey 2006).

A 2 × 2 table of outcomes Back to our simple problem: fit $Y = \beta_{0i} + \beta_{1i} X_i + \epsilon_i$ and calculate a p-value for each, $\{p_1, \ldots, p_m\}$. Consider the following (hypothetical) table, for null hypotheses $H_{0i}$, $1 \le i \le m$:

                     Decide true   Decide false   Total
  $H_{0i}$ true          U              V          $m_0$
  $H_{0i}$ false         T              S          $m - m_0$
  Total               $m - R$           R          $m$

Classic Bonferroni limits $\Pr(V \ge 1)$; any $V \ge 1$ is equally bad.

False Discovery Rate A less conservative measure of (hypothetical) embarrassment:
$$\frac{V}{R \vee 1} = \frac{\#\text{false positives}}{\#\text{declared positives}}$$
This is the realized False Discovery Rate. The badness of each Type I error depends on $R$. The $R \vee 1$ (i.e., $\max(R, 1)$) stops 0/0, setting the embarrassment to 0 when $R = 0$. For a given decision rule, define its FDR $= E\left[\frac{V}{R \vee 1}\right]$.

The False Discovery Rate & BH Correction Benjamini and Hochberg (1995) defined a set of rules which control the FDR, for independent tests:
1. Calculate and order the p-values $p_{(1)} \le \cdots \le p_{(m)}$
2. Find the largest $i$ such that $p_{(i)} \le \alpha i / m$
3. Decide "false" for all tests with $p_i$ at or below this threshold, and "true" otherwise.
This set of decisions will have FDR $= E[V/(R \vee 1)] \le (m_0/m)\,\alpha$, for any $m_0, m$.
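The three steps above can be sketched in a few lines; a minimal implementation (not from the slides; the function name and example p-values are my own):

```python
import numpy as np

def bh_reject(p, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the hypotheses with the k smallest
    p-values, where k = max{ i : p_(i) <= alpha * i / m }."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m       # alpha * i / m for i = 1..m
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest i with p_(i) <= alpha*i/m
        reject[order[: k + 1]] = True              # reject everything up to p_(k)
    return reject

p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9])
rejected = bh_reject(p, alpha=0.05)
print(rejected)  # only the two smallest p-values survive here
```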


The False Discovery Rate & BH Correction Note that the FDR is an expectation; we never know the truth. We could estimate $m_0$ to get a less conservative FDR-controlling procedure. Storey's algorithm starts by estimating $\hat\pi_0 = \hat m_0 / m$, then:
1. Calculate and order the p-values $p_{(1)} \le \cdots \le p_{(m)}$
2. Find the largest $i$ such that $p_{(i)} \le \alpha i / (\hat\pi_0 m)$

Estimating $\hat\pi_0$

For a fixed cutoff $t$, we can also estimate the FDR:

                     Decide true   Decide false   Total
  $H_{0i}$ true        $U(t)$         $V(t)$       $m_0$
  $H_{0i}$ false       $T(t)$         $S(t)$       $m - m_0$
  Total              $m - R(t)$       $R(t)$       $m$

$$\text{FDR}(t) = E\left[\frac{V(t)}{R(t) \vee 1}\right] \approx \frac{E[V(t)]}{E[R(t)]}$$
We can estimate $E[R(t)]$ with the plug-in estimator $R(t) = \#\{p_i \le t\}$. To estimate the numerator, note that with uniform null p-values $E[V(t)] = \sum_{i=1}^{m_0} \Pr(p_i \le t) = m_0 t$, and we can calculate the plug-in estimator $m \hat\pi_0 t$. This gives the FDR estimate:
$$\widehat{\text{FDR}}(t) = \frac{m \hat\pi_0 t}{\#\{p_i \le t\}}$$
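The plug-in estimate can be sketched directly. Note the specific form of $\hat\pi_0$ below (p-values above a tuning point $\lambda$ are treated as mostly null) is one common version of Storey's estimator and is my assumption, not spelled out on the slide; the simulated mixture is also illustrative:

```python
import numpy as np

def pi0_storey(p, lam=0.5):
    """Estimate pi0 = m0/m: null p-values are uniform, so
    #{p > lam} is approximately m0 * (1 - lam)."""
    p = np.asarray(p, dtype=float)
    return min(1.0, np.mean(p > lam) / (1 - lam))

def fdr_hat(p, t, lam=0.5):
    """Plug-in estimate: FDR(t) ~ m * pi0_hat * t / #{p_i <= t}."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    r = max(int(np.sum(p <= t)), 1)    # R(t) v 1
    return min(1.0, m * pi0_storey(p, lam) * t / r)

rng = np.random.default_rng(1)
# 9,000 null tests (uniform p-values) mixed with 1,000 signals (small p-values)
p = np.concatenate([rng.uniform(size=9000), rng.beta(1, 50, size=1000)])
pi0 = pi0_storey(p)
est = fdr_hat(p, t=0.01)
print(pi0, est)  # pi0 near the true 0.9; estimated FDR at cutoff t = 0.01
```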

The Q-value, the multiple testing analog of the P-value The q-value is the FDR analog of the p-value. Remember the p-value was the smallest $\alpha$-level at which we could call a test significant: $p_i = \inf\{\alpha : X_i \in S_\alpha\}$. A q-value is the smallest FDR at which we can call a test significant: $q_i = \inf\{\text{FDR}(S) : X_i \in S\}$. An estimate is $\hat q(p_i) = \min_{t \ge p_i} \widehat{\text{FDR}}(t)$. Or, when we estimate $\hat\pi_0 = 1$: $\hat q(p_{(i)}) = p_{(i)}\, m/i$. An important property of the q-value is that if we call all tests with $\hat q_i \le \alpha$ significant, then the FDR is controlled at level $\alpha$.
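With $\hat\pi_0 = 1$, the estimate reduces to $p_{(i)} m/i$ followed by the $\min_{t \ge p_i}$ from the definition, which is a running minimum over the sorted p-values; a minimal sketch (not from the slides):

```python
import numpy as np

def qvalues(p):
    """q-values with pi0_hat = 1: start from FDR-hat(p_(i)) = p_(i) * m / i,
    then take the running minimum from the right, so q(p_i) = min_{t >= p_i} FDR-hat(t)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    raw = p[order] * m / np.arange(1, m + 1)           # FDR-hat at each sorted p-value
    q_sorted = np.minimum.accumulate(raw[::-1])[::-1]  # running minimum from the right
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)               # back to the original order
    return q

p = np.array([0.01, 0.002, 0.03, 0.5])
q = qvalues(p)
print(q)
```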

The FDR is Used A Lot Examples abound! Brains, genomes, and networks, oh my!

What happens when the tests are dependent (like they always are)? It can be shown that under quite general conditions the FDR is still controlled; in other words, $E[\widehat{\text{FDR}}] \ge \text{FDR}$. But the estimate can become wildly variable (see Owen 2005, JRSS-B, "Variance of the number of false discoveries"). There are two main approaches to addressing dependence for multiple testing: p-value/threshold modification, or surrogate variable analysis.

P-value corrections See e.g. Devlin and Roeder, Biometrics (1999); Efron, JASA (2004); Schwartzman, AOAS (2008); Benjamini and Yekutieli, Annals of Statistics (2001); Romano, Shaikh, and Wolf, Test (2001)...

Surrogate variable analysis See Leek and Storey, PLoS Genetics (2007); Leek and Storey, PNAS (2008)

FDR: notes FDR was ground-breaking stuff. $\widehat{\text{FDR}}$ is commonly misinterpreted as the true FDR, rather than an estimate of the FDR. It also upsets those for whom classic Bonferroni is the One True Path. See Horen (2007, JAMA) and a pissy reply by Smith et al.

Empirical Bayes and the Two Groups Model Back to our simple problem: fit $Y = \beta_{0i} + \beta_{1i} X_i + \epsilon_i$. Consider the case where our evidence about each $H_{0i}$ is summarized by a statistic $Z_i$, where under $H_{0i}$, $Z_i \sim N(0, 1)$. As an example, consider a one-sample t-test statistic $T$ based on 20 observations, where the data are normally distributed. Then we could calculate $Z_j = \Phi^{-1}(F_{t_{19}}(T))$.

Empirical Bayes and the Two Groups Model The FDR calculates a tail measure, where we look at all statistics as-or-more extreme than the one we observed. What if we want to say something about the exact statistic we observed? How interesting is it? Eventually we will say that a set of $\{H_{0i}\}$ are true/false. We could try to assess directly whether $Z_i \sim N(0, 1)$ or not. "N(0,1) or not" demands we define an alternative for $Z_i$:
$$Z_i \sim \begin{cases} F_0 & \text{with probability } p_0 \\ F_1 & \text{with probability } 1 - p_0 \end{cases}$$

The local false discovery rate (a simple posterior probability) Under the two groups model the $Z_i$ have density $f(z) = p_0 f_0(z) + (1 - p_0) f_1(z)$. We can define
$$\text{fdr}(z) = \frac{p_0 f_0(z)}{p_0 f_0(z) + (1 - p_0) f_1(z)}$$
which is called the local false discovery rate by Efron in papers from 2004 and 2007.

The local false discovery rate - a frequentist interpretation Replicate in similar populations (or run the universe again). Take all $Z_j$ near $z$ and declare them to be from $F_1$, i.e. not null. Then $\text{fdr}(z)$ defines the long-run proportion of times we make a false discovery. "Local" means we only look at $Z$ near $z$.

The local false discovery rate - example A 50:50 mix of $N(0, 1)$ and $N(0, 4^2)$, in a picture: We get
$$\text{fdr}(z) = \frac{1}{1 + \frac{1}{4}\, e^{\frac{z^2}{2}\left(1 - 4^{-2}\right)}}$$
so e.g. $z = 1.72$ gives $\text{fdr}(z) = 0.5$.
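For this 50:50 mixture, the local fdr can be computed directly from the two normal densities; a quick sketch (not from the slides) that checks the $z = 1.72$ claim:

```python
import math

def local_fdr(z, p0=0.5, sigma0=1.0, sigma1=4.0):
    """fdr(z) = p0 f0(z) / (p0 f0(z) + (1 - p0) f1(z))
    for the 50:50 N(0,1) vs N(0,16) mixture in the example."""
    f0 = math.exp(-z**2 / (2 * sigma0**2)) / (sigma0 * math.sqrt(2 * math.pi))
    f1 = math.exp(-z**2 / (2 * sigma1**2)) / (sigma1 * math.sqrt(2 * math.pi))
    return p0 * f0 / (p0 * f0 + (1 - p0) * f1)

print(local_fdr(1.72))  # ~0.5: a z-score of 1.72 is a coin flip in this mixture
print(local_fdr(4.0))   # much smaller: almost surely from the wide component
```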

fdr: Efron's approach We have many $Z_j$, say from our 10,000 regressions. We don't know $p_0$, $F_0$, $F_1$, or $F$, so:
1. Approximate $f(z)$ by kernel density estimation (non-parametric)
2. Fit the biggest $p_0 F_0 = p_0 N(\mu, \sigma^2)$ you can under $\hat f(\cdot)$
3. Subtract: $\hat f(z) - \hat p_0 f_0(z; \hat\mu, \hat\sigma^2)$ defines $(1 - \hat p_0)\hat f_1(z)$
4. Plug in $\hat f_0$, $\hat f_1$, $\hat p_0$ to estimate $\text{fdr}(z_j)$
Report $\widehat{\text{fdr}}(z_j)$, or decide "false" where $\widehat{\text{fdr}}(z_j) < \alpha$.

fdr: Efron's approach There are several tuning parameters to choose: the smoothing, and finding the biggest Normal under $\hat f(\cdot)$. It probably works okay for some problems, where all the $Z_j$ are independent and the null $Z_j$ really are from a Normal. These are fairly heroic assumptions! Don't use this on your applied exam! Too much flexibility in $f_1$ means $\text{fdr}(z)$ may not be unimodal... which could be embarrassing. There are alternatives, e.g. calculating bootstrap null statistics to estimate $f_0$ (see Storey et al., PLoS Biology 2005).

Regularized Regression Back to our simple problem, where we have data $Y = (y_1, \ldots, y_n)^T$ and $X_i = (x_{i1}, \ldots, x_{in})^T$. We may want to fit the model:
$$Y = \beta_0 + \sum_{i=1}^m \beta_{1i} X_i + \epsilon$$
We could just solve the estimating equation that minimizes the least squares criterion:
$$\hat\beta^{ls} : \text{argmin}_\beta \sum_{j=1}^n \left(Y_j - \beta_0 - \sum_{i=1}^m \beta_{1i} X_{ij}\right)^2$$
This has some nice properties (e.g. Gauss-Markov; under normality, it is the ML estimate). But when $m > n$, we have some singularity problems. Another criterion we could use is how well the estimates
$$\hat f(X) = \hat\beta_0 + \sum_{i=1}^m \hat\beta_{1i} X_i$$
predict new observations.

Mean-squared error To answer this question we could consider the mean squared error of our estimate $\hat\beta$:
$$\text{MSE}(\hat\beta) = E[\|\hat\beta - \beta\|^2] = E[(\hat\beta - \beta)^T(\hat\beta - \beta)]$$
Instead we could look at the MSE for a new observation:
$$\text{MSE}_{pop}(\hat\beta) = E\left[\left(Y^{new} - \hat\beta_0 - \sum_{i=1}^m \hat\beta_{1i} X_i^{new}\right)^2\right] = \sigma^2_\epsilon + \text{Bias}\left(\hat\beta_0 + \sum_{i=1}^m \hat\beta_{1i} X_i^{new}\right)^2 + \text{Var}\left(\hat\beta_0 + \sum_{i=1}^m \hat\beta_{1i} X_i^{new}\right)$$

Bias-variance tradeoff
$$\text{MSE}_{pop}(\hat\beta) = \sigma^2_\epsilon + \text{Bias}\left(\hat\beta_0 + \sum_{i=1}^m \hat\beta_{1i} X_i^{new}\right)^2 + \text{Var}\left(\hat\beta_0 + \sum_{i=1}^m \hat\beta_{1i} X_i^{new}\right)$$
Such a decomposition is known as a bias-variance tradeoff. As the model becomes more complex (more terms included), local structure/curvature can be picked up. But coefficient estimates suffer from high variance as more terms are included in the model. So introducing a little bias in our estimate for $\beta$ might lead to a substantial decrease in variance, and hence to a substantial decrease in MSE.
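The tradeoff can be illustrated with ridge regression, which deliberately introduces bias to reduce variance. A small simulation sketch (not from the slides; the design, penalty value, and coefficients are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

n, m = 30, 20           # modest n with many predictors: OLS is very noisy here
beta = np.full(m, 0.1)  # weak, diffuse true signal
sigma = 1.0

def coef_mse(lam, reps=200):
    """Average ||beta_hat - beta||^2 for ridge with penalty lam (lam = 0 is OLS)."""
    mses = []
    for _ in range(reps):
        X = rng.normal(size=(n, m))
        y = X @ beta + sigma * rng.normal(size=n)
        # Ridge estimate (X'X + lam*I)^{-1} X'y: a little bias, much less variance
        b = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
        mses.append(np.sum((b - beta) ** 2))
    return float(np.mean(mses))

mse_ols = coef_mse(0.0)
mse_ridge = coef_mse(10.0)
print(mse_ols, mse_ridge)  # the biased ridge estimator wins on MSE in this setting
```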

Bias-variance tradeoff