Biometrika (2011), 98, 4, pp. © 2011 Biometrika Trust
Printed in Great Britain
doi: /biomet/asr057

Miscellanea

False discovery rate for scanning statistics

BY D. O. SIEGMUND, N. R. ZHANG
Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California, U.S.A.
siegmund@stanford.edu nzhang@stanford.edu

AND B. YAKIR
Department of Statistics, The Hebrew University of Jerusalem, Jerusalem 91905, Israel
msby@mscc.huji.ac.il

SUMMARY

The false discovery rate is a criterion for controlling Type I error in simultaneous testing of multiple hypotheses. For scanning statistics, due to local dependence, clusters of neighbouring hypotheses are likely to be rejected together. In such situations, it is more intuitive and informative to group neighbouring rejections together and count them as a single discovery, with the false discovery rate defined as the proportion of clusters that are falsely declared among all declared clusters. Assuming that the number of false discoveries, under this broader definition of a discovery, is approximately Poisson and independent of the number of true discoveries, we examine approaches for estimating and controlling the false discovery rate, and provide examples from biological applications.

Some key words: False discovery rate; Multiple comparisons; Poisson approximation; Scan statistic.

1. INTRODUCTION

In a pioneering paper, Benjamini & Hochberg (1995) initiated a fruitful line of research into the false discovery rate as a method to evaluate Type I error when simultaneously testing large numbers of hypotheses. We use their notation, so $R$ is the number of discoveries that emerge as a result of a particular statistical procedure, and $V$ is the number of false discoveries among them. Then $S = R - V$ is the number of true discoveries. The false discovery rate is the expected relative proportion of false discoveries, $\mathrm{FDR} = E(V/R;\, R > 0)$.
These quantities are defined implicitly in terms of the specific procedure that is used to make discoveries.

We are concerned with estimation and control of false discovery rates when there is substantial local correlation among the statistics used for testing the hypotheses. Due to local correlation, large values of the statistic tend to occur in clumps, and multiple rejections within a clump may constitute only a single discovery, as it relates to model identification. Yet a possibly large number of correct rejections at some locations can inflate the denominator in the definition of false discovery rate, hence artificially creating a small false discovery rate, and lowering the barrier to possibly false detections at distant locations. Scanning statistics to detect sparsely distributed signals provide typical examples.

In the examples that follow, there is an underlying set of observations $y_t$, where $t$ varies over an indexing set having some geometric structure. The $y_t$ are often assumed to be independent, but this is not necessary, provided the dependence between them is local with respect to the geometric structure. The test statistics $\{Z_t : t \in D\}$, where $Z_t$ is a function of $t$ and of the $y_s$ for $s \in N_t$, an appropriate neighbourhood of $t$, are related by a measure of distance within the scanning index set $D$. Hence, values of $Z_t$ and $Z_s$ for nearby $t$ and $s$ in $D$

are correlated, so a large value at a specific $\tau \in D$ causes a cluster of large values at $t$ close to $\tau$. Thus, a group of large values of $Z_t$ within close proximity are often associated with a single signal.

Example 1. The random fields used to detect local activity in an fMRI scan, as discussed in a series of papers by Worsley, for example, Worsley et al. (1992) or Siegmund & Worsley (1995).

Example 2. Massively parallel paired end DNA re-sequencing used to detect structural variation in genomic sequences. For a review, see Medvedev et al. (2009). The data come in the form of distances $y_t$ between mapped positions of relatively short paired reads from the ends of DNA sequences of approximately $w$ base pairs in length, with the leftmost read mapped to position $x_t$ in the genome. Where there are no structural variations one observes, after subtracting $w$ and standardizing, $y_t$ that are independent and have a standard normal distribution. For read pairs that straddle the breakpoint $\tau$ of a structural variation, the distribution of some percentage of the $y_t$, for $\tau < x_t \le \tau + w$, is shifted by an unknown amount $\delta$, which is related to the size of the variant. The score statistic with respect to $\delta$ to test for a breakpoint at $\tau$ is $\sum_{t=\tau+1}^{\tau+w} y_t / w^{1/2}$. A scan is conducted with $\tau$ varying over the genomic region of interest to find putative breakpoint locations.

Example 3. The scan statistics of variable window width used in Zhang et al. (2010) and Siegmund et al. (2011) to detect common regions of copy number variation in a set of subjects. An appropriate likelihood based statistic for the special case of a single sequence (Olshen et al., 2004) is similar to that of the preceding example, but the window width is unknown, so the scan involves two-dimensional maximization with respect to $\tau$ and $w$.

Example 4.
A genome scan to detect either linkage or association between a phenotype and related genetic variation, e.g., Lander & Botstein (1989), Siegmund & Yakir (2007). The main results of this paper are methods for estimating and controlling the false discovery rate of a given procedure, with discovery defined in the sense of detection of sparse, local signals. In order to focus on the conceptual aspect of how one defines a discovery, our assumptions are given in general, abstract terms, and we avoid, except for a few comments, the necessarily technical and application specific discussion of methods to ensure and test the validity of those assumptions. Motivated by a different genomic application, Zhang (2008) contains a similar approach without any theoretical analysis. In an unpublished 2011 manuscript in the Harvard University Biostatistics Working Paper Series A. Schwartzman, Y. Gavrilov and R. J. Adler discuss a different approach to the same general issue under specific technical assumptions motivated by and apparently limited to a one-dimensional process having the structure of Example 1. Our two central assumptions are (i) the distribution of the number of false discoveries, V, is Poisson, with expected value λ; and (ii) the number of false discoveries is independent of the number of true discoveries, S. The number of true discoveries must be nonnegative, but otherwise may follow any distribution. The Poisson assumption is valid asymptotically in a variety of applications. These include the examples given above under the assumptions made in the cited references, or more generally if one makes suitable adjustments when there are local dependencies in the underlying observations. See Aldous (1988) for numerous examples, Lindgren et al. (1983) for relevant general theorems under different sets of conditions, and Arratia et al. (1989) for a flexible Poisson approximation theorem that applies quite generally to processes involving local dependence. 
Methods for determining $\lambda$ depend on the specific problem. Illustrative examples based on more explicit assumptions about the underlying process are discussed below.

The assumption of independence between $V$ and $S$ is more subtle. If we treat the locations of the signals in $D$ as fixed but unknown quantities, then $D$ can be partitioned into disjoint sets $D = D_0 \cup D_1$, where $D_0$ are hypotheses that, if rejected, would be considered part of a false discovery, and $D_1$ are hypotheses that, if rejected, would be considered part of a true discovery. For example, in the simple case of a one-dimensional scan with fixed window size $w$ as in Example 2, suppose the true signals are a set of intervals $I$ within the scan region. If we count those windows that overlap with any interval in $I$ towards true discoveries, and the rest towards false discoveries, then $D_0 = \{t : (t, t + w] \cap \iota = \emptyset \ \text{for all}\ \iota \in I\}$ and $D_1 = D \setminus D_0$. Then, $V$

would be a function solely of $\{Z_t : t \in D_0\}$, while $S$ would be a function solely of $\{Z_t : t \in D_1\}$. At least when the true signals are sparse, approximate independence between $V$ and $S$ would follow if long-range dependencies between $\{Z_t : t \in D_0\}$ and $\{Z_t : t \in D_1\}$ are negligible. In practice, near overlap of detected signals is a danger sign regarding possible violation of this hypothesis of independence. For a more detailed discussion, see §3.1.

Significant long-range dependence among the $\{Z_t\}$ may cause nonnegligible dependence between $V$ and $S$. Scanning procedures that are based on a collection of localized tests are inherently designed for problems where dependence can be assumed to be local, since if long-range independence does not hold, then procedures that account for that dependence would be preferred on the basis of greater power.

The estimator that we propose for the false discovery rate is

$\widehat{\mathrm{FDR}} = \lambda/(R + 1)$,    (1)

where $\lambda$ is the expected number of false discoveries and $R$ is the total number of discoveries. This estimator has been considered by Efron (2010) in the framework of hypothesis testing with a large number of independent hypotheses and, except for the constant 1 in the denominator, is the same as that suggested by Zhang (2008). In some cases, the parameter $\lambda$ can be derived analytically. In other cases, it can be computed via permutations or simulations conducted under the null assumption. In §2, we show that the estimator (1) is unbiased under assumptions (i) and (ii).

Our method for controlling the rate of false discoveries is closely associated with the procedure proposed by Benjamini & Hochberg (1995) for ordered p-values. We in effect replace their assumption regarding the relations between p-values of individual hypotheses by the assumption that an appropriately indexed family of false discoveries is a Poisson process.

2.
ESTIMATING AND CONTROLLING THE FALSE DISCOVERY RATE

Let $V \sim \mathrm{Po}(\lambda)$ be the number of false discoveries and let $S \ge 0$ be the number of true discoveries. Assume $S$ is a nonnegative random variable independent of $V$. The total number of discoveries is $R = V + S$. Consider the ratio $V/R$, which is defined to be 0 if $R = 0$, and compare it to the estimator $\lambda/(R + 1)$.

THEOREM 1. Under assumptions (i) and (ii), $E(V/R;\, R > 0) = E\{\lambda/(R + 1)\}$.

Proof. For fixed $s \ge 0$, let $F_s(x) = (x + s)^{-1} I(x + s > 0)$, with the understanding that $F_0(0) = 0$. After writing expectations as infinite series, algebraic manipulations show that

$E\{V F_s(V)\} = \lambda E\{F_s(V + 1)\}$.    (2)

The result follows by taking expectations with respect to the distribution of $S$.

Hence, $\widehat{\mathrm{FDR}}$ defined in (1) is an unbiased estimator of the false discovery rate.

Remark. Equation (2) has been applied elsewhere. In particular, it is the basis for the Chen (1975) method of Poisson approximation.

Now suppose that false detections are a Poisson process $V_\lambda$ of rate 1, defined on the interval $[0, \bar\lambda]$. We assume also that the process $R_\lambda = V_\lambda + S_\lambda$ is nondecreasing and that the processes $V_\lambda$ and $S_\lambda$ are independent. Define the backwards stopping time $T = \max\{\lambda \le \bar\lambda : R_\lambda \ge \lambda/\alpha\}$. This is a function of the observed process $R_\lambda$, and thereby it is a function of the Poisson process $V_\lambda$ and the independent process $S_\lambda$, both unobserved. The extreme case $T = 0$ corresponds to the case where $R_T$ is equal to zero, and the ratio $V_T/R_T$ is then defined to be equal to zero as well.

Consider the procedure whereby the stopping time $T$ is evaluated and $R_T$ is reported as the number of discoveries. In Theorem 2, we prove that the expected proportion of false discoveries, $E(V_T/R_T)$, is bounded by $\alpha$. The proof is a version of the argument given by Storey et al. (2004).
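As a numerical sanity check on Theorem 1, the identity (2) can be evaluated directly for a fixed number of true discoveries $s$ by summing the Poisson series. The sketch below is only an illustration: the function names and the truncation point `kmax` are assumptions introduced here, not part of the paper. Because the identity holds for every fixed $s$, it also holds after averaging over any distribution of $S$ that is independent of $V$.

```python
import math


def poisson_pmf(k, lam):
    """P(V = k) for V ~ Poisson(lam), computed on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def true_fdr(lam, s, kmax=200):
    """E(V/R; R > 0) with R = V + s, for a fixed number s of true discoveries.

    kmax is a hypothetical truncation point for the infinite series.
    """
    total = 0.0
    for v in range(kmax):
        r = v + s
        if r > 0:
            total += (v / r) * poisson_pmf(v, lam)
    return total


def expected_estimator(lam, s, kmax=200):
    """E{lam / (R + 1)} with R = V + s, i.e. the mean of the estimator (1)."""
    return sum(lam / (v + s + 1) * poisson_pmf(v, lam) for v in range(kmax))


if __name__ == "__main__":
    for lam in (0.5, 2.0, 5.0):
        for s in (0, 3, 20):
            print(lam, s, true_fdr(lam, s), expected_estimator(lam, s))
```

The two columns printed for each pair of values should agree up to floating-point and truncation error, which is what unbiasedness of (1) asserts in this fixed-$s$ setting.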

THEOREM 2. Under the given conditions and for the procedure associated with the stopping time $T$, $E(V_T/R_T) \le \alpha$.

Proof. Consider the process $V_\lambda/\lambda$ and notice that it is a mean one backwards martingale with respect to the filtration $\mathcal{F}_\lambda = \sigma(V_t, S_t : \lambda \le t \le \bar\lambda)$. The stopping time $T$ is measurable with respect to this filtration. It follows that $E\{V_{T \vee \lambda}/(T \vee \lambda)\} = E(V_{\bar\lambda}/\bar\lambda) = 1$, for any $\lambda > 0$. Let $\lambda \to 0$ and observe that $1(T < \lambda) V_\lambda/\lambda$ converges to 0 and is bounded by $1/\alpha$. Hence, by the dominated convergence theorem, we see that $E(V_T/T;\, T > 0) = E(V_{\bar\lambda}/\bar\lambda) = 1$.

Consider the proportion $V_T/R_T$ of false detections of the proposed procedure. Since this proportion is defined to be equal to zero when $T = 0$, $E(V_T/R_T) = E(V_T/R_T;\, T > 0)$. Dividing and multiplying by $T$, we get

$E(V_T/R_T;\, T > 0) = E\{(T/R_T)(V_T/T);\, T > 0\} \le \alpha E\{V_T/T;\, T > 0\} = \alpha$,

where the inequality follows from the fact that when $T > 0$ the definition of the stopping time implies that $T/R_T \le \alpha$. The conclusion follows.

3. EXAMPLES

3.1. Fixed-width sliding window scan

Consider a fixed window scan statistic. Suppose $Y_1, \ldots, Y_m$ are independent and normally distributed random variables with unit variance. Under a global null hypothesis they are standard normal. Under the alternative there are intervals of known length $w$, and unknown positive integers $\tau$ such that $Y_{\tau+1}, \ldots, Y_{\tau+w}$ have mean $\mu_\tau > 0$. The values of $\mu_\tau$ and the number of such intervals are unknown, although we assume that the total width of all intervals is small relative to the sample size $m$. This situation corresponds roughly to Example 2 in §1, although, to facilitate our simulations, the numerical values of the parameters we use below are smaller than would be typical for this application.

Let $Z_t = \left(\sum_{i=t+1}^{t+w} Y_i\right)/w^{1/2}$. The behaviour of $Z_t$ as a process under the global null hypothesis that all discoveries are false is easily inferred from known results.
Specifically, an asymptotic approximation to $p = \mathrm{pr}(\max_{0 \le t \le m - w} Z_t > z)$ is given, for a two-sided alternative, in display (5.3) of Siegmund & Yakir (2007, p. 112), with parameters $C = 1$, $\Delta = 1$, $L = m - w$ and $\beta = 1/w$. For large enough thresholds $z$, the probability that $Z_t$ exceeds $z$ is small, and the number of clumps of $Z_t$ that exceed $z$ is approximately Poisson distributed with mean

$\lambda_0 = -\log(1 - p) = m z w^{-1} \phi(z)\, \nu\{z (2/w)^{1/2}\}$,    (3)

where $\phi$ denotes the standard normal probability density function and $\nu$ is a special function associated with the overshoot of a stopped random walk (cf. Siegmund & Yakir, 2007, p. 112).

Although there is no unique definition of a clump, there should usually be little difficulty in recognizing one in practice. Roughly speaking, it is a set of values of $t$ that are relatively close together, where $Z_t \ge z$. Except when different true discoveries are themselves close together, different clumps are distinguished by relatively long gaps where $Z_t$ remains below the level $z$. If all clumps were false positives and $z \gg 0$, then the size of a clump would be stochastically bounded, while the expected distances between clumps would be approximately $1/\lambda_0$, and hence grow faster than exponentially in $z$.

The independence of the $Y_t$ makes $Z_s$ and $Z_t$ independent as long as $|s - t| > w$. Clumps of false positives should be short and approximately uniformly distributed across the search interval. Hence, unless the true signals occur very frequently, the probability of a false positive occurring close to a true signal is small, so the independence of $V$ and $S$ would be approximately satisfied. The same would be true of the variable window scans of Example 3 provided the maximum window size is much smaller than the number of observations. See Siegmund et al. (2011) for a discussion of the data normalization used to validate the normality and independence assumptions needed by Example 3.

Some simulated results are presented in Tables 1 and 2.
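To make approximation (3) concrete, the following sketch evaluates $\lambda_0$ as a function of the threshold $z$ and inverts it by bisection to find the threshold for a nominal $\lambda_0$. Two things here are assumptions rather than statements from the text: the closed-form expression used for $\nu$, which is a commonly used approximation and should be checked against Siegmund & Yakir (2007) before serious use, and the value of `m` in the demonstration, which is purely hypothetical. All function names are likewise illustrative.

```python
import math


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def nu(x):
    """Approximate overshoot function nu(x) (assumed closed form, see lead-in)."""
    if x < 1e-8:
        return 1.0  # limiting value as x -> 0
    half = 0.5 * x
    numerator = (2.0 / x) * (std_normal_cdf(half) - 0.5)
    denominator = half * std_normal_cdf(half) + std_normal_pdf(half)
    return numerator / denominator


def lambda0(z, m, w):
    """Right-hand side of (3): approximate expected number of null clumps."""
    return m * z / w * std_normal_pdf(z) * nu(z * math.sqrt(2.0 / w))


def threshold_for(target, m, w, lo=1.0, hi=10.0):
    """Bisection for z in [lo, hi] with lambda0(z) = target.

    Relies on lambda0 being decreasing in z for z > 1.
    """
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if lambda0(mid, m, w) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    m, w = 10000, 50  # m is a hypothetical value chosen for illustration
    for target in (5, 3, 2, 1):
        z = threshold_for(target, m, w)
        print(target, round(z, 3), round(lambda0(z, m, w), 6))
```

Inverting (3) numerically in this way is one simple route to thresholds with prescribed nominal values of $\lambda_0$, such as those used in Table 1.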
For the simulations we took m = and $w = 50$. A total of 21 intervals of length $w$, scattered about the sequence, were simulated from the alternative distribution with mean values $\mu_\tau$ that range between $6/w^{1/2}$ and $2/w^{1/2}$ in steps of size $0.2/w^{1/2}$.
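The simulation design just described can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the sequence length, signal positions and signal strength in the demonstration are hypothetical, and since the text notes that a clump has no unique definition, the rule used here (exceedances separated by at most `max_gap` positions belong to one clump) is one possible choice. A clump is counted as a true discovery when some window position within its span overlaps a signal interval, following the partition into $D_0$ and $D_1$.

```python
import math
import random


def scan_statistic(y, w):
    """Z_t = (Y_{t+1} + ... + Y_{t+w}) / sqrt(w) for t = 0, ..., m - w."""
    prefix = [0.0]
    for value in y:
        prefix.append(prefix[-1] + value)
    root = math.sqrt(w)
    return [(prefix[t + w] - prefix[t]) / root for t in range(len(y) - w + 1)]


def find_clumps(z_values, threshold, max_gap):
    """Group exceedances Z_t >= threshold into clumps (assumed merging rule)."""
    clumps = []
    for t, z in enumerate(z_values):
        if z < threshold:
            continue
        if clumps and t - clumps[-1][1] <= max_gap:
            clumps[-1][1] = t  # extend the current clump
        else:
            clumps.append([t, t])  # start a new clump
    return [tuple(c) for c in clumps]


def classify(clumps, signal_starts, w):
    """Return (V, S): counts of false and true clumps.

    Window (t, t+w] overlaps signal (tau, tau+w] exactly when |t - tau| < w.
    """
    v = s = 0
    for start, end in clumps:
        if any(start - w < tau < end + w for tau in signal_starts):
            s += 1
        else:
            v += 1
    return v, s


if __name__ == "__main__":
    random.seed(0)
    m, w = 2000, 50              # hypothetical sequence length
    signal_starts = [500, 1200]  # hypothetical signal positions tau
    mu = 4.0 / math.sqrt(w)      # hypothetical signal mean
    y = [random.gauss(0.0, 1.0) for _ in range(m)]
    for tau in signal_starts:
        for i in range(tau, tau + w):
            y[i] += mu
    clumps = find_clumps(scan_statistic(y, w), threshold=3.0, max_gap=w)
    print(clumps, classify(clumps, signal_starts, w))
```

Repeating the demonstration over many seeds and averaging $V/R$ and $\lambda_0/(R + 1)$ would give Monte Carlo analogues of the quantities that Table 1 reports.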

Table 1. Simulated values of false discovery rate and $E\{\lambda_0/(R + 1)\}$, based on 400 repetitions with $w = 50$, m = . Nominal values of $\lambda_0$ are 5, 3, 2 and 1, respectively. There are 21 possible discoveries, with noncentrality parameters ranging from 6 to 2 in steps of size 0.2.

z | FDR | $E\{\lambda_0/(R + 1)\}$ | $E(V)$ | $E(S)$

FDR, false discovery rate.

Table 2. Simulated values of the false discovery rate for the procedure that controls this rate. The simulations are based on 400 repetitions with $w = 50$, m = . The false discovery rate is controlled to be no more than 0.3, 0.2, 0.1 or 0.05, respectively. There are 21 possible discoveries, with noncentrality parameters ranging from 6 to 2 in steps of size 0.2.

$\alpha$ | FDR | $E(V)$ | $E(S)$

FDR, false discovery rate.

Table 1 examines the estimator $\lambda_0/(R + 1)$ of the false discovery rate for several thresholds $z$. Four values of $z$ corresponding to nominal values of 5, 3, 2 and 1 for $\lambda_0$ are considered. For each level the actual level of the false discovery rate and the expectation of the estimator are presented. The expected number of false discoveries, $E(V)$, and the expected number of true discoveries, $E(S)$, are also given. The expectations are based on 400 replicates of the scanning process.

Table 2 examines the procedure for controlling the false discovery rate. We used the stopping rule $\inf\{z \ge 2 : R(z) \ge \lambda_0(z)/\alpha\}$, where $R(z)$ is the number of discoveries associated with the threshold $z$ and $\lambda_0(z)$ is the approximation (3) of the expected number of clumps associated with $z$ computed under the global null distribution. Four values of $\alpha$, 0.3, 0.2, 0.1 and 0.05, are considered. For each $\alpha$ the actual level of the false discovery rate, the expected number of false discoveries and the expected number of true discoveries are presented. The expectations are based on 400 replicates of the scanning process.

3.2. Allelic bias in transcribed RNA

Another example involves an experiment on RNA expression profiles in autistic subjects (Ben-David et al., 2011).
The goal of the experiment was to identify autosomal loci where only one of the two alleles is expressed. Nuclear RNA was extracted from blood cell-lines of 17 subjects and reverse transcribed. Both the cDNA produced and the genomic DNA of each of the subjects were genotyped using the Affymetrix Single Nucleotide Polymorphism 6.0 array technology. The identification of loci with mono-allelic expression of RNA resulted from the examination of the cDNA genotypes at single nucleotide polymorphisms that had been identified as heterozygous in genomic DNA.

Specifically, the algorithm for the discovery of differentially expressed regions involved the removal, for each subject, of the single nucleotide polymorphisms that were homozygous in the genomic DNA, or were determined not to be sufficiently expressed. For the remaining cDNA polymorphisms, an exponentially distributed distance from heterozygous expression was calculated using the log transformed ranking

of the confidence score from the Affymetrix Birdseed V2 genotyping algorithm (Korn et al., 2008). The p-values for the sum of scores in windows of five consecutive polymorphisms were calculated using the function rollapply from the R package zoo (R Development Core Team, 2011). Windows that included polymorphisms more than 1 Mbp apart were excluded from the analysis. On the other hand, consecutive windows with p-values < 0.05 were combined if the distance between them was < 1 Mbp. The p-values for the merged windows were recalculated. Final windows with a p-value < were declared to be discoveries. A total of 507 such windows were discovered using the algorithm described above.

Fig. 1. Scanning windows $(t, t + w)$ that exceed the threshold of $z = 30$ for a region containing 500 positions in the DNA copy number data of §3.3. Each black horizontal segment shows the start and end points of a window along the location axis, with the actual value of the scan statistic (z-score) shown on the y-axis. This region contains three discoveries, or clumps, shown as thick bars at the top of the plot.

In order to estimate the false discovery rate of the algorithm, the method of §2 was applied. The markers used are heterozygous and widely separated in the scale of base pairs. Hence, it seems reasonable to assume that they behave independently, since transcription, currently understood as a localized process within the genome, should not induce dependence between the allelic expression of distantly separated polymorphisms. The transcribed allelic ratios can be permuted within individuals, and a Monte Carlo experiment then determines the Poisson parameter $\lambda$. The algorithm was applied to each permuted set of data, and the number of discoveries was counted. The average number of discoveries, computed from 100 permutations, was 11.48. This average served as an estimate of the expected number of false discoveries.
Consequently, the estimated rate of false discovery is $11.48/(507 + 1) \approx 0.0226$.

3.3. Population-wide copy number variation

To detect copy number variation, Olshen et al. (2004) introduced a change-point model with white Gaussian measurement errors. Their procedure was found by Lai et al. (2005) to be preferable to other existing methods. See Jeng et al. (2010) for a recent discussion of this model. For the more general problem of aligned copy number variation in multiple sequences, Zhang et al. (2010) and Siegmund et al. (2011), after a suitable normalization of the data described in those papers, found that the change-point model with Gaussian white noise measurement errors was reasonable. It follows from (3.3) and (3.4) of Siegmund et al. (2011) that $V$ is approximately Poisson for high thresholds.

We used (3.4) from that paper applied to the data from chromosome 4 of the Stanford Quality Control Panel. For a complete description of this application and dataset, see the cited papers. There is a total of positions, with 62 samples. We restricted our analysis to small intervals, and so conducted a variable window scan of all positions with a maximum window size of 50 and a minimum window size of 1. The theoretically derived value of $\lambda(z)$ compares well with values estimated via Monte Carlo simulation, even for values of $z$ where $\lambda(z)$ is fairly large. With a false discovery rate threshold of 0.01, 337 discoveries were made. With a false discovery rate threshold of 0.1, 472 discoveries were made. See Fig. 1 for an example region containing 500 positions and 3 discoveries.

ACKNOWLEDGEMENT

The research of the first and third authors is supported by the Israeli-American Bi-National Fund. The second and third authors are supported by the National Science Foundation, U.S.A. We would like to thank

Dr Shifman from The Hebrew University of Jerusalem for giving us access to the data of the experiment described in §3.2 and for conducting the simulation described therein.

REFERENCES

ALDOUS, D. (1988). Applications of the Poisson Clumping Heuristic. New York: Springer.
ARRATIA, R., GOLDSTEIN, L. & GORDON, L. (1989). Two moments suffice for Poisson approximation. Ann. Prob. 17,
BEN-DAVID, E., GRANOT-HERSHKOVITZ, E., MONDERER-ROTHKOFF, G., LERER, E., LEVI, S., YAARI, M., EBSTEIN, R. P., YIRMIA, N. & SHIFMAN, S. (2011). Identification of a functional rare variant in autism using genome-wide screen for monoallelic expression. Hum. Molec. Genet. 20,
BENJAMINI, Y. & HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57,
CHEN, L. (1975). Poisson approximation for dependent trials. Ann. Prob. 3,
EFRON, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge: Cambridge University Press.
JENG, X. J., CAI, T. T. & LI, H. (2010). Optimal sparse segment identification with application in copy number variation analysis. J. Am. Statist. Assoc. 105,
KORN, J. M., KURUVILLA, F. G., MCCARROLL, S. A., WYSOKER, A., NEMESH, J., CAWLEY, S., HUBBELL, E., VEITCH, J., COLLINS, P. J., DARVISHI, K., et al. (2008). Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nature Genet. 40,
LAI, W. R., JOHNSON, M. D., KUCHERLAPATI, R. & PARK, P. J. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21,
LANDER, E. & BOTSTEIN, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121,
LINDGREN, G., LEADBETTER, M. R. & ROOTZÉN, H. (1983). Extremes and Related Properties of Stationary Sequences and Processes. New York: Springer.
MEDVEDEV, P., STANCIU, M. & BRUDNO, M. (2009).
Computational methods for discovering structural variation with next-generation sequencing. Nature Meth. Suppl. 6, S
OLSHEN, A. B., VENKATRAMAN, E. S., LUCITO, R. & WIGLER, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5,
R DEVELOPMENT CORE TEAM (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN
SIEGMUND, D. O. & WORSLEY, K. J. (1995). Testing for a signal with unknown location and scale in a stationary Gaussian random field. Ann. Statist. 23,
SIEGMUND, D. & YAKIR, B. (2007). The Statistics of Gene Mapping. New York: Springer.
SIEGMUND, D., YAKIR, B. & ZHANG, N. (2011). Detecting simultaneous variant intervals in aligned sequences. Ann. Appl. Statist. 5,
STOREY, J. D., TAYLOR, J. E. & SIEGMUND, D. O. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. R. Statist. Soc. B 66,
WORSLEY, K., EVANS, A. C., MARRETT, S. & NEELIN, P. (1992). A three dimensional statistical analysis for CBF activation studies in human brain. J. Cerebral Blood Flow Metab. 12,
ZHANG, Y. (2008). Poisson approximation for significance in genome-wide ChIP-chip tiling arrays. Bioinformatics 24,
ZHANG, N., SIEGMUND, D., JI, H. & LI, J. Z. (2010). Detecting simultaneous change-points in multiple sequences. Biometrika 97,

[Received September Revised August 2011]
