Miscellanea False discovery rate for scanning statistics


Biometrika (2011), 98, 4, pp. © 2011 Biometrika Trust. Printed in Great Britain. doi: /biomet/asr057

BY D. O. SIEGMUND, N. R. ZHANG
Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California, U.S.A.
siegmund@stanford.edu nzhang@stanford.edu

AND B. YAKIR
Department of Statistics, The Hebrew University of Jerusalem, Jerusalem 91905, Israel
msby@mscc.huji.ac.il

SUMMARY

The false discovery rate is a criterion for controlling Type I error in simultaneous testing of multiple hypotheses. For scanning statistics, due to local dependence, clusters of neighbouring hypotheses are likely to be rejected together. In such situations, it is more intuitive and informative to group neighbouring rejections together and count them as a single discovery, with the false discovery rate defined as the proportion of clusters that are falsely declared among all declared clusters. Assuming that the number of false discoveries, under this broader definition of a discovery, is approximately Poisson and independent of the number of true discoveries, we examine approaches for estimating and controlling the false discovery rate, and provide examples from biological applications.

Some key words: False discovery rate; Multiple comparisons; Poisson approximation; Scan statistic.

1. INTRODUCTION

In a pioneering paper, Benjamini & Hochberg (1995) initiated a fruitful line of research into the false discovery rate as a method to evaluate Type I error when simultaneously testing large numbers of hypotheses. We use their notation, so $R$ is the number of discoveries that emerge as a result of a particular statistical procedure, and $V$ is the number of false discoveries among them. Then $S = R - V$ is the number of true discoveries. The false discovery rate is the expected relative proportion of false discoveries, $\mathrm{FDR} = E(V/R; R > 0)$.
These quantities are defined implicitly in terms of the specific procedure that is used to make discoveries.

We are concerned with estimation and control of false discovery rates when there is substantial local correlation among the statistics used for testing the hypotheses. Due to local correlation, large values of the statistic tend to occur in clumps, and multiple rejections within a clump may constitute only a single discovery, as it relates to model identification. Yet a possibly large number of correct rejections at some locations can inflate the denominator in the definition of the false discovery rate, hence artificially creating a small false discovery rate and lowering the barrier to possibly false detections at distant locations.

Scanning statistics to detect sparsely distributed signals provide typical examples. In the examples that follow, there is an underlying set of observations $y_t$, where $t$ varies over an indexing set having some geometric structure. The $y_t$ are often assumed to be independent, but this is not necessary, provided the dependence between them is local with respect to the geometric structure. The test statistics $\{Z_t : t \in D\}$, where $Z_t$ is a function of $t$ and of the $y_s$ for $s \in N_t$, an appropriate neighbourhood of $t$, are related by a measure of distance within the scanning index set $D$. Hence, values of $Z_t$ and $Z_s$ for nearby $t$ and $s$ in $D$ are correlated, so a large value at a specific $\tau \in D$ causes a cluster of large values at $t$ close to $\tau$. Thus, a group of large values of $Z_t$ within close proximity is often associated with a single signal.

Example 1. Random fields used to detect local activity in an fMRI scan, as discussed in a series of papers by Worsley, for example, Worsley et al. (1992) or Siegmund & Worsley (1995).

Example 2. Massively parallel paired-end DNA re-sequencing used to detect structural variation in genomic sequences. For a review, see Medvedev et al. (2009). The data come in the form of distances $y_t$ between mapped positions of relatively short paired reads from the ends of DNA sequences of approximately $w$ base pairs in length, with the leftmost read mapped to position $x_t$ in the genome. Where there are no structural variations one observes, after subtracting $w$ and standardizing, $y_t$ that are independent and have a standard normal distribution. For read pairs that straddle the breakpoint $\tau$ of a structural variation, the distribution of some percentage of the $y_t$, for $\tau < x_t \le \tau + w$, is shifted by an unknown amount $\delta$, which is related to the size of the variant. The score statistic with respect to $\delta$ to test for a breakpoint at $\tau$ is $\sum_{t=\tau+1}^{\tau+w} y_t / w^{1/2}$. A scan is conducted with $\tau$ varying over the genomic region of interest to find putative breakpoint locations.

Example 3. The scan statistics of variable window width used in Zhang et al. (2010) and Siegmund et al. (2011) to detect common regions of copy number variation in a set of subjects. An appropriate likelihood-based statistic for the special case of a single sequence (Olshen et al., 2004) is similar to that of the preceding example, but the window width is unknown, so the scan involves two-dimensional maximization with respect to $\tau$ and $w$.

Example 4. A genome scan to detect either linkage or association between a phenotype and related genetic variation, e.g., Lander & Botstein (1989), Siegmund & Yakir (2007).

The main results of this paper are methods for estimating and controlling the false discovery rate of a given procedure, with discovery defined in the sense of detection of sparse, local signals. In order to focus on the conceptual aspect of how one defines a discovery, our assumptions are given in general, abstract terms, and we avoid, except for a few comments, the necessarily technical and application-specific discussion of methods to ensure and test the validity of those assumptions. Zhang (2008), motivated by a different genomic application, describes a similar approach without any theoretical analysis. In an unpublished 2011 manuscript in the Harvard University Biostatistics Working Paper Series, A. Schwartzman, Y. Gavrilov and R. J. Adler discuss a different approach to the same general issue under specific technical assumptions motivated by, and apparently limited to, a one-dimensional process having the structure of Example 1.

Our two central assumptions are (i) the distribution of the number of false discoveries, $V$, is Poisson, with expected value $\lambda$; and (ii) the number of false discoveries is independent of the number of true discoveries, $S$. The number of true discoveries must be nonnegative, but otherwise may follow any distribution.

The Poisson assumption is valid asymptotically in a variety of applications. These include the examples given above under the assumptions made in the cited references, or more generally if one makes suitable adjustments when there are local dependencies in the underlying observations. See Aldous (1988) for numerous examples, Lindgren et al. (1983) for relevant general theorems under different sets of conditions, and Arratia et al. (1989) for a flexible Poisson approximation theorem that applies quite generally to processes involving local dependence.
Methods for determining $\lambda$ depend on the specific problem. Illustrative examples based on more explicit assumptions about the underlying process are discussed below.

The assumption of independence between $V$ and $S$ is more subtle. If we treat the locations of the signals in $D$ as fixed but unknown quantities, then $D$ can be partitioned into disjoint sets, $D = D_0 \cup D_1$, where $D_0$ contains the hypotheses that, if rejected, would be considered part of a false discovery, and $D_1$ contains the hypotheses that, if rejected, would be considered part of a true discovery. For example, in the simple case of a one-dimensional scan with fixed window size $w$ as in Example 2, suppose the true signals are a set of intervals $I$ within the scan region. If we count those windows that overlap with any interval in $I$ towards true discoveries, and the rest towards false discoveries, then $D_0 = \{t : (t, t + w] \cap \iota = \emptyset \text{ for all } \iota \in I\}$ and $D_1 = D \setminus D_0$. Then, $V$ would be a function solely of $\{Z_t : t \in D_0\}$, while $S$ would be a function solely of $\{Z_t : t \in D_1\}$. At least when the true signals are sparse, approximate independence between $V$ and $S$ would follow if long-range dependencies between $\{Z_t : t \in D_0\}$ and $\{Z_t : t \in D_1\}$ are negligible. In practice, near overlap of detected signals is a danger sign regarding possible violation of this hypothesis of independence. For a more detailed discussion, see § 3·1.

Significant long-range dependence among the $\{Z_t\}$ may cause nonnegligible dependence between $V$ and $S$. Scanning procedures that are based on a collection of localized tests are inherently designed for problems where dependence can be assumed to be local, since if long-range independence does not hold, then procedures that account for that dependence would be preferred on the basis of greater power.

The estimator that we propose for the false discovery rate is

$$\widehat{\mathrm{FDR}} = \lambda/(R + 1), \qquad (1)$$

where $\lambda$ is the expected number of false discoveries and $R$ is the total number of discoveries. This estimator has been considered by Efron (2010) in the framework of hypothesis testing with a large number of independent hypotheses and, except for the constant 1 in the denominator, is the same as that suggested by Zhang (2008). In some cases, the parameter $\lambda$ can be derived analytically. In other cases, it can be computed via permutations or simulations conducted under the null assumption. In § 2, we show that the estimator (1) is unbiased under assumptions (i) and (ii).

Our method for controlling the rate of false discoveries is closely associated with the procedure proposed by Benjamini & Hochberg (1995) for ordered p-values. We in effect replace their assumption regarding the relations between p-values of individual hypotheses by the assumption that an appropriately indexed family of false discoveries is a Poisson process.

2. ESTIMATING AND CONTROLLING THE FALSE DISCOVERY RATE

Let $V \sim \mathrm{Po}(\lambda)$ be the number of false discoveries and let $S \ge 0$ be the number of true discoveries. Assume $S$ is a nonnegative random variable independent of $V$. The total number of discoveries is $R = V + S$. Consider the ratio $V/R$, which is defined to be 0 if $R = 0$, and compare it to the estimator $\lambda/(R + 1)$.

THEOREM 1. Under assumptions (i) and (ii), $E(V/R; R > 0) = E\{\lambda/(R + 1)\}$.

Proof. For fixed $s \ge 0$, let $F_s(x) = (x + s)^{-1} I(x + s > 0)$, with the understanding that $F_0(0) = 0$. After writing expectations as infinite series, algebraic manipulations show that

$$E\{V F_s(V)\} = \lambda E\{F_s(V + 1)\}. \qquad (2)$$

The result follows by taking expectations with respect to the distribution of $S$. Hence, $\widehat{\mathrm{FDR}}$ defined in (1) is an unbiased estimator of the false discovery rate.

Remark. Equation (2) has been applied elsewhere. In particular, it is the basis for the Chen (1975) method of Poisson approximation.

Now suppose that false detections are a Poisson process $V_\lambda$ of rate 1, defined on the interval $[0, \bar\lambda]$. We assume also that the process $R_\lambda = V_\lambda + S_\lambda$ is nondecreasing and that the processes $V_\lambda$ and $S_\lambda$ are independent. Define the backwards stopping time

$$\Lambda = \max\{\lambda \le \bar\lambda : R_\lambda \ge \lambda/\alpha\}.$$

This is a function of the observed process $R_\lambda$, and thereby it is a function of the Poisson process $V_\lambda$ and the independent process $S_\lambda$, both unobserved. The extreme case $\Lambda = 0$ corresponds to the case where $R_\Lambda$ is equal to zero, and the ratio $V_\Lambda/R_\Lambda$ is then defined to be equal to zero as well.

Consider the procedure whereby the stopping time $\Lambda$ is evaluated and $R_\Lambda$ is reported as the number of discoveries. In Theorem 2, we prove that the expected proportion of false discoveries, $E(V_\Lambda/R_\Lambda)$, is bounded by $\alpha$. The proof is a version of the argument given by Storey et al. (2004).
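The unbiasedness claim of Theorem 1 can be checked numerically. The sketch below sums the Poisson series directly for a made-up distribution of $S$; the function names, the truncation point `v_max` and the example distribution are illustrative assumptions, not part of the paper.

```python
import math

def poisson_pmf(v, lam):
    """pr(V = v) for V ~ Po(lam), lam > 0, computed on the log scale."""
    return math.exp(-lam + v * math.log(lam) - math.lgamma(v + 1))

def fdr_and_estimator_mean(lam, s_dist, v_max=200):
    """Return (E(V/R; R > 0), E{lam/(R + 1)}) for V ~ Po(lam) independent of S.

    s_dist maps each value s of S to its probability. The Poisson series is
    truncated at v_max, an illustrative choice that is ample for small lam.
    """
    fdr = 0.0
    est = 0.0
    for s, ps in s_dist.items():
        for v in range(v_max + 1):
            weight = ps * poisson_pmf(v, lam)
            r = v + s
            if r > 0:
                fdr += weight * v / r      # V/R, taken as 0 when R = 0
            est += weight * lam / (r + 1)  # estimator (1)
    return fdr, est

# Hypothetical distribution for the number of true discoveries S.
s_dist = {0: 0.2, 3: 0.5, 10: 0.3}
fdr, est = fdr_and_estimator_mean(2.0, s_dist)
print(fdr, est)  # the two values agree up to truncation error
```

Because the check conditions on each value of $S$ in turn, it mirrors the structure of the proof: identity (2) holds for every fixed $s$, and averaging over $S$ preserves it.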

THEOREM 2. Under the given conditions and for the procedure associated with the stopping time $\Lambda$, $E(V_\Lambda/R_\Lambda) \le \alpha$.

Proof. Consider the process $V_\lambda/\lambda$ and notice that it is a mean-one backwards martingale with respect to the filtration $\mathcal{F}_\lambda = \sigma(V_t, S_t : \lambda \le t \le \bar\lambda)$. The stopping time $\Lambda$ is measurable with respect to this filtration. It follows that $E\{V_{\Lambda \vee \lambda}/(\Lambda \vee \lambda)\} = E(V_{\bar\lambda}/\bar\lambda) = 1$ for any $\lambda > 0$. Let $\lambda \to 0$ and observe that $1(\Lambda < \lambda) V_\lambda/\lambda$ converges to 0 and is bounded by $1/\alpha$. Hence, by the dominated convergence theorem, we see that $E(V_\Lambda/\Lambda; \Lambda > 0) = E(V_{\bar\lambda}/\bar\lambda) = 1$.

Consider the proportion $V_\Lambda/R_\Lambda$ of false detections of the proposed procedure. Since this proportion is defined to be equal to zero when $\Lambda = 0$, $E(V_\Lambda/R_\Lambda) = E(V_\Lambda/R_\Lambda; \Lambda > 0)$. Dividing and multiplying by $\Lambda$, we get

$$E(V_\Lambda/R_\Lambda; \Lambda > 0) = E\{(\Lambda/R_\Lambda)(V_\Lambda/\Lambda); \Lambda > 0\} \le \alpha E\{V_\Lambda/\Lambda; \Lambda > 0\} = \alpha,$$

where the inequality follows from the fact that when $\Lambda > 0$ the definition of the stopping time implies that $\Lambda/R_\Lambda \le \alpha$. The conclusion follows.

3. EXAMPLES

3·1. Fixed-width sliding window scan

Consider a fixed window scan statistic. Suppose $Y_1, \ldots, Y_m$ are independent and normally distributed random variables with unit variance. Under a global null hypothesis they are standard normal. Under the alternative there are intervals of known length $w$, and unknown positive integers $\tau$ such that $Y_{\tau+1}, \ldots, Y_{\tau+w}$ have mean $\mu_\tau > 0$. The values of $\mu_\tau$ and the number of such intervals are unknown, although we assume that the total width of all intervals is small relative to the sample size $m$. This situation corresponds roughly to Example 2 in § 1, although, to facilitate our simulations, the numerical values of the parameters we use below are smaller than would be typical for this application.

Let $Z_t = \left(\sum_{i=t+1}^{t+w} Y_i\right)/w^{1/2}$. The behaviour of $Z_t$ as a process under the global null hypothesis that all discoveries are false is easily inferred from known results.
Specifically, an asymptotic approximation to $p = \mathrm{pr}(\max_{0 \le t \le m-w} Z_t > z)$ is given, for a two-sided alternative, in display (5.3) of Siegmund & Yakir (2007, p. 112), with parameters $C = 1$, $\Delta = 1$, $L = m - w$ and $\beta = 1/w$. For large enough thresholds $z$, the probability that $Z_t$ exceeds $z$ is small, and the number of clumps of $Z_t$ that exceed $z$ is approximately Poisson distributed with mean

$$\lambda_0 = -\log(1 - p) = m z w^{-1} \varphi(z)\, \nu\{z (2/w)^{1/2}\}, \qquad (3)$$

where $\varphi$ denotes the standard normal probability density function and $\nu$ is a special function associated with the overshoot of a stopped random walk (cf. Siegmund & Yakir, 2007, p. 112).

Although there is no unique definition of a clump, there should usually be little difficulty in recognizing one in practice. Roughly speaking, it is a set of values of $t$ that are relatively close together, where $Z_t \ge z$. Except when different true discoveries are themselves close together, different clumps are distinguished by relatively long gaps where $Z_t$ remains below the level $z$. If all clumps were false positives and $z \gg 0$, then the size of a clump would be stochastically bounded, while the expected distances between clumps would be approximately $1/\lambda_0$, and hence grow faster than exponentially in $z$.

The independence of the $Y_t$ makes $Z_s$ and $Z_t$ independent as long as $|s - t| > w$. Clumps of false positives should be short and approximately uniformly distributed across the search interval. Hence, unless the true signals occur very frequently, the probability of a false positive occurring close to a true signal is small, so the independence of $V$ and $S$ would be approximately satisfied. The same would be true of the variable window scans of Example 3, provided the maximum window size is much smaller than the number of observations. See Siegmund et al. (2011) for a discussion of the data normalization used to validate the normality and independence assumptions needed by Example 3.

Some simulated results are presented in Tables 1 and 2.
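Formula (3) is straightforward to evaluate once $\nu$ is available. The sketch below is a minimal illustration under two assumptions that are not taken from the text: $\nu$ is replaced by a widely used closed-form approximation, $\nu(x) \approx (2/x)\{\Phi(x/2) - 1/2\}/\{(x/2)\Phi(x/2) + \varphi(x/2)\}$, and the stopping time of § 2 is applied on a finite grid of thresholds by taking the smallest $z$ at which the observed clump count $R(z)$ is at least $\lambda_0(z)/\alpha$. The function names, the value of `m` and the clump counts are hypothetical.

```python
import math

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def nu(x):
    """Closed-form approximation to the overshoot function nu(x), x >= 0.

    Assumption: a standard approximation, not a formula given in the text.
    """
    if x < 1e-8:
        return 1.0  # limiting value as x -> 0
    h = 0.5 * x
    return (2.0 / x) * (std_normal_cdf(h) - 0.5) / (h * std_normal_cdf(h) + std_normal_pdf(h))

def lambda0(z, m, w):
    """Approximate expected number of null clumps above z, following (3)."""
    return m * z / w * std_normal_pdf(z) * nu(z * math.sqrt(2.0 / w))

def choose_threshold(z_grid, n_discoveries, expected_false, alpha):
    """Smallest z on the grid with R(z) >= lambda0(z) / alpha, or None.

    n_discoveries  -- callable z -> R(z), the observed number of clumps above z
    expected_false -- callable z -> lambda0(z)
    """
    for z in sorted(z_grid):
        if n_discoveries(z) >= expected_false(z) / alpha:
            return z
    return None

# Hypothetical clump counts R(z) on a grid of thresholds; m is also hypothetical.
observed = {2.0: 60, 2.5: 31, 3.0: 24, 3.5: 20, 4.0: 17}
z_star = choose_threshold(sorted(observed), observed.get,
                          lambda z: lambda0(z, m=20000, w=50), alpha=0.1)
print(z_star)
```

Scanning the grid from the smallest threshold upwards matches a backwards stopping time in $\lambda$, because $\lambda_0(z)$ decreases in $z$ over the range of thresholds of practical interest.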
For the simulations we took $m = $ and $w = 50$. A total of 21 intervals of length $w$, scattered about the sequence, were simulated from the alternative distribution with mean values $\mu_\tau$ that range between $6/w^{1/2}$ and $2/w^{1/2}$ in steps of size $0{\cdot}2/w^{1/2}$.
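A simulation of this kind can be sketched in a few lines. The code below is an illustration only: the sequence length `m`, the interval positions and the rule for merging exceedances into clumps (exceedances no more than `gap` positions apart belong to the same clump) are assumptions, since the text does not fix them.

```python
import math
import random

def scan_statistic(y, w):
    """Z_t = (y[t] + ... + y[t+w-1]) / sqrt(w) for every full window (0-based t)."""
    csum = [0.0]
    for value in y:
        csum.append(csum[-1] + value)
    return [(csum[t + w] - csum[t]) / math.sqrt(w) for t in range(len(y) - w + 1)]

def count_clumps(zs, z, gap):
    """Count clumps of exceedances Z_t >= z separated by more than `gap` positions.

    This merging rule is one workable convention; there is no unique
    definition of a clump.
    """
    clumps, last = 0, None
    for t, value in enumerate(zs):
        if value >= z:
            if last is None or t - last > gap:
                clumps += 1
            last = t
    return clumps

def simulate(m, w, starts, means, rng):
    """Unit-variance normal noise with mean shifts on the intervals [s, s + w)."""
    y = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for s, mu in zip(starts, means):
        for i in range(s, s + w):
            y[i] += mu
    return y

rng = random.Random(0)
m, w = 20000, 50                               # m is a hypothetical value
means = [(6.0 - 0.2 * k) / math.sqrt(w) for k in range(21)]
starts = [500 + 900 * k for k in range(21)]    # hypothetical, well-separated positions
y = simulate(m, w, starts, means, rng)
zs = scan_statistic(y, w)
print(count_clumps(zs, z=4.0, gap=w))
```

Using cumulative sums keeps the scan linear in $m$. Setting `gap` equal to $w$ reflects the fact that $Z_s$ and $Z_t$ are independent under the null once $|s - t| > w$.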

Table 1. Simulated values of the false discovery rate and $E\{\lambda_0/(R + 1)\}$, based on 400 repetitions with $w = 50$, $m = $ . Nominal values of $\lambda_0$ are 5, 3, 2 and 1, respectively. There are 21 possible discoveries, with noncentrality parameters ranging from 6 to 2 in steps of size 0·2.
Columns: $z$; FDR; $E\{\lambda_0/(R + 1)\}$; $E(V)$; $E(S)$.
FDR, false discovery rate.

Table 2. Simulated values of the false discovery rate for the procedure that controls this rate. The simulations are based on 400 repetitions with $w = 50$, $m = $ . The false discovery rate is controlled to be no more than 0·3, 0·2, 0·1 or 0·05, respectively. There are 21 possible discoveries, with noncentrality parameters ranging from 6 to 2 in steps of size 0·2.
Columns: $\alpha$; FDR; $E(V)$; $E(S)$.
FDR, false discovery rate.

Table 1 examines the estimator $\lambda_0/(R + 1)$ of the false discovery rate for several thresholds $z$. Four values of $z$, corresponding to nominal values of 5, 3, 2 and 1 for $\lambda_0$, are considered. For each level the actual level of the false discovery rate and the expectation of the estimator are presented. The expected number of false discoveries, $E(V)$, and the expected number of true discoveries, $E(S)$, are also given. The expectations are based on 400 replicates of the scanning process.

Table 2 examines the procedure for controlling the false discovery rate. We used the stopping rule $\inf\{z \ge 2 : R(z) \ge \lambda_0(z)/\alpha\}$, where $R(z)$ is the number of discoveries associated with the threshold $z$ and $\lambda_0(z)$ is the approximation (3) of the expected number of clumps associated with $z$ computed under the global null distribution. Four values of $\alpha$, 0·3, 0·2, 0·1 and 0·05, are considered. For each $\alpha$ the actual level of the false discovery rate, the expected number of false discoveries and the expected number of true discoveries are presented. The expectations are based on 400 replicates of the scanning process.

3·2. Allelic bias in transcribed RNA

Another example involves an experiment on RNA expression profiles in autistic subjects (Ben-David et al., 2011).
The goal of the experiment was to identify autosomal loci where only one of the two alleles is expressed. Nuclear RNA was extracted from blood cell-lines of 17 subjects and reverse transcribed. Both the cDNA produced and the genomic DNA of each of the subjects were genotyped using the Affymetrix Single Nucleotide Polymorphism 6·0 array technology. The identification of loci with mono-allelic expression of RNA resulted from the examination of the cDNA genotypes at single nucleotide polymorphisms that had been identified as heterozygous in genomic DNA.

Specifically, the algorithm for the discovery of differentially expressed regions involved the removal, for each subject, of the single nucleotide polymorphisms that were homozygous in the genomic DNA, or were determined not to be sufficiently expressed. For the remaining cDNA polymorphisms, an exponentially distributed distance from heterozygous expression was calculated using the log transformed ranking of the confidence score from the Affymetrix Birdseed V2 genotyping algorithm (Korn et al., 2008). The p-values for the sum of scores in windows of five consecutive polymorphisms were calculated using the function rollapply from the R package zoo (R Development Core Team, 2011). Windows that included polymorphisms more than 1 Mbp apart were excluded from the analysis. On the other hand, consecutive windows with p-values < 0·05 were combined if the distance between them was < 1 Mbp. The p-values for the merged windows were recalculated. Final windows with a p-value < were declared to be discoveries. A total of 507 such windows were discovered using the algorithm described above.

Fig. 1 (axes: z-score against location). Scanning windows $(t, t + w)$ that exceed the threshold of $z = 30$ for a region containing 500 positions in the DNA copy number data of § 3·3. Each black horizontal segment shows the start and end points of a window, with the actual value of the scan statistic shown on the y-axis. This region contains three discoveries, or clumps, shown as thick bars at the top of the plot.

In order to estimate the false discovery rate of the algorithm, the method of § 2 was applied. The markers used are heterozygous and widely separated in the scale of base pairs. Hence, it seems reasonable to assume that they behave independently, since transcription, currently understood as a localized process within the genome, should not induce dependence between the allelic expression of distantly separated polymorphisms. The transcribed allelic ratios can be permuted within individuals, and a Monte Carlo experiment then determines the Poisson parameter $\lambda$. The algorithm was applied to each permuted set of data, and the number of discoveries was counted. The average number of discoveries, computed from 100 permutations, was 11·48. This average served as an estimate of the expected number of false discoveries.
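The permutation step can be sketched generically. The code below is not the authors' pipeline: it assumes a hypothetical data layout (one list of scores per individual) and treats the whole discovery algorithm as an opaque callable, so that only the within-individual shuffling and the averaging are shown, together with estimator (1). The toy counting rule and the discovery count passed to `fdr_estimate` are made up for illustration.

```python
import random

def permutation_lambda(data, count_discoveries, n_perm=100, seed=0):
    """Average number of discoveries over datasets permuted within individuals.

    data              -- list of per-individual lists of scores (hypothetical layout)
    count_discoveries -- callable applying the full discovery procedure to such data
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_perm):
        permuted = []
        for scores in data:
            shuffled = list(scores)   # copy, so the input is left untouched
            rng.shuffle(shuffled)     # permute within one individual only
            permuted.append(shuffled)
        total += count_discoveries(permuted)
    return total / n_perm

def fdr_estimate(lam, r):
    """Estimator (1): lambda / (R + 1)."""
    return lam / (r + 1)

# Toy stand-in for the discovery procedure: count windows of five consecutive
# scores whose sum exceeds a cutoff, pooled over individuals.
def count_windows(data, width=5, cutoff=12.0):
    return sum(
        1
        for scores in data
        for i in range(len(scores) - width + 1)
        if sum(scores[i:i + width]) > cutoff
    )

rng = random.Random(1)
toy = [[rng.expovariate(1.0) for _ in range(200)] for _ in range(17)]
lam_hat = permutation_lambda(toy, count_windows, n_perm=100)
print(lam_hat, fdr_estimate(lam_hat, r=25))  # r = 25 is a made-up discovery count
```

Passing the procedure in as a callable keeps the permutation logic independent of how windows are scored, merged and thresholded, which is where an actual analysis would differ from this toy.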
Consequently, the estimated rate of false discovery is 11·48/(507 + 1) = 0·0226.

3·3. Population-wide copy number variation

To detect copy number variation, Olshen et al. (2004) introduced a change-point model with white Gaussian measurement errors. Their procedure was found by Lai et al. (2005) to be preferable to other existing methods. See Jeng et al. (2010) for a recent discussion of this model. For the more general problem of aligned copy number variation in multiple sequences, Zhang et al. (2010) and Siegmund et al. (2011), after a suitable normalization of the data described in those papers, found that the change-point model with Gaussian white noise measurement errors was reasonable. It follows from (3.3) and (3.4) of Siegmund et al. (2011) that $V$ is approximately Poisson for high thresholds.

We used (3.4) from that paper applied to the data from chromosome 4 of the Stanford Quality Control Panel. For a complete description of this application and dataset, see the cited papers. There is a total of positions, with 62 samples. We restricted our analysis to small intervals, and so conducted a variable window scan of all positions with a maximum window size of 50 and a minimum window size of 1. The theoretically derived value of $\lambda(z)$ compares well with values estimated via Monte Carlo simulation, even for values of $z$ where $\lambda(z)$ is fairly large. With a false discovery rate threshold of 0·01, 337 discoveries were made. With a false discovery rate threshold of 0·1, 472 discoveries were made. See Fig. 1 for an example region containing 500 positions and 3 discoveries.

ACKNOWLEDGEMENT

The research of the first and third authors is supported by the Israeli-American Bi-National Fund. The second and third authors are supported by the National Science Foundation, U.S.A. We would like to thank Dr Shifman from The Hebrew University of Jerusalem for giving us access to the data of the experiment described in § 3 and for conducting the simulation described therein.

REFERENCES

ALDOUS, D. (1988). Applications of the Poisson Clumping Heuristic. New York: Springer.
ARRATIA, R., GOLDSTEIN, L. & GORDON, L. (1989). Two moments suffice for Poisson approximation. Ann. Prob. 17.
BEN-DAVID, E., GRANOT-HERSHKOVITZ, E., MONDERER-ROTHKOFF, G., LERER, E., LEVI, S., YAARI, M., EBSTEIN, R. P., YIRMIA, N. & SHIFMAN, S. (2011). Identification of a functional rare variant in autism using genome-wide screen for monoallelic expression. Hum. Molec. Genet. 20.
BENJAMINI, Y. & HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57.
CHEN, L. (1975). Poisson approximation for dependent trials. Ann. Prob. 3.
EFRON, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge: Cambridge University Press.
JENG, X. J., CAI, T. T. & LI, H. (2010). Optimal sparse segment identification with application in copy number variation analysis. J. Am. Statist. Assoc. 105.
KORN, J. M., KURUVILLA, F. G., MCCARROLL, S. A., WYSOKER, A., NEMESH, J., CAWLEY, S., HUBBELL, E., VEITCH, J., COLLINS, P. J., DARVISHI, K., et al. (2008). Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nature Genet. 40.
LAI, W. R., JOHNSON, M. D., KUCHERLAPATI, R. & PARK, P. J. (2005). Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 21.
LANDER, E. & BOTSTEIN, D. (1989). Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121.
LINDGREN, G., LEADBETTER, M. R. & ROOTZÉN, H. (1983). Extremes and Related Properties of Stationary Sequences and Processes. New York: Springer.
MEDVEDEV, P., STANCIU, M. & BRUDNO, M. (2009). Computational methods for discovering structural variation with next-generation sequencing. Nature Meth. Suppl. 6, S
OLSHEN, A. B., VENKATRAMAN, E. S., LUCITO, R. & WIGLER, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5.
R DEVELOPMENT CORE TEAM (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN
SIEGMUND, D. O. & WORSLEY, K. J. (1995). Testing for a signal with unknown location and scale in a stationary Gaussian random field. Ann. Statist. 23.
SIEGMUND, D. & YAKIR, B. (2007). The Statistics of Gene Mapping. New York: Springer.
SIEGMUND, D., YAKIR, B. & ZHANG, N. (2011). Detecting simultaneous variant intervals in aligned sequences. Ann. Appl. Statist. 5.
STOREY, J. D., TAYLOR, J. E. & SIEGMUND, D. O. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. R. Statist. Soc. B 66.
WORSLEY, K., EVANS, A. C., MARRETT, S. & NEELIN, P. (1992). A three dimensional statistical analysis for CBF activation studies in human brain. J. Cerebral Blood Flow Metab. 12.
ZHANG, Y. (2008). Poisson approximation for significance in genome-wide ChIP-chip tiling arrays. Bioinformatics 24.
ZHANG, N., SIEGMUND, D., JI, H. & LI, J. Z. (2010). Detecting simultaneous change-points in multiple sequences. Biometrika 97.

[Received September Revised August 2011]


More information

Applying the Benjamini Hochberg procedure to a set of generalized p-values

Applying the Benjamini Hochberg procedure to a set of generalized p-values U.U.D.M. Report 20:22 Applying the Benjamini Hochberg procedure to a set of generalized p-values Fredrik Jonsson Department of Mathematics Uppsala University Applying the Benjamini Hochberg procedure

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

arxiv: v1 [math.st] 31 Mar 2009

arxiv: v1 [math.st] 31 Mar 2009 The Annals of Statistics 2009, Vol. 37, No. 2, 619 629 DOI: 10.1214/07-AOS586 c Institute of Mathematical Statistics, 2009 arxiv:0903.5373v1 [math.st] 31 Mar 2009 AN ADAPTIVE STEP-DOWN PROCEDURE WITH PROVEN

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics This Session IBD vs IBS Why is IBD important? Calculating IBD probabilities

More information

Overview. Background

Overview. Background Overview Implementation of robust methods for locating quantitative trait loci in R Introduction to QTL mapping Andreas Baierl and Andreas Futschik Institute of Statistics and Decision Support Systems

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics and Medical Informatics University of Wisconsin Madison www.biostat.wisc.edu/~kbroman [ Teaching Miscellaneous lectures]

More information

FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES

FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES Sanat K. Sarkar a a Department of Statistics, Temple University, Speakman Hall (006-00), Philadelphia, PA 19122, USA Abstract The concept

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Introduction to QTL mapping in model organisms Karl Broman Biostatistics and Medical Informatics University of Wisconsin Madison kbroman.org github.com/kbroman @kwbroman Backcross P 1 P 2 P 1 F 1 BC 4

More information

arxiv: v3 [math.st] 15 Oct 2018

arxiv: v3 [math.st] 15 Oct 2018 SEGMENTATION AND ESTIMATION OF CHANGE-POINT MODELS: FALSE POSITIVE CONTROL AND CONFIDENCE REGIONS arxiv:1608.03032v3 [math.st] 15 Oct 2018 Xiao Fang, Jian Li and David Siegmund The Chinese University of

More information

A COMPOUND POISSON APPROXIMATION INEQUALITY

A COMPOUND POISSON APPROXIMATION INEQUALITY J. Appl. Prob. 43, 282 288 (2006) Printed in Israel Applied Probability Trust 2006 A COMPOUND POISSON APPROXIMATION INEQUALITY EROL A. PEKÖZ, Boston University Abstract We give conditions under which the

More information

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly

Drosophila melanogaster and D. simulans, two fruit fly species that are nearly Comparative Genomics: Human versus chimpanzee 1. Introduction The chimpanzee is the closest living relative to humans. The two species are nearly identical in DNA sequence (>98% identity), yet vastly different

More information

Probabilistic Inference for Multiple Testing

Probabilistic Inference for Multiple Testing This is the title page! This is the title page! Probabilistic Inference for Multiple Testing Chuanhai Liu and Jun Xie Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: chuanhai,

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

How to analyze many contingency tables simultaneously?

How to analyze many contingency tables simultaneously? How to analyze many contingency tables simultaneously? Thorsten Dickhaus Humboldt-Universität zu Berlin Beuth Hochschule für Technik Berlin, 31.10.2012 Outline Motivation: Genetic association studies Statistical

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information # Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Details of PRF Methodology In the Poisson Random Field PRF) model, it is assumed that non-synonymous mutations at a given gene are either

More information

Hunting for significance with multiple testing

Hunting for significance with multiple testing Hunting for significance with multiple testing Etienne Roquain 1 1 Laboratory LPMA, Université Pierre et Marie Curie (Paris 6), France Séminaire MODAL X, 19 mai 216 Etienne Roquain Hunting for significance

More information

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018 High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously

More information

On adaptive procedures controlling the familywise error rate

On adaptive procedures controlling the familywise error rate , pp. 3 On adaptive procedures controlling the familywise error rate By SANAT K. SARKAR Temple University, Philadelphia, PA 922, USA sanat@temple.edu Summary This paper considers the problem of developing

More information

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems

Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Analysis of the AIC Statistic for Optimal Detection of Small Changes in Dynamic Systems Jeremy S. Conner and Dale E. Seborg Department of Chemical Engineering University of California, Santa Barbara, CA

More information

Estimation of a Two-component Mixture Model

Estimation of a Two-component Mixture Model Estimation of a Two-component Mixture Model Bodhisattva Sen 1,2 University of Cambridge, Cambridge, UK Columbia University, New York, USA Indian Statistical Institute, Kolkata, India 6 August, 2012 1 Joint

More information

Step-down FDR Procedures for Large Numbers of Hypotheses

Step-down FDR Procedures for Large Numbers of Hypotheses Step-down FDR Procedures for Large Numbers of Hypotheses Paul N. Somerville University of Central Florida Abstract. Somerville (2004b) developed FDR step-down procedures which were particularly appropriate

More information

Generalized Linear Models (1/29/13)

Generalized Linear Models (1/29/13) STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems

More information

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by

More information

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Xiaoquan Wen Department of Biostatistics, University of Michigan A Model

More information

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies

Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Probability of Detecting Disease-Associated SNPs in Case-Control Genome-Wide Association Studies Ruth Pfeiffer, Ph.D. Mitchell Gail Biostatistics Branch Division of Cancer Epidemiology&Genetics National

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Research Statement on Statistics Jun Zhang

Research Statement on Statistics Jun Zhang Research Statement on Statistics Jun Zhang (junzhang@galton.uchicago.edu) My interest on statistics generally includes machine learning and statistical genetics. My recent work focus on detection and interpretation

More information

Latent Variable Methods for the Analysis of Genomic Data

Latent Variable Methods for the Analysis of Genomic Data John D. Storey Center for Statistics and Machine Learning & Lewis-Sigler Institute for Integrative Genomics Latent Variable Methods for the Analysis of Genomic Data http://genomine.org/talks/ Data m variables

More information

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms Human vs mouse Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University www.biostat.jhsph.edu/~kbroman [ Teaching Miscellaneous lectures] www.daviddeen.com

More information

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia Expression QTLs and Mapping of Complex Trait Loci Paul Schliekelman Statistics Department University of Georgia Definitions: Genes, Loci and Alleles A gene codes for a protein. Proteins due everything.

More information

A Monte-Carlo study of asymptotically robust tests for correlation coefficients

A Monte-Carlo study of asymptotically robust tests for correlation coefficients Biometrika (1973), 6, 3, p. 661 551 Printed in Great Britain A Monte-Carlo study of asymptotically robust tests for correlation coefficients BY G. T. DUNCAN AND M. W. J. LAYAKD University of California,

More information

Detecting Simultaneous Variant Intervals in Aligned Sequences

Detecting Simultaneous Variant Intervals in Aligned Sequences University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2011 Detecting Simultaneous Variant Intervals in Aligned Sequences David Siegmund Benjamin Yakir Nancy R. Zhang University

More information

Statistical analysis of microarray data: a Bayesian approach

Statistical analysis of microarray data: a Bayesian approach Biostatistics (003), 4, 4,pp. 597 60 Printed in Great Britain Statistical analysis of microarray data: a Bayesian approach RAPHAEL GTTARD University of Washington, Department of Statistics, Box 3543, Seattle,

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate

Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Model-Free Knockoffs: High-Dimensional Variable Selection that Controls the False Discovery Rate Lucas Janson, Stanford Department of Statistics WADAPT Workshop, NIPS, December 2016 Collaborators: Emmanuel

More information

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics 1 Springer Nan M. Laird Christoph Lange The Fundamentals of Modern Statistical Genetics 1 Introduction to Statistical Genetics and Background in Molecular Genetics 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Asymptotic properties of the likelihood ratio test statistics with the possible triangle constraint in Affected-Sib-Pair analysis

Asymptotic properties of the likelihood ratio test statistics with the possible triangle constraint in Affected-Sib-Pair analysis The Canadian Journal of Statistics Vol.?, No.?, 2006, Pages???-??? La revue canadienne de statistique Asymptotic properties of the likelihood ratio test statistics with the possible triangle constraint

More information

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at Biometrika Trust Robust Regression via Discriminant Analysis Author(s): A. C. Atkinson and D. R. Cox Source: Biometrika, Vol. 64, No. 1 (Apr., 1977), pp. 15-19 Published by: Oxford University Press on

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol 21 no 11 2005, pages 2684 2690 doi:101093/bioinformatics/bti407 Gene expression A practical false discovery rate approach to identifying patterns of differential expression

More information

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University Multiple Testing Hoang Tran Department of Statistics, Florida State University Large-Scale Testing Examples: Microarray data: testing differences in gene expression between two traits/conditions Microbiome

More information

Simulating Properties of the Likelihood Ratio Test for a Unit Root in an Explosive Second Order Autoregression

Simulating Properties of the Likelihood Ratio Test for a Unit Root in an Explosive Second Order Autoregression Simulating Properties of the Likelihood Ratio est for a Unit Root in an Explosive Second Order Autoregression Bent Nielsen Nuffield College, University of Oxford J James Reade St Cross College, University

More information

The Wright-Fisher Model and Genetic Drift

The Wright-Fisher Model and Genetic Drift The Wright-Fisher Model and Genetic Drift January 22, 2015 1 1 Hardy-Weinberg Equilibrium Our goal is to understand the dynamics of allele and genotype frequencies in an infinite, randomlymating population

More information

Classical Selection, Balancing Selection, and Neutral Mutations

Classical Selection, Balancing Selection, and Neutral Mutations Classical Selection, Balancing Selection, and Neutral Mutations Classical Selection Perspective of the Fate of Mutations All mutations are EITHER beneficial or deleterious o Beneficial mutations are selected

More information

Two-stage stepup procedures controlling FDR

Two-stage stepup procedures controlling FDR Journal of Statistical Planning and Inference 38 (2008) 072 084 www.elsevier.com/locate/jspi Two-stage stepup procedures controlling FDR Sanat K. Sarar Department of Statistics, Temple University, Philadelphia,

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

False Discovery Rate

False Discovery Rate False Discovery Rate Peng Zhao Department of Statistics Florida State University December 3, 2018 Peng Zhao False Discovery Rate 1/30 Outline 1 Multiple Comparison and FWER 2 False Discovery Rate 3 FDR

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

False Discovery Control in Spatial Multiple Testing

False Discovery Control in Spatial Multiple Testing False Discovery Control in Spatial Multiple Testing WSun 1,BReich 2,TCai 3, M Guindani 4, and A. Schwartzman 2 WNAR, June, 2012 1 University of Southern California 2 North Carolina State University 3 University

More information

Supplementary Information for Discovery and characterization of indel and point mutations

Supplementary Information for Discovery and characterization of indel and point mutations Supplementary Information for Discovery and characterization of indel and point mutations using DeNovoGear Avinash Ramu 1 Michiel J. Noordam 1 Rachel S. Schwartz 2 Arthur Wuster 3 Matthew E. Hurles 3 Reed

More information

Affected Sibling Pairs. Biostatistics 666

Affected Sibling Pairs. Biostatistics 666 Affected Sibling airs Biostatistics 666 Today Discussion of linkage analysis using affected sibling pairs Our exploration will include several components we have seen before: A simple disease model IBD

More information

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5

Association Testing with Quantitative Traits: Common and Rare Variants. Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 Association Testing with Quantitative Traits: Common and Rare Variants Timothy Thornton and Katie Kerr Summer Institute in Statistical Genetics 2014 Module 10 Lecture 5 1 / 41 Introduction to Quantitative

More information

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics

Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Heritability estimation in modern genetics and connections to some new results for quadratic forms in statistics Lee H. Dicker Rutgers University and Amazon, NYC Based on joint work with Ruijun Ma (Rutgers),

More information

SAMPLE SIZE RE-ESTIMATION FOR ADAPTIVE SEQUENTIAL DESIGN IN CLINICAL TRIALS

SAMPLE SIZE RE-ESTIMATION FOR ADAPTIVE SEQUENTIAL DESIGN IN CLINICAL TRIALS Journal of Biopharmaceutical Statistics, 18: 1184 1196, 2008 Copyright Taylor & Francis Group, LLC ISSN: 1054-3406 print/1520-5711 online DOI: 10.1080/10543400802369053 SAMPLE SIZE RE-ESTIMATION FOR ADAPTIVE

More information

LARGE NUMBERS OF EXPLANATORY VARIABLES. H.S. Battey. WHAO-PSI, St Louis, 9 September 2018

LARGE NUMBERS OF EXPLANATORY VARIABLES. H.S. Battey. WHAO-PSI, St Louis, 9 September 2018 LARGE NUMBERS OF EXPLANATORY VARIABLES HS Battey Department of Mathematics, Imperial College London WHAO-PSI, St Louis, 9 September 2018 Regression, broadly defined Response variable Y i, eg, blood pressure,

More information

Lecture 7 April 16, 2018

Lecture 7 April 16, 2018 Stats 300C: Theory of Statistics Spring 2018 Lecture 7 April 16, 2018 Prof. Emmanuel Candes Scribe: Feng Ruan; Edited by: Rina Friedberg, Junjie Zhu 1 Outline Agenda: 1. False Discovery Rate (FDR) 2. Properties

More information

Statistical issues in QTL mapping in mice

Statistical issues in QTL mapping in mice Statistical issues in QTL mapping in mice Karl W Broman Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~kbroman Outline Overview of QTL mapping The X chromosome Mapping

More information

An Integrated Approach for the Assessment of Chromosomal Abnormalities

An Integrated Approach for the Assessment of Chromosomal Abnormalities An Integrated Approach for the Assessment of Chromosomal Abnormalities Department of Biostatistics Johns Hopkins Bloomberg School of Public Health June 6, 2007 Karyotypes Mitosis and Meiosis Meiosis Meiosis

More information

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015

Lecture 1: Case-Control Association Testing. Summer Institute in Statistical Genetics 2015 Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2015 1 / 1 Introduction Association mapping is now routinely being used to identify loci that are involved with complex traits.

More information

6.207/14.15: Networks Lecture 12: Generalized Random Graphs

6.207/14.15: Networks Lecture 12: Generalized Random Graphs 6.207/14.15: Networks Lecture 12: Generalized Random Graphs 1 Outline Small-world model Growing random networks Power-law degree distributions: Rich-Get-Richer effects Models: Uniform attachment model

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling

Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling Test (2008) 17: 461 471 DOI 10.1007/s11749-008-0134-6 DISCUSSION Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling Joseph P. Romano Azeem M. Shaikh

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

Haplotype-based variant detection from short-read sequencing

Haplotype-based variant detection from short-read sequencing Haplotype-based variant detection from short-read sequencing Erik Garrison and Gabor Marth July 16, 2012 1 Motivation While statistical phasing approaches are necessary for the determination of large-scale

More information

Specific Differences. Lukas Meier, Seminar für Statistik

Specific Differences. Lukas Meier, Seminar für Statistik Specific Differences Lukas Meier, Seminar für Statistik Problem with Global F-test Problem: Global F-test (aka omnibus F-test) is very unspecific. Typically: Want a more precise answer (or have a more

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information