BIOINFORMATICS ORIGINAL PAPER

Vol. 21 no. , pages , doi:10.1093/bioinformatics/bti407

Gene expression

A practical false discovery rate approach to identifying patterns of differential expression in microarray data

Gregory R. Grant, Junmin Liu and Christian J. Stoeckert Jr.
Center for Bioinformatics, University of Pennsylvania, 1429 Blockley Hall, 423 Guardian Drive, Philadelphia, PA, USA

Received on December 15, 2004; revised on March 16, 2005; accepted on March 22, 2005
Advance Access publication March 29, 2005

ABSTRACT

Summary: Searching for differentially expressed genes is one of the most common applications for microarrays, yet statistically there are difficult hurdles to achieving adequate rigor and practicality. False discovery rate (FDR) approaches have become relatively standard; however, how to define and control the FDR has been hotly debated. Permutation estimation approaches such as SAM and PaGE can be effective; however, they leave much room for improvement. We pursue the permutation estimation method and describe a convenient definition of the FDR that can be estimated in a straightforward manner. We then discuss issues regarding the choice of statistic and data transformation. It is impossible to optimize the power of any statistic for thousands of genes simultaneously, and we look at the practical consequences of this. For example, the log transform can both help and hurt at the same time, depending on the gene. We examine issues surrounding the SAM fudge factor parameter, and how to handle these issues by optimizing with respect to power.
Availability: Java and Perl implementations are available at www.cbil.upenn.edu/page
Contact: ggrant@pcbi.upenn.edu

1 INTRODUCTION

A common use of microarrays is to find differentially expressed genes between two experimental conditions. The statistical problem of controlling the error rates has proven to be difficult using straightforward classical statistics, and instead the relatively new false discovery rate (FDR) approaches
(Benjamini and Hochberg, 1995) have become widely accepted as appropriate (Ge et al., 2003). Therefore we will not argue the merits of the FDR approach, but will take it as our starting point. The FDR approach is to accept some false positives, while attempting to control their proportion in the set of all genes predicted. Benjamini and Hochberg (1995) did not provide a general and powerful method for achieving FDR control. Instead, their original method relies on strong assumptions, and even when those hold the method can be quite conservative. The method starts with any gene-by-gene p-values; then, for a desired FDR Q, it adjusts the p-values and determines a cutoff so that the genes whose adjusted p-values are smaller than the cutoff have an expected proportion of false positives no more than Q. By describing one method as having greater power than another, we will mean that the method predicts a larger set of genes at the desired FDR.

Many methods have been proposed to make the Benjamini and Hochberg method more general and more powerful (e.g. Benjamini and Yekutieli, 2001; Pounds and Cheng, 2004); however, there are always assumptions required to obtain the gene-by-gene p-values. Typical assumptions are normality of intensity distributions, independence of p-value distributions across genes or identically distributed t-statistics. For the most part, the degree to which these assumptions affect the results on real data is unknown. The methods work well on simulated data, but since we do not yet understand the nature of real data well enough to properly simulate them, it is difficult to compare methods in a meaningful way. When there are only a few replicates available in each condition, and thousands of genes on the array, permutation p-values, or rank-based p-values such as those from the Wilcoxon rank sum statistic, which would overcome some of the parametric assumptions, are too granular to be useful. For example,
with three replicates in each condition, there are only 20 permutations, so an array with 20,000 genes would necessarily clump 1000 of them into the highest significance value, 0.05. The empirical Bayes method of Efron and Tibshirani (2002) is based on the Wilcoxon statistic and so requires a similar number of replicates. PaGE is a permutation-based method which attempts to avoid as many parametric assumptions as possible, while also avoiding the granularity of p-values and rank-based statistics altogether. PaGE (Manduchi et al., 2001) and SAM (Tusher et al., 2001) are methods that attempt to control the FDR without the use of corrected p-values. SAM and PaGE use methods of permutation estimation, as described below.(1) SAM has become a popular application; however, it relies on some heuristics whose consequences are not well documented and which in fact present several limitations, as we will show below. We propose a slightly different definition of the FDR and show why it is more straightforward for permutation estimation. We will then describe a permutation method to control the FDR. The most serious issue regarding any method is the choice of statistic. SAM has relied on a modified t-statistic which depends on an extra parameter. We will show how sensitive results can be to this parameter, and that the SAM method of setting this parameter does not generally have good power properties. The current PaGE approach to this problem is different, as described in Section 8.

(Footnote 1: PaGE 1.0 does not use permutations; the original PaGE algorithm was replaced by a permutation algorithm in a subsequent release.)

(c) The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oupjournals.org

Other issues relating to the statistic and data transformations are discussed. The conclusion is that no statistic is best for finding all differentially expressed genes, and relying on one statistic over another generally involves a tradeoff.

PaGE can be used to find differentially expressed genes between two conditions, and to generate patterns across several conditions. PaGE was introduced by Manduchi et al. (1999), and though the algorithm has changed significantly, the general approach of generating discrete patterns using an FDR-based confidence measure has remained unchanged. Implementations of PaGE are available, in Java and Perl, at www.cbil.upenn.edu/page. A complete description of the PaGE algorithm as implemented is available in the technical manual documentation.

2 DIFFERENTIAL EXPRESSION

We assume that there are two well-defined experimental conditions, and that each gene has a measurable expression intensity that follows some (unknown) distribution in each condition. Differential expression of a gene means that these distributions differ between the two conditions. The distributions can differ in any possible way, but the statistics we use are designed to be sensitive primarily to a difference in the means (e.g. the t-statistic). Even so, the hypotheses being tested are of equality of distributions. This is a necessary consequence of using the permutation methods that we do.

The data are assumed to consist of multiple quantified microarray experiments in each condition. There are many ways to design a study for comparative analysis. We will focus here on the two-sample case where there is a separate measurement in each condition, for each gene. Another popular design, the direct comparison design, is to hybridize the two conditions to the two channels of a two-channel array, respectively, and to generate some number of replicate arrays. The PaGE software handles all of these designs; however, for concision, the direct comparison
design will not be discussed here, nor other cases such as paired or reference designs; those cases are discussed in detail in the technical manual. The theory for all cases is similar.

We begin by considering just two experimental conditions, called condition 0 and condition 1. Condition 0 will be referred to as the reference condition. Up-regulation of a gene will mean the gene's mean intensity is higher in condition 1 as compared to condition 0, and analogously for down-regulation.

3 THE DATA

The data will be assumed to consist of some number m of replicate arrays in condition 0 and some number n of replicates in condition 1. We put the data in a matrix as in Equation (1) below:

         C_1   C_2   ...  C_m   D_1   D_2   ...  D_n
    G_1  c_11  c_12  ...  c_1m  d_11  d_12  ...  d_1n
    G_2  c_21  c_22  ...  c_2m  d_21  d_22  ...  d_2n
    G_3  c_31  c_32  ...  c_3m  d_31  d_32  ...  d_3n
    ...
    G_g  c_g1  c_g2  ...  c_gm  d_g1  d_g2  ...  d_gn      (1)

The m + n columns in Equation (1) correspond to hybridizations and the g rows correspond to genes. The columns labeled with the C_i's correspond to hybridizations from condition 0, and the columns labeled with the D_i's correspond to hybridizations from condition 1.

A permutation of the data consists of choosing some number k of columns from condition 0, then choosing the same number of columns from condition 1, and switching them. We then obtain two new conditions from the first m and the last n columns of the permuted data matrix. What is important in a permutation is which columns end up in which condition, not the order in which they happen to be listed. Therefore there are a total of (m+n choose n) possible permutations in the two-sample case. Denote a row of the data matrix by r. If p is a permutation of the columns, we denote the correspondingly permuted row by r_p.

4 THE STATISTICS

We denote by S any two-class statistic, and think of it as a function which maps rows of the data matrix to real numbers. Suppose there is some center point c such that S > c indicates up-regulation and S < c indicates down-regulation. The statistics
we have in mind are the following.

1. The modified t-statistic

    S((c_1, c_2, ..., c_m), (d_1, d_2, ..., d_n)) = (µ_1 − µ_0) / (α + σ),      (2)

where µ_0 is the mean of (c_1, c_2, ..., c_m), µ_1 is the mean of (d_1, d_2, ..., d_n), and

    σ = sqrt[ (σ_0^2 (m − 1) + σ_1^2 (n − 1)) / (m + n − 2) ],

where σ_0^2 = (1/(m − 1)) Σ_{j=1}^m (c_j − µ_0)^2 and σ_1^2 = (1/(n − 1)) Σ_{j=1}^n (d_j − µ_1)^2. In this case the center c equals 0. When α = 0 this is the standard two-sample t-statistic. As we will see (Section 8), results can be extremely sensitive to the value of α; therefore we refer to α as the t-statistic tuning parameter.(2)

2. The second statistic is the ratio of the means in the two conditions:

    S((c_1, c_2, ..., c_m), (d_1, d_2, ..., d_n)) = µ_1 / µ_0,

where µ_0 is the mean of (c_1, c_2, ..., c_m) and µ_1 is the mean of (d_1, d_2, ..., d_n). In this case the center c = 1.

A statistic need not make sense in all cases. For example, if there are negative intensities in the unlogged data then the ratio statistic is not sensible. We could apply the t-statistic to the unlogged data, to the logged data, or to any transformation of the data. Therefore, even with fixed α, the t-statistic is not one statistic but a family of statistics. This will be discussed further in Section 9. The ratio statistic makes the most sense when the data are on a multiplicative scale, such as two-channel ratios. The log ratio is similar to the modified t-statistic with a large value of α, so we do not consider it separately.

Let M be the data matrix, p a permutation of the columns and M_p the permuted data matrix. Denote the value of the statistic S on row r of M by S_r and on row r of M_p by S_rp.

(Footnote 2: This is analogous to the t-statistic fudge factor introduced by Tusher et al. (2001).)
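For concreteness, the two statistics can be written out in a few lines of numpy. This is our own sketch of Equation (2) and the ratio statistic, not code from the PaGE distribution; the function names are ours:

```python
import numpy as np

def modified_t(c, d, alpha=0.0):
    """Modified two-sample t-statistic (mu1 - mu0) / (alpha + sigma),
    where sigma is the pooled standard deviation.  alpha = 0 gives the
    standard two-sample t-statistic; alpha is the tuning parameter."""
    c, d = np.asarray(c, float), np.asarray(d, float)
    m, n = len(c), len(d)
    # pooled variance: (sigma0^2 (m-1) + sigma1^2 (n-1)) / (m + n - 2)
    sigma = np.sqrt((c.var(ddof=1) * (m - 1) + d.var(ddof=1) * (n - 1))
                    / (m + n - 2))
    return (d.mean() - c.mean()) / (alpha + sigma)

def ratio_of_means(c, d):
    """Ratio statistic mu1 / mu0; center c = 1.  Only sensible for
    positive (unlogged) intensities."""
    return np.mean(d) / np.mean(c)
```

Note that, following the paper's formula, the denominator omits the sqrt(1/m + 1/n) factor of the textbook t-statistic; only the relative ranking of genes matters here.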

5 THE FDR

We will focus on up-regulation in condition 1 versus condition 0. The case of down-regulation follows by switching the roles of conditions 0 and 1. We will also assume that larger values of S are more significant. This is the case for all of the statistics above. The case where smaller values are more significant follows by switching the direction of the inequalities.

For any row r we take the null hypothesis H_r^0 to be that the distribution for any observation in condition 0 is identical to the distribution for any observation in condition 1. Suppose that for g_0 of the rows the null hypothesis is true. Let g_1 = g − g_0. For each real number k > c, let G_k be the set of rows r of M such that S_r ≥ k. G_k is the set of predictions if we use k as the cutoff for the statistic. Let R_k be the size of G_k. Let V_k be the number of rows in G_k for which the null hypothesis is true. With this set-up, we will have made V_k false predictions out of R_k total predictions. Provided R_k > 0, we call the ratio V_k / R_k the false discovery proportion of this set of predictions. We choose k so that this proportion is controlled to a desired level.

There are many ways to define what it means to control this proportion, and our FDR definition differs from the original one of Benjamini and Hochberg (1995), as well as that of Storey (2002) and Storey and Tibshirani (2003). Throughout, we define the FDR of the procedure as

    FDR = E(V_k) / R_k  if R_k > 0,  and 0 if R_k = 0.      (3)

This differs from the original definition of Benjamini and Hochberg (1995), which is given by

    FDR = E(V_k / R_k)  if R_k > 0,  and 0 if R_k = 0.      (4)

The advantage of the original definition (4) is that it takes into account the dependence between V_k and R_k. However, an advantage of definition (3) is that it can be more realistically estimated via permutation distributions.(3) The goal is to find the least conservative (i.e. smallest) value of k so that (3) is acceptably low. Sometimes one is willing to tolerate a relatively high FDR, such as
0.5; other times a low FDR such as 0.05 is desired.

6 FDR ESTIMATION

For each permutation p of the data matrix we obtain a value V_k^p, the number of rows whose permutation statistic S_rp ≥ k. Thus we obtain a permutation distribution D_k of V_k under the complete null hypothesis (that is, when all null hypotheses are true). Note that the distribution D_k depends on the joint distribution of the S_r over all genes, and the joint distribution of the S_r restricted to the null genes is maintained by permuting the data matrix in columns. Let µ_k be the mean of D_k.

(Footnote 3: Indeed, to estimate definition (4) one needs to know something about the random properties of V/R. If we permute the columns of the data matrix (1), we can obtain some kind of approximation to an observation of V under the complete null hypothesis, but this tells us nothing about V/R under the true distribution of the data. The bootstrap distribution obtained by sampling with replacement from the two conditions separately will give us an approximation to the distribution of R, but again tells us nothing about V/R. To obtain information about V/R most authors have had to make strong assumptions about the data.)

Typically permutation distributions are used to derive p-values, for which there is substantial theory. Here, however, we are interested in actually estimating E(V_k) from the permutation distribution, and this requires some justification. Note that similar permutation estimates are utilized in the SAM theory of Storey and Tibshirani (2003); however, they do not address their properties (Pan, 2003).

There are two problems in using µ_k as an estimate of E(V_k). First, since it is calculated under the complete null hypothesis, it is at best a measure of how many hypotheses would be falsely rejected if they were all true. So, assuming that some hypotheses are false, µ_k would be an overestimate. Second, unless all hypotheses are true, the false hypotheses can cause the distribution of V_k to be different from what it would be if we could consider only the true hypotheses in defining D_k. Since we do not know which hypotheses are true, we must allow the false hypotheses to contribute to the counts involved in D_k.

Regarding the second issue, we argue that µ_k is conservative. Suppose that null hypothesis r is false. Then for permutations p which switch only one or a few columns between conditions, the false hypothesis r will tend to have large values of the statistic S_rp, and will therefore tend to contribute to the count V_k^p more than it would if H_r^0 were true. Similarly, down-regulated genes will tend to over-contribute to the count V_k^p for those permutations which switch most or all of the columns between the conditions. Therefore the estimate µ_k will tend to be larger than the true value of E(V_k). The more hypotheses that are false, the more conservative µ_k will be.

Turning to the first issue, since µ_k is an overestimate of E(V_k),

    R_k − µ_k      (5)

is an underestimate of the number of true positives (the rows in G_k for which the null hypothesis is false). Therefore g − (R_k − µ_k) is an overestimate of the number of true hypotheses. Originally µ_k was calculated as an estimate of V_k assuming all hypotheses are true. If we recalculate assuming there are g − (R_k − µ_k) true hypotheses, then we obtain

    µ_k(1) = (µ_k / g) [g − (R_k − µ_k)].

Since g − (R_k − µ_k) is an overestimate of the number of true hypotheses, µ_k(1) is still an overestimate of E(V_k); however, it is a better estimate than µ_k. Using the same logic, we calculate

    µ_k(2) = (µ_k / g) [g − (R_k − µ_k(1))]

and in general

    µ_k(i + 1) = (µ_k / g) [g − (R_k − µ_k(i))].

This sequence is decreasing and bounded below, and therefore converges. In fact it converges quickly, and PaGE takes µ_k(n) as its final estimate for V_k, where |µ_k(n) − µ_k(n − 1)| < 0.0001. We denote µ_k(n) by Ṽ_k. We take as estimate of the FDR

    FDR_k = Ṽ_k / R_k.

It is useful to also define the quantity CONF_k = 1 − FDR_k. CONF_k is an estimate of the probability that any gene taken at random from G_k is a true positive.
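The estimation procedure of this section can be condensed into a short sketch. The helper below is our own illustration, not the PaGE implementation; it enumerates every column assignment, which is only feasible for small m + n, and assumes larger statistic values are more significant:

```python
import numpy as np
from itertools import combinations

def fdr_estimate(data0, data1, stat, k):
    """Permutation FDR estimate for cutoff k: compute mu_k from the
    permutation distribution, iterate to the corrected estimate, and
    return FDR_k (CONF_k = 1 - FDR_k)."""
    m, n = data0.shape[1], data1.shape[1]
    M = np.hstack([data0, data1])
    g = M.shape[0]
    S = np.array([stat(row[:m], row[m:]) for row in M])
    R_k = int((S >= k).sum())
    if R_k == 0:
        return 0.0
    # V_k^p for every assignment of m of the m + n columns to condition 0;
    # mu_k is the mean of this permutation distribution D_k
    V = []
    for cols0 in combinations(range(m + n), m):
        cols1 = [j for j in range(m + n) if j not in cols0]
        Sp = np.array([stat(row[list(cols0)], row[cols1]) for row in M])
        V.append((Sp >= k).sum())
    mu = float(np.mean(V))
    # iterate mu_k(i+1) = (mu_k / g) * (g - (R_k - mu_k(i))) to convergence
    v = mu
    while True:
        v_new = (mu / g) * (g - (R_k - v))
        if abs(v_new - v) < 1e-4:
            break
        v = v_new
    return min(v_new / R_k, 1.0)
```

Any row statistic from Section 4 can be passed as `stat`, e.g. a plain difference of means.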

We assign confidences not just to the sets G_k, but to the rows of the data matrix themselves, by

    CONF_r = min_{k such that r ∈ G_k} CONF_k,

where r is a row of the data matrix. In this way, we have confidence of at least γ that a row r with CONF_r = γ represents a truly differentially expressed gene.

7 LEVELS AND PATTERNS

SAM offers a multi-class mode which tests the hypothesis that the means in all conditions are equal. This does not, however, give an indication of what exactly is going on in which conditions, or a way of sorting the genes by their behavior across the conditions. In order to visualize the behavior of genes across multiple conditions, PaGE was designed to perform a series of pairwise comparisons and to generate patterns from the results (Manduchi et al., 2000). One condition is chosen as a reference and the remaining conditions are compared to this reference condition, which we call condition 0. If there are n conditions including condition 0, then a simple way to generate patterns of length n − 1 would be to put +1 in position i if there is up-regulation at the desired confidence in condition i versus the reference, and similarly −1 if there is down-regulation; if there is no differential regulation at the desired confidence, then put 0. This would allow one to see where the differential expression is happening.

However, in order to make this strategy more descriptive, we do not restrict to patterns of 0 and ±1, but use the rest of the integers as well. Higher levels represent a higher confidence of differential expression than lower levels. The levels are determined as follows. First, the user chooses a confidence γ. A cutoff C for the statistic is determined by setting it to the minimum k for which CONF_k > γ, if such a C exists (CONF_k is defined in Section 6). The set of all rows r such that S_r > C then gives the least conservative set of predictions with a confidence of at least γ. Depending on the data, there
may not be any such value of C that achieves confidence γ, in which case C is set to be infinite. Now, depending on whether the statistic is on an additive scale (such as the t-statistic) or on a multiplicative scale (such as the ratio statistic), the levels are created differently. In the additive case, if the statistic S_r < C, then row r is given level 0. If C ≤ S_r < 2C, then it is given level 1. If 2C ≤ S_r < 3C, then it is given level 2. In general, if nC ≤ S_r < (n + 1)C, then it is given level n. If the statistic is on a multiplicative scale, then row r is given level n if C^n ≤ S_r < C^(n+1). When the data are multiplicative and the ratio statistic is used, the patterns take on the intuitive meaning of fold-change.

The parameter γ is referred to as the level confidence. As one raises the level confidence, fewer levels are produced and the genes assigned to the levels have higher confidence. We have found this to be a convenient way of visualizing the results of a study with many conditions.

8 THE T-STATISTIC TUNING PARAMETER

When using the t-statistic, the number of genes found at the desired confidence can be dramatically affected by the value of the tuning

Table 1. The effect of the t-statistic tuning parameter α on the number of genes predicted. [Table values not preserved in this transcription.] The data consist of three replicates, direct comparison simulated data of 1000 rows, with 50 differentially expressed (up-regulated) genes. Each column gives the number of genes found at confidence = 0.5 for the corresponding value of α. The data are available at www.cbil.upenn.edu/page/doc/testdata0.txt

Table 2. Similar to Table 1, the effect of the t-statistic tuning parameter on real data consisting of four replicates per condition, log-transformed two-class mouse pancreas cells 0 and 2 h post-partial hepatectomy [data published in White et al. (2005)]. [Table values not preserved in this transcription.] Differential expression results were obtained at confidence = 0.8. The data are available at www.cbil.upenn.edu/page/doc/0vs2.txt

Table 3. Similar to Tables 1 and 2, the effect of the t-statistic tuning
parameter on real data consisting of eight replicates of direct comparison data. [Table values not preserved in this transcription.] The cells were taken from pig heart valve endothelial tissue, comparing regions subject to two different kinds of hemodynamic flow [data published in Simmons et al. (2005)]. Differential expression results were obtained at confidence = 0.8. The data are available at www.cbil.upenn.edu/page/doc/pig1.txt

parameter α in Equation (2). This is particularly true when there are only a few replicates per condition. The FDR is conservatively estimated regardless of the value of α, so choosing a value of α which maximizes the number of results at the desired confidence is desirable. If too many genes are found, the confidence can be raised to find a smaller set. We observe this effect empirically in a wide range of datasets, both real and simulated; see for example the data in Tables 1-3. In these tables the number of genes found differentially expressed, at a fixed confidence, is given as a function of α. Table 1 represents simulated data while Tables 2 and 3 are from real data.

For the simulated data, 50 genes are up-regulated out of 1000 genes. The intensities in each row for each condition are given by beta distributions. With beta distributions, by varying the two parameters as well as the range, we can produce heterogeneous behavior: from unimodal, to bimodal, to highly skewed. Since gene expression data are highly heterogeneous, the beta provides a better model than the normal with respect to the heterogeneity of distributions. For the complete details of the simulation engine, see Grant et al. (2005; www.cbil.upenn.edu/expression_simulator/).
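The sensitivity illustrated by Tables 1-3 can be reproduced in miniature. In the sketch below the replicate values are hypothetical and `mod_t` is our own helper, not PaGE code; it shows how the choice of α reorders two genes, with a small-difference, small-variance gene dominating at α = 0 and a large-difference, large-variance gene dominating once α is large:

```python
import numpy as np

def mod_t(c, d, alpha):
    """Modified t of Equation (2): (mu1 - mu0) / (alpha + pooled sd)."""
    c, d = np.asarray(c, float), np.asarray(d, float)
    pooled = np.sqrt((c.var(ddof=1) * (len(c) - 1) + d.var(ddof=1) * (len(d) - 1))
                     / (len(c) + len(d) - 2))
    return (d.mean() - c.mean()) / (alpha + pooled)

# Gene A: tiny mean difference and tiny variance (hypothetical values)
a0, a1 = [10.00, 10.01, 9.99], [10.10, 10.11, 10.09]
# Gene B: large mean difference and large variance
b0, b1 = [10.0, 30.0, 20.0], [60.0, 80.0, 70.0]

# At alpha = 0, gene A outranks gene B (t = 10 vs 5); at alpha = 10
# the ordering reverses (t is about 0.01 vs 2.5)
```

No single α can favor both genes at once, which is exactly the tradeoff the text describes.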

In any case, the sensitivity to α shown in Table 1 was not dependent on the arbitrary parameter choices used to simulate the data, but is a general phenomenon we found to varying degrees regardless of whether the dataset was real or simulated. Though the nature of the dependence on α varies from dataset to dataset, the dependence itself is typical.

The reason for this dependence on α is that the standard t-statistic (α = 0) is large for those genes with small variances, and the algorithm is forced to be more conservative in its predictions to avoid picking them up. As α grows, this effect is minimized; however, when α is too large, then for the differentially expressed genes with small mean difference and small variance, α dominates the statistic, which tends to obscure the differential expression of these genes in the noise. Therefore some genes get lost as α goes down, while other genes get lost as α goes up. What value to set α to depends on the nature of the differentially expressed genes as well as the non-differentially expressed genes. There is no known formula that can be applied to the data matrix to determine the value of α which maximizes the power. However, since the (expected) confidence of the set of predicted genes is the same regardless of α, a power criterion to determine α is desirable, and we attempt to choose α to maximize the power. PaGE tries a range of 10 values of α, from small to large, and chooses the one which gives the greatest number of results. This is the default value of α. Other values of α can, however, find genes that the default value misses, as we will see below. Therefore it is important that the user has control over this parameter.

Since we are taking the maximum over 10 values of α, we potentially introduce another multiple testing issue. If these were 10 independent runs this problem might become serious, but they are highly dependent, even for very different values of α, and simulation studies indicate that the confidence is
not significantly affected. The extreme at which this effect might have a significant impact on the FDR is when there are no differentially expressed genes and a high FDR is requested. In this case a false positive set might consist of one or a very few genes, and the maximization can have a significant effect on the proportion because the numbers involved are small. Because of this we do not recommend raising the FDR much higher than 0.5 when very few genes are found.

The SAM (Tusher et al., 2001) algorithm uses an approach which depends on a smoothing criterion to select α. The rationale is that the t-statistic distributions should be identical for all (null) genes, so they impose a uniformity criterion on the t-statistics to determine α. They do not, however, present theory showing that their criterion achieves reasonably optimal power, and in fact this method can give values of α that are quite far from optimal with regard to the power of the results. This could be due to the fact that they are smoothing over all genes and not just the null ones.

To illustrate this we generated several simulated datasets, using again the simulation engine from Grant et al. (2005). The first dataset has 5000 genes, 300 of which are differentially expressed. Differentially expressed genes have varying mean differences. The first 25 rows have mean difference 1.2 between the two conditions. The next 25 rows have mean difference 1.3. The next have 1.4, etc. Specifically, the 12 blocks of 25 have mean differences 1.2, 1.3, 1.4, 1.5, 2, 2.2, 2.4, 2.6, 4, 4.3, 4.6 and 4.9, respectively. This gives us 100 with relatively small mean differences, 100 with medium mean differences and 100 with relatively large mean differences. These are the rows numbered 0-299. The remaining 4700 non-differentially

Table 4. The power of the results as a function of the t-statistic tuning parameter. [Table values not preserved in this transcription.] The realized confidence is the actual confidence achieved.

Table 5. The effect of the t-statistic tuning parameter on different spectra
of differentially expressed genes. [Table values not preserved in this transcription.]

expressed genes were generated by 4700 randomly chosen beta distributions with randomly chosen parameters. Similarly to the data simulated for Table 1, the intensities in each condition are given by beta distributions. For the differentially expressed genes, the variances of the distributions increase over each block of 25 from very low to very high. So row 0 has µ_1 − µ_0 = 1.2 and very low variances, row 1 has µ_1 − µ_0 = 1.2 but slightly higher variances, up to row 24 which has µ_1 − µ_0 = 1.2 and high variances. Row 25 then has µ_1 − µ_0 = 1.3 with low variances, row 26 has µ_1 − µ_0 = 1.3 with slightly higher variances, etc. In this way a full range of mean differences and variances is represented in the data. The full dataset can be downloaded at www.cbil.upenn.edu/page/doc/testdata1.txt

Using the first three replicates of each condition, we ran PaGE with five different values of α and also ran SAM. Table 4 has a summary of the results obtained with an 80% confidence cutoff for five values of α and SAM. The realized confidence is the actual confidence achieved, which can be determined since this is simulated data and we know exactly which genes are the true positives and true negatives. Table 5 breaks down the results by type of differentially expressed gene. Note that the number of genes found in each row of the table is maximized on the diagonal of the shaded portion. Therefore small values of α find genes with small mean differences and small variance; large values of α find genes with large mean differences and large variance.

The above example is far from the worst case. If the non-differentially expressed genes have bimodal distributions and the differentially expressed genes have large variance, then SAM performance can be quite far from optimal. The page www.cbil.upenn.edu/page/doc/example1.htm has the complete results for the genes found at 0.8 confidence or higher. Columns represent different runs for different values of α. The final column gives the SAM results. The top three rows of the output consist of the actual values of R − V (the number of true positives), R (the total number of predictions) and the realized confidence. The 300 differentially expressed genes are listed, numbered 0-299, and an X means that that gene was found in that run.
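The simulated data described above can be approximated with a simple generator. This is a rough stand-in for the simulation engine of Grant et al. (2005), not the engine itself; the rescaled beta and the default shape parameters are our own assumptions:

```python
import numpy as np

def beta_intensities(mean, spread, n, rng, a=2.0, b=2.0):
    """Draw n intensities from a beta(a, b) distribution rescaled to
    [mean - spread/2, mean + spread/2].  With a = b the distribution is
    symmetric, so the sample mean is near `mean`; skewed or bimodal
    shapes come from other (a, b) choices."""
    lo = mean - spread / 2.0
    return lo + spread * rng.beta(a, b, size=n)

def de_gene(base_mean, mean_diff, spread, m, n, rng):
    """One differentially expressed row: m replicates in condition 0,
    n replicates in condition 1, with the given mean difference."""
    c = beta_intensities(base_mean, spread, m, rng)
    d = beta_intensities(base_mean + mean_diff, spread, n, rng)
    return c, d
```

Stacking 300 such rows (blocks of 25 with increasing mean differences and variances) plus 4700 null rows with random (a, b) reproduces the structure of the dataset described in the text.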

Table 6. Simulated dataset of 1000 genes, with 100 differentially expressed, and with the remaining genes each having a bimodal distribution. [Table values (numbers of PaGE and SAM predictions and true positives, by confidence) not preserved in this transcription.]

Table 7. Comparison of results using the ratio of means statistic versus the t-statistic, on simulated data. [Table values (confidence under each statistic, by gene ID) not preserved in this transcription.]

SAM produces results very close to the setting α = 0.1. The power is maximized around α = 2.5. The PaGE default is α = 3.5. At this value 160 genes are reported, as opposed to SAM's 145. But the overlap between the two sets is only 125 genes. SAM finds 20 that PaGE does not with the default α, and PaGE finds 35 that SAM does not. Note that it is not necessary to use the PaGE default, and in the software the user has control over the value of α. The default is simply designed to maximize the number of results.

It is important to keep in mind that an FDR only makes sense within the context of a set of predictions. Even though we apply confidences to the individual genes in the sets, these confidences mean the chance of making a mistake when considering genes at random from among the set of predictions made. Therefore, if one merges sets of genes found from different runs, the FDR may be increased. If two methods produce sets with the same FDR, but the results overlap proportionally more on the true positives than the false positives, then the FDR of the union of the two sets will be greater than the FDR of either set individually. So when looking for more genes, it is not generally a good idea to use all possible choices of parameter settings and transformations and merge the results, but rather to find the one or few that work best, and always take the meaning of the confidences in the context of the separate sets of results.

The dataset and parameter settings in the previous example were not particular; we repeated this
with many variations, and the same behavior was seen in general. In fact, it is possible to generate simple datasets for which SAM performs much worse. To demonstrate this, we generated a simulated dataset of 1000 genes, with 100 differentially expressed, and with the non-differentially expressed genes each having a bimodal distribution. The dataset, SAM q-values and the PaGE confidences can be obtained at www.cbil.upenn.edu/page/doc/files/bmod.html. The q-values are equal to one minus the confidence. The lowest q-values produced by SAM on this dataset are 0.5 (14 genes). In contrast, PaGE reports 17 genes at confidence > 0.8, all but one of which are true positives (Table 6).

9 THE CHOICE OF STATISTIC, TRANSFORMATION AND OTHER PARAMETERS

9.1 Using the ratio of means versus the t-statistic

For most datasets the user will probably want to start with the t-statistic option. If the t-statistic does not return many results, then, keeping in mind the caveats of the previous section about multiple runs, one can try the other statistic or the log transformation options.

(Note to Table 7: Genes with ID 0 are low-intensity differentially expressed. Genes with ID 1 are high-intensity differentially expressed. Genes with IDs 2-99 are medium-intensity non-differentially expressed. The bottom row shows the maximum confidence achieved by all other genes. Each method picks up one of the two differentially expressed genes at high confidence.)

Table 8. Comparison of results using the logged versus unlogged data with the t-statistic. Same data as in Table 7. [Table values not preserved in this transcription.] The bottom row shows the maximum confidence achieved by all other genes. Using the unlogged data was much better at finding gene 1, while using the logged data performed better on gene 0 and completely lost gene 1 in the noise.

Even if there are many genes found, however, different statistics can pick up different kinds of differential expression, as we saw in the previous section regarding using the t-statistic
with different values of the tuning parameter α. To illustrate this further, we generated a simulated dataset with two conditions, 100 genes and four replicates per condition. Two of the genes are differentially expressed. The 98 non-differentially expressed genes have moderate intensity (beta-distributed, with mean 50 and spread 35). Gene 0 is differentially expressed in the low-intensity range (means of 4 and 9 in the two conditions, respectively). Gene 1 is differentially expressed in the high-intensity range (means of 400 and 450 in the two conditions, respectively). The data are available at cbil.upenn.edu/page/doc/testdata2.txt. Table 7 shows the results using the ratio of means statistic (left) and the t-statistic (right). Using the ratio of means, the low-intensity differentially expressed gene (gene 0) is much more significant than the high-intensity differentially expressed gene (gene 1). Conversely, when using the t-statistic, the high-intensity differentially expressed gene is much more significant than the low-intensity one.

9.2 Using logged versus unlogged data

The caveats about the choice of statistic apply also to the different possible data transformations one can perform. Perhaps the most common is the log transformation. PaGE offers the option of performing this transformation when one is using the t-statistic. Using the same test dataset as above, Table 8 shows what happens to the confidences of genes 0 and 1 when the data are logged versus unlogged: with the logged data the confidence of gene 0 goes up, while the confidence of gene 1 goes down.
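The intensity dependence described for Table 7 can be sketched in a few lines of code. The replicate values below are hypothetical stand-ins, not the paper's actual testdata2 file: gene 0 mimics the low-intensity case (means 4 vs 9, with relatively large noise) and gene 1 the high-intensity case (means 400 vs 450):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    # Unbiased sample variance.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def t_stat(a, b):
    # Two-sample t-statistic with unequal variances (Welch form).
    return (mean(b) - mean(a)) / math.sqrt(var(a) / len(a) + var(b) / len(b))

def ratio_of_means(a, b):
    return mean(b) / mean(a)

# Hypothetical replicates (four per condition), invented for illustration:
# gene 0 differentially expressed at low intensity, gene 1 at high intensity.
gene0_a, gene0_b = [2, 4, 6, 4], [7, 9, 11, 9]
gene1_a, gene1_b = [390, 400, 410, 400], [440, 450, 460, 450]

print(ratio_of_means(gene0_a, gene0_b))        # 2.25  -- gene 0 wins on ratio
print(ratio_of_means(gene1_a, gene1_b))        # 1.125
print(round(t_stat(gene0_a, gene0_b), 2))      # 4.33
print(round(t_stat(gene1_a, gene1_b), 2))      # 8.66  -- gene 1 wins on t
```

With these numbers the ratio of means ranks gene 0 far above gene 1 (2.25 vs 1.125), while the t-statistic reverses the order (about 4.33 vs 8.66), matching the behavior the text describes for Table 7.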

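The log-transform trade-off of Section 9.2 can also be checked arithmetically: on logged data the t-statistic is driven by the difference of mean log-intensities, which is the log of a ratio of geometric means. The sketch below uses invented replicate values (not the paper's dataset): a low-intensity differentially expressed gene, a high-intensity one, and a null gene with no true change but typical replicate noise at moderate intensity:

```python
import math

def log_effect(a, b):
    # Difference of mean log-intensities = log of the geometric-mean ratio,
    # the quantity the t-statistic numerator measures on logged data.
    la = sum(math.log(x) for x in a) / len(a)
    lb = sum(math.log(x) for x in b) / len(b)
    return lb - la

# Hypothetical replicates, invented for illustration.
gene0 = ([2, 4, 6, 4], [7, 9, 11, 9])                  # low intensity, means 4 vs 9
gene1 = ([390, 400, 410, 400], [440, 450, 460, 450])   # high intensity, 400 vs 450
null_gene = ([30, 45, 60, 55], [50, 70, 40, 65])       # no true change, noisy

print(round(log_effect(*gene0), 3))      # about 0.87  -- large log effect
print(round(log_effect(*gene1), 3))      # about 0.118 -- small log effect
print(round(log_effect(*null_gene), 3))  # about 0.179 -- noise alone exceeds gene 1
```

The null gene's noise-driven log effect exceeds gene 1's true log effect, so on the log scale gene 1 sinks into the noise, while gene 0's log effect stands well clear of it.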
Thus one cannot trust either approach to perform better for all genes simultaneously. This happens because applying logs to the data and then applying the t-statistic, which focuses on differences, is similar to first taking ratios and then taking logs. Gene 1 in the above example, whose intensities are in the high range of the spectrum, has a relatively small ratio compared to the null genes, whose intensities are in a lower range of the spectrum.

9.3 Other statistics

Statistics that are invariant to monotonic transformations, for example the Wilcoxon rank-sum statistic, would allow us to avoid some of the issues above. However, without sufficiently many replicates, the Wilcoxon statistic suffers from the same granularity problems as permutation p-values. Once there are enough replicates for it to work, the null genes have a higher chance of appearing significant with the Wilcoxon, because it puts small and large differences on an equal footing and we are comparing the statistic across many genes. This can obscure the true signal, and as a result we were unable to find any datasets for which this method gave superior results. We experimented with other statistics as well, including using p-values themselves as statistics, and even adjusted p-values. We were unable to find a significant example where these gave improved results, so we do not include them as options in the PaGE implementation.

Ultimately there is no best statistic or transformation. A statistic can often be optimized for a single test, and there is substantial statistical theory about how to do this; but a statistic cannot typically be optimized for thousands of genes at once. Unfortunately each dataset is particular and must be treated as a special case, but by starting with the defaults the user can home in on reasonable parameter settings to suit their needs.

ACKNOWLEDGEMENTS

We thank Dr Elisabetta Manduchi and Warren Ewens for valuable comments and discussions. This work was supported in
part by NIH grant K25-HG A1.

REFERENCES

Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57.
Benjamini,Y. and Yekutieli,D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29.
Efron,B. and Tibshirani,R. (2002) Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol., 23.
Ge,Y. et al. (2003) Resampling-based multiple testing for DNA microarray data analysis. Test, 12, 1-44.
Grant,G.R., Sokolowski,S. and Stoeckert,C.J., Jr (2005) Performance analysis of differential expression prediction algorithms using simulated array data. Technical Report.
Manduchi,E. et al. (2000) Generation of patterns from gene expression data by assigning confidence to differentially expressed genes. Bioinformatics, 16.
Pan,W. (2003) On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19.
Pounds,S. and Cheng,C. (2004) Improving false discovery rate estimation. Bioinformatics, 20.
Simmons,C.A. et al. (2005) Spatial heterogeneity of endothelial phenotypes correlates with side-specific vulnerability to calcification in normal porcine aortic valves. Circ. Res., in press.
Storey,J.D. (2002) A direct approach to false discovery rates. J. R. Statist. Soc. B, 64.
Storey,J.D. and Tibshirani,R. (2003) In Parmigiani,G., Garrett,E.S., Irizarry,R.A. and Zeger,S. (eds), The Analysis of Gene Expression Data. Springer, New York.
Tusher,V.G. et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98.
White,P. et al. (2005) Identification of transcriptional networks during liver regeneration. J. Biol. Chem., 280.


More information

Fast and Accurate Causal Inference from Time Series Data

Fast and Accurate Causal Inference from Time Series Data Fast and Accurate Causal Inference from Time Series Data Yuxiao Huang and Samantha Kleinberg Stevens Institute of Technology Hoboken, NJ {yuxiao.huang, samantha.kleinberg}@stevens.edu Abstract Causal inference

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

DIC: Deviance Information Criterion

DIC: Deviance Information Criterion (((( Welcome Page Latest News DIC: Deviance Information Criterion Contact us/bugs list WinBUGS New WinBUGS examples FAQs DIC GeoBUGS DIC (Deviance Information Criterion) is a Bayesian method for model

More information

Chapter 7: Simple linear regression

Chapter 7: Simple linear regression The absolute movement of the ground and buildings during an earthquake is small even in major earthquakes. The damage that a building suffers depends not upon its displacement, but upon the acceleration.

More information

Unit 14: Nonparametric Statistical Methods

Unit 14: Nonparametric Statistical Methods Unit 14: Nonparametric Statistical Methods Statistics 571: Statistical Methods Ramón V. León 8/8/2003 Unit 14 - Stat 571 - Ramón V. León 1 Introductory Remarks Most methods studied so far have been based

More information

Physics 509: Non-Parametric Statistics and Correlation Testing

Physics 509: Non-Parametric Statistics and Correlation Testing Physics 509: Non-Parametric Statistics and Correlation Testing Scott Oser Lecture #19 Physics 509 1 What is non-parametric statistics? Non-parametric statistics is the application of statistical tests

More information

DEPARTMENT OF ENGINEERING MANAGEMENT. Two-level designs to estimate all main effects and two-factor interactions. Pieter T. Eendebak & Eric D.

DEPARTMENT OF ENGINEERING MANAGEMENT. Two-level designs to estimate all main effects and two-factor interactions. Pieter T. Eendebak & Eric D. DEPARTMENT OF ENGINEERING MANAGEMENT Two-level designs to estimate all main effects and two-factor interactions Pieter T. Eendebak & Eric D. Schoen UNIVERSITY OF ANTWERP Faculty of Applied Economics City

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE 1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 4 Problems with small populations 9 II. Why Random Sampling is Important 10 A myth,

More information

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Statistics Journal Club, 36-825 Sangwon Justin Hyun and William Willie Neiswanger 1 Paper Summary 1.1 Quick intuitive summary

More information

Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses

Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses Amit Zeisel, Or Zuk, Eytan Domany W.I.S. June 5, 29 Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving

More information

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo PROCEDURES CONTROLLING THE k-fdr USING BIVARIATE DISTRIBUTIONS OF THE NULL p-values Sanat K. Sarkar and Wenge Guo Temple University and National Institute of Environmental Health Sciences Abstract: Procedures

More information

A first model of learning

A first model of learning A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) We observe the data where each Suppose we are given an ensemble of possible hypotheses / classifiers

More information

Hypothesis Testing with the Bootstrap. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods

Hypothesis Testing with the Bootstrap. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods Hypothesis Testing with the Bootstrap Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods Bootstrap Hypothesis Testing A bootstrap hypothesis test starts with a test statistic

More information

Unit 19 Formulating Hypotheses and Making Decisions

Unit 19 Formulating Hypotheses and Making Decisions Unit 19 Formulating Hypotheses and Making Decisions Objectives: To formulate a null hypothesis and an alternative hypothesis, and to choose a significance level To identify the Type I error and the Type

More information

Lecture 27. December 13, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Lecture 27. December 13, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

A Practical Approach to Inferring Large Graphical Models from Sparse Microarray Data

A Practical Approach to Inferring Large Graphical Models from Sparse Microarray Data A Practical Approach to Inferring Large Graphical Models from Sparse Microarray Data Juliane Schäfer Department of Statistics, University of Munich Workshop: Practical Analysis of Gene Expression Data

More information

Statistical tests for differential expression in count data (1)

Statistical tests for differential expression in count data (1) Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image

More information