Statistical tests for differential expression in count data (1)

Size: px

Start display at page:

Download "Statistical tests for differential expression in count data (1)"

Oliver Griffin Simpson
5 years ago
Views:

1 Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course August 2011 Academic Medical Center, Amsterdam

2 The analysis of a microarray experiment Pre-process image analysis to produce relative abundances take the logarithm remove experimental artifacts (normalize) perform 20,000 t tests Correct for multiple testing

3 The model For any particular gene, the log expression levels are assumed to be normally distributed: fit=lm(y~group) or perhaps fit=lm(y~age+group) One might further assume that the gene variances come from a common (Gamma) distribution. In R: limma. Results in shrinkage: the empirical variances are pulled towards their common mean. Favours genes with a large variance.

4 Multiple Testing K tests at level α = Expect 0.05K false discoveries. Bonferroni: K tests at level α = 0.05/K. Expect 0.05 false discoveries. In other words, we expect to make one false discovery in twenty experiments. Perhaps too conservative? Simple solution: increase the level. Unfortunately, this is taboo. The Concise Oxford Dictionary (1996) (1) a system or the act of setting a person or thing apart as sacred, prohibited, or accursed. (2) a prohibition or restriction imposed on certain behaviour, word usage, etc., by social custom.

5 Multiple Testing H 0 true H 0 false H 0 rejected V S H 0 not rejected U T Note: V are the false discoveries. Bonferroni (Holm) controls the FWER FWER = P(V > 0) Benjamini Hochberg controls the FDR ( ) V FDR = E V + S

6 Multiple Testing FDR = E ( ) V V + S We allow not more than 1 in 20 of our discoveries to be false (on average). FDR is less conservative than FWER, yet somehow it does not feel like violating the taboo. Same for NGS data.

7 Testing sets of genes (pathways) Next speaker.

8 Next Generation Sequencing An NGS experiment produces very many counts. not well approximated by the normal distribution (unless counts are large and you take the square root.)

9 Example from Baggerly et al library pheno T+ T+ T+ T+ T+ T T T size tag tag 2 We want to test if tag 1 is more/less abundant in the tumor versus the control group.

10 In the tumor group we have and in the control group we have = = So, tag 1 occurs more than twice as often in the control group. Is this statistically significant?

11 Total count for tag 1 in the tumor group has a binomial distribution with n 1 = and some probability p 1. Total count for tag 1 in the control group has a binomial distribution with n 2 = and some probability p 2. We can use R to test the null hypothesis H 0 : p 1 = p 2 x=c(434,799) n=c(404105,284968) prop.test(x,n) p-value < 2.2e-16. Giddyup!

12 A more flexible approach is to fit a logistic regression model for the probability of observing tag 1. In such a model we could add other covariates besides the presence of absence of the tumor. x=c(129,167,71,61,6,43,247,509) n=c(100474,96631,92510,95785,18705,95155,91593,98220) group=factor(c(0,0,0,0,0,1,1,1)) y=x/n fit=glm(y~group,family=binomial(),weights=n) Again, we find a very significant effect.

13 We put all subjects in each group together, and assumed binomial distributions for the total counts of tag 1. This is only valid if the proportions for each subject within a group are the same. library pheno T+ T+ T+ T+ T+ T T T size tag prop. (%) Do subjects 1 and 2 have the same proportion? No! p= x=c(129,167) n=c(100474,96631) prop.test(x,n)

14 Overdispersion There is biological/measurement variation between subjects. This will increase the variance of the total count: overdispersion. Failure to recognize this, will lead to underestimation of the variance. Underestimation of the variance, will lead to many false positives.

15 We need to account for extra (biological) variation. For every subject, let P i be random. For every subject, X Bin(n i, P i ). Test if the P i differ on average between the two groups.

16 Many ways to deal with overdispersion: log(p i /1 P i ) normally distributed (GLMM) too slow to do a million tests or more. P i beta distributed. R package BBSeq seems to be too slow as well. If n is large and p is small, then we can use Poisson-Gamma or negative binomial. R packages bayseq, DESeq, edger. we ll discuss now.

17 Recall, Binomial(n, p) distribution has mean = np and variance = np(1 p). DESeq and edger assume Negative-Binomial(µ, φ) distribution with mean = µ and variance = µ + φµ 2. Much effort goes into estimating the dispersion φ when very few (biological) replicates are available.

18 Recall, Binomial(n, p) distribution has mean = np and variance = np(1 p). DESeq and edger assume Negative-Binomial(µ, φ) distribution with mean = µ and variance = µ + φµ 2. Much effort goes into estimating the dispersion φ when very few (biological) replicates are available. Here we should point out that it is generally better to increase the sample size, if you can.

19 From the edger manual

20 Much more on bayseq, DESeq and edger later today.

21 Quasi-Binomial distribution A cheap trick is to add an extra dispersion parameter ϕ to the binomial variance = np(1 p) becomes variance = ϕnp(1 p). To estimate, we maximize a quasi likelihood. fit=glm(y~group,family=binomial(),weights=n) simply becomes fit=glm(y~group,family=quasibinomial(),weights=n) Now we find p = 0.12.

22 fit=glm(y~group,family=quasibinomial(),weights=n) accounts for overdispersion. few assumptions, hence robust. standard GLM: fast and reliable software available. standard GLM: theory well-developed. easy to include covariates. extends to dependent data: GEE.

23 We condition on the library sizes, and test if the proportion of any given tag relative to all other tags is different between groups. But, if any tag differs between groups then other tags must do so as well, simply because i p i = 1. In a two sample binomial experiment with X 1 Bin(n 1, p 1 ) and X 2 Bin(n 2, p 2 ) it makes no sense to test both H 0 : p 1 = p 2 and H 0 : (1 p 1 ) = (1 p 2 ).

24 The problem is very noticeable if there are tags that are both very abundant and different beween groups. Anders and Huber (2010) and Robinson and Oshlack (2010) suggest to modify the library sizes so that single tags have little influence. Alternatively, we might take a set of tags that we know won t differ between groups and test all tags of interest relative to that set.

25 Methylation Joint with Jelle Goeman. Bas Heijmans. Elmar Tobi. n is not large and p is not small so bayseq, DESeq and edger cannot be used. Approximation with the normal distribution is also not appropriate.

26 Methylation addition of methyl group (CH 3 ) to cytosine. occurs at CpG sites. inherited during mitosis, not meiosis. stably affects transcription.

27 Jirtle and Skinner, 2007

28 Dutch Famine Well documented, brief period allows the study of the effects of famine on human health. Children exposed to famine in utero compared to siblings: diabetes obesity cardiovascular disease How can we measure methylation?

29 Reduced Representation Bisulphite Sequencing M --CCGG-----CCGG-- --GGCC-----GGCC M M CGG-----C C-----GGC MspI digestion & Size selection ( bp) End repair & Adaptor ligation M M CGG-----CCG GCC-----GGC M Bisulfite treatment M M CGG-----CCG PCR CGG-----CCG GCC-----GGC CGG-----CCA GCC-----GGT Sequencing of short reads CGG GGT Align to reduced representation & Infer methylation GCC-----GGU M

30 Data 24 same sex sibling pairs. One sib was exposed in utero to Dutch Famine. family case cg coverage methylated Three million different CpGs.

31 First, we remove all CpG with insufficient coverage. Next, we test for differences in methylation fraction between exposed and unexposed individuals for every CpGs. Finally, we test various groups of CpG simultaneously. We can account for relatedness by introducing a fixed effect per family, very similar to a paired t test. For more complicated designs, we would need GEE.

32 family case cg coverage (n) methyl. (x) y=x/n fit=glm(y~case+fam,family=quasibinomial(),weights=n)

33 How to test sets of CpGs?

34 How to test sets of CpGs? Next speaker!

Comparative analysis of RNA- Seq data with DESeq2

Comparative analysis of RNA- Seq data with DESeq2 Simon Anders EMBL Heidelberg Two applications of RNA- Seq Discovery Eind new transcripts Eind transcript boundaries Eind splice junctions Comparison Given