Testing High-Dimensional Count (RNA-Seq) Data for Differential Expression


Scott Hubbard

1 Testing High-Dimensional Count (RNA-Seq) Data for Differential Expression
Utah State University, Fall 2017
Statistical Bioinformatics (Biomedical Big Data), Notes 6

2 References
Anders & Huber (2010), Differential Expression Analysis for Sequence Count Data, Genome Biology 11:R106.
DESeq2 Bioconductor package vignette, obtained in R using vignette("DESeq2").
Kvam, Liu, and Si (2012), A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data, Am. J. of Botany 99(2).
Love, Huber, and Anders (2014), Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2, Genome Biology 15(12):550.

3 Example
3 treated vs. 4 untreated; read counts (RNA-Seq) for 14,470 genes
Published 2010 (Brooks et al., Genome Research)
Drosophila melanogaster
3 samples treated by knock-down of pasilla gene (thought to be involved in regulation of splicing)
[Table: first rows of the count matrix; columns T1 T2 T3 U1 U2 U3 U4; rows indexed by FlyBase gene ID (FBgn...)]

4
# load data
library(pasilla); data(pasillaGenes)
library(DESeq)
# extract the integer count matrix (genes in rows, samples in columns)
eset <- counts(pasillaGenes)
colnames(eset) <- c('T1','T2','T3','U1','U2','U3','U4')
head(eset)

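5 Consider per-gene tests
t-test:
Error in t.test.default(x = c(2L, 2L, 2L, 2L), y = c(1L, 1L, 1L)) : data are essentially constant
[Table: counts for the gene that triggers the error; columns T1 T2 T3 U1 U2 U3 U4. Per the error message, every untreated count is 2 and every treated count is 1, so both within-group variances are zero.]
Nonparametric alternative: Wilcoxon Rank Sum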

6
# try a per-gene t-test
trt <- c(1,1,1,0,0,0,0)
pvals <- rep(NA,nrow(eset))
for(i in 1:nrow(eset))
 {
   x <- eset[i,]
   a1 <- t.test(x~trt)
   pvals[i] <- a1$p.value
 }
# the loop stops with an error; check which gene caused it
i # 1687
eset[i,]
# T1 T2 T3 U1 U2 U3 U4
#
# try a per-gene Wilcoxon rank sum test (allowing for ties)
library(coin)
pvals <- rep(NA,nrow(eset))
for(i in 1:nrow(eset)) # This takes a few minutes
 {
   x <- eset[i,]
   a1 <- wilcox_test(x~as.factor(trt))
   pvals[i] <- pvalue(a1)
 }
hist(pvals, main='pvalues from Wilcoxon Rank Sum Test', cex.main=2, cex.lab=1.5)

7 Consider data as counts (Poisson regression)
On a per-gene basis:
Let $N_i$ = total # of fragments counted in sample $i$
Let $p_i$ = P{ fragment matches to gene in sample $i$ }
Observed # of total reads for gene in sample $i$: $R_i \sim \text{Poisson}(N_i p_i)$
$E[R_i] = \mathrm{Var}[R_i] = N_i p_i$
Let $T_i$ = indicator of trt. status (0/1) for sample $i$
Assume $\log(p_i) = \beta_0 + \beta_1 T_i$
Test for DE using $H_0: \beta_1 = 0$

8 Poisson Regression
$E[R_i] = N_i p_i = N_i \exp(\beta_0 + \beta_1 T_i)$
$\log(E[R_i]) = \log N_i + \beta_0 + \beta_1 T_i$
$\log N_i$: call this the "offset"; often considered the "exposure" for sample $i$ (a quasi-normalization to scale overall genomic material)
$\beta_0$ (intercept): not interesting, but important
Estimate the $\beta$'s using an iterative MLE procedure
Do this for one gene in R (here, gene 2):
trt <- c(1,1,1,0,0,0,0)
R <- eset[2,]
# offset: log of the total count in each sample
lexposure <- log(colSums(eset))
a1 <- glm(R ~ trt, family=poisson, offset=lexposure)
summary(a1)

9
Call:
glm(formula = R ~ trt, family = poisson, offset = lexposure)
Deviance Residuals:
T1 T2 T3 U1 U2 U3 U4
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                              <2e-16 ***
trt
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: on 6 degrees of freedom
Residual deviance: on 5 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 4

10 Do this for all genes
jackpot?

11 Possible (frequent) problem: overdispersion
Recall the [implicit] assumption for the Poisson dist'n: $E[R_i] = \mathrm{Var}[R_i] = N_i p_i$
It can sometimes happen that $\mathrm{Var}[R_i] > E[R_i]$
Common check: add a scale (or dispersion) parameter $\sigma$: $\mathrm{Var}[R_i] = \sigma^2 E[R_i]$
Estimate $\sigma^2$ as $\chi^2/df$
Deviance $\chi^2$, a goodness of fit statistic:
$$\chi^2_D = 2 \sum_i R_i \log\left(\frac{R_i}{\hat R_i}\right)$$
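The overdispersion check can be tried on a single made-up gene before running it genome-wide. This is only a sketch: the counts `R` and library sizes `N` below are hypothetical values chosen for illustration, not taken from the pasilla data. It fits the Poisson regression with an offset, recomputes the deviance by hand from the formula above, and forms the scale estimate as the square root of deviance over residual df.

```r
# Hypothetical toy data: one gene, 3 treated and 4 untreated samples
R   <- c(10, 50, 30, 5, 80, 20, 40)                  # read counts for the gene
N   <- c(1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3) * 1e6    # total fragments per sample
trt <- c(1, 1, 1, 0, 0, 0, 0)

# Poisson regression with log(N) as the offset (exposure)
fit <- glm(R ~ trt, family = poisson, offset = log(N))

# Deviance by hand: 2 * sum( R_i * log(R_i / Rhat_i) )
# (valid here because all counts are positive and the model has an intercept)
fitted_counts <- fitted(fit)
dev_by_hand   <- 2 * sum(R * log(R / fitted_counts))

# Scale estimate: sqrt(deviance / residual df); values well above 1
# suggest overdispersion relative to the Poisson assumption
scale_est <- sqrt(fit$deviance / fit$df.residual)

print(c(glm_deviance = fit$deviance, by_hand = dev_by_hand, scale = scale_est))
```

With counts this variable within each group, the scale estimate comes out well above 1, which is the pattern the slide warns about.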

12
# Poisson regression for all genes, checking for overdispersion
Poisson.p <- scale <- rep(NA,nrow(eset))
lexposure <- log(colSums(eset))
trt <- c(1,1,1,0,0,0,0)
## this next part takes about 1.5 minutes
print(date());
for(i in 1:nrow(eset))
 {
   count <- eset[i,]
   a1 <- glm(count ~ trt, family=poisson, offset=lexposure)
   Poisson.p[i] <- summary(a1)$coeff[2,4]        # p-value for the trt coefficient
   scale[i] <- sqrt(a1$deviance/a1$df.resid)     # scale estimate; > 1 suggests overdispersion
 }; print(date())
par(mfrow=c(2,2))
hist(Poisson.p, main='Poisson', xlab='raw P-value')
boxplot(scale, main='Poisson', xlab='scale estimate'); abline(h=1,lty=2)
mean(scale > 1) #

13 Can use alternative distribution:
edgeR package does this:
For each gene: $R_i \sim$ NegativeBinomial (number of indep. Bernoulli trials to achieve a fixed number of successes)
Let $\mu_i = E[R_i]$, and $v_i = \mathrm{Var}[R_i]$
But low sample sizes prevent reliable estimation of $\mu_i$ and $v_i$
Assume $v_i = \mu_i + \alpha \mu_i^2$
Estimate $\alpha$ by pooling information across genes; then only one parameter must be estimated for each gene
But DESeq2 package improves on this
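The mean-variance relation $v = \mu + \alpha\mu^2$ can be checked by simulation. The sketch below assumes illustrative values `mu = 100` and `alpha = 0.1` (not estimates from any data set) and uses base R's `rnbinom`, whose `size` argument corresponds to $1/\alpha$ under its `mu` parameterization.

```r
set.seed(1)
mu    <- 100   # hypothetical mean count
alpha <- 0.1   # hypothetical dispersion

# Variance implied by the NB assumption v = mu + alpha * mu^2
v_theory <- mu + alpha * mu^2

# rnbinom(mu, size) has variance mu + mu^2 / size, so size = 1 / alpha
x     <- rnbinom(1e5, mu = mu, size = 1 / alpha)
v_sim <- var(x)

# Compare with the Poisson variance, which would equal mu
print(c(poisson_var = mu, nb_var_theory = v_theory, nb_var_simulated = v_sim))
```

The simulated variance should land close to the theoretical value and far above the Poisson variance, which is why a single extra parameter $\alpha$ can absorb the overdispersion.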

14 Negative Binomial (NB) using DESeq2
Define trt. condition of sample $i$: $\rho(i)$
Define # of fragment reads in sample $i$ for gene $k$: $R_{ki} \sim NB(\mu_{ki}, \sigma^2_{ki})$
Assumptions in estimating $\mu_{ki}$ and $\sigma^2_{ki}$:
$\mu_{ki} = q_{k,\rho(i)} \, s_i$
  $s_i$: library size, prop. to coverage [exposure] in sample $i$
  $q_{k,\rho(i)}$: per-gene abundance, prop. to true conc. of fragments
$\sigma^2_{ki} = \mu_{ki} + s_i^2 \, v_{k,\rho(i)}$
  $\mu_{ki}$: shot noise; this dominates for low-expressed genes
  $s_i^2 \, v_{k,\rho(i)}$: raw variance (biological variability)
$v_{k,\rho(i)} = v_\rho(q_{k,\rho(i)})$: smooth function; pool information across genes to estimate variance

15 Estimate parameters (for NB distn.)
m = # samples; n = # genes
$$\hat s_i = \operatorname{median}_k \frac{R_{ki}}{\left(\prod_{j=1}^{m} R_{kj}\right)^{1/m}}$$
The denom. is the geometric mean across samples, like a pseudo-reference sample.
For the median calculation, skip genes where the geometric mean (denom.) is zero.
$\hat s_i$ is essentially equivalent to $\sum_k R_{ki}$, with robustness against very large $R_{ki}$ for some $k$.
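The size-factor formula is easy to reproduce in base R. In the sketch below, `size_factors` is a hypothetical helper name and `toy` is a made-up count matrix in which each sample is an exact multiple of the first; the last gene contains a zero and is skipped, as the slide describes. This is an illustration of the formula, not the package's internal code.

```r
# Median-of-ratios size factors for a genes x samples count matrix
size_factors <- function(counts) {
  log_counts   <- log(counts)              # log(0) = -Inf
  log_geo_mean <- rowMeans(log_counts)     # log of the per-gene geometric mean
  keep         <- is.finite(log_geo_mean)  # skip genes whose geometric mean is zero
  apply(log_counts[keep, , drop = FALSE], 2,
        function(col) exp(median(col - log_geo_mean[keep])))
}

# Hypothetical toy matrix: S2 = 2 * S1 and S3 = 4 * S1 for the first four genes
toy <- cbind(S1 = c(10, 20, 40,  80, 0),
             S2 = c(20, 40, 80, 160, 5),
             S3 = c(40, 80, 160, 320, 7))

sf <- size_factors(toy)
print(sf)   # ratios between samples are recovered: S2 / S1 = 2, S3 / S1 = 4
```

Because the median is taken over genes, a handful of genes with very large counts in one sample would barely move the result, unlike a simple column sum.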

16 Estimate parameters (for NB distn.)
$$\hat q_{k,\rho} = \frac{1}{m_\rho} \sum_{i:\rho(i)=\rho} \frac{R_{ki}}{\hat s_i}$$
$m_\rho$ = # samples in trt. condition $\rho$
This is the mean of the standardized counts from the samples in treatment condition $\rho$.

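17 Estimate parameters (for NB distn.)
$$\hat w_{k,\rho} = \frac{1}{m_\rho - 1} \sum_{i:\rho(i)=\rho} \left(\frac{R_{ki}}{\hat s_i} - \hat q_{k,\rho}\right)^2$$
(this is the variance of the standardized counts from the samples in trt. condition $\rho$)
$$z_{k,\rho} = \frac{\hat q_{k,\rho}}{m_\rho} \sum_{i:\rho(i)=\rho} \frac{1}{\hat s_i}$$
(an un-biasing constant)
Estimate function $w_\rho$ by plotting $\hat w_{k,\rho}$ vs. $\hat q_{k,\rho}$, and use parametric dispersion-mean relation:
$$w_\rho(\hat q_{k,\rho}) = \alpha_1 / \hat q_{k,\rho} + \alpha_0$$
($\alpha_0$ is "asymptotic dispersion"; $\alpha_1$ is "extra Poisson")
$$\hat v_{k,\rho} = \max\left\{\hat w_{k,\rho},\; w_\rho(\hat q_{k,\rho})\right\} - z_{k,\rho}$$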
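For one gene in one treatment condition, the per-gene quantities can be computed in a few lines. The sketch below uses hypothetical counts `R_k` and size factors `s_hat`, and follows the definitions in Anders & Huber (2010): mean and sample variance of the standardized counts, plus the un-biasing constant $z$.

```r
# Hypothetical values for one gene k in one condition rho (3 samples)
R_k   <- c(12, 20, 16)      # raw counts
s_hat <- c(0.8, 1.25, 1.0)  # estimated size factors for those samples
m_rho <- length(R_k)

std_counts <- R_k / s_hat   # standardized counts: 15, 16, 16

q_hat <- mean(std_counts)                    # per-gene abundance estimate
w_hat <- var(std_counts)                     # variance of standardized counts
z_hat <- (q_hat / m_rho) * sum(1 / s_hat)    # un-biasing constant

# Raw variance estimate for this gene alone (w - z); it can be negative
v_raw <- w_hat - z_hat

print(c(q_hat = q_hat, w_hat = w_hat, z_hat = z_hat, v_raw = v_raw))
```

Here the single-gene raw variance estimate is negative, which illustrates why estimates from three or four samples are unreliable on their own and why information is pooled across genes.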

18 Estimating Dispersion in DESeq2
1. Estimate dispersion value $\hat w_k$ for each gene
2. Fit for each condition (or pooled conditions [default]) a curve through estimates (in the $\hat w_k$ vs. $\hat q_k$ plot)
3. Assign to each gene a dispersion value, using the maximum of the estimated [empirical] value $\hat w_k$ or the fitted value $w(\hat q_k)$
-- this conservative approach avoids under-estimating dispersion (which would increase false positives)
(This maximum rule is the original DESeq approach; DESeq2 itself shrinks the gene-wise estimates toward the fitted curve.)

19 Getting started with DESeq2 package
Data in this format (previous slide 3)
Integer counts in matrix form, with columns for samples and rows for genes
Row names correspond to genes (or genomic regions, at least)
See package vignette for suggestions on how to get to this format (including from sequence alignments and annotation)
Can use read.csv or read.table functions to read in text files
Each column is a biological rep
If you have technical reps, sum them together to get a single column

20
# format data
library(DESeq2)
countsTable <- eset
# counts table needs gene IDs in row names
rownames(countsTable) <- rownames(eset)
dim(countsTable) # 14470 genes, 7 samples
conds <- c("t","t","t","u","u","u","u")
# 3 treated, 4 untreated; put in data.frame:
cframe <- data.frame(conds)
# Fit DESeq model (after formatting object):
dds <- DESeqDataSetFromMatrix(countsTable, colData=cframe, design = ~ conds)
ddsctrst <- DESeq(dds)
# check quality of dispersion estimation
par(mfrow=c(1,1))
plotDispEsts(ddsctrst, cex.lab=1.5)

21 Checking Quality of Dispersion Estimation
Plot $\hat w_k$ vs. $\hat q_k$ (both axes log-scale here)
Add fitted line for $w(\hat q_k)$
Check that fitted line is roughly appropriate general trend

22 Test for DE between conditions
Based on contrasts (coming more formally in Notes 7, slides 14-20)

23
log2 fold change (MLE): conds T vs U
Wald test p-value: conds T vs U
DataFrame with 6 rows and 4 columns
  baseMean log2FoldChange pvalue padj
  <numeric> <numeric> <numeric> <numeric>
[Table: six rows indexed by FlyBase gene ID (FBgn...); rows 1, 3, and 4 contain NA]
[Figure: histogram of raw p-values]
Peak near zero: DE genes
Peak nearer one: low-count genes (?)
Default adjustment: BH FDR (?)

24
# test for DE (Wald test, z=est/se{est})
res <- results(ddsctrst, contrast=c("conds","t","u"))
# see results
# (partial columns here just for convenience)
head(res)[,c(1,2,5,6)]
hist(res$pvalue, xlab='raw P-value', cex.lab=1.5, cex.main=2, main='DESeq2, Wald test')
# check to explain missing p-values
t <- is.na(res$pvalue)
sum(t) # 2638, or about 18.2% here
boxplot(res$baseMean[t], cex=2, pch=16)
# -- almost always, only happens
#    for undetected genes
# define sig DE genes ("fdr" is p.adjust's alias for "BH")
padj <- p.adjust(res$pvalue, "fdr")
t <- padj < .05 & !is.na(padj)
gn.sig <- rownames(res)[t]
length(gn.sig) #
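The adjustment step can be tried on a few made-up p-values to see how `p.adjust` treats missing values and to confirm that its "fdr" and "BH" methods coincide. The vector `p` below is hypothetical and only meant to mimic raw p-values with one NA (as happens for undetected genes).

```r
# Hypothetical raw p-values, one of them missing
p <- c(0.01, 0.02, 0.03, 0.5, NA)

padj_bh  <- p.adjust(p, method = "BH")
padj_fdr <- p.adjust(p, method = "fdr")   # alias for "BH"

# NA stays NA after adjustment, so guard the comparison explicitly
sig <- padj_bh < 0.05 & !is.na(padj_bh)

print(padj_bh)
print(sum(sig))   # number of tests called significant at FDR 0.05
```

The missing value is left out of the multiplicity count, so the adjustment here is over four tests rather than five, and the explicit `!is.na()` guard keeps the NA from propagating into the significance indicator.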

25
# check p-value peak nearer 1
counts <- rowMeans(eset)
t <- res$pvalue > 0.8 & !is.na(res$pvalue)
par(mfrow=c(2,2))
hist(log(counts[t]), xlab='[logged] mean count', main='genes with largest p-values')
hist(log(counts[!t]), xlab='[logged] mean count', main='genes with NOT largest p-values')
# -- tends to be genes with smaller overall counts

26 Same example, but with extra covariate
3 samples treated by knock-down of pasilla gene, 4 samples untreated
Of 3 treated samples, 1 was single-read and 2 were paired-end types
Of 4 untreated samples, 2 were single-read and 2 were paired-end types
[Table: first rows of the count matrix; columns TS1 TP1 TP2 US1 US2 UP1 UP2; rows indexed by FlyBase gene ID (FBgn...)]


28
# load data; recall eset object from previous slides
colnames(eset) <- c('TS1','TP1','TP2','US1','US2','UP1','UP2')
head(eset)
# format data and fit model
countsTable <- eset
rownames(countsTable) <- rownames(eset)
trt <- c("t","t","t","u","u","u","u")
type <- c("s","p","p","s","s","p","p")   # s = single-read, p = paired-end
cframe <- data.frame(trt, type)
dds <- DESeqDataSetFromMatrix(countsTable, colData=cframe, design = ~ trt + type)
ddsctrst <- DESeq(dds)
res <- results(ddsctrst, contrast=c("trt","t","u"))
pvals <- res$pvalue
# Visualize sig. results
par(mfrow=c(1,1))
hist(pvals, xlab='raw p-value', cex.lab=1.5, cex.main=2, main='test trt effect while accounting for type')

29
# Visualize sig. results
hist(pvals, xlab='raw p-value', cex.lab=1.5, cex.main=2, main='test trt effect while accounting for type')
# Get sig. genes
adj.pvals <- p.adjust(pvals, "BH")
t <- adj.pvals < .05 & !is.na(adj.pvals)
sum(t) # 708
sig.gn <- rownames(eset)[t]
# Visualize sig. genes
library(RColorBrewer)
small.eset <- eset[t,]
hmcol <- colorRampPalette(brewer.pal(9,"Reds"))(256)
# column side colors: dark for treated, light for untreated
csc <- rep(hmcol[250],ncol(small.eset))
csc[trt=="u"] <- hmcol[10]
heatmap(small.eset, scale="row", col=hmcol, ColSideColors=csc, cexCol=2.5, main=paste(sum(t),'sig. Genes'))

30 Summary
Test count (RNA-Seq) data using Negative Binomial distribution (DESeq2 approach, using contrasts), pooling information across genes
What next?
Adjust for multiple testing
Filtering (to increase statistical power): zero-count genes?
Visualization: Heatmaps / clustering / PCA biplot / others
Characterize significant genes (annotations)
