Unit-free and robust detection of differential expression from RNA-Seq data

Size: px

Start display at page:

Download "Unit-free and robust detection of differential expression from RNA-Seq data"

Buddy Lucas
5 years ago
Views:

1 Unit-free and robust detection of differential expression from RNA-Seq data arxiv: v [stat.me] 8 May 204 Hui Jiang,2,* Department of Biostatistics, University of Michigan 2 Center for Computational Medicine and Bioinformatics, University of Michigan * Please send correspondence to jianghui@umich.edu. May 20, 204 Abstract Ultra high-throughput sequencing of transcriptomes (RNA-Seq) has recently become one of the most widely used methods for quantifying gene expression levels due to its decreasing cost, high accuracy and wide dynamic range for detection. However, the nature of RNA-Seq makes it nearly impossible to provide absolute measurements of transcript concentrations. Several units or data summarization methods for transcript quantification have been proposed to account for differences in transcript lengths and sequencing depths across genes and samples. However, none of these methods can reliably detect differential expression directly without further proper normalization. We propose a statistical model for joint detection of differential expression and data normalization. Our method is independent of the unit in which gene expression levels are summarized. We also introduce an efficient algorithm for model fitting. Due to the L0-penalized likelihood used by our model, it is able to reliably normalize the data and detect differential expression in some cases when more than half of the genes are differentially expressed in an asymmetric manner. The robustness of our proposed approach is demonstrated with simulations. Introduction Ultra high-throughput sequencing of transcriptomes (RNA-Seq) has recently become one of the most widely used methods for quantifying gene expression levels due to its decreasing cost, high accuracy and wide dynamic range for detection (Mortazavi et al., 2008). As of today, modern ultra high-throughput sequencing platforms can

2 generate tens of millions of short sequencing reads from each prepared biological sample in less than a day. RNA-Seq also facilitates the detection of novel transcripts (Trapnell et al., 200) and the quantification of transcripts on isoform level (Jiang and Wong, 2009). For these reasons, RNA-Seq has become the method of choice for assaying transcriptomes (Wang et al., 2009). In an RNA-Seq experiment, mrna transcripts are extracted from samples of cells, reverse transcribed into cdna, randomly fragmented into short pieces, filtered based on fragment lengths (size selection), added sequencing adapters and then sequenced on a sequencer. The collected data are sequenced reads, either from one end of the fragments (single-end sequencing) or both ends (paired-end sequencing). These sequenced reads can be aligned to reference transcript sequences and the data are summarized as read counts for each transcript in each sample. Complications happen when reads can not be uniquely aligned to reference transcript sequences, either due to homology between genes, or due to multiple transcripts (isoforms) sharing regions (exons) within a single gene. In this paper we ignore those complications and assume each gene has only one transcript and therefore we will use gene and transcript interchangeably. We also assume that each read can be aligned to only a single gene, and we take read counts for each gene in each sample as observations. However, our approach can work with estimated read counts or gene expression levels from methods developed to handle those complications such as Jiang and Wong (2009) and Li et al. (200). For an overview of those method, please refer to Pachter (20). One major limitation of RNA-Seq is that it only provides relative measurements of transcript concentrations. Because reads are sequenced from a random sample of transcript fragments, proportional changes of the amount of transcripts in a sample will not impact the distribution of finally sequenced reads. Furthermore, longer transcripts will generate more fragments which will result in more reads, and sequencing a sample deeper will also lead to higher read counts for each transcript. To account for these issues, several units (or data summarization methods) for transcript quantification have been proposed to account for differences in transcript lengths and sequencing depths across genes and samples, which include CPM (Robinson and Oshlack, 200) or RPM (counts/reads per million), RPKM (Mortazavi et al., 2008) or FPKM (Trapnell et al., 200) (reads/fragments per kilobase of exon per million mapped reads) and TPM (Li et al., 200) (transcript per million). Since read can refer to either single-end read or paired-end read, depending on the experiment conducted, we will use RPKM instead of FPKM in this paper. Suppose there are a total of m genes in the sample. For i =,..., m, let l i be the length of gene i and let c i be the observed read count for gene i in the sample. CPM (denoted as cpm i ), RPKM (denoted as rpkm i ) and TPM (denoted as tpm i ) values for gene i are defined as follows, respectively cpm i = 0 6 c i / i c i rpkm i = 0 9 c i /(l i i c i) tpm i = 0 6 rpkm i / i rpkm i (.) All these units have been used for quantifying gene expression levels from RNA-Seq data and also for detecting differential expressed genes. It has been argued that TPM 2

3 is better than RPKM (Wagner et al., 202) since it estimates the relative molar concentration of transcripts in a sample. However, none of these methods can reliably detect differential expression directly without further proper normalization. For instance, when we compare gene expression profiles from two samples (e.g, sample A and sample B), if 50% of the genes are upregulated by 2-fold in sample B while the other 50% of the genes stayed stable in both samples, due to the relative nature of RNA-Seq measurements, it is impossible to tell that it is not case that the first group of 50% genes stayed stable but the second group of 50% gene are downregulated by 2-fold in sample B. A normalization step is therefore necessary to make gene expression measurements comparable across samples. Different normalization approaches make different assumptions on the distribution of gene expression levels across samples. For example, quantile normalization (Bolstad et al., 2003) assumes that the overall distribution of gene expression levels is unchanged across samples. In this sense, both RPKM and TPM can be considered as normalization approaches when they are used directly to detect differentially expressed genes without additional normalization. In this case, TPM assumes that the total amount of transcripts (i.e., total mole number) is unchanged across samples and RPKM assumes that the total amount of nucleotides in all the transcripts (i.e., total mass) is unchanged across samples, both of which are strong but still arguably reasonable assumptions. Since it is impossible to tell the truth for the example described above, the normalization approach has to rule out such possibility using its assumptions. One commonly used assumption is that the majority (i.e., > 50%) are non-differentially expressed. The median-based approach (Anders and Huber, 200) and TMM (Robinson and Oshlack, 200) (trimmed mean of M values) are two normalization approaches based on this assumption. Furthermore, the normalization and detection of differential expression are two problems entangled together, since ideally normalization should be based on non-differentially expressed genes. The iterative normalization approach (Li et al., 202) utilizes this idea and iterates between normalization and detection of differential expression. For an overview of normalization approaches and comparisons of their performance for differential expression detection, please refer to Dillies et al. (203) and Rapaport et al. (203) The example described above is also unrealistic because in reality genes rarely change at the same pace. That is, it is unlikely that all of the differentially expressed genes are upregulated by the same amount (2-fold in the above example). It is therefore reasonable to assume that among differentially expressed genes, the degrees at which genes change have a (possibly unknown) spread distribution. However, few existing approach explicitly utilizes this realistic assumption, which is exploited in this paper. This paper introduces a method for joint detection of differential expression and data normalization, which is also independent of the unit in which gene expression levels are summarized. We also introduce an efficient algorithm for model fitting. Simulation studies are also given. 3

4 2 A penalized likelihood approach 2. The Model Suppose there are a total of m genes measured in two groups of samples with n and n 2 samples, respectively. Let x sij, s {, 2}, i =,..., m, j =,..., n s be log-transformed gene expression measurements for the i-th gene in the j-th sample in the s-th group. The following statistical model is assumed x sij N(µ si + d sj, σ 2 si) (2.) where µ si and σ si are the mean and standard deviation of log-transformed expression levels of gene i in group s, and d sj is a scaling factor (i.e., log sequencing depth or log library size) for sample j in group s. Our interest is in detecting differentially expressed genes between the two groups. Let τ i be the indicator of differential expression for gene i such that τ i = if gene i is differentially expressed between the two groups and τ i = 0 otherwise. We assume µ i = µ 2i if τ i = 0. The τ i s are the parameters of major interest, while the µ si s and the d sj s might be of interest too, because they represent biologically meaningful quantities. Since σ si s are nuisance parameters, for simplicity, from now on we assume σ si = σ where σ is a known constant. This works well when all the σ si s are roughly equal (which is a roughly reasonable assumption since the log transformation usually stabilizes the variance) and we can estimate σ using a robust data-driven approach. Our method can also be extended to handle unequal σ 2 si s. One merit of Model (2.) is that it is unit independent. That is, regardless of the unit in which gene expression estimates are summarized, which can be log-transformed read counts, log-cpm, log-rpkm or log-tpm values, Model (2.) will produce exactly the same inferences on differentially expressed genes. We consider this property a desirable feature, because many already published studies on RNA-Seq have reported their gene expression estimates in all different units mentioned above, and converting from one unit to another is not always easy since it requires information on transcript lengths and the total number of sequenced reads. 2.2 Optimization To fit Model (2.), we reparametrize µ si as µ i = µ i, γ i = µ 2i µ i. The model now becomes { xij N(µ i + d j, σ 2 ) x 2ij N(µ i + γ i + d 2j, σ 2 (2.2) ) To fit Model (2.2), we minimize its negative log-likelihood n n 2 l(µ, γ, d; x) = (x ij µ i d j ) 2 + (x 2ij µ i γ i d 2j ) 2 4

5 Model (2.2) is non-identifiable because we can simply add any constant to all the d sj s and subtract the same constant from all the µ si s, while having the same fit. To resolve this issue, we fix d = 0. Furthermore, we introduce a penalty p(γ) on the γ i s and formulate a penalized likelihood f(µ, γ, d) = l(µ, γ, d; x) + p(γ) (2.3) Candidate penalty functions for p(γ) are L (Tibshirani, 996), L0, SCAD (Fan and Li, 200) and etc. The penalty function will force some of the γ i to be zero, which will in turn facilitate the detection of differential expression since τ i = ( γ i > 0). There are m(n + n 2 ) observations and 2m + n + n 2 parameters in the model, and typically we have m 0, 000 and n + n 2 0. Using an L penalty, it will be computationally intensive if we fit the model using a standard Lasso solver such as Glmnet (Friedman et al., 200). To achieve better performance, if we use a non-convex penalty such as L0 or SCAD, it will be computationally nearly infeasible to fit the model in a brute-force manner. Fortunately, we can take advantage of the special structure in the model and solve it efficiently. In this paper we work with the L0 penalty due to its robustness in estimation and variable selection p(γ) = α ( γ i > 0) (2.4) where α > 0 is tuning parameter. Our model fitting approach can also be adapted to other penalty functions. It can be shown that the solution to (2.3) with the L0 penalty (2.4) has the property that γ i = 0 or γ i λ, where λ > 0 is some constant. In particular, model (2.3) can be solve as follows. Proposition 2.. Model (2.3) with the L0 penalty (2.4) can be solved as follows d j = m m (x ij x i ) d 2j = m m (x 2ij x 2i ) µ i = n n (x ij d j ) µ 2i = n2 n 2 (x 2ij d 2j ) (n +n 2 )α λ = n n 2 d = arg min m d min ( (µ 2i µ i d)2, λ 2) d j = d j d 2j = d + d { 2j 0 if µ γ i = 2i µ i d < λ { µ 2i µ i d otherswise µ µ i = i if γ i > 0 n +n 2 (n µ i + n 2(µ 2i d)) othersise To solve for d, we need to minimize the function f(d) = m min ( (µ 2i µ i d)2, λ 2) which is non-convex and non-differentiable (see Figure ). However, since it is univariate, it can be solved efficiently in O(m log m) time using a sliding window search. 5

6 f(d) d Figure : The function f(d) = m min ( (µ 2i µ i d)2, λ 2) with simulated {µ i }00, {µ 2i }00 N(0, ) and λ = 0.2. The minimizer of f is shown with a dashed vertical line. 3 Experiments 3. Simulations We simulate RNA-Seq data with a total of m = 000 genes (900 non-differentially expressed, 00 differentially expressed) as follows µ i = µ 2i N( 3, 2) mean expression for genes -900 (log scale) µ i N( 3, 2), µ 2i = µ i + N(0, ) mean expression for genes (log scale) d sj N(0, 0.5) scaling factor (log scale) x sij N(µ si + d sj, 0.2) gene expression levels (log scale) log(l i ) Unif(5, 0) gene length (log scale) N sj Unif(3, 5) 0 7 total number of sequenced reads {c sij } m + Mult(N sj, {l i e x sij } m ) gene expression levels (read counts) We use the log c sij (read counts) as data to fit our model with λ = 0.5. The fitted γ i s are plotted in Figure 2a. Using CPM, RPKM or TPM values computed with formulas in (.) yield the same estimates for γ i. To demonstrate the robustness of our method, we also simulate with 600 non-differentially expressed genes and 400 differentially expressed genes. Furthermore, the mean gene expression in group two is simulated as µ 2i = µ i +N(3, ) for genes , which means that the 400 differentially expressed genes are upregulated for 20 fold in group two on average. Our method still robustly estimates the γ i s (Figure 2b). We further simulate with 300 non-differentially expressed genes and 700 differentially expressed genes, for which our method still achieves robust estimates 6

7 when we simulate with µ 2i = µ i +N(3, ) (Figure 2c). Only when we simulate with 900 differentially expressed genes and with µ 2i = µ i + N(3, ), our method fails to achieve robust estimates (Figure 2d). 4 Discussion It is shown in our simulations that our proposed approach is able to reliably normalize the data and detect differential expression in some cases when more than half of the genes are differentially expressed in an asymmetric manner. This is hard to achieve with other existing robust methods such as the median or trimmed mean based normalization approaches (Anders and Huber, 200; Robinson and Oshlack, 200). This is attributed to the L0-penalized likelihood used by our model. Typically L0-penalized models are difficult to fit due to their non-convexity. However in our case it is easily manageable once we reduce the model fitting to a univariate optimization problem. It is seen that adding an L0 penalty p(γ) is equivalent to applying hard thresholding on γ, which has been shown to be a general case in She and Owen (20). The hard thresholding helps us make inference on the indicators τ i s for differential gene expression since τ i = 0 γ i < λ This resembles the two-sample t-test that has been widely used for detecting differential gene expression. In fact, γ i < λ µ i µ 2i < λ ŝ i ŝ i where ŝ i is the estimated standard deviation for gene i, which is chosen as σ when σ is known, or estimated from the data using various of methods based on different assumptions (i.e., equal variance or unequal variance). This relationship hints us to choose λ as ŝ i Q q where Q q is the q quantile (for two-sided level q tests) of the standard normal (for known ŝ i ) or t distributions (for estimated ŝ i ). Appendix Proof of Proposition 2.. We have n n 2 f(µ, γ, d) = (x ij µ i d j ) 2 + (x 2ij µ i γ i d 2j ) 2 + α ( γ i > 0) Therefore which gives f d j = 2(x ij µ i d j ) = 0 d j = m (x ij µ i ) 7

8 i γ i (a) i γ i (b) i γ i (c) i γ i (d) Figure 2: Estimated γ from simulated data. 8

9 Consequently d j d = m (x ij x i ) = d j Since we fix d = 0, we have d j = d j. Similarly, we have d 2j d 2 = m (x 2ij x 2i ) = d 2j Denote d 2 as d, we have d 2j = d + d 2j. Now f(µ, γ, d) = n n 2 (x ij µ i d j) 2 + (x 2ij µ i γ i d d 2j) 2 + α( γ i > 0) which can be written as f(µ, γ, d) = h i (µ i, γ i ) where both µ i and γ i can be considered as functions of d. If γ i > 0, to minimize f, we only need to minimize h i, which can be easily solve as and µ i = n (x ij d n j) = µ i γ i = n 2 (x 2ij µ i d d n 2j) = n 2 (x 2ij d 2 n 2j) µ i d = µ 2i µ i d 2 Similarly, if γ i = 0, we have µ i = Changing from γ i = 0 to γ i > 0, we have n + n 2 (n µ i + n 2 (µ 2i d)) h i ( (n µ i + n 2 (µ 2i d)), 0) h i (µ n + n i, µ 2i µ i d) = n n 2 (µ 2i µ i d) 2 α 2 n + n 2 Therefore Where λ = d = arg min d (n +n 2 )α n n 2 γ i = { 0 if µ 2i µ i d < λ µ 2i µ i d otherswise. The only thing remaining now is to solve for d. We have ( ) min h i ( (n µ i + n 2 (µ 2i d)), 0), h i (µ n + n i, µ 2i µ i d) 2 9

10 which can be simplified as d = arg min d ( ) n n 2 min (µ 2i µ i d) 2 α, 0 n + n 2 since h i (µ i, µ 2i µ i d) is not function of d. Therefore, d = arg min d min ( (µ 2i µ i d) 2, λ 2) References S. Anders and W. Huber. Differential expression analysis for sequence count data. Genome Biol, (0):R06, 200. B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 9(2):85 93, Jan M.-A. Dillies, A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime, G. Marot, D. Castel, J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloe, C. Le Gall, B. Schaeffer, L.ffer, S. Le Crom, M. Guedj, F. Jaffrezic, and F. S. C.. A comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. Brief Bioinform, 4(6):67 683, Nov 203. J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456): , 200. J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33():, 200. H. Jiang and W. H. Wong. Statistical inferences for isoform expression in rna-seq. Bioinformatics, 25(8): , Apr B. Li, V. Ruotti, R. M. Stewart, J. A. Thomson, and C. N. Dewey. Rna-seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4): , Feb 200. J. Li, D. M. Witten, I. M. Johnstone, and R. Tibshirani. Normalization, testing, and false discovery rate estimation for rna-sequencing data. Biostatistics, 3(3): , Jul 202. A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, 5(7):62 628, Jul

11 L. Pachter. Models for transcript quantification from RNA-Seq. ArXiv e-prints, Apr. 20. F. Rapaport, R. Khanin, Y. Liang, M. Pirun, A. Krek, P. Zumbo, C. E. Mason, N. D. Socci, and D. Betel. Comprehensive evaluation of differential gene expression analysis methods for rna-seq data. Genome Biology, 4(9):R95, Sep 203. M. D. Robinson and A. Oshlack. A scaling normalization method for differential expression analysis of rna-seq data. Genome Biol, (3):R25, 200. Y. She and A. B. Owen. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 06(494), 20. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages , 996. C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter. Transcript assembly and quantification by rnaseq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):5 55, May 200. G. P. Wagner, K. Kin, and V. J. Lynch. Measurement of mrna abundance using rna-seq data: Rpkm measure is inconsistent among samples. Theory Biosci, 3(4):28 285, Dec 202. Z. Wang, M. Gerstein, and M. Snyder. Rna-seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 0():57 63, Jan 2009.

Normalization and differential analysis of RNA-seq data

Normalization and differential analysis of RNA-seq data Nathalie Villa-Vialaneix INRA, Toulouse, MIAT (Mathématiques et Informatique Appliquées de Toulouse) nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org