Unit-free and robust detection of differential expression from RNA-Seq data

Size: px
Start display at page:

Download "Unit-free and robust detection of differential expression from RNA-Seq data"

Transcription

1 Unit-free and robust detection of differential expression from RNA-Seq data arxiv: v [stat.me] 8 May 204 Hui Jiang,2,* Department of Biostatistics, University of Michigan 2 Center for Computational Medicine and Bioinformatics, University of Michigan * Please send correspondence to jianghui@umich.edu. May 20, 204 Abstract Ultra high-throughput sequencing of transcriptomes (RNA-Seq) has recently become one of the most widely used methods for quantifying gene expression levels due to its decreasing cost, high accuracy and wide dynamic range for detection. However, the nature of RNA-Seq makes it nearly impossible to provide absolute measurements of transcript concentrations. Several units or data summarization methods for transcript quantification have been proposed to account for differences in transcript lengths and sequencing depths across genes and samples. However, none of these methods can reliably detect differential expression directly without further proper normalization. We propose a statistical model for joint detection of differential expression and data normalization. Our method is independent of the unit in which gene expression levels are summarized. We also introduce an efficient algorithm for model fitting. Due to the L0-penalized likelihood used by our model, it is able to reliably normalize the data and detect differential expression in some cases when more than half of the genes are differentially expressed in an asymmetric manner. The robustness of our proposed approach is demonstrated with simulations. Introduction Ultra high-throughput sequencing of transcriptomes (RNA-Seq) has recently become one of the most widely used methods for quantifying gene expression levels due to its decreasing cost, high accuracy and wide dynamic range for detection (Mortazavi et al., 2008). As of today, modern ultra high-throughput sequencing platforms can

2 generate tens of millions of short sequencing reads from each prepared biological sample in less than a day. RNA-Seq also facilitates the detection of novel transcripts (Trapnell et al., 200) and the quantification of transcripts on isoform level (Jiang and Wong, 2009). For these reasons, RNA-Seq has become the method of choice for assaying transcriptomes (Wang et al., 2009). In an RNA-Seq experiment, mrna transcripts are extracted from samples of cells, reverse transcribed into cdna, randomly fragmented into short pieces, filtered based on fragment lengths (size selection), added sequencing adapters and then sequenced on a sequencer. The collected data are sequenced reads, either from one end of the fragments (single-end sequencing) or both ends (paired-end sequencing). These sequenced reads can be aligned to reference transcript sequences and the data are summarized as read counts for each transcript in each sample. Complications happen when reads can not be uniquely aligned to reference transcript sequences, either due to homology between genes, or due to multiple transcripts (isoforms) sharing regions (exons) within a single gene. In this paper we ignore those complications and assume each gene has only one transcript and therefore we will use gene and transcript interchangeably. We also assume that each read can be aligned to only a single gene, and we take read counts for each gene in each sample as observations. However, our approach can work with estimated read counts or gene expression levels from methods developed to handle those complications such as Jiang and Wong (2009) and Li et al. (200). For an overview of those method, please refer to Pachter (20). One major limitation of RNA-Seq is that it only provides relative measurements of transcript concentrations. Because reads are sequenced from a random sample of transcript fragments, proportional changes of the amount of transcripts in a sample will not impact the distribution of finally sequenced reads. Furthermore, longer transcripts will generate more fragments which will result in more reads, and sequencing a sample deeper will also lead to higher read counts for each transcript. To account for these issues, several units (or data summarization methods) for transcript quantification have been proposed to account for differences in transcript lengths and sequencing depths across genes and samples, which include CPM (Robinson and Oshlack, 200) or RPM (counts/reads per million), RPKM (Mortazavi et al., 2008) or FPKM (Trapnell et al., 200) (reads/fragments per kilobase of exon per million mapped reads) and TPM (Li et al., 200) (transcript per million). Since read can refer to either single-end read or paired-end read, depending on the experiment conducted, we will use RPKM instead of FPKM in this paper. Suppose there are a total of m genes in the sample. For i =,..., m, let l i be the length of gene i and let c i be the observed read count for gene i in the sample. CPM (denoted as cpm i ), RPKM (denoted as rpkm i ) and TPM (denoted as tpm i ) values for gene i are defined as follows, respectively cpm i = 0 6 c i / i c i rpkm i = 0 9 c i /(l i i c i) tpm i = 0 6 rpkm i / i rpkm i (.) All these units have been used for quantifying gene expression levels from RNA-Seq data and also for detecting differential expressed genes. It has been argued that TPM 2

3 is better than RPKM (Wagner et al., 202) since it estimates the relative molar concentration of transcripts in a sample. However, none of these methods can reliably detect differential expression directly without further proper normalization. For instance, when we compare gene expression profiles from two samples (e.g, sample A and sample B), if 50% of the genes are upregulated by 2-fold in sample B while the other 50% of the genes stayed stable in both samples, due to the relative nature of RNA-Seq measurements, it is impossible to tell that it is not case that the first group of 50% genes stayed stable but the second group of 50% gene are downregulated by 2-fold in sample B. A normalization step is therefore necessary to make gene expression measurements comparable across samples. Different normalization approaches make different assumptions on the distribution of gene expression levels across samples. For example, quantile normalization (Bolstad et al., 2003) assumes that the overall distribution of gene expression levels is unchanged across samples. In this sense, both RPKM and TPM can be considered as normalization approaches when they are used directly to detect differentially expressed genes without additional normalization. In this case, TPM assumes that the total amount of transcripts (i.e., total mole number) is unchanged across samples and RPKM assumes that the total amount of nucleotides in all the transcripts (i.e., total mass) is unchanged across samples, both of which are strong but still arguably reasonable assumptions. Since it is impossible to tell the truth for the example described above, the normalization approach has to rule out such possibility using its assumptions. One commonly used assumption is that the majority (i.e., > 50%) are non-differentially expressed. The median-based approach (Anders and Huber, 200) and TMM (Robinson and Oshlack, 200) (trimmed mean of M values) are two normalization approaches based on this assumption. Furthermore, the normalization and detection of differential expression are two problems entangled together, since ideally normalization should be based on non-differentially expressed genes. The iterative normalization approach (Li et al., 202) utilizes this idea and iterates between normalization and detection of differential expression. For an overview of normalization approaches and comparisons of their performance for differential expression detection, please refer to Dillies et al. (203) and Rapaport et al. (203) The example described above is also unrealistic because in reality genes rarely change at the same pace. That is, it is unlikely that all of the differentially expressed genes are upregulated by the same amount (2-fold in the above example). It is therefore reasonable to assume that among differentially expressed genes, the degrees at which genes change have a (possibly unknown) spread distribution. However, few existing approach explicitly utilizes this realistic assumption, which is exploited in this paper. This paper introduces a method for joint detection of differential expression and data normalization, which is also independent of the unit in which gene expression levels are summarized. We also introduce an efficient algorithm for model fitting. Simulation studies are also given. 3

4 2 A penalized likelihood approach 2. The Model Suppose there are a total of m genes measured in two groups of samples with n and n 2 samples, respectively. Let x sij, s {, 2}, i =,..., m, j =,..., n s be log-transformed gene expression measurements for the i-th gene in the j-th sample in the s-th group. The following statistical model is assumed x sij N(µ si + d sj, σ 2 si) (2.) where µ si and σ si are the mean and standard deviation of log-transformed expression levels of gene i in group s, and d sj is a scaling factor (i.e., log sequencing depth or log library size) for sample j in group s. Our interest is in detecting differentially expressed genes between the two groups. Let τ i be the indicator of differential expression for gene i such that τ i = if gene i is differentially expressed between the two groups and τ i = 0 otherwise. We assume µ i = µ 2i if τ i = 0. The τ i s are the parameters of major interest, while the µ si s and the d sj s might be of interest too, because they represent biologically meaningful quantities. Since σ si s are nuisance parameters, for simplicity, from now on we assume σ si = σ where σ is a known constant. This works well when all the σ si s are roughly equal (which is a roughly reasonable assumption since the log transformation usually stabilizes the variance) and we can estimate σ using a robust data-driven approach. Our method can also be extended to handle unequal σ 2 si s. One merit of Model (2.) is that it is unit independent. That is, regardless of the unit in which gene expression estimates are summarized, which can be log-transformed read counts, log-cpm, log-rpkm or log-tpm values, Model (2.) will produce exactly the same inferences on differentially expressed genes. We consider this property a desirable feature, because many already published studies on RNA-Seq have reported their gene expression estimates in all different units mentioned above, and converting from one unit to another is not always easy since it requires information on transcript lengths and the total number of sequenced reads. 2.2 Optimization To fit Model (2.), we reparametrize µ si as µ i = µ i, γ i = µ 2i µ i. The model now becomes { xij N(µ i + d j, σ 2 ) x 2ij N(µ i + γ i + d 2j, σ 2 (2.2) ) To fit Model (2.2), we minimize its negative log-likelihood n n 2 l(µ, γ, d; x) = (x ij µ i d j ) 2 + (x 2ij µ i γ i d 2j ) 2 4

5 Model (2.2) is non-identifiable because we can simply add any constant to all the d sj s and subtract the same constant from all the µ si s, while having the same fit. To resolve this issue, we fix d = 0. Furthermore, we introduce a penalty p(γ) on the γ i s and formulate a penalized likelihood f(µ, γ, d) = l(µ, γ, d; x) + p(γ) (2.3) Candidate penalty functions for p(γ) are L (Tibshirani, 996), L0, SCAD (Fan and Li, 200) and etc. The penalty function will force some of the γ i to be zero, which will in turn facilitate the detection of differential expression since τ i = ( γ i > 0). There are m(n + n 2 ) observations and 2m + n + n 2 parameters in the model, and typically we have m 0, 000 and n + n 2 0. Using an L penalty, it will be computationally intensive if we fit the model using a standard Lasso solver such as Glmnet (Friedman et al., 200). To achieve better performance, if we use a non-convex penalty such as L0 or SCAD, it will be computationally nearly infeasible to fit the model in a brute-force manner. Fortunately, we can take advantage of the special structure in the model and solve it efficiently. In this paper we work with the L0 penalty due to its robustness in estimation and variable selection p(γ) = α ( γ i > 0) (2.4) where α > 0 is tuning parameter. Our model fitting approach can also be adapted to other penalty functions. It can be shown that the solution to (2.3) with the L0 penalty (2.4) has the property that γ i = 0 or γ i λ, where λ > 0 is some constant. In particular, model (2.3) can be solve as follows. Proposition 2.. Model (2.3) with the L0 penalty (2.4) can be solved as follows d j = m m (x ij x i ) d 2j = m m (x 2ij x 2i ) µ i = n n (x ij d j ) µ 2i = n2 n 2 (x 2ij d 2j ) (n +n 2 )α λ = n n 2 d = arg min m d min ( (µ 2i µ i d)2, λ 2) d j = d j d 2j = d + d { 2j 0 if µ γ i = 2i µ i d < λ { µ 2i µ i d otherswise µ µ i = i if γ i > 0 n +n 2 (n µ i + n 2(µ 2i d)) othersise To solve for d, we need to minimize the function f(d) = m min ( (µ 2i µ i d)2, λ 2) which is non-convex and non-differentiable (see Figure ). However, since it is univariate, it can be solved efficiently in O(m log m) time using a sliding window search. 5

6 f(d) d Figure : The function f(d) = m min ( (µ 2i µ i d)2, λ 2) with simulated {µ i }00, {µ 2i }00 N(0, ) and λ = 0.2. The minimizer of f is shown with a dashed vertical line. 3 Experiments 3. Simulations We simulate RNA-Seq data with a total of m = 000 genes (900 non-differentially expressed, 00 differentially expressed) as follows µ i = µ 2i N( 3, 2) mean expression for genes -900 (log scale) µ i N( 3, 2), µ 2i = µ i + N(0, ) mean expression for genes (log scale) d sj N(0, 0.5) scaling factor (log scale) x sij N(µ si + d sj, 0.2) gene expression levels (log scale) log(l i ) Unif(5, 0) gene length (log scale) N sj Unif(3, 5) 0 7 total number of sequenced reads {c sij } m + Mult(N sj, {l i e x sij } m ) gene expression levels (read counts) We use the log c sij (read counts) as data to fit our model with λ = 0.5. The fitted γ i s are plotted in Figure 2a. Using CPM, RPKM or TPM values computed with formulas in (.) yield the same estimates for γ i. To demonstrate the robustness of our method, we also simulate with 600 non-differentially expressed genes and 400 differentially expressed genes. Furthermore, the mean gene expression in group two is simulated as µ 2i = µ i +N(3, ) for genes , which means that the 400 differentially expressed genes are upregulated for 20 fold in group two on average. Our method still robustly estimates the γ i s (Figure 2b). We further simulate with 300 non-differentially expressed genes and 700 differentially expressed genes, for which our method still achieves robust estimates 6

7 when we simulate with µ 2i = µ i +N(3, ) (Figure 2c). Only when we simulate with 900 differentially expressed genes and with µ 2i = µ i + N(3, ), our method fails to achieve robust estimates (Figure 2d). 4 Discussion It is shown in our simulations that our proposed approach is able to reliably normalize the data and detect differential expression in some cases when more than half of the genes are differentially expressed in an asymmetric manner. This is hard to achieve with other existing robust methods such as the median or trimmed mean based normalization approaches (Anders and Huber, 200; Robinson and Oshlack, 200). This is attributed to the L0-penalized likelihood used by our model. Typically L0-penalized models are difficult to fit due to their non-convexity. However in our case it is easily manageable once we reduce the model fitting to a univariate optimization problem. It is seen that adding an L0 penalty p(γ) is equivalent to applying hard thresholding on γ, which has been shown to be a general case in She and Owen (20). The hard thresholding helps us make inference on the indicators τ i s for differential gene expression since τ i = 0 γ i < λ This resembles the two-sample t-test that has been widely used for detecting differential gene expression. In fact, γ i < λ µ i µ 2i < λ ŝ i ŝ i where ŝ i is the estimated standard deviation for gene i, which is chosen as σ when σ is known, or estimated from the data using various of methods based on different assumptions (i.e., equal variance or unequal variance). This relationship hints us to choose λ as ŝ i Q q where Q q is the q quantile (for two-sided level q tests) of the standard normal (for known ŝ i ) or t distributions (for estimated ŝ i ). Appendix Proof of Proposition 2.. We have n n 2 f(µ, γ, d) = (x ij µ i d j ) 2 + (x 2ij µ i γ i d 2j ) 2 + α ( γ i > 0) Therefore which gives f d j = 2(x ij µ i d j ) = 0 d j = m (x ij µ i ) 7

8 i γ i (a) i γ i (b) i γ i (c) i γ i (d) Figure 2: Estimated γ from simulated data. 8

9 Consequently d j d = m (x ij x i ) = d j Since we fix d = 0, we have d j = d j. Similarly, we have d 2j d 2 = m (x 2ij x 2i ) = d 2j Denote d 2 as d, we have d 2j = d + d 2j. Now f(µ, γ, d) = n n 2 (x ij µ i d j) 2 + (x 2ij µ i γ i d d 2j) 2 + α( γ i > 0) which can be written as f(µ, γ, d) = h i (µ i, γ i ) where both µ i and γ i can be considered as functions of d. If γ i > 0, to minimize f, we only need to minimize h i, which can be easily solve as and µ i = n (x ij d n j) = µ i γ i = n 2 (x 2ij µ i d d n 2j) = n 2 (x 2ij d 2 n 2j) µ i d = µ 2i µ i d 2 Similarly, if γ i = 0, we have µ i = Changing from γ i = 0 to γ i > 0, we have n + n 2 (n µ i + n 2 (µ 2i d)) h i ( (n µ i + n 2 (µ 2i d)), 0) h i (µ n + n i, µ 2i µ i d) = n n 2 (µ 2i µ i d) 2 α 2 n + n 2 Therefore Where λ = d = arg min d (n +n 2 )α n n 2 γ i = { 0 if µ 2i µ i d < λ µ 2i µ i d otherswise. The only thing remaining now is to solve for d. We have ( ) min h i ( (n µ i + n 2 (µ 2i d)), 0), h i (µ n + n i, µ 2i µ i d) 2 9

10 which can be simplified as d = arg min d ( ) n n 2 min (µ 2i µ i d) 2 α, 0 n + n 2 since h i (µ i, µ 2i µ i d) is not function of d. Therefore, d = arg min d min ( (µ 2i µ i d) 2, λ 2) References S. Anders and W. Huber. Differential expression analysis for sequence count data. Genome Biol, (0):R06, 200. B. M. Bolstad, R. A. Irizarry, M. Astrand, and T. P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 9(2):85 93, Jan M.-A. Dillies, A. Rau, J. Aubert, C. Hennequet-Antier, M. Jeanmougin, N. Servant, C. Keime, G. Marot, D. Castel, J. Estelle, G. Guernec, B. Jagla, L. Jouneau, D. Laloe, C. Le Gall, B. Schaeffer, L.ffer, S. Le Crom, M. Guedj, F. Jaffrezic, and F. S. C.. A comprehensive evaluation of normalization methods for illumina high-throughput rna sequencing data analysis. Brief Bioinform, 4(6):67 683, Nov 203. J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456): , 200. J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33():, 200. H. Jiang and W. H. Wong. Statistical inferences for isoform expression in rna-seq. Bioinformatics, 25(8): , Apr B. Li, V. Ruotti, R. M. Stewart, J. A. Thomson, and C. N. Dewey. Rna-seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4): , Feb 200. J. Li, D. M. Witten, I. M. Johnstone, and R. Tibshirani. Normalization, testing, and false discovery rate estimation for rna-sequencing data. Biostatistics, 3(3): , Jul 202. A. Mortazavi, B. A. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, 5(7):62 628, Jul

11 L. Pachter. Models for transcript quantification from RNA-Seq. ArXiv e-prints, Apr. 20. F. Rapaport, R. Khanin, Y. Liang, M. Pirun, A. Krek, P. Zumbo, C. E. Mason, N. D. Socci, and D. Betel. Comprehensive evaluation of differential gene expression analysis methods for rna-seq data. Genome Biology, 4(9):R95, Sep 203. M. D. Robinson and A. Oshlack. A scaling normalization method for differential expression analysis of rna-seq data. Genome Biol, (3):R25, 200. Y. She and A. B. Owen. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 06(494), 20. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages , 996. C. Trapnell, B. A. Williams, G. Pertea, A. Mortazavi, G. Kwan, M. J. van Baren, S. L. Salzberg, B. J. Wold, and L. Pachter. Transcript assembly and quantification by rnaseq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol, 28(5):5 55, May 200. G. P. Wagner, K. Kin, and V. J. Lynch. Measurement of mrna abundance using rna-seq data: Rpkm measure is inconsistent among samples. Theory Biosci, 3(4):28 285, Dec 202. Z. Wang, M. Gerstein, and M. Snyder. Rna-seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 0():57 63, Jan 2009.

Normalization and differential analysis of RNA-seq data

Normalization and differential analysis of RNA-seq data Normalization and differential analysis of RNA-seq data Nathalie Villa-Vialaneix INRA, Toulouse, MIAT (Mathématiques et Informatique Appliquées de Toulouse) nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org

More information

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Expression analysis for RNA-seq data Ewa Szczurek Instytut Informatyki Uniwersytet Warszawski 1/35 The problem

More information

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Likun Wang Zhixing Feng i Wang iaowo Wang * and uegong Zhang * MOE Key Laboratory of Bioinformatics and Bioinformatics

More information

Statistical challenges in RNA-Seq data analysis

Statistical challenges in RNA-Seq data analysis Statistical challenges in RNA-Seq data analysis Julie Aubert UMR 518 AgroParisTech-INRA Mathématiques et Informatique Appliquées Ecole de bioinformatique, Station biologique de Roscoff, 2013 Nov. 18 J.

More information

RNASeq Differential Expression

RNASeq Differential Expression 12/06/2014 RNASeq Differential Expression Le Corguillé v1.01 1 Introduction RNASeq No previous genomic sequence information is needed In RNA-seq the expression signal of a transcript is limited by the

More information

Bias Correction in RNA-Seq Short-Read Counts Using Penalized Regression

Bias Correction in RNA-Seq Short-Read Counts Using Penalized Regression DOI 10.1007/s12561-012-9057-6 Bias Correction in RNA-Seq Short-Read Counts Using Penalized Regression David Dalpiaz Xuming He Ping Ma Received: 21 November 2011 / Accepted: 2 February 2012 International

More information

New RNA-seq workflows. Charlotte Soneson University of Zurich Brixen 2016

New RNA-seq workflows. Charlotte Soneson University of Zurich Brixen 2016 New RNA-seq workflows Charlotte Soneson University of Zurich Brixen 2016 Wikipedia The traditional workflow ALIGNMENT COUNTING ANALYSIS Gene A Gene B... Gene X 7... 13............... The traditional workflow

More information

Statistical Inferences for Isoform Expression in RNA-Seq

Statistical Inferences for Isoform Expression in RNA-Seq Statistical Inferences for Isoform Expression in RNA-Seq Hui Jiang and Wing Hung Wong February 25, 2009 Abstract The development of RNA sequencing (RNA-Seq) makes it possible for us to measure transcription

More information

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

Low-Level Analysis of High- Density Oligonucleotide Microarray Data Low-Level Analysis of High- Density Oligonucleotide Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley UC Berkeley Feb 23, 2004 Outline

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 8 29, pages 126 132 doi:1.193/bioinformatics/btp113 Gene expression Statistical inferences for isoform expression in RNA-Seq Hui Jiang 1 and Wing Hung Wong 2,

More information

Classifying next-generation sequencing data using a zero-inflated Poisson model

Classifying next-generation sequencing data using a zero-inflated Poisson model 7 Doc-StartBIOINFORMATICS Classifying next-generation sequencing data using a zero-inflated Poisson model Yan Zhou 1, Xiang Wan 2,, Baoxue Zhang 3 and Tiejun Tong 4, 1 College of Mathematics and Statistics,

More information

Isoform discovery and quantification from RNA-Seq data

Isoform discovery and quantification from RNA-Seq data Isoform discovery and quantification from RNA-Seq data C. Toffano-Nioche, T. Dayris, Y. Boursin, M. Deloger November 2016 C. Toffano-Nioche, T. Dayris, Y. Boursin, M. Isoform Deloger discovery and quantification

More information

Alignment-free RNA-seq workflow. Charlotte Soneson University of Zurich Brixen 2017

Alignment-free RNA-seq workflow. Charlotte Soneson University of Zurich Brixen 2017 Alignment-free RNA-seq workflow Charlotte Soneson University of Zurich Brixen 2017 The alignment-based workflow ALIGNMENT COUNTING ANALYSIS Gene A Gene B... Gene X 7... 13............... The alignment-based

More information

EBSeq: An R package for differential expression analysis using RNA-seq data

EBSeq: An R package for differential expression analysis using RNA-seq data EBSeq: An R package for differential expression analysis using RNA-seq data Ning Leng, John Dawson, and Christina Kendziorski October 14, 2013 Contents 1 Introduction 2 2 Citing this software 2 3 The Model

More information

Comparative analysis of RNA- Seq data with DESeq2

Comparative analysis of RNA- Seq data with DESeq2 Comparative analysis of RNA- Seq data with DESeq2 Simon Anders EMBL Heidelberg Two applications of RNA- Seq Discovery Eind new transcripts Eind transcript boundaries Eind splice junctions Comparison Given

More information

arxiv: v1 [stat.me] 30 Dec 2017

arxiv: v1 [stat.me] 30 Dec 2017 arxiv:1801.00105v1 [stat.me] 30 Dec 2017 An ISIS screening approach involving threshold/partition for variable selection in linear regression 1. Introduction Yu-Hsiang Cheng e-mail: 96354501@nccu.edu.tw

More information

DEXSeq paper discussion

DEXSeq paper discussion DEXSeq paper discussion L Collado-Torres December 10th, 2012 1 / 23 1 Background 2 DEXSeq paper 3 Results 2 / 23 Gene Expression 1 Background 1 Source: http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/applexpression.shtml

More information

Our typical RNA quantification pipeline

Our typical RNA quantification pipeline RNA-Seq primer Our typical RNA quantification pipeline Upload your sequence data (fastq) Align to the ribosome (Bow>e) Align remaining reads to genome (TopHat) or transcriptome (RSEM) Make report of quality

More information

Normalization. Example of Replicate Data. Biostatistics Rafael A. Irizarry

Normalization. Example of Replicate Data. Biostatistics Rafael A. Irizarry This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Normalization, testing, and false discovery rate estimation for RNA-sequencing data

Normalization, testing, and false discovery rate estimation for RNA-sequencing data Biostatistics Advance Access published October 14, 2011 Biostatistics (2011), 0, 0, pp. 1 16 doi:10.1093/biostatistics/kxr031 Normalization, testing, and false discovery rate estimation for RNA-sequencing

More information

Statistics for Differential Expression in Sequencing Studies. Naomi Altman

Statistics for Differential Expression in Sequencing Studies. Naomi Altman Statistics for Differential Expression in Sequencing Studies Naomi Altman naomi@stat.psu.edu Outline Preliminaries what you need to do before the DE analysis Stat Background what you need to know to understand

More information

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.

More information

Bias in RNA sequencing and what to do about it

Bias in RNA sequencing and what to do about it Bias in RNA sequencing and what to do about it Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Analyses biostatistiques de données RNA-seq

Analyses biostatistiques de données RNA-seq Analyses biostatistiques de données RNA-seq Ignacio Gonzàlez, Annick Moisan & Nathalie Villa-Vialaneix prenom.nom@toulouse.inra.fr Toulouse, 18/19 mai 2017 IG, AM, NV 2 (INRA) Biostatistique RNA-seq Toulouse,

More information

g A n(a, g) n(a, ḡ) = n(a) n(a, g) n(a) B n(b, g) n(a, ḡ) = n(b) n(b, g) n(b) g A,B A, B 2 RNA-seq (D) RNA mrna [3] RNA 2. 2 NGS 2 A, B NGS n(

g A n(a, g) n(a, ḡ) = n(a) n(a, g) n(a) B n(b, g) n(a, ḡ) = n(b) n(b, g) n(b) g A,B A, B 2 RNA-seq (D) RNA mrna [3] RNA 2. 2 NGS 2 A, B NGS n( ,a) RNA-seq RNA-seq Cuffdiff, edger, DESeq Sese Jun,a) Abstract: Frequently used biological experiment technique for observing comprehensive gene expression has been changed from microarray using cdna

More information

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data Cinzia Viroli 1 joint with E. Bonafede 1, S. Robin 2 & F. Picard 3 1 Department of Statistical Sciences, University

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Differential expression analysis for sequencing count data. Simon Anders

Differential expression analysis for sequencing count data. Simon Anders Differential expression analysis for sequencing count data Simon Anders RNA-Seq Count data in HTS RNA-Seq Tag-Seq Gene 13CDNA73 A2BP1 A2M A4GALT AAAS AACS AADACL1 [...] ChIP-Seq Bar-Seq... GliNS1 4 19

More information

COLE TRAPNELL, BRIAN A WILLIAMS, GEO PERTEA, ALI MORTAZAVI, GORDON KWAN, MARIJKE J VAN BAREN, STEVEN L SALZBERG, BARBARA J WOLD, AND LIOR PACHTER

COLE TRAPNELL, BRIAN A WILLIAMS, GEO PERTEA, ALI MORTAZAVI, GORDON KWAN, MARIJKE J VAN BAREN, STEVEN L SALZBERG, BARBARA J WOLD, AND LIOR PACHTER SUPPLEMENTARY METHODS FOR THE PAPER TRANSCRIPT ASSEMBLY AND QUANTIFICATION BY RNA-SEQ REVEALS UNANNOTATED TRANSCRIPTS AND ISOFORM SWITCHING DURING CELL DIFFERENTIATION COLE TRAPNELL, BRIAN A WILLIAMS,

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

Statistical Models for Gene and Transcripts Quantification and Identification Using RNA-Seq Technology

Statistical Models for Gene and Transcripts Quantification and Identification Using RNA-Seq Technology Purdue University Purdue e-pubs Open Access Dissertations Theses and Dissertations Fall 2013 Statistical Models for Gene and Transcripts Quantification and Identification Using RNA-Seq Technology Han Wu

More information

NBLDA: negative binomial linear discriminant analysis for RNA-Seq data

NBLDA: negative binomial linear discriminant analysis for RNA-Seq data Dong et al. BMC Bioinformatics (2016) 17:369 DOI 10.1186/s12859-016-1208-1 RESEARCH ARTICLE Open Access NBLDA: negative binomial linear discriminant analysis for RNA-Seq data Kai Dong 1,HongyuZhao 2,TiejunTong

More information

Statistical Modeling of RNA-Seq Data

Statistical Modeling of RNA-Seq Data Statistical Science 2011, Vol. 26, No. 1, 62 83 DOI: 10.1214/10-STS343 c Institute of Mathematical Statistics, 2011 Statistical Modeling of RNA-Seq Data Julia Salzman 1, Hui Jiang 1 and Wing Hung Wong

More information

The official electronic file of this thesis or dissertation is maintained by the University Libraries on behalf of The Graduate School at Stony Brook

The official electronic file of this thesis or dissertation is maintained by the University Libraries on behalf of The Graduate School at Stony Brook Stony Brook University The official electronic file of this thesis or dissertation is maintained by the University Libraries on behalf of The Graduate School at Stony Brook University. Alll Rigghht tss

More information

express: Streaming read deconvolution and abundance estimation applied to RNA-Seq

express: Streaming read deconvolution and abundance estimation applied to RNA-Seq express: Streaming read deconvolution and abundance estimation applied to RNA-Seq Adam Roberts 1 and Lior Pachter 1,2 1 Department of Computer Science, 2 Departments of Mathematics and Molecular & Cell

More information

Semi-Penalized Inference with Direct FDR Control

Semi-Penalized Inference with Direct FDR Control Jian Huang University of Iowa April 4, 2016 The problem Consider the linear regression model y = p x jβ j + ε, (1) j=1 where y IR n, x j IR n, ε IR n, and β j is the jth regression coefficient, Here p

More information

Nonconvex penalties: Signal-to-noise ratio and algorithms

Nonconvex penalties: Signal-to-noise ratio and algorithms Nonconvex penalties: Signal-to-noise ratio and algorithms Patrick Breheny March 21 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/22 Introduction In today s lecture, we will return to nonconvex

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Outlier detection and variable selection via difference based regression model and penalized regression

Outlier detection and variable selection via difference based regression model and penalized regression Journal of the Korean Data & Information Science Society 2018, 29(3), 815 825 http://dx.doi.org/10.7465/jkdi.2018.29.3.815 한국데이터정보과학회지 Outlier detection and variable selection via difference based regression

More information

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data 1 Lecture 3: Mixture Models for Microbiome data Outline: - Mixture Models (Negative Binomial) - DESeq2 / Don t Rarefy. Ever. 2 Hypothesis Tests - reminder

More information

Dispersion modeling for RNAseq differential analysis

Dispersion modeling for RNAseq differential analysis Dispersion modeling for RNAseq differential analysis E. Bonafede 1, F. Picard 2, S. Robin 3, C. Viroli 1 ( 1 ) univ. Bologna, ( 3 ) CNRS/univ. Lyon I, ( 3 ) INRA/AgroParisTech, Paris IBC, Victoria, July

More information

Lecture: Mixture Models for Microbiome data

Lecture: Mixture Models for Microbiome data Lecture: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data Outline: - - Sequencing thought experiment Mixture Models (tangent) - (esp. Negative Binomial) - Differential abundance

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression differences

ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression differences ABSSeq: a new RNA-Seq analysis method based on modelling absolute expression differences Wentao Yang October 30, 2018 1 Introduction This vignette is intended to give a brief introduction of the ABSSeq

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley University of Minnesota Mar 30, 2004 Outline

More information

Supplemental Information

Supplemental Information Molecular Cell, Volume 52 Supplemental Information The Translational Landscape of the Mammalian Cell Cycle Craig R. Stumpf, Melissa V. Moreno, Adam B. Olshen, Barry S. Taylor, and Davide Ruggero Supplemental

More information

Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis

Simultaneous variable selection and class fusion for high-dimensional linear discriminant analysis Biostatistics (2010), 11, 4, pp. 599 608 doi:10.1093/biostatistics/kxq023 Advance Access publication on May 26, 2010 Simultaneous variable selection and class fusion for high-dimensional linear discriminant

More information

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis March 18, 2016 UVA Seminar RNA Seq 1 RNA Seq Gene expression is the transcription of the

More information

Regularization and Variable Selection via the Elastic Net

Regularization and Variable Selection via the Elastic Net p. 1/1 Regularization and Variable Selection via the Elastic Net Hui Zou and Trevor Hastie Journal of Royal Statistical Society, B, 2005 Presenter: Minhua Chen, Nov. 07, 2008 p. 2/1 Agenda Introduction

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Mixtures and Hidden Markov Models for analyzing genomic data

Mixtures and Hidden Markov Models for analyzing genomic data Mixtures and Hidden Markov Models for analyzing genomic data Marie-Laure Martin-Magniette UMR AgroParisTech/INRA Mathématique et Informatique Appliquées, Paris UMR INRA/UEVE ERL CNRS Unité de Recherche

More information

Bayesian Clustering of Multi-Omics

Bayesian Clustering of Multi-Omics Bayesian Clustering of Multi-Omics for Cardiovascular Diseases Nils Strelow 22./23.01.2019 Final Presentation Trends in Bioinformatics WS18/19 Recap Intermediate presentation Precision Medicine Multi-Omics

More information

Regression Model In The Analysis Of Micro Array Data-Gene Expression Detection

Regression Model In The Analysis Of Micro Array Data-Gene Expression Detection Jamal Fathima.J.I 1 and P.Venkatesan 1. Research Scholar -Department of statistics National Institute For Research In Tuberculosis, Indian Council For Medical Research,Chennai,India,.Department of statistics

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

Statistical tests for differential expression in count data (1)

Statistical tests for differential expression in count data (1) Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image

More information

Group exponential penalties for bi-level variable selection

Group exponential penalties for bi-level variable selection for bi-level variable selection Department of Biostatistics Department of Statistics University of Kentucky July 31, 2011 Introduction In regression, variables can often be thought of as grouped: Indicator

More information

Web-based Supplementary Materials for BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data

Web-based Supplementary Materials for BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data Web-based Supplementary Materials for BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data Yuan Ji 1,, Yanxun Xu 2, Qiong Zhang 3, Kam-Wah Tsui 3, Yuan Yuan 4, Clift Norris 1, Shoudan

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2013 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Meghana Kshirsagar (mkshirsa), Yiwen Chen (yiwenche) 1 Graph

More information

Robust Variable Selection Through MAVE

Robust Variable Selection Through MAVE Robust Variable Selection Through MAVE Weixin Yao and Qin Wang Abstract Dimension reduction and variable selection play important roles in high dimensional data analysis. Wang and Yin (2008) proposed sparse

More information

BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing. Data

BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing. Data Biometrics 000, 000 000 DOI: 000 000 0000 BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data Yuan Ji 1,, Yanxun Xu 2, Qiong Zhang 3, Kam-Wah Tsui 3, Yuan Yuan 4, Clift Norris 1,

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao

More information

In Search of Desirable Compounds

In Search of Desirable Compounds In Search of Desirable Compounds Adrijo Chakraborty University of Georgia Email: adrijoc@uga.edu Abhyuday Mandal University of Georgia Email: amandal@stat.uga.edu Kjell Johnson Arbor Analytics, LLC Email:

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010

The MNet Estimator. Patrick Breheny. Department of Biostatistics Department of Statistics University of Kentucky. August 2, 2010 Department of Biostatistics Department of Statistics University of Kentucky August 2, 2010 Joint work with Jian Huang, Shuangge Ma, and Cun-Hui Zhang Penalized regression methods Penalized methods have

More information

Optimal normalization of DNA-microarray data

Optimal normalization of DNA-microarray data Optimal normalization of DNA-microarray data Daniel Faller 1, HD Dr. J. Timmer 1, Dr. H. U. Voss 1, Prof. Dr. Honerkamp 1 and Dr. U. Hobohm 2 1 Freiburg Center for Data Analysis and Modeling 1 F. Hoffman-La

More information

ADAPTIVE LASSO FOR SPARSE HIGH-DIMENSIONAL REGRESSION MODELS

ADAPTIVE LASSO FOR SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Statistica Sinica 18(2008), 1603-1618 ADAPTIVE LASSO FOR SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang, Shuangge Ma and Cun-Hui Zhang University of Iowa, Yale University and Rutgers University Abstract:

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * Some contents are adapted from Dr. Hung Huang and Dr. Chengkai Li at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Problems

More information

Generalized Linear Models (1/29/13)

Generalized Linear Models (1/29/13) STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq

SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq Mark F. Rogers, Christina Boucher, and Asa Ben-Hur Department of Computer Science 1873 Campus Delivery Fort Collins, CO 80523 rogersma@cs.colostate.edu,

More information

SUSTAINABLE AND INTEGRAL EXPLOITATION OF AGAVE

SUSTAINABLE AND INTEGRAL EXPLOITATION OF AGAVE SUSTAINABLE AND INTEGRAL EXPLOITATION OF AGAVE Editor Antonia Gutiérrez-Mora Compilers Benjamín Rodríguez-Garay Silvia Maribel Contreras-Ramos Manuel Reinhart Kirchmayr Marisela González-Ávila Index 1.

More information

Normalization of metagenomic data A comprehensive evaluation of existing methods

Normalization of metagenomic data A comprehensive evaluation of existing methods MASTER S THESIS Normalization of metagenomic data A comprehensive evaluation of existing methods MIKAEL WALLROTH Department of Mathematical Sciences CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Introduction to de novo RNA-seq assembly

Introduction to de novo RNA-seq assembly Introduction to de novo RNA-seq assembly Introduction Ideal day for a molecular biologist Ideal Sequencer Any type of biological material Genetic material with high quality and yield Cutting-Edge Technologies

More information

Proteomics and Variable Selection

Proteomics and Variable Selection Proteomics and Variable Selection p. 1/55 Proteomics and Variable Selection Alex Lewin With thanks to Paul Kirk for some graphs Department of Epidemiology and Biostatistics, School of Public Health, Imperial

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Bradley Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Single Index Quantile Regression for Heteroscedastic Data

Single Index Quantile Regression for Heteroscedastic Data Single Index Quantile Regression for Heteroscedastic Data E. Christou M. G. Akritas Department of Statistics The Pennsylvania State University SMAC, November 6, 2015 E. Christou, M. G. Akritas (PSU) SIQR

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

arxiv: v1 [stat.me] 1 Dec 2015

arxiv: v1 [stat.me] 1 Dec 2015 Bayesian Estimation of Negative Binomial Parameters with Applications to RNA-Seq Data arxiv:1512.00475v1 [stat.me] 1 Dec 2015 Luis León-Novelo Claudio Fuentes Sarah Emerson UT Health Science Center Oregon

More information

Robust high-dimensional linear regression: A statistical perspective

Robust high-dimensional linear regression: A statistical perspective Robust high-dimensional linear regression: A statistical perspective Po-Ling Loh University of Wisconsin - Madison Departments of ECE & Statistics STOC workshop on robustness and nonconvexity Montreal,

More information

Uncertainty quantification and visualization for functional random variables

Uncertainty quantification and visualization for functional random variables Uncertainty quantification and visualization for functional random variables MascotNum Workshop 2014 S. Nanty 1,3 C. Helbert 2 A. Marrel 1 N. Pérot 1 C. Prieur 3 1 CEA, DEN/DER/SESI/LSMR, F-13108, Saint-Paul-lez-Durance,

More information

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions?

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions? 1 Supporting Information: What Evidence is There for the Homology of Protein-Protein Interactions? Anna C. F. Lewis, Nick S. Jones, Mason A. Porter, Charlotte M. Deane Supplementary text for the section

More information

RNA-seq. Differential analysis

RNA-seq. Differential analysis RNA-seq Differential analysis DESeq2 DESeq2 http://bioconductor.org/packages/release/bioc/vignettes/deseq 2/inst/doc/DESeq2.html Input data Why un-normalized counts? As input, the DESeq2 package expects

More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

Statistical methods for estimation, testing, and clustering with gene expression data

Statistical methods for estimation, testing, and clustering with gene expression data Graduate Theses and Dissertations Iowa State University Capstones, Theses and Dissertations 2017 Statistical methods for estimation, testing, and clustering with gene expression data Andrew Lithio Iowa

More information

Automatic Response Category Combination. in Multinomial Logistic Regression

Automatic Response Category Combination. in Multinomial Logistic Regression Automatic Response Category Combination in Multinomial Logistic Regression Bradley S. Price, Charles J. Geyer, and Adam J. Rothman arxiv:1705.03594v1 [stat.me] 10 May 2017 Abstract We propose a penalized

More information

Probe-Level Analysis of Affymetrix GeneChip Microarray Data

Probe-Level Analysis of Affymetrix GeneChip Microarray Data Probe-Level Analysis of Affymetrix GeneChip Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley Memorial Sloan-Kettering Cancer Center July

More information