express: Streaming read deconvolution and abundance estimation applied to RNA-Seq

Adam Roberts (1) and Lior Pachter (1,2)
(1) Department of Computer Science, (2) Departments of Mathematics and Molecular & Cell Biology, University of California at Berkeley
IPAM: Mathematical and Computational Approaches in High-Throughput Genomics
Workshop II: Transcriptomics and Epigenomics
October 28, 2011

The unexpected application of cheap sequencing

Despite the obvious possibilities of sequencing many new genomes, high-throughput DNA sequencers have instead been used mainly as bean counters for sequence census methods. The vast majority of DNA sequence currently produced is for *-seq experiments:

[Diagram: Desired measurement -> reduce to sequencing -> Sequence -> Solve inverse problem -> Analyze. Labels on the diagram: Creativity, Biology, Computer Science, Mathematics/Statistics, (Computational) Biology.]

Assays include: ChIP-Seq, RNA-Seq, methyl-seq, GRO-Seq, Clip-Seq, BS-Seq, FRT-Seq, TraDI-Seq, Hi-C, SHAPE-Seq, ...

Background: Uses of RNA-Seq

RNA-Seq data has three primary uses:
1. Discovering genes and isoforms
2. Estimating transcript abundances
3. Finding differential expression between samples

For this talk, I will focus on the second and assume that the transcriptome has been fully assembled or the genome has been fully annotated.

RNA molecules are processed in six steps:
1. Fragmentation of RNA (RNA molecules -> RNA fragments)
2. Random priming to make sscDNA (first-strand synthesis)
3. Construction of dscDNA (second-strand synthesis)
4. Size selection (gel cutout between the short and long fragments)
5. Sequencing (paired-end read)
6. Mapping (sense or anti-sense relative to the RNA sequence)

Background: Estimating gene abundances

[Figure: fragments aligned to the genome over a gene of length 1500 bp.]

To get an abundance relative to other genes in the same experiment, we must normalize by length. To get an abundance relative to genes in other experiments, we must also normalize by the number of reads in the experiment. A typical measure is Fragments Per Kilobase per Million sequenced (FPKM).
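The two normalizations can be written as a short function. This is a minimal sketch, not code from express; the function name `fpkm` and the example numbers (300 fragments, 20M sequenced) are assumptions made for illustration.

```python
def fpkm(fragment_count, length_bp, total_fragments):
    """Fragments per kilobase of target per million sequenced fragments."""
    kilobases = length_bp / 1_000            # normalize by target length
    millions = total_fragments / 1_000_000   # normalize by sequencing depth
    return fragment_count / kilobases / millions


# Hypothetical gene: 1500 bp long, 300 aligned fragments, 20M fragments sequenced.
example = fpkm(fragment_count=300, length_bp=1_500, total_fragments=20_000_000)
print(f"FPKM = {example:.2f}")
```

Dividing by length makes genes within one experiment comparable; dividing by depth makes experiments comparable.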

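Background: Estimating gene abundances

[Figure: fragments aligned to the genome over a gene with two isoforms and their exon union. Isoform A length: 1500 bp. Isoform B length: 1000 bp. Exon union length: 1500 bp.]

$\mathrm{FPKM}_{\mathrm{true}} = \mathrm{FPKM}_A + \mathrm{FPKM}_B \propto \frac{f_A}{l_A} + \frac{f_B}{l_B} = \frac{10}{1500} + \frac{10}{1000} = \frac{1}{60}$

$\mathrm{FPKM}_{\mathrm{union}} \propto \frac{f_A + f_B}{l_{A \cup B}} = \frac{20}{1500} = \frac{1}{75}$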

Background: Implications for differential expression

FPKM is proportional to the number of copies of each transcript in the RNA sample. Assume 50M reads sequenced.

Target        Length (bp)   FPKM    Fragment Count
Isoform A     1000          10      500
Isoform B     1500          10      750
Gene          N/A           20      1250
Exon Union    1500          16.67   1250

What if there is differential splicing?
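The table and the closing question can be checked numerically. The sketch below recomputes the exon-union FPKM from the isoform FPKMs, then repeats the calculation for two hypothetical splicing extremes in which the gene-level FPKM stays at 20. The helper names are assumptions made for this illustration.

```python
TOTAL_FRAGMENTS = 50_000_000           # "Assume 50M reads sequenced"
LENGTHS = {"A": 1_000, "B": 1_500}     # isoform lengths from the table (bp)
EXON_UNION_LENGTH = 1_500


def fpkm(count, length_bp, total=TOTAL_FRAGMENTS):
    return count / (length_bp / 1_000) / (total / 1_000_000)


def expected_count(fpkm_value, length_bp, total=TOTAL_FRAGMENTS):
    """Invert the FPKM definition to get the expected fragment count."""
    return fpkm_value * (length_bp / 1_000) * (total / 1_000_000)


def exon_union_fpkm(isoform_fpkm):
    """FPKM reported by the exon union model for given true isoform FPKMs."""
    count = sum(expected_count(v, LENGTHS[k]) for k, v in isoform_fpkm.items())
    return fpkm(count, EXON_UNION_LENGTH)


table_case = exon_union_fpkm({"A": 10, "B": 10})  # the case in the table
only_a = exon_union_fpkm({"A": 20, "B": 0})       # hypothetical: all expression in A
only_b = exon_union_fpkm({"A": 0, "B": 20})       # hypothetical: all expression in B

for name, value in [("table", table_case), ("only A", only_a), ("only B", only_b)]:
    print(f"{name}: exon-union FPKM = {value:.2f} (true gene FPKM = 20)")
```

In all three cases the true gene-level FPKM is 20, yet the exon-union value changes with the isoform mix, so a pure splicing change can look like differential expression of the gene.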

Background: How is this dealt with?

- Some continue to use the exon union model (HTSeq, DNAnexus).
  Fast, but not accurate.
- Adjust the transcript length to remove shared sequence and only look at unique reads (NEUMA).
  Slower and ignores useful information provided by ambiguous fragments.
- Minimize an objective function based on coverage (rquant, IsoInfer).
  Slightly faster but loses information in individual fragments.
- Use batch EM-based likelihood maximization to probabilistically assign ambiguous fragments (Cufflinks, RSEM, IsoEM).
  Slower but can model all relevant information and is highly accurate (with sufficient depth).

Background: The batch EM Solution

The method:
1. Develop a generative model for the data
2. Derive a likelihood function based on this model
3. Maximize the likelihood function with EM

$\rho_t$ = relative abundance of transcript $t$
$F$ = set of sequenced fragments, i.e. read pairs
$T$ = set of annotated transcripts
$\lambda_L = P(\text{a fragment of length } L)$
$\epsilon$ = sequencing error parameters
$\mu^f_{t,i,L} = P(\text{fragment } f \text{ of length } L \text{ coming from transcript } t \text{ with 5' end at } i \mid \epsilon)$
$\phi$ = bias parameters (relative position and sequence distributions)

$\omega^{5'}_{t,i} = \frac{P(\text{5' end of a fragment at position } i \text{ in transcript } t \mid \phi)}{P(\text{5' end of a fragment at position } i \text{ in transcript } t)}$ --> Weight of locus (with $\omega^{3'}_{t,i}$ defined analogously for the 3' end)

$\omega_{t,i,L} = \omega^{5'}_{t,i} \, \omega^{3'}_{t,i+L-1}$ --> Weight of potential fragment of length $L$ at locus

$\omega_t^L = \sum_{i=0}^{l(t)-L+1} \omega_{t,i,L}$

$\tilde{\omega}_t = \sum_L \lambda_L \, \omega_t^L$ --> Total transcript weight (effective length)

$\alpha_t^L = \frac{\rho_t \, \omega_t^L}{\sum_{t'} \rho_{t'} \, \omega_{t'}^L}$ --> Prob of generating fragment of length $L$ from transcript $t$

$\alpha_t = \frac{\rho_t \, \tilde{\omega}_t}{\sum_{t'} \rho_{t'} \, \tilde{\omega}_{t'}} \implies \rho_t = \frac{\alpha_t / \tilde{\omega}_t}{\sum_{t'} \alpha_{t'} / \tilde{\omega}_{t'}}$ --> Prob of generating (any) fragment from transcript $t$
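The last relation on this slide, between the relative abundance of a transcript and the probability of generating a fragment from it, is easy to check numerically. The sketch below is illustrative only: the function names are made up here, and the effective lengths are simply set to transcript lengths of 1000 and 1500 bp, which assumes no bias and ignores the fragment length distribution.

```python
def fragment_fractions(rho, eff_len):
    """Probability that a fragment comes from each transcript.

    Each entry is proportional to (relative abundance) * (effective length).
    """
    weights = [r * w for r, w in zip(rho, eff_len)]
    total = sum(weights)
    return [w / total for w in weights]


def relative_abundances(alpha, eff_len):
    """Invert the relation: abundance is proportional to fraction / effective length."""
    weights = [a / w for a, w in zip(alpha, eff_len)]
    total = sum(weights)
    return [w / total for w in weights]


eff_len = [1000.0, 1500.0]   # assumed effective lengths for two isoforms
rho = [0.5, 0.5]             # equal numbers of transcript copies

alpha = fragment_fractions(rho, eff_len)
recovered = relative_abundances(alpha, eff_len)

print("fragment fractions:", [round(a, 3) for a in alpha])
print("recovered abundances:", [round(r, 3) for r in recovered])
```

With equal abundances, the longer isoform yields 60% of the fragments, which is why fragment counts must be divided by effective length before abundances are compared.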

[Figure, panels A to D: sequence logos (WebLogo 3.0) and plots of normalized count, density, expected density, and ratio (bias weight) against the offset from the 5' fragment end and the offset from the 3' fragment end (offsets -10 to 10).]

Roberts, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011)

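$\mathcal{L}(\rho \mid F, T, \lambda, \epsilon, \phi) = \prod_{f \in F} \sum_{L} \lambda_L \sum_{t \in T} \alpha_t^L \sum_{i=0}^{l(t)-L+1} \frac{\omega_{t,i,L}}{\omega_t^L} \, \mu^f_{t,i,L}$

Roberts, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011)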

[Slide build: the likelihood above annotated with the steps of the generative process: "Prepare sample of target RNA molecules", "Fragment and sequence molecule population", "Map fragments ...".]

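$(t, i, L)$ = fragment alignment (transcript, 5' endpoint, mapped length)
$A(f)$ = set of alignments of fragment $f$

$\mathcal{L}(\rho \mid F, T, \lambda, \epsilon, \phi) = \prod_{f \in F} \sum_{L} \lambda_L \sum_{t \in T} \alpha_t^L \sum_{i=0}^{l(t)-L+1} \frac{\omega_{t,i,L}}{\omega_t^L} \, \mu^f_{t,i,L} = \prod_{f \in F} \sum_{(t,i,L) \in A(f)} \lambda_L \, \alpha_t^L \, \frac{\omega_{t,i,L}}{\omega_t^L} \, \mu^f_{t,i,L}$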
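Restricting the sums to the alignments of each fragment makes the likelihood cheap to evaluate. The sketch below shows the shape of that computation under stated assumptions: the `Alignment` record and its field names are hypothetical, and the four factors of each alignment are supplied as plain numbers rather than computed from a fitted model.

```python
import math
from typing import NamedTuple


class Alignment(NamedTuple):
    """One alignment (t, i, L) of a fragment; all fields are illustrative inputs."""
    length_prob: float    # probability of a fragment of this mapped length
    alpha: float          # probability that a fragment of this length comes from t
    locus_weight: float   # weight of this start position divided by the transcript total
    error_prob: float     # probability of the observed read given this alignment


def log_likelihood(fragments):
    """Sum over fragments of the log of the summed alignment probabilities."""
    total = 0.0
    for alignments in fragments:
        p = sum(a.length_prob * a.alpha * a.locus_weight * a.error_prob
                for a in alignments)
        total += math.log(p)
    return total


# Two hypothetical fragments: the first maps to two transcripts, the second to one.
fragments = [
    [Alignment(0.5, 0.4, 1 / 1000, 1.0), Alignment(0.5, 0.6, 1 / 1500, 1.0)],
    [Alignment(0.5, 0.4, 1 / 1000, 1.0)],
]

print(f"log-likelihood = {log_likelihood(fragments):.4f}")
```

The cost is proportional to the total number of alignments, not to the number of transcripts times the number of positions.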

Background: Batch EM Deconvolution

[Figure: three transcripts (blue, green, red) aligned to the genome, and five aligned reads (a, b, c, d, e) with proportional assignment to transcripts. Transcript abundances start at 0.33, 0.33, 0.33. The values shown after successive rounds of E-step and M-step are 0.27, 0.27, 0.47, then 0.23, 0.55, 0.23, then 0.18, 0.18, 0.64.]

Example assumptions:
- All fragments are the same length
- All transcripts are the same length
- Uniform coverage (no sequence-specific or positional bias)

Pachter, L. Models for transcript quantification from RNA-Seq. (2011)
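The E-step and M-step of this toy example fit in a few lines. The read-to-transcript compatibility in the figure is not recoverable from the slide text, so the pattern in `READS` below is an assumption: it was chosen because, under the three simplifying assumptions listed above, it gives abundances of about 0.47/0.27/0.27 after one round, 0.55/0.23/0.23 after two, and converges near 0.64/0.18/0.18, which agrees with the values on the slide.

```python
TRANSCRIPTS = ("red", "green", "blue")

# Hypothetical compatibility pattern (an assumption, not read off the figure).
READS = {
    "a": ("red", "green", "blue"),
    "b": ("red",),
    "c": ("red", "green"),
    "d": ("red", "blue"),
    "e": ("green", "blue"),
}


def em_step(rho):
    """One round of batch EM over all reads."""
    counts = dict.fromkeys(TRANSCRIPTS, 0.0)
    # E-step: split each read among its compatible transcripts in
    # proportion to the current abundances.
    for compatible in READS.values():
        norm = sum(rho[t] for t in compatible)
        for t in compatible:
            counts[t] += rho[t] / norm
    # M-step: the new abundances are the normalized expected counts.
    n_reads = len(READS)
    return {t: counts[t] / n_reads for t in TRANSCRIPTS}


def run_em(iterations):
    rho = dict.fromkeys(TRANSCRIPTS, 1.0 / len(TRANSCRIPTS))
    for _ in range(iterations):
        rho = em_step(rho)
    return rho


for k in (0, 1, 2, 50):
    rho = run_em(k)
    print(k, {t: round(v, 2) for t, v in rho.items()})
```

Every round touches every read, which is the cost that the batch approach pays hundreds or thousands of times on real data.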

Background: The batch EM Solution

The method:
1. Come up with a generative model for the data
2. Make a likelihood function based on this model
3. Maximize the likelihood function with EM

The problem:
- The batch EM requires you to iterate over the data hundreds or thousands of times.
- Read alignments are nearing 1 TB for typical experiments (uncompressed).
- Alignments must therefore be stored in memory (expensive) or read from disk (slow).

Solutions:
- Cut-off on mismatches to avoid too much multi-mapping. (RSEM)
- Partition reads into blocks based on shared multi-mapping. Partitions will become large with more reads and allowed mismatches. (IsoEM)
- Partition reads based on genomic loci. Misses reads that might map to multiple genomic loci. (Cufflinks)

Proposal: Use online EM instead of batch! (express)

Method: Basics of the express online algorithm

- Based on the Cufflinks likelihood function combined with the online algorithm described in (Cappé & Moulines, 2009).
- Allow reads to multi-map to the transcript sequences with limited restrictions.
- Probabilistically assign an incoming read to transcripts based on current model parameters (length distribution, abundances, errors, bias, etc.).
- Update model parameters based on the probabilistic read assignment.
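The assign-then-update loop described above can be sketched for abundances alone. This is a simplified illustration, not the express implementation: it ignores fragment lengths, errors, and bias, it uses a step size of 1/(n+1)^c so that the first step never discards the initial estimate, and the five-read compatibility pattern is hypothetical.

```python
def online_em(read_stream, transcripts, c=0.85):
    """Process each read once, updating abundances after every read."""
    rho = {t: 1.0 / len(transcripts) for t in transcripts}
    for n, compatible in enumerate(read_stream, start=1):
        gamma = 1.0 / (n + 1) ** c  # decreasing step size (assumed schedule)
        # Assign the read to its compatible transcripts using current abundances.
        norm = sum(rho[t] for t in compatible)
        posterior = {t: rho[t] / norm for t in compatible}
        # Move the abundance estimate a small step toward this assignment.
        for t in transcripts:
            rho[t] = (1.0 - gamma) * rho[t] + gamma * posterior.get(t, 0.0)
    return rho


TRANSCRIPTS = ("red", "green", "blue")

# Hypothetical stream: five read types repeated in a fixed order.
READ_CYCLE = [
    ("red", "green", "blue"),
    ("red",),
    ("red", "green"),
    ("red", "blue"),
    ("green", "blue"),
]
stream = READ_CYCLE * 4000  # 20,000 reads, each seen exactly once

estimate = online_em(stream, TRANSCRIPTS)
print({t: round(v, 3) for t, v in estimate.items()})
```

No read is stored or revisited, so memory depends on the number of transcripts rather than the number of reads.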

Method: Simple express example

[Animated diagram: reads pass through the Bowtie aligner against the transcriptome, and each fragment is probabilistically assigned to transcripts. The assignment updates the transcript fragment counts, the transcript abundances, and the fragment length distribution (counts against fragment length). Other labels on the diagram: Read Counts, Oxford Nanopore.
Frame 1: transcript abundances 0.33, 0.33, 0.33.
Frame 2: a fragment is assigned with probability 1; abundances become 0.25, 0.5, 0.25.
Frame 3: a fragment is assigned with probabilities 0.67 and 0.33; abundances become 0.33, 0.27, 0.40.]

Results: Convergence for simple gene (UGT3A2)

[Figure: convergence plot comparing express, RSEM, and Cufflinks.]

Speeding up convergence

Cappé and Moulines (Journal of the Royal Statistical Society, 2009) prove the following theorem:

The forgetting factor is $\gamma_n = \frac{1}{n^c}$, where $\frac{1}{2} < c \le 1$.

The updates require O(k) operations, where k is the number of transcripts.

Speeding up convergence

We have shown that instead of updating the frequency vectors $\hat{s}_n$, one can update count vectors so that

$\hat{c}_{n+1} = \hat{c}_n + f_n \, s(y_{n+1}; \hat{\theta}_n)$

The forgetting weights are given by

$f_n = f_{n-1} \, \frac{\gamma_n}{1 - \gamma_n} \, \frac{1}{\gamma_{n-1}}$

where $\gamma_n$ is the forgetting factor. The number of coordinates that need to be updated at every step is the number of transcripts that the read being considered maps to.
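The equivalence between the frequency update and the count update can be checked numerically. The sketch below tracks one coordinate of the sufficient statistic and takes the per-read values as given; the function names, the indexing (the first observation gets weight 1), and the test sequence are assumptions for illustration.

```python
def step_size(n, c=0.85):
    """Forgetting factor 1 / n^c."""
    return 1.0 / n ** c


def frequency_updates(xs, c=0.85):
    """Running average form: every step rescales the old value by (1 - gamma)."""
    s = 0.0
    for n, x in enumerate(xs, start=1):
        g = step_size(n, c)
        s = (1.0 - g) * s + g * x
    return s


def count_updates(xs, c=0.85):
    """Count form: add each observation with a growing weight, normalize at the end."""
    count, total_weight, weight = 0.0, 0.0, 1.0
    for n, x in enumerate(xs, start=1):
        if n > 1:
            g, g_prev = step_size(n, c), step_size(n - 1, c)
            weight *= g / (1.0 - g) / g_prev  # forgetting weight recursion
        count += weight * x
        total_weight += weight
    return count / total_weight


# Arbitrary values in [0, 1], standing in for per-read assignment probabilities.
xs = [((i * 37) % 11) / 10 for i in range(200)]
print(frequency_updates(xs), count_updates(xs))
```

In the count form, a coordinate with no contribution from the current read is left untouched, which is why only the transcripts a read maps to need updating. The weights grow quickly, so a production implementation would likely keep them in log space.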

Results: Convergence for simple gene (UGT3A2)

[Figure: convergence plot comparing express, RSEM, and Cufflinks.]

Results: Convergence for complex gene (Dystrophin)

[Figure: convergence plot comparing express, RSEM, and Cufflinks.]

Results: Full transcriptome (hg19, >76k transcripts)

Results: Performance

- Analysis for 1B simulated reads mapped to the human transcriptome (hg19 UCSC)
- 8-core 2.27 GHz Mac Pro with 24 GB of RAM
- Cufflinks and RSEM were run with 8 threads, express with 1

[Figure: performance comparison, annotated with O(size of transcriptome) and O(# of fragments).]

Results: Speed of parameter convergence

10M mapped 76bp x 76bp fragments from ENCODE

express Usage

Inputs:
- Target FASTA
- Read FASTQ, passed through a read mapper

Outputs:
- results.xprs: FPKM, Unique Counts, Total Counts, Estimated Counts
- params.xprs: Fragment Lengths, Sequence-specific Bias, Relative Position Bias, Error Substitutions
- hits.prob.sam: Input alignments with posterior probability of each multi-mapping

Estimating counts

[Figure: the batch EM example (transcripts blue, green, red; reads a, b, c, d, e), annotated with Unique Counts, Total Counts, and Estimated Counts (values shown: 1 and 4).]

At every step of the batch EM algorithm:

unique $\le$ estimated $\le$ total

Estimation of counts in express

Because of the forgetting weights we use to speed up convergence, the counts at a given step may not lead to estimates that lie between the unique and total counts.

To ensure that estimated counts satisfy the constraint (which is satisfied by the optimum), we employ an alternating projection algorithm: the algorithm projects the initial estimate alternately onto the hyperplane (counts sum to the total) and the cube (unique <= estimate <= total).

The algorithm converges (in this case in a finite number of steps) by a theorem of von Neumann.
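A generic alternating projection between a hyperplane and a box can be sketched as follows. This illustrates the technique only, not the express code: the function names, the stopping tolerance, and the example counts are assumptions, and this plain version stops when the sum is within a tolerance rather than after an exact finite number of steps.

```python
def project_to_hyperplane(x, target_sum):
    """Closest point to x whose coordinates sum to target_sum."""
    shift = (target_sum - sum(x)) / len(x)
    return [v + shift for v in x]


def project_to_box(x, lower, upper):
    """Closest point to x with lower[i] <= x[i] <= upper[i] for every i."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]


def alternating_projection(x, lower, upper, target_sum, tol=1e-9, max_iter=10_000):
    """Alternate the two projections until the boxed point also meets the sum."""
    for _ in range(max_iter):
        x = project_to_hyperplane(x, target_sum)
        x = project_to_box(x, lower, upper)
        if abs(sum(x) - target_sum) <= tol:
            break
    return x


# Hypothetical counts for three transcripts sharing 8 fragments.
unique_counts = [1.0, 0.0, 2.0]
total_counts = [4.0, 3.0, 5.0]
raw_estimate = [5.0, -1.0, 3.0]  # violates the bounds and sums to 7

projected = alternating_projection(raw_estimate, unique_counts, total_counts,
                                   target_sum=8.0)
print([round(v, 3) for v in projected])
```

Because the loop ends on the box projection, the bounds hold exactly on return and the sum holds to within the tolerance.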

Discussion: Current Illumina analysis pipeline

Sequencing Machine (Illumina HiSeq, ...) -> Image File (.tif)
Image Analysis (Firecrest, ...) -> Intensity File (.cif, ~2 TB)
Base Caller (Bustard, BayesCall, ...) -> Sequence File (.fastq, ~250 GB)
Short-Read Aligner (Bowtie, Maq, ...) -> Alignment File (.sam, ~1.2 TB)
RNA-seq Quantification (Cufflinks, RSEM, ...) -> Expression File (.fpkm, ~3 MB)
? -> Archive (SRA, GEO, ...)

Discussion: Too much data to store!

[Figure on archive growth (SRA, GEO, ...). Macmillan Publishers Ltd: Nature 458, 719-724 (2009)]

Discussion: Too much data to store!

"As the performance of next-generation sequencing machines continues to improve in terms of speed, cost, accuracy, and length, and as computational processing continues to improve, the need to access the underlying reads decreases."
- David Lipman, NCBI

GB Editorial Team. Closure of the NCBI SRA and implications for the long-term future of genomics data storage. Genome Biology (2011)

Discussion: What we've accomplished.

Sequencing Machine (Illumina HiSeq, ...) -> Image File (.tif)
Image Analysis (Firecrest, ...) -> Intensity File (.cif, ~2 TB)
Base Caller (Bustard, BayesCall, ...) -> Sequence File (.fastq, ~250 GB)
Short-Read Aligner (Bowtie, Maq, ...) -> no Alignment File stored
RNA-seq Quantification (Cufflinks, RSEM, ...) -> Expression File (.fpkm, ~3 MB)
Archive (SRA, GEO, ...)

Discussion: What will be possible...?

Streaming Sequencing Machine (?) -> Short-Read Aligner (Bowtie, Maq, ...) -> RNA-seq Quantification (express) -> Expression File (.fpkm, ~3 MB) -> Archive (SRA, GEO, ...)

Conclusions and Future Work

- It is important to deconvolute read counts even for gene-level expression analysis.
- The online algorithm implemented in express produces very accurate results with much less resource use than batch approaches and is thus applicable to larger datasets.
- express can be used in a streaming sequencing pipeline to produce results without the need for storing intermediate data.
- express also estimates the posterior distribution on estimated counts for each isoform, which can be combined with negative-binomial models of biological over-dispersion to achieve more accurate differential expression analysis.
- Ambiguous read mapping is a problem in many other applications besides RNA-Seq, including ChIP-Seq, variant detection, and metagenomics. express has been engineered to be a general-purpose tool that can be applied in all of these areas and more.

Software: http://bio.math.berkeley.edu/express

Acknowledgements
Lior Pachter, UC Berkeley
Harold Pimentel, UC Berkeley
Funding for Adam Roberts provided by the NSF Graduate Research Fellowship

References
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology.
Cappé O and Moulines E (2009). On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society.
Roberts A and Pachter L (2011). express: A Bayesian / Online EM algorithm for isoform-level RNA-seq quantification. In preparation.