eXpress: Streaming read deconvolution and abundance estimation applied to RNA-Seq

eXpress: Streaming read deconvolution and abundance estimation applied to RNA-Seq
Adam Roberts (1) and Lior Pachter (1,2)
(1) Department of Computer Science, (2) Departments of Mathematics and Molecular & Cell Biology, University of California at Berkeley
IPAM: Mathematical and Computational Approaches in High-Throughput Genomics, Workshop II: Transcriptomics and Epigenomics
October 28, 2011

The unexpected application of cheap sequencing
Despite the obvious possibilities of sequencing many new genomes, high-throughput DNA sequencers have instead been mainly utilized as bean counters for sequence census methods. The vast majority of DNA sequence currently produced is for *-seq experiments.
[Diagram: desired measurement -> reduce to sequencing -> sequence -> solve inverse problem -> analyze. Fields involved: Creativity, Biology, Computer Science, Mathematics/Statistics, (Computational) Biology]
Assays include: ChIP-Seq, RNA-Seq, methyl-seq, GRO-Seq, Clip-Seq, BS-Seq, FRT-Seq, TraDI-Seq, Hi-C, SHAPE-Seq...

Background: Uses of RNA-Seq
RNA-Seq data has three primary uses:
- Discovering genes and isoforms
- Estimating transcript abundances
- Finding differential expression between samples
For this talk, I will focus on the second and assume that the transcriptome has been fully assembled / the genome has been fully annotated.

[Figure: RNA-Seq protocol, from RNA molecules to mapped paired-end reads]
1. Fragmentation of RNA (RNA molecules -> RNA fragments)
2. Random priming to make sscDNA (first-strand synthesis)
3. Construction of dscDNA (second-strand synthesis)
4. Size selection (gel cutout between short and long fragments)
5. Sequencing (paired-end reads)
6. Mapping (sense or anti-sense relative to the RNA sequence)

Background: Estimating gene abundances
[Figure: fragments aligned to a gene on the genome; gene length: 1500 bp]
To get an abundance relative to other genes in the same experiment, we must normalize by length. To get an abundance relative to genes in other experiments, we must also normalize by the number of reads in the experiment. A typical measure is Fragments Per Kilobase per Million sequenced (FPKM). For the gene shown:
$$\mathrm{FPKM} \propto \frac{12}{1500}$$

Background: Estimating gene abundances
[Figure: fragments aligned to a genome locus with two isoforms. Isoform A length: 1500 bp; Isoform B length: 1000 bp; exon union length: 1500 bp]
Summing the isoform abundances:
$$\mathrm{FPKM}_{\mathrm{true}} = \mathrm{FPKM}_A + \mathrm{FPKM}_B \propto \frac{f_A}{l_A} + \frac{f_B}{l_B} = \frac{10}{1500} + \frac{10}{1000} = \frac{1}{60}$$
Using the exon union model:
$$\mathrm{FPKM}_{\mathrm{union}} \propto \frac{f_A + f_B}{l_{A \cup B}} = \frac{20}{1500} = \frac{1}{75}$$
$$\mathrm{FPKM}_{\mathrm{union}} \le \mathrm{FPKM}_{\mathrm{true}}$$

Background: Implications for differential expression
FPKM is proportional to the number of copies of each transcript in the RNA sample. Assume 50M reads sequenced.

Target       Length   FPKM    Fragment Count
Isoform A    1000     10      500
Isoform B    1500     10      750
Gene         N/A      20      1250
Exon Union   1500     16.67   1250

What if there is differential splicing?

Target       Length   FPKM    Fragment Count
Isoform A    1000     24      1200
Isoform B    1500     0.67    50
Gene         N/A      24.67   1250
Exon Union   1500     16.67   1250
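The two tables above can be checked with a few lines of Python. This is a minimal sketch, assuming FPKM is computed as fragments divided by length in kilobases and by millions of sequenced fragments; the function names `fpkm` and `summarize` are illustrative and not part of any tool mentioned in the talk.

```python
def fpkm(fragments, length_bp, total_fragments):
    """Fragments per kilobase of target per million sequenced fragments."""
    return fragments / (length_bp / 1e3) / (total_fragments / 1e6)

TOTAL = 50e6  # 50M reads sequenced, as assumed on the slide

def summarize(count_a, count_b, len_a=1000, len_b=1500, union_len=1500):
    """Compare the isoform-sum gene FPKM with the exon-union FPKM."""
    fpkm_a = fpkm(count_a, len_a, TOTAL)
    fpkm_b = fpkm(count_b, len_b, TOTAL)
    return {
        "A": fpkm_a,
        "B": fpkm_b,
        "gene": fpkm_a + fpkm_b,
        "union": fpkm(count_a + count_b, union_len, TOTAL),
    }

before = summarize(500, 750)   # equal isoform abundances
after = summarize(1200, 50)    # differential splicing, same total count
for label, row in (("before", before), ("after", after)):
    print(label, {k: round(v, 2) for k, v in row.items()})
```

The gene-level FPKM changes from 20 to 24.67 while the exon-union value stays at 16.67, because the exon-union model only sees the unchanged total of 1250 fragments.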

Background: How is this dealt with?
- Some continue to use the exon union model (HTSeq, DNAnexus). Fast, but not accurate.
- Adjust the transcript length to remove shared sequence and only look at unique reads (NEUMA). Slower, and ignores useful information provided by ambiguous fragments.
- Minimize an objective function based on coverage (rquant, IsoInfer). Slightly faster, but loses information in individual fragments.
- Use batch EM-based likelihood maximization to probabilistically assign ambiguous fragments (Cufflinks, RSEM, IsoEM). Slower, but can model all relevant information and is highly accurate (with sufficient depth).

Background: The batch EM Solution
The method:
- Develop a generative model for the data
- Derive a likelihood function based on this model
- Maximize the likelihood function with EM

$\rho_t$ = relative abundance of transcript $t$
$F$ = set of sequenced fragments, i.e. read pairs
$T$ = set of annotated transcripts
$\kappa_L$ = P(a fragment of length $L$)
$\varepsilon$ = sequencing error parameters
$\mu^f_{t,i,L}$ = P(fragment $f$ of length $L$ coming from transcript $t$ with 5' end at $i$)
$\beta$ = bias parameters (relative position and sequence distributions)
$\omega^{5'}_{t,i}$, $\omega^{3'}_{t,i}$ = weight of locus: a ratio of two probabilities of the 5' (or 3') end of a fragment falling at position $i$ in transcript $t$ (plotted on the slide as normalized count density over expected density)

Weight of potential fragment of length $L$ at locus:
$$\omega_{t,i,L} = \omega^{5'}_{t,i}\,\omega^{3'}_{t,i+L-1}$$
$$\omega_t^L = \sum_{i=0}^{l(t)-L+1} \omega_{t,i,L}$$
Total transcript weight (effective length):
$$\tilde{\omega}_t = \sum_L \kappa_L\,\omega_t^L$$
Probability of generating a fragment of length $L$ from transcript $t$:
$$\tau_t^L = \frac{\rho_t\,\omega_t^L}{\sum_{t'} \rho_{t'}\,\omega_{t'}^L}$$
Probability of generating (any) fragment from transcript $t$:
$$\tau_t = \frac{\rho_t\,\tilde{\omega}_t}{\sum_{t'} \rho_{t'}\,\tilde{\omega}_{t'}} \implies \rho_t = \frac{\tau_t / \tilde{\omega}_t}{\sum_{t'} \tau_{t'} / \tilde{\omega}_{t'}}$$

[Figure: sequence logos around the 5' and 3' fragment ends, with normalized count density, expected density, and their ratio (bias weight) plotted against offset from the 5' and 3' fragment end]
Roberts, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011)
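The conversion between relative abundance and fragment-generation probability can be sketched in Python. This is an illustration under simplifying assumptions, not the authors' implementation: all bias weights are set to 1, so the weight of a transcript for fragment length L reduces to a count of valid start positions (taken here as l(t) - L + 1), and the names `effective_length`, `rho_to_tau`, and `tau_to_rho` as well as the toy fragment length distribution are hypothetical.

```python
def effective_length(length, frag_len_dist):
    """Total transcript weight with all bias weights equal to 1 (assumption).

    frag_len_dist maps fragment length L to its probability.
    """
    return sum(p * max(length - L + 1, 0) for L, p in frag_len_dist.items())

def rho_to_tau(rho, eff):
    """Relative abundances -> probability that a fragment comes from each transcript."""
    z = sum(rho[t] * eff[t] for t in rho)
    return {t: rho[t] * eff[t] / z for t in rho}

def tau_to_rho(tau, eff):
    """Inverse map: fragment probabilities -> relative abundances."""
    z = sum(tau[t] / eff[t] for t in tau)
    return {t: tau[t] / eff[t] / z for t in tau}

# Toy inputs (hypothetical): two isoforms and a two-point fragment length distribution.
lengths = {"A": 1500, "B": 1000}
frag_len_dist = {200: 0.5, 300: 0.5}
eff = {t: effective_length(l, frag_len_dist) for t, l in lengths.items()}

rho = {"A": 0.5, "B": 0.5}
tau = rho_to_tau(rho, eff)
rho_back = tau_to_rho(tau, eff)
print(eff, tau, rho_back)
```

With equal abundances, the longer isoform is expected to generate more fragments, which is why abundance estimates must divide out the effective length.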

$$L(\rho \mid F, T, \kappa, \varepsilon, \beta) = \prod_{f \in F} \sum_{L} \kappa_L \sum_{t \in T} \tau_t^L \sum_{i=0}^{l(t)-L+1} \frac{\omega_{t,i,L}}{\omega_t^L}\,\mu^f_{t,i,L}, \qquad \tau_t^L = \frac{\rho_t\,\omega_t^L}{\sum_{t'} \rho_{t'}\,\omega_{t'}^L}$$
For each fragment, the likelihood sums over the fragment length, the transcript of origin, and the 5' position within the transcript.
[Diagram labels: prepare sample of target RNA molecules; fragment and sequence molecule population; map fragments]
Roberts, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011)

$(t, i, L)$ = fragment alignment (transcript, 5' endpoint, mapped length)
$A(f)$ = set of alignments of fragment $f$
$$L(\rho \mid F, T, \kappa, \varepsilon, \beta) = \prod_{f \in F} \sum_{L} \kappa_L \sum_{t \in T} \tau_t^L \sum_{i=0}^{l(t)-L+1} \frac{\omega_{t,i,L}}{\omega_t^L}\,\mu^f_{t,i,L} = \prod_{f \in F} \sum_{(t,i,L) \in A(f)} \kappa_L\,\tau_t^L\,\frac{\omega_{t,i,L}}{\omega_t^L}\,\mu^f_{t,i,L}$$
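Restricting the sum to observed alignments makes the likelihood cheap to evaluate. The sketch below assumes each alignment is supplied as a tuple of transcript, mapped length, locus weight ratio, and error probability; the tuple layout, the function name `log_likelihood`, and all numeric values are hypothetical choices for illustration.

```python
import math

def log_likelihood(fragments, kappa, tau_L):
    """Sum of log fragment probabilities over the alignment sets A(f).

    fragments: list of alignment lists; each alignment is a tuple
               (transcript, mapped_length, weight_ratio, mu), where
               weight_ratio is the locus weight divided by the transcript
               weight for that length, and mu is the error-model probability.
    kappa:     dict mapping fragment length to its probability.
    tau_L:     dict mapping (transcript, length) to the probability of
               generating a fragment of that length from that transcript.
    """
    total = 0.0
    for alignments in fragments:
        p = 0.0
        for transcript, length, weight_ratio, mu in alignments:
            p += kappa[length] * tau_L[(transcript, length)] * weight_ratio * mu
        total += math.log(p)
    return total

# Toy data (hypothetical): one fragment with two candidate alignments.
kappa = {200: 1.0}
tau_L = {("A", 200): 0.6, ("B", 200): 0.4}
fragments = [[("A", 200, 0.001, 0.9), ("B", 200, 0.002, 0.9)]]
print(log_likelihood(fragments, kappa, tau_L))
```

Working in log space avoids underflow when the product runs over millions of fragments.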

Background: Batch EM Deconvolution
[Figure: three transcripts (blue, green, red) aligned to the genome, with aligned reads a-e proportionally assigned to transcripts. E-steps assign reads; M-steps re-estimate transcript abundances.]
Transcript abundances over successive iterations:
- Initial: 0.33, 0.33, 0.33
- After the first M-step: 0.27, 0.27, 0.47
- After the second M-step: 0.23, 0.23, 0.55
- Subsequently: 0.18, 0.18, 0.64
Example assumptions:
- All fragments are the same length
- All transcripts are the same length
- Uniform coverage (no sequence-specific or positional bias)
Pachter, L. Models for transcript quantification from RNA-Seq. (2011)
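The iteration on this slide can be reproduced with a small batch EM in Python. The figure itself did not survive extraction, so the read-to-transcript compatibility table below is an assumption, chosen because it yields the abundances quoted on the slide (0.27/0.27/0.47, then 0.23/0.23/0.55, converging to 0.18/0.18/0.64) under the stated equal-length, uniform-coverage assumptions. The names `COMPAT`, `em_step`, and `run_em` are illustrative.

```python
# Assumed compatibility between reads a-e and the three transcripts.
COMPAT = {
    "a": ["blue", "green", "red"],
    "b": ["blue", "green"],
    "c": ["blue", "red"],
    "d": ["red"],
    "e": ["green", "red"],
}

def em_step(rho, compat):
    """One E-step (proportional read assignment) followed by one M-step."""
    counts = {t: 0.0 for t in rho}
    for transcripts in compat.values():
        z = sum(rho[t] for t in transcripts)
        for t in transcripts:
            counts[t] += rho[t] / z          # fractional assignment of this read
    n_reads = len(compat)
    return {t: c / n_reads for t, c in counts.items()}

def run_em(compat, transcripts, iterations):
    """Return the abundance estimates after 0, 1, ..., iterations steps."""
    rho = {t: 1.0 / len(transcripts) for t in transcripts}
    history = [rho]
    for _ in range(iterations):
        rho = em_step(rho, compat)
        history.append(rho)
    return history

history = run_em(COMPAT, ["blue", "green", "red"], 100)
for step in (0, 1, 2, 100):
    print(step, {t: round(v, 2) for t, v in history[step].items()})
```

Every iteration touches every read, which is the cost that motivates a streaming alternative when alignments do not fit in memory.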

Background: The batch EM Solution
The method:
- Come up with a generative model for the data
- Make a likelihood function based on this model
- Maximize the likelihood function with EM
The problem:
- The batch EM requires you to iterate over the data hundreds or thousands of times.
- Read alignments are nearing 1 TB for typical experiments (uncompressed).
- Alignments must therefore be stored in memory (expensive) or read from disk (slow).
Solutions:
- Cut-off on mismatches to avoid too much multi-mapping (RSEM).
- Partition reads into blocks based on shared multi-mapping (IsoEM). Partitions will become large with more reads and allowed mismatches.
- Partition reads based on genomic loci (Cufflinks). Misses reads that might map to multiple genomic loci.
Proposal: Use online EM instead of batch! (eXpress)

Method: Basics of the eXpress online algorithm
- Based on the Cufflinks likelihood function combined with the online algorithm described in (Cappé & Moulines, 2009).
- Allow reads to multi-map to the transcript sequences with limited restrictions.
- Probabilistically assign an incoming read to transcripts based on current model parameters (length distribution, abundances, errors, bias, etc.).
- Update model parameters based on the probabilistic read assignment.

Method: Simple eXpress example
[Figure: reads stream from the sequencer (labeled Oxford Nanopore) through the Bowtie aligner against the transcriptome into probabilistic fragment assignment. Panels: Transcript Fragment Counts; Transcript Abundances (expectation); Fragment Length Distribution (counts vs. fragment length)]
The animation steps through three fragments:
1. Initial transcript abundances: 0.33, 0.33, 0.33. The first fragment is assigned to a single transcript with probability 1.
2. Abundances become 0.25, 0.5, 0.25. The second fragment is split between two transcripts with probabilities 0.67 and 0.33.
3. Abundances become 0.33, 0.27, 0.40. The third fragment is split among all three transcripts with probabilities 0.33, 0.27, 0.40.
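The streaming idea can be sketched as follows. This is a deliberately simplified illustration, not eXpress itself: it assumes one pseudo-count per transcript, ignores fragment length, bias, and error models, and gives every fragment the same mass unless a `mass` function is supplied. It reproduces the first update on the slide (0.33 each to 0.25, 0.5, 0.25) and the 0.67/0.33 split of the second fragment, but it is not claimed to reproduce the later numbers, which depend on details not recoverable from the transcript. The name `online_em` and the transcript labels are hypothetical.

```python
def online_em(reads, transcripts, mass=lambda n: 1.0):
    """Process reads one at a time, updating counts after each read.

    reads:       iterable of lists of transcripts each read aligns to.
    transcripts: all transcript names.
    mass:        weight given to the n-th read (constant by default).
    Returns the final abundances and the per-read assignment probabilities.
    """
    counts = {t: 1.0 for t in transcripts}   # assumed pseudo-count of 1 each
    trace = []
    for n, hits in enumerate(reads, start=1):
        total = sum(counts.values())
        abundance = {t: counts[t] / total for t in transcripts}
        z = sum(abundance[t] for t in hits)
        assignment = {t: abundance[t] / z for t in hits}
        for t, p in assignment.items():
            counts[t] += mass(n) * p          # only aligned transcripts change
        trace.append(assignment)
    total = sum(counts.values())
    return {t: counts[t] / total for t in transcripts}, trace

names = ["A", "B", "C"]
after_one, _ = online_em([["B"]], names)
after_two, trace = online_em([["B"], ["B", "C"]], names)
print(after_one)
print(trace[1])
```

Each read is seen once and then discarded, so memory use depends on the number of transcripts rather than on the number of reads.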

Results: Convergence for simple gene (UGT3A2)
[Figure legend: eXpress, RSEM, Cufflinks]

Speeding up convergence
Cappé and Moulines (Journal of the Royal Statistical Society, 2009) prove the following theorem: [theorem statement not preserved in the source].
The forgetting factor is $\gamma_n = \frac{1}{n^c}$, where $\frac{1}{2} < c \le 1$.
The updates require O(k) operations, where k is the number of transcripts.

Speeding up convergence
We have shown that instead of updating the frequency vectors $\hat{s}_n$ one can instead update count vectors, so that
$$\hat{c}_{n+1} = \hat{c}_n + f_n \, s(y_{n+1}; \hat{\theta}_n).$$
The forgetting weights are given by
$$f_n = f_{n-1} \cdot \frac{\gamma_n}{1 - \gamma_n} \cdot \frac{1}{\gamma_{n-1}},$$
where $\gamma_n$ is the forgetting factor.
The number of coordinates that need to be updated at every step is the number of transcripts the read being considered maps to.
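The equivalence between the two update rules can be checked numerically. The following is a minimal Python sketch, not the express implementation: the fragment representation (a dict from transcript index to likelihood), the function names, the uniform starting vector and the choice c = 0.85 are all assumptions made for illustration, and the weight f_n is paired here with the n-th update, which may differ from the slide's indexing by a shift. The count-vector version touches only the transcripts a fragment maps to, while the frequency-vector version rescales all k coordinates.

```python
def posterior(weights, fragment):
    """Posterior assignment of one fragment to its compatible transcripts.

    `fragment` maps transcript index -> likelihood (hypothetical format).
    Only the compatible coordinates of `weights` are read.
    """
    scores = {t: weights[t] * lik for t, lik in fragment.items()}
    norm = sum(scores.values())
    return {t: s / norm for t, s in scores.items()}


def online_em_frequencies(fragments, k, c=0.85):
    """Frequency-vector update: every step rescales all k coordinates."""
    freq = [1.0 / k] * k
    for n, fragment in enumerate(fragments, start=2):
        gamma = n ** -c                       # forgetting factor 1 / n^c
        post = posterior(freq, fragment)
        freq = [(1.0 - gamma) * v for v in freq]   # O(k) work
        for t, p in post.items():
            freq[t] += gamma * p
    return freq


def online_em_counts(fragments, k, c=0.85):
    """Count-vector update: only the mapped transcripts are touched."""
    counts = [1.0 / k] * k
    weight = 1.0                              # f_1, chosen so both versions agree
    for n, fragment in enumerate(fragments, start=2):
        gamma_prev, gamma = (n - 1) ** -c, n ** -c
        # f_n = f_{n-1} * gamma_n / (1 - gamma_n) * 1 / gamma_{n-1}
        weight *= gamma / (1.0 - gamma) / gamma_prev
        for t, p in posterior(counts, fragment).items():
            counts[t] += weight * p
    return counts


if __name__ == "__main__":
    toy = [{0: 1.0, 1: 1.0}, {1: 1.0}, {1: 0.5, 2: 1.0}, {0: 1.0, 2: 1.0}]
    counts = online_em_counts(toy, 3)
    total = sum(counts)
    print([cnt / total for cnt in counts])
    print(online_em_frequencies(toy, 3))
```

Normalising the count vector recovers the frequency vector, because the posterior for a fragment depends only on the ratios between the compatible coordinates.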

Results: Convergence for simple gene (UGT3A2)
[Figure: convergence plot; legend: express, RSEM, Cufflinks.]

Results: Convergence for complex gene (Dystrophin)
[Figure: convergence plot; legend: express, RSEM, Cufflinks.]

Results: Full transcriptome (hg19, >76k transcripts)

Results: Performance
Analysis for 1B simulated reads mapped to the human transcriptome (hg19 UCSC).
Hardware: 8-core 2.27 GHz Mac Pro with 24 GB of RAM.
Cufflinks and RSEM were run with 8 threads, express with 1.
[Figure annotations: O(size of transcriptome); O(# of fragments).]

Results: Speed of parameter convergence
10M mapped 76bp x 76bp fragments from ENCODE.

express Usage
Inputs: Target FASTA; Read FASTQ, passed through a Read Mapper.
Outputs:
results.xprs: FPKM, Unique Counts, Total Counts, Estimated Counts
params.xprs: Fragment Lengths, Sequence-specific Bias, Relative Position Bias, Error Substitutions
hits.prob.sam: input alignments with the posterior probability of each multi-mapping

Estimating counts
[Figure: transcripts (blue, green, red) aligned to the genome, and reads a to e aligned with proportional assignment to transcripts. Transcript abundances start at 0.33, 0.33, 0.33 and are updated by alternating E-steps and M-steps; values shown after successive M-steps: 0.27, 0.47, 0.27 and then 0.23, 0.55, 0.23. Each transcript is annotated with Unique Counts, Total Counts and Estimated Counts.]
At every step of the batch EM algorithm: unique ≤ estimated ≤ total.
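The bound on the estimated counts can be seen in a small batch EM sketch. The Python below is an illustration under simplifying assumptions, not the method used by express, RSEM or Cufflinks: all transcripts are treated as having equal effective length, each read is given as a tuple of the transcript indices it maps to, and the toy reads and function names are hypothetical. Every read contributes a fraction between 0 and 1 to each transcript it maps to, and exactly 1 when it maps uniquely, which is why the estimate stays between the unique and total counts.

```python
def unique_and_total(reads, k):
    """Count uniquely mapped reads and all mapped reads per transcript."""
    unique, total = [0] * k, [0] * k
    for targets in reads:
        for t in targets:
            total[t] += 1
        if len(targets) == 1:
            unique[targets[0]] += 1
    return unique, total


def batch_em(reads, k, iterations=50):
    """Batch EM over all reads; returns (abundances, estimated counts)."""
    abundances = [1.0 / k] * k
    estimated = [0.0] * k
    for _ in range(iterations):
        # E-step: split each read across its transcripts in proportion
        # to the current abundances.
        estimated = [0.0] * k
        for targets in reads:
            norm = sum(abundances[t] for t in targets)
            for t in targets:
                estimated[t] += abundances[t] / norm
        # M-step: abundances are the normalised estimated counts.
        abundances = [e / len(reads) for e in estimated]
    return abundances, estimated


if __name__ == "__main__":
    toy_reads = [(0,), (0, 1), (1, 2), (1,), (1, 2), (2,)]
    unique, total = unique_and_total(toy_reads, 3)
    abundances, estimated = batch_em(toy_reads, 3)
    for u, e, t in zip(unique, estimated, total):
        print(u, round(e, 3), t)
```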

Estimation of counts in express
Because of the forgetting weights we use to speed up convergence, the counts at a given step may not yield estimates that lie between the unique and total counts. To ensure that the estimated counts satisfy this constraint (which the optimum satisfies), we employ an alternating projection algorithm: the initial estimate is projected alternately onto the hyperplane (counts sum to the total) and the cube (unique ≤ estimate ≤ total). The algorithm converges (in this case in a finite number of steps) by a theorem of von Neumann.
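A minimal sketch of such an alternating projection in Python follows. It is an illustration only: the function name, the tolerance-based stopping rule and the toy numbers are assumptions, and plain alternating projection as written here is only guaranteed to approach the intersection, so the finite-step behaviour mentioned above is not reproduced. The returned point is some point satisfying both constraints, not necessarily the one closest to the starting estimate.

```python
def project_counts(estimate, unique, total, target_sum, tol=1e-9, max_iter=10_000):
    """Alternate between the hyperplane sum(x) == target_sum and the
    box unique[i] <= x[i] <= total[i] until both hold (within tol)."""
    if not (sum(unique) <= target_sum <= sum(total)):
        raise ValueError("constraints are infeasible")
    k = len(estimate)
    x = list(estimate)
    for _ in range(max_iter):
        # Projection onto the hyperplane: spread the deficit evenly.
        shift = (target_sum - sum(x)) / k
        on_plane = [v + shift for v in x]
        # Projection onto the box: clip each coordinate to its bounds.
        x = [min(max(v, lo), hi) for v, lo, hi in zip(on_plane, unique, total)]
        # If clipping changed (almost) nothing, both constraints hold.
        if max(abs(a - b) for a, b in zip(on_plane, x)) <= tol:
            return x
    raise RuntimeError("alternating projection did not converge")


if __name__ == "__main__":
    # Hypothetical counts for three transcripts.
    print(project_counts([7.0, -1.0, 4.0], unique=[1, 0, 2], total=[5, 4, 6], target_sum=10))
```

The result lies inside the box exactly and on the hyperplane up to k times the tolerance.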

Discussion: Current Illumina analysis pipeline
Sequencing Machine (Illumina HiSeq, ...) -> Image Analysis (Firecrest, ...) -> Base Caller (Bustard, BayesCall, ...) -> Short-Read Aligner (Bowtie, Maq, ...) -> RNA-seq Quantification (Cufflinks, RSEM, ...)
http://www.illumina.com/support/faqs.ilmn

Discussion: Current Illumina analysis pipeline
Stages: Sequencing Machine (Illumina HiSeq, ...) -> Image Analysis (Firecrest, ...) -> Base Caller (Bustard, BayesCall, ...) -> Short-Read Aligner (Bowtie, Maq, ...) -> RNA-seq Quantification (Cufflinks, RSEM, ...)
Files: Image File (.tif) -> Intensity File (.cif, ~2 TB) -> Sequence File (.fastq, ~250 GB) -> Alignment File (.sam, ~1.2 TB) -> Expression File (.fpkm, ~3 MB) -> Archive (SRA, GEO, ...)

Discussion: Too much data to store!
Archive (SRA, GEO, ...)
[Figure credit: Macmillan Publishers Ltd: Nature 458, 719-724 (2009).]

Discussion: Too much data to store!
"As the performance of next-generation sequencing machines continues to improve in terms of speed, cost, accuracy, and length, and as computational processing continues to improve, the need to access the underlying reads decreases." (David Lipman, NCBI)
Source: GB Editorial Team. Closure of the NCBI SRA and implications for the long-term future of genomics data storage. Genome Biology (2011).

Discussion: What we've accomplished.
Stages: Sequencing Machine (Illumina HiSeq, ...) -> Image Analysis (Firecrest, ...) -> Base Caller (Bustard, BayesCall, ...) -> Short-Read Aligner (Bowtie, Maq, ...) -> RNA-seq Quantification (Cufflinks, RSEM, ...)
Files: Image File (.tif) -> Intensity File (.cif, ~2 TB) -> Sequence File (.fastq, ~250 GB) -> Expression File (.fpkm, ~3 MB) -> Archive (SRA, GEO, ...)
The Alignment File (.sam, ~1.2 TB) is removed from the list of stored files.

Discussion: What will be possible...?
Streaming Sequencing Machine (?) -> Short-Read Aligner (Bowtie, Maq, ...) -> RNA-seq Quantification (express) -> Expression File (.fpkm, ~3 MB) -> Archive (SRA, GEO, ...)

Conclusions and Future Work
It is important to deconvolute read counts even for gene-level expression analysis.
The online algorithm implemented in express produces very accurate results with much less resource use than batch approaches and is thus applicable for larger datasets.
express can be used in a streaming sequencing pipeline to produce results without the need for storing intermediate data.
express also estimates the posterior distribution on estimated counts for each isoform, which can be combined with negative-binomial models of biological over-dispersion to achieve more accurate differential expression analysis.
Ambiguous read mapping is a problem in many other applications besides RNA-Seq, including ChIP-Seq, variant detection, and metagenomics. express has been engineered to be a general-purpose tool that can be applied in all of these areas and more.

Software: http://bio.math.berkeley.edu/express

Acknowledgements
Lior Pachter, UC Berkeley
Harold Pimentel, UC Berkeley
Funding for Adam Roberts provided by the NSF Graduate Research Fellowship

References
Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology.
Cappé O and Moulines E (2009). On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society.
Roberts A and Pachter L (2011). express: A Bayesian / Online EM algorithm for isoform-level RNA-seq quantification. In preparation.