express: Streaming read deconvolution and abundance estimation applied to RNA-Seq

1 express: Streaming read deconvolution and abundance estimation applied to RNA-Seq
Adam Roberts (1) and Lior Pachter (1,2)
(1) Department of Computer Science, (2) Departments of Mathematics and Molecular & Cell Biology, University of California at Berkeley
IPAM: Mathematical and Computational Approaches in High-Throughput Genomics, Workshop II: Transcriptomics and Epigenomics
October 28, 2011

2 The unexpected application of cheap sequencing
Despite the obvious possibilities of sequencing many new genomes, high-throughput DNA sequencers have instead been mainly utilized as bean counters for sequence census methods. The vast majority of DNA sequence currently produced is for *-seq experiments.
[Diagram: desired measurement, reduce to sequencing, sequence, solve inverse problem, analyze. Labels on the steps: Creativity, Biology, Computer Science, Mathematics/Statistics, (Computational) Biology.]
Assays include: ChIP-Seq, RNA-Seq, methyl-seq, GRO-Seq, Clip-Seq, BS-Seq, FRT-Seq, TraDI-Seq, Hi-C, SHAPE-Seq, ...

3 Background: Uses of RNA-Seq
RNA-Seq data has three primary uses:
- Discovering genes and isoforms
- Estimating transcript abundances
- Finding differential expression between samples
For this talk, I will focus on the second and assume that the transcriptome has been fully assembled / the genome has been fully annotated.

9 [Figure: RNA-Seq protocol, from RNA molecules to mapped paired-end reads.]
1. Fragmentation of RNA (RNA molecules to RNA fragments)
2. Random priming to make sscDNA (first-strand synthesis)
3. Construction of dscDNA (second-strand synthesis)
4. Size selection (gel cutout between short and long fragments)
5. Sequencing (paired-end read from the sense RNA sequence)
6. Mapping (sense and anti-sense)

11 Background: Estimating gene abundances
[Figure: fragments aligned to a gene on the genome. Gene length: 1500 bp.]
To get an abundance relative to other genes in the same experiment, we must normalize by length. To get an abundance relative to genes in other experiments, we must also normalize by the number of reads in the experiment. A typical measure is Fragments Per Kilobase per Million sequenced (FPKM), which is proportional to the fragment count divided by the length.
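
The two normalizations described above can be written out as a small function. This is a minimal sketch, not code from the talk: the function name `fpkm` and its argument names are chosen here for illustration, and the 50M total is borrowed from the later slide's assumption.

```python
def fpkm(fragments: float, length_bp: float, total_fragments: float) -> float:
    """Fragments per kilobase of target per million sequenced fragments."""
    if length_bp <= 0 or total_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    per_kilobase = fragments / (length_bp / 1e3)   # normalize by target length
    return per_kilobase / (total_fragments / 1e6)  # normalize by sequencing depth


# Hypothetical example: 20 fragments on a 1500 bp gene, 50M fragments sequenced.
example_fpkm = fpkm(20, 1500, 50_000_000)
print(round(example_fpkm, 4))
```

Dividing by length makes genes within one experiment comparable; dividing by depth makes the same gene comparable across experiments.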

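23 Background: Estimating gene abundances
[Figure: fragments aligned to a genomic locus with two isoforms. Isoform A, length 1500 bp; Isoform B, length 1000 bp; exon union, length 1500 bp.]
$\mathrm{FPKM}_{\mathrm{true}} = \mathrm{FPKM}_A + \mathrm{FPKM}_B \propto \frac{f_A}{l_A} + \frac{f_B}{l_B} = \frac{10}{1500} + \frac{10}{1000} = \frac{1}{60}$
$\mathrm{FPKM}_{\mathrm{union}} \propto \frac{f_A + f_B}{l_{A \cup B}} = \frac{20}{1500} = \frac{1}{75}$
$\mathrm{FPKM}_{\mathrm{union}} \le \mathrm{FPKM}_{\mathrm{true}}$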
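
The arithmetic on this slide can be checked with exact fractions. The sketch below uses the slide's lengths and its 10 + 10 fragment counts; the "switched" scenario (all 20 fragments coming from Isoform B) is a hypothetical added here to show why the exon union model cannot see differential splicing. Function and variable names are illustrative only.

```python
from fractions import Fraction


def true_rate(frags, lengths):
    """Sum of per-isoform fragments-per-base, proportional to the true gene FPKM."""
    return sum(Fraction(f, l) for f, l in zip(frags, lengths))


def union_rate(frags, union_length):
    """Fragments-per-base under the exon union model."""
    return Fraction(sum(frags), union_length)


lengths = (1500, 1000)   # Isoform A, Isoform B (bp)
union_length = 1500      # exon union (bp)

balanced = (10, 10)      # counts from the slide
switched = (0, 20)       # hypothetical: same total, all from Isoform B

for name, frags in (("balanced", balanced), ("switched", switched)):
    print(name, true_rate(frags, lengths), union_rate(frags, union_length))
# balanced 1/60 1/75
# switched 1/50 1/75
```

The union value stays at 1/75 in both scenarios even though the true value changes from 1/60 to 1/50.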

25 Background: Implications for differential expression
FPKM is proportional to the number of copies of each transcript in the RNA sample. Assume 50M reads sequenced.
[Table: rows Isoform A, Isoform B, Gene (length N/A), Exon Union; columns Target, Length, FPKM, Fragment Count. The cell values are missing from this transcript.]
What if there is differential splicing?
[Second table with the same rows and columns for the differentially spliced case. The cell values are likewise missing.]

26 Background: How is this dealt with?
- Some continue to use the exon union model (HTSeq, DNAnexus). Fast, but not accurate.
- Adjust the transcript length to remove shared sequence and only look at unique reads (NEUMA). Slower, and ignores useful information provided by ambiguous fragments.
- Minimize an objective function based on coverage (rquant, IsoInfer). Slightly faster, but loses information in individual fragments.
- Use batch EM-based likelihood maximization to probabilistically assign ambiguous fragments (Cufflinks, RSEM, IsoEM). Slower, but can model all relevant information and is highly accurate (with sufficient depth).

28 Background: The batch EM Solution
The method:
- Develop a generative model for the data
- Derive a likelihood function based on this model
- Maximize the likelihood function with EM

29
- $\rho_t$ = relative abundance of transcript $t$
- $F$ = set of sequenced fragments, i.e. read pairs
- $T$ = set of annotated transcripts
- $\kappa_L$ = P(a fragment of length $L$)
- $\varepsilon$ = sequencing error parameters
- $\mu^f_{t,i,L}$ = P(fragment $f$ of length $L$ coming from transcript $t$ with 5' end at $i$)
- $\beta$ = bias parameters (relative position and sequence distributions)
- $\omega^{5'}_{t,i}$, $\omega^{3'}_{t,i}$ = bias weights derived from P(5' end of a fragment at position $i$ in transcript $t$) and P(3' end of a fragment at position $i$ in transcript $t$)

Weight of a potential fragment of length $L$ at a locus:
$\omega_{t,i,L} = \omega^{5'}_{t,i}\,\omega^{3'}_{t,i+L-1}$

Total weight of transcript $t$ for fragments of length $L$:
$\omega_t^L = \sum_{i=0}^{l(t)-L+1} \omega_{t,i,L}$

Total transcript weight (effective length):
$\omega_t = \sum_L \kappa_L\,\omega_t^L$

Probability of generating a fragment of length $L$ from transcript $t$:
$\alpha_t^L = \frac{\rho_t\,\omega_t^L}{\sum_{t'} \rho_{t'}\,\omega_{t'}^L}$

Probability of generating (any) fragment from transcript $t$:
$\alpha_t = \frac{\rho_t\,\omega_t}{\sum_{t'} \rho_{t'}\,\omega_{t'}} \implies \rho_t = \frac{\alpha_t/\omega_t}{\sum_{t'} \alpha_{t'}/\omega_{t'}}$
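
To make the relation between abundances and fragment-generation probabilities concrete, here is a minimal sketch under stated assumptions. The symbol names are assumed for this example: rho for relative abundance, omega for transcript weights, kappa for the fragment length distribution, alpha for the probability of generating a fragment from a transcript. Bias weights default to 1 (no bias), and the sum runs over the l(t) - L + 1 valid 0-based start positions. All function names are hypothetical.

```python
def length_weight(length, L, w5=None, w3=None):
    """Sum of w5[i] * w3[i + L - 1] over all start positions i of a length-L fragment."""
    w5 = w5 if w5 is not None else [1.0] * length
    w3 = w3 if w3 is not None else [1.0] * length
    return sum(w5[i] * w3[i + L - 1] for i in range(length - L + 1))


def effective_length(length, kappa, w5=None, w3=None):
    """Total transcript weight: length weights averaged over the fragment length distribution."""
    return sum(p * length_weight(length, L, w5, w3) for L, p in kappa.items())


def alpha_from_rho(rho, omega):
    """alpha_t = rho_t * omega_t / sum_t' rho_t' * omega_t'."""
    z = sum(r * w for r, w in zip(rho, omega))
    return [r * w / z for r, w in zip(rho, omega)]


def rho_from_alpha(alpha, omega):
    """rho_t = (alpha_t / omega_t) / sum_t' (alpha_t' / omega_t')."""
    z = sum(a / w for a, w in zip(alpha, omega))
    return [a / w / z for a, w in zip(alpha, omega)]


# Hypothetical example: two transcripts of 1500 bp and 1000 bp, all fragments 200 bp.
kappa = {200: 1.0}
omega = [effective_length(1500, kappa), effective_length(1000, kappa)]
alpha = alpha_from_rho([0.5, 0.5], omega)
rho = rho_from_alpha(alpha, omega)
print(omega, alpha, rho)
```

With equal abundances, the longer transcript generates more fragments, and dividing by the effective length recovers the original abundances.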

30 [Figure, panels A to D: sequence-specific bias at the 5' and 3' fragment ends, shown as sequence logos (WebLogo 3.0) and plots. Panel and axis labels: Normalized Count, Density, Expected Density, Ratio (Bias Weight), Offset from 5' Fragment End, Offset from 3' Fragment End.]
Roberts, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011)

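32
$$\mathcal{L}(\rho \mid F, T, \kappa, \varepsilon, \beta) = \prod_{f \in F} \sum_{L} \kappa_L \sum_{t \in T} \alpha_t^L \sum_{i=0}^{l(t)-L+1} \frac{\omega_{t,i,L}}{\omega_t^L}\,\mu^f_{t,i,L}$$
Here $\kappa_L$ is the probability of a fragment of length $L$, $\alpha_t^L$ is the probability of generating a fragment of length $L$ from transcript $t$, $\omega_{t,i,L}/\omega_t^L$ is the normalized weight of position $i$, and $\mu^f_{t,i,L}$ is the probability of fragment $f$ given that origin.
Roberts, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011)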

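44
- $(t, i, L)$ = fragment alignment (transcript, 5' endpoint, mapped length)
- $A(f)$ = set of alignments of fragment $f$

Only the alignments of each fragment contribute to the likelihood, so the sums collapse:
$$\mathcal{L}(\rho \mid F, T, \kappa, \varepsilon, \beta) = \prod_{f \in F} \sum_{L} \kappa_L \sum_{t \in T} \alpha_t^L \sum_{i=0}^{l(t)-L+1} \frac{\omega_{t,i,L}}{\omega_t^L}\,\mu^f_{t,i,L} = \prod_{f \in F} \sum_{(t,i,L) \in A(f)} \kappa_L\,\alpha_t^L\,\frac{\omega_{t,i,L}}{\omega_t^L}\,\mu^f_{t,i,L}$$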
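
The alignment-restricted form is cheap to evaluate, since each fragment contributes only one term per alignment. The following is a sketch under assumptions: the dictionary-based data layout, the function name, and the symbol names (kappa, alpha, omega, mu) are all chosen here for illustration and are not taken from any tool's API.

```python
import math


def log_likelihood(fragments, kappa, alpha, omega, omega_total):
    """Log of the product over fragments of the sum over each fragment's alignments.

    fragments   : list of fragments, each a list of alignments (t, i, L, mu)
    kappa       : {L: P(fragment length L)}
    alpha       : {(t, L): P(length-L fragment comes from transcript t)}
    omega       : {(t, i, L): weight of a length-L fragment starting at i in t}
    omega_total : {(t, L): sum of omega over all start positions in t}
    """
    total = 0.0
    for alignments in fragments:
        p = 0.0
        for t, i, L, mu in alignments:
            p += kappa[L] * alpha[(t, L)] * omega[(t, i, L)] / omega_total[(t, L)] * mu
        total += math.log(p)
    return total


# Hypothetical toy data: one ambiguous fragment and one unique fragment.
kappa = {200: 1.0}
alpha = {("A", 200): 0.6, ("B", 200): 0.4}
omega = {("A", 10, 200): 1.0, ("B", 10, 200): 1.0}
omega_total = {("A", 200): 1301.0, ("B", 200): 801.0}
fragments = [
    [("A", 10, 200, 1.0), ("B", 10, 200, 1.0)],
    [("A", 10, 200, 1.0)],
]
ll = log_likelihood(fragments, kappa, alpha, omega, omega_total)
print(ll)
```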

45 Background: Batch EM Deconvolution
[Figure: three transcripts (blue, green, red) aligned to the genome, and reads a to e aligned with proportional assignment to transcripts. Transcript abundance estimates (values shown include 0.33, 0.27, and 0.23) are updated through alternating E-steps and M-steps.]
Example assumptions:
- All fragments are the same length
- All transcripts are the same length
- Uniform coverage (no sequence-specific or positional bias)
Pachter, L. Models for transcript quantification from RNA-Seq. (2011)
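
Under the three example assumptions above, the E-step and M-step reduce to a few lines. This is a minimal sketch, not the implementation used by any of the tools named in the talk; the function name and the read-to-transcript compatibility lists are hypothetical and do not reproduce the figure.

```python
def batch_em(compat, n_transcripts, n_iter=100):
    """Batch EM for transcript abundances.

    compat: one list per read, giving the indices of transcripts the read maps to.
    Assumes equal fragment lengths, equal transcript lengths, and uniform coverage.
    """
    rho = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: split each read among its transcripts in proportion to rho.
        for targets in compat:
            z = sum(rho[t] for t in targets)
            for t in targets:
                counts[t] += rho[t] / z
        # M-step: abundances are the normalized expected counts.
        rho = [c / len(compat) for c in counts]
    return rho


# Hypothetical reads: indices 0, 1, 2 stand for three transcripts.
reads = [[0, 1, 2], [1, 2], [0, 2], [0], [0, 1]]
estimate = batch_em(reads, 3)
print([round(x, 3) for x in estimate])
```

Every iteration touches every read, which is the cost the later slides set out to avoid.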

50 Background: The batch EM Solution
The method:
- Come up with a generative model for the data
- Make a likelihood function based on this model
- Maximize the likelihood function with EM
The problem:
- The batch EM requires you to iterate over the data hundreds or thousands of times.
- Read alignments are nearing 1 TB for typical experiments (uncompressed).
- Alignments must therefore be stored in memory (expensive) or read from disk (slow).
Solutions:
- Cut-off on mismatches to avoid too much multi-mapping (RSEM).
- Partition reads into blocks based on shared multi-mapping (IsoEM). The partitions become large with more reads and more allowed mismatches.
- Partition reads based on genomic loci (Cufflinks). This misses reads that might map to multiple genomic loci.
Proposal: Use online EM instead of batch! (express)

51 Method: Basics of the express online algorithm
- Based on the Cufflinks likelihood function combined with the online algorithm described in (Cappé & Moulines, 2009).
- Allow reads to multi-map to the transcript sequences with limited restrictions.
- Probabilistically assign each incoming read to transcripts based on the current model parameters (length distribution, abundances, errors, bias, etc.).
- Update the model parameters based on the probabilistic read assignment.

52 Method: Simple Express example
[Animated figure: reads stream from the sequencer (Oxford Nanopore) through the Bowtie aligner to probabilistic fragment assignment against the transcriptome. Each assigned fragment updates the transcript fragment counts, the expected transcript abundances, and the fragment length distribution (counts by fragment length).]

69 Results: Convergence for simple gene (UGT3A2)
[Plot comparing express, RSEM, and Cufflinks.]

70 Speeding up convergence
Cappé and Moulines (Journal of the Royal Statistical Society, 2009) prove a convergence theorem for the online EM algorithm. The forgetting factor is $\gamma_n = \frac{1}{n^c}$, where $\frac{1}{2} < c \le 1$.
The updates require O(k) operations, where k is the number of transcripts.

71 Speeding up convergence
We have shown that instead of updating the frequency vectors $\hat{s}_n$, one can instead update count vectors, so that
$$\hat{c}_{n+1} = \hat{c}_n + f_n\, s(y_{n+1}; \hat{\theta}_n)$$
The forgetting weights are given by
$$f_n = f_{n-1}\,\frac{\gamma_n}{1-\gamma_n}\,\frac{1}{\gamma_{n-1}}$$
The number of coordinates that need to be updated at every step is the number of transcripts the read being considered maps to.
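
The count-vector update can be sketched in a few lines. This is an illustration under several assumptions, not the express implementation: each fragment n carries a mass m_n with m_1 = 1 and m_n = m_{n-1} * gamma_n / ((1 - gamma_n) * gamma_{n-1}) for gamma_n = 1/n^c; all transcripts and fragments have equal length; there is no bias or error model; and the pseudo-count `prior`, the default c = 0.85, and the function name are hypothetical choices.

```python
def online_em(fragment_stream, n_transcripts, c=0.85, prior=1.0):
    """Single-pass online EM over a stream of fragments.

    fragment_stream: iterable of lists of transcript indices each fragment maps to.
    """
    counts = [prior] * n_transcripts  # pseudo-counts so no abundance starts at zero
    mass = 1.0                        # forgetting mass of the first fragment
    for n, targets in enumerate(fragment_stream, start=1):
        if n > 1:
            gamma_prev = (n - 1) ** -c
            gamma = n ** -c
            mass *= gamma / (1.0 - gamma) / gamma_prev
        # Posterior assignment under the current counts; the global normalizer cancels.
        z = sum(counts[t] for t in targets)
        posterior = [counts[t] / z for t in targets]
        # Only the transcripts this fragment maps to are updated.
        for t, p in zip(targets, posterior):
            counts[t] += mass * p
    total = sum(counts)
    return [x / total for x in counts]


# Hypothetical stream: transcript 0 has more unique support than transcript 1.
stream = [[0], [0, 1], [0], [1], [0, 1], [0]] * 50
print([round(x, 3) for x in online_em(stream, 2)])
```

With c = 1 every mass equals 1 and the counts are plain running sums; with c < 1 later fragments weigh more, so early assignments made under poor parameter estimates are gradually forgotten.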

72 Results: Convergence for simple gene (UGT3A2)
[Plot comparing express, RSEM, and Cufflinks.]

75 Results: Convergence for complex gene (Dystrophin)
[Plot comparing express, RSEM, and Cufflinks.]

76 Results: Full transcriptome (hg19, >76k transcripts)

77 Results: Performance
Analysis for 1B simulated reads mapped to the human transcriptome (hg19 UCSC).
Hardware: 8-core 2.27 GHz Mac Pro with 24 GB of RAM.
Cufflinks and RSEM were run with 8 threads, express with 1.
[Plot annotations: O(size of transcriptome), O(# of fragments).]

78 Results: Speed of parameter convergence
10M mapped 76bp x 76bp fragments from ENCODE

79 express Usage
Inputs:
- Target FASTA
- Read FASTQ, aligned to the targets by a read mapper
Outputs:
- results.xprs: FPKM, Unique Counts, Total Counts, Estimated Counts
- params.xprs: Fragment Lengths, Sequence-specific Bias, Relative Position Bias, Error Substitutions
- hits.prob.sam: Input alignments with posterior probability of each multi-mapping

81 Estimating counts
[Figure: the batch EM example (transcripts blue, green, red; reads a to e aligned to the genome with proportional assignment to transcripts), annotated with Unique Counts, Total Counts, and Estimated Counts across the E-steps and M-steps.]
At every step of the batch EM algorithm: unique ≤ estimated ≤ total

82 Estimation of counts in express
Because of the forgetting weights we use to speed up convergence, the counts at every step may not lead to estimates that lie between the unique and total counts. To ensure that the estimated counts satisfy the constraint (which is satisfied by the optimum), we employ an alternating projection algorithm: the algorithm projects the initial estimate alternately onto the hyperplane (counts sum to the total) and the cube (unique ≤ estimate ≤ total). The algorithm converges (in this case in a finite number of steps) by a theorem of von Neumann.
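
The alternating projection can be sketched as follows. This is a generic version written for illustration, assuming the feasible set is non-empty; it converges geometrically rather than reproducing the finite-step behaviour mentioned above, and the function name, tolerance, and example numbers are hypothetical.

```python
def project_counts(estimate, unique, total, n_fragments, max_iter=1000, tol=1e-9):
    """Alternate between the hyperplane sum(x) == n_fragments and the box unique <= x <= total."""
    x = list(estimate)
    for _ in range(max_iter):
        # Projection onto the hyperplane: shift every coordinate by the same amount.
        shift = (sum(x) - n_fragments) / len(x)
        x = [v - shift for v in x]
        # Projection onto the box: clip each coordinate to its own bounds.
        clipped = [min(max(v, lo), hi) for v, lo, hi in zip(x, unique, total)]
        if max(abs(a - b) for a, b in zip(x, clipped)) < tol:
            return clipped
        x = clipped
    return x


# Hypothetical example: three transcripts, 10 fragments in all.
unique = [2.0, 1.0, 0.0]
total = [6.0, 5.0, 4.0]
raw = [7.0, 2.5, 0.5]  # sums to 10 but exceeds the total count of transcript 0
fixed = project_counts(raw, unique, total, 10.0)
print([round(v, 6) for v in fixed])
```

In this example the excess count on the first transcript is redistributed evenly over the other two, which still respects their bounds.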

86 Discussion: Current Illumina analysis pipeline
Each stage and the file it produces:
- Sequencing Machine (Illumina HiSeq, ...): Image File (.tif)
- Image Analysis (Firecrest, ...): Intensity File (.cif, ~2 TB)
- Base Caller (Bustard, BayesCall, ...): Sequence File (.fastq, ~250 GB)
- Short-Read Aligner (Bowtie, Maq, ...): Alignment File (.sam, ~1.2 TB)
- RNA-seq Quantification (Cufflinks, RSEM, ...): Expression File (.fpkm, ~3 MB)
- Archive (SRA, GEO, ...)

87 Discussion: Too much data to store!
[Figure relating to the archives (SRA, GEO, ...). Credit: Macmillan Publishers Ltd: Nature 458 (2009).]

88 Discussion: Too much data to store!
"As the performance of next-generation sequencing machines continues to improve in terms of speed, cost, accuracy, and length, and as computational processing continues to improve, the need to access the underlying reads decreases." -David Lipman, NCBI
GB Editorial Team. Closure of the NCBI SRA and implications for the long-term future of genomics data storage. Genome Biology (2011)

90 Discussion: What we've accomplished.
The same pipeline, with the alignment file (.sam, ~1.2 TB) no longer among the stored files:
- Sequencing Machine (Illumina HiSeq, ...): Image File (.tif)
- Image Analysis (Firecrest, ...): Intensity File (.cif, ~2 TB)
- Base Caller (Bustard, BayesCall, ...): Sequence File (.fastq, ~250 GB)
- Short-Read Aligner (Bowtie, Maq, ...)
- RNA-seq Quantification (Cufflinks, RSEM, ...): Expression File (.fpkm, ~3 MB)
- Archive (SRA, GEO, ...)

91 Discussion: What will be possible...?
- Streaming Sequencing Machine (?)
- Short-Read Aligner (Bowtie, Maq, ...)
- RNA-seq Quantification (Express): Expression File (.fpkm, ~3 MB)
- Archive (SRA, GEO, ...)

92 Conclusions and Future Work
- It is important to deconvolute read counts even for gene-level expression analysis.
- The online algorithm implemented in express produces very accurate results with much less resource use than batch approaches and is thus applicable to larger datasets.
- express can be used in a streaming sequencing pipeline to produce results without the need for storing intermediate data.
- express also estimates the posterior distribution on estimated counts for each isoform, which can be combined with negative-binomial models of biological over-dispersion to achieve more accurate differential expression analysis.
- Ambiguous read mapping is a problem in many other applications besides RNA-Seq, including ChIP-Seq, variant detection, and metagenomics. express has been engineered to be a general-purpose tool that can be applied in all of these areas and more.

93 Software:
Acknowledgements
- Lior Pachter, UC Berkeley
- Harold Pimentel, UC Berkeley
- Funding for Adam Roberts provided by the NSF Graduate Research Fellowship
References
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
- Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology.
- Cappé O and Moulines E (2009). On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society.
- Roberts A and Pachter L (2011). express: A Bayesian / Online EM algorithm for isoform-level RNA-seq quantification. In preparation.


More information

Differential Expression Analysis Techniques for Single-Cell RNA-seq Experiments

Differential Expression Analysis Techniques for Single-Cell RNA-seq Experiments Differential Expression Analysis Techniques for Single-Cell RNA-seq Experiments for the Computational Biology Doctoral Seminar (CMPBIO 293), organized by N. Yosef & T. Ashuach, Spring 2018, UC Berkeley

More information

Inferring Protein-Signaling Networks II

Inferring Protein-Signaling Networks II Inferring Protein-Signaling Networks II Lectures 15 Nov 16, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022

More information

Lecture 15: Programming Example: TASEP

Lecture 15: Programming Example: TASEP Carl Kingsford, 0-0, Fall 0 Lecture : Programming Example: TASEP The goal for this lecture is to implement a reasonably large program from scratch. The task we will program is to simulate ribosomes moving

More information

Predictive Genome Analysis Using Partial DNA Sequencing Data

Predictive Genome Analysis Using Partial DNA Sequencing Data Predictive Genome Analysis Using Partial DNA Sequencing Data Nauman Ahmed, Koen Bertels and Zaid Al-Ars Computer Engineering Lab, Delft University of Technology, Delft, The Netherlands {n.ahmed, k.l.m.bertels,

More information