express: Streaming read deconvolution and abundance estimation applied to RNA-Seq
1 express: Streaming read deconvolution and abundance estimation applied to RNA-Seq
Adam Roberts (1) and Lior Pachter (1,2)
(1) Department of Computer Science, (2) Departments of Mathematics and Molecular & Cell Biology, University of California at Berkeley
IPAM: Mathematical and Computational Approaches in High-Throughput Genomics, Workshop II: Transcriptomics and Epigenomics, October 28, 2011
2 The unexpected application of cheap sequencing
Despite the obvious possibilities of sequencing many new genomes, high-throughput DNA sequencers have instead been mainly utilized as bean counters for sequence census methods. The vast majority of DNA sequence currently produced is for *-seq experiments:
- Workflow: desired measurement, reduce to sequencing, sequence, solve inverse problem, analyze
- Disciplines involved: creativity, biology, computer science, mathematics/statistics, (computational) biology
Assays include: ChIP-Seq, RNA-Seq, methyl-seq, GRO-Seq, Clip-Seq, BS-Seq, FRT-Seq, TraDI-Seq, Hi-C, SHAPE-Seq...
3 Background: Uses of RNA-Seq
RNA-Seq data has three primary uses:
- Discovering genes and isoforms
- Estimating transcript abundances
- Finding differential expression between samples
For this talk, I will focus on the second and assume that the transcriptome has been fully assembled / the genome has been fully annotated.
4 RNA molecules
1. fragmentation of RNA, giving RNA fragments
2. random priming to make ss-cDNA (first-strand synthesis)
3. construction of ds-cDNA (second-strand synthesis)
4. size selection (gel cutout between short and long fragments)
5. sequencing, giving a paired-end read
6. mapping (sense or anti-sense relative to the RNA sequence)
10 Background: Estimating gene abundances
[Figure: fragments aligned to the genome over a gene of length 1500 bp]
- To get an abundance relative to other genes in the same experiment, we must normalize by length.
- To get an abundance relative to genes in other experiments, we must also normalize by the number of reads in the experiment.
- A typical measure is Fragments Per Kilobase per Million sequenced (FPKM), which is proportional to the fragment count divided by the length.
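The two normalizations described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from express or Cufflinks; the function name `fpkm` and the example numbers (30 fragments, 50 million sequenced) are assumptions chosen only for the arithmetic.

```python
def fpkm(fragments: float, length_bp: float, total_fragments: float) -> float:
    """Fragments Per Kilobase of target per Million sequenced fragments."""
    if length_bp <= 0 or total_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    per_kilobase = fragments / (length_bp / 1e3)   # normalize by target length
    return per_kilobase / (total_fragments / 1e6)  # normalize by sequencing depth


# Hypothetical numbers: 30 fragments on a 1500 bp gene, 50 million fragments sequenced.
example_fpkm = fpkm(30, 1500, 50e6)
print(example_fpkm)
```

Dividing by length makes genes within one experiment comparable; dividing by depth makes experiments comparable with each other.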
12 Background: Estimating gene abundances
[Figure: fragments aligned to the genome over a gene with two isoforms. Isoform A length: 1500 bp. Isoform B length: 1000 bp. Exon union length: 1500 bp.]
$$\mathrm{FPKM}_{\text{true}} = \mathrm{FPKM}_A + \mathrm{FPKM}_B \propto \frac{f_A}{l_A} + \frac{f_B}{l_B} = \frac{10}{1500} + \frac{10}{1000} = \frac{1}{60}$$
$$\mathrm{FPKM}_{\text{union}} \propto \frac{f_A + f_B}{l_{A \cup B}} = \frac{20}{1500} = \frac{1}{75}$$
$$\mathrm{FPKM}_{\text{union}} \le \mathrm{FPKM}_{\text{true}}$$
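The arithmetic on this slide can be checked with exact fractions. The sketch below assumes the fragment counts implied by the slide's totals (10 fragments per isoform, 20 in all); the helper names `true_rate` and `union_rate` are illustrative only.

```python
from fractions import Fraction


def true_rate(counts, lengths):
    """Sum of per-isoform fragment densities f_t / l_t."""
    return sum(Fraction(f, l) for f, l in zip(counts, lengths))


def union_rate(counts, union_length):
    """Exon-union density: all fragments divided by the union length."""
    return Fraction(sum(counts), union_length)


lengths = [1500, 1000]   # isoform A, isoform B (bp)
union_length = 1500      # exon union (bp)
counts = [10, 10]        # f_A, f_B as implied by the slide's totals

rate_true = true_rate(counts, lengths)
rate_union = union_rate(counts, union_length)
print(rate_true, rate_union)

# The union length is at least every isoform length, so the union density
# can never exceed the summed per-isoform densities, whatever the counts.
for f_a in range(0, 21):
    c = [f_a, 20 - f_a]
    assert union_rate(c, union_length) <= true_rate(c, lengths)
```

The gap between the two values grows as more fragments come from the shorter isoform, which is why differential splicing can look like differential expression under the exon union model.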
24 Background: Implications for differential expression
FPKM is proportional to the number of copies of each transcript in the RNA sample. Assume 50M reads sequenced.
[Table: columns Target, Length, FPKM, Fragment Count; rows Isoform A, Isoform B, Gene (length N/A), Exon Union]
What if there is differential splicing?
[Table: same columns and rows, for the differential splicing case]
26 Background: How is this dealt with?
- Some continue to use the exon union model (HTSeq, DNAnexus). Fast, but not accurate.
- Adjust the transcript length to remove shared sequence and only look at unique reads (NEUMA). Slower, and ignores useful information provided by ambiguous fragments.
- Minimize an objective function based on coverage (rquant, IsoInfer). Slightly faster, but loses information in individual fragments.
- Use batch EM-based likelihood maximization to probabilistically assign ambiguous fragments (Cufflinks, RSEM, IsoEM). Slower, but can model all relevant information and is highly accurate (with sufficient depth).
28 Background: The batch EM Solution
The method:
- Develop a generative model for the data
- Derive a likelihood function based on this model
- Maximize the likelihood function with EM
29 Model notation
- $F$ = set of sequenced fragments, i.e. read pairs
- $T$ = set of annotated transcripts
- $\rho_t$ = relative abundance of transcript $t$
- $\lambda_L$ = P(a fragment of length $L$)
- $\varepsilon$ = sequencing error parameters
- $\beta$ = bias parameters (relative position and sequence distributions)
- $\mu^f_{t,i,L}$ = P(fragment $f$ of length $L$ coming from transcript $t$ with 5' end at $i$)
- $\omega^{5'}_{t,i}$, $\omega^{3'}_{t,i}$ = weight of a locus: the bias weight for a fragment 5' or 3' end at position $i$ in transcript $t$, given as a ratio of two probabilities of a fragment end at that position
Weight of a potential fragment of length $L$ at locus $i$:
$$\omega_{t,i,L} = \omega^{5'}_{t,i}\,\omega^{3'}_{t,i+L-1}$$
$$\omega_t^L = \sum_{i=0}^{l(t)-L+1} \omega_{t,i,L}$$
Total transcript weight (effective length):
$$\omega_t = \sum_L \lambda_L\,\omega_t^L$$
Probability of generating a fragment of length $L$ from transcript $t$:
$$\alpha_t^L = \frac{\rho_t\,\omega_t^L}{\sum_{t'} \rho_{t'}\,\omega_{t'}^L}$$
Probability of generating (any) fragment from transcript $t$:
$$\alpha_t = \frac{\rho_t\,\omega_t}{\sum_{t'} \rho_{t'}\,\omega_{t'}} \;\Longrightarrow\; \rho_t = \frac{\alpha_t/\omega_t}{\sum_{t'} \alpha_{t'}/\omega_{t'}}$$
[Figure: sequence logos (WebLogo 3.0) around the 5' and 3' fragment ends, panels A to D; axes include Normalized Count, Density, Expected Density, and Ratio (Bias Weight) against Offset from 5' Fragment End and Offset from 3' Fragment End]
Roberts, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011)
32 Likelihood function
$$\mathcal{L}(\rho \mid F, T, \lambda, \varepsilon, \beta) = \prod_{f \in F} \sum_{L} \lambda_L \sum_{t \in T} \alpha_t^L \sum_{i=0}^{l(t)-L+1} \frac{\omega_{t,i,L}}{\omega_t^L}\, \mu^f_{t,i,L}$$
Here $\lambda_L$ is the fragment length term and $\alpha_t^L = \rho_t\,\omega_t^L / \sum_{t'} \rho_{t'}\,\omega_{t'}^L$.
Roberts, et al. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology (2011)
43 Likelihood restricted to alignments
- $(t, i, L)$ = fragment alignment (transcript, 5' endpoint, mapped length)
- $A(f)$ = set of alignments of fragment $f$
$$\mathcal{L}(\rho \mid F, T, \lambda, \varepsilon, \beta) \approx \prod_{f \in F} \sum_{(t,i,L) \in A(f)} \lambda_L\, \alpha_t^L\, \frac{\omega_{t,i,L}}{\omega_t^L}\, \mu^f_{t,i,L}$$
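The relationship between relative abundance, total transcript weight (effective length), and the probability of generating a fragment from a transcript can be sketched numerically. The code below is an illustration under stated assumptions, not the Cufflinks or express implementation: per-position bias weights are all set to 1 (no bias), the fragment length distribution is a hypothetical two-point distribution, 5' start positions are 0-based and restricted to those whose 3' end stays inside the transcript, and all function names are invented for this example.

```python
def effective_length(w5, w3, frag_len_pmf):
    """Total transcript weight: sum over L of P(L) * sum_i w5[i] * w3[i + L - 1]."""
    n = len(w5)
    total = 0.0
    for frag_len, prob in frag_len_pmf.items():
        # Only 5' positions whose 3' end (i + L - 1) lies inside the transcript.
        weight_for_len = sum(w5[i] * w3[i + frag_len - 1]
                             for i in range(0, n - frag_len + 1))
        total += prob * weight_for_len
    return total


def fragment_probs(rho, omega):
    """P(fragment comes from t) = rho_t * omega_t / sum_t' rho_t' * omega_t'."""
    z = sum(r * w for r, w in zip(rho, omega))
    return [r * w / z for r, w in zip(rho, omega)]


def abundances(alpha, omega):
    """Invert fragment_probs: rho_t = (alpha_t / omega_t) / sum_t' (alpha_t' / omega_t')."""
    z = sum(a / w for a, w in zip(alpha, omega))
    return [(a / w) / z for a, w in zip(alpha, omega)]


# Hypothetical setup: two transcripts, uniform bias, fragments of 200 or 300 bp.
lengths = [1500, 1000]
frag_len_pmf = {200: 0.5, 300: 0.5}
omega = [effective_length([1.0] * n, [1.0] * n, frag_len_pmf) for n in lengths]

rho = [0.5, 0.5]
alpha = fragment_probs(rho, omega)
rho_back = abundances(alpha, omega)
print(omega, alpha, rho_back)
```

With equal abundances, the longer transcript still generates more fragments because its effective length is larger; dividing by the effective length undoes that.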
45 Background: Batch EM Deconvolution
[Figure: three transcripts (blue, green, red) aligned to the genome; aligned reads a to e with proportional assignment to transcripts; transcript abundances (values such as 0.33, 0.27, 0.23) updated by alternating E-steps and M-steps]
Example assumptions:
- All fragments are the same length
- All transcripts are the same length
- Uniform coverage (no sequence-specific or positional bias)
Pachter, L. Models for transcript quantification from RNA-Seq. (2011)
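Under the example assumptions on this slide (equal fragment lengths, equal transcript lengths, uniform coverage), batch EM reduces to a very short loop. The sketch below is illustrative only: the read-to-transcript compatibility lists are hypothetical and are not the ones drawn in the figure, and `batch_em` is an invented name.

```python
def batch_em(compat, n_transcripts, n_iter=100):
    """Batch EM for transcript abundances.

    compat[r] lists the transcripts that read r aligns to. Every iteration
    passes over all reads, which is what makes batch EM expensive.
    """
    rho = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: assign each read proportionally to current abundances.
        for targets in compat:
            z = sum(rho[t] for t in targets)
            for t in targets:
                counts[t] += rho[t] / z
        # M-step: abundances are the normalized expected counts.
        total = sum(counts)
        rho = [c / total for c in counts]
    return rho


# Hypothetical reads a..e over transcripts 0 (blue), 1 (green), 2 (red).
reads = [[0, 1, 2], [1, 2], [0, 2], [0], [2]]
estimate = batch_em(reads, n_transcripts=3)
print(estimate)
```

Each E-step and M-step pair touches every read, so the cost per iteration grows with the number of alignments.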
49 Background: The batch EM Solution
The method:
- Come up with a generative model for the data
- Make a likelihood function based on this model
- Maximize the likelihood function with EM
The problem:
- The batch EM requires you to iterate over the data hundreds or thousands of times.
- Read alignments are nearing 1 TB for typical experiments (uncompressed).
- Alignments must therefore be stored in memory (expensive) or read from disk (slow).
Solutions:
- Cut off on mismatches to avoid too much multi-mapping (RSEM).
- Partition reads into blocks based on shared multi-mapping (IsoEM). The partitions become large with more reads and more allowed mismatches.
- Partition reads based on genomic loci (Cufflinks). Misses reads that might map to multiple genomic loci.
Proposal: Use online EM instead of batch! (express)
51 Method: Basics of the express online algorithm
- Based on the Cufflinks likelihood function combined with the online algorithm described in (Cappé & Moulines, 2009).
- Allow reads to multi-map to the transcript sequences with limited restrictions.
- Probabilistically assign an incoming read to transcripts based on current model parameters (length distribution, abundances, errors, bias, etc.).
- Update model parameters based on the probabilistic read assignment.
52 Method: Simple Express example
[Animation: reads stream from the sequencer (Oxford Nanopore) through the Bowtie aligner to probabilistic fragment assignment over the transcriptome; each assigned fragment updates the transcript fragment counts, the expected transcript abundances, and the fragment length distribution]
69 Results: Convergence for simple gene (UGT3A2)
[Figure: convergence of express compared with RSEM and Cufflinks]
70 Speeding up convergence
Cappé and Moulines (Journal of the Royal Statistical Society, 2009) prove a convergence theorem for the online EM algorithm. The forgetting factor is $\gamma_n = 1/n^c$, where $1/2 < c \le 1$.
The updates require O(k) operations, where k is the number of transcripts.
71 Speeding up convergence
We have shown that instead of updating the frequency vectors $\hat{s}_n$ one can instead update count vectors, so that
$$\hat{c}_{n+1} = \hat{c}_n + f_n\, s(y_{n+1}; \hat{\theta}_n)$$
The forgetting weights are given by
$$f_n = f_{n-1}\,\frac{\gamma_n}{1-\gamma_n}\,\frac{1}{\gamma_{n-1}}$$
with forgetting factor $\gamma_n = 1/n^c$. The number of coordinates that need to be updated at every step is the number of transcripts the read being considered maps to.
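The count-vector update with forgetting weights can be sketched as follows. This is a simplified illustration and not the express source code: it assumes equal transcript and fragment lengths and no bias, so that abundances are just normalized counts; it takes the forgetting factor to be 1/n^c with a weight recursion f_n = f_{n-1} * g_n / (1 - g_n) / g_{n-1} starting from f_1 = 1; the exponent value 0.85, the prior pseudo-count, and the function names are all assumptions made for the example.

```python
def forgetting_weights(n_reads, c=0.85):
    """Weights f_1..f_n with g_n = 1 / n**c and f_n = f_{n-1} * g_n / (1 - g_n) / g_{n-1}."""
    weights = [1.0]
    for n in range(2, n_reads + 1):
        g_n, g_prev = n ** -c, (n - 1) ** -c
        weights.append(weights[-1] * g_n / (1.0 - g_n) / g_prev)
    return weights


def online_em(reads, n_transcripts, c=0.85, prior=1.0):
    """Single pass over the reads; each read touches only the transcripts it maps to."""
    counts = [prior] * n_transcripts  # pseudo-counts so early reads can be assigned
    for f, targets in zip(forgetting_weights(len(reads), c), reads):
        z = sum(counts[t] for t in targets)
        posterior = [counts[t] / z for t in targets]  # assignment under current estimate
        for t, p in zip(targets, posterior):
            counts[t] += f * p                        # weighted count update
    total = sum(counts)
    return [x / total for x in counts]


# Hypothetical stream: ambiguous reads alternating with reads unique to transcript 0.
stream = [[0, 1], [0]] * 200
estimate = online_em(stream, n_transcripts=2)
print(estimate)
```

With c = 1 every weight equals 1 and the update is plain counting; with c below 1 later reads receive larger weights, so early assignments made under poor parameter estimates are gradually forgotten.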
72 Results: Convergence for simple gene (UGT3A2)
[Figure: convergence of express compared with RSEM and Cufflinks]
74 Results: Convergence for complex gene (Dystrophin)
[Figure: convergence of express compared with RSEM and Cufflinks]
76 Results: Full transcriptome (hg19, >76k transcripts)
77 Results: Performance
- Analysis for 1B simulated reads mapped to the human transcriptome (hg19 UCSC)
- 8-core 2.27 GHz Mac Pro with 24 GB of RAM
- Cufflinks and RSEM were run with 8 threads, express with 1
[Figure labels: O(size of transcriptome), O(# of fragments)]
78 Results: Speed of parameter convergence
10M mapped 76bp x 76bp fragments from ENCODE
79 express Usage
Inputs:
- Target FASTA
- Read FASTQ, aligned by a read mapper
Outputs:
- results.xprs: FPKM, Unique Counts, Total Counts, Estimated Counts
- params.xprs: Fragment Lengths, Sequence-specific Bias, Relative Position Bias, Error Substitutions
- hits.prob.sam: input alignments with posterior probability of each multi-mapping
81 Estimating counts
[Figure: the batch EM deconvolution example (transcripts blue, green, red; reads a to e; alternating E-steps and M-steps) annotated with Unique Counts, Total Counts, and Estimated Counts]
At every step of the batch EM algorithm: unique ≤ estimated ≤ total
82 Estimation of counts in express
Because of the forgetting weights we use to speed up convergence, the counts at a given step may not lead to estimates that lie between the unique and total counts. To ensure that the estimated counts satisfy this constraint (which is satisfied by the optimum), we employ an alternating projection algorithm:
- The algorithm projects the initial estimate alternately onto the hyperplane (counts sum to the total) and the cube (unique ≤ estimate ≤ total).
- The algorithm converges (in this case in a finite number of steps) by a theorem of von Neumann.
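The alternating projection described above can be sketched generically. This is an illustration of the technique, not the express implementation: it assumes the constraints are feasible (the unique counts sum to at most the number of fragments, and the total counts to at least that number), stops at a numerical tolerance rather than proving finite termination, and uses hypothetical names and numbers throughout.

```python
def project_counts(estimate, unique, total, n_fragments, tol=1e-9, max_iter=10_000):
    """Alternate between sum(x) == n_fragments and unique[t] <= x[t] <= total[t]."""
    x = list(estimate)
    k = len(x)
    for _ in range(max_iter):
        # Orthogonal projection onto the hyperplane: spread the deficit evenly.
        shift = (n_fragments - sum(x)) / k
        x = [v + shift for v in x]
        # Orthogonal projection onto the box: clip each coordinate.
        clipped = [min(max(v, lo), hi) for v, lo, hi in zip(x, unique, total)]
        if max(abs(a - b) for a, b in zip(x, clipped)) <= tol:
            return clipped  # inside the box and within tol of the hyperplane
        x = clipped
    return x


# Hypothetical counts for three transcripts and 8 fragments in all.
unique_counts = [2.0, 0.0, 1.0]
total_counts = [6.0, 5.0, 4.0]
raw_estimate = [7.0, 0.5, 0.5]   # violates the upper bound for transcript 0

projected = project_counts(raw_estimate, unique_counts, total_counts, n_fragments=8.0)
print(projected)
```

Both constraint sets are convex, so the iterates move steadily toward a point that satisfies them together, provided such a point exists.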
83 Discussion: Current Illumina analysis pipeline
1. Sequencing Machine (Illumina HiSeq, ...), producing an Image File (.tif)
2. Image Analysis (Firecrest, ...), producing an Intensity File (.cif, ~2 TB)
3. Base Caller (Bustard, BayesCall, ...), producing a Sequence File (.fastq, ~250 GB)
4. Short-Read Aligner (Bowtie, Maq, ...), producing an Alignment File (.sam, ~1.2 TB)
5. RNA-seq Quantification (Cufflinks, RSEM, ...), producing an Expression File (.fpkm, ~3 MB)
6. Archive (SRA, GEO, ...)
87 Discussion: Too much data to store!
[Figure: Archive (SRA, GEO, ...). Credit: Macmillan Publishers Ltd: Nature 458, (2009)]
"As the performance of next-generation sequencing machines continues to improve in terms of speed, cost, accuracy, and length, and as computational processing continues to improve, the need to access the underlying reads decreases." (David Lipman, NCBI)
GB Editorial Team. Closure of the NCBI SRA and implications for the long-term future of genomics data storage. Genome Biology (2011)
89 Discussion: What we've accomplished
1. Sequencing Machine (Illumina HiSeq, ...), producing an Image File (.tif)
2. Image Analysis (Firecrest, ...), producing an Intensity File (.cif, ~2 TB)
3. Base Caller (Bustard, BayesCall, ...), producing a Sequence File (.fastq, ~250 GB)
4. Short-Read Aligner (Bowtie, Maq, ...); the Alignment File (.sam, ~1.2 TB) drops out of the pipeline
5. RNA-seq Quantification (Cufflinks, RSEM, ...), producing an Expression File (.fpkm, ~3 MB)
6. Archive (SRA, GEO, ...)
91 Discussion: What will be possible...?
1. Streaming Sequencing Machine (?)
2. Short-Read Aligner (Bowtie, Maq, ...)
3. RNA-seq Quantification (express), producing an Expression File (.fpkm, ~3 MB)
4. Archive (SRA, GEO, ...)
92 Conclusions and Future Work
- It is important to deconvolute read counts even for gene-level expression analysis.
- The online algorithm implemented in express produces very accurate results with much less resource use than batch approaches and is thus applicable to larger datasets.
- express can be used in a streaming sequencing pipeline to produce results without the need for storing intermediate data.
- express also estimates the posterior distribution on estimated counts for each isoform, which can be combined with negative-binomial models of biological over-dispersion to achieve more accurate differential expression analysis.
- Ambiguous read mapping is a problem in many other applications besides RNA-Seq, including ChIP-Seq, variant detection, and metagenomics. express has been engineered to be a general-purpose tool that can be applied in all of these areas and more.
93 Software:
Acknowledgements
- Lior Pachter, UC Berkeley
- Harold Pimentel, UC Berkeley
- Funding for Adam Roberts provided by the NSF Graduate Research Fellowship
References
- Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
- Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology.
- Cappé O and Moulines E (2009). On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society.
- Roberts A and Pachter L (2011). express: A Bayesian / Online EM algorithm for isoform-level RNA-seq quantification. In preparation.
Bias in RNA sequencing and what to do about it Walter L. (Larry) Ruzzo Computer Science and Engineering Genome Sciences University of Washington Fred Hutchinson Cancer Research Center Seattle, WA, USA
More informationSupplemental data. Pommerrenig et al. (2011). Plant Cell /tpc
Supplemental Figure 1. Prediction of phloem-specific MTK1 expression in Arabidopsis shoots and roots. The images and the corresponding numbers showing absolute (A) or relative expression levels (B) of
More informationRNA- seq read mapping
RNA- seq read mapping Pär Engström SciLifeLab RNA- seq workshop October 216 IniDal steps in RNA- seq data processing 1. Quality checks on reads 2. Trim 3' adapters (opdonal (for species with a reference
More informationUnit-free and robust detection of differential expression from RNA-Seq data
Unit-free and robust detection of differential expression from RNA-Seq data arxiv:405.4538v [stat.me] 8 May 204 Hui Jiang,2,* Department of Biostatistics, University of Michigan 2 Center for Computational
More informationDifferential analyses for RNA-seq: transcript-level estimates improve gene-level inferences Supplementary Material
Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences Supplementary Material Charlotte Soneson, Michael I. Love, Mark D. Robinson Contents 1 Simulation details, sim2
More informationCrick s early Hypothesis Revisited
Crick s early Hypothesis Revisited Or The Existence of a Universal Coding Frame Ryan Rossi, Jean-Louis Lassez and Axel Bernal UPenn Center for Bioinformatics BIOINFORMATICS The application of computer
More informationHigh throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm
Electronic Supplementary Material (ESI) for Nanoscale. This journal is The Royal Society of Chemistry 2018 High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationHigh-throughput sequencing: Alignment and related topic
High-throughput sequencing: Alignment and related topic Simon Anders EMBL Heidelberg HTS Platforms E s ta b lis h e d p la tfo rm s Illu m in a H is e q, A B I S O L id, R o c h e 4 5 4 N e w c o m e rs
More informationSUPPORTING INFORMATION FOR. SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA
SUPPORTING INFORMATION FOR SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA Aik T. Ooi, Cliff I. Stains, Indraneel Ghosh *, David J. Segal
More informationGenome 541! Unit 4, lecture 3! Genomics assays
Genome 541! Unit 4, lecture 3! Genomics assays Much easier to follow with slides. Good pace.! Having the slides was really helpful clearer to read and easier to follow the trajectory of the lecture.!!
More informationGBS Bioinformatics Pipeline(s) Overview
GBS Bioinformatics Pipeline(s) Overview Getting from sequence files to genotypes. Pipeline Coding: Ed Buckler Jeff Glaubitz James Harriman Presentation: Terry Casstevens With supporting information from
More informationGenome Assembly. Sequencing Output. High Throughput Sequencing
Genome High Throughput Sequencing Sequencing Output Example applications: Sequencing a genome (DNA) Sequencing a transcriptome and gene expression studies (RNA) ChIP (chromatin immunoprecipitation) Example
More informationStatistical Inferences for Isoform Expression in RNA-Seq
Statistical Inferences for Isoform Expression in RNA-Seq Hui Jiang and Wing Hung Wong February 25, 2009 Abstract The development of RNA sequencing (RNA-Seq) makes it possible for us to measure transcription
More informationCentrifuge: rapid and sensitive classification of metagenomic sequences
Centrifuge: rapid and sensitive classification of metagenomic sequences Daehwan Kim, Li Song, Florian P. Breitwieser, and Steven L. Salzberg Supplementary Material Supplementary Table 1 Supplementary Note
More informationMulti-Assembly Problems for RNA Transcripts
Multi-Assembly Problems for RNA Transcripts Alexandru Tomescu Department of Computer Science University of Helsinki Joint work with Veli Mäkinen, Anna Kuosmanen, Romeo Rizzi, Travis Gagie, Alex Popa CiE
More informationIntroduction to Bioinformatics
CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics
More informationRegulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)
Regulatory Sequence Analysis Sequence models (Bernoulli and Markov models) 1 Why do we need random models? Any pattern discovery relies on an underlying model to estimate the random expectation. This model
More informationSSR ( ) Vol. 48 No ( Microsatellite marker) ( Simple sequence repeat,ssr),
48 3 () Vol. 48 No. 3 2009 5 Journal of Xiamen University (Nat ural Science) May 2009 SSR,,,, 3 (, 361005) : SSR. 21 516,410. 60 %96. 7 %. (),(Between2groups linkage method),.,, 11 (),. 12,. (, ), : 0.
More informationTowards More Effective Formulations of the Genome Assembly Problem
Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department of Computer Science University of Helsinki, Finland DACS June 26, 2015 1 / 25 2 / 25 CENTRAL DOGMA OF BIOLOGY
More informationg A n(a, g) n(a, ḡ) = n(a) n(a, g) n(a) B n(b, g) n(a, ḡ) = n(b) n(b, g) n(b) g A,B A, B 2 RNA-seq (D) RNA mrna [3] RNA 2. 2 NGS 2 A, B NGS n(
,a) RNA-seq RNA-seq Cuffdiff, edger, DESeq Sese Jun,a) Abstract: Frequently used biological experiment technique for observing comprehensive gene expression has been changed from microarray using cdna
More informationThe Developmental Transcriptome of the Mosquito Aedes aegypti, an invasive species and major arbovirus vector.
The Developmental Transcriptome of the Mosquito Aedes aegypti, an invasive species and major arbovirus vector. Omar S. Akbari*, Igor Antoshechkin*, Henry Amrhein, Brian Williams, Race Diloreto, Jeremy
More informationEBSeq: An R package for differential expression analysis using RNA-seq data
EBSeq: An R package for differential expression analysis using RNA-seq data Ning Leng, John Dawson, and Christina Kendziorski October 14, 2013 Contents 1 Introduction 2 2 Citing this software 2 3 The Model
More informationSingle Cell Sequencing
Single Cell Sequencing Fundamental unit of life Autonomous and unique Interactive Dynamic - change over time Evolution occurs on the cellular level Robert Hooke s drawing of cork cells, 1665 Type Prokaryotes
More informationSupplemental Information
Molecular Cell, Volume 52 Supplemental Information The Translational Landscape of the Mammalian Cell Cycle Craig R. Stumpf, Melissa V. Moreno, Adam B. Olshen, Barry S. Taylor, and Davide Ruggero Supplemental
More informationHigh-throughput sequence alignment. November 9, 2017
High-throughput sequence alignment November 9, 2017 a little history human genome project #1 (many U.S. government agencies and large institute) started October 1, 1990. Goal: 10x coverage of human genome,
More informationAnnotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)
Annotation of Plant Genomes using RNA-seq Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA) inuscu1-35bp 5 _ 0 _ 5 _ What is Annotation inuscu2-75bp luscu1-75bp 0 _ 5 _ Reconstruction
More informationSpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq
SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq Mark F. Rogers, Christina Boucher, and Asa Ben-Hur Department of Computer Science 1873 Campus Delivery Fort Collins, CO 80523 rogersma@cs.colostate.edu,
More informationGoing Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014
Going Beyond SNPs with Next Genera5on Sequencing Technology 02-223 Personalized Medicine: Understanding Your Own Genome Fall 2014 Next Genera5on Sequencing Technology (NGS) NGS technology Discover more
More informationModelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics
582746 Modelling and Analysis in Bioinformatics Lecture 1: Genomic k-mer Statistics Juha Kärkkäinen 06.09.2016 Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k Outline
More informationGBS Bioinformatics Pipeline(s) Overview
GBS Bioinformatics Pipeline(s) Overview Getting from sequence files to genotypes. Pipeline Coding: Ed Buckler Jeff Glaubitz James Harriman Presentation: Rob Elshire With supporting information from the
More informationGenome 541! Unit 4, lecture 2! Transcription factor binding using functional genomics
Genome 541 Unit 4, lecture 2 Transcription factor binding using functional genomics Slides vs chalk talk: I m not sure why you chose a chalk talk over ppt. I prefer the latter no issues with readability
More informationStatistics for Differential Expression in Sequencing Studies. Naomi Altman
Statistics for Differential Expression in Sequencing Studies Naomi Altman naomi@stat.psu.edu Outline Preliminaries what you need to do before the DE analysis Stat Background what you need to know to understand
More informationBiology 644: Bioinformatics
A stochastic (probabilistic) model that assumes the Markov property Markov property is satisfied when the conditional probability distribution of future states of the process (conditional on both past
More informationIntroduction to Hidden Markov Models for Gene Prediction ECE-S690
Introduction to Hidden Markov Models for Gene Prediction ECE-S690 Outline Markov Models The Hidden Part How can we use this for gene prediction? Learning Models Want to recognize patterns (e.g. sequence
More informationGEP Annotation Report
GEP Annotation Report Note: For each gene described in this annotation report, you should also prepare the corresponding GFF, transcript and peptide sequence files as part of your submission. Student name:
More informationBIOINFORMATICS ORIGINAL PAPER
BIOINFORMATICS ORIGINAL PAPER Vol. 25 no. 8 29, pages 126 132 doi:1.193/bioinformatics/btp113 Gene expression Statistical inferences for isoform expression in RNA-Seq Hui Jiang 1 and Wing Hung Wong 2,
More informationSUPPLEMENTARY DATA - 1 -
- 1 - SUPPLEMENTARY DATA Construction of B. subtilis rnpb complementation plasmids For complementation, the B. subtilis rnpb wild-type gene (rnpbwt) under control of its native rnpb promoter and terminator
More informationIon Torrent. The chip is the machine
Ion Torrent Introduction The Ion Personal Genome Machine [PGM] is simple, more costeffective, and more scalable than any other sequencing technology. Founded in 2007 by Jonathan Rothberg. Part of Life
More informationSEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA
SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS 1 Prokaryotes and Eukaryotes 2 DNA and RNA 3 4 Double helix structure Codons Codons are triplets of bases from the RNA sequence. Each triplet defines an amino-acid.
More informationAdvanced topics in bioinformatics
Feinberg Graduate School of the Weizmann Institute of Science Advanced topics in bioinformatics Shmuel Pietrokovski & Eitan Rubin Spring 2003 Course WWW site: http://bioinformatics.weizmann.ac.il/courses/atib
More informationComparative analysis of RNA- Seq data with DESeq2
Comparative analysis of RNA- Seq data with DESeq2 Simon Anders EMBL Heidelberg Two applications of RNA- Seq Discovery Eind new transcripts Eind transcript boundaries Eind splice junctions Comparison Given
More informationA Robust Method for Transcript Quantification with RNA-seq Data
A Robust Method for Transcript Quantification with RNA-seq Data Yan Huang 1, Yin Hu 1, Corbin D. Jones 2, James N. MacLeod 3, Derek Y. Chiang 4, Yufeng Liu 5, Jan F. Prins 6, and Jinze Liu 1 1 Department
More informationDEGseq: an R package for identifying differentially expressed genes from RNA-seq data
DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Likun Wang Zhixing Feng i Wang iaowo Wang * and uegong Zhang * MOE Key Laboratory of Bioinformatics and Bioinformatics
More informationGenome 541 Gene regulation and epigenomics Lecture 2 Transcription factor binding using functional genomics
Genome 541 Gene regulation and epigenomics Lecture 2 Transcription factor binding using functional genomics I believe it is helpful to number your slides for easy reference. It's been a while since I took
More informationNumber-controlled spatial arrangement of gold nanoparticles with
Electronic Supplementary Material (ESI) for RSC Advances. This journal is The Royal Society of Chemistry 2016 Number-controlled spatial arrangement of gold nanoparticles with DNA dendrimers Ping Chen,*
More informationStatistical Models for Gene and Transcripts Quantification and Identification Using RNA-Seq Technology
Purdue University Purdue e-pubs Open Access Dissertations Theses and Dissertations Fall 2013 Statistical Models for Gene and Transcripts Quantification and Identification Using RNA-Seq Technology Han Wu
More informationIntroduc)on to RNA- Seq Data Analysis. Dr. Benilton S Carvalho Department of Medical Gene)cs Faculty of Medical Sciences State University of Campinas
Introduc)on to RNA- Seq Data Analysis Dr. Benilton S Carvalho Department of Medical Gene)cs Faculty of Medical Sciences State University of Campinas Material: hep://)ny.cc/rnaseq Slides: hep://)ny.cc/slidesrnaseq
More informationNSCI Basic Properties of Life and The Biochemistry of Life on Earth
NSCI 314 LIFE IN THE COSMOS 4 Basic Properties of Life and The Biochemistry of Life on Earth Dr. Karen Kolehmainen Department of Physics CSUSB http://physics.csusb.edu/~karen/ WHAT IS LIFE? HARD TO DEFINE,
More informationThe official electronic file of this thesis or dissertation is maintained by the University Libraries on behalf of The Graduate School at Stony Brook
Stony Brook University The official electronic file of this thesis or dissertation is maintained by the University Libraries on behalf of The Graduate School at Stony Brook University. Alll Rigghht tss
More informationGibbs Sampling Methods for Multiple Sequence Alignment
Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical
More informationPredicting Protein Functions and Domain Interactions from Protein Interactions
Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput
More informationGrundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson
Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)
More informationAlignment. Peak Detection
ChIP seq ChIP Seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008 ChIP Seq Analysis Alignment Peak Detection Annotation Visualization Sequence Analysis Motif Analysis Alignment ELAND Bowtie
More informationCount ratio model reveals bias affecting NGS fold changes
Published online 8 July 2015 Nucleic Acids Research, 2015, Vol. 43, No. 20 e136 doi: 10.1093/nar/gkv696 Count ratio model reveals bias affecting NGS fold changes Florian Erhard * and Ralf Zimmer Institut
More informationMathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007
-2 Transcript Alignment Assembly and Automated Gene Structure Improvements Using PASA-2 Mathangi Thiagarajan mathangi@jcvi.org Rice Genome Annotation Workshop May 23rd, 2007 About PASA PASA is an open
More informationStatistical Modeling of RNA-Seq Data
Statistical Science 2011, Vol. 26, No. 1, 62 83 DOI: 10.1214/10-STS343 c Institute of Mathematical Statistics, 2011 Statistical Modeling of RNA-Seq Data Julia Salzman 1, Hui Jiang 1 and Wing Hung Wong
More informationSupplementary Information for
Supplementary Information for Evolutionary conservation of codon optimality reveals hidden signatures of co-translational folding Sebastian Pechmann & Judith Frydman Department of Biology and BioX, Stanford
More informationIntroduction to de novo RNA-seq assembly
Introduction to de novo RNA-seq assembly Introduction Ideal day for a molecular biologist Ideal Sequencer Any type of biological material Genetic material with high quality and yield Cutting-Edge Technologies
More informationGene expression from RNA-Seq
Gene expression from RNA-Seq Once sequenced problem becomes computational cells cdna sequencer Sequenced reads ChI Alinment read coverae enome Considerations and assumptions Hih library complexity #molecules
More information6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution. Lecture 05. Hidden Markov Models Part II
6.047/6.878/HST.507 Computational Biology: Genomes, Networks, Evolution Lecture 05 Hidden Markov Models Part II 1 2 Module 1: Aligning and modeling genomes Module 1: Computational foundations Dynamic programming:
More informationBME 5742 Biosystems Modeling and Control
BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various
More informationIntroduction to Bioinformatics
Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression
More informationGenome Annotation. Qi Sun Bioinformatics Facility Cornell University
Genome Annotation Qi Sun Bioinformatics Facility Cornell University Some basic bioinformatics tools BLAST PSI-BLAST - Position-Specific Scoring Matrix HMM - Hidden Markov Model NCBI BLAST How does BLAST
More informationBM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing. Data
Biometrics 000, 000 000 DOI: 000 000 0000 BM-Map: Bayesian Mapping of Multireads for Next-Generation Sequencing Data Yuan Ji 1,, Yanxun Xu 2, Qiong Zhang 3, Kam-Wah Tsui 3, Yuan Yuan 4, Clift Norris 1,
More informationStatistical tests for differential expression in count data (1)
Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image
More informationMarkov Models & DNA Sequence Evolution
7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under
More informationRNAseq Applications in Genome Studies. Alexander Kanapin, PhD Wellcome Trust Centre for Human Genetics, University of Oxford
RNAseq Applications in Genome Studies Alexander Kanapin, PhD Wellcome Trust Centre for Human Genetics, University of Oxford RNAseq Protocols } Next generation sequencing protocol } cdna, not RNA sequencing
More information1/22/13. Example: CpG Island. Question 2: Finding CpG Islands
I529: Machine Learning in Bioinformatics (Spring 203 Hidden Markov Models Yuzhen Ye School of Informatics and Computing Indiana Univerty, Bloomington Spring 203 Outline Review of Markov chain & CpG island
More informationBias Correction in RNA-Seq Short-Read Counts Using Penalized Regression
DOI 10.1007/s12561-012-9057-6 Bias Correction in RNA-Seq Short-Read Counts Using Penalized Regression David Dalpiaz Xuming He Ping Ma Received: 21 November 2011 / Accepted: 2 February 2012 International
More informationSupplementary Information
Electronic Supplementary Material (ESI) for RSC Advances. This journal is The Royal Society of Chemistry 2014 Directed self-assembly of genomic sequences into monomeric and polymeric branched DNA structures
More informationNature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1
Supplementary Figure 1 Zn 2+ -binding sites in USP18. (a) The two molecules of USP18 present in the asymmetric unit are shown. Chain A is shown in blue, chain B in green. Bound Zn 2+ ions are shown as
More informationThe Saguaro Genome. Toward the Ecological Genomics of a Sonoran Desert Icon. Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS
The Saguaro Genome Toward the Ecological Genomics of a Sonoran Desert Icon Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS Why study a genome? - the genome contains the genetic information of an
More informationWhat is Systems Biology
What is Systems Biology 2 CBS, Department of Systems Biology 3 CBS, Department of Systems Biology Data integration In the Big Data era Combine different types of data, describing different things or the
More informationCS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning
CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy
More informationStochastic processes and
Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University
More informationComputational Genomics. Systems biology. Putting it together: Data integration using graphical models
02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput
More informationDifferential expression analysis for sequencing count data. Simon Anders
Differential expression analysis for sequencing count data Simon Anders RNA-Seq Count data in HTS RNA-Seq Tag-Seq Gene 13CDNA73 A2BP1 A2M A4GALT AAAS AACS AADACL1 [...] ChIP-Seq Bar-Seq... GliNS1 4 19
More informationElectronic supplementary material
Applied Microbiology and Biotechnology Electronic supplementary material A family of AA9 lytic polysaccharide monooxygenases in Aspergillus nidulans is differentially regulated by multiple substrates and
More informationCharacterization of Pathogenic Genes through Condensed Matrix Method, Case Study through Bacterial Zeta Toxin
International Journal of Genetic Engineering and Biotechnology. ISSN 0974-3073 Volume 2, Number 1 (2011), pp. 109-114 International Research Publication House http://www.irphouse.com Characterization of
More informationCloud-scale RNA-sequencing differential expression analysis with Myrna
Cloud-scale RNA-sequencing differential expression analysis with Myrna Jeff Leek Johns Hopkins Bloomberg School of Public Health e: jleek@jhsph.edu t: http://www.twitter.com/leekgroup myrna: http://bowtie-bio.sourceforge.net/myrna/
More informationComparative genomics: Overview & Tools + MUMmer algorithm
Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first
More informationModelling gene expression dynamics with Gaussian processes
Modelling gene expression dynamics with Gaussian processes Regulatory Genomics and Epigenomics March th 6 Magnus Rattray Faculty of Life Sciences University of Manchester Talk Outline Introduction to Gaussian
More informationMixtures and Hidden Markov Models for analyzing genomic data
Mixtures and Hidden Markov Models for analyzing genomic data Marie-Laure Martin-Magniette UMR AgroParisTech/INRA Mathématique et Informatique Appliquées, Paris UMR INRA/UEVE ERL CNRS Unité de Recherche
More informationBuilding a Multifunctional Aptamer-Based DNA Nanoassembly for Targeted Cancer Therapy
Supporting Information Building a Multifunctional Aptamer-Based DNA Nanoassembly for Targeted Cancer Therapy Cuichen Wu,, Da Han,, Tao Chen,, Lu Peng, Guizhi Zhu,, Mingxu You,, Liping Qiu,, Kwame Sefah,
More informationAlgorithmics and Bioinformatics
Algorithmics and Bioinformatics Gregory Kucherov and Philippe Gambette LIGM/CNRS Université Paris-Est Marne-la-Vallée, France Schedule Course webpage: https://wikimpri.dptinfo.ens-cachan.fr/doku.php?id=cours:c-1-32
More informationDEXSeq paper discussion
DEXSeq paper discussion L Collado-Torres December 10th, 2012 1 / 23 1 Background 2 DEXSeq paper 3 Results 2 / 23 Gene Expression 1 Background 1 Source: http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/applexpression.shtml
More informationMore Codon Usage Bias
.. CSC448 Bioinformatics Algorithms Alexander Dehtyar.. DA Sequence Evaluation Part II More Codon Usage Bias Scaled χ 2 χ 2 measure. In statistics, the χ 2 statstic computes how different the distribution
More informationTechnologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA
Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Expression analysis for RNA-seq data Ewa Szczurek Instytut Informatyki Uniwersytet Warszawski 1/35 The problem
More informationDifferential Expression Analysis Techniques for Single-Cell RNA-seq Experiments
Differential Expression Analysis Techniques for Single-Cell RNA-seq Experiments for the Computational Biology Doctoral Seminar (CMPBIO 293), organized by N. Yosef & T. Ashuach, Spring 2018, UC Berkeley
More informationInferring Protein-Signaling Networks II
Inferring Protein-Signaling Networks II Lectures 15 Nov 16, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022
More informationLecture 15: Programming Example: TASEP
Carl Kingsford, 0-0, Fall 0 Lecture : Programming Example: TASEP The goal for this lecture is to implement a reasonably large program from scratch. The task we will program is to simulate ribosomes moving
More informationPredictive Genome Analysis Using Partial DNA Sequencing Data
Predictive Genome Analysis Using Partial DNA Sequencing Data Nauman Ahmed, Koen Bertels and Zaid Al-Ars Computer Engineering Lab, Delft University of Technology, Delft, The Netherlands {n.ahmed, k.l.m.bertels,
More information