Isoform discovery and quantification from RNA-Seq data

C. Toffano-Nioche, T. Dayris, Y. Boursin, M. Deloger
November 2016

Introduction: Forewords
Haas BJ, Zody MC. Advancing RNA-Seq analysis. Nat Biotechnol. 2010 May;28(5):421-3.

Quantification from RNA-Seq data
Previous talk: quantification at the gene level (Condition 1 vs. Condition 2).
But genes may be differentially spliced: a single locus can give rise to many different mRNAs (isoforms).
Leng N, Dawson JA, Thomson JA, Ruotti V, Rissman AI, Smits BM, Haag JD, Gould MN, Stewart RM, Kendziorski C. EBSeq: an empirical Bayes hierarchical model for inference in RNA-Seq experiments. Bioinformatics. 2013 Apr 15;29(8):1035-43.

Quantification from RNA-Seq data
And isoforms may be differentially expressed between two conditions.
Leng N et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-Seq experiments. Bioinformatics. 2013 Apr 15;29(8):1035-43.

Classification and usage of splicing events
Histogram (number of splicing events): AStalavista [1] applied to the latest available RefSeq annotations for ce2, dm3, hg18 and tair10.
[1] Foissac S, Sammeth M (2007) ASTALAVISTA: dynamic and flexible analysis of alternative splicing events in custom gene datasets. Nucleic Acids Research 35:W297-299. http://genome.crg.es/astalavista/

A real need?
- transcriptome from a new condition
- tissue-specific transcriptome
- different developmental stages
- transcriptome from a non-model organism
- cancer cell
- RNA maturation mutant
- ...
How to manage RNA-Seq data with genes subject to differential splicing?
- Is it possible to discover new isoforms? Cufflinks, Cuffmerge
- Is it possible to quantify the abundance of each isoform? RSEM, EBSeq

Isoform reconstruction and quantification from RNA-Seq

Introduction: RNA-Seq data
Profiling of sex-biased expression in Drosophila melanogaster
- tissue: whole flies
- developmental stage, age: adult, 5-7 days post eclosion
- conditions: sex (female or male), biological duplicates
- samples: Female rep1, Female rep2, Male rep1, Male rep2
- SRA [1]: GSM694258, GSM694259, GSM694260, GSM694261
- library: polyA+ mRNA, paired-end 2x75 bp, insert size +/- 200 bp
- genome reduction: chr 3R (autosome), from 377 to 13 947 890
[1] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gsm6942xx

Introduction: Data
Practical, 1st step: data import from Published histories

Discovery of new transcripts: Cufflinks
Isoform reconstruction protocol (Tuxedo suite)
Trapnell C et al. Differential gene and transcript expression analysis of RNA-Seq experiments with TopHat and Cufflinks. Nat Protoc. 2012 Mar 1;7(3):562-78.

Practical, 2nd step: Cufflinks
Parameters:
- SAM or BAM file of aligned RNA-Seq reads: your mapping file
- Use Reference Annotation: set to "Use reference annotation as guide"
- Reference Annotation: your genome annotation
- Use effective length correction: No
We are interested in isoform detection, not in isoform quantification.

Cufflinks algorithm
Trapnell C et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511-5.

Our Cufflinks usage
- Aim: isoform discovery, without quantification, so the parameters related to quantification (normalization, length correction) do not matter.
- Two signal-to-noise thresholds, one for isoforms and one for splicing events:
  - minimum expression ratio: given isoform / major isoform
  - read-count ratio: splicing site / intron
- With a well-known genome (fruit fly): use the reference annotation as a guide (Cufflinks may also be run with no reference annotation).
Trapnell C et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511-5.
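The first threshold above (given isoform / major isoform) can be sketched in a few lines of Python. This is only an illustration of the idea, not Cufflinks code: the function name, the example abundances and the 0.1 cutoff are assumptions made for the example.

```python
def filter_minor_isoforms(abundances, min_fraction=0.1):
    """Keep isoforms whose abundance is at least min_fraction of the major isoform.

    abundances: dict mapping isoform name -> abundance for ONE gene
    (hypothetical values; any consistent unit works because only ratios are used).
    """
    if not abundances:
        return {}
    major = max(abundances.values())
    if major <= 0:
        return {}
    return {name: value for name, value in abundances.items()
            if value / major >= min_fraction}


# Hypothetical abundances for the three isoforms of one gene.
gene_isoforms = {"iso_A": 120.0, "iso_B": 30.0, "iso_C": 6.0}
kept = filter_minor_isoforms(gene_isoforms, min_fraction=0.1)
print(sorted(kept))
```

Here iso_C is dropped because 6 / 120 = 0.05 is below the assumed cutoff, while iso_B (0.25) is kept. Raising the cutoff trades sensitivity to rare isoforms for fewer spurious assemblies.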

Discovery of new transcripts: Cuffmerge
Merge transcripts from many samples
- Cufflinks is run on each sample, which gives a different list of transcripts per sample.
- These lists must be unified with one another and with the reference annotation: cuffmerge.
Trapnell C et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511-5.

Practical, 3rd step: Cuffmerge
Parameters:
- GTF file produced by Cufflinks: your first genomic annotation produced by Cufflinks (gtf)
- Additional GTF Input Files: repeat up to your last annotation!
- Use Reference Annotation: set to Yes, then insert your initial genomic annotation (gtf)
- Use Sequence Data: set to Yes
- Choose the source for the reference list: set to History
- Using Reference file: your genomic sequence (fasta)

Discovery of new transcripts: Results
Understanding the class codes that the Cufflinks suite assigns to transcripts
http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/

Class code "="
The reads shown map to an existing transcript in the fly genome (here female sample 2 and male sample 1), with neither differential expression nor differential processing.

Class code "="
Another example of the "=" class, this one differentially expressed with respect to the male condition.

Class code "j"
The reads shown map to part of an existing transcript. This is a potential novel isoform in the female sample.

Class code "u"
The reads shown map to an intergenic region.
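To get an overview of how many merged transcripts fall into each class, the class codes can be tallied from the merged annotation. The sketch below assumes that each GTF record carries `transcript_id` and `class_code` attributes in its ninth column; the example records and the `gtf_line` helper are hypothetical and only there to make the block self-contained.

```python
import re
from collections import Counter

# Matches GTF attributes written as: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def class_code_counts(gtf_lines):
    """Count transcripts per class code (each transcript counted once)."""
    codes = {}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # comment or malformed line
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "class_code" in attrs:
            codes[attrs["transcript_id"]] = attrs["class_code"]
    return Counter(codes.values())


def gtf_line(start, end, transcript_id, class_code):
    """Build a hypothetical exon record for the example."""
    attrs = f'gene_id "XLOC_1"; transcript_id "{transcript_id}"; class_code "{class_code}";'
    return "\t".join(["chr3R", "Cufflinks", "exon", str(start), str(end),
                      ".", "+", ".", attrs])


EXAMPLE = [
    gtf_line(100, 200, "TCONS_1", "="),   # two exons of the same known transcript
    gtf_line(300, 400, "TCONS_1", "="),
    gtf_line(500, 600, "TCONS_2", "j"),   # potential novel isoform
    gtf_line(900, 950, "TCONS_3", "u"),   # intergenic transcript
]
counts = class_code_counts(EXAMPLE)
print(dict(counts))
```

Counting per transcript rather than per exon line avoids inflating the classes that contain multi-exon transcripts.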

Isoform reconstruction and quantification from RNA-Seq

Differential expression, transcript level: Forewords
Isoform differential expression
- RSEM aligns reads to a reference of transcripts and counts them.
- EBSeq finds DE isoforms across two conditions.
- Some intermediate steps link RSEM and EBSeq.

Isoform differential expression: RSEM
RSEM aligns reads to a transcript reference, which is either:
- computed from the genome annotation (GTF file); only stranded features are accepted, so a filter is needed to remove unstranded isoforms
- taken directly from a transcript assembly (non-model organism, cancer cell, etc.)

Differential expression, transcript level: RSEM, pre-processing data
RSEM: removing unstranded new isoforms
RSEM requires stranded features only, so we have to filter out unstranded isoforms.

Practical, 4th step: filter for RSEM
Parameters:
- Filter: the file to be filtered, our merged GTF
- With following condition: c7!='.' (column 7 of a GTF file is the strand)
- Number of header lines to skip: 0 (we have no header at all)
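Outside the graphical interface, the same filter can be written as a short Python function. This is a minimal sketch of the condition above (keep a record only if its strand column is not "."); the function name and the example records are hypothetical.

```python
def keep_stranded(gtf_lines):
    """Return the GTF records whose 7th column (strand) is not '.'."""
    kept = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # Records with fewer than 7 columns (comments, blank lines) are skipped.
        if len(fields) >= 7 and fields[6] != ".":
            kept.append(line)
    return kept


def record(strand, transcript_id):
    """Build a hypothetical transcript record for the example."""
    return "\t".join(["chr3R", "Cufflinks", "exon", "100", "200", ".", strand, ".",
                      f'transcript_id "{transcript_id}";'])


merged = [record("+", "TCONS_1"), record(".", "TCONS_2"), record("-", "TCONS_3")]
stranded = keep_stranded(merged)
print(len(stranded))
```

Only the two records with a "+" or "-" strand survive; the unstranded TCONS_2 record is dropped, which is what RSEM needs.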

Isoform differential expression: RSEM
RSEM aligns reads to a transcript reference, which is either:
- computed from the genome annotation (GTF file); only stranded features are accepted, so a filter is needed to remove unstranded isoforms
- taken directly from a transcript assembly (non-model organism, cancer cell, etc.)
RSEM adds a poly(A) tail to each transcript (for reads coming from the 3' end of the mRNA) and indexes the reference (to save time): RSEM prepare reference.

Practical, 5th step: RSEM prepare reference
Parameters:
- Reference transcript source: set to "reference genome and gtf"
- reference fasta file: your genome sequence (fasta)
- gtf or gff3 file: your enhanced and merged genome annotation
- Use Bowtie2: Yes

Differential expression, transcript level: RSEM, calculating expression values
RSEM features
RSEM (RNA-Seq by Expectation Maximization) estimates the uncertainty due to both multiread allocation and random sampling, using all valid mappings of each read (mapping scores, probability that a read comes from a locus).
It needs a specific mapping (SAM/BAM) file that reports all valid mappings of each read, so the mapping step (bowtie/bowtie2) has to be rerun.
Some RSEM features:
- strand specificity
- read-position distributions that are highly 5' or 3' biased
- for single-end data, a fixed fragment length
- no support for gapped mappings (no indels)

EM algorithm: estimating the expression of cognate isoforms
EM: Expectation-Maximization
Figure: first 3 cycles of the EM algorithm.
Abundance of the red isoform estimated after the 1st M-step: (1/3 read a + 1/2 read c + 1 read d + 1/2 read e) / (total number of reads), i.e. (0.33 + 0.5 + 1 + 0.5) / 5 = 0.47.
- The algorithm is proved to converge.
- Stop criterion: all probabilities that a fragment is derived from a transcript which are >= 10^-7 have a relative change smaller than 10^-3.
RSEM calculate expression
L. Pachter: Models for transcript quantification from RNA-Seq, http://arxiv.org/pdf/1104.3889v2.pdf
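The arithmetic on this slide can be reproduced with a small EM sketch in Python. It is a toy version under stated assumptions, not RSEM itself: transcript lengths and mapping scores are ignored, and the read-to-isoform compatibility table is partly assumed. The slide only implies that read a is shared by three isoforms, reads c and e by two isoforms each (one of them red), and read d is unique to red; the partner isoforms of reads b, c and e below are illustrative guesses.

```python
def em_step(theta, compat):
    """One EM cycle: split each read among its compatible isoforms, then renormalize."""
    counts = {t: 0.0 for t in theta}
    # E-step: fractional assignment of each read, proportional to current abundances.
    for targets in compat.values():
        total = sum(theta[t] for t in targets)
        for t in targets:
            counts[t] += theta[t] / total
    # M-step: abundance = fraction of all reads assigned to the isoform.
    n_reads = len(compat)
    return {t: counts[t] / n_reads for t in theta}


def run_em(compat, transcripts, max_iter=1000, min_theta=1e-7, rel_tol=1e-3):
    """Iterate until every abundance >= min_theta changes by less than rel_tol (relative)."""
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(max_iter):
        new_theta = em_step(theta, compat)
        converged = all(abs(new_theta[t] - theta[t]) / theta[t] < rel_tol
                        for t in theta if theta[t] >= min_theta)
        theta = new_theta
        if converged:
            break
    return theta


TRANSCRIPTS = ["red", "green", "blue"]
COMPAT = {
    "a": ["red", "green", "blue"],
    "b": ["green", "blue"],   # assumption: read b is not described on the slide
    "c": ["red", "blue"],     # assumption: partner isoform chosen for illustration
    "d": ["red"],
    "e": ["red", "green"],    # assumption: partner isoform chosen for illustration
}

uniform = {t: 1.0 / 3 for t in TRANSCRIPTS}
first = em_step(uniform, COMPAT)
print(round(first["red"], 2))  # 0.47, as on the slide
final = run_em(COMPAT, TRANSCRIPTS)
```

Starting from uniform abundances, the first cycle gives red (1/3 + 1/2 + 1 + 1/2) / 5 = 7/15. `run_em` then repeats the cycle using the stop criterion quoted above, with the iteration cap added as a safeguard.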

Practical, 6th step: RSEM calculate expression
Parameters:
- RSEM Reference Source: set to "From your history"
- RSEM reference: your previous reference
- Library type: set to Paired End Reads
- Read 1 fastq file and Read 2 fastq file: your reads (fastq)
- Use bowtie 1 or 2?: set to Bowtie 2
- Is the library strand specific?: set to forward orientation

Differential expression, transcript level: EBSeq algorithm
Isoform differential expression: EBSeq
EBSeq is an empirical Bayesian approach that models a number of features observed in RNA-Seq data.
Tool "Isoform level DE test across two conditions": runs EBSeq to find DE isoforms across two conditions.

EBSeq algorithm
Mapping uncertainty increases with the presence of multiple isoforms of a given gene.
EBSeq:
- The expected count of an isoform follows a negative binomial distribution.
- Isoform-specific means and variances are estimated via the EM algorithm.
- EBSeq accommodates isoform expression estimation uncertainty by modeling the differential variability observed in distinct groups of isoforms.
- 3 groups, according to the number of isoforms associated with each gene (1, 2, or 3 and more).
Leng N et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-Seq experiments. Bioinformatics. 2013 Apr 15;29(8):1035-43.

EBSeq directly models isoform expression
A collective analysis of all isoforms:
- reduces the power to identify isoforms in group 1 (the true variances in that group are lower, on average, than those derived from the full collection of isoforms)
- increases the false discoveries in the two other groups (true variances are higher)
Figure: change of the estimation uncertainty with increasing isoform complexity.
Leng N et al. EBSeq: an empirical Bayes hierarchical model for inference in RNA-Seq experiments. Bioinformatics. 2013 Apr 15;29(8):1035-43.

Differential expression, transcript level: Pre-processing data
Isoform differential expression: EBSeq
Empirical Bayesian approach that models a number of features observed in RNA-Seq data.
- 2 workflows:
  - Create IG Vector: creates a vector with the related group for each isoform
  - Create Expression Table: 4 RSEM outputs -> 1 EBSeq input
- Isoform level DE test across two conditions: runs EBSeq to find DE isoforms across two conditions

From RSEM to EBSeq
We have:
- 4 files (1 per replicate)
- files that are identically ordered by transcript name
We need:
- 1 file containing the number of isoforms each gene has: the IG Vector
- 1 file with all the expected expression values: the Expression table
We have to convert the RSEM output to fit the requirements of the EBSeq input.

Practical, 7th step: EBSeq IG Vector (1/3)
What is the IG Vector?
- The IG Vector is a table with a single column of integers.
- Each row corresponds to the transcript on the same row of the Expression table.
- Each integer is the group (1, 2 or 3) of the isoform, according to the number of isoforms of the gene it belongs to.
Tools:
- "Cut" and "Remove beginning", from the Text Manipulation section
- "Get Ig vector from gene-isoform mapping for isoform level DE analysis", available in the EBSeq section

Practical, 7th step: EBSeq IG Vector (2/3)
RSEM isoform abundance table -> EBSeq IG Vector

Practical, 7th step: EBSeq IG Vector (3/3)
Parameter:
- input: a count table from RSEM
Caution: all the isoform abundance tabular files list the transcript and gene names in the same order, line by line. The Create IG Vector workflow relies on this order, so any of the isoform abundance files may be used in this step.
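The IG Vector described in this step can be sketched in Python as follows. The sketch assumes the input is the list of (transcript, gene) pairs taken, in file order, from the first two columns of one RSEM isoform abundance table; the function name and the identifiers are hypothetical.

```python
from collections import Counter


def ig_vector(transcript_gene_pairs, max_group=3):
    """Return one group number per transcript, in input order.

    Group = number of isoforms of the parent gene, capped at max_group
    (1, 2, or 3 meaning "3 and more").
    """
    isoforms_per_gene = Counter(gene for _, gene in transcript_gene_pairs)
    return [min(isoforms_per_gene[gene], max_group)
            for _, gene in transcript_gene_pairs]


# Hypothetical (transcript, gene) pairs in the order of the abundance table.
pairs = [
    ("T1", "G1"),                                            # gene with 1 isoform
    ("T2", "G2"), ("T3", "G2"),                              # gene with 2 isoforms
    ("T4", "G3"), ("T5", "G3"), ("T6", "G3"), ("T7", "G3"),  # gene with 4 isoforms
]
print(ig_vector(pairs))  # [1, 2, 2, 3, 3, 3, 3]
```

Because the output keeps the input order, row i of the vector lines up with row i of the Expression table, which is what the caution above is about.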

From RSEM Expression Table to EBSeq Data Matrix
The expression table has 5 columns:
- transcript name
- expected expression of F1, F2, M1 and M2
It is obtained by merging the 5th column of the RSEM isoform expression results.
Tool: Create Expression Table, available among the shared workflows.

Practical, 8th step: Create Expression Table
Parameters:
- First Dataset, Second Dataset, Third Dataset and Fourth Dataset: your count tables
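The merge performed by the Create Expression Table workflow can be sketched in Python. The sketch assumes tab-separated RSEM isoform results with one header line and the expected count in the 5th column; the column names in `HEADER`, the helper `fake_results` and all values are made up for the example.

```python
import csv
import io

HEADER = "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct"


def read_expected_counts(handle):
    """Return [(transcript, expected_count), ...] from one isoform results table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    return [(row[0], float(row[4])) for row in reader]


def build_expression_table(samples):
    """Merge the 5th column of each sample into one table (samples: name -> file handle)."""
    names = list(samples)
    columns = [read_expected_counts(samples[name]) for name in names]
    transcripts = [t for t, _ in columns[0]]
    for name, column in zip(names, columns):
        if [t for t, _ in column] != transcripts:
            raise ValueError(f"transcript order differs in sample {name}")
    table = [["transcript"] + names]
    for i, transcript in enumerate(transcripts):
        table.append([transcript] + [column[i][1] for column in columns])
    return table


def fake_results(counts):
    """Build a hypothetical in-memory results file from (transcript, count) pairs."""
    rows = [f"{t}\tG1\t1000\t900\t{c}\t0\t0\t0" for t, c in counts]
    return io.StringIO("\n".join([HEADER] + rows) + "\n")


samples = {
    "F1": fake_results([("T1", 10.0), ("T2", 0.5)]),
    "F2": fake_results([("T1", 12.0), ("T2", 1.5)]),
    "M1": fake_results([("T1", 3.0), ("T2", 7.0)]),
    "M2": fake_results([("T1", 4.0), ("T2", 9.0)]),
}
table = build_expression_table(samples)
for row in table:
    print("\t".join(str(value) for value in row))
```

The explicit check on transcript order guards the assumption stated earlier that all four files are identically ordered; a column-wise paste without it would silently misalign counts.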

Differential expression, transcript level: Differential analysis
Practical, 9th step: EBSeq differential expression
Parameters:
- Isoform Expression: our Data Matrix
- The first row is Sample Names: Yes
- Enter which condition each sample belongs to: M, M, F, F
- Ig Vector: our IG Vector

List of DE isoforms

Conclusion
Isoform differential expression: methods and tools
Classical RNA-Seq analysis method. Many methods (and tools):
- Expression estimation:
  - Bayesian estimation of the parameters of a model: BitSeq, Cufflinks, eXpress
  - Expectation-Maximization approach to inferring isoform abundances: RSEM-EBSeq, Sailfish/Salmon, Kallisto
- Mapping to:
  - the genome: Cuffdiff2, BitSeq, FluxCapacitor
  - the transcriptome: eXpress, RSEM-EBSeq
- Mapping-free: Sailfish/Salmon, Kallisto

Mapping-free?
Kallisto example: De Bruijn graph on the transcriptome
- Handles multimapped reads
- The algorithm needs to be adapted to use stranded RNA-Seq
- No mapping = no visualization
Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016 May;34(5):525-7.
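The idea behind mapping-free quantification can be illustrated with a toy k-mer index in Python: instead of aligning a read, one looks up which transcripts contain each of its k-mers and intersects those sets. This is a simplified sketch and not kallisto's implementation (no graph structure, no k-mer skipping, forward strand only); the sequences, names and k = 4 are arbitrary assumptions.

```python
def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers of the read.

    Simplification: a read with an unknown k-mer, or shorter than k,
    is reported as compatible with nothing.
    """
    result = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            return set()
        result = set(hits) if result is None else result & hits
    return result or set()


# Two hypothetical isoforms sharing their 5' part and differing at the 3' end.
TRANSCRIPTS = {"iso1": "ATGGCGTACC", "iso2": "ATGGCGTTAG"}
K = 4
INDEX = build_kmer_index(TRANSCRIPTS, K)

shared = compatible_transcripts("TGGCG", INDEX, K)      # from the common region
only_iso1 = compatible_transcripts("CGTACC", INDEX, K)  # covers the iso1-specific end
print(sorted(shared), sorted(only_iso1))
```

A read from the shared region stays ambiguous (compatible with both isoforms), which is where an EM step is needed afterwards; a read covering the isoform-specific end is assigned to a single isoform. No genomic coordinates are produced, which is why there is nothing to visualize.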

Isoforms with RNA-Seq: not yet
Isoform discovery and quantification from RNA-Seq is not yet a well-established measure.
- Transcriptome-based methods are generally better (for quantification, not for discovery).
- EM methods are better than count-based methods (many EM methods are available but differ little in accuracy).
- The more abundant an isoform is, the more accurately it is inferred.
- Major bottlenecks: short read length (compared with 2.2 kb for mammalian transcripts) and multimapped reads.
Evaluating the accuracy of computational methods for isoform abundance is difficult:
- too few isoforms have been validated experimentally (e.g. qRT-PCR)
- synthetically generated datasets may not adequately capture the complexities of RNA-Seq experiments

Improve?
- Don't forget the microarrays designed for isoform detection (not for the discovery of new isoforms; model organisms).
- Gain statistical power with spike-in measurements.
- Develop protocols like ribodepletion, but for highly expressed housekeeping genes (to enrich for interesting transcripts).
- Complete isoform definitions with other NGS studies?
  - ChIP-Seq with a spliceosome protein as target
  - capturing the 5' ends of RNAs
  - ...
- Full-length cDNA technology (Pacific Biosciences)? Throughput still too low (10^4 transcripts, summer 2015).
Adapt! Biological question + organism + data: parameters, software, sequencing protocols (single- or paired-end, stranded or not).
Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-Seq data. Genome Biol. 2015 Jul 23;16:150.

RNA-Seq: just a photo
RNA-Seq is just a single, sampled capture of RNA at a given position, at a given time, in one biological experiment... a poor-quality photo compared with real life.

Bonus: Creating a workflow
- 1/9 Start a new workflow
- 2/9 Add some details: both the name and the annotation are important for managing your own workflows
- 3/9 Add some details: the workflow is created empty; let us add some tags before diving into the tools
- 4/9 Cut
- 5/9 Add some actions
- 6/9 Remove Beginning
- 7/9 EBSeq IG Vector
- 8/9 Custom output
- 9/9 End: do not forget to save!