Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

Similar documents
GEP Annotation Report

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Isoform discovery and quantification from RNA-Seq data

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula

Going Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Small RNA in rice genome

COLE TRAPNELL, BRIAN A WILLIAMS, GEO PERTEA, ALI MORTAZAVI, GORDON KWAN, MARIJKE J VAN BAREN, STEVEN L SALZBERG, BARBARA J WOLD, AND LIOR PACHTER

Genomes Comparision via de Bruijn graphs

objective functions...

Предсказание и анализ промотерных последовательностей. Татьяна Татаринова

Supplementary Information

High-throughput sequencing: Alignment and related topic

Bias in RNA sequencing and what to do about it

Towards More Effective Formulations of the Genome Assembly Problem

RNA- seq read mapping

SUPPLEMENTARY INFORMATION

Introduction to Bioinformatics

Our typical RNA quantification pipeline

Introduction to de novo RNA-seq assembly

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Genomics and bioinformatics summary. Finding genes -- computer searches

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering

Potato Genome Analysis

RNA Processing: Eukaryotic mrnas

BLAST. Varieties of BLAST

GENOME-WIDE ANALYSIS OF CORE PROMOTER REGIONS IN EMILIANIA HUXLEYI

Annotation of Drosophila grimashawi Contig12

Supplemental Data. Hou et al. (2016). Plant Cell /tpc

Genome Assembly. Sequencing Output. High Throughput Sequencing

Paired-End Read Length Lower Bounds for Genome Re-sequencing

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Introduction to Sequence Alignment. Manpreet S. Katari

Pan-genomics: theory & practice

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Eppendorf twin.tec PCR Plates 96 LoBind Increase Yield of Transcript Species and Number of Reads of NGS Libraries

BME 5742 Biosystems Modeling and Control

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

Bioinformatics Chapter 1. Introduction

Comparative Network Analysis

Sequence analysis and comparison

Tandem repeat 16,225 20,284. 0kb 5kb 10kb 15kb 20kb 25kb 30kb 35kb

SoyBase, the USDA-ARS Soybean Genetics and Genomics Database

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors

Bioinformatics and BLAST

Araport, a community portal for Arabidopsis. Data integration, sharing and reuse. sergio contrino University of Cambridge

Supplementary Information

Supplemental Figure 1. Comparison of Tiller Bud Formation between the Wild Type and d27. (A) and (B) Longitudinal sections of shoot apex in wild-type

GCD3033:Cell Biology. Transcription

RGP finder: prediction of Genomic Islands

Supplementary Figure 1 The number of differentially expressed genes for uniparental males (green), uniparental females (yellow), biparental males

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Comparative analysis of RNA- Seq data with DESeq2

The Developmental Transcriptome of the Mosquito Aedes aegypti, an invasive species and major arbovirus vector.

HMMs and biological sequence analysis

AS A SERVICE TO THE RESEARCH COMMUNITY, GENOME BIOLOGY PROVIDES A 'PREPRINT' DEPOSITORY

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

Sequences, Structures, and Gene Regulatory Networks

Supporting Information

De novo assembly and genotyping of variants using colored de Bruijn graphs

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

Supplemental Information

Mapping-free and Assembly-free Discovery of Inversion Breakpoints from Raw NGS Reads

GENOME DUPLICATION AND GENE ANNOTATION: AN EXAMPLE FOR A REFERENCE PLANT SPECIES.

Statistics for Differential Expression in Sequencing Studies. Naomi Altman

Introduc)on to RNA- Seq Data Analysis. Dr. Benilton S Carvalho Department of Medical Gene)cs Faculty of Medical Sciences State University of Campinas

Statistical Models for Gene and Transcripts Quantification and Identification Using RNA-Seq Technology

Statistical Inferences for Isoform Expression in RNA-Seq

Gene Regula*on, ChIP- X and DNA Mo*fs. Statistics in Genomics Hongkai Ji

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

SUSTAINABLE AND INTEGRAL EXPLOITATION OF AGAVE

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

express: Streaming read deconvolution and abundance estimation applied to RNA-Seq

Lecture 18 June 2 nd, Gene Expression Regulation Mutations

New RNA-seq workflows. Charlotte Soneson University of Zurich Brixen 2016

A Browser for Pig Genome Data

Systematic comparison of lncrnas with protein coding mrnas in population expression and their response to environmental change

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

Markov Chains and Hidden Markov Models. = stochastic, generative models

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Eukaryotic vs. Prokaryotic genes

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Alignment Strategies for Large Scale Genome Alignments

Algorithmics and Bioinformatics

The Saguaro Genome. Toward the Ecological Genomics of a Sonoran Desert Icon. Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS

PG Diploma in Genome Informatics onwards CCII Page 1 of 6

High-throughput sequence alignment. November 9, 2017

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00.

Comparative genomics: Overview & Tools + MUMmer algorithm

Introduction. SAMStat. QualiMap. Conclusions

Variation in the genetic response to high temperature in Montastraea faveolata from the Florida Keys & Mexico

Multi-Assembly Problems for RNA Transcripts

Regulatory Change in YABBY-like Transcription Factor Led to Evolution of Extreme Fruit Size during Tomato Domestication

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Whole Genome Alignments and Synteny Maps

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

SpliceGrapherXT: From Splice Graphs to Transcripts Using RNA-Seq

Genome sequence of Plasmopara viticola and insight into the pathogenic mechanism

BIOINFORMATICS ORIGINAL PAPER

Transcription:

Annotation of Plant Genomes using RNA-seq Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

inuscu1-35bp 5 _ 0 _ 5 _ What is Annotation inuscu2-75bp luscu1-75bp 0 _ 5 _ Reconstruction of gene structure within genome 0 _ 5 _ luscu2-75bp 0 _ 0155 6 _ JGI 3.1 Genes from GFF as Partly Cleaned by SJC -- do not trust AA translations Solexa Transcriptome Log_10 Unique Coverage All All All with -1 for 0 (M./G./N.) ll_u -1 _ Transcripti onal Start Site UTR Coding Exon Intron Transcripti onal End Site

How are Genomes Annotated? Traditional Approaches use: Using information from expressed sequence tags (ESTs) Conservation across organisms Prior knowledge of sequence motifs (e.g. splice junctions) Do not take advantage of data generated from nextgeneration sequencers Challenge: Develop data-driven annotation using RNA-seq data

Whole-genome Transcriptome Analysis (WTA) 10!g Total RNA 10ng/!l mrna AAAAAAA Random fragmentation AAAA Random hexamer primed 1st strand cdna synthesis + 2nd strand cdna synthesis End-repairing and adaptor ligation Size selection Christian Haudenschild - Illumina

Plant Genomes Chlamydomonas is a model algae with a sequenced genome and still incomplete annotation Currently being used for biodiesel studies Arabiodpsis is a model plant with a very high quality genome and nearly complete annotation Genetically tractable organism

Limita&ons of Current Chlamydomonas Annota&on from Augustus Models Even at high coverage more than 20% of predicted genes have no RNA-seq evidence

Two Approaches for Annotating Genomes using RNA-seq 1. First Approach Align reads to genome Concatenate reads that map to overlapping bases on the genome 2. Second approach Assemble reads directly before mapping to genome Use Assembly tools such as ABySS

Method I - Alignment of Reads to Genome First perform ungapped alignments using a fast aligner (e.g. Novoalign or Bow@e) The reads that do not map are mapped using a gapped alignment protocol (e.g. BLAT or TopHat) The gaps iden@fy splice junc@ons We compute the number of reads that align to each base in the genome

Read Counts Across Chlamydomonas Genome 35 base reads JGI-MinusCu1-35bp JGI-MinusCu2-75bp sition old_1: 5 _ 0 _ 5 _ C. reinhardtii May 2006 scaffold_1:99,753-102,790 (3,038 bp) 100000 100500 101000 101500 102000 102500 RNA-Seq Coverage JGI MinusCu1 35bp First alignment round - Log10 RNA-Seq Coverage JGI MinusCu2 75bp First alignment round - Log10 75 base reads JGI-PlusCu1-75bp JGI-PlusCu2-75bp 0 _ 5 _ 0 _ 5 _ RNA-Seq Coverage JGI PlusCu1 75bp First alignment round - Log10 RNA-Seq Coverage JGI PlusCu2 75bp First alignment round - Log10 0 _ 0155 6 _ JGI 3.1 Genes from GFF as Partly Cleaned by SJC -- do not trust AA translations Solexa Transcriptome Log_10 Unique Coverage All All All with -1 for 0 (M./G./N.) AllAllAll_U Long Reads are not aligned across short exons in ungapped alignments -1 _

Gapped Alignments in Arabidopsis Genome Window Position chr1: ---> A. thaliana Jan. 2004 chr1:64,187-65,986 (1,800 bp) 64300 64400 64500 64600 64700 64800 64900 65000 65100 65200 65300 65400 65500 65600 65700 65800 65900 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=100 Gapped alignments allow us to cover short exons and define splice junctions Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=99 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=98 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=97 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=96 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=95 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=94 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=93 AT1G01140.1 AT1G01140.2 AT1G01140.3 Short Match Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=92 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=91 Gapped alignment from Solexa GAII. Arabidopsis flowers, lane 1, score=90 TAIR7 Annotations (green=protein-coding gene, red=pseudogene/transposon, others=various RNA types) Simple Tandem Repeats by TRF Tandem Repeats Dispersed Repeats Inverted Repeats Perfect Matches to Short Sequence (CG) Your Sequence from Blat Search

Assembly of Mapped Reads Ungapped Alignments Gapped Alignment RNA reads RNA Counts per base Splice junc@ons Assembly Transcript models Transla@on Gene models

Assembly e 1: 2 _ 2 kb 25000 25500 26000 26500 27000 27500 28000 28500 29000 29500 30000 DGE tags Forward strand DGE-Tag-F Our Models DGE-Tag-R (blue) JGI (green) Augustus (red) SMMC37CuP.MPU 0.30103 _ s -0.30103 _ -2 _ 4" 1" 2 _ -1 _ 7 chr1model66 NlaIII restriction sites - Thick restriction site - Thin 17-mer tag forward/reverse strand DGE tags Reverse strand Paired-end 50bp reads from Solexa GAII. Chre4 RNA-Seq models for chr=1 bestgenes v4 annotation Augustus Genes v5.0 from GFF Sabeeha Merchant, Madeli Castruita, 2137, Copper Plus Minimal, Paired end, Unique hits Reads that map to overlapping bases on the genome are concatenated into contigs (blue)

Alterna@ve Splicing The same locus can generate mul@ple transcripts due to alterna@ve splicing, TSS and TTS sites Our Assembly generates mul@ple models that represent different combina@ons of splice sites 13

0.30103 _ -0.30103 _ DGE-Tag-R Segmenta&on -2 _ SMMC37CuP.MPU au5.g1528_t1 au5.g1529_t1 au5.g1530_t1 2 _ Augustus Genes v5.0 from GFF "Sabeeha Merchant, Madeli Castruita, 2137, Copper Plus Minimal, Paired end, Unique hits" -1 _ Regions where two genes overlap show continuous RNA-seq data counts in Discontinuities in read counts may be used to define the boundaries We use Dynamic Programming approaches to efficiently segment count profiles

Refining Splice Junc&ons Splicing Junc@on Mo@fs Donor Acceptor Splice Junction motifs are computed and used to refine ambiguous gapped alignments

Preliminary Results 77% of the bases in our models overlap with Augustus 76% of Augustus models overlap our models Our models are ouen limited by poor RNA-seq coverageof genes which results in the genera@on of gene fragments rather than complete transcripts 16

De Novo Transcriptiome Assembly with ABySS l Transcriptome assembly --May be used when a genome seqeunce is not available --Not biased by errors in genome sequence l De novo assembly ABySS Assembler --Assembly By Short Sequence --Assembly basis: de Bruijn graph

Differences Between Genome and Transcript Assemblies Transcript have a large dynamic range of abundances

De Novo Transcriptiome Assembly with ABySS K-mers K=3 Remove bubbles and branches Output: contigs The k-mers are connected if the overlap is k-1=2 Blue arrows indicate the order of the k-mers and their overlaps

De Novo Transcriptiome Assembly with ABySS Generated RNA-seq library from Arabidopsis flowers Sequenced 20 million reads, 100 bases long Reads were assembled using ABySS

ABySS: Parameter Search for Op@mal k value Assembly Birol et al Arabipodisis txscriptome (14 mill 100mer reads) k-mer value 28 28 61 60 59 58 57 56 55 #contigs 812,300 1,700,453 40,603 42,365 45,491 47,319 51,115 54,541 59,672 #contigs >= 100 95,080 37,352 29,074 29,468 30,153 30,347 30,891 31,548 32,467 #contigs >= N50 N/A 8,621 5,828 5,818 5,805 5,771 5,775 5,780 5,803 median (bp) N/A 170 503 498 485 483 475 463 447 mean (bp) N/A 249 707 703 692 690 680 669 653 N50 (bp) 481 308 1,106 1,116 1,131 1,143 1,148 1,154 1,155 max (bp) 7,386 3,495 8,539 11,911 11,911 11,911 11,911 11,911 8,373 sum (Mbp) 29.0 9 21 21 21 21 21 21 21 Stats for con@gs >= 100bp (except #con@gs) N50: con@gs of size >= N50 make up 50% of assembly s bases Opted for k = 56 because highest N50, max con@g, & total Mbp

ABySS Assembly effienciency improved when adding more reads Sum of bases in contigs Contig length

How Much Coverage do We Need to Generate Full Length Transcripts? 15 counts per base are sufficient to assemble full length transcripts for most genes

ABySS: Assembly Coverage To determine ABySS assembly quality Aligned con@gs to TAIR mrna ref seq w/ BLAST Perl script to calculate coverage: only alignments w/ 98% iden@ty (2% MM & 0 gaps) BLAST of Contigs against Refseqs #Contigs Qual Hits Low Qual Hits No Hits 31,548 27,530 965 3,053 Coverage Total Covered %Covered Queries 31,548 27,530 87% Bases 21,124,429 19,575,778 93% BLAST of Refseqs against Contigs #Refseqs Qual Hits Low Qual Hits No Hits 31,770 19,189 5,146 7,435 Coverage Total Covered %Covered Queries 31,770 19,189 60% Bases 48,103,124 23,494,369 49%

Example ABySS Con&gs In Black ABySS contigs often underestimate TSS and TTS ABySS contigs only capture a single transcript

Able to Predict new genes that are not in Annotated Were able to identify 414 novel transcripts with no matches to existing annotation 90 of these had hits to protein database

Acknowledgements Pellegrini Lab David Casero Diaz-Cano Stephen Douglass Darren Kessner Sabeeha Merchant Lab Steven Karpowicz Madeli Castruita Janeie Kropat Sequencing done at JGI 27