Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

What is Annotation?
Reconstruction of gene structure within the genome.
[Figure: genome browser view of log10 RNA-seq coverage tracks (MinusCu1 35bp, MinusCu2 75bp, PlusCu1 75bp, PlusCu2 75bp) above JGI 3.1 gene models. The gene structure is labeled with: transcriptional start site, UTR, coding exon, intron, transcriptional end site.]

How are Genomes Annotated?
Traditional approaches use:
Information from expressed sequence tags (ESTs)
Conservation across organisms
Prior knowledge of sequence motifs (e.g. splice junctions)
These approaches do not take advantage of data generated by next-generation sequencers.
Challenge: develop data-driven annotation using RNA-seq data.

Whole-genome Transcriptome Analysis (WTA)
[Figure: library preparation workflow. Credit: Christian Haudenschild, Illumina]
10 µg total RNA
10 ng/µl mRNA (poly-A tailed)
Random fragmentation
Random hexamer primed 1st strand cDNA synthesis
2nd strand cDNA synthesis
End-repairing and adaptor ligation
Size selection

Plant Genomes
Chlamydomonas is a model alga with a sequenced genome and a still incomplete annotation. It is currently being used for biodiesel studies.
Arabidopsis is a model plant with a very high quality genome and a nearly complete annotation. It is a genetically tractable organism.

Limitations of Current Chlamydomonas Annotation from Augustus Models
Even at high coverage, more than 20% of predicted genes have no RNA-seq evidence.

Two Approaches for Annotating Genomes using RNA-seq
1. First approach: align reads to the genome, then concatenate reads that map to overlapping bases on the genome.
2. Second approach: assemble reads directly before mapping to the genome, using assembly tools such as ABySS.

Method I - Alignment of Reads to Genome
First perform ungapped alignments using a fast aligner (e.g. Novoalign or Bowtie).
The reads that do not map are then mapped using a gapped alignment protocol (e.g. BLAT or TopHat).
The gaps identify splice junctions.
We compute the number of reads that align to each base in the genome.
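
The last two points can be illustrated with a small sketch. It assumes alignments have already been parsed into lists of (start, end) blocks in 0-based half-open coordinates, which is a convention chosen here for illustration and not something the slides specify. The function names and the toy reads are hypothetical.

```python
from itertools import accumulate

def per_base_coverage(genome_length, alignments):
    """Count how many reads cover each base.

    alignments: one entry per read; each entry is a list of (start, end)
    blocks, 0-based and half-open. An ungapped read has one block, a
    gapped (spliced) read has several.
    """
    diff = [0] * (genome_length + 1)
    for blocks in alignments:
        for start, end in blocks:
            diff[start] += 1   # coverage rises where a block begins
            diff[end] -= 1     # and falls just past where it ends
    return list(accumulate(diff[:-1]))

def splice_junctions(alignments):
    """Return (intron_start, intron_end) pairs implied by gaps between blocks."""
    junctions = []
    for blocks in alignments:
        for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
            junctions.append((left_end, right_start))
    return junctions

# Toy data (hypothetical): two ungapped reads and one gapped read.
reads = [
    [(0, 5)],
    [(3, 8)],
    [(6, 8), (12, 15)],  # the gap 8..12 is a candidate intron
]
coverage = per_base_coverage(16, reads)
junctions = splice_junctions(reads)
```

In practice the blocks would come from the aligner output (for example the block fields of a BLAT or TopHat record); parsing is left out to keep the sketch short.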

Read Counts Across Chlamydomonas Genome
[Figure: genome browser view, C. reinhardtii May 2006 assembly, scaffold_1:99,753-102,790 (3,038 bp). Tracks show log10 RNA-seq coverage from the first alignment round for JGI MinusCu1 35bp (35 base reads), JGI MinusCu2 75bp, JGI PlusCu1 75bp and JGI PlusCu2 75bp (75 base reads), above the JGI 3.1 gene models.]
Long reads are not aligned across short exons in ungapped alignments.

Gapped Alignments in Arabidopsis Genome
[Figure: genome browser view, A. thaliana Jan. 2004 assembly, chr1:64,187-65,986 (1,800 bp). Tracks show gapped alignments from Solexa GAII, Arabidopsis flowers, lane 1 (scores 90 to 100), above the TAIR7 annotations for AT1G01140.1, AT1G01140.2 and AT1G01140.3.]
Gapped alignments allow us to cover short exons and define splice junctions.

Assembly of Mapped Reads
[Flow diagram] RNA reads -> ungapped alignments and gapped alignment -> RNA counts per base and splice junctions -> assembly -> transcript models -> translation -> gene models

Assembly
[Figure: genome browser view of a region of Chlamydomonas chromosome 1. Tracks: DGE tags on the forward and reverse strands (17-mer tags at NlaIII restriction sites), our RNA-seq models (blue), JGI models (green), Augustus v5.0 genes (red), and coverage from paired-end 50bp reads from Solexa GAII (Sabeeha Merchant, Madeli Castruita, 2137, Copper Plus Minimal, paired end, unique hits).]
Reads that map to overlapping bases on the genome are concatenated into contigs (blue).
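
The concatenation step can be sketched as an interval merge. This is a minimal illustration, assuming reads are given as (start, end) genomic intervals in 0-based half-open coordinates. It ignores strand and splice junctions, which the actual pipeline would need to handle, and the function name is hypothetical.

```python
def merge_into_contigs(intervals):
    """Merge read intervals that share at least one base into contigs.

    intervals: iterable of (start, end), 0-based half-open.
    Returns a list of merged (start, end) contigs sorted by position.
    """
    contigs = []
    for start, end in sorted(intervals):
        if contigs and start < contigs[-1][1]:
            # The read overlaps the current contig: extend it if needed.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # No shared base with the previous contig: start a new one.
            contigs.append([start, end])
    return [tuple(c) for c in contigs]

# Toy data (hypothetical read positions).
read_intervals = [(10, 20), (15, 30), (30, 40), (50, 60), (55, 58)]
contigs = merge_into_contigs(read_intervals)
```

Note that (10, 30) and (30, 40) stay separate here because they are adjacent but share no base; whether adjacent reads should be joined is a design choice the slides do not address.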

Alternative Splicing
The same locus can generate multiple transcripts due to alternative splicing and alternative transcription start sites (TSS) and transcription termination sites (TTS).
Our assembly generates multiple models that represent different combinations of splice sites.

Segmentation
[Figure: genome browser view with reverse-strand DGE tags, RNA-seq counts (Sabeeha Merchant, Madeli Castruita, 2137, Copper Plus Minimal, paired end, unique hits) and Augustus v5.0 genes au5.g1528_t1, au5.g1529_t1 and au5.g1530_t1.]
Regions where two genes overlap show continuous RNA-seq read counts.
Discontinuities in read counts may be used to define the boundaries.
We use dynamic programming approaches to efficiently segment count profiles.
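
The slides do not say which dynamic programming formulation is used. As one possible illustration, the sketch below splits a count profile into a fixed number of segments so that the summed squared deviation from each segment mean is minimal. The cost function, the fixed segment count and the function name are all assumptions made for this example.

```python
def segment_profile(counts, n_segments):
    """Split counts into n_segments contiguous pieces with minimal
    within-segment squared error. Requires len(counts) >= n_segments.
    Returns a list of (start, end) half-open index pairs.
    """
    n = len(counts)
    # Prefix sums let us evaluate any segment cost in constant time.
    s1, s2 = [0.0], [0.0]
    for c in counts:
        s1.append(s1[-1] + c)
        s2.append(s2[-1] + c * c)

    def cost(i, j):
        # Squared error of counts[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # Follow the back-pointers from the end of the profile.
    segments, j = [], n
    for k in range(n_segments, 0, -1):
        i = back[k][j]
        segments.append((i, j))
        j = i
    return segments[::-1]

# Toy profile (hypothetical): two expressed regions around a low-count gap.
profile = [10, 11, 9, 10, 1, 0, 1, 30, 31, 29]
segments = segment_profile(profile, 3)
```

This runs in time proportional to the number of segments times the square of the profile length, so a real genome-scale profile would need to be processed in windows or with a faster variant.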

Refining Splice Junctions
[Figure: splicing junction motifs for the donor and acceptor sites.]
Splice junction motifs are computed and used to refine ambiguous gapped alignments.
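
One common way to compute and apply such a motif is a position weight matrix built from confidently aligned junctions. The sketch below is an illustration under that assumption and not necessarily the method used here. The training sequences, the uniform background of 0.25 per base and the function names are hypothetical.

```python
import math
from collections import Counter

def build_pwm(sites, pseudocount=1.0):
    """Build a log-odds position weight matrix from equal-length sites.

    Scores are log2(observed frequency / 0.25), i.e. relative to a
    uniform base composition (an assumption made for simplicity).
    """
    pwm = []
    total = len(sites) + 4 * pseudocount
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        pwm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm

def motif_score(pwm, seq):
    """Sum the per-position log-odds scores for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, seq))

def best_candidate(pwm, candidates):
    """Pick the candidate junction sequence that best matches the motif."""
    return max(candidates, key=lambda seq: motif_score(pwm, seq))

# Toy donor sites (hypothetical), taken from the intron side of the junction.
donor_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
donor_pwm = build_pwm(donor_sites)

# An ambiguous gapped alignment could place the junction at several offsets;
# each offset gives a different candidate donor sequence.
candidates = ["CTAAGG", "GTAAGT", "AAGTAA"]
chosen = best_candidate(donor_pwm, candidates)
```

The same construction applies to acceptor sites; a refinement step could combine the donor and acceptor scores for each candidate placement of the gap.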

Preliminary Results
77% of the bases in our models overlap with Augustus.
76% of Augustus models overlap our models.
Our models are often limited by poor RNA-seq coverage of genes, which results in the generation of gene fragments rather than complete transcripts.

De Novo Transcriptome Assembly with ABySS
Transcriptome assembly
-- May be used when a genome sequence is not available
-- Not biased by errors in the genome sequence
De novo assembly with the ABySS assembler
-- Assembly By Short Sequences
-- Assembly basis: de Bruijn graph

Differences Between Genome and Transcript Assemblies
Transcripts have a large dynamic range of abundances.

De Novo Transcriptome Assembly with ABySS
[Figure: reads broken into k-mers with k = 3. Blue arrows indicate the order of the k-mers and their overlaps.]
The k-mers are connected if the overlap is k-1 = 2.
Remove bubbles and branches.
Output: contigs.
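
The k = 3 example can be reproduced with a toy graph in which each k-mer is a node and two k-mers are connected when the last k-1 bases of one equal the first k-1 bases of the other. This is only a sketch of the idea: it does not remove bubbles, it ignores self-overlapping k-mers, and it is not how ABySS is implemented internally. The reads and function names are hypothetical.

```python
def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_graph(reads, k):
    """Map each k-mer to the k-mers that overlap it by k-1 bases."""
    nodes = set()
    for read in reads:
        nodes.update(kmers(read, k))
    return {
        n: sorted(m for m in nodes if m != n and n[1:] == m[:-1])
        for n in nodes
    }

def assemble_unbranched(edges):
    """Join k-mers along paths that have no branches into contigs."""
    preds = {n: [] for n in edges}
    for n, successors in edges.items():
        for m in successors:
            preds[m].append(n)
    contigs = []
    for start in sorted(edges):
        # Skip k-mers that sit in the middle of an unbranched path.
        if len(preds[start]) == 1 and len(edges[preds[start][0]]) == 1:
            continue
        contig, node = start, start
        while len(edges[node]) == 1 and len(preds[edges[node][0]]) == 1:
            node = edges[node][0]
            contig += node[-1]  # each step adds one new base
        contigs.append(contig)
    return contigs

# Two overlapping toy reads (hypothetical) from the sequence ATGCCAG.
toy_reads = ["ATGCC", "GCCAG"]
graph = build_graph(toy_reads, 3)
assembled = assemble_unbranched(graph)
```

With larger k, fewer k-mers are shared by chance between unrelated sequences, which is one reason the choice of k affects contig number and length.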

De Novo Transcriptome Assembly with ABySS
Generated an RNA-seq library from Arabidopsis flowers.
Sequenced 20 million reads, 100 bases long.
Reads were assembled using ABySS.

ABySS: Parameter Search for Optimal k Value

Assembly           Birol et al  Arabidopsis transcriptome (14 million 100-mer reads)
k-mer value        28           28         61      60      59      58      57      56      55
#contigs           812,300      1,700,453  40,603  42,365  45,491  47,319  51,115  54,541  59,672
#contigs >= 100    95,080       37,352     29,074  29,468  30,153  30,347  30,891  31,548  32,467
#contigs >= N50    N/A          8,621      5,828   5,818   5,805   5,771   5,775   5,780   5,803
median (bp)        N/A          170        503     498     485     483     475     463     447
mean (bp)          N/A          249        707     703     692     690     680     669     653
N50 (bp)           481          308        1,106   1,116   1,131   1,143   1,148   1,154   1,155
max (bp)           7,386        3,495      8,539   11,911  11,911  11,911  11,911  11,911  8,373
sum (Mbp)          29.0         9          21      21      21      21      21      21      21

Stats are for contigs >= 100 bp (except #contigs).
N50: contigs of size >= N50 make up 50% of the assembly's bases.
Opted for k = 56 because of highest N50, max contig, and total Mbp.
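
The N50 definition on this slide can be written out directly. The sketch below follows the slide in restricting the statistic to contigs of at least 100 bp; the function name and the toy lengths are hypothetical.

```python
def n50(lengths, min_length=100):
    """Return the N50 of the contigs that are at least min_length long.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of the assembled bases.
    """
    kept = sorted((x for x in lengths if x >= min_length), reverse=True)
    total = sum(kept)
    running = 0
    for length in kept:
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contig passed the length filter

# Toy contig lengths (hypothetical); the 50 bp contig is filtered out.
toy_lengths = [100, 200, 300, 400, 500, 50]
toy_n50 = n50(toy_lengths)
```

Running the same function on the contig lengths of each assembly would give the N50 row used to compare k values.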

ABySS assembly efficiency improved when adding more reads.
[Figure: sum of bases in contigs plotted against contig length.]

How Much Coverage do We Need to Generate Full Length Transcripts? 15 counts per base are sufficient to assemble full length transcripts for most genes

ABySS: Assembly Coverage
To determine ABySS assembly quality:
Aligned contigs to TAIR mRNA reference sequences with BLAST.
Used a Perl script to calculate coverage: only alignments with 98% identity (2% mismatches and 0 gaps) are counted.

BLAST of contigs against RefSeqs
#Contigs  Qual Hits  Low Qual Hits  No Hits
31,548    27,530     965            3,053

Coverage  Total       Covered     %Covered
Queries   31,548      27,530      87%
Bases     21,124,429  19,575,778  93%

BLAST of RefSeqs against contigs
#Refseqs  Qual Hits  Low Qual Hits  No Hits
31,770    19,189     5,146          7,435

Coverage  Total       Covered     %Covered
Queries   31,770      19,189      60%
Bases     48,103,124  23,494,369  49%
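
The coverage calculation was done with a Perl script that is not shown. The Python sketch below illustrates one way such a filter could work, assuming BLAST was run with the standard 12-column tabular output (query, subject, percent identity, length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The threshold handling, the sequence identifiers and the function names are assumptions made for this example.

```python
from collections import defaultdict

def covered_bases(blast_lines, min_identity=98.0):
    """Count query bases covered by ungapped hits at or above min_identity.

    Overlapping hits on the same query are merged so that no base is
    counted twice. BLAST coordinates are 1-based and inclusive.
    """
    spans = defaultdict(list)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        identity, gap_opens = float(fields[2]), int(fields[5])
        q_start, q_end = sorted((int(fields[6]), int(fields[7])))
        if identity >= min_identity and gap_opens == 0:
            spans[query].append((q_start, q_end))

    covered = {}
    for query, intervals in spans.items():
        total, cur_start, cur_end = 0, None, None
        for start, end in sorted(intervals):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        covered[query] = total + cur_end - cur_start + 1
    return covered

def percent_covered(covered, total):
    """Coverage as a whole-number percentage."""
    return round(100 * covered / total)

# Toy hits (hypothetical identifiers and values).
hits = [
    "contig1\trefA\t100.00\t50\t0\t0\t1\t50\t101\t150\t1e-20\t93.0",
    "contig1\trefA\t98.00\t50\t1\t0\t31\t80\t131\t180\t1e-18\t88.0",
    "contig1\trefB\t95.00\t40\t2\t0\t90\t129\t5\t44\t1e-10\t60.0",
    "contig2\trefC\t99.00\t100\t0\t1\t1\t100\t1\t101\t1e-40\t170.0",
]
toy_coverage = covered_bases(hits)
```

Applying `percent_covered` to the totals in the tables above reproduces the reported percentages, for example `percent_covered(19575778, 21124429)` for the bases covered in the contigs.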

Example ABySS Contigs (in Black)
ABySS contigs often underestimate TSS and TTS.
ABySS contigs only capture a single transcript.

Able to Predict New Genes That Are Not in the Annotation
We were able to identify 414 novel transcripts with no matches to the existing annotation.
90 of these had hits to a protein database.

Acknowledgements
Pellegrini Lab: David Casero Diaz-Cano, Stephen Douglass, Darren Kessner
Sabeeha Merchant Lab: Steven Karpowicz, Madeli Castruita, Janette Kropat
Sequencing done at JGI