Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Similar documents
GEP Annotation Report

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

Comparing whole genomes

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

RNA Processing: Eukaryotic mrnas

Sequence analysis and comparison

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

SUPPLEMENTARY INFORMATION

Annotation of Drosophila grimashawi Contig12

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

The Developmental Transcriptome of the Mosquito Aedes aegypti, an invasive species and major arbovirus vector.

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Synteny Portal Documentation

Small RNA in rice genome

MegAlign Pro Pairwise Alignment Tutorials

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

CS612 - Algorithms in Bioinformatics

HMMs and biological sequence analysis

BLAST. Varieties of BLAST

RGP finder: prediction of Genomic Islands

ATLAS of Biochemistry

COLE TRAPNELL, BRIAN A WILLIAMS, GEO PERTEA, ALI MORTAZAVI, GORDON KWAN, MARIJKE J VAN BAREN, STEVEN L SALZBERG, BARBARA J WOLD, AND LIOR PACHTER

Whole Genome Alignments and Synteny Maps

Introduction to Bioinformatics Online Course: IBT

Mitochondrial Genome Annotation

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

A Browser for Pig Genome Data

RNA- seq read mapping

Browsing Genomic Information with Ensembl Plants

Предсказание и анализ промотерных последовательностей. Татьяна Татаринова

Cross Discipline Analysis made possible with Data Pipelining. J.R. Tozer SciTegic

SUPPLEMENTARY INFORMATION

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative Bioinformatics Midterm II Fall 2004

Bioinformatics and BLAST

Markov Models & DNA Sequence Evolution

Isoform discovery and quantification from RNA-Seq data

objective functions...

GCD3033:Cell Biology. Transcription

TUTORIAL EXERCISES WITH ANSWERS

BME 5742 Biosystems Modeling and Control

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week:

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

Large-Scale Genomic Surveys

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering

Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins

Abstract. CROMER, JOHN PATRICK. Genome Annotation of Meloidogyne hapla. (Under the direction of David Bird and Charles Opperman).

Eukaryotic vs. Prokaryotic genes

Going Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

1. In most cases, genes code for and it is that

Genome-wide analysis of the MYB transcription factor superfamily in soybean

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

1 Abstract. 2 Introduction. 3 Requirements. 4 Procedure

Lecture 5: September Time Complexity Analysis of Local Alignment

Predicting Protein Functions and Domain Interactions from Protein Interactions

GENE ONTOLOGY (GO) Wilver Martínez Martínez Giovanny Silva Rincón

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Supporting Information

training workshop 2015

EBI web resources II: Ensembl and InterPro

Computational Biology and Chemistry

ESPRIT Feature. Innovation with Integrity. Particle detection and chemical classification EDS

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula

Tandem repeat 16,225 20,284. 0kb 5kb 10kb 15kb 20kb 25kb 30kb 35kb

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Bioinformatics Chapter 1. Introduction

Our typical RNA quantification pipeline

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

Introduction to Sequence Alignment. Manpreet S. Katari

Identification of 3 0 gene ends using transcriptional and genomic conservation across vertebrates

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

Session 5: Phylogenomics

The Schrödinger KNIME extensions

Extensive identification and analysis of conserved small ORFs in animals

BIOINFORMATICS: An Introduction

Multiple Choice Review- Eukaryotic Gene Expression

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

Markov Chains and Hidden Markov Models. = stochastic, generative models

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

O 3 O 4 O 5. q 3. q 4. Transition

Protein Interaction Mapping: Use of Osprey to map Survival of Motor Neuron Protein interactions

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00.

DNA Feature Sensors. B. Majoros

Multiple Whole Genome Alignment

Matrices, Vector Spaces, and Information Retrieval

Sequence Alignment Techniques and Their Uses

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Table of content. Understanding workflow automation - Making the right choice Creating a workflow...05

L3.1: Circuits: Introduction to Transcription Networks. Cellular Design Principles Prof. Jenna Rickus

Towards More Effective Formulations of the Genome Assembly Problem

Robert Edgar. Independent scientist

Lecture 18 June 2 nd, Gene Expression Regulation Mutations

Transcription:

-2 Transcript Alignment Assembly and Automated Gene Structure Improvements Using PASA-2 Mathangi Thiagarajan mathangi@jcvi.org Rice Genome Annotation Workshop May 23rd, 2007

About PASA PASA is an open source free to download software program written by Brian Haas (bhaas@jcvi.org) Reference :Its original application is described in: Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res, 31, 5654-5666.

Topics Outline Overview of the PASA Pipeline Alignment Assembly Algorithm Annotation comparison

FL-cDNAs and ESTs Gold standard for gene structure resolution Introns and exons via spliced alignment Direct evidence for: Alternative splicing Untranslated regions (UTRs) Polyadenylation sites

The PASA Pipeline Automate incorporation of transcript alignments into gene structure annotations It was originally developed to refine gene structures in Arabidopsis as part of our Arabidopsis re-annotation effort. Since that time, we ve expanded the pipeline and applied it to a range of other organisms at TIGR, now with a special focus on Rice.

Influxes of mrna Sequences After Initial Genome Releases 1000000 human mouse Drosophila Arabidopsis 100000 Feb.2001 Dec.2002 10000 Mar.2000 1000 Dec.2000 1999 2000 2001 2002 2003 2004 2005 2006

Additionally Found Uses of PASA Automated generation of training sets for Gene Finders (Aedes, Aspergillus, Tetrahymena) Evaluation of EST libraries (Tetrahymena) examine redundancy within EST library selection of clones for full-length sequencing Transitive gene structure annotation for closely related species (Aspergillus sp.) Comparing different annotation methods on the same contigs (Plasmodium vivax) Cataloging polya sites for more detailed studies (Arabidopsis, Rice)

The PASA Pipeline [at a glance] Align transcripts to genome Assemble the alignments PASA: Program to Assemble Spliced Alignments Compare alignment assemblies to existing annotations, suggest updates

Transcript Sequences Seqclean Align to Genome Cluster overlapping alignments PASA Pipeline Seqclean (TIGR Gene Indices) vector removal poly-a identification, stripping trash low quality seqs PASA alignment assembly subcluster PASA assemblies Compare to annotation Update annotation

Transcript Sequences PASA Pipeline Seqclean BLAT and sim4 spliced alignments Align to Genome Cluster overlapping alignments PASA alignment assembly subcluster PASA assemblies Compare to annotation Update annotation Valid alignment criteria: min 95% Identity min 90% transcript length aligned (both configurable parameters) consensus splice sites (GT,GC) donors AG acceptor Assign Transcribed Orientations Splice sites Polyadenylation sites

Transcript Sequences PASA Pipeline Seqclean BLAT and sim4 spliced alignments Align to Genome Cluster overlapping alignments PASA alignment assembly subcluster PASA assemblies Compare to annotation Update annotation

Transcript Sequences PASA Pipeline Seqclean BLAT and sim4 spliced alignments Align to Genome Cluster overlapping alignments PASA alignment assembly subcluster PASA assemblies Compare to annotation Update annotation

Transcript Sequences PASA Pipeline Seqclean BLAT and sim4 spliced alignments Align to Genome Cluster overlapping alignments PASA alignment assembly subcluster PASA assemblies Compare to annotation Update annotation

Transcript Sequences PASA Pipeline Seqclean Align to Genome Cluster overlapping alignments PASA alignment assembly subcluster PASA assemblies Compare to annotation Update annotation Annotation Comparison FL-cDNAs and ESTs treated separately with different rules for incorporation Annotation Updates -exon modifications -alt splice isoform additions -gene merges -gene splits -new genes

Alignment Assembly Maximize evidence supporting gene structures. (Maximum evidence) ~ (Maximum # alignments) Goal: find maximal assembly of compatible alignments.

Alignment Assembly using PASA: Program to Assemble Spliced Alignments 5 3 Assemblies Maximally Assemble Compatible Alignments

Alignment Assembly using PASA: Program to Assemble Spliced Alignments 5 3 Assemblies Maximally Assemble Compatible Alignments

Alignment Assembly using PASA: Program to Assemble Spliced Alignments 5 3 Assemblies Maximally Assemble Compatible Alignments

PASA Algorithm Containments preclude the simple chaining of compatible alignments (B is contained within A) A B C A ~ B B ~ C A!~ C ~ :compatible!~ :not compatible

PASA Algorithm Finding the Single Maximal Assembly Sort list of alignments by left-most coordinate Determine pairwise containments C a = # alignments contained in a, including a Determine pairwise compatibilities Chain compatible alignments, summing unique containments. {Create Left Path Graph, chain compatible alignments from left to right} L a = maximal chain of alignments originating from the left of alignment a and ending at a. Solve by dynamic programming Find maximal assembly as the chain with maximal # alignments.

PASA Algorithm Find Maximal Assemblies for Missing Alignments (Alt Spliced Isoforms) Create reciprocal {right path} graph {chain compatible alignments from right to left} R a = maximal chain of alignments originating from the Right of alignment a and ending at a. For each missing alignment a, find the maximal assembly containing a (restated as sum of left and right paths)

Annotation Comparison The PASA Pipeline [Capabilities] Then (NAR, 2003) : Update gene structures: - Changes in introns and exons - UTR additions Model additional gene structures - Alternative splicing isoforms - New gene models Now, PASA-2 (above plus following enhancements) : Gene merging Gene splitting Antisense classification Polyadenylation sites

Incorporation of PASA assemblies into the annotation FL-assemblies contain at least one FL-cDNA, expected to encode all exons, complete protein, possibly UTRs. non-fl-assemblies encode part of a gene: - part of one or more exons - potentially UTRs.

Full-length cdnas Provide Complete Gene Structures (hence, full-length Assemblies too!) Genomic DNA Gapped Alignment Poly-A site Full-length cdna AAAAAAAAAAAAAA cdna-genome spliced alignment ORF reconstruction based on the joined exons. UTRs identified. Automated process

FL-assembly-based updates Existing model: UTR Different Introns/Exons UTR FL-assembly-based model: ::FL-assembly-based model replaces the existing model = CDS = cdna

Non-FL-assembly-based updates Existing model: Non-FL-assembly: *stitching* Stitched product replaces existing model

Alternative Splicing (incompatible alignment assemblies) Sets of mutually incompatible alignment assemblies Multiple FL-assemblies FL-assembly(s) and non-fl-assembly(s) Non-FL-assemblies (*pre-existing gene model required)

Minimize Corruption or Pollution of Existing Annotations Requirements of a FL-assembly Min ORF size requirement - MIN_PERCENT_PROT_CODING (ie. 40%) - MIN_FL_ORF_SIZE (ie. 100 aa) Max # UTR exons (ie. 2 or 3) - MAX_UTR_EXONS Requirements of an annotation update Compared to existing model, must pass validation tests: - Length test (ie. must encode a protein at least 70% the length of the current one) - *Maybe trust FL assemblies more than ESTs; can set stringencies separately: - MIN_PERCENT_LENGTH_FL_COMPARE (involving FL-assemblies) - MIN_PERCENT_LENGTH_NONFL_COMPARE (involving non-fl assemblies) - Homology test [Fasta Alignment] (ie. 70% identity, 70% length) - MIN_PERID_PROT_COMPARE (ie. 70% identity) - MIN_PERCENT_ALIGN_LENGTH (ie. 70% of the shorter protein length) * all user-configurable parameters, option names shown in italics

Enhancements: Gene Merging (FL-cDNA) gene1 FL-assembly based gene gene2 FL-ORF_SPAN If FL-ORF_SPAN overlaps both gene1 and gene2 [,... genex] by at least MIN_PERCENT_OVERLAP_GENE_REPLACE, gene1 and gene2 [,... genex] are to be merged and replaced by the FL-assembly based gene.

Enhancements: Gene Merging (non-fl) gene1 nonfl assembly reconstructed gene gene2 stitch into overlapping models ORF_SPAN Same rule as before, using ORF_SPAN and MIN_PERCENT_OVERLAP_GENE_REPLACE

Enhancements: Gene Existing gene Splitting FL-assemblies Requires multiple FL-assemblies from distinct sub clusters map to the same gene have the same transcribed orientation, and the min and max of the new ORFs must cover at least MIN_PERCENT_OVERLAP_GENE_REPLACE of the gene to be split.

Gene Merging and Gene Splitting Homology (used loosely) between the existing gene and the replacement is not required. Only require that the locus of interest continues to be covered by ORFs. Why? Merged and split genes may appear very different from the existing [predicted] gene. One of the split products may look quite similar to the preexisting gene, but the other may not. Our experience is that the existing methodology of splitting and merging works quite well, and we haven t needed to explore additional methods.

Want more aggressive updates? Besides merging and splitting, individual gene updates must pass the homology test. Failures require manual inspection. But, many that fail homology may still provide reasonable, and improved gene structure updates. Option (flag): STOMP_HIGH_PERCENTAGE_OVERLAPPING_GENE If update fails the homology test, consider the ORF_SPAN alone. if ORF_SPAN > MIN_PERCENT_OVERLAP_GENE_REPLACE, allow update to occur. Called STOMPing

Trusting the FL-Status Ideally, FL-transcripts are full length! existing gene FL-assembly But, often: existing gene FL-assembly Solution: If a FL-assembly is compatible with an existing gene annotation, treat it as non-fl update * option TRUST_FL_STATUS, by default, disabled.

Example Application of PASA to Rice

Results from Annotation Comparison (Counting PASA assemblies) cgi-bin/status_report.cgi

Gene Comparison Summary (Counting Genes) cgi-bin/status_report.cgi

Gene Structure Updates Summary

Examining Updates (clicking any link in the previous report)

Assembly Report Page

Examples of Classified Updates FL adds/extends UTRs FL extends protein FL updates structure (passes homology test) FL updates structure (fails homology, passes ORF span)

FL merges genes FL split gene FL novel gene

EST extends UTRs EST extends protein EST updates structure (passes homology test)

EST updates structure (fails homology test, passes ORF span) EST merges multiple genes

A tool for Studying Alternative Splicing Evidence for >5000 genes alternatively spliced Unspliced Introns: 45% Alt donor/acceptor: 32% Start/end in intron: 34% Exon skipping: 8.4% Alternate exon: 7.4% *categories overlap due to combinations Distribution of splicing variations is similar to those described in Arabidopsis.

PASA Pipeline Application Framework UI Tier Web browser Shell Terminal App Tier CGI scripts, run from Apache Perl Scripts, C++ program Data Tier MySQL Database Text files including Fasta formatted sequence databases and config files.

PASA Documentation http://pasa.sf.net

Obtaining PASA http://www.tigr.org/software

QUESTIONS?