GENOME DUPLICATION AND GENE ANNOTATION: AN EXAMPLE FOR A REFERENCE PLANT SPECIES.

Similar documents
biology Exploiting a Reference Genome in Terms of Duplications: The Network of Paralogs and Single Copy Genes in Arabidopsis thaliana

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

BLAST. Varieties of BLAST

From BBCC Conference 2017 Naples, Italy December 2017

Supplementary Information for: The genome of the extremophile crucifer Thellungiella parvula

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

GEP Annotation Report

Basic Local Alignment Search Tool

Comparing Genomes! Homologies and Families! Sequence Alignments!

Figure S1: Mitochondrial gene map for Pythium ultimum BR144. Arrows indicate transcriptional orientation, clockwise for the outer row and

Genome-wide discovery of G-quadruplex forming sequences and their functional

Example of Function Prediction

Chapter 18 Active Reading Guide Genomes and Their Evolution

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Orthologs Detection and Applications

Genome-wide analysis of the MYB transcription factor superfamily in soybean

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Quantitative Genetics & Evolutionary Genetics

Genomes and Their Evolution

Stage 1: Karyotype Stage 2: Gene content & order Step 3

JUNK DNA: EVIDENCE FOR EVOLUTION OR DESIGN?

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

Genomes Comparision via de Bruijn graphs

Supporting Information

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Genômica comparativa. João Carlos Setubal IQ-USP outubro /5/2012 J. C. Setubal

Comparative Bioinformatics Midterm II Fall 2004

Changes in Twelve Homoeologous Genomic Regions in Soybean following Three Rounds of Polyploidy W

Lineage specific conserved noncoding sequences in plants

Procedure to Create NCBI KOGS

Impact of recurrent gene duplication on adaptation of plant genomes

Genome Sequences and Evolution

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Eukaryotic vs. Prokaryotic genes

Homeotic Genes and Body Patterns

Arabidopsis genomic information for interpreting wheat EST sequences

Orthology Part I: concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

O 3 O 4 O 5. q 3. q 4. Transition

Comparative Genomics II

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions?

Molecular Evolution in the Genomic Era

Visit to BPRC. Data is crucial! Case study: Evolution of AIRE protein 6/7/13

Potato Genome Analysis

2012 Assessment Report. Biology Level 3

BIOINFORMATICS: An Introduction

AP BIOLOGY SUMMER ASSIGNMENT

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

TE content correlates positively with genome size

Bioinformatics: Network Analysis

How to detect paleoploidy?

Genomics and Bioinformatics

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

Browsing Genomic Information with Ensembl Plants

Small RNA in rice genome

Graph Alignment and Biological Networks

3/8/ Complex adaptations. 2. often a novel trait

DNA Technology, Bacteria, Virus and Meiosis Test REVIEW

PLNT2530 (2018) Unit 5 Genomes: Organization and Comparisons

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

EBI web resources II: Ensembl and InterPro

Sabarinath Subramaniam. A dissertation submitted in partial satisfaction of the. requirements for the degree of. Doctor of Philosophy.

8/23/2014. Phylogeny and the Tree of Life

Chapter 18 Lecture. Concepts of Genetics. Tenth Edition. Developmental Genetics

I. Molecules and Cells: Cells are the structural and functional units of life; cellular processes are based on physical and chemical changes.

Homology and Information Gathering and Domain Annotation for Proteins

Origin and diversification of leucine-rich repeat receptor-like protein kinase (LRR-RLK) genes in plants

MiGA: The Microbial Genome Atlas

Chapter 15 Active Reading Guide Regulation of Gene Expression

GENES ENCODING FLOWER- AND ROOT-SPECIFIC FUNCTIONS ARE MORE RESISTANT TO FRACTIONATION THAN GLOBALLY EXPRESSED GENES IN BRASSICA RAPA.

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever.

Research Article Expression Divergence of Tandemly Arrayed Genes in Human and Mouse

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Paleo-evolutionary plasticity of plant disease resistance genes

Chapter 2. Gene Orthology Assessment with OrthologID. Mary Egan, Ernest K. Lee, Joanna C. Chiu, Gloria Coruzzi, and Rob DeSalle.

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Model plants and their Role in genetic manipulation. Mitesh Shrestha

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors

Introduction to Bioinformatics

What Organelle Makes Proteins According To The Instructions Given By Dna

Molecular Evolution & the Origin of Variation

Molecular Evolution & the Origin of Variation

Araport, a community portal for Arabidopsis. Data integration, sharing and reuse. sergio contrino University of Cambridge

Bioinformatics Exercises

Piecing It Together. 1) The envelope contains puzzle pieces for 5 vertebrate embryos in 3 different stages of

BIOINFORMATICS. PILER: identification and classification of genomic repeats. Robert C. Edgar 1* and Eugene W. Myers 2 1 INTRODUCTION

Lecture 20 DNA Repair and Genetic Recombination (Chapter 16 and Chapter 15 Genes X)

Phylogenetic Comparison of F-Box (FBX) Gene Superfamily within the Plant Kingdom Reveals Divergent Evolutionary Histories Indicative of Genomic Drift

Protein function prediction based on sequence analysis

Ensembl Genomes (non-chordates): Quick tour. This quick tour provides a brief introduction to Ensembl Genomes [2], the non-chordate genome browser.

The Contribution of Bioinformatics to Evolutionary Thought

10 Biodiversity Support. AQA Biology. Biodiversity. Specification reference. Learning objectives. Introduction. Background

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/1/18

The Saguaro Genome. Toward the Ecological Genomics of a Sonoran Desert Icon. Dr. Dario Copetti June 30, 2015 STEMAZing workshop TCSS

1 Introduction. Abstract

SoyBase, the USDA-ARS Soybean Genetics and Genomics Database

BIOLOGY YEAR AT A GLANCE RESOURCE ( )

Fractionation resistance of duplicate genes following whole genome duplication in plants as a function of gene ontology category and expression level

Transcription:

GENOME DUPLICATION AND GENE ANNOTATION: AN EXAMPLE FOR A REFERENCE PLANT SPECIES. Alessandra Vigilante, Mara Sangiovanni, Chiara Colantuono, Luigi Frusciante and Maria Luisa Chiusano Dept. of Soil, Plant, Environmental and Animal Production Sciences CAB (Computer Aided Biosciences) group

BACKGROUND: Arabidopsis thaliana as a reference genome Arabidopsis thaliana WAS THE FIRST PLANT GENOME TO BE COMPLETELY SEQUENCED

BACKGROUND: Arabidopsis thaliana as a reference genome A REFERENCE GENOME SHOULD BE: FULLY RELIABLE SAFELY ANNOTATED WELL UNDERSTOOD IN TERMS OF EVOLUTIONARY HISTORY

BACKGROUND: Arabidopsis thaliana as a reference genome A REFERENCE GENOME SHOULD BE: FULLY RELIABLE SAFELY ANNOTATED WELL UNDERSTOOD IN TERMS OF EVOLUTIONARY HISTORY Arabidopsis thaliana GENOME IS: GENE DENSE COMPLEX BECAUSE HIGHLY DUPLICATED AND CLAIMED TO BE ARCHEOPOLYPLOID STILL NOT EXHAUSTIVELY ANNOTATED

BACKGROUND: Arabidopsis thaliana as a reference genome Whole genome duplication events Nature (2007) Jaillon O, et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla.

BACKGROUND: Arabidopsis thaliana as a reference genome V. vinifera A. thaliana Whole genome duplication events Nature (2007) Jaillon O, et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla.

STRATEGY: Unraveling Arabidopsis thaliana genome IT COULD BE USEFUL TO REVIEW THE GENOME IN TERMS OF RELATIONSHIPS BETWEEN DUPLICATED GENES/PARALOGS

STRATEGY: Unraveling Arabidopsis thaliana genome IT COULD BE USEFUL TO REVIEW THE GENOME IN TERMS OF RELATIONSHIPS BETWEEN DUPLICATED GENES/PARALOGS NETWORKS OF PARALOG GENES

Gene Gene duplication analysis: pipeline Arabidopsis thaliana proteome (TAIR9 release) All-against-all BLASTp versus protein-coding genes E<10-10, Rost s formula Singleton genes Duplicated genes Network extraction Networks of duplicated genes

Gene Gene duplication analysis: pipeline Arabidopsis thaliana proteome (TAIR9 release) All-against-all BLASTp versus protein-coding genes E<10-10, Rost s formula Singleton genes Duplicated genes Network extraction Networks of duplicated genes

Gene Gene duplication analysis: pipeline Arabidopsis thaliana proteome (TAIR9 release) All-against-all BLASTp versus protein-coding genes E<10-10, Rost s formula Singleton genes Duplicated genes Network extraction Networks of duplicated genes

Gene Gene duplication analysis: pipeline Arabidopsis thaliana proteome (TAIR9 release) All-against-all BLASTp versus protein-coding genes E<10-10, Rost s formula Singleton genes Duplicated genes Network extraction Networks of duplicated genes

Gene Gene duplication analysis: pipeline A. thaliana GENES AND PARALOGIES ARE REPRESENTED AS AN (UNDIRECTED) GRAPH G(V,E) WHERE: - V ={v 1,..v N } = genes - E ={e 1,..e M } = paralogies

Gene Gene duplication analysis: pipeline A. thaliana GENES AND PARALOGIES ARE REPRESENTED AS AN (UNDIRECTED) GRAPH G(V,E) WHERE: - V ={v 1,..v N } = genes - E ={e 1,..e M } = paralogies

RESULTS: Gene duplication Gene duplication analysis: analysis pipeline Arabidopsis thaliana proteome (TAIR9 release) 27169 All-against-all BLASTp versus protein-coding genes E<10-10, Rost s formula Singleton genes Duplicated genes Networks of duplicated genes

RESULTS: Gene duplication Gene duplication analysis: analysis pipeline Arabidopsis thaliana proteome (TAIR9 release) 27169 All-against-all BLASTp versus protein-coding genes E<10-10, Rost s formula Singleton genes 5326 Duplicated genes 21843 3017 Networks of duplicated genes

RESULTS: Gene duplication Gene duplication analysis: analysis pipeline 1400 1000 600 200 0 2 3-9 10-30 31-207 208-5168 Genes A network contains all and only the genes that share at least one paralogy relationship. Each gene belongs to one and only one network.

RESULTS: Gene duplication Networks analysis: applications pipeline NETWORKS ARE A USEFUL TOOL TO DEEPLY INVESTIGATE RELATIONSHIPS BETWEEN SUBSETS OF GENES SHARING DUPLICATION RELATIONSHIPS NETWORKS CAN BE A USEFUL TOOL TO REFINE GENE ANNOTATION

RESULTS: Networks Gene duplication Networks applications applications analysis: for the pipeline annotation of unknown information The 19% of the proteome is still annotated as unknown protein NETWORKS CAN HELP IN REFINING GENE ANNOTATION

RESULTS: Networks Gene duplication applications Networks applications analysis: for study pipeline of relationships between gene familes

RESULTS: Networks Gene duplication applications Networks applications analysis: for study pipeline of relationships between gene familes Networks can be a useful tool for highlighting evolutionary relationships between different gene families

RESULTS: Gene Networks duplication of applications analysis: duplicated pipeline genes

RESULTS: Gene Networks duplication of applications analysis: duplicated pipeline genes

RESULTS: Gene Networks duplication of applications analysis: duplicated pipeline genes

Gene RESULTS: duplication Networks 2-gene applications analysis: networks pipeline

Gene RESULTS: duplication Networks 2-gene applications analysis: networks pipeline Arabidopsis thaliana chromosomes

Gene RESULTS: duplication Networks 2-gene applications analysis: networks pipeline All protein-coding genes

Gene RESULTS: duplication Networks 2-gene applications analysis: networks pipeline All genes involved in two-gene networks

RESULTS: Gene 2-gene duplication Networks networks 2-gene networks applications analysis: and singleton pipeline genes AT LEAST 5% OF THE ENTIRE PROTEOME HAS A SINGLE PARALOGY RELATIONSHIP AT LEAST 20% OF THE ENTIRE PROTEOME IS A SINGLETON ABOUT ONE QUARTER OF THE ENTIRE PROTEOME HAS ZERO OR AT MOST ONE PARALOGY RELATIONSHIP

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline WHAT ABOUT SINGLETON GENES?

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline Singleton genes not having protein-coding paralogs Blastn mrna sequences against ESTs Singleton genes validated or not validated by ESTs 3588

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline Singleton genes not having protein-coding paralogs 3588 Comparative analysis (Biomart Ensembl) Singleton s orthologs in different species (O.sativa, V.vinifera, S.bicolor, P.trichocarpa)

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline SINGLETON GENES ANALYSIS RESULTS 3588 singleton genes

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline SINGLETON GENES ANALYSIS RESULTS 3588 singleton genes 2516 confirmed by ESTs 1072 Not confirmed by ESTs

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline SINGLETON GENES ANALYSIS RESULTS 3588 singleton genes 2516 confirmed by ESTs 1072 Not confirmed by ESTs 1072 Present only in A. thaliana

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline SINGLETON GENES ANALYSIS RESULTS 3588 singleton genes 2516 confirmed by ESTs 1072 Not confirmed by ESTs 481 Present only in A. thaliana 1072 Present only in A. thaliana

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline SINGLETON GENES ANALYSIS RESULTS 2516 confirmed by ESTs

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline SINGLETON GENES ANALYSIS RESULTS 481 out of 2516 present only in A.thaliana

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline SINGLETON GENES ANALYSIS RESULTS 1110 Not confirmed by ESTs and present only in A.thaliana

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline Singleton genes not having protein-coding paralogs 3588 BLASTn vs: Pseudogenes Transposons Other RNA Intergenic regions Singleton genes not having protein-coding/ non protein-coding paralogs and any significant hit in the genome 250

RESULTS: Gene duplication Networks Singleton 2-gene networks applications analysis: genes analysis pipeline Singleton genes not having protein-coding paralogs 3588 Blastx protein sequences against transcript Searching for ORF annotation errors MT1A (METALLOTHIONEIN 1A) MT1B (METALLOTHIONEIN 1B) Gene family: metallothionein binds to and detoxifies excess copper and other metals, limiting oxidative damage. copper ion binding / metal ion binding

CONCLUSIONS This work was helpful to highlight several crucial issues related to gene annotation Networks of paralogs -Are useful for the investigation of the gene annotation of reference and novel genomes -Provide a useful tool for deeper investigations of gene families -Reveal intriguing issues on the genome organization of Arabidopsis thaliana

FUNDING & ACKNOWLEDGEMENTS Prof. Luigi Frusciante Maria Luisa Chiusano PRINCIPAL INVESTIGATOR Nunzio D Agostino BIOLOGIST Alessandra Traini PHYSICIST Miriam Di Filippo BIOLOGIST Alessandra Vigilante BIOLOGIST Mara Sangiovanni Chiara Colantuono COMPUTER SCIENTIST STUDENT Mario Aversano TECHNICAL ASSISTANT

THANK YOU FOR THE ATTENTION! Reggia di Portici, XVIII century