Comparative Features of Multicellular Eukaryotic Genomes

Similar documents
Population transcriptomics uncovers the regulation of gene. expression variation in adaptation to changing environment

GO ID GO term Number of members GO: translation 225 GO: nucleosome 50 GO: calcium ion binding 76 GO: structural

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors

UNIT 6 PART 3 *REGULATION USING OPERONS* Hillis Textbook, CH 11

S1 Gene ontology (GO) analysis of the network alignment results

EBI web resources II: Ensembl and InterPro

CSCE555 Bioinformatics. Protein Function Annotation

Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1

Transcriptome analysis of leaf tissue from Bermudagrass (Cynodon dactylon) using a normalised cdna library

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

Amino Acid Structures from Klug & Cummings. Bioinformatics (Lec 12)

Procedure to Create NCBI KOGS

GCD3033:Cell Biology. Transcription

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

SUPPLEMENTARY MATERIAL. The table lists the 437 proteins proposed to be myristoylated in A. thaliana. The AGI

Lecture 10: Cyclins, cyclin kinases and cell division

CATEGORY a TERM COUNT b P VALUE GENE FAMILIES: REPRESENTATIVE GENE SYMBOLS c. Annotation Cluster 1 Enrichment Score: 1.

Regulation of Transcription in Eukaryotes. Nelson Saibo

Patterns and profiles applications of multiple alignments. Tore Samuelsson March 2013

Regulation of gene expression. Premedical - Biology

Comparative RNA-seq analysis of transcriptome dynamics during petal development in Rosa chinensis

Introduction. Gene expression is the combined process of :

-max_target_seqs: maximum number of targets to report

Figure S1: Mitochondrial gene map for Pythium ultimum BR144. Arrows indicate transcriptional orientation, clockwise for the outer row and

Peter Pristas. Gene regulation in eukaryotes

Transcription Regulation And Gene Expression in Eukaryotes UPSTREAM TRANSCRIPTION FACTORS

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever.

Quiz answers. Allele. BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 17: The Quiz (and back to Eukaryotic DNA)

Biological Process Term Enrichment

Prokaryotes: genome size:? gene number:?

Welcome to Class 21!

The Protein Ontology: An Evolution

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

BME 5742 Biosystems Modeling and Control

Chapter 10, 11, 14: Gene Expression, Regulation, and Development Exam

Ch 10, 11 &14 Preview

Homology and Information Gathering and Domain Annotation for Proteins

Leucine-rich repeat receptor-like kinases (LRR-RLKs), HAESA, ERECTA-family

Chapter 9 DNA recognition by eukaryotic transcription factors

2 The Proteome. The Proteome 15

Prediction of protein function from sequence analysis

Functional Annotation

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Plant Molecular and Cellular Biology Lecture 8: Mechanisms of Cell Cycle Control and DNA Synthesis Gary Peter

Regulation of Gene Expression

Student Learning Outcomes: Nucleus distinguishes Eukaryotes from Prokaryotes

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Lineage specific conserved noncoding sequences in plants

Photoperiodic effect on protein profiles of potato petiole

Motifs, Profiles and Domains. Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC

MOLECULAR CELL BIOLOGY

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Lecture 12. Metalloproteins - II

A Protein Ontology from Large-scale Textmining?

Activation of a receptor. Assembly of the complex

Lecture 2. The Blast2GO annotation framework

Introduction to Bioinformatics Integrated Science, 11/9/05

Regulation of Transcription in Eukaryotes

Multiple Choice Review- Eukaryotic Gene Expression

Lecture 14 - Cells. Astronomy Winter Lecture 14 Cells: The Building Blocks of Life

DNA Technology, Bacteria, Virus and Meiosis Test REVIEW

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

JMJ14-HA. Col. Col. jmj14-1. jmj14-1 JMJ14ΔFYR-HA. Methylene Blue. Methylene Blue

Supplementary Information 16

CSEP 590A Summer Tonight MLE. FYI, re HW #2: Hemoglobin History. Lecture 4 MLE, EM, RE, Expression. Maximum Likelihood Estimators

CSEP 590A Summer Lecture 4 MLE, EM, RE, Expression

Introduction to molecular biology. Mitesh Shrestha

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

Genome Annotation Project Presentation

Cellular Neuroanatomy I The Prototypical Neuron: Soma. Reading: BCP Chapter 2

Review. Membrane proteins. Membrane transport

Big Idea 3: Living systems store, retrieve, transmit and respond to information essential to life processes. Tuesday, December 27, 16

Protein Architecture V: Evolution, Function & Classification. Lecture 9: Amino acid use units. Caveat: collagen is a. Margaret A. Daugherty.

Signal Transduction. Dr. Chaidir, Apt

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Introduction to Molecular and Cell Biology

Name: SBI 4U. Gene Expression Quiz. Overall Expectation:

Supplementary Table 3. Membrane/Signaling/Neural Genes of the DmSP. FBgn CG5265 acetyltransferase amino acid metabolism

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Chapter 20. Initiation of transcription. Eukaryotic transcription initiation

Gene Control Mechanisms at Transcription and Translation Levels

Small RNA in rice genome

Old FINAL EXAM BIO409/509 NAME. Please number your answers and write them on the attached, lined paper.

32 Gene regulation, continued Lecture Outline 11/21/05

Introduction to Bioinformatics

Molecular Biology (9)

CELL CYCLE AND DIFFERENTIATION

Reception The target cell s detection of a signal coming from outside the cell May Occur by: Direct connect Through signal molecules

Chapter 18 Lecture. Concepts of Genetics. Tenth Edition. Developmental Genetics

Life Sciences 1a: Section 3B. The cell division cycle Objectives Understand the challenges to producing genetically identical daughter cells

From gene to protein. Premedical biology

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/8/16

11/24/13. Science, then, and now. Computational Structural Bioinformatics. Learning curve. ECS129 Instructor: Patrice Koehl

Supporting Information

Reminders about Eukaryotes

Co-ordination occurs in multiple layers Intracellular regulation: self-regulation Intercellular regulation: coordinated cell signalling e.g.

Computational Structural Bioinformatics

Transcription:

Comparative Features of Multicellular Eukaryotic Genomes C elegans A thaliana O. Sativa D. melanogaster M. musculus H. sapiens Size (Mb) 97 115 389 120 2500 2900 # Genes 18,425 25,498 37,544 13,601 30,000 30,000 Gene density (#/kb) 1/5 1/4.5 9.9 1/8.8 1/83 1/97 LINE/SINE (%) 0.4 0.5 1.2 0.7 27.4 33.6 LTR (%) 0.0 4.8 14.8 1.5 9.9 8.6 DNA Elements 5.3 5.1 18.0 0.7 0.9 3.1 Total repeats 6.5 10.5 34.8 3.1 38.6 46.4 Exons % genome size 27 28.8 12.0 24.0 per gene 5.4 4.7 4.1 8.4 8.7 average size (bp) 250 254 506 Introns % genome size 15.6 15.3 average size (bp) 168 413

Characterizing the Proteome The Protein World Sequencing has defined o 129,998 proteins Translation of EMBL (GenBank counterpart) has defined o 944,044 additional proteins How can we use this data to: o Define genes in new genomes o Look for evolutionarily related genes o Follow evolution of genes Mixing of domains to create new proteins o Uncover important subsets of genes that That deep phylogenies Plants vs. animals Placental vs. non-placental animals Monocots vs. dicots plants Common nomenclature needed o Ensure consistency of interpretations InterPro (http://www.ebi.ac.uk/interpro/) Intergrated documentation resource for protein families, domains and functional sites Nucleic Acids Research (2001) 29:37; (2003) 31:315 A database of protein families, domains, and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences A database that collapses information from 11 other databases Current Release: 7/2003 o 8,547 entries (7/03); 12,294 (7/05); 13,147 (8/06)

InterPro Entries Defined as: Family Evolutionarily related proteins Share architecture base on: o Domain signature o Repeat signature Multiple signatures can define a family 6/2004 release: 8,166 families; 8,753 (7/05); 9080 (8/06) Examples IPR000003 IPR000006 IPR000007 IPR000009 IPR000010 Retinoid X Receptor Vertebrate metallothionein Tubby Protein phosphate 2A regulatory subunit PR55 Cysteine protease inhibitor Domain Independent structural unit Can be associated with other o Domains o Repeat Proteins with the domain are evolutionarily related 6/2004 release: 2,573 domains; 3,240 (7/05); 3,760 (8/06) Examples IPR000001 IPR000002 IPR000005 IPR000008 IPR000014 Kringle Cdc20/Fizzy Helix-turn-helix, AraCtype C2 domain PAS domain

Repeat Region not expected to form a globular structure on their own o WD40 Six to eight copies needed for globular domain to form 6/2004 release: 201 repeats; 230 (7/05)/ 232 (8/06) Examples IPR000033 IPR000102 IPR000127 IPR000225 IPR000258 Low-density lipoprotein receptor, YWTD repeat Neuraxin/MAP1B repeat Ubiquitin-activating enzyme repeat Armadillo repeat Bacterial ice-nucleation proteins octamer repeat Site Post-translational modification o Modifies primary protein structure o Glycosylation o Splicing o Phosphorylation 6/2004 release: 20 post-translational modifications; 21 (7/05); 21 (8/06) Examples IPR000042 IPR000134 IPR000152 IPR000220 IPR000338 N-glycosylation site Amidation site Aspartic acid and asparagin hydroxylation site Tyrosine kinase phosphorylation site N-myristoylation site

Binding site o Binds chemical compounds Compounds not substrates Cofactors required for protein function 6/2004 release: 21 binding sites; 21 (7/05); 22 (8/06) Examples IPR000205 IPR000214 IPR000345 IPR000634 IPR001216 NAD-binding site Formamidopyrimidine-DNA glycolase, Zn-binding site Cytochrome c heme-binding site Serine/threonine dhydrate, pyridoxal-phosphate-binding site Cysteine synthase/cystathionine beta-synthase P-phosphate binding site Active site o Catalytic sites of protein o Site where substrate binds 6/2004 release: 26 active sites; 29 (7/05); 32 (8/06) Examples IPR000126 IPR000138 IPR000180 IPR000189 IPR000590 Serine protease V8, active site Hydroxymethylglutaryl-coenzyme A lyase, active site Peptidase M19, renal dipeptidase, active site Prokaryotic transglycosylase, active site Hydroxymethylglutaryl-coenzyme A synthase, active site

InterPro Strategy Uses nomenclature developed by other databases Collapses those nomenclatures to define each InterPro entry 8/2006 Data Database Contents No. entries UNIPRO Protein sequence database; recently translated SP: 228,670 nucleotides (EMBL) not represented as proteins TrEMBL: 3,031,970 PROSITE Biologically significant protein families and domains database PFAM Protein Families Database of Alignments and HMMs PRINTS Protein fingerprints database; fingerprints are conserved motifs used to characterize a protein family ProDom Database of automatically compiled homologous protein domains SMART Identifies and annotates domains involved in signaling, and extracellular and chromatin-associated proteins PIRSF PIR SuperFamily; assigns a protein to a single family based on full-length similarity and a sharing of a common domain SUPERFAMILY Domain based collection of proteins that are grouped using the SCOP definition of a superfamily; all superfamily members show evidence of a common ancestor WWW Sites: Swissprot: http://us.expasy.org/sprot/ PROSITE: http://us.expasy.org/prosite/ PFAM: http://www.sanger.ac.uk/software/pfam/ PRINTS: http://www.bioinf.man.ac.uk/dbbrowser/prints/ ProDom: http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php SMART: http://smart.embl-heidelberg.de/ TIGRFAMS: http://www.tigr.org/tigrfams/ PIRSF: http://pir.georgetown.edu/iproclass/ SUPERFAMILY: http://supfam.mrc-lmb.cam.ac.uk/superfamily/ 1697 pattern 125 preprofiles 8295 families 1899 fingerprints 1001 families 704 SMART HMMs 1033 superfamilies 450 superfamilies

InterPro statistics for sequenced eukaroyte genomes. (10/12/03 data from InterPro) # Proteins in proteome Proteins with InterPro matches (%) Number of signatures Number of InterPro entries (%) A. thaliana 26,070 20,685 (70.3) 5629 3119 (36.5) C. elegans 21,128 15,090 (71.4) 5402 2956 (34.5) D. melanogaster 15,455 11,267 (72.9) 5536 3035 (35.5) E. cuniculi 1,908 1,258 (65.9) 1765 927 (10.9) G. theta 451 279 (61.9) 596 295 (5.3) H. sapiens 29,638 22,036 (74.4) 7296 4163 (48.7) M. musculus 21,041 16,652 (79.1) 7123 4053 (47.4) S. cerevisiae 6,202 4,379 (70.6) 4436 2347 (27.5) S. pombe 4,988 3,889 (78.0) 4839 2335 (27.3) 1Encephalitozoon cuniculi: microsporidian; primative eukaryote; lack mitochondria 2Guillarida theta enslaved nucleus: nucleus containing chloroplasts are found in some unicellular algae.

InterPro Data Arabidopsis Top 15 Entries (6/2004) Rank No. proteins (% proteome) InterPro Entry (type) 1 1088 (4.2) Protein kinase-like 2 1054 (4.0) Protein kinase 3 731 (2.8) Serine/threonine protein kinase active site (active site) 4 649 (2.5) Cyclin-like F-box (domain) 5 551 (2.1) Protein prenyltransferase 6 542 (2.1) Leucine-rich repeat (repeat) 7 488 (1.9) Zn-finger, RING (domain) 8 457 (1.7) PPR repeat (Repeat) 9 389 (1.5) Homeodomain-like 10 333 (1.3) Leucine-rich repeat, plant specific (repeat) 11 322 (1.2) ARM repeat fold 12 321 (1.2) Galactose oxidase, central 13 319 (1.2) AAA ATPase (domain) 14 309 (1.2) Myb DNA-binding protein (domain) 15 285 (1.1) TPR-like

Arabidopsis Top 15 Families (6/2004) Rank No. proteins InterPro Entry 1 252 Cytochrome P450 2 161 Pectin lyase-like 3 158 Disease resistance protein 4 113 Retrotransposon gag protein 5 110 Lipolytic enzyme, G-D-S-L 6 109 UDP-glucoronosyl/UDP-glucosyltransferase 7 107 No apical meristem (NAM) protein 8 103 Ras GTPase 9 103 Peptidase aspartic 10 90 Transcritional factor B3 11 87 Major facilitator superfamily 12 85 Short-chain dehydrogenase/reductase SDR 13 82 Haemperoxidase 14 80 Peptidase C48, SUMO/Sentrin/Ubl1 15 72 Auxin responsive SAUR protein

Arabidopsis Top 15 Domains (6/2004) Rank Proteins matched (Matches/genome) InterPro Entry (type) 1 1088 (1348) Protein kinase-like 2 1055 (3767) Protein kinase 3 649 (1507) Cyclin-like F-box 4 551 (1206) Protein prenyltransferase 5 488 (1315) Zn-finger, RING 6 389 (540) Homeodomain-like 7 322 (714) ARM repeat fold 8 321 (474) Galactose oxidase, central 9 319 (394) AAA ATPase F-box protein interaction domain 10 309 (1367) Myb DNA-binding domain 11 285 (531) TPR-like 12 252 (1165) RNA-binding region RNP-1 (RNA recognigtion motif) 13 237 (973) Zn-finger, CCHC type 14 225 (232) Winged helix DNA-binding 15 222 (342) Nucleic acid-binding OB-fold

Arabidopsis Top 15 Repeats (6/2004) Rank No. proteins InterPro Entry 1 542 Leucine-rich repeat 2 457 PPR repeat 3 245 G-protein beta WD-40 repeat 4 151 Ankyrin 5 129 TPR repeat 6 111 Kelch repeat 7 75 Armadillo repeat 8 70 Parallel beta-helix repeat 9 44 Kelch 10 31 Paired amphipathic helix 11 27 Leucine-rich repeat, cysteine-containing 12 25 Pumilo/Puf RNA-binding 13 25 Bacterial transferase hexapeptide repeat 14 24 Protein of unknown function DUF321 15 23 Regulator of chomosome condensation, RCC!

Are their plant specific families? Rank InterPro entry A. thaliana C. elegans D. melanogaster H. sapiens S. cervisiae 1 Protein kinase 1050 528 303 694 118 2 Cytochrome P450 252 83 88 90 3 3 Disease resistance protein 4 2OG-Fe(II) oxygenase superfamily 5 Retrotransposon gag protein 6 Lipolytic enzyme, G- DE-S-L family 7 UDGglucoronosyl/UDPglucosyl transferase 8 No apical meristem (NAM) protein 9 Ras GTPase superfamily 157 1 0 4 2 114 11 26 19 1 110 5 3 4 3 110 11 2 5 1 109 76 35 29 0 108 0 0 0 0 106 82 86 118 34 10 Haem peroxidase 96 26 16 32 4 11 Short-chain dehyrogenase/reducta se SDR 12 General substrate transporter 13 Mitochondrial substrate carrier 14 Major facilitator superfamily 15 Transcriptional factor B3 91 92 60 80 13 91 96 118 77 54 85 65 74 93 38 84 128 123 106 670 81 0 0 0 0

Are their plant specific domains? [No. proteins (Matches/proteome)] Rank InterPro entry A. thaliana C. elegans D. melanogaster H. sapiens S. cervisiae 1 Cyclin-like F-box 648 (1487) 441 (957) 37 (78) 86 (204) 14 (32) 2 Zn-finger, RING 491 (1287) 175 (504) 150 (376) 387 (1194) 40 (108) 3 Myb DNA-binding domain 345 (1535) 39 (112) 50 (103) 92 (276) 21 (91) 4 AAA ATPase 314 (390) 124 (173) 131 (193) 176 (251) 89 (125) 5 RNA-binding region RNP-1 (RNA recognition motif 6 Zn-finger, CCHC type 7 Basic helix-loophelix dimerization domain bhlh 8 Calcium-binding EFhand 9 F-box protein interaction domain 10 Esterase/lipase/thioes terase 300 (1302) 131 (193) 176 (251) 346 (1809) 58 (307) 233 (923) 47 (210) 27 (111) 56 (308) 14 (89) 224 (647) 101 (271) 93 (307) 168 (635) 9 (36) 222 (1546) 133 (813) 134 (873) 316 (1945) 17 (102) 213 (223) 0 (0) 0 (0) 0 (0) 0 (0) 210 (213) 134 (142) 148 (149) 121 (121) 37 (37) 11 Zn-finger,C2H2 type 196 (852) 251 (2042) 397 (7249) 962 (29,334) 54 (381) 12 Zn-finger-like, PHD finger 164 (472) 46 (146) 51 (203) 126 (529) 16 (57) 13 FBD 159 (192) 1 (1) 2 (2) 1 (1) 0 (0) 14 NB-ARC domain 154 (157) 0 (0) 0 (0) 0 (0) 0 (0) 15 Integrase, catalytic domain 153 (158) 30 (31) 6 (6) 15 (15) 40 (40)

Are their plant specific repeats? Rank InterPro entry A. thaliana C. elegans D. melanogaster H. sapiens S. cervisiae 1 Leucine-rich repeat 518 85 127 269 11 2 PPR repeat 454 1 4 5 2 3 G-protein beta WD- 40 repeat 267 170 195 405 100 4 TPR repeat 136 64 92 189 33 5 Ankyrin 117 116 104 306 21 6 Kelch repeat 115 17 22 97 6 7 Armadillo repeat 70 7 11 47 2 8 Parallel beta-helix repeat 9 Paired amphipathic helix 10 Bacterail transferase hexapeptide repeat 11 Leucine-rich repeat, cysteine-containing 12 Protein of unknown function DUF321 13 Prenyltransferae/squa lene oxidase 14 Late embryogenesis abundant protein 62 3 2 9 1 28 1 3 3 1 26 7 5 8 4 26 7 10 21 3 24 0 0 0 0 20 3 3 4 4 18 4 1 2 0 15 HEAT repeat 17 14 6 30 10

Gene Ontology (GO) (http://www.geneontology.org/) The goal of the Gene Ontology Consortium is to produce a controlled vocabulary that be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing. Major Categories Molecular Function Ontology The action characteristic of a gene product. Biological Process Ontology A phenomenon marked by changes that lead to a particular result, mediated by one or more gene products. Cellular Component Ontology The part of a cell of which a gene product is a component; for purpose of GO includes the extracellular environment of cells; a gene product may be a component of one or more parts of a cell; this term includes gene products that are parts of macromolecular complexes, by the definition that all members of a complex normally copurify under all except extreme conditions.

GO Annotations for Eukaryotic Species Species Biological process Molecular function Cellular component Total genes associated A. thaliana 10,395 12,937 6,159 18,572 C. elegans 5,115 5,755 3,057 6,925 D. melanogaster 4,439 6,795 3,942 7,938 H. sapiens 16,643 18,722 13,880 20,687 S. cerevasiae 6,646 6,434 6,435 6,448 Note: Each annotations was performed by a different group.

Lineage Specific Expansion (http://www.ncbi.nlm.nih.gov/cog/new/sholse.cgi) Uses the Clusters of Orthologous Genes dataset Determines how many genes are specific to only one species Species LSEs A. thaliana 1,458 C. elegans 845 D. melanogaster 303 H. sapiens 1,373 S. cerevasiae 132 Arabidopsis LSE Category Unique Total Signal transduction mechanism (T) 34 1080 Defense mechanism (V) 15 337

Arabidopsis Example Disease resistance proteins o Shared structure among each protein Domains TIR: Toll-interleukin receptor site NBS: Nucleotide binding site LRR: Leucine rich repeat o Structure shared among many plant resistance genes Lineage specific for Arabidopsis relative to other sequenced genomes Not lineage specific for plants in general o Number 105 Other Arabidopsis Examples Receptor protein kinase containing LRR repeats o Gene expression regulator Involved in phosphorylating other proteins Number o 227 basic Helix Loop Helix o bhlh o Transcriptional regulator Number 143

Human Examples Olfactory receptor o Smell Number 570 Immunoglobin heavychain o Immune system protein o Defense response Number 156