SnoPatrol: How many snorna genes are there? Supplementary

Similar documents
Small RNA in rice genome

SNORNAS HOMOLOGY SEARCH

Homology and Information Gathering and Domain Annotation for Proteins

CS612 - Algorithms in Bioinformatics

EBI web resources II: Ensembl and InterPro

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

In Genomes, Two Types of Genes

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Genome 559 Wi RNA Function, Search, Discovery

Supplementary Information

Introduction to Bioinformatics

How much non-coding DNA do eukaryotes require?

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

The RNA component of the RNase P

BLAST. Varieties of BLAST

Homology. and. Information Gathering and Domain Annotation for Proteins

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

Taxonomical Classification using:

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage.

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Comparative Network Analysis

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions?


Nature Biotechnology: doi: /nbt Supplementary Figure 1. Detailed overview of the primer-free full-length SSU rrna library preparation.

Mitochondrial Genome Annotation

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

SUPPLEMENTARY INFORMATION

Ensembl Genomes (non-chordates): Quick tour. This quick tour provides a brief introduction to Ensembl Genomes [2], the non-chordate genome browser.

Quantitative Bioinformatics

Gene mention normalization in full texts using GNAT and LINNAEUS

Hands-On Nine The PAX6 Gene and Protein

6.096 Algorithms for Computational Biology. Prof. Manolis Kellis

Bioinformatics Chapter 1. Introduction

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Sequencing alignment Ameer Effat M. Elfarash

Shape Based Indexing For Faster Search Of RNA Family Databases

Ribosomal RNA. Introduction. Organization of the Ribosomal RNA Genes. Introductory article

MiGA: The Microbial Genome Atlas

Figure S1: Similar to Fig. 2D in paper, but using Euclidean distance instead of Spearman distance.

Araport, a community portal for Arabidopsis. Data integration, sharing and reuse. sergio contrino University of Cambridge

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

SUPPLEMENTARY INFORMATION

Sequencing alignment Ameer Effat M. Elfarash

Large-Scale Genomic Surveys

Microbial Taxonomy. Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Protein function prediction based on sequence analysis

Algorithmics and Bioinformatics

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

Supplemental Data. Perea-Resa et al. Plant Cell. (2012) /tpc

Robust Identification of Noncoding RNA from Transcriptomes Requires Phylogenetically-Informed Sampling

Introduction to molecular biology. Mitesh Shrestha

Comparison of Three Fugal ITS Reference Sets. Qiong Wang and Jim R. Cole

BIOINFORMATICS LAB AP BIOLOGY

Markov Models & DNA Sequence Evolution

Discovering Binding Motif Pairs from Interacting Protein Groups

Multiple Sequence Alignment. Sequences

Title ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses

Procedure to Create NCBI KOGS

Prediction of protein function from sequence analysis

Detecting non-coding RNA in Genomic Sequences

Networks & pathways. Hedi Peterson MTAT Bioinformatics

Review Article Ribonucleoproteins in Archaeal Pre-rRNA Processing and Modification

What can sequences tell us?

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Rho1 binding site PtdIns(4,5)P2 binding site Both sites

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Nature Structural & Molecular Biology: doi: /nsmb Supplementary Figure 1

Introduction to Evolutionary Concepts

Pyrobayes: an improved base caller for SNP discovery in pyrosequences

Supplementary Information for Discovery and characterization of indel and point mutations

Tandem Mass Spectrometry: Generating function, alignment and assembly

EBI web resources II: Ensembl and InterPro

"Omics" - Experimental Approachs 11/18/05

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

This is a repository copy of Microbiology: Mind the gaps in cellular evolution.

Simplifying Drug Discovery with JMP

THE DISTRIBUTION AND EVOLUTION OF SMALL NON-CODING RNAS IN ARCHAEA IN LIGHT OF NEW ARCHAEAL PHYLA

Supplementary Information for Evolution of resilience in protein interactomes across the tree of life

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Stable stem enabled Shannon entropies distinguish non-coding RNAs from random backgrounds

Microbiome: 16S rrna Sequencing 3/30/2018

Comparative Bioinformatics Midterm II Fall 2004

Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

Microbial Taxonomy. Slowly evolving molecules (e.g., rrna) used for large-scale structure; "fast- clock" molecules for fine-structure.

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Semantic Integration of Biological Entities in Phylogeny Visualization: Ontology Approach

Journal of Proteomics & Bioinformatics - Open Access

Pathogenic fungi genomics, evolution and epidemiology! John W. Taylor University of California Berkeley, USA

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Centrifuge: rapid and sensitive classification of metagenomic sequences

Supplemental Materials

PHYLOGENY AND SYSTEMATICS

-max_target_seqs: maximum number of targets to report

Transcription:

SnoPatrol: How many snorna genes are there? Supplementary materials. Paul P. Gardner 1, Alex G. Bateman 1 and Anthony M. Poole 2,3 1 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SA, UK 2 Department of Molecular Biology & Functional Genomics, Stockholm University, SE-106 91 Stockholm, Sweden 3 School of Biological Sciences, University of Canterbury, Christchurch 8140, New Zealand Email: Paul P. Gardner - pg5@sanger.ac.uk; Alex G. Bateman - agb@sanger.ac.uk; Anthony M. Poole - anthony.poole@canterbury.ac.nz; Corresponding author Supplementary Methods In order to investigate the taxonomic distribution of the known snornas and highlight where potential new discoveries can be made we have gathered data from the Pfam, Rfam, Genomes Online Database (GOLD) and EMBL databases. To locate the essential protein components of snornps we used domains annotated by the Pfam database. We chose 6 snornp associated domains Nop, Nop10p, Gar1, SHQ1, Fibrillarin and TruB N, the counts of which are shown in the blue bars in Figure 2. We also ran the control experiment of ensuring that the genomes for each taxa contain rrnas by using the annotation for the partial SSU model from the Rfam database, the counts of which are shown in the green bars in Figure 2 for each taxa. These two datasets indicate that the snornp machinery is present in all the known Archaea and Eukaryotic taxa, the notable exception being the absence of snornp protein domains in the Oömycetes, stramenopiles group. However, it appears that this lack is due to the protein sequences not being included in the public sequence databases yet rather than a bona fide vestigialization of this machinery. Yet for many major taxonomic clades there are few or no known snornas. To investigate these we used species counts of the snornas annotated by the Rfam database the results of these are displayed in the red bars in Figure 2. Of these there are many experimentally validated snorna sequences published in the literature and therefore submitted to one of the nucleotide archives Genbank, EMBL or the DDBJ, for further verification we mined the annotation of those sequences in the Rfam database for terms associated with snornas, the counts of these are shown in the pink bars in Figure 2. Finally, in order to be certain that 1

the counts we observed aren t due to an under-sampling of sequences from certain taxonomic groupings we used data from the GOLD genomes database, genomes annotated as either Complete, Draft or In progress ; assuming these genome sequences have all been submitted to the public nucleotide archives (the annotation of which is included in the UniProt database searched by Pfam) then this should give a good representation of where sequences are available and therefore annotation of snornas possible. Supplementary Table 1 Col. 1: taxon string (top three levels of the NCBI taxonomy) Col. 2: number of sequence regions annotated by the Pfam Nop, Nop10p and Gar1 families Col. 3: number of sequence regions annotated by the Rfam SSU rrna family Col. 4: number of genomes annotated as either completed, draft or in progress from the GOLD database Col. 5: number of sequence regions annotated by all the Rfam snorna families Col. 6: number of sequence regions annotated by all the Rfam snorna families where EMBL sequences have been published as snornas. TaxString Pfam SSU GOLD predsnos knownsnos Archaea;Crenarchaeota;Thermoprotei; 157 475 27 128 11 Archaea;Euryarchaeota;Aciduliprofundum. 10 3 1 2 0 Archaea;Euryarchaeota;Archaeoglobi; 6 43 1 8 3 Archaea;Euryarchaeota;Halobacteria; 59 1526 15 0 0 Archaea;Euryarchaeota;Methanobacteria; 23 1486 5 0 0 Archaea;Euryarchaeota;Methanococci; 45 107 6 26 0 Archaea;Euryarchaeota;Methanomicrobia; 64 1383 11 10 0 Archaea;Euryarchaeota;Methanopyri; 6 5 1 1 0 Archaea;Euryarchaeota;Thermococci; 48 300 7 324 19 Archaea;Euryarchaeota;Thermoplasmata; 14 220 3 0 0 Archaea;Korarchaeota;Candidatus 3 1 1 0 0 Archaea;Nanoarchaeota;Nanoarchaeum. 3 2 1 0 0 Archaea;Thaumarchaeota;Cenarchaeales; 4 0 0 0 0 Archaea;Thaumarchaeota;marine 18 0 0 0 0 Eukaryota;Alveolata;Apicomplexa; 132 2675 25 58 1 Eukaryota;Alveolata;Ciliophora; 28 1438 1 10 0 Eukaryota;Alveolata;Perkinsea; 20 175 1 29 0 Eukaryota;Amoebozoa;Archamoebae; 16 326 3 3 0 Eukaryota;Amoebozoa;Mycetozoa; 13 239 1 2 0 Eukaryota;Amoebozoa;Tubulinea; 1 84 0 0 0 Eukaryota;Choanoflagellida;Codonosigidae; 9 18 1 6 0 Eukaryota;Cryptophyta;Pyrenomonadales; 3 119 1 0 0 2

Eukaryota;Diplomonadida;Hexamitidae; 9 90 2 0 0 Eukaryota;Euglenozoa;Euglenida; 4 458 0 0 0 Eukaryota;Euglenozoa;Kinetoplastida; 70 669 3 174 26 Eukaryota;Fungi;Dikarya; 546 11898 154 3591 181 Eukaryota;Fungi;Microsporidia; 18 540 2 0 0 Eukaryota;Metazoa;Annelida; 1 1078 1 0 0 Eukaryota;Metazoa;Arthropoda; 208 10738 40 2032 120 Eukaryota;Metazoa;Brachiopoda; 1 74 0 0 0 Eukaryota;Metazoa;Chordata; 188 17465 90 85287 952 Eukaryota;Metazoa;Cnidaria; 9 1167 2 144 0 Eukaryota;Metazoa;Echinodermata; 2 254 1 168 0 Eukaryota;Metazoa;Echiura; 1 6 0 0 0 Eukaryota;Metazoa;Mollusca; 1 2045 1 140 0 Eukaryota;Metazoa;Nematoda; 31 2420 12 304 49 Eukaryota;Metazoa;Nematomorpha; 1 19 0 0 0 Eukaryota;Metazoa;Placozoa; 9 15 1 19 0 Eukaryota;Metazoa;Platyhelminthes; 22 1521 2 38 0 Eukaryota;Metazoa;Porifera; 1 281 0 0 0 Eukaryota;Metazoa;Priapulida; 1 23 0 0 0 Eukaryota;Metazoa;Sipuncula; 1 104 0 0 0 Eukaryota;Parabasalidea;Trichomonada; 12 350 1 1 0 Eukaryota;Rhizaria;Cercozoa; 3 866 1 0 0 Eukaryota;Viridiplantae;Chlorophyta; 48 2288 6 73 20 Eukaryota;Viridiplantae;Streptophyta; 194 5791 44 15529 680 Eukaryota;stramenopiles;Bacillariophyta; 21 1125 2 9 0 Eukaryota;stramenopiles;Oomycetes; 0 261 4 75 0 Supplementary Methods The Trichomonas vaginalis G3 C/D box snorna (RF00335, Small nucleolar RNA Z13/snr52) discussed in the text of the article. The results of the cmsearch [1] match between the RF00335 model and the T. vaginalis snorna candidate: >AAHC01000432.1/36420-36361 Query = 1-101, Target = 76-135 Score = 26.12, E = 104.1, P = 3.938e-09, GC = 33 ::<<<-----------------..-<<<<~~~~~~>>>>------------------->> 1 uucucuaaaaugaugaaauauu..ucgga*[47]*uccguugccuuccuucugauuaaga 99 +C::U+A AUGAUG A A U :GGA UCC:UUGCCUUCCUU UGAUU++:: 76 GACAAUUAGAUGAUGCAUAAAUgaGUGGA*[ 5]*UCCAUUGCCUUCCUU-UGAUUUUUU 133 >: 100 gu 101 GU 134 GU 135 3

A Stockholm format alignment showing the alignment between the known Fungal snr52/z13 snornas and the T.vaginalis snorna candidate: # STOCKHOLM 1.0 #=GF AC RF00335 #=GF ID snoz13_snr52 #=GF DE Small nucleolar RNA Z13/snr52 #=GF AU Moxon SJ #=GF SE Moxon SJ, INFERNAL #=GF SS Predicted; PFOLD; Moxon SJ #=GF GA 20.16 #=GF TC 22.70 #=GF NC undefined #=GF TP Gene; snrna; snorna; CD-box; #=GF BM cmbuild -F CM SEED; cmcalibrate --mpi -s 1 CM #=GF BM cmsearch -Z 169604 -E 1000 --toponly CM SEQDB #=GF DR SO:0000593 SO:C_D_box_snoRNA #=GF DR GO:0006396 GO:RNA processing #=GF DR GO:0005730 GO:nucleolus #=GF RN [1] #=GF RM 12007400 #=GF RT Small nucleolar RNAs: an abundant group of noncoding RNAs with #=GF RT diverse cellular functions. #=GF RA Kiss T; #=GF RL Cell 2002;109:145-148. #=GF CC Z13 or snr52 is a member of the C/D class of snorna which contain the C #=GF CC (UGAUGA) and D (CUGA) box motifs. Most of the members of the box C/D #=GF CC family function in directing site-specific 2 -O-methylation of substrate #=GF CC RNAs [1]. #=GF WK http://en.wikipedia.org/wiki/small_nucleolar_rna_z13/snr52 #=GS S.pombe_Z13_snoRNA CC AJ223843.1/1-95, score=89.65 E value=3.11e-16 #=GS CC AJ251862.1/1-95, score=89.65 E value=3.11e-16 #=GS CC EU780965.1/2-92, score=79.46 E value=2.01e-13 #=GS CC AM921928.1/6-91, score=79.30 E value=2.22e-13 #=GS CC AJ617322.1/3-83, score=67.73 E value=3.46e-10 #=GS CC AAHC01000432.1/36420-36361, score=26.12 E value=1.04e+02 #=GS CC AJ223033.1/15-106, score=9.86 E value=3.17e+06 #=GS CC AF064264.1/1-92, score=9.86 E value=3.17e+06 #=GS S.pombe_Z13_snoRNA CC S.pombe Schizosaccharomyces pombe Z13 small nucleolar RNA gen #=GS CC S.pombe Schizosaccharomyces pombe snr52 gene for small nucleo #=GS CC N.crassa Neurospora crassa box C/D snorna CD42, complete seque #=GS CC A.fumigatus Aspergillus fumigatus Af293 putative C/D box snorna A #=GS CC S.pombe Schizosaccharomyces pombe snr52 snorna #=GS CC T.vaginalis Trichomonas vaginalis G3, whole genome shotgun sequen #=GS CC S.cerevisiae Saccharomyces cerevisiae Z13 small nucleolar RNA gene #=GS CC S.cerevisiae Saccharomyces cerevisiae snr52 small nucleolar RNA, c 4

S.pombe_Z13_snoRNA #=GC SS_cons S.pombe_Z13_snoRNA #=GC SS_cons AUUUUGAAAAUGAUGAAAAU..AA.cGCGGAUGAAAUAUAU-------GU AUUUUGAAAAUGAUGAAAAU..AA.cGCGGAUGAAAUAUAU-------GU AC------CCUGAUGAAAUA..UU.cUCGGAUGUAAAAUUUUACUUUUGU ----------UGAUGAAAUA..UU..UCGGAUGUAAACUCACAGACCUGU ------AAAAUGAUGAAAAU..AA.cGCGGAUGAAAUAUAU-------GU GACAAUUAGAUGAUGCAUAA..AUgaGUGGA------------------- UA-----CUAUGAUGAAUGAcaUU.aGCGUGAACAAUCUCUGAUACAAAA UA-----CUAUGAUGAAUGAcaUU.aGCGUGAACAAUCUCUGAUACAAAA ::<<<---------------..--..-<<<<...CCCCCC... <25S_rRNA>... CUGAAAaG...CGCAAAAACAAUGUUGAGAUUUCCGUUGC UCUGAAuG...CGCAAAACC-GUGUAGAGAUUUCCGUUGC ------.g...uauuuccauugc UCGAAA.Gauuuuaggauuag...aaaa...acuUAUGUUGC UCGAAA.Gauuuuaggauuag...aaaa...acuUAUGUUGC..._... >>>>----... <18S_ S.pombe_Z13_snoRNA CUUCCUUCUGAUUCAAAAU CUUCCUUCUGAUUCAAAAU CUUCCUUCUGAUC------ CUUCCUUCUGAUGA----- CUUCCUUCUGA-------- CUUCCUU-UGAUUUUUUGU CUUCCUUCUGAA--A---- CUUCCUUCUGAA--A---- #=GC SS_cons --------------->>>: rrna.> DDDD... // References 1. Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments. Bioinformatics 2009, 25(10):1335 7. 5