Supplementary Information

Similar documents
Phylogenetics: Building Phylogenetic Trees

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Department of Pathology, Northwestern University Medical School, Chicago, IL 60611

SUPPLEMENTARY INFORMATION

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Algorithms in Bioinformatics

Phylogenetic Tree Reconstruction

Theory of Evolution Charles Darwin

Dr. Amira A. AL-Hosary

Cell biology traditionally identifies proteins based on their individual actions as catalysts, signaling

Types of biological networks. I. Intra-cellurar networks

BINF6201/8201. Molecular phylogenetic methods

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Chapter 26 Phylogeny and the Tree of Life

A Phylogenetic Network Construction due to Constrained Recombination

Multiple Sequence Alignment. Sequences

Constructing Evolutionary/Phylogenetic Trees

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

BioControl - Week 6, Lecture 1

EVOLUTIONARY DISTANCES

A (short) introduction to phylogenetics

Computational approaches for functional genomics

Phylogenetic trees 07/10/13

Phylogenetic inference

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

V14 Graph connectivity Metabolic networks

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

A bioinformatics approach to the structural and functional analysis of the glycogen phosphorylase protein family

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

The architecture of complexity: the structure and dynamics of complex networks.

Theory of Evolution. Charles Darwin

Chapter 19 Organizing Information About Species: Taxonomy and Cladistics

Network Biology: Understanding the cell s functional organization. Albert-László Barabási Zoltán N. Oltvai


Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Dynamic optimisation identifies optimal programs for pathway regulation in prokaryotes. - Supplementary Information -

Microbial Taxonomy. Microbes usually have few distinguishing properties that relate them, so a hierarchical taxonomy mainly has not been possible.

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Supplementary Materials for

Phylogeny Tree Algorithms

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

CS224W: Social and Information Network Analysis

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Microbial Diversity and Assessment (II) Spring, 2007 Guangyi Wang, Ph.D. POST103B

Phylogenetic inference: from sequences to trees

Classification and Phylogeny

Biological Networks. Gavin Conant 163B ASRC

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Classification and Phylogeny

What is Phylogenetics

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

86 Part 4 SUMMARY INTRODUCTION

Phylogeny: building the tree of life

Overview of clustering analysis. Yuehua Cui

Macroevolution Part I: Phylogenies

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Molecular Evolution and Phylogenetic Tree Reconstruction

Workshop: Biosystematics

Phylogenetics. BIOL 7711 Computational Bioscience

BioSystems 107 (2012) Contents lists available at SciVerse ScienceDirect. BioSystems

Genomic Comparison of Bacterial Species Based on Metabolic Characteristics

Honor pledge: I have neither given nor received unauthorized aid on this test. Name :

Workshop III: Evolutionary Genomics

Network Alignment 858L

Non-independence in Statistical Tests for Discrete Cross-species Data

Weighted gene co-expression analysis. Yuehua Cui June 7, 2013

How to read and make phylogenetic trees Zuzana Starostová

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Evolutionary Tree Analysis. Overview

Consistency Index (CI)

Name: Class: Date: ID: A

networks in molecular biology Wolfgang Huber

Phylogenetic Analysis of Molecular Interaction Networks 1

1 ATGGGTCTC 2 ATGAGTCTC

Taxonomy. Content. How to determine & classify a species. Phylogeny and evolution

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

METABOLIC PATHWAY PREDICTION/ALIGNMENT

University of Florida CISE department Gator Engineering. Clustering Part 1

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Fig. 26.7a. Biodiversity. 1. Course Outline Outcomes Instructors Text Grading. 2. Course Syllabus. Fig. 26.7b Table

RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION

ELE4120 Bioinformatics Tutorial 8

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

What is the range of a taxon? A scaling problem at three levels: Spa9al scale Phylogene9c depth Time

On the Uniqueness of the Selection Criterion in Neighbor-Joining

Lecture 6 Phylogenetic Inference

Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances

Chapter 19. Microbial Taxonomy

Biology 211 (2) Week 1 KEY!

PHYLOGENY AND SYSTEMATICS

Transcription:

Supplementary Information For the article"comparable system-level organization of Archaea and ukaryotes" by J. Podani, Z. N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, and. Szathmáry (reference numbers are the same as in the paper) 1

Web Note A Connectivity distribution of information transfer networks To determine the large-scale structural organization of information transfer networks, we have examined the topologic properties of the information transfer pathways of 43 different organisms based on "Information transfer" portions of data deposited in the WIT database. Similar to that of metabolic networks 11, we have first established a graph theoretic representation of the biochemical reactions taking place in a given information transfer network. In this representation, a network is built up of nodes, which are the substrates that are connected to one another through links, which are the actual biochemical reactions. Note, that beside core metabolic reactions, the WIT DB includes biochemical reactions contributing to (1) information pathways, (2) electron transport, (3) transmembrane transport, (4) signal transduction, and (5) structure and function of the cell. However, with the exception of information transfer pathways, the data deposited for these parts are so incomplete that comparison of e.g., signal transduction pathways is completely impossible. In addition, no information for any metazoan, including for those of the just recently sequenced human genome, is available in the WIT DB. To establish if the topology of information transfer networks is best described by the inherently random and uniform exponential- or the highly heterogeneous scale-free model 11 the connectivity distribution of all 43 networks were analyzed as described for metabolic networks in ref. 11. As illustrated in Web Fig. A, our results convincingly indicate that the probability that a given substrate participates in k reactions follows a power-law distribution, i.e., information transfer networks belong to the class of scale-free networks. Since under physiological conditions a large number of biochemical reactions (links) in a network are preferentially catalyzed in one direction (i.e., the links are directed), for each node we distinguish between incoming and outgoing links. For instance, in scherichia coli the probability that a substrate participates as an educt in k information transfer pathways follows P(k) ~ k -γin, with γ in = 2.1, and the probability that a given substrate is produced by k different information transfer reactions follows a similar distribution, with γ out = 2.2 (Web Fig. Ab). We find that scale-free networks describe the information transfer reaction networks in all organisms in all three domains of life (Web Fig. Aa-c), indicating the generic nature of this structural organization (Web Fig. Ad). 2

0 In Out 0 In Out -2-2 P(k) P(k) -4-6 0 (a) 0 1 2 3 k -2 In Out P(k) P(k) -6 0 1 2 3 0-4 -2 (b) k In Out -4-6 (c) 0 1 2 3 k -4-6 (d) 0 1 2 3 k Fig. A Connectivity distribution P(k) for the substrates in (a) A. fulgidus (Archaea) (b). coli (Bacteria) (c) C. elegans (ukarya), shown on a log-log plot, counting separately the incoming (IN) and outgoing links (OUT) for each substrate, k in (k out ) corresponding to the number of reactions in which a substrate participates as a product (educt). (d) The connectivity distribution averaged over all 43 organisms. 3

Web Note B Database analysis Data types and resemblance coefficients Although the data matrices comprise quantitative information, we decided to rely only upon the qualitative part of the data, for several reasons. First of all, traditionally, qualitative data have been almost exclusively used in molecular systematics. Second, qualitative data are less sensitive to the different intensity by which the organisms are studied, meaning that there is more noise in quantitative data attributable to 'sampling error'. The presence or absence (P/A) of a variable in a given organism is the simplest type of all data. We thus compared each pair of organisms using a simple ratio in which the number of variables present in both was divided by the number of variables present in at least one of them (widely known as the Jaccard index). The coefficient is expressed here as its complement to measure the dissimilarity of the two species being compared. In addition to the P/A scores, we also utilized the ordinal information (i.e., the rank order) in the data. We have previously demonstrated 11 that ordering relationships for each organism are also of great importance when describing metabolic pathways. Consequently, we also compared each pair of species using the ordinal measure suggested by Goodman and Kruskal (their γ function). In this coefficient, the number of variable pairs ordered similarly by both taxa is divided by the total number of pairs of variables that are ordered at all. For taxa j and k it is written as γ jk = (a - b) / (a + b) where a is the number of pairs of variables ordered for objects j and k identically, b is the number of pairs of variables that are reversely ordered in j and k. The coefficient ranges from -1 to 1, indicating complete disagreement and full agreement, respectively. To simplify calculations, we transformed the values to measure dissimilarity in the range of 0 to 2. For details of the methodology used in this paper see refs. 12-15. Ordination A major group of exploratory data analysis tools aims to reduce the dimensionality of the system, more precisely, to show a multidimensional picture in a few, preferably two, dimensions as efficiently as possible. These procedures are commonly referred to as ordination. For the present study, we have chosen the non-metric multidimensional scaling (NMDS) method which is best suited to input dissimilarities that are ordinal anyway. For 4

consistency, we applied the very same procedure to the otherwise metric Jaccard coefficient, as well. A fundamental feature of NMDS is that the user defines the final dimensionality of the result, which is generally selected to be two. The procedure then attempts to arrange the points in the space such that the ordering relationships of input dissimilarities are as faithfully preserved in the ordination space as possible. The procedure is iterative, and therefore requires several runs from which we retain the best result. Hierarchical clustering The most widely known hierarchical clustering method is group average clustering (UPGMA, ref. 13). Despite strong criticisms expressed by cladists, the method applies even to phylogenetic problems especially if the molecular clock (i.e., constant rate of evolutionary change on all lineages) is assumed. Nevertheless, UPGMA has been most commonly used on data other than DNA or RNA sequences, thus its properties and interpretability of its results are sufficiently known. In the present study, UPGMA was applied to matrices of the Jaccard index. The method produces an ultrametric tree, the hierarchical levels potentially interpretable as mean dissimilarities between groups fused at that level. Since the matrix calculated by Goodman-Kruskal's γ comprises non-metric information, they were processed by a different hierarchical classification algorithm, namely ordinal hierarchical clustering (OC, see ref. 14 for its underlying criterion). As OC is a method recently developed by one of us (JP); here we provide a more detailed description of this technique. This procedure considers only the ordering information present in the dissimilarity matrix, and then builds a tree in which each level is labeled only by the number of the iteration step in which the actual merger took place. The algorithm of agglomerative ordinal clustering first orders the m(m-1)/2 dissimilarity coefficients (m is the number of taxa), so that each γ jk is replaced by its rank, r jk. Then, the following clustering criterion is considered C = (R w - R min ) / (R max - R min ) where R w is the sum of ranks of within-cluster dissimilarities, R min is the possible minimum of such ranks for the given number of clusters and for the given numbers of objects in each cluster, and R max is the possible maximum. A similar criterion using R exp instead of R max was proposed by ref. 14 for determining the importance of variables in classification. The value of C ranges from 0 to 1, 0 indicating that all within-cluster 5

dissimilarities are smaller than the between-cluster dissimilarities, whereas 1 indicating that all between-cluster dissimilarities are smaller than the others. In the dendrogram resulted from agglomerative ordinal clustering we used the ranks of fusions (values from 1 to m-1), rather than the C values themselves, because this criterion does not change monotonically. That is, the result is an ordered dendrogram, rather than a weighted dendrogram, being consistent with the ordinality of the previous steps of the analysis. Neighbor joining (NJ) The method originally developed by Saitou and Nei (ref. 12) has been the most widely applied distance-based tree-building method in phylogenetic reconstruction. Contrary to UPGMA, the tree is not ultrametric, but the additivity of branch lengths holds. In other words, the sum of branch lengths along the path between two taxa in the tree is an approximation to the input distances. The NJ tree is unrooted by definition, although some external criterion (outgroup, longest path) may be used to position a root (a practice not followed here). 6

Web Note C Analysis of the effect of database errors Of the 43 organisms whose metabolic- and information transfer pathways we have analyzed the genome of 25 has been completely sequenced (5 Archaea, 18 Bacteria, 2 ukaryotes), while the remaining 18 are only partially sequenced. Therefore, two major sources of possible errors in the database could potentially affect our analysis: (a) the erroneous annotation of enzymes and consequently, biochemical reactions; for the organisms with completely sequenced genomes this is the likely source of error. (b) reactions and pathways missing from the database; for organisms with incompletely sequenced genomes both (a) and (b) are of potential source of error. To determine if these limitations affect our analysis we have performed repeated analyses according to the type of errors to demonstrate if database bias critically influences the validity of our findings. These analyses were restricted to the UPGMA classifications of the P/A case, assuming that ordinations and unrooted trees are similarly influenced by data changes. In the first series of simulations, the four UPGMA classifications (Fig. 1f,h, Fig. 2f,h) were recomputed based on a reduced set of input data such that only the randomly chosen 50% of original variables were retained. The INFO based dendrograms (compare original Fig. 1f,h to the corresponding ones in Web Fig. B) suffered little changes owing to data reduction, the only discrepancy being that Archaea and ukaryotes form three, rather than two small clusters, while Bacteria remain intact as a very tight group. For MTAB data, removal of 50% of the variables from the substrate data set caused only one critical modification of the tree (compare original Fig. 2f to the corresponding one in Web Fig. B); Aeropyrum pernix was moved to the B2 group which was otherwise split into two. On the other hand, the main groupings in the dendrogram of Fig. 2h (NZ data) did not change at all. In the second series of simulations, the incompletely sequenced organisms were simply removed from the analyses, so that clustering was confined to 25 taxa in case of all the four input data sets. Reduction of sample size did not alter the main groupings at all (compare original Fig. 1f,h and 2f,h to the corresponding trees in Web Fig. C), suggesting robustness of the data upon the removal of incompletely sequenced organisms. 7

1f 1h A+ B A+ B 2f 2h A B1 B2 A B1 B2 Fig. B UPGMA classifications based on randomly reduced data sets to evaluate potential effects of missing reactions and pathways. Captions refer to comparable dendrograms obtained from complete data. 8

1f 1h A B A+ 2f 2h B A B1 B2 A B1 B2 Fig. C UPGMA classifications restricted to completely sequenced species to evaluate the effect of missing information arising from incomplete sequencing. Captions refer to comparable dendrograms obtained for the full set of species. 9