Genomic Comparison of Bacterial Species Based on Metabolic Characteristics

Gaurav Jain 1, Haozhu Wang 1, Li Liao 1*, E. Fidelma Boyd 2*
1 Department of Computer and Information Sciences, University of Delaware, Newark, DE 19716, USA
2 Department of Biological Sciences, University of Delaware, Newark, DE 19716, USA
* Corresponding authors.

Abstract

In this work, we developed a novel method to generate comparison trees based on characteristics collected from metabolic networks of bacteria. We characterize each bacterial genome's metabolism by the occurrence frequencies of various chemical reactions classified by enzyme commission numbers, and by the correlation of the reaction types for any two consecutive reactions in pathways present in the networks. Hypothesizing that species physiologically close to each other should show high similarity in these characteristics, we quantitatively measure the similarity using the Pearson correlation coefficient, and build comparison trees using the Neighbor-Joining algorithm. These Metabolic Characteristics (MC) based comparison trees cluster the bacteria according to their functional groups and reveal the relationship between different organisms from a physiological perspective, yielding new insights about the organisms.

Keywords: genome comparison; metabolism; Pearson correlation

I. INTRODUCTION

Classifying organisms into an ordered scheme is important in understanding the fundamental biology of life. Traditionally, trees based on 16S rRNA sequences have been the main tool for studying the molecular phylogeny of bacteria. The advent of new molecular techniques such as metabolic network reconstruction and simulation has opened new avenues for comparing organisms. Most metabolic reactions critical for the proper functioning of cells are catalyzed by enzymes [7]. However, not all enzymes occur in all species, nor are they equally important in each species. Furthermore, within the phylogeny of each species, the occurrence of enzymes is heterogeneous [5].

Because enzymes are inherent parts of metabolic networks, we can generate comparison trees based on the phylogenetic properties of enzymes to look at the relationship between different organisms from a physiological perspective. As more bacterial genomes are sequenced and the metabolic pathways of these organisms are reconstructed, it becomes possible to perform organism comparisons from a biochemical-physiological perspective. Such comparisons may yield novel insights into the evolution of metabolic pathways and may be relevant to metabolic engineering of industrial microbes. Studies in this direction focusing on individual pathways have been attempted [1, 8].

In contrast to the classical view of metabolism, where relatively isolated sets of reactions or metabolic pathways allow the synthesis and degradation of compounds, the new perspective views metabolic pathway components such as substrates, products, cofactors, and enzymes as parts of a single whole network. Because some functional properties, such as the short distance between reactions from different pathways, are visible only when metabolism is analyzed from a network perspective, it becomes less meaningful to define metabolism as a set of isolated pathways [11]. A metabolic network consists of all chemical transformations or reactions involved in metabolism in the cell, with the metabolites being interconnected by enzyme-catalyzed reactions. Many enzymes are common to numerous species while others occur only in a few. Phylogenetic analysis of these metabolic components (substrates, products, cofactors, and enzymes) may expand the understanding of the evolutionary processes [13].

In order to study metabolism as a whole, there are two complementary ways to represent the metabolic network. First, metabolism can be represented as a compound-centric network, wherein nodes (substrates and products) participating in the same reaction are connected.
Second, metabolism can be represented as an enzyme-centric network, where nodes (enzymes) producing a compound are connected with nodes consuming the same compound.

In this work, we developed a novel method to collect and utilize characteristics embedded in enzyme-centric metabolic pathway networks, and to generate comparison trees for genomes of interest. These metabolic characteristics (MC) based comparison trees cluster the organisms according to their functional groups and reveal the relationship between different organisms from a physiological perspective, yielding new insights about the organisms. We showed that where the 16S rRNA tree failed to capture some major metabolic differences between the organisms, the MC-based method captured them. Our simple and accurate approach captured the functional properties of different groups, such as pathogens and non-pathogens, and clustered them in ways not seen in the traditional 16S rRNA tree, suggesting that there are differences in metabolic capabilities between the organisms.

II. METHOD AND DATA

A. Construction of enzyme-centric metabolic networks

We constructed metabolic comparison trees by combining information about all the enzyme-catalyzed metabolic reactions in bacterial species selected from the KEGG database [5] (www.genome.jp/kegg, as of 30 June 2008). This is currently one of the most comprehensive databases available for examining metabolic pathways, along with more deeply annotated and specialized databases such as EcoCyc [6]. Bacteria were chosen for three reasons:

1. Bacterial metabolism is reasonably well understood, which allows us to identify the roles of enzymes more clearly and reliably.
2. Bacteria allow us to get a better estimate of the phylogenetic profile and overall topological position of individual enzymes, because their phylogeny is widely studied.
3. Limiting the investigation to a single major group of organisms removes the confusion that might arise if representatives of several major organism types were examined, since each major group is likely to have metabolic characteristics peculiar to it.

We developed a framework, shown in Figure 1, to obtain the distance matrix used for the construction of comparison trees. The steps are as follows:

Figure 1. A flow chart of the steps to construct the distance matrix.

Step 1: We determined the number of different enzymes (identified by EC numbers) occurring in the selected bacterial species or strains from the XML files in the KEGG database.

Step 2: We then calculated the frequency of the reaction types (EC: a.b) for all the selected species or strains, and generated a histogram (Figure 2) to see the distribution of these reaction types.

Figure 2. Frequencies of reaction types in the Vco (Vibrio cholerae O1) metabolic network.

Step 3: We generated a meta-table listing all the reactions involved in metabolism, each with its corresponding reaction type (EC: a.b), for all the bacterial species and strains from the KEGG database.

Step 4: We created a list of reactions, each with its reaction type, for each organism or strain from the XML files in the KEGG database and the meta-table generated in the previous step.

Figure 3. A snapshot of the Glycolysis metabolic pathway from the KEGG database (a), and a schematic diagram for extracting a link between reaction types.

Step 5: For each organism or strain, we then created a correlation matrix (Figure 4) containing the z-values (defined below) of the frequency with which a reaction type (EC: a.b) is followed by another reaction type (EC: x.y). We defined a link between two enzymes that participate in two successive reactions such that the product of one is the substrate of the other. If reaction R1 produces a compound A and A is the substrate of R2, a directed link from the EC number of R1 to the EC number of R2 is established. For reversible reactions, a second link from the EC number of R2 to the EC number of R1 is added, as shown in Figure 3. Reference [9] discussed visualizing metabolic pathways as a useful framework for supporting the determination of gene functions.

A z-score is a dimensionless quantity derived by subtracting the population mean from an individual raw score and then dividing the difference by the population standard deviation. This conversion process is called normalizing. The z-score is calculated using the formula:

$$z = \frac{x - \mu}{\sigma} \tag{1}$$

where x is the raw score to be standardized, μ is the mean of the population, and σ is the standard deviation of the population. The score indicates the number of standard deviations an observation lies above or below the mean. The quantity z is negative when the raw score is below the mean and positive when it is above.

Figure 4. Correlation matrix of consecutive reaction types for a species, represented as a heat map.

Step 6: Finally, the correlation matrices of all the organisms and strains are used to create the distance matrix using the Pearson correlation coefficient. A correlation is a number that measures the degree of association between two variables (X and Y). A positive value for the correlation implies a positive association (large values of X tend to be associated with large values of Y and small values of X tend to be associated with small values of Y). A negative value for the correlation implies a negative or inverse association (large values of X tend to be associated with small values of Y and vice versa). The Pearson correlation coefficient is calculated as follows:

$$r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2 \, \sum_{i} (y_i - \bar{y})^2}} \tag{2}$$

where x_i and y_i are the i-th values of the two variables X and Y, and \bar{x} and \bar{y} are their means. The correlation coefficient is always between -1 and +1. The closer the correlation is to ±1, the closer the data are to a perfect linear relationship. Therefore, the Pearson correlation coefficient as defined above can measure, to a certain degree, the similarity between two species in terms of their metabolic characteristics. To conform to the requirements for constructing comparison trees for species, the similarity measured by the Pearson correlation coefficient is converted to a measure of distance between the two species by subtracting it from one (d = 1 - r). A distance matrix is thus created for all species that are to be compared.

Note that the reaction type frequencies and correlations can also be used for detecting duplicated genes, yielding an insightful perspective on the evolution of metabolism [2]. In this work, we focus on using such information to compare genomes in a larger context, as an alternative to the 16S rRNA approach.

B. Comparison tree construction

The final comparison trees are generated in the following steps:

Step 1: We used the program NEIGHBOR from the PHYLIP package [10], which implements the Neighbor-Joining method of Saitou and Nei (1987) and the UPGMA method of clustering. It constructs the tree but does not rearrange the nodes. The Neighbor-Joining method does not assume an evolutionary clock, so in effect it produces an unrooted tree.

Step 2: Using the output tree description files from NEIGHBOR, we used RETREE [10], which is an interactive tree-plotting program. The final tree is created with this program.

Figure 5. A flow chart of the steps in generating the comparison tree from the distance matrix.
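The path from reaction links to a distance matrix described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names and the toy link lists are hypothetical, the z-scores are assumed to be taken over all entries of one organism's link-count matrix, the distance is assumed to be 1 - r, and the output layout follows the square distance-matrix format read by PHYLIP programs (a count line, then names padded to 10 characters).

```python
from collections import Counter
from math import sqrt
from statistics import fmean, pstdev


def reaction_type(ec_number):
    """Truncate a full EC number such as '2.7.1.11' to its reaction type '2.7'."""
    return ".".join(ec_number.split(".")[:2])


def link_counts(links, types):
    """Count directed links between reaction types, flattened row by row."""
    counts = Counter((reaction_type(a), reaction_type(b)) for a, b in links)
    return [counts[(s, t)] for s in types for t in types]


def z_scores(values):
    """Standardize with the population mean and standard deviation (Eq. 1).

    Assumption: the population is the full set of entries of one matrix.
    """
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences (Eq. 2)."""
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    if den == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return num / den


def distance_matrix(profiles):
    """Pairwise distances d = 1 - r (assumed conversion) between profiles."""
    names = sorted(profiles)
    matrix = [
        [0.0 if a == b else 1.0 - pearson(profiles[a], profiles[b]) for b in names]
        for a in names
    ]
    return names, matrix


def to_phylip(names, matrix):
    """Format a square distance matrix in the PHYLIP distance-file layout."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        lines.append(name[:10].ljust(10) + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy, made-up link lists (EC of producer, EC of consumer); not KEGG data.
    toy_links = {
        "orgA": [("2.7.1.11", "4.1.2.13"), ("4.1.2.13", "1.2.1.12"),
                 ("1.2.1.12", "2.7.2.3")],
        "orgB": [("2.7.1.11", "4.1.2.13"), ("4.1.2.13", "2.7.1.11"),
                 ("4.1.2.13", "1.2.1.12"), ("1.2.1.12", "2.7.2.3")],
        "orgC": [("1.2.1.12", "1.2.4.1"), ("2.7.1.40", "1.2.4.1")],
    }
    types = ["1.2", "2.7", "4.1"]
    profiles = {org: z_scores(link_counts(links, types))
                for org, links in toy_links.items()}
    names, matrix = distance_matrix(profiles)
    print(to_phylip(names, matrix))
```

The text printed by `to_phylip` is intended as input for a distance-based tree program such as NEIGHBOR. In practice the link lists would be derived from the KEGG pathway files rather than written by hand, and the list of reaction types would cover every EC: a.b class observed across the organisms being compared.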

III. RESULTS

A. Comparison of the bacterium Escherichia coli and its strains

Escherichia coli is a widely studied intestinal bacterium and an ideal platform for understanding cell genomics and metabolic capabilities. Because 16S rRNA has several regions containing highly conserved sequences, slight differences in the 16S rRNA sequences from different organisms can be used to determine their phylogenetic relationships. The 16S rRNA tree for the 10 E. coli strains was constructed as follows. The 16S ribosomal sequences were downloaded from the NCBI website in FASTA format and aligned using ClustalW multiple sequence alignment. The tree was then generated using the neighbor-joining algorithm and rendered in TREEVIEW for visual depiction. It is clear from the tree, shown in Figure 6, that the E. coli strains have highly similar 16S rRNA sequences, as indicated by the tight clustering with short branch lengths. We constructed an MC-based comparison tree, shown in Figure 7, by combining information about all enzyme-catalyzed metabolic reactions from the KEGG database (30 June 2008) for our 10 E. coli representatives.

TABLE 1. THE FREQUENCY COUNT OF FOUR REACTION TYPES FOR THREE E. COLI K-12 STRAINS

Reaction type (EC: a.b)    eco    ecj    ecd
1.1                        252    254    125
1.2                        152    152     81
1.3                        188    184     99
2.4                        179    180     89

Figure 6. 16S rRNA tree for E. coli strains with Yersinia pestis CO92 as outgroup.

Figure 7. MC-based comparison tree for E. coli strains with Y. pestis CO92 as outgroup.

Figure 8. Histogram of reaction type frequencies for three E. coli K-12 strains.

Table 1 shows that there are major differences between the reaction type (EC: a.b) frequencies of E. coli K-12 DH10B and those of two other E. coli K-12 strains, MG1655 and W3110. It is worth pointing out that the strain DH10B has a distinct reaction type frequency pattern in comparison to the strains MG1655 and W3110 (Table 1 and Figure 8).
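The size of the difference in Table 1 can be made concrete with a short calculation. The sketch below copies the counts from Table 1 and divides each strain's count by the count in the eco column; the helper name `ratio_to_reference` is hypothetical, and reading the ecd column as DH10B (its KEGG organism code) is an assumption based on the surrounding text.

```python
# Reaction-type counts copied from Table 1 (columns eco, ecj, ecd).
table1 = {
    "1.1": {"eco": 252, "ecj": 254, "ecd": 125},
    "1.2": {"eco": 152, "ecj": 152, "ecd": 81},
    "1.3": {"eco": 188, "ecj": 184, "ecd": 99},
    "2.4": {"eco": 179, "ecj": 180, "ecd": 89},
}


def ratio_to_reference(table, strain, reference):
    """Per reaction type, the strain's count divided by the reference count."""
    return {rt: row[strain] / row[reference] for rt, row in table.items()}


if __name__ == "__main__":
    for strain in ("ecj", "ecd"):
        ratios = ratio_to_reference(table1, strain, "eco")
        print(strain, {rt: round(r, 2) for rt, r in ratios.items()})
```

For these four reaction types the ecj counts stay within a few percent of the eco counts, while the ecd counts are roughly half of them.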
It is clear that the 16S rRNA tree fails to capture the metabolic differences between these strains, while the MC-based tree does. Moreover, the MC-based method also successfully captures the functional properties of different pathogenic types. For example, our method grouped together the uropathogenic E. coli (UPEC) strains, the most common cause of non-hospital-acquired urinary tract infections. Similarly, enterohemorrhagic E. coli (EHEC) strains, which are the primary cause of hemorrhagic colitis or bloody diarrhea, were clustered together. These clusters of functional groups, largely absent from the 16S rRNA tree (Figure 6), suggest that there are differences in metabolic capabilities between the strains.

B. Comparison of Pseudomonas, Psychrobacter, and Acinetobacter, with Shewanella oneidensis as an outgroup

We next examined the phylogeny of Pseudomonas, Psychrobacter, and Acinetobacter, with Shewanella oneidensis taken as an outgroup. The main reason to analyze these species is their diversity, which is clearly visible in the 16S rRNA tree shown in Figure 9. Unlike the 16S rRNA tree for E. coli (Figure 6), this tree has long branch lengths. Pseudomonas species are ubiquitous in nature and include many pathogens that infect plants and humans. As these bacteria do not need any organic growth factor, they can grow under several different conditions. Psychrobacter, on the other hand, are cold-adapted organisms, some of them from extremely low temperature environments.

TABLE 2. THE REACTION TYPE FREQUENCY COUNT FOR FOUR ACINETOBACTER AND ONE PSEUDOMONAS

Reaction type (EC: a.b)    aci    psa    acb    aby    abm
1.1                        252    193    197    247    257
2.3                        132    187    200    173    170
2.7                        344    252    255    300    296
3.2                        139     65     80    147    147
6.1                          7     33     27      7      7

Acinetobacter is an aquatic organism that thrives in hospital environments and in hospitalized patients. Members of the genus are highly versatile and omnipresent in nature. The metabolic diversity of these organisms and their strains thus makes them a very interesting group to study. The 16S rRNA tree, shown in Figure 9, is divided into two main clusters, with Pseudomonas in one cluster. The other cluster is further divided into subclusters containing Psychrobacter and Acinetobacter. In comparison to the 16S rRNA tree, our MC-based tree, shown in Figure 10, takes account of the genomic versatility of these organisms and their strains. One of the major observations in the MC-based tree in comparison to the 16S rRNA based tree is the clustering of Acinetobacter sp. ADP1 with other Pseudomonas and the grouping of Acinetobacter baumannii SDF and Acinetobacter baumannii AYE in a separate cluster. We justified the clustering of Acinetobacter baumannii with Pseudomonas stutzeri, and of Acinetobacter baumannii SDF with Acinetobacter baumannii AYE, by analyzing the reaction type frequencies of these strains as shown in Table 3.

TABLE 3. THE REACTION TYPE FREQUENCY COUNT FOR FOUR PSEUDOMONAS STRAINS

Reaction type (EC: a.b)    ppu    pfl    pst    psb
1.10                        11     11     15     15
2.3                        153    136    168    171
3.3                         22     22     20     15
4.4                         25     22     16     17

We have also shown that Acinetobacter sp. ADP1, Acinetobacter baumannii SDF, and Acinetobacter baumannii AYE are clustered separately from the Pseudomonas. Another interesting observation is the clustering of Pseudomonas putida KT2440 with Pseudomonas fluorescens Pf-5 in the MC-based tree. Both of these organisms have agricultural applications as biocontrol agents, and both include many strains able to suppress agricultural pathogens. Clearly the traditional 16S rRNA tree did not capture this functionality.

IV. CONCLUSION

In an attempt to classify organisms into an ordered scheme, and thereby better understand the biological processes and functional metabolic differences among them, we have developed a framework to analyze organisms based not only on their phylogeny but also on the functional differences in their metabolism. We measured the distance between two species using the Pearson correlation coefficient. This score was calculated using the information about all enzyme-catalyzed metabolic reactions in the bacterial species selected from the KEGG database. It was used to create the distance matrix from which we built the enzyme-centric metabolic comparison trees. We showed that the 16S rRNA tree failed to capture some major metabolic differences between the organisms, while the MC-based method captured them.
Our simple and accurate approach captured the functional properties of different groups, such as pathogens and non-pathogens, and clustered them in ways not seen in the traditional 16S rRNA tree, suggesting that there are differences in metabolic capabilities between the organisms.

ACKNOWLEDGMENT

Research in EFB's laboratory is funded by UDRF 2008-2009 and USDA NRI CSREES grants. The authors are grateful for comments made by the anonymous reviewers, particularly for bringing to our attention a relevant paper by Lindroos and Andersson.

REFERENCES

[1] T. Dandekar, S. Schuster, B. Snel, M. Huynen, and P. Bork, Pathway alignment: application to the comparative analysis of glycolytic enzymes, Biochem. J., vol. 343, pp. 115-124, 1999.
[2] J.J. Diaz-Mejia, E. Perez-Rueda, and L. Segovia, A network perspective on the evolution of metabolism by gene duplication, Genome Biology, 8:R26, 2007.
[3] W.M. Fitch, Construction of phylogenetic trees, Science, vol. 155, pp. 279-284, 1967.
[4] M.A. Huynen and P. Bork, Measuring genome evolution, Proc. Natl Acad. Sci. USA, vol. 95, pp. 5849-5856, 1998.
[5] M. Kanehisa, S. Goto, S. Kawashima, and A. Nakaya, The KEGG databases at GenomeNet, Nucleic Acids Res., vol. 30, pp. 42-46, 2002.
[6] P.D. Karp, Pathway databases: a case study in computational symbolic theories, Science, vol. 293, pp. 2040-2044, 2001.
[7] A.L. Lehninger, D.L. Nelson, and M.M. Cox, Principles of Biochemistry, 2nd edition, Worth Publishers, Inc., 1993.
[8] L. Liao, S. Kim, and J-F. Tomb, Genome comparisons based on profiles of metabolic pathways, Proc. The Sixth International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES 2002), pp. 469-476, September 2002, Crema, Italy.
[9] H. Lindroos and S.G.E. Andersson, Visualizing metabolic pathways: comparative genomics and expression analysis, Proceedings of the IEEE, vol. 90, pp. 1793-1802, 2002.
[10] http://evolution.genetics.washington.edu/phylip.html.
[11] S. Schuster, D.A. Fell, and T. Dandekar, A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks, Nat. Biotechnol., vol. 18, pp. 326-332, 2000. doi:10.1038/73786.
[12] Studier and Keppler, A note on the neighbor-joining algorithm of Saitou and Nei, Molecular Biology and Evolution, vol. 5, pp. 729-731, 1988.
[13] S. Zhang, L. Liao, J-F. Tomb, and J.T.L. Wang, Clustering and classifying enzymes in metabolic pathways: some preliminary results, Proc. ACM SIGKDD Workshop on Data Mining in Bioinformatics, pp. 19-24, Edmonton, Canada, 2002.
[14] W.C. Hwang, W.H. Lin, A.J. Davis, F. Jordan, H.T. Yang, and M.J. Hwang, A network perspective on the topological importance of enzymes and their phylogenetic conservation, BMC Bioinformatics, 2007, vol. 8, pp. 212. doi:10.1186/1471-2105-8-121.

Figure 9. 16S rRNA tree for Pseudomonas, Psychrobacter, and Acinetobacter with Shewanella oneidensis as outgroup.

Figure 10. The MC-based comparison tree for Pseudomonas, Psychrobacter, and Acinetobacter with Shewanella oneidensis as an outgroup.