Gene Family Content-Based Phylogeny of Prokaryotes: The Effect of Criteria for Inferring Homology

Syst. Biol. 54(2):268-276, 2005. Copyright © Society of Systematic Biologists. ISSN: 1063-5157 print / 1076-836X online. DOI: 10.1080/10635150590923335

AUSTIN L. HUGHES (1), VIKRAM EKOLLU (2), ROBERT FRIEDMAN (1), AND JOHN R. ROSE (2)
(1) Department of Biological Sciences and (2) Department of Computer Science and Engineering, University of South Carolina, Columbia, South Carolina 29205, USA; E-mail: austin@biol.sc.edu (A.L.H.)

Abstract. A number of recent papers have suggested that gene family content can be used to resolve phylogenies, particularly in the case of prokaryotes, in which extensive horizontal gene transfer means that individual gene phylogenies may not mirror the organismal phylogeny. However, no study has yet examined how sensitive such analyses are to the criterion of homology assessment used to assemble multigene families. Using data from 99 completely sequenced prokaryotic genomes, we examined the effect of homology criteria in phylogenetic analyses wherein presence or absence of each family in the genome was used as a cladistic character. Different criteria resulted in evidence for contradictory tree topologies, sometimes with high bootstrap support. A moderately strict criterion seemed best for assembling multigene families in a biologically meaningful way, but it was not necessarily preferable for phylogenetic analysis. Instead, a very strict criterion, which broke up gene families into smaller subfamilies, seemed to have advantages for phylogenetic purposes. The poor performance of gene family content-based phylogenetic analysis in the case of prokaryotes appears to reflect high levels of homoplasy resulting not only from horizontal gene transfer but also, more importantly, from extensive parallel loss of gene families in certain bacterial genomes. [Gene content; gene families; gene loss; horizontal gene transfer; phylogenetic methods.]

The availability of a large number of complete sequences of prokaryotic genomes holds promise for further resolving the phylogenetic relationships among major prokaryotic groups. However, there is evidence that horizontal gene transfer (HGT) may have been a frequent occurrence in prokaryotic evolution, which would imply that the phylogeny of individual genes may not reflect the organismal phylogeny (Daubin et al., 2003; Kunin and Ouzounis, 2003; Lerat et al., 2003; Mirkin et al., 2003; Wolf et al., 2002). For this reason, a number of investigators have advocated approaches to prokaryotic phylogeny based on so-called gene content (Snell et al., 1999), that is, the presence or absence of gene families in genomes, which might be more properly called gene family content. Often, gene family content analyses have made use of various distances based on the proportion of shared gene families (Snell et al., 1999). Because of the ad hoc character of such distances, some authors (e.g., Gu, 2000; Huson and Steel, 2004) have proposed a maximum likelihood approach to this question. However, because developing a biologically accurate model of gene family gain and loss is problematic, a number of authors have applied parsimony to gene family content analyses (Wolf et al., 2001, 2004; Hughes and Friedman, 2004). Whatever method of phylogenetic reconstruction is used, any analysis based on gene family content faces a problem in defining gene families. When families are defined in automated fashion, some search criterion based on the extent of sequence similarity must be used; but the effect of the choice of search criterion on the results of phylogenetic analyses has so far not been studied. In the present paper, we apply a range of different homology criteria to establish gene families in order to examine the sensitivity of such analyses to the criteria used in assigning family membership. 
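
As a concrete illustration of what "gene family content" means, the following minimal sketch (not part of the original study) scores each genome for presence (1) or absence (0) of each family and computes one possible distance based on the proportion of shared families. The genome names, family identifiers, and the normalization by the smaller genome are assumptions chosen purely for illustration.

```python
# Hypothetical toy input: genome name -> set of gene family identifiers present.
# All names and identifiers are invented for illustration only.
genomes = {
    "genome_A": {"fam1", "fam2", "fam3", "fam4"},
    "genome_B": {"fam1", "fam2", "fam5"},
    "genome_C": {"fam2", "fam6"},
}


def presence_absence_matrix(genomes):
    """Return (sorted family list, {genome: [0/1 score per family]})."""
    families = sorted(set().union(*genomes.values()))
    matrix = {
        name: [1 if fam in present else 0 for fam in families]
        for name, present in genomes.items()
    }
    return families, matrix


def shared_family_distance(a, b):
    """Distance = 1 - (shared families / families in the smaller genome).

    The choice of denominator is an assumption; other normalizations
    are possible.
    """
    shared = len(a & b)
    return 1.0 - shared / min(len(a), len(b))


families, matrix = presence_absence_matrix(genomes)
dist_ab = shared_family_distance(genomes["genome_A"], genomes["genome_B"])
dist_ac = shared_family_distance(genomes["genome_A"], genomes["genome_C"])
```

Each row of `matrix` is the kind of 0/1 character vector that a parsimony or distance method would take as input; how families are delimited in the first place determines which columns exist at all.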
METHODS

We analyzed 99 complete genomes of prokaryotes, 16 from Archaea and 83 from Bacteria (see Appendix 1 for accession numbers). References to currently accepted taxonomy of these species followed the Bergey's Manual Trust Web site (http://www.cme.msu.edu/bergeys/outline.prn.pdf). We assembled gene families by inferred homology from searches applied to predicted protein translations using the BLASTCLUST software available in the BLAST software package (Altschul et al., 1997). This program identifies families by a single-linkage method, which assembles larger families by linking shared genes among families, thus ensuring that a given gene will be assigned to only one family. Sequence homology was established by identifying matches using a conservative E-value of 10^-6.

We used six different criteria for scoring a match between two sequences: (1) a minimum of 10% sequence identity across at least 30% of the two sequences; (2) a minimum of 20% sequence identity across at least 40% of the two sequences; (3) a minimum of 30% sequence identity across at least 50% of the two sequences; (4) a minimum of 40% sequence identity across at least 60% of the two sequences; (5) a minimum of 50% sequence identity across at least 70% of the two sequences; and (6) a minimum of 60% sequence identity across at least 80% of the two sequences. We refer to these criteria, respectively, as 10/30, 20/40, 30/50, 40/60, 50/70, and 60/80.

Using these specified homology criteria, all predicted proteins in the 99 genomes were assigned to families. Families having only a single member were excluded from the analyses. For each remaining family, each genome was scored for presence (1) or absence (0). Maximum parsimony (MP) analysis, using heuristic search by simple stepwise addition (Swofford, 2002), was applied to the resulting matrix, in which protein families corresponded to characters. MP trees were rooted on the assumption that Archaea constitute an outgroup to Bacteria. Bootstrapping (1000 replicates) (Felsenstein, 1985) was used to assess the extent to which clustering patterns in the MP tree received support from the data set as a whole.

In order to assess the nature of the phylogenetic signal in the data sets assembled under different homology criteria, we computed the amount of possible synapomorphy (APS) (Farris, 1989; Simmons et al., 2004). For each parsimony-informative character, APS is defined as the difference between the maximum and minimum number of possible steps for that character. Characters with high APS can potentially be used to resolve deep internal branches of the phylogenetic tree, whereas those with low APS can only resolve outer branches. Thus, the average APS across all informative characters provides information regarding the potential for resolution of deep branches.

In order to examine the tree-like nature of the signal in each data set, we calculated NeighborNet splits graphs using SplitsTree 4.0 (Bryant and Moulton, 2004; Huson, 1998) from a matrix of p-distances (proportion of difference) among genomes, derived from the matrix of 1s and 0s. This approach allowed a heuristic visualization of the extent of conflicting signals in the data, as homology criteria were changed.

RESULTS

The different search criteria led to differences in definition and membership of families (Table 1). As the strictness of the criterion increased, the mean number of genes per genome assigned to families decreased (Table 1). This evidently occurred because increasingly strict homology criteria led to an increase in the number of singletons, i.e., single genes not assigned to membership in any family. The mean number of families per genome was lowest with the least strict criterion (10/30), then increased as the criterion became stricter, reaching a maximum at 40/60, and then declined as the criterion became still stricter (Table 1). The mean number of genes per family decreased as a function of increasing strictness of the homology criterion (Table 1).

TABLE 1. Properties of gene families identified by different homology criteria.

Genome characteristic                       10/30              20/40              30/50              40/60              50/70              60/80
Number of genes in families per genome (a)  2460 ± 146         2443 ± 144         2349 ± 141         2101 ± 132         1822 ± 128         1555 ± 129
Number of families per genome (a)           837 ± 42           955 ± 46           1411 ± 65          1733 ± 100         1656 ± 115         1460 ± 122
Genes/family (a)                            2.86 ± 0.05        2.46 ± 0.04        1.58 ± 0.03        1.19 ± 0.01        1.10 ± 0.01        1.10 ± 0.01
Correlation between genes/family and
  genome size (bp)                          0.634 (P < 0.001)  0.754 (P < 0.001)  0.888 (P < 0.001)  0.699 (P < 0.001)  0.302 (P = 0.002)  0.018 (N.S.)

(a) Mean ± SE for 99 prokaryotic genomes.

Under most criteria, the mean number of genes per family in a genome was correlated with genome size (in bp). This correlation was strongest with the 30/50 criterion (Table 1), in which case a close linear relationship was observed (Fig. 1A). However, under the strictest criterion (60/80), there was not a significant relationship between the mean number of genes per family and genome size (Table 1 and Fig. 1B). This evidently occurred because, under the strictest criterion, families were broken up to the point that relatively few families had more than a single member in any given genome.

FIGURE 1. Scatter plots showing the relationship between number of genes per family and genome size in 99 prokaryotes: (A) when families were assembled under the 30/50 homology criterion; and (B) when families were assembled under the 60/80 homology criterion.

TABLE 2. Summary of MP analyses based on gene families identified by different search criteria.

Criterion                     10/30           20/40           30/50           40/60           50/70           60/80
No. informative characters    12,919          13,965          19,131          31,908          40,366          43,890
No. MP trees found            2               2               1               2               1               6
Changes
  1 to 0                      4975 (15.7%)    5726 (16.4%)    8659 (17.7%)    9372 (14.0%)    8110 (11.9%)    4210 (6.9%)
  0 to 1                      26,814 (84.3%)  29,113 (83.6%)  40,233 (82.4%)  57,346 (86.0%)  59,892 (88.1%)  56,676 (93.1%)
  Total                       31,789          34,839          48,892          66,718          68,002          60,886
Consistency index (a)         0.456           0.442           0.439           0.534           0.673           0.809
Mean d_T (b)                  42.8            36.2            36.2            41.6            43.2            50.8
Significant branches (c)
  Terminal pair               28 (47.4%)      28 (49.1%)      29 (47.5%)      32 (43.8%)      31 (44.9%)      27 (42.2%)
  Internal                    31 (52.6%)      29 (50.9%)      32 (52.5%)      41 (56.2%)      38 (55.1%)      37 (57.8%)
  Total                       59              57              61              73              69              64
Mean APS (d) (± SE)           4.23 ± 0.06     4.46 ± 0.06     4.64 ± 0.06     3.71 ± 0.04     2.82 ± 0.02     2.20 ± 0.01

(a) Excluding noninformative sites.
(b) Mean topological distance (d_T) to MP trees found under all other criteria.
(c) Defined as a branch receiving at least 95% support in 1000 bootstrap samples.
(d) APS = amount of possible synapomorphy (per character).

Table 2 summarizes results of phylogenetic analyses conducted using the data sets assembled under the different homology criteria. The number of informative characters (i.e., families) available for analyses increased as the strictness of the criterion increased (Table 2). The consistency index (CI) decreased, reaching a minimum at 30/50, then increased sharply as the criteria increased in strictness (Table 2).
This pattern evidently occurred because the proportion of hypothesized changes involving loss of a family (character changes from 1 to 0) was highest with the 30/50 criterion. Under the 30/50 criterion, large families were broken up but not excessively so. Thus there were fewer gains of families (character change from 0 to 1) relative to losses under this criterion, and families including both gains and losses contributed to the reduction in CI. With more liberal criteria, fewer distinct families were identified; thus, both gains and losses were reduced. In contrast, with stricter criteria, an increasingly large number of families were identified, leading to very few losses of families and a large number of gains (Table 2).

Regarding bootstrap support for branches within the trees, the number of significant branches (95% support or better) did not change in a consistent way as a function of the strictness of the homology criterion (Table 2). Both the number of significant branches and the number of significant internal branches (i.e., those deeper than the branch leading to a terminal pair) were highest with the 40/60 criterion.

The mean APS (per informative character) differed significantly among the six criteria (one-way analysis of variance [ANOVA]; F = 844.24; df = 5, 161,903; P < 0.001) (Table 2). Mean APS increased slightly with increasing strictness of the homology criterion from 10/30 to 30/50, then decreased as the criterion became increasingly strict (Table 2). As a result, the mean APS for 60/80 was less than half that for 30/50 (Table 2). These results imply that, using a criterion of intermediate strictness, there was maximal potential information for resolving deep internal branches, whereas with an extremely strict criterion a greater proportion of information was available for resolving terminal branches.

Figure 2 illustrates the single MP tree based on the moderate 30/50 criterion.
As in all MP trees found under all search criteria, Archaea clustered apart from Bacteria (Fig. 2). In addition, as in all MP trees found under all criteria, closely related species (such as congeners) clustered together, usually with strong bootstrap support (Fig. 2). In the Bacteria, certain members of recognized higher level taxonomic groups clustered together, although monophyly of previously recognized higher level groupings was generally not supported. For example, the order Bacillales (including Bacillus and related genera) formed a well-supported monophyletic group (Fig. 2). However, the phylum Firmicutes, in which Bacillales is included, did not form a monophyletic cluster. Mycoplasma and Ureaplasma, traditionally included in Firmicutes, clustered apart from the cluster including most Firmicutes. In addition, the cluster including most Firmicutes also included Fusobacterium (Fig. 2), which is assigned to a separate phylum (Fusobacteria). Similarly, there was a well-supported cluster that included many genera assigned to the phylum Proteobacteria, such as Escherichia, Agrobacterium, and Ralstonia (Fig. 2). However, the groupings within this cluster did not correspond to the currently accepted classes Alphaproteobacteria, Betaproteobacteria, and Gammaproteobacteria (Fig. 2). In addition, Rickettsia and Buchnera, traditionally assigned to Proteobacteria, fell outside this cluster (Fig. 2).

FIGURE 2. Single MP tree constructed under the 30/50 homology criterion (for details see Table 2). Symbols on the branches indicate the strength of bootstrap support: open circles, 95% to 98%; closed circles 99%.

In the six MP trees based on the strict 60/80 criterion, Rickettsia and Buchnera clustered strongly with Proteobacteria (Fig. 3). On the other hand, Firmicutes were not recovered as a monophyletic group, because Mycoplasma and Ureaplasma fell outside the cluster with other genera traditionally assigned to Firmicutes (Fig. 3).

Figure 4 shows the strict consensus of all MP trees found with the different criteria used. In this consensus tree, most deep-branching patterns were unresolved (Fig. 4). Only 46 branches received significant bootstrap support, a much lower figure than in any of the individual trees constructed on the basis of individual homology criteria (Table 2). Of these, only 19 (41%) represented deep branches (i.e., not branches subtending terminal pairs), again a much lower figure than in any individual tree (Table 2). The fact that certain deep branches were not resolved in the consensus tree but received significant bootstrap support in individual trees implies that the trees constructed on the basis of the different homology criteria frequently resolved the higher-level relationships of prokaryotes in mutually contradictory ways.

Equally illustrative of the conflicts among trees were the high mean topological distances (d_T) among the MP trees found under each criterion (Table 2). The 20/40 and 30/50 criteria were closest on average to the other criteria, while 60/80 was farthest from the other criteria (Table 2). The large average d_T values to 60/80 reflected in part the placement of both Rickettsia and Buchnera with other Proteobacteria under the latter criterion, which was not observed under any other criterion (Figs. 2, 3 and data not shown).

NeighborNet analyses produced splits graphs that corroborated the findings from the APS analyses. These show that, as the homology assessment became strict, support decreased for the internal branches that separate major clusters.
Most noticeable was the loss of phylogenetic signal separating Archaea and Eubacteria; for example, compare the graph for 30/50 with that for 60/80. (For splits graphs for all criteria, see Fig. A1, available online at the Society of Systematic Biologists web site, http://systematicbiology.org). Interestingly, at stricter levels used to infer homology, there appeared to be a higher level of bifurcation amongst terminal taxa (Fig. 4). These findings support other observations that we report and indicate that no single criterion represents all relevant phylogenetic information well.

DISCUSSION

The results presented here demonstrate that, at least in the case of prokaryotic genomes, phylogenetic analyses based on gene family content are highly sensitive to the homology criteria used to define families. The true phylogeny of these organisms is so far poorly resolved. Thus, it is not in general possible to say which of the homology criteria used produced a tree closer to the true tree. However, the fact that the trees obtained with different homology criteria were mutually contradictory did not increase confidence in the applicability of gene content analyses to the resolution of prokaryotic phylogenies.

Although parsimony was used for phylogenetic reconstruction in the present analyses, there is no reason to believe that the problems revealed here are unique to parsimony. Because all methods of analysis that have been applied to gene family content take family assignment of genes as a given, at least some of the same problems are likely to arise with distance or likelihood methods as well.

The absence of Rickettsia and Buchnera from the cluster with other Proteobacteria in the phylogeny based on the moderate 30/50 criterion (Fig. 2) suggested that parallel loss of gene families is the likely explanation for some of the observed problems.
Both Rickettsia and Buchnera have reduced genome sizes due to massive loss of gene families as an adaptation to life as obligate intracellular parasites (Andersson et al., 1998; van Ham et al., 2003). The extensive loss of gene families apparently caused these taxa to cluster nearer to other genera that have lost numerous gene families in adaptation to intracellular life, such as Mycoplasma (Himmelreich et al., 1996). Parallel gene family loss in adaptation to similar lifestyles appears to have created a sufficient degree of homoplasy that the true relationships of these organisms cannot be recovered by the method used. Previous studies have noted the problems that large-scale loss of gene families can pose for analyses based on gene family content (House and Fitz-Gibbon, 2002; Dutilh et al., 2004; Lake and Rivera, 2004). Dutilh et al. (2004) have developed a method of reducing phylogenetically discordant signals in gene family content data that appears to ameliorate the problem.

On the other hand, in our analysis based on the 60/80 homology criterion, Rickettsia and Buchnera clustered among the Proteobacteria, although Buchnera did not cluster with Gammaproteobacteria, as expected from traditional classification (Fig. 3). The strict 60/80 criterion evidently had the effect of breaking up gene families so that only proteins showing a close phylogenetic relationship were grouped in a common family (Table 2). Because of the problems of extensive parallel gene loss, these extremely subdivided families may better reconstruct relatively close evolutionary relationships than do less subdivided families, at least in the case of prokaryotes. The greatly reduced amount of possible synapomorphy (APS) per character in the case of the 60/80 criterion in comparison to more liberal criteria (Table 2) suggests that a stricter criterion provides more information suitable for resolving close relationships than do more liberal criteria.
Conversely, moderate criteria (such as 30/50) showed the highest mean APS per character (Table 2) and thus the most potential information for resolving deep branches. However, the higher APS for moderate criteria did not in practice lead to a strikingly better resolution of deep branches (compare Figs. 2 and 3). Even at this level of homology criteria NeighborNet analysis showed many contradictory internal splits. This may at least in part reflect ancient horizontal gene transfers (HGT) among major lineages. Using a stricter criterion eliminates some contradictory splits; however, accompanying this is the loss of information as the stricter criterion breaks up ancient gene families whose phylogenetic relationships may document HGT events.
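
The APS values compared above can be illustrated with a small sketch for binary presence/absence characters. It is an illustration under stated assumptions, not the authors' implementation: for a parsimony-informative binary character, the minimum number of steps on any tree is taken as 1 and the maximum as the count of the rarer state, so APS is their difference. The function names and the toy matrix are hypothetical.

```python
def aps_binary(column):
    """APS for one binary (0/1) character scored across genomes.

    Returns None for parsimony-uninformative characters, i.e. those in
    which either state occurs in fewer than two genomes.
    """
    ones = sum(column)
    zeros = len(column) - ones
    rarer = min(ones, zeros)
    if rarer < 2:
        return None
    max_steps = rarer  # worst case: every rarer-state genome needs its own change
    min_steps = 1      # best case: a single change explains the character
    return max_steps - min_steps


def mean_aps(matrix):
    """Mean APS over informative characters; rows are genomes, columns families."""
    values = [v for v in (aps_binary(col) for col in zip(*matrix)) if v is not None]
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy matrix: 6 genomes (rows) x 4 gene families (columns).
toy_matrix = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
per_character = [aps_binary(col) for col in zip(*toy_matrix)]
average = mean_aps(toy_matrix)
```

In this toy case a family present in half of the genomes has a higher APS than one present in only two, which mirrors the observation that finely subdivided families, each found in few genomes, carry less potential information about deep branches.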

FIGURE 3. Strict consensus tree of 6 MP trees constructed under the 60/80 criterion. Symbols on the branches indicate the strength of bootstrap support: open circles, 95% to 98%; closed circles 99%.

FIGURE 4. Strict consensus of all MP trees (N = 14) constructed under all six search criteria. Symbols on the branches indicate the strength of bootstrap support for a given clustering pattern in all MP trees: open circles, 95% to 98% in all trees; closed circles 99% in all MP trees.

Families assembled with a moderate criterion may provide a better representation of what is usually meant by a multigene family than do the highly subdivided families assembled by a very strict criterion. In completely sequenced eukaryotic genomes, there is a correlation between genome size and the number of genes per family (Friedman and Hughes, 2001). We found that in prokaryotic genomes also, except when the strictest criterion was used, the number of genes per family was positively correlated with genome size (Table 1 and Fig. 1A). This suggests that less strict criteria better capture the concept of a gene family as a product of within-genome gene duplications (and, in the case of prokaryotes, occasional between-genome horizontal transfers). This correlation was strongest with the 30/50 criterion, suggesting that a criterion of intermediate strictness may be optimal when the goal is to assemble gene families for purposes of reconstructing the pattern of gene duplication within a genome. On the other hand, a very liberal criterion (such as 10/30) may approximate the results of an analysis based on families of domains or protein folds (Lin and Gerstein, 2000), since a very liberal criterion is likely to group proteins that share even one domain.

With all homology criteria used, the hypothesized gains of families substantially exceeded the hypothesized losses (Table 2). Hypothesized gains of families include both the first appearance of the gene in the phylogeny and its appearance in a new part of the phylogeny as a result of an HGT event. Furthermore, as stricter homology criteria are used, an increasing number of hypothesized gains of gene families are artifacts of the break-up of large families. When subfamilies of a large family are characterized as separate families, each such family is hypothesized to make a separate first appearance in the phylogeny.
Thus, although a very strict homology criterion might be preferable for reconstructing some relationships of prokaryotic phylogeny, it would be very misleading if it were used to reconstruct the true pattern of HGT within a phylogeny.

ACKNOWLEDGMENTS

This research was supported by grant GM066710 to A.L.H. from the National Institutes of Health.

REFERENCES

Altschul, S. F., T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Andersson, S. G., A. Zomorodipour, J. O. Andersson, T. Sicheritz-Ponten, U. C. Alsmark, R. M. Podowski, A. K. Naslund, A. S. Erikson, H. H. Winkler, and C. G. Kurland. 1998. The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 396:133–140.
Bryant, D., and V. Moulton. 2004. NeighborNet: An agglomerative method for the construction of planar phylogenetic networks. Mol. Biol. Evol. 21:255–265.
Daubin, V., N. A. Moran, and H. Ochman. 2003. Phylogenetics and the cohesion of bacterial genomes. Science 301:829–832.
Dutilh, B. E., M. A. Huynen, W. J. Bruno, and B. Snel. 2004. The consistent phylogenetic signal in genome trees revealed by reducing the impact of noise. J. Mol. Evol. 58:527–538.
Farris, J. S. 1989. The retention index and the rescaled consistency index. Cladistics 5:417–419.
Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783–791.
Friedman, R., and A. L. Hughes. 2001. Pattern and timing of gene duplication in animal genomes. Genome Res. 11:1842–1847.
Gu, X. 2000. A simple evolutionary model for genome phylogeny based on gene content. Pages 515–523 in Comparative genomics (D. Sankoff and J. H. Nadeau, eds.). Kluwer Academic, Dordrecht.
Himmelreich, R., H. Hilbert, H. Plagens, E. Pirkl, B. C. Li, and R. Herrmann. 1996. Complete sequence analysis of the genome of the bacterium Mycoplasma pneumoniae. Nucleic Acids Res. 24:4420–4449.
House, C. H., and S. T. Fitz-Gibbon. 2002. Using homolog groups to create a whole-genomic tree of free-living organisms: An update. J. Mol. Evol. 54:539–547.
Hughes, A. L., and R. Friedman. 2004. Differential loss of ancestral gene families as a source of genomic divergence in animals. Proc. R. Soc. Lond. B Suppl. 271:S107–S109.
Huson, D. 1998. SplitsTree: Analyzing and visualizing evolutionary data. Bioinformatics 14:68–73.
Huson, D. H., and M. Steel. 2004. Phylogenetic trees based on gene content. Bioinformatics 20:2044–2049.
Kunin, V., and C. A. Ouzounis. 2003. The balance of driving forces during genome evolution in prokaryotes. Genome Res. 13:1589–1594.
Lake, J. A., and M. C. Rivera. 2004. Deriving the genomic tree of life in the presence of horizontal gene transfer: Conditioned reconstruction. Mol. Biol. Evol. 21:681–690.
Lerat, E., V. Daubin, and N. A. Moran. 2003. From gene trees to organismal phylogeny in prokaryotes: The case of the γ-Proteobacteria. PLoS Biol. 1:E19.
Lin, J., and M. Gerstein. 2000. Whole-genome trees based on the occurrence of folds and orthologs: Implications for comparing genomes on different levels. Genome Res. 10:808–818.
Mirkin, B. G., T. I. Fenner, M. Y. Galperin, and E. V. Koonin. 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol. 3:2.
Simmons, M. P., T. G. Carr, and K. O'Neill. 2004. Relative character-state space, amount of potential phylogenetic information, and heterogeneity of nucleotide and amino acid characters. Mol. Phyl. Evol. 32:913–926.
Snel, B., P. Bork, and M. A. Huynen. 1999. Genome phylogeny based on gene content. Nat. Genet. 21:108–110.
Swofford, D. L. 2002. PAUP*: Phylogenetic analysis using parsimony (*and other methods). Sinauer, Sunderland, Massachusetts.
Van Ham, R. C. J., J. Kamerbeek, C. Palacios, C. Rausell, F. Abascal, U. Bastolla, J. M. Fernández, L. Jiménez, M. Postigo, F. J. Silva, J. Tamames, E. Viguera, A. Latorre, A. Valencia, F. Morán, and A. Moya. 2003. Reductive genome evolution in Buchnera aphidicola. Proc. Natl. Acad. Sci. USA 100:581–586.
Wolf, Y. I., I. B. Rogozin, N. V. Grishin, and E. V. Koonin. 2002. Genome trees and the tree of life. Trends Genet. 18:472–479.
Wolf, Y. I., I. B. Rogozin, N. V. Grishin, R. L. Tatusov, and E. V. Koonin. 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. 1:8.
Wolf, Y. I., I. B. Rogozin, and E. V. Koonin. 2004. Coelomata and not Ecdysozoa: Evidence from genome-wide phylogenetic analysis. Genome Res. 14:29–36.

First submitted 22 December 2003; reviews returned 6 August 2004; final acceptance 31 October 2004
Associate Editor: Peter Lockhart
Editor: Chris Simon

APPENDIX 1

Genome sequences and accession numbers used in analyses:

1. Halobacterium sp. NC_002607
2. Thermoplasma acidophilum NC_002578
3. Thermoplasma volcanium NC_002689
4. Aeropyrum pernix NC_000854
5. Pyrobaculum aerophilum NC_003364
6. Sulfolobus solfataricus NC_002754
7. Sulfolobus tokodaii NC_003106
8. Pyrococcus furiosus NC_003413
9. Pyrococcus abyssi NC_000868
10. Pyrococcus horikoshii NC_000961
11. Archaeoglobus fulgidus NC_000917
12. Methanosarcina acetivorans NC_003552
13. Methanosarcina mazei NC_003901
14. Methanococcus jannaschii NC_000909
15. Methanobacterium thermoautotrophicum NC_000916
16. Methanopyrus kandleri NC_003551
17. Tropheryma whipplei NC_004551
18. Buchnera aphidicola Bp NC_004545
19. Buchnera aphidicola Sg NC_004061
20. Buchnera sp. APS NC_002528
21. Chlamydia trachomatis NC_000117
22. Chlamydia pneumoniae NC_002620
23. Chlamydophila pneumoniae CWL029 NC_000922
24. Chlamydophila pneumoniae J138 NC_002491
25. Borrelia burgdorferi NC_001318
26. Treponema pallidum NC_000919
27. Mycoplasma pulmonis NC_002771
28. Mycoplasma genitalium NC_000908
29. Mycoplasma pneumoniae NC_000912
30. Mycoplasma penetrans NC_004432
31. Ureaplasma urealyticum NC_002162
32. Rickettsia conorii NC_003103
33. Rickettsia prowazekii NC_000963
34. Campylobacter jejuni NC_002163
35. Helicobacter pylori 26695 NC_000915
36. Helicobacter pylori J99 NC_000921
37. Aquifex aeolicus NC_000918
38. Chlorobium tepidum NC_002932
39. Thermosynechococcus elongatus NC_004113
40. Nostoc sp. NC_003272
41. Synechocystis sp. BA000022
42. Nitrosomonas europaea NC_004757
43. Xylella fastidiosa NC_002488
44. Xanthomonas campestris NC_003902
45. Xanthomonas axonopodis NC_003919
46. Pseudomonas aeruginosa NC_002516
47. Ralstonia solanacearum NC_003295
48. Caulobacter crescentus NC_002696
49. Brucella melitensis NC_003317
50. Brucella suis NC_004310
51. Bradyrhizobium japonicum NC_004463
52. Mesorhizobium loti NC_002678
53. Sinorhizobium meliloti NC_003047
54. Agrobacterium tumefaciens C58 NC_003062
55. Agrobacterium tumefaciens C58 UW NC_003304
56. Neisseria meningitidis MC58 NC_003112
57. Neisseria meningitidis Z2491 NC_003116
58. Haemophilus influenzae NC_000907
59. Pasteurella multocida NC_002663
60. Shewanella oneidensis NC_004347
61. Vibrio cholerae NC_002505
62. Vibrio parahaemolyticus NC_004603
63. Yersinia pestis CO92 NC_003143
64. Yersinia pestis KIM NC_004088
65. Salmonella enterica NC_003198
66. Salmonella typhimurium NC_003197
67. Escherichia coli K12 NC_000913
68. Escherichia coli O157H7 NC_002695
69. Escherichia coli O157H7 EDL933 NC_002655
70. Deinococcus radiodurans NC_001263
71. Streptomyces avermitilis NC_003155
72. Streptomyces coelicolor NC_003888
73. Corynebacterium efficiens NC_004369
74. Mycobacterium leprae NC_002677
75. Mycobacterium tuberculosis CDC1551 NC_002755
76. Mycobacterium tuberculosis H37Rv NC_000962
77. Thermotoga maritima NC_000853
78. Thermoanaerobacter tengcongensis NC_003869
79. Clostridium acetobutylicum NC_003030
80. Clostridium perfringens NC_003366
81. Fusobacterium nucleatum NC_003454
82. Staphylococcus aureus MW2 NC_003923
83. Staphylococcus aureus Mu50 NC_002758
84. Staphylococcus aureus N315 NC_002745
85. Listeria innocua NC_003212
86. Listeria monocytogenes NC_003210
87. Oceanobacillus iheyensis NC_004193
88. Bacillus halodurans NC_002570
89. Bacillus subtilis NC_000964
90. Lactobacillus plantarum NC_004567
91. Lactococcus lactis NC_002662
92. Streptococcus pneumoniae R6 NC_003098
93. Streptococcus pneumoniae NC_003028
94. Streptococcus agalactiae 2603VR NC_004116
95. Streptococcus agalactiae NEM316 NC_004368
96. Streptococcus pyogenes NC_002737
97. Streptococcus pyogenes MGAS8232 NC_003485
98. Streptococcus pyogenes MGAS315 NC_004070
99. Streptococcus pyogenes SSI1 NC_004606