Similarity Matrix Generation

To compare the proteins of the phylogenomic matrix, we needed a similarity measure. Hamming distances between phylogenetic profiles require thresholds for presence/absence quantification and have been shown to underperform other metrics [1]. Pearson correlations between raw bit score vectors are an improvement in that they allow degrees of sensitivity; however, they are arbitrarily sensitive to large values in a given column, which occur with some frequency given the long right tail of the bit score distribution [2]. Mutual information [3] is an attractive alternative for naturally discrete data sets, but it is less desirable for continuous data because it requires the estimation of a joint density for each pairwise comparison and is quite sensitive to histogramming criteria [4]. Robust correlation methods such as the Maronna or Quadrant Correlation algorithms [5] are theoretically attractive, but they are computationally intensive and too slow to handle large data sets. Thus, as a compromise between robustness and simplicity, we used Spearman's rank correlation (with tie correction), rather than the Pearson correlation of raw scores, to compute similarities between rows of the phylogenomic matrix.

Using ranks rather than raw scores has several advantages. First, ranks are a natural representation for the ordered data returned from a BLASTP [2] result. Second, ranks are invariant under monotonically increasing transformations such as positive scaling or translation. Finally, rank correlations are a proven and well-characterized way to extract robust similarity information from data sets where the use of Pearson correlation may be questionable. We thus computed rank correlations between all pairs of profiles and retained the top 50 similarity values for each profile as our similarity matrix.
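As a minimal sketch of this step (not the code used in the study), Spearman's correlation with tie correction can be computed as the Pearson correlation of average ranks, after which only the k largest values per profile are kept. The function names `average_ranks`, `pearson`, and `top_k_spearman` are hypothetical, and treating a constant profile as having zero correlation is an assumption made here for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def top_k_spearman(profiles, k=50):
    """Map each row index to its k most similar rows as (index, rho) pairs."""
    ranked = [average_ranks(row) for row in profiles]
    result = {}
    for i, ri in enumerate(ranked):
        sims = [(j, pearson(ri, rj)) for j, rj in enumerate(ranked) if j != i]
        sims.sort(key=lambda t: t[1], reverse=True)
        result[i] = sims[:k]
    return result
```

Because only ranks enter the calculation, a single very large bit score changes the result no more than any other top-ranked entry would. The all-pairs loop is quadratic in the number of proteins, so a genome-scale run would need a vectorized implementation.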

Map Construction

The purpose of 2-dimensional ordination is to enable human visualization of high-dimensional data. Given a similarity matrix computed from a data matrix, we seek to place proteins with similar profiles near each other on a 2-dimensional map. As each protein in the original phylogenomic matrix has more than 200 dimensions, a projection into 2 dimensions reduces the dimensionality by more than two orders of magnitude. This data reduction is possible for two reasons. First, the original phylogenomic data matrix is sparse: the majority of entries are zero. Second, heuristically limiting the algorithm to the top 50 similarity scores allows it to preserve neighborhoods (local accuracy) at the expense of global integrity. In the context of phylogenomic mapping, this means that the 2-dimensional placement is optimized to show groups of proteins that are likely to interact on the basis of coinheritance.

In brief, the dimension reduction problem was framed as an optimization problem: find the representation of a high-dimensional set in a low-dimensional space that keeps similar pairs of proteins as close together as possible. The phylogenomic matrix was used to generate a square matrix of similarities (in our case, rank correlations). This similarity matrix was used to produce a quadratic form, whose minimization, subject to an appropriate constraint, was the multidimensional scaling step of the ordination algorithm. The result of this step was an initial placement of proteins in 2 dimensions. A force-directed placement algorithm was then applied to refine this initial placement and obtain a final set of (x,y) coordinates in the plane for each protein. We used VxInsight (Viswave LLC) to perform both ordination and visualization [6]. Fully interactive phylogenomic maps for more than 200 bacteria are available online at http://myxococcus.syr.edu/phylo.
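The exact quadratic form and constraint are not given in the text. For orientation only, one common formulation of such a spectral placement is the graph-Laplacian form below. It assumes symmetric, nonnegative weights w_ij taken from the retained similarities, and the symbols W, D, L, and X are introduced here for illustration, not taken from the VxInsight implementation.

```latex
\begin{aligned}
\min_{X \in \mathbb{R}^{n \times 2}} \quad
  & \frac{1}{2} \sum_{i,j} w_{ij}\, \lVert x_i - x_j \rVert^2
    \;=\; \operatorname{tr}\!\left( X^{\top} L X \right),
  \qquad L = D - W, \quad D_{ii} = \sum_j w_{ij}, \\
\text{subject to} \quad
  & X^{\top} X = I_2, \qquad X^{\top} \mathbf{1} = 0 .
\end{aligned}
```

Here x_i is the i-th row of X, the planar position of protein i. Under these assumptions, and for a connected similarity graph, the minimizer is given by the eigenvectors of L belonging to its two smallest nonzero eigenvalues. The constraints exclude the trivial solution in which every protein is placed at the same point.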

Noise Filtering

Our goal was to use phylogenomic information to distinguish between protein clusters with different evolutionary histories. However, despite their disparate pasts, all proteins in a given genome have one thing in common: they have a trivial hit to at least one ORF within their own genome. Including the fact that every protein in M. xanthus has a hit in M. xanthus adds unremarkable and computationally undesirable similarity information. This is because the spectral projection at the core of the ordination step is analogous to a principal components representation of the data matrix. The strong yet uninteresting signal from self-hit columns would dominate the first of our two coordinate eigenvectors and degrade our ability to interpret differences in location on the phylogenomic map as differences in evolutionary history. Furthermore, including self-hit columns is undesirable from an annotation standpoint because it relies on mildly circular inference: a draft genome would be used to annotate itself, rather than using other genomes (of presumably known quality) as independent data sources. For these reasons we removed the self-hit column from the data matrix and excluded from the analysis those proteins that only had BLASTP hits to M. xanthus sequences.

To ensure that the proteins on our map had nontrivial phylogenetic profiles, we further restricted our data set to those proteins with hits in at least five different sequenced genomes. Without this step, we encountered a problem different from the spurious agglomeration that arose from the inclusion of M. xanthus self-hits: the inclusion of proteins with very sparse row entries caused spurious differentiation. That is, the presence of proteins with only one or two hits across 200 genomes resulted in a profusion of mini-mountains with only one species hit across much of our landscape. This reduced the dynamic range for discrimination between other protein groups.
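The two filters can be sketched as follows. This is an illustration under stated assumptions, not the original pipeline: the data layout (a dict of bit-score rows), the name `filter_profiles`, the use of a score greater than zero to mean "hit", and counting the five-genome minimum after the self-hit column has been removed are all choices made here.

```python
def filter_profiles(scores, genomes, self_genome, min_genomes=5):
    """Drop the self-hit column, then keep proteins hit in >= min_genomes genomes.

    scores:  dict mapping protein id -> list of bit scores, one per genome
             (0 is assumed to mean "no BLASTP hit").
    genomes: column labels, in the same order as each score list.
    """
    keep_cols = [c for c, g in enumerate(genomes) if g != self_genome]
    kept_genomes = [genomes[c] for c in keep_cols]
    filtered = {}
    for protein, row in scores.items():
        reduced = [row[c] for c in keep_cols]
        n_hits = sum(1 for s in reduced if s > 0)
        # Proteins with hits only in the query genome have n_hits == 0 here,
        # so they are excluded by the same threshold test.
        if n_hits >= min_genomes:
            filtered[protein] = reduced
    return kept_genomes, filtered
```

For example, with columns `["M_xanthus", "A", "B", "C", "D", "E"]`, a protein scoring `[900, 50, 40, 30, 20, 10]` is retained as `[50, 40, 30, 20, 10]`, while one scoring `[800, 0, 0, 0, 0, 0]` is dropped.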

Paralogy Correction

Proteins with similar primary sequences tend to have similar BLAST neighbors and, hence, similar phylogenetic profiles. We were interested in those proteins that were frequently coinherited yet had different primary sequences and BLAST neighbors, as these were the potentially novel linkages that could not be discerned from an analysis of BLAST alone. For each pair of proteins on the map, we extracted the corresponding sets of GenInfo Identification (GI) numbers returned by a run of BLASTP against a local database of sequenced microbial genomes. We then computed the Jaccard coefficient between these two sets, A and B, which is defined as $|A \cap B| / |A \cup B|$. We considered two proteins to have an association predictable from BLAST alone if their Jaccard coefficient was greater than 0.5. Intuitively, a high Jaccard coefficient means that a manual comparison of the two BLASTP-generated GI lists would reveal much overlap and hint at functional similarity. When the network of such associations was superimposed on the phylogenomic map, we could detect at a glance whether the colocation of a given pair of proteins was due to sequence similarity or to nontrivial similarity in coinheritance patterns. A novel phylogenomic linkage between two proteins indicates that they are present in the same sets of organisms despite having different primary sequences, and is indicated on the map by proteins that are nearby but unconnected (see Figure 2c).

Mountain Discretization

To facilitate map analysis, we used an Expectation Maximization algorithm based on Gaussians with spherical covariance matrices [7] to discretize the 2-dimensional coordinates of each map into individual mountains. To initialize the number of basis Gaussians, the gap statistic of Tibshirani et al. [8] was applied to provide a robust estimate of the number of mountains present in the reduced 2-dimensional map space.
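A minimal sketch of Expectation Maximization for a mixture of spherical Gaussians in the plane is given below. It is illustrative only. The function name `em_spherical`, the caller-supplied starting means, the initial unit variances, the fixed iteration count, and the variance floor are assumptions. The number of components is taken as given, so the gap statistic is not shown.

```python
from math import exp, log, pi


def em_spherical(points, means, n_iter=50, var_floor=1e-6):
    """Fit a mixture of spherical 2-D Gaussians by EM.

    points: list of (x, y) map coordinates; means: initial component centres.
    Returns (weights, means, variances, labels), where labels[i] is the index
    of the component with the highest responsibility for points[i].
    """
    k, n = len(means), len(points)
    means = [tuple(m) for m in means]
    weights = [1.0 / k] * k
    variances = [1.0] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibilities, computed with log-sum-exp for stability.
        resp = []
        for x, y in points:
            logp = []
            for w, (mx, my), v in zip(weights, means, variances):
                d2 = (x - mx) ** 2 + (y - my) ** 2
                logp.append(log(w) - log(2.0 * pi * v) - d2 / (2.0 * v))
            top = max(logp)
            unnorm = [exp(lp - top) for lp in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: update weight, mean, and the single variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj <= 0.0:
                continue  # empty component: leave its parameters unchanged
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            d2 = sum(r[j] * ((p[0] - mx) ** 2 + (p[1] - my) ** 2)
                     for r, p in zip(resp, points))
            weights[j] = nj / n
            means[j] = (mx, my)
            # Two dimensions share one variance, hence the factor 2.
            variances[j] = max(d2 / (2.0 * nj), var_floor)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels
```

Each protein is then assigned to the mountain whose Gaussian has the highest responsibility for its coordinates. A spherical covariance keeps one variance parameter per mountain, which suits roughly circular peaks but would fit elongated ridges poorly.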

References

1. Enault, F., Suhre, K., Abergel, C., Poirot, O. & Claverie, J.M. Annotation of bacterial genomes using improved phylogenomic profiles. Bioinformatics 19 Suppl 1, i105-i107 (2003).
2. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997).
3. Date, S.V. & Marcotte, E.M. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat. Biotechnol. 21, 1055-1062 (2003).
4. Paninski, L. Estimation of entropy and mutual information. Neural Computation 15, 1191-1253 (2003).
5. Chilson, J., Ng, R., Wagner, A. & Zamar, R. Parallel computation of high dimensional robust correlation and covariance matrices. In Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 533-538 (ACM Press, Seattle, WA, USA; 2004).
6. Davidson, G.S., Wylie, B.N. & Boyack, K. Cluster stability and the use of noise in interpretation of clustering. In Proceedings of the IEEE Symposium on Information Visualization 2001 (INFOVIS'01), 23-30 (IEEE Computer Society, 2001).
7. McLachlan, G.J. & Krishnan, T. The EM Algorithm and Extensions. (John Wiley & Sons, New York, 1997).
8. Tibshirani, R., Walther, G. & Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. J. Roy. Stat. Soc. B 63, 411-423 (2001).