Theories on PHYlogenetic ReconstructioN (PHYRN)

Gaurav Bhardwaj 1,2, Zhenhai Zhang 1,3, Yoojin Hong 1,4, Kyung Dae Ko 1,2, Gue Su Chang 3, Evan J. Smith 1,2, Lindsay A. Kline 1,2, D. Nicholas Hartranft 1,2, Edward C. Holmes 1,2, Randen L. Patterson 1,2, and Damian B. van Rossum 1,2.

(1) Center for Computational Proteomics, The Pennsylvania State University, USA
(2) Department of Biology, The Pennsylvania State University, USA
(3) Department of Biochemistry and Molecular Biology, The Pennsylvania State University, USA
(4) Department of Computer Science and Engineering, The Pennsylvania State University, USA

* Address correspondence to: Randen L. Patterson, 230 Life Science Bldg, University Park, PA 16802. Tel: 001-814-865-1668; Fax: 001-814-863-1357; E-mail: rlp25@psu.edu. Damian B. van Rossum, 518 Wartik Laboratory, University Park, PA 16802. Tel: 001-814-863-7; Fax: 001-814-863-1357; E-mail: dbv10@psu.edu.

Abstract

The inability to resolve deep-node relationships of highly divergent/rapidly evolving protein families is a major factor that stymies evolutionary studies. In this manuscript, we propose a Multiple Sequence Alignment (MSA)-independent method to infer evolutionary relationships. We previously demonstrated that phylogenetic profiles built using position-specific scoring matrices (PSSMs) are capable of constructing informative evolutionary histories (1;2). In this manuscript, we theorize that PSSMs derived specifically from the query sequences used to construct the phylogenetic tree will improve this method for the study of rapidly evolving proteins. To test this theory, we performed phylogenetic analyses of a benchmark protein superfamily (reverse transcriptases (RT)) as well as simulated datasets. When we compare the results obtained from our method, PHYlogenetic ReconstructioN (PHYRN), with other MSA-dependent methods, we observe that PHYRN provides a 4- to - fold increase in accurate measurements at deep nodes.
As phylogenetic profiles, rather than an MSA, are used as the information source, we propose PHYRN as a paradigm shift in studying evolution when MSA approaches fail. Perhaps most importantly, due to the improvements in our computational approach and the availability of vast amounts of sequencing data, PHYRN is scalable to thousands of sequences. Taken together with PHYRN's adaptability to any protein family, this method can serve as a tool for resolving ambiguities in evolutionary studies of rapidly evolving/highly divergent protein families.

Introduction

Phylogenetic profiles have been suggested by us and others as a unified framework for measuring structural, functional, and evolutionary characteristics of proteins and protein families (1;3-6). Proteins within a phylogenetic profile can be defined in an N (query) by M (PSSM) matrix. Under this paradigm, a protein is defined as a vector in which each entry quantifies the alignment of a query sequence with a PSSM (3;7). In the case of evolutionary measurements, we previously demonstrated that phylogenetic profiles built in this manner can be used to construct phylogenetic trees using Euclidean distance measurements (1;2). Our previous study used reverse transcriptases (RT) as a benchmark dataset and demonstrated that phylogenetic profiles perform well even at extreme levels of divergence (i.e. the "twilight zone" of sequence similarity) (2). This work highlighted how pre-existing PSSMs, obtained from the Conserved Domain Database (CDD (8)), could be utilized to construct an informative M-dimension. When we generated trees with the entire CDD, we obtained a tree with perfect monophyly. Despite the perfect monophyly, statistical support at most deep nodes was lacking. Interestingly, when we analyzed the alignments from the phylogenetic profiles, we determined that the most frequently occurring PSSM alignments were those of the 16 RT domain-containing profiles present in CDD. When trees were constructed using only these 16 RT PSSMs for our M-dimension, we still observed significant monophyly, well above random (Supplemental Figure 3 in (2)). Based on these results, it is reasonable to consider that expanding only the informative profiles within our knowledge base will improve the robustness of phylogenetic profile-based measurements, in addition to improving computational performance. In this manuscript, we present data supporting this supposition.
Further, we present a pipeline for enriching and amplifying informative PSSMs as well as an algorithmic improvement that drastically reduces computational expense. As evidence for these theories, we analyzed biological (RT) and ROSE-simulated (9) datasets. When compared with other ab-initio multiple sequence alignment (MSA) methods, PHYRN reliably recapitulates the true evolutionary history in simulated datasets and provides deep-node measurements with robust statistical support.

Results

PHYRN Pipeline

The algorithm begins by compiling a set of protein queries belonging to the same protein family/superfamily (Figure 1). We use CDD (8) and other approaches to define conserved domains present in members of this superfamily (e.g. the RT domain in reverse transcriptases). From this subset of knowledge-base PSSMs, we utilize pairwise comparisons to define boundaries of homology. These homologous protein fragments are then utilized to construct a database/library of query-based PSSMs using PSI-BLAST (6 iterations, e-value threshold = 10^-6) (10). In this manner, a query set of sequences can make a library of at least PSSMs. We then use rpsblast (8) to obtain pairwise alignments between full-length queries and the query-specific PSSM library. This alignment information (% identity, % coverage) is then encoded into a phylogenetic profile matrix. Next, we calculate the Euclidean distance between each pair of queries (2). The results of these calculations can then be plotted as a phylogenetic tree using a variety of tree-building algorithms (e.g. Neighbor-Joining (11), Maximum Likelihood (12), Minimum Evolution (13), etc.; see Methods for a complete description of PHYRN). In the following sections we highlight the methods for creating informative PSSM libraries and the scoring schemes utilized.

Enriching and Amplifying Informative PSSMs

Although CDD provides a comprehensive resource for conserved domains, in all cases the number of PSSMs for any given domain is relatively small (>).
We have previously demonstrated that the quality of phylogenetic profile measurements is proportional to the size and variety of the domain-specific PSSM library (2;3;7). This increase in information content is exemplified in Figure 2. When the silkworm Non-LTR AAA92147.1 is analyzed with NCBI CDD, only one RT-specific PSSM alignment is returned (Figure 2a). When the same query is analyzed using PHYRN and the 16 RT PSSMs from CDD, we observe 3 overlapping alignments from 3 different PSSMs (~19% of the library, Figure 2b). These overlapping alignments define the boundary from which to generate PSSMs using PSI-BLAST. This same approach was used for RT query sequences; post-expansion, we obtain 102 RT-specific PSSMs. When we reanalyzed the silkworm sequence with the amplified library, ~56% of the PSSM library returned a result within the homologous region.

Figure 1: PHYRN Concept and Work Flow. PHYRN begins by (i-ii) defining and extracting the domain-specific region among the query sequences. (iii) Domain-specific regions are then used to create a PSSM library using PSI-BLAST (6 iterations, e = 10^-6). (iv-v) Positive alignments are then calculated between full-length queries and the PSSM library using rpsblast, and encoded as an N X M matrix of PHYRN product scores (%identity X %coverage). (vi) The product score matrix is converted to a Euclidean distance matrix by calculating the Euclidean distance between each query pair. (vii) Phylogenetic trees can then be graphed using Neighbor-Joining (NJ) or Minimum Evolution (ME) as available in MEGA.
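The conversion from an N X M product score matrix to an N X N Euclidean distance matrix (step vi of the workflow) can be illustrated with a minimal Python sketch. The function name `euclidean_distance_matrix` and the small score matrix are hypothetical and chosen only for illustration; they are not taken from the PHYRN implementation or from the RT data.

```python
import math

def euclidean_distance_matrix(profile):
    """Return the N x N matrix of Euclidean distances between the rows
    (queries) of an N x M phylogenetic profile."""
    n = len(profile)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(profile[i], profile[j])))
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist

# Hypothetical product scores for 3 queries against 4 PSSMs
# (illustrative values only, not data from the paper).
profile = [
    [0.20, 0.00, 0.15, 0.10],
    [0.18, 0.02, 0.15, 0.10],
    [0.00, 0.25, 0.00, 0.05],
]
dist = euclidean_distance_matrix(profile)
```

In this toy example the first two queries have similar score vectors and therefore a small distance, whereas the third query is distant from both. The resulting matrix is the kind of input that a distance-based tree-building method such as Neighbor-Joining would take.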
Scoring Scheme

In order to encode alignment information between queries and our PSSM libraries, we utilize a product score (%identity X %coverage) during our PHYRN analysis. Equation sets for %identity and %coverage are as defined in (2). The algebraic derivation in Figure 3a demonstrates that our product score is equivalent to (1 - p-distance) X gap weight. In data not shown, we determined that the gap weight is a negligible variable, and its removal does not alter our results. Figure 3b depicts the distribution of % gaps, % identity, and % coverage of alignments between RT queries and 102 RT-specific PSSMs. Overall, 92% of alignments have less than 25% identity, and the average percentage identity between RT sequences is 21.8% (± 5.6% s.d.). Within a smaller subset of 88 RT sequences, 3644 (95.2%) of the 3828 possible pairs have less than 25% sequence identity. As a whole, RT sequences reside in the twilight zone of sequence similarity, which underscores why deducing evolutionary relationships within the RT family is extremely challenging. Furthermore, % gap and % coverage measurements have wide variances.
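As a small worked illustration of the scoring scheme, the following Python sketch computes the product score in two ways: directly as %identity X %coverage, and as (1 - p) X gap weight. It assumes the variable definitions given with Figure 3a (ids, aqlen, gaps, plen, with alen = aqlen + gaps); the function names and the numeric example are hypothetical.

```python
def product_score(ids, aqlen, gaps, plen):
    """PHYRN-style product score: fractional identity times fractional coverage.

    ids   : number of identical residues
    aqlen : alignment length in the query, excluding gaps
    gaps  : number of gaps in the query-sided alignment
    plen  : sequence length of the PSSM
    """
    alen = aqlen + gaps          # alignment length including gaps
    identity = ids / alen        # %identity as a fraction
    coverage = aqlen / plen      # %coverage as a fraction
    return identity * coverage

def score_from_p_distance(ids, aqlen, gaps, plen):
    """The same quantity rearranged as (1 - p) * w."""
    one_minus_p = ids / plen               # proportion of identical sites over the domain
    gap_weight = aqlen / (aqlen + gaps)    # w, the gap weight
    return one_minus_p * gap_weight

# Hypothetical alignment: 40 identities, 180 aligned query residues,
# 20 gaps, against a PSSM of length 200.
score = product_score(40, 180, 20, 200)
rearranged = score_from_p_distance(40, 180, 20, 200)
```

For these illustrative values both forms give a score of about 0.18 (20% identity X 90% coverage), which is the equivalence the derivation states.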

Figure 2: Enrichment and Amplification of Signal Source PSSMs. (a) Family/superfamily-specific PSSMs can be identified using the NCBI Conserved Domain Database (rps-blast, default settings). (b) rpsblast can then be used to identify overlapping alignments between individual queries and family/superfamily-specific CDD PSSMs. Overlapping alignments are then used to define the domain-specific region as described in Methods. (c) The domain-specific regions so identified are then used to generate PSSMs using PSI-BLAST (6 iterations, e = 10^-6) and the NCBI non-redundant (nr) database. Library hits for AAA92147: pre-expansion, 3/16 PSSMs (18.75%); post-expansion (n = 102 RT PSSMs), 57/102 PSSMs (55.88%).

Phylogenetic Trees generated using PHYRN

Phylogenetic Reconstruction of the RT superfamily

Despite the low identity and high variance in % gaps and coverage, when a phylogenetic tree is constructed from a phylogenetic profile comprised of RT query sequences and 102 RT-specific PSSMs using Neighbor-Joining (11) (see Methods), we obtain a robust monophyletic tree with deep statistical support (bootstrap and jackknife, Figure 3c). In all aspects, this tree is superior to the tree constructed before expansion of the RT-specific PSSMs (Fig 3 in Chang et al (2)). Specifically, the Hepadnaviruses now form a clear monophyletic clade, and the Mt Plasmids now reside in the prokaryotic group as expected. As previously mentioned, phylogenetic profile measurements improve with larger datasets (1;3;7). Therefore, we wondered whether our resolution could be increased by the inclusion of additional sequences across multiple taxa, thereby increasing the size of the PSSM library. We collected 716 full-length RT-containing sequences from the literature (14-20) and from PSI-BLAST-aided searches of the NCBI non-redundant database. These sequences were subsequently included in our RT-specific PSSM library as previously described.
Figure 4 depicts a linearized phylogenetic tree of 716 full-length retroelements measured using 846 PSSMs generated from the RT domain and rooted with retrointrons. The pairwise distances among them were obtained from Euclidean distance measurements in the 716 X 846 data matrix, and an unrooted phylogenetic tree was derived from the 716 X 716 distance matrix using the neighbor-joining method. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances calculated from the data matrix. Even at this size and level of divergence (~17% identity between groups), the PHYRN tree has robust monophyly. Within these monophyletic nodes, multiple subclades are evident from this analysis. For example, we observe numerous subgroups of TY3 retroelements and proper subclade groupings of retroviral RTs. Recently, Simon et al reported that there is a plethora of uncharacterized bacterial RTs (19). Our analysis is congruent with this proposal, as we also observe novel clades of bacterial origin. While these results are promising, the true evolutionary history is unknown, and therefore we cannot fully evaluate the performance of PHYRN using this dataset. Further, during the course of these experiments, we determined that, although effective, the pipeline as described is computationally expensive for large datasets. To overcome these limitations, we made the following changes to the pipeline. Specifically, we compiled all PSSMs into a single database that can be used by standard rps-blast (8). This drastically reduced the number of operations and enabled us to use a sliding window of e-value thresholds to recover positive alignments. In data not shown, these changes afforded us a >600-fold increase in computational speed as well as improved speciation (see Methods for complete details).

Figure 3a (derivation):

\[
\begin{aligned}
\text{score} &= \%i \times \%c = \frac{ids}{alen} \times \frac{aqlen}{plen} \\
&= \frac{ids}{plen} \times \frac{aqlen}{aqlen + gaps} \qquad (\text{since } alen = aqlen + gaps) \\
&= (\text{proportion of amino acid sites identical over a domain}) \times (\text{gap weight}) \\
&= (1 - \text{proportion of amino acid sites different over a domain}) \times (\text{gap weight}) \\
&= (1 - p) \times w
\end{aligned}
\]

where %i is the percent identity; %c is the percent coverage; ids is the number of identical residues; alen is the alignment length including gaps; aqlen is the alignment length in the query excluding gaps (= query_to - query_from + 1); plen is the sequence length of a PSSM; gaps is the number of gaps in the query-sided alignment; and w = aqlen / (aqlen + gaps) is the gap weight based on the number of gaps in the query-sided alignment.

Figure 3: Phylogenetic profile based measurements of evolutionary distance. (a) Algebraic derivation of p-distance from the PHYRN product scoring scheme (%identity X %coverage). (b) Distribution of retroelements by measurements of % identity, % coverage, and % gap. (c) Unrooted phylogenetic tree of full-length retroelements measured using 102 PSSMs generated from the RT domain. The pairwise distances among them were obtained from Euclidean distance measurements in the X 102 data matrix, and an unrooted phylogenetic tree was derived from the X 102 distance matrix using a minimum evolution method. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances calculated from the data matrix. Bootstrap and jackknife (80% fraction of samples) values were obtained from 1,000 replicates and are reported as percentages.
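The bootstrap and jackknife support values are based on resampling the M-dimension (the PSSM columns) of the profile matrix. The Python sketch below illustrates that resampling step only. It makes several assumptions that are not from the original pipeline: it uses Python's `random` module rather than the PHYLIP random number generator, it assumes the jackknife draws 80% of the columns without replacement, and the function names are hypothetical. Tree building and consensus steps are not shown.

```python
import random

def bootstrap_columns(profile, rng):
    """Resample M columns with replacement (a column may be chosen more than once)."""
    m = len(profile[0])
    cols = [rng.randrange(m) for _ in range(m)]
    return [[row[c] for c in cols] for row in profile]

def jackknife_columns(profile, rng, fraction=0.8):
    """Keep a random subset of the columns (assumed: without replacement)."""
    m = len(profile[0])
    k = max(1, int(round(fraction * m)))
    cols = sorted(rng.sample(range(m), k))
    return [[row[c] for c in cols] for row in profile]

# Toy 3 x 10 profile in which every column is identifiable by its values.
profile = [list(range(10)), list(range(10, 20)), list(range(20, 30))]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
boot = bootstrap_columns(profile, rng)
jack = jackknife_columns(profile, rng)
```

Each replicate keeps all N queries and changes only which PSSM columns contribute to the distances, so a distance matrix and tree can be recomputed per replicate and summarized by majority-rule consensus.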
Phylogenetic Reconstruction of Rose Simulations

Rose (Random Model of Sequence Evolution - Version 1.3; http://bibiserv.techfak.unibielefeld.de/rose/) implements a probabilistic model of protein sequence evolution (9). In this simulation, sequences are created from a common ancestor to produce a dataset of known size, divergence, and history. In this artificial evolutionary process, the accurate history is recorded because the multiple sequence alignment is created simultaneously. This gives us full control over evolutionary rates and allows us to test the efficacy of our approach.

Figure 5 provides the results from PHYRN and MUSCLE (21) using 67 sequences simulated for 17% identity by ROSE (see Methods for a full description of PSSM generation). Whereas MUSCLE performs poorly on this dataset (Figure 5b), PHYRN recaptures 93% of the true evolutionary history and has only one deep node incorrectly identified (Figure 5a). In a second simulation, we maintained a similar level of divergence while increasing the size of the simulated dataset to 584 sequences. In this simulation, MUSCLE performance decreases significantly (no deep nodes are correctly obtained; Figure 6a), while PHYRN performance is still robust (Figure 6b). PHYRN recaptures 76% of the true evolutionary history at the deepest 64 nodes. Taken together, these results demonstrate the power of PHYRN for deriving deep evolutionary information.

Figure 4: Towards comprehensive phylogenies. Linearized phylogenetic tree of 716 full-length retroelements measured using 846 PSSMs generated from the RT domain and rooted with retrointrons. The pairwise distances among them were obtained from Euclidean distance measurements in the 716 X 846 data matrix, and an unrooted phylogenetic tree was derived from the 716 X 716 distance matrix using a neighbor-joining (NJ) method. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances calculated from the data matrix.

Figure 5: PHYRN recapitulates true evolutionary history better than MUSCLE in simulated protein families. Consensus tree between the original ROSE tree and the tree generated using (a) PHYRN and (b) MUSCLE. Simulated protein family generated using ROSE, with an average distance of 5 (p distance ~0.83). Red circles mark the branch points (nodes) that are not recapitulated correctly. (no. of query sequences = 67)
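One simple way to quantify how much of a known history is recovered is to count the clades of the true tree that also appear in the inferred tree. The Python sketch below illustrates this idea; it is not the procedure used in this study, which relied on the consense program of PHYLIP. The sketch assumes rooted trees written as nested tuples of leaf names, and the function names and the five-leaf example are hypothetical.

```python
def clades(tree):
    """Return (all_leaves, internal_clades) for a rooted nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):            # a leaf is a plain name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)                # the root clade is trivially shared
    return all_leaves, found

def recapitulation_rate(true_tree, inferred_tree):
    """Fraction of non-trivial clades of the true tree found in the inferred tree."""
    true_leaves, true_clades = clades(true_tree)
    inferred_leaves, inferred_clades = clades(inferred_tree)
    if true_leaves != inferred_leaves:
        raise ValueError("trees must share the same leaf set")
    if not true_clades:
        return 1.0
    return len(true_clades & inferred_clades) / len(true_clades)

# Hypothetical five-leaf example: the inferred tree misplaces leaf B.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
rate = recapitulation_rate(true_tree, inferred_tree)
```

In this toy case two of the three non-trivial clades of the true tree are recovered, so the rate is 2/3. Restricting the comparison to the clades closest to the root would give a deep-node version of the same measure.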

Figure 6: Deep Node Recapitulation of true evolutionary history in mega-phylogenies. Consensus tree between the original ROSE tree and the tree generated using (a) MUSCLE and (b) PHYRN. Simulated protein family generated using ROSE, with an average distance of 5 (p distance ~0.82). Green circles mark the deep nodes that are recapitulated correctly in the consensus trees. (no. of query sequences = 584; leaf labels SEQ... omitted)

Discussion

Our case study of the RT superfamily and simulated datasets demonstrates that PHYRN is capable of inferring deep evolutionary relationships between highly divergent proteins. A number of implications can be derived from this study: (i) phylogenies built with PHYRN recapture more of the true evolutionary history and have robust statistical support; (ii) phylogenies built on pairwise alignments outperform conventional MSA methods; and (iii) this method is scalable to thousands of sequences. This improved performance is due to the improved information content of the PSSM libraries used in this study. We improved the efficacy of our PSSMs by: (i) limiting the PSSMs to homologous domains, (ii) optimizing the PSI-BLAST settings for their generation, and (iii) creating a pipeline that is sufficiently fast to handle large datasets. With MSA-dependent methods, increasing the number of query sequences makes it increasingly difficult to obtain an optimal multiple sequence alignment (22); in PHYRN, by contrast, increasing the number of query sequences also increases the dimensionality of the phylogenetic profile, thus increasing the alignment information space. This increase in information space leads to better, more robust measurements of relative rates. This comprehensive survey approach, where more sequences are better, is in contrast to the random-walk approach of MSA-dependent methods, where additional sequences are a problem and trees are limited to discrete taxa. Further, the use of frequency tables in the phylogenetic profiles provides more informative measurements for calculating relative rates of evolution. This approach gives PHYRN the potential to generate trees with thousands of sequences, where the only theoretical limit is the available sequencing data. Indeed, when we expanded the RT tree from to 716 sequences comprising >14 groups, we obtained a tree that is consistent with, yet of higher resolution than, previously reported RT studies (2;15-17).
As PHYRN is well suited to making measurements on large divergent datasets, we hypothesize that this approach may be capable of solving a number of unanswered questions related to the ancient origins of life and speciation. Moreover, since PHYRN functions in the twilight zone of sequence similarity, this algorithm may be able to inform whether functionally or structurally similar proteins have a common ancestor or arose via convergent evolution. In conclusion, our study provides strong evidence that, even in its nascent stage, PHYRN measurements can provide key insight into evolutionary relationships among distantly related and/or rapidly evolving proteins.

Acknowledgements

This work was supported by the Searle Young Investigators Award and startup money from PSU (RLP), NCSA grant TG-MCB070027N (RLP, DVR), The National Science Foundation 428-15 691M (RLP, DVR), and The National Institutes of Health R01 GM087410-01 (RLP, DVR). This project was also funded by a Fellowship from the Eberly College of Sciences and the Huck Institutes of the Life Sciences (DVR) and a grant with the Pennsylvania Department of Health using Tobacco Settlement Funds (DVR). The Department of Health specifically disclaims responsibility for any analyses, interpretations or conclusions. We would like to thank Teresa Killick, Sree Chintapalli, and Anand Padmanbha for their help and support during the project as well as Jason Holmes at the Pennsylvania State University CAC center for technical assistance. We would also like to thank Drs. Robert E. Rothe, Jim White, Glenn M. Sharer, Cookie van Volmar, Barbara VanRossum, Russell Hamilton Carroll, C. Heesch and C. Hong for creative dialogue.

M (no. of PSSMs) matrix of product scores. Euclidean distances between queries were calculated from their PHYRN product scores, and an N X N Euclidean distance matrix was generated. As with the RT trees, phylogenetic trees were inferred using the Neighbor Joining (NJ) and Minimum Evolution (ME) methods, as available in MEGA (13).

Generating phylogenetic trees using MUSCLE

The optimal Multiple Sequence Alignment (MSA) for a given dataset was obtained using MUSCLE v3.6 (23). Phylogenetic trees for these optimal MSAs were inferred using MEGA's Neighbor-joining (NJ) and Minimum-evolution (ME) algorithms, with pairwise deletion and p-distance as default settings.

Consensus trees with Rose true history

We used the Consense program of the PHYLIP v3.67 package (24;25) to generate consensus trees between PHYRN and Rose trees, as well as between MUSCLE and Rose trees. Recapitulation rates and percentages were then calculated from the consensus tree Newick files.

Bootstrap and Jackknife

We generated 3,000 random samples from our PHYRN M-dimensional data, using random number generator code from the PHYLIP source code (http://evolution.genetics.washington.edu/phylip.html). During resampling, the same column could be selected more than once. We then used the Fitch program with default settings in the PHYLIP 3.67 package to generate minimum-evolution (ME) trees for each sample, followed by the Consense program to generate a majority-rule consensus tree of all samples. For jackknife resampling, we followed a similar approach to generate 1,000 random samples; however, only 80% of the original M-dimensional data was resampled each time. The Fitch and Consense programs were then used in the same manner as for bootstrap resampling.

Reference List

1. Ko,K.D., Hong,Y., Chang,G.S., Bhardwaj,G., van Rossum,D.B., and Patterson,R.L. 2008. Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure, Function and Evolution. Physics Archives arxiv:0806.239, q-bio.q.
2. Chang,G.S., Hong,Y., Ko,K.D., Bhardwaj,G., Holmes,E.C., Patterson,R.L., and van Rossum,D.B. 2008. Phylogenetic profiles reveal evolutionary relationships within the "twilight zone" of sequence similarity. Proc. Natl. Acad Sci U. S. A 105:13474-13479.
3. Ko,K.D., Hong,Y., Bhardwaj,G., Killick,T.M., van Rossum,D.B., and Patterson,R.L. 2009. Brainstorming through the Sequence Universe: Theories on the Protein Problem. Physics Archives arxiv:0911.0652v1, q-bio.qm:1-21.
4. Pellegrini,M., Marcotte,E.M., Thompson,M.J., Eisenberg,D., and Yeates,T.O. 1999. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad Sci U. S. A 96:4285-4288.
5. Ranea,J.A., Yeats,C., Grant,A., and Orengo,C.A. 2007. Predicting protein function with hierarchical phylogenetic profiles: the Gene3D Phylo-Tuner method applied to eukaryotic genomes. PLoS. Comput. Biol. 3:e237.
6. Wu,J., Mellor,J.C., and DeLisi,C. 2005. Deciphering protein network organization using phylogenetic profile groups. Genome Inform. 16:142-149.
7. Hong,Y., Lee,D., Kang,J., van Rossum,D.B., and Patterson,R.L. 2009. Adaptive BLASTing through the Sequence Dataspace: Theories on Protein Sequence Embedding. Physics Archives arxiv:0911.06v1, q-bio.qm:1-21.
8. Marchler-Bauer,A., Anderson,J.B., Cherukuri,P.F., Weese-Scott,C., Geer,L.Y., Gwadz,M., He,S., Hurwitz,D.I., Jackson,J.D., Ke,Z. et al. 2005. CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res. 33 Database Issue:D192-D196.
9. Stoye,J., Evers,D., and Meyer,F. 1998. Rose: generating sequence families. Bioinformatics. 14:157-163.
10. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W., and Lipman,D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
11. Saitou,N., and Nei,M. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
12. Adams,M.A., Suits,M.D., Zheng,J., and Jia,Z. 2007. Piecing together the structure-function puzzle: experiences in structure-based functional annotation of hypothetical proteins. Proteomics. 7:2920-2932.
13. Tamura,K., Dudley,J., Nei,M., and Kumar,S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24:1596-1599.
14. Arkhipova,I.R., Pyatkov,K.I., Meselson,M., and Evgen'ev,M.B. 2003. Retroelements containing introns in diverse invertebrate taxa. Nat. Genet. 33:123-124.
15. Curcio,M.J., and Belfort,M. 2007. The beginning of the end: links between ancient retroelements and modern telomerases. Proc. Natl. Acad Sci U. S. A 104:9107-9108.
16. Medhekar,B., and Miller,J.F. 2007. Diversity-generating retroelements. Curr. Opin. Microbiol. 10:388-395.
17. Eickbush,T.H., and Jamburuthugoda,V.K. 2008. The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Res. 134:221-234.
18. Simon,D.M., Kelchner,S.A., and Zimmerly,S. 2009. A broadscale phylogenetic analysis of group II intron RNAs and intron-encoded reverse transcriptases. Mol. Biol. Evol. 26:2795-2808.
19. Simon,D.M., and Zimmerly,S. 2008. A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 36:7219-7229.
20. Simon,D.M., Clarke,N.A., McNeil,B.A., Johnson,I., Pantuso,D., Dai,L., Chai,D., and Zimmerly,S. 2008. Group II introns in eubacteria and archaea: ORF-less introns and new varieties. RNA. 14:1704-1713.
21. Edgar,R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792-1797.
22. Kemena,C., and Notredame,C. 2009. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics. 25:2455-2465.
23. Edgar,R.C. 2004. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC. Bioinformatics. 5:113.
24. Felsenstein,J. 1997. An alternating least squares approach to inferring phylogenies from pairwise distances. Syst. Biol. 46:101-111.
25. Felsenstein,J. 2008. Comparative methods with sampling error and within-species variation: contrasts revisited and revised. Am. Nat. 171:713-725.
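
Supplementary illustration

The distance and resampling steps described in the Methods (an N X N Euclidean distance matrix computed from the N X M product scores, bootstrap resampling of columns with replacement, and jackknife resampling of 80% of the columns) can be illustrated with a minimal Python sketch. This is not the code used in this study, which relied on PHYLIP and MEGA. The function names, the toy scores and the fixed seed are hypothetical, and drawing the jackknife columns without replacement is an assumption of the sketch.

```python
import math
import random


def euclidean_distance_matrix(scores):
    """Turn an N x M table of product scores into an N x N distance matrix."""
    n = len(scores)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(scores[i], scores[j])))
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


def bootstrap_columns(scores, rng):
    """Draw M columns with replacement, so a column may appear more than once."""
    m = len(scores[0])
    cols = [rng.randrange(m) for _ in range(m)]
    return [[row[c] for c in cols] for row in scores]


def jackknife_columns(scores, rng, fraction=0.8):
    """Keep a random subset of columns (80% by default).

    Sampling without replacement is an assumption of this sketch.
    """
    m = len(scores[0])
    k = max(1, round(fraction * m))
    cols = sorted(rng.sample(range(m), k))
    return [[row[c] for c in cols] for row in scores]


# Hypothetical toy data: 3 queries scored against 5 PSSMs.
toy_scores = [
    [0.9, 0.1, 0.0, 0.4, 0.2],
    [0.8, 0.2, 0.1, 0.5, 0.1],
    [0.0, 0.7, 0.9, 0.1, 0.6],
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

full_matrix = euclidean_distance_matrix(toy_scores)
boot_matrix = euclidean_distance_matrix(bootstrap_columns(toy_scores, rng))
jack_matrix = euclidean_distance_matrix(jackknife_columns(toy_scores, rng))
```

In a full analysis, each resampled distance matrix would be passed to a tree-building program such as Fitch, and the resulting trees would be summarized with Consense; those steps are not reproduced here.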