Theories on PHYlogenetic ReconstructioN (PHYRN)


Gaurav Bhardwaj 1,2, Zhenhai Zhang 1,3, Yoojin Hong 1,4, Kyung Dae Ko 1,2, Gue Su Chang 3, Evan J. Smith 1,2, Lindsay A. Kline 1,2, D. Nicholas Hartranft 1,2, Edward C. Holmes 1,2, Randen L. Patterson 1,2, and Damian B. van Rossum 1,2.

(1) Center for Computational Proteomics, The Pennsylvania State University, USA
(2) Department of Biology, The Pennsylvania State University, USA
(3) Department of Biochemistry and Molecular Biology, The Pennsylvania State University, USA
(4) Department of Computer Science and Engineering, The Pennsylvania State University, USA

* Address correspondence to: Randen L. Patterson, 230 Life Science Bldg, University Park, PA; rlp25@psu.edu. Damian B. van Rossum, 518 Wartik Laboratory, University Park, PA; dbv10@psu.edu.

Abstract

The inability to resolve deep-node relationships in highly divergent or rapidly evolving protein families is a major obstacle in evolutionary studies. In this manuscript, we propose a multiple sequence alignment (MSA)-independent method for inferring evolutionary relationships. We previously demonstrated that phylogenetic profiles built using position-specific scoring matrices (PSSMs) can be used to construct informative evolutionary histories (1;2). Here, we theorize that PSSMs derived specifically from the query sequences used to construct the phylogenetic tree will improve this method for the study of rapidly evolving proteins. To test this theory, we performed phylogenetic analyses of a benchmark protein superfamily (reverse transcriptases, RT) as well as simulated datasets. When we compare the results obtained from our method, PHYlogenetic ReconstructioN (PHYRN), with those of MSA-dependent methods, we observe that PHYRN provides a 4- to - fold increase in accurate measurements at deep nodes. As phylogenetic profiles, rather than an MSA, are used as the information source, we propose PHYRN as a paradigm shift for studying evolution when MSA approaches fail. Perhaps most importantly, owing to improvements in our computational approach and the availability of vast amounts of sequencing data, PHYRN is scalable to thousands of sequences. Taken together with PHYRN's adaptability to any protein family, this method can serve as a tool for resolving ambiguities in evolutionary studies of rapidly evolving or highly divergent protein families.

Introduction

Phylogenetic profiles have been suggested by us and others as a unified framework for measuring structural, functional, and evolutionary characteristics of proteins and protein families (1;3-6). Proteins within a phylogenetic profile can be defined in an N (query) by M (PSSM) matrix. Under this paradigm, a protein is defined as a vector in which each entry quantifies the alignment of a query sequence with a PSSM (3;7). In the case of evolutionary measurements, we previously demonstrated that phylogenetic profiles built in this manner can be used to construct phylogenetic trees using Euclidean distance measurements (1;2). Our previous study used reverse transcriptases (RT) as a benchmark dataset and demonstrated that phylogenetic profiles perform well even at extreme levels of divergence (i.e. the "twilight zone" of sequence similarity) (2). This work highlighted how pre-existing PSSMs, obtained from the Conserved Domain Database (CDD) (8), could be used to construct an informative M-dimension. When we generated trees with the entire CDD, we obtained a tree with perfect monophyly; however, the statistical support at most deep nodes was lacking. Interestingly, when we analyzed the alignments from the phylogenetic profiles, we determined that the most frequently occurring PSSM alignments were to the 16 RT domain-containing profiles present in CDD. When trees were constructed using only these 16 RT PSSMs for our M-dimension, we still observed significant monophyly that is well above random (Supplemental Figure 3 in (2)). Based on these results, it is reasonable to consider that expanding only the informative profiles within our knowledge base will improve the robustness of phylogenetic profile-based measurements, in addition to improving computational performance. In this manuscript, we present data supporting this supposition.
Further, we present a pipeline for enriching and amplifying informative PSSMs, as well as an algorithmic improvement that drastically reduces computational expense. As evidence for these theories, we analyzed biological (RT) and ROSE-simulated (9) datasets. When compared with other ab initio multiple sequence alignment (MSA) methods, PHYRN reliably recapitulates true evolutionary history in simulated datasets and provides deep-node measurements with robust statistical support.

Results

PHYRN Pipeline

The algorithm begins by compiling a set of protein queries belonging to the same protein family/superfamily (Figure 1). We use CDD (8) and other approaches to define the conserved domains present in members of this superfamily (e.g. the RT domain in reverse transcriptases). From this subset of knowledge-base PSSMs, we use pairwise comparisons to define boundaries of homology. These homologous protein fragments are then used to construct a database/library of query-based PSSMs using PSI-BLAST (6 iterations, e-value threshold = 10^-6) (10). In this manner, a query set of sequences can make a library of at least PSSMs. We then use rpsblast (8) to obtain pairwise alignments between the full-length queries and the query-specific PSSM library. This alignment information (% identity, % coverage) is then encoded into a phylogenetic profile matrix. Next, we calculate the Euclidean distance between each pair of queries (2). The results of these calculations can then be plotted as a phylogenetic tree using a variety of tree-building algorithms (e.g. Neighbor-Joining (11), Maximum Likelihood (12), Minimum Evolution (13); see Methods for a complete description of PHYRN). In the following sections we highlight the methods for creating informative PSSM libraries and the scoring schemes used.

Figure 1: PHYRN Concept and Work Flow. PHYRN begins by (i-ii) defining and extracting the domain-specific region among the query sequences. (iii) Domain-specific regions are then used to create a PSSM library using PSI-BLAST (6 iterations, e = 10^-6). (iv-v) Positive alignments are then calculated between full-length queries and the PSSM library using rpsblast, and encoded as an N X M matrix of PHYRN product scores (%identity X %coverage). (vi) The product score matrix is converted to a Euclidean distance matrix by calculating the Euclidean distance between each query pair. (vii) Phylogenetic trees can then be graphed using Neighbor-Joining (NJ) or Minimum Evolution (ME) as available in MEGA.

Enriching and Amplifying Informative PSSMs

Although CDD provides a comprehensive resource for conserved domains, in all cases the number of PSSMs for any given domain is relatively small (>). We have previously demonstrated that the quality of phylogenetic profile measurements is proportional to the size and variety of the domain-specific PSSM library (2;3;7). This increase in information content is exemplified in Figure 2. When the silkworm Non-LTR AAA is analyzed with NCBI CDD, only one RT-specific PSSM alignment is returned (Figure 2a). When the same query is analyzed using PHYRN and the 16 RT PSSMs from CDD, we observe 3 overlapping alignments from 3 different PSSMs (~19% of the library, Figure 2b). These overlapping alignments define the boundary from which to generate PSSMs using PSI-BLAST. This same approach was used for RT query sequences; post-expansion, we obtain 102 RT-specific PSSMs. When we reanalyzed the silkworm sequence with the amplified library, ~56% of the PSSM library returned a result within the homologous region.
Scoring Scheme

To encode alignment information between queries and our PSSM libraries, we use a product score (%identity × %coverage) in the PHYRN analysis. Equation sets for %identity and %coverage are as defined in (2). The algebraic derivation in Figure 3a demonstrates that our product score is equivalent to (1 - p-distance) × gap weight. In data not shown, we determined that the gap weight is a negligible variable whose removal does not alter our results. Figure 3b depicts the distribution of % gaps, % identity, and % coverage of alignments between RT queries and 102 RT-specific PSSMs. Overall, 92% of alignments have less than 25% identity, and the average percentage identity between RT sequences is 21.8% (± 5.6% s.d.). Within a smaller subset of 88 RT sequences, 3644 (95.2%) of the 3828 possible pairs have less than 25% sequence identity. As a whole, RT sequences reside in the "twilight zone" of sequence similarity, which underscores why deducing evolutionary relationships within the RT family is extremely challenging. Furthermore, % gap and % coverage measurements have wide variances.
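As an illustration of the scoring scheme, the following minimal Python sketch computes the product score for a single query-versus-PSSM alignment and checks numerically that it equals (1 - p) × gap weight, as stated for Figure 3a. The variable names (ids, aqlen, gaps, plen) follow the Figure 3a legend; the `Alignment` class, the use of fractions rather than percentages, and the example counts are assumptions made for illustration and are not taken from the PHYRN implementation.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Counts from one query-vs-PSSM alignment (hypothetical container)."""
    ids: int    # number of identical residues
    aqlen: int  # alignment length in the query, excluding gaps
    gaps: int   # number of gaps in the query-sided alignment
    plen: int   # sequence length of the PSSM

    @property
    def alen(self) -> int:
        # alignment length including gaps
        return self.aqlen + self.gaps


def product_score(a: Alignment) -> float:
    """Identity times coverage, both expressed as fractions in [0, 1]."""
    identity = a.ids / a.alen
    coverage = a.aqlen / a.plen
    return identity * coverage


def one_minus_p_times_w(a: Alignment) -> float:
    """Equivalent form: (1 - p) times the gap weight w."""
    one_minus_p = a.ids / a.plen               # identical sites over the domain
    gap_weight = a.aqlen / (a.aqlen + a.gaps)  # w, equal to 1 when there are no gaps
    return one_minus_p * gap_weight


# Made-up counts, chosen only to exercise the two formulas.
example = Alignment(ids=40, aqlen=180, gaps=20, plen=200)
score = product_score(example)
```

With these made-up counts both functions give 0.18, because the alignment length cancels between the identity and coverage terms.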

[Figure 2 labels: rps-blast (default settings); 16 RT PSSMs from CDD; PSI-BLAST, 6 iterations, e = 10^-6; expansion (n = 102 RT PSSMs); library hits (AAA92147): pre-expansion PSSM 3/16 = 18.75%, post-expansion PSSM 57/102 = 55.88%.]

Figure 2: Enrichment and Amplification of Signal Source PSSMs. (a) Family/superfamily-specific PSSMs can be identified using the NCBI Conserved Domain Database. (b) rpsblast can then be used to identify overlapping alignments between individual queries and family/superfamily-specific CDD PSSMs. Overlapping alignments are then used to define the domain-specific region as described in Methods. (c) The domain-specific regions so identified are then used to generate PSSMs using PSI-BLAST and the NCBI non-redundant (nr) database.

Phylogenetic Trees generated using PHYRN

Phylogenetic Reconstruction of the RT superfamily

Despite the low identity and high variance in % gaps and % coverage, when a phylogenetic tree is constructed from a phylogenetic profile comprised of RT query sequences and 102 RT-specific PSSMs using Neighbor-Joining (11) (see Methods), we obtain a robust monophyletic tree with deep statistical support (bootstrap and jackknife, Figure 3c). In all aspects, this tree is superior to the tree constructed before expansion of the RT-specific PSSMs (Fig 3 in Chang et al. (2)). Specifically, the Hepadnaviruses now form a clear monophyletic clade, and the Mt Plasmids now reside in the prokaryotic group as expected. As previously mentioned, phylogenetic profile measurements improve with larger datasets (1;3;7). Therefore, we wondered whether our resolution could be increased by including additional sequences across multiple taxa, thereby increasing the size of the PSSM library. We collected 716 full-length RT-containing sequences from the literature (14-20) and from PSI-BLAST-aided searches of the NCBI non-redundant database. These sequences were subsequently included in our RT-specific PSSM library as previously described.
Figure 4 depicts a linearized phylogenetic tree of 716 full-length retroelements measured using 846 PSSMs generated from the RT domain and rooted with retrointrons. The pairwise distances among them were derived from Euclidean distance measurements in the data matrix, and an unrooted phylogenetic tree was derived from the distance matrix using the neighbor-joining method. The tree is drawn to scale, with branch lengths in the same units as the Euclidean distances calculated from the data matrix. Even at this size and level of divergence (~17% identity between groups), the PHYRN tree has robust monophyly. Within these monophyletic nodes, multiple subclades are evident. For example, we observe numerous subgroups of TY3 retroelements and proper subclade groupings of retroviral RTs. Recently, Simon et al. reported that there is a plethora of uncharacterized bacterial RTs (19). Our analysis is congruent with this proposal, as we also observe novel clades of bacterial origin. While these results are promising, the true evolutionary history of these sequences is unknown, and therefore we cannot fully evaluate the performance of PHYRN using this dataset. Further, during the course of these experiments, we determined that, although effective, the pipeline as described is computationally expensive for large datasets. To overcome these limitations, we made the following changes to the pipeline. Specifically, we compiled all PSSMs into a single database that can be used by standard c

rps-blast (8). This drastically reduced the number of operations and enabled us to use a sliding window of e-value thresholds to recover positive alignments. In data not shown, these changes afforded us a >600-fold increase in computational speed as well as improved speciation (see Methods for complete details).

[Figure 3, panel a: derivation]

\[
\begin{aligned}
\text{score} &= \%i \times \%c \qquad (\%i\text{: percent identity},\ \%c\text{: percent coverage}) \\
&= \frac{\mathit{ids}}{\mathit{alen}} \times \frac{\mathit{aqlen}}{\mathit{plen}} \\
&= \frac{\mathit{ids}}{\mathit{plen}} \times \frac{\mathit{aqlen}}{\mathit{aqlen} + \mathit{gaps}} \qquad (\text{because } \mathit{alen} = \mathit{aqlen} + \mathit{gaps}) \\
&= (\text{proportion of amino acid sites identical over a domain}) \times (\text{gap weight}) \\
&= (1 - \text{proportion of amino acid sites different over a domain}) \times (\text{gap weight}) \\
&= (1 - p) \times w
\end{aligned}
\]

ids: the number of identical residues
alen: the alignment length including gaps
aqlen: the alignment length in the query excluding gaps (= query_to - query_from + 1)
plen: the sequence length of a PSSM
gaps: the number of gaps in the query-sided alignment
w = aqlen / (aqlen + gaps): gap weight based on the number of gaps in the query-sided alignment

[Figure 3, panel b: distributions of % gap, % cov, % Id. Panel c: tree built from the 102-PSSM distance matrix.]

Figure 3: Phylogenetic profile based measurements of evolutionary distance. (a) Algebraic derivation of p-distance from the PHYRN product scoring scheme (%identity x %coverage). (b) Distribution of retroelements by measurements of % identity, % coverage, and % gap. (c) Unrooted phylogenetic tree of full-length retroelements measured using 102 PSSMs generated from the RT domain. The pairwise distances among them were acquired based on Euclidean distance measurement in the X 102 data matrix, and an unrooted phylogenetic tree was derived from the X 102 distance matrix using a minimum evolution method. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances calculated from the data matrix. Bootstrap and jackknife (80% fraction of samples) values were obtained from 1,000 replicates and are reported as percentages.
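The bootstrap and jackknife support values mentioned in the Figure 3 caption rest on resampling the PSSM columns (the M-dimension) of the product score matrix. The sketch below shows one way such resampling could look in plain Python. The function names, the toy matrix, and the choice to draw the jackknife columns without replacement are assumptions for illustration; the study itself reports using the PHYLIP random number generator, Fitch and Consense.

```python
import random


def bootstrap_columns(matrix, rng):
    """Draw M columns with replacement, so a column may be picked more than once."""
    m = len(matrix[0])
    picked = [rng.randrange(m) for _ in range(m)]
    return [[row[c] for c in picked] for row in matrix]


def jackknife_columns(matrix, rng, fraction=0.8):
    """Keep a random subset (80% by default) of the columns, each at most once.

    Sampling without replacement is an assumption of this sketch.
    """
    m = len(matrix[0])
    kept = sorted(rng.sample(range(m), round(fraction * m)))
    return [[row[c] for c in kept] for row in matrix]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy 3-query by 5-PSSM product score matrix (made-up values).
profile = [
    [0.18, 0.00, 0.12, 0.07, 0.01],
    [0.18, 0.05, 0.12, 0.02, 0.03],
    [0.00, 0.20, 0.02, 0.09, 0.11],
]
boot = bootstrap_columns(profile, rng)
jack = jackknife_columns(profile, rng)
```

Each resampled matrix would then be converted to a distance matrix and a tree, and the fraction of replicate trees containing a given node reported as its support value.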
Phylogenetic Reconstruction of Rose Simulations

Rose (Random Model of Sequence Evolution, version 1.3) implements a probabilistic model for protein sequence evolution (9). In this simulation, sequences are created from a common ancestor to produce a dataset of known size, divergence, and history. Because the multiple sequence alignment is created simultaneously during this artificial evolutionary process, the true history is recorded. This gives us full control over evolutionary rates and allows us to test the efficacy of our approach.

Figure 5 provides the results from PHYRN and MUSCLE (21) using 67 sequences simulated for 17% identity by ROSE (see Methods for a full description of PSSM generation). Whereas MUSCLE performs poorly on this dataset (Figure 5b), PHYRN recaptures 93% of the true evolutionary history and has only one deep node incorrectly identified (Figure 5a). In a second simulation, we maintained a similar level of divergence while increasing the size of the simulated dataset to 584 sequences. In this simulation, MUSCLE performance decreases significantly (no deep nodes are correctly obtained, Figure 6a), while PHYRN performance is still robust (Figure 6b): PHYRN recaptures 76% of the true evolutionary history at the deepest 64 nodes. Taken together, these results demonstrate the power of PHYRN for deriving deep evolutionary information.

Figure 4: Towards comprehensive phylogenies. Linearized phylogenetic tree of 716 full-length retroelements measured using 846 PSSMs generated from the RT domain and rooted with retrointrons. The pairwise distances among them were acquired based on Euclidean distance measurement in the 716 X 846 data matrix, and an unrooted phylogenetic tree was derived from the 716 X 716 distance matrix using a neighbor-joining (NJ) method. The tree is drawn to scale, with branch lengths in the same units as those of the Euclidean distances calculated from the data matrix.
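The step from an N X M data matrix to an N X N distance matrix described in the Figure 4 caption can be sketched as follows. This is a minimal pure-Python illustration with a made-up 3 X 3 profile; the function name and the values are hypothetical, and the real 716 X 846 matrix would normally be handled with a numerical library.

```python
import math


def euclidean_distance_matrix(profile):
    """Return the symmetric N x N matrix of Euclidean distances between rows."""
    n = len(profile)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(profile[i], profile[j])))
            dist[i][j] = dist[j][i] = d
    return dist


# Rows are queries, columns are PSSMs, entries are product scores (made up).
profile = [
    [0.18, 0.00, 0.12],
    [0.18, 0.05, 0.12],
    [0.00, 0.20, 0.02],
]
dist = euclidean_distance_matrix(profile)
```

In this toy example the first two queries differ in a single entry and end up close together, while the third query is distant from both. A matrix of this form is the kind of input that distance-based tree builders such as Neighbor-Joining or Minimum Evolution take.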

Figure 5: PHYRN recapitulates true evolutionary history better than MUSCLE in simulated protein families. Consensus tree between the original ROSE tree and the tree generated using (a) PHYRN and (b) MUSCLE. Simulated protein family generated using ROSE, with an average distance of 5 (p-distance ~0.83). Red circles mark the branch points (nodes) that are not recapitulated correctly (no. of query sequences = 67).
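The recapitulation percentages reported for the ROSE simulations compare the nodes of an inferred tree with those of the known true tree. One simple way to express such a comparison is the fraction of clades in the true tree that also appear in the inferred tree, as in the Python sketch below. This is an assumption-laden stand-in: the study derived its rates from PHYLIP consense output, whereas this sketch uses rooted trees written as nested tuples, hypothetical function names, and a made-up six-leaf example.

```python
def clades(tree):
    """Collect the non-trivial clades of a rooted tree given as nested tuples.

    Leaves are strings, e.g. (("A", "B"), ("C", "D")). Each clade is returned
    as a frozenset of leaf names; the all-leaf root clade is excluded.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # every tree on the same leaves shares this clade
    return found


def recapitulation_rate(true_tree, inferred_tree):
    """Fraction of true clades that are also present in the inferred tree."""
    true_clades = clades(true_tree)
    shared = true_clades & clades(inferred_tree)
    return len(shared) / len(true_clades)


true_tree = ((("A", "B"), ("C", "D")), ("E", "F"))
inferred = ((("A", "B"), ("C", "E")), ("D", "F"))
rate = recapitulation_rate(true_tree, inferred)
```

Here only the (A, B) clade is shared, so one of the four true clades is recovered. For unrooted trees the same idea is usually applied to bipartitions (splits) rather than rooted clades.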

Figure 6: Deep Node Recapitulation of true evolutionary history in mega-phylogenies. Consensus tree between the original ROSE tree and the tree generated using (a) MUSCLE and (b) PHYRN. Simulated protein family generated using ROSE, with an average distance of 5 (p-distance ~0.82). Green circles mark the deep nodes that are recapitulated correctly in the consensus trees (no. of query sequences = 584).

Discussion

Our case study of the RT superfamily and simulated datasets demonstrates that PHYRN is capable of inferring deep evolutionary relationships between highly divergent proteins. A number of implications can be derived from this study: (i) phylogenies built with PHYRN recapture more of the true evolutionary history and have robust statistical support; (ii) phylogenies built on pairwise alignments outperform conventional MSA methods; and (iii) this method is scalable to thousands of sequences. This improved performance is due to the improved information content of the PSSM libraries used in this study. We improved the efficacy of our PSSMs by: (i) limiting the PSSMs to homologous domains, (ii) optimizing the PSI-BLAST settings for their generation, and (iii) creating a pipeline that is sufficiently fast to handle large datasets. In MSA-dependent methods, increasing the number of query sequences makes it increasingly difficult to obtain an optimal multiple sequence alignment (22); in PHYRN, by contrast, increasing the number of query sequences also increases the dimensionality of the phylogenetic profile, thus increasing the alignment information space. This increase in information space leads to better, more robust measurements of relative rates. This comprehensive survey approach, where more sequences are better, is in contrast to the random walk approach of MSA-dependent methods, where additional sequences are a problem and trees are limited to discrete taxa. Further, the use of frequency tables in the phylogenetic profiles provides more informative measurements for calculating relative rates of evolution. This approach gives PHYRN the potential to generate trees with thousands of sequences, where the only theoretical limit is the available sequencing data. Indeed, when we expanded the RT tree from to 716 sequences comprising >14 groups, we obtained a tree that is consistent with, yet of higher resolution than, previously reported RT studies (2;15-17).
As PHYRN is well suited to making measurements on large divergent datasets, we hypothesize that this approach may be capable of solving a number of unanswered questions related to the ancient origins of life and speciation. Moreover, since PHYRN functions in the "twilight zone" of sequence similarity, this algorithm may be able to inform whether functionally or structurally similar proteins have a common ancestor or arose via convergent evolution. In conclusion, our study provides strong evidence that, even in its nascent stage, PHYRN measurements can provide key insight into evolutionary relationships among distantly related and/or rapidly evolving proteins.

Acknowledgements

This work was supported by the Searle Young Investigators Award and startup money from PSU (RLP), NCSA grant TG-MCB070027N (RLP, DVR), The National Science Foundation M (RLP, DVR), and The National Institutes of Health R01 GM (RLP, DVR). This project was also funded by a Fellowship from the Eberly College of Sciences and the Huck Institutes of the Life Sciences (DVR) and a grant with the Pennsylvania Department of Health using Tobacco Settlement Funds (DVR). The Department of Health specifically disclaims responsibility for any analyses, interpretations or conclusions. We would like to thank Teresa Killick, Sree Chintapalli, and Anand Padmanbha for their help and support during the project, as well as Jason Holmes at the Pennsylvania State University CAC center for technical assistance. We would also like to thank Drs. Robert E. Rothe, Jim White, Glenn M. Sharer, Cookie van Volmar, Barbara VanRossum, Russell Hamilton Carroll, C. Heesch and C. Hong for creative dialogue.

M (no. of PSSMs) matrix of product scores. Euclidean distances were calculated from the PHYRN product scores of each query, and an N X N Euclidean distance matrix was generated. As with the RT trees, phylogenetic trees were inferred using the Neighbor Joining (NJ) and Minimum Evolution (ME) methods, as available in MEGA (13).

Generating phylogenetic trees using MUSCLE

The optimal Multiple Sequence Alignment (MSA) for a given dataset was obtained using MUSCLE v3.6 (23). Phylogenetic trees for these optimal MSAs were inferred using MEGA's Neighbor-joining (NJ) and Minimum-evolution (ME) algorithms, with pairwise deletion and p-distance as default settings.

Consensus trees with ROSE true history

We used the Consense program of the PHYLIP v3.67 package (24;25) to generate consensus trees between PHYRN and Rose trees, as well as between MUSCLE and Rose trees. Recapitulation rates and percentages were then calculated from the consensus tree Newick files.

Bootstrap and Jackknife

We generated 3,000 random samples from our PHYRN M-dimension, using the random number generator code from the PHYLIP source code. During resampling, the same column could be selected more than once. We then used the Fitch program of the PHYLIP 3.67 package with default settings to generate minimum-evolution (ME) trees for each sample, followed by the Consense program to generate a majority-rule consensus tree of all samples. For jackknife resampling, we followed a similar approach to generate 1,000 random samples; however, only 80% of the original M-dimensional data was resampled each time. The Fitch and Consense programs were then used in the same manner as for bootstrap resampling.

Reference List

1. Ko,K.D., Hong,Y., Chang,G.S., Bhardwaj,G., van Rossum,D.B., and Patterson,R.L. Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure, Function and Evolution. Physics Archives arxiv: , q-bio.q.
2. Chang,G.S., Hong,Y., Ko,K.D., Bhardwaj,G., Holmes,E.C., Patterson,R.L., and van Rossum,D.B. Phylogenetic profiles reveal evolutionary relationships within the "twilight zone" of sequence similarity. Proc. Natl. Acad. Sci. U. S. A. 105:
3. Ko,K.D., Hong,Y., Bhardwaj,G., Killick,T.M., van Rossum,D.B., and Patterson,R.L. Brainstorming through the Sequence Universe: Theories on the Protein Problem. Physics Archives arxiv: v1, q-bio.qm:
4. Pellegrini,M., Marcotte,E.M., Thompson,M.J., Eisenberg,D., and Yeates,T.O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. U. S. A. 96:
5. Ranea,J.A., Yeats,C., Grant,A., and Orengo,C.A. Predicting protein function with hierarchical phylogenetic profiles: the Gene3D Phylo-Tuner method applied to eukaryotic genomes. PLoS. Comput. Biol. 3:e
6. Wu,J., Mellor,J.C., and DeLisi,C. Deciphering protein network organization using phylogenetic profile groups. Genome Inform. 16:
7. Hong,Y., Lee,D., Kang,J., van Rossum,D.B., and Patterson,R.L. Adaptive BLASTing through the Sequence Dataspace: Theories on Protein Sequence Embedding. Physics Archives arxiv: v1, q-bio.qm:
8. Marchler-Bauer,A., Anderson,J.B., Cherukuri,P.F., Weese-Scott,C., Geer,L.Y., Gwadz,M., He,S., Hurwitz,D.I., Jackson,J.D., Ke,Z. et al. CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res. 33 Database Issue:D192-D
9. Stoye,J., Evers,D., and Meyer,F. Rose: generating sequence families. Bioinformatics. 14:
10. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W., and Lipman,D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:
11. Saitou,N., and Nei,M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:
12. Adams,M.A., Suits,M.D., Zheng,J., and Jia,Z. Piecing together the structure-function puzzle: experiences in structure-based functional annotation of hypothetical proteins. Proteomics. 7:
13. Tamura,K., Dudley,J., Nei,M., and Kumar,S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 24:
14. Arkhipova,I.R., Pyatkov,K.I., Meselson,M., and Evgen'ev,M.B. Retroelements containing introns in diverse invertebrate taxa. Nat. Genet. 33:
15. Curcio,M.J., and Belfort,M. The beginning of the end: links between ancient retroelements and modern telomerases. Proc. Natl. Acad. Sci. U. S. A. 104:
16. Medhekar,B., and Miller,J.F. Diversity-generating retroelements. Curr. Opin. Microbiol. 10:
17. Eickbush,T.H., and Jamburuthugoda,V.K. The diversity of retrotransposons and the properties of their reverse transcriptases. Virus Res. 134:
18. Simon,D.M., Kelchner,S.A., and Zimmerly,S. A broadscale phylogenetic analysis of group II intron RNAs and intron-encoded reverse transcriptases. Mol. Biol. Evol. 26:
19. Simon,D.M., and Zimmerly,S. A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 36:
20. Simon,D.M., Clarke,N.A., McNeil,B.A., Johnson,I., Pantuso,D., Dai,L., Chai,D., and Zimmerly,S. Group II introns in eubacteria and archaea: ORF-less introns and new varieties. RNA. 14:
21. Edgar,R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:
22. Kemena,C., and Notredame,C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics. 25:
23. Edgar,R.C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC. Bioinformatics. 5:113.
24. Felsenstein,J. An alternating least squares approach to inferring phylogenies from pairwise distances. Syst. Biol. 46:
25. Felsenstein,J. Comparative methods with sampling error and within-species variation: contrasts revisited and revised. Am. Nat. 171:
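
Illustrative sketch (not the authors' code). The distance and resampling steps described in the Methods can be sketched in a few lines of Python. The sketch assumes the N X M product-score matrix is held as a list of rows, uses Python's own random module rather than the PHYLIP random number generator, and assumes the 80% jackknife draws columns without replacement. The function names and the example scores are hypothetical, and tree inference with Fitch and Consense is not covered.

```python
import math
import random


def euclidean_distance_matrix(scores):
    """Return the N x N Euclidean distance matrix for an N x M score matrix."""
    n = len(scores)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(scores[i], scores[j])))
            dist[i][j] = dist[j][i] = d
    return dist


def bootstrap_columns(scores, rng):
    """Draw M columns with replacement, so a column may be picked repeatedly."""
    m = len(scores[0])
    cols = [rng.randrange(m) for _ in range(m)]
    return [[row[c] for c in cols] for row in scores]


def jackknife_columns(scores, rng, fraction=0.8):
    """Keep a random subset of columns (assumption: drawn without replacement)."""
    m = len(scores[0])
    keep = sorted(rng.sample(range(m), max(1, round(fraction * m))))
    return [[row[c] for c in keep] for row in scores]


# Hypothetical product scores for 3 queries against 5 PSSMs (made-up numbers).
scores = [
    [1.0, 0.0, 2.0, 0.5, 0.0],
    [1.0, 0.0, 2.0, 0.5, 0.0],
    [4.0, 4.0, 2.0, 0.5, 0.0],
]
rng = random.Random(0)  # fixed seed so the resampling is repeatable

full = euclidean_distance_matrix(scores)
boot = euclidean_distance_matrix(bootstrap_columns(scores, rng))
jack = euclidean_distance_matrix(jackknife_columns(scores, rng))
```

Each resampled matrix has the same N X N shape as the full one, so in principle every replicate could be passed to a distance-based tree program in the same way as the original matrix.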


The CATH Database provides insights into protein structure/function relationships

The CATH Database provides insights into protein structure/function relationships 1999 Oxford University Press Nucleic Acids Research, 1999, Vol. 27, No. 1 275 279 The CATH Database provides insights into protein structure/function relationships C. A. Orengo, F. M. G. Pearl, J. E. Bray,

More information

Supplementary material to Whitney, K. D., B. Boussau, E. J. Baack, and T. Garland Jr. in press. Drift and genome complexity revisited. PLoS Genetics.

Supplementary material to Whitney, K. D., B. Boussau, E. J. Baack, and T. Garland Jr. in press. Drift and genome complexity revisited. PLoS Genetics. Supplementary material to Whitney, K. D., B. Boussau, E. J. Baack, and T. Garland Jr. in press. Drift and genome complexity revisited. PLoS Genetics. Tree topologies Two topologies were examined, one favoring

More information

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Research Proposal Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Name: Minjal Pancholi Howard University Washington, DC. June 19, 2009 Research

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Sequence Analysis '17- lecture 8. Multiple sequence alignment Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

More information

Bootstrapping and Tree reliability. Biol4230 Tues, March 13, 2018 Bill Pearson Pinn 6-057

Bootstrapping and Tree reliability. Biol4230 Tues, March 13, 2018 Bill Pearson Pinn 6-057 Bootstrapping and Tree reliability Biol4230 Tues, March 13, 2018 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 Rooting trees (outgroups) Bootstrapping given a set of sequences sample positions randomly,

More information

Name: Class: Date: ID: A

Name: Class: Date: ID: A Class: _ Date: _ Ch 17 Practice test 1. A segment of DNA that stores genetic information is called a(n) a. amino acid. b. gene. c. protein. d. intron. 2. In which of the following processes does change

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Introduction to Bioinformatics Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Dr. rer. nat. Gong Jing Cancer Research Center Medicine School of Shandong University 2012.11.09 1 Chapter 4 Phylogenetic Tree 2 Phylogeny Evidence from morphological ( 形态学的 ), biochemical, and gene sequence

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information