Department of Computer Science, Technical University of Munich, Bolzmannstr. 3, 85747, Garching b. Mu nchen, Germany 2

Size: px
Start display at page:

Download "Department of Computer Science, Technical University of Munich, Bolzmannstr. 3, 85747, Garching b. Mu nchen, Germany 2"

Transcription

1 Phylogenetic Bootstrapping under Resource Constraints: Higher Model Accuracy or more Replicates? Alexandros Stamatakis 1* and Vincent Rousset 2 1 Department of Computer Science, Technical University of Munich, Bolzmannstr. 3, 85747, Garching b. Mu nchen, Germany 2 Department of Biology, University of California, Riverside, Riverside, California, 92521, USA *Correspondence to: stamatak@cs.tum.edu Motivation: Bootstrapping is a standard method to infer confidence values on phylogenetic trees. Given the rapid growth of current input datasets that is driven by advances in wet-lab techniques as well as the high energy consumption and cost of computational resources we address the following question: How can a limited amount of computational resources best be used to infer the most accurate relative bootstrap support values under resource constraints. In particular, we address the question whether more computing time should be invested into optimizing per-bootstrap replicate Maximum Likelihood model parameters or if one should compute more replicates at the expense of lower model accuracy. Our computational experiments with the RAxML and GARLI algorithms indicate that it is better to invest more time into perreplicate model parameter optimization at the expense of computing less replicates, i.e., the computation of more, superficially optimized replicates, does not yield advantages. Introduction: The significant progress in DNA sequencing technology over the last years poses new challenges for phylogenetic analyses, since it provides the possibility to build and analyze significantly larger data sets that incorporate more sequences and/or more taxa. An important observation within this context, is that the pace of molecular data accumulation is significantly higher than the pace at which hardware architectures become faster, i.e., advances in sequencing techniques have outpaced Moore's law. Figure 1 provides the relative growth of data in Genebank compared to the transistor count in hardware architectures from (note the logscale on the y-axis). We call this phenomenon the Bio-Gap. With the emerging discipline of phylogenomics (see Delsuc et al. (2005) for a review) and the growing popularity of Expressed Sequence Tags (ESTs) for phylogeny reconstruction (Jeffroy et al., 2006), there is an increasing need to develop more efficient algorithms for tree-building, especially for time- as well as memory-intensive modelbased methods such as Maximum Likelihood (Felsenstein, 1981) or closely related Bayesian methods. Because of the growing popularity of multi-gene alignments, multi-core and supercomputer architectures will be deployed more frequently to conduct large-scale realworld phylogenetic analyses (see for example Dunn et al., 2008). Access to classic supercomputers such as the IBM BlueGene/L or BlueGene/P systems is typically granted in terms of CPU hours, i.e., a limited amount of time and resources is available to carry out an analysis. Moreover, many supercomputing centers currently face serious problems because of high energy costs, that in some cases even lead to the shutdown of relatively new and powerful facilities. This recent development has lead to a new research area in high performance computing called power-aware computing. Following this trend, the so-called green500 list ( that covers the 500 most energy efficient supercomputers in the world was introduced. This list only partially overlaps with the classic top500 list ( that comprises the most powerful supercomputers worldwide. Thus, taking into account this trend in computer architectures and energy prices, combined with the current molecular data explosion and rapidly growing alignment sizes we address the following question: Given a certain amount of time- or cost-constrained computational resources (CPU hours), how can those limited resources best be used for accurate phylogeny reconstruction? We intend to determine the trade-off, by means of the relative topological accuracy, between investing resources into optimization of per-bootstrap (bootstrapping: see below) replicate ML model parameters at the expense of computing less replicates versus superficial model parameter optimization for the sake of computing more replicates. The general Bootstrapping procedure is a well-established computer-based statistical method to obtain nonparametric error estimates (Efron, 1979), by inferring the variability in an unknown distribution from which the data (bootstrap replicates) was drawn by re-sampling from the original data. The seminal paper by Joe Felsenstein (Felsenstein, 1985), introduced the application of the bootstrap method to phylogenetic inference. The nonparametric phylogenetic Bootstrap (BS) method proceeds by randomly re-sampling the characters (columns) from the original matrix with replacement by creating a respective pseudo-replicate matrix that contains as many columns as the original alignment but has a slightly different column composition. This re-sampled (bootstrapped) alignment is then used as input for a Maximum Likelihood (ML, Felsenstein, 1981) tree search algorithm like GARLI or RAxML. Once ML trees for all 100 or 1,000 replicates have been computed, the frequency of appearance of a particular group (bipartition) of taxa among all the bootstrapped trees corresponds to the bootstrap confidence limit or simply BS value and can then be used to assess the relative stability of the respective phylogenetic group and stability of the overall tree under slight alterations (resamplings) of the original input alignment. To summarize and visualize the results of such a BS analysis the bootstrapped tree topologies are either used to compute various flavors of consensus trees: strict, majority rule, or extended (bifurcating) majority rule trees. Alternatively, the bootstrapped trees can be used to draw support values 36

2 on the best-scoring (best-known) ML tree on the original alignment. The phylogenetic BS is probably the most widely and commonly used approach to assess confidence on phylogenetic trees. Figure 1: The Bio-Gap : Relative growth of processor speeds and sequence data in GeneBank ; Molecular data growth has overtaken Moore's law. Note the logscale on the y- axis. To date, Joe Felsenstein's seminal paper on the phylogenetic BS (Felsenstein, 1985) has been cited over 11,000 times in the literature (ISI Web of Knowledge). Despite the significant progress in the development of heuristic ML search algorithms over the last years (see Morisson (2007) for a review), the bootstrap analyses still represent the major computational bottleneck in real-world phylogenomic studies under ML (Stamatakis et al., 2008; Morisson, 2007) and thus require a high amount of computational resources. These high resource requirements for inference of BS values, particularly on large memory-intensive phylogenomic datasets, have so far been a limiting factor in phylogeny reconstruction, in particular under the ML model (see McMahon and Sanderson 2006). Phylogenetic inference under ML for a single BS replicate/input alignment requires the estimation and optimization of: 1. the substitution model parameters 2. the branch lengths 3. the tree topology In the standard phylogenetic BS procedure the above three computational steps have to be repeated for every BS replicate, i.e., the tree for every replicate is computed de novo, without making use of any topological or model parameter information from ML searches on preceding BS replicates. Recently, Stamatakis et al. (2008) have partially solved the computational problem associated with bootstrapping by developing a novel rapid BS search algorithm (RBS algorithm) that uses the following two approximations to improve inference times by more than one order of magnitude while returning qualitatively comparable support values: Firstly, the replacement of the model parameter optimization procedure for each replicate by a one-time model parameter optimization on the original alignment and a reasonable, i.e., non-random starting tree, and, secondly, by designing a quick & dirty search algorithm to accelerate per-replicate BS tree searches. This quick & dirty algorithm also makes use of the topological information collected during preceding replicates (see Stamatakis et al., 2008 for details). Based on this fast RBS algorithm, here, we assess the trade-offs between speed and accuracy with respect to the first approximation only. We compare support values obtained from standard RBS searches (henceforth denoted as 1- RBS) that use the aforementioned one-time only model parameter optimization on the original alignment to modified RBS searches that use a more compute-intensive per-replicate optimization of model parameters (denoted as N-RBS, where N is the number of replicates). Our objective is to experimentally determine if, given a certain CPU time limit T, more replicates with 1-RBS or less replicates using N-RBS yield better relative accuracy with respect to a large number of 500 N-RBS replicates. Our findings, that are based on a large benchmark set of 18 real-world alignments, indicate that investing more time into per-replicate model parameter optimization yields slightly (RAxML) to significantly (GARLI) more accurate results than the execution of more replicates without thorough model parameter re-optimization. Experimental Setup: We conducted computational experiments on 18 single- and multi-gene real world alignments (17 DNA alignments and 1 Protein alignment) comprising 8 up to 2,000 taxa using an appropriately modified version of RAxML (Stamatakis, 2006b, see below). To ensure that our results are not biased by the specific RBS search algorithm and model parameter optimization strategy implemented in RAxML, we also performed 3 experiments with GARLI (Zwickl, 2006). GARLI allows to disable model parameter optimization by reading in userspecified model parameters instead. As model parameters for GARLI we used ML model parameters that were estimated with RAxML on the original alignment and an MP starting tree. Moreover, the GARLI search algorithm is equally powerful with respect to finding bestscoring trees as the RAxML search algorithm (Stamatakis, 2006b), but significantly slower by 1-2 orders of magnitude (Stamatakis 2006b, Stamatakis et al., 2008). Therefore, we only conducted 3 GARLI reference runs on datasets d150, d218, d354 (see below) with 500 replicates each. We call the resulting collections of BS trees with fixed model parameters 1-GARLI and with on-the-fly model parameter optimization N-GARLI, accordingly. The 18 test alignments are diverse in terms of organisms (green plants, acer, mammals, bacteria, archaea, fungi, Pappilomaviruses), genes (protein, mitochondrial, ribosomal, and non-coding genes), the number of taxa (from 8 to 2,000), and the number of concatenated genes in a single matrix (up to 106 genes for d8_m). Data set sizes are provided in Table 1 and are referred to in the text by dxyz, where XYZ is the respective number of taxa. Multigene datasets for which we executed partitioned analyses are denoted by _M and the protein protein datasets by _AA. 37

3 Dataset #bp Dataset #bp d8_m 127,026 d d53 7,542 d404_m 13,158 d59_m 6,951 d500 1,398 d81 4,552 d628 1,228 d125_m 29,149 d714 1,241 d140_aa_m 1,104 d855 1,436 d150 1,269 d1604 1,276 d217_m 3,665 d1908 1,424 d218 2,294 d2000 1,251 Table 1: Dataset sizes in the benchmark set. Computational experiments were conducted at the CIPRES ( project cluster located at the San Diego Supercomputer Center that is equipped with 16 8-way AMD 2.4 GHz Opteron shared-memory nodes. For each data set, we executed two RBS analyses with 500 replicates each: One analysis using a per-replicate optimization of ML model parameters (N-RBS) and a second analysis using the one-time model parameter optimization on the original alignment that are optimized on a Maximum Parsimony starting tree (1-RBS). All 1- RBS and N-RBS analyses on DNA data sets were performed under the GTR+ model (General Time-Reversible model of nucleotide substitution (Tavare, 1986) with the model of rate heterogeneity (Yang, 1994)), which is among the most widely used models for phylogenetic analyses of DNA (Ripplinger and Sullivan, 2008). Analyses on the 140 taxon protein dataset (d140_aa_m) were conducted under WAG+ (Whelan and Goldman, 2001), a widely used model of amino acid substitution. In addition, we conducted searches for the respective best-scoring ML trees on all original alignments. Adaptation of RAxML RBS Algorithm The RAxML RBS algorithm was adapted as follows with respect to the standard publicly available code described in Stamatakis et al. (2008): The restriction that only the GTR+CAT approximation of rate heterogeneity (Stamatakis, 2006a) can be used in combination with the quick & dirty RBS tree search algorithm was removed to allow usage of GTR+. In addition, we changed the command line interface parameters for invoking RBS to run 1-RBS (-f x) and N-RBS (-f X) analyses (for details please refer to the RAxML manual at The modified code as well as all test datasets and result files are available for download at the following address Result Analysis Experimental results were analyzed as follows: For each N-RBS run we determined the number of replicates that can be compute within the time taken by the significantly faster (average speedup 2.63) RBS replicates, i.e., we used the execution time of the RBS replicates on each dataset as a time constraint. We denote this reduced number of N-RBS replicates as N-RBS T, i.e., N- RBS constrained by CPU time T. We then computed various statistics between the N-RBS T trees, the RBS trees, and the full N-RBS trees. We use the 500 N- RBS trees as reference because they represent the statistically more correct, but slower, standard approach to bootstrapping. For all sets of replicates we computed the relative Robinson-Foulds distance (RF, Robinson and Foulds, 1981) with treedist from PHYLIP as well as the weighted Robinson-Foulds distance (WRF, Robinson and Foulds, 1979) with a script by Olaf Bininda-Emonds called partitionmetric.pl (available at on the extended majority rule (bifurcating/binary) consensus trees obtained by applying the consense program from the PHYLIP package. The unweighted RF distance between two tree topologies counts the number of subtrees found in one tree or the other, but not both. The WRF distance uses the support value information on the respective subtrees instead, i.e., a subtree with a support of 0.5 counts 0.5 instead of 1 as for RF. The WRF distance thus allows to take support values on consensus trees into account and penalizes differently placed subtrees with low support to a lesser extent than differently placed subtrees with high support. If the RF distance for a pair of trees with support values is significantly larger than the respective WRF distances, this means that the topological differences are mainly due to differently placed subtrees with low support, whereas when RF WRF this means that subtrees with high support are placed differently in the two trees under comparison. In addition, we computed the Pearson correlation coefficient between support values obtained via 1-RBS and N-RBS, as well as N-RBS T and N-RBS drawn on the respective bestscoring ML trees. We also computed the intercept and slope of the respective linear regression function. We mostly focus on the results obtained via RF and WRF since those topological distances appear to be more sensitive than the Pearson coefficient to slight changes/differences in the collections of replicates and better allow to discriminate between the approaches (see also Stamatakis et al., 2008). For the experiments with GARLI we computed the RF as well as WRF distances between GARLI runs and 500 N-GARLI runs. We also computed the topological distances between 250 N-GARLI runs, i.e., N-GARLI T and 500 N-GARLI runs. Here we chose a fixed value of 250, because unlike in RAxML, the run time differences between 1-GARLI and N-GARLI runs are insignificant, the run time variation lies between 0.97 and This insignificant execution time improvement is due to the genetic search algorithm that is used in GARLI (Zwickl, 2006). The omission of model parameter optimization does not necessarily mean that the algorithm will execute less generations, i.e., converge faster, in partic- 38

4 ular because the search space becomes less smooth without model parameter optimization. In the RAxML RBS algorithm the run time improvements are more prevalent, because the number of iterations of the search algorithm is fixed a priori (Stamatakis et al., 2008). Results: We provide the speedup (denoted as acc) of RBS replicates over 500 N-RBS replicates in Table 2. We also indicate the number of replicates (#reps) in the time-constrained N-RBS T searches. Dataset acc #reps Dataset acc #reps d8_m d d d404_m d59_m d d d d125_m d d140_aa_m d d d d217_m d d d Table 2: Speedup of 1-RBS over N-RBS and number of replicates that N-RBS can conduct within the time required for RBS replicates. The average number of replicates is 264 while the average speedup amounts to In Table 3 we indicate the topological RF as well as WRF distances in percent between RBS replicates and the 500 N-RBS replicates. The average relative RF is 6.51% and the average WRF is 2.47%. Dataset RF WRF Dataset RF WRF d8_m 0 0 d d d404_m d59_m d d d d125_m 0 0 d d140_aa_m d d d d217_m d d d Table 3: Relative RF and WRF distances in % between consensus trees induced by RBS and 500 N-RBS replicates. Finally, in Table 4 we indicated relative RF and WRF topological distances in % between varying numbers of NRBS T replicates (see Table 2) and 500 N-RBS replicates. The average RF amounts to 4.60% and the WRF to 1.98%. The average Pearson correlation coefficient (data not shown) between RBS and 500 N-RBS support values drawn on the respective best-scoring ML trees is (min: 0.998, max: 1.0), the average slope of the linear regression function is (min: 0.984, max: 1.009) and the average absolute offset amounts to 0.34% (max: 1.48%). The average Pearson correlation coefficient between time constrained N-RBS T and 500 N- RBS replicates on the best-scoring ML tree is (min: 0.995, max: 1.0) with an average slope of (min: 0.97, max: 1.012) and an average absolute offset of 0.56% (max: 3.09%). Dataset RF WRF Dataset RF WRF d8_m 0 0 d d d404_m d59_m 0 0 d d d d125_m d d140_aa_m 0 0 d d d d217_m d d d Table 4: Relative RF and WRF distances in % between consensus trees induced by varying numbers of N-RBS T and 500 N- RBS replicates. GARLI Results The results obtained by GARLI generally show a similar, though stronger tendency as the results obtained by RAxML. The relative RF/WRF topological distances between 1-GARLI and N-GARLI are 10.8%/4.4% (d150), 25.1%/12.8% (d218), and 55.8%/14.9% (d354). The RF/WRF distances between 250 N-GARLI T replicates and 500 N-GARLI replicates are 5.4%/1.6% (d150), 8.3%/2.4% (d218), and 34.47%/5.5%(d354) respectively. Discussion: In general our results indicate that the differences between the relative accuracy of 1-RBS bootstrap replicates and time-constrained NRBS T replicates with respect to a reference set of 500 N-RBS replicates are not very large. This indicates that the model parameter approximation used in the original RBS algorithm, which uses the 1-RBS strategy only has an insignificant impact on the support value distribution and shape of the extended majority rule consensus trees. The average speedup achieved by omitting per-replicate model parameter optimization is However, if the very large speedup on the 8-taxon data set is not included in the calculation, 1-RBS is only two times faster than N-RBS. The large speedup on the 8-taxon dataset is due to the low number of taxa and hence small search space in combination with the very long sequences (127,026 bp). Hence, in contrast to other datasets, the largest amount of CPU time is required to optimize model parameters and not for the tree search. The fact that virtually no speedup is observed on the protein alignment (d140_aa_m) is due to the fact that a fixed model of amino acid substitution is used and hence less parameters need to be optimized. Whereas the Pearson correlation on the best-scoring ML trees only yields insignificant differences between 1-RBS 39

5 and N-RBS T, the RF as well as WRF distances on the extended majority rule consensus trees exhibit larger discrepancies. As mentioned before, RF and WRF exhibit a higher degree of sensitivity for the comparison of sets of bootstrapped trees than the Pearson correlation. The N- RBS T approach shows smaller average RF and WRF values than 1-RBS, in particular on partitioned multigene analyses (denoted by _M). The results also indicate that the effects of Bootstrapping with one-time model parameter optimization regarding the relative accuracy and the run time improvements are highly algorithm-specific, since the omission of model parameter optimization in GARLI does not yield any speedup. Besides, unlike for RBS, the relative accuracy of GARLI bootstrap analyses is significantly worse than for 250 N-GARLI T analyses. The main reason for this phenomenon is that the model parameter optimization, branch length optimization, and tree search mechanisms are more strongly interleaved and interdependent in GARLI than in RAxML. Conclusion: Our results support that computing more replicates at the expense of a more superficial per-replicate model parameter optimization is costly in terms of relative accuracy and does not constitute a good alternative to the standard bootstrapping method. With respect to GARLI, omission of per-replicate model parameter optimization actually significantly decreases relative accuracy. We thus conclude that when only a limited amount of computational resources is available, it should be used to infer less replicates with higher per-replicate model accuracy. In addition, the biologically more meaningful average WRF distance is lower for this approach and at the same time it represents the statistically less debatable approach. Moreover, except for DNA datasets with few taxa and many genes (d8_m, d53, d59_m, d125_m) the inference times do not increase dramatically and the average WRF for these datasets is significantly lower for N- RBS T. The results obtained with GARLI also indicate that speedups induced by omission of the model parameter optimization procedure are highly algorithm-specific. Hence, the computation of more superficially optimized replicates does not seem to yield any substantial advantages. Future work will focus on devising a fast per-replicate model optimization procedure in the RBS algorithm. Acknowledgments: The authors wish to thank Markus Göker, Guido Grimm, Nicolas Salamin, Chuck Robertson, Nikos Poulakakis, and Marc Gottschling for providing realworld test datasets for this study. AS is funded under the auspices of the Emmy-Noether program by the German Science Foundation (DFG). Edgecombe, G.D., et al. (2008) Broad phylogenomic sampling improves resolution of the animal tree of life. Nature, 452(7188): Efron, B. (1979) Bootstrap methods: Another look at the jackknife. Ann. Statist., 7:1-25. Felsenstein, J. (1981) Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol., 17(6): Felsenstein, J. (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution, 39(4): Jeffroy, O., Brinkmann, H., Delsuc, F., Philippe, H. (2006) Phylogenomics: the beginning of incongruence? Trends in Genetics, 22(4): Morrison, D.A. (2007) Increasing the Efficiency of Searches for the Maximum Likelihood Tree in a Phylogenetic Analysis of up to 150 Nucleotide Sequences. Syst. Biol., 56(6): McMahon, M.M., Sanderson M.J. (2006) Phylogenetic Supermatrix Analysis of GenBank Sequences from 2228 Papilionoid Legumes. Syst. Biol., 55(5): Ripplinger, J., Sullivan, J. (2008) Does Choice In Model Selection Affect Maximum Likelihood Analyses? Syst. Biol., 57(1): Robinson, D.F., Foulds, L.R. (1979) Comparison of weighted labelled trees. Lecture Notes Math., 748: Robinson, D.F, Foulds, L.R. (1981) Comparison of phylogenetic trees. Math. Biosci., 53: Stamatakis, A. (2006a) Phylogenetic models of rate heterogeneity: A high performance computing perspective. In Proceedings of IPDPS2006. Stamatakis, A. (2006b) RAxML-VI-HPC: Maximum Likelihood-based Phylogenetic Analyses with Thousands of Taxa and Mixed Models. Bioinformatics, 22(21): Stamatakis, A., Hoover, P., Rougemont, J. (2008) A Rapid Bootsrap Algorithm for the RAxML WebServers. Syst. Biol., 75(5): , 2008 Tavare, S. (1986) Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. DNA Sequence Analysis, 17: Whelan, S., Goldman, N. (2001) A General Empirical Model of Protein Evolution Derived from Multiple Protein Families Using a MaximumLikelihood Approach. Mol. Biol. Evol., 18: Yang, Z. (1994) Maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites. J. Mol. Evol., 39: Zwickl, D. (2006) Genetic Algorithm Approaches for the Phylogenetic Analysis of Large Biological Sequence Datasets Under the Maximum Likelihood Criterion. PhD Thesis, The University of Texas at Austin. References Delsuc, F., Brinkmann, H., Philippe, H., et al. (2005) Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Gen., 6(5): Dunn, C.W., Hejnol, A., Matus, D.Q., Pang, K., Browne, W.E., Smith, S.A., Seaver, E., Rouse, G.W., Obst, M., 40

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1

Smith et al. American Journal of Botany 98(3): Data Supplement S2 page 1 Smith et al. American Journal of Botany 98(3):404-414. 2011. Data Supplement S1 page 1 Smith, Stephen A., Jeremy M. Beaulieu, Alexandros Stamatakis, and Michael J. Donoghue. 2011. Understanding angiosperm

More information

COMPUTING LARGE PHYLOGENIES WITH STATISTICAL METHODS: PROBLEMS & SOLUTIONS

COMPUTING LARGE PHYLOGENIES WITH STATISTICAL METHODS: PROBLEMS & SOLUTIONS COMPUTING LARGE PHYLOGENIES WITH STATISTICAL METHODS: PROBLEMS & SOLUTIONS *Stamatakis A.P., Ludwig T., Meier H. Department of Computer Science, Technische Universität München Department of Computer Science,

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Bioinformatics tools for phylogeny and visualization. Yanbin Yin Bioinformatics tools for phylogeny and visualization Yanbin Yin 1 Homework assignment 5 1. Take the MAFFT alignment http://cys.bios.niu.edu/yyin/teach/pbb/purdue.cellwall.list.lignin.f a.aln as input and

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana

More information

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Statistical estimation of models of sequence evolution Phylogenetic inference using maximum likelihood:

More information

Anatomy of a species tree

Anatomy of a species tree Anatomy of a species tree T 1 Size of current and ancestral Populations (N) N Confidence in branches of species tree t/2n = 1 coalescent unit T 2 Branch lengths and divergence times of species & populations

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Phylogenetics in the Age of Genomics: Prospects and Challenges

Phylogenetics in the Age of Genomics: Prospects and Challenges Phylogenetics in the Age of Genomics: Prospects and Challenges Antonis Rokas Department of Biological Sciences, Vanderbilt University http://as.vanderbilt.edu/rokaslab http://pubmed2wordle.appspot.com/

More information

Bootstrapping and Tree reliability. Biol4230 Tues, March 13, 2018 Bill Pearson Pinn 6-057

Bootstrapping and Tree reliability. Biol4230 Tues, March 13, 2018 Bill Pearson Pinn 6-057 Bootstrapping and Tree reliability Biol4230 Tues, March 13, 2018 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 Rooting trees (outgroups) Bootstrapping given a set of sequences sample positions randomly,

More information

A phylogenomic toolbox for assembling the tree of life

A phylogenomic toolbox for assembling the tree of life A phylogenomic toolbox for assembling the tree of life or, The Phylota Project (http://www.phylota.org) UC Davis Mike Sanderson Amy Driskell U Pennsylvania Junhyong Kim Iowa State Oliver Eulenstein David

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 A non-phylogeny

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor. Department of Biology, Arizona State University Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

More information

A Fitness Distance Correlation Measure for Evolutionary Trees

A Fitness Distance Correlation Measure for Evolutionary Trees A Fitness Distance Correlation Measure for Evolutionary Trees Hyun Jung Park 1, and Tiffani L. Williams 2 1 Department of Computer Science, Rice University hp6@cs.rice.edu 2 Department of Computer Science

More information

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION Integrative Biology 200B Spring 2009 University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley B.D. Mishler Jan. 22, 2009. Trees I. Summary of previous lecture: Hennigian

More information

Concepts and Methods in Molecular Divergence Time Estimation

Concepts and Methods in Molecular Divergence Time Estimation Concepts and Methods in Molecular Divergence Time Estimation 26 November 2012 Prashant P. Sharma American Museum of Natural History Overview 1. Why do we date trees? 2. The molecular clock 3. Local clocks

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Ziheng Yang Department of Biology, University College, London An excess of nonsynonymous substitutions

More information

Parsimony via Consensus

Parsimony via Consensus Syst. Biol. 57(2):251 256, 2008 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150802040597 Parsimony via Consensus TREVOR C. BRUEN 1 AND DAVID

More information

Bayesian Models for Phylogenetic Trees

Bayesian Models for Phylogenetic Trees Bayesian Models for Phylogenetic Trees Clarence Leung* 1 1 McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, Canada ABSTRACT Introduction: Inferring genetic ancestry of different species

More information

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition David D. Pollock* and William J. Bruno* *Theoretical Biology and Biophysics, Los Alamos National

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees

first (i.e., weaker) sense of the term, using a variety of algorithmic approaches. For example, some methods (e.g., *BEAST 20) co-estimate gene trees Concatenation Analyses in the Presence of Incomplete Lineage Sorting May 22, 2015 Tree of Life Tandy Warnow Warnow T. Concatenation Analyses in the Presence of Incomplete Lineage Sorting.. 2015 May 22.

More information

Systematics - Bio 615

Systematics - Bio 615 Bayesian Phylogenetic Inference 1. Introduction, history 2. Advantages over ML 3. Bayes Rule 4. The Priors 5. Marginal vs Joint estimation 6. MCMC Derek S. Sikes University of Alaska 7. Posteriors vs Bootstrap

More information

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support

ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support ASTRAL: Fast coalescent-based computation of the species tree topology, branch lengths, and local branch support Siavash Mirarab University of California, San Diego Joint work with Tandy Warnow Erfan Sayyari

More information

Phylogenetic analyses. Kirsi Kostamo

Phylogenetic analyses. Kirsi Kostamo Phylogenetic analyses Kirsi Kostamo The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species,

More information

Supplementary Materials for

Supplementary Materials for advances.sciencemag.org/cgi/content/full/1/8/e1500527/dc1 Supplementary Materials for A phylogenomic data-driven exploration of viral origins and evolution The PDF file includes: Arshan Nasir and Gustavo

More information

Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets

Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets Xiaofan Zhou, 1,2 Xing-Xing Shen, 3 Chris Todd Hittinger, 4 and Antonis Rokas*,3 1 Integrative Microbiology

More information

Evaluating phylogenetic hypotheses

Evaluating phylogenetic hypotheses Evaluating phylogenetic hypotheses Methods for evaluating topologies Topological comparisons: e.g., parametric bootstrapping, constrained searches Methods for evaluating nodes Resampling techniques: bootstrapping,

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid Lecture 11 Increasing Model Complexity I. Introduction. At this point, we ve increased the complexity of models of substitution considerably, but we re still left with the assumption that rates are uniform

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline Phylogenetics Todd Vision iology 522 March 26, 2007 pplications of phylogenetics Studying organismal or biogeographic history Systematics ating events in the fossil record onservation biology Studying

More information

The Importance of Proper Model Assumption in Bayesian Phylogenetics

The Importance of Proper Model Assumption in Bayesian Phylogenetics Syst. Biol. 53(2):265 277, 2004 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150490423520 The Importance of Proper Model Assumption in Bayesian

More information

Consensus Methods. * You are only responsible for the first two

Consensus Methods. * You are only responsible for the first two Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2018 University of California, Berkeley Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

More information

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation 1 dist est: Estimation of Rates-Across-Sites Distributions in Phylogenetic Subsititution Models Version 1.0 Edward Susko Department of Mathematics and Statistics, Dalhousie University Introduction The

More information

reconciling trees Stefanie Hartmann postdoc, Todd Vision s lab University of North Carolina the data

reconciling trees Stefanie Hartmann postdoc, Todd Vision s lab University of North Carolina the data reconciling trees Stefanie Hartmann postdoc, Todd Vision s lab University of North Carolina 1 the data alignments and phylogenies for ~27,000 gene families from 140 plant species www.phytome.org publicly

More information

The Exelixis Lab, Dept. of Computer Science, Technische Universität München, Germany. Organismic Botany, Eberhard-Karls-University, Tübingen, Germany

The Exelixis Lab, Dept. of Computer Science, Technische Universität München, Germany. Organismic Botany, Eberhard-Karls-University, Tübingen, Germany Research Article: Maximum Likelihood Analyses of 3,490 rbcl Sequences: Scalability of Comprehensive Inference versus Group-Specific Taxon Sampling Alexandros Stamatakis 1, *, Markus Göker 2,3, Guido Grimm

More information

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes

Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes Using Phylogenomics to Predict Novel Fungal Pathogenicity Genes David DeCaprio, Ying Li, Hung Nguyen (sequenced Ascomycetes genomes courtesy of the Broad Institute) Phylogenomics Combining whole genome

More information

Molecular Evolution & Phylogenetics

Molecular Evolution & Phylogenetics Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic

More information

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

More information

Phylogenomics: the beginning of incongruence?

Phylogenomics: the beginning of incongruence? Phylogenomics: the beginning of incongruence? Olivier Jeffroy, Henner Brinkmann, Frédéric Delsuc, Hervé Philippe To cite this version: Olivier Jeffroy, Henner Brinkmann, Frédéric Delsuc, Hervé Philippe.

More information

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein, 1985 1 I. Short

More information

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5. Five Sami Khuri Department of Computer Science San José State University San José, California, USA sami.khuri@sjsu.edu v Distance Methods v Character Methods v Molecular Clock v UPGMA v Maximum Parsimony

More information

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species PGA: A Program for Genome Annotation by Comparative Analysis of Maximum Likelihood Phylogenies of Genes and Species Paulo Bandiera-Paiva 1 and Marcelo R.S. Briones 2 1 Departmento de Informática em Saúde

More information

An Investigation of Phylogenetic Likelihood Methods

An Investigation of Phylogenetic Likelihood Methods An Investigation of Phylogenetic Likelihood Methods Tiffani L. Williams and Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131-1386 Email: tlw,moret @cs.unm.edu

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz Phylogenetic Trees What They Are Why We Do It & How To Do It Presented by Amy Harris Dr Brad Morantz Overview What is a phylogenetic tree Why do we do it How do we do it Methods and programs Parallels

More information

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence Are directed quartets the key for more reliable supertrees? Patrick Kück Department of Life Science, Vertebrates Division,

More information

Isolating - A New Resampling Method for Gene Order Data

Isolating - A New Resampling Method for Gene Order Data Isolating - A New Resampling Method for Gene Order Data Jian Shi, William Arndt, Fei Hu and Jijun Tang Abstract The purpose of using resampling methods on phylogenetic data is to estimate the confidence

More information

Bootstraps and testing trees. Alog-likelihoodcurveanditsconfidenceinterval

Bootstraps and testing trees. Alog-likelihoodcurveanditsconfidenceinterval ootstraps and testing trees Joe elsenstein epts. of Genome Sciences and of iology, University of Washington ootstraps and testing trees p.1/20 log-likelihoodcurveanditsconfidenceinterval 2620 2625 ln L

More information

Exam Dates. February 19 March 1 If those dates don't work ad hoc in Heidelberg

Exam Dates. February 19 March 1 If those dates don't work ad hoc in Heidelberg Exam Dates February 19 March 1 If those dates don't work ad hoc in Heidelberg Plan for next lectures Lecture 11: More on Likelihood Models & Parallel Computing in Phylogenetics Lecture 12: (Andre) Discrete

More information

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

More information

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Supplementary Note S2 Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Phylogenetic trees reconstructed by a variety of methods from either single-copy orthologous loci (Class

More information

Unsupervised Learning in Spectral Genome Analysis

Unsupervised Learning in Spectral Genome Analysis Unsupervised Learning in Spectral Genome Analysis Lutz Hamel 1, Neha Nahar 1, Maria S. Poptsova 2, Olga Zhaxybayeva 3, J. Peter Gogarten 2 1 Department of Computer Sciences and Statistics, University of

More information

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression) Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

More information

Appendix 1. 2K(K +1) n " k "1. = "2ln L + 2K +

Appendix 1. 2K(K +1) n  k 1. = 2ln L + 2K + Appendix 1 Selection of model of amino acid substitution. In the SOWHL tests performed on amino acid alignments either the WAG (Whelan and Goldman 2001) or the JTT (Jones et al. 1992) replacement matrix

More information

C.DARWIN ( )

C.DARWIN ( ) C.DARWIN (1809-1882) LAMARCK Each evolutionary lineage has evolved, transforming itself, from a ancestor appeared by spontaneous generation DARWIN All organisms are historically interconnected. Their relationships

More information

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods

More information

Comparative Bioinformatics Midterm II Fall 2004

Comparative Bioinformatics Midterm II Fall 2004 Comparative Bioinformatics Midterm II Fall 2004 Objective Answer, part I: For each of the following, select the single best answer or completion of the phrase. (3 points each) 1. Deinococcus radiodurans

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

Session 5: Phylogenomics

Session 5: Phylogenomics Session 5: Phylogenomics B.- Phylogeny based orthology assignment REMINDER: Gene tree reconstruction is divided in three steps: homology search, multiple sequence alignment and model selection plus tree

More information

What is Phylogenetics

What is Phylogenetics What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

More information

TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

More information

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis 10 December 2012 - Corrections - Exercise 1 Non-vertebrate chordates generally possess 2 homologs, vertebrates 3 or more gene copies; a Drosophila

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets

Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets Béatrice Roure, 1 Denis Baurain, z,2 and Hervé Philippe*,1 1 Département de Biochimie, Centre Robert-Cedergren, Université

More information

Mixture Models in Phylogenetic Inference. Mark Pagel and Andrew Meade Reading University.

Mixture Models in Phylogenetic Inference. Mark Pagel and Andrew Meade Reading University. Mixture Models in Phylogenetic Inference Mark Pagel and Andrew Meade Reading University m.pagel@rdg.ac.uk Mixture models in phylogenetic inference!some background statistics relevant to phylogenetic inference!mixture

More information

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley

PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION Integrative Biology 200 Spring 2018 University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley D.D. Ackerly Feb. 26, 2018 Maximum Likelihood Principles, and Applications to

More information

Symmetric Tree, ClustalW. Divergence x 0.5 Divergence x 1 Divergence x 2. Alignment length

Symmetric Tree, ClustalW. Divergence x 0.5 Divergence x 1 Divergence x 2. Alignment length ONLINE APPENDIX Talavera, G., and Castresana, J. (). Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Systematic Biology, -. Symmetric

More information

Maximum Likelihood Estimation on Large Phylogenies and Analysis of Adaptive Evolution in Human Influenza Virus A

Maximum Likelihood Estimation on Large Phylogenies and Analysis of Adaptive Evolution in Human Influenza Virus A J Mol Evol (2000) 51:423 432 DOI: 10.1007/s002390010105 Springer-Verlag New York Inc. 2000 Maximum Likelihood Estimation on Large Phylogenies and Analysis of Adaptive Evolution in Human Influenza Virus

More information

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu Mul$ple Sequence Alignment Methods Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu Species Tree Orangutan Gorilla Chimpanzee Human From the Tree of the Life

More information

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used Molecular Phylogenetics and Evolution 31 (2004) 865 873 MOLECULAR PHYLOGENETICS AND EVOLUTION www.elsevier.com/locate/ympev Efficiencies of maximum likelihood methods of phylogenetic inferences when different

More information

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization)

STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization) STEM-hy: Species Tree Estimation using Maximum likelihood (with hybridization) Laura Salter Kubatko Departments of Statistics and Evolution, Ecology, and Organismal Biology The Ohio State University kubatko.2@osu.edu

More information

HIGH PERFORMANCE, BAYESIAN BASED PHYLOGENETIC INFERENCE FRAMEWORK

HIGH PERFORMANCE, BAYESIAN BASED PHYLOGENETIC INFERENCE FRAMEWORK HIGH PERFORMANCE, BAYESIAN BASED PHYLOGENETIC INFERENCE FRAMEWORK By Xizhou Feng Bachelor of Engineering China Textile University, 1993 Master of Science Tsinghua University, 1996 Submitted in Partial

More information

林仲彥. Dec 4,

林仲彥. Dec 4, 林仲彥 cylin@iis.sinica.edu.tw Dec 4, 2009 http://eln.iis.sinica.edu.tw Coding Characters and Defining Homology Classical phylogenetic analysis by Morphology Molecular phylogenetic analysis By Bio-Molecules

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information

PhyloNet. Yun Yu. Department of Computer Science Bioinformatics Group Rice University

PhyloNet. Yun Yu. Department of Computer Science Bioinformatics Group Rice University PhyloNet Yun Yu Department of Computer Science Bioinformatics Group Rice University yy9@rice.edu Symposium And Software School 2016 The University Of Texas At Austin Installation System requirement: Java

More information

Intraspecific gene genealogies: trees grafting into networks

Intraspecific gene genealogies: trees grafting into networks Intraspecific gene genealogies: trees grafting into networks by David Posada & Keith A. Crandall Kessy Abarenkov Tartu, 2004 Article describes: Population genetics principles Intraspecific genetic variation

More information

molecular evolution and phylogenetics

molecular evolution and phylogenetics molecular evolution and phylogenetics Charlotte Darby Computational Genomics: Applied Comparative Genomics 2.13.18 https://www.thinglink.com/scene/762084640000311296 Internal node Root TIME Branch Leaves

More information

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution Today s topics Inferring phylogeny Introduction! Distance methods! Parsimony method!"#$%&'(!)* +,-.'/01!23454(6!7!2845*0&4'9#6!:&454(6 ;?@AB=C?DEF Overview of phylogenetic inferences Methodology Methods

More information

Rapid evolution of the cerebellum in humans and other great apes

Rapid evolution of the cerebellum in humans and other great apes Rapid evolution of the cerebellum in humans and other great apes Article Accepted Version Barton, R. A. and Venditti, C. (2014) Rapid evolution of the cerebellum in humans and other great apes. Current

More information

PHYLOGENY AND SYSTEMATICS

PHYLOGENY AND SYSTEMATICS AP BIOLOGY EVOLUTION/HEREDITY UNIT Unit 1 Part 11 Chapter 26 Activity #15 NAME DATE PERIOD PHYLOGENY AND SYSTEMATICS PHYLOGENY Evolutionary history of species or group of related species SYSTEMATICS Study

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

Chapter 9 BAYESIAN SUPERTREES. Fredrik Ronquist, John P. Huelsenbeck, and Tom Britton

Chapter 9 BAYESIAN SUPERTREES. Fredrik Ronquist, John P. Huelsenbeck, and Tom Britton Chapter 9 BAYESIAN SUPERTREES Fredrik Ronquist, John P. Huelsenbeck, and Tom Britton Abstract: Keywords: In this chapter, we develop a Bayesian approach to supertree construction. Bayesian inference requires

More information