HIGH PERFORMANCE, BAYESIAN BASED PHYLOGENETIC INFERENCE FRAMEWORK


By Xizhou Feng

Bachelor of Engineering, China Textile University, 1993
Master of Science, Tsinghua University, 1996

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Department of Computer Science and Engineering, College of Engineering and Information Technology, University of South Carolina, 2006

Dedication

To Rong, Kevin and Katherine

Acknowledgements

During the course of my graduate study, I have been fortunate to receive advice, support, and encouragement from many people. Foremost is the debt of gratitude that I owe to my thesis advisors, Professor Duncan A. Buell and Professor Kirk W. Cameron. Not only was Duncan responsible for introducing me to this interesting and fruitful field, he also provided me with inspiring guidance, great patience, and never-ending encouragement during the past several years. I especially thank Professor Kirk W. Cameron for his invaluable mentoring, insightful advising, and constant investment. Kirk guided me into the exciting field of systems study, and provided opportunities and support to conduct quality research work in several cutting-edge areas.

I thank Professor Manton Matthews for his years of academic advising and for being on my advisory committee. His guidance and support made it possible for me to explore various fields in computer science and engineering. I thank Professor John R. Rose and Professor Peter Waddell for their valuable suggestions in this research work. The discussions and collaborative work with John and Peter generated some important ideas which have been included in this thesis. I thank Professor Austin L. Hughes for being on my advisory committee and for providing critical opinions which led me to rethink and significantly improve this dissertation. I also thank the faculty and staff in the Department of Computer Science and Engineering for providing me with one of the most wonderful training programs in the world.

Finally, I thank my family for their love and support during the difficult period of completing my dissertation. This dissertation is dedicated to my wife Rong, my son Kevin, and my daughter Katherine.

Abstract

Comparative analyses of biological data rely on a phylogenetic tree that describes the evolutionary relationship of the organisms studied. By combining the Markov Chain Monte Carlo (MCMC) method with likelihood-based assessment of phylogenies, Bayesian phylogenetic inference incorporates complex statistical models into the process of phylogenetic tree estimation. This combination can be used to address a number of complex questions in evolutionary biology. However, Bayesian analyses are computationally expensive because they almost invariably require high dimensional integrations over unknown parameters. Thoroughly investigating and exploiting the power of the Bayesian approach requires a high performance computing framework; without one, the computational challenges of Bayesian phylogenetic inference for large phylogeny problems cannot be tackled.

This dissertation extends the existing Bayesian phylogenetic inference framework in three aspects: 1) exploring various strategies to improve the performance of the MCMC sampling method; 2) developing high performance, parallel algorithms for Bayesian phylogenetic inference; and 3) combining data uncertainty and model uncertainty in Bayesian phylogenetic inference. We implemented all these extensions in PBPI, a software package for parallel Bayesian phylogenetic inference. We validated the PBPI implementation using a simulation study, a common method in phylogenetics and other scientific disciplines. The simulation results showed that PBPI can estimate the model trees accurately given a sufficient number of sequences and correct models.

We evaluated the computational speed of PBPI using simulated datasets on a terascale computing facility and observed significant performance improvement. On a single processor, PBPI ran up to 19 times faster than the current leading Bayesian phylogenetic inference program with the same quality of output. On 64 processors, PBPI achieved a parallel speedup of 46 times on average. Combining both the sequential improvement and parallel computation, PBPI can speed up current Bayesian phylogenetic inference by up to 870 times.

Table of Contents

Dedication
Acknowledgements
Abstract
List of Tables
List of Figures

Chapter 1 Introduction
Phylogeny and its applications
Phylogenetic inference
The challenges
Searching a complex tree space
Developing realistic evolutionary models
Dealing with incomplete and unequal data distribution
Resolving conflicts among different methods and data sources
Bayesian phylogenetic inference and its issues
Motivation
Research objectives and contributions
Organization of this dissertation

Chapter 2 Background
Representations of phylogenetic trees
Methods for phylogenetic inference
2.2.1 Sequence-based methods and genome-based methods
Distance-, MP-, ML- and BP-based methods
Tree search strategies
High performance computing phylogenetic inference methods
Bayesian phylogenetic inference
Introduction
The Bayesian framework
Components of Bayesian phylogenetic inference
Likelihood, prior and posterior probability
Empirical and hierarchical Bayesian analysis
Models of molecular evolution
The substitution rate matrix
Properties of the substitution rate matrix
The general time reversible (GTR) model
Rate heterogeneity among different sites
Other more realistic evolutionary models
Likelihood function and its evaluation
The likelihood function
Felsenstein's algorithm for likelihood evaluation
Optimizations of likelihood computation
Sequence packing
Likelihood local update
Tree balance
2.8 Markov Chain Monte Carlo methods
The Metropolis-Hastings algorithm
Exploring the posterior distribution
The issues
Summary of the posterior distribution
Summary of the phylogenetic trees
Summary of the model parameters
Chapter summary

Chapter 3 Improved Monte Carlo Strategies
Introduction
Observations
Strategy #1: reducing stickiness using variable proposal step length
Strategy #2: reducing sampling intervals using multipoint MCMC
Strategy #3: improving mixing rate with parallel tempering
Proposal algorithms for phylogenetic models
Basic tree mutation operators
Basic tree branch length proposal methods
Propose new parameters
Co-propose topology and branch length
Extended proposal algorithms for phylogenetic models
Extended tree mutation operator
Multiple-tree-merge operator
Backbone-slide-and-slide operator
3.8 Chapter summary

Chapter 4 Parallel Bayesian Phylogenetic Inference
The need for parallel Bayesian phylogenetic inference
TAPS: a tree-based abstraction of parallel system
Performance models for parallel algorithms
Concurrencies in Bayesian phylogenetic inference
Issues of parallel Bayesian phylogenetic inference
Parallel algorithms for Bayesian phylogenetic inference
Task decomposition and assignment
Synchronization and communication
Load balancing
Symmetric MCMC algorithm
Asymmetric MCMC algorithm
Justifying the correctness of the parallel algorithms
Chapter summary

Chapter 5 Validation and Verification
Introduction
Experimental methodology
The model trees
The simulated datasets
The accuracy metrics
Tested programs and their run configurations
The computing platforms
5.3 Results on model tree FUSO
The overall accuracy of results
Further analysis
PBPI stability
Results on model tree BURK
Chapter summary

Chapter 6 Performance Evaluation
Introduction
Experimental methodology
The sequential performance of PBPI
The execution time of PBPI and MrBayes
The quality of the tree samples drawn by PBPI
The execution time of PBPI and MrBayes
Parallel speedup for fixed problem size
Scalability analysis
Parallel speedup with scaled workload
Scalability with different problem sizes
Scalability with the number of chains
Chapter summary

Chapter 7 Summary and Future Work
The big picture
Future work

Bibliography

List of Tables

Table 1-1: The number of unrooted bifurcating trees as a function of taxa
Table 5-1: The four model trees used in experiments
Table 5-2: PBPI run configurations for validation and verification
Table 5-3: The number of datasets where the model tree FUSO024 is found in the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree. A total of 5 datasets are used in each case
Table 5-4: The average distances between the model tree FUSO024 and the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree. A total of 5 datasets are used in each case
Table 5-5: The topological distances between the model tree FUSO024 and the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree for datasets with 10,000 characters. Datasets are simulated under the JC69 model
Table 5-6: The average distances between the model tree BURK050 and the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree. A total of 5 datasets were used in each case
Table 6-1: Benchmark dataset used in the evaluation
Table 6-2: Sequential execution time of PBPI and MrBayes

List of Figures

Figure 1-1: The procedure of a phylogenetic inference
Figure 2-1: Phylogenetic trees of 12 primates mitochondrial DNA sequences
Figure 2-2: The NEWICK representation of the primate phylogenetic tree
Figure 2-3: The nontrivial bipartitions of the primate phylogenetic tree
Figure 2-4: A phylogenetic tree with support values for each clade
Figure 2-5: The transition diagram and transition matrix of nucleotides
Figure 2-6: The Felsenstein algorithm for likelihood evaluation
Figure 2-7: Illustration of likelihood local update
Figure 2-8: The tree-balance algorithm
Figure 2-9: Metropolis-Hastings algorithm
Figure 3-1: A target distribution with three modes
Figure 3-2: Distribution approximated using Metropolis MCMC methods
Figure 3-3: Samples drawn using Metropolis MCMC method
Figure 3-4: Illustration of state moves
Figure 3-5: Approximated distribution using variable step length MCMC
Figure 3-6: The multipoint MCMC
Figure 3-7: A family of tempered distributions with different temperatures
Figure 3-8: The Metropolis-coupled MCMC algorithm
Figure 3-9: The extended-tree-mutation method
Figure 3-10: The multiple-tree-merge method
Figure 3-11: The backbone slide and scale method
Figure 4-1: An illustration of TAPS
Figure 4-2: Speedup under fixed workload
Figure 4-3: The procedure of a generic Bayesian phylogenetic inference
Figure 4-4: Map 8 chains to a 4 x 4 grid, where the length each sequence is
Figure 4-5: The symmetric parallel MCMC algorithm
Figure 5-1: The procedure of a simulation method for accuracy assessment
Figure 5-2: Run configuration for MrBayes
Figure 5-3: The phylogram of the model tree FUSO
Figure 5-4: The MPP tree estimated from dataset fuso024_l10000_jc69_d
Figure 5-5: Estimation variances in 10 individual runs
Figure 5-6: The phylogram of the model tree BURK
Figure 5-7: The MPP tree estimated from dataset burk050_l10000_jc69_d001.nex
Figure 5-8: The posterior distribution of the top 50 most probable trees
Figure 5-9: The topological distances distribution of the top 50 most probable trees
Figure 6-1: Different speedup values computed by wall clock time and user time
Figure 6-2: Log likelihood plot of the tree samples drawn by PBPI and MrBayes
Figure 6-3: The consensus tree estimated by PBPI
Figure 6-4: The consensus tree estimated by MrBayes
Figure 6-5: Parallel speedup of PBPI for dataset FUSO024_L
Figure 6-6: Parallel speedup of PBPI for dataset ARCH107_L
Figure 6-7: Parallel speedup of PBPI for dataset BACK218_L
Figure 6-8: The consensus tree estimated by PBPI on 64 processors
Figure 6-9: Parallel speedup with different number of taxa


Chapter 1 Introduction

1.1 Phylogeny and its applications

All life on earth, both present and past, is believed to be descended from a common ancestor. The pattern of descent, or evolutionary relationship, among species or organisms, or the relatedness of their genes, is usually described by a phylogeny: a tree or network structure whose edge lengths represent the evolutionary divergence along different lineages. In a phylogeny, all existing organisms are placed at the leaves and ancestral organisms are placed at the branching points, or internal nodes.

Since all biological phenomena are the result of evolution, most biological studies have to be conducted in the light of evolution and require information on phylogeny to interpret data [1]. Thus, phylogenies play important roles not only in evolutionary biology, genetics and genomics, but also in modern pharmaceutical research, drug discovery, agricultural plant improvement, disease control studies (detection, prevention and prediction) and other biology-related fields. The importance of phylogeny in scientific research and human society has never been made clearer than by the ambitious Tree of Life project initiated by the US National Science Foundation, which aims to assemble a phylogeny for all 1.7 million described species (ATOL) to benefit society and science [2].

The applications of phylogenies span a wide range of fields, both in industry and science. Several examples follow: identifying, organizing and classifying organisms [3, 4]; interpreting and understanding the organization and evolution of genomes [5, 6]; identifying and characterizing newly discovered pathogens [7]; reconstructing the evolution and radiation of life on earth [8, 9]; and identifying mutations most likely associated with diseases [10].

1.2 Phylogenetic inference

A phylogeny describes the pattern of evolutionary history among a group of taxa. But history happens only once, and people have to use the clues it leaves behind to reconstruct the actual events. One of the fundamental tasks of phylogenetic inference is to approximate the true phylogenetic tree for a group of taxa using a set of evolutionary evidence in which the phylogenetic signals reside.

Various kinds of data are used in phylogenetic inference, but recently DNA/RNA molecular sequences have become the most common, for three reasons: 1) DNA sequences are the hereditary material of all organisms on earth; 2) mathematical models of molecular evolution are feasible and can be improved incrementally; 3) huge numbers of genomic sequences have been generated and are publicly accessible.

The third reason is the most important for the rapid advancement of phylogenetic inference using genomic data. Worldwide genome projects, such as the Human Genome Project (HGP) [11], have generated an ever-increasing amount of biological data. These data are publicly accessible through several government-supported database efforts, such as GenBank [12], EMBL [13], DDBJ [14], and Swiss-Prot [15]. On August 22, 2005, the public collections of DNA and RNA sequences provided by GenBank, EMBL, and DDBJ reached 100 gigabases (i.e., 100,000,000,000 bases), representing genes and genomes of over 165,000 organisms. The massive, complex data sets already generated, and those yet to be generated, have been fueling the emergence or renaissance of several interdisciplinary fields, including large scale phylogenetic analysis of genomic data.

The problem of phylogenetic inference using genomic (molecular) sequences is formalized as follows: given an aligned character matrix $X = (x_{ij})_{N \times M}$ for a set of $N$ taxa, each taxon being represented by a sequence of $M$ characters and $x_{ij}$ denoting the character of the $i$-th taxon at the $j$-th site of its sequence, phylogenetic inference typically seeks to answer two basic questions:

1) What is the phylogenetic tree (or model) that best explains the evolutionary relations among these taxa?
2) With how much confidence is a particular tree expected to be correct?

Every phylogenetic method can output a phylogenetic tree which the method views as the best tree according to certain optimization criteria. However, given the inherent complexities in biological evolution and some unrealistic assumptions in phylogenetic inference, an inference method usually not only produces a tree but also provides a measurement of the confidence in the tree. Bootstrapping and Bayesian posterior probability (discussed later) are two common statistical tools that provide such confidence measurements.

As shown in Figure 1-1, a phylogenetic inference is usually preceded by multiple alignment and model selection to generate its input. Most phylogenetic methods rely on some phylogenetic tree as their input as well. To reduce the errors produced by the interdependence among multiple alignment, model selection and phylogenetic inference, several iterations of alignment, selection, and inference may be required.

[Figure 1-1 flowchart; box labels in order of appearance: Collect Data; Retrieve Homologous Sequences; Align Multiple Sequences; Aligned Data Matrix; Select Model of Evolution; Phylogenetic Inference; Phylogenetic Tree(s); Assess Confidence; Best tree with measures of support; Hypothesis Testing]

Figure 1-1: The procedure of a phylogenetic inference
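The aligned character matrix defined in this section can be pictured with a small sketch. The taxon names and sequences below are invented for illustration and are not data from this dissertation; the helper names `to_character_matrix` and `site_column` are likewise hypothetical.

```python
# Hypothetical toy alignment: names and sequences are made up for illustration.
alignment = {
    "taxon_A": "ACGTAC",
    "taxon_B": "ACGTTC",
    "taxon_C": "ACATAC",
    "taxon_D": "GCATAC",
}


def to_character_matrix(aln):
    """Return (taxa, X), where X[i][j] is the character of taxon i at site j."""
    taxa = sorted(aln)
    lengths = {len(aln[t]) for t in taxa}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    X = [list(aln[t]) for t in taxa]
    return taxa, X


def site_column(X, j):
    """Return the j-th column of X, i.e. one aligned site across all taxa."""
    return [row[j] for row in X]


taxa, X = to_character_matrix(alignment)
N, M = len(X), len(X[0])  # N taxa, M sites
print(N, M, site_column(X, 0))
```

Note that the code uses 0-based indices, whereas the text counts taxa and sites from 1.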

1.3 The challenges

Though there have been significant advances in phylogenetic inference in the past several decades, large scale phylogenetic inference is still a challenging problem.

1.3.1 Searching a complex tree space

The biggest challenge of phylogenetic inference is the growth in the number of unrooted trees, described by

$$Z = \prod_{i=3}^{N} (2i - 5) \qquad (1\text{-}1)$$

Here $Z$ denotes the number of possible tree topologies and $N$ denotes the number of taxa. Table 1-1 shows the number of unrooted trees corresponding to the number of taxa. For example, the tree space for 100 taxa contains on the order of $10^{182}$ unrooted trees. Searching this space exhaustively to find the best tree is computationally impractical. Most optimization-based phylogenetic methods, such as maximum parsimony and maximum likelihood, are NP-hard problems. Many heuristic strategies for tree searching have been studied, but much work remains to be done to improve these methods [16].

Table 1-1: The number of unrooted bifurcating trees as a function of taxa

Number of taxa | Number of unrooted trees
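Equation (1-1) can be evaluated directly with exact integer arithmetic. The following is a minimal sketch; the function name `count_unrooted_trees` and the chosen taxon counts are illustrative assumptions, not part of the original text.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating tree topologies, Z = prod_{i=3}^{N} (2i - 5)."""
    if n_taxa < 3:
        raise ValueError("the formula is defined for at least 3 taxa")
    z = 1
    for i in range(3, n_taxa + 1):
        z *= 2 * i - 5
    return z


# Print a small table of taxa counts against tree counts.
for n in (3, 4, 5, 10, 20, 50, 100):
    print(f"{n:>4d}  {count_unrooted_trees(n):.3e}")
```

Python integers have arbitrary precision, so the count for 100 taxa is computed exactly and only rounded when printed.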

1.3.2 Developing realistic evolutionary models

Most phylogenetic methods explicitly or implicitly assume a model of genomic sequence evolution and use such a model to estimate the rate of evolution, calculate pairwise distances, or compute the likelihood of a given phylogeny.

The process of genomic sequence evolution is shaped by two factors: mutation and selection. Mutations are errors incurred during DNA replication. Mutations create genetic diversity among populations, and natural selection steers the direction of evolution. Mutation events include substitution, recombination, duplication, insertion, deletion, and inversion [17]. At the same time, mutations are constrained by the geometric, physical and chemical structures of nucleotides, amino acids, codons, protein secondary structures, and protein tertiary structures [18].

Though phylogenetic signals exist in all kinds of mutation events, most evolutionary models consider only substitution events because it is either difficult or computationally intractable to integrate other events into the models used by phylogenetic analysis [19, 20]. With increasing computational power, researchers have relaxed some early assumptions in evolutionary models and proposed more realistic models, for example by allowing rate variation across sites [21], considering the effect of insertion and deletion, and incorporating secondary structure information [22-24].

Given multiple possible models, a phylogenetic inference approach needs to select the model that best fits the data. The approach should also be robust enough to give a correct tree even when some assumptions have been violated.

Besides the complexity of modeling the evolution of a single type of sequence, the need for combined analysis of multiple datasets with different data types and sources requires a unified model which is both mathematically well-founded and biologically meaningful [25, 26].

1.3.3 Dealing with incomplete and unequal data distribution

The imperfect process of sampling, sequencing and alignment may introduce various kinds of noise into an available data set. Bias or errors in multiple sequence alignment are the cause of most of this noise because: 1) most multiple sequence alignment methods depend on a correct phylogeny to guide the alignment process; 2) it is necessary to search across trees to find the overall optimum. It is possible to refine the alignment by repeating the procedure of multiple alignment, model selection, and phylogenetic inference, but it is always dangerous to assume the alignment is perfect.

To assess the reliability or sensitivity of a phylogeny on data with uncertainty, the bootstrap approach [28] was suggested by Felsenstein [29] and further refined by Efron et al. [30]. Bootstrapping requires repeating the phylogenetic inference procedure many times (typically on the order of 1000 times [23]) on derived datasets obtained by resampling the sites of the original data with replacement.

The usefulness of phylogenetic inference methods is also limited by the sparse and uneven distribution of sequence data among species and by the uncertainty inherent in the available data. Some species have been sequenced for many genes; a few genes have been sequenced for many species; but most of the potential data available for phylogenetic purposes is still missing [31, 32].
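The bootstrap resampling step mentioned in this section can be sketched as follows. This is only an illustration under stated assumptions: the toy alignment, the seed, and the function name `bootstrap_alignment` are invented here, and the tree inference that would be run on each replicate is omitted.

```python
import random


def bootstrap_alignment(X, rng):
    """Draw one bootstrap replicate by sampling site columns with replacement.

    X is a list of rows (one per taxon); every row has the same length.
    Returns the replicate and the list of sampled column indices.
    """
    n_sites = len(X[0])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    replicate = [[row[j] for j in cols] for row in X]
    return replicate, cols


# Hypothetical toy alignment (4 taxa, 6 sites).
X = [list("ACGTAC"), list("ACGTTC"), list("ACATAC"), list("GCATAC")]

rng = random.Random(2006)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(X, rng)[0] for _ in range(1000)]
print(len(replicates), len(replicates[0]), len(replicates[0][0]))
```

Each replicate keeps whole columns intact, so the characters of all taxa at a sampled site stay together; only the selection and multiplicity of sites changes between replicates.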

1.3.4 Resolving conflicts among different methods and data sources

Researchers usually represent a species with one or more genes in phylogeny reconstruction. However, a gene tree is not the same as a species tree [23]. Phylogenetic trees constructed with different genes or different data types (morphological data vs. molecular data) may differ. These conflicts may come from improper model assumptions or tree building approaches.

1.4 Bayesian phylogenetic inference and its issues

This dissertation aims to extend the framework of Bayesian phylogenetic inference to achieve high performance on large phylogeny problems. By combining several factors into a comprehensive probability model and removing unknown parameters by marginalizing over them, Bayesian analysis has the potential to integrate complex (i.e., realistic) models and existing knowledge into phylogenetic inference. However, like other methods when they were first introduced, Bayesian phylogenetic inference generated both excitement and debate.

Supporters of the Bayesian approach claim that Bayesian phylogenetic methods have at least two advantages over traditional phylogenetic methods [33-36]:

1) The primary Bayesian phylogenetic analysis produces both a tree estimate and a measure of uncertainty for the groups on the estimated tree [10, 37, 38]. The uncertainty is measured by a quantity called the Bayesian posterior probability, which is approximated by the percentage of occurrences of a group in the tree samples generated by certain MCMC (Markov Chain Monte Carlo) methods [39-41].

2) Bayesian methods can implement very complex models of sequence evolution, because a well-designed MCMC can traverse various highly probable regions of the tree space instead of staying in a single region which is locally optimal but may not be globally optimal [37].

However, with more thorough investigation, Bayesian phylogenetic inference has also raised various highly debated issues [34, 36, 42]. Several major issues are summarized below:

1) Some Bayesian analyses offer findings that conflict with those from other approaches, such as maximum parsimony (MP) and maximum likelihood (ML) [43, 44]. Some highly debated topics include: How meaningful are Bayesian support values? [45]; Do Bayesian support values reflect the probability of being true? [46]; and the overcredibility of molecular phylogenies obtained by Bayesian phylogenetics [47]. The supporters' claim that the Bayesian posterior probability of a tree is the probability that the estimated tree is correct under the correct model [10] is highly debatable. Some convincing interpretation is necessary to reconcile these debates.

2) One cornerstone of Bayesian phylogenetic inference is posterior probability approximation using Markov Chain Monte Carlo (MCMC). Shortly after MCMC-based methods appeared, people expected that they would be more efficient than traditional ML with bootstrapping [41]. However, experience shows that the chains have to run much longer than previously expected to converge to the correct approximation [48]. More seriously, research shows that the MCMC method may give misleading posterior probabilities under certain conditions [42, 49], for example on a mixture of trees [50].

In spite of the above and other issues, Bayesian analysis has gained wide acceptance since it was introduced into phylogenetics [8, 51-57].

1.5 Motivation

Given the arguments described above, both positive and negative, it is necessary to investigate Bayesian phylogenetic inference more thoroughly. Given the stochastic nature of molecular evolution, statistical analyses such as Bayesian methods do have the potential to provide a unified framework that combines multiple data sources and existing knowledge in phylogenetic inference.

Some of the debates about Bayesian phylogenetic inference are due to insufficient understanding or inadequate implementation of this method, especially of the MCMC algorithm. An improper MCMC implementation is in danger of stopping at local optima and of failing to cross low probability zones to reach other modes. Therefore, we need to explore improved MCMC strategies to develop more reliable and more efficient implementations.

One barrier to extensive investigation of Bayesian methods is that the method itself is time consuming. Given hundreds of taxa and complex models, a complete MCMC-based Bayesian analysis may run for several months to obtain a solution. A similar situation occurred when the maximum likelihood method was first introduced. However, as computing systems became more powerful and better algorithms were developed, the maximum likelihood method came into wide use. The same may happen for Bayesian phylogenetic methods.

1.6 Research objectives and contributions

This dissertation aims to develop a high performance framework for Bayesian phylogenetic inference. The following summarizes the research objectives and contributions of this dissertation.

1) Developing a high performance computing framework for Bayesian phylogenetic inference. In this dissertation, we investigate technologies and platforms for Bayesian phylogenetic inference and abstract different computing platforms into the TAPS (Tree-based Abstraction of Parallel System) model. Based on this model, we developed parallel MCMC algorithms for Bayesian phylogenetic inference and implemented them in the PBPI (Parallel Bayesian Phylogenetic Inference) program. Both analytical analyses and numerical simulations show that PBPI achieves roughly linear speedup for datasets with different problem sizes. This means that a Bayesian phylogenetic inference lasting several months with former methods can be finished in several hours using parallel algorithms on mid-sized Beowulf-like clusters.

2) Developing better MCMC strategies for Bayesian phylogenetic inference. In this dissertation, we proposed and implemented several MCMC strategies for exploring the posterior probability distribution of the phylogenetic model. By using a variable proposal step length, we made the MCMC chain cross high energy barriers (i.e., low probability regions) and overcome stickiness around locally optimal regions. By introducing a directional search within each proposal step, we improved the quality of each proposal and shortened the sampling intervals, thereby reducing the total number of generations needed to produce an acceptable distribution. To improve the mixing rate of the chain, we also implemented a class of population-based MCMC methods which use multiple chains to explore the search space more efficiently. We demonstrated that classical MCMC methods risk generating misleading posterior probabilities on some models; with an improved MCMC framework, this risk was reduced. Various novel algorithms and MCMC strategies were implemented in this research.

3) Accommodating data uncertainty in phylogenetic inference with data resampling in the MCMC. We extended Bayesian phylogenetic inference to include data noise in the inference procedure and showed that ML with bootstrapping can be viewed as a special case of generic Bayesian phylogenetic inference. We argued that the Bayesian posterior probability and the bootstrap support value measure two kinds of phylogenetic uncertainty: the former refers to multiple possible models for the same dataset; the latter refers to the robustness of a tree on a specific dataset. Both uncertainties can be assessed jointly by incorporating data resampling into a single MCMC run.

1.7 Organization of this dissertation

This dissertation includes three parts.

The first part consists of Chapters 1 and 2, which present background, methods, and results in the field of Bayesian phylogenetic inference. In this chapter we introduce the phylogenetic inference problem, its applications, and its challenges. We also provide a short review of positive and negative views of Bayesian phylogenetic methods. In Chapter 2, we review various phylogenetic approaches and recent advances in high performance computing for solving large phylogeny problems.

The second part includes Chapters 3 and 4, in which we describe our extended, high performance, Bayesian phylogenetic inference framework. In Chapter 3, we demonstrate the weaknesses of traditional MCMC methods and propose how to overcome these weaknesses using improved MCMC algorithms. In Chapter 4, we describe our parallel Bayesian phylogenetic inference framework. We first discuss the general models and methods for parallelizing Bayesian phylogenetic inference, which can be used as the foundation for introducing high performance computing support to the phylogenetic inference problem. Then we present an implementation of parallel Metropolis-coupled MCMC and numerical results.

The third part consists of Chapters 5 and 6, where we provide a performance evaluation of the Bayesian method and our implementations. Using datasets simulated under several model trees, we verified that our implementation not only output correct results but also ran faster, both sequentially and in parallel, than MrBayes [58], the most popular Bayesian phylogenetic inference program currently available. Our results also demonstrated that Bayesian phylogenetic methods are accurate under current models of evolution.

Finally, in Chapter 7, we summarize the results, conclusions and contributions of this dissertation and outline future research.
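Section 1.4 describes the Bayesian posterior probability of a group as the percentage of sampled trees in which that group occurs. A minimal sketch of that counting step is given below, assuming each sampled tree has already been reduced to the set of groups (clades) it contains. The samples are invented for illustration, and `clade_frequencies` is a hypothetical helper, not a function of PBPI or MrBayes.

```python
from collections import Counter


def clade_frequencies(tree_samples):
    """Estimate the posterior probability of each clade as its sample frequency.

    tree_samples: a list with one entry per sampled tree; each entry is a
    collection of clades, and each clade is a frozenset of taxon labels.
    """
    counts = Counter()
    for clades in tree_samples:
        counts.update(set(clades))  # count each clade at most once per tree
    n = len(tree_samples)
    return {clade: c / n for clade, c in counts.items()}


# Hypothetical MCMC output: four sampled trees over taxa A..E.
samples = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("ABC")},
]

support = clade_frequencies(samples)
for clade, p in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), p)
```

In practice the samples would be taken after discarding an initial burn-in portion of the chain; that step is left out of this sketch.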

Chapter 2 Background

2.1 Representations of phylogenetic trees

A phylogenetic tree is a graph representation of the evolutionary relationship among a set of species or organisms. Since species are organized as a hierarchical classification in taxonomy, in phylogenetic inference we call a species at a leaf node of the tree a taxon (plural taxa). A phylogenetic tree is usually represented by a binary tree in which each node is connected to at most three other nodes, but it can be represented by a multifurcating tree when some parts of the tree cannot be fully resolved [59-62].

Each internal branch of the tree corresponds to a divergence event in evolution and divides all taxa into two groups. Each group is called a clade, and each taxon in a clade shares the same common ancestor with the other taxa in the clade. If the length of a branch is given, it is proportional to the time since the two groups of taxa diverged from their most recent common ancestor.

A phylogenetic tree can be rooted or unrooted depending on whether a unique node is chosen as the most recent common ancestor of all taxa. Determining the true root for a group of taxa is usually impractical, so unrooted trees are most commonly used in phylogenetic inference.
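The statement that each internal branch divides the taxa into two groups can be illustrated with a small adjacency-list sketch of an unrooted binary tree. The five-taxon tree, its node labels, and the helper names `build_adjacency` and `split_by_branch` are assumptions made for this example only.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) branches."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def split_by_branch(adjacency, u, v):
    """Return the two groups of leaves separated by the branch (u, v).

    Because a tree has no cycles, walking from u while refusing to enter v
    visits exactly the nodes on u's side of the branch.
    """
    seen, stack = {u}, [u]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour != v and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    leaves = {n for n, nbrs in adjacency.items() if len(nbrs) == 1}
    side = seen & leaves
    return frozenset(side), frozenset(leaves - side)


# Hypothetical unrooted binary tree: leaves A..E, internal nodes x, y, z.
edges = [("x", "A"), ("x", "B"), ("x", "y"),
         ("y", "C"), ("y", "z"),
         ("z", "D"), ("z", "E")]
tree = build_adjacency(edges)

print(split_by_branch(tree, "x", "y"))
print(split_by_branch(tree, "y", "z"))
```

Every internal node in this example has exactly three neighbours and every leaf has one, matching the description of a fully resolved unrooted tree.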

[Figure 2-1: Phylogenetic trees of 12 primate mitochondrial DNA sequences, in four panels (a)-(d). Taxa shown: Tarsius syrichta, Lemur catta, Saimiri sciureus, Hylobates, Pongo, Gorilla, Homo sapiens, Pan, M. sylvanus, M. fascicularis, Macaca fuscata, M. mulatta. Scale bar: 0.1.]

Figure 2-1 shows the phylogenetic tree of 12 primate mitochondrial DNA sequences. This tree was constructed using MrBayes from 898 DNA characters under the JC69 model. Figure 2-1 (a) and (b) are cladograms, which provide topological information only. Figure 2-1 (c) and (d) are phylograms, which provide both branching order and divergence time. The NEWICK format representation of the phylogenetic tree [63, 64] in Figure 2-1 is shown as follows.

    #NEXUS
    BEGIN TREES;
      TRANSLATE
         1 Tarsius_syrichta,
         2 Lemur_catta,
         3 Homo_sapiens,
         4 Pan,
         5 Gorilla,
         6 Pongo,
         7 Hylobates,
         8 Macaca_fuscata,
         9 M_mulatta,
        10 M_fascicularis,
        11 M_sylvanus,
        12 Saimiri_sciureus
        ;
      UTREE * PRIMATE = (1,2,(12,((7,(6,(5,(3,4)))),(11,(10,(8,9))))));
    ENDBLOCK;

Figure 2-2: The NEWICK representation of the primate phylogenetic tree

To make the NEWICK representation unique, we define the signature of an unrooted tree as the one of its NEWICK representations that satisfies two requirements: 1) the root of the tree is fixed at the internal node that has the taxon with the smallest label as one of its children; and 2) the children of each internal node are ordered by their labels lexicographically. For example, the signature of the above tree is:

(1,2,((((((3,4),5),6),7),(((8,9),10),11)),12))

Using the tree signature, we can test the equality of two trees by simple string comparison. When the distance between two trees, rather than their equality, is of interest, a phylogenetic tree can also be treated as a hierarchy of bipartitions. Each branch in the phylogenetic tree divides the set of taxa into a bipartition. For example, the complete set of nontrivial bipartitions (i.e., bipartitions in which each part has at least two nodes) for the primate phylogenetic tree shown in Figure 2-2 is:

    (1,2)           (3,4,5,6,7,8,9,10,11,12)
    (1,2,12)        (3,4,5,6,7,8,9,10,11)
    (3,4)           (1,2,5,6,7,8,9,10,11,12)
    (3,4,5)         (1,2,6,7,8,9,10,11,12)
    (3,4,5,6)       (1,2,7,8,9,10,11,12)
    (3,4,5,6,7)     (1,2,8,9,10,11,12)
    (8,9)           (1,2,3,4,5,6,7,10,11,12)
    (8,9,10)        (1,2,3,4,5,6,7,11,12)
    (8,9,10,11)     (1,2,3,4,5,6,7,12)

Figure 2-3: The nontrivial bipartitions of the primate phylogenetic tree

Like the signature of a phylogenetic tree, each bipartition can be viewed as a signature of its corresponding tree node, so we can compare two nodes from two different phylogenetic trees that include the same group of taxa. The total number of bipartitions that appear in only one of the two trees, but not both, is defined as the Robinson and

Foulds topological distance of these two trees [24], a distance widely used in tree comparisons.

[Figure 2-4: A phylogenetic tree with support values for each clade. The tree covers the 12 primate taxa; the clade support values legible in the source are all 1.00.]

The support for a phylogenetic tree given the data is usually assessed with bootstrapping [65] or Bayesian posterior probability [66]. In both methods, a consensus tree is commonly used to summarize the common structures among a group of trees sampled using MCMC (Markov Chain Monte Carlo) or computed from the bootstrapped datasets. In either case, the occurrences of each bipartition are counted, and the frequency of each bipartition is shown in the phylogram, as in Figure 2-4. The consensus tree is also used to combine trees estimated using different genes or datasets for the same group of taxa.
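The signature, bipartition, and Robinson-Foulds definitions above, together with the bipartition counting used for clade support, can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names (parse_newick, signature, bipartitions, rf_distance, split_frequencies) are hypothetical, the parser assumes topology-only NEWICK strings with integer labels, and children are ordered numerically by the smallest label in each subtree, which is the ordering that reproduces the signature shown above.

```python
import itertools
from collections import Counter

# The primate tree from Figure 2-2 (topology only, integer taxon labels).
PRIMATE = "(1,2,(12,((7,(6,(5,(3,4)))),(11,(10,(8,9))))));"


def parse_newick(text):
    """Parse a topology-only NEWICK string into nested tuples of int labels."""
    text = text.replace(" ", "").rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return int(text[start:pos])

    return node()


def to_graph(tree):
    """Turn nested tuples into an undirected graph: (adjacency, leaf labels)."""
    adj, label, ids = {}, {}, itertools.count()

    def add(sub, parent):
        nid = next(ids)
        adj[nid] = [] if parent is None else [parent]
        if parent is not None:
            adj[parent].append(nid)
        if isinstance(sub, int):
            label[nid] = sub
        else:
            for child in sub:
                add(child, nid)

    add(tree, None)
    return adj, label


def signature(newick):
    """Canonical NEWICK string: rooted next to the smallest taxon, children sorted."""
    adj, label = to_graph(parse_newick(newick))
    smallest_leaf = min(label, key=label.get)
    root = adj[smallest_leaf][0]          # a leaf has exactly one neighbour

    def render(n, parent):
        if n in label:
            return label[n], str(label[n])
        parts = sorted(render(m, n) for m in adj[n] if m != parent)
        return parts[0][0], "(" + ",".join(text for _, text in parts) + ")"

    return render(root, None)[1]


def bipartitions(newick):
    """Set of nontrivial bipartitions, each stored as a frozenset of two sides."""
    adj, label = to_graph(parse_newick(newick))
    taxa = frozenset(label.values())
    splits = set()

    def below(n, parent):
        if n in label:
            return frozenset([label[n]])
        side = frozenset().union(*(below(m, n) for m in adj[n] if m != parent))
        if len(side) >= 2 and len(taxa - side) >= 2:
            splits.add(frozenset([side, taxa - side]))
        return side

    below(0, None)                        # node 0 is the parsed root
    return splits


def rf_distance(a, b):
    """Robinson-Foulds distance: bipartitions found in exactly one of the trees."""
    return len(bipartitions(a) ^ bipartitions(b))


def split_frequencies(newicks):
    """Fraction of the given trees that contain each nontrivial bipartition."""
    counts = Counter()
    for tree in newicks:
        counts.update(bipartitions(tree))
    return {split: n / len(newicks) for split, n in counts.items()}
```

Under these assumptions, signature(PRIMATE) reproduces the signature given in the text, and bipartitions(PRIMATE) contains nine entries, matching Figure 2-3 (an unrooted binary tree with n taxa has n - 3 internal branches). Applying split_frequencies to a list of sampled trees gives per-clade frequencies of the kind shown as support values in Figure 2-4.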

When the individual trees have different but overlapping sets of taxa, a supertree is used instead of the consensus tree as the summarized output [67]. To account for the possibility of horizontal gene transfer, a phylogenetic network is used as an alternative representation of the evolutionary relationships among a group of taxa [68].

2.2 Methods for phylogenetic inference

Various methods have been developed to build phylogenetic trees from different kinds of data. These methods can be classified by: 1) the data type used in tree estimation; 2) the criterion used to define an optimal tree; and 3) the tree search strategy.

Sequence-based methods and genome-based methods

Currently, molecular sequences and whole genome features are the two major data types used in phylogenetic inference [69]:

1) Sequence-based methods use one or multiple gene alignments to estimate the phylogenetic tree. Phylogenetic inference with multiple gene alignments has become common in recent years. The supermatrix [70] and supertree [71] methods are the two major approaches for handling combined data such as multiple gene alignments. Both approaches rely on standard sequence-based phylogenetic inference methods.

2) Genome-based methods use phylogenetic signals contained in gene content [72-74] or gene order [75, 76] to estimate the phylogenetic tree. Phylogenetic inference using whole-genome features has recently attracted researchers' attention, and much effort has been devoted to formulating distance metrics and

probability models. An overview of genome-based methods is provided by Delsuc et al. [69].

Distance-, MP-, ML- and BP-based methods

There are four major criteria to define an optimal tree: distance, maximum parsimony (MP), maximum likelihood (ML), and Bayesian posterior probability (BP). Comparisons among these methods are reviewed in [33, 62, 77]. Briefly, distance-based methods are much faster than the other three methods but have some potential weaknesses, including: 1) information loss in converting sequences into a distance matrix; and 2) inconsistency for datasets with large distances.

MP and ML are both optimization-based methods, which break the tree estimation process into two major components: scoring a given tree and searching for the tree (or trees) with the best score. MP uses as the score the minimum number of mutations needed to explain the data on a given tree. ML uses as the score the likelihood of the given tree under an explicit evolutionary model. MP runs much faster than ML because: 1) MP needs far less computation to count the number of mutations than ML needs to evaluate the likelihood; and 2) MP does not need to optimize the branch lengths. Drawbacks of MP include: 1) multiple (or too many) trees may have the same MP score, while only one of them can be the true tree; and 2) MP is subject to the long-branch attraction problem [78], since it does not account for the fact that the number of mutations varies across branches.

Both ML and BP are likelihood-based methods, which explicitly use a probabilistic model of molecular evolution. Their major difference is that ML uses point estimates for the unknown parameters, whereas BP uses the marginal distribution to integrate out the unknown parameters. BP has been suggested as a faster alternative to ML with bootstrapping [41],

however, this argument needs further justification [79]. Whether BP should be classified as an optimization-based method is questionable, since in theory BP requires more computation than ML in order to find the probabilities of all modes of the posterior distribution. As ML is conjectured to be an NP-hard problem, BP is at least as difficult as ML. Therefore, we put BP in a new category of phylogenetic methods: sampling-based methods.

Tree search strategies

Every phylogenetic inference method relies on one or more tree search strategies once the optimality criterion is formulated. We divide the tree search strategies into the following categories:

1) Clustering methods [23]: a clustering method builds the tree using a sequence of clustering operations. Examples include UPGMA [80] and neighbor-joining [81]. A clustering method runs much faster than other methods. Its limitation is that it produces only one tree, which may not be the global optimum.

2) Exact search [77]: this method examines every possible tree to locate the best tree. Exact search can be further divided into exhaustive search and branch-and-bound search. Exhaustive search enumerates all possible trees for evaluation. Considering the huge number of possible trees described in Chapter 1, exhaustive search is practical only for small datasets. Branch-and-bound prunes the search space by discarding those trees that score worse than a preset bound (or threshold). The stricter the bound, the more of the space is pruned. Like exhaustive search, branch-and-bound is limited to small problem sizes.

3) Deterministic heuristic search: the tree space is not completely randomly distributed; there is a certain order in it. A heuristic search attempts to exploit this order to find the best or a near-best tree. Commonly used deterministic search strategies include stepwise addition, local rearrangement, and global rearrangement [64, 77]. One potential problem of deterministic heuristic search is that it does not guarantee a globally optimal solution.

4) Stochastic search: by introducing some random moves, a stochastic search may escape local optima and move toward the global optimum. Three stochastic algorithms are used in phylogenetic inference: simulated annealing [82, 83], genetic algorithms [84-86], and MCMC [40, 41, 87, 88].

5) Divide and conquer: a large problem can be solved by dividing it into a set of smaller problems, solving each of them separately, and then merging their solutions to obtain the solution to the original problem. The disk-covering method (DCM) [89], quartet puzzling [90], and supertrees [67] are divide-and-conquer approaches used in phylogenetic inference.

2.3 High performance computing phylogenetic inference methods

As phylogenetic inference moves to larger problem sizes and parallel processing becomes common, high performance computing support for phylogenetic inference is needed. This support includes algorithm tuning, parallel algorithm design, and parallel platform deployment.

Algorithm tuning seeks alternative approaches for the computation-intensive parts of phylogenetic inference. One common technique for likelihood-based phylogenetic

methods is to avoid optimizing the branch lengths frequently, because this optimization process takes $O(n^2)$ likelihood calculations. This technique has been used in [85, 86, 91, 92].

Besides algorithmic improvement and exploration, parallel processing has the potential to reduce the computation time from several months to several hours in an efficient and immediate manner. Several parallel implementations of widely used phylogenetic inference methods have been developed recently, among them parallel fastDNAml [93, 94], parallel TREE-PUZZLE [95], a parallel genetic algorithm for ML [96], GRAPPA [97], and parallel MCMC algorithms [98, 99]. We note that most phylogenetic inference methods have multiple levels of concurrency and that these methods can be run in an embarrassingly parallel fashion.

2.4 Bayesian phylogenetic inference

Introduction

As described in the previous chapter, the task of phylogenetic inference includes two major steps: 1) constructing a phylogenetic tree that represents the evolutionary relationships among a group of taxa, and 2) assessing the confidence in the estimated tree given the observed data. Various methods are available for building the phylogenetic tree, and some of them are based on a probabilistic model of molecular evolution. Due to the stochastic nature of molecular evolution and the complicated mechanisms that affect the evolutionary process, almost every phylogenetic method has to deal with uncertainties caused by unknown parameters. Also, the fact that multiple phylogenetic trees are possible for the

same group of taxa has to be considered in applications that explicitly use a phylogeny as the basis of study.

Using a comprehensive probabilistic model, Bayesian analysis provides a methodology for describing the relationships among all variables under consideration. Bayesian phylogenetic inference learns the phylogenetic model from the observed data based on a quantity called the posterior probability. The posterior probability of a phylogenetic model $\Psi = (T, \tau, \theta)$ can be interpreted as the probability that this phylogenetic model is correct.

Bayesian phylogenetic inference shares some similarities with maximum likelihood estimation [10, 33]: both explicitly use a model of molecular evolution and a formalization of the likelihood function. However, the underlying methodologies are quite different. First, the Bayesian approach deals with parameter uncertainty by integrating over all possible values that a parameter might assume, while maximum likelihood estimation uses a point estimate. Second, Bayesian analysis requires specifying prior distributions for the parameters of a phylogenetic model, which offers the advantage of incorporating existing knowledge but also invites criticism, since the prior distributions are often unknown. Finally, Bayesian analysis outputs the posterior probabilities of trees and clades as a measure of confidence in the estimated results. Therefore, Bayesian phylogenetic inference is considered a faster alternative to maximum likelihood estimation with bootstrap resampling [41].

Though the idea of Bayesian phylogenetic inference emerged at almost the same time as the maximum likelihood method [100], the computation of the Bayesian posterior probability of a phylogeny was not feasible until Markov Chain Monte Carlo methods were

implemented for phylogenetic inference by three independent research groups [87, ]. Bayesian phylogenetic inference became widely used after the method of computing posterior probability was described [10, 33, 39-41, 87, 104, 105] and several phylogenetic inference programs (BAMBE [106] and MrBayes [58]) became publicly available.

Despite some obvious benefits and ever-increasing applications, Bayesian phylogenetic inference has been hotly debated on several issues, including the amount of bias caused by an inappropriate prior probability, the interpretation of the Bayesian posterior probability [46], and the accuracy of Bayesian clade support [34, 36, 42, 45]. This calls for further examination of the power and performance of Bayesian phylogenetic analysis, and therefore for improved and faster implementations of current Bayesian phylogenetic methods.

The Bayesian framework

A phylogenetic model $\Psi = (T, \tau, \theta)$ consists of three components: a tree structure ($T$) that represents the evolutionary relationships of the set of organisms under study, a vector of branch lengths ($\tau$) that represents the divergence times along different lineages, and a model of molecular evolution ($\theta$) that approximates how the characters at each site evolve over time along the tree.

In the Bayesian framework, both the observed data $X$ and the parameters of the phylogenetic model $\Psi$ are treated as random variables. The joint distribution of the data and the model can then be set up as follows:

$$P(X, \Psi) = P(X \mid \Psi)\, P(\Psi) \tag{2-1}$$

Once the data are known, Bayes' theorem can be used to compute the posterior probability of the model using

$$P(\Psi \mid X) = \frac{P(X \mid \Psi)\, P(\Psi)}{P(X)} \tag{2-2}$$

Here, $P(X \mid \Psi)$ is called the likelihood (the probability of the data given the model), $P(\Psi)$ is called the prior probability of the model (the unconditional probability of the model without any knowledge of the observed data), and $P(X)$ is the unconditional probability of the data. For the continuous case, $P(X)$ is computed by

$$P(X) = \int P(X \mid \Psi)\, P(\Psi)\, d\Psi \tag{2-3}$$

For the discrete case, $P(X)$ is computed by

$$P(X) = \sum_{\Psi_i} P(X \mid \Psi_i)\, P(\Psi_i) \tag{2-4}$$

Since $P(X)$ is just a normalizing constant, the computation of (2-3) or (2-4) is not needed in practical inference. The posterior probability distribution of the phylogenetic model can be written as

$$P(\Psi \mid X) = P(T_i, \tau, \theta \mid X) = \frac{P(X \mid T_i, \tau, \theta)\, P(T_i, \tau, \theta)}{\sum_{T_j} \iint P(X \mid T_j, \tau, \theta)\, P(T_j, \tau, \theta)\, d\tau\, d\theta} \tag{2-5}$$

This distribution is the basis of current Bayesian phylogenetic inference; useful information can be obtained from it. For example, the posterior probability of a phylogenetic tree $T_i$ can be computed as

$$P(T_i \mid X) = \iint P(T_i, \tau, \theta \mid X)\, d\tau\, d\theta \tag{2-6}$$

Similarly, the posterior probability of the $i$-th component $\theta_i$ of the parameter $\theta$ in the evolutionary model can be summarized by

$$P(\theta_i \mid X) = \sum_{T_j} \iint P(T_j, \tau, \theta_i, \theta \setminus \theta_i \mid X)\, d\tau\, d(\theta \setminus \theta_i) \tag{2-7}$$
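To make equations (2-2) and (2-4) concrete, the following Python sketch computes posterior probabilities for a small, finite set of candidate models, and shows why the normalizing constant $P(X)$ cancels whenever only the ratio of two posteriors is needed. It is an illustration under stated assumptions rather than part of the framework described here: the function names are hypothetical, the log-likelihood values are invented for the example, and branch lengths and model parameters are treated as already fixed, so the integrals in (2-5) do not appear.

```python
import math


def posterior(log_likelihoods, priors):
    """Posterior P(Psi_i | X) over a finite list of candidate models.

    log_likelihoods[i] is log P(X | Psi_i) and priors[i] is P(Psi_i).
    """
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    # log P(X) from equation (2-4), evaluated with the log-sum-exp trick
    # so that very small likelihoods do not underflow to zero.
    peak = max(log_joint)
    log_evidence = peak + math.log(sum(math.exp(v - peak) for v in log_joint))
    return [math.exp(v - log_evidence) for v in log_joint]


def posterior_ratio(i, j, log_likelihoods, priors):
    """P(Psi_i | X) / P(Psi_j | X); P(X) cancels, so it is never computed."""
    return math.exp(log_likelihoods[i] - log_likelihoods[j]) * priors[i] / priors[j]


# Hypothetical example: the three unrooted topologies for four taxa,
# a uniform prior, and made-up log-likelihood values.
example_log_likelihoods = [-1000.0, -1002.0, -1005.0]
example_priors = [1 / 3, 1 / 3, 1 / 3]
example_posterior = posterior(example_log_likelihoods, example_priors)
```

With a uniform prior, the ordering of the posterior probabilities follows the ordering of the likelihoods, and the ratio between the first two models depends only on the difference of their log-likelihoods. Working with ratios in this way is what allows sampling methods such as MCMC to avoid evaluating (2-3) or (2-4).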


More information

Organizing Life s Diversity

Organizing Life s Diversity 17 Organizing Life s Diversity section 2 Modern Classification Classification systems have changed over time as information has increased. What You ll Learn species concepts methods to reveal phylogeny

More information

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

More information

Bayesian Models for Phylogenetic Trees

Bayesian Models for Phylogenetic Trees Bayesian Models for Phylogenetic Trees Clarence Leung* 1 1 McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, Canada ABSTRACT Introduction: Inferring genetic ancestry of different species

More information

Multiple Sequence Alignment. Sequences

Multiple Sequence Alignment. Sequences Multiple Sequence Alignment Sequences > YOR020c mstllksaksivplmdrvlvqrikaqaktasglylpe knveklnqaevvavgpgftdangnkvvpqvkvgdqvl ipqfggstiklgnddevilfrdaeilakiakd > crassa mattvrsvksliplldrvlvqrvkaeaktasgiflpe

More information

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree Nicolas Salamin Department of Ecology and Evolution University of Lausanne

More information

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D 7.91 Lecture #5 Database Searching & Molecular Phylogenetics Michael Yaffe B C D B C D (((,B)C)D) Outline Distance Matrix Methods Neighbor-Joining Method and Related Neighbor Methods Maximum Likelihood

More information

TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

More information

A phylogenomic toolbox for assembling the tree of life

A phylogenomic toolbox for assembling the tree of life A phylogenomic toolbox for assembling the tree of life or, The Phylota Project (http://www.phylota.org) UC Davis Mike Sanderson Amy Driskell U Pennsylvania Junhyong Kim Iowa State Oliver Eulenstein David

More information

Classification and Phylogeny

Classification and Phylogeny Classification and Phylogeny The diversity of life is great. To communicate about it, there must be a scheme for organization. There are many species that would be difficult to organize without a scheme

More information

Intraspecific gene genealogies: trees grafting into networks

Intraspecific gene genealogies: trees grafting into networks Intraspecific gene genealogies: trees grafting into networks by David Posada & Keith A. Crandall Kessy Abarenkov Tartu, 2004 Article describes: Population genetics principles Intraspecific genetic variation

More information

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Statistical estimation of models of sequence evolution Phylogenetic inference using maximum likelihood:

More information

Consensus Methods. * You are only responsible for the first two

Consensus Methods. * You are only responsible for the first two Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline Phylogenetics Todd Vision iology 522 March 26, 2007 pplications of phylogenetics Studying organismal or biogeographic history Systematics ating events in the fossil record onservation biology Studying

More information

Real-Time Software Transactional Memory: Contention Managers, Time Bounds, and Implementations

Real-Time Software Transactional Memory: Contention Managers, Time Bounds, and Implementations Real-Time Software Transactional Memory: Contention Managers, Time Bounds, and Implementations Mohammed El-Shambakey Dissertation Submitted to the Faculty of the Virginia Polytechnic Institute and State

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2018 University of California, Berkeley Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008

Integrative Biology 200A PRINCIPLES OF PHYLOGENETICS Spring 2008 Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2008 University of California, Berkeley B.D. Mishler March 18, 2008. Phylogenetic Trees I: Reconstruction; Models, Algorithms & Assumptions

More information

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE Manmeet Kaur 1, Navneet Kaur Bawa 2 1 M-tech research scholar (CSE Dept) ACET, Manawala,Asr 2 Associate Professor (CSE Dept) ACET, Manawala,Asr

More information

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology

More information

Inferring Speciation Times under an Episodic Molecular Clock

Inferring Speciation Times under an Episodic Molecular Clock Syst. Biol. 56(3):453 466, 2007 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150701420643 Inferring Speciation Times under an Episodic Molecular

More information

Classification and Phylogeny

Classification and Phylogeny Classification and Phylogeny The diversity it of life is great. To communicate about it, there must be a scheme for organization. There are many species that would be difficult to organize without a scheme

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki Phylogene)cs IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, 2016 Joyce Nzioki Phylogenetics The study of evolutionary relatedness of organisms. Derived from two Greek words:» Phle/Phylon: Tribe/Race» Genetikos:

More information

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogeny? - Systematics? The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogenetic systematics? Connection between phylogeny and classification. - Phylogenetic systematics informs the

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein, 1985 1 I. Short

More information

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 A non-phylogeny

More information

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

More information

Phylogeny Tree Algorithms

Phylogeny Tree Algorithms Phylogeny Tree lgorithms Jianlin heng, PhD School of Electrical Engineering and omputer Science University of entral Florida 2006 Free for academic use. opyright @ Jianlin heng & original sources for some

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

MOLECULAR SYSTEMATICS: A SYNTHESIS OF THE COMMON METHODS AND THE STATE OF KNOWLEDGE

MOLECULAR SYSTEMATICS: A SYNTHESIS OF THE COMMON METHODS AND THE STATE OF KNOWLEDGE CELLULAR & MOLECULAR BIOLOGY LETTERS http://www.cmbl.org.pl Received: 16 August 2009 Volume 15 (2010) pp 311-341 Final form accepted: 01 March 2010 DOI: 10.2478/s11658-010-0010-8 Published online: 19 March

More information

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method Phylogeny 1 Plan: Phylogeny is an important subject. We have 2.5 hours. So I will teach all the concepts via one example of a chain letter evolution. The concepts we will discuss include: Evolutionary

More information

Phylogenetic Analysis

Phylogenetic Analysis Phylogenetic Analysis Aristotle Through classification, one might discover the essence and purpose of species. Nelson & Platnick (1981) Systematics and Biogeography Carl Linnaeus Swedish botanist (1700s)

More information

Phylogenetic Analysis

Phylogenetic Analysis Phylogenetic Analysis Aristotle Through classification, one might discover the essence and purpose of species. Nelson & Platnick (1981) Systematics and Biogeography Carl Linnaeus Swedish botanist (1700s)

More information

Phylogenetic Analysis

Phylogenetic Analysis Phylogenetic Analysis Aristotle Through classification, one might discover the essence and purpose of species. Nelson & Platnick (1981) Systematics and Biogeography Carl Linnaeus Swedish botanist (1700s)

More information

Unsupervised Learning in Spectral Genome Analysis

Unsupervised Learning in Spectral Genome Analysis Unsupervised Learning in Spectral Genome Analysis Lutz Hamel 1, Neha Nahar 1, Maria S. Poptsova 2, Olga Zhaxybayeva 3, J. Peter Gogarten 2 1 Department of Computer Sciences and Statistics, University of

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

How to read and make phylogenetic trees Zuzana Starostová

How to read and make phylogenetic trees Zuzana Starostová How to read and make phylogenetic trees Zuzana Starostová How to make phylogenetic trees? Workflow: obtain DNA sequence quality check sequence alignment calculating genetic distances phylogeny estimation

More information

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence

PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence PhyQuart-A new algorithm to avoid systematic bias & phylogenetic incongruence Are directed quartets the key for more reliable supertrees? Patrick Kück Department of Life Science, Vertebrates Division,

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

X X (2) X Pr(X = x θ) (3)

X X (2) X Pr(X = x θ) (3) Notes for 848 lecture 6: A ML basis for compatibility and parsimony Notation θ Θ (1) Θ is the space of all possible trees (and model parameters) θ is a point in the parameter space = a particular tree

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Evolutionary trees. Describe the relationship between objects, e.g. species or genes Evolutionary trees Bonobo Chimpanzee Human Neanderthal Gorilla Orangutan Describe the relationship between objects, e.g. species or genes Early evolutionary studies The evolutionary relationships between

More information

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Phylogeny: the evolutionary history of a species

More information

Theory of Evolution Charles Darwin

Theory of Evolution Charles Darwin Theory of Evolution Charles arwin 858-59: Origin of Species 5 year voyage of H.M.S. eagle (83-36) Populations have variations. Natural Selection & Survival of the fittest: nature selects best adapted varieties

More information

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference By Philip J. Bergmann 0. Laboratory Objectives 1. Learn what Bayes Theorem and Bayesian Inference are 2. Reinforce the properties of Bayesian

More information

Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2012 University of California, Berkeley

Integrative Biology 200A PRINCIPLES OF PHYLOGENETICS Spring 2012 University of California, Berkeley Integrative Biology 200A "PRINCIPLES OF PHYLOGENETICS" Spring 2012 University of California, Berkeley B.D. Mishler Feb. 7, 2012. Morphological data IV -- ontogeny & structure of plants The last frontier

More information

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT Inferring phylogeny Constructing phylogenetic trees Tõnu Margus Contents What is phylogeny? How/why it is possible to infer it? Representing evolutionary relationships on trees What type questions questions

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Phylogenetic analyses. Kirsi Kostamo

Phylogenetic analyses. Kirsi Kostamo Phylogenetic analyses Kirsi Kostamo The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species,

More information

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016 Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016 By Philip J. Bergmann 0. Laboratory Objectives 1. Learn what Bayes Theorem and Bayesian Inference are 2. Reinforce the properties

More information