HIGH PERFORMANCE, BAYESIAN BASED PHYLOGENETIC INFERENCE FRAMEWORK

By Xizhou Feng
Bachelor of Engineering, China Textile University, 1993
Master of Science, Tsinghua University, 1996
Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Department of Computer Science and Engineering, College of Engineering and Information Technology, University of South Carolina, 2006

Dedication
To Rong, Kevin and Katherine

Acknowledgements
During the course of my graduate study, I have been fortunate to receive advice, support, and encouragement from many people. Foremost is the debt of gratitude that I owe to my thesis advisors, Professor Duncan A. Buell and Professor Kirk W. Cameron. Not only was Duncan responsible for introducing me to this interesting and fruitful field, he also provided me with inspiring guidance, great patience, and never-ending encouragement during the past several years. I especially thank Professor Kirk W. Cameron for his invaluable mentoring, insightful advising, and constant investment. Kirk guided me into the exciting field of systems study, and provided opportunities and support to conduct quality research work in several cutting-edge areas.
I thank Professor Manton Matthews for his years of academic advising and for being on my advisory committee. His guidance and support made it possible for me to explore various fields in computer science and engineering. I thank Professor John R. Rose and Professor Peter Waddell for their valuable suggestions in this research work. The discussions and collaborative work with John and Peter generated some important ideas which have been included in this thesis. I appreciate Professor Austin L. Hughes for being on my advisory committee and providing me with critical opinions which led me to rethink and significantly improve this dissertation. I also thank the faculty and staff in the Department of Computer Science and Engineering for providing me with one of the most wonderful training programs in the world.

Finally, I thank my family for their love and support during the hard time of completing my dissertation. This dissertation is dedicated to my wife Rong, my son Kevin, and my daughter Katherine.

Abstract
Comparative analyses of biological data rely on a phylogenetic tree that describes the evolutionary relationship of the organisms studied. By combining the Markov Chain Monte Carlo (MCMC) method with likelihood-based assessment of phylogenies, Bayesian phylogenetic inference incorporates complex statistical models into the process of phylogenetic tree estimation. This combination can be used to address a number of complex questions in evolutionary biology. However, Bayesian analyses are computationally expensive because they almost invariably require high dimensional integrations over unknown parameters. Thoroughly investigating and exploiting the power of the Bayesian approach requires a high performance computing framework. Otherwise one cannot tackle the computational challenges of Bayesian phylogenetic inference for large phylogeny problems.
This dissertation extends the existing Bayesian phylogenetic inference framework in three respects: 1) exploring various strategies to improve the performance of the MCMC sampling method; 2) developing high performance, parallel algorithms for Bayesian phylogenetic inference; and 3) combining data uncertainty and model uncertainty in Bayesian phylogenetic inference. We implemented all these extensions in PBPI, a software package for parallel Bayesian phylogenetic inference. We validated the PBPI implementation using a simulation study, a common method used in phylogenetics and other scientific disciplines. The simulation results showed that PBPI can estimate the model trees accurately given a sufficient number of sequences and correct models.

We evaluated the computational speed of PBPI using simulated datasets on a terascale computing facility and observed significant performance improvement. On a single processor, PBPI ran up to 19 times faster than the current leading Bayesian phylogenetic inference program with the same quality of output. On 64 processors, PBPI achieved a parallel speedup of 46 times on average. Combining both sequential improvement and parallel computation, PBPI can speed up current Bayesian phylogenetic inferences by up to 870 times.

Table of Contents
Dedication
Acknowledgements
Abstract
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Phylogeny and its applications
  1.2 Phylogenetic inference
  1.3 The challenges
    1.3.1 Searching a complex tree space
    1.3.2 Developing realistic evolutionary models
    1.3.3 Dealing with incomplete and unequal data distribution
    1.3.4 Resolving conflicts among different methods and data sources
  1.4 Bayesian phylogenetic inference and its issues
  1.5 Motivation
  1.6 Research objectives and contributions
  1.7 Organization of this dissertation
Chapter 2 Background
  2.1 Representations of phylogenetic trees
  2.2 Methods for phylogenetic inference

    2.2.1 Sequence-based methods and genome-based methods
    2.2.2 Distance-, MP-, ML- and BP-based methods
    2.2.3 Tree search strategies
  2.3 High performance computing phylogenetic inference methods
  2.4 Bayesian phylogenetic inference
    2.4.1 Introduction
    2.4.2 The Bayesian framework
    2.4.3 Components of Bayesian phylogenetic inference
    2.4.4 Likelihood, prior and posterior probability
    2.4.5 Empirical and hierarchical Bayesian analysis
  2.5 Models of molecular evolution
    2.5.1 The substitution rate matrix
    2.5.2 Properties of the substitution rate matrix
    2.5.3 The general time reversible (GTR) model
    2.5.4 Rate heterogeneity among different sites
    2.5.5 Other more realistic evolutionary models
  2.6 Likelihood function and its evaluation
    2.6.1 The likelihood function
    2.6.2 Felsenstein's algorithm for likelihood evaluation
  2.7 Optimizations of likelihood computation
    2.7.1 Sequence packing
    2.7.2 Likelihood local update
    2.7.3 Tree balance

  2.8 Markov Chain Monte Carlo methods
    2.8.1 The Metropolis-Hastings algorithm
    2.8.2 Exploring the posterior distribution
    2.8.3 The issues
  2.9 Summary of the posterior distribution
    2.9.1 Summary of the phylogenetic trees
    2.9.2 Summary of the model parameters
  2.10 Chapter summary
Chapter 3 Improved Monte Carlo Strategies
  3.1 Introduction
  3.2 Observations
  3.3 Strategy #1: reducing stickiness using variable proposal step length
  3.4 Strategy #2: reducing sampling intervals using multipoint MCMC
  3.5 Strategy #3: improving mixing rate with parallel tempering
  3.6 Proposal algorithms for phylogenetic models
    3.6.1 Basic tree mutation operators
    3.6.2 Basic tree branch length proposal methods
    3.6.3 Propose new parameters
    3.6.4 Co-propose topology and branch length
  3.7 Extended proposal algorithms for phylogenetic models
    3.7.1 Extended tree mutation operator
    3.7.2 Multiple-tree-merge operator
    3.7.3 Backbone-slide-and-scale operator

  3.8 Chapter summary
Chapter 4 Parallel Bayesian Phylogenetic Inference
  4.1 The need for parallel Bayesian phylogenetic inference
  4.2 TAPS: a tree-based abstraction of parallel system
  4.3 Performance models for parallel algorithms
  4.4 Concurrencies in Bayesian phylogenetic inference
  4.5 Issues of parallel Bayesian phylogenetic inference
  4.6 Parallel algorithms for Bayesian phylogenetic inference
    4.6.1 Task decomposition and assignment
    4.6.2 Synchronization and communication
    4.6.3 Load balancing
    4.6.4 Symmetric MCMC algorithm
    4.6.5 Asymmetric MCMC algorithm
  4.7 Justifying the correctness of the parallel algorithms
  4.8 Chapter summary
Chapter 5 Validation and Verification
  5.1 Introduction
  5.2 Experimental methodology
    5.2.1 The model trees
    5.2.2 The simulated datasets
    5.2.3 The accuracy metrics
    5.2.4 Tested programs and their run configurations
    5.2.5 The computing platforms

  5.3 Results on model tree FUSO024
    5.3.1 The overall accuracy of results
    5.3.2 Further analysis
    5.3.3 PBPI stability
  5.4 Results on model tree BURK050
  5.5 Chapter summary
Chapter 6 Performance Evaluation
  6.1 Introduction
  6.2 Experimental methodology
  6.3 The sequential performance of PBPI
    6.3.1 The execution time of PBPI and MrBayes
    6.3.2 The quality of the tree samples drawn by PBPI
    6.3.3 The execution time of PBPI and MrBayes
  6.4 Parallel speedup for fixed problem size
  6.5 Scalability analysis
  6.6 Parallel speedup with scaled workload
    6.6.1 Scalability with different problem sizes
    6.6.2 Scalability with the number of chains
  6.7 Chapter summary
Chapter 7 Summary and Future Work
  7.1 The big picture
  7.2 Future work

Bibliography

List of Tables
Table 1-1: The number of unrooted bifurcating trees as a function of taxa
Table 5-1: The four model trees used in experiments
Table 5-2: PBPI run configurations for validation and verification
Table 5-3: The number of datasets where the model tree FUSO024 is found in the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree. A total of 5 datasets are used in each case.
Table 5-4: The average distances between the model tree FUSO024 and the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree. A total of 5 datasets are used in each case.
Table 5-5: The topological distances between the model tree FUSO024 and the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree for datasets with 10,000 characters. Datasets are simulated under the JC69 model.
Table 5-6: The average distances between the model tree BURK050 and the maximum probability tree, the 95% credible set of trees and the 50% majority consensus tree. A total of 5 datasets were used in each case.
Table 6-1: Benchmark dataset used in the evaluation
Table 6-2: Sequential execution time of PBPI and MrBayes

List of Figures
Figure 1-1: The procedure of a phylogenetic inference
Figure 2-1: Phylogenetic trees of 12 primates mitochondrial DNA sequences
Figure 2-2: The NEWICK representation of the primate phylogenetic tree
Figure 2-3: The nontrivial bipartitions of the primate phylogenetic tree
Figure 2-4: A phylogenetic tree with support values for each clade
Figure 2-5: The transition diagram and transition matrix of nucleotides
Figure 2-6: The Felsenstein algorithm for likelihood evaluation
Figure 2-7: Illustration of likelihood local update
Figure 2-8: The tree-balance algorithm
Figure 2-9: Metropolis-Hastings algorithm
Figure 3-1: A target distribution with three modes
Figure 3-2: Distribution approximated using Metropolis MCMC methods
Figure 3-3: Samples drawn using Metropolis MCMC method
Figure 3-4: Illustration of state moves
Figure 3-5: Approximated distribution using variable step length MCMC
Figure 3-6: The multipoint MCMC
Figure 3-7: A family of tempered distributions with different temperatures
Figure 3-8: The Metropolis-coupled MCMC algorithm
Figure 3-9: The extended-tree-mutation method
Figure 3-10: The multiple-tree-merge method
Figure 3-11: The backbone slide and scale method

Figure 4-1: An illustration of TAPS
Figure 4-2: Speedup under fixed workload
Figure 4-3: The procedure of a generic Bayesian phylogenetic inference
Figure 4-4: Map 8 chains to a 4 x 4 grid, where the length of each sequence is 2000
Figure 4-5: The symmetric parallel MCMC algorithm
Figure 5-1: The procedure of a simulation method for accuracy assessment
Figure 5-2: Run configuration for MrBayes
Figure 5-3: The phylogram of the model tree FUSO024
Figure 5-4: The MPP tree estimated from dataset fuso024_l10000_jc69_d001
Figure 5-5: Estimation variances in 10 individual runs
Figure 5-6: The phylogram of the model tree BURK050
Figure 5-7: The MPP tree estimated from dataset burk050_l10000_jc69_d001.nex
Figure 5-8: The posterior distribution of the top 50 most probable trees
Figure 5-9: The topological distances distribution of the top 50 most probable trees
Figure 6-1: Different speedup values computed by wall clock time and user time
Figure 6-2: Log likelihood plot of the tree samples drawn by PBPI and MrBayes
Figure 6-3: The consensus tree estimated by PBPI
Figure 6-4: The consensus tree estimated by MrBayes
Figure 6-5: Parallel speedup of PBPI for dataset FUSO024_L10000
Figure 6-6: Parallel speedup of PBPI for dataset ARCH107_L1000
Figure 6-7: Parallel speedup of PBPI for dataset BACK218_L10000
Figure 6-8: The consensus tree estimated by PBPI on 64 processors
Figure 6-9: Parallel speedup with different number of taxa


Chapter 1 Introduction

1.1 Phylogeny and its applications
All life on the earth, both present and past, is believed to be descended from a common ancestor. The pattern of descent, or evolutionary relationship, among species or organisms, or the relatedness of their genes, is usually described by a phylogeny: a tree or network structure in which edge lengths represent the evolutionary divergence along different lineages. In a phylogeny, all existing organisms are placed at the leaves and ancestral organisms are placed at the branching points, or internal nodes. Since all biological phenomena are the result of evolution, most biological studies have to be conducted in the light of evolution and require information on phylogeny to interpret data [1]. Thus, phylogenies play important roles not only in evolutionary biology, genetics and genomics, but also in modern pharmaceutical research, drug discovery, agricultural plant improvement, disease control studies (detection, prevention and prediction) and other biology-related fields. The importance of phylogeny in scientific research and human society has never been made clearer than by the ambitious Tree of Life project initiated by the US National Science Foundation, which

aims to assemble a phylogeny for all 1.7 million described species (ATOL) to benefit society and science [2]. The applications of phylogenies span a wide range of fields, both in industry and science. Several examples follow:
- Identifying, organizing and classifying organisms [3, 4];
- Interpreting and understanding the organization and evolution of genomes [5, 6];
- Identifying and characterizing newly discovered pathogens [7];
- Reconstructing the evolution and radiation of life on the earth [8, 9]; and
- Identifying mutations most likely associated with diseases [10].

1.2 Phylogenetic inference
A phylogeny describes the pattern of evolutionary history among a group of taxa. But history happens only once, and people have to use the clues left by history to reconstruct actual events. One of the fundamental tasks of phylogenetic inference is to approximate the true phylogenetic tree for a group of taxa using a set of evolutionary evidence in which the phylogenetic signals reside. Various kinds of data are used in phylogenetic inference, but recently DNA/RNA molecular sequences have become the most common. There are three reasons:
1) DNA sequences are the hereditary material of all organisms on the earth;
2) Mathematical models of molecular evolution are feasible and can be improved incrementally;
3) Huge numbers of genomic sequences have been generated and are publicly accessible.

The third reason is the most important for the rapid advancement of phylogenetic inference using genomic data. Worldwide genome projects, such as the Human Genome Project (HGP) [11], have generated an ever-increasing amount of biological data. These data are publicly accessible through several government-supported database efforts, such as GenBank [12], EMBL [13], DDBJ [14], and Swiss-Prot [15]. On August 22, 2005, the public collections of DNA and RNA sequences provided by GenBank, EMBL, and DDBJ reached 100 gigabases (i.e. 100,000,000,000 bases), representing genes and genomes of over 165,000 organisms. The massive, complex data sets already generated and those yet to be generated have been fueling the emergence or renaissance of a few interdisciplinary fields, including large scale phylogenetic analysis of genomic data.
The problem of phylogenetic inference using genomic (molecular) sequences is formalized as follows: Given an aligned character matrix $X = (x_{ij})_{N \times M}$ for a set of $N$ taxa, each taxon being represented by an $M$-character sequence and $x_{ij}$ denoting the character of the $i$-th taxon at the $j$-th site of its sequence, phylogenetic inference typically seeks to answer two basic questions:
1) What is the phylogenetic tree (or model) that best explains the evolutionary relations among these taxa?
2) With how much confidence is a particular tree expected to be correct?
Every phylogenetic method can output a phylogenetic tree which the method views as the best tree according to certain optimization criteria. However, given the inherent complexities in biological evolution and some unrealistic assumptions in phylogenetic inference, each given inference method usually not only produces a tree but also provides

a measurement of the confidence in the tree. Bootstrapping and Bayesian posterior probability (discussed later) are two common statistical tools to provide such confidence measurements. As shown in Figure 1-1, a phylogenetic inference usually is preceded by multiple alignments and model selections to generate input. Most phylogenetic methods rely on some phylogenetic tree as their input as well. To reduce the errors produced by the interdependence among multiple alignments, model selections and phylogenetic inference, several iterations of alignments, selections, and inferences may be required.

[Flowchart boxes: Collect Data; Retrieve Homologous Sequences; Align Multiple Sequences; Aligned Data Matrix; Select Model of Evolution; Phylogenetic Inference; Phylogenetic Tree(s); Assess Confidence; Best tree with measures of support; Hypothesis Testing]
Figure 1-1: The procedure of a phylogenetic inference

1.3 The challenges
Though there have been significant advances in phylogenetic inference in the past several decades, large scale phylogenetic inference is still a challenging problem.

1.3.1 Searching a complex tree space
The biggest challenge of phylogenetic inference is the growth in the number of unrooted trees, described by

$Z = \prod_{i=3}^{N} (2i - 5)$    (1-1)

Here $Z$ denotes the number of possible tree topologies and $N$ denotes the number of taxa. Table 1-1 shows the number of unrooted trees corresponding to the number of taxa. For example, the tree space for 100 taxa will contain $1.7 \times 10^{182}$ unrooted trees. Searching this space to find the best tree is computationally impractical. Most optimization-based phylogenetic methods, such as maximum parsimony and maximum likelihood, are NP-hard problems. Many heuristic strategies for tree searching have been studied, but much work remains to be done to improve these methods [16].

Table 1-1: The number of unrooted bifurcating trees as a function of taxa
Number of taxa    Number of unrooted trees
3                 1
10                2.03 × 10^6
50                2.84 × 10^74
100               1.70 × 10^182
1000              1.93 × 10^2860
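
As a minimal sketch, Equation (1-1) can be evaluated exactly with arbitrary-precision integers, which makes the growth shown in Table 1-1 easy to reproduce. The function names `count_unrooted_trees` and `scientific` are hypothetical helpers introduced only for this illustration; they are not part of PBPI.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Evaluate Equation (1-1): the product of (2i - 5) for i = 3..N."""
    if n_taxa < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 taxa")
    total = 1
    for i in range(3, n_taxa + 1):
        total *= 2 * i - 5
    return total


def scientific(value: int, digits: int = 2) -> str:
    """Format a (possibly huge) positive integer as 'mantissa e exponent'."""
    exponent = len(str(value)) - 1
    # Integer true division stays accurate even when value exceeds float range.
    mantissa = round(value / 10 ** exponent, digits)
    if mantissa >= 10:  # rounding may push the mantissa up to 10.0
        mantissa /= 10
        exponent += 1
    return f"{mantissa:.{digits}f}e{exponent}"


if __name__ == "__main__":
    # Same taxon counts as the rows of Table 1-1.
    for n in (3, 10, 50, 100, 1000):
        print(n, scientific(count_unrooted_trees(n)))
```

Because each additional taxon multiplies the count by $2N - 5$, the number of topologies grows faster than exponentially, which is why exhaustive search is ruled out even for moderately sized datasets.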

1.3.2 Developing realistic evolutionary models
Most phylogenetic methods explicitly or implicitly assume a model of genomic sequence evolution and use such a model to estimate the rate of evolution, calculate pair-wise distance, or compute the likelihood of a given phylogeny. The process of genomic sequence evolution is shaped by two factors: mutation and selection. Mutations are errors incurred during DNA replication. Mutations create genetic diversity among populations, and natural selection steers evolutionary direction. Types of mutation include substitution, recombination, duplication, insertion, deletion, and inversion [17]. At the same time, mutations are constrained by the geometric, physical and chemical structures of nucleotides, amino acids, codons, protein secondary structures, and protein tertiary structures [18]. Though phylogenetic signals exist in all kinds of mutation events, most evolutionary models only consider substitution events because it is either difficult or computationally intractable to integrate other events into the models used by phylogenetic analysis [19, 20].
With increasing computational power, researchers have relaxed some early assumptions in evolutionary models and proposed more realistic models, such as allowing rate variation across sites [21], considering the effect of insertion and deletion, and combining secondary structure information [22-24]. Given multiple possible models, it is necessary for the phylogenetic inference approach to select a model that best fits the data. This approach should also be robust enough to give a correct tree even when some assumptions have been violated.
Besides the complexity of modeling the evolution of a single type of sequence, the need for combined analysis of multiple datasets with different data types and sources requires

some unified model which is both mathematically founded and biologically meaningful [25, 26].

1.3.3 Dealing with incomplete and unequal data distribution
The imperfect process of sampling, sequencing and alignment may introduce varied noise into an available data set. Bias or errors in multiple sequence alignment are the cause of most noise because: 1) most multiple sequence alignment methods depend on a correct phylogeny to guide the alignment process; 2) it is necessary to search across trees to find the overall optimum. It is possible to refine the alignment by repeating the procedure of multiple alignment → model selection → phylogenetic inference, but it is always dangerous to assume the alignment is perfect.
To assess the reliability or sensitivity of phylogeny on data with uncertainty, the bootstrap approach [28] was suggested by Felsenstein [29] and further refined by Efron et al. [30]. Bootstrapping requires repeating the phylogenetic inference procedure many times (typically on the order of 1000 times [23]) on derived datasets obtained by resampling the original data with replacement.
The usefulness of phylogenetic inference methods is also limited by the sparse and uneven distribution of sequence data among species and the uncertainty inherent in the available data. Some species have been sequenced for many genes; a few genes have been sequenced for many species; but most of the potential data available for phylogenetic purposes is still missing [31, 32].
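
The resampling step of the bootstrap described in this section can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular program: it assumes the unit of resampling is the alignment column (site), represents the alignment as a list of equal-length strings, and uses the hypothetical names `bootstrap_alignment` and `bootstrap_replicates`. The tree inference that would be run on each replicate is omitted.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by drawing sites with replacement.

    `alignment` is a list of equal-length strings, one per taxon.
    Whole columns are resampled so that each site stays intact across taxa.
    """
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("all sequences in an alignment must have equal length")
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[j] for j in columns) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate `n_replicates` resampled alignments from one seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    # Toy alignment invented for this example (3 taxa, 6 sites).
    toy = ["ACGTAC", "ACGTTC", "ACCTAC"]
    for replicate in bootstrap_replicates(toy, n_replicates=3, seed=42):
        print(replicate)
```

Each replicate has the same dimensions as the original matrix, so the same inference procedure can be applied to it unchanged; the cost of the bootstrap comes from repeating that inference on every replicate.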

1.3.4 Resolving conflicts among different methods and data sources
Researchers usually represent a species with one or more genes in phylogeny reconstruction. However, a gene tree is not the same as a species tree [23]. Phylogenetic trees constructed with different genes or different data types (morphological data vs. molecular data) may be different. These conflicts may come from improper model assumptions or tree building approaches.

1.4 Bayesian phylogenetic inference and its issues
This dissertation aims to extend the framework of Bayesian phylogenetic inference to achieve high performance on large phylogeny problems. By combining several factors into a comprehensive probability model and removing unknown parameters with a marginal probability distribution, Bayesian analysis has the potential to integrate complex (i.e. realistic) models and existing knowledge into phylogenetic inference. However, like other methods when they were first introduced, Bayesian phylogenetic inference generated both excitement and debate.
Supporters of the Bayesian approach claim that Bayesian phylogenetic methods have at least two advantages over traditional phylogenetic methods [33-36]:
1) The primary Bayesian phylogenetic analysis produces both a tree estimate and a measure of uncertainty for the groups on the estimated tree [10, 37, 38]. The uncertainty is measured by a quantity called Bayesian posterior probability, which is approximated by the percentage of occurrences of a group in the tree samples generated by certain MCMC (Markov Chain Monte Carlo) methods [39-41].

2) Bayesian methods can implement very complex models of sequence evolution, because a well-designed MCMC can traverse various highly probable regions of the tree space instead of sticking around only one region which is locally optimal but may not be globally optimal [37].
However, with more thorough investigations, Bayesian phylogenetic inference has also raised various highly debated issues [34, 36, 42]. Several major issues are summarized below:
1) Some Bayesian analyses offer findings that conflict with those from other approaches, such as maximum parsimony (MP) and maximum likelihood (ML) [43, 44]. Some highly debated topics include: "How meaningful are Bayesian support values?" [45]; "Do Bayesian support values reflect the probability of being true?" [46]; and "Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics" [47]. The supporters' claim that the Bayesian posterior probability of a tree is the probability that the estimated tree is correct under the correct model [10] is highly debatable. Some convincing interpretation is necessary to reconcile these debates.
2) One cornerstone of Bayesian phylogenetic inference is posterior probability approximation using Markov Chain Monte Carlo (MCMC). Shortly after MCMC came out, people expected that it would be more efficient than traditional ML with bootstrapping [41]. However, experience shows that the chains have to run much longer than previously expected to converge to the correct approximation [48]. More seriously, research shows that the MCMC method may give

misleading posterior probability under certain conditions [42, 49], for example on a mixture of trees [50].
In spite of the above and other issues, Bayesian analysis has still gained wide acceptance since it was introduced into phylogenetics [8, 51-57].

1.5 Motivation
Given the views described above, both positive and negative, it is necessary to investigate Bayesian phylogenetic inference more thoroughly. Given the stochastic nature of molecular evolution, statistical analyses such as Bayesian methods do have the potential to develop a unified framework to combine multiple data sources and existing knowledge into phylogenetic inference. Some of the debates about Bayesian phylogenetic inference are due to insufficient understanding or implementation of this method, especially the MCMC algorithm. An improper MCMC implementation does have the danger of stopping at local optima. In addition, it cannot cross low probability zones to reach other optimal modes. Therefore, we need to explore improved MCMC strategies to develop more reliable, more efficient implementations.
One barrier to extensive investigation of Bayesian methods is that the method itself is time consuming. Given hundreds of taxa and complex models, a complete MCMC-based Bayesian analysis may run several months to obtain a solution. A similar situation occurred when the maximum likelihood method was first introduced. However, when computing systems became more and more powerful and better algorithms were

developed, the maximum likelihood method came into wide use. This phenomenon may happen again to the Bayesian-based phylogenetic method.

1.6 Research objectives and contributions
This dissertation aims to develop a high performance framework for Bayesian phylogenetic inference. The following summarizes the research objectives and contributions of this dissertation.
1) Developing a high performance computing framework for Bayesian phylogenetic inference. In this dissertation, we investigate technologies and platforms for Bayesian phylogenetic inference and abstract different computing platforms into the TAPS (Tree-based Abstraction of Parallel System) model. Based on this model, we developed parallel MCMC algorithms for Bayesian phylogenetic inference and implemented them in the PBPI (Parallel Bayesian Phylogenetic Inference) program. Both theoretical analyses and numerical simulations show that PBPI achieves roughly linear speedup for datasets with different problem sizes. This means a Bayesian phylogenetic inference lasting several months by former methods can be finished in several hours using parallel algorithms on mid-sized Beowulf-like clusters.
2) Developing better MCMC strategies for Bayesian phylogenetic inference. In this dissertation, we proposed and implemented several MCMC strategies for exploring the posterior probability distribution of the phylogenetic model. By using variable proposal step length, we made the MCMC chain cross high energy barriers (i.e., low probability regions) and overcome stickiness around local

optimal regions. By introducing directional search within each proposal step, we improved the quality of each proposal and shortened the sample intervals, thereby reducing the total number of generations needed to produce an acceptable distribution. To improve the mixing rate of the chain, we also implemented a class of population-based MCMC methods which used multiple chains to explore the search space more efficiently. We demonstrated that classical MCMC methods risk generating misleading posterior probability on some models; by using an improved MCMC framework, this risk was reduced. Various novel algorithms and MCMC strategies were implemented in this research.
3) Accommodating data uncertainty in phylogenetic inference with data resampling in the MCMC. We extended Bayesian phylogenetic inference to include data noise in the inference procedure and showed that ML with bootstrapping can be viewed as a special case of generic Bayesian phylogenetic inference. We argued that Bayesian posterior probability and bootstrap support value measure two kinds of phylogenetic uncertainty: the former refers to multiple possible models for the same dataset; the latter refers to the robustness of a tree on a specific dataset. Both uncertainties can be assessed jointly by incorporating data resampling during a single MCMC run.

1.7 Organization of this dissertation
This dissertation includes three parts.
The first part consists of Chapters 1 and 2, which present background, methods, and results in the field of Bayesian phylogenetic inference. In this chapter we introduce the

phylogenetic inference problem, its applications, and its challenges. We also provide a short review of positive and negative views of Bayesian phylogenetic methods. In Chapter 2, we review various phylogenetic approaches and recent advances in high performance computing for solving large phylogeny problems.
The second part includes Chapters 3 and 4, in which we describe our extended, high performance, Bayesian phylogenetic inference framework. In Chapter 3, we demonstrate the weaknesses of traditional MCMC methods and propose how to overcome these weaknesses using improved MCMC algorithms. In Chapter 4, we describe our parallel Bayesian phylogenetic inference framework. We first discuss the general models and methods for parallelizing Bayesian phylogenetic inference that can be used as the foundation for introducing high performance computing support to the phylogenetic inference problem. Then we present an implementation of parallel Metropolis-coupled MCMC and numerical results.
The third part consists of Chapters 5 and 6, where we provide a performance evaluation of the Bayesian method and our implementations. Using simulated datasets under several model trees, we verified that our implementation not only output the correct results but also ran faster, in both sequential and parallel execution, than MrBayes [58], the most popular Bayesian phylogenetic inference program currently available. Our results also demonstrated that Bayesian-based phylogenetic methods achieve good accuracy under the current models of evolution.
Finally, in Chapter 7, we summarize the results, conclusions and contributions from this dissertation and outline future research.

Chapter 2 Background

2.1 Representations of phylogenetic trees

A phylogenetic tree is a graph representation of the evolutionary relationships among a set of species or organisms. Since species are organized as a hierarchical classification in taxonomy, we call a species at a leaf node of the tree a taxon (plural taxa) in phylogenetic inference. A phylogenetic tree is usually represented by a binary tree in which each node is connected to at most three other nodes, but it can be represented by a multiforked tree when some parts of the tree cannot be fully resolved [59-62].

Each internal branch of the tree maps a divergence event in evolution and divides all taxa into two groups. Each group is called a clade, and each taxon in the clade shares the same common ancestor with the other taxa in the clade. If the length of the branch is set, it is proportional to the divergence time since the two groups of taxa separated from their most recent common ancestor.

A phylogenetic tree can be rooted or unrooted, depending on whether a unique node is chosen as the most recent common ancestor of all taxa. Determining the true root for a group of taxa is usually impractical, so unrooted trees are most commonly used in phylogenetic inference.

[Figure 2-1: Phylogenetic trees of 12 primate mitochondrial DNA sequences, shown in four panels (a), (b), (c) and (d). Taxa: Tarsius syrichta, Lemur catta, Saimiri sciureus, Hylobates, Pongo, Gorilla, Homo sapiens, Pan, M. sylvanus, M. fascicularis, Macaca fuscata, M. mulatta. Scale bar: 0.1.]

Figure 2-1 shows the phylogenetic tree of 12 primate mitochondrial DNA sequences. This tree was constructed using MrBayes from 898 DNA characters under the JC69 model. Figure 2-1 (a) and (b) are called cladograms, which provide topological information only. Figure 2-1 (c) and (d) are called phylograms, which provide both branching order and divergence time. The NEWICK format representation of the phylogenetic tree [63, 64] in Figure 2-1 is shown as follows.

    #NEXUS
    BEGIN TREES;
        TRANSLATE
            1 Tarsius_syrichta,
            2 Lemur_catta,
            3 Homo_sapiens,
            4 Pan,
            5 Gorilla,
            6 Pongo,
            7 Hylobates,
            8 Macaca_fuscata,
            9 M_mulatta,
            10 M_fascicularis,
            11 M_sylvanus,
            12 Saimiri_sciureus
        ;
        UTREE * PRIMATE = (1,2,(12,((7,(6,(5,(3,4)))),(11,(10,(8,9))))));
    ENDBLOCK;

Figure 2-2: The NEWICK representation of the primate phylogenetic tree

To make the NEWICK representation unique, we define the signature of an unrooted tree as the NEWICK representation that satisfies two requirements: 1) the root of the tree is fixed at the internal node that has the taxon with the smallest label as one of its children; and 2) the children of each internal node are ordered lexicographically by their labels. For example, the signature of the above tree is:

    (1,2,((((((3,4),5),6),7),(((8,9),10),11)),12))

Using the tree signature, we can test the equality of two trees by simple string comparison.

When the distance between two trees, rather than their equality, is of interest, a phylogenetic tree is treated as a set of hierarchical bipartitions. Each branch in the phylogenetic tree divides the set of taxa into one bipartition. For example, the complete set of nontrivial bipartitions (i.e., bipartitions in which each part has at least two nodes) for the primate phylogenetic tree shown in Figure 2-2 is:

    (1,2)          (3,4,5,6,7,8,9,10,11,12)
    (1,2,12)       (3,4,5,6,7,8,9,10,11)
    (3,4)          (1,2,5,6,7,8,9,10,11,12)
    (3,4,5)        (1,2,6,7,8,9,10,11,12)
    (3,4,5,6)      (1,2,7,8,9,10,11,12)
    (3,4,5,6,7)    (1,2,8,9,10,11,12)
    (8,9)          (1,2,3,4,5,6,7,10,11,12)
    (8,9,10)       (1,2,3,4,5,6,7,11,12)
    (8,9,10,11)    (1,2,3,4,5,6,7,12)

Figure 2-3: The nontrivial bipartitions of the primate phylogenetic tree

Like the signature of a phylogenetic tree, each bipartition can be viewed as a signature of its corresponding tree node, and thus we can compare two nodes from two different phylogenetic trees that include the same group of taxa. The total number of bipartitions that appear in only one of the two trees but not both is defined as the Robinson and

Foulds topological distance of these two trees [24], a distance widely used in tree comparisons.

[Figure 2-4: A phylogenetic tree with support values for each clade. The tree contains the same 12 primate taxa as Figure 2-1; of the nine clade support values shown, eight are 1.00 and one is 0.91.]

The support of a phylogenetic tree for given data is usually assessed with bootstrapping [65] or Bayesian posterior probability [66]. In both methods, a consensus tree is commonly used to summarize common structures among a group of trees sampled using MCMC (Markov Chain Monte Carlo) or computed using the bootstrapped datasets. In either case, the occurrences of each bipartition are counted and the frequency of each bipartition is shown in the phylogram, as in Figure 2-4. The consensus tree is also used to combine trees estimated using different genes or datasets for the same group of taxa.
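The tree signature, the nontrivial bipartitions and the Robinson and Foulds distance described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: trees are encoded as nested tuples of integer taxon labels (a hypothetical encoding), the input is assumed to be already rooted at the internal node adjacent to the smallest label, and children are ordered by the smallest label in each subtree, which reproduces the signature given above for the primate tree. All function names are illustrative.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical encoding: a leaf is an int label, an internal node is a tuple.
Tree = Union[int, tuple]


def signature(tree: Tree) -> str:
    """Return a canonical NEWICK string (assumes the tree is already rooted
    at the internal node adjacent to the smallest taxon label)."""
    def walk(node: Tree) -> Tuple[int, str]:
        if isinstance(node, int):
            return node, str(node)
        # Order children by the smallest label found in each subtree.
        parts = sorted(walk(child) for child in node)
        text = "(" + ",".join(s for _, s in parts) + ")"
        return parts[0][0], text
    return walk(tree)[1]


def bipartitions(tree: Tree) -> Set[FrozenSet[int]]:
    """Collect nontrivial bipartitions, each stored as the side that
    contains the smallest taxon label."""
    found: Set[FrozenSet[int]] = set()

    def collect(node: Tree) -> FrozenSet[int]:
        if isinstance(node, int):
            return frozenset([node])
        return frozenset().union(*(collect(child) for child in node))

    taxa = collect(tree)
    anchor = min(taxa)

    def walk(node: Tree, is_root: bool) -> FrozenSet[int]:
        if isinstance(node, int):
            return frozenset([node])
        clade = frozenset().union(*(walk(child, False) for child in node))
        rest = taxa - clade
        # Each non-root internal node corresponds to one internal branch.
        if not is_root and len(clade) >= 2 and len(rest) >= 2:
            found.add(clade if anchor in clade else rest)
        return clade

    walk(tree, True)
    return found


def rf_distance(a: Tree, b: Tree) -> int:
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(a) ^ bipartitions(b))


# The primate tree of Figure 2-2, and a variant that swaps taxa 4 and 5.
PRIMATE = (1, 2, (12, ((7, (6, (5, (3, 4)))), (11, (10, (8, 9))))))
VARIANT = (1, 2, (12, ((7, (6, (4, (3, 5)))), (11, (10, (8, 9))))))

print(signature(PRIMATE))
print(len(bipartitions(PRIMATE)))
print(rf_distance(PRIMATE, VARIANT))
```

Storing each bipartition as the side that contains the smallest label gives every branch a single canonical key, so the distance reduces to a set symmetric difference.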

When the individual trees have different but overlapping sets of taxa, a supertree replaces the consensus tree as the summarized output [67]. To account for the possibility of horizontal gene transfer, a phylogenetic network is used as an alternative representation of the evolutionary relationships among a group of taxa [68].

2.2 Methods for phylogenetic inference

Various methods have been developed to build phylogenetic trees from different kinds of data. These methods can be classified by: 1) the data type used in tree estimation; 2) the criteria used to define an optimal tree; and 3) the tree search strategies.

2.2.1 Sequence-based methods and genome-based methods

Currently, molecular sequences and whole genome features are the two major data types used in phylogenetic inference [69]:

1) Sequence-based methods use one or multiple gene alignments to estimate the phylogenetic tree. Phylogenetic inference with multiple gene alignments has become common in recent years. The supermatrix [70] and supertree [71] methods are the two major approaches for handling combined data such as multiple gene alignments. Both approaches rely on standard sequence-based phylogenetic inference methods.

2) Genome-based methods use phylogenetic signals contained in gene content [72-74] or gene order [75, 76] to estimate the phylogenetic tree. Phylogenetic inference using whole-genome features has recently attracted researchers' attention, and much effort has been devoted to formulating distance metrics and

probability models. An overview of genome-based methods is provided by Delsuc et al. [69].

2.2.2 Distance-, MP-, ML- and BP-based methods

There are four major criteria for defining an optimal tree: distance, maximum parsimony (MP), maximum likelihood (ML), and Bayesian posterior probability (BP). Comparisons among these methods are reviewed in [33, 62, 77]. Briefly, distance-based methods are much faster than the other three methods but have some potential weaknesses, including: 1) information loss in converting sequences into a distance matrix; and 2) inconsistency for datasets with large distances.

MP and ML are both optimization-based methods that break the tree estimation process into two major components: scoring a given tree and searching for the tree (or trees) with the best score. MP uses the minimum number of mutations that could produce a given tree as the score. ML uses the likelihood of the given tree under an explicit evolutionary model as the score. MP runs much faster than ML because: 1) MP needs far fewer computations to evaluate the number of mutations than ML needs to evaluate the likelihood; and 2) MP does not need to optimize the branch lengths. Drawbacks of MP include: 1) multiple (or too many) trees may have the same MP score while only one of them is true; and 2) MP is subject to the long-branch attraction problem [78], since it does not account for the fact that the number of mutations varies across branches.

Both ML and BP are likelihood-based methods that explicitly use a probabilistic model of molecular evolution. Their major difference is that ML uses point estimates for the unknown parameters, whereas BP uses the marginal distribution to integrate out the unknown parameters. BP has been suggested as a faster alternative to ML with bootstrapping [41],

however this argument needs to be further justified [79]. Whether BP should be classified as an optimization-based method is questionable, since theoretically BP requires more computation than ML in order to find the probabilities of all modes of the posterior distribution. As ML is conjectured to be NP-hard, BP is at least as difficult as ML. Therefore, we put BP in a new category of phylogenetic methods: sampling-based methods.

2.2.3 Tree search strategies

Every phylogenetic inference method relies on one or more tree search strategies once the optimality criterion is formulated. We divide the tree search strategies into the following categories:

1) Clustering method [23]: a clustering method builds the tree using a sequence of clustering operations. Examples include UPGMA [80] and neighbor-joining [81]. A clustering method runs much faster than other methods. Its limitation is that it produces only one tree, which may not be the global optimum.

2) Exact search [77]: this method examines every possible tree to locate the best tree. Exact search can be further divided into exhaustive search and branch-and-bound search. Exhaustive search enumerates all possible trees for evaluation. Considering the huge number of possible trees as described in Chapter 1, exhaustive search is practical only for small datasets. Branch-and-bound can prune the search space by discarding those trees whose score is worse than a preset bound (or threshold). The stricter the bound, the more the space is pruned. Like exhaustive search, branch-and-bound is limited to small problem sizes.

3) Deterministic heuristic search: tree space is not completely randomly distributed; it has a certain order. A heuristic search attempts to exploit this order to find the best or a near-best tree. Commonly used deterministic search strategies include stepwise addition, local rearrangement, and global rearrangement [64, 77]. One potential problem of deterministic heuristic search is that it does not guarantee a globally optimal solution.

4) Stochastic search: by introducing some random moves, a stochastic search may avoid local optima and move toward the global optimum. Three stochastic algorithms are used in phylogenetic inference: simulated annealing [82, 83], genetic algorithms [84-86] and MCMC [40, 41, 87, 88].

5) Divide and conquer: a large problem can be solved by dividing the original problem into a set of smaller problems, solving each of them separately, and then merging the solutions of the smaller problems to obtain the solution to the original problem. Examples used in phylogenetic inference include the disk-covering method (DCM) [89], quartet-puzzling [90] and supertree methods [67].

2.3 High performance computing phylogenetic inference methods

As phylogenetic inference moves to larger problem sizes and parallel processing becomes common, high performance computing support for phylogenetic inference is needed. High performance computing support includes algorithm tuning, parallel algorithm design, and parallel platform deployment.

Algorithm tuning seeks alternative approaches for the computation-intensive parts of phylogenetic inference. One common technique for likelihood-based phylogenetic

methods is to avoid optimizing the branch lengths frequently, because this optimization process takes $O(n^2)$ likelihood calculations. This technique has been used in [85, 86, 91, 92].

Besides algorithmic improvement and exploration, parallel processing can potentially reduce the computation time from several months to several hours in an efficient and immediate manner. Several parallel implementations of widely used phylogenetic inference methods have been developed recently, among them parallel fastDNAml [93, 94], parallel TREE-PUZZLE [95], a parallel genetic algorithm for ML [96], GRAPPA [97], and parallel MCMC algorithms [98, 99]. We note that most phylogenetic inference methods contain multiple levels of concurrency and that these methods can be run in an embarrassingly parallel fashion.

2.4 Bayesian phylogenetic inference

2.4.1 Introduction

As described in the previous chapter, the task of phylogenetic inference includes two major steps: 1) constructing a phylogenetic tree that maps the evolutionary relationships among a group of taxa, and 2) assessing the confidence in the estimated tree given the observed data.

Various methods are available for building the phylogenetic tree, and some of them are based on a probabilistic model of molecular evolution. Due to the stochastic nature of molecular evolution and the complicated mechanisms that affect the evolutionary process, almost every phylogenetic method has to deal with uncertainties caused by unknown parameters. Also, the fact that multiple phylogenetic trees are possible for the

same group of taxa has to be considered in applications that explicitly use a phylogeny as the basis of study.

Using a comprehensive probabilistic model, Bayesian analysis provides a methodology for describing relationships among all variables under consideration. Bayesian phylogenetic inference can learn the phylogenetic model from observed data based on a quantity called the posterior probability. The posterior probability of a phylogenetic model $\Psi = (T, \tau, \theta)$ can be interpreted as the probability that this phylogenetic model is correct.

Bayesian phylogenetic inference shares some similarities with maximum likelihood estimation [10, 33]: both explicitly use a model of molecular evolution and a formalization of the likelihood function. However, the underlying methodologies are quite different. First, the Bayesian approach deals with parameter uncertainty by integrating over all possible values that a parameter might assume, while maximum likelihood estimation uses a point estimate in the analysis. Second, Bayesian analysis requires specifying prior distributions for the parameters of a phylogenetic model, which offers the advantage of incorporating existing knowledge but also invites criticism, since the prior distributions are often unknown. Finally, Bayesian analysis outputs the posterior probabilities of trees and clades as a measure of confidence in the estimated results. Therefore, Bayesian phylogenetic inference is considered a faster alternative to maximum likelihood estimation with bootstrap resampling [41].

Though the idea of Bayesian phylogenetic inference emerged at almost the same time as the maximum likelihood method [100], the computation of the Bayesian posterior probability of a phylogeny was not feasible until Markov Chain Monte Carlo methods were

implemented for phylogenetic inference by three independent research groups [87, 101-103] in 1996. Bayesian phylogenetic inference became widely used after the method of computing posterior probability was described [10, 33, 39-41, 87, 104, 105] and several phylogenetic inference programs (BAMBE [106] and MrBayes [58]) became publicly available.

Despite some obvious benefits and ever-increasing applications, Bayesian phylogenetic inference has been hotly debated on several issues, including the amount of bias caused by inappropriate prior probabilities, the interpretation of Bayesian posterior probability [46], and the accuracy of Bayesian clade support [34, 36, 42, 45]. This calls for further examination of the power and performance of Bayesian phylogenetic analysis, and therefore for improved and faster implementations of current Bayesian phylogenetic methods.

2.4.2 The Bayesian framework

A phylogenetic model $\Psi = (T, \tau, \theta)$ consists of three components: a tree structure ($T$) that represents the evolutionary relationships of a set of organisms under study, a vector of branch lengths ($\tau$) that maps the divergence time along different lineages, and a model of molecular evolution ($\theta$) that approximates how the characters at each site evolve over time along the tree.

In the Bayesian framework, both the observed data $X$ and the parameters of the phylogenetic model $\Psi$ are treated as random variables. The joint distribution of the data and the model can then be set up as follows:

$$P(X, \Psi) = P(X \mid \Psi)\, P(\Psi) \tag{2-1}$$

Once the data is known, Bayes' theorem can be used to compute the posterior probability of the model using

$$P(\Psi \mid X) = \frac{P(X \mid \Psi)\, P(\Psi)}{P(X)} \tag{2-2}$$

Here, $P(X \mid \Psi)$ is called the likelihood (the probability of the data given the model), $P(\Psi)$ is called the prior probability of the model (the unconditional probability of the model without any knowledge of the observed data), and $P(X)$ is the unconditional probability of the data. For the continuous case, $P(X)$ is computed by

$$P(X) = \int P(X \mid \Psi)\, P(\Psi)\, d\Psi \tag{2-3}$$

For the discrete case, $P(X)$ is computed by

$$P(X) = \sum_{\Psi_i} P(X \mid \Psi_i)\, P(\Psi_i) \tag{2-4}$$

Since $P(X)$ is just a normalizing constant, the computation of (2-3) or (2-4) is not needed in practical inference.

The posterior probability distribution of the phylogenetic model can be written as

$$P(\Psi \mid X) = P(T_i, \tau, \theta \mid X) = \frac{P(X \mid T_i, \tau, \theta)\, P(T_i, \tau, \theta)}{\sum_{T_j} \iint P(X \mid T_j, \tau, \theta)\, P(T_j, \tau, \theta)\, d\tau\, d\theta} \tag{2-5}$$

This distribution is the current basis of Bayesian phylogenetic inference; useful information can be obtained from this distribution. For example, the posterior probability of a phylogenetic tree $T_i$ can be computed as

$$P(T_i \mid X) = \iint P(T_i, \tau, \theta \mid X)\, d\tau\, d\theta \tag{2-6}$$

Similarly, the posterior probability of the $i$-th component of the parameter $\theta$ in the evolutionary model can be summarized by

$$P(\theta_i \mid X) = \sum_{T_j} \iint P(T_j, \tau, \theta_i, \theta \setminus \theta_i \mid X)\, d\tau\, d(\theta \setminus \theta_i) \tag{2-7}$$
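For the discrete case, equations (2-2) and (2-4) amount to normalizing the product of likelihood and prior over the candidate models. The following Python sketch illustrates this under simplifying assumptions: each candidate tree is taken to have a log-likelihood in which $\tau$ and $\theta$ have already been integrated out, the log-likelihood values are made up for illustration, and the function name `posterior` is hypothetical. It is not the sampling procedure used in practice, where the normalizing constant is never computed directly.

```python
import math
from typing import List, Sequence


def posterior(log_likelihoods: Sequence[float],
              priors: Sequence[float]) -> List[float]:
    """Posterior probabilities for a finite set of candidate models.

    Implements P(model_i | X) = P(X | model_i) P(model_i) / P(X), with
    P(X) given by the sum in the discrete case. All priors must be > 0.
    """
    # Work in log space: tree log-likelihoods are typically large and
    # negative, so exponentiating them directly would underflow.
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    peak = max(log_joint)
    weights = [math.exp(value - peak) for value in log_joint]
    total = sum(weights)  # proportional to P(X)
    return [w / total for w in weights]


# Three hypothetical trees with made-up log-likelihoods and a uniform prior.
example_log_likelihoods = [-1000.0, -1001.0, -1003.0]
example_priors = [1 / 3, 1 / 3, 1 / 3]

for index, prob in enumerate(posterior(example_log_likelihoods,
                                       example_priors), start=1):
    print(f"tree {index}: posterior = {prob:.4f}")
```

Subtracting the largest log value before exponentiating leaves the normalized result unchanged, because the common factor cancels between numerator and denominator.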