Phylogenetic Inference and Hypothesis Testing. Catherine Lai (92720) BSc(Hons) Department of Mathematics and Statistics University of Melbourne


Phylogenetic Inference and Hypothesis Testing. Catherine Lai (92720), BSc(Hons), Department of Mathematics and Statistics, University of Melbourne. November 13, 2003

2 Contents 1 Introduction 4 2 Molecular Phylogenetics The Use of Phylogenetic Trees Traditional Approaches Phylogenetic Trees From Genomic Data What about the root? How Treelike is Evolution? Models of Evolution A Simple Approach Evolution as a stochastic process Markov Models of Evolution Markov Theory Markov Models of Site Substitution Parameterized Models of Nucleotide Evolution Jukes-Cantor Model Jukes-Cantor Variance Generalisations of the Jukes-Cantor Model Problems with Markov Models of Evolution Modelling Rate Heterogeneity Modelling Non-Stationarity Summary of Nucleotide Markov Models Empirical Models of amino acid evolution PAM/Dayhoff Substitution Matrices BLOSUM Differences in PAM and BLOSUM Phylogenetics Tree Reconstruction Methods Evaluating Reconstruction Methods Complexity Accuracy Consistency Efficiency Robustness Usability in tests Parsimony Maximum Likelihood Is MP the same as ML? Distance Based Methods Unweighted pair group method using arithmetic averages (UPGMA)

3 4.5.2 The Molecular Clock Hypothesis Long Branch Attraction Neighbour Joining BIONJ Weighbor NJ and the minimum evolution method Least Squares Estimating Branch lengths Minimum Evolution Method with Least Squares Bayesian Tree Reconstruction Trees from Alignments, Alignments from Trees Phylogenetic Hypothesis Tests Confidence Regions of Phylogenetic Trees The Bootstrap The Non-parametric Bootstrap Testing Phylogenies using the Non-parametric Boostrap How well does it work? The Parametric Bootstrap Problems with the Parametric Bootstrap Bootstrap Based Tests Centering The Kishino Hasegawa Test The Shimodaira Hasegawa Test The Swofford Olsen Waddell Hillis Test (SOWH) Bayesian methods Bootstraps and Posterior Probabilities Which Test? Generalized Least Squares in Phylogenetic Hypothesis Testing Sample Average Variance and Covariance Motivation for Simulation of GLS test statistic GLS Test Statistic Simulation Method Results Discussion Contribution Further Work Conclusion 63 A GLS Results: Sample Average Covariance 64 A.1 Four Leaf Trees A.2 Five Leaf Trees B GLS results: JC Covariance 73 B.1 Four Leaf Trees B.2 Five Leaf Trees C Covariance Estimation 81 C.1 Sample Covariance Results C.1.1 Sample Covariance - 100bp C.1.2 Sample Covariance bp

4 C.1.3 Sample Covariance bp C.2 Jukes-Cantor Covariance C.2.1 Jukes-Cantor Covariance - 100bp C.2.2 Jukes-Cantor Covariance bp C.2.3 Jukes-Cantor Covariance bp C.3 Sample Average Covariance(Susko) C.3.1 Sample Average Covariance(Susko) - 100bp C.3.2 Sample Average Covariance(Susko) bp C.3.3 Sample Average Covariance(Susko) bp

Chapter 1 Introduction

Phylogenetics is a field of biology that seeks to unlock the evolutionary history of life on earth. The aim is to understand relationships between species and, through this, the process of evolution itself. These relationships can be represented with a graph structure, traditionally simplified to evolutionary trees. The current approach is to try to reconstruct these trees from the blueprint of life: DNA sequences.

Reconstruction methods are difficult to design and evaluate because the biological evidence is often ambiguous. Many approaches have been introduced to deal with the problems of estimation and hypothesis testing of phylogenetic trees. Parametric approaches exploit the elementary knowledge we have of evolution, while non-parametric approaches have been developed to avoid the possibility of inaccurate preconceptions.

Recently, Susko [40] presented an approach that applies the theory of generalized least squares to phylogenetic hypothesis testing. The generalized least squares approach has strong theoretical foundations in the theory of linear models. While the theory appears to be sound, it is based on asymptotic results with regard to sequence length. It is not clear how well the test will perform in practice, where sequences are often only a few hundred nucleotides long. I investigate the effect of sequence length on this approach. I also consider how Susko's approach differs from traditional parametric techniques with respect to variance-covariance estimation.

In Chapter 2 I give a general background to the problems involved in molecular phylogenetics. In Chapter 3 and Chapter 4 I review commonly used probabilistic models and tree reconstruction methods respectively. In Chapter 5 I consider methods of evaluating confidence in results from such reconstruction methods when there may be conflict. This leads to an examination of current hypothesis testing methods and consideration of the validity of the generalized least squares approach in Chapter 6.

Chapter 2 Molecular Phylogenetics

Phylogenetic trees represent relationships between species. They tell the story of life on earth. A phylogenetic tree is a tree in the graph sense. External vertices (nodes) represent extant species while internal nodes represent speciating events. The tree topology determines the lines of evolution: which species descended from which common ancestors. The branch (edge) lengths represent time since speciation of adjacent nodes. In reality, absolute time scales cannot be used and relative time scales are employed. These time scales depend on the data used to infer the tree. For example, if we use genome sequence data, the scale is the expected number of substitutions that have taken place at a site.

2.1 The Use of Phylogenetic Trees

The role of phylogenetics is to help us understand the process of evolution from the patterns in nature we can observe in the present. As Huelsenbeck describes [25]: "[F]or any question in which history may be a confounding factor, phylogenies have a central role."

The most obvious use of phylogenetics is inferring common ancestors. This has implications for our understanding of evolution on a large scale. It can also help with more immediate problems, for example in epidemiology and in understanding the spread of viruses. Viruses such as hepatitis C have a long dormant period, meaning we can only detect the spread of the virus as it happened in the past. Understanding these processes and relationships can help in the development of biotechnology, for example by improving the design of drugs to take account of host-pathogen mutual genetic variation. Phylogenies also shed light on structural biology, helping us to infer function and functional constraints of genes [30].

2.2 Traditional Approaches

Historically, phylogenetic trees have been constructed using two principles. The phenetic approach uses similarity scores derived from measures of physical characteristics. The most similar species are clustered together. While this is an intuitive approach, results may not represent genetic or evolutionary similarity. The cladistic approach assumes that related species will share unique features that were not present in distant ancestors. All species in a group must share a common ancestor. This means that species with many similar physical traits may not be grouped together.

Both of the above approaches rely heavily on morphological and geographical data. However, this has changed with our understanding of the role of DNA in evolution. DNA sequencing has massively increased the information available about the evolutionary process. Most tree reconstruction methods now focus on examining the way DNA (or amino acid) sequences have evolved.

2.3 Phylogenetic Trees From Genomic Data

In molecular phylogenetics, patterns are searched for in genomic data. What we find is that evolution is a stochastic process. Mutations arise from changes to a species' genome, that is, site substitutions, deletions, insertions and inversions. If we understand the process of mutation, we can make reasonable inferences about our past from the genomic material we have now. Sequences that are very different are likely to be less closely related than sequences showing high similarity.

Determining similarity is not an easy problem. Substitutions may be hidden from view by a number of factors. Examples include a site changing and then changing back again (a reversal), more than one mutation occurring at the same site, and parallel changes occurring on different branches of the tree (convergence or parallelism). With that in mind, there are three major components of phylogenetic inference that need to be considered.

Probabilistic Models in Phylogenetics

We need to consider what role probabilistic models can play in helping our understanding of the problem. Typically, it is assumed that mutations occurring at time t depend on the sequence at that time but not on its previous history. This suggests a Markov model of sequence evolution. However, whether or not the traditional assumptions such as homogeneity and reversibility are valid is less clear. The role of probabilistic models in phylogenetics is discussed in Chapter 3.

Reconstructing Phylogenetic Trees from Sequence Data

We need to understand what methods can be employed to actually reconstruct a phylogenetic tree. Within this there are three problems to consider: choosing a criterion, estimating the tree topology, and estimating branch lengths. Besides this, there has been a long-standing feud among biologists (and to a lesser extent mathematicians) over whether parametric methods, non-parametric methods, or something in between should be used. A review of commonly used tree reconstruction methods is contained in Chapter 4.

Hypothesis Tests of Phylogenetic Trees

Once a phylogenetic tree is decided upon, the next step is usually to try to identify which parts are well supported by the data. Hypothesis tests must be used with an understanding of what they actually test and what information they can provide the user. To complicate matters, the hypothesis being tested is often a tree topology, and it is still unclear how the usual statistical measures, such as variance, can be applied to such a structure.

The debate between parametric and non-parametric testing continues here. The non-parametric bootstrap has been used extensively in phylogenetics to provide a measure of confidence. However, the parametric bootstrap and Bayesian methods are also becoming popular. All have their advantages and disadvantages. In any case, inferred trees need to be compared to traditional biological data (e.g. morphological data). No matter how well a method works for simulated data, the aim of the game is to understand the process in reality. These issues are considered in greater detail in Chapter 5 and (with respect to a generalized least squares test) Chapter 6.

2.4 What about the root?

In theory, phylogenetic trees should be rooted to represent descent from the common ancestor. However, as we attempt to reconstruct phylogenetic trees, we have to consider all possible positions of a root with respect to the other nodes of the tree. Unfortunately, it is unlikely that sequences will contain enough information to place a root accurately. However, there are some methods for rooting a tree. These include adding data from very distantly related species (outgroups), or the use of the molecular clock hypothesis. The use of outgroups brings its own bundle of problems, as error in distant species affects the more closely related species. This, and the validity of the molecular clock hypothesis, are discussed in the section on the Molecular Clock Hypothesis.

2.5 How Treelike is Evolution?

It is worth asking whether trees are really the right structure to describe evolution. Often there is not enough information in the data to resolve the issue of parallel mutations, while gene transfer from species to species means that a gene can have more than one ancestor. There is a criterion to determine whether data are treelike: the four point condition. For any taxa $i, j, k, l$ in a phylogenetic tree with four or more taxa we have:

$$d_{ij} + d_{kl} \le \max(d_{ik} + d_{jl},\; d_{jk} + d_{il}) \tag{2.1}$$

Otherwise, other structures may be more valid. For example, see Strimmer and Moulton's work on phylogenetic networks [39].
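The four point condition (2.1) can be checked mechanically on a distance matrix. The following is a minimal sketch, not part of the original text: the function names and the tolerance parameter `tol` are illustrative assumptions. It uses the fact that (2.1) holding for every labelling of a quartet is equivalent to the two largest of the three pairwise sums being equal.

```python
from itertools import combinations


def four_point_holds(d, i, j, k, l, tol=1e-9):
    """Check the four point condition for one quartet of taxa.

    d is a symmetric distance matrix (list of lists). The condition holds
    for every labelling of the quartet exactly when the two largest of the
    three pairwise sums are equal (here: equal within tol).
    """
    sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
    return abs(sums[2] - sums[1]) <= tol


def is_treelike(d, tol=1e-9):
    """Return True if every quartet of taxa satisfies the condition."""
    n = len(d)
    return all(four_point_holds(d, *quartet, tol=tol)
               for quartet in combinations(range(n), 4))


# Distances read off a four-leaf tree ((a,b),(c,d)) with leaf branch
# lengths 1, 2, 3, 4 and an internal branch of length 5.
tree_distances = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
print(is_treelike(tree_distances))
```

Distances estimated from real sequences are noisy, so in practice `tol` would have to be chosen with the sampling error of the estimates in mind; the default here is only suitable for exact tree distances.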

Chapter 3 Models of Evolution

Most widely used tree reconstruction methods use probabilistic models of evolution. These can be formulated parametrically, using known (or assumed) properties of sequence evolution. They can also be derived empirically from information in the observed sequences. It makes sense to use whatever knowledge we have about the process of evolution rather than ignore it. On the other hand, evolution is very complex and biological evidence is often ambiguous. An example of a factor that needs to be taken into account, but is very hard to model, is the difference in rates of evolution between and within lineages. How well a model fits reality can affect how a testing method works. Simpler models have greater power to discriminate but may be biased. So understanding models is necessary for understanding both tree reconstruction and confidence testing [19].

A Simple Approach

Evolutionary distances represent the divergence between species, that is, branch lengths on a phylogenetic tree. The following naive approach to determining distance shows why a probabilistic model is desirable. When determining distances between sequences it is intuitive to use a measure of dissimilarity, taking the distance between two sequences to be the Hamming distance. For two species $x$ and $y$ with sequence length $S$, we have:

$$D_{xy} = \sum_{i=1}^{S} \mathbf{1}(x_i \neq y_i) \tag{3.1}$$

However, this approach does not take into account the phenomena described in Section 2.3, such as reversals and parallelism. This means the observed number of substitutions in a given sequence is a lower bound for the actual number of substitutions that have occurred. This basically

means we cannot accurately look very far back in time. We need models that estimate substitution rates and correct for unseen events. An obvious first step is to try to define evolution as some type of stochastic process.

Evolution as a stochastic process

Definition. A stochastic process is a collection of random variables $\{X(t)\}_{t \in T}$ with a common probability space.

We can think of the process of evolution as the stochastic process of substitutions in a sequence. The set of states is the set of nucleotides (or amino acids). That is, for a nucleotide model, $X(t) \in \{A, C, G, T\}$ is the nucleotide at that site at time $t$.^1 To develop a tractable model for evolution we need to make further simplifying assumptions. This leads us to the well studied world of Markov processes.

3.1 Markov Models of Evolution

3.1.1 Markov Theory

Definition. A stochastic process has the Markov property if the probability of observing a new state at time $s + t$ depends only on the state at time $s$. That is,

$$P(X(s+t) = j \mid X(s) = i_s, \ldots, X(0) = i_0) = P(X(s+t) = j \mid X(s) = i_s) \tag{3.2}$$

A Markov process is a stochastic process with the Markov property. The Markov property is also referred to as the memoryless property of Markov processes. If $t, s \in \mathbb{N}$ then we have a discrete time Markov process. Similarly, if $t, s \in \mathbb{R}$ we have a continuous time Markov process. To model evolution we want the latter, since evolution happens in continuous time.

Definition. The transition probability, $P_{ij}(t,s)$, is the probability of changing from state $i$ at time $s$ to state $j$ at time $t + s$. For a Markov process we have:

$$P_{ij}(t,s) = P(X(s+t) = j \mid X(s) = i) \tag{3.3}$$

^1 When deriving species trees, this means the nucleotide at a site at time $t$ expressed in the majority of the population.

If $P_{ij}(t,s)$ above is independent of $s$ then the process is homogeneous. That is,

$$P_{ij}(t,s) = P_{ij}(t) \tag{3.4}$$

We can collect these probabilities in a transition matrix $P(t)$. The transition probabilities obey the Kolmogorov-Chapman equation:

$$P_{ik}(t) = \sum_{j} P_{ij}(v)\,P_{jk}(t-v) \tag{3.5}$$

In matrix form:

$$P(t) = P(v)\,P(t-v) \tag{3.6}$$

This is equivalent to

$$P(t+v) = P(t)\,P(v) \tag{3.7}$$

with initial condition

$$P(0) = I \tag{3.8}$$

From this, we can extrapolate the transition probabilities at time $t$ as

$$P(t) = [P(1)]^{t} \tag{3.9}$$

3.1.2 Markov Models of Site Substitution

A discrete state, continuous time Markov process of site substitution can be formulated as follows. We define the transition probabilities as the probabilities of substitution of a nucleotide or amino acid. We also assume that the process is time-homogeneous. That is, the rate of substitution is independent of time and the process is the same throughout the whole tree.

Markov chain models of site substitution are usually further constrained by other properties such as stationarity and reversibility. The assumption of stationarity is that the process is in equilibrium. This effectively assumes that nucleotide frequencies are (approximately) the same from species to species throughout time. Reversibility means that we do not distinguish the

process from the process in reverse: we treat the process running from an ancestor species to a descendant, and vice versa, as the same. That is, evolution does not have a direction. This is in fact a model of the evolution of a sequence along a phylogenetic tree branch. Usually the process of substitution is assumed to be Poisson. The validity of these assumptions is discussed in Section 3.3. Before this, I examine how transition matrices for Markov models of evolution can be derived parametrically.

3.2 Parameterized Models of Nucleotide Evolution

We can derive the transition matrix $P(t)$ by estimating a rate matrix $Q$. $Q_{ij}$ is the rate at which $i$ changes to $j$ in a very small time step $\delta t$:

$$P_{ij}(\delta t) = Q_{ij}\,\delta t + o(\delta t), \quad i \neq j \tag{3.10}$$

In fact, if for all states $i$ we have $\sum_j P_{ij} = 1$ (that is, the process is honest), $Q$ defines the process (and hence the transition matrix) uniquely [9]. Let $P(1) = e^{Q}$. Then, using the Kolmogorov-Chapman equation,

$$P(t) = P(1)^{t} = e^{tQ} \tag{3.11}$$
$$\phantom{P(t)} = \sum_{j=0}^{\infty} \frac{(tQ)^{j}}{j!} \tag{3.12}$$

Now,

$$\frac{d}{dt} P(t) \Big|_{t=0} = Q e^{tQ} \Big|_{t=0} \tag{3.13}$$
$$\phantom{\frac{d}{dt} P(t) \Big|_{t=0}} = Q \tag{3.14}$$

Now let $v = (1,1,1,1)^{T}$. It is easy to see that $v$ is an eigenvector of $P(t)$ with eigenvalue 1, since $\sum_j P_{ij}(t) = 1$ as noted earlier:

$$P(t)v = e^{tQ} v = \sum_{j=0}^{\infty} \frac{(tQ)^{j} v}{j!} \tag{3.15}$$
$$\phantom{P(t)v} = 1 \cdot v + \sum_{j=1}^{\infty} \frac{t^{j} (Q^{j} v)}{j!} \tag{3.16}$$
$$\phantom{P(t)v} = v \tag{3.17}$$

This represents a power series in $t$, so all coefficients $Q^{j} v = 0$. That is, $Qv = 0$. Rate

matrices that satisfy this condition define honest processes. It is clear at this point that Markov models suffer from confounding. That is, $e^{tQ} = e^{(t/\gamma)(\gamma Q)}$, where $\gamma > 0$ scales the rate. This means that absolute time scales cannot be used. Hence, expected substitutions per site is the usual time unit stated.

3.2.1 Jukes-Cantor Model

Of all the Markov models of evolution, the Jukes-Cantor model makes the strongest assumptions about the process. Besides the properties of Markov chains, this model assumes:

- The processes acting on sites are independent and identically distributed (iid).
- All substitutions occur with equal probability.

With these assumptions, we need only define an appropriate rate matrix to derive transition probabilities. If we define the rate of substitution as $\alpha$, the Jukes-Cantor rate matrix is:

$$Q_{JC} = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix} \tag{3.18}$$

The diagonal entry, corresponding to a site staying in its current state, must be $-3\alpha$ as row sums must equal zero. We can find the spectral decomposition of $e^{tQ_{JC}}$ by determining the eigenvectors and eigenvalues of $Q$:

$$Q = S\,\mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \lambda_4)\,S^{-1} \tag{3.19}$$

where $S$ is the matrix with the eigenvectors of $Q$ as columns and $\lambda_i$ are the corresponding eigenvalues. From further linear algebra we can write:

$$e^{tQ} = S\,\mathrm{diag}(e^{t\lambda_1}, e^{t\lambda_2}, e^{t\lambda_3}, e^{t\lambda_4})\,S^{-1} \tag{3.20}$$

In this case it is easy to see that the matrix $S$ can be defined as follows.

For example, $S$ can be taken as

$$S = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} \tag{3.21}$$

Corresponding eigenvalues: $\lambda_1 = 0$ and $\lambda_2 = \lambda_3 = \lambda_4 = -4\alpha$. Also, we can verify that $S^{-1} = \frac{1}{4} S^{T}$. We can now derive the transition probabilities for the Jukes-Cantor model:

$$P_{ij}(t) = \begin{cases} \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right) & \text{for diagonal elements} \\ \frac{1}{4}\left(1 - e^{-4\alpha t}\right) & \text{for off-diagonal elements} \end{cases} \tag{3.22}$$

The probability of seeing a change after time $t$ does not depend on the current state:

$$\Pr(X(t) \neq X(0)) = \sum_{b \neq a} \Pr(X(t) = b \mid X(0) = a) \tag{3.23}$$
$$\phantom{\Pr(X(t) \neq X(0))} = \sum_{b \neq a} P_{ab}(t) \tag{3.24}$$
$$\phantom{\Pr(X(t) \neq X(0))} = \tfrac{3}{4}\left(1 - e^{-4\alpha t}\right) \tag{3.25}$$

We can use $P_c(t)$, the proportion of changed sites after time $t$, to estimate the time of divergence by solving the above for $t$. Each site either has or has not changed, so the number of changed sites is binomially distributed: $N \hat{P}_c(t) \sim \mathrm{Bin}\left(N, \tfrac{3}{4}(1 - e^{-4\alpha t})\right)$, where $N$ is the length of the sequence. This means $\hat{P}_c(t) = \text{no. changes}/N$ is a maximum likelihood estimate. From the invariance property of maximum likelihood estimates, the following is a maximum likelihood estimate of the time of divergence:

$$\hat{t} = -\frac{1}{4\alpha} \log\left(1 - \tfrac{4}{3} \hat{P}_c\right) \tag{3.26}$$

This is the Jukes-Cantor distance estimate. It is usually written as $d_{ij}$, where $i$ and $j$ represent two sequences/taxa. Since two independent (unrelated) sequences are expected to agree at 1/4 of sites, sequences are considered to be unrelated as $\hat{P}_c \to 3/4$. At this point distances tend to infinity.
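As an illustration of equations (3.22) and (3.26), the following minimal sketch computes the Jukes-Cantor transition probabilities and the distance estimate from two aligned sequences. The function names are hypothetical, and the default rate `alpha = 1/3` is an assumption chosen so that time is measured in expected substitutions per site.

```python
import math


def jc_transition_probs(t, alpha=1.0 / 3.0):
    """Return (P_same, P_diff): diagonal and off-diagonal entries of (3.22)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * decay), 0.25 * (1.0 - decay)


def proportion_changed(seq_x, seq_y):
    """Observed proportion of differing sites in two aligned sequences."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    return sum(a != b for a, b in zip(seq_x, seq_y)) / len(seq_x)


def jc_distance(p_changed, alpha=1.0 / 3.0):
    """Jukes-Cantor estimate (3.26) of divergence time.

    Returns infinity once the proportion of changed sites reaches 3/4,
    the level expected for unrelated sequences.
    """
    if p_changed >= 0.75:
        return math.inf
    return -math.log(1.0 - 4.0 * p_changed / 3.0) / (4.0 * alpha)


p_hat = proportion_changed("ACGTACGTAC", "ACGTACGTAA")
print(p_hat, jc_distance(p_hat))
```

For small proportions of changed sites the estimate is close to the raw proportion; the correction grows as the proportion approaches 3/4.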

Selection of Rate Parameter

For very short time spans the total number of changes inferred by the Jukes-Cantor estimate is equal to the number of observed changes. More precisely:

$$\lim_{t \to 0} \frac{P_c(t)}{P_{obs}(t)} = 1 \tag{3.27}$$

As both numerator and denominator tend to zero, this can be seen using L'Hôpital's rule:

$$\frac{dP_c}{dt}\Big|_{t=0} = \frac{dP_{obs}}{dt}\Big|_{t=0} = 1 \tag{3.28}$$
$$\frac{dP_c}{dt} = 3\alpha e^{-4\alpha t} \tag{3.29}$$
$$\phantom{\frac{dP_c}{dt}} = 3\alpha \quad \text{at } t = 0 \tag{3.30}$$

Applying our boundary condition:

$$3\alpha = 1 \tag{3.31}$$
$$\alpha = \tfrac{1}{3} \tag{3.32}$$

3.2.2 Jukes-Cantor Variance

The variance of the Jukes-Cantor estimate can be derived using the delta method:

$$\hat{t} - E\hat{t} \approx \frac{d\hat{t}}{d\hat{p}}\,(\hat{p} - E\hat{p}) \tag{3.33}$$
$$\sigma^{2}(\hat{t}) \approx \frac{1}{\left(1 - \tfrac{4}{3}\hat{p}\right)^{2}} \cdot \frac{\hat{p}(1 - \hat{p})}{n} \tag{3.34}$$

where $n$ is the number of sites. That is,

$$\sigma^{2}(t) = e^{8t/3}\, T(1 - T)/S \tag{3.35}$$

where $T = \tfrac{3}{4}\left(1 - e^{-4t/3}\right)$ and $S$ is the sequence length. That is, the variance grows exponentially with $t$.

The covariance of two distance estimates is derived using the tree structure. Distance estimates are represented on a phylogenetic tree by the sum of branch lengths on the unique path between the taxa in question. To calculate the covariance of two pairwise distances we simply calculate the variance of the branches common to both paths.
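The two forms of the variance, (3.34) in terms of the proportion of changed sites and (3.35) in terms of $t$, can be compared numerically. This is a small sketch with hypothetical function names, assuming $\alpha = 1/3$ as derived above.

```python
import math


def jc_variance_from_t(t, seq_len):
    """Variance of the Jukes-Cantor estimate as a function of t, as in (3.35)."""
    T = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    return math.exp(8.0 * t / 3.0) * T * (1.0 - T) / seq_len


def jc_variance_from_p(p, n):
    """Delta-method variance in terms of the proportion of changed sites (3.34)."""
    return p * (1.0 - p) / ((1.0 - 4.0 * p / 3.0) ** 2 * n)


# The two expressions agree when p is the expected proportion at time t.
t = 0.5
p = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
print(jc_variance_from_t(t, 100), jc_variance_from_p(p, 100))

# Short sequences and long divergence times both inflate the variance.
for seq_len in (100, 500, 1000):
    print(seq_len, [jc_variance_from_t(time, seq_len) for time in (0.1, 0.5, 1.0, 2.0)])
```

The loop makes the practical concern concrete: with only a few hundred sites, the variance of a distance estimate for well diverged sequences is large.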

3.2.3 Generalisations of the Jukes-Cantor Model

It is very clear from biological evidence that the assumptions made in the Jukes-Cantor model do not generally hold. Analysis of DNA sequences shows that substitutions are not equiprobable. An example of this is transition/transversion bias. Nucleotides are grouped according to their molecular structure as purines (A, G) or pyrimidines (C, T). Purine to purine or pyrimidine to pyrimidine substitutions are called transitions. The rest are called transversions. Because of this molecular structure, a transition is much more likely than a transversion.

The Kimura 2-parameter model (K2P) attempts to correct the equal-rates assumption by introducing parameters to model the difference in transition and transversion rates. This approach produces a new rate matrix where $\alpha$ is the rate of transitions and $\beta$ the rate of transversions (rows and columns ordered A, T, C, G):

$$Q_{K2P} = \begin{pmatrix} -(2\beta + \alpha) & \beta & \beta & \alpha \\ \beta & -(2\beta + \alpha) & \alpha & \beta \\ \beta & \alpha & -(2\beta + \alpha) & \beta \\ \alpha & \beta & \beta & -(2\beta + \alpha) \end{pmatrix} \tag{3.36}$$

In 1981 Felsenstein [15] presented a model (F81) where the substitution rate depends only on the equilibrium frequency of a nucleotide. These equilibrium frequencies are usually determined from the observed frequencies in the sequences to hand. $\mu$ represents a rate parameter and $\pi_i$ represents the frequency of nucleotide $i$ ($\cdot$ indicates the value necessary to make the row sums equal to zero):

$$Q_{F81} = \begin{pmatrix} \cdot & \mu\pi_T & \mu\pi_C & \mu\pi_G \\ \mu\pi_A & \cdot & \mu\pi_C & \mu\pi_G \\ \mu\pi_A & \mu\pi_T & \cdot & \mu\pi_G \\ \mu\pi_A & \mu\pi_T & \mu\pi_C & \cdot \end{pmatrix} \tag{3.37}$$

Hasegawa et al [20] further refined Felsenstein's model by considering transition and transversion rates $\alpha$ and $\beta$.

$$Q_{HKY} = \begin{pmatrix} \cdot & \beta\pi_T & \beta\pi_C & \alpha\pi_G \\ \beta\pi_A & \cdot & \alpha\pi_C & \beta\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \beta\pi_G \\ \alpha\pi_A & \beta\pi_T & \beta\pi_C & \cdot \end{pmatrix} \tag{3.38}$$

Finally, the most general time reversible model (GTR) has nine free parameters:

$$Q_{G} = \begin{pmatrix} \cdot & \rho\pi_T & \beta\pi_C & \gamma\pi_G \\ \rho\pi_A & \cdot & \alpha\pi_C & \sigma\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \tau\pi_G \\ \gamma\pi_A & \sigma\pi_T & \tau\pi_C & \cdot \end{pmatrix} \tag{3.39}$$

3.3 Problems with Markov Models of Evolution

Biological evidence suggests that all the models considered so far simplify the situation too much. For example, they can't deal with long-distance correlation due to RNA folding.

A key problem appears to be the iid assumption between sites. The assumption of rate homogeneity is contradicted by evidence that mutations are dependent on local sequence context. Protein coding genes are an example of how this assumption can be violated by very basic ideas. Because of the redundancy in the nucleotide to amino acid code, different codon positions are subject to different selectional pressures. Mutation rates appear to be dependent on structural and functional constraints as well as chromosomal positions. These are all local properties of a sequence. However, it is assumed that substitution rates are constant throughout a phylogenetic tree.

Markov models of evolution assume stationarity of base frequencies. That is, expected nucleotide frequencies remain the same with time. This is contradicted by observations that nucleotide frequencies are very different in sequences from different species. For example, GC content in mammals is much higher than in flies [30]. Lockhart et al [31] have shown that if a model that assumes stationarity is used, then breaking the assumption can lead to inaccurate distance estimates. The main problem is a tendency to group sequences with similar nucleotide frequencies, irrespective of evolutionary development.

3.4 Modelling Rate Heterogeneity

A number of methods have been suggested to add some level of rate heterogeneity to Markov models. One approach is to set some sites invariable while others change. This is useful when one can determine conserved regions in sequences. However, it doesn't allow for more than one rate. Another approach allows sites to evolve at different rates, where the rate for a site is drawn from a Gamma distribution with shape parameter $\alpha$. A discrete Gamma model has also been developed by Yang [30] that allows much easier computation. This is, perhaps, the most popular approach at present. However, it still does not make use of the information available about local behaviour.

Recently, Steel et al [42] have presented a covariotide model of site substitution where sites are affected by different selection pressures. This model allows some sites to be invariant while others change. However, sites do not have to remain invariant. This represents the fact that constraints on sites can change over time. The activation of sites is governed by a Markov process where sites are still iid to keep the model tractable.

Other techniques have been based on defining multiple categories of rates. This is implemented using hidden Markov models. Algorithms infer the most probable rate category for a site. These are discussed in [11].

3.5 Modelling Non-Stationarity

As mentioned previously, Markov models of evolution assume stationarity of nucleotide frequencies. However, there is strong evidence suggesting that this is not the case. The paralinear and LogDet corrections have been developed to make distance estimates more reliable when base frequencies differ from species to species. Both rely on the following lemma.

Lemma. Let $t$ be a measure of evolutionary time. Now,

$$t \propto \log[\det(P(t))] \tag{3.40}$$

Proof.

$$P(t) = e^{tQ} \tag{3.41}$$

From linear algebra,

$$\Rightarrow \det(P(t)) = e^{t\,\mathrm{trace}(Q)} \tag{3.42}$$
$$\Rightarrow \log[\det(P(t))] = t\,\mathrm{trace}(Q) \tag{3.43}$$

Since $Q$ remains fixed,

$$\Rightarrow t \propto \log[\det(P(t))] \tag{3.44}$$

Paralinear distance

Barry and Hartigan [2] suggest an asynchronous distance estimate. This is still based on a Markov process where sites are iid. However, it makes no assumption of homogeneity, reversibility or stationarity. It need not assume base frequencies are in equilibrium, nor that the rate of substitution is constant throughout the tree. The distance estimate is taken to be:

$$\hat{d}_{xy} = -\tfrac{1}{4} \log[\det(P_{xy})] \tag{3.45}$$

where $P_{xy}$ is the transition matrix at a particular site from species $x$ to species $y$. This is assumed to be the same across all sites. The $(i,j)$th element of the transition matrix is estimated as $\Pr(Y = j \mid X = i)$, where $X$ and $Y$ are bases that have the same position in the sequences for species $x$ and species $y$ respectively, and $i, j \in \{A, C, G, T\}$ for nucleotide sequences.

The distance measure is additive and asymmetric. The latter property means that generally $d_{xy} \neq d_{yx}$, which is not a particularly desirable property. In fact, this measure can only be used to estimate the total number of substitutions along a branch when substitution rates are held constant and the model is reversible.

LogDet Transformation

This transformation method involves recording a divergence matrix, $F_{xy}$, for each pair of taxa $x$ and $y$. The $ij$th entry of $F_{xy}$ is the proportion of sites in which taxa $x$ and $y$ have states $i$ and $j$ respectively. The dissimilarity value, $d_{xy}$, is calculated as:

$$\hat{d}_{xy} = -\log[\det F_{xy}] \tag{3.46}$$

Variance can be calculated using the paralinear method. Where S is the sequence length, r

is 4 or 20:

$$\hat{\sigma}^{2}_{xy} = \sum_{i=1}^{r} \sum_{j=1}^{r} \left[ (F_{xy}^{-1})^{2}_{ji}\,(F_{xy})_{ij} - 1 \right] / S \tag{3.47}$$

When models have equal nucleotide frequencies, Lockhart et al [31] show how to calculate branch lengths:

$$\hat{d}'_{xy} = \left( \hat{d}_{xy} + [\log(\det F_{xx} F_{yy})]/2 \right) / r \tag{3.48}$$

Distances become treelike as sequence lengths increase, provided we reinstate our independence assumptions across sites and across the tree. This means that reconstruction methods that require treelike distances will work with corrected distances (and sufficiently long sequences). The LogDet transform has been shown to provide more realistic results where similar nucleotide frequencies might be indicating false evolutionary relationships.

3.5.1 Summary of Nucleotide Markov Models

We can consider the relationships between these models via the following parameters [44]:

- $\kappa$: The rate of transitions relative to the rate of transversions. In practice, $\kappa > 1$ reflects the evidence that transitions are more prevalent than transversions.
- $\alpha$: A measure of between-site variation in the rate of nucleotide substitution. Rates are often drawn from a gamma distribution with mean 1 and variance $1/\alpha$ [44]. High values mean low amounts of rate variation.
- Base frequencies $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$, i.e. three independent parameters.
- $\pi_{MLE}, \pi_{obs}$: The maximum likelihood and observed base frequencies.

The maximum likelihood estimates of $\alpha$ and $\kappa$ are usually used.

Model           α          κ          π
Jukes-Cantor    ∞          1          0.25 each
Kimura 2-P      ∞          variable   0.25 each
Felsenstein     ∞          1          variable
HKY             ∞          variable   variable
JC+Γ            variable   1          0.25 each
K2P+Γ           variable   variable   0.25 each
Fel+Γ           variable   1          variable
HKY+Γ           variable   variable   variable
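Returning to the LogDet transformation, the dissimilarity (3.46) and the corrected distance (3.48) can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the determinant routine is a plain Gaussian elimination written to avoid external libraries, and $F_{xx}$ is taken to be the divergence matrix of a sequence with itself (a diagonal matrix of its base frequencies).

```python
import math

BASES = "ACGT"


def divergence_matrix(seq_x, seq_y, states=BASES):
    """F_xy: entry (i, j) is the proportion of sites with state i in x and j in y."""
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    index = {state: k for k, state in enumerate(states)}
    counts = [[0.0] * len(states) for _ in states]
    for a, b in zip(seq_x, seq_y):
        counts[index[a]][index[b]] += 1.0
    return [[c / len(seq_x) for c in row] for row in counts]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def logdet_dissimilarity(seq_x, seq_y):
    """Uncorrected LogDet dissimilarity, equation (3.46)."""
    det = determinant(divergence_matrix(seq_x, seq_y))
    if det <= 0.0:
        raise ValueError("LogDet is undefined when det(F_xy) is not positive")
    return -math.log(det)


def logdet_corrected(seq_x, seq_y):
    """Corrected distance, equation (3.48), with r = number of states."""
    det_xx = determinant(divergence_matrix(seq_x, seq_x))
    det_yy = determinant(divergence_matrix(seq_y, seq_y))
    d_xy = logdet_dissimilarity(seq_x, seq_y)
    return (d_xy + 0.5 * math.log(det_xx * det_yy)) / len(BASES)


x = "ACGT" * 5
y = "ACGT" * 4 + "CCGT"
print(logdet_dissimilarity(x, x), logdet_corrected(x, x), logdet_corrected(x, y))
```

The example shows why the correction matters: the uncorrected dissimilarity of a sequence with itself is not zero, whereas the corrected distance is.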

3.6 Empirical Models of amino acid evolution

When modelling amino acid evolution, empirical models have been the preferred solution. These models specify explicit transition probabilities derived from empirical evidence. The preference for empirical models when dealing with amino acids is partially due to the increase in complexity involved in having twenty character states. The following section provides an overview of the two most common empirical methods: the PAM and the BLOSUM matrices.

3.6.1 PAM/Dayhoff Substitution Matrices

The PAM/Dayhoff matrices empirically estimate amino acid substitution rates based on a Markov process framework. These rates were derived from alignments of protein sequences that are at least 85% identical.

Deriving the Mutation Matrix

Let $A$ be the matrix of observed proportions of changes between two amino acids $i, j$. That is:

$$A_{ij} = \frac{N_{ij}}{N} \tag{3.49}$$

In fact, $A_{ij}$ has the same description as $F_{xy}$ described for the LogDet transform. Let $\hat{\pi}^{k}$ be the vector of amino acid frequencies of sequence $k$:

$$\hat{\pi}^{k}_{j} = \frac{N^{k}_{j}}{N} \tag{3.50}$$

We want to derive substitution (transition) probabilities for the time it takes 1% of all amino acids to mutate. This is the point accepted mutation (PAM) unit.

$$P_{ij} = \Pr(i \text{ mutates})\,\Pr(i \text{ mutates to } j \mid i \text{ mutates}) \tag{3.51}$$

Now we can empirically derive a relative mutability of the amino acid $i$ as $m_i$:

$$m_i = P(i \text{ mutates}) \tag{3.52}$$
$$\phantom{m_i} = \frac{\sum_{j} A_{ij}}{\sum_{k,j} A_{kj}} \tag{3.53}$$

Now,

$$\Pr(i \text{ mutates to } j \mid i \text{ mutates}) = \frac{A_{ij}}{\sum_j A_{ij}} \tag{3.54}$$

and we now have an estimate of P:

$$P_{ij} = m_i \frac{A_{ij}}{\sum_j A_{ij}} \tag{3.55}$$

To calibrate our matrix to the PAM measure we simply solve:

$$\sum_i \pi_i (1 - P_{ii}) = 0.01 \tag{3.56}$$

The matrix of $P_{ij}$s is the PAM matrix. If π is a vector of amino acid frequencies, Pπ is the probability vector after that time period (1-PAM).

To consider more distant relationships we can derive the k-PAM matrix. Because this is based on a Markov process, we can theoretically achieve this by raising the 1-PAM matrix to the kth power:

$$P(k) = P^k \tag{3.57}$$

The log-odds form of the PAM matrices is often used for scoring sequence alignment reliability. This can be thought of as a log-likelihood ratio test with the null hypothesis being that sites have aligned by chance:

$$S_{ij} = \log\frac{P_{ij}}{\pi_i} \tag{3.58}$$

The rarer the amino acids in an aligned pair, the lower the probability of a chance alignment and so the greater the significance.

Problems with the PAM model

Besides the problems inherent in Markov models of evolution, the PAM matrices suffer from other problems. Firstly, they assume that proteins have average amino acid composition (many don't). Secondly, rare replacements are not observed often enough to resolve relative frequencies properly. Thirdly, any error in PAM(1) is extrapolated into the higher-order matrices (say, PAM(250)).

Markov processes don't accurately model evolution. There is no theoretical justification for applying this to divergent alignments. In fact this approach implies a large loss of information. As evolutionary distance increases, information content decreases. This means a longer region of similarity is needed to get a score high enough to distinguish it from chance. However, regions of similarity are found in narrow blocks as evolutionary distance increases, so it is difficult to find the necessary data.

Attempts have been made to update the PAM matrices to make them more accurate. A particular example is the Jones Taylor Thornton model.

BLOSUM

The Block Sum (BLOSUM) substitution matrices were introduced in 1992 by Henikoff and Henikoff [21]. They take a completely different approach to the PAM matrices. The key point is that the derivation of transition probabilities uses alignments of distantly related sequences.

Blocks are conserved regions of local alignments with no gaps. The aim is to obtain a set of scores for matches and mismatches that best favours a correct alignment with each of the other segments in the block relative to an incorrect alignment. This is done by creating a table where each column contains amino acid pair frequencies for the corresponding column in the alignment. This is a 20(20 − 1)/2 × N matrix, where the first term is the number of possible pairs of amino acids and N is the length of the alignment.

A score matrix is defined from a log-odds matrix derived from the frequency table. Let $F_{ij}$ be the ij-th entry of the frequency matrix. Let $q_{ij}$ be the observed probability of an ij pair:

$$q_{ij} = \frac{F_{ij}}{\sum_j F_{ij}} \tag{3.59}$$

We can estimate the expected probability of an ij pair occurring as $e_{ij}$. Let $p_i$ be the probability of i occurring in an ij pair, and let $e_{ij} = p_i p_j$. Our odds ratio matrix takes the form:

$$S_{ij} = \frac{q_{ij}}{e_{ij}} \tag{3.60}$$

That is, the observed probability of an ij pair over the expected probability that i and j appear together at random. The base-2 logarithms of these ratios are usually multiplied by a scaling factor of 2 and then rounded to the nearest integer. This gives the BLOSUM (block substitution matrix) in half-bit units.

Unlike the PAM matrices, separate matrices have been derived for different time scales. BLOSUM matrices are referred to by the minimum percentage identity between species. That is, in BLOSUM 60, sequences that are at least 60% similar are treated as identical. As distances become large we expect to use a BLOSUM matrix with a decreased BLOSUM parameter.
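The BLOSUM scoring step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the Henikoff procedure itself: the pair table F is taken to be symmetric and to count ordered pairs, q is normalised over the whole table so that it sums to one, p_i is taken as the i-th row sum of q, and the two-letter alphabet and counts are hypothetical. The name `half_bit_scores` is illustrative only.

```python
import math


def half_bit_scores(F):
    """Rounded half-bit log-odds scores from a symmetric pair-count table F."""
    n = len(F)
    total = sum(sum(row) for row in F)
    # Observed pair probabilities (assumed normalised over the whole table).
    q = [[F[i][j] / total for j in range(n)] for i in range(n)]
    # Probability of seeing letter i in a pair (row marginal of q).
    p = [sum(q[i]) for i in range(n)]
    scores = []
    for i in range(n):
        row = []
        for j in range(n):
            expected = p[i] * p[j]  # e_ij = p_i p_j
            # 2 * log2(q/e) expresses the log-odds in half-bit units.
            row.append(round(2 * math.log2(q[i][j] / expected)))
        scores.append(row)
    return scores


# Hypothetical pair counts for a toy two-letter alphabet.
toy_F = [
    [8, 2],
    [2, 8],
]
toy_scores = half_bit_scores(toy_F)
```

In this toy table identical pairs are seen more often than chance predicts and score +1, while mismatched pairs are rarer than expected and score −3.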

Problems with the BLOSUM matrices

The main problem with the BLOSUM matrices is that they can be overtrained. That is, if most of the conserved blocks are taken from just a few species then the resulting matrix is unlikely to reflect reality. This is a real problem, as most of the genomic data available is from very few species.

To reduce the contributions from the most closely related members of a family (that is, to reduce multiple contributions of amino acid pairs), sequences are clustered within blocks. Each cluster is weighted as a single sequence. The resulting matrices are analogous to transition matrices, but are estimated without any reference to a rate matrix Q.

3.7 Differences in PAM and BLOSUM

The differences between the PAM and BLOSUM substitution matrices are a consequence of their different approaches to the problem. PAM matrices are derived from a tree-based model that uses matrix multiplication to extrapolate to larger time scales. They are based on mutations in both conserved and variable regions. BLOSUM is derived from pair frequencies in highly conserved blocks. Different weights can be given to different sequence groups.

BLOSUM has an advantage in that it was derived from a more representative data set. Hardly any transitions were observed in deriving PAM, whereas this was not the case for BLOSUM. This problem has been addressed by re-deriving the models with more data. This is the Jones Taylor Thornton model.

The fact that BLOSUM is not tree derived does not seem to be a major disadvantage. BLOSUM generally gives better results when used to score database searches, as highly conserved regions usually serve as anchor points. However, PAM-style matrices are still more widely used in phylogenetics.
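The matrix-multiplication extrapolation that distinguishes PAM from BLOSUM can be sketched on a toy alphabet. Several choices below are assumptions made for illustration and are not taken from the text: the calibration is done with a single scale factor `lam` applied to every mutability, the diagonal is set so that each row sums to one, and the three-letter table `toy_A` and frequencies `toy_pi` are hypothetical.

```python
def mat_mul(X, Y):
    """Plain-Python product of two square matrices."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(P, k):
    """P raised to the k-th power by repeated multiplication."""
    n = len(P)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, P)
    return result


def pam1(A, pi):
    """1-PAM matrix from change proportions A (zero diagonal) and frequencies pi."""
    n = len(A)
    row = [sum(r) for r in A]
    total = sum(row)
    m = [r / total for r in row]  # relative mutabilities
    # Assumed calibration: scale so that sum_i pi_i (1 - P_ii) = 0.01.
    lam = 0.01 / sum(p * mi for p, mi in zip(pi, m))
    P = [[lam * m[i] * A[i][j] / row[i] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        P[i][i] = 1.0 - lam * m[i]  # remaining probability: i does not mutate
    return P


toy_A = [[0.0, 0.03, 0.01], [0.03, 0.0, 0.02], [0.01, 0.02, 0.0]]
toy_pi = [0.5, 0.3, 0.2]
P1 = pam1(toy_A, toy_pi)
P250 = mat_pow(P1, 250)  # extrapolation to 250 PAM units
```

Any estimation error in `P1` is carried through every one of the 250 multiplications, which is the extrapolation problem noted for PAM(250).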

Chapter 4

Phylogenetics Tree Reconstruction Methods

This chapter surveys tree reconstruction methods. Phylogenetic reconstruction methods come in many forms. Parametric methods such as maximum likelihood rely on the specification of a model. At the other end of the scale, non-parametric methods, such as maximum parsimony, claim to make no assumptions about the underlying model, as any assumptions we do make are likely to be inadequate. In the middle there are semi-parametric methods: the distance-based methods that require a model to generate distances but then go on to reconstruct the tree by a non-parametric clustering method. Bayesian approaches have also been proposed.

This chapter first outlines how the usual statistical indicators are redefined for the problem of phylogenetic inference. The rest of the chapter examines commonly used methods for phylogenetic tree reconstruction.

4.1 Evaluating Reconstruction Methods

Before examining the methods available we need to have an idea of what we want from them. Several often-used criteria are discussed below. When evaluating tree reconstruction methods we have available the usual bag of statistical measures. However, it will be seen that defining these with respect to phylogenetics is not straightforward.

Complexity

An important issue to consider in the design of reconstruction methods is the size of the space of trees. There are (2N − 3)!! topologically unique rooted trees on N leaves, and (2N − 5)!! unrooted trees; for N = 10 this is already 34,459,425 rooted and 2,027,025 unrooted trees. Clearly, algorithms that involve evaluating the entire space of N-leaf trees are not going to be computationally feasible.

Accuracy

The evaluation criterion that has been given the most attention is accuracy. When we build phylogenies we need some measure of how well the method used estimates the true tree. This is usually evaluated by examining how the method performs with respect to simulated data and biologically well supported phylogenies [25].

Consistency

The behaviour of reconstruction methods as sequences get longer is usually discussed in terms of consistency.

Definition. An estimator $\hat{T}_n$ of T is consistent if, for any ε > 0,

$$\lim_{n \to \infty} P_T(|\hat{T}_n - T| > \epsilon) = 0 \tag{4.1}$$

In phylogenetics this is interpreted as whether a reconstruction method will return the true tree if its inputs are based on infinitely long sequences. This criterion has been given much consideration as the amount of genome data available increases. In effect, it means that the barrier to success depends only on the researcher having enough sequence to hand.

Steel [13] showed that if the frequencies of residues are known and sites are iid, then a consistent estimator can be found. If not (site rates are variable and/or frequencies are not known), then it can be impossible to estimate a phylogenetic tree consistently. However, Chang [8] has shown that if sites are iid, the correct model is being used and some other restrictions hold, then a consistent estimator can be found.

There has been much debate about the usefulness of such a measure given that real sequences will always be finite. Holmes makes the point that, in reality, the longer sequences are made the less valid the site independence assumption is [23].

(Footnote 1: Holmes provides a useful phylogenetics-to-statistics term conversion table in [23].)

Efficiency

Definition. An estimator $\hat{T}$ is efficient if it is unbiased and

$$\lim_{n \to \infty} \frac{\sigma^2(\hat{T}_n)}{I(T)^{-1}} = 1 \tag{4.2}$$

where I(T) is the Fisher information of T. In phylogenetics, we want this to mean that as longer sequences are used the variance in our trees is as low as it can get. The problem with this is that the variance of a tree is not well defined. In fact, the literature generally uses efficiency to describe how quickly a reconstruction method converges to the correct solution as it is given more data. This is usually measured via simulation.

Ideally, we would use an analogue of the mean squared error for trees, $E[d(T, \hat{T})^2]$, as this gives an indication of both variance and bias. However, the problem remains of how to define distances between trees, and this has not been solved.

Robustness

We know that current assumptions made about sequence evolution are inadequate. With this in mind, it is very desirable to know how well a method is likely to behave when wrong assumptions are made, for example the effect of model misspecification on parametric methods. This is usually assessed by simulating data under a fully specified model and then reconstructing the tree under a misspecified model.

Usability in tests

We also need to consider whether a reconstruction method can reject false assumptions in our model of evolution. For example, we want to be able to determine if additional complexity is worthwhile. Since understanding the process of evolution is our primary goal, this should always be kept in mind.

4.2 Parsimony

The maximum parsimony method chooses the best tree as the one where the least number of base changes have occurred in sequences from the root to the leaves. An example of this is seen in Figure 4.1. Combinatorially this is the same as finding the minimum Steiner tree for the Hamming distance between sequences [23]. In theory this means that all possible assignments of sequences to internal nodes over all possible tree topologies (with the necessary number of leaves) must be evaluated. In practice, heuristics are employed to cut down the search space.

Recursive algorithms and branch and bound have also been employed to avoid repeating computation (see [11]).
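For a fixed topology, the minimum number of base changes can be counted recursively without enumerating every assignment of sequences to internal nodes. The sketch below uses Fitch's algorithm, a standard method that the text does not name, so it should be read as one possible illustration rather than the approach of [11]. The nested-tuple tree representation, the function names and the toy sequences are all hypothetical.

```python
def fitch_site(tree, states):
    """Return (candidate_states, changes) for one site.

    tree is either a leaf name (str) or a pair (left, right) of subtrees;
    states maps each leaf name to its base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # Otherwise one change is forced on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Total number of changes over all sites (sites costed separately)."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )


# Hypothetical toy data: four taxa, three sites.
toy_seqs = {"a": "AAC", "b": "AAT", "c": "GGC", "d": "GGT"}
score_ab_cd = parsimony_score((("a", "b"), ("c", "d")), toy_seqs)
score_ac_bd = parsimony_score((("a", "c"), ("b", "d")), toy_seqs)
score_ad_bc = parsimony_score((("a", "d"), ("b", "c")), toy_seqs)
```

For these toy sequences the three possible groupings need 4, 5 and 6 changes respectively, so parsimony would prefer the tree that pairs a with b and c with d.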

Figure 4.1: An example of how changes are counted using the principle of parsimony. The paths from sequences at the leaves of the tree to the root involve 4 base changes. [Node labels in the figure: AAA, AAA, GGA, AAG, AAA, GGA, AGA.]

Parsimony is based on the concept of Occam's razor: solutions that make the fewest assumptions are likely to be the best. By looking only at base changes, parsimony claims to require no knowledge of the evolutionary model. Parameterized models are known to be flawed, so this non-parametric approach may seem quite reasonable. However, it seems that an underlying model for parsimony exists implicitly. The assumptions are that sites are independent (we cost each substitution separately) and that the probability of substitution is equal for all bases.

It has long been established that parsimony is inconsistent. The situation where this happens has been dubbed the Felsenstein zone. However, as mentioned previously, consistency is not always a necessary property for a reconstruction method. Another problem is that different trees can be equally parsimonious for a set of sequences.

4.3 Maximum Likelihood

Maximum likelihood reconstruction is, unsurprisingly, based on the likelihood principle. Given data D and a model, we calculate the likelihood of a hypothesis H as P(D | H), the probability of observing D if H is correct [38]. With respect to phylogenetics, the data D is sequence data and the model is a process of site substitution (see Chapter 3). H is a phylogenetic tree, which is defined by its topology and branch lengths. The aim is to find the tree that maximises the likelihood. We choose this tree as our best guess.
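To make P(D | H) concrete, here is a minimal sketch of a likelihood calculation for a given tree. It assumes independent sites and the Jukes-Cantor model of Chapter 3, with branch lengths measured in expected substitutions per site, and it sums over the unobserved states at internal nodes recursively from the leaves upward. The tree representation (an internal node is a list of `(child, branch_length)` pairs) and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability of base x becoming base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partial(node, column):
    """Probability of the data below node, conditional on each state at node."""
    if isinstance(node, str):
        # A leaf: the observed base has probability one, all others zero.
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial(child, column)
        for b in BASES:
            # Sum over the unobserved state at the child end of the branch.
            result[b] *= sum(jc_prob(b, c, t) * below[c] for c in BASES)
    return result


def log_likelihood(tree, seqs):
    """Log-likelihood of the alignment: a sum of per-site log-likelihoods."""
    n_sites = len(next(iter(seqs.values())))
    total = 0.0
    for i in range(n_sites):
        column = {name: s[i] for name, s in seqs.items()}
        root = partial(tree, column)
        # Jukes-Cantor has equal base frequencies of 0.25 at the root.
        total += math.log(sum(0.25 * root[b] for b in BASES))
    return total


# Hypothetical three-taxon tree: ((x1, x2), x3) with arbitrary branch lengths.
toy_tree = [([("x1", 0.1), ("x2", 0.1)], 0.05), ("x3", 0.2)]
toy_seqs = {"x1": "ACGT", "x2": "ACGA", "x3": "ACTA"}
toy_loglik = log_likelihood(toy_tree, toy_seqs)
```

Searching for the maximum likelihood tree would repeat this calculation over candidate topologies and branch lengths, which this sketch does not attempt.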

Figure 4.2: Example ML tree T. The $x_i$ are nodes representing sequences and the $t_i$ are branch lengths. [Labels in the figure: $x_1, \ldots, x_5$ and $t_1, \ldots, t_4$.]

Example Likelihood Calculation

The likelihood of the rooted tree in Figure 4.2 can be calculated as follows:

$$P(x_1, \ldots, x_5 \mid T, t_1, \ldots, t_4) = P(x_1 \mid x_4, t_1)\,P(x_2 \mid x_4, t_2)\,P(x_4 \mid x_5, t_1, t_2, t_4) \tag{4.3}$$

where $P(x_i \mid x_j, t) = L(x_i, x_j, t)$, the right hand side being the likelihood of $(x_i, x_j)$ forming a branch of length t in tree T. This can also be transformed into a recursive form.

Characteristics of maximum likelihood

Maximum likelihood estimation is consistent. It borrows its efficiency rating from the more general theory of maximum likelihood estimation. Unlike distance-based methods, it has been found to be robust to the presence of distant taxa [4].

The maximum likelihood tree is not necessarily unique [38], so this method may not be able to resolve completely which is the best tree. It is also extremely expensive computationally. The three-taxon tree shown in Figure 4.2 is a trivial example because there is only one possible unrooted tree topology for a three-leaf tree. If we are dealing with models that are not reversible, we have to consider every possible rooted tree. For n sequences this potentially involves evaluating the likelihood for all (2n − 3)!! rooted tree topologies and all possible assignments of sequences to the hidden internal nodes of the tree. This is a huge computational problem!

(Footnote 2: In fact it has been shown that maximum likelihood for phylogeny is NP-complete.)

Simplifications

The problem can be simplified by making our usual assumptions. If we assume that sites are iid, we need only consider the evolution of individual sites with respect to the tree. The probability of the tree with respect to the sequences is then just the product of the probabilities of the sites. This provides opportunities for parallelising computation. If we assume that the model of site substitution is reversible, then we can determine the probabilities of substitutions from the leaves up, in a postorder traversal. In fact, we only need to consider the unrooted tree. This is the pulley principle described by Felsenstein [15].

Search heuristics

This still leaves the problem of calculating the likelihood of every unrooted tree. To cut down the search space, heuristics need to be employed. Felsenstein proposed a branch and bound method where taxa are added incrementally to maximize the likelihood at each stage. The big disadvantage of this approach is that it may not find the optimal tree.

4.4 Is MP the same as ML?

The use of maximum parsimony over maximum likelihood (and vice versa) has been the source of much division in phylogenetics. However, as Holmes aptly puts it:

"The statistical perspective sees the differences between maximum likelihood, maximum parsimony... as much more a matter of degrees of freedom allowed in a model than a matter for religious wars"

The non-parametric nature of MP means that no parameters are pinned down. In effect it needs to optimize over infinite-dimensional criteria. A parametric model such as Jukes-Cantor is at the other end of the scale. Variable rate models lie somewhere in the middle.

This view is well supported by the work of Steel et al. [38], who have found conditions where the MP tree is the ML tree. This happens when there is no common mechanism assumed between sites or lineages.

However, the general evidence from simulations is that MP does not perform as well as ML. This is likely to be due to the implicitly restrained model involved in most parsimony

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Maximum Likelihood This presentation is based almost entirely on Peter G. Fosters - "The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed. http://www.bioinf.org/molsys/data/idiots.pdf

More information

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22 Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 24. Phylogeny methods, part 4 (Models of DNA and

More information

Lecture Notes: Markov chains

Lecture Notes: Markov chains Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

More information

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26 Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 4 (Models of DNA and

More information

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

More information

Lecture 3: Markov chains.

Lecture 3: Markov chains. 1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.

More information

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

More information

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018 Maximum Likelihood Tree Estimation Carrie Tribble IB 200 9 Feb 2018 Outline 1. Tree building process under maximum likelihood 2. Key differences between maximum likelihood and parsimony 3. Some fancy extras

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

Week 5: Distance methods, DNA and protein models

Week 5: Distance methods, DNA and protein models Week 5: Distance methods, DNA and protein models Genome 570 February, 2016 Week 5: Distance methods, DNA and protein models p.1/69 A tree and the expected distances it predicts E A 0.08 0.05 0.06 0.03

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Phylogeny: building the tree of life

Phylogeny: building the tree of life Phylogeny: building the tree of life Dr. Fayyaz ul Amir Afsar Minhas Department of Computer and Information Sciences Pakistan Institute of Engineering & Applied Sciences PO Nilore, Islamabad, Pakistan

More information

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral

More information

Lab 9: Maximum Likelihood and Modeltest

Lab 9: Maximum Likelihood and Modeltest Integrative Biology 200A University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS" Spring 2010 Updated by Nick Matzke Lab 9: Maximum Likelihood and Modeltest In this lab we re going to use PAUP*

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

More information

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein, 1985 1 I. Short

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
