Computational Issues in Phylogenetic Reconstruction: Analytic Maximum Likelihood Solutions, and Convex Recoloring


Computational Issues in Phylogenetic Reconstruction: Analytic Maximum Likelihood Solutions, and Convex Recoloring

Sagi Snir


Computational Issues in Phylogenetic Reconstruction: Analytic Maximum Likelihood Solutions, and Convex Recoloring

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Sagi Snir

Submitted to the Senate of the Technion - Israel Institute of Technology
Av 5764, Haifa, August 2004


This Research Thesis was done under the supervision of Professor Benny Chor in the Department of Computer Science.

I'm glad for the opportunity to thank the people who helped me. First, I want to thank Benny. Above all, Benny has been a friend. Our collaboration started with a request for advice sent to New Zealand, answered by: "Why won't you come work with me?" Despite the complex student-advisor relationship, I could always turn to Benny with a request for advice or help. Our meetings were spent mainly on sharing stories, adventures, jokes or advice. Benny taught me that cafes are the ideal locations to advance research. Benny's questions helped me focus on the important things. It sometimes took me a few weeks to find that the whole issue was encompassed in such a question. Benny has sent me around the world to his friends, trips that resulted in all these friendly, productive collaborations. I'm very grateful to Benny for all of these.

I also want to thank Shlomo very much. In our very intensive and enjoyable collaboration during the last year, every meeting resulted in a new algorithm, some key definition or a deep insight. Although we did not continue to a Ph.D in distributed algorithms, and pursued another fruitless direction, I feel very lucky for eventually succeeding in dragging him into the convex recoloring project. I'm sure that convex recoloring will be a big thing one day. However, without Shlomo, it would not have moved beyond a sketch on paper.

I want to thank my other collaborators: Mike Hendy, whose Hadamard Conjugation amazed me back when Benny sent me a bunch of papers from New Zealand. It has been thrilling to walk in Mike's footprints. I want to thank Steve Skiena for our efficient collaboration on the SB(H+RE) project. I want to thank Zohar Yakhini very much, a friend and a running companion, for our long discussions (which could be even better if Zohar talked while running) and the productive ideas. Thanks also to Eliezer Eskin, for getting me involved in the homology kernel project, for the nice conversations and advice, and for the wonderful time we spent all along these years. Thanks to Mike Fellows for the unforgettable visit to Newcastle. The exposure to his parameterized complexity, stories and personality, and to the beaches of Newcastle, was a big, dense adventure. Thanks also to Amit Khetan for elevating the comb work to the level of a professional algebraic contribution.

I want to thank Arie Freund and Dror Rawitz very much. Their friendship has been a big asset. Many thanks to Danny Geiger for his support and wise advice, to Ron Pinter for his informal enrichment and advice, to Ron Shamir and Mike Steel for enlightening ideas, to Esti for her very pleasant company during these four years, and to the friends on the sixth floor: Irad, Reuven, Orna, Hadas, Ami, Eldar, Eli, Shaul and Seffi. To Yahalomit and Hertzel (ozer) for facilitating getting to the Technion and Tel Aviv Univ. (resp.).

Above all, I want to thank my devoted partner Rachel, who supported me so much during all these travels and hard work, and took all of the family burden on her strong, wide shoulders.

The generous financial help of the Technion and the Center for Complexity Science Scholarship is gratefully acknowledged.

Contents

Abstract
Notation
1 Introduction
    Brief Biological Background
    Phylogenetic Trees
    Analytical Maximum Likelihood Phylogenetic Trees
    Convex Recoloring of Trees and Strings
2 Substitution Models and the Hadamard Conjugation
    Notations and the Hadamard Matrix
    Neyman Two States Substitution Models
    Kimura's 3-Substitution Model
    Hadamard Conjugation of K3ST on a Phylogenetic Tree
3 Maximum Likelihood on Evolutionary Trees
    Introduction
    Maximum Likelihood on Four Taxa Trees
    General ML System
    Simplifying Identities
    Solving the Molecular Clock Fork
    Additional Results
    Solving the Molecular Clock Comb
    Overview
    Obtaining the Solution
    Solving the System
    Unique ML Molecular Clock Comb Proof
    Solving the Jukes Cantor Triplets
    Preliminaries
    Obtaining the Maximum Likelihood Solution
    Results on Genomic Sequences
    Consistency of a Tree Reconstruction Method
    Concluding Remarks
4 Minimum Convex Recoloring of Phylogenetic Trees
    Introduction
    Preliminaries
    NP-Hardness Results
    Minimal Convex Recoloring of Strings is NP-Hard
    NP-Hardness of Minimal Convex Recoloring of Leaves
    NP-Hardness of Minimum Block-Recolorings
    Exact Algorithms
    Non-uniform Convex String Recoloring
    Enhanced Algorithm
    Non-uniform Optimal Convex Recoloring of Binary Trees
    Fixed Parameter Tractable Recoloring Algorithms
    FPT Recoloring Algorithm for Bounded Degree Trees
    Approximation Algorithms for Convex Recoloring
    Lower Bounds via Penalties
    A 2-Approximation Algorithm for a String
    A 3-Approximation Algorithm for a Tree
    Discussion and Future Work
References
Hebrew Abstract

List of Figures

1.1 The fork and comb two rooted topologies on four taxa
A rooted tree over 3 species
The three 4 taxa unrooted trees
(a) Kimura's 3 substitution model (K3ST). (b) Substitution types $t_\alpha = t_{01}$, $t_\beta = t_{10}$, $t_\gamma = t_{11}$ and $t_\epsilon = t_{00}$
(a): Example edge length spectra for the tree $T_{13}$. (b): Q = Q T
The fork and comb two rooted topologies on four taxa
A rooted tree over 3 species
Rooted layout of the (12)(34)-molecular clock-fork (left) and its unrooted version (right)
In the (12)(34) molecular clock fork, $q_1 = q_2$ and $q_3 = q_4$
A general tree with two sister taxa i and j s.t. $q_i = q_j$
r ε, log(r), $q_{12}$ as function of c
r as a function of ε
A triplet tree under the molecular clock satisfies $q_1 = q_2$
The primates tree, taken from The Tree of Life project
A schematic view of the colored string corresponding to F. Informative segments appear white (in the figure) where junk segments are longer and have distinct colors
A clause segment. The literals are $l_1$, $l_2$ and $l_3$, and the clause of size 3A consists of A repetitions of the corresponding triplet. Each block is a single vertex
S xj, the segment of the literal $x_j$. m + 1 $d_j$-blocks are interleaved by the m blocks c i,xj, i = 1, ..., m
A recoloring of segments S xj and S xj corresponding to a satisfying assignment
4.6 A caterpillar of length
A reduction from a fully colored string to a leaf colored caterpillar. Two leaves in the two ends are colored with two new colors. All other leaves are colored with the same color as the corresponding vertex in the string
A convex recoloring of the input caterpillar. All blue (triangle) and green (circles) blocks were recolored, so a single vertex of each color can be retained without violating convexity
Removing the mutation at the left edge implies the removal of the one at the right
The input string (S, C) and the corresponding informative segment in (S z, C z)
The counter-weight segment in S z
Changing u's color to red (circle) reduces the number of violations by 1 and the number of violations by
The white good block is only partially overwritten in the optimal coloring
The red (circles) core defined by the three leaf blocks in the core. Filled shapes (vertices) are in the red core
The three right hand vertices are in the signature. The signature is invalid since the right red (second from right) is totally overwritten by the black core
A block in the signature (left side) that is partially overwritten by two other cores (partially filled vertices), is maximally expanded (right side)
The maximal sub block in a good block for a signature, is expanded
C is a convex recoloring for C which defines the following penalties: p green (C) = 1, p red (C) = 2, p blue (C) =
The upper part of the figure shows the optimal blocks on the string and the lower part shows the coloring returned by the algorithm
Case 2: a vertex v is contained in 3 different containers
Case 3: Not case 1 nor 2. T is the subtree rooted at rd0 and ˆT = T \ T
Case 3a: No vertices of ˆT are colored by d
Case 3b: r d0 T d0 T d
REDUCE of case 3b: T is replaced with T0 where w1(v 0) = C high C min = 2 and w1(r d0) = C medium C min =

List of Tables

2.1 (a): Four aligned sequences with sixteen sites. (b): The corresponding observed sequence spectrum
The observed sequence spectrum of NK cell receptor D gene of human, mouse and rat


Abstract

This thesis is in the area of computational biology known as molecular phylogenetics. Specifically, we focus on two issues of phylogenetics: analytic maximum likelihood (ML) solutions and minimum convex recoloring for phylogenetic trees.

The first part is dedicated to the ML issue. Up to now, the only known general analytical ML solutions were for rooted triplets on the simplest evolutionary model: the Neyman two-state model under a molecular clock. We present here three ML solutions for three different topologies on two different evolutionary models. Each extends the existing knowledge by considering either a larger topology or a more general evolutionary model. The molecular clock fork and the comb extend the state of the art from three species to four while retaining the evolutionary model. The JC triplet retains the topology and the molecular clock assumption while extending the substitution model to a more realistic, four-state substitution model: the JC model. A by-product of this last work is a first formal proof of the Hadamard Conjugation for the Kimura substitution models. Even though this technique has existed for ten years, a formal proof of it was never published.

The second part of this thesis deals with the theme of Minimum Convex Recoloring for Phylogenetic Trees. A coloring of a tree is convex if the vertices that pertain to any color induce a connected subtree; a partial coloring (which assigns colors to some of the vertices) is convex if it can be completed to a convex (total) coloring. Convex coloring of trees arises in areas such as phylogenetics, linguistics, etc. For example, a perfect phylogenetic tree is one in which the states of each character induce a convex coloring of the tree. Research on perfect phylogeny usually focuses on finding a tree such that a few predetermined partial colorings of its vertices (i.e. each character is a different partial coloring on the leaves) are convex.

When a coloring of a tree is not convex, it is natural to ask how far it is from a convex one. That is, what is the minimal number of color changes at the vertices needed to make the coloring convex? This can be viewed as minimizing the number of exceptional vertices with respect to a closest convex coloring. We also study a similar measure, which aims at minimizing the number of exceptional edges with respect to a closest convex coloring. We show that finding each of these distances is NP-hard even for paths (or strings). We then focus on the first measure and generalize it to weighted trees, and then to non-uniform coloring costs (that is, each change between two colors may carry a different cost, in contrast to the uniform model, where a change between any two colors has unit cost).

On the positive side, we present a few algorithms for convex recoloring of strings and bounded degree trees. First, we present algorithms for optimal convex recolorings of strings and trees with non-uniform coloring costs, which, for any fixed number of colors, are linear in the input size. Then we present algorithms for strings and bounded degree trees that run in time exponential in the number of changes required and linear in the input size. Finally, we present polynomial time approximation algorithms for convex recoloring of strings and trees.

Notation

(12)(34)-fork: A rooted tree with two sister taxa at both sides of the root
((12)3)4-comb: A rooted tree with three taxa at one side, where 1 and 2 are siblings
$H_n$: Hadamard matrix of order n
$q$: The edge length spectrum
$s$: The expected sequence spectrum
$X$: The set of species
$e_\alpha$: The edge inducing the split $(\alpha, X \setminus \alpha)$
$\Pi_\alpha$: The path set induced by $\alpha$


Chapter 1

Introduction

This thesis is in an interdisciplinary area called computational biology, which is motivated by computational questions arising in molecular biology. This area offers many challenging algorithmic, combinatorial, probabilistic and optimization problems. The goal of this research is to devise algorithms, analyze complexity, solve open problems and give analytic results in the following specific topics of computational biology:

- Analytic solution of maximum likelihood evolutionary trees.
- Convex recoloring of phylogenetic trees.

The remainder of the introduction is devoted to a brief molecular biological background, and an overview of the rest of the dissertation.

1.1 Brief Biological Background

The complete set of instructions encoding an organism is called its genome. It contains the master blueprint for all cellular structures and biochemical processes for the lifetime of the cell or organism. The human genome physically consists of tightly coiled threads of deoxyribonucleic acid (DNA) and associated protein molecules, organized in 23 pairs of distinct, physically separate microscopic units called chromosomes. The nucleus of most human cells contains pairs of chromosomes, where in each pair, one chromosome originates from each parent. Each cell has 23 pairs of chromosomes: 22 pairs of regular autosomal chromosomes and a pair of sex chromosomes, X and Y (females carry two X chromosomes, and males carry one X and one Y chromosome). Chromosomes can be seen under a light microscope and, when stained with certain dyes, reveal a pattern of light and dark bands. Differences in size and banding pattern allow the 23 chromosomes to be distinguished from each other, an analysis called a karyotype. A few types of major chromosomal abnormalities, including missing or extra copies of a chromosome or gross breaks and rejoinings (translocations), can be detected by microscopic examination; Down's syndrome, in which an individual's cells contain a third copy of chromosome 21, can be diagnosed by karyotype analysis. Most changes in DNA, however, are too subtle to be detected by such a technique and require finer molecular analysis. These subtle DNA abnormalities (mutations) are responsible for many inherited diseases such as cystic fibrosis and sickle cell anemia, or may predispose an individual to cancer, major psychiatric illnesses, and other complex diseases.

A DNA molecule is a polymer consisting of two interwound helical strands. Each strand is a sequence of nucleotides, or bases, drawn from the set {A, C, T, G}, where A stands for Adenine, C for Cytosine, T for Thymine and G for Guanine. The two strands are complementary in the sense that each A on one strand is bound to its complementary nucleotide T on the other, and each C is bound to a G. The particular order of these bases is called the DNA sequence; the sequence specifies the exact genetic instructions required to create a particular organism with its own unique traits. DNA molecules can be very long. For example, the size of an average human chromosome is about 150 million base pairs, while the entire human genome contains about 3 billion base pairs.

The genes, which are the basic physical and functional units of heredity, are arranged linearly along the chromosomes. A gene is a specific sequence of nucleotide bases, whose sequence carries the information required for constructing one or more proteins. Proteins provide the structural components of cells and tissues as well as enzymes for essential biochemical reactions. The human genome, for example, is estimated to contain approximately 30,000 genes.

1.2 Phylogenetic Trees

Given a set of taxa (a group of related biological species), the goal of phylogenetic reconstruction is to build a tree which best represents the course of evolution for this set over time.
The leaves of the tree are labelled with the given, extant taxa. Internal nodes correspond to hypothesized, extinct taxa. Because events of taxon divergence are assumed to be rare, the sought-after tree is bifurcating (or binary), with internal nodes of degree three. (In case of ambiguous data one might have to resort to multifurcating trees, which are less informative.) In the early days, morphological features were mostly used to study evolution. Today, molecular data are the primary basis for phylogenetic analysis of evolution, but other sources of information (for example palaeontological, anatomical, and morphological) are also in use. Still, our exposition will concentrate on molecular sequence data.

The first step in constructing a tree is to collect from an up-to-date database either DNA (typically genes), RNA, or amino acid sequences for all taxa under study. Homologous sequences (detected by similarities, or low edit distances) from different taxa are then grouped together. Homologous sequences for different taxa often have the same functionality (e.g. insulin, hemoglobin, etc.) and are assumed to be descendants of a common ancestral sequence. Their degree of similarity gives an indication of the time when two taxa diverged. Since the mutational process is assumed to be probabilistic in nature and to operate locally, we expect that longer periods of time since divergence usually imply more accumulated mutations. Still, different proteins may evolve at different rates. Combining this with the stochastic nature of the process, single proteins, viewed separately, may give conflicting indications as to the history of evolution. To overcome this random noise effect it is thus advisable to employ longer sequences, obtained by concatenating many sequences together.

Phylogeny reconstruction methods are broadly divided into character-based and distance-based methods. Distance-based methods start by computing evolutionary distances between pairs of taxa. Then a tree with weighted edges whose pairwise tree distances approximate the evolutionary distances is sought, typically by some version of the neighbor joining clustering paradigm [47]. In contrast, character-based methods work directly on character data. The best known and most widely used character-based methods are maximum parsimony [20] and maximum likelihood [16]. Maximum parsimony (MP) is a non-parametric combinatorial method, while maximum likelihood (ML) is a parametric statistical method. Despite its popularity, MP has the drawback that for certain ranges of input parameters it is inconsistent. This means that the sequence data lead to the construction of an incorrect tree, even if the number of sample points tends to infinity [15].

1.3 Analytical Maximum Likelihood Phylogenetic Trees

Maximum likelihood (ML) on molecular sequence data uses a model of evolution, which is usually a family of trees with n taxa at their leaves, and a substitution model. The parameters of the substitution model describe probabilities of changes in character states (e.g. point mutations in DNA nucleotides).
Given a set of n observed sequences, the goal is to find the best explanation for the data within the model space. In our context, this usually means a weighted tree (where the weights are parameters of the substitution model for each edge) that maximizes the likelihood (the conditional probability, under the model, of generating the observed sequences). One of the attractive properties of maximum likelihood is that two competing explanations for the same data can be evaluated not only qualitatively but also quantitatively, by comparing their log likelihood values (or lod score). The current application of maximum likelihood for reconstructing evolution is due to Felsenstein [16], and has gained wide acceptance [7, 24, 46, 56]. The method is computationally intensive, but for tractable cases it is the method of choice.

Algorithmically, the likelihood is maximized separately for each tree in the family (pruning in the tree space is sometimes possible). The weighted tree (or trees) with maximum value(s) is then reported. There is no known analytical solution or direct algorithm that optimizes the edge parameters for a given tree. Existing algorithms [18, 55] use an iterative, hill climbing approach. For hill climbing to be guaranteed to find the maximum, there must be a single local and global maximum in the parameter space. Fukami and Tateno [21], and subsequently Tillier [58], have argued that for each tree, the ML point is indeed unique. However, Steel [52] showed that their proof was erroneous, and constructed a surprisingly simple counterexample (utilizing sequences with two sites and just four taxa). The example shown by Steel used highly pathological input data. Chor et al. [10] showed that multiple maxima can also occur on reasonable, non-pathological input data.

The molecular clock hypothesis assumes a constant rate of evolution across all lineages of the phylogenetic tree. This implies that a tree satisfying the molecular clock assumption can be rooted such that the length of the path from the root to each of the leaves is the same.

ML can be divided into two related problems: Big ML, which seeks the weighted tree that maximizes the likelihood of the data over the entire tree space, and Small ML, which operates on a given tree and seeks optimal edge weights (i.e. weights that maximize the likelihood on that tree). When the number of taxa is small, the big problem can be solved by solving the small problem for every tree in the tree space. The current state of analytical solutions for ML applies only to the small problem. However, since the number of species handled analytically is also small, this is not a serious limitation. This focuses the challenge on analytical small ML.

The first to consider analytical solutions for small ML with simple substitution models was Yang (2000), who worked on three taxa with the Neyman evolutionary model of symmetric two-state characters under a molecular clock [59]. Yang described his work as the simplest phylogeny estimation problem, but added that it has many of the conceptual and statistical complexities involved in phylogenetic estimation.
The solution of Yang was generalized and its derivation was simplified by Chor, Hendy and Penny [11] using the Hadamard Conjugation of Hendy, Penny, and Steel (1994) [28, 29], together with convexity arguments.

The first part of this thesis deals with analytical ML solutions. Chapter 2 introduces the notations used in this part of the thesis, describes the two substitution models we work with, and explains the basic mechanism behind all these solutions: the Hadamard conjugation. Moreover, when working under the more complex substitution model of Kimura, the Hadamard conjugation becomes substantially more involved and requires comprehensive understanding. A first correctness proof of the Hadamard conjugation for the Kimura models is provided in that chapter as well.

Chapter 3 presents three different extensions to the analytical small ML work on rooted triplets of [59, 11]. In each of these extensions, either the number of species is increased while the model is retained, or the model is extended and the number of species is retained. In the first two works, the number of taxa is increased from three to four while the underlying model of substitution is retained. There are two families of topologies on four taxa under the molecular clock model: topologies with two taxa in each subtree of the root, which we call fork topologies, and topologies where one subtree of the root has three taxa, which we call comb topologies. Recall that under a molecular clock, the distance from each of the four leaves to the root is the same (Figure 1.1).

[Figure 1.1: The fork and comb two rooted topologies on four taxa. Panels: (((1 2) 3) 4)-MC-comb and (1 2)(3 4)-MC-fork.]

We start with the simpler topology, the fork, in Section 3.3 and later extend it to the comb topology in Section 3.4. Section 3.5 investigates the same topology handled by [59, 11]: the molecular clock triplet (see Figure 1.2). However, the substitution model is extended from the simple two-state model of Neyman to the four-state model of JC.

[Figure 1.2: A rooted tree over 3 species.]

1.4 Convex Recoloring of Trees and Strings

The second part of this thesis is in Chapter 4, and deals with combinatorial character-based approaches in phylogenetics. In biology, characters describe attributes of the species under consideration and are the data that biologists use to reconstruct phylogenetic trees. Characters can be morphological (for example, wings versus no wings), biochemical, physiological, behavioral, embryological, or molecular (for example, the nucleotide at a particular DNA sequence position, or the order of certain genes on a chromosome). In a rooted phylogenetic tree, one can view the character states along a path from the root to some species as evolving to that species' state.

A natural biological goal is that the reconstructed phylogeny has the property that each of the characters could have evolved without reverse or convergent transitions. In a reverse transition, some species regains a character state of some old ancestor whilst its direct ancestor has lost this state. A convergent transition occurs if two species possess the same character state, while their least common ancestor possesses a different state. The concept behind this constraint is that of innovation. That is, each time the character state changes, it acquires a new state. The innovation assumption excludes reverse and convergent transitions, and a character satisfying it is alternatively called a homoplasy free character [50].

A character in a phylogenetic tree can be viewed as a coloring of the tree vertices, where each color represents one of the character's states. A character is homoplasy free iff the corresponding coloring is convex, that is, the set of vertices having the same color (state) induces a subtree. Thus, the above discussion implies that in a phylogenetic tree, each character is likely to be convex. This makes convexity a fundamental property in the context of phylogenetic trees.

In this work we introduce a natural criterion, suggested by the concept of Hamming distance between vectors: the minimal number of species whose states should be changed to make the given character convex. This measure is motivated by a scenario in which the evolutionary tree of a set of species is already known (e.g. the primates tree shown in Figure 4.1). Given a character relating to a subset of all the species (extant and extinct) in the tree, we want to know how much this character agrees with the tree, that is, how far this character is from perfect phylogeny on the given tree.
Indeed, the evolution of a character which can be made convex by removing a small number of exceptions can be explained by seeking biological explanations for a few exceptional phenomena, as was done for the two above mentioned cases (see e.g. [40]). If however a very large number of state changes is needed to make the character convex, a biological explanation becomes less probable, and the reliability of the given phylogeny as a correct description of the evolution of this character diminishes accordingly. Another measure that we discuss is the minimal number of state changes that should be removed to make the character convex on the tree, or the minimum number of changes of both types which are needed to achieve this purpose.

We note that our problem for partially colored trees bears some similarity to the small parsimony problem. In both problems a tree with a (partial) assignment of states to the vertices is given. The small parsimony problem finds a coloring with the minimum number of violations of the perfect phylogeny property, and the convex recoloring problem finds the minimum number of color changes needed to achieve perfect phylogeny. It is therefore a bit surprising that while the weighted version of the maximum parsimony problem has an efficient optimal solution [48], the unweighted version of minimum convex recoloring is NP-hard, as we show here.

We also study two more general cost functions, which enable more flexibility in measuring the distance of a given coloring from a convex one. The first allows weights on the vertices (the weight of a vertex reflects the certainty we have in its state/color). The second, most general criterion also allows a non-uniform cost function for color changes (details are in Section 4.2).
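To make the convexity property concrete, the following is a minimal Python sketch that tests whether a total coloring of a tree is convex, i.e. whether every color class induces a connected subtree. The function name `is_convex_coloring` and the input representation (an edge list plus a vertex-to-color dictionary) are illustrative assumptions, not notation from this thesis; partial colorings and the recoloring cost are not handled here.

```python
from collections import defaultdict


def is_convex_coloring(edges, color):
    """Return True if each color class induces a connected subtree.

    edges: list of (u, v) pairs describing a tree (assumed input format).
    color: dict mapping every vertex to its color (a total coloring).
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    seen = set()             # vertices already assigned to a monochromatic component
    colors_with_component = set()
    for start in color:
        if start in seen:
            continue
        c = color[start]
        if c in colors_with_component:
            # A second connected component of the same color: not convex.
            return False
        colors_with_component.add(c)
        # Depth-first search restricted to vertices of color c.
        stack = [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen and color[w] == c:
                    seen.add(w)
                    stack.append(w)
    return True


if __name__ == "__main__":
    path = [(1, 2), (2, 3), (3, 4)]
    print(is_convex_coloring(path, {1: "a", 2: "a", 3: "b", 4: "b"}))  # True
    print(is_convex_coloring(path, {1: "a", 2: "b", 3: "a", 4: "b"}))  # False
```

The check runs in time linear in the size of the tree, since each vertex is visited once and each edge is examined at most twice.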

In studying our problem, we first focus on a very simple form of a tree, a string (or a path), which seems to be interesting in its own right. Then we extend the results (where possible) to a tree. Our negative results show that the (unweighted) versions of the above problems are hard even for a very simple tree topology, a string, and for the case where character states are given only at the leaves (so that changes on extant species are not counted); we also prove that finding the minimum number of mutation removals needed to obtain convexity, in a sense to be defined, is NP-hard.

On the positive side, we present dynamic programming algorithms for strings and bounded degree trees. The first algorithm is designed to solve the non-uniform versions of the problem. The algorithm runs in linear time for any fixed number of colors (i.e. the time is linear in the input size and exponential in the number of colors). Then we show that for strings and trees of bounded degree, the (unweighted version of the) problem can be solved by a fixed parameter tractable algorithm. The proof of this result is based on identifying signatures of recolorings, so that the number of possible signatures is bounded from above by the number of recolored vertices in an optimal solution, and then presenting a polynomial time algorithm for a minimal convex recoloring with a given signature. Finally, we present polynomial time algorithms for 2-approximation for strings and 3-approximation for trees, for the weighted versions of the problem.
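As a small illustration of the unweighted string version of the problem, the sketch below computes the minimum number of positions that must be recolored to make a string convex (each color occupying one contiguous block). It is a brute-force enumeration for tiny inputs only, not the dynamic programming, fixed parameter tractable, or approximation algorithms of Chapter 4. The function names are hypothetical, and the sketch assumes that candidate recolorings may be restricted to the colors already present in the input, since a block of a new color can be merged into a neighboring block without raising the cost.

```python
from itertools import product


def is_convex_string(s):
    """True if every color in the sequence s occupies one contiguous block."""
    closed = set()   # colors whose block has already started
    prev = None
    for c in s:
        if c != prev:
            if c in closed:
                return False  # color c reappears after its block ended
            closed.add(c)
            prev = c
    return True


def min_convex_recoloring_cost(s):
    """Brute force: minimum Hamming distance from s to a convex coloring.

    Exponential in len(s); intended only to illustrate the distance measure.
    """
    colors = sorted(set(s))
    best = len(s)
    for candidate in product(colors, repeat=len(s)):
        if is_convex_string(candidate):
            cost = sum(a != b for a, b in zip(s, candidate))
            best = min(best, cost)
    return best


if __name__ == "__main__":
    for example in ["aabb", "abab", "abcab"]:
        print(example, min_convex_recoloring_cost(example))
```

The enumeration visits $k^n$ candidates for a string of length $n$ over $k$ colors, which is why exact algorithms with better dependence on the input size are of interest.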


25 Chapter 2 Substitution Models and the Hadamard Conjugation A central tool in our analytical solutions is the Hadamard conjugation [28, 29]. It is applicable to group like models of substitution. In this chapter we define these applicable models of substitution. Hadamard conjugation is an invertible transformation that links the probabilities of site substitutions on edges of an evolutionary tree, T, to the probabilities of obtaining each possible combination of characters. The Hadamard conjugation is applicable to a number of site substitution models: Neymann 2 state model, Jukes Cantor model [33], and Kimura 2ST model and 3ST model [39] (the last three are applicable to normal, four states DNA). For these models, the transformation yields a powerful tool which greatly simplifies and unifies the analysis of phylogenetic data, and in particular the analytical approach to ML. We begin by introducing notations that are used by all analytic solutions. We also define the specific Hadamard matrix we use. Next, we introduce the simplest model, the Neymann 2 state model [42], along with the corresponding Hadamard conjugation. Subsequently, we define the family of four states substitution models of Kimura. Although the Hadamard conjugation for the Kimura models is seemingly identical to the Neymann model, the underlying mathematics is substantially different. Moreover, the correctness of the Hadamard conjugation for the Kimura models was never formally proved. In the last section of this chapter, we explain and prove for the first time the appropriate Hadamard conjugation for this family of models. 2.1 Notations and the Hadamard Matrix We start with notations that are common to both models and will be useful for the rest of the work - tree labelling. Let X = {1, 2,, n} be a set of n taxa represented by the sequences σ 1, σ 2, σ n (these sequences are equi-length, assumed to be homologous and aligned). 
We select a reference taxon, $n$, and let $X^* = X \setminus \{n\}$ be the non-reference taxa. Consider now an evolutionary tree, $T$, on the taxa set $X$. The leaves of $T$ are labelled by the elements of $X$. For $i, j \in X$ we define the path $\Pi_{i,j}$ to be the set of edges in $T$ connecting leaf $i$ to leaf $j$. Each edge $e$ of $T$ defines a unique split $\{A, X \setminus A\}$ of the taxa set $X$, induced by deleting $e$ from $T$. We index $e$ as $e_A$, where $A = \{i \mid e \in \Pi_{i,n}\} \subseteq X^*$ is the set of taxa separated from $n$ by $e$.

Figure 2.1: The three 4-taxa unrooted trees ($T_{12}$, $T_{13}$ and $T_{23}$).

Figure 2.1 shows the three unrooted trees on the taxa set $X = \{1, 2, 3, 4\}$. The subscript $A$ of $e_A$ ($A \subseteq X^* = \{1, 2, 3\}$) is the set of taxa isolated from the reference taxon 4 on the deletion of that edge. The subscripts on the edges are abbreviated so that, for example, $e_{23}$ stands for $e_{\{2,3\}}$. The trees are indexed as $T_A$, where $A$ is the index of the internal edge.

The Hadamard matrix is used by both types of the Hadamard conjugation and is defined below.

Definition. A Hadamard matrix of order $l$ is an $l \times l$ matrix $A$ with $\pm 1$ entries such that $A^t A = l I_l$.

We will use a special family of Hadamard matrices, called Sylvester matrices in MacWilliams and Sloane (1977, p. 45), defined inductively for $n \ge 0$ by $H_0 = [1]$ and
$$H_{n+1} = \begin{bmatrix} H_n & H_n \\ H_n & -H_n \end{bmatrix}.$$
For example,
$$H_1 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \quad\text{and}\quad H_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}.$$

We encode a subset of $\{1, \ldots, n\}$ in an $n$-long binary number where the $i$th least significant bit ($i = 1, \ldots, n$) is 1 if $i$ is in the subset, and 0 otherwise. Using this representation, it is convenient to index the rows and columns of $H_n$ by subsets of $\{1, \ldots, n\}$ in a lexicographically increasing order (i.e. $\emptyset, \{1\}, \{2\}, \{1, 2\}, \ldots$). Denote by $h_{\alpha,\gamma}$ the general element of $H_n$, that is, the element at the $(\alpha, \gamma)$ entry of $H_n$.

Observation. $h_{\alpha,\gamma} = (-1)^{|\alpha \cap \gamma|}$. This implies that $H_n$ is symmetric, namely $H_n^t = H_n$, and thus by the definition of Hadamard matrices $H_n^{-1} = \frac{1}{2^n} H_n$.
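The inductive definition and the subset indexing can be checked on a small case. The following is a minimal sketch in plain Python; the function names (`sylvester`, `subset_index`, `closed_form_entry`) are illustrative choices and not part of the original text.

```python
def sylvester(n):
    """Return H_n built by the inductive rule H_{k+1} = [[H_k, H_k], [H_k, -H_k]]."""
    H = [[1]]
    for _ in range(n):
        top = [row + row for row in H]
        bottom = [row + [-x for x in row] for row in H]
        H = top + bottom
    return H


def subset_index(subset):
    """Encode a subset of {1, ..., n} as an integer: element i sets bit i - 1."""
    return sum(1 << (i - 1) for i in subset)


def closed_form_entry(a, b):
    """(-1)^{|alpha ∩ gamma|} for subsets encoded as the integers a and b."""
    return -1 if bin(a & b).count("1") % 2 else 1


H3 = sylvester(3)
size = len(H3)

# Entry-by-entry agreement between the inductive construction and the closed form.
assert all(H3[a][b] == closed_form_entry(a, b)
           for a in range(size) for b in range(size))

# H_n^t H_n = 2^n I, so the inverse of H_n is H_n / 2^n.
gram = [[sum(H3[k][i] * H3[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)]
assert gram == [[size if i == j else 0 for j in range(size)] for i in range(size)]

# {1, 3} and {3} share exactly one element, so the entry is -1.
print(H3[subset_index({1, 3})][subset_index({3})])  # -1
```

With this encoding the integer order 0, 1, 2, 3, ... coincides with the subset order $\emptyset, \{1\}, \{2\}, \{1,2\}, \ldots$ used above, which is why ordinary list indices can stand in for subsets.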

Observation. $h_{A,B} = h_{B,A}$, and
$$h_{A,B} = h_{A,C}\, h_{A,D}, \tag{2.1}$$
where $C, D$ form a partition of $B$ (i.e. $C \cup D = B$, $C \cap D = \emptyset$).

2.2 Neyman Two States Substitution Models

In the Neyman 2-state model [42], each character at a species admits one of two states, without loss of generality $\{x, y\}$. Hence, a character evolving along an evolutionary tree $T$ with $n$ leaves induces a split pattern between the leaves admitting the state $x$ and those admitting the state $y$. To every edge $e \in E(T)$ we assign a length $q_e$, defined as the expected number of substitutions (changes) per site along that edge. Given the edge lengths of $T$, $q = [q_e]_{e \in E(T)}$ ($0 \le q_e < \infty$), the probability of generating an $\alpha$-split pattern ($\alpha \subseteq \{1, \ldots, n-1\}$) is well defined. Denote this probability by $s_\alpha = \Pr(\alpha\text{-split} \mid T, q)$. Using the same indexing scheme as above, we define the expected sequence spectrum (expected spec) $s = [s_\alpha]_{\alpha \subseteq \{1, \ldots, n-1\}}$.

The edge length spectrum (edges spec) of a tree $T$ with $n$ leaves is the $2^{n-1}$-dimensional vector $q = [q_\alpha]_{\alpha \subseteq \{1, \ldots, n-1\}}$, defined for any subset $\alpha \subseteq \{1, \ldots, n-1\}$ by
$$q_\alpha = \begin{cases} q_e & \text{if } e \in E(T) \text{ induces the split } \alpha, \\ -\sum_{e \in E(T)} q_e & \text{if } \alpha = \emptyset, \\ 0 & \text{otherwise.} \end{cases}$$

The Hadamard conjugation specifies a relation between the expected sequence spectrum $s$ and the edge length spectrum $q$ of the tree as follows.

Proposition (Hendy and Penny 1993). Let $T$ be a phylogenetic tree on $n$ leaves with finite edge lengths ($q_e < \infty$ for all $e \in E(T)$). Assume that sites mutate according to a symmetric substitution model, with equal rates across sites. Let $s$ be the expected sequence spectrum. Then
$$s = s(q) = H_{n-1}^{-1} \exp(H_{n-1}\, q),$$
where the exponentiation function $\exp(x) = e^x$ is applied element-wise to the vector $\rho = H_{n-1}\, q$. That is, for $\alpha \subseteq \{1, \ldots, n-1\}$,
$$s_\alpha = 2^{-(n-1)} \sum_{\gamma} h_{\alpha,\gamma} \exp\Big(\sum_{\delta} h_{\gamma,\delta}\, q_\delta\Big).$$

The Hadamard conjugation for the two-state symmetric model was presented and proved by Hendy and Penny in [28] (some earlier versions of spectral analysis also appeared in [26, 27]).
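As a sanity check of the proposition, the sketch below evaluates $s = H^{-1}\exp(Hq)$ for the star tree on three taxa (reference taxon 3, edges $e_1$, $e_2$, $e_{12}$) and compares it with a direct sum over the state of the internal vertex. The edge lengths are hypothetical values chosen for illustration, the function names are not from the original text, and the brute-force part assumes the usual two-state symmetric relation $p = (1 - e^{-2q})/2$ between an edge length and its change probability.

```python
import itertools
import math


def sylvester(n):
    """Sylvester matrix H_n as a list of rows."""
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def hadamard_apply(H, v):
    """Matrix-vector product H v."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def expected_spectrum(q):
    """s = H^{-1} exp(H q), using H^{-1} = H / len(q)."""
    H = sylvester(len(q).bit_length() - 1)
    rho = hadamard_apply(H, q)
    return [x / len(q) for x in hadamard_apply(H, [math.exp(r) for r in rho])]


def brute_force_star(q1, q2, q3):
    """Split probabilities on the 3-leaf star tree, summing over the centre's state."""
    def change(q):
        # Assumed two-state symmetric change probability on an edge of length q.
        return (1 - math.exp(-2 * q)) / 2

    s = [0.0] * 4
    # c, x1, x2: states of the centre and of leaves 1, 2, relative to leaf 3.
    for c, x1, x2 in itertools.product((0, 1), repeat=3):
        pr = change(q3) if c else 1 - change(q3)
        pr *= change(q1) if x1 != c else 1 - change(q1)
        pr *= change(q2) if x2 != c else 1 - change(q2)
        s[x1 + 2 * x2] += pr  # index of the subset of leaves differing from leaf 3
    return s


q1, q2, q12 = 0.1, 0.2, 0.3          # hypothetical edge lengths
q = [-(q1 + q2 + q12), q1, q2, q12]  # indexed by the subsets ∅, {1}, {2}, {1,2}
s = expected_spectrum(q)

assert abs(sum(s) - 1.0) < 1e-12
assert all(abs(a - b) < 1e-12 for a, b in zip(s, brute_force_star(q1, q2, q12)))
```

The edge $e_{12}$ is the pendant edge of the reference leaf, since removing it separates $\{1, 2\}$ from taxon 3; this is why its length sits at index 3 of `q`.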

Definition. A vector $\hat{s} \in \mathbb{R}^{2^{n-1}}$ satisfying $\sum_{\alpha \subseteq \{1, \ldots, n-1\}} \hat{s}_\alpha = 1$ and $H\hat{s} > 0$ is called conservative.

For conservative data $\hat{s}$, the Hadamard conjugation is invertible, yielding
$$\gamma = \gamma(\hat{s}) = H_{n-1}^{-1} \ln(H\hat{s}),$$
where the $\ln$ function is applied element-wise to the vector $H\hat{s}$. We note that $\gamma$ is not necessarily the edge length spectrum of any tree. On the other hand, the expected sequence spectrum of any tree $T$ is always conservative.

2.3 Kimura's 3-substitution Model

In 1981 Kimura [38] introduced his 3-substitution-type model of RNA nucleotide substitution (K3ST) (see Figure 2.2(a)), where he identified 3 classes of substitution: the transitions (with rate $\alpha$), the type I transversions (with rate $\beta$) and the type II transversions (with rate $\gamma$). Setting $\beta = \gamma$ gives his 2-substitution-type (2ST) model [37], and setting $\alpha = \beta = \gamma$ gives the 1-parameter model of Jukes and Cantor [33]. We take the characters as the DNA or RNA nucleotides A, G, T (or U) and C. A particular advantage of K3ST is that the substitutions form a group acting on the set of nucleotides, and it is from this property that the Hadamard conjugation can be derived.

In Figure 2.2(b) we illustrate the substitution types $t_\alpha = t_{01}$ for transitions, and $t_\beta = t_{10}$ and $t_\gamma = t_{11}$ for transversions; $t_\epsilon = t_{00}$ is the identity (no substitution). With the binary codes for the nucleotides (e.g. $(0, 1)$ for C, cf. Figure 2.2(a)), $t_{xy}(a, b) = (c, d)$, where $c \equiv a + x \pmod 2$ and $d \equiv b + y \pmod 2$. Kimura identified $\alpha$, $\beta$ and $\gamma$ as three rate classes; however, we take our parameters to be the probabilities of each substitution type, $p_\alpha$, $p_\beta$ and $p_\gamma$, with $p_\epsilon = 1 - (p_\alpha + p_\beta + p_\gamma)$ being the probability of no substitution. These probabilities are recorded in a $2 \times 2$ probability matrix
$$P = \begin{bmatrix} p_\epsilon & p_\alpha \\ p_\beta & p_\gamma \end{bmatrix}.$$
For each edge $e$ of a tree $T$, we set $P_e = P$, where each entry $p_\theta$ is the probability that the characters at the vertices are transformed by $t_\theta$. (Thus $X$ at one vertex is transformed to $t_\theta(X)$ at the other.)
(The direction is irrelevant as Kimura's model is symmetric.) From $P_e$ (provided $HP_eH > 0$) we derive the matrix
$$Q_e = H^{-1}[\mathrm{Ln}(HP_eH)]H^{-1} = \begin{bmatrix} -K & q_\alpha \\ q_\beta & q_\gamma \end{bmatrix}, \tag{2.2}$$
where
$$H = H_1 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \tag{2.3}$$
is the $2 \times 2$ Hadamard matrix, having inverse $H^{-1} = \frac{1}{2}H$, and $\mathrm{Ln}$ is the natural logarithm function applied individually to each entry of the matrix $M = HP_eH$.

Figure 2.2: (a) Kimura's 3 substitution model (K3ST), with the nucleotides placed at the binary codes U(T) = $(0, 0)$, C = $(0, 1)$, A = $(1, 0)$ and G = $(1, 1)$. (b) Substitution types $t_\alpha = t_{01}$, $t_\beta = t_{10}$, $t_\gamma = t_{11}$ and $t_\epsilon = t_{00}$.

The entries $q_\alpha$, $q_\beta$ and $q_\gamma$ of $Q_e$ are additive parameters which we will refer to as the three edge-length parameters. Their sum $q_\alpha + q_\beta + q_\gamma = K$ is the value Kimura refers to as evolutionary distance. If we assume a Poisson model of substitution with the three Kimura rates over time $t$, then we find that $q_\alpha = 2\alpha t$, $q_\beta = 2\beta t$ and $q_\gamma = 2\gamma t$ are the expected numbers of substitutions of each type occurring along $e$, and $R = \frac{1}{2t} Q$ is the rate matrix.

Equation (2.2) can be inverted, giving
$$P_e = H^{-1}[\mathrm{Exp}(HQ_eH)]H^{-1} = \begin{bmatrix} p_\epsilon & p_\alpha \\ p_\beta & p_\gamma \end{bmatrix}, \tag{2.4}$$
where $\mathrm{Exp}$ refers to the exponential function applied individually to each entry of the matrix $HQ_eH$. We see
$$HQ_eH = -2\begin{bmatrix} 0 & q_\alpha + q_\gamma \\ q_\beta + q_\gamma & q_\alpha + q_\beta \end{bmatrix}.$$
Hence
$$HP_eH = \begin{bmatrix} 1 & 1 - 2(p_\alpha + p_\gamma) \\ 1 - 2(p_\beta + p_\gamma) & 1 - 2(p_\alpha + p_\beta) \end{bmatrix} = \begin{bmatrix} 1 & e^{-2(q_\alpha + q_\gamma)} \\ e^{-2(q_\beta + q_\gamma)} & e^{-2(q_\alpha + q_\beta)} \end{bmatrix}. \tag{2.5}$$

Multiplying Equation (2.5) by $H^{-1}$ on both sides gives
$$P_e = \begin{bmatrix} p_\epsilon & p_\alpha \\ p_\beta & p_\gamma \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 1 + e^{-2(q_\alpha+q_\gamma)} + e^{-2(q_\beta+q_\gamma)} + e^{-2(q_\alpha+q_\beta)} & 1 - e^{-2(q_\alpha+q_\gamma)} + e^{-2(q_\beta+q_\gamma)} - e^{-2(q_\alpha+q_\beta)} \\ 1 + e^{-2(q_\alpha+q_\gamma)} - e^{-2(q_\beta+q_\gamma)} - e^{-2(q_\alpha+q_\beta)} & 1 - e^{-2(q_\alpha+q_\gamma)} - e^{-2(q_\beta+q_\gamma)} + e^{-2(q_\alpha+q_\beta)} \end{bmatrix}. \tag{2.6}$$

In Figure 2.2(b), a reference nucleotide $\Sigma$ can be inserted at $(0, 0)$; then $t_\theta(\Sigma)$ occurs at the point $(a, b)$, where $(0, 1) = \alpha$, $(1, 0) = \beta$ and $(1, 1) = \gamma$. The set of substitutions $\{t_{ab} \mid a, b \in \mathbb{Z}_2\}$ is a group acting on the nucleotides.

The $i$th site of the sequence $\sigma_j$ is the $i$th character $\Sigma_{ij}$ of $\sigma_j$. Selecting $\sigma_0$ as the reference sequence, the $i$th site pattern is a vector of $n$ substitution types, with $j$th component the substitution which transforms $\Sigma_{i0}$ to $\Sigma_{ij}$. This site pattern can be referenced as a pair $(C, D)$ of subsets of $X = \{1, \ldots, n\}$, where
$$C = \{j : \Sigma_{ij} = t_\theta(\Sigma_{i0}),\ \theta = \alpha \text{ or } \gamma\}, \qquad D = \{j : \Sigma_{ij} = t_\theta(\Sigma_{i0}),\ \theta = \beta \text{ or } \gamma\}. \tag{2.7}$$

Table 2.1(a) illustrates four sample DNA sequences with sixteen sites. $\sigma_0$ is the reference sequence; the pair of binary digits above each character of $\sigma_1, \sigma_2, \sigma_3$ is the substitution type that derives that character from the homologous character of $\sigma_0$. For example, the entry 11 above G at site #10 of $\sigma_3$ indicates that the substitution to this nucleotide from the corresponding T of the reference sequence $\sigma_0$ is of type $t_\gamma$. In (b), the frequencies of each of the site patterns from (a) are summarized in the observed sequence spectrum $F$. The rows of $F$ are indexed by the triple formed by the first digits of the binary pairs, and the columns by the triple formed by the second digits, in the order 000, 100, 010, 110, 001, 101, 011, 111. The site pattern of site #10 is represented by the pair (011, 001), so the corresponding entry is in row 011 and column 001 of $F$. As this pattern occurs only at site #10, that entry is 1 (highlighted in bold font).

2.4 Hadamard Conjugation of K3ST on a Phylogenetic Tree

In this section we prove the Hadamard conjugation for Kimura's substitution models. Although the transformation is identical to that of the two-state model, the expected sequence spectrum is defined substantially differently, as is the edge length spectrum. In the K3ST model, for each edge $e_A$ of $T$, instead of one length we specify three edge-length parameters $q_{e_A}(\alpha)$, $q_{e_A}(\beta)$ and $q_{e_A}(\gamma)$. Using Equation (2.4) we can express the probabilities of each type of substitution across $e_A$ as functions of the edge lengths:
$$P_{e_A} = H^{-1}(\mathrm{Exp}(HQ_{e_A}H))H^{-1}.$$

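Table 2.1: (a) Four aligned sequences with sixteen sites. (b) The corresponding observed sequence spectrum.

(a)
site #   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
σ_0 =    A  C  A  G  T  A  G  T  G  T  T  A  C  C  A  G
σ_1 =    A  C  A  G  C  A  A  T  G  T  T  A  T  C  T  C
σ_2 =    C  C  A  T  T  G  A  A  G  A  T  G  C  G  T  T
σ_3 =    C  C  A  T  C  A  A  A  C  G  T  G  T  G  A  C

(b) F = the 8 × 8 matrix of site-pattern counts (the binary pairs printed above the characters in (a) and the entries of F are not reproduced here).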
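The site patterns of Table 2.1(a) can be recomputed directly from the four sequences. The sketch below is illustrative only: the function names are not from the original text, and the nucleotide codes T = (0, 0), C = (0, 1), A = (1, 0), G = (1, 1) are an assumption, chosen to agree with the code (0, 1) for C and with the site #10 example (T to G is $t_{11}$, pattern (011, 001)).

```python
# Binary nucleotide codes (an assumption of this sketch, see the lead-in).
CODE = {"T": (0, 0), "U": (0, 0), "C": (0, 1), "A": (1, 0), "G": (1, 1)}

SEQS = [
    "ACAGTAGTGTTACCAG",  # sigma_0 (reference)
    "ACAGCAATGTTATCTC",  # sigma_1
    "CCATTGAAGATGCGTT",  # sigma_2
    "CCATCAAACGTGTGAC",  # sigma_3
]


def substitution_type(ref, other):
    """Return (x, y) such that t_xy maps the reference nucleotide to the other one."""
    (a, b), (c, d) = CODE[ref], CODE[other]
    return (a ^ c, b ^ d)  # addition mod 2 on each coordinate


def site_pattern(column):
    """Encode one alignment column as (row, col) subset indices.

    row collects the taxa whose first digit is 1 (types beta or gamma),
    col collects the taxa whose second digit is 1 (types alpha or gamma);
    taxon j contributes bit j - 1.
    """
    ref, rest = column[0], column[1:]
    row = col = 0
    for j, ch in enumerate(rest):
        x, y = substitution_type(ref, ch)
        row |= x << j
        col |= y << j
    return row, col


def observed_spectrum(seqs):
    """Count how often each site pattern occurs in the alignment."""
    m = len(seqs) - 1
    F = [[0] * 2**m for _ in range(2**m)]
    for column in zip(*seqs):
        r, c = site_pattern(column)
        F[r][c] += 1
    return F


F = observed_spectrum(SEQS)

# Site #10: row 011 = {2, 3} -> index 6, column 001 = {3} -> index 4.
print(site_pattern([s[9] for s in SEQS]))  # (6, 4)
print(F[6][4])                             # 1
```

The integer indices follow the column order 000, 100, 010, 110, 001, 101, 011, 111 quoted above, where the leftmost digit belongs to $\sigma_1$.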

Here
$$P_{e_A} = \begin{bmatrix} p_{e_A}(\epsilon) & p_{e_A}(\alpha) \\ p_{e_A}(\beta) & p_{e_A}(\gamma) \end{bmatrix} \quad\text{and}\quad Q_{e_A} = \begin{bmatrix} -K_{e_A} & q_{e_A}(\alpha) \\ q_{e_A}(\beta) & q_{e_A}(\gamma) \end{bmatrix}.$$
From these equations we can derive the transition matrix
$$P(e_A) = \begin{bmatrix} p_{e_A}(\epsilon) & p_{e_A}(\alpha) & p_{e_A}(\beta) & p_{e_A}(\gamma) \\ p_{e_A}(\alpha) & p_{e_A}(\epsilon) & p_{e_A}(\gamma) & p_{e_A}(\beta) \\ p_{e_A}(\beta) & p_{e_A}(\gamma) & p_{e_A}(\epsilon) & p_{e_A}(\alpha) \\ p_{e_A}(\gamma) & p_{e_A}(\beta) & p_{e_A}(\alpha) & p_{e_A}(\epsilon) \end{bmatrix}.$$
Further, if $W = \{e_A, e_B, \ldots, e_C\}$ is a set of edges of $T$, then the product of transition matrices
$$P(W) = P(e_A)P(e_B) \cdots P(e_C) = \begin{bmatrix} p_W(\epsilon) & p_W(\alpha) & p_W(\beta) & p_W(\gamma) \\ p_W(\alpha) & p_W(\epsilon) & p_W(\gamma) & p_W(\beta) \\ p_W(\beta) & p_W(\gamma) & p_W(\epsilon) & p_W(\alpha) \\ p_W(\gamma) & p_W(\beta) & p_W(\alpha) & p_W(\epsilon) \end{bmatrix}$$
is the transition matrix of the probabilities $p_W(\theta)$ that the substitutions, composed across the edges of $W$, are of type $\theta$. Then we see from Equation (2.5) that
$$P_W = \begin{bmatrix} p_W(\epsilon) & p_W(\alpha) \\ p_W(\beta) & p_W(\gamma) \end{bmatrix} = H^{-1}[\mathrm{Exp}(HQ_WH)]H^{-1}, \tag{2.8}$$
where $Q_W = \sum_{e_A \in W} Q_{e_A}$.

The edge-length parameters for the full set of edges of $T$ can be collected in three vectors $q_\alpha$, $q_\beta$ and $q_\gamma$, whose components are indexed by the $2^{n-1}$ subsets of $X^*$ of taxa. For $A \subseteq X^*$ not an edge split, we set $q_A(\theta) = 0$ ($\theta \in \{\alpha, \beta, \gamma\}$), except for $A = \emptyset$, where $q_\emptyset(\theta) = -K(\theta)$ and $K(\theta)$ is the sum of all other $q_A(\theta)$ values. Figure 2.3(a) shows the edge length spectra for the tree $T_{13}$ on $n = 4$ taxa illustrated in Figure 2.1. Corresponding components of the vectors $q_\alpha$, $q_\beta$, $q_\gamma$ give the three edge-length parameters for the corresponding edge. The value 0 indicates that there is no corresponding edge in $T$ (e.g. $q_{12}$ in $T_{13}$).

We will find it convenient to put these three vectors into a matrix $Q\,(= Q_T) = [q_{A,B}]$ of $2^{n-1}$ rows and columns indexed by the subsets of $X^*$, with $q_{\emptyset,\emptyset} = -(K(\alpha) + K(\beta) + K(\gamma))$, with the remaining entries of $q_\alpha$, $q_\beta$ and $q_\gamma$ becoming the leading row, leading column and main diagonal of $Q$ respectively, and all other entries set to 0. Figure 2.3(b) shows the matrix $Q = Q_{T_{13}}$ holding the vectors $q_\alpha$, $q_\beta$, $q_\gamma$ from Figure 2.3(a).
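Equations (2.4) and (2.8) can be checked numerically for a pair of edges. The following sketch uses hypothetical edge-length parameters and illustrative function names; it builds $P_e$ from $Q_e$, expands it to the $4 \times 4$ transition matrix, and confirms that multiplying the transition matrices of two edges gives the same result as summing their $Q$ matrices first.

```python
import math

H = [[1, 1], [1, -1]]  # the 2 x 2 Hadamard matrix of Equation (2.3)


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def conjugate(M, f):
    """Return H^{-1} f(H M H) H^{-1}, with f applied entrywise and H^{-1} = H / 2."""
    inner = [[f(x) for x in row] for row in matmul(matmul(H, M), H)]
    return [[x / 4 for x in row] for row in matmul(matmul(H, inner), H)]


def q_matrix(qa, qb, qg):
    """Q_e = [[-K, q_alpha], [q_beta, q_gamma]] with K = q_alpha + q_beta + q_gamma."""
    return [[-(qa + qb + qg), qa], [qb, qg]]


def p_from_q(Q):
    """Equation (2.4)."""
    return conjugate(Q, math.exp)


def q_from_p(P):
    """Equation (2.2); requires every entry of H P H to be positive."""
    return conjugate(P, math.log)


def transition_matrix(P):
    """4 x 4 matrix over substitution types ordered (eps, alpha, beta, gamma)."""
    p = [P[0][0], P[0][1], P[1][0], P[1][1]]  # p[2x + y] is the probability of t_xy
    return [[p[i ^ j] for j in range(4)] for i in range(4)]


Q1 = q_matrix(0.10, 0.03, 0.02)  # hypothetical parameters for a first edge
Q2 = q_matrix(0.05, 0.04, 0.01)  # hypothetical parameters for a second edge
Q_W = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(Q1, Q2)]

lhs = matmul(transition_matrix(p_from_q(Q1)), transition_matrix(p_from_q(Q2)))
rhs = transition_matrix(p_from_q(Q_W))
assert all(abs(x - y) < 1e-12 for rx, ry in zip(lhs, rhs) for x, y in zip(rx, ry))
```

The index trick `p[i ^ j]` reproduces the layout of $P(e_A)$ shown above, because composing the substitution types $t_{xy}$ amounts to adding their binary indices modulo 2.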

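$$q_\alpha = \begin{bmatrix} -K(\alpha) \\ q_1(\alpha) \\ q_2(\alpha) \\ 0 \\ q_3(\alpha) \\ q_{13}(\alpha) \\ 0 \\ q_{123}(\alpha) \end{bmatrix}, \quad q_\beta = \begin{bmatrix} -K(\beta) \\ q_1(\beta) \\ q_2(\beta) \\ 0 \\ q_3(\beta) \\ q_{13}(\beta) \\ 0 \\ q_{123}(\beta) \end{bmatrix}, \quad q_\gamma = \begin{bmatrix} -K(\gamma) \\ q_1(\gamma) \\ q_2(\gamma) \\ 0 \\ q_3(\gamma) \\ q_{13}(\gamma) \\ 0 \\ q_{123}(\gamma) \end{bmatrix} \qquad \text{(a)}$$

$$Q_T = \begin{bmatrix} -K & q_1(\alpha) & q_2(\alpha) & 0 & q_3(\alpha) & q_{13}(\alpha) & 0 & q_{123}(\alpha) \\ q_1(\beta) & q_1(\gamma) & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ q_2(\beta) & \cdot & q_2(\gamma) & \cdot & \cdot & \cdot & \cdot & \cdot \\ 0 & \cdot & \cdot & 0 & \cdot & \cdot & \cdot & \cdot \\ q_3(\beta) & \cdot & \cdot & \cdot & q_3(\gamma) & \cdot & \cdot & \cdot \\ q_{13}(\beta) & \cdot & \cdot & \cdot & \cdot & q_{13}(\gamma) & \cdot & \cdot \\ 0 & \cdot & \cdot & \cdot & \cdot & \cdot & 0 & \cdot \\ q_{123}(\beta) & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & q_{123}(\gamma) \end{bmatrix} \qquad \text{(b)}$$

Figure 2.3: (a) Example edge length spectra for the tree $T_{13}$. (b) $Q = Q_{T_{13}}$.

This means that for $A, B \subseteq X^* = \{1, 2, 3\}$, $Q_{\emptyset,B} = q_B(\alpha)$, $Q_{A,\emptyset} = q_A(\beta)$, $Q_{A,A} = q_A(\gamma)$, and for all other entries $Q_{A,B} = 0$, except the first entry $Q_{\emptyset,\emptyset} = -K$, where $K = K(\alpha) + K(\beta) + K(\gamma)$. The entries indicated by "$\cdot$" are zero for every tree. The entries indicated by 0 are zero for this tree $T$, but can be non-zero for different trees. The non-zero entries (in the leading row, leading column and main diagonal) should each be in the same component, and these identify the edge splits of $T$. For general trees on $n$ taxa, the edge length spectra are vectors and matrices of order $2^{n-1}$.

For any $A \subseteq X^*$, let $\Pi_A = \{e_B \mid h_{A,B} = -1\}$. It is easily shown that if $A$ contains $2m$ (or $2m - 1$) elements, $\Pi_A$ is the disjoint union of $m$ paths connecting disjoint pairs of leaves in $A$ (or in $A \cup \{n\}$). When $T$ is binary, then these paths are disjoint. We will refer to $\Pi_A$ as the pathset generated by $A$. Consider the $A$th component of the product $Hq_\alpha$: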

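$$\begin{aligned} (Hq_\alpha)_A &= \sum_{D \subseteq X^*} (-1)^{|A \cap D|} q_D(\alpha) = q_\emptyset(\alpha) + \sum_{D : e_D \in E(T)} (-1)^{|A \cap D|} q_D(\alpha) \\ &= -\sum_{D : e_D \in E(T)} q_D(\alpha) + \sum_{D : e_D \in E(T)} (-1)^{|A \cap D|} q_D(\alpha) \\ &= -2 \sum_{D : h_{A,D} = -1} q_D(\alpha) = -2 \sum_{D : e_D \in \Pi_A} q_D(\alpha). \end{aligned} \tag{2.9}$$

For any subset $W$ of the edge set of $T$, let $d_W(\theta) = \sum_{e \in W} q_e(\theta)$ be the $\theta$-length of $W$, for $\theta \in \{\alpha, \beta, \gamma\}$. Thus in particular, from Equation (2.9), $(Hq_\theta)_A = -2 d_{\Pi_A}(\theta)$. Similarly we find, for $A, B$ subsets of $X = \{1, 2, \ldots, n\}$, that the $(A, B)$th component of $HQH$ is
$$(HQH)_{A,B} = -2\big(d_{\Pi_A}(\alpha) + d_{\Pi_B}(\beta) + d_{\Pi_C}(\gamma)\big),$$
where $C = A \,\triangle\, B$ is the symmetric difference of $A$ and $B$. Let $U = \Pi_A \cap \Pi_C$, $V = \Pi_B \cap \Pi_C$ and $W = \Pi_A \cap \Pi_B$, so $\Pi_A \cup \Pi_B$ is partitioned into the disjoint subsets $U, V, W$, with $\Pi_A = U \cup W$, $\Pi_B = V \cup W$ and $\Pi_C = U \cup V$. We see
$$e^{-2d_{\Pi_A}(\alpha)} = e^{-2(d_U(\alpha) + d_W(\alpha))},$$
so
$$\begin{aligned} (\mathrm{Exp}(HQH))_{A,B} &= e^{-2d_{\Pi_A}(\alpha)}\, e^{-2d_{\Pi_B}(\beta)}\, e^{-2d_{\Pi_C}(\gamma)} \\ &= e^{-2(d_U(\alpha) + d_W(\alpha))}\, e^{-2(d_V(\beta) + d_W(\beta))}\, e^{-2(d_U(\gamma) + d_V(\gamma))}. \end{aligned} \tag{2.10}$$
Now, from Equation (2.5) we see
$$e^{-2(d_U(\alpha) + d_U(\gamma))} = 1 - 2(p_U(\alpha) + p_U(\gamma)) = p_U(\epsilon) - p_U(\alpha) + p_U(\beta) - p_U(\gamma),$$
so that Equation (2.10) becomes
$$\begin{aligned} (\mathrm{Exp}(HQH))_{A,B} = {} & [p_U(\epsilon) - p_U(\alpha) + p_U(\beta) - p_U(\gamma)] \\ & \times [p_V(\epsilon) + p_V(\alpha) - p_V(\beta) - p_V(\gamma)] \\ & \times [p_W(\epsilon) - p_W(\alpha) - p_W(\beta) + p_W(\gamma)]. \end{aligned} \tag{2.11}$$
When the factors of Equation (2.11) are expanded we obtain 64 terms, each of the form
$$\pm\, p_U(\theta_1)\, p_V(\theta_2)\, p_W(\theta_3). \tag{2.12}$$
Further we observe
$$p_{\Pi_A}(\alpha)\, p_{\Pi_B}(\beta) = \sum_\theta p_U(\alpha\theta)\, p_V(\beta\theta)\, p_W(\theta),$$
where $\theta$ is summed over the elements of $\{\epsilon, \alpha, \beta, \gamma\}$. The corresponding four terms of Equation (2.11) each occur with a $+$ sign. However, each of the four terms corresponding to
$$p_{\Pi_A}(\alpha)\, p_{\Pi_B}(\alpha) = \sum_\theta p_U(\alpha\theta)\, p_V(\alpha\theta)\, p_W(\theta)$$
occurs in Equation (2.11) with a $-$ sign. Continuing this observation with the other terms, we find that the coefficient of $p_{\Pi_A}(\psi)\, p_{\Pi_B}(\phi)$ is $a(A, \psi)\, b(B, \phi)$, where
$$a(A, \psi) = \begin{cases} 1 & \text{when } \psi = \epsilon, \beta, \\ -1 & \text{when } \psi = \alpha, \gamma, \end{cases} \qquad b(B, \phi) = \begin{cases} 1 & \text{when } \phi = \epsilon, \alpha, \\ -1 & \text{when } \phi = \beta, \gamma. \end{cases}$$
Thus
$$(\mathrm{Exp}(HQH))_{A,B} = \sum_{\psi,\phi} a(A, \psi)\, p_{\Pi_A}(\psi)\, b(B, \phi)\, p_{\Pi_B}(\phi). \tag{2.13}$$

The characters $\chi_{i1}, \ldots, \chi_{in}$ at the $i$th site partition $X$ into (up to four) subsets by their states. These patterns can be indexed by a pair $D, E \subseteq X$, if we set
$$D = \{j : \chi_{ij} = t_\theta(\chi_{i0}) \text{ for } \theta = \alpha, \gamma\}, \qquad E = \{j : \chi_{ij} = t_\theta(\chi_{i0}) \text{ for } \theta = \beta, \gamma\},$$
so that
$$\begin{aligned} \{j : \chi_{ij} = t_\epsilon(\chi_{i0})\} &= X \setminus (D \cup E), & \{j : \chi_{ij} = t_\alpha(\chi_{i0})\} &= D \setminus E, \\ \{j : \chi_{ij} = t_\beta(\chi_{i0})\} &= E \setminus D, & \{j : \chi_{ij} = t_\gamma(\chi_{i0})\} &= D \cap E. \end{aligned}$$
Let $s_{DE}$ be the probability of obtaining a site with such a pattern. Then we see $a(A, \psi) = h_{AD}$ and $b(B, \phi) = h_{BE}$, so that Equation (2.13) can be rewritten as
$$(\mathrm{Exp}(HQH))_{A,B} = \sum_{D,E} h_{DU}\, h_{EV}\, h_{(D \triangle E)W}\, S_{DE} = \sum_{D,E} h_{AD}\, h_{BE}\, S_{DE} = (HSH)_{AB},$$
which gives us our major result.

Theorem. $S = H^{-1}(\mathrm{Exp}(HQH))H^{-1}$, which, provided the arguments of the $\mathrm{Ln}$ function are all positive, inverts to $Q = H^{-1}(\mathrm{Ln}(HSH))H^{-1}$.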
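The theorem can be exercised on the tree $T_{13}$ of Figure 2.1. The sketch below is a hedged illustration rather than the thesis's own procedure: the edge-length values are hypothetical, the function names are illustrative, and $Q$ is laid out as described for Figure 2.3(b) (leading row for $\alpha$, leading column for $\beta$, main diagonal for $\gamma$). It computes $S$, checks that $S$ is a probability distribution, and recovers $Q$ through the inverse conjugation.

```python
import math


def sylvester(n):
    """Sylvester matrix H_n as a list of rows."""
    H = [[1]]
    for _ in range(n):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def conjugate(M, f):
    """Return H^{-1} f(H M H) H^{-1}, with f applied entrywise and H^{-1} = H / len(M)."""
    n = len(M)
    H = sylvester(n.bit_length() - 1)
    inner = [[f(x) for x in row] for row in matmul(matmul(H, M), H)]
    return [[x / n**2 for x in row] for row in matmul(matmul(H, inner), H)]


def edge_length_matrix(edges, m):
    """Build Q from {subset index A: (q_A(alpha), q_A(beta), q_A(gamma))}.

    m is the number of non-reference taxa, so Q has 2^m rows and columns.
    """
    size = 2 ** m
    Q = [[0.0] * size for _ in range(size)]
    for A, (qa, qb, qg) in edges.items():
        Q[0][A] = qa               # leading row: alpha
        Q[A][0] = qb               # leading column: beta
        Q[A][A] = qg               # main diagonal: gamma
        Q[0][0] -= qa + qb + qg    # accumulates -K
    return Q


# Edges of T_13 on 4 taxa: e_1, e_2, e_3, e_13, e_123 -> indices 1, 2, 4, 5, 7.
# The numerical values are hypothetical.
EDGES_T13 = {
    1: (0.05, 0.02, 0.01),
    2: (0.04, 0.01, 0.02),
    4: (0.03, 0.02, 0.02),
    5: (0.02, 0.01, 0.01),
    7: (0.06, 0.03, 0.01),
}

Q = edge_length_matrix(EDGES_T13, 3)
S = conjugate(Q, math.exp)       # S = H^{-1} Exp(HQH) H^{-1}
Q_back = conjugate(S, math.log)  # Q = H^{-1} Ln(HSH) H^{-1}

assert abs(sum(map(sum, S)) - 1.0) < 1e-12
assert all(abs(x - y) < 1e-9 for rx, ry in zip(Q_back, Q) for x, y in zip(rx, ry))
```

With this layout of $Q$, the row index of `S` collects the taxa whose substitution type is $\beta$ or $\gamma$ and the column index those of type $\alpha$ or $\gamma$, which is the same orientation as the observed spectrum $F$ of Table 2.1; transposing both `Q` and `S` gives the opposite convention, since $H$ is symmetric.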


More information
