Inferring Phylogenetic Trees: Distance Approaches

Representing distances in rooted and unrooted trees

The distance approach to phylogenies

Given: an $n \times n$ matrix $M$, where $M_{ij}$ is the distance between taxa $i$ and $j$.

Problem: build an edge-weighted tree such that the distance between leaves $i$ and $j$ is as close as possible to $M_{ij}$.

Where do we get distances?

Distances are commonly obtained from multiple sequence alignments. In the alignment of sequence $i$ with sequence $j$, let

$$f_{ij} = \frac{\#\text{mismatches}}{\#\text{matches} + \#\text{mismatches}}$$

This could be used directly as a simple measure of sequence distance:

$$d_{ij} = f_{ij}$$

Or we could use the Jukes-Cantor correction for multiple substitutions at a single position:

$$d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4}{3} f_{ij}\right)$$

Derivation of the Jukes-Cantor model

The model assumes that all sites are independent and have identical mutation rates, and that all possible nucleotide substitutions occur at the same rate $\alpha$ per unit time. A matrix can then represent the substitution rates:

$$
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 1-3\alpha & \alpha & \alpha & \alpha \\
C & \alpha & 1-3\alpha & \alpha & \alpha \\
G & \alpha & \alpha & 1-3\alpha & \alpha \\
T & \alpha & \alpha & \alpha & 1-3\alpha
\end{array}
$$

Now suppose that an ancestral sequence diverged $t$ years ago into two related sequences. After this time, suppose that the fraction of identical sites between the two sequences is $q(t)$ and the fraction of different sites is $p(t)$, so that $p(0) = 0$, $q(0) = 1$, and $p(t) + q(t) = 1$ for all $t \geq 0$.
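The two distance measures above can be sketched in Julia as follows. This is a minimal illustration, not part of the original slides: the helper names `mismatch_fraction` and `jukes_cantor` are hypothetical, skipping gap columns is an assumption, and the two short sequences are made up for the example.

```julia
# Hypothetical helpers illustrating f_ij and the Jukes-Cantor correction.

# Fraction of mismatched positions between two aligned sequences.
function mismatch_fraction(a::AbstractString, b::AbstractString)
    length(a) == length(b) || error("sequences must be aligned to equal length")
    matches = 0
    mismatches = 0
    for (x, y) in zip(a, b)
        # assumption: columns containing a gap are ignored
        (x == '-' || y == '-') && continue
        if x == y
            matches += 1
        else
            mismatches += 1
        end
    end
    return mismatches / (matches + mismatches)
end

# Jukes-Cantor corrected distance; the logarithm is only defined for f < 3/4.
function jukes_cantor(f::Real)
    f < 0.75 || return Inf
    return -0.75 * log(1 - 4f / 3)
end

# Made-up aligned sequences differing at 2 of 10 positions.
f = mismatch_fraction("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor(f)
println("f = ", f, ", Jukes-Cantor d = ", d)
```

The corrected distance is always at least as large as the raw mismatch fraction, because it accounts for positions that have changed more than once.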

We can calculate $q(t+1)$, the fraction of identical sites after time $t+1$. There are two ways of getting an identical site at time $t+1$:

- Two aligned identical sites both do not mutate: the probability of this event is $(1-3\alpha)^2 \approx (1-6\alpha)$. Since a fraction $q(t)$ of sites were identical at time $t$, we expect $(1-6\alpha)\,q(t)$ to remain identical at time $t+1$.
- One of two different aligned sites at time $t$ mutates to become identical to the other at time $t+1$: the probability of this event is $2\alpha(1-3\alpha)\,p(t) \approx 2\alpha\,p(t)$.

Therefore, the fraction of identical sites at time $t+1$ is:

$$q(t+1) = (1-6\alpha)\,q(t) + 2\alpha\,p(t)$$

Using $p(t) = 1 - q(t)$, this allows for estimating the derivative of $q(t)$ with time as:

$$\frac{dq(t)}{dt} \approx q(t+1) - q(t) = 2\alpha - 8\alpha\,q(t)$$

Solving this differential equation subject to the initial condition $q(0) = 1$ gives:

$$q(t) = \frac{1}{4}\left(1 + 3e^{-8\alpha t}\right)$$

Notice that $q(t \to \infty) = \frac{1}{4}$, so this model predicts a minimum of 25% identity even when aligning unrelated nucleotide sequences.

Finally, to obtain the Jukes-Cantor correction, we note that we would expect $3\alpha t$ mutations during a time $t$ for each site on each sequence. Thus, the evolutionary distance between two sequences under this model is $6\alpha t$. However, $\alpha$ and $t$ are not observed directly, so we rewrite $6\alpha t$ using the solution for $q(t)$:

$$
6\alpha t = -\frac{3}{4}(-8\alpha t)
= -\frac{3}{4}\log\left(\frac{4q(t) - 1}{3}\right)
= -\frac{3}{4}\log\left(\frac{4(1 - p(t)) - 1}{3}\right)
= -\frac{3}{4}\log\left(1 - \frac{4}{3}p(t)\right)
$$

Replacing $p(t)$ by our measured deviation,

$$f_{ij} = \frac{\#\text{mismatches}}{\#\text{matches} + \#\text{mismatches}}$$

gives the Jukes-Cantor correction stated earlier:

$$d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4}{3} f_{ij}\right)$$

The molecular clock hypothesis

Some proteins appear to evolve slowly, others rapidly. But for any given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages.
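As a sanity check on the derivation, the one-step recurrence can be iterated numerically and compared with the closed-form solution. The following is a small sketch only: the function names `q_closed` and `q_iterated` and the rate value 0.001 are assumptions chosen for illustration, not values from the slides.

```julia
# Closed-form solution of dq/dt = 2α - 8αq with q(0) = 1.
q_closed(α, t) = (1 + 3 * exp(-8 * α * t)) / 4

# Iterate q(t+1) = (1 - 6α) q(t) + 2α p(t), with p(t) = 1 - q(t), from q(0) = 1.
function q_iterated(α, steps)
    q = 1.0
    for _ in 1:steps
        p = 1 - q
        q = (1 - 6α) * q + 2α * p
    end
    return q
end

# Assumed illustrative rate; the approximation improves as α gets smaller.
rate = 0.001
for t in (10, 100, 1000)
    println("t = ", t,
            ": recurrence = ", q_iterated(rate, t),
            ", closed form = ", q_closed(rate, t))
end
```

For small rates the two values stay close, and both approach 1/4 as t grows, which is the 25% identity floor noted above.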

Ultrametric data

The molecular clock assumption is not generally true: selection pressures vary across time periods, organisms, genes within an organism, and regions within a gene. If it does hold, then the data are said to be ultrametric.

Ultrametric data condition: if your data are ultrametric, then for any triplet of sequences $(i, j, k)$, the three pairwise distances are either all equal, or two are equal and the remaining one is smaller.

Unweighted Pair Group Method using Averages (UPGMA)

Given ultrametric data, UPGMA will reconstruct the tree $T$ that is consistent with the data.

Basic idea:

- iteratively pick two clusters of taxa and merge them
- create a new node in the tree for the merged cluster

The distance $d_{ij}$ between clusters $C_i$ and $C_j$ of taxa is defined as the average distance between pairs of taxa from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

UPGMA algorithm

- assign each taxon to its own cluster
- define one leaf for each taxon; place it at height 0
- while more than two clusters remain:
  - determine the two clusters $i$, $j$ with the smallest $d_{ij}$
  - define a new cluster $C_k = C_i \cup C_j$
  - define a node $k$ with children $i$ and $j$; place $k$ at height $d_{ij}/2$
  - replace clusters $i$ and $j$ with cluster $k$
  - compute the distance between $k$ and every other cluster $l$:

$$d_{kl} = \frac{|C_i|\,d_{il} + |C_j|\,d_{jl}}{|C_i| + |C_j|}$$

- join the last two clusters, $i$ and $j$, by a root at height $d_{ij}/2$

[Figure: UPGMA example]
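The ultrametric condition on triplets stated earlier is easy to test directly before running UPGMA. The sketch below is an illustration under stated assumptions: the function name `is_ultrametric`, the tolerance parameter, and both small matrices are made up for the example.

```julia
# Three-point condition: in every triplet the two largest pairwise distances
# must be equal (all three equal, or two equal and the third smaller).
function is_ultrametric(d::AbstractMatrix; tol = 1e-9)
    n = size(d, 1)
    for i in 1:n, j in i+1:n, k in j+1:n
        a, b, c = sort([d[i, j], d[i, k], d[j, k]])
        # after sorting, c is the largest and b the second largest
        abs(c - b) <= tol || return false
    end
    return true
end

# Made-up example: taxa 1 and 2 are closest, taxon 3 is equidistant from both.
u = [0 2 6; 2 0 6; 6 6 0]
# Made-up example that violates the condition (distances 3, 6 and 7).
v = [0 3 7; 3 0 6; 7 6 0]

println(is_ultrametric(u))
println(is_ultrametric(v))
```

On these inputs the first call returns `true` and the second `false`. When the check fails, UPGMA still produces a tree, but it is no longer guaranteed to be consistent with the distances.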

Newick format for phylogenetic trees

[Figure: an example phylogenetic tree]

This tree can be represented via an integer $n$ followed by the adjacency list of a weighted tree with $n$ leaves:

    4
    A->F:0.1
    B->F:0.2
    C->E:0.3
    D->E:0.4
    E->F:0.5

The tree can also be represented as Newick strings:

    (,,(,));                               (no names)
    (A,B,(C,D));                           (leaves are named)
    (A,B,(C,D)E)F;                         (leaves and internal nodes are named)
    (:0.1,:0.2,(:0.3,:0.4):0.5):0.0;       (distance to parent)
    (A:0.1,B:0.2,(C:0.3,D:0.4):0.5);       (distance and leaf names)
    (A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;     (distance and all node names)

Julia code for UPGMA trees

First we need a function to read in a distance matrix. The input file holds a single integer n followed by an n x n distance matrix.

    function getdistmatrix(fn)
        # read the whole file as a string
        data = read(fn, String)
        # split on whitespace to generate a list of tokens
        # (split without a delimiter already drops empty tokens)
        # and convert each token to Float64
        nums = parse.(Float64, split(data))
        # strip the first number (n) and reshape the rest
        n = round(Int, nums[1])
        nums = nums[2:end]
        # reshape fills column by column; a distance matrix is symmetric,
        # so this gives the same matrix as the row-by-row file layout
        dm = reshape(nums, (n, n))
        return dm
    end

Julia code for UPGMA (returns the tree as a Newick string)

    using LinearAlgebra   # provides the identity I
    using Printf          # provides @sprintf

    function upgma(dm)
        n = size(dm, 1)
        # first nodes are labelled 1..n
        # and each node is placed in its own cluster
        # heights of leaf nodes are all set to zero
        # newick starts off as a list of leaf labels
        clusters = [[i] for i in 1:n]
        newick = ["$i" for i in 1:n]
        nodes = collect(0:n-1)
        heights = zeros(n)
        # next node to generate has label n+1
        next = n + 1
        # enter the while loop; each pass merges two clusters
        while n > 1
            # first add 2 * max to the diagonal zeros
            # before finding the indices
            # and value of the minimum distance
            dmax = maximum(dm)
            dme = dm + I * (2 * dmax)
            (dmin, ind) = findmin(dme)
            # store the indices of the minimum as row and col
            row, col = Tuple(ind)
            # compute weights for generating distances to the new cluster
            ncr = length(clusters[row])
            ncc = length(clusters[col])
            # apply the distance-to-new-cluster formula
            # and append a new row and a new column
            # to the distance matrix
            newrow = (ncr * dm[row, :] + ncc * dm[col, :]) / (ncr + ncc)
            dm = vcat(dm, newrow')
            newcol = (ncr * dm[:, row] + ncc * dm[:, col]) / (ncr + ncc)
            dm = hcat(dm, newcol)
            # set the diagonal element of the new
            # row and new column to zero
            dm[n+1, n+1] = 0.0
            # append the new cluster to the cluster list
            push!(clusters, vcat(clusters[row], clusters[col]))
            # compute the height for the new cluster
            # and generate the Newick representation
            # for the new cluster
            h = dmin / 2
            hr = @sprintf(":%.3f", h - heights[row])
            hc = @sprintf(":%.3f", h - heights[col])
            newnode = "(" * newick[row] * hr * ", " * newick[col] * hc * ")$next"
            # append the new Newick representation,
            # the new height and the new node name
            # to the appropriate lists
            push!(newick, newnode)
            push!(heights, h)
            push!(nodes, next - 1)
            # use deleteat! to remove the row and col items
            # from each list (indices must be in increasing order)
            drop = sort([row, col])
            deleteat!(clusters, drop)
            deleteat!(newick, drop)
            deleteat!(heights, drop)
            deleteat!(nodes, drop)
            # finally remove the row and col rows and columns
            # from the distance matrix
            keep = setdiff(1:n+1, drop)
            dm = dm[keep, keep]
            # by now n should have dropped by one
            n = size(dm, 1)
            # increment the next node label
            next = next + 1
        end
        # after the while loop there should be one string
        # in the newick list representing the whole tree; return it
        return newick[1]
    end

In [1]:

    # the code is stored locally, let's try it out
    # note that print statements have been added
    # to generate the adjacency list required by Rosalind.
    include("code/upgma.jl")
    tree = upgma(getdistmatrix("data/dm1.txt"))

    3->4:5.000 4->3:5.000 2->4:5.000 4->2:5.000
    4->5:2.000 5->4:2.000 0->5:7.000 5->0:7.000
    5->6:1.833 6->5:1.833 1->6:8.833 6->1:8.833

Out[1]: "(((4:5.000, 3:5.000)5:2.000, 1:7.000)6:1.833, 2:8.833)7"

In [2]:

    # write the Newick string to a file for viewing with FigTree
    open("data/tr1.tree", "w") do f
        write(f, tree)
    end

Out[2]: 55

[Figure: a rendering from FigTree]
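To connect the Newick conventions listed earlier with the adjacency-list form, the small example tree (A, B, C, D with internal nodes E and F) can be serialized recursively. This is a sketch under assumptions: the `children` dictionary layout and the function name `to_newick` are hypothetical and are not taken from the course code.

```julia
# Hypothetical representation of the example tree: each internal node maps
# to a list of (child, branch length) pairs.
children = Dict(
    "F" => [("A", 0.1), ("B", 0.2), ("E", 0.5)],
    "E" => [("C", 0.3), ("D", 0.4)],
)

# Recursively serialize the subtree rooted at `node`.
function to_newick(node, children)
    # a leaf is written as just its name
    haskey(children, node) || return node
    # each child contributes "subtree:length"
    parts = [to_newick(child, children) * ":" * string(len)
             for (child, len) in children[node]]
    return "(" * join(parts, ",") * ")" * node
end

println(to_newick("F", children) * ";")
```

With this input the sketch prints `(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;`, the "distance and all node names" form from the list of Newick variants.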

In [7]:

    tree2 = upgma(getdistmatrix("data/dm2.txt"))
    open("data/tr2.tree", "w") do f
        write(f, tree2)
    end

    17->26:338.000 26->17:338.000 16->26:338.000 26->16:338.000
    25->27:339.500 27->25:339.500 12->27:339.500 27->12:339.500
    22->28:340.000 28->22:340.000 19->28:340.000 28->19:340.000
    24->29:341.000 29->24:341.000 2->29:341.000 29->2:341.000
    20->30:342.000 30->20:342.000 9->30:342.000 30->9:342.000
    28->31:5.000 31->28:5.000 13->31:345.000 31->13:345.000
    14->32:346.500 32->14:346.500 0->32:346.500 32->0:346.500
    23->33:355.500 33->23:355.500 6->33:355.500 33->6:355.500
    21->34:356.000 34->21:356.000 11->34:356.000 34->11:356.000
    18->35:358.000 35->18:358.000 1->35:358.000 35->1:358.000
    8->36:363.500 36->8:363.500 3->36:363.500 36->3:363.500
    15->37:365.000 37->15:365.000 4->37:365.000 37->4:365.000
    10->38:369.000 38->10:369.000 5->38:369.000 38->5:369.000
    27->39:59.000 39->27:59.000 7->39:398.500 39->7:398.500
    30->40:78.000 40->30:78.000 29->40:79.000 40->29:79.000
    36->41:58.500 41->36:58.500 34->41:66.000 41->34:66.000
    35->42:65.125 42->35:65.125 33->42:67.625 42->33:67.625
    38->43:56.000 43->38:56.000 26->43:87.000 43->26:87.000
    43->44:36.625 44->43:36.625 39->44:63.125 44->39:63.125
    44->45:23.232 45->44:23.232 31->45:139.857 45->31:139.857
    41->46:63.875 46->41:63.875 37->46:120.875 46->37:120.875
    45->47:13.718 47->45:13.718 32->47:152.075 47->32:152.075
    42->48:79.000 48->42:79.000 40->48:82.125 48->40:82.125
    47->49:18.224 49->47:18.224 46->49:30.924 49->46:30.924
    49->50:5.191 50->49:5.191 48->50:19.865 50->48:19.865

Out[7]: 570

In [8]:

    tree2

Out[8]: "(((((((11:369.000, 6:369.000)39:56.000, (18:338.000, 17:338.000)27:87.000)44:36.625, ((26:339.500, 13:339.500)28:59.000, 8:398.500)40:63.125)45:23.232, ((23:340.000, 20:340.000)29:5.000, 14:345.000)32:139.857)46:13.718, (15:346.500, 1:346.500)33:152.075)48:18.224, (((9:363.500, 4:363.500)37:58.500, (22:356.000, 12:356.000)35:66.000)42:63.875, (16:365.000, 5:365.000)38:120.875)47:30.924)50:5.191, (((19:358.000, 2:358.000)36:65.125, (24:355.500, 7:355.500)34:67.625)43:79.000, ((21:342.000, 10:342.000)31:78.000, (25:341.000, 3:341.000)30:79.000)41:82.125)49:19.865)51"

[Figure: another rendering from FigTree]

Homework

Attempt the following problems from the UKZN-COMP710-bioinformatics course on the Rosalind website. In each case, write Julia code to solve the problem. Do not use web-based tools.

http://rosalind.info/classes/enroll/2c2d9f977b/

BA7D Implement UPGMA