# Inferring Phylogenetic Trees: Distance Approaches

## Representing distances in rooted and unrooted trees

## The distance approach to phylogenies

- Given: an $n \times n$ matrix $M$, where $M_{ij}$ is the distance between taxa $i$ and $j$.
- Problem: build an edge-weighted tree such that the distance between leaves $i$ and $j$ is as close as possible to $M_{ij}$.

## Where do we get distances?

Distances are commonly obtained from multiple sequence alignments. In the alignment of sequence $i$ with sequence $j$, let

$$f_{ij} = \frac{\#\text{mismatches}}{\#\text{matches} + \#\text{mismatches}}$$

This fraction could be used directly as a simple measure of sequence distance:

$$d_{ij} = f_{ij}$$

Or we could use the Jukes-Cantor correction for multiple substitutions at a single position:

$$d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4}{3} f_{ij}\right)$$

## Derivation of the Jukes-Cantor model

The model assumes that:

- all sites are independent and have identical mutation rates;
- all possible nucleotide substitutions occur at the same rate $\alpha$ per unit time.

A matrix can then represent the substitution rates:

|   | A | C | G | T |
|---|---|---|---|---|
| A | $1 - 3\alpha$ | $\alpha$ | $\alpha$ | $\alpha$ |
| C | $\alpha$ | $1 - 3\alpha$ | $\alpha$ | $\alpha$ |
| G | $\alpha$ | $\alpha$ | $1 - 3\alpha$ | $\alpha$ |
| T | $\alpha$ | $\alpha$ | $\alpha$ | $1 - 3\alpha$ |

Now suppose that an ancestral sequence diverged $t$ years ago into two related sequences. After this time, suppose that the fraction of identical sites between the two sequences is $q(t)$ and the fraction of different sites is $p(t)$, so that $p(0) = 0$, $q(0) = 1$, and $p(t) + q(t) = 1$ for all $t \ge 0$.

We can calculate $q(t + 1)$, the fraction of identical sites after time $t + 1$, with $\alpha$ denoting the substitution rate per unit time. There are two ways of getting an identical site at time $t + 1$:

- Two aligned identical sites both fail to mutate. The probability of this event is $(1 - 3\alpha)^2 \approx 1 - 6\alpha$, dropping the $\alpha^2$ term because $\alpha$ is small. Since a fraction $q(t)$ of sites were identical at time $t$, we expect $(1 - 6\alpha)\,q(t)$ to remain identical at time $t + 1$.
- One of two different aligned sites at time $t$ mutates to become identical to the other at time $t + 1$. The probability of this event is $2\alpha(1 - 3\alpha)\,p(t) \approx 2\alpha\,p(t)$.

Therefore, the fraction of identical sites at time $t + 1$ is

$$q(t + 1) = (1 - 6\alpha)\,q(t) + 2\alpha\,p(t)$$

Substituting $p(t) = 1 - q(t)$, this allows for estimating the derivative of $q(t)$ with respect to time as

$$\frac{dq(t)}{dt} \approx q(t + 1) - q(t) = 2\alpha - 8\alpha\,q(t)$$

Solving this differential equation subject to the initial condition $q(0) = 1$ gives

$$q(t) = \frac{1}{4}\left(1 + 3e^{-8\alpha t}\right)$$

Notice that $q(t = \infty) = \frac{1}{4}$, so this model predicts a minimum of 25% identity even when aligning unrelated nucleotide sequences.

Finally, to obtain the Jukes-Cantor correction, note that we would expect $3\alpha t$ mutations during a time $t$ for each site on each sequence. Thus, the evolutionary distance between two sequences under this model is $6\alpha t$. However, from the solution for $q(t)$:

$$6\alpha t = -\frac{3}{4}(-8\alpha t) = -\frac{3}{4}\log\left(\frac{4q(t) - 1}{3}\right) = -\frac{3}{4}\log\left(\frac{4(1 - p(t)) - 1}{3}\right) = -\frac{3}{4}\log\left(1 - \frac{4}{3}\,p(t)\right)$$

Replacing $p(t)$ by our measured deviation,

$$f_{ij} = \frac{\#\text{mismatches}}{\#\text{matches} + \#\text{mismatches}}$$

gives the Jukes-Cantor correction stated earlier:

$$d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4}{3} f_{ij}\right)$$

## The molecular clock hypothesis

Some proteins appear to evolve slowly, others rapidly. But for any given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages.
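Returning to the two sequence distances defined earlier, the raw mismatch fraction $f_{ij}$ and its Jukes-Cantor correction, a minimal Julia sketch might look as follows. It is only an illustration, not code from the course: the names `mismatchfraction` and `jcdistance` are invented here, the two sequences are assumed to be already aligned, of equal length and free of gaps, and `log` is taken to be the natural logarithm.

```julia
# Hypothetical helpers; the names are not from the slides.

# Fraction of mismatching positions between two aligned, gap-free sequences.
function mismatchfraction(a::AbstractString, b::AbstractString)
    length(a) == length(b) || error("sequences must have equal length")
    mismatches = count(((x, y),) -> x != y, zip(a, b))
    return mismatches / length(a)
end

# Jukes-Cantor corrected distance; only defined for f < 3/4.
function jcdistance(f::Real)
    f < 0.75 || error("Jukes-Cantor distance is undefined for f >= 3/4")
    return -0.75 * log(1 - 4 * f / 3)
end

f = mismatchfraction("ACGTACGTAC", "ACGTTCGTAA")   # 2 mismatches in 10 sites
d = jcdistance(f)
println("f = ", f, ", d = ", round(d, digits = 4))
```

The corrected distance comes out slightly larger than the raw fraction, reflecting substitutions that are hidden by later substitutions at the same site.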

## Ultrametric data

- The molecular clock assumption is not generally true: selection pressures vary across time periods, organisms, genes within an organism, and regions within a gene.
- If it does hold, then the data is said to be ultrametric.

## Ultrametric data condition

If your data is ultrametric, then for any triplet of sequences $(i, j, k)$, the three pairwise distances are either all equal, or two are equal and the remaining one is smaller.

## Unweighted Pair Group Method using Averages (UPGMA)

Given ultrametric data, UPGMA will reconstruct the tree $T$ that is consistent with the data. Basic idea:

- iteratively pick two clusters of taxa and merge them;
- create a new node in the tree for the merged cluster.

The distance $d_{ij}$ between clusters $C_i$ and $C_j$ of taxa is defined as the average distance between pairs of taxa from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

## UPGMA algorithm

- assign each taxon to its own cluster
- define one leaf for each taxon; place it at height 0
- while more than two clusters remain:
    - determine the two clusters $i$, $j$ with smallest $d_{ij}$
    - define a new cluster $C_k = C_i \cup C_j$
    - define a node $k$ with children $i$ and $j$; place $k$ at height $d_{ij}/2$
    - replace clusters $i$ and $j$ with cluster $k$
    - compute the distance between $k$ and every other cluster $l$: $d_{kl} = \dfrac{|C_i|\, d_{il} + |C_j|\, d_{jl}}{|C_i| + |C_j|}$
- join the last two clusters, $i$ and $j$, by a root at height $d_{ij}/2$

## UPGMA example
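A single merge step can be checked numerically. The sketch below uses a small made-up distance matrix (it is not the example from the slides) to confirm that the size-weighted update formula of the algorithm gives the same value as averaging over all pairs of taxa directly. The names `D` and `avgdist` are assumptions introduced for this illustration.

```julia
# Made-up distances between four taxa (symmetric, zero diagonal).
D = [0.0  2.0  6.0  9.0;
     2.0  0.0  6.0 11.0;
     6.0  6.0  0.0 13.0;
     9.0 11.0 13.0  0.0]

# Average pairwise distance between two clusters given as lists of taxon indices.
avgdist(D, Ci, Cj) = sum(D[p, q] for p in Ci, q in Cj) / (length(Ci) * length(Cj))

# Suppose clusters {1,2} and {3} are being merged, and {4} is another cluster.
Ci, Cj, Cl = [1, 2], [3], [4]
Ck = vcat(Ci, Cj)

# Size-weighted update formula from the algorithm.
dil = avgdist(D, Ci, Cl)
djl = avgdist(D, Cj, Cl)
dkl_update = (length(Ci) * dil + length(Cj) * djl) / (length(Ci) + length(Cj))

# Direct definition: average over all pairs between the merged cluster and {4}.
dkl_direct = avgdist(D, Ck, Cl)

println(dkl_update, " ", dkl_direct)
```

Both routes give the same number, which is why the algorithm never needs to go back to the original taxon-level distances after a merge.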

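Before running UPGMA it can be useful to test whether a distance matrix satisfies the ultrametric condition stated earlier. For every triplet, "all equal, or two equal and the third smaller" is the same as saying that the two largest of the three distances coincide. The function name `isultrametric` and its tolerance argument `tol` are assumptions of this sketch, and the two small matrices are made up.

```julia
# Hypothetical helper: three-point check of the ultrametric condition.
function isultrametric(D::AbstractMatrix; tol = 1e-9)
    n = size(D, 1)
    for i in 1:n, j in i+1:n, k in j+1:n
        a, b, c = sort([D[i, j], D[i, k], D[j, k]])
        # the two largest distances of the triplet must coincide
        abs(c - b) <= tol || return false
    end
    return true
end

U = [0.0 2.0 6.0; 2.0 0.0 6.0; 6.0 6.0 0.0]   # distances 2, 6, 6
V = [0.0 2.0 6.0; 2.0 0.0 7.0; 6.0 7.0 0.0]   # distances 2, 6, 7

println(isultrametric(U))   # true
println(isultrametric(V))   # false
```

The check visits every triplet, so it costs on the order of $n^3$ comparisons; that is fine for matrices of the size used in this section.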

## Newick format for phylogenetic trees

[Figure: an example phylogenetic tree]

This tree can be represented via an integer $n$ followed by the adjacency list of a weighted tree with $n$ leaves:

    4
    A->F:0.1
    B->F:0.2
    C->E:0.3
    D->E:0.4
    E->F:0.5

The tree can also be represented as Newick strings:

- `(,,(,));` (no names)
- `(A,B,(C,D));` (leaves are named)
- `(A,B,(C,D)E)F;` (leaves and internal nodes are named)
- `(:0.1,:0.2,(:0.3,:0.4):0.5):0.0;` (distance to parent)
- `(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);` (distance and leaf names)
- `(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;` (distance and all node names)

## Julia code for UPGMA trees

First we need a function to read in a single integer $n$ followed by an $n \times n$ distance matrix:

    function getdistmatrix(fn)
        # read the whole file as a string
        data = read(fn, String)
        # use split to generate a list of tokens
        # and filter to get rid of empty tokens
        tokens = filter(!isempty, split(data))
        # convert the tokens to Float64
        nums = [parse(Float64, t) for t in tokens]
        # strip the first number (n) and reshape the rest;
        # reshape fills column by column, which is harmless here
        # because a distance matrix is symmetric
        n = round(Int, nums[1])
        dm = reshape(nums[2:end], (n, n))
        return dm
    end

## Julia code for UPGMA

The function below returns the tree as a Newick string.

    using LinearAlgebra   # provides I, the identity scaling
    using Printf          # provides @sprintf

    function upgma(dm)
        n = size(dm, 1)
        # the first nodes are labelled 1..n and each node is placed in
        # its own cluster; heights of leaf nodes are all set to zero;
        # the Newick list starts off as the list of leaf labels
        clusters = Vector{Int}[]
        newick = String[]
        heights = Float64[]
        nodes = Int[]
        for i in 1:n
            push!(clusters, [i])
            push!(newick, "$i")
            push!(nodes, i - 1)
            push!(heights, 0.0)
        end
        # the next node to generate has label n+1
        next = n + 1

        # enter the while loop; each pass merges two clusters
        while n > 1
            # first add 2 * max to the diagonal zeros before finding
            # the indices and value of the minimum distance
            max = maximum(dm)
            dme = dm + 2 * max * I
            (min, ind) = findmin(dme)
            # store the indices of the minimum as row and col
            (row, col) = Tuple(ind)

            # compute weights for generating distances to the new cluster
            ncr = length(clusters[row])
            ncc = length(clusters[col])
            # apply the distance-to-new-cluster formula and append
            # the new row and new column to the distance matrix
            newrow = (ncr * dm[row, :] + ncc * dm[col, :]) / (ncr + ncc)
            dm = vcat(dm, newrow')
            newcol = (ncr * dm[:, row] + ncc * dm[:, col]) / (ncr + ncc)
            dm = hcat(dm, newcol)
            # set the diagonal element of the new row and column to zero
            dm[n+1, n+1] = 0.0

            # append the new cluster to the cluster list
            push!(clusters, vcat(clusters[row], clusters[col]))
            # compute the height of the new cluster and generate
            # the Newick representation for the new cluster
            h = min / 2
            hr = @sprintf(":%.3f", h - heights[row])
            hc = @sprintf(":%.3f", h - heights[col])
            newnode = "(" * newick[row] * hr * ", " * newick[col] * hc * ")$next"
            # append the new Newick representation, the new height
            # and the new node name to the appropriate lists
            push!(newick, newnode)
            push!(heights, h)
            push!(nodes, next - 1)

            # use deleteat! to remove the row and col items from each list
            # (deleteat! expects the indices in increasing order)
            idx = sort([row, col])
            deleteat!(clusters, idx)
            deleteat!(newick, idx)
            deleteat!(heights, idx)
            deleteat!(nodes, idx)

            # finally remove rows and columns row and col
            # from the distance matrix
            keep = setdiff(1:n+1, idx)
            dm = dm[keep, keep]
            # by now n should have dropped by one
            n = size(dm, 1)
            # increment the next node label
            next = next + 1
        end

        # after the while loop there should be exactly one string left
        # in the newick list, representing the whole tree; return it
        return newick[1]
    end

Example run on a four-taxon distance matrix:

    In [1]: # the code is stored locally, let's try it out
            # note that print statements have been added
            # to generate the adjacency list required by Rosalind
            include("code/upgma.jl")
            tree = upgma(getdistmatrix("data/dm1.txt"))

    3->4:5.000
    4->3:5.000
    2->4:5.000
    4->2:5.000
    4->5:2.000
    5->4:2.000
    0->5:7.000
    5->0:7.000
    5->6:1.833
    6->5:1.833
    1->6:8.833
    6->1:8.833

    Out[1]: "(((4:5.000, 3:5.000)5:2.000, 1:7.000)6:1.833, 2:8.833)7"

    In [2]: # write the Newick string to a file for viewing with FigTree
            open("data/tr1.tree", "w") do f
                write(f, tree)
            end

    Out[2]: 55

[Figure: a rendering of the tree from FigTree]
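The print statements that produce the adjacency list are not shown in the slides. One possible way to format such lines is sketched below, assuming that each tree edge is reported once in each direction with the length printed to three decimals; the helper name `edgelines` and its arguments are invented for this illustration and are not the author's code.

```julia
using Printf

# Hypothetical helper: the two adjacency-list lines for one undirected edge.
function edgelines(child::Integer, parent::Integer, len::Real)
    return [@sprintf("%d->%d:%.3f", child, parent, len),
            @sprintf("%d->%d:%.3f", parent, child, len)]
end

for line in edgelines(3, 4, 5.0)
    println(line)
end
```

Inside a UPGMA loop, a helper like this would be called twice per merge, once for each child of the new node, with the branch length taken as the difference between the new node's height and the child's height.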

    In [7]: tree2 = upgma(getdistmatrix("data/dm2.txt"))
            open("data/tr2.tree", "w") do f
                write(f, tree2)
            end

[Adjacency-list output omitted: one `i->j:length` line per directed edge of the 51-node tree.]

    In [8]: tree2

    Out[8]: "(((((((11: , 6: )39:56.000, (18: , 17: )27:87.000)44:36.625, ((26: , 13: )28:59.000, 8: )40:63.125)45:23.232, ((23: , 20: )29:5.000, 14: )32: )46:13.718, (15: , 1: )33: )48:18.224, (((9: , 4: )37:58.500, (22: , 12: )35:66.000)42:63.875, (16: , 5: )38: )47:30.924)50:5.191, (((19: , 2: )36:65.125, (24: , 7: )34:67.625)43:79.000, ((21: , 10: )31:78.000, (25: , 3: )30:79.000)41: )49:19.865)51"

[Several branch lengths are blank in this copy of the output.]

[Figure: another rendering from FigTree]
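A quick sanity check on a Newick string returned by `upgma` is to count its nodes: a rooted binary tree on $n$ leaves has $n - 1$ internal nodes, hence $2n - 1$ nodes in total, which is why the root of the 26-taxon tree carries the label 51. The sketch below applies this check to the four-taxon string from the first run. It assumes the output format shown in this section, where a leaf label directly follows `(` or `,` and every `)` closes exactly one internal node; `newickleaves` is a hypothetical name.

```julia
# Hypothetical helper: leaf labels of a Newick string in upgma's output format.
function newickleaves(s::AbstractString)
    return [parse(Int, m.captures[1]) for m in eachmatch(r"[(,]\s*(\d+):", s)]
end

tree = "(((4:5.000, 3:5.000)5:2.000, 1:7.000)6:1.833, 2:8.833)7"
leaves = newickleaves(tree)
ninternal = count(c -> c == ')', tree)   # one closing parenthesis per internal node

println(leaves)                          # [4, 3, 1, 2]
println(length(leaves) + ninternal)      # 7, i.e. 2n - 1 for n = 4
```

This is not a Newick parser: it would miscount strings with named leaves, quoted labels or multifurcating nodes, so it is only suited to strings in the format produced above.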

## Homework

Attempt the following problems from the UKZN-COMP710-bioinformatics course on the Rosalind website. In each case, write Julia code to solve the problem. Do not use web-based tools.

- BA7D: Implement UPGMA

