Inferring Phylogenetic Trees: Distance Approaches

Representing distances in rooted and unrooted trees

The distance approach to phylogenies

given: an $n \times n$ matrix $M$ where $M_{ij}$ is the distance between taxa $i$ and $j$
problem: build an edge-weighted tree such that the distances between leaves $i$ and $j$ are as close as possible to $M_{ij}$
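To make the problem statement concrete, here is a minimal sketch that computes the leaf-to-leaf path lengths of a small edge-weighted tree and scores them against a matrix M. The example tree, the helper names (`adjacency`, `path_lengths`, `tree_distances`, `sse`) and the choice of a sum of squared differences as the measure of "as close as possible" are assumptions made for illustration; the slides do not fix a particular criterion.

```julia
# Illustrative tree given as (node, node, weight) edges; leaves are A, B, C, D.
edges = [("A", "F", 0.1), ("B", "F", 0.2), ("C", "E", 0.3),
         ("D", "E", 0.4), ("E", "F", 0.5)]

# Build an undirected adjacency map: node => list of (neighbour, weight).
function adjacency(edges)
    adj = Dict{String,Vector{Tuple{String,Float64}}}()
    for (u, v, w) in edges
        push!(get!(adj, u, Tuple{String,Float64}[]), (v, w))
        push!(get!(adj, v, Tuple{String,Float64}[]), (u, w))
    end
    return adj
end

# Path length from src to every node (depth-first walk; a tree has no cycles).
function path_lengths(adj, src)
    dist = Dict(src => 0.0)
    stack = [src]
    while !isempty(stack)
        u = pop!(stack)
        for (v, w) in adj[u]
            if !haskey(dist, v)
                dist[v] = dist[u] + w
                push!(stack, v)
            end
        end
    end
    return dist
end

# Leaf-to-leaf distance matrix implied by the tree.
function tree_distances(edges, leaves)
    adj = adjacency(edges)
    D = zeros(length(leaves), length(leaves))
    for (i, a) in enumerate(leaves)
        d = path_lengths(adj, a)
        for (j, b) in enumerate(leaves)
            D[i, j] = d[b]
        end
    end
    return D
end

# Assumed closeness measure: sum of squared differences over pairs i < j.
sse(D, M) = sum((D[i, j] - M[i, j])^2 for i in 1:size(M, 1) for j in i+1:size(M, 2))

leaves = ["A", "B", "C", "D"]
D = tree_distances(edges, leaves)
```

If M is set to the path lengths of the tree itself, `sse(D, M)` is zero; any other candidate tree would be scored by how far its path lengths depart from M.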

Where do we get distances?

Commonly obtained from multiple sequence alignments. In the alignment of sequence $i$ with sequence $j$, let

$$f_{ij} = \frac{\#\text{mismatches}}{\#\text{matches} + \#\text{mismatches}}$$

Then this could be used as a simple measure of sequence distance:

$$d_{ij} = f_{ij}$$

Or we could use the Jukes-Cantor correction for multiple substitutions at a single position:

$$d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4}{3} f_{ij}\right)$$

Derivation of the Jukes-Cantor model

The model assumes that all sites are independent and have identical mutation rates.
It also assumes that all possible nucleotide substitutions occur at the same rate $\alpha$ per unit time.

A matrix can then represent the substitution rates:

$$\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & 1-3\alpha & \alpha & \alpha & \alpha \\
C & \alpha & 1-3\alpha & \alpha & \alpha \\
G & \alpha & \alpha & 1-3\alpha & \alpha \\
T & \alpha & \alpha & \alpha & 1-3\alpha
\end{array}$$

Now suppose that an ancestral sequence diverged $t$ years ago into two related sequences. After this time, suppose that the fraction of identical sites between the two sequences is $q(t)$, and the fraction of different sites is $p(t)$, so that $p(0) = 0$ and $q(0) = 1$, and $p(t) + q(t) = 1$, $t > 0$.

We can calculate $q(t+1)$, the fraction of identical sites after time $t+1$. There are two ways of getting an identical site at time $t+1$:

Two aligned sites not mutating: the probability of this event is $(1-3\alpha)^2 \approx 1-6\alpha$ (the $\alpha^2$ term is dropped because $\alpha$ is small). Since a fraction $q(t)$ of sites were identical at time $t$, we expect a fraction $(1-6\alpha)\,q(t)$ to remain identical at time $t+1$.

One of two different aligned sites at time $t$ mutates to become identical to the other at time $t+1$: the expected fraction of such sites is $2\alpha(1-3\alpha)\,p(t) \approx 2\alpha\,p(t)$.

Therefore, the fraction of identical sites at time $t+1$ is:

$$q(t+1) = (1-6\alpha)\,q(t) + 2\alpha\,p(t)$$

Substituting $p(t) = 1 - q(t)$, this allows for estimating the derivative of $q(t)$ with time as:

$$\frac{dq(t)}{dt} \approx q(t+1) - q(t) = 2\alpha - 8\alpha\,q(t)$$

Solving this differential equation subject to the initial condition $q(0) = 1$ gives:

$$q(t) = \frac{1}{4}\left(1 + 3e^{-8\alpha t}\right)$$

Notice that $q(t = \infty) = \frac{1}{4}$, so this model predicts a minimum of 25% identity even when aligning unrelated nucleotide sequences.

Finally, to obtain the Jukes-Cantor correction, we note that we would expect $3\alpha t$ mutations during a time $t$ for each sequence site on each sequence. Thus, the evolutionary distance between two sequences under this model is $6\alpha t$. However:

$$6\alpha t = -\frac{3}{4}(-8\alpha t) = -\frac{3}{4}\log\left(\frac{4q(t)-1}{3}\right) = -\frac{3}{4}\log\left(\frac{4(1-p(t))-1}{3}\right) = -\frac{3}{4}\log\left(1-\frac{4}{3}\,p(t)\right)$$

Replacing $p(t)$ by our measured deviation,

$$f_{ij} = \frac{\#\text{mismatches}}{\#\text{matches} + \#\text{mismatches}}$$

gives the Jukes-Cantor correction stated earlier:

$$d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4}{3} f_{ij}\right)$$

The molecular clock hypothesis

Some proteins appear to evolve slowly, others rapidly. But for any given protein, the rate of molecular evolution is approximately constant in all evolutionary lineages.
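The Jukes-Cantor derivation above can be checked numerically. The following minimal sketch computes the mismatch fraction and its Jukes-Cantor correction, and compares the iterated recurrence for q(t) with the closed-form solution. The function names (`pdistance`, `jukescantor`, `q_closed`, `q_recurrence`), the example sequences and the value of α used for the check are illustrative assumptions, not part of the slides; gaps in the alignment are not treated specially.

```julia
# Fraction of mismatching positions between two aligned, equal-length sequences.
# Assumption: plain ASCII sequences, gaps are compared like any other character.
function pdistance(a::AbstractString, b::AbstractString)
    length(a) == length(b) || error("sequences must be aligned to the same length")
    mismatches = count(((x, y),) -> x != y, zip(a, b))
    return mismatches / length(a)
end

# Jukes-Cantor corrected distance d = -(3/4) log(1 - (4/3) f).
# The logarithm is only defined for f < 3/4; Inf is returned otherwise.
jukescantor(f) = f < 0.75 ? -0.75 * log(1 - 4f / 3) : Inf

# Closed-form fraction of identical sites, q(t) = (1 + 3 exp(-8αt)) / 4.
q_closed(t, α) = (1 + 3 * exp(-8α * t)) / 4

# Iterate q(t+1) = (1 - 6α) q(t) + 2α p(t), with p = 1 - q, starting from q(0) = 1.
function q_recurrence(steps, α)
    q = 1.0
    for _ in 1:steps
        q = (1 - 6α) * q + 2α * (1 - q)
    end
    return q
end

f = pdistance("ACGTACGTAC", "ACGTTCGTAA")   # 2 mismatches out of 10, so f = 0.2
d = jukescantor(f)                          # corrected distance, larger than f
```

For a small rate such as α = 1e-4, `q_recurrence(1000, 1e-4)` and `q_closed(1000, 1e-4)` agree closely, and `q_closed` tends to 1/4 for large t, matching the 25% identity floor noted above.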

ultrametric data

the molecular clock assumption is not generally true: selection pressures vary across
  time periods
  organisms
  genes within an organism
  regions within a gene
if it does hold, then the data is said to be ultrametric

ultrametric data condition

if your data is ultrametric, then for any triplet of sequences $(i, j, k)$, the three pairwise distances are either all equal, or two are equal and the remaining one is smaller.

Unweighted Pair Group Method using Averages (UPGMA)

given ultrametric data, UPGMA will reconstruct the tree $T$ that is consistent with the data.
basic idea:

  iteratively pick two taxa clusters and merge them
  create a new node in the tree for the merged cluster

the distance $d_{ij}$ between clusters $C_i$ and $C_j$ of taxa is defined as the average distance between pairs of taxa from each cluster:

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

UPGMA algorithm

assign each taxon to its own cluster
define one leaf for each taxon; place it at height 0
while more than two clusters
  determine two clusters $i$, $j$ with smallest $d_{ij}$
  define a new cluster $C_k = C_i \cup C_j$
  define a node $k$ with children $i$ and $j$; place $k$ at height $d_{ij}/2$
  replace clusters $i$ and $j$ with cluster $k$
  compute distance between $k$ and other clusters:

  $$d_{kl} = \frac{|C_i|\,d_{il} + |C_j|\,d_{jl}}{|C_i| + |C_j|}$$

join last two clusters, $i$ and $j$, by root at height $d_{ij}/2$

UPGMA example
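As a small illustration of the ultrametric condition and the cluster-distance formulas above, here is a minimal sketch on a hypothetical 4-taxon matrix. The matrix values and the helper names (`isultrametric`, `avgdist`, `merged_distance`) are made up for this example and do not come from the slides.

```julia
# Three-point condition: in every triplet the two largest distances are equal.
function isultrametric(D; tol = 1e-9)
    n = size(D, 1)
    for i in 1:n, j in i+1:n, k in j+1:n
        a, b, c = sort([D[i, j], D[i, k], D[j, k]])
        abs(c - b) <= tol || return false
    end
    return true
end

# Average distance between clusters Ci and Cj (vectors of taxon indices).
avgdist(D, Ci, Cj) = sum(D[p, q] for p in Ci for q in Cj) / (length(Ci) * length(Cj))

# Size-weighted update d_kl for the merged cluster Ck = Ci ∪ Cj against cluster Cl.
function merged_distance(D, Ci, Cj, Cl)
    d_il = avgdist(D, Ci, Cl)
    d_jl = avgdist(D, Cj, Cl)
    return (length(Ci) * d_il + length(Cj) * d_jl) / (length(Ci) + length(Cj))
end

# Hypothetical ultrametric distances between taxa 1..4.
D = [ 0.0  2.0  6.0 10.0
      2.0  0.0  6.0 10.0
      6.0  6.0  0.0 10.0
     10.0 10.0 10.0  0.0]

# Step 1: taxa 1 and 2 are closest (distance 2), so they merge at height 1.
d_12_3 = merged_distance(D, [1], [2], [3])      # distance from {1,2} to {3}
# Step 2: {1,2} and {3} merge next; distance from {1,2,3} to {4}:
d_123_4 = merged_distance(D, [1, 2], [3], [4])
```

The weighted update gives the same value as averaging over all pairs directly, for example `merged_distance(D, [1, 2], [3], [4])` equals `avgdist(D, [1, 2, 3], [4])`, which is why the algorithm never needs to revisit the original taxon-level distances.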

Newick format for phylogenetic trees

An example phylogenetic tree

This tree can be represented via an integer $n$ followed by the adjacency list of a weighted tree with $n$ leaves:

    4
    A->F:0.1
    B->F:0.2
    C->E:0.3
    D->E:0.4
    E->F:0.5

The tree can also be represented as Newick strings:

    (,,(,));                                 (no names)
    (A,B,(C,D));                             (leaves are named)
    (A,B,(C,D)E)F;                           (leaves and internal nodes are named)
    (:0.1,:0.2,(:0.3,:0.4):0.5):0.0;         (distance to parent)
    (A:0.1,B:0.2,(C:0.3,D:0.4):0.5);         (distance and leaf names)
    (A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;       (distance and all node names)

Julia code for UPGMA trees

First we need a function to read in a distance matrix: a function to read in a single integer n followed by an n x n distance matrix.

    function getdistmatrix(fn)
        # read the whole file as a string
        data = read(fn, String)
        # use split to generate a list of tokens
        # (split with no delimiter splits on whitespace and drops empty tokens)
        tokens = split(data)
        # convert tokens to Float64
        nums = parse.(Float64, tokens)
        # strip the first and reshape the rest
        n = round(Int, nums[1])
        nums = nums[2:end]
        # reshape fills column by column, so a file written row by row gives
        # the transpose; for a symmetric distance matrix that is the same matrix
        dm = reshape(nums, (n, n))
        return dm
    end

Julia code for UPGMA
returns tree as Newick string

    using LinearAlgebra   # provides I, the identity used in place of eye()
    using Printf          # provides @sprintf

    function upgma(dm)
        n = size(dm, 1)
        # first nodes are labeled 1..n
        # and each node is placed in a cluster
        # heights of leaf nodes are all set to zero
        # newick string starts off as a list of nodes
        clusters = Vector{Int}[]
        newick = []
        heights = []
        nodes = []
        for i in 1:n
            push!(clusters, [i])
            newick = vcat(newick, "$i")
            nodes = vcat(nodes, "$(i-1)")
            heights = vcat(heights, 0)
        end
        # next node to generate has label n+1
        next = n + 1

        # enter while loop, and each time merge two clusters
        while n > 1
            # first add 2 * max to the diagonal zeros
            # before finding the indices
            # and value of the minimum distance
            # (max and min are local variables here; they shadow the Base functions)
            (max, ind) = findmax(dm)
            dme = dm + 2 * max * I
            (min, ind) = findmin(dme)
            # store indices of the min as row and col
            # (findmin on a matrix returns a CartesianIndex)
            row, col = Tuple(ind)

            # continue while loop:
            # compute weights for generating distances to new cluster
            ncr = length(clusters[row])
            ncc = length(clusters[col])
            # get distance to new cluster formula
            # and append new row and new column
            # to distance matrix
            newrow = (ncr * dm[row, :] + ncc * dm[col, :]) / (ncr + ncc)
            dm = vcat(dm, newrow')
            newcol = (ncr * dm[:, row] + ncc * dm[:, col]) / (ncr + ncc)
            dm = hcat(dm, newcol)
            # set the diagonal element of new
            # row and new col to zero
            dm[n+1, n+1] = 0.0

            # continue while loop

            # append the new cluster to cluster list
            push!(clusters, vcat(clusters[row], clusters[col]))
            # compute height for the new cluster
            # and generate the Newick representation
            # for the new cluster
            h = min / 2
            hr = ":$(@sprintf("%.3f", (h - heights[row])))"
            hc = ":$(@sprintf("%.3f", (h - heights[col])))"
            newnode = "(" * newick[row] * hr * ", " * newick[col] * hc * ")$next"
            # append the new newick rep,
            # the new height and the new node name
            # to the appropriate lists
            newick = vcat(newick, newnode)
            heights = vcat(heights, h)
            nodes = vcat(nodes, next - 1)

            # continue while loop
            # make use of deleteat! to remove
            # row and col items from each list
            # (deleteat! needs the indices in increasing order)
            if (row < col)
                deleteat!(clusters, [row, col])
                deleteat!(newick, [row, col])
                deleteat!(heights, [row, col])
                deleteat!(nodes, [row, col])
            else
                deleteat!(clusters, [col, row])
                deleteat!(newick, [col, row])
                deleteat!(heights, [col, row])
                deleteat!(nodes, [col, row])
            end

            # complete the while loop
            # finally remove row and col
            # rows and columns
            # from the distance matrix
            dm = dm[setdiff(1:n+1, [row, col]), :]
            dm = dm[:, setdiff(1:n+1, [row, col])]
            # by now n should drop by one.
            n = size(dm, 1)
            # increment the next node label
            next = next + 1
        end

        # return the Newick string

        # after the while loop
        # there should be one string in the
        # newick list representing
        # the whole tree, return it!
        newick[1]
    end

In [1]:
    # the code is stored locally, let's try it out
    # note that print statements have been added
    # to generate the adjacency list required by Rosalind.
    include("code/upgma.jl")
    tree = upgma(getdistmatrix("data/dm1.txt"))

    3->4:5.000
    4->3:5.000
    2->4:5.000
    4->2:5.000
    4->5:2.000
    5->4:2.000
    0->5:7.000
    5->0:7.000
    5->6:1.833
    6->5:1.833
    1->6:8.833
    6->1:8.833

Out[1]: "(((4:5.000, 3:5.000)5:2.000, 1:7.000)6:1.833, 2:8.833)7"

In [2]:
    # write the Newick string to a file for viewing with FigTree
    open("data/tr1.tree", "w") do f
        write(f, tree)
    end

Out[2]: 55

A rendering from FigTree
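The adjacency list and the Newick string shown above describe the same tree. As a rough illustration of how the two relate, the following sketch converts a Newick string into a list of (child, parent, length) edges. The function name `newick_edges` is hypothetical, and the sketch assumes that every internal node is named, as in the `upgma` output; unnamed internal nodes and quoted labels are not handled.

```julia
using Printf

# Convert a Newick string with named internal nodes into (child, parent, length) edges.
# Assumption: every ")" is followed by a node label.
function newick_edges(s::AbstractString)
    edges = Tuple{String,String,Float64}[]
    # one list of (name, length) children per currently open parenthesis
    open_groups = Vector{Tuple{String,Float64}}[]
    closing = false
    # tokens are "(", ")", "," or a label such as "A:0.1"; ";" and spaces are skipped
    for m in eachmatch(r"[(),]|[^(),;\s]+", s)
        tok = m.match
        if tok == "("
            push!(open_groups, Tuple{String,Float64}[])
        elseif tok == ","
            continue
        elseif tok == ")"
            closing = true          # the next label names the group just closed
        else
            parts = split(tok, ':')
            name = String(parts[1])
            len = length(parts) > 1 ? parse(Float64, parts[2]) : 0.0
            if closing
                # this label is the parent of everything in the group just closed
                for (child, l) in pop!(open_groups)
                    push!(edges, (child, name, l))
                end
                closing = false
            end
            # attach this node to its enclosing group, unless it is the root
            isempty(open_groups) || push!(open_groups[end], (name, len))
        end
    end
    return edges
end

edges = newick_edges("(((4:5.000, 3:5.000)5:2.000, 1:7.000)6:1.833, 2:8.833)7")
for (child, parent, l) in edges
    @printf("%s->%s:%.3f\n", child, parent, l)
end
```

The labels printed here are the 1-based names from the Newick string, whereas the adjacency list printed by the notebook uses 0-based labels and lists each edge in both directions.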

In [7]:
    tree2 = upgma(getdistmatrix("data/dm2.txt"))
    open("data/tr2.tree", "w") do f
        write(f, tree2)
    end

    17->26:338.000   26->17:338.000
    16->26:338.000   26->16:338.000
    25->27:339.500   27->25:339.500
    12->27:339.500   27->12:339.500
    22->28:340.000   28->22:340.000
    19->28:340.000   28->19:340.000
    24->29:341.000   29->24:341.000
    2->29:341.000    29->2:341.000
    20->30:342.000   30->20:342.000
    9->30:342.000    30->9:342.000
    28->31:5.000     31->28:5.000
    13->31:345.000   31->13:345.000
    14->32:346.500   32->14:346.500
    0->32:346.500    32->0:346.500
    23->33:355.500   33->23:355.500
    6->33:355.500    33->6:355.500
    21->34:356.000   34->21:356.000
    11->34:356.000   34->11:356.000
    18->35:358.000   35->18:358.000
    1->35:358.000    35->1:358.000
    8->36:363.500    36->8:363.500
    3->36:363.500    36->3:363.500
    15->37:365.000   37->15:365.000
    4->37:365.000    37->4:365.000
    10->38:369.000   38->10:369.000
    5->38:369.000    38->5:369.000
    27->39:59.000    39->27:59.000
    7->39:398.500    39->7:398.500
    30->40:78.000    40->30:78.000
    29->40:79.000    40->29:79.000
    36->41:58.500    41->36:58.500

    34->41:66.000    41->34:66.000
    35->42:65.125    42->35:65.125
    33->42:67.625    42->33:67.625
    38->43:56.000    43->38:56.000
    26->43:87.000    43->26:87.000
    43->44:36.625    44->43:36.625
    39->44:63.125    44->39:63.125
    44->45:23.232    45->44:23.232
    31->45:139.857   45->31:139.857
    41->46:63.875    46->41:63.875
    37->46:120.875   46->37:120.875
    45->47:13.718    47->45:13.718
    32->47:152.075   47->32:152.075
    42->48:79.000    48->42:79.000
    40->48:82.125    48->40:82.125
    47->49:18.224    49->47:18.224
    46->49:30.924    49->46:30.924
    49->50:5.191     50->49:5.191
    48->50:19.865    50->48:19.865

Out[7]: 570

In [8]:
    tree2

Out[8]: "(((((((11:369.000, 6:369.000)39:56.000, (18:338.000, 17:338.000)27:87.000)44:36.625, ((26:339.500, 13:339.500)28:59.000, 8:398.500)40:63.125)45:23.232, ((23:340.000, 20:340.000)29:5.000, 14:345.000)32:139.857)46:13.718, (15:346.500, 1:346.500)33:152.075)48:18.224, (((9:363.500, 4:363.500)37:58.500, (22:356.000, 12:356.000)35:66.000)42:63.875, (16:365.000, 5:365.000)38:120.875)47:30.924)50:5.191, (((19:358.000, 2:358.000)36:65.125, (24:355.500, 7:355.500)34:67.625)43:79.000, ((21:342.000, 10:342.000)31:78.000, (25:341.000, 3:341.000)30:79.000)41:82.125)49:19.865)51"

Another rendering from FigTree

Homework

Attempt the following problems from the UKZN-COMP710-bioinformatics course on the Rosalind website. In each case write Julia code to solve the problem. Do not use web-based tools.

http://rosalind.info/classes/enroll/2c2d9f977b/

BA7D Implement UPGMA