Chapter 3: Phylogenetics
3. Computing Phylogeny
Prof. Yechiam Yemini (YY)
Computer Science Department
Columbia University

Overview
  Computing trees
  Distance-based techniques
  Maximal Parsimony (MP) techniques
  Maximum likelihood techniques
This chapter is based on Durbin, Chapter 7
Also recommended: The Phylogenetic Handbook, Salemi and Vandamme

Can We Tell Evolution From Homology?
  [Figure: gene trees under duplication, speciation, and partial sampling, and the phylogeny inferred from each]
  How do we tell the right tree?

Phylogeny: Computing Trees
  INPUT: a set of aligned sequences, one per taxon
  OUTPUT: a phylogenetic tree over the taxa

Brute Force Approach
  Brute force:
    o Enumerate all trees
    o Compute some measure of evolutionary likelihood for each
    o Select the best tree
  How many rooted trees are there with n leaves?
    o n=2 leaves => 1 tree
    o n=3 leaves => attach the 3rd leaf to any of 3 edges => 3 trees
    o Let T(n) = # rooted trees with n leaves; E(n) = # edges (counting an edge above the root)
    o T(2)=1, E(2)=3; T(3)=3, E(3)=5
    o Adding a leaf creates two new edges => E(n) = E(n-1) + 2 => E(n) = 2n-1
    o $T(n) = T(n-1) \cdot E(n-1) = T(n-1) \cdot (2n-3)$ => $T(n) = 1 \cdot 3 \cdot 5 \cdots (2n-3)$
    o The count explodes: T(10) = 34,459,425 and T(20) is about $8 \times 10^{21}$

Approaches
  Distance based: the tree should best model an evolutionary distance metric among taxa
  Character based [Maximal Parsimony (MP)]: the tree should minimize changes
  Maximum likelihood (ML): the tree should maximize the likelihood of changes
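
The recurrence for T(n) can be checked numerically. The following is a minimal sketch; the function name `num_rooted_trees` is chosen here for illustration and does not come from the slides.

```python
def num_rooted_trees(n: int) -> int:
    """Count rooted binary trees with n labeled leaves: 1*3*5*...*(2n-3)."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    count = 1
    for leaves in range(3, n + 1):
        # A tree with leaves-1 leaves has 2*(leaves-1)-1 = 2*leaves-3 edges
        # (counting the edge above the root); the new leaf may attach to any of them.
        count *= 2 * leaves - 3
    return count


for n in (2, 3, 4, 10, 20):
    print(n, num_rooted_trees(n))
```

Even at 20 taxa the count is far beyond exhaustive enumeration, which is why the rest of the chapter turns to heuristics.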

I. Distance Based Techniques
  Key idea:
    o Compute an evolutionary distance metric D among the taxa S
    o Compute a tree on S that best fits the distances
  Formally:
    o Given: an n x n distance matrix D
    o Compute: a weighted tree T on n leaves that best fits D
  How to establish evolutionary distance measures?
    o Distance ~ number of changes
    o Next chapter: evaluating distance using Markovian evolution models
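
Is There a Tree That Perfectly Fits D?
  Not every distance metric can be modeled by a tree
  How can we tell which distance metrics model a tree?

The Four-Point Condition
  A distance matrix D corresponding to a tree is called additive
  THEOREM: D is additive if and only if, for every four indices i, j, k, l, the maximum and median of the three pairwise sums are identical. With the indices labeled so that the first sum is the smallest:
    $D_{ij} + D_{kl} \le D_{ik} + D_{jl} = D_{il} + D_{jk}$
  This suggests how to connect the points into a tree that fits D: the pairs (i,j) and (k,l) with the smallest sum sit on opposite sides of the internal edge
  [Figure: quartet tree with i, j on one side of the internal edge and k, l on the other]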
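
The four-point condition is easy to test directly. The sketch below assumes D is a symmetric matrix stored as a list of lists; the function name `is_additive` and both toy matrices are made up for illustration.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted((D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]))
        # The two largest pairwise sums must coincide.
        if sums[2] - sums[1] > tol:
            return False
    return True


# Toy quartet ((0,1),(2,3)) with unit edge lengths, and a distorted copy of it.
tree_like = [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]
distorted = [row[:] for row in tree_like]
distorted[0][3] = distorted[3][0] = 4

print(is_additive(tree_like), is_additive(distorted))
```

The check visits every quadruple, so it costs O(n^4) time; that is fine for small matrices but it is not meant as an efficient additivity test.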

How Do We Handle Non-Additive D?
  Additive metrics are very useful
    o They provide a perfect fit with a tree model; the tree is easily computed from D
  But evolutionary distance metrics are often non-additive
  How do we handle a non-additive metric?
    o Fitch & Margoliash: find a tree T that minimizes the least-squares fit
        $E(T) = \sum_{i,j} (d_{ij}(T) - D_{ij})^2$
      where $d_{ij}(T)$ is the path length between leaves i and j in T
    o This problem is NP-hard, so heuristics are needed
    o Fitch & Margoliash (1967): exhaustive search

Closest-Pair Clustering
  Idea: use D to guide closest-pair clustering
  Extend D to clusters by UPGMA/WPGMA averaging

UPGMA Algorithm
  Initialization
    o Initialize n clusters $C_i = \{S_i\}$
    o Initialize T with a leaf for each cluster $C_i$
  Iteration
    o Find $C_i$, $C_j$ with the smallest distance $D_{ij}$
    o Create a new cluster $C_k = C_i \cup C_j$
    o Add a new node to T for $C_k$ and connect it to $C_i$, $C_j$
    o If all nodes are connected into a tree, exit; otherwise place node k at height $D_{ki} = D_{kj} = D_{ij}/2$ (each branch length is this height minus the height of the child) and compute the distances $D_{kl}$ to all other clusters $C_l$:
        $D_{kl} = \dfrac{D_{il}\,|C_i| + D_{jl}\,|C_j|}{|C_i| + |C_j|}$
    o Repeat the iteration

UPGMA: Molecular Clock Property
  Uniform distance from the root to all leaves
  Distance to the root ~ evolutionary clock
  Species are assumed to take identical time to evolve
  [Figure: UPGMA tree with all leaves at the same distance from the root]
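
The UPGMA iteration can be sketched in a few lines. This is an illustrative sketch rather than the slides' own code: the function name `upgma`, the nested-tuple tree representation, and the three-taxon toy matrix are all assumptions made here.

```python
def upgma(D, names):
    """Cluster with UPGMA; return (tree as nested tuples, height of the root)."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    # Distances are keyed by (larger id, smaller id).
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i)}
    height = 0.0
    next_id = n

    def d(a, b):
        return dist[(max(a, b), min(a, b))]

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Closest pair of current clusters.
        i, j = min(((a, b) for a in ids for b in ids if a < b), key=lambda p: d(*p))
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        height = d(i, j) / 2  # the new node sits at height D_ij / 2
        k, next_id = next_id, next_id + 1
        for l in ids:
            if l not in (i, j):
                # Size-weighted average of the two merged rows.
                dist[(k, l)] = (d(i, l) * size_i + d(j, l) * size_j) / (size_i + size_j)
        del clusters[i], clusters[j]
        clusters[k] = ((tree_i, tree_j), size_i + size_j)

    (tree, _), = clusters.values()
    return tree, height


# Toy ultrametric matrix: A and B are close, C is equally far from both.
toy = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]
print(upgma(toy, ["A", "B", "C"]))
```

The pair search rescans all clusters on every merge, so this version is cubic in n; it favors readability over speed.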

Notes
  Complexity is $O(n^2)$ with a careful implementation (a naive one is $O(n^3)$)
  Averaging redistributes distances to overcome non-additivity
  Clustering can lead to substantial errors and is very sensitive to the input distances
  This limits the applications of clustering
  How do we overcome the sensitivity of UPGMA?
  [Figure: a real tree next to the different tree that UPGMA reconstructs from its distances]

Improvements Through Bootstrapping
  Bootstrapping: a statistical technique to increase robustness
  Scenario: given a sample S(ω) and a result R(S) computed from S
  Bootstrapping:
    o Resample S to get S'(ω)
    o Evaluate R(S'(ω))
    o Evaluate the match of R(S) with the values R(S'(ω))
  Here S = the columns of the aligned sequences (n columns); R(S) = tree
    o S'(ω) = a sample of n random columns of S, with possible repetitions
    o Compute the phylogenetic tree R(S'(ω))
    o Use {R(S'(ω))} to compute the consensus/likelihood of the branches of R(S)
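
The column resampling described above can be sketched as follows. Everything named here is hypothetical: `resample_columns`, `bootstrap_support`, the toy sequences, and `closest_pair`, which stands in for a real tree builder R(S) and reports only the closest pair of sequences.

```python
import random
from collections import Counter
from itertools import combinations


def resample_columns(seqs, rng):
    """Draw len(seqs[0]) alignment columns with replacement: S'(omega)."""
    n_cols = len(seqs[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(s[c] for c in picks) for s in seqs]


def closest_pair(seqs):
    """Toy stand-in for R(S): the index pair with the fewest mismatches."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(combinations(range(len(seqs)), 2),
               key=lambda p: hamming(seqs[p[0]], seqs[p[1]]))


def bootstrap_support(seqs, infer, replicates=200, seed=0):
    """Fraction of resampled data sets on which infer() agrees with infer(seqs)."""
    rng = random.Random(seed)
    counts = Counter(infer(resample_columns(seqs, rng)) for _ in range(replicates))
    reference = infer(seqs)
    return reference, counts[reference] / replicates


seqs = ["ACGTACGT", "ACGTACGA", "TTGTCCGA", "TTGACCGA"]  # made-up alignment
print(bootstrap_support(seqs, closest_pair))
```

With a real tree builder, `infer` would return the set of branches (splits) of the tree, and support would be tallied per branch rather than for the whole result.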

Bootstrapping Example
  [Figure: bootstrap replicates and the resulting branch support values]

Closest Pair vs. Evolutionary Neighbors
  Additivity: $D_{ij} + D_{kl} \le D_{ik} + D_{jl} = D_{il} + D_{jk}$
  [Figure: quartet tree with i, j on one side of the internal edge and k, l on the other]
  UPGMA overcomes non-additivity by averaging distances
  But the closest pair may not be evolutionary neighbors
  The evolutionary tree distances may diverge greatly; averaging distorts the neighborhood

Neighbor Joining [Saitou & Nei 87; Studier & Keppler 88]
  Neighbor joining heuristic: join the closest clusters that are far from the rest
  Define $R_k = \sum_{i \ne k} D_{ik}$, the divergence of k
  Cluster the nodes k, m that minimize $N_{km} = D_{km} - (R_k + R_m)/(n-2)$
  [Define $r_k = R_k/(n-2)$ and consider $D_{km} - r_k - r_m$]

Neighbor Joining Algorithm
  Initialization (same as UPGMA):
    o Initialize n clusters $C_i = \{S_i\}$
  Iteration (n is the current number of clusters):
    1. Compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$ for each cluster k
    2. Find (k,m) minimizing $D_{km} - r_k - r_m$
    3. Define a new node i and set $D_{is} = 0.5\,(D_{ks} + D_{ms} - D_{km})$ for all s
    4. Join node i to k and m with edges of respective lengths
         $D_{ki} = 0.5\,(D_{km} + r_k - r_m)$
         $D_{mi} = 0.5\,(D_{km} + r_m - r_k)$
    5. Repeat until all nodes are connected
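
The five steps above translate almost line by line into code. The sketch below is one possible reading, not a reference implementation: the function name `neighbor_joining`, the internal node names `N0`, `N1`, ... and the edge-list output are choices made here. The matrix is the six-taxon example from The Phylogenetic Handbook that the following slides work through, with the taxa assumed to be named A to F.

```python
def neighbor_joining(D, names):
    """Return the NJ tree as a list of (node, node, branch_length) edges."""
    nodes = list(names)
    dist = {(a, b): D[i][j]
            for i, a in enumerate(names) for j, b in enumerate(names) if i != j}
    edges = []
    new_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: divergences r_k.
        r = {k: sum(dist[(k, s)] for s in nodes if s != k) / (n - 2) for k in nodes}
        # Step 2: pair minimizing D_km - r_k - r_m (ties go to the first pair found).
        k, m = min(((a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]),
                   key=lambda p: dist[p] - r[p[0]] - r[p[1]])
        new = f"N{new_id}"  # hypothetical name for the new internal node
        new_id += 1
        d_km = dist[(k, m)]
        # Step 4: branch lengths from k and m to the new node.
        edges.append((k, new, 0.5 * (d_km + r[k] - r[m])))
        edges.append((m, new, 0.5 * (d_km + r[m] - r[k])))
        # Step 3: distances from the new node to every remaining node.
        for s in nodes:
            if s not in (k, m):
                d_new = 0.5 * (dist[(k, s)] + dist[(m, s)] - d_km)
                dist[(new, s)] = dist[(s, new)] = d_new
        nodes = [s for s in nodes if s not in (k, m)] + [new]
    a, b = nodes
    edges.append((a, b, dist[(a, b)]))  # connect the last two nodes
    return edges


D = [[0, 5, 4, 7, 6, 8],
     [5, 0, 7, 10, 9, 11],
     [4, 7, 0, 7, 6, 8],
     [7, 10, 7, 0, 5, 9],
     [6, 9, 6, 5, 0, 8],
     [8, 11, 8, 9, 8, 0]]
for edge in neighbor_joining(D, "ABCDEF"):
    print(edge)
```

Several steps of this example have tied minima, so a different tie-breaking rule can join pairs in a different order; because the matrix is additive, the resulting unrooted tree is the same.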

Example: Steps 1-2, Compute Divergences r
  (From The Phylogenetic Handbook, Salemi and Vandamme)

  Distance matrix D for six taxa A-F:

          A      B      C      D      E      F
    A     -      5      4      7      6      8
    B     5      -      7     10      9     11
    C     4      7      -      7      6      8
    D     7     10      7      -      5      9
    E     6      9      6      5      -      8
    F     8     11      8      9      8      -
    Sum  30     42     32     38     34     44
    r     7.5   10.5    8      9.5    8.5   11

  Step 1: compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$
    o Sum the columns, then divide by 6-2=4

  Step 2: find the neighboring pair
    o Evaluate the neighboring distance matrix $N_{km} = D_{km} - (r_k + r_m)$ [subtract the r column and row]
    o Find (k,m) minimizing $N_{km}$
    o Create a new node and attach it to k, m

          A      B      C      D      E
    B   -13
    C   -11.5  -11.5
    D   -10    -10    -10.5
    E   -10    -10    -10.5  -13
    F   -10.5  -10.5  -11    -11.5  -11.5

  $\min N_{km} = -13$, attained by (A,B) and tied by (D,E); the example joins A and B
  UPGMA would connect the closest pair instead: (A,C), with $D_{AC} = 4$
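
Steps 3, 4: Join Neighbors, Update Distances
  Step 3: join A and B through a new node U and compute the branch lengths
    $D_{AU} = 0.5\,(D_{AB} + r_A - r_B) = 0.5\,(5 - 3) = 1$
    $D_{BU} = 0.5\,(D_{AB} + r_B - r_A) = 0.5\,(5 + 3) = 4$
  Step 4: update the distance matrix with $D_{UX} = 0.5\,(D_{AX} + D_{BX} - D_{AB})$
    $D_{UC} = 0.5\,(4 + 7 - 5) = 3$;  $D_{UD} = 0.5\,(7 + 10 - 5) = 6$
    $D_{UE} = 0.5\,(6 + 9 - 5) = 5$;  $D_{UF} = 0.5\,(8 + 11 - 5) = 7$

          U      C      D      E      F
    U     -      3      6      5      7
    C     3      -      7      6      8
    D     6      7      -      5      9
    E     5      6      5      -      8
    F     7      8      9      8      -

Repeat Steps 1/2/3/4
  Step 1: compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$ with n=5

          U      C      D      E      F
    r     7      8      9      8     10.7

  Step 2: compute the neighboring pair, $\min\{N_{XY} = D_{XY} - r_X - r_Y\}$ => (U,C) or (D,E)

          U      C      D      E
    C   -12
    D   -10    -10
    E   -10    -10    -12
    F   -10.7  -10.7  -10.7  -10.7

  Step 3: join U and C through a new node V; compute the branch lengths
    $D_{UV} = 0.5\,(D_{UC} + r_U - r_C) = 0.5\,(3 + 7 - 8) = 1$;  $D_{CV} = 2$
  Step 4: re-compute the distances with $D_{VX} = 0.5\,(D_{UX} + D_{CX} - D_{UC})$

          V      D      E      F
    V     -      5      4      6
    D     5      -      5      9
    E     4      5      -      8
    F     6      9      8      -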
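
Repeat (remaining nodes: V, D, E, F, where V is the node that joined U and C)
  Step 1: compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$ with n=4

          V      D      E      F
    r     7.5    9.5    8.5   11.5

  Step 2: compute the neighboring pair, $\min\{N_{XY} = D_{XY} - r_X - r_Y\}$ => (D,E)

          V      D      E
    D   -12
    E   -12    -13
    F   -13    -12    -12

  Step 3: join D and E through a new node W; compute the branch lengths
    $D_{DW} = 0.5\,(D_{DE} + r_D - r_E) = 0.5\,(5 + 9.5 - 8.5) = 3$;  $D_{EW} = 2$
  Step 4: re-compute the distances with $D_{WX} = 0.5\,(D_{DX} + D_{EX} - D_{DE})$

          V      W      F
    V     -      2      6
    W     2      -      6
    F     6      6      -

Repeat (remaining nodes: V, W, F)
  Step 1: compute $r_k = \sum_{i \ne k} D_{ik}/(n-2)$ with n=3

          V      W      F
    r     8      8     12

  Step 2: compute the neighboring pair, $\min\{N_{XY} = D_{XY} - r_X - r_Y\}$; all three pairs tie at -14, take (V,W)
  Step 3: join V and W through a new node Z; compute the branch lengths
    $D_{VZ} = 0.5\,(D_{VW} + r_V - r_W) = 0.5\,(2 + 8 - 8) = 1$;  $D_{WZ} = 1$
  Step 4: re-compute the distances with $D_{ZX} = 0.5\,(D_{VX} + D_{WX} - D_{VW})$
    $D_{ZF} = 0.5\,(6 + 6 - 2) = 5$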
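
Complete
  [Figure: final tree. Branch lengths: A-U 1, B-U 4, U-V 1, C-V 2, V-Z 1, D-W 3, E-W 2, W-Z 1, F-Z 5]

Notes On Neighbor Joining
  Complexity is $O(n^3)$ for the standard algorithm: $O(n^2)$ work per join, repeated for about n joins
  Does not depend on the molecular clock assumption
  Heavily used in practice [e.g., ClustalW]
  But can be sensitive to non-additivity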

Maximal Parsimony (character based phylogeny)

Key Idea: Minimize Changes
  Reconsider the problem: find the best tree to explain the evolution of sequences
  Motivation: focus on the evolution of positions
    o Distance loses information on evolutionary changes
  Key idea: find the tree with minimal changes to explain the data
  [Figure: two candidate trees over the same leaf nucleotides, each annotated with its number of changes C; the tree with fewer changes is preferred]

More Generally
  Taxa are considered as sets of attributes: characters
    o character = DNA position, gene order, morphological feature
    o character state = a value assumed by a character
  Characters evolve through state changes
  The evolutionary tree represents changes in character states
  The MP tree seeks to minimize state changes

MP Example
  http://evolution.berkeley.edu/evosite/evo0/iicasingparsimony.shtml
  [Figure: table of taxa against characters with binary states, and a tree on which the state changes are marked]
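
MP Example (continued)
  [Figure: two candidate trees for the same taxa, one requiring 7 state changes and the other 6 state changes]

Example: Evolution of a Gene
  www.life.uiuc.edu/ib/33/molsyst.html
  Character = position; State = nucleotide
  [Figure: aligned gene sequences, one row per taxon]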

Example: Evolution of a Gene
  http://home.cc.umanitoba.ca/~psgendb/g/phylogeny/parsimony/phylip.parsimony.html
  Character = position; State = nucleotide
  [Figure: aligned gene sequences, one row per taxon]

Example: MP Rearrangements of a Chromosome
  Pevzner 2003, Genome Research

The Max Parsimony (MP) Problem
  Big MP:
    Input: a set of n aligned sequences of length k
    Output: a phylogenetic tree T such that
      o T has n leaves labeled with the input sequences (taxa)
      o T has internal nodes labeled with sequences of length k (states)
      o T minimizes the total Hamming distance between the labels of adjacent nodes, summed over all edges
    This is a Steiner-tree type problem
    It can be shown to be NP-hard [Gusfield, Foulds]
    But often the number of sequences considered is small
  Small MP:
    Input: a tree with sequence-labeled leaves
    Output: a labeling of the internal nodes (states) that maximizes parsimony, i.e. minimizes the number of changes

MP Basics
  Consider a set of five short aligned sequences
  [Figure: the five sequences and candidate trees for their first column]
  The first column admits several arrangements and identifies the likely mutation
    o The MP arrangement needs 1 mutation; the alternatives need more
  The second column does not provide clues on likely mutations
    o Non-informative position (an informative position needs at least 2 character states, each shared by at least 2 taxa)
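
Whether a column can discriminate between trees at all can be tested with a one-line count. The sketch below uses the standard definition of a parsimony-informative site (at least two states, each present in at least two taxa); the function name and the example columns are made up for illustration.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


# Hypothetical alignment columns, one character per taxon.
for column in ("AAGGG", "ATTTT", "TTTTT", "ACGTA"):
    print(column, is_parsimony_informative(column))
```

Columns that fail the test cost the same number of changes on every tree, so they can be skipped when comparing topologies.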

MP Basics (continued)
  [Figure: MP trees for the individual informative columns]
  Merge the MP trees of columns 1 & 3
  [Figure: the merged trees over the five sequences]
  Result: two MP trees

Example (N. Friedman)
  Aardvark: CAGGTA
  Bison:    CAGACA
  Chimp:    CGGGTA
  Dog:      TGCACT
  Elephant: TGCGTA
  [Figure: tree over Aardvark, Bison, Chimp, Dog, Elephant with internal nodes labeled TGGGTA, CAGGTA, CGGGTA, TGCGTA]

Example: Evolution of Protein Domains
  http://ai.stanford.edu/~serafim/cs37_006/
  [Figure: tree of protein-domain gains and losses with per-edge costs and a total cost]
  C. Chothia et al., Evolution of the Protein Repertoire, Science Vol. 300, 13 June 2003
  T. Przytycka et al., Graph Theoretical Insights..., RECOMB 2005, LNBI 3500, 2005

Single Site MP: The Fitch Algorithm
  Problem:
    Input: a tree T with labeled leaves
    Output: labels of the internal nodes of the MP tree + cost C
  Step 1: assign to each node x a set of labels S(x); start with $C \leftarrow 0$
    o If x is a leaf, then S(x) = {label of x}
    o If x has children y, z: if $S(y) \cap S(z) \ne \emptyset$ then $S(x) = S(y) \cap S(z)$, else $S(x) = S(y) \cup S(z)$ and $C \leftarrow C + 1$
    o Traverse T in postorder (leaves to root)
  Step 2: assign to each node x a character value v(x)
    o Traverse T in preorder (root to leaves)
    o At the root, v(root) = any label from S(root)
    o If y is the parent of x and $v(y) \in S(x)$ then $v(x) \leftarrow v(y)$, else v(x) = any label from S(x)
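
The two passes can be sketched as follows. This is an illustrative sketch under assumptions made here: a leaf is a one-character string, an internal node is a `(left, right)` tuple, and "any label" is resolved by taking the alphabetically smallest one so that the output is deterministic.

```python
def fitch(tree):
    """Return (cost, labeled tree) for one character.

    A leaf is a 1-character string; an internal node is a (left, right) tuple.
    In the result, an internal node becomes (state, left, right).
    """
    cost = 0

    def up(node):  # Step 1: postorder pass computing S(x) and the cost
        nonlocal cost
        if isinstance(node, str):
            return {"set": {node}, "kids": ()}
        left, right = up(node[0]), up(node[1])
        labels = left["set"] & right["set"]
        if not labels:  # empty intersection: one change is needed here
            cost += 1
            labels = left["set"] | right["set"]
        return {"set": labels, "kids": (left, right)}

    def down(info, parent_state):  # Step 2: preorder pass choosing v(x)
        if parent_state in info["set"]:
            state = parent_state
        else:
            state = min(info["set"])  # "any label"; min() keeps it deterministic
        if not info["kids"]:
            return state
        return (state, down(info["kids"][0], state), down(info["kids"][1], state))

    root = up(tree)
    return cost, down(root, None)


print(fitch((("A", "G"), ("A", "G"))))
print(fitch((("A", "G"), ("A", "A"))))
```

For a multi-character alignment the function would be called once per column and the costs summed, since the characters are treated independently.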
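
Step 1: Computing Candidate Labels
  [Figure: bottom-up pass on a tree whose leaves are labeled A and G, with leaf sets {A} and {G}; internal nodes receive sets such as {A} or {A, G}, and the cost C grows by 1 at each empty intersection]

Step 2: Selecting MP Labels
  [Figure: top-down pass on the same tree; each node takes its parent's label when that label is in its candidate set, otherwise any label from its own set]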

Notes
  The algorithm is fast: O(nk), n = # nodes, k = # character values
  It selects a particular MP labeling (there may be others)
  [Figure: an alternative labeling of the same tree with the same cost]
  Run separately for each character, then merge the results
  May be generalized to weighted parsimony
    o Sankoff's generalization: different costs for different changes

Heuristic MP Algorithms
  Use Steiner-tree heuristic algorithms
  Branch-and-bound search
    o Represent the search space as a tree (nodes at the k-th level represent phylogenetic trees for the first k species)
    o Find the best scoring search node and use it as a bound
    o Branch to the children of this search node
  Nearest neighbor interchange (NNI): switch subtrees
  Simulated annealing
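
Sankoff's generalization, mentioned in the notes above, replaces Fitch's label sets with a table of minimum costs per state. The sketch below is a minimal version for one character on a fixed tree; the tree encoding (nested 2-tuples with one-character leaves) and the transition/transversion weights in `ti_tv_cost` are assumptions chosen for illustration, not values from the slides.

```python
PURINES = {"A", "G"}


def ti_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    return 1 if (a in PURINES) == (b in PURINES) else 2


def sankoff(tree, states, cost):
    """Return {state: min weighted cost of the subtree if its root has that state}."""
    if isinstance(tree, str):  # leaf: only its observed state is allowed
        return {s: (0 if s == tree else float("inf")) for s in states}
    left, right = (sankoff(child, states, cost) for child in tree)
    return {s: min(cost(s, t) + left[t] for t in states)
               + min(cost(s, t) + right[t] for t in states)
            for s in states}


tree = (("A", "G"), ("A", "C"))
weighted = sankoff(tree, "ACGT", ti_tv_cost)
unit = sankoff(tree, "ACGT", lambda a, b: int(a != b))
print(min(weighted.values()), min(unit.values()))
```

With unit costs the minimum equals the Fitch cost for the same tree, which is a quick sanity check on the recursion.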

Maximal Likelihood Approach

(III) Max Likelihood Approaches (Based on N. Friedman slides)
  Key idea: compute the maximum likelihood tree
    o Many models of changes (trees) can yield the observed data
    o Compute the tree that maximizes the likelihood
  Problem 1: given T, compute the probability $P(S \mid T)$
    o $S = \{x_1, \dots, x_n\}$ are the observed sequences
    o Need a probability model of the changes generated by T:
        Background probabilities: $q(a)$
        Mutation probabilities: $P(a \mid b, t)$
  Problem 2: compute the T that maximizes $P(S \mid T)$
    o This is the complex part
  [Figure: tree with nodes $x_1, \dots, x_5$ and branch lengths $t_1, \dots, t_4$]

Tree Likelihood Computation
  Define $P(L_k \mid a)$ = probability of the leaves in the subtree below node k, given $x_k = a$
  Init: for all leaves k, $P(L_k \mid a) = 1$ if $x_k = a$; 0 otherwise
  Iteration: if k is a node with children i and j, then
    $P(L_k \mid a) = \sum_{b,c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$
  Termination: the likelihood is
    $P(x_1, \dots, x_n \mid T, t) = \sum_a P(L_{root} \mid a)\, q(a)$
  [Figure: tree with nodes $x_1, \dots, x_5$ and branch lengths $t_1, \dots, t_4$]

Maximum Likelihood (ML)
  Score each tree by
    $P(X_1, \dots, X_n \mid T, t) = \prod_m P(x_1[m], \dots, x_n[m] \mid T, t)$
    o Assumption: positions are independent
  Find the highest scoring tree
    o Exhaustive search
    o Sampling methods (Metropolis)
    o Approximation (consider only a subset of trees)
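
The recursion above can be sketched for a concrete substitution model. The slides leave $P(b \mid a, t)$ and $q(a)$ unspecified, so the sketch assumes the Jukes-Cantor model with uniform background probabilities; the tree encoding (a leaf is an integer index into the alignment column, an internal node is a pair of `(child, branch_length)` tuples) and the toy sequences are also assumptions made here.

```python
import math

STATES = "ACGT"


def jc_prob(b, a, t):
    """P(b | a, t) under Jukes-Cantor; t is in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def partial(node, column):
    """Return {a: P(L_k | a)} for the subtree rooted at node."""
    if isinstance(node, int):  # leaf: index into the alignment column
        return {a: 1.0 if column[node] == a else 0.0 for a in STATES}
    (left, t_left), (right, t_right) = node
    p_left, p_right = partial(left, column), partial(right, column)
    # The double sum over (b, c) factors into a product of two single sums.
    return {a: sum(jc_prob(b, a, t_left) * p_left[b] for b in STATES)
               * sum(jc_prob(c, a, t_right) * p_right[c] for c in STATES)
            for a in STATES}


def site_likelihood(tree, column, q=0.25):
    """Termination step: sum over root states weighted by the background q(a)."""
    return sum(q * p for p in partial(tree, column).values())


def log_likelihood(tree, seqs):
    """Independent positions: sum the per-column log-likelihoods."""
    return sum(math.log(site_likelihood(tree, col)) for col in zip(*seqs))


inner = ((0, 0.1), (1, 0.1))      # leaves 0 and 1 join first
tree = ((inner, 0.2), (2, 0.3))   # then leaf 2 joins at the root
print(log_likelihood(tree, ["ACGT", "ACGA", "ACTT"]))
```

Working in log space avoids underflow when the product over many positions becomes very small.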

Comparison
  Tony Weisstein, http://bioquest.org:6080/bedrock/terre_haute_03_0/phylogenetics_.0.ppt

  Neighbor-joining | Maximum parsimony | Maximum likelihood
  Uses only pairwise distances | Uses only shared derived characters | Uses all data
  Minimizes distance between nearest neighbors | Minimizes total distance | Maximizes tree likelihood given specific parameter values
  Very fast | Slow | Very slow
  Easily trapped in local optima | Assumptions fail when evolution is rapid | Highly dependent on assumed evolution model
  Good for generating a tentative tree, or choosing among multiple trees | Best option when tractable (<30 taxa) | Good for very small data sets and for testing trees built using other methods

Conclusions
  Computing phylogeny is an area of active research
    o Hundreds of algorithms
  New models: phylogenetic networks (generalize trees)
  New challenges: whole genome phylogeny
    o Account for multi-site changes: replication, transpositions
    o New algorithms
  Applications
    o Epidemiology
    o Cancer diagnosis