omputational Genomics Molecular Evolution: Phylogenetic trees Eric Xing Lecture, March, 2007 Reading: DTW boo, hap 2 DEKM boo, hap 7, Phylogeny In, Ernst Haecel coined the word phylogeny and presented phylogenetic trees for most nown groups of living organisms. Ernst Haecel (3-99)
Phylogenetic Inference Given a multiple alignment, how do we construct the tree? - GTTGTGTTGT B TTGTTGTTGT TTGTGGT D - TTGGTTTTT E GTGGTTTGT F - TTTTGG? What Is a Tree? tree is a mathematical structure which represents a model of an actual evolutionary history of a group of sequences or organisms. In other words, it is an evolutionary hypothesis. tree consists of nodes connected by branches. One unique internal node is the root of the tree: the ancestor of all the sequences. Internal nodes represent hypothetical ancestors Terminal nodes represent sequences or organisms for which we have data. Each is typically called a Operational Taxonomical Unit or OTU. 2
Types of Trees Rooted vs. Unrooted Branches Nodes Rooted Interior M 2 M Total 2M 2 2M Unrooted Interior M 3 M 2 Total 2M 3 2M 2 M is the number of OTU s The number of rooted and unrooted trees: Number of OTU s Possible Number of Rooted trees 2 Unrooted trees 3 3 5 3 5 05 5 95 05 7 0395 95 3535 0395 9 2027025 3535 0 35925 2027025 3
More different inds of trees Different inds of trees can be used to depict different aspects of evolutionary history. ladogram: simply shows relative order of common ancestry 2 3 3 7 5 3 2. dditive trees: a cladogram with branch lengths, also called phylograms and metric trees 3 3 2 3. Ultrametric trees: (dendograms) special ind of additive tree in which the tips of the trees are all equidistant from the root Phylogeny methods Basic principles: Degree of sequence difference is proportional to length of independent sequence evolution Only use positions where alignment is pretty certain avoid areas with (too many) gaps Major methods:
UPGM onstruction of a distance tree using clustering with the Unweighted Pair Group Method with rithmatic Mean (UPGM) First, construct a distance matrix: - GTTGTGTTGT B TTGTTGTTGT TTGTGGT D - TTGGTTTTT E GTGGTTTGT F - TTTTGG B D E F 2 B D E From http://www.icp.ucl.ac.be/~opperd/private/upgma.html Ultrametric Trees 3 2 3 a b c Metric distances (for additive trees) must obey rules: Non-negativity: d(a,b) 0 Distinctness: d(a,b) = 0 if and only if a = b Symmetry: d(a,b) = d(b,a) Triangle Inequality: d(a,c) d(a,b) + d(b,c) Ultrametric must obey one additional rule: Three point condition: d(a,b) max( d(a,c), d(b,c) ) 0. 2 a c b 5
UPGM First round B D E F 2 B D E hoose the most similar pair, cluster them together and calculate the new distance matrix. dist(,b), = (dist + distb) / 2 = dist(,b),d = (distd + distbd) / 2 = dist(,b),e = (diste + distbe) / 2 = dist(,b),f = (distf + distbf) / 2 = D E F B, D E UPGM Second round D E F B, D E Third round,d E F B,,D E
UPGM Forth round,b E,D E,D F Fifth round DE,B F Note the this method identifies the root of the tree. UPGM assumes a molecular cloc The UPGM clustering method is very sensitive to unequal evolutionary rates (assumes that the evolutionary rate is the same for all branches). lustering wors only if the data are ultrametric Ultrametric distances are defined by the satisfaction of the 'threepoint condition'. The three-point condition: B B For any three taxa, the two greatest distances are equal. 7
UPGM fails when rates of evolution are not constant B 5 B 7 D E D 7 0 7 E 9 5 F 9 From http://www.icp.ucl.ac.be/~opperd/private/upgma.html (Neighbor joining will get the right tree in this case.) Phylogeny based upon the molecular cloc Evidence for a human mitochondrial origin in frica: frican sequence diversity is twice as large as that of non-frican Gyllensten and colleagues estimate that the divergence of fricans and non-fricans occurred 52,000 to 2,000 years ago. Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 0: 70-73.
pair of homologous bases ancestor? T years Q h Q m Typically, the ancestor is unnown. How does sequence variation arise? Mutation: (a) Inherent: DN replication errors are not always corrected. (b) External: exposure to chemicals and radiation. Selection: Deleterious mutations are removed quicly. Neutral and rarely, advantageous mutations, are tolerated and stic around. Fixation: It taes time for a new variant to be established (having a stable frequency) in a population. 9
Modeling DN base substitution Strictly speaing, only applicable to regions undergoing little selection. Standard assumptions (sometimes weaened). Site independence. 2. Site homogeneity. 3. Marovian: given current base, future substitutions independent of past.. Temporal homogeneity: stationary Marov chain. More assumptions Q h = s h Q and Q m = s m Q, for some positive s h, s m, and a rate matrix Q. The ancestor is sampled from the stationary distribution π of Q. Q is reversible: for a, b, t 0 π(a)p(t,a,b) = P(t,b,a)π(b), (detailed balance). 0
The stationary distribution probability distribution π on {,,G,T} is a stationary distribution of the Marov chain with transition probability matrix P = P(i,j), if for all j, i π(i) P(i,j) = π(j). Exercise. Given any initial distribution, the distribution at time t of a chain with transition matrix P converges to π as t. Thus, π is also called an equilibrium distribution. Exercise. For the Jues-antor and Kimura models, the uniform distribution is stationary. (Hint: diagonalize their infinitesimal rate matrices.) We often assume that the ancestor sequence is i.i.d π. New picture from a statistical evolution model ancestor ~ π s h T PMs Q Q s m T PMs
The Jues-antor adjustment ssume that the common ancestor has, G, or T with probability /. common ancestor Then the chance of the nt differing p = 3/ ( e αt ) = 3/ ( e /3 ), since =2 3αt 3/ orang human When =.0, described as PM t Joint probability of and Under the model in the previous slides, the joint probability is p(,) = = a a π ( a ) p( s T, Q, a ) p( s T, Q, a ) π () p( a s T, Q, ) p( s T, Q, a ) = π () p( s T + s T, Q, ) = ( t,,) h h h m m m where t = s h T+ s m T is the (evolutionary) distance between and. Note that s h, s m and T are not identifiable. The matrix F(t) is symmetric. It is equally valid to view as the ancestor of or vice versa. 2
Estimating the evolutionary distance between two sequences Suppose two aligned protein sequences a a n and b b n are separated by t PMs. Under a reversible substitution model that is IID across sites, the lielihood of t is L( t ) = p( a a, b Kb model) = = a, b K n ( t, a, b ) ( t, a, b) ( a, b ) where c(a,b) = # { : a = a, b = b}. n Maximizing this quantity gives the maximum lielihood estimate of t. This generalizes the distance correction with Jues-antor. Phylogeny The shaded nodes represent the observed nucleotides at a given site for a set of organisms The unshaded nodes represent putative ancestral nucleotides Transitions between nodes capture the dynamic of evolution 3
Lielihood methods tree, with branch lengths, and the data at a single site. t 7 t 5 t t t 2 t t 3 t GTGGGT GTGGTGT TGTGGTG TGTGGTGG TGTGGTGG Since the sites evolve independently on the same tree, m ( i ) L = P ( D T ) = i = P ( D T ) Lielihood at one site on a tree We can compute this by summing over all assignments of states x, y, z and w to the interior nodes: ( i ) P ( D T ) P (,,,,, x, y, z, w T ) = x y z w x Due to the Marov property of the tree, we can factorize the complete lielihood according to the tree topology: t t z y t 7 t 2 t w t 5 t 3 t P(,,,,, x, y, z, w T ) = P ( x ) P ( y x, t) P ( y, t ) P ( y, t2) P z x, t ) P ( y, ) ( t3 P ( w z, t7 ) P ( y, t) P ( y, t5 Summing this up, there are 25 terms in this case! )
Getting a recursive algorithm when we move the summation signs as far right as possible: t 5 ( i) P ( D T ) P(,,,,, x, y, z, w T ) = ( = x y z w x y P ( x) z P ( y x, t ) P( y, t ) P( y, t ) 2 P z x, t ) P( z, t ) ( 3 P ( w z, t7 ) P( w, t) P( w, t5) ( w )) x t t z y t 7 t 3 t 2 t w t Felsenstein s Pruning lgorithm To calculate P(x, x 2,, x N T, t) Initialization: Set = 2N Recursion: ompute P(L a) for all a Σ If is a leaf node: Set P(L a) = (a = x ) If is not a leaf node:. ompute P(L i b), P(L j c) for all b and c, for daughter nodes i, j Termination: a b c i j 2. Set P(L a) = Σ b, c P(b a, t i )P(L i b) P(c a, t j ) P(L j c) Lielihood at this column = P(x, x 2,, x N T, t) = Σ a P(L 2N- a)p(a) a This algorithm can easily handle mbiguity and error in the sequences (how?) 5
Finding the ML tree So far I have just taled about the computation of the lielihood for one tree with branch lengths nown. To find a ML tree, we must search the space of tree topologies, and for each one examined, we need to optimize the branch lengths to maximize the lielihood. Bayesian phylogeny methods Bayesian inference has been applied to inferring phylogenies (Rannala and Yang, 99;Mau and Larget, 997; Li, Pearl and Doss, 2000). ll use a prior distribution on trees. The prior has enough influence on the result that its reasonableness should be a major concern. In particular, the depth of the tree may be seriously affected by the distribution of depths in the prior. ll use Marov hain Monte arlo (MM) methods. They sample from the posterior distribution. When these methods mae sense they not only get you a point estimate of the phylogeny, they get you a distribution of possible phylogenies.
Modeling rate variation among sites G G... T G G model of variation in evolutionary rates among sites The basic idea is that the rate at each site is drawn independently from a distribution of rates. The most widely used choice is the Gamma distribution, which has density function: α α λ r e f ( r ) = Γ( α) Gamma distributions (α,θ) λr α r e = Γ( α) θ r / θ α 7
Unrealistic aspects of the model: There is no reason, aside from mathematical convenience, to assume that the Gamma is the right distribution. common variation is to assume there is a separate probability f 0 of having rate 0. Rates at different sites appear to be correlated, which this model does not allow. Rates are not constant throughout evolution, they change with time. Rates varying among sites If L (i) (r i ) is the lielihood of the tree for site i given that the rate of evolution at site i is r i, we can integrate this over a gamma density: so that the overall lielihood is L ( i ) = ( i ) f ( r i ; α) L ( ri ) dri 0 L = m i = 0 f ( r ; α) L i ( i ) ( r i ) dri Unfortunately these integrals cannot be evaluated for trees with more than a few tips as the quantities L (i) (r i ) becomes complicated.
Modeling rate variation among sites There are a finite number of rates (denote rate i as r i ). There are probabilities p i of a site having rate i. process not visible to us ("hidden") assigns rates to sites. The probability of our seeing some data are to be obtained by summing over all possible combinations of rates, weighting appropriately by their probabilities of occurrence. Rocall the HMM Y Y 2 Y 3... Y T i i i i i e e 2 e 3 e e 5 e e 7 e i i i i G T G G G T T... X X 2 X 3 X T The shaded nodes represent the observed nucleotides at particular sites of an organism's genome For discrete Y i, widely used in computational biology to represent segments of sequences gene finders and motif finders profile models of protein domains models of secondary structure 9
Definition (of HMM) Observation space lphabetic set: Euclidean space: Index set of hidden states x I = {, 2, L,M} Transition probabilities between any two states j i p y = y = ) = a, or p = d R Start probabilities p y ) ~ Multinomial( π, π, K, π ). Emission probabilities associated with each state or in general: i p( x y = ) ~ f θ, i I { c, c, L, } ( t t i,j i ( yt yt = ) ~ Multinomial ai,, ai, 2, K, ai, M, i I p ( 2 M 2 c K y ( ). i y = ) ~ Multinomial( b, b, K, b ), i. ( xt t i, i, 2 i, K I t t ( ). i y 2 y 3 x 2 x 3...... Graphical model K 2 State automata y T x T Hidden Marov Phylogeny... G G G T G Replacing the standard emission model with a tree process not visible to us (.hidden") assigns rates to sites. It is a Marov process woring along the sequence. For example it might have transition probability Prob (j i) of changing to rate j in the next site, given that it is at rate i in this site. These are the most widely used models allowing rate variation to be correlated along the sequence. 20
The Forward lgorithm α t We can compute for all, t, using dynamic programming! Initialization: α = P ( x y = ) π α = P ( x, y = P ( x y = ) = P ( x y = ) P ( y = ) = ) π Iteration: α P ( x t y t = ) i t = P ( xt y t = ) α i t ai, Termination: P (x) = α T The Bacward lgorithm β t We can compute for all, t, using dynamic programming! Initialization: β T =, Iteration: β = ) β t i i a i, ip ( x t + y t + = ) t + Termination: P (x) α β = 2
Hidden Marov Phylogeny... G G G T G this yields a gene finder that exploits evolutionary constraints omparison of comparative genomic gene-finding and isolated gene-finding Based on sequence data from 2-5 primate species, Mculiffe et al (2003) obtained sensitivity of 00%, with a specificity of 9%. Genscan (state-of-the-art gene finder) yield a sensitivity of 5%, with a specificity of 3%. 00 90 0 70 0 50 0 30 20 0 0 sensitivity specificity Phylogenetic HMM Genscan (HMM) 22
Open questions (philosophical) Observation: Finding a good phylogeny will help in finding the genes. Finding the genes will help to find biologically meaningful phylogenetic trees Which came first, the chicen or the egg? Open questions (technical) How to learn a phylogeny (topology and transition prob.)? Should different site use the same phylogeny? Functionspecific phylogeny? Other evolutionary events: duplication, rearrangement, lateral transfer, etc. 23
cnowledgments Terry Speed: for some of the slides modified from his lectures at U Bereley Phil Green and Joe Felsenstein: for some of the slides modified from his lectures at Univ. of Washington Itai Yanai: for some of the slides modified from his lectures at Weizmann Institute of Science 2