Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

Models Models help us intelligently interpolate between our observations for purposes of making predictions Adding parameters to a model generally increases its fit to the data Underparameterized models lead to poor fit to observed data points Overparameterized models lead to poor prediction of future observations Criteria for choosing models include likelihood ratio tests, AIC, BIC, Bayes Factors, etc. all provide a way to choose a model that is neither underparameterized nor overparameterized Copyright 2007 Paul O. Lewis 3

The Poisson distribution Probability distribution on the number of events when: 1. events are assumed to be independent, 2. the rate of events some constant, µ, and 3. the process continues for some duration of time, t. The expectation of the number of events is ν = µt. Note that ν can be any non-negative number, but the Poisson is a discrete distribution it gives the probabilities of the number of events (and this number will always be a non-negative integer).

The Poisson distribution Pr(k events Expected # is ν) = νk e ν k! Pr(0 events) = ν0 e ν 0! = e ν = e µt Pr( 1 events) = 1 e ν = 1 e µt

"Disruptions" vs. substitutions When a disruption occurs, any base can appear in a sequence. T? Note: disruption is my term for this make-believe event. You will not see this term in the literature. If the base that A appears is different C G from the base that was already there, then a substitution event has occurred. T The rate at which any particular substitution occurs will be 1/4 the disruption rate (assuming equal base frequencies) Copyright 2007 Paul O. Lewis 13

Probability of T G over time t If µ is the rate of disruptions, and a branch is t units of time long then: Let s use θ for the rate of any particular disruption. µ T A = µ T C = µ T G = µ T T = θ µ = 4θ Furthermore, given that there is a disruption the chance of any particular change is 1 4

Probability of T G over time t Pr(0 disruptions t) = e µt Pr(at least 1 disruption t) = 1 e µt Pr(last disruption leads to G) = 0.25 Pr(T G t) = 0.25 ( 1 e µt) = 0.25 ( 1 e 4θt)

JC69 model Bases are assumed to be equally frequent (all 0.25) Assumes rate of substitution (α) is the same for all possible substitutions Usually described as a 1-parameter model (the parameter being α) Remember, however, that each edge in a tree can have its own α, so there are really as many parameters in the model as there are edges in the tree! Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pages 21-132 in H. N. Munro (ed.), Mammalian Protein Metabolism. Academic Press, New York. Copyright 2007 Paul O. Lewis 15

JC transition probabilities Pr(T A t) = 0.25 ( 1 e 4θt) Pr(T C t) = 0.25 ( 1 e 4θt) Pr(T G t) = 0.25 ( 1 e 4θt) Pr(T T t) = 0.25 ( 1 e 4θt) but this only adds up to: ( 1 e 4θt ) instead of 1!

We left out the probability of no disruptions:e 4θt So: Pr(T A t) = 0.25 ( 1 e 4θt) Pr(T C t) = 0.25 ( 1 e 4θt) Pr(T G t) = 0.25 ( 1 e 4θt) Pr(T T t) = e 4θt + 0.25 ( 1 e 4θt) = 0.25 + 0.75e 4θt

JC transition probabilities Pr(i j t) = 0.25 ( 1 e 4θt) Pr(i i t) = 0.25 + 0.75e 4θt When t = 0, then e 4θt = 1, and: Pr(i j t) = 0 Pr(i i t) = 1

JC transition probabilities Pr(i j t) = 0.25 ( 1 e 4θt) Pr(i i t) = 0.25 + 0.75e 4θt When t =, then e 4θt = 0, and: Pr(i j t) = 0.25 Pr(i i t) = 0.25

Probability of A present as a function of time Upper curve assumes we started with A at time 0. Over time, the probability of still seeing an A at this site drops because rate of changing to one of the other three bases is 3α (so rate of staying the same is -3α). The equilibrium relative frequency of A is 0.25 Lower curve assumes we started with some state other than A (T is used here). Over time, the probability of seeing an A at this site grows because the rate at which the current base will change into an A is α. Copyright 2007 Paul O. Lewis 24

Water analogy (time 0) 3α A C G T Start with container A completely full and others empty Imagine that all containers are connected by tubes that allow same rate of flow between any two Initially, A will be losing water at 3 times the rate that C (or G or T) gains water α Copyright 2007 Paul O. Lewis 25

Water analogy (after a very long time) A C G T Eventually, all containers are one fourth full and there is zero net volume change stationarity (equilibrium) has been achieved (Thanks to Kent Holsinger for this analogy) Copyright 2007 Paul O. Lewis 27

JC instantaneous rate matrix - the Q matrix for JC The 1 parameter is α (sometimes parameterized in terms of µ). This is the rate of replacements ( disruptions that change the state): To State A C G T From A 3α α α α C α 3α α α State G α α 3α α T α α α 3α

Change probabilities We can calculate a transition probability matrix as a function of time by: P(t) = e Qt The important thing to note is the rates (Q matrix) is multiplied by the time. We can t separate rates and times since we always see the effect of their product. Is a medium level of character divergence: 1. medium rate of change and medium amount of time, 2. high rate, but short time period, 3. low rate, but a long time period?

JC instantaneous rate matrix again What if you do not know the length of time for a branch in the tree? We estimate branch lengths in terms of character divergence the product of rate and time. What is important is that we know the relative rates of different types of substitutions, so JC can be expressed: To State A C G T From A 3 1 1 1 C 1 3 1 1 State G 1 1 3 1 T 1 1 1 3

JC instantaneous rate matrix yet again We estimate branch lengths in terms of expected number of changes per site. To do this we standardize the total rate of divergence in the Q matrix and estimate ν = µt = 3αt for each branch. From A 1 1 3 1 C State G 1 3 T 1 3 To State A C G T 1 3 1 3 1 3 1 3 1 3 1 1 3 1 3 1 3 1 3 1

Kimura (1980) model or the K80 model Transitions and transversions occur at different rates: To State A C G T From A 2β α β α β C β 2β α β α State G α β 2β α β T β α β 2β α

Kimura (1980) model or the K80 model. Reparameterized. Once again, we care only about the relative rates, so we can choose one rate to be frame of reference. This turns the 2 parameter model into a 1 parameter form: From State To State A C G T A (2 + κ)β β κβ β C β (2 + κ)β β κβ G κβ β (2 + κ)β β T β κβ β (2 + κ)β

Kimura (1980) model or the K80 model. Reparameterized again. To State A C G T From A 2 κ 1 κ 1 C 1 2 κ 1 κ State G κ 1 2 κ 1 T 1 κ 1 2 κ

Kappa is the transititon/transversion rate ratio: κ = α β (if κ = 1 then we are back to JC).

What is the instantaneous probability of an particular transversion? Pr(A C) = Pr(A) Pr(change to C) = 1 4 (βdt)

What is the instantaneous probability of an particular transition? Pr(A G) = Pr(A) Pr(change to G) = 1 4 (κβdt)

There are four types of transitions: A G, G A, C T, T C and eight types of transversions: A C, A T, G C, G T, C A, C G, T A, T G Ti/Tv ratio = Pr(any transition) Pr(any transversion) = 4 ( 1 4 (κβdt)) 8 ( 1 4 (βdt)) = κ 2 For K2P instantaneous transition/transversion ratio is one-half the instantaneous transition/transversion rate ratio

Felsenstein 1981 model or F81 model To State A C G T From A π C π G π T C π State A π G π T G π A π C π T T π A π C π G

HKY 1985 model To State A C G T From A π C κπ G π T C π State A π G κπ T G κπ A π C π T T π A κπ C π G

F84 model: F84* vs. HKY85 μ rate of process generating all types of substitutions kμ rate of process generating only transitions Becomes F81 model if k = 0 HKY85 model: β rate of process generating only transversions κβ rate of process generating only transitions Becomes F81 model if κ = 1 *First used in PHYLIP in 1984, first published bykishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. Journal of Molecular Evolution 29: 170-179. Copyright 2007 Paul O. Lewis 38

General Time Reversible GTR model To State A C G T From A aπ C bπ G cπ T C aπ State A dπ G eπ T G bπ A dπ C fπ T T cπ A eπ C fπ G In PAUP, f = 1 indicating that G T is the reference rate

References Kimura, M. (1980). A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16:111 120.

JC state-comparison probabilities i and j refer to states (A, C, G, T). i j: Pr(i i ν) = 1 4 + 3 4 e 4ν 3 Pr(i j ν) = 1 4 1 4 e 4ν 3

CFN transition probabilities Pr(0 0 ν) = Pr(1 1 ν) = 1 2 + 1 2 e 2ν Pr(0 1 ν) = Pr(1 0 ν) = 1 2 1 2 e 2ν

Mk transition probabilities k-state version of the one-rate model. Pr(i i ν) = 1 k + (k 1)e ( k k 1)ν k Pr(i j ν) = 1 k e ( k k 1)ν k

Kimura (1980) model or the K80 model Transitions and transversions occur at different rates: To State A C G T From A 2β α β α β C β 2β α β α State G α β 2β α β T β α β 2β α

Kimura (1980) model or the K80 model. Reparameterized. We only care about the relative rates, so we can choose one rate to be frame of reference. This turns the 2 parameter model into a 1 parameter form: From State To State A C G T A (2 + κ)β β κβ β C β (2 + κ)β β κβ G κβ β (2 + κ)β β T β κβ β (2 + κ)β

Kimura (1980) model or the K80 model. Reparameterized again. To State A C G T From A 2 κ 1 κ 1 C 1 2 κ 1 κ State G κ 1 2 κ 1 T 1 κ 1 2 κ

Kappa is the transititon/transversion rate ratio: κ = α β (if κ = 1 then we are back to JC).

What is the instantaneous probability of an particular transversion? Pr(A C) = Pr(A) Pr(change to C) = 1 4 (βdt)

What is the instantaneous probability of an particular transition? Pr(A G) = Pr(A) Pr(change to G) = 1 4 (κβdt)

Kimura model change probabilities Pr(A A ν) = 1 4 ( 1 + e ( 4 2+κ)ν + 2e ( 2+2κ 2+κ )ν ) Pr(A G ν) = 1 4 ( 1 + e ( 4 2+κ)ν 2e ( 2+2κ 2+κ )ν ) Pr(A C ν) = 1 4 ( 1 e ( 4 2+κ)ν )

Felsenstein 1981 model or F81 model To State A C G T From A π C π G π T C π State A π G π T G π A π C π T T π A π C π G

HKY 1985 model To State A C G T From A π C κπ G π T C π State A π G κπ T G κπ A π C π T T π A κπ C π G

General Time Reversible GTR model To State A C G T From A aπ C bπ G cπ T C aπ State A dπ G eπ T G bπ A dπ C fπ T T cπ A eπ C fπ G In PAUP, f = 1 indicating that G T is the reference rate

Likelihood of a single sequence First 32 nucleotides of the ψη-globin gene of gorilla: GAAGTCCTTGAGAAATAAACTGCACACACTGG L = π π π π π π π π π π π π π π π π π π π π π π π π π π π π π π π π G A A G T C C T T G A G A A A T A A A C T G C A C A C A C T G G = π π π π 12 7 7 6 A C G T ( ) ( ) ( ) ( ) ln L = 12 ln π + 7 ln π + 7 ln π + 6 ln π A C G T We can already see by eye-balling this that the F81 model (which allows unequal base frequencies) will fit better than the JC69 model (which assumes equal base frequencies) because there are about twice as many As as there are Cs, Gs and Ts. Copyright 2007 Paul O. Lewis 4

How can we calculate the likelihood score Under the JC (or K2P) model: ln L = 12 ln π A + 7 ln π C + 7 ln π G + 6 ln π T = 12 ln 0.25 + 7 ln 0.25 + 7 ln 0.25 + 6 ln 0.25 = 44.361

How can we calculate the likelihood score Under the F81 (or HKY or GTR) model: ln L = 12 ln π A + 7 ln π C + 7 ln π G + 6 ln π T But what are the values for the parameters: π A, π C, π G, π T? In many cases we refer to these parameters as nuisance parameters. They must be specified in order to calculate the likelihood, but we are not interested in them by themselves.

ML parameter estimates We can find the maximum likelihood estimates of the parameters to give us the ML score: the maximum likelihood obtainable under this model: ln L = 12 ln π A + 7 ln π C + 7 ln π G + 6 ln π T = 12 ln π A + 7 ln π C + 7 ln π G + 6 ln π T = 12 ln 0.375 + 7 ln 0.21875 + 7 ln 0.21875 + 6 ln 0.1875 = 43.091 But how did I get the numbers to fill for the parameters? How do we know that π A = 0.375 and π C = 0.21875

ML parameter estimates We might guess that: but how do we prove it? π A = 12 32 = 0.375 π C = 7 32 = 0.21875 π G = 7 32 = 0.21875 π T = 6 32 = 0.1875

ML parameter estimates For simple problems we solve for the point in parameter space for which derivatives with respect to all parameters are 0 (we also have to consider boundary points). We would have to do constrained optimization because and that is a pain. π A + π C + π G + π T = 1

ML parameter estimates We can reparameterize: r = π A + π G a = c = π A π A + π G π C π C + π T and always recover the original parameters: π A = ra π G = r(1 a) π C = (1 r)c π T = (1 r)(1 c)

ML parameter estimates ln L = 12 ln π A + 7 ln π C + 7 ln π G + 6 ln π T Recall that: = 12 ln [ra] + 7 ln [(1 r)c] + 7 ln [r(1 a)] + 6 ln [(1 r)(1 c)] ln f(x) x = f(x) x f(x)

ln L = 12 ln [ra] + 7 ln [(1 r)c] + 7 ln [r(1 a)] + 6 ln [(1 r)(1 c)] ln L a = 12r ra + 7( r) r(1 a) = 12 a 7 (1 a) 0 = 12 â 7 (1 â) â = 12 19

ln L = 12 ln [ra] + 7 ln [(1 r)c] + 7 ln [r(1 a)] + 6 ln [(1 r)(1 c)] ln L c = 7(1 r) (1 r)c = 7 c 6 (1 c) 0 = 7 ĉ 6 (1 ĉ) ĉ = 7 13 + 6 (1 r) (1 r)(1 a)

ln L = 12 ln [ra] + 7 ln [(1 r)c] + 7 ln [r(1 a)] + 6 ln [(1 r)(1 c)] ln L r 0 = = 12a ra + 7( c) (1 r)c + 7(1 a) r(1 a) = 12 r 7 (1 r) + 7 r 6 1 r = 19 r 13 (1 r) ˆr = 19 32 19ˆr 13 (1 ˆr) + 6( (1 c)) (1 r)(1 c)

ML inference displays scale invariance so we can just transform the ML estimates into our original parameters: π A = ˆrâ = π G = ˆr(1 â) = π C = (1 ˆr)ĉ = π T = (1 ˆr)(1 ĉ) = ( ) ( ) 19 12 = 12 32 19 32 ( ) ( ) 19 7 = 7 32 19 32 ( ) ( ) 13 6 = 7 32 13 32 ( ) ( ) 13 6 = 6 32 13 32

Likelihood ratio testing ln L JC = 44.361 ln L F 81 = 43.091 But the F81 model has 3 more free parameters than JC. The likelihood ratio test is a hypothesis testing approach to model selection.

Likelihood ratio testing H 0 : the data were generated under the simpler model H A : the data were generated under the more complex model. ) test statistic: 2 (ln L complex ln L simple The LRT only works if the simple model is nested inside the more complex model (if the free parameters for the simple model are a subset of the free parameters for the more complex model).

Likelihood ratio testing Null distribution: χ 2 distribution with d.f. = difference in the number of free parameters. If the LR test statistic is > than the critical value from the appropriate chi-square table, then we reject the simple model and prefer the more complex model.

Likelihood ratio testing - example ln L JC = 44.361 ln L F 81 = 43.091 LRT = 2( 43.091 ( 44.361)) = 2.54 df = 3 0 = 3 χ 2 3(critical, P = 0.05) = 7.815 Not significant. Do not reject the JC model. (if we look up the P -value for this test statistic it is 0.4681).

Likelihoods on the simplest possible tree GA GG L = L 1 L 2 = Pr(G) Pr(G G) Pr(A) Pr(A G) = Pr(G) Pr(G G ν) Pr(A) Pr(A G ν) ( ) ( 1 1 = 4 4 + 3 ) ( ) ( e 4ν 1 1 3 4 4 4 1 ) e 4ν 3 4

d = 1 4 1 4 e 4ν 3 ( 1 4 + 3 ) e 4ν 3 = 1 3d 4 ln L d L = ( ) ( 1 1 4 4 + 3 e 4ν 3 4 (1 3d)d = 16 = 1 6d 16 0 = 1 6 ˆd 16 ˆd = 1 6 ˆν = 0.82396 L = 0.005208 ) ( ) ( 1 1 4 4 1 ) e 4ν 3 4

You may recall that the JC distance correction from lecture 8 looked like this: ν = 3 ( 4 ln 1 4p ) 3 If you put in p = 0.5, because half the sites differ in our example then you the same branch length: ν = 0.82396 Our JC distance correction formula is actually an ML estimator of the branch length between a pair of taxa.

The first 30 nucleotides of the ψη-globin gene 50 gorilla orangutan [( 1 L = 4 GAAGTCCTTGAGAAATAAACTGCACACTGG GGACTCCTTGAGAAATAAACTGCACACTGG )] 28 [( ) ( 1 1 4 4 1 )] 2 e 4ν 3 4 ) ( 1 4 + 3 e 4ν 3 4 51 52 53 54 ˆν = 0.06982 ln L = 51.13396 0.00 0.05 0.10 0.15 0.20 0.25

Likelihood of a tree (data for only one site shown) A A A C Arbitrarily chosen to serve as the root node C T Ancestral states like this are not really known - we will address this in a minute. Copyright 2007 Paul O. Lewis 9

Likelihood for site k A C ν 1 A ν 3 C ν 5 ν 5 is the expected no. substitutions for just this segment of the tree π A A ν 2 ν 4 T 1 1 3 4 ν1/3 1 3 4 ν2/3 1 1 4 ν3/3 1 1 4 ν4/3 1 3 4 ν5/3 Lk = 4 4 + 4e 4 + 4e 4 4e 4 4e 4 + 4e P AA (ν 1 ) P AA (ν 2 ) P AC (ν 3 ) Copyright 2007 Paul O. Lewis P CT (ν 4 ) P CC (ν 5 ) 10

Pruning algorithm* (same result, much less time) Many calculations can be done just once, and then reused many times *The pruning algorithm was introduced by: Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17:368-376 Copyright 2007 Paul O. Lewis 12

Taxon Character 1 A 2 C 3 C 4 C 5 G ν 6 ν 8 ν 1 ν 2 ν 3 ν 7 1 2 3 4 ν 4 5 ν 5

L = x Pr(x, y, z, w, A, C, C, C, G ν) y z w A y ν 1 x ν 6 ν 2 C C ν 8 z ν 3 w C ν 4 G ν 7 ν 5 1

L = x Pr(x) Pr(y x, ν 6 ) Pr(A y, ν 1 ) Pr(C y, ν 2 ) y z w Pr(z x, ν 8 ) Pr(C z, ν 3 ) Pr(w z, ν 7 ) Pr(C w, ν 4 ) Pr(G w, ν 5 ) A y ν 1 x ν 6 ν 2 C C ν 8 z ν 3 w C ν 4 G ν 7 ν 5

L = Pr(x) Pr(y x, ν 6 ) Pr(A y, ν 1 ) Pr(C y, ν 2 ) x y z ( ) Pr(z x, ν 8 ) Pr(C z, ν 3 ) Pr(w z, ν 7 ) Pr(C w, ν 4 ) Pr(G w, ν 5 ) w A y ν 1 x ν 6 ν 2 C C ν 8 z ν 3 w C ν 4 G ν 7 ν 5

L = Pr(x) Pr(y x, ν 6 ) Pr(A y, ν 1 ) Pr(C y, ν 2 ) x y ( ( )) Pr(z x, ν 8 ) Pr(C z, ν 3 ) Pr(w z, ν 7 ) Pr(C w, ν 4 ) Pr(G w, ν 5 ) z w A y ν 1 x ν 6 ν 2 C C ν 8 z ν 3 w C ν 4 G ν 7 ν 5

L = ( Pr(x) x y ( ( Pr(z x, ν 8 ) Pr(C z, ν 3 ) z ) Pr(y x, ν 6 ) Pr(A y, ν 1 ) Pr(C y, ν 2 ) w )) Pr(w z, ν 7 ) Pr(C w, ν 4 ) Pr(G w, ν 5 ) A y ν 1 x ν 6 ν 2 C C ν 8 z ν 3 w C ν 4 G ν 7 ν 5

Maximum likelihood is a lot of work Site likelihoods involve products of transition probabilities, summed over ancestral states Overall log-likelihood for a tree is sum of site loglikelihoods Overall log-likelihood must be maximized! must find MLEs for all edge lengths and all model parameters this involves computing the overall log-likelihood many, many times (try turning on logiter in PAUP to get a feel for how much work this involves) Maximized lnl can now be compared to maximized lnl from other trees Copyright 2007 Paul O. Lewis 13

Uses all information Is it worth it? Parsimony ignores constant and autapomorphic sites Distance methods ignore information not captured in pairwise comparisons Model generality Some models possible with distance methods, but some quantities cannot be estimated reliably (e.g. variation in rates across sites) Many parsimony variants exist, but parsimony does not allow estimation of the step matrix entries, for example Many complex models are only possible under likelihood or Bayesian methods (which have a likelihood foundation) Copyright 2007 Paul O. Lewis 14

References Kimura, M. (1980). A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16:111 120.

Green Plant rbcl First 88 amino acids, translation is for Zea mays M--S--P--Q--T--E--T--K--A--S--V--G--F--K--A--G--V--K--D--Y--K--L--T--Y--Y--T--P--E--Y--E--T--K--D--T--D--I--L--A--A--F--R--V--T--P-- Chara (green alga; land plant lineage) AAAGATTACAGATTAACTTACTATACTCCTGAGTATAAAACTAAAGATACTGACATTTTAGCTGCATTTCGTGTAACTCCA Chlorella (green alga)...c...c.t...t..cc..c.a...c...t...c.t..a..g..c...a.g...t Volvox (green alga)...tc.t...a...c..a...c...gt.gta...c...c...a...a.g... Conocephalum (liverwort)...tc...t...g..t...g...g..t...a...a.aa.g...t Bazzania (moss)...t...c..t...g...a...g.g..c...g..a..t...g..a...a.g...c Anthoceros (hornwort)...t...cc.t...c...t..cg.g..c..g...t...g..a..g.c.t.aa.g...t Osmunda (fern)...tc...g...c...c..t...g.g..c..g...t...g..a...c..aa.g...c Lycopodium (club "moss").gg...c.t..c...t...g..c...a..c..t...c.g..a...aa.g...t Ginkgo (gymnosperm; Ginkgo biloba)...g...t...a...c...c...t..c..g..a...c..a...t Picea (gymnosperm; spruce)...t...a...c.g..c...g..t...g..a...c..a...t Iris (flowering plant)...g...t...t..cg...c...t..c..g..a...c..a...t Asplenium (fern; spleenwort)...tc..c.g...t..c..c..c..a..c..g..c...c..t..c..g..a..t..c..ga.g..c... Nicotiana (flowering plant; tobacco)...g...a...g...t...cc...c..g...t..a..g..a...c..a...t Q--L--G--V--P--P--E--E--A--G--A--A--V--A--A--E--S--S--T--G--T--W--T--T--V--W--T--D--G--L--T--S--L--D--R--Y--K--G--R--C--Y--H--I--E-- CAACCTGGCGTTCCACCTGAAGAAGCAGGGGCTGCAGTAGCTGCAGAATCTTCTACTGGTACATGGACTACTGTTTGGACTGACGGATTAACTAGTTTGGACCGATACAAAGGAAGATGCTACGATATTGAA...A..T...A...G..T..G...A...A..A...T...G...A...T..T...A...T...TC.T..T..T..C..C..G...A..T...TGT..T...T..T...T...A..A..A...T...A...A...T..T...A...C.T...T...TC.T..T..T..C..C..G..G...G..A...G.A...A..A...T...T...A...T..TC.T...ACC.T..T..T..T...TC...T.G...C...G..A..A...A..G...T...A..C...G...C..G...C..T..GC.T..A...C.C..T..T...TC...T..C..C... T...A..G..G...A..C...T...A...C..T...C.T..C..CC.T...T...TC...C......C..A..A..GG...G...T..A...G...A...G...C...A...G..T...C.T..C...C.T..T..T..T..G..TC......T...A..A...C..G...G..A..C...T...C...C..T...C.T..C...C.C..T..C...TC.G...T..A......A..G...G...G..A...C...C...C...C..T...C.T..C...C.T..T..T...G...T..C..C..G...A..G..G..G..C..G...G..A..A...T...C..C...C...C..T...C.T...C.T..T..T...G..GC...T..C..C..G...C..A...TG...G...C..G...C...A..A..G...T...C.T..C...C.T..T..T...C...C.C..C..G...C..A..A...G...C..A...G..C...A...C...G...A...G..G..C..CC.T...T...G..CC...C..G...A...C..G...C...A...A...C..T...C.T..C..CC.T..T..T...GC...CGC..C..G All four bases are observed at some sites......while at other sites, only one base is observed Copyright 2007 Paul O. Lewis 3

Question: Why is rate heterogeneity ubiquituous? Answer: Differences in mutational rates and (mainly) selective constraint Many sites are under purifying (stabilizing) selection: Any mutation results in a different amino acid, AND A amino acid replacement at the site results in dramatically worse functioning of the protein. These sites will show low rates of evolution on a tree. Other sites are less constrained. A mutation results in the same amino acid, OR Many amino acids will work equally well at that position in the protein. These sites will show high rates of evolution on a tree.

Rate heterogeneity in protein-coding genes: terms Synonymous mutations result in the same amino acid. Non-synonymous mutations result in the different amino acid. Conservative changes are non-synonymous changes that result in a chemically similar amino acid. Neutral mutations result in a new genotype that has the same fitness as the genotypes currently fixed in the population.

Rate heterogeneity in protein-coding genes: generalities Synonymous changes are often neutral (or close to neutral), Third base positions and untranslated regions (introns and other non-coding regions) tend to have high rates because changes to these sites lead to synonymous changes. Transitions tend to lead to more synonymous or conservative changes. Amino acid residues that are embedded, involved in salt bonding, or part of the active site tend to be more constrained. Loops of amino acid residues on the outside of proteins often tolerate a wide range of substitutions (or even indels).

U C A G 2nd Base U C A G UUU UCU UAU UGU F Y C UUC UCC UAC UGC S UUA UCA UAA UGA * L * UUG UCG UAG UGG W CUU CCU CAU CGU H CUC CCC CAC CGC L P CUA CCA CAA CGA Q CUG CCG CAG CGG R AUU ACU AAU AGU N S AUC I ACC AAC AGC T AUA ACA AAA AGA K R AUG M ACG AAG AGG GUU GCU GAU GGU V D GUC GCC GAC GGC A GUA GCA GAA GGA L E GUG GCG GAG GGG G

Rate heterogeneity in RNA coding genes Stem regions formed when RNA strand forms double-helix with itself strongly conserved in general evidence for compensatory substitutions Loop regions some strongly conserved some entire loops are found in only particular lineages Copyright 2007 Paul O. Lewis 6

Accommodating rate heterogeneity in substitution models Site-specific rates approach e.g. let 1st, 2nd and 3rd position sites each have their own relative substitution rate Proportion of invariable sites approach assume that some proportion p invar of sites have rate 0, while a proportion 1-p invar have a rate > 0 Discrete gamma distributed relative rates approach assume that each site is evolving at one of n cat relative rates, where the relative rates are determined using a gamma distribution having mean 1 and shape α Codon models (protein-coding genes only) uses genetic code to determine appropriate relative rates Secondary structure models (RNA-coding genes only) uses separate model for loops vs. stems, stem model takes account of compensatory substitutions Copyright 2007 Paul O. Lewis 7

Site-specific rates You decide there are 3 classes of sites: 1st positions evolve at relative rate r 1 2nd positions evolve at relative rate r 2 3rd positions evolve at relative rate r 3 r 1, r 2 and r 3 are relative rates, not actual rates: their average is 1.0: if each category has the same number of sites, (r 1 + r 2 + r 3 )/3 = 1.0 the actual rates are r 1 α (for 1st positions), r 2 α (for 2nd positions) and r 3 α (for 3rd positions) note that the average substitution rate over all sites is α (r 1 α + r 2 α + r 3 α)/3 = α (1.0) = α Assuming k rate classes adds k-1 parameters to the model Copyright 2007 Paul O. Lewis 8

Transition probabilities under the JC69 model with no rate heterogeneity: Pr(i i ν) = 1 4 + 3 4 e 4ν 3 Pr(i j ν) = 1 4 1 4 e 4ν 3

Transition probabilities under the JC69 model First base positions under a site-specific rates model: Pr(i i ν) = 1 4 + 3 1 ν 4 e 4r 3 Pr(i j ν) = 1 4 1 1 ν 4 e 4r 3

Site-specific rates in PAUP* First, define a character partition that puts each site into one of several mutually exclusive categories (the category names are arbitrary): charpartition codons = one:1-.\3, two:2-.\3, three:3-.\3; Then tell PAUP* that you want site specific rates and provide the partition you defined previously: lset rates=sitespec siterates=partition:codons; Copyright 2007 Paul O. Lewis 11

Pinvar approach Unlike the site-specific rates approach, this approach does not require you to assign sites to rate categories Assumes there are only two classes of sites: invariable sites (evolve at relative rate 0) variable sites (evolves at relative rate r) Remarks: mean of relative rates = (p invar )(0) + (1-p invar )(r) = 1 this means that r = 1/(1-p invar ) if all sites are variable, p invar = 0 and r = 1 Copyright 2007 Paul O. Lewis 12

Constant site a site in which all of the taxa display the same character state. Invariable site a site in which only one character state is allowed. A site that cannot change state. All invariable sites are constant, but not all constant sites have to be invariable.

Pr(i i invariable) = 1 4 + 3 4 e 40ν 3 = 1 4 + 3 4 e0 = 1 Pr(i j invariable) = 1 4 1 4 e 40ν 3 = 0

A site s likelihood under the JC+ I model x i is the data pattern for site i. General form: Pr(x i JC+I) = p inv Pr(x i inv) + (1 p inv ) Pr If x i is a variable site: If x i is a constant site: Pr(x i JC+I) = (1 p inv ) Pr ( x i JC, Pr(x i JC+I) = p inv Pr(x i inv) + (1 p inv ) Pr ( x i JC, ν 1 p inv ( x i JC, ν 1 p inv ) ν 1 p inv ) )

Why ν 1 p inv? We want the mean rate of change to be 1.0 over all sites (so we can interpret the branch lengths in terms of the expected # of changes per site). If r is the rate of change for the variable sites then: 1 = 0p inv + r ( 1 p inv ) r = = r ( 1 p inv ) 1 1 p inv

Variable (but unknown) rates We expect more shades of grey rather than the on-or-off view of the pinvar model. a priori we do not know which sites are fast and which are slow We may be able to characterize the distribution of rates across sites high variance or low variance.

Gamma distributions relative frequency of sites 1.4 1.2 1 0.8 0.6 0.4 0.2 0 α = 10 α = 1 larger α means less heterogeneity α = 0.1 The mean equals 1.0 for all three of of these distributions smaller α means more heterogeneity 0 1 2 3 4 5 relative rate Copyright 2007 Paul O. Lewis 20

Gamma distribution f(r) = rα 1 β α e βr Γ(α) mean = α/β mean (in phylogenetics) = 1 (in phylogenetics) β = α variance = α/β 2 variance (in phylogenetics) = 1/α

Using Gamma-distributed rates across sites We usually use a discretized version of the gamma with 4-8 categories (the computation time increases linearly with the number of categories). Pr(x i JC + G) = ncat j Pr(x i JC, r j ν) Pr(r j ) where: ncat j r j Pr(r j ) = 1

Discrete gamma (continued) We break up the continuous gamma into intervals each of which has an equal probability, and use the mean rate within each interval as the representative rate for that rate category: Pr(r j ) = 1 ncat So: Pr(x i JC + G) = 1 ncat ncat j Pr(x i JC, r j ν)

Relative rates in 4-category case 1 0.8 Boundary between 1st and 2nd categories Boundaries are placed so that each category represents 1/4 of the distribution (i.e. 1/4 of the area under the curve) 0.6 Relative rates represent the mean of their 0.4 category Boundary between 2nd and 3rd categories Boundary between 3rd and 4th categories 0.2 r 1 = 0.137 r 2 = 0.477 r 3 = 1.000 r 4 = 2.386 0 0 0.5 1 1.5 2 2.5 Copyright 2007 Paul O. Lewis 21

Discrete gamma rate heterogeneity in PAUP* To use gamma distributed rates with 4 categories: lset rates=gamma ncat=4; To estimate the shape parameter: lset shape=estimate; To combine pinvar with gamma: lset rates=gamma shape=0.2 pinvar=0.4; Note: estimate, previous, or a specific value can be specified for both shape and pinvar Copyright 2007 Paul O. Lewis 23

Rate homogeneity in PAUP* Just tell PAUP* that you want all rates to be equal and that you want all sites to be allowed to vary: lset rates=equal pinvar=0; Note: these are the default settings, but it is useful to know how to go back to rate homogeneity after you have experimented with rate heterogeneity! Copyright 2007 Paul O. Lewis 24

Likelihood ratio test Always compares an unconstrained to a constrained model Constrained model must be nested within the unconstrained model Parameter(s) take on their maximum likelihood estimates (MLEs) in the unconstrained model Parameters(s) set to some other value of interest in the constrained model Unconstrained model must be able to attain a higher maximum likelihood than the constrained model Copyright 2007 by Paul O. Lewis 2

Likelihood Ratio Test is the MLE is some other value Coin-flipping example: Data: 6 heads out of 10 flips Constrained model: fair coin (θ = 0.5) Unconstrained model: biased coin (θ = ) Example of likelihood calculation for case of θ = 0.6 Copyright 2007 by Paul O. Lewis 3

Likelihood Ratio Test Coin-flipping example: Data: 6 heads out of 10 flips Constrained model: fair coin (θ = 0.5) Unconstrained model: biased coin (θ = ) Not significant: P = 0.527 This means that the simpler, constrained model cannot be rejected LRT approximates a chi-square random variable with d.f. equal to the difference in the number of free parameters between the two models Copyright 2007 by Paul O. Lewis 4

Examples of unconstrained vs. constrained model comparisons 1. GTR+G (shape=mle) vs. GTR (shape= ) 2. K80 (κ=mle) vs. JC (κ=1.0) 3. HKY+I+G (p inv =MLE) vs. HKY+G (p inv =0) 4. HKY+I+G (p inv =MLE, shape=mle) vs. HKY (p inv =0, shape= ) Note: cases in which the constrained model involves setting a parameter to the edge of its valid range (e.g. cases 1, 2 and 4 above) require special consideration (see Ota et al. 2000) Ota, R., P. J. Waddell, M. Hasegawa, H. Shimodaira, and H. Kishino. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Molecular Biology and Evolution 17:798-803. Copyright 2007 by Paul O. Lewis 6

Testing the molecular clock Unconstrained model: need to estimate 2n-3 = 11 branch lengths Constrained model: need to estimate n-1 = 6 divergence times 1 2 3 4 5 6 7 t 1 t2 t 3 t 4 t 5 t 6 n = 7 taxa Likelihood ratio test thus has (2n-3) - (n-1) = n-2 d.f. Copyright 2007 by Paul O. Lewis 7

Akaike Information Criterion AIC = -2 max(lnl) + 2K K is number of free model parameters Measures relative distance to true model Model with smallest AIC wins Advantage over LRT: non-nested models Example: 6 heads/10 flips revisited Unconstrained model: θ = 0.6, AIC = -2(-1.383) + 2(1) = 4.766 Constrained model: θ = 0.5, AIC = -2(-1.584) + 2(0) = 3.168 (best) Copyright 2007 by Paul O. Lewis 8

Bayesian Information Criterion BIC = -2 max(lnl) + K log(n) K is number of free model parameters n is the sample size Model with smallest BIC wins Advantage over LRT: non-nested models Considered superior to both AIC and LRT Example: 6 heads/10 flips one more time. Note: log(10) 2.3 Unconstrained model: θ = 0.6, BIC = -2(-1.383) + (2.3)(1) = 5.066 Constrained model: θ = 0.5, BIC = -2(-1.584) + 0 = 3.168 (best) Copyright 2007 by Paul O. Lewis 9

Likelihood ratio test favors more complex models Assume the simpler, constrained model is the true model If the LRT was statistically consistent, it would choose the true model with certainty as n But the simpler model will be rejected 5% of the time, regardless of sample size Thus, LRT biased toward choosing the more complex, unconstrained model Copyright 2007 by Paul O. Lewis 11

References