
Lecture 11
Increasing Model Complexity

I. Introduction.

At this point, we've increased the complexity of models of substitution considerably, but we're still left with the assumption that rates are uniform across sites. Differences in functional and structural constraints across sites lead to different sites evolving at different rates. For example, if we look at this hypothetical sequence, we'll see that, because of the nature of the genetic code, not all nucleotide substitutions will result in an amino-acid substitution.

 T   R   K   V
CCU CGA AAA GUC   (No Amino-acid Substitution)

 T   R   K   V
CCU CGG AAA GUC   (Ancestral Sequence)

 T   Q   K   V
CCU CAG AAA GUC   (Amino-acid Substitution)

Therefore, a very common observation is that sites at 3rd codon positions evolve fastest (perhaps at the rate of neutral mutation), followed by those at 1st positions, and 2nd-position sites evolve the slowest. This is one form of among-site rate variation (ASRV), which exacerbates the loss of historical information caused by multiple hits.

If we think about this, it should be pretty obvious. If we have 10 substitutions, and they are distributed randomly across 50 sites, a given site should only rarely receive more than a single substitution. However, if those 10 substitutions are distributed across the 50 sites in a non-random fashion, say concentrated in 1/3 of them, many more will occur at multiply hit sites. Because of the importance of this, I want to present ways to model among-site rate variation.

II. Discrete Methods:

The simplest thing to do is to assign the sites of an alignment to a series of rate partitions. This assignment is often done based on some extraneous information such as codon structure or stem/loop structure.
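As a quick check on the multiple-hits argument from the Introduction, the sketch below computes the expected number of sites hit two or more times when substitutions fall independently and uniformly on a set of sites. The function name is hypothetical, and treating "1/3 of the sites" as 17 of the 50 sites (all equally likely to be hit) is an assumption made only for illustration.

```python
def expected_multiply_hit_sites(n_subs, n_sites):
    """Expected number of sites receiving 2+ of n_subs substitutions
    when each substitution lands uniformly at random on one of n_sites."""
    p = 1.0 / n_sites
    p_zero = (1.0 - p) ** n_subs                     # site is never hit
    p_one = n_subs * p * (1.0 - p) ** (n_subs - 1)   # site is hit exactly once
    return n_sites * (1.0 - p_zero - p_one)

# 10 substitutions spread over all 50 sites.
uniform = expected_multiply_hit_sites(10, 50)

# Assumption: the same 10 substitutions confined to 17 sites (about 1/3 of 50).
concentrated = expected_multiply_hit_sites(10, 17)

print(f"uniform over 50 sites:      {uniform:.2f} multiply hit sites expected")
print(f"concentrated in 17 sites:   {concentrated:.2f} multiply hit sites expected")
```

Under these assumptions, concentrating the substitutions more than doubles the expected number of multiply hit sites, which is the effect the text describes.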

Accommodating different rates of substitution is easily accomplished simply by adding a relative-rate parameter r for site classes to our models, as illustrated below for a JC model:

$$
P_{ij}(t,r) =
\begin{cases}
\tfrac{1}{4} + \tfrac{3}{4}\, e^{-4\alpha r t} & \text{for } i = j \\
\tfrac{1}{4} - \tfrac{1}{4}\, e^{-4\alpha r t} & \text{for } i \neq j
\end{cases}
$$

Then the likelihood for a particular site is calculated as follows:

$$
\ln L_{\tau}(i) = \sum_{r=1}^{c} w_r \ln L(i,r)
$$

where there are c rate categories, and w_r is the probability that site i belongs to a particular rate category; these are binary (0 or 1) if we're assigning sites to rate classes a priori.

B. The most common discrete-rates model is called the Site-Specific Rates (or SSR) model. In the SSR models, the theoretical limit to the number of rate categories is the number of sites in the alignment, but usually these are determined a priori and often they follow codon structure. Felsenstein gives an example of these on page 223 in the text. So in this case, w_1, w_2, and w_3 are fixed to 0 or 1, and we just use an independent JC model for each class (e.g., codon position). The relative rate parameters then can be assigned, as Joe describes in the text, or they can be optimized numerically, which is what is often done.

These SSR models have the advantage that one can essentially use a different transformation matrix (Q) for each class. This can lead to huge improvements in fit between model and data relative to, say, a single GTR for all sites. They have the disadvantage that all sites within a category are assumed to be evolving at a uniform rate. This may be an OK assumption for 3rd codon position sites, but it's probably a really bad one for 1st and 2nd position sites (e.g., Buckley et al. 2001. Syst. Biol. 50:67).

C. Invariable Sites model. A common approach allows two rate categories, and in one of these the relative rate parameter is zero. This is based on observations that there are sites in alignments of conserved genes at which all of life seems to have the same state.
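To make the rate-class machinery concrete, including the r = 0 class just described, here is a minimal sketch of the JC transition probability with a relative rate and of a site likelihood combined across rate classes. It is a toy case under stated assumptions: only two sequences separated by a branch of length t, equal base frequencies of 1/4, and hypothetical function names. The classes are combined as a weighted sum of likelihoods, L(i) = Σ w_r L(i,r); when the w_r are 0 or 1, the log of this sum equals the weighted sum of log-likelihoods.

```python
import math

def jc_prob(i, j, t, r=1.0, a=1.0):
    """JC transition probability P_ij(t, r); `a` is the JC rate (alpha in the text)."""
    e = math.exp(-4.0 * a * r * t)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def pair_site_likelihood(x, y, t, r):
    """Likelihood of seeing states x and y at one site in two sequences
    separated by branch length t, for a site evolving at relative rate r."""
    return 0.25 * jc_prob(x, y, t, r)   # 0.25 = equal base frequency of x

def mixture_site_likelihood(x, y, t, rates, weights):
    """Combine rate classes: sum over classes of w_r * L(site | r)."""
    return sum(w * pair_site_likelihood(x, y, t, r)
               for r, w in zip(rates, weights))

# A priori assignment: the site belongs to the second class only (weights 0/1).
print(mixture_site_likelihood("A", "G", 0.1, [0.5, 1.0, 1.5], [0, 1, 0]))

# Invariable-sites flavour: one class with r = 0 and one variable class.
print(mixture_site_likelihood("A", "A", 0.1, [0.0, 1.0], [0.3, 0.7]))
print(mixture_site_likelihood("A", "G", 0.1, [0.0, 1.0], [0.3, 0.7]))
```

With r = 0 the probability of staying in the same state is exactly 1 and the probability of any change is exactly 0, so a site observed to vary gets no contribution from the invariable class.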

We can think about this model in two ways.

As a mixture model. Typically in this model, the w_r for the category of sites that is potentially variable is estimated from the data. This is taken as the probability that a particular site belongs to either the variable class or the invariable class. So the degree of the mixture is 2.

w_invar = p_invar: the probability that a site is in the class where r = 0.
w_var = p_var: the probability that the site is in the class where r > 0.
w_var = 1 - w_invar
Sites that are observed to vary have w_invar = 0.

We can also take the w_r to be the proportion of sites that are invariable across the alignment. So this model is governed by the parameter p_invar, the proportion of sites that are invariable. This is constrained to be no greater than the proportion of sites that are constant, because there is a non-zero probability that a potentially variable site has not experienced a substitution due to the stochasticity of the process.

II. Continuous Methods:

There's no biological reason, however, to expect rates to fall into discrete categories, and we can use rate-mixture models to deal with this. There are a number of continuous-rates models that have been applied historically, but one of these has become pretty dominant. Most studies that attempt to incorporate ASRV directly use a gamma distribution to model rate variation across sites (the Γ-distributed rates model).

Gamma distributions are governed by two parameters: a shape parameter (α) and a scale parameter (β). The mean of a Γ-distribution is equal to the product of these, αβ. In applications to molecular systematics, we set the mean of the Γ-distribution equal to 1 by constraining β = 1/α. This allows us to scale branch lengths in units of expected substitutions per site. In addition, the Γ-distribution is then governed solely by the shape parameter.

The advantage of using Γ-distributions to model ASRV is that, by varying this single parameter, α, the distribution can take on a variety of different shapes.

When α = ∞, the gamma model converges on a single-rate model. When α = 0.5, the distribution is L-shaped. Many real data sets have a shape parameter near 0.5, although there is a lot of variation.

In principle, using a gamma distribution to model ASRV in phylogenetics can be accomplished by integrating the site likelihood across the Γ-distribution. This isn't at all feasible in practice, so the common solution is to discretize the gamma distribution. The idea is to break the continuous gamma into a number of rate categories (usually 4 to 8). The rate within a category is represented by the within-category mean, and these means are drawn from a Γ-distribution with shape parameter α. The boundaries of the rate categories are set such that there is an equal area of the distribution in each. This is demonstrated below:

[Figure: text plot of the gamma distribution with shape parameter (alpha) = 0.492. Axes run from 0.0 to 1.75 and from 0.00 to 3.00.]

Cut-points and category rates for discrete gamma approximation (ncat = 4)

            ------ cut-points ------
category       lower        upper      rate (mean)
--------------------------------------------------
    1       0.00000000   0.09804816   0.03191473
    2       0.09804816   0.44841399   0.24666120
    3       0.44841399   1.31969682   0.81435904
    4       1.31969682    infinity    2.90706503

Mean = 1.0

So these means are incorporated into the likelihood function as the r_i's, and the w_i's for each site (the probability of the site occurring in rate category i) are optimized. Although it's not usually done, one should vary ncat and identify the smallest value that produces an accurate estimate of α.

Cut-points and category rates for discrete gamma approximation (ncat = 8)

            ------ cut-points ------
category       lower        upper      rate (mean)
--------------------------------------------------
    1       0.00000000   0.02338747   0.00768838
    2       0.02338747   0.09804816   0.05614108
    3       0.09804816   0.23352213   0.16013076
    4       0.23352213   0.44841399   0.33319164
    5       0.44841399   0.78071211   0.60229167
    6       0.78071211   1.31969682   1.02642641
    7       1.31969682   2.35886822   1.77009489
    8       2.35886822    infinity    4.04403517

and:

[Figure: plot with ncat (4 to 24) on the horizontal axis; vertical-axis ticks run from 0.08 to 0.16.]
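The cut-points and category means in tables like those above can be computed with nothing more than the incomplete gamma function. The sketch below is one way to do it in plain Python, under stated assumptions: equal-probability categories, a mean-1 gamma (scale = 1/shape), a power-series evaluation of the regularized incomplete gamma function, and bisection for the quantiles. All function names are hypothetical, and this is not a description of how any particular program implements it. It uses the identity that, for a mean-1 gamma with shape α, the partial mean E[r; r < q] equals P(α + 1, αq), where P is the regularized lower incomplete gamma function.

```python
import math

def reg_lower_gamma(a, x, tol=1e-14, max_terms=10000):
    """Regularized lower incomplete gamma P(a, x), by its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > tol * abs(total) and n < max_terms:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def gamma_quantile(p, alpha):
    """Quantile of a mean-1 gamma (shape alpha, scale 1/alpha), by bisection."""
    def cdf(x):
        return reg_lower_gamma(alpha, alpha * x)
    lo, hi = 0.0, 1.0
    while cdf(hi) < p:          # expand the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma(alpha, ncat):
    """Return (cut-points, category mean rates) for ncat equal-area categories."""
    cuts = [0.0] + [gamma_quantile(k / ncat, alpha) for k in range(1, ncat)] + [math.inf]
    # Partial means E[r; r < cut]; these are 0 at the lower end and 1 at infinity.
    partial = [0.0] + [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts[1:-1]] + [1.0]
    rates = [ncat * (partial[k + 1] - partial[k]) for k in range(ncat)]
    return cuts, rates

cuts, rates = discrete_gamma(0.492, 4)
for k, rate in enumerate(rates):
    print(f"{k + 1}  {cuts[k]:.8f}  {cuts[k + 1]:.8f}  {rate:.8f}")
print("mean rate:", sum(rates) / len(rates))
```

Running it with alpha = 0.492 and ncat = 4 or 8 gives values that can be compared directly against the two tables. Because each category holds equal area, the category means average to 1 by construction.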

This is done across the entire data set, so essentially we take the same transformation matrix (Q) for each site and scale it by the average rate for each category. This has the tremendous advantage of being able to accommodate a wide diversity of rates with just a single parameter, α. Some sites can be so slowly evolving as to have a high probability of stasis, yet others (perhaps adjacent) may be free to evolve rapidly. It has the disadvantage that we apply the same transformation matrix uniformly across a data set. It's also pretty common to over-discretize the gamma distribution (i.e., use too few rate categories).

III. I+Γ Models

A further elaboration that has become widely used is a mixture of invariable sites, with rates at variable sites being drawn from a gamma distribution. This is called the I+Γ model, and was developed independently by Gu, Fu, & Li (1995, Mol. Biol. Evol.) and Waddell and Penny (1996).

[Figure: distribution of rates under the I+Γ model. Labels: p_invar; Γ shape parameter (α). Horizontal axis: Rate of Evolution.]

This is intuitively very appealing when one considers that, at least for some genes, there's a set of sites that are constant across essentially the entire tree of life.

This model is very frequently required by real data sets (as assessed by methods we'll discuss in subsequent lectures), but there are some issues with it that are sometimes not appreciated. Both the mixed model and the gamma alone expect there to be many constant sites. It can be very difficult to discern the sites that are truly invariable from those potentially variable sites that are evolving slowly enough to have a high probability of stasis. This can result in very poorly behaved likelihood surfaces, as shown below.

[Figure: two ln-likelihood surfaces for the I+Γ model, plotted over Pinv (0 to 0.6) and alpha (0.5 to 2.5); the ln Likelihood axis runs from about -5800 to -5660.]

This is the likelihood surface for the parameters of the I+Γ model. There are relatively few taxa in this data set and there are multiple peaks in the likelihood surface, one of which is the true peak (the data are simulated, so they fit the model perfectly). However, with many taxa, the surface is better behaved (same data, more taxa).
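One way to see why p_invar and α are hard to separate is to look at the simplest possible summary: the probability that two sequences share the same state at a site. The sketch below is a toy illustration under stated assumptions, not the analysis behind the surfaces above: two sequences only, the JC formula used earlier in this lecture, s standing for 4 × (JC rate) × t, and no rescaling of rates in the variable class. It uses the standard result that, for mean-1 gamma rates with shape α, the expectation of exp(-s r) is (1 + s/α)^(-α). Function names are hypothetical.

```python
def p_same_gamma(s, alpha):
    """P(two sequences match at a site) under JC with mean-1 gamma rates."""
    return 0.25 + 0.75 * (1.0 + s / alpha) ** (-alpha)

def p_same_inv_gamma(s, alpha, p_inv):
    """Same probability when a fraction p_inv of sites cannot change."""
    return p_inv + (1.0 - p_inv) * p_same_gamma(s, alpha)

def matching_alpha(s, target, p_inv, lo=1e-3, hi=1e3):
    """Bisection for the alpha that reproduces `target` given p_inv.
    p_same_gamma decreases as alpha grows, so larger alpha means fewer matches."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_same_inv_gamma(s, mid, p_inv) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s = 0.5
target = p_same_gamma(s, alpha=0.3)             # gamma alone, strong rate variation
alpha_2 = matching_alpha(s, target, p_inv=0.3)  # I + gamma with 30% invariable sites

print("gamma alone, alpha = 0.3:  ", target)
print("p_inv = 0.3, alpha =", alpha_2, ":", p_same_inv_gamma(s, alpha_2, 0.3))
```

Two quite different parameter combinations give the same expected proportion of matching sites for this pair, so a pairwise comparison cannot tell them apart. Information from additional taxa is what distinguishes them, which is consistent with the better-behaved surface described above for larger data sets.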

IV. Expanding the GTR family of models.

Let's think about the parameters of a GTR transformation matrix with ASRV modeled with I+Γ. There are 10 free parameters. Eight are associated with the transformation matrix: three free base frequencies (they're constrained to sum to 1) and five relative rate parameters (they're relative, and r_GT is set to one). Two are associated with ASRV: the shape parameter of the gamma distribution and p_invar.

Just as, when we were discussing the transformation matrix, simpler models are special cases of the most parameter-rich model, the equal-rates models are special cases of the variable-rates models. That is, we can erect a nested series of substitution models. So the relationships among the GTR+I+Γ family of models can be illustrated with a clover-leaf diagram. To me, this is a very convenient way to visualize model space.

[Figure: clover-leaf diagram of the GTR+I+Γ family. The four lobes are the equal-rates models (JC, K2P, K3P, SYM, F81, HKY85, F84, TmN, GTR), the +I models, the +Γ models, and the +I+Γ models. Within each lobe, models are linked by the restrictions "all tv equal", "all ti equal", "all substitutions equal", and "equal base frequencies". Lobes are linked by the restrictions p_invar = 0 and α = infinity.]
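The bookkeeping in this section can be checked with a few lines of code. The sketch below counts free parameters for members of the family and counts the ways the six relative-rate parameters of the r-matrix can be constrained to be equal to one another, which is the number of set partitions of six items (the Bell number B6). The function names and the way a model is described (number of free base frequencies, number of distinct rate classes) are conventions assumed here for illustration.

```python
def free_parameters(free_freqs, rate_classes, gamma=False, invariable=False):
    """Free parameters: base frequencies + relative rates (one class is fixed
    to 1) + one for the gamma shape + one for p_invar."""
    return free_freqs + (rate_classes - 1) + int(gamma) + int(invariable)

def bell_number(n):
    """Number of ways to partition n items into groups, via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]            # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

print("JC       :", free_parameters(0, 1))
print("K2P      :", free_parameters(0, 2))
print("HKY85    :", free_parameters(3, 2))
print("GTR      :", free_parameters(3, 6))
print("GTR+I+G  :", free_parameters(3, 6, gamma=True, invariable=True))
print("restrictions of six rate parameters:", bell_number(6))
```

GTR+I+Γ comes out at 10 free parameters, matching the count in the text, and the six rate parameters can be grouped in 203 ways.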

In the I+Γ models, if α = infinity, the ASRV model is an invariable sites model. If p_invar = 0, the ASRV model is equivalent to a Γ alone. In a Γ model alone, when α = infinity, the gamma model converges to the equal-rates models. Similarly, in an invariable sites model (alone), if p_invar = 0, the invariable sites model also reduces to an equal-rates model. Remember that each lobe of the cloverleaf represents 203 possible restrictions of the r-matrix.

Similarly, we can consider the SSR models to be a family of special cases. If we have a GTR+SSR3 model, we can think of the following parameterization:

π_A1     π_A2     π_A3
π_C1     π_C2     π_C3
π_G1     π_G2     π_G3
π_T1     π_T2     π_T3
r_(AC)1  r_(AC)2  r_(AC)3
r_(AG)1  r_(AG)2  r_(AG)3
r_(AT)1  r_(AT)2  r_(AT)3
r_(CG)1  r_(CG)2  r_(CG)3
r_(CT)1  r_(CT)2  r_(CT)3
r_(GT)1  r_(GT)2  r_(GT)3

The GTR model applied to all sites is equivalent to this with the following restrictions:

π_A1 = π_A2 = π_A3
π_C1 = π_C2 = π_C3
π_G1 = π_G2 = π_G3
π_T1 = π_T2 = π_T3
r_(AC)1 = r_(AC)2 = r_(AC)3
r_(AG)1 = r_(AG)2 = r_(AG)3
r_(AT)1 = r_(AT)2 = r_(AT)3
r_(CG)1 = r_(CG)2 = r_(CG)3
r_(CT)1 = r_(CT)2 = r_(CT)3
r_(GT)1 = r_(GT)2 = r_(GT)3

So we have a number of models, and there are nested series.

GTR+CAT in RAxML

Before we leave rate heterogeneity, we should discuss a relatively new approach that Stamatakis has implemented in RAxML.

It's like the SSR approach, in that sites are assigned to rate classes, and therefore the w_r's are all either zero or 1. However, sites are classed into categories (usually 25 rate categories) based on an initial estimate of their rates on a starting tree. The rates for each class are assigned as the rate of the site with the highest SSL in the category, and they're then fixed for tree searching (remember, this is via stepwise addition under parsimony followed by lazy SPR). This is much faster than a Γ-model because SSLs are only calculated once, since every site is assigned to a single rate category.

This approach works well when the number of sequences in the data set is large, but when there are fewer than several hundred, the estimates of the rates at each site are pretty lousy and the performance of GTR+CAT declines.

IV. rRNA Model

A couple of models have been developed to deal with non-independence of nucleotides in paired stem regions of rRNA. These models use a priori partitioning of sites into stem and loop regions, and sites in the loop partition are treated with some variant of the GTR+I+Γ family. Sites in the stem regions are treated using the doublet model. Doublets are treated as characters rather than nucleotides, and there are 16 possible states rather than 4. So instead of 12 substitution types (or 6 reversible types) there are n(n-1) = 240 types (or 120 reversible types). In addition, instead of three free base frequency parameters, there will be 15 free doublet frequencies, many of which are likely to be zero.

Smith et al. (2004. Mol. Biol. Evol. 21:419) used an aligned database of 50K sequences to estimate these parameters, and they provide a fully parameterized empirical model. See also Telford et al. (2005. Mol. Biol. Evol. 22:1129) and Todaro et al. (2006. Zool. Scripta 35:251).

V. Codon-based models

It's also possible to model the non-independence of sites generated by the genetic code using a codon-based model.

Here, in-frame triplets are used as characters and there are 61 possible character states (64 triplets minus the three stop codons). Thus the transformation matrix has 3660 rate parameters (or 1830 in the reversible case). Again, empirical matrices can be used.

Alternatively, cells of the transformation matrix can be restricted so that there are only, say, two substitution types.

TTT ↔ TTC: both code for Phe, so the T ↔ C transition is silent. The cell in the matrix would be filled with α π_C, where α is the rate of silent substitutions and π_C is (as before) the frequency of nucleotide C.

Conversely, TTT ↔ TTA would be expressed as β π_A, where β is the rate of amino-acid replacement substitutions, because TTA codes for Leucine.

This is the approach taken by Muse & Weir (1994. Mol. Biol. Evol. 11:715). There are only 4 parameters here: the three free base frequencies and the ratio of the rates of silent vs. replacement substitutions. Goldman and Yang (1994. Mol. Biol. Evol. 11:725) go a step further and incorporate a transition/transversion rate ratio, and Halpern and Bruno (1998. Mol. Biol. Evol. 15:910) allow all six possible nucleotide substitution types.

A cool thing about this approach is that we can calculate the ratio of synonymous to replacement substitutions, which allows for an assessment of the strength of selection operating at a site.
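The two-substitution-type restriction described above can be sketched in a few lines. The code below builds the standard genetic code, confirms the state counts quoted in this section, and fills a single off-diagonal cell in the style of Muse & Weir. It is a minimal sketch under stated assumptions: DNA letters (T rather than U), the standard code, changes at more than one codon position given rate zero, and hypothetical names (`mw_rate`, `alpha`, `beta`, `pi`); diagonal entries and rate-matrix scaling are left out.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [codon for codon, aa in CODE.items() if aa != "*"]

def mw_rate(src, dst, alpha, beta, pi):
    """Rate from sense codon src to sense codon dst.
    alpha: silent rate; beta: replacement rate; pi: nucleotide frequencies."""
    if CODE[src] == "*" or CODE[dst] == "*":
        return 0.0
    diffs = [k for k in range(3) if src[k] != dst[k]]
    if len(diffs) != 1:
        return 0.0                     # same codon, or more than one position differs
    rate = alpha if CODE[src] == CODE[dst] else beta
    return rate * pi[dst[diffs[0]]]    # weighted by the frequency of the new base

pi = {"T": 0.2, "C": 0.3, "A": 0.3, "G": 0.2}   # illustrative values only
n = len(SENSE)
print("sense codons:", n, " cells:", n * (n - 1), " reversible:", n * (n - 1) // 2)
print("doublet cells:", 16 * 15, " reversible:", 16 * 15 // 2)
print("TTT -> TTC:", mw_rate("TTT", "TTC", alpha=1.0, beta=0.2, pi=pi))  # silent
print("TTT -> TTA:", mw_rate("TTT", "TTA", alpha=1.0, beta=0.2, pi=pi))  # replacement
```

The same lookup table classifies the changes in the example from the Introduction: CGG and CGA both encode R, whereas CAG encodes Q.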