Base Range A 0.00 to 0.25 C 0.25 to 0.50 G 0.50 to 0.75 T 0.75 to 1.00

Similar documents
Evolutionary Models. Evolutionary Models

Logarithmic Functions

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

9.2 Multiplication Properties of Radicals

A factor times a logarithm can be re-written as the argument of the logarithm raised to the power of that factor

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

MITOCW MITRES18_006F10_26_0501_300k-mp4

Exponential Functions

5.6 Logarithmic and Exponential Equations

Slide 1. Slide 2. Slide 3. Dilution Solution. What s the concentration of red triangles? What s the concentration of red triangles?

Appendix A. Common Mathematical Operations in Chemistry

Numerical Methods. Exponential and Logarithmic functions. Jaesung Lee

2. Limits at Infinity

Conditional Probability

1 Continuity and Limits of Functions

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Section 5.1 Logarithms and Their Properties

Mathematics-I Prof. S.K. Ray Department of Mathematics and Statistics Indian Institute of Technology, Kanpur. Lecture 1 Real Numbers

Nondeterministic finite automata

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

#29: Logarithm review May 16, 2009

Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations

Engineering Mechanics Prof. Siva Kumar Department of Civil Engineering Indian Institute of Technology, Madras Statics - 5.2

Guide to Proofs on Sets

Since the logs have the same base, I can set the arguments equal and solve: x 2 30 = x x 2 x 30 = 0

Logarithms Tutorial for Chemistry Students

Fitting a Straight Line to Data

Exponential and Logarithmic Functions

Skill 6 Exponential and Logarithmic Functions

4.4 Graphs of Logarithmic Functions

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Functions

The Growth of Functions. A Practical Introduction with as Little Theory as possible

Graphical Analysis and Errors - MBL

Introduction to Determining Power Law Relationships

Logarithmic Properties

Cosets and Lagrange s theorem

4.1 Solutions to Exercises

TheFourierTransformAndItsApplications-Lecture28

CS1800: Strong Induction. Professor Kevin Gold

If a function has an inverse then we can determine the input if we know the output. For example if the function

Exponential and Logarithmic Functions. By Lauren Bae and Yamini Ramadurai

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER /2018

Understanding Organic Reactions

Lecture 29 Relevant sections in text: 3.9

Sequences and Series

2.6 Logarithmic Functions. Inverse Functions. Question: What is the relationship between f(x) = x 2 and g(x) = x?

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

MA554 Assessment 1 Cosets and Lagrange s theorem

Graphical Analysis and Errors MBL

J = L + S. to this ket and normalize it. In this way we get expressions for all the kets

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

MATH 124. Midterm 2 Topics

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Expected Value II. 1 The Expected Number of Events that Happen

MITOCW MITRES18_005S10_DerivOfSinXCosX_300k_512kb-mp4

Continuity and One-Sided Limits

Chapter 3. Exponential and Logarithmic Functions. 3.2 Logarithmic Functions

Modern Algebra Prof. Manindra Agrawal Department of Computer Science and Engineering Indian Institute of Technology, Kanpur

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Chapter 13 - Inverse Functions

Exercise Sheet 4: Covariance and Correlation, Bayes theorem, and Linear discriminant analysis

TAYLOR POLYNOMIALS DARYL DEFORD

Lecture 7: Chemical Equilbria--a bit more detail and some additional kinds of problems.

Logarithms and Exponentials

6.080 / Great Ideas in Theoretical Computer Science Spring 2008

MA 1125 Lecture 15 - The Standard Normal Distribution. Friday, October 6, Objectives: Introduce the standard normal distribution and table.

Non-independence in Statistical Tests for Discrete Cross-species Data

The following are generally referred to as the laws or rules of exponents. x a x b = x a+b (5.1) 1 x b a (5.2) (x a ) b = x ab (5.

20. The pole diagram and the Laplace transform

Section 0.6: Factoring from Precalculus Prerequisites a.k.a. Chapter 0 by Carl Stitz, PhD, and Jeff Zeager, PhD, is available under a Creative

Modeling Vascular Networks with Applications. Instructor: Van Savage Spring 2015 Quarter 4/6/2015 Meeting time: Monday and Wednesday, 2:00pm-3:50 pm

Assignment 5 Name MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Module 6 Lecture Notes

MATH 22 FUNCTIONS: ORDER OF GROWTH. Lecture O: 10/21/2003. The old order changeth, yielding place to new. Tennyson, Idylls of the King

Skill 6 Exponential and Logarithmic Functions

MITOCW R11. Double Pendulum System

Regression with Nonlinear Transformations

Problem Set 1 Solutions

P (E) = P (A 1 )P (A 2 )... P (A n ).

0. Introduction 1 0. INTRODUCTION

This is the last of our four introductory lectures. We still have some loose ends, and in today s lecture, we will try to tie up some of these loose

4 ORTHOGONALITY ORTHOGONALITY OF THE FOUR SUBSPACES 4.1

The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed

Lecture 12: Quality Control I: Control of Location

Lecture 3 Probability Basics

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

PROFESSOR: WELCOME BACK TO THE LAST LECTURE OF THE SEMESTER. PLANNING TO DO TODAY WAS FINISH THE BOOK. FINISH SECTION 6.5

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny

Chapter 7: Models of discrete character evolution

Lesson 6-1: Relations and Functions

Chemical Applications of Symmetry and Group Theory Prof. Manabendra Chandra Department of Chemistry Indian Institute of Technology, Kanpur.

Exponential Functions:

PHYSICS 206b HOMEWORK #3 SOLUTIONS

MATH 320, WEEK 7: Matrices, Matrix Operations

Conditional probabilities and graphical models

Separable First-Order Equations

MATH 1101 Exam 3 Review - Gentry. Spring 2018

POD. A) Graph: y = 2e x +2 B) Evaluate: (e 2x e x ) 2 2e -x. e 7x 2

MA 1128: Lecture 19 4/20/2018. Quadratic Formula Solving Equations with Graphs

Transcription:

PhyloMath Lecture 1 by Paul O. Lewis, 22 January 2004 Simulation of a single sequence under the JC model We drew 10 uniform random numbers to simulate a nucleotide sequence 10 sites long. The JC model specifies that each base has a relative frequency of 0.25, so the following table was used to decide which base to insert for each random number drawn: Base Range A 0.00 to 0.25 C 0.25 to 0.50 G 0.50 to 0.75 T 0.75 to 1.00 Below is a table showing the random numbers drawn and the data generated (not the exact values presented in class, I made up new numbers): Random number Base chosen 0.644 G 0.783 T 0.752 T 0.717 G 0.757 T 0.307 C 0.079 A 0.471 C 0.093 A 0.998 T Likelihood under the JC model using the JC sequence The likelihood for the sequence simulated in the previous section (GTTGTCACAT) under the JC model is just L JC =( 1 4 )( 1 4)( 1 4)( 1 4)( 1 4)( 1 4)( 1 4)( 1 4)( 1 4 )( 1 4)=( 1 4) 10 =9.536743164 10 7 The likelihood is the same numerical value for every possible sequence under the JC model. The question was raised: Why wouldn t the JC model specify a smaller likelihood for the sequence GGGGGGGGGG than for the one we simulated; it seems much less probable to have 10 Gs in a row than to have a sequence with at least one representative of each of the four bases. The answer is that it really is just as probable to see GGGGGGGGGG as any other particular sequence, including GTTGTCACAT. The reason this is not intuitive is that in our minds we (mistakenly) think of sequences as falling into two groups: monomorphic sequences like GGGGGGGGGG and polymorphic sequences like GTTGTCACAT. Our minds correctly figure out that the probability of simulating one of the four monomorphic sequences is small compared to the probability of simulating one of the polymorphic sequences (of which there are 4 10 4!). The chance of simulating the particular sequence GTTGTCACAT is, however, identical to the chance of simulating the particular sequence GGGGGGGGGG. Simulation of a single sequence under the F81 model We next drew 10 uniform random numbers to simulate a second nucleotide sequence 10 sites long, this time using the F81 model, which allows the relative nucleotide frequencies to be unequal. We decided (arbitrarily) to let π A =0.1, π C =0.2, π G =0.3, and π T =0.4. The following table was used to decide which base to insert for each random number drawn:

Base Range A 0.0 to 0.1 C 0.1 to 0.3 G 0.3 to 0.6 T 0.6 to 1.0 Below is a table showing the random numbers drawn and the data generated (again, these are not the exact values presented in class): Random number Base chosen 0.511 G 0.632 T 0.601 T 0.739 T 0.627 T 0.766 T 0.125 C 0.480 G 0.599 G 0.978 T Likelihood under the F81 model using the F81 sequence The empirical base frequencies for a sequence are computed as simple proportions, using the number of each base observed in the sequence, divided by the total number of sites. The empirical frequency is 0.0 for base A (no As were observed), 0.1 for base C (1 C was observed out of 10 sites), 0.3 for G, and 0.6 for T. Below, the likelihood under the F81 model is computed for this sequence using the empirical nucleotide composition: L F 81 =(0.3)(0.6)(0.6)(0.6)(0.6)(0.6)(0.1)(0.3)(0.3)(0.6) = 1.259712 10 4 This can be written as a general formula as follows: L F 81 = π na A πnc C πng G πnt T Here, π i is the relative frequency of base i, and n i is the number of times base i was seen in the sequence. Plugging in the empirical values 0.0, 0.1, 0.3, and 0.6 for π A, π C, π G, and π T, respectively, we get the same answer as before (1.259712 10 4 ). Plugging in the true frequencies, we get L F 81 = π na A πnc C πnga G πnt T =(0.1)0 (0.2) 1 (0.3) 3 (0.4) 6 =2.21184 10 5 Thus, the probability of the sequence GTTTTTCGGT under the F81 model is higher using the empirical frequencies than it is using the true frequencies. This will always be the case, because if the data consist of just one sequence, the empirical base frequencies are the maximum likelihood estimates of the relative nucleotide frequencies. This means that one cannot make the likelihood any higher using any other combination of relative frequencies, including the true combination. Likelihood under both models using the JC sequence The difference between the JC model and the F81 model lies in the fact that the F81 model allows any combination of base frequencies, whereas the JC model forces all the base frequencies to be equal: i.e. π A = π C = π G = π T =0.25. The likelihood computed under a certain model tells us how well the model fits the data. We might reasonably expect the JC model to fit the JC sequence better than the F81 model, and vice versa, but there is so little data in this case (10 sites) that this result is certainly not guaranteed. Also, note that it is impossible for the JC model to beat the F81 model in this game, because the F81 model

can be made to equal the JC model by stipulating that the base frequencies are all equal. So the only way for JC to win is to prevent F81 from setting its base frequency parameters to anything it chooses. We have already computed the likelihood for the JC sequence (GTTGTCACAT) under the JC model: 9.536743164 10 7. The likelihood under the F81 model for the same sequence (using the empirical frequencies) is: L F 81 = π na A πnc C πnga G πnt T =(0.2)2 (0.2) 2 (0.2) 2 (0.4) 4 =1.6384 10 6 The probability of the data (i.e. the likelihood) is higher under the F81 model, even though the data were generated using the JC model. Exercise 1: compute the likelihood for the JC sequence under the F81 model using the true frequencies (that is, 0.25 for all four bases) The F81 likelihood should be identical to the JC likelihood in this case, namely 9.536743164 10 7. Exercise 2: compute the likelihood for the JC sequence under the F81 model assuming π A =0.4, π C =0.3, π G =0.2, and π T =0.1? These frequencies are pretty far from the true ones, so we might expect the F81 likelihood to be less (i.e. worse) than the JC likelihood in this case. The answer I got for this was 5.76 10 8, which is indeed quite a bit worse than the JC likelihood. Likelihood under both models using the F81 sequence Doing the same exercise for the F81 sequence (GTTTTTCGGT), we again get 9.536743164 10 7 for the JC model (all sequences of length 10 have the same likelihood under the JC model) and 2.21184 10 5 for the F81 model (using the true frequencies). So this time, the model actually used to simulate the sequence is indeed the best-fitting model. The fit of the F81 model is even better if the empirical frequencies are used: 1.259712 10 4. Using the empirical frequencies tunes the F81 model to fit the data as well as possible. In this case, the sequence GTTTTTCGGT is 132 times more probable under the F81 model than it is under the JC model (0.0001259712 is 132 times larger than 0.0000009536743164). Logarithms Logarithms are useful for representing likelihoods especially for large amounts of data. The probability that an A will be seen at the first position of a sequence and a T will be seen at the second position is the probability of a coincidence. The more players (i.e. sites) involved in the coincidence, the smaller will be the probability of that coincidence. There is nothing unusual about this: just think of the probability of seeing three of your friends while sitting at the stoplight at Four Corners. All of your friends (and you) probably have perfectly good reasons for being at Four Corners at some time during the day, but the coincidence is always going to be much less probable than any individual component. With sequences, the probabilities of coincidences involving thousands of sites are very tiny numbers indeed. It doesn t mean these numbers are not useful, but they can become so small that computers cannot distinguish them from zero. Logarithms allow such tiny numbers to be manipulated accurately. Common logarithms are defined to be the power of 10 that is needed to represent some number. For example, the common logarithm of the value x (written log(x)) is the value y such that 10 y = x. This means that the following is a true statement: x =10 log(x) This is a very useful tautology. Natural logarithms are much more common in the sciences, and behave just like common logarithms but use the base e instead of the base 10. Thus x = e ln(x) The value e is a constant approximately equal to 2.718281828. You can use your calculator to get the value of e by entering the number 1 and pressing the e x key.

Logs of products The (natural) log of a product is the sum of the logs of the individual terms in the product. Thus, logs turn products into sums. For example, for any two real numbers a and b, For example, ln(ab) =ln(a) + ln(b) ln((2.2)(4.4)) = ln(9.68) = 2.270061901 ln(2.2) = 0.78845736 ln(4.4) = 1.481604541 ln(2.2) + ln(4.4) = 2.270061901 Why is this true? Starting with ab, replace a and b with their equivalents using the tautology and simplify the resulting expression using the rules of exponents: ab = (e ln(a))( e ln(b)) = e ln(a)+ln(b) The last step just involves invoking the tautology once more, realizing that the power to which e on the right side is raised (i.e. the quantity ln(a) + ln(b)) must be (by definition) the natural logarithm of ab. In other words, if ab = e ln(ab) then ln(ab) must be equal to ln(a) + ln(b) (because they both equal ab). Logs of quotients The same process can be used to derive a rule for quotients. Starting with the quotient a/b, Thus, ln(a/b) must equal ln(a) ln(b). a b = eln(a) e ln(b) = eln(a) ln(b) Log-likelihoods The log-likelihood (commonly abbreviated lnl) is the natural logarithm of the likelihood. For the JC likelihood of one sequence 10 sites in length, ln L =ln ( ( (0.25) 10) =ln[ e ln(0.25)) ] 10 ( =ln e 10 ln(0.25)) = 10 ln(0.25) = 13.86294361 Note that you can recover the now-familiar L JC =9.536743164 10 7 by entering ln L = 13.86294361 into your calculator and pressing the e x button. The exponential function is the inverse of the natural log function, and the two cancel each other out, leaving just the likelihood L: e ln L = L. A general formula for the log-likelihood of a sequence under the F81 model would be: ln L = n A ln(π A )+n C ln(π C )+n G ln(π G )+n T ln(π T )= n i ln(π i ) i {A,C,G,T }

Simulation of two sequences separated by time t and evolving at rate α We finished by beginning a discussion on evolving one sequence from another sequence. Two more quantities must be introduced before the simulation can be done. Assume the time t is 10 and the substitution rate α is 0.01. The product αt is thus 0.1. This product represents the expected number of substitutions. Thus, for αt =0.1, we expect one substitution to occur for every 10 sites. For the JC model, the expected number of substitutions is actually 3αt, but not for the reason I proposed in class. The real reason is because there are three stochastic substitution processes going on simultaneously (the base in the ancestral sequence can change to any one of the three other bases), and the expected number of substitutions is the same for all three (i.e. αt). Adding the expected number of substitutions from all three processes together gives you the 3αt. Starting with the JC sequence created at the beginning of this lecture (GTTGTCACAT), simulating the evolution of a second sequence involves transition equations, which for the JC model are: P ii (t) = 1 4 + 3 4e 4αt P ij (t) = 1 4 1 4e 4αt These equations tell us that the probability of a site ending up with a different base than it started with over a time t = 10 when it is evolving at a substitution rate α = 0.01 is 0.25 0.25e 0.4 = 0.25 (0.25)(0.670320046) = 0.08242. The probability of ending up in the same base as it started with is just 1 (3)(0.08242) = 0.75274. Calculating it the hard way, you get the same answer: 0.25+0.75e 0.4 = 0.25 + (0.75)(0.670320046) = 0.75274. For the first site, the starting base is G, and to simulate the evolution of this site, we would set up a table for deciding how to process the uniform random numbers as follows: Base Range A 0.0 to 0.08242 C 0.08242 to 0.16484 G 0.16484 to 0.91758 T 0.91758 to 1.0 Note that this table is designed specifically for a starting base of G (the probability of ending with a G is the largest slice of the pie in this case, a little over 75%). If we were simulating the second site, which starts with a T, the table would look like this instead (with the largest slice of the pie going to T as the base at the far end of the branch): Base Range A 0.0 to 0.08242 C 0.08242 to 0.16484 G 0.16484 to 0.24726 T 0.24726 to 1.0 It is of course important to establish these rules before you begin drawing random numbers! Otherwise, you might be sorely tempted to cheat and start inserting your favorite base too many times. This process is exactly how programs such as SeqGen simulate sequence data. Programs like SeqGen allow one to use a variety of models, of course, but this just involves using different transition probability equations, which differ among models. Simulating data on a tree involves simply repeating this process for each branch in the tree, each time starting with the sequence you were left with in the last step. Here is a second sequence I simulated starting with an ancestral sequence consisting of 10 Gs. The simplicity of the ancestral sequence meant that I could use the first table above for all 10 sites.

Random number Base chosen 0.670 G 0.333 G 0.173 G 0.733 G 0.524 G 0.890 G 0.400 G 0.156 C 0.591 G 0.708 G Only one site (8) ended up with a different base (C); all the other sites did not change over the branch. The expected number of substitutions was 3αt = 0.3. The transition equations take account of hidden substitutions, so the fact that we only detected one substitution when three substitutions were expected out of 10 sites should not worry us. If one of the other sites had experienced a substitution from G to say A and back again, the ending base would still be G and we would not be able to detect these additional two substitutions. I will show you another method of simulating data next time that allows us to see these hidden substitutions. This alternative method forms the basis of the character mapping approach advocated by Rasmus Nielsen in several recent papers (Huelsenbeck et al., 2003; Nielsen, 2002). Literature Cited Huelsenbeck, J. P., R. Nielsen, and Bollback, J. P. 2003. Stochastic mapping of morphological characters. Systematic Biology 52(2): 131 158. Nielsen, R. 2002. Mapping mutations on phylogenies. Systematic Biology 51(5): 729 739.