Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Size: px

Start display at page:

Download "Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington."

Naomi Sanders
6 years ago
Views:

1 Maximum Likelihood This presentation is based almost entirely on Peter G. Fosters - "The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed.

2 Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Its slow uptake by the scientific community has to do with the difficulty of understanding the theory and also the absence (initially) of good quality software with choice of models and ease of interaction with data. Also, at the time, it was computationally intractable to analyze large datasets (consider that in the mid-1980s a typical desktop computer had a processor speed of less than 30 MHz). In recent times, software, models and computer hardware have become sufficiently sophisticated that ML is becoming a method of choice.

3 ML: comparison with other methods ML assumes a model of sequence evolution. ML attempts to answer the question: What is the probability that I would observe these data (a multiple sequence alignment), given a particular model of evolution (a tree and a process). ML uses a model. This is justifiable, since molecular sequence data can be shown to have arisen according to a evolutionary process.

4 Maximum Likelihood - goal To estimate the probability that we would observe a particular dataset, given a phylogenetic tree and some notion of how the evolutionary process worked over time. Probability of given π = [ a,c,g,t] a b c d b a e f c e a g d c f a

5 What is the probability of observing a datum? If we flip a coin and get a head and we think the coin is unbiased, then the probability of observing this head is 0.5. If we think the coin is biased so that we expect to get a head 80% of the time, then the likelihood of observing this datum (a head) is 0.8. Therefore: The likelihood of making some observation is entirely dependent on the model that underlies our assumption. p =? Lesson: The datum has not changed, our model has. Therefore under the new model the likelihood of observing the datum has changed.

6 What is the probability of observing a 'G' nucleotide? Question:If we have a DNA sequence of one nucleotide in length and the identity of this nucleotide is 'G', what is the likelihood that we would observe this 'G'? Answer: In the same way as the coin-flipping observation, the likelihood of observing this 'G' is dependent on the model of sequence evolution that is thought to underlie the data. E.g. Model 1: frequency of G = 0.40 => likelihood(g) = 0.40 Model 2: frequency of G = 0.10 => likelihood(g) = 0.10 Model 3: frequency of G = 0.25 => likelihood(g) = 0.25

7 One rule the rule of 1 The sum of the likelihoods of all the possibilities will always equal 1 E.g. for DNA p(a)+p(c)+p(g)+p(t)=1

8 What about longer sequences? If we consider a gene of length 2: Gene 1: ga The the probability of observing this gene is the product of the probabilities of observing each character. E.g p(g) = 0.40; p(a)=0.15 (for instance) Likelihood (ga) = 0.40 x 0.15 = 0.06

9 or even longer sequences? Gene 1: gactagctagacagatacgaattac Model (simple base frequency model): p(a)=0.15; p(c)=0.20; p(g)=0.40; p(t)=0.25; (the sum of all probabilities must equal 1) Like(Gene 1) = = 1.8 x 10-17

10 Note about models You might notice that our model of base frequency is not the optimal model for our observed data. If we had used the following model: p(a)=0.40; p(c) =0.20; p(g)= 0.20; p(t) = 0.20; The likelihood of observing the gene is: Like(gene 1) = = 3.3 x (a value that is almost 10,000 times higher vs ) Lesson: The datum has not changed, our model has. Therefore under the new model the likelihood of observing the datum has changed.

11 How does this relate to phylogenetic trees? Consider an alignment of two sequences: Gene 1: gaac Gene 2: gacc We assume these genes are related by a (simple) phylogenetic tree with branch lengths.

12 Increase in model sophistication It is no longer possible to simply invoke a model that encompasses base composition, we must also include the mechanism of sequence change and stasis. There are two parts to this model - the tree and the process (the latter is confusingly referred to as the model, although both parts really compose the model).

13 The model The two parts of the model are the tree and the process (the model). The model is composed of the composition and the substitution process rate of change from one character state to another character state. Model = + a b c d b a e f c e a g d c f a [ ] π = a,c,g,t

14 Simple time-reversible model A simple model is that the rate of change from a to c or vice versa is 0.4, the composition of a is 0.25 and the composition of c is 0.25 (a simplified version of the Jukes and Cantor 1969 model) P = π = [ ]

15 Probability of the third nucleotide position in our current alignment p(a) =0.25; p(c) = 0.25; p a c = 0.4 Starting with a, the likelihood of the nucleotide is 0.25 and the likelihood of the substitution (branch) is 0.4. So the likelihood of observing these data is: *Likelihood model = prob (D M) = 0.25 x 0.4 =0.01 Note: you will get the same result if you start with c, since this model is reversible *The likelihood of the model, given the data.

16 Substitution matrix For nucleotide sequences, there are 16 possible ways to describe substitutions - a 4x4 matrix. P = a b c d e f g h i j k l m n o p Convention dictates that the order of the nucleotides is a,c,g,t Note: for amino acids, the matrix is a 20 x 20 matrix and for codon-based models, the matrix is 61 x 61

17 Substitution matrix - an example P = In this matrix, the probability of an a changing to a c is 0.01 and the probability of a c remaining the same is 0.979, etc. Note: The rows of this matrix sum to 1 - meaning that for every nucleotide, we have covered all the possibilities of what might happen to it. The columns do not sum to anything in particular.

18 To calculate the likelihood of the entire dataset, given a substitution matrix, base composition and a branch length of one "certain evolutionary distance" or "ced" Likelihood of Gene 1: ccat Gene 2: ccgt given P = π=[0.1,0.4,0.2,0.3]

19 Likelihood of a two-sequence alignment ccat ccgt π c P c > c π c P c >c π a P a > g π t P t > t =0.4x0.983x0.4x0.983x0.1x0.007x0.3x0.979 = Likelihood of going from the first to the second sequence is

20 Different Branch Lengths For very short branch lengths, the probability of a character staying the same is high and the probability of it changing is low (for our particular matrix). For longer branch lengths, the probability of character change becomes higher and the probability of staying the same is lower. The previous calculations are based on the assumption that the branch length describes one Certain Evolutionary Distance or CED. If we want to consider a branch length that is twice as long (2 CED), then we can multiply the substitution matrix by itself (matrix 2 ).

21 2 CED model P = = X P = Which gives a likelihood of Note the higher likelihood

22 For 3 CED P 3 = This gives a likelihood of Note that as the branch lengths increase, the values on diagonal decrease and the values on the off-diagonals increase.

23 For higher values of CED units L i k e l i h o o d Branch Length

24 Raising P to a large power If we raise P to a very large power, we find that the ML base composition pops out P 10 6 = So the base composition is built into the probability matrix P.

25 Consider the following equation Rate Matrices 5 4 = exp( 4log(5) ) In the same way, raising a matrix to the power of a number can be calculated by taking the log of the matrix, multiplying it by branch length and taking the exponent of the product. In this way, you can exponentiate the matrix by a number that is not a whole number (e.g or whatever). E.g. The log (natural log base e = ln) of the previous matrix, P is: log P = Note: the sum of each row is zero.

26 log P = This matrix corresponds to one CED. What we want is to derive a matrix so that when we exponentiate it, the values correspond to substitutions per site. We must therefore scale logp so that when the rows of logp are multiplied by π row the off-diagonal elements sum to 1. The resulting scaled logp (called Q), when its exponent is taken gives a P corresponding to 1 substitution per site.

27 Converting to substitutions per For a branch length of v: site e Qv = P(v) If we scale the logp appropriately, we will get a Q matrix. If we multiply this Q matrix by a diagonal matrix of the composition we get a matrix where the off-diagonal elements sum to 1 and the diagonal elements sum to -1.

28 Scaling logp appropriately LogP scaled by a factor of 50 (in this case) Q = Q = (diagonal matrix of the composition) Off-diagonal elements sum to 1, diagonal elements sum to -1 P s generated from this Q will give branch lengths in substitutions per site

29 Separating composition from rates π col If we divide the columns of Q by the composition is separated from the rates. You can then use the exact same rate matrix with different matrices of base composition. For the model we have been using, the rate matrix R is: R = The diagonal elements do not matter and are left out. The model is symmetrical (time reversible).

30 Relationships between R, Q and P matrices R Multiply columns by the composition, scale so that the offdiagonals sum to 1 Q Multiply by branch length, then exponentiate P Divide columns by the composition Log, then scale so that off-diagonals sum to 1 P = substitution matrix Q = scaled log substitution matrix R = rate matrix (defines the model of evolution)

31 Likelihood of the alignment at various branch lengths ccat ccgt The maximum likelihood value is at a branch length of

32 Likelihood of a two-branch tree O 0.1 A 0.2 B O is the origin or root, the numbers represent branch lengths. The likelihood can be calculated in three ways: from A to B in one step (this amounts to the previous method) from A to B in two steps (through O) in two parts starting at O

33 Lesson about O O is an unknown sequence. We can only speculate what each position in the alignment would be if we could observe the sequence of O. What we do know is that the sum of all possibilities is equal to 1. Therefore we must sum the likelihoods for all possibilities of O. This becomes computationally intensive. O 0.1 A For position 1: {a,c,g,t} 0.1 {c} 0.2 B 0.2 {c}

34 A three branch tree A C B 0.2 The tree can be rooted anywhere and the substitutions calculated accordingly. There are many ways of doing this and this is left as an exercise for the student.

35 Increasing the sophistication of models So far, the models we have dealt with assume that change is equally likely at all positions and that the rate of change is constant for the entire duration of the phylogeny. This is not a realistic model for all sequences (it is a neutral model with a constant molecular clock). A B A p CORRECT TREE q q q p p > q D WRONG TREE C D C B

36 Small subunit ribosomal RNA 18S or 16S rrna

37 The molecular clock for alpha-globin: Each point represents the number of substitutions separating each animal from humans number of substitutions cow platypus chicken shark carp Time to common ancestor (millions of years)

38 Invariable sites For a given dataset we can assume that a certain proportion of sites are not free to vary (e.g. purifying selection prevents these sites from changing). We can therefore observe invariable positions either because they are under this selective constraint or because they have not had a chance to vary or because there is homoplasy in the dataset and a reversal (say) has caused the site to appear constant. The likelihood that a site is invariable can be calculated by incorporating this possibility into our model and calculating for every site the likelihood that it is an invariable site.

39 Variable sites Obviously other sites in the dataset are free to vary. Selection intensity on these sites is rarely uniform, so it is desirable to model site-by-site rate variation. This is done in two ways: site specific (codon position, or alpha helix etc.) using a discrete approximation to a continuous distribution (gamma distribution). Again, these variables are modeled over all possibilities of sequence change over all possibilities of branch length over all possibilities of tree topology.

40 The shape of the gamma distribution for different values of alpha

41 Does changing a model affect the outcome? There are different models Jukes and Cantor (JC69): All base compositions equal (0.25 each), rate of change from one base to another is the same Kimura 2-Parameter (K2P): All base compositions equal (0.25 each), different substitution rate for transitions and transversions). Hasegawa-Kishino-Yano (HKY): Like the K2P, but with base composition free to vary. General Time Reversible (GTR): Base composition free to vary, all possible substitutions can differ. All these models can be extended to accommodate invariable sites and site-to-site rate variation.

43 Long-branch attraction (LBA) In the case below, the wrong tree is often selected. ML will not be prone to this problem, if the correct model of sequence evolution is used. A B A p CORRECT TREE q C q q D p p > q D C WRONG TREE B This is often called the Felsenstein-zone.

44 Strengths of ML Does not try to make an observation of sequence change and then a correction for superimposed substitutions. There is no need to correct for anything, the models take care of superimposed substitutions. Accurate branch lengths. Each site has a likelihood. If the model is correct, we should retrieve the correct tree*. You can use a model that fits the data. ML uses all the data (no selection of sites based on informativeness, all sites are informative). ML can not only tell you about the phylogeny of the sequences, but also the process of evolution that led to the observations of today s sequences. *If we have long-enough sequences and a sophisticated-enough model.

45 Models You can use models that: Deal with different transition/transversion ratios. Deal with unequal base composition. Deal with heterogeneity of rates across sites. Deal with heterogeneity of the substitution process (different rates across lineages, different rates at different parts of the tree). The more free parameters, the better your model fits your data (good). The more free parameters, the higher the variance of the estimate (bad). Use a model that fits your data.

46 Over-fitting a model to your data

47 Weaknesses of ML Can be inconsistent if we use models that are not accurate. Model might not be sophisticated enough (you can max-out on models). Very computationally-intensive. Might not be possible to examine all models (substitution matrices, tree topologies, etc.).

48 Recommendations Interact with the data. If you have collected enough data, you might get a good picture of the underlying model of sequence evolution. Use a test of alternative models (implemented in the Modeltest software). Don t just choose a model, use a model that fits your data. Don t over-fit a model to your data.

49 How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth. Sherlock Holmes to Dr. Watson in The Sign of Four, by A. Conan Doyle.

Consensus methods. Strict consensus methods

Consensus methods. Strict consensus methods Consensus methods A consensus tree is a summary of the agreement among a set of fundamental trees There are many consensus methods that differ in: 1. the kind of agreement 2. the level of agreement Consensus