# Lecture Notes: Markov chains

Computational Genomics and Molecular Biology, Fall 2015. Dannie Durand.

At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity function, which assigns a score of $M$ to matches ($M > 0$), $m$ to mismatches ($m < 0$), and $g$ to indels ($g < 0$); and an edit distance, which does not reward matches and assigns a unit cost to mismatches and gaps. These scoring functions allow us to compare two alignments by comparing their scores, but they are less useful for assessing a pairwise alignment in an absolute sense. Given a pair of aligned sequences with a particular collection of matches, mismatches, and indels, does the alignment reflect enough similarity to suggest that it is of biological interest?

One way of assessing an alignment in an absolute sense is to determine whether it reflects more similarity than we would expect by chance. In developing this approach, we must take into account the divergence of related sequences due to mutation. With that in mind, we will explore models of sequence evolution and then discuss how they are used to assess alignments. Sequence evolution models are typically based on Markov chains, so we will begin with a general introduction to Markov models.

## Finite discrete Markov chains

In various computational biology applications, it is useful to track the stochastic variation of a random variable. Here are some examples:

1. For models of sequences evolving by point mutation, the random variable of interest is the nucleotide observed at a fixed position, or site, in the sequence at time $t$. The goal is to characterize how this random variable changes over time.
2. It is also useful to consider how the residues in a sequence change as one moves along the sequence from one site to the next. In this case, the random variable is the amino acid (or nucleotide) at site $i$. We are interested in how the probability of observing a given amino acid at site $i$ depends on the amino acid observed at site $i-1$.

For each of these examples, we can model how the random variable (the nucleotide or amino acid) changes with respect to an independent variable (time or position), using a Markov chain with a finite number of states, $E_1, E_2, \ldots, E_s$. Each state corresponds to one of the possible values of the random variable. In our examples, the states are defined as follows:

1. There are four states, each corresponding to the event of observing one of the four nucleotides at the site of interest, e.g., $E_1 = \text{A}$, $E_2 = \text{C}$, $E_3 = \text{G}$, $E_4 = \text{T}$. In a time-dependent system, such as this one, we say the system is in state $E_j$ at time $t$.
2. Each of the states corresponds to the event of observing a given amino acid; for example, $E_1 = \text{Ala}$, $E_2 = \text{Cys}$, $\ldots$, $E_{20} = \text{Tyr}$. In a spatially varying system, we say the system is in state $E_j$ at site $i$. This is in contrast to the previous example, where time varies and the position, $i$, is held fixed.

The probability that a Markov chain is in state $E_j$ at time $t$ is designated $\phi_j(t)$. The vector $\phi(t) = (\phi_1(t), \phi_2(t), \ldots, \phi_s(t))$ describes the state probability distribution over all states at time $t$. The initial state probability distribution is given by $\phi(0)$. Note that Ewens and Grant use $\pi$ to denote the initial state distribution: $\pi = (\phi_1(0), \phi_2(0), \ldots, \phi_s(0))$.

In order to capture the stochastic variation of the system, we must also define the probability of making a transition from one state to another. The transition probability, $P_{jk}$, is the probability that the chain will be in state $E_k$ at time $t+1$, given that it was in state $E_j$ at the previous time step, $t$. $P$ is an $s \times s$ matrix specifying the probability of making a transition from any state to any other state. The rows of this matrix sum to one ($\sum_k P_{jk} = 1$), since the chain must be in some state at every time step. The columns do not have to add up to one, since nothing guarantees that the chain will arrive at a particular state, $E_k$.

The Markov property states that Markov chains are memoryless.
In other words, the probability that the chain is in state $E_j$ at time $t+1$ depends only on the state at time $t$ and not on the past history of the states visited at times $t-1, t-2, \ldots$

In this course, we will focus on discrete, finite, time-homogeneous Markov chains. These are models with a finite number of states, in which time (or space) is split into discrete steps. The assumption of discrete steps is quite natural for Example 2, because sequences of symbols are inherently discrete, but somewhat artificial for the sequence evolution model in Example 1, since time is continuous. Our models are time-homogeneous, because the transition matrix does not change over time.

Note: To simplify the exposition, we will focus on models where time is the independent variable. However, the framework is more general, and can be used to model variation with respect to other independent variables, such as the position in a sequence.

Reference: W. Ewens and G. Grant, Statistical Methods in Bioinformatics: An Introduction. Springer.

To illustrate these concepts, let us consider a simple example: a drunk is staggering about on a very short railway track with five ties on top of a mesa (a high hill with a flat top and steep sides).

At each time step, the drunk staggers either to the left or to the right. Here state $E_j$ corresponds to the event that the drunk is standing on the $j$th tie, where $0 \le j \le 4$. At each step, the drunk moves to the left or to the right with equal probability, resulting in the following transition probability matrix:

$$
P = \begin{array}{c|ccccc}
 & E_0 & E_1 & E_2 & E_3 & E_4 \\ \hline
E_0 & 1 & 0 & 0 & 0 & 0 \\
E_1 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 & 0 \\
E_2 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \\
E_3 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\
E_4 & 0 & 0 & 0 & 0 & 1
\end{array} \tag{1}
$$

Note that each row sums to one, consistent with the definition of a Markov chain. If the drunk reaches either end of the track (either the 0th or the 4th tie), he falls off the mesa. This model is called a random walk with absorbing boundaries, because once the drunk falls off the mesa, he can never get back on the railroad track. States $E_0$ and $E_4$ are absorbing states. Once the system enters one of these states, it remains in that state forever, since $P_{00} = P_{44} = 1$.

The transition matrix of a Markov chain can be represented as a graph, where the nodes represent states and the edges represent transitions with non-zero probability. For example, the random walk with absorbing boundaries can be modeled like this:

[Figure: state graph of the random walk with absorbing boundaries]

How does the state probability distribution change over time? If we know the state probability distribution at time $t$, the distribution at the next time step is given by

$$
\phi_k(t+1) = \sum_j \phi_j(t)\, P_{jk} \tag{2}
$$

or, in matrix notation,

$$
\phi(t+1) = \phi(t)\, P. \tag{3}
$$
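Equations 2 and 3 translate almost directly into code. The following is a minimal sketch in Python, using exact fractions, of the random walk with absorbing boundaries. The names `step` and `P_ABSORBING` are illustrative choices, not part of the notes, and the starting position on the middle tie is assumed here for illustration.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Transition matrix of the random walk with absorbing boundaries (Equation 1).
# Row j holds the probabilities of moving from state E_j to each state E_k.
P_ABSORBING = [
    [1, 0, 0, 0, 0],
    [half, 0, half, 0, 0],
    [0, half, 0, half, 0],
    [0, 0, half, 0, half],
    [0, 0, 0, 0, 1],
]


def step(phi, P):
    """Return phi(t+1) = phi(t) P, i.e. phi_k(t+1) = sum_j phi_j(t) P_jk."""
    s = len(P)
    return [sum(phi[j] * P[j][k] for j in range(s)) for k in range(s)]


# Every row of a transition matrix must sum to one.
assert all(sum(row) == 1 for row in P_ABSORBING)

phi0 = [0, 0, 1, 0, 0]            # assumed start: the middle tie, E_2
phi1 = step(phi0, P_ABSORBING)    # distribution after one time step
phi2 = step(phi1, P_ABSORBING)    # distribution after two time steps
print([str(x) for x in phi1])
print([str(x) for x in phi2])
```

Exact fractions are used so that the results can be compared term by term with a hand calculation; floating-point numbers would work equally well for larger chains.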

For example, suppose that at time $t = 0$, the drunk is standing on the middle tie; that is, $\phi(0) = (0, 0, 1, 0, 0)$. To obtain the state probability distribution after one time step, we apply Equation 2:

$$
\phi_k(1) = \sum_{j=0}^{4} \phi_j(0)\, P_{jk}.
$$

For example, the probability of being in state $E_1$ when $t = 1$ is given by

$$
\phi_1(1) = \sum_{j=0}^{4} \phi_j(0)\, P_{j1}
= 0 \cdot 0 + 0 \cdot 0 + 1 \cdot \tfrac{1}{2} + 0 \cdot 0 + 0 \cdot 0
= \tfrac{1}{2}. \tag{4}
$$

Note that Equation 4 is equivalent to multiplying the vector $(0, 0, 1, 0, 0)$ by the second column of the matrix in Equation 1. Since the Markov chain is symmetrical, it is easy to show that $\phi_3(1)$ is also equal to $\tfrac{1}{2}$. It is not possible to reach state $E_0$ or state $E_4$ in a single step from state $E_2$, so $\phi_0(1) = \phi_4(1) = 0$. Nor is it possible to remain in state $E_2$ for two consecutive time steps, so $\phi_2(1) = 0$. Since state $E_2$ is the only state with non-zero probability at time $t = 0$, we obtain

$$
\phi(1) = (0, \tfrac{1}{2}, 0, \tfrac{1}{2}, 0).
$$

Now that we have the probability distribution at time $t = 1$, we can calculate the probability distribution at time $t = 2$ using the same procedure:

$$
\phi_k(2) = \sum_{j=0}^{4} \phi_j(1)\, P_{jk}.
$$

The probability of being in state $E_0$ at $t = 2$ is given by

$$
\phi_0(2) = \sum_{j=0}^{4} \phi_j(1)\, P_{j0}
= 0 \cdot 1 + \tfrac{1}{2} \cdot \tfrac{1}{2} + 0 \cdot 0 + \tfrac{1}{2} \cdot 0 + 0 \cdot 0
= \tfrac{1}{4}.
$$

As above, $\phi_4(2) = \phi_0(2)$, because the matrix is symmetrical. The probability of being in state $E_2$ is

$$
\phi_2(2) = \sum_{j=0}^{4} \phi_j(1)\, P_{j2}
= 0 \cdot 0 + \tfrac{1}{2} \cdot \tfrac{1}{2} + 0 \cdot 0 + \tfrac{1}{2} \cdot \tfrac{1}{2} + 0 \cdot 0
= \tfrac{1}{2}.
$$

The probabilities of being in state $E_1$ or $E_3$ at time $t = 2$ are zero. The only states with non-zero probability at $t = 1$ are $E_1$ and $E_3$, and neither has a transition into $E_1$ or $E_3$: $P_{11} = P_{13} = P_{31} = P_{33} = 0$. The probability distribution vector at time $t = 2$ is, therefore,

$$
\phi(2) = (\tfrac{1}{4}, 0, \tfrac{1}{2}, 0, \tfrac{1}{4}). \tag{5}
$$

## Summary of Markov chain notation

- A Markov chain has states $E_1, E_2, \ldots, E_s$ corresponding to the range of the associated random variable.
- $\phi_j(t)$ is the probability that the chain is in state $E_j$ at time $t$. The vector $\phi(t) = (\phi_1(t), \ldots, \phi_s(t))$ is the state probability distribution at time $t$.
- $\pi = \phi(0)$ is the initial state probability distribution.
- $P$ is the transition probability matrix. $P_{jk}$ gives the probability of making a transition to state $E_k$ at time $t+1$, given that the chain was in state $E_j$ at time $t$. The rows of this matrix sum to one: $\sum_k P_{jk} = 1$.
- The state probability distribution at time $t+1$ is given by $\phi(t+1) = \phi(t)\, P$. The probability of being in state $E_k$ at $t+1$ is $\phi_k(t+1) = \sum_j \phi_j(t)\, P_{jk}$.
- The Markov property states that Markov chains are memoryless. The probability that the chain is in state $E_k$ at time $t+1$ depends only on $\phi(t)$ and is independent of $\phi(t-1), \phi(t-2), \phi(t-3), \ldots$

- In this course, we will focus on discrete, finite, time-homogeneous Markov chains. These are models with a finite number of states, in which the independent variable takes on a discrete set of values. In other words, we assume that time (or space) is split into discrete steps. A Markov chain is time-homogeneous if the transition matrix does not change over time.

## Higher order Markov chains

Suppose that we wish to know the state of the system after two time steps. In the previous section, we used Equation 2 to calculate $\phi(1)$, given $\phi(0)$, and then we applied Equation 2 again to calculate $\phi(2)$ from $\phi(1)$. Here we derive a general expression for $\phi(t+2)$ in terms of $\phi(t)$. From Equation 2, we obtain

$$
\phi_l(t+1) = \sum_{j=1}^{s} \phi_j(t)\, P_{jl} \tag{6}
$$

and

$$
\phi_k(t+2) = \sum_{l=1}^{s} \phi_l(t+1)\, P_{lk}. \tag{7}
$$

Substituting the right hand side of Equation 6 for $\phi_l(t+1)$ in Equation 7 yields

$$
\phi_k(t+2) = \sum_{l=1}^{s} \sum_{j=1}^{s} \phi_j(t)\, P_{jl}\, P_{lk}.
$$

We can reverse the order of the summations, since the terms may be added in any order, yielding

$$
\phi_k(t+2) = \sum_{j=1}^{s} \phi_j(t) \left( \sum_{l=1}^{s} P_{jl}\, P_{lk} \right). \tag{8}
$$

The term in the inner summation is simply the element in row $j$ and column $k$ of the matrix obtained by multiplying matrix $P$ by itself. In other words,

$$
P^{(2)}_{jk} = \sum_{l=1}^{s} P_{jl}\, P_{lk},
$$

where $P^{(2)} = P \cdot P$, so that Equation 8 may be rewritten as

$$
\phi_k(t+2) = \sum_{j=1}^{s} \phi_j(t)\, P^{(2)}_{jk}.
$$

Matrix $P^{(2)}$ is the transition matrix of a 2nd order Markov chain that has the same states as the 1st order Markov chain described by $P$. A single time step in $P^{(2)}$ is equivalent to two time steps in $P$.

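Similarly, an $n$th order Markov chain models change after $n$ time steps with a transition probability matrix

$$
P^{(n)} = \underbrace{P \cdot P \cdots P}_{n}.
$$

The $n$th order equivalent of Equation 3 is

$$
\phi(t+n) = \phi(t)\, P^{(n)}.
$$

As an example, let's apply this approach to our 5-state random walk with absorbing boundaries. Recall that the transition matrix for the 1st order random walk is

$$
P = \begin{array}{c|ccccc}
 & E_0 & E_1 & E_2 & E_3 & E_4 \\ \hline
E_0 & 1 & 0 & 0 & 0 & 0 \\
E_1 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 & 0 \\
E_2 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \\
E_3 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\
E_4 & 0 & 0 & 0 & 0 & 1
\end{array}
$$

Multiplying $P$ by itself yields the 2nd order transition matrix, $P^{(2)}$:

$$
P^{(2)} = \begin{array}{c|ccccc}
 & E_0 & E_1 & E_2 & E_3 & E_4 \\ \hline
E_0 & 1 & 0 & 0 & 0 & 0 \\
E_1 & \tfrac{1}{2} & \tfrac{1}{4} & 0 & \tfrac{1}{4} & 0 \\
E_2 & \tfrac{1}{4} & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{4} \\
E_3 & 0 & \tfrac{1}{4} & 0 & \tfrac{1}{4} & \tfrac{1}{2} \\
E_4 & 0 & 0 & 0 & 0 & 1
\end{array}
$$

The state probability distribution at $t = 2$ can be calculated by applying $P^{(2)}$ to $\phi(0)$:

$$
\phi(2) = \phi(0)\, P^{(2)} = (0, 0, 1, 0, 0)\, P^{(2)} = (\tfrac{1}{4}, 0, \tfrac{1}{2}, 0, \tfrac{1}{4}).
$$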
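The matrix product above can also be checked mechanically. The sketch below, again in plain Python with exact fractions, computes $P^{(n)}$ by repeated multiplication and confirms that one step of $P^{(2)}$ gives the same distribution as two single steps of $P$. The helper names `mat_mul`, `mat_pow`, and `apply_matrix` are assumptions made for this illustration.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Random walk with absorbing boundaries, states E_0 .. E_4.
P = [
    [1, 0, 0, 0, 0],
    [half, 0, half, 0, 0],
    [0, half, 0, half, 0],
    [0, 0, half, 0, half],
    [0, 0, 0, 0, 1],
]


def mat_mul(A, B):
    """Matrix product: (AB)_jk = sum_l A_jl B_lk."""
    n = len(A)
    return [[sum(A[j][l] * B[l][k] for l in range(n)) for k in range(n)]
            for j in range(n)]


def mat_pow(P, n):
    """Return P^(n), the product of n copies of P (n >= 1)."""
    result = P
    for _ in range(n - 1):
        result = mat_mul(result, P)
    return result


def apply_matrix(phi, M):
    """Row vector times matrix: phi M."""
    n = len(M)
    return [sum(phi[j] * M[j][k] for j in range(n)) for k in range(n)]


P2 = mat_pow(P, 2)
phi0 = [0, 0, 1, 0, 0]
phi2 = apply_matrix(phi0, P2)

# One step of P^(2) equals two single steps of P.
assert phi2 == apply_matrix(apply_matrix(phi0, P), P)

for row in P2:
    print([str(x) for x in row])
print([str(x) for x in phi2])
```

Calling `mat_pow(P, n)` with a larger `n` gives the transition matrix for any number of steps; for long horizons, repeated squaring would need fewer multiplications than this simple loop.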

Try this matrix multiplication to convince yourself that this is correct. Note that this is the same result as Equation 5, which we got by applying the first order Markov chain twice.

## Periodic Markov chains

Let us consider a second example. In order to save the drunk from an early death, we introduce a random walk with reflecting boundaries. At each step, the drunk moves to the left or to the right with equal probability. When the drunk reaches one of the boundary states ($E_0$ or $E_4$), he returns to the adjacent state ($E_1$ or $E_3$) at the next step, with probability one. This yields the following transition probability matrix:

$$
P = \begin{array}{c|ccccc}
 & E_0 & E_1 & E_2 & E_3 & E_4 \\ \hline
E_0 & 0 & 1 & 0 & 0 & 0 \\
E_1 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 & 0 \\
E_2 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \\
E_3 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\
E_4 & 0 & 0 & 0 & 1 & 0
\end{array}
$$

The random walk with reflecting boundaries can be represented graphically like this:

[Figure: state graph of the random walk with reflecting boundaries]

Suppose that the drunk starts out on the middle tie at $t = 0$, as before. That is, the initial state probability distribution is $\phi(0) = (0, 0, 1, 0, 0)$. The state distributions for the first two time steps are the same in both random walk models, namely

$$
\phi(1) = (0, \tfrac{1}{2}, 0, \tfrac{1}{2}, 0), \qquad
\phi(2) = (\tfrac{1}{4}, 0, \tfrac{1}{2}, 0, \tfrac{1}{4}).
$$
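These distributions, and the ones that follow them, can be reproduced by iterating $\phi(t+1) = \phi(t)\,P$ for the reflecting walk. The sketch below is one way to do that; `P_REFLECTING`, `step`, and `history` are names chosen for this illustration, and six steps is an arbitrary cutoff.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Random walk with reflecting boundaries: from E_0 the drunk always moves
# to E_1, and from E_4 he always moves to E_3.
P_REFLECTING = [
    [0, 1, 0, 0, 0],
    [half, 0, half, 0, 0],
    [0, half, 0, half, 0],
    [0, 0, half, 0, half],
    [0, 0, 0, 1, 0],
]


def step(phi, P):
    """Return phi(t+1) = phi(t) P."""
    n = len(P)
    return [sum(phi[j] * P[j][k] for j in range(n)) for k in range(n)]


phi = [0, 0, 1, 0, 0]      # start on the middle tie
history = [phi]            # history[t] holds phi(t)
for _ in range(6):
    phi = step(phi, P_REFLECTING)
    history.append(phi)

for t, dist in enumerate(history):
    print(t, [str(x) for x in dist])
```

Printing the whole history, rather than only the final vector, makes it easy to compare the distributions at successive odd and even time steps.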

The agreement over the first two steps makes sense because the two random walk models differ only in the boundary states, $E_0$ and $E_4$, and $\phi_0(t) = \phi_4(t) = 0$ when $t = 0$ or $t = 1$. We calculate the state probability distribution at $t = 3$ by multiplying the vector $\phi(2)$ with the matrix $P$:

$$
\phi(3) = \phi(2)\, P = (\tfrac{1}{4}, 0, \tfrac{1}{2}, 0, \tfrac{1}{4})\, P = (0, \tfrac{1}{2}, 0, \tfrac{1}{2}, 0).
$$

This demonstrates that the state probability distribution at time $t = 3$ is the same as the distribution at time $t = 1$. In other words, $\phi(3) = \phi(1)$. Similarly, $\phi(4) = \phi(2)$, as can be seen from the following calculation:

$$
\phi(4) = \phi(3)\, P = (0, \tfrac{1}{2}, 0, \tfrac{1}{2}, 0)\, P = (\tfrac{1}{4}, 0, \tfrac{1}{2}, 0, \tfrac{1}{4}).
$$

From this we can see that the state probability distribution will be $(0, \tfrac{1}{2}, 0, \tfrac{1}{2}, 0)$ at all odd time steps and $(\tfrac{1}{4}, 0, \tfrac{1}{2}, 0, \tfrac{1}{4})$ at all even time steps from $t = 2$ onward. Thus, the random walk with reflecting boundaries is a periodic Markov chain. A Markov chain is periodic if there is some state that can only be visited in multiples of $m$ time steps, where $m > 1$. We do not require periodic Markov chains for modeling sequence evolution and will only consider aperiodic Markov chains going forward.

## Stationary distributions

A state probability distribution, $\phi$, that satisfies the equation

$$
\phi = \phi\, P \tag{9}
$$

is called a stationary distribution. A key question for a given Markov chain is whether such a stationary distribution exists. Equation 9 is equivalent to a system of $s$ equations in $s$ unknowns. One way to determine the stationary distribution is to solve that system of equations. The stationary distribution can also be obtained using matrix algebra, but that approach is beyond the scope of this course.

For the random walk with reflecting boundaries started on the middle tie, the state distribution never settles into a stationary distribution, since every state with non-zero probability at time $t$ has zero probability at time $t+1$. The random walk with absorbing boundaries does have a stationary distribution, but it is not unique. Both $(1, 0, 0, 0, 0)$ and $(0, 0, 0, 0, 1)$ are stationary distributions of the random walk with absorbing boundaries.

For the rest of this course, we will concern ourselves only with aperiodic Markov chains that do not have absorbing states. In fact, we will make an even stronger assumption and restrict our consideration to Markov chains in which every state can be reached from every other state, either directly or via a series of intermediate states. If a finite Markov chain is aperiodic and connected in this way, it has a unique stationary distribution. We will not attempt to prove this or even to state the theorem in a rigorous way; that is beyond the scope of this class. For those who are interested, a very nice treatment can be found in Chapter 15 of Probability Theory and its Applications (Volume I) by William Feller (John Wiley & Sons).

As an example of a Markov chain with a unique stationary distribution, we introduce a third random walk model that has neither absorbing nor reflecting boundaries. In this model, if the drunk is in one of the boundary states ($E_0$ or $E_4$) at time $t$, then at time $t+1$ he remains in the boundary state with a probability of 0.5 or returns to the adjacent state ($E_1$ or $E_3$) with a probability of 0.5. This results in the following state transition matrix,

$$
P = \begin{array}{c|ccccc}
 & E_0 & E_1 & E_2 & E_3 & E_4 \\ \hline
E_0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 & 0 \\
E_1 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 & 0 \\
E_2 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \\
E_3 & 0 & 0 & \tfrac{1}{2} & 0 & \tfrac{1}{2} \\
E_4 & 0 & 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2}
\end{array}
$$

which can be represented graphically like this:

[Figure: state graph of the third random walk model]

We can determine the stationary state distribution for this random walk model by substituting this transition matrix into Equation 9. The probability of being in state $E_0$ is

$$
\phi_0 = \sum_{j=0}^{4} \phi_j P_{j0}
= \phi_0 P_{00} + \phi_1 P_{10} + \phi_2 P_{20} + \phi_3 P_{30} + \phi_4 P_{40}.
$$

This reduces to $\phi_0 = \tfrac{1}{2}\phi_0 + \tfrac{1}{2}\phi_1$, since $P_{20}$, $P_{30}$ and $P_{40}$ are all equal to zero. The other stationary state probabilities are derived similarly, yielding

$$
\begin{align}
\phi_0 &= \tfrac{1}{2}\phi_0 + \tfrac{1}{2}\phi_1 \tag{10} \\
\phi_1 &= \tfrac{1}{2}\phi_0 + \tfrac{1}{2}\phi_2 \tag{11} \\
\phi_2 &= \tfrac{1}{2}\phi_1 + \tfrac{1}{2}\phi_3 \tag{12} \\
\phi_3 &= \tfrac{1}{2}\phi_2 + \tfrac{1}{2}\phi_4 \tag{13} \\
\phi_4 &= \tfrac{1}{2}\phi_3 + \tfrac{1}{2}\phi_4. \tag{14}
\end{align}
$$

In addition, the stationary state probabilities must sum to 1, imposing an additional constraint:

$$
\phi_0 + \phi_1 + \phi_2 + \phi_3 + \phi_4 = 1. \tag{15}
$$

The model has a stationary distribution if the above equations have a solution. By repeated substitution, it is possible to show that Equations 10 through 14 reduce to $\phi_0 = \phi_1 = \phi_2 = \phi_3 = \phi_4$. (Do the algebra to convince yourself that this is true.) Applying the constraint in Equation 15, we see that $\phi = (0.2, 0.2, 0.2, 0.2, 0.2)$ is the unique solution to the above equations.

In this example, we found a unique solution to Equations 10 through 15, demonstrating that our third random walk has a unique stationary distribution. Solving Equation 9 is a general approach to finding the stationary distribution. Alternatively, if we know the stationary state distribution, or have an educated guess, it is sufficient to verify that it indeed satisfies Equation 9. For example, it is easy to verify that

$$
(0.2, 0.2, 0.2, 0.2, 0.2)\, P = (0.2, 0.2, 0.2, 0.2, 0.2).
$$
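The verification suggested above takes only a few lines. The sketch below checks Equation 9 exactly for the uniform distribution, and then, as a separate experiment that is not part of the notes, iterates the chain in floating point from the middle tie to see where the distribution ends up. The names `P_THIRD`, `step`, and `is_stationary`, and the choice of 200 iterations, are assumptions made for illustration.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Third random walk: at a boundary the drunk stays put or steps back,
# each with probability 1/2.
P_THIRD = [
    [half, half, 0, 0, 0],
    [half, 0, half, 0, 0],
    [0, half, 0, half, 0],
    [0, 0, half, 0, half],
    [0, 0, 0, half, half],
]


def step(phi, P):
    """Return phi(t+1) = phi(t) P."""
    n = len(P)
    return [sum(phi[j] * P[j][k] for j in range(n)) for k in range(n)]


def is_stationary(phi, P):
    """True if phi is a probability distribution satisfying phi = phi P."""
    return sum(phi) == 1 and step(phi, P) == phi


uniform = [Fraction(1, 5)] * 5
print(is_stationary(uniform, P_THIRD))

# Numerical experiment: start on the middle tie and iterate many steps.
P_float = [[float(x) for x in row] for row in P_THIRD]
phi = [0.0, 0.0, 1.0, 0.0, 0.0]
for _ in range(200):
    phi = step(phi, P_float)
print([round(x, 6) for x in phi])
```

The exact check uses fractions so that equality is meaningful; the iterative experiment uses floats, where a small tolerance should be used instead of exact comparison.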

## Markov models of sequence evolution

Now that we have the Markov chain machinery under our belts, let's return to the question of modeling sequence evolution. The process of substitution at a single site in a nucleotide sequence can be modeled as a Markov chain, where each state represents a single nucleotide and the transition probability, $P_{jk}$, is the probability that nucleotide $j$ will be replaced by nucleotide $k$ in one time step. Similarly, Markov chains can be constructed to model the evolution of amino acid sequences.

Although in principle Markov models of sequence evolution are general and can be applied to nucleotide sequences and to amino acid sequences in exactly the same way, in practice working with a twenty-letter alphabet poses challenges that do not arise with a four-letter alphabet. In addition, the biophysical properties of the amino acids are more varied than those of the nucleotides. For these reasons, the Markov chain framework is applied somewhat differently in amino acid sequence models. For the moment, we will focus on nucleotide models and postpone amino acid models until later in the course.

Markov models of sequence substitution are used to answer a wide range of questions that arise in molecular evolution, including correcting for multiple substitutions at the same site, simulating sequence evolution, estimating rates of evolution, deriving substitution scoring matrices, and estimating the likelihood of observing a pair of aligned nucleotides, given a phylogenetic model.

The simplest Markov model of sequence evolution for DNA is the Jukes-Cantor model (Jukes and Cantor, 1969), which assumes that all substitutions (A → C, A → G, A → T, C → A, ...) are equally probable and occur at a rate, $\alpha$. The consequence of this assumption is that the overall rate of substitution is $\lambda = 3\alpha$. That is, $\lambda$ is the probability that a given nucleotide will be replaced by some other nucleotide in one time step. The probability that the nucleotide remains unchanged is $1 - 3\alpha$. The transition probability matrix for this Markov model is:

$$
\begin{array}{c|cccc}
 & \text{A} & \text{G} & \text{C} & \text{T} \\ \hline
\text{A} & 1 - 3\alpha & \alpha & \alpha & \alpha \\
\text{G} & \alpha & 1 - 3\alpha & \alpha & \alpha \\
\text{C} & \alpha & \alpha & 1 - 3\alpha & \alpha \\
\text{T} & \alpha & \alpha & \alpha & 1 - 3\alpha
\end{array}
$$

Reference: Jukes and Cantor, Evolution of protein molecules. In H. N. Munro (ed.), Mammalian Protein Metabolism, 21-132, Academic Press, NY, 1969.

This model can be represented graphically like this:

[Figure: state graph of the Jukes-Cantor model]

The stationary distribution of this Markov chain is $\phi = (0.25, 0.25, 0.25, 0.25)$. (Verify that this is so using Equation 9.) The rate, $\alpha$, of each possible substitution is an explicit parameter of the Jukes-Cantor model. The frequencies of A's, G's, C's and T's are implicitly specified by the model, since they are determined by the stationary distribution. The assumption that all four bases have the same frequency ($\phi_A = \phi_C = \phi_G = \phi_T$) is unlikely to hold in most data sets.

Nucleotide substitution models can be made more realistic in two directions. First, the assumption that all substitutions occur at the same rate can be relaxed. Second, the specification of the rates can be adjusted to yield a non-uniform stationary distribution.

Non-uniform transition probabilities: The Kimura 2 Parameter (K2P) model assumes that transitions and transversions occur at different rates. A transition is the substitution of a purine for another purine or a pyrimidine for another pyrimidine. A transversion is the substitution of a purine for a pyrimidine or a pyrimidine for a purine. Recall that the pyrimidines, including cytosine and thymine, are nucleotides with a six-membered ring. The purines, including adenine and guanine, have a pyrimidine ring fused to a five-membered imidazole ring. It makes sense that transversions would proceed at a different rate than transitions, since substituting a purine with a pyrimidine, or vice versa, involves a greater change in size and shape than a substitution of two nucleotides from the same class. The transition matrix for the K2P model is

$$
\begin{array}{c|cccc}
 & \text{A} & \text{G} & \text{C} & \text{T} \\ \hline
\text{A} & 1 - \alpha - 2\beta & \alpha & \beta & \beta \\
\text{G} & \alpha & 1 - \alpha - 2\beta & \beta & \beta \\
\text{C} & \beta & \beta & 1 - \alpha - 2\beta & \alpha \\
\text{T} & \beta & \beta & \alpha & 1 - \alpha - 2\beta
\end{array}
$$

where $\alpha$ is the rate of transitions and $\beta$ is the rate of transversions.
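As a concrete illustration, the K2P matrix can be built programmatically from the purine/pyrimidine classification, with Jukes-Cantor recovered as the special case $\alpha = \beta$. This is a minimal sketch: the function name `k2p_matrix` and the numerical values of `alpha` and `beta` are hypothetical, chosen only so that the rows are valid probability distributions.

```python
from fractions import Fraction

BASES = ["A", "G", "C", "T"]   # row and column order used in the matrices above
PURINES = {"A", "G"}           # C and T are the pyrimidines


def k2p_matrix(alpha, beta):
    """Kimura 2-parameter transition matrix for one time step.

    alpha: probability of each transition (within purines or within pyrimidines)
    beta:  probability of each transversion (between the two classes)
    """
    P = []
    for x in BASES:
        row = []
        for y in BASES:
            if x == y:
                row.append(1 - alpha - 2 * beta)   # nucleotide unchanged
            elif (x in PURINES) == (y in PURINES):
                row.append(alpha)                  # same class: transition
            else:
                row.append(beta)                   # different class: transversion
        P.append(row)
    return P


def step(phi, P):
    """Return phi P for a row vector phi."""
    n = len(P)
    return [sum(phi[j] * P[j][k] for j in range(n)) for k in range(n)]


alpha, beta = Fraction(1, 10), Fraction(1, 20)   # hypothetical values
K2P = k2p_matrix(alpha, beta)
JC = k2p_matrix(alpha, alpha)   # equal rates: diagonal becomes 1 - 3*alpha

uniform = [Fraction(1, 4)] * 4
for name, M in (("K2P", K2P), ("JC", JC)):
    assert all(sum(row) == 1 for row in M)       # rows are distributions
    print(name, "uniform is stationary:", step(uniform, M) == uniform)
```

Both matrices are symmetric, so their columns also sum to one, which is why the uniform distribution satisfies Equation 9 for any valid choice of the rates.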

Non-uniform stationary distributions: Like the Jukes-Cantor model, the K2P model has a uniform stationary distribution, $\phi = (0.25, 0.25, 0.25, 0.25)$. Work through the algebra to convince yourself that this is true. However, this is not a realistic model for the many genomes in which the G+C content deviates from 50%. The Felsenstein (1981) model, like the Jukes-Cantor model, assumes that all substitutions are equally likely, but can model an arbitrary stationary distribution, $\phi = (\phi_A, \phi_C, \phi_G, \phi_T)$, in which the four frequencies need not be equal. The Felsenstein model transition matrix is

$$
\begin{array}{c|cccc}
 & \text{A} & \text{G} & \text{C} & \text{T} \\ \hline
\text{A} & 1 - \alpha(\phi_C + \phi_G + \phi_T) & \phi_G\,\alpha & \phi_C\,\alpha & \phi_T\,\alpha \\
\text{G} & \phi_A\,\alpha & 1 - \alpha(\phi_A + \phi_C + \phi_T) & \phi_C\,\alpha & \phi_T\,\alpha \\
\text{C} & \phi_A\,\alpha & \phi_G\,\alpha & 1 - \alpha(\phi_A + \phi_G + \phi_T) & \phi_T\,\alpha \\
\text{T} & \phi_A\,\alpha & \phi_G\,\alpha & \phi_C\,\alpha & 1 - \alpha(\phi_A + \phi_C + \phi_G)
\end{array}
$$

The Hasegawa, Kishino, Yano (HKY) model, which combines both innovations, allows different rates for transitions and transversions and an arbitrary stationary distribution, $\phi = (\phi_A, \phi_C, \phi_G, \phi_T)$:

$$
\begin{array}{c|cccc}
 & \text{A} & \text{G} & \text{C} & \text{T} \\ \hline
\text{A} & 1 - \alpha\phi_G - \beta(\phi_C + \phi_T) & \phi_G\,\alpha & \phi_C\,\beta & \phi_T\,\beta \\
\text{G} & \phi_A\,\alpha & 1 - \alpha\phi_A - \beta(\phi_C + \phi_T) & \phi_C\,\beta & \phi_T\,\beta \\
\text{C} & \phi_A\,\beta & \phi_G\,\beta & 1 - \alpha\phi_T - \beta(\phi_A + \phi_G) & \phi_T\,\alpha \\
\text{T} & \phi_A\,\beta & \phi_G\,\beta & \phi_C\,\alpha & 1 - \alpha\phi_C - \beta(\phi_A + \phi_G)
\end{array}
$$

The General Time Reversible (GTR) model is an even more general model that allows a different rate for each of the six possible substitutions and an arbitrary stationary distribution. These models are discussed in greater detail in various molecular evolution textbooks; see, for example, Li's Molecular Evolution (Sinauer Associates, 1997).

In deciding which model to use for a particular data set, we face the usual tradeoff: more general models with more parameters can provide a more accurate representation of the underlying evolutionary process. However, more complex models require more data to infer the parameter values, and the danger of overfitting the parameters is greater.

Analyses of alignments of present-day sequences suggest that, in many sequence families, the rate of change varies from site to site. This is typically addressed by assuming that sequence substitution in a given family can be modeled by a single model with a small number of rate categories. For example, one might model substitution in a given family using the Jukes-Cantor model with four rates, $(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$. For each site, $i$, maximum likelihood estimation is used to estimate probabilities $(p_1(i), p_2(i), p_3(i), p_4(i))$, where $p_r(i)$ is the probability that site $i$ is evolving at rate $\alpha_r$. Ziheng Yang discusses this approach in his textbook Computational Molecular Evolution (Oxford University Press, 2006).

In addition, these models do not allow for changes in rate or in GC-content over time. Developing models to account for temporal changes in rate or nucleotide composition is currently an active area of research.
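Returning to the substitution matrices, the claim that the HKY model accommodates an arbitrary stationary distribution can be checked numerically with Equation 9. The sketch below builds the HKY matrix from its definition and treats the Felsenstein (1981) matrix as the special case $\alpha = \beta$. The function name `hky_matrix`, the base frequencies, and the rate values are all hypothetical, chosen only for illustration.

```python
from fractions import Fraction

BASES = ["A", "G", "C", "T"]   # row and column order used in the notes
PURINES = {"A", "G"}           # C and T are the pyrimidines


def hky_matrix(freqs, alpha, beta):
    """HKY transition matrix for one time step.

    freqs: dict mapping each base to its target stationary frequency
    alpha: rate parameter for transitions
    beta:  rate parameter for transversions
    """
    P = []
    for x in BASES:
        row = {}
        for y in BASES:
            if y != x:
                same_class = (x in PURINES) == (y in PURINES)
                rate = alpha if same_class else beta
                row[y] = freqs[y] * rate       # off-diagonal: phi_y times rate
        row[x] = 1 - sum(row.values())         # diagonal makes the row sum to one
        P.append([row[y] for y in BASES])
    return P


def step(phi, P):
    """Return phi P for a row vector phi."""
    n = len(P)
    return [sum(phi[j] * P[j][k] for j in range(n)) for k in range(n)]


# Hypothetical AT-rich base composition and rates.
freqs = {"A": Fraction(3, 10), "G": Fraction(1, 5),
         "C": Fraction(1, 5), "T": Fraction(3, 10)}
alpha, beta = Fraction(1, 5), Fraction(1, 10)

HKY = hky_matrix(freqs, alpha, beta)
F81 = hky_matrix(freqs, alpha, alpha)   # equal rates: Felsenstein (1981)

phi = [freqs[b] for b in BASES]
print("HKY stationary:", step(phi, HKY) == phi)
print("F81 stationary:", step(phi, F81) == phi)
```

Computing each diagonal entry as one minus the sum of the off-diagonal entries in its row keeps every row a valid distribution by construction, provided the rates are small enough that the diagonal stays non-negative.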

### Lecture Notes: Markov chains Tuesday, September 16 Dannie Durand

Computational Genomics and Molecular Biology, Lecture Notes: Markov chains Tuesday, September 6 Dannie Durand In the last lecture, we introduced Markov chains, a mathematical formalism for modeling how

### Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

### Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral

### Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

### Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 24. Phylogeny methods, part 4 (Models of DNA and

### Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 4 (Models of DNA and

### Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

### Evolutionary Models. Evolutionary Models

Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

### BLAST: Target frequencies and information content Dannie Durand

Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

### Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood For: Prof. Partensky Group: Jimin zhu Rama Sharma Sravanthi Polsani Xin Gong Shlomit klopman April. 7. 2003 Table of Contents Introduction...3

### Quantifying sequence similarity

Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

### Mutation models I: basic nucleotide sequence mutation models

Mutation models I: basic nucleotide sequence mutation models Peter Beerli September 3, 009 Mutations are irreversible changes in the DNA. This changes may be introduced by chance, by chemical agents, or

### Taming the Beast Workshop

Workshop David Rasmussen & arsten Magnus June 27, 2016 1 / 31 Outline of sequence evolution: rate matrices Markov chain model Variable rates amongst different sites: +Γ Implementation in BES2 2 / 31 genotype

### Lecture 3: Markov chains.

1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.

### Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36

### Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

### Evolutionary Analysis of Viral Genomes

University of Oxford, Department of Zoology Evolutionary Biology Group Department of Zoology University of Oxford South Parks Road Oxford OX1 3PS, U.K. Fax: +44 1865 271249 Evolutionary Analysis of Viral

### Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

### How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

How should we go about modeling this? gorilla GAAGTCCTTGAGAAATAAACTGCACACACTGG orangutan GGACTCCTTGAGAAATAAACTGCACACACTGG Model parameters? Time Substitution rate Can we observe time or subst. rate? What

### Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

### Motivating the need for optimal sequence alignments...

1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

### 6.842 Randomness and Computation March 3, Lecture 8
