Lecture 3: Markov chains.


1 BIOINFORMATIK II: PROBABILITY & STATISTICS. Summer semester 2008, the University of Zürich and ETH Zürich. Lecture 3: Markov chains. Prof. Andrew Barbour, Dr. Nicolas Pétrélis. Adapted from a course by Dr. D. Schuhmacher & Dr. D. Svensson.

2 Random processes. A random process (or stochastic process) in discrete time is a sequence of random variables $X_0, X_1, X_2, X_3, \dots$ They are usually dependent. Typically, the process describes something that evolves in time: $X_i$ = the state of the process at time $i$.

3 Examples of random processes: Ex. 1: Let $X_0, X_1, X_2, \dots$ be independent random variables, $X_i \sim \mathrm{Bin}(100, 0.25)$ for $i \ge 0$. No dependence at all, and a trivial process. Ex. 2: Let $X_0 = 0$, and set $X_i = X_{i-1} + Z_i$ for $i \ge 1$, where $Z_1, Z_2, \dots$ are independent random variables with $P(Z_i = -1) = P(Z_i = +1) = 0.5$. Dependence? Given the history $X_0, X_1, \dots, X_n$ of the process (including the current state $X_n$), the future $X_{n+1}, X_{n+2}, \dots$ depends on the history only through the current state $X_n$. Ex. 3: Let $X_0 \sim \mathrm{Bin}(100, 0.25)$ and let $X_i := X_0$ for all $i \ge 1$. This is the strongest possible dependence (but a trivial process). There are infinitely many examples...
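Ex. 2 is a symmetric random walk. A minimal simulation sketch in Python (NumPy assumed; variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Ex. 2: X_0 = 0 and X_i = X_{i-1} + Z_i, where the Z_i are i.i.d.
# with P(Z_i = -1) = P(Z_i = +1) = 0.5.
n_steps = 10
Z = rng.choice([-1, +1], size=n_steps)   # the i.i.d. increments
X = np.concatenate(([0], np.cumsum(Z)))  # the path X_0, X_1, ..., X_n

print(X)  # e.g. [ 0  1  2  1  0 ... ]
```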

4 Markov chains (MC), overview: Markov chains are of high importance in bioinformatics. A Markov chain is a particular type of random process, typically evolving in time. At each time point the Markov chain visits one of a number of possible states. A Markov chain is memoryless in the sense that only the current state of the process matters for the future.

5 $(X_j)_{j=0}^{\infty}$ = Markov chain. At each time point the MC visits a state. [Figure: sample path of a chain with states 1-5 on the vertical axis and time 0-10 on the horizontal axis; e.g. $X_0 = 1$, $X_1 = 2$, $X_2 = 1$, $X_3 = 1$, $X_7 = 3$.] Definition: A Markov chain is a time-homogeneous Markovian random process which takes values in a state space $S$.

6 Markovian means: At each time, only the current state is important for the future: $$P(X_{n+1} = i_{n+1} \mid X_0 = i_0, X_1 = i_1, \dots, X_n = i_n) = P(X_{n+1} = i_{n+1} \mid X_n = i_n).$$ (This is not true for general random processes!) Time-homogeneous means: The probability that a Markov chain jumps in one time step from state $i$ to state $j$ does not depend on time: $$P(X_{n+1} = j \mid X_n = i) = P(X_1 = j \mid X_0 = i).$$

7 Why Markov chains? They are intuitive, mathematically tractable, a well-studied topic, and have many good applications. Applications: sequence evolution, scoring matrices for sequence alignments, phylogenetic tree reconstruction, and much more!

8 The states visited are not necessarily numerical. They can e.g. be a, c, g and t. The concept of time can be replaced by "space": e.g. position in a sequence. ...cacagctagctacgactacgacacttttattactttattatatcagcgc... Example: Markov chains as a model for DNA. Let $X_j$ be equal to the nucleotide present at position $j$ (counted from left to right) in a gene, say, for $j = 1, 2, \dots, N$. An i.i.d. random sequence of a, c, g and t's is unlikely to be a good model for the nucleotide pattern in a gene sequence. A Markov chain on {a, c, g, t} might be a better approximation to reality: the probabilities for the nucleotide at position $j + 1$ depend upon the nucleotide at position $j$. (However, in real data, more complex dependencies are found.)
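Once a transition matrix is fixed, such a chain is easy to simulate. A sketch with a made-up transition matrix (the numerical values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

states = "acgt"
# Hypothetical transition matrix: row i is the distribution of the next
# nucleotide given that the current one is states[i]; each row sums to 1.
P = np.array([
    [0.30, 0.20, 0.30, 0.20],  # from a
    [0.25, 0.25, 0.25, 0.25],  # from c
    [0.20, 0.30, 0.20, 0.30],  # from g
    [0.25, 0.25, 0.25, 0.25],  # from t
])

def simulate(n, start=0):
    """Simulate n positions of the chain, starting from state index `start`."""
    seq = [start]
    for _ in range(n - 1):
        seq.append(rng.choice(4, p=P[seq[-1]]))
    return "".join(states[i] for i in seq)

print(simulate(40))  # e.g. 'agctgacg...'
```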

9 Transition matrix. Let $X_0, X_1, X_2, X_3, \dots$ be a Markov chain with state space $S$, for example $S = \{a, c, g, t\}$. To any Markov chain, a transition matrix $P$ is associated. In the case with four possible states, $P$ is $4 \times 4$: $$P = \begin{pmatrix} p_{a,a} & p_{a,c} & p_{a,g} & p_{a,t} \\ p_{c,a} & p_{c,c} & p_{c,g} & p_{c,t} \\ p_{g,a} & p_{g,c} & p_{g,g} & p_{g,t} \\ p_{t,a} & p_{t,c} & p_{t,g} & p_{t,t} \end{pmatrix}.$$ Here $p_{i,j} = P(X_{n+1} = j \mid X_n = i)$ for $n \ge 0$, where $i, j \in \{a, c, g, t\}$.

10 The ("one-step") transition matrix $P$ contains all the information needed to compute any multi-step transition probabilities to future states: The two-step probabilities $p^{(2)}_{ij} := P(X_{n+2} = j \mid X_n = i)$ (where $i, j \in S$) are given by the matrix $P^{(2)} = P \cdot P$ (matrix multiplication!). E.g. with only two states (say, $S = \{0, 1\}$): $$P^{(2)} = \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} \begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix} = \begin{pmatrix} p_{00}^2 + p_{01} p_{10} & p_{00} p_{01} + p_{01} p_{11} \\ p_{10} p_{00} + p_{11} p_{10} & p_{10} p_{01} + p_{11}^2 \end{pmatrix}.$$

11 How to derive the two-step transition matrix $P^{(2)}$? Let $i, j \in S$. Then $$P(X_{n+2} = j \mid X_n = i) = \sum_{k \in S} P(X_{n+1} = k, X_{n+2} = j \mid X_n = i) = \sum_{k \in S} \frac{P(X_n = i, X_{n+1} = k, X_{n+2} = j)}{P(X_n = i)}$$ $$= \sum_{k \in S} \frac{P(X_n = i, X_{n+1} = k, X_{n+2} = j)}{P(X_n = i, X_{n+1} = k)} \cdot \frac{P(X_n = i, X_{n+1} = k)}{P(X_n = i)} = \sum_{k \in S} P(X_{n+2} = j \mid X_n = i, X_{n+1} = k)\, P(X_{n+1} = k \mid X_n = i)$$ $$= \sum_{k \in S} P(X_{n+2} = j \mid X_{n+1} = k)\, P(X_{n+1} = k \mid X_n = i) = \sum_{k \in S} p_{kj}\, p_{ik},$$ which is the entry at position $(i, j)$ in the matrix $P^2 = P \cdot P$.

12 m-step transition probabilities: In general, the m-step transition probability $p^{(m)}_{ij} := P(X_{n+m} = j \mid X_n = i)$ is given as the entry at position $(i, j)$ in the matrix $P^m$ (the m-th matrix power of $P$). In other words: if we denote by $P^{(m)}$ the matrix of the m-step transition probabilities, then $$P^{(m)} = \underbrace{P \cdot P \cdots P}_{m \text{ times}} =: P^m.$$
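A quick numerical check of this fact, as a sketch with a made-up 2-state matrix:

```python
import numpy as np

# Hypothetical 2-state transition matrix; rows sum to 1.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# m-step transition matrix P^(m) = P^m (m-th matrix power).
m = 3
P_m = np.linalg.matrix_power(P, m)

# Entry (0, 1): probability of being in state 1 three steps
# after starting in state 0.
print(P_m)
print(P_m[0, 1])
```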

13 Initial distribution. Typically, in a Markov chain, the first state $X_0$ is also random, and drawn from some initial probability distribution. Let $\lambda_j = P(X_0 = j)$ for $j \in S$. E.g. for $S = \{a, c, g, t\}$, the initial probability distribution can be represented as a row vector $\lambda = (\lambda_a, \lambda_c, \lambda_g, \lambda_t)$. The probabilities for the states at time 1 are $$\big( P_\lambda(X_1 = a), P_\lambda(X_1 = c), P_\lambda(X_1 = g), P_\lambda(X_1 = t) \big) = \lambda P.$$

14 $P$ and $\lambda$ completely determine the distribution of the MC. Given the transition matrix $P$ and the initial distribution $\lambda$ of a Markov chain with state space $S$, we can compute any probability we like for that chain. E.g. for any $i_0, i_1, \dots, i_n \in S$, $$P_\lambda(X_0 = i_0, X_1 = i_1, \dots, X_n = i_n) = \lambda_{i_0}\, p_{i_0 i_1} \cdots p_{i_{n-1} i_n},$$ or, if we choose for the sake of simplicity again $S = \{a, c, g, t\}$, $$\big( P_\lambda(X_n = a), P_\lambda(X_n = c), P_\lambda(X_n = g), P_\lambda(X_n = t) \big) = \lambda P^n.$$
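Continuing the made-up 2-state sketch above, the distribution of $X_n$ is the row vector $\lambda P^n$:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
lam = np.array([0.5, 0.5])   # hypothetical initial distribution

# Distribution of X_n as the row vector lam @ P^n.
for n in range(5):
    print(n, lam @ np.linalg.matrix_power(P, n))
```

Running the loop shows the vector changing less and less from step to step, which is exactly the stabilization described on the next slide.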

15 The MC gradually forgets about its initial distribution... Usually, the distribution of $X_n$ changes over time, but it stabilizes as $n$ gets bigger and bigger: $$\lambda_i = P_\lambda(X_0 = i) \neq P_\lambda(X_1 = i) \neq P_\lambda(X_2 = i) \neq \dots,$$ but, for $N$ large enough, $$P_\lambda(X_N = i) \approx P_\lambda(X_{N+1} = i) \approx \dots$$ This phenomenon can be thought of as: the process remembers how it was started, but it gradually forgets about its initial distribution.

16 Stationary distribution, overview: Mathematically, there is a probability distribution $\pi$ such that $P_\lambda(X_n = i)$ converges (sometimes in a broader sense of the word) towards $\pi_i$ as $n \to \infty$. Furthermore, if we take $\pi$ as the initial distribution, then the distribution of $X_n$ is equal to $\pi$ for any $n$. For this reason $\pi$ is called the stationary distribution of the Markov chain $X_0, X_1, X_2, \dots$

17 Stationary distribution. Let $\pi = (\pi_a, \pi_c, \pi_g, \pi_t)$ be a probability distribution such that $$\pi = \pi P$$ holds. If this holds, the probability vector $\pi$ is the stationary distribution (sometimes called equilibrium distribution) for the Markov chain. In our situations (finite state space, "reasonable" MC) there is always a unique stationary distribution.
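Numerically, $\pi = \pi P$ says that $\pi$ is a left eigenvector of $P$ with eigenvalue 1. A sketch of computing it, reusing the made-up 2-state matrix from above:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# pi = pi P  means  pi is a left eigenvector of P for eigenvalue 1,
# i.e. a right eigenvector of the transpose of P.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()          # normalize to a probability vector

print(pi)                   # here: [0.8 0.2]
print(pi @ P)               # equals pi again
```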

18 Stationarity. Let $X_0, X_1, \dots$ be a Markov chain with the initial distribution being the stationary distribution, i.e. $P(X_0 = i) = \pi_i$ for $i \in S$, where $S$ is the state space of the chain. Then the chain is stationary (or "in equilibrium"): $$P(X_n = i) = \pi_i$$ for all $i \in S$ and for all $n \ge 1$. In other words: $X_n$ has the same distribution for every $n \ge 0$. It is thus much easier to compute probabilities...

19 Convergence to the equilibrium. Let $X_0, X_1, \dots$ be a Markov chain with finite state space and arbitrary initial distribution $\lambda$. In virtually all practical cases (namely if the MC is "irreducible" and "aperiodic"): $$P_\lambda(X_n = i) \to \pi_i \quad \text{as } n \to \infty.$$ Irreducible means: the MC never gets stuck in one part of the state space, i.e. departing from any state there is always a positive probability for the MC to reach any other state (within one or several time steps). Aperiodic means: there is no strict periodicity in the MC for returning to a fixed state; e.g. a MC on $S = \{a, c, g, t\}$ that can only reach a at every second point in time has period 2, and hence is not aperiodic.

20 Reversibility. Suppose that $X_0, X_1, X_2, \dots$ is a Markov chain (irreducible, aperiodic, with finite state space) with stationary distribution $\pi$. If $$\pi_i p_{ij} = \pi_j p_{ji}$$ for any two states $i$ and $j$, then the Markov chain is reversible. By no means are all Markov chains reversible! *** Reversibility is often desired in phylogenetic analyses (in computing the evolutionary distance between two sequences).

21 Detailed versus full balance. Compare the condition for reversibility, $$(1) \quad \pi_i p_{ij} = \pi_j p_{ji} \quad \text{for all } i, j \in S,$$ with the condition for the stationary distribution, $\pi = \pi P$, or in sum notation, $$(2) \quad \sum_{j \in S} \pi_i p_{ij} = \pi_i = \sum_{j \in S} \pi_j p_{ji} \quad \text{for all } i \in S.$$ (1) is often referred to as the detailed balance condition (DBC), (2) as the full balance condition (FBC). The DBC is much stronger, but usually also much easier to solve than the FBC. It is often a good shortcut for computing the stationary distribution of a reversible Markov chain.

22 Example: Let $X_0, X_1, \dots$ be a Markov chain with state space $S := \{1, 2, 3\}$ and transition matrix $$P := \begin{pmatrix} 0 & 2/3 & 1/3 \\ 2/3 & 0 & 1/3 \\ 1/2 & 1/2 & 0 \end{pmatrix}.$$ This situation can be depicted in the form of a transition graph: [Figure: transition graph with arrows $1 \to 2$ and $2 \to 1$ labeled 2/3; $1 \to 3$ and $2 \to 3$ labeled 1/3; $3 \to 1$ and $3 \to 2$ labeled 1/2.]

23 We try to solve the detailed balance condition $\pi_i p_{ij} = \pi_j p_{ji}$ for all $i, j \in S$. Set $\pi_1 := 1$. Then, because of the DBC, $$\pi_2 = \frac{p_{12}}{p_{21}} \pi_1 = \frac{2/3}{2/3} \cdot 1 = 1, \qquad \pi_3 = \frac{p_{13}}{p_{31}} \pi_1 = \frac{1/3}{1/2} \cdot 1 = \frac{2}{3}.$$ However, $\pi = (\pi_1, \pi_2, \pi_3) = (1, 1, 2/3)$ is not yet a probability distribution, because $\sum_{i=1}^3 \pi_i = 8/3 \neq 1$. But if we multiply everything by $3/8$, we get $$\pi = \left( \tfrac{3}{8}, \tfrac{3}{8}, \tfrac{2}{8} \right) = (0.375, 0.375, 0.25),$$ a probability vector which satisfies the DBC!
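A minimal numerical check of this computation (a sketch; note that the DBC is equivalent to the matrix $M$ with entries $M_{ij} = \pi_i p_{ij}$ being symmetric):

```python
import numpy as np

P = np.array([[0,   2/3, 1/3],
              [2/3, 0,   1/3],
              [1/2, 1/2, 0  ]])

pi = np.array([1.0, 1.0, 2/3])
pi = pi / pi.sum()              # (0.375, 0.375, 0.25)

M = pi[:, None] * P             # M[i, j] = pi_i * p_ij
print(np.allclose(M, M.T))      # True: detailed balance holds
print(np.allclose(pi @ P, pi))  # True: pi is stationary (full balance)
```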

24 (Recall: $\pi = (0.375, 0.375, 0.25)$.) Thus, if the MC is started with this $\pi$, we have for any $n \ge 0$ that $$\big( P_\pi(X_n = 1), P_\pi(X_n = 2), P_\pi(X_n = 3) \big) = (0.375, 0.375, 0.25). \quad \text{(Stationarity)}$$ Furthermore, if the MC is started with an arbitrary initial distribution $\lambda$, we have for $n$ large that $$\big( P_\lambda(X_n = 1), P_\lambda(X_n = 2), P_\lambda(X_n = 3) \big) \approx (0.375, 0.375, 0.25). \quad \text{(Convergence to the equilibrium)}$$

25 Higher-order Markov chains. Remember: Markov chain model for DNA. $X_i$ := nucleotide at position $i$ in a DNA sequence of length $N$. We have seen: $X_1, X_2, \dots, X_N$ i.i.d. is unrealistic. Better model: $X_1, X_2, \dots, X_N$ is a Markov chain (probabilities for the nucleotides at position $n + 1$ depend only on the nucleotide present at the preceding position $n$). But in real data the dependence is usually more complex! Better still: a higher-order Markov chain. (Markov chain of order $k$: probabilities for the nucleotides at position $n + k$ depend only on the nucleotides present at the $k$ preceding positions $n, n+1, \dots, n+k-1$.)

26 Markov chain of order k. We can write the transition probabilities of a Markov chain of order $k$ as $$p_{i_0 i_1 \dots i_{k-1}; i_k} := P\big( X_k = i_k \mid X_0 = i_0, X_1 = i_1, \dots, X_{k-1} = i_{k-1} \big)$$ $$= P\big( X_{n+k} = i_k \mid X_n = i_0, X_{n+1} = i_1, \dots, X_{n+k-1} = i_{k-1} \big)$$ $$= P\big( X_{n+k} = i_k \mid X_0 = l_0, X_1 = l_1, \dots, X_{n-1} = l_{n-1}, X_n = i_0, X_{n+1} = i_1, \dots, X_{n+k-1} = i_{k-1} \big)$$ for any $n \ge 1$ and for any states $i_0, \dots, i_k, l_0, \dots, l_{n-1} \in S$. (The second equality holds because the higher-order MC is time-homogeneous; the third one holds because it is higher-order Markovian.)

27 Higher-order MC vs. first-order MC. Main advantage of higher-order MC: they generally yield a better approximation of the dependence in real data. (It has been found that even DNA in nonfunctional intergenic regions tends to have high-order dependence!) Main disadvantage: computationally expensive. The transition matrix for an order-$k$ Markov chain model of DNA has $3 \cdot 4^k$ degrees of freedom, a similar model for a protein even $19 \cdot 20^k$. That means that huge amounts of data would be required for making reasonable statistical statements about such a MC (e.g. for estimating the transition matrix; see Lecture 4).
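To make the data requirement concrete, here is a sketch that estimates order-$k$ transition probabilities by counting $(k+1)$-mers (the toy input reuses the example sequence from slide 8; with $4^k$ possible contexts, each context needs to occur many times for the estimates to be reliable):

```python
from collections import Counter, defaultdict

def estimate_order_k(seq, k):
    """Estimate order-k transition probabilities p(context -> next symbol)
    by counting (k+1)-mers in seq."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        context, nxt = seq[i:i + k], seq[i + k]
        counts[context][nxt] += 1
    return {c: {b: n / sum(cnt.values()) for b, n in cnt.items()}
            for c, cnt in counts.items()}

seq = "cacagctagctacgactacgacacttttattactttattatatcagcgc"
probs = estimate_order_k(seq, k=2)
print(probs["ca"])   # estimated distribution of the symbol following 'ca'
```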

28 Hidden Markov models (HMM): Sequence of states (= the "path"): $q = q_1 q_2 \cdots q_N$, where $q_i \in S$ for $i = 1, 2, \dots, N$, and where $S$ is the set of possible states. The path is usually not observable (i.e. it is hidden). Sequence of observations (or "symbols"): $o = o_1 o_2 \cdots o_N$, where $o_i \in A$ for $i = 1, 2, \dots, N$, and where $A$ is the set of possible symbols (the "alphabet").

29 HMMs are important in bioinformatics, e.g. for multiple alignments, gene finding, and more. *** HMMs are more flexible (more realistic) than standard Markov chains, but still mathematically tractable.

30 Hidden Markov models (HMM): The sequence of states $q = q_1 q_2 \cdots q_N$ is a (standard) Markov chain, with transition probabilities $$p_{kl} = P(q_i = l \mid q_{i-1} = k),$$ where $k, l \in S$. In each state, a symbol is emitted: symbol $b$ is emitted with probability $$e_k(b) = P(o_i = b \mid q_i = k),$$ depending on the state $k$ the Markov chain is in (where $b \in A$, $k \in S$).

31 The joint probability of having a given sequence $o = o_1 o_2 \cdots o_N$ and a given path $q = q_1 q_2 \cdots q_N$ is $$P(o, q) = p_{s,q_1} e_{q_1}(o_1)\, p_{q_1,q_2} e_{q_2}(o_2) \cdots p_{q_{N-1},q_N} e_{q_N}(o_N)\, p_{q_N,f},$$ where $s$ = the start state and $f$ = the end state.
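A direct transcription of this formula into code, as a sketch. The two-state model anticipates the intron/exon example on the next slide, but all parameter values are invented for illustration (chosen so that transition probabilities out of each state, including into the end state, sum to 1):

```python
import numpy as np

# Hypothetical two-state gene model (0 = intron, 1 = exon).
p_start = np.array([0.5, 0.5])       # p_{s,k}: start -> state k
P = np.array([[0.80, 0.10],          # p_{kl}: state k -> state l
              [0.15, 0.75]])
p_end = np.array([0.10, 0.10])       # p_{k,f}: state k -> end
idx = {"a": 0, "c": 1, "g": 2, "t": 3}
E = np.array([[0.4, 0.1, 0.1, 0.4],  # e_0(b): emission probs in intron
              [0.2, 0.3, 0.3, 0.2]]) # e_1(b): emission probs in exon

def joint_prob(obs, path):
    """P(o, q) = p_{s,q_1} e_{q_1}(o_1) p_{q_1,q_2} e_{q_2}(o_2) ... p_{q_N,f}."""
    prob = p_start[path[0]] * E[path[0], idx[obs[0]]]
    for i in range(1, len(obs)):
        prob *= P[path[i - 1], path[i]] * E[path[i], idx[obs[i]]]
    return prob * p_end[path[-1]]

print(joint_prob("acgt", [1, 1, 0, 0]))  # probability of one particular path
```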

32 Example of HMM application: model for nucleotide patterns in a gene. States $S = \{0, 1\}$, where 0 = intron and 1 = exon. Alphabet $A = \{a, c, g, t\}$. If the state $q_i = 0$ (INTRON), emit a symbol $o_i$ according to the probability distribution $(e_0(a), e_0(c), e_0(g), e_0(t))$. If the state $q_i = 1$ (EXON), emit a symbol $o_i$ according to the probability distribution $(e_1(a), e_1(c), e_1(g), e_1(t))$.

33 Decoding (finding the path). In HMM applications the following problem often occurs: Given the observations $o = o_1 o_2 \cdots o_N$ (and given all the model parameters), find the path $q = q_1 q_2 \cdots q_N$ that is most likely to have generated the observations. For example, given the observed sequence ccgtactagctgtagctgtgac...atcggggctgctgccatcgatcgacgtagc, where are the introns and exons?

34 Most probable path. The most probable path $q^*$ (given the observations $o$) is $$q^* = \operatorname{argmax}_q P(q \mid o).$$ Even with a moderate number of possible states, the number of possible paths is huge. Therefore, calculation by exhaustive methods (testing all possible paths) becomes impossible in many cases. But the most probable path $q^*$ can be found recursively using a dynamic programming algorithm, the Viterbi algorithm (see also Lectures 6 and 7 on HMMs).

35 Most probable path: the Viterbi algorithm. Given: observations $o_1, o_2, \dots, o_N$. Define $$\delta_n(k) = \max_{q_1, q_2, \dots, q_{n-1}} P(q_1, q_2, \dots, q_{n-1}, q_n = k, o_1, o_2, \dots, o_n)$$ for $k \in S$, $1 \le n \le N$. INITIATION: $\delta_1(k) = p_{sk}\, e_k(o_1)$ for $k \in S$ ($s$ = "start"). INDUCTION: $\delta_n(j) = \max_{i \in S} \big[ \delta_{n-1}(i)\, p_{ij} \big]\, e_j(o_n)$ for $j \in S$, for $2 \le n \le N$. BACKTRACKING: Put $q_N^* := \operatorname{argmax}_{j \in S} \delta_N(j)$ and $q_{n-1}^* := \operatorname{argmax}_{j \in S} \delta_{n-1}(j)\, p_{j, q_n^*}$ for $n = N, N-1, \dots, 2$. RESULT: Most probable path $q_1^* q_2^* \cdots q_N^*$.
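A compact sketch of these recursions (it stores backpointers during the induction step, which is equivalent to the backtracking rule above; the end-state transition is ignored, as in the slide). Probabilities are multiplied directly for clarity; for long sequences one would work with log probabilities to avoid numerical underflow. The parameters are the hypothetical intron/exon values from the joint-probability sketch:

```python
import numpy as np

def viterbi(obs, p_start, P, E, idx):
    """Most probable state path for an observed symbol string.

    delta[n, k] = max over q_1..q_{n-1} of
                  P(q_1, ..., q_{n-1}, q_n = k, o_1, ..., o_n).
    """
    N, K = len(obs), len(p_start)
    delta = np.zeros((N, K))
    back = np.zeros((N, K), dtype=int)

    # Initiation: delta_1(k) = p_sk * e_k(o_1).
    delta[0] = p_start * E[:, idx[obs[0]]]
    # Induction: delta_n(j) = max_i [delta_{n-1}(i) p_ij] * e_j(o_n).
    for n in range(1, N):
        scores = delta[n - 1][:, None] * P     # scores[i, j] = delta_{n-1}(i) p_ij
        back[n] = scores.argmax(axis=0)        # best predecessor of each state j
        delta[n] = scores.max(axis=0) * E[:, idx[obs[n]]]

    # Backtracking from the best final state.
    path = [int(delta[-1].argmax())]
    for n in range(N - 1, 0, -1):
        path.append(int(back[n, path[-1]]))
    return path[::-1]

# Hypothetical parameters, as in the joint-probability sketch:
p_start = np.array([0.5, 0.5])
P = np.array([[0.80, 0.10], [0.15, 0.75]])
E = np.array([[0.4, 0.1, 0.1, 0.4], [0.2, 0.3, 0.3, 0.2]])
idx = {"a": 0, "c": 1, "g": 2, "t": 3}
print(viterbi("ccgtactagc", p_start, P, E, idx))  # e.g. [1, 1, 1, 0, ...]
```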