A Study of Hidden Markov Model


1 University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Masters Theses Graduate School A Study of Hidden Markov Model Yang Liu University of Tennessee - Knoxville Recommended Citation Liu, Yang, "A Study of Hidden Markov Model. " Master's Thesis, University of Tennessee, This Thesis is brought to you for free and open access by the Graduate School at Trace: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Masters Theses by an authorized administrator of Trace: Tennessee Research and Creative Exchange. For more information, please contact trace@utk.edu.

2 To the Graduate Council: I am submitting herewith a thesis written by Yang Liu entitled "A Study of Hidden Markov Model." I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Mathematics. We have read this thesis and recommend its acceptance: Xia Chen, Balram Rajput (Original signatures are on file with official student records.) Jan Rosinski, Major Professor Accepted for the Council: Dixie L. Thompson Vice Provost and Dean of the Graduate School

3 To the Graduate Council: I am submitting herewith a thesis written by Yang Liu entitled A Study of Hidden Markov Model. I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Mathematics. We have read this thesis and recommend its acceptance: Jan Rosinski Major Professor Xia Chen Balram Rajput Accepted for the Council: Anne Mayhew Vice Chancellor and Dean of Graduate Studies (Original signatures are on file with official student records.)

4 A STUDY OF HIDDEN MARKOV MODEL A Thesis Presented for the Master of Science Degree The University of Tennessee, Knoxville Yang Liu August 2004

DEDICATION

This thesis is dedicated to my parents, Jiaquan Liu and Xiaobo Yang, for consistently supporting me, inspiring me, and encouraging me.

ACKNOWLEDGMENTS

I would like to thank my parents for their consistent support and encouragement. I especially express my deepest appreciation to my advisor, Dr. Jan Rosinski, for his patient, friendly, and unfailing support over the past two years. I also want to thank the other committee members, Dr. Balram Rajput and Dr. Xia Chen, for their comments and suggestions. I would also like to thank Xiaohu Tang for his help.

ABSTRACT

The purpose of this thesis is fourfold: to introduce the definition of the Hidden Markov Model; to present three problems about HMMs that must be solved for the model to be useful in real-world applications; to cover the solutions of these three basic problems, explaining some algorithms in detail together with their mathematical proofs; and finally, to give some examples showing how these algorithms work.

TABLE OF CONTENTS

1. Introduction
2. What is A Hidden Markov Model
3. Three Basic Problems of HMMs and Their Solutions
   3.1 The Evaluation Problem
       3.1.1 The Forward Procedure
       3.1.2 The Backward Procedure
       3.1.3 Summary of the Evaluation Problem
   3.2 The Decoding Problem
       3.2.1 Viterbi Algorithm
       3.2.2 The δ-θ Algorithm
   3.3 The Learning Problem
   3.4 An Example
4. HMMs Analysis by Matlab
   4.1 How to Generate A, B, π
   4.2 How to Test the Baum-Welch Method
References
Appendices
   Appendix A: The New Models Generated in Section 4.2, Step 3
   Appendix B: Matlab Program
Vita

LIST OF TABLES

Table 1. The Forward Variables $\alpha_t(i)$
Table 2. The Variables $\delta_t(i)$ and $\psi_t(i)$
Table 3. The Backward Variables $\beta_t(i)$
Table 4. The Variables $\xi_t(i, j)$ and $\gamma_t(i)$
Table 5. The Distances, Seed $\lambda_0$
Table 6. The Distances, Seed $\lambda$

1 INTRODUCTION

Many of the most powerful sequence analysis methods are now based on principles of probabilistic modeling, such as Hidden Markov Models. Hidden Markov Models (HMMs) are very powerful and flexible. They originally emerged in the domain of speech recognition, where over the past several years they have become the most successful speech models, and in recent years they have been widely used in many other fields, including molecular biology, biochemistry and genetics, as well as computer vision.

An HMM is a probabilistic model composed of a number of interconnected states, each of which emits an observable output. Each state has two kinds of parameters. First, symbol emission probabilities describe the probabilities of the possible outputs from the state; second, state transition probabilities specify the probability of moving to a new state from the current one. An observed sequence of symbols is generated by starting at some initial state and moving probabilistically from state to state until some terminal state is reached, emitting an observable symbol from each state passed through. The sequence of states is a first-order Markov chain. This state sequence is hidden; only the sequence of symbols that it emits is observable, hence the term hidden Markov model.

This paper is organized in the following manner. The definition as well as some

necessary assumptions about HMMs are explained first. The three basic problems of HMMs are then discussed, and their solutions are given. To make HMMs useful in real-world applications, we must know how to solve these problems: the evaluation problem, the decoding problem and the learning problem. In this part some useful algorithms are introduced, including the forward procedure, the backward procedure, the Viterbi algorithm, the δ-θ algorithm and the Baum-Welch method. Lastly, the computer programs by which HMMs are simulated are covered.

2 WHAT IS A HIDDEN MARKOV MODEL

A Hidden Markov Model is a stochastic model comprising an unobserved Markov chain $(X_t : t = 0, 1, \dots)$ and an observable process $(Y_t : t = 0, 1, \dots)$. In this paper we consider only the case where the observations are discrete symbols chosen from a finite set, so we can use a discrete probability distribution within each state.

To define an HMM, we need some elements. The number of hidden states is $N$; we denote these $N$ states by $s_1, s_2, \dots, s_N$. The number of observable symbols is $M$; we denote them by $r_1, r_2, \dots, r_M$. Specifically, $(X_t : t = 0, 1, \dots)$ is a Markov chain with transition probability matrix $A = [a_{ij}]$ and an initial state distribution $\pi = (\pi_i)$, where $1 \le i, j \le N$. By the properties of Markov chains, we know that
$$a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i) = P(X_{t+1} = s_j \mid X_t = s_i, X_{t-1} = s_k, \dots, X_0 = s_l), \quad t = 0, 1, \dots,$$
for every $s_i, s_j, \dots, s_k$ and $s_l$, where $1 \le i, j, k, l \le N$. The initial state distribution of the Markov chain is given by $\pi_i = P(X_0 = s_i)$.

For a finite observation sequence $(Y_t : t = 0, 1, \dots, T)$, where $T$ is any fixed number, we make a fundamental assumption connecting the hidden state sequence $(X_t : t = 0, 1, \dots, T)$ and the observation sequence, namely the conditional independence of the observations $(Y_t : t = 0, 1, \dots, T)$ given the states. Formulated mathematically,
$$P(Y_0 = r_{j_0}, Y_1 = r_{j_1}, \dots, Y_T = r_{j_T} \mid X_0 = s_{i_0}, X_1 = s_{i_1}, \dots, X_T = s_{i_T}) = \prod_{t=0}^{T} P(Y_t = r_{j_t} \mid X_t = s_{i_t}), \quad (1)$$
where $1 \le i_0, i_1, \dots, i_T \le N$ and $1 \le j_0, j_1, \dots, j_T \le M$.

To simplify the notation, we denote the state sequence $(s_{i_0}, s_{i_1}, \dots, s_{i_T})$ by $s_0^T$ and the symbol sequence $(r_{j_0}, r_{j_1}, \dots, r_{j_T})$ by $r_0^T$, and we write $X_0^T = (X_0, X_1, \dots, X_T)$ and $Y_0^T = (Y_0, Y_1, \dots, Y_T)$. Then we put
$$b_j(k) = P(Y_t = r_k \mid X_t = s_j), \quad 1 \le j \le N,\ 1 \le k \le M,\ t = 0, 1, 2, \dots$$
We may rewrite formula (1) as
$$P(Y_0^T = r_0^T \mid X_0^T = s_0^T) = \prod_{t=0}^{T} b_{i_t}(j_t). \quad (2)$$

So far, we know a Hidden Markov Model has several components: a set of states $s_1, s_2, \dots, s_N$, a set of output symbols $r_1, r_2, \dots, r_M$, a set of transitions each carrying a probability and an output symbol, and a starting state. When a transition is taken, it produces an output symbol. The complicating factor is that the output symbol given is not necessarily unique to that transition, and thus it

is difficult to determine which transition was the one actually taken; this is why such models are termed hidden. See Figure 2.1.

From the theory of Markov chains, we know $P(X_0^T = s_0^T)$, the probability of the state sequence $s_0^T$:
$$P(X_0^T = s_0^T) = P(X_0 = s_{i_0}, X_1 = s_{i_1}, \dots, X_T = s_{i_T}) = \pi_{i_0}\, a_{i_0 i_1} \cdots a_{i_{T-1} i_T} = \prod_{t=0}^{T} a_{i_{t-1} i_t},$$
where we set $a_{i_{-1} i_0} = \pi_{i_0}$. Hence the joint probability of $s_0^T$ and $r_0^T$ is
$$P(X_0^T = s_0^T, Y_0^T = r_0^T) = P(Y_0^T = r_0^T \mid X_0^T = s_0^T)\, P(X_0^T = s_0^T) = \prod_{t=0}^{T} [\,b_{i_t}(j_t)\, a_{i_{t-1} i_t}\,]. \quad (3)$$

The transition probability matrix $A$, the initial state distribution $\pi$ and the matrix $B = [b_j(k)]$, $1 \le j \le N$, $1 \le k \le M$, define a Hidden Markov Model completely. Therefore we can use the compact notation $\lambda = (A, B, \pi)$ to denote a Hidden Markov Model with discrete probability distributions. We may think of $\lambda$ as the parameter of the Hidden Markov Model. We denote the probability given a model $\lambda$ by $P_\lambda$ from now on.

Lemma 2.1. $P_\lambda(Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \dots, i_T \le N}\ \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t)$, where $a_{i_{-1} i_0} = \pi_{i_0}$.

[Figure 2.1 HMM: the hidden chain $X_0 \to X_1 \to X_2 \to \cdots \to X_{T-1} \to X_T$, with each state $X_t$ emitting the observation $Y_t$.]

Proof: We want to compute $P_\lambda(Y_0^T = r_0^T)$, the probability of the observation sequence given the model $\lambda$. From time 0 to time $T$, we consider every possible hidden state sequence $s_0^T$. The probability of $r_0^T$ is then obtained by summing the joint probability of $r_0^T$ and $s_0^T$ over all possible $s_0^T$, that is,
$$P_\lambda(Y_0^T = r_0^T) = \sum_{\text{all possible } s_0^T} P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \dots, i_T \le N}\ \prod_{t=0}^{T} [\,a_{i_{t-1} i_t}\, b_{i_t}(j_t)\,]. \quad (4)$$

Lemma 2.1 gives us a method to compute the probability of a sequence of observations $r_0^T$. Unfortunately, this direct calculation is computationally infeasible, because the formula involves on the order of $(T+1)\,N^{T+1}$ calculations. We introduce some more efficient methods in the next section.

We can understand Lemma 2.1 from another point of view. A hidden Markov model consists of a set of hidden states $s_1, s_2, \dots, s_N$ connected by directed edges. Each state assigns probabilities to the characters of the alphabet used in the observable sequence and to the edges leaving the state. A path in an HMM, $s_{i_0}, s_{i_1}, \dots, s_{i_T}$, is a sequence of states such that there is an edge from each state in the path to the next state in the path. The probability of this path is the product of the probabilities of the edges traversed, that is, $P(X_0^T = s_0^T)$.
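The sum in Lemma 2.1 can be evaluated directly by enumerating all hidden paths, which makes the $(T+1)\,N^{T+1}$ cost concrete. A minimal sketch follows; the two-state model and all its numbers are hypothetical, not taken from the thesis:

```python
import itertools
import numpy as np

def likelihood_bruteforce(A, B, pi, obs):
    """Evaluate Lemma 2.1 directly: sum the path probabilities over all
    N**(T+1) hidden state sequences. Feasible only for tiny N and T."""
    N = len(pi)
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]          # pi_{i_0} b_{i_0}(j_0)
        for t in range(1, len(obs)):
            p *= A[path[t-1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

# Hypothetical 2-state, 2-symbol model (numbers are my own).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 0]
print(likelihood_bruteforce(A, B, pi, obs))
```

Even for this toy case the loop visits $2^3 = 8$ paths; for realistic $N$ and $T$ the enumeration explodes, which is exactly why the recursive procedures of the next section are needed.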

Each path through the HMM gives a probability distribution for each position in a string of the same length, based on the probabilities for the characters in the corresponding states. The probability of the observable sequence given a particular path is the product of the probabilities of the characters, that is,
$$P(Y_0^T = r_0^T \mid X_0^T = s_0^T) = \prod_{t=0}^{T} b_{i_t}(j_t).$$
The probability of any sequence of characters is the sum, over all paths whose length is the same as the sequence, of the probability of the path times the probability of the sequence given the path; this is precisely the result of Lemma 2.1. See Figure 2.2.

A Hidden Markov Model has a property very similar to that of a Markov process: given the value of $X_t$, the values of $Y_s$, $s \ge t$, do not depend on the values of $X_u$, $u < t$. The probability of any particular future observation of the model, when its present hidden state is known exactly, is not altered by additional knowledge concerning its past hidden behavior. In formal terms, we have Lemma 2.2.

Lemma 2.2. $P_\lambda(Y_u = r_{j_u} \mid X_0^t = s_0^t) = P_\lambda(Y_u = r_{j_u} \mid X_t = s_{i_t})$, for $u \ge t$.

Proof: First we prove that the result is true when $u = t$:
$$P_\lambda(Y_t = r_{j_t} \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \dots, j_{t-1} \le M} P_\lambda(Y_0^t = r_0^t \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \dots, j_{t-1} \le M} P_\lambda(Y_0^{t-1} = r_0^{t-1} \mid X_0^{t-1} = s_0^{t-1})\, P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t})$$

[Figure 2.2 Paths of An HMM: a trellis with states $s_1, s_2, \dots, s_N$ at each time $t = 0, 1, \dots, T-1, T$.]

$$= 1 \cdot P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}) = P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}).$$
In particular, for any $q \le t$ the same computation gives
$$P_\lambda(Y_t = r_{j_t} \mid X_q^t = s_q^t) = \sum_{1 \le j_q, \dots, j_{t-1} \le M} P_\lambda(Y_q^t = r_q^t \mid X_q^t = s_q^t) = \sum_{1 \le j_q, \dots, j_{t-1} \le M} P_\lambda(Y_q^{t-1} = r_q^{t-1} \mid X_q^{t-1} = s_q^{t-1})\, P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}) = P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}).$$

Next we prove that the result also holds when $u > t$:
$$P_\lambda(Y_u = r_{j_u} \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \dots, j_{u-1} \le M} P_\lambda(Y_0^u = r_0^u \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \dots, j_{u-1} \le M}\ \sum_{1 \le i_{t+1}, \dots, i_u \le N} P_\lambda(Y_0^u = r_0^u, X_{t+1}^u = s_{t+1}^u \mid X_0^t = s_0^t)$$
$$= \sum_{1 \le j_0, \dots, j_{u-1} \le M}\ \sum_{1 \le i_{t+1}, \dots, i_u \le N} P_\lambda(Y_0^u = r_0^u \mid X_0^u = s_0^u)\, P_\lambda(X_{t+1}^u = s_{t+1}^u \mid X_t = s_{i_t})$$
$$= \sum_{1 \le j_0, \dots, j_{u-1} \le M}\ \sum_{1 \le i_{t+1}, \dots, i_u \le N} P_\lambda(Y_0^{u-1} = r_0^{u-1} \mid X_0^{u-1} = s_0^{u-1})\, P_\lambda(Y_u = r_{j_u} \mid X_u = s_{i_u})\, P_\lambda(X_{t+1}^u = s_{t+1}^u \mid X_t = s_{i_t})$$
$$= \sum_{1 \le i_{t+1}, \dots, i_u \le N} P_\lambda(Y_u = r_{j_u} \mid X_u = s_{i_u})\, P_\lambda(X_{t+1}^u = s_{t+1}^u \mid X_t = s_{i_t})$$

$$= P_\lambda(Y_u = r_{j_u} \mid X_t = s_{i_t}).$$

Lemma 2.2 will be widely used in the next section to help solve the basic problems of HMMs.

Remark. The observations $(Y_t, t \ge 0)$ are not independent in general.

Proof: Take $T = 1$. From Lemma 2.1 we have
$$P_\lambda(Y_0^1 = r_0^1) = \sum_{i_0, i_1 = 1}^{N} b_{i_0}(j_0)\, b_{i_1}(j_1)\, \pi_{i_0}\, a_{i_0 i_1}.$$
But $P_\lambda(Y_0 = r_{j_0}) = \sum_{i_0=1}^{N} b_{i_0}(j_0)\, \pi_{i_0}$, and
$$P_\lambda(Y_1 = r_{j_1}) = \sum_{i_1} P_\lambda(Y_1 = r_{j_1} \mid X_1 = s_{i_1})\, P_\lambda(X_1 = s_{i_1}) = \sum_{i_1} b_{i_1}(j_1) \Big[\sum_{i_0} a_{i_0 i_1}\, \pi_{i_0}\Big] = \sum_{i_0, i_1 = 1}^{N} b_{i_1}(j_1)\, a_{i_0 i_1}\, \pi_{i_0}.$$
Hence in general $P_\lambda(Y_0^1 = r_0^1) \ne P_\lambda(Y_0 = r_{j_0})\, P_\lambda(Y_1 = r_{j_1})$. Therefore the sequence $(Y_t, t \ge 0)$ is not independent.

Actually, this is very natural to understand: each $Y_t$ is generated by the corresponding $X_t$, and the hidden sequence $(X_t, t = 0, 1, \dots)$ is not independent. There is a very easy example: take $N = M$ and $b_i(j) = \delta_{ij}$. Then $(Y_t)$ is itself a Markov chain.
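The remark can be verified numerically. A small sketch, using the extreme case $b_i(j) = \delta_{ij}$ mentioned above (the transition matrix and initial distribution are my own hypothetical choices):

```python
import numpy as np

# Hypothetical two-state model with b_i(j) = delta_ij, so Y_t = X_t.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.eye(2)                      # emission matrix is the identity
pi = np.array([0.5, 0.5])

# Joint law of (Y_0, Y_1) via Lemma 2.1, then the two marginals.
joint = np.zeros((2, 2))
for i0 in range(2):
    for i1 in range(2):
        for j0 in range(2):
            for j1 in range(2):
                joint[j0, j1] += pi[i0] * B[i0, j0] * A[i0, i1] * B[i1, j1]
pY0 = joint.sum(axis=1)
pY1 = joint.sum(axis=0)
print(joint[0, 0], pY0[0] * pY1[0])  # 0.45 vs 0.25: Y_0 and Y_1 are dependent
```

Since the joint probability differs from the product of marginals, $(Y_0, Y_1)$ are not independent, exactly as the remark states.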

3 THREE BASIC PROBLEMS OF HMMs AND THEIR SOLUTIONS

Once we have an HMM, there are three problems of interest.

3.1 The Evaluation Problem

Given a Hidden Markov Model $\lambda$ and a sequence of observations $Y_0^T = r_0^T$, what is the probability that the observations are generated by the model, i.e. $P_\lambda(Y_0^T = r_0^T)$? We can also view the problem as asking how well a given model matches a given observation sequence. From this second viewpoint, if we have several competing models, the solution to the evaluation problem lets us select the model that best matches the observation sequence.

The most straightforward way of doing this is to use Lemma 2.1, but it involves a great many calculations. The more efficient methods are called the forward procedure and the backward procedure; see [1]. We will introduce these two procedures first, then reveal the mathematical idea behind them.

3.1.1 The Forward Procedure

Fix $r_0^t$ and consider the forward variable $\alpha_t(i_t)$ defined as
$$\alpha_t(i_t) = P_\lambda(Y_0^t = r_0^t, X_t = s_{i_t}), \quad t = 0, 1, \dots, T,\ 1 \le i_t \le N,$$
that is, the probability of the partial observation sequence $r_{j_0}, r_{j_1}, \dots, r_{j_t}$ (until time $t$) together with hidden state $s_{i_t}$ at time $t$. So for the forward variable $\alpha_t(i_t)$ we only consider those paths which end at state $s_{i_t}$ at time $t$. We can solve for $\alpha_t(i_t)$ inductively, as follows:

(1) Initialization:
$$\alpha_0(i_0) = \pi_{i_0}\, b_{i_0}(j_0), \quad 1 \le i_0 \le N.$$

(2) Induction ($t = 0, 1, \dots, T-1$):
$$\alpha_{t+1}(i_{t+1}) = P_\lambda(Y_0^{t+1} = r_0^{t+1}, X_{t+1} = s_{i_{t+1}}) = \sum_{1 \le i_0, \dots, i_t \le N} P_\lambda(Y_0^{t+1} = r_0^{t+1}, X_0^{t+1} = s_0^{t+1})$$
$$= \sum_{1 \le i_0, \dots, i_t \le N} P_\lambda(Y_0^{t+1} = r_0^{t+1} \mid X_0^{t+1} = s_0^{t+1})\, P_\lambda(X_0^{t+1} = s_0^{t+1})$$
$$= \sum_{1 \le i_0, \dots, i_t \le N} P_\lambda(Y_0^t = r_0^t \mid X_0^t = s_0^t)\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_0^t = s_0^t)\, P_\lambda(X_0^t = s_0^t)$$
$$= \Big[\sum_{1 \le i_0, \dots, i_t \le N} P_\lambda(Y_0^t = r_0^t, X_0^t = s_0^t)\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})\Big]\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})$$

$$= \Big[\sum_{1 \le i_t \le N} P_\lambda(Y_0^t = r_0^t, X_t = s_{i_t})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})\Big]\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}}) = \Big[\sum_{i_t=1}^{N} \alpha_t(i_t)\, a_{i_t i_{t+1}}\Big]\, b_{i_{t+1}}(j_{t+1}).$$

(3) Termination:
$$P_\lambda(Y_0^T = r_0^T) = \sum_{i_T=1}^{N} P_\lambda(Y_0^T = r_0^T, X_T = s_{i_T}) = \sum_{i_T=1}^{N} \alpha_T(i_T).$$

If we examine the computation involved in the calculation of $\alpha_t(i_t)$, we see that it requires on the order of $N^2(T+1)$ calculations, rather than the $(T+1)\,N^{T+1}$ required by the direct calculation. Hence the forward probability calculation is much more efficient than the direct calculation.

Similarly, we have another method for the evaluation problem, called the backward procedure.

3.1.2 The Backward Procedure

We consider a backward variable $\beta_t(i_t)$ defined as
$$\beta_t(i_t) = P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}), \quad 0 \le t \le T-1,\ 1 \le i_t \le N.$$

That is, $\beta_t(i_t)$ is the probability of the partial observation sequence from time $t+1$ to the end, given that the hidden state is $s_{i_t}$ at time $t$. Again we can solve for $\beta_t(i_t)$ inductively, as follows:

(1) Initialization: to make the procedure work for $t = T-1$, we arbitrarily define
$$\beta_T(i_T) = 1.$$

(2) Induction ($t = T-1, T-2, \dots, 0$):
$$\beta_t(i_t) = P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}) = \sum_{i_{t+1}=1}^{N} P_\lambda(Y_{t+1}^T = r_{t+1}^T, X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$
$$= \sum_{i_{t+1}=1}^{N} P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$
$$= \sum_{i_{t+1}=1}^{N} P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(Y_{t+2}^T = r_{t+2}^T \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$

$$= \sum_{i_{t+1}=1}^{N} a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1}).$$

(3) Termination:
$$P_\lambda(Y_0^T = r_0^T) = \sum_{i_0=1}^{N} P_\lambda(Y_0^T = r_0^T, X_0 = s_{i_0}) = \sum_{i_0=1}^{N} P_\lambda(Y_1^T = r_1^T \mid X_0 = s_{i_0})\, P_\lambda(Y_0 = r_{j_0} \mid X_0 = s_{i_0})\, P_\lambda(X_0 = s_{i_0}) = \sum_{i_0=1}^{N} \beta_0(i_0)\, b_{i_0}(j_0)\, \pi_{i_0}.$$

The backward procedure requires on the order of $N^2(T+1)$ calculations, as many as the forward procedure.

3.1.3 Summary of the Evaluation Problem

We have shown how to evaluate the probability that the observation sequence $Y_0^T = r_0^T$ is generated, using either the forward procedure or the backward procedure. Both are more efficient than the method given in Lemma 2.1. In fact, these two procedures do nothing more than convert the multiple sum of Lemma 2.1 into a repeated sum. In the forward procedure we use the identity
$$\sum_{1 \le i_0, \dots, i_T \le N} = \sum_{i_T=1}^{N} \cdots \sum_{i_0=1}^{N},$$
and in the backward procedure we reverse the order of summation, that is,
$$\sum_{1 \le i_0, \dots, i_T \le N} = \sum_{i_0=1}^{N} \cdots \sum_{i_T=1}^{N}.$$
From this point of view, it is clear that other procedures are possible. For example, we can carry out the summation from the two ends toward the middle at the same time; in some sense, this can be seen as a kind of parallel algorithm. It requires on the order of $N(N-1)(T+1)$ calculations.

3.2 The Decoding Problem

Given a model $\lambda$ and a sequence of observations $Y_0^T = r_0^T$, what is the most likely state sequence in the model that produced the observations? That is, we want to find a hidden state sequence $X_0^T = s_0^T$ maximizing the probability $P_\lambda(X_0^T = s_0^T \mid Y_0^T = r_0^T)$ over all possible sequences $s_0^T$. This is equivalent to maximizing $P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T)$, because the probability $P_\lambda(Y_0^T = r_0^T)$ of $r_0^T$ under the model $\lambda$ is fixed. A technique for finding this state sequence exists, based on dynamic programming methods, and

is called the Viterbi algorithm; see [1].

3.2.1 Viterbi Algorithm

Fixing $s_{i_t}$, we consider a variable $\delta_t(i_t)$ defined as
$$\delta_t(i_t) = \max_{1 \le i_0, i_1, \dots, i_{t-1} \le N} P_\lambda(X_0^{t-1} = s_0^{t-1}, X_t = s_{i_t}, Y_0^t = r_0^t).$$
Hence $\delta_t(i_t)$ is the highest probability along a single path which accounts for the first $t$ observations and ends in state $s_{i_t}$ at time $t$. By induction we have
$$\delta_{t+1}(i_{t+1}) = \max_{1 \le i_0, i_1, \dots, i_t \le N} P_\lambda(X_0^t = s_0^t, X_{t+1} = s_{i_{t+1}}, Y_0^{t+1} = r_0^{t+1})$$
$$= \max_{1 \le i_0, i_1, \dots, i_t \le N} P_\lambda(Y_0^{t+1} = r_0^{t+1} \mid X_0^{t+1} = s_0^{t+1})\, P_\lambda(X_0^{t+1} = s_0^{t+1})$$
$$= \max_{1 \le i_0, i_1, \dots, i_t \le N} P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(Y_0^t = r_0^t, X_0^t = s_0^t)\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$
$$= \Big[\max_{1 \le i_t \le N}\ \max_{1 \le i_0, \dots, i_{t-1} \le N} P_\lambda(Y_0^t = r_0^t, X_0^t = s_0^t)\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})\Big]\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})$$
$$= \Big[\max_{1 \le i_t \le N} \delta_t(i_t)\, a_{i_t i_{t+1}}\Big]\, b_{i_{t+1}}(j_{t+1}).$$
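The recursion above, plus an array of maximizing predecessors for reading the best path back, can be sketched as follows. This is a minimal implementation; the model numbers are hypothetical, and ties are resolved by picking a single argmax rather than keeping all maximizers:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state path via the delta recursion with predecessor
    bookkeeping and backtracking."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t-1][:, None] * A              # delta_{t-1}(i) * a_{ij}
        psi[t] = scores.argmax(axis=0)                # best predecessor of j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # induction
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                     # termination
    for t in range(T-2, -1, -1):                      # backtracking
        path[t] = psi[t+1][path[t+1]]
    return path, delta[-1].max()

A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]
path, p = viterbi(A, B, pi, obs)
print(path, p)
```

The returned probability is the maximum joint probability $\max P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T)$, which can be checked against brute-force enumeration for small cases.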

To find the hidden state sequence, we need to keep track of the argument which maximizes $\delta_t(i_t)$, for every $t$ and $i_t$; these are recorded in an array $\psi_t(i_t)$.

(1) Initialization:
$$\delta_0(i_0) = \pi_{i_0}\, b_{i_0}(j_0), \quad 1 \le i_0 \le N.$$

(2) Induction ($t = 1, \dots, T$):
$$\delta_t(i_t) = \max_{1 \le i_{t-1} \le N} [\delta_{t-1}(i_{t-1})\, a_{i_{t-1} i_t}]\, b_{i_t}(j_t), \qquad \psi_t(i_t) = \arg\max_{1 \le i_{t-1} \le N} [\delta_{t-1}(i_{t-1})\, a_{i_{t-1} i_t}].$$

(3) Termination:
$$\delta^* = \max_{1 \le i_T \le N} \delta_T(i_T), \qquad \psi^* = \arg\max_{1 \le i_T \le N} \delta_T(i_T).$$

Finally, we obtain the state sequence by backtracking: $i_T^* = \psi^*$ and $i_t^* = \psi_{t+1}(i_{t+1}^*)$, $t = T-1, T-2, \dots, 0$.

Remark 1. In the Viterbi algorithm, $\psi_t(i_t)$, $t = 0, \dots, T-1$, and $\psi^*$ are set-valued maps. This means there may be more than one best hidden state sequence; in order to distinguish them, we need more information.

Remark 2. This procedure is not consistent as $T$ increases.

Proof: We give a counterexample, taking a suitable choice of $\pi$, $A$ and $B$. Suppose we have an observation sequence $Y_0^1 = (r_1, r_2)$. Following the Viterbi algorithm, when $T = 0$ the most likely hidden state is $s_1$, but when $T = 1$ the best sequence is $(s_2, s_1)$. This means that, given a fixed sequence $Y$, if $s_{i_0}$ is a solution for $T = 0$ and $(s_{i_0'}, s_{i_1'})$ is a solution for $T = 1$, we may not have $i_0 = i_0'$, even when $s_{i_0}$ and $(s_{i_0'}, s_{i_1'})$ are the unique best solutions respectively.

The Viterbi algorithm is very similar to the forward procedure introduced in the previous section. In the forward procedure we defined $\alpha_t(i_t)$, while here we use $\delta_t(i_t)$; the only difference between them is that summation is replaced by maximization. The general ideas are the same: we change a multiple summation to a repeated summation, and a multiple maximum to a repeated maximum.

From this point of view, it is very natural to ask how the idea used in the backward procedure can be applied to the decoding problem. Therefore, we

introduce a new method: we carry out the maximization from the two ends toward the middle at the same time. That is the δ-θ algorithm.

3.2.2 The δ-θ Algorithm

The advantage of this method is saving time. To describe the procedure, we define another variable $\theta_t(i_t)$, $t = 0, \dots, T-1$:
$$\theta_t(i_t) = \max_{1 \le i_{t+1}, \dots, i_T \le N} P_\lambda(X_{t+1}^T = s_{t+1}^T, Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}).$$
$\theta_t(i_t)$ is the highest probability along a path which accounts for the observations from time $t+1$ to the end and starts at state $s_{i_t}$ at time $t$. Inductively, we have
$$\theta_t(i_t) = \max_{1 \le i_{t+1}, \dots, i_T \le N} P_\lambda(X_{t+1}^T = s_{t+1}^T, Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t})$$
$$= \max_{1 \le i_{t+1}, \dots, i_T \le N} P_\lambda(Y_{t+2}^T = r_{t+2}^T, X_{t+2}^T = s_{t+2}^T \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$

31 = max 1 i t+1,...,i T N P λ(y T t+2 = r T t+2, X T t+2 = s T t+2 X t+1 = s it+1 ) P λ (Y t+1 = r jt+1 X t+1 = s it+1 ) P λ (X t+1 = s it+1 X t = s it ) = max 1 i t+1 N θ t+1(i t+1 )a it i t+1 b it+1 (j t+1 ) (1)Initialization: δ 0 (i 0 ) = π i0 b i0 (j 0 ), 1 i 0 N, θ T (i T ) = 1, 1 i T N. (2)Induction: δ t (i t ) = max [δ t 1(i t 1 )a it 1 i t ]b it (j t ), t = 1,..., T. 1 i t 1 N ψ t (i t ) = arg max [δ t 1(i t 1 )a it 1 i t ], t = 1,..., T. 1 i t 1 N θ t (i t ) = max 1 i t+1 N θ t+1(i t+1 )a iti t+1 b it+1 (j t+1 ), t = 0,..., T 1. ϕ t (i t ) = arg max θ t+1(i t+1 )a it i t+1 b it+1 (j t+1 ), t = 0,..., T 1. 1 i t+1 N (3)Termination: max P λ(x T 0 = s T 0, Y0 T = r T 0 ) 1 i 0,...,i T N 22

$$= \max_{1 \le i_t \le N}\ \max_{1 \le i_s \le N,\, s \ne t} P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T) = \max_{1 \le i_t \le N}\ \max_{1 \le i_s \le N,\, s \ne t} P_\lambda(Y_0^T = r_0^T \mid X_0^T = s_0^T)\, P_\lambda(X_0^T = s_0^T)$$
$$= \max_{1 \le i_t \le N} \Big[\max_{1 \le i_s \le N,\, s < t} P_\lambda(X_0^t = s_0^t, Y_0^t = r_0^t)\ \max_{1 \le i_s \le N,\, s > t} P_\lambda(Y_{t+1}^T = r_{t+1}^T, X_{t+1}^T = s_{t+1}^T \mid X_t = s_{i_t})\Big]$$
$$= \max_{1 \le i_t \le N} [\delta_t(i_t)\, \theta_t(i_t)].$$
We define $i_t^* = \arg\max_{1 \le i_t \le N} [\delta_t(i_t)\, \theta_t(i_t)]$, and the most likely hidden sequence is recovered by
$$i_u^* = \begin{cases} \psi_{u+1}(i_{u+1}^*) & \text{for } u = t-1, \dots, 0, \\ i_t^* & \text{for } u = t, \\ \varphi_{u-1}(i_{u-1}^*) & \text{for } u = t+1, \dots, T. \end{cases}$$

3.3 The Learning Problem

Given a model $\lambda$ and a sequence of observations $Y_0^T = r_0^T$, how should we adjust the model parameters $(A, B, \pi)$ in order to maximize $P_\lambda(Y_0^T = r_0^T)$? We face an optimization problem with constraints. The probability $P_\lambda(Y_0^T = r_0^T)$ is a function of the variables $\pi_i, a_{ij}, b_j(k)$, where $1 \le i, j \le N$ and $1 \le k \le M$. We may also view the probability $P_\lambda(Y_0^T = r_0^T)$ as the likelihood function of $\lambda$,

considered as a function of $\lambda$ for fixed $r_0^T$. Thus, for each $r_0^T$, $P_\lambda(Y_0^T = r_0^T)$ gives the probability of observing $r_0^T$. We use the method of maximum likelihood: we try to find the value of $\lambda$ that is most likely to have produced the sequence $r_0^T$. See Lemma 2.1. The problem we need to solve is
$$\max_\lambda P_\lambda(Y_0^T = r_0^T) = \max_\lambda \sum_{1 \le i_0, i_1, \dots, i_T \le N}\ \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t),$$
subject to
$$\pi_i \ge 0, \quad \sum_{i=1}^{N} \pi_i = 1,$$
$$a_{ij} \ge 0, \quad \sum_{j=1}^{N} a_{ij} = 1 \text{ for every } i,$$
$$b_j(k) \ge 0, \quad \sum_{k=1}^{M} b_j(k) = 1 \text{ for every } j,$$
where $1 \le i, j \le N$, $1 \le k \le M$ and $a_{i_{-1} i_0} = \pi_{i_0}$.

To solve this problem, we need to use some results proved by Leonard E. Baum and George R. Sell; see [3].

1. Let $\mathcal{M}$ denote the manifold with boundary given by $x = (x_{ij})$ with
$$\Big\{x_{ij} : x_{ij} \ge 0 \text{ and } \sum_{j=1}^{q_i} x_{ij} = 1\Big\},$$

where $q_1, \dots, q_k$ is a set of nonnegative integers. Let $P$ be a homogeneous polynomial in the variables $\{x_{ij}\}$ with nonnegative coefficients. Let $\Lambda = \Lambda_P : \mathcal{M} \to \mathcal{M}$ be defined by $y = \Lambda_P(x)$, where
$$y_{ij} = x_{ij}\, \frac{\partial P}{\partial x_{ij}} \left[\sum_{k=1}^{q_i} x_{ik}\, \frac{\partial P}{\partial x_{ik}}\right]^{-1}.$$
Then
$$P(x) \le P\big(t\,\Lambda_P(x) + (1-t)\,x\big), \quad 0 \le t \le 1,\ x \in \mathcal{M}^\circ.$$
Here $\mathcal{M}^\circ = \{x_{ij} : x_{ij} > 0 \text{ and } \sum_{j=1}^{q_i} x_{ij} = 1\}$ denotes the interior of $\mathcal{M}$, and $\partial\mathcal{M} = \{x_{ij} : \text{some } x_{ij} = 0 \text{ and } \sum_{j=1}^{q_i} x_{ij} = 1\}$ its boundary.

2. Let $P$ be a homogeneous polynomial in the variables $(x_{ij})$ with positive coefficients and let $q \in \mathcal{M}^\circ$ be an isolated local maximum of $P$. Then there exists a neighborhood $V$ of $q$ such that $\Lambda(V) \subseteq V$ and, for every $x \in V$, $\Lambda^n(x) \to q$ as $n \to \infty$.

3. (A) The transformation $\Lambda_P$ on $\mathcal{M}^\circ$ can be extended to be continuous on $\mathcal{M} = \mathcal{M}^\circ \cup \partial\mathcal{M}$. (B) $\Lambda_P$ can also be continuously extended to any isolated local maximum $q$ of $P$ on $\partial\mathcal{M}$ by the definition $\Lambda_P(q) = q$. (C) The extended transformation $\Lambda_P$ still obeys the inequality $P(x) \le P(t\,\Lambda_P(x) + (1-t)\,x)$, where $0 \le t \le 1$.

Now, we come back to the learning problem. We define
$$P = P_\lambda(Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \dots, i_T \le N}\ \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t).$$
Notice that $P$ is a homogeneous polynomial of order $2(T+1)$ in the variables $\pi_i, a_{ij}, b_j(k)$, with nonnegative coefficients, and these $\pi_i, a_{ij}, b_j(k)$ satisfy the conditions in the above results. We define a new model $\bar\lambda = (\bar A, \bar B, \bar\pi)$. When $\pi_i \ne 0$, $a_{ij} \ne 0$, $b_j(k) \ne 0$, set
$$\bar\pi_i = \frac{\pi_i\, \partial P/\partial \pi_i}{\sum_{i=1}^{N} \pi_i\, \partial P/\partial \pi_i}, \qquad \bar a_{ij} = \frac{a_{ij}\, \partial P/\partial a_{ij}}{\sum_{j=1}^{N} a_{ij}\, \partial P/\partial a_{ij}}, \qquad \bar b_j(k) = \frac{b_j(k)\, \partial P/\partial b_j(k)}{\sum_{k=1}^{M} b_j(k)\, \partial P/\partial b_j(k)}.$$
When $\pi_i = 0$, or $a_{ij} = 0$, or $b_j(k) = 0$, we define $\bar\pi_i = 0$, $\bar a_{ij} = 0$, $\bar b_j(k) = 0$.

Applying the above result and taking $t = 1$, we obtain
$$P(\pi_i, a_{ij}, b_j(k)) \le P(\bar\pi_i, \bar a_{ij}, \bar b_j(k)).$$
Also, if $\lambda$ belongs to a neighborhood $V$ of $\lambda^*$, where $\lambda^*$ is a local maximum of $P$, we will have either 1) the initial model $\lambda$ defines a critical point of $P$, in which

case $\bar\lambda = \lambda = \lambda^*$; or 2) the model $\bar\lambda$ is better than the model $\lambda$ in the sense that $P(\bar\lambda) > P(\lambda)$, or we may say $\bar\lambda$ is more likely than $\lambda$.

The next step is to express $\bar\pi_i, \bar a_{ij}, \bar b_j(k)$ in terms of $\pi_i, a_{ij}, b_j(k)$. We claim:
$$\pi_i\, \frac{\partial P}{\partial \pi_i} = P_\lambda(Y_0^T = r_0^T, X_0 = s_i),$$
$$a_{ij}\, \frac{\partial P}{\partial a_{ij}} = \sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i, X_{t+1} = s_j),$$
$$b_j(k)\, \frac{\partial P}{\partial b_j(k)} = \sum_{t=0,\, Y_t = r_k}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j).$$

Proof: From the formula
$$P = \sum_{1 \le i_0, i_1, \dots, i_T \le N} \pi_{i_0}\, a_{i_0 i_1} \cdots a_{i_{T-1} i_T}\, b_{i_0}(j_0) \cdots b_{i_T}(j_T),$$
we see that $P$ is a sum over all possible paths $s_{i_0}, s_{i_1}, \dots, s_{i_T}$; each term of the sum corresponds to one possible path. If we take the partial derivative of $P$ with respect to $\pi_i$, all terms vanish except those representing paths starting from state $s_i$. So
$$\frac{\partial P}{\partial \pi_i} = \sum_{1 \le i_1, \dots, i_T \le N} a_{i i_1} \cdots a_{i_{T-1} i_T}\, b_i(j_0) \cdots b_{i_T}(j_T),$$

and
$$\pi_i\, \frac{\partial P}{\partial \pi_i} = \sum_{1 \le i_1, \dots, i_T \le N} \pi_i\, a_{i i_1} \cdots a_{i_{T-1} i_T}\, b_i(j_0) \cdots b_{i_T}(j_T) = P_\lambda(Y_0^T = r_0^T, X_0 = s_i).$$
In the same way we can prove the other two results. When we take the partial derivative of $P$ with respect to $a_{ij}$, only those terms survive which represent paths staying at state $s_i$ at some time $t$ and moving to state $s_j$ at time $t+1$. When we take the partial derivative of $P$ with respect to $b_j(k)$, all terms vanish except those representing paths passing through state $s_j$ at a time $t$ at which the observation is $r_k$.

Then we have
$$\bar\pi_i = \frac{\pi_i\, \partial P/\partial \pi_i}{\sum_{i=1}^{N} \pi_i\, \partial P/\partial \pi_i} = \frac{P_\lambda(Y_0^T = r_0^T, X_0 = s_i)}{P_\lambda(Y_0^T = r_0^T)},$$
$$\bar a_{ij} = \frac{a_{ij}\, \partial P/\partial a_{ij}}{\sum_{j=1}^{N} a_{ij}\, \partial P/\partial a_{ij}} = \frac{\sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i, X_{t+1} = s_j)}{\sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i)},$$
$$\bar b_j(k) = \frac{b_j(k)\, \partial P/\partial b_j(k)}{\sum_{k=1}^{M} b_j(k)\, \partial P/\partial b_j(k)} = \frac{\sum_{t=0,\, Y_t = r_k}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j)}{\sum_{t=0}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j)}.$$

After these theoretical results, we introduce an algorithm to realize them: the Baum-Welch method. In order to describe the procedure, we first define $\xi_t(i_t, i_{t+1})$, the probability of

being in state $s_{i_t}$ at time $t$ and state $s_{i_{t+1}}$ at time $t+1$, given the model $\lambda$ and the observation sequence:
$$\xi_t(i_t, i_{t+1}) = P_\lambda(X_t = s_{i_t}, X_{t+1} = s_{i_{t+1}} \mid Y_0^T = r_0^T).$$
From the definitions of the forward and backward variables, we can write $\xi_t(i_t, i_{t+1})$ in the form
$$\xi_t(i_t, i_{t+1}) = \frac{\alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}{P_\lambda(Y_0^T = r_0^T)} = \frac{\alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}{\sum_{i_t=1}^{N} \sum_{i_{t+1}=1}^{N} \alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}.$$
To solve this problem, we need another variable $\gamma_t(i_t)$, the probability of being in state $s_{i_t}$ at time $t$, given the observation sequence and the model $\lambda$:
$$\gamma_t(i_t) = P_\lambda(X_t = s_{i_t} \mid Y_0^T = r_0^T).$$
Hence we can relate $\gamma_t(i_t)$ to $\xi_t(i_t, i_{t+1})$ by summing over $i_{t+1}$:
$$\gamma_t(i_t) = \sum_{i_{t+1}=1}^{N} \xi_t(i_t, i_{t+1}).$$
Let $i_t = i$, $i_{t+1} = j$, where $1 \le i, j \le N$. Then
$$\gamma_t(i) = P_\lambda(X_t = s_i \mid Y_0^T = r_0^T), \qquad \xi_t(i, j) = P_\lambda(X_t = s_i, X_{t+1} = s_j \mid Y_0^T = r_0^T).$$

If we sum $\gamma_t(i)$ over the time index $t$, we get a quantity which can be interpreted as the expected number of times that state $s_i$ is visited, that is,
$$\sum_{t=0}^{T-1} \gamma_t(i) = \text{expected number of transitions from } s_i,$$
and similarly
$$\sum_{t=0}^{T-1} \xi_t(i, j) = \text{expected number of transitions from } s_i \text{ to } s_j.$$
Using these formulas, we can give a method for reestimating the parameters of an HMM:
$$\bar\pi_i = \text{expected frequency (number of times) in state } s_i \text{ at time } 0 = \gamma_0(i),$$
$$\bar a_{ij} = \frac{\text{expected number of transitions from } s_i \text{ to } s_j}{\text{expected number of transitions from } s_i} = \frac{\sum_{t=0}^{T-1} \xi_t(i, j)}{\sum_{t=0}^{T-1} \gamma_t(i)},$$
$$\bar b_j(k) = \frac{\text{expected number of times in state } s_j \text{ with observation } r_k}{\text{expected number of times in state } s_j} = \frac{\sum_{t=0,\, Y_t = r_k}^{T} \gamma_t(j)}{\sum_{t=0}^{T} \gamma_t(j)}.$$
Therefore, we obtain a new model $\bar\lambda = (\bar A, \bar B, \bar\pi)$. Based on the above procedure, if we iteratively use $\bar\lambda$ in place of $\lambda$ and repeat the calculation, we can improve the probability of $Y_0^T = r_0^T$ until some limiting point is reached.
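The reestimation cycle can be sketched as follows. This is a minimal single-sequence implementation without the numerical scaling needed for long sequences; the model and observation sequence are hypothetical, not the thesis' example:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch reestimation for a single observation sequence.
    No scaling is used, so this sketch suits only short sequences."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # forward variables
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    beta[T-1] = 1.0                                  # backward variables
    for t in range(T-2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    evidence = alpha[-1].sum()                       # P_lambda(Y_0^T = r_0^T)
    # xi_t(i, j) and gamma_t(i)
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t+1]] * beta[t+1]
                   for t in range(T-1)]) / evidence
    gamma = alpha * beta / evidence
    # reestimation formulas
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.array([gamma[np.array(obs) == k].sum(axis=0)
                      for k in range(B.shape[1])]).T / gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, evidence

A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
obs = [0, 0, 1, 1, 0, 1]
history = []
for _ in range(10):
    A, B, pi, p = baum_welch_step(A, B, pi, obs)
    history.append(p)
print(history[0], history[-1])  # likelihood of obs never decreases
```

Each iteration records the likelihood of the observation sequence under the model it starts from, so the sequence of recorded values is non-decreasing, exactly as the inequality for $\Lambda_P$ guarantees.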

3.4 An Example

In this section, we introduce an example to show how to solve these three problems by hand or by computer. A school teacher gave three different types of daily homework assignments: (1) one taking about 5 minutes to complete, (2) one taking about 1 hour to complete, and (3) one taking about 3 hours to complete. The teacher did not openly reveal his mood to his students daily, but we know that the teacher was in either a (1) good, (2) neutral, or (3) bad mood for each whole day. Meanwhile, we knew the relationship between his mood and the homework he would assign.

We use a Hidden Markov Model $\lambda$ to analyze this problem. This model includes three hidden states, {1 (good mood), 2 (neutral mood), 3 (bad mood)}, and three observable symbols, {1 (5-minute assignment), 2 (1-hour assignment), 3 (3-hour assignment)}. We consider the problem over one week, from Monday to Friday, using $t = 1, 2, 3, 4, 5$. Hence the Markov chain in this model is $(X_t : t = 1, 2, 3, 4, 5)$, and the observable process is $(Y_t : t = 1, 2, 3, 4, 5)$. The transition probability matrix is $A = [a_{ij}]$ and the initial state distribution is $\pi = (\pi_i)$, where
$$a_{ij} = P_\lambda(X_{t+1} = j \mid X_t = i), \qquad \pi_i = P_\lambda(X_1 = i),$$

1 ≤ i, j ≤ 3, t = 1, 2, 3, 4. The relationship between the teacher's mood and the assignments is given by a matrix B = [b_j(k)], where

b_j(k) = P_λ(Y_t = k | X_t = j),  1 ≤ j ≤ 3, 1 ≤ k ≤ 3, t = 1, 2, 3, 4, 5.

In this particular case, the observed sequence is Y_1^5 = (1, 3, 2, 1, 3), and the HMM is λ = (A, B, π), where A = [a_ij] and B = [b_j(k)] are given 3 × 3 matrices of numerical values and π = (π_1, π_2, π_3). To get the initial distribution π, we use some knowledge of Markov chains, namely the stationary probability distribution.
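Numerically, such a stationary distribution can be computed either by solving the linear system π = πA together with Σ_i π_i = 1, or by power iteration (repeatedly applying π ← πA). A Python sketch of power iteration, on an arbitrary illustrative stochastic matrix rather than the one from this example:

```python
def stationary(A, tol=1e-12, max_iter=100000):
    """Power iteration: repeatedly apply pi <- pi * A until convergence."""
    n = len(A)
    pi = [1.0 / n] * n                 # start from the uniform distribution
    for _ in range(max_iter):
        new = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(new[j] - pi[j]) for j in range(n)) < tol:
            return new
        pi = new
    return pi

# an arbitrary illustrative 3x3 stochastic matrix (not the one from the example)
A = [[0.2, 0.3, 0.5],
     [0.1, 0.6, 0.3],
     [0.4, 0.1, 0.5]]
pi = stationary(A)   # satisfies pi_j = sum_i pi_i a_ij and sum_j pi_j = 1
```

Power iteration converges for irreducible aperiodic chains such as this one; solving the linear system directly works in all finite cases.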

Definition: Consider a Markov chain with transition probability matrix P = (P_ij)_{i,j ≥ 0}, where Σ_j P_ij = 1 for each i, and an initial distribution π = (π_i)_{i ≥ 0} with Σ_i π_i = 1. If

π_j = Σ_{i ≥ 0} π_i P_ij

for every j, then we call π = (π_i)_{i ≥ 0} a stationary probability distribution of the Markov chain.

For the transition matrix used in this example, finding the stationary probability distribution amounts to solving the following linear system:

π_1 = π_1 a_11 + π_2 a_21 + π_3 a_31
π_2 = π_1 a_12 + π_2 a_22 + π_3 a_32
π_3 = π_1 a_13 + π_2 a_23 + π_3 a_33
1 = π_1 + π_2 + π_3

This linear system tells us that the stationary probability distribution is π = (0.05, 0.2, 0.75).

The Evaluation Problem: The first question we want to ask is: what is the probability that this teacher would assign this order of homework assignments? We use the forward procedure to solve

this problem.

(1) Initialization:

α_1(1) = π_1 b_1(1),
α_1(2) = π_2 b_2(1) = 0.06,
α_1(3) = π_3 b_3(1) = 0.

(2) Induction (t = 1, 2, 3, 4):

α_2(1) = (α_1(1) a_11 + α_1(2) a_21 + α_1(3) a_31) b_1(3) = (α_1(1) a_11 + α_1(2) a_21 + α_1(3) a_31) × 0.1,
α_2(2) = (α_1(1) a_12 + α_1(2) a_22 + α_1(3) a_32) b_2(3),
α_2(3) = (α_1(1) a_13 + α_1(2) a_23 + α_1(3) a_33) b_3(3).

In the same way, we can calculate the values of all forward variables α_t(i), 1 ≤ t ≤ 5, 1 ≤ i ≤ 3. We use Table 1 to represent our result.
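The whole forward procedure (initialization, induction, termination) can be sketched as follows. This Python illustration uses π = (0.05, 0.2, 0.75) from the example but stand-in matrices A and B (the thesis's numerical entries are given in the text); indices are 0-based, so the observation sequence (1, 3, 2, 1, 3) becomes [0, 2, 1, 0, 2]:

```python
def forward(A, B, pi, obs):
    """Forward variables: alpha[t][j] = P(first t+1 observations, state j at time t+1)."""
    N = len(A)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]            # (1) initialization
    for t in range(1, len(obs)):                                  # (2) induction
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

# stand-in parameters, not the example's numerical matrices
A = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.6, 0.3, 0.1],
     [0.3, 0.4, 0.3],
     [0.0, 0.3, 0.7]]
pi = [0.05, 0.2, 0.75]
obs = [0, 2, 1, 0, 2]          # the sequence (1, 3, 2, 1, 3), 0-based

alpha = forward(A, B, pi, obs)
prob = sum(alpha[-1])          # (3) termination: P(Y_1^5)
```

The recursion computes in O(N^2 T) operations the same probability as summing over all N^T hidden state paths.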

Table 1. The Forward Variables α_t(i)
(numerical values of α_t(i), 1 ≤ t ≤ 5, 1 ≤ i ≤ 3)

(3) Termination:

P_λ(Y_1^5 = (1, 3, 2, 1, 3)) = α_5(1) + α_5(2) + α_5(3).

The Decoding Problem: The second question we want to ask is: what did the teacher's mood curve most likely look like that week? We use the Viterbi method.

(1) Initialization:

δ_1(1) = π_1 b_1(1),

δ_1(2) = π_2 b_2(1) = 0.06,
δ_1(3) = π_3 b_3(1) = 0.

(2) Induction:

δ_2(1) = max_{1 ≤ i ≤ 3} [δ_1(i) a_i1] b_1(3) = max[0.007, 0.012, 0] × 0.1 = 0.0012,
ψ_2(1) = arg max_{1 ≤ i ≤ 3} [δ_1(i) a_i1] = 2,
δ_2(2) = max_{1 ≤ i ≤ 3} [δ_1(i) a_i2] b_2(3) = max[0.0105, 0.012, 0] × 0.3 = 0.0036,
ψ_2(2) = arg max_{1 ≤ i ≤ 3} [δ_1(i) a_i2] = 2.

Using the same method, we obtain the values of all variables δ_t(i) and ψ_t(i), where 1 ≤ t ≤ 5, 1 ≤ i ≤ 3. See Table 2.
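The full Viterbi recursion, including the backtracking through ψ, can be sketched as follows; as before, the matrices are illustrative stand-ins and indices are 0-based:

```python
def viterbi(A, B, pi, obs):
    """Most likely hidden state path, via delta (max-probabilities) and psi (backpointers)."""
    N, T = len(A), len(obs)
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]            # (1) initialization
    psi = [[0] * N]                                               # psi[0] is unused
    for t in range(1, T):                                         # (2) induction
        row, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            back.append(best_i)
            row.append(delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
        delta.append(row)
        psi.append(back)
    # (3) termination and backtracking: i_t = psi_{t+1}(i_{t+1})
    path = [max(range(N), key=lambda i: delta[T - 1][i])]
    for t in range(T - 1, 0, -1):
        path.insert(0, psi[t][path[0]])
    return path

A = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.6, 0.3, 0.1],
     [0.3, 0.4, 0.3],
     [0.0, 0.3, 0.7]]
pi = [0.05, 0.2, 0.75]
obs = [0, 2, 1, 0, 2]
path = viterbi(A, B, pi, obs)   # most likely hidden state sequence, 0-based
```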

Table 2. The Variables δ_t(i) and ψ_t(i)
(numerical values of δ_t(i) and ψ_t(i), 1 ≤ t ≤ 5, 1 ≤ i ≤ 3)

(3) Termination:

δ* = max_{1 ≤ i ≤ 3} [δ_5(i)],  i_5* = arg max_{1 ≤ i ≤ 3} [δ_5(i)] = 3.

Finally, the hidden state sequence is recovered by backtracking,

i_t* = ψ_{t+1}(i_{t+1}*),  t = 4, 3, 2, 1.

Hence, this teacher's most likely mood curve is (2, 3, 2, 1, 3).

The Learning Problem: The third problem is how to adjust the model parameters (A, B, π) to maximize P_λ(Y_1^5 = (1, 3, 2, 1, 3)). We use the Baum-Welch method.

To use this method, we need the values of the backward variables β_t(i).

(1) Initialization:

β_5(1) = 1,  β_5(2) = 1,  β_5(3) = 1.

(2) Induction (t = 4, 3, 2, 1): since y_5 = 3,

β_4(1) = Σ_{i=1}^{3} a_1i b_i(3) β_5(i) = a_11 b_1(3) + a_12 b_2(3) + a_13 b_3(3) = 0.56,

β_4(2) = Σ_{i=1}^{3} a_2i b_i(3) β_5(i) = a_21 b_1(3) + a_22 b_2(3) + a_23 b_3(3) = 0.62,

Table 3. The Backward Variables β_t(i)
(numerical values of β_t(i), 1 ≤ t ≤ 5, 1 ≤ i ≤ 3)

β_4(3) = Σ_{i=1}^{3} a_3i b_i(3) β_5(i) = a_31 b_1(3) + a_32 b_2(3) + a_33 b_3(3).

Using the same method, we obtain the values of all variables β_t(i), where 1 ≤ t ≤ 5, 1 ≤ i ≤ 3. See Table 3. We can use this table and follow the backward procedure to get P_λ(Y_1^5 = (1, 3, 2, 1, 3)):

P_λ(Y_1^5 = (1, 3, 2, 1, 3)) = Σ_{i=1}^{3} β_1(i) b_i(1) π_i

= β_1(1) b_1(1) π_1 + β_1(2) b_2(1) π_2 + β_1(3) b_3(1) π_3.

We can compare this result with the one obtained before by the forward procedure. The next step is to get the values of ξ_t(i, j) and γ_t(i), where 1 ≤ t ≤ 4, 1 ≤ i, j ≤ 3. See Table 4. After we have these tables, it is easy to get a new model λ̄ = (Ā, B̄, π̄) from the re-estimation formulas. The new model λ̄ is better than λ; we may check this by using λ̄ to solve the evaluation problem again and comparing P_λ̄(Y_1^5 = (1, 3, 2, 1, 3)) with the value obtained under λ.
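The backward procedure, together with the consistency check against the forward result, can be sketched as follows (illustrative matrices, 0-based indices):

```python
def backward(A, B, obs):
    """Backward variables beta[t][i], initialized to 1 at the final time."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N]                                   # (1) initialization at t = T
    for t in range(T - 2, -1, -1):                       # (2) induction, t = T-1, ..., 1
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    return beta

A = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.0, 0.3, 0.7]]
pi = [0.05, 0.2, 0.75]
obs = [0, 2, 1, 0, 2]

beta = backward(A, B, obs)
# P(Y) from the backward side: sum_i beta_1(i) b_i(y_1) pi_i
prob = sum(beta[0][i] * B[i][obs[0]] * pi[i] for i in range(3))
```

This value must agree with the forward termination sum, which is the comparison made in the text.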

Table 4. The Variables ξ_t(i, j) and γ_t(i)
(a) ξ_t(i, j): numerical values for t = 1, 2, 3, 4, 1 ≤ i, j ≤ 3
(b) γ_t(i): numerical values for t = 1, 2, 3, 4, 1 ≤ i ≤ 3

4 HMMs Analysis by Matlab

4.1 How to generate A, B and π

Firstly, let us discuss how to get an initial distribution π. Here, we only consider the case of finitely many states. Suppose we have N hidden states; then the initial distribution is π = (π_1, π_2, ..., π_N). We can think of these π's as the N parameters of a multinomial distribution. When we estimate the parameters of the binomial distribution using Bayesian models, we usually use the two-parameter beta family as the prior. This class of distributions has the remarkable property that the resulting posterior distributions are again beta distributions; we call this property conjugacy. So it is very natural to ask what a conjugate prior for the multinomial is. The answer is the Dirichlet distribution. In some sense, the Dirichlet distribution is an extension of the beta distribution to higher-dimensional space.

Definition: The Dirichlet distribution D(α), where α = (α_1, ..., α_r), α_j > 0, 1 ≤ j ≤ r, is supported by the (r-1)-dimensional simplex

Δ_{r-1} = {(u_1, ..., u_{r-1}) : Σ_{i=1}^{r-1} u_i ≤ 1, 0 ≤ u_i ≤ 1, i = 1, ..., r-1},

and the probability density is given by

f(u_1, ..., u_{r-1}) = [Γ(Σ_{j=1}^{r} α_j) / Π_{j=1}^{r} Γ(α_j)] Π_{j=1}^{r-1} u_j^{α_j - 1} (1 - Σ_{j=1}^{r-1} u_j)^{α_r - 1}  if (u_1, ..., u_{r-1}) ∈ Δ_{r-1},

and f(u_1, ..., u_{r-1}) = 0 if (u_1, ..., u_{r-1}) ∈ R^{r-1} \ Δ_{r-1}.

Lemma 4.1. Let N = (N_1, ..., N_r) have the multinomial distribution M(n, θ), where θ = (θ_1, ..., θ_r), 0 < θ_j < 1, Σ_{j=1}^{r} θ_j = 1. If the prior distribution π(θ) for θ is D(α), then the posterior distribution π(θ | N = n) is D(α + n), where α = (α_1, ..., α_r) and n = (n_1, ..., n_r).

Proof: For the vector θ = (θ_1, ..., θ_r), we can rewrite it as θ = (θ_1, ..., θ_{r-1}) and define θ_r = 1 - Σ_{j=1}^{r-1} θ_j. Then we can write the prior out:

π(θ) = [Γ(Σ_{j=1}^{r} α_j) / Π_{j=1}^{r} Γ(α_j)] Π_{j=1}^{r-1} θ_j^{α_j - 1} (1 - Σ_{j=1}^{r-1} θ_j)^{α_r - 1}.

By Bayes' rule, and defining Δ = {(t_1, ..., t_{r-1}) : Σ_{i=1}^{r-1} t_i ≤ 1, 0 ≤ t_i ≤ 1, i = 1, ..., r-1}, we have

π(θ | N = n) = π(θ) P(N = n | θ) / ∫_Δ π(t) P(N = n | t) d(t_1, ..., t_{r-1}) = C Π_{j=1}^{r-1} θ_j^{α_j + n_j - 1} (1 - Σ_{j=1}^{r-1} θ_j)^{α_r + n_r - 1}.

The proportionality constant C, which depends on α and n only, must be Γ(Σ_{j=1}^{r} (α_j + n_j)) / Π_{j=1}^{r} Γ(α_j + n_j), and the posterior distribution of θ given n is D(α + n).
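Lemma 4.1 reduces Bayesian updating to bookkeeping: the posterior parameters are simply α + n. A tiny numerical illustration (the function name and numbers are ours, not from the thesis):

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts (Lemma 4.1)."""
    return [a + n for a, n in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]          # uniform D(1, 1, 1) prior over the 2-simplex
counts = [5, 2, 3]               # observed category counts
post = dirichlet_posterior(alpha, counts)

# the posterior mean of theta_j is (alpha_j + n_j) / sum_k (alpha_k + n_k)
mean = [a / sum(post) for a in post]
```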

Lemma 4.2. (See [9].) Let d be an integer greater than 1. Let X_1, ..., X_d be independent random variables, where X_i has the gamma distribution with parameters (α, γ_i). Then the random vector

(Y_1, ..., Y_d) = (1/(X_1 + ... + X_d)) (X_1, ..., X_d)

has the Dirichlet distribution.

Proof: For the random vector (Y_1, ..., Y_d), we have Y_d = 1 - Σ_{j=1}^{d-1} Y_j. Therefore, the distribution of (Y_1, ..., Y_d) is actually supported by the (d-1)-dimensional simplex

Δ_0 = {(y_1, ..., y_{d-1}) : Σ_{i=1}^{d-1} y_i ≤ 1, 0 ≤ y_i ≤ 1, i = 1, ..., d-1}.

For a Borel subset A of Δ = {(y_1, ..., y_d) : Σ_{i=1}^{d} y_i = 1, 0 ≤ y_i ≤ 1, i = 1, ..., d}, let B = {(x_1, ..., x_d) : (1/(x_1 + ... + x_d))(x_1, ..., x_d) ∈ A}. Because X_1, ..., X_d are independent, we can write out the probability distribution of the random vector (X_1, ..., X_d):

P({ω : (1/(X_1(ω) + ... + X_d(ω)))(X_1(ω), ..., X_d(ω)) ∈ A}) = ∫_B Π_{i=1}^{d} [α^{γ_i} x_i^{γ_i - 1} e^{-α x_i} / Γ(γ_i)] d(x_1, ..., x_d),

where d(x_1, ..., x_d) indicates that the integration is with respect to Lebesgue measure in R^d. We make a change of variables:

s = x_1 + ... + x_d,  y_i = x_i / s,  1 ≤ i ≤ d-1.

The Jacobian of (x_1, ..., x_d) with respect to (y_1, ..., y_{d-1}, s) is s^{d-1}. After some calculations, the above integral equals

∫_C [Γ(Σ_{j=1}^{d} γ_j) / Π_{j=1}^{d} Γ(γ_j)] Π_{j=1}^{d-1} y_j^{γ_j - 1} (1 - Σ_{j=1}^{d-1} y_j)^{γ_d - 1} d(y_1, ..., y_{d-1}),

where C = {(y_1, ..., y_{d-1}) : (y_1, ..., y_{d-1}, 1 - y_1 - ... - y_{d-1}) ∈ A}. Thus we have a density for (1/(X_1 + ... + X_d))(X_1, ..., X_d) with respect to Lebesgue measure on Δ_0.

Lemma 4.3. Let X be a real-valued random variable with continuous distribution function F. Then F(X) is uniformly distributed on [0, 1]. Conversely, let Y be a real-valued random variable uniformly distributed on [0, 1]; then F^{-1}(Y) has the distribution F.

Proof: Let us prove the second part of this lemma. Define X = F^{-1}(Y). Then

P(X ≤ x) = P(F^{-1}(Y) ≤ x) = P(Y ≤ F(x)).

Since Y is uniformly distributed on [0, 1], this probability equals F(x), which means X has the distribution F.

We use Lemma 4.2 and take α = 1, γ_i = 1, 1 ≤ i ≤ d. Therefore X_1, ..., X_d have the exponential distribution with parameter 1. To generate the exponential distribution randomly, we apply Lemma 4.3: generate a random variable Y first, which has the uniform distribution on [0, 1], then apply the function F^{-1}(x) = -ln(1 - x) to Y, where F is the distribution function of the exponential distribution. In this way, we may generate the distribution for π, and for each row of A and B.

4.2 How to test the Baum-Welch method

Idea: Firstly, we generate a model λ, which is regarded as the true value. Then we use λ to generate some sample sequences Y_1, ..., Y_n. To use the Baum-Welch method, we need a seed λ_0, which can be thought of as an initial value. For this seed λ_0 and each sequence Y_i, we run the program to get a new model λnew_i = (Anew_i, Bnew_i, πnew_i), where 1 ≤ i ≤ n. Finally, we get an estimate of λ, namely λnew = (Anew, Bnew, πnew), where

Anew = (Σ_{i=1}^{n} Anew_i)/n,  Bnew = (Σ_{i=1}^{n} Bnew_i)/n,  πnew = (Σ_{i=1}^{n} πnew_i)/n.

Then we can compare this estimate λnew with the true value λ to see how the Baum-Welch method works. We also define a distance between two matrices A = (a_ij) and B = (b_ij) by d(A, B), where

56 N M d(a, B) = (a ij b ij ) 2, i=1 j=1 1 i N, 1 j M. For example, we take N = M = 3, T = 20, and n = 10, 20, 30 respectively. Step1: λ and λ 0. In this step we generate a true value λ and a seed λ 0. λ = ( π, A, B ), π = ( ), A = , B = , 47

The seed λ_0 = (π_0, A_0, B_0) is generated in the same way.

Step 2: get Y_i, taking n = 30, 1 ≤ i ≤ n. See Figure 4.1.

Step 3: get the new model λnew. Step 3 is divided into two parts. In the first part, we use the seed λ_0 to run the program; in the second part, we take the seed equal to λ.
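Steps 1 and 2 can be sketched as follows. The rows of A and B and the vector π are generated by normalizing independent Exp(1) draws (Lemmas 4.2 and 4.3 with α = γ_i = 1, i.e. uniform Dirichlet samples), observation sequences are then drawn from the true model, and d(·,·) is the distance defined above. All helper names are ours, and the Baum-Welch runs of Step 3 are omitted:

```python
import math
import random

def random_row(n):
    """A random probability vector: normalized Exp(1) draws, via F^{-1}(u) = -ln(1 - u)."""
    x = [-math.log(1.0 - random.random()) for _ in range(n)]
    s = sum(x)
    return [v / s for v in x]

def sample_sequence(A, B, pi, T):
    """Draw an observation sequence of length T from the HMM (A, B, pi)."""
    def draw(p):                      # sample an index from a probability vector
        u, acc = random.random(), 0.0
        for k, pk in enumerate(p):
            acc += pk
            if u < acc:
                return k
        return len(p) - 1
    state = draw(pi)
    obs = [draw(B[state])]
    for _ in range(T - 1):
        state = draw(A[state])
        obs.append(draw(B[state]))
    return obs

def distance(A, B):
    """d(A, B): sum of squared entrywise differences."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

random.seed(0)
N, M, T, n = 3, 3, 20, 30
pi = random_row(N)
A = [random_row(N) for _ in range(N)]       # true transition matrix
B = [random_row(M) for _ in range(N)]       # true emission matrix
Y = [sample_sequence(A, B, pi, T) for _ in range(n)]
```

With Y in hand, Step 3 would run Baum-Welch from each seed on each sequence, average the resulting models, and report d(π, πnew), d(A, Anew), d(B, Bnew).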

Figure 4.1. The generated observable sequences Y_i, 1 ≤ i ≤ 30.

The reason that we want to do this is to show that the Baum-Welch method only works when the initial model λ_0 is in a neighborhood V of λ_q, where λ_q is a local maximum of the function P = P_λ(Y_0^T = r_0^T). If we use the seed λ_0, we will see that we cannot get a good estimate of λ; the reason is that λ_0 is not in a small enough neighborhood of λ. In the second part, if we use λ as the seed, we will get a good result. We think λ must be a local maximum, and it is in any neighborhood of itself, so this seed satisfies the condition in the Baum-Welch method.

Part I: use λ_0 = (A_0, B_0, π_0) as the seed. We use the samples {Y_1, ..., Y_10}, {Y_1, ..., Y_20}, {Y_1, ..., Y_30} respectively to get three new models λnew_1-10, λnew_1-20, λnew_1-30 (see Figure 4.2), and then calculate the distances between them and λ. From Table 5, we can see that the distances between the new models and λ remain large even if we increase the sample size. This result means we cannot get a good estimate from the seed λ_0.

Table 5. The Distances, Seed λ_0
(values of d(π, πnew), d(A, Anew), d(B, Bnew) for λnew_1-10, λnew_1-20, λnew_1-30)

Figure 4.2. Three new models corresponding to the samples {Y_1, ..., Y_10}, {Y_1, ..., Y_20}, {Y_1, ..., Y_30}, with seed λ_0.

Table 6. The Distances, Seed λ
(values of d(π, πnew), d(A, Anew), d(B, Bnew) for λnew_1-10, λnew_1-20, λnew_1-30)

Part II: use λ = (A, B, π) as the seed. In the same way, we use the samples {Y_1, ..., Y_10}, {Y_1, ..., Y_20}, {Y_1, ..., Y_30} respectively to get three new models λnew_1-10, λnew_1-20, and λnew_1-30. The results are listed in Figure 4.3, and the distances between them and λ are listed in Table 6. This table indicates that we get a good estimate, which is very close to λ.

Discussion: When we regard P_λ(Y_0^T = r_0^T) as a function of λ, it may have more than one local maximum, for example P_1 and P_2 corresponding to λ_1 and λ_2. By the Baum-Welch method, we may get a good estimate of one of them, but only if the initial model we use is in a neighborhood of one of them, and the closer the better. The true value λ we used here may not itself be a local maximum point of the function P_λ(Y_0^T = r_0^T). Also, to get each new model λnew_i, 1 ≤ i ≤ 30, we ran the program 10 times; maybe this is not enough, because we do not know the speed of convergence.

Figure 4.3. Three new models corresponding to the samples {Y_1, ..., Y_10}, {Y_1, ..., Y_20}, {Y_1, ..., Y_30}, with seed λ.


A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Lecture Notes Speech Communication 2, SS 2004 Erhard Rank/Franz Pernkopf Signal Processing and Speech Communication Laboratory Graz University of Technology Inffeldgasse 16c, A-8010

More information

Outline of Today s Lecture

Outline of Today s Lecture University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005 Jeff A. Bilmes Lecture 12 Slides Feb 23 rd, 2005 Outline of Today s

More information

Advanced Algorithms. Lecture Notes for April 5, 2016 Dynamic programming, continued (HMMs); Iterative improvement Bernard Moret

Advanced Algorithms. Lecture Notes for April 5, 2016 Dynamic programming, continued (HMMs); Iterative improvement Bernard Moret Advanced Algorithms Lecture Notes for April 5, 2016 Dynamic programming, continued (HMMs); Iterative improvement Bernard Moret Dynamic Programming (continued) Finite Markov models A finite Markov model

More information

A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS

A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS Michael Ryan Hoff A Dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Note Set 5: Hidden Markov Models

Note Set 5: Hidden Markov Models Note Set 5: Hidden Markov Models Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016 1 Hidden Markov Models (HMMs) 1.1 Introduction Consider observed data vectors x t that are d-dimensional

More information

Master 2 Informatique Probabilistic Learning and Data Analysis

Master 2 Informatique Probabilistic Learning and Data Analysis Master 2 Informatique Probabilistic Learning and Data Analysis Faicel Chamroukhi Maître de Conférences USTV, LSIS UMR CNRS 7296 email: chamroukhi@univ-tln.fr web: chamroukhi.univ-tln.fr 2013/2014 Faicel

More information

Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data

Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data 0. Notations Myungjun Choi, Yonghyun Ro, Han Lee N = number of states in the model T = length of observation sequence

More information

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009 CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov

More information

MCMC notes by Mark Holder

MCMC notes by Mark Holder MCMC notes by Mark Holder Bayesian inference Ultimately, we want to make probability statements about true values of parameters, given our data. For example P(α 0 < α 1 X). According to Bayes theorem:

More information

Hidden Markov Models Hamid R. Rabiee

Hidden Markov Models Hamid R. Rabiee Hidden Markov Models Hamid R. Rabiee 1 Hidden Markov Models (HMMs) In the previous slides, we have seen that in many cases the underlying behavior of nature could be modeled as a Markov process. However

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Hidden Markov Models. Terminology and Basic Algorithms

Hidden Markov Models. Terminology and Basic Algorithms Hidden Markov Models Terminology and Basic Algorithms The next two weeks Hidden Markov models (HMMs): Wed 9/11: Terminology and basic algorithms Mon 14/11: Implementing the basic algorithms Wed 16/11:

More information

MAS114: Solutions to Exercises

MAS114: Solutions to Exercises MAS114: s to Exercises Up to week 8 Note that the challenge problems are intended to be difficult! Doing any of them is an achievement. Please hand them in on a separate piece of paper if you attempt them.

More information

Hidden Markov Models. Terminology, Representation and Basic Problems

Hidden Markov Models. Terminology, Representation and Basic Problems Hidden Markov Models Terminology, Representation and Basic Problems Data analysis? Machine learning? In bioinformatics, we analyze a lot of (sequential) data (biological sequences) to learn unknown parameters

More information

11.3 Decoding Algorithm

11.3 Decoding Algorithm 11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

( ).666 Information Extraction from Speech and Text

( ).666 Information Extraction from Speech and Text (520 600).666 Information Extraction from Speech and Text HMM Parameters Estimation for Gaussian Output Densities April 27, 205. Generalization of the Results of Section 9.4. It is suggested in Section

More information

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models 6 Dependent data The AR(p) model The MA(q) model Hidden Markov models Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories

More information

CS 7180: Behavioral Modeling and Decision- making in AI

CS 7180: Behavioral Modeling and Decision- making in AI CS 7180: Behavioral Modeling and Decision- making in AI Learning Probabilistic Graphical Models Prof. Amy Sliva October 31, 2012 Hidden Markov model Stochastic system represented by three matrices N =

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

Hidden Markov Model and Speech Recognition

Hidden Markov Model and Speech Recognition 1 Dec,2006 Outline Introduction 1 Introduction 2 3 4 5 Introduction What is Speech Recognition? Understanding what is being said Mapping speech data to textual information Speech Recognition is indeed

More information

Doubly Indexed Infinite Series

Doubly Indexed Infinite Series The Islamic University of Gaza Deanery of Higher studies Faculty of Science Department of Mathematics Doubly Indexed Infinite Series Presented By Ahed Khaleel Abu ALees Supervisor Professor Eissa D. Habil

More information

Hidden Markov Models. Three classic HMM problems

Hidden Markov Models. Three classic HMM problems An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems

More information

Markov Chains and Hidden Markov Models

Markov Chains and Hidden Markov Models Chapter 1 Markov Chains and Hidden Markov Models In this chapter, we will introduce the concept of Markov chains, and show how Markov chains can be used to model signals using structures such as hidden

More information

Computational Biology Lecture #3: Probability and Statistics. Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept

Computational Biology Lecture #3: Probability and Statistics. Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept Computational Biology Lecture #3: Probability and Statistics Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept 26 2005 L2-1 Basic Probabilities L2-2 1 Random Variables L2-3 Examples

More information

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16 VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 16 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Based on slides by

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454 Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) standard sequence model in genomics, speech, NLP, What

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

MATH3200, Lecture 31: Applications of Eigenvectors. Markov Chains and Chemical Reaction Systems

MATH3200, Lecture 31: Applications of Eigenvectors. Markov Chains and Chemical Reaction Systems Lecture 31: Some Applications of Eigenvectors: Markov Chains and Chemical Reaction Systems Winfried Just Department of Mathematics, Ohio University April 9 11, 2018 Review: Eigenvectors and left eigenvectors

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017 1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Introduction to Artificial Intelligence (AI)

Introduction to Artificial Intelligence (AI) Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 10 Oct, 13, 2011 CPSC 502, Lecture 10 Slide 1 Today Oct 13 Inference in HMMs More on Robot Localization CPSC 502, Lecture

More information

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Lecture - 21 HMM, Forward and Backward Algorithms, Baum Welch

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information