A Study of Hidden Markov Model
1 University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Masters Theses Graduate School A Study of Hidden Markov Model Yang Liu University of Tennessee - Knoxville Recommended Citation Liu, Yang, "A Study of Hidden Markov Model. " Master's Thesis, University of Tennessee, This Thesis is brought to you for free and open access by the Graduate School at Trace: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Masters Theses by an authorized administrator of Trace: Tennessee Research and Creative Exchange. For more information, please contact trace@utk.edu.
2 To the Graduate Council: I am submitting herewith a thesis written by Yang Liu entitled "A Study of Hidden Markov Model." I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Mathematics. We have read this thesis and recommend its acceptance: Xia Chen, Balram Rajput (Original signatures are on file with official student records.) Jan Rosinski, Major Professor Accepted for the Council: Dixie L. Thompson Vice Provost and Dean of the Graduate School
4 A STUDY OF HIDDEN MARKOV MODEL A Thesis Presented for the Master of Science Degree The University of Tennessee, Knoxville Yang Liu August 2004
DEDICATION This thesis is dedicated to my parents, Jiaquan Liu and Xiaobo Yang, for consistently supporting me, inspiring me, and encouraging me.
ACKNOWLEDGMENTS I would like to thank my parents for their consistent support and encouragement. I especially express my deepest appreciation to my advisor, Dr. Jan Rosinski, for his patient, friendly, and unfailing support over the past two years. I also want to thank the other committee members, Dr. Balram Rajput and Dr. Xia Chen, for their comments and suggestions. I would also like to thank Xiaohu Tang for his help.
ABSTRACT The purpose of this thesis is fourfold: to introduce the definition of the Hidden Markov Model; to present three problems about HMMs that must be solved for the model to be useful in real-world applications; to cover the solutions of these three basic problems, explaining some algorithms in detail together with their mathematical proofs; and finally, to give some examples showing how these algorithms work.
TABLE OF CONTENTS
1. Introduction
2. What is a Hidden Markov Model
3. Three Basic Problems of HMMs and Their Solutions
   3.1 The Evaluation Problem
       3.1.1 The Forward Procedure
       3.1.2 The Backward Procedure
       3.1.3 Summary of the Evaluation Problem
   3.2 The Decoding Problem
       3.2.1 Viterbi Algorithm
       3.2.2 The δ-θ Algorithm
   3.3 The Learning Problem
   3.4 An Example
4. HMMs Analysis by Matlab
   4.1 How to Generate A, B, π
   4.2 How to Test the Baum-Welch Method
References
Appendices
   Appendix A: The New Models Generated in Section 4.2, Step 3
   Appendix B: Matlab Program
Vita
LIST OF TABLES
Table 1. The Forward Variables α_t(i)
Table 2. The Variables δ_t(i) and ψ_t(i)
Table 3. The Backward Variables β_t(i)
Table 4. The Variables ξ_t(i, j) and γ_t(i)
Table 5. The Distances, Seed λ_0
Table 6. The Distances, Seed λ
1 INTRODUCTION
Many of the most powerful sequence analysis methods are now based on principles of probabilistic modeling, such as Hidden Markov Models. Hidden Markov Models (HMMs) are very powerful and flexible. They originally emerged in the domain of speech recognition, where they have become the most successful class of models. In recent years they have been widely used as tools in many other fields, including molecular biology, biochemistry and genetics, as well as computer vision.
An HMM is a probabilistic model composed of a number of interconnected states, each of which emits an observable output. Each state has two kinds of parameters. First, symbol emission probabilities describe the probabilities of the possible outputs from the state, and second, state transition probabilities specify the probability of moving to a new state from the current one. An observed sequence of symbols is generated by starting at some initial state and moving probabilistically from state to state until some terminal state is reached, emitting an observable symbol from each state that is passed through. The sequence of states is a first-order Markov chain. This state sequence is hidden; only the sequence of symbols that it emits is observable, hence the term hidden Markov model.
This paper is organized in the following manner. The definition as well as some
necessary assumptions about HMMs are explained first. The three basic problems of HMMs are then discussed, and their solutions are given. For HMMs to be useful in real-world applications, we must know how to solve these problems: the evaluation problem, the decoding problem and the learning problem. In this part, some useful algorithms are introduced, including the forward procedure, the backward procedure, the Viterbi algorithm, the δ-θ algorithm and the Baum-Welch method. Lastly, the computer programs by which HMMs are simulated are covered.
2 WHAT IS A HIDDEN MARKOV MODEL
A Hidden Markov Model is a stochastic model comprising an unobserved Markov chain $(X_t : t = 0, 1, \ldots)$ and an observable process $(Y_t : t = 0, 1, \ldots)$. In this paper we consider only the case in which the observations are represented as discrete symbols chosen from a finite set, so that a discrete probability density can be used within each state.
To define an HMM, we need some elements. The number of hidden states is $N$; we denote these $N$ states by $s_1, s_2, \ldots, s_N$. The number of observable states is $M$; we denote them by $r_1, r_2, \ldots, r_M$. Specifically, $(X_t : t = 0, 1, \ldots)$ is a Markov chain with transition probability matrix $A = [a_{ij}]$ and an initial state distribution $\pi = (\pi_i)$, where $1 \le i, j \le N$. By the properties of Markov chains, we know that
$$a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i) = P(X_{t+1} = s_j \mid X_t = s_i, X_{t-1} = s_k, \ldots, X_0 = s_l), \quad t = 0, 1, \ldots,$$
for every $s_i, s_j, \ldots, s_k$ and $s_l$, where $1 \le i, j, k, l \le N$. The initial state distribution of the Markov chain is given by $\pi_i = P(X_0 = s_i)$.
For a finite observation sequence $(Y_t : t = 0, 1, \ldots, T)$, where $T$ is any fixed number, we make a fundamental assumption connecting the hidden state sequence $(X_t : t = 0, 1, \ldots, T)$ and the observation sequence, namely the statistical independence of the observations $(Y_t : t = 0, 1, \ldots, T)$ given the states. Formulated mathematically,
$$P(Y_0 = r_{j_0}, Y_1 = r_{j_1}, \ldots, Y_T = r_{j_T} \mid X_0 = s_{i_0}, X_1 = s_{i_1}, \ldots, X_T = s_{i_T}) = \prod_{t=0}^{T} P(Y_t = r_{j_t} \mid X_t = s_{i_t}), \quad (1)$$
where $1 \le i_0, i_1, \ldots, i_T \le N$ and $1 \le j_0, j_1, \ldots, j_T \le M$. To simplify the notation, we denote the sequence $(s_{i_0}, s_{i_1}, \ldots, s_{i_T})$ by $s_0^T$, $(r_{j_0}, r_{j_1}, \ldots, r_{j_T})$ by $r_0^T$, and $(X_0, X_1, \ldots, X_T)$ by $X_0^T$, $(Y_0, Y_1, \ldots, Y_T)$ by $Y_0^T$. Then we put
$$b_j(k) = P(Y_t = r_k \mid X_t = s_j), \quad 1 \le j \le N, \ 1 \le k \le M, \ t = 0, 1, 2, \ldots$$
We may rewrite formula (1) as
$$P(Y_0^T = r_0^T \mid X_0^T = s_0^T) = \prod_{t=0}^{T} b_{i_t}(j_t). \quad (2)$$
So far, we know that a Hidden Markov Model has several components: a set of states $s_1, s_2, \ldots, s_N$, a set of output symbols $r_1, r_2, \ldots, r_M$, a set of transitions, each of which has an associated probability and output symbol, and a starting state. When a transition is taken, it produces an output symbol. The complicating factor is that the output symbol given is not necessarily unique to that transition, and thus it
is difficult to determine which transition was the one actually taken, and this is why these models are termed hidden. See Figure 2.1.
From the theory of Markov chains, we know $P(X_0^T = s_0^T)$, the probability of the state sequence $s_0^T$:
$$P(X_0^T = s_0^T) = P(X_0 = s_{i_0}, X_1 = s_{i_1}, \ldots, X_T = s_{i_T}) = \pi_{i_0} a_{i_0 i_1} \cdots a_{i_{T-1} i_T} = \prod_{t=0}^{T} a_{i_{t-1} i_t},$$
where we set $a_{i_{-1} i_0} = \pi_{i_0}$. Hence, the joint probability of $s_0^T$ and $r_0^T$ is
$$P(X_0^T = s_0^T, Y_0^T = r_0^T) = P(Y_0^T = r_0^T \mid X_0^T = s_0^T)\, P(X_0^T = s_0^T) = \prod_{t=0}^{T} [\, b_{i_t}(j_t)\, a_{i_{t-1} i_t}\,]. \quad (3)$$
The transition probability matrix $A$, the initial state distribution $\pi$ and the matrix $B = [b_j(k)]$, $1 \le j \le N$, $1 \le k \le M$, define a Hidden Markov Model completely. Therefore we can use the compact notation $\lambda = (A, B, \pi)$ to denote a Hidden Markov Model with discrete probability distributions. We may think of $\lambda$ as a parameter of the Hidden Markov Model, and we denote probabilities given a model $\lambda$ by $P_\lambda$ from now on.

Lemma 2.1. $P_\lambda(Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \ldots, i_T \le N} \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t)$, where $a_{i_{-1} i_0} = \pi_{i_0}$.
Figure 2.1: The dependency structure of an HMM, a hidden chain $X_0, X_1, X_2, \ldots, X_{T-1}, X_T$ with the observations $Y_0, Y_1, Y_2, \ldots, Y_{T-1}, Y_T$ emitted from it.
Proof: We want to compute $P_\lambda(Y_0^T = r_0^T)$, the probability of the observation sequence given the model $\lambda$. From time $0$ to time $T$, we consider every possible hidden state sequence $s_0^T$. The probability of $r_0^T$ is then obtained by summing the joint probability over all possible $s_0^T$, that is,
$$P_\lambda(Y_0^T = r_0^T) = \sum_{\text{all possible } s_0^T} P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \ldots, i_T \le N} \prod_{t=0}^{T} [\, a_{i_{t-1} i_t}\, b_{i_t}(j_t)\,]. \quad (4)$$
Lemma 2.1 gives us a method to compute the probability of a sequence of observations $r_0^T$. But unfortunately, this direct calculation is computationally infeasible, because the formula involves on the order of $(T+1) N^{T+1}$ calculations. We are going to introduce more efficient methods in the next section.
We can understand Lemma 2.1 from another point of view. A hidden Markov model consists of a set of hidden states $s_1, s_2, \ldots, s_N$ connected by directed edges. Each state assigns probabilities to the characters of the alphabet used in the observable sequence and to the edges leaving the state. A path in an HMM, $s_{i_0}, s_{i_1}, \ldots, s_{i_T}$, is a sequence of states such that there is an edge from each state in the path to the next state in the path, and the probability of this path is the product of the probabilities of the edges traversed, that is, $P(X_0^T = s_0^T)$.
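Lemma 2.1 can be checked numerically by enumerating all $N^{T+1}$ state paths exactly as in formula (4). The following sketch is in Python rather than the Matlab used in the thesis's own programs, and the toy parameters are illustrative, not taken from the thesis:

```python
from itertools import product

# Illustrative toy model (not from the thesis): N = 2 hidden states, M = 2 symbols.
A  = [[0.7, 0.3],
      [0.4, 0.6]]    # a_ij = P(X_{t+1} = s_j | X_t = s_i)
B  = [[0.9, 0.1],
      [0.2, 0.8]]    # b_j(k) = P(Y_t = r_k | X_t = s_j)
pi = [0.6, 0.4]      # pi_i = P(X_0 = s_i)

def direct_likelihood(obs):
    """Formula (4): sum the path probabilities over all N^(T+1) state paths."""
    N = len(pi)
    total = 0.0
    for path in product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total
```

Summing `direct_likelihood` over every observation sequence of a fixed length gives 1, as it must for a probability distribution; the cost, however, grows like $(T+1) N^{T+1}$, which is the point of the more efficient procedures in the next section.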
Each path through the HMM gives a probability distribution for each position in a string of the same length, based on the probabilities for the characters in the corresponding states. The probability of the observable sequence given a particular path is the product of the probabilities of the characters, that is,
$$P(Y_0^T = r_0^T \mid X_0^T = s_0^T) = \prod_{t=0}^{T} b_{i_t}(j_t).$$
The probability of any sequence of characters is the sum, over all paths whose length is the same as the sequence, of the probability of the path times the probability of the sequence given the path; this is exactly the result of Lemma 2.1. See Figure 2.2.
A Hidden Markov Model has a property very similar to that of a Markov process: given the value of $X_t$, the values of $Y_s$, $s \ge t$, do not depend on the values of $X_u$, $u < t$. The probability of any particular future observation of the model, when its present hidden state is known exactly, is not altered by additional knowledge concerning its past hidden behavior. In formal terms, we have Lemma 2.2.

Lemma 2.2. $P_\lambda(Y_u = r_{j_u} \mid X_0^t = s_0^t) = P_\lambda(Y_u = r_{j_u} \mid X_t = s_{i_t})$, $u \ge t$.

Proof: First, we prove the result is true when $u = t$. Summing over the earlier observations and using assumption (1),
$$P_\lambda(Y_t = r_{j_t} \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \ldots, j_{t-1} \le M} P_\lambda(Y_0^t = r_0^t \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \ldots, j_{t-1} \le M} P_\lambda(Y_0^{t-1} = r_0^{t-1} \mid X_0^{t-1} = s_0^{t-1})\, P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t})$$
Figure 2.2: Paths of an HMM, a trellis in which at each time $t = 0, 1, \ldots, T$ the chain occupies one of the states $s_1, s_2, \ldots, s_N$.
$$= 1 \cdot P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}) = P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}).$$
In particular, for any $q \le t$ the same computation gives
$$P_\lambda(Y_t = r_{j_t} \mid X_q^t = s_q^t) = \sum_{1 \le j_q, \ldots, j_{t-1} \le M} P_\lambda(Y_q^{t-1} = r_q^{t-1} \mid X_q^{t-1} = s_q^{t-1})\, P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}) = P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}).$$
Then, we prove the result is also true when $u > t$. Summing over the intermediate states and observations,
$$P_\lambda(Y_u = r_{j_u} \mid X_0^t = s_0^t) = \sum_{\substack{1 \le j_0, \ldots, j_{u-1} \le M \\ 1 \le i_{t+1}, \ldots, i_u \le N}} P_\lambda(Y_0^u = r_0^u, X_{t+1}^u = s_{t+1}^u \mid X_0^t = s_0^t) = \sum_{\substack{1 \le j_0, \ldots, j_{u-1} \le M \\ 1 \le i_{t+1}, \ldots, i_u \le N}} P_\lambda(Y_0^u = r_0^u \mid X_0^u = s_0^u)\, P_\lambda(X_{t+1}^u = s_{t+1}^u \mid X_t = s_{i_t}).$$
Factoring the first term as above and summing out the observations $j_0, \ldots, j_{u-1}$, this reduces to
$$\sum_{1 \le i_{t+1}, \ldots, i_u \le N} P_\lambda(Y_u = r_{j_u} \mid X_u = s_{i_u})\, P_\lambda(X_{t+1}^u = s_{t+1}^u \mid X_t = s_{i_t})$$
$$= P_\lambda(Y_u = r_{j_u} \mid X_t = s_{i_t}).$$
Lemma 2.2 will be widely used in the next section to help solve the basic problems of HMMs.

Remark. $(Y_t, t \ge 0)$ are not independent.

Proof: We take $T = 1$. From Lemma 2.1, we have
$$P_\lambda(Y_0^1 = r_0^1) = \sum_{i_0, i_1 = 1}^{N} b_{i_0}(j_0)\, b_{i_1}(j_1)\, \pi_{i_0}\, a_{i_0 i_1}.$$
But $P_\lambda(Y_0 = r_{j_0}) = \sum_{i_0 = 1}^{N} b_{i_0}(j_0)\, \pi_{i_0}$, and
$$P_\lambda(Y_1 = r_{j_1}) = \sum_{i_1} P_\lambda(Y_1 = r_{j_1} \mid X_1 = s_{i_1})\, P_\lambda(X_1 = s_{i_1}) = \sum_{i_1} b_{i_1}(j_1) \Big[ \sum_{i_0} a_{i_0 i_1}\, \pi_{i_0} \Big] = \sum_{i_0, i_1 = 1}^{N} b_{i_1}(j_1)\, a_{i_0 i_1}\, \pi_{i_0}.$$
Hence in general $P_\lambda(Y_0^1 = r_0^1) \ne P_\lambda(Y_0 = r_{j_0})\, P_\lambda(Y_1 = r_{j_1})$. Therefore, the sequence $(Y_t, t \ge 0)$ is not independent.
Actually, this is very natural to understand: each $Y_t$ is generated by the corresponding $X_t$, and the hidden sequence $(X_t, t = 0, 1, \ldots)$ is not independent. There is a very easy example: take $N = M$ and $b_i(j) = \delta_{ij}$. Then $Y_t$ is itself a Markov chain.
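To make the "hidden" terminology concrete, one can simulate a model $\lambda = (A, B, \pi)$: only the emitted symbols would be visible to an observer. A minimal sketch in Python (the thesis's own simulations are in Matlab; the parameters here are hypothetical):

```python
import random

# Hypothetical three-state model; each row of A and B is a probability vector.
A  = [[0.5, 0.3, 0.2],
      [0.2, 0.6, 0.2],
      [0.1, 0.3, 0.6]]
B  = [[0.8, 0.1, 0.1],
      [0.1, 0.8, 0.1],
      [0.1, 0.1, 0.8]]
pi = [1/3, 1/3, 1/3]

def simulate(T, seed=0):
    """Run the chain for t = 0..T; return (hidden states, observed symbols)."""
    rng = random.Random(seed)
    states, obs = [], []
    dist = pi                                        # distribution of X_0
    for _ in range(T + 1):
        x = rng.choices(range(3), weights=dist)[0]   # hidden state X_t
        y = rng.choices(range(3), weights=B[x])[0]   # emitted symbol Y_t
        states.append(x)
        obs.append(y)
        dist = A[x]                                  # row of A gives the law of X_{t+1}
    return states, obs
```

An observer sees only `obs`; `states` stays hidden, and because this $B$ is not a permutation matrix, a given symbol does not identify the state that emitted it.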
3 THREE BASIC PROBLEMS OF HMMs AND THEIR SOLUTIONS
Once we have an HMM, there are three problems of interest.

3.1 The Evaluation Problem
Given a Hidden Markov Model $\lambda$ and a sequence of observations $Y_0^T = r_0^T$, what is the probability that the observations are generated by the model, i.e. $P_\lambda(Y_0^T = r_0^T)$? We can also view the problem as asking how well a given model matches a given observation sequence. From this second viewpoint, if we have several competing models, the solution to the evaluation problem tells us which model best matches the observation sequence. The most straightforward way of computing this probability is to use Lemma 2.1, but that involves far too many calculations. The more efficient methods are called the forward procedure and the backward procedure; see [1]. We will introduce these two procedures first, then reveal the mathematical idea behind them.
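Both procedures developed in the next two subsections compute the same quantity $P_\lambda(Y_0^T = r_0^T)$ in on the order of $N^2(T+1)$ operations. As a preview, here is a compact sketch of the two recursions (Python rather than the thesis's Matlab; the toy parameters are hypothetical):

```python
A  = [[0.7, 0.3], [0.4, 0.6]]   # hypothetical transition matrix
B  = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical emission matrix
pi = [0.6, 0.4]                 # hypothetical initial distribution
N  = len(pi)

def forward(obs):
    # alpha_0(i) = pi_i b_i(j_0); alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(j_{t+1})
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][y]
                 for j in range(N)]
    return sum(alpha)                        # termination: sum_i alpha_T(i)

def backward(obs):
    # beta_T(i) = 1; beta_t(i) = sum_j a_ij b_j(j_{t+1}) beta_{t+1}(j)
    beta = [1.0] * N
    for y in reversed(obs[1:]):
        beta = [sum(A[i][j] * B[j][y] * beta[j] for j in range(N))
                for i in range(N)]
    # termination: sum_i beta_0(i) b_i(j_0) pi_i
    return sum(beta[i] * B[i][obs[0]] * pi[i] for i in range(N))
```

For any observation sequence the two functions agree up to floating-point error, since both evaluate the multiple sum of Lemma 2.1 as a repeated sum, just in opposite orders.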
3.1.1 The Forward Procedure
Fix $r_0^t$ and consider the forward variable $\alpha_t(i_t)$ defined as
$$\alpha_t(i_t) = P_\lambda(Y_0^t = r_0^t, X_t = s_{i_t}), \quad t = 0, 1, \ldots, T, \ 1 \le i_t \le N,$$
that is, the probability of the partial observation sequence $r_{j_0}, r_{j_1}, \ldots, r_{j_t}$ (until time $t$) with hidden state $s_{i_t}$ at time $t$. So for the forward variable $\alpha_t(i_t)$ we only consider those paths that end at the state $s_{i_t}$ at time $t$. We can solve for $\alpha_t(i_t)$ inductively, as follows:
(1) Initialization: $\alpha_0(i_0) = \pi_{i_0} b_{i_0}(j_0)$, $1 \le i_0 \le N$.
(2) Induction ($t = 0, 1, \ldots, T-1$): expanding the joint probability over all partial state sequences, and using the conditional independence of the observations together with the Markov property,
$$\alpha_{t+1}(i_{t+1}) = P_\lambda(Y_0^{t+1} = r_0^{t+1}, X_{t+1} = s_{i_{t+1}}) = \Big[ \sum_{1 \le i_0, \ldots, i_t \le N} P_\lambda(Y_0^t = r_0^t, X_0^t = s_0^t)\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t}) \Big]\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})$$
$$= \Big[ \sum_{1 \le i_t \le N} P_\lambda(Y_0^t = r_0^t, X_t = s_{i_t})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t}) \Big]\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}}) = \Big[ \sum_{i_t = 1}^{N} \alpha_t(i_t)\, a_{i_t i_{t+1}} \Big]\, b_{i_{t+1}}(j_{t+1}).$$
(3) Termination:
$$P_\lambda(Y_0^T = r_0^T) = \sum_{i_T = 1}^{N} P_\lambda(Y_0^T = r_0^T, X_T = s_{i_T}) = \sum_{i_T = 1}^{N} \alpha_T(i_T).$$
If we examine the computation involved in the calculation of $\alpha_t(i_t)$, we see that it requires on the order of $N^2(T+1)$ calculations, rather than the $(T+1) N^{T+1}$ required by the direct calculation. Hence, the forward probability calculation is much more efficient than the direct calculation.
Similarly, we have another method for the evaluation problem. It is called the backward procedure.

3.1.2 The Backward Procedure
We consider a backward variable $\beta_t(i_t)$ defined as
$$\beta_t(i_t) = P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}), \quad 0 \le t \le T-1, \ 1 \le i_t \le N.$$
That is, $\beta_t(i_t)$ is the probability of the partial observation sequence from time $t+1$ to the end, given that the hidden state is $s_{i_t}$ at time $t$. Again we can solve for $\beta_t(i_t)$ inductively, as follows:
(1) Initialization: to make this procedure work for $t = T-1$, we arbitrarily define $\beta_T(i_T) = 1$ in the initialization step.
(2) Induction ($t = T-1, \ldots, 1, 0$): conditioning on the state at time $t+1$ and using the Markov property together with Lemma 2.2,
$$\beta_t(i_t) = P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}) = \sum_{i_{t+1} = 1}^{N} P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$
$$= \sum_{i_{t+1} = 1}^{N} P_\lambda(Y_{t+2}^T = r_{t+2}^T \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t}) = \sum_{i_{t+1} = 1}^{N} a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1}).$$
(3) Termination:
$$P_\lambda(Y_0^T = r_0^T) = \sum_{i_0 = 1}^{N} P_\lambda(Y_1^T = r_1^T \mid X_0 = s_{i_0})\, P_\lambda(Y_0 = r_{j_0} \mid X_0 = s_{i_0})\, P_\lambda(X_0 = s_{i_0}) = \sum_{i_0 = 1}^{N} \beta_0(i_0)\, b_{i_0}(j_0)\, \pi_{i_0}.$$
The backward procedure requires on the order of $N^2(T+1)$ calculations, as many as the forward procedure.

3.1.3 Summary of the Evaluation Problem
We have introduced how to evaluate the probability that the observation sequence $Y_0^T = r_0^T$ is generated, by using either the forward procedure or the backward procedure. They are more efficient than the method given in Lemma 2.1. In fact, these two procedures are nothing but ways of converting the multiple sum of Lemma 2.1 into repeated sums. In the forward procedure, we use the identity
$$\sum_{1 \le i_0, \ldots, i_T \le N} = \sum_{i_T = 1}^{N} \cdots \sum_{i_0 = 1}^{N},$$
and for the backward procedure, we reverse the order of summation, that is,
$$\sum_{1 \le i_0, \ldots, i_T \le N} = \sum_{i_0 = 1}^{N} \cdots \sum_{i_T = 1}^{N}.$$
From this point of view, it is obvious that other procedures are possible. For example, we can do the summation from the two ends toward the middle at the same time; in some sense, this can be seen as a kind of parallel algorithm. It requires on the order of $N(N-1)(T+1)$ calculations.

3.2 The Decoding Problem
Given a model $\lambda$ and a sequence of observations $Y_0^T = r_0^T$, what is the most likely state sequence in the model to have produced the observations? That is, we want to find a hidden state sequence $X_0^T = s_0^T$ maximizing the probability $P_\lambda(X_0^T = s_0^T \mid Y_0^T = r_0^T)$ over all possible sequences $s_0^T$. This is equivalent to maximizing $P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T)$, because the probability of $r_0^T$ given a model $\lambda$, $P_\lambda(Y_0^T = r_0^T)$, is fixed. A technique for finding this state sequence exists, based on dynamic programming methods, and
is called the Viterbi algorithm. See [1].

3.2.1 Viterbi Algorithm
Fixing $s_{i_t}$, we consider a variable $\delta_t(i_t)$ defined as
$$\delta_t(i_t) = \max_{1 \le i_0, i_1, \ldots, i_{t-1} \le N} P_\lambda(X_0^{t-1} = s_0^{t-1}, X_t = s_{i_t}, Y_0^t = r_0^t).$$
Hence, $\delta_t(i_t)$ is the highest probability along a single path which accounts for the first $t$ observations and ends in state $s_{i_t}$ at time $t$. By induction, factoring the joint probability exactly as in the forward procedure but with maxima in place of sums, we have
$$\delta_{t+1}(i_{t+1}) = \max_{1 \le i_0, i_1, \ldots, i_t \le N} P_\lambda(X_0^t = s_0^t, X_{t+1} = s_{i_{t+1}}, Y_0^{t+1} = r_0^{t+1}) = \Big[ \max_{1 \le i_t \le N} \delta_t(i_t)\, a_{i_t i_{t+1}} \Big]\, b_{i_{t+1}}(j_{t+1}).$$
To find the hidden state sequence, we need to keep track of the argument which maximizes $\delta_t(i_t)$, for every $t$ and $i_t$; these will be recorded in an array $\psi_t(i_t)$.
(1) Initialization: $\delta_0(i_0) = \pi_{i_0} b_{i_0}(j_0)$, $1 \le i_0 \le N$.
(2) Induction ($t = 1, \ldots, T$):
$$\delta_t(i_t) = \max_{1 \le i_{t-1} \le N} [\delta_{t-1}(i_{t-1})\, a_{i_{t-1} i_t}]\, b_{i_t}(j_t), \qquad \psi_t(i_t) = \arg\max_{1 \le i_{t-1} \le N} [\delta_{t-1}(i_{t-1})\, a_{i_{t-1} i_t}].$$
(3) Termination:
$$\delta^* = \max_{1 \le i_T \le N} \delta_T(i_T), \qquad \psi^* = \arg\max_{1 \le i_T \le N} \delta_T(i_T).$$
Finally, the state sequence is read off backwards: $i_T^* = \psi^*$ and $i_t^* = \psi_{t+1}(i_{t+1}^*)$, $t = T-1, T-2, \ldots, 0$.

Remark 1. In the Viterbi algorithm, $\psi_t(i_t)$, $t = 0, \ldots, T-1$, and $\psi^*$ are set-valued maps. This means there may be more than one best hidden state sequence; in order to distinguish them, we need more information.

Remark 2. This procedure is not consistent as $T$ increases.
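Steps (1) through (3) can be sketched in code. This is Python rather than the thesis's Matlab, the toy parameters are hypothetical (unrelated to the counterexample below), and, as in Remark 1, `max` silently picks one maximizer, so ties are broken arbitrarily:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]   # hypothetical parameters
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
N  = len(pi)

def viterbi(obs):
    """delta recursion with backpointers psi, then backtracking."""
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]        # initialization
    psi = []
    for y in obs[1:]:
        new_delta, back = [], []
        for j in range(N):
            i_best = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(i_best)
            new_delta.append(delta[i_best] * A[i_best][j] * B[j][y])
        delta = new_delta
        psi.append(back)
    path = [max(range(N), key=lambda i: delta[i])]          # termination
    for back in reversed(psi):                              # backtracking
        path.append(back[path[-1]])
    return path[::-1]
```

For the observation sequence $(r_1, r_1, r_2, r_2)$ under these parameters, the decoded path is $(s_1, s_1, s_2, s_2)$, i.e. `[0, 0, 1, 1]` in 0-based indices.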
Proof: We give a counterexample. We take
$$\pi = (\quad), \qquad A = (\quad), \qquad B = (\quad).$$
Suppose we have an observation sequence $Y_0^1 = (r_1, r_2)$. Following the Viterbi algorithm, when $T = 0$ the most likely hidden state is $s_1$, but when $T = 1$ the best sequence is $(s_2, s_1)$. This means that, given a fixed sequence $Y$, if $s_{i_0}$ is a solution for $T = 0$ and $(s_{i'_0}, s_{i'_1})$ is a solution for $T = 1$, we may not have $i_0 = i'_0$, even if $s_{i_0}$ and $(s_{i'_0}, s_{i'_1})$ are the unique best solutions respectively.
The Viterbi algorithm is very similar to the forward procedure that we introduced in the previous section. In the forward procedure we defined $\alpha_t(i_t)$, while here we use $\delta_t(i_t)$; the only difference between them is that summation is replaced by maximization, but the general ideas are the same. We change a multiple summation to repeated summations and a multiple maximum to repeated maxima.
From this point of view, it is very natural to ask how the idea that we used in the backward procedure can be applied to the decoding problem. Therefore, we
introduce a new method, in which we carry out the maximization from the two ends toward the middle at the same time. That is,

3.2.2 The δ-θ Algorithm
The advantage of this method is that it saves time. To describe this procedure, we define another variable $\theta_t(i_t)$, $t = 0, \ldots, T-1$:
$$\theta_t(i_t) = \max_{1 \le i_{t+1}, \ldots, i_T \le N} P_\lambda(X_{t+1}^T = s_{t+1}^T, Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}).$$
$\theta_t(i_t)$ is the highest probability along a path which accounts for the observations from time $t+1$ to the end and starts at state $s_{i_t}$ at time $t$. Conditioning on the state at time $t+1$ and factoring as in the backward procedure, with maxima in place of sums, induction gives
$$\theta_t(i_t) = \max_{1 \le i_{t+1} \le N} \theta_{t+1}(i_{t+1})\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1}).$$
(1) Initialization:
$$\delta_0(i_0) = \pi_{i_0} b_{i_0}(j_0), \ 1 \le i_0 \le N; \qquad \theta_T(i_T) = 1, \ 1 \le i_T \le N.$$
(2) Induction: for $t = 1, \ldots, T$,
$$\delta_t(i_t) = \max_{1 \le i_{t-1} \le N} [\delta_{t-1}(i_{t-1})\, a_{i_{t-1} i_t}]\, b_{i_t}(j_t), \qquad \psi_t(i_t) = \arg\max_{1 \le i_{t-1} \le N} [\delta_{t-1}(i_{t-1})\, a_{i_{t-1} i_t}];$$
for $t = 0, \ldots, T-1$,
$$\theta_t(i_t) = \max_{1 \le i_{t+1} \le N} \theta_{t+1}(i_{t+1})\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1}), \qquad \varphi_t(i_t) = \arg\max_{1 \le i_{t+1} \le N} \theta_{t+1}(i_{t+1})\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1}).$$
(3) Termination: for any fixed $t$,
$$\max_{1 \le i_0, \ldots, i_T \le N} P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T)$$
32 = max max P λ(x T 0 = s T 0, Y0 T = r T 0 ) 1 i t N 1 i s N,s t = max max P λ(y0 T = r T 0 X T 0 = s T 0 )P λ (X T 0 = s T 0 ) 1 i t N 1 i s N,s t = max max P λ(y0 t = r t 1 i t N 1 i s 0 X t 0 = s t 0)P λ (X t 0 = s t 0) N,s t P λ (Y T t+1 = r T t+1 X T t+1 = s T t+1)p λ (X T t+1 = s T t+1 X t 0 = s t 0) = max 1 i t N [ max P λ(x t 0 = s t 1 i s 0, Y0 t = r t 0) N,s<t max 1 i s N,s>t P λ(y T t+1 = r T t+1, X T t+1 = s T t+1 X t+1 = s it+1 )] = max 1 i t N [δ t(i t ) θ t (i t )] We define i t = arg max 1 it N[δ t (i t ) θ t (i t )] and the most likely hidden sequence is i t = ψ t+1 (i t+1 ) from 0 to t 1 i t when time = t ϕ t 1 (i t 1 ) from t + 1 to T 3.3 The Learning Problem Given a model λ and a sequence of observations Y T 0 = r T 0, how should we adjust the model parameters (A, B, π) in order to maximize P λ (Y T 0 = r T 0 )? We face an optimization problem with restrictions. This probability P λ (Y T 0 = r T 0 ) is a function of the variables π i, a ij, b j (k), where 1 i, j N, 1 k N. Also. we may view the probability P λ (Y T 0 = r T 0 ) as the likelihood function of λ, 23
considered as a function of $\lambda$ for fixed $r_0^T$. Thus, for each $r_0^T$, $P_\lambda(Y_0^T = r_0^T)$ gives the probability of observing $r_0^T$. We use the method of maximum likelihood: we try to find the value of $\lambda$ that is most likely to have produced the sequence $r_0^T$. See Lemma 2.1. The problem we need to solve is
$$\max_\lambda P_\lambda(Y_0^T = r_0^T) = \max_\lambda \sum_{1 \le i_0, i_1, \ldots, i_T \le N} \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t),$$
subject to
$$\pi_i \ge 0, \quad \sum_{i=1}^{N} \pi_i = 1; \qquad a_{ij} \ge 0, \quad \sum_{j=1}^{N} a_{ij} = 1 \ \text{for every } i; \qquad b_j(k) \ge 0, \quad \sum_{k=1}^{M} b_j(k) = 1 \ \text{for every } j,$$
where $1 \le i, j \le N$, $1 \le k \le M$ and $a_{i_{-1} i_0} = \pi_{i_0}$.
To solve this problem, we need to use some results proved by Leonard E. Baum and George R. Sell; see [3].
1. Let $\bar{M}$ denote the manifold with boundary given by the points $x = (x_{ij})$ with
$$x_{ij} \ge 0 \quad \text{and} \quad \sum_{j=1}^{q_i} x_{ij} = 1,$$
where $q_1, \ldots, q_k$ is a set of nonnegative integers. Let $P$ be a homogeneous polynomial in the variables $\{x_{ij}\}$ with nonnegative coefficients, and let $\Lambda = \Lambda_P : \bar{M} \to \bar{M}$ be defined by $y = \Lambda_P(x)$, where
$$y_{ij} = x_{ij}\, \frac{\partial P}{\partial x_{ij}} \left[ \sum_{k=1}^{q_i} x_{ik}\, \frac{\partial P}{\partial x_{ik}} \right]^{-1}.$$
Then $P(x) \le P(t\Lambda_P(x) + (1-t)x)$, where $0 \le t \le 1$ and $x \in M$. Here $M = \{x_{ij} : x_{ij} > 0 \text{ and } \sum_{j=1}^{q_i} x_{ij} = 1\}$ is the interior and $\partial M = \{x_{ij} : \text{some } x_{ij} = 0 \text{ and } \sum_{j=1}^{q_i} x_{ij} = 1\}$ the boundary.
2. Let $P$ be a homogeneous polynomial in the variables $(x_{ij})$ with positive coefficients and let $q \in M$ be an isolated local maximum of $P$. Then there exists a neighborhood $V$ of $q$ such that $\Lambda(V) \subset V$ and, for every $x \in V$, $\Lambda^n(x) \to q$ as $n \to \infty$.
3. (A) The transformation $\Lambda_P$ on $M$ can be extended to be continuous on $\bar{M} = M \cup \partial M$. (B) $\Lambda_P$ can also be continuously extended to any isolated local maximum $q$ of $P$ on $\partial M$ by the definition $\Lambda_P(q) = q$. (C) The extended transformation $\Lambda_P$ still obeys the inequality $P(x) \le P(t\Lambda_P(x) + (1-t)x)$, where $0 \le t \le 1$.
Now, we come back to the learning problem. We define
$$P = P_\lambda(Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \ldots, i_T \le N} \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t).$$
Notice that $P$ is a homogeneous polynomial of order $2(T+1)$ in the variables $\pi_i$, $a_{ij}$, $b_j(k)$, with nonnegative coefficients, and these variables satisfy the conditions in the results above. We define a new model $\bar\lambda = (\bar A, \bar B, \bar\pi)$: when $\pi_i \ne 0$, $a_{ij} \ne 0$, $b_j(k) \ne 0$,
$$\bar\pi_i = \frac{\pi_i\, \partial P / \partial \pi_i}{\sum_{i=1}^{N} \pi_i\, \partial P / \partial \pi_i}, \qquad \bar a_{ij} = \frac{a_{ij}\, \partial P / \partial a_{ij}}{\sum_{j=1}^{N} a_{ij}\, \partial P / \partial a_{ij}}, \qquad \bar b_j(k) = \frac{b_j(k)\, \partial P / \partial b_j(k)}{\sum_{k=1}^{M} b_j(k)\, \partial P / \partial b_j(k)};$$
when $\pi_i = 0$, $a_{ij} = 0$, or $b_j(k) = 0$, we define $\bar\pi_i = 0$, $\bar a_{ij} = 0$, $\bar b_j(k) = 0$ respectively. Applying the above results and taking $t = 1$, we obtain
$$P(\pi_i, a_{ij}, b_j(k)) \le P(\bar\pi_i, \bar a_{ij}, \bar b_j(k)).$$
Also, if $\lambda$ belongs to a neighborhood $V$ of $\lambda^*$, where $\lambda^*$ is a local maximum of $P$, we will have either 1) the initial model $\lambda$ defines a critical point of $P$, in which
case $\lambda = \bar\lambda = \lambda^*$; or 2) the model $\bar\lambda$ is better than the model $\lambda$ in the sense that $P(\bar\lambda) > P(\lambda)$, that is, $\bar\lambda$ is more likely than $\lambda$.
The next step is to represent $\bar\pi_i$, $\bar a_{ij}$, $\bar b_j(k)$ by using $\pi_i$, $a_{ij}$, $b_j(k)$. We claim:
$$\pi_i\, \frac{\partial P}{\partial \pi_i} = P_\lambda(Y_0^T = r_0^T, X_0 = s_i),$$
$$a_{ij}\, \frac{\partial P}{\partial a_{ij}} = \sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i, X_{t+1} = s_j),$$
$$b_j(k)\, \frac{\partial P}{\partial b_j(k)} = \sum_{t=0,\, Y_t = r_k}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j).$$
Proof: From the formula
$$P = \sum_{1 \le i_0, i_1, \ldots, i_T \le N} \pi_{i_0}\, a_{i_0 i_1} \cdots a_{i_{T-1} i_T}\, b_{i_0}(j_0) \cdots b_{i_T}(j_T),$$
we see that $P$ is the sum of the probabilities of the paths $s_{i_0}, s_{i_1}, \ldots, s_{i_T}$; each term of the sum is a possible path. If we take the partial derivative of $P$ with respect to $\pi_i$, all the other terms vanish except for those representing paths that start from state $s_i$. So
$$\frac{\partial P}{\partial \pi_i} = \sum_{1 \le i_1, \ldots, i_T \le N} a_{i i_1} \cdots a_{i_{T-1} i_T}\, b_i(j_0) \cdots b_{i_T}(j_T),$$
and
$$\pi_i\, \frac{\partial P}{\partial \pi_i} = \sum_{1 \le i_1, \ldots, i_T \le N} \pi_i\, a_{i i_1} \cdots a_{i_{T-1} i_T}\, b_i(j_0) \cdots b_{i_T}(j_T) = P_\lambda(Y_0^T = r_0^T, X_0 = s_i).$$
In the same way, we can prove the other two results. When we take the partial derivative of $P$ with respect to $a_{ij}$, only those terms survive that represent paths staying at state $s_i$ at some time $t$ and moving to state $s_j$ at time $t+1$. When we take the partial derivative of $P$ with respect to $b_j(k)$, all the other terms vanish except for those representing paths that pass through state $s_j$ at a time $t$ at which the observation is $r_k$. Then we have
$$\bar\pi_i = \frac{\pi_i\, \partial P / \partial \pi_i}{\sum_{i=1}^{N} \pi_i\, \partial P / \partial \pi_i} = \frac{P_\lambda(Y_0^T = r_0^T, X_0 = s_i)}{P_\lambda(Y_0^T = r_0^T)},$$
$$\bar a_{ij} = \frac{\sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i, X_{t+1} = s_j)}{\sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i)}, \qquad \bar b_j(k) = \frac{\sum_{t=0,\, Y_t = r_k}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j)}{\sum_{t=0}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j)}.$$
After these theoretical results, we introduce an algorithm to realize them: the Baum-Welch method. In order to describe the procedure, we first define $\xi_t(i_t, i_{t+1})$, the probability of
being in state $s_{i_t}$ at time $t$ and state $s_{i_{t+1}}$ at time $t+1$, given the model $\lambda$ and the observation sequence:
$$\xi_t(i_t, i_{t+1}) = P_\lambda(X_t = s_{i_t}, X_{t+1} = s_{i_{t+1}} \mid Y_0^T = r_0^T).$$
From the definitions of the forward and backward variables, we can write $\xi_t(i_t, i_{t+1})$ in the form
$$\xi_t(i_t, i_{t+1}) = \frac{\alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}{P_\lambda(Y_0^T = r_0^T)} = \frac{\alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}{\sum_{i_t = 1}^{N} \sum_{i_{t+1} = 1}^{N} \alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}.$$
We also need another variable, $\gamma_t(i_t)$, the probability of being in state $s_{i_t}$ at time $t$, given the observation sequence and the model $\lambda$:
$$\gamma_t(i_t) = P_\lambda(X_t = s_{i_t} \mid Y_0^T = r_0^T).$$
Hence we can relate $\gamma_t(i_t)$ to $\xi_t(i_t, i_{t+1})$ by summing over $i_{t+1}$:
$$\gamma_t(i_t) = \sum_{i_{t+1} = 1}^{N} \xi_t(i_t, i_{t+1}).$$
Writing $i_t = i$, $i_{t+1} = j$, where $1 \le i, j \le N$, we have
$$\gamma_t(i) = P_\lambda(X_t = s_i \mid Y_0^T = r_0^T), \qquad \xi_t(i, j) = P_\lambda(X_t = s_i, X_{t+1} = s_j \mid Y_0^T = r_0^T).$$
If we sum $\gamma_t(i)$ over the time index $t$, we get a quantity which can be interpreted as the expected number of times that state $s_i$ is visited:
$$\sum_{t=0}^{T-1} \gamma_t(i) = \text{expected number of transitions from } s_i,$$
and similarly,
$$\sum_{t=0}^{T-1} \xi_t(i, j) = \text{expected number of transitions from } s_i \text{ to } s_j.$$
Using these formulas, we can give a method for reestimating the parameters of an HMM:
$$\bar\pi_i = \text{expected frequency (number of times) in state } s_i \text{ at time } 0 = \gamma_0(i),$$
$$\bar a_{ij} = \frac{\text{expected number of transitions from } s_i \text{ to } s_j}{\text{expected number of transitions from } s_i} = \frac{\sum_{t=0}^{T-1} \xi_t(i, j)}{\sum_{t=0}^{T-1} \gamma_t(i)},$$
$$\bar b_j(k) = \frac{\text{expected number of times in state } s_j \text{ with observation } r_k}{\text{expected number of times in state } s_j} = \frac{\sum_{t=0,\, Y_t = r_k}^{T} \gamma_t(j)}{\sum_{t=0}^{T} \gamma_t(j)}.$$
Therefore, we obtain a new model $\bar\lambda = (\bar A, \bar B, \bar\pi)$. Based on the above procedure, if we iteratively use $\bar\lambda$ in place of $\lambda$ and repeat the calculation, we can improve the probability of $Y_0^T = r_0^T$ until some limiting point is reached.
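One pass of the reestimation above can be sketched as follows (Python rather than the thesis's Matlab; the starting parameters are hypothetical). The functions compute the unscaled α and β tables, form γ and ξ, and return the reestimated model; by the Baum inequality the likelihood cannot decrease:

```python
A  = [[0.7, 0.3], [0.4, 0.6]]   # hypothetical initial model
B  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]

def alphas(A, B, pi, obs):
    """Table of forward variables alpha_t(i), t = 0..T (unscaled)."""
    N = len(pi)
    a = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for y in obs[1:]:
        a.append([sum(a[-1][i] * A[i][j] for i in range(N)) * B[j][y]
                  for j in range(N)])
    return a

def betas(A, B, pi, obs):
    """Table of backward variables beta_t(i), with beta_T(i) = 1."""
    N = len(pi)
    b = [[1.0] * N]
    for y in reversed(obs[1:]):
        b.insert(0, [sum(A[i][j] * B[j][y] * b[0][j] for j in range(N))
                     for i in range(N)])
    return b

def baum_welch_step(A, B, pi, obs):
    """One reestimation: gamma and xi, then the new (A, B, pi)."""
    N, M, L = len(pi), len(B[0]), len(obs)
    al, be = alphas(A, B, pi, obs), betas(A, B, pi, obs)
    P = sum(al[-1])                                  # P_lambda(Y = r)
    gamma = [[al[t][i] * be[t][i] / P for i in range(N)] for t in range(L)]
    xi = [[[al[t][i] * A[i][j] * B[j][obs[t + 1]] * be[t + 1][j] / P
            for j in range(N)] for i in range(N)] for t in range(L - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(L - 1)) /
              sum(gamma[t][i] for t in range(L - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(L) if obs[t] == k) /
              sum(gamma[t][j] for t in range(L))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi
```

Iterating `baum_welch_step` produces a monotonically non-decreasing sequence of likelihoods, which is exactly the limiting behavior described above. (For long sequences the α and β tables underflow and must be rescaled, an issue this unscaled sketch ignores.)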
3.4 An Example
In this section, we introduce an example to show how to solve these three problems by hand or by computer. A school teacher gave three different types of daily homework assignments: (1) one that took about 5 minutes to complete, (2) one that took about 1 hour to complete, and (3) one that took about 3 hours to complete. The teacher did not openly reveal his mood to his students, but we know that he was in either a (1) good, (2) neutral, or (3) bad mood for each whole day. Meanwhile, we knew the relationship between his mood and the homework he would assign.
We use a Hidden Markov Model $\lambda$ to analyze this problem. The model includes three hidden states, {1 (good mood), 2 (neutral mood), 3 (bad mood)}, and three observable states, {1 (5-minute assignment), 2 (1-hour assignment), 3 (3-hour assignment)}. We consider the problem over a week, from Monday to Friday, using $t = 1, 2, 3, 4, 5$. Hence, the Markov chain in this model is $(X_t : t = 1, 2, 3, 4, 5)$, and the observable process is $(Y_t : t = 1, 2, 3, 4, 5)$. The transition probability matrix is $A = [a_{ij}]$ and the initial state distribution is $\pi = (\pi_i)$, where
$$a_{ij} = P_\lambda(X_{t+1} = j \mid X_t = i), \qquad \pi_i = P_\lambda(X_1 = i),$$
for $1 \le i, j \le 3$ and $t = 1, 2, 3, 4$. The relationship between the teacher's mood and the assignments is given by a matrix $B = [b_j(k)]$, where

$b_j(k) = P_\lambda(Y_t = k \mid X_t = j), \qquad 1 \le j \le 3,\ 1 \le k \le 3,\ t = 1, \ldots, 5.$

In this particular case, the observed sequence is $Y_1^5 = (1, 3, 2, 1, 3)$, and the HMM is $\lambda = (A, B, \pi)$, with

$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}, \qquad B = \begin{pmatrix} b_1(1) & b_1(2) & b_1(3) \\ b_2(1) & b_2(2) & b_2(3) \\ b_3(1) & b_3(2) & b_3(3) \end{pmatrix}, \qquad \pi = (\pi_1\ \ \pi_2\ \ \pi_3).$

To get the initial distribution $\pi$, we use some knowledge of Markov chains, namely the stationary probability distribution.
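A stationary distribution can also be found numerically. The sketch below (in Python rather than the thesis's Matlab, and with a hypothetical 3-state transition matrix, not the example's) uses power iteration, repeatedly applying $\pi \leftarrow \pi P$ until the fixed point $\pi_j = \sum_i \pi_i P_{ij}$ is reached.

```python
def stationary(P, iters=2000):
    """Approximate the stationary distribution of a row-stochastic matrix P
    by power iteration: pi <- pi P until pi stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# hypothetical 3-state transition matrix (each row sums to 1)
P = [[0.5, 0.4, 0.1],
     [0.3, 0.4, 0.3],
     [0.2, 0.3, 0.5]]
pi = stationary(P)
```

The result satisfies the defining equations $\pi_j = \sum_i \pi_i P_{ij}$ and $\sum_j \pi_j = 1$ up to floating-point error, which matches the linear system solved in the text.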
Definition: Consider a Markov chain with transition probability matrix $P = [P_{ij}]_{i,j \ge 0}$, where $\sum_j P_{ij} = 1$ for each $i$, and with initial distribution $\pi = (\pi_i)_{i \ge 0}$, $\sum_i \pi_i = 1$. If we have

$\pi_j = \sum_{i} \pi_i P_{ij}$

for every $j$, then we call $\pi = (\pi_i)_{i \ge 0}$ a stationary probability distribution of the Markov chain.

For the transition matrix used in this example, finding the stationary probability distribution means solving the linear system

$\pi_1 = \pi_1 a_{11} + \pi_2 a_{21} + \pi_3 a_{31}$
$\pi_2 = \pi_1 a_{12} + \pi_2 a_{22} + \pi_3 a_{32}$
$\pi_3 = \pi_1 a_{13} + \pi_2 a_{23} + \pi_3 a_{33}$
$1 = \pi_1 + \pi_2 + \pi_3.$

This linear system tells us that the stationary probability distribution is $\pi = (0.05, 0.2, 0.75)$.

The Evaluation Problem: The first question we ask is: what is the probability that this teacher would assign this order of homework assignments? We use the forward procedure to solve
this problem.

(1) Initialization:

$\alpha_1(1) = \pi_1 b_1(1)$
$\alpha_1(2) = \pi_2 b_2(1) = 0.06$
$\alpha_1(3) = \pi_3 b_3(1) = 0$

(2) Induction ($t = 1, 2, 3, 4$):

$\alpha_2(1) = (\alpha_1(1) a_{11} + \alpha_1(2) a_{21} + \alpha_1(3) a_{31})\, b_1(3) = (\,\cdots\,) \times 0.1$
$\alpha_2(2) = (\alpha_1(1) a_{12} + \alpha_1(2) a_{22} + \alpha_1(3) a_{32})\, b_2(3)$
$\alpha_2(3) = (\alpha_1(1) a_{13} + \alpha_1(2) a_{23} + \alpha_1(3) a_{33})\, b_3(3)$

In the same way, we can calculate the values of all forward variables $\alpha_t(i)$, $1 \le t \le 5$, $1 \le i \le 3$. We use Table 1 to represent the results.
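The forward procedure (initialization, induction, termination) can be sketched as follows. This is an illustrative Python version; only $\pi = (0.05, 0.2, 0.75)$ is taken from the text, while the matrices `A` and `B` are hypothetical stand-ins, and states and symbols are 0-indexed.

```python
def forward(pi, A, B, obs):
    """alpha[t][j] = P(Y_1..Y_{t+1}, X_{t+1} = s_j) for the observed symbols obs."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]      # initialization
    for y in obs[1:]:                                       # induction
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][y]
                      for j in range(N)])
    return alpha

pi = [0.05, 0.2, 0.75]                  # stationary distribution from the text
A = [[0.2, 0.3, 0.5],                   # hypothetical transition matrix
     [0.2, 0.2, 0.6],
     [0.0, 0.2, 0.8]]
B = [[0.7, 0.2, 0.1],                   # hypothetical emission matrix
     [0.3, 0.4, 0.3],
     [0.0, 0.1, 0.9]]
obs = [0, 2, 1, 0, 2]                   # the sequence (1, 3, 2, 1, 3), 0-indexed
alpha = forward(pi, A, B, obs)
prob = sum(alpha[-1])                   # termination: P(Y_1^5 = (1, 3, 2, 1, 3))
```

With these stand-in parameters the initialization gives $\alpha_1 = (0.035,\ 0.06,\ 0)$, matching the shape of the computation in the text.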
Table 1. The Forward Variables $\alpha_t(i)$

(3) Termination:

$P_\lambda(Y_1^5 = (1, 3, 2, 1, 3)) = \alpha_5(1) + \alpha_5(2) + \alpha_5(3).$

The Decoding Problem: The second question we ask is: what did the teacher's mood curve most likely look like that week? We use the Viterbi method.

(1) Initialization:

$\delta_1(1) = \pi_1 b_1(1)$
$\delta_1(2) = \pi_2 b_2(1) = 0.06$
$\delta_1(3) = \pi_3 b_3(1) = 0$

(2) Induction:

$\delta_2(1) = \max_{1 \le i \le 3} [\delta_1(i) a_{i1}]\, b_1(3) = \max[0.007, 0.012, 0] \times 0.1 = 0.0012$
$\psi_2(1) = \arg\max_{1 \le i \le 3} [\delta_1(i) a_{i1}] = 2$
$\delta_2(2) = \max_{1 \le i \le 3} [\delta_1(i) a_{i2}]\, b_2(3) = \max[0.0105, 0.012, 0] \times 0.3 = 0.0036$
$\psi_2(2) = \arg\max_{1 \le i \le 3} [\delta_1(i) a_{i2}] = 2$

Using the same method, we obtain the values of all variables $\delta_t(i)$ and $\psi_t(i)$, where $1 \le t \le 5$, $1 \le i \le 3$. See Table 2.
Table 2. The Variables $\delta_t(i)$ and $\psi_t(i)$

(3) Termination:

$\delta^* = \max_{1 \le i \le 3} [\delta_5(i)]$
$i_5^* = \arg\max_{1 \le i \le 3} [\delta_5(i)] = 3$

Finally, the hidden state sequence is given by $i_t^* = \psi_{t+1}(i_{t+1}^*)$, $t = 4, 3, 2, 1$. Hence, this teacher's mood curve is (2, 3, 2, 1, 3).

The Learning Problem: The third problem is how to adjust the model parameters $(A, B, \pi)$ to maximize $P_\lambda(Y_1^5 = (1, 3, 2, 1, 3))$. We use the Baum-Welch method.
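The Viterbi procedure just applied can be sketched in the same illustrative style (Python, 0-indexed; `A` and `B` are hypothetical stand-ins, only $\pi$ is from the text). With these stand-in parameters the decoded path happens to be `[1, 2, 1, 0, 2]`, i.e. (2, 3, 2, 1, 3) in the 1-based state labels.

```python
def viterbi(pi, A, B, obs):
    """Most likely hidden state path (0-indexed) for the observed symbols obs."""
    N = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]       # initialization
    psi = []
    for y in obs[1:]:                                      # induction
        back, new_delta = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best)
            new_delta.append(delta[best] * A[best][j] * B[j][y])
        psi.append(back)
        delta = new_delta
    state = max(range(N), key=lambda i: delta[i])          # termination
    path = [state]
    for back in reversed(psi):                             # backtracking
        state = back[state]
        path.append(state)
    path.reverse()
    return path

pi = [0.05, 0.2, 0.75]                                     # from the text
A = [[0.2, 0.3, 0.5], [0.2, 0.2, 0.6], [0.0, 0.2, 0.8]]    # hypothetical
B = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.0, 0.1, 0.9]]    # hypothetical
path = viterbi(pi, A, B, [0, 2, 1, 0, 2])
```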
To use this method, we need the values of the backward variables $\beta_t(i)$.

(1) Initialization:

$\beta_5(1) = 1, \quad \beta_5(2) = 1, \quad \beta_5(3) = 1.$

(2) Induction ($t = 4, 3, 2, 1$):

$\beta_4(1) = \sum_{i_5=1}^{3} a_{1 i_5} b_{i_5}(j_5) \beta_5(i_5) = a_{11} b_1(3) + a_{12} b_2(3) + a_{13} b_3(3) = 0.56,$

$\beta_4(2) = \sum_{i_5=1}^{3} a_{2 i_5} b_{i_5}(j_5) \beta_5(i_5) = a_{21} b_1(3) + a_{22} b_2(3) + a_{23} b_3(3) = 0.62,$
Table 3. The Backward Variables $\beta_t(i)$

$\beta_4(3) = \sum_{i_5=1}^{3} a_{3 i_5} b_{i_5}(j_5) \beta_5(i_5) = a_{31} b_1(3) + a_{32} b_2(3) + a_{33} b_3(3).$

Using the same method, we obtain the values of all variables $\beta_t(i)$, where $1 \le t \le 5$, $1 \le i \le 3$. See Table 3. We can use this table and follow the backward procedure to compute $P_\lambda(Y_1^5 = (1, 3, 2, 1, 3))$:

$P_\lambda(Y_1^5 = (1, 3, 2, 1, 3)) = \sum_{i_1=1}^{3} \beta_1(i_1) b_{i_1}(j_1) \pi_{i_1}$
$= \beta_1(1) b_1(1) \pi_1 + \beta_1(2) b_2(1) \pi_2 + \beta_1(3) b_3(1) \pi_3.$

We can compare this result with the one obtained before. The next step is to compute the values of $\xi_t(i, j)$ and $\gamma_t(i)$, where $1 \le t \le 4$, $1 \le i, j \le 3$; see Table 4. Once we have these tables, it is easy to obtain a new model $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})$, with

$\bar{A} = [\bar{a}_{ij}], \qquad \bar{B} = [\bar{b}_j(k)], \qquad \bar{\pi} = (\bar{\pi}_1\ \ \bar{\pi}_2\ \ \bar{\pi}_3).$

Therefore, we obtain a new model $\bar{\lambda}$, which is better than $\lambda$. We may check this by using $\bar{\lambda}$ to solve the evaluation problem again and comparing $P_{\bar{\lambda}}(Y_1^5 = (1, 3, 2, 1, 3))$ with the value computed above.
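The backward recursion, and the check that it reproduces the probability computed by the forward procedure, can be sketched as follows (an illustrative Python version with hypothetical `A` and `B`; only $\pi$ is from the text).

```python
def forward_prob(pi, A, B, obs):
    """Total observation probability via the forward recursion."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for y in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][y]
                 for j in range(N)]
    return sum(alpha)

def backward_prob(pi, A, B, obs):
    """Total observation probability via the backward recursion."""
    N = len(pi)
    beta = [1.0] * N                         # initialization: beta_T(i) = 1
    for y in reversed(obs[1:]):              # induction, t = T-1, ..., 1
        beta = [sum(A[i][j] * B[j][y] * beta[j] for j in range(N))
                for i in range(N)]
    # termination: P(Y) = sum_i beta_1(i) * b_i(y_1) * pi_i
    return sum(beta[i] * B[i][obs[0]] * pi[i] for i in range(N))

pi = [0.05, 0.2, 0.75]                                     # from the text
A = [[0.2, 0.3, 0.5], [0.2, 0.2, 0.6], [0.0, 0.2, 0.8]]    # hypothetical
B = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.0, 0.1, 0.9]]    # hypothetical
obs = [0, 2, 1, 0, 2]
p_fwd = forward_prob(pi, A, B, obs)
p_back = backward_prob(pi, A, B, obs)
```

The two procedures agree up to floating-point error, which is exactly the comparison suggested in the text.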
Table 4. The Variables $\xi_t(i, j)$ and $\gamma_t(i)$: (a) $\xi_t(i, j)$, $t = 1, 2, 3, 4$; (b) $\gamma_t(i)$
4 HMM Analysis by Matlab

4.1 How to generate A, B and π

First, let us discuss how to generate an initial distribution $\pi$. Here we consider only finitely many states. Suppose we have $N$ hidden states; then the initial distribution is $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$. We can think of these $\pi_i$ as the $N$ parameters of a multinomial distribution. When we estimate the parameters of the binomial distribution using Bayesian models, we usually use the two-parameter beta family as the prior. This class of distributions has the remarkable property that the resulting posterior distributions are again beta distributions; this property is called conjugacy. So it is natural to ask what the conjugate prior for the multinomial is. The answer is the Dirichlet distribution, which in some sense extends the beta distribution to higher-dimensional spaces.

Definition: The Dirichlet distribution $D(\alpha)$, where $\alpha = (\alpha_1, \ldots, \alpha_r)$, $\alpha_j > 0$, $1 \le j \le r$, is supported on the $(r-1)$-dimensional simplex

$\Delta = \{(u_1, \ldots, u_{r-1}) : \sum_{i=1}^{r-1} u_i \le 1,\ 0 \le u_i \le 1,\ i = 1, \ldots, r-1\},$
and the probability density is given by

$f(u_1, \ldots, u_{r-1}) = \dfrac{\Gamma(\sum_{j=1}^{r} \alpha_j)}{\prod_{j=1}^{r} \Gamma(\alpha_j)} \prod_{j=1}^{r-1} u_j^{\alpha_j - 1} \Big(1 - \sum_{j=1}^{r-1} u_j\Big)^{\alpha_r - 1}$ if $(u_1, \ldots, u_{r-1}) \in \Delta$,

and $f(u_1, \ldots, u_{r-1}) = 0$ if $(u_1, \ldots, u_{r-1}) \in \mathbb{R}^{r-1} \setminus \Delta$.

Lemma 4.1. Let $N = (N_1, \ldots, N_r)$ have the multinomial distribution $M(n, \theta)$, where $\theta = (\theta_1, \ldots, \theta_r)$, $0 < \theta_j < 1$, $\sum_{j=1}^{r} \theta_j = 1$. If the prior distribution $\pi(\theta)$ for $\theta$ is $D(\alpha)$, then the posterior distribution $\pi(\theta \mid N = n)$ is $D(\alpha + n)$, where $\alpha = (\alpha_1, \ldots, \alpha_r)$, $n = (n_1, \ldots, n_r)$.

Proof: For the vector $\theta = (\theta_1, \ldots, \theta_r)$, we can rewrite it as $\theta = (\theta_1, \ldots, \theta_{r-1})$ and define $\theta_r = 1 - \sum_{j=1}^{r-1} \theta_j$. Then we can write the prior out:

$\pi(\theta) = \dfrac{\Gamma(\sum_{j=1}^{r} \alpha_j)}{\prod_{j=1}^{r} \Gamma(\alpha_j)} \prod_{j=1}^{r-1} \theta_j^{\alpha_j - 1} \Big(1 - \sum_{j=1}^{r-1} \theta_j\Big)^{\alpha_r - 1}.$

By Bayes' rule, with $\Delta = \{(t_1, \ldots, t_{r-1}) : \sum_{i=1}^{r-1} t_i \le 1,\ 0 \le t_i \le 1,\ i = 1, \ldots, r-1\}$, we have

$\pi(\theta \mid N = n) = \dfrac{\pi(\theta) P(N = n \mid \theta)}{\int_\Delta \pi(t) P(N = n \mid t)\, d(t_1, \ldots, t_{r-1})} = C \prod_{j=1}^{r-1} \theta_j^{\alpha_j + n_j - 1} \Big(1 - \sum_{j=1}^{r-1} \theta_j\Big)^{\alpha_r + n_r - 1}.$

The proportionality constant $C$, which depends on $\alpha$ and $n$ only, must be $\Gamma(\sum_{j=1}^{r} (\alpha_j + n_j)) \big/ \prod_{j=1}^{r} \Gamma(\alpha_j + n_j)$, and the posterior distribution of $\theta$ given $n$ is $D(\alpha + n)$.
Lemma 4.2 (see [9]). Let $d$ be an integer greater than 1. Let $X_1, \ldots, X_d$ be independent random variables, where $X_i$ has the gamma distribution with parameters $(a, \gamma_i)$. Then the random vector

$(Y_1, \ldots, Y_d) = \dfrac{1}{X_1 + \cdots + X_d} (X_1, \ldots, X_d)$

has the Dirichlet distribution.

Proof: For the random vector $(Y_1, \ldots, Y_d)$, we have $Y_d = 1 - \sum_{j=1}^{d-1} Y_j$. Therefore, the distribution of $(Y_1, \ldots, Y_d)$ is actually supported on a $(d-1)$-dimensional simplex $\Delta_0$,

$\Delta_0 = \{(y_1, \ldots, y_{d-1}) : \sum_{i=1}^{d-1} y_i \le 1,\ 0 \le y_i \le 1,\ i = 1, \ldots, d-1\}.$

For a Borel subset $A$ of $\Delta = \{(y_1, \ldots, y_d) : \sum_{i=1}^{d} y_i = 1,\ 0 \le y_i \le 1,\ i = 1, \ldots, d\}$, let $B = \{(x_1, \ldots, x_d) : \frac{1}{x_1 + \cdots + x_d}(x_1, \ldots, x_d) \in A\}$. Because $X_1, \ldots, X_d$ are independent, we can write out the probability distribution of the random vector $(X_1, \ldots, X_d)$:

$P\big(\big\{\omega : \tfrac{1}{X_1(\omega) + \cdots + X_d(\omega)}(X_1(\omega), \ldots, X_d(\omega)) \in A\big\}\big) = \int_B \prod_{i=1}^{d} \dfrac{a^{\gamma_i} x_i^{\gamma_i - 1}}{\Gamma(\gamma_i)} e^{-a x_i}\, d(x_1, \ldots, x_d),$

where $d(x_1, \ldots, x_d)$ indicates that the integration is with respect to Lebesgue measure in $\mathbb{R}^d$. We make a change of variables: $s = x_1 + \cdots + x_d$, $y_i = x_i / s$, $1 \le i \le d-1$.
The Jacobian of $(x_1, \ldots, x_d)$ with respect to $(y_1, \ldots, y_{d-1}, s)$ is $s^{d-1}$. After some calculation, the above integral equals

$\int_C \dfrac{\Gamma(\sum_{j=1}^{d} \gamma_j)}{\prod_{j=1}^{d} \Gamma(\gamma_j)} \prod_{j=1}^{d-1} y_j^{\gamma_j - 1} \Big(1 - \sum_{j=1}^{d-1} y_j\Big)^{\gamma_d - 1}\, d(y_1, \ldots, y_{d-1}),$

where $C = \{(y_1, \ldots, y_{d-1}) : (y_1, \ldots, y_{d-1}, 1 - y_1 - \cdots - y_{d-1}) \in A\}$. Thus we have a density for $\frac{1}{X_1 + \cdots + X_d}(X_1, \ldots, X_d)$ with respect to Lebesgue measure on $\Delta_0$.

Lemma 4.3. Let $X$ be a real-valued random variable with continuous distribution function $F$. Then $F(X)$ is uniformly distributed on $[0, 1]$. Conversely, let $Y$ be a real-valued random variable uniformly distributed on $[0, 1]$; then $F^{-1}(Y)$ has the distribution $F$.

Proof: Let us prove the second part of this lemma. Define $X = F^{-1}(Y)$. Then

$P(X \le x) = P(F^{-1}(Y) \le x) = P(Y \le F(x)).$

Since $Y$ is uniformly distributed on $[0, 1]$, this probability equals $F(x)$, which means $X$ has the distribution $F$.

We use Lemma 4.2 and take $a = 1$, $\gamma_i = 1$, $1 \le i \le d$. Then $X_1, \ldots, X_d$ have the exponential distribution with parameter 1. To generate the exponential distribution randomly, we apply Lemma 4.3: generate a random variable $Y$ with the uniform distribution on $[0, 1]$ first, then apply the function $F^{-1}(y) = -\ln(1 - y)$ to $Y$,
where $F$ is the distribution function of the exponential distribution. In this way we may generate the distribution $\pi$, and likewise each row of $A$ and $B$.

4.2 How to test the Baum-Welch method

Idea: First, we generate a model $\lambda$, which is regarded as the true value. Then we use $\lambda$ to generate some sample sequences $Y_1, \ldots, Y_n$. To use the Baum-Welch method, we need a seed $\lambda_0$, which we can think of as an initial value. For this seed $\lambda_0$ and each sequence $Y_i$, we run the program to get a new model $\lambda new_i = (Anew_i, Bnew_i, \pi new_i)$, where $1 \le i \le n$. Finally, we get an estimate of $\lambda$, namely $\lambda new = (Anew, Bnew, \pi new)$, where

$Anew = \dfrac{\sum_{i=1}^{n} Anew_i}{n}, \qquad Bnew = \dfrac{\sum_{i=1}^{n} Bnew_i}{n}, \qquad \pi new = \dfrac{\sum_{i=1}^{n} \pi new_i}{n}.$

We can then compare this estimate $\lambda new$ with the true value $\lambda$ to see how well the Baum-Welch method works. We also define a distance between two matrices $A = (a_{ij})$ and $B = (b_{ij})$ by $d(A, B)$, where
56 N M d(a, B) = (a ij b ij ) 2, i=1 j=1 1 i N, 1 j M. For example, we take N = M = 3, T = 20, and n = 10, 20, 30 respectively. Step1: λ and λ 0. In this step we generate a true value λ and a seed λ 0. λ = ( π, A, B ), π = ( ), A = , B = , 47
$\lambda_0 = (\pi_0, A_0, B_0).$

Step 2: Generate the sequences $Y_i$, taking $n = 30$, $1 \le i \le n$; see Figure 4.1.

Step 3: Get the new model $\lambda new$. Step 3 is divided into two parts. In the first part, we use the seed $\lambda_0$ to run the program; in the second part, we take the seed equal to $\lambda$.
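The row-generation recipe of Section 4.1 — uniform variates pushed through $F^{-1}(u) = -\ln(1-u)$ to obtain Exp(1) draws, then normalized to obtain a Dirichlet(1, ..., 1) row — can be sketched as follows (an illustrative Python version of what the Matlab program does; the seed value is arbitrary).

```python
import math
import random

def random_stochastic_row(n, rng):
    # Lemma 4.3: if U ~ Uniform(0, 1), then -ln(1 - U) ~ Exponential(1)
    x = [-math.log(1.0 - rng.random()) for _ in range(n)]
    # Lemma 4.2 with a = 1, gamma_i = 1: normalizing i.i.d. Exp(1) variables
    # yields a Dirichlet(1, ..., 1) vector, i.e. a uniform point on the simplex
    total = sum(x)
    return [xi / total for xi in x]

rng = random.Random(0)                                   # fixed seed, reproducible
pi0 = random_stochastic_row(3, rng)                      # a random initial distribution
A0 = [random_stochastic_row(3, rng) for _ in range(3)]   # a random 3x3 transition matrix
```

Each generated row is nonnegative and sums to one, so stacking such rows gives valid candidates for $\pi$, $A$, and $B$.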
Figure 4.1 The generated observable sequences $Y_i$, $1 \le i \le 30$.
The reason we do this is to show that the Baum-Welch method only works when the initial model $\lambda_0$ is in a neighborhood $V$ of $\lambda_q$, where $\lambda_q$ is a local maximum of the function $P = P_\lambda(Y_0^T = r_0^T)$. If we use the seed $\lambda_0$, we will see that we cannot get a good estimate of $\lambda$; the reason is that $\lambda_0$ is not in a small enough neighborhood of $\lambda$. In the second part, if we use $\lambda$ itself as the seed, we will get a good result: we expect $\lambda$ to be (near) a local maximum, and it lies in every neighborhood of itself, so this seed satisfies the condition of the Baum-Welch method.

Part I: use $\lambda_0 = (A_0, B_0, \pi_0)$ as the seed. We use the samples $\{Y_1, \ldots, Y_{10}\}$, $\{Y_1, \ldots, Y_{20}\}$, and $\{Y_1, \ldots, Y_{30}\}$ respectively to get three new models $\lambda new_{1\text{-}10}$, $\lambda new_{1\text{-}20}$, $\lambda new_{1\text{-}30}$; see Figure 4.2. We then calculate the distances between them and $\lambda$. From Table 5 we can see that the distances between the new models and $\lambda$ remain large even as the sample size increases. This means we cannot get a good estimate from the seed $\lambda_0$.

Table 5. The Distances, Seed $\lambda_0$: $d(\pi, \pi new)$, $d(A, Anew)$, $d(B, Bnew)$ for $\lambda new_{1\text{-}10}$, $\lambda new_{1\text{-}20}$, $\lambda new_{1\text{-}30}$
Figure 4.2 Three new models corresponding to the samples $\{Y_1, \ldots, Y_{10}\}$, $\{Y_1, \ldots, Y_{20}\}$, $\{Y_1, \ldots, Y_{30}\}$, with seed $\lambda_0$.
Table 6. The Distances, Seed $\lambda$: $d(\pi, \pi new)$, $d(A, Anew)$, $d(B, Bnew)$ for $\lambda new_{1\text{-}10}$, $\lambda new_{1\text{-}20}$, $\lambda new_{1\text{-}30}$

Part II: use $\lambda = (A, B, \pi)$ as the seed. In the same way, we use the samples $\{Y_1, \ldots, Y_{10}\}$, $\{Y_1, \ldots, Y_{20}\}$, $\{Y_1, \ldots, Y_{30}\}$ respectively to get three new models $\lambda new_{1\text{-}10}$, $\lambda new_{1\text{-}20}$, and $\lambda new_{1\text{-}30}$. The results are listed in Figure 4.3, and the distances between them and $\lambda$ are listed in Table 6. The table indicates that we get a good estimate of $\lambda$: the new model is very close to $\lambda$.

Discussion: When we regard $P_\lambda(Y_0^T = r_0^T)$ as a function of $\lambda$, it may have more than one local maximum value, for example $P_1$ and $P_2$ attained at $\lambda_1$ and $\lambda_2$. By the Baum-Welch method, we may get a good estimate of one of them, but only if the initial model we use is in a neighborhood of one of them, and the closer the better. The true value $\lambda$ we used here may not be a local maximum point of the function $P_\lambda(Y_0^T = r_0^T)$. To get each new model $\lambda new_i$, $1 \le i \le 30$, we ran the program through 10 iterations; this may not be enough, because we do not know the speed of convergence.
Figure 4.3 Three new models corresponding to the samples $\{Y_1, \ldots, Y_{10}\}$, $\{Y_1, \ldots, Y_{20}\}$, $\{Y_1, \ldots, Y_{30}\}$, with seed $\lambda$.
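The distance measure and the model-averaging step of this test can be sketched as follows (illustrative Python with toy 2×2 matrices; the distance is taken as the sum of squared entrywise differences, as in the formula above, and the matrices here are hypothetical).

```python
def matrix_distance(A, B):
    """d(A, B) = sum over i, j of (a_ij - b_ij)^2 for equally sized matrices."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))

def average_matrices(mats):
    """Elementwise mean of a list of equally sized matrices,
    e.g. Anew = (1/n) * sum of Anew_i."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]

# toy check with hypothetical 2x2 matrices
M1 = [[1.0, 0.0], [0.0, 1.0]]
M2 = [[0.0, 1.0], [1.0, 0.0]]
avg = average_matrices([M1, M2])
d = matrix_distance(M1, M2)
```

In the experiment above, `average_matrices` would be applied to the $n$ reestimated models and `matrix_distance` to compare the average against the true $\lambda$.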
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationHidden Markov Models. Terminology and Basic Algorithms
Hidden Markov Models Terminology and Basic Algorithms The next two weeks Hidden Markov models (HMMs): Wed 9/11: Terminology and basic algorithms Mon 14/11: Implementing the basic algorithms Wed 16/11:
More informationMAS114: Solutions to Exercises
MAS114: s to Exercises Up to week 8 Note that the challenge problems are intended to be difficult! Doing any of them is an achievement. Please hand them in on a separate piece of paper if you attempt them.
More informationHidden Markov Models. Terminology, Representation and Basic Problems
Hidden Markov Models Terminology, Representation and Basic Problems Data analysis? Machine learning? In bioinformatics, we analyze a lot of (sequential) data (biological sequences) to learn unknown parameters
More information11.3 Decoding Algorithm
11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence
More informationToday s Lecture: HMMs
Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models
More information( ).666 Information Extraction from Speech and Text
(520 600).666 Information Extraction from Speech and Text HMM Parameters Estimation for Gaussian Output Densities April 27, 205. Generalization of the Results of Section 9.4. It is suggested in Section
More informationDynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models
6 Dependent data The AR(p) model The MA(q) model Hidden Markov models Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories
More informationCS 7180: Behavioral Modeling and Decision- making in AI
CS 7180: Behavioral Modeling and Decision- making in AI Learning Probabilistic Graphical Models Prof. Amy Sliva October 31, 2012 Hidden Markov model Stochastic system represented by three matrices N =
More informationCS 6820 Fall 2014 Lectures, October 3-20, 2014
Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given
More informationHidden Markov Model and Speech Recognition
1 Dec,2006 Outline Introduction 1 Introduction 2 3 4 5 Introduction What is Speech Recognition? Understanding what is being said Mapping speech data to textual information Speech Recognition is indeed
More informationDoubly Indexed Infinite Series
The Islamic University of Gaza Deanery of Higher studies Faculty of Science Department of Mathematics Doubly Indexed Infinite Series Presented By Ahed Khaleel Abu ALees Supervisor Professor Eissa D. Habil
More informationHidden Markov Models. Three classic HMM problems
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems
More informationMarkov Chains and Hidden Markov Models
Chapter 1 Markov Chains and Hidden Markov Models In this chapter, we will introduce the concept of Markov chains, and show how Markov chains can be used to model signals using structures such as hidden
More informationComputational Biology Lecture #3: Probability and Statistics. Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept
Computational Biology Lecture #3: Probability and Statistics Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept 26 2005 L2-1 Basic Probabilities L2-2 1 Random Variables L2-3 Examples
More informationVL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16
VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 16 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Based on slides by
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationWhat s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction
Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454 Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) standard sequence model in genomics, speech, NLP, What
More informationHidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationMATH3200, Lecture 31: Applications of Eigenvectors. Markov Chains and Chemical Reaction Systems
Lecture 31: Some Applications of Eigenvectors: Markov Chains and Chemical Reaction Systems Winfried Just Department of Mathematics, Ohio University April 9 11, 2018 Review: Eigenvectors and left eigenvectors
More informationAn Introduction to Bioinformatics Algorithms Hidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More informationStatistical Methods for NLP
Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition
More informationMachine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017
1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationIntroduction to Artificial Intelligence (AI)
Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 10 Oct, 13, 2011 CPSC 502, Lecture 10 Slide 1 Today Oct 13 Inference in HMMs More on Robot Localization CPSC 502, Lecture
More informationNatural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay
Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Lecture - 21 HMM, Forward and Backward Algorithms, Baum Welch
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More information