A Study of Hidden Markov Model


1 University of Tennessee, Knoxville Trace: Tennessee Research and Creative Exchange Masters Theses Graduate School A Study of Hidden Markov Model Yang Liu University of Tennessee - Knoxville Recommended Citation Liu, Yang, "A Study of Hidden Markov Model. " Master's Thesis, University of Tennessee, This Thesis is brought to you for free and open access by the Graduate School at Trace: Tennessee Research and Creative Exchange. It has been accepted for inclusion in Masters Theses by an authorized administrator of Trace: Tennessee Research and Creative Exchange. For more information, please contact trace@utk.edu.

2 To the Graduate Council: I am submitting herewith a thesis written by Yang Liu entitled "A Study of Hidden Markov Model." I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Mathematics. We have read this thesis and recommend its acceptance: Xia Chen, Balram Rajput (Original signatures are on file with official student records.) Jan Rosinski, Major Professor Accepted for the Council: Dixie L. Thompson Vice Provost and Dean of the Graduate School

3 To the Graduate Council: I am submitting herewith a thesis written by Yang Liu entitled A Study of Hidden Markov Model. I have examined the final electronic copy of this thesis for form and content and recommend that it be accepted in partial fulfillment of the requirements for the degree of Master of Science, with a major in Mathematics. We have read this thesis and recommend its acceptance: Jan Rosinski Major Professor Xia Chen Balram Rajput Accepted for the Council: Anne Mayhew Vice Chancellor and Dean of Graduate Studies (Original signatures are on file with official student records.)

4 A STUDY OF HIDDEN MARKOV MODEL A Thesis Presented for the Master of Science Degree The University of Tennessee, Knoxville Yang Liu August 2004

DEDICATION

This thesis is dedicated to my parents, Jiaquan Liu and Xiaobo Yang, for consistently supporting me, inspiring me, and encouraging me.

ACKNOWLEDGMENTS

I would like to thank my parents for their consistent support and encouragement. I especially express my deepest appreciation to my advisor, Dr. Jan Rosinski, for his patient, friendly, and unfailing support over the past two years. I also want to thank the other committee members, Dr. Balram Rajput and Dr. Xia Chen, for their comments and suggestions. I would also like to thank Xiaohu Tang for his help.

ABSTRACT

The purpose of this thesis is fourfold: to introduce the definition of the Hidden Markov Model; to present three problems about HMMs that must be solved for the model to be useful in real-world applications; to cover the solutions of these three basic problems, explaining some algorithms in detail together with their mathematical proofs; and finally, to give some examples showing how these algorithms work.

TABLE OF CONTENTS

1. Introduction
2. What is A Hidden Markov Model
3. Three Basic Problems of HMMs and Their Solutions
   3.1 The Evaluation Problem
       3.1.1 The Forward Procedure
       3.1.2 The Backward Procedure
       3.1.3 Summary of the Evaluation Problem
   3.2 The Decoding Problem
       3.2.1 Viterbi Algorithm
       3.2.2 The δ-θ Algorithm
   3.3 The Learning Problem
   3.4 An Example
4. HMMs Analysis by Matlab
   4.1 How to Generate A, B, π
   4.2 How to Test the Baum-Welch Method
References
Appendices
   Appendix A: The New Models Generated in Section 4.2, Step 3
   Appendix B: Matlab Program
Vita

LIST OF TABLES

Table 1. The Forward Variables $\alpha_t(i)$
Table 2. The Variables $\delta_t(i)$ and $\psi_t(i)$
Table 3. The Backward Variables $\beta_t(i)$
Table 4. The Variables $\xi_t(i, j)$ and $\gamma_t(i)$
Table 5. The Distances, Seed $\lambda_0$
Table 6. The Distances, Seed $\lambda$

1 INTRODUCTION

Many of the most powerful sequence analysis methods are now based on principles of probabilistic modeling, such as Hidden Markov Models. Hidden Markov Models (HMMs) are very powerful and flexible. They originally emerged in the domain of speech recognition, where over the past several years they have become the most successful speech models, and in recent years they have been widely used in many other fields, including molecular biology, biochemistry and genetics, as well as computer vision.

An HMM is a probabilistic model composed of a number of interconnected states, each of which emits an observable output. Each state has two kinds of parameters. First, symbol emission probabilities describe the probabilities of the possible outputs from the state; second, state transition probabilities specify the probability of moving to a new state from the current one. An observed sequence of symbols is generated by starting at some initial state and moving probabilistically from state to state until some terminal state is reached, emitting an observable symbol from each state passed through. The sequence of states is a first-order Markov chain. This state sequence is hidden; only the sequence of symbols that it emits is observable, hence the term hidden Markov model.

This paper is organized in the following manner. The definition as well as some

necessary assumptions about HMMs are explained first. The three basic problems of HMMs are then discussed, and their solutions are given. To make HMMs useful in real-world applications, we must know how to solve these problems: the evaluation problem, the decoding problem and the learning problem. In this part some useful algorithms are introduced, including the forward procedure, the backward procedure, the Viterbi algorithm, the δ-θ algorithm and the Baum-Welch method. Lastly, the computer programs by which HMMs are simulated are covered.

2 WHAT IS A HIDDEN MARKOV MODEL

A Hidden Markov Model is a stochastic model comprising an unobserved Markov chain $(X_t : t = 0, 1, \dots)$ and an observable process $(Y_t : t = 0, 1, \dots)$. In this paper we consider only the case where the observations are discrete symbols chosen from a finite set, so we can use a discrete probability distribution within each state.

To define an HMM, we need some elements. The number of hidden states is $N$; we denote these $N$ states by $s_1, s_2, \dots, s_N$. The number of observable symbols is $M$; we denote them by $r_1, r_2, \dots, r_M$. Specifically, $(X_t : t = 0, 1, \dots)$ is a Markov chain with transition probability matrix $A = [a_{ij}]$ and an initial state distribution $\pi = (\pi_i)$, where $1 \le i, j \le N$. By the properties of Markov chains, we know that
$$a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i) = P(X_{t+1} = s_j \mid X_t = s_i, X_{t-1} = s_k, \dots, X_0 = s_l), \quad t = 0, 1, \dots,$$
for every $s_i, s_j, \dots, s_k$ and $s_l$, where $1 \le i, j, k, l \le N$. The initial state distribution of the Markov chain is given by $\pi_i = P(X_0 = s_i)$.

For a finite observation sequence $(Y_t : t = 0, 1, \dots, T)$, where $T$ is any fixed number, we make a fundamental assumption connecting the hidden state sequence $(X_t : t = 0, 1, \dots, T)$ and the observation sequence, namely the conditional independence of the observations $(Y_t : t = 0, 1, \dots, T)$ given the states. Formulated mathematically,
$$P(Y_0 = r_{j_0}, Y_1 = r_{j_1}, \dots, Y_T = r_{j_T} \mid X_0 = s_{i_0}, X_1 = s_{i_1}, \dots, X_T = s_{i_T}) = \prod_{t=0}^{T} P(Y_t = r_{j_t} \mid X_t = s_{i_t}), \quad (1)$$
where $1 \le i_0, i_1, \dots, i_T \le N$ and $1 \le j_0, j_1, \dots, j_T \le M$.

To simplify the notation, we denote the state sequence $(s_{i_0}, s_{i_1}, \dots, s_{i_T})$ by $s_0^T$ and the symbol sequence $(r_{j_0}, r_{j_1}, \dots, r_{j_T})$ by $r_0^T$, and we write $X_0^T = (X_0, X_1, \dots, X_T)$ and $Y_0^T = (Y_0, Y_1, \dots, Y_T)$. Then we put
$$b_j(k) = P(Y_t = r_k \mid X_t = s_j), \quad 1 \le j \le N,\ 1 \le k \le M,\ t = 0, 1, 2, \dots$$
We may rewrite formula (1) as
$$P(Y_0^T = r_0^T \mid X_0^T = s_0^T) = \prod_{t=0}^{T} b_{i_t}(j_t). \quad (2)$$

So far, we know a Hidden Markov Model has several components: a set of states $s_1, s_2, \dots, s_N$, a set of output symbols $r_1, r_2, \dots, r_M$, a set of transitions each carrying a probability and an output symbol, and a starting state. When a transition is taken, it produces an output symbol. The complicating factor is that the output symbol given is not necessarily unique to that transition, and thus it

is difficult to determine which transition was the one actually taken; this is why such models are termed hidden. See Figure 2.1.

From the theory of Markov chains, we know $P(X_0^T = s_0^T)$, the probability of the state sequence $s_0^T$:
$$P(X_0^T = s_0^T) = P(X_0 = s_{i_0}, X_1 = s_{i_1}, \dots, X_T = s_{i_T}) = \pi_{i_0}\, a_{i_0 i_1} \cdots a_{i_{T-1} i_T} = \prod_{t=0}^{T} a_{i_{t-1} i_t},$$
where we set $a_{i_{-1} i_0} = \pi_{i_0}$. Hence the joint probability of $s_0^T$ and $r_0^T$ is
$$P(X_0^T = s_0^T, Y_0^T = r_0^T) = P(Y_0^T = r_0^T \mid X_0^T = s_0^T)\, P(X_0^T = s_0^T) = \prod_{t=0}^{T} [\,b_{i_t}(j_t)\, a_{i_{t-1} i_t}\,]. \quad (3)$$

The transition probability matrix $A$, the initial state distribution $\pi$ and the matrix $B = [b_j(k)]$, $1 \le j \le N$, $1 \le k \le M$, define a Hidden Markov Model completely. Therefore we can use the compact notation $\lambda = (A, B, \pi)$ to denote a Hidden Markov Model with discrete probability distributions. We may think of $\lambda$ as the parameter of the Hidden Markov Model. We denote the probability given a model $\lambda$ by $P_\lambda$ from now on.

Lemma 2.1. $P_\lambda(Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \dots, i_T \le N}\ \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t)$, where $a_{i_{-1} i_0} = \pi_{i_0}$.

[Figure 2.1 HMM: the hidden chain $X_0 \to X_1 \to X_2 \to \cdots \to X_{T-1} \to X_T$, with each state $X_t$ emitting the observation $Y_t$.]

Proof: We want to compute $P_\lambda(Y_0^T = r_0^T)$, the probability of the observation sequence given the model $\lambda$. From time 0 to time $T$, we consider every possible hidden state sequence $s_0^T$. The probability of $r_0^T$ is then obtained by summing the joint probability of $r_0^T$ and $s_0^T$ over all possible $s_0^T$, that is,
$$P_\lambda(Y_0^T = r_0^T) = \sum_{\text{all possible } s_0^T} P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \dots, i_T \le N}\ \prod_{t=0}^{T} [\,a_{i_{t-1} i_t}\, b_{i_t}(j_t)\,]. \quad (4)$$

Lemma 2.1 gives us a method to compute the probability of a sequence of observations $r_0^T$. Unfortunately, this direct calculation is computationally infeasible, because the formula involves on the order of $(T+1)\,N^{T+1}$ calculations. We introduce some more efficient methods in the next section.

We can understand Lemma 2.1 from another point of view. A hidden Markov model consists of a set of hidden states $s_1, s_2, \dots, s_N$ connected by directed edges. Each state assigns probabilities to the characters of the alphabet used in the observable sequence and to the edges leaving the state. A path in an HMM, $s_{i_0}, s_{i_1}, \dots, s_{i_T}$, is a sequence of states such that there is an edge from each state in the path to the next state in the path. The probability of this path is the product of the probabilities of the edges traversed, that is, $P(X_0^T = s_0^T)$.
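The sum in Lemma 2.1 can be evaluated directly by enumerating all hidden paths, which makes the $(T+1)\,N^{T+1}$ cost concrete. A minimal sketch follows; the two-state model and all its numbers are hypothetical, not taken from the thesis:

```python
import itertools
import numpy as np

def likelihood_bruteforce(A, B, pi, obs):
    """Evaluate Lemma 2.1 directly: sum the path probabilities over all
    N**(T+1) hidden state sequences. Feasible only for tiny N and T."""
    N = len(pi)
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        p = pi[path[0]] * B[path[0], obs[0]]          # pi_{i_0} b_{i_0}(j_0)
        for t in range(1, len(obs)):
            p *= A[path[t-1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

# Hypothetical 2-state, 2-symbol model (numbers are my own).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 0]
print(likelihood_bruteforce(A, B, pi, obs))
```

Even for this toy case the loop visits $2^3 = 8$ paths; for realistic $N$ and $T$ the enumeration explodes, which is exactly why the recursive procedures of the next section are needed.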

Each path through the HMM gives a probability distribution for each position in a string of the same length, based on the probabilities for the characters in the corresponding states. The probability of the observable sequence given a particular path is the product of the probabilities of the characters, that is,
$$P(Y_0^T = r_0^T \mid X_0^T = s_0^T) = \prod_{t=0}^{T} b_{i_t}(j_t).$$
The probability of any sequence of characters is the sum, over all paths whose length is the same as the sequence, of the probability of the path times the probability of the sequence given the path; this is precisely the result of Lemma 2.1. See Figure 2.2.

A Hidden Markov Model has a property very similar to that of a Markov process: given the value of $X_t$, the values of $Y_s$, $s \ge t$, do not depend on the values of $X_u$, $u < t$. The probability of any particular future observation of the model, when its present hidden state is known exactly, is not altered by additional knowledge concerning its past hidden behavior. In formal terms, we have Lemma 2.2.

Lemma 2.2. $P_\lambda(Y_u = r_{j_u} \mid X_0^t = s_0^t) = P_\lambda(Y_u = r_{j_u} \mid X_t = s_{i_t})$, for $u \ge t$.

Proof: First we prove that the result is true when $u = t$:
$$P_\lambda(Y_t = r_{j_t} \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \dots, j_{t-1} \le M} P_\lambda(Y_0^t = r_0^t \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \dots, j_{t-1} \le M} P_\lambda(Y_0^{t-1} = r_0^{t-1} \mid X_0^{t-1} = s_0^{t-1})\, P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t})$$

[Figure 2.2 Paths of An HMM: a trellis with states $s_1, s_2, \dots, s_N$ at each time $t = 0, 1, \dots, T-1, T$.]

$$= 1 \cdot P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}) = P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}).$$
In particular, for any $q \le t$ the same computation gives
$$P_\lambda(Y_t = r_{j_t} \mid X_q^t = s_q^t) = \sum_{1 \le j_q, \dots, j_{t-1} \le M} P_\lambda(Y_q^t = r_q^t \mid X_q^t = s_q^t) = \sum_{1 \le j_q, \dots, j_{t-1} \le M} P_\lambda(Y_q^{t-1} = r_q^{t-1} \mid X_q^{t-1} = s_q^{t-1})\, P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}) = P_\lambda(Y_t = r_{j_t} \mid X_t = s_{i_t}).$$

Next we prove that the result also holds when $u > t$:
$$P_\lambda(Y_u = r_{j_u} \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \dots, j_{u-1} \le M} P_\lambda(Y_0^u = r_0^u \mid X_0^t = s_0^t) = \sum_{1 \le j_0, \dots, j_{u-1} \le M}\ \sum_{1 \le i_{t+1}, \dots, i_u \le N} P_\lambda(Y_0^u = r_0^u, X_{t+1}^u = s_{t+1}^u \mid X_0^t = s_0^t)$$
$$= \sum_{1 \le j_0, \dots, j_{u-1} \le M}\ \sum_{1 \le i_{t+1}, \dots, i_u \le N} P_\lambda(Y_0^u = r_0^u \mid X_0^u = s_0^u)\, P_\lambda(X_{t+1}^u = s_{t+1}^u \mid X_t = s_{i_t})$$
$$= \sum_{1 \le j_0, \dots, j_{u-1} \le M}\ \sum_{1 \le i_{t+1}, \dots, i_u \le N} P_\lambda(Y_0^{u-1} = r_0^{u-1} \mid X_0^{u-1} = s_0^{u-1})\, P_\lambda(Y_u = r_{j_u} \mid X_u = s_{i_u})\, P_\lambda(X_{t+1}^u = s_{t+1}^u \mid X_t = s_{i_t})$$
$$= \sum_{1 \le i_{t+1}, \dots, i_u \le N} P_\lambda(Y_u = r_{j_u} \mid X_u = s_{i_u})\, P_\lambda(X_{t+1}^u = s_{t+1}^u \mid X_t = s_{i_t})$$

$$= P_\lambda(Y_u = r_{j_u} \mid X_t = s_{i_t}).$$

Lemma 2.2 will be widely used in the next section to help solve the basic problems of HMMs.

Remark. The observations $(Y_t, t \ge 0)$ are not independent in general.

Proof: Take $T = 1$. From Lemma 2.1 we have
$$P_\lambda(Y_0^1 = r_0^1) = \sum_{i_0, i_1 = 1}^{N} b_{i_0}(j_0)\, b_{i_1}(j_1)\, \pi_{i_0}\, a_{i_0 i_1}.$$
But $P_\lambda(Y_0 = r_{j_0}) = \sum_{i_0=1}^{N} b_{i_0}(j_0)\, \pi_{i_0}$, and
$$P_\lambda(Y_1 = r_{j_1}) = \sum_{i_1} P_\lambda(Y_1 = r_{j_1} \mid X_1 = s_{i_1})\, P_\lambda(X_1 = s_{i_1}) = \sum_{i_1} b_{i_1}(j_1) \Big[\sum_{i_0} a_{i_0 i_1}\, \pi_{i_0}\Big] = \sum_{i_0, i_1 = 1}^{N} b_{i_1}(j_1)\, a_{i_0 i_1}\, \pi_{i_0}.$$
Hence in general $P_\lambda(Y_0^1 = r_0^1) \ne P_\lambda(Y_0 = r_{j_0})\, P_\lambda(Y_1 = r_{j_1})$. Therefore the sequence $(Y_t, t \ge 0)$ is not independent.

Actually, this is very natural to understand: each $Y_t$ is generated by the corresponding $X_t$, and the hidden sequence $(X_t, t = 0, 1, \dots)$ is not independent. There is a very easy example: take $N = M$ and $b_i(j) = \delta_{ij}$. Then $(Y_t)$ is itself a Markov chain.
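The remark can be verified numerically. A small sketch, using the extreme case $b_i(j) = \delta_{ij}$ mentioned above (the transition matrix and initial distribution are my own hypothetical choices):

```python
import numpy as np

# Hypothetical two-state model with b_i(j) = delta_ij, so Y_t = X_t.
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.eye(2)                      # emission matrix is the identity
pi = np.array([0.5, 0.5])

# Joint law of (Y_0, Y_1) via Lemma 2.1, then the two marginals.
joint = np.zeros((2, 2))
for i0 in range(2):
    for i1 in range(2):
        for j0 in range(2):
            for j1 in range(2):
                joint[j0, j1] += pi[i0] * B[i0, j0] * A[i0, i1] * B[i1, j1]
pY0 = joint.sum(axis=1)
pY1 = joint.sum(axis=0)
print(joint[0, 0], pY0[0] * pY1[0])  # 0.45 vs 0.25: Y_0 and Y_1 are dependent
```

Since the joint probability differs from the product of marginals, $(Y_0, Y_1)$ are not independent, exactly as the remark states.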

3 THREE BASIC PROBLEMS OF HMMs AND THEIR SOLUTIONS

Once we have an HMM, there are three problems of interest.

3.1 The Evaluation Problem

Given a Hidden Markov Model $\lambda$ and a sequence of observations $Y_0^T = r_0^T$, what is the probability that the observations are generated by the model, i.e. $P_\lambda(Y_0^T = r_0^T)$? We can also view the problem as asking how well a given model matches a given observation sequence. From this second viewpoint, if we have several competing models, the solution to the evaluation problem lets us select the model that best matches the observation sequence.

The most straightforward way of doing this is to use Lemma 2.1, but it involves a great many calculations. The more efficient methods are called the forward procedure and the backward procedure; see [1]. We will introduce these two procedures first, then reveal the mathematical idea behind them.

3.1.1 The Forward Procedure

Fix $r_0^t$ and consider the forward variable $\alpha_t(i_t)$ defined as
$$\alpha_t(i_t) = P_\lambda(Y_0^t = r_0^t, X_t = s_{i_t}), \quad t = 0, 1, \dots, T,\ 1 \le i_t \le N,$$
that is, the probability of the partial observation sequence $r_{j_0}, r_{j_1}, \dots, r_{j_t}$ (until time $t$) together with hidden state $s_{i_t}$ at time $t$. So for the forward variable $\alpha_t(i_t)$ we only consider those paths which end at state $s_{i_t}$ at time $t$. We can solve for $\alpha_t(i_t)$ inductively, as follows:

(1) Initialization:
$$\alpha_0(i_0) = \pi_{i_0}\, b_{i_0}(j_0), \quad 1 \le i_0 \le N.$$

(2) Induction ($t = 0, 1, \dots, T-1$):
$$\alpha_{t+1}(i_{t+1}) = P_\lambda(Y_0^{t+1} = r_0^{t+1}, X_{t+1} = s_{i_{t+1}}) = \sum_{1 \le i_0, \dots, i_t \le N} P_\lambda(Y_0^{t+1} = r_0^{t+1}, X_0^{t+1} = s_0^{t+1})$$
$$= \sum_{1 \le i_0, \dots, i_t \le N} P_\lambda(Y_0^{t+1} = r_0^{t+1} \mid X_0^{t+1} = s_0^{t+1})\, P_\lambda(X_0^{t+1} = s_0^{t+1})$$
$$= \sum_{1 \le i_0, \dots, i_t \le N} P_\lambda(Y_0^t = r_0^t \mid X_0^t = s_0^t)\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_0^t = s_0^t)\, P_\lambda(X_0^t = s_0^t)$$
$$= \Big[\sum_{1 \le i_0, \dots, i_t \le N} P_\lambda(Y_0^t = r_0^t, X_0^t = s_0^t)\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})\Big]\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})$$

$$= \Big[\sum_{1 \le i_t \le N} P_\lambda(Y_0^t = r_0^t, X_t = s_{i_t})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})\Big]\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}}) = \Big[\sum_{i_t=1}^{N} \alpha_t(i_t)\, a_{i_t i_{t+1}}\Big]\, b_{i_{t+1}}(j_{t+1}).$$

(3) Termination:
$$P_\lambda(Y_0^T = r_0^T) = \sum_{i_T=1}^{N} P_\lambda(Y_0^T = r_0^T, X_T = s_{i_T}) = \sum_{i_T=1}^{N} \alpha_T(i_T).$$

If we examine the computation involved in the calculation of $\alpha_t(i_t)$, we see that it requires on the order of $N^2(T+1)$ calculations, rather than the $(T+1)\,N^{T+1}$ required by the direct calculation. Hence the forward probability calculation is much more efficient than the direct calculation.

Similarly, we have another method for the evaluation problem, called the backward procedure.

3.1.2 The Backward Procedure

We consider a backward variable $\beta_t(i_t)$ defined as
$$\beta_t(i_t) = P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}), \quad 0 \le t \le T-1,\ 1 \le i_t \le N.$$

That is, $\beta_t(i_t)$ is the probability of the partial observation sequence from time $t+1$ to the end, given that the hidden state is $s_{i_t}$ at time $t$. Again we can solve for $\beta_t(i_t)$ inductively, as follows:

(1) Initialization: to make the procedure work for $t = T-1$, we arbitrarily define
$$\beta_T(i_T) = 1.$$

(2) Induction ($t = T-1, T-2, \dots, 0$):
$$\beta_t(i_t) = P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}) = \sum_{i_{t+1}=1}^{N} P_\lambda(Y_{t+1}^T = r_{t+1}^T, X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$
$$= \sum_{i_{t+1}=1}^{N} P_\lambda(Y_{t+1}^T = r_{t+1}^T \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$
$$= \sum_{i_{t+1}=1}^{N} P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(Y_{t+2}^T = r_{t+2}^T \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$

$$= \sum_{i_{t+1}=1}^{N} a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1}).$$

(3) Termination:
$$P_\lambda(Y_0^T = r_0^T) = \sum_{i_0=1}^{N} P_\lambda(Y_0^T = r_0^T, X_0 = s_{i_0}) = \sum_{i_0=1}^{N} P_\lambda(Y_1^T = r_1^T \mid X_0 = s_{i_0})\, P_\lambda(Y_0 = r_{j_0} \mid X_0 = s_{i_0})\, P_\lambda(X_0 = s_{i_0}) = \sum_{i_0=1}^{N} \beta_0(i_0)\, b_{i_0}(j_0)\, \pi_{i_0}.$$

The backward procedure requires on the order of $N^2(T+1)$ calculations, as many as the forward procedure.

3.1.3 Summary of the Evaluation Problem

We have shown how to evaluate the probability that the observation sequence $Y_0^T = r_0^T$ is generated, using either the forward procedure or the backward procedure. Both are more efficient than the method given in Lemma 2.1. In fact, these two procedures do nothing more than convert the multiple sum of Lemma 2.1 into a repeated sum. In the forward procedure we use the identity
$$\sum_{1 \le i_0, \dots, i_T \le N} = \sum_{i_T=1}^{N} \cdots \sum_{i_0=1}^{N},$$
and in the backward procedure we reverse the order of summation, that is,
$$\sum_{1 \le i_0, \dots, i_T \le N} = \sum_{i_0=1}^{N} \cdots \sum_{i_T=1}^{N}.$$
From this point of view, it is clear that other procedures are possible. For example, we can carry out the summation from the two ends toward the middle at the same time; in some sense, this can be seen as a kind of parallel algorithm. It requires on the order of $N(N-1)(T+1)$ calculations.

3.2 The Decoding Problem

Given a model $\lambda$ and a sequence of observations $Y_0^T = r_0^T$, what is the most likely state sequence in the model that produced the observations? That is, we want to find a hidden state sequence $X_0^T = s_0^T$ maximizing the probability $P_\lambda(X_0^T = s_0^T \mid Y_0^T = r_0^T)$ over all possible sequences $s_0^T$. This is equivalent to maximizing $P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T)$, because the probability $P_\lambda(Y_0^T = r_0^T)$ of $r_0^T$ under the model $\lambda$ is fixed. A technique for finding this state sequence exists, based on dynamic programming methods, and

is called the Viterbi algorithm; see [1].

3.2.1 Viterbi Algorithm

Fixing $s_{i_t}$, we consider a variable $\delta_t(i_t)$ defined as
$$\delta_t(i_t) = \max_{1 \le i_0, i_1, \dots, i_{t-1} \le N} P_\lambda(X_0^{t-1} = s_0^{t-1}, X_t = s_{i_t}, Y_0^t = r_0^t).$$
Hence $\delta_t(i_t)$ is the highest probability along a single path which accounts for the first $t$ observations and ends in state $s_{i_t}$ at time $t$. By induction we have
$$\delta_{t+1}(i_{t+1}) = \max_{1 \le i_0, i_1, \dots, i_t \le N} P_\lambda(X_0^t = s_0^t, X_{t+1} = s_{i_{t+1}}, Y_0^{t+1} = r_0^{t+1})$$
$$= \max_{1 \le i_0, i_1, \dots, i_t \le N} P_\lambda(Y_0^{t+1} = r_0^{t+1} \mid X_0^{t+1} = s_0^{t+1})\, P_\lambda(X_0^{t+1} = s_0^{t+1})$$
$$= \max_{1 \le i_0, i_1, \dots, i_t \le N} P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(Y_0^t = r_0^t, X_0^t = s_0^t)\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$
$$= \Big[\max_{1 \le i_t \le N}\ \max_{1 \le i_0, \dots, i_{t-1} \le N} P_\lambda(Y_0^t = r_0^t, X_0^t = s_0^t)\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})\Big]\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})$$
$$= \Big[\max_{1 \le i_t \le N} \delta_t(i_t)\, a_{i_t i_{t+1}}\Big]\, b_{i_{t+1}}(j_{t+1}).$$
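The recursion above, plus an array of maximizing predecessors for reading the best path back, can be sketched as follows. This is a minimal implementation; the model numbers are hypothetical, and ties are resolved by picking a single argmax rather than keeping all maximizers:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most likely state path via the delta recursion with predecessor
    bookkeeping and backtracking."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):
        scores = delta[t-1][:, None] * A              # delta_{t-1}(i) * a_{ij}
        psi[t] = scores.argmax(axis=0)                # best predecessor of j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # induction
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()                     # termination
    for t in range(T-2, -1, -1):                      # backtracking
        path[t] = psi[t+1][path[t+1]]
    return path, delta[-1].max()

A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]
path, p = viterbi(A, B, pi, obs)
print(path, p)
```

The returned probability is the maximum joint probability $\max P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T)$, which can be checked against brute-force enumeration for small cases.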

To find the hidden state sequence, we need to keep track of the argument which maximizes $\delta_t(i_t)$, for every $t$ and $i_t$; these are recorded in an array $\psi_t(i_t)$.

(1) Initialization:
$$\delta_0(i_0) = \pi_{i_0}\, b_{i_0}(j_0), \quad 1 \le i_0 \le N.$$

(2) Induction ($t = 1, \dots, T$):
$$\delta_t(i_t) = \max_{1 \le i_{t-1} \le N} [\delta_{t-1}(i_{t-1})\, a_{i_{t-1} i_t}]\, b_{i_t}(j_t), \qquad \psi_t(i_t) = \arg\max_{1 \le i_{t-1} \le N} [\delta_{t-1}(i_{t-1})\, a_{i_{t-1} i_t}].$$

(3) Termination:
$$\delta^* = \max_{1 \le i_T \le N} \delta_T(i_T), \qquad \psi^* = \arg\max_{1 \le i_T \le N} \delta_T(i_T).$$

Finally, we obtain the state sequence by backtracking: $i_T^* = \psi^*$ and $i_t^* = \psi_{t+1}(i_{t+1}^*)$, $t = T-1, T-2, \dots, 0$.

Remark 1. In the Viterbi algorithm, $\psi_t(i_t)$, $t = 0, \dots, T-1$, and $\psi^*$ are set-valued maps. This means there may be more than one best hidden state sequence; in order to distinguish them, we need more information.

Remark 2. This procedure is not consistent as $T$ increases.

Proof: We give a counterexample, taking a suitable choice of $\pi$, $A$ and $B$. Suppose we have an observation sequence $Y_0^1 = (r_1, r_2)$. Following the Viterbi algorithm, when $T = 0$ the most likely hidden state is $s_1$, but when $T = 1$ the best sequence is $(s_2, s_1)$. This means that, given a fixed sequence $Y$, if $s_{i_0}$ is a solution for $T = 0$ and $(s_{i_0'}, s_{i_1'})$ is a solution for $T = 1$, we may not have $i_0 = i_0'$, even when $s_{i_0}$ and $(s_{i_0'}, s_{i_1'})$ are the unique best solutions respectively.

The Viterbi algorithm is very similar to the forward procedure introduced in the previous section. In the forward procedure we defined $\alpha_t(i_t)$, while here we use $\delta_t(i_t)$; the only difference between them is that summation is replaced by maximization. The general ideas are the same: we change a multiple summation to a repeated summation, and a multiple maximum to a repeated maximum.

From this point of view, it is very natural to ask how the idea used in the backward procedure can be applied to the decoding problem. Therefore, we

introduce a new method: we carry out the maximization from the two ends toward the middle at the same time. That is the δ-θ algorithm.

3.2.2 The δ-θ Algorithm

The advantage of this method is saving time. To describe the procedure, we define another variable $\theta_t(i_t)$, $t = 0, \dots, T-1$:
$$\theta_t(i_t) = \max_{1 \le i_{t+1}, \dots, i_T \le N} P_\lambda(X_{t+1}^T = s_{t+1}^T, Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t}).$$
$\theta_t(i_t)$ is the highest probability along a path which accounts for the observations from time $t+1$ to the end and starts at state $s_{i_t}$ at time $t$. Inductively, we have
$$\theta_t(i_t) = \max_{1 \le i_{t+1}, \dots, i_T \le N} P_\lambda(X_{t+1}^T = s_{t+1}^T, Y_{t+1}^T = r_{t+1}^T \mid X_t = s_{i_t})$$
$$= \max_{1 \le i_{t+1}, \dots, i_T \le N} P_\lambda(Y_{t+2}^T = r_{t+2}^T, X_{t+2}^T = s_{t+2}^T \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(Y_{t+1} = r_{j_{t+1}} \mid X_{t+1} = s_{i_{t+1}})\, P_\lambda(X_{t+1} = s_{i_{t+1}} \mid X_t = s_{i_t})$$

31 = max 1 i t+1,...,i T N P λ(y T t+2 = r T t+2, X T t+2 = s T t+2 X t+1 = s it+1 ) P λ (Y t+1 = r jt+1 X t+1 = s it+1 ) P λ (X t+1 = s it+1 X t = s it ) = max 1 i t+1 N θ t+1(i t+1 )a it i t+1 b it+1 (j t+1 ) (1)Initialization: δ 0 (i 0 ) = π i0 b i0 (j 0 ), 1 i 0 N, θ T (i T ) = 1, 1 i T N. (2)Induction: δ t (i t ) = max [δ t 1(i t 1 )a it 1 i t ]b it (j t ), t = 1,..., T. 1 i t 1 N ψ t (i t ) = arg max [δ t 1(i t 1 )a it 1 i t ], t = 1,..., T. 1 i t 1 N θ t (i t ) = max 1 i t+1 N θ t+1(i t+1 )a iti t+1 b it+1 (j t+1 ), t = 0,..., T 1. ϕ t (i t ) = arg max θ t+1(i t+1 )a it i t+1 b it+1 (j t+1 ), t = 0,..., T 1. 1 i t+1 N (3)Termination: max P λ(x T 0 = s T 0, Y0 T = r T 0 ) 1 i 0,...,i T N 22

$$= \max_{1 \le i_t \le N}\ \max_{1 \le i_s \le N,\, s \ne t} P_\lambda(X_0^T = s_0^T, Y_0^T = r_0^T) = \max_{1 \le i_t \le N}\ \max_{1 \le i_s \le N,\, s \ne t} P_\lambda(Y_0^T = r_0^T \mid X_0^T = s_0^T)\, P_\lambda(X_0^T = s_0^T)$$
$$= \max_{1 \le i_t \le N} \Big[\max_{1 \le i_s \le N,\, s < t} P_\lambda(X_0^t = s_0^t, Y_0^t = r_0^t)\ \max_{1 \le i_s \le N,\, s > t} P_\lambda(Y_{t+1}^T = r_{t+1}^T, X_{t+1}^T = s_{t+1}^T \mid X_t = s_{i_t})\Big]$$
$$= \max_{1 \le i_t \le N} [\delta_t(i_t)\, \theta_t(i_t)].$$
We define $i_t^* = \arg\max_{1 \le i_t \le N} [\delta_t(i_t)\, \theta_t(i_t)]$, and the most likely hidden sequence is recovered by
$$i_u^* = \begin{cases} \psi_{u+1}(i_{u+1}^*) & \text{for } u = t-1, \dots, 0, \\ i_t^* & \text{for } u = t, \\ \varphi_{u-1}(i_{u-1}^*) & \text{for } u = t+1, \dots, T. \end{cases}$$

3.3 The Learning Problem

Given a model $\lambda$ and a sequence of observations $Y_0^T = r_0^T$, how should we adjust the model parameters $(A, B, \pi)$ in order to maximize $P_\lambda(Y_0^T = r_0^T)$? We face an optimization problem with constraints. The probability $P_\lambda(Y_0^T = r_0^T)$ is a function of the variables $\pi_i, a_{ij}, b_j(k)$, where $1 \le i, j \le N$ and $1 \le k \le M$. We may also view the probability $P_\lambda(Y_0^T = r_0^T)$ as the likelihood function of $\lambda$,

considered as a function of $\lambda$ for fixed $r_0^T$. Thus, for each $r_0^T$, $P_\lambda(Y_0^T = r_0^T)$ gives the probability of observing $r_0^T$. We use the method of maximum likelihood: we try to find the value of $\lambda$ that is most likely to have produced the sequence $r_0^T$. See Lemma 2.1. The problem we need to solve is
$$\max_\lambda P_\lambda(Y_0^T = r_0^T) = \max_\lambda \sum_{1 \le i_0, i_1, \dots, i_T \le N}\ \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t),$$
subject to
$$\pi_i \ge 0, \quad \sum_{i=1}^{N} \pi_i = 1,$$
$$a_{ij} \ge 0, \quad \sum_{j=1}^{N} a_{ij} = 1 \text{ for every } i,$$
$$b_j(k) \ge 0, \quad \sum_{k=1}^{M} b_j(k) = 1 \text{ for every } j,$$
where $1 \le i, j \le N$, $1 \le k \le M$ and $a_{i_{-1} i_0} = \pi_{i_0}$.

To solve this problem, we need to use some results proved by Leonard E. Baum and George R. Sell; see [3].

1. Let $\mathcal{M}$ denote the manifold with boundary given by $x = (x_{ij})$ with
$$\Big\{x_{ij} : x_{ij} \ge 0 \text{ and } \sum_{j=1}^{q_i} x_{ij} = 1\Big\},$$

where $q_1, \dots, q_k$ is a set of nonnegative integers. Let $P$ be a homogeneous polynomial in the variables $\{x_{ij}\}$ with nonnegative coefficients. Let $\Lambda = \Lambda_P : \mathcal{M} \to \mathcal{M}$ be defined by $y = \Lambda_P(x)$, where
$$y_{ij} = x_{ij}\, \frac{\partial P}{\partial x_{ij}} \left[\sum_{k=1}^{q_i} x_{ik}\, \frac{\partial P}{\partial x_{ik}}\right]^{-1}.$$
Then
$$P(x) \le P\big(t\,\Lambda_P(x) + (1-t)\,x\big), \quad 0 \le t \le 1,\ x \in \mathcal{M}^\circ.$$
Here $\mathcal{M}^\circ = \{x_{ij} : x_{ij} > 0 \text{ and } \sum_{j=1}^{q_i} x_{ij} = 1\}$ denotes the interior of $\mathcal{M}$, and $\partial\mathcal{M} = \{x_{ij} : \text{some } x_{ij} = 0 \text{ and } \sum_{j=1}^{q_i} x_{ij} = 1\}$ its boundary.

2. Let $P$ be a homogeneous polynomial in the variables $(x_{ij})$ with positive coefficients and let $q \in \mathcal{M}^\circ$ be an isolated local maximum of $P$. Then there exists a neighborhood $V$ of $q$ such that $\Lambda(V) \subseteq V$ and, for every $x \in V$, $\Lambda^n(x) \to q$ as $n \to \infty$.

3. (A) The transformation $\Lambda_P$ on $\mathcal{M}^\circ$ can be extended to be continuous on $\mathcal{M} = \mathcal{M}^\circ \cup \partial\mathcal{M}$. (B) $\Lambda_P$ can also be continuously extended to any isolated local maximum $q$ of $P$ on $\partial\mathcal{M}$ by the definition $\Lambda_P(q) = q$. (C) The extended transformation $\Lambda_P$ still obeys the inequality $P(x) \le P(t\,\Lambda_P(x) + (1-t)\,x)$, where $0 \le t \le 1$.

Now, we come back to the learning problem. We define
$$P = P_\lambda(Y_0^T = r_0^T) = \sum_{1 \le i_0, i_1, \dots, i_T \le N}\ \prod_{t=0}^{T} a_{i_{t-1} i_t}\, b_{i_t}(j_t).$$
Notice that $P$ is a homogeneous polynomial of order $2(T+1)$ in the variables $\pi_i, a_{ij}, b_j(k)$, with nonnegative coefficients, and these $\pi_i, a_{ij}, b_j(k)$ satisfy the conditions in the above results. We define a new model $\bar\lambda = (\bar A, \bar B, \bar\pi)$. When $\pi_i \ne 0$, $a_{ij} \ne 0$, $b_j(k) \ne 0$, set
$$\bar\pi_i = \frac{\pi_i\, \partial P/\partial \pi_i}{\sum_{i=1}^{N} \pi_i\, \partial P/\partial \pi_i}, \qquad \bar a_{ij} = \frac{a_{ij}\, \partial P/\partial a_{ij}}{\sum_{j=1}^{N} a_{ij}\, \partial P/\partial a_{ij}}, \qquad \bar b_j(k) = \frac{b_j(k)\, \partial P/\partial b_j(k)}{\sum_{k=1}^{M} b_j(k)\, \partial P/\partial b_j(k)}.$$
When $\pi_i = 0$, or $a_{ij} = 0$, or $b_j(k) = 0$, we define $\bar\pi_i = 0$, $\bar a_{ij} = 0$, $\bar b_j(k) = 0$.

Applying the above result and taking $t = 1$, we obtain
$$P(\pi_i, a_{ij}, b_j(k)) \le P(\bar\pi_i, \bar a_{ij}, \bar b_j(k)).$$
Also, if $\lambda$ belongs to a neighborhood $V$ of $\lambda^*$, where $\lambda^*$ is a local maximum of $P$, we will have either 1) the initial model $\lambda$ defines a critical point of $P$, in which

case $\bar\lambda = \lambda = \lambda^*$; or 2) the model $\bar\lambda$ is better than the model $\lambda$ in the sense that $P(\bar\lambda) > P(\lambda)$, or we may say $\bar\lambda$ is more likely than $\lambda$.

The next step is to express $\bar\pi_i, \bar a_{ij}, \bar b_j(k)$ in terms of $\pi_i, a_{ij}, b_j(k)$. We claim:
$$\pi_i\, \frac{\partial P}{\partial \pi_i} = P_\lambda(Y_0^T = r_0^T, X_0 = s_i),$$
$$a_{ij}\, \frac{\partial P}{\partial a_{ij}} = \sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i, X_{t+1} = s_j),$$
$$b_j(k)\, \frac{\partial P}{\partial b_j(k)} = \sum_{t=0,\, Y_t = r_k}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j).$$

Proof: From the formula
$$P = \sum_{1 \le i_0, i_1, \dots, i_T \le N} \pi_{i_0}\, a_{i_0 i_1} \cdots a_{i_{T-1} i_T}\, b_{i_0}(j_0) \cdots b_{i_T}(j_T),$$
we see that $P$ is a sum over all possible paths $s_{i_0}, s_{i_1}, \dots, s_{i_T}$; each term of the sum corresponds to one possible path. If we take the partial derivative of $P$ with respect to $\pi_i$, all terms vanish except those representing paths starting from state $s_i$. So
$$\frac{\partial P}{\partial \pi_i} = \sum_{1 \le i_1, \dots, i_T \le N} a_{i i_1} \cdots a_{i_{T-1} i_T}\, b_i(j_0) \cdots b_{i_T}(j_T),$$

and
$$\pi_i\, \frac{\partial P}{\partial \pi_i} = \sum_{1 \le i_1, \dots, i_T \le N} \pi_i\, a_{i i_1} \cdots a_{i_{T-1} i_T}\, b_i(j_0) \cdots b_{i_T}(j_T) = P_\lambda(Y_0^T = r_0^T, X_0 = s_i).$$
In the same way we can prove the other two results. When we take the partial derivative of $P$ with respect to $a_{ij}$, only those terms survive which represent paths staying at state $s_i$ at some time $t$ and moving to state $s_j$ at time $t+1$. When we take the partial derivative of $P$ with respect to $b_j(k)$, all terms vanish except those representing paths passing through state $s_j$ at a time $t$ at which the observation is $r_k$.

Then we have
$$\bar\pi_i = \frac{\pi_i\, \partial P/\partial \pi_i}{\sum_{i=1}^{N} \pi_i\, \partial P/\partial \pi_i} = \frac{P_\lambda(Y_0^T = r_0^T, X_0 = s_i)}{P_\lambda(Y_0^T = r_0^T)},$$
$$\bar a_{ij} = \frac{a_{ij}\, \partial P/\partial a_{ij}}{\sum_{j=1}^{N} a_{ij}\, \partial P/\partial a_{ij}} = \frac{\sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i, X_{t+1} = s_j)}{\sum_{t=0}^{T-1} P_\lambda(Y_0^T = r_0^T, X_t = s_i)},$$
$$\bar b_j(k) = \frac{b_j(k)\, \partial P/\partial b_j(k)}{\sum_{k=1}^{M} b_j(k)\, \partial P/\partial b_j(k)} = \frac{\sum_{t=0,\, Y_t = r_k}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j)}{\sum_{t=0}^{T} P_\lambda(Y_0^T = r_0^T, X_t = s_j)}.$$

After these theoretical results, we introduce an algorithm to realize them: the Baum-Welch method. In order to describe the procedure, we first define $\xi_t(i_t, i_{t+1})$, the probability of

being in state $s_{i_t}$ at time $t$ and state $s_{i_{t+1}}$ at time $t+1$, given the model $\lambda$ and the observation sequence:
$$\xi_t(i_t, i_{t+1}) = P_\lambda(X_t = s_{i_t}, X_{t+1} = s_{i_{t+1}} \mid Y_0^T = r_0^T).$$
From the definitions of the forward and backward variables, we can write $\xi_t(i_t, i_{t+1})$ in the form
$$\xi_t(i_t, i_{t+1}) = \frac{\alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}{P_\lambda(Y_0^T = r_0^T)} = \frac{\alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}{\sum_{i_t=1}^{N} \sum_{i_{t+1}=1}^{N} \alpha_t(i_t)\, a_{i_t i_{t+1}}\, b_{i_{t+1}}(j_{t+1})\, \beta_{t+1}(i_{t+1})}.$$
To solve this problem, we need another variable $\gamma_t(i_t)$, the probability of being in state $s_{i_t}$ at time $t$, given the observation sequence and the model $\lambda$:
$$\gamma_t(i_t) = P_\lambda(X_t = s_{i_t} \mid Y_0^T = r_0^T).$$
Hence we can relate $\gamma_t(i_t)$ to $\xi_t(i_t, i_{t+1})$ by summing over $i_{t+1}$:
$$\gamma_t(i_t) = \sum_{i_{t+1}=1}^{N} \xi_t(i_t, i_{t+1}).$$
Let $i_t = i$, $i_{t+1} = j$, where $1 \le i, j \le N$. Then
$$\gamma_t(i) = P_\lambda(X_t = s_i \mid Y_0^T = r_0^T), \qquad \xi_t(i, j) = P_\lambda(X_t = s_i, X_{t+1} = s_j \mid Y_0^T = r_0^T).$$

If we sum $\gamma_t(i)$ over the time index $t$, we get a quantity which can be interpreted as the expected number of times that state $s_i$ is visited, that is,
$$\sum_{t=0}^{T-1} \gamma_t(i) = \text{expected number of transitions from } s_i,$$
and similarly
$$\sum_{t=0}^{T-1} \xi_t(i, j) = \text{expected number of transitions from } s_i \text{ to } s_j.$$
Using these formulas, we can give a method for reestimating the parameters of an HMM:
$$\bar\pi_i = \text{expected frequency (number of times) in state } s_i \text{ at time } 0 = \gamma_0(i),$$
$$\bar a_{ij} = \frac{\text{expected number of transitions from } s_i \text{ to } s_j}{\text{expected number of transitions from } s_i} = \frac{\sum_{t=0}^{T-1} \xi_t(i, j)}{\sum_{t=0}^{T-1} \gamma_t(i)},$$
$$\bar b_j(k) = \frac{\text{expected number of times in state } s_j \text{ with observation } r_k}{\text{expected number of times in state } s_j} = \frac{\sum_{t=0,\, Y_t = r_k}^{T} \gamma_t(j)}{\sum_{t=0}^{T} \gamma_t(j)}.$$
Therefore, we obtain a new model $\bar\lambda = (\bar A, \bar B, \bar\pi)$. Based on the above procedure, if we iteratively use $\bar\lambda$ in place of $\lambda$ and repeat the calculation, we can improve the probability of $Y_0^T = r_0^T$ until some limiting point is reached.
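The reestimation cycle can be sketched as follows. This is a minimal single-sequence implementation without the numerical scaling needed for long sequences; the model and observation sequence are hypothetical, not the thesis' example:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch reestimation for a single observation sequence.
    No scaling is used, so this sketch suits only short sequences."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                     # forward variables
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    beta[T-1] = 1.0                                  # backward variables
    for t in range(T-2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    evidence = alpha[-1].sum()                       # P_lambda(Y_0^T = r_0^T)
    # xi_t(i, j) and gamma_t(i)
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t+1]] * beta[t+1]
                   for t in range(T-1)]) / evidence
    gamma = alpha * beta / evidence
    # reestimation formulas
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.array([gamma[np.array(obs) == k].sum(axis=0)
                      for k in range(B.shape[1])]).T / gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, evidence

A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.8, 0.2], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
obs = [0, 0, 1, 1, 0, 1]
history = []
for _ in range(10):
    A, B, pi, p = baum_welch_step(A, B, pi, obs)
    history.append(p)
print(history[0], history[-1])  # likelihood of obs never decreases
```

Each iteration records the likelihood of the observation sequence under the model it starts from, so the sequence of recorded values is non-decreasing, exactly as the inequality for $\Lambda_P$ guarantees.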

3.4 An Example

In this section, we introduce an example to show how to solve these three problems by hand or by computer. A school teacher gave three different types of daily homework assignments: (1) one taking about 5 minutes to complete, (2) one taking about 1 hour to complete, and (3) one taking about 3 hours to complete. The teacher did not openly reveal his mood to his students daily, but we know that the teacher was in either a (1) good, (2) neutral, or (3) bad mood for each whole day. Meanwhile, we knew the relationship between his mood and the homework he would assign.

We use a Hidden Markov Model $\lambda$ to analyze this problem. This model includes three hidden states, {1 (good mood), 2 (neutral mood), 3 (bad mood)}, and three observable symbols, {1 (5-minute assignment), 2 (1-hour assignment), 3 (3-hour assignment)}. We consider the problem over one week, from Monday to Friday, using $t = 1, 2, 3, 4, 5$. Hence the Markov chain in this model is $(X_t : t = 1, 2, 3, 4, 5)$, and the observable process is $(Y_t : t = 1, 2, 3, 4, 5)$. The transition probability matrix is $A = [a_{ij}]$ and the initial state distribution is $\pi = (\pi_i)$, where
$$a_{ij} = P_\lambda(X_{t+1} = j \mid X_t = i), \qquad \pi_i = P_\lambda(X_1 = i),$$

1 ≤ i, j ≤ 3, t = 1, 2, 3, 4. The relationship between the teacher's mood and the assignments is given by a matrix B = [b_j(k)], where

b_j(k) = P_λ(Y_t = k | X_t = j),  1 ≤ j ≤ 3, 1 ≤ k ≤ 3, t = 1, 2, 3, 4, 5.

In this particular case, the observed sequence is Y_1^5 = (1, 3, 2, 1, 3), and the HMM is λ = (A, B, π), where A = [a_ij] and B = [b_j(k)] are given 3 × 3 matrices of numerical values and π = (π_1, π_2, π_3). To get the initial distribution π, we use some knowledge of Markov chains, namely the stationary probability distribution.
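Numerically, such a stationary distribution can be computed either by solving the linear system π = πA together with Σ_i π_i = 1, or by power iteration (repeatedly applying π ← πA). A Python sketch of power iteration, on an arbitrary illustrative stochastic matrix rather than the one from this example:

```python
def stationary(A, tol=1e-12, max_iter=100000):
    """Power iteration: repeatedly apply pi <- pi * A until convergence."""
    n = len(A)
    pi = [1.0 / n] * n                 # start from the uniform distribution
    for _ in range(max_iter):
        new = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(new[j] - pi[j]) for j in range(n)) < tol:
            return new
        pi = new
    return pi

# an arbitrary illustrative 3x3 stochastic matrix (not the one from the example)
A = [[0.2, 0.3, 0.5],
     [0.1, 0.6, 0.3],
     [0.4, 0.1, 0.5]]
pi = stationary(A)   # satisfies pi_j = sum_i pi_i a_ij and sum_j pi_j = 1
```

Power iteration converges for irreducible aperiodic chains such as this one; solving the linear system directly works in all finite cases.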

Definition: Consider a Markov chain with transition probability matrix P = (P_ij)_{i,j ≥ 0}, where Σ_j P_ij = 1 for each i, and an initial distribution π = (π_i)_{i ≥ 0} with Σ_i π_i = 1. If

π_j = Σ_{i ≥ 0} π_i P_ij

for every j, then we call π = (π_i)_{i ≥ 0} a stationary probability distribution of the Markov chain.

For the transition matrix used in this example, finding the stationary probability distribution amounts to solving the following linear system:

π_1 = π_1 a_11 + π_2 a_21 + π_3 a_31
π_2 = π_1 a_12 + π_2 a_22 + π_3 a_32
π_3 = π_1 a_13 + π_2 a_23 + π_3 a_33
1 = π_1 + π_2 + π_3

This linear system tells us that the stationary probability distribution is π = (0.05, 0.2, 0.75).

The Evaluation Problem: The first question we want to ask is: what is the probability that this teacher would assign this order of homework assignments? We use the forward procedure to solve

this problem.

(1) Initialization:

α_1(1) = π_1 b_1(1),
α_1(2) = π_2 b_2(1) = 0.06,
α_1(3) = π_3 b_3(1) = 0.

(2) Induction (t = 1, 2, 3, 4):

α_2(1) = (α_1(1) a_11 + α_1(2) a_21 + α_1(3) a_31) b_1(3) = (α_1(1) a_11 + α_1(2) a_21 + α_1(3) a_31) × 0.1,
α_2(2) = (α_1(1) a_12 + α_1(2) a_22 + α_1(3) a_32) b_2(3),
α_2(3) = (α_1(1) a_13 + α_1(2) a_23 + α_1(3) a_33) b_3(3).

In the same way, we can calculate the values of all forward variables α_t(i), 1 ≤ t ≤ 5, 1 ≤ i ≤ 3. We use Table 1 to represent our result.
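The whole forward procedure (initialization, induction, termination) can be sketched as follows. This Python illustration uses π = (0.05, 0.2, 0.75) from the example but stand-in matrices A and B (the thesis's numerical entries are given in the text); indices are 0-based, so the observation sequence (1, 3, 2, 1, 3) becomes [0, 2, 1, 0, 2]:

```python
def forward(A, B, pi, obs):
    """Forward variables: alpha[t][j] = P(first t+1 observations, state j at time t+1)."""
    N = len(A)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]            # (1) initialization
    for t in range(1, len(obs)):                                  # (2) induction
        alpha.append([sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

# stand-in parameters, not the example's numerical matrices
A = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.6, 0.3, 0.1],
     [0.3, 0.4, 0.3],
     [0.0, 0.3, 0.7]]
pi = [0.05, 0.2, 0.75]
obs = [0, 2, 1, 0, 2]          # the sequence (1, 3, 2, 1, 3), 0-based

alpha = forward(A, B, pi, obs)
prob = sum(alpha[-1])          # (3) termination: P(Y_1^5)
```

The recursion computes in O(N^2 T) operations the same probability as summing over all N^T hidden state paths.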

Table 1. The Forward Variables α_t(i)
(numerical values of α_t(i), 1 ≤ t ≤ 5, 1 ≤ i ≤ 3)

(3) Termination:

P_λ(Y_1^5 = (1, 3, 2, 1, 3)) = α_5(1) + α_5(2) + α_5(3).

The Decoding Problem: The second question we want to ask is: what did the teacher's mood curve most likely look like that week? We use the Viterbi method.

(1) Initialization:

δ_1(1) = π_1 b_1(1),

δ_1(2) = π_2 b_2(1) = 0.06,
δ_1(3) = π_3 b_3(1) = 0.

(2) Induction:

δ_2(1) = max_{1 ≤ i ≤ 3} [δ_1(i) a_i1] b_1(3) = max[0.007, 0.012, 0] × 0.1 = 0.0012,
ψ_2(1) = arg max_{1 ≤ i ≤ 3} [δ_1(i) a_i1] = 2,
δ_2(2) = max_{1 ≤ i ≤ 3} [δ_1(i) a_i2] b_2(3) = max[0.0105, 0.012, 0] × 0.3 = 0.0036,
ψ_2(2) = arg max_{1 ≤ i ≤ 3} [δ_1(i) a_i2] = 2.

Using the same method, we obtain the values of all variables δ_t(i) and ψ_t(i), where 1 ≤ t ≤ 5, 1 ≤ i ≤ 3. See Table 2.
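The full Viterbi recursion, including the backtracking through ψ, can be sketched as follows; as before, the matrices are illustrative stand-ins and indices are 0-based:

```python
def viterbi(A, B, pi, obs):
    """Most likely hidden state path, via delta (max-probabilities) and psi (backpointers)."""
    N, T = len(A), len(obs)
    delta = [[pi[i] * B[i][obs[0]] for i in range(N)]]            # (1) initialization
    psi = [[0] * N]                                               # psi[0] is unused
    for t in range(1, T):                                         # (2) induction
        row, back = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            back.append(best_i)
            row.append(delta[t - 1][best_i] * A[best_i][j] * B[j][obs[t]])
        delta.append(row)
        psi.append(back)
    # (3) termination and backtracking: i_t = psi_{t+1}(i_{t+1})
    path = [max(range(N), key=lambda i: delta[T - 1][i])]
    for t in range(T - 1, 0, -1):
        path.insert(0, psi[t][path[0]])
    return path

A = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.3, 0.5]]
B = [[0.6, 0.3, 0.1],
     [0.3, 0.4, 0.3],
     [0.0, 0.3, 0.7]]
pi = [0.05, 0.2, 0.75]
obs = [0, 2, 1, 0, 2]
path = viterbi(A, B, pi, obs)   # most likely hidden state sequence, 0-based
```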

Table 2. The Variables δ_t(i) and ψ_t(i)
(numerical values of δ_t(i) and ψ_t(i), 1 ≤ t ≤ 5, 1 ≤ i ≤ 3)

(3) Termination:

δ* = max_{1 ≤ i ≤ 3} [δ_5(i)],  i_5* = arg max_{1 ≤ i ≤ 3} [δ_5(i)] = 3.

Finally, the hidden state sequence is recovered by backtracking,

i_t* = ψ_{t+1}(i_{t+1}*),  t = 4, 3, 2, 1.

Hence, this teacher's most likely mood curve is (2, 3, 2, 1, 3).

The Learning Problem: The third problem is how to adjust the model parameters (A, B, π) to maximize P_λ(Y_1^5 = (1, 3, 2, 1, 3)). We use the Baum-Welch method.

To use this method, we need the values of the backward variables β_t(i).

(1) Initialization:

β_5(1) = 1,  β_5(2) = 1,  β_5(3) = 1.

(2) Induction (t = 4, 3, 2, 1): since y_5 = 3,

β_4(1) = Σ_{i=1}^{3} a_1i b_i(3) β_5(i) = a_11 b_1(3) + a_12 b_2(3) + a_13 b_3(3) = 0.56,

β_4(2) = Σ_{i=1}^{3} a_2i b_i(3) β_5(i) = a_21 b_1(3) + a_22 b_2(3) + a_23 b_3(3) = 0.62,

Table 3. The Backward Variables β_t(i)
(numerical values of β_t(i), 1 ≤ t ≤ 5, 1 ≤ i ≤ 3)

β_4(3) = Σ_{i=1}^{3} a_3i b_i(3) β_5(i) = a_31 b_1(3) + a_32 b_2(3) + a_33 b_3(3).

Using the same method, we obtain the values of all variables β_t(i), where 1 ≤ t ≤ 5, 1 ≤ i ≤ 3. See Table 3. We can use this table and follow the backward procedure to get P_λ(Y_1^5 = (1, 3, 2, 1, 3)):

P_λ(Y_1^5 = (1, 3, 2, 1, 3)) = Σ_{i=1}^{3} β_1(i) b_i(1) π_i

= β_1(1) b_1(1) π_1 + β_1(2) b_2(1) π_2 + β_1(3) b_3(1) π_3.

We can compare this result with the one obtained before by the forward procedure. The next step is to get the values of ξ_t(i, j) and γ_t(i), where 1 ≤ t ≤ 4, 1 ≤ i, j ≤ 3. See Table 4. After we have these tables, it is easy to get a new model λ̄ = (Ā, B̄, π̄) from the re-estimation formulas. The new model λ̄ is better than λ; we may check this by using λ̄ to solve the evaluation problem again and comparing P_λ̄(Y_1^5 = (1, 3, 2, 1, 3)) with the value obtained under λ.
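The backward procedure, together with the consistency check against the forward result, can be sketched as follows (illustrative matrices, 0-based indices):

```python
def backward(A, B, obs):
    """Backward variables beta[t][i], initialized to 1 at the final time."""
    N, T = len(A), len(obs)
    beta = [[1.0] * N]                                   # (1) initialization at t = T
    for t in range(T - 2, -1, -1):                       # (2) induction, t = T-1, ..., 1
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j]
                            for j in range(N)) for i in range(N)])
    return beta

A = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.0, 0.3, 0.7]]
pi = [0.05, 0.2, 0.75]
obs = [0, 2, 1, 0, 2]

beta = backward(A, B, obs)
# P(Y) from the backward side: sum_i beta_1(i) b_i(y_1) pi_i
prob = sum(beta[0][i] * B[i][obs[0]] * pi[i] for i in range(3))
```

This value must agree with the forward termination sum, which is the comparison made in the text.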

Table 4. The Variables ξ_t(i, j) and γ_t(i)
(a) ξ_t(i, j): numerical values for t = 1, 2, 3, 4, 1 ≤ i, j ≤ 3
(b) γ_t(i): numerical values for t = 1, 2, 3, 4, 1 ≤ i ≤ 3

4 HMMs Analysis by Matlab

4.1 How to generate A, B and π

Firstly, let us discuss how to get an initial distribution π. Here, we only consider the case of finitely many states. Suppose we have N hidden states; then the initial distribution is π = (π_1, π_2, ..., π_N). We can think of these π's as the N parameters of a multinomial distribution. When we estimate the parameters of the binomial distribution using Bayesian models, we usually use the two-parameter beta family as the prior. This class of distributions has the remarkable property that the resulting posterior distributions are again beta distributions; we call this property conjugacy. So it is very natural to ask what a conjugate prior for the multinomial is. The answer is the Dirichlet distribution. In some sense, the Dirichlet distribution is an extension of the beta distribution to higher-dimensional space.

Definition: The Dirichlet distribution D(α), where α = (α_1, ..., α_r), α_j > 0, 1 ≤ j ≤ r, is supported by the (r-1)-dimensional simplex

Δ_{r-1} = {(u_1, ..., u_{r-1}) : Σ_{i=1}^{r-1} u_i ≤ 1, 0 ≤ u_i ≤ 1, i = 1, ..., r-1},

and the probability density is given by

f(u_1, ..., u_{r-1}) = [Γ(Σ_{j=1}^{r} α_j) / Π_{j=1}^{r} Γ(α_j)] Π_{j=1}^{r-1} u_j^{α_j - 1} (1 - Σ_{j=1}^{r-1} u_j)^{α_r - 1}  if (u_1, ..., u_{r-1}) ∈ Δ_{r-1},

and f(u_1, ..., u_{r-1}) = 0 if (u_1, ..., u_{r-1}) ∈ R^{r-1} \ Δ_{r-1}.

Lemma 4.1. Let N = (N_1, ..., N_r) have the multinomial distribution M(n, θ), where θ = (θ_1, ..., θ_r), 0 < θ_j < 1, Σ_{j=1}^{r} θ_j = 1. If the prior distribution π(θ) for θ is D(α), then the posterior distribution π(θ | N = n) is D(α + n), where α = (α_1, ..., α_r) and n = (n_1, ..., n_r).

Proof: For the vector θ = (θ_1, ..., θ_r), we can rewrite it as θ = (θ_1, ..., θ_{r-1}) and define θ_r = 1 - Σ_{j=1}^{r-1} θ_j. Then we can write the prior out:

π(θ) = [Γ(Σ_{j=1}^{r} α_j) / Π_{j=1}^{r} Γ(α_j)] Π_{j=1}^{r-1} θ_j^{α_j - 1} (1 - Σ_{j=1}^{r-1} θ_j)^{α_r - 1}.

By Bayes' rule, and defining Δ = {(t_1, ..., t_{r-1}) : Σ_{i=1}^{r-1} t_i ≤ 1, 0 ≤ t_i ≤ 1, i = 1, ..., r-1}, we have

π(θ | N = n) = π(θ) P(N = n | θ) / ∫_Δ π(t) P(N = n | t) d(t_1, ..., t_{r-1}) = C Π_{j=1}^{r-1} θ_j^{α_j + n_j - 1} (1 - Σ_{j=1}^{r-1} θ_j)^{α_r + n_r - 1}.

The proportionality constant C, which depends on α and n only, must be Γ(Σ_{j=1}^{r} (α_j + n_j)) / Π_{j=1}^{r} Γ(α_j + n_j), and the posterior distribution of θ given n is D(α + n).
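Lemma 4.1 reduces Bayesian updating to bookkeeping: the posterior parameters are simply α + n. A tiny numerical illustration (the function name and numbers are ours, not from the thesis):

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing multinomial counts (Lemma 4.1)."""
    return [a + n for a, n in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]          # uniform D(1, 1, 1) prior over the 2-simplex
counts = [5, 2, 3]               # observed category counts
post = dirichlet_posterior(alpha, counts)

# the posterior mean of theta_j is (alpha_j + n_j) / sum_k (alpha_k + n_k)
mean = [a / sum(post) for a in post]
```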

Lemma 4.2. (See [9].) Let d be an integer greater than 1. Let X_1, ..., X_d be independent random variables, where X_i has the gamma distribution with parameters (α, γ_i). Then the random vector

(Y_1, ..., Y_d) = (1/(X_1 + ... + X_d)) (X_1, ..., X_d)

has the Dirichlet distribution.

Proof: For the random vector (Y_1, ..., Y_d), we have Y_d = 1 - Σ_{j=1}^{d-1} Y_j. Therefore, the distribution of (Y_1, ..., Y_d) is actually supported by the (d-1)-dimensional simplex

Δ_0 = {(y_1, ..., y_{d-1}) : Σ_{i=1}^{d-1} y_i ≤ 1, 0 ≤ y_i ≤ 1, i = 1, ..., d-1}.

For a Borel subset A of Δ = {(y_1, ..., y_d) : Σ_{i=1}^{d} y_i = 1, 0 ≤ y_i ≤ 1, i = 1, ..., d}, let B = {(x_1, ..., x_d) : (1/(x_1 + ... + x_d))(x_1, ..., x_d) ∈ A}. Because X_1, ..., X_d are independent, we can write out the probability distribution of the random vector (X_1, ..., X_d):

P({ω : (1/(X_1(ω) + ... + X_d(ω)))(X_1(ω), ..., X_d(ω)) ∈ A}) = ∫_B Π_{i=1}^{d} [α^{γ_i} x_i^{γ_i - 1} e^{-α x_i} / Γ(γ_i)] d(x_1, ..., x_d),

where d(x_1, ..., x_d) indicates that the integration is with respect to Lebesgue measure in R^d. We make a change of variables:

s = x_1 + ... + x_d,  y_i = x_i / s,  1 ≤ i ≤ d-1.

The Jacobian of (x_1, ..., x_d) with respect to (y_1, ..., y_{d-1}, s) is s^{d-1}. After some calculations, the above integral equals

∫_C [Γ(Σ_{j=1}^{d} γ_j) / Π_{j=1}^{d} Γ(γ_j)] Π_{j=1}^{d-1} y_j^{γ_j - 1} (1 - Σ_{j=1}^{d-1} y_j)^{γ_d - 1} d(y_1, ..., y_{d-1}),

where C = {(y_1, ..., y_{d-1}) : (y_1, ..., y_{d-1}, 1 - y_1 - ... - y_{d-1}) ∈ A}. Thus we have a density for (1/(X_1 + ... + X_d))(X_1, ..., X_d) with respect to Lebesgue measure on Δ_0.

Lemma 4.3. Let X be a real-valued random variable with continuous distribution function F. Then F(X) is uniformly distributed on [0, 1]. Conversely, let Y be a real-valued random variable uniformly distributed on [0, 1]; then F^{-1}(Y) has the distribution F.

Proof: Let us prove the second part of this lemma. Define X = F^{-1}(Y). Then

P(X ≤ x) = P(F^{-1}(Y) ≤ x) = P(Y ≤ F(x)).

Since Y is uniformly distributed on [0, 1], this probability equals F(x), which means X has the distribution F.

We use Lemma 4.2 and take α = 1, γ_i = 1, 1 ≤ i ≤ d. Therefore X_1, ..., X_d have the exponential distribution with parameter 1. To generate the exponential distribution randomly, we apply Lemma 4.3: generate a random variable Y first, which has the uniform distribution on [0, 1], then apply the function F^{-1}(x) = -ln(1 - x) to Y, where F is the distribution function of the exponential distribution. In this way, we may generate the distribution for π, and for each row of A and B.

4.2 How to test the Baum-Welch method

Idea: Firstly, we generate a model λ, which is regarded as the true value. Then we use λ to generate some sample sequences Y_1, ..., Y_n. To use the Baum-Welch method, we need a seed λ_0, which can be thought of as an initial value. For this seed λ_0 and each sequence Y_i, we run the program to get a new model λnew_i = (Anew_i, Bnew_i, πnew_i), where 1 ≤ i ≤ n. Finally, we get an estimate of λ, namely λnew = (Anew, Bnew, πnew), where

Anew = (Σ_{i=1}^{n} Anew_i)/n,  Bnew = (Σ_{i=1}^{n} Bnew_i)/n,  πnew = (Σ_{i=1}^{n} πnew_i)/n.

Then we can compare this estimate λnew with the true value λ to see how the Baum-Welch method works. We also define a distance between two matrices A = (a_ij) and B = (b_ij) by d(A, B), where

56 N M d(a, B) = (a ij b ij ) 2, i=1 j=1 1 i N, 1 j M. For example, we take N = M = 3, T = 20, and n = 10, 20, 30 respectively. Step1: λ and λ 0. In this step we generate a true value λ and a seed λ 0. λ = ( π, A, B ), π = ( ), A = , B = , 47

The seed λ_0 = (π_0, A_0, B_0) is generated in the same way.

Step 2: get Y_i, taking n = 30, 1 ≤ i ≤ n. See Figure 4.1.

Step 3: get the new model λnew. Step 3 is divided into two parts. In the first part, we use the seed λ_0 to run the program; in the second part, we take the seed equal to λ.
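Steps 1 and 2 can be sketched as follows. The rows of A and B and the vector π are generated by normalizing independent Exp(1) draws (Lemmas 4.2 and 4.3 with α = γ_i = 1, i.e. uniform Dirichlet samples), observation sequences are then drawn from the true model, and d(·,·) is the distance defined above. All helper names are ours, and the Baum-Welch runs of Step 3 are omitted:

```python
import math
import random

def random_row(n):
    """A random probability vector: normalized Exp(1) draws, via F^{-1}(u) = -ln(1 - u)."""
    x = [-math.log(1.0 - random.random()) for _ in range(n)]
    s = sum(x)
    return [v / s for v in x]

def sample_sequence(A, B, pi, T):
    """Draw an observation sequence of length T from the HMM (A, B, pi)."""
    def draw(p):                      # sample an index from a probability vector
        u, acc = random.random(), 0.0
        for k, pk in enumerate(p):
            acc += pk
            if u < acc:
                return k
        return len(p) - 1
    state = draw(pi)
    obs = [draw(B[state])]
    for _ in range(T - 1):
        state = draw(A[state])
        obs.append(draw(B[state]))
    return obs

def distance(A, B):
    """d(A, B): sum of squared entrywise differences."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))

random.seed(0)
N, M, T, n = 3, 3, 20, 30
pi = random_row(N)
A = [random_row(N) for _ in range(N)]       # true transition matrix
B = [random_row(M) for _ in range(N)]       # true emission matrix
Y = [sample_sequence(A, B, pi, T) for _ in range(n)]
```

With Y in hand, Step 3 would run Baum-Welch from each seed on each sequence, average the resulting models, and report d(π, πnew), d(A, Anew), d(B, Bnew).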

Figure 4.1. The generated observable sequences Y_i, 1 ≤ i ≤ 30.

The reason that we want to do this is to show that the Baum-Welch method only works when the initial model λ_0 is in a neighborhood V of λ_q, where λ_q is a local maximum of the function P = P_λ(Y_0^T = r_0^T). If we use the seed λ_0, we will see that we cannot get a good estimate of λ; the reason is that λ_0 is not in a small enough neighborhood of λ. In the second part, if we use λ as the seed, we will get a good result. We think λ must be a local maximum, and it is in any neighborhood of itself, so this seed satisfies the condition in the Baum-Welch method.

Part I: use λ_0 = (A_0, B_0, π_0) as the seed. We use the samples {Y_1, ..., Y_10}, {Y_1, ..., Y_20}, {Y_1, ..., Y_30} respectively to get three new models λnew_1-10, λnew_1-20, λnew_1-30 (see Figure 4.2), and then calculate the distances between them and λ. From Table 5, we can see that the distances between the new models and λ remain large even if we increase the sample size. This result means we cannot get a good estimate from the seed λ_0.

Table 5. The Distances, Seed λ_0
(values of d(π, πnew), d(A, Anew), d(B, Bnew) for λnew_1-10, λnew_1-20, λnew_1-30)

Figure 4.2. Three new models corresponding to the samples {Y_1, ..., Y_10}, {Y_1, ..., Y_20}, {Y_1, ..., Y_30}, with seed λ_0.

Table 6. The Distances, Seed λ
(values of d(π, πnew), d(A, Anew), d(B, Bnew) for λnew_1-10, λnew_1-20, λnew_1-30)

Part II: use λ = (A, B, π) as the seed. In the same way, we use the samples {Y_1, ..., Y_10}, {Y_1, ..., Y_20}, {Y_1, ..., Y_30} respectively to get three new models λnew_1-10, λnew_1-20, and λnew_1-30. The results are listed in Figure 4.3, and the distances between them and λ are listed in Table 6. This table indicates that we get a good estimate, which is very close to λ.

Discussion: When we regard P_λ(Y_0^T = r_0^T) as a function of λ, it may have more than one local maximum, for example P_1 and P_2 corresponding to λ_1 and λ_2. By the Baum-Welch method, we may get a good estimate of one of them, but only if the initial model we use is in a neighborhood of one of them, and the closer the better. The true value λ we used here may not itself be a local maximum point of the function P_λ(Y_0^T = r_0^T). Also, to get each new model λnew_i, 1 ≤ i ≤ 30, we ran the program 10 times; maybe this is not enough, because we do not know the speed of convergence.

Figure 4.3. Three new models corresponding to the samples {Y_1, ..., Y_10}, {Y_1, ..., Y_20}, {Y_1, ..., Y_30}, with seed λ.


A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Lecture Notes Speech Communication 2, SS 2004 Erhard Rank/Franz Pernkopf Signal Processing and Speech Communication Laboratory Graz University of Technology Inffeldgasse 16c, A-8010

More information

Outline of Today s Lecture

Outline of Today s Lecture University of Washington Department of Electrical Engineering Computer Speech Processing EE516 Winter 2005 Jeff A. Bilmes Lecture 12 Slides Feb 23 rd, 2005 Outline of Today s

More information

Advanced Algorithms. Lecture Notes for April 5, 2016 Dynamic programming, continued (HMMs); Iterative improvement Bernard Moret

Advanced Algorithms. Lecture Notes for April 5, 2016 Dynamic programming, continued (HMMs); Iterative improvement Bernard Moret Advanced Algorithms Lecture Notes for April 5, 2016 Dynamic programming, continued (HMMs); Iterative improvement Bernard Moret Dynamic Programming (continued) Finite Markov models A finite Markov model

More information

A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS

A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS A REVIEW AND APPLICATION OF HIDDEN MARKOV MODELS AND DOUBLE CHAIN MARKOV MODELS Michael Ryan Hoff A Dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Note Set 5: Hidden Markov Models

Note Set 5: Hidden Markov Models Note Set 5: Hidden Markov Models Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016 1 Hidden Markov Models (HMMs) 1.1 Introduction Consider observed data vectors x t that are d-dimensional

More information

Master 2 Informatique Probabilistic Learning and Data Analysis

Master 2 Informatique Probabilistic Learning and Data Analysis Master 2 Informatique Probabilistic Learning and Data Analysis Faicel Chamroukhi Maître de Conférences USTV, LSIS UMR CNRS 7296 email: chamroukhi@univ-tln.fr web: chamroukhi.univ-tln.fr 2013/2014 Faicel

More information

Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data

Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data Human Mobility Pattern Prediction Algorithm using Mobile Device Location and Time Data 0. Notations Myungjun Choi, Yonghyun Ro, Han Lee N = number of states in the model T = length of observation sequence

More information

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009

CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models. The ischool University of Maryland. Wednesday, September 30, 2009 CMSC 723: Computational Linguistics I Session #5 Hidden Markov Models Jimmy Lin The ischool University of Maryland Wednesday, September 30, 2009 Today s Agenda The great leap forward in NLP Hidden Markov

More information

MCMC notes by Mark Holder

MCMC notes by Mark Holder MCMC notes by Mark Holder Bayesian inference Ultimately, we want to make probability statements about true values of parameters, given our data. For example P(α 0 < α 1 X). According to Bayes theorem:

More information

Hidden Markov Models Hamid R. Rabiee

Hidden Markov Models Hamid R. Rabiee Hidden Markov Models Hamid R. Rabiee 1 Hidden Markov Models (HMMs) In the previous slides, we have seen that in many cases the underlying behavior of nature could be modeled as a Markov process. However

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Hidden Markov Models. Terminology and Basic Algorithms

Hidden Markov Models. Terminology and Basic Algorithms Hidden Markov Models Terminology and Basic Algorithms The next two weeks Hidden Markov models (HMMs): Wed 9/11: Terminology and basic algorithms Mon 14/11: Implementing the basic algorithms Wed 16/11:

More information

MAS114: Solutions to Exercises

MAS114: Solutions to Exercises MAS114: s to Exercises Up to week 8 Note that the challenge problems are intended to be difficult! Doing any of them is an achievement. Please hand them in on a separate piece of paper if you attempt them.

More information

Hidden Markov Models. Terminology, Representation and Basic Problems

Hidden Markov Models. Terminology, Representation and Basic Problems Hidden Markov Models Terminology, Representation and Basic Problems Data analysis? Machine learning? In bioinformatics, we analyze a lot of (sequential) data (biological sequences) to learn unknown parameters

More information

11.3 Decoding Algorithm

11.3 Decoding Algorithm 11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

( ).666 Information Extraction from Speech and Text

( ).666 Information Extraction from Speech and Text (520 600).666 Information Extraction from Speech and Text HMM Parameters Estimation for Gaussian Output Densities April 27, 205. Generalization of the Results of Section 9.4. It is suggested in Section

More information

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models

Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models 6 Dependent data The AR(p) model The MA(q) model Hidden Markov models Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories

More information

CS 7180: Behavioral Modeling and Decision- making in AI

CS 7180: Behavioral Modeling and Decision- making in AI CS 7180: Behavioral Modeling and Decision- making in AI Learning Probabilistic Graphical Models Prof. Amy Sliva October 31, 2012 Hidden Markov model Stochastic system represented by three matrices N =

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

Hidden Markov Model and Speech Recognition

Hidden Markov Model and Speech Recognition 1 Dec,2006 Outline Introduction 1 Introduction 2 3 4 5 Introduction What is Speech Recognition? Understanding what is being said Mapping speech data to textual information Speech Recognition is indeed

More information

Doubly Indexed Infinite Series

Doubly Indexed Infinite Series The Islamic University of Gaza Deanery of Higher studies Faculty of Science Department of Mathematics Doubly Indexed Infinite Series Presented By Ahed Khaleel Abu ALees Supervisor Professor Eissa D. Habil

More information

Hidden Markov Models. Three classic HMM problems

Hidden Markov Models. Three classic HMM problems An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems

More information

Markov Chains and Hidden Markov Models

Markov Chains and Hidden Markov Models Chapter 1 Markov Chains and Hidden Markov Models In this chapter, we will introduce the concept of Markov chains, and show how Markov chains can be used to model signals using structures such as hidden

More information

Computational Biology Lecture #3: Probability and Statistics. Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept

Computational Biology Lecture #3: Probability and Statistics. Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept Computational Biology Lecture #3: Probability and Statistics Bud Mishra Professor of Computer Science, Mathematics, & Cell Biology Sept 26 2005 L2-1 Basic Probabilities L2-2 1 Random Variables L2-3 Examples

More information

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16

VL Algorithmen und Datenstrukturen für Bioinformatik ( ) WS15/2016 Woche 16 VL Algorithmen und Datenstrukturen für Bioinformatik (19400001) WS15/2016 Woche 16 Tim Conrad AG Medical Bioinformatics Institut für Mathematik & Informatik, Freie Universität Berlin Based on slides by

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454 Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) standard sequence model in genomics, speech, NLP, What

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

MATH3200, Lecture 31: Applications of Eigenvectors. Markov Chains and Chemical Reaction Systems

MATH3200, Lecture 31: Applications of Eigenvectors. Markov Chains and Chemical Reaction Systems Lecture 31: Some Applications of Eigenvectors: Markov Chains and Chemical Reaction Systems Winfried Just Department of Mathematics, Ohio University April 9 11, 2018 Review: Eigenvectors and left eigenvectors

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017

Machine Learning & Data Mining Caltech CS/CNS/EE 155 Hidden Markov Models Last Updated: Feb 7th, 2017 1 Introduction Let x = (x 1,..., x M ) denote a sequence (e.g. a sequence of words), and let y = (y 1,..., y M ) denote a corresponding hidden sequence that we believe explains or influences x somehow

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Introduction to Artificial Intelligence (AI)

Introduction to Artificial Intelligence (AI) Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 10 Oct, 13, 2011 CPSC 502, Lecture 10 Slide 1 Today Oct 13 Inference in HMMs More on Robot Localization CPSC 502, Lecture

More information

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Lecture - 21 HMM, Forward and Backward Algorithms, Baum Welch

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information