Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010


Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010. Special Meeting.

Markov Model

When a dynamical system is probabilistic, it may be determined by the transition probability

\[ P_{ij} = P(x_{t+1} = j \mid x_t = i) = P(\phi_t(i) = j) \]

for finitely many states $i$ and $j$, with initial distribution $\pi_0(i) = P(x_0 = i)$. It becomes a Markov chain.
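As a minimal sketch of the definition above, the distribution of $x_t$ can be obtained by multiplying the initial distribution by the transition matrix $t$ times. The two-state chain, its numbers, and the helper name `marginal` are assumptions made for illustration only.

```r
# Hypothetical two-state chain; the numbers are illustrative only.
# P[i, j] = P(x_{t+1} = j | x_t = i), so every row sums to 1.
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
pi0 <- c(1, 0)  # initial distribution: start in state 1

# Distribution of x_t: multiply the row vector pi0 by P, t times.
marginal <- function(pi0, P, t) {
  p <- pi0
  for (s in seq_len(t)) p <- as.vector(p %*% P)
  p
}

print(marginal(pi0, P, 1))
print(marginal(pi0, P, 2))
```

Each call returns a probability vector over the states, so its entries sum to 1 for any $t$.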

Hidden Markov Model

A hidden Markov model has two processes: (1) the evolution of the state is internal and unobservable, but (2) an observation is obtained from each internal state according to the transition probability

\[ Q_{jk} = P(y_t = k \mid x_t = j) \]

Example: DNA Sequence Alignment

DNA is composed of an alphabet of four nucleotides, A, C, G, and T, and two sequences may have been acquired from a common ancestor. Aligning them is complicated by insertions, deletions, and mutations. We can introduce three hidden states: match (M), insertion (I), and deletion (D).

HMM for DNA Sequence Alignment

A family of models is introduced by parameters $(\theta_0, \theta_1, \theta_2, \theta_3)$. The initial distribution $\pi_0$ has the form

\[ \pi_0(M) = 1 - 2\theta_0; \qquad \pi_0(I) = \pi_0(D) = \theta_0, \]

and the transition probabilities $P_{ij}$ and $Q_{jk}$ are expressed in terms of $\theta_1$, $\theta_2$, and $\theta_3$.

DNA Sequence Alignment: A Challenge

Two sequences are not aligned: the observed values $y_0, y_1, \dots$ are determined only once the alignment is known, and therefore the sequences are aligned only when the control values, M (Match/mismatch), I (Insert), or D (Delete), are estimated. It seems impossible to pursue the sequence alignment!

DNA Sequence Alignment

Dynamic programming with an indefinite time horizon makes the Viterbi algorithm applicable to DNA sequence alignment. To keep memory usage bounded, we set the maximum length of the time horizon to the total length of the two sequences.

> source("viterbi.r")
> source("dnaprofile.r")
> dna = strsplit(scan("gene57.txt", what="character"), "")
> cat(dna[[1]], "\n", dna[[2]], "\n", sep="")
> th = c(0.1, 0.1, 0.1, 0.1)
> out = viterbi(3, length(dna[[1]])+length(dna[[2]]), k0, cc)
> aligned(out[[1]])

Conditional Distribution

Suppose that a pair $(X, Y)$ of discrete random variables has a joint frequency function $p(x, y)$. Then we can introduce the conditional frequency function

\[ p(x \mid y) = \frac{p(x, y)}{p(y)} \]

Here the marginal frequency function $p(y)$ is known as the normalizing constant in the sense that it guarantees

\[ \sum_x p(x \mid y) = 1 \]

Since $p(x \mid y)$ is proportional to $p(x, y)$, we simply write

\[ p(x \mid y) \propto p(x, y) \]
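The normalization above can be illustrated with a small joint table. This is only a sketch: the table entries and the labels `a`, `b`, `u`, `v` are hypothetical.

```r
# Hypothetical joint frequency function p(x, y); rows index x, columns index y.
joint <- matrix(c(0.1, 0.3,
                  0.2, 0.4), nrow = 2, byrow = TRUE,
                dimnames = list(x = c("a", "b"), y = c("u", "v")))

# Marginal p(y): the normalizing constant for each column.
p_y <- colSums(joint)

# Conditional p(x | y): divide every column of the joint table by p(y).
cond <- sweep(joint, 2, p_y, "/")

print(cond)
print(colSums(cond))  # each column of p(x | y) sums to 1
```

Each column of `cond` is the corresponding column of `joint` rescaled, which is the sense in which $p(x \mid y) \propto p(x, y)$.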

The joint distribution of the evolution $x_0, \dots, x_T$ and the observation $y_0, \dots, y_T$,

\[ p(x_0, \dots, x_T, y_0, \dots, y_T) = \pi_0(x_0) P_{x_0,x_1} Q_{x_0,y_0} \cdots P_{x_{T-1},x_T} Q_{x_T,y_T}, \]

is proportional to the conditional distribution of the evolution given $y_0, \dots, y_T$,

\[ p(x_0, \dots, x_T \mid y_0, \dots, y_T) \]

Hidden Markov Model: Filtering Recursion

The filtering problem is to compute the conditional distribution of the internal state $x_T$,

\[ \pi_{T|T}(i) = p(x_T = i \mid y_0, \dots, y_T) = \sum_{x_0, \dots, x_{T-1}} p(x_0, \dots, x_T \mid y_0, \dots, y_T), \]

given $y_0, \dots, y_T$. It is formulated by the forward algorithm.

1. $\pi_{0|0}(i) \propto \pi_0(i) Q_{i,y_0}$
2. $\pi_{t|t}(j) \propto \sum_i \pi_{t-1|t-1}(i) P_{i,j} Q_{j,y_t}$ for $t = 1, \dots, T$.
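The two steps of the forward algorithm can be sketched in R as follows. The function name `forward_filter`, the two-state model, and the coding of observations as integers 1..K are assumptions for illustration; because R indexes from 1, row `t + 1` of the result holds $\pi_{t|t}$.

```r
# Forward (filtering) recursion; row t + 1 of the result holds pi_{t|t}.
forward_filter <- function(pi0, P, Q, y) {
  n <- length(y)
  filt <- matrix(0, nrow = n, ncol = length(pi0))
  a <- pi0 * Q[, y[1]]                 # step 1: pi_0(i) * Q[i, y_0]
  filt[1, ] <- a / sum(a)              # normalize to a probability vector
  for (t in seq_len(n)[-1]) {
    # step 2: sum_i pi_{t-1|t-1}(i) P[i, j], then weight by Q[j, y_t]
    a <- as.vector(filt[t - 1, ] %*% P) * Q[, y[t]]
    filt[t, ] <- a / sum(a)
  }
  filt
}

# Hypothetical two-state, two-symbol model (illustrative numbers only).
pi0 <- c(0.5, 0.5)
P <- matrix(c(0.7, 0.3,
              0.4, 0.6), nrow = 2, byrow = TRUE)
Q <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
y <- c(1, 2, 2)                        # observed symbols, coded 1..K

filt <- forward_filter(pi0, P, Q, y)
print(filt)
```

Normalizing at every step is one common design choice: it keeps the numbers from underflowing on long sequences and does not change the result, since each step is only defined up to proportionality.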

Hidden Markov Model: Smoothing Recursion

Let $t < T$. The smoothing problem is to compute the conditional distribution of the internal state $x_t$,

\[ \pi_{t|T}(i) = p(x_t = i \mid y_0, \dots, y_T) = \sum_{x_0, \dots, x_{t-1}, x_{t+1}, \dots, x_T} p(x_0, \dots, x_T \mid y_0, \dots, y_T), \]

given $y_0, \dots, y_T$. It combines $\pi_{t|t}(i)$ forward with the backward algorithm.

1. $\beta_{T|T}(j) = 1$
2. $\beta_{k-1|T}(i) = \sum_j \beta_{k|T}(j) P_{i,j} Q_{j,y_k}$ for $k = T, \dots, t + 1$.

Then it formulates

\[ \pi_{t|T}(i) \propto \pi_{t|t}(i) \beta_{t|T}(i) \]
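A minimal sketch of the smoothing recursion in R, combining the forward pass with the backward pass. The function name `smooth_states` and the example model are hypothetical; row `t + 1` of each returned matrix corresponds to time $t$.

```r
# Forward-backward smoothing; returns filtered and smoothed distributions.
smooth_states <- function(pi0, P, Q, y) {
  n <- length(y); m <- length(pi0)
  filt <- matrix(0, n, m)
  beta <- matrix(1, n, m)              # beta_{T|T}(j) = 1 in the last row
  a <- pi0 * Q[, y[1]]
  filt[1, ] <- a / sum(a)
  for (t in seq_len(n)[-1]) {          # forward pass
    a <- as.vector(filt[t - 1, ] %*% P) * Q[, y[t]]
    filt[t, ] <- a / sum(a)
  }
  for (k in rev(seq_len(n)[-1])) {     # backward pass
    b <- as.vector(P %*% (Q[, y[k]] * beta[k, ]))
    beta[k - 1, ] <- b / sum(b)        # rescaled; only proportions matter
  }
  s <- filt * beta                     # pi_{t|T}(i) is proportional to this
  list(filt = filt, smooth = s / rowSums(s))
}

# Hypothetical two-state, two-symbol model (illustrative numbers only).
pi0 <- c(0.5, 0.5)
P <- matrix(c(0.7, 0.3,
              0.4, 0.6), nrow = 2, byrow = TRUE)
Q <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
y <- c(1, 2, 2)

out <- smooth_states(pi0, P, Q, y)
print(out$smooth)
```

At the final time the smoothed and filtered distributions coincide, because $\beta_{T|T} = 1$.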

Baum-Welch Algorithm

Let $t < T$. The bivariate smoothing problem is to compute the transition probability

\[ \lambda_{t|T}(i, j) = p(x_t = i, x_{t+1} = j \mid y_0, \dots, y_T) \]

conditioned upon $y_0, \dots, y_T$. The forward-backward algorithm can be applied to obtain

\[ \lambda_{t|T}(i, j) \propto \pi_{t|t}(i) P_{i,j} Q_{j,y_{t+1}} \beta_{t+1|T}(j) \]

The summation gives the univariate smoothing

\[ \pi_{t|T}(i) = \sum_j \lambda_{t|T}(i, j) \]

Maximum Likelihood for Markov Model

The transition probabilities $P_{ij}$ are considered as model parameters. Having observed the evolution $x_0, x_1, \dots, x_T$, we can infer the parameters by maximizing the likelihood

\[ L = P_{x_0,x_1} P_{x_1,x_2} \cdots P_{x_{T-1},x_T} \]

This maximum likelihood estimate (MLE) is proportional to the occurrence count

\[ P_{ij} \propto \sum_{t=1}^{T} I(x_{t-1} = i, x_t = j) \]

where $I(x_{t-1} = i, x_t = j) = 1$ or $0$, indicating the occurrence of transition.
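The occurrence-count estimate can be sketched in a few lines of R. The observed sequence `x` is made up for illustration; states are coded 1 and 2.

```r
# Hypothetical observed evolution x_0, ..., x_T over states 1 and 2.
x <- c(1, 1, 2, 1, 2, 2, 2, 1)

# Count transitions (x_{t-1}, x_t): pair each state with its successor.
from <- factor(head(x, -1), levels = 1:2)
to   <- factor(tail(x, -1), levels = 1:2)
counts <- table(from, to)

# Normalize each row so that the estimated P[i, ] sums to 1.
P_hat <- prop.table(counts, 1)

print(counts)
print(P_hat)
```

A state that never appears as a starting point would give a row of zero counts and an undefined estimate, so in practice such rows need separate handling.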

Model Inference for HMM

The initial probabilities $\pi_0(i)$ and the transition probabilities $P_{ij}$ and $Q_{jk}$ become model parameters. Given the observation $y_0, \dots, y_T$,

\[ L(\theta) = \sum_{x_0, \dots, x_T} \pi_0(x_0) P_{x_0,x_1} Q_{x_0,y_0} \cdots P_{x_{T-1},x_T} Q_{x_T,y_T} \]

is the likelihood. The evolution $x_0, \dots, x_T$ is not observed, and its values are called the latent variables. It is not tractable to maximize the likelihood $L$ under the probability constraints for $\pi_0(i)$ (i.e., $\pi_0(i) \ge 0$ and $\sum_i \pi_0(i) = 1$), as well as for $P_{ij}$ and $Q_{jk}$, and find their estimates analytically.

Maximization with latent variables

If the hidden transitions $x_0, \dots, x_T$ are assumed to be known, the maximum likelihood estimate for $P_{ij}$ can be formulated with the occurrence of hidden transitions. Then based on the current estimate of $\pi_0(i)$, $P_{ij}$, and $Q_{jk}$, it can be replaced with the conditional expectation of occurrence count as follows.

\[ P_{ij} \propto E\left[ \sum_{t=1}^{T} I(x_{t-1} = i, x_t = j) \,\middle|\, y_0, \dots, y_T \right] = \sum_{t=1}^{T} \lambda_{t-1|T}(i, j) \]

where the conditional probabilities $\lambda_{t-1|T}(i, j)$ can be obtained via Baum-Welch algorithm.

Baum-Welch Training

1. Estimate Internal States: Given the current estimate $\pi_0(i)$, $P_{ij}$, and $Q_{jk}$, compute $\lambda_{t-1|T}(i, j)$ by Baum-Welch algorithm.
2. Update Model Parameters: For example,

\[ \pi_0(i) \propto \pi_{0|T}(i), \qquad P_{ij} \propto \sum_{t=0}^{T-1} \lambda_{t|T}(i, j), \qquad Q_{jk} \propto \sum_{t : y_t = k} \pi_{t|T}(j) \]

3. Repeat the above steps until it converges.
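One pass through steps 1 and 2 can be sketched in R as below. This is an illustrative sketch, not the lecture's own script: the function name `bw_step`, the starting values, and the observation sequence are hypothetical, and a fixed number of iterations stands in for a convergence check.

```r
# One Baum-Welch update: forward-backward pass, then re-estimation.
bw_step <- function(pi0, P, Q, y) {
  n <- length(y); m <- length(pi0)
  filt <- matrix(0, n, m); beta <- matrix(1, n, m)
  a <- pi0 * Q[, y[1]]
  filt[1, ] <- a / sum(a)
  for (t in seq_len(n)[-1]) {          # forward recursion
    a <- as.vector(filt[t - 1, ] %*% P) * Q[, y[t]]
    filt[t, ] <- a / sum(a)
  }
  for (k in rev(seq_len(n)[-1])) {     # backward recursion
    b <- as.vector(P %*% (Q[, y[k]] * beta[k, ]))
    beta[k - 1, ] <- b / sum(b)
  }
  sm <- filt * beta
  sm <- sm / rowSums(sm)               # univariate smoothing pi_{t|T}
  A <- matrix(0, m, m)                 # expected transition counts
  for (t in seq_len(n - 1)) {
    lam <- (filt[t, ] %o% (Q[, y[t + 1]] * beta[t + 1, ])) * P
    A <- A + lam / sum(lam)            # bivariate smoothing lambda_{t|T}
  }
  B <- matrix(0, m, ncol(Q))           # expected emission counts
  for (k in seq_len(ncol(Q))) B[, k] <- colSums(sm[y == k, , drop = FALSE])
  list(pi0 = sm[1, ], P = A / rowSums(A), Q = B / rowSums(B))
}

# Hypothetical starting values and observations (coded 1..K).
est <- list(pi0 = c(0.5, 0.5),
            P = matrix(c(0.7, 0.3, 0.4, 0.6), nrow = 2, byrow = TRUE),
            Q = matrix(c(0.9, 0.1, 0.2, 0.8), nrow = 2, byrow = TRUE))
y <- c(1, 1, 2, 2, 2, 1, 1, 2)

for (iter in 1:5) est <- bw_step(est$pi0, est$P, est$Q, y)
print(est)
```

The sketch assumes at least two observations and that every state keeps positive expected counts; a fuller implementation would also monitor the likelihood to decide when to stop.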

Dynamical System with Control

Knowing the state $x_t$ at time $t$, the control value $u_t$ is used to determine $x_{t+1}$. Then the evolution of states is governed by

\[ x_{t+1} = \phi_t(x_t, u_t) \]

Optimal Control Problem

Given the initial state $x_0$ and the control sequence $u_0, \dots, u_T$, we obtain the trajectory $x_0, \dots, x_T$. Then the real value

\[ V = c_0(x_0, u_0) + \cdots + c_T(x_T, u_T) + k_T(x_T) \]

is defined over the horizon from $t = 0$ to $T$, and is viewed as the sum of the running rewards and the terminal reward (or costs). The optimal control problem is to find the control sequence $u_0, \dots, u_T$ that maximizes the reward $V$ (or minimizes the cost).

Starting from the terminal reward $k_T(x_T)$, we can work backward and find the optimal value $V_0(x_0)$. Then, from any time $t$ on, the remaining control sequence is itself optimal.

1. $V_{T+1}(i) = k_T(i)$
2. Compute backward for $t = T, \dots, 0$,

\[ V_t(i) = \max_{u_t} \left[ c_t(i, u_t) + V_{t+1}(\phi_t(i, u_t)) \right] \]
\[ \psi_t(i) = \operatorname{argmax}_{u_t} \left[ c_t(i, u_t) + V_{t+1}(\phi_t(i, u_t)) \right] \]

3. Set $u_0 = \psi_0(x_0)$, and calculate $u_t = \psi_t(\phi_{t-1}(x_{t-1}, u_{t-1}))$ forward for $t = 1, \dots, T$.
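The backward recursion and the forward rollout can be sketched in R on a toy problem. Everything in the example is hypothetical: four states on a line, controls 0 (stay) and 1 (move right), a unit cost per move, and a terminal reward of 5 at state 4. Row `t + 1` of `V` holds $V_t$.

```r
# Backward dynamic programming over a finite horizon t = 0, ..., horizon.
dp_backward <- function(n_states, controls, phi, reward, k_term, horizon) {
  V <- matrix(0, horizon + 2, n_states)          # rows for t = 0, ..., T + 1
  psi <- matrix(NA_real_, horizon + 1, n_states) # optimal control per (t, state)
  V[horizon + 2, ] <- sapply(seq_len(n_states), k_term)  # terminal reward
  for (t in horizon:0) {
    for (i in seq_len(n_states)) {
      vals <- sapply(controls, function(u) reward(t, i, u) + V[t + 2, phi(t, i, u)])
      V[t + 1, i] <- max(vals)
      psi[t + 1, i] <- controls[which.max(vals)]
    }
  }
  list(V = V, psi = psi)
}

# Hypothetical toy problem.
phi    <- function(t, i, u) min(i + u, 4)        # move right, capped at state 4
reward <- function(t, i, u) -u                   # each move costs 1
k_term <- function(i) if (i == 4) 5 else 0       # terminal reward at state 4
out <- dp_backward(4, c(0, 1), phi, reward, k_term, horizon = 3)

# Forward rollout of the optimal controls from x_0 = 1.
x <- 1
u_seq <- numeric(0)
for (t in 0:3) {
  u <- out$psi[t + 1, x]
  u_seq <- c(u_seq, u)
  x <- phi(t, x, u)
}
print(out$V[1, 1])  # optimal value V_0(1)
print(u_seq)
```

Ties between controls are broken by `which.max`, which returns the first maximizer; any tie-breaking rule yields the same optimal value.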

Log Likelihood for Optimal Decoding

Assume that the model parameters $\pi_0(i)$, $P_{ij}$, and $Q_{jk}$ are known. Given the observation $y_0, \dots, y_T$, the log likelihood becomes

\[ V = k_0(x_0) + c_1(x_0, x_1) + \cdots + c_T(x_{T-1}, x_T) \]

where

\[ k_0(x_0) = \log\left( \pi_0(x_0) Q_{x_0,y_0} \right) \]
\[ c_t(x_{t-1}, x_t) = \log\left( P_{x_{t-1},x_t} Q_{x_t,y_t} \right), \quad t = 1, \dots, T \]

The MLE problem is to obtain the optimal decoding $x_0, \dots, x_T$.

Viterbi Decoding Algorithm

Starting from the initial cost $k_0(x_0)$, we can work forward and find the optimal value $V_T(x_T)$.

1. $V_0(i) = k_0(i)$
2. Compute forward for $t = 1, \dots, T$,

\[ V_t(j) = \max_{x_{t-1}} \left[ V_{t-1}(x_{t-1}) + c_t(x_{t-1}, j) \right] \]
\[ \psi_t(j) = \operatorname{argmax}_{x_{t-1}} \left[ V_{t-1}(x_{t-1}) + c_t(x_{t-1}, j) \right] \]

3. Set $x_T = \operatorname{argmax}_{x_T} V_T(x_T)$, and calculate $x_{t-1} = \psi_t(x_t)$ backward for $t = T, \dots, 1$.
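The three steps above can be sketched in R as follows. The function name `viterbi_decode` and the example model are hypothetical and unrelated to the `viterbi.r` script used elsewhere in the lecture; row `t + 1` of `V` holds $V_t$.

```r
# Viterbi decoding in log space; returns the most likely state sequence.
viterbi_decode <- function(pi0, P, Q, y) {
  n <- length(y); m <- length(pi0)
  V <- matrix(-Inf, n, m)
  psi <- matrix(0L, n, m)
  V[1, ] <- log(pi0) + log(Q[, y[1]])            # step 1: V_0(i) = k_0(i)
  for (t in seq_len(n)[-1]) {                    # step 2: forward pass
    for (j in seq_len(m)) {
      vals <- V[t - 1, ] + log(P[, j]) + log(Q[j, y[t]])
      V[t, j] <- max(vals)
      psi[t, j] <- which.max(vals)               # best predecessor of state j
    }
  }
  path <- integer(n)
  path[n] <- which.max(V[n, ])                   # step 3: best final state
  for (t in rev(seq_len(n)[-1])) path[t - 1] <- psi[t, path[t]]
  path
}

# Hypothetical "sticky" two-state model (illustrative numbers only).
pi0 <- c(0.5, 0.5)
P <- matrix(c(0.9, 0.1,
              0.1, 0.9), nrow = 2, byrow = TRUE)
Q <- matrix(c(0.9, 0.1,
              0.1, 0.9), nrow = 2, byrow = TRUE)
y <- c(1, 1, 2, 2, 2)

print(viterbi_decode(pi0, P, Q, y))
```

Working with logarithms turns the product of probabilities into the sum $V$ defined for optimal decoding and avoids underflow on long sequences.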

Viterbi Training

1. Estimate Internal States: Given the current estimate $\pi_0(i)$, $P_{ij}$, and $Q_{jk}$, decode $x_0, \dots, x_T$ by Viterbi algorithm.
2. Update Model Parameters: For example,

\[ P_{ij} \propto \sum_{t=1}^{T} I(x_{t-1} = i, x_t = j), \qquad Q_{jk} \propto \sum_{t=0}^{T} I(x_t = j, y_t = k) \]

3. Repeat the above steps until it converges.

DNA Sequence Alignment: Position State Setting

Two sequences are dynamically aligned: Starting from the empty state $(0, 0)$ of aligned sequences, we choose the control value M (Match/mismatch), I (Insert), or D (Delete) to add letters to the aligned sequences.

DNA Sequence Alignment: Dynamical System

It introduces the dynamical system: Given the current state $x_t = (i, j)$ it updates with the control value $u_t$ = M, I, or D

\[ x_{t+1} = \phi_t(x_t, u_t) = \begin{cases} (i + 1, j + 1) & \text{if } u_t = M; \\ (i, j + 1) & \text{if } u_t = I; \\ (i + 1, j) & \text{if } u_t = D \end{cases} \]

Needleman-Wunsch Algorithm

At the state $x = (i, j)$ the reward function $c(x, u)$ is given by

\[ c(x, u) = \begin{cases} 1 & \text{if } u = M \text{ and the pair at } (i, j) \text{ match;} \\ 0 & \text{otherwise.} \end{cases} \]

When the two sequences have $N$ and $L$ letters, $x = (N, j)$ or $(i, L)$ becomes the boundary state with terminal cost $k(x) = 0$. Then the dynamic programming principle applies with indefinite time horizon.
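With this 0/1 reward, the backward recursion over position states can be sketched in R as below. The function name `align_score` and the two short sequences are hypothetical; the sketch returns only the optimal value at the empty state $(0, 0)$, not the alignment itself, and uses no gap penalties because the reward above has none.

```r
# Optimal alignment value under the 0/1 match reward.
# V[i + 1, j + 1] stores the value at state (i, j); the boundary states
# (N, j) and (i, L) keep the terminal value 0.
align_score <- function(a, b) {
  N <- length(a); L <- length(b)
  V <- matrix(0, N + 1, L + 1)
  for (i in rev(seq_len(N))) {
    for (j in rev(seq_len(L))) {
      # Row i, column j is state (i - 1, j - 1); the next letters are a[i], b[j].
      v_match  <- as.numeric(a[i] == b[j]) + V[i + 1, j + 1]  # control M
      v_insert <- V[i, j + 1]                                 # control I
      v_delete <- V[i + 1, j]                                 # control D
      V[i, j] <- max(v_match, v_insert, v_delete)
    }
  }
  V[1, 1]
}

# Hypothetical sequences for illustration.
a <- strsplit("ACGTA", "")[[1]]
b <- strsplit("AGTA", "")[[1]]
print(align_score(a, b))
```

Because only matching pairs earn a reward, the optimal value counts the largest number of letters that can be matched in order between the two sequences.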

Comparison with Needleman-Wunsch algorithm

If the reward function is changed, the algorithm can be adjusted for Needleman-Wunsch algorithm. Note that it is not an HMM, and that it does not require the prior estimate of parameter $\theta$. Compare the outputs with Viterbi decoding with various prior estimates for $\theta$.

> source("needleman.r")
> dna = strsplit(scan("gene57.txt", what="character"), "")
> cat(dna[[1]], "\n", dna[[2]], "\n", sep="")
> out = viterbi(3, length(dna[[1]])+length(dna[[2]]), k0, cc)
> aligned(out[[1]])
