Lecture 4: State Estimation in Hidden Markov Models (cont.)



EE378A Statistical Signal Processing, Lecture 4 - 04/13/2017

Lecturer: Tsachy Weissman
Scribe: David Wugofski

In this lecture we build on the previous analysis of state estimation for Hidden Markov Models to create an algorithm for estimating the posterior distribution of the states given the observations.

1 Notation

A quick summary of the notation:

1. Random Variables (objects): used more loosely, i.e., $X, Y, U, V$
2. Alphabets: $\mathcal{X}, \mathcal{Y}, \mathcal{U}, \mathcal{V}$
3. Specific Realizations of Random Variables: $x, y, u, v$
4. Vector Notation: For a vector $(x_1, \ldots, x_n)$, we denote the vector of values from the $i$-th component to the $j$-th component (inclusive) as $x_i^j$, with the convention $x^i \triangleq x_1^i$.
5. Markov Chains: We use the standard Markov chain notation $X - Y - Z$ to denote that $X, Y, Z$ form a Markov triplet.

For a discrete random variable (object), we write $p(x)$ for the pmf $P(X = x)$, and similarly $p(x, y)$ for $p_{X,Y}(x, y)$ and $p(y \mid x)$ for $p_{Y \mid X}(y \mid x)$, etc.

2 Recap of HMP

For a set of unknown states $\{X_n\}_{n \ge 1}$ of a Markov process and noisy observations $\{Y_n\}_{n \ge 1}$ of those states taken through some memoryless channel, we can express the joint distribution of the states and outputs as

$$p(x^n, y^n) = \prod_{t=1}^{n} p(x_t \mid x_{t-1}) \prod_{t=1}^{n} p(y_t \mid x_t) \tag{1}$$

where $x_0$ is taken to be some fixed arbitrary value (e.g., 0). Additionally, we have the following principles for this setting:

(a) $(X^{t-1}, Y^{t}) - X_t - (X_{t+1}^n, Y_{t+1}^n)$
(b) $(X^{t-1}, Y^{t-1}) - X_t - (X_{t+1}^n, Y_{t}^n)$
(c) $X^{t-1} - (X_t, Y^{t-1}) - (X_{t+1}^n, Y_{t}^n)$

In the last lecture, we derived the following forward-backward recursions for computing various posteriors:

$$\alpha_t(x_t) = p_{X_t \mid Y^t}(x_t \mid y^t) = \frac{\beta_t(x_t)\, p_{Y_t \mid X_t}(y_t \mid x_t)}{\sum_{\tilde{x}_t} \beta_t(\tilde{x}_t)\, p_{Y_t \mid X_t}(y_t \mid \tilde{x}_t)} \quad \text{(Measurement Update)}$$

$$\beta_{t+1}(x_{t+1}) = p_{X_{t+1} \mid Y^{t}}(x_{t+1} \mid y^{t}) = \sum_{x_t} \alpha_t(x_t)\, p_{X_{t+1} \mid X_t}(x_{t+1} \mid x_t) \quad \text{(Time Update)}$$

$$\gamma_t(x_t) = p_{X_t \mid Y^n}(x_t \mid y^n) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1})\, \frac{\alpha_t(x_t)\, p_{X_{t+1} \mid X_t}(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})} \quad \text{(Backward Recursion)}$$
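The three recursions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own code: states and observations are integers starting at 0, the transition and channel laws do not depend on $t$, and the argument names `prior`, `trans`, `emit`, and `obs` are hypothetical. The vector `prior` plays the role of $\beta_1$, i.e. $p(x_1 \mid x_0)$ for the fixed $x_0$.

```python
def forward_backward(prior, trans, emit, obs):
    """Return (alpha, beta, gamma) for a finite-alphabet HMM.

    prior[x]     : beta_1(x) = p(x_1 = x | x_0)           (assumed name)
    trans[x][x2] : p(x_{t+1} = x2 | x_t = x)              (assumed time-invariant)
    emit[x][y]   : p(y_t = y | x_t = x)                   (assumed time-invariant)
    obs          : observed sequence y_1, ..., y_n as integers
    """
    S, n = len(prior), len(obs)
    alpha, beta = [], [list(prior)]
    for t in range(n):
        # Measurement update: alpha_t(x) is proportional to beta_t(x) p(y_t | x).
        w = [beta[t][x] * emit[x][obs[t]] for x in range(S)]
        z = sum(w)
        alpha.append([v / z for v in w])
        # Time update: beta_{t+1}(x2) = sum_x alpha_t(x) p(x2 | x).
        beta.append([sum(alpha[t][x] * trans[x][x2] for x in range(S))
                     for x2 in range(S)])
    # Backward recursion, started from gamma_n = alpha_n.
    gamma = [None] * n
    gamma[n - 1] = alpha[n - 1]
    for t in range(n - 2, -1, -1):
        gamma[t] = [
            sum(gamma[t + 1][x2] * alpha[t][x] * trans[x][x2] / beta[t + 1][x2]
                for x2 in range(S) if beta[t + 1][x2] > 0)
            for x in range(S)
        ]
    return alpha, beta, gamma


# Illustrative two-state model; the numbers are made up for the example.
prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]
alpha, beta, gamma = forward_backward(prior, trans, emit, [0, 1, 1])
```

Each `gamma[t]` is a probability vector over the states, and `gamma[-1]` coincides with `alpha[-1]` because the last filtering distribution already conditions on all of $y^n$.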


3 Factor Graphs

Let us derive a graphical representation of the joint distribution expressed in (1) as follows:

1. Create nodes on the graph representing each random variable: $X_1, \ldots, X_n, Y_1, \ldots, Y_n$.
2. For each pairing of nodes, create an edge if there is some factor in (1) which contains both variables.

This form of graph is called a Factor Graph.

Example: Figure 1 shows the factor graph for the hidden Markov process in our setting.

[Figure 1: The factor graph for our hidden Markov process]

For any partitioning of the graph into three node sets $S_1$, $S_2$, and $S_3$ such that any path from $S_1$ to $S_3$ passes through some node in $S_2$, these sets form a Markov chain $S_1 - S_2 - S_3$ (to see this, simply express the joint pmf as a product of factors). Hence, for any $s_i \in S_i$, $i = 1, 2, 3$, we have

$$p(s_1, s_3 \mid s_2) = \phi_1(s_1, s_2)\, \phi_2(s_2, s_3) \tag{2}$$

for some functions $\phi_1, \phi_2$ which consist of the factors that were used to form the factor graph.

Using the factor graph, it is quite easy to establish principles (a)-(c) graphically. For example, principles (a) and (b) can be proved via Figures 2 and 3.

[Figure 2: The factor graph partition for (a)]

[Figure 3: The factor graph partition for (b)]
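The separation condition can also be checked mechanically. The sketch below is an illustration only: it builds the graph for (1) with edges $X_{t-1} - X_t$ and $X_t - Y_t$ (no node is created for the fixed $x_0$) and uses a breadth-first search to test whether every path from $S_1$ to $S_3$ passes through $S_2$. The function names `hmm_graph` and `separates` and the node labels `("X", t)`, `("Y", t)` are hypothetical.

```python
from collections import deque


def hmm_graph(n):
    """Adjacency sets for the graph of (1): X_{t-1} -- X_t and X_t -- Y_t."""
    edges = {}

    def link(a, b):
        edges.setdefault(a, set()).add(b)
        edges.setdefault(b, set()).add(a)

    for t in range(1, n + 1):
        link(("X", t), ("Y", t))
        if t > 1:
            link(("X", t - 1), ("X", t))
    return edges


def separates(edges, s1, s2, s3):
    """True if every path from s1 to s3 passes through some node of s2."""
    seen = set(s1) - set(s2)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in s3:
            return False  # reached s3 without crossing s2
        for nxt in edges.get(node, ()):
            if nxt not in seen and nxt not in s2:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Principle (a) with n = 5, t = 3: (X^{t-1}, Y^t) - X_t - (X_{t+1}^n, Y_{t+1}^n).
n, t = 5, 3
graph = hmm_graph(n)
s1 = {("X", i) for i in range(1, t)} | {("Y", i) for i in range(1, t + 1)}
s2 = {("X", t)}
s3 = ({("X", i) for i in range(t + 1, n + 1)}
      | {("Y", i) for i in range(t + 1, n + 1)})
holds_a = separates(graph, s1, s2, s3)

# Conditioning on Y_t alone does not separate past states from future states.
holds_y = separates(graph, {("X", 1), ("X", 2)}, {("Y", t)}, {("X", 4), ("X", 5)})
```

Here `holds_a` is `True` and `holds_y` is `False`, since the path $X_2 - X_3 - X_4$ avoids $Y_3$.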


We can also construct an additional principle from the factor graph partitioning presented in Figure 4:

(d) $X^{t-1} - (X_t, Y^n) - X_{t+1}^n$

[Figure 4: The graphical proof of (d)]

4 General Forward Backward Recursion: Module Functions

Let us consider a generic discrete random variable $U \sim p_U$ passed through a memoryless channel characterized by $p_{V \mid U}$ to get a discrete random variable $V$. By Bayes' rule we know

$$p_{U \mid V}(u \mid v) = \frac{p_U(u)\, p_{V \mid U}(v \mid u)}{\sum_{\tilde{u}} p_U(\tilde{u})\, p_{V \mid U}(v \mid \tilde{u})} \tag{3}$$

If we create the probability vector of $p_{U \mid V}$ for each value of $U$, we can define a vector function $F$ s.t.

$$\{p_{U \mid V}(u \mid v)\}_u = \left\{ \frac{p_U(u)\, p_{V \mid U}(v \mid u)}{\sum_{\tilde{u}} p_U(\tilde{u})\, p_{V \mid U}(v \mid \tilde{u})} \right\}_u \triangleq F(p_U, p_{V \mid U}, v) \tag{4}$$

Additionally, overloading the notation of $F$, we can define the matrix function $F$ as

$$F(p_U, p_{V \mid U}) = \{F(p_U, p_{V \mid U}, v)\}_v \tag{5}$$

This function $F$ acts as a module function for calculating the posterior distribution of a random variable given a prior distribution and a channel distribution.

In addition to $F$, we can write a module function for calculating the marginal distribution given a prior distribution on the input and a channel distribution. We label this vector function $G$ and define it as follows:

$$p_V = \{p_V(v)\}_v = \left\{ \sum_{\tilde{u}} p_U(\tilde{u})\, p_{V \mid U}(v \mid \tilde{u}) \right\}_v \triangleq G(p_U, p_{V \mid U}) \tag{6}$$

With these module functions in mind, we can express our recursion functions as follows:

$$\alpha_t = F(\beta_t, p_{Y_t \mid X_t}, y_t) \quad \text{(Measurement Update)}$$

$$\beta_{t+1} = G(\alpha_t, p_{X_{t+1} \mid X_t}) \quad \text{(Time Update)}$$

$$\gamma_t = G(\gamma_{t+1}, F(\alpha_t, p_{X_{t+1} \mid X_t})) \quad \text{(Backward Recursion)}$$
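The module functions translate almost directly into code. The following is a sketch under assumptions, not a reference implementation: alphabets are integers starting at 0, a channel is stored as `channel[u][v]` $= p_{V \mid U}(v \mid u)$, every output value has positive probability, and the matrix form of $F$ is stored row-by-row as `[v][u]` so that it can be passed to $G$ as a channel from $V$ back to $U$.

```python
def F(p_u, channel, v=None):
    """Posterior module.

    With v given, returns the vector {p(u | v)}_u as in (4).
    With v omitted, returns the matrix of (5), indexed [v][u].
    """
    if v is None:
        return [F(p_u, channel, vv) for vv in range(len(channel[0]))]
    w = [p_u[u] * channel[u][v] for u in range(len(p_u))]
    z = sum(w)  # equals p_V(v); assumed positive
    return [x / z for x in w]


def G(p_u, channel):
    """Marginal module of (6): p_V(v) = sum_u p_U(u) p(v | u)."""
    return [sum(p_u[u] * channel[u][v] for u in range(len(p_u)))
            for v in range(len(channel[0]))]


# Two-step example with made-up numbers: observations y_1 = 0, y_2 = 1.
beta1 = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]   # trans[x][x2] = p(x2 | x)
emit = [[0.9, 0.1], [0.3, 0.7]]    # emit[x][y] = p(y | x)

alpha1 = F(beta1, emit, 0)             # measurement update
beta2 = G(alpha1, trans)               # time update
alpha2 = F(beta2, emit, 1)             # measurement update
gamma2 = alpha2                        # gamma_n = alpha_n
gamma1 = G(gamma2, F(alpha1, trans))   # backward recursion
```

In the last line, `F(alpha1, trans)` is the matrix with entries $p(x_1 \mid x_2, y^1)$ indexed `[x_2][x_1]`, which is exactly the channel that $G$ needs in order to average over $\gamma_2$.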

5 State Estimation: the Viterbi Algorithm

Now we are ready to derive the joint posterior distribution $p(x^n \mid y^n)$:

$$\begin{aligned}
p(x^n \mid y^n) &= p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}^n, y^n) \\
&= p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}, y^n) && \text{(By principle (d))} \\
&= p(x_n \mid y^n) \prod_{t=1}^{n-1} p(x_t \mid x_{t+1}, y^t) && \text{(By principle (c))} \\
&= p(x_n \mid y^n) \prod_{t=1}^{n-1} \frac{p(x_t \mid y^t)\, p(x_{t+1} \mid x_t, y^t)}{p(x_{t+1} \mid y^t)} \\
&= \gamma_n(x_n) \prod_{t=1}^{n-1} \frac{\alpha_t(x_t)\, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})} && \text{(By principle (a))}
\end{aligned}$$

Write $\ln p(x^n \mid y^n) = \sum_{t=1}^{n} g_t(x_t, x_{t+1})$ with

$$g_t(x_t, x_{t+1}) \triangleq \begin{cases} \ln \dfrac{\alpha_t(x_t)\, p(x_{t+1} \mid x_t)}{\beta_{t+1}(x_{t+1})} & t = 1, \ldots, n-1 \\ \ln \gamma_n(x_n) & t = n \end{cases}$$

We can then solve for the MAP estimator $\hat{x}^{\mathrm{MAP}}$ with the help of the following definition:

Definition 1. For $1 \le k \le n$, let

$$M_k(x_k) := \max_{x_{k+1}^n} \sum_{t=k}^{n} g_t(x_t, x_{t+1}).$$

It is straightforward from the definition that

$$M_1(\hat{x}_1^{\mathrm{MAP}}) = \ln p(\hat{x}^{\mathrm{MAP}} \mid y^n) = \max_{x^n} \ln p(x^n \mid y^n)$$

and

$$\begin{aligned}
M_k(x_k) &= \max_{x_{k+1}} \max_{x_{k+2}^n} \Big( g_k(x_k, x_{k+1}) + \sum_{t=k+1}^{n} g_t(x_t, x_{t+1}) \Big) \\
&= \max_{x_{k+1}} \Big( g_k(x_k, x_{k+1}) + \max_{x_{k+2}^n} \sum_{t=k+1}^{n} g_t(x_t, x_{t+1}) \Big) \\
&= \max_{x_{k+1}} \big( g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}) \big).
\end{aligned}$$

Since $M_n(x_n) = g_n(x_n, x_{n+1}) = \ln \gamma_n(x_n)$ depends only on $x_n$, we may start from $k = n$ and apply the recursive formula backward to obtain $M_1(x_1)$. This is called the Viterbi Algorithm. The Bellman equations, referred to in the communication setting as the Viterbi Algorithm, are an application of a technique called dynamic programming.

Using the recursive equations derived above, we can express the Viterbi Algorithm as follows.
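As a sanity check on the factorization, one can compare $\sum_t g_t(x_t, x_{t+1})$ against a brute-force evaluation of $\ln p(x^n \mid y^n)$ on a small model. The sketch below is only an illustration: the function names and the two-state model are made up, the laws are assumed time-invariant and strictly positive, and `prior` stands for $p(x_1 \mid x_0)$.

```python
import itertools
import math


def filter_predict(prior, trans, emit, obs):
    """Forward pass: alpha[t] = p(x_t | y^t), beta[t] = p(x_t | y^{t-1}) (0-based t)."""
    S = len(prior)
    alpha, beta = [], [list(prior)]
    for y in obs:
        w = [beta[-1][x] * emit[x][y] for x in range(S)]
        z = sum(w)
        alpha.append([v / z for v in w])
        beta.append([sum(alpha[-1][x] * trans[x][x2] for x in range(S))
                     for x2 in range(S)])
    return alpha, beta


def log_posterior_via_g(xs, prior, trans, emit, obs):
    """Sum of g_t along the path xs, using gamma_n = alpha_n for the last term."""
    alpha, beta = filter_predict(prior, trans, emit, obs)
    n = len(obs)
    total = math.log(alpha[n - 1][xs[n - 1]])  # g_n = ln gamma_n(x_n)
    for t in range(n - 1):
        total += math.log(alpha[t][xs[t]] * trans[xs[t]][xs[t + 1]]
                          / beta[t + 1][xs[t + 1]])
    return total


def log_posterior_brute(xs, prior, trans, emit, obs):
    """ln p(x^n | y^n) computed from the joint (1) by summing over all paths."""
    def joint(path):
        p = prior[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        return p

    all_paths = itertools.product(range(len(prior)), repeat=len(obs))
    return math.log(joint(xs) / sum(joint(p) for p in all_paths))


prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]
obs = [0, 1, 1]
max_gap = max(
    abs(log_posterior_via_g(xs, prior, trans, emit, obs)
        - log_posterior_brute(xs, prior, trans, emit, obs))
    for xs in itertools.product(range(2), repeat=len(obs))
)
```

On this example `max_gap` is at the level of floating-point rounding, which is what the derivation predicts.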

function Viterbi
    $M_n(x_n) \leftarrow \ln \gamma_n(x_n)$    ▷ Initialization of log-likelihood
    $\hat{x}^{\mathrm{MAP}}(x_n) \leftarrow \emptyset$    ▷ Initialization of the MAP estimator
    for $k = n-1, \ldots, 1$ do
        $M_k(x_k) \leftarrow \max_{x_{k+1}} \big( g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}) \big)$    ▷ Maximum of log-likelihood
        $\hat{x}_{k+1}(x_k) \leftarrow \arg\max_{x_{k+1}} \big( g_k(x_k, x_{k+1}) + M_{k+1}(x_{k+1}) \big)$
        $\hat{x}^{\mathrm{MAP}}(x_k) \leftarrow [\hat{x}_{k+1}(x_k), \hat{x}^{\mathrm{MAP}}(\hat{x}_{k+1}(x_k))]$    ▷ Maximizing sequence with leading term $x_k$
    end for
    $M \leftarrow \max_{x_1} M_1(x_1)$    ▷ Maximum of overall log-likelihood
    $\hat{x}_1 \leftarrow \arg\max_{x_1} M_1(x_1)$
    $\hat{x}^{\mathrm{MAP}} \leftarrow [\hat{x}_1, \hat{x}^{\mathrm{MAP}}(\hat{x}_1)]$    ▷ Overall maximizing sequence
end function
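A runnable version of this backward dynamic program might look as follows. It is a sketch with some choices of its own: the inputs are passed as precomputed tables (`g[k][x][x2]` holds $g_{k+1}(x, x_2)$ with a 0-based list index `k`, and `log_gamma_n[x]` holds $\ln \gamma_n(x)$), and instead of storing a full sequence for every $x_k$ as the pseudocode does, it keeps one back-pointer per state and traces the path at the end. The numbers in the example are arbitrary scores chosen for illustration, not log-probabilities from a fitted model.

```python
def viterbi(g, log_gamma_n):
    """Return (max of sum_t g_t, maximizing path) via M_k = max(g_k + M_{k+1})."""
    S = len(log_gamma_n)
    M = list(log_gamma_n)  # M_n(x_n) = ln gamma_n(x_n)
    back = []              # back-pointers, filled from k = n-1 down to k = 1
    for k in range(len(g) - 1, -1, -1):
        new_M, arg = [], []
        for x in range(S):
            scores = [g[k][x][x2] + M[x2] for x2 in range(S)]
            best = max(range(S), key=scores.__getitem__)
            new_M.append(scores[best])
            arg.append(best)
        M = new_M
        back.append(arg)
    back.reverse()  # back[k][x] is now the best successor of x at step k

    x = max(range(S), key=M.__getitem__)  # argmax of M_1
    best_score = M[x]
    path = [x]
    for arg in back:
        x = arg[x]
        path.append(x)
    return best_score, path


# n = 3 steps, two states; illustrative scores only.
g = [
    [[-0.5, -2.0], [-1.5, -0.2]],
    [[-0.3, -1.0], [-2.5, -0.1]],
]
log_gamma_n = [-1.2, -0.4]
score, path = viterbi(g, log_gamma_n)
```

Keeping back-pointers costs one integer per state and step, whereas storing the sequences explicitly copies a list of length up to $n$ for each state at each step; both recover the same maximizing sequence.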
