Inferring the Number of State Clusters in a Hidden Markov Model and Its Extension
1 Inferring the Number of State Clusters in a Hidden Markov Model and Its Extension. Xugang Ye, Department of Applied Mathematics and Statistics, Johns Hopkins University
2 Elements of a Hidden Markov Model (HMM)
S: set of state clusters, S = {i}
O: set of observations, O = {k}
X_t: time series of state clusters
Y_t: time series of observations
A: set of state transitions, A = {A(i, j)}, where A(i, j) = P(X_{t+1} = j | X_t = i), t = 0, 1, 2, ...
B: set of emissions, B = {B(i, k)}, where B(i, k) = P(Y_t = k | X_t = i), t = 0, 1, 2, ...
μ: initial state cluster distribution, μ = {μ(i)}, where μ(i) = P(X_0 = i)
Graphic illustration: a chain X_0 → X_1 → X_2 → X_3 → ... → X_t → X_{t+1}, each X_t emitting Y_t (Y_0, Y_1, Y_2, Y_3, ..., Y_t, Y_{t+1})
Property 1: stationarity
Property 2 (implicit): P(X_{t+1} | X_{0:t}) = P(X_{t+1} | X_t)
Property 3 (implicit): P(Y_{t+1} | Y_{0:t}, X_{0:t+1}) = P(Y_{t+1} | X_{t+1})
Compact notation: λ = (A, B, μ)
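To make the notation concrete, here is a minimal Python/NumPy sketch (an illustration, not from the slides) that stores λ = (A, B, μ) as arrays and samples one state/observation sequence from the generative model; the sizes |S| = 3 and |O| = 4 and all array values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: |S| = 3 state clusters, |O| = 4 observation symbols.
A = np.array([[0.8, 0.1, 0.1],       # A[i, j] = P(X_{t+1} = j | X_t = i)
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
B = np.array([[0.7, 0.1, 0.1, 0.1],  # B[i, k] = P(Y_t = k | X_t = i)
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.4, 0.4]])
mu = np.array([0.5, 0.3, 0.2])       # mu[i] = P(X_0 = i)

def sample_sequence(A, B, mu, T, rng):
    """Sample (X_{0:T}, Y_{0:T}) from the HMM lambda = (A, B, mu)."""
    x = rng.choice(len(mu), p=mu)
    xs, ys = [x], [rng.choice(B.shape[1], p=B[x])]
    for _ in range(T):
        x = rng.choice(A.shape[1], p=A[x])          # Property 2: X_{t+1} depends only on X_t
        xs.append(x)
        ys.append(rng.choice(B.shape[1], p=B[x]))   # Property 3: Y_t depends only on X_t
    return np.array(xs), np.array(ys)

X, Y = sample_sequence(A, B, mu, T=10, rng=rng)
```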
3 Factorization of the complete joint likelihood of the history up to time T:
P_T := P(X_{0:T}, Y_{0:T} | λ) = μ(X_0) ∏_{t=0}^{T} B(X_t, Y_t) ∏_{t=1}^{T} A(X_{t-1}, X_t)
(Try to find a recursive relation!)
P_T = P(X_{0:T}, Y_{0:T} | λ)
= P(X_T, Y_T | X_{0:T-1}, Y_{0:T-1}; λ) P(X_{0:T-1}, Y_{0:T-1} | λ)
= P(X_T, Y_T | X_{0:T-1}, Y_{0:T-1}; λ) P_{T-1}   (conditioning on the history up to time T-1)
= P(Y_T | X_T, X_{0:T-1}, Y_{0:T-1}; λ) P(X_T | X_{0:T-1}, Y_{0:T-1}; λ) P_{T-1}   (conditioning on X_T)
= P(Y_T | X_T; λ) P(X_T | X_{T-1}; λ) P_{T-1}   (by Properties 2 and 3)
= B(X_T, Y_T) A(X_{T-1}, X_T) P_{T-1}   (emission factor and transition factor)
Unrolling the recursion gives P_T = ∏_{t=1}^{T} B(X_t, Y_t) ∏_{t=1}^{T} A(X_{t-1}, X_t) P_0, where
P_0 = P(X_0, Y_0 | λ) = P(Y_0 | X_0; λ) P(X_0 | λ) = B(X_0, Y_0) μ(X_0),
so P_T = μ(X_0) ∏_{t=0}^{T} B(X_t, Y_t) ∏_{t=1}^{T} A(X_{t-1}, X_t).
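The factorization maps directly onto code. Below is a minimal sketch (again an illustration, not the author's implementation) that evaluates log P(X_{0:T}, Y_{0:T} | λ) for a given pair of integer-coded sequences, reusing the arrays from the sketch above.

```python
import numpy as np

def complete_log_likelihood(X, Y, A, B, mu):
    """log P(X_{0:T}, Y_{0:T} | lambda) = log mu(X_0)
       + sum_{t=0}^{T} log B(X_t, Y_t) + sum_{t=1}^{T} log A(X_{t-1}, X_t)."""
    ll = np.log(mu[X[0]])
    ll += np.sum(np.log(B[X, Y]))            # emission factors, t = 0..T
    ll += np.sum(np.log(A[X[:-1], X[1:]]))   # transition factors, t = 1..T
    return ll
```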
4 A fundamental problem: computing the data likelihood P(Y_{0:T} | λ).
Computing method: forward/backward iteration (dynamic programming).
Forward calculation: define α_t(i) = P(Y_{0:t}, X_t = i | λ). Then
α_0(i) = P(Y_0, X_0 = i | λ) = P(Y_0 | X_0 = i; λ) P(X_0 = i | λ) = B(i, Y_0) μ(i),
and for t = 1, 2, ..., T,
α_t(i) = P(Y_{0:t}, X_t = i | λ)
= Σ_j P(Y_{0:t}, X_{t-1} = j, X_t = i | λ)
= Σ_j P(Y_t, X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) P(Y_{0:t-1}, X_{t-1} = j | λ)
= Σ_j P(Y_t, X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) α_{t-1}(j)
= Σ_j P(Y_t | X_t = i, Y_{0:t-1}, X_{t-1} = j; λ) P(X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) α_{t-1}(j)
= Σ_j P(Y_t | X_t = i; λ) P(X_t = i | X_{t-1} = j; λ) α_{t-1}(j)
= Σ_j α_{t-1}(j) A(j, i) B(i, Y_t).
Finally, since α_T(i) = P(Y_{0:T}, X_T = i | λ), we have P(Y_{0:T} | λ) = Σ_i α_T(i).
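A minimal sketch of this forward recursion (Python/NumPy, not from the slides); it returns the full α table and P(Y_{0:T} | λ). No rescaling is applied, so it can underflow for long sequences.

```python
import numpy as np

def forward(Y, A, B, mu):
    """alpha[t, i] = P(Y_{0:t}, X_t = i | lambda)."""
    T = len(Y)
    alpha = np.zeros((T, A.shape[0]))
    alpha[0] = B[:, Y[0]] * mu                      # alpha_0(i) = B(i, Y_0) mu(i)
    for t in range(1, T):
        # alpha_t(i) = sum_j alpha_{t-1}(j) A(j, i) B(i, Y_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]
    return alpha, alpha[-1].sum()                   # P(Y_{0:T} | lambda) = sum_i alpha_T(i)
```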
5 Backward calculation: define β_t(i) = P(Y_{t+1:T} | X_t = i; λ). Then β_T(i) = 1, and for t = T-1, T-2, ..., 0,
β_t(i) = P(Y_{t+1:T} | X_t = i; λ)
= Σ_j P(Y_{t+1:T}, X_{t+1} = j | X_t = i; λ)
= Σ_j P(Y_{t+2:T} | Y_{t+1}, X_{t+1} = j, X_t = i; λ) P(Y_{t+1}, X_{t+1} = j | X_t = i; λ)
= Σ_j P(Y_{t+2:T} | X_{t+1} = j; λ) P(Y_{t+1} | X_{t+1} = j, X_t = i; λ) P(X_{t+1} = j | X_t = i; λ)
= Σ_j β_{t+1}(j) P(Y_{t+1} | X_{t+1} = j; λ) P(X_{t+1} = j | X_t = i; λ)
= Σ_j A(i, j) B(j, Y_{t+1}) β_{t+1}(j).
Finally,
P(Y_{0:T} | λ) = Σ_i P(Y_{0:T}, X_0 = i | λ)
= Σ_i P(Y_{1:T} | Y_0, X_0 = i; λ) P(Y_0, X_0 = i | λ)
= Σ_i P(Y_{1:T} | X_0 = i; λ) P(Y_0 | X_0 = i; λ) P(X_0 = i | λ)
= Σ_i μ(i) B(i, Y_0) β_0(i).
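A matching sketch of the backward recursion, under the same assumptions; the returned likelihood should agree with the forward value up to rounding.

```python
import numpy as np

def backward(Y, A, B, mu):
    """beta[t, i] = P(Y_{t+1:T} | X_t = i; lambda)."""
    T = len(Y)
    beta = np.zeros((T, A.shape[0]))
    beta[-1] = 1.0                                   # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j A(i, j) B(j, Y_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
    likelihood = np.sum(mu * B[:, Y[0]] * beta[0])   # = sum_i mu(i) B(i, Y_0) beta_0(i)
    return beta, likelihood
```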
6 An important posterior: P(X_t = i | Y_{0:T}; λ).
P(X_t = i | Y_{0:T}; λ) = P(X_t = i, Y_{0:T} | λ) / P(Y_{0:T} | λ)
= P(X_t = i, Y_{0:t}, Y_{t+1:T} | λ) / P(Y_{0:T} | λ)
= P(Y_{t+1:T} | X_t = i, Y_{0:t}; λ) P(X_t = i, Y_{0:t} | λ) / P(Y_{0:T} | λ)
= P(Y_{t+1:T} | X_t = i; λ) P(X_t = i, Y_{0:t} | λ) / P(Y_{0:T} | λ)
= α_t(i) β_t(i) / Σ_j α_T(j).
Now, consider the inverse problem of inferring λ given Y_{0:T}. An annoying problem is that we don't know the dimensions of A, B, and μ.
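Combining the two recursions gives this posterior in one line; a short sketch reusing the forward and backward functions from the sketches above:

```python
def state_posteriors(Y, A, B, mu):
    """gamma[t, i] = P(X_t = i | Y_{0:T}; lambda) = alpha_t(i) beta_t(i) / sum_j alpha_T(j)."""
    alpha, likelihood = forward(Y, A, B, mu)    # forward sketch above
    beta, _ = backward(Y, A, B, mu)             # backward sketch above
    return alpha * beta / likelihood
```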
7 Goal: infer the number of distinct hidden state clusters given the observation data Y_{0:T}^{(1:N)}.
Note that for different n and n', Y_{0:T}^{(n)} and Y_{0:T}^{(n')} are independent. However, for any n, there is sequential dependence within Y_{0:T}^{(n)}.
Inference method: Gibbs sampling + stick-breaking construction + Dirichlet distribution.
Steps (a sketch of one Gibbs sweep follows this slide):
Step 1. Select an initial estimate of |S| (hence S = {1, 2, ..., |S|}). Select an initialization of λ = (A, B, μ): use uniform initialization for A and μ, and initialize B by drawing from a Dirichlet distribution parameterized by the empirical distribution of Y_t.
Step 2. For each n = 1, 2, ..., N, draw a sequence of state clusters X_{0:T}^{(n)} from the posteriors P(X_t^{(n)} = i | Y_{0:T}^{(n)}; λ), t = 0, 1, 2, ..., T.
Step 3. For each n = 1, 2, ..., N, compute the count statistics #(i → j)^{(n)}, #(i → k)^{(n)}, and #(X_0 = i)^{(n)}. Compute the state cluster occupation based on the newly drawn X_{0:T}^{(1:N)}. This yields the new estimate of |S|. Relabel the newly drawn X_{0:T}^{(1:N)} according to the occupation if necessary.
Step 4. Based on the count statistics obtained in Step 3, draw A and μ via the stick-breaking process, and draw B from the Dirichlet conditional posteriors. Go to Step 2.
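The sketch below illustrates Steps 2-3 of one sweep under the sampling scheme described on this slide (each X_t drawn independently from its marginal posterior, as stated, rather than a joint draw of the whole path); the function and variable names are assumptions, not the author's code.

```python
import numpy as np

def gibbs_sweep(sequences, A, B, mu, rng):
    """One pass of Steps 2-3: for each sequence, draw X_t from the per-time posteriors
    P(X_t = i | Y_{0:T}; lambda) and accumulate the count statistics."""
    S, K = B.shape
    trans_counts = np.zeros((S, S))      # #(i -> j), summed over n
    emit_counts = np.zeros((S, K))       # #(i -> k), summed over n
    init_counts = np.zeros(S)            # #(X_0 = i), summed over n
    occupied = np.zeros(S, dtype=bool)
    for Y in sequences:
        gamma = state_posteriors(Y, A, B, mu)                 # posterior sketch above
        # Step 2 as described on the slide: draw each X_t from its marginal posterior.
        X = np.array([rng.choice(S, p=g / g.sum()) for g in gamma])
        occupied |= np.bincount(X, minlength=S) > 0
        init_counts[X[0]] += 1
        np.add.at(trans_counts, (X[:-1], X[1:]), 1)
        np.add.at(emit_counts, (X, Y), 1)
    # Step 3: the occupation vector gives the new estimate of |S|.
    return trans_counts, emit_counts, init_counts, int(occupied.sum())
```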
8 Posteriors involved in the stick-breaking process:
α_i^{(A)} ~ Gamma(c_1 + 1, d_1 − Σ_j log(1 − V_{ij}^{(A)}))
V_{ij}^{(A)} ~ Beta(1 + Σ_n #(i → j)^{(n)}, α_i^{(A)} + Σ_n #(i → j' > j)^{(n)}),
a_{i1} = V_{i1}^{(A)}, a_{ij} = V_{ij}^{(A)} ∏_{j' < j} (1 − V_{ij'}^{(A)})
α^{(μ)} ~ Gamma(c_2 + 1, d_2 − Σ_j log(1 − V_j^{(μ)}))
V_j^{(μ)} ~ Beta(1 + Σ_n #(X_0 = j)^{(n)}, α^{(μ)} + Σ_n #(X_0 > j)^{(n)}),
μ_1 = V_1^{(μ)}, μ_j = V_j^{(μ)} ∏_{j' < j} (1 − V_{j'}^{(μ)})
Conditional Dirichlet posteriors:
(b_{i1}, b_{i2}, ..., b_{i|O|}) ~ Dirichlet(β_1^{(B)} + Σ_n #(i → 1)^{(n)}, β_2^{(B)} + Σ_n #(i → 2)^{(n)}, ..., β_{|O|}^{(B)} + Σ_n #(i → |O|)^{(n)})
Empirical hyper-parameters: c_1 = 10^{-6}, d_1 = 0.1, c_2 = 0.01, d_2 = 0.01; all β_k's are determined by the empirical distribution of Y_t.
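As an illustration of how these posteriors could be sampled, here is a hedged sketch of one coupled update for a single row of A (draw α_i given the previous sticks, then the sticks given α_i and the counts) together with the Dirichlet draws for B. The truncation at the current number of clusters and the numerical clipping are my assumptions, and under truncation the resulting row sums to one only approximately.

```python
import numpy as np

def draw_stick_row(counts_row, V_prev, c1=1e-6, d1=0.1, rng=None):
    """One coupled update of the stick-breaking posteriors for a row A(i, .):
    alpha_i ~ Gamma(c1 + 1, d1 - sum_j log(1 - V_ij)) given the previous sticks,
    V_ij    ~ Beta(1 + #(i->j), alpha_i + #(i -> j' > j)) given alpha_i and the counts,
    a_ij    = V_ij * prod_{j' < j} (1 - V_ij')."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(counts_row, dtype=float)
    tail = counts[::-1].cumsum()[::-1] - counts                # #(i -> j' > j) for each j
    rate = d1 - np.sum(np.log1p(-np.clip(V_prev, None, 1 - 1e-12)))
    alpha_i = rng.gamma(shape=c1 + 1.0, scale=1.0 / rate)      # NumPy gamma uses scale = 1/rate
    V = rng.beta(1.0 + counts, alpha_i + tail)
    sticks = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * sticks, V                                       # row of A, and sticks for next sweep

def draw_emission_rows(emit_counts, beta_B, rng):
    """Draw each row B(i, .) from its Dirichlet conditional posterior
    Dirichlet(beta_k^(B) + #(i -> k))."""
    return np.vstack([rng.dirichlet(beta_B + row) for row in emit_counts])
```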
9 A toy ground truth (randomly generated): A is 10×10, B is 10×4, μ is 1×10.
[The slide shows the entries of A, B, and μ.]
Simulated toy data: Y_{0:39}^{(1:100)}
10 Estimation Results
[Figure: estimates of |S| versus the number of Gibbs iterations, starting from 6 different initial estimates (up to 60); ground truth |S| = 10.]
11 An extension of the Hidden Markov Model
S: set of state clusters, S = {i}
A: set of behaviors, A = {a}
O: set of inputs, O = {o}
z_t: time series of state clusters
a_t: time series of behaviors
o_t: time series of inputs
W: set of state transitions, W = {W(i, a, o, j)}, where W(i, a, o, j) = P(z_{t+1} = j | z_t = i, a_t = a, o_{t+1} = o), t = 0, 1, 2, ...
π: set of emissions, π = {π(i, a)}, where π(i, a) = P(a_t = a | z_t = i), t = 0, 1, 2, ...
μ: initial state cluster distribution, μ = {μ(i)}, where μ(i) = P(z_0 = i)
Graphic illustration: a chain z_0 → z_1 → z_2 → z_3 → ... (t = 0, 1, 2, 3, ...) with initial factor μ(z_0), emission factors π(z_0, a_0), π(z_1, a_1), π(z_2, a_2), π(z_3, a_3), ..., and transition factors W(z_0, a_0, o_1, z_1), W(z_1, a_1, o_2, z_2), W(z_2, a_2, o_3, z_3), ..., so that each transition is driven by the previous behavior a_t and the next input o_{t+1}.
12 Key properties
Property 1: stationarity
Property 2 (implicit): P(z_{t+1}, a_{t+1} | z_{0:t}, a_{0:t}, o_{1:t+1}) = P(z_{t+1}, a_{t+1} | z_t, a_t, o_{t+1})
Property 3 (implicit): P(z_{0:t}, a_{0:t} | o_{1:t+1}) = P(z_{0:t}, a_{0:t} | o_{1:t})
Property 4 (implicit): P(a_{t+1} | z_{t+1}, z_t, a_t, o_{t+1}) = P(a_{t+1} | z_{t+1})
Compact notation: θ = (W, π, μ), where W is a tensor, π is a matrix, and μ is a vector.
Factorization of the complete joint likelihood of the history given the input data up to time T:
P(z_{0:T}, a_{0:T} | o_{1:T}; θ) = μ(z_0) ∏_{t=0}^{T} π(z_t, a_t) ∏_{t=1}^{T} W(z_{t-1}, a_{t-1}, o_t, z_t)
13 P(z_{0:T}, a_{0:T} | o_{1:T}; θ) = P_T   (Try to find a recursive relation!)
P_T = P(z_T, a_T | z_{0:T-1}, a_{0:T-1}, o_{1:T}; θ) P(z_{0:T-1}, a_{0:T-1} | o_{1:T}; θ)   (conditioning on the history up to time T-1)
= P(z_T, a_T | z_{T-1}, a_{T-1}, o_T; θ) P(z_{0:T-1}, a_{0:T-1} | o_{1:T-1}; θ)
(By Property 2, z_T, a_T are independent of z_{0:T-2}, a_{0:T-2}, o_{1:T-1} when z_{T-1}, a_{T-1}, o_T are given; by Property 3, z_{0:T-1}, a_{0:T-1} are independent of o_T, that is, the history does not depend on the future!)
= P(z_T, a_T | z_{T-1}, a_{T-1}, o_T; θ) P_{T-1}
= P(a_T | z_T, z_{T-1}, a_{T-1}, o_T; θ) P(z_T | z_{T-1}, a_{T-1}, o_T; θ) P_{T-1}   (conditioning on z_T)
= P(a_T | z_T, z_{T-1}, a_{T-1}, o_T; θ) W(z_{T-1}, a_{T-1}, o_T, z_T) P_{T-1}   (transition factor)
= P(a_T | z_T; θ) W(z_{T-1}, a_{T-1}, o_T, z_T) P_{T-1}   (by Property 4, a_T is independent of z_{T-1}, a_{T-1}, o_T given z_T)
= π(z_T, a_T) W(z_{T-1}, a_{T-1}, o_T, z_T) P_{T-1}   (emission factor)
Unrolling the recursion gives P_T = ∏_{t=2}^{T} π(z_t, a_t) ∏_{t=2}^{T} W(z_{t-1}, a_{t-1}, o_t, z_t) P_1, where
P_1 = P(z_{0:1}, a_{0:1} | o_1; θ) = P(z_1, a_1 | z_0, a_0, o_1; θ) P(z_0, a_0 | o_1; θ)
= π(z_1, a_1) W(z_0, a_0, o_1, z_1) P(z_0, a_0 | θ)
= π(z_1, a_1) W(z_0, a_0, o_1, z_1) P(a_0 | z_0; θ) P(z_0 | θ)
= π(z_1, a_1) W(z_0, a_0, o_1, z_1) π(z_0, a_0) μ(z_0),
so P_T = μ(z_0) ∏_{t=0}^{T} π(z_t, a_t) ∏_{t=1}^{T} W(z_{t-1}, a_{t-1}, o_t, z_t).
14 An important posterior: P(z_t = i | a_{0:T}, o_{1:T}; θ).
Similar to (but different from) the hidden Markov model, we define
α_0(i) = P(z_0 = i, a_0 | o_{1:T}; θ) = P(z_0 = i, a_0 | θ) = P(a_0 | z_0 = i; θ) P(z_0 = i | θ) = π(i, a_0) μ(i).
For t = 1, 2, ..., T,
α_t(i) = P(z_t = i, a_{0:t} | o_{1:T}; θ) = P(z_t = i, a_{0:t} | o_{1:t}; θ)
= Σ_j P(z_{t-1} = j, z_t = i, a_{0:t} | o_{1:t}; θ)
= Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{0:t-1}, o_{1:t}; θ) P(z_{t-1} = j, a_{0:t-1} | o_{1:t}; θ)
= Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{t-1}, o_t; θ) P(z_{t-1} = j, a_{0:t-1} | o_{1:t-1}; θ)
= Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
= Σ_j P(a_t | z_t = i, z_{t-1} = j, a_{t-1}, o_t; θ) P(z_t = i | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
= Σ_j P(a_t | z_t = i; θ) P(z_t = i | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
= Σ_j α_{t-1}(j) W(j, a_{t-1}, o_t, i) π(i, a_t).
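A minimal sketch of this forward recursion for the extension (Python/NumPy, an illustration rather than the author's code), assuming W is stored as a 4-D array indexed W[i, a, o, j] and that o_seq[0] holds o_1.

```python
import numpy as np

def forward_ext(a_seq, o_seq, W, pi, mu):
    """alpha[t, i] = P(z_t = i, a_{0:t} | o_{1:t}; theta).
    a_seq has length T+1 (a_0..a_T); o_seq has length T (o_1..o_T)."""
    T = len(a_seq)
    alpha = np.zeros((T, len(mu)))
    alpha[0] = pi[:, a_seq[0]] * mu                    # alpha_0(i) = pi(i, a_0) mu(i)
    for t in range(1, T):
        trans = W[:, a_seq[t - 1], o_seq[t - 1], :]    # trans[j, i] = W(j, a_{t-1}, o_t, i)
        # alpha_t(i) = sum_j alpha_{t-1}(j) W(j, a_{t-1}, o_t, i) pi(i, a_t)
        alpha[t] = (alpha[t - 1] @ trans) * pi[:, a_seq[t]]
    return alpha, alpha[-1].sum()                      # P(a_{0:T} | o_{1:T}; theta)
```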
15 Since α_T(i) = P(z_T = i, a_{0:T} | o_{1:T}; θ), we have P(a_{0:T} | o_{1:T}; θ) = Σ_i P(z_T = i, a_{0:T} | o_{1:T}; θ) = Σ_i α_T(i); this is the data likelihood.
We also define β_T(i) = 1, and for t = T-1, T-2, ..., 0,
β_t(i) = P(a_{t+1:T} | z_t = i, a_t, o_{1:T}; θ) = P(a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ)
= Σ_j P(z_{t+1} = j, a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ)
= Σ_j P(a_{t+2:T} | z_{t+1} = j, z_t = i, a_{t+1}, a_t, o_{t+2:T}; θ) P(z_{t+1} = j, a_{t+1} | z_t = i, a_t, o_{t+1:T}; θ)
= Σ_j β_{t+1}(j) P(z_{t+1} = j, a_{t+1} | z_t = i, a_t, o_{t+1:T}; θ)
= Σ_j β_{t+1}(j) P(a_{t+1} | z_{t+1} = j, z_t = i, a_t, o_{t+1:T}; θ) P(z_{t+1} = j | z_t = i, a_t, o_{t+1:T}; θ)
= Σ_j β_{t+1}(j) P(a_{t+1} | z_{t+1} = j; θ) P(z_{t+1} = j | z_t = i, a_t, o_{t+1}; θ)
= Σ_j W(i, a_t, o_{t+1}, j) π(j, a_{t+1}) β_{t+1}(j).
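The corresponding backward recursion, under the same storage assumptions as the forward sketch above:

```python
import numpy as np

def backward_ext(a_seq, o_seq, W, pi):
    """beta[t, i] = P(a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; theta)."""
    T = len(a_seq)
    beta = np.zeros((T, pi.shape[0]))
    beta[-1] = 1.0                                       # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        trans = W[:, a_seq[t], o_seq[t], :]              # trans[i, j] = W(i, a_t, o_{t+1}, j)
        # beta_t(i) = sum_j W(i, a_t, o_{t+1}, j) pi(j, a_{t+1}) beta_{t+1}(j)
        beta[t] = trans @ (pi[:, a_seq[t + 1]] * beta[t + 1])
    return beta
```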
16 Finally,
P(z_t = i | a_{0:T}, o_{1:T}; θ) = P(z_t = i, a_{0:T} | o_{1:T}; θ) / P(a_{0:T} | o_{1:T}; θ)
= P(z_t = i, a_{0:t}, a_{t+1:T} | o_{1:T}; θ) / Σ_j P(z_T = j, a_{0:T} | o_{1:T}; θ)
= P(z_t = i, a_{0:t}, a_{t+1:T} | o_{1:T}; θ) / Σ_j α_T(j)
= P(a_{t+1:T} | z_t = i, a_{0:t}, o_{1:T}; θ) P(z_t = i, a_{0:t} | o_{1:T}; θ) / Σ_j α_T(j)
= P(a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ) P(z_t = i, a_{0:t} | o_{1:t}; θ) / Σ_j α_T(j)
= β_t(i) α_t(i) / Σ_j α_T(j).
Same form as that in the HMM!
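And the posterior itself, as a one-line sketch reusing the two extension recursions above:

```python
def state_posteriors_ext(a_seq, o_seq, W, pi, mu):
    """gamma[t, i] = P(z_t = i | a_{0:T}, o_{1:T}; theta) = alpha_t(i) beta_t(i) / sum_j alpha_T(j)."""
    alpha, likelihood = forward_ext(a_seq, o_seq, W, pi, mu)   # forward sketch above
    beta = backward_ext(a_seq, o_seq, W, pi)                   # backward sketch above
    return alpha * beta / likelihood
```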
17 Goal: infer the number of distinct hidden state clusters given the observation data a_{0:T}^{(1:N)} and the input data o_{1:T}^{(1:N)}.
Inference method: Gibbs sampling + stick-breaking construction + Dirichlet distribution.
Steps:
Step 1. Select an initial estimate of |S| (hence S = {1, 2, ..., |S|}). Select an initialization of θ = (W, π, μ): use uniform initialization for W and μ, and initialize π by drawing from a Dirichlet distribution parameterized by the empirical distribution of a_t.
Step 2. For each n = 1, 2, ..., N, draw a sequence of states z_{0:T}^{(n)} from the conditional posteriors P(z_t^{(n)} = i | a_{0:T}^{(n)}, o_{1:T}^{(n)}; θ), t = 0, 1, 2, ..., T.
Step 3. For each n = 1, 2, ..., N, compute the count statistics #(i → j | a, o)^{(n)}, #(i → a)^{(n)}, and #(z_0 = i)^{(n)}. Compute the state cluster occupation based on the newly drawn z_{0:T}^{(1:N)}. This yields the new estimate of |S|. Relabel the newly drawn z_{0:T}^{(1:N)} according to the occupation if necessary.
Step 4. Based on the count statistics obtained in Step 3, draw W and μ via the stick-breaking process, and draw π from the Dirichlet conditional posteriors. Go to Step 2.
18 Posteriors involved in the stick-breaking process:
α_i^{(W, a, o)} ~ Gamma(c_1 + 1, d_1 − Σ_j log(1 − V_{ij}^{(W, a, o)}))
V_{ij}^{(W, a, o)} ~ Beta(1 + Σ_n #(i → j | a, o)^{(n)}, α_i^{(W, a, o)} + Σ_n #(i → j' > j | a, o)^{(n)}),
w_{i1}(a, o) = V_{i1}^{(W, a, o)}, w_{ij}(a, o) = V_{ij}^{(W, a, o)} ∏_{j' < j} (1 − V_{ij'}^{(W, a, o)})
α^{(μ)} ~ Gamma(c_2 + 1, d_2 − Σ_j log(1 − V_j^{(μ)}))
V_j^{(μ)} ~ Beta(1 + Σ_n #(z_0 = j)^{(n)}, α^{(μ)} + Σ_n #(z_0 > j)^{(n)}),
μ_1 = V_1^{(μ)}, μ_j = V_j^{(μ)} ∏_{j' < j} (1 − V_{j'}^{(μ)})
Conditional Dirichlet posteriors:
(π_{i1}, π_{i2}, ..., π_{i|A|}) ~ Dirichlet(β_1^{(π)} + Σ_n #(i → 1)^{(n)}, β_2^{(π)} + Σ_n #(i → 2)^{(n)}, ..., β_{|A|}^{(π)} + Σ_n #(i → |A|)^{(n)})
Empirical hyper-parameters: c_1 = 10^{-6}, d_1 = 0.1, c_2 = 0.01, d_2 = 0.01; all β_k's are determined by the empirical distribution of a_t.
19 A toy ground truth (randomly generated): W is a 4×15 collection of 10×10 transition matrices (one per behavior-input pair), π is 10×4, μ is 1×10 (|A| = 4, |O| = 15, |S| = 10).
[The slide shows the MATLAB display of the cell array W_a_o, each cell a 10x10 double matrix, and the entries of W_a_o{1,1}.]
20 A toy ground truth (randomly generated), continued.
[The slide shows the entries of the transition matrices W_a_o{3,5} and W_a_o{4,10}.]
22 A toy ground truth (randomly generated), continued.
[The slide shows the entries of π (P_i) and μ (m_u), and of p_o, the discrete distribution over the 15 input symbols used to simulate the input data.]
23 Simulated toy data: o_{1:19}^{(1:100)}, a_{0:19}^{(1:100)}
[Figure: estimates of |S| versus the number of Gibbs iterations, starting from 6 different initial estimates (up to 60); ground truth |S| = 10.]
24 Now, with the same toy ground truth, change the mechanism of generating o_{1:T}^{(1:N)}; that is, don't just use a single multinomial distribution.
Simulated toy data: o_{1:39}^{(1:100)}, a_{0:39}^{(1:100)}
[Figure: estimates of |S| versus the number of Gibbs iterations, starting from 6 different initial estimates (up to 60); ground truth |S| = 10.]
25 Questions? Thanks