Inferring the Number of State Clusters in the Hidden Markov Model and Its Extension


Inferring the Number of State Clusters in the Hidden Markov Model and Its Extension
Xugang Ye, Department of Applied Mathematics and Statistics, Johns Hopkins University

Elements of a Hidden Markov Model (HMM)

S: set of state clusters, S = {i}
O: set of observations, O = {k}
X_t: time series of state clusters
Y_t: time series of observations
A: set of state transitions, A = {A(i, j)}, where A(i, j) = P(X_{t+1} = j | X_t = i), t = 0, 1, 2, ...
B: set of emissions, B = {B(i, k)}, where B(i, k) = P(Y_t = k | X_t = i), t = 0, 1, 2, ...
μ: initial state cluster distribution, μ = {μ(i)}, where μ(i) = P(X_0 = i)

Graphical illustration: a chain X_0 → X_1 → X_2 → X_3 → ... → X_t → X_{t+1}, with each X_t emitting Y_t.

Property 1: stationarity
Property 2 (implicit): P(X_{t+1} | X_{0:t}) = P(X_{t+1} | X_t)
Property 3 (implicit): P(Y_{t+1} | Y_{0:t}, X_{0:t+1}) = P(Y_{t+1} | X_{t+1})

Compact notation: λ = (A, B, μ)

Factorization of the complete joint likelihood of history up to time T:

P(X_{0:T}, Y_{0:T} | λ) = μ(X_0) ∏_{t=0}^{T} B(X_t, Y_t) ∏_{t=1}^{T} A(X_{t-1}, X_t)

(Try to find a recursive relation!) Define P_T = P(X_{0:T}, Y_{0:T} | λ). Then

P_T = P(X_T, Y_T | X_{0:T-1}, Y_{0:T-1}; λ) P(X_{0:T-1}, Y_{0:T-1} | λ)
    = P(X_T, Y_T | X_{0:T-1}, Y_{0:T-1}; λ) P_{T-1}   (conditioning on history up to time T-1)
    = P(Y_T | X_T, X_{0:T-1}, Y_{0:T-1}; λ) P(X_T | X_{0:T-1}, Y_{0:T-1}; λ) P_{T-1}   (conditioning on X_T)
    = P(Y_T | X_T; λ) P(X_T | X_{T-1}; λ) P_{T-1}   (by Properties 2 and 3)
    = B(X_T, Y_T) A(X_{T-1}, X_T) P_{T-1}   (transition factor and emission factor)

with base case

P_0 = P(X_0, Y_0 | λ) = P(Y_0 | X_0; λ) P(X_0 | λ) = B(X_0, Y_0) μ(X_0).

Unrolling the recursion down to P_0 gives the product formula above.

A fundamental problem: computing the data likelihood P(Y_{0:T} | λ)

Computing method: forward/backward iteration (dynamic programming)

Forward calculation: define α_t(i) = P(Y_{0:t}, X_t = i | λ). Then

α_0(i) = P(Y_0, X_0 = i | λ) = P(Y_0 | X_0 = i; λ) P(X_0 = i | λ) = B(i, Y_0) μ(i)

and for t = 1, 2, ..., T:

α_t(i) = P(Y_{0:t}, X_t = i | λ)
       = Σ_j P(Y_{0:t}, X_{t-1} = j, X_t = i | λ)
       = Σ_j P(Y_t, X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) P(Y_{0:t-1}, X_{t-1} = j | λ)
       = Σ_j P(Y_t, X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) α_{t-1}(j)
       = Σ_j P(Y_t | X_t = i, Y_{0:t-1}, X_{t-1} = j; λ) P(X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) α_{t-1}(j)
       = Σ_j P(Y_t | X_t = i; λ) P(X_t = i | X_{t-1} = j; λ) α_{t-1}(j)
       = Σ_j α_{t-1}(j) A(j, i) B(i, Y_t)

Finally, since α_T(i) = P(Y_{0:T}, X_T = i | λ), we have P(Y_{0:T} | λ) = Σ_i α_T(i).
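The forward recursion above can be sketched in Python with NumPy; the 2-state, 2-symbol model below is a hypothetical example chosen for illustration, not the toy model used later in these slides:

```python
import numpy as np

def forward(A, B, mu, Y):
    """Forward pass: alpha[t, i] = P(Y_0:t, X_t = i | lambda)."""
    T, S = len(Y), len(mu)
    alpha = np.zeros((T, S))
    alpha[0] = B[:, Y[0]] * mu                  # alpha_0(i) = B(i, Y_0) mu(i)
    for t in range(1, T):
        # alpha_t(i) = sum_j alpha_{t-1}(j) A(j, i) B(i, Y_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]
    return alpha

# Hypothetical 2-state, 2-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.5, 0.5])
Y = [0, 1, 0]
alpha = forward(A, B, mu, Y)
likelihood = alpha[-1].sum()                    # P(Y_0:T | lambda) = sum_i alpha_T(i)
```

The recursion costs O(T |S|^2), versus O(|S|^{T+1}) for summing over all state paths directly.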

Backward calculation: define β_t(i) = P(Y_{t+1:T} | X_t = i; λ). Then β_T(i) = 1 and, for t = T-1, T-2, ..., 0:

β_t(i) = P(Y_{t+1:T} | X_t = i; λ)
       = Σ_j P(Y_{t+1:T}, X_{t+1} = j | X_t = i; λ)
       = Σ_j P(Y_{t+2:T} | Y_{t+1}, X_{t+1} = j, X_t = i; λ) P(Y_{t+1}, X_{t+1} = j | X_t = i; λ)
       = Σ_j P(Y_{t+2:T} | X_{t+1} = j; λ) P(Y_{t+1} | X_{t+1} = j, X_t = i; λ) P(X_{t+1} = j | X_t = i; λ)
       = Σ_j β_{t+1}(j) P(Y_{t+1} | X_{t+1} = j; λ) P(X_{t+1} = j | X_t = i; λ)
       = Σ_j A(i, j) B(j, Y_{t+1}) β_{t+1}(j)

Finally,

P(Y_{0:T} | λ) = Σ_i P(Y_{0:T}, X_0 = i | λ)
             = Σ_i P(Y_{1:T} | Y_0, X_0 = i; λ) P(Y_0, X_0 = i | λ)
             = Σ_i P(Y_{1:T} | X_0 = i; λ) P(Y_0 | X_0 = i; λ) P(X_0 = i | λ)
             = Σ_i μ(i) B(i, Y_0) β_0(i)
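The backward recursion admits the same kind of sketch; the small model is again a hypothetical 2-state example, and the final line checks the likelihood formula Σ_i μ(i) B(i, Y_0) β_0(i):

```python
import numpy as np

def backward(A, B, Y):
    """Backward pass: beta[t, i] = P(Y_{t+1:T} | X_t = i; lambda)."""
    T, S = len(Y), A.shape[0]
    beta = np.ones((T, S))                      # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j A(i, j) B(j, Y_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
    return beta

# Hypothetical 2-state, 2-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.5, 0.5])
Y = [0, 1, 0]
beta = backward(A, B, Y)
likelihood = (mu * B[:, Y[0]] * beta[0]).sum()  # sum_i mu(i) B(i, Y_0) beta_0(i)
```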

An important posterior: P(X_t = i | Y_{0:T}; λ)

P(X_t = i | Y_{0:T}; λ) = P(X_t = i, Y_{0:T} | λ) / P(Y_{0:T} | λ)
 = P(X_t = i, Y_{0:t}, Y_{t+1:T} | λ) / P(Y_{0:T} | λ)
 = P(Y_{t+1:T} | X_t = i, Y_{0:t}; λ) P(X_t = i, Y_{0:t} | λ) / P(Y_{0:T} | λ)
 = P(Y_{t+1:T} | X_t = i; λ) P(X_t = i, Y_{0:t} | λ) / P(Y_{0:T} | λ)
 = α_t(i) β_t(i) / Σ_j α_T(j)

Now consider the inverse problem of inferring λ given Y_{0:T}. An annoying complication is that we do not know the dimensions of A, B, and μ.
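Combining the two passes gives the posterior marginals γ_t(i) = α_t(i) β_t(i) / Σ_j α_T(j); a compact sketch on the same hypothetical 2-state model:

```python
import numpy as np

# Hypothetical 2-state, 2-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.5, 0.5])
Y = [0, 1, 0]
T, S = len(Y), len(mu)

# Forward and backward passes
alpha = np.zeros((T, S))
beta = np.ones((T, S))
alpha[0] = B[:, Y[0]] * mu
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])

# gamma[t, i] = P(X_t = i | Y_0:T; lambda) = alpha_t(i) beta_t(i) / sum_j alpha_T(j)
gamma = alpha * beta / alpha[-1].sum()
```

Each row of gamma is a probability distribution over the states at that time step, which makes it directly usable for the sampling step of the inference procedure on the next slide.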

Goal: Infer the number of distinct hidden state clusters given the observation data Y_{0:T}^{(1:N)}.

Note that for different n and n', Y_{0:T}^{(n)} and Y_{0:T}^{(n')} are independent. However, for any n, there is sequential dependence within Y_{0:T}^{(n)}.

Inference method: Gibbs sampling + stick-breaking construction + Dirichlet distributions

Steps:

Step 1. Select an initial estimate of |S| (hence S = {1, 2, ..., |S|}). Select an initialization of λ = (A, B, μ): use uniform initialization for A and μ, and initialize B by drawing from a Dirichlet distribution parameterized by the empirical distribution of Y_t.

Step 2. For each n = 1, 2, ..., N, draw a sequence of state clusters X_{0:T}^{(n)} from the posteriors P(X_t^{(n)} = i | Y_{0:T}^{(n)}; λ), t = 0, 1, 2, ..., T.

Step 3. For each n = 1, 2, ..., N, compute the count statistics #(i → j)^{(n)}, #(i → k)^{(n)}, and #(X_0 = i)^{(n)}. Compute the state-cluster occupation based on the newly drawn X_{0:T}^{(1:N)}; this yields the new estimate of |S|. Relabel the newly drawn X_{0:T}^{(1:N)} according to the occupation if necessary.

Step 4. Based on the count statistics obtained in Step 3, draw A and μ via the stick-breaking process, and draw B from the Dirichlet conditional posteriors. Go to Step 2.
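Steps 2 and 3 for a single sequence can be sketched as follows. This is a minimal illustration on a hypothetical 2-state model; note that, as on the previous slide, each X_t is drawn from its marginal posterior P(X_t = i | Y_{0:T}; λ) = γ_t(i) rather than from the joint posterior over the whole path:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-symbol model (for illustration only)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.5, 0.5])
Y = [0, 1, 1, 0]
T, S = len(Y), len(mu)

# Forward-backward to get the per-time marginal posteriors gamma_t(i)
alpha = np.zeros((T, S))
beta = np.ones((T, S))
alpha[0] = B[:, Y[0]] * mu
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
gamma = alpha * beta / alpha[-1].sum()

# Step 2: draw each X_t from its marginal posterior gamma_t(.)
X = np.array([rng.choice(S, p=gamma[t]) for t in range(T)])

# Step 3: count statistics from the drawn sequence
trans_counts = np.zeros((S, S), dtype=int)        # #(i -> j)
emit_counts = np.zeros((S, B.shape[1]), dtype=int)  # #(i -> k)
for t in range(T):
    emit_counts[X[t], Y[t]] += 1
    if t > 0:
        trans_counts[X[t - 1], X[t]] += 1
```

In the full procedure these counts would be accumulated over all N sequences before Step 4 resamples the parameters.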

Posteriors involved in the stick-breaking process:

α_i^(A) ~ Gamma(c_1 + 1, d_1 - Σ_j log(1 - V_ij^(A)))
V_ij^(A) ~ Beta(1 + Σ_n #(i → j)^(n), α_i^(A) + Σ_n #(i → >j)^(n)),  with a_i1 = V_i1^(A) and a_ij = V_ij^(A) ∏_{j'<j} (1 - V_ij'^(A))

α^(μ) ~ Gamma(c_2 + 1, d_2 - Σ_j log(1 - V_j^(μ)))
V_j^(μ) ~ Beta(1 + Σ_n #(X_0 = j)^(n), α^(μ) + Σ_n #(X_0 > j)^(n)),  with μ_1 = V_1^(μ) and μ_j = V_j^(μ) ∏_{j'<j} (1 - V_j'^(μ))

Conditional Dirichlet posteriors:

(b_i1, b_i2, ..., b_i|O|) ~ Dirichlet(β_1^(B) + Σ_n #(i → 1)^(n), β_2^(B) + Σ_n #(i → 2)^(n), ..., β_|O|^(B) + Σ_n #(i → |O|)^(n))

Empirical hyper-parameters: c_1 = 10^-6, d_1 = 0.1, c_2 = 0.01, d_2 = 0.01; all β_k^(B) are determined by the empirical distribution of Y_t.
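Drawing one transition row A(i, ·) from the stick-breaking posterior can be sketched as below. This is an assumption-laden illustration: the concentration α_i^(A) is held fixed rather than resampled from its Gamma posterior, and the stick is truncated at the current |S| and renormalized (the slide leaves the handling of the leftover stick mass implicit):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_transition_row(counts, alpha):
    """Draw one row A(i, .) via truncated stick-breaking.

    counts[j] = #(i -> j), the observed i -> j transition counts;
    alpha is the (fixed) concentration parameter alpha_i^(A).
    V_j ~ Beta(1 + #(i -> j), alpha + #(i -> >j)),
    a_ij = V_j * prod_{j' < j} (1 - V_{j'}).
    """
    # tail[j] = #(i -> >j) = sum of counts strictly after position j
    tail = np.concatenate([np.cumsum(counts[::-1])[::-1][1:], [0.0]])
    V = rng.beta(1.0 + counts, alpha + tail)
    stick = np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    row = V * stick
    return row / row.sum()          # renormalize the truncated stick

counts = np.array([5.0, 2.0, 0.0, 1.0])   # hypothetical counts for one state i
row = draw_transition_row(counts, alpha=0.5)
```

States with many observed transitions receive large Beta means, so heavily used clusters keep most of the stick, while unused clusters shrink toward zero; that shrinkage is what drives the estimate of |S|.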

A toy ground truth (randomly generated): A: 10 × 10, B: 10 × 4, μ: 1 × 10

A =
0.19691 0.12754 0.011997 0.0031654 0.17369 0.040087 0.10291 0.15069 0.16472 0.028292
0.046207 0.15832 0.070542 0.14929 0.0039261 0.13638 0.17987 0.06183 0.19128 0.0023503
0.088622 0.13462 0.11875 0.065001 0.099492 0.044215 0.11999 0.12245 0.076318 0.13054
0.090343 0.13723 0.0018332 0.17322 0.070545 0.1007 0.11989 0.1056 0.16362 0.037019
0.20655 0.040848 0.032187 0.10799 0.19276 0.034963 0.18956 0.08584 0.040081 0.069226
0.12714 0.067684 0.033828 0.069844 0.083885 0.11643 0.11015 0.11724 0.16345 0.11035
0.091861 0.18826 0.039991 0.1703 0.14278 0.076145 0.068819 0.10999 0.054627 0.057235
0.0038474 0.19065 0.12554 0.10919 0.089178 0.17882 0.060241 0.092502 0.052466 0.097564
0.16967 0.084748 0.056225 0.04186 0.062924 0.17634 0.070479 0.14347 0.1809 0.013382
0.075713 0.15215 0.033849 0.11443 0.032289 0.10106 0.090929 0.10578 0.12553 0.16827

B =
0.41007 0.14711 0.29227 0.15055
0.24175 0.21682 0.17411 0.36733
0.20676 0.31418 0.35069 0.12836
0.16782 0.34215 0.0075428 0.48248
0.18124 0.19304 0.32151 0.30421
0.10381 0.26088 0.44604 0.18927
0.18651 0.25548 0.31849 0.23951
0.40523 0.031541 0.42042 0.1428
0.26343 0.29974 0.2181 0.21873
0.30178 0.023684 0.23478 0.43976

μ =
0.13981 0.04349 0.17171 0.12865 0.02737 0.04238 0.12423 0.12888 0.0758 0.11768

Simulated toy data: Y_{0:39}^{(1:100)}

Estimation Results

[Figure: "Estimation of |S|": trajectories of the estimated |S| over 1000 Gibbs iterations, started from 6 different initial estimates; ground truth |S| = 10. x-axis: number of Gibbs iterations (0 to 1000); y-axis: |S| (0 to 60).]

An extension of the Hidden Markov Model

S: set of state clusters, S = {i}
A: set of behaviors, A = {a}
O: set of inputs, O = {o}
z_t: time series of state clusters
a_t: time series of behaviors
o_t: time series of inputs
W: set of state transitions, W = {W(i, a, o, j)}, where W(i, a, o, j) = P(z_{t+1} = j | z_t = i, a_t = a, o_{t+1} = o)
π: set of emissions, π = {π(i, a)}, where π(i, a) = P(a_t = a | z_t = i)
μ: initial state cluster distribution, μ = {μ(i)}, where μ(i) = P(z_0 = i)

Graphical illustration (t = 0, 1, 2, 3): a chain z_0 → z_1 → z_2 → z_3 with initial factor μ(z_0), transition factors W(z_0, a_0, o_1, z_1), W(z_1, a_1, o_2, z_2), W(z_2, a_2, o_3, z_3), and emission factors π(z_0, a_0), π(z_1, a_1), π(z_2, a_2), π(z_3, a_3); the inputs o_1, o_2, o_3 enter the transition factors.

Key properties

Property 1: stationarity
Property 2 (implicit): P(z_{t+1}, a_{t+1} | z_{0:t}, a_{0:t}, o_{1:t+1}) = P(z_{t+1}, a_{t+1} | z_t, a_t, o_{t+1})
Property 3 (implicit): P(z_{0:t}, a_{0:t} | o_{1:t+1}) = P(z_{0:t}, a_{0:t} | o_{1:t})
Property 4 (implicit): P(a_{t+1} | z_{t+1}, z_t, a_t, o_{t+1}) = P(a_{t+1} | z_{t+1})

Compact notation: θ = {W, π, μ}, where W is a tensor, π a matrix, and μ a vector.

Factorization of the complete joint likelihood of history given the input data up to time T:

P(z_{0:T}, a_{0:T} | o_{1:T}; θ) = μ(z_0) ∏_{t=0}^{T} π(z_t, a_t) ∏_{t=1}^{T} W(z_{t-1}, a_{t-1}, o_t, z_t)

Define P_T = P(z_{0:T}, a_{0:T} | o_{1:T}; θ). (Try to find a recursive relation!) Then

P_T = P(z_T, a_T | z_{0:T-1}, a_{0:T-1}, o_{1:T}; θ) P(z_{0:T-1}, a_{0:T-1} | o_{1:T}; θ)   (conditioning on history up to time T-1)
    = P(z_T, a_T | z_{T-1}, a_{T-1}, o_T; θ) P(z_{0:T-1}, a_{0:T-1} | o_{1:T-1}; θ)
      (by Property 2, z_T, a_T are independent of z_{0:T-2}, a_{0:T-2}, o_{1:T-1} when z_{T-1}, a_{T-1}, o_T are given;
       by Property 3, z_{0:T-1}, a_{0:T-1} are independent of o_T, that is, history does not depend on the future!)
    = P(z_T, a_T | z_{T-1}, a_{T-1}, o_T; θ) P_{T-1}
    = P(a_T | z_T, z_{T-1}, a_{T-1}, o_T; θ) P(z_T | z_{T-1}, a_{T-1}, o_T; θ) P_{T-1}   (conditioning on z_T)
    = P(a_T | z_T, z_{T-1}, a_{T-1}, o_T; θ) W(z_{T-1}, a_{T-1}, o_T, z_T) P_{T-1}   (transition factor)
    = π(z_T, a_T) W(z_{T-1}, a_{T-1}, o_T, z_T) P_{T-1}   (by Property 4, a_T is independent of z_{T-1}, a_{T-1}, o_T given z_T; emission factor)

with base case

P_1 = P(z_{0:1}, a_{0:1} | o_1; θ)
    = P(z_1, a_1 | z_0, a_0, o_1; θ) P(z_0, a_0 | o_1; θ)
    = π(z_1, a_1) W(z_0, a_0, o_1, z_1) P(a_0 | z_0; θ) P(z_0 | θ)
    = π(z_1, a_1) W(z_0, a_0, o_1, z_1) π(z_0, a_0) μ(z_0).

Unrolling the recursion gives the product formula above.

An important posterior: P(z_t = i | a_{0:T}, o_{1:T}; θ)

Similar to (but different from) the hidden Markov model, we define

α_0(i) = P(z_0 = i, a_0 | o_{1:T}; θ) = P(z_0 = i, a_0 | θ) = P(a_0 | z_0 = i; θ) P(z_0 = i | θ) = π(i, a_0) μ(i)

For t = 1, 2, ..., T:

α_t(i) = P(z_t = i, a_{0:t} | o_{1:T}; θ) = P(z_t = i, a_{0:t} | o_{1:t}; θ)
       = Σ_j P(z_{t-1} = j, z_t = i, a_{0:t} | o_{1:t}; θ)
       = Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{0:t-1}, o_{1:t}; θ) P(z_{t-1} = j, a_{0:t-1} | o_{1:t}; θ)
       = Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{t-1}, o_t; θ) P(z_{t-1} = j, a_{0:t-1} | o_{1:t-1}; θ)
       = Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
       = Σ_j P(a_t | z_t = i, z_{t-1} = j, a_{t-1}, o_t; θ) P(z_t = i | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
       = Σ_j P(a_t | z_t = i; θ) P(z_t = i | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
       = Σ_j α_{t-1}(j) W(j, a_{t-1}, o_t, i) π(i, a_t)
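The input-driven forward recursion can be sketched as below. The tiny model (2 states, 2 behaviors, 2 inputs, random normalized W and π) is a hypothetical example, far smaller than the toy ground truth on the later slides; note that o is indexed from t = 1, so o[t - 1] plays the role of o_t:

```python
import numpy as np

def forward_ext(W, pi, mu, a, o):
    """Forward pass for the input-driven extension.

    W[j, a, o, i] = P(z_{t+1} = i | z_t = j, a_t = a, o_{t+1} = o)
    pi[i, a]      = P(a_t = a | z_t = i)
    alpha[t, i]   = P(z_t = i, a_0:t | o_1:T; theta)
    """
    T, S = len(a), len(mu)
    alpha = np.zeros((T, S))
    alpha[0] = pi[:, a[0]] * mu             # alpha_0(i) = pi(i, a_0) mu(i)
    for t in range(1, T):
        # alpha_t(i) = sum_j alpha_{t-1}(j) W(j, a_{t-1}, o_t, i) pi(i, a_t)
        alpha[t] = (alpha[t - 1] @ W[:, a[t - 1], o[t - 1], :]) * pi[:, a[t]]
    return alpha

# Hypothetical random model: S = 2, |A| = 2, |O| = 2
rng = np.random.default_rng(1)
S, nA, nO = 2, 2, 2
W = rng.random((S, nA, nO, S))
W /= W.sum(axis=-1, keepdims=True)          # normalize over the next state
pi = rng.random((S, nA))
pi /= pi.sum(axis=-1, keepdims=True)        # normalize over behaviors
mu = np.array([0.5, 0.5])
a = [0, 1, 0]                               # behaviors a_0, a_1, a_2
o = [1, 0]                                  # inputs o_1, o_2
alpha = forward_ext(W, pi, mu, a, o)
likelihood = alpha[-1].sum()                # P(a_0:T | o_1:T; theta)
```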

By α_T(i) = P(z_T = i, a_{0:T} | o_{1:T}; θ), we have P(a_{0:T} | o_{1:T}; θ) = Σ_i P(z_T = i, a_{0:T} | o_{1:T}; θ) = Σ_i α_T(i); this is the data likelihood. We also define β_T(i) = 1 and, for t = T-1, T-2, ..., 0:

β_t(i) = P(a_{t+1:T} | z_t = i, a_t, o_{1:T}; θ) = P(a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j P(z_{t+1} = j, a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j P(a_{t+2:T} | z_{t+1} = j, z_t = i, a_{t+1}, a_t, o_{t+2:T}; θ) P(z_{t+1} = j, a_{t+1} | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j β_{t+1}(j) P(z_{t+1} = j, a_{t+1} | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j β_{t+1}(j) P(a_{t+1} | z_{t+1} = j, z_t = i, a_t, o_{t+1:T}; θ) P(z_{t+1} = j | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j β_{t+1}(j) P(a_{t+1} | z_{t+1} = j; θ) P(z_{t+1} = j | z_t = i, a_t, o_{t+1}; θ)
       = Σ_j W(i, a_t, o_{t+1}, j) π(j, a_{t+1}) β_{t+1}(j)

Finally,

P(z_t = i | a_{0:T}, o_{1:T}; θ) = P(z_t = i, a_{0:T} | o_{1:T}; θ) / P(a_{0:T} | o_{1:T}; θ)
 = P(z_t = i, a_{0:t}, a_{t+1:T} | o_{1:T}; θ) / Σ_j α_T(j)
 = P(a_{t+1:T} | z_t = i, a_{0:t}, o_{1:T}; θ) P(z_t = i, a_{0:t} | o_{1:T}; θ) / Σ_j α_T(j)
 = P(a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ) P(z_t = i, a_{0:t} | o_{1:t}; θ) / Σ_j α_T(j)
 = α_t(i) β_t(i) / Σ_j α_T(j)

Same form as in the HMM!

Goal: Infer the number of distinct hidden state clusters given the observation data a_{0:T}^{(1:N)} and the input data o_{1:T}^{(1:N)}.

Inference method: Gibbs sampling + stick-breaking construction + Dirichlet distributions

Steps:

Step 1. Select an initial estimate of |S| (hence S = {1, 2, ..., |S|}). Select an initialization of θ = {W, π, μ}: use uniform initialization for W and μ, and initialize π by drawing from a Dirichlet distribution parameterized by the empirical distribution of a_t.

Step 2. For each n = 1, 2, ..., N, draw a sequence of states z_{0:T}^{(n)} from the conditional posteriors P(z_t^{(n)} = i | a_{0:T}^{(n)}, o_{1:T}^{(n)}; θ), t = 0, 1, 2, ..., T.

Step 3. For each n = 1, 2, ..., N, compute the count statistics #(i → j | a, o)^{(n)}, #(i → a)^{(n)}, and #(z_0 = i)^{(n)}. Compute the state-cluster occupation based on the newly drawn z_{0:T}^{(1:N)}; this yields the new estimate of |S|. Relabel the newly drawn z_{0:T}^{(1:N)} according to the occupation if necessary.

Step 4. Based on the count statistics obtained in Step 3, draw W and μ via the stick-breaking process, and draw π from the Dirichlet conditional posteriors. Go to Step 2.

Posteriors involved in the stick-breaking process:

α_i^(W,a,o) ~ Gamma(c_1 + 1, d_1 - Σ_j log(1 - V_ij^(W,a,o)))
V_ij^(W,a,o) ~ Beta(1 + Σ_n #(i → j | a, o)^(n), α_i^(W,a,o) + Σ_n #(i → >j | a, o)^(n)),
  with w_i1(a, o) = V_i1^(W,a,o) and w_ij(a, o) = V_ij^(W,a,o) ∏_{j'<j} (1 - V_ij'^(W,a,o))

α^(μ) ~ Gamma(c_2 + 1, d_2 - Σ_j log(1 - V_j^(μ)))
V_j^(μ) ~ Beta(1 + Σ_n #(z_0 = j)^(n), α^(μ) + Σ_n #(z_0 > j)^(n)),  with μ_1 = V_1^(μ) and μ_j = V_j^(μ) ∏_{j'<j} (1 - V_j'^(μ))

Conditional Dirichlet posteriors:

(π_i1, π_i2, ..., π_i|A|) ~ Dirichlet(β_1^(π) + Σ_n #(i → 1)^(n), β_2^(π) + Σ_n #(i → 2)^(n), ..., β_|A|^(π) + Σ_n #(i → |A|)^(n))

Empirical hyper-parameters: c_1 = 10^-6, d_1 = 0.1, c_2 = 0.01, d_2 = 0.01; all β_k^(π) are determined by the empirical distribution of a_t.

A toy ground truth (randomly generated): W: 4 × 15 × 10 × 10, π: 10 × 4, μ: 1 × 10 (|A| = 4, |O| = 15, |S| = 10)

In MATLAB, W_a_o is a 4 × 15 cell array in which every cell is a [10x10 double] transition matrix. For example:

>> W_a_o{1,1}
ans =
0.19691 0.12754 0.011997 0.0031654 0.17369 0.040087 0.10291 0.15069 0.16472 0.028292
0.046207 0.15832 0.070542 0.14929 0.0039261 0.13638 0.17987 0.06183 0.19128 0.0023503
0.088622 0.13462 0.11875 0.065001 0.099492 0.044215 0.11999 0.12245 0.076318 0.13054
0.090343 0.13723 0.0018332 0.17322 0.070545 0.1007 0.11989 0.1056 0.16362 0.037019
0.20655 0.040848 0.032187 0.10799 0.19276 0.034963 0.18956 0.08584 0.040081 0.069226
0.12714 0.067684 0.033828 0.069844 0.083885 0.11643 0.11015 0.11724 0.16345 0.11035
0.091861 0.18826 0.039991 0.1703 0.14278 0.076145 0.068819 0.10999 0.054627 0.057235
0.0038474 0.19065 0.12554 0.10919 0.089178 0.17882 0.060241 0.092502 0.052466 0.097564
0.16967 0.084748 0.056225 0.04186 0.062924 0.17634 0.070479 0.14347 0.1809 0.013382
0.075713 0.15215 0.033849 0.11443 0.032289 0.10106 0.090929 0.10578 0.12553 0.16827

A toy ground truth (randomly generated), continued:

>> W_a_o{3,5}
ans =
0.049595 0.1711 0.0048086 0.22873 0.092572 0.024649 0.069779 0.10077 0.14129 0.1167
0.037899 0.17893 0.074279 0.078157 0.10912 0.072839 0.15128 0.14954 0.096132 0.051819
0.112 0.0094001 0.13369 0.038813 0.15062 0.13029 0.14407 0.078113 0.20227 0.00073703
0.0041812 0.20036 0.0064187 0.18575 0.22347 0.0077509 0.13714 0.20276 0.0058403 0.02633
0.043656 0.11452 0.1658 0.11457 0.13171 0.086594 0.010688 0.11056 0.070362 0.15154
0.10089 0.085904 0.098771 0.072727 7.3562e-005 0.18129 0.15943 0.071039 0.061676 0.16821
0.082729 0.11931 0.12038 0.13838 0.15105 0.12878 0.044396 0.017092 0.023745 0.17415
0.077517 0.14142 0.026553 0.11525 0.066107 0.01647 0.094269 0.1871 0.14204 0.13327
0.17604 0.21067 0.095448 0.017306 0.062586 0.054057 0.0063592 0.19496 0.13312 0.049449
0.043106 0.20757 0.091527 0.029171 0.18601 0.03193 0.14494 0.10332 0.043977 0.11845

>> W_a_o{4,10}
ans =
0.081838 0.13278 0.056819 0.0033495 0.010084 0.20048 0.071638 0.14095 0.14306 0.159
0.13113 0.0039547 0.10423 0.2564 0.0073725 0.03746 0.057935 0.20249 0.15949 0.039539
0.14215 0.12991 0.17242 0.051946 0.039586 0.055112 0.019218 0.078064 0.17018 0.14141
0.026762 0.036433 0.23035 0.069818 0.026359 0.2867 0.008662 0.11252 0.11566 0.086737
0.10372 0.026524 0.2023 0.11755 0.067874 0.17014 0.076155 0.16799 0.057377 0.010372
0.14841 0.11691 0.16689 0.17767 0.0062905 0.17205 0.0015792 0.086508 0.10353 0.020161
0.079686 0.13615 0.09691 0.0080626 0.17542 0.080951 0.10324 0.13532 0.0595 0.12476
0.010199 0.0029338 0.18272 0.16032 0.25071 0.0026783 0.27869 0.026886 0.0067274 0.078133
0.14151 0.01472 0.04687 0.18795 0.1471 0.013547 0.0396 0.17451 0.092818 0.14136
0.048205 0.0060693 0.025795 0.13245 0.10371 0.02278 0.17249 0.16962 0.14522 0.17366


A toy ground truth (randomly generated), continued:

>> P_i
P_i =
0.21432 0.0032032 0.096714 0.68577
0.38876 0.32973 0.022091 0.25943
0.0038868 0.15986 0.40782 0.42844
0.24144 0.29837 0.27234 0.18784
0.24997 0.12633 0.089493 0.53421
0.34921 0.025155 0.20485 0.42078
0.31067 0.103 0.29444 0.29189
0.39228 0.10078 0.28538 0.22157
0.35535 0.11409 0.18591 0.34464
0.27563 0.26337 0.12512 0.33589

>> m_u
m_u =
0.20145 0.010774 0.15493 0.12541 0.070178 0.0016605 0.062962 0.21026 0.096605 0.06578

>> p_o (this discrete distribution is used to simulate the input data)
p_o =
0.16553 0.086469 0.040123 0.095611 0.019218 0.021096 0.02608 0.025348 0.095511 0.048504 0.15768 0.033172 0.0025443 0.019514 0.16359

Simulated toy data: o_{1:19}^{(1:100)}, a_{0:19}^{(1:100)}

[Figure: "Estimation of |S|": trajectories of the estimated |S| over 160 Gibbs iterations, started from 6 different initial estimates; ground truth |S| = 10. x-axis: number of Gibbs iterations (0 to 160); y-axis: |S| (0 to 60).]

Now, with the same toy ground truth, change the mechanism of generating o_{1:T}^{(1:N)}; that is, do not just use a single multinomial distribution.

Simulated toy data: o_{1:39}^{(1:100)}, a_{0:39}^{(1:100)}

[Figure: "Estimation of |S|": trajectories of the estimated |S| over 160 Gibbs iterations, started from 6 different initial estimates; ground truth |S| = 10. x-axis: number of Gibbs iterations (0 to 160); y-axis: |S| (0 to 60).]

Questions? Thanks