Inferring the Number of State Clusters in the Hidden Markov Model and Its Extension


Inferring the Number of State Clusters in the Hidden Markov Model and Its Extension
Xugang Ye, Department of Applied Mathematics and Statistics, Johns Hopkins University

Elements of a Hidden Markov Model (HMM)

S: set of state clusters, S = {i}
O: set of observations, O = {k}
X_t: time series of state clusters
Y_t: time series of observations
A: set of state transitions, A = {A(i, j)}, where A(i, j) = P(X_{t+1} = j | X_t = i), t = 0, 1, 2, ...
B: set of emissions, B = {B(i, k)}, where B(i, k) = P(Y_t = k | X_t = i), t = 0, 1, 2, ...
μ: initial state cluster distribution, μ = {μ(i)}, where μ(i) = P(X_0 = i)

Graphical illustration: a chain X_0 → X_1 → X_2 → X_3 → ... → X_t → X_{t+1}, with each X_t emitting Y_t.

Property 1: stationarity
Property 2 (implicit): P(X_{t+1} | X_{0:t}) = P(X_{t+1} | X_t)
Property 3 (implicit): P(Y_{t+1} | Y_{0:t}, X_{0:t+1}) = P(Y_{t+1} | X_{t+1})

Compact notation: λ = (A, B, μ)

Factorization of the complete joint likelihood of history up to time T:

P(X_{0:T}, Y_{0:T} | λ) = μ(X_0) ∏_{t=0}^{T} B(X_t, Y_t) ∏_{t=1}^{T} A(X_{t-1}, X_t)

(Try to find a recursive relation!) Define P_T = P(X_{0:T}, Y_{0:T} | λ). Then

P_T = P(X_T, Y_T | X_{0:T-1}, Y_{0:T-1}; λ) P(X_{0:T-1}, Y_{0:T-1} | λ)
    = P(X_T, Y_T | X_{0:T-1}, Y_{0:T-1}; λ) P_{T-1}   (conditioning on history up to time T-1)
    = P(Y_T | X_T, X_{0:T-1}, Y_{0:T-1}; λ) P(X_T | X_{0:T-1}, Y_{0:T-1}; λ) P_{T-1}   (conditioning on X_T)
    = P(Y_T | X_T; λ) P(X_T | X_{T-1}; λ) P_{T-1}   (by Properties 2 and 3)
    = B(X_T, Y_T) A(X_{T-1}, X_T) P_{T-1}   (transition factor and emission factor)

with base case

P_0 = P(X_0, Y_0 | λ) = P(Y_0 | X_0; λ) P(X_0 | λ) = B(X_0, Y_0) μ(X_0).

Unrolling the recursion down to P_0 gives the product formula above.

A fundamental problem: computing the data likelihood P(Y_{0:T} | λ)

Computing method: forward/backward iteration (dynamic programming)

Forward calculation: define α_t(i) = P(Y_{0:t}, X_t = i | λ). Then

α_0(i) = P(Y_0, X_0 = i | λ) = P(Y_0 | X_0 = i; λ) P(X_0 = i | λ) = B(i, Y_0) μ(i)

and for t = 1, 2, ..., T:

α_t(i) = P(Y_{0:t}, X_t = i | λ)
       = Σ_j P(Y_{0:t}, X_{t-1} = j, X_t = i | λ)
       = Σ_j P(Y_t, X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) P(Y_{0:t-1}, X_{t-1} = j | λ)
       = Σ_j P(Y_t, X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) α_{t-1}(j)
       = Σ_j P(Y_t | X_t = i, Y_{0:t-1}, X_{t-1} = j; λ) P(X_t = i | Y_{0:t-1}, X_{t-1} = j; λ) α_{t-1}(j)
       = Σ_j P(Y_t | X_t = i; λ) P(X_t = i | X_{t-1} = j; λ) α_{t-1}(j)
       = Σ_j α_{t-1}(j) A(j, i) B(i, Y_t)

Finally, since α_T(i) = P(Y_{0:T}, X_T = i | λ), we have P(Y_{0:T} | λ) = Σ_i α_T(i).
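The forward recursion above can be sketched in Python with NumPy; the 2-state, 2-symbol model below is a hypothetical example chosen for illustration, not the toy model used later in these slides:

```python
import numpy as np

def forward(A, B, mu, Y):
    """Forward pass: alpha[t, i] = P(Y_0:t, X_t = i | lambda)."""
    T, S = len(Y), len(mu)
    alpha = np.zeros((T, S))
    alpha[0] = B[:, Y[0]] * mu                  # alpha_0(i) = B(i, Y_0) mu(i)
    for t in range(1, T):
        # alpha_t(i) = sum_j alpha_{t-1}(j) A(j, i) B(i, Y_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]
    return alpha

# Hypothetical 2-state, 2-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.5, 0.5])
Y = [0, 1, 0]
alpha = forward(A, B, mu, Y)
likelihood = alpha[-1].sum()                    # P(Y_0:T | lambda) = sum_i alpha_T(i)
```

The recursion costs O(T |S|^2), versus O(|S|^{T+1}) for summing over all state paths directly.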

Backward calculation: define β_t(i) = P(Y_{t+1:T} | X_t = i; λ). Then β_T(i) = 1 and, for t = T-1, T-2, ..., 0:

β_t(i) = P(Y_{t+1:T} | X_t = i; λ)
       = Σ_j P(Y_{t+1:T}, X_{t+1} = j | X_t = i; λ)
       = Σ_j P(Y_{t+2:T} | Y_{t+1}, X_{t+1} = j, X_t = i; λ) P(Y_{t+1}, X_{t+1} = j | X_t = i; λ)
       = Σ_j P(Y_{t+2:T} | X_{t+1} = j; λ) P(Y_{t+1} | X_{t+1} = j, X_t = i; λ) P(X_{t+1} = j | X_t = i; λ)
       = Σ_j β_{t+1}(j) P(Y_{t+1} | X_{t+1} = j; λ) P(X_{t+1} = j | X_t = i; λ)
       = Σ_j A(i, j) B(j, Y_{t+1}) β_{t+1}(j)

Finally,

P(Y_{0:T} | λ) = Σ_i P(Y_{0:T}, X_0 = i | λ)
             = Σ_i P(Y_{1:T} | Y_0, X_0 = i; λ) P(Y_0, X_0 = i | λ)
             = Σ_i P(Y_{1:T} | X_0 = i; λ) P(Y_0 | X_0 = i; λ) P(X_0 = i | λ)
             = Σ_i μ(i) B(i, Y_0) β_0(i)
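The backward recursion admits the same kind of sketch; the small model is again a hypothetical 2-state example, and the final line checks the likelihood formula Σ_i μ(i) B(i, Y_0) β_0(i):

```python
import numpy as np

def backward(A, B, Y):
    """Backward pass: beta[t, i] = P(Y_{t+1:T} | X_t = i; lambda)."""
    T, S = len(Y), A.shape[0]
    beta = np.ones((T, S))                      # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j A(i, j) B(j, Y_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
    return beta

# Hypothetical 2-state, 2-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.5, 0.5])
Y = [0, 1, 0]
beta = backward(A, B, Y)
likelihood = (mu * B[:, Y[0]] * beta[0]).sum()  # sum_i mu(i) B(i, Y_0) beta_0(i)
```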

An important posterior: P(X_t = i | Y_{0:T}; λ)

P(X_t = i | Y_{0:T}; λ) = P(X_t = i, Y_{0:T} | λ) / P(Y_{0:T} | λ)
 = P(X_t = i, Y_{0:t}, Y_{t+1:T} | λ) / P(Y_{0:T} | λ)
 = P(Y_{t+1:T} | X_t = i, Y_{0:t}; λ) P(X_t = i, Y_{0:t} | λ) / P(Y_{0:T} | λ)
 = P(Y_{t+1:T} | X_t = i; λ) P(X_t = i, Y_{0:t} | λ) / P(Y_{0:T} | λ)
 = α_t(i) β_t(i) / Σ_j α_T(j)

Now consider the inverse problem of inferring λ given Y_{0:T}. An annoying complication is that we do not know the dimensions of A, B, and μ.
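Combining the two passes gives the posterior marginals γ_t(i) = α_t(i) β_t(i) / Σ_j α_T(j); a compact sketch on the same hypothetical 2-state model:

```python
import numpy as np

# Hypothetical 2-state, 2-symbol model
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.5, 0.5])
Y = [0, 1, 0]
T, S = len(Y), len(mu)

# Forward and backward passes
alpha = np.zeros((T, S))
beta = np.ones((T, S))
alpha[0] = B[:, Y[0]] * mu
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])

# gamma[t, i] = P(X_t = i | Y_0:T; lambda) = alpha_t(i) beta_t(i) / sum_j alpha_T(j)
gamma = alpha * beta / alpha[-1].sum()
```

Each row of gamma is a probability distribution over the states at that time step, which makes it directly usable for the sampling step of the inference procedure on the next slide.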

Goal: Infer the number of distinct hidden state clusters given the observation data Y_{0:T}^{(1:N)}.

Note that for different n and n', Y_{0:T}^{(n)} and Y_{0:T}^{(n')} are independent. However, for any n, there is sequential dependence within Y_{0:T}^{(n)}.

Inference method: Gibbs sampling + stick-breaking construction + Dirichlet distributions

Steps:

Step 1. Select an initial estimate of |S| (hence S = {1, 2, ..., |S|}). Select an initialization of λ = (A, B, μ): use uniform initialization for A and μ, and initialize B by drawing from a Dirichlet distribution parameterized by the empirical distribution of Y_t.

Step 2. For each n = 1, 2, ..., N, draw a sequence of state clusters X_{0:T}^{(n)} from the posteriors P(X_t^{(n)} = i | Y_{0:T}^{(n)}; λ), t = 0, 1, 2, ..., T.

Step 3. For each n = 1, 2, ..., N, compute the count statistics #(i → j)^{(n)}, #(i → k)^{(n)}, and #(X_0 = i)^{(n)}. Compute the state-cluster occupation based on the newly drawn X_{0:T}^{(1:N)}; this yields the new estimate of |S|. Relabel the newly drawn X_{0:T}^{(1:N)} according to the occupation if necessary.

Step 4. Based on the count statistics obtained in Step 3, draw A and μ via the stick-breaking process, and draw B from the Dirichlet conditional posteriors. Go to Step 2.
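Steps 2 and 3 for a single sequence can be sketched as follows. This is a minimal illustration on a hypothetical 2-state model; note that, as on the previous slide, each X_t is drawn from its marginal posterior P(X_t = i | Y_{0:T}; λ) = γ_t(i) rather than from the joint posterior over the whole path:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-symbol model (for illustration only)
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
mu = np.array([0.5, 0.5])
Y = [0, 1, 1, 0]
T, S = len(Y), len(mu)

# Forward-backward to get the per-time marginal posteriors gamma_t(i)
alpha = np.zeros((T, S))
beta = np.ones((T, S))
alpha[0] = B[:, Y[0]] * mu
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, Y[t + 1]] * beta[t + 1])
gamma = alpha * beta / alpha[-1].sum()

# Step 2: draw each X_t from its marginal posterior gamma_t(.)
X = np.array([rng.choice(S, p=gamma[t]) for t in range(T)])

# Step 3: count statistics from the drawn sequence
trans_counts = np.zeros((S, S), dtype=int)        # #(i -> j)
emit_counts = np.zeros((S, B.shape[1]), dtype=int)  # #(i -> k)
for t in range(T):
    emit_counts[X[t], Y[t]] += 1
    if t > 0:
        trans_counts[X[t - 1], X[t]] += 1
```

In the full procedure these counts would be accumulated over all N sequences before Step 4 resamples the parameters.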

Posteriors involved in the stick-breaking process:

α_i^(A) ~ Gamma(c_1 + 1, d_1 - Σ_j log(1 - V_ij^(A)))
V_ij^(A) ~ Beta(1 + Σ_n #(i → j)^(n), α_i^(A) + Σ_n #(i → >j)^(n)),  with a_i1 = V_i1^(A) and a_ij = V_ij^(A) ∏_{j'<j} (1 - V_ij'^(A))

α^(μ) ~ Gamma(c_2 + 1, d_2 - Σ_j log(1 - V_j^(μ)))
V_j^(μ) ~ Beta(1 + Σ_n #(X_0 = j)^(n), α^(μ) + Σ_n #(X_0 > j)^(n)),  with μ_1 = V_1^(μ) and μ_j = V_j^(μ) ∏_{j'<j} (1 - V_j'^(μ))

Conditional Dirichlet posteriors:

(b_i1, b_i2, ..., b_i|O|) ~ Dirichlet(β_1^(B) + Σ_n #(i → 1)^(n), β_2^(B) + Σ_n #(i → 2)^(n), ..., β_|O|^(B) + Σ_n #(i → |O|)^(n))

Empirical hyper-parameters: c_1 = 10^-6, d_1 = 0.1, c_2 = 0.01, d_2 = 0.01; all β_k^(B) are determined by the empirical distribution of Y_t.
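Drawing one transition row A(i, ·) from the stick-breaking posterior can be sketched as below. This is an assumption-laden illustration: the concentration α_i^(A) is held fixed rather than resampled from its Gamma posterior, and the stick is truncated at the current |S| and renormalized (the slide leaves the handling of the leftover stick mass implicit):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_transition_row(counts, alpha):
    """Draw one row A(i, .) via truncated stick-breaking.

    counts[j] = #(i -> j), the observed i -> j transition counts;
    alpha is the (fixed) concentration parameter alpha_i^(A).
    V_j ~ Beta(1 + #(i -> j), alpha + #(i -> >j)),
    a_ij = V_j * prod_{j' < j} (1 - V_{j'}).
    """
    # tail[j] = #(i -> >j) = sum of counts strictly after position j
    tail = np.concatenate([np.cumsum(counts[::-1])[::-1][1:], [0.0]])
    V = rng.beta(1.0 + counts, alpha + tail)
    stick = np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    row = V * stick
    return row / row.sum()          # renormalize the truncated stick

counts = np.array([5.0, 2.0, 0.0, 1.0])   # hypothetical counts for one state i
row = draw_transition_row(counts, alpha=0.5)
```

States with many observed transitions receive large Beta means, so heavily used clusters keep most of the stick, while unused clusters shrink toward zero; that shrinkage is what drives the estimate of |S|.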

A toy ground truth (randomly generated): A: 10 × 10, B: 10 × 4, μ: 1 × 10

A =
0.19691 0.12754 0.011997 0.0031654 0.17369 0.040087 0.10291 0.15069 0.16472 0.028292
0.046207 0.15832 0.070542 0.14929 0.0039261 0.13638 0.17987 0.06183 0.19128 0.0023503
0.088622 0.13462 0.11875 0.065001 0.099492 0.044215 0.11999 0.12245 0.076318 0.13054
0.090343 0.13723 0.0018332 0.17322 0.070545 0.1007 0.11989 0.1056 0.16362 0.037019
0.20655 0.040848 0.032187 0.10799 0.19276 0.034963 0.18956 0.08584 0.040081 0.069226
0.12714 0.067684 0.033828 0.069844 0.083885 0.11643 0.11015 0.11724 0.16345 0.11035
0.091861 0.18826 0.039991 0.1703 0.14278 0.076145 0.068819 0.10999 0.054627 0.057235
0.0038474 0.19065 0.12554 0.10919 0.089178 0.17882 0.060241 0.092502 0.052466 0.097564
0.16967 0.084748 0.056225 0.04186 0.062924 0.17634 0.070479 0.14347 0.1809 0.013382
0.075713 0.15215 0.033849 0.11443 0.032289 0.10106 0.090929 0.10578 0.12553 0.16827

B =
0.41007 0.14711 0.29227 0.15055
0.24175 0.21682 0.17411 0.36733
0.20676 0.31418 0.35069 0.12836
0.16782 0.34215 0.0075428 0.48248
0.18124 0.19304 0.32151 0.30421
0.10381 0.26088 0.44604 0.18927
0.18651 0.25548 0.31849 0.23951
0.40523 0.031541 0.42042 0.1428
0.26343 0.29974 0.2181 0.21873
0.30178 0.023684 0.23478 0.43976

μ =
0.13981 0.04349 0.17171 0.12865 0.02737 0.04238 0.12423 0.12888 0.0758 0.11768

Simulated toy data: Y_{0:39}^{(1:100)}

Estimation Results

[Figure: "Estimation of |S|": trajectories of the estimated |S| over 1000 Gibbs iterations, started from 6 different initial estimates; ground truth |S| = 10. x-axis: number of Gibbs iterations (0 to 1000); y-axis: |S| (0 to 60).]

An extension of the Hidden Markov Model

S: set of state clusters, S = {i}
A: set of behaviors, A = {a}
O: set of inputs, O = {o}
z_t: time series of state clusters
a_t: time series of behaviors
o_t: time series of inputs
W: set of state transitions, W = {W(i, a, o, j)}, where W(i, a, o, j) = P(z_{t+1} = j | z_t = i, a_t = a, o_{t+1} = o)
π: set of emissions, π = {π(i, a)}, where π(i, a) = P(a_t = a | z_t = i)
μ: initial state cluster distribution, μ = {μ(i)}, where μ(i) = P(z_0 = i)

Graphical illustration (t = 0, 1, 2, 3): a chain z_0 → z_1 → z_2 → z_3 with initial factor μ(z_0), transition factors W(z_0, a_0, o_1, z_1), W(z_1, a_1, o_2, z_2), W(z_2, a_2, o_3, z_3), and emission factors π(z_0, a_0), π(z_1, a_1), π(z_2, a_2), π(z_3, a_3); the inputs o_1, o_2, o_3 enter the transition factors.

Key properties

Property 1: stationarity
Property 2 (implicit): P(z_{t+1}, a_{t+1} | z_{0:t}, a_{0:t}, o_{1:t+1}) = P(z_{t+1}, a_{t+1} | z_t, a_t, o_{t+1})
Property 3 (implicit): P(z_{0:t}, a_{0:t} | o_{1:t+1}) = P(z_{0:t}, a_{0:t} | o_{1:t})
Property 4 (implicit): P(a_{t+1} | z_{t+1}, z_t, a_t, o_{t+1}) = P(a_{t+1} | z_{t+1})

Compact notation: θ = {W, π, μ}, where W is a tensor, π a matrix, and μ a vector.

Factorization of the complete joint likelihood of history given the input data up to time T:

P(z_{0:T}, a_{0:T} | o_{1:T}; θ) = μ(z_0) ∏_{t=0}^{T} π(z_t, a_t) ∏_{t=1}^{T} W(z_{t-1}, a_{t-1}, o_t, z_t)

Define P_T = P(z_{0:T}, a_{0:T} | o_{1:T}; θ). (Try to find a recursive relation!) Then

P_T = P(z_T, a_T | z_{0:T-1}, a_{0:T-1}, o_{1:T}; θ) P(z_{0:T-1}, a_{0:T-1} | o_{1:T}; θ)   (conditioning on history up to time T-1)
    = P(z_T, a_T | z_{T-1}, a_{T-1}, o_T; θ) P(z_{0:T-1}, a_{0:T-1} | o_{1:T-1}; θ)
      (by Property 2, z_T, a_T are independent of z_{0:T-2}, a_{0:T-2}, o_{1:T-1} when z_{T-1}, a_{T-1}, o_T are given;
       by Property 3, z_{0:T-1}, a_{0:T-1} are independent of o_T, that is, history does not depend on the future!)
    = P(z_T, a_T | z_{T-1}, a_{T-1}, o_T; θ) P_{T-1}
    = P(a_T | z_T, z_{T-1}, a_{T-1}, o_T; θ) P(z_T | z_{T-1}, a_{T-1}, o_T; θ) P_{T-1}   (conditioning on z_T)
    = P(a_T | z_T, z_{T-1}, a_{T-1}, o_T; θ) W(z_{T-1}, a_{T-1}, o_T, z_T) P_{T-1}   (transition factor)
    = π(z_T, a_T) W(z_{T-1}, a_{T-1}, o_T, z_T) P_{T-1}   (by Property 4, a_T is independent of z_{T-1}, a_{T-1}, o_T given z_T; emission factor)

with base case

P_1 = P(z_{0:1}, a_{0:1} | o_1; θ)
    = P(z_1, a_1 | z_0, a_0, o_1; θ) P(z_0, a_0 | o_1; θ)
    = π(z_1, a_1) W(z_0, a_0, o_1, z_1) P(a_0 | z_0; θ) P(z_0 | θ)
    = π(z_1, a_1) W(z_0, a_0, o_1, z_1) π(z_0, a_0) μ(z_0).

Unrolling the recursion gives the product formula above.

An important posterior: P(z_t = i | a_{0:T}, o_{1:T}; θ)

Similar to (but different from) the hidden Markov model, we define

α_0(i) = P(z_0 = i, a_0 | o_{1:T}; θ) = P(z_0 = i, a_0 | θ) = P(a_0 | z_0 = i; θ) P(z_0 = i | θ) = π(i, a_0) μ(i)

For t = 1, 2, ..., T:

α_t(i) = P(z_t = i, a_{0:t} | o_{1:T}; θ) = P(z_t = i, a_{0:t} | o_{1:t}; θ)
       = Σ_j P(z_{t-1} = j, z_t = i, a_{0:t} | o_{1:t}; θ)
       = Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{0:t-1}, o_{1:t}; θ) P(z_{t-1} = j, a_{0:t-1} | o_{1:t}; θ)
       = Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{t-1}, o_t; θ) P(z_{t-1} = j, a_{0:t-1} | o_{1:t-1}; θ)
       = Σ_j P(z_t = i, a_t | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
       = Σ_j P(a_t | z_t = i, z_{t-1} = j, a_{t-1}, o_t; θ) P(z_t = i | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
       = Σ_j P(a_t | z_t = i; θ) P(z_t = i | z_{t-1} = j, a_{t-1}, o_t; θ) α_{t-1}(j)
       = Σ_j α_{t-1}(j) W(j, a_{t-1}, o_t, i) π(i, a_t)
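The input-driven forward recursion can be sketched as below. The tiny model (2 states, 2 behaviors, 2 inputs, random normalized W and π) is a hypothetical example, far smaller than the toy ground truth on the later slides; note that o is indexed from t = 1, so o[t - 1] plays the role of o_t:

```python
import numpy as np

def forward_ext(W, pi, mu, a, o):
    """Forward pass for the input-driven extension.

    W[j, a, o, i] = P(z_{t+1} = i | z_t = j, a_t = a, o_{t+1} = o)
    pi[i, a]      = P(a_t = a | z_t = i)
    alpha[t, i]   = P(z_t = i, a_0:t | o_1:T; theta)
    """
    T, S = len(a), len(mu)
    alpha = np.zeros((T, S))
    alpha[0] = pi[:, a[0]] * mu             # alpha_0(i) = pi(i, a_0) mu(i)
    for t in range(1, T):
        # alpha_t(i) = sum_j alpha_{t-1}(j) W(j, a_{t-1}, o_t, i) pi(i, a_t)
        alpha[t] = (alpha[t - 1] @ W[:, a[t - 1], o[t - 1], :]) * pi[:, a[t]]
    return alpha

# Hypothetical random model: S = 2, |A| = 2, |O| = 2
rng = np.random.default_rng(1)
S, nA, nO = 2, 2, 2
W = rng.random((S, nA, nO, S))
W /= W.sum(axis=-1, keepdims=True)          # normalize over the next state
pi = rng.random((S, nA))
pi /= pi.sum(axis=-1, keepdims=True)        # normalize over behaviors
mu = np.array([0.5, 0.5])
a = [0, 1, 0]                               # behaviors a_0, a_1, a_2
o = [1, 0]                                  # inputs o_1, o_2
alpha = forward_ext(W, pi, mu, a, o)
likelihood = alpha[-1].sum()                # P(a_0:T | o_1:T; theta)
```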

By α_T(i) = P(z_T = i, a_{0:T} | o_{1:T}; θ), we have P(a_{0:T} | o_{1:T}; θ) = Σ_i P(z_T = i, a_{0:T} | o_{1:T}; θ) = Σ_i α_T(i); this is the data likelihood. We also define β_T(i) = 1 and, for t = T-1, T-2, ..., 0:

β_t(i) = P(a_{t+1:T} | z_t = i, a_t, o_{1:T}; θ) = P(a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j P(z_{t+1} = j, a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j P(a_{t+2:T} | z_{t+1} = j, z_t = i, a_{t+1}, a_t, o_{t+2:T}; θ) P(z_{t+1} = j, a_{t+1} | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j β_{t+1}(j) P(z_{t+1} = j, a_{t+1} | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j β_{t+1}(j) P(a_{t+1} | z_{t+1} = j, z_t = i, a_t, o_{t+1:T}; θ) P(z_{t+1} = j | z_t = i, a_t, o_{t+1:T}; θ)
       = Σ_j β_{t+1}(j) P(a_{t+1} | z_{t+1} = j; θ) P(z_{t+1} = j | z_t = i, a_t, o_{t+1}; θ)
       = Σ_j W(i, a_t, o_{t+1}, j) π(j, a_{t+1}) β_{t+1}(j)

Finally,

P(z_t = i | a_{0:T}, o_{1:T}; θ) = P(z_t = i, a_{0:T} | o_{1:T}; θ) / P(a_{0:T} | o_{1:T}; θ)
 = P(z_t = i, a_{0:t}, a_{t+1:T} | o_{1:T}; θ) / Σ_j α_T(j)
 = P(a_{t+1:T} | z_t = i, a_{0:t}, o_{1:T}; θ) P(z_t = i, a_{0:t} | o_{1:T}; θ) / Σ_j α_T(j)
 = P(a_{t+1:T} | z_t = i, a_t, o_{t+1:T}; θ) P(z_t = i, a_{0:t} | o_{1:t}; θ) / Σ_j α_T(j)
 = α_t(i) β_t(i) / Σ_j α_T(j)

Same form as in the HMM!

Goal: Infer the number of distinct hidden state clusters given the observation data a_{0:T}^{(1:N)} and the input data o_{1:T}^{(1:N)}.

Inference method: Gibbs sampling + stick-breaking construction + Dirichlet distributions

Steps:

Step 1. Select an initial estimate of |S| (hence S = {1, 2, ..., |S|}). Select an initialization of θ = {W, π, μ}: use uniform initialization for W and μ, and initialize π by drawing from a Dirichlet distribution parameterized by the empirical distribution of a_t.

Step 2. For each n = 1, 2, ..., N, draw a sequence of states z_{0:T}^{(n)} from the conditional posteriors P(z_t^{(n)} = i | a_{0:T}^{(n)}, o_{1:T}^{(n)}; θ), t = 0, 1, 2, ..., T.

Step 3. For each n = 1, 2, ..., N, compute the count statistics #(i → j | a, o)^{(n)}, #(i → a)^{(n)}, and #(z_0 = i)^{(n)}. Compute the state-cluster occupation based on the newly drawn z_{0:T}^{(1:N)}; this yields the new estimate of |S|. Relabel the newly drawn z_{0:T}^{(1:N)} according to the occupation if necessary.

Step 4. Based on the count statistics obtained in Step 3, draw W and μ via the stick-breaking process, and draw π from the Dirichlet conditional posteriors. Go to Step 2.

Posteriors involved in the stick-breaking process:

α_i^(W,a,o) ~ Gamma(c_1 + 1, d_1 - Σ_j log(1 - V_ij^(W,a,o)))
V_ij^(W,a,o) ~ Beta(1 + Σ_n #(i → j | a, o)^(n), α_i^(W,a,o) + Σ_n #(i → >j | a, o)^(n)),
  with w_i1(a, o) = V_i1^(W,a,o) and w_ij(a, o) = V_ij^(W,a,o) ∏_{j'<j} (1 - V_ij'^(W,a,o))

α^(μ) ~ Gamma(c_2 + 1, d_2 - Σ_j log(1 - V_j^(μ)))
V_j^(μ) ~ Beta(1 + Σ_n #(z_0 = j)^(n), α^(μ) + Σ_n #(z_0 > j)^(n)),  with μ_1 = V_1^(μ) and μ_j = V_j^(μ) ∏_{j'<j} (1 - V_j'^(μ))

Conditional Dirichlet posteriors:

(π_i1, π_i2, ..., π_i|A|) ~ Dirichlet(β_1^(π) + Σ_n #(i → 1)^(n), β_2^(π) + Σ_n #(i → 2)^(n), ..., β_|A|^(π) + Σ_n #(i → |A|)^(n))

Empirical hyper-parameters: c_1 = 10^-6, d_1 = 0.1, c_2 = 0.01, d_2 = 0.01; all β_k^(π) are determined by the empirical distribution of a_t.

A toy ground truth (randomly generated): W: 4 × 15 × 10 × 10, π: 10 × 4, μ: 1 × 10 (|A| = 4, |O| = 15, |S| = 10)

In MATLAB, W_a_o is a 4 × 15 cell array in which every cell is a [10x10 double] transition matrix. For example:

>> W_a_o{1,1}
ans =
0.19691 0.12754 0.011997 0.0031654 0.17369 0.040087 0.10291 0.15069 0.16472 0.028292
0.046207 0.15832 0.070542 0.14929 0.0039261 0.13638 0.17987 0.06183 0.19128 0.0023503
0.088622 0.13462 0.11875 0.065001 0.099492 0.044215 0.11999 0.12245 0.076318 0.13054
0.090343 0.13723 0.0018332 0.17322 0.070545 0.1007 0.11989 0.1056 0.16362 0.037019
0.20655 0.040848 0.032187 0.10799 0.19276 0.034963 0.18956 0.08584 0.040081 0.069226
0.12714 0.067684 0.033828 0.069844 0.083885 0.11643 0.11015 0.11724 0.16345 0.11035
0.091861 0.18826 0.039991 0.1703 0.14278 0.076145 0.068819 0.10999 0.054627 0.057235
0.0038474 0.19065 0.12554 0.10919 0.089178 0.17882 0.060241 0.092502 0.052466 0.097564
0.16967 0.084748 0.056225 0.04186 0.062924 0.17634 0.070479 0.14347 0.1809 0.013382
0.075713 0.15215 0.033849 0.11443 0.032289 0.10106 0.090929 0.10578 0.12553 0.16827

A toy ground truth (randomly generated), continued:

>> W_a_o{3,5}
ans =
0.049595 0.1711 0.0048086 0.22873 0.092572 0.024649 0.069779 0.10077 0.14129 0.1167
0.037899 0.17893 0.074279 0.078157 0.10912 0.072839 0.15128 0.14954 0.096132 0.051819
0.112 0.0094001 0.13369 0.038813 0.15062 0.13029 0.14407 0.078113 0.20227 0.00073703
0.0041812 0.20036 0.0064187 0.18575 0.22347 0.0077509 0.13714 0.20276 0.0058403 0.02633
0.043656 0.11452 0.1658 0.11457 0.13171 0.086594 0.010688 0.11056 0.070362 0.15154
0.10089 0.085904 0.098771 0.072727 7.3562e-005 0.18129 0.15943 0.071039 0.061676 0.16821
0.082729 0.11931 0.12038 0.13838 0.15105 0.12878 0.044396 0.017092 0.023745 0.17415
0.077517 0.14142 0.026553 0.11525 0.066107 0.01647 0.094269 0.1871 0.14204 0.13327
0.17604 0.21067 0.095448 0.017306 0.062586 0.054057 0.0063592 0.19496 0.13312 0.049449
0.043106 0.20757 0.091527 0.029171 0.18601 0.03193 0.14494 0.10332 0.043977 0.11845

>> W_a_o{4,10}
ans =
0.081838 0.13278 0.056819 0.0033495 0.010084 0.20048 0.071638 0.14095 0.14306 0.159
0.13113 0.0039547 0.10423 0.2564 0.0073725 0.03746 0.057935 0.20249 0.15949 0.039539
0.14215 0.12991 0.17242 0.051946 0.039586 0.055112 0.019218 0.078064 0.17018 0.14141
0.026762 0.036433 0.23035 0.069818 0.026359 0.2867 0.008662 0.11252 0.11566 0.086737
0.10372 0.026524 0.2023 0.11755 0.067874 0.17014 0.076155 0.16799 0.057377 0.010372
0.14841 0.11691 0.16689 0.17767 0.0062905 0.17205 0.0015792 0.086508 0.10353 0.020161
0.079686 0.13615 0.09691 0.0080626 0.17542 0.080951 0.10324 0.13532 0.0595 0.12476
0.010199 0.0029338 0.18272 0.16032 0.25071 0.0026783 0.27869 0.026886 0.0067274 0.078133
0.14151 0.01472 0.04687 0.18795 0.1471 0.013547 0.0396 0.17451 0.092818 0.14136
0.048205 0.0060693 0.025795 0.13245 0.10371 0.02278 0.17249 0.16962 0.14522 0.17366


A toy ground truth (randomly generated), continued:

>> P_i
P_i =
0.21432 0.0032032 0.096714 0.68577
0.38876 0.32973 0.022091 0.25943
0.0038868 0.15986 0.40782 0.42844
0.24144 0.29837 0.27234 0.18784
0.24997 0.12633 0.089493 0.53421
0.34921 0.025155 0.20485 0.42078
0.31067 0.103 0.29444 0.29189
0.39228 0.10078 0.28538 0.22157
0.35535 0.11409 0.18591 0.34464
0.27563 0.26337 0.12512 0.33589

>> m_u
m_u =
0.20145 0.010774 0.15493 0.12541 0.070178 0.0016605 0.062962 0.21026 0.096605 0.06578

>> p_o (this discrete distribution is used to simulate the input data)
p_o =
0.16553 0.086469 0.040123 0.095611 0.019218 0.021096 0.02608 0.025348 0.095511 0.048504 0.15768 0.033172 0.0025443 0.019514 0.16359

Simulated toy data: o_{1:19}^{(1:100)}, a_{0:19}^{(1:100)}

[Figure: "Estimation of |S|": trajectories of the estimated |S| over 160 Gibbs iterations, started from 6 different initial estimates; ground truth |S| = 10. x-axis: number of Gibbs iterations (0 to 160); y-axis: |S| (0 to 60).]

Now, with the same toy ground truth, change the mechanism of generating o_{1:T}^{(1:N)}; that is, do not just use a single multinomial distribution.

Simulated toy data: o_{1:39}^{(1:100)}, a_{0:39}^{(1:100)}

[Figure: "Estimation of |S|": trajectories of the estimated |S| over 160 Gibbs iterations, started from 6 different initial estimates; ground truth |S| = 10. x-axis: number of Gibbs iterations (0 to 160); y-axis: |S| (0 to 60).]

Questions? Thanks