Inference in Explicit Duration Hidden Markov Models


1 Inference in Explicit Duration Hidden Markov Models

Frank Wood, joint work with Chris Wiggins and Mike Dewar. Columbia University, November 2011.

2 Hidden Markov Models

Hidden Markov models (HMMs) [Rabiner, 1989] are an important tool for data exploration and engineering applications. Applications include:
- Speech recognition [Jelinek, 1997, Juang and Rabiner, 1985]
- Natural language processing [Manning and Schütze, 1999]
- Handwriting recognition [Nag et al., 1986]
- DNA and other biological sequence modeling [Krogh et al., 1994]
- Gesture recognition [Tanguay Jr, 1995, Wilson and Bobick, 1999]
- Financial data modeling [Rydén et al., 1998]
... and many more.

3 Graphical Model: Hidden Markov Model

(Figure: plate diagram with per-state parameters $\pi_m$ and $\theta_m$, $m = 1, \ldots, K$; latent chain $z_0, z_1, \ldots, z_T$; observations $y_1, \ldots, y_T$.)

4 Notation: Hidden Markov Model

$z_t \mid z_{t-1} = m \sim \text{Discrete}(\pi_m)$
$y_t \mid z_t = m \sim F_\theta(\theta_m)$

The transition matrix stacks the per-state transition distributions as rows:
$A = \begin{bmatrix} \pi_1 \\ \vdots \\ \pi_m \\ \vdots \\ \pi_K \end{bmatrix}$

5 HMM: Typical Usage Scenario (Character Recognition)

Training data: multiple observed sequences of stylus positions $y_t = \{v_t, h_t\}$ for each kind of character.
Task: train $\Sigma$ different models, one per character.
Latent states: sequences of strokes.
Usage: classify new stylus position sequences using the trained models $M_\sigma = \{A_\sigma, \Theta_\sigma\}$:
$$P(M_\sigma \mid y_1, \ldots, y_T) \propto P(y_1, \ldots, y_T \mid M_\sigma)\, P(M_\sigma)$$
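The classification rule above is just Bayes' rule over the per-character models. A minimal sketch of my own (not the authors' code; the log-likelihood interface is an assumed placeholder for whatever the trained HMMs expose):

```python
import numpy as np

def classify(y, logliks, log_prior):
    """Return the character sigma maximizing log P(y | M_sigma) + log P(M_sigma).

    logliks: one function per trained model M_sigma, mapping an observation
    sequence y to its log-likelihood (hypothetical interface).
    """
    scores = np.array([ll(y) for ll in logliks]) + np.asarray(log_prior)
    return int(np.argmax(scores))
```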

6 Shortcomings of Original HMM Specification

Latent state dwell times are not usually geometrically distributed, yet the HMM imposes
$$P(z_t = m, \ldots, z_{t+L} = m \mid A) = \prod_{l=1}^{L} P(z_{t+l} = m \mid z_{t+l-1} = m, A) = \text{Geometric}(L; \pi_m(m))$$
Prior knowledge often suggests structural constraints on allowable transitions.
The state cardinality $K$ of the latent Markov chain is usually unknown.
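A quick simulation (my own illustration, not from the slides) makes the geometric dwell-time property concrete: with self-transition probability $\pi_m(m)$, the expected dwell time is $1/(1 - \pi_m(m))$.

```python
import numpy as np

rng = np.random.default_rng(0)
p_self = 0.9  # self-transition probability pi_m(m)

def dwell():
    """Consecutive steps spent in a state before the chain leaves it."""
    steps = 1
    while rng.random() < p_self:  # each step, stay with probability p_self
        steps += 1
    return steps

samples = [dwell() for _ in range(100_000)]
print(np.mean(samples))  # close to 1 / (1 - p_self) = 10
```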

7 Explicit Duration HMM / Hidden Semi-Markov Model [Mitchell et al., 1995, Murphy, 2002, Yu and Kobayashi, 2003, Yu, 2010]

Latent state sequence $z = (\{s_1, r_1\}, \ldots, \{s_T, r_T\})$
Latent state id sequence $s = (s_1, \ldots, s_T)$
Latent remaining duration sequence $r = (r_1, \ldots, r_T)$
State-specific duration distribution $F_r(\lambda_m)$; the other distributions are the same as in the HMM.

An EDHMM transitions between states in a different way than a typical HMM does. Unless $r_t = 0$, the current remaining duration is decremented and the state does not change. If $r_t = 0$, the EDHMM transitions to a state $m \neq s_t$ according to the distribution defined by $\pi_{s_t}$.

Problem: inference requires enumerating possible durations.

8 EDHMM notation

Latent state $z_t = \{s_t, r_t\}$ is a tuple consisting of the state identity and the time left in that state.

$$s_t \mid s_{t-1}, r_{t-1} \sim \begin{cases} \mathbb{I}(s_t = s_{t-1}), & r_{t-1} > 0 \\ \text{Discrete}(\pi_{s_{t-1}}), & r_{t-1} = 0 \end{cases}$$
$$r_t \mid s_t, r_{t-1} \sim \begin{cases} \mathbb{I}(r_t = r_{t-1} - 1), & r_{t-1} > 0 \\ F_r(\lambda_{s_t}), & r_{t-1} = 0 \end{cases}$$
$$y_t \mid s_t \sim F_\theta(\theta_{s_t})$$
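A direct transcription of these conditionals into code, as a sketch (assuming Poisson durations for $F_r$ and a row-stochastic $\pi$ with zero diagonal):

```python
import numpy as np

rng = np.random.default_rng(1)

def edhmm_step(s_prev, r_prev, pi, lam):
    """Sample (s_t, r_t) given (s_{t-1}, r_{t-1}) under the EDHMM dynamics."""
    if r_prev > 0:
        # mid-segment: hold the state and decrement the countdown
        return s_prev, r_prev - 1
    # duration expired: switch state, then draw a fresh remaining duration
    s = rng.choice(len(lam), p=pi[s_prev])
    r = rng.poisson(lam[s])
    return s, r
```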

9 EDHMM: Graphical Model

(Figure: plate diagram with per-state parameters $\lambda_m$, $\pi_m$, $\theta_m$, $m = 1, \ldots, K$; duration chain $r_0, r_1, \ldots, r_T$; state chain $s_0, s_1, \ldots, s_T$; observations $y_1, \ldots, y_T$.)

10 Structured HMMs, e.g. the left-to-right HMM [Rabiner, 1989]

Example: chicken pox.
Observations: vital signs.
Latent states: pre-infection, infected, post-infection (disregarding shingles).
State transition structure: you can't go from infected back to pre-infection.
Structured transitions imply zeros in the transition matrix $A$, i.e. $p(s_t = m \mid s_{t-1} = l) = 0$ for $m < l$.
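As a concrete sketch (an assumed three-state example of my own), the left-to-right constraint is just an upper-triangular mask on $A$:

```python
import numpy as np

K = 3  # pre-infection, infected, post-infection
A = np.triu(np.ones((K, K)))       # zeros below the diagonal: no going back
A /= A.sum(axis=1, keepdims=True)  # renormalize each row to a distribution
print(A)
```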

11 Bayesian HMM

We put a prior on the parameters so that the solution conforms to our ideas about what it should look like.

Structured prior examples:
- $A_{i,j} = 0$ (hard constraints)
- $A_{i,j} \propto \sum_{i'} A_{i',j}$ (rich get richer)

Regularization means that we can specify a model with more parameters than could possibly be needed:
- infinite complexity (i.e. $K \to \infty$) avoids many model selection problems
- extra states can be thought of as auxiliary or nuisance variables
- inference requires sampling in a model with countably infinite support

The posterior over latent variables encodes uncertainty about the interpretation of the data.

12 Bayesian HMM

(Figure: the HMM graphical model with prior $H_z$ on the $\pi_m$ and prior $H_y$ on the $\theta_m$.)

$\pi_m \sim H_z$, $\theta_m \sim H_y$
$z_t \mid z_{t-1} = m \sim \text{Discrete}(\pi_m)$
$y_t \mid z_t = m \sim F_\theta(\theta_m)$

13 Infinite HMMs (IHMM) [Beal et al., 2002, Teh et al., 2006]

(Figure: hierarchical graphical model with concentration parameters $c_0$, $c_1$ and base weight $\gamma$; the plate size $K \to \infty$.)

See also the Sticky Infinite HMM [Fox et al., 2011].

14 Inference for the Explicit Duration HMM (EDHMM)

Simple idea: infinite HMMs and EDHMMs share a fundamental characteristic, countable support, so inference techniques for Bayesian nonparametric (infinite) HMMs can be used for EDHMM inference.

Result: an approximation-free, efficient inference algorithm for the EDHMM.

15 Bayesian EDHMM: Graphical Model

(Figure: the EDHMM graphical model with priors $H_r$ on the duration parameters $\lambda_m$, $H_z$ on the transition distributions $\pi_m$, and $H_y$ on the emission parameters $\theta_m$.)

A choice of prior:
$\lambda_m \mid H_r \sim \text{Gamma}(H_r)$
$\pi_m \mid H_z \sim \text{Dir}(1/K, 1/K, \ldots, 1/K, 0, 1/K, \ldots, 1/K, 1/K)$
The 0 sits in the $m$-th position: self-transitions are forbidden, since the state only changes when its duration expires.
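A sketch of drawing one such transition distribution (my own illustration; the assumption is that the structural zero is realized by zeroing the corresponding gamma variate in the standard gamma construction of a Dirichlet draw):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_pi_m(K, m):
    """pi_m ~ Dir(1/K, ..., 0 in position m, ..., 1/K): no self-transition mass."""
    g = rng.gamma(np.full(K, 1.0 / K))  # gamma variates with shape 1/K
    g[m] = 0.0                          # zero concentration forces pi_m(m) = 0
    return g / g.sum()

print(sample_pi_m(5, 2))  # entry 2 is exactly zero
```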

16 EDHMM Inference: Beam Sampling [Dewar, Wiggins and W, 2011]

We employ the forward-filtering, backward slice-sampling approach for the IHMM of [Van Gael et al., 2008], in which the state and duration variables $s$ and $r$ are sampled conditioned on auxiliary slice variables $u$.

Net result: an efficient, always-finite forward-backward procedure for sampling the latent states and the amount of time spent in them.

17 Auxiliary Variables for Sampling

Objective: get samples of $x$.

(Figure: graphical model $\lambda \to x$.)

18 Auxiliary Variables for Sampling

Objective: get samples of $x$. Sometimes it is easier to introduce an auxiliary variable $u$, Gibbs sample the joint $P(x, u)$ (i.e. sample from $P(x \mid u, \lambda)$, then $P(u \mid x, \lambda)$, and so on), and then discard the $u$ values, than it is to sample directly from $P(x \mid \lambda)$.

(Figure: graphical models $\lambda \to x$ and $\lambda \to x \to u$.)

19 Auxiliary Variables for Sampling (continued)

This is useful when $P(x \mid \lambda)$ does not have a known parametric form but adding $u$ results in one, and when $x$ has countable support so that sampling it directly would require enumerating all its values.

20 Slice Sampling: A Very Useful Auxiliary Variable Sampling Trick

Pedagogical example: $x \mid \lambda \sim \text{Poisson}(\lambda)$ has countable support, so an enumeration strategy for sampling $x$ is impossible.¹ Introduce an auxiliary variable $u$ with
$$P(u \mid x, \lambda) = \frac{\mathbb{I}(0 \le u \le P(x \mid \lambda))}{P(x \mid \lambda)}$$
Note that the marginal distribution of $x$ is unchanged:
$$P(x \mid \lambda) = \int P(x, u \mid \lambda)\,du = \int P(x \mid \lambda)\,\frac{\mathbb{I}(0 \le u \le P(x \mid \lambda))}{P(x \mid \lambda)}\,du = \int \mathbb{I}(0 \le u \le P(x \mid \lambda))\,du = P(x \mid \lambda)$$

¹ Sampling such countable-support variables is necessary in the EDHMM.

21 Slice Sampling: A Very Useful Auxiliary Variable Sampling Trick

This suggests a Gibbs sampling scheme that alternates between
$$P(x \mid u, \lambda) \propto \mathbb{I}(u \le P(x \mid \lambda))$$
(finite support: uniform over the values above the slice, so enumeration is possible) and
$$P(u \mid x, \lambda) = \frac{\mathbb{I}(0 \le u \le P(x \mid \lambda))}{P(x \mid \lambda)}$$
(uniform between 0 and $P(x \mid \lambda)$), then discards the $u$ values to arrive at $x$ samples marginally distributed according to $P(x \mid \lambda)$.

(Figure: PoissonPDF$(r_t; \lambda_m = 5)$ with a horizontal slice at height $u_t$; the values of $r_t$ whose probability exceeds $u_t$ form the finite slice.)
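A runnable sketch of this Gibbs scheme for the Poisson example (my own illustration; it relies on the Poisson pmf being unimodal, so the slice $\{x : P(x \mid \lambda) > u\}$ is a contiguous set found by scanning outward from the mode):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
lam = 5.0

def slice_set(u):
    """All x with Poisson(lam) pmf above u: contiguous around the mode."""
    mode, xs = int(lam), []
    x = mode
    while poisson.pmf(x, lam) > u:             # scan right from the mode
        xs.append(x); x += 1
    x = mode - 1
    while x >= 0 and poisson.pmf(x, lam) > u:  # scan left from the mode
        xs.append(x); x -= 1
    return xs

x, samples = 0, []
for _ in range(20_000):
    u = rng.uniform(0.0, poisson.pmf(x, lam))  # u | x ~ Uniform(0, P(x | lam))
    x = int(rng.choice(slice_set(u)))          # x | u ~ Uniform over the slice
    samples.append(x)
print(np.mean(samples))  # close to lam
```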

22 EDHMM Graphical Model with Auxiliary Variables

(Figure: the EDHMM chain over $(s_t, r_t)$ with a slice variable $u_t$ attached to each transition, $t = 1, \ldots, T$, and observations $y_1, \ldots, y_T$.)

23 EDHMM Inference: Beam Sampling

With auxiliary variables defined as
$$p(u_t \mid z_t, z_{t-1}) = \frac{\mathbb{I}(0 < u_t < p(z_t \mid z_{t-1}))}{p(z_t \mid z_{t-1})}$$
and
$$p(z_t \mid z_{t-1}) = p((s_t, r_t) \mid (s_{t-1}, r_{t-1})) = \begin{cases} \mathbb{I}(s_t = s_{t-1})\,\mathbb{I}(r_t = r_{t-1} - 1), & r_{t-1} > 0 \\ \pi_{s_{t-1}}(s_t)\,F_r(r_t; \lambda_{s_t}), & r_{t-1} = 0 \end{cases}$$
one can run standard forward-backward conditioned on the $u$'s. Forward-backward slice sampling only has to consider a finite number of successor states at each timestep.

24 Forward recursion

$$\hat\alpha_t(z_t) = p(z_t, Y_1^t, U_1^t) = \sum_{z_{t-1}} p(z_t, z_{t-1}, Y_1^t, U_1^t) = \sum_{z_{t-1}} p(u_t \mid z_t, z_{t-1})\, p(y_t \mid z_t)\, p(z_t \mid z_{t-1})\, \hat\alpha_{t-1}(z_{t-1})$$
$$= \sum_{z_{t-1}} \mathbb{I}(0 < u_t < p(z_t \mid z_{t-1}))\, p(y_t \mid z_t)\, \hat\alpha_{t-1}(z_{t-1}),$$
since $p(u_t \mid z_t, z_{t-1})\, p(z_t \mid z_{t-1}) = \mathbb{I}(0 < u_t < p(z_t \mid z_{t-1}))$; the indicator restricts the sum to the finitely many predecessors above the slice.
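A sketch of one slice-conditioned forward update (my own illustration, not the authors' code; it assumes Poisson durations and unit-variance Gaussian emissions, and stores $\hat\alpha$ sparsely as a dict from $(s, r)$ to weight):

```python
import numpy as np
from scipy.stats import norm, poisson

def successors(z_prev, u, pi, lam):
    """The finite set {z : p(z | z_prev) > u} carved out by the slice u."""
    s_prev, r_prev = z_prev
    if r_prev > 0:                 # deterministic countdown has probability 1 > u
        return [(s_prev, r_prev - 1)]
    out = []
    for s in range(len(lam)):
        if pi[s_prev, s] <= u:     # pmf <= 1, so no duration can clear the slice
            continue
        r = int(lam[s])            # Poisson pmf is unimodal: scan out from the mode
        while pi[s_prev, s] * poisson.pmf(r, lam[s]) > u:
            out.append((s, r)); r += 1
        r = int(lam[s]) - 1
        while r >= 0 and pi[s_prev, s] * poisson.pmf(r, lam[s]) > u:
            out.append((s, r)); r -= 1
    return out

def forward_step(alpha_prev, u, y, pi, lam, mu):
    """alpha_t from alpha_{t-1}: sum over slice-surviving predecessors only."""
    alpha = {}
    for z_prev, a in alpha_prev.items():
        for z in successors(z_prev, u, pi, lam):
            alpha[z] = alpha.get(z, 0.0) + a * norm.pdf(y, mu[z[0]], 1.0)
    total = sum(alpha.values())
    return {z: a / total for z, a in alpha.items()}  # normalize for stability
```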

25 Backward Sampling

Recursively sample a state sequence from the distribution $p(z_{t-1} \mid z_t, Y, U)$, which can be expressed in terms of the forward variable:
$$p(z_{t-1} \mid z_t, Y, U) \propto p(z_t, z_{t-1}, Y, U) \propto p(u_t \mid z_t, z_{t-1})\, p(z_t \mid z_{t-1})\, \hat\alpha_{t-1}(z_{t-1}) \propto \mathbb{I}(0 < u_t < p(z_t \mid z_{t-1}))\, \hat\alpha_{t-1}(z_{t-1}).$$
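And a matching backward-pass sketch (same assumptions as the forward sketch above): at each step $z_{t-1}$ is resampled from the stored forward weights, restricted to predecessors compatible with the slice.

```python
import numpy as np
from scipy.stats import poisson

def trans_prob(z, z_prev, pi, lam):
    """p(z_t | z_{t-1}) from slide 23, with Poisson durations."""
    s, r = z
    s_prev, r_prev = z_prev
    if r_prev > 0:
        return float(s == s_prev and r == r_prev - 1)
    return pi[s_prev, s] * poisson.pmf(r, lam[s])

def backward_sample(alphas, us, pi, lam, rng):
    """Draw z_{1:T}; alphas[t] and us[t] come from the forward pass at time t."""
    T = len(alphas)
    zs = [None] * T
    keys, w = zip(*alphas[-1].items())
    zs[-1] = keys[rng.choice(len(keys), p=np.array(w) / np.sum(w))]
    for t in range(T - 2, -1, -1):
        # keep only predecessors above the slice for the transition into t+1
        pairs = [(z, a) for z, a in alphas[t].items()
                 if trans_prob(zs[t + 1], z, pi, lam) > us[t + 1]]
        keys, w = zip(*pairs)
        zs[t] = keys[rng.choice(len(keys), p=np.array(w) / np.sum(w))]
    return zs
```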

26 Results

To illustrate EDHMM learning on synthetic data, five hundred datapoints were generated using a 4-state EDHMM with Poisson duration distributions, rates $\lambda = (10, 20, 3, 7)$, and unit-variance Gaussian emission distributions with means $\mu = (-6, -2, 2, 6)$.
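A sketch of generating comparable synthetic data (assumed details not stated on the slide: uniform initial state and uniform transitions among the other states):

```python
import numpy as np

rng = np.random.default_rng(4)
K, T = 4, 500
lam = np.array([10.0, 20.0, 3.0, 7.0])        # Poisson duration rates
mu = np.array([-6.0, -2.0, 2.0, 6.0])         # Gaussian emission means, unit variance
pi = (np.ones((K, K)) - np.eye(K)) / (K - 1)  # uniform over the other states

s = rng.integers(K)
r = rng.poisson(lam[s])
states, ys = [], []
for t in range(T):
    if r == 0:                 # duration expired: switch state, redraw duration
        s = rng.choice(K, p=pi[s])
        r = rng.poisson(lam[s])
    else:
        r -= 1
    states.append(s)
    ys.append(rng.normal(mu[s], 1.0))
```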

27 EDHMM: Synthetic Data Results

(Figure: emissions and inferred state means over time.)

28 EDHMM: Synthetic Data, State Mean Posterior

(Figure: histogram of posterior samples of the emission means.)

29 EDHMM: Synthetic Data, State Duration Parameter Posterior

(Figure: histogram of posterior samples of the duration rates.)

30 EDHMM: Nanoscale Transistor Spontaneous Voltage Fluctuation

(Figure: emissions and inferred state means over time.)

31 EDHMM: States Not Distinguishable By Output

Task: understand a system whose states have identical observation distributions and differ only in their duration distributions. Here the observation distributions have means $\mu_1 = 0$, $\mu_2 = 0$, and $\mu_3 = 3$, and the duration distributions have rates $\lambda_1 = 5$, $\lambda_2 = 15$, $\lambda_3 = 20$.

(Next slide) a) observations; b) true states overlaid with 20 randomly selected state traces produced by the sampler; c) samples from the posterior observation distribution means; d) samples from the posterior duration distribution rates.

32 EDHMM: States Not Distinguishable By Output

(Figure, four panels: (a) observations $y_t$ over time; (b) states $x_t$ over time; (c) $p(\mu_j \mid Y)$; (d) $p(\lambda_j \mid Y)$.)

33 EDHMM vs. IHMM: Modeling the Morse Code Cepstrum

34 Wrap-Up

A novel inference procedure for EDHMMs that doesn't require truncation and is more efficient than considering all possible durations.

(Figure: mean number of transitions considered per time point by the beam sampler over 1000 post-burn-in iterations. Compare to the $(KT)^2 = O(10^6)$ transitions that would need to be considered by standard forward-backward without truncation, the only surely safe, truncation-free alternative.)

35 Future Work

- A novel Gamma process construction for dependent, structured, infinite-dimensional HMM transition distributions
- Generalizing to a spatial prior on HMM states ("location")
- Simultaneous localization and mapping
- Process diagram modeling for systems biology
- Applications; seeking users

36 Questions? Thank you!

37 Bibliography I

M. J. Beal, Z. Ghahramani, and C. E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, 2002.
M. Dewar, C. Wiggins, and F. Wood. Inference in hidden Markov models with explicit state duration distributions. In submission, 2011.
E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky. A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A), 2011.
F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997.
B. H. Juang and L. R. Rabiner. Mixture autoregressive hidden Markov models for speech signals. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(6), 1985.

38 Bibliography II

A. Krogh, M. Brown, I. S. Mian, K. Sjölander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modelling. Journal of Molecular Biology, 235, 1994.
C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
C. Mitchell, M. Harper, and L. Jamieson. On the complexity of explicit duration HMMs. IEEE Transactions on Speech and Audio Processing, 3(3), 1995.
K. P. Murphy. Hidden semi-Markov models (HSMMs). Technical report, MIT, 2002.
R. Nag, K. Wong, and F. Fallside. Script recognition using hidden Markov models. In ICASSP, 1986.
L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989.

39 Bibliography III

T. Rydén, T. Teräsvirta, and S. Åsbrink. Stylized facts of daily return series and the hidden Markov model. Journal of Applied Econometrics, 13(3), 1998.
D. O. Tanguay Jr. Hidden Markov models for gesture recognition. Thesis, Massachusetts Institute of Technology, 1995.
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 2006.
J. Van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
A. D. Wilson and A. F. Bobick. Parametric hidden Markov models for gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 1999.

40 Bibliography IV

S. Yu and H. Kobayashi. An efficient forward-backward algorithm for an explicit-duration hidden Markov model. IEEE Signal Processing Letters, 10(1):11-14, 2003.
S. Z. Yu. Hidden semi-Markov models. Artificial Intelligence, 174(2), 2010.
