The Infinite Hidden Markov Model


1 The Infinite Hidden Markov Model
Matthew J. Beal
Gatsby Computational Neuroscience Unit, University College London
March 2002
Joint work with Zoubin Ghahramani and Carl E. Rasmussen

2 Abstract
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite; consider, for example, symbols being possible words appearing in English text.

3 Overview
- Introduce the idea of an HMM with an infinite number of states and symbols.
- Make use of the infinite Dirichlet process, and generalise this to a two-level hierarchical Dirichlet process over transitions and states.
- Explore the properties of the prior over trajectories.
- Present inference and learning procedures for the infinite HMM.

4 E.g. Infinite Gaussian Mixtures
Following Neal (1991), Rasmussen (2000) showed that it is possible to do inference in countably infinite mixtures of Gaussians.

P(x_1, …, x_N | π, µ, Σ) = ∏_{i=1}^N ∑_{j=1}^K π_j N(x_i | µ_j, Σ_j) = ∑_s P(s, x | π, µ, Σ) = ∑_s ∏_{i=1}^N ∏_{j=1}^K [π_j N(x_i | µ_j, Σ_j)]^{δ(s_i, j)}

The joint distribution of the indicators is multinomial:

P(s_1, …, s_N | π) = ∏_{j=1}^K π_j^{n_j},   n_j = ∑_{i=1}^N δ(s_i, j).

The mixing proportions are given a symmetric Dirichlet prior:

P(π | β) = Γ(β) / Γ(β/K)^K · ∏_{j=1}^K π_j^{β/K − 1}

5 Infinite Mixtures (continued)
The joint distribution of the indicators is multinomial:

P(s_1, …, s_N | π) = ∏_{j=1}^K π_j^{n_j},   n_j = ∑_{i=1}^N δ(s_i, j).

The mixing proportions are given a symmetric, conjugate Dirichlet prior:

P(π | β) = Γ(β) / Γ(β/K)^K · ∏_{j=1}^K π_j^{β/K − 1}

Integrating out the mixing proportions we obtain

P(s_1, …, s_N | β) = ∫ dπ P(s_1, …, s_N | π) P(π | β) = Γ(β)/Γ(N + β) · ∏_{j=1}^K Γ(n_j + β/K)/Γ(β/K)

This yields a Dirichlet process over the indicator variables: the probability of a set of states is only a function of the indicator counts. Analytical tractability of this integral means we can work with the finite number of indicators rather than an infinite-dimensional mixing proportion.
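Because the collapsed probability depends on the assignments only through the counts n_j, it is cheap to evaluate. The following is a minimal sketch (not from the slides; the function name is illustrative) of its logarithm using log-gamma functions:

```python
# Log of the collapsed probability P(s_1..N | beta) for the finite-K
# symmetric Dirichlet model above, evaluated from the counts n_j.
import numpy as np
from scipy.special import gammaln

def log_prob_indicators(counts, beta):
    """log Gamma(beta) - log Gamma(N + beta)
       + sum_j [log Gamma(n_j + beta/K) - log Gamma(beta/K)]."""
    counts = np.asarray(counts, dtype=float)
    K = len(counts)
    N = counts.sum()
    return (gammaln(beta) - gammaln(N + beta)
            + np.sum(gammaln(counts + beta / K) - gammaln(beta / K)))

# Example: 10 points occupying 3 of K = 5 possible classes.
print(log_prob_indicators([4, 3, 3, 0, 0], beta=1.0))
```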

6 Dirichlet Process Conditional Probabilities
Conditional probabilities, finite K:

P(s_i = j | s_{−i}, β) = (n_{−i,j} + β/K) / (N − 1 + β)

where s_{−i} denotes all indicators except the i-th, and n_{−i,j} is the total number of observations with indicator j, excluding the i-th.

DP: more populous classes are more likely to be joined.

Conditional probabilities, infinite K: taking the limit as K → ∞ yields the conditionals

P(s_i = j | s_{−i}, β) = n_{−i,j} / (N − 1 + β)   for each represented j,
P(s_i = j | s_{−i}, β) = β / (N − 1 + β)   for all unrepresented j combined.

The left-over mass β is shared among the countably infinite number of unrepresented indicator settings. Gibbs sampling from the posterior of the indicators is easy!
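The infinite-K conditional above is exactly the Chinese-restaurant-process update used in Gibbs sampling. Below is a hedged sketch (names are illustrative, not from the slides) of resampling one indicator from this prior conditional; in a mixture model each option would additionally be weighted by the likelihood of x_i under that class.

```python
# Resample s[i] | s[-i], beta: an existing class j with probability
# proportional to n_{-i,j}, or a brand-new class with probability
# proportional to beta (denominator N - 1 + beta in both cases).
import numpy as np

def crp_resample_indicator(s, i, beta, rng):
    others = np.delete(s, i)
    classes, counts = np.unique(others, return_counts=True)
    weights = np.append(counts.astype(float), beta)   # last entry = new class
    probs = weights / weights.sum()
    choice = rng.choice(len(weights), p=probs)
    if choice == len(classes):                        # open a new class
        return (others.max() + 1) if len(others) else 0
    return classes[choice]

rng = np.random.default_rng(0)
s = np.array([0, 0, 1, 1, 1, 2])
s[3] = crp_resample_indicator(s, 3, beta=0.5, rng=rng)
```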

7 Infinite Hidden Markov Models
(Graphical model: hidden-state chain S_1 → S_2 → … → S_T, each S_t emitting a symbol Y_t.)
Motivation: we want to model data with HMMs without worrying about overfitting, picking the number of states, picking architectures, ...

8 Review of Hidden Markov Models (HMMs)
Generative graphical model: hidden states s_t, emitted symbols y_t.
(Graphical model: chain S_1 → S_2 → … → S_T with emissions Y_1, …, Y_T.)
The hidden state evolves as a Markov process:

P(s_{1:T} | A) = P(s_1 | π^0) ∏_{t=1}^{T−1} P(s_{t+1} | s_t),   P(s_{t+1} = j | s_t = i) = A_{ij},   i, j ∈ {1, …, K}.

Observation model: e.g. discrete symbols y_t from an alphabet, produced according to an emission matrix

P(y_t = l | s_t = i) = E_{il}.
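As a small illustrative sketch (assumed, not from the slides), ancestral sampling from this finite HMM draws s_1 from π^0, each y_t from the row E[s_t], and each s_{t+1} from the row A[s_t]:

```python
# Ancestral sampling from a finite HMM with initial distribution pi0,
# transition matrix A (A[i, j] = P(s_{t+1}=j | s_t=i)) and emission
# matrix E (E[i, l] = P(y_t=l | s_t=i)).
import numpy as np

def sample_hmm(pi0, A, E, T, rng):
    K, L = E.shape
    states = np.empty(T, dtype=int)
    symbols = np.empty(T, dtype=int)
    states[0] = rng.choice(K, p=pi0)
    for t in range(T):
        symbols[t] = rng.choice(L, p=E[states[t]])
        if t + 1 < T:
            states[t + 1] = rng.choice(K, p=A[states[t]])
    return states, symbols

rng = np.random.default_rng(1)
pi0 = np.array([1.0, 0.0])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.7, 0.3], [0.1, 0.9]])
print(sample_hmm(pi0, A, E, T=10, rng=rng))
```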

9 Limitations of the standard HMM
There are some important limitations of the standard estimation procedure and the model definition of HMMs:
- Maximum likelihood methods do not consider the complexity of the model, leading to over- or under-fitting.
- The model structure has to be specified in advance.
- In many scenarios it is unreasonable to believe a priori that the data were generated from a fixed number of discrete states.
- We don't know what form the state-transition matrix should take.
- We might not want to restrict ourselves to a finite alphabet of symbols.
Previous authors (MacKay, 1997; Stolcke and Omohundro, 1993) have attempted to approximate a Bayesian analysis.

10 Generative model for the hidden state
Propose a transition to s_{t+1} conditional on the current state s_t. Existing transitions are more probable, thus giving rise to typical trajectories. With transition counts n_{ij} out of state s_t = i (a sampling sketch follows this list):

- self-transition (j = i): probability (n_{ii} + α) / (∑_j n_{ij} + β + α)
- existing transition to j: probability n_{ij} / (∑_j n_{ij} + β + α)
- oracle: probability β / (∑_j n_{ij} + β + α)

If the oracle is consulted, propose according to the oracle occupancies n_j^o; previously chosen oracle states are more probable:

- existing state j: probability n_j^o / (∑_j n_j^o + γ)
- new state: probability γ / (∑_j n_j^o + γ)
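A hedged sketch of this two-level draw is given below. Function and variable names are illustrative (this is not the authors' code); n is the transition count matrix and n_oracle the oracle occupancies described above.

```python
# Sample the next state from current state i under the two-level process:
# first choose self / existing / oracle, then (if oracle) choose an
# existing state by occupancy or open a brand-new state.
import numpy as np

def sample_next_state(i, n, n_oracle, alpha, beta, gamma, rng):
    K = n.shape[0]
    row = n[i].astype(float)
    denom = row.sum() + beta + alpha
    probs = row.copy()
    probs[i] += alpha                        # extra mass on the self-transition
    probs = np.append(probs, beta) / denom   # last entry = consult the oracle
    j = rng.choice(K + 1, p=probs)
    if j < K:
        return j, False                      # ordinary (or self) transition
    occ = n_oracle.astype(float)
    oracle_probs = np.append(occ, gamma) / (occ.sum() + gamma)
    j = rng.choice(K + 1, p=oracle_probs)
    return j, True                           # j == K means a brand-new state

rng = np.random.default_rng(2)
n = np.array([[5, 1, 0], [0, 3, 2], [1, 0, 4]])
n_oracle = np.array([2, 1, 1])
print(sample_next_state(0, n, n_oracle, alpha=1.0, beta=1.0, gamma=1.0, rng=rng))
```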

11 Trajectories under the Prior
(Figure: sample state trajectories drawn from the prior under four hyperparameter settings.)
- explorative: α = 0.1, β = 1000, γ = 100
- repetitive: α = 0, β = 0.1, γ = 100
- self-transitioning: α = 2, β = 2, γ = 20
- ramping: α = 1, β = 1, γ =

Just 3 hyperparameters provide:
- slow/fast dynamics (α)
- sparse/dense transition matrices (β)
- many/few states (γ)
- left-right structure, with multiple interacting cycles

12 Real data: Lewis Carroll's Alice's Adventures in Wonderland
(Figure: word identity vs. word position in the text.)
With a finite alphabet, a model would assign zero likelihood to a test sequence containing any symbols not present in the training set(s). In iHMMs, at each time step the hidden state s_t emits a symbol y_t, which can come from a potentially infinite alphabet.
- state s_t → state s_{t+1}: transition process, counts n_{ij}
- state s_t → symbol y_t: emission process, counts m_{iq}

13 Infinite HMMs
(Graphical model: chain S_1 → S_2 → … → S_T with emissions Y_1, …, Y_T.)
Approach: countably infinite hidden states. Deal with both the transition and the emission process using a two-level hierarchical Dirichlet process.

Transition process (from state s_t = i, counts n_{ij}, oracle counts n_j^o):
- self-transition (j = i): (n_{ii} + α) / (∑_j n_{ij} + β + α)
- existing transition: n_{ij} / (∑_j n_{ij} + β + α)
- oracle: β / (∑_j n_{ij} + β + α), then an existing state with n_j^o / (∑_j n_j^o + γ) or a new state with γ / (∑_j n_j^o + γ)

Emission process (from state s_t = i, counts m_{iq}, oracle counts m_q^o):
- existing emission of symbol q: m_{iq} / (∑_q m_{iq} + β_e)
- oracle: β_e / (∑_q m_{iq} + β_e), then an existing symbol with m_q^o / (∑_q m_q^o + γ_e) or a new symbol with γ_e / (∑_q m_q^o + γ_e)

Gibbs sampling over the states is possible, while all parameters are implicitly integrated out; only five hyperparameters need to be inferred (Beal, Ghahramani, and Rasmussen, 2001).

14 Inference and learning
For a given observed sequence:

Inference: what is the distribution over hidden-state trajectories? In this model we can straightforwardly sample the hidden states using Gibbs sampling. The Gibbs probabilities are

P(s_t | s_{−t}, y_{1:T}, α, β, γ, β_e, γ_e).

Unfortunately exact calculation takes in general O(T^2); we can use a sampling argument to reduce this to O(T).

Learning: what can we infer about the transition and emission processes? This involves updating the hyperparameters α, β, γ, β_e and γ_e.

Q. How do we learn the transition and emission parameters?
A. With infinite Dirichlet priors, they are represented non-parametrically as count matrices, capturing countably infinite state and symbol spaces.
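As a structural illustration only (assumed, not the authors' code), one Gibbs update for a single state s_t multiplies the incoming transition, outgoing transition and emission predictive probabilities, with the final index standing for "a new state". The helpers trans_prob(i, j) and emit_prob(i, q) are placeholders for the table-based predictive distributions of the previous slides, and this sketch ignores the count adjustments the exact conditional requires.

```python
# One simplified Gibbs update: resample states[t] proportionally to
# P(s_t | s_{t-1}) * P(s_{t+1} | s_t) * P(y_t | s_t).
import numpy as np

def gibbs_update_state(t, states, symbols, K_plus_1, trans_prob, emit_prob, rng):
    T = len(states)
    w = np.ones(K_plus_1)
    for k in range(K_plus_1):
        if t > 0:
            w[k] *= trans_prob(states[t - 1], k)   # transition into candidate k
        if t + 1 < T:
            w[k] *= trans_prob(k, states[t + 1])   # transition out of candidate k
        w[k] *= emit_prob(k, symbols[t])           # emission of the observed symbol
    states[t] = rng.choice(K_plus_1, p=w / w.sum())
    return states[t]

rng = np.random.default_rng(3)
states = np.array([0, 1, 0, 1])
symbols = np.array([0, 1, 1, 0])
uniform = lambda a, b: 1.0 / 3.0     # toy placeholder probabilities
print(gibbs_update_state(2, states, symbols, 3, uniform, uniform, rng))
```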

15 Hyperparameter updates
In a standard Dirichlet process with concentration β, the likelihood of observing counts {n_j} is, as a function of β,

P({n_j}_{j=1,…,K} | β) ∝ β^K · Γ(β)/Γ(n + β),   where n = ∑_j n_j.

More specifically, in this model the likelihood contribution from (α, β) for row i of the transition counts is approximately*

P({n_{i1}, …, n_{ii}, …, n_{iK}} | α, β) = β^{K^{(i)} − 1} · Γ(α + β)/Γ(∑_j n_{ij} + α + β) · Γ(n_{ii} + α)/Γ(α)

* All counts are replaced by their relevant expectations over Gibbs sweeps.
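A sketch of the log of the row-wise (α, β) likelihood above, which could be fed to any one-dimensional sampler or optimiser for hyperparameter updates. It assumes K^{(i)} is the number of represented destination states in row i, which is an interpretation of the slide's notation, not something the source states explicitly.

```python
# log [ beta^{K_i - 1} * Gamma(alpha+beta) / Gamma(sum_j n_ij + alpha + beta)
#       * Gamma(n_ii + alpha) / Gamma(alpha) ]
import numpy as np
from scipy.special import gammaln

def log_row_likelihood(n_row, i, alpha, beta):
    n_row = np.asarray(n_row, dtype=float)
    K_i = np.count_nonzero(n_row)   # assumed: represented destinations from state i
    return ((K_i - 1) * np.log(beta)
            + gammaln(alpha + beta) - gammaln(n_row.sum() + alpha + beta)
            + gammaln(n_row[i] + alpha) - gammaln(alpha))

print(log_row_likelihood([5, 1, 0, 2], i=0, alpha=1.0, beta=0.5))
```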

16 A toy example: ascending-descending
ABCDEFEDCBABCDEFEDCBABCDEFEDCBABCDEFEDCB...
This requires minimally 10 hidden states to capture.
(Figure: number of represented states K vs. Gibbs sweeps.)

17 Synthetic examples: expansive & compressive HMMs
(Figure: true and learned transition and emission probabilities/count matrices, up to permutation of the hidden states; lighter boxes correspond to higher values. Top row: expansive HMM, with count-matrix pairs {n, m} displayed after {1, 80, 150} sweeps of Gibbs sampling. Bottom row: compressive HMM, displaying count matrices after {1, 100, 230} sweeps of Gibbs sampling.)
See hmm2.avi and hmm3.avi.

18 Real data: Alice
Trained on the 1st chapter (10787 characters: A–Z, space, period) = 2046 words.
iHMM initialized with a random sequence of 30 states; α = 0, β = β_e = γ = γ_e.
Gibbs sweeps took several hours in Matlab.
The n matrix starts out full and ends up sparse (14% full).

200-character fantasies:
Sweep 1: LTEMFAODEOHADONARNL SAE UDSEN DTTET ETIR NM H VEDEIPH L.SHYIPFADB OMEBEGLSENTEI GN HEOWDA EELE HEFOMADEE IS AL THWRR KH TDAAAC CHDEE OIGW OHRBOOLEODT DSECT M OEDPGTYHIHNOL CAEGTR.ROHA NOHTR.L
Sweep 250: AREDIND DUW THE JEDING THE BUBLE MER.FION SO COR.THAN THALD THE BATHERSTHWE ICE WARLVE I TOMEHEDS I LISHT LAKT ORTH.A CEUT.INY OBER.GERD POR GRIEN THE THIS FICE HIGE TO SO.A REMELDLE THEN.SHILD TACE G
Sweep 500: H ON ULY KER IN WHINGLE THICHEY TEIND EARFINK THATH IN ATS GOAP AT.FO ANICES IN RELL A GOR ARGOR PEN EUGUGTTHT ON THIND NOW BE WIT OR ANND YOADE WAS FOUE CAIT DOND SEAS HAMBER ANK THINK ME.HES URNDEY
Sweep 1000: BUT.THOUGHT ANGERE SHERE ACRAT OR WASS WILE DOOF SHE.WAS ABBORE GLEAT DOING ALIRE AT TOO AMIMESSOF ON SHAM LUZDERY AMALT ANDING A BUPLA BUT THE LIDTIND BEKER HAGE FEMESETIMEY BUT NOTE GD I SO CALL OVE

19 Alice results: number of states and hyperparameters
(Figure: evolution of the hyperparameters β, γ, β_e, γ_e and the number of represented states K over Gibbs sweeps.)

20 Likelihood calculation: the infinite-state particle filter
The likelihood of a particular observed sequence of symbols involves intractable sums over the possible hidden-state trajectories:
- Integrating out the parameters in any HMM induces long-range dependencies between the states, so we cannot use standard tricks like dynamic programming.
- The number of distinct states can grow with the sequence length as new states are generated. Starting with K distinct states, at time t there could be K + t possible distinct states, making the total number of trajectories over the entire sequence (K + T)!/K!.
We propose estimating the likelihood of a test sequence given a learned model using particle filtering. The idea is to start with some number R of particles distributed on the represented hidden states according to the final state marginal from the training sequence (some of the R may fall onto new states).

21 The infinite-state particle filter (cont.)
Starting from the set of particles {s_t^1, …, s_t^R}, the tables from the training sequences {n, n^o, m, m^o}, and t = 1, the recursive procedure is as specified below (a code sketch follows this list), where P(s_t | y_1, …, y_{t−1}) ≈ (1/R) ∑_r δ(s_t, s_t^r):
1. Compute l_t^r = P(y_t | s_t = s_t^r) for each particle r.
2. Calculate l_t = (1/R) ∑_r l_t^r ≈ P(y_t | y_1, …, y_{t−1}).
3. Resample R particles from s_t^r ~ (1 / ∑_r l_t^r) ∑_r l_t^r δ(s_t, s_t^r).
4. Update the transition and emission tables n^r, m^r for each particle.
5. For each r sample the forward dynamics: s_{t+1}^r ~ P(s_{t+1} | s_t = s_t^r, n^r, m^r); this may cause particles to land on novel states. Update n^r and m^r.
6. If t < T, go to step 1 with t = t + 1.
The log-likelihood of the test sequence is computed as ∑_t log l_t. Since it is a discrete state space, with much of the probability mass concentrated on the represented states, it is feasible to use O(K) particles.
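Below is a compact sketch of this recursion under stated assumptions: emit_lik(state, symbol) stands in for step 1's predictive likelihood and sample_forward(state) for step 5's table-based forward draw; the per-particle table updates of steps 4-5 are omitted, and all names are illustrative rather than the authors' API.

```python
# Particle-filter estimate of the test-sequence log-likelihood:
# weight by the emission likelihood, accumulate log of the mean weight,
# resample, then propagate each particle forward one step.
import numpy as np

def particle_filter_loglik(symbols, particles, emit_lik, sample_forward, rng):
    particles = np.array(particles)
    R = len(particles)
    loglik = 0.0
    for y in symbols:
        l = np.array([emit_lik(s, y) for s in particles])    # step 1
        loglik += np.log(l.mean())                           # step 2: P(y_t | y_<t)
        idx = rng.choice(R, size=R, p=l / l.sum())           # step 3: resample
        particles = particles[idx]
        # step 4 (table updates) omitted in this sketch
        particles = np.array([sample_forward(s) for s in particles])   # step 5
    return loglik

rng = np.random.default_rng(4)
emit_lik = lambda s, y: 0.5                  # toy placeholders
sample_forward = lambda s: rng.integers(3)
print(particle_filter_loglik([0, 1, 1], particles=[0, 1, 2],
                             emit_lik=emit_lik, sample_forward=sample_forward,
                             rng=rng))
```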

22 Conclusions
- It is possible to extend HMMs to a countably infinite number of hidden states using a hierarchical Dirichlet process. The resulting model is a non-parametric Bayesian HMM.
- With just three hyperparameters, prior trajectories demonstrate rich, appealing structure.
- We can use the same process for emissions.
- Tools: an O(T) approximate Gibbs sampler for inference, equations for hyperparameter learning, and a particle-filtering scheme for likelihood evaluation.
- Presented empirical results on toy data sets and Alice.

23 Some References
1. Attias, H. (1999) Inferring parameters and structure of latent variable models by variational Bayes. Proc. 15th Conference on Uncertainty in Artificial Intelligence.
2. Barber, D. and Bishop, C. M. (1998) Ensemble learning for multilayer networks. Advances in Neural Information Processing Systems.
3. Bishop, C. M. (1999) Variational principal components. Proceedings of the Ninth International Conference on Artificial Neural Networks, ICANN 99.
4. Beal, M. J., Ghahramani, Z. and Rasmussen, C. E. (2001) The infinite hidden Markov model. To appear in NIPS.
5. Ghahramani, Z. and Beal, M. J. (1999) Variational inference for Bayesian mixtures of factor analysers. In Neural Information Processing Systems.
6. Ghahramani, Z. and Beal, M. J. (2000) Propagation algorithms for variational Bayesian learning. In Neural Information Processing Systems.
7. Hinton, G. E. and van Camp, D. (1993) Keeping neural networks simple by minimizing the description length of the weights. In Proc. 6th Annual Workshop on Computational Learning Theory. ACM Press, New York, NY.
8. MacKay, D. J. C. (1995) Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6.
9. Miskin, J. and MacKay, D. J. C. Ensemble learning independent component analysis for blind separation and deconvolution of images. In Advances in Independent Component Analysis, M. Girolami (ed.), Springer, Berlin.
10. Neal, R. M. (1991) Bayesian mixture modeling by Monte Carlo simulation. Technical Report CRG-TR-91-2, Dept. of Computer Science, University of Toronto, 23 pages.
11. Neal, R. M. (1994) Priors for infinite networks. Technical Report CRG-TR-94-1, Dept. of Computer Science, University of Toronto, 22 pages.

24 Some References (continued)
12. Rasmussen, C. E. (1996) Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. Ph.D. thesis, Graduate Department of Computer Science, University of Toronto.
13. Rasmussen, C. E. (1999) The infinite Gaussian mixture model. Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen and K.-R. Müller (eds.), MIT Press (2000).
14. Rasmussen, C. E. and Ghahramani, Z. (2000) Occam's razor. Advances in Neural Information Processing Systems 13, MIT Press (2001).
15. Rasmussen, C. E. and Ghahramani, Z. (2001) Infinite mixtures of Gaussian process experts. In NIPS.
16. Ueda, N. and Ghahramani, Z. (2000) Optimal model inference for Bayesian mixtures of experts. IEEE Neural Networks for Signal Processing, Sydney, Australia.
17. Waterhouse, S., MacKay, D. J. C. and Robinson, T. (1996) Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
18. Williams, C. K. I. and Rasmussen, C. E. (1996) Gaussian processes for regression. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (eds.).
