Bayesian Hidden Markov Models and Extensions

Bayesian Hidden Markov Models and Extensions. Zoubin Ghahramani, Department of Engineering, University of Cambridge. Joint work with Matt Beal, Jurgen van Gael, Yunus Saatci, Tom Stepleton, and Yee Whye Teh.

Modeling time series. A sequence of observations y_1, y_2, y_3, ..., y_t: for example, a sequence of images, speech signals, stock prices, kinematic variables in a robot, sensor readings from an industrial process, amino acids, etc. Goal: to build a probabilistic model of the data, i.e. something that can predict p(y_t | y_{t-1}, y_{t-2}, y_{t-3}, ...).

Causal structure and hidden variables. Speech recognition: x = underlying phonemes or words, y = acoustic waveform. Vision: x = object identities, poses, illumination, y = image pixel values. Industrial monitoring: x = current state of molten steel in caster, y = temperature and pressure sensor readings. Two frequently-used tractable models: linear-Gaussian state-space models and hidden Markov models.

Graphical Model for HMM. Discrete hidden states s_t ∈ {1, ..., K} and outputs y_t (discrete or continuous). The joint probability factorizes as

P(s_1, ..., s_τ, y_1, ..., y_τ) = P(s_1) P(y_1 | s_1) ∏_{t=2}^{τ} P(s_t | s_{t-1}) P(y_t | s_t).

An HMM can be viewed either as a Markov chain with stochastic measurements or as a mixture model with states coupled across time.
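
To make the factorization concrete, here is a minimal sketch (not from the talk) that evaluates the log joint probability of a discrete-output HMM; the argument names pi0, A and B are conventions introduced here for illustration.

```python
import numpy as np

def hmm_log_joint(s, y, pi0, A, B):
    """Log joint probability log P(s_1..s_T, y_1..y_T) for a discrete-output HMM.

    s   : int array (T,)  hidden state sequence, values in {0, ..., K-1}
    y   : int array (T,)  observation sequence (discrete symbols here)
    pi0 : (K,)            initial state distribution P(s_1)
    A   : (K, K)          transition matrix, A[i, j] = P(s_t = j | s_{t-1} = i)
    B   : (K, V)          emission matrix, B[k, v] = P(y_t = v | s_t = k)
    """
    logp = np.log(pi0[s[0]]) + np.log(B[s[0], y[0]])
    for t in range(1, len(s)):
        logp += np.log(A[s[t - 1], s[t]]) + np.log(B[s[t], y[t]])
    return logp
```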

Hidden Markov Models. Hidden Markov models (HMMs) are widely used, but how do we choose the number of hidden states? Variational Bayesian learning of HMMs. A non-parametric Bayesian approach: infinite HMMs. Can we extract richer structure from sequences by grouping together states in an HMM? Block-diagonal iHMMs. A single discrete state variable is a poor representation of the history. Can we do better? Factorial HMMs. Can we make factorial HMMs non-parametric? Infinite factorial HMMs and the Markov Indian Buffet Process.

Part I: Variational Bayesian learning of Hidden Markov Models

Bayesian Learning. Apply the basic rules of probability to learning from data. Data set: D = {x_1, ..., x_n}. Models: m, m', etc. Model parameters: θ. Prior probability of models: P(m), P(m'), etc. Prior probabilities of model parameters: P(θ | m). Model of data given parameters (likelihood model): P(x | θ, m). If the data are independently and identically distributed, then

P(D | θ, m) = ∏_{i=1}^{n} P(x_i | θ, m).

Posterior probability of model parameters:

P(θ | D, m) = P(D | θ, m) P(θ | m) / P(D | m).

Posterior probability of models:

P(m | D) = P(m) P(D | m) / P(D).
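
As a small worked example (mine, not from the slides), the parameter-posterior rule above has a closed form for i.i.d. coin flips with a conjugate Beta prior on the bias:

```python
from scipy.stats import beta

# Worked example: i.i.d. coin flips with a Beta(a, b) prior on the bias theta.
# The likelihood of k heads in n flips is proportional to theta^k (1 - theta)^(n - k),
# so the posterior P(theta | D, m) is Beta(a + k, b + n - k).
a, b = 1.0, 1.0           # prior P(theta | m)
k, n = 7, 10              # data D: 7 heads out of 10 flips
posterior = beta(a + k, b + n - k)
print(posterior.mean())   # posterior mean (a + k) / (a + b + n) = 8/12
```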

Bayesian Occam's Razor and Model Comparison. Compare model classes, e.g. m and m', using their posterior probabilities given D:

p(m | D) = p(D | m) p(m) / p(D),   where   p(D | m) = ∫ p(D | θ, m) p(θ | m) dθ.

Interpretations of the marginal likelihood ("model evidence"): the probability that randomly selected parameters from the prior would generate D; the probability of the data under the model, averaging over all possible parameter values; log_2 (1 / p(D | m)) is the number of bits of surprise at observing data D under model m. Model classes that are too simple are unlikely to generate the data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random. [Figure: P(D | m) plotted over all possible data sets of size n, for models that are too simple, "just right", and too complex.]
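
The first interpretation suggests a naive estimator of the evidence: average the likelihood over parameters drawn from the prior. The sketch below is mine (the helper names log_lik and sample_prior are assumptions), and it checks the estimator on a coin-flip model whose evidence is known analytically.

```python
import numpy as np
from scipy.stats import binom

def marginal_likelihood_mc(log_lik, sample_prior, n_samples, rng):
    """Naive Monte Carlo estimate of the evidence p(D | m).

    p(D | m) = integral of p(D | theta, m) p(theta | m) dtheta, approximated by
    averaging the likelihood over parameters drawn from the prior: literally the
    probability that randomly selected parameters from the prior would generate D.
    (High variance; shown only to make the definition concrete.)
    """
    log_ps = np.array([log_lik(sample_prior(rng)) for _ in range(n_samples)])
    m = log_ps.max()                      # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_ps - m)))

# Check on a model with a known answer: 7 heads in 10 flips, uniform prior on the bias.
rng = np.random.default_rng(0)
estimate = np.exp(marginal_likelihood_mc(lambda p: binom.logpmf(7, 10, p),
                                         lambda r: r.uniform(), 5000, rng))
print(estimate)   # analytic evidence is 1/11, about 0.0909
```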

Bayesian Model Comparison: Occam's Razor at Work. [Figure: a data set fitted with polynomials of order M = 0 through 7, together with the model evidence P(Y | M), which peaks at an intermediate order.] For example, for quadratic polynomials (M = 2): y = a_0 + a_1 x + a_2 x^2 + ε, where ε ~ N(0, σ^2) and the parameters are θ = (a_0, a_1, a_2, σ). demo: polybayes
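
A sketch of the kind of computation behind this demo (mine, not the demo code): assuming, for simplicity, a Gaussian prior on the coefficients and a fixed, known noise variance, the evidence of each polynomial order is available in closed form. The slide's model also treats σ as a parameter; here it is held fixed.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_evidence_polynomial(x, y, order, prior_var=1.0, noise_var=0.1):
    """Log marginal likelihood of a Bayesian polynomial regression model.

    Model: y = Phi(x) @ w + eps, with w ~ N(0, prior_var * I) and
    eps ~ N(0, noise_var * I).  Integrating out w gives
    y ~ N(0, prior_var * Phi Phi^T + noise_var * I), evaluated in closed form.
    """
    Phi = np.vander(x, order + 1, increasing=True)   # design matrix [1, x, x^2, ...]
    cov = prior_var * Phi @ Phi.T + noise_var * np.eye(len(x))
    return multivariate_normal.logpdf(y, mean=np.zeros(len(x)), cov=cov)

# Compare model orders: the evidence typically peaks at the "just right" complexity.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.standard_normal(30)
for M in range(8):
    print(M, log_evidence_polynomial(x, y, M))
```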

Learning Model Structure. How many clusters in the data? What is the intrinsic dimensionality of the data? Is this input relevant to predicting that output? What is the order of a dynamical system? How many states in a hidden Markov model (e.g. for a protein sequence such as SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNGVALMTTY)? How many auditory sources in the input? Which graph structure best models the data? demo: run simple

Variational Bayesian Learning: Lower Bounding the Marginal Likelihood. Let the observed data be D, the hidden state variables be s, and the parameters be θ. Lower bound the marginal likelihood (Bayesian model evidence) using Jensen's inequality:

log P(D | m) = log ∫ Σ_s P(D, s, θ | m) dθ
            = log ∫ Σ_s Q(s, θ) [P(D, s, θ | m) / Q(s, θ)] dθ
            ≥ ∫ Σ_s Q(s, θ) log [P(D, s, θ | m) / Q(s, θ)] dθ.

Here Q(s, θ) is an approximation to the posterior P(s, θ | D, m). Assume Q(s, θ) is a simpler factorised distribution, Q(s, θ) = Q_s(s) Q_θ(θ):

log P(D | m) ≥ ∫ Σ_s Q_s(s) Q_θ(θ) log [P(D, s, θ | m) / (Q_s(s) Q_θ(θ))] dθ = F(Q_s(s), Q_θ(θ), D).

Maximizing this lower bound with respect to Q leads to a generalization of the EM algorithm.

Hidden Markov Models. Discrete hidden states s_t, observations y_t. How many hidden states? What structure for the state-transition matrix? Variational Bayesian HMMs (MacKay 1997; Beal PhD thesis 2003). demo: vbhmm
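
A minimal sketch (mine, following the general VB-HMM recipe rather than the exact code behind the demo) of the step that distinguishes variational Bayesian learning from ordinary EM: the parameters get Dirichlet posteriors, and the E-step uses their expected log values rather than point estimates.

```python
import numpy as np
from scipy.special import digamma

def expected_log(dirichlet_params):
    """E[log p] under independent Dirichlet posteriors (one per row of the array)."""
    return digamma(dirichlet_params) - digamma(dirichlet_params.sum(axis=1, keepdims=True))

def vb_m_step(prior_counts_A, prior_counts_B, exp_trans_counts, exp_emit_counts):
    """VB M-step: Dirichlet posterior counts = prior counts + expected counts from Q_s."""
    return prior_counts_A + exp_trans_counts, prior_counts_B + exp_emit_counts

# VB E-step (sketch): run the usual forward-backward recursions, but with the
# sub-normalized parameters exp(expected_log(post_A)) and exp(expected_log(post_B))
# in place of point estimates of A and B; the resulting expected transition and
# emission counts feed back into vb_m_step.  Iterating the two steps monotonically
# increases the lower bound F(Q_s, Q_theta, D).
```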

Summary of Part I. Bayesian machine learning. Marginal likelihoods and Occam's Razor. Variational Bayesian lower bounds. Application to learning the number of hidden states and structure of an HMM.

Part II: The Infinite Hidden Markov Model

Hidden Markov Models

Choosing the number of hidden states How do we choose K, the number of hidden states, in an HMM? Can we define a model with an unbounded number of hidden states, and a suitable inference algorithm?

Alice in Wonderland

Infinite Hidden Markov models

Hierarchical Urn Scheme for generating transitions in the iHMM (2002). n_ij is the number of previous transitions from i to j; α, β, and γ are hyperparameters. The probability of a transition from i to j is proportional to n_ij; with probability proportional to βγ, jump to a new state.
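
A simplified sketch (mine) of the reuse-or-create step described above. It omits the hierarchical/oracle level that shares mass across rows, so it is a caricature of the scheme rather than the full generative process.

```python
import numpy as np

def sample_next_state(counts, prev, beta, gamma, rng):
    """Sample s_t given s_{t-1} = prev under a simplified urn scheme.

    counts[i, j] is the number of previous i -> j transitions among the K states
    seen so far.  State j is reused with weight counts[prev, j]; a brand-new
    state (returned as index K) is created with weight beta * gamma.
    """
    K = counts.shape[0]
    weights = np.append(counts[prev].astype(float), beta * gamma)
    return rng.choice(K + 1, p=weights / weights.sum())

rng = np.random.default_rng(0)
counts = np.array([[3.0, 1.0],
                   [0.0, 2.0]])
print(sample_next_state(counts, prev=0, beta=1.0, gamma=1.0, rng=rng))
```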

Relating iHMMs to DPMs. The infinite Hidden Markov Model is closely related to Dirichlet Process Mixture (DPM) models. This makes sense: HMMs are time-series generalisations of mixture models; DPMs are a way of defining mixture models with countably infinitely many components; iHMMs are HMMs with countably infinitely many states.

HMMs as sequential mixtures

Infinite Hidden Markov Models

Infinite Hidden Markov Models

Infinite Hidden Markov Models. Teh, Jordan, Beal and Blei (2006) derived iHMMs in terms of Hierarchical Dirichlet Processes.
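
A truncated sketch (mine) of that construction: a top-level stick-breaking weight vector β over states is shared by all rows of the transition matrix, and each row is a Dirichlet draw (a finite-truncation DP) centred on β.

```python
import numpy as np

def sample_hdp_hmm_transitions(K, gamma, alpha, rng):
    """Truncated stick-breaking sketch of HDP-HMM transition rows.

    Top level: beta ~ GEM(gamma), truncated to K components.
    Per state k: pi_k ~ Dirichlet(alpha * beta), so every row shares the same
    base measure beta over states.  K is a finite truncation of the countably
    infinite model.
    """
    # Stick-breaking construction of the shared base measure beta.
    v = rng.beta(1.0, gamma, size=K)
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    beta /= beta.sum()   # renormalize the truncation
    # Each row of the transition matrix is a DP draw with base measure beta.
    return np.vstack([rng.dirichlet(alpha * beta) for _ in range(K)])

rng = np.random.default_rng(0)
A = sample_hdp_hmm_transitions(K=5, gamma=3.0, alpha=2.0, rng=rng)
```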

Efficient inference in iHMMs?

Inference and Learning in HMMs and iHMMs. HMM inference of hidden states, p(s_t | y_1, ..., y_T, θ): forward-backward = dynamic programming = belief propagation. HMM parameter learning: Baum-Welch = expectation maximization (EM), or Gibbs sampling (Bayesian). iHMM inference and learning, p(s_t, θ | y_1, ..., y_T): Gibbs sampling. This is unfortunate: Gibbs can be very slow for time series! Can we use dynamic programming?

Dynamic Programming in HMMs: Forward Backtrack Sampling
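
For a finite HMM, the dynamic-programming routine alluded to here samples an entire state path exactly from its posterior: filter forwards, then sample backwards. A minimal sketch (mine, using the same pi0/A/B conventions as the earlier joint-probability sketch):

```python
import numpy as np

def forward_backward_sample(y, pi0, A, B, rng):
    """Sample a full hidden-state path from p(s_1..T | y_1..T) by dynamic programming.

    Forward pass: filtered distributions alpha_t = p(s_t | y_1..t).
    Backward pass: sample s_T from alpha_T, then each s_t with weight
    alpha_t(s_t) * A[s_t, s_{t+1}].
    """
    T, K = len(y), len(pi0)
    alpha = np.zeros((T, K))
    alpha[0] = pi0 * B[:, y[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, y[t]]
        alpha[t] /= alpha[t].sum()
    s = np.zeros(T, dtype=int)
    s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * A[:, s[t + 1]]
        s[t] = rng.choice(K, p=w / w.sum())
    return s
```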

Beam Sampling

Beam Sampling

Auxiliary variables. Note: adding the u variables does not change the distribution over the other variables.
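
A sketch (mine, shown on a finite truncation of the transition matrix) of how these auxiliary slice variables are used in beam sampling: given the current path, each u_t is drawn uniformly below the probability of the transition actually taken; conditioned on u, only transitions with probability above u_t remain reachable, so an ordinary forward-backward sweep resamples the whole path.

```python
import numpy as np

def beam_sample_states(y, s_prev, pi0, A, B, rng):
    """One beam-sampling sweep over the state sequence (finite-truncation sketch).

    Slice step: u_t ~ Uniform(0, A[s_{t-1}, s_t]) given the current path s_prev.
    Conditioned on u, the transition probabilities are replaced by indicators
    A[i, j] > u_t, so the forward pass sums over a finite set of states and a
    new path is drawn by forward filtering / backward sampling.
    """
    T, K = len(y), len(pi0)
    # Auxiliary slice variables (u_1 uses the initial distribution).
    u = np.empty(T)
    u[0] = rng.uniform(0.0, pi0[s_prev[0]])
    for t in range(1, T):
        u[t] = rng.uniform(0.0, A[s_prev[t - 1], s_prev[t]])
    # Forward pass restricted to transitions whose probability exceeds the slice.
    alpha = np.zeros((T, K))
    alpha[0] = (pi0 > u[0]) * B[:, y[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ (A > u[t])) * B[:, y[t]]
        alpha[t] /= alpha[t].sum()
    # Backward sampling of a new path.
    s = np.zeros(T, dtype=int)
    s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (A[:, s[t + 1]] > u[t + 1])
        s[t] = rng.choice(K, p=w / w.sum())
    return s
```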

Beam Sampling

Experiment: Text Prediction

Experiment: Changepoint Detection

Experiment: Changepoint Detection

Parallel and Distributed Implementations of iHMMs. Recent work on parallel (.NET) and distributed (Hadoop) implementations of beam sampling for iHMMs (Bratieres, van Gael, Vlachos and Ghahramani, 2010). Applied to unsupervised learning of part-of-speech tags from newswire text (10 million word sequences). Promising results; open-source code is available for the beam-sampling iHMM: http://mloss.org/software/view/205/

Part III: iHMMs with clustered states. We would like HMM models that can automatically group or cluster states. States within a group are more likely to transition to other states within the same group. This implies a block-diagonal transition matrix.
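
An illustrative sketch (mine; this is not the actual BD-iHMM prior, just a finite caricature of the block-diagonal structure it favours):

```python
import numpy as np

def block_structured_transitions(block_sizes, within=10.0, between=0.1, rng=None):
    """Sample a transition matrix whose rows put most of their mass inside the
    state's own block (illustrative only, not the BD-iHMM prior).

    Each row is a Dirichlet draw whose concentration is `within` for states in
    the same block and `between` elsewhere, yielding a nearly block-diagonal
    right-stochastic matrix.
    """
    rng = rng or np.random.default_rng()
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    K = len(labels)
    conc = np.where(labels[None, :] == labels[:, None], within, between)
    return np.vstack([rng.dirichlet(conc[i]) for i in range(K)])

A = block_structured_transitions([3, 2, 4])
```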

The Block-Diagonal iHMM (Stepleton, Ghahramani, Gordon & Lee, 2009). [Figure: truncated Markov transition matrices (right stochastic) sampled from the BD-iHMM prior with various settings of the hyperparameters ζ, γ, α_0, and ξ.]

BD-iHMM finding sub-behaviours in video gestures (Nintendo Wii). [Figure: selected downscaled video frames and inferred state sequences for batting, boxing, pitching, and tennis gestures.]

Part IV: The Factorial Hidden Markov Model. Hidden Markov models represent the entire history of a sequence using a single state variable s_t. This seems restrictive... It seems more natural to allow many hidden state variables: a distributed representation of state.

Factorial HMMs. Factorial HMMs (Ghahramani and Jordan, 1997): a kind of dynamic Bayesian network. Inference using variational methods or sampling. Have been used in a variety of applications (e.g. condition monitoring, biological sequences, speech recognition).

From factorial HMMs to infinite factorial HMMs? A non-parametric version where the number of chains is unbounded? In the infinite factorial HMM (iFHMM), each chain is binary (van Gael, Teh, and Ghahramani, 2008). Based on the Markov extension of the Indian Buffet Process (IBP).

Bars-in-time data

Bars-in-time data

iFHMM Toy Experiment: Bars-in-time. [Figure: data, ground truth, and inferred features.]

iFHMM Experiment: Bars-in-time

Separating speech audio of multiple speakers in time

The Big Picture

Summary. Bayesian methods provide a flexible framework for modelling. HMMs can be learned using variational Bayesian methods; this should always be preferable to EM. iHMMs provide a non-parametric sequence model where the number of states is not bounded a priori. Beam sampling provides an efficient, exact, dynamic-programming-based MCMC method for iHMMs. Block-diagonal iHMMs learn to cluster states into sub-behaviours. iFHMMs extend iHMMs to have multiple state variables in parallel. Future directions: new models, fast algorithms, and other compelling applications.

References

Beal, M.J. (2003) Variational Algorithms for Approximate Bayesian Inference. PhD Thesis, University College London, UK.

Beal, M.J., Ghahramani, Z. and Rasmussen, C.E. (2002) The infinite hidden Markov model. Advances in Neural Information Processing Systems 14:577-585. Cambridge, MA: MIT Press.

Bratieres, S., van Gael, J., Vlachos, A. and Ghahramani, Z. (2010) Scaling the iHMM: Parallelization versus Hadoop. International Workshop on Scalable Machine Learning and Applications (SMLA-10).

Ghahramani, Z. and Jordan, M.I. (1997) Factorial Hidden Markov Models. Machine Learning, 29:245-273.

Griffiths, T.L. and Ghahramani, Z. (2006) Infinite Latent Feature Models and the Indian Buffet Process. Advances in Neural Information Processing Systems 18:475-482. Cambridge, MA: MIT Press.

MacKay, D.J.C. (1997) Ensemble learning for hidden Markov models. Technical Report.

Stepleton, T., Ghahramani, Z., Gordon, G. and Lee, T.-S. (2009) The Block Diagonal Infinite Hidden Markov Model. AISTATS 2009.

Teh, Y.W., Jordan, M.I., Beal, M.J. and Blei, D.M. (2006) Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581.

van Gael, J., Teh, Y.W. and Ghahramani, Z. (2009) The Infinite Factorial Hidden Markov Model. Advances in Neural Information Processing Systems 21. Cambridge, MA: MIT Press.

van Gael, J., Saatci, Y., Teh, Y.W. and Ghahramani, Z. (2008) Beam sampling for the infinite Hidden Markov Model. International Conference on Machine Learning (ICML 2008).

van Gael, J., Vlachos, A. and Ghahramani, Z. (2009) The Infinite HMM for Unsupervised PoS Tagging. EMNLP 2009, Singapore.