The Infinite Hidden Markov Model


1 The Infinite Hidden Markov Model
Matthew J. Beal
Gatsby Computational Neuroscience Unit, University College London
March 2002
Joint work with Zoubin Ghahramani and Carl E. Rasmussen

2 Abstract
We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite; consider, for example, symbols being possible words appearing in English text.

3 Overview
- Introduce the idea of an HMM with an infinite number of states and symbols.
- Make use of the infinite Dirichlet process, and generalise this to a two-level hierarchical Dirichlet process over transitions and states.
- Explore the properties of the prior over trajectories.
- Present inference and learning procedures for the infinite HMM.

4 E.g. Infinite Gaussian Mixtures
Following Neal (1991), Rasmussen (2000) showed that it is possible to do inference in countably infinite mixtures of Gaussians.

P(x_1, …, x_N | π, µ, Σ) = ∏_{i=1}^N ∑_{j=1}^K π_j N(x_i | µ_j, Σ_j) = ∑_s P(s, x | π, µ, Σ) = ∑_s ∏_{i=1}^N ∏_{j=1}^K [π_j N(x_i | µ_j, Σ_j)]^{δ(s_i, j)}

The joint distribution of the indicators is multinomial:

P(s_1, …, s_N | π) = ∏_{j=1}^K π_j^{n_j},   n_j = ∑_{i=1}^N δ(s_i, j).

The mixing proportions are given a symmetric Dirichlet prior:

P(π | β) = Γ(β) / Γ(β/K)^K · ∏_{j=1}^K π_j^{β/K − 1}

5 Infinite Mixtures (continued)
The joint distribution of the indicators is multinomial:

P(s_1, …, s_N | π) = ∏_{j=1}^K π_j^{n_j},   n_j = ∑_{i=1}^N δ(s_i, j).

The mixing proportions are given a symmetric, conjugate Dirichlet prior:

P(π | β) = Γ(β) / Γ(β/K)^K · ∏_{j=1}^K π_j^{β/K − 1}

Integrating out the mixing proportions we obtain

P(s_1, …, s_N | β) = ∫ dπ P(s_1, …, s_N | π) P(π | β) = Γ(β)/Γ(N + β) · ∏_{j=1}^K Γ(n_j + β/K)/Γ(β/K)

This yields a Dirichlet process over the indicator variables: the probability of a set of states is only a function of the indicator counts. Analytical tractability of this integral means we can work with the finite number of indicators rather than an infinite-dimensional mixing proportion.
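Because the collapsed probability depends on the assignments only through the counts n_j, it is cheap to evaluate. The following is a minimal sketch (not from the slides; the function name is illustrative) of its logarithm using log-gamma functions:

```python
# Log of the collapsed probability P(s_1..N | beta) for the finite-K
# symmetric Dirichlet model above, evaluated from the counts n_j.
import numpy as np
from scipy.special import gammaln

def log_prob_indicators(counts, beta):
    """log Gamma(beta) - log Gamma(N + beta)
       + sum_j [log Gamma(n_j + beta/K) - log Gamma(beta/K)]."""
    counts = np.asarray(counts, dtype=float)
    K = len(counts)
    N = counts.sum()
    return (gammaln(beta) - gammaln(N + beta)
            + np.sum(gammaln(counts + beta / K) - gammaln(beta / K)))

# Example: 10 points occupying 3 of K = 5 possible classes.
print(log_prob_indicators([4, 3, 3, 0, 0], beta=1.0))
```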

6 Dirichlet Process Conditional Probabilities
Conditional probabilities, finite K:

P(s_i = j | s_{−i}, β) = (n_{−i,j} + β/K) / (N − 1 + β)

where s_{−i} denotes all indicators except the i-th, and n_{−i,j} is the total number of observations with indicator j, excluding the i-th.

DP: more populous classes are more likely to be joined.

Conditional probabilities, infinite K: taking the limit as K → ∞ yields the conditionals

P(s_i = j | s_{−i}, β) = n_{−i,j} / (N − 1 + β)   for each represented j,
P(s_i = j | s_{−i}, β) = β / (N − 1 + β)   for all unrepresented j combined.

The left-over mass β is shared among the countably infinite number of unrepresented indicator settings. Gibbs sampling from the posterior of the indicators is easy!
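The infinite-K conditional above is exactly the Chinese-restaurant-process update used in Gibbs sampling. Below is a hedged sketch (names are illustrative, not from the slides) of resampling one indicator from this prior conditional; in a mixture model each option would additionally be weighted by the likelihood of x_i under that class.

```python
# Resample s[i] | s[-i], beta: an existing class j with probability
# proportional to n_{-i,j}, or a brand-new class with probability
# proportional to beta (denominator N - 1 + beta in both cases).
import numpy as np

def crp_resample_indicator(s, i, beta, rng):
    others = np.delete(s, i)
    classes, counts = np.unique(others, return_counts=True)
    weights = np.append(counts.astype(float), beta)   # last entry = new class
    probs = weights / weights.sum()
    choice = rng.choice(len(weights), p=probs)
    if choice == len(classes):                        # open a new class
        return (others.max() + 1) if len(others) else 0
    return classes[choice]

rng = np.random.default_rng(0)
s = np.array([0, 0, 1, 1, 1, 2])
s[3] = crp_resample_indicator(s, 3, beta=0.5, rng=rng)
```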

7 Infinite Hidden Markov Models
(Graphical model: hidden-state chain S_1 → S_2 → … → S_T, each S_t emitting a symbol Y_t.)
Motivation: we want to model data with HMMs without worrying about overfitting, picking the number of states, picking architectures, ...

8 Review of Hidden Markov Models (HMMs)
Generative graphical model: hidden states s_t, emitted symbols y_t.
(Graphical model: chain S_1 → S_2 → … → S_T with emissions Y_1, …, Y_T.)
The hidden state evolves as a Markov process:

P(s_{1:T} | A) = P(s_1 | π^0) ∏_{t=1}^{T−1} P(s_{t+1} | s_t),   P(s_{t+1} = j | s_t = i) = A_{ij},   i, j ∈ {1, …, K}.

Observation model: e.g. discrete symbols y_t from an alphabet, produced according to an emission matrix

P(y_t = l | s_t = i) = E_{il}.
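As a small illustrative sketch (assumed, not from the slides), ancestral sampling from this finite HMM draws s_1 from π^0, each y_t from the row E[s_t], and each s_{t+1} from the row A[s_t]:

```python
# Ancestral sampling from a finite HMM with initial distribution pi0,
# transition matrix A (A[i, j] = P(s_{t+1}=j | s_t=i)) and emission
# matrix E (E[i, l] = P(y_t=l | s_t=i)).
import numpy as np

def sample_hmm(pi0, A, E, T, rng):
    K, L = E.shape
    states = np.empty(T, dtype=int)
    symbols = np.empty(T, dtype=int)
    states[0] = rng.choice(K, p=pi0)
    for t in range(T):
        symbols[t] = rng.choice(L, p=E[states[t]])
        if t + 1 < T:
            states[t + 1] = rng.choice(K, p=A[states[t]])
    return states, symbols

rng = np.random.default_rng(1)
pi0 = np.array([1.0, 0.0])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
E = np.array([[0.7, 0.3], [0.1, 0.9]])
print(sample_hmm(pi0, A, E, T=10, rng=rng))
```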

9 Limitations of the standard HMM
There are some important limitations of the standard estimation procedure and the model definition of HMMs:
- Maximum likelihood methods do not consider the complexity of the model, leading to over- or under-fitting.
- The model structure has to be specified in advance.
- In many scenarios it is unreasonable to believe a priori that the data were generated from a fixed number of discrete states.
- We don't know what form the state-transition matrix should take.
- We might not want to restrict ourselves to a finite alphabet of symbols.
Previous authors (MacKay, 1997; Stolcke and Omohundro, 1993) have attempted to approximate a Bayesian analysis.

10 Generative model for the hidden state
Propose a transition to s_{t+1} conditional on the current state s_t. Existing transitions are more probable, thus giving rise to typical trajectories. With transition counts n_{ij} out of state s_t = i (a sampling sketch follows this list):

- self-transition (j = i): probability (n_{ii} + α) / (∑_j n_{ij} + β + α)
- existing transition to j: probability n_{ij} / (∑_j n_{ij} + β + α)
- oracle: probability β / (∑_j n_{ij} + β + α)

If the oracle is consulted, propose according to the oracle occupancies n_j^o; previously chosen oracle states are more probable:

- existing state j: probability n_j^o / (∑_j n_j^o + γ)
- new state: probability γ / (∑_j n_j^o + γ)
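A hedged sketch of this two-level draw is given below. Function and variable names are illustrative (this is not the authors' code); n is the transition count matrix and n_oracle the oracle occupancies described above.

```python
# Sample the next state from current state i under the two-level process:
# first choose self / existing / oracle, then (if oracle) choose an
# existing state by occupancy or open a brand-new state.
import numpy as np

def sample_next_state(i, n, n_oracle, alpha, beta, gamma, rng):
    K = n.shape[0]
    row = n[i].astype(float)
    denom = row.sum() + beta + alpha
    probs = row.copy()
    probs[i] += alpha                        # extra mass on the self-transition
    probs = np.append(probs, beta) / denom   # last entry = consult the oracle
    j = rng.choice(K + 1, p=probs)
    if j < K:
        return j, False                      # ordinary (or self) transition
    occ = n_oracle.astype(float)
    oracle_probs = np.append(occ, gamma) / (occ.sum() + gamma)
    j = rng.choice(K + 1, p=oracle_probs)
    return j, True                           # j == K means a brand-new state

rng = np.random.default_rng(2)
n = np.array([[5, 1, 0], [0, 3, 2], [1, 0, 4]])
n_oracle = np.array([2, 1, 1])
print(sample_next_state(0, n, n_oracle, alpha=1.0, beta=1.0, gamma=1.0, rng=rng))
```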

11 Trajectories under the Prior
(Figure: sample state trajectories drawn from the prior under four hyperparameter settings.)
- explorative: α = 0.1, β = 1000, γ = 100
- repetitive: α = 0, β = 0.1, γ = 100
- self-transitioning: α = 2, β = 2, γ = 20
- ramping: α = 1, β = 1, γ =

Just 3 hyperparameters provide:
- slow/fast dynamics (α)
- sparse/dense transition matrices (β)
- many/few states (γ)
- left-right structure, with multiple interacting cycles

12 Real data: Lewis Carroll's Alice's Adventures in Wonderland
(Figure: word identity vs. word position in the text.)
With a finite alphabet, a model would assign zero likelihood to a test sequence containing any symbols not present in the training set(s). In iHMMs, at each time step the hidden state s_t emits a symbol y_t, which can come from a potentially infinite alphabet.
- state s_t → state s_{t+1}: transition process, counts n_{ij}
- state s_t → symbol y_t: emission process, counts m_{iq}

13 Infinite HMMs
(Graphical model: chain S_1 → S_2 → … → S_T with emissions Y_1, …, Y_T.)
Approach: countably infinite hidden states. Deal with both the transition and the emission process using a two-level hierarchical Dirichlet process.

Transition process (from state s_t = i, counts n_{ij}, oracle counts n_j^o):
- self-transition (j = i): (n_{ii} + α) / (∑_j n_{ij} + β + α)
- existing transition: n_{ij} / (∑_j n_{ij} + β + α)
- oracle: β / (∑_j n_{ij} + β + α), then an existing state with n_j^o / (∑_j n_j^o + γ) or a new state with γ / (∑_j n_j^o + γ)

Emission process (from state s_t = i, counts m_{iq}, oracle counts m_q^o):
- existing emission of symbol q: m_{iq} / (∑_q m_{iq} + β_e)
- oracle: β_e / (∑_q m_{iq} + β_e), then an existing symbol with m_q^o / (∑_q m_q^o + γ_e) or a new symbol with γ_e / (∑_q m_q^o + γ_e)

Gibbs sampling over the states is possible, while all parameters are implicitly integrated out; only five hyperparameters need to be inferred (Beal, Ghahramani, and Rasmussen, 2001).

14 Inference and learning
For a given observed sequence:

Inference: what is the distribution over hidden-state trajectories? In this model we can straightforwardly sample the hidden states using Gibbs sampling. The Gibbs probabilities are

P(s_t | s_{−t}, y_{1:T}, α, β, γ, β_e, γ_e).

Unfortunately exact calculation takes in general O(T^2); we can use a sampling argument to reduce this to O(T).

Learning: what can we infer about the transition and emission processes? This involves updating the hyperparameters α, β, γ, β_e and γ_e.

Q. How do we learn the transition and emission parameters?
A. With infinite Dirichlet priors, they are represented non-parametrically as count matrices, capturing countably infinite state and symbol spaces.
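As a structural illustration only (assumed, not the authors' code), one Gibbs update for a single state s_t multiplies the incoming transition, outgoing transition and emission predictive probabilities, with the final index standing for "a new state". The helpers trans_prob(i, j) and emit_prob(i, q) are placeholders for the table-based predictive distributions of the previous slides, and this sketch ignores the count adjustments the exact conditional requires.

```python
# One simplified Gibbs update: resample states[t] proportionally to
# P(s_t | s_{t-1}) * P(s_{t+1} | s_t) * P(y_t | s_t).
import numpy as np

def gibbs_update_state(t, states, symbols, K_plus_1, trans_prob, emit_prob, rng):
    T = len(states)
    w = np.ones(K_plus_1)
    for k in range(K_plus_1):
        if t > 0:
            w[k] *= trans_prob(states[t - 1], k)   # transition into candidate k
        if t + 1 < T:
            w[k] *= trans_prob(k, states[t + 1])   # transition out of candidate k
        w[k] *= emit_prob(k, symbols[t])           # emission of the observed symbol
    states[t] = rng.choice(K_plus_1, p=w / w.sum())
    return states[t]

rng = np.random.default_rng(3)
states = np.array([0, 1, 0, 1])
symbols = np.array([0, 1, 1, 0])
uniform = lambda a, b: 1.0 / 3.0     # toy placeholder probabilities
print(gibbs_update_state(2, states, symbols, 3, uniform, uniform, rng))
```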

15 Hyperparameter updates
In a standard Dirichlet process with concentration β, the likelihood of observing counts {n_j} is, as a function of β,

P({n_j}_{j=1,…,K} | β) ∝ β^K · Γ(β)/Γ(n + β),   where n = ∑_j n_j.

More specifically, in this model the likelihood contribution from (α, β) for row i of the transition counts is approximately*

P({n_{i1}, …, n_{ii}, …, n_{iK}} | α, β) = β^{K^{(i)} − 1} · Γ(α + β)/Γ(∑_j n_{ij} + α + β) · Γ(n_{ii} + α)/Γ(α)

* All counts are replaced by their relevant expectations over Gibbs sweeps.
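A sketch of the log of the row-wise (α, β) likelihood above, which could be fed to any one-dimensional sampler or optimiser for hyperparameter updates. It assumes K^{(i)} is the number of represented destination states in row i, which is an interpretation of the slide's notation, not something the source states explicitly.

```python
# log [ beta^{K_i - 1} * Gamma(alpha+beta) / Gamma(sum_j n_ij + alpha + beta)
#       * Gamma(n_ii + alpha) / Gamma(alpha) ]
import numpy as np
from scipy.special import gammaln

def log_row_likelihood(n_row, i, alpha, beta):
    n_row = np.asarray(n_row, dtype=float)
    K_i = np.count_nonzero(n_row)   # assumed: represented destinations from state i
    return ((K_i - 1) * np.log(beta)
            + gammaln(alpha + beta) - gammaln(n_row.sum() + alpha + beta)
            + gammaln(n_row[i] + alpha) - gammaln(alpha))

print(log_row_likelihood([5, 1, 0, 2], i=0, alpha=1.0, beta=0.5))
```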

16 A toy example: ascending-descending
ABCDEFEDCBABCDEFEDCBABCDEFEDCBABCDEFEDCB...
This requires minimally 10 hidden states to capture.
(Figure: number of represented states K vs. Gibbs sweeps.)

17 Synthetic examples: expansive & compressive HMMs
(Figure: true and learned transition and emission probabilities/count matrices, up to permutation of the hidden states; lighter boxes correspond to higher values. Top row: expansive HMM, with count-matrix pairs {n, m} displayed after {1, 80, 150} sweeps of Gibbs sampling. Bottom row: compressive HMM, displaying count matrices after {1, 100, 230} sweeps of Gibbs sampling.)
See hmm2.avi and hmm3.avi.

18 Real data: Alice
Trained on the 1st chapter (10787 characters: A–Z, space, period) = 2046 words.
iHMM initialized with a random sequence of 30 states; α = 0, β = β_e = γ = γ_e.
Gibbs sweeps took several hours in Matlab.
The n matrix starts out full and ends up sparse (14% full).

200-character fantasies:
Sweep 1: LTEMFAODEOHADONARNL SAE UDSEN DTTET ETIR NM H VEDEIPH L.SHYIPFADB OMEBEGLSENTEI GN HEOWDA EELE HEFOMADEE IS AL THWRR KH TDAAAC CHDEE OIGW OHRBOOLEODT DSECT M OEDPGTYHIHNOL CAEGTR.ROHA NOHTR.L
Sweep 250: AREDIND DUW THE JEDING THE BUBLE MER.FION SO COR.THAN THALD THE BATHERSTHWE ICE WARLVE I TOMEHEDS I LISHT LAKT ORTH.A CEUT.INY OBER.GERD POR GRIEN THE THIS FICE HIGE TO SO.A REMELDLE THEN.SHILD TACE G
Sweep 500: H ON ULY KER IN WHINGLE THICHEY TEIND EARFINK THATH IN ATS GOAP AT.FO ANICES IN RELL A GOR ARGOR PEN EUGUGTTHT ON THIND NOW BE WIT OR ANND YOADE WAS FOUE CAIT DOND SEAS HAMBER ANK THINK ME.HES URNDEY
Sweep 1000: BUT.THOUGHT ANGERE SHERE ACRAT OR WASS WILE DOOF SHE.WAS ABBORE GLEAT DOING ALIRE AT TOO AMIMESSOF ON SHAM LUZDERY AMALT ANDING A BUPLA BUT THE LIDTIND BEKER HAGE FEMESETIMEY BUT NOTE GD I SO CALL OVE

19 Alice results: number of states and hyperparameters
(Figure: evolution of the hyperparameters β, γ, β_e, γ_e and the number of represented states K over Gibbs sweeps.)

20 Likelihood calculation: the infinite-state particle filter
The likelihood of a particular observed sequence of symbols involves intractable sums over the possible hidden-state trajectories:
- Integrating out the parameters in any HMM induces long-range dependencies between the states, so we cannot use standard tricks like dynamic programming.
- The number of distinct states can grow with the sequence length as new states are generated. Starting with K distinct states, at time t there could be K + t possible distinct states, making the total number of trajectories over the entire sequence (K + T)!/K!.
We propose estimating the likelihood of a test sequence given a learned model using particle filtering. The idea is to start with some number R of particles distributed on the represented hidden states according to the final state marginal from the training sequence (some of the R may fall onto new states).

21 The infinite-state particle filter (cont.)
Starting from the set of particles {s_t^1, …, s_t^R}, the tables from the training sequences {n, n^o, m, m^o}, and t = 1, the recursive procedure is as specified below (a code sketch follows this list), where P(s_t | y_1, …, y_{t−1}) ≈ (1/R) ∑_r δ(s_t, s_t^r):
1. Compute l_t^r = P(y_t | s_t = s_t^r) for each particle r.
2. Calculate l_t = (1/R) ∑_r l_t^r ≈ P(y_t | y_1, …, y_{t−1}).
3. Resample R particles from s_t^r ~ (1 / ∑_r l_t^r) ∑_r l_t^r δ(s_t, s_t^r).
4. Update the transition and emission tables n^r, m^r for each particle.
5. For each r sample the forward dynamics: s_{t+1}^r ~ P(s_{t+1} | s_t = s_t^r, n^r, m^r); this may cause particles to land on novel states. Update n^r and m^r.
6. If t < T, go to step 1 with t = t + 1.
The log-likelihood of the test sequence is computed as ∑_t log l_t. Since it is a discrete state space, with much of the probability mass concentrated on the represented states, it is feasible to use O(K) particles.
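Below is a compact sketch of this recursion under stated assumptions: emit_lik(state, symbol) stands in for step 1's predictive likelihood and sample_forward(state) for step 5's table-based forward draw; the per-particle table updates of steps 4-5 are omitted, and all names are illustrative rather than the authors' API.

```python
# Particle-filter estimate of the test-sequence log-likelihood:
# weight by the emission likelihood, accumulate log of the mean weight,
# resample, then propagate each particle forward one step.
import numpy as np

def particle_filter_loglik(symbols, particles, emit_lik, sample_forward, rng):
    particles = np.array(particles)
    R = len(particles)
    loglik = 0.0
    for y in symbols:
        l = np.array([emit_lik(s, y) for s in particles])    # step 1
        loglik += np.log(l.mean())                           # step 2: P(y_t | y_<t)
        idx = rng.choice(R, size=R, p=l / l.sum())           # step 3: resample
        particles = particles[idx]
        # step 4 (table updates) omitted in this sketch
        particles = np.array([sample_forward(s) for s in particles])   # step 5
    return loglik

rng = np.random.default_rng(4)
emit_lik = lambda s, y: 0.5                  # toy placeholders
sample_forward = lambda s: rng.integers(3)
print(particle_filter_loglik([0, 1, 1], particles=[0, 1, 2],
                             emit_lik=emit_lik, sample_forward=sample_forward,
                             rng=rng))
```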

22 Conclusions
- It is possible to extend HMMs to a countably infinite number of hidden states using a hierarchical Dirichlet process. The resulting model is a non-parametric Bayesian HMM.
- With just three hyperparameters, prior trajectories demonstrate rich, appealing structure.
- We can use the same process for emissions.
- Tools: an O(T) approximate Gibbs sampler for inference, equations for hyperparameter learning, and a particle-filtering scheme for likelihood evaluation.
- Presented empirical results on toy data sets and Alice.

23 Some References
1. Attias, H. (1999) Inferring parameters and structure of latent variable models by variational Bayes. Proc. 15th Conference on Uncertainty in Artificial Intelligence.
2. Barber, D. and Bishop, C. M. (1998) Ensemble learning for multilayer networks. Advances in Neural Information Processing Systems.
3. Bishop, C. M. (1999) Variational principal components. Proceedings of the Ninth International Conference on Artificial Neural Networks, ICANN 99.
4. Beal, M. J., Ghahramani, Z. and Rasmussen, C. E. (2001) The infinite hidden Markov model. To appear in NIPS.
5. Ghahramani, Z. and Beal, M. J. (1999) Variational inference for Bayesian mixtures of factor analysers. In Neural Information Processing Systems.
6. Ghahramani, Z. and Beal, M. J. (2000) Propagation algorithms for variational Bayesian learning. In Neural Information Processing Systems.
7. Hinton, G. E. and van Camp, D. (1993) Keeping neural networks simple by minimizing the description length of the weights. In Proc. 6th Annual Workshop on Computational Learning Theory. ACM Press, New York, NY.
8. MacKay, D. J. C. (1995) Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6.
9. Miskin, J. and MacKay, D. J. C. Ensemble learning independent component analysis for blind separation and deconvolution of images. In Advances in Independent Component Analysis, M. Girolami (ed.), Springer, Berlin.
10. Neal, R. M. (1991) Bayesian mixture modeling by Monte Carlo simulation. Technical Report CRG-TR-91-2, Dept. of Computer Science, University of Toronto, 23 pages.
11. Neal, R. M. (1994) Priors for infinite networks. Technical Report CRG-TR-94-1, Dept. of Computer Science, University of Toronto, 22 pages.

24 Some References (continued)
12. Rasmussen, C. E. (1996) Evaluation of Gaussian Processes and Other Methods for Non-Linear Regression. Ph.D. thesis, Graduate Department of Computer Science, University of Toronto.
13. Rasmussen, C. E. (1999) The infinite Gaussian mixture model. Advances in Neural Information Processing Systems 12, S. A. Solla, T. K. Leen and K.-R. Müller (eds.), MIT Press (2000).
14. Rasmussen, C. E. and Ghahramani, Z. (2000) Occam's razor. Advances in Neural Information Processing Systems 13, MIT Press (2001).
15. Rasmussen, C. E. and Ghahramani, Z. (2001) Infinite mixtures of Gaussian process experts. In NIPS.
16. Ueda, N. and Ghahramani, Z. (2000) Optimal model inference for Bayesian mixtures of experts. IEEE Neural Networks for Signal Processing, Sydney, Australia.
17. Waterhouse, S., MacKay, D. J. C. and Robinson, T. (1996) Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
18. Williams, C. K. I. and Rasmussen, C. E. (1996) Gaussian processes for regression. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer and M. E. Hasselmo (eds.).
