A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models
Eurandom
Zoubin Ghahramani
Department of Engineering, University of Cambridge, UK
zoubin@eng.cam.ac.uk
http://learning.eng.cam.ac.uk/zoubin
Also at Machine Learning Department, CMU

Parametric vs Nonparametric Models

Parametric models assume some finite set of parameters θ. Given the parameters, future predictions x are independent of the observed data D:

P(x | θ, D) = P(x | θ)

so θ captures everything there is to know about the data. The complexity of the model is therefore bounded even if the amount of data is unbounded. This makes parametric models not very flexible.

Non-parametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters, but they can often be defined by assuming an infinite-dimensional θ. Usually we think of θ as a function. The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes non-parametric models more flexible.

Why? Flexibility, better predictive performance, more realistic models.

[Figure: an illustrative nonparametric fit to data]

Outline

Bayesian nonparametrics has many uses. Some modelling goals and examples of associated nonparametric Bayesian models:

Modelling goal                   Example process
Distributions on functions       Gaussian process
Distributions on distributions   Dirichlet process, Polya tree
Clustering                       Chinese restaurant process, Pitman-Yor process
Hierarchical clustering          Dirichlet diffusion tree, Kingman's coalescent
Sparse latent feature models     Indian buffet process
Survival analysis                Beta process
Distributions on measures        Completely random measures
...                              ...

Gaussian Processes

A Gaussian process defines a distribution P(f) on functions f, where f is a function mapping some input space X to R: f : X → R.

Let f = (f(x_1), f(x_2), ..., f(x_n)) be an n-dimensional vector of function values evaluated at n points x_i ∈ X. Note that f is a random variable.

Definition: P(f) is a Gaussian process if for any finite subset {x_1, ..., x_n} ⊂ X, the marginal distribution P(f) on that finite subset is multivariate Gaussian, and all these marginals are coherent.

We can think of GPs as the extension of multivariate Gaussians to distributions on functions.

Samples from Gaussian processes with different covariance functions c(x, x')

[Figure: panels of sample functions f(x) drawn from GP priors with different covariance functions]
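The finite-dimensional marginals above give a direct recipe for visualising a GP prior: evaluate the covariance function on a grid of inputs and draw from the resulting multivariate Gaussian. Here is a minimal numpy sketch of that recipe, assuming a squared-exponential covariance; the kernel choice, lengthscale and jitter values are illustrative, not taken from the slides.

# Draw sample functions from a GP prior with a squared-exponential covariance.
import numpy as np

def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    # c(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_prior(x, n_samples=3, lengthscale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    K = rbf_kernel(x, lengthscale) + 1e-8 * np.eye(len(x))  # jitter for numerical stability
    L = np.linalg.cholesky(K)                                # K = L @ L.T
    z = rng.standard_normal((len(x), n_samples))
    return L @ z                                             # each column is a draw from N(0, K)

x = np.linspace(0, 10, 200)
samples = sample_gp_prior(x, n_samples=4, lengthscale=1.5)
print(samples.shape)   # (200, 4): four sample functions evaluated on the input grid

Changing the lengthscale, or swapping in a different covariance function, reproduces the qualitative variety of samples shown in the figure.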

Dirichlet Processes

Gaussian processes define a distribution on functions,

f ~ GP(µ, c)

where µ is the mean function and c is the covariance function. We can think of GPs as infinite-dimensional Gaussians.

Dirichlet processes define a distribution on distributions (a measure on measures),

G ~ DP(G_0, α)

where α > 0 is a scaling parameter and G_0 is the base measure. We can think of DPs as infinite-dimensional Dirichlet distributions.

Note that both f and G are infinite-dimensional objects.

Dirichlet Process

Let Θ be a measurable space, G_0 a probability measure on Θ, and α a positive real number. G ~ DP(G_0, α) means that for every finite partition (A_1, ..., A_K) of Θ,

(G(A_1), ..., G(A_K)) ~ Dir(α G_0(A_1), ..., α G_0(A_K))

(Ferguson, 1973)

Dirichlet Distribution

The Dirichlet distribution is a distribution on the K-dimensional probability simplex. Let p be a K-dimensional vector such that p_j ≥ 0 for all j and Σ_{j=1}^K p_j = 1. Then

P(p | α) = Dir(α_1, ..., α_K) := [ Γ(Σ_j α_j) / Π_j Γ(α_j) ] Π_{j=1}^K p_j^{α_j - 1}

where the first term is a normalization constant and E(p_j) = α_j / Σ_k α_k.

The Dirichlet is conjugate to the multinomial distribution. Let c | p ~ Multinomial(p), that is, P(c = j | p) = p_j. Then the posterior is also Dirichlet:

P(p | c = j, α) = P(c = j | p) P(p | α) / P(c = j | α) = Dir(α')

where α'_j = α_j + 1 and α'_l = α_l for all l ≠ j.

(Recall Γ(x) = (x - 1) Γ(x - 1) = ∫_0^∞ t^{x-1} e^{-t} dt; for integer n, Γ(n) = (n - 1)!)
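A quick numerical illustration of this conjugate update (not from the slides; the prior parameters, true probabilities and sample size are illustrative): observing categorical draws simply adds counts to the Dirichlet parameters.

# Dirichlet-multinomial conjugacy: the posterior parameters are prior + counts.
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])       # prior Dir(alpha) on a 3-dimensional simplex
p_true = np.array([0.2, 0.3, 0.5])      # probabilities used to simulate the data

c = rng.choice(3, size=100, p=p_true)   # categorical observations c_i | p ~ Multinomial(p)
counts = np.bincount(c, minlength=3)

alpha_post = alpha + counts             # conjugate update
print("counts:", counts)
print("posterior mean E[p_j | data]:", alpha_post / alpha_post.sum())
print("posterior samples:\n", rng.dirichlet(alpha_post, size=3))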

Dirichlet Process

G ~ DP(G_0, α). OK, but what does it look like? Samples from a DP are discrete with probability one:

G(θ) = Σ_{k=1}^∞ π_k δ_{θ_k}(θ)

where δ_{θ_k}(·) is a Dirac delta at θ_k, and θ_k ~ G_0(·).

Note: E(G) = G_0. As α → ∞, G looks more like G_0.
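One standard way to generate such a discrete G is the stick-breaking construction (listed later on the "Big Picture" slide): π_k = v_k Π_{j<k} (1 - v_j) with v_k ~ Beta(1, α). Below is a truncated sketch, assuming a standard normal base measure G_0; the truncation level and α are illustrative.

# Truncated stick-breaking draw of G ~ DP(G_0, alpha) with G_0 = N(0, 1).
import numpy as np

def sample_dp_stick_breaking(alpha, K=1000, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=K)                         # v_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = v * remaining                                       # pi_k = v_k * prod_{j<k} (1 - v_j)
    theta = rng.normal(0.0, 1.0, size=K)                     # atom locations theta_k ~ G_0
    return pi, theta

pi, theta = sample_dp_stick_breaking(alpha=5.0)
print("total mass of the truncation:", pi.sum())             # close to 1 for large K
print("locations of the 5 heaviest atoms:", theta[np.argsort(pi)[::-1][:5]])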

Dirichlet Process: Conjugacy

If the prior on G is a DP, i.e. G ~ DP(G_0, α) with P(G) = DP(G | G_0, α), and you observe θ with P(θ | G) = G(θ), then the posterior is also a DP:

P(G | θ) = P(θ | G) P(G) / P(θ) = DP( (α / (α + 1)) G_0 + (1 / (α + 1)) δ_θ , α + 1 )

Generalization for n observations:

P(G | θ_1, ..., θ_n) = DP( (α / (α + n)) G_0 + (1 / (α + n)) Σ_{i=1}^n δ_{θ_i} , α + n )

Analogous to the Dirichlet being conjugate to multinomial observations.

Dirichlet Process

Blackwell and MacQueen's (1973) urn representation: let G ~ DP(G_0, α) and θ | G ~ G(·). Then, integrating out G,

θ_n | θ_1, ..., θ_{n-1}, G_0, α ~ (α / (n - 1 + α)) G_0(·) + (1 / (n - 1 + α)) Σ_{j=1}^{n-1} δ_{θ_j}(·)

where

P(θ_n | θ_1, ..., θ_{n-1}, G_0, α) ∝ ∫ dG Π_{j=1}^n P(θ_j | G) P(G | G_0, α).

The model exhibits a clustering effect.
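A minimal sketch of drawing from this urn scheme (not from the slides; the base measure is taken to be a standard normal and α is illustrative). Repeated θ values make the clustering effect visible.

# Blackwell-MacQueen urn: sequential draws theta_1, ..., theta_n with G_0 = N(0, 1).
import numpy as np

def polya_urn(n, alpha, seed=0):
    rng = np.random.default_rng(seed)
    thetas = []
    for i in range(n):                               # i = number of previous draws
        if rng.random() < alpha / (i + alpha):
            thetas.append(rng.normal())              # fresh draw from the base measure G_0
        else:
            # copying a uniformly chosen earlier value implements the sum-of-deltas term:
            # values that have been drawn more often are copied proportionally more often
            thetas.append(thetas[rng.integers(i)])
    return np.array(thetas)

draws = polya_urn(n=20, alpha=2.0)
print(np.round(draws, 3))
print("number of distinct values:", len(np.unique(draws)))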

Relationship between DPs and CRPs

The DP is a distribution on distributions. The DP results in discrete distributions, so if you draw n points you are likely to get repeated values. A DP therefore induces a partitioning of the n points, e.g. (1 3 4)(2 5) corresponds to θ_1 = θ_3 = θ_4 and θ_2 = θ_5.

The Chinese restaurant process (CRP) defines the corresponding distribution on partitions.

Although the CRP is a sequential process, the distribution on θ_1, ..., θ_n is exchangeable (i.e. invariant to permuting the indices of the θs), e.g. P(θ_1, θ_2, θ_3, θ_4) = P(θ_2, θ_4, θ_3, θ_1).

Chinese Restaurant Process

The CRP generates samples from the distribution on partitions induced by a DPM.

[Figure: customers seated around tables T_1, T_2, T_3, T_4, ...]

Generating from a CRP:
  customer 1 enters the restaurant and sits at table 1; K = 1, n = 1, n_1 = 1
  for n = 2, 3, ...:
      customer n sits at table k with prob n_k / (n - 1 + α), for k = 1, ..., K
                        at table K + 1 with prob α / (n - 1 + α)   (new table)
      if a new table was chosen then K ← K + 1
  endfor

Rich-get-richer property. (Aldous 1985; Pitman 2002)
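A runnable Python version of the pseudocode above, as a sketch (the concentration α and the number of customers are illustrative choices):

# CRP sketch: sample a random partition of N customers into tables.
import numpy as np

def chinese_restaurant_process(N, alpha, seed=0):
    rng = np.random.default_rng(seed)
    assignments = [0]                  # customer 1 sits at table 0 (zero-indexed)
    counts = [1]                       # n_k: number of customers at table k
    for n in range(2, N + 1):
        probs = np.array(counts + [alpha]) / (n - 1 + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):       # the "new table" option was chosen
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

assignments, counts = chinese_restaurant_process(N=50, alpha=3.0)
print("table sizes (rich get richer):", counts)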

Dirichlet Processes: Big Picture

There are many ways to derive the Dirichlet process: the Dirichlet distribution, the urn model, the Chinese restaurant process, stick breaking, the gamma process.

DP: a distribution on distributions.

Dirichlet process mixture (DPM): a mixture model with infinitely many components, where the parameters of each component are drawn from a DP. Useful for clustering; assignments of points to clusters follow a CRP.

Samples from a Dirichlet Process Mixture of Gaussians

[Figure: data drawn from a DPM of Gaussians at increasing sample sizes N]

Notice that more structure (clusters) appears as you draw more points. (Figure inspired by Neal.)

Approximate Inference in DPMs

Gibbs sampling (e.g. Escobar and West, 1995; Neal, 2000; Rasmussen, 2000)
Expectation propagation (Minka and Ghahramani, 2003)
Variational approximation (Blei and Jordan, 2005)
Hierarchical clustering (Heller and Ghahramani, 2005)
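As one concrete illustration of the Gibbs-sampling route, here is a hedged sketch of a collapsed Gibbs sampler for a one-dimensional DPM of Gaussians with known observation variance and a conjugate normal prior on the cluster means. It is not taken from the references above; the model choices and hyperparameters are illustrative.

# Collapsed Gibbs sampling for a 1-D DP mixture of Gaussians.
# Assumed model: x_i | z_i = k ~ N(mu_k, sigma2), mu_k ~ N(0, tau2),
# and cluster assignments z follow a CRP with concentration alpha.
import numpy as np

def log_predictive(x, cluster_xs, sigma2=1.0, tau2=10.0):
    # log N(x | m, v + sigma2), where (m, v) is the posterior of mu given cluster_xs
    n = len(cluster_xs)
    v = 1.0 / (n / sigma2 + 1.0 / tau2)
    m = v * np.sum(cluster_xs) / sigma2
    var = v + sigma2
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - m) ** 2 / var

def dpm_gibbs(x, alpha=1.0, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    z = np.zeros(len(x), dtype=int)                 # start with everyone in one cluster
    for _ in range(n_iters):
        for i in range(len(x)):
            z[i] = -1                               # remove point i from its cluster
            labels = [k for k in np.unique(z) if k >= 0]
            logp = []
            for k in labels:
                members = x[z == k]
                logp.append(np.log(len(members)) + log_predictive(x[i], members))
            logp.append(np.log(alpha) + log_predictive(x[i], np.array([])))  # new cluster
            logp = np.array(logp)
            p = np.exp(logp - logp.max())
            p /= p.sum()
            choice = rng.choice(len(p), p=p)
            z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
        z = np.unique(z, return_inverse=True)[1]    # relabel clusters as 0, ..., K-1
    return z

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 30), rng.normal(4, 1, 30)])
print("cluster sizes found:", np.bincount(dpm_gibbs(x)))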

Hierarchical Clustering

Dirichlet Diffusion Trees (DDT) (Neal, 2003)

In a DPM, the parameters of one mixture component are independent of those of the other components; this lack of structure is potentially undesirable. A DDT is a generalization of DPMs with hierarchical structure between components.

To generate from a DDT, we consider θ taking a random walk according to a Brownian motion (Gaussian diffusion) process.

θ_1(t) is a Gaussian diffusion process starting at the origin (θ_1(0) = 0), run for unit time.

θ_2(t) also starts at the origin and follows θ_1, but diverges at some time τ_d, at which point the path followed by θ_2 becomes independent of θ_1's path.

a(t) is a divergence or hazard function, e.g. a(t) = 1/(1 - t). For small dt:

P(θ diverges in (t, t + dt)) = a(t) dt / m

where m is the number of previous points that have followed this path. If θ_i reaches a branch point between two paths, it picks a branch in proportion to the number of points that have followed each path.
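A heavily simplified sketch of this generative process in one dimension, on a discrete time grid (the discretisation, the hazard a(t) = 1/(1 - t), the step size and the unit-variance Brownian motion are illustrative assumptions, not Neal's implementation):

# Grid-based simulation of 1-D Dirichlet diffusion tree paths.
import numpy as np

def ddt_paths(n_points, n_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    paths = []
    for i in range(n_points):
        path = np.zeros(n_steps + 1)
        if i == 0:
            # the first point is plain Brownian motion started at the origin
            path[1:] = np.cumsum(rng.normal(0, np.sqrt(dt), n_steps))
        else:
            candidates = list(range(i))        # previous paths still being followed
            diverged = False
            for s in range(n_steps):
                t = s * dt
                # divergence probability a(t) dt / m, with m = number of followed paths
                if not diverged and rng.random() < dt / ((1 - t) * len(candidates)):
                    diverged = True
                if diverged:
                    path[s + 1] = path[s] + rng.normal(0, np.sqrt(dt))
                else:
                    # at a branch point, pick a branch in proportion to its followers
                    nxt = np.array([paths[j][s + 1] for j in candidates])
                    vals, counts = np.unique(nxt, return_counts=True)
                    v = vals[rng.choice(len(vals), p=counts / counts.sum())]
                    path[s + 1] = v
                    candidates = [j for j in candidates if paths[j][s + 1] == v]
        paths.append(path)
    return np.array(paths)

leaves = ddt_paths(n_points=10)[:, -1]         # point locations at t = 1
print(np.round(leaves, 2))

In the continuum limit every path diverges before t = 1; with a finite grid an occasional exact tie between points can remain, which is an artifact of the discretisation.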

Dirichlet Diffusion Trees (DDT)

Generating from a DDT. [Figure from Neal, 2003]

Dirichlet Diffusion Trees (DDT)

Some samples from DDT priors. [Figure from Neal, 2003]

Kingman's Coalescent (Kingman, 1982)

A model for genealogies. Working backwards in time from t = 0, generate a tree with n individuals at the leaves by merging lineages backwards in time. Every pair of lineages merges independently with rate 1. The time to the first merge for n individuals is t_1 ~ Exp(n(n - 1)/2), and so on. With probability one this generates a binary tree in which all lineages eventually merge.

The coalescent results in the limit n → ∞, is infinitely exchangeable over individuals, and has uniform marginals over tree topologies.

The coalescent has been used as the basis of a hierarchical clustering algorithm by (Teh, Daumé and Roy, 2007).

[Figure (a)-(c): a coalescent tree on four individuals x_1, ..., x_4 with merge times t_1, t_2, t_3 and the induced sequence of partitions π(t) = {{1,2,3,4}}, {{1,2},{3,4}}, {{1},{2},{3,4}}, {{1},{2},{3},{4}}]
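A small sketch of simulating coalescent times and a random merge order for n individuals (illustrative code, not from Teh et al.):

# Kingman's coalescent: merge n lineages backwards in time.
# With k lineages and pair-merge rate 1, the waiting time to the next merge is Exp(k(k-1)/2).
import numpy as np

def kingman_coalescent(n, seed=0):
    rng = np.random.default_rng(seed)
    lineages = [[i] for i in range(n)]       # each lineage records the leaves below it
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        t -= rng.exponential(1.0 / (k * (k - 1) / 2))    # move backwards in time
        i, j = rng.choice(k, size=2, replace=False)      # merge a uniformly random pair
        merged = lineages[i] + lineages[j]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        events.append((t, sorted(merged)))
    return events

for t, group in kingman_coalescent(5):
    print(f"t = {t:.3f}: merged leaves {group}")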

Hierarchical Dirichlet Processes (HDP) (Teh et al., 2005)

Assume you have data which is divided into J groups. You assume there are clusters within each group, but you also believe these clusters are shared between groups (i.e. data points in different groups can belong to the same cluster).

In an HDP there is a common DP,

G_0 | H, γ ~ DP(H, γ)

which forms the base measure for a draw from a DP within each group:

G_j | G_0, α ~ DP(G_0, α)

[Graphical model: H, γ → G_0; G_0, α → G_j; G_j → φ_ji → x_ji, with plates over i = 1, ..., n_j and j = 1, ..., J]
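A hedged sketch of a truncated, finite-dimensional approximation to this hierarchy (the truncation level K, the number of groups, the group sizes, the hyperparameters and the Gaussian likelihood are all illustrative). Because the truncated G_0 has finitely many atoms, the group-level DP draw over those atoms reduces to a Dirichlet distribution with parameters α β.

# Truncated HDP sketch: shared global weights beta, group-specific weights pi_j
# over the same set of atoms, and Gaussian observations around the shared atoms.
import numpy as np

rng = np.random.default_rng(0)
K, J, gamma, alpha = 20, 3, 5.0, 3.0

# global DP draw G_0: truncated stick-breaking weights beta over K shared atoms
v = rng.beta(1.0, gamma, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()                          # renormalise the truncated weights
atoms = rng.normal(0.0, 5.0, size=K)        # shared cluster locations drawn from H

# group-level draws G_j | G_0, alpha: new weights over the SAME atoms
for j in range(J):
    pi_j = rng.dirichlet(alpha * beta)      # finite-dimensional stand-in for DP(G_0, alpha)
    z = rng.choice(K, size=50, p=pi_j)      # cluster assignments for the group's points
    x = rng.normal(atoms[z], 1.0)           # observations x_ji
    print(f"group {j} uses shared clusters {sorted(int(k) for k in np.unique(z))}")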

Nonparametric Time Series Models

Infinite Hidden Markov Models (iHMMs)

Hidden Markov models (HMMs) are widely used sequence models for speech recognition, bioinformatics, text modelling, video monitoring, etc.

[Graphical model: a Markov chain of hidden states S_1 → S_2 → S_3 → ... → S_T with emissions Y_1, Y_2, Y_3, ..., Y_T]

HMMs can be thought of as time-dependent mixture models. In an HMM with K states, the transition matrix has K × K elements. We want to let K → ∞.

Infinite HMMs can be derived from the HDP:

β | γ ~ Stick(γ)                                 (base distribution over states)
π_k | α, β ~ DP(α, β)                            (transition parameters for state k = 1, 2, ...)
θ_k | H ~ H(·)                                   (emission parameters for state k = 1, 2, ...)
s_t | s_{t-1}, (π_k)_{k=1}^∞ ~ π_{s_{t-1}}(·)    (transition)
y_t | s_t, (θ_k)_{k=1}^∞ ~ p(· | θ_{s_t})        (emission)

(Beal, Ghahramani, and Rasmussen, 2002) (Teh et al., 2005)
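A sketch of generating from a truncated version of this construction (the truncation level, hyperparameters, Gaussian emission model and sequence length are illustrative assumptions):

# Truncated HDP-HMM (iHMM) generative sketch with Gaussian emissions.
import numpy as np

rng = np.random.default_rng(0)
K, T, gamma, alpha = 15, 200, 4.0, 4.0

# beta ~ Stick(gamma), truncated at K states
v = rng.beta(1.0, gamma, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()

# pi_k | alpha, beta ~ DP(alpha, beta): one transition distribution per state
Pi = np.vstack([rng.dirichlet(alpha * beta) for _ in range(K)])
theta = rng.normal(0.0, 5.0, size=K)            # emission means theta_k ~ H

s = np.zeros(T, dtype=int)
y = np.zeros(T)
s[0] = rng.choice(K, p=beta)                    # initial state drawn from beta
y[0] = rng.normal(theta[s[0]], 1.0)
for t in range(1, T):
    s[t] = rng.choice(K, p=Pi[s[t - 1]])        # s_t ~ pi_{s_{t-1}}
    y[t] = rng.normal(theta[s[t]], 1.0)         # y_t ~ N(theta_{s_t}, 1)

print("states visited:", sorted(int(k) for k in np.unique(s)))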

Infinite Hidden Markov Models

We let the number of states K → ∞.

[Graphical model: a Markov chain of hidden states S_1 → S_2 → S_3 → ... → S_T with emissions Y_1, Y_2, Y_3, ..., Y_T]

Experiment: changepoint detection.

Introduced in (Beal, Ghahramani and Rasmussen, 2002). Teh, Jordan, Beal and Blei (2005) showed that iHMMs can be derived from hierarchical Dirichlet processes, and provided a more efficient Gibbs sampler. We have recently derived a much more efficient sampler based on dynamic programming (van Gael, Saatci, Teh, and Ghahramani, 2008).

Completely Random Measures (Kingman, 1967)

Measurable space: Θ with σ-algebra Ω.
Measure: a function µ : Ω → [0, ∞] assigning a non-negative real number to each measurable set.
Random measure: measures are drawn from some distribution on measures, µ ~ P.
Completely random measure (CRM): the values that µ takes on disjoint subsets are independent: µ(A) ⊥ µ(B) if A ∩ B = ∅.

CRMs can be represented as the sum of a nonrandom measure, an atomic measure with fixed atoms but random masses, and an atomic measure with random atoms and masses:

µ = µ_0 + Σ_{i=1}^{N or ∞} u_i δ_{φ_i} + Σ_{i=1}^{M or ∞} u'_i δ_{φ'_i}

We can write µ ~ CRM(µ_0, Λ, {φ_i, F_i}), where u_i ~ F_i and the {u'_i, φ'_i} are drawn from a Poisson process on (0, ∞] × Θ with rate measure Λ, called the Lévy measure.

Examples: gamma process, beta process (Hjort, 1990), stable-beta process (Teh and Görür, 2009).

Beta Processes (Hjort, 1990)

CRMs can be represented as the sum of a nonrandom measure, an atomic measure with fixed atoms but random masses, and an atomic measure with random atoms and masses:

µ = µ_0 + Σ_{i=1}^{N or ∞} u_i δ_{φ_i} + Σ_{i=1}^{M or ∞} u'_i δ_{φ'_i}

For a beta process we have µ ~ CRM(0, Λ, {}), where the {u_i, φ_i} are drawn from a Poisson process on (0, 1] × Θ with rate (Lévy) measure

Λ(du dθ) = α c u^{-1} (1 - u)^{c-1} du H(dθ)

[Figure: atom masses u against locations θ for a draw from the beta process, together with the Lévy density]
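As a hedged illustration of this Poisson-process representation, the sketch below samples the atoms of an approximate beta process draw in the special case c = 1, truncating the (infinite-mass) Lévy measure to masses u ≥ ε. The choices c = 1, ε, α and a uniform base measure H are illustrative assumptions.

# Approximate beta process draw via its Poisson process representation (c = 1).
# For c = 1 the Levy measure restricted to u >= eps has finite total mass alpha * log(1/eps).
import numpy as np

def beta_process_atoms(alpha=5.0, eps=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    mass = alpha * np.log(1.0 / eps)          # integral of alpha / u over [eps, 1]
    n_atoms = rng.poisson(mass)               # number of atoms with mass u >= eps
    # masses have density proportional to 1/u on [eps, 1]; sample by inverse CDF
    u = eps ** (1.0 - rng.random(n_atoms))
    theta = rng.random(n_atoms)               # atom locations from H = Uniform(0, 1)
    return u, theta

u, theta = beta_process_atoms()
print(len(u), "atoms; largest masses:", np.round(np.sort(u)[::-1][:5], 3))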

Selected References

Gaussian Processes:
O'Hagan, A. (1978). Curve Fitting and Optimal Design for Prediction (with discussion). Journal of the Royal Statistical Society B, 40(1):1-42.
Rasmussen, C.E. and Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press.

Dirichlet Processes, Chinese Restaurant Processes, and related work:
Ferguson, T. (1973). A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics, 1(2):209-230.
Blackwell, D. and MacQueen, J. (1973). Ferguson Distributions via Polya Urn Schemes. Annals of Statistics, 1:353-355.
Aldous, D. (1985). Exchangeability and Related Topics. In Ecole d'Ete de Probabilites de Saint-Flour XIII 1983, Springer, Berlin, pp. 1-198.

Hierarchical Dirichlet Processes and Infinite Hidden Markov Models:
Beal, M.J., Ghahramani, Z., and Rasmussen, C.E. (2002). The Infinite Hidden Markov Model. In T.G. Dietterich, S. Becker, and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems 14, pp. 577-584. Cambridge, MA: MIT Press.
Teh, Y.W., Jordan, M.I., Beal, M.J., and Blei, D.M. (2005). Hierarchical Dirichlet Processes. Journal of the American Statistical Association.

van Gael, J., Saatci, Y., Teh, Y.W., and Ghahramani, Z. (2008). Beam Sampling for the Infinite Hidden Markov Model. In International Conference on Machine Learning (ICML 2008), pp. 1088-1095.

Dirichlet Process Mixtures:
Antoniak, C.E. (1974). Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Annals of Statistics, 2:1152-1174.
Escobar, M.D. and West, M. (1995). Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association, 90:577-588.
Neal, R.M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9:249-265.
Rasmussen, C.E. (2000). The Infinite Gaussian Mixture Model. In Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press.
Blei, D.M. and Jordan, M.I. (2005). Variational Methods for Dirichlet Process Mixtures. Bayesian Analysis.
Minka, T.P. and Ghahramani, Z. (2003). Expectation Propagation for Infinite Mixtures. NIPS 2003 Workshop on Nonparametric Bayesian Methods and Infinite Models.
Heller, K.A. and Ghahramani, Z. (2005). Bayesian Hierarchical Clustering. In Twenty-Second International Conference on Machine Learning (ICML 2005).

Dirichlet Diffusion Trees and Kingman's Coalescent:
Neal, R.M. (2003). Density Modeling and Clustering Using Dirichlet Diffusion Trees. In J.M. Bernardo et al. (eds.), Bayesian Statistics 7.
Kingman, J.F.C. (1982). The Coalescent. Stochastic Processes and their Applications, 13(3):235-248.

Teh, Y.W., Daumé III, H., and Roy, D. (2007). Bayesian Agglomerative Clustering with Coalescents. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada.

Indian Buffet Processes:
Griffiths, T.L. and Ghahramani, Z. (2006). Infinite Latent Feature Models and the Indian Buffet Process. In Advances in Neural Information Processing Systems 18, pp. 475-482. Cambridge, MA: MIT Press.
Teh, Y.W., Görür, D., and Ghahramani, Z. (2007). Stick-Breaking Construction for the Indian Buffet Process. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico.
Thibaux, R. and Jordan, M.I. (2007). Hierarchical Beta Processes and the Indian Buffet Process. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico.

Other:
Hjort, N.L. (1990). Nonparametric Bayes Estimators Based on Beta Processes in Models for Life History Data. Annals of Statistics, 18:1259-1294.
Kingman, J.F.C. (1967). Completely Random Measures. Pacific Journal of Mathematics, 21(1).
Teh, Y.W. and Görür, D. (2009). Indian Buffet Processes with Power-Law Behavior. In Advances in Neural Information Processing Systems.