Lecture 3a: Dirichlet processes

Lecture 3a: Dirichlet processes
Cédric Archambeau
Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London
c.archambeau@cs.ucl.ac.uk
Advanced Topics in Machine Learning (MSc in Intelligent Systems), January 2008

Today's plan
The Dirichlet distribution. The Dirichlet process. Representations of the Dirichlet process. Infinite mixture models. DP mixtures of regressors. DP mixtures of dynamical systems. Guest speakers: Yee Whye Teh (Gatsby Unit), David Barber (CSML) on 05/02.

(Bayesian) Nonparametric models
Real data is complex and the parametric assumption is often wrong. Nonparametric models allow us to relax the parametric assumption, bringing significant flexibility to our models. Nonparametric models lead to model selection/averaging behaviours without the cost of actually doing model selection/averaging. Nonparametric models are gaining popularity, spurred by growth in computational resources and inference algorithms. Examples: Gaussian processes (GPs) and Dirichlet processes (DPs). Other models, which will not be discussed, include Indian buffet processes, beta processes, tree processes, ...

Gaussian processes in a nutshell
A GP is a distribution over (latent) functions. A GP is defined by a Gaussian probability measure¹, i.e. it is characterised by a mean function $m(\cdot)$ and a covariance function $k(\cdot,\cdot)$. The consistency property of the Gaussian measure preserves the Gaussian parametric form when integrating out latent function values: every finite subset of latent function values is jointly Gaussian, the marginals are Gaussian, and Gaussian conditionals can be used to sample from a GP.
¹ Let $\mathcal{A}$ be the set of subsets of a space $\mathcal{X}$. A measure $G : \mathcal{A} \to \Omega$ assigns a nonnegative value to any subset of $\mathcal{X}$. We say that $G$ is a probability measure when $\Omega = [0, 1]$.
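As a quick illustration of the finite-subset view, the sketch below (not from the original slides) draws functions from a GP prior with a squared-exponential covariance at a finite grid of inputs; the grid, kernel and hyperparameter values are arbitrary choices for illustration.

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance k(x, x')."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)                  # finite set of inputs
K = se_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability
# Every finite subset of latent function values is jointly Gaussian:
f_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```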

Dirichlet distribution
The Dirichlet distribution is a distribution over $K$ discrete random variables $\pi_1, \dots, \pi_K$:
$$(\pi_1, \dots, \pi_K) \sim \mathcal{D}(\alpha_1, \dots, \alpha_K) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k^{\alpha_k - 1}.$$
We say that the Dirichlet distribution is a distribution over the $(K-1)$-dimensional probability simplex:
$$\Delta_K = \{(\pi_1, \dots, \pi_K) : 0 \le \pi_k \le 1,\ \textstyle\sum_k \pi_k = 1\}.$$
The mean and the variance are respectively given by
$$\langle \pi_k \rangle = \frac{\alpha_k}{\alpha}, \qquad \langle (\pi_k - \langle \pi_k \rangle)^2 \rangle = \frac{\alpha_k (\alpha - \alpha_k)}{\alpha^2 (\alpha + 1)},$$
where $\alpha \triangleq \sum_k \alpha_k$.
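A small numerical check of the mean and variance formulas (my own sketch; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])            # (alpha_1, ..., alpha_K)
a0 = alpha.sum()                             # alpha = sum_k alpha_k

samples = rng.dirichlet(alpha, size=200_000)

print(samples.mean(axis=0))                  # ~ alpha_k / alpha
print(alpha / a0)
print(samples.var(axis=0))                   # ~ alpha_k (alpha - alpha_k) / (alpha^2 (alpha + 1))
print(alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))
```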

Examples (figures omitted in the transcription)

Properties of the Dirichlet distribution
The Beta distribution corresponds to the Dirichlet distribution with $K = 2$:
$$\pi \sim \mathcal{B}(\alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \pi^{\alpha - 1} (1 - \pi)^{\beta - 1}.$$
Agglomerative property: Let $(\pi_1, \dots, \pi_K) \sim \mathcal{D}(\alpha_1, \dots, \alpha_K)$. Combining entries of probability vectors preserves the Dirichlet property:
$$(\pi_1 + \pi_2, \pi_3, \dots, \pi_K) \sim \mathcal{D}(\alpha_1 + \alpha_2, \alpha_3, \dots, \alpha_K).$$
The property holds for any partition of $\{1, \dots, K\}$.
Decimative property: Let $(\pi_1, \dots, \pi_K) \sim \mathcal{D}(\alpha_1, \dots, \alpha_K)$ and $(\tau_1, \tau_2) \sim \mathcal{D}(\alpha_1 \beta_1, \alpha_1 \beta_2)$, with $\beta_1 + \beta_2 = 1$. The converse of the agglomerative property is also true:
$$(\pi_1 \tau_1, \pi_1 \tau_2, \pi_2, \pi_3, \dots, \pi_K) \sim \mathcal{D}(\alpha_1 \beta_1, \alpha_1 \beta_2, \alpha_2, \alpha_3, \dots, \alpha_K).$$
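The agglomerative property can be checked numerically: combining the first two coordinates of a Dirichlet draw should match direct draws with the corresponding parameters summed. A rough Monte Carlo sketch (arbitrary parameters), comparing the first two moments:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0, 4.0])

x = rng.dirichlet(alpha, size=100_000)
combined = np.column_stack([x[:, 0] + x[:, 1], x[:, 2], x[:, 3]])

# Direct draws with the agglomerated parameters (alpha_1 + alpha_2, alpha_3, alpha_4)
y = rng.dirichlet(np.array([alpha[0] + alpha[1], alpha[2], alpha[3]]), size=100_000)

# Means and variances should agree up to Monte Carlo error
print(combined.mean(axis=0), y.mean(axis=0))
print(combined.var(axis=0), y.var(axis=0))
```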

Dirichlet process (DP)
A DP is a distribution over probability measures, such that marginals on finite partitions are Dirichlet distributed. Let $G$ and $H$ be probability measures over $\mathcal{X}$. The random probability measure $G$ is a Dirichlet process with base measure $H(\cdot)$ and intensity parameter $\alpha$,
$$G(\cdot) \sim \mathcal{DP}(\alpha, H(\cdot)),$$
if for any finite partition $\{A_1, \dots, A_K\}$ of $\mathcal{X}$ we have
$$(G(A_1), \dots, G(A_K)) \sim \mathcal{D}(\alpha H(A_1), \dots, \alpha H(A_K)).$$
The mean measure and its variance are respectively given by
$$\langle G(A) \rangle = H(A), \qquad \langle (G(A) - \langle G(A) \rangle)^2 \rangle = \frac{H(A)(1 - H(A))}{\alpha + 1},$$
where $A$ is any measurable subset of $\mathcal{X}$.

The mean measure follows directly from the expression of the mean of the Dirichlet distribution:
$$\langle G(A_k) \rangle = \frac{\alpha H(A_k)}{\sum_k \alpha H(A_k)} = H(A_k),$$
where we used the fact that $\sum_k H(A_k) = 1$. And so does the variance:
$$\langle (G(A_k) - H(A_k))^2 \rangle = \frac{\alpha H(A_k) \left( \sum_k \alpha H(A_k) - \alpha H(A_k) \right)}{\left( \sum_k \alpha H(A_k) \right)^2 \left( \sum_k \alpha H(A_k) + 1 \right)} = \frac{\alpha^2 H(A_k)(1 - H(A_k))}{\alpha^2 (\alpha + 1)} = \frac{H(A_k)(1 - H(A_k))}{\alpha + 1}.$$
(Figure: a partition $A_1, \dots, A_6$ of $\mathcal{X}$.)

Posterior Dirichlet process
Consider a fixed partition $\{A_1, \dots, A_K\}$ of $\mathcal{X}$. If $G(\cdot) \sim \mathcal{DP}(\alpha, H(\cdot))$, then
$$(G(A_1), \dots, G(A_K)) \sim \mathcal{D}(\alpha H(A_1), \dots, \alpha H(A_K)).$$
We may treat the random measure $G$ as a distribution over $\mathcal{X}$, such that $P(\theta \in A_k \mid G) = G(A_k)$. The posterior is again a Dirichlet distribution:
$$(G(A_1), \dots, G(A_K)) \mid \theta \sim \mathcal{D}(\alpha H(A_1) + \delta_\theta(A_1), \dots, \alpha H(A_K) + \delta_\theta(A_K)),$$
where $\delta_\theta(A_k) = 1$ if $\theta \in A_k$ and $0$ otherwise. The above is true for every finite partition of $\mathcal{X}$, e.g. with $H(d\theta) = p(\theta) d\theta$. Hence, this suggests that the posterior process is a Dirichlet process:
$$G(\cdot) \mid \theta \sim \mathcal{DP}\left( \alpha + 1, \frac{\alpha H(\cdot) + \delta_\theta(\cdot)}{\alpha + 1} \right),$$
where $\delta_\theta(\cdot)$ is Dirac's delta centred at $\theta$, which we call an atom.

Let us denote $G(A_k)$ by $\pi_k$ and $\alpha H(A_k)$ by $\alpha_k$. We introduce the auxiliary indicator variable $z$, which takes the value $k \in \{1, \dots, K\}$ when $\theta \in A_k$. Hence, we have
$$(\pi_1, \dots, \pi_K) \sim \mathcal{D}(\alpha_1, \dots, \alpha_K), \qquad z \mid \pi_1, \dots, \pi_K \sim \text{Discrete}(\pi_1, \dots, \pi_K).$$
If $z$ is Discrete with parameters $\pi_1, \dots, \pi_K$, then $z$ takes the value $k \in \{1, \dots, K\}$ with probability $\pi_k$. The posterior is given by Bayes' rule:
$$p(\pi_1, \dots, \pi_K \mid z = k) \propto \pi_k \prod_k \pi_k^{\alpha_k - 1} \quad\Longrightarrow\quad \pi_1, \dots, \pi_K \mid z \sim \mathcal{D}(\alpha_1 + \delta_z(1), \dots, \alpha_K + \delta_z(K)),$$
where $\delta_z(\cdot)$ is Kronecker's delta centred at $z$. The marginal is given by
$$p(z = k) = \langle \pi_k \rangle = \frac{\alpha_k}{\sum_k \alpha_k} \quad\Longrightarrow\quad z \sim \text{Discrete}\left( \frac{\alpha_1}{\sum_k \alpha_k}, \dots, \frac{\alpha_K}{\sum_k \alpha_k} \right).$$
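In finite dimensions this is just Dirichlet-Discrete conjugacy. A minimal sketch of the posterior update and the marginal $p(z = k)$ (my own illustration, with arbitrary parameter values):

```python
import numpy as np

def posterior_params(alpha, z):
    """Dirichlet posterior parameters after observing z = k (0-based index)."""
    alpha_post = alpha.copy()
    alpha_post[z] += 1.0               # alpha_k + delta_z(k)
    return alpha_post

alpha = np.array([1.0, 2.0, 3.0])
p_z = alpha / alpha.sum()              # marginal p(z = k) = alpha_k / sum_k alpha_k

rng = np.random.default_rng(2)
z = rng.choice(len(alpha), p=p_z)      # draw z from its marginal
print(p_z, z, posterior_params(alpha, z))
```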

Remark
If $G \sim \mathcal{DP}(\alpha, H)$ and $\theta \mid G \sim G$, then the posterior process and the marginal are respectively given by
$$G \mid \theta \sim \mathcal{DP}\left( \alpha + 1, \frac{\alpha H + \delta_\theta}{\alpha + 1} \right), \qquad \theta \sim H.$$
It can be shown that the converse is also true, such that
$$\begin{cases} G \sim \mathcal{DP}(\alpha, H) \\ \theta \mid G \sim G \end{cases} \quad\Longleftrightarrow\quad \begin{cases} \theta \sim H \\ G \mid \theta \sim \mathcal{DP}\left( \alpha + 1, \frac{\alpha H + \delta_\theta}{\alpha + 1} \right). \end{cases}$$

Blackwell-MacQueen urn scheme
The Blackwell-MacQueen urn scheme tells us how to sample sequentially from a DP. It is like a representer, i.e. a finite projection of an infinite object. It produces a sequence $\theta_1, \theta_2, \dots$ with the following conditionals:
$$\theta_N \mid \theta_{1:N-1} \sim \frac{\alpha H + \sum_{n=1}^{N-1} \delta_{\theta_n}}{\alpha + N - 1}.$$
This can be interpreted as picking balls of different colours from an urn:
Start with no balls in the urn.
With probability $\frac{\alpha}{\alpha + N - 1}$, draw $\theta_N \sim H$ and add a ball of colour $\theta_N$ to the urn.
With probability $\frac{N - 1}{\alpha + N - 1}$, pick a ball at random from the urn, record its colour $\theta_N$, return the ball to the urn and place a second ball of the same colour in the urn.
Hence, there is positive probability that $\theta_N$ takes the same value as $\theta_n$ for some $n < N$, i.e. the $\theta_1, \dots, \theta_N$ cluster together (see CRP). The Blackwell-MacQueen urn scheme is also known as the Pólya urn scheme.
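A minimal sketch of sequential sampling via the urn scheme, assuming a standard Gaussian base measure H (not part of the original slides; parameter values are illustrative):

```python
import numpy as np

def polya_urn(n_samples, alpha, base_sampler, rng):
    """Draw theta_1, ..., theta_N from the Blackwell-MacQueen urn scheme."""
    thetas = []
    for n in range(n_samples):
        # With prob alpha / (alpha + n): new draw from H; else copy a previous value.
        if rng.random() < alpha / (alpha + n):
            thetas.append(base_sampler(rng))
        else:
            thetas.append(thetas[rng.integers(n)])
    return np.array(thetas)

rng = np.random.default_rng(3)
draws = polya_urn(1000, alpha=5.0, base_sampler=lambda r: r.normal(), rng=rng)
print(len(np.unique(draws)))   # number of distinct values (clusters) << 1000
```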

First sample:
$$G \sim \mathcal{DP}(\alpha, H), \quad \theta_1 \mid G \sim G \quad\Longrightarrow\quad \theta_1 \sim H, \quad G \mid \theta_1 \sim \mathcal{DP}\left( \alpha + 1, \frac{\alpha H + \delta_{\theta_1}}{\alpha + 1} \right).$$
Second sample:
$$G \mid \theta_1 \sim \mathcal{DP}\left( \alpha + 1, \frac{\alpha H + \delta_{\theta_1}}{\alpha + 1} \right), \quad \theta_2 \mid \theta_1, G \sim G \quad\Longrightarrow\quad \theta_2 \mid \theta_1 \sim \frac{\alpha H + \delta_{\theta_1}}{\alpha + 1}, \quad G \mid \theta_1, \theta_2 \sim \mathcal{DP}\left( \alpha + 2, \frac{\alpha H + \delta_{\theta_1} + \delta_{\theta_2}}{\alpha + 2} \right).$$
$N$th sample:
$$G \mid \theta_{1:N-1} \sim \mathcal{DP}\left( \alpha + N - 1, \frac{\alpha H + \sum_{n=1}^{N-1} \delta_{\theta_n}}{\alpha + N - 1} \right), \quad \theta_N \mid \theta_{1:N-1}, G \sim G \quad\Longrightarrow\quad \theta_N \mid \theta_{1:N-1} \sim \frac{\alpha H + \sum_{n=1}^{N-1} \delta_{\theta_n}}{\alpha + N - 1}, \quad G \mid \theta_{1:N} \sim \mathcal{DP}\left( \alpha + N, \frac{\alpha H + \sum_{n=1}^{N} \delta_{\theta_n}}{\alpha + N} \right).$$

Chinese restaurant process (CRP)
Random draws $\theta_1, \dots, \theta_N$ from a Blackwell-MacQueen urn scheme induce a random partition of $\{1, \dots, N\}$: the $\theta_1, \dots, \theta_N$ take $K \le N$ distinct values, say $\theta_1^*, \dots, \theta_K^*$. Every drawn sequence defines a partition of $\{1, \dots, N\}$ into $K$ clusters, such that if $n$ is in cluster $k$, then $\theta_n = \theta_k^*$. The induced distribution over partitions is a CRP.
Generating from the CRP:
The first customer sits at the first table.
Customer $N$ sits at a new table with probability $\frac{\alpha}{\alpha + N - 1}$, and at table $k$ with probability $\frac{N_k}{\alpha + N - 1}$, where $N_k$ is the number of customers at table $k$.
The CRP exhibits the clustering property of the DP:
Tables are clusters, i.e. distinct values $\{\theta_k^*\}_{k=1}^K$ drawn from $H$.
Customers are the actual realisations, i.e. $\theta_n = \theta_{z_n}^*$, where $z_n \in \{1, \dots, K\}$ indicates at which table customer $n$ sat.
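A direct sketch of CRP table assignments (my own illustration; counts-based, and equivalent in distribution to the urn scheme above):

```python
import numpy as np

def crp(n_customers, alpha, rng):
    """Return table assignments z_1, ..., z_N and table sizes sampled from a CRP(alpha)."""
    counts = []                                   # N_k: customers per table
    assignments = []
    for n in range(n_customers):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= alpha + n                        # N_k / (alpha + n) or alpha / (alpha + n)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):                  # new table
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return np.array(assignments), np.array(counts)

rng = np.random.default_rng(4)
z, counts = crp(1000, alpha=5.0, rng=rng)
print(len(counts), counts[:10])                   # number of tables and their sizes
```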

Stick-breaking construction
We would like to exhibit the discrete character of $G \sim \mathcal{DP}(\alpha, H)$. Consider $\theta \sim H$ and the partition $\{\theta, \mathcal{X} \setminus \theta\}$ of $\mathcal{X}$. The posterior process $G \mid \theta \sim \mathcal{DP}\left( \alpha + 1, \frac{\alpha H + \delta_\theta}{\alpha + 1} \right)$ implies that
$$(G(\theta), G(\mathcal{X} \setminus \theta)) \mid \theta \sim \mathcal{D}(\alpha H(\theta) + \delta_\theta(\theta), \alpha H(\mathcal{X} \setminus \theta) + \delta_\theta(\mathcal{X} \setminus \theta)) = \mathcal{D}(1, \alpha) \quad\Longrightarrow\quad G(\theta) \mid \theta \sim \mathcal{B}(1, \alpha),$$
since $H(\theta) = 0$ for a continuous base measure $H$. Hence, $G$ has a point mass located at $\theta$, such that
$$G = \beta \delta_\theta + (1 - \beta) G', \qquad \beta \sim \mathcal{B}(1, \alpha),$$
where $G'$ is the (renormalised) probability measure without the point mass.

Stick-breaking construction (continued)
What form has the probability measure $G'$? Let $\{A_1, \dots, A_K\}$ be a partition of $\mathcal{X} \setminus \theta$. Since $G(\mathcal{X} \setminus \theta) = \sum_k G(A_k)$, the agglomerative property of the Dirichlet leads to
$$(G(\theta), G(A_1), \dots, G(A_K)) \mid \theta \sim \mathcal{D}(1, \alpha H(A_1), \dots, \alpha H(A_K)).$$
Hence, we see that
$$G(\theta) \mid \theta = \beta, \qquad G(A_1) \mid \theta = (1 - \beta) G'(A_1), \qquad \dots, \qquad G(A_K) \mid \theta = (1 - \beta) G'(A_K).$$
From the decimative property of the Dirichlet, we get
$$(G'(A_1), \dots, G'(A_K)) \sim \mathcal{D}(\alpha H(A_1), \dots, \alpha H(A_K)).$$
In other words, $G' \sim \mathcal{DP}(\alpha, H)$!

Stick-breaking construction (continued)
What form has the random measure $G$?
$$G \sim \mathcal{DP}(\alpha, H) \quad\Longrightarrow\quad G = \beta_1 \delta_{\theta_1^*} + (1 - \beta_1) G_1, \qquad G_1 \sim \mathcal{DP}(\alpha, H),$$
$$\phantom{G \sim \mathcal{DP}(\alpha, H)} \quad\Longrightarrow\quad G = \beta_1 \delta_{\theta_1^*} + (1 - \beta_1)\big(\beta_2 \delta_{\theta_2^*} + (1 - \beta_2) G_2\big), \qquad G_2 \sim \mathcal{DP}(\alpha, H),$$
$$\vdots$$
$$G = \sum_{n=1}^{\infty} \pi_n \delta_{\theta_n^*}, \qquad \pi_n = \beta_n \prod_{k=1}^{n-1} (1 - \beta_k), \qquad \beta_n \sim \mathcal{B}(1, \alpha), \qquad \theta_n^* \sim H.$$
Draws from the DP look like a weighted sum of point masses, where the weights are drawn from the stick-breaking construction². The steps are limited by assumptions of regularity on $\mathcal{X}$ and the smoothness of $H$.
² The construction can be viewed as repeatedly breaking off a Beta-distributed fraction $\beta_n$ of the remainder of a unit-length stick.
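A truncated stick-breaking sampler, assuming a standard Gaussian base measure and a fixed truncation level (my own sketch; the truncation is an approximation to the infinite sum):

```python
import numpy as np

def stick_breaking(alpha, base_sampler, truncation, rng):
    """Draw (weights, atoms) of a truncated stick-breaking approximation to G ~ DP(alpha, H)."""
    betas = rng.beta(1.0, alpha, size=truncation)           # beta_n ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                              # pi_n = beta_n prod_{k<n} (1 - beta_k)
    atoms = base_sampler(rng, truncation)                    # theta_n* ~ H
    return weights, atoms

rng = np.random.default_rng(5)
w, theta = stick_breaking(alpha=15.0, base_sampler=lambda r, n: r.normal(size=n),
                          truncation=500, rng=rng)
print(w.sum())   # close to 1 when the truncation level is large relative to alpha
```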

Sampling from a DP using the stick-breaking construction (demo: stick_break)
The base measure H is a unit Gaussian with zero mean. Note how the intensity parameter α regulates the number of spikes.
Figure: The random measure G drawn from a Dirichlet process prior is discrete. (a) α = 15. (b) α = 150.

Exchangeability (de Finetti's theorem)
Exchangeability is a fundamental property of Dirichlet processes, in particular for inference. An infinite sequence $\theta_1, \theta_2, \theta_3, \dots$ of random variables is said to be infinitely exchangeable if for any finite subset $\theta_{i_1}, \dots, \theta_{i_N}$ and any permutation $\theta_{j_1}, \dots, \theta_{j_N}$ of it, we have
$$P(\theta_{i_1}, \dots, \theta_{i_N}) = P(\theta_{j_1}, \dots, \theta_{j_N}).$$
The condition of exchangeability is weaker than the assumption that the variables are independent and identically distributed. Exchangeable observations are conditionally independent given some (usually unobserved) quantity to which some probability distribution would be assigned. Using de Finetti's theorem, it is possible to show that draws from a DP are infinitely exchangeable.

Density estimation with Dirichlet processes
Consider a set of realisations $\{x_n\}_{n=1}^N$ of the random variable $X$. Based on these observations, we would like to model the density of $X$. Would the following setting work?
$$G \sim \mathcal{DP}(\alpha, H), \qquad x_n \mid G \sim G.$$
Obviously not, as $G$ is a discrete distribution (not a density). A solution is to convolve $G \sim \mathcal{DP}(\alpha, H)$ with a kernel $F(\cdot \mid \theta)$:
$$F(\cdot) = \int F(\cdot \mid \theta) G(\theta)\, d\theta = \sum_{n=1}^{\infty} \pi_n F(\cdot \mid \theta_n^*),$$
where we used the stick-breaking representation $G(\theta) = \sum_{n=1}^{\infty} \pi_n \delta_{\theta_n^*}(\theta)$. Each data point $x_n$ is modelled by a latent parameter $\theta_n$, which is drawn from an unknown distribution $G$ on which a Dirichlet process prior is imposed:
$$G \sim \mathcal{DP}(\alpha, H), \qquad \theta_n \mid G \sim G, \qquad x_n \mid \theta_n \sim F(\cdot \mid \theta_n).$$
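Putting the pieces together, the sketch below generates data from a DP mixture of Gaussians via the truncated stick-breaking representation, assuming H is a Gaussian over component means and a fixed observation noise (all settings are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, truncation, n_data, obs_std = 2.0, 200, 500, 0.3

# G = sum_n pi_n delta_{theta_n*}, approximated by truncation
betas = rng.beta(1.0, alpha, size=truncation)
weights = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
weights /= weights.sum()                        # renormalise the truncated weights
atoms = rng.normal(0.0, 2.0, size=truncation)   # theta_n* ~ H = N(0, 2^2)

# theta_n | G ~ G, then x_n | theta_n ~ F(. | theta_n) = N(theta_n, obs_std^2)
idx = rng.choice(truncation, size=n_data, p=weights)
x = rng.normal(atoms[idx], obs_std)
print(len(np.unique(idx)), "active components among", n_data, "data points")
```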

Finite mixture models
A finite mixture model makes the assumption that each data point $x_n$ was generated by a single mixture component with parameters $\theta_k$:
$$x_n \sim \sum_k \pi_k F(x_n \mid \theta_k).$$
This can be captured by introducing a latent indicator variable $z_n$. The model is formalised by the following priors and likelihood:
$$\theta_k \sim H, \qquad (\pi_1, \dots, \pi_K) \sim \mathcal{D}(\alpha/K, \dots, \alpha/K), \qquad z_n \mid \pi_1, \dots, \pi_K \sim \text{Discrete}(\pi_1, \dots, \pi_K), \qquad x_n \mid \theta_{z_n} \sim F(\cdot \mid \theta_{z_n}).$$
Inference is performed on: the hyperparameters of the prior $H$, the parameter $\alpha$ of the Dirichlet prior on the weights, and the number of components $K$.
(Figure: graphical model $\alpha \to \pi \to z_i \to x_i \leftarrow \theta_k \leftarrow H$, with plates over $i = 1, \dots, n$ and $k = 1, \dots, K$.)
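For comparison with the DP mixture sketch above, the same generative process with finite K (Gaussian components; my own illustrative settings):

```python
import numpy as np

rng = np.random.default_rng(7)
K, N, alpha, obs_std = 5, 500, 2.0, 0.3

theta = rng.normal(0.0, 2.0, size=K)              # theta_k ~ H
pi = rng.dirichlet(np.full(K, alpha / K))         # (pi_1, ..., pi_K) ~ D(alpha/K, ..., alpha/K)
z = rng.choice(K, size=N, p=pi)                   # z_n ~ Discrete(pi)
x = rng.normal(theta[z], obs_std)                 # x_n ~ F(. | theta_{z_n})
print(np.bincount(z, minlength=K))                # how many points each component generated
```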

DP mixture models
If $K$ is very large, there is no overfitting provided the parameters $\{\theta_k\}$ and mixing proportions $\{\pi_k\}$ are integrated out, and at most $N$ (but usually far fewer) components will be active, i.e. associated with data. Recall the density estimation approach based on the DP:
$$F(\cdot) = \sum_{k=1}^{\infty} \pi_k F(\cdot \mid \theta_k^*), \qquad G \sim \mathcal{DP}(\alpha, H), \qquad \theta_n \mid G \sim G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k^*}, \qquad x_n \mid \theta_n \sim F(\cdot \mid \theta_n = \theta_k^*).$$
This is equivalent to a mixture model with a countably infinite number of components:
$$(\pi_1, \dots, \pi_K) \sim \mathcal{D}(\alpha/K, \dots, \alpha/K), \qquad z_n \sim \text{Discrete}(\pi_1, \dots, \pi_K), \qquad x_n \mid z_n \sim F(\cdot \mid \theta_{z_n}),$$
where $K \to \infty$.
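The $K \to \infty$ claim can be illustrated numerically: the expected number of components actually used by $N$ draws from the finite symmetric Dirichlet mixture approaches the CRP expectation $\sum_{n=1}^{N} \alpha/(\alpha + n - 1)$ as $K$ grows. A rough Monte Carlo sketch (arbitrary settings):

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, N, n_reps = 2.0, 100, 200

def active_components(K):
    """Average number of distinct components used by N draws from a D(alpha/K,...) mixture."""
    used = []
    for _ in range(n_reps):
        pi = rng.dirichlet(np.full(K, alpha / K))
        z = rng.choice(K, size=N, p=pi)
        used.append(len(np.unique(z)))
    return np.mean(used)

for K in (10, 100, 1000):
    print(K, active_components(K))

# CRP expectation of the number of occupied tables after N customers
print("CRP:", np.sum(alpha / (alpha + np.arange(N))))
```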

DP mixture models (continued)
DP mixture models are used to sidestep the model selection problem by replacing it with model averaging. They are used in applications in which the number of clusters is not known a priori or is believed to grow without bound as the amount of data grows. DPs can also be used in applications where the number of latent objects is not known or is unbounded: nonparametric probabilistic context-free grammars, visual scene analysis, infinite hidden Markov models/trees, haplotype inference, ... In many such applications it is important to be able to model the same set of objects in different contexts. This corresponds to the problem of grouped clustering and can be tackled using hierarchical Dirichlet processes (Teh et al., 2006). Inference for DP mixtures depends on the representation: MCMC based on the CRP, the stick-breaking construction, etc., or variational inference based on the stick-breaking construction (Blei and Jordan, 2006).

References
The slides are largely based on Y. W. Teh's tutorial and practical course on Dirichlet processes at the Machine Learning Summer School 2007, and on M. I. Jordan's tutorial on Dirichlet processes at NIPS 2005.
Blackwell, D. and MacQueen, J. B. (1973), Ferguson distributions via Pólya urn schemes, Annals of Statistics, 1:353–355.
Blei, D. M. and Jordan, M. I. (2006), Variational inference for Dirichlet process mixtures, Bayesian Analysis, 1(1):121–144.
Ferguson, T. S. (1973), A Bayesian analysis of some nonparametric problems, Annals of Statistics, 1(2):209–230.
Rasmussen, C. E. (2000), The infinite Gaussian mixture model, Advances in Neural Information Processing Systems 12.
Sethuraman, J. (1994), A constructive definition of Dirichlet priors, Statistica Sinica, 4:639–650.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006), Hierarchical Dirichlet processes, Journal of the American Statistical Association, 101(476):1566–1581.