Lecture 3a: Dirichlet processes Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced Topics in Machine Learning (MSc in Intelligent Systems) January 2008
Today's plan
The Dirichlet distribution
The Dirichlet process
Representations of the Dirichlet process
Infinite mixture models
DP mixtures of Regressors
DP mixtures of Dynamical Systems
Guest speakers: Yee Whye Teh (Gatsby Unit) and David Barber (CSML) on 05/02.
(Bayesian) Nonparametric models Real data is complex and the parametric assumption is often wrong. Nonparametric models allow us to relax the parametric assumption, bringing significant flexibility to our models. Nonparametric models lead to model selection/averaging behaviours without the cost of actually doing model selection/averaging. Nonparametric models are gaining popularity, spurred by growth in computational resources and inference algorithms. Examples: Gaussian processes (GPs) and Dirichlet processes (DPs). Other models which will not be discussed include Indian buffet processes, beta processes, tree processes,...
Gaussian processes in a nutshell A GP is a distribution over (latent) functions. A GP is defined by a Gaussian probability measure¹, i.e. it is characterised by a mean function m(·) and a covariance function k(·, ·). The consistency property of the Gaussian measure preserves the Gaussian parametric form when integrating out latent function values:
Every finite subset of latent function values is jointly Gaussian.
The marginals are Gaussian.
Gaussian conditionals can be used to sample from a GP.
¹ Let A be the set of subsets of a space X. A measure G : A → Ω assigns a nonnegative value to any subset of X. We say that G is a probability measure when Ω = [0, 1].
Dirichlet distribution The Dirichlet distribution is a distribution over K random variables π_1, ..., π_K with density

(π_1, ..., π_K) ~ D(α_1, ..., α_K) = [Γ(Σ_k α_k) / Π_k Γ(α_k)] Π_{k=1}^K π_k^{α_k − 1}.

We say that the Dirichlet distribution is a distribution over the (K − 1)-dimensional probability simplex:

Δ_K = {(π_1, ..., π_K) : 0 ≤ π_k ≤ 1, Σ_k π_k = 1}.

The mean and the variance are respectively given by

⟨π_k⟩ = α_k / α,   ⟨(π_k − ⟨π_k⟩)²⟩ = α_k(α − α_k) / (α²(α + 1)),

where α ≡ Σ_k α_k.
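These moments can be checked numerically. A minimal sketch using NumPy's built-in Dirichlet sampler; the concentration vector and sample size are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])       # arbitrary concentration parameters
samples = rng.dirichlet(alpha, size=200_000)

a = alpha.sum()
mean_theory = alpha / a                               # <pi_k> = alpha_k / alpha
var_theory = alpha * (a - alpha) / (a**2 * (a + 1))   # alpha_k(alpha - alpha_k)/(alpha^2(alpha + 1))

assert np.allclose(samples.sum(axis=1), 1.0)          # every draw lies on the simplex
assert np.allclose(samples.mean(axis=0), mean_theory, atol=1e-2)
assert np.allclose(samples.var(axis=0), var_theory, atol=1e-3)
```

The empirical moments of the Monte Carlo sample match the closed-form mean and variance above.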
Examples
Properties of the Dirichlet distribution The Beta distribution corresponds to the Dirichlet distribution with K = 2:

π ~ B(α, β) = [Γ(α + β) / (Γ(α)Γ(β))] π^{α − 1} (1 − π)^{β − 1}.

Agglomerative property: Let (π_1, ..., π_K) ~ D(α_1, ..., α_K). Combining entries of probability vectors preserves the Dirichlet property:

(π_1 + π_2, π_3, ..., π_K) ~ D(α_1 + α_2, α_3, ..., α_K).

The property holds for any partition of {1, ..., K}.

Decimative property: Let (π_1, ..., π_K) ~ D(α_1, ..., α_K) and (τ_1, τ_2) ~ D(α_1β_1, α_1β_2), with β_1 + β_2 = 1. The converse of the agglomerative property is also true:

(π_1τ_1, π_1τ_2, π_2, π_3, ..., π_K) ~ D(α_1β_1, α_1β_2, α_2, α_3, ..., α_K).
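The agglomerative property is easy to verify by simulation. A hedged sketch (parameter values are arbitrary): combining the first two coordinates of draws from D(1, 2, 3, 4) should match direct draws from D(3, 3, 4) in distribution, so their first two moments should agree.

```python
import numpy as np

rng = np.random.default_rng(1)
draws = rng.dirichlet(np.array([1.0, 2.0, 3.0, 4.0]), size=200_000)

# Combine the first two entries: (pi1 + pi2, pi3, pi4) should be D(1+2, 3, 4).
combined = np.column_stack([draws[:, 0] + draws[:, 1], draws[:, 2:]])
direct = rng.dirichlet(np.array([3.0, 3.0, 4.0]), size=200_000)

# Compare first two moments of the two ensembles.
assert np.allclose(combined.mean(axis=0), direct.mean(axis=0), atol=1e-2)
assert np.allclose(combined.var(axis=0), direct.var(axis=0), atol=1e-3)
```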
Dirichlet process (DP) A DP is a distribution over probability measures, such that marginals on finite partitions are Dirichlet distributed. Let G and H be probability measures over X. The random probability measure G is a Dirichlet process with base measure H(·) and intensity parameter α,

G(·) ~ DP(α, H(·)),

if for any finite partition {A_1, ..., A_K} of X we have

(G(A_1), ..., G(A_K)) ~ D(αH(A_1), ..., αH(A_K)).

The mean distribution and its variance are respectively given by

⟨G(A)⟩ = H(A),   ⟨(G(A) − ⟨G(A)⟩)²⟩ = H(A)(1 − H(A)) / (α + 1),

where A is any measurable subset of X.
The mean measure follows directly from the expression of the mean of the Dirichlet distribution:

⟨G(A_k)⟩ = αH(A_k) / Σ_k αH(A_k) = H(A_k),

where we used the fact that Σ_k H(A_k) = 1. And so does the variance:

⟨(G(A_k) − H(A_k))²⟩ = αH(A_k)(Σ_k αH(A_k) − αH(A_k)) / [(Σ_k αH(A_k))² (Σ_k αH(A_k) + 1)]
  = α² H(A_k)(1 − H(A_k)) / [α²(α + 1)]
  = H(A_k)(1 − H(A_k)) / (α + 1).

(Figure: a finite partition {A_1, ..., A_6} of the space X.)
Posterior Dirichlet process Consider a fixed partition {A_1, ..., A_K} of X. If G(·) ~ DP(α, H(·)), then

(G(A_1), ..., G(A_K)) ~ D(αH(A_1), ..., αH(A_K)).

We may treat the random measure G as a distribution over X, such that P(θ ∈ A_k | G) = G(A_k). The posterior is again a Dirichlet distribution:

(G(A_1), ..., G(A_K)) | θ ~ D(αH(A_1) + δ_θ(A_1), ..., αH(A_K) + δ_θ(A_K)),

where δ_θ(A_k) = 1 if θ ∈ A_k and 0 otherwise. The above is true for every finite partition of X, e.g. for H(dθ) = p(θ)dθ. Hence, this suggests that the posterior process is a Dirichlet process:

G(·) | θ ~ DP(α + 1, (αH(·) + δ_θ(·)) / (α + 1)),

where δ_θ(·) is Dirac's delta centred at θ, which we call an atom.
Let us denote G(A_k) by π_k and αH(A_k) by α_k. We introduce the auxiliary indicator variable z, which takes the value k ∈ {1, ..., K} when θ ∈ A_k. Hence, we have

(π_1, ..., π_K) ~ D(α_1, ..., α_K),
z | π_1, ..., π_K ~ Discrete(π_1, ..., π_K).

If z is Discrete with parameters π_1, ..., π_K, then z takes the value k ∈ {1, ..., K} with probability π_k. The posterior is given by Bayes' rule:

p(π_1, ..., π_K | z = k) ∝ π_k Π_k π_k^{α_k − 1},
π_1, ..., π_K | z ~ D(α_1 + δ_z(1), ..., α_K + δ_z(K)),

where δ_z(·) is Kronecker's delta centred at z. The marginal is given by

p(z = k) = ⟨π_k⟩ = α_k / Σ_k α_k,   i.e.   z ~ Discrete(α_1/Σ_k α_k, ..., α_K/Σ_k α_k).
Remark If G ~ DP(α, H) and θ | G ~ G, then the posterior process and the marginal are respectively given by

G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)),   θ ~ H.

It can be shown that the converse is also true, such that

G ~ DP(α, H), θ | G ~ G   ⟺   θ ~ H, G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)).
Blackwell-MacQueen urn scheme The Blackwell-MacQueen urn scheme tells us how to sample sequentially from a DP. It is like a representer, i.e. a finite projection of an infinite object. It produces a sequence θ_1, θ_2, ... with the following conditionals:

θ_N | θ_{1:N−1} ~ (αH + Σ_{n=1}^{N−1} δ_{θ_n}) / (α + N − 1).

This can be interpreted as picking balls of different colours from an urn:
Start with no balls in the urn.
With probability proportional to α, draw θ_N ~ H and add a ball of colour θ_N to the urn.
With probability proportional to N − 1, pick a ball at random from the urn, record its colour θ_N and return the ball to the urn, together with a second ball of the same colour.
Hence, there is positive probability that θ_N takes the same value as θ_n for some n < N, i.e. the draws θ_1, ..., θ_N cluster together (see CRP). The Blackwell-MacQueen urn scheme is also known as the Pólya urn scheme.
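The urn scheme above is straightforward to simulate. A minimal sketch, assuming a standard-Gaussian base measure H and α = 2 (both arbitrary illustrative choices):

```python
import numpy as np

def blackwell_macqueen(n, alpha, base_sampler, rng):
    """Sequentially draw theta_1..theta_n from the Blackwell-MacQueen urn."""
    thetas = []
    for i in range(n):                               # i balls currently in the urn
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_sampler(rng))         # new colour: theta ~ H
        else:
            thetas.append(thetas[rng.integers(i)])   # pick an existing ball uniformly
    return thetas

rng = np.random.default_rng(2)
draws = blackwell_macqueen(50, alpha=2.0, base_sampler=lambda r: r.normal(), rng=rng)
print(len(set(draws)), "distinct values among 50 draws")
```

Because previously seen values are re-drawn with positive probability, the 50 draws contain far fewer than 50 distinct values, illustrating the clustering behaviour.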
First sample:
G ~ DP(α, H), θ_1 | G ~ G   ⟹   θ_1 ~ H,   G | θ_1 ~ DP(α + 1, (αH + δ_{θ_1})/(α + 1)).

Second sample:
G | θ_1 ~ DP(α + 1, (αH + δ_{θ_1})/(α + 1)), θ_2 | θ_1, G ~ G
⟹   θ_2 | θ_1 ~ (αH + δ_{θ_1})/(α + 1),   G | θ_1, θ_2 ~ DP(α + 2, (αH + δ_{θ_1} + δ_{θ_2})/(α + 2)).

N-th sample:
G | θ_{1:N−1} ~ DP(α + N − 1, (αH + Σ_{n=1}^{N−1} δ_{θ_n})/(α + N − 1)), θ_N | θ_{1:N−1}, G ~ G
⟹   θ_N | θ_{1:N−1} ~ (αH + Σ_{n=1}^{N−1} δ_{θ_n})/(α + N − 1),   G | θ_{1:N} ~ DP(α + N, (αH + Σ_{n=1}^{N} δ_{θ_n})/(α + N)).
Chinese restaurant process (CRP) Random draws θ_1, ..., θ_N from a Blackwell-MacQueen urn scheme induce a random partition of {1, ..., N}:
θ_1, ..., θ_N take K ≤ N distinct values, say θ*_1, ..., θ*_K.
Every drawn sequence defines a partition of {1, ..., N} into K clusters, such that if n is in cluster k, then θ_n = θ*_k.
The induced distribution over partitions is a CRP.
Generating from the CRP:
The first customer sits at the first table.
Customer N sits at:
  a new table with probability α/(α + N − 1);
  table k with probability N_k/(α + N − 1), where N_k is the number of customers at table k.
The CRP exhibits the clustering property of the DP:
Tables are clusters, i.e. distinct values {θ*_k}_{k=1}^K drawn from H.
Customers are the actual realisations, i.e. θ_n = θ*_{z_n}, where z_n ∈ {1, ..., K} indicates at which table customer n sat.
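The seating rule above translates directly into code. A small sketch of the CRP that returns table assignments and per-table counts (α = 1 and n = 100 are arbitrary choices):

```python
import numpy as np

def crp(n, alpha, rng):
    """Table assignments z_1..z_n under the Chinese restaurant process."""
    counts = []                        # N_k: number of customers per table
    z = []
    for i in range(n):
        # Existing table k with prob N_k/(alpha+i), new table with prob alpha/(alpha+i).
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)           # customer opens a new table
        else:
            counts[k] += 1
        z.append(int(k))
    return z, counts

rng = np.random.default_rng(3)
z, counts = crp(100, alpha=1.0, rng=rng)
assert sum(counts) == 100              # every customer is seated exactly once
```

With α = 1 the expected number of occupied tables grows only logarithmically in n, which is the clustering property in action.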
Stick-breaking construction We would like to exhibit the discrete character of G ~ DP(α, H). Consider θ ~ H and the partition {θ, X\θ} of X. The posterior process G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)) implies that

(G(θ), G(X\θ)) | θ ~ D(αH(θ) + δ_θ(θ), αH(X\θ) + δ_θ(X\θ)) = D(1, α),

so that G(θ) | θ ~ B(1, α). Hence, G has a point mass located at θ, such that

G = βδ_θ + (1 − β)G′,   β ~ B(1, α),

where G′ is the (renormalised) probability measure without the point mass.
Stick-breaking construction (continued) What form has the probability measure G′? Let {A_1, ..., A_K} be a partition of X\θ. Since G(X\θ) = Σ_k G(A_k), the agglomerative property of the Dirichlet leads to

(G(θ), G(A_1), ..., G(A_K)) | θ ~ D(1, αH(A_1), ..., αH(A_K)).

Hence, we see that

G(θ) | θ = β,
G(A_1) | θ = (1 − β)G′(A_1),
...
G(A_K) | θ = (1 − β)G′(A_K).

From the decimative property of the Dirichlet, we get

(G′(A_1), ..., G′(A_K)) ~ D(αH(A_1), ..., αH(A_K)).

In other words, G′ ~ DP(α, H)!
Stick-breaking construction (continued) What form has the random measure G?

G ~ DP(α, H)   ⟹   G = β_1 δ_{θ*_1} + (1 − β_1)G_1,   G_1 ~ DP(α, H)
⟹   G = β_1 δ_{θ*_1} + (1 − β_1)(β_2 δ_{θ*_2} + (1 − β_2)G_2),   G_2 ~ DP(α, H)
⟹   ...
⟹   G = Σ_{n=1}^∞ π_n δ_{θ*_n},

where

π_n = β_n Π_{k=1}^{n−1} (1 − β_k),   β_n ~ B(1, α),   θ*_n ~ H.

Draws from the DP look like a weighted sum of point masses, where the weights are drawn from the stick-breaking construction². The steps require mild regularity assumptions on X and smoothness assumptions on H.
² The construction can be viewed as repeatedly breaking off a Beta-distributed fraction β_n of the remainder of a unit-length stick, where the remainder after n − 1 breaks has length Π_{k=1}^{n−1}(1 − β_k).
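The infinite sum can be approximated in practice by truncating the construction after a fixed number of breaks; the truncation level and the Gaussian base measure below are arbitrary illustrative choices, not part of the construction itself:

```python
import numpy as np

def stick_breaking(alpha, base_sampler, rng, truncation=500):
    """Truncated stick-breaking draw from DP(alpha, H): weights pi_n and atoms theta*_n."""
    betas = rng.beta(1.0, alpha, size=truncation)            # beta_n ~ B(1, alpha)
    remainder = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remainder                              # pi_n = beta_n prod_{k<n}(1 - beta_k)
    atoms = base_sampler(rng, truncation)                    # theta*_n ~ H, i.i.d.
    return weights, atoms

rng = np.random.default_rng(4)
weights, atoms = stick_breaking(15.0, lambda r, size: r.normal(size=size), rng)
assert abs(weights.sum() - 1.0) < 1e-3   # 500 atoms capture essentially all the mass
```

The leftover stick length after T breaks has expectation (α/(α+1))^T, so for moderate α a few hundred atoms already account for almost all of the probability mass.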
Sampling from a DP using the stick-breaking construction (demo: stick_break) The base measure H is a unit Gaussian with zero mean. Note how the intensity parameter α regulates the number of spikes.

Figure: The random measure G drawn from a Dirichlet process prior is discrete. (a) α = 15; (b) α = 150.
Exchangeability (de Finetti's theorem) Exchangeability is a fundamental property of Dirichlet processes, in particular for inference. An infinite sequence θ_1, θ_2, θ_3, ... of random variables is said to be infinitely exchangeable if for any finite subset θ_{i_1}, ..., θ_{i_N} and any permutation θ_{j_1}, ..., θ_{j_N} of it, we have

P(θ_{i_1}, ..., θ_{i_N}) = P(θ_{j_1}, ..., θ_{j_N}).

The condition of exchangeability is weaker than the assumption that the variables are independent and identically distributed. Exchangeable observations are conditionally independent given some (usually unobserved) quantity to which some probability distribution would be assigned. Using de Finetti's theorem, it is possible to show that draws from a DP are infinitely exchangeable.
Density estimation with Dirichlet processes Consider a set of realisations {x_n}_{n=1}^N of the random variable X. Based on these observations, we would like to model the density of X. Would the following setting work?

G ~ DP(α, H),   x_n | G ~ G.

Obviously not, as G is a discrete distribution (not a density). A solution is to convolve G ~ DP(α, H) with a kernel F(·|θ):

F(·) = ∫ F(·|θ) G(θ) dθ = Σ_{n=1}^∞ π_n F(·|θ*_n),

where we used the stick-breaking representation G(θ) = Σ_{n=1}^∞ π_n δ_{θ*_n}(θ). Each data point x_n is modelled by a latent parameter θ_n, which is drawn from an unknown distribution G on which a Dirichlet process prior is imposed:

G ~ DP(α, H),   θ_n | G ~ G,   x_n | θ_n ~ F(·|θ_n).
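This generative model can be sampled from without ever instantiating G, by drawing the latent parameters θ_n through the urn scheme. A hedged sketch: H is taken as N(0, 3²) over cluster means and F(·|θ) as a unit-variance Gaussian kernel, both arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, N = 2.0, 500

# Draw latent parameters theta_n via the urn scheme (G integrated out).
thetas = []
for i in range(N):
    if rng.random() < alpha / (alpha + i):
        thetas.append(3.0 * rng.normal())        # new cluster mean from H = N(0, 3^2)
    else:
        thetas.append(thetas[rng.integers(i)])   # reuse an existing cluster mean
x = np.array([rng.normal(loc=t, scale=1.0) for t in thetas])  # x_n ~ F(.|theta_n)

print(f"{len(set(thetas))} clusters for {N} points")
```

The resulting x is a draw from a DP mixture of Gaussians: a density with an a priori unbounded, data-driven number of modes.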
Finite mixture models A finite mixture model makes the assumption that each data point x_n was generated by a single mixture component with parameters θ_k:

x_n ~ Σ_k π_k F(x_n|θ_k).

This can be captured by introducing a latent indicator variable z_n. The model is formalised by the following priors and likelihood:

θ_k ~ H,
(π_1, ..., π_K) ~ D(α/K, ..., α/K),
z_n | π_1, ..., π_K ~ Discrete(π_1, ..., π_K),
x_n | θ_{z_n} ~ F(·|θ_{z_n}).

Inference is performed on:
The hyperparameters of the prior H.
The parameter α of the Dirichlet prior on the weights.
The number of components K.

(Graphical model: α → π → z_n → x_n ← θ_k ← H, with n = 1, ..., N and k = 1, ..., K.)
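The generative story above takes only a few lines to simulate. A minimal sketch with Gaussian components; K, N, α and the base measure N(0, 4²) are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
K, N, alpha = 5, 1000, 2.0

theta = 4.0 * rng.normal(size=K)            # theta_k ~ H (here H = N(0, 4^2))
pi = rng.dirichlet(np.full(K, alpha / K))   # pi ~ D(alpha/K, ..., alpha/K)
z = rng.choice(K, size=N, p=pi)             # z_n ~ Discrete(pi)
x = rng.normal(loc=theta[z], scale=1.0)     # x_n | z_n ~ F(.|theta_{z_n})

assert x.shape == (N,)
```

Note the symmetric prior D(α/K, ..., α/K): keeping the total concentration fixed at α as K grows is exactly what makes the K → ∞ limit on the next slide well behaved.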
DP mixture models If K is very large, there is no overfitting provided the parameters {θ_k} and mixing proportions {π_k} are integrated out: at most N (but usually far fewer) components will be active, i.e. associated with data. Recall the density estimation approach based on the DP:

F(·) = Σ_{k=1}^∞ π_k F(·|θ*_k),

G ~ DP(α, H),   θ_n | G ~ G = Σ_{k=1}^∞ π_k δ_{θ*_k},   x_n | θ_n ~ F(·|θ_n).

This is equivalent to a mixture model with a countably infinite number of components:

(π_1, ..., π_K) ~ D(α/K, ..., α/K),   z_n ~ Discrete(π_1, ..., π_K),   x_n | z_n ~ F(·|θ*_{z_n}),

where K → ∞.
DP mixture models (continued) DP mixture models are used to sidestep the model selection problem by replacing it with model averaging. They are used in applications in which the number of clusters is not known a priori or is believed to grow without bound as the amount of data grows. DPs can also be used in applications where the number of latent objects is unknown or unbounded:
Nonparametric probabilistic context-free grammars.
Visual scene analysis.
Infinite hidden Markov models/trees.
Haplotype inference.
...
In many such applications it is important to be able to model the same set of objects in different contexts. This corresponds to the problem of grouped clustering and can be tackled using hierarchical Dirichlet processes (Teh et al., 2006). Inference for DP mixtures depends on the representation:
MCMC based on the CRP, the stick-breaking construction, etc.
Variational inference based on the stick-breaking construction (Blei and Jordan, 2006).
References The slides are largely based on Y. W. Teh's tutorial and practical course on Dirichlet processes at the Machine Learning Summer School 2007, and the tutorial on Dirichlet processes at NIPS 2005 by M. I. Jordan.
Blackwell, D. and MacQueen, J. B. (1973), Ferguson distributions via Pólya urn schemes, Annals of Statistics, 1:353–355.
Blei, D. M. and Jordan, M. I. (2006), Variational inference for Dirichlet process mixtures, Bayesian Analysis, 1(1):121–144.
Ferguson, T. S. (1973), A Bayesian analysis of some nonparametric problems, Annals of Statistics, 1(2):209–230.
Rasmussen, C. E. (2000), The infinite Gaussian mixture model, Advances in Neural Information Processing Systems, volume 12.
Sethuraman, J. (1994), A constructive definition of Dirichlet priors, Statistica Sinica, 4:639–650.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006), Hierarchical Dirichlet processes, Journal of the American Statistical Association, 101(476):1566–1581.