A Brief Overview of Nonparametric Bayesian Models
Eurandom
Zoubin Ghahramani
Department of Engineering, University of Cambridge, UK
zoubin@eng.cam.ac.uk
http://learning.eng.cam.ac.uk/zoubin
Also at Machine Learning Department, CMU
Parametric vs Nonparametric Models

Parametric models assume some finite set of parameters θ. Given the parameters, future predictions, x, are independent of the observed data, D:

P(x | θ, D) = P(x | θ)

therefore θ captures everything there is to know about the data. So the complexity of the model is bounded even if the amount of data is unbounded. This makes parametric models not very flexible.

Nonparametric models assume that the data distribution cannot be defined in terms of such a finite set of parameters. But they can often be defined by assuming an infinite-dimensional θ; usually we think of θ as a function. The amount of information that θ can capture about the data D can grow as the amount of data grows. This makes nonparametric models more flexible.
Why?
- flexibility
- better predictive performance
- more realistic

[Figure: illustrative regression fit; axis data omitted.]
Outline

Bayesian nonparametrics has many uses. Some modelling goals and examples of associated nonparametric Bayesian models:

Modelling goal                  Example process
Distributions on functions      Gaussian process
Distributions on distributions  Dirichlet process, Polya tree
Clustering                      Chinese restaurant process, Pitman-Yor process
Hierarchical clustering         Dirichlet diffusion tree, Kingman's coalescent
Sparse latent feature models    Indian buffet process
Survival analysis               Beta process
Distributions on measures       Completely random measures
...
Gaussian Processes

A Gaussian process defines a distribution P(f) on functions, f, where f is a function mapping some input space X to the reals: f : X → R.

Let f = (f(x_1), f(x_2), ..., f(x_n)) be an n-dimensional vector of function values evaluated at n points x_i ∈ X. Note f is a random variable.

Definition: P(f) is a Gaussian process if for any finite subset {x_1, ..., x_n} ⊂ X, the marginal distribution P(f) on that finite subset is multivariate Gaussian, and all these marginals are coherent.

We can think of GPs as the extension of multivariate Gaussians to distributions on functions.
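As an illustration (not part of the original slides), the finite-marginal definition above can be exercised directly: pick n input points, build the n x n covariance matrix from some covariance function c, and draw from the resulting multivariate Gaussian. The squared-exponential covariance and its parameters below are illustrative assumptions; the slides leave c unspecified.

```python
import numpy as np

def se_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance c(x, x') = v * exp(-(x - x')^2 / (2 l^2))."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)          # a finite set of input points
K = se_kernel(x, x)                 # marginal covariance on that finite subset
# Any finite marginal of a GP is multivariate Gaussian, so sampling f at these
# points is just a draw from N(0, K); the jitter keeps K positive definite.
f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```

Each row of f is one sampled function evaluated on the grid; plotting the rows against x reproduces the kind of panels shown on the next slide.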
Samples from Gaussian processes with different covariance functions c(x, x')

[Figure: twelve panels, each plotting several draws of f(x) against x for a different covariance function; axis data omitted.]
Dirichlet Processes

Gaussian processes define a distribution on functions:

f ~ GP(µ, c)

where µ is the mean function and c is the covariance function. We can think of GPs as infinite-dimensional Gaussians.

Dirichlet processes define a distribution on distributions (a measure on measures):

G ~ DP(G_0, α)

where α > 0 is a scaling parameter and G_0 is the base measure. We can think of DPs as infinite-dimensional Dirichlet distributions.

Note that both f and G are infinite-dimensional objects.
Dirichlet Process

Let Θ be a measurable space, G_0 a probability measure on Θ, and α a positive real number. For all finite partitions (A_1, ..., A_K) of Θ, G ~ DP(G_0, α) means that

(G(A_1), ..., G(A_K)) ~ Dir(α G_0(A_1), ..., α G_0(A_K))

(Ferguson, 1973)
Dirichlet Distribution

The Dirichlet distribution is a distribution on the K-dimensional probability simplex. Let p be a K-dimensional vector such that p_j >= 0 for all j and sum_{j=1}^K p_j = 1. Then

P(p | α) = Dir(α_1, ..., α_K) := [Γ(Σ_j α_j) / Π_j Γ(α_j)] Π_{j=1}^K p_j^{α_j - 1}

where the first term is a normalization constant and E(p_j) = α_j / Σ_k α_k.

The Dirichlet is conjugate to the multinomial distribution. Let

c | p ~ Multinomial(p)

that is, P(c = j | p) = p_j. Then the posterior is also Dirichlet:

P(p | c = j, α) = P(c = j | p) P(p | α) / P(c = j | α) = Dir(α')

where α'_j = α_j + 1, and α'_l = α_l for l ≠ j.

Recall Γ(x) = (x - 1) Γ(x - 1) = ∫_0^∞ t^{x-1} e^{-t} dt; for integer n, Γ(n) = (n - 1)!
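A quick numerical sketch (illustrative, not from the original slides) of the conjugacy statement above: observing c = j simply adds one to α_j, so the posterior is again a Dirichlet with updated counts.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 1.0, 1.0])   # prior pseudo-counts (illustrative values)
p = rng.dirichlet(alpha)            # p ~ Dir(alpha), a point on the simplex
c = rng.choice(len(p), p=p)         # c | p ~ Multinomial(1, p)

# Conjugacy: the posterior Dir(alpha') just increments the observed count.
alpha_post = alpha.copy()
alpha_post[c] += 1.0

prior_mean = alpha / alpha.sum()          # E(p_j) = alpha_j / sum_k alpha_k
post_mean = alpha_post / alpha_post.sum()
```

The posterior mean shifts toward the observed category, as the update rule for α' predicts.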
Dirichlet Process

G ~ DP(G_0, α). OK, but what does it look like? Samples from a DP are discrete with probability one:

G(θ) = Σ_{k=1}^∞ π_k δ_{θ_k}(θ)

where δ_{θ_k}(·) is a Dirac delta at θ_k, and θ_k ~ G_0(·).

Note: E(G) = G_0. As α → ∞, G looks more like G_0.
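The discreteness of G can be made concrete with the stick-breaking construction (one of the derivations listed later in these slides). The sketch below is illustrative: it truncates the infinite sum at a finite number of atoms, and the standard-normal base measure G_0 is an arbitrary choice.

```python
import numpy as np

def dp_stick_breaking(alpha, n_atoms, rng):
    """Truncated stick-breaking draw G = sum_k pi_k * delta_{theta_k}.
    Break a unit-length stick: beta_k ~ Beta(1, alpha), and pi_k is beta_k
    times the stick remaining after the first k-1 breaks."""
    betas = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    pi = betas * remaining
    theta = rng.standard_normal(n_atoms)   # atom locations theta_k ~ G_0
    return pi, theta

rng = np.random.default_rng(2)
pi, theta = dp_stick_breaking(alpha=2.0, n_atoms=1000, rng=rng)
```

The weights pi sum to (nearly) one after a deep truncation, and a handful of atoms carry most of the mass, which is exactly the discrete picture of G above.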
Dirichlet Process: Conjugacy

If the prior on G is a DP,

G ~ DP(G_0, α),   i.e. P(G) = DP(G | G_0, α),

and you observe θ, with P(θ | G) = G(θ), then the posterior is also a DP:

P(G | θ) = P(θ | G) P(G) / P(θ) = DP( α/(α+1) G_0 + 1/(α+1) δ_θ, α + 1 )

Generalization for n observations:

P(G | θ_1, ..., θ_n) = DP( α/(α+n) G_0 + 1/(α+n) Σ_{i=1}^n δ_{θ_i}, α + n )

Analogous to the Dirichlet being conjugate to multinomial observations.
Dirichlet Process

Blackwell and MacQueen's (1973) urn representation. Let

G ~ DP(G_0, α) and θ | G ~ G(·).

Then, marginalizing out G,

θ_n | θ_1, ..., θ_{n-1} ~ α/(n-1+α) G_0(·) + 1/(n-1+α) Σ_{j=1}^{n-1} δ_{θ_j}(·)

where P(θ_1, ..., θ_n | G_0, α) = ∫ dG Π_{j=1}^n P(θ_j | G) P(G | G_0, α).

The model exhibits a clustering effect.
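The urn scheme above translates directly into a few lines of code. This sketch (with an arbitrary Gaussian G_0; not from the original slides) makes the clustering effect visible: the number of distinct values grows far more slowly than n.

```python
import numpy as np

def polya_urn(n, alpha, base_sampler, rng):
    """Draw theta_1..theta_n from the Blackwell-MacQueen urn:
    theta_i is a fresh draw from G_0 with prob alpha/(i-1+alpha), and
    otherwise a uniformly chosen previous value, so popular values
    are re-drawn more often (the clustering effect)."""
    thetas = []
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            thetas.append(base_sampler(rng))        # new value from G_0
        else:
            thetas.append(thetas[rng.integers(i)])  # copy an earlier value
    return thetas

rng = np.random.default_rng(3)
draws = polya_urn(200, alpha=1.0, base_sampler=lambda r: r.standard_normal(), rng=rng)
n_unique = len(set(draws))   # far fewer than 200 distinct values
```

With α = 1 the expected number of distinct values after n draws is roughly log n, so for n = 200 one sees only a handful of clusters.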
Relationship between DPs and CRPs

- The DP is a distribution on distributions.
- The DP results in discrete distributions, so if you draw n points you are likely to get repeated values.
- A DP induces a partitioning of the n points, e.g. (1 3 4)(2 5) corresponds to θ_1 = θ_3 = θ_4 and θ_2 = θ_5.
- The Chinese restaurant process (CRP) defines the corresponding distribution on partitions.
- Although the CRP is a sequential process, the distribution on θ_1, ..., θ_n is exchangeable (i.e. invariant to permuting the indices of the θs), e.g. P(θ_1, θ_2, θ_3, θ_4) = P(θ_2, θ_4, θ_3, θ_1).
Chinese Restaurant Process

The CRP generates samples from the distribution on partitions induced by a DPM.

[Figure: customers seated at tables T_1, T_2, T_3, T_4, ...]

Generating from a CRP:
  customer 1 enters the restaurant and sits at table 1; K = 1, n_1 = 1
  for n = 2, ...:
      customer n sits at table k with prob n_k / (n - 1 + α), for k = 1, ..., K
                 and at table K + 1 with prob α / (n - 1 + α)   (new table)
      if a new table was chosen then K <- K + 1

Rich-get-richer property. (Aldous 1985; Pitman 2002)
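The seating rule above can be implemented directly. This is an illustrative sketch (α and n are arbitrary choices), tracking only the partition, not the θ values.

```python
import numpy as np

def crp(n, alpha, rng):
    """Seat n customers by the CRP; return each customer's table and
    the occupancy counts per table."""
    counts = []   # counts[k] = number of customers at table k
    seats = []
    for i in range(n):
        # existing tables with prob n_k/(i+alpha), new table with prob alpha/(i+alpha)
        probs = np.array(counts + [alpha], dtype=float)
        probs /= i + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)        # open a new table
        else:
            counts[k] += 1          # rich get richer
        seats.append(k)
    return seats, counts

rng = np.random.default_rng(4)
seats, counts = crp(500, alpha=2.0, rng=rng)
```

The number of occupied tables grows roughly like α log n, so 500 customers typically occupy only a dozen or so tables.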
Dirichlet Processes: Big Picture

There are many ways to derive the Dirichlet process:
- Dirichlet distribution
- Urn model
- Chinese restaurant process
- Stick-breaking
- Gamma process

DP: a distribution on distributions.

Dirichlet process mixture (DPM): a mixture model with infinitely many components, where the parameters of each component are drawn from a DP. Useful for clustering; the assignment of points to clusters follows a CRP.
Samples from a Dirichlet Process Mixture of Gaussians

[Figure: four panels showing draws for increasing numbers of points N.]

Notice that more structure (clusters) appears as you draw more points. (Figure inspired by Neal.)
Approximate Inference in DPMs

- Gibbs sampling (e.g. Escobar and West, 1995; Neal, 2000; Rasmussen, 2000)
- Expectation propagation (Minka and Ghahramani, 2003)
- Variational approximation (Blei and Jordan, 2005)
- Hierarchical clustering (Heller and Ghahramani, 2005)
Hierarchical Clustering
Dirichlet Diffusion Trees (DFT) (Neal, 2003)

In a DPM, the parameters of one mixture component are independent of those of another component; this lack of structure is potentially undesirable. A DFT is a generalization of DPMs with hierarchical structure between components.

To generate from a DFT, we consider θ taking a random walk according to a Brownian motion Gaussian diffusion process.

- θ_1(t) is a Gaussian diffusion process starting at the origin (θ_1(0) = 0), run for unit time.
- θ_2(t) also starts at the origin and follows θ_1, but diverges at some time τ_d, at which point the path followed by θ_2 becomes independent of θ_1's path.
- a(t) is a divergence or hazard function, e.g. a(t) = 1/(1 - t). For small dt:

P(θ diverges in (t, t + dt)) = a(t) dt / m

where m is the number of previous points that have followed this path.
- If θ_i reaches a branch point between two paths, it picks a branch in proportion to the number of points that have followed that path.
Dirichlet Diffusion Trees (DFT)

Generating from a DFT. [Figure from Neal, 2003.]
Dirichlet Diffusion Trees (DFT)

Some samples from DFT priors. [Figure from Neal, 2003.]
Kingman's Coalescent (Kingman, 1982)

A model for genealogies. Working backwards in time from t = 0, generate a tree with n individuals at the leaves by merging lineages backwards in time.

- Every pair of lineages merges independently with rate 1.
- The time of the first merge for n individuals is t_1 ~ Exp(n(n - 1)/2), etc.
- This generates, with probability one, a binary tree in which all lineages have merged by t = -∞.

The coalescent results in the limit n → ∞, is infinitely exchangeable over individuals, and has uniform marginals over tree topologies. The coalescent has been used as the basis of a hierarchical clustering algorithm by Teh, Daumé and Roy (2007).

[Figure: (a) a coalescent tree over x_1, ..., x_4 with merge times t_1, t_2, t_3 and the induced partitions π(t) = {{1,2,3,4}}, {{1,2},{3,4}}, {{1},{2},{3,4}}, {{1},{2},{3},{4}}; (b), (c) further panels; axis data omitted.]
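A minimal simulation of the merge times (illustrative; the tree topology itself is not tracked here): while k lineages remain, the next merge going backwards in time happens after an Exp(k(k-1)/2) waiting time.

```python
import numpy as np

def coalescent_times(n, rng):
    """Simulate the merge times of Kingman's coalescent for n leaves.
    With k lineages remaining, each of the k(k-1)/2 pairs merges at rate 1,
    so the next merge occurs after an Exp(k(k-1)/2) waiting time; times run
    backwards from t = 0 toward -infinity."""
    t = 0.0
    times = []
    for k in range(n, 1, -1):
        t -= rng.exponential(scale=2.0 / (k * (k - 1)))  # mean 2/(k(k-1))
        times.append(t)
    return times

rng = np.random.default_rng(5)
times = coalescent_times(10, rng)   # 9 merges reduce 10 lineages to 1
```

Early merges (many lineages, high total rate) happen quickly; the final merge of the last two lineages takes an Exp(1) time, so it dominates the depth of the tree on average.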
Hierarchical Dirichlet Processes (HDP) (Teh et al., 2005)

Assume your data are divided into J groups. You assume there are clusters within each group, but you also believe these clusters are shared between groups (i.e. data points in different groups can belong to the same cluster).

In an HDP there is a common DP,

G_0 | H, γ ~ DP(H, γ),

which forms the base measure for a draw from a DP within each group:

G_j | G_0, α ~ DP(G_0, α)

[Graphical model: γ and H are parents of G_0; G_0 and α are parents of each G_j; G_j generates φ_ji, which generates x_ji; plates over the n_j observations in group j and over the J groups.]
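A truncated sketch of the sharing mechanism (illustrative; the truncation level, concentrations, and the Gaussian base measure H are assumptions, not from the slides): because G_0 is discrete, every group's DP draw places its mass on the same global atoms, only with group-specific weights.

```python
import numpy as np

def hdp_group_weights(gamma, alpha, n_atoms, n_groups, rng):
    """Truncated HDP: a global stick-breaking measure with weights beta over
    shared atoms theta, then group weights pi_j ~ Dir(alpha * beta) over the
    SAME atoms, so clusters are shared across groups."""
    nu = rng.beta(1.0, gamma, size=n_atoms)
    beta = nu * np.concatenate([[1.0], np.cumprod(1.0 - nu)[:-1]])
    beta /= beta.sum()                      # renormalise the truncation
    theta = rng.standard_normal(n_atoms)    # shared atom locations from H
    pi = np.vstack([rng.dirichlet(alpha * beta) for _ in range(n_groups)])
    return beta, pi, theta

rng = np.random.default_rng(8)
beta, pi, theta = hdp_group_weights(gamma=1.0, alpha=5.0,
                                    n_atoms=30, n_groups=3, rng=rng)
```

Each row of pi is one group's distribution over the shared atoms theta; a data point in any group picks an atom from its group's row, so the same cluster can be used by several groups.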
Nonparametric Time Series Models
Infinite Hidden Markov Models (iHMMs)

Hidden Markov models (HMMs) are widely used sequence models for speech recognition, bioinformatics, text modelling, video monitoring, etc.

[Graphical model: hidden state chain S_1 → S_2 → S_3 → ... → S_T, with emission Y_t from each S_t.]

HMMs can be thought of as time-dependent mixture models. In an HMM with K states, the transition matrix has K x K elements: given s_{t-1} = k, the next state s_t is drawn from π_k(·). We want to let K → ∞.

Infinite HMMs can be derived from the HDP:

β | γ ~ Stick(γ)                                 (base distribution over states)
π_k | α, β ~ DP(α, β)                            (transition parameters for state k = 1, 2, ...)
θ_k | H ~ H(·)                                   (emission parameters for state k = 1, 2, ...)
s_t | s_{t-1}, (π_k)_{k=1}^∞ ~ π_{s_{t-1}}(·)    (transition)
y_t | s_t, (θ_k)_{k=1}^∞ ~ p(· | θ_{s_t})        (emission)

(Beal, Ghahramani, and Rasmussen, 2002) (Teh et al., 2005)
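The construction above can be approximated by the common weak-limit truncation (an assumption, not from the slides): with K large, Dir(γ/K, ..., γ/K) stands in for Stick(γ), and each transition row is Dir(α·β) over the same K states. Gaussian emissions with illustrative noise level are also an assumption.

```python
import numpy as np

def truncated_ihmm_sample(T, K, alpha, gamma, rng):
    """Weak-limit approximation to the iHMM, truncated to K states.
    beta ~ Dir(gamma/K, ..., gamma/K) approximates Stick(gamma); each
    transition row pi_k ~ Dir(alpha * beta) is a DP draw with base beta."""
    beta = rng.dirichlet(np.full(K, gamma / K))
    pi = np.vstack([rng.dirichlet(alpha * beta) for _ in range(K)])
    theta = rng.standard_normal(K)       # Gaussian emission means, H = N(0, 1)
    s = np.zeros(T, dtype=int)
    y = np.zeros(T)
    s[0] = rng.choice(K, p=beta)         # initial state from the base weights
    for t in range(T):
        if t > 0:
            s[t] = rng.choice(K, p=pi[s[t - 1]])   # transition
        y[t] = theta[s[t]] + 0.1 * rng.standard_normal()  # emission
    return s, y

rng = np.random.default_rng(6)
s, y = truncated_ihmm_sample(T=100, K=20, alpha=4.0, gamma=3.0, rng=rng)
```

Because all rows of pi share the base weights beta, a few states dominate the whole transition matrix, so the sampled sequence visits only a small subset of the K available states.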
Infinite Hidden Markov Models

We let the number of states K → ∞.

[Graphical model: state chain S_1, ..., S_T with emissions Y_1, ..., Y_T. Experiment: changepoint detection.]

- Introduced in Beal, Ghahramani and Rasmussen (2002).
- Teh, Jordan, Beal and Blei (2005) showed that iHMMs can be derived from hierarchical Dirichlet processes, and provided a more efficient Gibbs sampler.
- We have recently derived a much more efficient sampler based on dynamic programming (Van Gael, Saatci, Teh, and Ghahramani, 2008).
Completely Random Measures (Kingman, 1967)

- Measurable space: Θ with σ-algebra Ω.
- Measure: a function µ : Ω → [0, ∞] assigning to each measurable set a non-negative real.
- Random measure: measures drawn from some distribution on measures, µ ~ P.
- Completely random measure (CRM): the values that µ takes on disjoint subsets are independent: µ(A) ⊥ µ(B) if A ∩ B = ∅.

CRMs can be represented as a sum of a nonrandom measure, an atomic measure with fixed atoms but random masses, and an atomic measure with random atoms and masses:

µ = µ_0 + Σ_{i=1}^{N or ∞} u_i δ_{φ_i} + Σ_{i=1}^{M or ∞} ũ_i δ_{φ̃_i}

We can write µ ~ CRM(µ_0, Λ, {φ_i, F_i}), where u_i ~ F_i and the pairs {ũ_i, φ̃_i} are drawn from a Poisson process on (0, ∞] x Θ with rate measure Λ, called the Lévy measure.

Examples: Gamma process, Beta process (Hjort, 1990), Stable-beta process (Teh and Görür, 2009).
Beta Processes (Hjort, 1990)

CRMs can be represented as a sum of a nonrandom measure, an atomic measure with fixed atoms but random masses, and an atomic measure with random atoms and masses:

µ = µ_0 + Σ_{i=1}^{N or ∞} u_i δ_{φ_i} + Σ_{i=1}^{M or ∞} ũ_i δ_{φ̃_i}

For a beta process we have µ ~ CRM(0, Λ, {}), where the pairs {ũ_i, φ̃_i} are drawn from a Poisson process on (0, 1] x Θ with Lévy rate measure

Λ(du dθ) = α c u^{-1} (1 - u)^{c-1} du H(dθ)

[Figure: atom masses u plotted against locations θ for a draw from a beta process; axis data omitted.]
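For the special case c = 1, the atom masses of a beta process draw can be generated by the stick-breaking construction of Teh, Görür and Ghahramani (2007), cited below for the Indian buffet process. The sketch is illustrative: the uniform base measure H and the truncation to a finite number of atoms are assumptions.

```python
import numpy as np

def beta_process_atoms(alpha, n_atoms, rng):
    """Stick-breaking draw of beta-process atom masses for c = 1
    (Teh, Görür & Ghahramani, 2007): u_k = prod_{i<=k} nu_i with
    nu_i ~ Beta(alpha, 1), giving strictly decreasing masses in (0, 1).
    Atom locations are i.i.d. from the base measure H (here Uniform[0, 1])."""
    nus = rng.beta(alpha, 1.0, size=n_atoms)
    u = np.cumprod(nus)                 # decreasing atom masses
    phi = rng.uniform(size=n_atoms)     # atom locations from H
    return u, phi

rng = np.random.default_rng(7)
u, phi = beta_process_atoms(alpha=5.0, n_atoms=50, rng=rng)
```

Plotting u against phi reproduces the kind of scatter shown in the figure: infinitely many atoms in the full process, but with masses decaying fast enough that only finitely many are non-negligible.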
Selected References

Gaussian Processes:
- O'Hagan, A. (1978) Curve Fitting and Optimal Design for Prediction (with discussion). Journal of the Royal Statistical Society B, 40(1):1–42.
- Rasmussen, C.E. and Williams, C.K.I. (2006) Gaussian Processes for Machine Learning. MIT Press.

Dirichlet Processes, Chinese Restaurant Processes, and related work:
- Ferguson, T. (1973) A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics, 1(2):209–230.
- Blackwell, D. and MacQueen, J. (1973) Ferguson Distributions via Polya Urn Schemes. Annals of Statistics, 1:353–355.
- Aldous, D. (1985) Exchangeability and Related Topics. In École d'Été de Probabilités de Saint-Flour XIII 1983, Springer, Berlin, pp. 1–198.

Hierarchical Dirichlet Processes and Infinite Hidden Markov Models:
- Beal, M.J., Ghahramani, Z., and Rasmussen, C.E. (2002) The Infinite Hidden Markov Model. In T.G. Dietterich, S. Becker, and Z. Ghahramani (eds.) Advances in Neural Information Processing Systems 14, MIT Press, pp. 577–584.
- Teh, Y.W., Jordan, M.I., Beal, M.J., and Blei, D.M. (2005) Hierarchical Dirichlet Processes. Journal of the American Statistical Association.
- van Gael, J., Saatci, Y., Teh, Y.W., and Ghahramani, Z. (2008) Beam Sampling for the Infinite Hidden Markov Model. In International Conference on Machine Learning (ICML 2008), pp. 1088–1095.

Dirichlet Process Mixtures:
- Antoniak, C.E. (1974) Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Annals of Statistics, 2:1152–1174.
- Escobar, M.D. and West, M. (1995) Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association, 90:577–588.
- Neal, R.M. (2000) Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9:249–265.
- Rasmussen, C.E. (2000) The Infinite Gaussian Mixture Model. In Advances in Neural Information Processing Systems 12. MIT Press.
- Blei, D.M. and Jordan, M.I. (2005) Variational Methods for Dirichlet Process Mixtures. Bayesian Analysis.
- Minka, T.P. and Ghahramani, Z. (2003) Expectation Propagation for Infinite Mixtures. NIPS 2003 Workshop on Nonparametric Bayesian Methods and Infinite Models.
- Heller, K.A. and Ghahramani, Z. (2005) Bayesian Hierarchical Clustering. In Twenty-Second International Conference on Machine Learning (ICML 2005).

Dirichlet Diffusion Trees and Kingman's Coalescent:
- Neal, R.M. (2003) Density Modeling and Clustering Using Dirichlet Diffusion Trees. In J.M. Bernardo et al. (eds.) Bayesian Statistics 7.
- Kingman, J.F.C. (1982) The Coalescent. Stochastic Processes and their Applications, 13(3):235–248.
- Teh, Y.W., Daumé III, H., and Roy, D. (2007) Bayesian Agglomerative Clustering with Coalescents. In Advances in Neural Information Processing Systems (NIPS), Vancouver, Canada.

Indian Buffet Processes:
- Griffiths, T.L. and Ghahramani, Z. (2006) Infinite Latent Feature Models and the Indian Buffet Process. In Advances in Neural Information Processing Systems 18:475–482. MIT Press.
- Teh, Y.W., Görür, D., and Ghahramani, Z. (2007) Stick-breaking Construction for the Indian Buffet Process. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico.
- Thibaux, R. and Jordan, M.I. (2007) Hierarchical Beta Processes and the Indian Buffet Process. In Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico.

Other:
- Hjort, N.L. (1990) Nonparametric Bayes Estimators Based on Beta Processes in Models for Life History Data. Annals of Statistics, 18:1259–1294.
- Kingman, J.F.C. (1967) Completely Random Measures. Pacific Journal of Mathematics, 21(1).
- Teh, Y.W. and Görür, D. (2009) Indian Buffet Processes with Power-law Behavior. In Advances in Neural Information Processing Systems.