Bayesian Nonparametrics: Dirichlet Process

Similar documents
Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Processes: Tutorial and Practical Course

Lecture 3a: Dirichlet processes

Bayesian Nonparametrics

Bayesian nonparametrics

A Brief Overview of Nonparametric Bayesian Models

Non-parametric Clustering with Dirichlet Processes

Bayesian Nonparametrics: Models Based on the Dirichlet Process

Non-Parametric Bayes

Infinite latent feature models and the Indian Buffet Process

The Indian Buffet Process: An Introduction and Review

Dirichlet Processes and other non-parametric Bayesian models

Nonparametric Mixed Membership Models

Bayesian Nonparametric Models

Hauptseminar: Machine Learning. Chinese Restaurant Process, Indian Buffet Process

Bayesian nonparametric latent feature models

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Non-parametric Bayesian Methods

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

Infinite Latent Feature Models and the Indian Buffet Process

Gentle Introduction to Infinite Gaussian Mixture Modeling

Bayesian Nonparametrics for Speech and Signal Processing

Stochastic Processes, Kernel Regression, Infinite Mixture Models

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

Collapsed Variational Dirichlet Process Mixture Models

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017

STAT Advanced Bayesian Inference

Hierarchical Models, Nested Models and Completely Random Measures

Bayesian Nonparametric Models on Decomposable Graphs

CSC 2541: Bayesian Methods for Machine Learning

A marginal sampler for σ-stable Poisson-Kingman mixture models

An Introduction to Bayesian Nonparametric Modelling

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Bayesian non parametric approaches: an introduction

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles

Clustering using Mixture Models

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

Tree-Based Inference for Dirichlet Process Mixtures

Nonparametric Bayesian Methods: Models, Algorithms, and Applications (Day 5)

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes

Hierarchical Bayesian Language Model Based on Pitman-Yor Processes. Yee Whye Teh

Lecturer: David Blei Lecture #3 Scribes: Jordan Boyd-Graber and Francisco Pereira October 1, 2007

Bayesian nonparametric latent feature models

Probabilistic modeling of NLP

Spatial Normalized Gamma Process

Modelling Genetic Variations with Fragmentation-Coagulation Processes

Probabilistic Graphical Models

Dirichlet Enhanced Latent Semantic Analysis

Distance dependent Chinese restaurant processes

Hybrid Dirichlet processes for functional data

Priors for Random Count Matrices with Random or Fixed Row Sums

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Dependent hierarchical processes for multi armed bandits

arxiv: v2 [stat.ml] 4 Aug 2011

Advanced Machine Learning

Bayesian Mixtures of Bernoulli Distributions

Construction of Dependent Dirichlet Processes based on Poisson Processes

Foundations of Nonparametric Bayesian Methods

Hierarchical Dirichlet Processes

arxiv: v1 [stat.me] 22 Feb 2015

Bayesian Nonparametric Modelling of Genetic Variations using Fragmentation-Coagulation Processes

Part 3: Applications of Dirichlet processes. 3.2 First examples: applying DPs to density estimation and cluster

CMPS 242: Project Report

Graphical Models for Query-driven Analysis of Multimodal Data

arxiv: v1 [stat.ml] 20 Nov 2012

Bayesian Methods for Machine Learning

Spatial Bayesian Nonparametrics for Natural Image Segmentation

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

arxiv: v1 [stat.ml] 8 Jan 2012

Bayesian nonparametric models for bipartite graphs

Bayesian Nonparametrics: some contributions to construction and properties of prior distributions

An Infinite Product of Sparse Chinese Restaurant Processes

Truncation error of a superposed gamma process in a decreasing order representation

A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process

A permutation-augmented sampler for DP mixture models

On the posterior structure of NRMI

Nonparametric Bayesian Methods - Lecture I

Applied Nonparametric Bayes

The ubiquitous Ewens sampling formula

Bayesian Nonparametric Learning of Complex Dynamical Phenomena

TWO MODELS INVOLVING BAYESIAN NONPARAMETRIC TECHNIQUES

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

Parallel Markov Chain Monte Carlo for Pitman-Yor Mixture Models

Simple approximate MAP Inference for Dirichlet processes

CPSC 540: Machine Learning

Distance-Based Probability Distribution for Set Partitions with Applications to Bayesian Nonparametrics

Construction of Dependent Dirichlet Processes based on Poisson Processes

An Overview of Nonparametric Bayesian Models and Applications to Natural Language Processing

13: Variational inference II

MAD-Bayes: MAP-based Asymptotic Derivations from Bayes

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign

Bayesian Nonparametric Mixture, Admixture, and Language Models

The two-parameter generalization of Ewens random partition structure

Présentation des travaux en statistiques bayésiennes

Bayesian non-parametric model to longitudinally predict churn

An Alternative Prior Process for Nonparametric Bayesian Clustering

Bayesian Nonparametric Models for Ranking Data

Transcription:

Bayesian Nonparametrics: Dirichlet Process Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes2012

Dirichlet Process. Cornerstone of modern Bayesian nonparametrics. Rediscovered many times as the infinite limit of finite mixture models. Formally defined by [Ferguson 1973] as a distribution over measures. Can be derived in different ways, and as a special case of different processes. Random partition view: Chinese restaurant process, Blackwell-MacQueen urn scheme. Random measure view: stick-breaking construction, Poisson-Dirichlet distribution, gamma process.

The Infinite Limit of Finite Mixture Models

Finite Mixture Models. Model for data from heterogeneous unknown sources. Each cluster (source) is modelled using a parametric model (e.g. Gaussian).

Data item i: $z_i \mid \pi \sim \mathrm{Discrete}(\pi)$, $\quad x_i \mid z_i, \theta^*_{z_i} \sim F(\theta^*_{z_i})$
Mixing proportions: $\pi = (\pi_1, \ldots, \pi_K) \mid \alpha \sim \mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K)$
Cluster k: $\theta^*_k \mid H \sim H$

[Graphical model: $\alpha \to \pi \to z_i \to x_i \leftarrow \theta^*_k \leftarrow H$, for $i = 1, \ldots, n$ and $k = 1, \ldots, K$.]

Finite Mixture Models. The Dirichlet distribution on the K-dimensional probability simplex $\{\pi : \sum_k \pi_k = 1\}$:

$P(\pi \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \prod_{k=1}^K \pi_k^{\alpha/K - 1}$, with $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\,dx$.

It is the standard distribution on probability vectors, due to its conjugacy with the multinomial.

Dirichlet Distribution. [Figure: Dirichlet densities on the 3-dimensional simplex for $\alpha$ = (1, 1, 1), (2, 2, 2), (5, 5, 5), (2, 5, 5), (2, 2, 5), (0.7, 0.7, 0.7).] General density:

$P(\pi \mid \alpha) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k^{\alpha_k - 1}$
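To get a feel for how the concentration parameters shape draws on the simplex, here is a minimal NumPy sketch; the $\alpha$ values mirror the panels above, while the sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from Dirichlet distributions on the 3-dimensional simplex for several
# settings of alpha. The mean of each coordinate is alpha_k / sum(alpha);
# larger alphas concentrate the draws more tightly around the mean.
for alpha in [(1, 1, 1), (2, 2, 2), (5, 5, 5), (2, 5, 5), (2, 2, 5), (0.7, 0.7, 0.7)]:
    samples = rng.dirichlet(alpha, size=5000)
    print(alpha, "mean:", samples.mean(axis=0).round(3),
          "std:", samples.std(axis=0).round(3))
```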

Dirichlet-Multinomial Conjugacy. Joint distribution over $z_1, \ldots, z_n$ and $\pi$:

$P(\pi \mid \alpha) \prod_{i=1}^n P(z_i \mid \pi) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^K} \prod_{k=1}^K \pi_k^{\alpha/K - 1} \prod_{k=1}^K \pi_k^{n_k}$

where $n_k = \#\{i : z_i = k\}$.

Posterior distribution: $P(\pi \mid z, \alpha) = \frac{\Gamma(n + \alpha)}{\prod_{k=1}^K \Gamma(n_k + \alpha/K)} \prod_{k=1}^K \pi_k^{n_k + \alpha/K - 1}$

Marginal distribution: $P(z \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{k=1}^K \frac{\Gamma(n_k + \alpha/K)}{\Gamma(\alpha/K)}$
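The marginal $P(z \mid \alpha)$ above is easy to evaluate on the log scale. The following sketch (hypothetical labels z and illustrative $\alpha$, K) checks the closed form against a Monte Carlo average of $\prod_k \pi_k^{n_k}$ over Dirichlet draws.

```python
import numpy as np
from scipy.special import gammaln

alpha, K = 2.0, 3
z = np.array([0, 0, 1, 2, 2, 2])           # hypothetical cluster labels
n = len(z)
n_k = np.bincount(z, minlength=K)           # cluster counts n_k

# Analytic log marginal: log Gamma(alpha) - log Gamma(n + alpha)
#                        + sum_k [log Gamma(n_k + alpha/K) - log Gamma(alpha/K)]
log_pz = (gammaln(alpha) - gammaln(n + alpha)
          + np.sum(gammaln(n_k + alpha / K) - gammaln(alpha / K)))

# Monte Carlo check: E_pi[ prod_k pi_k^{n_k} ] with pi ~ Dirichlet(alpha/K, ..., alpha/K).
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.full(K, alpha / K), size=200_000)
log_pz_mc = np.log(np.mean(np.prod(pi ** n_k, axis=1)))

print(log_pz, log_pz_mc)                    # the two values should be close
```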

Gibbs Sampling. All conditional distributions are simple to compute:

$p(z_i = k \mid \text{others}) \propto \pi_k\, f(x_i \mid \theta^*_k)$
$\pi \mid \text{others} \sim \mathrm{Dirichlet}\!\left(\tfrac{\alpha}{K} + n_1, \ldots, \tfrac{\alpha}{K} + n_K\right)$
$p(\theta^*_k \mid \text{others}) \propto h(\theta^*_k) \prod_{j : z_j = k} f(x_j \mid \theta^*_k)$

This is not as efficient as collapsed Gibbs sampling, which integrates out $\pi$ and the $\theta^*_k$'s:

$p(z_i = k \mid \text{others}) \propto \frac{n_k^{-i} + \frac{\alpha}{K}}{n - 1 + \alpha}\, f(x_i \mid \{x_j : j \neq i,\ z_j = k\})$

where the predictive density is

$f(x_i \mid \{x_j : j \neq i,\ z_j = k\}) = \frac{\int h(\theta)\, f(x_i \mid \theta) \prod_{j \neq i : z_j = k} f(x_j \mid \theta)\, d\theta}{\int h(\theta) \prod_{j \neq i : z_j = k} f(x_j \mid \theta)\, d\theta}$

These conditional distributions can be computed efficiently if F is conjugate to H.
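As a concrete (uncollapsed) instance, here is one Gibbs sweep for a finite mixture of unit-variance Gaussians with a $N(0, \tau^2)$ prior on the cluster means; the data, K, $\alpha$ and $\tau^2$ are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 1-D data from two well-separated groups (illustrative only).
x = np.concatenate([rng.normal(0.0, 1.0, 25), rng.normal(4.0, 1.0, 25)])
K, alpha, tau2 = 5, 1.0, 10.0

# Initial state.
z = rng.integers(K, size=len(x))                    # cluster assignments
theta = rng.normal(scale=np.sqrt(tau2), size=K)     # cluster means ~ N(0, tau2)

# --- one Gibbs sweep ---

# pi | z ~ Dirichlet(alpha/K + n_1, ..., alpha/K + n_K)
pi = rng.dirichlet(alpha / K + np.bincount(z, minlength=K))

# z_i | pi, theta:  p(z_i = k | others) is proportional to pi_k * N(x_i; theta_k, 1)
for i in range(len(x)):
    logp = np.log(pi) - 0.5 * (x[i] - theta) ** 2
    p = np.exp(logp - logp.max())
    z[i] = rng.choice(K, p=p / p.sum())

# theta_k | z, x:  conjugate normal posterior given the points currently in cluster k
for k in range(K):
    xk = x[z == k]
    prec = 1.0 / tau2 + len(xk)                     # posterior precision
    theta[k] = rng.normal(xk.sum() / prec, np.sqrt(1.0 / prec))
```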

Infinite Limit of Collapsed Gibbs Sampler. We will take $K \to \infty$. Imagine a very large value of K. There are at most $n < K$ occupied clusters, so most components are empty; we can lump the empty components together:

$p(z_i = k \mid \text{others}) = \frac{n_k^{-i} + \frac{\alpha}{K}}{n - 1 + \alpha}\, f(x_i \mid \{x_j : j \neq i,\ z_j = k\})$ for occupied clusters k,

$p(z_i = \text{any empty cluster} \mid \text{others}) = \frac{(K - K_{\mathrm{occ}})\,\frac{\alpha}{K}}{n - 1 + \alpha}\, f(x_i \mid \{\})$

where $K_{\mathrm{occ}}$ is the number of occupied clusters; as $K \to \infty$ the empty-cluster probability tends to $\frac{\alpha}{n - 1 + \alpha}\, f(x_i \mid \{\})$. [Neal 2000, Rasmussen 2000, Ishwaran & Zarepour 2002]

Infinite Limit. The actual infinite limit of the finite mixture model does not make sense: any particular cluster would get a mixing proportion of 0. Better ways of making this infinite limit precise: the Chinese restaurant process, and the stick-breaking construction. Both are different views of the Dirichlet process (DP). DPs can be thought of as infinite-dimensional Dirichlet distributions. The $K \to \infty$ Gibbs sampler is for DP mixture models.

Ferguson's Definition of the Dirichlet Process

Ferguson's Definition of Dirichlet Processes. A Dirichlet process (DP) is a random probability measure G over $(\Theta, \Sigma)$ such that for any finite collection of measurable sets $A_1, \ldots, A_K \in \Sigma$ partitioning $\Theta$, i.e. $A_1 \cup \cdots \cup A_K = \Theta$ with the $A_k$ disjoint, we have

$(G(A_1), \ldots, G(A_K)) \sim \mathrm{Dirichlet}(\alpha H(A_1), \ldots, \alpha H(A_K))$

where $\alpha$ and H are parameters of the DP. [Figure: a partition of $\Theta$ into cells $A_1, \ldots, A_6$.] [Ferguson 1973]

Parameters of the Dirichlet Process. $\alpha$ is called the strength, mass or concentration parameter. H is called the base distribution. Mean and variance:

$E[G(A)] = H(A)$, $\qquad V[G(A)] = \frac{H(A)(1 - H(A))}{\alpha + 1}$

where A is a measurable subset of $\Theta$. H is the mean of G, and $\alpha$ is an inverse variance.
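These moments can be checked by simulation. The sketch below approximates draws of G using the truncated stick-breaking construction introduced later in the lecture, with an assumed standard normal base distribution H and the set $A = (-\infty, 0.5]$; all numbers are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

def dp_mass(alpha, a_upper, T=2000):
    """Approximate G(A) for A = (-inf, a_upper] under G ~ DP(alpha, N(0, 1)),
    using a truncated stick-breaking draw with T atoms."""
    v = rng.beta(1.0, alpha, size=T)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    atoms = rng.normal(size=T)
    return pi[atoms <= a_upper].sum()

alpha, a_upper = 2.0, 0.5
draws = np.array([dp_mass(alpha, a_upper) for _ in range(2000)])
HA = norm.cdf(a_upper)                                    # H(A) for H = N(0, 1)
print("simulated mean, var:  ", draws.mean().round(3), draws.var().round(4))
print("theoretical mean, var:", HA.round(3), (HA * (1 - HA) / (alpha + 1)).round(4))
```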

Posterior Dirichlet Process. Suppose $G \sim \mathrm{DP}(\alpha, H)$. We can define random variables that are G-distributed: $\theta_i \mid G \sim G$ for $i = 1, \ldots, n$. The usual Dirichlet-multinomial conjugacy carries over to the DP as well:

$G \mid \theta_1, \ldots, \theta_n \sim \mathrm{DP}\!\left(\alpha + n,\; \frac{\alpha H + \sum_{i=1}^n \delta_{\theta_i}}{\alpha + n}\right)$

Pólya Urn Scheme. Marginalizing out G from $G \sim \mathrm{DP}(\alpha, H)$, $\theta_i \mid G \sim G$ for $i = 1, 2, \ldots$, we get:

$\theta_{n+1} \mid \theta_1, \ldots, \theta_n \sim \frac{\alpha H + \sum_{i=1}^n \delta_{\theta_i}}{\alpha + n}$

This is called the Pólya, Hoppe or Blackwell-MacQueen urn scheme. Start with an urn containing $\alpha$ balls of a special colour. Pick a ball randomly from the urn: if it is of the special colour, make a new ball with a colour sampled from H, note the colour, and return both balls to the urn; if not, note its colour and return two balls of that colour to the urn. [Blackwell & MacQueen 1973, Hoppe 1984]
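A direct implementation of the urn takes only a few lines. The sketch below assumes an arbitrary base sampler for H (standard normal here, an illustrative choice) and simply records how many distinct values appear among n draws.

```python
import numpy as np

rng = np.random.default_rng(4)

def polya_urn(n, alpha, base_sampler):
    """Draw theta_1..theta_n from the Blackwell-MacQueen urn with strength alpha
    and base distribution H given by `base_sampler` (a function returning one draw)."""
    thetas = []
    for i in range(n):
        # With probability alpha/(alpha + i) draw a fresh value from H,
        # otherwise copy one of the i previous draws uniformly at random.
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_sampler())
        else:
            thetas.append(thetas[rng.integers(i)])
    return np.array(thetas)

draws = polya_urn(1000, alpha=3.0, base_sampler=lambda: rng.normal())
print("number of distinct values among 1000 draws:", len(np.unique(draws)))
```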

Clustering Property. With $G \sim \mathrm{DP}(\alpha, H)$ and $\theta_i \mid G \sim G$ for $i = 1, 2, \ldots$, the n variables $\theta_1, \theta_2, \ldots, \theta_n$ can take on $K \le n$ distinct values. Let the distinct values be $\theta^*_1, \ldots, \theta^*_K$. This defines a partition of $\{1, \ldots, n\}$ such that i is in cluster k if and only if $\theta_i = \theta^*_k$. The induced distribution over partitions is the Chinese restaurant process.

Discreteness of the Dirichlet Process. Suppose $G \sim \mathrm{DP}(\alpha, H)$ and $\theta \mid G \sim G$. Then G is discrete in the sense that $P(G(\{\theta\}) > 0) = 1$. This holds because the joint distribution of $(\theta, G)$ is equivalently given by:

$\theta \sim H, \qquad G \mid \theta \sim \mathrm{DP}\!\left(\alpha + 1,\; \frac{\alpha H + \delta_\theta}{\alpha + 1}\right)$

and under this posterior DP the mass $G(\{\theta\})$ is Beta-distributed with strictly positive first parameter, hence positive almost surely.

A draw from a Dirichlet Process. [Figure: a draw G from a Dirichlet process on [0, 1], shown as a set of weighted atoms.]

Atomic Distributions. Draws from Dirichlet processes are always atomic:

$G = \sum_{k=1}^\infty \pi_k \delta_{\theta^*_k}$ where $\sum_k \pi_k = 1$ and $\theta^*_k \in \Theta$.

There are a number of ways to specify the joint distribution of $\{\pi_k, \theta^*_k\}$: the stick-breaking construction and the Poisson-Dirichlet distribution.

Random Partitions [Aldous 1985, Pitman 2006]

Partitions. A partition $\rho$ of a set S is a disjoint family of non-empty subsets of S whose union is S. For example, with S = {Alice, Bob, Charles, David, Emma, Florence}, one partition is $\rho$ = { {Alice, David}, {Bob, Charles, Emma}, {Florence} }. Random partitions are random variables taking values in the set of all partitions of S. We will work with partitions of $S = [n] = \{1, 2, \ldots, n\}$.

Chinese Restaurant Process. [Figure: customers 1-9 seated at tables.] Each customer comes into the restaurant and sits at a table:

$p(\text{sit at table } c) = \frac{n_c}{\alpha + \sum_c n_c}, \qquad p(\text{sit at new table}) = \frac{\alpha}{\alpha + \sum_c n_c}$

where $n_c$ is the number of customers already at table c. Customers correspond to elements of S, and tables to clusters in $\rho$. Rich-gets-richer: large clusters are more likely to attract more customers. Multiplying the conditional probabilities together, the overall probability of $\rho$, called the exchangeable partition probability function (EPPF), is:

$P(\rho \mid \alpha) = \alpha^{|\rho|} \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{c \in \rho} \Gamma(|c|)$

[Aldous 1985, Pitman 2006]
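The seating process and the EPPF translate directly into code. This sketch samples a partition and evaluates its log probability; the restaurant size and $\alpha$ are illustrative.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(5)

def crp(n, alpha):
    """Sample a partition of [n] from the Chinese restaurant process."""
    tables = []                           # tables[c] = list of customers at table c
    for i in range(n):
        counts = np.array([len(t) for t in tables] + [alpha])
        c = rng.choice(len(counts), p=counts / counts.sum())
        if c == len(tables):
            tables.append([i])            # open a new table
        else:
            tables[c].append(i)
    return tables

def log_eppf(tables, n, alpha):
    """log P(partition | alpha) = |rho| log alpha + log Gamma(alpha)
       - log Gamma(n + alpha) + sum_c log Gamma(|c|)."""
    return (len(tables) * np.log(alpha) + gammaln(alpha) - gammaln(n + alpha)
            + sum(gammaln(len(t)) for t in tables))

part = crp(20, alpha=2.0)
print("table sizes:", [len(t) for t in part], " log EPPF:", log_eppf(part, 20, 2.0))
```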

Number of Clusters. The prior mean and variance of the number of clusters $|\rho|$ are:

$E[|\rho| \mid \alpha, n] = \alpha(\psi(\alpha + n) - \psi(\alpha)) \simeq \alpha \log\!\left(1 + \tfrac{n}{\alpha}\right)$
$V[|\rho| \mid \alpha, n] = \alpha(\psi(\alpha + n) - \psi(\alpha)) + \alpha^2(\psi'(\alpha + n) - \psi'(\alpha)) \simeq \alpha \log\!\left(1 + \tfrac{n}{\alpha}\right)$

where $\psi(\alpha) = \frac{\partial}{\partial \alpha} \log \Gamma(\alpha)$ is the digamma function. [Figures: number of tables versus number of customers ($\alpha = 30$, $d = 0$), and the prior distribution of the number of clusters for $\alpha = 10$.]
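The exact expressions and the logarithmic approximation are easy to compare numerically, as in this sketch ($\alpha$ and n are arbitrary choices).

```python
import numpy as np
from scipy.special import digamma, polygamma

alpha, n = 10.0, 100

# Exact prior mean and variance of the number of clusters under the CRP,
# plus the alpha * log(1 + n/alpha) approximation.
mean = alpha * (digamma(alpha + n) - digamma(alpha))
var = mean + alpha**2 * (polygamma(1, alpha + n) - polygamma(1, alpha))
approx = alpha * np.log1p(n / alpha)

print(f"E[#clusters] = {mean:.2f}  (approx {approx:.2f}),  Var = {var:.2f}")
```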

Model-based Clustering with Chinese Restaurant Process

Partitions in Model-based Clustering. Partitions are the natural latent objects of inference in clustering. Given a dataset S, partition it into clusters of similar items. Each cluster $c \in \rho$ is described by a parametric model $F(\theta^*_c)$ with parameter $\theta^*_c$. Bayesian approach: introduce a prior over $\rho$ and the $\theta^*_c$, and compute the posterior over both.

Finite Mixture Model. Explicitly allow only K clusters in the partition: each cluster k has parameter $\theta^*_k$, and each data item i is assigned to cluster k with mixing probability $\pi_k$. This gives a random partition with at most K clusters. Priors on the other parameters:

$\pi \mid \alpha \sim \mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K), \qquad \theta^*_k \mid H \sim H$

Induced Distribution over Partitions.

$P(z \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{k=1}^K \frac{\Gamma(n_k + \alpha/K)}{\Gamma(\alpha/K)}$

P(z | α) describes a partition of the data set into clusters together with a labelling of each cluster by a mixture component index. It induces a distribution over partitions $\rho$ (without labelling) of the data set:

$P(\rho \mid \alpha) = [K]^{|\rho|}_{-1} \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{c \in \rho} \frac{\Gamma(|c| + \alpha/K)}{\Gamma(\alpha/K)}$

where $[x]^a_b = x(x + b) \cdots (x + (a-1)b)$, so $[K]^{|\rho|}_{-1} = K(K-1)\cdots(K - |\rho| + 1)$ counts the labellings consistent with $\rho$. Taking $K \to \infty$, the factor $[K]^{|\rho|}_{-1} \approx K^{|\rho|}$ cancels the $(\alpha/K)^{|\rho|}$ arising from the ratios of Gamma functions, and we get a proper distribution over partitions without a limit on the number of clusters:

$P(\rho \mid \alpha) = \alpha^{|\rho|} \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{c \in \rho} \Gamma(|c|)$

Chinese Restaurant Process. An important representation of the Dirichlet process, and an important object of study in its own right. It predates the Dirichlet process and originated in genetics (where it is related to the Ewens sampling formula). A large number of MCMC samplers use the CRP representation. Random partitions are useful concepts for clustering problems in machine learning: CRP mixture models for nonparametric model-based clustering; hierarchical clustering using concepts of fragmentations and coagulations; clustering nodes in graphs, e.g. for community discovery in social networks. Other combinatorial structures can be built from partitions.

Random Probability Measures

A draw from a Dirichlet Process. [Figure: a draw G from a Dirichlet process on [0, 1], shown as a set of weighted atoms.]

Atomic Distributions. Draws from Dirichlet processes are always atomic:

$G = \sum_{k=1}^\infty \pi_k \delta_{\theta^*_k}$ where $\sum_k \pi_k = 1$ and $\theta^*_k \in \Theta$.

There are a number of ways to specify the joint distribution of $\{\pi_k, \theta^*_k\}$: the stick-breaking construction and the Poisson-Dirichlet distribution.

Stick-breaking Construction. Stick-breaking construction for the joint distribution:

$\theta^*_k \sim H, \quad v_k \sim \mathrm{Beta}(1, \alpha) \quad \text{for } k = 1, 2, \ldots$
$\pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad G = \sum_{k=1}^\infty \pi_k \delta_{\theta^*_k}$

The $\pi_k$'s are decreasing on average, but not strictly. The distribution of $\{\pi_k\}$ is the Griffiths-Engen-McCloskey (GEM) distribution. The Poisson-Dirichlet distribution [Kingman 1975] gives a strictly decreasing ordering (but is not computationally tractable).
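A truncated version of the construction is straightforward to simulate. The sketch below checks that the weights decrease on average, $E[\pi_k] = \frac{1}{1+\alpha}\left(\frac{\alpha}{1+\alpha}\right)^{k-1}$ (a standard consequence of $v_k \sim \mathrm{Beta}(1, \alpha)$), while individual draws need not be sorted; the truncation level and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, T, S = 2.0, 25, 50_000        # concentration, truncation level, number of draws

# S truncated stick-breaking draws of the weight sequence (pi_1, ..., pi_T).
v = rng.beta(1.0, alpha, size=(S, T))
stick_left = np.concatenate([np.ones((S, 1)), np.cumprod(1.0 - v, axis=1)[:, :-1]], axis=1)
pi = v * stick_left

# Weights decrease on average, but any individual draw need not be sorted.
k = np.arange(1, T + 1)
print("empirical   E[pi_k], k=1..5:", pi.mean(axis=0)[:5].round(3))
print("theoretical E[pi_k], k=1..5:",
      ((1 / (1 + alpha)) * (alpha / (1 + alpha)) ** (k - 1))[:5].round(3))
print("fraction of draws with strictly decreasing weights:",
      np.mean(np.all(np.diff(pi, axis=1) < 0, axis=1)).round(3))
```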

Finite Mixture Model. Explicitly allow only K clusters in the partition: each cluster k has parameter $\theta^*_k$, and each data item i is assigned to cluster k with mixing probability $\pi_k$. This gives a random partition with at most K clusters. Priors on the other parameters:

$\pi \mid \alpha \sim \mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K), \qquad \theta^*_k \mid H \sim H$

Size-biased Permutation. Reordering the clusters does not change the marginal distribution over partitions or data items. Ordering by strictly decreasing $\pi_k$ gives the Poisson-Dirichlet distribution. Reordering stochastically as follows gives the stick-breaking construction: pick cluster k to be the first cluster with probability $\pi_k$; remove cluster k, renormalize the rest of $\{\pi_j : j \neq k\}$, and repeat. This stochastic reordering is called a size-biased permutation (see the sketch below). After reordering, taking $K \to \infty$ gives the corresponding DP representations.
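A size-biased permutation of a finite weight vector can be sketched as follows; the example weights are arbitrary and assumed strictly positive.

```python
import numpy as np

rng = np.random.default_rng(7)

def size_biased_permutation(weights):
    """Return a size-biased reordering of `weights`: at each step pick an index
    with probability proportional to its weight, remove it, and renormalise.
    Assumes strictly positive weights."""
    w = np.asarray(weights, dtype=float).copy()
    order = []
    while len(order) < len(w):
        j = rng.choice(len(w), p=w / w.sum())
        order.append(j)
        w[j] = 0.0                      # removed weights can no longer be picked
    return order

print(size_biased_permutation([0.5, 0.3, 0.15, 0.05]))
```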

Stick-breaking Construction Easy to generalize stick-breaking construction: to other random measures; to random measures that depend on covariates or vary spatially. Easy to work with different algorithms: MCMC samplers; variational inference; parallelized algorithms. [Ishwaran & James 2001, Dunson 2010 and many others]

DP Mixture Model: Representations and Inference

DP Mixture Model. A DP mixture model:

$G \mid \alpha, H \sim \mathrm{DP}(\alpha, H), \qquad \theta_i \mid G \sim G, \qquad x_i \mid \theta_i \sim F(\theta_i)$

Different representations: $\theta_1, \theta_2, \ldots, \theta_n$ are clustered according to the Pólya urn scheme, with the induced partition given by a CRP; G is atomic, with weights and atoms described by the stick-breaking construction. [Neal 2000, Rasmussen 2000, Ishwaran & Zarepour 2002]

CRP Representation. Representing the partition structure explicitly with a CRP:

$\rho \mid \alpha \sim \mathrm{CRP}([n], \alpha)$
$\theta^*_c \mid H \sim H \quad \text{for } c \in \rho$
$x_i \mid \theta^*_c \sim F(\theta^*_c) \quad \text{for } c \in \rho,\ i \in c$

This makes explicit that it is a clustering model. Using a CRP prior for $\rho$ obviates the need to limit the number of clusters as in finite mixture models. [Neal 2000, Rasmussen 2000, Ishwaran & Zarepour 2002]

Marginal Sampler. Marginal MCMC sampler: marginalize out G and Gibbs sample the partition under the model $\rho \mid \alpha \sim \mathrm{CRP}([n], \alpha)$, $\theta^*_c \mid H \sim H$, $x_i \mid \theta^*_c \sim F(\theta^*_c)$. The conditional probability of the cluster of data item i is:

$P(\rho_i \mid \rho_{\setminus i}, x, \theta^*) \propto P(\rho_i \mid \rho_{\setminus i})\, P(x_i \mid \rho_i, x_{\setminus i}, \theta^*)$

$P(\rho_i = c \mid \rho_{\setminus i}) = \frac{|c|}{n - 1 + \alpha}$ if $c \in \rho_{\setminus i}$, and $\frac{\alpha}{n - 1 + \alpha}$ if c is a new cluster.

$P(x_i \mid \rho_i = c, x_{\setminus i}, \theta^*) = f(x_i \mid \theta^*_c)$ if $c \in \rho_{\setminus i}$, and $\int f(x_i \mid \theta)\, h(\theta)\, d\theta$ if c is a new cluster.

A variety of methods exist to deal with new clusters; the difficulty lies in handling them, especially when the prior h is not conjugate to f. [Neal 2000]

Induced Prior on the Number of Clusters. The prior expectation and variance of $|\rho|$ are:

$E[|\rho| \mid \alpha, n] = \alpha(\psi(\alpha + n) - \psi(\alpha)) \simeq \alpha \log\!\left(1 + \tfrac{n}{\alpha}\right)$
$V[|\rho| \mid \alpha, n] = \alpha(\psi(\alpha + n) - \psi(\alpha)) + \alpha^2(\psi'(\alpha + n) - \psi'(\alpha)) \simeq \alpha \log\!\left(1 + \tfrac{n}{\alpha}\right)$

[Figures: prior distributions of the number of clusters for $\alpha$ = 1, 10 and 100.]

Marginal Gibbs Sampler Pseudocode.
Initialize: randomly assign each data item to some cluster; let K be the number of clusters used.
For each cluster k = 1...K:
  Compute sufficient statistics s_k := Σ { s(x_i) : z_i = k }.
  Compute cluster sizes n_k := #{ i : z_i = k }.
Iterate until convergence:
  For each data item i = 1...n:
    Let k := z_i be the cluster the data item is currently assigned to.
    Remove the data item: s_k -= s(x_i), n_k -= 1.
    If n_k = 0, remove cluster k (K -= 1 and relabel the remaining clusters).
    Compute the conditional probabilities p(z_i = c | others) for c = 1...K and c = k_empty := K + 1.
    Sample a new cluster c for the data item from these conditional probabilities.
    If c = k_empty, create a new cluster: K += 1, s_c := 0, n_c := 0.
    Add the data item: z_i := c, s_c += s(x_i), n_c += 1.
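For concreteness, here is a minimal runnable sketch of this collapsed sampler for a DP mixture of unit-variance Gaussians with a $N(0, \tau^2)$ base measure, a conjugate pair so the cluster means can be integrated out analytically; the data, hyperparameters and bookkeeping choices are illustrative rather than part of the lecture.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)

# Synthetic 1-D data from two groups (illustrative only).
x = np.concatenate([rng.normal(-3.0, 1.0, 40), rng.normal(3.0, 1.0, 40)])
n, alpha, tau2 = len(x), 1.0, 10.0

z = np.zeros(n, dtype=int)            # start with all items in one cluster
sums = {0: x.sum()}                   # per-cluster sufficient statistics
counts = {0: n}                       # per-cluster sizes

def predictive(xi, s, m):
    """Predictive density of x_i given a cluster with m points summing to s
    (cluster mean integrated out under the N(0, tau2) prior)."""
    prec = 1.0 / tau2 + m
    return norm.pdf(xi, loc=s / prec, scale=np.sqrt(1.0 + 1.0 / prec))

for sweep in range(50):
    for i in range(n):
        k = z[i]                       # remove item i from its current cluster
        sums[k] -= x[i]
        counts[k] -= 1
        if counts[k] == 0:
            del sums[k], counts[k]

        labels = list(counts)          # existing clusters, plus one new cluster
        probs = [counts[c] * predictive(x[i], sums[c], counts[c]) for c in labels]
        probs.append(alpha * predictive(x[i], 0.0, 0))
        probs = np.array(probs)
        choice = rng.choice(len(probs), p=probs / probs.sum())

        if choice == len(labels):      # open a new cluster
            k = max(counts, default=-1) + 1
            sums[k], counts[k] = 0.0, 0
        else:
            k = labels[choice]
        z[i] = k
        sums[k] += x[i]
        counts[k] += 1
    if sweep % 10 == 0:
        print("sweep", sweep, "cluster sizes:", sorted(counts.values(), reverse=True))
```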

Stick-breaking Representation. Dissecting the stick-breaking representation for G:

$\pi \mid \alpha \sim \mathrm{GEM}(\alpha), \qquad \theta^*_k \mid H \sim H$
$z_i \mid \pi \sim \mathrm{Discrete}(\pi), \qquad x_i \mid z_i, \theta^*_{z_i} \sim F(\theta^*_{z_i})$

This makes explicit that it is a mixture model with an infinite number of components. Conditional sampler: a standard Gibbs sampler, except that the number of clusters needs to be truncated. Easy to work with non-conjugate priors. For the sampler to mix well, moves that permute the order of the clusters need to be introduced. [Ishwaran & James 2001, Walker 2007, Papaspiliopoulos & Roberts 2008]

Explicit G Sampler. Under the model $G \mid \alpha, H \sim \mathrm{DP}(\alpha, H)$, $\theta_i \mid G \sim G$, $x_i \mid \theta_i \sim F(\theta_i)$, represent G explicitly, alternately sampling $\{\theta_i\} \mid G$ (simple) and $G \mid \{\theta_i\}$:

$G \mid \theta_1, \ldots, \theta_n \sim \mathrm{DP}\!\left(\alpha + n,\; \frac{\alpha H + \sum_{i=1}^n \delta_{\theta_i}}{\alpha + n}\right)$

which can be written as

$G = \pi_0 G' + \sum_{k=1}^K \pi^*_k \delta_{\theta^*_k}, \qquad (\pi_0, \pi^*_1, \ldots, \pi^*_K) \sim \mathrm{Dirichlet}(\alpha, n_1, \ldots, n_K), \qquad G' \sim \mathrm{DP}(\alpha, H)$

Use a stick-breaking representation for $G'$ and truncate as before. Having no explicit ordering of the non-empty clusters makes for better mixing. The explicit representation of G allows for posterior estimates of functionals of G.

Other Inference Algorithms. Split-merge algorithms [Jain & Neal 2004], close in spirit to reversible-jump MCMC methods [Green & Richardson 2001]. Sequential Monte Carlo methods [Liu 1996, Ishwaran & James 2003, Fearnhead 2004, Mansinghka et al 2007]. Variational algorithms [Blei & Jordan 2006, Kurihara et al 2007, Teh et al 2008]. Expectation propagation [Minka & Ghahramani 2003, Tarlow et al 2008].