Non-parametric Clustering with Dirichlet Processes

Similar documents
Bayesian Nonparametrics: Models Based on the Dirichlet Process

Non-Parametric Bayes

Bayesian Nonparametrics for Speech and Signal Processing

Lecture 3a: Dirichlet processes

Dirichlet Processes and other non-parametric Bayesian models

A Brief Overview of Nonparametric Bayesian Models

Bayesian Nonparametrics: Dirichlet Process

Dirichlet Processes: Tutorial and Practical Course

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

Non-parametric Bayesian Methods

Bayesian Nonparametric Models

Bayesian Nonparametrics

Spatial Normalized Gamma Process

Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado

Hierarchical Bayesian Languge Model Based on Pitman-Yor Processes. Yee Whye Teh

Bayesian nonparametric latent feature models

CSC 2541: Bayesian Methods for Machine Learning

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Infinite latent feature models and the Indian Buffet Process

Stochastic Processes, Kernel Regression, Infinite Mixture Models

Applied Nonparametric Bayes

Lecturer: David Blei Lecture #3 Scribes: Jordan Boyd-Graber and Francisco Pereira October 1, 2007

Nonparametric Mixed Membership Models

Image segmentation combining Markov Random Fields and Dirichlet Processes

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

Bayesian non parametric approaches: an introduction

Latent Dirichlet Allocation Introduction/Overview

Collapsed Variational Dirichlet Process Mixture Models

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Bayesian Nonparametric Models on Decomposable Graphs

STAT Advanced Bayesian Inference

A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process

Dirichlet Enhanced Latent Semantic Analysis

Dirichlet Process. Yee Whye Teh, University College London

Collapsed Variational Inference for HDP

Hierarchical Models, Nested Models and Completely Random Measures

Clustering using Mixture Models

Haupthseminar: Machine Learning. Chinese Restaurant Process, Indian Buffet Process

IPSJ SIG Technical Report Vol.2014-MPS-100 No /9/25 1,a) 1 1 SNS / / / / / / Time Series Topic Model Considering Dependence to Multiple Topics S

Applied Bayesian Nonparametrics 3. Infinite Hidden Markov Models

Distance dependent Chinese restaurant processes

arxiv: v1 [stat.ml] 8 Jan 2012

Latent Dirichlet Alloca/on

Advanced Machine Learning

Mixed Membership Models for Time Series

Latent Dirichlet Allocation (LDA)

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Bayesian nonparametrics

Trajectory Analysis and Semantic Region Modeling Using A Nonparametric Bayesian Model Xiaogang Wang,, Keng Teck Ma,, Gee-Wah Ng,, and Eric Grimson

UNIVERSITY OF CALIFORNIA, IRVINE. Networks of Mixture Blocks for Non Parametric Bayesian Models with Applications DISSERTATION

Bayesian Structure Modeling. SPFLODD December 1, 2011

Bayesian Nonparametric Learning of Complex Dynamical Phenomena

Probabilistic Graphical Models

Generative Clustering, Topic Modeling, & Bayesian Inference

Probabilistic modeling of NLP

Machine Learning Summer School, Austin, TX January 08, 2015

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Bayesian Methods for Machine Learning

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Modeling Environment

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

Introduction to the Dirichlet Distribution and Related Processes

Latent Dirichlet Bayesian Co-Clustering

Bayesian Analysis for Natural Language Processing Lecture 2

Gentle Introduction to Infinite Gaussian Mixture Modeling

Bayesian Mixtures of Bernoulli Distributions

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Bayesian Inference for Dirichlet-Multinomials

Non-parametric Bayesian Methods for Structured Topic Models

An Infinite Product of Sparse Chinese Restaurant Processes

Hybrid Dirichlet processes for functional data

Hierarchical Bayesian Nonparametrics

arxiv: v2 [stat.ml] 4 Aug 2011

CS Lecture 18. Topic Models and LDA

Topic Modelling and Latent Dirichlet Allocation

Tree-Based Inference for Dirichlet Process Mixtures

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles

Hierarchical Bayesian Nonparametric Models with Applications

Present and Future of Text Modeling

Topic Models. Charles Elkan November 20, 2008

An Introduction to Bayesian Nonparametric Modelling

Applying hlda to Practical Topic Modeling

Nested Hierarchical Dirichlet Processes

1 Inference in Dirichlet Mixture Models Introduction Problem Statement Dirichlet Process Mixtures... 3

Bayesian non-parametric model to longitudinally predict churn

Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process

Latent Variable Models Probabilistic Models in the Study of Language Day 4

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

The Indian Buffet Process: An Introduction and Review

The Infinite Markov Model

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Dirichlet Process Based Evolutionary Clustering

Dependent hierarchical processes for multi armed bandits

Bayesian nonparametric latent feature models

arxiv: v2 [stat.ml] 5 Nov 2012

Transcription:

Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24

Introduction Question: What should we do if we want to model data using a mixture, but we don t know the number of mixing elements k beforehand? T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 2 / 24

Introduction Question: What should we do if we want to model data using a mixture, but we don t know the number of mixing elements k beforehand? Good candidate for a non-parametric method! T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 2 / 24

Introduction Rational e.g. We want to select the number m of Gaussians in a mixture of Gaussians (right) or The number of means k in the k-means algorithm How can we do this? T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 3 / 24

Background Dirichlet Processes First, we need to address a few things: T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 4 / 24

Background Dirichlet Processes First, we need to address a few things: 1 What is the Dirichlet distribution? T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 4 / 24

Background Dirichlet Processes First, we need to address a few things: 1 What is the Dirichlet distribution? 2 What is a Dirichlet Process? T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 4 / 24

Background Dirichlet Processes First, we need to address a few things: 1 What is the Dirichlet distribution? 2 What is a Dirichlet Process? 3 How can it be represented? T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 4 / 24

Background Dirichlet Processes First, we need to address a few things: 1 What is the Dirichlet distribution? 2 What is a Dirichlet Process? 3 How can it be represented? Once we ve done covered the basics we ll talk about a few examples! T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 4 / 24

Dirichlet Processes The Dirichlet Distribution Dirichlet Distribution The Dirichlet Distribution is a distribution over the K-1 probability simplex. Let p be a K-dimensional vector s.t. j : p j 0 and K j=1 p j = 1, then P (p α) = Dir(α 1,..., α K ) = Γ( j α j) j Γ(α j) K j=1 p α j 1 j (1) The first term in the above equation is just a normalization constant. The Dirichlet Distribution is conjugate to the multinomial distribution. i.e if c p Multinomial( p) then the posterior is dirichlet P (p c = j, α) = P (c = j p)p (p α) P (c = j α) where α j = α j + 1, and l j : α l = α l = Dir(alpha ) T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 5 / 24

Dirichlet Processes The Dirichlet Distribution Dirichlet Distribution T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 6 / 24

Dirichlet Processes Dirichlet Processes The Dirichlet Process Dirichlet Processes define a distribution over distributions (or a measure on measures) G DP( G 0, α) where α > 0 is a scaling parameter, and G 0 is the base distribution. Think of DP s as infinite dimensional Dirichlet distributions. It is important to note that G is an infinite dimensional object. T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 7 / 24

The Dirichlet Process Dirichlet Processes Dirichlet Processes Let Θ be a measurable space, G 0 be a probability measure on Θ, and α 0 be a real number. For all (A 1,..., A K ) finite partitions of Θ, G DP( G 0, α) means that (G(A 1 ),..., G(A K )) Dir(α 0 G 0 (A 1 ),..., α 0 G 0 (A K )) T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 8 / 24

The Dirichlet Process Dirichlet Processes Dirichlet Processes Samples from a DP are discrete with probability one: G(θ) = π k δ θk (θ) k=1 Posterior P (G θ) is also a DP! i.e. DP s are conjugate to themselves! ( α P (G θ) = DP α + 1 G 0 + 1 ) α + 1 δ θ, α + 1 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 9 / 24

Urn Representation Dirichlet Processes Representations Given Then θ n θ 1,..., θ n 1, G 0, α P (θ n θ 1,..., θ n 1, G 0, α) G DP( G 0, α) and θ G G( ) n 1 α n 1 + α G 1 0( ) + δ θj ( ) n 1 + α j=1 n P (θ j G)P (G G 0, α)dg j=1 This model exhibits a clustering kind of effect T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 10 / 24

Dirichlet Processes Representations Chinese Restaurant Process (CRP) The Chinese Restaurant Process is another representation of the DP. It can help us see this clustering effect more explicitly: Restaurant has potentially infinitely many tables k = 1,... Customers are indexed by i = 1,..., with values φ i as they arrive Tables have values θ k drawn from G 0 K = total number of occupied tables so far n = total number of customers arrived thus far n k = number of customers seated at table k. T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 11 / 24

Dirichlet Processes Representations Chinese Restaurant Process (CRP) T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 12 / 24

Dirichlet Processes Representations Relationship between CRPs and DPs DP is a distribution over distributions. A DP results in discrete distributions, so drawing n points will likely result in repeat values. A DP induces a partitioning of the n points CRP is the corresponding distribution over partitions. T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 13 / 24

Dirichlet Processes Stick Breaking Construction Representations Samples G DP( G 0, α) can be represented as follows: G( ) = π k δ θk ( ) k=1 where θ k G 0 ( ), k=1 π k = 1, π k = β k k 1(1 β j ) and j=1 β k Beta( 1, α) T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 14 / 24

Extensions Dirichlet Processes Representations Hierarchical Dirichlet Processes (HDP) - For sharing statistical power between many different groups of clustering data. T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 15 / 24

Topic Modeling Examples Goal: To model topics (distributions of words) across an entire corpus. Common Topics LDA (Latent Dirichlet Allocation) is an example of parametric solution to problem, but has shortcomings 1 Documents each draw their own mixing proportions (mixture component is a topic) separately 2 Words are then drawn independently from the mixture model. T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 16 / 24

Examples Topic Modeling Not easy to extend like a simple mixture model since each document has essentially it s own mixture (and thus mixing proportions) In order to capture this we use a separate DP mixture to model each document where each DP is a draw from another DP. This is an application of HDPs T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 17 / 24

Examples Dirichlet Process Mixtures DPs are discrete with prob one, so they are not useful for use as a prior on continuous densities. In a DP Mixture, we draw the parameters of a mixture model from a draw from a DP: G DP( G 0, α) θ i G( ) x i p( θ i ) For example, can be a DP-MoG if p( θ) = Gaussian. However, p( θ) could be any density. T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 18 / 24

Examples Dirichlet Process Mixtures T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 19 / 24

Examples Dirichlet Process Mixtures T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 20 / 24

Examples Dirichlet Process Mixtures We can think of Infinite DP Mixtures in terms of finite mixtures: Consider using a finite mixture of K components to model a data set D = {x (1),..., x (n) } p(x (i) θ) = = K π j p j (x (i) θ j ) j=1 K P (s (i) = j π)p j (x (i) θ j, s (i) = j) j=1 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 21 / 24

Examples Dirichlet Process Mixtures The distribution of indicator variables (assignments of data to mixture) s = (s (1),..., s (n) ) given π is multinomial P (s (1),..., s (n) π) = K j=1 π n j j, n j = n δ(s (i), j) Since we know the Dirichlet distribution is conjugate to the multinomial we can assume the mixing proportions π have a Dirichlet prior: p(π α) = Γ(α) Γ(α/K) K K j=1 i=1 π α/k 1 j And integrating out the mixing proportions gives us: P (s (1),..., s (n) α) = P (s π)p (π α)dπ = Γ(α) K Γ(n + α) j=1 Γ(n j + α/k) Γ(α/K) T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 22 / 24

Examples Dirichlet Process Mixtures Conditionals: Finite K P (s (i) = j s i, α) = n i,j + α/k n 1 + α where s i denotes all indeces except i, and n i,j = l i δ(s(l), i) Conditionals: Infinite K Limit as K P (s (i) = j s i, α) = { n i,j n 1+α α n 1+α j represented all j not represented T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 23 / 24

Summary Summary DPs are essentially infinite dimensional Dirichlet distributions DPs can be constructed in a number of different ways (CRP,SB,Urn) DPs can be extended via HDPs (didn t talk a lot about this, but you can read yourself) Can use DPs to Cluster discrete data Can apply DPs as non-parametric priors over mixing proportions in non-discrete mixtures as a way to avoid sticking with a particular model. And that s all folks! T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 24 / 24

Sources References These slides have made extensive use of the following sources. Many slides were adapted directly from Zhoubin Ghahramani s Non-parametric Bayesian Methods Tutorial, 2005 Yee Whye Teh, Dirichlet Processes Hierarchical Dirichlet Processes, Blei, Jordan, Y. W. Teh, Beal, JASA T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 25 / 24