Bayesian Nonparametrics: Models Based on the Dirichlet Process


1 Bayesian Nonparametrics: Models Based on the Dirichlet Process. Alessandro Panella, Department of Computer Science, University of Illinois at Chicago. Machine Learning Seminar Series, February 18, 2013.

2 Sources and Inspirations. Tutorials (slides): P. Orbanz and Y.W. Teh, Modern Bayesian Nonparametrics, NIPS; M. Jordan, Dirichlet Process, Chinese Restaurant Process, and All That, NIPS. Articles etc.: E.B. Sudderth, chapter in PhD thesis; E. Fox, chapter in PhD thesis; Y.W. Teh, Dirichlet Processes, Encyclopedia of Machine Learning, Springer; ...

3 Outline. 1 Introduction and background: Bayesian learning; Nonparametric models. 2 Finite mixture models: Bayesian models; Clustering with FMMs; Inference. 3 Dirichlet process mixture models: Going nonparametric!; The Dirichlet process; DP mixture models; Inference. 4 A little more theory...: De Finetti's REDUX; Dirichlet process REDUX. 5 The hierarchical Dirichlet process.

5 The meaning of it all (Introduction and background, Bayesian learning). BAYESIAN NONPARAMETRICS.

8 Bayesian statistics (Introduction and background, Bayesian learning). Estimate a parameter θ ∈ Θ after observing data x. Frequentist: Maximum Likelihood (ML): θ̂_MLE = argmax_θ p(x | θ) = argmax_θ L(θ : x). Bayesian: Bayes rule: p(θ | x) = p(x | θ) p(θ) / p(x). Bayesian prediction (using the whole posterior, not just one estimator): p(x_new | x) = ∫_Θ p(x_new | θ) p(θ | x) dθ. Maximum A Posteriori (MAP): θ̂_MAP = argmax_θ p(x | θ) p(θ).

11 Introduction and background, Bayesian learning. De Finetti's theorem. A premise: Definition. An infinite sequence of random variables (x_1, x_2, ...) is said to be (infinitely) exchangeable if, for every N and every permutation π on (1, ..., N), p(x_1, x_2, ..., x_N) = p(x_π(1), x_π(2), ..., x_π(N)). Note: exchangeable does not mean i.i.d.! Example (Polya urn): an urn contains some red balls and some black balls; an infinite sequence of colors is drawn recursively as follows: draw a ball, mark down its color, then put the ball back in the urn along with an additional ball of the same color.

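The Polya urn can be simulated directly. Below is a minimal Python sketch (not from the slides; the starting composition of one red and one black ball is an illustrative assumption) that draws many length-3 sequences and checks that orderings with the same color counts occur with roughly the same frequency, which is exactly what exchangeability asserts.

```python
import random
from collections import Counter

def polya_urn_sequence(n, red=1, black=1, seed=None):
    """Draw n colors from a Polya urn: draw a ball, record its color, then
    return it to the urn together with one extra ball of the same color."""
    rng = random.Random(seed)
    colors = []
    for _ in range(n):
        color = 'R' if rng.random() < red / (red + black) else 'B'
        colors.append(color)
        if color == 'R':
            red += 1
        else:
            black += 1
    return ''.join(colors)

# Sequences with the same color counts should be (approximately) equally frequent,
# even though the draws are clearly not independent.
counts = Counter(polya_urn_sequence(3) for _ in range(200_000))
for seq in ('RRB', 'RBR', 'BRR'):
    print(seq, round(counts[seq] / 200_000, 4))   # each close to 1/12
```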

13 Introduction and background, Bayesian learning. De Finetti's theorem (cont'd). Theorem (De Finetti, a.k.a. Representation Theorem). A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exist a random variable θ and a probability measure p on it such that p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i | θ) dθ, i.e., there exists a parameter space and a measure on it that make the variables i.i.d.! The representation theorem motivates (and encourages!) the use of Bayesian statistics.

15 Introduction and background, Bayesian learning. Bayesian learning. Hypothesis space H. Given data D, compute p(h | D) = p(D | h) p(h) / p(D). Then we probably want to predict some future data D', by either: averaging over H, i.e. p(D' | D) = ∫_H p(D' | h) p(h | D) dh; choosing the MAP h (or computing it directly), i.e. p(D' | D) = p(D' | h_MAP); sampling from the posterior; ... H can be anything! Bayesian learning as a general learning framework. We will consider the case in which h is a probabilistic model itself, i.e. a parameter vector θ.

16 Introduction and background, Bayesian learning. A simple example. Infer the bias θ ∈ [0, 1] of a coin after observing N tosses. H = 1, T = 0, p(x = 1 | θ) = θ, hence the hypothesis space is [0, 1]. Sequence of Bernoulli trials: p(x_1, ..., x_N | θ) = θ^{n_H} (1 − θ)^{N − n_H}, where n_H = # heads. Unknown θ: p(x_1, ..., x_N) = ∫_0^1 θ^{n_H} (1 − θ)^{N − n_H} p(θ) dθ. Need to find a good prior p(θ)... Beta distribution!

17 Introduction and background, Bayesian learning. A simple example (cont'd). Beta distribution: θ ~ Beta(a, b), p(θ | a, b) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}. Bayesian learning: p(h | D) ∝ p(D | h) p(h); for us: p(θ | x_1, ..., x_N) ∝ p(x_1, ..., x_N | θ) p(θ) = θ^{n_H} (1 − θ)^{n_T} · (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1} ∝ θ^{n_H + a − 1} (1 − θ)^{n_T + b − 1}, i.e. θ | x_1, ..., x_N ~ Beta(a + n_H, b + n_T). We're lucky! The Beta distribution is a conjugate prior to the binomial distribution. (Plots: Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3), Beta(10, 10).)

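As a quick illustration of this conjugate update, here is a hedged Python sketch (not part of the original slides; it assumes SciPy is available) that computes the Beta posterior for the coin example.

```python
from scipy import stats

def beta_posterior(tosses, a=1.0, b=1.0):
    """Beta(a + n_H, b + n_T) posterior over the coin bias, given tosses coded 1 = H, 0 = T."""
    n_h = sum(tosses)
    n_t = len(tosses) - n_h
    return stats.beta(a + n_h, b + n_t)

post = beta_posterior([1, 0, 1, 1], a=2, b=3)   # observe H T H H under a Beta(2, 3) prior
print(post.mean())                              # posterior mean of the bias
print(post.interval(0.95))                      # central 95% credible interval
```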

19 Introduction and background, Bayesian learning. A simple example (cont'd). Three sequences of four tosses: H T H H, H H H T, H H H H. (Plots of the corresponding posteriors.)

20 Introduction and background, Nonparametric models. Nonparametric models. Nonparametric doesn't mean no parameters! Rather: the number of parameters grows as more data are observed; ∞-dimensional parameter space; finite data ⇒ bounded number of parameters. Definition: a nonparametric model is a Bayesian model on an ∞-dimensional parameter space. Example: density estimation, parametric vs. nonparametric (figure from Orbanz and Teh, NIPS 2011).

24 Finite mixture models, Bayesian models. Models in Bayesian data analysis. Model = generative process: expresses how we think the data are generated; contains hidden variables (the subject of learning); specifies relations between variables, e.g. graphical models. Posterior inference: knowing p(D | M, θ), i.e. how the data are generated, compute p(θ | M, D); akin to reversing the generative process.

25 Finite mixture models, Clustering with FMMs. Finite mixture models (FMMs). Bayesian approach to clustering: each data point is assumed to belong to one of K clusters. General form: a sequence of data points x = (x_1, ..., x_N), each with probability p(x_i | π, θ_1, ..., θ_K) = Σ_{k=1}^K π_k f(x_i | θ_k), where π lies in the (K−1)-simplex. Generative process: for each i, draw a cluster assignment z_i ~ π, then draw a data point x_i ~ F(θ_{z_i}).

26 Finite mixture models, Clustering with FMMs. FMMs (example). Mixture of univariate Gaussians: θ_k = (µ_k, σ_k), p(x_i | π, µ, σ) = Σ_{k=1}^K π_k f_N(x_i; µ_k, σ_k). Example: π = (0.15, 0.25, 0.6) with components N(1, 1), N(4, .5), N(6, .7).
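
A minimal Python/NumPy sketch (not from the slides) of the generative process for this example mixture; the sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.15, 0.25, 0.6])          # mixing weights from the slide
mu = np.array([1.0, 4.0, 6.0])            # component means
sigma = np.array([1.0, 0.5, 0.7])         # component standard deviations

N = 1000
z = rng.choice(len(pi), size=N, p=pi)     # cluster assignments z_i ~ pi
x = rng.normal(mu[z], sigma[z])           # observations x_i ~ N(mu_{z_i}, sigma_{z_i})
print(np.bincount(z) / N)                 # empirical proportions, roughly pi
```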

27 Finite mixture models, Clustering with FMMs. FMMs (cont'd). Clustering with FMMs: we need priors for π and θ. Usually π is given a (symmetric) Dirichlet distribution prior, and the θ_k's are given a suitable prior H, depending on the data. Full model: π ~ Dir(α/K, ..., α/K); θ_k ~ H, k = 1, ..., K; z_i | π ~ π; x_i | θ, z_i ~ F(θ_{z_i}), i = 1, ..., N.

28 Finite mixture models, Clustering with FMMs. Dirichlet distribution. Multivariate generalization of the Beta distribution. (Density plots: Dir(1, 1, 1), Dir(2, 2, 2), Dir(5, 5, 5), Dir(5, 5, 2), Dir(5, 2, 2), Dir(0.7, 0.7, 0.7); from Teh, MLSC 2008.)

29 Finite mixture models, Clustering with FMMs. Dirichlet distribution (cont'd). π ~ Dir(α/K, ..., α/K) iff p(π_1, ..., π_K) = (Γ(α) / ∏_k Γ(α/K)) ∏_{k=1}^K π_k^{α/K − 1}. Conjugate prior to the categorical/multinomial: π ~ Dir(α/K, ..., α/K) and z_i | π ~ π (i = 1, ..., N) imply π | z_1, ..., z_N ~ Dir(α/K + n_1, α/K + n_2, ..., α/K + n_K). Moreover, p(z_1, ..., z_N | α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(n_k + α/K) / Γ(α/K), and p(z_i = k | z_{−i}, α) = (n_k^{(−i)} + α/K) / (α + N − 1).
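
These conjugacy and predictive formulas are easy to sanity-check numerically. The following sketch is illustrative only (α, K and N are arbitrary choices, not from the slides); it draws π and z as above and prints the Dirichlet posterior parameters and the predictive probabilities for the next assignment.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, K, N = 2.0, 3, 10

pi = rng.dirichlet(np.full(K, alpha / K))      # pi ~ Dir(alpha/K, ..., alpha/K)
z = rng.choice(K, size=N, p=pi)                # z_i | pi ~ pi
n = np.bincount(z, minlength=K)                # cluster counts n_k

print("posterior Dirichlet parameters:", alpha / K + n)          # Dir(alpha/K + n_1, ..., alpha/K + n_K)
print("predictive for a new z:", (n + alpha / K) / (alpha + N))  # p(z_{N+1} = k | z, alpha)
```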

30 Finite mixture models, Inference. Inference in FMMs. Clustering: infer z (marginalizing over π, θ): p(z | x, α, H) = p(x | z, H) p(z | α) / Σ_z p(x | z, H) p(z | α), where p(z | α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(n_k + α/K) / Γ(α/K), and p(x | z, H) = ∫_Θ [∏_{i=1}^N p(x_i | θ_{z_i})] [∏_{k=1}^K H(θ_k)] dθ. Parameter estimation: infer π, θ: p(π, θ | x, α, H) = Σ_z p(π | z, α) [∏_{k=1}^K p(θ_k | x, z, H)] p(z | x, α, H). No analytic procedure.

35 Finite mixture models, Inference. Approximate inference for FMMs. No exact inference because of the unknown cluster identifiers z. Expectation-Maximization (EM) is widely used, but we will focus on MCMC because of the connection with the Dirichlet process. Gibbs sampling: a Markov chain Monte Carlo (MCMC) integration method. Set of random variables v = {v_1, v_2, ..., v_M}; we want to compute p(v). Randomly initialize the values. At each iteration, sample one variable and hold the rest constant: v_i^{(t)} ~ p(v_i | v_j^{(t−1)}, j ≠ i) (usually tractable), v_j^{(t)} = v_j^{(t−1)} for j ≠ i. This creates a Markov chain with p(v) as its equilibrium distribution.

36 Finite mixture models, Inference. Gibbs sampling for FMMs. State variables: z_1, ..., z_N, θ_1, ..., θ_K, π. Conditional distributions: p(π | z, θ) = Dir(α/K + n_1, ..., α/K + n_K); p(θ_k | x, z) ∝ p(θ_k) ∏_{i: z_i = k} p(x_i | θ_k) = H(θ_k) ∏_{i: z_i = k} F_{θ_k}(x_i); p(z_i = k | π, θ, x) ∝ p(z_i = k | π_k) p(x_i | z_i = k, θ_k) = π_k F_{θ_k}(x_i). We can avoid sampling π: p(z_i = k | z_{−i}, θ, x) ∝ p(x_i | θ_k) p(z_i = k | z_{−i}) ∝ F_{θ_k}(x_i) (n_k^{(−i)} + α/K).
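
A hedged sketch of this sampler for one-dimensional Gaussian components with known variance and a conjugate normal prior on each mean (these modelling choices, and the use of NumPy/SciPy, are assumptions not made explicit on the slide). The mixing weights are integrated out in the z update, as above.

```python
import numpy as np
from scipy import stats

def gibbs_fmm(x, K, alpha=1.0, sigma=1.0, mu0=0.0, tau0=10.0, iters=100, seed=0):
    """Gibbs sampler for a K-component Gaussian mixture with known component
    variance sigma^2 and N(mu0, tau0^2) priors on the component means."""
    rng = np.random.default_rng(seed)
    N = len(x)
    z = rng.integers(K, size=N)
    mu = rng.normal(mu0, tau0, size=K)
    for _ in range(iters):
        for i in range(N):                                    # resample z_i given z_{-i}, theta, x
            n = np.bincount(np.delete(z, i), minlength=K)     # counts n_k^{(-i)}
            logp = stats.norm.logpdf(x[i], mu, sigma) + np.log(n + alpha / K)
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(K, p=p / p.sum())
        for k in range(K):                                    # resample theta_k given x, z
            xk = x[z == k]
            prec = 1 / tau0**2 + len(xk) / sigma**2           # Normal-Normal posterior precision
            mean = (mu0 / tau0**2 + xk.sum() / sigma**2) / prec
            mu[k] = rng.normal(mean, np.sqrt(1 / prec))
    return z, mu
```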

37 Finite mixture models, Inference. Gibbs sampling for FMMs (example). Mixture of 4 bivariate Gaussians; Normal-inverse-Wishart prior on θ_k = (µ_k, Σ_k), conjugate to the normal distribution: Σ_k ~ W(ν, ·), µ_k | Σ_k ~ N(ϑ, Σ_k / κ). (Figure: learning a mixture of K = 4 Gaussians using the Gibbs sampler; plots show the current parameters after T = 2, T = 10, and T = 40 iterations from two random initializations, each labeled by the current data log-likelihood; from Sudderth, 2008.)

38 Finite mixture models, Inference. FMMs: alternative representation. Instead of explicit indicators, place the mixture in the parameters: G(θ) = Σ_{k=1}^K π_k δ(θ, θ_k), with π ~ Dir(α/K, ..., α/K) and θ_k ~ H; then θ̄_i ~ G and x_i ~ F(θ̄_i). This is equivalent to the indicator representation π ~ Dir(α/K, ..., α/K), θ_k ~ H, z_i ~ π, x_i ~ F(θ_{z_i}).

40 Dirichlet process mixture models, Going nonparametric! Going nonparametric! The problem with finite FMMs: what if K is unknown? How many parameters? Idea: let's use ∞ parameters! We want something of the kind p(x_i | π, θ_1, θ_2, ...) = Σ_{k=1}^∞ π_k p(x_i | θ_k). How to define such a measure? We'd like the nice conjugacy properties of the Dirichlet to carry over... Is there such a thing, the limit of a Dirichlet?

43 Dirichlet process mixture models, The Dirichlet process. The (practical) Dirichlet process. The Dirichlet process DP(α, H) is a distribution over probability measures over Θ. H(θ) is the base (mean) measure: think of µ for a Gaussian, but in the space of probability measures. α is the concentration parameter: it controls the dispersion around the mean H.

44 Dirichlet process mixture models, The Dirichlet process. The Dirichlet process (cont'd). A draw G ~ DP(α, H) is an infinite discrete probability measure: G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k), where θ_k ~ H and π is sampled from a stick-breaking prior. (Figure: a draw G on Θ; from Orbanz & Teh, 2008.) Break a stick: imagine a stick of length one; for k = 1, 2, ..., break off a fraction of the remaining stick drawn from Beta(1, α), let π_k be the length broken off, and keep the remainder. Following standard convention, we write π ~ GEM(α). (Details in the second part of the talk.)

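A minimal sketch (not from the slides) of a truncated stick-breaking draw of π ~ GEM(α); the truncation level is an arbitrary assumption.

```python
import numpy as np

def gem_weights(alpha, K_trunc=100, rng=None):
    """Truncated stick-breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    rng = rng or np.random.default_rng()
    beta = rng.beta(1.0, alpha, size=K_trunc)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))   # length of stick left before break k
    return beta * remaining

for a in (1.0, 5.0):
    pi = gem_weights(a, rng=np.random.default_rng(0))
    print(f"alpha={a}: five largest weights {np.sort(pi)[::-1][:5].round(3)}")
```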

46 Dirichlet process mixture models, The Dirichlet process. Stick-breaking, intuitively. (Figure: stick-breaking weights π_1 = β_1, π_2 = β_2(1 − β_1), π_3 = β_3(1 − β_1)(1 − β_2), ..., and sampled weights π_k for α = 1 and α = 5; from Sudderth, 2008.) Small α: lots of weight assigned to few θ_k's; G will be very different from the base measure H. Large α: weight distributed more equally over the θ_k's; G will resemble the base measure H.

47 Dirichlet process mixture models, The Dirichlet process. The Dirichlet process. (Figure: a base measure H and draws G ~ DP(α, H); from Navarro et al., 2005.)

48 Dirichlet process mixture models, DP mixture models. The DP mixture model (DPMM). Let's use G ~ DP(α, H) to build an infinite mixture model: G ~ DP(α, H); θ̄_i | G ~ G; x_i | θ̄_i ~ F(θ̄_i), for i = 1, ..., N.

49 Dirichlet process mixture models, DP mixture models. DPM (cont'd). Using explicit cluster indicators z = (z_1, z_2, ..., z_N): π ~ GEM(α); θ_k ~ H, k = 1, ..., ∞; z_i | π ~ π; x_i | z_i, θ ~ F(θ_{z_i}), i = 1, ..., N.

50 Dirichlet process mixture models, DP mixture models. Chinese restaurant process. So far, we only have a generative model. Is there a nice conjugacy property to use during inference? It turns out (details in part 2) that, if π ~ GEM(α) and z_i | π ~ π, the distribution p(z | α) = ∫ p(z | π) p(π) dπ is easily tractable, and is known as the Chinese restaurant process (CRP).

51 Dirichlet process mixture models, DP mixture models. Chinese restaurant process (cont'd). A restaurant with infinitely many tables, each of infinite capacity. z_i = table at which customer i sits upon entering. Customer 1 sits at table 1. Customer 2 sits at table 1 with probability 1/(1 + α), at table 2 with probability α/(1 + α). In general, customer i sits at table k with probability proportional to n_k (# of people at k) and at a new table with probability proportional to α: p(z_i = k) = n_k / (α + i − 1), p(z_i = k_new) = α / (α + i − 1).

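A minimal Python sketch (not from the slides) simulating these seating probabilities; the values of N and α are arbitrary.

```python
import numpy as np

def crp(N, alpha, rng=None):
    """Sample table assignments from a Chinese restaurant process: customer i joins
    table k with prob. n_k / (alpha + i - 1), a new table with prob. alpha / (alpha + i - 1)."""
    rng = rng or np.random.default_rng()
    assignments, counts = [], []
    for i in range(N):                                          # i customers already seated
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                                    # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

z, counts = crp(100, alpha=2.0, rng=np.random.default_rng(0))
print(len(counts), "tables with sizes", counts)
```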

56 Dirichlet process mixture models, Inference. Gibbs sampling for DPMMs. Via the CRP, we can find the conditional distributions for Gibbs sampling. State: θ_1, ..., θ_K, z. p(θ_k | x, z) ∝ p(θ_k) ∏_{i: z_i = k} p(x_i | θ_k) = h(θ_k) ∏_{i: z_i = k} f(x_i | θ_k). p(z_i = k | z_{−i}, θ, x) ∝ p(x_i | θ_k) p(z_i = k | z_{−i}) ∝ n_k^{(−i)} f(x_i | θ_k) for an existing k, and ∝ α f(x_i | θ_k) for a new k. The number of clusters grows as more data are observed, asymptotically as α log n.
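
The sketch below is an illustration, not the slide's exact algorithm: one sweep of such a sampler for one-dimensional Gaussian components with known variance and base measure H = N(mu0, tau0^2), where the new-cluster option is scored with a single fresh parameter drawn from H, in the spirit of Neal's auxiliary-variable samplers. All hyperparameters and the usage data are assumptions.

```python
import numpy as np
from scipy import stats

def dpmm_gibbs_sweep(x, z, mu, alpha, sigma=1.0, mu0=0.0, tau0=10.0, rng=None):
    """One Gibbs sweep for a DP mixture of 1-D Gaussians. z is an int array of cluster
    labels, mu a dict label -> mean. Existing cluster k gets weight n_k^{(-i)} f(x_i | theta_k);
    a new cluster gets weight alpha f(x_i | theta_new) with theta_new drawn from H."""
    rng = rng or np.random.default_rng()
    for i in range(len(x)):
        z[i] = -1                                             # remove x_i from its cluster
        labels, counts = np.unique(z[z >= 0], return_counts=True)
        mu = {k: mu[k] for k in labels}                       # drop clusters left empty
        mu_new = rng.normal(mu0, tau0)                        # candidate parameter theta_new ~ H
        means = np.array([mu[k] for k in labels] + [mu_new])
        logw = np.log(np.append(counts, alpha)) + stats.norm.logpdf(x[i], means, sigma)
        w = np.exp(logw - logw.max())
        choice = rng.choice(len(means), p=w / w.sum())
        if choice == len(labels):                             # open a new cluster
            k_new = (max(mu) + 1) if mu else 0
            mu[k_new] = mu_new
            z[i] = k_new
        else:
            z[i] = labels[choice]
    for k in list(mu):                                        # resample theta_k | x, z (Normal-Normal)
        xk = x[z == k]
        prec = 1 / tau0**2 + len(xk) / sigma**2
        mu[k] = rng.normal((mu0 / tau0**2 + xk.sum() / sigma**2) / prec, np.sqrt(1 / prec))
    return z, mu

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(m, 1.0, 50) for m in (-4.0, 0.0, 5.0)])   # toy data, three clumps
z, mu = np.zeros(len(x), dtype=int), {0: 0.0}                            # start with a single cluster
for _ in range(50):
    z, mu = dpmm_gibbs_sweep(x, z, mu, alpha=1.0, rng=rng)
print("clusters found:", len(mu))
```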

57 Dirichlet process mixture models, Inference. Gibbs sampling for DPMMs (example). Mixture of bivariate Gaussians. (Figure: learning a mixture of Gaussians using the Dirichlet process Gibbs sampler; columns show the parameters of clusters currently assigned to observations, and corresponding data log-likelihoods, after T = 2, T = 10, and T = 50 iterations from two initializations; from Sudderth, 2008.)

58 END OF FIRST PART.

60 A little more theory..., De Finetti's REDUX. De Finetti's REDUX. Theorem (De Finetti, a.k.a. Representation Theorem). A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exist a random variable θ and a probability measure p on it such that p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i | θ) dθ. The theorem wouldn't be true if θ's range were limited to Euclidean vector spaces: we need to allow θ to range over measures. p(θ) is then a distribution on measures, like the DP.

62 A little more theory..., Dirichlet process REDUX. Dirichlet process REDUX. Definition. Let Θ be a measurable space (of parameters), H a probability distribution on Θ, and α a positive scalar. A Dirichlet process is the distribution of a random probability measure G over Θ such that, for any finite partition (T_1, ..., T_K) of Θ, (G(T_1), ..., G(T_K)) ~ Dir(αH(T_1), ..., αH(T_K)). In particular, E[G(T_k)] = H(T_k). (Figure: two partitions of Θ; from Sudderth, 2008.)
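
This finite-dimensional characterization can be checked numerically with a truncated stick-breaking draw (the stick-breaking form is derived later in these slides). A minimal sketch, assuming H = Uniform(0, 1), Θ = [0, 1], and an arbitrary three-set partition:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K_trunc, draws = 2.0, 500, 2000
edges = np.array([0.0, 0.3, 0.7, 1.0])                   # partition T_1, T_2, T_3 of [0, 1]

masses = np.zeros((draws, len(edges) - 1))
for d in range(draws):
    beta = rng.beta(1.0, alpha, size=K_trunc)             # truncated stick-breaking weights
    pi = beta * np.concatenate(([1.0], np.cumprod(1 - beta)[:-1]))
    theta = rng.uniform(0.0, 1.0, size=K_trunc)           # atoms theta_k ~ H = U(0, 1)
    masses[d] = [pi[(theta >= a) & (theta < b)].sum() for a, b in zip(edges[:-1], edges[1:])]

# The definition says (G(T_1), G(T_2), G(T_3)) ~ Dir(0.3*alpha, 0.4*alpha, 0.3*alpha),
# so in particular E[G(T_k)] = H(T_k):
print(masses.mean(axis=0))                                # approximately (0.3, 0.4, 0.3)
```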

63 A little more theory..., Dirichlet process REDUX. Posterior conjugacy. Via the conjugacy of the Dirichlet distribution, we know that p(G(T_1), ..., G(T_K) | θ ∈ T_k) = Dir(αH(T_1), ..., αH(T_k) + 1, ..., αH(T_K)). Formalizing this analysis, we obtain that if G ~ DP(α, H) and θ̄_i | G ~ G for i = 1, ..., N, the posterior measure also follows a Dirichlet process: G | θ̄_1, ..., θ̄_N, α, H ~ DP(α + N, (αH + Σ_{i=1}^N δ_{θ̄_i}) / (α + N)). The DP defines a conjugate prior for distributions on arbitrary measure spaces.

64 A little more theory..., Dirichlet process REDUX. Generating samples: stick breaking. Sethuraman (1994) gave an equivalent definition of the Dirichlet process through the stick-breaking construction: G ~ DP(α, H) iff G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k), where θ_k ~ H and π_k = β_k ∏_{l=1}^{k−1} (1 − β_l), with β_l ~ Beta(1, α). (Figure: stick-breaking weights for α = 1 and α = 5; from Sudderth, 2008.)

65 A little more theory..., Dirichlet process REDUX. Stick-breaking (derivation) [Teh 2007]. We know that (posterior): if G ~ DP(α, H) and θ | G ~ G, then θ ~ H marginally and G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)). Consider the partition ({θ}, Θ \ {θ}) of Θ. We have (G(θ), G(Θ \ θ)) ~ Dir((α + 1) [(αH + δ_θ)/(α + 1)](θ), (α + 1) [(αH + δ_θ)/(α + 1)](Θ \ θ)) = Dir(1, α) = Beta(1, α). So G has a point mass located at θ: G = β δ_θ + (1 − β) G', with β ~ Beta(1, α), where G' is the renormalized probability measure with the point mass removed. What is G'?

69 A little more theory..., Dirichlet process REDUX. Stick-breaking (derivation) [Teh 2007]. We have: G ~ DP(α, H), θ | G ~ G, θ ~ H, G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)), and G = β δ_θ + (1 − β) G' with β ~ Beta(1, α). Consider a further partition ({θ}, T_1, ..., T_K) of Θ: (G(θ), G(T_1), ..., G(T_K)) = (β, (1 − β) G'(T_1), ..., (1 − β) G'(T_K)) ~ Dir(1, αH(T_1), ..., αH(T_K)). Using the agglomerative/decimative property of the Dirichlet, we get (G'(T_1), ..., G'(T_K)) ~ Dir(αH(T_1), ..., αH(T_K)), i.e. G' ~ DP(α, H).

72 A little more theory..., Dirichlet process REDUX. Stick-breaking (derivation) [Teh 2007]. Therefore, G ~ DP(α, H) can be written as G = β_1 δ_{θ_1} + (1 − β_1) G_1 = β_1 δ_{θ_1} + (1 − β_1)(β_2 δ_{θ_2} + (1 − β_2) G_2) = ... = Σ_{k=1}^∞ π_k δ_{θ_k}, where π_k = β_k ∏_{l=1}^{k−1} (1 − β_l) and β_l ~ Beta(1, α), which is the stick-breaking construction.

73 A little more theory..., Dirichlet process REDUX. Chinese restaurant (derivation). Once again, we start from the posterior: G | θ̄_1, ..., θ̄_N, α, H ~ DP(α + N, (αH + Σ_{i=1}^N δ_{θ̄_i}) / (α + N)). The expected measure of any subset T ⊆ Θ is E[G(T) | θ̄_1, ..., θ̄_N, α, H] = (αH + Σ_{i=1}^N δ_{θ̄_i})(T) / (α + N). Since G is discrete, some of the {θ̄_i}_{i=1}^N take identical values. Assuming K ≤ N unique values θ_1, ..., θ_K with multiplicities N_1, ..., N_K: E[G(T) | θ̄_1, ..., θ̄_N, α, H] = (αH + Σ_{k=1}^K N_k δ_{θ_k})(T) / (α + N).

76 A little more theory..., Dirichlet process REDUX. Chinese restaurant (derivation). A bit informally: let T_k contain θ_k and shrink it arbitrarily. In the limit, p(θ̄_{N+1} = θ | θ̄_1, ..., θ̄_N, α, H) = (αH(θ) + Σ_{k=1}^K N_k δ_{θ_k}(θ)) / (α + N). This is the generalized Polya urn scheme: an urn contains one ball for each preceding observation, with a different color for each distinct θ_k; for each ball drawn from the urn, we replace that ball and add one more ball of the same color; there is a special weighted ball which is drawn with probability proportional to α normal balls, and has a new, previously unseen color. [This description is from Sudderth, 2008.] This allows us to sample from a Dirichlet process without explicitly constructing the underlying G ~ DP(α, H).

78 A little more theory..., Dirichlet process REDUX. Chinese restaurant (derivation). The Dirichlet process implicitly partitions the data. Let z_i indicate the subset (cluster) associated with the i-th observation, i.e. θ̄_i = θ_{z_i}. From the previous slide, we get p(z_{N+1} = z | z_1, ..., z_N, α) = (α δ(z, k_new) + Σ_{k=1}^K N_k δ(z, k)) / (α + N). This is the Chinese restaurant process (CRP). It induces an exchangeable distribution on partitions: the joint distribution is invariant to the order in which observations are assigned to clusters.

80 A little more theory..., Dirichlet process REDUX. Take-away message: these representations are all equivalent. Posterior DP: G ~ DP(α, H), θ | G ~ G, θ ~ H, G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)). Stick-breaking construction: G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k), θ_k ~ H, π ~ GEM(α). Generalized Polya urn: p(θ̄_{N+1} = θ | θ̄_1, ..., θ̄_N, α, H) = (αH(θ) + Σ_{k=1}^K N_k δ_{θ_k}(θ)) / (α + N). Chinese restaurant process: p(z_{N+1} = z | z_1, ..., z_N, α) = (α δ(z, k_new) + Σ_{k=1}^K N_k δ(z, k)) / (α + N).

82 The hierarchical Dirichlet process. The DP mixture model (DPMM). Recall: we use G ~ DP(α, H) to build an infinite mixture model: G ~ DP(α, H); θ̄_i | G ~ G; x_i | θ̄_i ~ F(θ̄_i).

83 The hierarchical Dirichlet process. Related subgroups of data. Dataset with J related groups x = (x_1, ..., x_J); group x_j = (x_{j1}, ..., x_{jN_j}) contains N_j observations. We want these groups to share clusters (transfer knowledge). (Figure from Jordan, 2005.)

84 The hierarchical Dirichlet process. Hierarchical Dirichlet process (HDP). Global probability measure G_0 ~ DP(γ, H) defines a set of shared clusters: G_0(θ) = Σ_{k=1}^∞ β_k δ(θ, θ_k), with θ_k ~ H and β ~ GEM(γ). Group-specific distributions G_j ~ DP(α, G_0): G_j(θ) = Σ_{t=1}^∞ π̃_{jt} δ(θ, θ̃_{jt}), with θ̃_{jt} ~ G_0 and π̃_j ~ GEM(α). Note G_0 as the base measure! Each local cluster has its parameter θ̃_{jt} copied from some global cluster θ_k. For each group, data points are generated according to θ̄_{ji} ~ G_j, x_{ji} ~ F(θ̄_{ji}).

87 The hierarchical Dirichlet process. The HDP mixture model. G_0 ~ DP(γ, H); G_j | G_0 ~ DP(α, G_0), j = 1, ..., J; θ̄_{ji} | G_j ~ G_j; x_{ji} | θ̄_{ji} ~ F(θ̄_{ji}).

88 The hierarchical Dirichlet process. The HDP mixture model (cont'd). G_j(θ) = Σ_{t=1}^∞ π̃_{jt} δ(θ, θ̃_{jt}), with θ̃_{jt} ~ G_0 and π̃_j ~ GEM(α). G_0 is discrete, so each group might create several copies of the same global cluster. Aggregating the probabilities: G_j(θ) = Σ_{k=1}^∞ π_{jk} δ(θ, θ_k), with π_{jk} = Σ_{t: k_{jt} = k} π̃_{jt}. It can be shown that π_j ~ DP(α, β). β = (β_1, β_2, ...): average weight of the clusters across groups; π_j = (π_{j1}, π_{j2}, ...): group-specific weights; α controls the variability of the cluster weights across groups.
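
A hedged sketch of this hierarchical construction via truncated stick-breaking (illustrative only; H = N(0, 1), the truncation level, and the concentration values are assumptions). Because each group's atoms are drawn from the discrete G_0, all groups reuse the same global θ_k's, only with different aggregated weights π_{jk}.

```python
import numpy as np

def gem(alpha, K, rng):
    """Truncated stick-breaking weights."""
    b = rng.beta(1.0, alpha, size=K)
    return b * np.concatenate(([1.0], np.cumprod(1 - b)[:-1]))

rng = np.random.default_rng(0)
gamma_c, alpha_c, K_trunc, J = 5.0, 1.0, 50, 3

theta = rng.normal(0.0, 1.0, size=K_trunc)       # global atoms theta_k ~ H = N(0, 1)
beta = gem(gamma_c, K_trunc, rng)                 # global weights beta ~ GEM(gamma)

for j in range(J):                                # group-specific G_j ~ DP(alpha, G_0)
    k_jt = rng.choice(K_trunc, size=K_trunc, p=beta)    # local atoms are copies of global ones
    pi_tilde = gem(alpha_c, K_trunc, rng)               # local weights ~ GEM(alpha)
    pi_j = np.bincount(k_jt, weights=pi_tilde, minlength=K_trunc)   # aggregate copies: pi_{jk}
    top = np.argsort(pi_j)[::-1][:3]
    print(f"group {j}: top global clusters {top}, weights {pi_j[top].round(3)}")
```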

89 THANK YOU. QUESTIONS?

Lecture 3a: Dirichlet processes

Lecture 3a: Dirichlet processes Lecture 3a: Dirichlet processes Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced Topics

More information

Non-parametric Clustering with Dirichlet Processes

Non-parametric Clustering with Dirichlet Processes Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet

More information

Bayesian Nonparametrics: Dirichlet Process

Bayesian Nonparametrics: Dirichlet Process Bayesian Nonparametrics: Dirichlet Process Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes2012 Dirichlet Process Cornerstone of modern Bayesian

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

Dirichlet Processes: Tutorial and Practical Course

Dirichlet Processes: Tutorial and Practical Course Dirichlet Processes: Tutorial and Practical Course (updated) Yee Whye Teh Gatsby Computational Neuroscience Unit University College London August 2007 / MLSS Yee Whye Teh (Gatsby) DP August 2007 / MLSS

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I X i Ν CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr 2004 Dirichlet Process I Lecturer: Prof. Michael Jordan Scribe: Daniel Schonberg dschonbe@eecs.berkeley.edu 22.1 Dirichlet

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu Lecture 16-17: Bayesian Nonparametrics I STAT 6474 Instructor: Hongxiao Zhu Plan for today Why Bayesian Nonparametrics? Dirichlet Distribution and Dirichlet Processes. 2 Parameter and Patterns Reference:

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Bayesian Nonparametric Models

Bayesian Nonparametric Models Bayesian Nonparametric Models David M. Blei Columbia University December 15, 2015 Introduction We have been looking at models that posit latent structure in high dimensional data. We use the posterior

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

Part IV: Monte Carlo and nonparametric Bayes

Part IV: Monte Carlo and nonparametric Bayes Part IV: Monte Carlo and nonparametric Bayes Outline Monte Carlo methods Nonparametric Bayesian models Outline Monte Carlo methods Nonparametric Bayesian models The Monte Carlo principle The expectation

More information

Nonparametric Mixed Membership Models

Nonparametric Mixed Membership Models 5 Nonparametric Mixed Membership Models Daniel Heinz Department of Mathematics and Statistics, Loyola University of Maryland, Baltimore, MD 21210, USA CONTENTS 5.1 Introduction................................................................................

More information

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Yee Whye Teh (1), Michael I. Jordan (1,2), Matthew J. Beal (3) and David M. Blei (1) (1) Computer Science Div., (2) Dept. of Statistics

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Stochastic Processes, Kernel Regression, Infinite Mixture Models

Stochastic Processes, Kernel Regression, Infinite Mixture Models Stochastic Processes, Kernel Regression, Infinite Mixture Models Gabriel Huang (TA for Simon Lacoste-Julien) IFT 6269 : Probabilistic Graphical Models - Fall 2018 Stochastic Process = Random Function 2

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

Construction of Dependent Dirichlet Processes based on Poisson Processes

Construction of Dependent Dirichlet Processes based on Poisson Processes 1 / 31 Construction of Dependent Dirichlet Processes based on Poisson Processes Dahua Lin Eric Grimson John Fisher CSAIL MIT NIPS 2010 Outstanding Student Paper Award Presented by Shouyuan Chen Outline

More information

Collapsed Variational Dirichlet Process Mixture Models

Collapsed Variational Dirichlet Process Mixture Models Collapsed Variational Dirichlet Process Mixture Models Kenichi Kurihara Dept. of Computer Science Tokyo Institute of Technology, Japan kurihara@mi.cs.titech.ac.jp Max Welling Dept. of Computer Science

More information

Spatial Normalized Gamma Process

Spatial Normalized Gamma Process Spatial Normalized Gamma Process Vinayak Rao Yee Whye Teh Presented at NIPS 2009 Discussion and Slides by Eric Wang June 23, 2010 Outline Introduction Motivation The Gamma Process Spatial Normalized Gamma

More information

Dirichlet Processes and other non-parametric Bayesian models

Dirichlet Processes and other non-parametric Bayesian models Dirichlet Processes and other non-parametric Bayesian models Zoubin Ghahramani http://learning.eng.cam.ac.uk/zoubin/ zoubin@cs.cmu.edu Statistical Machine Learning CMU 10-702 / 36-702 Spring 2008 Model

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Bayesian non parametric approaches: an introduction

Bayesian non parametric approaches: an introduction Introduction Latent class models Latent feature models Conclusion & Perspectives Bayesian non parametric approaches: an introduction Pierre CHAINAIS Bordeaux - nov. 2012 Trajectory 1 Bayesian non parametric

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado CSCI 5822 Probabilistic Model of Human and Machine Learning Mike Mozer University of Colorado Topics Language modeling Hierarchical processes Pitman-Yor processes Based on work of Teh (2006), A hierarchical

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

Infinite latent feature models and the Indian Buffet Process

Infinite latent feature models and the Indian Buffet Process p.1 Infinite latent feature models and the Indian Buffet Process Tom Griffiths Cognitive and Linguistic Sciences Brown University Joint work with Zoubin Ghahramani p.2 Beyond latent classes Unsupervised

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Hierarchical Bayesian Languge Model Based on Pitman-Yor Processes. Yee Whye Teh

Hierarchical Bayesian Languge Model Based on Pitman-Yor Processes. Yee Whye Teh Hierarchical Bayesian Languge Model Based on Pitman-Yor Processes Yee Whye Teh Probabilistic model of language n-gram model Utility i-1 P(word i word i-n+1 ) Typically, trigram model (n=3) e.g., speech,

More information

Lecture: Probabilistic Machine Learning. Riashat Islam, Reasoning and Learning Lab, McGill University, September 11, 2018.
Computer Vision Group, Prof. Daniel Cremers: 14. Clustering.
Probabilistic Graphical Models: Infinite Feature Models, the Indian Buffet Process. Eric Xing, Lecture 21; slides first drafted by Sinead Williamson.
Non-parametric Bayesian Methods. Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London. Uncertainty in Artificial Intelligence tutorial.
Gentle Introduction to Infinite Gaussian Mixture Modeling, with an application in neuroscience (spike sorting). Frank Wood; based on Rasmussen, NIPS 1999.
STAT 625: Advanced Bayesian Inference (the Dirichlet distribution). Meng Li, Department of Statistics.
Chapter 8: Probabilistic Models for Text Mining. Yizhou Sun, Department of Computer Science, University of Illinois at Urbana-Champaign, and Hongbo Deng.
STAT 499/962 Topics in Statistics: Bayesian Inference and Decision Theory, Handout 01, January 2018. Nasser Sadeghkhani.
PMR: Learning as Inference. Probabilistic Modelling and Reasoning, Amos Storkey, School of Informatics, University of Edinburgh.
STA561: Probabilistic Machine Learning. Introduction: MLE, MAP, Bayesian reasoning (28/8/13). Lecturer: Barbara Engelhardt; scribes: K. Ulrich, J. Subramanian, N. Raval, J. O'Hollaren.
Bayesian Inference for Dirichlet-Multinomials. Mark Johnson, Macquarie University, Sydney, Australia. MLSS Summer School.
Lecture 4: Probabilistic Learning. DD2431, Autumn 2015.
CSE 446: Point Estimation, Winter 2012. Dan Weld; some slides from Carlos Guestrin, Luke Zettlemoyer, and K. Gajos.
Bayesian Models in Machine Learning. Lukáš Burget, Escuela de Ciencias Informáticas 2017, Buenos Aires, July 24-29, 2017.
Lecture 4: Probabilistic Learning. Estimation Theory; Classification with Probability Distributions. DD2431, Autumn 2014.
An introduction to Bayesian inference, with an application to network analysis. http://jakehofman.com, January 13, 2010.
Density Estimation. Seungjin Choi, Department of Computer Science and Engineering, Pohang University of Science and Technology.
The Indian Buffet Process: An Introduction and Review. Thomas L. Griffiths, Department of Psychology. Journal of Machine Learning Research 12 (2011), 1185-1224.
Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures. Shipeng Yu, Kai Yu, Volker Tresp, et al. 17th European Conference on Machine Learning, Berlin, Germany, 2006.
Clustering problems, mixture models and Bayesian nonparametrics. Nguyễn Xuân Long, Department of Statistics and Department of Electrical Engineering and Computer Science, University of Michigan.
Parametric Techniques, Lecture 3. Jason Corso, SUNY at Buffalo, January 22, 2009.
Exchangeability. Peter Orbanz, Columbia University.
Bayesian Nonparametrics. Peter Orbanz, Columbia University.
Types of learning and modeling data: supervised learning (classification, regression) vs. unsupervised learning (clustering). Lecture notes, 9/12/17.
Bayesian Mixtures of Bernoulli Distributions. Laurens van der Maaten, Department of Computer Science and Engineering, University of California, San Diego.
Advanced Machine Learning, Lecture 10: Mixture Models II. Bastian Leibe, RWTH Aachen, 30.11.2015. http://www.vision.rwth-aachen.de/
Bayesian Nonparametric Learning of Complex Dynamical Phenomena. Emily Fox, Department of Statistical Science, Duke University; joint work with Erik Sudderth (Brown University) and Michael Jordan (UC Berkeley).
DS-GA 1003: Machine Learning and Computational Statistics, Homework 7: Bayesian Modeling. Due May 10, 2016.
Applied Nonparametric Bayes. Michael I. Jordan, Department of Electrical Engineering and Computer Science and Department of Statistics, University of California, Berkeley.
Parametric Models. Shuang Liang, School of Software Engineering, Tongji University, Fall 2012.
Lecture 6: Graphical Models: Learning. 4F13: Machine Learning, Zoubin Ghahramani and Carl Edward Rasmussen, Department of Engineering, University of Cambridge, February 3, 2010.
Statistical and Learning Techniques in Computer Vision, Lecture 2: Maximum Likelihood and Bayesian Estimation. Jens Rittscher and Chuck Stewart.
CMPS 242: Project Report. RadhaKrishna Vuppala, University of California, Santa Cruz.
Bayesian RL Seminar. Chris Mansley, September 9, 2008.
STA 4273H: Statistical Machine Learning, Lecture 7. Russ Salakhutdinov, Department of Computer Science and Department of Statistical Sciences, University of Toronto.
Dirichlet Enhanced Latent Semantic Analysis. Kai Yu (Siemens Corporate Technology, Munich) and Shipeng Yu (Institute for Computer Science, University of Munich).
19: Bayesian Nonparametrics: The Indian Buffet Process. 10-708: Probabilistic Graphical Models, Spring 2015. Lecturer: Avinava Dubey; scribes: Rishav Das, Adam Brodie, and Hemank Lamba.
Naïve Bayes classification.
COS 597C: Bayesian Nonparametrics, Lecture 3 (Gibbs Sampling with a DP). Lecturer: David Blei; scribes: Jordan Boyd-Graber and Francisco Pereira, October 1, 2007.
COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Trevor Cohn; adapted from slides by Ben Rubinstein.
Introduction to Machine Learning: Generative Models. Varun Chandola, Computer Science & Engineering, State University of New York at Buffalo.
Model-based cognitive neuroscience approaches to computational psychiatry: clustering and classification. Thomas V. Wiecki, Jeffrey Poland, and Michael Frank, July 10, 2014.
Time Series and Dynamic Models, Section 1: Intro to Bayesian Inference. Carlos M. Carvalho, The University of Texas at Austin.
Bayesian nonparametric latent feature models. François Caron, UBC, October 2, 2007.
Parametric Techniques. Jason J. Corso, SUNY at Buffalo.
Hierarchical Dirichlet Processes. Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, University of California at Berkeley.
CSC321 Lecture 18: Learning Probabilistic Models. Roger Grosse.
A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process. John Paisley, Department of Computer Science, Princeton University. (An illustrative sketch of this construction appears after this list.)
Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia, August 2011; based on tutorial slides by Lester Mackey and Ariel Kleiner.
Unsupervised Learning: Bayesian Model Comparison. Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London.
Computational Cognitive Science, Lecture 8. Frank Keller, School of Informatics, University of Edinburgh, October 14, 2016; based on slides by Sharon Goldwater.
Machine Learning Summer School, Lecture 3: Learning parameters and structure. Zoubin Ghahramani, Department of Engineering, University of Cambridge.
Lecture 2: Priors and Conjugacy. Melih Kandemir, May 6, 2014.
Probabilistic modeling. Slides closely adapted from Subhransu Maji's slides.
Structured Bayesian Nonparametric Models with Variational Inference. Percy Liang and Dan Klein, ACL Tutorial, Prague, Czech Republic, June 24, 2007.
Probability Distributions. J. Elder, CSE 6390/PSYC 6225: Computational Modeling of Visual Perception.
Naïve Bayes classification (probability theory review), 11/15/16.
Inference for a Population Proportion. Al Nosedal, University of Toronto, November 11, 2015.
Tree-Based Inference for Dirichlet Process Mixtures. Yang Xu (Machine Learning Department, Carnegie Mellon University), Katherine A. Heller (Department of Engineering, University of Cambridge), and Zoubin Ghahramani.
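Several of the entries above (Paisley's stick-breaking note, the infinite Gaussian mixture tutorial, and the DP mixture references) rely on the stick-breaking construction of the Dirichlet process. As a rough, self-contained illustration of that construction, and not code taken from any of the listed documents, the following minimal sketch draws an approximate sample G ~ DP(alpha, G0) by truncating the stick-breaking process at T atoms; the function name, the truncation level, and the choice of a standard normal base measure are illustrative assumptions.

import numpy as np

def truncated_dp_sample(alpha, base_sampler, T=100, rng=None):
    # Approximate draw from DP(alpha, G0) via truncated stick-breaking:
    # pi_k = beta_k * prod_{j<k} (1 - beta_j),  beta_k ~ Beta(1, alpha),
    # atoms theta_k drawn i.i.d. from the base measure G0.
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=T)                         # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining                                  # truncated weights (sum < 1)
    atoms = np.array([base_sampler(rng) for _ in range(T)])      # atoms from G0
    return weights, atoms

# Example usage: G0 = N(0, 1); larger alpha spreads mass over more atoms.
weights, atoms = truncated_dp_sample(alpha=2.0,
                                     base_sampler=lambda r: r.normal(0.0, 1.0),
                                     T=50, rng=0)
print(weights[:5], atoms[:5])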