Gibbs Sampling Héctor Corrada Bravo University of Maryland, College Park, USA CMSC 644: 2019 03 27
Latent semantic analysis Documents as mixtures of topics (Hofmann 1999) 1 / 60
Latent semantic analysis We have a set of documents D. Each document is modeled as a bag of words (bow) over dictionary W. x_{w,d}: the number of times word w ∈ W appears in document d ∈ D. 2 / 60
Latent semantic analysis Let's start with a simple model based on the frequency of word occurrences. Each document is modeled as n_d draws from a Multinomial distribution with parameters θ_d = {θ_{1,d}, …, θ_{W,d}}. Note θ_{w,d} ≥ 0 and Σ_w θ_{w,d} = 1. 3 / 60
Latent semantic analysis Probability of observed corpus D: Pr(D | {θ_d}) ∝ ∏_{d=1}^{D} ∏_{w=1}^{W} θ_{w,d}^{x_{w,d}} 4 / 60
Latent semantic analysis Problem 1: Prove the MLE is θ̂_{w,d} = x_{w,d} / n_d 5 / 60
Constrained optimization We have a problem of the type: min_x f_0(x) s.t. f_i(x) ≤ 0, i = 1, …, m and h_i(x) = 0, i = 1, …, p. Note: This discussion follows Boyd and Vandenberghe, Convex Optimization 6 / 60
Constrained optimization To solve these types of problems we will look at the Lagrangian function: L(x, λ, ν) = f_0(x) + Σ_{i=1}^{m} λ_i f_i(x) + Σ_{i=1}^{p} ν_i h_i(x) 7 / 60
Constrained optimization We'll see these in more detail later, but there is a beautiful result giving optimality conditions based on the Lagrangian: Suppose x̃, λ̃ and ν̃ are optimal; then f_i(x̃) ≤ 0, h_i(x̃) = 0, λ̃_i ≥ 0, λ̃_i f_i(x̃) = 0, and ∇L(x̃, λ̃, ν̃) = 0. 8 / 60
Constrained optimization We can use the gradient and feasibility conditions to prove the MLE result. 9 / 60
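As a sketch of how Problem 1 falls out of these conditions (using ν for the multiplier of the single equality constraint Σ_w θ_{w,d} = 1):

```latex
\begin{aligned}
L(\theta_d, \nu) &= \sum_{w=1}^{W} x_{w,d} \log \theta_{w,d}
  + \nu \left(1 - \sum_{w=1}^{W} \theta_{w,d}\right) \\
\frac{\partial L}{\partial \theta_{w,d}} &= \frac{x_{w,d}}{\theta_{w,d}} - \nu = 0
  \quad\Rightarrow\quad \theta_{w,d} = \frac{x_{w,d}}{\nu} \\
\sum_{w=1}^{W} \theta_{w,d} &= 1
  \quad\Rightarrow\quad \nu = \sum_{w=1}^{W} x_{w,d} = n_d
  \quad\Rightarrow\quad \hat{\theta}_{w,d} = \frac{x_{w,d}}{n_d}
\end{aligned}
```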
Probabilistic Latent Semantic Analysis Let's change our document model to introduce topics. The key idea is that the probability of observing a word in a document is given by two pieces: the probability of observing a topic in a document, and the probability of observing a word given a topic: Pr(w, d) = Σ_{t=1}^{T} Pr(w | t) Pr(t | d) 10 / 60
Probabilistic Latent Semantic Analysis So, we rewrite the corpus probability as Pr(D | {p_d}, {θ_t}) ∝ ∏_{d=1}^{D} ∏_{w=1}^{W} (Σ_{t=1}^{T} p_{t,d} θ_{w,t})^{x_{w,d}} Mixture of topics!! 12 / 60
Probabilistic Latent Semantic Analysis A fully observed model Assume you know Δ_{w,d,t}, the latent number of occurrences of word w in document d generated from topic t, such that Σ_t Δ_{w,d,t} = x_{w,d}. In that case we can rewrite the corpus probability: Pr(D | {p_d}, {θ_t}) ∝ ∏_{d=1}^{D} ∏_{w=1}^{W} ∏_{t=1}^{T} (p_{t,d} θ_{w,t})^{Δ_{w,d,t}} 13 / 60
Probabilistic Latent Semantic Analysis Problem 2: Show the MLEs are given by p̂_{t,d} = Σ_{w=1}^{W} Δ_{w,d,t} / Σ_{t=1}^{T} Σ_{w=1}^{W} Δ_{w,d,t} and θ̂_{w,t} = Σ_{d=1}^{D} Δ_{w,d,t} / Σ_{w=1}^{W} Σ_{d=1}^{D} Δ_{w,d,t} 14 / 60
Probabilistic Latent Semantic Analysis Since we don't observe Δ_{w,d,t} we use the EM algorithm. At each iteration (given current parameters {p_d} and {θ_t}) find the responsibility γ_{w,d,t} = E[Δ_{w,d,t} | {p_d}, {θ_t}] and maximize the fully observed likelihood, plugging in γ_{w,d,t} for Δ_{w,d,t} 15 / 60
Probabilistic Latent Semantic Analysis Problem 4: Show γ_{w,d,t} = x_{w,d} · p_{t,d} θ_{w,t} / Σ_{t'=1}^{T} p_{t',d} θ_{w,t'} 16 / 60
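A minimal NumPy sketch of the resulting EM updates (array names and layout are my own choices, not from the slides: `X[d, w]` holds x_{w,d}, `p[d, t]` holds p_{t,d}, `theta[t, w]` holds θ_{w,t}):

```python
import numpy as np

def plsa_em(X, T, n_iter=100, seed=0):
    """EM for pLSA. X: (D, W) word-document count matrix, T topics."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p = rng.dirichlet(np.ones(T), size=D)      # p[d, t] = Pr(t | d)
    theta = rng.dirichlet(np.ones(W), size=T)  # theta[t, w] = Pr(w | t)
    for _ in range(n_iter):
        # E step: gamma[d, w, t] proportional to p[d, t] * theta[t, w]
        joint = p[:, None, :] * theta.T[None, :, :]          # (D, W, T)
        gamma = joint / joint.sum(axis=2, keepdims=True)
        counts = X[:, :, None] * gamma                       # expected Delta
        # M step: normalize expected counts (Problem 2 formulas)
        p = counts.sum(axis=1)
        p /= p.sum(axis=1, keepdims=True)
        theta = counts.sum(axis=0).T
        theta /= theta.sum(axis=1, keepdims=True)
    return p, theta
```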
The EM Algorithm in General So, why does that work? Why does plugging in γ_{w,d,t} for the latent variables Δ_{w,d,t} work? Why does that maximize the log likelihood ℓ({p_d}, {θ_t}; D)? 17 / 60
The EM Algorithm in General Think of it as follows: Z: observed data; Z^m: missing latent data; T = (Z, Z^m): complete data (observed and missing). ℓ(θ; Z): log likelihood w.r.t. observed data; ℓ_0(θ; T): log likelihood w.r.t. complete data 19 / 60
The EM Algorithm in General Next, notice that Pr(Z | θ) = Pr(T | θ) / Pr(Z^m | Z, θ). In terms of log likelihoods: ℓ(θ; Z) = ℓ_0(θ; T) − ℓ_1(θ; Z^m | Z) 21 / 60
The EM Algorithm in General Iterative approach: given current parameters θ, take the expectation of the log likelihoods: ℓ(θ'; Z) = E[ℓ_0(θ'; T) | Z, θ] − E[ℓ_1(θ'; Z^m | Z) | Z, θ] ≡ Q(θ', θ) − R(θ', θ) In pLSA, Q(θ', θ) is the log likelihood of the complete data with Δ_{w,d,t} replaced by γ_{w,d,t} 23 / 60
The EM Algorithm in General The general EM algorithm: 1. Initialize parameters θ^(0) 2. Construct the function Q(θ, θ^(j)) 3. Find the next set of parameters θ^(j+1) = arg max_θ Q(θ, θ^(j)) 4. Iterate steps 2 and 3 until convergence 24 / 60
The EM Algorithm in General So, why does that work? ℓ(θ^(j+1); Z) − ℓ(θ^(j); Z) = [Q(θ^(j+1), θ^(j)) − Q(θ^(j), θ^(j))] − [R(θ^(j+1), θ^(j)) − R(θ^(j), θ^(j))] ≥ 0 I.e., every step makes the log likelihood larger 26 / 60
The EM Algorithm in General Why else does it work? Q(θ', θ) minorizes ℓ(θ'; Z) 27 / 60
The EM Algorithm in General General algorithmic concept: Iterative approach: Initialize parameters Construct bound based on current parameters Optimize bound We will see this again when we look at variational methods 29 / 60
Approximate Inference by Sampling Ultimately, what we are interested in is learning topics. Perhaps instead of finding parameters θ that maximize likelihood, sample from a distribution Pr(θ | D) that gives us topic estimates. But, we have only talked about Pr(D | θ); how can we sample parameters? 31 / 60
Approximate Inference by Sampling Like EM, the trick here is to expand the model with latent data Z^m and sample from the distribution Pr(θ, Z^m | Z). This is challenging, but sampling from Pr(θ | Z^m, Z) and Pr(Z^m | θ, Z) is easier 33 / 60
Approximate Inference by Sampling The Gibbs Sampler does exactly that. Property: after some rounds, samples from the conditional distribution Pr(θ | Z^m, Z) correspond to samples from the marginal Pr(θ | Z) = Σ_{Z^m} Pr(θ, Z^m | Z) 34 / 60
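As a toy illustration of the sampler itself (an aside, not from the slides): a Gibbs sampler for a standard bivariate normal with correlation ρ, where each full conditional is a univariate normal:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, burn_in=500, seed=0):
    """Alternately sample x | y ~ N(rho*y, 1-rho^2) and y | x ~ N(rho*x, 1-rho^2)."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho**2)
    x = y = 0.0
    samples = np.empty((n_samples, 2))
    for i in range(burn_in + n_samples):
        x = rng.normal(rho * y, sd)
        y = rng.normal(rho * x, sd)
        if i >= burn_in:           # keep only post-burn-in draws
            samples[i - burn_in] = (x, y)
    return samples
```

After burn-in, the kept pairs behave like draws from the joint N(0, I with correlation ρ), even though we only ever sampled from the two conditionals.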
Approximate Inference by Sampling Quick aside, how to simulate data for pLSA? Generate parameters {p_d} and {θ_t}, then generate Δ_{w,d,t} 35 / 60
Approximate Inference by Sampling Let's go backwards, let's deal with Δ_{w,d,t}: Δ_{w,d,·} ~ Mult_{x_{w,d}}(γ_{w,d,1}, …, γ_{w,d,T}), where γ_{w,d,t} is as given by the E step
for d in range(num_docs):
    for w in range(num_words):
        delta[d, w, :] = np.random.multinomial(doc_mat[d, w], gamma[d, w, :])
38 / 60
Approximate Inference by Sampling Hmm, that's a problem since we need x_{w,d}... But, we know Pr(w, d) = Σ_t p_{t,d} θ_{w,t}, so let's use that to generate each document's counts as x_{·,d} ~ Mult_{n_d}(Pr(1, d), …, Pr(W, d))
for d in range(num_docs):
    # theta: (num_words, num_topics), p: (num_topics, num_docs)
    # n_d[d]: total number of words in document d
    doc_mat[d, :] = np.random.multinomial(n_d[d], theta @ p[:, d])
39 / 60
Approximate Inference by Sampling Now, how about p_d? How do we generate the parameters of a Multinomial distribution? This is where the Dirichlet distribution comes in... If p_d ~ Dir(α), then Pr(p_d) ∝ ∏_{t=1}^{T} p_{t,d}^{α_t − 1} 42 / 60
Approximate Inference by Sampling Some interesting properties: E[p_{t,d}] = α_t / Σ_{t'} α_{t'}. So, if we set all α_t = 1 we will tend to have uniform probability over topics (1/T each on average). If we increase to α_t = 100 it will also have uniform probability on average, but very little variance (it will almost always be 1/T) 43 / 60
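A quick NumPy check of this behavior (an illustrative aside, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5
low = rng.dirichlet(np.ones(T), size=2000)          # alpha_t = 1
high = rng.dirichlet(100 * np.ones(T), size=2000)   # alpha_t = 100
# Both concentrate around 1/T = 0.2 per topic on average...
print(low.mean(axis=0), high.mean(axis=0))
# ...but the alpha_t = 100 draws have far less spread around 1/T
print(low.std(), high.std())
```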
Approximate Inference by Sampling So, we can say p_d ~ Dir(α) and θ_t ~ Dir(β), and generate data as (with α_t = 1)
for d in range(num_docs):
    p[:, d] = np.random.dirichlet(np.ones(num_topics))
45 / 60
Approximate Inference by Sampling So what we have is a prior over the parameters {p_d} and {θ_t}: Pr(p_d | α) and Pr(θ_t | β). And we can formulate a joint distribution with the missing data Δ_{w,d,t}: Pr(Δ_{w,d,t}, p_d, θ_t | α, β) = Pr(Δ_{w,d,t} | p_d, θ_t) Pr(p_d | α) Pr(θ_t | β) 46 / 60
Approximate Inference by Sampling However, what we care about is the posterior distribution Pr(p_d | Δ_{w,d,t}, θ_t, α, β). What do we do??? 47 / 60
Approximate Inference by Sampling Another neat property of the Dirichlet distribution is that it is conjugate to the Multinomial: if θ | α ~ Dir(α) and X | θ ~ Multinomial(θ), then θ | X, α ~ Dir(X + α) 48 / 60
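A small numerical check of conjugacy (with hypothetical counts, not from the slides): the posterior Dir(X + α) has mean (X + α) / Σ(X + α), which sampled draws should match:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.ones(3)                  # uniform Dir(1, 1, 1) prior
X = np.array([50, 30, 20])          # observed multinomial counts
posterior = X + alpha               # conjugate update: theta | X ~ Dir(X + alpha)
draws = rng.dirichlet(posterior, size=5000)
analytic_mean = posterior / posterior.sum()
print(analytic_mean)                # posterior mean in closed form
print(draws.mean(axis=0))           # should agree with the sampled mean
```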
Approximate Inference by Sampling That means we can sample the parameters from p_d ~ Dir(Σ_w Δ_{w,d,·} + α) and θ_t ~ Dir(Σ_d Δ_{·,d,t} + β) 49 / 60
Approximate Inference by Sampling Coincidentally, we have just specified the Latent Dirichlet Allocation method for topic modeling. This is the most commonly used method for topic modeling Blei, Ng, Jordan (2003), JMLR 50 / 60
Approximate Inference by Sampling We can now specify a full Gibbs Sampler for an LDA mixture model. Given: word-document counts x_{w,d}, number of topics K, and prior parameters α and β. Do: learn parameters {p_d} and {θ_t} for K topics 51 / 60
Approximate Inference by Sampling Step 0: Initialize parameters {p_d} and {θ_t}: p_d ~ Dir(α) and θ_t ~ Dir(β) 52 / 60
Approximate Inference by Sampling Step 1: Sample Δ_{w,d,·} based on current parameters {p_d} and {θ_t}: Δ_{w,d,·} ~ Mult_{x_{w,d}}(γ_{w,d,1}, …, γ_{w,d,T}) 53 / 60
Approximate Inference by Sampling Step 2: Sample parameters from p_d ~ Dir(Σ_w Δ_{w,d,·} + α) and θ_t ~ Dir(Σ_d Δ_{·,d,t} + β) 54 / 60
Approximate Inference by Sampling Step 3: Iterate steps 1 and 2, collecting samples, for a number of iterations (e.g., 200); we want to reach a stationary distribution... 55 / 60
Approximate Inference by Sampling Step 4: Estimate Δ̂_{w,d,t} as the average of the sampled Δ_{w,d,t} from the last m iterations (e.g., m = 500) 56 / 60
Approximate Inference by Sampling Step 5: Estimate parameters p_d and θ_t based on the estimated Δ̂_{w,d,t}: p̂_{t,d} = (Σ_w Δ̂_{w,d,t} + α) / Σ_{t'} (Σ_w Δ̂_{w,d,t'} + α) and θ̂_{w,t} = (Σ_d Δ̂_{w,d,t} + β) / Σ_{w'} (Σ_d Δ̂_{w',d,t} + β) 57 / 60
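Steps 0-5 can be put together as a sketch (my own array layout and variable names, written for clarity rather than efficiency: `X[d, w]` holds x_{w,d}, `p[d, t]` holds p_{t,d}, `theta[t, w]` holds θ_{w,t}):

```python
import numpy as np

def lda_gibbs(X, K, alpha=1.0, beta=1.0, n_burn=200, n_keep=100, seed=0):
    """Gibbs sampler for the LDA mixture model. X: (D, W) word-document counts."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    # Step 0: initialize parameters from the priors
    p = rng.dirichlet(alpha * np.ones(K), size=D)     # p[d, t] = Pr(t | d)
    theta = rng.dirichlet(beta * np.ones(W), size=K)  # theta[t, w] = Pr(w | t)
    delta_sum = np.zeros((D, W, K))
    for it in range(n_burn + n_keep):
        # Step 1: sample Delta | p, theta via the responsibilities gamma
        joint = p[:, None, :] * theta.T[None, :, :]           # (D, W, K)
        gamma = joint / joint.sum(axis=2, keepdims=True)
        delta = np.zeros((D, W, K))
        for d in range(D):
            for w in range(W):
                if X[d, w] > 0:
                    delta[d, w] = rng.multinomial(X[d, w], gamma[d, w])
        # Step 2: sample p, theta | Delta from the conjugate Dirichlet posteriors
        for d in range(D):
            p[d] = rng.dirichlet(delta[d].sum(axis=0) + alpha)
        for t in range(K):
            theta[t] = rng.dirichlet(delta[:, :, t].sum(axis=0) + beta)
        # Steps 3-4: accumulate Delta after burn-in
        if it >= n_burn:
            delta_sum += delta
    # Step 5: estimate parameters from the averaged Delta
    delta_hat = delta_sum / n_keep
    p_hat = delta_hat.sum(axis=1) + alpha
    p_hat /= p_hat.sum(axis=1, keepdims=True)
    theta_hat = delta_hat.sum(axis=0).T + beta
    theta_hat /= theta_hat.sum(axis=1, keepdims=True)
    return p_hat, theta_hat
```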
Mixture models We have now seen two different mixture models: soft k-means and topic models. Two inference procedures: Exact Inference with Maximum Likelihood using the EM algorithm, and Approximate Inference using Gibbs Sampling. Next, we will go back to Maximum Likelihood but learn about Approximate Inference using Variational Methods 60 / 60