Latent Dirichlet Allocation
Directed Graphical Models. William W. Cohen, Machine Learning 10-601.
DGMs: The Burglar Alarm example. Node ~ random variable; arcs define the form of the probability distribution:
Pr(X_i | X_1, ..., X_{i-1}, X_{i+1}, ..., X_n) = Pr(X_i | parents(X_i))
Nodes: Burglar, Earthquake, Alarm, Phone Call.
Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh!
DGMs: The Burglar Alarm example. Node ~ random variable; arcs define the form of the probability distribution:
Pr(X_i | X_1, ..., X_{i-1}, X_{i+1}, ..., X_n) = Pr(X_i | parents(X_i))
Generative story (and joint distribution):
- Pick b ~ Pr(Burglar), a binomial
- Pick e ~ Pr(Earthquake), a binomial
- Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
- Pick c ~ Pr(PhoneCall | Alarm=a)
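To make this generative story concrete, here is a minimal sketch of ancestral sampling from the burglar-alarm network in Python; the CPT values (P_BURGLAR, P_EARTHQUAKE, etc.) are hypothetical numbers for illustration, not values from the slides.

```python
import random

# Hypothetical CPT values, illustrative only.
P_BURGLAR = 0.001
P_EARTHQUAKE = 0.002
P_ALARM = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}   # Pr(Alarm=true | B, E)
P_CALL = {True: 0.90, False: 0.05}                        # Pr(PhoneCall=true | Alarm)

def sample_joint(rng=random):
    """Ancestral sampling: draw each node given its parents, in topological order."""
    b = rng.random() < P_BURGLAR
    e = rng.random() < P_EARTHQUAKE
    a = rng.random() < P_ALARM[(b, e)]
    c = rng.random() < P_CALL[a]
    return {"Burglar": b, "Earthquake": e, "Alarm": a, "PhoneCall": c}

print(sample_joint())
```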
DGMs: The Burglar Alarm example. You can also compute other quantities: e.g., Pr(Burglar=true | PhoneCall=true), or Pr(Burglar=true | PhoneCall=true, Earthquake=true).
Generative story:
- Pick b ~ Pr(Burglar), a binomial
- Pick e ~ Pr(Earthquake), a binomial
- Pick a ~ Pr(Alarm | Burglar=b, Earthquake=e), four binomials
- Pick c ~ Pr(PhoneCall | Alarm=a)
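For a network this small, quantities like Pr(Burglar=true | PhoneCall=true) can be computed by brute-force enumeration over the joint; a sketch follows, reusing the same hypothetical CPTs from the previous snippet. (General-purpose inference is the subject of the next slides.)

```python
from itertools import product

P_BURGLAR, P_EARTHQUAKE = 0.001, 0.002
P_ALARM = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}
P_CALL = {True: 0.90, False: 0.05}

def joint(b, e, a, c):
    """Pr(B, E, A, C) = Pr(B) Pr(E) Pr(A | B, E) Pr(C | A)."""
    p = (P_BURGLAR if b else 1 - P_BURGLAR) * (P_EARTHQUAKE if e else 1 - P_EARTHQUAKE)
    p *= P_ALARM[(b, e)] if a else 1 - P_ALARM[(b, e)]
    p *= P_CALL[a] if c else 1 - P_CALL[a]
    return p

# Pr(Burglar=true | PhoneCall=true): sum out the unobserved variables, then normalize.
num = sum(joint(True, e, a, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True) for b, e, a in product([True, False], repeat=3))
print(num / den)
```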
Inference in Bayes nets. General problem: given evidence E_1, ..., E_k, compute P(X | E_1, ..., E_k) for any X.
Big assumption: the graph is a polytree (at most one undirected path between any two nodes X, Y).
Notation (for a node X with parents U_1, U_2 and children Y_1, Y_2, whose other parents are Z_1, Z_2):
E_X^+ = "causal support" for X (evidence connected to X through its parents)
E_X^- = "evidential support" for X (evidence connected to X through its children)
Inference in Bayes nets: P(X | E).
P(X | E) = P(X | E_X^+, E_X^-)
         = P(E_X^- | X, E_X^+) P(X | E_X^+) / P(E_X^- | E_X^+)
         ∝ P(E_X^- | X) P(X | E_X^+)
Let's start with P(X | E_X^+)...
(E_X^+: causal support; E_X^-: evidential support)
Inference in Bayes nets: P(X | E_X^+).
P(X | E_X^+) = Σ_{u_1,u_2} P(X | u_1, u_2, E_X^+) P(u_1, u_2 | E_X^+)
             = Σ_{u_1,u_2} P(X | u_1, u_2) Π_j P(u_j | E^+_{U_j \ X})
Here P(X | u_1, u_2) is a CPT table lookup, and each P(u_j | E^+_{U_j \ X}) is a recursive call to P(. | E^+); E_{U_j \ X} is the evidence for U_j that doesn't go through X.
So far: a simple way of propagating requests for belief due to causal evidence up the tree, i.e., info on Pr(X | E^+) flows down.
Inference in Bayes nets: P(E_X^- | X), simplified.
P(E_X^- | X) = Π_j P(E_{Y_j \ X} | X)                          [d-sep + polytree]
             = Π_j Σ_{y_jk} P(y_jk | X) P(E_{Y_j \ X} | X, y_jk)
             = Π_j Σ_{y_jk} P(y_jk | X) P(E^-_{Y_j} | y_jk)     [recursive call to P(E^- | .)]
Notation: E^-_{Y_j} is the evidential support for Y_j; E_{Y_j \ X} is the evidence for Y_j excluding support through X.
So far: a simple way of propagating requests for belief due to evidential support down the tree, i.e., info on Pr(E^- | X) flows up.
Recap: Message Passing for BP.
We reduced P(X | E) to the product of two recursively calculated parts:
P(X=x | E_X^+) = Σ_{u_1,u_2} P(X | u_1, u_2) Π_j P(u_j | E^+_{U_j \ X})
i.e., the CPT for X and the product of forward messages from the parents.
P(E_X^- | X=x) = β Π_j Σ_{y_jk} P(E^-_{Y_j} | y_jk) Σ_{z_jk} P(y_jk | X, z_jk) P(z_jk | E_{Z_jk \ Y_j})
i.e., a combination of backward messages from the children, CPTs, and P(Z | E_{Z \ Y_k}), a simpler instance of P(X | E).
This can also be implemented by message-passing (belief propagation). Messages are distributions, i.e., vectors.
Recap: Message Passing for BP. Top-level algorithm:
- Pick one vertex as the root. Any node with only one edge is a leaf.
- Pass messages from the leaves to the root.
- Pass messages from the root to the leaves.
Now every X has received P(X | E_X^+) and P(E_X^- | X) and can compute P(X | E).
NAÏVE BAYES AS A DGM
Naïve Bayes as a DGM. For each document d in the corpus (of size D):
- Pick a label y_d from Pr(Y)
- For each word in d (of length N_d): pick a word x_id from Pr(X | Y=y_d)
Example CPTs: Pr(Y=y): onion 0.3, economist 0.7. Pr(X=x | Y=y): (onion, aardvark) 0.034; (onion, ai) 0.0067; ...; (economist, aardvark) 0.0000003; ...; (economist, zymurgy) 0.01.
Plate diagram: y_d for each document; x_id inside a plate of size N_d, nested in a plate of size D.
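A small sketch of this generative story as code; the class prior and per-class word probabilities below are toy stand-ins for the onion/economist CPTs pictured on the slide.

```python
import random

PR_Y = {"onion": 0.3, "economist": 0.7}                       # Pr(Y=y)
PR_X_GIVEN_Y = {                                              # Pr(X=x | Y=y), toy values
    "onion":     {"aardvark": 0.4, "ai": 0.5, "zymurgy": 0.1},
    "economist": {"aardvark": 0.1, "ai": 0.3, "zymurgy": 0.6},
}

def generate_document(n_words, rng=random):
    """Generative story: pick a label y ~ Pr(Y), then each word x ~ Pr(X | Y=y)."""
    y = rng.choices(list(PR_Y), weights=list(PR_Y.values()))[0]
    vocab = list(PR_X_GIVEN_Y[y])
    words = rng.choices(vocab, weights=[PR_X_GIVEN_Y[y][w] for w in vocab], k=n_words)
    return y, words

print(generate_document(5))
```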
Naïve Bayes as a DGM. For each document d in the corpus (of size D):
- Pick a label y_d from Pr(Y)
- For each word in d (of length N_d): pick a word x_id from Pr(X | Y=y_d)
Not described: how do we smooth for classes? For multinomials? How many classes are there? (Plate diagram as before.)
Recap: smoothing for a binomial. MLE: maximize Pr(D | θ). MAP: maximize Pr(D | θ) Pr(θ).
Estimate θ = P(heads) for a binomial with MLE as:
  #heads / (#heads + #tails)
and with MAP as:
  (#heads + #imaginary heads) / (#heads + #tails + #imaginary heads + #imaginary tails)
Smoothing = a prior over the parameter θ.
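A tiny worked example of the two estimates, with the imaginary counts chosen purely for illustration:

```python
def binomial_mle(heads, tails):
    # MLE: maximize Pr(D | theta)
    return heads / (heads + tails)

def binomial_map(heads, tails, imaginary_heads, imaginary_tails):
    # MAP with the prior expressed as imaginary counts (smoothing)
    return (heads + imaginary_heads) / (heads + tails + imaginary_heads + imaginary_tails)

# e.g. 3 heads, 0 tails observed: MLE says theta = 1.0, MAP pulls toward the prior.
print(binomial_mle(3, 0))         # 1.0
print(binomial_map(3, 0, 1, 1))   # 0.8
```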
Smoothing for a binomial as a DGM. MAP for dataset D with α_1 heads and α_2 tails: prior γ (parameterized by #imaginary heads and #imaginary tails) → θ → D.
MAP is a Viterbi-style inference: we want to find the maximum-probability parameter θ according to the posterior distribution.
Also: inference in even a simple graph can be intractable if the conditional distributions are complicated.
Smoothing for a binomial as a DGM. Same picture, replacing D=d with F=f_1, f_2, ...: prior γ (parameterized by α_1, ..., α_K) → θ → F, where the Dirichlet prior has imaginary data for each of the K classes.
Comment: the conjugate prior for a binomial is a Beta; the conjugate prior for a multinomial is called a Dirichlet.
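The multinomial analogue, following the slides' imaginary-count convention: smoothing with a Dirichlet prior amounts to adding an imaginary count for each of the K classes. The observed counts and prior counts below are illustrative.

```python
def multinomial_smoothed(counts, alphas):
    """Smoothed estimate of a multinomial from observed counts plus
    imaginary counts alphas (one per class), mirroring the binomial case."""
    total = sum(counts) + sum(alphas)
    return [(c + a) / total for c, a in zip(counts, alphas)]

# 3 classes observed 5, 0, 1 times; the Dirichlet prior adds one imaginary count per class.
print(multinomial_smoothed([5, 0, 1], [1, 1, 1]))   # [0.666..., 0.111..., 0.222...]
```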
Recap: Naïve Bayes as a DGM. Now: let's turn Bayes up to 11 for naïve Bayes. (Plate diagram as before: y_d, then x_id in a plate of size N_d, nested in a plate of size D.)
A more Bayesian Naïve Bayes.
From a Dirichlet α: draw a multinomial π over the K classes.
From a Dirichlet β: for each class y = 1...K, draw a multinomial γ[y] over the vocabulary.
For each document d = 1...D:
- Pick a label y_d from π
- For each word in d (of length N_d): pick a word x_id from γ[y_d]
Plate diagram: α → π → Y (one per document); β → γ (plate of size K); W in a plate of size N_d inside the document plate of size D.
Unsupervised Naïve Bayes. Same generative story:
From a Dirichlet α: draw a multinomial π over the K classes.
From a Dirichlet β: for each class y = 1...K, draw a multinomial γ[y] over the vocabulary.
For each document d = 1...D:
- Pick a label y_d from π
- For each word in d (of length N_d): pick a word x_id from γ[y_d]
Unsupervised Naïve Bayes. The generative story is the same; what we do is different. We want to both find values of γ and π and find values of Y for each example. EM is a natural algorithm, as sketched below.
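A minimal sketch of EM for unsupervised naïve Bayes (a mixture of multinomials) on toy data: the E-step computes soft class assignments and the M-step re-estimates π and the per-class word distributions γ. The toy corpus, smoothing constant, and iteration count are assumptions for illustration, not the slides' exact recipe.

```python
import math
import random
from collections import Counter

def em_unsup_nb(docs, K, iters=50, smooth=1.0, seed=0):
    """EM for unsupervised naive Bayes (a mixture of multinomials)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Start from random soft assignments Pr(Y=k | doc).
    resp = []
    for _ in docs:
        row = [rng.random() for _ in range(K)]
        s = sum(row)
        resp.append([r / s for r in row])
    for _ in range(iters):
        # M-step: re-estimate class prior pi and word distributions gamma[k], with smoothing.
        pi = [sum(r[k] for r in resp) + smooth for k in range(K)]
        total_pi = sum(pi)
        pi = [p / total_pi for p in pi]
        gamma = []
        for k in range(K):
            counts = Counter()
            for d, r in zip(docs, resp):
                for w in d:
                    counts[w] += r[k]
            total = sum(counts[w] + smooth for w in vocab)
            gamma.append({w: (counts[w] + smooth) / total for w in vocab})
        # E-step: recompute responsibilities from the current pi and gamma (in log space).
        resp = []
        for d in docs:
            logs = [math.log(pi[k]) + sum(math.log(gamma[k][w]) for w in d) for k in range(K)]
            m = max(logs)
            exps = [math.exp(l - m) for l in logs]
            s = sum(exps)
            resp.append([e / s for e in exps])
    return pi, gamma, resp

docs = [["ai", "ai", "robot"], ["stocks", "market"], ["robot", "ai"], ["market", "stocks", "stocks"]]
pi, gamma, resp = em_unsup_nb(docs, K=2)
print([max(range(2), key=lambda k: r[k]) for r in resp])  # hard cluster label per document
```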
LDA AS A DGM
The LDA Topic Model
Unsupervised NB vs LDA. Unsupervised NB: one class prior π, and one Y per document. LDA: a different class distribution for each document, and one Y per word. (Both share the plate structure: α → π, β → γ over K classes, words W in plates of size N_d and D.)
Unsupervised NB vs LDA. Unsupervised NB: one class prior π, and one Y per document. LDA: a different class distribution θ_d for each document, and one Z_di per word W_di. (Plates: α → θ_d; β → γ_k, k = 1...K; W_di in plates of size N_d and D.)
LDA. Blei's motivation: start with the bag-of-words (BOW) assumption.
Assumptions: 1) documents are i.i.d.; 2) within a document, words are i.i.d. (bag of words).
For each document d = 1, ..., M:
- Generate θ_d ~ D_1(...)
- For each word n = 1, ..., N_d: generate w_n ~ D_2(. | θ_d)
Now pick your favorite distributions for D_1, D_2. (One standard choice is sketched below.)
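One standard choice (the LDA choice) makes D_1 a Dirichlet over topic proportions and makes D_2 first draw a per-word topic z from θ_d and then a word from that topic's multinomial. A sketch, with an illustrative two-topic, four-word setup:

```python
import random

def generate_lda_corpus(n_docs, doc_len, topics, alpha, rng=random.Random(0)):
    """Generative story: theta_d ~ Dirichlet(alpha); for each word, z ~ theta_d, w ~ topics[z]."""
    vocab = list(next(iter(topics.values())))
    corpus = []
    for _ in range(n_docs):
        # Dirichlet draw via normalized Gamma variates.
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        theta = [x / sum(g) for x in g]
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(len(alpha)), weights=theta)[0]
            w = rng.choices(vocab, weights=[topics[z][v] for v in vocab])[0]
            doc.append((z, w))
        corpus.append(doc)
    return corpus

# Two illustrative topics over a four-word vocabulary.
topics = {0: {"ai": .4, "robot": .4, "stocks": .1, "market": .1},
          1: {"ai": .1, "robot": .1, "stocks": .4, "market": .4}}
print(generate_lda_corpus(n_docs=2, doc_len=5, topics=topics, alpha=[0.5, 0.5]))
```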
Unsupervised NB vs LDA. Unsupervised NB clusters documents into latent classes. LDA clusters word occurrences into latent classes (topics).
The smoothness of β_k → the same word suggests the same topic.
The smoothness of θ_d → the same document suggests the same topic.
(Plates: α → θ_d → Z_di → W_di ← γ_k, with W_di in plates of size N_d and D, and γ_k in a plate of size K.)
LDA's view of a document: a mixed membership model.
LDA topics: top words w by Pr(w | Z=k), shown for example topics Z=13, Z=22, Z=27, Z=19.
SVM using 50 features Pr(Z=k | θ_d): 50 topics vs. all words as features for an SVM.
Gibbs Sampling for LDA
LDA = Latent Dirichlet Allocation. Parameter learning: Variational EM, or Collapsed Gibbs Sampling. Wait, why is sampling called learning here? Here's the idea.
Gibbs sampling works for any directed model! It is applicable when the joint distribution is hard to evaluate but the conditional distributions are known. The sequence of samples comprises a Markov chain, and the stationary distribution of the chain is the joint distribution. Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
A small example model: α → θ → Z_i, with Z_i → X_i and X_i → Y_i, for i = 1, 2. I'll assume we know the parameters for Pr(X | Z) and Pr(Y | X).
Initialize all the hidden variables randomly, then...
- Pick Z_1 ~ Pr(Z | x_1, y_1, θ)
- Pick θ ~ Pr(θ | z_1, z_2, α)   [pick from the posterior]
- Pick Z_2 ~ Pr(Z | x_2, y_2, θ)
- Pick Z_1 ~ Pr(Z | x_1, y_1, θ)
- Pick θ ~ Pr(θ | z_1, z_2, α)   [pick from the posterior]
- Pick Z_2 ~ Pr(Z | x_2, y_2, θ)
...
Eventually these will converge to samples from the true joint distribution, so we will have (a sample of) the true θ in a broad range of cases. A sketch of this schedule on a toy model follows.
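A sketch of this uncollapsed Gibbs schedule on a toy instance of the model (K classes, discrete observations x_i, known Pr(x | z)); the Pr(y | x) factor is dropped because it does not depend on z, and all constants are illustrative.

```python
import random

rng = random.Random(0)
K, ALPHA = 2, [1.0, 1.0]
PR_X_GIVEN_Z = [{"a": 0.8, "b": 0.2},   # Pr(x | Z=0)
                {"a": 0.2, "b": 0.8}]   # Pr(x | Z=1)
xs = ["a", "a", "b", "a", "b", "b", "b"]               # observed data

def sample_dirichlet(params):
    g = [rng.gammavariate(p, 1.0) for p in params]
    s = sum(g)
    return [x / s for x in g]

z = [rng.randrange(K) for _ in xs]                     # initialize hidden variables randomly
for sweep in range(1000):
    # Pick theta ~ Pr(theta | z, alpha): Dirichlet posterior with class counts added.
    counts = [sum(1 for zi in z if zi == k) for k in range(K)]
    theta = sample_dirichlet([a + c for a, c in zip(ALPHA, counts)])
    # Pick each Z_i ~ Pr(Z | x_i, theta), proportional to theta[k] * Pr(x_i | Z=k).
    for i, x in enumerate(xs):
        w = [theta[k] * PR_X_GIVEN_Z[k][x] for k in range(K)]
        z[i] = rng.choices(range(K), weights=w)[0]
print(theta, z)   # one (correlated) sample from the posterior
```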
Why does Gibbs sampling work? Basic claim: when you sample x ~ P(x | y_1, ..., y_k), then if y_1, ..., y_k were sampled from the true joint, x will also be sampled from the true joint. So the true joint is a fixed point: you tend to stay there if you ever get there. How long does it take to get there? That depends on the structure of the space of samples: how well connected are the samples by the sampling steps?
Mixing time depends on the eigenvalues of the chain's transition matrix over the state space!
LDA = Latent Dirichlet Allocation. Parameter learning: Variational EM (not covered in 601-B), or Collapsed Gibbs Sampling. What is collapsed Gibbs sampling?
Initialize all the Z's randomly, then... (θ has been integrated out, i.e., "collapsed")
- Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α)
- Pick Z_2 ~ Pr(Z_2 | z_1, z_3, α)
- Pick Z_3 ~ Pr(Z_3 | z_1, z_2, α)
- Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α)
- Pick Z_2 ~ Pr(Z_2 | z_1, z_3, α)
- Pick Z_3 ~ Pr(Z_3 | z_1, z_2, α)
...
This converges to samples from the true joint, and then we can estimate Pr(θ | α, sample of Z's).
Initialize all the Z's randomly, then...
Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α)
What's this distribution?
Initialize all the Z's randomly, then... Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α).
Simpler case: what's this distribution? It's called a Dirichlet-multinomial, and it looks like this:
Pr(Z_1=k_1, Z_2=k_2, Z_3=k_3 | α) = ∫_θ Pr(Z_1=k_1, Z_2=k_2, Z_3=k_3 | θ) Pr(θ | α) dθ
If there are K values for the Z's and n_k = # Z's with value k, then it turns out:
Pr(Z | α) = ∫_θ Pr(Z | θ) Pr(θ | α) dθ = [Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k n_k)] · Π_k [Γ(n_k + α_k) / Γ(α_k)]
(Recall Γ(n) = (n-1)! when n is a positive integer.)
Initialize all the Z's randomly, then... Pick Z_1 ~ Pr(Z_1 | z_2, z_3, α).
It turns out that sampling from a Dirichlet-multinomial is very easy!
Notation: K values for the Z's; n_k = # Z's with value k; Z = (Z_1, ..., Z_m); Z^(-i) = (Z_1, ..., Z_{i-1}, Z_{i+1}, ..., Z_m); n_k^(-i) = # Z's with value k, excluding Z_i.
Pr(Z | α) = ∫_θ Pr(Z | θ) Pr(θ | α) dθ = [Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k n_k)] · Π_k [Γ(n_k + α_k) / Γ(α_k)]
Pr(Z_i = k | Z^(-i), α) ∝ n_k^(-i) + α_k
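A minimal sketch of that collapsed update: to resample Z_i, subtract it from the counts, then draw a value with probability proportional to n_k^(-i) + α_k. With no downstream evidence yet (matching this slide), the chain simply samples Z's from the prior Dirichlet-multinomial; the hyperparameters are illustrative.

```python
import random

rng = random.Random(0)
ALPHA = [0.5, 0.5, 0.5]                    # Dirichlet hyperparameters, illustrative
K = len(ALPHA)
z = [rng.randrange(K) for _ in range(10)]  # initialize the Z's randomly
n = [z.count(k) for k in range(K)]         # n_k = # Z's with value k

for sweep in range(100):
    for i in range(len(z)):
        n[z[i]] -= 1                       # exclude Z_i: counts become n_k^(-i)
        weights = [n[k] + ALPHA[k] for k in range(K)]
        z[i] = rng.choices(range(K), weights=weights)[0]
        n[z[i]] += 1                       # put the new assignment back
print(n)
```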
What about with downstream evidence? The update
Pr(Z_i = k | Z^(-i), α) ∝ n_k^(-i) + α_k
captures the constraints on Z_i via θ (from above, the causal direction). What about via β, i.e., the observed words X_i below (with prior η → β[k])?
Pr(Z | E) = (1/c) · Pr(Z | E^+) · Pr(E^- | Z)
Sampling for LDA.
Notation: K values for the Z_d,i's; Z^(-d,i) = all the Z's but Z_d,i; n_{w,k} = # Z_d,i's with value k paired with W_d,i = w; n_{*,k} = # Z_d,i's with value k; n_{w,k}^(-d,i) = n_{w,k} excluding Z_d,i; n_{*,k}^(-d,i) = n_{*,k} excluding Z_d,i; n_{*,k}^{d,(-i)} = n_{*,k} from doc d, excluding Z_d,i.
Pr(Z_d,i = k | Z^(-d,i), W_d,i = w, α, η) ∝ (n_{*,k}^{d,(-i)} + α_k) · (n_{w,k}^(-d,i) + η_w) / Σ_{w'} (n_{w',k}^(-d,i) + η_{w'})
The first factor is Pr(Z | E^+), the fraction of the time Z=k in doc d; the second factor is Pr(E^- | Z), the fraction of the time W=w in topic k.
IMPLEMENTING LDA
Way way more detail
More detail
[Figure: sampling z. The unnormalized probabilities for z=1, z=2, z=3 are stacked into a bar of unit height; a uniformly random point is drawn, and the segment it lands in is the sampled value.]
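The picture corresponds to the usual way of drawing a discrete z from unnormalized weights: rescale them to unit total height, pick a uniform random number, and return the segment it falls in. A minimal sketch:

```python
import random

def sample_discrete(weights, rng=random):
    """Draw an index with probability proportional to weights (need not be normalized)."""
    total = sum(weights)
    r = rng.random() * total          # a uniform point along the unit-height stack
    cum = 0.0
    for k, w in enumerate(weights):   # note: this is the loop over all topics
        cum += w
        if r < cum:
            return k
    return len(weights) - 1           # guard against floating-point round-off

print(sample_discrete([0.2, 0.5, 0.3]))
```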
What gets learned...
In a math-ier notation, the sampler keeps count matrices: N[d,k] = # words in document d assigned to topic k; N[*,k] = # words assigned to topic k over all documents; M[w,k] (written W[w,k] below) = # times word w is assigned to topic k; and N[*,*] = V, the vocabulary size.
Initialization: for each document d and word position j in d:
  z[d,j] = k, a random topic
  N[d,k]++; W[w,k]++ where w = id of the j-th word in d
for each pass t = 1, 2, ...:
  for each document d and word position j in d:
    z[d,j] = k, a new topic sampled from the conditional distribution below
    update N, W to reflect the new assignment of z:
      N[d,k]++; N[d,k']-- where k' is the old z[d,j]
      W[w,k]++; W[w,k']-- where w is w[d,j]
p(z[d,j] = k | ...) ∝ Pr(Z_d,j = k | "d") · Pr(W_d,j = w | Z_d,j = k, ...)
  = (N[d,k] - C_{d,j,k} + α) / Z  ·  (W[w,k] - C_{d,j,k} + β) / ((W[*,k] - C_{d,j,k}) + β·N[*,*])
where C_{d,j,k} = 1 if z[d,j] = k (the current assignment) and 0 otherwise, i.e., the correction that removes the word's own current assignment from the counts, and Z is a normalizer.
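Putting the pieces together, here is a compact sketch of the collapsed Gibbs sampler these slides describe, using the N[d,k] and W[w,k] count matrices; the subtract-resample-add pattern plays the role of the C_{d,j,k} correction, and V plays the role of N[*,*] in the denominator. The toy corpus and hyperparameters are illustrative.

```python
import random

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, passes=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids in [0, V)."""
    rng = random.Random(seed)
    N = [[0] * K for _ in docs]          # N[d][k]: # words in doc d assigned to topic k
    W = [[0] * K for _ in range(V)]      # W[w][k]: # times word w assigned to topic k
    Wsum = [0] * K                       # W[*][k]: total # words assigned to topic k
    z = []
    for d, doc in enumerate(docs):       # initialization: assign each word a random topic
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            N[d][k] += 1; W[w][k] += 1; Wsum[k] += 1
        z.append(zd)
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]          # remove the current assignment (the C_{d,j,k} correction)
                N[d][k_old] -= 1; W[w][k_old] -= 1; Wsum[k_old] -= 1
                # p(z = k | ...) ∝ (N[d][k] + alpha) * (W[w][k] + beta) / (Wsum[k] + beta * V)
                weights = [(N[d][k] + alpha) * (W[w][k] + beta) / (Wsum[k] + beta * V)
                           for k in range(K)]
                k_new = rng.choices(range(K), weights=weights)[0]
                z[d][j] = k_new
                N[d][k_new] += 1; W[w][k_new] += 1; Wsum[k_new] += 1
    return z, N, W

# Toy corpus: word ids over a vocabulary of size V = 4.
docs = [[0, 0, 1, 0], [2, 3, 3, 2], [0, 1, 1], [3, 2, 3]]
z, N, W = lda_gibbs(docs, K=2, V=4)
print(N)   # per-document topic counts, from which each theta_d can be estimated
```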
[Same sampling picture as before: stack the unnormalized probabilities of z=1, z=2, z=3 to unit height and pick a uniform random point.]
1. You spend a lot of time sampling.
2. There's a loop over all topics here in the sampler.