Collapsed Variational Inference for LDA
B.T. Thomas Yeo

1 LDA

We shall follow the same notation as Blei et al. In other words, we consider the full LDA model with hyperparameters $\alpha$ and $\eta$ on $\theta$ and $\beta$ respectively, where $\theta_d$ parameterizes P(topic | document) and $\beta_k$ parameterizes P(word | topic). $w_{dn}$ refers to the $n$-th word of the $d$-th document; $z_{dn}$ refers to the topic that generates the $n$-th word of the $d$-th document.

1.1 Variational Inference Setup

Cost function: the aim is to maximize the likelihood of the observed words $w$ with respect to $\alpha, \eta$:

$$\log p(w \mid \alpha, \eta) = \log \int_\theta \int_\beta \sum_z p(\theta, \beta, z, w \mid \alpha, \eta) \, d\beta \, d\theta \tag{1}$$

$$= \log \int_\theta \int_\beta \sum_z q(\beta, \theta, z) \, \frac{p(\theta, \beta, z, w \mid \alpha, \eta)}{q(\beta, \theta, z)} \, d\beta \, d\theta \tag{2}$$

$$= \log E_q \frac{p(\theta, \beta, z, w \mid \alpha, \eta)}{q(\beta, \theta, z)} \tag{3}$$

$$\geq E_q \log p(\theta, \beta, z, w \mid \alpha, \eta) - E_q \log q(\beta, \theta, z) \quad \text{(by Jensen's inequality)} \tag{4}$$

$$= -F(q(\beta, \theta, z)), \quad \text{where } F \text{ is called the variational free energy.} \tag{5}$$

Note that

$$E_q \log p(\theta, \beta, z, w \mid \alpha, \eta) - E_q \log q(\beta, \theta, z) + \mathrm{KL}\big(q(\beta, \theta, z) \,\|\, p(\beta, \theta, z \mid w, \alpha, \eta)\big) \tag{6}$$

$$= E_q \log p(\theta, \beta, z, w \mid \alpha, \eta) - E_q \log q(\beta, \theta, z) + E_q \log \frac{q(\beta, \theta, z)}{p(\beta, \theta, z \mid w, \alpha, \eta)} \tag{7}$$

$$= E_q \log \frac{p(\theta, \beta, z, w \mid \alpha, \eta)}{p(\beta, \theta, z \mid w, \alpha, \eta)} \tag{8}$$

$$= E_q \log p(w \mid \alpha, \eta) = \log p(w \mid \alpha, \eta). \tag{9}$$

Therefore

$$\log p(w \mid \alpha, \eta) + F(q(\beta, \theta, z)) = \mathrm{KL}\big(q(\beta, \theta, z) \,\|\, p(\beta, \theta, z \mid w, \alpha, \eta)\big), \tag{10}$$

i.e.

$$-F(q(\beta, \theta, z)) = \log p(w \mid \alpha, \eta) - \mathrm{KL}\big(q(\beta, \theta, z) \,\|\, p(\beta, \theta, z \mid w, \alpha, \eta)\big). \tag{11}$$

In other words, for the negative free energy $-F$ to be equal to $\log p(w \mid \alpha, \eta)$, $q(\beta, \theta, z)$ should be equal to $p(\beta, \theta, z \mid w, \alpha, \eta)$. In EM, we compute $q(\beta, \theta, z) = p(\beta, \theta, z \mid w, \alpha, \eta)$ exactly in the E-step and then, conditioned on this particular $q$, we maximize with respect to $\alpha$ and $\eta$ in the M-step. In LDA, this is not tractable. In variational EM, $q(\beta, \theta, z)$ is approximated with a simpler class of functions. The best $q$ function within this class is optimized during the E-step, and then, conditioned on this particular $q$, we maximize with respect to $\alpha$ and $\eta$ in the M-step.
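Before moving on, Eq. (11) is easy to verify numerically. Below is a minimal sketch on a toy two-state model (the model and all variable names are illustrative, far simpler than LDA): the gap between the evidence and the negative free energy is exactly the KL divergence from $q$ to the true posterior.

```python
import numpy as np

# Toy model (hypothetical): latent h in {0, 1}, one fixed observation w.
p_h = np.array([0.3, 0.7])            # prior p(h)
p_w_given_h = np.array([0.9, 0.2])    # likelihood p(w | h) of the observed w
p_joint = p_h * p_w_given_h           # p(h, w)
log_pw = np.log(p_joint.sum())        # exact evidence log p(w)

q = np.array([0.5, 0.5])              # an arbitrary variational distribution
neg_F = np.sum(q * (np.log(p_joint) - np.log(q)))   # -F(q), Eqs. (4)-(5)
posterior = p_joint / p_joint.sum()                 # p(h | w)
kl = np.sum(q * (np.log(q) - np.log(posterior)))    # KL(q || p(h | w))

assert np.isclose(log_pw, neg_F + kl)               # Eq. (11)
```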
1.2 Motivation for Collapsed Variational Inference

In the original LDA paper (Blei et al., 2003), the class of proposal distributions was

$$q(\theta, z, \beta \mid \gamma, \phi, \lambda) = q(\beta \mid \lambda) \, q(\theta \mid \gamma) \, q(z \mid \phi) \tag{12}$$

$$= \prod_{k=1}^K q(\beta_k \mid \lambda_k) \prod_{d=1}^D q(\theta_d \mid \gamma_d) \, q(z_d \mid \phi_d). \tag{13}$$

In Teh et al. (2007), the class of proposal distributions was

$$q(\theta, z, \beta \mid \phi) = q(\theta, \beta \mid z) \, q(z \mid \phi) \tag{14}$$

$$= p(\theta, \beta \mid z, w, \alpha, \eta) \, q(z \mid \phi) \quad \text{(notice the first term is exact)} \tag{15}$$

$$= p(\theta, \beta \mid z, w, \alpha, \eta) \prod_{d=1}^D q(z_d \mid \phi_d). \tag{16}$$

Observe that Blei's class of distributions is a subset of Teh's:

$$\Big\{ \prod_{k=1}^K q(\beta_k \mid \lambda_k) \prod_{d=1}^D q(\theta_d \mid \gamma_d) \Big\} \;\subset\; \big\{ q(\theta, \beta \mid z) \big\},$$

since the former class of distributions assumes independence between $\beta$ and $\theta$ (and of both from $z$), while the latter makes no assumption about their independence. Therefore, intuitively, Teh's approximation is better. We can also prove this more formally:

$$\mathrm{KL}\big(q(\beta, \theta, z) \,\|\, p(\beta, \theta, z \mid w, \alpha, \eta)\big) = \mathrm{KL}\big(q(z)\, q(\beta, \theta \mid z) \,\|\, p(\beta, \theta, z \mid w, \alpha, \eta)\big) \tag{17}$$

$$= E_q \log \frac{q(z)\, q(\beta, \theta \mid z)}{p(\beta, \theta, z \mid w, \alpha, \eta)} \tag{18}$$

$$= E_q \log \frac{q(z)}{p(z \mid w, \alpha, \eta)} + E_q \log \frac{q(\beta, \theta \mid z)}{p(\beta, \theta \mid z, w, \alpha, \eta)} \tag{19}$$

$$= E_{q(z)} \log \frac{q(z)}{p(z \mid w, \alpha, \eta)} + E_{q(z)} E_{q(\beta, \theta \mid z)} \log \frac{q(\beta, \theta \mid z)}{p(\beta, \theta \mid z, w, \alpha, \eta)} \tag{20}$$

$$= \mathrm{KL}\big(q(z) \,\|\, p(z \mid w, \alpha, \eta)\big) + E_{q(z)} \, \mathrm{KL}\big(q(\beta, \theta \mid z) \,\|\, p(\beta, \theta \mid z, w, \alpha, \eta)\big) \tag{21}$$

$$\geq \mathrm{KL}\big(q(z) \,\|\, p(z \mid w, \alpha, \eta)\big), \quad \text{with equality when } q(\beta, \theta \mid z) = p(\beta, \theta \mid z, w, \alpha, \eta). \tag{22}$$

Therefore the KL divergence of Teh's approximation is at most equal to the KL divergence of Blei's approximation, because the second KL term is zero for Teh's approximation but non-zero for Blei's approximation. The question then is whether Teh's approximation is tractable.
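The decomposition in Eqs. (17)-(21) is just the chain rule for KL divergence, and it can also be checked numerically. A small sketch with arbitrary discrete distributions (the sizes and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((3, 4)); p /= p.sum()                 # target joint p(z, u)
qz = rng.random(3); qz /= qz.sum()                   # q(z)
qu_z = rng.random((3, 4))
qu_z /= qu_z.sum(axis=1, keepdims=True)              # q(u | z)

q = qz[:, None] * qu_z                               # joint q(z, u)
kl_joint = np.sum(q * np.log(q / p))                 # KL(q(z)q(u|z) || p(z, u))

pz = p.sum(axis=1)
pu_z = p / pz[:, None]                               # p(z) and p(u | z)
kl_z = np.sum(qz * np.log(qz / pz))                  # KL(q(z) || p(z))
kl_cond = np.sum(qz * np.sum(qu_z * np.log(qu_z / pu_z), axis=1))

assert np.isclose(kl_joint, kl_z + kl_cond)          # Eqs. (17)-(21)
# Setting q(u | z) = p(u | z) makes kl_cond zero, which is Teh's choice in Eq. (15).
```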
1.3 Simplifying the lower bound on log p(w | α, η) with Teh's approximation

We can plug $q(\beta, \theta, z) = p(\theta, \beta \mid z, w, \alpha, \eta)\, q(z \mid \phi)$ into Eq. (4) and simplify. However, it is simpler to restart from $\log p(w \mid \alpha, \eta)$:

$$\log p(w \mid \alpha, \eta) = \log \sum_z p(z, w \mid \alpha, \eta) \tag{23}$$

$$= \log \sum_z q(z) \, \frac{p(z, w \mid \alpha, \eta)}{q(z)} \tag{24}$$

$$= \log E_q \frac{p(z, w \mid \alpha, \eta)}{q(z)} \tag{25}$$

$$\geq E_q \log p(z, w \mid \alpha, \eta) - E_q \log q(z). \tag{26}$$

The class of proposal distributions we will consider is $q(z) = \prod_{d=1}^D q(z_d \mid \phi_d) = \prod_{d,n} q(z_{dn} \mid \phi_{dn})$. Note that $q(z_{dn} \mid \phi_{dn})$ is a categorical distribution with parameters $\phi_{dn}$. In other words, $\phi_{dn}$ is a vector of length $K$ (one entry per topic), such that $\sum_k \phi_{dnk} = 1$.

1.4 Variational E-step

The goal is to maximize Eq. (26) with respect to $\{\phi_{dn}\}$. We have to add the Lagrange multipliers corresponding to the constraints $\sum_k \phi_{dnk} = 1$. Furthermore, note that

$$E_q \log q(z) = \sum_z q(z) \log\big[ q(z_{11}) \cdots q(z_{D N_D}) \big] = \sum_z q(z) \big[ \log q(z_{11}) + \cdots + \log q(z_{D N_D}) \big] \tag{27}$$

$$= \sum_{d,n} \sum_{z_{dn}} q(z_{dn}) \log q(z_{dn}) = \sum_{d,n} \sum_k \phi_{dnk} \log \phi_{dnk}. \tag{28}$$

Therefore, Eq. (26) (with the Lagrange terms added) becomes

$$E_q \log p(z, w \mid \alpha, \eta) - E_q \log q(z) + \sum_{d,n} \mu_{dn} \Big( \sum_k \phi_{dnk} - 1 \Big) \tag{29}$$

$$= E_q \log p(z, w \mid \alpha, \eta) - \sum_{d,n,k} \phi_{dnk} \log \phi_{dnk} + \sum_{d,n} \mu_{dn} \Big( \sum_k \phi_{dnk} - 1 \Big) \tag{30}$$

$$= E_{q(z_{ij})} E_{q(z_{\neg ij})} \log p(z, w \mid \alpha, \eta) - \sum_{d,n,k} \phi_{dnk} \log \phi_{dnk} + \sum_{d,n} \mu_{dn} \Big( \sum_k \phi_{dnk} - 1 \Big) \tag{31}$$

$$= \sum_k \phi_{ijk} \, E_{q(z_{\neg ij})} \log p(z_{\neg ij}, z_{ij} = k, w \mid \alpha, \eta) - \sum_{d,n,k} \phi_{dnk} \log \phi_{dnk} + \sum_{d,n} \mu_{dn} \Big( \sum_k \phi_{dnk} - 1 \Big), \tag{32}$$

where $i$ indexes a particular document, $j$ indexes the $j$-th word of that document, and $z_{\neg ij}$ denotes all topic assignments except $z_{ij}$. Differentiating with respect to $\phi_{ijl}$, we get

$$E_{q(z_{\neg ij})} \log p(z_{\neg ij}, z_{ij} = l, w \mid \alpha, \eta) - \log \phi_{ijl} - 1 + \mu_{ij}. \tag{33}$$

Equating the above to 0, we get

$$\log \phi_{ijl} = E_{q(z_{\neg ij})} \log p(z_{\neg ij}, z_{ij} = l, w \mid \alpha, \eta) + \mu_{ij} - 1, \tag{34}$$

$$\phi_{ijl} \propto \exp\Big( E_{q(z_{\neg ij})} \log p(z_{\neg ij}, z_{ij} = l, w \mid \alpha, \eta) \Big), \tag{35}$$

$$\phi_{ijl} = \frac{\exp\Big( E_{q(z_{\neg ij})} \log p(z_{\neg ij}, z_{ij} = l, w \mid \alpha, \eta) \Big)}{\sum_k \exp\Big( E_{q(z_{\neg ij})} \log p(z_{\neg ij}, z_{ij} = k, w \mid \alpha, \eta) \Big)}. \tag{36}$$

1.4.1 Plugging in the counts

Using the standard fact about the Dirichlet-compound-multinomial distribution (see the Wikipedia entry on the Dirichlet-multinomial distribution), we get

$$p(z \mid \alpha) = \prod_d \int p(z_d \mid \theta_d) \, p(\theta_d \mid \alpha) \, d\theta_d = \prod_d \frac{\Gamma(K\alpha)}{\Gamma(K\alpha + N_{d\cdot\cdot})} \prod_k \frac{\Gamma(\alpha + N_{dk\cdot})}{\Gamma(\alpha)}, \tag{37}$$

where $N_{dkw}$ is the number of words in the $d$-th document belonging to topic $k$ and corresponding to dictionary word $w$. A dot means the corresponding index is summed out; for example, $N_{\cdot kw} = \sum_d N_{dkw}$ is the number of words in the corpus belonging to topic $k$ and corresponding to dictionary word $w$. Similarly,

$$p(w \mid z, \eta) = \prod_k \frac{\Gamma(V\eta)}{\Gamma(V\eta + N_{\cdot k\cdot})} \prod_w \frac{\Gamma(\eta + N_{\cdot kw})}{\Gamma(\eta)}. \tag{38}$$

The way to think about the above equation is that, conditioned on the topics $z$, we can divide all the words in the corpus $w$ into words from topic 1 to topic $K$. Words generated from a particular topic are independent of words generated from another topic (hence the product over $k$). For a given topic, the words generated by the topic follow a Dirichlet-compound-multinomial distribution with hyperparameter $\eta$.

Using both equations, and the identity $\Gamma(a + n)/\Gamma(a) = \prod_{m=0}^{n-1} (a + m)$, we get

$$p(z, w \mid \alpha, \eta) = p(z \mid \alpha) \, p(w \mid z, \eta) \tag{39}$$

$$= \prod_d \frac{\Gamma(K\alpha)}{\Gamma(K\alpha + N_{d\cdot\cdot})} \prod_{d,k} \frac{\Gamma(\alpha + N_{dk\cdot})}{\Gamma(\alpha)} \cdot \prod_k \frac{\Gamma(V\eta)}{\Gamma(V\eta + N_{\cdot k\cdot})} \prod_{k,w} \frac{\Gamma(\eta + N_{\cdot kw})}{\Gamma(\eta)} \tag{40}$$

$$= \prod_d \Big[ \prod_{m=0}^{N_{d\cdot\cdot}-1} (K\alpha + m) \Big]^{-1} \prod_{d,k} \prod_{m=0}^{N_{dk\cdot}-1} (\alpha + m) \tag{41}$$

$$\qquad \times \prod_k \Big[ \prod_{m=0}^{N_{\cdot k\cdot}-1} (V\eta + m) \Big]^{-1} \prod_{k,w} \prod_{m=0}^{N_{\cdot kw}-1} (\eta + m). \tag{42}$$
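Eqs. (37)-(40) make the collapsed joint straightforward to evaluate from count matrices, which is useful for checking the derivations that follow. A sketch (the function name and array layout are assumptions of this note, not from Teh et al.):

```python
import numpy as np
from scipy.special import gammaln

def collapsed_log_joint(N_dk, N_kw, alpha, eta):
    """log p(z, w | alpha, eta) from Eqs. (37)-(40).

    N_dk: (D, K) words per document and topic; N_kw: (K, V) words per topic
    and dictionary word (both derived from a fixed assignment z)."""
    K = N_dk.shape[1]
    V = N_kw.shape[1]
    N_d = N_dk.sum(axis=1)                                  # N_{d..}
    N_k = N_kw.sum(axis=1)                                  # N_{.k.}
    log_p_z = (np.sum(gammaln(K * alpha) - gammaln(K * alpha + N_d))
               + np.sum(gammaln(alpha + N_dk) - gammaln(alpha)))   # Eq. (37)
    log_p_w = (np.sum(gammaln(V * eta) - gammaln(V * eta + N_k))
               + np.sum(gammaln(eta + N_kw) - gammaln(eta)))       # Eq. (38)
    return log_p_z + log_p_w
```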
We now come to a tricky part of the derivations! Consider Eq. (36). The numerator and denominator will each have four factors, corresponding to those in Eqs. (41)-(42). The first factor is the same for both numerator and denominator and will therefore cancel out. Let $N^{\neg ij}$ denote the counts computed with word $(i,j)$ excluded. The second factor of $p(z_{\neg ij}, z_{ij} = l, w \mid \alpha, \eta)$ (i.e., of the numerator of Eq. (36), excluding the exponential and expectation) corresponds to

$$\prod_{d,k} \prod_{m=0}^{N_{dk\cdot}-1} (\alpha + m) = \prod_{d \neq i} \prod_k \prod_{m=0}^{N_{dk\cdot}-1} (\alpha + m) \cdot \prod_k \prod_{m=0}^{N_{ik\cdot}-1} (\alpha + m) \tag{43}$$

$$= \prod_{d \neq i} \prod_k \prod_{m=0}^{N_{dk\cdot}-1} (\alpha + m) \cdot \prod_k \prod_{m=0}^{N^{\neg ij}_{ik\cdot}-1} (\alpha + m) \cdot \big( \alpha + N^{\neg ij}_{il\cdot} \big), \tag{44}$$

where the first two terms will cancel with the denominator because they are independent of $l$ (note that $N_{ik\cdot} = N^{\neg ij}_{ik\cdot}$ for $k \neq l$, while $N_{il\cdot} = N^{\neg ij}_{il\cdot} + 1$ when $z_{ij} = l$). Similarly, the third factor of $p(z_{\neg ij}, z_{ij} = l, w \mid \alpha, \eta)$ corresponds to

$$\prod_k \Big[ \prod_{m=0}^{N_{\cdot k\cdot}-1} (V\eta + m) \Big]^{-1} = \prod_k \Big[ \prod_{m=0}^{N^{\neg ij}_{\cdot k\cdot}-1} (V\eta + m) \Big]^{-1} \cdot \big( V\eta + N^{\neg ij}_{\cdot l\cdot} \big)^{-1}, \tag{45}$$

where the first term will cancel with the denominator because it is independent of $l$. Finally, the fourth factor corresponds to

$$\prod_{k,w} \prod_{m=0}^{N_{\cdot kw}-1} (\eta + m) = \prod_{k,w} \prod_{m=0}^{N^{\neg ij}_{\cdot kw}-1} (\eta + m) \cdot \big( \eta + N^{\neg ij}_{\cdot l w_{ij}} \big), \tag{46}$$

where the first term will cancel with the denominator because it is independent of $l$. Therefore Eq. (36) becomes

$$\phi_{ijl} = \frac{\exp\Big( E_{q(z_{\neg ij})} \log\big[ (\alpha + N^{\neg ij}_{il\cdot}) \, (V\eta + N^{\neg ij}_{\cdot l\cdot})^{-1} \, (\eta + N^{\neg ij}_{\cdot l w_{ij}}) \big] \Big)}{\sum_k \exp\Big( E_{q(z_{\neg ij})} \log\big[ (\alpha + N^{\neg ij}_{ik\cdot}) \, (V\eta + N^{\neg ij}_{\cdot k\cdot})^{-1} \, (\eta + N^{\neg ij}_{\cdot k w_{ij}}) \big] \Big)}. \tag{47}$$

Renaming $i$ to $d$, $j$ to $n$, and $l$ to $k$, we get

$$\phi_{dnk} = \frac{\exp\Big( E_{q(z_{\neg dn})} \log\big[ (\alpha + N^{\neg dn}_{dk\cdot}) \, (V\eta + N^{\neg dn}_{\cdot k\cdot})^{-1} \, (\eta + N^{\neg dn}_{\cdot k w_{dn}}) \big] \Big)}{\sum_{k'} \exp\Big( E_{q(z_{\neg dn})} \log\big[ (\alpha + N^{\neg dn}_{dk'\cdot}) \, (V\eta + N^{\neg dn}_{\cdot k'\cdot})^{-1} \, (\eta + N^{\neg dn}_{\cdot k' w_{dn}}) \big] \Big)}. \tag{48}$$

1.4.2 Approximating the counts

$N^{\neg dn}_{dk\cdot} = \sum_{n' \neq n} I(z_{dn'} = k)$, which is a sum of independent (by the mean-field approximation) Bernoulli variables with probability of heads equal to $\phi_{dn'k}$. Therefore the mean and variance of $N^{\neg dn}_{dk\cdot}$ are given by

$$E_q\big[ N^{\neg dn}_{dk\cdot} \big] = \sum_{n' \neq n} \phi_{dn'k}, \qquad \mathrm{Var}_q\big[ N^{\neg dn}_{dk\cdot} \big] = \sum_{n' \neq n} \phi_{dn'k} (1 - \phi_{dn'k}). \tag{49}$$

$N^{\neg dn}_{\cdot k\cdot} = \sum_{(d',n') \neq (d,n)} I(z_{d'n'} = k)$, with mean and variance given by

$$E_q\big[ N^{\neg dn}_{\cdot k\cdot} \big] = \sum_{(d',n') \neq (d,n)} \phi_{d'n'k}, \qquad \mathrm{Var}_q\big[ N^{\neg dn}_{\cdot k\cdot} \big] = \sum_{(d',n') \neq (d,n)} \phi_{d'n'k} (1 - \phi_{d'n'k}). \tag{50}$$

$N^{\neg dn}_{\cdot k w_{dn}} = \sum_{(d',n') \neq (d,n)} I(z_{d'n'} = k) \, I(w_{d'n'} = w_{dn})$, with mean and variance given by

$$E_q\big[ N^{\neg dn}_{\cdot k w_{dn}} \big] = \sum_{(d',n') \neq (d,n)} \phi_{d'n'k} \, I(w_{d'n'} = w_{dn}), \qquad \mathrm{Var}_q\big[ N^{\neg dn}_{\cdot k w_{dn}} \big] = \sum_{(d',n') \neq (d,n)} \phi_{d'n'k} (1 - \phi_{d'n'k}) \, I(w_{d'n'} = w_{dn}). \tag{51}$$
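These moments are cheap to compute from the φ's. A sketch, under the simplifying assumption (of this sketch, not of the notes) that all documents are padded to a common length N so that φ can be stored as one dense array:

```python
import numpy as np

def count_moments(phi, words, V):
    """Means/variances of the count variables in Eqs. (49)-(51).

    phi: (D, N, K) variational parameters; words: (D, N) integer word ids in
    [0, V)."""
    var = phi * (1.0 - phi)                      # Bernoulli variances per (d, n, k)
    E_Ndk, V_Ndk = phi.sum(1), var.sum(1)        # N_{dk.}: shape (D, K)
    E_Nk, V_Nk = phi.sum((0, 1)), var.sum((0, 1))   # N_{.k.}: shape (K,)
    onehot = np.eye(V)[words]                    # (D, N, V) indicators I(w_dn = w)
    E_Nkw = np.einsum('dnk,dnv->kv', phi, onehot)   # N_{.kw}: shape (K, V)
    V_Nkw = np.einsum('dnk,dnv->kv', var, onehot)
    return (E_Ndk, V_Ndk), (E_Nk, V_Nk), (E_Nkw, V_Nkw)

# The leave-one-out moments (superscript "-dn") follow by subtracting word
# (d, n)'s own contribution, e.g. E[N^{-dn}_{dk.}] = E_Ndk[d] - phi[d, n].
```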
As suggested by Teh et al. (2007), we will approximate the random variables $N^{\neg dn}_{dk\cdot}$, $N^{\neg dn}_{\cdot k\cdot}$ and $N^{\neg dn}_{\cdot k w_{dn}}$ as Gaussian random variables with means and variances as discussed above. First, note that a second-order Taylor expansion of $f(x) = \log(b + x)$ about a point $a$ gives

$$f(x) \approx f(a) + f'(a)(x - a) + \tfrac{1}{2} f''(a)(x - a)^2 \tag{52}$$

$$= \log(b + a) + \frac{x - a}{b + a} - \frac{(x - a)^2}{2(b + a)^2}, \tag{53}$$

so, taking expectations with $a = E_q[x]$ (the linear term vanishes),

$$E_q[\log(b + x)] \approx \log(b + E_q[x]) - \frac{\mathrm{Var}_q[x]}{2(b + E_q[x])^2}. \tag{54}$$

Therefore the terms in the numerator of Eq. (48) become:

1. Let $b = \alpha$, $x = N^{\neg dn}_{dk\cdot}$ and $a = E_q[N^{\neg dn}_{dk\cdot}]$:

$$E_{q(z_{\neg dn})} \log\big( \alpha + N^{\neg dn}_{dk\cdot} \big) \approx \log\big( \alpha + E_q[N^{\neg dn}_{dk\cdot}] \big) - \frac{\mathrm{Var}_q[N^{\neg dn}_{dk\cdot}]}{2\big( \alpha + E_q[N^{\neg dn}_{dk\cdot}] \big)^2}. \tag{57}$$

2. Let $b = V\eta$, $x = N^{\neg dn}_{\cdot k\cdot}$ and $a = E_q[N^{\neg dn}_{\cdot k\cdot}]$:

$$E_{q(z_{\neg dn})} \log\big( V\eta + N^{\neg dn}_{\cdot k\cdot} \big) \approx \log\big( V\eta + E_q[N^{\neg dn}_{\cdot k\cdot}] \big) - \frac{\mathrm{Var}_q[N^{\neg dn}_{\cdot k\cdot}]}{2\big( V\eta + E_q[N^{\neg dn}_{\cdot k\cdot}] \big)^2}. \tag{58}$$

3. Let $b = \eta$, $x = N^{\neg dn}_{\cdot k w_{dn}}$ and $a = E_q[N^{\neg dn}_{\cdot k w_{dn}}]$:

$$E_{q(z_{\neg dn})} \log\big( \eta + N^{\neg dn}_{\cdot k w_{dn}} \big) \approx \log\big( \eta + E_q[N^{\neg dn}_{\cdot k w_{dn}}] \big) - \frac{\mathrm{Var}_q[N^{\neg dn}_{\cdot k w_{dn}}]}{2\big( \eta + E_q[N^{\neg dn}_{\cdot k w_{dn}}] \big)^2}. \tag{59}$$

Plugging the above into Eq. (48), we get

$$\phi_{dnk} \propto \frac{\big( \alpha + E_q[N^{\neg dn}_{dk\cdot}] \big) \big( \eta + E_q[N^{\neg dn}_{\cdot k w_{dn}}] \big)}{V\eta + E_q[N^{\neg dn}_{\cdot k\cdot}]} \exp\left( -\frac{\mathrm{Var}_q[N^{\neg dn}_{dk\cdot}]}{2(\alpha + E_q[N^{\neg dn}_{dk\cdot}])^2} + \frac{\mathrm{Var}_q[N^{\neg dn}_{\cdot k\cdot}]}{2(V\eta + E_q[N^{\neg dn}_{\cdot k\cdot}])^2} - \frac{\mathrm{Var}_q[N^{\neg dn}_{\cdot k w_{dn}}]}{2(\eta + E_q[N^{\neg dn}_{\cdot k w_{dn}}])^2} \right). \tag{60}$$
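Eqs. (49)-(51) and (60) together define one sweep of the collapsed variational E-step. A sketch under the same padded-array assumption as above; it recomputes all statistics for clarity, whereas a practical implementation would update them incrementally:

```python
import numpy as np

def cvb_update(phi, words, alpha, eta, V):
    """One full sweep of the collapsed variational update, Eq. (60).

    phi: (D, N, K), words: (D, N) integer word ids; dense padded arrays as in
    the earlier sketches (an assumption for clarity, not efficiency)."""
    var = phi * (1.0 - phi)
    # Leave-one-out Gaussian moments of Eqs. (49)-(51), for all (d, n) at once.
    E_dk = phi.sum(1, keepdims=True) - phi           # E[N^{-dn}_{dk.}], (D, N, K)
    V_dk = var.sum(1, keepdims=True) - var
    E_k = phi.sum((0, 1)) - phi                      # E[N^{-dn}_{.k.}], (D, N, K)
    V_k = var.sum((0, 1)) - var
    onehot = np.eye(V)[words]                        # (D, N, V)
    E_kw = np.einsum('dnk,dnv->kv', phi, onehot)     # E[N_{.kw}] incl. own term
    V_kw = np.einsum('dnk,dnv->kv', var, onehot)
    E_w = E_kw[:, words].transpose(1, 2, 0) - phi    # moments at w = w_dn, own term removed
    V_w = V_kw[:, words].transpose(1, 2, 0) - var
    # Eq. (60) in log space, then normalize over topics.
    log_phi = (np.log(alpha + E_dk) - V_dk / (2 * (alpha + E_dk) ** 2)
               - np.log(V * eta + E_k) + V_k / (2 * (V * eta + E_k) ** 2)
               + np.log(eta + E_w) - V_w / (2 * (eta + E_w) ** 2))
    log_phi -= log_phi.max(axis=2, keepdims=True)    # numerical stabilization
    new_phi = np.exp(log_phi)
    return new_phi / new_phi.sum(axis=2, keepdims=True)
```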
1.5 M-step

Teh et al. (2007) stopped at the variational E-step. Here, we will consider how to optimize $\alpha$ and $\eta$. In the M-step, we seek to maximize Eq. (26) with respect to $\alpha$ and $\eta$. Keeping only the first term (the entropy term contains neither $\alpha$ nor $\eta$), and taking the log of Eqs. (41)-(42), we get

$$E_q \log p(z, w \mid \alpha, \eta) = E_q \bigg[ -\sum_d \sum_{m=0}^{N_{d\cdot\cdot}-1} \log(K\alpha + m) + \sum_{d,k} \sum_{m=0}^{N_{dk\cdot}-1} \log(\alpha + m) \tag{61}$$

$$\qquad\qquad - \sum_k \sum_{m=0}^{N_{\cdot k\cdot}-1} \log(V\eta + m) + \sum_{k,w} \sum_{m=0}^{N_{\cdot kw}-1} \log(\eta + m) \bigg]. \tag{62}$$

1.5.1 Optimize α

Let's consider the terms with $\alpha$:

$$L(\alpha) = E_q \bigg[ -\sum_d \sum_{m=0}^{N_{d\cdot\cdot}-1} \log(K\alpha + m) + \sum_{d,k} \sum_{m=0}^{N_{dk\cdot}-1} \log(\alpha + m) \bigg] \tag{63}$$

$$= -\sum_d \sum_{m=0}^{N_{d\cdot\cdot}-1} \log(K\alpha + m) + \sum_{d,k} E_q \sum_{m=0}^{N_{dk\cdot}-1} \log(\alpha + m) \tag{64}$$

(the first term is deterministic because the document lengths $N_{d\cdot\cdot}$ are observed). Expanding the expectation over the distribution of the count $N_{dk\cdot}$ and then swapping the order of the two sums,

$$= -\sum_d \sum_{m=0}^{N_{d\cdot\cdot}-1} \log(K\alpha + m) + \sum_{d,k} \sum_{n=1}^{N_{d\cdot\cdot}} q(N_{dk\cdot} = n) \sum_{m=0}^{n-1} \log(\alpha + m) \tag{65}$$

$$= -\sum_d \sum_{m=0}^{N_{d\cdot\cdot}-1} \log(K\alpha + m) + \sum_{d,k} \sum_{n=0}^{N_{d\cdot\cdot}-1} q(N_{dk\cdot} > n) \log(\alpha + n). \tag{66}$$

Differentiating with respect to $\alpha$, we get

$$\frac{\partial L}{\partial \alpha} = -\sum_d \sum_{n=0}^{N_{d\cdot\cdot}-1} \frac{K}{K\alpha + n} + \sum_{d,k} \sum_{n=0}^{N_{d\cdot\cdot}-1} \frac{q(N_{dk\cdot} > n)}{\alpha + n}. \tag{67}$$

Differentiating one more time, we get

$$\frac{\partial^2 L}{\partial \alpha^2} = \sum_d \sum_{n=0}^{N_{d\cdot\cdot}-1} \frac{K^2}{(K\alpha + n)^2} - \sum_{d,k} \sum_{n=0}^{N_{d\cdot\cdot}-1} \frac{q(N_{dk\cdot} > n)}{(\alpha + n)^2}. \tag{68}$$

We can use the Hessian and gradient to compute the Newton-Raphson update. To compute $q(N_{dk\cdot} > n)$, note that $N_{dk\cdot} = \sum_n I(z_{dn} = k)$, which we can approximate by a Gaussian with mean $\sum_n \phi_{dnk}$ and variance $\sum_n \phi_{dnk}(1 - \phi_{dnk})$.

1.5.2 Optimize η

Let's consider the terms with $\eta$. Denoting by $N = \sum_d N_{d\cdot\cdot}$ the total number of words in all the documents, we get

$$L(\eta) = E_q \bigg[ -\sum_k \sum_{m=0}^{N_{\cdot k\cdot}-1} \log(V\eta + m) + \sum_{k,w} \sum_{m=0}^{N_{\cdot kw}-1} \log(\eta + m) \bigg] \tag{69}$$

$$= -\sum_k \sum_{n=1}^{N} q(N_{\cdot k\cdot} = n) \sum_{m=0}^{n-1} \log(V\eta + m) + \sum_{k,w} \sum_{n=1}^{N} q(N_{\cdot kw} = n) \sum_{m=0}^{n-1} \log(\eta + m) \tag{70}$$

$$= -\sum_k \sum_{n=0}^{N-1} q(N_{\cdot k\cdot} > n) \log(V\eta + n) + \sum_{k,w} \sum_{n=0}^{N-1} q(N_{\cdot kw} > n) \log(\eta + n). \tag{71}$$

Differentiating with respect to $\eta$, we get

$$\frac{\partial L}{\partial \eta} = -\sum_k \sum_{n=0}^{N-1} q(N_{\cdot k\cdot} > n) \frac{V}{V\eta + n} + \sum_{k,w} \sum_{n=0}^{N-1} \frac{q(N_{\cdot kw} > n)}{\eta + n}. \tag{72}$$

Differentiating one more time, we get

$$\frac{\partial^2 L}{\partial \eta^2} = \sum_k \sum_{n=0}^{N-1} q(N_{\cdot k\cdot} > n) \frac{V^2}{(V\eta + n)^2} - \sum_{k,w} \sum_{n=0}^{N-1} \frac{q(N_{\cdot kw} > n)}{(\eta + n)^2}. \tag{73}$$

We can use the Hessian and gradient to compute the Newton-Raphson update. To compute $q(N_{\cdot k\cdot} > n)$, note that $N_{\cdot k\cdot} = \sum_{d,n} I(z_{dn} = k)$, which we can approximate by a Gaussian with mean $\sum_{d,n} \phi_{dnk}$ and variance $\sum_{d,n} \phi_{dnk}(1 - \phi_{dnk})$. To compute $q(N_{\cdot kw} > n)$, note that $N_{\cdot kw} = \sum_{d,n} I(z_{dn} = k) I(w_{dn} = w)$, which we can approximate by a Gaussian with mean $\sum_{d,n} \phi_{dnk} I(w_{dn} = w)$ and variance $\sum_{d,n} \phi_{dnk}(1 - \phi_{dnk}) I(w_{dn} = w)$.
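A sketch of the resulting Newton-Raphson step for α follows (the η step is analogous, using the corpus-level moments of §1.5.2). The Gaussian tail probabilities stand in for $q(N_{dk\cdot} > n)$ as suggested above; the +0.5 continuity correction and the function name are illustrative choices of this sketch:

```python
import numpy as np
from scipy.stats import norm

def alpha_newton_step(alpha, E_Ndk, V_Ndk, N_d, K):
    """One Newton-Raphson step for alpha using Eqs. (67)-(68).

    E_Ndk, V_Ndk: (D, K) Gaussian moments of N_{dk.} computed from the phis;
    N_d: array of document lengths N_{d..}."""
    grad, hess = 0.0, 0.0
    for d, Nd in enumerate(N_d):
        n = np.arange(Nd)                               # n = 0, ..., N_d - 1
        grad -= np.sum(K / (K * alpha + n))             # deterministic first term
        hess += np.sum(K**2 / (K * alpha + n) ** 2)
        sd = np.sqrt(np.maximum(V_Ndk[d], 1e-12))[:, None]
        # q(N_{dk.} > n) under N_{dk.} ~ Normal(E, V), shape (K, N_d)
        tail = norm.sf(n[None, :] + 0.5, loc=E_Ndk[d][:, None], scale=sd)
        grad += np.sum(tail / (alpha + n))              # sum over k and n
        hess -= np.sum(tail / (alpha + n) ** 2)
    return alpha - grad / hess                          # Newton-Raphson update
```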
1.6 Alternative M-step

As an alternative approach, we can use $\phi_{dnk}$ to compute

$$\gamma_{dk} = \alpha + \sum_{n=1}^{N_{d\cdot\cdot}} \phi_{dnk}, \tag{74}$$

$$\lambda_{kw} = \eta + \sum_d \sum_{n=1}^{N_{d\cdot\cdot}} \phi_{dnk} \, I(w_{dn} = w), \tag{75}$$

which we can then use to update $\alpha$ and $\eta$ in exactly the same way as in the original LDA (Blei et al., 2003). This is theoretically not as good as the previous section, because we are using point estimates of $\theta$ and $\beta$ instead of integrating them out. But the estimates of $\phi$ should be better than those of the original LDA (Blei et al., 2003), so maybe it will perform better? However, there are no theoretical guarantees, unlike for the variational E-step.
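A sketch of these statistics, reusing the padded array layout of the earlier sketches; γ and λ can then be fed to the standard hyperparameter updates of Blei et al. (2003):

```python
import numpy as np

def alternative_m_step_stats(phi, words, alpha, eta, V):
    """gamma and lambda of Eqs. (74)-(75).

    phi: (D, N, K) variational parameters; words: (D, N) integer word ids."""
    gamma = alpha + phi.sum(axis=1)                      # (D, K)
    onehot = np.eye(V)[words]                            # (D, N, V)
    lam = eta + np.einsum('dnk,dnv->kv', phi, onehot)    # (K, V)
    return gamma, lam
```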