Lecture 2: Correlated Topic Model


Probabilistic Models for Unsupervised Learning, Spring 2013
Lecture 2: Correlated Topic Model
Inference for the Correlated Topic Model
Yuan Yuan

First of all, let us fix notation for the parameters and variables of the model. Let K be the number of topics, D the number of documents, and V the number of terms in the vocabulary. We use i to index a topic [1], d to index a document [2], n to index a word [3], and w (or v) to denote a word. In the correlated topic model, µ (K-dimensional), Σ (K × K) and β (K × V) are model parameters, while η (D × K) and z [4] are hidden variables. As a variational distribution q(·), we use a fully factorized model, where all the variables are independently governed by a different distribution,

q(η, z | λ, ν, φ) = q(η | λ, ν) q(z | φ). (1.1)

Here λ (D × K), ν (D × K) and φ [5] are variational parameters. Note that the only assumption we have made in the variational inference is that η and z are independent; we do not specify any probability functions for these two hidden variables. The topic assignments of words and the documents are exchangeable, i.e., independent conditioned on the parameters (either model parameters or variational parameters). Note also that the variational distribution q(·) is a conditional distribution and should be written as q(· | w); for simplicity we write it as q(·).

The main idea is to use variational expectation-maximization (EM). In the E-step of variational EM, we use the variational approximation to the posterior described above and find the optimal values of the variational parameters. In the M-step, we maximize the bound with respect to the model parameters. More concisely, we perform variational inference to learn the variational parameters in the E-step, and parameter estimation in the M-step. The two steps alternate in each iteration. We optimize the lower bound with respect to the variational parameters and the model parameters one by one, that is, we perform the optimization with a coordinate ascent algorithm.

1.1 Variational objective function

1.1.1 Finding a lower bound for log p(w | µ, Σ, β)

Jensen's inequality. Let X be a random variable and f a convex function. Then f(E(X)) ≤ E(f(X)). If f is a concave function, then f(E(X)) ≥ E(f(X)).

[1] ∑_i means ∑_{i=1}^{K}.
[2] ∑_d means ∑_{d=1}^{D}.
[3] ∑_n means ∑_{n=1}^{N}, where N is the length of the current document.
[4] z is represented as a three-dimensional array; each entry, indexed by a triplet <d, n, i>, indicates whether the topic assignment of the nth word in the dth document is the ith topic.
[5] Corresponding to z, φ is represented as a three-dimensional array; each entry, indexed by a triplet <d, n, i>, gives the probability that the nth word in the dth document is assigned to the ith topic.
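Since everything below leans on the concave case f = log, here is a tiny NumPy check of the inequality on simulated data (a sketch; the gamma distribution is just an arbitrary positive example, not part of the model):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=2.0, scale=1.0, size=100_000)  # arbitrary positive samples
    print(np.log(x.mean()))   # log E(X)
    print(np.log(x).mean())   # E(log X): smaller, since log is concave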

We use Jensen's inequality to bound the log probability of a document [6]:

log p(w | µ, Σ, β) = log ∫ ∑_z p(η, z, w | µ, Σ, β) dη
= log ∫ ∑_z p(η, z, w | µ, Σ, β) (q(η, z) / q(η, z)) dη
≥ ∫ ∑_z q(η, z) log p(η, z, w | µ, Σ, β) dη − ∫ ∑_z q(η, z) log q(η, z) dη
= E_q[log p(η | µ, Σ)] + E_q[log p(z | η)] + E_q[log p(w | z, β)] + H(q)
≜ L(λ, ν, φ | µ, Σ, β). (1.2)

We can easily verify that

log p(w | µ, Σ, β) = L(λ, ν, φ | µ, Σ, β) + D(q(η, z | λ, ν, φ) ‖ p(η, z | µ, Σ, β, w)). (1.3)

We have indeed found a lower bound for log p(w | µ, Σ, β), namely L(λ, ν, φ | µ, Σ, β). Moreover, Eq. (1.3) shows that maximizing the lower bound L(λ, ν, φ | µ, Σ, β) with respect to λ, ν and φ is equivalent to minimizing the KL divergence between the variational posterior and the true posterior, i.e., the optimization problem presented in Eq. (1.24) below [7].

1.1.2 Expanding the lower bound

E_q[log p(η | µ, Σ)] (1.4)
= (1/2) log |Σ⁻¹| − (K/2) log 2π − (1/2) E_q[(η − µ)ᵀ Σ⁻¹ (η − µ)] (1.5)
= (1/2) log |Σ⁻¹| − (K/2) log 2π − (1/2) ( Trace(diag(ν²) Σ⁻¹) + (λ − µ)ᵀ Σ⁻¹ (λ − µ) ). (1.6)

Let z_n denote the topic assignment of the nth word in the current document; it is an indicator vector: z_{n,i} = 1 when the topic assignment is the ith topic, and z_{n,i} = 0 otherwise.

E_q[log p(z | η)] = ∑_n E_q[log p(z_n | η)] (1.7)
= ∑_{n,i} E_q[ z_{n,i} log ( exp(η_i) / ∑_j exp(η_j) ) ] (1.8)
= ∑_{n,i} E_q[z_{n,i} η_i] − ∑_n E_q[log ∑_i exp(η_i)]. (1.9)

[6] For now we suppress the document index d, since all the hidden variables and variational parameters are document-specific. We will use the document index explicitly in the parameter estimation part, since the model parameters are shared by all the documents.
[7] When we learn the variational parameters, we fix all the model parameters, so log p(w | µ, Σ, β) can be considered a fixed value: it is the sum of the lower bound and the KL divergence.
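As a sanity check on Eq. (1.6), here is a small NumPy sketch that evaluates E_q[log p(η | µ, Σ)] from the variational parameters; the function and variable names (e_log_p_eta, lam, nu2) are illustrative, not from any particular library:

    import numpy as np

    def e_log_p_eta(lam, nu2, mu, Sigma):
        # E_q[log p(eta | mu, Sigma)] for q(eta) = N(lambda, diag(nu^2)),
        # following Eq. (1.6)
        K = len(lam)
        Sigma_inv = np.linalg.inv(Sigma)
        logdet = np.linalg.slogdet(Sigma_inv)[1]   # log |Sigma^{-1}|
        diff = lam - mu
        return (0.5 * logdet
                - 0.5 * K * np.log(2 * np.pi)
                - 0.5 * (np.trace(np.diag(nu2) @ Sigma_inv)
                         + diff @ Sigma_inv @ diff))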

We can verify that E_q[z_{n,i} η_i] = λ_i φ_{n,i}. It is more difficult to derive ∑_n E_q[log ∑_i exp(η_i)]. To preserve the lower bound on the log probability, we upper bound the log normalizer (which enters Eq. (1.9) with a negative sign) by a first-order Taylor expansion:

E_q[log ∑_i exp(η_i)] ≤ ζ⁻¹ ( ∑_i E_q[exp(η_i)] ) − 1 + log ζ, (1.10)

where we have introduced a new slack parameter ζ. The expectation E_q[exp(η_i)] is the mean of a log-normal distribution whose mean and variance are obtained from the variational parameters {λ_i, ν_i²}: E_q[exp(η_i)] = exp(λ_i + ν_i²/2). Using this additional bound, the right side of Eq. (1.9) is bounded below by

E_q[log p(z | η)] ≥ ∑_{n,i} λ_i φ_{n,i} − N ( ζ⁻¹ ∑_i exp(λ_i + ν_i²/2) − 1 + log ζ ). (1.11)

E_q[log p(w | z, β)] = ∑_n E_q[log p(w_n | z_n, β)] (1.12)
= ∑_{n,i} E_q[z_{n,i} log β_{i,w_n}] (1.13)
= ∑_{n,i} φ_{n,i} log β_{i,w_n}. (1.14)

H(q) = −∫ ∑_z q(η, z) log q(η, z) dη (1.15)
= −∫ q(η) log q(η) dη − ∑_z q(z) log q(z) (1.16)
= ∑_i (1/2)(log ν_i² + log 2π + 1) − ∑_{n,i} φ_{n,i} log φ_{n,i}. (1.17)

We also present the detailed derivation of −∫ q(η) log q(η) dη:

−∫ q(η) log q(η) dη = −∑_i ∫ q(η_i | λ_i, ν_i²) log q(η_i | λ_i, ν_i²) dη_i (1.18, 1.19)
= −∑_i ∫ (2πν_i²)^{-1/2} exp( −(η_i − λ_i)²/(2ν_i²) ) ( −(η_i − λ_i)²/(2ν_i²) − (1/2) log(2πν_i²) ) dη_i (1.20, 1.21)
= ∑_i (1/2)(1 + log 2π + log ν_i²). (1.22)

Here we use the following property of a Gaussian distribution p(x) with mean µ and variance δ²:

∫ (x − µ)² p(x) dx = δ². (1.23)
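The remaining three terms, Eqs. (1.11), (1.14) and (1.17), can be evaluated the same way; together with e_log_p_eta from the previous sketch they make up the full per-document bound L. Again a sketch with illustrative names (phi is the N × K matrix of word-topic probabilities, words the length-N array of word indices; all phi entries are assumed strictly positive):

    import numpy as np

    def remaining_bound_terms(lam, nu2, phi, zeta, words, beta):
        N = len(words)
        # lower bound on E_q[log p(z | eta)], Eq. (1.11)
        e_z = (phi @ lam).sum() - N * (
            np.exp(lam + nu2 / 2).sum() / zeta - 1 + np.log(zeta))
        # E_q[log p(w | z, beta)], Eq. (1.14)
        e_w = (phi * np.log(beta[:, words]).T).sum()
        # entropy H(q), Eq. (1.17)
        h = 0.5 * (np.log(nu2) + np.log(2 * np.pi) + 1).sum() \
            - (phi * np.log(phi)).sum()
        return e_z + e_w + h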

1.2 Variational inference

The aim of variational inference is to learn the values of the variational parameters λ, ν, φ. With the learned variational parameters, we can evaluate the posterior probabilities of the hidden variables. Having specified a simplified family of probability distributions, the next step is to set up an optimization problem that determines their values. We obtain a solution for the variational parameters by solving

(λ*, ν*, φ*) = arg min_{λ,ν,φ} D(q(η, z | λ, ν, φ) ‖ p(η, z | µ, Σ, β, w)). (1.24)

By Eq. (1.3), we can minimize D(q(η, z | λ, ν, φ) ‖ p(η, z | µ, Σ, β, w)) by maximizing the lower bound L(λ, ν, φ | µ, Σ, β).

1.2.1 Learning the variational parameters

We have expanded each term of the lower bound L(λ, ν, φ | µ, Σ, β) of Eq. (1.2). We now maximize the bound with respect to the variational parameters λ, ν, φ and the slack variable ζ we have introduced.

First, we maximize Eq. (1.10) with respect to ζ. The derivative is

∂L/∂ζ = N ( ζ⁻² ∑_i exp(λ_i + ν_i²/2) − ζ⁻¹ ), (1.25)

which has its maximum at

ζ̂ = ∑_i exp(λ_i + ν_i²/2). (1.26)

Second, we maximize with respect to φ_{n,i}. We have

∂L/∂φ_{n,i} = log β_{i,w_n} − log φ_{n,i} − 1 + λ_i + τ_n, (1.27)

where τ_n is a Lagrange multiplier enforcing ∑_i φ_{n,i} = 1. This has its maximum at

φ̂_{n,i} ∝ exp(λ_i) β_{i,w_n}. (1.28)

Then we optimize the Gaussian variational parameters λ and ν. For λ, we have the derivative

∂L/∂λ = −Σ⁻¹(λ − µ) + ∑_n φ_{n,1:K} − (N/ζ) exp(λ + ν²/2), (1.29)

where φ_{n,1:K} is a column vector. Here we use a property of matrix gradients: ∂(xᵀAx)/∂x = 2Ax if A is a symmetric matrix and x is a vector. We cannot obtain a closed-form solution for λ, so we feed the above derivative to a gradient-based optimization algorithm, e.g., the conjugate gradient algorithm.
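Before turning to ν, here is a sketch of the updates derived so far, under the same naming assumptions as the earlier sketches. The ζ and φ updates are closed form; for λ the sketch only supplies the gradient of Eq. (1.29), which a minimizer such as scipy.optimize.minimize with method="CG" could consume (applied to the negated bound and gradient):

    import numpy as np

    def update_zeta(lam, nu2):
        # Eq. (1.26): zeta = sum_i exp(lambda_i + nu_i^2 / 2)
        return np.exp(lam + nu2 / 2).sum()

    def update_phi(lam, beta, words):
        # Eq. (1.28): phi_{n,i} proportional to exp(lambda_i) * beta_{i,w_n},
        # normalized over topics for each word
        phi = np.exp(lam)[None, :] * beta[:, words].T   # N x K
        return phi / phi.sum(axis=1, keepdims=True)

    def grad_lambda(lam, nu2, phi, zeta, mu, Sigma_inv, N):
        # Eq. (1.29); negate when handing to a minimizer
        return (-Sigma_inv @ (lam - mu)
                + phi.sum(axis=0)
                - (N / zeta) * np.exp(lam + nu2 / 2))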

Finally, we have the derivative with respect to ν_{d,i}² [8]:

∂L/∂ν_{d,i}² = −Σ⁻¹_{ii}/2 − (N/(2ζ)) exp(λ_{d,i} + ν_{d,i}²/2) + 1/(2ν_{d,i}²). (1.30)

Again there is no analytic solution, and we can use Newton's method with the constraint ν_{d,i}² > 0. We do not present the details of these optimization methods (e.g., Newton's method); generally speaking, they are easy to apply once the derivatives are available.

1.3 Parameter estimation

In this section, we estimate the model parameters β, µ and Σ. We do so by using the variational lower bound, with the variational parameters fixed, as a surrogate for the (intractable) marginal log likelihood. Note that we first aggregate the document-specific lower bounds defined in Eq. (1.2), and from here on we use the document index d explicitly.

We first rewrite the lower bound, keeping only the terms that contain β, with Lagrange multipliers ρ_i:

L_[β] = ∑_{d,n,i} φ_{d,n,i} log β_{i,w_{dn}} + ∑_i ρ_i ( ∑_{v=1}^{V} β_{i,v} − 1 ). (1.31)

Taking the derivative of L_[β], we have

∂L_[β]/∂β_{i,v} = ∑_{d,n} φ_{d,n,i} 1(v = w_{dn}) / β_{i,v} + ρ_i, (1.32)

where 1(v = w_{dn}) is an indicator function that returns 1 when the condition is true and 0 otherwise. Setting ∑_{d,n} φ_{d,n,i} 1(v = w_{dn}) / β_{i,v} + ρ_i to zero and solving for ρ_i gives ρ_i = −∑_{d,n,v} φ_{d,n,i} 1(v = w_{dn}). Since ∑_v β_{i,v} = 1, we can ignore ρ_i and estimate an unnormalized value of β_{i,v}:

β̂_{i,v} ∝ ∑_{d,n} φ_{d,n,i} 1(v = w_{dn}). (1.33)

Similarly, we can rewrite the lower bound keeping only the terms that contain µ:

L_[µ] = −∑_d (1/2)(λ_d − µ)ᵀ Σ⁻¹ (λ_d − µ). (1.34)

Taking the derivative of L_[µ], we have

∂L_[µ]/∂µ = ∑_d Σ⁻¹ (λ_d − µ). (1.35)

Setting ∂L_[µ]/∂µ to zero, we have

µ̂ = (1/D) ∑_d λ_d. (1.36)

[8] With respect to ν_{d,i}², NOT ν_{d,i}.
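A sketch of these two M-step updates, Eqs. (1.33) and (1.36), given the fitted per-document variational parameters. Here phis is a list of N_d × K responsibility matrices, docs the matching lists of word indices, and lambdas a D × K matrix; all names are illustrative:

    import numpy as np

    def m_step_beta(phis, docs, K, V):
        # Eq. (1.33): accumulate phi_{d,n,i} into cell (i, w_dn) of beta,
        # then normalize each topic's row to sum to one
        beta = np.zeros((K, V))
        for phi, words in zip(phis, docs):      # one document at a time
            for n, w in enumerate(words):
                beta[:, w] += phi[n]
        return beta / beta.sum(axis=1, keepdims=True)

    def m_step_mu(lambdas):
        # Eq. (1.36): mu = (1/D) sum_d lambda_d
        return lambdas.mean(axis=0)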

Then we collect all the terms of the lower bound containing Σ:

L_[Σ] = (D/2) log |Σ⁻¹| − ∑_d ( (1/2) Trace(diag(ν_d²) Σ⁻¹) + (1/2)(λ_d − µ)ᵀ Σ⁻¹ (λ_d − µ) ) (1.37)
= (D/2) log |Σ⁻¹| − ∑_d ( (1/2) Trace(diag(ν_d²) Σ⁻¹) + (1/2) Trace( (λ_d − µ)ᵀ Σ⁻¹ (λ_d − µ) ) ) (1.38)
= (D/2) log |Σ⁻¹| − ∑_d ( (1/2) Trace(diag(ν_d²) Σ⁻¹) + (1/2) Trace( Σ⁻¹ (λ_d − µ)(λ_d − µ)ᵀ ) ). (1.39)

In the above, we use the trace trick: for square matrices A and B, Trace(AB) = Trace(BA). Next we take the derivative of L_[Σ] with respect to Σ⁻¹, using the following properties: (1) ∂ log |A| / ∂A = (A⁻¹)ᵀ; (2) ∂ Trace(AB)/∂A = ∂ Trace(BA)/∂A = Bᵀ. We obtain

∂L_[Σ]/∂Σ⁻¹ = (D/2) Σᵀ − (1/2) ∑_d ( diag(ν_d²) + (λ_d − µ)(λ_d − µ)ᵀ )ᵀ, (1.40)

so that, setting the derivative to zero,

Σ̂ = (1/D) ∑_d ( diag(ν_d²) + (λ_d − µ)(λ_d − µ)ᵀ ). (1.41)

1.4 Discussion of the convergence

We can discuss the convergence of variational EM for CTM informally, in terms of either the change in the conditional likelihood log p(w | µ, Σ, β) or the change in the lower bound L(λ, ν, φ | µ, Σ, β). Since both the E-step and the M-step perform coordinate ascent on the lower bound L(λ, ν, φ | µ, Σ, β), the bound increases monotonically and converges (to a local optimum). As for log p(w | µ, Σ, β): in the E-step we increase its lower bound with respect to the variational parameters, and in the M-step we increase the lower bound further, so the likelihood probably increases as well. However, the gap D(q(η, z | λ, ν, φ) ‖ p(η, z | µ, Σ, β, w)) is usually nonzero and might decrease after optimizing over the model parameters, so although the lower bound never decreases, it is not obvious that the likelihood itself converges monotonically.
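Finally, a sketch of the Σ update, Eq. (1.41), together with the bound-based convergence test just described. Here lambdas and nu2s are D × K arrays of per-document variational parameters, and the tolerance is an arbitrary choice; as above, all names are illustrative:

    import numpy as np

    def m_step_sigma(lambdas, nu2s, mu):
        # Eq. (1.41): Sigma = (1/D) sum_d ( diag(nu_d^2)
        #                                   + (lambda_d - mu)(lambda_d - mu)^T )
        diff = lambdas - mu                          # D x K
        return (np.diag(nu2s.sum(axis=0)) + diff.T @ diff) / len(lambdas)

    def converged(bound, bound_old, tol=1e-4):
        # monitor the relative change of the lower bound L, which
        # coordinate ascent never decreases
        return abs(bound - bound_old) <= tol * (abs(bound_old) + 1e-12)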
