Variational Bayes for generic topic models

Gregor Heinrich (Fraunhofer IGD and University of Leipzig) and Michael Goesele (TU Darmstadt)

Abstract. The article contributes a derivation of variational Bayes for a large class of topic models by generalising from the well-known model of latent Dirichlet allocation. For an abstraction of these models as systems of interconnected mixtures, variational update equations are obtained, leading to inference algorithms for models that so far have used Gibbs sampling exclusively.

1 Introduction

Topic models (TMs) are a set of unsupervised learning models used in many areas of artificial intelligence: In text mining, they allow retrieval and automatic thesaurus generation; computer vision uses TMs for image classification and content-based retrieval; in bioinformatics they are the basis for protein relationship models, etc. In all of these cases, TMs learn latent variables from co-occurrences of features in data. Following the seminal model of latent Dirichlet allocation (LDA [6]), this is done efficiently according to a model that exploits the conjugacy of Dirichlet and multinomial probability distributions. Although the original work by Blei et al. [6] has shown the applicability of variational Bayes (VB) for TMs with impressive results, inference especially in more complex models has not adopted this technique but remains the domain of Gibbs sampling (e.g., [12,9,8]).

In this article, we explore variational Bayes for TMs in general rather than specifically for some given model. We start with an overview of TMs and specify general properties (Sec. 2). Using these properties, we develop a generic approach to VB that can be applied to a large class of models (Sec. 3). We verify the variational algorithms on real data and several models (Sec. 4). This paper is therefore the VB counterpart to [7].

2 Topic models

We characterise topic models as a form of discrete mixture models. Mixture models approximate complex distributions by a convex sum of component distributions, p(x) = ∑_{k=1}^K p(x|z=k) p(z=k), where p(z=k) is the weight of a component with index k and distribution p(x|z=k). Latent Dirichlet allocation as the simplest TM can be considered a mixture model with two interrelated mixtures: It represents documents m as mixtures of latent variables z with components ϑ_m = p(z|m), and latent topics z as mixtures of words w with components β_k = p(w|z=k) and component weights ϑ_m, leading to a distribution over words w of p(w|m) = ∑_{k=1}^K ϑ_{m,k} β_{k,w}.
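To make the two interrelated mixtures concrete, the following minimal sketch (Python with NumPy; all variable names are ours, not taken from the paper) draws the LDA generative quantities for one document and evaluates the resulting word distribution p(w|m) = ∑_k ϑ_{m,k} β_{k,w}:

    import numpy as np

    rng = np.random.default_rng(0)

    K, V, N_m = 4, 1000, 50          # topics, vocabulary size, tokens in document m
    alpha, eta = 0.1, 0.01           # symmetric Dirichlet hyperparameters

    beta = rng.dirichlet(np.full(V, eta), size=K)   # topic-term multinomials beta_k
    theta_m = rng.dirichlet(np.full(K, alpha))      # document-topic multinomial theta_m

    # generative process: topic z_{m,n} ~ Mult(theta_m), word w_{m,n} ~ Mult(beta_z)
    z = rng.choice(K, size=N_m, p=theta_m)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])

    # the word distribution of document m is the convex combination
    # p(w | m) = sum_k theta_{m,k} * beta_{k,w}
    p_w_given_m = theta_m @ beta                    # shape (V,), sums to 1
    assert np.isclose(p_w_given_m.sum(), 1.0)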

Fig. 1. Dependencies of mixture levels (ellipses) via discrete variables (arrows) in examples from literature: (a) latent Dirichlet allocation [6], (b) author-topic model (ATM [12], using observed parameters a_m to label documents, see end of Sec. 3), (c) 4-level pachinko allocation (PAM [9], models semantic structure with a hierarchy of topics ϑ_m, ϑ_{m,x}, ϑ_y), (d) hierarchical pachinko allocation (hPAM [8], topic hierarchy; complex mixture structure).

The corresponding generative process is illustrative: For each text document m, a multinomial distribution ϑ_m is drawn from a Dirichlet prior Dir(ϑ_m|α) with hyperparameter α. For each word token w_{m,n} of that document, a topic z_{m,n}=k is drawn from the document multinomial ϑ_m, and finally the word observation w is drawn from a topic-specific multinomial over terms, β_k. Pursuing a Bayesian strategy with parameters handled as random variables, the topic-specific multinomial is itself drawn from another Dirichlet, Dir(β_k|η), similar to the document multinomial.

Generic TMs. As generalisations of LDA, topic models can be seen as a powerful yet flexible framework to model complex relationships in data that is based on only two modelling assumptions: (1) TMs are structured into Dirichlet-multinomial mixture levels to learn discrete latent variables (in LDA: z) and multinomial parameters (in LDA: β and ϑ). And (2) these levels are coupled via the values of discrete variables, similar to the coupling in LDA between ϑ and β via z. More specifically, topic models form graphs of mixture levels, with sets of multinomial components as nodes connected by discrete random values as directed edges. Conditioned on discrete inputs, each mixture level chooses one of its components to generate discrete output propagated to the next level(s), until one or more final levels produce observable discrete data. For some examples from literature, the corresponding mixture networks are shown in Fig. 1 (in example models, we use the symbols from the original literature), including the variant of observed multinomial parameters substituting the Dirichlet prior, which will be discussed further below.

For the following derivations, we introduce sets of discrete variables X, multinomial parameters Θ and Dirichlet hyperparameters A as model-wide quantities, and the corresponding level-specific quantities X^l, Θ^l, A^l, where superscript l indicates the mixture level. The constraint of connecting different mixture levels (ellipses in Fig. 1) via discrete variables (arrows in Fig. 1) can be expressed by an operator ↑x^l that yields all parent variables of a mixture level l ∈ L generating variable x^l. Here ↑x^l can refer to specific tokens ↑x_i^l or configurations ↑X^l.
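The abstraction of a model as coupled mixture levels can be illustrated with a small encoding sketch. This is our own encoding, not notation from the paper: the class name MixtureLevel, the mapping g and the array doc_of are hypothetical; g selects the component index k_i from the incoming parent value and the token index i. LDA (Fig. 1a) then consists of a document level feeding a topic level:

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class MixtureLevel:
        name: str
        n_components: int                       # number of multinomial components on this level
        n_values: int                           # size of the discrete output domain
        g: Callable[[Optional[int], int], int]  # k_i = g(parent value, token index i)

    # LDA as two coupled levels: doc_of[i] gives the document of token i
    doc_of = [0, 0, 1, 1, 1]                    # toy corpus with 5 tokens in 2 documents
    K, V, M = 4, 1000, 2

    lda_levels = [
        MixtureLevel("theta", n_components=M, n_values=K,
                     g=lambda parent, i: doc_of[i]),   # component chosen by document index m
        MixtureLevel("beta",  n_components=K, n_values=V,
                     g=lambda parent, i: parent),      # component chosen by incoming topic z_i
    ]

    k_i = lda_levels[0].g(None, 2)   # token 2 belongs to document 1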

Based on this and the definitions of the multinomial and Dirichlet distributions, the joint likelihood of any TM is:

p(X, Θ | A) = ∏_{l∈L} p(X^l, Θ^l | A^l, ↑X^l) = ∏_l [ ∏_i Mult(x_i | Θ, ↑x_i) ∏_k Dir(ϑ_k | A, ↑X) ]^l   (1)
            = ∏_l [ ∏_i ϑ_{k_i, x_i} ∏_k ( Γ(∑_t α_{j,t}) / ∏_t Γ(α_{j,t}) ) ∏_t ϑ_{k,t}^{α_{j,t}−1} ]^l ;   k_i^l = g^l(↑x_i^l, i),  j^l = f^l(k^l)
            = ∏_l [ ∏_k (1/Δ(α_j)) ∏_t ϑ_{k,t}^{n_{k,t}+α_{j,t}−1} ]^l ;   n^l_{k,t} = ∑_i δ(k_i − k) δ(x_i − t).   (2)

In this equation, some further notation is introduced: We use brackets [·]^l to indicate that the contained quantities are specific to level l. Moreover, the mappings from parent variables to component indices k_i are expressed by the (level-specific) k_i = g(↑x_i, i), and n^l_{k,t} is the number of times that a configuration {↑x_i, i} for level l led to component k^l. Further, models are allowed to group components by providing group-specific hyperparameters α_j with mapping j = f(k). Finally, Δ(α) is the normalisation function of the Dirichlet distribution, a K-dimensional beta function: Δ(α) ≜ ∏_t Γ(α_t) / Γ(∑_t α_t).
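The count statistics n^l_{k,t} and the normaliser Δ(α) of Eq. 2 translate directly into code. The following sketch uses our own helper names (log_delta, level_counts) and standard SciPy functions:

    import numpy as np
    from scipy.special import gammaln

    def log_delta(alpha):
        # log of the Dirichlet normaliser: sum_t log Gamma(alpha_t) - log Gamma(sum_t alpha_t)
        alpha = np.asarray(alpha, dtype=float)
        return gammaln(alpha).sum() - gammaln(alpha.sum())

    def level_counts(k_idx, x_vals, K, T):
        # n_{k,t} = sum_i delta(k_i - k) delta(x_i - t) for one mixture level
        n = np.zeros((K, T))
        np.add.at(n, (k_idx, x_vals), 1.0)
        return n

    # toy example: 6 tokens with component indices k_i and discrete outputs x_i
    k_idx = np.array([0, 0, 1, 2, 2, 2])
    x_vals = np.array([3, 1, 0, 3, 3, 2])
    n = level_counts(k_idx, x_vals, K=3, T=4)
    print(n, log_delta([0.1, 0.1, 0.1, 0.1]))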

3 Variational Bayes for topic models

As in many latent-variable models, determining the posterior distribution p(H, Θ | V) = p(V, H, Θ) / ∑_H ∫ p(V, H, Θ) dΘ with hidden and visible variables {H, V} = X is intractable in TMs because of excessive dependencies between the sets of latent variables H and parameters Θ in the marginal likelihood p(V) = ∑_H ∫ p(V, H, Θ) dΘ in the denominator. Variational Bayes [2] is an approximative inference technique that relaxes the structure of p(H, Θ | V) by a simpler variational distribution q(H, Θ | Ψ, Ξ), conditioned on sets of free variational parameters Ψ and Ξ to be estimated in lieu of H and Θ. Minimising the Kullback-Leibler divergence of the distribution q to the true posterior can be shown to be equivalent to maximising a lower bound on the log marginal likelihood:

log p(V) ≥ log p(V) − KL{q(H, Θ) ‖ p(H, Θ | V)} = ⟨log p(V, H, Θ)⟩_{q(H,Θ)} + H{q(H, Θ)} ≜ F{q(H, Θ)}   (3)

with entropy H{·}. F{q(H, Θ)} is the (negative) variational free energy, the quantity to be optimised using an EM-like algorithm that alternates between (E) maximising F w.r.t. the variational parameters to pull the lower bound towards the marginal likelihood and (M) maximising F w.r.t. the true parameters to raise the marginal likelihood.

Mean-field approximation. Following the variational mean-field approach [2], in the LDA model the variational distribution consists of fully factorised Dirichlet and multinomial distributions [6] (in [6] this refers to the smoothed version; it is described in more detail in [5]):

q(z, β, ϑ | φ, λ, γ) = ∏_{m=1}^M ∏_{n=1}^{N_m} Mult(z_{m,n} | φ_{m,n}) · ∏_{k=1}^K Dir(β_k | λ_k) · ∏_{m=1}^M Dir(ϑ_m | γ_m).   (4)

In [6], this approach proved very successful, which raises the question of how it can be transferred to more generic TMs. Our approach is to view Eq. 4 as a special case of a more generic variational structure that captures dependencies ↑X between multiple hidden mixture levels and includes LDA for the case of one hidden level (H = {z}):

q(H, Θ | Ψ, Ξ) = ∏_{l∈H} [ ∏_i Mult(x_i | ψ_i, ↑x_i) ∏_k Dir(ϑ_k | ξ_k, ↑X) ]^l,   (5)

where l ∈ H refers to all levels that produce hidden variables. In the following, we assume that the indicator i is identical for all levels l, e.g., words in documents, i^l = i ≜ (m, n). Further, tokens i in the corpus can be grouped into terms v, and (observable) document-specific term frequencies n_{m,v} are introduced. We use the shorthand u = (m, v) to refer to specific unique tokens or document-term pairs.

Topic field. The dependency between mixture levels, ↑x^l_u, can be expressed by the likelihood of a particular configuration of hidden variables x_u = t ≜ {x^l_u = t^l}_{l∈H} under the variational distribution: ψ_{u,t} = q(x_u = t | Ψ). The complete structure ψ_u (the joint distribution over all l ∈ H, with Ψ = {ψ_u}_u) is a multi-way array of likelihoods for all latent configurations of token u, with as many index dimensions as there are dependent variables. For instance, Fig. 1 reveals that LDA has one hidden variable with dimension K while PAM has two with dimensions s_1 × s_2. Because of its interpretation as a mean field of topic states in the model, we refer to ψ_u as a topic field (in underline notation). We further define ψ^l_{u,k,t} as the likelihood of configuration (k^l, t^l) for document-term pair u. This marginal of ψ_u depends on the mappings between parent variables ↑x_u and components k on each level. To obtain ψ^l_{u,k,t}, the topic field ψ_u is summed over all descendant paths that x_u = t causes and the ancestor paths that can cause k = g(↑x_u, u) on level l according to the generative process:

ψ^l_{u,k,t} = ∑_{ {t^l_A, t^l_D} } ψ_{u; (t^l_A, k^l, t^l, t^l_D)} ;   t^l_A = path causing k^l,  t^l_D = path caused by t^l.   (6)

Descendant paths t^l_D of t^l are obtained via recursion of k = g(↑x^d_u, u) over l's descendant levels d. Assuming bijective g(·) as in the TMs in Fig. 1, the ancestor paths t^l_A that correspond to components in parents leading to k^l are obtained via (↑x^a_u, u) = g^{−1}(k) on l's ancestor levels a, recursively. Each pair {t^l_A, t^l_D} corresponds to one element in ψ_u per {k^l, t^l}, at index vector t = (t^l_A, k^l, t^l, t^l_D).

Free energy. Using Eqs. 2, 3, 5 and 6, the free energy of the generic model becomes:

F = ∑_l [ ∑_k ( log Δ(ξ_k) − log Δ(α_j) + ∑_t ( (∑_u n_u ψ_{u,k,t}) + α_{j,t} − ξ_{k,t} ) μ_t(ξ_k) ) ]^l − ∑_u n_u ∑_t ψ_{u,t} log ψ_{u,t} = ∑_l F^l + H{Ψ},   (7)

where μ_t(ξ) ≜ Ψ(ξ_t) − Ψ(∑_t ξ_t) = ⟨log ϑ_t⟩_{Dir(ϑ|ξ)} = ∂/∂ξ_t log Δ(ξ), and Ψ(ξ) ≜ d/dξ log Γ(ξ) is the digamma function (note the distinction between the function Ψ(·) and the quantity Ψ).
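For a model with two hidden levels of sizes s_1 × s_2 (as in PAM), the topic field of one document-term pair is an s_1 × s_2 array, and the marginals of Eq. 6, as well as the digamma expectations μ_t(ξ) entering Eq. 7, can be sketched as follows. The names mu and psi_u are ours, and we assume, as for the models in Fig. 1, bijective mappings so that the marginal of one level is obtained by summing out the index dimensions of the others:

    import numpy as np
    from scipy.special import digamma

    def mu(xi):
        # expected log parameters under Dir(theta | xi): mu_t(xi) = Psi(xi_t) - Psi(sum_t xi_t)
        xi = np.asarray(xi, dtype=float)
        return digamma(xi) - digamma(xi.sum())

    print(mu([1.0, 2.0, 3.0]))

    # topic field for one document-term pair u in a PAM-like model with two hidden
    # levels of sizes s1 x s2: a joint distribution over latent configurations
    s1, s2 = 5, 10
    psi_u = np.random.default_rng(1).random((s1, s2))
    psi_u /= psi_u.sum()                 # q(x_u = t | Psi) sums to one over all configurations t

    # marginals of Eq. 6: sum over the index dimensions of the other levels
    psi_super = psi_u.sum(axis=1)        # likelihood of the super-topic value (component of the next level)
    psi_sub = psi_u.sum(axis=0)          # likelihood of the sub-topic value (component of the word level)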

Variational E-steps. In the E-step of each model, the variational distributions for the joint multinomial ψ_u for each token (its topic field) and the Dirichlet parameters ξ^l_k on each level need to be estimated. The updates can be derived from the generic Eq. 7 by setting derivatives with respect to the variational parameters to zero, which yields:

ψ_{u,t} ∝ exp( ∑_l [ μ_t(ξ_k) ]^l ),   (8)
ξ^l_{k,t} = [ ( ∑_u n_u ψ_{u,k,t} ) + α_{j,t} ]^l,   (9)

where the sum ∑_u n_u ψ^l_{u,k,t} for level l can be interpreted as the expected count ⟨n^l_{k,t}⟩_q of co-occurrence of the value pair (k^l, t^l). (In Eq. 8 we assume that t^l = v on the final mixture level(s), the "leaves", which ties observed terms v to the latent structure. For root levels where component indices are observed, μ_t(ξ_k) in Eq. 8 can be replaced by Ψ(ξ_{k,t}).) The result in Eqs. 8 and 9 perfectly generalises that for LDA in [5].

M-steps. In the M-step of each model, the Dirichlet hyperparameters α^l_j (or scalar α^l) are calculated from the variational expectations of the log model parameters ⟨log ϑ_{k,t}⟩_q = μ_t(ξ_k), which can be done per mixture level (Eq. 9 has no reference to α^l_j across levels). Each estimator for α_j (omitting level l) should see only the expected parameters μ_t(ξ_k) of the K_j components associated with its group j = f(k). We assume that components are associated a priori (e.g., PAM in Fig. 1c has ϑ_{m,x} ∼ Dir(α_x)) and K_j is known. Then the Dirichlet ML parameter estimation procedure given in [6,10] can be used in modified form. It is based on Newton's method with the Dirichlet log likelihood function f as well as its gradient and Hessian elements g_t and h_tu:

f(α_j) = −K_j log Δ(α_j) + ∑_t (α_{j,t} − 1) ∑_{k: f(k)=j} μ_t(ξ_k)
g_t(α_j) = −K_j μ_t(α_j) + ∑_{k: f(k)=j} μ_t(ξ_k)
h_tu(α_j) = K_j Ψ′(∑_s α_{j,s}) − δ(t−u) K_j Ψ′(α_{j,t}) ≜ z + δ(t−u) h_tt
α_{j,t} ← α_{j,t} − (H^{−1} g)_t = α_{j,t} − h_tt^{−1} ( g_t − ( ∑_s g_s h_ss^{−1} ) / ( z^{−1} + ∑_s h_ss^{−1} ) ).   (10)

A scalar α (without grouping) is found accordingly via the symmetric Dirichlet:

f = −K[ T log Γ(α) − log Γ(Tα) ] + (α − 1) s_α,   s_α = ∑_{k=1}^K ∑_{t=1}^T μ_t(ξ_k),
g = KT[ Ψ(Tα) − Ψ(α) ] + s_α,   h = KT[ T Ψ′(Tα) − Ψ′(α) ],   α ← α − g h^{−1}.   (11)

Variants. As an alternative to Bayesian estimation of all mixture level parameters, for some mixture levels ML point estimates may be used that are computationally less expensive (e.g., unsmoothed LDA [6]). By applying ML only to levels without document-specific components, the generative process for unseen documents is retained. The E-step with ML levels has a simplified Eq. 8, and ML parameters ϑ^c are estimated in the M-step (instead of hyperparameters):

ψ_{u,t} ∝ exp( ∑_{l\c} [ μ_t(ξ_k) ]^l ) · ϑ^c_{k,t},   ϑ^c_{k,t} = ⟨n^c_{k,t}⟩_q / ⟨n^c_k⟩_q ∝ ∑_u n_u ψ^c_{u,k,t}.   (12)
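The scalar update of Eq. 11 is easy to implement with standard special functions. The sketch below uses our own naming and a plain Newton iteration with only a positivity clamp as safeguard, so it is an illustration of the update rather than a production routine:

    import numpy as np
    from scipy.special import digamma, polygamma

    def update_symmetric_alpha(alpha, xi, n_iter=20):
        # Newton updates for a scalar (symmetric) Dirichlet hyperparameter, Eq. 11;
        # xi has shape (K, T) and holds the variational Dirichlet parameters xi_k.
        K, T = xi.shape
        # s_alpha = sum_k sum_t mu_t(xi_k), the summed expected log parameters
        s_alpha = np.sum(digamma(xi) - digamma(xi.sum(axis=1, keepdims=True)))
        for _ in range(n_iter):
            g = K * T * (digamma(T * alpha) - digamma(alpha)) + s_alpha
            h = K * T * (T * polygamma(1, T * alpha) - polygamma(1, alpha))
            alpha = alpha - g / h
            alpha = max(alpha, 1e-6)      # keep the hyperparameter positive
        return alpha

    alpha = update_symmetric_alpha(0.5, np.random.default_rng(2).random((10, 20)) + 0.1)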

Moreover, as an extension to the framework specified in Sec. 2, it is straightforward to introduce observed parameters that can, for instance, represent labels, as in the author-topic model, cf. Fig. 1. In the free energy in Eq. 7, the term with μ_t(ξ_k) is replaced by (∑_u n_u ψ_{u,k,t}) log ϑ_{k,t}, and consequently Eq. 8 takes the form of Eq. 12 (left) as well. Other variants, like specific distributions for priors (e.g., logistic-normal to model topic correlation [4] and non-parametric approaches [14]) and observations (e.g., Gaussian components to model continuous data [1]), will not be covered here.

Algorithm structure. The complete variational EM algorithm alternates between the variational E-step and M-step until the variational free energy F converges at an optimum. At convergence, the estimated document and topic multinomials can be obtained via the variational expectation log ϑ̂_{k,t} = μ_t(ξ_k). Initialisation plays an important role to avoid local optima, and a common approach is to initialise topic distributions with observed data, possibly using several such initialisations concurrently. The actual variational EM loop can be outlined in its generic form as follows (a concrete sketch for the LDA special case is given below):

1. Repeat E-step loop until convergence w.r.t. variational parameters:
   1. For each observed unique token u:
      1. For each configuration t: calculate var. multinomial ψ_{u,t} (Eq. 8 or 12 left).
      2. For each (k, t) on each level l: calculate var. Dirichlet parameters ξ^l_{k,t} based on topic field marginals ψ^l_{u,k,t} (Eqs. 6 and 9), which can be done differentially: ξ^l_{k,t} ← ξ^l_{k,t} + n_u Δψ^l_{u,k,t}, with Δψ^l_{u,k,t} the change of ψ^l_{u,k,t}.
   2. Finish variational E-step if free energy F (Eq. 7) converged.
2. Perform M-step:
   1. For each j on each level l: calculate hyperparameter α^l_{j,t} (Eqs. 10 or 11), inner iteration loop over t.
   2. For each (k, t) in point-estimated nodes l: estimate ϑ^l_{k,t} (Eq. 12 right).
3. Finish variational EM loop if free energy F (Eq. 7) converged.

In practice, similar to [5], this algorithm can be modified by separating levels with document-specific variational parameters Ξ^{l,m} from those with corpus-wide parameters Ξ^l. This allows a separate E-step loop for each document m that updates ψ_u and Ξ^{l,m} with Ξ^l fixed. Parameters Ξ^l are updated afterwards from the changes Δψ^l_{u,k,t} accumulated in the document-specific loops, and their contribution added to F.
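For the special case of LDA (one hidden level), the loop above reduces to the familiar coupled updates of Eqs. 8 and 9. The following compact sketch is our own illustration, not the authors' implementation: gamma and lam play the role of ξ on the document and topic levels, phi plays the role of the topic field, and the batch schedule follows the document-specific/corpus-wide separation just described.

    import numpy as np
    from scipy.special import digamma

    def lda_vb(n_mv, K, alpha=0.1, eta=0.01, n_outer=50, n_inner=20, seed=0):
        # Mean-field VB for smoothed LDA: n_mv is an (M, V) document-term count matrix;
        # returns variational Dirichlet parameters gamma (documents) and lam (topics).
        rng = np.random.default_rng(seed)
        M, V = n_mv.shape
        lam = rng.gamma(100.0, 0.01, size=(K, V))                          # xi of the beta level
        gamma = np.zeros((M, K)) + alpha + n_mv.sum(1, keepdims=True) / K  # xi of the theta level
        for _ in range(n_outer):
            e_log_beta = digamma(lam) - digamma(lam.sum(1, keepdims=True))   # mu_t(xi_k) per topic
            lam_stat = np.zeros((K, V))
            for m in range(M):
                ids = np.nonzero(n_mv[m])[0]                 # terms occurring in document m
                cts = n_mv[m, ids]
                for _ in range(n_inner):                     # document-specific E-step loop
                    e_log_theta = digamma(gamma[m]) - digamma(gamma[m].sum())
                    phi = np.exp(e_log_theta[:, None] + e_log_beta[:, ids])   # Eq. 8 (topic field)
                    phi /= phi.sum(axis=0, keepdims=True)
                    gamma[m] = alpha + phi @ cts             # Eq. 9, document level
                lam_stat[:, ids] += phi * cts                # accumulate expected counts
            lam = eta + lam_stat                             # Eq. 9, topic level
        return gamma, lam

    # toy usage: 20 documents over a 50-term vocabulary, 5 topics
    counts = np.random.default_rng(1).integers(0, 4, size=(20, 50))
    gamma, lam = lda_vb(counts, K=5)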

4 Experimental verification

In this section, we present initial validation results based on the algorithm in Sec. 3.

Setting. We chose models from Fig. 1, LDA, ATM and PAM, and investigated two versions of each: an unsmoothed version that performs ML estimation of the final mixture level (using Eq. 12) and a smoothed version that places variational distributions over all parameters (using Eq. 8). Except for the component grouping in PAM (the ϑ_{m,x} have vector hyperparameter α_x), we used scalar hyperparameters. As a baseline, we used Gibbs sampling implementations of the corresponding models. Two criteria are immediately useful: the ability to generalise to test data V′ given the model parameters Θ, and the convergence time (assuming single-threaded operation).

For the first criterion, because of its frequent usage with topic models we use the perplexity, the inverse geometric mean of the likelihood of test data tokens given the model: P(V′) = exp( −∑_u n_u log p(v_u | Θ′) / W′ ), where Θ′ are the parameters fitted to the test data V′ with W′ tokens. The log likelihood of test tokens log p(v_u | Θ′) is obtained by (1) running the inference algorithms on the test data, which yields Ξ′ and consequently Θ′, and (2) marginalising all hidden variables h_u in the likelihood p(v_u | h_u, Θ′) = ∏_l [ϑ_{k,t}]^l. (In contrast to [12], we also used this method to determine ATM perplexity, from the φ_k.) The experiments were performed on the NIPS corpus [11] with M = 1740 documents (174 held out), V terms, W tokens, and A = 2037 authors.

Fig. 2. Results of VB and Gibbs experiments: convergence time [h], iteration time [sec], number of iterations, and perplexity for LDA, ATM and PAM (dimension settings A, B: K = {25, 100} for LDA and ATM, (s_1, s_2) = {(5, 10), (25, 25)} for PAM) under Gibbs sampling (GS), unsmoothed VB (VB ML) and smoothed VB.

Results. The results of the experiments are shown in Fig. 2. It turns out that generally the VB algorithms were able to achieve perplexity reductions in the range of their Gibbs counterparts, which verifies the approach taken. Further, the full VB approaches tend to yield slightly improved perplexity reductions compared to the ML versions. However, these first VB results were consistently weaker compared to the baselines. This may be due to adverse initialisation of the variational distributions, causing the VB algorithms to become trapped at local optima. It may alternatively be a systematic issue due to the correlation between Ψ and Ξ, which are assumed independent in Eq. 5, a fact that has motivated the collapsed variant of variational Bayes in [13]. Considering the second evaluation criterion, the results show that the current VB implementations generally converge less than half as fast as the corresponding Gibbs samplers. This is why work is currently being undertaken on code optimisation, including parallelisation for multi-core CPUs, which, as opposed to (collapsed) Gibbs samplers, is straightforward for VB.
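The perplexity used above can be computed directly from the fitted test-data parameters. The sketch below covers the LDA-structured case, with our own names; theta_test and beta stand for the document and topic multinomials recovered from the variational parameters on the test data (e.g., by normalising the Dirichlet parameters or via the expectations log ϑ̂_{k,t} = μ_t(ξ_k)):

    import numpy as np

    def perplexity(n_test_mv, theta_test, beta):
        # P(V') = exp( - sum_u n_u log p(v_u | Theta') / W' ) for LDA-structured parameters:
        # n_test_mv is the (M', V) document-term count matrix of the held-out data,
        # theta_test the (M', K) document-topic multinomials, beta the (K, V) topic-term multinomials.
        p_w_given_m = theta_test @ beta                       # (M', V), hidden topics marginalised out
        log_lik = np.sum(n_test_mv * np.log(p_w_given_m + 1e-12))
        W = n_test_mv.sum()
        return np.exp(-log_lik / W)

    # toy usage with random parameters
    rng = np.random.default_rng(3)
    theta_test = rng.dirichlet(np.ones(5), size=4)            # 4 test documents, 5 topics
    beta = rng.dirichlet(np.ones(30), size=5)                 # 5 topics over 30 terms
    n_test = rng.integers(0, 3, size=(4, 30))
    print(perplexity(n_test, theta_test, beta))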

5 Conclusions

We have derived variational Bayes algorithms for a large class of topic models by generalising from the well-known model of latent Dirichlet allocation. By an abstraction of these models as systems of interconnected mixture levels, we could obtain variational update equations in a generic way, which are the basis for an algorithm that can easily be applied to specific topic models. Finally, we have applied the algorithm to a couple of example models, verifying the general applicability of the approach. So far, especially the more complex topic models have predominantly used inference based on Gibbs sampling. Therefore, this paper is a step towards exploring the possibility of variational approaches. However, as can be concluded from the experimental study in this paper, more work remains to be done in order to make VB algorithms as effective and efficient as their Gibbs counterparts.

Related work. Beside the relation to the original LDA model [6,5], especially the proposed representation of topic models as networks of mixture levels makes work on discrete DAG models relevant: In [3], a variational approach for structure learning in DAGs is provided, with an alternative derivation based on exponential families leading to a structure similar to the topic field. They do not discuss mapping of components or hyperparameters and restrict their implementations to structure learning in graphs bipartite between hidden and observed nodes. Also, the authors of [9] present their pachinko allocation models as DAGs, but formulate inference based on Gibbs sampling. In contrast to this, the novelty of the work presented here is that it unifies the theory of topic models in general, including labels, the option of point estimates and component grouping for variational Bayes, giving empirical results for real-world topic models. Future work will optimise the current implementations with respect to efficiency in order to improve the experimental results presented here, and an important aspect is to develop parallel algorithms for the models at hand. Another research direction is the extension of the framework of generic topic models, especially taking into consideration the variants of mixture levels outlined in Sec. 3. Finally, we will investigate a generalisation of collapsed variational Bayes [13].

References

1. K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3(6):1107–1135, 2003.
2. M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
3. M. J. Beal and Z. Ghahramani. Variational Bayesian learning of directed graphical models with hidden variables. Bayesian Analysis, 1:793–832, 2006.
4. D. Blei and J. Lafferty. A correlated topic model of science. AOAS, 1:17–35, 2007.
5. D. Blei, A. Ng, and M. Jordan. Hierarchical Bayesian models for applications in information retrieval. Bayesian Statistics, 7:25–44, 2003.
6. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. JMLR, 3:993–1022, 2003.
7. G. Heinrich. A generic approach to topic models. In ECML/PKDD, 2009.
8. W. Li, D. Blei, and A. McCallum. Mixtures of hierarchical topics with pachinko allocation. In ICML, 2007.
9. W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, 2006.
10. T. Minka. Estimating a Dirichlet distribution. Web, 2000.
11. NIPS corpus. roweis/data.html.
12. M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In ACM SIGKDD, 2004.
13. Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, volume 19, 2007.
14. Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Technical Report 653, Department of Statistics, University of California at Berkeley, 2004.
