
On Topic Evolution

Eric P. Xing
School of Computer Science
Carnegie Mellon University
Technical Report CMU-CALD-05-115, December 2005

Abstract

I introduce topic evolution models for longitudinal epochs of word documents. The models employ marginally dependent latent state-space models for evolving topic proportion distributions and topic-specific word distributions, and either a logistic-normal-multinomial or a logistic-normal-Poisson model for document likelihood. These models allow posterior inference of latent topic themes over time, and topical clustering of longitudinal document epochs. I derive a variational inference algorithm for non-conjugate generalized linear models based on truncated Taylor approximation, and I also outline formulae for parameter estimation based on the variational EM principle.

1 Introduction

Text information, such as media documents, journal articles and emails, often comes as temporal streams. Current information retrieval systems working on corpora collected over time make little use of the time stamps associated with the documents. They often merely pool all the documents into a single collection, in which each document is treated as an iid sample from some topical distribution [Hofmann, 1999; Blei et al., 2003; Griffiths and Steyvers, 2004]; or they model the topics of each time-specific epoch separately and then examine relationships among the independently inferred time-specific topics [Steyvers et al., 2004]. In practice, the topic themes that generate the documents can evolve over time, and there exist dependencies among documents over time. In this report, I develop a principled statistical framework for modeling topic evolution and extracting high-level insights into the topic history based on latent-space dynamic processes, and I derive the formulae for posterior inference and parameter estimation.

2 Topic Evolution

Let $D_1, \ldots, D_T$ represent a temporal series of corpora, where $D_t = \{x_d\}_{d=1}^{N_t}$ denotes the set of $N_t$ documents available at time $t$; $x_d$ denotes a document consisting of a word sequence $(x_{d,1}, \ldots, x_{d,N_d})$; and $n_d = (n_{d,1}, \ldots, n_{d,M})$ denotes an $M$-dimensional count vector recording the frequencies, in document $d$, of the $M$ words defined by a fixed vocabulary. We assume that every document $x_d$ can express multiple topics coming from a predefined topic space, and that the weights of the topics can be represented by a normalized vector $\theta_d$ of fixed dimension. Furthermore, we assume that each topic can be represented by a set of parameters that determine how words from a fixed vocabulary are drawn in a topic-specific manner to compose the document (for simplicity, here we assume a bag-of-words model for the word-to-document relationship, so that topic-specific semantics translate only to measures on word rates, not to non-trivial syntactic grammars). Under a topic evolution model, the prior distributions of the topic proportions of every document, and the representations of each of the topics themselves, evolve over time. In the following, I present two topic evolution models defined on two different kinds of topic representations, and derive the variational inference formulas in each case.

2.1 A Dynamic Logistic-Normal-Multinomial Model

In this model we assume that each document is an admixture of topics, resulting from a bag of topic-specific instances of words, each of which is marginally a mixture of topics. Each topic, say topic $k$, is represented by an $M$-dimensional normalized word frequency vector $\beta_k$, which parameterizes a topic-specific multinomial distribution over words. Here is an outline of a generative process under such a model (a graphical model representation of this model is illustrated in Figure 1). We assume that the topic proportion vector $\theta_d$ for each document follows a time-specific logistic normal prior $\mathcal{LN}(\mu_t, \Sigma_t)$, whose mean $\mu_t$ evolves over time according to a linear Gaussian model (for simplicity, we assume that the $\Sigma_t$'s capturing time-specific topic correlations are independent across time):

- $\mu_1 \sim \mathrm{Normal}(\nu, \Phi)$: sample the mean of the topic mixing prior at time 1.
- $\mu_t \sim \mathrm{Normal}(A \mu_{t-1}, \Phi)$: sample the means of the topic mixing priors over subsequent time points.
- $\theta_d \sim \mathrm{LogisticNormal}(\mu_t, \Sigma_t)$: for each document, sample a topic proportion vector.

Notice that the last step above can be broken into two sub-steps (for simplicity, in the sequel we will omit the time index $t$ and/or the document index $d$ when describing a general law that applies to all time points and/or all documents):

- $\gamma_d \sim \mathrm{Normal}(\mu_t, \Sigma_t)$;
- $\theta_{d,k} = \exp(\gamma_{d,k}) / \sum_{l} \exp(\gamma_{d,l})$, for $k = 1, \ldots, K$.

Furthermore, due to the normalizability constraint on the multinomial parameters, $\theta_d$ has only $K-1$ degrees of freedom. Thus, as described in detail in the sequel, we only need to draw the first $K-1$ components of $\gamma_d$ from a $(K-1)$-dimensional multivariate Gaussian, and fix $\gamma_{d,K} = 0$. For simplicity, we omit this technicality in the forthcoming general description of the model.
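To make the moving parts concrete, here is a minimal illustrative sketch of the dynamic logistic-normal prior over topic proportions (all function and variable names are my own, not the report's; random-walk dynamics $A = I$ and isotropic covariances are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_topic_proportions(T, N_t, K, sigma_mu=0.1, sigma_gamma=0.5):
    """Sample document topic proportions from the dynamic logistic-normal prior.

    mu_t follows a Gaussian random walk (A = I, Phi = sigma_mu^2 I); each
    document's gamma_d is drawn around mu_t and mapped to the simplex by the
    logistic (softmax) transformation, with the K-th component of gamma pinned
    to 0 to remove the extra degree of freedom.
    """
    mu = np.zeros(K - 1)                      # mean of the topic-mixing prior
    thetas = []                               # thetas[t][d] is a point on the K-simplex
    for t in range(T):
        mu = mu + sigma_mu * rng.standard_normal(K - 1)   # mu_t ~ N(mu_{t-1}, Phi)
        docs = []
        for d in range(N_t):
            gamma = mu + sigma_gamma * rng.standard_normal(K - 1)
            gamma = np.append(gamma, 0.0)     # gamma_K = 0
            theta = np.exp(gamma) / np.exp(gamma).sum()   # logistic transform
            docs.append(theta)
        thetas.append(np.array(docs))
    return thetas

thetas = simulate_topic_proportions(T=5, N_t=10, K=4)
assert np.allclose(thetas[0].sum(axis=1), 1.0)            # valid simplex points
```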

We further assume that the representation of each topic, in this case a topic-specific multinomial vector $\beta_k$ of word frequencies, also evolves over time. By defining $\beta_k$ as a logistic transformation of a multivariate normal random vector $\eta_k$, we can model the temporal evolution of $\beta_k$ in a simplex via a linear Gaussian dynamics model:

- $\eta_k^{(1)} \sim \mathrm{Normal}(\iota, \Psi)$: sample topic $k$ at time 1.
- $\eta_k^{(t)} \sim \mathrm{Normal}(B \eta_k^{(t-1)}, \Psi)$: sample topic $k$ over subsequent time points.
- $\beta_{k,w} = \exp(\eta_{k,w}) / \sum_{w'} \exp(\eta_{k,w'})$, for $w = 1, \ldots, M$: compute word probabilities via the logistic transformation.

Now we assume that each occurrence of a word, e.g., the $n$th word in document $d$ at time $t$, $x_{d,n}$, is drawn from a topic-specific word distribution $\beta_k$ specified by a latent topic indicator $z_{d,n}$:

- $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$: sample the latent topic indicator (again, for simplicity, the indices $t$ and $d$ will be omitted in the sequel where no confusion arises).
- $x_{d,n} \mid z_{d,n} = k \sim \mathrm{Multinomial}(\beta_k)$: sample the word from a topic-specific word distribution.

In principle, we can use the above topic evolution model to capture not only topic correlation among documents at a specific time (as did [Blei and Lafferty, 2006]), but also dynamic coupling (i.e., co-evolution) of topics via the covariance matrix $\Phi$, and topic-specific word coupling via the covariance matrices $\Psi$. In the simplest scenario, when $A = I$, $B = I$, $\Phi = \sigma I$, and $\Psi = \rho I$, this model reduces to a random walk in both the topic space and the topic-mixing space. Since in most realistic temporal series of corpora both the proportions of topics and the semantic representations of topics are unlikely to be invariant over time, we expect that even a random-walk topic evolution model can provide a better fit to the data than a static model that ignores the time stamps of all documents.

2.2 A Dynamic Log-Normal-Poisson Model

The above topic evolution process assumes an admixture likelihood model for documents belonging to a specific time interval, and the admixing is realized at the word level, i.e., the marginal probability of each word in the document is defined by a mixture of topic-specific word distributions. Now we present another text likelihood model, employing a different topic-mixing mechanism, which can also be plugged into the topic evolution model. Note that in a bag-of-words model all we observe are counts of words in the documents. Instead of assuming that each occurrence of a word is sampled from a topic-specific word distribution, we can directly assume that the total count $n_w$ of word $w$ is made up of fractions, each contributed by a specific topic according to a topic-specific Poisson distribution $\mathrm{Poisson}(\omega \theta_k \tau_{w,k})$, where $\omega$ denotes the length of the document, $\theta_k$ denotes the proportion of topic $k$ in the document as defined before, and $\tau_{w,k}$ is a rate measure for word $w$ associated with topic $k$. Specifically, $n_w = \sum_k n_{w,k}$, where $n_{w,k} \sim \mathrm{Poisson}(\omega \theta_k \tau_{w,k})$. It can be shown that under this model we have:

$$n_w \sim \mathrm{Poisson}\Big(\omega \sum_k \theta_k \tau_{w,k}\Big), \qquad p(n_w) = \frac{\exp\Big\{ n_w \log\big(\omega \sum_k \theta_k \tau_{w,k}\big) - \omega \sum_k \theta_k \tau_{w,k} \Big\}}{\Gamma(n_w + 1)}.$$
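The last identity is the superposition property of the Poisson distribution; a quick Monte Carlo check (illustrative code with made-up rates, not from the report) confirms that summing the topic-specific counts is distributionally equivalent to a single Poisson draw with the pooled rate:

```python
import numpy as np

rng = np.random.default_rng(1)

K, omega = 4, 200                            # topics, document length
theta = np.array([0.4, 0.3, 0.2, 0.1])       # topic proportions
tau_w = np.array([0.02, 0.05, 0.01, 0.08])   # per-topic rates for word w

# Sum of per-topic counts n_{w,k} ~ Poisson(omega * theta_k * tau_{w,k}) ...
samples_sum = rng.poisson(omega * theta * tau_w, size=(100_000, K)).sum(axis=1)
# ... has the same law as a single Poisson with the pooled rate.
samples_pooled = rng.poisson(omega * theta @ tau_w, size=100_000)

print(samples_sum.mean(), samples_pooled.mean())  # both ~ omega * theta @ tau_w
print(samples_sum.var(), samples_pooled.var())    # Poisson: variance == mean
```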

[Figure 1: A graphical model representation of the dynamic logistic-normal-multinomial model for topic evolution.]

Note that in the above setting, for each word $w$ we have a row vector of rates, each associated with a specific topic: $\vec{\tau}_w = (\tau_{w,1}, \ldots, \tau_{w,K})$. For each topic $k$, we have a column vector of rates, each associated with a specific word: $\vec{\tau}_k = (\tau_{1,k}, \ldots, \tau_{M,k})$. Unlike the multinomial topic model, which is parameterized by the column-normalized topic matrix $\beta = [\vec{\beta}_1, \ldots, \vec{\beta}_K]$, the Poisson topic model is parameterized by a matrix $\tau = [\vec{\tau}_1, \ldots, \vec{\tau}_K]$ that does not have to be column- or row-normalized. Thus we can directly use a log-normal distribution to model $\tau$, which is simpler than the logistic-normal distribution. This leads to the following generative model for topic evolution (assuming we are interested in modeling cross-topic coupling of word rates):

- $\mu_1 \sim \mathrm{Normal}(\nu, \Phi)$: sample the mean of the topic mixing prior at time 1.
- $\mu_t \sim \mathrm{Normal}(A \mu_{t-1}, \Phi)$: sample the means of the topic mixing priors over subsequent time points.
- $\theta_d \sim \mathrm{LogisticNormal}(\mu_t, \Sigma_t)$: for each document, sample a topic proportion vector.
- $\zeta_w^{(1)} \sim \mathrm{Normal}(0, \Psi_w)$: sample rates for word $w$ at time 1.
- $\zeta_w^{(t)} \sim \mathrm{Normal}(B_w \zeta_w^{(t-1)}, \Psi_w)$: sample rates for word $w$ over subsequent time points.
- $\tau_{w,k} = \exp(\zeta_{w,k})$, for $k = 1, \ldots, K$ and $w = 1, \ldots, M$: compute word rates.
- $n_{d,w} \sim \mathrm{Poisson}(\omega_d\, \vec{\theta}_d^{\top} \vec{\tau}_w)$: sample the word counts.

Figure 2 illustrates a graphical model representation of such a dynamic log-normal-Poisson model.
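For concreteness, a minimal end-to-end sketch of this generative process (hypothetical names; random-walk dynamics $A = I$, $B_w = I$ and isotropic noise assumed):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_poisson_corpus(T, N_t, K, M, omega=150, s_mu=0.1, s_zeta=0.1, s_gamma=0.5):
    """Sample word-count vectors from the dynamic log-normal-Poisson model
    with random-walk dynamics (A = I, B_w = I) and isotropic noise."""
    mu = np.zeros(K - 1)
    zeta = 0.5 * rng.standard_normal((M, K))            # zeta_w^(1) ~ N(0, Psi_w)
    corpus = []
    for t in range(T):
        mu = mu + s_mu * rng.standard_normal(K - 1)         # topic-mixing mean walk
        zeta = zeta + s_zeta * rng.standard_normal((M, K))  # per-word rate walk
        tau = np.exp(zeta)                                  # M x K word rates
        counts_t = np.empty((N_t, M), dtype=int)
        for d in range(N_t):
            gamma = np.append(mu + s_gamma * rng.standard_normal(K - 1), 0.0)
            theta = np.exp(gamma) / np.exp(gamma).sum()     # logistic-normal theta_d
            counts_t[d] = rng.poisson(omega * tau @ theta)  # n_{d,w} ~ Poisson(omega theta' tau_w)
        corpus.append(counts_t)
    return corpus

corpus = simulate_poisson_corpus(T=3, N_t=5, K=4, M=30)
print(corpus[0].shape)   # (5, 30): documents x vocabulary counts at time 1
```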

[Figure 2: A graphical model representation of the dynamic log-normal-Poisson model for topic evolution. Note that the topic representations evolve as $M$ independent word-rate vectors, each of which defines the rates of one word under a fixed set of topics.]

3 Variational Inference

3.1 Variational Inference for the Logistic-Normal-Multinomial Model

Under the logistic-normal-multinomial topic evolution model, the complete likelihood function can be written as follows:

$$
\begin{aligned}
p\big(D, \{\mu_t\}, \{\theta_d\}, \{\eta_k\}, \{z_{d,n}\}\big)
&= \prod_{t} p(\mu_t \mid \mu_{t-1}) \prod_{t} \prod_{d=1}^{N_t} p(\theta_d \mid \mu_t) \prod_{n} p(z_{d,n} \mid \theta_d)\, p(x_{d,n} \mid z_{d,n}, \{\eta_k\}) \prod_{k} p\big(\{\eta_k^{(t)}\}\big) \\
&= \mathcal{N}(\mu_1 \mid \nu, \Phi) \prod_{t=2}^{T} \mathcal{N}(\mu_t \mid A\mu_{t-1}, \Phi) \times \prod_{t=1}^{T} \prod_{d=1}^{N_t} \mathcal{LN}(\theta_d \mid \mu_t, \Sigma_t) \\
&\quad \times \prod_{k=1}^{K} \Big[ \mathcal{N}(\eta_k^{(1)} \mid \iota, \Psi) \prod_{t=2}^{T} \mathcal{N}\big(\eta_k^{(t)} \mid B\eta_k^{(t-1)}, \Psi\big) \Big] \\
&\quad \times \prod_{t,d,n} \mathrm{Multinomial}(z_{d,n} \mid \theta_d)\, \mathrm{Multinomial}\big(x_{d,n} \mid z_{d,n}, \mathrm{Logistic}(\{\eta_k^{(t)}\})\big). \qquad (1)
\end{aligned}
$$

The posterior of $\{\mu_t\}, \{\theta_d\}, \{\eta_k\}, \{z_{d,n}\}$ under the above model is intractable; we therefore approximate it with a product of simpler marginals, each over a cluster of latent variables:

$$q = q_\mu(\{\mu_t\})\; q_\theta(\{\theta_d\})\; q_\eta(\{\eta_k\})\; q_z(\{z_{d,n}\}). \qquad (2)$$

Based on the generalized mean field theorem [Xing et al., 2003], the optimal parameterization of each marginal can be derived by plugging the generalized mean field (GMF) messages received by the corresponding cluster of variables (say, $X_C$) into the original conditional distribution of that cluster given its Markov blanket (MB). The GMF messages can be thought of as surrogates of the dependent variables $X_{MB}$ in the Markov blanket of the cluster, and they replace the original values of those variables in $p(X_C \mid X_{MB})$. [Xing et al., 2003] showed that, in the case of generalized linear models, the GMF message corresponds to an expectation of the sufficient statistics of the relevant Markov blanket variables under their associated GMF cluster marginals. In the sequel, we use $\langle S_x \rangle_{q(x)}$ to denote the GMF message due to latent variable $x$; the optimal GMF approximation to $p(X_C \mid X_{MB})$ is:

$$q(X_C) \propto p\big(X_C \mid \{\langle S_y \rangle_{q(y)} : y \in X_{MB}\}\big).$$

As a prelude to the detailed derivations, we first rearrange some relevant local conditional distributions of our model into the canonical form of generalized linear models. As mentioned before, the multinomial parameters $\theta$ are logistic transformations of the elements of a multivariate normal vector $\gamma$: $\theta_k = e^{\gamma_k} / \sum_l e^{\gamma_l}$. In fact, since $\theta$ is a multinomial parameter vector, it has only $K-1$ degrees of freedom. Therefore, we only need to model a $(K-1)$-dimensional normal vector and pad it with a vacuous element $\gamma_K = 0$. Under this parameterization, the logistic transformation from $\gamma$ to $\theta$ remains the same, but the inverse of the transformation takes a simple form:

$$\gamma_k = \ln \frac{\theta_k}{1 - \sum_{i=1}^{K-1} \theta_i} = \ln \frac{\theta_k}{\theta_K}.$$

Assuming that $z$ is a normalized $K$-dimensional random binary vector (that is, when $z$ indicates the $k$th event, $z_k = 1$, $z_i = 0$ for $i \neq k$, and $\sum_i z_i = 1$), the exponential family representation of a multinomial distribution for a topic indicator $z$ is:

$$p(z \mid \theta) = \exp\Big\{ \sum_{k=1}^{K} z_k \ln \theta_k \Big\} = \exp\Big\{ \sum_{k=1}^{K-1} z_k \gamma_k - \ln\Big(1 + \sum_{k=1}^{K-1} e^{\gamma_k}\Big) \Big\} = \exp\big\{ z^{\top} \gamma - c(\gamma) \big\}, \qquad (3)$$

where $c(\gamma) = \ln\big(1 + \sum_{k=1}^{K-1} e^{\gamma_k}\big)$ is a scalar determined by $\gamma$. For a collection of topic indicators $\{z_{d,n}\}$, we have the following conditional likelihood at time $t$:

$$p\big(\{z_{d,n}\} \mid \{\theta_d\}\big) = \exp\Big\{ \sum_{d} \Big( \sum_{k=1}^{K-1} n_{d,k}\, \gamma_{d,k} - n_d\, c(\gamma_d) \Big) \Big\} = \exp\Big\{ \sum_{d} \big( m_d^{\top} \gamma_d - n_d\, c(\gamma_d) \big) \Big\}, \qquad (4)$$

where $n_{d,k} = \sum_n z_{d,n,k}$ is the number of words from topic $k$ in document $d$ at time $t$; $m_d = (n_{d,1}, \ldots, n_{d,K-1})$ denotes the row vector of total word counts from topics 1 to $K-1$ in document $d$ at time $t$; $\vec{n}_d = (n_{d,1}, \ldots, n_{d,K}) = (m_d, n_{d,K})$ denotes the row vector of word counts in document $d$ from all topics; and $n_d = \vec{n}_d \mathbf{1}$ is the total word count, with $\mathbf{1}$ a column vector of all ones.

Similarly, the local conditional probability of the data $x_{d,n}$, where $x$ is also defined as an $M$-dimensional one-hot indicator vector, can be written as:

$$p\big(\{x_{d,n}\} \mid \{z_{d,n}\}, \{\eta_k\}\big) = \exp\Big\{ \sum_{k=1}^{K} \Big( \sum_{w=1}^{M-1} n_{k,w}\, \eta_{k,w} - N_k\, c(\eta_k) \Big) \Big\} = \exp\Big\{ \sum_{k=1}^{K} \big( m_k^{\top} \eta_k - N_k\, c(\eta_k) \big) \Big\}, \qquad (5)$$

where $n_{k,w} = \sum_{d,n} x_{d,n,w}\, z_{d,n,k}$ is the count of word $w$ from topic $k$ at time $t$; $m_k = (n_{k,1}, \ldots, n_{k,M-1})$ denotes the row vector of total counts of all but the last word under topic $k$ at time $t$; $\vec{n}_k = (m_k, n_{k,M})$ denotes the row vector of counts of every word generated from topic $k$ at time $t$; $N_k = \vec{n}_k \mathbf{1}$; and $c(\eta_k) = \ln\big(1 + \sum_{w=1}^{M-1} e^{\eta_{k,w}}\big)$ is a scalar determined by $\eta_k$. Note that we have the identity $\sum_k N_k = \sum_d n_d$: both equal the total number of words at time $t$.

With the above specifications of the local conditional probability distributions, in the following we write down, one by one, the GMF approximations to the marginal posteriors of the subsets of latent variables.
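These conventions are easy to sanity-check numerically; the snippet below (illustrative, not from the report) verifies that the inverse map $\gamma_k = \ln(\theta_k / \theta_K)$ recovers $\gamma$, and that $c(\gamma)$ is exactly the log-normalizer of the padded softmax:

```python
import numpy as np

rng = np.random.default_rng(3)

K = 5
gamma = np.append(rng.standard_normal(K - 1), 0.0)   # gamma_K pinned to 0
theta = np.exp(gamma) / np.exp(gamma).sum()          # logistic transform

# Inverse transformation: gamma_k = ln(theta_k / theta_K), k = 1..K-1.
gamma_rec = np.log(theta[:-1] / theta[-1])
assert np.allclose(gamma_rec, gamma[:-1])

# c(gamma) = ln(1 + sum_{k<K} exp(gamma_k)) is the log-normalizer:
# log p(z = k) = gamma_k - c(gamma) for every k (including gamma_K = 0).
c = np.log1p(np.exp(gamma[:-1]).sum())
assert np.allclose(np.log(theta), gamma - c)
```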

7 where n,w,n x,n,w z,n, is the count for wor w from topic at time t; m n,,..., n enotes the row vector of total wor-counts of all but the last wor of topic at time t; n enotes the row vector of counts of every wor generate from topic at time t; an c η,m,m m, n is a scalar etermine by η. Note that we have the following ientity: n t n n. With the above specifications of local conitional probability istributions, in the following we can write own one by one the GMF approximations to marginal posteriors of subsets of latent variables. 3.. We first show that the marginal posterior of µ t can be approximate by a re-parameterize state-space moel. q µ µ t p µ t S θ qθ p µ t S γ qγ π K / Φ exp / µ ν Φ µ ν T exp π K / Φ / N t t π K / Σ t / exp γ µ t Σ t T µ t A µ t Φ µ t A µ t t γ µ t 6 where γ γ,,..., γ,k is the expecte topic vector of ocument at time t, in which γ, enotes the expectation of γ, ln θ, θ,k uner variational marginal q γ γ. For simplicity, we efine y t γ as a short han for the expecte topic vector, an Y t γ Nt as a short han for all such vectors at time t. Note that the above Eq. 6 is a linear Gaussian SSM, except that at each time the output is not a single observation γ, but a set of observations γ Nt. It is well nown that uner a stanar SSM, the posterior istribution of the centroi µ t given the entire observation sequence is still a normal istribution, of which the mean an covariance matrix can be reaily estimate using the Kalman filtering KF an Rauch- Tung-Striebel RTS smoothing algorithms. Here we give the moifie Kalman filter measurement-upate equations that tae into account multiple rather than single output ata points. The RTS equations an the time-upate equations of KF is ientical to the stanar case for single output. Let ˆµ t t enote the mean of µ t conitione on partial sequence Y,..., Y t. The convariance matrix of µ t conitione on partial sequence Y,..., Y t is enote P t t ; that is: ˆµ t t E[ µ t Y,..., Y t ] P t t E[ µ t ˆµ t t µ t ˆµ t t Y,..., Y t ]. Similarly, we let ˆµ t+ t enotes the mean of µ t+ conitione on the partial sequence Y,..., Y t ; P t+ t enotes the covariance matrices of µ t+ t conitione of partial sequences Y,..., Y t ; an so on. Thus, the SSM inference formulae are as follows: Time upate: ˆµ t+ t Aˆµ t t 8 P t+ t AP t t + Φ 9 This can be erive using the fact that the posterior mean an covariance matrix of the mean of a normal istribution N µ, Σ given ata Y an prior of the mean N µ 0, Σ 0 is: Σ p nσ + Σ 0, µ p nσ + Σ 0 nσ ỹ + Σ 0 µ 0 7 6

3.1.2 The variational marginal $q_\gamma$

Now we move on to the variational marginal $q_\gamma(\gamma_d)$:

$$
q_\gamma(\gamma_d) \propto p\big(\gamma_d \mid \langle S_\mu \rangle_{q_\mu}, \langle S_z \rangle_{q_z}\big)
\propto \exp\Big\{ -\tfrac{1}{2}\big(\gamma_d - \langle \mu_t \rangle\big)^{\top}\Sigma_t^{-1}\big(\gamma_d - \langle \mu_t \rangle\big) \Big\}\,
\exp\Big\{ \langle m_d \rangle^{\top}\gamma_d - n_d\, c(\gamma_d) \Big\}, \qquad (16)
$$

where $\langle m_d \rangle = (\langle n_{d,1} \rangle, \ldots, \langle n_{d,K-1} \rangle)$ (cf. $m_d$), and $\langle n_{d,k} \rangle = \sum_n \langle z_{d,n,k} \rangle$ denotes the sum of expected topic-specific counts for the words in document $d$ under $q_z(\{z_{d,n}\})$, which will be specified in the sequel.

Due to the complexity of $c(\gamma) = \ln\big(1 + \sum_{k=1}^{K-1} e^{\gamma_k}\big)$, the $q_\gamma$ defined above is not integrable during inference (e.g., for computing an expectation of $\gamma$). In [Blei and Lafferty, 2006], a variational approximation based on optimizing a relaxed bound of the KL divergence between $q$ and $p$ was used to approximate $q_\gamma$. In the following, we present a different approach that overcomes the non-conjugacy between the multinomial likelihood and the logistic-normal prior and makes the joint tractable: we seek a normal approximation to $q_\gamma$ using a Taylor expansion technique. Differentiating $c(\gamma)$ with respect to $\gamma$ gives the gradient and Hessian elements

$$
g_i = \frac{\partial c(\gamma)}{\partial \gamma_i} = \frac{e^{\gamma_i}}{1 + \sum_k e^{\gamma_k}}, \qquad
h_{ii} = \frac{e^{\gamma_i}}{1 + \sum_k e^{\gamma_k}} - \Big(\frac{e^{\gamma_i}}{1 + \sum_k e^{\gamma_k}}\Big)^{2}, \qquad
h_{ij} = -\frac{e^{\gamma_i}\, e^{\gamma_j}}{\big(1 + \sum_k e^{\gamma_k}\big)^{2}} \;\; (i \neq j). \qquad (17)
$$
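In softmax terms, the gradient is the vector of padded-softmax probabilities of the first $K-1$ categories, and the Hessian is $H = \mathrm{diag}(g) - g g^{\top}$. A small numerically stable sketch (illustrative) with a finite-difference check:

```python
import numpy as np

def c_grad_hess(gamma):
    """Gradient and Hessian of c(gamma) = log(1 + sum_k exp(gamma_k)).

    Equivalent to softmax probabilities over (gamma, 0) restricted to the
    first K-1 coordinates: g_i = e^{gamma_i}/(1 + sum e^{gamma}), and
    H = diag(g) - g g^T (cf. Eq. 17).
    """
    a = np.append(gamma, 0.0)                  # pad the pinned K-th logit
    p = np.exp(a - a.max()); p /= p.sum()      # numerically stable softmax
    g = p[:-1]
    return g, np.diag(g) - np.outer(g, g)

gamma = np.array([0.3, -1.2, 0.8])
g, H = c_grad_hess(gamma)

# Finite-difference check of the gradient.
eps = 1e-6
c = lambda x: np.log1p(np.exp(x).sum())
g_fd = np.array([(c(gamma + eps * np.eye(3)[i]) - c(gamma - eps * np.eye(3)[i]))
                 / (2 * eps) for i in range(3)])
assert np.allclose(g, g_fd, atol=1e-6)
```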

Therefore, the second-order Taylor series of $c(\gamma) = \ln\big(1 + \sum_{k=1}^{K-1} e^{\gamma_k}\big)$ with respect to some $\hat{\gamma}$ is:

$$c(\gamma) = c(\hat{\gamma}) + g_{\hat{\gamma}}^{\top}(\gamma - \hat{\gamma}) + \tfrac{1}{2}(\gamma - \hat{\gamma})^{\top}H_{\hat{\gamma}}(\gamma - \hat{\gamma}) + R_2, \qquad (18)$$

where $g_{\hat{\gamma}} = \nabla c(\gamma)|_{\hat{\gamma}} = (g_1, \ldots, g_{K-1})$ denotes the gradient vector of $c(\gamma)$, $H_{\hat{\gamma}} = (h_{ij})$ denotes the Hessian matrix, and $R_2$ is the Lagrange remainder. Assuming that $\hat{\gamma}$ is close enough to the true $\gamma$ for each document at each time (e.g., the posterior mean of $\gamma$), we have:

$$c(\gamma) \approx c(\hat{\gamma}) + g_{\hat{\gamma}}^{\top}(\gamma - \hat{\gamma}) + \tfrac{1}{2}(\gamma - \hat{\gamma})^{\top}H_{\hat{\gamma}}(\gamma - \hat{\gamma}). \qquad (19)$$

It can be shown that, since $c(\gamma)$ is convex with respect to $\gamma$, the above approximation is a second-order polynomial lower bound of $c(\gamma)$ [Jordan et al., 1999]. Now we have:

$$
\begin{aligned}
p\big(\gamma_d \mid \langle S_\mu \rangle_{q_\mu}, \langle S_z \rangle_{q_z}\big)
&\propto \exp\Big\{ -\tfrac{1}{2}(\gamma - \langle \mu_t \rangle)^{\top}\Sigma_t^{-1}(\gamma - \langle \mu_t \rangle) + \langle m \rangle^{\top}\gamma - n\,c(\gamma) \Big\} \\
&\approx \exp\Big\{ -\tfrac{1}{2}\gamma^{\top}\big(\Sigma_t^{-1} + n H_{\hat{\gamma}}\big)\gamma + \gamma^{\top}\big(\Sigma_t^{-1}\langle \mu_t \rangle + \langle m \rangle - n\,g_{\hat{\gamma}} + n H_{\hat{\gamma}}\hat{\gamma}\big) + \mathrm{const} \Big\}. \qquad (20)
\end{aligned}
$$

Rearranging the terms, and setting $\hat{\gamma} = \langle \mu_t \rangle = \hat{\mu}_{t|T}$ from Section 3.1.1, we have the following multivariate normal approximation:

$$p\big(\gamma_d \mid \langle S_\mu \rangle_{q_\mu}, \langle S_z \rangle_{q_z}\big) \approx \mathcal{N}\big(\gamma_d \mid \tilde{\mu}_t, \tilde{\Sigma}_t\big), \qquad (21)$$

where

$$\tilde{\Sigma}_t = \big(\Sigma_t^{-1} + n_d\,H_{\hat{\mu}_{t|T}}\big)^{-1}, \qquad (22)$$

$$\tilde{\mu}_t = \tilde{\Sigma}_t\Big(\Sigma_t^{-1}\hat{\mu}_{t|T} + n_d\,H_{\hat{\mu}_{t|T}}\hat{\mu}_{t|T} + \langle m_d \rangle - n_d\,g_{\hat{\mu}_{t|T}}\Big) = \hat{\mu}_{t|T} + \tilde{\Sigma}_t\big(\langle m_d \rangle - n_d\,g_{\hat{\mu}_{t|T}}\big). \qquad (23)$$

3.1.3 The variational marginal $q_\beta$

Now we compute the variational marginal $q_\beta(\{\beta_k\})$:

$$q_\beta(\{\beta_k\}) \propto \prod_{k=1}^{K} p\big(\beta_k^{(1)}, \ldots, \beta_k^{(T)} \mid \langle S_z \rangle_{q_z}\big). \qquad (24)$$

This is a product of conditionally independent SSMs, given the sufficient statistics $\langle S_\gamma \rangle_{q_\gamma}$ and $\langle S_z \rangle_{q_z}$, the model parameters $\iota, \Psi, B$, and the data $D$. The variational marginal of a single chain of an evolving topic, represented by the pre-transformed normal vectors $\eta_k^{(1)}, \ldots, \eta_k^{(T)}$, is:

$$
\begin{aligned}
p\big(\eta_k^{(1)}, \ldots, \eta_k^{(T)} \mid \langle S_z \rangle_{q_z}\big)
&\propto \exp\Big\{ -\tfrac{1}{2}\big(\eta_k^{(1)} - \iota\big)^{\top}\Psi^{-1}\big(\eta_k^{(1)} - \iota\big) - \tfrac{1}{2}\sum_{t=2}^{T}\big(\eta_k^{(t)} - B\eta_k^{(t-1)}\big)^{\top}\Psi^{-1}\big(\eta_k^{(t)} - B\eta_k^{(t-1)}\big) \Big\} \\
&\quad \times \exp\Big\{ \sum_{t=1}^{T}\Big( \big\langle m_k^{(t)} \big\rangle^{\top}\eta_k^{(t)} - \big\langle N_k^{(t)} \big\rangle\,c\big(\eta_k^{(t)}\big) \Big) \Big\}. \qquad (25)
\end{aligned}
$$

Recall that we can approximate $c(\eta)$ by its second-order truncated Taylor series with respect to an estimate $\hat{\eta}$ of $\eta$:

$$c(\eta) \approx c(\hat{\eta}) + g_{\hat{\eta}}^{\top}(\eta - \hat{\eta}) + \tfrac{1}{2}(\eta - \hat{\eta})^{\top}H_{\hat{\eta}}(\eta - \hat{\eta}). \qquad (26)$$

In the following we first outline a normal approximation to a multinomial distribution over a count vector $n$, assuming that the multinomial parameters are logistic transformations of a real vector $\eta$:

$$
\begin{aligned}
p(n \mid \eta) &= \exp\big\{ n^{\top}\eta - N\,c(\eta) \big\} \\
&\approx \exp\Big\{ -\tfrac{1}{2}\eta^{\top}\big(N H_{\hat{\eta}}\big)\eta + \big(n - N g_{\hat{\eta}} + N H_{\hat{\eta}}\hat{\eta}\big)^{\top}\eta + \mathrm{const} \Big\} \\
&\propto \mathcal{N}\big(v \mid \eta, (N H_{\hat{\eta}})^{-1}\big), \qquad (27)
\end{aligned}
$$

where $g_{\hat{\eta}} = \nabla c(\eta)|_{\hat{\eta}}$, $H_{\hat{\eta}} = \nabla^2 c(\eta)|_{\hat{\eta}}$, and $v = \hat{\eta} + (N H_{\hat{\eta}})^{-1}(n - N g_{\hat{\eta}})$ plays the role of a Gaussian pseudo-observation; the Taylor expansion point $\hat{\eta}$ can be set to an empirical estimate, or just a guess, of $\eta$. With this approximation, we can approximate Eq. (25) by an SSM with linear Gaussian emission models:

$$
p\big(\eta_k^{(1)}, \ldots, \eta_k^{(T)} \mid \langle S_z \rangle_{q_z}\big)
\approx \mathcal{N}\big(\eta_k^{(1)} \mid \iota, \Psi\big) \prod_{t=2}^{T} \mathcal{N}\big(\eta_k^{(t)} \mid B\eta_k^{(t-1)}, \Psi\big) \prod_{t=1}^{T} \mathcal{N}\Big( v_k^{(t)} \,\Big|\, \eta_k^{(t)},\, \big(\langle N_k^{(t)} \rangle H_{\hat{\eta}_k^{(t)}}\big)^{-1} \Big), \qquad (28)
$$

where the observation is $v_k^{(t)} = \hat{\eta}_k^{(t)} + \big(\langle N_k^{(t)} \rangle H_{\hat{\eta}_k^{(t)}}\big)^{-1}\big(\langle m_k^{(t)} \rangle - \langle N_k^{(t)} \rangle g_{\hat{\eta}_k^{(t)}}\big)$, with $\hat{\eta}_k^{(t)}$ set to its estimate from the previous round of the GMF iteration (see Section 3.1.5). The expected count vector $\langle m_k^{(t)} \rangle$ and total word count $\langle N_k^{(t)} \rangle$ associated with topic $k$ at time $t$ are computed using the variational marginal $q_z$: specifically, $\langle n_{k,w} \rangle = \sum_{d,n} x_{d,n,w}\,\langle z_{d,n,k} \rangle$ is the expected count of word $w$ from topic $k$ at time $t$; $\langle m_k \rangle = (\langle n_{k,1} \rangle, \ldots, \langle n_{k,M-1} \rangle)$ denotes the expected row vector of total word counts of all but the last word under topic $k$ at time $t$; $\langle \vec{n}_k \rangle = (\langle m_k \rangle, \langle n_{k,M} \rangle)$ denotes the row vector of expected counts of every word generated from topic $k$ at time $t$; and $\langle N_k \rangle = \langle \vec{n}_k \rangle \mathbf{1}$.

Now the posterior of $\eta_k^{(t)}$ can be approximated by a multivariate Gaussian $\mathcal{N}\big(\hat{\eta}_{k,t|T}, P_{k,t|T}\big)$; here we give the formulae for the KF time/measurement updates and the RTS smoothing of the topic distribution parameters at time $t$.

Time update:

$$\hat{\eta}_{k,t+1|t} = B\,\hat{\eta}_{k,t|t}, \qquad P_{k,t+1|t} = B\,P_{k,t|t}\,B^{\top} + \Psi. \qquad (29)$$

Measurement update:

$$\hat{\eta}_{k,t+1|t+1} = \hat{\eta}_{k,t+1|t} + P_{k,t+1|t}\Big(P_{k,t+1|t} + \big(\langle N_k \rangle H_{\hat{\eta}_k}\big)^{-1}\Big)^{-1}\big(v_k^{(t+1)} - \hat{\eta}_{k,t+1|t}\big),$$

$$P_{k,t+1|t+1} = P_{k,t+1|t} - P_{k,t+1|t}\Big(P_{k,t+1|t} + \big(\langle N_k \rangle H_{\hat{\eta}_k}\big)^{-1}\Big)^{-1}P_{k,t+1|t}. \qquad (30)$$

RTS smoothing:

$$L_{k,t} = P_{k,t|t}\,B^{\top}\,P_{k,t+1|t}^{-1}, \qquad \hat{\eta}_{k,t|T} = \hat{\eta}_{k,t|t} + L_{k,t}\big(\hat{\eta}_{k,t+1|T} - \hat{\eta}_{k,t+1|t}\big),$$

$$P_{k,t|T} = P_{k,t|t} + L_{k,t}\big(P_{k,t+1|T} - P_{k,t+1|t}\big)L_{k,t}^{\top}, \qquad P_{k,t,t-1|T} = P_{k,t|t}\,L_{k,t-1}^{\top} + L_{k,t}\big(P_{k,t+1,t|T} - B\,P_{k,t|t}\big)L_{k,t-1}^{\top}. \qquad (31)$$

We can estimate the parameters $\iota$, $\Psi$ and $B$ using an EM algorithm (see Section 4).
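The conversion of counts into the Gaussian pseudo-observation $v$ of Eq. (27) is the only nonstandard step here; an illustrative sketch follows (hypothetical names; the small ridge term for numerical stability is my addition, not the report's):

```python
import numpy as np

def pseudo_observation(m_counts, N_total, eta_hat):
    """Gaussian pseudo-observation for the linearized multinomial emission
    (cf. Eq. 27): v = eta_hat + (N H)^{-1} (m - N g), with covariance (N H)^{-1}.

    m_counts: expected counts of the first M-1 words under this topic;
    N_total:  expected total word count for the topic (all M words);
    eta_hat:  Taylor expansion point (e.g., the previous smoothed estimate).
    """
    a = np.append(eta_hat, 0.0)
    p = np.exp(a - a.max()); p /= p.sum()
    g = p[:-1]                                   # gradient of c at eta_hat
    H = np.diag(g) - np.outer(g, g)              # Hessian of c at eta_hat
    NH_inv = np.linalg.inv(N_total * H + 1e-8 * np.eye(len(g)))  # ridge for stability
    v = eta_hat + NH_inv @ (m_counts - N_total * g)
    return v, NH_inv                             # observation and its covariance

m = np.array([12.0, 7.0, 3.0])                   # counts of words 1..M-1 (M = 4)
v, R = pseudo_observation(m, N_total=30.0, eta_hat=np.zeros(3))
```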

3.1.4 The variational marginal $q_z$

Now we compute the variational marginal $q_z(\{z_{d,n}\})$:

$$q_z(\{z_{d,n}\}) \propto p\big(\{z_{d,n}\} \mid D, \langle S_\gamma \rangle_{q_\gamma}, \langle S_\eta \rangle_{q_\eta}\big) = \prod_{d,n,t} p\big(z_{d,n} \mid x_{d,n}, \langle S_\gamma \rangle_{q_\gamma}, \langle S_\eta \rangle_{q_\eta}\big). \qquad (32)$$

For notational simplicity, we omit the indices $d$, $n$ and $t$, and give below a generic formula for the variational approximation to a singleton marginal. Recall that $z$ is a unit-base (one-hot) vector, so $z^{\top} z = 1$; a similar definition applies to $x$:

$$p(z \mid x, \langle S_\gamma \rangle, \langle S_\eta \rangle) \propto p(z \mid \langle S_\gamma \rangle)\, p(x \mid z, \langle S_\eta \rangle) \propto \exp\big\{ z^{\top} \langle \gamma \rangle - \langle c(\gamma) \rangle + x^{\top} \langle \Xi \rangle z - z^{\top} \langle c_\eta \rangle \big\}, \qquad (33)$$

where $\gamma$ follows a Gaussian distribution, $\Xi$ is an $M \times K$ matrix whose column vectors $\eta_k$ also follow Gaussian distributions, and $c_\eta = \big(c(\eta_1), \ldots, c(\eta_K)\big)^{\top}$. Closed-form solutions for $\langle c(\gamma) \rangle$ and $\langle c(\eta_k) \rangle$ under normal distributions are not available. Note that the multinomial parameter vector $\theta = \pi / (\mathbf{1}^{\top}\pi)$, where $\pi = (\pi_1, \ldots, \pi_{K-1}, \pi_K) = (e^{\gamma_1}, \ldots, e^{\gamma_{K-1}}, 1)$: its first $K-1$ components follow a multivariate log-normal distribution, and $\pi_K = 1$. To better approximate $\langle c(\gamma) \rangle$ (and similarly $\langle c(\eta) \rangle$), we rewrite $c(\gamma) = \ln\big(1 + \sum_{k=1}^{K-1} e^{\gamma_k}\big)$ as $c(\pi) = \ln(\mathbf{1}^{\top}\pi)$, where $\pi$ is the unnormalized version of the multinomial parameter vector $\theta$. Now we expand $c(\pi)$ around the mean of $\pi$ up to second order. The gradient of $c(\pi)$ with respect to $\pi$ is:

$$\frac{\partial \ln(\mathbf{1}^{\top}\pi)}{\partial \pi_i} = \frac{1}{\mathbf{1}^{\top}\pi}, \qquad \text{i.e.,} \qquad g_\pi = \nabla_\pi \ln(\mathbf{1}^{\top}\pi) = \frac{\mathbf{1}}{\mathbf{1}^{\top}\pi}. \qquad (34)$$

The Hessian of $c(\pi)$ with respect to $\pi$ is:

$$\frac{\partial^2 \ln(\mathbf{1}^{\top}\pi)}{\partial \pi_i \partial \pi_j} = -\frac{1}{(\mathbf{1}^{\top}\pi)^2}, \qquad \text{i.e.,} \qquad H_\pi = -\frac{\mathbf{1} \otimes \mathbf{1}}{(\mathbf{1}^{\top}\pi)^2}, \qquad (35)$$

where $\mathbf{1} \otimes \mathbf{1}$ represents the outer product of the two one-vectors. Therefore, letting $\hat{\pi} = E[\pi]$ under $\pi \sim \mathcal{LN}_{K-1}(\mu, \Sigma)$, we have:

$$c(\pi) \approx \ln(\mathbf{1}^{\top}\hat{\pi}) + g_\pi^{\top}(\pi - \hat{\pi}) + \tfrac{1}{2}(\pi - \hat{\pi})^{\top}H_\pi(\pi - \hat{\pi}), \qquad (36)$$

and hence

$$\langle c(\pi) \rangle \approx \ln\big(\mathbf{1}^{\top}E[\pi]\big) + \tfrac{1}{2}\mathrm{Tr}\Big(H_\pi\big(E[\pi]\big)\,E\big[(\pi - E[\pi])(\pi - E[\pi])^{\top}\big]\Big) = \ln\big(\mathbf{1}^{\top}E[\pi]\big) + \tfrac{1}{2}\mathrm{Tr}\Big(H_\pi\big(E[\pi]\big)\,\hat{\Sigma}_\pi\Big), \qquad (37)$$

where $\hat{\Sigma}_\pi$ is the covariance of $\pi$ under the multivariate log-normal distribution. It can be shown that [Kleiber and Kotz, 2003]:

$$\mathrm{cov}(\pi_i, \pi_j) = \exp\big\{\mu_i + \mu_j + \tfrac{1}{2}(\sigma_{ii} + \sigma_{jj})\big\}\,\big(\exp\{\sigma_{ij}\} - 1\big), \qquad (38)$$

$$E[\pi_i] = \exp\big\{\mu_i + \tfrac{1}{2}\sigma_{ii}\big\}, \qquad (39)$$

$$E[\pi] = \exp\big\{\mu + \tfrac{1}{2}\mathrm{Diag}(\Sigma)\big\}. \qquad (40)$$

This computation can be applied to the expectations of both the pre-normalized topic proportion vector $\pi$ and the pre-normalized topic-specific word frequency vector $\xi_k$ corresponding to $\beta_k$.
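The log-normal moment formulas (38)-(40) are easy to verify by Monte Carlo (illustrative snippet):

```python
import numpy as np

rng = np.random.default_rng(4)

mu = np.array([0.2, -0.5, 0.1])
A = rng.standard_normal((3, 3))
Sigma = A @ A.T / 3 + 0.1 * np.eye(3)         # a valid covariance matrix

# Closed-form log-normal moments (Eqs. 38-40).
E_pi = np.exp(mu + 0.5 * np.diag(Sigma))
Cov_pi = np.outer(E_pi, E_pi) * (np.exp(Sigma) - 1.0)

# Monte Carlo check.
samples = np.exp(rng.multivariate_normal(mu, Sigma, size=500_000))
print(np.max(np.abs(samples.mean(axis=0) - E_pi)))    # ~ 0
print(np.max(np.abs(np.cov(samples.T) - Cov_pi)))     # ~ 0
```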

So, we have:

$$
p(z \mid x, \langle S_\gamma \rangle, \langle S_\eta \rangle) \propto \exp\Big\{ z^{\top}E[\gamma] - \ln\big(\mathbf{1}^{\top}E[\pi]\big) - \tfrac{1}{2}\mathrm{Tr}\big(H_\pi(E[\pi])\,\hat{\Sigma}_\pi\big) + x^{\top}E[\Xi]\,z - \sum_k z_k\Big( \ln\big(\mathbf{1}^{\top}E[\xi_k]\big) + \tfrac{1}{2}\mathrm{Tr}\big(H_\xi(E[\xi_k])\,\hat{\Sigma}_{\xi_k}\big) \Big) \Big\}, \qquad (41)
$$

where $H_\xi(E[\xi_k])$, $E[\xi_k]$ and $\hat{\Sigma}_{\xi_k}$ are the Hessian, mean and covariance of $\xi_k$ under a log-normal distribution, as in Eqs. (35) and (38)-(40), and $E[\Xi]$ consists of the column-by-column expectations $E[\eta_k]$ under the normal distributions of the $\eta_k$. Note that in the above computation, one must be careful to appropriately recover the $K$-dimensional multinomial distribution of $z$ from the $(K-1)$-dimensional pre-transformed natural parameter vector $\gamma$ and the $(M-1) \times K$-dimensional pre-transformed natural parameter matrix $\Xi = [\eta_1, \ldots, \eta_K]$; I omit the details of such manipulations. We need to compute the above singleton marginal for each $z_{d,n}$, given $\gamma \sim \mathcal{N}(\tilde{\mu}_t, \tilde{\Sigma}_t)$ and hence $\pi \sim \mathcal{LN}(\tilde{\mu}_t, \tilde{\Sigma}_t)$, and $\eta_k \sim \mathcal{N}(\hat{\eta}_{k,t|T}, P_{k,t|T})$ and hence $\xi_k \sim \mathcal{LN}(\hat{\eta}_{k,t|T}, P_{k,t|T})$.

3.1.5 Summary

The above four variational marginals are coupled and thus constitute a set of fixed-point equations: computing the GMF message for one marginal requires the marginals of the other sets of variables. Thus, we iteratively update each marginal until convergence (i.e., until all the GMF messages stop changing). This approximation scheme can be shown to minimize the KL divergence between the variational posterior and the true posterior of the latent variables. We can then use a variational EM scheme to estimate the parameters of our model, which are essentially the SSM parameters. Operationally, VEM is no different from a standard EM for SSMs: we have the observation sequences $\{\langle \gamma_d \rangle\}_{d=1}^{N_t}$, $t = 1, \ldots, T$, for the topic-mixing SSM as defined by $q_\mu$, and an observation sequence $\{v_k^{(t)}\}$ for each of the $K$ topic representation (i.e., word-frequency) SSMs as defined by $q_\eta$; and we can use the standard learning rules for SSM parameter estimation.
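As a concrete illustration of the singleton update in Eq. (41), the snippet below (illustrative; toy Gaussian moments in place of the actual variational marginals) computes the variational topic posterior for one token, using the log-normal correction of Eqs. (37)-(40) for the $\langle c(\cdot) \rangle$ terms:

```python
import numpy as np

def expected_c(mu, Sigma):
    """Approximate <ln(1^T pi)> for pi ~ LN(mu, Sigma), padded with the
    constant component pi_K = 1 (Eqs. 36-40): ln(1^T E[pi]) + Tr(H Cov)/2,
    where H = -(1 1^T)/(1^T E[pi])^2."""
    E = np.exp(mu + 0.5 * np.diag(Sigma))           # E[pi_i], Eq. 39
    Cov = np.outer(E, E) * (np.exp(Sigma) - 1.0)    # cov(pi_i, pi_j), Eq. 38
    s = 1.0 + E.sum()                               # 1^T E[pi], incl. pi_K = 1
    return np.log(s) - 0.5 * Cov.sum() / s ** 2

K, M = 4, 6
mu_gamma, Sig_gamma = np.zeros(K - 1), 0.1 * np.eye(K - 1)
etas = [(np.zeros(M - 1), 0.1 * np.eye(M - 1)) for _ in range(K)]  # q_eta moments

w = 2                                               # index of the observed word
log_q = np.append(mu_gamma, 0.0) - expected_c(mu_gamma, Sig_gamma)
for k, (m_k, P_k) in enumerate(etas):
    eta_kw = m_k[w] if w < M - 1 else 0.0           # padded natural parameter
    log_q[k] += eta_kw - expected_c(m_k, P_k)
q_z = np.exp(log_q - log_q.max()); q_z /= q_z.sum() # normalize over topics
print(q_z)                                          # variational topic posterior
```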

In the E step, we use the KF/RTS recursions above (Eqs. (13)-(14) and (30)-(31)) to estimate the expected sufficient statistics of the latent states; in the M step, we update the parameters $\Phi, A, \Sigma_t$ of the topic-mixing SSM, and $\Psi, B$ of the topic representation SSM (see [Ghahramani and Hinton, 1996; Ghahramani and Hinton, 1998] for details).

3.2 Variational Inference for the Log-Normal-Poisson Model

Now we need to approximate

$$p(n_w) = \mathrm{Poisson}\big(n_w \mid N\,\theta^{\top}\tau_w\big) = \frac{\exp\big\{ n_w \ln(N\,\theta^{\top}\tau_w) - N\,\theta^{\top}\tau_w \big\}}{\Gamma(n_w + 1)}.$$

Again, let $\pi$ denote the unnormalized version of the multinomial parameter vector $\theta$. Note that the Jacobian of the vector $\theta$ with respect to $\pi$ is:

$$\frac{\partial \theta_i}{\partial \pi_i} = \frac{1}{\mathbf{1}^{\top}\pi} - \frac{\pi_i}{(\mathbf{1}^{\top}\pi)^2}, \qquad \frac{\partial \theta_i}{\partial \pi_j} = -\frac{\pi_i}{(\mathbf{1}^{\top}\pi)^2} \;(j \neq i), \qquad J_\theta = \frac{I}{\mathbf{1}^{\top}\pi} - \frac{\pi \otimes \mathbf{1}}{(\mathbf{1}^{\top}\pi)^2}. \qquad (42)$$

From this derivation, we know that, similarly,

$$\frac{\partial \theta}{\partial \pi_i} = \frac{e_i - \theta}{\mathbf{1}^{\top}\pi}, \qquad \frac{\partial (\theta^{\top}\tau_w)}{\partial \pi_i} = \frac{\tau_{w,i} - \theta^{\top}\tau_w}{\mathbf{1}^{\top}\pi}, \qquad \frac{\partial \ln(N\theta^{\top}\tau_w)}{\partial \pi_i} = \frac{\tau_{w,i}}{\pi^{\top}\tau_w} - \frac{1}{\mathbf{1}^{\top}\pi}, \qquad (43)$$

and

$$\frac{\partial^2 \ln(N\theta^{\top}\tau_w)}{\partial \pi_i\, \partial \pi_j} = \frac{1}{(\mathbf{1}^{\top}\pi)^2} - \frac{\tau_{w,i}\,\tau_{w,j}}{(\pi^{\top}\tau_w)^2}. \qquad (44)$$

In matrix form, the gradient and Hessian of the Poisson log-likelihood with respect to $\pi$ are:

$$\nabla_\pi \ln(N\theta^{\top}\tau_w) = \frac{\tau_w}{\pi^{\top}\tau_w} - \frac{\mathbf{1}}{\mathbf{1}^{\top}\pi}, \qquad H_\pi \ln(N\theta^{\top}\tau_w) = \frac{\mathbf{1} \otimes \mathbf{1}}{(\mathbf{1}^{\top}\pi)^2} - \frac{\tau_w \otimes \tau_w}{(\pi^{\top}\tau_w)^2}, \qquad (45)$$

where $\tau_w \otimes \tau_w$ represents the outer product of the two vectors. The gradient and Hessian with respect to $\tau_w$ are:

$$\frac{\partial \ln(N\theta^{\top}\tau_w)}{\partial \tau_{w,i}} = \frac{\theta_i}{\theta^{\top}\tau_w}, \qquad \frac{\partial^2 \ln(N\theta^{\top}\tau_w)}{\partial \tau_{w,i}\, \partial \tau_{w,j}} = -\frac{\theta_i\,\theta_j}{(\theta^{\top}\tau_w)^2}; \qquad (46)$$

in matrix form:

$$\nabla_\tau \ln(N\theta^{\top}\tau_w) = \frac{\theta}{\theta^{\top}\tau_w}, \qquad H_\tau \ln(N\theta^{\top}\tau_w) = -\frac{\theta \otimes \theta}{(\theta^{\top}\tau_w)^2}. \qquad (47)$$

Assuming that $\theta$ and $\tau$ are independent, i.e., $\mathrm{cov}(\tau, \theta) = 0$, we have the following approximation of $\ln(N\theta^{\top}\tau_w)$:

$$
\ln(N\theta^{\top}\tau_w) \approx \ln(N\hat{\theta}^{\top}\hat{\tau}_w) + \nabla_\tau[\hat{\tau}_w]^{\top}(\tau_w - \hat{\tau}_w) + \nabla_\pi[\hat{\pi}]^{\top}(\pi - \hat{\pi}) + \tfrac{1}{2}(\tau_w - \hat{\tau}_w)^{\top}H_\tau[\hat{\tau}_w](\tau_w - \hat{\tau}_w) + \tfrac{1}{2}(\pi - \hat{\pi})^{\top}H_\pi[\hat{\pi}](\pi - \hat{\pi}), \qquad (48)
$$

where $\hat{\pi}$ and $\hat{\tau}_w$ are some estimates of the true $\pi$ and $\tau_w$. Note that under this approximation, computing the expectation of $\ln(N\theta^{\top}\tau_w)$ under $q_\theta$ and $q_{\tau_w}$ can be done approximately in closed form by using the variational marginals of $\gamma$ and $\zeta_w$. Now, the variational marginals for $\zeta_w$ (the log-domain representation of $\tau_w$, i.e., its natural parameter) and $\gamma$ (the inverse logistic transformation of $\theta$) can be derived from the following GMF approximations to the marginal posteriors of $\zeta_w$ and $\gamma$, respectively:

$$
\begin{aligned}
p\big(\zeta_w^{(1)}, \ldots, \zeta_w^{(T)} \mid \langle S_\theta \rangle_{q_\theta}\big)
&\propto \exp\Big\{ -\tfrac{1}{2}\big(\zeta_w^{(1)}\big)^{\top}\Psi_w^{-1}\zeta_w^{(1)} - \tfrac{1}{2}\sum_{t=2}^{T}\big(\zeta_w^{(t)} - B_w\zeta_w^{(t-1)}\big)^{\top}\Psi_w^{-1}\big(\zeta_w^{(t)} - B_w\zeta_w^{(t-1)}\big) \Big\} \\
&\quad \times \prod_{t}\prod_{d=1}^{N_t} \frac{\exp\Big\{ n_{d,w}\,\big\langle \ln\big(\omega_d\,\theta_d^{\top}\tau_w\big) \big\rangle_{q_\theta} - \omega_d\,\langle \theta_d \rangle_{q_\theta}^{\top}\tau_w \Big\}}{\Gamma(n_{d,w} + 1)}; \qquad (49)
\end{aligned}
$$
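A quick finite-difference check (illustrative) of the $\tau$-gradient in Eqs. (46)-(47), $\partial \ln(\theta^{\top}\tau_w)/\partial \tau_{w,i} = \theta_i / (\theta^{\top}\tau_w)$; the constant factor $N$ drops out of the gradient of the logarithm:

```python
import numpy as np

rng = np.random.default_rng(5)

K = 4
theta = rng.dirichlet(np.ones(K))        # a point on the simplex
tau_w = rng.gamma(2.0, 0.05, size=K)     # positive per-topic rates for word w

f = lambda tau: np.log(theta @ tau)      # log-rate, up to the constant ln N
grad = theta / (theta @ tau_w)           # Eq. (46)/(47), gradient w.r.t. tau_w

eps = 1e-7
grad_fd = np.array([(f(tau_w + eps * np.eye(K)[i]) - f(tau_w - eps * np.eye(K)[i]))
                    / (2 * eps) for i in range(K)])
assert np.allclose(grad, grad_fd, atol=1e-5)
```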

Similarly, the GMF approximation for $q_\gamma$ is:

$$
\begin{aligned}
q_\gamma(\gamma_d) &\propto p\big(\gamma_d \mid \langle S_\mu \rangle_{q_\mu}, \langle S_\tau \rangle_{q_\tau}\big) \\
&\propto \exp\Big\{ -\tfrac{1}{2}\big(\gamma_d - \langle \mu_t \rangle\big)^{\top}\Sigma_t^{-1}\big(\gamma_d - \langle \mu_t \rangle\big) \Big\} \prod_{w} \frac{\exp\Big\{ n_{d,w}\,\big\langle \ln\big(\omega_d\,\theta_d^{\top}\tau_w\big) \big\rangle_{q_\tau} - \omega_d\,\theta_d^{\top}\langle \tau_w \rangle_{q_\tau} \Big\}}{\Gamma(n_{d,w} + 1)}. \qquad (50)
\end{aligned}
$$

Note that by introducing the Taylor approximation (48) to $\ln(N\theta^{\top}\tau_w)$, and using the laws for computing the means and covariances of $\tau_w$ and $\pi$ under the multivariate log-normal distribution (i.e., Eqs. (38)-(40)), the expectation terms in the above equations can be approximately solved. Using techniques similar to those employed in Section 3.1, these marginals can then be approximated by standard SSMs with Gaussian emissions.

4 Parameter Estimation

As mentioned before, we can use a variational EM scheme to estimate the parameters of our model, which are essentially the SSM parameters. Operationally, VEM is no different from a standard EM for SSMs: we have the observation sequences $\{\langle \gamma_d \rangle\}_{d=1}^{N_t}$ for the topic-mixing SSM as defined by $q_\mu$, and an observation sequence $\{v_k^{(t)}\}$ for each of the $K$ topic representation (i.e., word-frequency) SSMs as defined by $q_\eta$; and we can use the standard learning rules for SSM parameter estimation. In the E step, we use the KF/RTS recursions (Eqs. (13)-(14) and (30)-(31)) to estimate the expected sufficient statistics of the latent states; in the M step, we update the parameters $\Phi, A, \Sigma_t$ of the topic-mixing SSM, and $\Psi, B$ of the topic representation SSM. Following [Ghahramani and Hinton, 1996; Ghahramani and Hinton, 1998], which give detailed derivations of maximum likelihood estimation for the standard SSM and the switching SSM, below we give the relevant MLE formulae for the parameters of our model. Each estimate can be derived by taking the corresponding partial derivative of the expected log-likelihood under our variational approximation to the true posterior, setting it to zero, and solving.

Topic-mixing dynamics matrix:

$$A = \Big( \sum_{t=2}^{T} V_{t,t-1}^{\mu} \Big) \Big( \sum_{t=2}^{T} V_{t-1}^{\mu} \Big)^{-1}, \qquad (51)$$

where $V_t^{\mu} = E[\mu_t \mu_t^{\top} \mid Y_1, \ldots, Y_T]$ (not to be confused with the centered moment $P_{t|T}$), and $V_{t,t-1}^{\mu} = E[\mu_t \mu_{t-1}^{\top} \mid Y_1, \ldots, Y_T]$. From the RTS smoother, it is easy to see that:

$$V_t^{\mu} = P_{t|T} + \hat{\mu}_{t|T}\,\hat{\mu}_{t|T}^{\top}, \qquad V_{t,t-1}^{\mu} = P_{t,t-1|T} + \hat{\mu}_{t|T}\,\hat{\mu}_{t-1|T}^{\top}, \qquad (52)$$

where the posterior estimates of the self- and cross-time covariance matrices $P_{t|T}$ and $P_{t,t-1|T}$ are computed from Eqs. (14) and (15).

Noise covariance matrix for the topic-mixing state:

$$\Phi = \frac{1}{T-1} \Big( \sum_{t=2}^{T} V_t^{\mu} - A \sum_{t=2}^{T} \big(V_{t,t-1}^{\mu}\big)^{\top} \Big). \qquad (53)$$
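An illustrative M-step fragment for the topic-mixing dynamics (Eqs. (51)-(53)), taking the smoothed moments produced by the E-step (hypothetical array layout):

```python
import numpy as np

def m_step_dynamics(mu_s, P_s, P_cross):
    """MLE of the dynamics matrix A and state noise Phi (Eqs. 51-53).

    mu_s:    (T, K) smoothed means mu_hat_{t|T}
    P_s:     (T, K, K) smoothed covariances P_{t|T}
    P_cross: (T, K, K) cross covariances P_{t,t-1|T} (entry 0 unused)
    """
    T, K = mu_s.shape
    V = P_s + np.einsum('ti,tj->tij', mu_s, mu_s)        # V_t = P_{t|T} + mu mu^T
    V_cross = P_cross[1:] + np.einsum('ti,tj->tij', mu_s[1:], mu_s[:-1])
    A = V_cross.sum(axis=0) @ np.linalg.inv(V[:-1].sum(axis=0))       # Eq. 51
    Phi = (V[1:].sum(axis=0) - A @ V_cross.sum(axis=0).T) / (T - 1)   # Eq. 53
    return A, Phi

# Toy usage with placeholder moments.
T, K = 6, 3
rng = np.random.default_rng(6)
mu_s = rng.standard_normal((T, K))
P_s = np.stack([np.eye(K)] * T); P_cross = np.stack([0.5 * np.eye(K)] * T)
A, Phi = m_step_dynamics(mu_s, P_s, P_cross)
```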

Output covariance matrix for the topic-mixing vectors:

$$\Sigma_t = \frac{1}{N_t} \sum_{d=1}^{N_t} \big(\langle \gamma_d \rangle - \hat{\mu}_{t|T}\big)\big(\langle \gamma_d \rangle - \hat{\mu}_{t|T}\big)^{\top}. \qquad (54)$$

Topic representation (i.e., topic-specific word frequency vector) dynamics matrix:

$$B = \Big( \sum_{t=2}^{T} V_{k,t,t-1}^{\eta} \Big) \Big( \sum_{t=2}^{T} V_{k,t-1}^{\eta} \Big)^{-1}, \qquad (55)$$

where

$$V_{k,t}^{\eta} = P_{k,t|T} + \hat{\eta}_{k,t|T}\,\hat{\eta}_{k,t|T}^{\top}, \qquad V_{k,t,t-1}^{\eta} = P_{k,t,t-1|T} + \hat{\eta}_{k,t|T}\,\hat{\eta}_{k,t-1|T}^{\top}. \qquad (56)$$

Noise covariance matrix for the topic representation vectors:

$$\Psi = \frac{1}{T-1} \Big( \sum_{t=2}^{T} V_{k,t}^{\eta} - B \sum_{t=2}^{T} \big(V_{k,t,t-1}^{\eta}\big)^{\top} \Big). \qquad (57)$$

We set the initial vectors $\iota$ and $\nu$ to zero instead of estimating them from the data. Finally, note that the formulas given above are the most general forms of the transitions and correlations of topics and words. In practice, to avoid over-parameterization, we can choose to restrict, for example, the transition matrices $B$ and the covariance matrices $\Psi$ of the topic representations to be sparse or diagonal matrices, so as to model only random-walk effects.

5 Conclusion

In this report I introduced topic evolution models for longitudinal epochs of word documents. The models employ marginally dependent latent state-space models for evolving topic proportion distributions and topic-specific word distributions, and both a logistic-normal-multinomial and a logistic-normal-Poisson model for document likelihood. These models allow posterior inference of latent topic themes over time, and topical clustering of longitudinal document epochs. I derived a variational inference algorithm for non-conjugate generalized linear models based on truncated Taylor approximation, and I also outlined formulae for parameter estimation based on the variational EM principle. In the current model, I assume that all topics coexist over time and that no new topic emerges over time. In a companion report, I present a birth-death process model that captures more complicated and realistic behaviors of topic evolution, such as aggregation, emergence, extinction, and splitting of topics over time.

References

[Blei and Lafferty, 2006] D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems 18, 2006.

[Blei et al., 2003] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[Ghahramani and Hinton, 1996] Z. Ghahramani and G. E. Hinton. Parameter estimation for linear dynamical systems. University of Toronto Technical Report CRG-TR-96-2, 1996.

[Ghahramani and Hinton, 1998] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 1998.

[Griffiths and Steyvers, 2004] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences USA, 101(Suppl. 1):5228-5235, 2004.

[Hofmann, 1999] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd International ACM SIGIR Conference, pages 50-57, 1999.

[Jordan et al., 1999] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publishers, 1999.

[Kleiber and Kotz, 2003] C. Kleiber and S. Kotz. Statistical Size Distributions in Economics and Actuarial Sciences. Wiley-Interscience, 2003.

[Steyvers et al., 2004] M. Steyvers, P. Smyth, M. Rosen-Zvi, and T. Griffiths. Probabilistic author-topic models for information discovery. In The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.

[Xing et al., 2003] E. P. Xing, M. I. Jordan, and S. Russell. A generalized mean field algorithm for variational inference in exponential families. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence, 2003.


More information

Proof of SPNs as Mixture of Trees

Proof of SPNs as Mixture of Trees A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

22 : Hilbert Space Embeddings of Distributions

22 : Hilbert Space Embeddings of Distributions 10-708: Probabilistic Graphical Models 10-708, Spring 2014 22 : Hilbert Space Embeddings of Distributions Lecturer: Eric P. Xing Scribes: Sujay Kumar Jauhar and Zhiguang Huo 1 Introduction and Motivation

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

The Ehrenfest Theorems

The Ehrenfest Theorems The Ehrenfest Theorems Robert Gilmore Classical Preliminaries A classical system with n egrees of freeom is escribe by n secon orer orinary ifferential equations on the configuration space (n inepenent

More information

SYMMETRIC KRONECKER PRODUCTS AND SEMICLASSICAL WAVE PACKETS

SYMMETRIC KRONECKER PRODUCTS AND SEMICLASSICAL WAVE PACKETS SYMMETRIC KRONECKER PRODUCTS AND SEMICLASSICAL WAVE PACKETS GEORGE A HAGEDORN AND CAROLINE LASSER Abstract We investigate the iterate Kronecker prouct of a square matrix with itself an prove an invariance

More information

The Hamiltonian particle-mesh method for the spherical shallow water equations

The Hamiltonian particle-mesh method for the spherical shallow water equations ATMOSPHERIC SCIENCE LETTERS Atmos. Sci. Let. 5: 89 95 (004) Publishe online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.100/asl.70 The Hamiltonian particle-mesh metho for the spherical

More information

New Statistical Test for Quality Control in High Dimension Data Set

New Statistical Test for Quality Control in High Dimension Data Set International Journal of Applie Engineering Research ISSN 973-456 Volume, Number 6 (7) pp. 64-649 New Statistical Test for Quality Control in High Dimension Data Set Shamshuritawati Sharif, Suzilah Ismail

More information

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES MATHEMATICS OF COMPUTATION Volume 69, Number 231, Pages 1117 1130 S 0025-5718(00)01120-0 Article electronically publishe on February 17, 2000 EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION

More information

arxiv: v2 [math.st] 29 Oct 2015

arxiv: v2 [math.st] 29 Oct 2015 EXPONENTIAL RANDOM SIMPLICIAL COMPLEXES KONSTANTIN ZUEV, OR EISENBERG, AND DMITRI KRIOUKOV arxiv:1502.05032v2 [math.st] 29 Oct 2015 Abstract. Exponential ranom graph moels have attracte significant research

More information

INDEPENDENT COMPONENT ANALYSIS VIA

INDEPENDENT COMPONENT ANALYSIS VIA INDEPENDENT COMPONENT ANALYSIS VIA NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATION Truth Rotate S 2 2 1 0 1 2 3 4 X 2 2 1 0 1 2 3 4 4 2 0 2 4 6 4 2 0 2 4 6 S 1 X 1 Reconstructe S^ 2 2 1 0 1 2 3 4 Marginal

More information

LeChatelier Dynamics

LeChatelier Dynamics LeChatelier Dynamics Robert Gilmore Physics Department, Drexel University, Philaelphia, Pennsylvania 1914, USA (Date: June 12, 28, Levine Birthay Party: To be submitte.) Dynamics of the relaxation of a

More information

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10 Some vector algebra an the generalize chain rule Ross Bannister Data Assimilation Research Centre University of Reaing UK Last upate 10/06/10 1. Introuction an notation As we shall see in these notes the

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

Gaussian processes with monotonicity information

Gaussian processes with monotonicity information Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process

More information