Dirichlet Mixtures in Text Modeling

Mikio Yamamoto and Kugatsu Sadamitsu
CS Technical Report CS-TR-05-1
University of Tsukuba
May 30, 2005

Abstract

Word rates in text vary according to global factors such as genre, topic, author, and expected readership (Church and Gale 1995). Models that summarize such global factors in text at the document level are called text models. A finite mixture of Dirichlet distributions (Dirichlet Mixture, or DM for short) was investigated as a new text model. When the parameters of a multinomial are drawn from a DM, the compound distribution for discrete outcomes is a finite mixture of Dirichlet-multinomials. A Dirichlet-multinomial can be regarded as a multivariate version of the Poisson mixture, a reliable univariate model for global factors (Church and Gale 1995). In the present paper, the DM and its compounds are introduced, with parameter estimation methods derived from Minka's fixed-point methods (Minka 2003) and the EM algorithm. The method can estimate a considerable number of parameters for a large DM, i.e., a few hundred thousand parameters. After a discussion of the relationships among the DM, probabilistic latent semantic analysis (PLSA) (Hofmann 1999), the mixture of unigrams (Nigam et al. 2000), and latent Dirichlet allocation (LDA) (Blei et al. 2001, 2003), applications to statistical language modeling are presented and their performance compared in terms of perplexity. The DM model achieves the lowest perplexity level despite its unitopic nature.

1 Introduction

Word rates in text vary according to global factors such as genre, topic, author, and expected readership. Church and Gale (1995) examined the Brown corpus and showed that the English word "said" occurs with high frequency in the press and fiction genres, but relatively infrequently in the hobby and learned genres. This observation is basic to their model of word rate variation. Rosenfeld (1996) wrote that the occurrence of the word "winter" in a document is correlated with the occurrence of the word "summer" in the same document. This is a basic observation for his trigger models.
Church (2001) states that a document including the word "Noriega" has a high probability of another occurrence of "Noriega" in the same document, an observation
basic to his adaptation model. In the present paper, an attempt is made to model these global factors with a new generative text model (the Dirichlet Mixture, or DM for short), and to apply it to improve conventional ngram language models, which capture only local interdependency among words. It is well known that global factors are important to language modeling in three research communities: natural language processing, speech processing, and neural networks. In the natural language processing community, Church and Gale (1995) proposed that word rate distribution can be explained by the Poisson mixture, an infinite Poisson mixture model in which the Poisson parameter varies according to a density function. For example, they demonstrated that empirical word rate variation closely fits a special case of the Poisson mixture, the negative binomial, in which the density function is a gamma distribution. However, because the Poisson mixture is univariate, it is difficult to manage the rates of all words simultaneously. Researchers in the speech processing community have proposed and tested a great number of multivariate models (cache models, trigger models, and topic-based models) to capture distant word dependency over the past two decades. Iyer and Ostendorf (1999) proposed an m-component mixture of unigram models with a parameter estimation method using the EM algorithm. Each unigram model of the mixture corresponds to a different topic and yields word rates for that topic. Using topic-based finite mixture models, language models can be greatly improved in perplexity, though parameter estimation for this kind of model tends to overfit the training data. Recently, generative text models such as latent Dirichlet allocation (LDA) have attracted researchers in the neural network community. Using generative text models, the probability of a whole document, rather than simply of sentences, can be computed. Probability computation in these models takes advantage of prior distributions of word rate variability garnered from large document collections.
Generative models are statistically well defined and robust for parameter estimation and adaptation because they exploit (hierarchical) Bayesian frameworks, which rely heavily on prior and posterior distributions of word rates. In this paper, a new generative text model is investigated that unifies the following concepts developed by the three communities related to language processing:

(1) summarization of word rate variability as a stochastic distribution;
(2) finite topic mixture models of multivariate distributions;
(3) generative text models based on a (hierarchical) Bayesian framework.

Based on (1) and (2), it is assumed that word rate variability can be modeled with a finite mixture of Dirichlet distributions. The finite mixture encapsulates rough topic structure, and each Dirichlet distribution yields clear word rate variability within each topic for all words simultaneously. From (3), a robust model is built by adopting a Bayesian framework, employing a prior and a posterior distribution to estimate the DM parameters and to adapt them to the context of documents as they are processed.
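The generative story assumed here, in which a document gets one topic, word probabilities are drawn from that topic's Dirichlet, and words are drawn from the resulting multinomial, can be sketched as follows. This is an illustrative sketch only; `sample_document` and its arguments are hypothetical names, not from the paper.

```python
import random

def sample_document(lam, alpha, doc_len, seed=0):
    """Sketch of the assumed generative process: a document gets ONE
    component m drawn from the mixture weights lam, word probabilities p
    are drawn from the m-th Dirichlet, and doc_len words are drawn
    i.i.d. from the multinomial with parameter p."""
    rng = random.Random(seed)
    V = len(alpha[0])
    # choose a mixture component (the document's single topic)
    m = rng.choices(range(len(lam)), weights=lam)[0]
    # draw p ~ Dirichlet(alpha_m) via normalized Gamma variates
    g = [rng.gammavariate(a, 1.0) for a in alpha[m]]
    total = sum(g)
    p = [x / total for x in g]
    # draw the word-count vector y from the multinomial
    counts = [0] * V
    for _ in range(doc_len):
        counts[rng.choices(range(V), weights=p)[0]] += 1
    return counts
```

Sampling a Dirichlet by normalizing independent Gamma draws is the standard construction; it keeps the sketch within the Python standard library.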
In Sections 2 and 3 of the paper, the DM model is described, together with its parameter estimation methods and its posterior and predictive distributions. In Section 4, the relationship between the DM and other text models is discussed. In Section 5, experimental results for statistical language modeling applications are presented. The DM model achieves lower perplexity levels than the mixture of unigrams (MU) and LDA models.

2 The DM and parameter estimation

2.1 The DM and the Polya mixture

The Dirichlet distribution is defined for a random vector p = (p_1, p_2, ..., p_V) on a simplex of V dimensions; the elements of a random vector on a simplex sum to 1. We interpret p as the occurrence probabilities of the V words of a vocabulary, so that the Dirichlet distribution models word occurrence probabilities. The density function of the Dirichlet distribution for p is:

    P_D(p; \alpha) = \frac{\Gamma(\alpha)}{\prod_{v=1}^{V} \Gamma(\alpha_v)} \prod_{v=1}^{V} p_v^{\alpha_v - 1},    (1)

where \alpha = (\alpha_1, \alpha_2, ..., \alpha_V) is a parameter vector, \alpha_v > 0, and \alpha = \sum_{v=1}^{V} \alpha_v. The Dirichlet mixture distribution (Sjölander et al. 1996) with M components is defined as follows:

    P_{DM}(p; \lambda, \alpha_1^M) = \sum_{m=1}^{M} \lambda_m P_D(p; \alpha_m)
                                   = \sum_{m=1}^{M} \lambda_m \frac{\Gamma(\alpha_m)}{\prod_{v=1}^{V} \Gamma(\alpha_{mv})} \prod_{v=1}^{V} p_v^{\alpha_{mv} - 1},    (2)

where \lambda = (\lambda_1, \lambda_2, ..., \lambda_M) is the weight vector of the component Dirichlet distributions and \alpha_m = \sum_v \alpha_{mv}. When the random vector p, as the parameter of a multinomial, is drawn from the DM, the compound distribution for discrete outcomes y = (y_1, y_2, ..., y_V) is:

    P_{PM}(y; \lambda, \alpha_1^M) = \int P_{Mul}(y|p) P_{DM}(p; \lambda, \alpha_1^M) \, dp
                                   = \sum_{m=1}^{M} \lambda_m \int P_{Mul}(y|p) P_D(p; \alpha_m) \, dp
                                   = \sum_{m=1}^{M} \lambda_m \frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + y)} \prod_{v=1}^{V} \frac{\Gamma(y_v + \alpha_{mv})}{\Gamma(\alpha_{mv})},    (3)
where y = \sum_v y_v. Each y_v is the occurrence frequency of the v-th word in a document. This distribution is called the Dirichlet-multinomial mixture or the Polya mixture distribution. The Polya mixture is used to estimate the parameters of the DM, \alpha and \lambda.

2.2 Parameter estimation

In this subsection, methods are introduced for estimating the parameters of the DM with a maximum likelihood estimator of the Polya mixture. The estimation methods are based on Minka's estimation methods for a Dirichlet distribution (Minka 2003) and the EM algorithm (Dempster et al. 1977). Given the i-th training document, the word counts y_i = (y_{i1}, y_{i2}, ..., y_{iV}) can be determined. For N training documents, the log likelihood function of the training data D = (y_1, y_2, ..., y_N) is:

    L(D; \lambda, \alpha_1^M) = \sum_{i=1}^{N} \log P_{PM}(y_i; \lambda, \alpha_1^M).

The \lambda and \alpha that maximize the above likelihood function are also the DM parameters. Let Z = (z_1, z_2, ..., z_N), where z_i is a hidden variable denoting the component that generates the i-th document; the log likelihood of the complete data is:

    L(D, Z; \lambda, \alpha_1^M) = \sum_{i=1}^{N} \log P(y_i, z_i; \lambda, \alpha_1^M).

The Q-function for the EM algorithm, i.e., the conditional expectation of the above log likelihood, is:

    Q(\theta|\bar\theta) = \sum_i \sum_m P_{im} \log P(y_i, z_i = m; \lambda, \alpha_1^M)
                         = \sum_i \sum_m P_{im} \log \lambda_m + \sum_i \sum_m P_{im} \log \frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + y_i)} \prod_{v=1}^{V} \frac{\Gamma(y_{iv} + \alpha_{mv})}{\Gamma(\alpha_{mv})},    (4)

where P_{im} = P(z_i = m | y_i; \bar\lambda, \bar\alpha_1^M) and y_i = \sum_v y_{iv}.
\bar\lambda and \bar\alpha_1^M are the current values of the parameters. The first term of (4) can be maximized via the following update formula for \lambda:

    \lambda_m \propto \sum_i P_{im}.    (5)

The second term of (4) can be maximized via the following update formula for \alpha, where \Psi(x) is the digamma function. The update formula is derived from Minka's fixed-point iteration (Minka 2003) and the EM algorithm (see Appendix A):

    \alpha_{mv} = \bar\alpha_{mv} \frac{\sum_i P_{im} \{\Psi(y_{iv} + \bar\alpha_{mv}) - \Psi(\bar\alpha_{mv})\}}{\sum_i P_{im} \{\Psi(y_i + \bar\alpha_m) - \Psi(\bar\alpha_m)\}}.    (6)

If the leaving-one-out (LOO) likelihood is designated as the function to be maximized, a faster update formula is obtained. The following update formula is based on Minka's iteration for the LOO likelihood (Minka 2003) and the EM algorithm (see Appendix B):

    \alpha_{mv} = \bar\alpha_{mv} \frac{\sum_i P_{im} \{y_{iv} / (y_{iv} - 1 + \bar\alpha_{mv})\}}{\sum_i P_{im} \{y_i / (y_i - 1 + \bar\alpha_m)\}}.    (7)

This LOO update formula is used to estimate the DM parameters in all of the following experiments.

3 Inference

3.1 Posterior and predictive distributions

The DM model with parameters estimated using the above methods is regarded as a prior for the distribution of word occurrence probabilities. In this section, a method is described for computing the posterior and expectations of word occurrence probability, given a word history or a document. The following formula is the posterior distribution of the word occurrence probability given the history data y = (y_1, y_2, ..., y_V), assuming a multinomial distribution for the count data y, with parameter p distributed according to a DM as a prior:

    P(p|y) = \frac{P(y|p) P(p)}{\int P(y|p) P(p) \, dp}
           = \frac{P_{Mul}(y|p) P_{DM}(p; \lambda, \alpha_1^M)}{\int P_{Mul}(y|p) P_{DM}(p; \lambda, \alpha_1^M) \, dp}
           = \frac{\sum_{m=1}^{M} B_m \prod_v p_v^{\alpha_{mv} + y_v - 1}}{\sum_{m=1}^{M} C_m},    (8)
where

    B_m = \lambda_m \frac{\Gamma(\alpha_m)}{\prod_{v=1}^{V} \Gamma(\alpha_{mv})}, \quad
    C_m = B_m \frac{\prod_{v=1}^{V} \Gamma(\alpha_{mv} + y_v)}{\Gamma(\alpha_m + y)}, \quad
    \alpha_m = \sum_{v=1}^{V} \alpha_{mv}, \quad
    y = \sum_{v=1}^{V} y_v.

The expectation of the occurrence probability of the w-th word of the vocabulary, P(w|y), is:

    P(w|y) = \int p_w P(p|y) \, dp
           = \frac{\sum_{m=1}^{M} B_m \int \prod_{v=1}^{V} p_v^{\alpha_{mv} + y_v + \delta(v-w) - 1} \, dp}{\sum_{m=1}^{M} C_m}
           = \frac{\sum_{m=1}^{M} B_m \prod_{v=1}^{V} \Gamma\{\alpha_{mv} + y_v + \delta(v-w)\} / \Gamma(\alpha_m + y + 1)}{\sum_{m=1}^{M} C_m}
           = \frac{\sum_{m=1}^{M} C_m \frac{\alpha_{mw} + y_w}{\alpha_m + y}}{\sum_{m=1}^{M} C_m},    (9)

where \delta(k) is Kronecker's delta:

    \delta(k) = 1 if k = 0, and 0 otherwise.

In contrast to LDA, the DM has a closed formula for computing the expectation of word occurrence probability.

3.2 Model averaging

In the experiment section (Section 5), it is demonstrated that a statistical language model using the DM outperforms the other models even with few components, but that performance does not rise in proportion to the number of components. This problem reflects the overfitting nature of DM models. To avoid the overfitting problem, a simple model averaging method is adopted, which computes a predictive distribution as a mean over the predictions of DM models with different numbers of components. It is assumed that there are N different DM models, and that P_i(w|y), i = 1, 2, ..., N is the predictive probability of the word w under the i-th model. The following averaging
equation is referred to as method 1, in which the evidence probability of the history is regarded as a credit weight for each model:

    P_{ma1}(w|y) = \sum_i \frac{P_{PM}^i(y; \lambda, \alpha)}{\sum_j P_{PM}^j(y; \lambda, \alpha)} P_i(w|y).    (10)

Method 2 is simpler, averaging the predictive probabilities with a plain arithmetic mean:

    P_{ma2}(w|y) = \frac{1}{N} \sum_i P_i(w|y).    (11)

4 Relationship with other topic-based models

The relationship of the DM with the other topic-based models is demonstrated using graphical representations of the models. The following are the evidence probabilities for data y = (y_1, y_2, ..., y_V) in each topic-based model.

Mixture of unigrams: P(y) = \sum_z p(z) P_{Mul}(y|z)

LDA: P(y) = \int P_D(\theta|\alpha) \prod_v P(w_v|\theta)^{y_v} \, d\theta, where P(w_v|\theta) = \sum_z p(z|\theta) p(w_v|z)

DM: P(y) = P_{PM}(y)

The z's of the mixture of unigrams and LDA are latent variables representing topics. The \theta of LDA contains the weights of the unigram models and is modeled with the Dirichlet distribution P_D(\theta; \alpha). The evidence probability of the DM is the Polya mixture distribution described in Sec. 2.1. Figure 1 is a graphical representation of the four models, including probabilistic LSA (PLSA). The outer and inner squares represent the N documents (d) and the L words (w) in each document, respectively. Circles are random variables and double circles are model parameters. Arrows indicate conditional dependency relationships between variables. The simplest model is the MU. In this model, it is assumed that a document is generated from just one topic: the first chosen topic z is used to generate all words in the document. The PLSA model relaxes this assumption. Under PLSA, each word in a document may be generated from a different topic. As a result, a document is assumed to have multiple topics, a realistic assumption. However, PLSA is not a well-defined generative model for documents, because the model has no natural method for assigning a probability to a new document. LDA is an extension of PLSA. By assuming Dirichlet-distributed weights over the unigram models, LDA can assign probabilities to new documents. The DM is another extension of the MU.
In the DM, the MU assumption of one topic per document remains, but the distribution of unigram probabilities is directly modeled by the Dirichlet distributions. The other models, including PLSA and LDA, can capture multitopic structure, but each topic model is quite simple: a unigram model. Though the DM assumes a unitopic structure for each document, its topic models are richer than those of its strong rivals.
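The closed-form expectation of eq. (9) and the averaging schemes of eqs. (10) and (11) are both simple to realize. The sketch below is illustrative (function and argument names are not from the paper); it computes the C_m weights in log space for numerical stability, since the constant part of B_m cancels between the numerator and the denominator of eq. (9).

```python
import math

def predictive_word_probs(y, lam, alpha):
    """Expected word probabilities P(w|y) under the DM posterior (eq. 9).
    y: observed word counts; lam: mixture weights; alpha: M x V parameters."""
    V, n = len(y), sum(y)
    # log C_m, combining lambda_m, Gamma(alpha_m)/prod Gamma(alpha_mv),
    # and the Gamma ratios of C_m
    log_C = []
    for m in range(len(lam)):
        a, s = alpha[m], sum(alpha[m])
        log_C.append(math.log(lam[m]) + math.lgamma(s) - math.lgamma(s + n)
                     + sum(math.lgamma(av + yv) - math.lgamma(av)
                           for av, yv in zip(a, y)))
    mx = max(log_C)
    w = [math.exp(c - mx) for c in log_C]
    Z = sum(w)
    probs = [0.0] * V
    for m, wm in enumerate(w):
        s = sum(alpha[m])
        for v in range(V):
            # posterior-mean contribution of component m (eq. 9)
            probs[v] += (wm / Z) * (alpha[m][v] + y[v]) / (s + n)
    return probs

def average_predictions(evidences, preds):
    """Model averaging over N DM models: method 1 (eq. 10) weights each
    model's prediction by its evidence P_PM(y); method 2 (eq. 11) is a
    plain arithmetic mean."""
    N, V = len(preds), len(preds[0])
    Z = sum(evidences)
    ma1 = [sum(evidences[i] * preds[i][v] for i in range(N)) / Z
           for v in range(V)]
    ma2 = [sum(preds[i][v] for i in range(N)) / N for v in range(V)]
    return ma1, ma2
```

With a single component and a uniform Dirichlet, `predictive_word_probs` reduces to the familiar add-one posterior mean (alpha_v + y_v) / (alpha + y).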
Figure 1: Graphical model representations of generative text models

5 Experiments

The performance of the DM was compared with those of LDA and the MU in terms of test-set perplexity using adaptive ngram language models. All three models were trained on the same training data and evaluated with the same test data. The training data was a set of 98,211 Japanese newspaper articles from the year 1999. The test data was a set of 495 articles of more than 40 words each, randomly selected from the year 1998. The vocabulary comprised the 20,000 most frequent words in the training data and had a coverage rate of 97.1%. The variational EM method was used to estimate the LDA parameters (Blei et al. 2003), but the Dirichlet parameter \alpha of LDA was updated with Minka's fixed-point iteration method (Minka 2003) instead of the Newton-Raphson method. For the DM, the estimation method based on the LOO likelihood (eq. 7) was used. Training for both models used the same stopping criterion: the change of the closed perplexity on the training data before and after one global loop of iteration is less than 0.1%. Models were constructed with both LDA and DM having 1, 2, 5, 10, 20, 50, 100, 200, and 500 components. Figure 2 presents the perplexity of each model for different numbers of components. Each perplexity is computed as the inverse of the probability per word, a geometric mean of the document probability, that is, the evidence probability of the data. For the DM, the Polya mixture probability of a document is the document probability. The DM consistently outperforms both other models. However, DM performance is best at 20 components, as it saturates with relatively few components. The DM is a generalized version of the MU, and the saturation suggests that the DM suffers from the MU overfitting problem. Figure 3 presents the perplexity of the adaptive language models for different numbers of components. Adaptive language models predict the probability of the next word by using a history, like a conventional ngram language model.
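The document probability used here is the Polya mixture evidence of eq. (3), and perplexity is the inverse geometric mean of the per-word probability. A sketch with illustrative names (as in eq. 3, the multinomial coefficient is omitted):

```python
import math

def log_polya_mixture(y, lam, alpha):
    """log P_PM(y; lambda, alpha), the Polya mixture evidence of eq. (3)."""
    n = sum(y)
    terms = []
    for m in range(len(lam)):
        a, s = alpha[m], sum(alpha[m])
        terms.append(math.log(lam[m]) + math.lgamma(s) - math.lgamma(s + n)
                     + sum(math.lgamma(yv + av) - math.lgamma(av)
                           for yv, av in zip(y, a)))
    mx = max(terms)  # log-sum-exp over the M components
    return mx + math.log(sum(math.exp(t - mx) for t in terms))

def perplexity(docs, lam, alpha):
    """Test-set perplexity: the inverse geometric mean of the
    probability per word over all test documents."""
    total_ll = sum(log_polya_mixture(y, lam, alpha) for y in docs)
    total_words = sum(sum(y) for y in docs)
    return math.exp(-total_ll / total_words)
```

The log-sum-exp step keeps the mixture sum stable even when individual component likelihoods underflow.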
However, they employ a longer history, such as the entire previously processed section, rather than just the two words preceding the target
word.

Figure 2: Comparison of test-set perplexity by document probability

In this experiment, the models were adapted to the section of a document from the first word to the current word, and then probabilities were computed for the next 20 words. This operation was repeated every 20 words. Figure 3 also shows that the DM outperforms LDA.

Figure 3: Comparison of test-set perplexity by history adaptation probability (unigram)

For the next experiment, a trigram language model was developed and dynamically adapted to longer histories using the topic-based models. A unigram rescaling method was used for the adaptation (Gildea and Hofmann 1999). Figure 4 shows the perplexity of the combined adaptive language models. As in the above experiments, the performance of the unigram-rescaled trigram model with the DM is better than that with LDA.

Figure 4: Comparison of test-set perplexity by history adaptation probability (trigram)

Table 1 shows the best perplexity of each model. The values in parentheses are perplexity reduction rates relative to the baseline models.

Table 1: Minimum perplexity of each aspect model

    Model                 history adaptation       history adaptation       document
                          probability (unigram)    probability (trigram)    probability
    DM                    453.06 (32.0%)           60.74 (19.6%)            434.73
    DM ave.2              425.97 (36.1%)           57.97 (23.2%)            -
    LDA                   467.61 (29.9%)           67.13 (11.1%)            474.82
    Mixture of Unigrams   -                        -                        520.95

LDA is a multitopic text model, which captures a mixture of topics in a document, while the DM is a unitopic text model, which assumes one topic per document. Generally speaking, few documents have just one topic. This raises the question of why the DM is better than LDA in perplexity in these experiments. While there is no clear answer, it is possible that, because the training and test materials for these experiments were newspaper articles, they were somewhat focused on a single topic. The DM may capture the detailed distribution of word probabilities over topics using multiple Dirichlets, whereas LDA captures the distribution only indirectly, as a topic proportion or mixture weight, using a single Dirichlet. The same experiments need to be conducted with web data, which contains more complex topical structures.

6 Conclusions

A finite mixture of Dirichlet distributions was investigated as a new text model. Parameter estimation methods were introduced, based on Minka's fixed-point methods (Minka 2003) and the EM algorithm, using the mixture of Dirichlet-multinomial distributions. Experimental results for statistical language modeling applications were demonstrated and their perplexity performance compared. The DM model achieved the lowest perplexity level despite its unitopic nature.

References

[1] D.M. Blei, A.Y. Ng, and M.I. Jordan. 2001. Latent Dirichlet allocation. Neural Information Processing Systems, Vol. 14.

[2] D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, Vol. 3, pages 993-1022.

[3] Kenneth W. Church and William A. Gale. 1995. Poisson mixtures. Natural Language Engineering, Vol. 1, No. 1, pages 163-190.

[4] Kenneth W. Church. 2001. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p^2. In Proc. of COLING 2000, pages 180-186.

[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, pages 1-38.

[6] T. Hofmann. 1999. Probabilistic latent semantic indexing. In Proc. of the 22nd Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57, Berkeley, California.

[7] K. Sjölander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I.S. Mian, and D. Haussler. 1996. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Computer Applications in the Biosciences, Vol. 12, No. 4, pages 327-345.

[8] D. Gildea and T. Hofmann. 1999. Topic-based language models using EM. In Proc. of the 6th European Conference on Speech Communication and Technology (EUROSPEECH '99).

[9] R.M. Iyer and M. Ostendorf. 1999. Modeling long distance dependence in language: topic mixtures versus dynamic cache models. IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, pages 30-39.

[10] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, Vol. 39, No. 2/3, pages 103-134.

[11] T. Minka. 2003. Estimating a Dirichlet distribution. http://www.stat.cmu.edu/~minka/papers/dirichlet/

[12] Ronald Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, Vol. 10, No. 3, pages 187-228.
A Derivation of the update formula of \alpha for the MLE

Minka's fixed-point iteration method (Minka 2003) and the EM algorithm were used. The following inequalities (Minka 2003) give a lower bound on the second term of equation (4):

    \frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + y_i)} \geq \frac{\Gamma(\bar\alpha_m)}{\Gamma(\bar\alpha_m + y_i)} \exp\{(\bar\alpha_m - \alpha_m) b_{im}\},

and

    \frac{\Gamma(\alpha_{mv} + y_{iv})}{\Gamma(\alpha_{mv})} \geq c_{imv} \, \alpha_{mv}^{a_{imv}}    (if y_{iv} \geq 1),

where y_i = \sum_v y_{iv}, \alpha_m = \sum_v \alpha_{mv}, and \bar\alpha_m = \sum_v \bar\alpha_{mv}. The resulting lower bound is:

    \sum_i \sum_m P_{im} \log \frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + y_i)} \prod_{v=1}^{V} \frac{\Gamma(y_{iv} + \alpha_{mv})}{\Gamma(\alpha_{mv})}
      \geq \sum_i \sum_m P_{im} \left[ \log \frac{\Gamma(\bar\alpha_m) \exp\{(\bar\alpha_m - \alpha_m) b_{im}\}}{\Gamma(y_i + \bar\alpha_m)} + \sum_v \log c_{imv} \, \alpha_{mv}^{a_{imv}} \right] \equiv Q'(\alpha),

where

    a_{imv} = \{\Psi(\bar\alpha_{mv} + y_{iv}) - \Psi(\bar\alpha_{mv})\} \, \bar\alpha_{mv},
    b_{im} = \Psi(\bar\alpha_m + y_i) - \Psi(\bar\alpha_m),
    c_{imv} = \frac{\Gamma(\bar\alpha_{mv} + y_{iv})}{\Gamma(\bar\alpha_{mv})} \, \bar\alpha_{mv}^{-a_{imv}}.

This lower bound is maximized by setting its derivative with respect to \alpha_{mv} to zero:

    \frac{\partial Q'(\alpha)}{\partial \alpha_{mv}} = -\sum_i P_{im} b_{im} + \frac{1}{\alpha_{mv}} \sum_i P_{im} a_{imv} = 0,

which yields the update formula:

    \alpha_{mv} = \bar\alpha_{mv} \frac{\sum_i P_{im} \{\Psi(y_{iv} + \bar\alpha_{mv}) - \Psi(\bar\alpha_{mv})\}}{\sum_i P_{im} \{\Psi(y_i + \bar\alpha_m) - \Psi(\bar\alpha_m)\}}.
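The update of eq. (6) needs the digamma function, which the Python standard library lacks. The sketch below is illustrative (names are not from the paper): it uses a common shift-and-asymptotic-series digamma approximation, and applies one fixed-point update of eq. (6) given fixed responsibilities P_im.

```python
import math

def digamma(x):
    """Psi(x) via the recurrence Psi(x) = Psi(x+1) - 1/x and an
    asymptotic series, accurate enough for this illustration (x > 0)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12.0 - f * (1/120.0 - f / 252.0))

def mle_fixed_point_update(docs, resp, alpha):
    """One application of eq. (6) to every alpha_mv, with resp[i][m]
    playing the role of the responsibilities P_im.  Terms with
    y_iv = 0 vanish automatically since Psi(a) - Psi(a) = 0."""
    M, V = len(alpha), len(alpha[0])
    new_alpha = []
    for m in range(M):
        a, s = alpha[m], sum(alpha[m])
        den = sum(resp[i][m] * (digamma(sum(docs[i]) + s) - digamma(s))
                  for i in range(len(docs)))
        new_alpha.append([
            a[v] * sum(resp[i][m] * (digamma(docs[i][v] + a[v]) - digamma(a[v]))
                       for i in range(len(docs))) / den
            for v in range(V)])
    return new_alpha
```

Iterating this update (with fresh responsibilities each EM loop) monotonically improves the bound Q'(alpha) derived above.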
B Derivation of the update formula of \alpha for the maximum LOO likelihood

Given the datum y_i\v, in which one occurrence of the word v is left out of the i-th document, the predictive probability P(v|y_i\v) of the word v follows from equation (9):

    p(v|y_i\backslash v) \approx \sum_m P_{im} \frac{\alpha_{mv} + y_{iv} - 1}{\alpha_m + y_i - 1}.

The LOO log-likelihood, L_loo, is:

    L_{loo}(y|\alpha) \approx \sum_i \sum_v y_{iv} \log \sum_m P_{im} \frac{\alpha_{mv} + y_{iv} - 1}{\alpha_m + y_i - 1}.

Since P_{im} is very nearly 1 or 0 in almost all cases, the preceding equation can be transformed:

    L_{loo}(y|\alpha) \approx \sum_i \sum_m \sum_v y_{iv} P_{im} \log \left( \frac{\alpha_{mv} + y_{iv} - 1}{\alpha_m + y_i - 1} \right).

Using this form, the likelihood can be maximized independently for the \alpha_{mv} of each Dirichlet component. The lower bound of the LOO log-likelihood L^m_loo for the m-th component is:

    L^m_{loo} \geq \sum_i \sum_v y_{iv} P_{im} q_{imv} \log \alpha_{mv} - \sum_i y_i P_{im} a_{im} \alpha_m + (const.),

where

    q_{imv} = \frac{\bar\alpha_{mv}}{\bar\alpha_{mv} + y_{iv} - 1}, \qquad a_{im} = \frac{1}{\bar\alpha_m + y_i - 1}.

The following inequalities (Minka 2003) were used to obtain the above bound:

    \log(n + x) \geq q \log x + (1 - q) \log n - q \log q - (1 - q) \log(1 - q), where q = \frac{\hat{x}}{n + \hat{x}},

and

    \log(x) \leq a x - 1 + \log \hat{x},

where a = 1/\hat{x}. Setting the derivative of the bound to zero gives the fixed-point update formula:

    \alpha_{mv} = \bar\alpha_{mv} \frac{\sum_i P_{im} \{y_{iv} / (y_{iv} - 1 + \bar\alpha_{mv})\}}{\sum_i P_{im} \{y_i / (y_i - 1 + \bar\alpha_m)\}}.
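Putting eq. (5) and eq. (7) together, one EM iteration for the Polya mixture might be sketched as follows. This is illustrative code, not the authors' implementation; zero counts contribute nothing to the numerator of eq. (7) and are skipped.

```python
import math

def em_loo_step(docs, lam, alpha):
    """One EM iteration for the Polya mixture: E-step responsibilities
    P_im, then the lambda update (eq. 5) and the LOO fixed-point
    alpha update (eq. 7).  Returns the updated (lam, alpha)."""
    M, V = len(lam), len(alpha[0])

    def comp_loglik(y, a):
        # log of the m-th Polya component (eq. 3, without lambda_m)
        n, s = sum(y), sum(a)
        return (math.lgamma(s) - math.lgamma(s + n)
                + sum(math.lgamma(yv + av) - math.lgamma(av)
                      for yv, av in zip(y, a)))

    # E-step: P_im = P(z_i = m | y_i), computed with log-sum-exp
    resp = []
    for y in docs:
        lt = [math.log(lam[m]) + comp_loglik(y, alpha[m]) for m in range(M)]
        mx = max(lt)
        w = [math.exp(t - mx) for t in lt]
        z = sum(w)
        resp.append([x / z for x in w])

    # M-step for lambda (eq. 5): proportional to summed responsibilities
    new_lam = [sum(r[m] for r in resp) / len(docs) for m in range(M)]

    # M-step for alpha: LOO fixed-point update (eq. 7)
    new_alpha = []
    for m in range(M):
        a, s = alpha[m], sum(alpha[m])
        den = sum(resp[i][m] * sum(docs[i]) / (sum(docs[i]) - 1 + s)
                  for i in range(len(docs)))
        new_alpha.append([
            a[v] * sum(resp[i][m] * docs[i][v] / (docs[i][v] - 1 + a[v])
                       for i in range(len(docs)) if docs[i][v] > 0) / den
            for v in range(V)])
    return new_lam, new_alpha
```

In practice the step would be repeated until the closed perplexity on the training data changes by less than the stopping threshold used in Section 5.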