Dirichlet Mixtures in Text Modeling

Mikio Yamamoto and Kugatsu Sadamitsu
CS Technical Report CS-TR-05-1
University of Tsukuba
May 30, 2005

Abstract

Word rates in text vary according to global factors such as genre, topic, author, and expected readership (Church and Gale 1995). Models that summarize such global factors in text, or at the document level, are called text models. A finite mixture of Dirichlet distributions (Dirichlet Mixture, or DM for short) was investigated as a new text model. When the parameters of a multinomial are drawn from a DM, the compound distribution for discrete outcomes is a finite mixture of Dirichlet-multinomials. A Dirichlet-multinomial can be regarded as a multivariate version of the Poisson mixture, a reliable univariate model of global factors (Church and Gale 1995). In the present paper, the DM and its compound distributions are introduced, together with parameter estimation methods derived from Minka's fixed-point methods (Minka 2003) and the EM algorithm. The method can estimate a considerable number of parameters for a large DM, i.e., a few hundred thousand parameters. After a discussion of the relationships among the DM, probabilistic latent semantic analysis (PLSA) (Hofmann 1999), the mixture of unigrams (Nigam et al. 2000), and latent Dirichlet allocation (LDA) (Blei et al. 2001, 2003), applications to statistical language modeling are discussed and their performance compared in perplexity. The DM model achieves the lowest perplexity level despite its unitopic nature.

1 Introduction

Word rates in text vary according to global factors such as genre, topic, author, and expected readership. Church and Gale (1995) examined the Brown corpus and showed that the English word "said" occurs with high frequency in the press and fiction genres, but relatively infrequently in the hobby and learned genres. This observation is basic to their model of word rate variation. Rosenfeld (1996) wrote that the occurrence of the word "winter" in a document is proportional to the occurrence of the word "summer" in the same document, the basic observation behind his trigger models. Church (2001) states that a document including the word "Noriega" has a high probability of containing another occurrence of "Noriega" in the same document, an observation basic to his adaptation model. In the present paper, an attempt is made to model these global factors with a new generative text model (the Dirichlet Mixture, or DM for short), and to apply it to improve conventional n-gram language models, which capture only local interdependencies among words.

It is well known that global factors are important to language modeling in three research communities involved in language processing: natural language processing, speech processing, and neural networks. In the natural language processing community, Church and Gale (1995) proposed that word rate distributions can be explained by the Poisson mixture, an infinite mixture of Poissons in which the Poisson parameter varies according to a density function. For example, they demonstrated that empirical word rate variation closely fits a special case of the Poisson mixture, the negative binomial, in which the density function is a gamma distribution. However, because the Poisson mixture is univariate, it is difficult to manage the rates of all words simultaneously.

Researchers in the speech processing community have, over the past two decades, proposed and tested a great number of multivariate models (cache models, trigger models, and topic-based models) to capture distant word dependencies. Iyer and Ostendorf (1999) proposed an m-component mixture of unigram models with a parameter estimation method based on the EM algorithm. Each unigram model in the mixture corresponds to a different topic and yields word rates for that topic. Using topic-based finite mixture models, language models can be greatly improved in perplexity, though parameter estimation for such models tends to overfit the training data.

Recently, generative text models such as latent Dirichlet allocation (LDA) have attracted researchers in the neural network community. Using generative text models, the probability of a whole document, rather than of individual sentences, can be computed. Probability computation in these models takes advantage of prior distributions of word rate variability learned from large document collections. Generative models are statistically well defined and robust for parameter estimation and adaptation because they exploit (hierarchical) Bayesian frameworks, which rely heavily on prior and posterior distributions of word rates.

In this paper, a new generative text model is investigated, which unifies the following concepts developed by the three communities related to language processing:

(1) summarizing word rate variability as a stochastic distribution;
(2) finite topic mixture models of multivariate distributions;
(3) generative text models based on a (hierarchical) Bayesian framework.

Based on (1) and (2), it is assumed that word rate variability can be modeled with a finite mixture of Dirichlet distributions. The finite mixture captures a rough topic structure, and each Dirichlet distribution yields a clear picture of word rate variability within each topic, for all words simultaneously. From (3), a robust model is built by adopting a Bayesian framework, employing prior and posterior distributions to estimate the DM parameters and to adapt them to the context of the documents processed so far.

Sections 2 and 3 of the paper describe the DM model, its parameter estimation methods, and its posterior and predictive distributions. Section 4 discusses the relationship between the DM and other text models. Section 5 presents experimental results for statistical language modeling applications. The DM model achieves lower perplexity levels than models based on the mixture of unigrams (MU) and LDA.

2 The DM and parameter estimation

2.1 The DM and the Polya mixture

The Dirichlet distribution is defined for a random vector p = (p_1, p_2, ..., p_V) on a simplex of V dimensions; the elements of a random vector on a simplex sum to 1. We interpret p as the occurrence probabilities of the V words of a vocabulary, so that the Dirichlet distribution models word occurrence probabilities. The density function of the Dirichlet for p is:

P_D(p; \alpha) = \frac{\Gamma(\alpha)}{\prod_{v=1}^{V} \Gamma(\alpha_v)} \prod_{v=1}^{V} p_v^{\alpha_v - 1},   (1)

where \alpha = (\alpha_1, \alpha_2, ..., \alpha_V) is a parameter vector with \alpha_v > 0, and \alpha = \sum_{v=1}^{V} \alpha_v.

The Dirichlet mixture distribution (Sjölander et al. 1996) with M components is defined as:

P_{DM}(p; \lambda, \alpha_1^M) = \sum_{m=1}^{M} \lambda_m P_D(p; \alpha_m) = \sum_{m=1}^{M} \lambda_m \frac{\Gamma(\alpha_m)}{\prod_{v=1}^{V} \Gamma(\alpha_{mv})} \prod_{v=1}^{V} p_v^{\alpha_{mv} - 1},   (2)

where \lambda = (\lambda_1, \lambda_2, ..., \lambda_M) is the weight vector of the component Dirichlet distributions and \alpha_m = \sum_v \alpha_{mv}.

When the random vector p, taken as the parameters of a multinomial, is drawn from the DM, the compound distribution for discrete outcomes y = (y_1, y_2, ..., y_V) is:

P_{PM}(y; \lambda, \alpha_1^M) = \int P_{Mul}(y|p) P_{DM}(p; \lambda, \alpha_1^M) dp = \sum_{m=1}^{M} \lambda_m \int P_{Mul}(y|p) P_D(p; \alpha_m) dp = \sum_{m=1}^{M} \lambda_m \frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + y)} \prod_{v=1}^{V} \frac{\Gamma(y_v + \alpha_{mv})}{\Gamma(\alpha_{mv})},   (3)

where y = \sum_v y_v and each y_v is the occurrence frequency of the v-th word in a document. This distribution is called the Dirichlet-multinomial mixture or the Polya mixture distribution. The Polya mixture is used to estimate the parameters of the DM, \alpha and \lambda.
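As a concrete reading of equation (3), the following sketch evaluates the log evidence of one document's count vector under a DM, working in log space with the log-gamma function for numerical stability. It is an illustration only, not the authors' implementation; the function name log_polya_mixture and the toy data are hypothetical, and numpy/scipy are assumed.

```python
import numpy as np
from scipy.special import gammaln, logsumexp


def log_polya_mixture(y, lam, alpha):
    """Log of equation (3): the Polya (Dirichlet-multinomial) mixture evidence.

    y     : (V,) word-count vector of one document
    lam   : (M,) mixture weights, summing to 1
    alpha : (M, V) Dirichlet parameters, all positive
    Note: the multinomial coefficient is omitted, as in equation (3).
    """
    y = np.asarray(y, dtype=float)
    alpha_m = alpha.sum(axis=1)            # alpha_m = sum_v alpha_mv
    n = y.sum()                            # y = sum_v y_v, the document length
    # log[ Gamma(alpha_m)/Gamma(alpha_m + y) * prod_v Gamma(y_v + alpha_mv)/Gamma(alpha_mv) ]
    log_comp = (gammaln(alpha_m) - gammaln(alpha_m + n)
                + (gammaln(y + alpha) - gammaln(alpha)).sum(axis=1))
    return logsumexp(np.log(lam) + log_comp)


# toy usage: V = 4 words, M = 2 components
rng = np.random.default_rng(0)
alpha = rng.gamma(1.0, 1.0, size=(2, 4))
lam = np.array([0.6, 0.4])
y = np.array([3, 0, 1, 2])
print(log_polya_mixture(y, lam, alpha))
```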

2.2 Parameter estimation

This subsection introduces methods for estimating the DM parameters via maximum likelihood estimation of the Polya mixture. The estimation methods are based on Minka's estimation methods for a Dirichlet distribution (Minka 2003) and the EM algorithm (Dempster et al. 1977).

The i-th datum, the i-th training document, provides the word counts y_i = (y_{i1}, y_{i2}, ..., y_{iV}). For N training documents, the log-likelihood of the training data D = (y_1, y_2, ..., y_N) is:

L(D; \lambda, \alpha_1^M) = \sum_{i=1}^{N} \log P_{PM}(y_i; \lambda, \alpha_1^M).

The \lambda and \alpha that maximize this likelihood function are taken as the DM parameters. Let Z = (z_1, z_2, ..., z_N), where z_i is a hidden variable denoting the component that generated the i-th document. The log-likelihood of the complete data is:

L(D, Z; \lambda, \alpha_1^M) = \sum_{i=1}^{N} \log P(y_i, z_i; \lambda, \alpha_1^M).

The Q-function for the EM algorithm, i.e., the conditional expectation of the above log-likelihood, is:

Q(\theta | \bar{\theta}) = \sum_i \sum_m \bar{P}_{im} \log P(y_i, z_i = m; \lambda, \alpha_1^M)
                         = \sum_i \sum_m \bar{P}_{im} \log \lambda_m + \sum_i \sum_m \bar{P}_{im} \log \frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + y_i)} \prod_{v=1}^{V} \frac{\Gamma(y_{iv} + \alpha_{mv})}{\Gamma(\alpha_{mv})},   (4)

where

\bar{P}_{im} = P(z_i = m | y_i; \bar{\lambda}, \bar{\alpha}_1^M),
y_i = \sum_v y_{iv}.

Here \bar{\lambda} and \bar{\alpha}_1^M are the current values of the parameters. The first term of (4) is maximized by the following update formula for \lambda:

\lambda_m \propto \sum_i \bar{P}_{im}.   (5)

The second term of (4) is maximized by the following update formula for \alpha, where \Psi(x) is the digamma function. The formula is derived from Minka's fixed-point iteration (Minka 2003) and the EM algorithm (see Appendix A):

\alpha_{mv} = \bar{\alpha}_{mv} \frac{\sum_i \bar{P}_{im} \{\Psi(y_{iv} + \bar{\alpha}_{mv}) - \Psi(\bar{\alpha}_{mv})\}}{\sum_i \bar{P}_{im} \{\Psi(y_i + \bar{\alpha}_m) - \Psi(\bar{\alpha}_m)\}}.   (6)

If the leave-one-out (LOO) likelihood is taken as the function to be maximized, a faster update formula is obtained. The following update formula is based on Minka's iteration for the LOO likelihood (Minka 2003) and the EM algorithm (see Appendix B):

\alpha_{mv} = \bar{\alpha}_{mv} \frac{\sum_i \bar{P}_{im} \{y_{iv} / (y_{iv} - 1 + \bar{\alpha}_{mv})\}}{\sum_i \bar{P}_{im} \{y_i / (y_i - 1 + \bar{\alpha}_m)\}}.   (7)

This LOO update formula is used to estimate the DM parameters in all of the following experiments.
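To show how the E-step responsibilities and the updates (5) and (7) fit together, here is a minimal sketch of one EM iteration. It assumes numpy/scipy, a count matrix Y of shape N x V, and current values lam and alpha; the function name em_loo_step is hypothetical, and the initialization, stopping criterion, and numerical safeguards used in the actual experiments are omitted.

```python
import numpy as np
from scipy.special import gammaln, logsumexp


def em_loo_step(Y, lam, alpha):
    """One EM iteration for the Polya mixture: E-step responsibilities,
    then update (5) for lam and the LOO fixed-point update (7) for alpha.

    Y     : (N, V) matrix of word counts, one row per training document
    lam   : (M,) current mixture weights
    alpha : (M, V) current Dirichlet parameters
    """
    N = Y.shape[0]
    M = lam.shape[0]
    n = Y.sum(axis=1)                                # document lengths y_i
    a_sum = alpha.sum(axis=1)                        # alpha_m

    # E-step: responsibilities P_im = P(z_i = m | y_i), via equation (3) per component
    log_comp = (gammaln(a_sum)[None, :] - gammaln(n[:, None] + a_sum[None, :])
                + (gammaln(Y[:, None, :] + alpha[None, :, :])
                   - gammaln(alpha)[None, :, :]).sum(axis=2))
    log_post = np.log(lam)[None, :] + log_comp
    P = np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))   # (N, M)

    # M-step, update (5): lam_m proportional to sum_i P_im
    new_lam = P.sum(axis=0) / N

    # M-step, update (7): leave-one-out fixed-point update for alpha
    new_alpha = np.empty_like(alpha)
    for m in range(M):
        with np.errstate(divide="ignore", invalid="ignore"):
            # words that never occur contribute zero here; in practice a small floor helps
            num_terms = np.where(Y > 0, Y / (Y - 1.0 + alpha[m]), 0.0)  # (N, V)
        den_terms = n / (n - 1.0 + a_sum[m])                            # (N,)
        num = (P[:, m:m + 1] * num_terms).sum(axis=0)
        den = (P[:, m] * den_terms).sum()
        new_alpha[m] = alpha[m] * num / den
    return new_lam, new_alpha
```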

3 Inference

3.1 A posterior and predictive distribution

The DM model with parameters estimated by the above methods is regarded as a prior on the distribution of word occurrence probabilities. In this section, a method is described for computing the posterior and the expectation of the word occurrence probabilities, given a word history or a document. The following formula is the posterior distribution of the word occurrence probabilities given the history data y = (y_1, y_2, ..., y_V), assuming a multinomial distribution for the count data y, with parameter p distributed according to a DM prior:

P(p|y) = \frac{P(y|p) P(p)}{\int P(y|p) P(p) dp} = \frac{P_{Mul}(y|p) P_{DM}(p; \lambda, \alpha_1^M)}{\int P_{Mul}(y|p) P_{DM}(p; \lambda, \alpha_1^M) dp} = \frac{\sum_{m=1}^{M} B_m \prod_v p_v^{\alpha_{mv} + y_v - 1}}{\sum_{m=1}^{M} C_m},   (8)

where

B_m = \lambda_m \frac{\Gamma(\alpha_m)}{\prod_{v=1}^{V} \Gamma(\alpha_{mv})},
C_m = B_m \frac{\prod_{v=1}^{V} \Gamma(\alpha_{mv} + y_v)}{\Gamma(\alpha_m + y)},
\alpha_m = \sum_{v=1}^{V} \alpha_{mv},
y = \sum_{v=1}^{V} y_v.

The expectation of the occurrence probability of the w-th word in the vocabulary, P(w|y), is:

P(w|y) = \int p_w P(p|y) dp = \frac{\sum_m B_m \int \prod_{v=1}^{V} p_v^{\alpha_{mv} + y_v + \delta(v-w) - 1} dp}{\sum_m C_m} = \frac{\sum_m B_m \prod_{v=1}^{V} \Gamma\{\alpha_{mv} + y_v + \delta(v-w)\} / \Gamma(\alpha_m + y + 1)}{\sum_m C_m} = \frac{\sum_m C_m \frac{\alpha_{mw} + y_w}{\alpha_m + y}}{\sum_m C_m},   (9)

where \delta(k) is Kronecker's delta: \delta(k) = 1 if k = 0, and 0 otherwise.

In contrast to LDA, the DM has a closed formula for computing the expectation of the word occurrence probability.

3.2 Model averaging

In the experiment section (Section 5), it is demonstrated that a statistical language model using the DM outperforms the other models even with few components, but that its performance does not rise in proportion to the number of components. This reflects the overfitting tendency of DM models. To avoid the overfitting problem, a simple model averaging method is adopted, which computes a predictive distribution as a mean of the predictions of DMs with different numbers of components. Assume there are N different DM models and that P_i(w|y), i = 1, 2, ..., N, is the predictive probability of the word w under the i-th model. The following averaging equation is referred to as method 1, in which the evidence probability of the history is regarded as a credit weight for each model:

P_{ma1}(w|y) = \sum_i \frac{P^{i}_{PM}(y; \lambda, \alpha)}{\sum_j P^{j}_{PM}(y; \lambda, \alpha)} P_i(w|y).   (10)

Method 2 is simpler, averaging the predictive probabilities with a plain arithmetic mean:

P_{ma2}(w|y) = \frac{1}{N} \sum_i P_i(w|y).   (11)
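The closed-form prediction (9) and the two averaging rules (10) and (11) translate directly into code. The sketch below is illustrative rather than the authors' implementation; the names predictive_word_probs and average_predictions are hypothetical, and numpy/scipy are assumed.

```python
import numpy as np
from scipy.special import gammaln, logsumexp


def predictive_word_probs(y, lam, alpha):
    """Equation (9): expected word probabilities P(w | y) under the DM posterior.

    y     : (V,) counts of the history (document so far)
    lam   : (M,) mixture weights
    alpha : (M, V) Dirichlet parameters
    Returns a (V,) vector that sums to 1.
    """
    y = np.asarray(y, dtype=float)
    a_sum = alpha.sum(axis=1)
    n = y.sum()
    # log B_m and log C_m as defined below equation (8)
    log_B = np.log(lam) + gammaln(a_sum) - gammaln(alpha).sum(axis=1)
    log_C = log_B + gammaln(alpha + y).sum(axis=1) - gammaln(a_sum + n)
    w = np.exp(log_C - log_C.max())
    w /= w.sum()                                   # C_m / sum_m C_m
    # P(w|y) = sum_m (C_m / sum C) * (alpha_mw + y_w) / (alpha_m + y)
    return (w[:, None] * (alpha + y) / (a_sum + n)[:, None]).sum(axis=0)


def average_predictions(preds, log_evidences=None):
    """Model averaging: method 1 (equation (10)) weights each DM's prediction by its
    evidence for the history, e.g. the log Polya-mixture value of equation (3);
    method 2 (equation (11)) is a plain arithmetic mean."""
    preds = np.asarray(preds)                      # (num_models, V)
    if log_evidences is None:                      # method 2
        return preds.mean(axis=0)
    log_evidences = np.asarray(log_evidences, dtype=float)
    w = np.exp(log_evidences - logsumexp(log_evidences))   # method 1 weights
    return (w[:, None] * preds).sum(axis=0)
```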

4 Relationship with other topic-based models

The relationship of the DM with other topic-based models is illustrated using graphical representations of the models. The following are the evidence probabilities for data y = (y_1, y_2, ..., y_V) under each topic-based model.

Mixture of unigrams: P(y) = \sum_z p(z) P_{Mul}(y|z)

LDA: P(y) = \int P_D(\theta|\alpha) \prod_v P(w_v|\theta)^{y_v} d\theta, where P(w_v|\theta) = \sum_z p(z|\theta) p(w_v|z)

DM: P(y) = P_{PM}(y)

The z's of the mixture-of-unigrams model and of LDA are latent variables representing topics. The \theta of LDA is the vector of weights of the unigram models, modeled with the Dirichlet distribution P_D(\theta; \alpha). The evidence probability for the DM is the Polya mixture distribution described in Sec. 2.1.

Figure 1 is a graphical representation of the four models, including probabilistic LSA (PLSA). The outer and inner plates represent the N documents (d) and the L words (w) in each document, respectively. Circles are random variables and double circles are model parameters. Arrows indicate conditional dependencies between variables. The simplest model is MU. In this model, it is assumed that a document is generated from just one topic: the first chosen topic z is used to generate all words in the document. The PLSA model relaxes this assumption. Under PLSA, each word in a document may be generated from a different topic. As a result, a document is assumed to have multiple topics, a more realistic assumption. However, PLSA is not a well-defined generative model for documents, because it has no natural way to assign probability to a new document. LDA is an extension of PLSA. By placing a Dirichlet distribution over the weights of the unigram models, LDA can assign probability to new documents. The DM is another extension of MU. For the DM, the MU assumption of one topic per document remains, but the distribution of unigram probabilities is modeled directly by Dirichlet distributions. The other models, including PLSA and LDA, can reveal multitopic structure, but each of their topic models is quite simple: a unigram model. Though the DM assumes a unitopic structure for each document, its topic models are richer than those of its strong rivals.
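For contrast with the Polya-mixture evidence sketched in Section 2.1, the mixture-of-unigrams evidence listed above has an even simpler form. The snippet below is a hypothetical illustration (numpy/scipy assumed); as elsewhere in this report, the multinomial coefficient is omitted.

```python
import numpy as np
from scipy.special import logsumexp


def log_mu_evidence(y, pz, word_probs):
    """Mixture-of-unigrams evidence: log P(y) = log sum_z p(z) prod_v P(w_v | z)^{y_v}.

    y          : (V,) word-count vector of one document
    pz         : (Z,) topic prior p(z)
    word_probs : (Z, V) per-topic unigram distributions, rows summing to 1
    """
    log_lik = (np.asarray(y, dtype=float) * np.log(word_probs)).sum(axis=1)  # (Z,)
    return logsumexp(np.log(pz) + log_lik)
```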

Figure 1: Graphical model expressions of generative text models

5 Experiments

The performance of the DM was compared with that of LDA and MU in terms of test-set perplexity, using adaptive n-gram language models. All three models were trained on the same training data and evaluated on the same test data. The training data was a set of 98,211 Japanese newspaper articles from the year 1999. The test data was a set of 495 randomly selected articles of more than 40 words from the year 1998. The vocabulary comprised the 20,000 most frequent words in the training data and had a coverage rate of 97.1%. The variational EM method was used to estimate the LDA parameters (Blei et al. 2003), but the Dirichlet parameter \alpha of LDA was updated with Minka's fixed-point iteration method (Minka 2003) instead of the Newton-Raphson method. For the DM, the estimation method based on the LOO likelihood (update formula (7)) was used. Training of both models used the same stopping criterion: the change in closed perplexity on the training data before and after one global iteration loop is less than 0.1%. LDA and DM models were constructed with 1, 2, 5, 10, 20, 50, 100, 200, and 500 components.

Figure 2 presents the perplexity of each model for different numbers of components. Each perplexity is computed as the inverse of the per-word probability, the geometric mean of the document probabilities, that is, of the evidence probabilities for the data. For the DM, the Polya-mixture probability of a document is the document probability. The DM consistently outperforms both other models. However, DM performance is best at 20 components and saturates with relatively few components. DM is a generalized version of MU, and the saturation suggests that DM suffers from the same overfitting problem as MU.
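Read literally, the description above yields the following test-set perplexity. The short sketch below (hypothetical name, numpy assumed) is one way to compute it from per-document log evidences, e.g., the log Polya-mixture values of equation (3) for the DM.

```python
import numpy as np


def test_set_perplexity(log_doc_probs, doc_lengths):
    """Inverse per-word probability over the test set:
    exp( - sum_i log P(doc_i) / sum_i |doc_i| )."""
    return float(np.exp(-np.sum(log_doc_probs) / np.sum(doc_lengths)))
```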

Figure 2: Comparison of test-set perplexity by document probability

Figure 3 presents the perplexity of the adaptive language models for different numbers of components. Adaptive language models predict the probability of the next word using a history, as a conventional n-gram language model does, but they employ a longer history, such as the entire previously processed portion of the document, rather than just the two words preceding the target word. In this experiment, the models were adapted to the portion of a document from the first word up to the current word, and probabilities were then computed for the next 20 words; this operation was repeated every 20 words. Figure 3 also shows that DM outperforms LDA.

Figure 3: Comparison of test-set perplexity by history adaptation probability (unigram)

For the next experiment, a trigram language model was developed that is dynamically adapted to longer histories by the topic-based models. A unigram rescaling method was used for the adaptation (Gildea and Hofmann 1999). Figure 4 shows the perplexity of the combined adaptive language models. As in the experiments above, the unigram-rescaled trigram model with DM performs better than the one with LDA. Table 1 shows the best perplexity of each model; the values in parentheses are perplexity reduction rates relative to the baseline models.

Figure 4: Comparison of test-set perplexity by history adaptation probability (trigram)

Table 1: Minimum perplexity of each aspect model

    Model                 history adaptation       history adaptation       document
                          probability (unigram)    probability (trigram)    probability
    DM                    453.06 (32.0%)           60.74 (19.6%)            434.73
    DM ave.2              425.97 (36.1%)           57.97 (23.2%)            -
    LDA                   467.61 (29.9%)           67.13 (11.1%)            474.82
    Mixture of Unigrams   -                        -                        520.95

LDA is a multitopic text model, which reveals a mixture of topics in a document, while DM is a unitopic text model, which assumes one topic per document. Generally speaking, few documents have just one topic, which raises the question of why DM is better than LDA in perplexity in these experiments. While there is no clear answer, one possibility is that the training and test materials were newspaper articles and thus tend to focus on a single topic. DM may capture the detailed distribution of word probabilities over topics using multiple Dirichlet distributions, whereas LDA captures that distribution only indirectly, as a topic proportion (mixture weight) governed by a single Dirichlet. The same experiments need to be conducted with web data, which contains more complex topical structures.

6 Conclusions

A finite mixture of Dirichlet distributions was investigated as a new text model. Parameter estimation methods were introduced, based on Minka's fixed-point methods (Minka 2003) and the EM algorithm, using the mixture of Dirichlet-multinomial distributions. Experimental results for statistical language modeling applications were presented and their performance compared in perplexity. The DM model achieved the lowest perplexity level despite its unitopic nature.

References

[1] D.M. Blei, A.Y. Ng, and M.I. Jordan. 2001. Latent Dirichlet allocation. Neural Information Processing Systems, vol. 14.

[2] D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3, pages 993-1022.

[3] Kenneth W. Church and William A. Gale. 1995. Poisson mixtures. Natural Language Engineering, vol. 1, no. 1, pages 163-190.

[4] Kenneth W. Church. 2001. Empirical estimates of adaptation: The chance of two Noriegas is closer to p/2 than p^2. In Proc. of Coling 2000, pages 180-186.

[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, vol. 39, pages 1-38.

[6] T. Hofmann. 1999. Probabilistic latent semantic indexing. Proc. of the 22nd Annual ACM Conference on Research and Development in Information Retrieval, pages 50-57, Berkeley, California.

[7] K. Sjölander, K. Karplus, M. Brown, R. Hughey, A. Krogh, I.S. Mian, and D. Haussler. 1996. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Computer Applications in the Biosciences, vol. 12, no. 4, pages 327-345.

[8] D. Gildea and T. Hofmann. 1999. Topic-based language models using EM. Proc. of the 6th European Conference on Speech Communication and Technology (EUROSPEECH 99).

[9] R.M. Iyer and M. Ostendorf. 1999. Modeling long distance dependence in language: topic mixtures versus dynamic cache models. IEEE Transactions on Speech and Audio Processing, vol. 7, no. 1, pages 30-39.

[10] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning, vol. 39, no. 2/3, pages 103-134.

[11] T. Minka. 2003. Estimating a Dirichlet distribution. http://www.stat.cmu.edu/~minka/papers/dirichlet/

[12] Ronald Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, vol. 10, no. 3, pages 187-228.

A Derivation of the update formula of \alpha for the MLE

Minka's fixed-point iteration method (Minka 2003) and the EM algorithm were used. The following inequalities (Minka 2003) give a lower bound on the second term of equation (4):

\frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + y_i)} \geq \frac{\Gamma(\bar{\alpha}_m)}{\Gamma(\bar{\alpha}_m + y_i)} \exp\{(\bar{\alpha}_m - \alpha_m) b_{im}\},

and

\frac{\Gamma(\alpha_{mv} + y_{iv})}{\Gamma(\alpha_{mv})} \geq c_{imv} \alpha_{mv}^{a_{imv}}   (if y_{iv} \geq 1),

where y_i = \sum_v y_{iv}, \alpha_m = \sum_v \alpha_{mv}, and \bar{\alpha}_m = \sum_v \bar{\alpha}_{mv}. The resulting lower bound is:

\sum_i \sum_m \bar{P}_{im} \log \frac{\Gamma(\alpha_m)}{\Gamma(\alpha_m + y_i)} \prod_{v=1}^{V} \frac{\Gamma(y_{iv} + \alpha_{mv})}{\Gamma(\alpha_{mv})}
\geq \sum_i \sum_m \bar{P}_{im} \left[ \log \frac{\Gamma(\bar{\alpha}_m)}{\Gamma(\bar{\alpha}_m + y_i)} \exp\{(\bar{\alpha}_m - \alpha_m) b_{im}\} + \sum_v \log c_{imv} \alpha_{mv}^{a_{imv}} \right] \equiv Q'(\alpha),

where

a_{imv} = \{\Psi(\bar{\alpha}_{mv} + y_{iv}) - \Psi(\bar{\alpha}_{mv})\} \bar{\alpha}_{mv},
b_{im} = \Psi(\bar{\alpha}_m + y_i) - \Psi(\bar{\alpha}_m),
c_{imv} = \frac{\Gamma(\bar{\alpha}_{mv} + y_{iv})}{\Gamma(\bar{\alpha}_{mv})} \bar{\alpha}_{mv}^{-a_{imv}}.

This lower bound is maximized by setting its derivative with respect to \alpha_{mv} to zero:

\frac{\partial Q'(\alpha)}{\partial \alpha_{mv}} = -\sum_i \bar{P}_{im} b_{im} + \frac{1}{\alpha_{mv}} \sum_i \bar{P}_{im} a_{imv} = 0,

which yields the update formula (6):

\alpha_{mv} = \bar{\alpha}_{mv} \frac{\sum_i \bar{P}_{im} \{\Psi(y_{iv} + \bar{\alpha}_{mv}) - \Psi(\bar{\alpha}_{mv})\}}{\sum_i \bar{P}_{im} \{\Psi(y_i + \bar{\alpha}_m) - \Psi(\bar{\alpha}_m)\}}.
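For completeness, the MLE-based update (6) derived above can be written in the same vectorized style as the LOO update used in the experiments. The sketch below is illustrative only (hypothetical name mle_alpha_update, scipy's digamma assumed) and handles a single component m.

```python
import numpy as np
from scipy.special import digamma


def mle_alpha_update(Y, P_m, alpha_m):
    """Fixed-point update (6) for the m-th Dirichlet component.

    Y       : (N, V) word-count matrix
    P_m     : (N,) responsibilities P_im of component m
    alpha_m : (V,) current parameters of the m-th Dirichlet
    """
    n = Y.sum(axis=1)                               # document lengths y_i
    a_sum = alpha_m.sum()                           # alpha_m = sum_v alpha_mv
    num = (P_m[:, None] * (digamma(Y + alpha_m) - digamma(alpha_m))).sum(axis=0)  # (V,)
    den = (P_m * (digamma(n + a_sum) - digamma(a_sum))).sum()
    return alpha_m * num / den
```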

B Derivation of the update formula of \alpha for the maximum LOO likelihood

Given the datum y_i^{\setminus v}, in which one occurrence of word v is left out of the i-th document, the predictive probability P(v | y_i^{\setminus v}) of the word v follows from equation (9):

p(v | y_i^{\setminus v}) \approx \sum_m \bar{P}_{im} \frac{\alpha_{mv} + y_{iv} - 1}{\alpha_m + y_i - 1},

where the leave-one-out responsibility is approximated by \bar{P}_{im}. The LOO log-likelihood L_{loo} is:

L_{loo}(y_i | \alpha) \approx \sum_v y_{iv} \log \sum_m \bar{P}_{im} \frac{\alpha_{mv} + y_{iv} - 1}{\alpha_m + y_i - 1}.

Since \bar{P}_{im} is very nearly 1 or 0 in almost all cases, the preceding equation can be transformed into:

L_{loo}(y_i | \alpha) \approx \sum_v y_{iv} \sum_m \bar{P}_{im} \log \frac{\alpha_{mv} + y_{iv} - 1}{\alpha_m + y_i - 1}.

Using this form, the likelihood can be maximized independently for the \alpha_{mv} of the m-th Dirichlet component. A lower bound on the LOO log-likelihood L^m_{loo} of the m-th component is:

L^m_{loo} \geq \sum_i \left[ \sum_v y_{iv} \bar{P}_{im} q_{imv} \log \alpha_{mv} - y_i \bar{P}_{im} a_{im} \alpha_m \right] + (\text{const.}),

where

q_{imv} = \frac{\bar{\alpha}_{mv}}{\bar{\alpha}_{mv} + y_{iv} - 1},
a_{im} = \frac{1}{\bar{\alpha}_m + y_i - 1}.

The following inequalities (Minka 2003) were used to obtain the above bound:

\log(n + x) \geq q \log x + (1 - q) \log n - q \log q - (1 - q) \log(1 - q), where q = \frac{\hat{x}}{n + \hat{x}},

\log(x) \leq a x - 1 + \log \hat{x},

where a = 1/\hat{x}. The resulting fixed-point update formula is:

\alpha_{mv} = \bar{\alpha}_{mv} \frac{\sum_i \bar{P}_{im} \{y_{iv} / (y_{iv} - 1 + \bar{\alpha}_{mv})\}}{\sum_i \bar{P}_{im} \{y_i / (y_i - 1 + \bar{\alpha}_m)\}}.