Topic Modeling: Beyond Bag-of-Words


Hanna M. Wallach, Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, UK. hmw26@cam.ac.uk

Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006. Copyright 2006 by the author(s)/owner(s).

Abstract

Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the bag-of-words assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.

1. Introduction

Recently, much attention has been given to generative probabilistic models of textual corpora, designed to identify representations of the data that reduce description length and reveal inter- or intra-document statistical structure. Such models typically fall into one of two categories: those that generate each word on the basis of some number of preceding words or word classes, and those that generate words based on latent topic variables inferred from word correlations, independent of the order in which the words appear.

n-gram language models make predictions using observed marginal and conditional word frequencies. While such models may use conditioning contexts of arbitrary length, this paper deals only with bigram models, i.e., models that predict each word based on the immediately preceding word.

To develop a bigram language model, marginal and conditional word counts are determined from a corpus w. The marginal count N_i is defined as the number of times that word i has occurred in the corpus, while the conditional count N_{i|j} is the number of times word i immediately follows word j. Given these counts, the aim of bigram language modeling is to develop predictions of word w_t given word w_{t-1}, in any document. Typically this is done by computing estimators of both the marginal probability of word i and the conditional probability of word i following word j, such as f_i = N_i / N and f_{i|j} = N_{i|j} / N_j, where N is the number of words in the corpus. If there were sufficient data available, the observed conditional frequency f_{i|j} could be used as an estimator for the predictive probability of i given j. In practice, this does not provide a good estimate: only a small fraction of possible (i, j) word pairs will have been observed in the corpus. Consequently, the conditional frequency estimator has too large a variance to be used by itself.

To alleviate this problem, the bigram estimator f_{i|j} is smoothed by the marginal frequency estimator f_i to give the predictive probability of word i given word j:

P(w_t = i | w_{t-1} = j) = λ f_i + (1 - λ) f_{i|j}.   (1)

The parameter λ may be fixed, or determined from the data using techniques such as cross-validation (Jelinek & Mercer, 1980). This procedure works well in practice, despite its somewhat ad hoc nature.
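
To make equation (1) concrete, here is a minimal Python sketch (not from the paper; the toy corpus, the fixed value of λ, and all variable names are illustrative) that computes the marginal and conditional counts from a word sequence and forms the interpolated estimate.

    from collections import Counter

    # Toy corpus: a single stream of words (document boundaries are ignored here).
    corpus = "the cat sat on the mat the cat ate".split()

    N = len(corpus)                                  # N: total number of words
    marginal = Counter(corpus)                       # N_i: number of times word i occurs
    bigram = Counter(zip(corpus[:-1], corpus[1:]))   # N_{i|j}: times word i follows word j

    def predictive(i, j, lam=0.5):
        """Equation (1): P(w_t = i | w_{t-1} = j) = lam * f_i + (1 - lam) * f_{i|j}."""
        f_i = marginal[i] / N
        f_i_given_j = bigram[(j, i)] / marginal[j] if marginal[j] > 0 else 0.0
        return lam * f_i + (1 - lam) * f_i_given_j

    print(predictive("cat", "the"))   # smoothed probability of "cat" given previous word "the"

Here λ is simply fixed at 0.5; as the text notes, in practice it would normally be chosen by cross-validation.
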
The hierarchical Dirichlet language model (MacKay & Peto, 1995) is a bigram model that is entirely driven by principles of Bayesian inference. This model has a similar predictive distribution to models based on equation (1), with one key difference: the bigram statistics f_{i|j} in MacKay and Peto's model are not smoothed with marginal statistics f_i, but are smoothed with a quantity related to the number of different contexts in which each word has occurred.

Latent Dirichlet allocation (Blei et al., 2003) provides an alternative approach to modeling textual corpora. Documents are modeled as finite mixtures over an underlying set of latent topics inferred from correlations between words, independent of word order. This bag-of-words assumption makes sense from a point of view of computational efficiency, but is unrealistic. In many language modeling applications, such as text compression, speech recognition, and predictive text entry, word order is extremely important. Furthermore, it is likely that word order can assist in topic inference. The phrases "the department chair couches offers" and "the chair department offers couches" have the same unigram statistics, but are about quite different topics. When deciding which topic generated the word "chair" in the first sentence, knowing that it was immediately preceded by the word "department" makes it much more likely to have been generated by a topic that assigns high probability to words related to university administration.

In practice, the topics inferred using latent Dirichlet allocation are heavily dominated by function words, such as "in", "that", "of" and "for", unless these words are removed from corpora prior to topic inference. While removing these may be appropriate for tasks where word order does not play a significant role, such as information retrieval, it is not appropriate for many language modeling applications, where both function and content words must be accurately predicted.

In this paper, I present a hierarchical Bayesian model that integrates bigram-based and topic-based approaches to document modeling. This model moves beyond the bag-of-words assumption found in latent Dirichlet allocation by introducing properties of MacKay and Peto's hierarchical Dirichlet language model. In addition to exhibiting better predictive performance than either MacKay and Peto's language model or latent Dirichlet allocation, the topics inferred using the new model are typically less dominated by function words than are topics inferred from the same corpora using latent Dirichlet allocation.

2. Background

I begin with brief descriptions of MacKay and Peto's hierarchical Dirichlet language model and Blei et al.'s latent Dirichlet allocation.

2.1. Hierarchical Dirichlet Language Model

Bigram language models are specified by a conditional distribution P(w_t = i | w_{t-1} = j), described by W(W - 1) free parameters, where W is the number of words in the vocabulary. These parameters are denoted by the matrix Φ, with P(w_t = i | w_{t-1} = j) ≡ φ_{i|j}. Φ may be thought of as a transition probability matrix, in which the j-th row, the probability vector for transitions from word j, is denoted by the vector φ_j. Given a corpus w, the likelihood function is

P(w | Φ) = ∏_i ∏_j φ_{i|j}^{N_{i|j}},   (2)

where N_{i|j} is the number of times that word i immediately follows word j in the corpus. MacKay and Peto (1995) extend this basic framework by placing a Dirichlet prior over Φ:

P(Φ | βm) = ∏_j Dirichlet(φ_j | βm),   (3)

where β > 0 and m is a measure satisfying ∑_i m_i = 1. Combining equations (2) and (3), and integrating over Φ, yields the probability of the corpus given the hyperparameters βm, also known as the "evidence":

P(w | βm) = ∏_j [∏_i Γ(N_{i|j} + βm_i) / Γ(N_j + β)] [Γ(β) / ∏_i Γ(βm_i)].   (4)

It is also easy to obtain a predictive distribution for each context j given the hyperparameters βm:

P(i | j, w, βm) = (N_{i|j} + βm_i) / (N_j + β).   (5)

To make the relationship to equation (1) explicit, P(i | j, w, βm) may be rewritten as

P(i | j, w, βm) = λ_j m_i + (1 - λ_j) f_{i|j},   (6)

where f_{i|j} = N_{i|j} / N_j and

λ_j = β / (N_j + β).   (7)
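
The predictive distribution in equation (5) is easy to compute once the bigram counts and the hyperparameters βm are available. The sketch below is a rough illustration under assumed toy counts (the array names and sizes are mine, not MacKay and Peto's).

    import numpy as np

    W = 4                                    # vocabulary size (toy value)
    rng = np.random.default_rng(0)
    N_ij = rng.integers(0, 5, size=(W, W))   # N_ij[i, j]: times word i follows word j
    beta, m = 2.0, np.full(W, 1.0 / W)       # hyperparameters: concentration beta, measure m

    def predictive(j):
        """Equation (5): P(i | j, w, beta*m) = (N_{i|j} + beta*m_i) / (N_j + beta)."""
        N_j = N_ij[:, j].sum()
        return (N_ij[:, j] + beta * m) / (N_j + beta)

    p = predictive(0)
    print(p, p.sum())   # a proper distribution over the W words; the probabilities sum to 1

Writing it this way also makes the interpolation in equations (6) and (7) visible: sparsely observed contexts (small N_j) fall back towards m, while well-observed contexts are dominated by the raw bigram frequencies.
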
The hyperparameter m_i is now taking the role of the marginal statistic f_i in equation (1). Ideally, the measure βm should be given a proper prior and marginalized over when making predictions, yielding the true predictive distribution:

P(i | j, w) = ∫ P(βm | w) P(i | j, w, βm) d(βm).   (8)

However, if P(βm | w) is sharply peaked in βm, so that it is effectively a delta function, then the true predictive distribution may be approximated by P(i | j, w, [βm]^MP), where [βm]^MP is the maximum of P(βm | w). Additionally, the prior over βm may be assumed to be uninformative, yielding a minimal data-driven Bayesian model in which the optimal βm may be determined from the data by maximizing the evidence. MacKay and Peto show that each element of the optimal m, when estimated using this "empirical Bayes" procedure, is related to the number of contexts in which the corresponding word has appeared.
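
Maximizing the evidence over βm, as described above, only requires evaluating equation (4) as a function of the hyperparameters. A hedged sketch of the log-evidence follows; it assumes the same kind of count matrix as in the previous sketch, and the function name is my own.

    import numpy as np
    from scipy.special import gammaln   # log Gamma, used for the ratios of Gamma functions

    def log_evidence(N_ij, beta, m):
        """Log of equation (4): log P(w | beta*m) for the hierarchical Dirichlet language model."""
        total = 0.0
        for j in range(N_ij.shape[1]):
            N_j = N_ij[:, j].sum()
            total += gammaln(N_ij[:, j] + beta * m).sum() - gammaln(N_j + beta)
            total += gammaln(beta) - gammaln(beta * m).sum()
        return total

    # Toy usage: compare two settings of beta on random counts.
    rng = np.random.default_rng(1)
    counts = rng.integers(0, 5, size=(5, 5))
    m = np.full(5, 0.2)
    print(log_evidence(counts, 1.0, m), log_evidence(counts, 10.0, m))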

2.2. Latent Dirichlet Allocation

Latent Dirichlet allocation (Blei et al., 2003) represents documents as random mixtures over latent topics, where each topic is characterized by a distribution over words. Each word w_t in a corpus w is assumed to have been generated by a latent topic z_t, drawn from a document-specific distribution over T topics. Word generation is defined by a conditional distribution P(w_t = i | z_t = k), described by T(W - 1) free parameters, where T is the number of topics and W is the size of the vocabulary. These parameters are denoted by Φ, with P(w_t = i | z_t = k) ≡ φ_{i|k}. Φ may be thought of as an emission probability matrix, in which the k-th row, the distribution over words for topic k, is denoted by φ_k. Similarly, topic generation is characterized by a conditional distribution P(z_t = k | d_t = d), described by D(T - 1) free parameters, where D is the number of documents in the corpus. These parameters form a matrix Θ, with P(z_t = k | d_t = d) ≡ θ_{k|d}. The d-th row of this matrix is the distribution over topics for document d, denoted by θ_d.

The joint probability of a corpus w and a set of corresponding latent topics z is

P(w, z | Φ, Θ) = ∏_i ∏_k ∏_d φ_{i|k}^{N_{i|k}} θ_{k|d}^{N_{k|d}},   (9)

where N_{i|k} is the number of times that word i has been generated by topic k, and N_{k|d} is the number of times topic k has been used in document d. Blei et al. place a Dirichlet prior over Φ,

P(Φ | βm) = ∏_k Dirichlet(φ_k | βm),   (10)

and another over Θ,

P(Θ | αn) = ∏_d Dirichlet(θ_d | αn).   (11)

Combining these priors with equation (9) and integrating over Φ and Θ gives the evidence for hyperparameters αn and βm, which is also the probability of the corpus given the hyperparameters:

P(w | αn, βm) = ∑_z ∏_k [∏_i Γ(N_{i|k} + βm_i) / Γ(N_k + β)] [Γ(β) / ∏_i Γ(βm_i)] ∏_d [∏_k Γ(N_{k|d} + αn_k) / Γ(N_d + α)] [Γ(α) / ∏_k Γ(αn_k)].   (12)

N_k is the total number of times topic k occurs in z, while N_d is the number of words in document d. The sum over z cannot be computed directly because it does not factorize and involves T^N terms, where N is the total number of words in the corpus. However, it may be approximated using Markov chain Monte Carlo (Griffiths & Steyvers, 2004).

Given a corpus w, a set of latent topics z, and optimal hyperparameters [αn]^MP and [βm]^MP, approximate predictive distributions for each topic k and document d are given by the following pair of equations:

P(i | k, w, z, [βm]^MP) = (N_{i|k} + [βm_i]^MP) / (N_k + β^MP),   (13)

P(k | d, w, z, [αn]^MP) = (N_{k|d} + [αn_k]^MP) / (N_d + α^MP).   (14)

These may be rewritten as

P(i | k, z, w, βm) = λ_k f_{i|k} + (1 - λ_k) m_i,   (15)

P(k | d, z, w, αn) = γ_d f_{k|d} + (1 - γ_d) n_k,   (16)

where f_{i|k} = N_{i|k} / N_k, f_{k|d} = N_{k|d} / N_d, and

λ_k = N_k / (N_k + β),   (17)

γ_d = N_d / (N_d + α).   (18)

f_{i|k} is therefore being smoothed by the hyperparameter m_i, while f_{k|d} is smoothed by n_k. Note the similarity of equations (15) and (16) to equation (6).
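
As a concrete reading of equations (13) and (14), the sketch below (toy counts and names of my own choosing, not Blei et al.'s code) forms the two approximate predictive distributions from a topic-word count matrix and a document-topic count matrix, assuming a single sample of topic assignments z has already been obtained.

    import numpy as np

    T, W, D = 3, 6, 2                        # topics, vocabulary size, documents (toy sizes)
    rng = np.random.default_rng(0)
    N_ik = rng.integers(0, 4, size=(W, T))   # N_ik[i, k]: times word i was generated by topic k
    N_kd = rng.integers(0, 4, size=(T, D))   # N_kd[k, d]: times topic k was used in document d
    alpha, n = 1.0, np.full(T, 1.0 / T)      # hyperparameters alpha*n over topics
    beta, m = 1.0, np.full(W, 1.0 / W)       # hyperparameters beta*m over words

    def p_word_given_topic(k):
        """Equation (13): (N_{i|k} + beta*m_i) / (N_k + beta)."""
        return (N_ik[:, k] + beta * m) / (N_ik[:, k].sum() + beta)

    def p_topic_given_doc(d):
        """Equation (14): (N_{k|d} + alpha*n_k) / (N_d + alpha)."""
        return (N_kd[:, d] + alpha * n) / (N_kd[:, d].sum() + alpha)

    print(p_word_given_topic(0).sum(), p_topic_given_doc(0).sum())   # both sum to 1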

3. Bigram Topic Model

This section introduces a model that extends latent Dirichlet allocation by incorporating a notion of word order, similar to that employed by MacKay and Peto's hierarchical Dirichlet language model. Each topic is now represented by a set of W distributions. Word generation is defined by a conditional distribution P(w_t = i | w_{t-1} = j, z_t = k), described by W T (W - 1) free parameters. As before, these parameters form a matrix Φ, this time with W T rows. Each row is a distribution over words for a particular context (j, k), denoted by φ_{j,k}. Each topic k is now characterized by the W distributions specific to that topic. Topic generation is the same as in latent Dirichlet allocation: topics are drawn from the conditional distribution P(z_t = k | d_t = d), described by D(T - 1) free parameters, which form a matrix Θ.

The joint probability of a corpus w and a single set of latent topic assignments z is

P(w, z | Φ, Θ) = ∏_i ∏_j ∏_k ∏_d φ_{i|j,k}^{N_{i|j,k}} θ_{k|d}^{N_{k|d}},   (19)

where N_{i|j,k} is the number of times word i has been generated by topic k when preceded by word j. As in latent Dirichlet allocation, N_{k|d} is the number of times topic k has been used in document d. The prior over Θ is chosen to be the same as that used in latent Dirichlet allocation:

P(Θ | αn) = ∏_d Dirichlet(θ_d | αn).   (20)

However, the additional conditioning context j in the distribution that defines word generation affords greater flexibility in choosing a hierarchical prior for Φ than in either latent Dirichlet allocation or the hierarchical Dirichlet language model. The priors over Φ used in both MacKay and Peto's language model and Blei et al.'s latent Dirichlet allocation are "coupled" priors: learning the probability vector for a single context (φ_j in the case of MacKay and Peto's model and φ_k in Blei et al.'s) gives information about the probability vectors in other contexts, j′ and k′ respectively. This dependence comes from the hyperparameter vector βm, shared, in the case of the hierarchical Dirichlet language model, between all possible previous word contexts j and, in the case of latent Dirichlet allocation, between all possible topics. Since word generation is conditioned upon both j and k in the new model presented in this paper, there is more than one way in which hyperparameters for the prior over Φ might be shared in this model.

Prior 1: Most simply, a single hyperparameter vector βm may be shared between all (j, k) contexts:

P(Φ | βm) = ∏_j ∏_k Dirichlet(φ_{j,k} | βm).   (21)

Here, knowledge about the probability vector for one context φ_{j,k} will give information about the probability vectors φ_{j′,k′} for all other (j′, k′) contexts.

Prior 2: Alternatively, there may be T hyperparameter vectors, one for each topic k:

P(Φ | {β_k m_k}) = ∏_j ∏_k Dirichlet(φ_{j,k} | β_k m_k).   (22)

Information is now shared between only those probability vectors with topic context k. Intuitively, this is appealing. Learning about the distribution over words for a single context (j, k) yields information about the distributions over words for other contexts (j′, k) that share this topic, but not about distributions with other topic contexts. In other words, this prior encapsulates the notion of similarity between distributions over words for a given topic context.

Having defined the distributions that characterize word and topic generation in the new model and assigned priors over the parameters, the generative process for a corpus w is:

1. For each topic k and word j:
   (a) Draw φ_{j,k} from the prior over Φ: either Dirichlet(φ_{j,k} | βm) (prior 1) or Dirichlet(φ_{j,k} | β_k m_k) (prior 2).
2. For each document d in the corpus:
   (a) Draw the topic mixture θ_d for document d from Dirichlet(θ_d | αn).
   (b) For each position t in document d:
      i. Draw a topic z_t ~ Discrete(θ_d).
      ii. Draw a word w_t from the distribution over words for the context defined by the topic z_t and previous word w_{t-1}, Discrete(φ_{w_{t-1}, z_t}).

The evidence, or probability of a corpus w given the hyperparameters, is either (prior 1)

P(w | αn, βm) = ∑_z ∏_j ∏_k [∏_i Γ(N_{i|j,k} + βm_i) / Γ(N_{j,k} + β)] [Γ(β) / ∏_i Γ(βm_i)] ∏_d [∏_k Γ(N_{k|d} + αn_k) / Γ(N_d + α)] [Γ(α) / ∏_k Γ(αn_k)]   (23)

or (prior 2)

P(w | αn, {β_k m_k}) = ∑_z ∏_j ∏_k [∏_i Γ(N_{i|j,k} + β_k m_{i|k}) / Γ(N_{j,k} + β_k)] [Γ(β_k) / ∏_i Γ(β_k m_{i|k})] ∏_d [∏_k Γ(N_{k|d} + αn_k) / Γ(N_d + α)] [Γ(α) / ∏_k Γ(αn_k)].   (24)
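
The generative process above translates directly into ancestral sampling. The following sketch draws a tiny synthetic corpus under prior 1; the corpus sizes are made up, and the handling of the very first word in each document (here a fixed arbitrary context) is an assumption of mine rather than something specified in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    W, T, D, doc_len = 5, 2, 3, 8        # vocabulary, topics, documents, words per document
    alpha_n = np.full(T, 0.5)            # alpha * n
    beta_m = np.full(W, 0.1)             # beta * m (prior 1: shared across all (j, k) contexts)

    # Step 1: draw phi_{j,k}, one distribution over words per (previous word, topic) context.
    phi = rng.dirichlet(beta_m, size=(W, T))   # phi[j, k] is a length-W probability vector

    corpus = []
    for d in range(D):
        theta = rng.dirichlet(alpha_n)         # Step 2(a): topic mixture for document d
        words, prev = [], 0                    # assumed fixed start context for the first word
        for t in range(doc_len):
            z = rng.choice(T, p=theta)         # Step 2(b)i: draw topic z_t
            w = rng.choice(W, p=phi[prev, z])  # Step 2(b)ii: draw w_t given (w_{t-1}, z_t)
            words.append(w)
            prev = w
        corpus.append(words)

    print(corpus)   # D lists of word indices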

As in latent Dirichlet allocation, the sum over z is intractable, but may be approximated using MCMC. For a single set of latent topics z, and optimal hyperparameters [βm]^MP or {[β_k m_k]^MP}, the approximate predictive distribution over words given previous word j and current topic k is either (prior 1)

P(i | j, k, w, z, [βm]^MP) = (N_{i|j,k} + [βm_i]^MP) / (N_{j,k} + β^MP)   (25)

or (prior 2)

P(i | j, k, w, z, {[β_k m_k]^MP}) = (N_{i|j,k} + [β_k m_{i|k}]^MP) / (N_{j,k} + β_k^MP).   (26)

In equation (25), the statistic N_{i|j,k} / N_{j,k} is always smoothed by the quantity m_i, regardless of the conditioning context (j, k). Meanwhile, in equation (26), N_{i|j,k} / N_{j,k} is smoothed by m_{i|k}, which will vary depending on the conditioning topic k.

Given [αn]^MP, the approximate predictive distribution over topics for document d is

P(k | d, w, z, [αn]^MP) = (N_{k|d} + [αn_k]^MP) / (N_d + α^MP).   (27)

4. Inference of Hyperparameters

Previous sampling-based treatments of latent Dirichlet allocation have not included any method for optimizing hyperparameters. However, the method described in this section may be applied to both latent Dirichlet allocation and the model presented in this paper.

Given an uninformative prior over αn and βm or {β_k m_k}, the optimal hyperparameters, [αn]^MP and [βm]^MP or {[β_k m_k]^MP}, may be found by maximizing the evidence, given in equation (23) or (24). The evidence contains latent variables z and must therefore be maximized with respect to the hyperparameters using an expectation-maximization (EM) algorithm. Unfortunately, the expectation with respect to the distribution over the latent variables involves a sum over T^N terms, where N is the number of words in the entire corpus. However, this sum may be approximated using a Markov chain Monte Carlo algorithm, such as Gibbs sampling, resulting in a Gibbs EM algorithm (Andrieu et al., 2003).

Given a corpus w, and denoting the set of hyperparameters as U = {αn, βm} or U = {αn, {β_k m_k}}, the optimal hyperparameters may be found by using the following steps:

1. Initialize z^(0) and U^(0) and set i = 1.
2. Iteration i:
   (a) E-step: Draw S samples {z^(s)}_{s=1}^S from P(z | w, U^(i-1)) using a Gibbs sampler.
   (b) M-step: Maximize U^(i) = argmax_U (1/S) ∑_{s=1}^S log P(w, z^(s) | U).
3. Set i ← i + 1 and go to 2.

4.1. E-Step

Gibbs sampling involves sequentially sampling each variable of interest, here z_t, from the distribution over that variable given the current values of all other variables and the data. Letting the subscript -t denote a quantity that excludes data from the t-th position, the conditional posterior for z_t is either (prior 1)

P(z_t = k | z_{-t}, w, αn, βm) ∝ [({N_{w_t|w_{t-1},k}}_{-t} + βm_{w_t}) / ({N_{w_{t-1},k}}_{-t} + β)] [({N_{k|d_t}}_{-t} + αn_k) / ({N_{d_t}}_{-t} + α)]   (28)

or (prior 2)

P(z_t = k | z_{-t}, w, αn, {β_k m_k}) ∝ [({N_{w_t|w_{t-1},k}}_{-t} + β_k m_{w_t|k}) / ({N_{w_{t-1},k}}_{-t} + β_k)] [({N_{k|d_t}}_{-t} + αn_k) / ({N_{d_t}}_{-t} + α)].   (29)

Drawing a single set of topics z takes time proportional to the size of the corpus N and the number of topics T. The E-step therefore takes time proportional to N, T, and the number of iterations for which the Markov chain is run in order to obtain the S samples. Note that the samples used to approximate the E-step must come from a single Markov chain. The model is unaffected by permutations of topic indices. Consequently, there is no correspondence between topic indices across samples from different Markov chains: topics that have index k in two different Markov chains need not have similar distributions over words.
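
A single Gibbs update from equation (28) (prior 1) amounts to forming the product of a word term and a topic term for every candidate topic and sampling from the normalized result. The sketch below assumes count arrays from which the current token has already been subtracted; the array names and toy counts are mine.

    import numpy as np

    def sample_topic(w_t, w_prev, d, N_wjk, N_jk, N_kd, N_d, alpha_n, beta_m, rng):
        """Draw z_t from equation (28); all counts must already exclude position t."""
        beta, alpha = beta_m.sum(), alpha_n.sum()
        # Word term: probability of w_t given previous word w_prev under each candidate topic k.
        word_term = (N_wjk[w_t, w_prev, :] + beta_m[w_t]) / (N_jk[w_prev, :] + beta)
        # Topic term: probability of each topic k in the current document d.
        topic_term = (N_kd[:, d] + alpha_n) / (N_d[d] + alpha)
        p = word_term * topic_term
        return rng.choice(len(p), p=p / p.sum())

    # Toy call, just to show the shapes involved.
    rng = np.random.default_rng(0)
    W, T, D = 4, 3, 2
    N_wjk = rng.integers(0, 3, size=(W, W, T)).astype(float)   # N_wjk[i, j, k]
    N_jk = N_wjk.sum(axis=0)                                   # N_jk[j, k]
    N_kd = rng.integers(0, 3, size=(T, D)).astype(float)       # N_kd[k, d]
    N_d = N_kd.sum(axis=0)                                     # words per document
    print(sample_topic(1, 0, 0, N_wjk, N_jk, N_kd, N_d,
                       np.full(T, 0.5), np.full(W, 0.1), rng))
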
4.2. M-Step

Given {z^(s)}_{s=1}^S, the optimal αn can be computed using the fixed-point iteration

[αn_k]^new = αn_k [∑_s ∑_d (Ψ(N_{k|d}^(s) + αn_k) - Ψ(αn_k))] / [∑_s ∑_d (Ψ(N_d + α) - Ψ(α))],   (30)

where N_{k|d}^(s) is the number of times topic k has been used in document d in the s-th sample. Similar fixed-point iterations can be used to determine [βm_i]^MP and {[β_k m_k]^MP} (Minka, 2003).
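
A sketch of the fixed point in equation (30) follows, assuming S sampled count matrices N_kd of shape (T, D) and a vector of document lengths N_d; it follows Minka's (2003) style of update, but the function and variable names, the stopping rule (a fixed number of sweeps), and the toy data are mine.

    import numpy as np
    from scipy.special import digamma   # the Psi function in equation (30)

    def update_alpha_n(alpha_n, N_kd_samples, N_d, n_sweeps=50):
        """Repeatedly apply equation (30) to every component of alpha*n."""
        alpha_n = alpha_n.copy()
        S = len(N_kd_samples)
        for _ in range(n_sweeps):
            alpha = alpha_n.sum()
            # Denominator: sum over s and d of Psi(N_d + alpha) - Psi(alpha).
            # N_d is the same in every sample, so the sum over s is a factor of S.
            denom = S * (digamma(N_d + alpha) - digamma(alpha)).sum()
            # Numerator, per topic k: sum over s, d of Psi(N_kd^(s) + alpha_n_k) - Psi(alpha_n_k).
            numer = sum((digamma(N_kd + alpha_n[:, None]) - digamma(alpha_n)[:, None]).sum(axis=1)
                        for N_kd in N_kd_samples)
            alpha_n = alpha_n * numer / denom
        return alpha_n

    # Toy usage: two "samples" with identical document lengths but different topic assignments.
    rng = np.random.default_rng(0)
    base = rng.integers(1, 10, size=(3, 4)).astype(float)   # T = 3 topics, D = 4 documents
    samples = [base, base[::-1].copy()]                     # reversing rows preserves column sums
    print(update_alpha_n(np.full(3, 0.5), samples, base.sum(axis=0)))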

In my implementation, each fixed-point iteration takes time that is proportional to S and (at worst) N. For latent Dirichlet allocation and the new model with prior 1, the time taken to perform the M-step is therefore at worst proportional to S, N, and the number of iterations taken to reach convergence. For the new model with prior 2, the time taken is also proportional to T.

5. Experiments

To evaluate the new model, both variants were compared with latent Dirichlet allocation and MacKay and Peto's hierarchical Dirichlet language model. The topic models were trained identically: the Gibbs EM algorithm described in the previous section was used for both the new model (with either prior) and latent Dirichlet allocation. The hyperparameters of the hierarchical Dirichlet language model were inferred using the same fixed-point iteration used in the M-step. The results presented in this section are therefore a direct reflection of differences between the models.

Language models are typically evaluated by computing the information rate of unseen test data, measured in bits per word: the better the predictive performance, the fewer the bits per word. Information rate is a direct measure of text compressibility. Given corpora w and w_test, the information rate is defined as

R = -log_2 P(w_test | w) / N_test,   (31)

where N_test is the number of words in the test corpus.

The information rate may be computed directly for the hierarchical Dirichlet language model. For the topic models, computing P(w_test | w) requires summing over z and z_test. As mentioned before, this is intractable. Instead, the information rate may be computed using a single set of topics z for the training data, in this case obtained by running a Gibbs sampler for 20000 iterations after the hyperparameters have been inferred. Given z, multiple sets of topics for the test data {z_test^(s)}_{s=1}^S may be obtained using the predictive distributions. Given hyperparameters U, P(w_test | w) may be approximated by taking the harmonic mean of {P(w_test | z_test^(s), w, z, U)}_{s=1}^S (Kass & Raftery, 1995).

5.1. Corpora

The models were compared using two data sets. The first was constructed by drawing 150 abstracts (documents) at random from the Psychological Review Abstracts data provided by Griffiths and Steyvers (2005). A subset of 100 documents was used to infer the hyperparameters, while the remaining 50 were used for evaluating the models. The second data set consisted of 150 newsgroup postings, drawn at random from the 20 Newsgroups data (Rennie, 2005). Again, 100 documents were used for inference, while 50 were retained for evaluating predictive accuracy. Punctuation characters, including hyphens and apostrophes, were treated as word separators, and each number was replaced with a special "number" token to reduce the size of the vocabulary. To enable evaluation using documents containing tokens not present in the training corpus, all words that occurred only once in the training corpus were replaced with an "unseen" token. Preprocessing the Psychological Review Abstracts data in this manner resulted in a vocabulary of 1374 words, which occurred 13414 times in the training corpus and 6521 times in the documents used for testing. The 20 Newsgroups data ended up with a vocabulary of 2281 words, which occurred 27478 times in the training data and 13579 times in the test data. Despite consisting of the same number of documents, the 20 Newsgroups corpora are roughly twice the size of the Psychological Review Abstracts corpora.
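
The preprocessing described above is simple to reproduce. The sketch below is my own reconstruction of the procedure (the token names "<number>" and "<unseen>" and the lowercasing are assumptions; the paper does not show its code): punctuation, hyphens, and apostrophes act as separators, digit strings map to a number token, and training-corpus singletons map to an unseen token.

    import re
    from collections import Counter

    def tokenize(text):
        """Split on anything that is not a letter or digit; map digit strings to one token."""
        tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
        return ["<number>" if t.isdigit() else t for t in tokens]

    def replace_singletons(train_docs):
        """Replace words occurring exactly once in the training corpus with an unseen token."""
        counts = Counter(t for doc in train_docs for t in doc)
        return [[t if counts[t] > 1 else "<unseen>" for t in doc] for doc in train_docs]

    train = [tokenize("The chair's department offers 3 couches."),
             tokenize("The department chair offers 12 couches!")]
    print(replace_singletons(train))
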
5.2. Results

The experiments involving latent Dirichlet allocation and the new model were run with 1 to 120 topics, on an Opteron 254 (2.8GHz). These models all require at most 200 iterations of the Gibbs EM algorithm described in section 4. In the E-step, a Markov chain was run for 400 iterations. The first 200 iterations were discarded and 5 samples were taken from the remaining iterations. The mean time taken for each iteration is shown for both variants of the new model as a function of the number of topics in figure 2. As expected, the time taken is proportional to both the number of topics and the size of the corpus.

The information rates of the test data are shown in figure 1. On both corpora, latent Dirichlet allocation and the hierarchical Dirichlet language model achieve similar performance. With prior 1, the new model improves upon this by between 0.5 and 1 bits per word. However, with prior 2, it achieves an information rate reduction of between 1 and 2 bits per word. For latent Dirichlet allocation, the information rate is reduced most by the first 20 topics. The new model uses a larger number of topics and exhibits a greater information rate reduction as more topics are added.

Figure 1. Information rates of the test data, measured in bits per word, under the different models versus number of topics. Left: Psychological Review Abstracts data. Right: 20 Newsgroups data.

Figure 2. Mean time taken to perform a single iteration of the Gibbs EM algorithm described in section 4 as a function of the number of topics for both variants of the new model. Left: prior 1. Right: prior 2.

In latent Dirichlet allocation, the latent topic for a given word is inferred using the identity of the word, the number of times the word has previously been assumed to be generated by each topic, and the number of times each topic has been used in the current document. In the new model, the previous word is also taken into account.

This additional information means words that were considered to be generated by the same topic in latent Dirichlet allocation may now be assumed to have been generated by different topics, depending on the contexts in which they are seen. Consequently, the new model tends to use a greater number of topics.

In addition to comparing predictive accuracy, it is instructive to look at the inferred topics. Table 1 shows the words most frequently assigned to a selection of topics extracted from the 20 Newsgroups training data by each of the models. The "unseen" token was omitted. The topics inferred using latent Dirichlet allocation contain many function words, such as "the", "in" and "to". In contrast, all but one of the topics inferred by the new model, especially with prior 2, typically contain fewer function words. Instead, these are largely collected into the single remaining topic, shown in the last column of rows 2 and 3 in table 1. This effect is similar, though less pronounced, to that achieved by Griffiths et al.'s composite model (2004), in which function words are handled by a hidden Markov model, while content words are handled by latent Dirichlet allocation.

6. Future Work

There is another possible prior over Φ, in addition to the two priors discussed in this paper. This prior has a hyperparameter vector for each previous word context j, resulting in W hyperparameter vectors:

P(Φ | {β_j m_j}) = ∏_j ∏_k Dirichlet(φ_{j,k} | β_j m_j).   (32)

Here, information is shared between all distributions with previous word context j. This prior captures the notion of common bigrams: word pairs that always occur together. However, the number of hyperparameter vectors is extremely large (much larger than the number of hyperparameters in prior 2), with comparatively little data from which to infer them. To make effective use of this prior, each normalized measure m_j should itself be assigned a Dirichlet prior. This variant of the model could be compared with those presented in this paper.

To enable a direct comparison, Dirichlet hyperpriors could also be placed on the hyperparameters of the priors described in section 3.

7. Conclusions

Creating a single model that integrates bigram-based and topic-based approaches to document modeling has several benefits. Firstly, the predictive accuracy of the new model, especially when using prior 2, is significantly better than that of either latent Dirichlet allocation or the hierarchical Dirichlet language model. Secondly, the model automatically infers a separate topic for function words, meaning that the other topics are less dominated by these words.

Acknowledgments

Thanks to Phil Cowans, David MacKay and Fernando Pereira for useful discussions. Thanks to Andrew Suffield for providing sparse matrix code.

References

Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5-43.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235.

Griffiths, T. L., & Steyvers, M. (2005). Topic modeling toolbox. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm.

Griffiths, T. L., Steyvers, M., Blei, D. M., & Tenenbaum, J. B. (2004). Integrating topics and syntax. Advances in Neural Information Processing Systems.

Jelinek, F., & Mercer, R. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. Gelsema and L. Kanal (Eds.), Pattern recognition in practice, 381-402. North-Holland Publishing Company.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

MacKay, D. J. C., & Peto, L. C. B. (1995). A hierarchical Dirichlet language model. Natural Language Engineering, 1, 289-307.

Minka, T. P. (2003). Estimating a Dirichlet distribution. http://research.microsoft.com/~minka/papers/dirichlet/.

Rennie, J. (2005). 20 Newsgroups data set. http://people.csail.mit.edu/jrennie/20newsgroups/.

Table 1. Top: The most commonly occurring words in some of the topics inferred from the 20 Newsgroups training data by latent Dirichlet allocation. Middle: Some of the topics inferred from the same data by the new model with prior 1. Bottom: Some of the topics inferred by the new model with prior 2. Each column represents a single topic, and words appear in order of frequency of occurrence. In the original table, content words are shown in bold; function words were identified by their presence on a standard list of stop words: http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words. All three sets of topics were taken from models with 90 topics.

Latent Dirichlet allocation:
  the      i             that      easter
  number   is            proteins  ishtar
  in       satan         the       a
  to       the           of        the
  espn     which         to        have
  hockey   and           i         with
  a        of            if        but
  this     metaphorical  number    english
  as       evil          you       and
  run      there         fact      is

Bigram topic model (prior 1):
  to       the       the          the
  party    go        and          a
  arab     is        between      to
  not      belief    warrior      i
  power    believe   enemy        of
  any      use       battlefield  number
  i        there     a            is
  is       strong    of           in
  this     made      there        and
  things   i         way          it

Bigram topic model (prior 2):
  party      go       number       the
  arab       believe  the          to
  power      about    tower        a
  as         atheism  clock        and
  arabs      gods     a            of
  political  before   power        i
  are        see      motherboard  is
  rolling    atheist  mhz          number
  london     most     socket       it
  security   shafts   plastic      that