Factorized Multi-Modal Topic Model
Seppo Virtanen 1, Yangqing Jia 2, Arto Klami 1, Trevor Darrell 2
1 Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University
2 UC Berkeley EECS and ICSI

Abstract

Multi-modal data collections, such as corpora of paired images and text snippets, require analysis methods beyond single-view component and topic models. For continuous observations the current dominant approach is based on extensions of canonical correlation analysis, factorizing the variation into components shared by the different modalities and those private to each of them. For count data, multiple variants of topic models attempting to tie the modalities together have been presented. All of these, however, lack the ability to learn components private to one modality, and consequently will try to force dependencies even between minimally correlating modalities. In this work we combine the two approaches by presenting a novel HDP-based topic model that automatically learns both shared and private topics. The model is shown to be especially useful for querying the contents of one domain given samples of the other.

1 INTRODUCTION

Analysis of objects represented by multiple modalities has been an active research direction over the past few years. If the analysis of a single modality is characterized as learning some sort of components that describe the data, the task in analysis of multiple modalities can be summarized as learning components that describe both the variation within each modality and the variation shared between them (Klami and Kaski, 2008; Jia et al., 2010). The fundamental problem is in learning how to correctly factorize the variation into the shared and private components, so that the components can be intuitively interpreted.
For continuous vector-valued samples the problem can be solved efficiently by a structural sparsity assumption (Jia et al., 2010; Virtanen et al., 2011), resulting in an extension of canonical correlation analysis (CCA) that models not only the correlations but also components private to each modality. One prototypical example of multi-modal analysis is that of modeling collections of images and associated text snippets, such as captions or contents of a web page. While both text and image content can naturally be represented with bag-of-words-type vectors, the assumptions made by the above methods fail. Instead, such count data calls for topic models such as latent Dirichlet allocation (LDA): several extensions of LDA have been presented for multi-modal setups, including Blei and Jordan (2003); Mimno and McCallum (2008); Salomatin et al. (2009); Yakhnenko and Honavar (2009); Rasiwasia et al. (2010) and Putthividhya et al. (2011). However, none of these extensions is able to find shared and private topics in the same sense as the CCA-based models do for continuous data. Instead, the models attempt to enforce strong correlation between the modalities, which is a reasonable assumption when analyzing e.g. multi-lingual textual corpora with similar languages, but one that does not hold for analysis of images associated with free-flowing text. In most cases, the images will contain a considerable amount of information not related to the text snippet, and it is not even guaranteed that the text is related at all to the visual content of the image. In this work, we introduce a novel topic model that combines the two above lines of work. It builds on the correlated topic models (CTM) by Blei and Lafferty (2007) and Paisley et al. (2011), by modeling correlations between topic allocations and by using a hierarchical Dirichlet process (HDP) formulation for automatically learning the number of topics.
The proposed factorized multi-modal topic model integrates the technical improvements of these single-modality topic models into the multi-modal application, and in particular automatically learns to make some topics specific to each of the modalities, implementing the factorization idea of Klami and Kaski (2008) and Jia et al. (2010) used for continuous data. The component selection plays a crucial role in implementing this property, implying that the HDP-based technique for automatically selecting the complexity is even more important for factorized multi-modal models than it would be for a regular topic model. The primary advantage of the new model is that it does not enforce correlations between the modalities, like the earlier multi-modal topic models do, but instead factorizes the variation into interpretable topics describing shared and private structure. The model is very flexible and does not enforce any particular factorization structure, but instead learns it from the data. For example, the model can completely ignore the shared topics in case the modalities are independent, or find almost solely shared topics when they are strongly correlated. In this work we demonstrate the model in analyzing modalities that have only weak relationships, a scenario for which the previous models would not work. In particular, we analyze a collection of Wikipedia pages that consist of images and the whole text on the page. Such a collection has relatively low between-modality correlation and in particular includes a considerable amount of text that is not related to the image at all, necessitating topics private to the text modality. The proposed model is shown to clearly outperform alternative HDP-based topic models as well as correspondence LDA (Blei and Jordan, 2003) in the task of inferring the contents of a missing modality.

2 BACKGROUND: TOPIC MODELS

To briefly summarize the topic models and to introduce the notation used in the paper, we describe the standard topic model of Latent Dirichlet Allocation (LDA) (Blei et al., 2003) through its generative process. We assume that words occurring in a document are drawn from K topics.
Each topic k specifies a multinomial probability distribution over the vocabulary, parameterized through η_k drawn from the Dirichlet distribution Dir(γ1), and the topic proportions are multinomial with parameters θ ∼ Dir(ν1). The documents are generated by repeatedly sampling a topic indicator z ∼ Multi(θ) and then drawing a word from the corresponding topic as x ∼ Multi(η_z). We will also heavily depend on the concept of correlated topic models (CTM) (Blei and Lafferty, 2007). In the standard LDA the topic proportions θ drawn from the Dirichlet distribution are independent except for the weak negative correlation stemming from the normalization constraint. CTM replaces this choice by the logistic normal distribution, first drawing an auxiliary variable from a Gaussian distribution ξ ∼ N(µ, Σ) and specifying the topic distribution as θ ∝ exp(ξ). The topics become correlated when Σ is not diagonal, and empirical experiments show increased predictive accuracy. Finally, our model will be formulated through a hierarchical Dirichlet process (HDP) formulation (Teh et al., 2006), to enable automatic choice of the number of topics. As mentioned in the introduction, the choice is even more critical for multi-modal models, since we will have several sets of topics instead of just a single one; specifying the complexity for all of those in advance would not be feasible. Our model will use elements from the recently introduced Discrete Infinite Logistic Normal (DILN) model by Paisley et al. (2011), which incorporates HDP into CTM. The key idea of DILN is that the topic distributions θ are made sparse by multiplying exp(ξ) by sparse topic-selection terms. The topic distribution is given by θ_k ∼ Gamma(βp_k, exp(−ξ_k)), where both β and p_k come from a stick-breaking process: β is the second-level concentration parameter, and p_k = V_k ∏_{i=1}^{k−1} (1 − V_i), where V_k ∼ Beta(1, α) with α as the first-level concentration parameter. The expected value of θ_k is proportional to βp_k exp(ξ_k), illustrating the way the different parameters influence the topic weights.
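As a concrete illustration, the stick-breaking and gamma construction described above can be sketched in a few lines of NumPy. This is a minimal sketch with made-up hyperparameter values, not a reference implementation; the Gamma is parameterized with rate exp(−ξ_k), which matches the stated mean βp_k exp(ξ_k).

```python
import numpy as np

rng = np.random.default_rng(0)

K = 20          # truncation level for the stick-breaking process (arbitrary)
alpha = 2.0     # first-level concentration (made up for the sketch)
beta = 5.0      # second-level concentration (made up for the sketch)

# Stick-breaking weights: p_k = V_k * prod_{i<k} (1 - V_i), with V_k ~ Beta(1, alpha)
V = rng.beta(1.0, alpha, size=K)
p = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))

# Correlated auxiliary variables xi ~ N(mu, Sigma); an arbitrary dense Sigma here
mu = np.zeros(K)
Sigma = 0.5 * np.eye(K) + 0.1
xi = rng.multivariate_normal(mu, Sigma)

# Topic weights theta_k ~ Gamma(beta * p_k, rate exp(-xi_k)),
# i.e. scale exp(xi_k), so that E[theta_k] = beta * p_k * exp(xi_k)
theta = rng.gamma(shape=beta * p, scale=np.exp(xi))
theta /= theta.sum()   # normalize into a topic distribution
```

The decaying stick weights p_k are what make only a finite prefix of the topics effectively active.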
For any finite data collection, p_k > 0 only for a finite subset of topics and hence the model automatically selects the number of topics.

3 FACTORIZED MULTI-MODAL TOPIC MODEL

Consider a collection of documents each containing M weakly correlated modalities, where each modality has its own vocabulary. In the application of this paper the two vocabularies are textual and visual words collected from Wikipedia pages with text and a single image (though the model would directly generalize to multiple images). We introduce a novel multi-modal topic model that can be used to learn dependencies between these modalities, enabling e.g. predicting the textual content associated with a novel image. The problem is made particularly challenging by the weak relationship between the modalities; several of the documents will contain large amounts of text not related to the image content. For modeling the data, we will use M separate vocabularies, so that words (or visual words) for each modality are drawn from separate dictionaries η^(m) specific to each view m. The topic proportions θ^(m) will also be specific to each modality, whereas the actual words are sampled independently for each modality given the topic proportions. The essential modeling question is then how the topic proportions are tied with each other, in order to achieve the factorization into shared and private topics. In brief, we will do this by (i) modeling dependencies between topics both within and across modalities and (ii) automatically selecting the number of topics for each type (shared or private to any of the modalities). The topic proportions θ^(m) are made dependent by introducing auxiliary variables ξ^(m), denoting by ξ = (ξ^(1), ..., ξ^(M)) the concatenation of them, and using the CTM prior ξ ∼ N(µ, Σ). This part of the model corresponds to the multi-field CTM with different topic sets by Salomatin et al. (2009), and the different blocks in Σ describe different types of dependencies between the topic proportions. In particular, the blocks around the diagonal describe dependencies between the topic proportions within each modality, whereas the off-diagonal blocks describe dependencies in topic proportions between the modalities. Having a CTM for the joint topic distribution is not yet sufficient for separating the shared topics from private ones, since we can only control the correlation between the topic proportions. A large correlation between two topics of different modalities would imply that the topic is shared, but lack of correlation (that is, Σ_kl = 0) would not make either component private. Instead, the weights would simply be determined independently. To create separate sets of shared and private topics we need to be able to switch some of the topics off in one or more of the modalities, similarly to how Jia et al. (2010) and Virtanen et al. (2011) switch off components to make the same distinction in continuous data models. In the case of the multi-field CTM this could only be done by driving µ_k (the mean of the Gaussian prior for ξ_k) towards minus infinity, which is not encouraged by the model and is difficult to achieve with mean-field updates.
We implement the shared/private choice by separate HDPs, one for each modality, switching a subset of topics off for each modality separately by a mechanism similar to how the single-view DILN model (Paisley et al., 2011) selects the topics. We introduce β^(m) and p^(m) for each modality m = 1, ..., M, and draw them from separate HDPs, resulting in θ^(m)_k ∼ Gamma(β^(m) p^(m)_k, exp(−ξ^(m)_k)) as the final topic proportions. The topic distributions are still coupled through the ξ^(m) that were drawn from a single high-dimensional Gaussian, but for each modality the stick weights p^(m) select different subsets of topics to be switched off. In the end, a finite number of topics remain for each modality, and the private topics can be identified as ones that have non-zero weight for one modality and are not correlated with topics active in other modalities.

Figure 1: A graphical representation of the factorized multi-modal topic model. The data has D documents described by M modalities. For each modality, the words x^(m) are drawn from a dictionary specific to that modality, according to topic proportions θ^(m) also specific to the modality. The topic proportions are generated by a logistic transformation of latent variables ξ^(m) that model the correlations between the topics both within and across modalities, followed by topic selection with a HDP (denoted by V and β in the plate; see text for details) for each modality. As a result, the model learns both topics modeling correlations between the modalities as well as topics private to each modality.

The final generative model motivated by the above discussion results in a collection of M correlated BOW data sets X^(m), generated as follows (see Figure 1 for a graphical representation). For the whole collection we:
- create a dictionary of T^(m) topics for each modality by drawing η^(m)_k ∼ Dir(γ^(m) 1) for k = 1, ..., T^(m)
- draw the parameters α^(m), β^(m), V^(m) of the DILN distribution for each modality from the stick-breaking formulation and construct p^(m).
For each document we then draw ξ ∼ N(µ, Σ) and partition it into the different modalities as ξ = (ξ^(1), ..., ξ^(M)). For each modality, we then generate the words independently as follows:
- form the topic proportions by drawing Y^(m)_k ∼ Gamma(β^(m) p^(m)_k, exp(−ξ^(m)_k)) and setting θ^(m)_k = Y^(m)_k / Σ_{i=1}^{T^(m)} Y^(m)_i
- draw N^(m) words by choosing a topic z ∼ Multi(θ^(m)) and drawing a word x ∼ Multi(η^(m)_z).
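The whole generative process above can be sketched end to end for M = 2 modalities. All sizes and hyperparameters below are made up for illustration; the sketch follows the steps listed in the text but makes no claim about the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

M, T, D = 2, 10, 3                  # modalities, truncated topics per modality, documents
vocab = [50, 40]                    # vocabulary size per modality (made up)
N = [30, 20]                        # words per document per modality (made up)
gamma, alpha, beta = 0.1, 2.0, 5.0  # hyperparameters (made up)

# Per-modality topic dictionaries eta^(m)_k ~ Dir(gamma 1)
eta = [rng.dirichlet(gamma * np.ones(vocab[m]), size=T) for m in range(M)]

# Per-modality stick weights p^(m) from separate stick-breaking processes
p = []
for m in range(M):
    V = rng.beta(1.0, alpha, size=T)
    p.append(V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1])))

# A single Gaussian over the concatenated xi = (xi^(1), xi^(2))
mu = np.zeros(M * T)
Sigma = 0.5 * np.eye(M * T) + 0.05  # dense off-diagonal blocks couple the modalities

docs = []
for d in range(D):
    xi = rng.multivariate_normal(mu, Sigma)
    words = []
    for m in range(M):
        xi_m = xi[m * T:(m + 1) * T]
        Y = rng.gamma(shape=beta * p[m], scale=np.exp(xi_m))
        theta = Y / Y.sum()                                  # normalized proportions
        z = rng.choice(T, size=N[m], p=theta)                # topic indicators
        x = [rng.choice(vocab[m], p=eta[m][k]) for k in z]   # words
        words.append(x)
    docs.append(words)
```

Because each modality has its own stick weights p^(m), a topic can receive essentially zero mass in one modality while staying active in the other, which is exactly the shared/private behavior discussed above.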
3.1 INFERENCE

For learning the model parameters we use a truncated variational approximation following closely the algorithm given by Paisley et al. (2011), the main difference being that we have M separate sets of η, β and p, one for each modality. The above generative process is truncated by setting V^(m)_{T^(m)} = 1, forcing the stick lengths beyond the truncation level T^(m) to be zero, and the resulting factorized approximation is given by

Q = ∏_{m=1}^{M} ∏_{d=1}^{D} [ ∏_{n_m=1}^{N_m} q(z^(m)_{d,n_m}) ] [ ∏_{k=1}^{T} q(Y^(m)_{d,k}) ] q(ξ^(m)_d) q(η^(m)) q(V^(m)) q(α_m) q(β_m) q(µ) q(Σ),

where to simplify notation we assume T_m = T for all m. The algorithm proceeds by updating each factor in turn while keeping the others fixed, using either gradient ascent or an analytic solution for maximizing the lower bound of the approximation for each of the terms (see Paisley et al. (2011) for details). The main difference in the algorithms comes from updating ξ, since in our case it ranges over M sets of topics instead of just one, yet the activities within each set are governed by separate HDPs. We use a diagonal Gaussian factor q(ξ) = N(ξ̃, diag(ṽ)), where ṽ denotes the variances of the dimensions, and use gradient ascent for jointly updating the parameters. To simplify notation we use ξ and v to denote the expectation and variance of the factorial distribution. The relevant part of the lower bound is

L_{ξ,v} = − Σ_{m=1}^{M} β^(m) p^(m)T ξ^(m) − Σ_{m=1}^{M} E[θ^(m)]^T E[exp(−ξ^(m))] − (ξ − µ)^T Σ^{−1} (ξ − µ)/2 − diag(Σ^{−1})^T v/2 + log(v)^T 1/2.   (1)

Here Σ^{−1} couples the separate ξ^(m) terms in the partial derivatives as

∂L_{ξ,v}/∂ξ^(m) = −β^(m) p^(m) + E[θ^(m)] ∘ E[exp(−ξ^(m))] − (Σ^{−1})_{m,m} (ξ^(m) − µ^(m)) − Σ_{j≠m} (Σ^{−1})_{m,j} (ξ^(j) − µ^(j)),

with (Σ^{−1})_{i,j} denoting the block of Σ^{−1} corresponding to modalities i and j. The inverse of Σ remains constant during the gradient ascent, and hence only needs to be evaluated once for every time the factor q(ξ) is updated. We use maximum marginal likelihood to update µ and Σ, resulting in the closed-form updates

µ = (1/D) Σ_{d=1}^{D} ξ_d,
Σ = (1/D) Σ_{d=1}^{D} ( (ξ_d − µ)(ξ_d − µ)^T + diag(v_d) ).
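The closed-form updates for µ and Σ are simple moment computations over the variational means and variances. A minimal sketch with synthetic values standing in for the variational parameters (the xi and v arrays are made up; in the real algorithm they come from q(ξ)):

```python
import numpy as np

rng = np.random.default_rng(2)

D, K = 100, 6                          # documents, total topics across modalities
xi = rng.normal(size=(D, K))           # variational means of q(xi_d), one row per document
v = rng.gamma(1.0, 0.1, size=(D, K))   # variational variances (diagonal of q(xi_d))

# mu = (1/D) sum_d xi_d
mu = xi.mean(axis=0)

# Sigma = (1/D) sum_d ((xi_d - mu)(xi_d - mu)^T + diag(v_d))
centered = xi - mu
Sigma = (centered.T @ centered + np.diag(v.sum(axis=0))) / D
```

Adding the diag(v_d) terms keeps Σ positive definite even when the means alone would give a rank-deficient sample covariance.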
3.2 PREDICTION

The model structure is well suited for prediction tasks, where the goal is to infer missing modalities for a new document given that one of them is observed (e.g. infer the caption given the image content). This is because the correlations between the topic proportions provide a direct link between the modalities, and the private topics explain away all the variation that is not useful for predictions. Here we present the details of the prediction for the special case with just one observed modality (j) and one missing modality (i). Given the observed data we first infer the topic proportions θ̂^(j) and then the auxiliary variable ξ̂^(j) by maximizing a cost similar to (1), but only using the newly inferred topic proportions of the observed modality and the corresponding part of Σ. As ξ̂ comes from a Gaussian distribution we can infer ξ̂^(i) given ξ̂^(j) with the standard conditional expectation as

ξ̂^(i) = µ^(i) + Σ_{i,j} Σ_{j,j}^{−1} (ξ̂^(j) − µ^(j))   (2)
      = µ^(i) + W (ξ̂^(j) − µ^(j)).

Here W involves the corresponding parts of the between-topic covariance matrix Σ as indicated above, and can be seen as a projection matrix transforming the components of one modality to another. Finally, the newly estimated ξ̂^(i) for the missing views is converted back to the expected topic proportions θ̂^(i) by exponentiation and multiplying with the corresponding stick lengths p^(i).

3.3 SHARED AND PRIVATE TOPICS

The key novelty of the model is its capability to learn both topics that are shared and topics that are private to each modality, without needing to specify them in advance. Since the way these topics appear is by no means transparent in the above formulation, we will here discuss the property in more detail. In brief, the distinct nature of the topics comes from an interplay of the correlations between the topics of different modalities and the HDP procedure that turns some of the topics off
for each modality. In particular, neither of these properties alone would be sufficient. As mentioned already in Section 3, merely having separate ξ^(m) drawn from a single Gaussian is not sufficient for finding private topics. At best, the correlation structure can specify that the weights will be independent for the modalities. Next we explain how the other key element of the model, separate selection of active topics for each modality, is not sufficient alone either. We do that by considering a special case of the model that assumes equal ξ = ξ^(m) for all views but has separate stick-breaking processes switching some of the topics off for each of the views. We call this alternative model mmDILN, due to the way it implements the multi-modal LDA of Blei and Jordan (2003) with DILN-style component selection. Intuitively, the mmDILN model could find private topics simply by setting p^(m)_k to a small value for topics that are not needed in that modality. However, it cannot make correct predictions from one modality to another, and hence fails in achieving one of the primary goals of shared-private factorizations. If p^(m)_k is small then the model has no information for inferring ξ_k from that view, and hence also all other elements ξ_l that correlate with ξ_k will be incorrect. If ξ_k was an important topic for the other view, the predictions will be severely biased. Our model avoids this issue by having the separate ξ^(m) parameters, leading to correct across-modality predictions as described in the previous section. In the experimental section we will empirically compare the proposed model with mmDILN, demonstrating how mmDILN indeed has very poor predictive accuracy despite modeling the training data almost as well. Hence, even though the structure is in principle sufficient for learning private topics, the model has no practical value as a shared-private factorization.
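The across-modality prediction that mmDILN fails at is, in our model, the conditional Gaussian expectation of Eq. (2). A minimal sketch with a synthetic covariance (the dimensions, Σ, µ and the stick lengths below are all made up; ξ̂^(j) stands in for the value inferred from the observed modality):

```python
import numpy as np

rng = np.random.default_rng(3)

Ti, Tj = 4, 5                       # topics in the missing (i) and observed (j) modality
A = rng.normal(size=(Ti + Tj, Ti + Tj))
Sigma = A @ A.T + np.eye(Ti + Tj)   # a random positive-definite joint covariance
mu = rng.normal(size=Ti + Tj)

mu_i, mu_j = mu[:Ti], mu[Ti:]
Sigma_ij = Sigma[:Ti, Ti:]          # cross-modality block
Sigma_jj = Sigma[Ti:, Ti:]          # observed-modality block

xi_hat_j = rng.normal(size=Tj)      # stand-in for xi inferred from the observed modality

# Eq. (2): xi_hat_i = mu_i + Sigma_ij Sigma_jj^{-1} (xi_hat_j - mu_j)
W = Sigma_ij @ np.linalg.inv(Sigma_jj)
xi_hat_i = mu_i + W @ (xi_hat_j - mu_j)

# Back to topic proportions: exponentiate and weight by the stick lengths p^(i)
p_i = np.full(Ti, 1.0 / Ti)         # placeholder stick lengths
theta_hat = p_i * np.exp(xi_hat_i)
theta_hat /= theta_hat.sum()
```

W depends only on Σ, so it can be precomputed once and reused for every test document.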
In order to recognize the nature of each of the topics, we need to look at both the covariance Σ between the topic weights and the modality-specific stick weights p^(m). Since the topics can be (potentially strongly) correlated both within and across modalities, we can identify private topics only by searching for topics that do not correlate with any topic that would be active in any other modality. In the experiments we demonstrate how the topics can be ranked according to how strongly they are shared with another modality, by inspecting the elements of Σ.

4 RELATED WORK

In this section we relate the model to other approaches for modeling multi-modal count data.

4.1 MULTI-MODAL TOPIC MODELS

The multi-modal extension of LDA (mmLDA) by Blei and Jordan (2003) and its non-parametric version mmHDP by Yakhnenko and Honavar (2009) assume all modalities share the same topic proportions, and essentially extend LDA only by having separate dictionaries for each modality and generating the words for the domains independently. For many real-world data sets the assumption of identical topic proportions is too strong, and the model tries to enforce correlations even when they do not exist. While the assumption may help in picking up topics that would be weak in either modality alone, it makes identifying the true correlations almost impossible. Such models fail especially when modeling data having strong private topics in one modality. Since the topic proportions are shared, the topic must be present in the other modalities as well, and there it becomes associated with a dictionary that merely replicates the overall distribution of the words. Such topics are particularly harmful for prediction tasks. When the dictionary of a topic matches that of the background word distribution, it will be present in every document in that modality.
For example, when predicting text from images we could learn to associate politics (a strong topic private to the text modality) with the overall visual word distribution, resulting in all of the predictions including terms from the politics topic. Salomatin et al. (2009) took a step towards our model with their multi-field CTM. It extends CTM by introducing separate ξ^(m) for each modality, similarly to our model. However, as described in the previous section, the separate topic proportions are not yet sufficient for separating the shared topics from the private ones.

4.2 CONDITIONAL TOPIC MODELS

Much recent work on multi-modal topic modeling has focused on building conditional models, largely for the image annotation task. Correspondence LDA (corrLDA), proposed simultaneously with mmLDA by Blei and Jordan (2003), is a prominent example, assuming that the image is generated first and the text depends on the image content. Both modalities are assumed to share the same topic weights. While such models are very useful for modeling the conditional relationship, they do not treat the modalities symmetrically as our model does. Recently Putthividhya et al. (2011) proposed an extension of corrLDA, replacing the identical topic distributions with a regression module from image topics to the textual annotation topics. The added flexibility results in better predictive performance, but the model remains a directional one,
in contrast to our model, which generates all modalities with equal importance. For applications treating only two modalities and having a specific task that makes one of them more important (say, image annotation), the conditional models often work well. However, they do not easily generalize to multiple modalities and are not flexible in terms of the eventual application. Other conditional models focus on conditioning on meta-data, such as author or link structure (Mimno and McCallum, 2008; Hennig et al., 2012). Such models allow integrating data that are not necessarily in count format, but the same distinction of directional versus generative applies. However, this family of models could be integrated with our solution, incorporating a meta-data link into our multi-modal model. In essence, the choice of whether meta-data is modeled or not is independent of the choice of how many count data modalities the data has.

4.3 CANONICAL CORRELATIONS

As described earlier, the model bears a close resemblance to how CCA models correlations between continuous data, the similarities being most apparent with the recent re-interpretations of CCA as shared-private factorization (Klami and Kaski, 2008; Jia et al., 2010). The technical details of the solutions are, however, very different, as the normalization of topic proportions makes the techniques used for continuous data infeasible for topic models. Despite the mismatch of data types, CCA can be used for modeling count data as well. The most promising direction would be to apply kernel CCA, but there are no obvious choices for a kernel function that would directly match the analysis of image-text pairs. As one practical remedy, Rasiwasia et al. (2010) combined CCA and LDA directly by first estimating a separate LDA model for each modality and then combining the resulting topic proportions with CCA. Our approach does not rely on such two separate analysis steps, which do not result in directly interpretable private topics.
5 EXPERIMENTS AND RESULTS

5.1 DATA AND MEASURES

We validate the model on real data collected from Wikipedia.1 We constructed a data collection with D = 20,000 documents, each consisting of a single image represented with 5000 SIFT patches and text (the contents of the whole Wikipedia page) represented with a vocabulary of the 7500 most frequent terms after stopword removal. We made a random 50/50 split into test and training data. To demonstrate the ability of the proposed model to correctly model the relationships between the two modalities, we evaluate the model with the conditional perplexity of a missing modality for a new sample:

P^(m)_train = exp( − Σ_{d ∈ D_train} log p(x^(m)_d) / Σ_{d ∈ D_train} N^(m)_d ),
P^(i|j)_test = exp( − Σ_{d ∈ D_test} log p(x^(i)_d | x^(j)_d) / Σ_{d ∈ D_test} N^(i)_d ),

where x^(m)_d denotes the concatenation of the N^(m)_d words. These quantities measure how well the model can relate the visual content to the textual content, corresponding to the document completion task of Wallach et al. (2009) but computed across modalities. We compare our model to three alternatives representing various kinds of multi-modal topic models: mmDILN (Section 3.3), mmHDP (Section 4.1) and corrLDA (Section 4.2). Both mmDILN and mmHDP are comparable to our model in making automatic topic number selection and modeling both modalities symmetrically. Consequently, the experiments will focus on demonstrating the importance of finding the correct factorization into shared and private topics. corrLDA is included as an example of a conditional model that gives an alternative approach to solving a similar prediction task. Note that we need to learn two separate corrLDA models, one for predicting text from images and one for the other direction, whereas the other models can do both types of predictions. For corrLDA we use 100 topics (the truncation level we use for the nonparametric models).

1 Available from ~jiayq/

5.2 INFERENCE SPEED

First we show that the variational approximation used for inference is efficient.
Figure 2 shows how the algorithm converges for both N = 400 and N = 10,000 documents already after some tens of iterations. For both experiments we used a maximum of T = 100 topics. The convergence of mmHDP and mmDILN is similar (not shown).

5.3 PREDICTING TEXT FROM IMAGES AND VICE VERSA

Figure 3 shows the evaluation on the training and test sets for the proposed model and the comparison methods, measured as the perplexity on training data and the conditional perplexity of images given the text and text given the images. The proposed method, which is more flexible than the alternatives, reaches better
(lower) perplexity on the training and test data due to being able to describe the variation not shared by the other modality without needing to introduce noise topics.

Figure 2: Training perplexity as a function of algorithm iterations, for the text and image modalities with 400 and 10,000 documents.

A notable observation is that the baseline methods perform worse at predicting text from images as the amount of training data increases. This illustrates clearly the fundamental problem in modeling multi-modal collections without separate private topics. Since the text documents are easier to model than the images, the alternative models start to focus more and more on modeling the text when there is a large amount of data. The dominant topics start describing the text alone, yet they are also active in the image modality, but with a dictionary that does not contain any information. Given a new image sample, the estimated topic proportions will be arbitrary and hence do not enable meaningful prediction. The proposed model, however, learns to make those textual topics private to the text modality, while capturing the weaker correlations between the two modalities with shared topics. The model still cannot predict textual information not correlated with the image content, but it learns correctly not to even attempt that, and manages to make accurate predictions for the aspects that are correlated.

5.4 SHARED AND PRIVATE TOPICS

To illustrate how the HDP formulation chooses the topics, we visualize the stick parameters p in Figure 4. First, we notice that the last sticks have close to zero weight, indicating that the chosen truncation level T = 100 is sufficient. More importantly, we see that the weights for the text and image topics are different (the image topics are more spread out), motivating the choice of separate weights for the modalities. To further understand how the proposed model is able to find both shared and private topics, we explore the nature of the individual topics.
Since the SIFT vocabulary is not easily interpretable by visual inspection, we illustrate the property for the textual topics. For each textual topic we measure the amount of correlation with the other modality by inspecting the correlation structure in Σ, and then rank the topics according to this measure.

Figure 4: Visualization of the stick parameters p of the proposed model for the text modality (a) and the image modality (b) reveals how they are not identical for the two modalities. Both figures show the weights for two models learned with 400 and 10,000 documents, revealing how the distribution is learned fairly accurately already from a small collection.

This results in a ranked list of the text topics, the first ones being strongly shared by the two modalities while the last ones are private to the text modality. More specifically, denoting the separate blocks in the covariance matrix as

Σ = ( Σ_{t,t}  Σ_{t,i} ; Σ_{i,t}  Σ_{i,i} ),   (3)

we convert it to a correlation matrix Ω, threshold small values out (we use a threshold of 0.2) and extract the cross-correlation between the textual (rows) and visual topics (columns) to get Ω_{t,i}. Then for each textual topic we define the visual relevance ρ as the row mean of the absolute values of Ω_{t,i}.2 This quantity captures general and rich visual combinations that co-occur with the textual topics, and it is worth noticing how general the measure is: it allows multiple visual topics to correlate with one textual topic (and vice versa), and includes both positive and negative correlations, which are typically equally relevant (negative correlation can be seen as the absence of a visual component); see Figure 5 for a demonstration. The textual topics are ranked according to ρ in Figure 6. There are a few very strongly shared topics between the text and image modalities, and at the end of the list we have several topics private to the text modality, indicated by zero correlation with the image modality.
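The visual relevance computation just described can be sketched directly from the blocks of Σ. The covariance below is synthetic (random, positive definite) and the topic counts are made up; the steps — covariance to correlation, thresholding at 0.2, row means of absolute cross-correlations — follow the text.

```python
import numpy as np

rng = np.random.default_rng(4)

Tt, Ti = 8, 6                       # number of text and image topics (made up)
A = rng.normal(size=(Tt + Ti, Tt + Ti))
Sigma = A @ A.T + np.eye(Tt + Ti)   # stand-in for the learned covariance

# Covariance -> correlation matrix Omega
s = np.sqrt(np.diag(Sigma))
Omega = Sigma / np.outer(s, s)

# Threshold small correlations and extract the text-vs-image block Omega_ti
Omega = np.where(np.abs(Omega) < 0.2, 0.0, Omega)
Omega_ti = Omega[:Tt, Tt:]

# Visual relevance rho: row mean of absolute cross-correlations
rho = np.abs(Omega_ti).mean(axis=1)
ranking = np.argsort(-rho)          # most visually relevant text topics first
```

Topics at the tail of the ranking, with rho near zero, are the candidates for being private to the text modality.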
This matches the intuition that the full text of a Wikipedia page cannot be mapped to the image content in all cases. Table 1 summarizes the six text topics most strongly correlating with the image modality, as well as six topics that are private

2 We also tried using the maximum element instead of the mean; it results in a fairly similar ranking.
to the text modality, revealing very clear interpretations. The most strongly correlating topic covers airplanes, which are known to be easy to recognize from images due to their distinct shapes and backgrounds. The second topic is about maps, which also have a clear visual correspondence, and the other strongly correlated topics also cover clearly visual concepts like buildings, cars and railroads. The topics private to the text domain, in turn, are about concepts with no clear visual counterpart: economy, politics, history and research. In summary, the model has separated the components nicely into shared and private ones, and provides additional interpretability beyond regular multi-modal topic models.

Figure 3: Training and test perplexities (lower is better) for the two modalities. For training data we show the perplexity of modeling the text (a) and the images (b) separately. For test data, we show the conditional perplexity of predicting text from images (c) and predicting images from text (d), corresponding to the document completion task used for evaluating topic models. The proposed method outperforms the comparison ones in all respects. The comparison methods mmHDP, mmDILN and corrLDA, which are not able to extract topics private to either modality, are not able to learn good predictive models, demonstrated especially by the error increasing as a function of training samples in (c). The image prediction perplexity for mmDILN is outside the range depicted in (d), above 5400 for all training set sizes.

Figure 5: Illustration of part of the cross-correlation between text topics and image topics, corresponding to a subset of Ω_{t,i}, where yellow represents positive correlations and blue represents negative ones. The size of the boxes corresponds to the absolute value.
Figure 6: Text topics ordered according to visual relevance ρ (the vertical axis shows image relevance for the ordered text topics). We see that there are a few strongly correlating topics, and that the model has found roughly 10 topics that are private to the text domain. Note that such topics may still be important for modeling the whole multi-modal corpus, even though they do not contribute to the cross-modal information transfer.

6 DISCUSSION

Our paper ties together two separate lines of work on the analysis of multi-modal data. In particular, we created a novel multi-modal topic model which extends earlier tools for the analysis of multi-modal count data by incorporating elements found useful in the continuous-valued case. We explained how learning topics private to each modality is of crucial importance when modeling modalities with potentially weak correlations, and
Table 1: Text topics ranked according to visual relevance, summarized by the words with highest probability. The topic indices match the ranking in Figure 6. The shared topics have clear visual counterparts, whereas the private ones do not relate to any kind of visual content.

Shared topics
T1: airport flight airlines air international aircraft aviation terminal passengers airline boeing flights airways service airports passenger accident
T2: format ms lat m longm latm longs lats launched mi broken mill sold renamed dec captured rapids class feet coordinates built lake located
T3: building house built buildings street hall st century tower houses west designed design castle south north east side main square large end site
T4: car engine cars model models ford engines race rear series front racing wheel year driver speed vehicles vehicle production hp motor drive
T5: retrieved album song music video released single awards number billboard chart top release mtv songs media love show uk jackson hot albums
T6: line railway station rail trains train service lines bus transport services system railways stations built railroad passenger main metro transit

Topics private to the text domain
T95: president washington post united american national states secretary december november september times military dc kennedy press security
T96: ottoman turkish turkey kosovo armenian war greek serbia bulgarian serbian government border bulgaria turks forces croatian albanian republic
T97: research science development institute university management scientific technology design world national engineering work human international
T98: government state national european policy council international states members act union political countries system nations article parliament
T99: nuclear weapons anti power protest bomb people protests united protesters government strike peace states march reactor atomic april test
T100: economic trade economy world production industry oil million growth development government agricultural market agriculture industrial

demonstrated
empirically how such a property can only be obtained by combining two separate elements: modeling correlations between separate topic weights for each modality, and learning modality-specific indicators switching unnecessary topics off. For implementing these elements we combine state-of-the-art techniques in topic models, integrating the DILN distribution (Paisley et al., 2011) into a model similar to the multi-field correlated topic model of Salomatin et al. (2009), to create an efficient learning algorithm readily applicable to relatively large document collections.

Acknowledgements

AK and SK were supported by the COIN Finnish Center of Excellence and the FuNeSoMo exchange project. AK was additionally supported by the Academy of Finland (decision number ) and the PASCAL2 European Network of Excellence.

References

Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet allocation. JMLR, 3.
Blei, D. and Jordan, M. (2003). Modeling annotated data. In SIGIR.
Blei, D. and Lafferty, J. (2007). A correlated topic model of science. Annals of Applied Statistics, 1.
Hennig, P., Stern, D., Herbrich, R. and Graepel, T. (2012). Kernel topic models. In AISTATS.
Jia, Y., Salzmann, M. and Darrell, T. (2010). Factorized latent spaces with structured sparsity. In NIPS 23.
Klami, A. and Kaski, S. (2008). Probabilistic approach to detecting dependencies between data sets. Neurocomputing, 72(1-3).
Mimno, D. and McCallum, A. (2008). Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI.
Paisley, J., Wang, C. and Blei, D. (2011). Discrete infinite logistic normal distribution. In AISTATS.
Putthividhya, D., Attias, H. and Nagarajan, S. (2011). Topic-regression multi-modal latent Dirichlet allocation for image annotation. In CVPR.
Putthividhya, D., Attias, H. and Nagarajan, S. (2009). Independent factor topic models. In ICML.
Rasiwasia, N., Pereira, J., Coviello, E., Doyle, G., Lanckriet, G., Levy, R. and Vasconcelos, N. (2010). A new approach to cross-modal multimedia retrieval. In ACM Multimedia.
Salomatin, K., Yang, Y. and Lad, A.
(2009). Multi-field correlated topic modeling. In SDM.
Teh, Y., Blei, D. and Jordan, M. (2006). Hierarchical Dirichlet processes. JASA, 101(476).
Virtanen, S., Klami, A. and Kaski, S. (2011). Bayesian CCA via structured sparsity. In ICML.
Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D. (2009). Evaluation methods for topic models. In ICML.
Yakhnenko, O. and Honavar, V. (2009). Multi-modal hierarchical Dirichlet process model for predicting image annotation and image-object label correspondence. In SDM.
More information