Bayesian embedding of co-occurrence data for query-based visualization

Mohammad Khoshneshin, Department of Management Sciences, The University of Iowa, Iowa City, IA 52242 USA
W. Nick Street, Department of Management Sciences, The University of Iowa, Iowa City, IA 52242 USA
Padmini Srinivasan, Computer Science Department, The University of Iowa, Iowa City, IA 52242 USA

Abstract: We propose a generative probabilistic model for visualizing co-occurrence data. In co-occurrence data, there are a number of entities, and the data record the frequency with which two entities co-occur. We propose a Bayesian approach to infer the latent variables. Given the intractability of inference for the posterior distribution, we use approximate inference via variational approaches. The proposed Bayesian approach enables accurate embedding in a high-dimensional space, which is not directly useful for visualization. Therefore, we propose a method to embed a filtered set of entities for a given query (query-based visualization). Our experiments show that our proposed models outperform co-occurrence data embedding, the state-of-the-art model for visualizing co-occurrence data.

I. INTRODUCTION

Visualization is one of the most important exploratory tools for data analysis and mining. In this paper we propose a fully generative probabilistic model, Bayesian co-occurrence data embedding, to embed co-occurrence data in a Euclidean space as an unsupervised learning approach that can be used for visualization. This model is a Bayesian extension of co-occurrence data embedding (CODE) [7].

Bayesian co-occurrence data embedding (Bayes-CODE) is a generative probabilistic model for embedding co-occurrence data in a Euclidean space. To explain the different types of co-occurrence data, we use graph notation. Let G = (U, E) represent an unweighted graph in which multiple edges are allowed between any two nodes. Each edge depicts a token in the co-occurrence dataset. A token is a single observed occurrence. So if the edge or token (i, k), where i and k index entities or nodes, occurred 3 times, then v_ik (the i-th, k-th element of the co-occurrence matrix V) is 3, and there are 3 edges between nodes i and k in the equivalent graph. The reason we use graph notation is to capture some specific relationships that cannot be represented by the matrix notation.
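To make the token and matrix views concrete, the following minimal sketch (our own illustration; the entity names and counts are made up) builds the co-occurrence matrix V from a list of observed tokens:

```python
import numpy as np

# Hypothetical hetero-directed tokens: (document, word) index pairs.
# Each pair is one observed token; a repeated pair corresponds to
# parallel edges between the same two nodes in the multigraph G.
tokens = [(0, 1), (0, 1), (0, 1), (0, 2), (1, 2)]

n_docs, n_words = 2, 3
V = np.zeros((n_docs, n_words), dtype=int)
for d, w in tokens:
    V[d, w] += 1        # v_dw = number of edges between nodes d and w

print(V)
# [[0 3 1]
#  [0 0 1]]  e.g. v_01 = 3: token (0, 1) occurred 3 times
```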
Based on the structure of the graph G, there are two structural attributes that describe co-occurrence data. The first is whether the graph is directed or undirected. The second is whether the nodes U are homogeneous or heterogeneous. When the co-occurrence graph is directed, one type of entity is responsible for generating another type, so it is logical to define the generative model as a conditional probability. In the case of an undirected graph, the joint probability is more appropriate. In the case of heterogeneous nodes, the nodes in U are divided into two groups, U_1 and U_2, and edges can only be defined between nodes of different types; this heterogeneous graph is a bipartite graph. In the case of homogeneous nodes, there is only one type of node.

Based on these categories, four different classes of co-occurrence data can be defined. In the hetero-directed class, each token consists of two different entities, and one type of entity is responsible for generating the other. The most popular example of this type is text data, where a document is responsible for generating words; the link direction is therefore from document nodes to word nodes. In the hetero-undirected class, each token consists of two different entities, and both entities are generated simultaneously from a joint distribution. An example of this type is the co-occurrence of image features and keywords. In the homo-directed class, each token consists of two entities of the same type, and one entity is responsible for generating the other. An example is co-citation data: entities are scholars who cite other scholars in their research papers. In the homo-undirected class, each token consists of two entities of the same type, and both entities are generated simultaneously from a joint distribution. An example is the co-occurrence between items in market basket data, where each token consists of two items that have been purchased together.

In this paper, we focus on the hetero-directed case. Deriving the model and inference algorithm for the other cases is straightforward given the method proposed in this paper. Since inference in the proposed model is intractable, we use approximate inference via variational methods.

It is hard to capture the essence of real-world data in two dimensions due to its high complexity. On the other hand, visualizing a large number of data points is confusing rather than informative. Therefore, a way to present a filtered version of the data is useful. To this end, we propose a query-based visualization method. Although we present this algorithm in the context of information retrieval, it can be applied to any query-answering problem over co-occurrence data.

The paper is organized as follows. In Section II, we review the related literature on visualizing co-occurrence data. In Section III, Bayesian co-occurrence data embedding is presented. In Section IV, the approximate inference derivations are presented, where we use a variational Bayes approach to learn the posterior parameters.

In Section V, we present query-based visualization. In Section VI, the experimental results are presented; we examine Bayes-CODE in the context of visualizing text data, where it gives very competitive results. In the last section, we conclude with some comments and future directions.

II. BACKGROUND

Visualizing co-occurrence data with heterogeneous nodes (heterogeneous data in general) has rarely been studied. Most of the literature concentrates on embedding only one type of data (e.g., multi-dimensional scaling [5]). In the context of text data, which is co-occurrence data with heterogeneous nodes, most visualization approaches embed only documents or only words via embedding algorithms such as multidimensional scaling [10]. The state-of-the-art algorithm for visualizing co-occurrence data with heterogeneous nodes is co-occurrence data embedding (CODE) [7]. In CODE, all entities are embedded in a unified Euclidean space in such a way that closer entities are more correlated. Let X_i (a 1 × D vector) represent the latent variable of entity i and Y_k (a 1 × D vector) represent the latent variable of entity k in a D-dimensional space. In CODE, two approaches are used for fitting the model to the data. In the first approach, the joint distribution is modeled via:

P(i, k) = (1/Z) P̄(i) P̄(k) exp(−δ_ik),   (1)

where δ_ik = (X_i − Y_k)(X_i − Y_k)^T is the squared Euclidean distance between the latent variables of i and k, and the normalizing factor is Z = Σ_ik P̄(i) P̄(k) exp(−δ_ik). P̄(i) and P̄(k) are the empirical marginal probabilities, which model the bias of entities (some entities occur more frequently than others). This model is appropriate for the hetero-undirected case. The second approach in CODE models the conditional probability instead of the joint probability:

P(i | k) = (1/Z_k) P̄(i) exp(−δ_ik),   (2)

where Z_k = Σ_i P̄(i) exp(−δ_ik), which is useful in the hetero-directed case. The latent variables can be found by minimizing the Kullback-Leibler divergence between the empirical and model probabilities: P* = arg min D_KL[P̄ ‖ P].

The main intuition behind CODE is that if two points are related, then they should be very close in the latent space. Therefore, the result of the embedding can be presented as a visualization of the entities. Although the embedding is based on the relationships between entities of different types, we expect entities of the same type to be close as well, due to the transitivity of distance.

Our proposed model, Bayes-CODE, shares its basic probability model with CODE. The main difference, which is our main contribution, is extending CODE to a fully generative probabilistic model for learning co-occurrence data. Instead of the maximum likelihood approach, which is very vulnerable to overfitting, we use a Bayesian approach, which has classic advantages such as robustness. Although our generative model outperforms CODE even in a 2-dimensional space, the difference intensifies in higher dimensions. Nevertheless, it is not possible to interpret more than 3 dimensions visually. Our other contribution is a query-based visualization that embeds a filtered set of entities. A further contribution of our work concerns how the bias of entities is learned: our algorithm learns the biases in a Bayesian manner, whereas in CODE the bias parameters were estimated directly from the marginal empirical probabilities.
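To make the conditional model concrete, here is a small numeric sketch (our own, with random toy values) of eq. (2): it computes P(i | k) from latent positions and empirical marginals, and evaluates the KL-divergence objective that CODE minimizes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_i, n_k = 2, 4, 3                   # latent dimension and entity counts
X = rng.normal(size=(n_i, D))           # latent positions X_i (first type)
Y = rng.normal(size=(n_k, D))           # latent positions Y_k (second type)
p_i = np.full(n_i, 1.0 / n_i)           # empirical marginals P̄(i) (toy: uniform)

# delta[i, k] = squared Euclidean distance between X_i and Y_k
delta = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

# Eq. (2): P(i | k) proportional to P̄(i) exp(-delta_ik), normalized over i
unnorm = p_i[:, None] * np.exp(-delta)
P_model = unnorm / unnorm.sum(axis=0, keepdims=True)

# CODE fits X and Y by minimizing KL(P̄ || P); with a toy empirical
# conditional P̄(i | k), one evaluation of that objective looks like:
P_emp = rng.dirichlet(np.ones(n_i), size=n_k).T
kl = np.sum(P_emp * np.log(P_emp / P_model))
print(P_model.sum(axis=0), kl)          # each column of P_model sums to 1
```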
III. BAYESIAN EMBEDDING OF CO-OCCURRENCE DATA

In Bayesian co-occurrence data embedding (Bayes-CODE), the relationship between entities is captured by embedding them in a latent variable space. Here, we present Bayes-CODE for the hetero-directed case; deriving the model for the other cases is straightforward. Following the notation of the previous sections, i indexes the entities of the first type and k indexes the entities of the second type.

Similar to other latent space models, the dimension D of the latent space is an algorithmic input and can be chosen in a Bayesian manner. Here, we treat D as a known parameter. The learned positions of entities in the latent space are denoted by X_i for the entities of the first type and Y_k for the entities of the second type. Furthermore, b_i and b_k represent the biases of the entities. By bias we refer to the situation in which some entities tend to occur more often, such as words that have higher frequency than others. In the current model, we assume Gaussian priors on all latent variables. This is an arbitrary choice of distribution, and one may assume any other distribution. The relationship between entities is computed via the squared Euclidean distance. The probability model for the data is defined as a conditional probability:

P(i | k) = (1/Z_k) exp(−δ(X_i, Y_k) + b_i),   (3)

where Z_k = Σ_i exp(−δ(X_i, Y_k) + b_i). Note that we do not include a bias parameter for entity k: since we are conditioning on k, the bias of k has no effect. Even if b_k is inserted into (3), it cancels out.
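A quick numeric check of that remark (our own sketch with arbitrary values): adding a bias b_k for the conditioned entity shifts the numerator and the normalizer Z_k of (3) by the same factor, leaving the conditional probabilities unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
delta_k = rng.uniform(0, 4, size=5)   # distances delta(X_i, Y_k), fixed k, all i
b_i = rng.normal(size=5)              # biases of the generated entities i
b_k = 2.7                             # a hypothetical bias for the conditioned k

def conditional(extra=0.0):
    logits = -delta_k + b_i + extra   # eq. (3) logits, optionally shifted by b_k
    w = np.exp(logits)
    return w / w.sum()                # dividing by Z_k

# The same shift appears in every term of Z_k, so it cancels:
print(np.allclose(conditional(), conditional(extra=b_k)))   # True
```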

The graphical model of the generative process is shown in Figure 1. First, latent variables are generated for the N_I entities of the first type from the prior parameters θ_I, and for the N_K entities of the second type from the prior parameters θ_K. Then, for each of the N_k tokens in entity k, an entity of the first type is chosen from the N_I possible entities via a multinomial distribution with probabilities from (3). The generative process can be summarized as follows:

1) For each entity i:
   a) Choose the entity latent variable X_i ~ N(μ_0I, σ_0I² I)
   b) Choose the entity bias variable b_i ~ N(β_0, ξ_0²)
2) For each entity k:
   a) Choose the entity latent variable Y_k ~ N(μ_0K, σ_0K² I)
   b) For each token j:
      i) Choose i_kj ~ Multinomial(P(· | k))

where I denotes the identity matrix and P is computed from (3).

Fig. 1. The graphical model of Bayesian embedding of co-occurrence data.

IV. APPROXIMATE INFERENCE

Under the assumption that each token is independent of the other tokens given the hidden variables, the likelihood of the whole dataset is:

P(U | X, Y, b) = Π_u P(i_u | k_u, X, Y, b) = Π_ik P(i | k, X, Y, b)^{v_ik},   (4)

where U is the set of all tokens (i.e., the whole dataset) and v_ik is the number of times the token (i, k) has occurred. Given (3) and (4), the log-likelihood is:

log P(U | X, Y, b) = −Σ_ik v_ik δ(X_i, Y_k) + Σ_i v_i· b_i − Σ_k v_·k log Σ_i exp(−δ(X_i, Y_k) + b_i),   (5)

where v_i· = Σ_k v_ik and v_·k = Σ_i v_ik. To estimate the hidden variables X, Y and b, we could maximize the log-likelihood (5). However, the maximum likelihood approach has problems such as overfitting. A Bayesian approach results in a more robust solution by giving the posterior distribution of the hidden parameters. By Bayes' rule, the posterior distribution of the latent variables given the data is:

P(X, Y, b | U) = P(U | X, Y, b) P(X) P(Y) P(b) / ∫ P(U | X', Y', b') P(X') P(Y') P(b') dX' dY' db',   (6)

which is not computable analytically due to the intractability of the integral in the denominator. Therefore, we chose to use variational inference [8], a popular algorithm for approximate inference in graphical models. In variational approximation, instead of finding the true posterior, we estimate a variational distribution for each latent variable. The main idea is to minimize the difference between the true posterior and the surrogate variational distribution, so that the variational distribution can be used for making inferences about the latent variables. If Q(X, Y, b) denotes the variational distribution over the latent variables, we are interested in minimizing the Kullback-Leibler divergence between the true posterior and its approximation: KL(Q(X, Y, b) ‖ P(X, Y, b | U)). This is equivalent to maximizing a lower bound on the marginal probability of the data [1]:

log P(U) ≥ E_Q[log P(U | X, Y, b)] − KL(Q(X, Y, b) ‖ P(X, Y, b)),   (7)

where E_Q[·] denotes expectation with respect to the variational distributions. Here we assume the variational distributions are independent Gaussians with the following parameters: X_i ~ N(μ_i, σ_i² I), Y_k ~ N(μ_k, σ_k² I), and b_i ~ N(β_i, ξ_i²), where I is the identity matrix with dimension D. Note that it is possible to use a full covariance matrix instead of σ² I; the only reason we chose independent coordinates is simplicity in optimization. Substituting P(U | X, Y, b) with its value from (5), the lower bound on the probability of the data in (7) can be written as:

L(μ, β, σ, ξ) = Σ_ik v_ik E_Q[−δ(X_i, Y_k) + b_i] − Σ_k v_·k E_Q[log Σ_i exp(−δ(X_i, Y_k) + b_i)] − KL(Q(X) ‖ P(X)) − KL(Q(Y) ‖ P(Y)) − KL(Q(b) ‖ P(b)).   (8)

Given the Gaussian distributions for the priors and the variational distributions, all integrals for computing the expectations in (8) are analytically solvable except for the part Σ_k v_·k E_Q[log Σ_i exp(−δ(X_i, Y_k) + b_i)], which is intractable because of its log-sum-exp form.
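For reference, a short sketch (our own) that evaluates the log-likelihood (5) directly from a count matrix; the final log-sum-exp term is exactly the piece whose expectation under Q has no closed form:

```python
import numpy as np

def log_likelihood(V, X, Y, b):
    """Eq. (5): V[i, k] = v_ik, X[i] and Y[k] are latent positions,
    b[i] are the biases of the generated (first-type) entities."""
    delta = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # delta(X_i, Y_k)
    logits = -delta + b[:, None]                            # -delta_ik + b_i
    log_Z = np.logaddexp.reduce(logits, axis=0)             # log sum_i exp(...)
    return (V * logits).sum() - (V.sum(axis=0) * log_Z).sum()

rng = np.random.default_rng(2)
V = rng.poisson(1.0, size=(4, 3))           # toy co-occurrence counts
X, Y = rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
print(log_likelihood(V, X, Y, b=rng.normal(size=4)))
```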
It is possible to use the concavity of the log function to define an upper bound on the log-sum-exp function:

log Σ_i a_i ≤ φ Σ_i a_i − log φ − 1,   (9)

where equality holds iff φ = 1 / Σ_i a_i. Such an approach has been used in several works in the variational inference context [4], [2], [3]. Therefore, the new bound is as follows:

L(μ, β, σ, ξ) = constant + Σ_ik v_ik E_Q[−δ(X_i, Y_k) + b_i] − Σ_k v_·k φ_k Σ_i E_Q[exp(−δ(X_i, Y_k) + b_i)] − KL(Q(X) ‖ P(X)) − KL(Q(Y) ‖ P(Y)) − KL(Q(b) ‖ P(b)),   (10)

where the constant does not depend on the decision variables and we set φ_k = [Σ_i E_Q(exp(−δ(X_i, Y_k) + b_i))]^{−1} to tighten the lower bound with respect to the log-sum-exp part. The only tricky part in (10) is deriving the integral E_Q[exp(−δ(X_i, Y_k) + b_i)]. Let x_1 ~ N(μ_1, σ_1²) and x_2 ~ N(μ_2, σ_2²); then it can be shown that the following equation holds:

E[exp(−(x_1 − x_2)²)] = exp(−(μ_1 − μ_2)² / (1 + 2(σ_1² + σ_2²))) / sqrt(1 + 2(σ_1² + σ_2²)).   (11)
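Identity (11) is easy to sanity-check by Monte Carlo; a throwaway sketch with arbitrary parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)
mu1, s1, mu2, s2 = 0.4, 0.8, -0.3, 0.5
x1 = rng.normal(mu1, s1, size=1_000_000)
x2 = rng.normal(mu2, s2, size=1_000_000)

mc = np.exp(-(x1 - x2) ** 2).mean()                    # Monte Carlo estimate
c = 1.0 + 2.0 * (s1 ** 2 + s2 ** 2)
closed = np.exp(-(mu1 - mu2) ** 2 / c) / np.sqrt(c)    # right-hand side of (11)
print(mc, closed)                                      # the two agree closely
```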

Therefore, we have:

E_Q[exp(−δ(X_i, Y_k) + b_i)] = η_ik^D exp(−η_ik² δ(μ_i, μ_k) + β_i + ξ_i²/2),

where η_ik = [1 + 2(σ_i² + σ_k²)]^{−1/2}. As a result, the lower bound in (10) can be written as follows:

L(μ, β, σ, ξ) = constant − Σ_ik v_ik δ(μ_i, μ_k) − (Σ_i v_i· σ_i² + Σ_k v_·k σ_k²) D + Σ_i v_i· β_i − Σ_k v_·k φ_k Σ_i η_ik^D exp(−η_ik² δ(μ_i, μ_k) + β_i + ξ_i²/2) − (1/2) Σ_i [−D log σ_i² + D σ_i²/σ_0I² + (μ_i − μ_0I)(μ_i − μ_0I)^T / σ_0I²] − (1/2) Σ_k [−D log σ_k² + D σ_k²/σ_0K² + (μ_k − μ_0K)(μ_k − μ_0K)^T / σ_0K²] − (1/2) Σ_i [−log ξ_i² + ξ_i²/ξ_0² + (β_i − β_0)² / ξ_0²].   (12)

To find the variational values, we need to optimize (12). Since σ² must be nonnegative, to obtain an unconstrained optimization problem we use an auxiliary variable χ and the exponential function in our experiments: σ² = exp(χ). Any unconstrained optimization algorithm can be used to solve (12). In our experiments, we used gradient ascent with multiple random starts.

V. QUERY-BASED VISUALIZATION

Here, we present query-based visualization for information retrieval; however, it can be applied to other dyadic data (see [9] for a similar approach in collaborative filtering). In query-based visualization (QBV), documents, query words, and the query are embedded in a Euclidean space to help the user identify documents of interest. Unfortunately, 2 dimensions are barely enough to capture the complexity of the data, while higher dimensions cannot be interpreted visually. Additionally, presenting all the data to a user is beyond a person's processing ability. Therefore, visualizing only the top-N documents (where N can be specified by the user) is of interest. These top-N documents can be chosen by an arbitrary retrieval method. A two-phase visualization can therefore be used. First, the data are embedded in a high-dimensional space; then a group of entities chosen via some filtering approach is re-embedded in a 2-dimensional space by classic algorithms such as multidimensional scaling (MDS) [5]. The second embedding phase is straightforward: we already have the distances from the first phase, which satisfy all requirements of a Euclidean distance, so MDS can be applied directly. In an information retrieval context, our proposed approach embeds words, documents, and the query in a high-dimensional space and then, using the distances in that space, embeds the selected objects in a 2-dimensional space via multidimensional scaling.

Another approach would be to embed the filtered data directly in a 2-dimensional space. However, such an approach is undesirable for two reasons. First, since there are few entities in the filtered data, generalization is expected to be poor, which deteriorates the result. Second, embedding entities separately for each query is very time consuming and inefficient, especially given the high number of queries in retrieval systems.
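The two-phase procedure is easy to prototype; a minimal sketch (our own, using scikit-learn's MDS and random stand-in coordinates; in the paper the 100-dimensional positions come from the Bayes-CODE fit and the filter is a retrieval method):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.manifold import MDS

rng = np.random.default_rng(4)
# Phase 1 output (stand-in): high-dimensional positions that the
# Bayes-CODE fit would supply for documents, query words, and the query.
docs_hd = rng.normal(size=(500, 100))
words_hd = rng.normal(size=(50, 100))
query_hd = rng.normal(size=(1, 100))    # the query, mapped like a new document

# Filter: keep the top-N documents closest to the query (any retrieval
# method could supply this ranking).
N = 20
top = np.argsort(cdist(query_hd, docs_hd)[0])[:N]

# Phase 2: re-embed query + query words + top-N documents in 2-D via MDS,
# feeding it the pairwise Euclidean distances from the 100-D space.
pts = np.vstack([query_hd, words_hd, docs_hd[top]])
xy = MDS(n_components=2, dissimilarity="precomputed",
         random_state=0).fit_transform(cdist(pts, pts))
print(xy.shape)                         # (1 + 50 + N, 2) points to plot
```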
VI. EXPERIMENTS

In our experiments, we compare CODE and Bayes-CODE in the context of information retrieval. Text data is considered hetero-directed, so we use a conditional probability model. We study the goodness of the embedding using two evaluation metrics. The first is the proximity of an embedded query to the relevant documents. This can be measured with average precision (AP). The definition of AP that we use averages the precision at each relevant document at the point it is retrieved. Let AP(S, R) be the average precision, where R is the set of relevant documents and S is the score a method assigns to all documents for retrieval. Then, given an embedded query q, the average precision is AP(−δ_q·, R), where δ_q· denotes the distances of the query to all documents. The second metric concerns the proximity of relevant documents to each other: whether all relevant documents are close to each other compared to the other documents. To measure this, average relevant-document proximity (ARDP) is proposed:

ARDP = [ Σ_c (1/|R_c|) Σ_{i ∈ R_c} AP(−δ_i·, R_c \ {i}) ] / N_C,   (13)

where c indexes categories or queries (any group of relevant documents), i indexes documents, R_c is the set of relevant documents in c, and N_C is the total number of categories or queries. ARDP is partially similar to the doc-doc measure used in [7]. More precisely, we consider each relevant document as an embedded query and then compute the AP of retrieving the other relevant documents. This metric measures how well we embed the data: relevant documents are expected to be closer, and the visualization is better as a result.
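Both metrics reduce to average precision over distance-based rankings; a sketch under our reading of (13) (the function names and toy data are ours):

```python
import numpy as np

def average_precision(scores, relevant):
    """AP of ranking all items by descending score; `relevant` is a boolean
    mask. Precision is averaged at the rank of each relevant item."""
    order = np.argsort(-scores)
    rel = relevant[order]
    hits = np.cumsum(rel)
    return (hits[rel] / (np.flatnonzero(rel) + 1)).mean()

def ardp(dist, categories):
    """Eq. (13): dist[i, j] is the document-document distance in the
    embedding; `categories` maps each category c to its relevant docs R_c."""
    per_cat = []
    for rel_idx in categories.values():
        aps = []
        for i in rel_idx:
            mask = np.zeros(dist.shape[0], dtype=bool)
            mask[rel_idx] = True
            mask[i] = False                 # rank against R_c \ {i}
            scores = -dist[i]
            scores[i] = -np.inf             # the probe document ranks last
            aps.append(average_precision(scores, mask))
        per_cat.append(np.mean(aps))        # inner average over |R_c|
    return np.mean(per_cat)                 # outer average over N_C categories

rng = np.random.default_rng(5)
dist = rng.uniform(size=(30, 30))
dist = (dist + dist.T) / 2
np.fill_diagonal(dist, 0.0)
print(ardp(dist, {0: [0, 1, 2, 3], 1: [10, 11, 12]}))
```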

In Bayes-CODE, we set the prior mean to 0 for both words and documents, and the prior variance to 1 for words and 2 for documents. No fine tuning was done in choosing the priors. We chose a smaller prior variance for words because the support for words is often lower than the support for documents: some words occur only 2 or 3 times, while the number of occurrences of a document (the number of words in it) is usually much higher.

Four datasets were used in our experiments. We used subsets of the TDT-2 and Reuters-21578 datasets. In these datasets there are a number of categories, and each document belongs to one category; the documents in the same category are considered related. In each dataset, we selected 5 categories such that the number of documents per category is almost equal. Words that occurred fewer than 3 times were excluded. Our subset of TDT-2 includes 8,676 words and 1,584 documents, and our subset of Reuters-21578 includes 4,711 words and 103 documents. These two datasets were used only for evaluation based on ARDP. For the information retrieval task, we used the CRAN and MED datasets. The CRAN dataset includes 3,763 words, 1,398 documents, and 225 queries, and MEDLINE includes 7,014 words, 1,033 documents, and 30 queries. In both CODE and Bayes-CODE, queries were treated as new documents and mapped into the latent space; the Euclidean distance between queries and documents was then used to compute a score for retrieval. These data were used for evaluation based on both AP and ARDP.

Table I presents the ARDP results for all datasets. We implemented CODE and Bayes-CODE in a 2-dimensional space and then computed the ARDP score. Bayes-CODE outperforms CODE on all 4 datasets (on CRAN the performance is close).

TABLE I. ARDP in a 2-dimensional space on TDT-2, Reuters-21578, CRAN, and MEDLINE for CODE and Bayes-CODE (best results in bold).

To evaluate CODE versus Bayes-CODE in higher dimensions, we used CRAN and MEDLINE. Figures 2 and 3 show the results for average precision, and Figures 4 and 5 show the results for ARDP, on the CRAN and MEDLINE datasets respectively. As expected, the performance of CODE decreases as we increase the number of dimensions. Note that procedures such as tuning are not possible, since CODE is an unsupervised algorithm.

TABLE II. Query-based visualization with Bayes-CODE versus CODE with 2 dimensions: AP and ARDP on the CRAN and MEDLINE datasets (best results in bold).

Finally, we explored query-based visualization using Bayes-CODE. First, we selected the top-100 documents for each query using latent semantic indexing [6], a successful method in information retrieval. Note that it is possible to use Bayes-CODE for filtering documents directly, but here we need a method common to both CODE and Bayes-CODE for the sake of comparison. Then, we re-embedded all filtered documents, the query words, and the query in a 2-dimensional space via MDS, with distances taken from running Bayes-CODE in a 100-dimensional space. We compare the result to CODE's result in a 2-dimensional space. Table II presents the results. The performance of query-based visualization is dramatically better than that of CODE. Figures 6 and 7 show a typical snapshot of visualizing a query using CODE and using query-based visualization with Bayes-CODE, respectively, for a specific query from the MEDLINE dataset. Note the distinction between relevant and irrelevant documents in query-based visualization, while they are highly mixed in CODE. Additionally, in CODE the query is far away from the other entities, which makes interpretation difficult. Query words might help the user identify which area of the space is more relevant.

VII. CONCLUSION

In this paper, we developed a Bayesian model based on the state-of-the-art visualization model for co-occurrence data. Our experimental studies show the superiority of the Bayesian approach. However, better embedding in higher dimensions is not by itself useful for visualization. Therefore, we proposed a method to embed filtered data from a high-dimensional embedding for a given query (query-based visualization), which was successful in our experiments.

Query-based visualization can be the basis for an interactive user interface, in which a user receives her recommendations in a visual manner while getting a general picture of the relationships between her query, her keywords, and the top-N relevant documents. She can then explore the visual space and mark documents accordingly. She might also have the option of asking for more documents close to a document, or more words close to a word, and after re-embedding she obtains a more accurate picture.
Also, using relevance feedback techniques, known relevant documents can be used to formulate a more accurate query, find more relevant documents, and reconstruct the picture.

REFERENCES

[1] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Secaucus, NJ, USA, 2006.
[2] D. Blei and J. Lafferty. Correlated topic models. Advances in Neural Information Processing Systems, 18:147, 2006.
[3] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, New York, NY, USA, 2006. ACM.
[4] G. Bouchard. Efficient bounds for the softmax function, applications to inference in hybrid models. Advances in Neural Information Processing Systems, 2007.
[5] M. Cox and T. Cox. Multidimensional scaling. Handbook of Data Visualization, 2008.
[6] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
[7] A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean embedding of co-occurrence data. Journal of Machine Learning Research, 8:2265-2295, 2007.
[8] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
[9] M. Khoshneshin and W. N. Street. Collaborative filtering via Euclidean embedding. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys '10, pages 87-94, New York, NY, USA, 2010. ACM.
[10] J. Zhang. Visualization for Information Retrieval. Springer-Verlag, 2008.

Fig. 2. The average precision results for the CRAN dataset.
Fig. 3. The average precision results for the MEDLINE dataset.
Fig. 4. The ARDP results for the CRAN dataset.
Fig. 5. The ARDP results for the MEDLINE dataset.
Fig. 6. Visualization of a typical query from the MEDLINE dataset using CODE with 2 dimensions.
Fig. 7. Visualization of a typical query from the MEDLINE dataset using query-based visualization + Bayes-CODE with 100 dimensions.
